# feat(ruvllm): Implement SOTA features for production agentic workflows
**Labels**: `enhancement`, `p0-critical`, `agentic`, `v2.4`, `mistral-rs`, `performance`

---
## Summary
RuvLLM v2.4 adds the three capabilities needed for production agentic workflows: **Structured Output**, **Function Calling**, and **Prefix Caching**.

These features are essential for modern LLM applications, and their absence is currently blocking production adoption by major agent frameworks.

---
## Motivation
### Why This Matters
**Current State:**
- RuvLLM cannot reliably generate structured outputs (JSON schema enforcement)
- No native function calling support for tool-using agents
- Repeated prompts/prefixes incur full generation costs (no caching)
- Agent frameworks (LangChain, LlamaIndex, CrewAI) cannot integrate
**Impact:**
- **Blocking production adoption** for agentic workflows
- **Cost inefficiency**: every request recomputes shared prefixes from scratch, making RAG/chat workloads an estimated 10-100x slower than competitors that cache them
- **Reliability gap**: JSON parsing failures break agent loops
- **Missing compatibility**: Cannot replace vLLM, llama.cpp, SGLang in existing stacks
**Competitive Gap:**
| Feature | vLLM | llama.cpp | SGLang | RuvLLM |
|---------|------|-----------|--------|--------|
| Structured Output | ✅ | ✅ | ✅ | ❌ |
| Function Calling | ✅ | ✅ | ✅ | ❌ |
| Prefix Caching | ✅ | ✅ | ✅ | ❌ |
---
## Features
### 1. Structured Output / JSON Mode (P0)
**Objective**: Guarantee valid JSON output conforming to user-provided schemas.
#### Core Capabilities
- [ ] **JSON schema validation** (JSON Schema Draft 7 support)
  - Primitive types: `string`, `number`, `integer`, `boolean`, `null`
  - Complex types: `object`, `array`
  - Nested schemas with `properties`, `items`, `required`
  - Constraints: `minLength`, `maxLength`, `pattern`, `enum`, `minimum`, `maximum`
- [ ] **Constrained decoding with logit bias**
  - State machine for tracking JSON structure (open braces, quotes, commas)
  - Token masking to enforce valid next tokens
  - Rejection sampling fallback for complex schemas
- [ ] **Bracket/brace state machine**
  - Track depth of `{}` and `[]`
  - Enforce closing brackets
  - Handle escaped quotes in strings
- [ ] **JSON repair for malformed output**
  - Auto-close unclosed braces/brackets
  - Fix trailing commas
  - Escape unescaped quotes
  - Best-effort recovery mode
- [ ] **GBNF grammar support (future)**
  - llama.cpp-compatible grammar format
  - Custom domain-specific languages
- [ ] **Comprehensive tests**
  - Unit tests for all JSON types
  - Property-based testing with Hypothesis/QuickCheck
  - Adversarial inputs (deeply nested, large arrays)
- [ ] **Benchmarks vs unconstrained**
  - Measure latency overhead (<10% target)
  - Throughput impact
  - Memory usage
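The bracket/brace state machine and the auto-close part of JSON repair listed above can be sketched in a few lines. This is a minimal illustration, not the planned implementation: `JsonTracker` and `auto_close` are hypothetical names, and the sketch only tracks nesting and string state (it does not fix trailing commas or validate against a schema).

```rust
/// Hypothetical sketch: tracks JSON nesting while scanning one character at a time.
#[derive(Debug, Default)]
pub struct JsonTracker {
    stack: Vec<char>, // expected closers, innermost last
    in_string: bool,
    escaped: bool,
}

impl JsonTracker {
    /// Feed one character; returns false on a stray or mismatched closer.
    pub fn feed(&mut self, c: char) -> bool {
        if self.in_string {
            if self.escaped {
                self.escaped = false; // character consumed by the preceding backslash
            } else if c == '\\' {
                self.escaped = true;
            } else if c == '"' {
                self.in_string = false;
            }
            return true;
        }
        match c {
            '"' => self.in_string = true,
            '{' => self.stack.push('}'),
            '[' => self.stack.push(']'),
            '}' | ']' => return self.stack.pop() == Some(c),
            _ => {}
        }
        true
    }

    /// Current nesting depth of `{}` and `[]`.
    pub fn depth(&self) -> usize {
        self.stack.len()
    }
}

/// Best-effort repair: terminate an open string, then append the missing closers.
/// Returns None when the input contains a mismatched closer, which this sketch
/// does not attempt to fix.
pub fn auto_close(text: &str) -> Option<String> {
    let mut tracker = JsonTracker::default();
    for c in text.chars() {
        if !tracker.feed(c) {
            return None;
        }
    }
    let mut out = String::from(text);
    if tracker.in_string {
        if tracker.escaped {
            out.pop(); // drop a dangling backslash at the end of the input
        }
        out.push('"');
    }
    while let Some(closer) = tracker.stack.pop() {
        out.push(closer);
    }
    Some(out)
}
```

The same tracker could feed a constrained decoder: at each step, `depth()` and the string state indicate which structural tokens are legal next.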
#### Example API
```rust
use serde_json::json;

let schema = json!({
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "age": {"type": "number", "minimum": 0},
        "tags": {"type": "array", "items": {"type": "string"}}
    },
    "required": ["name"]
});

let response = llm.generate(GenerateRequest {
    prompt: "Extract person info: John is 30",
    json_schema: Some(schema),
    strict: true, // Guarantee valid JSON
    ..Default::default()
})?;
// With `strict: true`, response.text is guaranteed to be valid JSON matching the schema
```
#### Acceptance Criteria
- [ ] **100% valid JSON** when `strict: true` enabled
- [ ] **<10% latency overhead** vs unconstrained generation
- [ ] **Schema validation passes** for nested objects/arrays (depth ≥ 5)
- [ ] **Repair mode** recovers ≥95% of malformed outputs
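The 100% validity target under `strict: true` rests on token masking: before sampling, every token the grammar does not allow is given a logit of negative infinity, so it can never be chosen. The sketch below shows that step in isolation. `mask_logits` and `argmax` are hypothetical helper names, the allowed-token list is assumed to come from the JSON state machine, and greedy selection stands in for whatever sampler is actually used.

```rust
/// Hypothetical helper: keep only `allowed` token IDs. Every other logit becomes
/// -inf, so softmax would assign it zero probability.
pub fn mask_logits(logits: &mut [f32], allowed: &[usize]) {
    let mut keep = vec![false; logits.len()];
    for &id in allowed {
        if id < keep.len() {
            keep[id] = true; // out-of-range IDs are ignored
        }
    }
    for (logit, k) in logits.iter_mut().zip(&keep) {
        if !*k {
            *logit = f32::NEG_INFINITY;
        }
    }
}

/// Greedy pick over the masked logits; None if every token is masked out.
pub fn argmax(logits: &[f32]) -> Option<usize> {
    let mut best: Option<(usize, f32)> = None;
    for (i, &l) in logits.iter().enumerate() {
        if l == f32::NEG_INFINITY {
            continue;
        }
        if best.map_or(true, |(_, b)| l > b) {
            best = Some((i, l));
        }
    }
    best.map(|(i, _)| i)
}
```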
---
### 2. Function Calling / Tool Use (P0)
**Objective**: Enable LLMs to call external tools/functions with structured arguments.
#### Core Capabilities
- [ ] **Tool definition schema**
  - Function name, description
  - Parameters (JSON schema)
  - Return type (optional)
- [ ] **ToolChoice enum**
  - `auto`: Model decides whether to call tools
  - `none`: Never call tools (text-only)
  - `required`: Must call at least one tool
  - `specific(name)`: Force specific tool
- [ ] **Parallel tool calls**
  - Generate multiple tool calls in one response
  - Dependency-aware ordering
- [ ] **Tool result handling**
  - Inject tool results back into conversation
  - Continue generation after tool execution
  - Multi-turn tool loops
- [ ] **Model-specific formats**
  - Llama 3.1 tool format (`<|python_tag|>`)
  - Mistral tool format (`[AVAILABLE_TOOLS]` / `[TOOL_CALLS]` control tokens)
  - Qwen tool format
  - Claude tool format
- [ ] **OpenAI API compatibility layer**
  - `tools` parameter
  - `tool_choice` parameter
  - `ChatCompletionMessageToolCall` response format
- [ ] **LangChain integration tests**
  - Works with `AgentExecutor`
  - Compatible with `StructuredTool`
  - Multi-agent workflows
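To make the four `ToolChoice` modes concrete, the sketch below resolves a choice into the set of tools the decoder may emit and whether a call is mandatory. It is an illustration under assumptions: `ToolPolicy`, `resolve_policy`, and `tool_names` are hypothetical names, and `Tool` is reduced to the fields needed here (the real definition would also carry the JSON-schema parameters).

```rust
#[derive(Debug, Clone, PartialEq)]
pub enum ToolChoice {
    Auto,
    None,
    Required,
    Specific(String),
}

/// Reduced, hypothetical tool definition (parameters omitted for brevity).
#[derive(Debug, Clone)]
pub struct Tool {
    pub name: String,
    pub description: String,
}

/// Hypothetical result type: what the decoder may emit for this request.
#[derive(Debug, PartialEq)]
pub struct ToolPolicy<'a> {
    pub candidates: Vec<&'a str>,
    pub must_call: bool,
}

fn tool_names(tools: &[Tool]) -> Vec<&str> {
    tools.iter().map(|t| t.name.as_str()).collect()
}

/// Map a ToolChoice onto the candidate tools, rejecting impossible requests early.
pub fn resolve_policy<'a>(
    choice: &ToolChoice,
    tools: &'a [Tool],
) -> Result<ToolPolicy<'a>, String> {
    match choice {
        ToolChoice::None => Ok(ToolPolicy { candidates: Vec::new(), must_call: false }),
        ToolChoice::Auto => Ok(ToolPolicy { candidates: tool_names(tools), must_call: false }),
        ToolChoice::Required => {
            if tools.is_empty() {
                Err("tool_choice is `required` but no tools were provided".to_string())
            } else {
                Ok(ToolPolicy { candidates: tool_names(tools), must_call: true })
            }
        }
        ToolChoice::Specific(name) => tools
            .iter()
            .find(|t| &t.name == name)
            .map(|t| ToolPolicy { candidates: vec![t.name.as_str()], must_call: true })
            .ok_or_else(|| format!("unknown tool: {}", name)),
    }
}
```

Validating the choice before generation keeps errors such as an unknown forced tool out of the decoding loop, where they would be harder to report.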
#### Example API
```rust
use serde_json::json;

let tools = vec![
    Tool {
        name: "get_weather".into(),
        description: "Get current weather for a location".into(),
        parameters: json!({
            "type": "object",
            "properties": {
                "location": {"type": "string"},
                "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]}
            },
            "required": ["location"]
        }),
    },
    Tool {
        name: "search_web".into(),
        description: "Search the web".into(),
        parameters: json!({
            "type": "object",
            "properties": {
                "query": {"type": "string"}
            },
            "required": ["query"]
        }),
    },
];

let response = llm.chat(ChatRequest {
    messages: vec![
        Message::user("What's the weather in SF and latest AI news?")
    ],
    tools: Some(tools),
    tool_choice: ToolChoice::Auto,
    ..Default::default()
})?;
// response.tool_calls contains parallel calls:
// [get_weather(location="San Francisco"), search_web(query="AI news")]
```
#### Acceptance Criteria
- [ ] **OpenAI API format compatibility** (passes OpenAI SDK tests)
- [ ] **LangChain AgentExecutor** integration works end-to-end
- [ ] **Parallel tool calls** supported (≥3 concurrent)
- [ ] **Multi-turn tool conversations** (≥5 turns)
- [ ] **Tool call success rate** ≥95% for common tools
---
### 3. Prefix Caching (P0)
**Objective**: Cache and reuse KV cache for repeated prompt prefixes (system prompts, RAG documents).
#### Core Capabilities
- [ ] **Hash-based prefix lookup**
  - SHA-256 hash of token IDs
  - O(1) average-case hit detection for exact-prefix matches (after hashing the prefix)
- [ ] **Radix tree implementation**
  - Efficient storage for overlapping prefixes
  - Longest common prefix matching
  - Memory-efficient sharing
- [ ] **KV cache copy-on-write**
  - Share read-only cache entries
  - Copy only on divergence
  - Zero-copy for cache hits
- [ ] **LRU eviction policy**
  - Evict least recently used prefixes
  - Configurable cache size
  - Per-model cache isolation
- [ ] **Memory limits**
  - Hard limit on cache size (bytes)
  - Soft limit with warning
  - Graceful degradation
- [ ] **Cache hit/miss metrics**
  - Prometheus metrics
  - Hit rate tracking
  - Memory usage stats
- [ ] **Chat prefix caching**
  - System prompt caching
  - Conversation history caching
  - Automatic prefix detection
- [ ] **RAG document caching**
  - Document chunk prefixes
  - Query-independent context
  - Multi-query reuse
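Longest-prefix matching over token IDs is the core lookup behind the radix tree item above. The sketch below uses a plain per-token trie rather than a compressed radix tree, which keeps the logic short at the cost of one node per token. `PrefixTrie` and the `block` handle (standing in for a reference to stored KV data) are hypothetical.

```rust
use std::collections::HashMap;

/// One node per token ID. `block` is a hypothetical handle to cached KV data
/// for the prefix that ends at this node.
#[derive(Default)]
struct Node {
    children: HashMap<u32, Node>,
    block: Option<usize>,
}

/// Plain trie over token IDs (a simplification of a compressed radix tree).
#[derive(Default)]
pub struct PrefixTrie {
    root: Node,
}

impl PrefixTrie {
    /// Record that the KV cache for `tokens` is stored under `block`.
    pub fn insert(&mut self, tokens: &[u32], block: usize) {
        let mut node = &mut self.root;
        for &t in tokens {
            node = node.children.entry(t).or_default();
        }
        node.block = Some(block);
    }

    /// Longest cached prefix of `tokens`, as (matched length, block handle).
    pub fn longest_prefix(&self, tokens: &[u32]) -> Option<(usize, usize)> {
        let mut node = &self.root;
        let mut best = None;
        for (i, t) in tokens.iter().enumerate() {
            match node.children.get(t) {
                Some(next) => node = next,
                None => break,
            }
            if let Some(b) = node.block {
                best = Some((i + 1, b));
            }
        }
        best
    }
}
```

On a hit, only the tokens after the matched length need to go through prefill. A radix tree would merge single-child chains into one edge to cut per-node overhead for long prompts.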
#### Example API
```rust
// First request - cache miss
let response1 = llm.generate(GenerateRequest {
    prompt: "System: You are a helpful assistant.\nUser: Hello",
    cache_prefix: Some(CacheConfig {
        enable: true,
        key: Some("chat-system-prompt".into()),
        ttl_seconds: Some(3600),
    }),
    ..Default::default()
})?;
// Illustrative latency: ~500 ms

// Second request - cache hit (reuses the KV cache for "System: You are...")
let response2 = llm.generate(GenerateRequest {
    prompt: "System: You are a helpful assistant.\nUser: How are you?",
    cache_prefix: Some(CacheConfig {
        enable: true,
        key: Some("chat-system-prompt".into()),
        ttl_seconds: Some(3600),
    }),
    ..Default::default()
})?;
// Illustrative latency: ~50 ms (the 10x target)
```
#### Performance Targets
- [ ] **10x speedup** for repeated system prompts (cache hit)
- [ ] **<5% overhead** for cache miss
- [ ] **Memory-bounded** (configurable, default 2GB)
- [ ] **Thread-safe** for concurrent requests
- [ ] **Hit rate ≥80%** for typical chat/RAG workloads
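The memory-bounded target can be illustrated with a byte-budgeted LRU index. This is a sketch under assumptions: `LruBytes` is a hypothetical name, keys are `u64` values (for example a hash of the prefix tokens), only entry sizes are tracked rather than the KV data itself, and recency updates are O(n) for simplicity. A thread-safe version would wrap it in a lock.

```rust
use std::collections::{HashMap, VecDeque};

/// Hypothetical byte-budgeted LRU index for cached prefixes.
pub struct LruBytes {
    max_bytes: usize,
    used_bytes: usize,
    sizes: HashMap<u64, usize>, // key -> entry size in bytes
    order: VecDeque<u64>,       // least recently used at the front
}

impl LruBytes {
    pub fn new(max_bytes: usize) -> Self {
        Self { max_bytes, used_bytes: 0, sizes: HashMap::new(), order: VecDeque::new() }
    }

    /// Mark `key` as most recently used; returns true on a cache hit.
    pub fn touch(&mut self, key: u64) -> bool {
        if !self.sizes.contains_key(&key) {
            return false;
        }
        self.order.retain(|k| *k != key);
        self.order.push_back(key);
        true
    }

    /// Insert an entry, evicting least recently used entries until it fits.
    /// Returns the evicted keys, or Err(bytes) if the entry exceeds the budget.
    pub fn insert(&mut self, key: u64, bytes: usize) -> Result<Vec<u64>, usize> {
        if bytes > self.max_bytes {
            return Err(bytes);
        }
        if let Some(old) = self.sizes.remove(&key) {
            self.used_bytes -= old; // replacing an existing entry
            self.order.retain(|k| *k != key);
        }
        let mut evicted = Vec::new();
        while self.used_bytes + bytes > self.max_bytes {
            match self.order.pop_front() {
                Some(victim) => {
                    self.used_bytes -= self.sizes.remove(&victim).unwrap_or(0);
                    evicted.push(victim);
                }
                None => break,
            }
        }
        self.sizes.insert(key, bytes);
        self.order.push_back(key);
        self.used_bytes += bytes;
        Ok(evicted)
    }

    pub fn used_bytes(&self) -> usize {
        self.used_bytes
    }
}
```

Rejecting oversized entries up front is what makes the limit hard: total usage never exceeds `max_bytes`, and the caller falls back to uncached generation.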
#### Acceptance Criteria
- [ ] **Speedup**: ≥10x for 1024-token prefix reuse
- [ ] **Memory**: Bounded by config, no OOM
- [ ] **Correctness**: Identical outputs for cached vs uncached
- [ ] **Concurrency**: No race conditions (stress tested)
- [ ] **Metrics**: Prometheus metrics exported
---
## Technical Design
### Architecture Overview
```
┌─────────────────────────────────────────────────────────┐
│                       RuvLLM v2.4                       │
├─────────────────────────────────────────────────────────┤
│  ┌─────────────────┐  ┌──────────────┐  ┌────────────┐  │
│  │ Structured      │  │ Function     │  │ Prefix     │  │
│  │ Output Engine   │  │ Calling      │  │ Cache      │  │
│  │                 │  │ Router       │  │ Manager    │  │
│  │ - JSON Schema   │  │ - Tool Defs  │  │ - Radix    │  │
│  │ - Logit Bias    │  │ - ToolChoice │  │   Tree     │  │
│  │ - State Machine │  │ - Multi-call │  │ - LRU      │  │
│  └────────┬────────┘  └──────┬───────┘  └─────┬──────┘  │
│           │                  │                │         │
│           └──────────────────┼────────────────┘         │
│                              │                          │
│                    ┌─────────▼──────────┐               │
│                    │ mistral-rs Core    │               │
│                    │ - Model Loading    │               │
│                    │ - Token Sampling   │               │
│                    │ - KV Cache         │               │
│                    └────────────────────┘               │
└─────────────────────────────────────────────────────────┘
```
### Reference ADRs
- **ADR-009**: Structured Output Implementation
  - Constrained decoding algorithm
  - JSON schema validation approach
  - Performance optimization strategies
- **ADR-010**: Function Calling Architecture
  - Tool definition format
  - Multi-model compatibility layer
  - Parallel execution model
- **ADR-011**: Prefix Caching Design
  - Radix tree structure
  - Eviction policies
  - Memory management
- **ADR-008**: mistral-rs Integration
  - Dependency structure
  - API surface
  - Migration path
---
## Implementation Plan
### Phase 1: Foundation (Weeks 1-2)
**Focus**: Structured Output basics + Function Calling definitions
- [ ] Week 1: JSON schema parser and validator
  - Implement schema types (object, array, string, number, integer, boolean, null)
  - Unit tests for all types
  - Property-based tests
- [ ] Week 2: Constrained decoding MVP
  - Logit bias implementation
  - Simple state machine (braces, brackets)
  - Integration with mistral-rs sampler
  - Basic function calling types (`Tool` struct, `ToolChoice` enum)
**Deliverable**: JSON mode works for simple schemas, tool definitions parsed

---
### Phase 2: Core Logic (Weeks 3-4)
**Focus**: Constrained decoding + Tool generation
- [ ] Week 3: Advanced constrained decoding
  - Nested schema support
  - String pattern matching
  - Enum constraints
  - JSON repair mode
- [ ] Week 4: Tool call generation
  - Llama 3.1 format support
  - Mistral format support
  - Parallel tool calls
  - OpenAI API compatibility layer
**Deliverable**: Complex JSON schemas work, tool calls generated in OpenAI format

---
### Phase 3: Caching + Polish (Weeks 5-6)
**Focus**: Prefix Caching + Integration tests
- [ ] Week 5: Prefix caching implementation
  - Radix tree structure
  - Hash-based lookup
  - LRU eviction
  - Thread-safety (RwLock)
- [ ] Week 6: Integration + benchmarks
  - LangChain integration tests
  - RAG workflow tests
  - Performance benchmarks
  - Documentation
  - Example applications
**Deliverable**: All 3 features production-ready, benchmarked, documented

---
## Testing Strategy
### Unit Tests
- JSON schema validation (all types, nested, constraints)
- Logit bias correctness
- Tool definition parsing
- Prefix cache hit/miss logic
- Radix tree operations
### Integration Tests
- LangChain AgentExecutor with tools
- LlamaIndex ReAct agent
- CrewAI multi-agent workflows
- OpenAI SDK compatibility tests
### Benchmarks
- Structured output latency vs unconstrained
- Tool calling accuracy (% correct tool selections)
- Prefix cache speedup when the same prefix is reused 1, 10, and 100 times
- Memory usage under load
### Stress Tests
- 1000 concurrent requests with caching
- Deeply nested JSON schemas (depth 20)
- Large tool libraries (100+ tools)
- Multi-turn tool conversations (50+ turns)
---
## Success Metrics
### Structured Output
- [ ] **Validity**: 100% valid JSON when `strict: true`
- [ ] **Overhead**: <10% latency vs unconstrained
- [ ] **Schema compliance**: 100% for depth ≤10 schemas
- [ ] **Repair rate**: ≥95% successful repairs
### Function Calling
- [ ] **Compatibility**: Passes OpenAI SDK test suite
- [ ] **LangChain**: Works with AgentExecutor (5+ examples)
- [ ] **Accuracy**: ≥95% correct tool selection (benchmark dataset)
- [ ] **Parallel calls**: Supports ≥5 concurrent tools
### Prefix Caching
- [ ] **Speedup**: 10x for 1024-token prefix, 100x for 4096-token
- [ ] **Hit rate**: ≥80% for chat workloads
- [ ] **Memory**: Bounded, no OOM under stress
- [ ] **Correctness**: 100% identical outputs (cached vs uncached)
---
## Dependencies
### Upstream
- **mistral-rs v0.4.x** (ADR-008)
  - KV cache access for prefix caching
  - Token sampling hooks for logit bias
  - Model loading infrastructure
### Downstream
- **Enables**: Agentic workflow support
- **Enables**: LangChain/LlamaIndex/CrewAI integration
- **Blocks**: v2.4 release
- **Blocks**: Production adoption by agent frameworks
---
## Related Issues
- Depends on: #XXX (mistral-rs integration ADR-008)
- Enables: #XXX (Agentic workflow support)
- Enables: #XXX (LangChain integration)
- Blocks: #XXX (v2.4 release milestone)
---
## Documentation Requirements
- [ ] API reference docs (rustdoc)
- [ ] User guides for each feature
  - "How to use JSON mode"
  - "How to define tools"
  - "How to enable prefix caching"
- [ ] Migration guide from v2.3
- [ ] Example applications
  - Structured extraction (NER, info extraction)
  - Multi-tool agent (ReAct loop)
  - RAG with caching (chatbot)
- [ ] Performance tuning guide
---
## Open Questions
1. **JSON Schema**: Full Draft 7 or subset? (Propose: Core subset + extensions)
2. **Tool formats**: Support all models or Llama 3.1+ only? (Propose: Llama 3.1+ with adapters)
3. **Cache eviction**: LRU vs LFU vs TTL-based? (Propose: LRU + TTL)
4. **Memory limit**: Default cache size? (Propose: 2GB default, configurable)
5. **Breaking changes**: Any API changes needed? (Propose: Additive only, no breaks)
---
## Future Enhancements (Post-v2.4)
- **Structured Output**:
  - GBNF grammar support (custom DSLs)
  - Regex-constrained strings
  - Speculative decoding for constrained generation
- **Function Calling**:
  - Async/streaming tool execution
  - Tool result validation
  - Tool dependency graphs
- **Prefix Caching**:
  - Cross-request caching (shared cache pool)
  - Disk-backed cache (persist across restarts)
  - Distributed caching (Redis/memcached)
---
## Timeline Summary
| Phase | Duration | Focus | Deliverable |
|-------|----------|-------|-------------|
| 1 | Weeks 1-2 | Structured Output + Tool Definitions | JSON mode MVP, tool parsing |
| 2 | Weeks 3-4 | Constrained Decoding + Tool Generation | Complex schemas, tool calls |
| 3 | Weeks 5-6 | Prefix Caching + Integration | Production-ready, benchmarked |

**Total**: 6 weeks to production-ready v2.4

---
## Getting Involved
### For Contributors
- Pick a task from the checkboxes above
- Comment on this issue to claim a feature
- Follow the implementation plan phases
- Submit PRs with tests + benchmarks
### For Reviewers
- Focus on correctness (JSON validity, cache correctness)
- Performance regression checks (<10% overhead target)
- API design feedback (before Week 3)
### For Testers
- Test with real-world agent workflows
- Report edge cases and failure modes
- Benchmark on your hardware/models
---
**Let's close the gap with vLLM/llama.cpp and make RuvLLM the best choice for production agentic workflows!** 🚀