# feat(ruvllm): Implement SOTA features for production agentic workflows

**Labels**: `enhancement`, `p0-critical`, `agentic`, `v2.4`, `mistral-rs`, `performance`

---

## Summary

This issue tracks the RuvLLM v2.4 SOTA feature work: the three critical capabilities needed for production agentic workflows are **Structured Output**, **Function Calling**, and **Prefix Caching**.

These features are essential for modern LLM applications, and their absence currently blocks production adoption by major agent frameworks.

---

## Motivation

### Why This Matters

**Current State:**
- RuvLLM cannot reliably generate structured output (no JSON schema enforcement)
- No native function calling support for tool-using agents
- Repeated prompts/prefixes incur full generation cost (no caching)
- Agent frameworks (LangChain, LlamaIndex, CrewAI) cannot integrate with RuvLLM

**Impact:**
- **Blocking production adoption** for agentic workflows
- **Cost inefficiency**: 10-100x slower than competitors for RAG/chat applications
- **Reliability gap**: JSON parsing failures break agent loops
- **Missing compatibility**: cannot serve as a drop-in replacement for vLLM, llama.cpp, or SGLang in existing stacks

**Competitive Gap:**

| Feature | vLLM | llama.cpp | SGLang | RuvLLM |
|---------|------|-----------|--------|--------|
| Structured Output | ✅ | ✅ | ✅ | ❌ |
| Function Calling | ✅ | ✅ | ✅ | ❌ |
| Prefix Caching | ✅ | ✅ | ✅ | ❌ |

---

## Features

### 1. Structured Output / JSON Mode (P0)

**Objective**: Guarantee valid JSON output conforming to user-provided schemas.

#### Core Capabilities
- [ ] **JSON Schema validation** (JSON Schema Draft 7 support)
  - Primitive types: `string`, `number`, `boolean`, `null`
  - Complex types: `object`, `array`
  - Nested schemas with `properties`, `items`, `required`
  - Constraints: `minLength`, `maxLength`, `pattern`, `enum`, `minimum`, `maximum`

- [ ] **Constrained decoding with logit bias** (see the sketch after this list)
  - State machine for tracking JSON structure (open braces, quotes, commas)
  - Token masking to enforce valid next tokens
  - Rejection sampling fallback for complex schemas

- [ ] **Bracket/brace state machine**
  - Track depth of `{}` and `[]`
  - Enforce closing brackets
  - Handle escaped quotes in strings

- [ ] **JSON repair for malformed output**
  - Auto-close unclosed braces/brackets
  - Fix trailing commas
  - Escape unescaped quotes
  - Best-effort recovery mode

- [ ] **GBNF grammar support (future)**
  - llama.cpp-compatible grammar format
  - Custom domain-specific languages

- [ ] **Comprehensive tests**
  - Unit tests for all JSON types
  - Property-based testing (proptest/QuickCheck style)
  - Adversarial inputs (deeply nested, large arrays)

- [ ] **Benchmarks vs unconstrained generation**
  - Latency overhead (<10% target)
  - Throughput impact
  - Memory usage
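
To make the constrained-decoding idea concrete, here is a minimal sketch of the bracket/brace state machine applied as a logit mask. All names (`JsonState`, `advance`, `allows`) are hypothetical, and a real implementation would operate on token IDs through the mistral-rs sampler hooks rather than on decoded characters:

```rust
/// Minimal sketch (hypothetical names): tracks JSON nesting and string
/// state so the sampler can mask tokens that would break validity.
#[derive(Clone, Default)]
struct JsonState {
    stack: Vec<char>, // open '{' / '[' delimiters
    in_string: bool,
    escaped: bool,
}

impl JsonState {
    /// Feed one decoded character; returns false if it breaks JSON structure.
    fn advance(&mut self, c: char) -> bool {
        if self.in_string {
            match (self.escaped, c) {
                (true, _) => self.escaped = false,      // char after '\' is consumed
                (false, '\\') => self.escaped = true,   // start of escape sequence
                (false, '"') => self.in_string = false, // closing quote
                _ => {}
            }
            return true;
        }
        match c {
            '"' => self.in_string = true,
            '{' | '[' => self.stack.push(c),
            '}' => return self.stack.pop() == Some('{'),
            ']' => return self.stack.pop() == Some('['),
            _ => {}
        }
        true
    }

    /// A candidate token is kept only if every character it decodes to
    /// keeps the state machine valid (applied as a logit mask).
    fn allows(&self, token_text: &str) -> bool {
        let mut trial = self.clone();
        token_text.chars().all(|c| trial.advance(c))
    }
}
```
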
#### Example API

```rust
let schema = json!({
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "age": {"type": "number", "minimum": 0},
        "tags": {"type": "array", "items": {"type": "string"}}
    },
    "required": ["name"]
});

let response = llm.generate(GenerateRequest {
    prompt: "Extract person info: John is 30".into(),
    json_schema: Some(schema),
    strict: true, // guarantee valid JSON
    ..Default::default()
})?;

// response.text is guaranteed to be valid JSON matching the schema
```
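
Repair mode can be sketched in the same spirit. This best-effort pass (a hypothetical `repair_json` helper, not a committed API) closes unterminated strings and brackets; a full implementation would also fix interior trailing commas and unescaped quotes:

```rust
/// Hypothetical sketch of best-effort JSON repair: strip a trailing
/// comma, then auto-close any unterminated string and any unclosed
/// braces/brackets left by truncated output.
fn repair_json(raw: &str) -> String {
    let mut out = String::from(raw.trim_end().trim_end_matches(','));
    let mut stack = Vec::new();
    let (mut in_string, mut escaped) = (false, false);
    for c in out.chars() {
        if in_string {
            match (escaped, c) {
                (true, _) => escaped = false,
                (false, '\\') => escaped = true,
                (false, '"') => in_string = false,
                _ => {}
            }
            continue;
        }
        match c {
            '"' => in_string = true,
            '{' => stack.push('}'),
            '[' => stack.push(']'),
            '}' | ']' => { stack.pop(); } // sketch: ignores mismatches
            _ => {}
        }
    }
    if in_string {
        out.push('"'); // close an unterminated string
    }
    while let Some(closer) = stack.pop() {
        out.push(closer); // close in reverse nesting order
    }
    out
}
```
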
#### Acceptance Criteria
- [ ] **100% valid JSON** when `strict: true` is enabled
- [ ] **<10% latency overhead** vs unconstrained generation
- [ ] **Schema validation passes** for nested objects/arrays (depth ≥ 5)
- [ ] **Repair mode** recovers ≥95% of malformed outputs

---
### 2. Function Calling / Tool Use (P0)

**Objective**: Enable LLMs to call external tools/functions with structured arguments.

#### Core Capabilities
- [ ] **Tool definition schema**
  - Function name, description
  - Parameters (JSON Schema)
  - Return type (optional)

- [ ] **ToolChoice enum** (see the type sketch after this list)
  - `auto`: model decides whether to call tools
  - `none`: never call tools (text-only)
  - `required`: must call at least one tool
  - `specific(name)`: force a specific tool

- [ ] **Parallel tool calls**
  - Generate multiple tool calls in one response
  - Dependency-aware ordering

- [ ] **Tool result handling**
  - Inject tool results back into the conversation
  - Continue generation after tool execution
  - Multi-turn tool loops

- [ ] **Model-specific formats**
  - Llama 3.1 tool format (`<|python_tag|>`)
  - Mistral tool format (function tags)
  - Qwen tool format
  - Claude tool format

- [ ] **OpenAI API compatibility layer**
  - `tools` parameter
  - `tool_choice` parameter
  - `ChatCompletionToolCall` response format

- [ ] **LangChain integration tests**
  - Works with `AgentExecutor`
  - Compatible with `StructuredTool`
  - Multi-agent workflows
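
Sketching the types this list implies (illustrative names, not a final API; the OpenAI-compatible layer would serialize `ToolCall` into the `tool_calls` array of a chat completion response):

```rust
use serde_json::Value;

/// Hypothetical sketch of the tool-calling types described above.
pub struct Tool {
    pub name: String,
    pub description: String,
    pub parameters: Value, // JSON Schema for the arguments
}

pub enum ToolChoice {
    Auto,             // model decides whether to call tools
    None,             // never call tools (text-only)
    Required,         // must call at least one tool
    Specific(String), // force the named tool
}

/// One generated call, mirroring OpenAI's `tool_calls` entries.
pub struct ToolCall {
    pub id: String,
    pub name: String,
    pub arguments: Value, // parsed JSON arguments
}
```
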
#### Example API

```rust
let tools = vec![
    Tool {
        name: "get_weather".into(),
        description: "Get current weather for a location".into(),
        parameters: json!({
            "type": "object",
            "properties": {
                "location": {"type": "string"},
                "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]}
            },
            "required": ["location"]
        }),
    },
    Tool {
        name: "search_web".into(),
        description: "Search the web".into(),
        parameters: json!({
            "type": "object",
            "properties": {
                "query": {"type": "string"}
            },
            "required": ["query"]
        }),
    },
];

let response = llm.chat(ChatRequest {
    messages: vec![
        Message::user("What's the weather in SF and latest AI news?"),
    ],
    tools: Some(tools),
    tool_choice: ToolChoice::Auto,
    ..Default::default()
})?;

// response.tool_calls contains parallel calls:
// [get_weather(location="San Francisco"), search_web(query="AI news")]
```
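
Multi-turn tool loops would then follow the usual driver pattern. A hedged sketch, assuming the types above plus hypothetical `into_assistant_message` and `Message::tool_result` helpers and an app-provided `execute` dispatcher:

```rust
// Hypothetical sketch of a multi-turn tool loop: run the requested
// tools, feed their results back, and continue until the model
// answers in plain text. Assumes the `tools` vector defined above.
let mut messages = vec![Message::user("What's the weather in SF?")];
loop {
    let response = llm.chat(ChatRequest {
        messages: messages.clone(),
        tools: Some(tools.clone()),
        tool_choice: ToolChoice::Auto,
        ..Default::default()
    })?;

    let calls = response.tool_calls.clone();
    if calls.is_empty() {
        break; // final text answer, loop done
    }
    messages.push(response.into_assistant_message()); // hypothetical helper
    for call in &calls {
        let result = execute(call)?; // app-provided tool dispatch
        messages.push(Message::tool_result(&call.id, result)); // hypothetical helper
    }
}
```
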
#### Acceptance Criteria
- [ ] **OpenAI API format compatibility** (passes OpenAI SDK tests)
- [ ] **LangChain AgentExecutor** integration works end-to-end
- [ ] **Parallel tool calls** supported (≥3 concurrent)
- [ ] **Multi-turn tool conversations** (≥5 turns)
- [ ] **Tool call success rate** ≥95% for common tools

---
### 3. Prefix Caching (P0)

**Objective**: Cache and reuse the KV cache for repeated prompt prefixes (system prompts, RAG documents).

#### Core Capabilities
- [ ] **Hash-based prefix lookup** (see the sketch after this list)
  - SHA-256 hash of token IDs
  - Fast O(1) cache-hit detection

- [ ] **Radix tree implementation**
  - Efficient storage for overlapping prefixes
  - Longest-common-prefix matching
  - Memory-efficient sharing

- [ ] **KV cache copy-on-write**
  - Share read-only cache entries
  - Copy only on divergence
  - Zero-copy for cache hits

- [ ] **LRU eviction policy**
  - Evict least recently used prefixes
  - Configurable cache size
  - Per-model cache isolation

- [ ] **Memory limits**
  - Hard limit on cache size (bytes)
  - Soft limit with warning
  - Graceful degradation

- [ ] **Cache hit/miss metrics**
  - Prometheus metrics
  - Hit-rate tracking
  - Memory usage stats

- [ ] **Chat prefix caching**
  - System prompt caching
  - Conversation history caching
  - Automatic prefix detection

- [ ] **RAG document caching**
  - Document chunk prefixes
  - Query-independent context
  - Multi-query reuse
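
A minimal sketch of the hash-based lookup at block granularity (the std hasher stands in for SHA-256 here, and all names are hypothetical; the real design would layer the radix tree and copy-on-write on top):

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

const BLOCK: usize = 16; // tokens per cached KV block

/// Hypothetical sketch: map a hash of each token-prefix block to a KV
/// block ID, then find the longest already-cached prefix of a prompt.
#[derive(Default)]
struct PrefixIndex {
    blocks: HashMap<u64, usize>, // prefix hash -> KV block id
}

impl PrefixIndex {
    fn hash_prefix(tokens: &[u32]) -> u64 {
        // Sketch only: std's hasher in place of the design's SHA-256.
        let mut h = DefaultHasher::new();
        tokens.hash(&mut h);
        h.finish()
    }

    /// Number of leading tokens whose KV blocks are already cached.
    fn longest_cached_prefix(&self, tokens: &[u32]) -> usize {
        let mut hit = 0;
        for end in (BLOCK..=tokens.len()).step_by(BLOCK) {
            // Each key hashes the whole prefix up to `end`, so a hit
            // implies every earlier block also matched.
            match self.blocks.get(&Self::hash_prefix(&tokens[..end])) {
                Some(_) => hit = end,
                None => break,
            }
        }
        hit
    }

    fn insert(&mut self, tokens: &[u32], block_id: usize) {
        self.blocks.insert(Self::hash_prefix(tokens), block_id);
    }
}
```
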
#### Example API

```rust
// First request - cache miss
let response1 = llm.generate(GenerateRequest {
    prompt: "System: You are a helpful assistant.\nUser: Hello".into(),
    cache_prefix: Some(CacheConfig {
        enable: true,
        key: Some("chat-system-prompt".into()),
        ttl_seconds: Some(3600),
    }),
    ..Default::default()
})?;
// Latency: 500ms

// Second request - cache hit (reuses the "System: You are..." KV cache)
let response2 = llm.generate(GenerateRequest {
    prompt: "System: You are a helpful assistant.\nUser: How are you?".into(),
    cache_prefix: Some(CacheConfig {
        enable: true,
        key: Some("chat-system-prompt".into()),
        ttl_seconds: Some(3600),
    }),
    ..Default::default()
})?;
// Latency: 50ms (10x faster!)
```
#### Performance Targets
- [ ] **10x speedup** for repeated system prompts (cache hit)
- [ ] **<5% overhead** on a cache miss
- [ ] **Memory-bounded** (configurable, default 2GB)
- [ ] **Thread-safe** for concurrent requests
- [ ] **Hit rate ≥80%** for typical chat/RAG workloads

#### Acceptance Criteria
- [ ] **Speedup**: ≥10x for 1024-token prefix reuse
- [ ] **Memory**: bounded by config, no OOM
- [ ] **Correctness**: identical outputs for cached vs uncached
- [ ] **Concurrency**: no race conditions (stress tested)
- [ ] **Metrics**: Prometheus metrics exported

---
## Technical Design

### Architecture Overview

```
┌─────────────────────────────────────────────────────────┐
│                       RuvLLM v2.4                       │
├─────────────────────────────────────────────────────────┤
│  ┌─────────────────┐  ┌──────────────┐  ┌────────────┐  │
│  │ Structured      │  │ Function     │  │ Prefix     │  │
│  │ Output Engine   │  │ Calling      │  │ Cache      │  │
│  │                 │  │ Router       │  │ Manager    │  │
│  │ - JSON Schema   │  │ - Tool Defs  │  │ - Radix    │  │
│  │ - Logit Bias    │  │ - ToolChoice │  │   Tree     │  │
│  │ - State Machine │  │ - Multi-call │  │ - LRU      │  │
│  └────────┬────────┘  └──────┬───────┘  └─────┬──────┘  │
│           │                  │                │         │
│           └──────────────────┼────────────────┘         │
│                              │                          │
│                    ┌─────────▼──────────┐               │
│                    │  mistral-rs Core   │               │
│                    │  - Model Loading   │               │
│                    │  - Token Sampling  │               │
│                    │  - KV Cache        │               │
│                    └────────────────────┘               │
└─────────────────────────────────────────────────────────┘
```
### Reference ADRs

- **ADR-009**: Structured Output Implementation
  - Constrained decoding algorithm
  - JSON schema validation approach
  - Performance optimization strategies

- **ADR-010**: Function Calling Architecture
  - Tool definition format
  - Multi-model compatibility layer
  - Parallel execution model

- **ADR-011**: Prefix Caching Design
  - Radix tree structure
  - Eviction policies
  - Memory management

- **ADR-008**: mistral-rs Integration
  - Dependency structure
  - API surface
  - Migration path

---
## Implementation Plan

### Phase 1: Foundation (Weeks 1-2)
**Focus**: Structured Output basics + Function Calling definitions

- [ ] Week 1: JSON schema parser and validator
  - Implement schema types (object, array, string, number, boolean, null)
  - Unit tests for all types
  - Property-based tests

- [ ] Week 2: Constrained decoding MVP
  - Logit bias implementation
  - Simple state machine (braces, brackets)
  - Integration with the mistral-rs sampler
  - Basic function calling types (`Tool`, `ToolChoice` enums)

**Deliverable**: JSON mode works for simple schemas; tool definitions parsed

---

### Phase 2: Core Logic (Weeks 3-4)
**Focus**: Constrained decoding + Tool generation

- [ ] Week 3: Advanced constrained decoding
  - Nested schema support
  - String pattern matching
  - Enum constraints
  - JSON repair mode

- [ ] Week 4: Tool call generation
  - Llama 3.1 format support
  - Mistral format support
  - Parallel tool calls
  - OpenAI API compatibility layer

**Deliverable**: Complex JSON schemas work; tool calls generated in OpenAI format

---

### Phase 3: Caching + Polish (Weeks 5-6)
**Focus**: Prefix Caching + Integration tests

- [ ] Week 5: Prefix caching implementation
  - Radix tree structure
  - Hash-based lookup
  - LRU eviction
  - Thread safety (`RwLock`)

- [ ] Week 6: Integration + benchmarks
  - LangChain integration tests
  - RAG workflow tests
  - Performance benchmarks
  - Documentation
  - Example applications

**Deliverable**: All 3 features production-ready, benchmarked, documented

---
## Testing Strategy

### Unit Tests
- JSON schema validation (all types, nested, constraints)
- Logit bias correctness
- Tool definition parsing
- Prefix cache hit/miss logic
- Radix tree operations

A property-based sketch of the schema-validation tests is shown below.
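
For example, assuming the `proptest` crate and a hypothetical `validate(schema, value)` entry point, the primitive-types property might look like:

```rust
use proptest::prelude::*;
use serde_json::json;

proptest! {
    /// Sketch: any generated string/number pair must validate against
    /// the matching schema (`validate` is the hypothetical entry point).
    #[test]
    fn accepts_matching_primitives(name in ".*", age in 0.0..150.0f64) {
        let schema = json!({
            "type": "object",
            "properties": {
                "name": {"type": "string"},
                "age": {"type": "number", "minimum": 0}
            },
            "required": ["name"]
        });
        let value = json!({ "name": name, "age": age });
        prop_assert!(validate(&schema, &value).is_ok());
    }
}
```
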
### Integration Tests
- LangChain AgentExecutor with tools
- LlamaIndex ReAct agent
- CrewAI multi-agent workflows
- OpenAI SDK compatibility tests

### Benchmarks
- Structured output latency vs unconstrained
- Tool calling accuracy (% correct tool selections)
- Prefix cache speedup (1x, 10x, 100x reuse)
- Memory usage under load

### Stress Tests
- 1000 concurrent requests with caching
- Deeply nested JSON schemas (depth 20)
- Large tool libraries (100+ tools)
- Multi-turn tool conversations (50+ turns)

---
## Success Metrics

### Structured Output
- [ ] **Validity**: 100% valid JSON when `strict: true`
- [ ] **Overhead**: <10% latency vs unconstrained
- [ ] **Schema compliance**: 100% for schemas of depth ≤10
- [ ] **Repair rate**: ≥95% successful repairs

### Function Calling
- [ ] **Compatibility**: passes the OpenAI SDK test suite
- [ ] **LangChain**: works with AgentExecutor (5+ examples)
- [ ] **Accuracy**: ≥95% correct tool selection (benchmark dataset)
- [ ] **Parallel calls**: supports ≥5 concurrent tools

### Prefix Caching
- [ ] **Speedup**: 10x for a 1024-token prefix, 100x for 4096 tokens
- [ ] **Hit rate**: ≥80% for chat workloads
- [ ] **Memory**: bounded, no OOM under stress
- [ ] **Correctness**: 100% identical outputs (cached vs uncached)

---
## Dependencies

### Upstream
- **mistral-rs v0.4.x** (ADR-008)
  - KV cache access for prefix caching
  - Token sampling hooks for logit bias
  - Model loading infrastructure

### Downstream
- **Enables**: agentic workflow support
- **Enables**: LangChain/LlamaIndex/CrewAI integration
- **Blocks**: v2.4 release
- **Blocks**: production adoption by agent frameworks

---
## Related Issues

- Depends on: #XXX (mistral-rs integration, ADR-008)
- Enables: #XXX (agentic workflow support)
- Enables: #XXX (LangChain integration)
- Blocks: #XXX (v2.4 release milestone)

---
## Documentation Requirements

- [ ] API reference docs (rustdoc)
- [ ] User guides for each feature
  - "How to use JSON mode"
  - "How to define tools"
  - "How to enable prefix caching"
- [ ] Migration guide from v2.3
- [ ] Example applications
  - Structured extraction (NER, info extraction)
  - Multi-tool agent (ReAct loop)
  - RAG with caching (chatbot)
- [ ] Performance tuning guide

---
## Open Questions

1. **JSON Schema**: full Draft 7 or a subset? (Propose: core subset + extensions)
2. **Tool formats**: support all models, or Llama 3.1+ only? (Propose: Llama 3.1+ with adapters)
3. **Cache eviction**: LRU vs LFU vs TTL-based? (Propose: LRU + TTL)
4. **Memory limit**: default cache size? (Propose: 2GB default, configurable)
5. **Breaking changes**: any API changes needed? (Propose: additive only, no breaks)

---
## Future Enhancements (Post-v2.4)

- **Structured Output**:
  - GBNF grammar support (custom DSLs)
  - Regex-constrained strings
  - Speculative decoding for constrained generation

- **Function Calling**:
  - Async/streaming tool execution
  - Tool result validation
  - Tool dependency graphs

- **Prefix Caching**:
  - Cross-request caching (shared cache pool)
  - Disk-backed cache (persists across restarts)
  - Distributed caching (Redis/memcached)

---
## Timeline Summary

| Phase | Duration | Focus | Deliverable |
|-------|----------|-------|-------------|
| 1 | Weeks 1-2 | Structured Output + Tool Definitions | JSON mode MVP, tool parsing |
| 2 | Weeks 3-4 | Constrained Decoding + Tool Generation | Complex schemas, tool calls |
| 3 | Weeks 5-6 | Prefix Caching + Integration | Production-ready, benchmarked |

**Total**: 6 weeks to a production-ready v2.4

---
## Getting Involved

### For Contributors
- Pick a task from the checkboxes above
- Comment on this issue to claim a feature
- Follow the implementation plan phases
- Submit PRs with tests + benchmarks

### For Reviewers
- Focus on correctness (JSON validity, cache correctness)
- Check for performance regressions (<10% overhead target)
- Give API design feedback (before Week 3)

### For Testers
- Test with real-world agent workflows
- Report edge cases and failure modes
- Benchmark on your own hardware/models

---
**Let's close the gap with vLLM/llama.cpp and make RuvLLM the best choice for production agentic workflows!** 🚀