feat(ruvllm): Implement SOTA features for production agentic workflows
Labels: enhancement, p0-critical, agentic, v2.4, mistral-rs, performance
Summary
RuvLLM v2.4 SOTA Feature Implementation - Adding the 3 critical capabilities needed for production agentic workflows: Structured Output, Function Calling, and Prefix Caching.
These features are essential for modern LLM applications and are currently blocking production adoption for major agent frameworks.
Motivation
Why This Matters
Current State:
- RuvLLM cannot reliably generate structured outputs (JSON schema enforcement)
- No native function calling support for tool-using agents
- Repeated prompts/prefixes incur full generation costs (no caching)
- Agent frameworks (LangChain, LlamaIndex, CrewAI) cannot integrate
Impact:
- Blocking production adoption for agentic workflows
- Cost inefficiency: 10-100x slower for RAG/chat applications vs competitors
- Reliability gap: JSON parsing failures break agent loops
- Missing compatibility: Cannot replace vLLM, llama.cpp, SGLang in existing stacks
Competitive Gap:
| Feature | vLLM | llama.cpp | SGLang | RuvLLM |
|---|---|---|---|---|
| Structured Output | ✅ | ✅ | ✅ | ❌ |
| Function Calling | ✅ | ✅ | ✅ | ❌ |
| Prefix Caching | ✅ | ✅ | ✅ | ❌ |
Features
1. Structured Output / JSON Mode (P0)
Objective: Guarantee valid JSON output conforming to user-provided schemas.
Core Capabilities
- [ ] JSON schema validation (JSON Schema Draft 7 support)
  - Primitive types: `string`, `number`, `boolean`, `null`
  - Complex types: `object`, `array`
  - Nested schemas with `properties`, `items`, `required`
  - Constraints: `minLength`, `maxLength`, `pattern`, `enum`, `minimum`, `maximum`
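To make the constraint keywords above concrete, here is a minimal, dependency-free sketch of how `minLength`/`maxLength`/`enum` and `minimum`/`maximum` checks could look. The `StringConstraints`, `check_string`, and `check_number` names are illustrative, not the RuvLLM API.

```rust
// Hand-rolled sketch of the JSON Schema constraint checks listed above.
#[derive(Debug, Default)]
struct StringConstraints {
    min_length: Option<usize>,
    max_length: Option<usize>,
    one_of: Option<Vec<String>>, // JSON Schema `enum`
}

fn check_string(value: &str, c: &StringConstraints) -> bool {
    let len = value.chars().count();
    if c.min_length.map_or(false, |m| len < m) { return false; }
    if c.max_length.map_or(false, |m| len > m) { return false; }
    if let Some(allowed) = &c.one_of {
        if !allowed.iter().any(|a| a == value) { return false; }
    }
    true
}

fn check_number(value: f64, minimum: Option<f64>, maximum: Option<f64>) -> bool {
    minimum.map_or(true, |m| value >= m) && maximum.map_or(true, |m| value <= m)
}

fn main() {
    let c = StringConstraints { min_length: Some(2), max_length: Some(5), one_of: None };
    assert!(check_string("abc", &c));
    assert!(!check_string("a", &c));
    assert!(check_number(30.0, Some(0.0), None)); // "age": minimum 0
    println!("constraint checks pass");
}
```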
- [ ] Constrained decoding with logit bias
  - State machine for tracking JSON structure (open braces, quotes, commas)
  - Token masking to enforce valid next tokens
  - Rejection sampling fallback for complex schemas
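The token-masking step above can be sketched in a few lines: every token whose text would break the JSON state machine gets its logit forced to negative infinity, so its softmax probability becomes zero. The `is_valid_next` oracle is a hypothetical stand-in for the real state-machine query.

```rust
// Token masking for constrained decoding: invalid tokens get -inf logits.
fn mask_logits(logits: &mut [f32], is_valid_next: impl Fn(usize) -> bool) {
    for (token_id, logit) in logits.iter_mut().enumerate() {
        if !is_valid_next(token_id) {
            *logit = f32::NEG_INFINITY; // softmax probability becomes 0
        }
    }
}

fn main() {
    // Toy vocab of 4 tokens; suppose only tokens 1 and 3 keep the JSON valid.
    let mut logits = [0.5_f32, 1.0, 2.0, 0.1];
    mask_logits(&mut logits, |t| t == 1 || t == 3);
    assert!(logits[0].is_infinite() && logits[0] < 0.0);
    assert_eq!(logits[1], 1.0);
    assert_eq!(logits[3], 0.1);
}
```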
- [ ] Bracket/brace state machine
  - Track depth of `{}` and `[]`
  - Enforce closing brackets
  - Handle escaped quotes in strings
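The tracker described above can be sketched as a tiny character-fed state machine: it tracks net `{}`/`[]` depth, whether we are inside a string literal, and backslash escapes, and reports when the value is structurally complete. Illustrative only, not the RuvLLM internals.

```rust
// Minimal bracket/brace state machine for streaming JSON output.
#[derive(Default)]
struct JsonTracker {
    depth: i32,       // net {} / [] nesting
    in_string: bool,  // inside a "..." literal
    escaped: bool,    // previous char was a backslash
    seen_any: bool,   // has the top-level value been opened yet
}

impl JsonTracker {
    fn feed(&mut self, c: char) {
        if self.in_string {
            if self.escaped { self.escaped = false; }      // \" stays in string
            else if c == '\\' { self.escaped = true; }
            else if c == '"' { self.in_string = false; }
            return;
        }
        match c {
            '"' => self.in_string = true,
            '{' | '[' => { self.depth += 1; self.seen_any = true; }
            '}' | ']' => self.depth -= 1,
            _ => {}
        }
    }

    /// True once every opened brace/bracket has been closed.
    fn complete(&self) -> bool {
        self.seen_any && self.depth == 0 && !self.in_string
    }
}

fn main() {
    let mut t = JsonTracker::default();
    for c in r#"{"name": "Jo\"hn", "tags": ["a"]}"#.chars() { t.feed(c); }
    assert!(t.complete());
}
```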
- [ ] JSON repair for malformed output
  - Auto-close unclosed braces/brackets
  - Fix trailing commas
  - Escape unescaped quotes
  - Best-effort recovery mode
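A best-effort sketch of the two most common repairs listed above, trailing commas and unclosed braces/brackets. `repair_json` is a hypothetical helper, not the actual repair module, and it deliberately ignores the harder cases (unescaped quotes, truncated literals).

```rust
// Best-effort JSON repair: strip trailing commas, auto-close open scopes.
fn repair_json(input: &str) -> String {
    let mut out = String::new();
    let mut stack: Vec<char> = Vec::new(); // closers still owed
    let mut in_string = false;
    let mut escaped = false;
    for c in input.chars() {
        if in_string {
            if escaped { escaped = false; }
            else if c == '\\' { escaped = true; }
            else if c == '"' { in_string = false; }
            out.push(c);
            continue;
        }
        match c {
            '"' => { in_string = true; out.push(c); }
            '{' => { stack.push('}'); out.push(c); }
            '[' => { stack.push(']'); out.push(c); }
            '}' | ']' => {
                // Drop a trailing comma before a closer: {"a":1,} -> {"a":1}
                while out.ends_with(|p: char| p.is_whitespace()) { out.pop(); }
                if out.ends_with(',') { out.pop(); }
                stack.pop();
                out.push(c);
            }
            _ => out.push(c),
        }
    }
    if in_string { out.push('"'); }        // close an unterminated string
    while let Some(closer) = stack.pop() { // auto-close remaining scopes
        while out.ends_with(|p: char| p.is_whitespace()) { out.pop(); }
        if out.ends_with(',') { out.pop(); }
        out.push(closer);
    }
    out
}

fn main() {
    assert_eq!(repair_json(r#"{"a": [1, 2,"#), r#"{"a": [1, 2]}"#);
    assert_eq!(repair_json(r#"{"a": 1,}"#), r#"{"a": 1}"#);
}
```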
- [ ] GBNF grammar support (future)
  - llama.cpp-compatible grammar format
  - Custom domain-specific languages
- [ ] Comprehensive tests
  - Unit tests for all JSON types
  - Property-based testing with Hypothesis/QuickCheck
  - Adversarial inputs (deeply nested objects, large arrays)
- [ ] Benchmarks vs unconstrained generation
  - Measure latency overhead (<10% target)
  - Throughput impact
  - Memory usage
Example API

```rust
let schema = json!({
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "age": {"type": "number", "minimum": 0},
        "tags": {"type": "array", "items": {"type": "string"}}
    },
    "required": ["name"]
});

let response = llm.generate(GenerateRequest {
    prompt: "Extract person info: John is 30".into(),
    json_schema: Some(schema),
    strict: true, // guarantee valid JSON
    ..Default::default()
})?;

// response.text is guaranteed to be valid JSON matching the schema
```
Acceptance Criteria
- [ ] 100% valid JSON when `strict: true` is enabled
- [ ] <10% latency overhead vs unconstrained generation
- [ ] Schema validation passes for nested objects/arrays (depth ≥ 5)
- [ ] Repair mode recovers ≥95% of malformed outputs
2. Function Calling / Tool Use (P0)
Objective: Enable LLMs to call external tools/functions with structured arguments.
Core Capabilities
- [ ] Tool definition schema
  - Function name and description
  - Parameters (JSON schema)
  - Return type (optional)
- [ ] `ToolChoice` enum
  - `auto`: Model decides whether to call tools
  - `none`: Never call tools (text-only)
  - `required`: Must call at least one tool
  - `specific(name)`: Force a specific tool
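One way the variants above could be modeled in Rust; a sketch, not the committed RuvLLM API surface. The `permits` helper shows how a sampler might consult the policy.

```rust
// Sketch of the ToolChoice policy described above.
#[derive(Debug, Clone, PartialEq)]
enum ToolChoice {
    Auto,             // model decides whether to call tools
    None,             // text-only, never call tools
    Required,         // must call at least one tool
    Specific(String), // force a specific tool by name
}

impl ToolChoice {
    /// May the model emit a call to `name` under this policy?
    fn permits(&self, name: &str) -> bool {
        match self {
            ToolChoice::None => false,
            ToolChoice::Specific(n) => n == name,
            ToolChoice::Auto | ToolChoice::Required => true,
        }
    }
}

fn main() {
    assert!(ToolChoice::Auto.permits("get_weather"));
    assert!(!ToolChoice::None.permits("get_weather"));
    assert!(ToolChoice::Specific("search_web".into()).permits("search_web"));
    assert!(!ToolChoice::Specific("search_web".into()).permits("get_weather"));
}
```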
- [ ] Parallel tool calls
  - Generate multiple tool calls in one response
  - Dependency-aware ordering
- [ ] Tool result handling
  - Inject tool results back into the conversation
  - Continue generation after tool execution
  - Multi-turn tool loops
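The multi-turn loop above can be sketched as: run the model, execute any tool calls it emitted, append the results to the history, and repeat until the model answers in plain text. All types and the `run_model`/`run_tool` closures are hypothetical stand-ins for the RuvLLM chat API.

```rust
// Minimal multi-turn tool loop sketch.
#[allow(dead_code)]
#[derive(Debug, Clone)]
enum Message {
    User(String),
    Assistant(String),
    ToolResult { name: String, output: String },
}

#[derive(Debug)]
enum ModelTurn {
    Text(String),
    ToolCalls(Vec<(String, String)>), // (tool name, JSON-encoded args)
}

fn agent_loop(
    mut history: Vec<Message>,
    run_model: impl Fn(&[Message]) -> ModelTurn,
    run_tool: impl Fn(&str, &str) -> String,
    max_turns: usize,
) -> Option<String> {
    for _ in 0..max_turns {
        match run_model(&history) {
            ModelTurn::Text(answer) => return Some(answer),
            ModelTurn::ToolCalls(calls) => {
                for (name, args) in calls {
                    // Inject each tool result back into the conversation.
                    let output = run_tool(&name, &args);
                    history.push(Message::ToolResult { name, output });
                }
            }
        }
    }
    None // turn budget exhausted
}

fn main() {
    // Fake model: calls a tool once, then answers using its result.
    let run_model = |h: &[Message]| {
        if h.iter().any(|m| matches!(m, Message::ToolResult { .. })) {
            ModelTurn::Text("18C and sunny".into())
        } else {
            ModelTurn::ToolCalls(vec![("get_weather".into(), r#"{"location":"SF"}"#.into())])
        }
    };
    let run_tool = |_name: &str, _args: &str| "18C, sunny".to_string();
    let answer = agent_loop(vec![Message::User("Weather in SF?".into())], run_model, run_tool, 5);
    assert_eq!(answer.as_deref(), Some("18C and sunny"));
}
```

The `max_turns` budget is the guard that keeps a misbehaving model from looping forever, mirroring the ≥5-turn acceptance criterion below.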
- [ ] Model-specific formats
  - Llama 3.1 tool format (`<|python_tag|>`)
  - Mistral tool format (function tags)
  - Qwen tool format
  - Claude tool format
- [ ] OpenAI API compatibility layer
  - `tools` parameter
  - `tool_choice` parameter
  - `ChatCompletionToolCall` response format
- [ ] LangChain integration tests
  - Works with `AgentExecutor`
  - Compatible with `StructuredTool`
  - Multi-agent workflows
Example API

```rust
let tools = vec![
    Tool {
        name: "get_weather".into(),
        description: "Get current weather for a location".into(),
        parameters: json!({
            "type": "object",
            "properties": {
                "location": {"type": "string"},
                "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]}
            },
            "required": ["location"]
        }),
    },
    Tool {
        name: "search_web".into(),
        description: "Search the web".into(),
        parameters: json!({
            "type": "object",
            "properties": {
                "query": {"type": "string"}
            },
            "required": ["query"]
        }),
    },
];

let response = llm.chat(ChatRequest {
    messages: vec![
        Message::user("What's the weather in SF and latest AI news?")
    ],
    tools: Some(tools),
    tool_choice: ToolChoice::Auto,
    ..Default::default()
})?;

// response.tool_calls contains parallel calls:
// [get_weather(location="San Francisco"), search_web(query="AI news")]
```
Acceptance Criteria
- [ ] OpenAI API format compatibility (passes OpenAI SDK tests)
- [ ] LangChain AgentExecutor integration works end-to-end
- [ ] Parallel tool calls supported (≥3 concurrent)
- [ ] Multi-turn tool conversations (≥5 turns)
- [ ] Tool call success rate ≥95% for common tools
3. Prefix Caching (P0)
Objective: Cache and reuse KV cache for repeated prompt prefixes (system prompts, RAG documents).
Core Capabilities
- [ ] Hash-based prefix lookup
  - SHA-256 hash of token IDs
  - Fast O(1) cache-hit detection
- [ ] Radix tree implementation
  - Efficient storage for overlapping prefixes
  - Longest-common-prefix matching
  - Memory-efficient sharing
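The longest-common-prefix matching at the heart of the radix tree above, shown directly over token-ID slices; a sketch of the core idea, not the tree itself.

```rust
// Length of the shared prefix between a cached token sequence and a request.
fn longest_common_prefix(a: &[u32], b: &[u32]) -> usize {
    a.iter().zip(b.iter()).take_while(|(x, y)| x == y).count()
}

fn main() {
    let cached  = [101, 7, 42, 9];  // e.g. tokenized system prompt + turn 1
    let request = [101, 7, 42, 13]; // same system prompt, new user turn
    // The first 3 tokens can reuse cached KV entries; only the tail past
    // the divergence point needs fresh prefill.
    assert_eq!(longest_common_prefix(&cached, &request), 3);
}
```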
- [ ] KV cache copy-on-write
  - Share read-only cache entries
  - Copy only on divergence
  - Zero-copy for cache hits
- [ ] LRU eviction policy
  - Evict least recently used prefixes
  - Configurable cache size
  - Per-model cache isolation
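A dependency-free sketch of the hash-keyed, LRU-evicted cache described above. The issue proposes SHA-256 over token IDs; the stdlib hasher here is a stand-in so the example compiles without external crates, and `KvEntry` is a placeholder for a real KV-cache block. Eviction scans for the minimum recency tick, which is O(n); a production cache would use an ordered structure.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

#[derive(Clone, Debug)]
struct KvEntry(String); // placeholder for a real KV-cache block

struct PrefixCache {
    map: HashMap<u64, (KvEntry, u64)>, // hash -> (entry, last-used tick)
    tick: u64,
    capacity: usize,
}

impl PrefixCache {
    fn new(capacity: usize) -> Self {
        Self { map: HashMap::new(), tick: 0, capacity }
    }

    fn key(tokens: &[u32]) -> u64 {
        let mut h = DefaultHasher::new();
        tokens.hash(&mut h);
        h.finish()
    }

    fn get(&mut self, tokens: &[u32]) -> Option<KvEntry> {
        self.tick += 1;
        let tick = self.tick;
        self.map.get_mut(&Self::key(tokens)).map(|slot| {
            slot.1 = tick; // refresh recency on hit
            slot.0.clone()
        })
    }

    fn put(&mut self, tokens: &[u32], entry: KvEntry) {
        if self.map.len() >= self.capacity {
            // Evict the least recently used prefix.
            if let Some(&lru) = self.map.iter().min_by_key(|(_, v)| v.1).map(|(k, _)| k) {
                self.map.remove(&lru);
            }
        }
        self.tick += 1;
        self.map.insert(Self::key(tokens), (entry, self.tick));
    }
}

fn main() {
    let mut cache = PrefixCache::new(2);
    cache.put(&[1, 2, 3], KvEntry("sys-prompt".into()));
    assert!(cache.get(&[1, 2, 3]).is_some()); // hit
    assert!(cache.get(&[9, 9]).is_none());    // miss
}
```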
- [ ] Memory limits
  - Hard limit on cache size (bytes)
  - Soft limit with warning
  - Graceful degradation
- [ ] Cache hit/miss metrics
  - Prometheus metrics
  - Hit-rate tracking
  - Memory usage stats
- [ ] Chat prefix caching
  - System prompt caching
  - Conversation history caching
  - Automatic prefix detection
- [ ] RAG document caching
  - Document chunk prefixes
  - Query-independent context
  - Multi-query reuse
Example API

```rust
// First request - cache miss
let response1 = llm.generate(GenerateRequest {
    prompt: "System: You are a helpful assistant.\nUser: Hello".into(),
    cache_prefix: Some(CacheConfig {
        enable: true,
        key: Some("chat-system-prompt".into()),
        ttl_seconds: Some(3600),
    }),
    ..Default::default()
})?;
// Latency: 500ms

// Second request - cache hit (reuses the "System: You are..." KV cache)
let response2 = llm.generate(GenerateRequest {
    prompt: "System: You are a helpful assistant.\nUser: How are you?".into(),
    cache_prefix: Some(CacheConfig {
        enable: true,
        key: Some("chat-system-prompt".into()),
        ttl_seconds: Some(3600),
    }),
    ..Default::default()
})?;
// Latency: 50ms (10x faster)
```
Performance Targets
- 10x speedup for repeated system prompts (cache hit)
- <5% overhead for cache miss
- Memory-bounded (configurable, default 2GB)
- Thread-safe for concurrent requests
- Hit rate ≥80% for typical chat/RAG workloads
Acceptance Criteria
- Speedup: ≥10x for 1024-token prefix reuse
- Memory: Bounded by config, no OOM
- Correctness: Identical outputs for cached vs uncached
- Concurrency: No race conditions (stress tested)
- Metrics: Prometheus metrics exported
Technical Design
Architecture Overview

```
┌─────────────────────────────────────────────────────────┐
│                     RuvLLM v2.4                         │
├─────────────────────────────────────────────────────────┤
│ ┌─────────────────┐ ┌──────────────┐ ┌────────────┐     │
│ │ Structured      │ │ Function     │ │ Prefix     │     │
│ │ Output Engine   │ │ Calling      │ │ Cache      │     │
│ │                 │ │ Router       │ │ Manager    │     │
│ │ - JSON Schema   │ │ - Tool Defs  │ │ - Radix    │     │
│ │ - Logit Bias    │ │ - ToolChoice │ │   Tree     │     │
│ │ - State Machine │ │ - Multi-call │ │ - LRU      │     │
│ └────────┬────────┘ └──────┬───────┘ └─────┬──────┘     │
│          │                 │               │            │
│          └─────────────────┼───────────────┘            │
│                            │                            │
│                 ┌──────────▼─────────┐                  │
│                 │   mistral-rs Core  │                  │
│                 │   - Model Loading  │                  │
│                 │   - Token Sampling │                  │
│                 │   - KV Cache       │                  │
│                 └────────────────────┘                  │
└─────────────────────────────────────────────────────────┘
```
Reference ADRs
- ADR-009: Structured Output Implementation
  - Constrained decoding algorithm
  - JSON schema validation approach
  - Performance optimization strategies
- ADR-010: Function Calling Architecture
  - Tool definition format
  - Multi-model compatibility layer
  - Parallel execution model
- ADR-011: Prefix Caching Design
  - Radix tree structure
  - Eviction policies
  - Memory management
- ADR-008: mistral-rs Integration
  - Dependency structure
  - API surface
  - Migration path
Implementation Plan
Phase 1: Foundation (Weeks 1-2)
Focus: Structured Output basics + Function Calling definitions
- [ ] Week 1: JSON schema parser and validator
  - Implement schema types (object, array, string, number, boolean, null)
  - Unit tests for all types
  - Property-based tests
- [ ] Week 2: Constrained decoding MVP
  - Logit bias implementation
  - Simple state machine (braces, brackets)
  - Integration with the mistral-rs sampler
  - Basic function calling types (Tool, ToolChoice enums)

Deliverable: JSON mode works for simple schemas; tool definitions parsed
Phase 2: Core Logic (Weeks 3-4)
Focus: Constrained decoding + Tool generation
- [ ] Week 3: Advanced constrained decoding
  - Nested schema support
  - String pattern matching
  - Enum constraints
  - JSON repair mode
- [ ] Week 4: Tool call generation
  - Llama 3.1 format support
  - Mistral format support
  - Parallel tool calls
  - OpenAI API compatibility layer

Deliverable: Complex JSON schemas work; tool calls generated in OpenAI format
Phase 3: Caching + Polish (Weeks 5-6)
Focus: Prefix Caching + Integration tests
- [ ] Week 5: Prefix caching implementation
  - Radix tree structure
  - Hash-based lookup
  - LRU eviction
  - Thread safety (RwLock)
- [ ] Week 6: Integration + benchmarks
  - LangChain integration tests
  - RAG workflow tests
  - Performance benchmarks
  - Documentation
  - Example applications

Deliverable: All 3 features production-ready, benchmarked, and documented
Testing Strategy
Unit Tests
- JSON schema validation (all types, nested, constraints)
- Logit bias correctness
- Tool definition parsing
- Prefix cache hit/miss logic
- Radix tree operations
Integration Tests
- LangChain AgentExecutor with tools
- LlamaIndex ReAct agent
- CrewAI multi-agent workflows
- OpenAI SDK compatibility tests
Benchmarks
- Structured output latency vs unconstrained
- Tool calling accuracy (% correct tool selections)
- Prefix cache speedup (1x, 10x, 100x reuse)
- Memory usage under load
Stress Tests
- 1000 concurrent requests with caching
- Deeply nested JSON schemas (depth 20)
- Large tool libraries (100+ tools)
- Multi-turn tool conversations (50+ turns)
Success Metrics
Structured Output
- Validity: 100% valid JSON when `strict: true`
- Overhead: <10% latency vs unconstrained
- Schema compliance: 100% for schemas of depth ≤ 10
- Repair rate: ≥95% successful repairs
Function Calling
- Compatibility: Passes OpenAI SDK test suite
- LangChain: Works with AgentExecutor (5+ examples)
- Accuracy: ≥95% correct tool selection (benchmark dataset)
- Parallel calls: Supports ≥5 concurrent tools
Prefix Caching
- Speedup: 10x for 1024-token prefix, 100x for 4096-token
- Hit rate: ≥80% for chat workloads
- Memory: Bounded, no OOM under stress
- Correctness: 100% identical outputs (cached vs uncached)
Dependencies
Upstream
- mistral-rs v0.4.x (ADR-008)
- KV cache access for prefix caching
- Token sampling hooks for logit bias
- Model loading infrastructure
Downstream
- Enables: Agentic workflow support
- Enables: LangChain/LlamaIndex/CrewAI integration
- Blocks: v2.4 release
- Blocks: Production adoption by agent frameworks
Related Issues
- Depends on: #XXX (mistral-rs integration ADR-008)
- Enables: #XXX (Agentic workflow support)
- Enables: #XXX (LangChain integration)
- Blocks: #XXX (v2.4 release milestone)
Documentation Requirements
- API reference docs (rustdoc)
- User guides for each feature
- "How to use JSON mode"
- "How to define tools"
- "How to enable prefix caching"
- Migration guide from v2.3
- Example applications
- Structured extraction (NER, info extraction)
- Multi-tool agent (ReAct loop)
- RAG with caching (chatbot)
- Performance tuning guide
Open Questions
- JSON Schema: Full Draft 7 or subset? (Propose: Core subset + extensions)
- Tool formats: Support all models or Llama 3.1+ only? (Propose: Llama 3.1+ with adapters)
- Cache eviction: LRU vs LFU vs TTL-based? (Propose: LRU + TTL)
- Memory limit: Default cache size? (Propose: 2GB default, configurable)
- Breaking changes: Any API changes needed? (Propose: Additive only, no breaks)
Future Enhancements (Post-v2.4)
- Structured Output:
  - GBNF grammar support (custom DSLs)
  - Regex-constrained strings
  - Speculative decoding for constrained generation
- Function Calling:
  - Async/streaming tool execution
  - Tool result validation
  - Tool dependency graphs
- Prefix Caching:
  - Cross-request caching (shared cache pool)
  - Disk-backed cache (persist across restarts)
  - Distributed caching (Redis/memcached)
Timeline Summary
| Phase | Duration | Focus | Deliverable |
|---|---|---|---|
| 1 | Weeks 1-2 | Structured Output + Tool Definitions | JSON mode MVP, tool parsing |
| 2 | Weeks 3-4 | Constrained Decoding + Tool Generation | Complex schemas, tool calls |
| 3 | Weeks 5-6 | Prefix Caching + Integration | Production-ready, benchmarked |
Total: 6 weeks to production-ready v2.4
Getting Involved
For Contributors
- Pick a task from the checkboxes above
- Comment on this issue to claim a feature
- Follow the implementation plan phases
- Submit PRs with tests + benchmarks
For Reviewers
- Focus on correctness (JSON validity, cache correctness)
- Performance regression checks (<10% overhead target)
- API design feedback (before Week 3)
For Testers
- Test with real-world agent workflows
- Report edge cases and failure modes
- Benchmark on your hardware/models
Let's close the gap with vLLM/llama.cpp and make RuvLLM the best choice for production agentic workflows! 🚀