# feat(ruvllm): Implement SOTA features for production agentic workflows

**Labels**: `enhancement`, `p0-critical`, `agentic`, `v2.4`, `mistral-rs`, `performance`

---

## Summary

RuvLLM v2.4 SOTA feature implementation: add the three critical capabilities needed for production agentic workflows: **Structured Output**, **Function Calling**, and **Prefix Caching**. These features are essential for modern LLM applications and their absence is currently blocking production adoption by major agent frameworks.

---

## Motivation

### Why This Matters

**Current State:**

- RuvLLM cannot reliably generate structured outputs (no JSON schema enforcement)
- No native function calling support for tool-using agents
- Repeated prompts/prefixes incur full prefill costs on every request (no caching)
- Agent frameworks (LangChain, LlamaIndex, CrewAI) cannot integrate

**Impact:**

- **Blocking production adoption** for agentic workflows
- **Cost inefficiency**: 10-100x slower than competitors for RAG/chat applications
- **Reliability gap**: JSON parsing failures break agent loops
- **Missing compatibility**: cannot replace vLLM, llama.cpp, or SGLang in existing stacks

**Competitive Gap:**

| Feature | vLLM | llama.cpp | SGLang | RuvLLM |
|---------|------|-----------|--------|--------|
| Structured Output | ✅ | ✅ | ✅ | ❌ |
| Function Calling | ✅ | ✅ | ✅ | ❌ |
| Prefix Caching | ✅ | ✅ | ✅ | ❌ |

---

## Features

### 1. Structured Output / JSON Mode (P0)

**Objective**: Guarantee valid JSON output conforming to user-provided schemas.

#### Core Capabilities

- [ ] **JSON schema validation** (JSON Schema Draft 7 support)
  - Primitive types: `string`, `number`, `boolean`, `null`
  - Complex types: `object`, `array`
  - Nested schemas with `properties`, `items`, `required`
  - Constraints: `minLength`, `maxLength`, `pattern`, `enum`, `minimum`, `maximum`
- [ ] **Constrained decoding with logit bias** (see the sketch at the end of this section)
  - State machine for tracking JSON structure (open braces, quotes, commas)
  - Token masking to enforce valid next tokens
  - Rejection sampling fallback for complex schemas
- [ ] **Bracket/brace state machine**
  - Track depth of `{}` and `[]`
  - Enforce closing brackets
  - Handle escaped quotes in strings
- [ ] **JSON repair for malformed output**
  - Auto-close unclosed braces/brackets
  - Fix trailing commas
  - Escape unescaped quotes
  - Best-effort recovery mode
- [ ] **GBNF grammar support (future)**
  - llama.cpp-compatible grammar format
  - Custom domain-specific languages
- [ ] **Comprehensive tests**
  - Unit tests for all JSON types
  - Property-based testing with Hypothesis/QuickCheck
  - Adversarial inputs (deeply nested, large arrays)
- [ ] **Benchmarks vs unconstrained generation**
  - Measure latency overhead (<10% target)
  - Throughput impact
  - Memory usage

#### Example API

```rust
let schema = json!({
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "age": {"type": "number", "minimum": 0},
        "tags": {"type": "array", "items": {"type": "string"}}
    },
    "required": ["name"]
});

let response = llm.generate(GenerateRequest {
    prompt: "Extract person info: John is 30".into(),
    json_schema: Some(schema),
    strict: true, // Guarantee valid JSON
    ..Default::default()
})?;

// response.text is guaranteed to be valid JSON matching the schema
```

#### Acceptance Criteria

- [ ] **100% valid JSON** when `strict: true` is enabled
- [ ] **<10% latency overhead** vs unconstrained generation
- [ ] **Schema validation passes** for nested objects/arrays (depth ≥ 5)
- [ ] **Repair mode** recovers ≥95% of malformed outputs

---
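To make the constrained-decoding item above concrete, here is a minimal sketch of the brace/bracket state machine plus logit masking. All names here (`JsonStateMachine`, `mask_logits`, the `token_text` lookup) are illustrative assumptions, not the RuvLLM API; a real implementation would also constrain commas, colons, and schema-specific keys.

```rust
/// Illustrative sketch only: tracks JSON nesting and masks logits so the
/// sampler can never emit a structurally invalid closing token.

#[derive(Debug, Clone, Copy, PartialEq)]
enum JsonCtx {
    Object, // inside { }
    Array,  // inside [ ]
    String, // inside " "
}

struct JsonStateMachine {
    stack: Vec<JsonCtx>,
    escaped: bool, // previous char was a backslash inside a string
}

impl JsonStateMachine {
    fn new() -> Self {
        Self { stack: Vec::new(), escaped: false }
    }

    /// Feed one decoded character and update the structural state.
    fn advance(&mut self, c: char) {
        if self.stack.last() == Some(&JsonCtx::String) {
            if self.escaped {
                self.escaped = false; // escaped char consumed
            } else if c == '\\' {
                self.escaped = true;
            } else if c == '"' {
                self.stack.pop(); // closing quote ends the string
            }
            return;
        }
        match c {
            '{' => self.stack.push(JsonCtx::Object),
            '[' => self.stack.push(JsonCtx::Array),
            '"' => self.stack.push(JsonCtx::String),
            '}' | ']' => { self.stack.pop(); }
            _ => {}
        }
    }

    /// Would emitting `c` next keep the output structurally valid?
    fn allows(&self, c: char) -> bool {
        match c {
            '}' => self.stack.last() == Some(&JsonCtx::Object),
            ']' => self.stack.last() == Some(&JsonCtx::Array),
            _ => true, // a full implementation would be far stricter
        }
    }
}

/// Mask logits so only structurally valid tokens survive sampling.
/// `token_text` maps token id -> decoded text (an assumption for this sketch).
fn mask_logits(logits: &mut [f32], token_text: &[&str], sm: &JsonStateMachine) {
    for (id, text) in token_text.iter().enumerate() {
        if text.chars().any(|c| !sm.allows(c)) {
            logits[id] = f32::NEG_INFINITY; // token can never be sampled
        }
    }
}

fn main() {
    let mut sm = JsonStateMachine::new();
    for c in "{\"name\": [".chars() {
        sm.advance(c);
    }
    // After `{"name": [` the only legal closer is `]`, not `}`.
    assert!(sm.allows(']'));
    assert!(!sm.allows('}'));

    let vocab = ["}", "]", "\"x\""];
    let mut logits = [0.0_f32; 3];
    mask_logits(&mut logits, &vocab, &sm);
    println!("{:?}", logits); // [-inf, 0.0, 0.0]
}
```

Because the mask is applied before sampling, `strict: true` can guarantee validity rather than hoping for it; the rejection-sampling fallback listed above would only kick in for schema constraints (e.g. `pattern`) that are too expensive to encode in the mask.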
### 2. Function Calling / Tool Use (P0)

**Objective**: Enable LLMs to call external tools/functions with structured arguments.

#### Core Capabilities

- [ ] **Tool definition schema**
  - Function name, description
  - Parameters (JSON schema)
  - Return type (optional)
- [ ] **ToolChoice enum**
  - `auto`: Model decides whether to call tools
  - `none`: Never call tools (text-only)
  - `required`: Must call at least one tool
  - `specific(name)`: Force a specific tool
- [ ] **Parallel tool calls**
  - Generate multiple tool calls in one response
  - Dependency-aware ordering
- [ ] **Tool result handling** (see the loop sketch at the end of this section)
  - Inject tool results back into the conversation
  - Continue generation after tool execution
  - Multi-turn tool loops
- [ ] **Model-specific formats**
  - Llama 3.1 tool format (`<|python_tag|>`)
  - Mistral tool format (function tags)
  - Qwen tool format
  - Claude tool format
- [ ] **OpenAI API compatibility layer**
  - `tools` parameter
  - `tool_choice` parameter
  - `ChatCompletionToolCall` response format
- [ ] **LangChain integration tests**
  - Works with `AgentExecutor`
  - Compatible with `StructuredTool`
  - Multi-agent workflows

#### Example API

```rust
let tools = vec![
    Tool {
        name: "get_weather".into(),
        description: "Get current weather for a location".into(),
        parameters: json!({
            "type": "object",
            "properties": {
                "location": {"type": "string"},
                "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]}
            },
            "required": ["location"]
        }),
    },
    Tool {
        name: "search_web".into(),
        description: "Search the web".into(),
        parameters: json!({
            "type": "object",
            "properties": {
                "query": {"type": "string"}
            },
            "required": ["query"]
        }),
    },
];

let response = llm.chat(ChatRequest {
    messages: vec![
        Message::user("What's the weather in SF and latest AI news?")
    ],
    tools: Some(tools),
    tool_choice: ToolChoice::Auto,
    ..Default::default()
})?;

// response.tool_calls contains parallel calls:
// [get_weather(location="San Francisco"), search_web(query="AI news")]
```

#### Acceptance Criteria

- [ ] **OpenAI API format compatibility** (passes OpenAI SDK tests)
- [ ] **LangChain AgentExecutor** integration works end-to-end
- [ ] **Parallel tool calls** supported (≥3 concurrent)
- [ ] **Multi-turn tool conversations** (≥5 turns)
- [ ] **Tool call success rate** ≥95% for common tools

---
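As referenced in the *Tool result handling* item, here is a minimal sketch of the multi-turn tool loop. `ToolCall`, `Message`, the stubbed `chat`, and `execute` are placeholders standing in for the proposed API, not real RuvLLM types; a production router would parse model-specific formats and dispatch real tools.

```rust
/// Illustrative multi-turn tool loop: call the model, execute any requested
/// tools, inject the results, and repeat until no tools are requested.

#[derive(Debug, Clone)]
struct ToolCall {
    name: String,
    arguments: String, // JSON-encoded arguments
}

#[derive(Debug, Clone)]
enum Message {
    User(String),
    Assistant { text: String, tool_calls: Vec<ToolCall> },
    ToolResult { name: String, content: String },
}

/// Stand-in for the model: first turn requests a tool, second turn answers.
fn chat(messages: &[Message]) -> Message {
    let has_result = messages
        .iter()
        .any(|m| matches!(m, Message::ToolResult { .. }));
    if has_result {
        Message::Assistant {
            text: "It is 18°C in San Francisco.".into(),
            tool_calls: vec![],
        }
    } else {
        Message::Assistant {
            text: String::new(),
            tool_calls: vec![ToolCall {
                name: "get_weather".into(),
                arguments: r#"{"location": "San Francisco"}"#.into(),
            }],
        }
    }
}

/// Execute a tool call; a real router would dispatch on `call.name`.
fn execute(call: &ToolCall) -> String {
    println!("calling {}({})", call.name, call.arguments);
    match call.name.as_str() {
        "get_weather" => r#"{"temp_c": 18}"#.into(),
        _ => r#"{"error": "unknown tool"}"#.into(),
    }
}

fn main() {
    let mut messages = vec![Message::User("Weather in SF?".into())];
    // Tool loop: keep going until the assistant stops requesting tools.
    loop {
        let reply = chat(&messages);
        let done = match &reply {
            Message::Assistant { tool_calls, .. } => tool_calls.is_empty(),
            _ => true,
        };
        if done {
            println!("final: {:?}", reply);
            break;
        }
        if let Message::Assistant { tool_calls, .. } = &reply {
            // Inject each tool result back into the conversation.
            let results: Vec<Message> = tool_calls
                .iter()
                .map(|c| Message::ToolResult {
                    name: c.name.clone(),
                    content: execute(c),
                })
                .collect();
            messages.push(reply.clone());
            messages.extend(results);
        }
    }
}
```

The ≥5-turn acceptance criterion above is exactly this loop iterating: each iteration appends one assistant message plus one result message per tool call, so parallel calls fold naturally into a single turn.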
### 3. Prefix Caching (P0)

**Objective**: Cache and reuse the KV cache for repeated prompt prefixes (system prompts, RAG documents).

#### Core Capabilities

- [ ] **Hash-based prefix lookup** (see the lookup sketch following the Technical Design section)
  - SHA-256 hash of token IDs
  - Fast O(1) cache hit detection
- [ ] **Radix tree implementation**
  - Efficient storage for overlapping prefixes
  - Longest common prefix matching
  - Memory-efficient sharing
- [ ] **KV cache copy-on-write**
  - Share read-only cache entries
  - Copy only on divergence
  - Zero-copy for cache hits
- [ ] **LRU eviction policy**
  - Evict least recently used prefixes
  - Configurable cache size
  - Per-model cache isolation
- [ ] **Memory limits**
  - Hard limit on cache size (bytes)
  - Soft limit with warning
  - Graceful degradation
- [ ] **Cache hit/miss metrics**
  - Prometheus metrics
  - Hit rate tracking
  - Memory usage stats
- [ ] **Chat prefix caching**
  - System prompt caching
  - Conversation history caching
  - Automatic prefix detection
- [ ] **RAG document caching**
  - Document chunk prefixes
  - Query-independent context
  - Multi-query reuse

#### Example API

```rust
// First request - cache miss
let response1 = llm.generate(GenerateRequest {
    prompt: "System: You are a helpful assistant.\nUser: Hello".into(),
    cache_prefix: Some(CacheConfig {
        enable: true,
        key: Some("chat-system-prompt".into()),
        ttl_seconds: Some(3600),
    }),
    ..Default::default()
})?;
// Latency: 500ms

// Second request - cache hit (reuses the "System: You are..." KV cache)
let response2 = llm.generate(GenerateRequest {
    prompt: "System: You are a helpful assistant.\nUser: How are you?".into(),
    cache_prefix: Some(CacheConfig {
        enable: true,
        key: Some("chat-system-prompt".into()),
        ttl_seconds: Some(3600),
    }),
    ..Default::default()
})?;
// Latency: 50ms (10x faster!)
```

#### Performance Targets

- [ ] **10x speedup** for repeated system prompts (cache hit)
- [ ] **<5% overhead** on cache miss
- [ ] **Memory-bounded** (configurable, default 2GB)
- [ ] **Thread-safe** for concurrent requests
- [ ] **Hit rate ≥80%** for typical chat/RAG workloads

#### Acceptance Criteria

- [ ] **Speedup**: ≥10x for 1024-token prefix reuse
- [ ] **Memory**: bounded by config, no OOM
- [ ] **Correctness**: identical outputs for cached vs uncached
- [ ] **Concurrency**: no race conditions (stress tested)
- [ ] **Metrics**: Prometheus metrics exported

---

## Technical Design

### Architecture Overview

```
┌─────────────────────────────────────────────────────────┐
│                       RuvLLM v2.4                       │
├─────────────────────────────────────────────────────────┤
│  ┌─────────────────┐  ┌──────────────┐  ┌────────────┐  │
│  │ Structured      │  │ Function     │  │ Prefix     │  │
│  │ Output Engine   │  │ Calling      │  │ Cache      │  │
│  │                 │  │ Router       │  │ Manager    │  │
│  │ - JSON Schema   │  │ - Tool Defs  │  │ - Radix    │  │
│  │ - Logit Bias    │  │ - ToolChoice │  │   Tree     │  │
│  │ - State Machine │  │ - Multi-call │  │ - LRU      │  │
│  └────────┬────────┘  └──────┬───────┘  └─────┬──────┘  │
│           │                  │                │         │
│           └──────────────────┼────────────────┘         │
│                              │                          │
│                    ┌─────────▼──────────┐               │
│                    │  mistral-rs Core   │               │
│                    │  - Model Loading   │               │
│                    │  - Token Sampling  │               │
│                    │  - KV Cache        │               │
│                    └────────────────────┘               │
└─────────────────────────────────────────────────────────┘
```

### Reference ADRs

- **ADR-009**: Structured Output Implementation
  - Constrained decoding algorithm
  - JSON schema validation approach
  - Performance optimization strategies
- **ADR-010**: Function Calling Architecture
  - Tool definition format
  - Multi-model compatibility layer
  - Parallel execution model
- **ADR-011**: Prefix Caching Design
  - Radix tree structure
  - Eviction policies
  - Memory management
- **ADR-008**: mistral-rs Integration
  - Dependency structure
  - API surface
  - Migration path

---
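Returning to the hash-based prefix lookup flagged in the Prefix Caching capabilities, here is a minimal sketch under assumed types: `KvBlock` stands in for a real KV-cache handle, and `std::hash` substitutes for the proposed SHA-256. It hashes every prefix length to find the longest match, which is O(n²) in prompt length; the radix tree from ADR-011 exists precisely to turn this into a single O(n) walk.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

/// Illustrative prefix cache keyed by hashes of token-id prefixes.
#[derive(Clone)]
struct KvBlock {
    prefix_len: usize, // number of tokens this cache entry covers
}

struct PrefixCache {
    entries: HashMap<u64, KvBlock>,
}

fn hash_tokens(tokens: &[u32]) -> u64 {
    let mut h = DefaultHasher::new();
    tokens.hash(&mut h);
    h.finish()
}

impl PrefixCache {
    fn new() -> Self {
        Self { entries: HashMap::new() }
    }

    /// Store the KV state for a full token prefix.
    fn insert(&mut self, tokens: &[u32]) {
        self.entries
            .insert(hash_tokens(tokens), KvBlock { prefix_len: tokens.len() });
    }

    /// Longest cached prefix of `tokens`, scanning lengths from long to short.
    fn longest_match(&self, tokens: &[u32]) -> Option<&KvBlock> {
        (1..=tokens.len())
            .rev()
            .find_map(|len| self.entries.get(&hash_tokens(&tokens[..len])))
    }
}

fn main() {
    let mut cache = PrefixCache::new();
    let system_prompt = [101_u32, 7592, 2088]; // tokenized system prompt
    cache.insert(&system_prompt);

    // A new request sharing the system prompt hits the cached prefix,
    // so only the 2 trailing tokens need fresh prefill.
    let request = [101_u32, 7592, 2088, 555, 777];
    let hit = cache.longest_match(&request).expect("cache hit");
    println!(
        "reuse {} tokens, prefill {}",
        hit.prefix_len,
        request.len() - hit.prefix_len
    );
}
```

The copy-on-write capability above would hang off `KvBlock`: cache hits hand out shared read-only references, and a divergent continuation copies only the blocks past the matched prefix.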
## Implementation Plan

### Phase 1: Foundation (Weeks 1-2)

**Focus**: Structured Output basics + Function Calling definitions

- [ ] Week 1: JSON schema parser and validator
  - Implement schema types (object, array, string, number, boolean, null)
  - Unit tests for all types
  - Property-based tests
- [ ] Week 2: Constrained decoding MVP
  - Logit bias implementation
  - Simple state machine (braces, brackets)
  - Integration with the mistral-rs sampler
  - Basic function calling types (Tool, ToolChoice enums)

**Deliverable**: JSON mode works for simple schemas; tool definitions parsed

---

### Phase 2: Core Logic (Weeks 3-4)

**Focus**: Constrained decoding + Tool generation

- [ ] Week 3: Advanced constrained decoding
  - Nested schema support
  - String pattern matching
  - Enum constraints
  - JSON repair mode
- [ ] Week 4: Tool call generation
  - Llama 3.1 format support
  - Mistral format support
  - Parallel tool calls
  - OpenAI API compatibility layer

**Deliverable**: Complex JSON schemas work; tool calls generated in OpenAI format

---

### Phase 3: Caching + Polish (Weeks 5-6)

**Focus**: Prefix Caching + Integration tests

- [ ] Week 5: Prefix caching implementation
  - Radix tree structure
  - Hash-based lookup
  - LRU eviction
  - Thread safety (RwLock)
- [ ] Week 6: Integration + benchmarks
  - LangChain integration tests
  - RAG workflow tests
  - Performance benchmarks
  - Documentation
  - Example applications

**Deliverable**: All 3 features production-ready, benchmarked, documented

---

## Testing Strategy

### Unit Tests

- JSON schema validation (all types, nested, constraints)
- Logit bias correctness
- Tool definition parsing
- Prefix cache hit/miss logic
- Radix tree operations

### Integration Tests

- LangChain AgentExecutor with tools
- LlamaIndex ReAct agent
- CrewAI multi-agent workflows
- OpenAI SDK compatibility tests

### Benchmarks

- Structured output latency vs unconstrained
- Tool calling accuracy (% correct tool selections)
- Prefix cache speedup (1x, 10x, 100x reuse)
- Memory usage under load

### Stress Tests

- 1000 concurrent requests with caching (see the concurrency sketch after this section)
- Deeply nested JSON schemas (depth 20)
- Large tool libraries (100+ tools)
- Multi-turn tool conversations (50+ turns)

---
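As a starting point for the concurrent-caching stress test above, here is a small self-contained smoke test of a shared map behind `Arc<RwLock<...>>`, the synchronization primitive proposed for Week 5. The cache type and workload are placeholders; the real test would run against the actual prefix cache and assert output equivalence between cached and uncached runs.

```rust
use std::collections::HashMap;
use std::sync::{Arc, RwLock};
use std::thread;

/// Illustrative concurrency smoke test: hammer a shared cache from many
/// threads and check that the process neither panics nor deadlocks.
fn main() {
    let cache: Arc<RwLock<HashMap<u64, usize>>> =
        Arc::new(RwLock::new(HashMap::new()));

    let mut handles = Vec::new();
    for t in 0..8_u64 {
        let cache = Arc::clone(&cache);
        handles.push(thread::spawn(move || {
            for i in 0..1_000_u64 {
                let key = (t * 1_000 + i) % 64; // force key collisions
                // Writers take the exclusive lock briefly...
                cache.write().unwrap().insert(key, i as usize);
                // ...while readers can then share the lock concurrently.
                let _ = cache.read().unwrap().get(&key).copied();
            }
        }));
    }
    for h in handles {
        h.join().unwrap();
    }
    // Surviving contention is the smoke test; correctness of cached vs
    // uncached outputs would be asserted separately in the real suite.
    println!("entries: {}", cache.read().unwrap().len());
}
```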
## Success Metrics

### Structured Output

- [ ] **Validity**: 100% valid JSON when `strict: true`
- [ ] **Overhead**: <10% latency vs unconstrained
- [ ] **Schema compliance**: 100% for schemas of depth ≤10
- [ ] **Repair rate**: ≥95% successful repairs

### Function Calling

- [ ] **Compatibility**: passes the OpenAI SDK test suite
- [ ] **LangChain**: works with AgentExecutor (5+ examples)
- [ ] **Accuracy**: ≥95% correct tool selection (benchmark dataset)
- [ ] **Parallel calls**: supports ≥5 concurrent tools

### Prefix Caching

- [ ] **Speedup**: 10x for a 1024-token prefix, 100x for a 4096-token prefix
- [ ] **Hit rate**: ≥80% for chat workloads
- [ ] **Memory**: bounded, no OOM under stress
- [ ] **Correctness**: 100% identical outputs (cached vs uncached)

---

## Dependencies

### Upstream

- **mistral-rs v0.4.x** (ADR-008)
  - KV cache access for prefix caching
  - Token sampling hooks for logit bias
  - Model loading infrastructure

### Downstream

- **Enables**: agentic workflow support
- **Enables**: LangChain/LlamaIndex/CrewAI integration
- **Blocks**: v2.4 release
- **Blocks**: production adoption by agent frameworks

---

## Related Issues

- Depends on: #XXX (mistral-rs integration, ADR-008)
- Enables: #XXX (agentic workflow support)
- Enables: #XXX (LangChain integration)
- Blocks: #XXX (v2.4 release milestone)

---

## Documentation Requirements

- [ ] API reference docs (rustdoc)
- [ ] User guides for each feature
  - "How to use JSON mode"
  - "How to define tools"
  - "How to enable prefix caching"
- [ ] Migration guide from v2.3
- [ ] Example applications
  - Structured extraction (NER, info extraction)
  - Multi-tool agent (ReAct loop)
  - RAG with caching (chatbot)
- [ ] Performance tuning guide

---

## Open Questions

1. **JSON Schema**: full Draft 7 or a subset? (Propose: core subset + extensions)
2. **Tool formats**: support all models or Llama 3.1+ only? (Propose: Llama 3.1+ with adapters)
3. **Cache eviction**: LRU vs LFU vs TTL-based? (Propose: LRU + TTL)
4. **Memory limit**: default cache size? (Propose: 2GB default, configurable)
5. **Breaking changes**: any API changes needed? (Propose: additive only, no breaks)

---

## Future Enhancements (Post-v2.4)

- **Structured Output**:
  - GBNF grammar support (custom DSLs)
  - Regex-constrained strings
  - Speculative decoding for constrained generation
- **Function Calling**:
  - Async/streaming tool execution
  - Tool result validation
  - Tool dependency graphs
- **Prefix Caching**:
  - Cross-request caching (shared cache pool)
  - Disk-backed cache (persists across restarts)
  - Distributed caching (Redis/memcached)

---

## Timeline Summary

| Phase | Duration | Focus | Deliverable |
|-------|----------|-------|-------------|
| 1 | Weeks 1-2 | Structured Output + Tool Definitions | JSON mode MVP, tool parsing |
| 2 | Weeks 3-4 | Constrained Decoding + Tool Generation | Complex schemas, tool calls |
| 3 | Weeks 5-6 | Prefix Caching + Integration | Production-ready, benchmarked |

**Total**: 6 weeks to a production-ready v2.4

---

## Getting Involved

### For Contributors

- Pick a task from the checkboxes above
- Comment on this issue to claim a feature
- Follow the implementation plan phases
- Submit PRs with tests + benchmarks

### For Reviewers

- Focus on correctness (JSON validity, cache correctness)
- Performance regression checks (<10% overhead target)
- API design feedback (before Week 3)

### For Testers

- Test with real-world agent workflows
- Report edge cases and failure modes
- Benchmark on your hardware/models

---

**Let's close the gap with vLLM/llama.cpp and make RuvLLM the best choice for production agentic workflows!** 🚀