
feat(ruvllm): Implement SOTA features for production agentic workflows

Labels: enhancement, p0-critical, agentic, v2.4, mistral-rs, performance


Summary

RuvLLM v2.4 SOTA Feature Implementation - Adding the 3 critical capabilities needed for production agentic workflows: Structured Output, Function Calling, and Prefix Caching.

These features are essential for modern LLM applications and are currently blocking production adoption for major agent frameworks.


Motivation

Why This Matters

Current State:

  • RuvLLM cannot reliably generate structured outputs (JSON schema enforcement)
  • No native function calling support for tool-using agents
  • Repeated prompts/prefixes incur full generation costs (no caching)
  • Agent frameworks (LangChain, LlamaIndex, CrewAI) cannot integrate

Impact:

  • Blocking production adoption for agentic workflows
  • Cost inefficiency: RAG/chat applications run 10-100x slower than on competitors because repeated prefixes are recomputed on every request
  • Reliability gap: JSON parsing failures break agent loops
  • Missing compatibility: Cannot replace vLLM, llama.cpp, SGLang in existing stacks

Competitive Gap:

Feature             vLLM   llama.cpp   SGLang   RuvLLM
Structured Output   ✅      ✅           ✅        ❌
Function Calling    ✅      ✅           ✅        ❌
Prefix Caching      ✅      ✅           ✅        ❌

Features

1. Structured Output / JSON Mode (P0)

Objective: Guarantee valid JSON output conforming to user-provided schemas.

Core Capabilities

  • JSON schema validation (JSONSchema Draft 7 support)

    • Primitive types: string, number, boolean, null
    • Complex types: object, array
    • Nested schemas with properties, items, required
    • Constraints: minLength, maxLength, pattern, enum, minimum, maximum
  • Constrained decoding with logit bias

    • State machine for tracking JSON structure (open braces, quotes, commas)
    • Token masking to enforce valid next tokens
    • Rejection sampling fallback for complex schemas
  • Bracket/brace state machine (a sketch follows this list)

    • Track depth of {} and []
    • Enforce closing brackets
    • Handle escaped quotes in strings
  • JSON repair for malformed output

    • Auto-close unclosed braces/brackets
    • Fix trailing commas
    • Escape unescaped quotes
    • Best-effort recovery mode
  • GBNF grammar support (future)

    • llama.cpp-compatible grammar format
    • Custom domain-specific languages
  • Comprehensive tests

    • Unit tests for all JSON types
    • Property-based testing (proptest/QuickCheck-style)
    • Adversarial inputs (deeply nested, large arrays)
  • Benchmarks vs unconstrained

    • Measure latency overhead (<10% target)
    • Throughput impact
    • Memory usage
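
For concreteness, here is a minimal sketch of the bracket/brace state machine and the repair pass it enables. Every name here (JsonTracker, token_is_allowed, closing_suffix) is hypothetical, not the shipped API, and a real implementation would also consult the schema (types, required keys) when masking tokens:

#[derive(Default, Clone)]
struct JsonTracker {
    stack: Vec<char>, // open '{' / '[' delimiters, innermost last
    in_string: bool,  // currently inside a "..." literal
    escaped: bool,    // previous char was a backslash inside a string
}

impl JsonTracker {
    /// Feed one decoded character and update the structural state.
    fn push(&mut self, c: char) {
        if self.in_string {
            match (self.escaped, c) {
                (true, _) => self.escaped = false,      // escaped char, consume
                (false, '\\') => self.escaped = true,   // start of an escape
                (false, '"') => self.in_string = false, // closing quote
                _ => {}
            }
            return;
        }
        match c {
            '"' => self.in_string = true,
            '{' | '[' => self.stack.push(c),
            '}' if self.stack.last() == Some(&'{') => { self.stack.pop(); }
            ']' if self.stack.last() == Some(&'[') => { self.stack.pop(); }
            _ => {}
        }
    }

    /// True if a candidate token's text keeps the document structurally valid.
    fn token_is_allowed(&self, text: &str) -> bool {
        let mut probe = self.clone();
        for c in text.chars() {
            if !probe.in_string {
                // Reject tokens that close a bracket that is not open.
                if c == '}' && probe.stack.last() != Some(&'{') { return false; }
                if c == ']' && probe.stack.last() != Some(&'[') { return false; }
            }
            probe.push(c);
        }
        true
    }

    /// Best-effort repair: the suffix that closes everything still open.
    fn closing_suffix(&self) -> String {
        let mut out = String::new();
        if self.in_string { out.push('"'); } // close an unterminated string
        for open in self.stack.iter().rev() {
            out.push(if *open == '{' { '}' } else { ']' });
        }
        out
    }
}

In the decoder, token_is_allowed becomes a mask over candidate tokens before sampling; closing_suffix is the cheap half of repair mode (auto-closing), with trailing-comma and quote fixes layered on top.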

Example API

let schema = json!({
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "age": {"type": "number", "minimum": 0},
        "tags": {"type": "array", "items": {"type": "string"}}
    },
    "required": ["name"]
});

let response = llm.generate(GenerateRequest {
    prompt: "Extract person info: John is 30",
    json_schema: Some(schema),
    strict: true, // Guarantee valid JSON
    ..Default::default()
})?;

// response.text is guaranteed valid JSON matching schema
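
Because strict: true guarantees schema-conforming JSON, callers can deserialize response.text directly into a typed struct. A usage sketch assuming serde/serde_json as the JSON layer (the Person type mirrors the schema above):

use serde::Deserialize;

#[derive(Deserialize)]
struct Person {
    name: String,
    age: Option<f64>,  // schema "number"; not in "required", so Option
    #[serde(default)]
    tags: Vec<String>, // not required either; default to empty
}

let person: Person = serde_json::from_str(&response.text)?;
assert_eq!(person.name, "John");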

Acceptance Criteria

  • 100% valid JSON when strict: true is enabled
  • <10% latency overhead vs unconstrained generation
  • Schema validation passes for nested objects/arrays (depth ≥ 5)
  • Repair mode recovers ≥95% of malformed outputs

2. Function Calling / Tool Use (P0)

Objective: Enable LLMs to call external tools/functions with structured arguments.

Core Capabilities

  • Tool definition schema

    • Function name, description
    • Parameters (JSON schema)
    • Return type (optional)
  • ToolChoice enum

    • auto: Model decides whether to call tools
    • none: Never call tools (text-only)
    • required: Must call at least one tool
    • specific(name): Force specific tool
  • Parallel tool calls

    • Generate multiple tool calls in one response
    • Dependency-aware ordering
  • Tool result handling

    • Inject tool results back into conversation
    • Continue generation after tool execution
    • Multi-turn tool loops
  • Model-specific formats (adapter sketch follows this list)

    • Llama 3.1 tool format (<|python_tag|>)
    • Mistral tool format (function tags)
    • Qwen tool format
    • Claude tool format
  • OpenAI API compatibility layer

    • tools parameter
    • tool_choice parameter
    • ChatCompletionToolCall response format
  • LangChain integration tests

    • Works with AgentExecutor
    • Compatible with StructuredTool
    • Multi-agent workflows
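
To keep the router model-agnostic, the per-model formats can sit behind a single adapter trait. A sketch with hypothetical names (ToolFormat, ToolCall, and Llama31Format are illustrative, not the final API; Tool is the definition type shown in the example below):

/// Who picks the tool: mirrors the ToolChoice semantics above.
pub enum ToolChoice {
    Auto,             // model decides whether to call tools
    None,             // text-only, tools are ignored
    Required,         // must emit at least one call
    Specific(String), // force the named tool
}

/// A parsed call, independent of any model's wire format.
pub struct ToolCall {
    pub name: String,
    pub arguments: serde_json::Value,
}

/// One adapter per model family (Llama 3.1, Mistral, Qwen, ...).
pub trait ToolFormat {
    /// Render tool definitions into the model's prompt dialect.
    fn render_tools(&self, tools: &[Tool]) -> String;
    /// Parse raw model output back into structured calls.
    fn parse_calls(&self, output: &str) -> Vec<ToolCall>;
}

struct Llama31Format;

impl ToolFormat for Llama31Format {
    fn render_tools(&self, _tools: &[Tool]) -> String {
        // Llama 3.1 takes JSON tool schemas in the system prompt.
        todo!("serialize tools into the Llama 3.1 system-prompt block")
    }
    fn parse_calls(&self, _output: &str) -> Vec<ToolCall> {
        // Llama 3.1 emits calls behind its <|python_tag|> special token.
        todo!("split output on <|python_tag|> and parse the JSON payload")
    }
}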

Example API

let tools = vec![
    Tool {
        name: "get_weather".into(),
        description: "Get current weather for a location".into(),
        parameters: json!({
            "type": "object",
            "properties": {
                "location": {"type": "string"},
                "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]}
            },
            "required": ["location"]
        }),
    },
    Tool {
        name: "search_web".into(),
        description: "Search the web".into(),
        parameters: json!({
            "type": "object",
            "properties": {
                "query": {"type": "string"}
            },
            "required": ["query"]
        }),
    }
];

let response = llm.chat(ChatRequest {
    messages: vec![
        Message::user("What's the weather in SF and latest AI news?")
    ],
    tools: Some(tools),
    tool_choice: ToolChoice::Auto,
    ..Default::default()
})?;

// response.tool_calls contains parallel calls:
// [get_weather(location="San Francisco"), search_web(query="AI news")]
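
The tool-result handling described above composes into the standard agent loop: execute each returned call, feed the results back, and continue until the model answers in plain text. A sketch, where Message::tool, Message::assistant_tool_calls, and execute_tool are hypothetical helpers, not current API:

let mut messages = vec![Message::user("What's the weather in SF and latest AI news?")];

loop {
    let response = llm.chat(ChatRequest {
        messages: messages.clone(),
        tools: Some(tools.clone()),
        tool_choice: ToolChoice::Auto,
        ..Default::default()
    })?;

    if response.tool_calls.is_empty() {
        println!("{}", response.text); // plain-text answer: the loop is done
        break;
    }

    // Record the model's calls, execute them, and feed results back.
    messages.push(Message::assistant_tool_calls(&response.tool_calls));
    for call in &response.tool_calls {
        let result = execute_tool(&call.name, &call.arguments)?; // app-provided
        messages.push(Message::tool(&call.name, result));
    }
}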

Acceptance Criteria

  • OpenAI API format compatibility (passes OpenAI SDK tests)
  • LangChain AgentExecutor integration works end-to-end
  • Parallel tool calls supported (≥3 concurrent)
  • Multi-turn tool conversations (≥5 turns)
  • Tool call success rate ≥95% for common tools

3. Prefix Caching (P0)

Objective: Cache and reuse KV cache for repeated prompt prefixes (system prompts, RAG documents).

Core Capabilities

  • Hash-based prefix lookup

    • SHA-256 hash of token IDs
    • Fast O(1) cache hit detection
  • Radix tree implementation (lookup sketch follows this list)

    • Efficient storage for overlapping prefixes
    • Longest common prefix matching
    • Memory-efficient sharing
  • KV cache copy-on-write

    • Share read-only cache entries
    • Copy only on divergence
    • Zero-copy for cache hits
  • LRU eviction policy

    • Evict least recently used prefixes
    • Configurable cache size
    • Per-model cache isolation
  • Memory limits

    • Hard limit on cache size (bytes)
    • Soft limit with warning
    • Graceful degradation
  • Cache hit/miss metrics

    • Prometheus metrics
    • Hit rate tracking
    • Memory usage stats
  • Chat prefix caching

    • System prompt caching
    • Conversation history caching
    • Automatic prefix detection
  • RAG document caching

    • Document chunk prefixes
    • Query-independent context
    • Multi-query reuse
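
A condensed sketch of the lookup path, combining the SHA-256 exact-match key with radix-tree longest-prefix matching. Names are hypothetical (RadixNode, KvBlock, prefix_key), and the sha2 crate is assumed:

use std::collections::HashMap;
use std::sync::Arc;
use sha2::{Digest, Sha256};

/// Placeholder for a model's cached KV span.
struct KvBlock;

#[derive(Default)]
struct RadixNode {
    // Each edge is a run of token IDs, keyed by its first token.
    children: HashMap<u32, (Vec<u32>, RadixNode)>,
    kv: Option<Arc<KvBlock>>, // shared read-only; copied on divergence
    last_used: u64,           // logical clock for LRU eviction
}

impl RadixNode {
    /// Length of the longest prefix of `tokens` already cached.
    fn longest_prefix(&mut self, tokens: &[u32], now: u64) -> usize {
        self.last_used = now; // touch for LRU
        let Some(first) = tokens.first() else { return 0 };
        let Some((edge, child)) = self.children.get_mut(first) else { return 0 };
        let common = edge.iter().zip(tokens).take_while(|(a, b)| a == b).count();
        if common < edge.len() {
            common // diverged mid-edge: partial reuse up to the split point
        } else {
            common + child.longest_prefix(&tokens[common..], now)
        }
    }
}

/// Exact-match key for whole-prefix lookups (O(1) hit detection).
fn prefix_key(tokens: &[u32]) -> [u8; 32] {
    let mut h = Sha256::new();
    for t in tokens { h.update(t.to_le_bytes()); }
    h.finalize().into()
}

The hash answers the common "same system prompt again" case in one lookup; the tree handles partially overlapping prefixes, and eviction walks nodes by last_used. Each node's KV block stays shared (Arc) until a request diverges, which is the copy-on-write point.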

Example API

// First request - cache miss
let response1 = llm.generate(GenerateRequest {
    prompt: "System: You are a helpful assistant.\nUser: Hello",
    cache_prefix: Some(CacheConfig {
        enable: true,
        key: Some("chat-system-prompt".into()),
        ttl_seconds: Some(3600),
    }),
    ..Default::default()
})?;
// Latency: 500ms

// Second request - cache hit (reuses "System: You are..." KV cache)
let response2 = llm.generate(GenerateRequest {
    prompt: "System: You are a helpful assistant.\nUser: How are you?",
    cache_prefix: Some(CacheConfig {
        enable: true,
        key: Some("chat-system-prompt".into()),
        ttl_seconds: Some(3600),
    }),
    ..Default::default()
})?;
// Latency: 50ms (10x faster!)
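
The "Automatic prefix detection" capability suggests a key-less variant: when key is None, the cache manager can hash the tokenized prompt and reuse the longest cached prefix on its own. A hedged sketch of that call shape (the key: None semantics are a proposal, not settled API):

// With key: None, the cache manager tokenizes the prompt, walks the
// radix tree, and reuses the longest cached prefix automatically.
let response = llm.generate(GenerateRequest {
    prompt: "System: You are a helpful assistant.\nUser: Anything new today?",
    cache_prefix: Some(CacheConfig {
        enable: true,
        key: None, // automatic prefix detection
        ttl_seconds: Some(3600),
    }),
    ..Default::default()
})?;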

Performance Targets

  • 10x speedup for repeated system prompts (cache hit)
  • <5% overhead for cache miss
  • Memory-bounded (configurable, default 2GB)
  • Thread-safe for concurrent requests
  • Hit rate ≥80% for typical chat/RAG workloads

Acceptance Criteria

  • Speedup: ≥10x for 1024-token prefix reuse
  • Memory: Bounded by config, no OOM
  • Correctness: Identical outputs for cached vs uncached
  • Concurrency: No race conditions (stress tested)
  • Metrics: Prometheus metrics exported

Technical Design

Architecture Overview

┌─────────────────────────────────────────────────────────┐
│                     RuvLLM v2.4                         │
├─────────────────────────────────────────────────────────┤
│  ┌─────────────────┐  ┌──────────────┐  ┌────────────┐ │
│  │ Structured      │  │ Function     │  │ Prefix     │ │
│  │ Output Engine   │  │ Calling      │  │ Cache      │ │
│  │                 │  │ Router       │  │ Manager    │ │
│  │ - JSON Schema   │  │ - Tool Defs  │  │ - Radix    │ │
│  │ - Logit Bias    │  │ - ToolChoice │  │   Tree     │ │
│  │ - State Machine │  │ - Multi-call │  │ - LRU      │ │
│  └────────┬────────┘  └──────┬───────┘  └─────┬──────┘ │
│           │                  │                 │        │
│           └──────────────────┼─────────────────┘        │
│                              │                          │
│                    ┌─────────▼──────────┐               │
│                    │  mistral-rs Core   │               │
│                    │  - Model Loading   │               │
│                    │  - Token Sampling  │               │
│                    │  - KV Cache        │               │
│                    └────────────────────┘               │
└─────────────────────────────────────────────────────────┘
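
One plausible wiring into the mistral-rs core is a chain of sampler hooks: structured output and the function-calling formats both reduce to logit masks at decode time, while the cache manager intercepts prefill. The trait below is illustrative only; mistral-rs's actual hook surface may differ:

/// Hook applied to each decode step's logits, in registration order.
trait LogitsHook {
    fn apply(&mut self, generated: &[u32], logits: &mut [f32]);
}

struct JsonMask {
    tracker: JsonTracker, // the state machine from the structured-output sketch
}

impl LogitsHook for JsonMask {
    fn apply(&mut self, _generated: &[u32], logits: &mut [f32]) {
        for (token_id, logit) in logits.iter_mut().enumerate() {
            // detokenize() is hypothetical; a real engine precomputes
            // per-state token masks instead of decoding every candidate.
            if !self.tracker.token_is_allowed(&detokenize(token_id)) {
                *logit = f32::NEG_INFINITY; // mask structurally invalid tokens
            }
        }
    }
}

// The engine runs the chain before sampling each token:
// for hook in &mut hooks { hook.apply(&generated_so_far, &mut logits); }

A production version would precompute per-state token masks rather than detokenizing every candidate, which is where the <10% overhead target is won or lost.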

Reference ADRs

  • ADR-009: Structured Output Implementation

    • Constrained decoding algorithm
    • JSON schema validation approach
    • Performance optimization strategies
  • ADR-010: Function Calling Architecture

    • Tool definition format
    • Multi-model compatibility layer
    • Parallel execution model
  • ADR-011: Prefix Caching Design

    • Radix tree structure
    • Eviction policies
    • Memory management
  • ADR-008: mistral-rs Integration

    • Dependency structure
    • API surface
    • Migration path

Implementation Plan

Phase 1: Foundation (Weeks 1-2)

Focus: Structured Output basics + Function Calling definitions

  • Week 1: JSON schema parser and validator

    • Implement schema types (object, array, string, number, boolean, null)
    • Unit tests for all types
    • Property-based tests
  • Week 2: Constrained decoding MVP

    • Logit bias implementation
    • Simple state machine (braces, brackets)
    • Integration with mistral-rs sampler
    • Basic function calling types (Tool, ToolChoice enums)

Deliverable: JSON mode works for simple schemas, tool definitions parsed


Phase 2: Core Logic (Weeks 3-4)

Focus: Constrained decoding + Tool generation

  • Week 3: Advanced constrained decoding

    • Nested schema support
    • String pattern matching
    • Enum constraints
    • JSON repair mode
  • Week 4: Tool call generation

    • Llama 3.1 format support
    • Mistral format support
    • Parallel tool calls
    • OpenAI API compatibility layer

Deliverable: Complex JSON schemas work, tool calls generated in OpenAI format


Phase 3: Caching + Polish (Weeks 5-6)

Focus: Prefix Caching + Integration tests

  • Week 5: Prefix caching implementation

    • Radix tree structure
    • Hash-based lookup
    • LRU eviction
    • Thread-safety (RwLock)
  • Week 6: Integration + benchmarks

    • LangChain integration tests
    • RAG workflow tests
    • Performance benchmarks
    • Documentation
    • Example applications

Deliverable: All 3 features production-ready, benchmarked, documented


Testing Strategy

Unit Tests

  • JSON schema validation (all types, nested, constraints)
  • Logit bias correctness
  • Tool definition parsing
  • Prefix cache hit/miss logic
  • Radix tree operations
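
The property-based tests can pin down the repair guarantee directly: whenever repair succeeds, its output must parse. A sketch with the proptest crate, where repair_json is the hypothetical Option-returning repair entry point:

use proptest::prelude::*;

proptest! {
    #[test]
    fn repaired_output_always_parses(input in "\\PC{0,256}") {
        // Best-effort repair may decline (None), but a returned
        // string must be valid JSON.
        if let Some(repaired) = repair_json(&input) {
            prop_assert!(serde_json::from_str::<serde_json::Value>(&repaired).is_ok());
        }
    }
}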

Integration Tests

  • LangChain AgentExecutor with tools
  • LlamaIndex ReAct agent
  • CrewAI multi-agent workflows
  • OpenAI SDK compatibility tests

Benchmarks

  • Structured output latency vs unconstrained
  • Tool calling accuracy (% correct tool selections)
  • Prefix cache speedup (1x, 10x, 100x reuse)
  • Memory usage under load

Stress Tests

  • 1000 concurrent requests with caching
  • Deeply nested JSON schemas (depth 20)
  • Large tool libraries (100+ tools)
  • Multi-turn tool conversations (50+ turns)

Success Metrics

Structured Output

  • Validity: 100% valid JSON when strict: true
  • Overhead: <10% latency vs unconstrained
  • Schema compliance: 100% for depth ≤10 schemas
  • Repair rate: ≥95% successful repairs

Function Calling

  • Compatibility: Passes OpenAI SDK test suite
  • LangChain: Works with AgentExecutor (5+ examples)
  • Accuracy: ≥95% correct tool selection (benchmark dataset)
  • Parallel calls: Supports ≥5 concurrent tools

Prefix Caching

  • Speedup: 10x for 1024-token prefix, 100x for 4096-token
  • Hit rate: ≥80% for chat workloads
  • Memory: Bounded, no OOM under stress
  • Correctness: 100% identical outputs (cached vs uncached)

Dependencies

Upstream

  • mistral-rs v0.4.x (ADR-008; integration tracked in #XXX)
    • KV cache access for prefix caching
    • Token sampling hooks for logit bias
    • Model loading infrastructure

Downstream

  • Enables: #XXX (agentic workflow support)
  • Enables: #XXX (LangChain/LlamaIndex/CrewAI integration)
  • Blocks: #XXX (v2.4 release milestone)
  • Blocks: Production adoption by agent frameworks

Documentation Requirements

  • API reference docs (rustdoc)
  • User guides for each feature
    • "How to use JSON mode"
    • "How to define tools"
    • "How to enable prefix caching"
  • Migration guide from v2.3
  • Example applications
    • Structured extraction (NER, info extraction)
    • Multi-tool agent (ReAct loop)
    • RAG with caching (chatbot)
  • Performance tuning guide

Open Questions

  1. JSON Schema: Full Draft 7 or subset? (Propose: Core subset + extensions)
  2. Tool formats: Support all models or Llama 3.1+ only? (Propose: Llama 3.1+ with adapters)
  3. Cache eviction: LRU vs LFU vs TTL-based? (Propose: LRU + TTL)
  4. Memory limit: Default cache size? (Propose: 2GB default, configurable)
  5. Breaking changes: Any API changes needed? (Propose: Additive only, no breaks)

Future Enhancements (Post-v2.4)

  • Structured Output:

    • GBNF grammar support (custom DSLs)
    • Regex-constrained strings
    • Speculative decoding for constrained generation
  • Function Calling:

    • Async/streaming tool execution
    • Tool result validation
    • Tool dependency graphs
  • Prefix Caching:

    • Cross-request caching (shared cache pool)
    • Disk-backed cache (persist across restarts)
    • Distributed caching (Redis/memcached)

Timeline Summary

Phase  Duration   Focus                                    Deliverable
1      Weeks 1-2  Structured Output + Tool Definitions     JSON mode MVP, tool parsing
2      Weeks 3-4  Constrained Decoding + Tool Generation   Complex schemas, tool calls
3      Weeks 5-6  Prefix Caching + Integration             Production-ready, benchmarked

Total: 6 weeks to production-ready v2.4


Getting Involved

For Contributors

  • Pick a task from the feature checklists above
  • Comment on this issue to claim a feature
  • Follow the implementation plan phases
  • Submit PRs with tests + benchmarks

For Reviewers

  • Focus on correctness (JSON validity, cache correctness)
  • Performance regression checks (<10% overhead target)
  • API design feedback (before Week 3)

For Testers

  • Test with real-world agent workflows
  • Report edge cases and failure modes
  • Benchmark on your hardware/models

Let's close the gap with vLLM/llama.cpp and make RuvLLM the best choice for production agentic workflows! 🚀