feat(ruvllm): Implement SOTA features for production agentic workflows
Labels: enhancement, p0-critical, agentic, v2.4, mistral-rs, performance
Summary
RuvLLM v2.4 SOTA Feature Implementation - Adding the 3 critical capabilities needed for production agentic workflows: Structured Output, Function Calling, and Prefix Caching.
These features are essential for modern LLM applications and are currently blocking production adoption for major agent frameworks.
Motivation
Why This Matters
Current State:
- RuvLLM cannot reliably generate structured outputs (JSON schema enforcement)
- No native function calling support for tool-using agents
- Repeated prompts/prefixes incur full generation costs (no caching)
- Agent frameworks (LangChain, LlamaIndex, CrewAI) cannot integrate
Impact:
- Blocking production adoption for agentic workflows
- Cost inefficiency: 10-100x slower for RAG/chat applications vs competitors
- Reliability gap: JSON parsing failures break agent loops
- Missing compatibility: Cannot replace vLLM, llama.cpp, SGLang in existing stacks
Competitive Gap:
| Feature | vLLM | llama.cpp | SGLang | RuvLLM |
|---|---|---|---|---|
| Structured Output | ✅ | ✅ | ✅ | ❌ |
| Function Calling | ✅ | ✅ | ✅ | ❌ |
| Prefix Caching | ✅ | ✅ | ✅ | ❌ |
Features
1. Structured Output / JSON Mode (P0)
Objective: Guarantee valid JSON output conforming to user-provided schemas.
Core Capabilities
- [ ] JSON schema validation (JSON Schema Draft 7 support)
  - Primitive types: `string`, `number`, `boolean`, `null`
  - Complex types: `object`, `array`
  - Nested schemas with `properties`, `items`, `required`
  - Constraints: `minLength`, `maxLength`, `pattern`, `enum`, `minimum`, `maximum`
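To make the constraint keywords above concrete, here is a minimal, dependency-free sketch of how `minLength`/`maxLength`/`enum` and `minimum`/`maximum` checks could look. The `StringConstraints`, `check_string`, and `check_number` names are illustrative, not the RuvLLM API.

```rust
// Hand-rolled sketch of the JSON Schema constraint checks listed above.
#[derive(Debug, Default)]
struct StringConstraints {
    min_length: Option<usize>,
    max_length: Option<usize>,
    one_of: Option<Vec<String>>, // JSON Schema `enum`
}

fn check_string(value: &str, c: &StringConstraints) -> bool {
    let len = value.chars().count();
    if c.min_length.map_or(false, |m| len < m) { return false; }
    if c.max_length.map_or(false, |m| len > m) { return false; }
    if let Some(allowed) = &c.one_of {
        if !allowed.iter().any(|a| a == value) { return false; }
    }
    true
}

fn check_number(value: f64, minimum: Option<f64>, maximum: Option<f64>) -> bool {
    minimum.map_or(true, |m| value >= m) && maximum.map_or(true, |m| value <= m)
}

fn main() {
    let c = StringConstraints { min_length: Some(2), max_length: Some(5), one_of: None };
    assert!(check_string("abc", &c));
    assert!(!check_string("a", &c));
    assert!(check_number(30.0, Some(0.0), None)); // "age": minimum 0
    println!("constraint checks pass");
}
```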
- [ ] Constrained decoding with logit bias
  - State machine for tracking JSON structure (open braces, quotes, commas)
  - Token masking to enforce valid next tokens
  - Rejection sampling fallback for complex schemas
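The token-masking step above can be sketched in a few lines: every token whose text would break the JSON state machine gets its logit forced to negative infinity, so its softmax probability becomes zero. The `is_valid_next` oracle is a hypothetical stand-in for the real state-machine query.

```rust
// Token masking for constrained decoding: invalid tokens get -inf logits.
fn mask_logits(logits: &mut [f32], is_valid_next: impl Fn(usize) -> bool) {
    for (token_id, logit) in logits.iter_mut().enumerate() {
        if !is_valid_next(token_id) {
            *logit = f32::NEG_INFINITY; // softmax probability becomes 0
        }
    }
}

fn main() {
    // Toy vocab of 4 tokens; suppose only tokens 1 and 3 keep the JSON valid.
    let mut logits = [0.5_f32, 1.0, 2.0, 0.1];
    mask_logits(&mut logits, |t| t == 1 || t == 3);
    assert!(logits[0].is_infinite() && logits[0] < 0.0);
    assert_eq!(logits[1], 1.0);
    assert_eq!(logits[3], 0.1);
}
```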
- [ ] Bracket/brace state machine
  - Track depth of `{}` and `[]`
  - Enforce closing brackets
  - Handle escaped quotes in strings
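The tracker described above can be sketched as a tiny character-fed state machine: it tracks net `{}`/`[]` depth, whether we are inside a string literal, and backslash escapes, and reports when the value is structurally complete. Illustrative only, not the RuvLLM internals.

```rust
// Minimal bracket/brace state machine for streaming JSON output.
#[derive(Default)]
struct JsonTracker {
    depth: i32,       // net {} / [] nesting
    in_string: bool,  // inside a "..." literal
    escaped: bool,    // previous char was a backslash
    seen_any: bool,   // has the top-level value been opened yet
}

impl JsonTracker {
    fn feed(&mut self, c: char) {
        if self.in_string {
            if self.escaped { self.escaped = false; }      // \" stays in string
            else if c == '\\' { self.escaped = true; }
            else if c == '"' { self.in_string = false; }
            return;
        }
        match c {
            '"' => self.in_string = true,
            '{' | '[' => { self.depth += 1; self.seen_any = true; }
            '}' | ']' => self.depth -= 1,
            _ => {}
        }
    }

    /// True once every opened brace/bracket has been closed.
    fn complete(&self) -> bool {
        self.seen_any && self.depth == 0 && !self.in_string
    }
}

fn main() {
    let mut t = JsonTracker::default();
    for c in r#"{"name": "Jo\"hn", "tags": ["a"]}"#.chars() { t.feed(c); }
    assert!(t.complete());
}
```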
- [ ] JSON repair for malformed output
  - Auto-close unclosed braces/brackets
  - Fix trailing commas
  - Escape unescaped quotes
  - Best-effort recovery mode
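A best-effort sketch of the two most common repairs listed above, trailing commas and unclosed braces/brackets. `repair_json` is a hypothetical helper, not the actual repair module, and it deliberately ignores the harder cases (unescaped quotes, truncated literals).

```rust
// Best-effort JSON repair: strip trailing commas, auto-close open scopes.
fn repair_json(input: &str) -> String {
    let mut out = String::new();
    let mut stack: Vec<char> = Vec::new(); // closers still owed
    let mut in_string = false;
    let mut escaped = false;
    for c in input.chars() {
        if in_string {
            if escaped { escaped = false; }
            else if c == '\\' { escaped = true; }
            else if c == '"' { in_string = false; }
            out.push(c);
            continue;
        }
        match c {
            '"' => { in_string = true; out.push(c); }
            '{' => { stack.push('}'); out.push(c); }
            '[' => { stack.push(']'); out.push(c); }
            '}' | ']' => {
                // Drop a trailing comma before a closer: {"a":1,} -> {"a":1}
                while out.ends_with(|p: char| p.is_whitespace()) { out.pop(); }
                if out.ends_with(',') { out.pop(); }
                stack.pop();
                out.push(c);
            }
            _ => out.push(c),
        }
    }
    if in_string { out.push('"'); }        // close an unterminated string
    while let Some(closer) = stack.pop() { // auto-close remaining scopes
        while out.ends_with(|p: char| p.is_whitespace()) { out.pop(); }
        if out.ends_with(',') { out.pop(); }
        out.push(closer);
    }
    out
}

fn main() {
    assert_eq!(repair_json(r#"{"a": [1, 2,"#), r#"{"a": [1, 2]}"#);
    assert_eq!(repair_json(r#"{"a": 1,}"#), r#"{"a": 1}"#);
}
```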
- [ ] GBNF grammar support (future)
  - llama.cpp-compatible grammar format
  - Custom domain-specific languages
- [ ] Comprehensive tests
  - Unit tests for all JSON types
  - Property-based testing with Hypothesis/QuickCheck
  - Adversarial inputs (deeply nested objects, large arrays)
- [ ] Benchmarks vs unconstrained generation
  - Measure latency overhead (<10% target)
  - Throughput impact
  - Memory usage
Example API

```rust
let schema = json!({
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "age": {"type": "number", "minimum": 0},
        "tags": {"type": "array", "items": {"type": "string"}}
    },
    "required": ["name"]
});

let response = llm.generate(GenerateRequest {
    prompt: "Extract person info: John is 30".into(),
    json_schema: Some(schema),
    strict: true, // guarantee valid JSON
    ..Default::default()
})?;

// response.text is guaranteed to be valid JSON matching the schema
```
Acceptance Criteria
- [ ] 100% valid JSON when `strict: true` is enabled
- [ ] <10% latency overhead vs unconstrained generation
- [ ] Schema validation passes for nested objects/arrays (depth ≥ 5)
- [ ] Repair mode recovers ≥95% of malformed outputs
2. Function Calling / Tool Use (P0)
Objective: Enable LLMs to call external tools/functions with structured arguments.
Core Capabilities
- [ ] Tool definition schema
  - Function name and description
  - Parameters (JSON schema)
  - Return type (optional)
- [ ] `ToolChoice` enum
  - `auto`: Model decides whether to call tools
  - `none`: Never call tools (text-only)
  - `required`: Must call at least one tool
  - `specific(name)`: Force a specific tool
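One way the variants above could be modeled in Rust; a sketch, not the committed RuvLLM API surface. The `permits` helper shows how a sampler might consult the policy.

```rust
// Sketch of the ToolChoice policy described above.
#[derive(Debug, Clone, PartialEq)]
enum ToolChoice {
    Auto,             // model decides whether to call tools
    None,             // text-only, never call tools
    Required,         // must call at least one tool
    Specific(String), // force a specific tool by name
}

impl ToolChoice {
    /// May the model emit a call to `name` under this policy?
    fn permits(&self, name: &str) -> bool {
        match self {
            ToolChoice::None => false,
            ToolChoice::Specific(n) => n == name,
            ToolChoice::Auto | ToolChoice::Required => true,
        }
    }
}

fn main() {
    assert!(ToolChoice::Auto.permits("get_weather"));
    assert!(!ToolChoice::None.permits("get_weather"));
    assert!(ToolChoice::Specific("search_web".into()).permits("search_web"));
    assert!(!ToolChoice::Specific("search_web".into()).permits("get_weather"));
}
```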
- [ ] Parallel tool calls
  - Generate multiple tool calls in one response
  - Dependency-aware ordering
- [ ] Tool result handling
  - Inject tool results back into the conversation
  - Continue generation after tool execution
  - Multi-turn tool loops
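The multi-turn loop above can be sketched as: run the model, execute any tool calls it emitted, append the results to the history, and repeat until the model answers in plain text. All types and the `run_model`/`run_tool` closures are hypothetical stand-ins for the RuvLLM chat API.

```rust
// Minimal multi-turn tool loop sketch.
#[allow(dead_code)]
#[derive(Debug, Clone)]
enum Message {
    User(String),
    Assistant(String),
    ToolResult { name: String, output: String },
}

#[derive(Debug)]
enum ModelTurn {
    Text(String),
    ToolCalls(Vec<(String, String)>), // (tool name, JSON-encoded args)
}

fn agent_loop(
    mut history: Vec<Message>,
    run_model: impl Fn(&[Message]) -> ModelTurn,
    run_tool: impl Fn(&str, &str) -> String,
    max_turns: usize,
) -> Option<String> {
    for _ in 0..max_turns {
        match run_model(&history) {
            ModelTurn::Text(answer) => return Some(answer),
            ModelTurn::ToolCalls(calls) => {
                for (name, args) in calls {
                    // Inject each tool result back into the conversation.
                    let output = run_tool(&name, &args);
                    history.push(Message::ToolResult { name, output });
                }
            }
        }
    }
    None // turn budget exhausted
}

fn main() {
    // Fake model: calls a tool once, then answers using its result.
    let run_model = |h: &[Message]| {
        if h.iter().any(|m| matches!(m, Message::ToolResult { .. })) {
            ModelTurn::Text("18C and sunny".into())
        } else {
            ModelTurn::ToolCalls(vec![("get_weather".into(), r#"{"location":"SF"}"#.into())])
        }
    };
    let run_tool = |_name: &str, _args: &str| "18C, sunny".to_string();
    let answer = agent_loop(vec![Message::User("Weather in SF?".into())], run_model, run_tool, 5);
    assert_eq!(answer.as_deref(), Some("18C and sunny"));
}
```

The `max_turns` budget is the guard that keeps a misbehaving model from looping forever, mirroring the ≥5-turn acceptance criterion below.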
- [ ] Model-specific formats
  - Llama 3.1 tool format (`<|python_tag|>`)
  - Mistral tool format (function tags)
  - Qwen tool format
  - Claude tool format
- [ ] OpenAI API compatibility layer
  - `tools` parameter
  - `tool_choice` parameter
  - `ChatCompletionToolCall` response format
- [ ] LangChain integration tests
  - Works with `AgentExecutor`
  - Compatible with `StructuredTool`
  - Multi-agent workflows
Example API

```rust
let tools = vec![
    Tool {
        name: "get_weather".into(),
        description: "Get current weather for a location".into(),
        parameters: json!({
            "type": "object",
            "properties": {
                "location": {"type": "string"},
                "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]}
            },
            "required": ["location"]
        }),
    },
    Tool {
        name: "search_web".into(),
        description: "Search the web".into(),
        parameters: json!({
            "type": "object",
            "properties": {
                "query": {"type": "string"}
            },
            "required": ["query"]
        }),
    },
];

let response = llm.chat(ChatRequest {
    messages: vec![
        Message::user("What's the weather in SF and latest AI news?")
    ],
    tools: Some(tools),
    tool_choice: ToolChoice::Auto,
    ..Default::default()
})?;

// response.tool_calls contains parallel calls:
// [get_weather(location="San Francisco"), search_web(query="AI news")]
```
Acceptance Criteria
- [ ] OpenAI API format compatibility (passes OpenAI SDK tests)
- [ ] LangChain AgentExecutor integration works end-to-end
- [ ] Parallel tool calls supported (≥3 concurrent)
- [ ] Multi-turn tool conversations (≥5 turns)
- [ ] Tool call success rate ≥95% for common tools
3. Prefix Caching (P0)
Objective: Cache and reuse KV cache for repeated prompt prefixes (system prompts, RAG documents).
Core Capabilities
- [ ] Hash-based prefix lookup
  - SHA-256 hash of token IDs
  - Fast O(1) cache-hit detection
- [ ] Radix tree implementation
  - Efficient storage for overlapping prefixes
  - Longest-common-prefix matching
  - Memory-efficient sharing
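The longest-common-prefix matching at the heart of the radix tree above, shown directly over token-ID slices; a sketch of the core idea, not the tree itself.

```rust
// Length of the shared prefix between a cached token sequence and a request.
fn longest_common_prefix(a: &[u32], b: &[u32]) -> usize {
    a.iter().zip(b.iter()).take_while(|(x, y)| x == y).count()
}

fn main() {
    let cached  = [101, 7, 42, 9];  // e.g. tokenized system prompt + turn 1
    let request = [101, 7, 42, 13]; // same system prompt, new user turn
    // The first 3 tokens can reuse cached KV entries; only the tail past
    // the divergence point needs fresh prefill.
    assert_eq!(longest_common_prefix(&cached, &request), 3);
}
```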
- [ ] KV cache copy-on-write
  - Share read-only cache entries
  - Copy only on divergence
  - Zero-copy for cache hits
- [ ] LRU eviction policy
  - Evict least recently used prefixes
  - Configurable cache size
  - Per-model cache isolation
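A dependency-free sketch of the hash-keyed, LRU-evicted cache described above. The issue proposes SHA-256 over token IDs; the stdlib hasher here is a stand-in so the example compiles without external crates, and `KvEntry` is a placeholder for a real KV-cache block. Eviction scans for the minimum recency tick, which is O(n); a production cache would use an ordered structure.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

#[derive(Clone, Debug)]
struct KvEntry(String); // placeholder for a real KV-cache block

struct PrefixCache {
    map: HashMap<u64, (KvEntry, u64)>, // hash -> (entry, last-used tick)
    tick: u64,
    capacity: usize,
}

impl PrefixCache {
    fn new(capacity: usize) -> Self {
        Self { map: HashMap::new(), tick: 0, capacity }
    }

    fn key(tokens: &[u32]) -> u64 {
        let mut h = DefaultHasher::new();
        tokens.hash(&mut h);
        h.finish()
    }

    fn get(&mut self, tokens: &[u32]) -> Option<KvEntry> {
        self.tick += 1;
        let tick = self.tick;
        self.map.get_mut(&Self::key(tokens)).map(|slot| {
            slot.1 = tick; // refresh recency on hit
            slot.0.clone()
        })
    }

    fn put(&mut self, tokens: &[u32], entry: KvEntry) {
        if self.map.len() >= self.capacity {
            // Evict the least recently used prefix.
            if let Some(&lru) = self.map.iter().min_by_key(|(_, v)| v.1).map(|(k, _)| k) {
                self.map.remove(&lru);
            }
        }
        self.tick += 1;
        self.map.insert(Self::key(tokens), (entry, self.tick));
    }
}

fn main() {
    let mut cache = PrefixCache::new(2);
    cache.put(&[1, 2, 3], KvEntry("sys-prompt".into()));
    assert!(cache.get(&[1, 2, 3]).is_some()); // hit
    assert!(cache.get(&[9, 9]).is_none());    // miss
}
```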
- [ ] Memory limits
  - Hard limit on cache size (bytes)
  - Soft limit with warning
  - Graceful degradation
- [ ] Cache hit/miss metrics
  - Prometheus metrics
  - Hit-rate tracking
  - Memory usage stats
- [ ] Chat prefix caching
  - System prompt caching
  - Conversation history caching
  - Automatic prefix detection
- [ ] RAG document caching
  - Document chunk prefixes
  - Query-independent context
  - Multi-query reuse
Example API

```rust
// First request - cache miss
let response1 = llm.generate(GenerateRequest {
    prompt: "System: You are a helpful assistant.\nUser: Hello".into(),
    cache_prefix: Some(CacheConfig {
        enable: true,
        key: Some("chat-system-prompt".into()),
        ttl_seconds: Some(3600),
    }),
    ..Default::default()
})?;
// Latency: 500ms

// Second request - cache hit (reuses the "System: You are..." KV cache)
let response2 = llm.generate(GenerateRequest {
    prompt: "System: You are a helpful assistant.\nUser: How are you?".into(),
    cache_prefix: Some(CacheConfig {
        enable: true,
        key: Some("chat-system-prompt".into()),
        ttl_seconds: Some(3600),
    }),
    ..Default::default()
})?;
// Latency: 50ms (10x faster)
```
Performance Targets
- 10x speedup for repeated system prompts (cache hit)
- <5% overhead for cache miss
- Memory-bounded (configurable, default 2GB)
- Thread-safe for concurrent requests
- Hit rate ≥80% for typical chat/RAG workloads
Acceptance Criteria
- Speedup: ≥10x for 1024-token prefix reuse
- Memory: Bounded by config, no OOM
- Correctness: Identical outputs for cached vs uncached
- Concurrency: No race conditions (stress tested)
- Metrics: Prometheus metrics exported
Technical Design
Architecture Overview

```
┌─────────────────────────────────────────────────────────┐
│                     RuvLLM v2.4                         │
├─────────────────────────────────────────────────────────┤
│ ┌─────────────────┐ ┌──────────────┐ ┌────────────┐     │
│ │ Structured      │ │ Function     │ │ Prefix     │     │
│ │ Output Engine   │ │ Calling      │ │ Cache      │     │
│ │                 │ │ Router       │ │ Manager    │     │
│ │ - JSON Schema   │ │ - Tool Defs  │ │ - Radix    │     │
│ │ - Logit Bias    │ │ - ToolChoice │ │   Tree     │     │
│ │ - State Machine │ │ - Multi-call │ │ - LRU      │     │
│ └────────┬────────┘ └──────┬───────┘ └─────┬──────┘     │
│          │                 │               │            │
│          └─────────────────┼───────────────┘            │
│                            │                            │
│                 ┌──────────▼─────────┐                  │
│                 │   mistral-rs Core  │                  │
│                 │   - Model Loading  │                  │
│                 │   - Token Sampling │                  │
│                 │   - KV Cache       │                  │
│                 └────────────────────┘                  │
└─────────────────────────────────────────────────────────┘
```
Reference ADRs
- ADR-009: Structured Output Implementation
  - Constrained decoding algorithm
  - JSON schema validation approach
  - Performance optimization strategies
- ADR-010: Function Calling Architecture
  - Tool definition format
  - Multi-model compatibility layer
  - Parallel execution model
- ADR-011: Prefix Caching Design
  - Radix tree structure
  - Eviction policies
  - Memory management
- ADR-008: mistral-rs Integration
  - Dependency structure
  - API surface
  - Migration path
Implementation Plan
Phase 1: Foundation (Weeks 1-2)
Focus: Structured Output basics + Function Calling definitions
- [ ] Week 1: JSON schema parser and validator
  - Implement schema types (object, array, string, number, boolean, null)
  - Unit tests for all types
  - Property-based tests
- [ ] Week 2: Constrained decoding MVP
  - Logit bias implementation
  - Simple state machine (braces, brackets)
  - Integration with the mistral-rs sampler
  - Basic function calling types (Tool, ToolChoice enums)

Deliverable: JSON mode works for simple schemas; tool definitions parsed
Phase 2: Core Logic (Weeks 3-4)
Focus: Constrained decoding + Tool generation
- [ ] Week 3: Advanced constrained decoding
  - Nested schema support
  - String pattern matching
  - Enum constraints
  - JSON repair mode
- [ ] Week 4: Tool call generation
  - Llama 3.1 format support
  - Mistral format support
  - Parallel tool calls
  - OpenAI API compatibility layer

Deliverable: Complex JSON schemas work; tool calls generated in OpenAI format
Phase 3: Caching + Polish (Weeks 5-6)
Focus: Prefix Caching + Integration tests
- [ ] Week 5: Prefix caching implementation
  - Radix tree structure
  - Hash-based lookup
  - LRU eviction
  - Thread safety (RwLock)
- [ ] Week 6: Integration + benchmarks
  - LangChain integration tests
  - RAG workflow tests
  - Performance benchmarks
  - Documentation
  - Example applications

Deliverable: All 3 features production-ready, benchmarked, and documented
Testing Strategy
Unit Tests
- JSON schema validation (all types, nested, constraints)
- Logit bias correctness
- Tool definition parsing
- Prefix cache hit/miss logic
- Radix tree operations
Integration Tests
- LangChain AgentExecutor with tools
- LlamaIndex ReAct agent
- CrewAI multi-agent workflows
- OpenAI SDK compatibility tests
Benchmarks
- Structured output latency vs unconstrained
- Tool calling accuracy (% correct tool selections)
- Prefix cache speedup (1x, 10x, 100x reuse)
- Memory usage under load
Stress Tests
- 1000 concurrent requests with caching
- Deeply nested JSON schemas (depth 20)
- Large tool libraries (100+ tools)
- Multi-turn tool conversations (50+ turns)
Success Metrics
Structured Output
- Validity: 100% valid JSON when `strict: true`
- Overhead: <10% latency vs unconstrained
- Schema compliance: 100% for schemas of depth ≤ 10
- Repair rate: ≥95% successful repairs
Function Calling
- Compatibility: Passes OpenAI SDK test suite
- LangChain: Works with AgentExecutor (5+ examples)
- Accuracy: ≥95% correct tool selection (benchmark dataset)
- Parallel calls: Supports ≥5 concurrent tools
Prefix Caching
- Speedup: 10x for 1024-token prefix, 100x for 4096-token
- Hit rate: ≥80% for chat workloads
- Memory: Bounded, no OOM under stress
- Correctness: 100% identical outputs (cached vs uncached)
Dependencies
Upstream
- mistral-rs v0.4.x (ADR-008)
- KV cache access for prefix caching
- Token sampling hooks for logit bias
- Model loading infrastructure
Downstream
- Enables: Agentic workflow support
- Enables: LangChain/LlamaIndex/CrewAI integration
- Blocks: v2.4 release
- Blocks: Production adoption by agent frameworks
Related Issues
- Depends on: #XXX (mistral-rs integration ADR-008)
- Enables: #XXX (Agentic workflow support)
- Enables: #XXX (LangChain integration)
- Blocks: #XXX (v2.4 release milestone)
Documentation Requirements
- API reference docs (rustdoc)
- User guides for each feature
- "How to use JSON mode"
- "How to define tools"
- "How to enable prefix caching"
- Migration guide from v2.3
- Example applications
- Structured extraction (NER, info extraction)
- Multi-tool agent (ReAct loop)
- RAG with caching (chatbot)
- Performance tuning guide
Open Questions
- JSON Schema: Full Draft 7 or subset? (Propose: Core subset + extensions)
- Tool formats: Support all models or Llama 3.1+ only? (Propose: Llama 3.1+ with adapters)
- Cache eviction: LRU vs LFU vs TTL-based? (Propose: LRU + TTL)
- Memory limit: Default cache size? (Propose: 2GB default, configurable)
- Breaking changes: Any API changes needed? (Propose: Additive only, no breaks)
Future Enhancements (Post-v2.4)
- Structured Output:
  - GBNF grammar support (custom DSLs)
  - Regex-constrained strings
  - Speculative decoding for constrained generation
- Function Calling:
  - Async/streaming tool execution
  - Tool result validation
  - Tool dependency graphs
- Prefix Caching:
  - Cross-request caching (shared cache pool)
  - Disk-backed cache (persist across restarts)
  - Distributed caching (Redis/memcached)
Timeline Summary
| Phase | Duration | Focus | Deliverable |
|---|---|---|---|
| 1 | Weeks 1-2 | Structured Output + Tool Definitions | JSON mode MVP, tool parsing |
| 2 | Weeks 3-4 | Constrained Decoding + Tool Generation | Complex schemas, tool calls |
| 3 | Weeks 5-6 | Prefix Caching + Integration | Production-ready, benchmarked |
Total: 6 weeks to production-ready v2.4
Getting Involved
For Contributors
- Pick a task from the checkboxes above
- Comment on this issue to claim a feature
- Follow the implementation plan phases
- Submit PRs with tests + benchmarks
For Reviewers
- Focus on correctness (JSON validity, cache correctness)
- Performance regression checks (<10% overhead target)
- API design feedback (before Week 3)
For Testers
- Test with real-world agent workflows
- Report edge cases and failure modes
- Benchmark on your hardware/models
Let's close the gap with vLLM/llama.cpp and make RuvLLM the best choice for production agentic workflows! 🚀