Squashed 'vendor/ruvector/' content from commit b64c2172
git-subtree-dir: vendor/ruvector
git-subtree-split: b64c21726f2bb37286d9ee36a7869fef60cc6900

New file: crates/ruvllm/docs/GITHUB_ISSUE_MISTRAL_RS.md (300 lines)
# feat(ruvllm): Full mistral-rs backend integration with PagedAttention, X-LoRA, and ISQ

## Summary

Wire the existing `MistralBackend` stub to the actual [mistral-rs](https://github.com/EricLBuehler/mistral.rs) crate for production-scale LLM serving with advanced memory management and adapter routing.

## Motivation

The current Candle backend is optimized for single-user and edge deployment scenarios, achieving approximately 100 tokens/second. While sufficient for development and small-scale use, production deployments require significantly higher throughput and concurrency.

**mistral-rs enables:**
- **500-1000 tok/s throughput** via continuous batching and PagedAttention
- **50+ concurrent users** with efficient KV cache management
- **Memory efficiency** through paged memory allocation and prefix caching
- **Dynamic adapter routing** via X-LoRA for multi-task inference
- **Runtime quantization** via ISQ for deployment flexibility

### Performance Comparison

| Metric | Candle Backend | mistral-rs Backend |
|--------|----------------|-------------------|
| Throughput | ~100 tok/s | 500-1000 tok/s |
| Concurrent Users | 1-5 | 50+ |
| Memory Efficiency | Static KV | Paged + Prefix Cache |
| Adapter Support | Static LoRA | Dynamic X-LoRA |
| Quantization | Pre-quantized only | Runtime ISQ |

## Features to Implement

### 1. PagedAttention (Priority: High)

PagedAttention manages the KV cache like virtual memory, allocating it in fixed-size blocks so that memory can be shared efficiently across sequences.

- [ ] Add `mistralrs` dependency to `Cargo.toml` with feature flags
- [ ] Wire PagedAttention to `MistralBackend::generate()`
- [ ] Implement sequence allocation/deallocation callbacks
- [ ] Add prefix caching support for prompt reuse
- [ ] Configure block size and max sequences
- [ ] Benchmark: target 5-10x concurrent capacity improvement

**Key Implementation Points:**
```rust
// Block configuration
let paged_config = PagedAttentionConfig {
    block_size: 16,        // Tokens per block
    max_num_blocks: 1024,  // Total blocks available
    sliding_window: None,  // Optional sliding window
    prefix_caching: true,  // Enable prefix cache
};
```
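
With the configuration above, capacity and per-sequence block counts follow from simple ceiling division. A minimal sketch of that arithmetic (the `blocks_needed` helper is illustrative, not a mistral-rs or ruvllm API):

```rust
// Illustrative helper: how many fixed-size blocks a sequence occupies.
// `blocks_needed` is a hypothetical name, not part of mistral-rs.
fn blocks_needed(seq_len_tokens: usize, block_size: usize) -> usize {
    // Ceiling division: a partially filled block still occupies a slot.
    (seq_len_tokens + block_size - 1) / block_size
}

fn main() {
    let block_size = 16;
    let max_num_blocks = 1024;
    // A 100-token sequence occupies 7 blocks (6 full + 1 partial).
    assert_eq!(blocks_needed(100, block_size), 7);
    // Total cache capacity with these defaults: 16 * 1024 = 16,384 tokens,
    // shared across all live sequences rather than reserved per sequence.
    assert_eq!(block_size * max_num_blocks, 16_384);
}
```

The per-block waste is at most `block_size - 1` tokens per sequence, which is why small block sizes (16-32) are typical.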

### 2. X-LoRA Dynamic Routing (Priority: Medium)

X-LoRA enables per-token routing to different LoRA adapters, allowing a single model to handle multiple tasks efficiently.

- [ ] Wire `XLoraManager` to mistral-rs X-LoRA implementation
- [ ] Implement per-token adapter routing logic
- [ ] Support learned routing networks (classifier)
- [ ] Add adapter hot-loading for runtime updates
- [ ] Implement adapter weight caching
- [ ] Benchmark: multi-task quality metrics vs single adapters

**Key Implementation Points:**
```rust
// X-LoRA configuration
let xlora_config = XLoraConfig {
    adapters: vec![
        ("code", "path/to/code-lora"),
        ("chat", "path/to/chat-lora"),
        ("reasoning", "path/to/reasoning-lora"),
    ],
    routing_method: RoutingMethod::Learned,
    top_k_adapters: 2,    // Use top-2 adapters per token
    scaling_factor: 1.0,
};
```

### 3. ISQ Runtime Quantization (Priority: Medium)

In-Situ Quantization (ISQ) loads full-precision weights and quantizes them at runtime, providing deployment flexibility without pre-quantized checkpoints.

- [ ] Wire `IsqConfig` to mistral-rs ISQ implementation
- [ ] Support quantization methods: AWQ, GPTQ, RTN, SmoothQuant
- [ ] Implement calibration workflow with sample data
- [ ] Add memory estimation before/after quantization
- [ ] Support mixed-precision quantization per layer
- [ ] Benchmark: quality vs compression tradeoffs

**Supported Quantization Methods:**

| Method | Bits | Quality | Speed | Use Case |
|--------|------|---------|-------|----------|
| AWQ | 4-bit | High | Fast | Production |
| GPTQ | 4-bit | High | Medium | Accuracy-critical |
| RTN | 8-bit | Very High | Very Fast | Quality-first |
| SmoothQuant | 8-bit | Very High | Fast | Balanced |

## Technical Details

### Cargo.toml Changes

```toml
[dependencies]
# Core mistral-rs integration
mistralrs = { version = "0.4", optional = true }
mistralrs-core = { version = "0.4", optional = true }

# Required for tokenization with mistral-rs
tokenizers = { version = "0.20", optional = true }

[features]
default = ["candle"]

# Base mistral-rs support (CPU)
mistral-rs = ["mistralrs", "mistralrs-core", "tokenizers"]

# Metal acceleration (macOS)
mistral-rs-metal = ["mistral-rs", "mistralrs/metal"]

# CUDA acceleration (NVIDIA)
mistral-rs-cuda = ["mistral-rs", "mistralrs/cuda"]

# Full feature set
full = ["candle", "mistral-rs"]
```

### Files to Modify

| File | Changes |
|------|---------|
| `crates/ruvllm/Cargo.toml` | Add mistral-rs dependencies and feature flags |
| `crates/ruvllm/src/backends/mistral_backend.rs` | Replace stub with real implementation |
| `crates/ruvllm/src/backends/mod.rs` | Update conditional exports |
| `crates/ruvllm/src/paged_attention.rs` | Wire to mistral-rs PagedAttention |
| `crates/ruvllm/src/xlora_manager.rs` | Wire to mistral-rs X-LoRA |
| `crates/ruvllm/src/isq.rs` | Wire to mistral-rs ISQ |
| `crates/ruvllm/src/lib.rs` | Add re-exports and feature gates |
| `crates/ruvllm/README.md` | Document usage and examples |

### API Design

```rust
use ruvllm::{MistralBackend, MistralConfig, PagedAttentionConfig};

// Create backend with PagedAttention
let config = MistralConfig {
    model_id: "mistralai/Mistral-7B-Instruct-v0.2".to_string(),
    paged_attention: Some(PagedAttentionConfig {
        block_size: 16,
        max_num_blocks: 1024,
        prefix_caching: true,
    }),
    xlora: None,
    isq: None,
};

let backend = MistralBackend::new(config).await?;

// Generate with automatic KV cache management
let output = backend.generate(&request).await?;
```

### Feature Flag Matrix

| Build Command | CPU | Metal | CUDA | PagedAttn | X-LoRA | ISQ |
|---------------|-----|-------|------|-----------|--------|-----|
| `--features mistral-rs` | Yes | No | No | Yes | Yes | Yes |
| `--features mistral-rs-metal` | Yes | Yes | No | Yes | Yes | Yes |
| `--features mistral-rs-cuda` | Yes | No | Yes | Yes | Yes | Yes |

## Acceptance Criteria

### Build Verification
- [ ] `cargo build --features mistral-rs` compiles on Linux
- [ ] `cargo build --features mistral-rs-metal` compiles on macOS
- [ ] `cargo build --features mistral-rs-cuda` compiles with CUDA toolkit
- [ ] All clippy warnings resolved
- [ ] No breaking changes to existing Candle backend

### Functionality
- [ ] Model loading works with HuggingFace model IDs
- [ ] Model loading works with local paths
- [ ] Generation produces correct, coherent output
- [ ] Streaming generation works correctly
- [ ] Stop sequences are respected

### PagedAttention
- [ ] KV cache is managed in blocks
- [ ] Sequence allocation succeeds up to max capacity
- [ ] Sequence deallocation frees blocks correctly
- [ ] Prefix caching improves repeated prompt performance
- [ ] Memory usage stays within configured limits

### X-LoRA
- [ ] Multiple adapters can be loaded
- [ ] Per-token routing selects appropriate adapters
- [ ] Adapter hot-loading works without restart
- [ ] Quality matches or exceeds single-adapter baseline

### ISQ
- [ ] Models quantize at runtime without pre-quantized weights
- [ ] All supported methods produce valid output
- [ ] Memory reduction matches expected compression ratio
- [ ] Quality degradation within acceptable bounds (<5% on benchmarks)

### Performance Benchmarks
- [ ] Throughput: >500 tok/s on Mistral-7B (single user)
- [ ] Concurrency: >50 concurrent generations without OOM
- [ ] Latency: <50ms time-to-first-token
- [ ] Memory: PagedAttention reduces peak usage by >30%
## Testing Plan

### Unit Tests
```rust
#[cfg(feature = "mistral-rs")]
mod mistral_tests {
    #[tokio::test]
    async fn test_model_loading() { ... }

    #[tokio::test]
    async fn test_generation() { ... }

    #[tokio::test]
    async fn test_paged_attention_allocation() { ... }

    #[tokio::test]
    async fn test_xlora_routing() { ... }

    #[tokio::test]
    async fn test_isq_quantization() { ... }
}
```

### Integration Tests
- Model download and cache management
- End-to-end generation pipeline
- Concurrent request handling
- Memory pressure scenarios

### Benchmarks
```bash
# Run throughput benchmark
cargo bench --features mistral-rs-metal -- throughput

# Run concurrency benchmark
cargo bench --features mistral-rs-metal -- concurrency

# Run memory benchmark
cargo bench --features mistral-rs-metal -- memory
```

## Implementation Notes

### Thread Safety
mistral-rs uses async Rust throughout. Ensure all shared state is properly synchronized:
- Use `Arc<RwLock<...>>` for shared configuration
- Use channels for sequence lifecycle events
- Avoid blocking in async contexts
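
The pattern above can be sketched as follows. This is an illustrative synchronization sketch only; `SharedConfig` and `SeqEvent` are hypothetical names, not ruvllm or mistral-rs types:

```rust
use std::sync::mpsc;
use std::sync::{Arc, RwLock};

// Hypothetical shared configuration, read often and written rarely.
struct SharedConfig {
    max_sequences: usize,
}

// Hypothetical sequence lifecycle events, sent over a channel instead of
// being tracked in shared mutable state.
enum SeqEvent {
    Allocated(u64),
    Freed(u64),
}

fn main() {
    // Read-mostly configuration behind Arc<RwLock<...>>: many concurrent
    // readers, exclusive writers.
    let config = Arc::new(RwLock::new(SharedConfig { max_sequences: 64 }));
    let reader = Arc::clone(&config);
    assert_eq!(reader.read().unwrap().max_sequences, 64);

    // Lifecycle events flow through a channel, so producers never hold a lock.
    let (tx, rx) = mpsc::channel();
    tx.send(SeqEvent::Allocated(1)).unwrap();
    tx.send(SeqEvent::Freed(1)).unwrap();
    assert!(matches!(rx.recv().unwrap(), SeqEvent::Allocated(1)));
    assert!(matches!(rx.recv().unwrap(), SeqEvent::Freed(1)));
}
```

In async contexts, the tokio equivalents (`tokio::sync::RwLock`, `tokio::sync::mpsc`) avoid blocking the executor.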

### Error Handling
Map mistral-rs errors to ruvllm error types:
```rust
impl From<mistralrs::Error> for RuvllmError {
    fn from(e: mistralrs::Error) -> Self {
        match e {
            mistralrs::Error::ModelLoad(_) => RuvllmError::ModelLoad(...),
            mistralrs::Error::Generation(_) => RuvllmError::Generation(...),
            // ...
        }
    }
}
```

### Backward Compatibility
- Keep Candle backend as default
- Use feature flags for mistral-rs
- Maintain consistent API across backends
- Document migration path

## Related Issues

- Depends on: Initial MistralBackend stub implementation
- Blocks: Production deployment readiness
- Related: Candle backend optimizations

## References

- [mistral-rs GitHub](https://github.com/EricLBuehler/mistral.rs)
- [PagedAttention Paper](https://arxiv.org/abs/2309.06180)
- [X-LoRA Paper](https://arxiv.org/abs/2402.07148)
- [AWQ Paper](https://arxiv.org/abs/2306.00978)
- [vLLM PagedAttention](https://blog.vllm.ai/2023/06/20/vllm.html)

---

**Labels:** `enhancement`, `ruvllm`, `backend`, `performance`, `P1`

**Milestone:** v0.2.0

**Assignees:** TBD
New file: crates/ruvllm/docs/GITHUB_ISSUE_SOTA_FEATURES.md (555 lines)

# feat(ruvllm): Implement SOTA features for production agentic workflows

**Labels**: `enhancement`, `p0-critical`, `agentic`, `v2.4`, `mistral-rs`, `performance`

---

## Summary

RuvLLM v2.4 SOTA feature implementation, adding the three critical capabilities needed for production agentic workflows: **Structured Output**, **Function Calling**, and **Prefix Caching**.

These features are essential for modern LLM applications and are currently blocking production adoption by major agent frameworks.

---

## Motivation

### Why This Matters

**Current State:**
- RuvLLM cannot reliably generate structured outputs (JSON schema enforcement)
- No native function calling support for tool-using agents
- Repeated prompts/prefixes incur full generation costs (no caching)
- Agent frameworks (LangChain, LlamaIndex, CrewAI) cannot integrate

**Impact:**
- **Blocking production adoption** for agentic workflows
- **Cost inefficiency**: 10-100x slower for RAG/chat applications vs competitors
- **Reliability gap**: JSON parsing failures break agent loops
- **Missing compatibility**: Cannot replace vLLM, llama.cpp, SGLang in existing stacks

**Competitive Gap:**

| Feature | vLLM | llama.cpp | SGLang | RuvLLM |
|---------|------|-----------|--------|--------|
| Structured Output | ✅ | ✅ | ✅ | ❌ |
| Function Calling | ✅ | ✅ | ✅ | ❌ |
| Prefix Caching | ✅ | ✅ | ✅ | ❌ |

---
## Features

### 1. Structured Output / JSON Mode (P0)

**Objective**: Guarantee valid JSON output conforming to user-provided schemas.

#### Core Capabilities
- [ ] **JSON schema validation** (JSON Schema Draft 7 support)
  - Primitive types: `string`, `number`, `boolean`, `null`
  - Complex types: `object`, `array`
  - Nested schemas with `properties`, `items`, `required`
  - Constraints: `minLength`, `maxLength`, `pattern`, `enum`, `minimum`, `maximum`

- [ ] **Constrained decoding with logit bias**
  - State machine for tracking JSON structure (open braces, quotes, commas)
  - Token masking to enforce valid next tokens
  - Rejection sampling fallback for complex schemas

- [ ] **Bracket/brace state machine**
  - Track depth of `{}` and `[]`
  - Enforce closing brackets
  - Handle escaped quotes in strings

- [ ] **JSON repair for malformed output**
  - Auto-close unclosed braces/brackets
  - Fix trailing commas
  - Escape unescaped quotes
  - Best-effort recovery mode

- [ ] **GBNF grammar support (future)**
  - llama.cpp-compatible grammar format
  - Custom domain-specific languages

- [ ] **Comprehensive tests**
  - Unit tests for all JSON types
  - Property-based testing with Hypothesis/QuickCheck
  - Adversarial inputs (deeply nested, large arrays)

- [ ] **Benchmarks vs unconstrained**
  - Measure latency overhead (<10% target)
  - Throughput impact
  - Memory usage
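
The bracket/brace state machine described above can be sketched in a few dozen lines. This is a minimal illustration of the tracking logic (hypothetical `JsonState` type, not the ruvllm implementation, and without the token-masking integration):

```rust
// Minimal sketch of a bracket/brace state machine for JSON structure.
// Tracks open braces/brackets and string/escape state; a constrained
// decoder would consult this state to mask invalid next tokens.
struct JsonState {
    depth_stack: Vec<char>, // open '{' / '[' awaiting their close
    in_string: bool,
    escaped: bool,
}

impl JsonState {
    fn new() -> Self {
        Self { depth_stack: Vec::new(), in_string: false, escaped: false }
    }

    // Feed one character; returns false if it would make the JSON invalid.
    fn push(&mut self, c: char) -> bool {
        if self.in_string {
            match (self.escaped, c) {
                (true, _) => self.escaped = false,   // consume escaped char
                (false, '\\') => self.escaped = true,
                (false, '"') => self.in_string = false,
                _ => {}
            }
            return true;
        }
        match c {
            '"' => self.in_string = true,
            '{' | '[' => self.depth_stack.push(c),
            '}' => return self.depth_stack.pop() == Some('{'),
            ']' => return self.depth_stack.pop() == Some('['),
            _ => {}
        }
        true
    }

    // A document is complete when every bracket is closed.
    fn complete(&self) -> bool {
        self.depth_stack.is_empty() && !self.in_string
    }
}

fn main() {
    let mut st = JsonState::new();
    for c in r#"{"name": "John", "tags": ["a"]}"#.chars() {
        assert!(st.push(c));
    }
    assert!(st.complete());
}
```

A real constrained decoder would run this per candidate token (over the token's characters) and zero out the logits of tokens whose characters are rejected.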

#### Example API
```rust
let schema = json!({
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "age": {"type": "number", "minimum": 0},
        "tags": {"type": "array", "items": {"type": "string"}}
    },
    "required": ["name"]
});

let response = llm.generate(GenerateRequest {
    prompt: "Extract person info: John is 30",
    json_schema: Some(schema),
    strict: true, // Guarantee valid JSON
    ..Default::default()
})?;

// response.text is guaranteed valid JSON matching schema
```

#### Acceptance Criteria
- [ ] **100% valid JSON** when `strict: true` enabled
- [ ] **<10% latency overhead** vs unconstrained generation
- [ ] **Schema validation passes** for nested objects/arrays (depth ≥ 5)
- [ ] **Repair mode** recovers ≥95% of malformed outputs

---

### 2. Function Calling / Tool Use (P0)

**Objective**: Enable LLMs to call external tools/functions with structured arguments.

#### Core Capabilities
- [ ] **Tool definition schema**
  - Function name, description
  - Parameters (JSON schema)
  - Return type (optional)

- [ ] **ToolChoice enum**
  - `auto`: Model decides whether to call tools
  - `none`: Never call tools (text-only)
  - `required`: Must call at least one tool
  - `specific(name)`: Force a specific tool

- [ ] **Parallel tool calls**
  - Generate multiple tool calls in one response
  - Dependency-aware ordering

- [ ] **Tool result handling**
  - Inject tool results back into conversation
  - Continue generation after tool execution
  - Multi-turn tool loops

- [ ] **Model-specific formats**
  - Llama 3.1 tool format (`<|python_tag|>`)
  - Mistral tool format (function tags)
  - Qwen tool format
  - Claude tool format

- [ ] **OpenAI API compatibility layer**
  - `tools` parameter
  - `tool_choice` parameter
  - `ChatCompletionToolCall` response format

- [ ] **LangChain integration tests**
  - Works with `AgentExecutor`
  - Compatible with `StructuredTool`
  - Multi-agent workflows

#### Example API
```rust
let tools = vec![
    Tool {
        name: "get_weather".into(),
        description: "Get current weather for a location".into(),
        parameters: json!({
            "type": "object",
            "properties": {
                "location": {"type": "string"},
                "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]}
            },
            "required": ["location"]
        }),
    },
    Tool {
        name: "search_web".into(),
        description: "Search the web".into(),
        parameters: json!({
            "type": "object",
            "properties": {
                "query": {"type": "string"}
            },
            "required": ["query"]
        }),
    },
];

let response = llm.chat(ChatRequest {
    messages: vec![
        Message::user("What's the weather in SF and latest AI news?")
    ],
    tools: Some(tools),
    tool_choice: ToolChoice::Auto,
    ..Default::default()
})?;

// response.tool_calls contains parallel calls:
// [get_weather(location="San Francisco"), search_web(query="AI news")]
```

#### Acceptance Criteria
- [ ] **OpenAI API format compatibility** (passes OpenAI SDK tests)
- [ ] **LangChain AgentExecutor** integration works end-to-end
- [ ] **Parallel tool calls** supported (≥3 concurrent)
- [ ] **Multi-turn tool conversations** (≥5 turns)
- [ ] **Tool call success rate** ≥95% for common tools

---

### 3. Prefix Caching (P0)

**Objective**: Cache and reuse KV cache for repeated prompt prefixes (system prompts, RAG documents).

#### Core Capabilities
- [ ] **Hash-based prefix lookup**
  - SHA-256 hash of token IDs
  - Fast O(1) cache hit detection

- [ ] **Radix tree implementation**
  - Efficient storage for overlapping prefixes
  - Longest common prefix matching
  - Memory-efficient sharing

- [ ] **KV cache copy-on-write**
  - Share read-only cache entries
  - Copy only on divergence
  - Zero-copy for cache hits

- [ ] **LRU eviction policy**
  - Evict least recently used prefixes
  - Configurable cache size
  - Per-model cache isolation

- [ ] **Memory limits**
  - Hard limit on cache size (bytes)
  - Soft limit with warning
  - Graceful degradation

- [ ] **Cache hit/miss metrics**
  - Prometheus metrics
  - Hit rate tracking
  - Memory usage stats

- [ ] **Chat prefix caching**
  - System prompt caching
  - Conversation history caching
  - Automatic prefix detection

- [ ] **RAG document caching**
  - Document chunk prefixes
  - Query-independent context
  - Multi-query reuse
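
The longest-common-prefix matching at the heart of the lookup can be sketched as below. This is a hypothetical linear-scan illustration; a real implementation would use the radix tree described above, but the hit/miss semantics are the same:

```rust
// Illustrative sketch: find how many leading tokens of `prompt` are
// already covered by some cached token sequence. `longest_cached_prefix`
// is a hypothetical name, not a ruvllm API.
fn longest_cached_prefix(prompt: &[u32], cached: &[Vec<u32>]) -> usize {
    cached
        .iter()
        .map(|entry| {
            prompt
                .iter()
                .zip(entry.iter())
                .take_while(|(a, b)| a == b)
                .count()
        })
        .max()
        .unwrap_or(0)
}

fn main() {
    // Two cached entries sharing the system-prompt tokens [1, 2, 3].
    let cache = vec![vec![1, 2, 3, 4], vec![1, 2, 3, 9, 9]];
    // A new prompt reuses the first 4 cached tokens of entry 0.
    assert_eq!(longest_cached_prefix(&[1, 2, 3, 4, 5], &cache), 4);
    // An unrelated prompt is a full cache miss.
    assert_eq!(longest_cached_prefix(&[7, 8], &cache), 0);
}
```

Only the tokens past the matched prefix need a prefill pass; the matched portion reuses the stored KV blocks.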

#### Example API
```rust
// First request - cache miss
let response1 = llm.generate(GenerateRequest {
    prompt: "System: You are a helpful assistant.\nUser: Hello",
    cache_prefix: Some(CacheConfig {
        enable: true,
        key: Some("chat-system-prompt".into()),
        ttl_seconds: Some(3600),
    }),
    ..Default::default()
})?;
// Latency: 500ms

// Second request - cache hit (reuses "System: You are..." KV cache)
let response2 = llm.generate(GenerateRequest {
    prompt: "System: You are a helpful assistant.\nUser: How are you?",
    cache_prefix: Some(CacheConfig {
        enable: true,
        key: Some("chat-system-prompt".into()),
        ttl_seconds: Some(3600),
    }),
    ..Default::default()
})?;
// Latency: 50ms (10x faster)
```

#### Performance Targets
- [ ] **10x speedup** for repeated system prompts (cache hit)
- [ ] **<5% overhead** for cache miss
- [ ] **Memory-bounded** (configurable, default 2GB)
- [ ] **Thread-safe** for concurrent requests
- [ ] **Hit rate ≥80%** for typical chat/RAG workloads

#### Acceptance Criteria
- [ ] **Speedup**: ≥10x for 1024-token prefix reuse
- [ ] **Memory**: Bounded by config, no OOM
- [ ] **Correctness**: Identical outputs for cached vs uncached
- [ ] **Concurrency**: No race conditions (stress tested)
- [ ] **Metrics**: Prometheus metrics exported

---

## Technical Design

### Architecture Overview

```
┌─────────────────────────────────────────────────────────┐
│                       RuvLLM v2.4                       │
├─────────────────────────────────────────────────────────┤
│  ┌─────────────────┐  ┌──────────────┐  ┌────────────┐  │
│  │   Structured    │  │   Function   │  │   Prefix   │  │
│  │  Output Engine  │  │   Calling    │  │   Cache    │  │
│  │                 │  │    Router    │  │  Manager   │  │
│  │ - JSON Schema   │  │ - Tool Defs  │  │ - Radix    │  │
│  │ - Logit Bias    │  │ - ToolChoice │  │   Tree     │  │
│  │ - State Machine │  │ - Multi-call │  │ - LRU      │  │
│  └────────┬────────┘  └──────┬───────┘  └─────┬──────┘  │
│           │                  │                │         │
│           └──────────────────┼────────────────┘         │
│                              │                          │
│                    ┌─────────▼──────────┐               │
│                    │  mistral-rs Core   │               │
│                    │ - Model Loading    │               │
│                    │ - Token Sampling   │               │
│                    │ - KV Cache         │               │
│                    └────────────────────┘               │
└─────────────────────────────────────────────────────────┘
```

### Reference ADRs

- **ADR-009**: Structured Output Implementation
  - Constrained decoding algorithm
  - JSON schema validation approach
  - Performance optimization strategies

- **ADR-010**: Function Calling Architecture
  - Tool definition format
  - Multi-model compatibility layer
  - Parallel execution model

- **ADR-011**: Prefix Caching Design
  - Radix tree structure
  - Eviction policies
  - Memory management

- **ADR-008**: mistral-rs Integration
  - Dependency structure
  - API surface
  - Migration path

---

## Implementation Plan

### Phase 1: Foundation (Weeks 1-2)
**Focus**: Structured Output basics + Function Calling definitions

- [ ] Week 1: JSON schema parser and validator
  - Implement schema types (object, array, string, number, boolean, null)
  - Unit tests for all types
  - Property-based tests

- [ ] Week 2: Constrained decoding MVP
  - Logit bias implementation
  - Simple state machine (braces, brackets)
  - Integration with mistral-rs sampler
  - Basic function calling types (Tool, ToolChoice enums)

**Deliverable**: JSON mode works for simple schemas, tool definitions parsed

---

### Phase 2: Core Logic (Weeks 3-4)
**Focus**: Constrained decoding + Tool generation

- [ ] Week 3: Advanced constrained decoding
  - Nested schema support
  - String pattern matching
  - Enum constraints
  - JSON repair mode

- [ ] Week 4: Tool call generation
  - Llama 3.1 format support
  - Mistral format support
  - Parallel tool calls
  - OpenAI API compatibility layer

**Deliverable**: Complex JSON schemas work, tool calls generated in OpenAI format

---

### Phase 3: Caching + Polish (Weeks 5-6)
**Focus**: Prefix Caching + Integration tests

- [ ] Week 5: Prefix caching implementation
  - Radix tree structure
  - Hash-based lookup
  - LRU eviction
  - Thread safety (RwLock)

- [ ] Week 6: Integration + benchmarks
  - LangChain integration tests
  - RAG workflow tests
  - Performance benchmarks
  - Documentation
  - Example applications

**Deliverable**: All 3 features production-ready, benchmarked, documented

---

## Testing Strategy

### Unit Tests
- JSON schema validation (all types, nested, constraints)
- Logit bias correctness
- Tool definition parsing
- Prefix cache hit/miss logic
- Radix tree operations

### Integration Tests
- LangChain AgentExecutor with tools
- LlamaIndex ReAct agent
- CrewAI multi-agent workflows
- OpenAI SDK compatibility tests

### Benchmarks
- Structured output latency vs unconstrained
- Tool calling accuracy (% correct tool selections)
- Prefix cache speedup (1x, 10x, 100x reuse)
- Memory usage under load

### Stress Tests
- 1000 concurrent requests with caching
- Deeply nested JSON schemas (depth 20)
- Large tool libraries (100+ tools)
- Multi-turn tool conversations (50+ turns)

---

## Success Metrics

### Structured Output
- [ ] **Validity**: 100% valid JSON when `strict: true`
- [ ] **Overhead**: <10% latency vs unconstrained
- [ ] **Schema compliance**: 100% for depth ≤10 schemas
- [ ] **Repair rate**: ≥95% successful repairs

### Function Calling
- [ ] **Compatibility**: Passes OpenAI SDK test suite
- [ ] **LangChain**: Works with AgentExecutor (5+ examples)
- [ ] **Accuracy**: ≥95% correct tool selection (benchmark dataset)
- [ ] **Parallel calls**: Supports ≥5 concurrent tools

### Prefix Caching
- [ ] **Speedup**: 10x for 1024-token prefix, 100x for 4096-token
- [ ] **Hit rate**: ≥80% for chat workloads
- [ ] **Memory**: Bounded, no OOM under stress
- [ ] **Correctness**: 100% identical outputs (cached vs uncached)

---

## Dependencies

### Upstream
- **mistral-rs v0.4.x** (ADR-008)
  - KV cache access for prefix caching
  - Token sampling hooks for logit bias
  - Model loading infrastructure

### Downstream
- **Enables**: Agentic workflow support
- **Enables**: LangChain/LlamaIndex/CrewAI integration
- **Blocks**: v2.4 release
- **Blocks**: Production adoption by agent frameworks

---

## Related Issues

- Depends on: #XXX (mistral-rs integration ADR-008)
- Enables: #XXX (Agentic workflow support)
- Enables: #XXX (LangChain integration)
- Blocks: #XXX (v2.4 release milestone)

---

## Documentation Requirements

- [ ] API reference docs (rustdoc)
- [ ] User guides for each feature
  - "How to use JSON mode"
  - "How to define tools"
  - "How to enable prefix caching"
- [ ] Migration guide from v2.3
- [ ] Example applications
  - Structured extraction (NER, info extraction)
  - Multi-tool agent (ReAct loop)
  - RAG with caching (chatbot)
- [ ] Performance tuning guide

---

## Open Questions

1. **JSON Schema**: Full Draft 7 or a subset? (Propose: core subset + extensions)
2. **Tool formats**: Support all models or Llama 3.1+ only? (Propose: Llama 3.1+ with adapters)
3. **Cache eviction**: LRU vs LFU vs TTL-based? (Propose: LRU + TTL)
4. **Memory limit**: Default cache size? (Propose: 2GB default, configurable)
5. **Breaking changes**: Any API changes needed? (Propose: additive only, no breaks)

---

## Future Enhancements (Post-v2.4)

- **Structured Output**:
  - GBNF grammar support (custom DSLs)
  - Regex-constrained strings
  - Speculative decoding for constrained generation

- **Function Calling**:
  - Async/streaming tool execution
  - Tool result validation
  - Tool dependency graphs

- **Prefix Caching**:
  - Cross-request caching (shared cache pool)
  - Disk-backed cache (persist across restarts)
  - Distributed caching (Redis/memcached)

---

## Timeline Summary

| Phase | Duration | Focus | Deliverable |
|-------|----------|-------|-------------|
| 1 | Weeks 1-2 | Structured Output + Tool Definitions | JSON mode MVP, tool parsing |
| 2 | Weeks 3-4 | Constrained Decoding + Tool Generation | Complex schemas, tool calls |
| 3 | Weeks 5-6 | Prefix Caching + Integration | Production-ready, benchmarked |

**Total**: 6 weeks to production-ready v2.4

---

## Getting Involved

### For Contributors
- Pick a task from the checkboxes above
- Comment on this issue to claim a feature
- Follow the implementation plan phases
- Submit PRs with tests + benchmarks

### For Reviewers
- Focus on correctness (JSON validity, cache correctness)
- Performance regression checks (<10% overhead target)
- API design feedback (before Week 3)

### For Testers
- Test with real-world agent workflows
- Report edge cases and failure modes
- Benchmark on your hardware/models

---

**Let's close the gap with vLLM/llama.cpp and make RuvLLM the best choice for production agentic workflows!** 🚀
New file: crates/ruvllm/docs/GITHUB_ISSUE_V2.md (599 lines)

# 🚀 RuvLLM v2.0 - High-Performance LLM Inference for Apple Silicon

[crates.io](https://crates.io/crates/ruvllm)
[npm](https://www.npmjs.com/package/@aspect/ruvllm)
[docs](https://ruv.io/docs/ruvllm)
[license](LICENSE)
[CI](https://github.com/aspect/ruvector/actions)
[Discord](https://discord.gg/ruv)

<p align="center">
  <img src="https://ruv.io/assets/ruvllm-logo.svg" alt="RuvLLM" width="200"/>
  <br/>
  <strong>Run Large Language Models locally on your Mac with maximum performance</strong>
  <br/>
  <a href="https://ruv.io/ruvllm">Website</a> •
  <a href="https://ruv.io/docs/ruvllm">Documentation</a> •
  <a href="https://discord.gg/ruv">Discord</a> •
  <a href="https://twitter.com/raboruv">Twitter</a>
</p>

---
|
||||
|
||||
## What is RuvLLM?

**RuvLLM** is a blazing-fast LLM inference engine built in Rust, specifically optimized for Apple Silicon Macs (M1/M2/M3/M4). It lets you run AI models like Llama, Mistral, Phi, and Gemma directly on your laptop — no cloud, no API costs, complete privacy.

### Why RuvLLM?

- **🔥 Fast** — 40+ tokens/second on M4 Pro with optimized Metal shaders
- **🍎 Apple Silicon Native** — Uses Metal GPU, Apple Neural Engine (ANE), and ARM NEON
- **🔒 Private** — Everything runs locally, your data never leaves your device
- **📦 Easy** — One command to install, one line to run
- **🌐 Cross-Platform** — Works in Rust, Node.js, and browsers via WebAssembly

---
## ✨ Key Features

### Core Capabilities

| Feature | Description |
|---------|-------------|
| **Multi-Backend Support** | Metal GPU, Core ML (ANE), CPU with NEON SIMD |
| **Quantization** | Q4, Q5, Q8 quantized models (4-8x memory savings) |
| **GGUF Support** | Load models directly from Hugging Face in GGUF format |
| **Streaming** | Real-time token-by-token generation |
| **Continuous Batching** | Efficient multi-request handling |
| **KV Cache** | Optimized key-value cache with paged attention |
| **Speculative Decoding** | 1.5-2x speedup with draft models |

### v2.0 New Features

| Feature | Improvement |
|---------|-------------|
| **Apple Neural Engine** | 38 TOPS dedicated ML acceleration on M4 Pro |
| **Hybrid GPU+ANE Pipeline** | Best of both worlds for optimal throughput |
| **Flash Attention v2** | 2.5-7.5x faster attention computation |
| **SONA Learning** | Self-optimizing neural architecture for adaptive inference |
| **Ruvector Integration** | Built-in vector embeddings for RAG applications |

---
## 🚀 Quickstart

### Rust (Cargo)

```bash
# Add to Cargo.toml
cargo add ruvllm --features inference-metal
```

```rust
use ruvllm::{Engine, GenerateParams};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Load a model (downloads automatically from Hugging Face)
    let engine = Engine::from_pretrained("microsoft/Phi-3-mini-4k-instruct-gguf")?;

    // Generate text
    let response = engine.generate(
        "Explain quantum computing in simple terms:",
        GenerateParams::default()
    )?;

    println!("{}", response);
    Ok(())
}
```
### Node.js (npm)

```bash
npm install @aspect/ruvllm
```

```javascript
import { RuvLLM } from '@aspect/ruvllm';

// Initialize with a model
const llm = await RuvLLM.fromPretrained('microsoft/Phi-3-mini-4k-instruct-gguf');

// Generate text
const response = await llm.generate('Explain quantum computing in simple terms:');
console.log(response);

// Or stream tokens
for await (const token of llm.stream('Write a haiku about coding:')) {
  process.stdout.write(token);
}
```
### CLI

```bash
# Install CLI
cargo install ruvllm-cli

# Run interactively
ruvllm chat --model microsoft/Phi-3-mini-4k-instruct-gguf

# One-shot generation
ruvllm generate "What is the meaning of life?" --model phi-3
```

---
<details>
<summary><h2>📚 Tutorials</h2></summary>

### Tutorial 1: Building a Local Chatbot

Create a simple chatbot that runs entirely on your Mac:

```rust
use std::io::Write;

use ruvllm::{Engine, GenerateParams, ChatMessage};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let engine = Engine::from_pretrained("meta-llama/Llama-3.2-1B-Instruct-GGUF")?;

    let mut history = vec![];

    loop {
        print!("You: ");
        std::io::stdout().flush()?; // ensure the prompt appears before read_line blocks
        let mut input = String::new();
        std::io::stdin().read_line(&mut input)?;

        history.push(ChatMessage::user(input.trim()));

        let response = engine.chat(&history, GenerateParams {
            max_tokens: 512,
            temperature: 0.7,
            ..Default::default()
        })?;

        println!("AI: {}", response);
        history.push(ChatMessage::assistant(&response));
    }
}
```
### Tutorial 2: Streaming Responses in Node.js

Build a real-time streaming API:

```javascript
import { RuvLLM } from '@aspect/ruvllm';
import express from 'express';

const app = express();
const llm = await RuvLLM.fromPretrained('phi-3-mini');

app.get('/stream', async (req, res) => {
  const prompt = req.query.prompt;

  res.setHeader('Content-Type', 'text/event-stream');
  res.setHeader('Cache-Control', 'no-cache');

  for await (const token of llm.stream(prompt)) {
    res.write(`data: ${JSON.stringify({ token })}\n\n`);
  }

  res.write('data: [DONE]\n\n');
  res.end();
});

app.listen(3000);
```
### Tutorial 3: RAG with Ruvector

Combine RuvLLM with Ruvector for retrieval-augmented generation:

```rust
use ruvllm::Engine;
use ruvector_core::{VectorDb, HnswConfig};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Initialize vector database
    let db = VectorDb::new(HnswConfig::default())?;

    // Initialize LLM
    let llm = Engine::from_pretrained("phi-3-mini")?;

    // Add documents (embeddings generated automatically)
    db.add_document("doc1", "RuvLLM is a fast LLM inference engine.")?;
    db.add_document("doc2", "It supports Metal GPU acceleration.")?;

    // Query and generate
    let query = "What is RuvLLM?";
    let context = db.search(query, 3)?;

    let prompt = format!(
        "Context:\n{}\n\nQuestion: {}\nAnswer:",
        context.iter().map(|d| d.text.as_str()).collect::<Vec<_>>().join("\n"),
        query
    );

    let response = llm.generate(&prompt, Default::default())?;
    println!("{}", response);
    Ok(())
}
```
### Tutorial 4: Browser-Based Inference (WebAssembly)

Run models directly in the browser:

```html
<!DOCTYPE html>
<html>
<head>
  <script type="module">
    import init, { RuvLLM } from 'https://unpkg.com/@aspect/ruvllm-wasm/ruvllm.js';

    async function main() {
      await init();

      const llm = await RuvLLM.fromUrl('/models/phi-3-mini-q4.gguf');

      const output = document.getElementById('output');

      for await (const token of llm.stream('Write a poem about the web:')) {
        output.textContent += token;
      }
    }

    main();
  </script>
</head>
<body>
  <pre id="output"></pre>
</body>
</html>
```

</details>

---
<details>
<summary><h2>🔧 Advanced Usage</h2></summary>

### Custom Model Configuration

Fine-tune model loading for your specific hardware:

```rust
use ruvllm::{Engine, ModelConfig, ComputeBackend, Quantization};

let engine = Engine::builder()
    .model_path("/path/to/model.gguf")
    .backend(ComputeBackend::Metal)   // Use Metal GPU
    .quantization(Quantization::Q4K)  // 4-bit quantization
    .context_length(8192)             // Max context
    .num_gpu_layers(32)               // Layers on GPU
    .use_flash_attention(true)        // Enable Flash Attention
    .build()?;
```
### Apple Neural Engine (ANE) Configuration

Leverage the dedicated ML accelerator on Apple Silicon:

```rust
use ruvllm::{CoreMLBackend, ComputeUnits, ModelConfig, GenerateParams};
use ruvllm::tokenizer::RuvTokenizer;

// Create Core ML backend with ANE
let tokenizer = RuvTokenizer::from_file("./tokenizer.json")?;
let backend = CoreMLBackend::new()?
    .with_compute_units(ComputeUnits::CpuAndNeuralEngine) // Use ANE
    .with_tokenizer(tokenizer);

// Load Core ML model
backend.load_model("model.mlmodelc", ModelConfig::default())?;

// Generate (uses ANE for MLP, GPU for attention)
let response = backend.generate("Hello", GenerateParams::default())?;
```
### Hybrid GPU + ANE Pipeline

Maximize throughput with intelligent workload distribution:

```rust
use ruvllm::kernels::get_ane_recommendation;

// Check if ANE is beneficial for your matrix size
let (batch_size, hidden_dim, vocab_size) = (8, 4096, 32_000);
let recommendation = get_ane_recommendation(batch_size, hidden_dim, vocab_size);

if recommendation.use_ane {
    println!("Using ANE: {} (confidence: {:.0}%)",
        recommendation.reason,
        recommendation.confidence * 100.0);
}
```
### Continuous Batching Server

Build a high-throughput inference server:

```rust
use ruvllm::serving::{
    ContinuousBatchScheduler, KvCacheManager, InferenceRequest, SchedulerConfig,
    PreemptionMode, KvCachePoolConfig,
};

let config = SchedulerConfig {
    max_batch_size: 32,
    max_tokens_per_batch: 4096,
    preemption_mode: PreemptionMode::Swap,
    ..Default::default()
};

let mut scheduler = ContinuousBatchScheduler::new(config);
let mut kv_cache = KvCacheManager::new(KvCachePoolConfig::default());

// Add requests
scheduler.add_request(InferenceRequest::new(tokens, params));

// Process batches
while let Some(batch) = scheduler.schedule() {
    // Execute batch inference
    let outputs = engine.forward_batch(&batch)?;

    // Update scheduler with results
    scheduler.update(outputs);
}
```
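The scheduler's core admission decision, which waiting requests fit under `max_batch_size` and `max_tokens_per_batch`, can be sketched as a greedy pass. This is a simplified illustration with plain token counts standing in for real requests; the actual `ContinuousBatchScheduler` also tracks KV-cache blocks, priorities, and preemption:

```rust
/// Greedy admission: pack waiting requests into a batch while both the
/// request-count and token budgets hold. Returns the indices admitted.
fn schedule(waiting: &[usize], max_batch: usize, max_tokens: usize) -> Vec<usize> {
    let mut batch = Vec::new();
    let mut tokens = 0;
    for (i, &len) in waiting.iter().enumerate() {
        if batch.len() == max_batch {
            break; // batch-size budget exhausted
        }
        if tokens + len <= max_tokens {
            tokens += len;
            batch.push(i); // admit request i
        }
    }
    batch
}

fn main() {
    // Three prompts of 1000/3000/500 tokens, 4096-token budget, batch ≤ 32:
    // the third request must wait for the next scheduling round.
    let batch = schedule(&[1000, 3000, 500], 32, 4096);
    println!("admitted: {batch:?}");
}
```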
### Speculative Decoding

Speed up generation with draft models:

```rust
use ruvllm::speculative::{SpeculativeDecoder, SpeculativeConfig};

let config = SpeculativeConfig {
    draft_model: "phi-3-mini-draft", // Small, fast model
    target_model: "phi-3-medium",    // Large, accurate model
    num_speculative_tokens: 4,       // Tokens to speculate
    temperature: 0.8,
};

let decoder = SpeculativeDecoder::new(config)?;

// 1.5-2x faster than standard decoding
let response = decoder.generate("Explain relativity:", params)?;
```
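The speedup comes from letting the draft model propose `num_speculative_tokens` tokens cheaply and having the target model verify them, keeping the longest agreeing prefix plus one correction. A toy greedy version of that loop, with deterministic next-token closures standing in for the two models (a sketch of the general technique, not RuvLLM's sampler):

```rust
/// Greedy speculative decoding: draft proposes `k` tokens, target verifies
/// them in order; matches are kept, and the first mismatch is replaced by
/// the target's own token. Returns how many draft tokens were accepted.
fn speculate(
    prefix: &mut Vec<u32>,
    draft: &dyn Fn(&[u32]) -> u32,
    target: &dyn Fn(&[u32]) -> u32,
    k: usize,
) -> usize {
    // 1. Draft proposes k tokens autoregressively.
    let base = prefix.len();
    let mut proposed = prefix.clone();
    for _ in 0..k {
        let t = draft(&proposed);
        proposed.push(t);
    }
    // 2. Target verifies each proposal against its own greedy choice.
    let mut accepted = 0;
    for i in 0..k {
        let correct = target(&proposed[..base + i]);
        if correct == proposed[base + i] {
            prefix.push(correct);
            accepted += 1;
        } else {
            prefix.push(correct); // target's correction still yields a token
            break;
        }
    }
    accepted
}

fn main() {
    // Toy models over token ids: the target counts up; the draft agrees
    // until it sees a token ≥ 3, then derails to 99.
    let target = |s: &[u32]| s.last().copied().unwrap_or(0) + 1;
    let draft = |s: &[u32]| {
        let last = s.last().copied().unwrap_or(0);
        if last >= 3 { 99 } else { last + 1 }
    };
    let mut prefix = vec![1u32];
    let accepted = speculate(&mut prefix, &draft, &target, 4);
    println!("accepted={accepted}, prefix={prefix:?}");
}
```

Even with two rejected draft tokens, the target emits three new tokens for what would batch into a single verification pass, which is where the 1.5-2x figure comes from.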
### Custom Tokenizer

Use custom tokenizers for specialized models:

```rust
use ruvllm::tokenizer::{RuvTokenizer, TokenizerConfig};

// Load from HuggingFace
let tokenizer = RuvTokenizer::from_pretrained("meta-llama/Llama-3.2-1B")?;

// Or from local file
let tokenizer = RuvTokenizer::from_file("./tokenizer.json")?;

// Encode/decode
let tokens = tokenizer.encode("Hello, world!")?;
let text = tokenizer.decode(&tokens)?;

// With chat template
let formatted = tokenizer.apply_chat_template(&[
    ChatMessage::system("You are a helpful assistant."),
    ChatMessage::user("What is 2+2?"),
])?;
```
### Memory Optimization

Optimize for large models on limited memory:

```rust
use ruvllm::{Engine, MemoryConfig, DType};

let engine = Engine::builder()
    .model_path("llama-70b.gguf")
    .memory_config(MemoryConfig {
        max_memory_gb: 24.0,        // Limit memory usage
        offload_to_cpu: true,       // Offload layers to CPU
        use_mmap: true,             // Memory-map model file
        kv_cache_dtype: DType::F16, // Half-precision KV cache
    })
    .build()?;
```
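The `kv_cache_dtype` choice matters because KV cache size grows linearly with context length: 2 tensors (K and V) × layers × KV heads × head dim × sequence length × bytes per element. A quick sizing helper (the model shape below is an illustrative Llama-7B-like configuration, not a measured RuvLLM value):

```rust
/// Total KV cache bytes for a dense-attention model.
fn kv_cache_bytes(layers: u64, kv_heads: u64, head_dim: u64, seq: u64, elem_bytes: u64) -> u64 {
    2 * layers * kv_heads * head_dim * seq * elem_bytes
}

fn main() {
    // 32 layers, 32 KV heads, head_dim 128, 4096-token context, f16 (2 bytes).
    let bytes = kv_cache_bytes(32, 32, 128, 4096, 2);
    // 2 * 32 * 32 * 128 * 4096 * 2 = 2^31 bytes = exactly 2 GiB
    println!("KV cache: {:.1} GiB", bytes as f64 / (1u64 << 30) as f64);
}
```

Halving `elem_bytes` (f32 → f16) or the context length halves the cache, which is why these are the first knobs to turn on memory-limited machines.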
### Embeddings for RAG

Generate embeddings for retrieval applications:

```rust
use ruvllm::Engine;

let engine = Engine::from_pretrained("nomic-embed-text-v1.5")?;

// Single embedding
let embedding = engine.embed("What is machine learning?")?;

// Batch embeddings
let embeddings = engine.embed_batch(&[
    "Document 1 content",
    "Document 2 content",
    "Document 3 content",
])?;

// Cosine similarity
let similarity = ruvector_core::cosine_similarity(&embedding, &embeddings[0]);
```
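For reference, cosine similarity reduces to dot(a, b) / (‖a‖ · ‖b‖). A plain-Rust equivalent of the `cosine_similarity` call above (the shipped `ruvector_core` version is presumably SIMD-accelerated; this sketch just shows the math):

```rust
/// Cosine similarity of two equal-length vectors, in [-1, 1] for
/// non-zero inputs; 0.0 is returned for a zero vector.
fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    assert_eq!(a.len(), b.len());
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let na = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let nb = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if na == 0.0 || nb == 0.0 { 0.0 } else { dot / (na * nb) }
}

fn main() {
    // Identical directions → 1.0; orthogonal → 0.0; opposite → -1.0.
    println!("{}", cosine_similarity(&[1.0, 0.0], &[1.0, 0.0]));
    println!("{}", cosine_similarity(&[1.0, 0.0], &[0.0, 1.0]));
}
```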
### Node.js Advanced Configuration

```javascript
import { RuvLLM, ModelConfig, ComputeBackend } from '@aspect/ruvllm';

const llm = await RuvLLM.create({
  modelPath: './models/phi-3-mini-q4.gguf',
  backend: ComputeBackend.Metal,
  contextLength: 8192,
  numGpuLayers: 32,
  flashAttention: true,

  // Callbacks
  onToken: (token) => process.stdout.write(token),
  onProgress: (progress) => console.log(`Loading: ${progress}%`),
});

// Structured output (JSON mode)
const result = await llm.generate('List 3 colors', {
  responseFormat: 'json',
  schema: {
    type: 'object',
    properties: {
      colors: { type: 'array', items: { type: 'string' } }
    }
  }
});

console.log(JSON.parse(result)); // { colors: ['red', 'blue', 'green'] }
```

</details>

---
## 📊 Performance Benchmarks

Tested on M4 Pro (14-core CPU, 20-core GPU, 38 TOPS ANE):

### Model Inference Speed

| Model | Size | Quantization | Tokens/sec | Memory |
|-------|------|--------------|------------|--------|
| Phi-3 Mini | 3.8B | Q4_K_M | 52 t/s | 2.4 GB |
| Llama 3.2 | 1B | Q4_K_M | 78 t/s | 0.8 GB |
| Llama 3.2 | 3B | Q4_K_M | 45 t/s | 2.1 GB |
| Mistral 7B | 7B | Q4_K_M | 28 t/s | 4.2 GB |
| Gemma 2 | 9B | Q4_K_M | 22 t/s | 5.8 GB |
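The Memory column can be sanity-checked from first principles: a Q4_K_M model stores roughly 4.5 bits per weight, plus runtime overhead for embeddings, KV cache, and buffers. A rough estimator (the 4.5 bits/weight and 15% overhead figures are assumptions for illustration, not measured RuvLLM values):

```rust
/// Rough resident-memory estimate for a quantized model:
/// parameters × bits-per-weight, scaled by a runtime-overhead factor.
fn model_memory_gb(params_billion: f64, bits_per_weight: f64, overhead: f64) -> f64 {
    params_billion * bits_per_weight / 8.0 * (1.0 + overhead)
}

fn main() {
    // Assumed Q4_K_M average of ~4.5 bits/weight and ~15% overhead.
    for (name, params) in [("Phi-3 Mini", 3.8), ("Mistral 7B", 7.2)] {
        println!("{name}: ~{:.1} GB", model_memory_gb(params, 4.5, 0.15));
    }
}
```

The estimates land in the same range as the measured column, which is a useful quick check before downloading a model.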
### 🔥 ANE vs NEON Matrix Multiply (NEW in v2.0)

| Dimension | ANE | NEON | Speedup |
|-----------|-----|------|---------|
| 768×768 | 400 µs | 104 ms | **261x** |
| 1024×1024 | 1.2 ms | 283 ms | **243x** |
| 1536×1536 | 3.4 ms | 1,028 ms | **306x** |
| 2048×2048 | 8.5 ms | 4,020 ms | **473x** |
| 3072×3072 | 28.2 ms | 15,240 ms | **541x** |
| 4096×4096 | 66.1 ms | 65,428 ms | **989x** |
### Hybrid Pipeline Performance

| Mode | seq=128 | seq=512 | vs NEON |
|------|---------|---------|---------|
| **Pure ANE** | 35.9 ms | 112.9 ms | **460x faster** |
| Hybrid | 862 ms | 3,195 ms | 19x faster |
| Pure NEON | 16,529 ms | 66,539 ms | baseline |

### Activation Functions (SiLU/GELU)

| Size | NEON | ANE | Winner |
|------|------|-----|--------|
| 32×4096 | 70 µs | 152 µs | NEON 2.2x |
| 64×4096 | 141 µs | 303 µs | NEON 2.1x |
| 128×4096 | 284 µs | 613 µs | NEON 2.2x |

**Auto-dispatch** correctly routes: ANE for matmul ≥768 dims, NEON for activations.
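That routing rule is easy to state as code. A standalone sketch of the heuristic (an illustration of the stated policy only, not RuvLLM's internal dispatcher):

```rust
/// Backend choice mirroring the benchmark-driven rule: large matrix
/// multiplies go to the ANE, everything else (including elementwise
/// activations like SiLU/GELU) stays on NEON.
#[derive(Debug, PartialEq)]
enum Backend {
    Ane,
    Neon,
}

fn dispatch(op_is_matmul: bool, min_dim: usize) -> Backend {
    if op_is_matmul && min_dim >= 768 {
        Backend::Ane
    } else {
        Backend::Neon
    }
}

fn main() {
    println!("{:?}", dispatch(true, 4096));  // big matmul → Ane
    println!("{:?}", dispatch(true, 512));   // small matmul → Neon
    println!("{:?}", dispatch(false, 4096)); // activation → Neon
}
```

The 768 threshold matches the smallest dimension in the matmul table above where the ANE already wins decisively.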
### Quantization Performance

| Dimension | Encode | Hamming Distance |
|-----------|--------|------------------|
| 128-dim | 0.1 µs | <0.1 µs |
| 384-dim | 0.3 µs | <0.1 µs |
| 768-dim | 0.5 µs | <0.1 µs |
| 1536-dim | 1.0 µs | <0.1 µs |

*Benchmarks run with Criterion.rs, 50 samples per test, M4 Pro 48GB.*
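The sub-microsecond Hamming distances come from the standard binary-quantization trick: pack one sign bit per component into machine words, then compare vectors with XOR + popcount. A scalar sketch of the technique (the shipped kernels are presumably SIMD-accelerated):

```rust
/// Pack each f32's sign into one bit: non-negative → 1, negative → 0.
fn binary_quantize(v: &[f32]) -> Vec<u64> {
    let mut out = vec![0u64; (v.len() + 63) / 64];
    for (i, &x) in v.iter().enumerate() {
        if x >= 0.0 {
            out[i / 64] |= 1u64 << (i % 64);
        }
    }
    out
}

/// Hamming distance via XOR + popcount: one word-sized op per 64 dims.
fn hamming(a: &[u64], b: &[u64]) -> u32 {
    a.iter().zip(b).map(|(x, y)| (x ^ y).count_ones()).sum()
}

fn main() {
    let a = binary_quantize(&[0.5, -1.0, 2.0, -0.1]);
    let b = binary_quantize(&[0.5, 1.0, 2.0, -0.1]);
    println!("distance = {}", hamming(&a, &b)); // the vectors differ in one sign
}
```

A 1536-dim vector packs into 24 u64 words, so a full comparison is a few dozen XOR/popcount instructions, which is why the distance column stays below 0.1 µs.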
---

## 🔌 Supported Models

RuvLLM supports any model in GGUF format. Popular options:

- **Llama 3.2** (1B, 3B) — Meta's latest efficient models
- **Phi-3** (Mini, Small, Medium) — Microsoft's powerful small models
- **Mistral 7B** — Excellent quality-to-size ratio
- **Gemma 2** (2B, 9B, 27B) — Google's open models
- **Qwen 2.5** (0.5B-72B) — Alibaba's multilingual models
- **DeepSeek Coder** — Specialized for code generation

Download models from [Hugging Face](https://huggingface.co/models?library=gguf).

---

## 🛠️ Installation

### Rust

```toml
[dependencies]
ruvllm = { version = "2.0", features = ["inference-metal"] }

# Or with all features
ruvllm = { version = "2.0", features = ["inference-metal", "coreml", "speculative"] }
```

Available features:

- `inference-metal` — Metal GPU acceleration (recommended for Mac)
- `inference-cuda` — CUDA acceleration (for NVIDIA GPUs)
- `coreml` — Apple Neural Engine via Core ML
- `speculative` — Speculative decoding support
- `async-runtime` — Async/await support with Tokio

### Node.js

```bash
npm install @aspect/ruvllm
# or
yarn add @aspect/ruvllm
# or
pnpm add @aspect/ruvllm
```

### From Source

```bash
git clone https://github.com/aspect/ruvector
cd ruvector/crates/ruvllm
cargo build --release --features inference-metal
```

---

## 🤝 Contributing

We welcome contributions! See [CONTRIBUTING.md](CONTRIBUTING.md) for guidelines.

- 🐛 [Report bugs](https://github.com/aspect/ruvector/issues/new?template=bug_report.md)
- 💡 [Request features](https://github.com/aspect/ruvector/issues/new?template=feature_request.md)
- 📖 [Improve docs](https://github.com/aspect/ruvector/tree/main/docs)

---

## 📄 License

RuvLLM is dual-licensed under MIT and Apache 2.0. See [LICENSE-MIT](LICENSE-MIT) and [LICENSE-APACHE](LICENSE-APACHE).

---

<p align="center">
  Made with ❤️ by <a href="https://ruv.io">ruv.io</a>
  <br/>
  <sub>Part of the <a href="https://github.com/aspect/ruvector">Ruvector</a> ecosystem</sub>
</p>