# ADR-008: mistral-rs Integration for Production-Scale LLM Serving

**Status:** Proposed

**Date:** 2026-01-20

**Decision Makers:** Ruvector Architecture Team

**Technical Area:** LLM Inference Engine / Production Serving

---

## Context and Problem Statement

RuvLLM v2.3 includes a stub `MistralBackend` implementation at `crates/ruvllm/src/backends/mistral_backend.rs` that defines the interface for high-performance LLM inference but lacks an actual integration with the mistral-rs crate. The current Candle backend is optimized for single-user and edge deployment scenarios, while production-scale serving requires advanced memory management and multi-tenant capabilities.

### Current State

The existing `MistralBackend` stub provides:

- Configuration structures for PagedAttention, X-LoRA, and ISQ
- An `XLoraManager` with adapter loading/routing logic (placeholder)
- A `MistralBackendConfig` with a builder pattern for Metal/CUDA targets
- Integration hooks for the `LlmBackend` trait

However, the implementation is non-functional:

- No actual mistral-rs crate dependency
- Token generation returns placeholder values
- Model loading does not wire into the inference pipeline
- PagedAttention uses RuvLLM's internal implementation, not mistral-rs's optimized version

### Key Challenges

1. **Concurrent User Scaling**: The Candle backend is optimized for single-user inference; production servers need to handle 10-100+ concurrent requests
2. **KV Cache Memory Pressure**: Without vLLM-style paging, long-context sessions exhaust GPU memory
3. **Multi-Task Models**: LoRA adapter switching incurs per-request overhead; X-LoRA enables per-token routing
4. **Deployment Flexibility**: Models should be quantized at runtime based on available hardware
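To make the KV-cache pressure in challenge 2 concrete, here is a back-of-envelope sketch. The parameter values are illustrative of a Llama-style 7-8B model with grouped-query attention, not a specific RuvLLM deployment:

```rust
// Back-of-envelope KV-cache sizing (illustrative, not a specific model):
// the cache stores one K and one V vector per layer, per KV head, per token,
// so it grows linearly with context length.
fn kv_cache_bytes(
    layers: usize,
    kv_heads: usize,
    head_dim: usize,
    seq_len: usize,
    bytes_per_elem: usize,
) -> usize {
    // 2 accounts for the separate K and V tensors.
    2 * layers * kv_heads * head_dim * seq_len * bytes_per_elem
}
```

With 32 layers, 8 KV heads, head dimension 128, and fp16 elements, a single 32k-token session already consumes 4 GiB of KV cache before any weights are counted; a handful of such sessions exhausts a typical GPU unless pages are shared or reclaimed.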

---

## Decision Drivers

### Performance Requirements

- **Concurrent sessions**: 50-100 simultaneous inference requests
- **Memory efficiency**: 5-10x improvement in KV cache utilization
- **Adapter latency**: <1 ms overhead for X-LoRA routing decisions
- **Quantization**: Runtime ISQ without model re-export

### Compatibility Requirements

- **Existing interface**: Must implement the `LlmBackend` trait seamlessly
- **Feature isolation**: Optional dependency gated behind feature flags
- **Backend selection**: Per-deployment choice between Candle and mistral-rs

### Hardware Requirements

- **Apple Silicon**: Metal acceleration via `mistral-rs-metal`
- **NVIDIA GPUs**: CUDA acceleration via `mistral-rs-cuda`
- **CPU fallback**: Pure-Rust path for edge/WASM targets

---

## Considered Options
## Considered Options
|
||||
|
||||
### Option A: Fork and Embed mistral-rs
|
||||
|
||||
Vendor mistral-rs source code directly into RuvLLM.
|
||||
|
||||
**Pros:**
|
||||
- Full control over API surface
|
||||
- No external dependency versioning
|
||||
- Can customize for RuvLLM's needs
|
||||
|
||||
**Cons:**
|
||||
- Maintenance burden of tracking upstream
|
||||
- Miss upstream optimizations and fixes
|
||||
- Duplicated effort
|
||||
|
||||
### Option B: Optional Dependency with Feature Flags
|
||||
|
||||
Add mistral-rs as an optional dependency behind feature flags, wiring the existing `MistralBackend` interface to actual mistral-rs crate.
|
||||
|
||||
**Pros:**
|
||||
- Leverage upstream development
|
||||
- Clean separation via features
|
||||
- Users choose their backend at compile time
|
||||
- Smaller binary for edge deployments (Candle-only)
|
||||
|
||||
**Cons:**
|
||||
- API surface depends on upstream stability
|
||||
- Two codepaths to maintain
|
||||
- Feature matrix complexity
|
||||
|
||||
### Option C: Runtime Backend Selection
|
||||
|
||||
Use dynamic dispatch to select backend at runtime via configuration.
|
||||
|
||||
**Pros:**
|
||||
- Single binary for all deployments
|
||||
- Runtime flexibility
|
||||
|
||||
**Cons:**
|
||||
- Binary size includes all backends
|
||||
- Dynamic dispatch overhead
|
||||
- Complex testing matrix
|
||||
|
||||
---
|
||||
|
||||
## Decision Outcome

**Chosen Option: Option B - Optional Dependency with Feature Flags**

Add mistral-rs as an optional dependency with three feature flags, wiring the existing `MistralBackend` stub to the actual mistral-rs implementation.

### Rationale

1. **Separation of concerns**: Edge deployments use Candle (no mistral-rs dependency); server deployments enable mistral-rs features
2. **Upstream leverage**: The mistral-rs team maintains the PagedAttention, X-LoRA, and ISQ implementations
3. **Existing interface**: The `MistralBackend` stub already defines the API; we wire it to the real implementation
4. **Incremental adoption**: Users can migrate from the Candle backend to the mistral-rs backend per deployment

---

## Technical Specifications

### Feature Flags

```toml
# Cargo.toml additions
[features]
default = ["candle-backend"]

# Base mistral-rs integration
mistral-rs = ["dep:mistralrs", "dep:mistralrs-core"]

# Apple Silicon Metal acceleration
mistral-rs-metal = ["mistral-rs", "mistralrs/metal"]

# NVIDIA CUDA acceleration
mistral-rs-cuda = ["mistral-rs", "mistralrs/cuda"]

[dependencies]
# Optional mistral-rs integration
mistralrs = { version = "0.3", optional = true }
mistralrs-core = { version = "0.3", optional = true }
```
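As a minimal sketch of how the additive flags surface in code (the function below is hypothetical, not part of the existing RuvLLM API), a build can report which backend path it was compiled with. Because the flags are additive, enabling `mistral-rs` never removes the Candle path; it only adds the server path:

```rust
// Hypothetical sketch (not the existing RuvLLM API): compile-time backend
// selection via cfg. Builds without the `mistral-rs` feature keep the
// lightweight Candle path; builds with it gain the mistral-rs path.
pub fn compiled_backend() -> &'static str {
    #[cfg(feature = "mistral-rs")]
    {
        "mistral-rs"
    }
    #[cfg(not(feature = "mistral-rs"))]
    {
        "candle"
    }
}
```

A default build (no mistral-rs feature) resolves this to `"candle"` with zero runtime cost, which is the edge/WASM configuration described above.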

### Feature Matrix

| Feature | Candle | mistral-rs | mistral-rs-metal | mistral-rs-cuda |
|---------|--------|------------|------------------|-----------------|
| Single-user inference | Yes | Yes | Yes | Yes |
| PagedAttention | No | Yes | Yes | Yes |
| X-LoRA | No | Yes | Yes | Yes |
| ISQ | No | Yes | Yes | Yes |
| Metal acceleration | Yes | No | Yes | No |
| CUDA acceleration | Partial | No | No | Yes |
| WASM support | Yes | No | No | No |
| Binary size | ~15MB | ~45MB | ~50MB | ~60MB |

### Architecture

```
+-----------------------------------------------------------------------+
|                  MISTRAL-RS INTEGRATION ARCHITECTURE                  |
+-----------------------------------------------------------------------+
|                                                                       |
|  +-------------------+     +-------------------+     +--------------+ |
|  | MistralBackend    |     | mistralrs::Model  |     | Hardware     | |
|  | (RuvLLM adapter)  |     | (inference core)  |     | Accelerator  | |
|  |                   |     |                   |     |              | |
|  | - Config mapping  |---->| - PagedAttention  |---->| - Metal      | |
|  | - Trait impl      |     | - X-LoRA routing  |     | - CUDA       | |
|  | - Error handling  |     | - ISQ runtime     |     | - CPU        | |
|  +--------+----------+     +---------+---------+     +------+-------+ |
|           |                          |                      |         |
|           v                          v                      v         |
|  +--------+----------+     +---------+---------+     +------+-------+ |
|  | LlmBackend trait  |     | KV Cache Pool     |     | Tensor Ops   | |
|  | (RuvLLM unified)  |     | (PagedAttention)  |     | (kernels)    | |
|  +-------------------+     +-------------------+     +--------------+ |
|                                                                       |
+-----------------------------------------------------------------------+
```

### Key Features to Enable

#### 1. PagedAttention (vLLM-style KV Cache Management)

PagedAttention partitions the KV cache into fixed-size blocks (pages) that can be allocated non-contiguously, enabling:

- **5-10x concurrent users**: Memory is shared across requests via copy-on-write pages
- **Dynamic allocation**: Pages are allocated as sequences grow and freed when they complete
- **Prefix caching**: Common prefixes (e.g. system prompts) share pages across requests

```rust
/// PagedAttention configuration for mistral-rs
#[cfg(feature = "mistral-rs")]
pub struct PagedAttentionConfig {
    /// Block size in tokens (typical: 16)
    pub block_size: usize,
    /// Maximum blocks in the page table
    pub max_blocks: usize,
    /// GPU memory fraction for KV cache (0.0-1.0)
    pub gpu_memory_fraction: f32,
    /// Enable prefix caching for repeated prompts
    pub enable_prefix_caching: bool,
}

impl Default for PagedAttentionConfig {
    fn default() -> Self {
        Self {
            block_size: 16,
            max_blocks: 4096,
            gpu_memory_fraction: 0.9,
            enable_prefix_caching: true,
        }
    }
}
```

**Performance Impact:**

| Metric | Without PagedAttention | With PagedAttention |
|--------|------------------------|---------------------|
| Concurrent users | 1-2 | 10-50 |
| Memory utilization | 40-60% | 85-95% |
| Memory fragmentation | High | Near-zero |
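The paging idea can be sketched with a toy block allocator (illustrative only; mistral-rs's real implementation also handles copy-on-write and prefix sharing). Pages come from a free list, so a sequence's cache need not be contiguous, and pages freed by a finished sequence are immediately reusable, which is what drives fragmentation toward zero:

```rust
// Toy sketch of PagedAttention-style page allocation (not mistral-rs
// internals): fixed-size KV-cache pages handed out from a free list.
pub struct BlockAllocator {
    block_size: usize,       // tokens per page
    free_blocks: Vec<usize>, // indices of currently free pages
}

impl BlockAllocator {
    pub fn new(block_size: usize, max_blocks: usize) -> Self {
        Self { block_size, free_blocks: (0..max_blocks).rev().collect() }
    }

    /// Number of pages a sequence of `tokens` needs (ceiling division).
    pub fn blocks_needed(&self, tokens: usize) -> usize {
        (tokens + self.block_size - 1) / self.block_size
    }

    /// Allocate pages for a sequence; the returned page indices need not
    /// be contiguous, which is the point of paging.
    pub fn allocate(&mut self, tokens: usize) -> Option<Vec<usize>> {
        let n = self.blocks_needed(tokens);
        if self.free_blocks.len() < n {
            return None; // backpressure: caller must queue or evict
        }
        Some(self.free_blocks.split_off(self.free_blocks.len() - n))
    }

    /// Return a finished sequence's pages to the free list.
    pub fn free(&mut self, blocks: Vec<usize>) {
        self.free_blocks.extend(blocks);
    }
}
```

With `block_size: 16` as in the default config above, a 33-token sequence occupies exactly three pages, and a growing sequence only ever requests one more page at a time instead of reallocating a contiguous buffer.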

#### 2. X-LoRA (Mixture of LoRA Experts)

X-LoRA enables per-token adapter routing for multi-task models:

- **Dynamic mixing**: A router network selects adapters per token
- **Learned routing**: An MLP router is trained on adapter selection
- **Top-k activation**: Only k adapters compute per token (efficiency)

```rust
/// X-LoRA configuration for multi-adapter inference
#[cfg(feature = "mistral-rs")]
pub struct XLoraConfig {
    /// Adapter names/paths to load
    pub adapters: Vec<String>,
    /// Top-k adapters to activate per token
    pub top_k: usize,
    /// Router temperature for softmax
    pub temperature: f32,
    /// Mixing mode
    pub mixing_mode: XLoraMixingMode,
}

#[derive(Debug, Clone, Copy)]
pub enum XLoraMixingMode {
    /// Sum weighted adapter outputs
    Additive,
    /// Concatenate and project
    Concatenate,
    /// Gated mixture with learned gates
    Gated,
}
```

**Use Cases:**

- Code + chat model: Route code tokens to a code adapter, natural language to a chat adapter
- Multi-language: Route based on detected language
- Domain-specific: Finance, medical, or legal adapters activated by context
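The routing decision itself is a small computation. The sketch below is illustrative, not the mistral-rs router (which is a learned MLP over hidden states): temperature-scaled softmax over per-adapter scores, keep the top-k, renormalize, and mix additively, matching the `top_k` and `temperature` fields of the config above:

```rust
// Illustrative X-LoRA-style routing step (not mistral-rs internals):
// given one token's router logits over the loaded adapters, return the
// top-k adapter indices with renormalized mixing weights.
fn route_top_k(logits: &[f32], temperature: f32, k: usize) -> Vec<(usize, f32)> {
    // Temperature-scaled softmax (max-subtracted for numerical stability).
    let max = logits.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let exp: Vec<f32> = logits.iter().map(|&l| ((l - max) / temperature).exp()).collect();
    let sum: f32 = exp.iter().sum();
    let mut probs: Vec<(usize, f32)> = exp.iter().map(|e| e / sum).enumerate().collect();

    // Keep the k highest-probability adapters and renormalize their weights,
    // so only k adapters actually compute for this token.
    probs.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap());
    probs.truncate(k);
    let norm: f32 = probs.iter().map(|(_, p)| p).sum();
    probs.iter().map(|&(i, p)| (i, p / norm)).collect()
}
```

In `Additive` mixing mode, the token's output is then the weight-scaled sum of the selected adapters' outputs; the <1 ms routing-latency target applies to exactly this per-token step.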

#### 3. ISQ (In-Situ Quantization)

ISQ enables runtime quantization without pre-exported quantized models:

- **Runtime flexibility**: Same model weights, different quantization per deployment
- **Memory adaptation**: Quantize to fit the available hardware
- **Quality preservation**: Activation-aware methods (AWQ, GPTQ) maintain accuracy

```rust
/// ISQ configuration for runtime quantization
#[cfg(feature = "mistral-rs")]
pub struct IsqConfig {
    /// Quantization bits (2, 4, 8)
    pub bits: u8,
    /// Quantization method
    pub method: IsqMethod,
    /// Calibration dataset size
    pub calibration_samples: usize,
}

#[derive(Debug, Clone, Copy)]
pub enum IsqMethod {
    /// Activation-aware Weight Quantization
    AWQ,
    /// GPTQ with optimal brain quantization
    GPTQ,
    /// Round-to-nearest (fastest, lower quality)
    RTN,
    /// SmoothQuant (activation smoothing)
    SmoothQuant,
}
```

**Performance Impact:**

| Method | Bits | Memory Reduction | Quality Loss |
|--------|------|------------------|--------------|
| AWQ | 4 | 4x | <1% |
| GPTQ | 4 | 4x | <1% |
| RTN | 4 | 4x | 2-3% |
| AWQ | 2 | 8x | 3-5% |
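The simplest of these methods, RTN, fits in a few lines and shows the core mechanic behind runtime quantization (this sketch is illustrative, not the mistral-rs implementation, and AWQ/GPTQ additionally use calibration data to pick scales):

```rust
// Minimal round-to-nearest (RTN) sketch: map f32 weights to 4-bit signed
// integers with one symmetric per-row scale, then dequantize. Illustrative
// only; real ISQ quantizes per-group and packs two 4-bit values per byte.
fn rtn_quantize_4bit(weights: &[f32]) -> (Vec<i8>, f32) {
    // Symmetric scale: the largest magnitude maps to +/-7 (4-bit signed range).
    let max_abs = weights.iter().fold(0f32, |m, &w| m.max(w.abs()));
    let scale = if max_abs == 0.0 { 1.0 } else { max_abs / 7.0 };
    let q = weights
        .iter()
        .map(|&w| (w / scale).round().clamp(-7.0, 7.0) as i8)
        .collect();
    (q, scale)
}

fn dequantize(q: &[i8], scale: f32) -> Vec<f32> {
    q.iter().map(|&v| v as f32 * scale).collect()
}
```

Everything between quantize and dequantize is the rounding error; the 2-3% quality loss in the table above is that error accumulated across a full model, which the activation-aware methods reduce by choosing scales that protect the weights that matter most to activations.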

### Implementation Roadmap

#### Phase 1: Core Integration (Weeks 1-2)

1. Add mistral-rs dependencies with feature flags
2. Implement config mapping: `MistralBackendConfig` -> `mistralrs::Config`
3. Wire `load_model` to mistral-rs model loading
4. Wire `generate` and `generate_stream` to mistral-rs inference
```rust
// Sketch of the intended wiring; exact mistral-rs type and method names
// below are indicative and will be confirmed against the pinned crate
// version during implementation.
#[cfg(feature = "mistral-rs")]
impl LlmBackend for MistralBackend {
    fn load_model(&mut self, model_id: &str, config: ModelConfig) -> Result<()> {
        use mistralrs::MistralRsBuilder;

        let builder = MistralRsBuilder::new(model_id)
            .with_paged_attention(self.config.paged_attention.as_ref().map(|pa| {
                mistralrs::PagedAttentionConfig {
                    block_size: pa.block_size,
                    ..Default::default()
                }
            }));

        self.inner = Some(builder.build()?);
        Ok(())
    }

    fn generate(&self, prompt: &str, params: GenerateParams) -> Result<String> {
        let inner = self.inner.as_ref()
            .ok_or_else(|| Error::msg("Model not loaded"))?;

        let request = mistralrs::Request::new(prompt)
            .with_max_tokens(params.max_tokens)
            .with_temperature(params.temperature);

        let response = inner.send_request(request)?;
        Ok(response.text)
    }
}
```
#### Phase 2: Advanced Features (Weeks 3-4)

1. Enable PagedAttention with configurable parameters
2. Add X-LoRA adapter loading and routing
3. Implement ISQ with a calibration pipeline

#### Phase 3: Hardware Acceleration (Weeks 5-6)

1. Test and validate Metal acceleration
2. Test and validate CUDA acceleration
3. Benchmark against the Candle backend

---

## Consequences

### Positive Consequences

1. **Production-scale serving**: PagedAttention enables 5-10x more concurrent users
2. **Multi-task efficiency**: X-LoRA eliminates adapter-switching overhead
3. **Deployment flexibility**: ISQ allows runtime quantization decisions
4. **Upstream maintenance**: The mistral-rs team maintains the core inference optimizations
5. **Feature parity**: Access to the latest mistral-rs features (Flash Attention 2, speculative decoding)

### Negative Consequences

1. **Dependency complexity**: Additional crate dependencies increase build complexity
2. **API surface coupling**: Changes in mistral-rs may require RuvLLM updates
3. **Feature matrix**: Two backend codepaths require testing both paths
4. **WASM incompatibility**: mistral-rs does not support WASM targets

### Neutral Consequences

1. **Two backend options**: Candle remains optimal for edge/WASM; mistral-rs for servers
2. **Compile-time selection**: Users choose a backend via feature flags
3. **Binary size tradeoff**: Server builds are larger; edge builds are unchanged

### Risk Mitigation

| Risk | Mitigation |
|------|------------|
| mistral-rs API instability | Pin to a specific version; abstract via the `MistralBackend` interface |
| Feature flag complexity | Comprehensive CI matrix testing all feature combinations |
| Performance regression | Benchmark suite comparing Candle vs mistral-rs |
| Metal/CUDA compatibility | Platform-specific CI runners for hardware validation |

---
## Alternatives Considered

### llama.cpp via rust-llama

- **Rejected**: Different model format (GGUF), weaker Rust integration
- **Consideration**: Could be added as a third backend for GGUF model support

### candle-transformers PagedAttention

- **Rejected**: Candle's PagedAttention is experimental and less mature
- **Consideration**: Monitor upstream development

### vLLM Python Backend

- **Rejected**: Python FFI adds latency and deployment complexity
- **Consideration**: vLLM's PagedAttention algorithm informs our design

---

## Related Decisions

- **ADR-001**: Ruvector Core Architecture (HNSW, Graph Store)
- **ADR-002**: RuvLLM Integration with Ruvector
- **ADR-003**: SIMD Optimization Strategy
- **ADR-004**: KV Cache Management
- **ADR-006**: Memory Management
- **ADR-007**: Security Review & Technical Debt

---

## Compliance and Standards

### API Compatibility

- `MistralBackend` implements the `LlmBackend` trait
- All existing RuvLLM consumers work unchanged
- Feature flags are additive (no breaking changes)

### Testing Requirements

- Unit tests for config mapping
- Integration tests with sample models
- Benchmark suite comparing backends
- CI matrix for feature flag combinations

### Documentation Requirements

- Feature flag documentation in the README
- Backend selection guide
- Performance comparison benchmarks

---
## References

1. mistral-rs repository: https://github.com/EricLBuehler/mistral.rs
2. vLLM PagedAttention paper: "Efficient Memory Management for Large Language Model Serving with PagedAttention"
3. X-LoRA paper: "X-LoRA: Mixture of Low-Rank Adapter Experts"
4. ISQ/AWQ paper: "AWQ: Activation-aware Weight Quantization for LLM Compression"
5. Existing `MistralBackend` stub: `crates/ruvllm/src/backends/mistral_backend.rs`

---

## Implementation Status

| Component | Status | Notes |
|-----------|--------|-------|
| Feature flags | Pending | Add to Cargo.toml |
| Config mapping | Pending | `MistralBackendConfig` -> `mistralrs::Config` |
| Model loading | Pending | Wire to mistral-rs loader |
| Generation | Pending | Wire to mistral-rs inference |
| PagedAttention | Pending | Enable via config |
| X-LoRA | Pending | Wire existing `XLoraManager` |
| ISQ | Pending | Implement calibration pipeline |
| Metal acceleration | Pending | Test on Apple Silicon |
| CUDA acceleration | Pending | Test on NVIDIA GPUs |

---
## Revision History

| Version | Date | Author | Changes |
|---------|------|--------|---------|
| 1.0 | 2026-01-20 | Ruvector Architecture Team | Initial proposal |