wifi-densepose/vendor/ruvector/docs/adr/ADR-008-mistral-rs-integration.md

# ADR-008: mistral-rs Integration for Production-Scale LLM Serving

**Status:** Proposed
**Date:** 2026-01-20
**Decision Makers:** Ruvector Architecture Team
**Technical Area:** LLM Inference Engine / Production Serving

---

## Context and Problem Statement

RuvLLM v2.3 includes a stub `MistralBackend` implementation at `crates/ruvllm/src/backends/mistral_backend.rs` that defines the interface for high-performance LLM inference but lacks actual integration with the mistral-rs crate. The current Candle backend is optimized for single-user and edge deployment scenarios, but production-scale serving requires advanced memory management and multi-tenant capabilities.

### Current State

The existing `MistralBackend` stub provides:
- Configuration structures for PagedAttention, X-LoRA, and ISQ
- `XLoraManager` with adapter loading/routing logic (placeholder)
- `MistralBackendConfig` with builder pattern for Metal/CUDA targets
- Integration hooks for the `LlmBackend` trait

However, the implementation is non-functional:
- No actual mistral-rs crate dependency
- Token generation returns placeholder values
- Model loading does not wire to inference pipeline
- PagedAttention uses RuvLLM's internal implementation, not mistral-rs's optimized version

### Key Challenges

1. **Concurrent User Scaling**: Candle backend is optimized for single-user inference; production servers need 10-100+ concurrent requests
2. **KV Cache Memory Pressure**: Without vLLM-style paging, long-context sessions exhaust GPU memory
3. **Multi-Task Models**: LoRA adapter switching requires per-request overhead; X-LoRA enables per-token routing
4. **Deployment Flexibility**: Models should be quantized at runtime based on available hardware

---

## Decision Drivers

### Performance Requirements
- **Concurrent sessions**: 50-100 simultaneous inference requests
- **Memory efficiency**: 5-10x improvement in KV cache utilization
- **Adapter latency**: <1ms overhead for X-LoRA routing decisions
- **Quantization**: Runtime ISQ without model re-export

### Compatibility Requirements
- **Existing interface**: Must implement `LlmBackend` trait seamlessly
- **Feature isolation**: Optional dependency with feature flags
- **Backend selection**: Runtime choice between Candle and mistral-rs

### Hardware Requirements
- **Apple Silicon**: Metal acceleration via `mistral-rs-metal`
- **NVIDIA GPUs**: CUDA acceleration via `mistral-rs-cuda`
- **CPU fallback**: Pure Rust path for edge/WASM targets

---

## Considered Options

### Option A: Fork and Embed mistral-rs

Vendor mistral-rs source code directly into RuvLLM.

**Pros:**
- Full control over API surface
- No external dependency versioning
- Can customize for RuvLLM's needs

**Cons:**
- Maintenance burden of tracking upstream
- Miss upstream optimizations and fixes
- Duplicated effort

### Option B: Optional Dependency with Feature Flags

Add mistral-rs as an optional dependency behind feature flags, wiring the existing `MistralBackend` interface to actual mistral-rs crate.

**Pros:**
- Leverage upstream development
- Clean separation via features
- Users choose their backend at compile time
- Smaller binary for edge deployments (Candle-only)

**Cons:**
- API surface depends on upstream stability
- Two codepaths to maintain
- Feature matrix complexity

### Option C: Runtime Backend Selection

Use dynamic dispatch to select backend at runtime via configuration.

**Pros:**
- Single binary for all deployments
- Runtime flexibility

**Cons:**
- Binary size includes all backends
- Dynamic dispatch overhead
- Complex testing matrix

---

## Decision Outcome

**Chosen Option: Option B - Optional Dependency with Feature Flags**

Add mistral-rs as an optional dependency with three feature flags, wiring the existing `MistralBackend` stub to the actual mistral-rs implementation.

### Rationale

1. **Separation of concerns**: Edge deployments use Candle (no mistral-rs dependency); server deployments enable mistral-rs features
2. **Upstream leverage**: mistral-rs team maintains PagedAttention, X-LoRA, ISQ implementations
3. **Existing interface**: The `MistralBackend` stub already defines the API; we wire it to real implementation
4. **Incremental adoption**: Users can migrate from Candle to mistral-rs backend per-deployment

---

## Technical Specifications

### Feature Flags

```toml
# Cargo.toml additions
[features]
default = ["candle-backend"]

# Base mistral-rs integration
mistral-rs = ["dep:mistralrs", "dep:mistralrs-core"]

# Apple Silicon Metal acceleration
mistral-rs-metal = ["mistral-rs", "mistralrs/metal"]

# NVIDIA CUDA acceleration
mistral-rs-cuda = ["mistral-rs", "mistralrs/cuda"]

[dependencies]
# Optional mistral-rs integration
mistralrs = { version = "0.3", optional = true }
mistralrs-core = { version = "0.3", optional = true }
```

### Feature Matrix

| Feature | Candle | mistral-rs | mistral-rs-metal | mistral-rs-cuda |
|---------|--------|------------|------------------|-----------------|
| Single-user inference | Yes | Yes | Yes | Yes |
| PagedAttention | No | Yes | Yes | Yes |
| X-LoRA | No | Yes | Yes | Yes |
| ISQ | No | Yes | Yes | Yes |
| Metal acceleration | Yes | No | Yes | No |
| CUDA acceleration | Partial | No | No | Yes |
| WASM support | Yes | No | No | No |
| Binary size | ~15MB | ~45MB | ~50MB | ~60MB |

### Architecture

```
+-----------------------------------------------------------------------+
|                    MISTRAL-RS INTEGRATION ARCHITECTURE                 |
+-----------------------------------------------------------------------+
|                                                                        |
|   +-------------------+     +-------------------+     +--------------+ |
|   | MistralBackend    |     | mistralrs::Model  |     | Hardware     | |
|   | (RuvLLM adapter)  |     | (inference core)  |     | Accelerator  | |
|   |                   |     |                   |     |              | |
|   | - Config mapping  |---->| - PagedAttention  |---->| - Metal      | |
|   | - Trait impl      |     | - X-LoRA routing  |     | - CUDA       | |
|   | - Error handling  |     | - ISQ runtime     |     | - CPU        | |
|   +--------+----------+     +---------+---------+     +------+-------+ |
|            |                          |                      |         |
|            v                          v                      v         |
|   +--------+----------+     +---------+---------+     +------+-------+ |
|   | LlmBackend trait  |     | KV Cache Pool     |     | Tensor Ops   | |
|   | (RuvLLM unified)  |     | (PagedAttention)  |     | (kernels)    | |
|   +-------------------+     +-------------------+     +--------------+ |
|                                                                        |
+-----------------------------------------------------------------------+
```

### Key Features to Enable

#### 1. PagedAttention (vLLM-style KV Cache Management)

PagedAttention partitions the KV cache into fixed-size blocks (pages) that can be allocated non-contiguously, enabling:
- **5-10x concurrent users**: Memory shared across requests via copy-on-write pages
- **Dynamic allocation**: Pages allocated as sequences grow, freed when complete
- **Prefix caching**: Common prefixes (system prompts) share pages across requests

```rust
/// PagedAttention configuration for mistral-rs
#[cfg(feature = "mistral-rs")]
pub struct PagedAttentionConfig {
    /// Block size in tokens (typical: 16)
    pub block_size: usize,
    /// Maximum blocks in page table
    pub max_blocks: usize,
    /// GPU memory fraction for KV cache (0.0-1.0)
    pub gpu_memory_fraction: f32,
    /// Enable prefix caching for repeated prompts
    pub enable_prefix_caching: bool,
}

impl Default for PagedAttentionConfig {
    fn default() -> Self {
        Self {
            block_size: 16,
            max_blocks: 4096,
            gpu_memory_fraction: 0.9,
            enable_prefix_caching: true,
        }
    }
}
```

**Performance Impact:**
| Metric | Without PagedAttention | With PagedAttention |
|--------|------------------------|---------------------|
| Concurrent users | 1-2 | 10-50 |
| Memory utilization | 40-60% | 85-95% |
| Memory fragmentation | High | Near-zero |

#### 2. X-LoRA (eXpert-mixed LoRA)

X-LoRA enables per-token adapter routing for multi-task models:
- **Dynamic mixing**: Router network selects adapters per token
- **Learned routing**: MLP router trained on adapter selection
- **Top-k activation**: Only k adapters compute per token (efficiency)

```rust
/// X-LoRA configuration for multi-adapter inference
#[cfg(feature = "mistral-rs")]
pub struct XLoraConfig {
    /// Adapter names/paths to load
    pub adapters: Vec<String>,
    /// Top-k adapters to activate per token
    pub top_k: usize,
    /// Router temperature for softmax
    pub temperature: f32,
    /// Mixing mode
    pub mixing_mode: XLoraMixingMode,
}

#[derive(Debug, Clone, Copy)]
pub enum XLoraMixingMode {
    /// Sum weighted adapter outputs
    Additive,
    /// Concatenate and project
    Concatenate,
    /// Gated mixture with learned gates
    Gated,
}
```

**Use Cases:**
- Code + chat model: Route code tokens to code adapter, natural language to chat adapter
- Multi-language: Route based on detected language
- Domain-specific: Finance, medical, legal adapters activated by context

#### 3. ISQ (In-Situ Quantization)

ISQ enables runtime quantization without pre-exported quantized models:
- **Runtime flexibility**: Same model weights, different quantization per deployment
- **Memory adaptation**: Quantize to fit available hardware
- **Quality preservation**: Activation-aware methods (AWQ, GPTQ) maintain accuracy

```rust
/// ISQ configuration for runtime quantization
#[cfg(feature = "mistral-rs")]
pub struct IsqConfig {
    /// Quantization bits (2, 4, 8)
    pub bits: u8,
    /// Quantization method
    pub method: IsqMethod,
    /// Calibration dataset size
    pub calibration_samples: usize,
}

#[derive(Debug, Clone, Copy)]
pub enum IsqMethod {
    /// Activation-aware Weight Quantization
    AWQ,
    /// GPTQ with optimal brain quantization
    GPTQ,
    /// Round-to-nearest (fastest, lower quality)
    RTN,
    /// SmoothQuant (activation smoothing)
    SmoothQuant,
}
```

**Performance Impact:**
| Method | Bits | Memory Reduction | Quality Loss |
|--------|------|------------------|--------------|
| AWQ | 4 | 4x | <1% |
| GPTQ | 4 | 4x | <1% |
| RTN | 4 | 4x | 2-3% |
| AWQ | 2 | 8x | 3-5% |

### Implementation Roadmap

#### Phase 1: Core Integration (Week 1-2)

1. Add mistral-rs dependencies with feature flags
2. Implement config mapping: `MistralBackendConfig` -> `mistralrs::Config`
3. Wire `load_model` to mistral-rs model loading
4. Wire `generate` and `generate_stream` to mistral-rs inference

```rust
#[cfg(feature = "mistral-rs")]
impl LlmBackend for MistralBackend {
    fn load_model(&mut self, model_id: &str, config: ModelConfig) -> Result<()> {
        use mistralrs::{ModelKind, MistralRs, MistralRsBuilder};

        let builder = MistralRsBuilder::new(model_id)
            .with_paged_attention(self.config.paged_attention.as_ref().map(|pa| {
                mistralrs::PagedAttentionConfig {
                    block_size: pa.block_size,
                    ..Default::default()
                }
            }));

        self.inner = Some(builder.build()?);
        Ok(())
    }

    fn generate(&self, prompt: &str, params: GenerateParams) -> Result<String> {
        let inner = self.inner.as_ref()
            .ok_or_else(|| Error::msg("Model not loaded"))?;

        let request = mistralrs::Request::new(prompt)
            .with_max_tokens(params.max_tokens)
            .with_temperature(params.temperature);

        let response = inner.send_request(request)?;
        Ok(response.text)
    }
}
```

#### Phase 2: Advanced Features (Week 3-4)

1. Enable PagedAttention with configurable parameters
2. Add X-LoRA adapter loading and routing
3. Implement ISQ with calibration pipeline

#### Phase 3: Hardware Acceleration (Week 5-6)

1. Test and validate Metal acceleration
2. Test and validate CUDA acceleration
3. Benchmark against Candle backend

---

## Consequences

### Positive Consequences

1. **Production-scale serving**: PagedAttention enables 5-10x more concurrent users
2. **Multi-task efficiency**: X-LoRA eliminates adapter switching overhead
3. **Deployment flexibility**: ISQ allows runtime quantization decisions
4. **Upstream maintenance**: mistral-rs team maintains core inference optimizations
5. **Feature parity**: Access to latest mistral-rs features (Flash Attention 2, speculative decoding)

### Negative Consequences

1. **Dependency complexity**: Additional crate dependencies increase build complexity
2. **API surface coupling**: Changes in mistral-rs may require RuvLLM updates
3. **Feature matrix**: Two backend codepaths require testing both paths
4. **WASM incompatibility**: mistral-rs does not support WASM targets

### Neutral Consequences

1. **Two backend options**: Candle remains optimal for edge/WASM; mistral-rs for server
2. **Compile-time selection**: Users choose backend via feature flags
3. **Binary size tradeoff**: Server builds are larger; edge builds unchanged

### Risk Mitigation

| Risk | Mitigation |
|------|------------|
| mistral-rs API instability | Pin to specific version; abstract via MistralBackend interface |
| Feature flag complexity | Comprehensive CI matrix testing all feature combinations |
| Performance regression | Benchmark suite comparing Candle vs mistral-rs |
| Metal/CUDA compatibility | Platform-specific CI runners for hardware validation |

---

## Alternatives Considered

### llama.cpp via rust-llama

- **Rejected**: Different model format (GGUF), weaker Rust integration
- **Consideration**: Could add as third backend for GGUF model support

### candle-transformers PagedAttention

- **Rejected**: Candle's PagedAttention is experimental and less mature
- **Consideration**: Monitor upstream development

### vLLM Python Backend

- **Rejected**: Python FFI adds latency; deployment complexity
- **Consideration**: vLLM's algorithm informs our understanding

---

## Related Decisions

- **ADR-001**: Ruvector Core Architecture (HNSW, Graph Store)
- **ADR-002**: RuvLLM Integration with Ruvector
- **ADR-003**: SIMD Optimization Strategy
- **ADR-004**: KV Cache Management
- **ADR-006**: Memory Management
- **ADR-007**: Security Review & Technical Debt

---

## Compliance and Standards

### API Compatibility
- `MistralBackend` implements `LlmBackend` trait
- All existing RuvLLM consumers work unchanged
- Feature flags are additive (no breaking changes)

### Testing Requirements
- Unit tests for config mapping
- Integration tests with sample models
- Benchmark suite comparing backends
- CI matrix for feature flag combinations

### Documentation Requirements
- Feature flag documentation in README
- Backend selection guide
- Performance comparison benchmarks

---

## References

1. mistral-rs Repository: https://github.com/EricLBuehler/mistral.rs
2. vLLM PagedAttention Paper: "Efficient Memory Management for Large Language Model Serving with PagedAttention"
3. X-LoRA Paper: "X-LoRA: Mixture of Low-Rank Adapter Experts"
4. ISQ/AWQ Paper: "AWQ: Activation-aware Weight Quantization for LLM Compression"
5. Existing MistralBackend stub: `crates/ruvllm/src/backends/mistral_backend.rs`

---

## Implementation Status

| Component | Status | Notes |
|-----------|--------|-------|
| Feature flags | Pending | Add to Cargo.toml |
| Config mapping | Pending | MistralBackendConfig -> mistralrs::Config |
| Model loading | Pending | Wire to mistral-rs loader |
| Generation | Pending | Wire to mistral-rs inference |
| PagedAttention | Pending | Enable via config |
| X-LoRA | Pending | Wire existing XLoraManager |
| ISQ | Pending | Implement calibration pipeline |
| Metal acceleration | Pending | Test on Apple Silicon |
| CUDA acceleration | Pending | Test on NVIDIA GPUs |

---

## Revision History

| Version | Date | Author | Changes |
|---------|------|--------|---------|
| 1.0 | 2026-01-20 | Ruvector Architecture Team | Initial proposal |