Files
wifi-densepose/crates/ruvllm/docs/GITHUB_ISSUE_MISTRAL_RS.md
ruv d803bfe2b1 Squashed 'vendor/ruvector/' content from commit b64c2172
git-subtree-dir: vendor/ruvector
git-subtree-split: b64c21726f2bb37286d9ee36a7869fef60cc6900
2026-02-28 14:39:40 -05:00

9.6 KiB

feat(ruvllm): Full mistral-rs backend integration with PagedAttention, X-LoRA, and ISQ

Summary

Wire the existing MistralBackend stub to the actual mistral-rs crate for production-scale LLM serving with advanced memory management and adapter routing.

Motivation

The current Candle backend is optimized for single-user and edge deployment scenarios, achieving approximately 100 tokens/second. While sufficient for development and small-scale use, production deployments require significantly higher throughput and concurrency.

mistral-rs enables:

  • 500-1000 tok/s throughput via continuous batching and PagedAttention
  • 50+ concurrent users with efficient KV cache management
  • Memory efficiency through paged memory allocation and prefix caching
  • Dynamic adapter routing via X-LoRA for multi-task inference
  • Runtime quantization via ISQ for deployment flexibility

Performance Comparison

Metric Candle Backend mistral-rs Backend
Throughput ~100 tok/s 500-1000 tok/s
Concurrent Users 1-5 50+
Memory Efficiency Static KV Paged + Prefix Cache
Adapter Support Static LoRA Dynamic X-LoRA
Quantization Pre-quantized only Runtime ISQ

Features to Implement

1. PagedAttention (Priority: High)

PagedAttention revolutionizes KV cache management by treating attention as virtual memory, enabling efficient memory sharing across sequences.

  • Add mistralrs dependency to Cargo.toml with feature flags
  • Wire PagedAttention to MistralBackend::generate()
  • Implement sequence allocation/deallocation callbacks
  • Add prefix caching support for prompt reuse
  • Configure block size and max sequences
  • Benchmark: target 5-10x concurrent capacity improvement

Key Implementation Points:

// Block configuration
let paged_config = PagedAttentionConfig {
    block_size: 16,           // Tokens per block
    max_num_blocks: 1024,     // Total blocks available
    sliding_window: None,     // Optional sliding window
    prefix_caching: true,     // Enable prefix cache
};

2. X-LoRA Dynamic Routing (Priority: Medium)

X-LoRA enables per-token routing to different LoRA adapters, allowing a single model to handle multiple tasks efficiently.

  • Wire XLoraManager to mistral-rs X-LoRA implementation
  • Implement per-token adapter routing logic
  • Support learned routing networks (classifier)
  • Add adapter hot-loading for runtime updates
  • Implement adapter weight caching
  • Benchmark: multi-task quality metrics vs single adapters

Key Implementation Points:

// X-LoRA configuration
let xlora_config = XLoraConfig {
    adapters: vec![
        ("code", "path/to/code-lora"),
        ("chat", "path/to/chat-lora"),
        ("reasoning", "path/to/reasoning-lora"),
    ],
    routing_method: RoutingMethod::Learned,
    top_k_adapters: 2,        // Use top-2 adapters per token
    scaling_factor: 1.0,
};

3. ISQ Runtime Quantization (Priority: Medium)

In-Situ Quantization allows loading full-precision models and quantizing at runtime, providing deployment flexibility.

  • Wire IsqConfig to mistral-rs ISQ implementation
  • Support quantization methods: AWQ, GPTQ, RTN, SmoothQuant
  • Implement calibration workflow with sample data
  • Add memory estimation before/after quantization
  • Support mixed-precision quantization per layer
  • Benchmark: quality vs compression tradeoffs

Supported Quantization Methods:

Method Bits Quality Speed Use Case
AWQ 4-bit High Fast Production
GPTQ 4-bit High Medium Accuracy-critical
RTN 8-bit Very High Very Fast Quality-first
SmoothQuant 8-bit Very High Fast Balanced

Technical Details

Cargo.toml Changes

[dependencies]
# Core mistral-rs integration
mistralrs = { version = "0.4", optional = true }
mistralrs-core = { version = "0.4", optional = true }

# Required for tokenization with mistral-rs
tokenizers = { version = "0.20", optional = true }

[features]
default = ["candle"]

# Base mistral-rs support (CPU)
mistral-rs = ["mistralrs", "mistralrs-core", "tokenizers"]

# Metal acceleration (macOS)
mistral-rs-metal = ["mistral-rs", "mistralrs/metal"]

# CUDA acceleration (NVIDIA)
mistral-rs-cuda = ["mistral-rs", "mistralrs/cuda"]

# Full feature set
full = ["candle", "mistral-rs"]

Files to Modify

File Changes
crates/ruvllm/Cargo.toml Add mistral-rs dependencies and feature flags
crates/ruvllm/src/backends/mistral_backend.rs Replace stub with real implementation
crates/ruvllm/src/backends/mod.rs Update conditional exports
crates/ruvllm/src/paged_attention.rs Wire to mistral-rs PagedAttention
crates/ruvllm/src/xlora_manager.rs Wire to mistral-rs X-LoRA
crates/ruvllm/src/isq.rs Wire to mistral-rs ISQ
crates/ruvllm/src/lib.rs Add re-exports and feature gates
crates/ruvllm/README.md Document usage and examples

API Design

use ruvllm::{MistralBackend, MistralConfig, PagedAttentionConfig};

// Create backend with PagedAttention
let config = MistralConfig {
    model_id: "mistralai/Mistral-7B-Instruct-v0.2".to_string(),
    paged_attention: Some(PagedAttentionConfig {
        block_size: 16,
        max_num_blocks: 1024,
        prefix_caching: true,
    }),
    xlora: None,
    isq: None,
};

let backend = MistralBackend::new(config).await?;

// Generate with automatic KV cache management
let output = backend.generate(&request).await?;

Feature Flag Matrix

Build Command CPU Metal CUDA PagedAttn X-LoRA ISQ
--features mistral-rs Yes No No Yes Yes Yes
--features mistral-rs-metal Yes Yes No Yes Yes Yes
--features mistral-rs-cuda Yes No Yes Yes Yes Yes

Acceptance Criteria

Build Verification

  • cargo build --features mistral-rs compiles on Linux
  • cargo build --features mistral-rs-metal compiles on macOS
  • cargo build --features mistral-rs-cuda compiles with CUDA toolkit
  • All clippy warnings resolved
  • No breaking changes to existing Candle backend

Functionality

  • Model loading works with HuggingFace model IDs
  • Model loading works with local paths
  • Generation produces correct, coherent output
  • Streaming generation works correctly
  • Stop sequences are respected

PagedAttention

  • KV cache is managed in blocks
  • Sequence allocation succeeds up to max capacity
  • Sequence deallocation frees blocks correctly
  • Prefix caching improves repeated prompt performance
  • Memory usage stays within configured limits

X-LoRA

  • Multiple adapters can be loaded
  • Per-token routing selects appropriate adapters
  • Adapter hot-loading works without restart
  • Quality matches or exceeds single-adapter baseline

ISQ

  • Models quantize at runtime without pre-quantized weights
  • All supported methods produce valid output
  • Memory reduction matches expected compression ratio
  • Quality degradation within acceptable bounds (<5% on benchmarks)

Performance Benchmarks

  • Throughput: >500 tok/s on Mistral-7B (single user)
  • Concurrency: >50 concurrent generations without OOM
  • Latency: <50ms time-to-first-token
  • Memory: PagedAttention reduces peak usage by >30%

Testing Plan

Unit Tests

#[cfg(feature = "mistral-rs")]
mod mistral_tests {
    #[tokio::test]
    async fn test_model_loading() { ... }

    #[tokio::test]
    async fn test_generation() { ... }

    #[tokio::test]
    async fn test_paged_attention_allocation() { ... }

    #[tokio::test]
    async fn test_xlora_routing() { ... }

    #[tokio::test]
    async fn test_isq_quantization() { ... }
}

Integration Tests

  • Model download and cache management
  • End-to-end generation pipeline
  • Concurrent request handling
  • Memory pressure scenarios

Benchmarks

# Run throughput benchmark
cargo bench --features mistral-rs-metal -- throughput

# Run concurrency benchmark
cargo bench --features mistral-rs-metal -- concurrency

# Run memory benchmark
cargo bench --features mistral-rs-metal -- memory

Implementation Notes

Thread Safety

mistral-rs uses async Rust throughout. Ensure all shared state is properly synchronized:

  • Use Arc<RwLock<>> for shared configuration
  • Use channels for sequence lifecycle events
  • Avoid blocking in async contexts

Error Handling

Map mistral-rs errors to ruvllm error types:

impl From<mistralrs::Error> for RuvllmError {
    fn from(e: mistralrs::Error) -> Self {
        match e {
            mistralrs::Error::ModelLoad(_) => RuvllmError::ModelLoad(...),
            mistralrs::Error::Generation(_) => RuvllmError::Generation(...),
            // ...
        }
    }
}

Backward Compatibility

  • Keep Candle backend as default
  • Use feature flags for mistral-rs
  • Maintain consistent API across backends
  • Document migration path
  • Depends on: Initial MistralBackend stub implementation
  • Blocks: Production deployment readiness
  • Related: Candle backend optimizations

References


Labels: enhancement, ruvllm, backend, performance, P1

Milestone: v0.2.0

Assignees: TBD