feat(ruvllm): Full mistral-rs backend integration with PagedAttention, X-LoRA, and ISQ
Summary
Wire the existing MistralBackend stub to the actual mistral-rs crate for production-scale LLM serving with advanced memory management and adapter routing.
Motivation
The current Candle backend is optimized for single-user and edge deployment scenarios, achieving approximately 100 tokens/second. While sufficient for development and small-scale use, production deployments require significantly higher throughput and concurrency.
mistral-rs enables:
- 500-1000 tok/s throughput via continuous batching and PagedAttention
- 50+ concurrent users with efficient KV cache management
- Memory efficiency through paged memory allocation and prefix caching
- Dynamic adapter routing via X-LoRA for multi-task inference
- Runtime quantization via ISQ for deployment flexibility
Performance Comparison
| Metric | Candle Backend | mistral-rs Backend |
|---|---|---|
| Throughput | ~100 tok/s | 500-1000 tok/s |
| Concurrent Users | 1-5 | 50+ |
| Memory Efficiency | Static KV | Paged + Prefix Cache |
| Adapter Support | Static LoRA | Dynamic X-LoRA |
| Quantization | Pre-quantized only | Runtime ISQ |
Features to Implement
1. PagedAttention (Priority: High)
PagedAttention revolutionizes KV cache management by treating attention as virtual memory, enabling efficient memory sharing across sequences.
- Add `mistralrs` dependency to `Cargo.toml` with feature flags
- Wire PagedAttention into `MistralBackend::generate()`
- Implement sequence allocation/deallocation callbacks
- Add prefix caching support for prompt reuse
- Configure block size and max sequences
- Benchmark: target 5-10x concurrent capacity improvement
Key Implementation Points:
```rust
// Block configuration
let paged_config = PagedAttentionConfig {
    block_size: 16,        // Tokens per block
    max_num_blocks: 1024,  // Total blocks available
    sliding_window: None,  // Optional sliding window
    prefix_caching: true,  // Enable prefix cache
};
```
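The allocation/deallocation callbacks above can be sketched as a minimal block allocator. This is an illustrative model only, with hypothetical names; the real mistral-rs scheduler manages physical KV blocks internally:

```rust
use std::collections::HashMap;

/// Minimal paged KV-cache block allocator sketch (names are illustrative,
/// not the actual mistral-rs API).
struct BlockAllocator {
    block_size: usize,                    // tokens per block
    free_blocks: Vec<usize>,              // indices of unused blocks
    seq_blocks: HashMap<u64, Vec<usize>>, // sequence id -> owned blocks
}

impl BlockAllocator {
    fn new(block_size: usize, max_num_blocks: usize) -> Self {
        Self {
            block_size,
            free_blocks: (0..max_num_blocks).collect(),
            seq_blocks: HashMap::new(),
        }
    }

    /// Reserve enough blocks to hold `num_tokens` for a new sequence.
    fn allocate(&mut self, seq_id: u64, num_tokens: usize) -> Result<(), String> {
        let needed = (num_tokens + self.block_size - 1) / self.block_size;
        if needed > self.free_blocks.len() {
            return Err("out of KV cache blocks".into());
        }
        let blocks = self.free_blocks.split_off(self.free_blocks.len() - needed);
        self.seq_blocks.insert(seq_id, blocks);
        Ok(())
    }

    /// Return a finished sequence's blocks to the free pool.
    fn free(&mut self, seq_id: u64) {
        if let Some(blocks) = self.seq_blocks.remove(&seq_id) {
            self.free_blocks.extend(blocks);
        }
    }

    fn free_block_count(&self) -> usize {
        self.free_blocks.len()
    }
}

fn main() {
    let mut alloc = BlockAllocator::new(16, 1024);
    // 100 tokens -> ceil(100/16) = 7 blocks
    alloc.allocate(1, 100).unwrap();
    assert_eq!(alloc.free_block_count(), 1024 - 7);
    alloc.free(1);
    assert_eq!(alloc.free_block_count(), 1024);
    println!("paged allocation round-trip ok");
}
```

Because blocks are fixed-size and returned to a shared pool, memory is bounded by `block_size * max_num_blocks` regardless of how sequence lengths are distributed.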
2. X-LoRA Dynamic Routing (Priority: Medium)
X-LoRA enables per-token routing to different LoRA adapters, allowing a single model to handle multiple tasks efficiently.
- Wire `XLoraManager` to the mistral-rs X-LoRA implementation
- Implement per-token adapter routing logic
- Support learned routing networks (classifier)
- Add adapter hot-loading for runtime updates
- Implement adapter weight caching
- Benchmark: multi-task quality metrics vs single adapters
Key Implementation Points:
```rust
// X-LoRA configuration
let xlora_config = XLoraConfig {
    adapters: vec![
        ("code", "path/to/code-lora"),
        ("chat", "path/to/chat-lora"),
        ("reasoning", "path/to/reasoning-lora"),
    ],
    routing_method: RoutingMethod::Learned,
    top_k_adapters: 2,    // Use top-2 adapters per token
    scaling_factor: 1.0,
};
```
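The `top_k_adapters` setting above boils down to standard top-k routing over the router's scores. A minimal sketch of that selection step (illustrative only; the real routing happens inside the mistral-rs model graph):

```rust
/// Sketch of X-LoRA-style per-token adapter routing: softmax the router
/// scores, keep the top-k adapters, and renormalize their weights.
fn route_top_k(scores: &[f32], top_k: usize) -> Vec<(usize, f32)> {
    // Numerically stable softmax over the raw router scores.
    let max = scores.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = scores.iter().map(|s| (s - max).exp()).collect();
    let sum: f32 = exps.iter().sum();
    let mut probs: Vec<(usize, f32)> =
        exps.iter().map(|e| e / sum).enumerate().collect();
    // Keep only the top-k adapters and renormalize their mass to 1.
    probs.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap());
    probs.truncate(top_k);
    let kept: f32 = probs.iter().map(|(_, p)| p).sum();
    probs.iter().map(|(i, p)| (*i, p / kept)).collect()
}

fn main() {
    // Three adapters ("code", "chat", "reasoning"); the router favors 0 and 2.
    let weights = route_top_k(&[2.0, 0.1, 1.5], 2);
    assert_eq!(weights.len(), 2);
    assert_eq!(weights[0].0, 0); // "code" wins for this token
    let total: f32 = weights.iter().map(|(_, w)| w).sum();
    assert!((total - 1.0).abs() < 1e-6); // kept weights sum to 1
}
```

Per-token routing means the selected pair of adapters can differ at every decoding step, which is what lets one base model serve multiple tasks.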
3. ISQ Runtime Quantization (Priority: Medium)
In-Situ Quantization allows loading full-precision models and quantizing at runtime, providing deployment flexibility.
- Wire `IsqConfig` to the mistral-rs ISQ implementation
- Support quantization methods: AWQ, GPTQ, RTN, SmoothQuant
- Implement calibration workflow with sample data
- Add memory estimation before/after quantization
- Support mixed-precision quantization per layer
- Benchmark: quality vs compression tradeoffs
Supported Quantization Methods:
| Method | Bits | Quality | Speed | Use Case |
|---|---|---|---|---|
| AWQ | 4-bit | High | Fast | Production |
| GPTQ | 4-bit | High | Medium | Accuracy-critical |
| RTN | 8-bit | Very High | Very Fast | Quality-first |
| SmoothQuant | 8-bit | Very High | Fast | Balanced |
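Of the methods above, RTN is the simplest to picture. A minimal sketch of symmetric per-tensor 8-bit round-to-nearest quantization (not the mistral-rs implementation, just the underlying arithmetic):

```rust
/// Sketch of round-to-nearest (RTN) 8-bit quantization for one weight
/// tensor: compute a symmetric per-tensor scale, quantize, dequantize.
fn quantize_rtn_i8(weights: &[f32]) -> (Vec<i8>, f32) {
    let max_abs = weights.iter().fold(0f32, |m, w| m.max(w.abs()));
    let scale = if max_abs == 0.0 { 1.0 } else { max_abs / 127.0 };
    let q = weights
        .iter()
        .map(|w| (w / scale).round().clamp(-127.0, 127.0) as i8)
        .collect();
    (q, scale)
}

fn dequantize_i8(q: &[i8], scale: f32) -> Vec<f32> {
    q.iter().map(|&v| v as f32 * scale).collect()
}

fn main() {
    let w = [0.5f32, -1.0, 0.25, 0.0];
    let (q, scale) = quantize_rtn_i8(&w);
    let back = dequantize_i8(&q, scale);
    // RTN round-trip error is bounded by half the scale step.
    for (a, b) in w.iter().zip(back.iter()) {
        assert!((a - b).abs() <= scale / 2.0 + 1e-6);
    }
    // 4x compression vs f32 (1 byte vs 4 per weight, ignoring the scale).
    assert_eq!(q.len() * std::mem::size_of::<i8>() * 4,
               w.len() * std::mem::size_of::<f32>());
}
```

AWQ, GPTQ, and SmoothQuant improve on this baseline with calibration data and activation-aware scaling, which is why they hold quality better at 4-bit.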
Technical Details
Cargo.toml Changes
```toml
[dependencies]
# Core mistral-rs integration
mistralrs = { version = "0.4", optional = true }
mistralrs-core = { version = "0.4", optional = true }

# Required for tokenization with mistral-rs
tokenizers = { version = "0.20", optional = true }

[features]
default = ["candle"]

# Base mistral-rs support (CPU)
mistral-rs = ["mistralrs", "mistralrs-core", "tokenizers"]

# Metal acceleration (macOS)
mistral-rs-metal = ["mistral-rs", "mistralrs/metal"]

# CUDA acceleration (NVIDIA)
mistral-rs-cuda = ["mistral-rs", "mistralrs/cuda"]

# Full feature set
full = ["candle", "mistral-rs"]
```
Files to Modify
| File | Changes |
|---|---|
| `crates/ruvllm/Cargo.toml` | Add mistral-rs dependencies and feature flags |
| `crates/ruvllm/src/backends/mistral_backend.rs` | Replace stub with real implementation |
| `crates/ruvllm/src/backends/mod.rs` | Update conditional exports |
| `crates/ruvllm/src/paged_attention.rs` | Wire to mistral-rs PagedAttention |
| `crates/ruvllm/src/xlora_manager.rs` | Wire to mistral-rs X-LoRA |
| `crates/ruvllm/src/isq.rs` | Wire to mistral-rs ISQ |
| `crates/ruvllm/src/lib.rs` | Add re-exports and feature gates |
| `crates/ruvllm/README.md` | Document usage and examples |
API Design
```rust
use ruvllm::{MistralBackend, MistralConfig, PagedAttentionConfig};

// Create backend with PagedAttention
let config = MistralConfig {
    model_id: "mistralai/Mistral-7B-Instruct-v0.2".to_string(),
    paged_attention: Some(PagedAttentionConfig {
        block_size: 16,
        max_num_blocks: 1024,
        prefix_caching: true,
    }),
    xlora: None,
    isq: None,
};

let backend = MistralBackend::new(config).await?;

// Generate with automatic KV cache management
let output = backend.generate(&request).await?;
```
Feature Flag Matrix
| Build Command | CPU | Metal | CUDA | PagedAttn | X-LoRA | ISQ |
|---|---|---|---|---|---|---|
| `--features mistral-rs` | Yes | No | No | Yes | Yes | Yes |
| `--features mistral-rs-metal` | Yes | Yes | No | Yes | Yes | Yes |
| `--features mistral-rs-cuda` | Yes | No | Yes | Yes | Yes | Yes |
Acceptance Criteria
Build Verification
- `cargo build --features mistral-rs` compiles on Linux
- `cargo build --features mistral-rs-metal` compiles on macOS
- `cargo build --features mistral-rs-cuda` compiles with CUDA toolkit
- All clippy warnings resolved
- No breaking changes to existing Candle backend
Functionality
- Model loading works with HuggingFace model IDs
- Model loading works with local paths
- Generation produces correct, coherent output
- Streaming generation works correctly
- Stop sequences are respected
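Stop-sequence handling amounts to truncating output at the earliest match of any configured stop string. A minimal sketch (a streaming implementation would run this check on each decoded chunk instead of the full text):

```rust
/// Sketch of stop-sequence handling: truncate generated text at the
/// earliest occurrence of any configured stop string.
fn apply_stop_sequences(text: &str, stops: &[&str]) -> String {
    let cut = stops
        .iter()
        .filter_map(|s| text.find(s)) // byte offset of each stop, if present
        .min()                        // earliest match wins
        .unwrap_or(text.len());       // no match: keep everything
    text[..cut].to_string()
}

fn main() {
    let out = apply_stop_sequences("fn main() {}\n</s>more", &["</s>", "\n\n"]);
    assert_eq!(out, "fn main() {}\n");
}
```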
PagedAttention
- KV cache is managed in blocks
- Sequence allocation succeeds up to max capacity
- Sequence deallocation frees blocks correctly
- Prefix caching improves repeated prompt performance
- Memory usage stays within configured limits
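The prefix-caching criterion above relies on identical prompt prefixes mapping to already-populated KV blocks. A toy sketch of the lookup (keyed per block for brevity; a real implementation hashes the cumulative prefix so a block is only reusable when everything before it also matches):

```rust
use std::collections::HashMap;

/// Toy prefix cache: full blocks of prompt token ids are keyed by their
/// contents so a repeated prompt reuses cached blocks instead of
/// recomputing their KV entries.
struct PrefixCache {
    block_size: usize,
    cached: HashMap<Vec<u32>, usize>, // block token ids -> physical block id
    next_block: usize,
}

impl PrefixCache {
    fn new(block_size: usize) -> Self {
        Self { block_size, cached: HashMap::new(), next_block: 0 }
    }

    /// Map a prompt onto blocks, returning (cache hits, cache misses).
    fn lookup_or_insert(&mut self, tokens: &[u32]) -> (usize, usize) {
        let (mut hits, mut misses) = (0, 0);
        for chunk in tokens.chunks_exact(self.block_size) {
            if self.cached.contains_key(chunk) {
                hits += 1; // KV for this block already computed
            } else {
                self.cached.insert(chunk.to_vec(), self.next_block);
                self.next_block += 1;
                misses += 1; // must prefill this block
            }
        }
        (hits, misses)
    }
}

fn main() {
    let mut cache = PrefixCache::new(4);
    let prompt: Vec<u32> = (0..12).collect(); // 3 full blocks
    assert_eq!(cache.lookup_or_insert(&prompt), (0, 3)); // cold: all misses
    assert_eq!(cache.lookup_or_insert(&prompt), (3, 0)); // warm: all hits
}
```

This is what makes repeated system prompts cheap: only the suffix that differs pays prefill cost.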
X-LoRA
- Multiple adapters can be loaded
- Per-token routing selects appropriate adapters
- Adapter hot-loading works without restart
- Quality matches or exceeds single-adapter baseline
ISQ
- Models quantize at runtime without pre-quantized weights
- All supported methods produce valid output
- Memory reduction matches expected compression ratio
- Quality degradation within acceptable bounds (<5% on benchmarks)
Performance Benchmarks
- Throughput: >500 tok/s on Mistral-7B (single user)
- Concurrency: >50 concurrent generations without OOM
- Latency: <50ms time-to-first-token
- Memory: PagedAttention reduces peak usage by >30%
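The throughput and latency targets can be computed from per-token emission timestamps collected during a benchmark run. A small sketch of that reduction (the timing data here is synthetic):

```rust
use std::time::Duration;

/// Compute tokens/second and time-to-first-token from per-token emission
/// timestamps, each measured from request start.
fn throughput_and_ttft(token_times: &[Duration]) -> (f64, Duration) {
    let ttft = token_times[0];
    let total = token_times.last().unwrap().as_secs_f64();
    let tok_per_s = token_times.len() as f64 / total;
    (tok_per_s, ttft)
}

fn main() {
    // Synthetic run: 600 tokens over ~1 second, first token at ~42 ms.
    let times: Vec<Duration> = (1..=600)
        .map(|i| Duration::from_millis(40) + Duration::from_micros(i * 1600))
        .collect();
    let (tps, ttft) = throughput_and_ttft(&times);
    assert!(ttft < Duration::from_millis(50)); // TTFT target
    assert!(tps > 500.0);                      // throughput target
}
```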
Testing Plan
Unit Tests
```rust
#[cfg(feature = "mistral-rs")]
mod mistral_tests {
    #[tokio::test]
    async fn test_model_loading() { ... }

    #[tokio::test]
    async fn test_generation() { ... }

    #[tokio::test]
    async fn test_paged_attention_allocation() { ... }

    #[tokio::test]
    async fn test_xlora_routing() { ... }

    #[tokio::test]
    async fn test_isq_quantization() { ... }
}
```
Integration Tests
- Model download and cache management
- End-to-end generation pipeline
- Concurrent request handling
- Memory pressure scenarios
Benchmarks
```bash
# Run throughput benchmark
cargo bench --features mistral-rs-metal -- throughput

# Run concurrency benchmark
cargo bench --features mistral-rs-metal -- concurrency

# Run memory benchmark
cargo bench --features mistral-rs-metal -- memory
```
Implementation Notes
Thread Safety
mistral-rs uses async Rust throughout. Ensure all shared state is properly synchronized:
- Use `Arc<RwLock<...>>` for shared configuration
- Use channels for sequence lifecycle events
- Avoid blocking in async contexts
Error Handling
Map mistral-rs errors to ruvllm error types:
```rust
impl From<mistralrs::Error> for RuvllmError {
    fn from(e: mistralrs::Error) -> Self {
        match e {
            mistralrs::Error::ModelLoad(_) => RuvllmError::ModelLoad(...),
            mistralrs::Error::Generation(_) => RuvllmError::Generation(...),
            // ...
        }
    }
}
```
Backward Compatibility
- Keep Candle backend as default
- Use feature flags for mistral-rs
- Maintain consistent API across backends
- Document migration path
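The "consistent API across backends" point is naturally expressed as a shared trait that both backends implement, so callers depend only on the abstraction. A sketch with hypothetical names (the actual ruvllm trait and types may differ):

```rust
/// Illustrative backend abstraction: callers program against the trait,
/// and feature flags decide which concrete backend gets constructed.
trait LlmBackend {
    fn name(&self) -> &'static str;
    fn generate(&self, prompt: &str) -> String;
}

// Stand-ins for the real backends, just to show the shape.
struct CandleStub;
struct MistralStub;

impl LlmBackend for CandleStub {
    fn name(&self) -> &'static str { "candle" }
    fn generate(&self, prompt: &str) -> String { format!("[candle] {prompt}") }
}

impl LlmBackend for MistralStub {
    fn name(&self) -> &'static str { "mistral-rs" }
    fn generate(&self, prompt: &str) -> String { format!("[mistral-rs] {prompt}") }
}

fn main() {
    // Either backend can sit behind the same trait object.
    let backends: Vec<Box<dyn LlmBackend>> =
        vec![Box::new(CandleStub), Box::new(MistralStub)];
    for b in &backends {
        assert!(b.generate("hi").contains("hi"));
    }
    assert_eq!(backends[0].name(), "candle");
}
```

Keeping the trait stable is what makes the migration path painless: users opt into mistral-rs via a feature flag without touching call sites.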
Related Issues
- Depends on: Initial MistralBackend stub implementation
- Blocks: Production deployment readiness
- Related: Candle backend optimizations
Labels: enhancement, ruvllm, backend, performance, P1
Milestone: v0.2.0
Assignees: TBD