feat(ruvllm): Full mistral-rs backend integration with PagedAttention, X-LoRA, and ISQ
Summary
Wire the existing MistralBackend stub to the actual mistral-rs crate for production-scale LLM serving with advanced memory management and adapter routing.
Motivation
The current Candle backend is optimized for single-user and edge deployment scenarios, achieving approximately 100 tokens/second. While sufficient for development and small-scale use, production deployments require significantly higher throughput and concurrency.
mistral-rs enables:
- 500-1000 tok/s throughput via continuous batching and PagedAttention
- 50+ concurrent users with efficient KV cache management
- Memory efficiency through paged memory allocation and prefix caching
- Dynamic adapter routing via X-LoRA for multi-task inference
- Runtime quantization via ISQ for deployment flexibility
Performance Comparison
| Metric | Candle Backend | mistral-rs Backend |
|---|---|---|
| Throughput | ~100 tok/s | 500-1000 tok/s |
| Concurrent Users | 1-5 | 50+ |
| Memory Efficiency | Static KV | Paged + Prefix Cache |
| Adapter Support | Static LoRA | Dynamic X-LoRA |
| Quantization | Pre-quantized only | Runtime ISQ |
Features to Implement
1. PagedAttention (Priority: High)
PagedAttention revolutionizes KV cache management by treating attention as virtual memory, enabling efficient memory sharing across sequences.
- Add `mistralrs` dependency to `Cargo.toml` with feature flags
- Wire PagedAttention into `MistralBackend::generate()`
- Implement sequence allocation/deallocation callbacks
- Add prefix caching support for prompt reuse
- Configure block size and max sequences
- Benchmark: target 5-10x concurrent capacity improvement
Key Implementation Points:
```rust
// Block configuration
let paged_config = PagedAttentionConfig {
    block_size: 16,        // Tokens per block
    max_num_blocks: 1024,  // Total blocks available
    sliding_window: None,  // Optional sliding window
    prefix_caching: true,  // Enable prefix cache
};
```
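The allocation/deallocation callbacks above can be sketched as a minimal block allocator. This is an illustrative model only, with hypothetical names; the real mistral-rs scheduler manages physical KV blocks internally:

```rust
use std::collections::HashMap;

/// Minimal paged KV-cache block allocator sketch (names are illustrative,
/// not the actual mistral-rs API).
struct BlockAllocator {
    block_size: usize,                    // tokens per block
    free_blocks: Vec<usize>,              // indices of unused blocks
    seq_blocks: HashMap<u64, Vec<usize>>, // sequence id -> owned blocks
}

impl BlockAllocator {
    fn new(block_size: usize, max_num_blocks: usize) -> Self {
        Self {
            block_size,
            free_blocks: (0..max_num_blocks).collect(),
            seq_blocks: HashMap::new(),
        }
    }

    /// Reserve enough blocks to hold `num_tokens` for a new sequence.
    fn allocate(&mut self, seq_id: u64, num_tokens: usize) -> Result<(), String> {
        let needed = (num_tokens + self.block_size - 1) / self.block_size;
        if needed > self.free_blocks.len() {
            return Err("out of KV cache blocks".into());
        }
        let blocks = self.free_blocks.split_off(self.free_blocks.len() - needed);
        self.seq_blocks.insert(seq_id, blocks);
        Ok(())
    }

    /// Return a finished sequence's blocks to the free pool.
    fn free(&mut self, seq_id: u64) {
        if let Some(blocks) = self.seq_blocks.remove(&seq_id) {
            self.free_blocks.extend(blocks);
        }
    }

    fn free_block_count(&self) -> usize {
        self.free_blocks.len()
    }
}

fn main() {
    let mut alloc = BlockAllocator::new(16, 1024);
    // 100 tokens -> ceil(100/16) = 7 blocks
    alloc.allocate(1, 100).unwrap();
    assert_eq!(alloc.free_block_count(), 1024 - 7);
    alloc.free(1);
    assert_eq!(alloc.free_block_count(), 1024);
    println!("paged allocation round-trip ok");
}
```

Because blocks are fixed-size and returned to a shared pool, memory is bounded by `block_size * max_num_blocks` regardless of how sequence lengths are distributed.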
2. X-LoRA Dynamic Routing (Priority: Medium)
X-LoRA enables per-token routing to different LoRA adapters, allowing a single model to handle multiple tasks efficiently.
- Wire `XLoraManager` to the mistral-rs X-LoRA implementation
- Implement per-token adapter routing logic
- Support learned routing networks (classifier)
- Add adapter hot-loading for runtime updates
- Implement adapter weight caching
- Benchmark: multi-task quality metrics vs single adapters
Key Implementation Points:
```rust
// X-LoRA configuration
let xlora_config = XLoraConfig {
    adapters: vec![
        ("code", "path/to/code-lora"),
        ("chat", "path/to/chat-lora"),
        ("reasoning", "path/to/reasoning-lora"),
    ],
    routing_method: RoutingMethod::Learned,
    top_k_adapters: 2,    // Use top-2 adapters per token
    scaling_factor: 1.0,
};
```
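The `top_k_adapters` setting above boils down to standard top-k routing over the router's scores. A minimal sketch of that selection step (illustrative only; the real routing happens inside the mistral-rs model graph):

```rust
/// Sketch of X-LoRA-style per-token adapter routing: softmax the router
/// scores, keep the top-k adapters, and renormalize their weights.
fn route_top_k(scores: &[f32], top_k: usize) -> Vec<(usize, f32)> {
    // Numerically stable softmax over the raw router scores.
    let max = scores.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = scores.iter().map(|s| (s - max).exp()).collect();
    let sum: f32 = exps.iter().sum();
    let mut probs: Vec<(usize, f32)> =
        exps.iter().map(|e| e / sum).enumerate().collect();
    // Keep only the top-k adapters and renormalize their mass to 1.
    probs.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap());
    probs.truncate(top_k);
    let kept: f32 = probs.iter().map(|(_, p)| p).sum();
    probs.iter().map(|(i, p)| (*i, p / kept)).collect()
}

fn main() {
    // Three adapters ("code", "chat", "reasoning"); the router favors 0 and 2.
    let weights = route_top_k(&[2.0, 0.1, 1.5], 2);
    assert_eq!(weights.len(), 2);
    assert_eq!(weights[0].0, 0); // "code" wins for this token
    let total: f32 = weights.iter().map(|(_, w)| w).sum();
    assert!((total - 1.0).abs() < 1e-6); // kept weights sum to 1
}
```

Per-token routing means the selected pair of adapters can differ at every decoding step, which is what lets one base model serve multiple tasks.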
3. ISQ Runtime Quantization (Priority: Medium)
In-Situ Quantization allows loading full-precision models and quantizing at runtime, providing deployment flexibility.
- Wire `IsqConfig` to the mistral-rs ISQ implementation
- Support quantization methods: AWQ, GPTQ, RTN, SmoothQuant
- Implement calibration workflow with sample data
- Add memory estimation before/after quantization
- Support mixed-precision quantization per layer
- Benchmark: quality vs compression tradeoffs
Supported Quantization Methods:
| Method | Bits | Quality | Speed | Use Case |
|---|---|---|---|---|
| AWQ | 4-bit | High | Fast | Production |
| GPTQ | 4-bit | High | Medium | Accuracy-critical |
| RTN | 8-bit | Very High | Very Fast | Quality-first |
| SmoothQuant | 8-bit | Very High | Fast | Balanced |
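Of the methods above, RTN is the simplest to picture. A minimal sketch of symmetric per-tensor 8-bit round-to-nearest quantization (not the mistral-rs implementation, just the underlying arithmetic):

```rust
/// Sketch of round-to-nearest (RTN) 8-bit quantization for one weight
/// tensor: compute a symmetric per-tensor scale, quantize, dequantize.
fn quantize_rtn_i8(weights: &[f32]) -> (Vec<i8>, f32) {
    let max_abs = weights.iter().fold(0f32, |m, w| m.max(w.abs()));
    let scale = if max_abs == 0.0 { 1.0 } else { max_abs / 127.0 };
    let q = weights
        .iter()
        .map(|w| (w / scale).round().clamp(-127.0, 127.0) as i8)
        .collect();
    (q, scale)
}

fn dequantize_i8(q: &[i8], scale: f32) -> Vec<f32> {
    q.iter().map(|&v| v as f32 * scale).collect()
}

fn main() {
    let w = [0.5f32, -1.0, 0.25, 0.0];
    let (q, scale) = quantize_rtn_i8(&w);
    let back = dequantize_i8(&q, scale);
    // RTN round-trip error is bounded by half the scale step.
    for (a, b) in w.iter().zip(back.iter()) {
        assert!((a - b).abs() <= scale / 2.0 + 1e-6);
    }
    // 4x compression vs f32 (1 byte vs 4 per weight, ignoring the scale).
    assert_eq!(q.len() * std::mem::size_of::<i8>() * 4,
               w.len() * std::mem::size_of::<f32>());
}
```

AWQ, GPTQ, and SmoothQuant improve on this baseline with calibration data and activation-aware scaling, which is why they hold quality better at 4-bit.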
Technical Details
Cargo.toml Changes
```toml
[dependencies]
# Core mistral-rs integration
mistralrs = { version = "0.4", optional = true }
mistralrs-core = { version = "0.4", optional = true }

# Required for tokenization with mistral-rs
tokenizers = { version = "0.20", optional = true }

[features]
default = ["candle"]

# Base mistral-rs support (CPU)
mistral-rs = ["mistralrs", "mistralrs-core", "tokenizers"]

# Metal acceleration (macOS)
mistral-rs-metal = ["mistral-rs", "mistralrs/metal"]

# CUDA acceleration (NVIDIA)
mistral-rs-cuda = ["mistral-rs", "mistralrs/cuda"]

# Full feature set
full = ["candle", "mistral-rs"]
```
Files to Modify
| File | Changes |
|---|---|
| `crates/ruvllm/Cargo.toml` | Add mistral-rs dependencies and feature flags |
| `crates/ruvllm/src/backends/mistral_backend.rs` | Replace stub with real implementation |
| `crates/ruvllm/src/backends/mod.rs` | Update conditional exports |
| `crates/ruvllm/src/paged_attention.rs` | Wire to mistral-rs PagedAttention |
| `crates/ruvllm/src/xlora_manager.rs` | Wire to mistral-rs X-LoRA |
| `crates/ruvllm/src/isq.rs` | Wire to mistral-rs ISQ |
| `crates/ruvllm/src/lib.rs` | Add re-exports and feature gates |
| `crates/ruvllm/README.md` | Document usage and examples |
API Design
```rust
use ruvllm::{MistralBackend, MistralConfig, PagedAttentionConfig};

// Create backend with PagedAttention
let config = MistralConfig {
    model_id: "mistralai/Mistral-7B-Instruct-v0.2".to_string(),
    paged_attention: Some(PagedAttentionConfig {
        block_size: 16,
        max_num_blocks: 1024,
        prefix_caching: true,
    }),
    xlora: None,
    isq: None,
};

let backend = MistralBackend::new(config).await?;

// Generate with automatic KV cache management
let output = backend.generate(&request).await?;
```
Feature Flag Matrix
| Build Command | CPU | Metal | CUDA | PagedAttn | X-LoRA | ISQ |
|---|---|---|---|---|---|---|
| `--features mistral-rs` | Yes | No | No | Yes | Yes | Yes |
| `--features mistral-rs-metal` | Yes | Yes | No | Yes | Yes | Yes |
| `--features mistral-rs-cuda` | Yes | No | Yes | Yes | Yes | Yes |
Acceptance Criteria
Build Verification
- `cargo build --features mistral-rs` compiles on Linux
- `cargo build --features mistral-rs-metal` compiles on macOS
- `cargo build --features mistral-rs-cuda` compiles with CUDA toolkit
- All clippy warnings resolved
- No breaking changes to existing Candle backend
Functionality
- Model loading works with HuggingFace model IDs
- Model loading works with local paths
- Generation produces correct, coherent output
- Streaming generation works correctly
- Stop sequences are respected
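Stop-sequence handling amounts to truncating output at the earliest match of any configured stop string. A minimal sketch (a streaming implementation would run this check on each decoded chunk instead of the full text):

```rust
/// Sketch of stop-sequence handling: truncate generated text at the
/// earliest occurrence of any configured stop string.
fn apply_stop_sequences(text: &str, stops: &[&str]) -> String {
    let cut = stops
        .iter()
        .filter_map(|s| text.find(s)) // byte offset of each stop, if present
        .min()                        // earliest match wins
        .unwrap_or(text.len());       // no match: keep everything
    text[..cut].to_string()
}

fn main() {
    let out = apply_stop_sequences("fn main() {}\n</s>more", &["</s>", "\n\n"]);
    assert_eq!(out, "fn main() {}\n");
}
```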
PagedAttention
- KV cache is managed in blocks
- Sequence allocation succeeds up to max capacity
- Sequence deallocation frees blocks correctly
- Prefix caching improves repeated prompt performance
- Memory usage stays within configured limits
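The prefix-caching criterion above relies on identical prompt prefixes mapping to already-populated KV blocks. A toy sketch of the lookup (keyed per block for brevity; a real implementation hashes the cumulative prefix so a block is only reusable when everything before it also matches):

```rust
use std::collections::HashMap;

/// Toy prefix cache: full blocks of prompt token ids are keyed by their
/// contents so a repeated prompt reuses cached blocks instead of
/// recomputing their KV entries.
struct PrefixCache {
    block_size: usize,
    cached: HashMap<Vec<u32>, usize>, // block token ids -> physical block id
    next_block: usize,
}

impl PrefixCache {
    fn new(block_size: usize) -> Self {
        Self { block_size, cached: HashMap::new(), next_block: 0 }
    }

    /// Map a prompt onto blocks, returning (cache hits, cache misses).
    fn lookup_or_insert(&mut self, tokens: &[u32]) -> (usize, usize) {
        let (mut hits, mut misses) = (0, 0);
        for chunk in tokens.chunks_exact(self.block_size) {
            if self.cached.contains_key(chunk) {
                hits += 1; // KV for this block already computed
            } else {
                self.cached.insert(chunk.to_vec(), self.next_block);
                self.next_block += 1;
                misses += 1; // must prefill this block
            }
        }
        (hits, misses)
    }
}

fn main() {
    let mut cache = PrefixCache::new(4);
    let prompt: Vec<u32> = (0..12).collect(); // 3 full blocks
    assert_eq!(cache.lookup_or_insert(&prompt), (0, 3)); // cold: all misses
    assert_eq!(cache.lookup_or_insert(&prompt), (3, 0)); // warm: all hits
}
```

This is what makes repeated system prompts cheap: only the suffix that differs pays prefill cost.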
X-LoRA
- Multiple adapters can be loaded
- Per-token routing selects appropriate adapters
- Adapter hot-loading works without restart
- Quality matches or exceeds single-adapter baseline
ISQ
- Models quantize at runtime without pre-quantized weights
- All supported methods produce valid output
- Memory reduction matches expected compression ratio
- Quality degradation within acceptable bounds (<5% on benchmarks)
Performance Benchmarks
- Throughput: >500 tok/s on Mistral-7B (single user)
- Concurrency: >50 concurrent generations without OOM
- Latency: <50ms time-to-first-token
- Memory: PagedAttention reduces peak usage by >30%
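The throughput and latency targets can be computed from per-token emission timestamps collected during a benchmark run. A small sketch of that reduction (the timing data here is synthetic):

```rust
use std::time::Duration;

/// Compute tokens/second and time-to-first-token from per-token emission
/// timestamps, each measured from request start.
fn throughput_and_ttft(token_times: &[Duration]) -> (f64, Duration) {
    let ttft = token_times[0];
    let total = token_times.last().unwrap().as_secs_f64();
    let tok_per_s = token_times.len() as f64 / total;
    (tok_per_s, ttft)
}

fn main() {
    // Synthetic run: 600 tokens over ~1 second, first token at ~42 ms.
    let times: Vec<Duration> = (1..=600)
        .map(|i| Duration::from_millis(40) + Duration::from_micros(i * 1600))
        .collect();
    let (tps, ttft) = throughput_and_ttft(&times);
    assert!(ttft < Duration::from_millis(50)); // TTFT target
    assert!(tps > 500.0);                      // throughput target
}
```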
Testing Plan
Unit Tests
```rust
#[cfg(feature = "mistral-rs")]
mod mistral_tests {
    #[tokio::test]
    async fn test_model_loading() { ... }

    #[tokio::test]
    async fn test_generation() { ... }

    #[tokio::test]
    async fn test_paged_attention_allocation() { ... }

    #[tokio::test]
    async fn test_xlora_routing() { ... }

    #[tokio::test]
    async fn test_isq_quantization() { ... }
}
```
Integration Tests
- Model download and cache management
- End-to-end generation pipeline
- Concurrent request handling
- Memory pressure scenarios
Benchmarks
```bash
# Run throughput benchmark
cargo bench --features mistral-rs-metal -- throughput

# Run concurrency benchmark
cargo bench --features mistral-rs-metal -- concurrency

# Run memory benchmark
cargo bench --features mistral-rs-metal -- memory
```
Implementation Notes
Thread Safety
mistral-rs uses async Rust throughout. Ensure all shared state is properly synchronized:
- Use `Arc<RwLock<...>>` for shared configuration
- Use channels for sequence lifecycle events
- Avoid blocking in async contexts
Error Handling
Map mistral-rs errors to ruvllm error types:
```rust
impl From<mistralrs::Error> for RuvllmError {
    fn from(e: mistralrs::Error) -> Self {
        match e {
            mistralrs::Error::ModelLoad(_) => RuvllmError::ModelLoad(...),
            mistralrs::Error::Generation(_) => RuvllmError::Generation(...),
            // ...
        }
    }
}
```
Backward Compatibility
- Keep Candle backend as default
- Use feature flags for mistral-rs
- Maintain consistent API across backends
- Document migration path
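The "consistent API across backends" point is naturally expressed as a shared trait that both backends implement, so callers depend only on the abstraction. A sketch with hypothetical names (the actual ruvllm trait and types may differ):

```rust
/// Illustrative backend abstraction: callers program against the trait,
/// and feature flags decide which concrete backend gets constructed.
trait LlmBackend {
    fn name(&self) -> &'static str;
    fn generate(&self, prompt: &str) -> String;
}

// Stand-ins for the real backends, just to show the shape.
struct CandleStub;
struct MistralStub;

impl LlmBackend for CandleStub {
    fn name(&self) -> &'static str { "candle" }
    fn generate(&self, prompt: &str) -> String { format!("[candle] {prompt}") }
}

impl LlmBackend for MistralStub {
    fn name(&self) -> &'static str { "mistral-rs" }
    fn generate(&self, prompt: &str) -> String { format!("[mistral-rs] {prompt}") }
}

fn main() {
    // Either backend can sit behind the same trait object.
    let backends: Vec<Box<dyn LlmBackend>> =
        vec![Box::new(CandleStub), Box::new(MistralStub)];
    for b in &backends {
        assert!(b.generate("hi").contains("hi"));
    }
    assert_eq!(backends[0].name(), "candle");
}
```

Keeping the trait stable is what makes the migration path painless: users opt into mistral-rs via a feature flag without touching call sites.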
Related Issues
- Depends on: Initial MistralBackend stub implementation
- Blocks: Production deployment readiness
- Related: Candle backend optimizations
Labels: enhancement, ruvllm, backend, performance, P1
Milestone: v0.2.0
Assignees: TBD