Squashed 'vendor/ruvector/' content from commit b64c2172
git-subtree-dir: vendor/ruvector
git-subtree-split: b64c21726f2bb37286d9ee36a7869fef60cc6900

New file: crates/ruvllm/docs/GITHUB_ISSUE_MISTRAL_RS.md (300 lines)
# feat(ruvllm): Full mistral-rs backend integration with PagedAttention, X-LoRA, and ISQ

## Summary

Wire the existing `MistralBackend` stub to the actual [mistral-rs](https://github.com/EricLBuehler/mistral.rs) crate for production-scale LLM serving with advanced memory management and adapter routing.

## Motivation

The current Candle backend is optimized for single-user and edge deployment scenarios, achieving approximately 100 tokens/second. While sufficient for development and small-scale use, production deployments require significantly higher throughput and concurrency.

**mistral-rs enables:**
- **500-1000 tok/s throughput** via continuous batching and PagedAttention
- **50+ concurrent users** with efficient KV cache management
- **Memory efficiency** through paged memory allocation and prefix caching
- **Dynamic adapter routing** via X-LoRA for multi-task inference
- **Runtime quantization** via ISQ for deployment flexibility

### Performance Comparison

| Metric | Candle Backend | mistral-rs Backend |
|--------|----------------|-------------------|
| Throughput | ~100 tok/s | 500-1000 tok/s |
| Concurrent Users | 1-5 | 50+ |
| Memory Efficiency | Static KV | Paged + Prefix Cache |
| Adapter Support | Static LoRA | Dynamic X-LoRA |
| Quantization | Pre-quantized only | Runtime ISQ |

## Features to Implement

### 1. PagedAttention (Priority: High)

PagedAttention manages the KV cache like virtual memory, allocating it in fixed-size blocks so that memory can be shared efficiently across sequences.

- [ ] Add `mistralrs` dependency to `Cargo.toml` with feature flags
- [ ] Wire PagedAttention to `MistralBackend::generate()`
- [ ] Implement sequence allocation/deallocation callbacks
- [ ] Add prefix caching support for prompt reuse
- [ ] Configure block size and max sequences
- [ ] Benchmark: target 5-10x concurrent capacity improvement

**Key Implementation Points:**
```rust
// Block configuration
let paged_config = PagedAttentionConfig {
    block_size: 16,        // Tokens per block
    max_num_blocks: 1024,  // Total blocks available
    sliding_window: None,  // Optional sliding window
    prefix_caching: true,  // Enable prefix cache
};
```
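
With the configuration above, capacity and per-sequence block counts follow from simple ceiling division. A minimal sketch of that arithmetic (the `blocks_needed` helper is illustrative, not a mistral-rs or ruvllm API):

```rust
// Illustrative helper: how many fixed-size blocks a sequence occupies.
// `blocks_needed` is a hypothetical name, not part of mistral-rs.
fn blocks_needed(seq_len_tokens: usize, block_size: usize) -> usize {
    // Ceiling division: a partially filled block still occupies a slot.
    (seq_len_tokens + block_size - 1) / block_size
}

fn main() {
    let block_size = 16;
    let max_num_blocks = 1024;
    // A 100-token sequence occupies 7 blocks (6 full + 1 partial).
    assert_eq!(blocks_needed(100, block_size), 7);
    // Total cache capacity with these defaults: 16 * 1024 = 16,384 tokens,
    // shared across all live sequences rather than reserved per sequence.
    assert_eq!(block_size * max_num_blocks, 16_384);
}
```

The per-block waste is at most `block_size - 1` tokens per sequence, which is why small block sizes (16-32) are typical.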

### 2. X-LoRA Dynamic Routing (Priority: Medium)

X-LoRA enables per-token routing to different LoRA adapters, allowing a single model to handle multiple tasks efficiently.

- [ ] Wire `XLoraManager` to mistral-rs X-LoRA implementation
- [ ] Implement per-token adapter routing logic
- [ ] Support learned routing networks (classifier)
- [ ] Add adapter hot-loading for runtime updates
- [ ] Implement adapter weight caching
- [ ] Benchmark: multi-task quality metrics vs single adapters

**Key Implementation Points:**
```rust
// X-LoRA configuration
let xlora_config = XLoraConfig {
    adapters: vec![
        ("code", "path/to/code-lora"),
        ("chat", "path/to/chat-lora"),
        ("reasoning", "path/to/reasoning-lora"),
    ],
    routing_method: RoutingMethod::Learned,
    top_k_adapters: 2,    // Use top-2 adapters per token
    scaling_factor: 1.0,
};
```

### 3. ISQ Runtime Quantization (Priority: Medium)

In-Situ Quantization (ISQ) loads full-precision weights and quantizes them at runtime, providing deployment flexibility without pre-quantized checkpoints.

- [ ] Wire `IsqConfig` to mistral-rs ISQ implementation
- [ ] Support quantization methods: AWQ, GPTQ, RTN, SmoothQuant
- [ ] Implement calibration workflow with sample data
- [ ] Add memory estimation before/after quantization
- [ ] Support mixed-precision quantization per layer
- [ ] Benchmark: quality vs compression tradeoffs

**Supported Quantization Methods:**

| Method | Bits | Quality | Speed | Use Case |
|--------|------|---------|-------|----------|
| AWQ | 4-bit | High | Fast | Production |
| GPTQ | 4-bit | High | Medium | Accuracy-critical |
| RTN | 8-bit | Very High | Very Fast | Quality-first |
| SmoothQuant | 8-bit | Very High | Fast | Balanced |

## Technical Details

### Cargo.toml Changes

```toml
[dependencies]
# Core mistral-rs integration
mistralrs = { version = "0.4", optional = true }
mistralrs-core = { version = "0.4", optional = true }

# Required for tokenization with mistral-rs
tokenizers = { version = "0.20", optional = true }

[features]
default = ["candle"]

# Base mistral-rs support (CPU)
mistral-rs = ["mistralrs", "mistralrs-core", "tokenizers"]

# Metal acceleration (macOS)
mistral-rs-metal = ["mistral-rs", "mistralrs/metal"]

# CUDA acceleration (NVIDIA)
mistral-rs-cuda = ["mistral-rs", "mistralrs/cuda"]

# Full feature set
full = ["candle", "mistral-rs"]
```

### Files to Modify

| File | Changes |
|------|---------|
| `crates/ruvllm/Cargo.toml` | Add mistral-rs dependencies and feature flags |
| `crates/ruvllm/src/backends/mistral_backend.rs` | Replace stub with real implementation |
| `crates/ruvllm/src/backends/mod.rs` | Update conditional exports |
| `crates/ruvllm/src/paged_attention.rs` | Wire to mistral-rs PagedAttention |
| `crates/ruvllm/src/xlora_manager.rs` | Wire to mistral-rs X-LoRA |
| `crates/ruvllm/src/isq.rs` | Wire to mistral-rs ISQ |
| `crates/ruvllm/src/lib.rs` | Add re-exports and feature gates |
| `crates/ruvllm/README.md` | Document usage and examples |

### API Design

```rust
use ruvllm::{MistralBackend, MistralConfig, PagedAttentionConfig};

// Create backend with PagedAttention
let config = MistralConfig {
    model_id: "mistralai/Mistral-7B-Instruct-v0.2".to_string(),
    paged_attention: Some(PagedAttentionConfig {
        block_size: 16,
        max_num_blocks: 1024,
        prefix_caching: true,
    }),
    xlora: None,
    isq: None,
};

let backend = MistralBackend::new(config).await?;

// Generate with automatic KV cache management
let output = backend.generate(&request).await?;
```

### Feature Flag Matrix

| Build Command | CPU | Metal | CUDA | PagedAttn | X-LoRA | ISQ |
|---------------|-----|-------|------|-----------|--------|-----|
| `--features mistral-rs` | Yes | No | No | Yes | Yes | Yes |
| `--features mistral-rs-metal` | Yes | Yes | No | Yes | Yes | Yes |
| `--features mistral-rs-cuda` | Yes | No | Yes | Yes | Yes | Yes |

## Acceptance Criteria

### Build Verification
- [ ] `cargo build --features mistral-rs` compiles on Linux
- [ ] `cargo build --features mistral-rs-metal` compiles on macOS
- [ ] `cargo build --features mistral-rs-cuda` compiles with CUDA toolkit
- [ ] All clippy warnings resolved
- [ ] No breaking changes to existing Candle backend

### Functionality
- [ ] Model loading works with HuggingFace model IDs
- [ ] Model loading works with local paths
- [ ] Generation produces correct, coherent output
- [ ] Streaming generation works correctly
- [ ] Stop sequences are respected

### PagedAttention
- [ ] KV cache is managed in blocks
- [ ] Sequence allocation succeeds up to max capacity
- [ ] Sequence deallocation frees blocks correctly
- [ ] Prefix caching improves repeated prompt performance
- [ ] Memory usage stays within configured limits

### X-LoRA
- [ ] Multiple adapters can be loaded
- [ ] Per-token routing selects appropriate adapters
- [ ] Adapter hot-loading works without restart
- [ ] Quality matches or exceeds single-adapter baseline

### ISQ
- [ ] Models quantize at runtime without pre-quantized weights
- [ ] All supported methods produce valid output
- [ ] Memory reduction matches expected compression ratio
- [ ] Quality degradation within acceptable bounds (<5% on benchmarks)

### Performance Benchmarks
- [ ] Throughput: >500 tok/s on Mistral-7B (single user)
- [ ] Concurrency: >50 concurrent generations without OOM
- [ ] Latency: <50ms time-to-first-token
- [ ] Memory: PagedAttention reduces peak usage by >30%
## Testing Plan

### Unit Tests
```rust
#[cfg(feature = "mistral-rs")]
mod mistral_tests {
    #[tokio::test]
    async fn test_model_loading() { ... }

    #[tokio::test]
    async fn test_generation() { ... }

    #[tokio::test]
    async fn test_paged_attention_allocation() { ... }

    #[tokio::test]
    async fn test_xlora_routing() { ... }

    #[tokio::test]
    async fn test_isq_quantization() { ... }
}
```

### Integration Tests
- Model download and cache management
- End-to-end generation pipeline
- Concurrent request handling
- Memory pressure scenarios

### Benchmarks
```bash
# Run throughput benchmark
cargo bench --features mistral-rs-metal -- throughput

# Run concurrency benchmark
cargo bench --features mistral-rs-metal -- concurrency

# Run memory benchmark
cargo bench --features mistral-rs-metal -- memory
```

## Implementation Notes

### Thread Safety
mistral-rs uses async Rust throughout. Ensure all shared state is properly synchronized:
- Use `Arc<RwLock<...>>` for shared configuration
- Use channels for sequence lifecycle events
- Avoid blocking in async contexts
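
The pattern above can be sketched as follows. This is an illustrative synchronization sketch only; `SharedConfig` and `SeqEvent` are hypothetical names, not ruvllm or mistral-rs types:

```rust
use std::sync::mpsc;
use std::sync::{Arc, RwLock};

// Hypothetical shared configuration, read often and written rarely.
struct SharedConfig {
    max_sequences: usize,
}

// Hypothetical sequence lifecycle events, sent over a channel instead of
// being tracked in shared mutable state.
enum SeqEvent {
    Allocated(u64),
    Freed(u64),
}

fn main() {
    // Read-mostly configuration behind Arc<RwLock<...>>: many concurrent
    // readers, exclusive writers.
    let config = Arc::new(RwLock::new(SharedConfig { max_sequences: 64 }));
    let reader = Arc::clone(&config);
    assert_eq!(reader.read().unwrap().max_sequences, 64);

    // Lifecycle events flow through a channel, so producers never hold a lock.
    let (tx, rx) = mpsc::channel();
    tx.send(SeqEvent::Allocated(1)).unwrap();
    tx.send(SeqEvent::Freed(1)).unwrap();
    assert!(matches!(rx.recv().unwrap(), SeqEvent::Allocated(1)));
    assert!(matches!(rx.recv().unwrap(), SeqEvent::Freed(1)));
}
```

In async contexts, the tokio equivalents (`tokio::sync::RwLock`, `tokio::sync::mpsc`) avoid blocking the executor.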

### Error Handling
Map mistral-rs errors to ruvllm error types:
```rust
impl From<mistralrs::Error> for RuvllmError {
    fn from(e: mistralrs::Error) -> Self {
        match e {
            mistralrs::Error::ModelLoad(_) => RuvllmError::ModelLoad(...),
            mistralrs::Error::Generation(_) => RuvllmError::Generation(...),
            // ...
        }
    }
}
```

### Backward Compatibility
- Keep Candle backend as default
- Use feature flags for mistral-rs
- Maintain consistent API across backends
- Document migration path

## Related Issues

- Depends on: Initial MistralBackend stub implementation
- Blocks: Production deployment readiness
- Related: Candle backend optimizations

## References

- [mistral-rs GitHub](https://github.com/EricLBuehler/mistral.rs)
- [PagedAttention Paper](https://arxiv.org/abs/2309.06180)
- [X-LoRA Paper](https://arxiv.org/abs/2402.07148)
- [AWQ Paper](https://arxiv.org/abs/2306.00978)
- [vLLM PagedAttention](https://blog.vllm.ai/2023/06/20/vllm.html)

---

**Labels:** `enhancement`, `ruvllm`, `backend`, `performance`, `P1`

**Milestone:** v0.2.0

**Assignees:** TBD
New file: crates/ruvllm/docs/GITHUB_ISSUE_SOTA_FEATURES.md (555 lines)

# feat(ruvllm): Implement SOTA features for production agentic workflows

**Labels**: `enhancement`, `p0-critical`, `agentic`, `v2.4`, `mistral-rs`, `performance`

---

## Summary

RuvLLM v2.4 SOTA feature implementation, adding the three critical capabilities needed for production agentic workflows: **Structured Output**, **Function Calling**, and **Prefix Caching**.

These features are essential for modern LLM applications and are currently blocking production adoption by major agent frameworks.

---

## Motivation

### Why This Matters

**Current State:**
- RuvLLM cannot reliably generate structured outputs (JSON schema enforcement)
- No native function calling support for tool-using agents
- Repeated prompts/prefixes incur full generation costs (no caching)
- Agent frameworks (LangChain, LlamaIndex, CrewAI) cannot integrate

**Impact:**
- **Blocking production adoption** for agentic workflows
- **Cost inefficiency**: 10-100x slower for RAG/chat applications vs competitors
- **Reliability gap**: JSON parsing failures break agent loops
- **Missing compatibility**: Cannot replace vLLM, llama.cpp, SGLang in existing stacks

**Competitive Gap:**

| Feature | vLLM | llama.cpp | SGLang | RuvLLM |
|---------|------|-----------|--------|--------|
| Structured Output | ✅ | ✅ | ✅ | ❌ |
| Function Calling | ✅ | ✅ | ✅ | ❌ |
| Prefix Caching | ✅ | ✅ | ✅ | ❌ |

---
## Features

### 1. Structured Output / JSON Mode (P0)

**Objective**: Guarantee valid JSON output conforming to user-provided schemas.

#### Core Capabilities
- [ ] **JSON schema validation** (JSON Schema Draft 7 support)
  - Primitive types: `string`, `number`, `boolean`, `null`
  - Complex types: `object`, `array`
  - Nested schemas with `properties`, `items`, `required`
  - Constraints: `minLength`, `maxLength`, `pattern`, `enum`, `minimum`, `maximum`

- [ ] **Constrained decoding with logit bias**
  - State machine for tracking JSON structure (open braces, quotes, commas)
  - Token masking to enforce valid next tokens
  - Rejection sampling fallback for complex schemas

- [ ] **Bracket/brace state machine**
  - Track depth of `{}` and `[]`
  - Enforce closing brackets
  - Handle escaped quotes in strings

- [ ] **JSON repair for malformed output**
  - Auto-close unclosed braces/brackets
  - Fix trailing commas
  - Escape unescaped quotes
  - Best-effort recovery mode

- [ ] **GBNF grammar support (future)**
  - llama.cpp-compatible grammar format
  - Custom domain-specific languages

- [ ] **Comprehensive tests**
  - Unit tests for all JSON types
  - Property-based testing with Hypothesis/QuickCheck
  - Adversarial inputs (deeply nested, large arrays)

- [ ] **Benchmarks vs unconstrained**
  - Measure latency overhead (<10% target)
  - Throughput impact
  - Memory usage
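
The bracket/brace state machine described above can be sketched in a few dozen lines. This is a minimal illustration of the tracking logic (hypothetical `JsonState` type, not the ruvllm implementation, and without the token-masking integration):

```rust
// Minimal sketch of a bracket/brace state machine for JSON structure.
// Tracks open braces/brackets and string/escape state; a constrained
// decoder would consult this state to mask invalid next tokens.
struct JsonState {
    depth_stack: Vec<char>, // open '{' / '[' awaiting their close
    in_string: bool,
    escaped: bool,
}

impl JsonState {
    fn new() -> Self {
        Self { depth_stack: Vec::new(), in_string: false, escaped: false }
    }

    // Feed one character; returns false if it would make the JSON invalid.
    fn push(&mut self, c: char) -> bool {
        if self.in_string {
            match (self.escaped, c) {
                (true, _) => self.escaped = false,   // consume escaped char
                (false, '\\') => self.escaped = true,
                (false, '"') => self.in_string = false,
                _ => {}
            }
            return true;
        }
        match c {
            '"' => self.in_string = true,
            '{' | '[' => self.depth_stack.push(c),
            '}' => return self.depth_stack.pop() == Some('{'),
            ']' => return self.depth_stack.pop() == Some('['),
            _ => {}
        }
        true
    }

    // A document is complete when every bracket is closed.
    fn complete(&self) -> bool {
        self.depth_stack.is_empty() && !self.in_string
    }
}

fn main() {
    let mut st = JsonState::new();
    for c in r#"{"name": "John", "tags": ["a"]}"#.chars() {
        assert!(st.push(c));
    }
    assert!(st.complete());
}
```

A real constrained decoder would run this per candidate token (over the token's characters) and zero out the logits of tokens whose characters are rejected.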

#### Example API
```rust
let schema = json!({
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "age": {"type": "number", "minimum": 0},
        "tags": {"type": "array", "items": {"type": "string"}}
    },
    "required": ["name"]
});

let response = llm.generate(GenerateRequest {
    prompt: "Extract person info: John is 30",
    json_schema: Some(schema),
    strict: true, // Guarantee valid JSON
    ..Default::default()
})?;

// response.text is guaranteed valid JSON matching schema
```

#### Acceptance Criteria
- [ ] **100% valid JSON** when `strict: true` enabled
- [ ] **<10% latency overhead** vs unconstrained generation
- [ ] **Schema validation passes** for nested objects/arrays (depth ≥ 5)
- [ ] **Repair mode** recovers ≥95% of malformed outputs

---

### 2. Function Calling / Tool Use (P0)

**Objective**: Enable LLMs to call external tools/functions with structured arguments.

#### Core Capabilities
- [ ] **Tool definition schema**
  - Function name, description
  - Parameters (JSON schema)
  - Return type (optional)

- [ ] **ToolChoice enum**
  - `auto`: Model decides whether to call tools
  - `none`: Never call tools (text-only)
  - `required`: Must call at least one tool
  - `specific(name)`: Force a specific tool

- [ ] **Parallel tool calls**
  - Generate multiple tool calls in one response
  - Dependency-aware ordering

- [ ] **Tool result handling**
  - Inject tool results back into conversation
  - Continue generation after tool execution
  - Multi-turn tool loops

- [ ] **Model-specific formats**
  - Llama 3.1 tool format (`<|python_tag|>`)
  - Mistral tool format (function tags)
  - Qwen tool format
  - Claude tool format

- [ ] **OpenAI API compatibility layer**
  - `tools` parameter
  - `tool_choice` parameter
  - `ChatCompletionToolCall` response format

- [ ] **LangChain integration tests**
  - Works with `AgentExecutor`
  - Compatible with `StructuredTool`
  - Multi-agent workflows

#### Example API
```rust
let tools = vec![
    Tool {
        name: "get_weather".into(),
        description: "Get current weather for a location".into(),
        parameters: json!({
            "type": "object",
            "properties": {
                "location": {"type": "string"},
                "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]}
            },
            "required": ["location"]
        }),
    },
    Tool {
        name: "search_web".into(),
        description: "Search the web".into(),
        parameters: json!({
            "type": "object",
            "properties": {
                "query": {"type": "string"}
            },
            "required": ["query"]
        }),
    },
];

let response = llm.chat(ChatRequest {
    messages: vec![
        Message::user("What's the weather in SF and latest AI news?")
    ],
    tools: Some(tools),
    tool_choice: ToolChoice::Auto,
    ..Default::default()
})?;

// response.tool_calls contains parallel calls:
// [get_weather(location="San Francisco"), search_web(query="AI news")]
```

#### Acceptance Criteria
- [ ] **OpenAI API format compatibility** (passes OpenAI SDK tests)
- [ ] **LangChain AgentExecutor** integration works end-to-end
- [ ] **Parallel tool calls** supported (≥3 concurrent)
- [ ] **Multi-turn tool conversations** (≥5 turns)
- [ ] **Tool call success rate** ≥95% for common tools

---

### 3. Prefix Caching (P0)

**Objective**: Cache and reuse KV cache for repeated prompt prefixes (system prompts, RAG documents).

#### Core Capabilities
- [ ] **Hash-based prefix lookup**
  - SHA-256 hash of token IDs
  - Fast O(1) cache hit detection

- [ ] **Radix tree implementation**
  - Efficient storage for overlapping prefixes
  - Longest common prefix matching
  - Memory-efficient sharing

- [ ] **KV cache copy-on-write**
  - Share read-only cache entries
  - Copy only on divergence
  - Zero-copy for cache hits

- [ ] **LRU eviction policy**
  - Evict least recently used prefixes
  - Configurable cache size
  - Per-model cache isolation

- [ ] **Memory limits**
  - Hard limit on cache size (bytes)
  - Soft limit with warning
  - Graceful degradation

- [ ] **Cache hit/miss metrics**
  - Prometheus metrics
  - Hit rate tracking
  - Memory usage stats

- [ ] **Chat prefix caching**
  - System prompt caching
  - Conversation history caching
  - Automatic prefix detection

- [ ] **RAG document caching**
  - Document chunk prefixes
  - Query-independent context
  - Multi-query reuse
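
The longest-common-prefix matching at the heart of the lookup can be sketched as below. This is a hypothetical linear-scan illustration; a real implementation would use the radix tree described above, but the hit/miss semantics are the same:

```rust
// Illustrative sketch: find how many leading tokens of `prompt` are
// already covered by some cached token sequence. `longest_cached_prefix`
// is a hypothetical name, not a ruvllm API.
fn longest_cached_prefix(prompt: &[u32], cached: &[Vec<u32>]) -> usize {
    cached
        .iter()
        .map(|entry| {
            prompt
                .iter()
                .zip(entry.iter())
                .take_while(|(a, b)| a == b)
                .count()
        })
        .max()
        .unwrap_or(0)
}

fn main() {
    // Two cached entries sharing the system-prompt tokens [1, 2, 3].
    let cache = vec![vec![1, 2, 3, 4], vec![1, 2, 3, 9, 9]];
    // A new prompt reuses the first 4 cached tokens of entry 0.
    assert_eq!(longest_cached_prefix(&[1, 2, 3, 4, 5], &cache), 4);
    // An unrelated prompt is a full cache miss.
    assert_eq!(longest_cached_prefix(&[7, 8], &cache), 0);
}
```

Only the tokens past the matched prefix need a prefill pass; the matched portion reuses the stored KV blocks.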

#### Example API
```rust
// First request - cache miss
let response1 = llm.generate(GenerateRequest {
    prompt: "System: You are a helpful assistant.\nUser: Hello",
    cache_prefix: Some(CacheConfig {
        enable: true,
        key: Some("chat-system-prompt".into()),
        ttl_seconds: Some(3600),
    }),
    ..Default::default()
})?;
// Latency: 500ms

// Second request - cache hit (reuses "System: You are..." KV cache)
let response2 = llm.generate(GenerateRequest {
    prompt: "System: You are a helpful assistant.\nUser: How are you?",
    cache_prefix: Some(CacheConfig {
        enable: true,
        key: Some("chat-system-prompt".into()),
        ttl_seconds: Some(3600),
    }),
    ..Default::default()
})?;
// Latency: 50ms (10x faster)
```

#### Performance Targets
- [ ] **10x speedup** for repeated system prompts (cache hit)
- [ ] **<5% overhead** for cache miss
- [ ] **Memory-bounded** (configurable, default 2GB)
- [ ] **Thread-safe** for concurrent requests
- [ ] **Hit rate ≥80%** for typical chat/RAG workloads

#### Acceptance Criteria
- [ ] **Speedup**: ≥10x for 1024-token prefix reuse
- [ ] **Memory**: Bounded by config, no OOM
- [ ] **Correctness**: Identical outputs for cached vs uncached
- [ ] **Concurrency**: No race conditions (stress tested)
- [ ] **Metrics**: Prometheus metrics exported

---

## Technical Design

### Architecture Overview

```
┌─────────────────────────────────────────────────────────┐
│                       RuvLLM v2.4                       │
├─────────────────────────────────────────────────────────┤
│  ┌─────────────────┐  ┌──────────────┐  ┌────────────┐  │
│  │   Structured    │  │   Function   │  │   Prefix   │  │
│  │  Output Engine  │  │   Calling    │  │   Cache    │  │
│  │                 │  │    Router    │  │  Manager   │  │
│  │ - JSON Schema   │  │ - Tool Defs  │  │ - Radix    │  │
│  │ - Logit Bias    │  │ - ToolChoice │  │   Tree     │  │
│  │ - State Machine │  │ - Multi-call │  │ - LRU      │  │
│  └────────┬────────┘  └──────┬───────┘  └─────┬──────┘  │
│           │                  │                │         │
│           └──────────────────┼────────────────┘         │
│                              │                          │
│                    ┌─────────▼──────────┐               │
│                    │  mistral-rs Core   │               │
│                    │ - Model Loading    │               │
│                    │ - Token Sampling   │               │
│                    │ - KV Cache         │               │
│                    └────────────────────┘               │
└─────────────────────────────────────────────────────────┘
```

### Reference ADRs

- **ADR-009**: Structured Output Implementation
  - Constrained decoding algorithm
  - JSON schema validation approach
  - Performance optimization strategies

- **ADR-010**: Function Calling Architecture
  - Tool definition format
  - Multi-model compatibility layer
  - Parallel execution model

- **ADR-011**: Prefix Caching Design
  - Radix tree structure
  - Eviction policies
  - Memory management

- **ADR-008**: mistral-rs Integration
  - Dependency structure
  - API surface
  - Migration path

---

## Implementation Plan

### Phase 1: Foundation (Weeks 1-2)
**Focus**: Structured Output basics + Function Calling definitions

- [ ] Week 1: JSON schema parser and validator
  - Implement schema types (object, array, string, number, boolean, null)
  - Unit tests for all types
  - Property-based tests

- [ ] Week 2: Constrained decoding MVP
  - Logit bias implementation
  - Simple state machine (braces, brackets)
  - Integration with mistral-rs sampler
  - Basic function calling types (Tool, ToolChoice enums)

**Deliverable**: JSON mode works for simple schemas, tool definitions parsed

---

### Phase 2: Core Logic (Weeks 3-4)
**Focus**: Constrained decoding + Tool generation

- [ ] Week 3: Advanced constrained decoding
  - Nested schema support
  - String pattern matching
  - Enum constraints
  - JSON repair mode

- [ ] Week 4: Tool call generation
  - Llama 3.1 format support
  - Mistral format support
  - Parallel tool calls
  - OpenAI API compatibility layer

**Deliverable**: Complex JSON schemas work, tool calls generated in OpenAI format

---

### Phase 3: Caching + Polish (Weeks 5-6)
**Focus**: Prefix Caching + Integration tests

- [ ] Week 5: Prefix caching implementation
  - Radix tree structure
  - Hash-based lookup
  - LRU eviction
  - Thread safety (RwLock)

- [ ] Week 6: Integration + benchmarks
  - LangChain integration tests
  - RAG workflow tests
  - Performance benchmarks
  - Documentation
  - Example applications

**Deliverable**: All 3 features production-ready, benchmarked, documented

---

## Testing Strategy

### Unit Tests
- JSON schema validation (all types, nested, constraints)
- Logit bias correctness
- Tool definition parsing
- Prefix cache hit/miss logic
- Radix tree operations

### Integration Tests
- LangChain AgentExecutor with tools
- LlamaIndex ReAct agent
- CrewAI multi-agent workflows
- OpenAI SDK compatibility tests

### Benchmarks
- Structured output latency vs unconstrained
- Tool calling accuracy (% correct tool selections)
- Prefix cache speedup (1x, 10x, 100x reuse)
- Memory usage under load

### Stress Tests
- 1000 concurrent requests with caching
- Deeply nested JSON schemas (depth 20)
- Large tool libraries (100+ tools)
- Multi-turn tool conversations (50+ turns)

---

## Success Metrics

### Structured Output
- [ ] **Validity**: 100% valid JSON when `strict: true`
- [ ] **Overhead**: <10% latency vs unconstrained
- [ ] **Schema compliance**: 100% for depth ≤10 schemas
- [ ] **Repair rate**: ≥95% successful repairs

### Function Calling
- [ ] **Compatibility**: Passes OpenAI SDK test suite
- [ ] **LangChain**: Works with AgentExecutor (5+ examples)
- [ ] **Accuracy**: ≥95% correct tool selection (benchmark dataset)
- [ ] **Parallel calls**: Supports ≥5 concurrent tools

### Prefix Caching
- [ ] **Speedup**: 10x for 1024-token prefix, 100x for 4096-token
- [ ] **Hit rate**: ≥80% for chat workloads
- [ ] **Memory**: Bounded, no OOM under stress
- [ ] **Correctness**: 100% identical outputs (cached vs uncached)

---

## Dependencies

### Upstream
- **mistral-rs v0.4.x** (ADR-008)
  - KV cache access for prefix caching
  - Token sampling hooks for logit bias
  - Model loading infrastructure

### Downstream
- **Enables**: Agentic workflow support
- **Enables**: LangChain/LlamaIndex/CrewAI integration
- **Blocks**: v2.4 release
- **Blocks**: Production adoption by agent frameworks

---

## Related Issues

- Depends on: #XXX (mistral-rs integration ADR-008)
- Enables: #XXX (Agentic workflow support)
- Enables: #XXX (LangChain integration)
- Blocks: #XXX (v2.4 release milestone)

---

## Documentation Requirements

- [ ] API reference docs (rustdoc)
- [ ] User guides for each feature
  - "How to use JSON mode"
  - "How to define tools"
  - "How to enable prefix caching"
- [ ] Migration guide from v2.3
- [ ] Example applications
  - Structured extraction (NER, info extraction)
  - Multi-tool agent (ReAct loop)
  - RAG with caching (chatbot)
- [ ] Performance tuning guide

---

## Open Questions

1. **JSON Schema**: Full Draft 7 or a subset? (Propose: core subset + extensions)
2. **Tool formats**: Support all models or Llama 3.1+ only? (Propose: Llama 3.1+ with adapters)
3. **Cache eviction**: LRU vs LFU vs TTL-based? (Propose: LRU + TTL)
4. **Memory limit**: Default cache size? (Propose: 2GB default, configurable)
5. **Breaking changes**: Any API changes needed? (Propose: additive only, no breaks)

---

## Future Enhancements (Post-v2.4)

- **Structured Output**:
  - GBNF grammar support (custom DSLs)
  - Regex-constrained strings
  - Speculative decoding for constrained generation

- **Function Calling**:
  - Async/streaming tool execution
  - Tool result validation
  - Tool dependency graphs

- **Prefix Caching**:
  - Cross-request caching (shared cache pool)
  - Disk-backed cache (persist across restarts)
  - Distributed caching (Redis/memcached)

---

## Timeline Summary

| Phase | Duration | Focus | Deliverable |
|-------|----------|-------|-------------|
| 1 | Weeks 1-2 | Structured Output + Tool Definitions | JSON mode MVP, tool parsing |
| 2 | Weeks 3-4 | Constrained Decoding + Tool Generation | Complex schemas, tool calls |
| 3 | Weeks 5-6 | Prefix Caching + Integration | Production-ready, benchmarked |

**Total**: 6 weeks to production-ready v2.4

---

## Getting Involved

### For Contributors
- Pick a task from the checkboxes above
- Comment on this issue to claim a feature
- Follow the implementation plan phases
- Submit PRs with tests + benchmarks

### For Reviewers
- Focus on correctness (JSON validity, cache correctness)
- Performance regression checks (<10% overhead target)
- API design feedback (before Week 3)

### For Testers
- Test with real-world agent workflows
- Report edge cases and failure modes
- Benchmark on your hardware/models

---

**Let's close the gap with vLLM/llama.cpp and make RuvLLM the best choice for production agentic workflows!** 🚀
New file: crates/ruvllm/docs/GITHUB_ISSUE_V2.md (599 lines)

# 🚀 RuvLLM v2.0 - High-Performance LLM Inference for Apple Silicon

[crates.io](https://crates.io/crates/ruvllm)
[npm](https://www.npmjs.com/package/@aspect/ruvllm)
[docs](https://ruv.io/docs/ruvllm)
[license](LICENSE)
[CI](https://github.com/aspect/ruvector/actions)
[Discord](https://discord.gg/ruv)

<p align="center">
  <img src="https://ruv.io/assets/ruvllm-logo.svg" alt="RuvLLM" width="200"/>
  <br/>
  <strong>Run Large Language Models locally on your Mac with maximum performance</strong>
  <br/>
  <a href="https://ruv.io/ruvllm">Website</a> •
  <a href="https://ruv.io/docs/ruvllm">Documentation</a> •
  <a href="https://discord.gg/ruv">Discord</a> •
  <a href="https://twitter.com/raboruv">Twitter</a>
</p>

---
|
||||
|
||||
## What is RuvLLM?

**RuvLLM** is a blazing-fast LLM inference engine built in Rust, specifically optimized for Apple Silicon Macs (M1/M2/M3/M4). It lets you run AI models like Llama, Mistral, Phi, and Gemma directly on your laptop — no cloud, no API costs, complete privacy.

### Why RuvLLM?

- **🔥 Fast** — 40+ tokens/second on M4 Pro with optimized Metal shaders
- **🍎 Apple Silicon Native** — Uses Metal GPU, Apple Neural Engine (ANE), and ARM NEON
- **🔒 Private** — Everything runs locally, your data never leaves your device
- **📦 Easy** — One command to install, one line to run
- **🌐 Cross-Platform** — Works in Rust, Node.js, and browsers via WebAssembly

---
## ✨ Key Features

### Core Capabilities

| Feature | Description |
|---------|-------------|
| **Multi-Backend Support** | Metal GPU, Core ML (ANE), CPU with NEON SIMD |
| **Quantization** | Q4, Q5, Q8 quantized models (4-8x memory savings) |
| **GGUF Support** | Load models directly from Hugging Face in GGUF format |
| **Streaming** | Real-time token-by-token generation |
| **Continuous Batching** | Efficient multi-request handling |
| **KV Cache** | Optimized key-value cache with paged attention |
| **Speculative Decoding** | 1.5-2x speedup with draft models |

### v2.0 New Features

| Feature | Improvement |
|---------|-------------|
| **Apple Neural Engine** | 38 TOPS dedicated ML acceleration on M4 Pro |
| **Hybrid GPU+ANE Pipeline** | Best of both worlds for optimal throughput |
| **Flash Attention v2** | 2.5-7.5x faster attention computation |
| **SONA Learning** | Self-optimizing neural architecture for adaptive inference |
| **Ruvector Integration** | Built-in vector embeddings for RAG applications |

---
## 🚀 Quickstart

### Rust (Cargo)

```bash
# Add to Cargo.toml
cargo add ruvllm --features inference-metal
```

```rust
use ruvllm::{Engine, GenerateParams};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Load a model (downloads automatically from Hugging Face)
    let engine = Engine::from_pretrained("microsoft/Phi-3-mini-4k-instruct-gguf")?;

    // Generate text
    let response = engine.generate(
        "Explain quantum computing in simple terms:",
        GenerateParams::default()
    )?;

    println!("{}", response);
    Ok(())
}
```
### Node.js (npm)

```bash
npm install @aspect/ruvllm
```

```javascript
import { RuvLLM } from '@aspect/ruvllm';

// Initialize with a model
const llm = await RuvLLM.fromPretrained('microsoft/Phi-3-mini-4k-instruct-gguf');

// Generate text
const response = await llm.generate('Explain quantum computing in simple terms:');
console.log(response);

// Or stream tokens
for await (const token of llm.stream('Write a haiku about coding:')) {
  process.stdout.write(token);
}
```
### CLI

```bash
# Install CLI
cargo install ruvllm-cli

# Run interactively
ruvllm chat --model microsoft/Phi-3-mini-4k-instruct-gguf

# One-shot generation
ruvllm generate "What is the meaning of life?" --model phi-3
```

---
<details>
<summary><h2>📚 Tutorials</h2></summary>

### Tutorial 1: Building a Local Chatbot

Create a simple chatbot that runs entirely on your Mac:

```rust
use std::io::Write;

use ruvllm::{Engine, GenerateParams, ChatMessage};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let engine = Engine::from_pretrained("meta-llama/Llama-3.2-1B-Instruct-GGUF")?;

    let mut history = vec![];

    loop {
        print!("You: ");
        std::io::stdout().flush()?; // ensure the prompt appears before read_line blocks
        let mut input = String::new();
        std::io::stdin().read_line(&mut input)?;

        history.push(ChatMessage::user(input.trim()));

        let response = engine.chat(&history, GenerateParams {
            max_tokens: 512,
            temperature: 0.7,
            ..Default::default()
        })?;

        println!("AI: {}", response);
        history.push(ChatMessage::assistant(&response));
    }
}
```
### Tutorial 2: Streaming Responses in Node.js

Build a real-time streaming API:

```javascript
import { RuvLLM } from '@aspect/ruvllm';
import express from 'express';

const app = express();
const llm = await RuvLLM.fromPretrained('phi-3-mini');

app.get('/stream', async (req, res) => {
  const prompt = req.query.prompt;

  res.setHeader('Content-Type', 'text/event-stream');
  res.setHeader('Cache-Control', 'no-cache');

  for await (const token of llm.stream(prompt)) {
    res.write(`data: ${JSON.stringify({ token })}\n\n`);
  }

  res.write('data: [DONE]\n\n');
  res.end();
});

app.listen(3000);
```
### Tutorial 3: RAG with Ruvector

Combine RuvLLM with Ruvector for retrieval-augmented generation:

```rust
use ruvllm::Engine;
use ruvector_core::{VectorDb, HnswConfig};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Initialize vector database
    let db = VectorDb::new(HnswConfig::default())?;

    // Initialize LLM
    let llm = Engine::from_pretrained("phi-3-mini")?;

    // Add documents (embeddings generated automatically)
    db.add_document("doc1", "RuvLLM is a fast LLM inference engine.")?;
    db.add_document("doc2", "It supports Metal GPU acceleration.")?;

    // Query and generate
    let query = "What is RuvLLM?";
    let context = db.search(query, 3)?;

    let prompt = format!(
        "Context:\n{}\n\nQuestion: {}\nAnswer:",
        context.iter().map(|d| d.text.as_str()).collect::<Vec<_>>().join("\n"),
        query
    );

    let response = llm.generate(&prompt, Default::default())?;
    println!("{}", response);
    Ok(())
}
```
### Tutorial 4: Browser-Based Inference (WebAssembly)

Run models directly in the browser:

```html
<!DOCTYPE html>
<html>
<head>
  <script type="module">
    import init, { RuvLLM } from 'https://unpkg.com/@aspect/ruvllm-wasm/ruvllm.js';

    async function main() {
      await init();

      const llm = await RuvLLM.fromUrl('/models/phi-3-mini-q4.gguf');

      const output = document.getElementById('output');

      for await (const token of llm.stream('Write a poem about the web:')) {
        output.textContent += token;
      }
    }

    main();
  </script>
</head>
<body>
  <pre id="output"></pre>
</body>
</html>
```

</details>

---
<details>
<summary><h2>🔧 Advanced Usage</h2></summary>

### Custom Model Configuration

Fine-tune model loading for your specific hardware:

```rust
use ruvllm::{Engine, ModelConfig, ComputeBackend, Quantization};

let engine = Engine::builder()
    .model_path("/path/to/model.gguf")
    .backend(ComputeBackend::Metal)   // Use Metal GPU
    .quantization(Quantization::Q4K)  // 4-bit quantization
    .context_length(8192)             // Max context
    .num_gpu_layers(32)               // Layers on GPU
    .use_flash_attention(true)        // Enable Flash Attention
    .build()?;
```
### Apple Neural Engine (ANE) Configuration

Leverage the dedicated ML accelerator on Apple Silicon:

```rust
use ruvllm::{CoreMLBackend, ComputeUnits, ModelConfig, GenerateParams};
use ruvllm::tokenizer::RuvTokenizer;

// Create Core ML backend with ANE
let tokenizer = RuvTokenizer::from_file("./tokenizer.json")?;
let backend = CoreMLBackend::new()?
    .with_compute_units(ComputeUnits::CpuAndNeuralEngine) // Use ANE
    .with_tokenizer(tokenizer);

// Load Core ML model
backend.load_model("model.mlmodelc", ModelConfig::default())?;

// Generate (uses ANE for MLP, GPU for attention)
let response = backend.generate("Hello", GenerateParams::default())?;
```
### Hybrid GPU + ANE Pipeline

Maximize throughput with intelligent workload distribution:

```rust
use ruvllm::kernels::get_ane_recommendation;

// Check if ANE is beneficial for your matrix size
let (batch_size, hidden_dim, vocab_size) = (8, 4096, 32_000);
let recommendation = get_ane_recommendation(batch_size, hidden_dim, vocab_size);

if recommendation.use_ane {
    println!("Using ANE: {} (confidence: {:.0}%)",
        recommendation.reason,
        recommendation.confidence * 100.0);
}
```
### Continuous Batching Server

Build a high-throughput inference server:

```rust
use ruvllm::serving::{
    ContinuousBatchScheduler, KvCacheManager, InferenceRequest, SchedulerConfig,
    PreemptionMode, KvCachePoolConfig,
};

let config = SchedulerConfig {
    max_batch_size: 32,
    max_tokens_per_batch: 4096,
    preemption_mode: PreemptionMode::Swap,
    ..Default::default()
};

let mut scheduler = ContinuousBatchScheduler::new(config);
let mut kv_cache = KvCacheManager::new(KvCachePoolConfig::default());

// Add requests
scheduler.add_request(InferenceRequest::new(tokens, params));

// Process batches
while let Some(batch) = scheduler.schedule() {
    // Execute batch inference
    let outputs = engine.forward_batch(&batch)?;

    // Update scheduler with results
    scheduler.update(outputs);
}
```
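The scheduler's core admission decision, which waiting requests fit under `max_batch_size` and `max_tokens_per_batch`, can be sketched as a greedy pass. This is a simplified illustration with plain token counts standing in for real requests; the actual `ContinuousBatchScheduler` also tracks KV-cache blocks, priorities, and preemption:

```rust
/// Greedy admission: pack waiting requests into a batch while both the
/// request-count and token budgets hold. Returns the indices admitted.
fn schedule(waiting: &[usize], max_batch: usize, max_tokens: usize) -> Vec<usize> {
    let mut batch = Vec::new();
    let mut tokens = 0;
    for (i, &len) in waiting.iter().enumerate() {
        if batch.len() == max_batch {
            break; // batch-size budget exhausted
        }
        if tokens + len <= max_tokens {
            tokens += len;
            batch.push(i); // admit request i
        }
    }
    batch
}

fn main() {
    // Three prompts of 1000/3000/500 tokens, 4096-token budget, batch ≤ 32:
    // the third request must wait for the next scheduling round.
    let batch = schedule(&[1000, 3000, 500], 32, 4096);
    println!("admitted: {batch:?}");
}
```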
### Speculative Decoding

Speed up generation with draft models:

```rust
use ruvllm::speculative::{SpeculativeDecoder, SpeculativeConfig};

let config = SpeculativeConfig {
    draft_model: "phi-3-mini-draft", // Small, fast model
    target_model: "phi-3-medium",    // Large, accurate model
    num_speculative_tokens: 4,       // Tokens to speculate
    temperature: 0.8,
};

let decoder = SpeculativeDecoder::new(config)?;

// 1.5-2x faster than standard decoding
let response = decoder.generate("Explain relativity:", params)?;
```
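The speedup comes from letting the draft model propose `num_speculative_tokens` tokens cheaply and having the target model verify them, keeping the longest agreeing prefix plus one correction. A toy greedy version of that loop, with deterministic next-token closures standing in for the two models (a sketch of the general technique, not RuvLLM's sampler):

```rust
/// Greedy speculative decoding: draft proposes `k` tokens, target verifies
/// them in order; matches are kept, and the first mismatch is replaced by
/// the target's own token. Returns how many draft tokens were accepted.
fn speculate(
    prefix: &mut Vec<u32>,
    draft: &dyn Fn(&[u32]) -> u32,
    target: &dyn Fn(&[u32]) -> u32,
    k: usize,
) -> usize {
    // 1. Draft proposes k tokens autoregressively.
    let base = prefix.len();
    let mut proposed = prefix.clone();
    for _ in 0..k {
        let t = draft(&proposed);
        proposed.push(t);
    }
    // 2. Target verifies each proposal against its own greedy choice.
    let mut accepted = 0;
    for i in 0..k {
        let correct = target(&proposed[..base + i]);
        if correct == proposed[base + i] {
            prefix.push(correct);
            accepted += 1;
        } else {
            prefix.push(correct); // target's correction still yields a token
            break;
        }
    }
    accepted
}

fn main() {
    // Toy models over token ids: the target counts up; the draft agrees
    // until it sees a token ≥ 3, then derails to 99.
    let target = |s: &[u32]| s.last().copied().unwrap_or(0) + 1;
    let draft = |s: &[u32]| {
        let last = s.last().copied().unwrap_or(0);
        if last >= 3 { 99 } else { last + 1 }
    };
    let mut prefix = vec![1u32];
    let accepted = speculate(&mut prefix, &draft, &target, 4);
    println!("accepted={accepted}, prefix={prefix:?}");
}
```

Even with two rejected draft tokens, the target emits three new tokens for what would batch into a single verification pass, which is where the 1.5-2x figure comes from.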
### Custom Tokenizer

Use custom tokenizers for specialized models:

```rust
use ruvllm::tokenizer::{RuvTokenizer, TokenizerConfig};

// Load from HuggingFace
let tokenizer = RuvTokenizer::from_pretrained("meta-llama/Llama-3.2-1B")?;

// Or from local file
let tokenizer = RuvTokenizer::from_file("./tokenizer.json")?;

// Encode/decode
let tokens = tokenizer.encode("Hello, world!")?;
let text = tokenizer.decode(&tokens)?;

// With chat template
let formatted = tokenizer.apply_chat_template(&[
    ChatMessage::system("You are a helpful assistant."),
    ChatMessage::user("What is 2+2?"),
])?;
```
### Memory Optimization

Optimize for large models on limited memory:

```rust
use ruvllm::{Engine, MemoryConfig, DType};

let engine = Engine::builder()
    .model_path("llama-70b.gguf")
    .memory_config(MemoryConfig {
        max_memory_gb: 24.0,        // Limit memory usage
        offload_to_cpu: true,       // Offload layers to CPU
        use_mmap: true,             // Memory-map model file
        kv_cache_dtype: DType::F16, // Half-precision KV cache
    })
    .build()?;
```
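The `kv_cache_dtype` choice matters because KV cache size grows linearly with context length: 2 tensors (K and V) × layers × KV heads × head dim × sequence length × bytes per element. A quick sizing helper (the model shape below is an illustrative Llama-7B-like configuration, not a measured RuvLLM value):

```rust
/// Total KV cache bytes for a dense-attention model.
fn kv_cache_bytes(layers: u64, kv_heads: u64, head_dim: u64, seq: u64, elem_bytes: u64) -> u64 {
    2 * layers * kv_heads * head_dim * seq * elem_bytes
}

fn main() {
    // 32 layers, 32 KV heads, head_dim 128, 4096-token context, f16 (2 bytes).
    let bytes = kv_cache_bytes(32, 32, 128, 4096, 2);
    // 2 * 32 * 32 * 128 * 4096 * 2 = 2^31 bytes = exactly 2 GiB
    println!("KV cache: {:.1} GiB", bytes as f64 / (1u64 << 30) as f64);
}
```

Halving `elem_bytes` (f32 → f16) or the context length halves the cache, which is why these are the first knobs to turn on memory-limited machines.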
### Embeddings for RAG

Generate embeddings for retrieval applications:

```rust
use ruvllm::Engine;

let engine = Engine::from_pretrained("nomic-embed-text-v1.5")?;

// Single embedding
let embedding = engine.embed("What is machine learning?")?;

// Batch embeddings
let embeddings = engine.embed_batch(&[
    "Document 1 content",
    "Document 2 content",
    "Document 3 content",
])?;

// Cosine similarity
let similarity = ruvector_core::cosine_similarity(&embedding, &embeddings[0]);
```
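For reference, cosine similarity reduces to dot(a, b) / (‖a‖ · ‖b‖). A plain-Rust equivalent of the `cosine_similarity` call above (the shipped `ruvector_core` version is presumably SIMD-accelerated; this sketch just shows the math):

```rust
/// Cosine similarity of two equal-length vectors, in [-1, 1] for
/// non-zero inputs; 0.0 is returned for a zero vector.
fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    assert_eq!(a.len(), b.len());
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let na = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let nb = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if na == 0.0 || nb == 0.0 { 0.0 } else { dot / (na * nb) }
}

fn main() {
    // Identical directions → 1.0; orthogonal → 0.0; opposite → -1.0.
    println!("{}", cosine_similarity(&[1.0, 0.0], &[1.0, 0.0]));
    println!("{}", cosine_similarity(&[1.0, 0.0], &[0.0, 1.0]));
}
```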
### Node.js Advanced Configuration

```javascript
import { RuvLLM, ModelConfig, ComputeBackend } from '@aspect/ruvllm';

const llm = await RuvLLM.create({
  modelPath: './models/phi-3-mini-q4.gguf',
  backend: ComputeBackend.Metal,
  contextLength: 8192,
  numGpuLayers: 32,
  flashAttention: true,

  // Callbacks
  onToken: (token) => process.stdout.write(token),
  onProgress: (progress) => console.log(`Loading: ${progress}%`),
});

// Structured output (JSON mode)
const result = await llm.generate('List 3 colors', {
  responseFormat: 'json',
  schema: {
    type: 'object',
    properties: {
      colors: { type: 'array', items: { type: 'string' } }
    }
  }
});

console.log(JSON.parse(result)); // { colors: ['red', 'blue', 'green'] }
```

</details>

---
## 📊 Performance Benchmarks

Tested on M4 Pro (14-core CPU, 20-core GPU, 38 TOPS ANE):

### Model Inference Speed

| Model | Size | Quantization | Tokens/sec | Memory |
|-------|------|--------------|------------|--------|
| Phi-3 Mini | 3.8B | Q4_K_M | 52 t/s | 2.4 GB |
| Llama 3.2 | 1B | Q4_K_M | 78 t/s | 0.8 GB |
| Llama 3.2 | 3B | Q4_K_M | 45 t/s | 2.1 GB |
| Mistral 7B | 7B | Q4_K_M | 28 t/s | 4.2 GB |
| Gemma 2 | 9B | Q4_K_M | 22 t/s | 5.8 GB |
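The Memory column can be sanity-checked from first principles: a Q4_K_M model stores roughly 4.5 bits per weight, plus runtime overhead for embeddings, KV cache, and buffers. A rough estimator (the 4.5 bits/weight and 15% overhead figures are assumptions for illustration, not measured RuvLLM values):

```rust
/// Rough resident-memory estimate for a quantized model:
/// parameters × bits-per-weight, scaled by a runtime-overhead factor.
fn model_memory_gb(params_billion: f64, bits_per_weight: f64, overhead: f64) -> f64 {
    params_billion * bits_per_weight / 8.0 * (1.0 + overhead)
}

fn main() {
    // Assumed Q4_K_M average of ~4.5 bits/weight and ~15% overhead.
    for (name, params) in [("Phi-3 Mini", 3.8), ("Mistral 7B", 7.2)] {
        println!("{name}: ~{:.1} GB", model_memory_gb(params, 4.5, 0.15));
    }
}
```

The estimates land in the same range as the measured column, which is a useful quick check before downloading a model.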
### 🔥 ANE vs NEON Matrix Multiply (NEW in v2.0)

| Dimension | ANE | NEON | Speedup |
|-----------|-----|------|---------|
| 768×768 | 400 µs | 104 ms | **261x** |
| 1024×1024 | 1.2 ms | 283 ms | **243x** |
| 1536×1536 | 3.4 ms | 1,028 ms | **306x** |
| 2048×2048 | 8.5 ms | 4,020 ms | **473x** |
| 3072×3072 | 28.2 ms | 15,240 ms | **541x** |
| 4096×4096 | 66.1 ms | 65,428 ms | **989x** |
### Hybrid Pipeline Performance

| Mode | seq=128 | seq=512 | vs NEON |
|------|---------|---------|---------|
| **Pure ANE** | 35.9 ms | 112.9 ms | **460x faster** |
| Hybrid | 862 ms | 3,195 ms | 19x faster |
| Pure NEON | 16,529 ms | 66,539 ms | baseline |

### Activation Functions (SiLU/GELU)

| Size | NEON | ANE | Winner |
|------|------|-----|--------|
| 32×4096 | 70 µs | 152 µs | NEON 2.2x |
| 64×4096 | 141 µs | 303 µs | NEON 2.1x |
| 128×4096 | 284 µs | 613 µs | NEON 2.2x |

**Auto-dispatch** correctly routes: ANE for matmul ≥768 dims, NEON for activations.
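That routing rule is easy to state as code. A standalone sketch of the heuristic (an illustration of the stated policy only, not RuvLLM's internal dispatcher):

```rust
/// Backend choice mirroring the benchmark-driven rule: large matrix
/// multiplies go to the ANE, everything else (including elementwise
/// activations like SiLU/GELU) stays on NEON.
#[derive(Debug, PartialEq)]
enum Backend {
    Ane,
    Neon,
}

fn dispatch(op_is_matmul: bool, min_dim: usize) -> Backend {
    if op_is_matmul && min_dim >= 768 {
        Backend::Ane
    } else {
        Backend::Neon
    }
}

fn main() {
    println!("{:?}", dispatch(true, 4096));  // big matmul → Ane
    println!("{:?}", dispatch(true, 512));   // small matmul → Neon
    println!("{:?}", dispatch(false, 4096)); // activation → Neon
}
```

The 768 threshold matches the smallest dimension in the matmul table above where the ANE already wins decisively.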
### Quantization Performance

| Dimension | Encode | Hamming Distance |
|-----------|--------|------------------|
| 128-dim | 0.1 µs | <0.1 µs |
| 384-dim | 0.3 µs | <0.1 µs |
| 768-dim | 0.5 µs | <0.1 µs |
| 1536-dim | 1.0 µs | <0.1 µs |

*Benchmarks run with Criterion.rs, 50 samples per test, M4 Pro 48GB.*
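The sub-microsecond Hamming distances come from the standard binary-quantization trick: pack one sign bit per component into machine words, then compare vectors with XOR + popcount. A scalar sketch of the technique (the shipped kernels are presumably SIMD-accelerated):

```rust
/// Pack each f32's sign into one bit: non-negative → 1, negative → 0.
fn binary_quantize(v: &[f32]) -> Vec<u64> {
    let mut out = vec![0u64; (v.len() + 63) / 64];
    for (i, &x) in v.iter().enumerate() {
        if x >= 0.0 {
            out[i / 64] |= 1u64 << (i % 64);
        }
    }
    out
}

/// Hamming distance via XOR + popcount: one word-sized op per 64 dims.
fn hamming(a: &[u64], b: &[u64]) -> u32 {
    a.iter().zip(b).map(|(x, y)| (x ^ y).count_ones()).sum()
}

fn main() {
    let a = binary_quantize(&[0.5, -1.0, 2.0, -0.1]);
    let b = binary_quantize(&[0.5, 1.0, 2.0, -0.1]);
    println!("distance = {}", hamming(&a, &b)); // the vectors differ in one sign
}
```

A 1536-dim vector packs into 24 u64 words, so a full comparison is a few dozen XOR/popcount instructions, which is why the distance column stays below 0.1 µs.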
---

## 🔌 Supported Models

RuvLLM supports any model in GGUF format. Popular options:

- **Llama 3.2** (1B, 3B) — Meta's latest efficient models
- **Phi-3** (Mini, Small, Medium) — Microsoft's powerful small models
- **Mistral 7B** — Excellent quality-to-size ratio
- **Gemma 2** (2B, 9B, 27B) — Google's open models
- **Qwen 2.5** (0.5B-72B) — Alibaba's multilingual models
- **DeepSeek Coder** — Specialized for code generation

Download models from [Hugging Face](https://huggingface.co/models?library=gguf).

---

## 🛠️ Installation

### Rust

```toml
[dependencies]
ruvllm = { version = "2.0", features = ["inference-metal"] }

# Or with all features
ruvllm = { version = "2.0", features = ["inference-metal", "coreml", "speculative"] }
```

Available features:

- `inference-metal` — Metal GPU acceleration (recommended for Mac)
- `inference-cuda` — CUDA acceleration (for NVIDIA GPUs)
- `coreml` — Apple Neural Engine via Core ML
- `speculative` — Speculative decoding support
- `async-runtime` — Async/await support with Tokio

### Node.js

```bash
npm install @aspect/ruvllm
# or
yarn add @aspect/ruvllm
# or
pnpm add @aspect/ruvllm
```

### From Source

```bash
git clone https://github.com/aspect/ruvector
cd ruvector/crates/ruvllm
cargo build --release --features inference-metal
```

---

## 🤝 Contributing

We welcome contributions! See [CONTRIBUTING.md](CONTRIBUTING.md) for guidelines.

- 🐛 [Report bugs](https://github.com/aspect/ruvector/issues/new?template=bug_report.md)
- 💡 [Request features](https://github.com/aspect/ruvector/issues/new?template=feature_request.md)
- 📖 [Improve docs](https://github.com/aspect/ruvector/tree/main/docs)

---

## 📄 License

RuvLLM is dual-licensed under MIT and Apache 2.0. See [LICENSE-MIT](LICENSE-MIT) and [LICENSE-APACHE](LICENSE-APACHE).

---

<p align="center">
  Made with ❤️ by <a href="https://ruv.io">ruv.io</a>
  <br/>
  <sub>Part of the <a href="https://github.com/aspect/ruvector">Ruvector</a> ecosystem</sub>
</p>