# RuvLLM Architecture (v2.0.0)

This document describes the system architecture of RuvLLM, a high-performance LLM inference engine optimized for Apple Silicon.

## v2.0.0 New Features

| Feature | Description | Performance Impact |
|---------|-------------|-------------------|
| Multi-threaded GEMM/GEMV | Rayon parallelization | 12.7x speedup on M4 Pro |
| Flash Attention 2 | Auto block sizing | +10% throughput |
| Quantized Inference | INT8/INT4/Q4_K kernels | 4-8x memory reduction |
| Metal GPU Shaders | simdgroup_matrix ops | 3x speedup |
| Memory Pool | Arena allocator | Zero-alloc inference |
| WASM Support | Browser inference | ~2.5x overhead |
| npm Integration | @ruvector/ruvllm | JavaScript/TypeScript API |

## System Overview

```
                  +----------------------------------+
                  |         User Application         |
                  +----------------------------------+
                                   |
                                   v
+--------------------------------------------------------------------------------+
|                                  RuvLLM Core                                   |
|                                                                                |
|  +--------------------------------------------------------------------------+ |
|  |                           Backend Abstraction                            | |
|  |   +-------------------------+        +-------------------------+         | |
|  |   |     Candle Backend      |        |   mistral-rs Backend    |         | |
|  |   |  - Model Loading        |        |  - Model Loading        |         | |
|  |   |  - Tokenization         |        |  - Tokenization         |         | |
|  |   |  - Forward Pass         |        |  - Forward Pass         |         | |
|  |   +-------------------------+        +-------------------------+         | |
|  +--------------------------------------------------------------------------+ |
|                                       |                                        |
|  +--------------------------------------------------------------------------+ |
|  |                           SONA Learning Layer                            | |
|  |   +-------------------+  +--------------------+  +-------------------+   | |
|  |   |   Instant Loop    |  |  Background Loop   |  |     Deep Loop     |   | |
|  |   |  (<1ms latency)   |  | (~100ms interval)  |  |  (minutes/hours)  |   | |
|  |   | - MicroLoRA adapt |  | - Pattern merge    |  | - Full fine-tune  |   | |
|  |   | - Per-request     |  | - EWC++ update     |  | - Model distill   |   | |
|  |   +-------------------+  +--------------------+  +-------------------+   | |
|  +--------------------------------------------------------------------------+ |
|                                       |                                        |
|  +--------------------------------------------------------------------------+ |
|  |                            Optimized Kernels                             | |
|  |   +------------------+   +------------------+   +------------------+     | |
|  |   |    Attention     |   |  Normalization   |   |    Embedding     |     | |
|  |   | - Flash Attn 2   |   | - RMSNorm        |   | - RoPE           |     | |
|  |   | - Paged Attn     |   | - LayerNorm      |   | - Token Embed    |     | |
|  |   | - GQA/MQA        |   | - Fused Ops      |   | - Pos Embed      |     | |
|  |   +------------------+   +------------------+   +------------------+     | |
|  +--------------------------------------------------------------------------+ |
|                                       |                                        |
|  +--------------------------------------------------------------------------+ |
|  |                           Memory Management                              | |
|  |   +-----------------------+   +----------------------------------+       | |
|  |   |   Two-Tier KV Cache   |   |           Memory Pool            |       | |
|  |   |  +-----------------+  |   |  - Slab allocator                |       | |
|  |   |  | FP16 Tail (hot) |  |   |  - Arena allocation              |       | |
|  |   |  +-----------------+  |   |  - Zero-copy transfers           |       | |
|  |   |  | Q4 Store (cold) |  |   |                                  |       | |
|  |   |  +-----------------+  |   +----------------------------------+       | |
|  |   +-----------------------+                                              | |
|  +--------------------------------------------------------------------------+ |
+--------------------------------------------------------------------------------+
                                   |
                                   v
+--------------------------------------------------------------------------------+
|                             Hardware Acceleration                              |
|   +---------------------------+         +---------------------------+          |
|   |     Metal (Apple GPU)     |         |       CUDA (NVIDIA)       |          |
|   |  - MLX integration        |         |  - cuBLAS                 |          |
|   |  - Metal Performance      |         |  - cuDNN                  |          |
|   |    Shaders                |         |  - TensorRT               |          |
|   +---------------------------+         +---------------------------+          |
+--------------------------------------------------------------------------------+
```

## Component Architecture

### 1. Backend Abstraction Layer

The backend abstraction provides a unified interface for different ML frameworks.

```
        +---------------------------+
        |     LlmBackend Trait      |
        |  - load_model()           |
        |  - generate()             |
        |  - forward()              |
        |  - get_tokenizer()        |
        +---------------------------+
                     ^
                     |
              +------+------+
              |             |
          +-------+   +-----------+
          |Candle |   |mistral-rs |
          +-------+   +-----------+
```

**Candle Backend Features:**
- HuggingFace model hub integration
- Native Rust tensor operations
- Metal/CUDA acceleration
- Safetensors loading
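
For orientation, a minimal sketch of what the `LlmBackend` trait implies is shown below. The method names follow the diagram; the signatures, the associated error type, and the `Tokenizer` trait are illustrative assumptions rather than the crate's exact API.

```rust
/// Hedged sketch of the backend trait: one interface, multiple ML frameworks.
pub trait LlmBackend {
    type Error;

    /// Load model weights (e.g. safetensors) from a path or hub reference.
    fn load_model(&mut self, model_ref: &str) -> Result<(), Self::Error>;

    /// Single forward pass over token ids, returning next-token logits.
    fn forward(&mut self, tokens: &[u32]) -> Result<Vec<f32>, Self::Error>;

    /// High-level generation: prompt in, completion out.
    fn generate(&mut self, prompt: &str, max_tokens: usize) -> Result<String, Self::Error>;

    /// Tokenizer paired with the loaded model.
    fn get_tokenizer(&self) -> &dyn Tokenizer;
}

/// Minimal tokenizer interface, also illustrative.
pub trait Tokenizer {
    fn encode(&self, text: &str) -> Vec<u32>;
    fn decode(&self, tokens: &[u32]) -> String;
}
```

Because both Candle and mistral-rs sit behind this one trait, the rest of the engine stays framework-agnostic.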

### 2. SONA Learning Layer

Self-Optimizing Neural Architecture with three learning loops:

```
+-------------------+     +-------------------+
| Inference Request |---->|   Instant Loop    |
|    + feedback     |     | - MicroLoRA adapt |
+-------------------+     | - <1ms latency    |
                          +--------+----------+
                                   |
                                   v  (async, 100ms)
                          +--------+----------+
                          |  Background Loop  |
                          | - Pattern merge   |
                          | - Adapter compose |
                          | - EWC++ update    |
                          +--------+----------+
                                   |
                                   v  (triggered)
                          +--------+----------+
                          |     Deep Loop     |
                          | - Full fine-tune  |
                          | - Model distill   |
                          | - Pattern bank    |
                          +-------------------+
```

**Loop Characteristics:**

| Loop | Latency | Trigger | Purpose |
|------|---------|---------|---------|
| Instant | <1ms | Per-request | Real-time adaptation |
| Background | ~100ms | Interval/threshold | Pattern consolidation |
| Deep | Minutes | Accumulated quality | Full optimization |
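
As a rough illustration of how the instant and background loops can coexist, the sketch below pushes per-request feedback onto a channel, drains it on a ~100ms tick, and flags a deep-loop trigger when accumulated quality drops. Every name and threshold here is hypothetical; the actual SONA engine is considerably more involved.

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

/// Per-request feedback record (illustrative fields).
struct Feedback {
    request_id: u64,
    quality: f32,
}

fn main() {
    let (tx, rx) = mpsc::channel::<Feedback>();

    // Background loop: wake on an interval, drain queued feedback, consolidate.
    thread::spawn(move || loop {
        thread::sleep(Duration::from_millis(100));
        let batch: Vec<Feedback> = rx.try_iter().collect();
        if !batch.is_empty() {
            // Pattern merge / EWC++ updates would happen here.
            let avg = batch.iter().map(|f| f.quality).sum::<f32>() / batch.len() as f32;
            if avg < 0.5 {
                // Deep loop trigger: accumulated quality below threshold.
                eprintln!("deep loop triggered (avg quality {avg:.2})");
            }
        }
    });

    // Instant loop (request path): apply MicroLoRA, then queue feedback without blocking.
    tx.send(Feedback { request_id: 1, quality: 0.9 }).ok();
    thread::sleep(Duration::from_millis(250)); // let the background tick run once
}
```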

### 3. Optimized Kernel Layer

NEON SIMD-optimized kernels for ARM64:

```
+-----------------------------------------------+
|               Attention Kernels               |
+-----------------------------------------------+
|                                               |
|  +------------------+  +------------------+   |
|  | Flash Attention  |  | Paged Attention  |   |
|  | - Tiled QKV      |  | - Block tables   |   |
|  | - Online softmax |  | - Non-contiguous |   |
|  | - O(N) memory    |  | - KV cache aware |   |
|  +------------------+  +------------------+   |
|                                               |
|  +------------------+  +------------------+   |
|  | Multi-Query (MQA)|  | Grouped-Query    |   |
|  | - 1 KV head      |  | - KV groups      |   |
|  | - Shared KV      |  | - 4-8x savings   |   |
|  +------------------+  +------------------+   |
+-----------------------------------------------+

+-----------------------------------------------+
|             Normalization Kernels             |
+-----------------------------------------------+
|  +------------------+  +------------------+   |
|  | RMSNorm          |  | LayerNorm        |   |
|  | - NEON SIMD      |  | - NEON SIMD      |   |
|  | - Fused ops      |  | - Fused ops      |   |
|  +------------------+  +------------------+   |
+-----------------------------------------------+

+-----------------------------------------------+
|               Embedding Kernels               |
+-----------------------------------------------+
|  +------------------+  +------------------+   |
|  | Rotary Position  |  | Token Embedding  |   |
|  | (RoPE)           |  | - Lookup table   |   |
|  | - Precomputed    |  | - Batch gather   |   |
|  +------------------+  +------------------+   |
+-----------------------------------------------+
```
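
As a reference point for the normalization kernels: RMSNorm computes `y[i] = w[i] * x[i] / sqrt(mean(x^2) + eps)`. The scalar semantics in plain Rust are below; the shipped kernel applies NEON vectorization and fusion to this same computation.

```rust
/// Scalar reference for RMSNorm: y[i] = w[i] * x[i] / sqrt(mean(x^2) + eps).
/// The production kernel is the NEON-vectorized equivalent of this loop.
fn rms_norm(x: &[f32], weight: &[f32], eps: f32) -> Vec<f32> {
    assert_eq!(x.len(), weight.len());
    let mean_sq = x.iter().map(|v| v * v).sum::<f32>() / x.len() as f32;
    let inv_rms = 1.0 / (mean_sq + eps).sqrt();
    x.iter().zip(weight).map(|(v, w)| v * inv_rms * w).collect()
}
```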

### 4. Memory Management

Two-tier KV cache for an optimal memory/quality tradeoff:

```
+----------------------------------------------------+
|                 Two-Tier KV Cache                  |
+----------------------------------------------------+
|                                                    |
|  Position: 0        tail_length              max   |
|  +------------------+------------------+           |
|  |                  |                  |           |
|  |  Quantized Store |  High-Precision  |           |
|  |  (Cold)          |  Tail (Hot)      |           |
|  |                  |                  |           |
|  |  - Q4/Q8 format  |  - FP16 format   |           |
|  |  - Older tokens  |  - Recent tokens |           |
|  |  - 4x smaller    |  - Full quality  |           |
|  |                  |                  |           |
|  +------------------+------------------+           |
|                                                    |
|  Migration: Hot -> Cold (when tail_length exceeded)|
|  Eviction:  Cold first, then Hot                   |
+----------------------------------------------------+
```

**Cache Operations:**

1. **Append**: Add new KV pairs to tail
2. **Migrate**: Move old tokens from tail to quantized store
3. **Evict**: Remove oldest tokens when max exceeded
4. **Attend**: Dequantize cold + use hot for attention
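
The sketch below mirrors this append/migrate/evict flow. The types, field names, and the byte-per-value quantizer are placeholders; the real store packs Q4/Q8 blocks with per-block scales and keeps the hot tail in FP16.

```rust
use std::collections::VecDeque;

/// Illustrative two-tier cache skeleton; shapes and names are hypothetical.
struct TwoTierKvCache {
    hot: VecDeque<Vec<f32>>, // FP16 tail in the real engine; f32 here for simplicity
    cold: VecDeque<Vec<u8>>, // quantized blocks holding older tokens
    tail_len: usize,         // positions kept at full precision
    max_len: usize,          // total positions before eviction kicks in
}

impl TwoTierKvCache {
    /// Append: new KV entries always land in the hot tail.
    fn append(&mut self, kv: Vec<f32>) {
        self.hot.push_back(kv);
        // Migrate: once the tail is full, the oldest hot entry moves to the cold store.
        if self.hot.len() > self.tail_len {
            let old = self.hot.pop_front().unwrap();
            self.cold.push_back(quantize(&old));
        }
        // Evict: cold entries go first when the total budget is exceeded.
        if self.hot.len() + self.cold.len() > self.max_len {
            self.cold.pop_front();
        }
    }
}

/// Stub quantizer: one byte per value. Real code packs 4- or 8-bit blocks
/// with per-block scales, which is where the ~4x size reduction comes from.
fn quantize(kv: &[f32]) -> Vec<u8> {
    kv.iter().map(|v| (v.clamp(-1.0, 1.0) * 127.0) as i8 as u8).collect()
}
```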

## Data Flow

### Inference Pipeline

```
Input Tokens
      |
      v
+--------------------+
|  Token Embedding   |
|  + RoPE Position   |
+--------------------+
      |
      v  (for each layer)
+--------------------+
|  Attention Layer   |
|  +---------------+ |
|  | Q,K,V Project | |
|  +---------------+ |
|          |         |
|  +---------------+ |
|  |   KV Cache    | |
|  |    Update     | |
|  +---------------+ |
|          |         |
|  +---------------+ |
|  |  Flash/Paged  | |
|  |   Attention   | |
|  +---------------+ |
|          |         |
|  +---------------+ |
|  |  Output Proj  | |
|  +---------------+ |
+--------------------+
      |
      v
+--------------------+
|     FFN Layer      |
|  - Gate Proj       |
|  - Up Proj         |
|  - Down Proj       |
|  - Activation      |
+--------------------+
      |
      v
+--------------------+
|      RMSNorm       |
+--------------------+
      |
      v
+--------------------+
|      LM Head       |
|   (final layer)    |
+--------------------+
      |
      v
Logits -> Sampling -> Token
```
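
Stripped of framework detail, the per-token loop this pipeline implies looks like the skeleton below. `forward` stands in for the embedding, layer stack, and LM head; greedy argmax stands in for the sampler. All names and the vocabulary size are illustrative, not the engine's API.

```rust
/// Pick the highest-scoring token (greedy decoding).
fn argmax(logits: &[f32]) -> u32 {
    logits
        .iter()
        .enumerate()
        .max_by(|a, b| a.1.partial_cmp(b.1).unwrap())
        .map(|(i, _)| i as u32)
        .unwrap()
}

/// Stand-in for token embed + RoPE, N x (attention + FFN + RMSNorm), LM head.
fn forward(_tokens: &[u32], vocab_size: usize) -> Vec<f32> {
    vec![0.0; vocab_size] // placeholder logits
}

/// The decode loop: one forward pass and one sampled token per iteration.
fn generate(mut tokens: Vec<u32>, max_new: usize, eos: u32) -> Vec<u32> {
    const VOCAB: usize = 32_000; // illustrative vocabulary size
    for _ in 0..max_new {
        let logits = forward(&tokens, VOCAB);
        let next = argmax(&logits);
        tokens.push(next);
        if next == eos {
            break;
        }
    }
    tokens
}
```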

### Learning Pipeline

```
Request + Response + Feedback
              |
              v
+---------------------------+
|       Instant Loop        |
|  - Compute embeddings     |
|  - Apply MicroLoRA        |
|  - Queue for background   |
+---------------------------+
              |
              v  (async)
+---------------------------+
|      Background Loop      |
|  - Batch samples          |
|  - Update EWC++ Fisher    |
|  - Merge adapters         |
|  - Store in ReasoningBank |
+---------------------------+
              |
              v  (threshold triggered)
+---------------------------+
|         Deep Loop         |
|  - Full training pipeline |
|  - Pattern distillation   |
|  - Catastrophic forgetting|
|    prevention (EWC++)     |
+---------------------------+
```
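
What keeps the instant loop cheap is the LoRA factorization itself: the adapted output is `y = Wx + (alpha/r) * B(Ax)`, so each adaptation step touches only the small rank-`r` matrices `A` and `B` while the base weight `W` stays frozen. A plain-Rust sketch with illustrative names:

```rust
/// Dense matrix-vector product over row-major `Vec` rows.
fn matvec(m: &[Vec<f32>], x: &[f32]) -> Vec<f32> {
    m.iter()
        .map(|row| row.iter().zip(x).map(|(a, b)| a * b).sum())
        .collect()
}

/// y = Wx + (alpha/r) * B(Ax): base path plus low-rank correction.
fn lora_forward(
    w: &[Vec<f32>], // frozen base weight, d_out x d_in
    a: &[Vec<f32>], // adapter down-projection, r x d_in
    b: &[Vec<f32>], // adapter up-projection, d_out x r
    alpha: f32,
    x: &[f32],
) -> Vec<f32> {
    let r = a.len() as f32; // rank
    let base = matvec(w, x);
    let delta = matvec(b, &matvec(a, x));
    base.iter()
        .zip(&delta)
        .map(|(y, d)| y + (alpha / r) * d)
        .collect()
}
```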

## Module Structure

```
ruvllm/
├── src/
│   ├── lib.rs              # Crate root, re-exports
│   ├── error.rs            # Error types
│   ├── types.rs            # Common types (Precision, etc.)
│   │
│   ├── backends/           # ML framework backends
│   │   ├── mod.rs          # Backend trait
│   │   ├── candle_backend.rs
│   │   └── config.rs
│   │
│   ├── kernels/            # Optimized kernels
│   │   ├── mod.rs          # Kernel exports
│   │   ├── attention.rs    # Attention variants
│   │   ├── matmul.rs       # Matrix multiplication
│   │   ├── norm.rs         # Normalization ops
│   │   └── rope.rs         # Rotary embeddings
│   │
│   ├── lora/               # LoRA adapters
│   │   ├── mod.rs          # LoRA exports
│   │   ├── micro_lora.rs   # Real-time MicroLoRA
│   │   └── training.rs     # Training pipeline
│   │
│   ├── optimization/       # SONA integration
│   │   ├── mod.rs
│   │   └── sona_llm.rs     # Learning loops
│   │
│   ├── kv_cache.rs         # Two-tier KV cache
│   ├── sona.rs             # SONA core integration
│   ├── policy_store.rs     # Learned policies
│   └── witness_log.rs      # Inference logging
│
└── benches/                # Benchmarks
    ├── attention_bench.rs
    ├── lora_bench.rs
    └── e2e_bench.rs
```

## Performance Characteristics

### Memory Layout

| Component | Memory Pattern | Optimization |
|-----------|----------------|--------------|
| KV Cache Tail | Sequential | NEON vectorized |
| KV Cache Store | Quantized blocks | Batch dequant |
| Model Weights | Memory-mapped | Zero-copy |
| Intermediate | Stack allocated | Arena alloc |

### Throughput Targets (M4 Pro)

| Operation | Target | Achieved |
|-----------|--------|----------|
| Flash Attention | 2.5x vs naive | ~2.3x |
| Paged Attention | 1.8x vs contiguous | ~1.7x |
| GQA vs MHA | 4x less KV memory | 4x |
| MicroLoRA adapt | <1ms | ~0.5ms |
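
The "Arena alloc" row refers to the bump-allocation pattern behind zero-alloc inference: intermediate buffers are carved out of one preallocated region and reclaimed wholesale between requests. A minimal sketch of the idea, not the engine's actual allocator:

```rust
/// Bump arena: one upfront allocation, O(1) reset, no per-request heap traffic.
struct Arena {
    buf: Vec<f32>,
    used: usize,
}

impl Arena {
    fn with_capacity(n: usize) -> Self {
        Arena { buf: vec![0.0; n], used: 0 }
    }

    /// Hand out the next `n` floats from the region (panics if the budget is blown).
    fn alloc(&mut self, n: usize) -> &mut [f32] {
        let start = self.used;
        self.used += n;
        &mut self.buf[start..self.used]
    }

    /// Reset between requests: nothing is freed or reallocated.
    fn reset(&mut self) {
        self.used = 0;
    }
}
```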

## Integration Points

### With RuVector Core

```rust
// Memory backend integration
use ruvector_core::storage::Storage;

// SONA learning integration
use ruvector_sona::{SonaEngine, ReasoningBank};
```

### With External Systems

- **HuggingFace Hub**: Model downloads
- **OpenAI API**: Compatible inference endpoint
- **Prometheus**: Metrics export
- **gRPC**: High-performance RPC

## Future Architecture

Planned enhancements:

1. **Speculative Decoding**: Draft model integration
2. **Tensor Parallelism**: Multi-GPU support
3. **Continuous Batching**: Dynamic batch scheduling
4. **PagedAttention v2**: vLLM-style memory management