RuvLLM Architecture (v2.0.0)
This document describes the system architecture of RuvLLM, a high-performance LLM inference engine optimized for Apple Silicon.
v2.0.0 New Features
| Feature | Description | Performance Impact |
|---|---|---|
| Multi-threaded GEMM/GEMV | Rayon parallelization | 12.7x speedup on M4 Pro |
| Flash Attention 2 | Auto block sizing | +10% throughput |
| Quantized Inference | INT8/INT4/Q4_K kernels | 4-8x memory reduction |
| Metal GPU Shaders | simdgroup_matrix ops | 3x speedup |
| Memory Pool | Arena allocator | Zero-alloc inference |
| WASM Support | Browser inference | ~2.5x overhead |
| npm Integration | @ruvector/ruvllm | JavaScript/TypeScript API |
System Overview
+----------------------------------+
| User Application |
+----------------------------------+
|
v
+-------------------------------------------------------------------------------------+
| RuvLLM Core |
| +-------------------------------------------------------------------------------+ |
| | Backend Abstraction | |
| | +-------------------------+ +-------------------------+ | |
| | | Candle Backend | | mistral-rs Backend | | |
| | | - Model Loading | | - Model Loading | | |
| | | - Tokenization | | - Tokenization | | |
| | | - Forward Pass | | - Forward Pass | | |
| | +-------------------------+ +-------------------------+ | |
| +-------------------------------------------------------------------------------+ |
| | |
| +-------------------------------------------------------------------------------+ |
| | SONA Learning Layer | |
| | +---------------------+ +----------------------+ +---------------------+ | |
| | | Instant Loop | | Background Loop | | Deep Loop | | |
| | | (<1ms latency) | | (~100ms interval) | | (minutes/hours) | | |
| | | - MicroLoRA adapt | | - Pattern merge | | - Full fine-tune | | |
| | | - Per-request | | - EWC++ update | | - Model distill | | |
| | +---------------------+ +----------------------+ +---------------------+ | |
| +-------------------------------------------------------------------------------+ |
| | |
| +-------------------------------------------------------------------------------+ |
| | Optimized Kernels | |
| | +------------------+ +------------------+ +------------------+ | |
| | | Attention | | Normalization | | Embedding | | |
| | | - Flash Attn 2 | | - RMSNorm | | - RoPE | | |
| | | - Paged Attn | | - LayerNorm | | - Token Embed | | |
| | | - GQA/MQA | | - Fused Ops | | - Pos Embed | | |
| | +------------------+ +------------------+ +------------------+ | |
| +-------------------------------------------------------------------------------+ |
| | |
| +-------------------------------------------------------------------------------+ |
| | Memory Management | |
| | +-------------------------+ +-------------------------------------------+ | |
| | | Two-Tier KV Cache | | Memory Pool | | |
| | | +-------------------+ | | - Slab allocator | | |
| | | | FP16 Tail (hot) | | | - Arena allocation | | |
| | | +-------------------+ | | - Zero-copy transfers | | |
| | | | Q4 Store (cold) | | | | | |
| | | +-------------------+ | +-------------------------------------------+ | |
| | +-------------------------+ | |
| +-------------------------------------------------------------------------------+ |
+-------------------------------------------------------------------------------------+
|
v
+-------------------------------------------------------------------------------------+
| Hardware Acceleration |
| +---------------------------+ +---------------------------+ |
| | Metal (Apple GPU) | | CUDA (NVIDIA) | |
| | - MLX integration | | - cuBLAS | |
| | - Metal Performance | | - cuDNN | |
| | Shaders | | - TensorRT | |
| +---------------------------+ +---------------------------+ |
+-------------------------------------------------------------------------------------+
Component Architecture
1. Backend Abstraction Layer
The backend abstraction provides a unified interface for different ML frameworks.
+---------------------------+
| LlmBackend Trait |
| - load_model() |
| - generate() |
| - forward() |
| - get_tokenizer() |
+---------------------------+
^
|
+------+------+
| |
+-------+ +-----------+
|Candle | |mistral-rs |
+-------+ +-----------+
Candle Backend Features:
- HuggingFace model hub integration
- Native Rust tensor operations
- Metal/CUDA acceleration
- Safetensors loading
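The trait diagram above can be sketched in Rust as follows. This is an illustrative sketch only: the method set mirrors the diagram (`load_model`, `generate`, `forward`), but the concrete signatures, error type, and the `DummyBackend` are assumptions, not the crate's actual API.

```rust
/// Unified backend interface, per the LlmBackend diagram above.
pub trait LlmBackend {
    /// Load model weights from a path (e.g. a safetensors file).
    fn load_model(&mut self, path: &str) -> Result<(), String>;
    /// Run one forward pass; returns logits for the last position.
    fn forward(&self, tokens: &[u32]) -> Vec<f32>;
    /// Convenience: tokenize, run the model, decode a completion.
    fn generate(&self, prompt: &str, max_tokens: usize) -> String;
}

/// Toy backend showing how an implementation plugs into the trait.
pub struct DummyBackend {
    loaded: bool,
}

impl LlmBackend for DummyBackend {
    fn load_model(&mut self, _path: &str) -> Result<(), String> {
        self.loaded = true;
        Ok(())
    }
    fn forward(&self, tokens: &[u32]) -> Vec<f32> {
        // Uniform logits over a tiny fake vocabulary.
        vec![0.0; tokens.len().max(16)]
    }
    fn generate(&self, prompt: &str, _max_tokens: usize) -> String {
        format!("echo: {prompt}")
    }
}
```

A Candle or mistral-rs backend would implement the same trait, which is what lets the upper layers stay framework-agnostic.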
2. SONA Learning Layer
Self-Optimizing Neural Architecture with three learning loops:
+-------------------+ +-------------------+
| Inference Request |---->| Instant Loop |
| + feedback | | - MicroLoRA adapt |
+-------------------+ | - <1ms latency |
+--------+----------+
|
v (async, 100ms)
+--------+----------+
| Background Loop |
| - Pattern merge |
| - Adapter compose |
| - EWC++ update |
+--------+----------+
|
v (triggered)
+--------+----------+
| Deep Loop |
| - Full fine-tune |
| - Model distill |
| - Pattern bank |
+-------------------+
Loop Characteristics:
| Loop | Latency | Trigger | Purpose |
|---|---|---|---|
| Instant | <1ms | Per-request | Real-time adaptation |
| Background | ~100ms | Interval/threshold | Pattern consolidation |
| Deep | Minutes | Accumulated quality | Full optimization |
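The control flow between the three loops can be illustrated with a toy dispatcher. The real loops run MicroLoRA updates, EWC++ consolidation, and fine-tuning; here each loop just records that it fired, and the queue size and threshold values are arbitrary assumptions chosen for the sketch.

```rust
/// Toy model of the three SONA loops and their triggers.
struct SonaLoops {
    background_queue: Vec<f32>, // feedback awaiting consolidation
    consolidated: usize,        // background-loop runs so far
    deep_runs: usize,           // deep-loop runs so far
    deep_threshold: usize,      // consolidations before a deep run
}

impl SonaLoops {
    /// Instant loop: runs synchronously on every request (<1ms budget).
    fn on_request(&mut self, feedback: f32) {
        // (real code: apply a MicroLoRA delta here)
        self.background_queue.push(feedback);
        if self.background_queue.len() >= 4 {
            self.background_tick();
        }
    }

    /// Background loop: drains the queue on an interval/threshold.
    fn background_tick(&mut self) {
        // (real code: merge patterns, update EWC++ Fisher information)
        self.background_queue.clear();
        self.consolidated += 1;
        if self.consolidated % self.deep_threshold == 0 {
            self.deep_run();
        }
    }

    /// Deep loop: triggered after enough accumulated consolidations.
    fn deep_run(&mut self) {
        // (real code: full fine-tune / distillation)
        self.deep_runs += 1;
    }
}
```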
3. Optimized Kernel Layer
NEON SIMD-optimized kernels for ARM64:
+-----------------------------------------------+
| Attention Kernels |
+-----------------------------------------------+
| |
| +------------------+ +------------------+ |
| | Flash Attention | | Paged Attention | |
| | - Tiled QKV | | - Block tables | |
| | - Online softmax| | - Non-contiguous| |
| | - O(N) memory | | - KV cache aware| |
| +------------------+ +------------------+ |
| |
| +------------------+ +------------------+ |
| | Multi-Query (MQA)| | Grouped-Query | |
| | - 1 KV head | | - KV groups | |
| | - Shared KV | | - 4-8x savings | |
| +------------------+ +------------------+ |
+-----------------------------------------------+
+-----------------------------------------------+
| Normalization Kernels |
+-----------------------------------------------+
| +------------------+ +------------------+ |
| | RMSNorm | | LayerNorm | |
| | - NEON SIMD | | - NEON SIMD | |
| | - Fused ops | | - Fused ops | |
| +------------------+ +------------------+ |
+-----------------------------------------------+
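For reference, a scalar RMSNorm looks like the following; the NEON kernel computes the same quantity with vectorized sum-of-squares and multiply, and this sketch is not the crate's actual kernel code.

```rust
/// Scalar reference RMSNorm: y_i = x_i / sqrt(mean(x^2) + eps) * w_i.
fn rms_norm(x: &[f32], weight: &[f32], eps: f32) -> Vec<f32> {
    let mean_sq = x.iter().map(|v| v * v).sum::<f32>() / x.len() as f32;
    let inv_rms = 1.0 / (mean_sq + eps).sqrt();
    x.iter().zip(weight).map(|(v, w)| v * inv_rms * w).collect()
}
```

Fusing the normalization with the following weight multiply (one pass, one write) is what the "Fused ops" entries in the diagram refer to.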
+-----------------------------------------------+
| Embedding Kernels |
+-----------------------------------------------+
| +------------------+ +------------------+ |
| | Rotary Position | | Token Embedding | |
| | (RoPE) | | - Lookup table | |
| | - Precomputed | | - Batch gather | |
| +------------------+ +------------------+ |
+-----------------------------------------------+
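The "precomputed" note for RoPE means the per-position cos/sin pairs are built once and reused for every query/key vector. A minimal sketch, assuming the standard rotary formulation (each dimension pair (2i, 2i+1) rotated by pos · theta^(-2i/d)):

```rust
/// Precompute (cos, sin) per position and dimension pair.
fn rope_tables(max_pos: usize, dim: usize, theta: f32) -> Vec<Vec<(f32, f32)>> {
    (0..max_pos)
        .map(|pos| {
            (0..dim / 2)
                .map(|i| {
                    let freq = theta.powf(-2.0 * i as f32 / dim as f32);
                    let angle = pos as f32 * freq;
                    (angle.cos(), angle.sin())
                })
                .collect()
        })
        .collect()
}

/// Rotate one position's query/key vector in place.
fn apply_rope(x: &mut [f32], table: &[(f32, f32)]) {
    for (i, &(c, s)) in table.iter().enumerate() {
        let (a, b) = (x[2 * i], x[2 * i + 1]);
        x[2 * i] = a * c - b * s;
        x[2 * i + 1] = a * s + b * c;
    }
}
```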
4. Memory Management
A two-tier KV cache trades memory for quality: recent tokens keep full precision while older tokens are quantized:
+----------------------------------------------------+
| Two-Tier KV Cache |
+----------------------------------------------------+
| |
| Position: 0 tail_length max |
| +------------------+------------------+ |
| | | | |
| | Quantized Store | High-Precision | |
| | (Cold) | Tail (Hot) | |
| | | | |
| | - Q4/Q8 format | - FP16 format | |
| | - Older tokens | - Recent tokens | |
| | - 4x smaller | - Full quality | |
| | | | |
| +------------------+------------------+ |
| |
| Migration: Hot -> Cold (when tail_length exceeded)|
| Eviction: Cold first, then Hot |
+----------------------------------------------------+
Cache Operations:
- Append: Add new KV pairs to tail
- Migrate: Move old tokens from tail to quantized store
- Evict: Remove oldest tokens when max exceeded
- Attend: Dequantize cold + use hot for attention
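The append/migrate/evict policy above can be sketched as follows. This is a deliberately simplified model: "quantization" is faked as an f32 → i8 round-trip with a single shared scale (an assumption for the sketch), whereas the real store uses Q4/Q8 block formats, and the hot tail here uses f32 in place of FP16.

```rust
/// Minimal model of the two-tier cache policy (one value per token).
struct TwoTierKvCache {
    cold: Vec<i8>,   // quantized store (older tokens)
    hot: Vec<f32>,   // high-precision tail (recent tokens)
    tail_len: usize, // max hot entries before migration
    max_len: usize,  // total capacity before eviction
    scale: f32,      // shared quantization scale (assumption)
}

impl TwoTierKvCache {
    /// Append: new entries always land in the hot tail.
    fn append(&mut self, v: f32) {
        self.hot.push(v);
        // Migrate: overflow moves the oldest hot entry to the cold store.
        if self.hot.len() > self.tail_len {
            let old = self.hot.remove(0);
            self.cold.push((old / self.scale).round() as i8);
        }
        // Evict: cold entries go first once total capacity is exceeded.
        if self.cold.len() + self.hot.len() > self.max_len {
            self.cold.remove(0);
        }
    }

    /// Attend: dequantize cold entries and concatenate the hot tail.
    fn full_view(&self) -> Vec<f32> {
        self.cold.iter().map(|&q| q as f32 * self.scale)
            .chain(self.hot.iter().copied())
            .collect()
    }
}
```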
Data Flow
Inference Pipeline
Input Tokens
|
v
+--------------------+
| Token Embedding |
| + RoPE Position |
+--------------------+
|
v (for each layer)
+--------------------+
| Attention Layer |
| +---------------+|
| | Q,K,V Project ||
| +---------------+|
| | |
| +---------------+|
| | KV Cache ||
| | Update ||
| +---------------+|
| | |
| +---------------+|
| | Flash/Paged ||
| | Attention ||
| +---------------+|
| | |
| +---------------+|
| | Output Proj ||
| +---------------+|
+--------------------+
|
v
+--------------------+
| FFN Layer |
| - Gate Proj |
| - Up Proj |
| - Down Proj |
| - Activation |
+--------------------+
|
v
+--------------------+
| RMSNorm |
+--------------------+
|
v
+--------------------+
| LM Head |
| (final layer) |
+--------------------+
|
v
Logits -> Sampling -> Token
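The final "Logits -> Sampling -> Token" step, in its simplest (greedy) form, is an argmax over the logits; temperature and top-k sampling divide or mask the logits before this step. A minimal sketch:

```rust
/// Greedy sampling: pick the token id with the highest logit.
fn greedy_sample(logits: &[f32]) -> usize {
    logits
        .iter()
        .enumerate()
        .max_by(|a, b| a.1.partial_cmp(b.1).unwrap())
        .map(|(i, _)| i)
        .unwrap()
}
```

The sampled token id is fed back as the next input token, repeating the pipeline above until an end-of-sequence token or the length limit is reached.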
Learning Pipeline
Request + Response + Feedback
|
v
+---------------------------+
| Instant Loop |
| - Compute embeddings |
| - Apply MicroLoRA |
| - Queue for background |
+---------------------------+
|
v (async)
+---------------------------+
| Background Loop |
| - Batch samples |
| - Update EWC++ Fisher |
| - Merge adapters |
| - Store in ReasoningBank |
+---------------------------+
|
v (threshold triggered)
+---------------------------+
| Deep Loop |
| - Full training pipeline |
| - Pattern distillation |
| - Catastrophic forgetting|
| prevention (EWC++) |
+---------------------------+
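The forgetting-prevention term is the classic EWC quadratic penalty: (λ/2) · Σᵢ Fᵢ(θᵢ − θ*ᵢ)², where F is the Fisher information and θ* the consolidated weights. EWC++ additionally maintains F online with decay/renormalization, which this sketch omits:

```rust
/// EWC-style penalty: (lambda / 2) * sum_i F_i * (theta_i - theta*_i)^2.
/// Large F_i marks a weight as important to previous tasks, so moving
/// it away from its consolidated value theta*_i is penalized heavily.
fn ewc_penalty(theta: &[f32], theta_star: &[f32], fisher: &[f32], lambda: f32) -> f32 {
    theta
        .iter()
        .zip(theta_star)
        .zip(fisher)
        .map(|((t, ts), f)| f * (t - ts) * (t - ts))
        .sum::<f32>()
        * lambda / 2.0
}
```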
Module Structure
ruvllm/
├── src/
│ ├── lib.rs # Crate root, re-exports
│ ├── error.rs # Error types
│ ├── types.rs # Common types (Precision, etc.)
│ │
│ ├── backends/ # ML framework backends
│ │ ├── mod.rs # Backend trait
│ │ ├── candle_backend.rs
│ │ └── config.rs
│ │
│ ├── kernels/ # Optimized kernels
│ │ ├── mod.rs # Kernel exports
│ │ ├── attention.rs # Attention variants
│ │ ├── matmul.rs # Matrix multiplication
│ │ ├── norm.rs # Normalization ops
│ │ └── rope.rs # Rotary embeddings
│ │
│ ├── lora/ # LoRA adapters
│ │ ├── mod.rs # LoRA exports
│ │ ├── micro_lora.rs # Real-time MicroLoRA
│ │ └── training.rs # Training pipeline
│ │
│ ├── optimization/ # SONA integration
│ │ ├── mod.rs
│ │ └── sona_llm.rs # Learning loops
│ │
│ ├── kv_cache.rs # Two-tier KV cache
│ ├── sona.rs # SONA core integration
│ ├── policy_store.rs # Learned policies
│ └── witness_log.rs # Inference logging
│
└── benches/ # Benchmarks
├── attention_bench.rs
├── lora_bench.rs
└── e2e_bench.rs
Performance Characteristics
Memory Layout
| Component | Memory Pattern | Optimization |
|---|---|---|
| KV Cache Tail | Sequential | NEON vectorized |
| KV Cache Store | Quantized blocks | Batch dequant |
| Model Weights | Memory-mapped | Zero-copy |
| Intermediate | Stack allocated | Arena alloc |
Throughput Targets (M4 Pro)
| Operation | Target | Achieved |
|---|---|---|
| Flash Attention | 2.5x vs naive | ~2.3x |
| Paged Attention | 1.8x vs contiguous | ~1.7x |
| GQA vs MHA | 4x less KV memory | 4x |
| MicroLoRA adapt | <1ms | ~0.5ms |
Integration Points
With RuVector Core
// Memory backend integration
use ruvector_core::storage::Storage;
// SONA learning integration
use ruvector_sona::{SonaEngine, ReasoningBank};
With External Systems
- HuggingFace Hub: Model downloads
- OpenAI API: Compatible inference endpoint
- Prometheus: Metrics export
- gRPC: High-performance RPC
Future Architecture
Planned enhancements:
- Speculative Decoding: Draft model integration
- Tensor Parallelism: Multi-GPU support
- Continuous Batching: Dynamic batch scheduling
- PagedAttention v2: vLLM-style memory management