RuvLLM Architecture (v2.0.0)
This document describes the system architecture of RuvLLM, a high-performance LLM inference engine optimized for Apple Silicon.
v2.0.0 New Features
| Feature | Description | Performance Impact |
|---|---|---|
| Multi-threaded GEMM/GEMV | Rayon parallelization | 12.7x speedup on M4 Pro |
| Flash Attention 2 | Auto block sizing | +10% throughput |
| Quantized Inference | INT8/INT4/Q4_K kernels | 4-8x memory reduction |
| Metal GPU Shaders | simdgroup_matrix ops | 3x speedup |
| Memory Pool | Arena allocator | Zero-alloc inference |
| WASM Support | Browser inference | ~2.5x overhead |
| npm Integration | @ruvector/ruvllm | JavaScript/TypeScript API |
System Overview
+----------------------------------+
| User Application |
+----------------------------------+
|
v
+-------------------------------------------------------------------------------------+
| RuvLLM Core |
| +-------------------------------------------------------------------------------+ |
| | Backend Abstraction | |
| | +-------------------------+ +-------------------------+ | |
| | | Candle Backend | | mistral-rs Backend | | |
| | | - Model Loading | | - Model Loading | | |
| | | - Tokenization | | - Tokenization | | |
| | | - Forward Pass | | - Forward Pass | | |
| | +-------------------------+ +-------------------------+ | |
| +-------------------------------------------------------------------------------+ |
| | |
| +-------------------------------------------------------------------------------+ |
| | SONA Learning Layer | |
| | +---------------------+ +----------------------+ +---------------------+ | |
| | | Instant Loop | | Background Loop | | Deep Loop | | |
| | | (<1ms latency) | | (~100ms interval) | | (minutes/hours) | | |
| | | - MicroLoRA adapt | | - Pattern merge | | - Full fine-tune | | |
| | | - Per-request | | - EWC++ update | | - Model distill | | |
| | +---------------------+ +----------------------+ +---------------------+ | |
| +-------------------------------------------------------------------------------+ |
| | |
| +-------------------------------------------------------------------------------+ |
| | Optimized Kernels | |
| | +------------------+ +------------------+ +------------------+ | |
| | | Attention | | Normalization | | Embedding | | |
| | | - Flash Attn 2 | | - RMSNorm | | - RoPE | | |
| | | - Paged Attn | | - LayerNorm | | - Token Embed | | |
| | | - GQA/MQA | | - Fused Ops | | - Pos Embed | | |
| | +------------------+ +------------------+ +------------------+ | |
| +-------------------------------------------------------------------------------+ |
| | |
| +-------------------------------------------------------------------------------+ |
| | Memory Management | |
| | +-------------------------+ +-------------------------------------------+ | |
| | | Two-Tier KV Cache | | Memory Pool | | |
| | | +-------------------+ | | - Slab allocator | | |
| | | | FP16 Tail (hot) | | | - Arena allocation | | |
| | | +-------------------+ | | - Zero-copy transfers | | |
| | | | Q4 Store (cold) | | | | | |
| | | +-------------------+ | +-------------------------------------------+ | |
| | +-------------------------+ | |
| +-------------------------------------------------------------------------------+ |
+-------------------------------------------------------------------------------------+
|
v
+-------------------------------------------------------------------------------------+
| Hardware Acceleration |
| +---------------------------+ +---------------------------+ |
| | Metal (Apple GPU) | | CUDA (NVIDIA) | |
| | - MLX integration | | - cuBLAS | |
| | - Metal Performance | | - cuDNN | |
| | Shaders | | - TensorRT | |
| +---------------------------+ +---------------------------+ |
+-------------------------------------------------------------------------------------+
Component Architecture
1. Backend Abstraction Layer
The backend abstraction provides a unified interface for different ML frameworks.
+---------------------------+
| LlmBackend Trait |
| - load_model() |
| - generate() |
| - forward() |
| - get_tokenizer() |
+---------------------------+
^
|
+------+------+
| |
+-------+ +-----------+
|Candle | |mistral-rs |
+-------+ +-----------+
Candle Backend Features:
- HuggingFace model hub integration
- Native Rust tensor operations
- Metal/CUDA acceleration
- Safetensors loading
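The trait diagram above can be sketched in Rust as follows. This is an illustrative sketch only: the method set mirrors the diagram (`load_model`, `generate`, `forward`), but the concrete signatures, error type, and the `DummyBackend` are assumptions, not the crate's actual API.

```rust
/// Unified backend interface, per the LlmBackend diagram above.
pub trait LlmBackend {
    /// Load model weights from a path (e.g. a safetensors file).
    fn load_model(&mut self, path: &str) -> Result<(), String>;
    /// Run one forward pass; returns logits for the last position.
    fn forward(&self, tokens: &[u32]) -> Vec<f32>;
    /// Convenience: tokenize, run the model, decode a completion.
    fn generate(&self, prompt: &str, max_tokens: usize) -> String;
}

/// Toy backend showing how an implementation plugs into the trait.
pub struct DummyBackend {
    loaded: bool,
}

impl LlmBackend for DummyBackend {
    fn load_model(&mut self, _path: &str) -> Result<(), String> {
        self.loaded = true;
        Ok(())
    }
    fn forward(&self, tokens: &[u32]) -> Vec<f32> {
        // Uniform logits over a tiny fake vocabulary.
        vec![0.0; tokens.len().max(16)]
    }
    fn generate(&self, prompt: &str, _max_tokens: usize) -> String {
        format!("echo: {prompt}")
    }
}
```

A Candle or mistral-rs backend would implement the same trait, which is what lets the upper layers stay framework-agnostic.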
2. SONA Learning Layer
Self-Optimizing Neural Architecture with three learning loops:
+-------------------+ +-------------------+
| Inference Request |---->| Instant Loop |
| + feedback | | - MicroLoRA adapt |
+-------------------+ | - <1ms latency |
+--------+----------+
|
v (async, 100ms)
+--------+----------+
| Background Loop |
| - Pattern merge |
| - Adapter compose |
| - EWC++ update |
+--------+----------+
|
v (triggered)
+--------+----------+
| Deep Loop |
| - Full fine-tune |
| - Model distill |
| - Pattern bank |
+-------------------+
Loop Characteristics:
| Loop | Latency | Trigger | Purpose |
|---|---|---|---|
| Instant | <1ms | Per-request | Real-time adaptation |
| Background | ~100ms | Interval/threshold | Pattern consolidation |
| Deep | Minutes | Accumulated quality | Full optimization |
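The control flow between the three loops can be illustrated with a toy dispatcher. The real loops run MicroLoRA updates, EWC++ consolidation, and fine-tuning; here each loop just records that it fired, and the queue size and threshold values are arbitrary assumptions chosen for the sketch.

```rust
/// Toy model of the three SONA loops and their triggers.
struct SonaLoops {
    background_queue: Vec<f32>, // feedback awaiting consolidation
    consolidated: usize,        // background-loop runs so far
    deep_runs: usize,           // deep-loop runs so far
    deep_threshold: usize,      // consolidations before a deep run
}

impl SonaLoops {
    /// Instant loop: runs synchronously on every request (<1ms budget).
    fn on_request(&mut self, feedback: f32) {
        // (real code: apply a MicroLoRA delta here)
        self.background_queue.push(feedback);
        if self.background_queue.len() >= 4 {
            self.background_tick();
        }
    }

    /// Background loop: drains the queue on an interval/threshold.
    fn background_tick(&mut self) {
        // (real code: merge patterns, update EWC++ Fisher information)
        self.background_queue.clear();
        self.consolidated += 1;
        if self.consolidated % self.deep_threshold == 0 {
            self.deep_run();
        }
    }

    /// Deep loop: triggered after enough accumulated consolidations.
    fn deep_run(&mut self) {
        // (real code: full fine-tune / distillation)
        self.deep_runs += 1;
    }
}
```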
3. Optimized Kernel Layer
NEON SIMD-optimized kernels for ARM64:
+-----------------------------------------------+
| Attention Kernels |
+-----------------------------------------------+
| |
| +------------------+ +------------------+ |
| | Flash Attention | | Paged Attention | |
| | - Tiled QKV | | - Block tables | |
| | - Online softmax| | - Non-contiguous| |
| | - O(N) memory | | - KV cache aware| |
| +------------------+ +------------------+ |
| |
| +------------------+ +------------------+ |
| | Multi-Query (MQA)| | Grouped-Query | |
| | - 1 KV head | | - KV groups | |
| | - Shared KV | | - 4-8x savings | |
| +------------------+ +------------------+ |
+-----------------------------------------------+
+-----------------------------------------------+
| Normalization Kernels |
+-----------------------------------------------+
| +------------------+ +------------------+ |
| | RMSNorm | | LayerNorm | |
| | - NEON SIMD | | - NEON SIMD | |
| | - Fused ops | | - Fused ops | |
| +------------------+ +------------------+ |
+-----------------------------------------------+
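For reference, a scalar RMSNorm looks like the following; the NEON kernel computes the same quantity with vectorized sum-of-squares and multiply, and this sketch is not the crate's actual kernel code.

```rust
/// Scalar reference RMSNorm: y_i = x_i / sqrt(mean(x^2) + eps) * w_i.
fn rms_norm(x: &[f32], weight: &[f32], eps: f32) -> Vec<f32> {
    let mean_sq = x.iter().map(|v| v * v).sum::<f32>() / x.len() as f32;
    let inv_rms = 1.0 / (mean_sq + eps).sqrt();
    x.iter().zip(weight).map(|(v, w)| v * inv_rms * w).collect()
}
```

Fusing the normalization with the following weight multiply (one pass, one write) is what the "Fused ops" entries in the diagram refer to.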
+-----------------------------------------------+
| Embedding Kernels |
+-----------------------------------------------+
| +------------------+ +------------------+ |
| | Rotary Position | | Token Embedding | |
| | (RoPE) | | - Lookup table | |
| | - Precomputed | | - Batch gather | |
| +------------------+ +------------------+ |
+-----------------------------------------------+
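The "precomputed" note for RoPE means the per-position cos/sin pairs are built once and reused for every query/key vector. A minimal sketch, assuming the standard rotary formulation (each dimension pair (2i, 2i+1) rotated by pos · theta^(-2i/d)):

```rust
/// Precompute (cos, sin) per position and dimension pair.
fn rope_tables(max_pos: usize, dim: usize, theta: f32) -> Vec<Vec<(f32, f32)>> {
    (0..max_pos)
        .map(|pos| {
            (0..dim / 2)
                .map(|i| {
                    let freq = theta.powf(-2.0 * i as f32 / dim as f32);
                    let angle = pos as f32 * freq;
                    (angle.cos(), angle.sin())
                })
                .collect()
        })
        .collect()
}

/// Rotate one position's query/key vector in place.
fn apply_rope(x: &mut [f32], table: &[(f32, f32)]) {
    for (i, &(c, s)) in table.iter().enumerate() {
        let (a, b) = (x[2 * i], x[2 * i + 1]);
        x[2 * i] = a * c - b * s;
        x[2 * i + 1] = a * s + b * c;
    }
}
```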
4. Memory Management
A two-tier KV cache trades memory for quality: recent tokens keep full precision while older tokens are quantized:
+----------------------------------------------------+
| Two-Tier KV Cache |
+----------------------------------------------------+
| |
| Position: 0 tail_length max |
| +------------------+------------------+ |
| | | | |
| | Quantized Store | High-Precision | |
| | (Cold) | Tail (Hot) | |
| | | | |
| | - Q4/Q8 format | - FP16 format | |
| | - Older tokens | - Recent tokens | |
| | - 4x smaller | - Full quality | |
| | | | |
| +------------------+------------------+ |
| |
| Migration: Hot -> Cold (when tail_length exceeded)|
| Eviction: Cold first, then Hot |
+----------------------------------------------------+
Cache Operations:
- Append: Add new KV pairs to tail
- Migrate: Move old tokens from tail to quantized store
- Evict: Remove oldest tokens when max exceeded
- Attend: Dequantize cold + use hot for attention
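The append/migrate/evict policy above can be sketched as follows. This is a deliberately simplified model: "quantization" is faked as an f32 → i8 round-trip with a single shared scale (an assumption for the sketch), whereas the real store uses Q4/Q8 block formats, and the hot tail here uses f32 in place of FP16.

```rust
/// Minimal model of the two-tier cache policy (one value per token).
struct TwoTierKvCache {
    cold: Vec<i8>,   // quantized store (older tokens)
    hot: Vec<f32>,   // high-precision tail (recent tokens)
    tail_len: usize, // max hot entries before migration
    max_len: usize,  // total capacity before eviction
    scale: f32,      // shared quantization scale (assumption)
}

impl TwoTierKvCache {
    /// Append: new entries always land in the hot tail.
    fn append(&mut self, v: f32) {
        self.hot.push(v);
        // Migrate: overflow moves the oldest hot entry to the cold store.
        if self.hot.len() > self.tail_len {
            let old = self.hot.remove(0);
            self.cold.push((old / self.scale).round() as i8);
        }
        // Evict: cold entries go first once total capacity is exceeded.
        if self.cold.len() + self.hot.len() > self.max_len {
            self.cold.remove(0);
        }
    }

    /// Attend: dequantize cold entries and concatenate the hot tail.
    fn full_view(&self) -> Vec<f32> {
        self.cold.iter().map(|&q| q as f32 * self.scale)
            .chain(self.hot.iter().copied())
            .collect()
    }
}
```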
Data Flow
Inference Pipeline
Input Tokens
|
v
+--------------------+
| Token Embedding |
| + RoPE Position |
+--------------------+
|
v (for each layer)
+--------------------+
| Attention Layer |
| +---------------+|
| | Q,K,V Project ||
| +---------------+|
| | |
| +---------------+|
| | KV Cache ||
| | Update ||
| +---------------+|
| | |
| +---------------+|
| | Flash/Paged ||
| | Attention ||
| +---------------+|
| | |
| +---------------+|
| | Output Proj ||
| +---------------+|
+--------------------+
|
v
+--------------------+
| FFN Layer |
| - Gate Proj |
| - Up Proj |
| - Down Proj |
| - Activation |
+--------------------+
|
v
+--------------------+
| RMSNorm |
+--------------------+
|
v
+--------------------+
| LM Head |
| (final layer) |
+--------------------+
|
v
Logits -> Sampling -> Token
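The final "Logits -> Sampling -> Token" step, in its simplest (greedy) form, is an argmax over the logits; temperature and top-k sampling divide or mask the logits before this step. A minimal sketch:

```rust
/// Greedy sampling: pick the token id with the highest logit.
fn greedy_sample(logits: &[f32]) -> usize {
    logits
        .iter()
        .enumerate()
        .max_by(|a, b| a.1.partial_cmp(b.1).unwrap())
        .map(|(i, _)| i)
        .unwrap()
}
```

The sampled token id is fed back as the next input token, repeating the pipeline above until an end-of-sequence token or the length limit is reached.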
Learning Pipeline
Request + Response + Feedback
|
v
+---------------------------+
| Instant Loop |
| - Compute embeddings |
| - Apply MicroLoRA |
| - Queue for background |
+---------------------------+
|
v (async)
+---------------------------+
| Background Loop |
| - Batch samples |
| - Update EWC++ Fisher |
| - Merge adapters |
| - Store in ReasoningBank |
+---------------------------+
|
v (threshold triggered)
+---------------------------+
| Deep Loop |
| - Full training pipeline |
| - Pattern distillation |
| - Catastrophic forgetting|
| prevention (EWC++) |
+---------------------------+
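The forgetting-prevention term is the classic EWC quadratic penalty: (λ/2) · Σᵢ Fᵢ(θᵢ − θ*ᵢ)², where F is the Fisher information and θ* the consolidated weights. EWC++ additionally maintains F online with decay/renormalization, which this sketch omits:

```rust
/// EWC-style penalty: (lambda / 2) * sum_i F_i * (theta_i - theta*_i)^2.
/// Large F_i marks a weight as important to previous tasks, so moving
/// it away from its consolidated value theta*_i is penalized heavily.
fn ewc_penalty(theta: &[f32], theta_star: &[f32], fisher: &[f32], lambda: f32) -> f32 {
    theta
        .iter()
        .zip(theta_star)
        .zip(fisher)
        .map(|((t, ts), f)| f * (t - ts) * (t - ts))
        .sum::<f32>()
        * lambda / 2.0
}
```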
Module Structure
ruvllm/
├── src/
│ ├── lib.rs # Crate root, re-exports
│ ├── error.rs # Error types
│ ├── types.rs # Common types (Precision, etc.)
│ │
│ ├── backends/ # ML framework backends
│ │ ├── mod.rs # Backend trait
│ │ ├── candle_backend.rs
│ │ └── config.rs
│ │
│ ├── kernels/ # Optimized kernels
│ │ ├── mod.rs # Kernel exports
│ │ ├── attention.rs # Attention variants
│ │ ├── matmul.rs # Matrix multiplication
│ │ ├── norm.rs # Normalization ops
│ │ └── rope.rs # Rotary embeddings
│ │
│ ├── lora/ # LoRA adapters
│ │ ├── mod.rs # LoRA exports
│ │ ├── micro_lora.rs # Real-time MicroLoRA
│ │ └── training.rs # Training pipeline
│ │
│ ├── optimization/ # SONA integration
│ │ ├── mod.rs
│ │ └── sona_llm.rs # Learning loops
│ │
│ ├── kv_cache.rs # Two-tier KV cache
│ ├── sona.rs # SONA core integration
│ ├── policy_store.rs # Learned policies
│ └── witness_log.rs # Inference logging
│
└── benches/ # Benchmarks
├── attention_bench.rs
├── lora_bench.rs
└── e2e_bench.rs
Performance Characteristics
Memory Layout
| Component | Memory Pattern | Optimization |
|---|---|---|
| KV Cache Tail | Sequential | NEON vectorized |
| KV Cache Store | Quantized blocks | Batch dequant |
| Model Weights | Memory-mapped | Zero-copy |
| Intermediate | Stack allocated | Arena alloc |
Throughput Targets (M4 Pro)
| Operation | Target | Achieved |
|---|---|---|
| Flash Attention | 2.5x vs naive | ~2.3x |
| Paged Attention | 1.8x vs contiguous | ~1.7x |
| GQA vs MHA | 4x less KV memory | 4x |
| MicroLoRA adapt | <1ms | ~0.5ms |
Integration Points
With RuVector Core
// Memory backend integration
use ruvector_core::storage::Storage;
// SONA learning integration
use ruvector_sona::{SonaEngine, ReasoningBank};
With External Systems
- HuggingFace Hub: Model downloads
- OpenAI API: Compatible inference endpoint
- Prometheus: Metrics export
- gRPC: High-performance RPC
Future Architecture
Planned enhancements:
- Speculative Decoding: Draft model integration
- Tensor Parallelism: Multi-GPU support
- Continuous Batching: Dynamic batch scheduling
- PagedAttention v2: vLLM-style memory management