# RuvLLM Architecture (v2.0.0)

This document describes the system architecture of RuvLLM, a high-performance LLM inference engine optimized for Apple Silicon.

## v2.0.0 New Features

| Feature | Description | Performance Impact |
|---------|-------------|-------------------|
| Multi-threaded GEMM/GEMV | Rayon parallelization | 12.7x speedup on M4 Pro |
| Flash Attention 2 | Auto block sizing | +10% throughput |
| Quantized Inference | INT8/INT4/Q4_K kernels | 4-8x memory reduction |
| Metal GPU Shaders | simdgroup_matrix ops | 3x speedup |
| Memory Pool | Arena allocator | Zero-alloc inference |
| WASM Support | Browser inference | ~2.5x overhead |
| npm Integration | @ruvector/ruvllm | JavaScript/TypeScript API |

## System Overview

```
                              +----------------------------------+
                              |          User Application        |
                              +----------------------------------+
                                              |
                                              v
+-------------------------------------------------------------------------------------+
|                                    RuvLLM Core                                       |
|  +-------------------------------------------------------------------------------+  |
|  |                              Backend Abstraction                               |  |
|  |  +-------------------------+  +-------------------------+                     |  |
|  |  |    Candle Backend       |  |    mistral-rs Backend   |                     |  |
|  |  |  - Model Loading        |  |  - Model Loading        |                     |  |
|  |  |  - Tokenization         |  |  - Tokenization         |                     |  |
|  |  |  - Forward Pass         |  |  - Forward Pass         |                     |  |
|  |  +-------------------------+  +-------------------------+                     |  |
|  +-------------------------------------------------------------------------------+  |
|                                          |                                          |
|  +-------------------------------------------------------------------------------+  |
|  |                              SONA Learning Layer                               |  |
|  |  +---------------------+  +----------------------+  +---------------------+   |  |
|  |  |    Instant Loop     |  |   Background Loop    |  |     Deep Loop       |   |  |
|  |  |    (<1ms latency)   |  |   (~100ms interval)  |  |   (minutes/hours)   |   |  |
|  |  |  - MicroLoRA adapt  |  |  - Pattern merge     |  |  - Full fine-tune   |   |  |
|  |  |  - Per-request      |  |  - EWC++ update      |  |  - Model distill    |   |  |
|  |  +---------------------+  +----------------------+  +---------------------+   |  |
|  +-------------------------------------------------------------------------------+  |
|                                          |                                          |
|  +-------------------------------------------------------------------------------+  |
|  |                              Optimized Kernels                                 |  |
|  |  +------------------+  +------------------+  +------------------+              |  |
|  |  |  Attention       |  |  Normalization   |  |  Embedding       |              |  |
|  |  |  - Flash Attn 2  |  |  - RMSNorm       |  |  - RoPE          |              |  |
|  |  |  - Paged Attn    |  |  - LayerNorm     |  |  - Token Embed   |              |  |
|  |  |  - GQA/MQA       |  |  - Fused Ops     |  |  - Pos Embed     |              |  |
|  |  +------------------+  +------------------+  +------------------+              |  |
|  +-------------------------------------------------------------------------------+  |
|                                          |                                          |
|  +-------------------------------------------------------------------------------+  |
|  |                              Memory Management                                 |  |
|  |  +-------------------------+  +-------------------------------------------+   |  |
|  |  |   Two-Tier KV Cache     |  |           Memory Pool                     |   |  |
|  |  |  +-------------------+  |  |  - Slab allocator                         |   |  |
|  |  |  |  FP16 Tail (hot)  |  |  |  - Arena allocation                       |   |  |
|  |  |  +-------------------+  |  |  - Zero-copy transfers                    |   |  |
|  |  |  |  Q4 Store (cold)  |  |  |                                           |   |  |
|  |  |  +-------------------+  |  +-------------------------------------------+   |  |
|  |  +-------------------------+                                                  |  |
|  +-------------------------------------------------------------------------------+  |
+-------------------------------------------------------------------------------------+
                                          |
                                          v
+-------------------------------------------------------------------------------------+
|                              Hardware Acceleration                                   |
|  +---------------------------+  +---------------------------+                       |
|  |     Metal (Apple GPU)     |  |      CUDA (NVIDIA)        |                       |
|  |  - MLX integration        |  |  - cuBLAS                 |                       |
|  |  - Metal Performance      |  |  - cuDNN                  |                       |
|  |    Shaders                |  |  - TensorRT               |                       |
|  +---------------------------+  +---------------------------+                       |
+-------------------------------------------------------------------------------------+
```

## Component Architecture

### 1. Backend Abstraction Layer

The backend abstraction exposes a single `LlmBackend` trait that each ML framework backend implements, so the rest of the engine does not depend on a specific framework.

```
+---------------------------+
|     LlmBackend Trait      |
|  - load_model()           |
|  - generate()             |
|  - forward()              |
|  - get_tokenizer()        |
+---------------------------+
           ^
           |
    +------+------+
    |             |
+-------+   +-----------+
|Candle |   |mistral-rs |
+-------+   +-----------+
```

**Candle Backend Features:**
- HuggingFace model hub integration
- Native Rust tensor operations
- Metal/CUDA acceleration
- Safetensors loading
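
To make the shape of the abstraction concrete, here is a minimal sketch of what a trait with the four methods from the diagram could look like. Only the method names come from the diagram; the signatures, `BackendError`, the `Tokenizer` trait, `ByteTokenizer`, and `ToyBackend` are hypothetical and exist only for illustration.

```rust
use std::fmt;

/// Error type for this sketch only (the crate keeps its real errors in `error.rs`).
#[derive(Debug)]
pub struct BackendError(pub String);

impl fmt::Display for BackendError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "backend error: {}", self.0)
    }
}

impl std::error::Error for BackendError {}

/// Hypothetical tokenizer interface returned by `get_tokenizer()`.
pub trait Tokenizer {
    fn encode(&self, text: &str) -> Vec<u32>;
    fn decode(&self, ids: &[u32]) -> String;
}

/// Method names follow the diagram; every signature here is an assumption.
pub trait LlmBackend {
    fn load_model(&mut self, model_id: &str) -> Result<(), BackendError>;
    /// Returns next-token logits for the given token sequence.
    fn forward(&self, tokens: &[u32]) -> Result<Vec<f32>, BackendError>;
    fn get_tokenizer(&self) -> &dyn Tokenizer;

    /// Greedy decoding built on `forward()`, shared by all backends.
    fn generate(&self, prompt: &str, max_new_tokens: usize) -> Result<String, BackendError> {
        let tokenizer = self.get_tokenizer();
        let mut ids = tokenizer.encode(prompt);
        let prompt_len = ids.len();
        for _ in 0..max_new_tokens {
            let logits = self.forward(&ids)?;
            let next = logits
                .iter()
                .enumerate()
                .max_by(|a, b| a.1.total_cmp(b.1))
                .map(|(i, _)| i as u32)
                .ok_or_else(|| BackendError("empty logits".into()))?;
            ids.push(next);
        }
        Ok(tokenizer.decode(&ids[prompt_len..]))
    }
}

/// Toy tokenizer: one token per byte.
#[derive(Default)]
pub struct ByteTokenizer;

impl Tokenizer for ByteTokenizer {
    fn encode(&self, text: &str) -> Vec<u32> {
        text.bytes().map(u32::from).collect()
    }
    fn decode(&self, ids: &[u32]) -> String {
        let bytes: Vec<u8> = ids.iter().map(|&id| id as u8).collect();
        String::from_utf8_lossy(&bytes).into_owned()
    }
}

/// Toy backend that always predicts "previous byte + 1".
#[derive(Default)]
pub struct ToyBackend {
    loaded: bool,
    tokenizer: ByteTokenizer,
}

impl LlmBackend for ToyBackend {
    fn load_model(&mut self, _model_id: &str) -> Result<(), BackendError> {
        self.loaded = true;
        Ok(())
    }
    fn forward(&self, tokens: &[u32]) -> Result<Vec<f32>, BackendError> {
        if !self.loaded {
            return Err(BackendError("model not loaded".into()));
        }
        let last = tokens.last().copied().unwrap_or(0) as usize;
        let mut logits = vec![0.0; 256];
        logits[(last + 1) % 256] = 1.0;
        Ok(logits)
    }
    fn get_tokenizer(&self) -> &dyn Tokenizer {
        &self.tokenizer
    }
}
```

Putting `generate()` in the trait as a default method is one possible design: a backend then only has to supply `forward()` and a tokenizer, and can still override `generate()` when it has a faster native path.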

### 2. SONA Learning Layer

SONA (Self-Optimizing Neural Architecture) adapts the model through three learning loops that run at different time scales:

```
+-------------------+     +-------------------+
| Inference Request |---->| Instant Loop      |
| + feedback        |     | - MicroLoRA adapt |
+-------------------+     | - <1ms latency    |
                          +--------+----------+
                                   |
                                   v (async, 100ms)
                          +--------+----------+
                          | Background Loop   |
                          | - Pattern merge   |
                          | - Adapter compose |
                          | - EWC++ update    |
                          +--------+----------+
                                   |
                                   v (triggered)
                          +--------+----------+
                          | Deep Loop         |
                          | - Full fine-tune  |
                          | - Model distill   |
                          | - Pattern bank    |
                          +-------------------+
```
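
The "MicroLoRA adapt" step in the instant loop can be pictured as a standard low-rank (LoRA) update applied on top of a frozen weight matrix: `y = W x + (alpha / r) * B (A x)`. The sketch below shows that generic formula only. `LoraAdapter`, `forward_with_lora`, and the field layout are hypothetical, and how MicroLoRA actually chooses the rank or stores its matrices is not stated in this document.

```rust
/// Dot product of two equally sized slices.
fn dot(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

/// Generic LoRA adapter (hypothetical layout, not the crate's type).
pub struct LoraAdapter {
    /// Down-projection A with shape r x d_in.
    pub a: Vec<Vec<f32>>,
    /// Up-projection B with shape d_out x r.
    pub b: Vec<Vec<f32>>,
    /// Scaling numerator; the effective scale is alpha / r.
    pub alpha: f32,
}

impl LoraAdapter {
    pub fn rank(&self) -> usize {
        self.a.len()
    }

    /// Computes (alpha / r) * B (A x) without ever forming the d_out x d_in product B A.
    pub fn delta(&self, x: &[f32]) -> Vec<f32> {
        let ax: Vec<f32> = self.a.iter().map(|row| dot(row, x)).collect();
        let scale = self.alpha / self.rank() as f32;
        self.b.iter().map(|row| scale * dot(row, &ax)).collect()
    }
}

/// y = W x + delta(x); the base weights W stay untouched.
pub fn forward_with_lora(w: &[Vec<f32>], adapter: &LoraAdapter, x: &[f32]) -> Vec<f32> {
    w.iter()
        .zip(adapter.delta(x))
        .map(|(row, d)| dot(row, x) + d)
        .collect()
}
```

Because only `A` and `B` change, a per-request update touches `r * (d_in + d_out)` values instead of `d_in * d_out`, which is what makes a very small rank attractive for a sub-millisecond budget.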

**Loop Characteristics:**

| Loop | Latency | Trigger | Purpose |
|------|---------|---------|---------|
| Instant | <1ms | Per-request | Real-time adaptation |
| Background | ~100ms | Interval/threshold | Pattern consolidation |
| Deep | Minutes to hours | Accumulated quality | Full optimization |
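
The triggers in the table can be sketched as a small scheduler that decides, per request, which loops are due. This is an illustrative model only: `LoopScheduler`, its fields, the batch threshold, and the way quality is accumulated are assumptions, not the crate's API. Elapsed time is passed in explicitly so the logic stays deterministic.

```rust
use std::time::Duration;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Loop {
    Instant,
    Background,
    Deep,
}

/// Hypothetical scheduler for the three loops.
pub struct LoopScheduler {
    background_interval: Duration, // e.g. ~100ms, as in the table
    background_batch: usize,       // assumed sample-count threshold
    deep_quality_threshold: f32,   // assumed accumulated-quality trigger
    since_background: Duration,
    pending_samples: usize,
    accumulated_quality: f32,
}

impl LoopScheduler {
    pub fn new(
        background_interval: Duration,
        background_batch: usize,
        deep_quality_threshold: f32,
    ) -> Self {
        Self {
            background_interval,
            background_batch,
            deep_quality_threshold,
            since_background: Duration::ZERO,
            pending_samples: 0,
            accumulated_quality: 0.0,
        }
    }

    /// Call once per request with the time since the previous request and a
    /// quality score for the feedback. Returns the loops that should run.
    pub fn on_request(&mut self, elapsed: Duration, quality: f32) -> Vec<Loop> {
        // The instant loop runs on every request.
        let mut due = vec![Loop::Instant];
        self.since_background += elapsed;
        self.pending_samples += 1;
        self.accumulated_quality += quality;

        // The background loop fires on an interval or a sample threshold.
        let interval_hit = self.since_background >= self.background_interval;
        let batch_hit = self.pending_samples >= self.background_batch;
        if interval_hit || batch_hit {
            due.push(Loop::Background);
            self.since_background = Duration::ZERO;
            self.pending_samples = 0;

            // The deep loop is only considered after a background pass.
            if self.accumulated_quality >= self.deep_quality_threshold {
                due.push(Loop::Deep);
                self.accumulated_quality = 0.0;
            }
        }
        due
    }
}
```

In a real system the background and deep loops would run asynchronously; the sketch only shows the trigger decision, not the execution.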

### 3. Optimized Kernel Layer

The kernel layer provides NEON SIMD-optimized kernels for ARM64, grouped into attention, normalization, and embedding kernels:

```
+-----------------------------------------------+
|              Attention Kernels                 |
+-----------------------------------------------+
|                                               |
|  +------------------+  +------------------+   |
|  | Flash Attention  |  | Paged Attention  |   |
|  |  - Tiled QKV     |  |  - Block tables  |   |
|  |  - Online softmax|  |  - Non-contiguous|   |
|  |  - O(N) memory   |  |  - KV cache aware|   |
|  +------------------+  +------------------+   |
|                                               |
|  +------------------+  +------------------+   |
|  | Multi-Query (MQA)|  | Grouped-Query    |   |
|  |  - 1 KV head     |  |  - KV groups     |   |
|  |  - Shared KV     |  |  - 4-8x savings  |   |
|  +------------------+  +------------------+   |
+-----------------------------------------------+

+-----------------------------------------------+
|            Normalization Kernels               |
+-----------------------------------------------+
|  +------------------+  +------------------+   |
|  |    RMSNorm       |  |    LayerNorm     |   |
|  |  - NEON SIMD     |  |  - NEON SIMD     |   |
|  |  - Fused ops     |  |  - Fused ops     |   |
|  +------------------+  +------------------+   |
+-----------------------------------------------+

+-----------------------------------------------+
|             Embedding Kernels                  |
+-----------------------------------------------+
|  +------------------+  +------------------+   |
|  | Rotary Position  |  | Token Embedding  |   |
|  |  (RoPE)          |  |  - Lookup table  |   |
|  |  - Precomputed   |  |  - Batch gather  |   |
|  +------------------+  +------------------+   |
+-----------------------------------------------+
```
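
The "tiled" and "online softmax" entries in the Flash Attention box refer to a technique that processes keys and values block by block while keeping only a running maximum, a running denominator, and an unnormalized output. The following scalar sketch shows the idea for a single query vector and checks it against a naive reference. It is a plain-Rust illustration with hypothetical function names, not the NEON or Metal kernel.

```rust
fn dot(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

/// Reference: materializes all scores, then applies softmax.
pub fn attention_naive(q: &[f32], keys: &[Vec<f32>], values: &[Vec<f32>]) -> Vec<f32> {
    let scale = 1.0 / (q.len() as f32).sqrt();
    let scores: Vec<f32> = keys.iter().map(|k| dot(q, k) * scale).collect();
    let max = scores.iter().copied().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = scores.iter().map(|s| (*s - max).exp()).collect();
    let sum: f32 = exps.iter().sum();
    let mut out = vec![0.0f32; values[0].len()];
    for (p, v) in exps.iter().zip(values) {
        for (o, x) in out.iter_mut().zip(v) {
            *o += *p / sum * *x;
        }
    }
    out
}

/// Online softmax: streams over K/V in tiles of `block` rows and never
/// stores more than one tile of scores.
pub fn attention_online(
    q: &[f32],
    keys: &[Vec<f32>],
    values: &[Vec<f32>],
    block: usize,
) -> Vec<f32> {
    assert_eq!(keys.len(), values.len());
    assert!(block > 0 && !keys.is_empty());
    let scale = 1.0 / (q.len() as f32).sqrt();
    let mut m = f32::NEG_INFINITY; // running maximum of the scores
    let mut l = 0.0f32; // running softmax denominator
    let mut acc = vec![0.0f32; values[0].len()]; // unnormalized output

    for (k_tile, v_tile) in keys.chunks(block).zip(values.chunks(block)) {
        let scores: Vec<f32> = k_tile.iter().map(|k| dot(q, k) * scale).collect();
        let tile_max = scores.iter().copied().fold(f32::NEG_INFINITY, f32::max);
        let m_new = m.max(tile_max);
        // Rescale what was accumulated under the old maximum.
        // On the first tile this is exp(-inf) = 0, which is harmless.
        let correction = (m - m_new).exp();
        l *= correction;
        for a in acc.iter_mut() {
            *a *= correction;
        }
        for (s, v) in scores.iter().zip(v_tile) {
            let p = (*s - m_new).exp();
            l += p;
            for (a, x) in acc.iter_mut().zip(v) {
                *a += p * *x;
            }
        }
        m = m_new;
    }
    acc.iter().map(|a| a / l).collect()
}
```

Both functions produce the same result up to floating-point rounding; the online version only needs working memory proportional to the tile size rather than to the full score row.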

### 4. Memory Management

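The KV cache uses two tiers to balance memory use against quality. Older tokens are held in a quantized cold store, while recent tokens stay in a high-precision hot tail: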

```
+----------------------------------------------------+
|                  Two-Tier KV Cache                  |
+----------------------------------------------------+
|                                                    |
|   Position: 0            tail_length        max    |
|   +------------------+------------------+          |
|   |                  |                  |          |
|   |  Quantized Store |  High-Precision  |          |
|   |     (Cold)       |    Tail (Hot)    |          |
|   |                  |                  |          |
|   |  - Q4/Q8 format  |  - FP16 format   |          |
|   |  - Older tokens  |  - Recent tokens |          |
|   |  - 4x smaller    |  - Full quality  |          |
|   |                  |                  |          |
|   +------------------+------------------+          |
|                                                    |
|   Migration: Hot -> Cold (tail_length exceeded)    |
|   Eviction:  Cold first, then Hot                  |
+----------------------------------------------------+
```

**Cache Operations:**

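1. **Append**: Add new KV pairs to the hot tail.
2. **Migrate**: Move the oldest tail tokens into the quantized store once the tail exceeds `tail_length`.
3. **Evict**: Remove the oldest tokens once the cache exceeds its maximum size, cold tier first, then hot.
4. **Attend**: Dequantize the cold store and combine it with the hot tail to compute attention.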
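
The four operations can be sketched with two queues. This is a simplified model under stated assumptions: it stores one vector per token instead of separate K and V tensors, uses `f32` as a stand-in for the FP16 tail, and uses a per-row symmetric 8-bit format as a stand-in for the Q4/Q8 store. `TwoTierKvCache`, `QuantizedRow`, and all method names are hypothetical.

```rust
use std::collections::VecDeque;

/// One quantized row: 8-bit values plus a per-row scale (stand-in for Q4/Q8).
pub struct QuantizedRow {
    scale: f32,
    data: Vec<i8>,
}

fn quantize(row: &[f32]) -> QuantizedRow {
    let max_abs = row.iter().fold(0.0f32, |m, x| m.max(x.abs()));
    let scale = if max_abs == 0.0 { 1.0 } else { max_abs / 127.0 };
    let data = row.iter().map(|x| (x / scale).round() as i8).collect();
    QuantizedRow { scale, data }
}

fn dequantize(row: &QuantizedRow) -> Vec<f32> {
    row.data.iter().map(|&q| q as f32 * row.scale).collect()
}

pub struct TwoTierKvCache {
    tail_length: usize,
    max_tokens: usize,
    cold: VecDeque<QuantizedRow>, // older tokens, quantized
    hot: VecDeque<Vec<f32>>,      // recent tokens, full precision
}

impl TwoTierKvCache {
    pub fn new(tail_length: usize, max_tokens: usize) -> Self {
        Self { tail_length, max_tokens, cold: VecDeque::new(), hot: VecDeque::new() }
    }

    /// Append, then migrate and evict as needed.
    pub fn append(&mut self, row: Vec<f32>) {
        self.hot.push_back(row);
        // Migrate: the oldest hot rows move to the cold store.
        while self.hot.len() > self.tail_length {
            if let Some(old) = self.hot.pop_front() {
                self.cold.push_back(quantize(&old));
            }
        }
        // Evict: cold first, then hot.
        while self.cold.len() + self.hot.len() > self.max_tokens {
            if self.cold.pop_front().is_none() {
                self.hot.pop_front();
            }
        }
    }

    /// (cold rows, hot rows)
    pub fn tier_lens(&self) -> (usize, usize) {
        (self.cold.len(), self.hot.len())
    }

    /// Attend: returns all rows oldest-first, dequantizing the cold tier.
    pub fn rows(&self) -> Vec<Vec<f32>> {
        self.cold
            .iter()
            .map(|r| dequantize(r))
            .chain(self.hot.iter().cloned())
            .collect()
    }
}
```

With `tail_length = 2` and `max_tokens = 4`, appending six rows leaves the two newest rows exact in the hot tier, the two before them approximated in the cold tier, and the two oldest evicted.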

## Data Flow

### Inference Pipeline

```
Input Tokens
     |
     v
+--------------------+
|   Token Embedding  |
|   + RoPE Position  |
+--------------------+
     |
     v (for each layer)
+--------------------+
|   Attention Layer  |
|   +---------------+|
|   | Q,K,V Project ||
|   +---------------+|
|          |         |
|   +---------------+|
|   | KV Cache      ||
|   | Update        ||
|   +---------------+|
|          |         |
|   +---------------+|
|   | Flash/Paged   ||
|   | Attention     ||
|   +---------------+|
|          |         |
|   +---------------+|
|   | Output Proj   ||
|   +---------------+|
+--------------------+
     |
     v
+--------------------+
|   FFN Layer        |
|   - Gate Proj      |
|   - Up Proj        |
|   - Down Proj      |
|   - Activation     |
+--------------------+
     |
     v
+--------------------+
|   RMSNorm          |
+--------------------+
     |
     v
+--------------------+
|   LM Head          |
|   (final layer)    |
+--------------------+
     |
     v
Logits -> Sampling -> Token
```
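
The last three stages of the pipeline (RMSNorm, LM head, sampling) are small enough to show in scalar form. The sketch below assumes greedy sampling and a dense LM head stored as one weight row per vocabulary entry; the function names are hypothetical and the real kernels are vectorized.

```rust
/// RMSNorm: x_i / sqrt(mean(x^2) + eps) * weight_i
pub fn rms_norm(x: &[f32], weight: &[f32], eps: f32) -> Vec<f32> {
    let mean_sq = x.iter().map(|v| v * v).sum::<f32>() / x.len() as f32;
    let inv_rms = 1.0 / (mean_sq + eps).sqrt();
    x.iter().zip(weight).map(|(v, w)| v * inv_rms * w).collect()
}

/// LM head: one dot product per vocabulary entry gives the logits.
pub fn lm_head(hidden: &[f32], head: &[Vec<f32>]) -> Vec<f32> {
    head.iter()
        .map(|row| row.iter().zip(hidden).map(|(w, h)| w * h).sum())
        .collect()
}

/// Greedy sampling: index of the largest logit, or None for empty input.
pub fn greedy(logits: &[f32]) -> Option<usize> {
    logits
        .iter()
        .enumerate()
        .max_by(|a, b| a.1.total_cmp(b.1))
        .map(|(i, _)| i)
}

/// Final hidden state -> RMSNorm -> LM head -> greedy token id.
pub fn next_token(
    hidden: &[f32],
    norm_weight: &[f32],
    head: &[Vec<f32>],
    eps: f32,
) -> Option<usize> {
    let normed = rms_norm(hidden, norm_weight, eps);
    greedy(&lm_head(&normed, head))
}
```

Temperature, top-k, or top-p sampling would replace `greedy()` without changing the two stages before it.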

### Learning Pipeline

```
Request + Response + Feedback
              |
              v
+---------------------------+
|      Instant Loop         |
|  - Compute embeddings     |
|  - Apply MicroLoRA        |
|  - Queue for background   |
+---------------------------+
              |
              v (async)
+---------------------------+
|    Background Loop        |
|  - Batch samples          |
|  - Update EWC++ Fisher    |
|  - Merge adapters         |
|  - Store in ReasoningBank |
+---------------------------+
              |
              v (threshold triggered)
+---------------------------+
|       Deep Loop           |
|  - Full training pipeline |
|  - Pattern distillation   |
|  - Catastrophic forget    |
|    prevention (EWC++)     |
+---------------------------+
```
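
The "Update EWC++ Fisher" step can be illustrated with a generic EWC-style regularizer: a diagonal Fisher estimate kept as an exponential moving average of squared gradients, and a quadratic penalty that pulls parameters back toward an anchor. Whether RuvLLM's EWC++ uses exactly this form is an assumption; `EwcState`, `decay`, and `lambda` are hypothetical names.

```rust
/// Generic EWC-style state (illustrative, not the crate's type).
pub struct EwcState {
    fisher: Vec<f32>, // diagonal Fisher estimate, one value per parameter
    anchor: Vec<f32>, // parameters to stay close to
    decay: f32,       // moving-average factor in [0, 1)
    lambda: f32,      // penalty strength
}

impl EwcState {
    pub fn new(anchor: Vec<f32>, decay: f32, lambda: f32) -> Self {
        Self { fisher: vec![0.0; anchor.len()], anchor, decay, lambda }
    }

    /// F <- decay * F + (1 - decay) * g^2, element-wise.
    pub fn update_fisher(&mut self, grad: &[f32]) {
        for (f, g) in self.fisher.iter_mut().zip(grad) {
            *f = self.decay * *f + (1.0 - self.decay) * g * g;
        }
    }

    /// (lambda / 2) * sum_i F_i * (theta_i - anchor_i)^2
    pub fn penalty(&self, params: &[f32]) -> f32 {
        let sum: f32 = self
            .fisher
            .iter()
            .zip(params.iter().zip(&self.anchor))
            .map(|(f, (p, a))| f * (p - a) * (p - a))
            .sum();
        0.5 * self.lambda * sum
    }

    /// Gradient of the penalty: lambda * F_i * (theta_i - anchor_i)
    pub fn penalty_grad(&self, params: &[f32]) -> Vec<f32> {
        self.fisher
            .iter()
            .zip(params.iter().zip(&self.anchor))
            .map(|(f, (p, a))| self.lambda * f * (p - a))
            .collect()
    }
}
```

Parameters with a large Fisher value are held close to the anchor, while parameters the earlier data did not depend on remain free to change. This is how the penalty counters catastrophic forgetting.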

## Module Structure

```
ruvllm/
├── src/
│   ├── lib.rs              # Crate root, re-exports
│   ├── error.rs            # Error types
│   ├── types.rs            # Common types (Precision, etc.)
│   │
│   ├── backends/           # ML framework backends
│   │   ├── mod.rs          # Backend trait
│   │   ├── candle_backend.rs
│   │   └── config.rs
│   │
│   ├── kernels/            # Optimized kernels
│   │   ├── mod.rs          # Kernel exports
│   │   ├── attention.rs    # Attention variants
│   │   ├── matmul.rs       # Matrix multiplication
│   │   ├── norm.rs         # Normalization ops
│   │   └── rope.rs         # Rotary embeddings
│   │
│   ├── lora/               # LoRA adapters
│   │   ├── mod.rs          # LoRA exports
│   │   ├── micro_lora.rs   # Real-time MicroLoRA
│   │   └── training.rs     # Training pipeline
│   │
│   ├── optimization/       # SONA integration
│   │   ├── mod.rs
│   │   └── sona_llm.rs     # Learning loops
│   │
│   ├── kv_cache.rs         # Two-tier KV cache
│   ├── sona.rs             # SONA core integration
│   ├── policy_store.rs     # Learned policies
│   └── witness_log.rs      # Inference logging
│
└── benches/                # Benchmarks
    ├── attention_bench.rs
    ├── lora_bench.rs
    └── e2e_bench.rs
```

## Performance Characteristics

### Memory Layout

| Component | Memory Pattern | Optimization |
|-----------|---------------|--------------|
| KV Cache Tail | Sequential | NEON vectorized |
| KV Cache Store | Quantized blocks | Batch dequant |
| Model Weights | Memory-mapped | Zero-copy |
| Intermediate | Stack allocated | Arena alloc |
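
As an illustration of the arena pattern named in the table, the sketch below preallocates one buffer, hands out ranges by bumping an offset, and reclaims everything with a single `reset()`. `Arena` and its methods are hypothetical; the crate's memory pool may be organized differently.

```rust
use std::ops::Range;

/// Minimal bump arena over a preallocated f32 buffer.
pub struct Arena {
    buf: Vec<f32>,
    used: usize,
}

impl Arena {
    /// The only heap allocation happens here, up front.
    pub fn with_capacity(len: usize) -> Self {
        Self { buf: vec![0.0; len], used: 0 }
    }

    /// Reserves `len` elements and returns their range, or None when full.
    pub fn alloc(&mut self, len: usize) -> Option<Range<usize>> {
        let end = self.used.checked_add(len)?;
        if end > self.buf.len() {
            return None;
        }
        let range = self.used..end;
        self.used = end;
        Some(range)
    }

    pub fn slice(&self, range: Range<usize>) -> &[f32] {
        &self.buf[range]
    }

    pub fn slice_mut(&mut self, range: Range<usize>) -> &mut [f32] {
        &mut self.buf[range]
    }

    pub fn used(&self) -> usize {
        self.used
    }

    /// Frees every reservation at once, e.g. after each decoded token.
    pub fn reset(&mut self) {
        self.used = 0;
    }
}
```

Returning index ranges instead of `&mut` slices keeps the borrow checker out of the way when several buffers are live at once, at the cost of one bounds check per access.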

### Throughput Targets (M4 Pro)

| Operation | Target | Achieved |
|-----------|--------|----------|
| Flash Attention | 2.5x vs naive | ~2.3x |
| Paged Attention | 1.8x vs contiguous | ~1.7x |
| GQA vs MHA | 4x less KV memory | 4x |
| MicroLoRA adapt | <1ms | ~0.5ms |
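
The 4x KV memory figure for GQA follows directly from the cache size formula `2 * layers * kv_heads * head_dim * seq_len * bytes_per_element`, where the factor 2 covers keys and values. The worked example below uses a hypothetical model shape (32 layers, 32 query heads, head dimension 128, 4096 tokens, FP16); these numbers are not taken from any RuvLLM configuration.

```rust
/// KV cache size in bytes; the leading 2 accounts for K and V.
pub fn kv_cache_bytes(
    layers: u64,
    kv_heads: u64,
    head_dim: u64,
    seq_len: u64,
    bytes_per_elem: u64,
) -> u64 {
    2 * layers * kv_heads * head_dim * seq_len * bytes_per_elem
}

/// Sizes for a hypothetical 32-layer model at 4096 tokens in FP16:
/// returns (MHA, GQA with 8 KV heads, MQA with 1 KV head).
pub fn example_sizes() -> (u64, u64, u64) {
    let (layers, head_dim, seq_len, fp16) = (32, 128, 4096, 2);
    let mha = kv_cache_bytes(layers, 32, head_dim, seq_len, fp16); // one KV head per query head
    let gqa = kv_cache_bytes(layers, 8, head_dim, seq_len, fp16); // 4 query heads share a KV head
    let mqa = kv_cache_bytes(layers, 1, head_dim, seq_len, fp16); // all query heads share one
    (mha, gqa, mqa)
}
```

For this shape the cache shrinks from 2 GiB (MHA) to 512 MiB (GQA) and 64 MiB (MQA). The saving equals the ratio of query heads to KV heads, so the 4x figure corresponds to four query heads per KV group.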

## Integration Points

### With RuVector Core

```rust
// Memory backend integration
use ruvector_core::storage::Storage;

// SONA learning integration
use ruvector_sona::{SonaEngine, ReasoningBank};
```

### With External Systems

- **HuggingFace Hub**: Model downloads
- **OpenAI API**: Compatible inference endpoint
- **Prometheus**: Metrics export
- **gRPC**: High-performance RPC

## Future Architecture

Planned enhancements:

1. **Speculative Decoding**: Draft model integration
2. **Tensor Parallelism**: Multi-GPU support
3. **Continuous Batching**: Dynamic batch scheduling
4. **PagedAttention v2**: vLLM-style memory management