# RuvLLM: Candle + mistral-rs + SONA Integration Architecture **Document Version**: 1.0 **Status**: Proposed **Date**: 2026-01-18 **Target Hardware**: Apple M4 Pro (ARM64/NEON) --- ## 1. Executive Summary This document defines the architecture for integrating Candle tensor operations, mistral-rs model inference, and RuvLLM's SONA learning framework into a unified, high-performance LLM serving runtime optimized for Apple Silicon. ### Key Design Goals | Goal | Target | Rationale | |------|--------|-----------| | Inference Latency | <50ms TTFT | Real-time interactive use | | Memory Efficiency | 4GB for 7B model | M4 Pro unified memory constraint | | Learning Overhead | <1ms per request | SONA instant loop requirement | | Throughput | 100+ tokens/sec | Competitive with cloud inference | --- ## 2. Component Diagram ``` +===========================================================================+ | RuvLLM Engine (Orchestration Layer) | +===========================================================================+ | | | +-------------------+ +-------------------+ +------------------+ | | | Request Router |---->| Model Selector |---->| Batch Scheduler | | | | (SONA-guided) | | (FastGRNN) | | (Continuous) | | | +-------------------+ +-------------------+ +------------------+ | | | | | | | v v v | | +------------------------------------------------------------------------+ | | Backend Abstraction Layer | | +------------------------------------------------------------------------+ | | | | | | v v v | | +-------------------+ +-------------------+ +------------------+ | | | Candle Backend | | mistral-rs Backend| | Hybrid Backend | | | | (Tensor Ops) | | (Full Inference) | | (Mix & Match) | | | +-------------------+ +-------------------+ +------------------+ | | | | | | | +-------------+-----------+------------------------+ | | | | | v | | +------------------------------------------------------------------------+ | | NEON-Optimized Kernel Layer | | | 
(ruvector-core/simd_intrinsics) | | +------------------------------------------------------------------------+ | | Attention | RoPE/ALiBi | RMSNorm | Quantization | GEMM | | +------------------------------------------------------------------------+ | | | | v | | +------------------------------------------------------------------------+ | | Memory Management Layer | | +------------------------------------------------------------------------+ | | +----------------+ +------------------+ +----------------------------+ | | | | Arena Allocator| | Unified Mem Pool | | 3-Tier KV Cache | | | | | (Batch Ops) | | (ADR-006) | | Hot(FP16)/Warm(Q8)/Cold(Q4)| | | | +----------------+ +------------------+ +----------------------------+ | | +------------------------------------------------------------------------+ | | | | v | | +------------------------------------------------------------------------+ | | SONA Learning Integration | | +------------------------------------------------------------------------+ | | +----------------+ +------------------+ +----------------------------+ | | | | MicroLoRA | | ReasoningBank | | EWC++ Fisher | | | | | (Rank 1-2) | | (Pattern Store) | | (Forgetting Prevention) | | | | +----------------+ +------------------+ +----------------------------+ | | +------------------------------------------------------------------------+ | | +============================================================================+ ``` --- ## 3. 
Integration Architecture

### 3.1 Backend Selection Strategy

```
+-----------------------------------------------------------------------+
|                   BACKEND SELECTION DECISION TREE                     |
+-----------------------------------------------------------------------+

                        +-------------------+
                        | Inference Request |
                        +---------+---------+
                                  |
                        +---------v---------+
                        | Check Model Type  |
                        +---------+---------+
                                  |
        +---------------------+---------------------+
        |                     |                     |
+-------v--------+    +-------v-------+     +-------v-------+
| Standard LLM   |    | Custom/LoRA   |     | Embedding     |
| (Mistral/Llama)|    | (Fine-tuned)  |     | Only          |
+-------+--------+    +-------+-------+     +-------+-------+
        |                     |                     |
+-------v-------+     +-------v-------+     +-------v-------+
| mistral-rs    |     | Candle Backend|     | Candle Backend|
| Backend       |     | + MicroLoRA   |     | (Optimized)   |
| (Full Model)  |     |   Injection   |     |               |
+---------------+     +---------------+     +---------------+

Backend Selection Criteria:
- mistral-rs: Best for standard models (optimized loading, PagedAttention)
- Candle:     Best for custom operations, LoRA injection, embeddings
- Hybrid:     Route different layers to different backends
```

### 3.2 Candle Integration Layer

```rust
// crates/ruvllm/src/backends/candle.rs

/// Candle backend configuration
pub struct CandleBackendConfig {
    /// Device type (Metal for M4 Pro)
    pub device: DeviceType,
    /// Default dtype for operations
    pub default_dtype: DType,
    /// Enable Metal Performance Shaders
    pub use_mps: bool,
    /// Memory pool configuration
    pub memory_config: MemoryConfig,
}

/// Candle backend for tensor operations
pub struct CandleBackend {
    config: CandleBackendConfig,
    device: Device,
    /// NEON kernel registry
    neon_kernels: NeonKernelRegistry,
    /// Memory pool
    memory_pool: Arc<MemoryPool>,
}

impl CandleBackend {
    /// Create tensors with NEON-optimized operations
    pub fn create_tensor(&self, data: &[f32], shape: &[usize]) -> Result<Tensor> {
        // Use CacheAlignedVec for NEON compatibility
        let aligned = CacheAlignedVec::from_slice(data);
        Tensor::from_slice(aligned.as_slice(), shape, &self.device)
    }
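    /// Hypothetical sketch (NOT defined in the original document): the
    /// `attention` method below dispatches on `should_use_neon`, which is
    /// never shown. One plausible heuristic: require a NEON-friendly head
    /// dimension (a multiple of the 4-lane f32 vector width) and enough
    /// total work to amortize dispatch overhead. The 1024-element
    /// threshold is an illustrative assumption, not a measured value.
    fn should_use_neon(&self, dims: &[usize]) -> bool {
        match dims {
            // dims = [seq_len, num_heads, head_dim]
            [seq_len, _num_heads, head_dim] => {
                head_dim % 4 == 0 && seq_len * head_dim >= 1024
            }
            _ => false,
        }
    }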
    /// Execute NEON-optimized attention
    pub fn attention(&self, q: &Tensor, k: &Tensor, v: &Tensor, scale: f32) -> Result<Tensor> {
        // Route to NEON kernel if dimensions match optimization thresholds
        if self.should_use_neon(q.dims()) {
            self.neon_kernels.attention(q, k, v, scale)
        } else {
            // Fallback to Candle default
            candle_nn::attention(q, k, v, scale)
        }
    }
}
```

### 3.3 mistral-rs Integration Layer

```rust
// crates/ruvllm/src/backends/mistral.rs

/// mistral-rs backend configuration
pub struct MistralBackendConfig {
    /// Model path or HuggingFace ID
    pub model_id: String,
    /// Quantization format
    pub quantization: QuantizationFormat,
    /// Use PagedAttention
    pub paged_attention: bool,
    /// KV cache configuration
    pub kv_cache: KvCacheConfig,
    /// Device mapping (for multi-device)
    pub device_map: DeviceMap,
}

/// mistral-rs backend for model inference
pub struct MistralBackend {
    config: MistralBackendConfig,
    /// mistral-rs model pipeline
    pipeline: Arc<dyn Pipeline>,
    /// KV cache manager
    kv_cache: Arc<TwoTierKvCache>,
    /// Paged attention manager
    paged_attention: Arc<PagedAttention>,
}

impl MistralBackend {
    /// Load model with SONA-aware caching
    pub async fn load(config: MistralBackendConfig) -> Result<Self> {
        // Create model loader with custom device configuration
        let loader = MistralLoader::new(&config.model_id)
            .with_dtype(config.quantization.dtype())
            .with_device_map(&config.device_map);

        // Load model
        let pipeline = loader.load().await?;

        // Initialize KV cache with existing RuvLLM implementation
        let kv_cache = TwoTierKvCache::new(config.kv_cache.clone());
        let paged_attention = PagedAttention::new(config.paged_attention_config());

        Ok(Self {
            config,
            pipeline: Arc::new(pipeline),
            kv_cache: Arc::new(kv_cache),
            paged_attention: Arc::new(paged_attention),
        })
    }

    /// Forward pass with KV cache integration
    pub fn forward(
        &self,
        tokens: &[u32],
        sequence_id: &str,
        generation_config: &GenerationConfig,
    ) -> Result<ForwardOutput> {
        // Allocate paged attention for this sequence
        self.paged_attention.allocate_sequence(sequence_id, tokens.len())?;

        // Run inference
through mistral-rs pipeline let output = self.pipeline.forward(tokens, generation_config)?; // Update KV cache self.kv_cache.append( &output.key_cache, &output.value_cache, )?; Ok(output) } } ``` --- ## 4. Data Flow for Inference ``` +===========================================================================+ | INFERENCE DATA FLOW | +===========================================================================+ User Request Response | ^ v | +-----+-----+ +-----+-----+ | Tokenize | | Decode | | (HF) | | (HF) | +-----+-----+ +-----+-----+ | ^ v | +-----+-----+ +----------------+ +----------------+ +-----+-----+ | Embedding |---->| SONA Pattern |---->| Route Decision |---->| Log | | Lookup | | Lookup | | (Model+Quant) | | Witness | +-----------+ +----------------+ +----------------+ +-----------+ | | | | +-------------+ | | | | v v v +-----+----+-----+ +-----+-----+ | Context Prep | | Select | | - Retrieve KV | | Backend | | - Load LoRA | | (Candle/ | | - Apply Policy | | Mistral) | +-----+----------+ +-----+-----+ | | +------------------+----------------------+ | v +----------+----------+ | NEON Kernels | | (Attention, | | RoPE, Norm) | +----------+----------+ | v +----------+----------+ | Transformer Layers | | (Loop N times) | +----------+----------+ | v +----------+----------+ | Output Projection | | + Sampling | +----------+----------+ | v +----------+----------+ | MicroLoRA Update | | (Instant Loop) | +----------+----------+ | v +----------+----------+ | Update KV Cache | | (Tiered Storage) | +----------+----------+ | v [Output] ``` ### 4.1 Detailed Token Processing Flow ``` Token IDs: [1, 234, 567, ...] 
| v +-------------------+ | Embedding Layer | | (NEON dot_product)| +-------------------+ | v +-------------------+ | RoPE Position | | Encoding (NEON) | +-------------------+ | v For each layer (0..N): +-------------------+ | RMSNorm (NEON) | +-------------------+ | v +-------------------+ | Self-Attention | | - Q/K/V Project | | - Paged Attention | | - Output Project | +-------------------+ | v +-------------------+ | Feed Forward | | - Gate Project | | - Up Project | | - Down Project | +-------------------+ | v +-------------------+ | MicroLoRA Inject | | (If active) | +-------------------+ | +-- Next Layer --+ | v +-------------------+ | Final RMSNorm | +-------------------+ | v +-------------------+ | LM Head Project | +-------------------+ | v [Logits] ``` --- ## 5. Memory Layout ### 5.1 Unified Memory Architecture (M4 Pro) ``` +===========================================================================+ | UNIFIED MEMORY LAYOUT (16GB M4 Pro) | +===========================================================================+ Address Space: 0x0000_0000_0000 +--------------------------------------------------+ | System Reserved (2GB) | 0x0000_8000_0000 +--------------------------------------------------+ | Model Weights (4-8GB depending on quantization) | | +--------------------------------------------+ | | | Embedding Matrix (128MB - 512MB) | | | +--------------------------------------------+ | | | Transformer Layers (N x ~200MB) | | | | - Attention Weights (Q, K, V, O) | | | | - FFN Weights (Gate, Up, Down) | | | +--------------------------------------------+ | | | LM Head (128MB - 512MB) | | | +--------------------------------------------+ | 0x0002_0000_0000 +--------------------------------------------------+ | KV Cache Pool (2-4GB) | | +--------------------------------------------+ | | | Hot Tier (FP16) - 512MB | | | | - Last 256 tokens per sequence | | | +--------------------------------------------+ | | | Warm Tier (Q8) - 1GB | | | | - Tokens 257-2048 | | | 
+--------------------------------------------+ | | | Cold Tier (Q4/KIVI) - 1-2GB | | | | - Tokens 2049+ | | | +--------------------------------------------+ | 0x0003_0000_0000 +--------------------------------------------------+ | LoRA Adapter Pool (256MB - 1GB) | | +--------------------------------------------+ | | | Active Adapters (FP16, ~10MB each) | | | | MicroLoRA Weights (Rank 1-2, ~1MB) | | | | BaseLoRA Weights (Rank 4-8, ~4MB) | | | +--------------------------------------------+ | 0x0003_4000_0000 +--------------------------------------------------+ | Activation Scratch Space (512MB) | | +--------------------------------------------+ | | | Per-request activations | | | | Intermediate computations | | | +--------------------------------------------+ | 0x0003_6000_0000 +--------------------------------------------------+ | Arena Allocator Pool (256MB) | | +--------------------------------------------+ | | | Batch Vector Allocator | | | | Temporary SIMD buffers | | | +--------------------------------------------+ | 0x0003_7000_0000 +--------------------------------------------------+ | SONA Learning State (128MB) | | +--------------------------------------------+ | | | ReasoningBank Patterns | | | | EWC++ Fisher Diagonal | | | | Trajectory Buffer | | | +--------------------------------------------+ | 0x0003_7800_0000 +--------------------------------------------------+ | Free / Expansion (Remaining) | 0x0004_0000_0000 +--------------------------------------------------+ ``` ### 5.2 KV Cache Memory Layout (Detailed) ``` +===========================================================================+ | 3-TIER KV CACHE MEMORY LAYOUT | +===========================================================================+ Per-Sequence Layout (4096 context length, 32 KV heads, 128 head dim): +------------------------+------------------------+------------------------+ | HOT TIER | WARM TIER | COLD TIER | | (FP16) | (Q8) | (Q4/KIVI) | 
+------------------------+------------------------+------------------------+ | Tokens: 3841-4096 | Tokens: 2049-3840 | Tokens: 0-2048 | | Length: 256 tokens | Length: 1792 tokens | Length: 2048 tokens | +------------------------+------------------------+------------------------+ | Size per KV head: | Size per KV head: | Size per KV head: | | 256 * 128 * 2 bytes | 1792 * 128 * 1 byte | 2048 * 128 * 0.5 byte | | = 64KB | = 224KB | = 128KB | +------------------------+------------------------+------------------------+ | Total (32 heads): | Total (32 heads): | Total (32 heads): | | 64KB * 32 * 2 (K+V) | 224KB * 32 * 2 (K+V) | 128KB * 32 * 2 (K+V) | | = 4MB | = 14MB | = 8MB | +------------------------+------------------------+------------------------+ Total per sequence: 4MB + 14MB + 8MB = 26MB With 100 concurrent sequences: 2.6GB Page Table Structure: +--------+--------+--------+--------+--------+--------+ | Seq ID | Tier | Page 0 | Page 1 | Page 2 | ... | +--------+--------+--------+--------+--------+--------+ | seq-1 | HOT | 0x100 | 0x101 | 0x102 | 0x103 | | seq-1 | WARM | 0x200 | 0x201 | ... | ... | | seq-1 | COLD | 0x300 | 0x301 | ... | ... | | seq-2 | HOT | 0x104 | 0x105 | ... | ... | +--------+--------+--------+--------+--------+--------+ ``` --- ## 6. 
NEON Optimization Points

### 6.1 Kernel Registry

```rust
// crates/ruvllm/src/kernels/mod.rs

/// NEON-optimized kernel registry
pub struct NeonKernelRegistry {
    /// Attention kernels
    pub attention: AttentionKernels,
    /// RoPE kernels
    pub rope: RoPEKernels,
    /// Normalization kernels
    pub norm: NormKernels,
    /// Quantization kernels
    pub quant: QuantKernels,
    /// GEMM kernels
    pub gemm: GemmKernels,
}

impl NeonKernelRegistry {
    pub fn new() -> Self {
        Self {
            attention: AttentionKernels::new(),
            rope: RoPEKernels::new(),
            norm: NormKernels::new(),
            quant: QuantKernels::new(),
            gemm: GemmKernels::new(),
        }
    }
}
```

### 6.2 Attention Kernels (NEON)

```rust
// crates/ruvllm/src/kernels/attention.rs

use std::arch::aarch64::*;

/// Flash Attention variant optimized for M4 Pro NEON
pub struct FlashAttentionNeon {
    /// Block size for tiled computation
    block_size: usize,
    /// Softmax scale factor
    scale: f32,
}

impl FlashAttentionNeon {
    /// Compute attention with 4x unrolling (matching simd_intrinsics.rs pattern)
    #[inline(always)]
    pub unsafe fn forward(
        &self,
        query: &[f32],   // [seq_len, num_heads, head_dim]
        key: &[f32],     // [seq_len, num_kv_heads, head_dim]
        value: &[f32],   // [seq_len, num_kv_heads, head_dim]
        output: &mut [f32],
        seq_len: usize,
        num_heads: usize,
        num_kv_heads: usize,
        head_dim: usize,
    ) {
        let gqa_ratio = num_heads / num_kv_heads;
        let scale = self.scale;

        // For each query head
        for h in 0..num_heads {
            let kv_head = h / gqa_ratio;

            // Tiled attention computation
            for q_block_start in (0..seq_len).step_by(self.block_size) {
                let q_block_end = (q_block_start + self.block_size).min(seq_len);

                for k_block_start in (0..seq_len).step_by(self.block_size) {
                    let k_block_end = (k_block_start + self.block_size).min(seq_len);

                    // Compute QK^T for this tile
                    self.compute_attention_tile(
                        query, key, value, output,
                        q_block_start, q_block_end,
                        k_block_start, k_block_end,
                        h, kv_head, head_dim, scale,
                    );
                }
            }
        }
    }

    #[inline(always)]
    unsafe fn compute_attention_tile(
        &self,
        query: &[f32],
        key: &[f32],
        value: &[f32],
        output: &mut [f32],
        q_start: usize,
        q_end: usize,
        k_start: usize,
        k_end: usize,
        head: usize,
        kv_head: usize,
        head_dim: usize,
        scale: f32,
    ) {
        // Use 4 accumulators for better ILP (matching simd_intrinsics.rs)
        let mut sum0 = vdupq_n_f32(0.0);
        let mut sum1 = vdupq_n_f32(0.0);
        let mut sum2 = vdupq_n_f32(0.0);
        let mut sum3 = vdupq_n_f32(0.0);
        let scale_vec = vdupq_n_f32(scale);

        // Process head_dim in chunks of 16 (4x4 unrolling)
        let chunks = head_dim / 16;

        for q_pos in q_start..q_end {
            let q_offset = (q_pos * head_dim) + (head * head_dim);
            let q_ptr = query.as_ptr().add(q_offset);

            let mut max_score = f32::NEG_INFINITY;
            let mut scores = Vec::with_capacity(k_end - k_start);

            // Compute attention scores
            for k_pos in k_start..k_end {
                let k_offset = (k_pos * head_dim) + (kv_head * head_dim);
                let k_ptr = key.as_ptr().add(k_offset);

                // Reset accumulators
                sum0 = vdupq_n_f32(0.0);
                sum1 = vdupq_n_f32(0.0);
                sum2 = vdupq_n_f32(0.0);
                sum3 = vdupq_n_f32(0.0);

                let mut idx = 0;
                for _ in 0..chunks {
                    // Load Q vectors
                    let q0 = vld1q_f32(q_ptr.add(idx));
                    let q1 = vld1q_f32(q_ptr.add(idx + 4));
                    let q2 = vld1q_f32(q_ptr.add(idx + 8));
                    let q3 = vld1q_f32(q_ptr.add(idx + 12));

                    // Load K vectors
                    let k0 = vld1q_f32(k_ptr.add(idx));
                    let k1 = vld1q_f32(k_ptr.add(idx + 4));
                    let k2 = vld1q_f32(k_ptr.add(idx + 8));
                    let k3 = vld1q_f32(k_ptr.add(idx + 12));

                    // FMA: sum += q * k
                    sum0 = vfmaq_f32(sum0, q0, k0);
                    sum1 = vfmaq_f32(sum1, q1, k1);
                    sum2 = vfmaq_f32(sum2, q2, k2);
                    sum3 = vfmaq_f32(sum3, q3, k3);

                    idx += 16;
                }

                // Tree reduction
                let sum01 = vaddq_f32(sum0, sum1);
                let sum23 = vaddq_f32(sum2, sum3);
                let sum = vaddq_f32(sum01, sum23);

                // Horizontal sum + scale
                let score = vaddvq_f32(vmulq_f32(sum, scale_vec));
                scores.push(score);
                max_score = max_score.max(score);
            }

            // Online softmax + value accumulation
            self.softmax_and_accumulate(
                &scores, max_score, value, output,
                q_pos, k_start, k_end, kv_head, head_dim, head,
            );
        }
    }
}
```

### 6.3 RoPE Kernels (NEON)

```rust
// crates/ruvllm/src/kernels/rope.rs

use std::arch::aarch64::*;

/// Rotary Position Embedding optimized for NEON
pub struct RoPENeon {
    /// Precomputed cos table
    cos_cache: Vec<f32>,
    /// Precomputed sin table
    sin_cache: Vec<f32>,
    /// Maximum sequence length
    max_seq_len: usize,
    /// Head dimension
    head_dim: usize,
}

impl RoPENeon {
    pub fn new(max_seq_len: usize, head_dim: usize, base: f32) -> Self {
        let half_dim = head_dim / 2;
        let mut cos_cache = vec![0.0; max_seq_len * half_dim];
        let mut sin_cache = vec![0.0; max_seq_len * half_dim];

        // Precompute frequencies
        for pos in 0..max_seq_len {
            for i in 0..half_dim {
                let freq = 1.0 / base.powf((2 * i) as f32 / head_dim as f32);
                let angle = pos as f32 * freq;
                cos_cache[pos * half_dim + i] = angle.cos();
                sin_cache[pos * half_dim + i] = angle.sin();
            }
        }

        Self { cos_cache, sin_cache, max_seq_len, head_dim }
    }

    /// Apply RoPE to query/key tensors in-place
    #[inline(always)]
    pub unsafe fn apply(
        &self,
        tensor: &mut [f32],
        positions: &[usize],
        num_heads: usize,
    ) {
        let half_dim = self.head_dim / 2;
        let chunks = half_dim / 4;

        for (seq_idx, &pos) in positions.iter().enumerate() {
            let cos_ptr = self.cos_cache.as_ptr().add(pos * half_dim);
            let sin_ptr = self.sin_cache.as_ptr().add(pos * half_dim);

            for head in 0..num_heads {
                let base_offset = (seq_idx * num_heads + head) * self.head_dim;
                let tensor_ptr = tensor.as_mut_ptr().add(base_offset);

                let mut idx = 0;
                for _ in 0..chunks {
                    // Load first half (x0) and second half (x1)
                    let x0 = vld1q_f32(tensor_ptr.add(idx));
                    let x1 = vld1q_f32(tensor_ptr.add(idx + half_dim));

                    // Load cos/sin
                    let cos = vld1q_f32(cos_ptr.add(idx));
                    let sin = vld1q_f32(sin_ptr.add(idx));

                    // Apply rotation: [x0*cos - x1*sin, x0*sin + x1*cos]
                    let neg_sin = vnegq_f32(sin);
                    let new_x0 = vfmaq_f32(vmulq_f32(x0, cos), x1, neg_sin);
                    let new_x1 = vfmaq_f32(vmulq_f32(x0, sin), x1, cos);

                    // Store results
                    vst1q_f32(tensor_ptr.add(idx), new_x0);
                    vst1q_f32(tensor_ptr.add(idx + half_dim), new_x1);

                    idx += 4;
                }
            }
        }
    }
}
```

### 6.4 RMSNorm Kernel (NEON)

```rust
// crates/ruvllm/src/kernels/norm.rs

use std::arch::aarch64::*;

/// RMSNorm optimized for NEON
pub struct RMSNormNeon {
    /// Weight vector (gamma)
    weight: Vec<f32>,
    /// Epsilon for numerical stability
    eps: f32,
}

impl RMSNormNeon {
    /// Apply RMSNorm in-place
    #[inline(always)]
    pub unsafe fn forward(&self, x: &mut [f32], hidden_size: usize) {
        let num_tokens = x.len() / hidden_size;

        for token_idx in 0..num_tokens {
            let offset = token_idx * hidden_size;
            let x_ptr = x.as_mut_ptr().add(offset);
            let w_ptr = self.weight.as_ptr();

            // Compute variance (mean of squares)
            let mut var0 = vdupq_n_f32(0.0);
            let mut var1 = vdupq_n_f32(0.0);
            let mut var2 = vdupq_n_f32(0.0);
            let mut var3 = vdupq_n_f32(0.0);

            let chunks = hidden_size / 16;
            let mut idx = 0;
            for _ in 0..chunks {
                let v0 = vld1q_f32(x_ptr.add(idx));
                let v1 = vld1q_f32(x_ptr.add(idx + 4));
                let v2 = vld1q_f32(x_ptr.add(idx + 8));
                let v3 = vld1q_f32(x_ptr.add(idx + 12));
                var0 = vfmaq_f32(var0, v0, v0);
                var1 = vfmaq_f32(var1, v1, v1);
                var2 = vfmaq_f32(var2, v2, v2);
                var3 = vfmaq_f32(var3, v3, v3);
                idx += 16;
            }

            // Tree reduction
            let var01 = vaddq_f32(var0, var1);
            let var23 = vaddq_f32(var2, var3);
            let var = vaddq_f32(var01, var23);
            let variance = vaddvq_f32(var) / hidden_size as f32;

            // Compute scale: 1/sqrt(variance + eps)
            let scale = 1.0 / (variance + self.eps).sqrt();
            let scale_vec = vdupq_n_f32(scale);

            // Apply normalization and weight
            idx = 0;
            for _ in 0..chunks {
                let v0 = vld1q_f32(x_ptr.add(idx));
                let v1 = vld1q_f32(x_ptr.add(idx + 4));
                let v2 = vld1q_f32(x_ptr.add(idx + 8));
                let v3 = vld1q_f32(x_ptr.add(idx + 12));
                let w0 = vld1q_f32(w_ptr.add(idx));
                let w1 = vld1q_f32(w_ptr.add(idx + 4));
                let w2 = vld1q_f32(w_ptr.add(idx + 8));
                let w3 = vld1q_f32(w_ptr.add(idx + 12));
                let out0 = vmulq_f32(vmulq_f32(v0, scale_vec), w0);
                let out1 = vmulq_f32(vmulq_f32(v1, scale_vec), w1);
                let out2 = vmulq_f32(vmulq_f32(v2, scale_vec), w2);
                let out3 = vmulq_f32(vmulq_f32(v3, scale_vec), w3);
                vst1q_f32(x_ptr.add(idx),
out0);
                vst1q_f32(x_ptr.add(idx + 4), out1);
                vst1q_f32(x_ptr.add(idx + 8), out2);
                vst1q_f32(x_ptr.add(idx + 12), out3);
                idx += 16;
            }
        }
    }
}
```

---

## 7. MicroLoRA Integration

### 7.1 MicroLoRA Architecture

```
+===========================================================================+
|                    MICROLORA REAL-TIME ADAPTATION                         |
+===========================================================================+

                       +-------------------+
                       | Input Activation  |
                       |  x: [batch, dim]  |
                       +---------+---------+
                                 |
         +-----------------------+-------------------------+
         |                       |                         |
         v                       v                         v
 +-------+-------+       +-------+-------+        +-------+-------+
 |  Base Weight  |       |  MicroLoRA A  |        |  MicroLoRA B  |
 | W: [out, in]  |       | A: [rank, in] |        | B: [out, rank]|
 |   (Frozen)    |       |  (Rank 1-2)   |        |  (Rank 1-2)   |
 +-------+-------+       +-------+-------+        +-------+-------+
         |                       |                         |
         v                       +------------+------------+
 +-------+-------+                            |
 |     W @ x     |                            v
 +-------+-------+                +-----------+-----------+
         |                        |  scale * B @ (A @ x)  |
         |                        +-----------+-----------+
         |                                    |
         +-----------------+------------------+
                           |
                           v
                   +-------+-------+
                   | y = Wx + sBAx |
                   +---------------+
```

### 7.2 MicroLoRA Implementation

```rust
// crates/ruvllm/src/lora/micro_lora.rs

/// MicroLoRA for per-request real-time adaptation
pub struct MicroLoRA {
    /// Config
    config: MicroLoRAConfig,
    /// A matrices per layer: [num_layers, rank, hidden_dim]
    a_matrices: Vec<Vec<f32>>,
    /// B matrices per layer: [num_layers, hidden_dim, rank]
    b_matrices: Vec<Vec<f32>>,
    /// Scale factor
    scale: f32,
    /// Gradient accumulators for instant learning
    grad_a: Vec<Vec<f32>>,
    grad_b: Vec<Vec<f32>>,
}

/// MicroLoRA configuration
pub struct MicroLoRAConfig {
    /// LoRA rank (typically 1-2 for instant learning)
    pub rank: usize,
    /// Hidden dimension
    pub hidden_dim: usize,
    /// Number of layers
    pub num_layers: usize,
    /// Learning rate for instant updates
    pub learning_rate: f32,
    /// Scale factor (alpha / rank)
    pub scale: f32,
    /// Apply to which modules
    pub target_modules: TargetModules,
}

#[derive(Clone, Copy)]
pub enum TargetModules {
    /// Query and Value projections only
    QV,
    /// All attention projections
    QKVO,
    /// All linear layers
    All,
}

impl MicroLoRA {
    pub fn new(config: MicroLoRAConfig) -> Self {
        let num_layers = config.num_layers;
        let rank = config.rank;
        let hidden_dim = config.hidden_dim;

        // Initialize with small random values (Xavier)
        let mut rng = rand::thread_rng();
        let std_a = (2.0 / (hidden_dim + rank) as f32).sqrt();
        let std_b = 0.0; // B initialized to zero

        let a_matrices: Vec<Vec<f32>> = (0..num_layers)
            .map(|_| {
                (0..rank * hidden_dim)
                    .map(|_| rng.gen::<f32>() * std_a)
                    .collect()
            })
            .collect();

        let b_matrices: Vec<Vec<f32>> = (0..num_layers)
            .map(|_| vec![std_b; hidden_dim * rank])
            .collect();

        let grad_a = vec![vec![0.0; rank * hidden_dim]; num_layers];
        let grad_b = vec![vec![0.0; hidden_dim * rank]; num_layers];

        Self {
            scale: config.scale,
            config,
            a_matrices,
            b_matrices,
            grad_a,
            grad_b,
        }
    }

    /// Forward pass: adds LoRA contribution to base output
    #[inline(always)]
    pub fn forward(
        &self,
        x: &[f32],                 // Input: [batch_size, hidden_dim]
        base_output: &mut [f32],   // Base output to modify in-place
        layer_idx: usize,
        batch_size: usize,
    ) {
        let rank = self.config.rank;
        let hidden_dim = self.config.hidden_dim;
        let a = &self.a_matrices[layer_idx];
        let b = &self.b_matrices[layer_idx];

        // Compute A @ x -> [batch_size, rank]
        let mut ax = vec![0.0; batch_size * rank];
        for batch in 0..batch_size {
            for r in 0..rank {
                let mut sum = 0.0;
                for d in 0..hidden_dim {
                    sum += a[r * hidden_dim + d] * x[batch * hidden_dim + d];
                }
                ax[batch * rank + r] = sum;
            }
        }

        // Compute B @ (A @ x) and add to base_output
        for batch in 0..batch_size {
            for d in 0..hidden_dim {
                let mut sum = 0.0;
                for r in 0..rank {
                    sum += b[d * rank + r] * ax[batch * rank + r];
                }
                base_output[batch * hidden_dim + d] += self.scale * sum;
            }
        }
    }

    /// Instant update from trajectory (SONA instant loop)
    pub fn instant_update(
        &mut self,
        input: &[f32],
        grad_output: &[f32],
        layer_idx: usize,
        quality_score: f32,
    ) {
        let rank = self.config.rank;
        let hidden_dim = self.config.hidden_dim;
        let lr = self.config.learning_rate * quality_score;
// Scale lr by quality
        // Compute gradients:
        //   grad_B = grad_output @ (A @ input)^T
        //   grad_A = B^T @ grad_output @ input^T
        // Simplified single-sample update
        let a = &self.a_matrices[layer_idx];
        let b = &mut self.b_matrices[layer_idx];

        // A @ input -> [rank]
        let mut ax = vec![0.0; rank];
        for r in 0..rank {
            let mut sum = 0.0;
            for d in 0..hidden_dim {
                sum += a[r * hidden_dim + d] * input[d];
            }
            ax[r] = sum;
        }

        // Update B: grad_B[d, r] = grad_output[d] * ax[r]
        for d in 0..hidden_dim {
            for r in 0..rank {
                let grad = grad_output[d] * ax[r];
                b[d * rank + r] -= lr * grad;
            }
        }

        // Update A: grad_A[r, d] = sum_d'(B[d', r] * grad_output[d']) * input[d]
        let a = &mut self.a_matrices[layer_idx];
        for r in 0..rank {
            let mut b_grad_sum = 0.0;
            for d in 0..hidden_dim {
                b_grad_sum += self.b_matrices[layer_idx][d * rank + r] * grad_output[d];
            }
            for d in 0..hidden_dim {
                let grad = b_grad_sum * input[d];
                a[r * hidden_dim + d] -= lr * grad;
            }
        }
    }
}
```

### 7.3 LoRA Adapter Manager

```rust
// crates/ruvllm/src/lora/adapter.rs

/// LoRA adapter management with hot-swapping
pub struct LoRAAdapterManager {
    /// Active MicroLoRA (per-request)
    micro_lora: Arc<RwLock<MicroLoRA>>,
    /// Base LoRA adapters (shared across requests)
    base_adapters: DashMap<String, Arc<BaseLoRAAdapter>>,
    /// Adapter residency manager
    residency: AdapterResidencyManager,
    /// Memory pool for adapter weights
    memory_pool: Arc<MemoryPool>,
}

/// Base LoRA adapter (rank 4-8, trained in background loop)
pub struct BaseLoRAAdapter {
    pub id: String,
    pub rank: usize,
    pub a_matrices: Vec<Vec<f32>>,
    pub b_matrices: Vec<Vec<f32>>,
    pub scale: f32,
    pub precision: Precision,
    pub last_access: AtomicU64,
    pub access_count: AtomicU64,
}

impl LoRAAdapterManager {
    /// Load adapter from storage with tier management
    pub async fn load_adapter(&self, adapter_id: &str) -> Result<Arc<BaseLoRAAdapter>> {
        // Check if already loaded
        if let Some(adapter) = self.base_adapters.get(adapter_id) {
            adapter.access_count.fetch_add(1, Ordering::Relaxed);
            adapter.last_access.store(
                std::time::SystemTime::now()
                    .duration_since(std::time::UNIX_EPOCH)
                    .unwrap()
                    .as_secs(),
                Ordering::Relaxed,
            );
            return Ok(adapter.clone());
        }

        // Load from appropriate tier
        let adapter = self.residency.load(adapter_id).await?;
        let adapter = Arc::new(adapter);
        self.base_adapters.insert(adapter_id.to_string(), adapter.clone());
        Ok(adapter)
    }

    /// Merge MicroLoRA into Base LoRA (background loop)
    pub fn merge_micro_to_base(&self, base_adapter_id: &str, quality_threshold: f32) {
        let micro = self.micro_lora.read();
        if let Some(mut base) = self.base_adapters.get_mut(base_adapter_id) {
            // Only merge if recent trajectories exceed the quality threshold;
            // trajectory filtering is handled by SONA.
            for layer_idx in 0..micro.config.num_layers {
                for (micro_a, base_a) in micro.a_matrices[layer_idx]
                    .iter()
                    .zip(base.a_matrices[layer_idx].iter_mut())
                {
                    // Exponential moving average merge
                    *base_a = 0.99 * *base_a + 0.01 * micro_a;
                }
                for (micro_b, base_b) in micro.b_matrices[layer_idx]
                    .iter()
                    .zip(base.b_matrices[layer_idx].iter_mut())
                {
                    *base_b = 0.99 * *base_b + 0.01 * micro_b;
                }
            }
        }
    }
}
```

---

## 8. SONA-LLM Integration

### 8.1 SONA LLM Configuration

```rust
// crates/ruvllm/src/optimization/sona_llm.rs

/// SONA integration specifically for LLM operations
pub struct SonaLLM {
    /// Core SONA integration
    sona: Arc<SonaIntegration>,
    /// MicroLoRA manager
    micro_lora: Arc<RwLock<MicroLoRA>>,
    /// KV cache policy learning
    kv_policy_learner: KvPolicyLearner,
    /// Router learning
    router_learner: RouterLearner,
}

impl SonaLLM {
    /// Record LLM trajectory for learning
    pub fn record_llm_trajectory(
        &self,
        request_id: &str,
        session_id: &str,
        input_tokens: &[u32],
        output_tokens: &[u32],
        quality_score: f32,
        latency_ms: f32,
        model_used: ModelSize,
        kv_cache_stats: &KvCacheStats,
    ) -> Result<()> {
        // Compute embeddings
        let query_embedding = self.compute_embedding(input_tokens)?;
        let response_embedding = self.compute_embedding(output_tokens)?;

        // Create trajectory
        let trajectory = Trajectory {
            request_id: request_id.to_string(),
            session_id: session_id.to_string(),
            query_embedding,
            response_embedding,
            quality_score,
            routing_features: vec![
                latency_ms / 1000.0, // Normalize
                kv_cache_stats.compression_ratio,
                kv_cache_stats.total_tokens as f32 / 4096.0,
                model_used.index() as f32 / 4.0,
            ],
            model_index: model_used.index(),
            timestamp: chrono::Utc::now(),
        };

        // Record in SONA
        self.sona.record_trajectory(trajectory)?;

        // Update MicroLoRA if quality is good
        if quality_score >= 0.7 {
            self.update_micro_lora(&query_embedding, quality_score)?;
        }

        // Update KV cache policy
        self.kv_policy_learner.update(kv_cache_stats, quality_score);

        Ok(())
    }

    /// Get routing recommendation for new request
    pub fn get_llm_routing(&self, input_embedding: &[f32]) -> LLMRoutingDecision {
        // Get base SONA recommendation
        let base_rec = self.sona.get_routing_recommendation(input_embedding);
        // Get router learner recommendation
        let router_rec = self.router_learner.recommend(input_embedding);
        // Get KV cache policy recommendation
        let kv_rec = self.kv_policy_learner.recommend(input_embedding);

        LLMRoutingDecision {
            model: base_rec.suggested_model,
            confidence: (base_rec.confidence + router_rec.confidence) / 2.0,
            kv_quantization: kv_rec.quantization,
            kv_tail_length: kv_rec.tail_length,
            use_micro_lora: base_rec.average_quality > 0.6,
        }
    }
}

/// LLM-specific routing decision
pub struct LLMRoutingDecision {
    /// Model size to use (0=tiny, 1=small, 2=medium, 3=large)
    pub model: usize,
    /// Confidence in decision
    pub confidence: f32,
    /// KV cache quantization level
    pub kv_quantization: Precision,
    /// KV cache tail length (high-precision)
    pub kv_tail_length: usize,
    /// Whether to apply MicroLoRA
    pub use_micro_lora: bool,
}
```

### 8.2 Real-Time Optimization Loop

```rust
// crates/ruvllm/src/optimization/realtime.rs

/// Real-time optimization during inference
pub struct RealtimeOptimizer {
    /// SONA LLM integration
    sona_llm: Arc<SonaLLM>,
    /// Performance monitor
    perf_monitor: PerformanceMonitor,
    /// Optimization triggers
    triggers: OptimizationTriggers,
}

#[derive(Clone)]
pub struct OptimizationTriggers {
    /// Trigger MicroLoRA update after N requests
    pub micro_lora_update_interval: usize,
    /// Trigger KV cache rebalance at memory threshold
    pub kv_rebalance_threshold: f32,
    /// Trigger router update after N trajectories
    pub router_update_interval: usize,
}

impl RealtimeOptimizer {
    /// Called before each forward pass
    pub fn pre_forward(&self, request: &InferenceRequest) -> ForwardConfig {
        // Get SONA routing decision
        let routing = self.sona_llm.get_llm_routing(&request.input_embedding);

        // Check if real-time adjustments needed
        let perf = self.perf_monitor.current_metrics();

        ForwardConfig {
            model_index: routing.model,
            use_micro_lora: routing.use_micro_lora,
            kv_config: KvConfig {
                quantization: if perf.memory_pressure > 0.9 {
                    Precision::Q4 // Aggressive compression under pressure
                } else {
                    routing.kv_quantization
                },
                tail_length: routing.kv_tail_length,
            },
            batch_optimization: perf.throughput < 50.0, // tokens/sec
        }
    }

    /// Called after each forward pass
    pub fn post_forward(&self, result: &InferenceResult) {
        // Record trajectory
        self.sona_llm.record_llm_trajectory(
            &result.request_id,
            &result.session_id,
            &result.input_tokens,
            &result.output_tokens,
            result.quality_score,
            result.latency_ms,
            result.model_used,
            &result.kv_stats,
        ).ok(); // Learning failures must not fail inference

        // Update performance monitor
        self.perf_monitor.record(result);

        // Check optimization triggers
        if self.should_trigger_micro_lora_update() {
            self.trigger_micro_lora_merge();
        }
        if self.should_trigger_kv_rebalance() {
            self.trigger_kv_rebalance();
        }
    }
}
```

---

## 9. API Design

### 9.1 Public API

```rust
// crates/ruvllm/src/engine.rs (to be added)

use std::sync::Arc;

/// Main inference engine combining all components
pub struct LLMInferenceEngine {
    /// Configuration
    config: LLMInferenceConfig,
    /// Backend (Candle, mistral-rs, or Hybrid)
    backend: Box<dyn InferenceBackend>,
    /// SONA LLM integration
    sona_llm: Arc<SonaLLM>,
    /// Real-time optimizer
    optimizer: Arc<RealtimeOptimizer>,
    /// KV cache manager
    kv_cache: Arc<TwoTierKvCache>,
    /// Paged attention manager
    paged_attention: Arc<PagedAttention>,
    /// LoRA adapter manager
    lora_manager: Arc<LoRAAdapterManager>,
    /// Session manager
    session_manager: SessionManager,
}

/// Engine configuration
pub struct LLMInferenceConfig {
    /// Backend type
    pub backend: BackendType,
    /// Model configuration
    pub model: ModelConfig,
    /// Memory configuration
    pub memory: MemoryConfig,
    /// SONA configuration
    pub sona: SonaConfig,
    /// KV cache configuration
    pub kv_cache: KvCacheConfig,
    /// LoRA configuration
    pub lora: LoRAConfig,
    /// Session configuration
    pub session: SessionConfig,
}

#[derive(Clone)]
pub enum BackendType {
    Candle(CandleBackendConfig),
    MistralRs(MistralBackendConfig),
    Hybrid {
        candle: CandleBackendConfig,
        mistral: MistralBackendConfig,
        routing: HybridRoutingConfig,
    },
}

impl LLMInferenceEngine {
    /// Create a new inference engine
    pub async fn new(config: LLMInferenceConfig) -> Result<Self> {
        // `InferenceBackend` is the trait defined in backends/mod.rs
        let backend: Box<dyn InferenceBackend> = match &config.backend {
            BackendType::Candle(cfg) => Box::new(CandleBackend::new(cfg.clone())?),
            BackendType::MistralRs(cfg) => Box::new(MistralBackend::load(cfg.clone()).await?),
            BackendType::Hybrid { candle, mistral, routing } => Box::new(
                HybridBackend::new(candle.clone(), mistral.clone(), routing.clone()).await?,
            ),
        };

        // Initialize components
        let sona_llm = Arc::new(SonaLLM::new(config.sona.clone())?);
        let optimizer = Arc::new(RealtimeOptimizer::new(sona_llm.clone()));
        let kv_cache = Arc::new(TwoTierKvCache::new(config.kv_cache.clone()));
        // Clone before `.into()` so `config` can still be stored below
        let paged_attention = Arc::new(PagedAttention::new(config.kv_cache.clone().into()));
        let lora_manager = Arc::new(LoRAAdapterManager::new(config.lora.clone()));
        let session_manager = SessionManager::new(config.session.clone());

        Ok(Self {
            config,
            backend,
            sona_llm,
            optimizer,
            kv_cache,
            paged_attention,
            lora_manager,
            session_manager,
        })
    }

    /// Run inference
    pub async fn generate(
        &self,
        request: GenerationRequest,
    ) -> Result<GenerationResponse> {
        // Get or create the session
        let session = self.session_manager
            .get_or_create(&request.session_id)?;

        // Pre-forward optimization (convert by reference so `request` is not consumed)
        let inference_request: InferenceRequest = (&request).into();
        let forward_config = self.optimizer.pre_forward(&inference_request);

        // Load LoRA adapter if specified
        if let Some(adapter_id) = &request.adapter_id {
            self.lora_manager.load_adapter(adapter_id).await?;
        }

        // Run generation
        let start = std::time::Instant::now();
        let output = self.backend.generate(&request, &forward_config, &session).await?;
        let latency_ms = start.elapsed().as_secs_f32() * 1000.0;

        // Post-forward optimization
        let result = InferenceResult {
            request_id: request.request_id.clone(),
            session_id: session.id.clone(),
            input_tokens: request.input_ids.clone(),
            output_tokens: output.token_ids.clone(),
            quality_score: output.quality_estimate,
            latency_ms,
            model_used: forward_config.model_index.into(),
            kv_stats: self.kv_cache.stats(),
        };
        self.optimizer.post_forward(&result);

        // Compute throughput before `output.token_ids` is moved into the response
        let tokens_per_second = output.token_ids.len() as f32 / (latency_ms / 1000.0);

        Ok(GenerationResponse {
            request_id: request.request_id,
            generated_text: output.text,
            token_ids: output.token_ids,
            latency_ms,
            tokens_per_second,
        })
    }
}

/// Generation request
pub struct GenerationRequest {
    pub request_id: String,
    pub session_id: Option<String>,
    pub prompt: String,
    pub input_ids: Vec<u32>,
    pub max_new_tokens: usize,
    pub temperature: f32,
    pub top_p: f32,
    pub adapter_id: Option<String>,
}

/// Generation response
pub struct GenerationResponse {
    pub request_id: String,
    pub generated_text: String,
    pub token_ids: Vec<u32>,
    pub latency_ms: f32,
    pub tokens_per_second: f32,
}
```

---

## 10. Cargo.toml Dependencies

```toml
# crates/ruvllm/Cargo.toml (additions to existing)

[package]
name = "ruvllm-integration"
version.workspace = true
edition.workspace = true
# ... existing fields ...

[dependencies]
# Existing dependencies
ruvector-core = { path = "../ruvector-core", default-features = false, features = ["storage"] }
ruvector-sona = { path = "../sona", default-features = false, features = ["serde-support"] }

# Candle - Tensor operations
candle-core = { version = "0.8", features = ["metal"] }
candle-nn = { version = "0.8" }
candle-transformers = { version = "0.8" }

# mistral-rs - Model inference (optional, for hybrid mode)
mistralrs = { version = "0.6", optional = true, features = ["metal", "flash-attn"] }
mistralrs-core = { version = "0.6", optional = true }

# Tokenizers
tokenizers = { version = "0.20", features = ["http"] }
hf-hub = { version = "0.3" }

# Async runtime
tokio = { workspace = true, features = ["rt-multi-thread", "sync", "macros"] }
futures = "0.3"

# Serialization
serde = { workspace = true }
serde_json = { workspace = true }

# Error handling
thiserror = { workspace = true }
anyhow = { workspace = true }
tracing = { workspace = true }

# Performance
dashmap = { workspace = true }
parking_lot = { workspace = true }
once_cell = { workspace = true }

# Time and UUID
chrono = { workspace = true, features = ["serde"] }
uuid = { workspace = true, features = ["v4", "serde"] }

# Math
ndarray = { workspace = true }
rand = { workspace = true }
half = { version = "2.4", features = ["std"] } # For f16 support

# Memory mapping (for model loading)
memmap2 = "0.9"
bytemuck = { version = "1.18", features = ["derive"] }

[dev-dependencies]
criterion = { workspace = true, features = ["html_reports"] }
tempfile = "3.13"
tracing-subscriber = { workspace = true }
approx = "0.5"

[features]
default = ["async-runtime", "candle-backend"]
async-runtime = ["tokio"]
candle-backend = []
mistral-backend = ["mistralrs", "mistralrs-core"]
hybrid-backend = ["candle-backend", "mistral-backend"]
metal = ["candle-core/metal"]
wasm = []

[[bench]]
name = "attention_benchmarks"
harness = false

[[bench]]
name = "lora_benchmarks"
harness = false
```

---

## 11. Module Structure (Final)

```
crates/ruvllm/src/
+-- lib.rs                 # (modify) Add new module exports
+-- engine.rs              # NEW: Main LLM inference engine
|
+-- backends/
|   +-- mod.rs             # NEW: Backend trait and selection
|   +-- candle.rs          # NEW: Candle tensor backend
|   +-- mistral.rs         # NEW: mistral-rs model backend
|   +-- hybrid.rs          # NEW: Hybrid routing backend
|
+-- lora/
|   +-- mod.rs             # NEW: LoRA module exports
|   +-- micro_lora.rs      # NEW: MicroLoRA implementation
|   +-- base_lora.rs       # NEW: Base LoRA adapters
|   +-- adapter.rs         # NEW: Adapter manager
|   +-- residency.rs       # NEW: Tier management
|
+-- kernels/
|   +-- mod.rs             # NEW: Kernel registry
|   +-- attention.rs       # NEW: Flash/Paged attention NEON
|   +-- rope.rs            # NEW: RoPE NEON implementation
|   +-- norm.rs            # NEW: RMSNorm/LayerNorm NEON
|   +-- quantize.rs        # NEW: Quantization kernels
|   +-- gemm.rs            # NEW: GEMM kernels (optional)
|
+-- optimization/
|   +-- mod.rs             # NEW: Optimization exports
|   +-- sona_llm.rs        # NEW: SONA LLM integration
|   +-- realtime.rs        # NEW: Real-time optimization
|   +-- policy.rs          # NEW: KV/Router policy learning
|
+-- adapter_manager.rs     # (existing) Modify for new LoRA
+-- error.rs               # (existing)
+-- kv_cache.rs            # (existing) Enhance with 3-tier
+-- paged_attention.rs     # (existing)
+-- policy_store.rs        # (existing)
+-- session.rs             # (existing)
+-- session_index.rs       # (existing)
+-- sona.rs                # (existing)
+-- types.rs               # (existing) Add new types
+-- witness_log.rs         # (existing)
```

---

## 12. Performance Targets

| Operation | Target | Hardware Optimization |
|-----------|--------|----------------------|
| Attention (256 seq) | <2ms | NEON 4x unrolling, Flash tiling |
| RoPE | <0.1ms | Precomputed tables, NEON vectorization |
| RMSNorm | <0.05ms | NEON tree reduction |
| MicroLoRA forward | <0.5ms | Rank 1-2, NEON matmul |
| MicroLoRA update | <1ms | Sparse gradient, instant loop |
| KV append (hot tier) | <0.1ms | Zero-copy append |
| KV migration (hot->warm) | <1ms | Batch quantization |
| Model load (7B Q4) | <30s | mmap, lazy loading |
| TTFT | <50ms | Paged attention, continuous batching |
| Throughput | 100+ tok/s | Batch optimization, prefetching |

---

## 13. Risk Analysis

| Risk | Likelihood | Impact | Mitigation |
|------|------------|--------|------------|
| Metal compatibility issues | Medium | High | Fallback to CPU NEON |
| Memory pressure at scale | Medium | High | Aggressive KV quantization, eviction |
| mistral-rs API changes | Low | Medium | Version pinning, abstraction layer |
| MicroLoRA quality degradation | Medium | Medium | EWC++, quality thresholds |
| Backend switching overhead | Low | Low | Warm-start caching |

---

## 14. References

1. [Candle Documentation](https://huggingface.co/docs/candle)
2. [mistral-rs GitHub](https://github.com/EricLBuehler/mistral.rs)
3. [Flash Attention Paper](https://arxiv.org/abs/2205.14135)
4. [S-LoRA Paper](https://arxiv.org/abs/2311.03285)
5. [KIVI: 2-bit KV Cache Quantization](https://arxiv.org/abs/2402.02750)
6. ADR-002: RuvLLM Integration with Ruvector
7. ADR-006: Unified Memory Pool and Paging Strategy

---

**Document Status**: Ready for Implementation Review