# Sparse Inference Engine Architecture ## PowerInfer-style Activation Locality for Ruvector **Version**: 1.0.0 **Date**: 2026-01-05 **Status**: Design Phase --- ## Executive Summary This document defines the architecture for a sparse inference engine that exploits **activation locality** in transformer models. The system achieves 2-10x speedup with <1% accuracy loss by: 1. **Predicting** which neurons will activate using low-rank matrices (P·Q) 2. **Computing** only active neurons in FFN layers 3. **Caching** hot neurons in fast memory 4. **Offloading** cold neurons to slower storage **Key Innovation**: Unlike model-wide quantization or pruning, we perform **neuron-level sparse computation** at runtime based on learned activation patterns. --- ## 1. System Architecture Overview ### 1.1 High-Level Architecture ``` ┌─────────────────────────────────────────────────────────────────────┐ │ Sparse Inference Engine │ ├─────────────────────────────────────────────────────────────────────┤ │ │ │ ┌───────────────┐ ┌──────────────┐ ┌─────────────────┐ │ │ │ Model Loader │─────▶│ Calibrator │─────▶│ Execution Engine│ │ │ │ │ │ │ │ │ │ │ │ • GGUF Parser │ │ • P·Q Learn │ │ • Layer Exec │ │ │ │ • HF Loader │ │ • Threshold │ │ • Sparse Compute│ │ │ │ • Safetensors │ │ • Neuron Map │ │ • Backend Route │ │ │ └───────────────┘ └──────────────┘ └─────────────────┘ │ │ │ │ │ │ │ ▼ ▼ ▼ │ │ ┌───────────────────────────────────────────────────────────────┐ │ │ │ Neuron Cache Manager │ │ │ │ ┌──────────────┐ ┌───────────────┐ ┌──────────────────┐ │ │ │ │ │ Hot Neurons │ │ Predictor Map │ │ Cold Neurons │ │ │ │ │ │ (GPU/Memory) │ │ (P·Q Matrices)│ │ (Disk/Offload) │ │ │ │ │ └──────────────┘ └───────────────┘ └──────────────────┘ │ │ │ └───────────────────────────────────────────────────────────────┘ │ │ │ │ │ │ │ ▼ ▼ ▼ │ │ ┌───────────────────────────────────────────────────────────────┐ │ │ │ Backend Abstraction │ │ │ │ ┌─────────┐ ┌──────────┐ ┌──────────┐ ┌──────────────┐ │ │ │ │ │ CPU SIMD│ │ WASM SIMD│ │ GPU/Metal│ │ NPU (future) │ │ │ │ │ │ (AVX512)│ │ (128-bit)│ │ (Compute)│ │ │ │ │ │ │ └─────────┘ └──────────┘ └──────────┘ └──────────────┘ │ │ │ └───────────────────────────────────────────────────────────────┘ │ │ │ ├─────────────────────────────────────────────────────────────────────┤ │ Integration Layer │ │ ┌──────────────────────────┐ ┌──────────────────────────────┐ │ │ │ Ruvector Integration │ │ RuvLLM Integration │ │ │ │ • EmbeddingProvider │ │ • InferenceBackend trait │ │ │ │ • Sparse embed() calls │ │ • generate() with sparsity │ │ │ └──────────────────────────┘ └──────────────────────────────┘ │ └─────────────────────────────────────────────────────────────────────┘ ``` ### 1.2 Component Interaction Flow ``` User Request │ ▼ ┌─────────────────┐ │ Model Selection │ (LFM2, sentence-bert, Llama GGUF) └────────┬────────┘ │ ▼ ┌─────────────────┐ │ Model Loader │ Parse GGUF/HF → Extract layers, weights, config └────────┬────────┘ │ ▼ ┌─────────────────┐ │ Calibration │ Feed sample data → Learn P·Q matrices → Classify neurons └────────┬────────┘ (Optional: Skip if pre-calibrated model) │ ▼ ┌─────────────────┐ │ Inference Setup │ Load hot neurons → Offload cold → Build predictor └────────┬────────┘ │ ▼ ┌─────────────────────────────────────────────┐ │ Runtime Inference Loop │ │ │ │ Input → Embedding │ │ │ │ │ ▼ │ │ For each layer: │ │ 1. Predictor(x) → Active neuron mask │ │ 2. Sparse Attention (full or sparse) │ │ 3. Sparse FFN (only active neurons) │ │ 4. 
LayerNorm + Residual │ │ │ │ │ ▼ │ │ Output (embeddings/logits) │ └─────────────────────────────────────────────┘ │ ▼ User Result ``` --- ## 2. Core Components ### 2.1 Model Loader **Responsibility**: Parse and load transformer models from various formats. #### Supported Formats | Format | Use Case | Priority | |--------|----------|----------| | **GGUF** | Quantized Llama models (q4_0, q8_0) | P0 | | **HuggingFace** | Sentence transformers (LFM2, BERT) | P0 | | **Safetensors** | Modern PyTorch exports | P1 | | **ONNX** | Cross-platform inference | P2 | #### GGUF Parser Details ```rust pub struct GGUFLoader { /// File handle to .gguf model file: File, /// Parsed metadata metadata: GGUFMetadata, /// Tensor mappings tensor_index: HashMap, } impl GGUFLoader { /// Parse header and build tensor index pub fn open(path: &Path) -> Result; /// Load specific layer weights pub fn load_layer(&self, layer_idx: usize) -> Result; /// Extract model config (n_layers, hidden_size, etc.) pub fn config(&self) -> ModelConfig; /// Check quantization type (Q4_0, Q8_0, F16) pub fn quantization(&self) -> QuantizationType; } pub struct LayerWeights { pub attention_qkv: Tensor, // Combined Q,K,V weights pub attention_output: Tensor, pub ffn_gate: Tensor, // FFN up-projection pub ffn_up: Tensor, pub ffn_down: Tensor, // FFN down-projection pub norm1: Tensor, // Pre-attention norm pub norm2: Tensor, // Pre-FFN norm } ``` #### HuggingFace Loader ```rust pub struct HFLoader { model_id: String, cache_dir: PathBuf, tokenizer: Tokenizer, } impl HFLoader { /// Download or load cached model pub fn from_pretrained(model_id: &str) -> Result; /// Load full model into memory pub fn load_model(&self) -> Result; /// Stream-load layer by layer (for large models) pub fn load_layer_stream(&self) -> impl Iterator; } ``` #### Model Configuration ```rust #[derive(Debug, Clone, Serialize, Deserialize)] pub struct ModelConfig { pub model_type: ModelType, // Llama, BERT, GPT pub num_layers: usize, pub hidden_size: usize, // Embedding dimension pub intermediate_size: usize, // FFN intermediate dimension pub num_attention_heads: usize, pub vocab_size: usize, pub max_position_embeddings: usize, pub quantization: Option, } #[derive(Debug, Clone)] pub enum ModelType { Llama, LlamaRoPE, BERT, SentenceBERT, LFM2, } ``` --- ### 2.2 Activation Predictor **Responsibility**: Predict which neurons will activate without computing full FFN. 
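To make the prediction step concrete before the formal description in the subsections below, here is a minimal, dependency-free sketch of low-rank activation prediction on plain `Vec<f32>` buffers. The dimensions, threshold value, and function name are illustrative assumptions, not part of the engine's API; the real predictor operates on `Tensor` types as shown later in this section.

```rust
/// Sketch: score = (x · P) · Q, then threshold into a binary neuron mask.
/// Shapes (row-major): x: [hidden], p: [hidden, rank], q: [rank, inter].
fn predict_mask(
    x: &[f32], p: &[f32], q: &[f32],
    hidden: usize, rank: usize, inter: usize,
    threshold: f32,
) -> Vec<bool> {
    // proj = x · P  ->  [rank]
    let mut proj = vec![0.0f32; rank];
    for r in 0..rank {
        for h in 0..hidden {
            proj[r] += x[h] * p[h * rank + r];
        }
    }
    // scores = proj · Q  ->  [inter]
    let mut scores = vec![0.0f32; inter];
    for i in 0..inter {
        for r in 0..rank {
            scores[i] += proj[r] * q[r * inter + i];
        }
    }
    // Cost: hidden*rank + rank*inter multiply-adds per token, versus
    // hidden*inter for each of the dense gate/up projections it gates.
    scores.iter().map(|&s| s > threshold).collect()
}

fn main() {
    // Toy dimensions; a real layer would be e.g. 4096 x 11008 with rank 64-256.
    let (hidden, rank, inter) = (8usize, 2usize, 16usize);
    let x = vec![0.1f32; hidden];
    let p = vec![0.05f32; hidden * rank];
    let q = vec![0.2f32; rank * inter];
    let mask = predict_mask(&x, &p, &q, hidden, rank, inter, 0.0);
    let active = mask.iter().filter(|&&m| m).count();
    println!(
        "active {}/{} neurons ({:.0}% sparsity)",
        active, inter,
        100.0 * (1.0 - active as f32 / inter as f32)
    );
}
```

Because `rank` is small relative to `hidden_size`, the prediction itself is cheap compared with the FFN work it allows the engine to skip.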
#### Low-Rank Predictor Architecture

The predictor uses **P·Q decomposition** where:

- **P ∈ ℝ^(hidden_size × r)**: Projection from hidden state
- **Q ∈ ℝ^(r × intermediate_size)**: Projection to neuron scores
- **r << hidden_size**: Rank (typically 64-256)

```
Input: x ∈ ℝ^hidden_size
Score: s = (x·P)·Q ∈ ℝ^intermediate_size   (low-rank computation)
Mask:  m = (s > threshold)                 (binary mask for active neurons)
```

#### Predictor Implementation

```rust
pub struct LowRankPredictor {
    /// P matrix: [hidden_size, rank]
    p_matrix: Tensor,
    /// Q matrix: [rank, intermediate_size]
    q_matrix: Tensor,
    /// Activation threshold per neuron
    thresholds: Vec<f32>,
    /// Neuron statistics (for threshold tuning)
    neuron_stats: Vec<NeuronStats>,
}

impl LowRankPredictor {
    /// Predict active neurons from hidden state
    pub fn predict(&self, hidden_state: &Tensor) -> NeuronMask {
        // scores = (hidden_state @ P) @ Q
        let proj = hidden_state.matmul(&self.p_matrix);   // [batch, rank]
        let scores = proj.matmul(&self.q_matrix);         // [batch, intermediate]

        // Apply per-neuron thresholds
        let mask = scores.iter()
            .zip(&self.thresholds)
            .map(|(score, threshold)| score > threshold)
            .collect();

        NeuronMask::new(mask)
    }

    /// Get predicted sparsity ratio
    pub fn sparsity_ratio(&self, hidden_state: &Tensor) -> f32 {
        let mask = self.predict(hidden_state);
        mask.active_ratio()
    }
}

#[derive(Debug, Clone)]
pub struct NeuronMask {
    /// Boolean mask: true = compute, false = skip
    mask: Vec<bool>,
    /// Precomputed active indices (for sparse kernels)
    active_indices: Vec<usize>,
}

impl NeuronMask {
    pub fn active_count(&self) -> usize {
        self.active_indices.len()
    }

    pub fn active_ratio(&self) -> f32 {
        self.active_count() as f32 / self.mask.len() as f32
    }

    pub fn iter_active(&self) -> impl Iterator<Item = usize> + '_ {
        self.active_indices.iter().copied()
    }
}
```

#### Calibration Process

```rust
pub struct Calibrator {
    model: TransformerModel,
    config: CalibrationConfig,
}

#[derive(Debug, Clone)]
pub struct CalibrationConfig {
    /// Number of calibration samples
    pub num_samples: usize,
    /// Target sparsity (e.g., 0.8 = 80% of neurons skipped)
    pub target_sparsity: f32,
    /// Predictor rank
    pub predictor_rank: usize,
    /// Calibration data source
    pub data_source: DataSource,
}

impl Calibrator {
    /// Run calibration to learn P, Q matrices
    pub fn calibrate(&mut self) -> Result<PredictorSet> {
        let samples = self.load_calibration_data()?;
        let mut predictors = Vec::new();

        for layer_idx in 0..self.model.num_layers() {
            // 1. Collect activation statistics
            let activations = self.collect_activations(layer_idx, &samples)?;

            // 2. Classify hot/cold neurons
            let classification = self.classify_neurons(&activations)?;

            // 3.
Learn low-rank predictor let predictor = self.learn_predictor( layer_idx, &activations, &classification )?; predictors.push(predictor); } Ok(PredictorSet { predictors }) } /// Collect FFN activations for layer fn collect_activations( &self, layer_idx: usize, samples: &[Tensor] ) -> Result { let mut hidden_states = Vec::new(); let mut ffn_activations = Vec::new(); for input in samples { let hidden = self.model.forward_to_layer(input, layer_idx)?; let ffn_out = self.model.compute_ffn(layer_idx, &hidden)?; hidden_states.push(hidden); ffn_activations.push(ffn_out); } Ok(ActivationData { hidden_states, ffn_activations, }) } /// Classify neurons as hot/cold based on activation frequency fn classify_neurons(&self, data: &ActivationData) -> Result { let intermediate_size = data.ffn_activations[0].shape()[1]; let mut activation_counts = vec![0usize; intermediate_size]; // Count how often each neuron activates for activations in &data.ffn_activations { for (i, value) in activations.iter().enumerate() { if value.abs() > 1e-6 { // Non-zero threshold activation_counts[i] += 1; } } } // Compute activation frequency let total_samples = data.ffn_activations.len(); let frequencies: Vec = activation_counts.iter() .map(|&count| count as f32 / total_samples as f32) .collect(); // Classify: hot if frequency > threshold let hot_threshold = self.config.target_sparsity; let classification: Vec = frequencies.iter() .map(|&freq| { if freq > hot_threshold { NeuronType::Hot } else { NeuronType::Cold } }) .collect(); Ok(NeuronClassification { types: classification, frequencies, }) } /// Learn P, Q matrices via gradient descent fn learn_predictor( &self, layer_idx: usize, data: &ActivationData, classification: &NeuronClassification ) -> Result { let hidden_size = data.hidden_states[0].shape()[0]; let intermediate_size = classification.types.len(); let rank = self.config.predictor_rank; // Initialize P, Q with Xavier let mut p_matrix = Tensor::randn(&[hidden_size, rank]) * (2.0 / hidden_size as f32).sqrt(); let mut q_matrix = Tensor::randn(&[rank, intermediate_size]) * (2.0 / rank as f32).sqrt(); let optimizer = Adam::new(0.001); // Training loop for epoch in 0..100 { let mut total_loss = 0.0; for (hidden, target) in data.hidden_states.iter().zip(&data.ffn_activations) { // Forward: scores = (hidden @ P) @ Q let proj = hidden.matmul(&p_matrix); let scores = proj.matmul(&q_matrix); // Loss: binary cross-entropy on active/inactive neurons let loss = self.predictor_loss(&scores, target, classification); total_loss += loss.item(); // Backward loss.backward(); optimizer.step(&mut [&mut p_matrix, &mut q_matrix]); } if total_loss < 0.01 { break; // Converged } } // Learn thresholds (per-neuron calibration) let thresholds = self.compute_thresholds(&p_matrix, &q_matrix, data)?; Ok(LowRankPredictor { p_matrix, q_matrix, thresholds, neuron_stats: self.compute_neuron_stats(classification), }) } } #[derive(Debug, Clone)] pub enum NeuronType { Hot, // Frequently activates (>80% samples) Cold, // Rarely activates (<20% samples) } ``` --- ### 2.3 Sparse FFN Computation **Responsibility**: Compute FFN layer with only active neurons. #### Standard FFN vs Sparse FFN **Standard FFN**: ``` FFN(x) = down(activation(gate(x) ⊙ up(x))) where gate, up, down are full matrix multiplications ``` **Sparse FFN**: ``` 1. mask = Predictor(x) (predict active neurons) 2. gate_active = gate(x)[mask] (sparse matmul: only active columns) 3. up_active = up(x)[mask] 4. hidden = activation(gate_active ⊙ up_active) 5. 
output = down_active(hidden) (sparse matmul: only active rows) ``` #### Implementation ```rust pub struct SparseFFN { /// Gate projection weights: [hidden_size, intermediate_size] gate_weights: Tensor, /// Up projection weights: [hidden_size, intermediate_size] up_weights: Tensor, /// Down projection weights: [intermediate_size, hidden_size] down_weights: Tensor, /// Activation function (SiLU, GELU, ReLU) activation: ActivationType, /// Predictor for this layer predictor: LowRankPredictor, } impl SparseFFN { pub fn forward(&self, hidden_state: &Tensor, backend: &dyn Backend) -> Result { // 1. Predict active neurons let mask = self.predictor.predict(hidden_state); if mask.active_ratio() > 0.8 { // Fallback to dense computation if too many neurons active return self.forward_dense(hidden_state, backend); } // 2. Sparse gate projection: only compute active columns let gate_active = backend.sparse_matmul_cols( hidden_state, &self.gate_weights, &mask )?; // 3. Sparse up projection let up_active = backend.sparse_matmul_cols( hidden_state, &self.up_weights, &mask )?; // 4. Activation: gate ⊙ up (element-wise) let activated = self.activation.apply(&gate_active.mul(&up_active)?)?; // 5. Sparse down projection: only active rows matter let output = backend.sparse_matmul_rows( &activated, &self.down_weights, &mask )?; Ok(output) } fn forward_dense(&self, hidden_state: &Tensor, backend: &dyn Backend) -> Result { // Standard dense FFN (fallback) let gate = hidden_state.matmul(&self.gate_weights)?; let up = hidden_state.matmul(&self.up_weights)?; let activated = self.activation.apply(&gate.mul(&up)?)?; activated.matmul(&self.down_weights) } } #[derive(Debug, Clone, Copy)] pub enum ActivationType { SiLU, // Llama models GELU, // BERT models ReLU, // Legacy models } impl ActivationType { pub fn apply(&self, x: &Tensor) -> Result { match self { Self::SiLU => x.mul(&x.sigmoid()?), // x * σ(x) Self::GELU => x.gelu(), Self::ReLU => x.relu(), } } } ``` #### Sparse Attention (Optional) For very large models, attention can also be sparsified: ```rust pub struct SparseAttention { /// Query, Key, Value weights (combined or separate) qkv_weights: Tensor, output_weights: Tensor, num_heads: usize, head_dim: usize, /// Attention mask pattern (e.g., local, strided) sparsity_pattern: AttentionPattern, } #[derive(Debug, Clone)] pub enum AttentionPattern { /// Full attention (no sparsity) Full, /// Local attention (window size) Local { window_size: usize }, /// Strided attention (BigBird style) Strided { stride: usize, window: usize }, /// Learned sparse pattern Learned { mask: Tensor }, } ``` --- ### 2.4 Neuron Cache Manager **Responsibility**: Manage hot/cold neuron weights in memory hierarchy. 
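Before the cache hierarchy is detailed below, a back-of-the-envelope sizing sketch shows why the hot/cold split fits the memory budgets quoted in Section 8. The Llama-7B-like dimensions, tier fractions, and byte widths are illustrative assumptions rather than measured values.

```rust
/// Rough FFN weight footprint for a hot/cold neuron split.
/// gate + up: [hidden, inter] each; down: [inter, hidden]
/// => 3 * hidden * inter parameters per layer.
fn ffn_bytes(layers: usize, hidden: usize, inter: usize,
             neuron_fraction: f64, bytes_per_weight: f64) -> f64 {
    3.0 * hidden as f64 * (inter as f64 * neuron_fraction)
        * bytes_per_weight * layers as f64
}

fn main() {
    // Illustrative Llama-7B-like shape (not read from a real model).
    let (layers, hidden, inter) = (32, 4096, 11008);
    const GIB: f64 = 1024.0 * 1024.0 * 1024.0;

    // Hot tier: ~20% of neurons, FP16 (2 bytes), always resident.
    let hot = ffn_bytes(layers, hidden, inter, 0.20, 2.0);
    // Cold tier: remaining ~80%, INT4 (0.5 bytes), mmap / lazy-loaded.
    let cold = ffn_bytes(layers, hidden, inter, 0.80, 0.5);
    // Dense FP16 baseline for comparison.
    let dense = ffn_bytes(layers, hidden, inter, 1.0, 2.0);

    println!("hot  (resident): {:.2} GiB", hot / GIB);
    println!("cold (offload) : {:.2} GiB", cold / GIB);
    println!("dense baseline : {:.2} GiB", dense / GIB);
}
```

Under these assumptions the resident hot tier lands around 1.6 GiB, consistent with the 1-2 GB figure in the CPU deployment sketch of Section 8.1, while the cold tier stays on disk until the predictor requests it.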
#### Cache Architecture ``` ┌──────────────────────────────────────────────┐ │ Neuron Cache Hierarchy │ ├──────────────────────────────────────────────┤ │ L1: Hot Neurons (GPU Memory / Fast RAM) │ │ - 10-20% most active neurons │ │ - Always resident │ │ - FP16/FP32 precision │ ├──────────────────────────────────────────────┤ │ L2: Warm Neurons (System RAM) │ │ - 30-40% moderately active │ │ - Loaded on-demand │ │ - INT8/FP16 quantized │ ├──────────────────────────────────────────────┤ │ L3: Cold Neurons (Disk / Compressed) │ │ - 40-60% rarely active │ │ - Lazy load if predicted │ │ - INT4/INT8 quantized │ └──────────────────────────────────────────────┘ ``` #### Implementation ```rust pub struct NeuronCache { /// Model configuration config: ModelConfig, /// Per-layer cache layers: Vec, /// Memory budget (bytes) memory_budget: usize, /// Current memory usage memory_used: usize, } pub struct LayerCache { /// Hot neuron indices hot_neurons: Vec, /// Hot neuron weights (gate, up, down) hot_weights: HotWeights, /// Cold neuron weights (memory-mapped or compressed) cold_weights: ColdWeights, /// Neuron statistics stats: Vec, } #[derive(Debug, Clone)] pub struct HotWeights { /// Gate weights for hot neurons: [hidden_size, num_hot] gate: Tensor, /// Up weights for hot neurons: [hidden_size, num_hot] up: Tensor, /// Down weights for hot neurons: [num_hot, hidden_size] down: Tensor, } pub enum ColdWeights { /// Memory-mapped file (lazy load) MemoryMapped { file: Mmap, offsets: Vec, }, /// Compressed in-memory Compressed { data: Vec, codec: CompressionCodec, }, /// Quantized INT4 Quantized { data: Vec, scales: Vec, }, } impl NeuronCache { /// Build cache from calibration results pub fn from_calibration( model: &TransformerModel, predictors: &PredictorSet, config: CacheConfig ) -> Result { let mut layers = Vec::new(); for (layer_idx, predictor) in predictors.iter().enumerate() { // Extract hot/cold neurons let hot_neurons: Vec = predictor.neuron_stats.iter() .enumerate() .filter(|(_, stats)| stats.neuron_type == NeuronType::Hot) .map(|(idx, _)| idx) .collect(); // Load hot neuron weights into fast memory let layer_weights = model.get_layer_weights(layer_idx)?; let hot_weights = Self::extract_hot_weights(&layer_weights, &hot_neurons)?; // Compress cold neuron weights let cold_weights = Self::compress_cold_weights( &layer_weights, &hot_neurons, config.compression )?; layers.push(LayerCache { hot_neurons, hot_weights, cold_weights, stats: predictor.neuron_stats.clone(), }); } Ok(Self { config: model.config.clone(), layers, memory_budget: config.memory_budget, memory_used: Self::calculate_memory(&layers), }) } /// Get weights for active neurons (hot cached, cold loaded on-demand) pub fn get_active_weights( &self, layer_idx: usize, mask: &NeuronMask ) -> Result { let cache = &self.layers[layer_idx]; // Separate hot and cold neurons in mask let (hot_indices, cold_indices) = self.split_hot_cold(cache, mask); // Hot neurons: direct lookup let hot_weights = self.gather_hot_weights(cache, &hot_indices)?; // Cold neurons: lazy load let cold_weights = if !cold_indices.is_empty() { self.load_cold_weights(cache, &cold_indices)? 
} else { None }; Ok(ActiveWeights { hot: hot_weights, cold: cold_weights, }) } fn load_cold_weights( &self, cache: &LayerCache, indices: &[usize] ) -> Result> { match &cache.cold_weights { ColdWeights::MemoryMapped { file, offsets } => { // Lazy load from disk let mut weights = Vec::new(); for &idx in indices { let offset = offsets[idx]; let data = &file[offset..offset + self.weight_size()]; weights.extend_from_slice(data); } Ok(Some(Tensor::from_bytes(&weights)?)) } ColdWeights::Quantized { data, scales } => { // Dequantize on-the-fly let weights = Self::dequantize(data, scales, indices)?; Ok(Some(weights)) } _ => Ok(None), } } } #[derive(Debug, Clone)] pub struct CacheConfig { /// Memory budget in bytes pub memory_budget: usize, /// Compression for cold neurons pub compression: CompressionCodec, /// Whether to use memory-mapped files pub use_mmap: bool, } #[derive(Debug, Clone, Copy)] pub enum CompressionCodec { None, Quantize4Bit, Quantize8Bit, ZSTD, } ``` --- ### 2.5 Execution Engine **Responsibility**: Orchestrate layer-by-layer inference with sparse computation. ```rust pub struct ExecutionEngine { /// Model configuration config: ModelConfig, /// Neuron cache cache: NeuronCache, /// Predictors (one per layer) predictors: PredictorSet, /// Backend for computation backend: Arc, /// Performance metrics metrics: Metrics, } impl ExecutionEngine { /// Run inference on input pub fn forward(&mut self, input: &Tensor) -> Result { let batch_size = input.shape()[0]; // 1. Embedding layer (always dense) let mut hidden = self.embed(input)?; // 2. Transformer layers for layer_idx in 0..self.config.num_layers { let start = std::time::Instant::now(); // Attention (dense or sparse) hidden = self.run_attention(layer_idx, &hidden)?; // Sparse FFN hidden = self.run_sparse_ffn(layer_idx, &hidden)?; self.metrics.record_layer(layer_idx, start.elapsed()); } // 3. Output layer let output = self.output_projection(&hidden)?; Ok(output) } fn run_sparse_ffn(&mut self, layer_idx: usize, hidden: &Tensor) -> Result { // 1. Predict active neurons let predictor = &self.predictors[layer_idx]; let mask = predictor.predict(hidden); self.metrics.record_sparsity(layer_idx, mask.active_ratio()); // 2. Get active neuron weights from cache let weights = self.cache.get_active_weights(layer_idx, &mask)?; // 3. Sparse FFN computation let ffn = SparseFFN::new(weights, predictor.clone()); let output = ffn.forward(hidden, self.backend.as_ref())?; Ok(output) } /// Get inference statistics pub fn metrics(&self) -> &Metrics { &self.metrics } } #[derive(Debug, Default)] pub struct Metrics { /// Per-layer latency layer_latency: Vec, /// Per-layer sparsity ratio layer_sparsity: Vec, /// Total tokens processed tokens_processed: usize, /// Cache hits/misses cache_hits: usize, cache_misses: usize, } impl Metrics { pub fn average_sparsity(&self) -> f32 { self.layer_sparsity.iter().sum::() / self.layer_sparsity.len() as f32 } pub fn total_latency(&self) -> Duration { self.layer_latency.iter().sum() } pub fn tokens_per_second(&self) -> f32 { let total_secs = self.total_latency().as_secs_f32(); self.tokens_processed as f32 / total_secs } } ``` --- ### 2.6 Backend Abstraction **Responsibility**: Provide SIMD-optimized sparse operations across platforms. 
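As a reference point for the trait below, here is a scalar (non-SIMD) sketch of the column-sparse matmul that the CPU and WASM backends specialize. It assumes the weight matrix is stored with each output column contiguous (the same layout the AVX-512 kernel later in this section indexes as `col * hidden + k`); the function and variable names are illustrative.

```rust
/// Scalar reference for `sparse_matmul_cols`: out = A @ B[:, active_cols].
/// A: [batch, hidden], row-major.
/// B: stored with each output column contiguous, i.e. indexed as b[col * hidden + k].
fn sparse_matmul_cols_scalar(
    a: &[f32],
    b: &[f32],
    batch: usize,
    hidden: usize,
    active_cols: &[usize],
) -> Vec<f32> {
    let num_active = active_cols.len();
    let mut out = vec![0.0f32; batch * num_active];
    for row in 0..batch {
        let a_row = &a[row * hidden..(row + 1) * hidden];
        for (out_idx, &col) in active_cols.iter().enumerate() {
            let b_col = &b[col * hidden..(col + 1) * hidden];
            // Dot product a[row, :] · b[:, col]; skipped columns are never touched.
            let mut sum = 0.0f32;
            for k in 0..hidden {
                sum += a_row[k] * b_col[k];
            }
            out[row * num_active + out_idx] = sum;
        }
    }
    out
}

fn main() {
    // Toy usage: 1 x 4 input, 6 output columns, of which 2 are predicted active.
    let a = vec![1.0, 2.0, 3.0, 4.0];
    let b = vec![0.5f32; 6 * 4]; // 6 columns, each 4 values, column-contiguous
    let out = sparse_matmul_cols_scalar(&a, &b, 1, 4, &[1, 4]);
    println!("{:?}", out); // one output per active column
}
```

The SIMD variants that follow keep the same loop structure and replace the inner dot product with 16-wide (AVX-512) or 4-wide (WASM v128) fused multiply-adds.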
```rust pub trait Backend: Send + Sync { /// Sparse matrix multiplication: A @ B[:, mask] fn sparse_matmul_cols( &self, a: &Tensor, b: &Tensor, col_mask: &NeuronMask ) -> Result; /// Sparse matrix multiplication: A[mask, :] @ B fn sparse_matmul_rows( &self, a: &Tensor, b: &Tensor, row_mask: &NeuronMask ) -> Result; /// Dense matrix multiplication (fallback) fn matmul(&self, a: &Tensor, b: &Tensor) -> Result; /// Element-wise operations fn add(&self, a: &Tensor, b: &Tensor) -> Result; fn mul(&self, a: &Tensor, b: &Tensor) -> Result; /// Activation functions fn silu(&self, x: &Tensor) -> Result; fn gelu(&self, x: &Tensor) -> Result; /// Quantization fn quantize(&self, x: &Tensor, bits: u8) -> Result<(Tensor, Vec)>; fn dequantize(&self, x: &Tensor, scales: &[f32]) -> Result; } ``` #### CPU Backend (AVX512 SIMD) ```rust pub struct CpuBackend { num_threads: usize, simd_features: SimdFeatures, } #[derive(Debug, Clone)] pub struct SimdFeatures { pub avx512: bool, pub avx2: bool, pub fma: bool, pub vnni: bool, // INT8 acceleration } impl Backend for CpuBackend { fn sparse_matmul_cols( &self, a: &Tensor, b: &Tensor, col_mask: &NeuronMask ) -> Result { // A: [batch, hidden_size] // B: [hidden_size, intermediate_size] // Output: [batch, active_neurons] let active_cols = col_mask.iter_active().collect::>(); if self.simd_features.avx512 { self.sparse_matmul_avx512(a, b, &active_cols) } else if self.simd_features.avx2 { self.sparse_matmul_avx2(a, b, &active_cols) } else { self.sparse_matmul_scalar(a, b, &active_cols) } } } impl CpuBackend { #[target_feature(enable = "avx512f")] unsafe fn sparse_matmul_avx512( &self, a: &Tensor, b: &Tensor, active_cols: &[usize] ) -> Result { // AVX-512: 16x f32 per vector // Optimized sparse GEMM kernel let batch = a.shape()[0]; let hidden = a.shape()[1]; let num_active = active_cols.len(); let mut output = Tensor::zeros(&[batch, num_active]); for row in 0..batch { for (out_idx, &col) in active_cols.iter().enumerate() { // Dot product: a[row, :] · b[:, col] let mut sum = _mm512_setzero_ps(); for k in (0..hidden).step_by(16) { let a_vec = _mm512_loadu_ps(&a.data()[row * hidden + k]); let b_vec = _mm512_loadu_ps(&b.data()[col * hidden + k]); sum = _mm512_fmadd_ps(a_vec, b_vec, sum); } output.data_mut()[row * num_active + out_idx] = _mm512_reduce_add_ps(sum); } } Ok(output) } } ``` #### WASM Backend (Portable SIMD) ```rust #[cfg(target_arch = "wasm32")] pub struct WasmBackend { simd_enabled: bool, } #[cfg(target_arch = "wasm32")] impl Backend for WasmBackend { fn sparse_matmul_cols( &self, a: &Tensor, b: &Tensor, col_mask: &NeuronMask ) -> Result { use std::arch::wasm32::*; if !self.simd_enabled { return self.sparse_matmul_scalar(a, b, col_mask); } // WASM SIMD: 4x f32 per v128 let active_cols = col_mask.iter_active().collect::>(); let batch = a.shape()[0]; let hidden = a.shape()[1]; let num_active = active_cols.len(); let mut output = Tensor::zeros(&[batch, num_active]); for row in 0..batch { for (out_idx, &col) in active_cols.iter().enumerate() { let mut sum = f32x4_splat(0.0); for k in (0..hidden).step_by(4) { let a_vec = v128_load(&a.data()[row * hidden + k] as *const f32 as *const v128); let b_vec = v128_load(&b.data()[col * hidden + k] as *const f32 as *const v128); sum = f32x4_add(sum, f32x4_mul(a_vec, b_vec)); } // Horizontal sum let result = f32x4_extract_lane::<0>(sum) + f32x4_extract_lane::<1>(sum) + f32x4_extract_lane::<2>(sum) + f32x4_extract_lane::<3>(sum); output.data_mut()[row * num_active + out_idx] = result; } } Ok(output) } } ``` --- ## 3. 
Data Flow Architecture ### 3.1 Model Loading Flow ``` User: model_path │ ▼ ┌─────────────────────────────┐ │ Detect Model Format │ (check extension: .gguf, .safetensors, .bin) └────────┬────────────────────┘ │ ├──── .gguf ────────▶ GGUFLoader::open() │ │ │ ├─ Parse header │ ├─ Build tensor index │ └─ Extract config │ ├──── HF/ST ────────▶ HFLoader::from_pretrained() │ │ │ ├─ Download if needed │ ├─ Load safetensors │ └─ Parse config.json │ ▼ ┌─────────────────────────────┐ │ TransformerModel │ │ - Config │ │ - Layer weights │ │ - Tokenizer │ └─────────────────────────────┘ ``` ### 3.2 Calibration Flow ``` TransformerModel │ ▼ ┌─────────────────────────────┐ │ Load Calibration Dataset │ (WikiText, C4, custom) │ - Sample 512-2048 examples │ └────────┬────────────────────┘ │ ▼ ┌─────────────────────────────────────────┐ │ For each layer: │ │ 1. Forward samples → collect │ │ - Hidden states (input to FFN) │ │ - FFN activations (output) │ │ │ │ 2. Analyze activations: │ │ - Compute activation frequency │ │ - Classify hot/cold neurons │ │ │ │ 3. Learn predictor: │ │ - Initialize P, Q matrices │ │ - Train on (hidden → activation) │ │ - Optimize thresholds │ └────────┬────────────────────────────────┘ │ ▼ ┌─────────────────────────────┐ │ PredictorSet + NeuronCache │ │ - P·Q matrices per layer │ │ - Hot neuron weights │ │ - Cold neuron offload │ └─────────────────────────────┘ ``` ### 3.3 Inference Flow (Single Token) ``` Input Token(s) │ ▼ ┌─────────────────────────────┐ │ Tokenizer + Embedding │ └────────┬────────────────────┘ │ ▼ ┌───────────────────────────────────────────────────────┐ │ Layer 0: │ │ ┌─────────────────────────────────────────────────┐ │ │ │ 1. Attention (dense) │ │ │ │ - Q, K, V projections │ │ │ │ - Scaled dot-product attention │ │ │ │ - Output projection │ │ │ └─────────────────────────────────────────────────┘ │ │ ┌─────────────────────────────────────────────────┐ │ │ │ 2. Sparse FFN │ │ │ │ a) Predictor(hidden) → mask [T/F/F/T/T/F...] │ │ │ │ b) Load weights for active neurons only │ │ │ │ c) Sparse gate/up projections │ │ │ │ d) Activation (SiLU/GELU) │ │ │ │ e) Sparse down projection │ │ │ └─────────────────────────────────────────────────┘ │ │ ┌─────────────────────────────────────────────────┐ │ │ │ 3. Residual + LayerNorm │ │ │ └─────────────────────────────────────────────────┘ │ └───────────────────┬───────────────────────────────────┘ │ ▼ (Repeat for layers 1..N-1) │ ▼ ┌─────────────────────────────┐ │ Output Projection │ │ - Linear(hidden, vocab) │ │ - Softmax (for generation) │ └────────┬────────────────────┘ │ ▼ Logits / Embedding ``` --- ## 4. 
Rust Module Structure ### 4.1 Crate Layout ``` crates/ruvector-sparse-inference/ ├── Cargo.toml ├── build.rs # Build-time feature detection ├── README.md └── src/ ├── lib.rs # Public API ├── config.rs # Configuration types ├── error.rs # Error types │ ├── predictor/ │ ├── mod.rs # Predictor API │ ├── lowrank.rs # P·Q low-rank predictor │ ├── calibration.rs # Calibration logic │ └── threshold.rs # Threshold optimization │ ├── sparse/ │ ├── mod.rs # Sparse operations API │ ├── ffn.rs # Sparse FFN layer │ ├── attention.rs # Sparse attention (optional) │ └── kernels.rs # SIMD kernels │ ├── model/ │ ├── mod.rs # Model loading API │ ├── gguf.rs # GGUF parser │ ├── hf.rs # HuggingFace loader │ ├── loader.rs # Generic loader trait │ └── runners.rs # Model-specific runners (Llama, BERT) │ ├── memory/ │ ├── mod.rs # Memory management API │ ├── cache.rs # Neuron cache │ ├── quantization.rs # Quantization utilities │ └── compression.rs # Compression codecs │ ├── backend/ │ ├── mod.rs # Backend trait │ ├── cpu.rs # CPU SIMD backend │ ├── wasm.rs # WASM SIMD backend │ └── gpu.rs # GPU backend (future) │ ├── integration/ │ ├── mod.rs # Integration API │ ├── ruvector.rs # EmbeddingProvider impl │ └── ruvllm.rs # InferenceBackend impl │ └── utils/ ├── mod.rs ├── tensor.rs # Tensor utilities └── metrics.rs # Performance tracking ``` ### 4.2 Key Module Responsibilities | Module | Responsibility | Dependencies | |--------|---------------|-------------| | `lib.rs` | Public API, re-exports | All modules | | `config` | Configuration types | None | | `error` | Error handling | None | | `predictor` | Neuron prediction | `tensor`, `backend` | | `sparse` | Sparse computation | `predictor`, `backend`, `memory` | | `model` | Model loading | `config`, `error` | | `memory` | Cache management | `model`, `predictor` | | `backend` | SIMD operations | `tensor` | | `integration` | Ruvector/RuvLLM | All | --- ## 5. Key Traits and Interfaces ### 5.1 ModelRunner Trait ```rust pub trait ModelRunner: Send + Sync { /// Get model configuration fn config(&self) -> &ModelConfig; /// Run inference on input tokens fn forward(&mut self, input_ids: &[u32]) -> Result; /// Encode text to embeddings (for embedding models) fn encode(&mut self, text: &str) -> Result>; /// Generate text (for language models) fn generate(&mut self, prompt: &str, max_tokens: usize) -> Result; /// Get inference metrics fn metrics(&self) -> &Metrics; } // Implementations pub struct LlamaRunner { /* ... */ } pub struct BertRunner { /* ... */ } pub struct LFM2Runner { /* ... */ } impl ModelRunner for LlamaRunner { /* ... */ } impl ModelRunner for BertRunner { /* ... */ } impl ModelRunner for LFM2Runner { /* ... 
*/ } ``` ### 5.2 Predictor Trait ```rust pub trait Predictor: Send + Sync { /// Predict active neurons from hidden state fn predict(&self, hidden_state: &Tensor) -> NeuronMask; /// Get predicted sparsity ratio fn sparsity_ratio(&self, hidden_state: &Tensor) -> f32; /// Get neuron statistics fn neuron_stats(&self) -> &[NeuronStats]; } #[derive(Debug, Clone)] pub struct NeuronStats { pub neuron_type: NeuronType, pub activation_frequency: f32, pub average_magnitude: f32, } ``` ### 5.3 Cache Trait ```rust pub trait Cache: Send + Sync { /// Get weights for active neurons fn get_active_weights( &self, layer_idx: usize, mask: &NeuronMask ) -> Result; /// Get memory usage statistics fn memory_usage(&self) -> MemoryStats; /// Evict least-recently-used cold neurons fn evict(&mut self, size: usize) -> Result<()>; } #[derive(Debug, Clone)] pub struct MemoryStats { pub hot_neurons_bytes: usize, pub cold_neurons_bytes: usize, pub predictor_bytes: usize, pub total_bytes: usize, } ``` --- ## 6. Integration Architecture ### 6.1 Ruvector EmbeddingProvider Integration ```rust // In ruvector-core/src/embeddings.rs pub trait EmbeddingProvider { fn embed(&self, text: &str) -> Result>; fn embed_batch(&self, texts: &[&str]) -> Result>>; } // New implementation in sparse-inference impl EmbeddingProvider for SparseInferenceEngine { fn embed(&self, text: &str) -> Result> { // 1. Tokenize let tokens = self.tokenizer.encode(text)?; // 2. Run sparse inference let output = self.runner.forward(&tokens)?; // 3. Mean pooling (for sentence embeddings) let embedding = self.mean_pool(&output)?; Ok(embedding.to_vec()) } fn embed_batch(&self, texts: &[&str]) -> Result>> { texts.iter().map(|text| self.embed(text)).collect() } } // Usage let engine = SparseInferenceEngine::from_pretrained( "TaylorAI/gte-tiny", SparseConfig::default() )?; let rv = RuVector::builder() .embedding_provider(Box::new(engine)) .build()?; rv.insert("test", "Hello world")?; ``` ### 6.2 RuvLLM InferenceBackend Integration ```rust // In ruvllm/src/backend.rs pub trait InferenceBackend { fn generate(&mut self, prompt: &str, config: GenerateConfig) -> Result; fn logits(&mut self, prompt: &str) -> Result>; } // Implementation impl InferenceBackend for SparseInferenceEngine { fn generate(&mut self, prompt: &str, config: GenerateConfig) -> Result { let mut tokens = self.tokenizer.encode(prompt)?; let mut output = String::new(); for _ in 0..config.max_tokens { // Sparse inference let logits = self.runner.forward(&tokens)?; // Sample next token let next_token = self.sample(&logits, config.temperature)?; tokens.push(next_token); // Decode let text = self.tokenizer.decode(&[next_token])?; output.push_str(&text); if next_token == self.tokenizer.eos_token() { break; } } Ok(output) } } // Usage let engine = SparseInferenceEngine::from_pretrained( "TheBloke/Llama-2-7B-GGUF", SparseConfig::default() )?; let llm = RuvLLM::builder() .backend(Box::new(engine)) .build()?; let response = llm.generate("Explain quantum computing", Default::default())?; ``` --- ## 7. 
Performance Targets ### 7.1 Latency Targets | Model | Operation | Target Latency | Baseline | Speedup | |-------|-----------|----------------|----------|---------| | **LFM2-350M** | Sentence embedding | 5-10ms | 25ms | 2.5-5x | | **BERT-base** | Sentence embedding | 8-15ms | 40ms | 2.7-5x | | **Llama-7B** | Token generation | 50-100ms | 500ms | 5-10x | | **Llama-13B** | Token generation | 100-200ms | 1.2s | 6-12x | ### 7.2 Memory Targets | Model | Baseline RAM | Sparse RAM | Reduction | |-------|--------------|------------|-----------| | **LFM2-350M** | 1.4 GB | 700 MB | 2x | | **Llama-7B (FP16)** | 14 GB | 7-9 GB | 1.5-2x | | **Llama-7B (Q4)** | 4 GB | 2.5-3 GB | 1.3-1.6x | ### 7.3 Accuracy Targets - **Embedding similarity**: >0.99 cosine similarity to dense baseline - **Generation quality**: <1% perplexity increase - **Classification accuracy**: <0.5% drop on downstream tasks ### 7.4 Sparsity Targets | Layer Type | Target Sparsity | Active Neurons | |------------|-----------------|----------------| | **Early layers** | 60-70% | 30-40% compute | | **Middle layers** | 70-85% | 15-30% compute | | **Late layers** | 50-60% | 40-50% compute | | **Average** | 70-80% | 20-30% compute | --- ## 8. Deployment Architecture ### 8.1 CPU Deployment ``` ┌─────────────────────────────────────┐ │ Application Process │ ├─────────────────────────────────────┤ │ ┌───────────────────────────────┐ │ │ │ Sparse Inference Engine │ │ │ │ - Hot neurons: RAM (1-2 GB) │ │ │ │ - Cold neurons: mmap disk │ │ │ │ - SIMD: AVX-512 / AVX2 │ │ │ └───────────────────────────────┘ │ │ ┌───────────────────────────────┐ │ │ │ Thread Pool (rayon) │ │ │ │ - Parallel batch processing │ │ │ └───────────────────────────────┘ │ └─────────────────────────────────────┘ ``` ### 8.2 WASM Deployment ``` ┌─────────────────────────────────────┐ │ Browser / Node.js │ ├─────────────────────────────────────┤ │ ┌───────────────────────────────┐ │ │ │ WASM Module │ │ │ │ - Hot neurons: ArrayBuffer │ │ │ │ - SIMD: wasm128 (if avail) │ │ │ │ - Memory limit: 2-4 GB │ │ │ └───────────────────────────────┘ │ │ ┌───────────────────────────────┐ │ │ │ Worker Pool │ │ │ │ - Parallel inference │ │ │ └───────────────────────────────┘ │ └─────────────────────────────────────┘ ``` ### 8.3 Hybrid Deployment ``` ┌─────────────────────────────────────┐ │ Cloud GPU (Hot path) │ │ - 20% most frequent queries │ │ - Full dense inference │ │ - <10ms latency │ └──────────────┬──────────────────────┘ │ ▼ ┌─────────────────────────────────────┐ │ Edge CPU (Cold path) │ │ - 80% long-tail queries │ │ - Sparse inference │ │ - 20-50ms latency │ │ - 10x lower cost │ └─────────────────────────────────────┘ ``` --- ## 9. Future Enhancements ### 9.1 Phase 2 Features - **Dynamic sparsity**: Adjust predictor thresholds at runtime - **Multi-modal**: Support vision-language models (CLIP, LLaVA) - **Quantization-aware**: INT8/INT4 predictor matrices - **GPU kernels**: CUDA/Metal sparse kernels - **NPU support**: Apple Neural Engine, Qualcomm Hexagon ### 9.2 Phase 3 Features - **Learned sparsity patterns**: Train end-to-end with sparsity loss - **Mixture-of-experts**: Combine with MoE models - **Speculative decoding**: Sparse draft models + dense verification - **Cross-layer optimization**: Share predictors across layers --- ## 10. References and Inspiration 1. **PowerInfer** (SOSP'23): Fast LLM serving with activation locality 2. **DejaVu** (MLSys'23): Contextual sparsity in transformers 3. **CATS** (ICLR'24): Context-aware token selection 4. 
**FlashAttention**: Memory-efficient attention
5. **GPTQ/AWQ**: Weight quantization for LLMs

---

## Appendix A: Configuration Examples

### A.1 LFM2 Embedding Configuration

```rust
let config = SparseConfig {
    model_path: "TaylorAI/gte-tiny".to_string(),
    predictor_rank: 128,
    target_sparsity: 0.75,
    cache_config: CacheConfig {
        memory_budget: 1024 * 1024 * 1024, // 1 GB
        use_mmap: false,
        compression: CompressionCodec::None,
    },
    backend: BackendType::CpuAvx2,
    calibration: Some(CalibrationConfig {
        num_samples: 1024,
        data_source: DataSource::WikiText,
    }),
};
```

### A.2 Llama-7B Generation Configuration

```rust
let config = SparseConfig {
    model_path: "TheBloke/Llama-2-7B-GGUF".to_string(),
    predictor_rank: 256,
    target_sparsity: 0.80,
    cache_config: CacheConfig {
        memory_budget: 8 * 1024 * 1024 * 1024, // 8 GB
        use_mmap: true,
        compression: CompressionCodec::Quantize4Bit,
    },
    backend: BackendType::CpuAvx512,
    calibration: Some(CalibrationConfig {
        num_samples: 2048,
        data_source: DataSource::C4,
    }),
};
```

---

## Appendix B: Benchmarking Protocol

### B.1 Latency Benchmarking

```bash
# Embedding models
cargo bench --bench embeddings -- \
    --model gte-tiny \
    --batch-sizes 1,8,32 \
    --sequence-lengths 16,64,256

# Generation models
cargo bench --bench generation -- \
    --model llama-7b \
    --prompt-lengths 32,128,512 \
    --generate-lengths 32,128
```

### B.2 Accuracy Evaluation

```bash
# STS-B (semantic similarity)
cargo run --release --bin eval-sts \
    --model gte-tiny \
    --sparse \
    --dataset data/stsbenchmark

# MMLU (language understanding)
cargo run --release --bin eval-mmlu \
    --model llama-7b \
    --sparse \
    --subset abstract_algebra,anatomy
```

---

**End of Architecture Document**