wifi-densepose/vendor/ruvector/docs/sparse-inference/ARCHITECTURE.md


Sparse Inference Engine Architecture

PowerInfer-style Activation Locality for Ruvector

Version: 1.0.0 | Date: 2026-01-05 | Status: Design Phase


Executive Summary

This document defines the architecture for a sparse inference engine that exploits activation locality in transformer models. The system achieves 2-10x speedup with <1% accuracy loss by:

  1. Predicting which neurons will activate using low-rank matrices (P·Q)
  2. Computing only active neurons in FFN layers
  3. Caching hot neurons in fast memory
  4. Offloading cold neurons to slower storage

Key Innovation: Unlike model-wide quantization or pruning, we perform neuron-level sparse computation at runtime based on learned activation patterns.


1. System Architecture Overview

1.1 High-Level Architecture

┌─────────────────────────────────────────────────────────────────────┐
│                     Sparse Inference Engine                          │
├─────────────────────────────────────────────────────────────────────┤
│                                                                       │
│  ┌───────────────┐      ┌──────────────┐      ┌─────────────────┐  │
│  │ Model Loader  │─────▶│  Calibrator  │─────▶│ Execution Engine│  │
│  │               │      │              │      │                 │  │
│  │ • GGUF Parser │      │ • P·Q Learn  │      │ • Layer Exec    │  │
│  │ • HF Loader   │      │ • Threshold  │      │ • Sparse Compute│  │
│  │ • Safetensors │      │ • Neuron Map │      │ • Backend Route │  │
│  └───────────────┘      └──────────────┘      └─────────────────┘  │
│         │                       │                       │            │
│         ▼                       ▼                       ▼            │
│  ┌───────────────────────────────────────────────────────────────┐  │
│  │                      Neuron Cache Manager                      │  │
│  │  ┌──────────────┐  ┌───────────────┐  ┌──────────────────┐   │  │
│  │  │ Hot Neurons  │  │ Predictor Map │  │ Cold Neurons     │   │  │
│  │  │ (GPU/Memory) │  │ (P·Q Matrices)│  │ (Disk/Offload)   │   │  │
│  │  └──────────────┘  └───────────────┘  └──────────────────┘   │  │
│  └───────────────────────────────────────────────────────────────┘  │
│         │                       │                       │            │
│         ▼                       ▼                       ▼            │
│  ┌───────────────────────────────────────────────────────────────┐  │
│  │                     Backend Abstraction                        │  │
│  │  ┌─────────┐  ┌──────────┐  ┌──────────┐  ┌──────────────┐   │  │
│  │  │ CPU SIMD│  │ WASM SIMD│  │ GPU/Metal│  │ NPU (future) │   │  │
│  │  │ (AVX512)│  │ (128-bit)│  │ (Compute)│  │              │   │  │
│  │  └─────────┘  └──────────┘  └──────────┘  └──────────────┘   │  │
│  └───────────────────────────────────────────────────────────────┘  │
│                                                                       │
├─────────────────────────────────────────────────────────────────────┤
│                      Integration Layer                               │
│  ┌──────────────────────────┐    ┌──────────────────────────────┐  │
│  │   Ruvector Integration   │    │     RuvLLM Integration       │  │
│  │ • EmbeddingProvider      │    │ • InferenceBackend trait     │  │
│  │ • Sparse embed() calls   │    │ • generate() with sparsity   │  │
│  └──────────────────────────┘    └──────────────────────────────┘  │
└─────────────────────────────────────────────────────────────────────┘

1.2 Component Interaction Flow

User Request
    │
    ▼
┌─────────────────┐
│ Model Selection │ (LFM2, sentence-bert, Llama GGUF)
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│  Model Loader   │ Parse GGUF/HF → Extract layers, weights, config
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│   Calibration   │ Feed sample data → Learn P·Q matrices → Classify neurons
└────────┬────────┘         (Optional: Skip if pre-calibrated model)
         │
         ▼
┌─────────────────┐
│ Inference Setup │ Load hot neurons → Offload cold → Build predictor
└────────┬────────┘
         │
         ▼
┌─────────────────────────────────────────────┐
│        Runtime Inference Loop               │
│                                             │
│  Input → Embedding                          │
│     │                                       │
│     ▼                                       │
│  For each layer:                            │
│     1. Predictor(x) → Active neuron mask   │
│     2. Sparse Attention (full or sparse)   │
│     3. Sparse FFN (only active neurons)    │
│     4. LayerNorm + Residual                │
│     │                                       │
│     ▼                                       │
│  Output (embeddings/logits)                │
└─────────────────────────────────────────────┘
         │
         ▼
    User Result

2. Core Components

2.1 Model Loader

Responsibility: Parse and load transformer models from various formats.

Supported Formats

| Format | Use Case | Priority |
|--------|----------|----------|
| GGUF | Quantized Llama models (q4_0, q8_0) | P0 |
| HuggingFace | Sentence transformers (LFM2, BERT) | P0 |
| Safetensors | Modern PyTorch exports | P1 |
| ONNX | Cross-platform inference | P2 |

GGUF Parser Details

pub struct GGUFLoader {
    /// File handle to .gguf model
    file: File,
    /// Parsed metadata
    metadata: GGUFMetadata,
    /// Tensor mappings
    tensor_index: HashMap<String, TensorInfo>,
}

impl GGUFLoader {
    /// Parse header and build tensor index
    pub fn open(path: &Path) -> Result<Self>;

    /// Load specific layer weights
    pub fn load_layer(&self, layer_idx: usize) -> Result<LayerWeights>;

    /// Extract model config (n_layers, hidden_size, etc.)
    pub fn config(&self) -> ModelConfig;

    /// Check quantization type (Q4_0, Q8_0, F16)
    pub fn quantization(&self) -> QuantizationType;
}

pub struct LayerWeights {
    pub attention_qkv: Tensor,       // Combined Q,K,V weights
    pub attention_output: Tensor,
    pub ffn_gate: Tensor,            // FFN up-projection
    pub ffn_up: Tensor,
    pub ffn_down: Tensor,            // FFN down-projection
    pub norm1: Tensor,               // Pre-attention norm
    pub norm2: Tensor,               // Pre-FFN norm
}

HuggingFace Loader

pub struct HFLoader {
    model_id: String,
    cache_dir: PathBuf,
    tokenizer: Tokenizer,
}

impl HFLoader {
    /// Download or load cached model
    pub fn from_pretrained(model_id: &str) -> Result<Self>;

    /// Load full model into memory
    pub fn load_model(&self) -> Result<TransformerModel>;

    /// Stream-load layer by layer (for large models)
    pub fn load_layer_stream(&self) -> impl Iterator<Item = LayerWeights>;
}

Model Configuration

#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct ModelConfig {
    pub model_type: ModelType,        // Llama, BERT, GPT
    pub num_layers: usize,
    pub hidden_size: usize,           // Embedding dimension
    pub intermediate_size: usize,     // FFN intermediate dimension
    pub num_attention_heads: usize,
    pub vocab_size: usize,
    pub max_position_embeddings: usize,
    pub quantization: Option<QuantizationType>,
}

#[derive(Debug, Clone)]
pub enum ModelType {
    Llama,
    LlamaRoPE,
    BERT,
    SentenceBERT,
    LFM2,
}

2.2 Activation Predictor

Responsibility: Predict which neurons will activate without computing full FFN.

Low-Rank Predictor Architecture

The predictor uses P·Q decomposition where:

  • P ∈ ℝ^(hidden_size × r): projection from hidden state
  • Q ∈ ℝ^(r × intermediate_size): projection to neuron scores
  • r ≪ hidden_size: rank (typically 64-256)

Input: x ∈ ℝ^hidden_size
Score: s = (x·P)·Q ∈ ℝ^intermediate_size  (low-rank computation)
Mask: m = (s > threshold)  (binary mask for active neurons)

Predictor Implementation

pub struct LowRankPredictor {
    /// P matrix: [hidden_size, rank]
    p_matrix: Tensor,
    /// Q matrix: [rank, intermediate_size]
    q_matrix: Tensor,
    /// Activation threshold per neuron
    thresholds: Vec<f32>,
    /// Neuron statistics (for threshold tuning)
    neuron_stats: Vec<NeuronStats>,
}

impl LowRankPredictor {
    /// Predict active neurons from hidden state
    pub fn predict(&self, hidden_state: &Tensor) -> NeuronMask {
        // scores = (hidden_state @ P) @ Q
        let proj = hidden_state.matmul(&self.p_matrix);  // [batch, rank]
        let scores = proj.matmul(&self.q_matrix);        // [batch, intermediate]

        // Apply thresholds
        let mask = scores.iter()
            .zip(&self.thresholds)
            .map(|(score, threshold)| score > threshold)
            .collect();

        NeuronMask::new(mask)
    }

    /// Get predicted sparsity ratio
    pub fn sparsity_ratio(&self, hidden_state: &Tensor) -> f32 {
        let mask = self.predict(hidden_state);
        mask.active_ratio()
    }
}

#[derive(Debug, Clone)]
pub struct NeuronMask {
    /// Boolean mask: true = compute, false = skip
    mask: Vec<bool>,
    /// Precomputed active indices (for sparse kernels)
    active_indices: Vec<usize>,
}

impl NeuronMask {
    pub fn active_count(&self) -> usize {
        self.active_indices.len()
    }

    pub fn active_ratio(&self) -> f32 {
        self.active_count() as f32 / self.mask.len() as f32
    }

    pub fn iter_active(&self) -> impl Iterator<Item = usize> + '_ {
        self.active_indices.iter().copied()
    }
}
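NeuronMask::new is invoked by predict() but never defined above. One minimal sketch, restating the struct fields from the definition above so the snippet is self-contained, precomputes the active index list at construction time:

```rust
/// Restated from the definition above so this example stands alone.
#[derive(Debug, Clone)]
pub struct NeuronMask {
    mask: Vec<bool>,
    active_indices: Vec<usize>,
}

impl NeuronMask {
    /// Precompute active indices once so sparse kernels can iterate
    /// over them without re-scanning the boolean mask.
    pub fn new(mask: Vec<bool>) -> Self {
        let active_indices: Vec<usize> = mask
            .iter()
            .enumerate()
            .filter_map(|(i, &on)| on.then_some(i))
            .collect();
        Self { mask, active_indices }
    }

    pub fn active_count(&self) -> usize {
        self.active_indices.len()
    }

    pub fn active_ratio(&self) -> f32 {
        self.active_count() as f32 / self.mask.len() as f32
    }
}
```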

Calibration Process

pub struct Calibrator {
    model: TransformerModel,
    config: CalibrationConfig,
}

#[derive(Debug, Clone)]
pub struct CalibrationConfig {
    /// Number of calibration samples
    pub num_samples: usize,
    /// Target sparsity (e.g., 0.2 = 80% neurons skipped)
    pub target_sparsity: f32,
    /// Predictor rank
    pub predictor_rank: usize,
    /// Calibration data source
    pub data_source: DataSource,
}

impl Calibrator {
    /// Run calibration to learn P, Q matrices
    pub fn calibrate(&mut self) -> Result<PredictorSet> {
        let samples = self.load_calibration_data()?;
        let mut predictors = Vec::new();

        for layer_idx in 0..self.model.num_layers() {
            // 1. Collect activation statistics
            let activations = self.collect_activations(layer_idx, &samples)?;

            // 2. Classify hot/cold neurons
            let classification = self.classify_neurons(&activations)?;

            // 3. Learn low-rank predictor
            let predictor = self.learn_predictor(
                layer_idx,
                &activations,
                &classification
            )?;

            predictors.push(predictor);
        }

        Ok(PredictorSet { predictors })
    }

    /// Collect FFN activations for layer
    fn collect_activations(
        &self,
        layer_idx: usize,
        samples: &[Tensor]
    ) -> Result<ActivationData> {
        let mut hidden_states = Vec::new();
        let mut ffn_activations = Vec::new();

        for input in samples {
            let hidden = self.model.forward_to_layer(input, layer_idx)?;
            let ffn_out = self.model.compute_ffn(layer_idx, &hidden)?;

            hidden_states.push(hidden);
            ffn_activations.push(ffn_out);
        }

        Ok(ActivationData {
            hidden_states,
            ffn_activations,
        })
    }

    /// Classify neurons as hot/cold based on activation frequency
    fn classify_neurons(&self, data: &ActivationData) -> Result<NeuronClassification> {
        let intermediate_size = data.ffn_activations[0].shape()[1];
        let mut activation_counts = vec![0usize; intermediate_size];

        // Count how often each neuron activates
        for activations in &data.ffn_activations {
            for (i, value) in activations.iter().enumerate() {
                if value.abs() > 1e-6 {  // Non-zero threshold
                    activation_counts[i] += 1;
                }
            }
        }

        // Compute activation frequency
        let total_samples = data.ffn_activations.len();
        let frequencies: Vec<f32> = activation_counts.iter()
            .map(|&count| count as f32 / total_samples as f32)
            .collect();

        // Classify: hot if frequency > threshold
        let hot_threshold = self.config.target_sparsity;
        let classification: Vec<NeuronType> = frequencies.iter()
            .map(|&freq| {
                if freq > hot_threshold {
                    NeuronType::Hot
                } else {
                    NeuronType::Cold
                }
            })
            .collect();

        Ok(NeuronClassification {
            types: classification,
            frequencies,
        })
    }

    /// Learn P, Q matrices via gradient descent
    fn learn_predictor(
        &self,
        layer_idx: usize,
        data: &ActivationData,
        classification: &NeuronClassification
    ) -> Result<LowRankPredictor> {
        let hidden_size = data.hidden_states[0].shape()[0];
        let intermediate_size = classification.types.len();
        let rank = self.config.predictor_rank;

        // Initialize P, Q with Xavier
        let mut p_matrix = Tensor::randn(&[hidden_size, rank]) * (2.0 / hidden_size as f32).sqrt();
        let mut q_matrix = Tensor::randn(&[rank, intermediate_size]) * (2.0 / rank as f32).sqrt();

        let mut optimizer = Adam::new(0.001);

        // Training loop
        for epoch in 0..100 {
            let mut total_loss = 0.0;

            for (hidden, target) in data.hidden_states.iter().zip(&data.ffn_activations) {
                // Forward: scores = (hidden @ P) @ Q
                let proj = hidden.matmul(&p_matrix);
                let scores = proj.matmul(&q_matrix);

                // Loss: binary cross-entropy on active/inactive neurons
                let loss = self.predictor_loss(&scores, target, classification);
                total_loss += loss.item();

                // Backward
                loss.backward();
                optimizer.step(&mut [&mut p_matrix, &mut q_matrix]);
            }

            if total_loss < 0.01 {
                break;  // Converged
            }
        }

        // Learn thresholds (per-neuron calibration)
        let thresholds = self.compute_thresholds(&p_matrix, &q_matrix, data)?;

        Ok(LowRankPredictor {
            p_matrix,
            q_matrix,
            thresholds,
            neuron_stats: self.compute_neuron_stats(classification),
        })
    }
}

#[derive(Debug, Clone, PartialEq)]
pub enum NeuronType {
    Hot,   // Activates frequently (above the calibrated frequency threshold)
    Cold,  // Activates rarely; candidate for offload
}
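compute_thresholds, called at the end of learn_predictor, is likewise left undefined. A plausible approach, shown here as a sketch over plain Vec<f32> predictor scores rather than Tensor, picks a per-neuron score quantile so that roughly keep_ratio of calibration samples exceed the threshold (e.g. keep_ratio = 0.2 targets 80% sparsity):

```rust
/// For each neuron, choose the threshold at which roughly `keep_ratio`
/// of calibration scores pass the strict `score > threshold` test used
/// by the predictor. `scores[sample][neuron]` holds predictor outputs.
pub fn compute_thresholds(scores: &[Vec<f32>], keep_ratio: f32) -> Vec<f32> {
    let num_neurons = scores[0].len();
    let num_samples = scores.len();
    let mut thresholds = Vec::with_capacity(num_neurons);

    for n in 0..num_neurons {
        // Gather and sort this neuron's scores across all samples.
        let mut col: Vec<f32> = scores.iter().map(|row| row[n]).collect();
        col.sort_by(|a, b| a.partial_cmp(b).unwrap());

        let keep = ((num_samples as f32) * keep_ratio).round() as usize;
        let thr = if keep == 0 {
            col[num_samples - 1]          // keep nothing above the max score
        } else if keep >= num_samples {
            f32::NEG_INFINITY             // keep everything
        } else {
            col[num_samples - keep - 1]   // exactly `keep` scores sit strictly above
        };
        thresholds.push(thr);
    }
    thresholds
}
```

With ties in the score distribution the kept fraction is approximate, which is why the document also tunes thresholds per neuron after training.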

2.3 Sparse FFN Computation

Responsibility: Compute FFN layer with only active neurons.

Standard FFN vs Sparse FFN

Standard FFN:

FFN(x) = down(activation(gate(x)) ⊙ up(x))
  where gate, up, down are full matrix multiplications and the
  activation (e.g. SiLU in Llama's SwiGLU) applies to the gate branch

Sparse FFN:

1. mask = Predictor(x)  (predict active neurons)
2. gate_active = gate(x)[mask]  (sparse matmul: only active columns)
3. up_active = up(x)[mask]
4. hidden = activation(gate_active) ⊙ up_active
5. output = down_active(hidden)  (sparse matmul: only active rows)
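The payoff of these steps can be sized with a back-of-envelope FLOP count. Using illustrative Llama-7B-scale dimensions (hidden 4096, intermediate 11008, rank-128 predictor; these numbers are assumptions, not from this document), a sketch:

```rust
/// Rough per-token FLOP estimate for one FFN layer, dense vs. sparse.
/// Each of the three projections (gate, up, down) costs
/// 2 * hidden * intermediate multiply-adds when dense; the sparse path
/// touches only the active columns/rows but pays for the low-rank
/// predictor: 2 * (hidden * rank + rank * intermediate).
pub fn ffn_speedup(hidden: usize, intermediate: usize, rank: usize, active_ratio: f64) -> f64 {
    let dense = 3.0 * 2.0 * (hidden * intermediate) as f64;
    let active = (intermediate as f64 * active_ratio).round();
    let sparse = 3.0 * 2.0 * hidden as f64 * active
        + 2.0 * (hidden * rank + rank * intermediate) as f64;
    dense / sparse
}
```

With 20% of neurons active the estimate lands around 4-5x for the FFN alone, consistent with the 2-10x end-to-end range claimed in the summary once attention and memory effects are folded in.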

Implementation

pub struct SparseFFN {
    /// Gate projection weights: [hidden_size, intermediate_size]
    gate_weights: Tensor,
    /// Up projection weights: [hidden_size, intermediate_size]
    up_weights: Tensor,
    /// Down projection weights: [intermediate_size, hidden_size]
    down_weights: Tensor,
    /// Activation function (SiLU, GELU, ReLU)
    activation: ActivationType,
    /// Predictor for this layer
    predictor: LowRankPredictor,
}

impl SparseFFN {
    pub fn forward(&self, hidden_state: &Tensor, backend: &dyn Backend) -> Result<Tensor> {
        // 1. Predict active neurons
        let mask = self.predictor.predict(hidden_state);

        if mask.active_ratio() > 0.8 {
            // Fallback to dense computation if too many neurons active
            return self.forward_dense(hidden_state, backend);
        }

        // 2. Sparse gate projection: only compute active columns
        let gate_active = backend.sparse_matmul_cols(
            hidden_state,
            &self.gate_weights,
            &mask
        )?;

        // 3. Sparse up projection
        let up_active = backend.sparse_matmul_cols(
            hidden_state,
            &self.up_weights,
            &mask
        )?;

        // 4. Gated activation: activation(gate) ⊙ up (element-wise)
        let activated = self.activation.apply(&gate_active)?.mul(&up_active)?;

        // 5. Sparse down projection: only active rows matter
        let output = backend.sparse_matmul_rows(
            &activated,
            &self.down_weights,
            &mask
        )?;

        Ok(output)
    }

    fn forward_dense(&self, hidden_state: &Tensor, backend: &dyn Backend) -> Result<Tensor> {
        // Standard dense FFN (fallback)
        let gate = hidden_state.matmul(&self.gate_weights)?;
        let up = hidden_state.matmul(&self.up_weights)?;
        let activated = self.activation.apply(&gate)?.mul(&up)?;
        activated.matmul(&self.down_weights)
    }
}

#[derive(Debug, Clone, Copy)]
pub enum ActivationType {
    SiLU,    // Llama models
    GELU,    // BERT models
    ReLU,    // Legacy models
}

impl ActivationType {
    pub fn apply(&self, x: &Tensor) -> Result<Tensor> {
        match self {
            Self::SiLU => x.mul(&x.sigmoid()?),  // x * σ(x)
            Self::GELU => x.gelu(),
            Self::ReLU => x.relu(),
        }
    }
}

Sparse Attention (Optional)

For very large models, attention can also be sparsified:

pub struct SparseAttention {
    /// Query, Key, Value weights (combined or separate)
    qkv_weights: Tensor,
    output_weights: Tensor,
    num_heads: usize,
    head_dim: usize,
    /// Attention mask pattern (e.g., local, strided)
    sparsity_pattern: AttentionPattern,
}

#[derive(Debug, Clone)]
pub enum AttentionPattern {
    /// Full attention (no sparsity)
    Full,
    /// Local attention (window size)
    Local { window_size: usize },
    /// Strided attention (BigBird style)
    Strided { stride: usize, window: usize },
    /// Learned sparse pattern
    Learned { mask: Tensor },
}
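As an illustration of the Local variant, a boolean [seq_len, seq_len] mask can be built so each query attends only to the window_size keys at or behind its own position. The causal-local convention here is an assumption; the document does not pin down the masking rule:

```rust
/// Build a causal local-attention mask: query i may attend key j iff
/// j <= i and i - j < window_size. Returned row-major, length seq_len².
pub fn local_attention_mask(seq_len: usize, window_size: usize) -> Vec<bool> {
    let mut mask = vec![false; seq_len * seq_len];
    for i in 0..seq_len {
        for j in 0..=i {
            if i - j < window_size {
                mask[i * seq_len + j] = true;
            }
        }
    }
    mask
}
```

This reduces attention cost from O(seq_len²) to O(seq_len · window_size); the Strided variant would additionally admit every stride-th key outside the window.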

2.4 Neuron Cache Manager

Responsibility: Manage hot/cold neuron weights in memory hierarchy.

Cache Architecture

┌──────────────────────────────────────────────┐
│          Neuron Cache Hierarchy              │
├──────────────────────────────────────────────┤
│  L1: Hot Neurons (GPU Memory / Fast RAM)     │
│      - 10-20% most active neurons            │
│      - Always resident                       │
│      - FP16/FP32 precision                   │
├──────────────────────────────────────────────┤
│  L2: Warm Neurons (System RAM)               │
│      - 30-40% moderately active              │
│      - Loaded on-demand                      │
│      - INT8/FP16 quantized                   │
├──────────────────────────────────────────────┤
│  L3: Cold Neurons (Disk / Compressed)        │
│      - 40-60% rarely active                  │
│      - Lazy load if predicted               │
│      - INT4/INT8 quantized                   │
└──────────────────────────────────────────────┘

Implementation

pub struct NeuronCache {
    /// Model configuration
    config: ModelConfig,
    /// Per-layer cache
    layers: Vec<LayerCache>,
    /// Memory budget (bytes)
    memory_budget: usize,
    /// Current memory usage
    memory_used: usize,
}

pub struct LayerCache {
    /// Hot neuron indices
    hot_neurons: Vec<usize>,
    /// Hot neuron weights (gate, up, down)
    hot_weights: HotWeights,
    /// Cold neuron weights (memory-mapped or compressed)
    cold_weights: ColdWeights,
    /// Neuron statistics
    stats: Vec<NeuronStats>,
}

#[derive(Debug, Clone)]
pub struct HotWeights {
    /// Gate weights for hot neurons: [hidden_size, num_hot]
    gate: Tensor,
    /// Up weights for hot neurons: [hidden_size, num_hot]
    up: Tensor,
    /// Down weights for hot neurons: [num_hot, hidden_size]
    down: Tensor,
}

pub enum ColdWeights {
    /// Memory-mapped file (lazy load)
    MemoryMapped {
        file: Mmap,
        offsets: Vec<usize>,
    },
    /// Compressed in-memory
    Compressed {
        data: Vec<u8>,
        codec: CompressionCodec,
    },
    /// Quantized INT4
    Quantized {
        data: Vec<u8>,
        scales: Vec<f32>,
    },
}

impl NeuronCache {
    /// Build cache from calibration results
    pub fn from_calibration(
        model: &TransformerModel,
        predictors: &PredictorSet,
        config: CacheConfig
    ) -> Result<Self> {
        let mut layers = Vec::new();

        for (layer_idx, predictor) in predictors.iter().enumerate() {
            // Extract hot/cold neurons
            let hot_neurons: Vec<usize> = predictor.neuron_stats.iter()
                .enumerate()
                .filter(|(_, stats)| stats.neuron_type == NeuronType::Hot)
                .map(|(idx, _)| idx)
                .collect();

            // Load hot neuron weights into fast memory
            let layer_weights = model.get_layer_weights(layer_idx)?;
            let hot_weights = Self::extract_hot_weights(&layer_weights, &hot_neurons)?;

            // Compress cold neuron weights
            let cold_weights = Self::compress_cold_weights(
                &layer_weights,
                &hot_neurons,
                config.compression
            )?;

            layers.push(LayerCache {
                hot_neurons,
                hot_weights,
                cold_weights,
                stats: predictor.neuron_stats.clone(),
            });
        }

        let memory_used = Self::calculate_memory(&layers);

        Ok(Self {
            config: model.config.clone(),
            layers,
            memory_budget: config.memory_budget,
            memory_used,
        })
    }

    /// Get weights for active neurons (hot cached, cold loaded on-demand)
    pub fn get_active_weights(
        &self,
        layer_idx: usize,
        mask: &NeuronMask
    ) -> Result<ActiveWeights> {
        let cache = &self.layers[layer_idx];

        // Separate hot and cold neurons in mask
        let (hot_indices, cold_indices) = self.split_hot_cold(cache, mask);

        // Hot neurons: direct lookup
        let hot_weights = self.gather_hot_weights(cache, &hot_indices)?;

        // Cold neurons: lazy load
        let cold_weights = if !cold_indices.is_empty() {
            self.load_cold_weights(cache, &cold_indices)?
        } else {
            None
        };

        Ok(ActiveWeights {
            hot: hot_weights,
            cold: cold_weights,
        })
    }

    fn load_cold_weights(
        &self,
        cache: &LayerCache,
        indices: &[usize]
    ) -> Result<Option<Tensor>> {
        match &cache.cold_weights {
            ColdWeights::MemoryMapped { file, offsets } => {
                // Lazy load from disk
                let mut weights = Vec::new();
                for &idx in indices {
                    let offset = offsets[idx];
                    let data = &file[offset..offset + self.weight_size()];
                    weights.extend_from_slice(data);
                }
                Ok(Some(Tensor::from_bytes(&weights)?))
            }
            ColdWeights::Quantized { data, scales } => {
                // Dequantize on-the-fly
                let weights = Self::dequantize(data, scales, indices)?;
                Ok(Some(weights))
            }
            _ => Ok(None),
        }
    }
}

#[derive(Debug, Clone)]
pub struct CacheConfig {
    /// Memory budget in bytes
    pub memory_budget: usize,
    /// Compression for cold neurons
    pub compression: CompressionCodec,
    /// Whether to use memory-mapped files
    pub use_mmap: bool,
}

#[derive(Debug, Clone, Copy)]
pub enum CompressionCodec {
    None,
    Quantize4Bit,
    Quantize8Bit,
    ZSTD,
}
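To make the Quantize8Bit codec concrete, here is one possible symmetric scheme over a block of cold-neuron weights. The per-block scale layout is an assumption; the `Quantized { data, scales }` variant above leaves the payload at byte granularity:

```rust
/// Symmetric 8-bit quantization for a weight block:
/// scale = max |w| / 127, payload stored as i8.
pub fn quantize_q8(weights: &[f32]) -> (Vec<i8>, f32) {
    let max = weights.iter().fold(0.0f32, |m, w| m.max(w.abs()));
    let scale = if max == 0.0 { 1.0 } else { max / 127.0 };
    let data: Vec<i8> = weights.iter().map(|w| (*w / scale).round() as i8).collect();
    (data, scale)
}

/// On-the-fly dequantization, as done in `load_cold_weights` above.
pub fn dequantize_q8(data: &[i8], scale: f32) -> Vec<f32> {
    data.iter().map(|&q| q as f32 * scale).collect()
}
```

Quantize4Bit would follow the same shape with two nibbles packed per byte and a clamp to [-8, 7], at roughly double the compression and higher reconstruction error.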

2.5 Execution Engine

Responsibility: Orchestrate layer-by-layer inference with sparse computation.

pub struct ExecutionEngine {
    /// Model configuration
    config: ModelConfig,
    /// Neuron cache
    cache: NeuronCache,
    /// Predictors (one per layer)
    predictors: PredictorSet,
    /// Backend for computation
    backend: Arc<dyn Backend>,
    /// Performance metrics
    metrics: Metrics,
}

impl ExecutionEngine {
    /// Run inference on input
    pub fn forward(&mut self, input: &Tensor) -> Result<Tensor> {
        let batch_size = input.shape()[0];

        // 1. Embedding layer (always dense)
        let mut hidden = self.embed(input)?;

        // 2. Transformer layers
        for layer_idx in 0..self.config.num_layers {
            let start = std::time::Instant::now();

            // Attention (dense or sparse)
            hidden = self.run_attention(layer_idx, &hidden)?;

            // Sparse FFN
            hidden = self.run_sparse_ffn(layer_idx, &hidden)?;

            self.metrics.record_layer(layer_idx, start.elapsed());
        }

        // 3. Output layer
        let output = self.output_projection(&hidden)?;

        Ok(output)
    }

    fn run_sparse_ffn(&mut self, layer_idx: usize, hidden: &Tensor) -> Result<Tensor> {
        // 1. Predict active neurons
        let predictor = &self.predictors[layer_idx];
        let mask = predictor.predict(hidden);

        self.metrics.record_sparsity(layer_idx, mask.active_ratio());

        // 2. Get active neuron weights from cache
        let weights = self.cache.get_active_weights(layer_idx, &mask)?;

        // 3. Sparse FFN computation
        let ffn = SparseFFN::new(weights, predictor.clone());
        let output = ffn.forward(hidden, self.backend.as_ref())?;

        Ok(output)
    }

    /// Get inference statistics
    pub fn metrics(&self) -> &Metrics {
        &self.metrics
    }
}

#[derive(Debug, Default)]
pub struct Metrics {
    /// Per-layer latency
    layer_latency: Vec<Duration>,
    /// Per-layer sparsity ratio
    layer_sparsity: Vec<f32>,
    /// Total tokens processed
    tokens_processed: usize,
    /// Cache hits/misses
    cache_hits: usize,
    cache_misses: usize,
}

impl Metrics {
    pub fn average_sparsity(&self) -> f32 {
        self.layer_sparsity.iter().sum::<f32>() / self.layer_sparsity.len() as f32
    }

    pub fn total_latency(&self) -> Duration {
        self.layer_latency.iter().sum()
    }

    pub fn tokens_per_second(&self) -> f32 {
        let total_secs = self.total_latency().as_secs_f32();
        self.tokens_processed as f32 / total_secs
    }
}

2.6 Backend Abstraction

Responsibility: Provide SIMD-optimized sparse operations across platforms.

pub trait Backend: Send + Sync {
    /// Sparse matrix multiplication: A @ B[:, mask]
    fn sparse_matmul_cols(
        &self,
        a: &Tensor,
        b: &Tensor,
        col_mask: &NeuronMask
    ) -> Result<Tensor>;

    /// Sparse matrix multiplication: A[mask, :] @ B
    fn sparse_matmul_rows(
        &self,
        a: &Tensor,
        b: &Tensor,
        row_mask: &NeuronMask
    ) -> Result<Tensor>;

    /// Dense matrix multiplication (fallback)
    fn matmul(&self, a: &Tensor, b: &Tensor) -> Result<Tensor>;

    /// Element-wise operations
    fn add(&self, a: &Tensor, b: &Tensor) -> Result<Tensor>;
    fn mul(&self, a: &Tensor, b: &Tensor) -> Result<Tensor>;

    /// Activation functions
    fn silu(&self, x: &Tensor) -> Result<Tensor>;
    fn gelu(&self, x: &Tensor) -> Result<Tensor>;

    /// Quantization
    fn quantize(&self, x: &Tensor, bits: u8) -> Result<(Tensor, Vec<f32>)>;
    fn dequantize(&self, x: &Tensor, scales: &[f32]) -> Result<Tensor>;
}

CPU Backend (AVX512 SIMD)

pub struct CpuBackend {
    num_threads: usize,
    simd_features: SimdFeatures,
}

#[derive(Debug, Clone)]
pub struct SimdFeatures {
    pub avx512: bool,
    pub avx2: bool,
    pub fma: bool,
    pub vnni: bool,  // INT8 acceleration
}

impl Backend for CpuBackend {
    fn sparse_matmul_cols(
        &self,
        a: &Tensor,
        b: &Tensor,
        col_mask: &NeuronMask
    ) -> Result<Tensor> {
        // A: [batch, hidden_size]
        // B: [hidden_size, intermediate_size]
        // Output: [batch, active_neurons]

        let active_cols = col_mask.iter_active().collect::<Vec<_>>();

        if self.simd_features.avx512 {
            // SAFETY: AVX-512F availability was checked at runtime above.
            unsafe { self.sparse_matmul_avx512(a, b, &active_cols) }
        } else if self.simd_features.avx2 {
            // SAFETY: AVX2 availability was checked at runtime above.
            unsafe { self.sparse_matmul_avx2(a, b, &active_cols) }
        } else {
            self.sparse_matmul_scalar(a, b, &active_cols)
        }
    }
}

impl CpuBackend {
    #[target_feature(enable = "avx512f")]
    unsafe fn sparse_matmul_avx512(
        &self,
        a: &Tensor,
        b: &Tensor,
        active_cols: &[usize]
    ) -> Result<Tensor> {
        // AVX-512: 16 f32 lanes per vector.
        // Assumes weights are stored transposed ([intermediate, hidden], row-major)
        // so each neuron's weight vector is contiguous, and hidden is a multiple of 16.

        let batch = a.shape()[0];
        let hidden = a.shape()[1];
        let num_active = active_cols.len();

        let mut output = Tensor::zeros(&[batch, num_active]);

        for row in 0..batch {
            for (out_idx, &col) in active_cols.iter().enumerate() {
                // Dot product: a[row, :] · b[:, col]
                let mut sum = _mm512_setzero_ps();

                for k in (0..hidden).step_by(16) {
                    let a_vec = _mm512_loadu_ps(&a.data()[row * hidden + k]);
                    let b_vec = _mm512_loadu_ps(&b.data()[col * hidden + k]);
                    sum = _mm512_fmadd_ps(a_vec, b_vec, sum);
                }

                output.data_mut()[row * num_active + out_idx] =
                    _mm512_reduce_add_ps(sum);
            }
        }

        Ok(output)
    }
}
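The sparse_matmul_scalar fallback referenced above is not shown. A reference version over plain slices, assuming the same transposed weight layout as the AVX-512 kernel's `b.data()[col * hidden + k]` indexing (B stored as [intermediate, hidden], row-major):

```rust
/// Scalar reference for the sparse column matmul:
/// out[row, j] = dot(a[row, :], b_t[active_cols[j], :]),
/// where b_t is the weight matrix stored transposed ([intermediate, hidden])
/// so each neuron's weight vector is a contiguous row.
pub fn sparse_matmul_scalar(
    a: &[f32],            // [batch, hidden], row-major
    b_t: &[f32],          // [intermediate, hidden], row-major (transposed weights)
    batch: usize,
    hidden: usize,
    active_cols: &[usize],
) -> Vec<f32> {
    let num_active = active_cols.len();
    let mut out = vec![0.0f32; batch * num_active];
    for row in 0..batch {
        for (j, &col) in active_cols.iter().enumerate() {
            let mut sum = 0.0f32;
            for k in 0..hidden {
                sum += a[row * hidden + k] * b_t[col * hidden + k];
            }
            out[row * num_active + j] = sum;
        }
    }
    out
}
```

The SIMD kernels compute exactly this, 16 (AVX-512) or 4 (WASM) lanes at a time, which makes this version useful as a correctness oracle in tests.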

WASM Backend (Portable SIMD)

#[cfg(target_arch = "wasm32")]
pub struct WasmBackend {
    simd_enabled: bool,
}

#[cfg(target_arch = "wasm32")]
impl Backend for WasmBackend {
    fn sparse_matmul_cols(
        &self,
        a: &Tensor,
        b: &Tensor,
        col_mask: &NeuronMask
    ) -> Result<Tensor> {
        use std::arch::wasm32::*;

        if !self.simd_enabled {
            return self.sparse_matmul_scalar(a, b, col_mask);
        }

        // WASM SIMD: 4x f32 per v128
        let active_cols = col_mask.iter_active().collect::<Vec<_>>();
        let batch = a.shape()[0];
        let hidden = a.shape()[1];
        let num_active = active_cols.len();

        let mut output = Tensor::zeros(&[batch, num_active]);

        for row in 0..batch {
            for (out_idx, &col) in active_cols.iter().enumerate() {
                let mut sum = f32x4_splat(0.0);

                for k in (0..hidden).step_by(4) {
                    // SAFETY: in-bounds unaligned loads; assumes the same transposed
                    // weight layout as the CPU kernel and hidden divisible by 4.
                    let a_vec = unsafe { v128_load(a.data().as_ptr().add(row * hidden + k) as *const v128) };
                    let b_vec = unsafe { v128_load(b.data().as_ptr().add(col * hidden + k) as *const v128) };
                    sum = f32x4_add(sum, f32x4_mul(a_vec, b_vec));
                }

                // Horizontal sum
                let result = f32x4_extract_lane::<0>(sum)
                    + f32x4_extract_lane::<1>(sum)
                    + f32x4_extract_lane::<2>(sum)
                    + f32x4_extract_lane::<3>(sum);

                output.data_mut()[row * num_active + out_idx] = result;
            }
        }

        Ok(output)
    }
}

3. Data Flow Architecture

3.1 Model Loading Flow

User: model_path
    │
    ▼
┌─────────────────────────────┐
│  Detect Model Format        │ (check extension: .gguf, .safetensors, .bin)
└────────┬────────────────────┘
         │
         ├──── .gguf ────────▶ GGUFLoader::open()
         │                       │
         │                       ├─ Parse header
         │                       ├─ Build tensor index
         │                       └─ Extract config
         │
         ├──── HF/ST ────────▶ HFLoader::from_pretrained()
         │                       │
         │                       ├─ Download if needed
         │                       ├─ Load safetensors
         │                       └─ Parse config.json
         │
         ▼
┌─────────────────────────────┐
│  TransformerModel           │
│  - Config                   │
│  - Layer weights            │
│  - Tokenizer                │
└─────────────────────────────┘
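The format-detection step at the top of this flow can be sketched as a simple match on the file extension. `ModelFormat` and `detect_format` are illustrative names, not the crate's actual API:

```rust
use std::path::Path;

/// Hypothetical model-format tag for the loading flow sketched above.
#[derive(Debug, PartialEq)]
enum ModelFormat {
    Gguf,
    Safetensors,
    PytorchBin,
}

/// Pick a loader based on the file extension; unknown extensions yield None.
fn detect_format(path: &str) -> Option<ModelFormat> {
    match Path::new(path).extension()?.to_str()? {
        "gguf" => Some(ModelFormat::Gguf),
        "safetensors" => Some(ModelFormat::Safetensors),
        "bin" => Some(ModelFormat::PytorchBin),
        _ => None,
    }
}
```

A real loader would also sniff magic bytes (GGUF files start with `GGUF`) rather than trusting the extension alone.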

3.2 Calibration Flow

TransformerModel
    │
    ▼
┌─────────────────────────────┐
│  Load Calibration Dataset   │ (WikiText, C4, custom)
│  - Sample 512-2048 examples │
└────────┬────────────────────┘
         │
         ▼
┌─────────────────────────────────────────┐
│  For each layer:                        │
│  1. Forward samples → collect           │
│     - Hidden states (input to FFN)      │
│     - FFN activations (output)          │
│                                         │
│  2. Analyze activations:                │
│     - Compute activation frequency      │
│     - Classify hot/cold neurons         │
│                                         │
│  3. Learn predictor:                    │
│     - Initialize P, Q matrices          │
│     - Train on (hidden → activation)    │
│     - Optimize thresholds               │
└────────┬────────────────────────────────┘
         │
         ▼
┌─────────────────────────────┐
│  PredictorSet + NeuronCache │
│  - P·Q matrices per layer   │
│  - Hot neuron weights       │
│  - Cold neuron offload      │
└─────────────────────────────┘
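Step 2 of the calibration loop (classifying hot vs. cold neurons) reduces to a threshold on per-neuron activation frequency. This is a minimal sketch; the names and the idea of a single scalar cutoff are assumptions:

```rust
/// Hot neurons are kept in fast memory; cold neurons are offloaded.
#[derive(Debug, PartialEq, Clone, Copy)]
enum NeuronType {
    Hot,
    Cold,
}

/// Classify each neuron by how often it activated over the calibration set.
/// `frequencies[i]` is the fraction of calibration samples in which neuron i
/// fired; `hot_threshold` is a tunable cutoff (illustrative, not a tuned value).
fn classify_neurons(frequencies: &[f32], hot_threshold: f32) -> Vec<NeuronType> {
    frequencies
        .iter()
        .map(|&f| if f >= hot_threshold { NeuronType::Hot } else { NeuronType::Cold })
        .collect()
}
```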

3.3 Inference Flow (Single Token)

Input Token(s)
    │
    ▼
┌─────────────────────────────┐
│  Tokenizer + Embedding      │
└────────┬────────────────────┘
         │
         ▼
┌───────────────────────────────────────────────────────┐
│  Layer 0:                                             │
│  ┌─────────────────────────────────────────────────┐ │
│  │ 1. Attention (dense)                            │ │
│  │    - Q, K, V projections                        │ │
│  │    - Scaled dot-product attention               │ │
│  │    - Output projection                          │ │
│  └─────────────────────────────────────────────────┘ │
│  ┌─────────────────────────────────────────────────┐ │
│  │ 2. Sparse FFN                                   │ │
│  │    a) Predictor(hidden) → mask [T/F/F/T/T/F...] │ │
│  │    b) Load weights for active neurons only      │ │
│  │    c) Sparse gate/up projections                │ │
│  │    d) Activation (SiLU/GELU)                    │ │
│  │    e) Sparse down projection                    │ │
│  └─────────────────────────────────────────────────┘ │
│  ┌─────────────────────────────────────────────────┐ │
│  │ 3. Residual + LayerNorm                         │ │
│  └─────────────────────────────────────────────────┘ │
└───────────────────┬───────────────────────────────────┘
                    │
                    ▼
                (Repeat for layers 1..N-1)
                    │
                    ▼
┌─────────────────────────────┐
│  Output Projection          │
│  - Linear(hidden, vocab)    │
│  - Softmax (for generation) │
└────────┬────────────────────┘
         │
         ▼
    Logits / Embedding
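The sparse FFN steps (2a-2e above) can be sketched for a single token as follows. `Vec<f32>` stands in for the engine's `Tensor` type, and all names here are illustrative assumptions:

```rust
/// SiLU activation, as used by Llama-family FFNs.
fn silu(x: f32) -> f32 {
    x / (1.0 + (-x).exp())
}

/// Sparse gated FFN for one token. `gate_w`, `up_w`, `down_w` are stored
/// per-neuron as [n_neurons][hidden] rows; `mask` comes from the predictor.
fn sparse_ffn(
    hidden: &[f32],
    gate_w: &[Vec<f32>],
    up_w: &[Vec<f32>],
    down_w: &[Vec<f32>],
    mask: &[bool],
) -> Vec<f32> {
    let mut out = vec![0.0f32; hidden.len()];
    for (n, &active) in mask.iter().enumerate() {
        if !active {
            continue; // cold neuron: skip all three projections entirely
        }
        let gate: f32 = gate_w[n].iter().zip(hidden).map(|(w, h)| w * h).sum();
        let up: f32 = up_w[n].iter().zip(hidden).map(|(w, h)| w * h).sum();
        let act = silu(gate) * up;
        // Sparse down projection: accumulate this neuron's contribution.
        for (o, w) in out.iter_mut().zip(&down_w[n]) {
            *o += act * w;
        }
    }
    out
}
```

The speedup comes from the `continue`: a masked-out neuron costs nothing in any of the three projections.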

4. Rust Module Structure

4.1 Crate Layout

crates/ruvector-sparse-inference/
├── Cargo.toml
├── build.rs                     # Build-time feature detection
├── README.md
└── src/
    ├── lib.rs                   # Public API
    ├── config.rs                # Configuration types
    ├── error.rs                 # Error types
    │
    ├── predictor/
    │   ├── mod.rs               # Predictor API
    │   ├── lowrank.rs           # P·Q low-rank predictor
    │   ├── calibration.rs       # Calibration logic
    │   └── threshold.rs         # Threshold optimization
    │
    ├── sparse/
    │   ├── mod.rs               # Sparse operations API
    │   ├── ffn.rs               # Sparse FFN layer
    │   ├── attention.rs         # Sparse attention (optional)
    │   └── kernels.rs           # SIMD kernels
    │
    ├── model/
    │   ├── mod.rs               # Model loading API
    │   ├── gguf.rs              # GGUF parser
    │   ├── hf.rs                # HuggingFace loader
    │   ├── loader.rs            # Generic loader trait
    │   └── runners.rs           # Model-specific runners (Llama, BERT)
    │
    ├── memory/
    │   ├── mod.rs               # Memory management API
    │   ├── cache.rs             # Neuron cache
    │   ├── quantization.rs      # Quantization utilities
    │   └── compression.rs       # Compression codecs
    │
    ├── backend/
    │   ├── mod.rs               # Backend trait
    │   ├── cpu.rs               # CPU SIMD backend
    │   ├── wasm.rs              # WASM SIMD backend
    │   └── gpu.rs               # GPU backend (future)
    │
    ├── integration/
    │   ├── mod.rs               # Integration API
    │   ├── ruvector.rs          # EmbeddingProvider impl
    │   └── ruvllm.rs            # InferenceBackend impl
    │
    └── utils/
        ├── mod.rs
        ├── tensor.rs            # Tensor utilities
        └── metrics.rs           # Performance tracking

4.2 Key Module Responsibilities

| Module      | Responsibility         | Dependencies               |
| ----------- | ---------------------- | -------------------------- |
| lib.rs      | Public API, re-exports | All modules                |
| config      | Configuration types    | None                       |
| error       | Error handling         | None                       |
| predictor   | Neuron prediction      | tensor, backend            |
| sparse      | Sparse computation     | predictor, backend, memory |
| model       | Model loading          | config, error              |
| memory      | Cache management       | model, predictor           |
| backend     | SIMD operations        | tensor                     |
| integration | Ruvector/RuvLLM        | All                        |

5. Key Traits and Interfaces

5.1 ModelRunner Trait

pub trait ModelRunner: Send + Sync {
    /// Get model configuration
    fn config(&self) -> &ModelConfig;

    /// Run inference on input tokens
    fn forward(&mut self, input_ids: &[u32]) -> Result<Tensor>;

    /// Encode text to embeddings (for embedding models)
    fn encode(&mut self, text: &str) -> Result<Vec<f32>>;

    /// Generate text (for language models)
    fn generate(&mut self, prompt: &str, max_tokens: usize) -> Result<String>;

    /// Get inference metrics
    fn metrics(&self) -> &Metrics;
}

// Implementations
pub struct LlamaRunner { /* ... */ }
pub struct BertRunner { /* ... */ }
pub struct LFM2Runner { /* ... */ }

impl ModelRunner for LlamaRunner { /* ... */ }
impl ModelRunner for BertRunner { /* ... */ }
impl ModelRunner for LFM2Runner { /* ... */ }

5.2 Predictor Trait

pub trait Predictor: Send + Sync {
    /// Predict active neurons from hidden state
    fn predict(&self, hidden_state: &Tensor) -> NeuronMask;

    /// Get predicted sparsity ratio
    fn sparsity_ratio(&self, hidden_state: &Tensor) -> f32;

    /// Get neuron statistics
    fn neuron_stats(&self) -> &[NeuronStats];
}

#[derive(Debug, Clone)]
pub struct NeuronStats {
    pub neuron_type: NeuronType,
    pub activation_frequency: f32,
    pub average_magnitude: f32,
}
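Behind `Predictor::predict` sits the P·Q low-rank scheme from the executive summary: score = Q·(P·h), then a threshold. A hedged sketch, with P as [rank][hidden] and Q as [n_neurons][rank] (the shapes and helper names are assumptions):

```rust
/// Dense matrix-vector product over row-major rows.
fn matvec(m: &[Vec<f32>], v: &[f32]) -> Vec<f32> {
    m.iter()
        .map(|row| row.iter().zip(v).map(|(a, b)| a * b).sum())
        .collect()
}

/// Predict the active-neuron mask: project the hidden state into the
/// low-rank space with P, score each neuron with Q, then threshold at tau.
fn predict_mask(p: &[Vec<f32>], q: &[Vec<f32>], hidden: &[f32], tau: f32) -> Vec<bool> {
    let low = matvec(p, hidden);  // [rank] — cheap: rank << n_neurons
    let scores = matvec(q, &low); // [n_neurons] — per-neuron activation score
    scores.iter().map(|&s| s > tau).collect()
}
```

The whole predictor costs O(rank · (hidden + n_neurons)) per token, which is why it can pay for itself against a full FFN pass.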

5.3 Cache Trait

pub trait Cache: Send + Sync {
    /// Get weights for active neurons
    fn get_active_weights(
        &self,
        layer_idx: usize,
        mask: &NeuronMask
    ) -> Result<ActiveWeights>;

    /// Get memory usage statistics
    fn memory_usage(&self) -> MemoryStats;

    /// Evict least-recently-used cold neurons
    fn evict(&mut self, size: usize) -> Result<()>;
}

#[derive(Debug, Clone)]
pub struct MemoryStats {
    pub hot_neurons_bytes: usize,
    pub cold_neurons_bytes: usize,
    pub predictor_bytes: usize,
    pub total_bytes: usize,
}
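`Cache::evict` can be realized with a simple LRU queue over cold-neuron blocks. The block bookkeeping below is an illustrative assumption, not the engine's actual cache layout:

```rust
use std::collections::VecDeque;

/// One offloadable block of cold-neuron weights (hypothetical bookkeeping).
struct ColdBlock {
    layer: usize,
    bytes: usize,
}

/// Evict least-recently-used blocks until at least `size` bytes are freed.
/// Returns the layer indices whose blocks were dropped.
fn evict_lru(lru: &mut VecDeque<ColdBlock>, size: usize) -> Vec<usize> {
    let mut freed = 0;
    let mut evicted = Vec::new();
    while freed < size {
        match lru.pop_front() {
            // Front of the queue is the least recently used block.
            Some(block) => {
                freed += block.bytes;
                evicted.push(block.layer);
            }
            None => break, // cache exhausted before the target was met
        }
    }
    evicted
}
```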

6. Integration Architecture

6.1 Ruvector EmbeddingProvider Integration

// In ruvector-core/src/embeddings.rs
pub trait EmbeddingProvider {
    fn embed(&self, text: &str) -> Result<Vec<f32>>;
    fn embed_batch(&self, texts: &[&str]) -> Result<Vec<Vec<f32>>>;
}

// New implementation in sparse-inference
impl EmbeddingProvider for SparseInferenceEngine {
    fn embed(&self, text: &str) -> Result<Vec<f32>> {
        // 1. Tokenize
        let tokens = self.tokenizer.encode(text)?;

        // 2. Run sparse inference
        let output = self.runner.forward(&tokens)?;

        // 3. Mean pooling (for sentence embeddings)
        let embedding = self.mean_pool(&output)?;

        Ok(embedding.to_vec())
    }

    fn embed_batch(&self, texts: &[&str]) -> Result<Vec<Vec<f32>>> {
        texts.iter().map(|text| self.embed(text)).collect()
    }
}

// Usage
let engine = SparseInferenceEngine::from_pretrained(
    "TaylorAI/gte-tiny",
    SparseConfig::default()
)?;

let rv = RuVector::builder()
    .embedding_provider(Box::new(engine))
    .build()?;

rv.insert("test", "Hello world")?;
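The `mean_pool` step referenced in `embed` averages token embeddings across the sequence axis. A self-contained sketch over `Vec<f32>` rows (the real engine operates on its `Tensor` type):

```rust
/// Mean-pool a [seq_len][hidden_dim] matrix of token embeddings into a
/// single sentence embedding. Panics on an empty sequence; a production
/// version would return a Result instead.
fn mean_pool(tokens: &[Vec<f32>]) -> Vec<f32> {
    let seq_len = tokens.len();
    let hidden = tokens[0].len();
    let mut pooled = vec![0.0f32; hidden];
    for tok in tokens {
        for (p, x) in pooled.iter_mut().zip(tok) {
            *p += x; // accumulate per-dimension
        }
    }
    for p in pooled.iter_mut() {
        *p /= seq_len as f32; // average over sequence length
    }
    pooled
}
```

For models trained with attention-mask-aware pooling, padding positions would be excluded from both the sum and the divisor.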

6.2 RuvLLM InferenceBackend Integration

// In ruvllm/src/backend.rs
pub trait InferenceBackend {
    fn generate(&mut self, prompt: &str, config: GenerateConfig) -> Result<String>;
    fn logits(&mut self, prompt: &str) -> Result<Vec<f32>>;
}

// Implementation
impl InferenceBackend for SparseInferenceEngine {
    fn generate(&mut self, prompt: &str, config: GenerateConfig) -> Result<String> {
        let mut tokens = self.tokenizer.encode(prompt)?;
        let mut output = String::new();

        for _ in 0..config.max_tokens {
            // Sparse inference
            let logits = self.runner.forward(&tokens)?;

            // Sample next token
            // Sample next token; stop before appending the EOS marker text
            let next_token = self.sample(&logits, config.temperature)?;
            if next_token == self.tokenizer.eos_token() {
                break;
            }
            tokens.push(next_token);

            // Decode and append
            let text = self.tokenizer.decode(&[next_token])?;
            output.push_str(&text);
        }

        Ok(output)
    }
}

// Usage
let engine = SparseInferenceEngine::from_pretrained(
    "TheBloke/Llama-2-7B-GGUF",
    SparseConfig::default()
)?;

let llm = RuvLLM::builder()
    .backend(Box::new(engine))
    .build()?;

let response = llm.generate("Explain quantum computing", Default::default())?;
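The `sample` step inside `generate` can be sketched in its simplest greedy form (argmax over logits, i.e. the temperature → 0 limit). A real sampler would apply temperature scaling and draw from the softmax distribution:

```rust
/// Greedy sampling: pick the token id with the highest logit.
/// Illustrative stand-in for the engine's temperature-based `sample`.
fn sample_greedy(logits: &[f32]) -> u32 {
    logits
        .iter()
        .enumerate()
        .max_by(|(_, a), (_, b)| a.partial_cmp(b).unwrap())
        .map(|(i, _)| i as u32)
        .expect("logits must be non-empty")
}
```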

7. Performance Targets

7.1 Latency Targets

| Model     | Operation          | Target Latency | Baseline | Speedup |
| --------- | ------------------ | -------------- | -------- | ------- |
| LFM2-350M | Sentence embedding | 5-10 ms        | 25 ms    | 2.5-5x  |
| BERT-base | Sentence embedding | 8-15 ms        | 40 ms    | 2.7-5x  |
| Llama-7B  | Token generation   | 50-100 ms      | 500 ms   | 5-10x   |
| Llama-13B | Token generation   | 100-200 ms     | 1.2 s    | 6-12x   |

7.2 Memory Targets

| Model           | Baseline RAM | Sparse RAM | Reduction |
| --------------- | ------------ | ---------- | --------- |
| LFM2-350M       | 1.4 GB       | 700 MB     | 2x        |
| Llama-7B (FP16) | 14 GB        | 7-9 GB     | 1.5-2x    |
| Llama-7B (Q4)   | 4 GB         | 2.5-3 GB   | 1.3-1.6x  |

7.3 Accuracy Targets

  • Embedding similarity: >0.99 cosine similarity to dense baseline
  • Generation quality: <1% perplexity increase
  • Classification accuracy: <0.5% drop on downstream tasks

7.4 Sparsity Targets

| Layer Type    | Target Sparsity | Active Neurons    |
| ------------- | --------------- | ----------------- |
| Early layers  | 60-70%          | 30-40% of compute |
| Middle layers | 70-85%          | 15-30% of compute |
| Late layers   | 50-60%          | 40-50% of compute |
| Average       | 70-80%          | 20-30% of compute |
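As a sanity check on these targets: at sparsity s, only (1 − s) of FFN neurons are computed, so the ideal FFN-only speedup is 1/(1 − s). End-to-end speedup is lower because attention stays dense (Amdahl's law). A trivial helper:

```rust
/// Ideal FFN-only speedup at a given sparsity ratio, ignoring predictor
/// overhead and the dense attention path (back-of-envelope only).
fn ideal_ffn_speedup(sparsity: f32) -> f32 {
    assert!((0.0..1.0).contains(&sparsity), "sparsity must be in [0, 1)");
    1.0 / (1.0 - sparsity)
}
```

At the 70-80% average above this gives roughly 3.3-5x on the FFN, consistent with the 2-10x system-level range in the executive summary once attention and predictor costs are folded in.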

8. Deployment Architecture

8.1 CPU Deployment

┌─────────────────────────────────────┐
│  Application Process                │
├─────────────────────────────────────┤
│  ┌───────────────────────────────┐  │
│  │  Sparse Inference Engine      │  │
│  │  - Hot neurons: RAM (1-2 GB)  │  │
│  │  - Cold neurons: mmap disk    │  │
│  │  - SIMD: AVX-512 / AVX2       │  │
│  └───────────────────────────────┘  │
│  ┌───────────────────────────────┐  │
│  │  Thread Pool (rayon)          │  │
│  │  - Parallel batch processing  │  │
│  └───────────────────────────────┘  │
└─────────────────────────────────────┘

8.2 WASM Deployment

┌─────────────────────────────────────┐
│  Browser / Node.js                  │
├─────────────────────────────────────┤
│  ┌───────────────────────────────┐  │
│  │  WASM Module                  │  │
│  │  - Hot neurons: ArrayBuffer   │  │
│  │  - SIMD: simd128 (if avail)   │  │
│  │  - Memory limit: 2-4 GB       │  │
│  └───────────────────────────────┘  │
│  ┌───────────────────────────────┐  │
│  │  Worker Pool                  │  │
│  │  - Parallel inference         │  │
│  └───────────────────────────────┘  │
└─────────────────────────────────────┘

8.3 Hybrid Deployment

┌─────────────────────────────────────┐
│  Cloud GPU (Hot path)               │
│  - 20% most frequent queries        │
│  - Full dense inference             │
│  - <10ms latency                    │
└──────────────┬──────────────────────┘
               │
               ▼
┌─────────────────────────────────────┐
│  Edge CPU (Cold path)               │
│  - 80% long-tail queries            │
│  - Sparse inference                 │
│  - 20-50ms latency                  │
│  - 10x lower cost                   │
└─────────────────────────────────────┘
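The routing decision in this hybrid split can be sketched as a threshold on observed query frequency; hot queries take the dense GPU path, the long tail takes sparse edge CPU. Names and the cutoff are illustrative:

```rust
/// Deployment tier for a given query (hypothetical router).
#[derive(Debug, PartialEq)]
enum Tier {
    CloudGpu, // hot path: dense inference, lowest latency
    EdgeCpu,  // cold path: sparse inference, ~10x lower cost
}

/// Route by frequency percentile: queries at or above `hot_cutoff`
/// (e.g. the top 20%) go to the GPU tier.
fn route(query_frequency_percentile: f32, hot_cutoff: f32) -> Tier {
    if query_frequency_percentile >= hot_cutoff {
        Tier::CloudGpu
    } else {
        Tier::EdgeCpu
    }
}
```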

9. Future Enhancements

9.1 Phase 2 Features

  • Dynamic sparsity: Adjust predictor thresholds at runtime
  • Multi-modal: Support vision-language models (CLIP, LLaVA)
  • Quantization-aware: INT8/INT4 predictor matrices
  • GPU kernels: CUDA/Metal sparse kernels
  • NPU support: Apple Neural Engine, Qualcomm Hexagon

9.2 Phase 3 Features

  • Learned sparsity patterns: Train end-to-end with sparsity loss
  • Mixture-of-experts: Combine with MoE models
  • Speculative decoding: Sparse draft models + dense verification
  • Cross-layer optimization: Share predictors across layers

10. References and Inspiration

  1. PowerInfer (SOSP'24): Fast LLM serving with activation locality on consumer GPUs
  2. Deja Vu (ICML'23): Contextual sparsity in transformers
  3. CATS (2024): Contextually-aware thresholding for activation sparsity
  4. FlashAttention: Memory-efficient attention
  5. GPTQ/AWQ: Weight quantization for LLMs

Appendix A: Configuration Examples

A.1 LFM2 Embedding Configuration

let config = SparseConfig {
    model_path: "TaylorAI/gte-tiny".to_string(),
    predictor_rank: 128,
    target_sparsity: 0.75,
    cache_config: CacheConfig {
        memory_budget: 1024 * 1024 * 1024,  // 1 GB
        use_mmap: false,
        compression: CompressionCodec::None,
    },
    backend: BackendType::CpuAvx2,
    calibration: Some(CalibrationConfig {
        num_samples: 1024,
        data_source: DataSource::WikiText,
    }),
};

A.2 Llama-7B Generation Configuration

let config = SparseConfig {
    model_path: "TheBloke/Llama-2-7B-GGUF".to_string(),
    predictor_rank: 256,
    target_sparsity: 0.80,
    cache_config: CacheConfig {
        memory_budget: 8 * 1024 * 1024 * 1024,  // 8 GB
        use_mmap: true,
        compression: CompressionCodec::Quantize4Bit,
    },
    backend: BackendType::CpuAvx512,
    calibration: Some(CalibrationConfig {
        num_samples: 2048,
        data_source: DataSource::C4,
    }),
};

Appendix B: Benchmarking Protocol

B.1 Latency Benchmarking

# Embedding models
cargo bench --bench embeddings -- \
  --model gte-tiny \
  --batch-sizes 1,8,32 \
  --sequence-lengths 16,64,256

# Generation models
cargo bench --bench generation -- \
  --model llama-7b \
  --prompt-lengths 32,128,512 \
  --generate-lengths 32,128

B.2 Accuracy Evaluation

# STS-B (semantic similarity)
cargo run --release --bin eval-sts \
  --model gte-tiny \
  --sparse \
  --dataset data/stsbenchmark

# MMLU (language understanding)
cargo run --release --bin eval-mmlu \
  --model llama-7b \
  --sparse \
  --subset abstract_algebra,anatomy

End of Architecture Document