Sparse Inference Engine Architecture
PowerInfer-style Activation Locality for Ruvector
Version: 1.0.0 Date: 2026-01-05 Status: Design Phase
Executive Summary
This document defines the architecture for a sparse inference engine that exploits activation locality in transformer models. The system achieves 2-10x speedup with <1% accuracy loss by:
- Predicting which neurons will activate using low-rank matrices (P·Q)
- Computing only active neurons in FFN layers
- Caching hot neurons in fast memory
- Offloading cold neurons to slower storage
Key Innovation: Unlike model-wide quantization or pruning, we perform neuron-level sparse computation at runtime based on learned activation patterns.
1. System Architecture Overview
1.1 High-Level Architecture
┌─────────────────────────────────────────────────────────────────────┐
│ Sparse Inference Engine │
├─────────────────────────────────────────────────────────────────────┤
│ │
│ ┌───────────────┐ ┌──────────────┐ ┌─────────────────┐ │
│ │ Model Loader │─────▶│ Calibrator │─────▶│ Execution Engine│ │
│ │ │ │ │ │ │ │
│ │ • GGUF Parser │ │ • P·Q Learn │ │ • Layer Exec │ │
│ │ • HF Loader │ │ • Threshold │ │ • Sparse Compute│ │
│ │ • Safetensors │ │ • Neuron Map │ │ • Backend Route │ │
│ └───────────────┘ └──────────────┘ └─────────────────┘ │
│ │ │ │ │
│ ▼ ▼ ▼ │
│ ┌───────────────────────────────────────────────────────────────┐ │
│ │ Neuron Cache Manager │ │
│ │ ┌──────────────┐ ┌───────────────┐ ┌──────────────────┐ │ │
│ │ │ Hot Neurons │ │ Predictor Map │ │ Cold Neurons │ │ │
│ │ │ (GPU/Memory) │ │ (P·Q Matrices)│ │ (Disk/Offload) │ │ │
│ │ └──────────────┘ └───────────────┘ └──────────────────┘ │ │
│ └───────────────────────────────────────────────────────────────┘ │
│ │ │ │ │
│ ▼ ▼ ▼ │
│ ┌───────────────────────────────────────────────────────────────┐ │
│ │ Backend Abstraction │ │
│ │ ┌─────────┐ ┌──────────┐ ┌──────────┐ ┌──────────────┐ │ │
│ │ │ CPU SIMD│ │ WASM SIMD│ │ GPU/Metal│ │ NPU (future) │ │ │
│ │ │ (AVX512)│ │ (128-bit)│ │ (Compute)│ │ │ │ │
│ │ └─────────┘ └──────────┘ └──────────┘ └──────────────┘ │ │
│ └───────────────────────────────────────────────────────────────┘ │
│ │
├─────────────────────────────────────────────────────────────────────┤
│ Integration Layer │
│ ┌──────────────────────────┐ ┌──────────────────────────────┐ │
│ │ Ruvector Integration │ │ RuvLLM Integration │ │
│ │ • EmbeddingProvider │ │ • InferenceBackend trait │ │
│ │ • Sparse embed() calls │ │ • generate() with sparsity │ │
│ └──────────────────────────┘ └──────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────────┘
1.2 Component Interaction Flow
User Request
│
▼
┌─────────────────┐
│ Model Selection │ (LFM2, sentence-bert, Llama GGUF)
└────────┬────────┘
│
▼
┌─────────────────┐
│ Model Loader │ Parse GGUF/HF → Extract layers, weights, config
└────────┬────────┘
│
▼
┌─────────────────┐
│ Calibration │ Feed sample data → Learn P·Q matrices → Classify neurons
└────────┬────────┘ (Optional: Skip if pre-calibrated model)
│
▼
┌─────────────────┐
│ Inference Setup │ Load hot neurons → Offload cold → Build predictor
└────────┬────────┘
│
▼
┌─────────────────────────────────────────────┐
│ Runtime Inference Loop │
│ │
│ Input → Embedding │
│ │ │
│ ▼ │
│ For each layer: │
│ 1. Predictor(x) → Active neuron mask │
│ 2. Sparse Attention (full or sparse) │
│ 3. Sparse FFN (only active neurons) │
│ 4. LayerNorm + Residual │
│ │ │
│ ▼ │
│ Output (embeddings/logits) │
└─────────────────────────────────────────────┘
│
▼
User Result
2. Core Components
2.1 Model Loader
Responsibility: Parse and load transformer models from various formats.
Supported Formats
| Format | Use Case | Priority |
|---|---|---|
| GGUF | Quantized Llama models (q4_0, q8_0) | P0 |
| HuggingFace | Sentence transformers (LFM2, BERT) | P0 |
| Safetensors | Modern PyTorch exports | P1 |
| ONNX | Cross-platform inference | P2 |
GGUF Parser Details
pub struct GGUFLoader {
/// File handle to .gguf model
file: File,
/// Parsed metadata
metadata: GGUFMetadata,
/// Tensor mappings
tensor_index: HashMap<String, TensorInfo>,
}
impl GGUFLoader {
/// Parse header and build tensor index
pub fn open(path: &Path) -> Result<Self>;
/// Load specific layer weights
pub fn load_layer(&self, layer_idx: usize) -> Result<LayerWeights>;
/// Extract model config (n_layers, hidden_size, etc.)
pub fn config(&self) -> ModelConfig;
/// Check quantization type (Q4_0, Q8_0, F16)
pub fn quantization(&self) -> QuantizationType;
}
pub struct LayerWeights {
pub attention_qkv: Tensor, // Combined Q,K,V weights
pub attention_output: Tensor,
pub ffn_gate: Tensor, // FFN gate projection
pub ffn_up: Tensor, // FFN up-projection
pub ffn_down: Tensor, // FFN down-projection
pub norm1: Tensor, // Pre-attention norm
pub norm2: Tensor, // Pre-FFN norm
}
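As a usage sketch (the file path is hypothetical, and QuantizationType is assumed to derive Debug; signatures as declared above):
use std::path::Path;

// Open a GGUF checkpoint and stream its layers one at a time, so weights
// never need to be fully resident before calibration begins.
let loader = GGUFLoader::open(Path::new("models/llama-2-7b.Q4_0.gguf"))?;
let config = loader.config();
println!("{} layers, quantization: {:?}", config.num_layers, loader.quantization());

for layer_idx in 0..config.num_layers {
    let weights: LayerWeights = loader.load_layer(layer_idx)?;
    // ... hand off to the calibrator or execution engine ...
}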
HuggingFace Loader
pub struct HFLoader {
model_id: String,
cache_dir: PathBuf,
tokenizer: Tokenizer,
}
impl HFLoader {
/// Download or load cached model
pub fn from_pretrained(model_id: &str) -> Result<Self>;
/// Load full model into memory
pub fn load_model(&self) -> Result<TransformerModel>;
/// Stream-load layer by layer (for large models)
pub fn load_layer_stream(&self) -> impl Iterator<Item = LayerWeights>;
}
Model Configuration
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct ModelConfig {
pub model_type: ModelType, // Llama, BERT, GPT
pub num_layers: usize,
pub hidden_size: usize, // Embedding dimension
pub intermediate_size: usize, // FFN intermediate dimension
pub num_attention_heads: usize,
pub vocab_size: usize,
pub max_position_embeddings: usize,
pub quantization: Option<QuantizationType>,
}
#[derive(Debug, Clone)]
pub enum ModelType {
Llama,
LlamaRoPE,
BERT,
SentenceBERT,
LFM2,
}
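For concreteness, the published Llama-2-7B dimensions expressed in this schema (the struct literal itself is illustrative; Q4_0 as a QuantizationType variant is assumed from the GGUF section above):
let llama2_7b = ModelConfig {
    model_type: ModelType::Llama,
    num_layers: 32,
    hidden_size: 4096,
    intermediate_size: 11008,
    num_attention_heads: 32,
    vocab_size: 32000,
    max_position_embeddings: 4096,
    quantization: Some(QuantizationType::Q4_0),
};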
2.2 Activation Predictor
Responsibility: Predict which neurons will activate without computing full FFN.
Low-Rank Predictor Architecture
The predictor uses P·Q decomposition where:
- P ∈ ℝ^(hidden_size × r): Projection from hidden state
- Q ∈ ℝ^(r × intermediate_size): Projection to neuron scores
- r << hidden_size: Rank (typically 64-256)
Input: x ∈ ℝ^hidden_size
Score: s = (x·P)·Q ∈ ℝ^intermediate_size (low-rank computation; x is a row vector, matching the matmul order in the code below)
Mask: m = (s > threshold) (binary mask for active neurons)
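To make the savings concrete, a worked example with typical Llama-7B dimensions (hidden_size = 4096, intermediate_size = 11008) and rank r = 128: the predictor costs 4096·128 + 128·11008 ≈ 1.9M multiply-adds per token, versus 4096·11008 ≈ 45.1M for a single dense gate projection, i.e. roughly 4% overhead. That overhead is repaid whenever the resulting mask lets the FFN skip even a small fraction of its three full projections.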
Predictor Implementation
pub struct LowRankPredictor {
/// P matrix: [hidden_size, rank]
p_matrix: Tensor,
/// Q matrix: [rank, intermediate_size]
q_matrix: Tensor,
/// Activation threshold per neuron
thresholds: Vec<f32>,
/// Neuron statistics (for threshold tuning)
neuron_stats: Vec<NeuronStats>,
}
impl LowRankPredictor {
/// Predict active neurons from hidden state
pub fn predict(&self, hidden_state: &Tensor) -> NeuronMask {
// scores = (hidden_state @ P) @ Q
let proj = hidden_state.matmul(&self.p_matrix); // [batch, rank]
let scores = proj.matmul(&self.q_matrix); // [batch, intermediate]
// Apply thresholds
let mask = scores.iter()
.zip(&self.thresholds)
.map(|(score, threshold)| score > threshold)
.collect();
NeuronMask::new(mask)
}
/// Get predicted sparsity ratio
pub fn sparsity_ratio(&self, hidden_state: &Tensor) -> f32 {
let mask = self.predict(hidden_state);
mask.active_ratio()
}
}
#[derive(Debug, Clone)]
pub struct NeuronMask {
/// Boolean mask: true = compute, false = skip
mask: Vec<bool>,
/// Precomputed active indices (for sparse kernels)
active_indices: Vec<usize>,
}
impl NeuronMask {
pub fn active_count(&self) -> usize {
self.active_indices.len()
}
pub fn active_ratio(&self) -> f32 {
self.active_count() as f32 / self.mask.len() as f32
}
pub fn iter_active(&self) -> impl Iterator<Item = usize> + '_ {
self.active_indices.iter().copied()
}
}
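predict() above calls a NeuronMask::new constructor that is not shown; a minimal sketch that derives the precomputed index list from the boolean mask:
impl NeuronMask {
    /// Build a mask and precompute the active indices in one pass.
    pub fn new(mask: Vec<bool>) -> Self {
        let active_indices = mask
            .iter()
            .enumerate()
            .filter_map(|(idx, &active)| active.then_some(idx))
            .collect();
        Self { mask, active_indices }
    }
}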
Calibration Process
pub struct Calibrator {
model: TransformerModel,
config: CalibrationConfig,
}
#[derive(Debug, Clone)]
pub struct CalibrationConfig {
/// Number of calibration samples
pub num_samples: usize,
/// Target sparsity (e.g., 0.8 = 80% of neurons skipped)
pub target_sparsity: f32,
/// Predictor rank
pub predictor_rank: usize,
/// Calibration data source
pub data_source: DataSource,
}
impl Calibrator {
/// Run calibration to learn P, Q matrices
pub fn calibrate(&mut self) -> Result<PredictorSet> {
let samples = self.load_calibration_data()?;
let mut predictors = Vec::new();
for layer_idx in 0..self.model.num_layers() {
// 1. Collect activation statistics
let activations = self.collect_activations(layer_idx, &samples)?;
// 2. Classify hot/cold neurons
let classification = self.classify_neurons(&activations)?;
// 3. Learn low-rank predictor
let predictor = self.learn_predictor(
layer_idx,
&activations,
&classification
)?;
predictors.push(predictor);
}
Ok(PredictorSet { predictors })
}
/// Collect FFN activations for layer
fn collect_activations(
&self,
layer_idx: usize,
samples: &[Tensor]
) -> Result<ActivationData> {
let mut hidden_states = Vec::new();
let mut ffn_activations = Vec::new();
for input in samples {
let hidden = self.model.forward_to_layer(input, layer_idx)?;
let ffn_out = self.model.compute_ffn(layer_idx, &hidden)?;
hidden_states.push(hidden);
ffn_activations.push(ffn_out);
}
Ok(ActivationData {
hidden_states,
ffn_activations,
})
}
/// Classify neurons as hot/cold based on activation frequency
fn classify_neurons(&self, data: &ActivationData) -> Result<NeuronClassification> {
let intermediate_size = data.ffn_activations[0].shape()[1];
let mut activation_counts = vec![0usize; intermediate_size];
// Count how often each neuron activates
for activations in &data.ffn_activations {
for (i, value) in activations.iter().enumerate() {
if value.abs() > 1e-6 { // Non-zero threshold
activation_counts[i] += 1;
}
}
}
// Compute activation frequency
let total_samples = data.ffn_activations.len();
let frequencies: Vec<f32> = activation_counts.iter()
.map(|&count| count as f32 / total_samples as f32)
.collect();
// Classify: hot if frequency > threshold
let hot_threshold = self.config.target_sparsity;
let classification: Vec<NeuronType> = frequencies.iter()
.map(|&freq| {
if freq > hot_threshold {
NeuronType::Hot
} else {
NeuronType::Cold
}
})
.collect();
Ok(NeuronClassification {
types: classification,
frequencies,
})
}
/// Learn P, Q matrices via gradient descent
fn learn_predictor(
&self,
layer_idx: usize,
data: &ActivationData,
classification: &NeuronClassification
) -> Result<LowRankPredictor> {
let hidden_size = data.hidden_states[0].shape()[0];
let intermediate_size = classification.types.len();
let rank = self.config.predictor_rank;
// Initialize P, Q with He-style scaled random values
let mut p_matrix = Tensor::randn(&[hidden_size, rank]) * (2.0 / hidden_size as f32).sqrt();
let mut q_matrix = Tensor::randn(&[rank, intermediate_size]) * (2.0 / rank as f32).sqrt();
let optimizer = Adam::new(0.001);
// Training loop
for epoch in 0..100 {
let mut total_loss = 0.0;
for (hidden, target) in data.hidden_states.iter().zip(&data.ffn_activations) {
// Forward: scores = (hidden @ P) @ Q
let proj = hidden.matmul(&p_matrix);
let scores = proj.matmul(&q_matrix);
// Loss: binary cross-entropy on active/inactive neurons
let loss = self.predictor_loss(&scores, target, classification);
total_loss += loss.item();
// Backward
loss.backward();
optimizer.step(&mut [&mut p_matrix, &mut q_matrix]);
}
if total_loss < 0.01 {
break; // Converged
}
}
// Learn thresholds (per-neuron calibration)
let thresholds = self.compute_thresholds(&p_matrix, &q_matrix, data)?;
Ok(LowRankPredictor {
p_matrix,
q_matrix,
thresholds,
neuron_stats: self.compute_neuron_stats(classification),
})
}
}
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum NeuronType {
Hot, // Activates frequently (above the calibrated frequency threshold)
Cold, // Activates rarely (below the threshold; candidate for offload)
}
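compute_thresholds(), referenced in learn_predictor() above, is left undefined; one plausible per-neuron rule, sketched here under the assumption that the calibrator biases toward recall (a missed active neuron costs accuracy, a spurious one only costs compute):
/// Pick a threshold for one neuron from the predictor scores it received on
/// calibration samples where it was truly active: keep `recall` of those
/// samples above the threshold. (Sketch; the real routine would batch this.)
fn threshold_for_neuron(scores_when_active: &mut [f32], recall: f32) -> f32 {
    if scores_when_active.is_empty() {
        return f32::INFINITY; // neuron never fired during calibration
    }
    scores_when_active.sort_by(|a, b| a.partial_cmp(b).unwrap());
    let cut = ((1.0 - recall) * scores_when_active.len() as f32) as usize;
    scores_when_active[cut.min(scores_when_active.len() - 1)]
}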
2.3 Sparse FFN Computation
Responsibility: Compute FFN layer with only active neurons.
Standard FFN vs Sparse FFN
Standard FFN (SwiGLU-style):
FFN(x) = down(activation(gate(x)) ⊙ up(x))
where gate, up, down are full matrix multiplications
Sparse FFN:
1. mask = Predictor(x) (predict active neurons)
2. gate_active = gate(x)[mask] (sparse matmul: only active columns)
3. up_active = up(x)[mask]
4. hidden = activation(gate_active) ⊙ up_active
5. output = down_active(hidden) (sparse matmul: only active rows)
Implementation
pub struct SparseFFN {
/// Gate projection weights: [hidden_size, intermediate_size]
gate_weights: Tensor,
/// Up projection weights: [hidden_size, intermediate_size]
up_weights: Tensor,
/// Down projection weights: [intermediate_size, hidden_size]
down_weights: Tensor,
/// Activation function (SiLU, GELU, ReLU)
activation: ActivationType,
/// Predictor for this layer
predictor: LowRankPredictor,
}
impl SparseFFN {
pub fn forward(&self, hidden_state: &Tensor, backend: &dyn Backend) -> Result<Tensor> {
// 1. Predict active neurons
let mask = self.predictor.predict(hidden_state);
if mask.active_ratio() > 0.8 {
// Fallback to dense computation if too many neurons active
return self.forward_dense(hidden_state, backend);
}
// 2. Sparse gate projection: only compute active columns
let gate_active = backend.sparse_matmul_cols(
hidden_state,
&self.gate_weights,
&mask
)?;
// 3. Sparse up projection
let up_active = backend.sparse_matmul_cols(
hidden_state,
&self.up_weights,
&mask
)?;
// 4. SwiGLU-style gating: activation(gate) ⊙ up (element-wise)
let activated = self.activation.apply(&gate_active)?.mul(&up_active)?;
// 5. Sparse down projection: only active rows matter
let output = backend.sparse_matmul_rows(
&activated,
&self.down_weights,
&mask
)?;
Ok(output)
}
fn forward_dense(&self, hidden_state: &Tensor, backend: &dyn Backend) -> Result<Tensor> {
// Standard dense FFN (fallback)
let gate = hidden_state.matmul(&self.gate_weights)?;
let up = hidden_state.matmul(&self.up_weights)?;
let activated = self.activation.apply(&gate)?.mul(&up)?;
activated.matmul(&self.down_weights)
}
}
#[derive(Debug, Clone, Copy)]
pub enum ActivationType {
SiLU, // Llama models
GELU, // BERT models
ReLU, // Legacy models
}
impl ActivationType {
pub fn apply(&self, x: &Tensor) -> Result<Tensor> {
match self {
Self::SiLU => x.mul(&x.sigmoid()?), // x * σ(x)
Self::GELU => x.gelu(),
Self::ReLU => x.relu(),
}
}
}
Sparse Attention (Optional)
For very large models, attention can also be sparsified:
pub struct SparseAttention {
/// Query, Key, Value weights (combined or separate)
qkv_weights: Tensor,
output_weights: Tensor,
num_heads: usize,
head_dim: usize,
/// Attention mask pattern (e.g., local, strided)
sparsity_pattern: AttentionPattern,
}
#[derive(Debug, Clone)]
pub enum AttentionPattern {
/// Full attention (no sparsity)
Full,
/// Local attention (window size)
Local { window_size: usize },
/// Strided attention (BigBird style)
Strided { stride: usize, window: usize },
/// Learned sparse pattern
Learned { mask: Tensor },
}
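As a usage note, the pattern would typically be chosen per model or per request; an illustrative (assumed, not prescribed) policy keyed on sequence length:
fn pick_pattern(seq_len: usize) -> AttentionPattern {
    // Short sequences keep exact attention; long ones trade recall for memory.
    if seq_len <= 512 {
        AttentionPattern::Full
    } else {
        AttentionPattern::Local { window_size: 256 }
    }
}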
2.4 Neuron Cache Manager
Responsibility: Manage hot/cold neuron weights in memory hierarchy.
Cache Architecture
┌──────────────────────────────────────────────┐
│ Neuron Cache Hierarchy │
├──────────────────────────────────────────────┤
│ L1: Hot Neurons (GPU Memory / Fast RAM) │
│ - 10-20% most active neurons │
│ - Always resident │
│ - FP16/FP32 precision │
├──────────────────────────────────────────────┤
│ L2: Warm Neurons (System RAM) │
│ - 30-40% moderately active │
│ - Loaded on-demand │
│ - INT8/FP16 quantized │
├──────────────────────────────────────────────┤
│ L3: Cold Neurons (Disk / Compressed) │
│ - 40-60% rarely active │
│ - Lazy load if predicted │
│ - INT4/INT8 quantized │
└──────────────────────────────────────────────┘
Implementation
pub struct NeuronCache {
/// Model configuration
config: ModelConfig,
/// Per-layer cache
layers: Vec<LayerCache>,
/// Memory budget (bytes)
memory_budget: usize,
/// Current memory usage
memory_used: usize,
}
pub struct LayerCache {
/// Hot neuron indices
hot_neurons: Vec<usize>,
/// Hot neuron weights (gate, up, down)
hot_weights: HotWeights,
/// Cold neuron weights (memory-mapped or compressed)
cold_weights: ColdWeights,
/// Neuron statistics
stats: Vec<NeuronStats>,
}
#[derive(Debug, Clone)]
pub struct HotWeights {
/// Gate weights for hot neurons: [hidden_size, num_hot]
gate: Tensor,
/// Up weights for hot neurons: [hidden_size, num_hot]
up: Tensor,
/// Down weights for hot neurons: [num_hot, hidden_size]
down: Tensor,
}
pub enum ColdWeights {
/// Memory-mapped file (lazy load)
MemoryMapped {
file: Mmap,
offsets: Vec<usize>,
},
/// Compressed in-memory
Compressed {
data: Vec<u8>,
codec: CompressionCodec,
},
/// Quantized INT4
Quantized {
data: Vec<u8>,
scales: Vec<f32>,
},
}
impl NeuronCache {
/// Build cache from calibration results
pub fn from_calibration(
model: &TransformerModel,
predictors: &PredictorSet,
config: CacheConfig
) -> Result<Self> {
let mut layers = Vec::new();
for (layer_idx, predictor) in predictors.iter().enumerate() {
// Extract hot/cold neurons
let hot_neurons: Vec<usize> = predictor.neuron_stats.iter()
.enumerate()
.filter(|(_, stats)| stats.neuron_type == NeuronType::Hot)
.map(|(idx, _)| idx)
.collect();
// Load hot neuron weights into fast memory
let layer_weights = model.get_layer_weights(layer_idx)?;
let hot_weights = Self::extract_hot_weights(&layer_weights, &hot_neurons)?;
// Compress cold neuron weights
let cold_weights = Self::compress_cold_weights(
&layer_weights,
&hot_neurons,
config.compression
)?;
layers.push(LayerCache {
hot_neurons,
hot_weights,
cold_weights,
stats: predictor.neuron_stats.clone(),
});
}
Ok(Self {
config: model.config.clone(),
layers,
memory_budget: config.memory_budget,
memory_used: Self::calculate_memory(&layers),
})
}
/// Get weights for active neurons (hot cached, cold loaded on-demand)
pub fn get_active_weights(
&self,
layer_idx: usize,
mask: &NeuronMask
) -> Result<ActiveWeights> {
let cache = &self.layers[layer_idx];
// Separate hot and cold neurons in mask
let (hot_indices, cold_indices) = self.split_hot_cold(cache, mask);
// Hot neurons: direct lookup
let hot_weights = self.gather_hot_weights(cache, &hot_indices)?;
// Cold neurons: lazy load
let cold_weights = if !cold_indices.is_empty() {
self.load_cold_weights(cache, &cold_indices)?
} else {
None
};
Ok(ActiveWeights {
hot: hot_weights,
cold: cold_weights,
})
}
fn load_cold_weights(
&self,
cache: &LayerCache,
indices: &[usize]
) -> Result<Option<Tensor>> {
match &cache.cold_weights {
ColdWeights::MemoryMapped { file, offsets } => {
// Lazy load from disk
let mut weights = Vec::new();
for &idx in indices {
let offset = offsets[idx];
let data = &file[offset..offset + self.weight_size()];
weights.extend_from_slice(data);
}
Ok(Some(Tensor::from_bytes(&weights)?))
}
ColdWeights::Quantized { data, scales } => {
// Dequantize on-the-fly
let weights = Self::dequantize(data, scales, indices)?;
Ok(Some(weights))
}
_ => Ok(None),
}
}
}
#[derive(Debug, Clone)]
pub struct CacheConfig {
/// Memory budget in bytes
pub memory_budget: usize,
/// Compression for cold neurons
pub compression: CompressionCodec,
/// Whether to use memory-mapped files
pub use_mmap: bool,
}
#[derive(Debug, Clone, Copy)]
pub enum CompressionCodec {
None,
Quantize4Bit,
Quantize8Bit,
ZSTD,
}
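To illustrate the ColdWeights::Quantized path, a minimal 4-bit dequantizer, assuming two values per byte and one scale per 32-value group (the group size GGUF's Q4_0 uses; both packing details are assumptions here):
/// Expand packed 4-bit weights to f32. Values are stored offset-binary
/// (0..=15 maps to -8..=7), low nibble first, with one scale per 32 values.
fn dequantize_4bit(data: &[u8], scales: &[f32]) -> Vec<f32> {
    let mut out = Vec::with_capacity(data.len() * 2);
    for (i, &byte) in data.iter().enumerate() {
        // Byte i holds values 2i and 2i+1, which always share a group.
        let scale = scales[(i * 2) / 32];
        out.push(((byte & 0x0F) as i32 - 8) as f32 * scale);
        out.push(((byte >> 4) as i32 - 8) as f32 * scale);
    }
    out
}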
2.5 Execution Engine
Responsibility: Orchestrate layer-by-layer inference with sparse computation.
pub struct ExecutionEngine {
/// Model configuration
config: ModelConfig,
/// Neuron cache
cache: NeuronCache,
/// Predictors (one per layer)
predictors: PredictorSet,
/// Backend for computation
backend: Arc<dyn Backend>,
/// Performance metrics
metrics: Metrics,
}
impl ExecutionEngine {
/// Run inference on input
pub fn forward(&mut self, input: &Tensor) -> Result<Tensor> {
let batch_size = input.shape()[0];
// 1. Embedding layer (always dense)
let mut hidden = self.embed(input)?;
// 2. Transformer layers
for layer_idx in 0..self.config.num_layers {
let start = std::time::Instant::now();
// Attention (dense or sparse)
hidden = self.run_attention(layer_idx, &hidden)?;
// Sparse FFN
hidden = self.run_sparse_ffn(layer_idx, &hidden)?;
self.metrics.record_layer(layer_idx, start.elapsed());
}
// 3. Output layer
let output = self.output_projection(&hidden)?;
Ok(output)
}
fn run_sparse_ffn(&mut self, layer_idx: usize, hidden: &Tensor) -> Result<Tensor> {
// 1. Predict active neurons
let predictor = &self.predictors[layer_idx];
let mask = predictor.predict(hidden);
self.metrics.record_sparsity(layer_idx, 1.0 - mask.active_ratio()); // sparsity = fraction skipped
// 2. Get active neuron weights from cache
let weights = self.cache.get_active_weights(layer_idx, &mask)?;
// 3. Sparse FFN computation
let ffn = SparseFFN::new(weights, predictor.clone());
let output = ffn.forward(hidden, self.backend.as_ref())?;
Ok(output)
}
/// Get inference statistics
pub fn metrics(&self) -> &Metrics {
&self.metrics
}
}
#[derive(Debug, Default)]
pub struct Metrics {
/// Per-layer latency
layer_latency: Vec<Duration>,
/// Per-layer sparsity ratio
layer_sparsity: Vec<f32>,
/// Total tokens processed
tokens_processed: usize,
/// Cache hits/misses
cache_hits: usize,
cache_misses: usize,
}
impl Metrics {
pub fn average_sparsity(&self) -> f32 {
self.layer_sparsity.iter().sum::<f32>() / self.layer_sparsity.len() as f32
}
pub fn total_latency(&self) -> Duration {
self.layer_latency.iter().sum()
}
pub fn tokens_per_second(&self) -> f32 {
let total_secs = self.total_latency().as_secs_f32();
self.tokens_processed as f32 / total_secs
}
}
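A usage sketch tying these together after a forward pass (names as defined above; note that record_sparsity stores the fraction of neurons skipped):
let output = engine.forward(&input)?;
let m = engine.metrics();
println!(
    "avg sparsity: {:.1}% skipped, throughput: {:.1} tok/s",
    m.average_sparsity() * 100.0,
    m.tokens_per_second()
);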
2.6 Backend Abstraction
Responsibility: Provide SIMD-optimized sparse operations across platforms.
pub trait Backend: Send + Sync {
/// Sparse matrix multiplication: A @ B[:, mask]
fn sparse_matmul_cols(
&self,
a: &Tensor,
b: &Tensor,
col_mask: &NeuronMask
) -> Result<Tensor>;
/// Sparse matrix multiplication: A[mask, :] @ B
fn sparse_matmul_rows(
&self,
a: &Tensor,
b: &Tensor,
row_mask: &NeuronMask
) -> Result<Tensor>;
/// Dense matrix multiplication (fallback)
fn matmul(&self, a: &Tensor, b: &Tensor) -> Result<Tensor>;
/// Element-wise operations
fn add(&self, a: &Tensor, b: &Tensor) -> Result<Tensor>;
fn mul(&self, a: &Tensor, b: &Tensor) -> Result<Tensor>;
/// Activation functions
fn silu(&self, x: &Tensor) -> Result<Tensor>;
fn gelu(&self, x: &Tensor) -> Result<Tensor>;
/// Quantization
fn quantize(&self, x: &Tensor, bits: u8) -> Result<(Tensor, Vec<f32>)>;
fn dequantize(&self, x: &Tensor, scales: &[f32]) -> Result<Tensor>;
}
CPU Backend (AVX512 SIMD)
pub struct CpuBackend {
num_threads: usize,
simd_features: SimdFeatures,
}
#[derive(Debug, Clone)]
pub struct SimdFeatures {
pub avx512: bool,
pub avx2: bool,
pub fma: bool,
pub vnni: bool, // INT8 acceleration
}
impl Backend for CpuBackend {
fn sparse_matmul_cols(
&self,
a: &Tensor,
b: &Tensor,
col_mask: &NeuronMask
) -> Result<Tensor> {
// A: [batch, hidden_size]
// B: [hidden_size, intermediate_size]
// Output: [batch, active_neurons]
let active_cols = col_mask.iter_active().collect::<Vec<_>>();
if self.simd_features.avx512 {
// SAFETY: gated on runtime AVX-512 feature detection above
unsafe { self.sparse_matmul_avx512(a, b, &active_cols) }
} else if self.simd_features.avx2 {
// SAFETY: gated on runtime AVX2 feature detection above
unsafe { self.sparse_matmul_avx2(a, b, &active_cols) }
} else {
self.sparse_matmul_scalar(a, b, &active_cols)
}
}
}
impl CpuBackend {
#[target_feature(enable = "avx512f")]
unsafe fn sparse_matmul_avx512(
&self,
a: &Tensor,
b: &Tensor,
active_cols: &[usize]
) -> Result<Tensor> {
// AVX-512: 16x f32 per vector
// Optimized sparse GEMM kernel. Layout assumptions: B is stored
// transposed ([intermediate, hidden]) so each active column is
// contiguous, and hidden is a multiple of 16 (no remainder loop shown).
let batch = a.shape()[0];
let hidden = a.shape()[1];
let num_active = active_cols.len();
let mut output = Tensor::zeros(&[batch, num_active]);
for row in 0..batch {
for (out_idx, &col) in active_cols.iter().enumerate() {
// Dot product: a[row, :] · b[:, col]
let mut sum = _mm512_setzero_ps();
for k in (0..hidden).step_by(16) {
let a_vec = _mm512_loadu_ps(&a.data()[row * hidden + k]);
let b_vec = _mm512_loadu_ps(&b.data()[col * hidden + k]);
sum = _mm512_fmadd_ps(a_vec, b_vec, sum);
}
output.data_mut()[row * num_active + out_idx] =
_mm512_reduce_add_ps(sum);
}
}
Ok(output)
}
}
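The scalar fallback referenced above is not shown; a portable sketch (same transposed-B layout assumption as the SIMD kernels, no cache blocking):
impl CpuBackend {
    fn sparse_matmul_scalar(
        &self,
        a: &Tensor,
        b: &Tensor,
        active_cols: &[usize],
    ) -> Result<Tensor> {
        let (batch, hidden) = (a.shape()[0], a.shape()[1]);
        let num_active = active_cols.len();
        let mut output = Tensor::zeros(&[batch, num_active]);
        for row in 0..batch {
            for (out_idx, &col) in active_cols.iter().enumerate() {
                // Dot product a[row, :] · b[col, :] (B stored transposed).
                let mut sum = 0.0f32;
                for k in 0..hidden {
                    sum += a.data()[row * hidden + k] * b.data()[col * hidden + k];
                }
                output.data_mut()[row * num_active + out_idx] = sum;
            }
        }
        Ok(output)
    }
}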
WASM Backend (Portable SIMD)
#[cfg(target_arch = "wasm32")]
pub struct WasmBackend {
simd_enabled: bool,
}
#[cfg(target_arch = "wasm32")]
impl Backend for WasmBackend {
fn sparse_matmul_cols(
&self,
a: &Tensor,
b: &Tensor,
col_mask: &NeuronMask
) -> Result<Tensor> {
use std::arch::wasm32::*;
if !self.simd_enabled {
return self.sparse_matmul_scalar(a, b, col_mask);
}
// WASM SIMD: 4x f32 per v128. Same layout assumptions as the CPU
// kernel: B stored transposed, hidden a multiple of 4.
let active_cols = col_mask.iter_active().collect::<Vec<_>>();
let batch = a.shape()[0];
let hidden = a.shape()[1];
let num_active = active_cols.len();
let mut output = Tensor::zeros(&[batch, num_active]);
for row in 0..batch {
for (out_idx, &col) in active_cols.iter().enumerate() {
let mut sum = f32x4_splat(0.0);
for k in (0..hidden).step_by(4) {
let a_vec = v128_load(&a.data()[row * hidden + k] as *const f32 as *const v128);
let b_vec = v128_load(&b.data()[col * hidden + k] as *const f32 as *const v128);
sum = f32x4_add(sum, f32x4_mul(a_vec, b_vec));
}
// Horizontal sum
let result = f32x4_extract_lane::<0>(sum)
+ f32x4_extract_lane::<1>(sum)
+ f32x4_extract_lane::<2>(sum)
+ f32x4_extract_lane::<3>(sum);
output.data_mut()[row * num_active + out_idx] = result;
}
}
Ok(output)
}
}
3. Data Flow Architecture
3.1 Model Loading Flow
User: model_path
│
▼
┌─────────────────────────────┐
│ Detect Model Format │ (check extension: .gguf, .safetensors, .bin)
└────────┬────────────────────┘
│
├──── .gguf ────────▶ GGUFLoader::open()
│ │
│ ├─ Parse header
│ ├─ Build tensor index
│ └─ Extract config
│
├──── HF/ST ────────▶ HFLoader::from_pretrained()
│ │
│ ├─ Download if needed
│ ├─ Load safetensors
│ └─ Parse config.json
│
▼
┌─────────────────────────────┐
│ TransformerModel │
│ - Config │
│ - Layer weights │
│ - Tokenizer │
└─────────────────────────────┘
3.2 Calibration Flow
TransformerModel
│
▼
┌─────────────────────────────┐
│ Load Calibration Dataset │ (WikiText, C4, custom)
│ - Sample 512-2048 examples │
└────────┬────────────────────┘
│
▼
┌─────────────────────────────────────────┐
│ For each layer: │
│ 1. Forward samples → collect │
│ - Hidden states (input to FFN) │
│ - FFN activations (output) │
│ │
│ 2. Analyze activations: │
│ - Compute activation frequency │
│ - Classify hot/cold neurons │
│ │
│ 3. Learn predictor: │
│ - Initialize P, Q matrices │
│ - Train on (hidden → activation) │
│ - Optimize thresholds │
└────────┬────────────────────────────────┘
│
▼
┌─────────────────────────────┐
│ PredictorSet + NeuronCache │
│ - P·Q matrices per layer │
│ - Hot neuron weights │
│ - Cold neuron offload │
└─────────────────────────────┘
3.3 Inference Flow (Single Token)
Input Token(s)
│
▼
┌─────────────────────────────┐
│ Tokenizer + Embedding │
└────────┬────────────────────┘
│
▼
┌───────────────────────────────────────────────────────┐
│ Layer 0: │
│ ┌─────────────────────────────────────────────────┐ │
│ │ 1. Attention (dense) │ │
│ │ - Q, K, V projections │ │
│ │ - Scaled dot-product attention │ │
│ │ - Output projection │ │
│ └─────────────────────────────────────────────────┘ │
│ ┌─────────────────────────────────────────────────┐ │
│ │ 2. Sparse FFN │ │
│ │ a) Predictor(hidden) → mask [T/F/F/T/T/F...] │ │
│ │ b) Load weights for active neurons only │ │
│ │ c) Sparse gate/up projections │ │
│ │ d) Activation (SiLU/GELU) │ │
│ │ e) Sparse down projection │ │
│ └─────────────────────────────────────────────────┘ │
│ ┌─────────────────────────────────────────────────┐ │
│ │ 3. Residual + LayerNorm │ │
│ └─────────────────────────────────────────────────┘ │
└───────────────────┬───────────────────────────────────┘
│
▼
(Repeat for layers 1..N-1)
│
▼
┌─────────────────────────────┐
│ Output Projection │
│ - Linear(hidden, vocab) │
│ - Softmax (for generation) │
└────────┬────────────────────┘
│
▼
Logits / Embedding
4. Rust Module Structure
4.1 Crate Layout
crates/ruvector-sparse-inference/
├── Cargo.toml
├── build.rs # Build-time feature detection
├── README.md
└── src/
├── lib.rs # Public API
├── config.rs # Configuration types
├── error.rs # Error types
│
├── predictor/
│ ├── mod.rs # Predictor API
│ ├── lowrank.rs # P·Q low-rank predictor
│ ├── calibration.rs # Calibration logic
│ └── threshold.rs # Threshold optimization
│
├── sparse/
│ ├── mod.rs # Sparse operations API
│ ├── ffn.rs # Sparse FFN layer
│ ├── attention.rs # Sparse attention (optional)
│ └── kernels.rs # SIMD kernels
│
├── model/
│ ├── mod.rs # Model loading API
│ ├── gguf.rs # GGUF parser
│ ├── hf.rs # HuggingFace loader
│ ├── loader.rs # Generic loader trait
│ └── runners.rs # Model-specific runners (Llama, BERT)
│
├── memory/
│ ├── mod.rs # Memory management API
│ ├── cache.rs # Neuron cache
│ ├── quantization.rs # Quantization utilities
│ └── compression.rs # Compression codecs
│
├── backend/
│ ├── mod.rs # Backend trait
│ ├── cpu.rs # CPU SIMD backend
│ ├── wasm.rs # WASM SIMD backend
│ └── gpu.rs # GPU backend (future)
│
├── integration/
│ ├── mod.rs # Integration API
│ ├── ruvector.rs # EmbeddingProvider impl
│ └── ruvllm.rs # InferenceBackend impl
│
└── utils/
├── mod.rs
├── tensor.rs # Tensor utilities
└── metrics.rs # Performance tracking
4.2 Key Module Responsibilities
| Module | Responsibility | Dependencies |
|---|---|---|
| lib.rs | Public API, re-exports | All modules |
| config | Configuration types | None |
| error | Error handling | None |
| predictor | Neuron prediction | tensor, backend |
| sparse | Sparse computation | predictor, backend, memory |
| model | Model loading | config, error |
| memory | Cache management | model, predictor |
| backend | SIMD operations | tensor |
| integration | Ruvector/RuvLLM | All |
5. Key Traits and Interfaces
5.1 ModelRunner Trait
pub trait ModelRunner: Send + Sync {
/// Get model configuration
fn config(&self) -> &ModelConfig;
/// Run inference on input tokens
fn forward(&mut self, input_ids: &[u32]) -> Result<Tensor>;
/// Encode text to embeddings (for embedding models)
fn encode(&mut self, text: &str) -> Result<Vec<f32>>;
/// Generate text (for language models)
fn generate(&mut self, prompt: &str, max_tokens: usize) -> Result<String>;
/// Get inference metrics
fn metrics(&self) -> &Metrics;
}
// Implementations
pub struct LlamaRunner { /* ... */ }
pub struct BertRunner { /* ... */ }
pub struct LFM2Runner { /* ... */ }
impl ModelRunner for LlamaRunner { /* ... */ }
impl ModelRunner for BertRunner { /* ... */ }
impl ModelRunner for LFM2Runner { /* ... */ }
5.2 Predictor Trait
pub trait Predictor: Send + Sync {
/// Predict active neurons from hidden state
fn predict(&self, hidden_state: &Tensor) -> NeuronMask;
/// Get predicted sparsity ratio
fn sparsity_ratio(&self, hidden_state: &Tensor) -> f32;
/// Get neuron statistics
fn neuron_stats(&self) -> &[NeuronStats];
}
#[derive(Debug, Clone)]
pub struct NeuronStats {
pub neuron_type: NeuronType,
pub activation_frequency: f32,
pub average_magnitude: f32,
}
5.3 Cache Trait
pub trait Cache: Send + Sync {
/// Get weights for active neurons
fn get_active_weights(
&self,
layer_idx: usize,
mask: &NeuronMask
) -> Result<ActiveWeights>;
/// Get memory usage statistics
fn memory_usage(&self) -> MemoryStats;
/// Evict least-recently-used cold neurons
fn evict(&mut self, size: usize) -> Result<()>;
}
#[derive(Debug, Clone)]
pub struct MemoryStats {
pub hot_neurons_bytes: usize,
pub cold_neurons_bytes: usize,
pub predictor_bytes: usize,
pub total_bytes: usize,
}
6. Integration Architecture
6.1 Ruvector EmbeddingProvider Integration
// In ruvector-core/src/embeddings.rs
pub trait EmbeddingProvider {
fn embed(&self, text: &str) -> Result<Vec<f32>>;
fn embed_batch(&self, texts: &[&str]) -> Result<Vec<Vec<f32>>>;
}
// New implementation in sparse-inference
impl EmbeddingProvider for SparseInferenceEngine {
fn embed(&self, text: &str) -> Result<Vec<f32>> {
// 1. Tokenize
let tokens = self.tokenizer.encode(text)?;
// 2. Run sparse inference
let output = self.runner.forward(&tokens)?;
// 3. Mean pooling (for sentence embeddings)
let embedding = self.mean_pool(&output)?;
Ok(embedding.to_vec())
}
fn embed_batch(&self, texts: &[&str]) -> Result<Vec<Vec<f32>>> {
texts.iter().map(|text| self.embed(text)).collect()
}
}
// Usage
let engine = SparseInferenceEngine::from_pretrained(
"TaylorAI/gte-tiny",
SparseConfig::default()
)?;
let rv = RuVector::builder()
.embedding_provider(Box::new(engine))
.build()?;
rv.insert("test", "Hello world")?;
6.2 RuvLLM InferenceBackend Integration
// In ruvllm/src/backend.rs
pub trait InferenceBackend {
fn generate(&mut self, prompt: &str, config: GenerateConfig) -> Result<String>;
fn logits(&mut self, prompt: &str) -> Result<Vec<f32>>;
}
// Implementation
impl InferenceBackend for SparseInferenceEngine {
fn generate(&mut self, prompt: &str, config: GenerateConfig) -> Result<String> {
let mut tokens = self.tokenizer.encode(prompt)?;
let mut output = String::new();
for _ in 0..config.max_tokens {
// Sparse inference
let logits = self.runner.forward(&tokens)?;
// Sample next token
let next_token = self.sample(&logits, config.temperature)?;
tokens.push(next_token);
// Decode
let text = self.tokenizer.decode(&[next_token])?;
output.push_str(&text);
if next_token == self.tokenizer.eos_token() {
break;
}
}
Ok(output)
}
}
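The sample() helper above is not defined in this document; a minimal temperature sampler consistent with how it is called (a sketch: greedy argmax near zero temperature is an assumed convention, and the rand crate is assumed available):
fn sample(&self, logits: &Tensor, temperature: f32) -> Result<u32> {
    let logits = logits.data(); // assume last-token logits as a flat &[f32]
    if temperature < 1e-4 {
        // Greedy decoding: pick the highest logit.
        return Ok(logits.iter().enumerate()
            .max_by(|a, b| a.1.partial_cmp(b.1).unwrap())
            .map(|(i, _)| i as u32)
            .unwrap_or(0));
    }
    // Temperature-scaled softmax, stabilized by subtracting the max logit.
    let max = logits.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let weights: Vec<f32> = logits.iter()
        .map(|&l| ((l - max) / temperature).exp())
        .collect();
    let total: f32 = weights.iter().sum();
    let mut r = rand::random::<f32>() * total;
    for (i, w) in weights.iter().enumerate() {
        r -= w;
        if r <= 0.0 { return Ok(i as u32); }
    }
    Ok((weights.len() - 1) as u32)
}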
// Usage
let engine = SparseInferenceEngine::from_pretrained(
"TheBloke/Llama-2-7B-GGUF",
SparseConfig::default()
)?;
let llm = RuvLLM::builder()
.backend(Box::new(engine))
.build()?;
let response = llm.generate("Explain quantum computing", Default::default())?;
7. Performance Targets
7.1 Latency Targets
| Model | Operation | Target Latency | Baseline | Speedup |
|---|---|---|---|---|
| LFM2-350M | Sentence embedding | 5-10ms | 25ms | 2.5-5x |
| BERT-base | Sentence embedding | 8-15ms | 40ms | 2.7-5x |
| Llama-7B | Token generation | 50-100ms | 500ms | 5-10x |
| Llama-13B | Token generation | 100-200ms | 1.2s | 6-12x |
7.2 Memory Targets
| Model | Baseline RAM | Sparse RAM | Reduction |
|---|---|---|---|
| LFM2-350M | 1.4 GB | 700 MB | 2x |
| Llama-7B (FP16) | 14 GB | 7-9 GB | 1.5-2x |
| Llama-7B (Q4) | 4 GB | 2.5-3 GB | 1.3-1.6x |
7.3 Accuracy Targets
- Embedding similarity: >0.99 cosine similarity to dense baseline
- Generation quality: <1% perplexity increase
- Classification accuracy: <0.5% drop on downstream tasks
7.4 Sparsity Targets
| Layer Type | Target Sparsity | Active Neurons |
|---|---|---|
| Early layers | 60-70% | 30-40% compute |
| Middle layers | 70-85% | 15-30% compute |
| Late layers | 50-60% | 40-50% compute |
| Average | 70-80% | 20-30% compute |
8. Deployment Architecture
8.1 CPU Deployment
┌─────────────────────────────────────┐
│ Application Process │
├─────────────────────────────────────┤
│ ┌───────────────────────────────┐ │
│ │ Sparse Inference Engine │ │
│ │ - Hot neurons: RAM (1-2 GB) │ │
│ │ - Cold neurons: mmap disk │ │
│ │ - SIMD: AVX-512 / AVX2 │ │
│ └───────────────────────────────┘ │
│ ┌───────────────────────────────┐ │
│ │ Thread Pool (rayon) │ │
│ │ - Parallel batch processing │ │
│ └───────────────────────────────┘ │
└─────────────────────────────────────┘
8.2 WASM Deployment
┌─────────────────────────────────────┐
│ Browser / Node.js │
├─────────────────────────────────────┤
│ ┌───────────────────────────────┐ │
│ │ WASM Module │ │
│ │ - Hot neurons: ArrayBuffer │ │
│ │ - SIMD: wasm128 (if avail) │ │
│ │ - Memory limit: 2-4 GB │ │
│ └───────────────────────────────┘ │
│ ┌───────────────────────────────┐ │
│ │ Worker Pool │ │
│ │ - Parallel inference │ │
│ └───────────────────────────────┘ │
└─────────────────────────────────────┘
8.3 Hybrid Deployment
┌─────────────────────────────────────┐
│ Cloud GPU (Hot path) │
│ - 20% most frequent queries │
│ - Full dense inference │
│ - <10ms latency │
└──────────────┬──────────────────────┘
│
▼
┌─────────────────────────────────────┐
│ Edge CPU (Cold path) │
│ - 80% long-tail queries │
│ - Sparse inference │
│ - 20-50ms latency │
│ - 10x lower cost │
└─────────────────────────────────────┘
9. Future Enhancements
9.1 Phase 2 Features
- Dynamic sparsity: Adjust predictor thresholds at runtime
- Multi-modal: Support vision-language models (CLIP, LLaVA)
- Quantization-aware: INT8/INT4 predictor matrices
- GPU kernels: CUDA/Metal sparse kernels
- NPU support: Apple Neural Engine, Qualcomm Hexagon
9.2 Phase 3 Features
- Learned sparsity patterns: Train end-to-end with sparsity loss
- Mixture-of-experts: Combine with MoE models
- Speculative decoding: Sparse draft models + dense verification
- Cross-layer optimization: Share predictors across layers
10. References and Inspiration
- PowerInfer (SOSP'24): Fast LLM serving with activation locality
- Deja Vu (ICML'23): Contextual sparsity in transformers
- CATS (2024): Contextually-aware activation thresholding for sparsity
- FlashAttention: Memory-efficient attention
- GPTQ/AWQ: Weight quantization for LLMs
Appendix A: Configuration Examples
A.1 LFM2 Embedding Configuration
let config = SparseConfig {
model_path: "TaylorAI/gte-tiny".to_string(),
predictor_rank: 128,
target_sparsity: 0.75,
cache_config: CacheConfig {
memory_budget: 1024 * 1024 * 1024, // 1 GB
use_mmap: false,
compression: CompressionCodec::None,
},
backend: BackendType::CpuAvx2,
calibration: Some(CalibrationConfig {
num_samples: 1024,
data_source: DataSource::WikiText,
}),
};
A.2 Llama-7B Generation Configuration
let config = SparseConfig {
model_path: "TheBloke/Llama-2-7B-GGUF".to_string(),
predictor_rank: 256,
target_sparsity: 0.80,
cache_config: CacheConfig {
memory_budget: 8 * 1024 * 1024 * 1024, // 8 GB
use_mmap: true,
compression: CompressionCodec::Quantize4Bit,
},
backend: BackendType::CpuAvx512,
calibration: Some(CalibrationConfig {
num_samples: 2048,
data_source: DataSource::C4,
}),
};
Appendix B: Benchmarking Protocol
B.1 Latency Benchmarking
# Embedding models
cargo bench --bench embeddings -- \
--model gte-tiny \
--batch-sizes 1,8,32 \
--sequence-lengths 16,64,256
# Generation models
cargo bench --bench generation -- \
--model llama-7b \
--prompt-lengths 32,128,512 \
--generate-lengths 32,128
B.2 Accuracy Evaluation
# STS-B (semantic similarity)
cargo run --release --bin eval-sts \
--model gte-tiny \
--sparse \
--dataset data/stsbenchmark
# MMLU (language understanding)
cargo run --release --bin eval-mmlu \
--model llama-7b \
--sparse \
--subset abstract_algebra,anatomy
End of Architecture Document