
RuvLLM: Self-Learning LLM with LFM2 and Ruvector Integration

SPARC Phase 1: Specification


1. Executive Summary

RuvLLM is a self-learning LLM architecture that integrates Liquid Foundation Models (LFM2) with ruvector as the world model and memory substrate. The system uses FastGRNN as an intelligent router to dynamically allocate computational resources based on query complexity, enabling efficient on-device inference with continuous learning capabilities.

Core Innovation

The architecture treats:

  • LFM2 as the reasoning head (inference engine)
  • Ruvector as the world model and episodic memory
  • FastGRNN as the control circuit (routing decisions)

This triad creates a self-learning system where:

  1. Queries are semantically embedded and matched against memory
  2. Graph attention extracts relevant neighborhood context
  3. FastGRNN routes to optimal model configuration
  4. LFM2 generates responses with retrieved context
  5. Successful interactions are written back to memory (self-improvement)
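The five-step loop above can be sketched end to end. Every component here is a stand-in closure, not the actual RuvLLM or ruvector API; routing (step 3) is elided for brevity:

```rust
/// Minimal sketch of the query -> retrieve -> generate -> evaluate ->
/// writeback loop. All components are illustrative stand-ins.
fn answer_and_learn(
    query: &str,
    embed: impl Fn(&str) -> Vec<f32>,
    retrieve: impl Fn(&[f32]) -> Vec<String>,
    generate: impl Fn(&str, &[String]) -> String,
    judge: impl Fn(&str, &str) -> f32,
    memory: &mut Vec<(Vec<f32>, String)>,
    quality_threshold: f32,
) -> String {
    let q_vec = embed(query); // 1. semantically embed the query
    let context = retrieve(&q_vec); // 2-3. retrieve neighborhood context (routing elided)
    let response = generate(query, &context); // 4. generate with retrieved context
    let score = judge(query, &response); // 5a. evaluate quality
    if score >= quality_threshold {
        // 5b. write successful interactions back to memory
        memory.push((q_vec, format!("Q: {query}\nA: {response}")));
    }
    response
}
```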

2. Technical Requirements

2.1 Functional Requirements

FR-001: LFM2 Model Integration

  • Description: Support LFM2 model family (350M, 700M, 1.2B, 2.6B parameters)
  • Acceptance Criteria:
    • Load models via llama.cpp (CPU) or vLLM (server)
    • Support quantization: Q4/Q5 (CPU), 8-bit/4-bit weight-only (GPU)
    • Enable KV cache for context reuse
    • Achieve <500ms median latency (CPU), <100ms (GPU)

FR-002: Ruvector Memory Service

  • Description: Implement semantic memory with graph structure
  • Storage Schema:
    Nodes: {
      id: UUID,
      vector: [f32; D],      // D = embedding dimension
      text: String,
      type: NodeType,        // Query | Document | AgentStep | Fact
      source: String,
      metadata: {
        timestamp: i64,
        tags: Vec<String>,
        domain: String,
        version: u32,
        confidence: f32
      }
    }
    
    Edges: {
      id: UUID,
      src: UUID,
      dst: UUID,
      rel: EdgeType,         // Cites | Follows | SameTopic | AgentStep | Derived
      weight: f32,
      metadata: {
        timestamp: i64,
        created_by: String,
        confidence: f32
      }
    }
    
  • Acceptance Criteria:
    • HNSW index with M=32, efConstruction=200, efSearch=64
    • Sub-millisecond retrieval for k≤64
    • Graph attention over 2-hop neighborhoods
    • Support billion-scale corpora

FR-003: FastGRNN Router

  • Description: Implement gated recurrent router for intelligent resource allocation
  • Architecture (per Kusupati et al.):
    • Hidden size: 32-64 units
    • Input: Fixed-length feature vector (~128 dims)
    • Outputs: model_selection, context_size, temperature, top_p
  • Feature Vector Components (~128 dimensions; group sizes below are nominal):
    Query Stats [32 dims]:
      - token_count: f32
      - language_id: [f32; 8] (one-hot)
      - domain_encoding: [f32; 16]
      - user_frequency: f32
      - query_type: [f32; 6] (factual/reasoning/creative/...)
    
    Embedding Stats [16 dims]:
      - l2_norm: f32
      - principal_components: [f32; 8]
      - entropy: f32
      - sparsity: f32
      - cluster_assignment: [f32; 4]
    
    HNSW Search Stats [48 dims]:
      - k_retrieved: f32
      - distances: { mean, std, min, max }: [f32; 4]
      - entropy: f32
      - graph_depth: f32
      - recall_estimate: f32
      - neighborhood_density: [f32; 16]
      - semantic_coherence: [f32; 24]
    
    System Constraints [32 dims]:
      - latency_budget: f32
      - device_class: [f32; 4] (edge/mobile/server/cluster)
      - privacy_level: [f32; 4]
      - memory_available: f32
      - battery_level: f32 (for mobile)
      - concurrent_requests: f32
      - historical_accuracy: [f32; 16]
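The groupings above can be concatenated into the router's flat input vector. The struct and field names below are illustrative, not actual RuvLLM types:

```rust
/// Hypothetical container for the router's input features, following the
/// spec's four groupings. Sizes are nominal (~32/16/48/32 dims).
pub struct RouterFeatures {
    pub query_stats: Vec<f32>,
    pub embedding_stats: Vec<f32>,
    pub hnsw_stats: Vec<f32>,
    pub system_stats: Vec<f32>,
}

impl RouterFeatures {
    /// Concatenate the groups into one fixed-length vector for FastGRNN.
    pub fn to_vector(&self) -> Vec<f32> {
        let mut v = Vec::with_capacity(128);
        v.extend_from_slice(&self.query_stats);
        v.extend_from_slice(&self.embedding_stats);
        v.extend_from_slice(&self.hnsw_stats);
        v.extend_from_slice(&self.system_stats);
        v
    }
}
```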
    

FR-004: Self-Learning Pipeline

  • Description: Implement continuous learning with forgetting mitigation
  • Components:
    • Online learning from successful interactions
    • Elastic Weight Consolidation (EWC) for catastrophic forgetting prevention
    • Experience replay with reservoir sampling
    • Curriculum learning for progressive complexity
  • Acceptance Criteria:
    • Quality regret <0.1 points vs. always-big baseline
    • No measurable forgetting over 10K update cycles
    • Router accuracy >95% for seen patterns
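The experience replay component above can be sketched as reservoir sampling (Vitter's Algorithm R), which keeps a uniform sample of all interactions seen so far in a fixed-size buffer. The tiny LCG below stands in for a real RNG so the example needs no external crates:

```rust
/// Fixed-capacity replay buffer via reservoir sampling (Algorithm R).
pub struct Reservoir<T> {
    capacity: usize,
    seen: usize,
    items: Vec<T>,
    rng_state: u64,
}

impl<T> Reservoir<T> {
    pub fn new(capacity: usize, seed: u64) -> Self {
        Self { capacity, seen: 0, items: Vec::with_capacity(capacity), rng_state: seed }
    }

    /// Minimal LCG; a production buffer would use a proper RNG.
    fn next_rand(&mut self, bound: usize) -> usize {
        self.rng_state = self
            .rng_state
            .wrapping_mul(6364136223846793005)
            .wrapping_add(1442695040888963407);
        (self.rng_state >> 33) as usize % bound
    }

    /// After n items, each one remains with probability capacity/n.
    pub fn add(&mut self, item: T) {
        self.seen += 1;
        if self.items.len() < self.capacity {
            self.items.push(item);
        } else {
            let j = self.next_rand(self.seen);
            if j < self.capacity {
                self.items[j] = item;
            }
        }
    }

    pub fn items(&self) -> &[T] {
        &self.items
    }
}
```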

FR-005: Graph Attention Engine

  • Description: Context extraction via graph-aware attention
  • Mechanism:
    • Multi-head attention over retrieved nodes
    • Edge-weighted aggregation (confidence, recency)
    • Hyperbolic embeddings for hierarchical relationships
    • 2-hop neighborhood expansion
  • Integration with existing ruvector-attention:
    • Leverage EdgeFeaturedAttention for edge attributes
    • Use GraphRoPE for positional encoding on graphs
    • Apply DualSpaceAttention for multi-manifold reasoning
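The simplest of the mechanisms above, edge-weighted aggregation, can be sketched directly. This is not the ruvector-attention API, just the core idea: scale each neighbor vector by its edge weight (e.g. a confidence/recency product) and normalize by the total weight:

```rust
/// Illustrative edge-weighted aggregation over retrieved neighbors.
/// Input: (neighbor vector, edge weight) pairs. Output: weighted mean.
pub fn weighted_aggregate(neighbors: &[(Vec<f32>, f32)]) -> Vec<f32> {
    let dim = neighbors.first().map(|(v, _)| v.len()).unwrap_or(0);
    let mut out = vec![0.0f32; dim];
    let total: f32 = neighbors.iter().map(|(_, w)| *w).sum();
    if total == 0.0 {
        return out;
    }
    for (vec, w) in neighbors {
        for (o, x) in out.iter_mut().zip(vec) {
            *o += w * x; // accumulate weight-scaled contribution
        }
    }
    for o in &mut out {
        *o /= total; // normalize by total edge weight
    }
    out
}
```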

2.2 Non-Functional Requirements

NFR-001: Performance

| Metric | Tier A (Server) | Tier B (Edge) | Tier C (Mobile) |
|--------|-----------------|---------------|-----------------|
| P50 Latency | <200ms | <500ms | <800ms |
| P99 Latency | <1s | <2s | <5s |
| Throughput | 100 QPS | 20 QPS | 5 QPS |
| Memory | <16GB | <4GB | <1GB |

NFR-002: Quality

  • Accuracy: F1 >0.85 on QA benchmarks
  • Retrieval: R@10 >0.90 for relevant documents
  • Router: Decision accuracy >95%
  • Judge Rating: 4.2+/5.0 on LLM-as-judge evaluations

NFR-003: Scalability

  • Support 10M+ vectors in memory
  • Support 1B+ vectors with hybrid indexing
  • Linear scaling with node count in cluster mode

NFR-004: Reliability

  • Zero data loss on graceful shutdown
  • Recovery from OOM within 30s
  • Automatic failover in cluster mode

3. LFM2 Deep Dive

3.1 Architecture Analysis

LFM2 employs a hybrid backbone combining:

  1. Gated Short Convolutions: Lightweight local feature processing

    • O(n) complexity vs O(n²) for attention
    • Captures local patterns efficiently
    • Enables 2x faster prefill on CPUs
  2. Grouped Query Attention (GQA): Reduced KV heads

    • 4-8 KV heads vs 32+ in standard attention
    • Maintains quality with 4x memory reduction
    • Critical for edge deployment

3.2 Training Methodology

LFM2's training is relevant for our self-learning pipeline:

  1. Knowledge Distillation: Tempered, decoupled Top-K

    • Teacher: Large model (70B+)
    • Student: LFM2 variants
    • Insight: We can distill router decisions from expensive oracle
  2. Curriculum Learning: Progressive complexity

    • Start with simple factual queries
    • Graduate to multi-step reasoning
    • Application: Router training follows same progression
  3. Three-Stage Post-Training:

    • SFT: Supervised fine-tuning on quality data
    • DPO: Direct preference optimization
    • Model merging: Combine specialists
    • Application: We merge domain-specific adapters

3.3 Multimodal Extensions (Future)

  • LFM2-VL: Vision-language (image understanding)
  • LFM2-Audio: Speech I/O
  • LFM2-ColBERT: Low-latency retrieval encoder

4. Ruvector Integration Analysis

4.1 Existing Capabilities

| Component | Status | Integration Plan |
|-----------|--------|------------------|
| ruvector-core | Production | Primary vector store |
| ruvector-gnn | Production | Graph neural layer |
| ruvector-attention | Production | Attention mechanisms |
| ruvector-router-core | Production | Base routing |
| ruvector-graph | Production | Knowledge graph |

4.2 Required Extensions

4.2.1 Embedding Adapter

pub struct EmbeddingAdapter {
    /// LFM2 encoder for query embedding
    lfm2_encoder: Lfm2Encoder,
    /// Dimension alignment layer
    projection: Linear,
    /// Normalization
    layer_norm: LayerNorm,
}

impl EmbeddingAdapter {
    pub fn embed(&self, text: &str) -> Vec<f32> {
        let raw = self.lfm2_encoder.encode(text);
        let projected = self.projection.forward(&raw);
        self.layer_norm.forward(&projected)
    }
}

4.2.2 Memory Writeback Service

pub struct MemoryWriteback {
    /// Quality threshold for writeback
    quality_threshold: f32,
    /// Deduplication via MinHash
    dedup_hasher: MinHasher,
    /// Conflict resolution
    merger: ConflictMerger,
}

impl MemoryWriteback {
    pub async fn maybe_write(
        &self,
        query: &str,
        response: &str,
        quality_score: f32,
        db: &VectorDB,
    ) -> Result<Option<UUID>> {
        if quality_score < self.quality_threshold {
            return Ok(None);
        }

        // Check for near-duplicates
        let embedding = embed(query, response);
        let similar = db.search_threshold(&embedding, 0.95)?;
        if !similar.is_empty() {
            return self.merger.resolve(similar, query, response);
        }

        // Insert new memory
        let entry = VectorEntry::new(embedding)
            .with_text(format!("Q: {}\nA: {}", query, response))
            .with_metadata(json!({
                "type": "qa_pair",
                "quality": quality_score,
                "timestamp": now(),
            }));

        Ok(Some(db.insert(entry)?))
    }
}

4.3 HNSW Parameter Tuning

Based on arxiv:2511.23404v1 insights on retrieval efficiency:

| Corpus Size | M | efConstruction | efSearch | Recall@10 |
|-------------|---|----------------|----------|-----------|
| <100K | 16 | 100 | 32 | 0.98 |
| 100K-1M | 32 | 200 | 64 | 0.96 |
| 1M-10M | 48 | 300 | 128 | 0.94 |
| 10M-100M | 64 | 400 | 256 | 0.92 |
| >100M | Hybrid | Tiered | Adaptive | 0.90 |
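The table rows reduce to a simple lookup by corpus size; `hnsw_params` is an illustrative helper, not an existing ruvector function:

```rust
/// Pick (M, ef_construction, ef_search) for an in-memory HNSW index
/// based on corpus size, following the tuning table above.
pub fn hnsw_params(corpus_size: usize) -> (usize, usize, usize) {
    match corpus_size {
        n if n < 100_000 => (16, 100, 32),
        n if n < 1_000_000 => (32, 200, 64),
        n if n < 10_000_000 => (48, 300, 128),
        // Beyond ~100M vectors the spec switches to hybrid/tiered
        // indexing; this sketch just returns the largest flat config.
        _ => (64, 400, 256),
    }
}
```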

5. FastGRNN Router Specification

5.1 Mathematical Formulation

FastGRNN (Fast, Accurate, Stable and Tiny Gated Recurrent Neural Network):

z_t = σ(W · x_t + U · h_{t-1} + b_z)
h̃_t = tanh(W · x_t + U · h_{t-1} + b_h)
h_t = (ζ · (1 - z_t) + ν) ⊙ h̃_t + z_t ⊙ h_{t-1}

where:
  - ζ, ν: Learned scalars constrained to (0, 1)
  - W: Input weight matrix (sparse), shared by the gate and candidate state
  - U: Recurrent weight matrix (low-rank), shared by the gate and candidate state
  - b_z, b_h: Gate and candidate biases

Unlike a full GRU, FastGRNN has no reset gate; the single shared W/U pair is what keeps the cell kilobyte-sized.
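A single FastGRNN step can be written directly from the update rule, with W and U shared between the gate and candidate state as in Kusupati et al. Dense matrices are used here for clarity where the real cell uses a sparse W and low-rank U:

```rust
/// Minimal one-step FastGRNN cell (dense, for illustration only).
pub struct FastGrnnCell {
    pub w: Vec<Vec<f32>>, // hidden x input, shared between gate and candidate
    pub u: Vec<Vec<f32>>, // hidden x hidden, shared between gate and candidate
    pub b_z: Vec<f32>,    // gate bias
    pub b_h: Vec<f32>,    // candidate bias
    pub zeta: f32,        // learned scalar ζ
    pub nu: f32,          // learned scalar ν
}

fn sigmoid(x: f32) -> f32 {
    1.0 / (1.0 + (-x).exp())
}

fn matvec(m: &[Vec<f32>], v: &[f32]) -> Vec<f32> {
    m.iter()
        .map(|row| row.iter().zip(v).map(|(a, b)| a * b).sum())
        .collect()
}

impl FastGrnnCell {
    pub fn step(&self, x: &[f32], h_prev: &[f32]) -> Vec<f32> {
        let wx = matvec(&self.w, x);
        let uh = matvec(&self.u, h_prev);
        (0..h_prev.len())
            .map(|i| {
                let z = sigmoid(wx[i] + uh[i] + self.b_z[i]); // update gate
                let h_tilde = (wx[i] + uh[i] + self.b_h[i]).tanh(); // candidate
                // h_t = (ζ(1 - z) + ν) ⊙ h̃ + z ⊙ h_{t-1}
                (self.zeta * (1.0 - z) + self.nu) * h_tilde + z * h_prev[i]
            })
            .collect()
    }
}
```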

5.2 Output Heads

pub struct RouterOutputs {
    /// Model selection: [350M, 700M, 1.2B, 2.6B] probabilities
    pub model_probs: [f32; 4],
    /// Context size bins: [256, 512, 1024, 2048, 4096] tokens
    pub context_probs: [f32; 5],
    /// Temperature: continuous [0.0, 2.0]
    pub temperature: f32,
    /// Top-p: continuous [0.0, 1.0]
    pub top_p: f32,
    /// Confidence score
    pub confidence: f32,
}

5.3 Training Protocol

Phase 1: Data Collection

For each query q:
  1. Run all model configurations (expensive baseline)
  2. Collect quality metrics Q, latency L, cost C
  3. Compute utility: U = Q - λ·L - μ·C
  4. Label: y_model = argmax(U), y_ctx = min viable context
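The labeling step reduces to computing the utility per configuration and taking the argmax. `label_best_config` is a sketch of the spec's U = Q - λ·L - μ·C rule; inputs are (quality, latency, cost) triples, one per model configuration:

```rust
/// Return the index of the configuration maximizing U = Q - λ·L - μ·C.
pub fn label_best_config(runs: &[(f32, f32, f32)], lambda: f32, mu: f32) -> usize {
    runs.iter()
        .map(|&(q, l, c)| q - lambda * l - mu * c) // utility per config
        .enumerate()
        .max_by(|a, b| a.1.partial_cmp(&b.1).unwrap())
        .map(|(i, _)| i)
        .unwrap()
}
```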

Phase 2: Supervised Training

Loss = CE(model_pred, y_model)
     + CE(ctx_pred, y_ctx)
     + α·SmoothL1(temp_pred, y_temp)
     + β·SmoothL1(top_p_pred, y_top_p)

Phase 3: Online Refinement

Every N requests:
  1. Sample exploration (ε-greedy or Thompson)
  2. Compute regret vs. oracle
  3. Update weights with importance sampling
  4. Apply EWC regularization
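Step 1's ε-greedy exploration can be sketched over the router's model distribution. `select_config` and its `rand01` input are illustrative; a real implementation would draw from an RNG and likely use Thompson sampling instead:

```rust
/// With probability ε pick a (pseudo-)random configuration; otherwise
/// exploit the argmax of the router's model probabilities.
pub fn select_config(model_probs: &[f32], epsilon: f32, rand01: f32) -> usize {
    if rand01 < epsilon {
        // Explore: reuse the draw to make a uniform choice.
        ((rand01 / epsilon) * model_probs.len() as f32) as usize % model_probs.len()
    } else {
        // Exploit: argmax over the router's distribution.
        model_probs
            .iter()
            .enumerate()
            .max_by(|a, b| a.1.partial_cmp(b.1).unwrap())
            .map(|(i, _)| i)
            .unwrap()
    }
}
```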

6. Self-Learning Mechanisms

6.1 Continual Learning Architecture

┌─────────────────────────────────────────────────────────────┐
│                    Self-Learning Pipeline                     │
├─────────────────────────────────────────────────────────────┤
│                                                               │
│  ┌─────────┐    ┌─────────┐    ┌─────────┐    ┌─────────┐   │
│  │ Query   │───▶│ Retrieve│───▶│ Generate│───▶│ Evaluate│   │
│  └─────────┘    └─────────┘    └─────────┘    └─────────┘   │
│       │              │              │              │         │
│       │              │              │              ▼         │
│       │              │              │        ┌─────────┐     │
│       │              │              │        │ Quality │     │
│       │              │              │        │ > θ ?   │     │
│       │              │              │        └────┬────┘     │
│       │              │              │             │          │
│       │              │              │      ┌──────┴──────┐   │
│       │              │              │      ▼             ▼   │
│       │              │              │  ┌───────┐   ┌───────┐ │
│       │              │              │  │ Write │   │ Skip  │ │
│       │              │              │  │ Back  │   │       │ │
│       │              │              │  └───┬───┘   └───────┘ │
│       │              │              │      │                 │
│       ▼              ▼              ▼      ▼                 │
│  ┌─────────────────────────────────────────────┐             │
│  │            Replay Buffer (Reservoir)         │             │
│  │  ┌─────┐ ┌─────┐ ┌─────┐ ┌─────┐ ┌─────┐   │             │
│  │  │ E_1 │ │ E_2 │ │ ... │ │E_n-1│ │ E_n │   │             │
│  │  └─────┘ └─────┘ └─────┘ └─────┘ └─────┘   │             │
│  └──────────────────────┬──────────────────────┘             │
│                         │                                    │
│                         ▼                                    │
│  ┌─────────────────────────────────────────────┐             │
│  │           EWC Regularization Layer           │             │
│  │                                               │             │
│  │  L_total = L_task + λ·Σ F_i·(θ_i - θ*_i)²   │             │
│  │                                               │             │
│  │  F_i = Fisher Information (importance)        │             │
│  │  θ*_i = Optimal weights from previous task   │             │
│  └─────────────────────────────────────────────┘             │
│                                                               │
└─────────────────────────────────────────────────────────────┘

6.2 Quality Evaluation

LLM-as-Judge Protocol:

pub struct QualityJudge {
    judge_model: Lfm2, // Use 2.6B for judging
    rubric: JudgeRubric,
}

impl QualityJudge {
    pub fn evaluate(&self, query: &str, response: &str, context: &[&str]) -> f32 {
        let prompt = format!(r#"
            Evaluate the response quality on a scale of 1-5:

            Query: {query}
            Retrieved Context: {context:?}
            Response: {response}

            Criteria:
            1. Factual accuracy (grounded in context)
            2. Completeness (addresses the query fully)
            3. Coherence (logical flow)
            4. Conciseness (no unnecessary verbosity)

            Score (1-5):
        "#);

        let score_str = self.judge_model.generate(&prompt, 10);
        parse_score(&score_str)
    }
}
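The `parse_score` helper used above is left undefined in this spec. A plausible sketch extracts the first number from the judge's free-text output and clamps it to the 1-5 rubric range, defaulting to the midpoint when nothing parses:

```rust
/// Hypothetical parser for the judge's "Score (1-5):" completion.
pub fn parse_score(text: &str) -> f32 {
    text.split(|c: char| !(c.is_ascii_digit() || c == '.'))
        .filter(|s| !s.is_empty())
        .find_map(|s| s.parse::<f32>().ok()) // first parseable number
        .map(|v| v.clamp(1.0, 5.0)) // enforce rubric range
        .unwrap_or(3.0) // neutral default on parse failure
}
```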

6.3 Forgetting Mitigation

Elastic Weight Consolidation (EWC):

// From ruvector-gnn ewc module
pub struct ElasticWeightConsolidation {
    lambda: f32,                    // Regularization strength
    fisher_info: Vec<f32>,          // Fisher information diagonal
    optimal_weights: Vec<f32>,      // θ* from previous task
}

impl ElasticWeightConsolidation {
    pub fn regularization_loss(&self, current_weights: &[f32]) -> f32 {
        self.fisher_info.iter()
            .zip(current_weights.iter())
            .zip(self.optimal_weights.iter())
            .map(|((f, w), w_star)| f * (w - w_star).powi(2))
            .sum::<f32>() * self.lambda / 2.0
    }

    pub fn update_fisher(&mut self, gradients: &[Vec<f32>]) {
        // Fisher = E[∇logP(y|x;θ)²]
        for (i, grad_samples) in gradients.iter().enumerate() {
            self.fisher_info[i] = grad_samples.iter()
                .map(|g| g.powi(2))
                .sum::<f32>() / grad_samples.len() as f32;
        }
    }
}

7. Performance Optimization Strategy

7.1 LFM2 Level

| Optimization | Speedup | Quality Impact | Implementation |
|--------------|---------|----------------|----------------|
| Model selection | 2-4x | <1% | FastGRNN router |
| KV cache reuse | 1.5-2x | 0% | llama.cpp native |
| Q4 quantization | 2-3x | <2% | GGUF format |
| Speculative decode | 1.3-1.5x | 0% | Draft model |
| Continuous batching | 2-4x | 0% | vLLM |

7.2 Ruvector Level

| Optimization | Speedup | Quality Impact | Implementation |
|--------------|---------|----------------|----------------|
| HNSW tuning | Variable | Recall tradeoff | efSearch adjustment |
| Product quantization | 4-8x memory | <5% | PQ in ruvector-core |
| Graph pruning | 1.2-1.5x | <1% | Edge weight threshold |
| Batch retrieval | 2-3x | 0% | Parallel HNSW |
| Caching | 10x+ (hits) | 0% | LRU with TTL |

7.3 Router Level

| Optimization | Speedup | Quality Impact | Implementation |
|--------------|---------|----------------|----------------|
| Sparse weights | 10-50x | <0.5% | Magnitude pruning |
| Low-rank U | 2-4x | <0.5% | SVD decomposition |
| Int8 quantization | 2-4x | <0.1% | Post-training quant |
| Cascade routing | 1.5-2x | 0% | Early exit |

8. Success Metrics

8.1 Primary Metrics

| Metric | Target | Measurement |
|--------|--------|-------------|
| End-to-end latency | P50 <500ms | Timer instrumentation |
| Quality (LLM judge) | 4.2+/5.0 | Automated evaluation |
| Router accuracy | >95% | Oracle comparison |
| Memory efficiency | <4GB (edge) | RSS monitoring |
| Throughput | 20 QPS (edge) | Load testing |

8.2 Secondary Metrics

| Metric | Target | Measurement |
|--------|--------|-------------|
| Retrieval R@10 | >0.90 | Benchmark suite |
| Forgetting rate | <5%/10K updates | Periodic eval |
| Cost reduction | >50% vs baseline | Token counting |
| Writeback rate | 10-30% | Database metrics |

8.3 Regret Analysis

Quality Regret = E[Q_baseline - Q_routed]
Latency Regret = E[L_routed - L_oracle]
Cost Regret = E[C_routed - C_oracle]

Targets:
- Quality Regret < 0.1 points (1-5 scale)
- Latency Regret < 50ms
- Cost Regret < 10%
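Each regret above is an expected pairwise gap, so one helper covers all three. The argument order is chosen so a positive result means the routed system is worse than the reference (baseline or oracle):

```rust
/// Mean pairwise gap E[reference - routed] over matched evaluations.
/// For latency/cost regret, pass (routed, oracle) so the sign convention
/// still makes positive values "worse".
pub fn mean_regret(reference: &[f32], routed: &[f32]) -> f32 {
    assert_eq!(reference.len(), routed.len());
    reference
        .iter()
        .zip(routed)
        .map(|(r, s)| r - s)
        .sum::<f32>()
        / reference.len() as f32
}
```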

9. Risk Analysis

| Risk | Probability | Impact | Mitigation |
|------|-------------|--------|------------|
| Router misprediction | Medium | High | Confidence thresholds, fallback |
| Catastrophic forgetting | Low | Critical | EWC, replay buffer, checkpoints |
| Memory exhaustion | Medium | High | Streaming, tiered storage |
| Quality degradation | Medium | High | A/B testing, rollback |
| Latency spikes | High | Medium | Caching, async processing |

10. Dependencies

10.1 Internal Dependencies

[dependencies]
ruvector-core = { path = "../ruvector-core" }
ruvector-gnn = { path = "../ruvector-gnn" }
ruvector-attention = { path = "../ruvector-attention" }
ruvector-graph = { path = "../ruvector-graph" }
ruvector-router-core = { path = "../ruvector-router-core" }

10.2 External Dependencies

[dependencies]
# LLM runtime
llama-cpp-rs = "0.3"        # CPU inference
tokenizers = "0.15"         # Fast tokenization

# Async runtime
tokio = { version = "1.41", features = ["full"] }

# Serialization
serde = { version = "1.0", features = ["derive"] }

# Metrics
prometheus = "0.13"
tracing = "0.1"

11. References

  1. LFM2 Technical Report: arxiv:2511.23404v1
  2. FastGRNN: Kusupati et al., "FastGRNN: A Fast, Accurate, Stable and Tiny Kilobyte Sized Gated Recurrent Neural Network"
  3. EWC: Kirkpatrick et al., "Overcoming catastrophic forgetting in neural networks"
  4. HNSW: Malkov & Yashunin, "Efficient and robust approximate nearest neighbor search using Hierarchical Navigable Small World graphs"
  5. Graph Attention: Veličković et al., "Graph Attention Networks"

Document Version: 1.0 | Last Updated: 2025-12-02 | Author: RuvLLM Architecture Team