
RuvLLM: Self-Learning LLM with LFM2 and Ruvector Integration

SPARC Phase 1: Specification


1. Executive Summary

RuvLLM is a self-learning LLM architecture that integrates Liquid Foundation Models (LFM2) with ruvector as the world model and memory substrate. The system uses FastGRNN as an intelligent router to dynamically allocate computational resources based on query complexity, enabling efficient on-device inference with continuous learning capabilities.

Core Innovation

The architecture treats:

  • LFM2 as the reasoning head (inference engine)
  • Ruvector as the world model and episodic memory
  • FastGRNN as the control circuit (routing decisions)

This triad creates a self-learning system where:

  1. Queries are semantically embedded and matched against memory
  2. Graph attention extracts relevant neighborhood context
  3. FastGRNN routes to optimal model configuration
  4. LFM2 generates responses with retrieved context
  5. Successful interactions are written back to memory (self-improvement)
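The five-step loop above can be sketched end to end. Every component here is a stand-in closure, not the actual RuvLLM or ruvector API; routing (step 3) is elided for brevity:

```rust
/// Minimal sketch of the query -> retrieve -> generate -> evaluate ->
/// writeback loop. All components are illustrative stand-ins.
fn answer_and_learn(
    query: &str,
    embed: impl Fn(&str) -> Vec<f32>,
    retrieve: impl Fn(&[f32]) -> Vec<String>,
    generate: impl Fn(&str, &[String]) -> String,
    judge: impl Fn(&str, &str) -> f32,
    memory: &mut Vec<(Vec<f32>, String)>,
    quality_threshold: f32,
) -> String {
    let q_vec = embed(query); // 1. semantically embed the query
    let context = retrieve(&q_vec); // 2-3. retrieve neighborhood context (routing elided)
    let response = generate(query, &context); // 4. generate with retrieved context
    let score = judge(query, &response); // 5a. evaluate quality
    if score >= quality_threshold {
        // 5b. write successful interactions back to memory
        memory.push((q_vec, format!("Q: {query}\nA: {response}")));
    }
    response
}
```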

2. Technical Requirements

2.1 Functional Requirements

FR-001: LFM2 Model Integration

  • Description: Support LFM2 model family (350M, 700M, 1.2B, 2.6B parameters)
  • Acceptance Criteria:
    • Load models via llama.cpp (CPU) or vLLM (server)
    • Support quantization: Q4/Q5 (CPU), 8-bit/4-bit weight-only (GPU)
    • Enable KV cache for context reuse
    • Achieve <500ms median latency (CPU), <100ms (GPU)

FR-002: Ruvector Memory Service

  • Description: Implement semantic memory with graph structure
  • Storage Schema:
    Nodes: {
      id: UUID,
      vector: [f32; D],      // D = embedding dimension
      text: String,
      type: NodeType,        // Query | Document | AgentStep | Fact
      source: String,
      metadata: {
        timestamp: i64,
        tags: Vec<String>,
        domain: String,
        version: u32,
        confidence: f32
      }
    }
    
    Edges: {
      id: UUID,
      src: UUID,
      dst: UUID,
      rel: EdgeType,         // Cites | Follows | SameTopic | AgentStep | Derived
      weight: f32,
      metadata: {
        timestamp: i64,
        created_by: String,
        confidence: f32
      }
    }
    
  • Acceptance Criteria:
    • HNSW index with M=32, efConstruction=200, efSearch=64
    • Sub-millisecond retrieval for k≤64
    • Graph attention over 2-hop neighborhoods
    • Support billion-scale corpora

FR-003: FastGRNN Router

  • Description: Implement gated recurrent router for intelligent resource allocation
  • Architecture (per Kusupati et al.):
    • Hidden size: 32-64 units
    • Input: Fixed-length feature vector (~128 dims)
    • Outputs: model_selection, context_size, temperature, top_p
  • Feature Vector Components (~128 dimensions; group sizes below are nominal):
    Query Stats [32 dims]:
      - token_count: f32
      - language_id: [f32; 8] (one-hot)
      - domain_encoding: [f32; 16]
      - user_frequency: f32
      - query_type: [f32; 6] (factual/reasoning/creative/...)
    
    Embedding Stats [16 dims]:
      - l2_norm: f32
      - principal_components: [f32; 8]
      - entropy: f32
      - sparsity: f32
      - cluster_assignment: [f32; 4]
    
    HNSW Search Stats [48 dims]:
      - k_retrieved: f32
      - distances: { mean, std, min, max }: [f32; 4]
      - entropy: f32
      - graph_depth: f32
      - recall_estimate: f32
      - neighborhood_density: [f32; 16]
      - semantic_coherence: [f32; 24]
    
    System Constraints [32 dims]:
      - latency_budget: f32
      - device_class: [f32; 4] (edge/mobile/server/cluster)
      - privacy_level: [f32; 4]
      - memory_available: f32
      - battery_level: f32 (for mobile)
      - concurrent_requests: f32
      - historical_accuracy: [f32; 16]
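The groupings above can be concatenated into the router's flat input vector. The struct and field names below are illustrative, not actual RuvLLM types:

```rust
/// Hypothetical container for the router's input features, following the
/// spec's four groupings. Sizes are nominal (~32/16/48/32 dims).
pub struct RouterFeatures {
    pub query_stats: Vec<f32>,
    pub embedding_stats: Vec<f32>,
    pub hnsw_stats: Vec<f32>,
    pub system_stats: Vec<f32>,
}

impl RouterFeatures {
    /// Concatenate the groups into one fixed-length vector for FastGRNN.
    pub fn to_vector(&self) -> Vec<f32> {
        let mut v = Vec::with_capacity(128);
        v.extend_from_slice(&self.query_stats);
        v.extend_from_slice(&self.embedding_stats);
        v.extend_from_slice(&self.hnsw_stats);
        v.extend_from_slice(&self.system_stats);
        v
    }
}
```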
    

FR-004: Self-Learning Pipeline

  • Description: Implement continuous learning with forgetting mitigation
  • Components:
    • Online learning from successful interactions
    • Elastic Weight Consolidation (EWC) for catastrophic forgetting prevention
    • Experience replay with reservoir sampling
    • Curriculum learning for progressive complexity
  • Acceptance Criteria:
    • Quality regret <0.1 points vs. always-big baseline
    • No measurable forgetting over 10K update cycles
    • Router accuracy >95% for seen patterns
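The experience replay component above can be sketched as reservoir sampling (Vitter's Algorithm R), which keeps a uniform sample of all interactions seen so far in a fixed-size buffer. The tiny LCG below stands in for a real RNG so the example needs no external crates:

```rust
/// Fixed-capacity replay buffer via reservoir sampling (Algorithm R).
pub struct Reservoir<T> {
    capacity: usize,
    seen: usize,
    items: Vec<T>,
    rng_state: u64,
}

impl<T> Reservoir<T> {
    pub fn new(capacity: usize, seed: u64) -> Self {
        Self { capacity, seen: 0, items: Vec::with_capacity(capacity), rng_state: seed }
    }

    /// Minimal LCG; a production buffer would use a proper RNG.
    fn next_rand(&mut self, bound: usize) -> usize {
        self.rng_state = self
            .rng_state
            .wrapping_mul(6364136223846793005)
            .wrapping_add(1442695040888963407);
        (self.rng_state >> 33) as usize % bound
    }

    /// After n items, each one remains with probability capacity/n.
    pub fn add(&mut self, item: T) {
        self.seen += 1;
        if self.items.len() < self.capacity {
            self.items.push(item);
        } else {
            let j = self.next_rand(self.seen);
            if j < self.capacity {
                self.items[j] = item;
            }
        }
    }

    pub fn items(&self) -> &[T] {
        &self.items
    }
}
```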

FR-005: Graph Attention Engine

  • Description: Context extraction via graph-aware attention
  • Mechanism:
    • Multi-head attention over retrieved nodes
    • Edge-weighted aggregation (confidence, recency)
    • Hyperbolic embeddings for hierarchical relationships
    • 2-hop neighborhood expansion
  • Integration with existing ruvector-attention:
    • Leverage EdgeFeaturedAttention for edge attributes
    • Use GraphRoPE for positional encoding on graphs
    • Apply DualSpaceAttention for multi-manifold reasoning
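The simplest of the mechanisms above, edge-weighted aggregation, can be sketched directly. This is not the ruvector-attention API, just the core idea: scale each neighbor vector by its edge weight (e.g. a confidence/recency product) and normalize by the total weight:

```rust
/// Illustrative edge-weighted aggregation over retrieved neighbors.
/// Input: (neighbor vector, edge weight) pairs. Output: weighted mean.
pub fn weighted_aggregate(neighbors: &[(Vec<f32>, f32)]) -> Vec<f32> {
    let dim = neighbors.first().map(|(v, _)| v.len()).unwrap_or(0);
    let mut out = vec![0.0f32; dim];
    let total: f32 = neighbors.iter().map(|(_, w)| *w).sum();
    if total == 0.0 {
        return out;
    }
    for (vec, w) in neighbors {
        for (o, x) in out.iter_mut().zip(vec) {
            *o += w * x; // accumulate weight-scaled contribution
        }
    }
    for o in &mut out {
        *o /= total; // normalize by total edge weight
    }
    out
}
```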

2.2 Non-Functional Requirements

NFR-001: Performance

| Metric | Tier A (Server) | Tier B (Edge) | Tier C (Mobile) |
|--------|-----------------|---------------|-----------------|
| P50 Latency | <200ms | <500ms | <800ms |
| P99 Latency | <1s | <2s | <5s |
| Throughput | 100 QPS | 20 QPS | 5 QPS |
| Memory | <16GB | <4GB | <1GB |

NFR-002: Quality

  • Accuracy: F1 >0.85 on QA benchmarks
  • Retrieval: R@10 >0.90 for relevant documents
  • Router: Decision accuracy >95%
  • Judge Rating: 4.2+/5.0 on LLM-as-judge evaluations

NFR-003: Scalability

  • Support 10M+ vectors in memory
  • Support 1B+ vectors with hybrid indexing
  • Linear scaling with node count in cluster mode

NFR-004: Reliability

  • Zero data loss on graceful shutdown
  • Recovery from OOM within 30s
  • Automatic failover in cluster mode

3. LFM2 Deep Dive

3.1 Architecture Analysis

LFM2 employs a hybrid backbone combining:

  1. Gated Short Convolutions: Lightweight local feature processing

    • O(n) complexity vs O(n²) for attention
    • Captures local patterns efficiently
    • Enables 2x faster prefill on CPUs
  2. Grouped Query Attention (GQA): Reduced KV heads

    • 4-8 KV heads vs 32+ in standard attention
    • Maintains quality with 4x memory reduction
    • Critical for edge deployment

3.2 Training Methodology

LFM2's training is relevant for our self-learning pipeline:

  1. Knowledge Distillation: Tempered, decoupled Top-K

    • Teacher: Large model (70B+)
    • Student: LFM2 variants
    • Insight: We can distill router decisions from expensive oracle
  2. Curriculum Learning: Progressive complexity

    • Start with simple factual queries
    • Graduate to multi-step reasoning
    • Application: Router training follows same progression
  3. Three-Stage Post-Training:

    • SFT: Supervised fine-tuning on quality data
    • DPO: Direct preference optimization
    • Model merging: Combine specialists
    • Application: We merge domain-specific adapters

3.3 Multimodal Extensions (Future)

  • LFM2-VL: Vision-language (image understanding)
  • LFM2-Audio: Speech I/O
  • LFM2-ColBERT: Low-latency retrieval encoder

4. Ruvector Integration Analysis

4.1 Existing Capabilities

| Component | Status | Integration Plan |
|-----------|--------|------------------|
| ruvector-core | Production | Primary vector store |
| ruvector-gnn | Production | Graph neural layer |
| ruvector-attention | Production | Attention mechanisms |
| ruvector-router-core | Production | Base routing |
| ruvector-graph | Production | Knowledge graph |

4.2 Required Extensions

4.2.1 Embedding Adapter

pub struct EmbeddingAdapter {
    /// LFM2 encoder for query embedding
    lfm2_encoder: Lfm2Encoder,
    /// Dimension alignment layer
    projection: Linear,
    /// Normalization
    layer_norm: LayerNorm,
}

impl EmbeddingAdapter {
    pub fn embed(&self, text: &str) -> Vec<f32> {
        let raw = self.lfm2_encoder.encode(text);
        let projected = self.projection.forward(&raw);
        self.layer_norm.forward(&projected)
    }
}

4.2.2 Memory Writeback Service

pub struct MemoryWriteback {
    /// Quality threshold for writeback
    quality_threshold: f32,
    /// Deduplication via MinHash
    dedup_hasher: MinHasher,
    /// Conflict resolution
    merger: ConflictMerger,
}

impl MemoryWriteback {
    pub async fn maybe_write(
        &self,
        query: &str,
        response: &str,
        quality_score: f32,
        db: &VectorDB,
    ) -> Result<Option<UUID>> {
        if quality_score < self.quality_threshold {
            return Ok(None);
        }

        // Check for near-duplicates
        let embedding = embed(query, response);
        let similar = db.search_threshold(&embedding, 0.95)?;
        if !similar.is_empty() {
            return self.merger.resolve(similar, query, response);
        }

        // Insert new memory
        let entry = VectorEntry::new(embedding)
            .with_text(format!("Q: {}\nA: {}", query, response))
            .with_metadata(json!({
                "type": "qa_pair",
                "quality": quality_score,
                "timestamp": now(),
            }));

        Ok(Some(db.insert(entry)?))
    }
}

4.3 HNSW Parameter Tuning

Based on arxiv:2511.23404v1 insights on retrieval efficiency:

| Corpus Size | M | efConstruction | efSearch | Recall@10 |
|-------------|---|----------------|----------|-----------|
| <100K | 16 | 100 | 32 | 0.98 |
| 100K-1M | 32 | 200 | 64 | 0.96 |
| 1M-10M | 48 | 300 | 128 | 0.94 |
| 10M-100M | 64 | 400 | 256 | 0.92 |
| >100M | Hybrid | Tiered | Adaptive | 0.90 |
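The table rows reduce to a simple lookup by corpus size; `hnsw_params` is an illustrative helper, not an existing ruvector function:

```rust
/// Pick (M, ef_construction, ef_search) for an in-memory HNSW index
/// based on corpus size, following the tuning table above.
pub fn hnsw_params(corpus_size: usize) -> (usize, usize, usize) {
    match corpus_size {
        n if n < 100_000 => (16, 100, 32),
        n if n < 1_000_000 => (32, 200, 64),
        n if n < 10_000_000 => (48, 300, 128),
        // Beyond ~100M vectors the spec switches to hybrid/tiered
        // indexing; this sketch just returns the largest flat config.
        _ => (64, 400, 256),
    }
}
```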

5. FastGRNN Router Specification

5.1 Mathematical Formulation

FastGRNN (Fast, Accurate, Stable and Tiny Gated Recurrent Neural Network):

z_t = σ(W · x_t + U · h_{t-1} + b_z)
h̃_t = tanh(W · x_t + U · h_{t-1} + b_h)
h_t = (ζ · (1 - z_t) + ν) ⊙ h̃_t + z_t ⊙ h_{t-1}

where:
  - ζ, ν: Learned scalars constrained to (0, 1)
  - W: Input weight matrix (sparse), shared by the gate and candidate state
  - U: Recurrent weight matrix (low-rank), shared by the gate and candidate state
  - b_z, b_h: Gate and candidate biases

Unlike a full GRU, FastGRNN has no reset gate; the single shared W/U pair is what keeps the cell kilobyte-sized.
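A single FastGRNN step can be written directly from the update rule, with W and U shared between the gate and candidate state as in Kusupati et al. Dense matrices are used here for clarity where the real cell uses a sparse W and low-rank U:

```rust
/// Minimal one-step FastGRNN cell (dense, for illustration only).
pub struct FastGrnnCell {
    pub w: Vec<Vec<f32>>, // hidden x input, shared between gate and candidate
    pub u: Vec<Vec<f32>>, // hidden x hidden, shared between gate and candidate
    pub b_z: Vec<f32>,    // gate bias
    pub b_h: Vec<f32>,    // candidate bias
    pub zeta: f32,        // learned scalar ζ
    pub nu: f32,          // learned scalar ν
}

fn sigmoid(x: f32) -> f32 {
    1.0 / (1.0 + (-x).exp())
}

fn matvec(m: &[Vec<f32>], v: &[f32]) -> Vec<f32> {
    m.iter()
        .map(|row| row.iter().zip(v).map(|(a, b)| a * b).sum())
        .collect()
}

impl FastGrnnCell {
    pub fn step(&self, x: &[f32], h_prev: &[f32]) -> Vec<f32> {
        let wx = matvec(&self.w, x);
        let uh = matvec(&self.u, h_prev);
        (0..h_prev.len())
            .map(|i| {
                let z = sigmoid(wx[i] + uh[i] + self.b_z[i]); // update gate
                let h_tilde = (wx[i] + uh[i] + self.b_h[i]).tanh(); // candidate
                // h_t = (ζ(1 - z) + ν) ⊙ h̃ + z ⊙ h_{t-1}
                (self.zeta * (1.0 - z) + self.nu) * h_tilde + z * h_prev[i]
            })
            .collect()
    }
}
```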

5.2 Output Heads

pub struct RouterOutputs {
    /// Model selection: [350M, 700M, 1.2B, 2.6B] probabilities
    pub model_probs: [f32; 4],
    /// Context size bins: [256, 512, 1024, 2048, 4096] tokens
    pub context_probs: [f32; 5],
    /// Temperature: continuous [0.0, 2.0]
    pub temperature: f32,
    /// Top-p: continuous [0.0, 1.0]
    pub top_p: f32,
    /// Confidence score
    pub confidence: f32,
}

5.3 Training Protocol

Phase 1: Data Collection

For each query q:
  1. Run all model configurations (expensive baseline)
  2. Collect quality metrics Q, latency L, cost C
  3. Compute utility: U = Q - λ·L - μ·C
  4. Label: y_model = argmax(U), y_ctx = min viable context
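The labeling step reduces to computing the utility per configuration and taking the argmax. `label_best_config` is a sketch of the spec's U = Q - λ·L - μ·C rule; inputs are (quality, latency, cost) triples, one per model configuration:

```rust
/// Return the index of the configuration maximizing U = Q - λ·L - μ·C.
pub fn label_best_config(runs: &[(f32, f32, f32)], lambda: f32, mu: f32) -> usize {
    runs.iter()
        .map(|&(q, l, c)| q - lambda * l - mu * c) // utility per config
        .enumerate()
        .max_by(|a, b| a.1.partial_cmp(&b.1).unwrap())
        .map(|(i, _)| i)
        .unwrap()
}
```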

Phase 2: Supervised Training

Loss = CE(model_pred, y_model)
     + CE(ctx_pred, y_ctx)
     + α·SmoothL1(temp_pred, y_temp)
     + β·SmoothL1(top_p_pred, y_top_p)

Phase 3: Online Refinement

Every N requests:
  1. Sample exploration (ε-greedy or Thompson)
  2. Compute regret vs. oracle
  3. Update weights with importance sampling
  4. Apply EWC regularization
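Step 1's ε-greedy exploration can be sketched over the router's model distribution. `select_config` and its `rand01` input are illustrative; a real implementation would draw from an RNG and likely use Thompson sampling instead:

```rust
/// With probability ε pick a (pseudo-)random configuration; otherwise
/// exploit the argmax of the router's model probabilities.
pub fn select_config(model_probs: &[f32], epsilon: f32, rand01: f32) -> usize {
    if rand01 < epsilon {
        // Explore: reuse the draw to make a uniform choice.
        ((rand01 / epsilon) * model_probs.len() as f32) as usize % model_probs.len()
    } else {
        // Exploit: argmax over the router's distribution.
        model_probs
            .iter()
            .enumerate()
            .max_by(|a, b| a.1.partial_cmp(b.1).unwrap())
            .map(|(i, _)| i)
            .unwrap()
    }
}
```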

6. Self-Learning Mechanisms

6.1 Continual Learning Architecture

┌─────────────────────────────────────────────────────────────┐
│                    Self-Learning Pipeline                     │
├─────────────────────────────────────────────────────────────┤
│                                                               │
│  ┌─────────┐    ┌─────────┐    ┌─────────┐    ┌─────────┐   │
│  │ Query   │───▶│ Retrieve│───▶│ Generate│───▶│ Evaluate│   │
│  └─────────┘    └─────────┘    └─────────┘    └─────────┘   │
│       │              │              │              │         │
│       │              │              │              ▼         │
│       │              │              │        ┌─────────┐     │
│       │              │              │        │ Quality │     │
│       │              │              │        │ > θ ?   │     │
│       │              │              │        └────┬────┘     │
│       │              │              │             │          │
│       │              │              │      ┌──────┴──────┐   │
│       │              │              │      ▼             ▼   │
│       │              │              │  ┌───────┐   ┌───────┐ │
│       │              │              │  │ Write │   │ Skip  │ │
│       │              │              │  │ Back  │   │       │ │
│       │              │              │  └───┬───┘   └───────┘ │
│       │              │              │      │                 │
│       ▼              ▼              ▼      ▼                 │
│  ┌─────────────────────────────────────────────┐             │
│  │            Replay Buffer (Reservoir)         │             │
│  │  ┌─────┐ ┌─────┐ ┌─────┐ ┌─────┐ ┌─────┐   │             │
│  │  │ E_1 │ │ E_2 │ │ ... │ │E_n-1│ │ E_n │   │             │
│  │  └─────┘ └─────┘ └─────┘ └─────┘ └─────┘   │             │
│  └──────────────────────┬──────────────────────┘             │
│                         │                                    │
│                         ▼                                    │
│  ┌─────────────────────────────────────────────┐             │
│  │           EWC Regularization Layer           │             │
│  │                                               │             │
│  │  L_total = L_task + λ·Σ F_i·(θ_i - θ*_i)²   │             │
│  │                                               │             │
│  │  F_i = Fisher Information (importance)        │             │
│  │  θ*_i = Optimal weights from previous task   │             │
│  └─────────────────────────────────────────────┘             │
│                                                               │
└─────────────────────────────────────────────────────────────┘

6.2 Quality Evaluation

LLM-as-Judge Protocol:

pub struct QualityJudge {
    judge_model: Lfm2, // Use 2.6B for judging
    rubric: JudgeRubric,
}

impl QualityJudge {
    pub fn evaluate(&self, query: &str, response: &str, context: &[&str]) -> f32 {
        let prompt = format!(r#"
            Evaluate the response quality on a scale of 1-5:

            Query: {query}
            Retrieved Context: {context:?}
            Response: {response}

            Criteria:
            1. Factual accuracy (grounded in context)
            2. Completeness (addresses the query fully)
            3. Coherence (logical flow)
            4. Conciseness (no unnecessary verbosity)

            Score (1-5):
        "#);

        let score_str = self.judge_model.generate(&prompt, 10);
        parse_score(&score_str)
    }
}
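The `parse_score` helper used above is left undefined in this spec. A plausible sketch extracts the first number from the judge's free-text output and clamps it to the 1-5 rubric range, defaulting to the midpoint when nothing parses:

```rust
/// Hypothetical parser for the judge's "Score (1-5):" completion.
pub fn parse_score(text: &str) -> f32 {
    text.split(|c: char| !(c.is_ascii_digit() || c == '.'))
        .filter(|s| !s.is_empty())
        .find_map(|s| s.parse::<f32>().ok()) // first parseable number
        .map(|v| v.clamp(1.0, 5.0)) // enforce rubric range
        .unwrap_or(3.0) // neutral default on parse failure
}
```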

6.3 Forgetting Mitigation

Elastic Weight Consolidation (EWC):

// From ruvector-gnn ewc module
pub struct ElasticWeightConsolidation {
    lambda: f32,                    // Regularization strength
    fisher_info: Vec<f32>,          // Fisher information diagonal
    optimal_weights: Vec<f32>,      // θ* from previous task
}

impl ElasticWeightConsolidation {
    pub fn regularization_loss(&self, current_weights: &[f32]) -> f32 {
        self.fisher_info.iter()
            .zip(current_weights.iter())
            .zip(self.optimal_weights.iter())
            .map(|((f, w), w_star)| f * (w - w_star).powi(2))
            .sum::<f32>() * self.lambda / 2.0
    }

    pub fn update_fisher(&mut self, gradients: &[Vec<f32>]) {
        // Fisher = E[∇logP(y|x;θ)²]
        for (i, grad_samples) in gradients.iter().enumerate() {
            self.fisher_info[i] = grad_samples.iter()
                .map(|g| g.powi(2))
                .sum::<f32>() / grad_samples.len() as f32;
        }
    }
}

7. Performance Optimization Strategy

7.1 LFM2 Level

| Optimization | Speedup | Quality Impact | Implementation |
|--------------|---------|----------------|----------------|
| Model selection | 2-4x | <1% | FastGRNN router |
| KV cache reuse | 1.5-2x | 0% | llama.cpp native |
| Q4 quantization | 2-3x | <2% | GGUF format |
| Speculative decode | 1.3-1.5x | 0% | Draft model |
| Continuous batching | 2-4x | 0% | vLLM |

7.2 Ruvector Level

| Optimization | Speedup | Quality Impact | Implementation |
|--------------|---------|----------------|----------------|
| HNSW tuning | Variable | Recall tradeoff | efSearch adjustment |
| Product quantization | 4-8x memory | <5% | PQ in ruvector-core |
| Graph pruning | 1.2-1.5x | <1% | Edge weight threshold |
| Batch retrieval | 2-3x | 0% | Parallel HNSW |
| Caching | 10x+ (hits) | 0% | LRU with TTL |

7.3 Router Level

| Optimization | Speedup | Quality Impact | Implementation |
|--------------|---------|----------------|----------------|
| Sparse weights | 10-50x | <0.5% | Magnitude pruning |
| Low-rank U | 2-4x | <0.5% | SVD decomposition |
| Int8 quantization | 2-4x | <0.1% | Post-training quant |
| Cascade routing | 1.5-2x | 0% | Early exit |

8. Success Metrics

8.1 Primary Metrics

| Metric | Target | Measurement |
|--------|--------|-------------|
| End-to-end latency | P50 <500ms | Timer instrumentation |
| Quality (LLM judge) | 4.2+/5.0 | Automated evaluation |
| Router accuracy | >95% | Oracle comparison |
| Memory efficiency | <4GB (edge) | RSS monitoring |
| Throughput | 20 QPS (edge) | Load testing |

8.2 Secondary Metrics

| Metric | Target | Measurement |
|--------|--------|-------------|
| Retrieval R@10 | >0.90 | Benchmark suite |
| Forgetting rate | <5%/10K updates | Periodic eval |
| Cost reduction | >50% vs baseline | Token counting |
| Writeback rate | 10-30% | Database metrics |

8.3 Regret Analysis

Quality Regret = E[Q_baseline - Q_routed]
Latency Regret = E[L_routed - L_oracle]
Cost Regret = E[C_routed - C_oracle]

Targets:
- Quality Regret < 0.1 points (1-5 scale)
- Latency Regret < 50ms
- Cost Regret < 10%
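Each regret above is an expected pairwise gap, so one helper covers all three. The argument order is chosen so a positive result means the routed system is worse than the reference (baseline or oracle):

```rust
/// Mean pairwise gap E[reference - routed] over matched evaluations.
/// For latency/cost regret, pass (routed, oracle) so the sign convention
/// still makes positive values "worse".
pub fn mean_regret(reference: &[f32], routed: &[f32]) -> f32 {
    assert_eq!(reference.len(), routed.len());
    reference
        .iter()
        .zip(routed)
        .map(|(r, s)| r - s)
        .sum::<f32>()
        / reference.len() as f32
}
```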

9. Risk Analysis

| Risk | Probability | Impact | Mitigation |
|------|-------------|--------|------------|
| Router misprediction | Medium | High | Confidence thresholds, fallback |
| Catastrophic forgetting | Low | Critical | EWC, replay buffer, checkpoints |
| Memory exhaustion | Medium | High | Streaming, tiered storage |
| Quality degradation | Medium | High | A/B testing, rollback |
| Latency spikes | High | Medium | Caching, async processing |

10. Dependencies

10.1 Internal Dependencies

[dependencies]
ruvector-core = { path = "../ruvector-core" }
ruvector-gnn = { path = "../ruvector-gnn" }
ruvector-attention = { path = "../ruvector-attention" }
ruvector-graph = { path = "../ruvector-graph" }
ruvector-router-core = { path = "../ruvector-router-core" }

10.2 External Dependencies

[dependencies]
# LLM runtime
llama-cpp-rs = "0.3"        # CPU inference
tokenizers = "0.15"         # Fast tokenization

# Async runtime
tokio = { version = "1.41", features = ["full"] }

# Serialization
serde = { version = "1.0", features = ["derive"] }

# Metrics
prometheus = "0.13"
tracing = "0.1"

11. References

  1. LFM2 Technical Report: arxiv:2511.23404v1
  2. FastGRNN: Kusupati et al., "FastGRNN: A Fast, Accurate, Stable and Tiny Kilobyte Sized Gated Recurrent Neural Network"
  3. EWC: Kirkpatrick et al., "Overcoming catastrophic forgetting in neural networks"
  4. HNSW: Malkov & Yashunin, "Efficient and robust approximate nearest neighbor search using Hierarchical Navigable Small World graphs"
  5. Graph Attention: Veličković et al., "Graph Attention Networks"

Document Version: 1.0 | Last Updated: 2025-12-02 | Author: RuvLLM Architecture Team