# RuvLLM: Self-Learning LLM with LFM2 and Ruvector Integration
## SPARC Phase 1: Specification
---
## 1. Executive Summary
RuvLLM is a self-learning LLM architecture that integrates **Liquid Foundation Models (LFM2)** with **ruvector** as the world model and memory substrate. The system uses **FastGRNN** as an intelligent router to dynamically allocate computational resources based on query complexity, enabling efficient on-device inference with continuous learning capabilities.
### Core Innovation
The architecture treats:
- **LFM2** as the reasoning head (inference engine)
- **Ruvector** as the world model and episodic memory
- **FastGRNN** as the control circuit (routing decisions)
This triad creates a self-learning system where:
1. Queries are semantically embedded and matched against memory
2. Graph attention extracts relevant neighborhood context
3. FastGRNN routes to optimal model configuration
4. LFM2 generates responses with retrieved context
5. Successful interactions are written back to memory (self-improvement)
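These five steps can be sketched as a toy end-to-end loop. Everything below is an illustrative stand-in, not the real RuvLLM API: the byte-histogram "embedding", the `Memory` struct, and the cosine retrieval are placeholders for the LFM2 encoder and ruvector store.

```rust
/// Toy memory substrate standing in for ruvector: (embedding, text) pairs.
struct Memory {
    entries: Vec<(Vec<f32>, String)>,
}

/// Toy embedding (byte histogram) standing in for the LFM2 encoder.
fn embed(text: &str) -> Vec<f32> {
    let mut v = vec![0.0f32; 8];
    for b in text.bytes() {
        v[(b % 8) as usize] += 1.0;
    }
    v
}

fn cosine(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let na = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let nb = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if na == 0.0 || nb == 0.0 { 0.0 } else { dot / (na * nb) }
}

impl Memory {
    /// Step 2: retrieve the nearest stored memory for a query embedding.
    fn retrieve(&self, q: &[f32]) -> Option<&String> {
        self.entries
            .iter()
            .max_by(|a, b| cosine(q, &a.0).partial_cmp(&cosine(q, &b.0)).unwrap())
            .map(|(_, t)| t)
    }

    /// Step 5: write back only interactions above a quality threshold.
    fn writeback(&mut self, q: &str, a: &str, quality: f32, threshold: f32) -> bool {
        if quality >= threshold {
            self.entries.push((embed(q), format!("Q: {q}\nA: {a}")));
            true
        } else {
            false
        }
    }
}
```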
---
## 2. Technical Requirements
### 2.1 Functional Requirements
#### FR-001: LFM2 Model Integration
- **Description**: Support LFM2 model family (350M, 700M, 1.2B, 2.6B parameters)
- **Acceptance Criteria**:
- Load models via llama.cpp (CPU) or vLLM (server)
- Support quantization: Q4/Q5 (CPU), 8-bit/4-bit weight-only (GPU)
- Enable KV cache for context reuse
- Achieve <500ms median latency (CPU), <100ms (GPU)
#### FR-002: Ruvector Memory Service
- **Description**: Implement semantic memory with graph structure
- **Storage Schema**:
```
Nodes: {
id: UUID,
vector: [f32; D], // D = embedding dimension
text: String,
type: NodeType, // Query | Document | AgentStep | Fact
source: String,
metadata: {
timestamp: i64,
tags: Vec<String>,
domain: String,
version: u32,
confidence: f32
}
}
Edges: {
id: UUID,
src: UUID,
dst: UUID,
rel: EdgeType, // Cites | Follows | SameTopic | AgentStep | Derived
weight: f32,
metadata: {
timestamp: i64,
created_by: String,
confidence: f32
}
}
```
- **Acceptance Criteria**:
- HNSW index with M=32, efConstruction=200, efSearch=64
- Sub-millisecond retrieval for k≤64
- Graph attention over 2-hop neighborhoods
- Support billion-scale corpora
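The 2-hop neighborhood expansion in the criteria above amounts to a two-round breadth-first walk over the edge list. A stdlib-only sketch, with a plain `HashMap` adjacency list standing in for the actual ruvector-graph API:

```rust
use std::collections::{HashMap, HashSet};

/// Expand a set of seed node IDs to their 2-hop neighborhood
/// (seeds included). Illustrative sketch, not the ruvector-graph API.
fn two_hop(adj: &HashMap<u32, Vec<u32>>, seeds: &[u32]) -> HashSet<u32> {
    let mut frontier: HashSet<u32> = seeds.iter().copied().collect();
    let mut visited = frontier.clone();
    for _hop in 0..2 {
        let mut next = HashSet::new();
        for n in &frontier {
            if let Some(nbrs) = adj.get(n) {
                for &m in nbrs {
                    // insert() returns true only for unseen nodes
                    if visited.insert(m) {
                        next.insert(m);
                    }
                }
            }
        }
        frontier = next;
    }
    visited
}
```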
#### FR-003: FastGRNN Router
- **Description**: Implement gated recurrent router for intelligent resource allocation
- **Architecture** (per Kusupati et al.):
- Hidden size: 32-64 units
- Input: Fixed-length feature vector (~128 dims)
- Outputs: model_selection, context_size, temperature, top_p
- **Feature Vector Components** (128 dimensions):
```
Query Stats [32 dims]:
- token_count: f32
- language_id: [f32; 8] (one-hot)
- domain_encoding: [f32; 16]
- user_frequency: f32
- query_type: [f32; 6] (factual/reasoning/creative/...)
Embedding Stats [16 dims]:
- l2_norm: f32
- principal_components: [f32; 8]
- entropy: f32
- sparsity: f32
- cluster_assignment: [f32; 4]
- reserved: f32 (pads the segment to 16)
HNSW Search Stats [48 dims]:
- k_retrieved: f32
- distances: { mean, std, min, max }: [f32; 4]
- entropy: f32
- graph_depth: f32
- recall_estimate: f32
- neighborhood_density: [f32; 16]
- semantic_coherence: [f32; 24]
System Constraints [32 dims]:
- latency_budget: f32
- device_class: [f32; 4] (edge/mobile/server/cluster)
- privacy_level: [f32; 4]
- memory_available: f32
- battery_level: f32 (for mobile)
- concurrent_requests: f32
- historical_accuracy: [f32; 16]
- reserved: [f32; 4] (pads the segment to 32)
```
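Assembling the four segments into the router's 128-dim input is a straightforward concatenation; in this sketch the segment contents are placeholders and only the dimensions follow the spec:

```rust
/// Concatenate the four feature segments into the 128-dim router input.
/// Segment contents are illustrative; only the sizes follow the spec.
fn build_features(
    query_stats: [f32; 32],
    embedding_stats: [f32; 16],
    hnsw_stats: [f32; 48],
    system_constraints: [f32; 32],
) -> Vec<f32> {
    let mut f = Vec::with_capacity(128);
    f.extend_from_slice(&query_stats);
    f.extend_from_slice(&embedding_stats);
    f.extend_from_slice(&hnsw_stats);
    f.extend_from_slice(&system_constraints);
    debug_assert_eq!(f.len(), 128); // must match the router's input width
    f
}
```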
#### FR-004: Self-Learning Pipeline
- **Description**: Implement continuous learning with forgetting mitigation
- **Components**:
- Online learning from successful interactions
- Elastic Weight Consolidation (EWC) for catastrophic forgetting prevention
- Experience replay with reservoir sampling
- Curriculum learning for progressive complexity
- **Acceptance Criteria**:
- Quality regret <0.1 points vs. always-big baseline
- No measurable forgetting over 10K update cycles
- Router accuracy >95% for seen patterns
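The experience-replay component above can be sketched as a classic reservoir sampler (Vitter's Algorithm R), which keeps a uniform sample of all experiences seen so far in bounded memory. The tiny LCG below keeps the sketch stdlib-only; the real buffer lives in ruvector-gnn and may differ:

```rust
/// Reservoir sampling replay buffer (Algorithm R). Stdlib-only sketch.
struct Reservoir {
    capacity: usize,
    seen: u64,
    items: Vec<String>,
    rng_state: u64, // tiny LCG so the sketch needs no external crate
}

impl Reservoir {
    fn new(capacity: usize) -> Self {
        Self { capacity, seen: 0, items: Vec::new(), rng_state: 0x9E37_79B9_7F4A_7C15 }
    }

    fn next_u64(&mut self) -> u64 {
        // Linear congruential step; a real implementation would use `rand`.
        self.rng_state = self
            .rng_state
            .wrapping_mul(6364136223846793005)
            .wrapping_add(1);
        self.rng_state
    }

    fn push(&mut self, item: String) {
        self.seen += 1;
        if self.items.len() < self.capacity {
            self.items.push(item);
        } else {
            // Keep the new item with probability capacity / seen by
            // replacing a uniformly chosen slot index j < capacity.
            let j = (self.next_u64() % self.seen) as usize;
            if j < self.capacity {
                self.items[j] = item;
            }
        }
    }
}
```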
#### FR-005: Graph Attention Engine
- **Description**: Context extraction via graph-aware attention
- **Mechanism**:
- Multi-head attention over retrieved nodes
- Edge-weighted aggregation (confidence, recency)
- Hyperbolic embeddings for hierarchical relationships
- 2-hop neighborhood expansion
- **Integration with existing ruvector-attention**:
- Leverage `EdgeFeaturedAttention` for edge attributes
- Use `GraphRoPE` for positional encoding on graphs
- Apply `DualSpaceAttention` for multi-manifold reasoning
### 2.2 Non-Functional Requirements
#### NFR-001: Performance
| Metric | Tier A (Server) | Tier B (Edge) | Tier C (Mobile) |
|--------|-----------------|---------------|-----------------|
| P50 Latency | <200ms | <500ms | <800ms |
| P99 Latency | <1s | <2s | <5s |
| Throughput | 100 QPS | 20 QPS | 5 QPS |
| Memory | <16GB | <4GB | <1GB |
#### NFR-002: Quality
- **Accuracy**: F1 >0.85 on QA benchmarks
- **Retrieval**: R@10 >0.90 for relevant documents
- **Router**: Decision accuracy >95%
- **Judge Rating**: 4.2+/5.0 on LLM-as-judge evaluations
#### NFR-003: Scalability
- Support 10M+ vectors in memory
- Support 1B+ vectors with hybrid indexing
- Linear scaling with node count in cluster mode
#### NFR-004: Reliability
- Zero data loss on graceful shutdown
- Recovery from OOM within 30s
- Automatic failover in cluster mode
---
## 3. LFM2 Deep Dive
### 3.1 Architecture Analysis
LFM2 employs a **hybrid backbone** combining:
1. **Gated Short Convolutions**: Lightweight local feature processing
- O(n) complexity vs O(n²) for attention
- Captures local patterns efficiently
- Enables 2x faster prefill on CPUs
2. **Grouped Query Attention (GQA)**: Reduced KV heads
- 4-8 KV heads vs 32+ in standard attention
- Maintains quality with 4x memory reduction
- Critical for edge deployment
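To illustrate why the short-convolution path is O(n·k) rather than O(n²): each output position touches only the last k inputs, not all previous positions. A toy single-channel causal convolution with a sigmoid gate (LFM2's actual block is multi-channel with learned gating, so this is only a shape sketch):

```rust
/// Toy single-channel gated short convolution: causal 1-D convolution
/// with kernel `w`, modulated by a sigmoid gate on the input.
/// Illustrative only; LFM2's real layer is multi-channel and learned.
fn gated_short_conv(x: &[f32], w: &[f32]) -> Vec<f32> {
    let k = w.len();
    x.iter()
        .enumerate()
        .map(|(t, &xt)| {
            // Causal: only positions <= t contribute, so cost is O(n·k).
            let conv: f32 = (0..k)
                .filter_map(|i| t.checked_sub(i).map(|idx| w[i] * x[idx]))
                .sum();
            let gate = 1.0 / (1.0 + (-xt).exp()); // sigmoid gate
            gate * conv
        })
        .collect()
}
```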
### 3.2 Training Methodology
LFM2's training is relevant for our self-learning pipeline:
1. **Knowledge Distillation**: Tempered, decoupled Top-K
- Teacher: Large model (70B+)
- Student: LFM2 variants
- **Insight**: We can distill router decisions from expensive oracle
2. **Curriculum Learning**: Progressive complexity
- Start with simple factual queries
- Graduate to multi-step reasoning
- **Application**: Router training follows same progression
3. **Three-Stage Post-Training**:
- SFT: Supervised fine-tuning on quality data
- DPO: Direct preference optimization
- Model merging: Combine specialists
- **Application**: We merge domain-specific adapters
### 3.3 Multimodal Extensions (Future)
- **LFM2-VL**: Vision-language (image understanding)
- **LFM2-Audio**: Speech I/O
- **LFM2-ColBERT**: Low-latency retrieval encoder
---
## 4. Ruvector Integration Analysis
### 4.1 Existing Capabilities
| Component | Status | Integration Plan |
|-----------|--------|------------------|
| ruvector-core | ✅ Production | Primary vector store |
| ruvector-gnn | ✅ Production | Graph neural layer |
| ruvector-attention | ✅ Production | Attention mechanisms |
| ruvector-router-core | ✅ Production | Base routing |
| ruvector-graph | ✅ Production | Knowledge graph |
### 4.2 Required Extensions
#### 4.2.1 Embedding Adapter
```rust
pub struct EmbeddingAdapter {
/// LFM2 encoder for query embedding
lfm2_encoder: Lfm2Encoder,
/// Dimension alignment layer
projection: Linear,
/// Normalization
layer_norm: LayerNorm,
}
impl EmbeddingAdapter {
pub fn embed(&self, text: &str) -> Vec<f32> {
let raw = self.lfm2_encoder.encode(text);
let projected = self.projection.forward(&raw);
self.layer_norm.forward(&projected)
}
}
```
#### 4.2.2 Memory Writeback Service
```rust
pub struct MemoryWriteback {
/// Quality threshold for writeback
quality_threshold: f32,
/// Deduplication via MinHash
dedup_hasher: MinHasher,
/// Conflict resolution
merger: ConflictMerger,
}
impl MemoryWriteback {
pub async fn maybe_write(
&self,
query: &str,
response: &str,
quality_score: f32,
db: &VectorDB,
) -> Result<Option<UUID>> {
if quality_score < self.quality_threshold {
return Ok(None);
}
// Check for near-duplicates
let embedding = embed(query, response);
let similar = db.search_threshold(&embedding, 0.95)?;
if !similar.is_empty() {
return self.merger.resolve(similar, query, response);
}
// Insert new memory
let entry = VectorEntry::new(embedding)
.with_text(format!("Q: {}\nA: {}", query, response))
.with_metadata(json!({
"type": "qa_pair",
"quality": quality_score,
"timestamp": now(),
}));
Ok(Some(db.insert(entry)?))
}
}
```
### 4.3 HNSW Parameter Tuning
Based on arxiv:2511.23404v1 insights on retrieval efficiency:
| Corpus Size | M | efConstruction | efSearch | Recall@10 |
|-------------|---|----------------|----------|-----------|
| <100K | 16 | 100 | 32 | 0.98 |
| 100K-1M | 32 | 200 | 64 | 0.96 |
| 1M-10M | 48 | 300 | 128 | 0.94 |
| 10M-100M | 64 | 400 | 256 | 0.92 |
| >100M | Hybrid | Tiered | Adaptive | 0.90 |
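Selecting parameters from the table can be a simple size-based lookup. The >100M tier is omitted in this sketch because it switches to hybrid/tiered indexing rather than a single HNSW configuration:

```rust
/// Pick (M, efConstruction, efSearch) from corpus size, following the
/// tuning table. The >100M hybrid tier is out of scope for this sketch.
fn hnsw_params(corpus_size: u64) -> (usize, usize, usize) {
    match corpus_size {
        0..=99_999 => (16, 100, 32),
        100_000..=999_999 => (32, 200, 64),
        1_000_000..=9_999_999 => (48, 300, 128),
        _ => (64, 400, 256),
    }
}
```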
---
## 5. FastGRNN Router Specification
### 5.1 Mathematical Formulation
FastGRNN (Fast, Accurate, Stable and Tiny Gated Recurrent Neural Network):
```
z_t = σ(W · x_t + U · h_{t-1} + b_z)
h̃_t = tanh(W · x_t + U · h_{t-1} + b_h)
h_t = (ζ · (1 - z_t) + ν) ⊙ h̃_t + z_t ⊙ h_{t-1}
where:
- ζ, ν: Learned scalars constrained to [0, 1]
- W: Input weight matrix, shared by gate and candidate (sparse)
- U: Recurrent weight matrix, shared by gate and candidate (low-rank)
- b_z, b_h: Gate and candidate biases (the only non-shared parameters)
```
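A minimal dense implementation of the cell. FastGRNN shares W and U between the gate and the candidate, which, together with sparsity and low rank, is what makes it kilobyte-sized; the production router would use sparse/low-rank matrices, while this sketch keeps them dense for clarity:

```rust
/// Minimal dense FastGRNN cell (after Kusupati et al.). The gate and
/// candidate share the same W and U; only the biases differ.
struct FastGrnnCell {
    w: Vec<Vec<f32>>, // shared input weights  [hidden][input]
    u: Vec<Vec<f32>>, // shared recurrent weights [hidden][hidden]
    b_z: Vec<f32>,    // update-gate bias
    b_h: Vec<f32>,    // candidate bias
    zeta: f32,        // learned scalar ζ
    nu: f32,          // learned scalar ν
}

fn matvec(m: &[Vec<f32>], v: &[f32]) -> Vec<f32> {
    m.iter()
        .map(|row| row.iter().zip(v).map(|(a, b)| a * b).sum())
        .collect()
}

impl FastGrnnCell {
    fn step(&self, x: &[f32], h: &[f32]) -> Vec<f32> {
        let wx = matvec(&self.w, x);
        let uh = matvec(&self.u, h);
        (0..h.len())
            .map(|i| {
                let pre = wx[i] + uh[i]; // shared pre-activation
                let z = 1.0 / (1.0 + (-(pre + self.b_z[i])).exp());
                let h_tilde = (pre + self.b_h[i]).tanh();
                (self.zeta * (1.0 - z) + self.nu) * h_tilde + z * h[i]
            })
            .collect()
    }
}
```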
### 5.2 Output Heads
```rust
pub struct RouterOutputs {
/// Model selection: [350M, 700M, 1.2B, 2.6B] probabilities
pub model_probs: [f32; 4],
/// Context size bins: [256, 512, 1024, 2048, 4096] tokens
pub context_probs: [f32; 5],
/// Temperature: continuous [0.0, 2.0]
pub temperature: f32,
/// Top-p: continuous [0.0, 1.0]
pub top_p: f32,
/// Confidence score
pub confidence: f32,
}
```
### 5.3 Training Protocol
**Phase 1: Data Collection**
```
For each query q:
1. Run all model configurations (expensive baseline)
2. Collect quality metrics Q, latency L, cost C
3. Compute utility: U = Q - λ·L - μ·C
4. Label: y_model = argmax(U), y_ctx = min viable context
```
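The utility computation and argmax labeling translate directly to code; the λ and μ values in the test are illustrative trade-off weights:

```rust
/// Phase-1 labeling: score each configuration by U = Q - λ·L - μ·C
/// and return the index of the best one.
fn label_best(configs: &[(f32, f32, f32)], lambda: f32, mu: f32) -> usize {
    // Each tuple is (quality Q, latency L, cost C).
    configs
        .iter()
        .enumerate()
        .map(|(i, &(q, l, c))| (i, q - lambda * l - mu * c))
        .max_by(|a, b| a.1.partial_cmp(&b.1).unwrap())
        .map(|(i, _)| i)
        .unwrap()
}
```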
**Phase 2: Supervised Training**
```
Loss = CE(model_pred, y_model)
+ CE(ctx_pred, y_ctx)
+ α·SmoothL1(temp_pred, y_temp)
+ β·SmoothL1(top_p_pred, y_top_p)
```
**Phase 3: Online Refinement**
```
Every N requests:
1. Sample exploration (ε-greedy or Thompson)
2. Compute regret vs. oracle
3. Update weights with importance sampling
4. Apply EWC regularization
```
---
## 6. Self-Learning Mechanisms
### 6.1 Continual Learning Architecture
```
┌─────────────────────────────────────────────────────────────┐
│ Self-Learning Pipeline │
├─────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐ │
│ │ Query │───▶│ Retrieve│───▶│ Generate│───▶│ Evaluate│ │
│ └─────────┘ └─────────┘ └─────────┘ └─────────┘ │
│ │ │ │ │ │
│ │ │ │ ▼ │
│ │ │ │ ┌─────────┐ │
│ │ │ │ │ Quality │ │
│ │ │ │ │ > θ ? │ │
│ │ │ │ └────┬────┘ │
│ │ │ │ │ │
│ │ │ │ ┌──────┴──────┐ │
│ │ │ │ ▼ ▼ │
│ │ │ │ ┌───────┐ ┌───────┐ │
│ │ │ │ │ Write │ │ Skip │ │
│ │ │ │ │ Back │ │ │ │
│ │ │ │ └───┬───┘ └───────┘ │
│ │ │ │ │ │
│ ▼ ▼ ▼ ▼ │
│ ┌─────────────────────────────────────────────┐ │
│ │ Replay Buffer (Reservoir) │ │
│ │ ┌─────┐ ┌─────┐ ┌─────┐ ┌─────┐ ┌─────┐ │ │
│ │ │ E_1 │ │ E_2 │ │ ... │ │E_n-1│ │ E_n │ │ │
│ │ └─────┘ └─────┘ └─────┘ └─────┘ └─────┘ │ │
│ └──────────────────────┬──────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────┐ │
│ │ EWC Regularization Layer │ │
│ │ │ │
│ │ L_total = L_task + λ·Σ F_i·(θ_i - θ*_i)² │ │
│ │ │ │
│ │ F_i = Fisher Information (importance) │ │
│ │ θ*_i = Optimal weights from previous task │ │
│ └─────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────┘
```
### 6.2 Quality Evaluation
**LLM-as-Judge Protocol**:
```rust
pub struct QualityJudge {
judge_model: Lfm2, // Use 2.6B for judging
rubric: JudgeRubric,
}
impl QualityJudge {
pub fn evaluate(&self, query: &str, response: &str, context: &[&str]) -> f32 {
let prompt = format!(r#"
Evaluate the response quality on a scale of 1-5:
Query: {query}
Retrieved Context: {context:?}
Response: {response}
Criteria:
1. Factual accuracy (grounded in context)
2. Completeness (addresses the query fully)
3. Coherence (logical flow)
4. Conciseness (no unnecessary verbosity)
Score (1-5):
"#);
let score_str = self.judge_model.generate(&prompt, 10);
parse_score(&score_str)
}
}
```
### 6.3 Forgetting Mitigation
**Elastic Weight Consolidation (EWC)**:
```rust
// From ruvector-gnn ewc module
pub struct ElasticWeightConsolidation {
lambda: f32, // Regularization strength
fisher_info: Vec<f32>, // Fisher information diagonal
optimal_weights: Vec<f32>, // θ* from previous task
}
impl ElasticWeightConsolidation {
pub fn regularization_loss(&self, current_weights: &[f32]) -> f32 {
self.fisher_info.iter()
.zip(current_weights.iter())
.zip(self.optimal_weights.iter())
.map(|((f, w), w_star)| f * (w - w_star).powi(2))
.sum::<f32>() * self.lambda / 2.0
}
pub fn update_fisher(&mut self, gradients: &[Vec<f32>]) {
// Fisher = E[∇logP(y|x;θ)²]
for (i, grad_samples) in gradients.iter().enumerate() {
self.fisher_info[i] = grad_samples.iter()
.map(|g| g.powi(2))
.sum::<f32>() / grad_samples.len() as f32;
}
}
}
```
---
## 7. Performance Optimization Strategy
### 7.1 LFM2 Level
| Optimization | Speedup | Quality Impact | Implementation |
|--------------|---------|----------------|----------------|
| Model selection | 2-4x | <1% | FastGRNN router |
| KV cache reuse | 1.5-2x | 0% | llama.cpp native |
| Q4 quantization | 2-3x | <2% | GGUF format |
| Speculative decode | 1.3-1.5x | 0% | Draft model |
| Continuous batching | 2-4x | 0% | vLLM |
### 7.2 Ruvector Level
| Optimization | Speedup | Quality Impact | Implementation |
|--------------|---------|----------------|----------------|
| HNSW tuning | Variable | Recall tradeoff | efSearch adjustment |
| Product quantization | 4-8x memory | <5% | PQ in ruvector-core |
| Graph pruning | 1.2-1.5x | <1% | Edge weight threshold |
| Batch retrieval | 2-3x | 0% | Parallel HNSW |
| Caching | 10x+ (hits) | 0% | LRU with TTL |
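The TTL half of the caching row can be sketched as lazy expiry on read; a production cache would additionally bound capacity with LRU eviction (e.g. via the `lru` crate already listed in the dependencies):

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Minimal TTL cache sketch: entries expire lazily on read.
/// A production cache would also bound capacity with LRU eviction.
struct TtlCache {
    ttl: Duration,
    map: HashMap<String, (Instant, String)>,
}

impl TtlCache {
    fn new(ttl: Duration) -> Self {
        Self { ttl, map: HashMap::new() }
    }

    fn put(&mut self, k: String, v: String) {
        self.map.insert(k, (Instant::now(), v));
    }

    fn get(&mut self, k: &str) -> Option<String> {
        let expired = match self.map.get(k) {
            Some((inserted, _)) => inserted.elapsed() >= self.ttl,
            None => return None,
        };
        if expired {
            self.map.remove(k); // lazy eviction of stale entries
            None
        } else {
            self.map.get(k).map(|(_, v)| v.clone())
        }
    }
}
```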
### 7.3 Router Level
| Optimization | Speedup | Quality Impact | Implementation |
|--------------|---------|----------------|----------------|
| Sparse weights | 10-50x | <0.5% | Magnitude pruning |
| Low-rank U | 2-4x | <0.5% | SVD decomposition |
| Int8 quantization | 2-4x | <0.1% | Post-training quant |
| Cascade routing | 1.5-2x | 0% | Early exit |
---
## 8. Success Metrics
### 8.1 Primary Metrics
| Metric | Target | Measurement |
|--------|--------|-------------|
| End-to-end latency P50 | <500ms | Timer instrumentation |
| Quality (LLM judge) | 4.2+/5.0 | Automated evaluation |
| Router accuracy | >95% | Oracle comparison |
| Memory efficiency | <4GB (edge) | RSS monitoring |
| Throughput | 20 QPS (edge) | Load testing |
### 8.2 Secondary Metrics
| Metric | Target | Measurement |
|--------|--------|-------------|
| Retrieval R@10 | >0.90 | Benchmark suite |
| Forgetting rate | <5%/10K updates | Periodic eval |
| Cost reduction | >50% vs baseline | Token counting |
| Writeback rate | 10-30% | Database metrics |
### 8.3 Regret Analysis
```
Quality Regret = E[Q_baseline - Q_routed]
Latency Regret = E[L_routed - L_oracle]
Cost Regret = E[C_routed - C_oracle]
Targets:
- Quality Regret < 0.1 points (1-5 scale)
- Latency Regret < 50ms
- Cost Regret < 10%
```
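The regret formulas reduce to averages over logged observations; `Obs` below is an illustrative record type, not part of the actual metrics schema:

```rust
/// One logged request with routed, baseline, and oracle measurements.
/// Illustrative record type for regret estimation.
struct Obs {
    q_routed: f32,
    q_baseline: f32,
    l_routed: f32,
    l_oracle: f32,
}

/// E[Q_baseline - Q_routed] over the log.
fn quality_regret(obs: &[Obs]) -> f32 {
    obs.iter().map(|o| o.q_baseline - o.q_routed).sum::<f32>() / obs.len() as f32
}

/// E[L_routed - L_oracle] over the log (milliseconds).
fn latency_regret(obs: &[Obs]) -> f32 {
    obs.iter().map(|o| o.l_routed - o.l_oracle).sum::<f32>() / obs.len() as f32
}
```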
---
## 9. Risk Analysis
| Risk | Probability | Impact | Mitigation |
|------|-------------|--------|------------|
| Router misprediction | Medium | High | Confidence thresholds, fallback |
| Catastrophic forgetting | Low | Critical | EWC, replay buffer, checkpoints |
| Memory exhaustion | Medium | High | Streaming, tiered storage |
| Quality degradation | Medium | High | A/B testing, rollback |
| Latency spikes | High | Medium | Caching, async processing |
---
## 10. Dependencies
### 10.1 Internal Dependencies
```toml
[dependencies]
ruvector-core = { path = "../ruvector-core" }
ruvector-gnn = { path = "../ruvector-gnn" }
ruvector-attention = { path = "../ruvector-attention" }
ruvector-graph = { path = "../ruvector-graph" }
ruvector-router-core = { path = "../ruvector-router-core" }
```
### 10.2 External Dependencies
```toml
[dependencies]
# LLM runtime
llama-cpp-rs = "0.3" # CPU inference
tokenizers = "0.15" # Fast tokenization
# Async runtime
tokio = { version = "1.41", features = ["full"] }
# Serialization
serde = { version = "1.0", features = ["derive"] }
# Metrics
prometheus = "0.13"
tracing = "0.1"
```
---
## 11. References
1. **LFM2 Technical Report**: arxiv:2511.23404v1
2. **FastGRNN**: Kusupati et al., "FastGRNN: A Fast, Accurate, Stable and Tiny Kilobyte Sized Gated Recurrent Neural Network"
3. **EWC**: Kirkpatrick et al., "Overcoming catastrophic forgetting in neural networks"
4. **HNSW**: Malkov & Yashunin, "Efficient and robust approximate nearest neighbor search using Hierarchical Navigable Small World graphs"
5. **Graph Attention**: Veličković et al., "Graph Attention Networks"
---
*Document Version: 1.0*
*Last Updated: 2025-12-02*
*Author: RuvLLM Architecture Team*

# RuvLLM: Integration and Deployment
## SPARC Phase 5: Completion
---
## 1. Integration Strategy
### 1.1 Crate Structure
```
ruvector/
├── crates/
│ ├── ruvector-core/ # Existing: Vector DB
│ ├── ruvector-gnn/ # Existing: GNN + EWC + Replay
│ ├── ruvector-attention/ # Existing: Attention mechanisms
│ ├── ruvector-graph/ # Existing: Graph storage
│ └── ruvector-router-core/ # Existing: Routing primitives
└── examples/
└── ruvLLM/ # NEW: Self-learning LLM
├── src/
│ ├── lib.rs # Main library entry
│ ├── orchestrator.rs # Request orchestration
│ ├── embedding.rs # LFM2 embedding service
│ ├── router.rs # FastGRNN router
│ ├── memory.rs # Ruvector memory layer
│ ├── attention.rs # Graph attention wrapper
│ ├── inference.rs # LFM2 model pool
│ ├── learning.rs # Self-learning service
│ ├── compression.rs # Concept abstraction
│ ├── config.rs # Configuration
│ ├── types.rs # Core types
│ └── error.rs # Error handling
├── tests/
│ ├── unit/
│ └── integration/
├── benches/
├── config/
└── docs/ # SPARC documentation
```
### 1.2 Dependency Integration
```toml
# examples/ruvLLM/Cargo.toml
[package]
name = "ruvllm"
version = "0.1.0"
edition = "2021"
description = "Self-learning LLM with LFM2 and Ruvector integration"
[dependencies]
# Internal dependencies (path-based for development)
ruvector-core = { path = "../../crates/ruvector-core" }
ruvector-gnn = { path = "../../crates/ruvector-gnn" }
ruvector-attention = { path = "../../crates/ruvector-attention" }
ruvector-graph = { path = "../../crates/ruvector-graph" }
ruvector-router-core = { path = "../../crates/ruvector-router-core" }
# LLM inference
llama-cpp-rs = "0.3" # CPU inference via llama.cpp
tokenizers = "0.15" # Fast tokenization
# Async runtime
tokio = { version = "1.41", features = ["full"] }
futures = "0.3"
# Serialization
serde = { version = "1.0", features = ["derive"] }
serde_json = "1.0"
bincode = "2.0.0-rc.3"
# Numerics
ndarray = { version = "0.16", features = ["serde"] }
rand = "0.8"
# Utilities
uuid = { version = "1.11", features = ["v4", "serde"] }
chrono = { version = "0.4", features = ["serde"] }
thiserror = "2.0"
anyhow = "1.0"
tracing = "0.1"
# Performance
dashmap = "6.1"
parking_lot = "0.12"
lru = "0.12"
# Metrics
prometheus = "0.13"
[dev-dependencies]
criterion = { version = "0.5", features = ["html_reports"] }
proptest = "1.5"
tokio-test = "0.4"
tempfile = "3.13"
tracing-subscriber = "0.3"
[features]
default = ["cpu"]
cpu = [] # llama.cpp CPU inference
gpu = ["vllm"] # vLLM GPU inference (optional)
vllm = []
[[bench]]
name = "pipeline"
harness = false
[[bench]]
name = "router"
harness = false
[[bench]]
name = "memory"
harness = false
```
### 1.3 API Surface
```rust
//! # RuvLLM - Self-Learning LLM
//!
//! A self-learning language model system integrating LFM2 with Ruvector.
//!
//! ## Architecture
//!
//! - **LFM2**: Frozen reasoning engine (350M-2.6B parameters)
//! - **Ruvector**: Living memory that adapts continuously
//! - **FastGRNN**: Control circuit for intelligent routing
//!
//! ## Quick Start
//!
//! ```rust,ignore
//! use ruvllm::{RuvLLM, Config};
//!
//! #[tokio::main]
//! async fn main() -> Result<()> {
//! // Initialize system
//! let config = Config::builder()
//! .db_path("./memory.db")
//! .model_path_350m("./models/lfm2-350m-q4.gguf")
//! .model_path_700m("./models/lfm2-700m-q4.gguf")
//! .build()?;
//!
//! let llm = RuvLLM::new(config).await?;
//!
//! // Process query
//! let response = llm.query("What is machine learning?").await?;
//! println!("Response: {}", response.text);
//! println!("Confidence: {:.2}", response.confidence);
//!
//! Ok(())
//! }
//! ```
//!
//! ## Self-Learning Loops
//!
//! The system learns through three feedback loops:
//!
//! 1. **Memory Growth**: Every interaction strengthens/weakens graph edges
//! 2. **Router Learning**: FastGRNN learns optimal model selection
//! 3. **Compression**: Periodic summarization creates concept hierarchies
pub mod attention;
pub mod compression;
pub mod config;
pub mod embedding;
pub mod error;
pub mod inference;
pub mod learning;
pub mod memory;
pub mod orchestrator;
pub mod router;
pub mod types;
// Re-exports for convenience
pub use config::{Config, ConfigBuilder};
pub use error::{Error, Result};
pub use orchestrator::RuvLLM;
pub use types::{Request, Response, Session};
/// Library version
pub const VERSION: &str = env!("CARGO_PKG_VERSION");
```
---
## 2. Implementation Checklist
### 2.1 Core Components
```
Phase 1: Foundation
━━━━━━━━━━━━━━━━━━━━
[x] Project structure setup
[x] Cargo.toml with dependencies
[ ] Error types definition
[ ] Configuration system
[ ] Core types (Request, Response, Session)
Phase 2: Services
━━━━━━━━━━━━━━━━━━
[ ] EmbeddingService
[ ] LFM2 encoder wrapper
[ ] Dimension projection
[ ] Tokenization
[ ] Batch processing
[ ] MemoryService
[ ] VectorDB initialization
[ ] GraphStore integration
[ ] HNSW search wrapper
[ ] Graph expansion
[ ] Writeback queue
[ ] FastGRNNRouter
[ ] Cell implementation
[ ] Sparse matrix operations
[ ] Low-rank matrices
[ ] Output heads
[ ] Training loop
[ ] GraphAttentionEngine
[ ] Attention layer wrapper
[ ] Edge feature encoding
[ ] Multi-head aggregation
[ ] Context ranking
[ ] InferencePool
[ ] Model loading
[ ] Lazy initialization
[ ] KV cache management
[ ] LRU eviction
[ ] LearningService
[ ] Quality judge
[ ] Replay buffer
[ ] EWC integration
[ ] Background training
[ ] Compression jobs
Phase 3: Orchestration
━━━━━━━━━━━━━━━━━━━━━━
[ ] Orchestrator
[ ] Request routing
[ ] Session management
[ ] Pipeline coordination
[ ] Metrics collection
[ ] Error handling
Phase 4: Integration
━━━━━━━━━━━━━━━━━━━━
[ ] Integration tests
[ ] Benchmark suite
[ ] Example applications
[ ] Documentation
```
### 2.2 Test Coverage Requirements
| Component | Unit Tests | Integration | Benchmark |
|-----------|------------|-------------|-----------|
| Embedding | 15+ | 3+ | 2 |
| Memory | 20+ | 5+ | 3 |
| Router | 25+ | 5+ | 2 |
| Attention | 15+ | 3+ | 2 |
| Inference | 10+ | 3+ | 2 |
| Learning | 20+ | 5+ | 1 |
| Orchestrator | 10+ | 5+ | 2 |
| **Total** | **115+** | **29+** | **14** |
---
## 3. Deployment Configurations
### 3.1 Edge Deployment (Raspberry Pi / Mobile)
```toml
# config/edge.toml
[system]
device_class = "edge"
max_memory_mb = 2048
max_concurrent_requests = 2
[embedding]
model = "onnx" # ONNX for portability
dimension = 384
batch_size = 1
[memory]
hnsw_m = 16
hnsw_ef_construction = 100
hnsw_ef_search = 32
max_nodes = 100_000
[router]
hidden_dim = 32
sparsity = 0.95
confidence_threshold = 0.6
[inference]
models = ["350m"]
quantization = "q4_k"
max_context = 1024
max_loaded_models = 1
[learning]
enabled = true
quality_threshold = 0.8
replay_capacity = 1000
training_interval_ms = 300_000 # 5 minutes
```
### 3.2 Server Deployment (CPU)
```toml
# config/server-cpu.toml
[system]
device_class = "server"
max_memory_mb = 16384
max_concurrent_requests = 20
[embedding]
model = "lfm2-encoder"
dimension = 768
batch_size = 8
[memory]
hnsw_m = 32
hnsw_ef_construction = 200
hnsw_ef_search = 64
max_nodes = 10_000_000
[router]
hidden_dim = 64
sparsity = 0.9
confidence_threshold = 0.7
[inference]
models = ["700m", "1.2b", "2.6b"]
quantization = "q5_k"
max_context = 4096
max_loaded_models = 2
[learning]
enabled = true
quality_threshold = 0.75
replay_capacity = 100_000
training_interval_ms = 60_000 # 1 minute
```
### 3.3 Server Deployment (GPU)
```toml
# config/server-gpu.toml
[system]
device_class = "gpu"
max_memory_mb = 32768
max_concurrent_requests = 100
[embedding]
model = "lfm2-encoder"
dimension = 1024
batch_size = 32
[memory]
hnsw_m = 48
hnsw_ef_construction = 300
hnsw_ef_search = 128
max_nodes = 100_000_000
[router]
hidden_dim = 64
sparsity = 0.85
confidence_threshold = 0.75
[inference]
models = ["1.2b", "2.6b"]
quantization = "fp16"
max_context = 8192
max_loaded_models = 2
use_vllm = true
tensor_parallel = 1
[learning]
enabled = true
quality_threshold = 0.7
replay_capacity = 1_000_000
training_interval_ms = 30_000 # 30 seconds
```
---
## 4. Operational Runbook
### 4.1 Startup Sequence
```bash
#!/bin/bash
# scripts/start.sh
set -e
CONFIG=${1:-"config/server-cpu.toml"}
LOG_LEVEL=${LOG_LEVEL:-"info"}
echo "Starting RuvLLM with config: $CONFIG"
# 1. Validate configuration
cargo run --release --bin ruvllm-validate -- --config "$CONFIG"
# 2. Initialize database if needed
if [ ! -f "data/memory.db" ]; then
echo "Initializing database..."
cargo run --release --bin ruvllm-init -- --config "$CONFIG"
fi
# 3. Download models if needed
cargo run --release --bin ruvllm-models -- --config "$CONFIG" --check-or-download
# 4. Start server
RUST_LOG=$LOG_LEVEL cargo run --release --bin ruvllm-server -- \
--config "$CONFIG" \
--metrics-port 9090 \
--http-port 8080
```
### 4.2 Health Checks
```rust
/// Health check endpoint implementation
pub struct HealthCheck {
memory: Arc<RuvectorMemory>,
router: Arc<FastGRNNRouter>,
inference: Arc<InferencePool>,
}
impl HealthCheck {
pub async fn check(&self) -> HealthStatus {
let mut status = HealthStatus::default();
// Check memory service
status.memory = match self.memory.ping().await {
Ok(latency) => ComponentHealth::Healthy { latency_ms: latency },
Err(e) => ComponentHealth::Unhealthy { error: e.to_string() },
};
// Check router
status.router = match self.router.ping() {
Ok(latency) => ComponentHealth::Healthy { latency_ms: latency },
Err(e) => ComponentHealth::Unhealthy { error: e.to_string() },
};
// Check inference (at least one model loadable)
status.inference = match self.inference.health_check().await {
Ok(info) => ComponentHealth::Healthy {
latency_ms: info.latency,
details: json!({
"loaded_models": info.loaded_models,
"available_memory": info.available_memory,
}),
},
Err(e) => ComponentHealth::Unhealthy { error: e.to_string() },
};
status.overall = if status.all_healthy() {
OverallHealth::Healthy
} else if status.any_critical() {
OverallHealth::Critical
} else {
OverallHealth::Degraded
};
status
}
}
```
### 4.3 Monitoring Dashboards
```yaml
# Prometheus alerting rules
groups:
- name: ruvllm
rules:
- alert: HighLatency
expr: histogram_quantile(0.95, rate(ruvllm_request_latency_seconds_bucket[5m])) > 1.0
for: 5m
labels:
severity: warning
annotations:
summary: "RuvLLM P95 latency above 1s"
- alert: LowQualityScore
expr: avg(ruvllm_quality_score) < 0.7
for: 10m
labels:
severity: warning
annotations:
summary: "Average quality score dropped below 0.7"
- alert: MemoryPressure
expr: ruvllm_memory_usage_bytes / ruvllm_memory_limit_bytes > 0.9
for: 5m
labels:
severity: critical
annotations:
summary: "Memory usage above 90%"
- alert: RouterLowConfidence
expr: avg(ruvllm_router_confidence) < 0.5
for: 15m
labels:
severity: warning
annotations:
summary: "Router confidence consistently low"
- alert: HighErrorRate
expr: rate(ruvllm_errors_total[5m]) > 0.1
for: 5m
labels:
severity: critical
annotations:
summary: "Error rate above 0.1 errors/sec (5m average)"
```
### 4.4 Backup and Recovery
```bash
#!/bin/bash
# scripts/backup.sh
BACKUP_DIR="/backups/ruvllm/$(date +%Y%m%d_%H%M%S)"
mkdir -p "$BACKUP_DIR"
echo "Creating backup in $BACKUP_DIR"
# 1. Backup memory database
cp -r data/memory.db "$BACKUP_DIR/memory.db"
# 2. Backup router weights
cp -r data/router_weights.bin "$BACKUP_DIR/router_weights.bin"
# 3. Backup EWC state
cp -r data/ewc_state.bin "$BACKUP_DIR/ewc_state.bin"
# 4. Backup replay buffer
cp -r data/replay_buffer.bin "$BACKUP_DIR/replay_buffer.bin"
# 5. Backup configuration
cp -r config/ "$BACKUP_DIR/config/"
# 6. Create manifest
cat > "$BACKUP_DIR/manifest.json" << EOF
{
"timestamp": "$(date -Iseconds)",
"version": "$(cargo run --release --bin ruvllm-version)",
"components": {
"memory_db": "memory.db",
"router_weights": "router_weights.bin",
"ewc_state": "ewc_state.bin",
"replay_buffer": "replay_buffer.bin",
"config": "config/"
}
}
EOF
echo "Backup complete: $BACKUP_DIR"
# 7. Upload to S3 if configured
if [ -n "$S3_BACKUP_BUCKET" ]; then
aws s3 sync "$BACKUP_DIR" "s3://$S3_BACKUP_BUCKET/$(basename $BACKUP_DIR)/"
echo "Uploaded to S3: $S3_BACKUP_BUCKET"
fi
```
---
## 5. Production Checklist
### 5.1 Pre-Launch
```
Security
━━━━━━━━
[ ] Input validation and sanitization
[ ] Rate limiting configured
[ ] TLS/HTTPS enabled
[ ] API authentication (if public)
[ ] Secrets in environment variables
[ ] Model integrity verification
Performance
━━━━━━━━━━━
[ ] Load tested to expected traffic
[ ] Memory profiled (no leaks)
[ ] Latency targets met
[ ] Caching configured
[ ] Connection pooling
Reliability
━━━━━━━━━━━
[ ] Health checks implemented
[ ] Graceful shutdown
[ ] Automatic restarts (systemd/k8s)
[ ] Backup procedures tested
[ ] Recovery procedures documented
Observability
━━━━━━━━━━━━━
[ ] Structured logging
[ ] Metrics exported
[ ] Distributed tracing
[ ] Alerting rules configured
[ ] Dashboards created
```
### 5.2 Post-Launch
```
Daily
━━━━━
[ ] Check error rates
[ ] Review quality scores
[ ] Monitor latency trends
[ ] Verify backup success
Weekly
━━━━━━
[ ] Review router decisions distribution
[ ] Analyze forgetting metrics
[ ] Check memory growth rate
[ ] Run compression job
[ ] Update router weights
Monthly
━━━━━━━
[ ] Full system backup
[ ] Performance benchmark
[ ] Security audit
[ ] Dependency updates
[ ] Evaluate student model candidates
```
---
## 6. API Reference
### 6.1 HTTP API
```yaml
openapi: "3.0.0"
info:
title: RuvLLM API
version: "0.1.0"
description: Self-learning LLM with LFM2 and Ruvector
paths:
/v1/query:
post:
summary: Process a query
requestBody:
required: true
content:
application/json:
schema:
type: object
required:
- query
properties:
query:
type: string
description: The user query
session_id:
type: string
description: Optional session for multi-turn
constraints:
type: object
properties:
max_latency_ms:
type: integer
max_tokens:
type: integer
temperature:
type: number
responses:
"200":
description: Successful response
content:
application/json:
schema:
type: object
properties:
text:
type: string
confidence:
type: number
sources:
type: array
items:
type: object
routing_info:
type: object
/v1/feedback:
post:
summary: Provide feedback on a response
requestBody:
required: true
content:
application/json:
schema:
type: object
required:
- request_id
properties:
request_id:
type: string
rating:
type: integer
minimum: 1
maximum: 5
correction:
type: string
responses:
"200":
description: Feedback recorded
/v1/health:
get:
summary: Health check
responses:
"200":
description: System healthy
"503":
description: System unhealthy
/v1/metrics:
get:
summary: Prometheus metrics
responses:
"200":
description: Metrics in Prometheus format
```
### 6.2 Rust SDK
```rust
use ruvllm::{RuvLLM, Config, Request, Response};
/// Simple query
async fn simple_query(llm: &RuvLLM) -> Result<Response> {
llm.query("What is Rust?").await
}
/// Query with options
async fn query_with_options(llm: &RuvLLM) -> Result<Response> {
llm.query_with(Request {
query: "Explain backpropagation".into(),
session_id: Some("user-123".into()),
constraints: Constraints {
max_latency_ms: Some(500),
max_tokens: Some(500),
temperature: Some(0.7),
..Default::default()
},
}).await
}
/// Multi-turn conversation
async fn conversation(llm: &RuvLLM) -> Result<()> {
let session = llm.new_session();
let r1 = llm.query_session(&session, "What is a neural network?").await?;
println!("Turn 1: {}", r1.text);
let r2 = llm.query_session(&session, "How do you train one?").await?;
println!("Turn 2: {}", r2.text);
let r3 = llm.query_session(&session, "What about overfitting?").await?;
println!("Turn 3: {}", r3.text);
Ok(())
}
/// Provide feedback
async fn with_feedback(llm: &RuvLLM) -> Result<()> {
let response = llm.query("What is 2+2?").await?;
llm.feedback(Feedback {
request_id: response.request_id,
rating: 5,
correction: None,
}).await?;
Ok(())
}
/// Stream response
async fn streaming(llm: &RuvLLM) -> Result<()> {
let mut stream = llm.query_stream("Tell me a story").await?;
while let Some(chunk) = stream.next().await {
print!("{}", chunk?);
}
Ok(())
}
```
---
## 7. Future Roadmap
### 7.1 Short-Term (1-3 months)
- [ ] LFM2-VL integration (vision-language)
- [ ] Multi-GPU inference with tensor parallelism
- [ ] Retrieval-augmented fine-tuning pipeline
- [ ] Improved compression algorithms
- [ ] WebAssembly deployment target
### 7.2 Medium-Term (3-6 months)
- [ ] Federated learning across edge nodes
- [ ] LFM2-Audio integration (speech)
- [ ] Custom domain fine-tuning toolkit
- [ ] Advanced curriculum learning
- [ ] Hyperbolic embeddings for hierarchies
### 7.3 Long-Term (6-12 months)
- [ ] Multi-agent collaboration
- [ ] Neuro-symbolic reasoning integration
- [ ] Continuous pre-training pipeline
- [ ] Hardware-specific optimizations (NPU, TPU)
- [ ] Enterprise multi-tenancy
---
## 8. Success Criteria
### 8.1 Technical Metrics
| Metric | Target | Current |
|--------|--------|---------|
| Latency P50 | <500ms | - |
| Latency P99 | <2s | - |
| Quality Score | >0.8 | - |
| Router Accuracy | >90% | - |
| Memory Efficiency | <4GB (edge) | - |
| Throughput | 20 QPS (edge) | - |
| Forgetting Rate | <5%/10K | - |
| Test Coverage | >80% | - |
### 8.2 Business Metrics
| Metric | Target | Notes |
|--------|--------|-------|
| User Satisfaction | >4.0/5.0 | Survey scores |
| Response Relevance | >85% | Human eval |
| Knowledge Retention | >90% | Multi-turn coherence |
| Cost Reduction | >50% | vs. always-big baseline |
---
## 9. Conclusion
RuvLLM represents a paradigm shift from static LLMs to adaptive, self-learning systems. By treating:
- **LFM2 as the stable cortex** (reasoning)
- **Ruvector as the living synaptic mesh** (memory)
- **FastGRNN as the control circuit** (routing)
we create intelligence that emerges from the loop, not just from the model.
The three learning loops—memory growth, router optimization, and concept compression—enable continuous adaptation without the risks of in-place weight modification.
**The intelligence is not in one model anymore. It is in the loop.**
---
*Document Version: 1.0*
*Last Updated: 2025-12-02*
*Author: RuvLLM Architecture Team*