# RuvLLM: Self-Learning LLM with LFM2 and Ruvector Integration
## SPARC Phase 1: Specification
---
## 1. Executive Summary
RuvLLM is a self-learning LLM architecture that integrates **Liquid Foundation Models (LFM2)** with **ruvector** as the world model and memory substrate. The system uses **FastGRNN** as an intelligent router to dynamically allocate computational resources based on query complexity, enabling efficient on-device inference with continuous learning capabilities.
### Core Innovation
The architecture treats:
- **LFM2** as the reasoning head (inference engine)
- **Ruvector** as the world model and episodic memory
- **FastGRNN** as the control circuit (routing decisions)
This triad creates a self-learning system where:
1. Queries are semantically embedded and matched against memory
2. Graph attention extracts relevant neighborhood context
3. FastGRNN routes to optimal model configuration
4. LFM2 generates responses with retrieved context
5. Successful interactions are written back to memory (self-improvement)
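These five steps can be sketched as a toy end-to-end loop. Everything below is an illustrative stand-in, not the real RuvLLM API: the byte-histogram "embedding", the `Memory` struct, and the cosine retrieval are placeholders for the LFM2 encoder and ruvector store.

```rust
/// Toy memory substrate standing in for ruvector: (embedding, text) pairs.
struct Memory {
    entries: Vec<(Vec<f32>, String)>,
}

/// Toy embedding (byte histogram) standing in for the LFM2 encoder.
fn embed(text: &str) -> Vec<f32> {
    let mut v = vec![0.0f32; 8];
    for b in text.bytes() {
        v[(b % 8) as usize] += 1.0;
    }
    v
}

fn cosine(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let na = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let nb = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if na == 0.0 || nb == 0.0 { 0.0 } else { dot / (na * nb) }
}

impl Memory {
    /// Step 2: retrieve the nearest stored memory for a query embedding.
    fn retrieve(&self, q: &[f32]) -> Option<&String> {
        self.entries
            .iter()
            .max_by(|a, b| cosine(q, &a.0).partial_cmp(&cosine(q, &b.0)).unwrap())
            .map(|(_, t)| t)
    }

    /// Step 5: write back only interactions above a quality threshold.
    fn writeback(&mut self, q: &str, a: &str, quality: f32, threshold: f32) -> bool {
        if quality >= threshold {
            self.entries.push((embed(q), format!("Q: {q}\nA: {a}")));
            true
        } else {
            false
        }
    }
}
```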
---
## 2. Technical Requirements
### 2.1 Functional Requirements
#### FR-001: LFM2 Model Integration
- **Description**: Support LFM2 model family (350M, 700M, 1.2B, 2.6B parameters)
- **Acceptance Criteria**:
- Load models via llama.cpp (CPU) or vLLM (server)
- Support quantization: Q4/Q5 (CPU), 8-bit/4-bit weight-only (GPU)
- Enable KV cache for context reuse
- Achieve <500ms median latency (CPU), <100ms (GPU)
#### FR-002: Ruvector Memory Service
- **Description**: Implement semantic memory with graph structure
- **Storage Schema**:
```
Nodes: {
id: UUID,
vector: [f32; D], // D = embedding dimension
text: String,
type: NodeType, // Query | Document | AgentStep | Fact
source: String,
metadata: {
timestamp: i64,
tags: Vec<String>,
domain: String,
version: u32,
confidence: f32
}
}
Edges: {
id: UUID,
src: UUID,
dst: UUID,
rel: EdgeType, // Cites | Follows | SameTopic | AgentStep | Derived
weight: f32,
metadata: {
timestamp: i64,
created_by: String,
confidence: f32
}
}
```
- **Acceptance Criteria**:
- HNSW index with M=32, efConstruction=200, efSearch=64
- Sub-millisecond retrieval for k≤64
- Graph attention over 2-hop neighborhoods
- Support billion-scale corpora
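The 2-hop neighborhood expansion in the criteria above amounts to a two-round breadth-first walk over the edge list. A stdlib-only sketch, with a plain `HashMap` adjacency list standing in for the actual ruvector-graph API:

```rust
use std::collections::{HashMap, HashSet};

/// Expand a set of seed node IDs to their 2-hop neighborhood
/// (seeds included). Illustrative sketch, not the ruvector-graph API.
fn two_hop(adj: &HashMap<u32, Vec<u32>>, seeds: &[u32]) -> HashSet<u32> {
    let mut frontier: HashSet<u32> = seeds.iter().copied().collect();
    let mut visited = frontier.clone();
    for _hop in 0..2 {
        let mut next = HashSet::new();
        for n in &frontier {
            if let Some(nbrs) = adj.get(n) {
                for &m in nbrs {
                    // insert() returns true only for unseen nodes
                    if visited.insert(m) {
                        next.insert(m);
                    }
                }
            }
        }
        frontier = next;
    }
    visited
}
```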
#### FR-003: FastGRNN Router
- **Description**: Implement gated recurrent router for intelligent resource allocation
- **Architecture** (per Kusupati et al.):
- Hidden size: 32-64 units
- Input: Fixed-length feature vector (~128 dims)
- Outputs: model_selection, context_size, temperature, top_p
- **Feature Vector Components** (128 dimensions):
```
Query Stats [32 dims]:
- token_count: f32
- language_id: [f32; 8] (one-hot)
- domain_encoding: [f32; 16]
- user_frequency: f32
- query_type: [f32; 6] (factual/reasoning/creative/...)
Embedding Stats [16 dims]:
- l2_norm: f32
- principal_components: [f32; 8]
- entropy: f32
- sparsity: f32
- cluster_assignment: [f32; 4]
- reserved: f32 (pads the segment to 16)
HNSW Search Stats [48 dims]:
- k_retrieved: f32
- distances: { mean, std, min, max }: [f32; 4]
- entropy: f32
- graph_depth: f32
- recall_estimate: f32
- neighborhood_density: [f32; 16]
- semantic_coherence: [f32; 24]
System Constraints [32 dims]:
- latency_budget: f32
- device_class: [f32; 4] (edge/mobile/server/cluster)
- privacy_level: [f32; 4]
- memory_available: f32
- battery_level: f32 (for mobile)
- concurrent_requests: f32
- historical_accuracy: [f32; 16]
- reserved: [f32; 4] (pads the segment to 32)
```
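Assembling the four segments into the router's 128-dim input is a straightforward concatenation; in this sketch the segment contents are placeholders and only the dimensions follow the spec:

```rust
/// Concatenate the four feature segments into the 128-dim router input.
/// Segment contents are illustrative; only the sizes follow the spec.
fn build_features(
    query_stats: [f32; 32],
    embedding_stats: [f32; 16],
    hnsw_stats: [f32; 48],
    system_constraints: [f32; 32],
) -> Vec<f32> {
    let mut f = Vec::with_capacity(128);
    f.extend_from_slice(&query_stats);
    f.extend_from_slice(&embedding_stats);
    f.extend_from_slice(&hnsw_stats);
    f.extend_from_slice(&system_constraints);
    debug_assert_eq!(f.len(), 128); // must match the router's input width
    f
}
```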
#### FR-004: Self-Learning Pipeline
- **Description**: Implement continuous learning with forgetting mitigation
- **Components**:
- Online learning from successful interactions
- Elastic Weight Consolidation (EWC) for catastrophic forgetting prevention
- Experience replay with reservoir sampling
- Curriculum learning for progressive complexity
- **Acceptance Criteria**:
- Quality regret <0.1 points vs. always-big baseline
- No measurable forgetting over 10K update cycles
- Router accuracy >95% for seen patterns
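The experience-replay component above can be sketched as a classic reservoir sampler (Vitter's Algorithm R), which keeps a uniform sample of all experiences seen so far in bounded memory. The tiny LCG below keeps the sketch stdlib-only; the real buffer lives in ruvector-gnn and may differ:

```rust
/// Reservoir sampling replay buffer (Algorithm R). Stdlib-only sketch.
struct Reservoir {
    capacity: usize,
    seen: u64,
    items: Vec<String>,
    rng_state: u64, // tiny LCG so the sketch needs no external crate
}

impl Reservoir {
    fn new(capacity: usize) -> Self {
        Self { capacity, seen: 0, items: Vec::new(), rng_state: 0x9E37_79B9_7F4A_7C15 }
    }

    fn next_u64(&mut self) -> u64 {
        // Linear congruential step; a real implementation would use `rand`.
        self.rng_state = self
            .rng_state
            .wrapping_mul(6364136223846793005)
            .wrapping_add(1);
        self.rng_state
    }

    fn push(&mut self, item: String) {
        self.seen += 1;
        if self.items.len() < self.capacity {
            self.items.push(item);
        } else {
            // Keep the new item with probability capacity / seen by
            // replacing a uniformly chosen slot index j < capacity.
            let j = (self.next_u64() % self.seen) as usize;
            if j < self.capacity {
                self.items[j] = item;
            }
        }
    }
}
```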
#### FR-005: Graph Attention Engine
- **Description**: Context extraction via graph-aware attention
- **Mechanism**:
- Multi-head attention over retrieved nodes
- Edge-weighted aggregation (confidence, recency)
- Hyperbolic embeddings for hierarchical relationships
- 2-hop neighborhood expansion
- **Integration with existing ruvector-attention**:
- Leverage `EdgeFeaturedAttention` for edge attributes
- Use `GraphRoPE` for positional encoding on graphs
- Apply `DualSpaceAttention` for multi-manifold reasoning
### 2.2 Non-Functional Requirements
#### NFR-001: Performance
| Metric | Tier A (Server) | Tier B (Edge) | Tier C (Mobile) |
|--------|-----------------|---------------|-----------------|
| P50 Latency | <200ms | <500ms | <800ms |
| P99 Latency | <1s | <2s | <5s |
| Throughput | 100 QPS | 20 QPS | 5 QPS |
| Memory | <16GB | <4GB | <1GB |
#### NFR-002: Quality
- **Accuracy**: F1 >0.85 on QA benchmarks
- **Retrieval**: R@10 >0.90 for relevant documents
- **Router**: Decision accuracy >95%
- **Judge Rating**: 4.2+/5.0 on LLM-as-judge evaluations
#### NFR-003: Scalability
- Support 10M+ vectors in memory
- Support 1B+ vectors with hybrid indexing
- Linear scaling with node count in cluster mode
#### NFR-004: Reliability
- Zero data loss on graceful shutdown
- Recovery from OOM within 30s
- Automatic failover in cluster mode
---
## 3. LFM2 Deep Dive
### 3.1 Architecture Analysis
LFM2 employs a **hybrid backbone** combining:
1. **Gated Short Convolutions**: Lightweight local feature processing
- O(n) complexity vs O(n²) for attention
- Captures local patterns efficiently
- Enables 2x faster prefill on CPUs
2. **Grouped Query Attention (GQA)**: Reduced KV heads
- 4-8 KV heads vs 32+ in standard attention
- Maintains quality with 4x memory reduction
- Critical for edge deployment
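To illustrate why the short-convolution path is O(n·k) rather than O(n²): each output position touches only the last k inputs, not all previous positions. A toy single-channel causal convolution with a sigmoid gate (LFM2's actual block is multi-channel with learned gating, so this is only a shape sketch):

```rust
/// Toy single-channel gated short convolution: causal 1-D convolution
/// with kernel `w`, modulated by a sigmoid gate on the input.
/// Illustrative only; LFM2's real layer is multi-channel and learned.
fn gated_short_conv(x: &[f32], w: &[f32]) -> Vec<f32> {
    let k = w.len();
    x.iter()
        .enumerate()
        .map(|(t, &xt)| {
            // Causal: only positions <= t contribute, so cost is O(n·k).
            let conv: f32 = (0..k)
                .filter_map(|i| t.checked_sub(i).map(|idx| w[i] * x[idx]))
                .sum();
            let gate = 1.0 / (1.0 + (-xt).exp()); // sigmoid gate
            gate * conv
        })
        .collect()
}
```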
### 3.2 Training Methodology
LFM2's training is relevant for our self-learning pipeline:
1. **Knowledge Distillation**: Tempered, decoupled Top-K
- Teacher: Large model (70B+)
- Student: LFM2 variants
- **Insight**: We can distill router decisions from expensive oracle
2. **Curriculum Learning**: Progressive complexity
- Start with simple factual queries
- Graduate to multi-step reasoning
- **Application**: Router training follows same progression
3. **Three-Stage Post-Training**:
- SFT: Supervised fine-tuning on quality data
- DPO: Direct preference optimization
- Model merging: Combine specialists
- **Application**: We merge domain-specific adapters
### 3.3 Multimodal Extensions (Future)
- **LFM2-VL**: Vision-language (image understanding)
- **LFM2-Audio**: Speech I/O
- **LFM2-ColBERT**: Low-latency retrieval encoder
---
## 4. Ruvector Integration Analysis
### 4.1 Existing Capabilities
| Component | Status | Integration Plan |
|-----------|--------|------------------|
| ruvector-core | ✅ Production | Primary vector store |
| ruvector-gnn | ✅ Production | Graph neural layer |
| ruvector-attention | ✅ Production | Attention mechanisms |
| ruvector-router-core | ✅ Production | Base routing |
| ruvector-graph | ✅ Production | Knowledge graph |
### 4.2 Required Extensions
#### 4.2.1 Embedding Adapter
```rust
pub struct EmbeddingAdapter {
/// LFM2 encoder for query embedding
lfm2_encoder: Lfm2Encoder,
/// Dimension alignment layer
projection: Linear,
/// Normalization
layer_norm: LayerNorm,
}
impl EmbeddingAdapter {
pub fn embed(&self, text: &str) -> Vec<f32> {
let raw = self.lfm2_encoder.encode(text);
let projected = self.projection.forward(&raw);
self.layer_norm.forward(&projected)
}
}
```
#### 4.2.2 Memory Writeback Service
```rust
pub struct MemoryWriteback {
/// Quality threshold for writeback
quality_threshold: f32,
/// Deduplication via MinHash
dedup_hasher: MinHasher,
/// Conflict resolution
merger: ConflictMerger,
}
impl MemoryWriteback {
pub async fn maybe_write(
&self,
query: &str,
response: &str,
quality_score: f32,
db: &VectorDB,
) -> Result<Option<UUID>> {
if quality_score < self.quality_threshold {
return Ok(None);
}
// Check for near-duplicates
let embedding = embed(query, response);
let similar = db.search_threshold(&embedding, 0.95)?;
if !similar.is_empty() {
return self.merger.resolve(similar, query, response);
}
// Insert new memory
let entry = VectorEntry::new(embedding)
.with_text(format!("Q: {}\nA: {}", query, response))
.with_metadata(json!({
"type": "qa_pair",
"quality": quality_score,
"timestamp": now(),
}));
Ok(Some(db.insert(entry)?))
}
}
```
### 4.3 HNSW Parameter Tuning
Based on arxiv:2511.23404v1 insights on retrieval efficiency:
| Corpus Size | M | efConstruction | efSearch | Recall@10 |
|-------------|---|----------------|----------|-----------|
| <100K | 16 | 100 | 32 | 0.98 |
| 100K-1M | 32 | 200 | 64 | 0.96 |
| 1M-10M | 48 | 300 | 128 | 0.94 |
| 10M-100M | 64 | 400 | 256 | 0.92 |
| >100M | Hybrid | Tiered | Adaptive | 0.90 |
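Selecting parameters from the table can be a simple size-based lookup. The >100M tier is omitted in this sketch because it switches to hybrid/tiered indexing rather than a single HNSW configuration:

```rust
/// Pick (M, efConstruction, efSearch) from corpus size, following the
/// tuning table. The >100M hybrid tier is out of scope for this sketch.
fn hnsw_params(corpus_size: u64) -> (usize, usize, usize) {
    match corpus_size {
        0..=99_999 => (16, 100, 32),
        100_000..=999_999 => (32, 200, 64),
        1_000_000..=9_999_999 => (48, 300, 128),
        _ => (64, 400, 256),
    }
}
```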
---
## 5. FastGRNN Router Specification
### 5.1 Mathematical Formulation
FastGRNN (Fast, Accurate, Stable and Tiny Gated Recurrent Neural Network):
```
z_t = σ(W · x_t + U · h_{t-1} + b_z)
h̃_t = tanh(W · x_t + U · h_{t-1} + b_h)
h_t = (ζ · (1 - z_t) + ν) ⊙ h̃_t + z_t ⊙ h_{t-1}
where:
- ζ, ν: Learned scalars constrained to [0, 1]
- W: Input weight matrix, shared by gate and candidate (sparse)
- U: Recurrent weight matrix, shared by gate and candidate (low-rank)
- b_z, b_h: Gate and candidate biases (the only non-shared parameters)
```
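A minimal dense implementation of the cell. FastGRNN shares W and U between the gate and the candidate, which, together with sparsity and low rank, is what makes it kilobyte-sized; the production router would use sparse/low-rank matrices, while this sketch keeps them dense for clarity:

```rust
/// Minimal dense FastGRNN cell (after Kusupati et al.). The gate and
/// candidate share the same W and U; only the biases differ.
struct FastGrnnCell {
    w: Vec<Vec<f32>>, // shared input weights  [hidden][input]
    u: Vec<Vec<f32>>, // shared recurrent weights [hidden][hidden]
    b_z: Vec<f32>,    // update-gate bias
    b_h: Vec<f32>,    // candidate bias
    zeta: f32,        // learned scalar ζ
    nu: f32,          // learned scalar ν
}

fn matvec(m: &[Vec<f32>], v: &[f32]) -> Vec<f32> {
    m.iter()
        .map(|row| row.iter().zip(v).map(|(a, b)| a * b).sum())
        .collect()
}

impl FastGrnnCell {
    fn step(&self, x: &[f32], h: &[f32]) -> Vec<f32> {
        let wx = matvec(&self.w, x);
        let uh = matvec(&self.u, h);
        (0..h.len())
            .map(|i| {
                let pre = wx[i] + uh[i]; // shared pre-activation
                let z = 1.0 / (1.0 + (-(pre + self.b_z[i])).exp());
                let h_tilde = (pre + self.b_h[i]).tanh();
                (self.zeta * (1.0 - z) + self.nu) * h_tilde + z * h[i]
            })
            .collect()
    }
}
```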
### 5.2 Output Heads
```rust
pub struct RouterOutputs {
/// Model selection: [350M, 700M, 1.2B, 2.6B] probabilities
pub model_probs: [f32; 4],
/// Context size bins: [256, 512, 1024, 2048, 4096] tokens
pub context_probs: [f32; 5],
/// Temperature: continuous [0.0, 2.0]
pub temperature: f32,
/// Top-p: continuous [0.0, 1.0]
pub top_p: f32,
/// Confidence score
pub confidence: f32,
}
```
### 5.3 Training Protocol
**Phase 1: Data Collection**
```
For each query q:
1. Run all model configurations (expensive baseline)
2. Collect quality metrics Q, latency L, cost C
3. Compute utility: U = Q - λ·L - μ·C
4. Label: y_model = argmax(U), y_ctx = min viable context
```
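The utility computation and argmax labeling translate directly to code; the λ and μ values in the test are illustrative trade-off weights:

```rust
/// Phase-1 labeling: score each configuration by U = Q - λ·L - μ·C
/// and return the index of the best one.
fn label_best(configs: &[(f32, f32, f32)], lambda: f32, mu: f32) -> usize {
    // Each tuple is (quality Q, latency L, cost C).
    configs
        .iter()
        .enumerate()
        .map(|(i, &(q, l, c))| (i, q - lambda * l - mu * c))
        .max_by(|a, b| a.1.partial_cmp(&b.1).unwrap())
        .map(|(i, _)| i)
        .unwrap()
}
```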
**Phase 2: Supervised Training**
```
Loss = CE(model_pred, y_model)
+ CE(ctx_pred, y_ctx)
+ α·SmoothL1(temp_pred, y_temp)
+ β·SmoothL1(top_p_pred, y_top_p)
```
**Phase 3: Online Refinement**
```
Every N requests:
1. Sample exploration (ε-greedy or Thompson)
2. Compute regret vs. oracle
3. Update weights with importance sampling
4. Apply EWC regularization
```
---
## 6. Self-Learning Mechanisms
### 6.1 Continual Learning Architecture
```
┌─────────────────────────────────────────────────────────────┐
│ Self-Learning Pipeline │
├─────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐ │
│ │ Query │───▶│ Retrieve│───▶│ Generate│───▶│ Evaluate│ │
│ └─────────┘ └─────────┘ └─────────┘ └─────────┘ │
│ │ │ │ │ │
│ │ │ │ ▼ │
│ │ │ │ ┌─────────┐ │
│ │ │ │ │ Quality │ │
│ │ │ │ │ > θ ? │ │
│ │ │ │ └────┬────┘ │
│ │ │ │ │ │
│ │ │ │ ┌──────┴──────┐ │
│ │ │ │ ▼ ▼ │
│ │ │ │ ┌───────┐ ┌───────┐ │
│ │ │ │ │ Write │ │ Skip │ │
│ │ │ │ │ Back │ │ │ │
│ │ │ │ └───┬───┘ └───────┘ │
│ │ │ │ │ │
│ ▼ ▼ ▼ ▼ │
│ ┌─────────────────────────────────────────────┐ │
│ │ Replay Buffer (Reservoir) │ │
│ │ ┌─────┐ ┌─────┐ ┌─────┐ ┌─────┐ ┌─────┐ │ │
│ │ │ E_1 │ │ E_2 │ │ ... │ │E_n-1│ │ E_n │ │ │
│ │ └─────┘ └─────┘ └─────┘ └─────┘ └─────┘ │ │
│ └──────────────────────┬──────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────┐ │
│ │ EWC Regularization Layer │ │
│ │ │ │
│ │ L_total = L_task + λ·Σ F_i·(θ_i - θ*_i)² │ │
│ │ │ │
│ │ F_i = Fisher Information (importance) │ │
│ │ θ*_i = Optimal weights from previous task │ │
│ └─────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────┘
```
### 6.2 Quality Evaluation
**LLM-as-Judge Protocol**:
```rust
pub struct QualityJudge {
judge_model: Lfm2, // Use 2.6B for judging
rubric: JudgeRubric,
}
impl QualityJudge {
pub fn evaluate(&self, query: &str, response: &str, context: &[&str]) -> f32 {
let prompt = format!(r#"
Evaluate the response quality on a scale of 1-5:
Query: {query}
Retrieved Context: {context:?}
Response: {response}
Criteria:
1. Factual accuracy (grounded in context)
2. Completeness (addresses the query fully)
3. Coherence (logical flow)
4. Conciseness (no unnecessary verbosity)
Score (1-5):
"#);
let score_str = self.judge_model.generate(&prompt, 10);
parse_score(&score_str)
}
}
```
### 6.3 Forgetting Mitigation
**Elastic Weight Consolidation (EWC)**:
```rust
// From ruvector-gnn ewc module
pub struct ElasticWeightConsolidation {
lambda: f32, // Regularization strength
fisher_info: Vec<f32>, // Fisher information diagonal
optimal_weights: Vec<f32>, // θ* from previous task
}
impl ElasticWeightConsolidation {
pub fn regularization_loss(&self, current_weights: &[f32]) -> f32 {
self.fisher_info.iter()
.zip(current_weights.iter())
.zip(self.optimal_weights.iter())
.map(|((f, w), w_star)| f * (w - w_star).powi(2))
.sum::<f32>() * self.lambda / 2.0
}
pub fn update_fisher(&mut self, gradients: &[Vec<f32>]) {
// Fisher = E[∇logP(y|x;θ)²]
for (i, grad_samples) in gradients.iter().enumerate() {
self.fisher_info[i] = grad_samples.iter()
.map(|g| g.powi(2))
.sum::<f32>() / grad_samples.len() as f32;
}
}
}
```
---
## 7. Performance Optimization Strategy
### 7.1 LFM2 Level
| Optimization | Speedup | Quality Impact | Implementation |
|--------------|---------|----------------|----------------|
| Model selection | 2-4x | <1% | FastGRNN router |
| KV cache reuse | 1.5-2x | 0% | llama.cpp native |
| Q4 quantization | 2-3x | <2% | GGUF format |
| Speculative decode | 1.3-1.5x | 0% | Draft model |
| Continuous batching | 2-4x | 0% | vLLM |
### 7.2 Ruvector Level
| Optimization | Speedup | Quality Impact | Implementation |
|--------------|---------|----------------|----------------|
| HNSW tuning | Variable | Recall tradeoff | efSearch adjustment |
| Product quantization | 4-8x memory | <5% | PQ in ruvector-core |
| Graph pruning | 1.2-1.5x | <1% | Edge weight threshold |
| Batch retrieval | 2-3x | 0% | Parallel HNSW |
| Caching | 10x+ (hits) | 0% | LRU with TTL |
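The TTL half of the caching row can be sketched as lazy expiry on read; a production cache would additionally bound capacity with LRU eviction (e.g. via the `lru` crate already listed in the dependencies):

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Minimal TTL cache sketch: entries expire lazily on read.
/// A production cache would also bound capacity with LRU eviction.
struct TtlCache {
    ttl: Duration,
    map: HashMap<String, (Instant, String)>,
}

impl TtlCache {
    fn new(ttl: Duration) -> Self {
        Self { ttl, map: HashMap::new() }
    }

    fn put(&mut self, k: String, v: String) {
        self.map.insert(k, (Instant::now(), v));
    }

    fn get(&mut self, k: &str) -> Option<String> {
        let expired = match self.map.get(k) {
            Some((inserted, _)) => inserted.elapsed() >= self.ttl,
            None => return None,
        };
        if expired {
            self.map.remove(k); // lazy eviction of stale entries
            None
        } else {
            self.map.get(k).map(|(_, v)| v.clone())
        }
    }
}
```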
### 7.3 Router Level
| Optimization | Speedup | Quality Impact | Implementation |
|--------------|---------|----------------|----------------|
| Sparse weights | 10-50x | <0.5% | Magnitude pruning |
| Low-rank U | 2-4x | <0.5% | SVD decomposition |
| Int8 quantization | 2-4x | <0.1% | Post-training quant |
| Cascade routing | 1.5-2x | 0% | Early exit |
---
## 8. Success Metrics
### 8.1 Primary Metrics
| Metric | Target | Measurement |
|--------|--------|-------------|
| End-to-end latency P50 | <500ms | Timer instrumentation |
| Quality (LLM judge) | 4.2+/5.0 | Automated evaluation |
| Router accuracy | >95% | Oracle comparison |
| Memory efficiency | <4GB (edge) | RSS monitoring |
| Throughput | 20 QPS (edge) | Load testing |
### 8.2 Secondary Metrics
| Metric | Target | Measurement |
|--------|--------|-------------|
| Retrieval R@10 | >0.90 | Benchmark suite |
| Forgetting rate | <5%/10K updates | Periodic eval |
| Cost reduction | >50% vs baseline | Token counting |
| Writeback rate | 10-30% | Database metrics |
### 8.3 Regret Analysis
```
Quality Regret = E[Q_baseline - Q_routed]
Latency Regret = E[L_routed - L_oracle]
Cost Regret = E[C_routed - C_oracle]
Targets:
- Quality Regret < 0.1 points (1-5 scale)
- Latency Regret < 50ms
- Cost Regret < 10%
```
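The regret formulas reduce to averages over logged observations; `Obs` below is an illustrative record type, not part of the actual metrics schema:

```rust
/// One logged request with routed, baseline, and oracle measurements.
/// Illustrative record type for regret estimation.
struct Obs {
    q_routed: f32,
    q_baseline: f32,
    l_routed: f32,
    l_oracle: f32,
}

/// E[Q_baseline - Q_routed] over the log.
fn quality_regret(obs: &[Obs]) -> f32 {
    obs.iter().map(|o| o.q_baseline - o.q_routed).sum::<f32>() / obs.len() as f32
}

/// E[L_routed - L_oracle] over the log (milliseconds).
fn latency_regret(obs: &[Obs]) -> f32 {
    obs.iter().map(|o| o.l_routed - o.l_oracle).sum::<f32>() / obs.len() as f32
}
```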
---
## 9. Risk Analysis
| Risk | Probability | Impact | Mitigation |
|------|-------------|--------|------------|
| Router misprediction | Medium | High | Confidence thresholds, fallback |
| Catastrophic forgetting | Low | Critical | EWC, replay buffer, checkpoints |
| Memory exhaustion | Medium | High | Streaming, tiered storage |
| Quality degradation | Medium | High | A/B testing, rollback |
| Latency spikes | High | Medium | Caching, async processing |
---
## 10. Dependencies
### 10.1 Internal Dependencies
```toml
[dependencies]
ruvector-core = { path = "../ruvector-core" }
ruvector-gnn = { path = "../ruvector-gnn" }
ruvector-attention = { path = "../ruvector-attention" }
ruvector-graph = { path = "../ruvector-graph" }
ruvector-router-core = { path = "../ruvector-router-core" }
```
### 10.2 External Dependencies
```toml
[dependencies]
# LLM runtime
llama-cpp-rs = "0.3" # CPU inference
tokenizers = "0.15" # Fast tokenization
# Async runtime
tokio = { version = "1.41", features = ["full"] }
# Serialization
serde = { version = "1.0", features = ["derive"] }
# Metrics
prometheus = "0.13"
tracing = "0.1"
```
---
## 11. References
1. **LFM2 Technical Report**: arxiv:2511.23404v1
2. **FastGRNN**: Kusupati et al., "FastGRNN: A Fast, Accurate, Stable and Tiny Kilobyte Sized Gated Recurrent Neural Network"
3. **EWC**: Kirkpatrick et al., "Overcoming catastrophic forgetting in neural networks"
4. **HNSW**: Malkov & Yashunin, "Efficient and robust approximate nearest neighbor search using Hierarchical Navigable Small World graphs"
5. **Graph Attention**: Veličković et al., "Graph Attention Networks"
---
*Document Version: 1.0*
*Last Updated: 2025-12-02*
*Author: RuvLLM Architecture Team*

# RuvLLM: Integration and Deployment
## SPARC Phase 5: Completion
---
## 1. Integration Strategy
### 1.1 Crate Structure
```
ruvector/
├── crates/
│ ├── ruvector-core/ # Existing: Vector DB
│ ├── ruvector-gnn/ # Existing: GNN + EWC + Replay
│ ├── ruvector-attention/ # Existing: Attention mechanisms
│ ├── ruvector-graph/ # Existing: Graph storage
│ └── ruvector-router-core/ # Existing: Routing primitives
└── examples/
└── ruvLLM/ # NEW: Self-learning LLM
├── src/
│ ├── lib.rs # Main library entry
│ ├── orchestrator.rs # Request orchestration
│ ├── embedding.rs # LFM2 embedding service
│ ├── router.rs # FastGRNN router
│ ├── memory.rs # Ruvector memory layer
│ ├── attention.rs # Graph attention wrapper
│ ├── inference.rs # LFM2 model pool
│ ├── learning.rs # Self-learning service
│ ├── compression.rs # Concept abstraction
│ ├── config.rs # Configuration
│ ├── types.rs # Core types
│ └── error.rs # Error handling
├── tests/
│ ├── unit/
│ └── integration/
├── benches/
├── config/
└── docs/ # SPARC documentation
```
### 1.2 Dependency Integration
```toml
# examples/ruvLLM/Cargo.toml
[package]
name = "ruvllm"
version = "0.1.0"
edition = "2021"
description = "Self-learning LLM with LFM2 and Ruvector integration"
[dependencies]
# Internal dependencies (path-based for development)
ruvector-core = { path = "../../crates/ruvector-core" }
ruvector-gnn = { path = "../../crates/ruvector-gnn" }
ruvector-attention = { path = "../../crates/ruvector-attention" }
ruvector-graph = { path = "../../crates/ruvector-graph" }
ruvector-router-core = { path = "../../crates/ruvector-router-core" }
# LLM inference
llama-cpp-rs = "0.3" # CPU inference via llama.cpp
tokenizers = "0.15" # Fast tokenization
# Async runtime
tokio = { version = "1.41", features = ["full"] }
futures = "0.3"
# Serialization
serde = { version = "1.0", features = ["derive"] }
serde_json = "1.0"
bincode = "2.0.0-rc.3"
# Numerics
ndarray = { version = "0.16", features = ["serde"] }
rand = "0.8"
# Utilities
uuid = { version = "1.11", features = ["v4", "serde"] }
chrono = { version = "0.4", features = ["serde"] }
thiserror = "2.0"
anyhow = "1.0"
tracing = "0.1"
# Performance
dashmap = "6.1"
parking_lot = "0.12"
lru = "0.12"
# Metrics
prometheus = "0.13"
[dev-dependencies]
criterion = { version = "0.5", features = ["html_reports"] }
proptest = "1.5"
tokio-test = "0.4"
tempfile = "3.13"
tracing-subscriber = "0.3"
[features]
default = ["cpu"]
cpu = [] # llama.cpp CPU inference
gpu = ["vllm"] # vLLM GPU inference (optional)
vllm = []
[[bench]]
name = "pipeline"
harness = false
[[bench]]
name = "router"
harness = false
[[bench]]
name = "memory"
harness = false
```
### 1.3 API Surface
```rust
//! # RuvLLM - Self-Learning LLM
//!
//! A self-learning language model system integrating LFM2 with Ruvector.
//!
//! ## Architecture
//!
//! - **LFM2**: Frozen reasoning engine (350M-2.6B parameters)
//! - **Ruvector**: Living memory that adapts continuously
//! - **FastGRNN**: Control circuit for intelligent routing
//!
//! ## Quick Start
//!
//! ```rust,ignore
//! use ruvllm::{RuvLLM, Config};
//!
//! #[tokio::main]
//! async fn main() -> Result<()> {
//! // Initialize system
//! let config = Config::builder()
//! .db_path("./memory.db")
//! .model_path_350m("./models/lfm2-350m-q4.gguf")
//! .model_path_700m("./models/lfm2-700m-q4.gguf")
//! .build()?;
//!
//! let llm = RuvLLM::new(config).await?;
//!
//! // Process query
//! let response = llm.query("What is machine learning?").await?;
//! println!("Response: {}", response.text);
//! println!("Confidence: {:.2}", response.confidence);
//!
//! Ok(())
//! }
//! ```
//!
//! ## Self-Learning Loops
//!
//! The system learns through three feedback loops:
//!
//! 1. **Memory Growth**: Every interaction strengthens/weakens graph edges
//! 2. **Router Learning**: FastGRNN learns optimal model selection
//! 3. **Compression**: Periodic summarization creates concept hierarchies
pub mod attention;
pub mod compression;
pub mod config;
pub mod embedding;
pub mod error;
pub mod inference;
pub mod learning;
pub mod memory;
pub mod orchestrator;
pub mod router;
pub mod types;
// Re-exports for convenience
pub use config::{Config, ConfigBuilder};
pub use error::{Error, Result};
pub use orchestrator::RuvLLM;
pub use types::{Request, Response, Session};
/// Library version
pub const VERSION: &str = env!("CARGO_PKG_VERSION");
```
---
## 2. Implementation Checklist
### 2.1 Core Components
```
Phase 1: Foundation
━━━━━━━━━━━━━━━━━━━━
[x] Project structure setup
[x] Cargo.toml with dependencies
[ ] Error types definition
[ ] Configuration system
[ ] Core types (Request, Response, Session)
Phase 2: Services
━━━━━━━━━━━━━━━━━━
[ ] EmbeddingService
[ ] LFM2 encoder wrapper
[ ] Dimension projection
[ ] Tokenization
[ ] Batch processing
[ ] MemoryService
[ ] VectorDB initialization
[ ] GraphStore integration
[ ] HNSW search wrapper
[ ] Graph expansion
[ ] Writeback queue
[ ] FastGRNNRouter
[ ] Cell implementation
[ ] Sparse matrix operations
[ ] Low-rank matrices
[ ] Output heads
[ ] Training loop
[ ] GraphAttentionEngine
[ ] Attention layer wrapper
[ ] Edge feature encoding
[ ] Multi-head aggregation
[ ] Context ranking
[ ] InferencePool
[ ] Model loading
[ ] Lazy initialization
[ ] KV cache management
[ ] LRU eviction
[ ] LearningService
[ ] Quality judge
[ ] Replay buffer
[ ] EWC integration
[ ] Background training
[ ] Compression jobs
Phase 3: Orchestration
━━━━━━━━━━━━━━━━━━━━━━
[ ] Orchestrator
[ ] Request routing
[ ] Session management
[ ] Pipeline coordination
[ ] Metrics collection
[ ] Error handling
Phase 4: Integration
━━━━━━━━━━━━━━━━━━━━
[ ] Integration tests
[ ] Benchmark suite
[ ] Example applications
[ ] Documentation
```
### 2.2 Test Coverage Requirements
| Component | Unit Tests | Integration | Benchmark |
|-----------|------------|-------------|-----------|
| Embedding | 15+ | 3+ | 2 |
| Memory | 20+ | 5+ | 3 |
| Router | 25+ | 5+ | 2 |
| Attention | 15+ | 3+ | 2 |
| Inference | 10+ | 3+ | 2 |
| Learning | 20+ | 5+ | 1 |
| Orchestrator | 10+ | 5+ | 2 |
| **Total** | **115+** | **29+** | **14** |
---
## 3. Deployment Configurations
### 3.1 Edge Deployment (Raspberry Pi / Mobile)
```toml
# config/edge.toml
[system]
device_class = "edge"
max_memory_mb = 2048
max_concurrent_requests = 2
[embedding]
model = "onnx" # ONNX for portability
dimension = 384
batch_size = 1
[memory]
hnsw_m = 16
hnsw_ef_construction = 100
hnsw_ef_search = 32
max_nodes = 100_000
[router]
hidden_dim = 32
sparsity = 0.95
confidence_threshold = 0.6
[inference]
models = ["350m"]
quantization = "q4_k"
max_context = 1024
max_loaded_models = 1
[learning]
enabled = true
quality_threshold = 0.8
replay_capacity = 1000
training_interval_ms = 300_000 # 5 minutes
```
### 3.2 Server Deployment (CPU)
```toml
# config/server-cpu.toml
[system]
device_class = "server"
max_memory_mb = 16384
max_concurrent_requests = 20
[embedding]
model = "lfm2-encoder"
dimension = 768
batch_size = 8
[memory]
hnsw_m = 32
hnsw_ef_construction = 200
hnsw_ef_search = 64
max_nodes = 10_000_000
[router]
hidden_dim = 64
sparsity = 0.9
confidence_threshold = 0.7
[inference]
models = ["700m", "1.2b", "2.6b"]
quantization = "q5_k"
max_context = 4096
max_loaded_models = 2
[learning]
enabled = true
quality_threshold = 0.75
replay_capacity = 100_000
training_interval_ms = 60_000 # 1 minute
```
### 3.3 Server Deployment (GPU)
```toml
# config/server-gpu.toml
[system]
device_class = "gpu"
max_memory_mb = 32768
max_concurrent_requests = 100
[embedding]
model = "lfm2-encoder"
dimension = 1024
batch_size = 32
[memory]
hnsw_m = 48
hnsw_ef_construction = 300
hnsw_ef_search = 128
max_nodes = 100_000_000
[router]
hidden_dim = 64
sparsity = 0.85
confidence_threshold = 0.75
[inference]
models = ["1.2b", "2.6b"]
quantization = "fp16"
max_context = 8192
max_loaded_models = 2
use_vllm = true
tensor_parallel = 1
[learning]
enabled = true
quality_threshold = 0.7
replay_capacity = 1_000_000
training_interval_ms = 30_000 # 30 seconds
```
---
## 4. Operational Runbook
### 4.1 Startup Sequence
```bash
#!/bin/bash
# scripts/start.sh
set -e
CONFIG=${1:-"config/server-cpu.toml"}
LOG_LEVEL=${LOG_LEVEL:-"info"}
echo "Starting RuvLLM with config: $CONFIG"
# 1. Validate configuration
cargo run --release --bin ruvllm-validate -- --config "$CONFIG"
# 2. Initialize database if needed
if [ ! -f "data/memory.db" ]; then
echo "Initializing database..."
cargo run --release --bin ruvllm-init -- --config "$CONFIG"
fi
# 3. Download models if needed
cargo run --release --bin ruvllm-models -- --config "$CONFIG" --check-or-download
# 4. Start server
RUST_LOG=$LOG_LEVEL cargo run --release --bin ruvllm-server -- \
--config "$CONFIG" \
--metrics-port 9090 \
--http-port 8080
```
### 4.2 Health Checks
```rust
/// Health check endpoint implementation
pub struct HealthCheck {
memory: Arc<RuvectorMemory>,
router: Arc<FastGRNNRouter>,
inference: Arc<InferencePool>,
}
impl HealthCheck {
pub async fn check(&self) -> HealthStatus {
let mut status = HealthStatus::default();
// Check memory service
status.memory = match self.memory.ping().await {
Ok(latency) => ComponentHealth::Healthy { latency_ms: latency },
Err(e) => ComponentHealth::Unhealthy { error: e.to_string() },
};
// Check router
status.router = match self.router.ping() {
Ok(latency) => ComponentHealth::Healthy { latency_ms: latency },
Err(e) => ComponentHealth::Unhealthy { error: e.to_string() },
};
// Check inference (at least one model loadable)
status.inference = match self.inference.health_check().await {
Ok(info) => ComponentHealth::Healthy {
latency_ms: info.latency,
details: json!({
"loaded_models": info.loaded_models,
"available_memory": info.available_memory,
}),
},
Err(e) => ComponentHealth::Unhealthy { error: e.to_string() },
};
status.overall = if status.all_healthy() {
OverallHealth::Healthy
} else if status.any_critical() {
OverallHealth::Critical
} else {
OverallHealth::Degraded
};
status
}
}
```
### 4.3 Monitoring Dashboards
```yaml
# Prometheus alerting rules
groups:
- name: ruvllm
rules:
- alert: HighLatency
expr: histogram_quantile(0.95, rate(ruvllm_request_latency_seconds_bucket[5m])) > 1.0
for: 5m
labels:
severity: warning
annotations:
summary: "RuvLLM P95 latency above 1s"
- alert: LowQualityScore
expr: avg(ruvllm_quality_score) < 0.7
for: 10m
labels:
severity: warning
annotations:
summary: "Average quality score dropped below 0.7"
- alert: MemoryPressure
expr: ruvllm_memory_usage_bytes / ruvllm_memory_limit_bytes > 0.9
for: 5m
labels:
severity: critical
annotations:
summary: "Memory usage above 90%"
- alert: RouterLowConfidence
expr: avg(ruvllm_router_confidence) < 0.5
for: 15m
labels:
severity: warning
annotations:
summary: "Router confidence consistently low"
- alert: HighErrorRate
expr: rate(ruvllm_errors_total[5m]) > 0.1
for: 5m
labels:
severity: critical
annotations:
summary: "Error rate above 0.1 errors/sec (5m average)"
```
### 4.4 Backup and Recovery
```bash
#!/bin/bash
# scripts/backup.sh
BACKUP_DIR="/backups/ruvllm/$(date +%Y%m%d_%H%M%S)"
mkdir -p "$BACKUP_DIR"
echo "Creating backup in $BACKUP_DIR"
# 1. Backup memory database
cp -r data/memory.db "$BACKUP_DIR/memory.db"
# 2. Backup router weights
cp -r data/router_weights.bin "$BACKUP_DIR/router_weights.bin"
# 3. Backup EWC state
cp -r data/ewc_state.bin "$BACKUP_DIR/ewc_state.bin"
# 4. Backup replay buffer
cp -r data/replay_buffer.bin "$BACKUP_DIR/replay_buffer.bin"
# 5. Backup configuration
cp -r config/ "$BACKUP_DIR/config/"
# 6. Create manifest
cat > "$BACKUP_DIR/manifest.json" << EOF
{
"timestamp": "$(date -Iseconds)",
"version": "$(cargo run --release --bin ruvllm-version)",
"components": {
"memory_db": "memory.db",
"router_weights": "router_weights.bin",
"ewc_state": "ewc_state.bin",
"replay_buffer": "replay_buffer.bin",
"config": "config/"
}
}
EOF
echo "Backup complete: $BACKUP_DIR"
# 7. Upload to S3 if configured
if [ -n "$S3_BACKUP_BUCKET" ]; then
aws s3 sync "$BACKUP_DIR" "s3://$S3_BACKUP_BUCKET/$(basename $BACKUP_DIR)/"
echo "Uploaded to S3: $S3_BACKUP_BUCKET"
fi
```
---
## 5. Production Checklist
### 5.1 Pre-Launch
```
Security
━━━━━━━━
[ ] Input validation and sanitization
[ ] Rate limiting configured
[ ] TLS/HTTPS enabled
[ ] API authentication (if public)
[ ] Secrets in environment variables
[ ] Model integrity verification
Performance
━━━━━━━━━━━
[ ] Load tested to expected traffic
[ ] Memory profiled (no leaks)
[ ] Latency targets met
[ ] Caching configured
[ ] Connection pooling
Reliability
━━━━━━━━━━━
[ ] Health checks implemented
[ ] Graceful shutdown
[ ] Automatic restarts (systemd/k8s)
[ ] Backup procedures tested
[ ] Recovery procedures documented
Observability
━━━━━━━━━━━━━
[ ] Structured logging
[ ] Metrics exported
[ ] Distributed tracing
[ ] Alerting rules configured
[ ] Dashboards created
```
### 5.2 Post-Launch
```
Daily
━━━━━
[ ] Check error rates
[ ] Review quality scores
[ ] Monitor latency trends
[ ] Verify backup success
Weekly
━━━━━━
[ ] Review router decisions distribution
[ ] Analyze forgetting metrics
[ ] Check memory growth rate
[ ] Run compression job
[ ] Update router weights
Monthly
━━━━━━━
[ ] Full system backup
[ ] Performance benchmark
[ ] Security audit
[ ] Dependency updates
[ ] Evaluate student model candidates
```
---
## 6. API Reference
### 6.1 HTTP API
```yaml
openapi: "3.0.0"
info:
title: RuvLLM API
version: "0.1.0"
description: Self-learning LLM with LFM2 and Ruvector
paths:
/v1/query:
post:
summary: Process a query
requestBody:
required: true
content:
application/json:
schema:
type: object
required:
- query
properties:
query:
type: string
description: The user query
session_id:
type: string
description: Optional session for multi-turn
constraints:
type: object
properties:
max_latency_ms:
type: integer
max_tokens:
type: integer
temperature:
type: number
responses:
"200":
description: Successful response
content:
application/json:
schema:
type: object
properties:
text:
type: string
confidence:
type: number
sources:
type: array
items:
type: object
routing_info:
type: object
/v1/feedback:
post:
summary: Provide feedback on a response
requestBody:
required: true
content:
application/json:
schema:
type: object
required:
- request_id
properties:
request_id:
type: string
rating:
type: integer
minimum: 1
maximum: 5
correction:
type: string
responses:
"200":
description: Feedback recorded
/v1/health:
get:
summary: Health check
responses:
"200":
description: System healthy
"503":
description: System unhealthy
/v1/metrics:
get:
summary: Prometheus metrics
responses:
"200":
description: Metrics in Prometheus format
```
### 6.2 Rust SDK
```rust
use ruvllm::{RuvLLM, Config, Request, Response};
/// Simple query
async fn simple_query(llm: &RuvLLM) -> Result<Response> {
llm.query("What is Rust?").await
}
/// Query with options
async fn query_with_options(llm: &RuvLLM) -> Result<Response> {
llm.query_with(Request {
query: "Explain backpropagation".into(),
session_id: Some("user-123".into()),
constraints: Constraints {
max_latency_ms: Some(500),
max_tokens: Some(500),
temperature: Some(0.7),
..Default::default()
},
}).await
}
/// Multi-turn conversation
async fn conversation(llm: &RuvLLM) -> Result<()> {
let session = llm.new_session();
let r1 = llm.query_session(&session, "What is a neural network?").await?;
println!("Turn 1: {}", r1.text);
let r2 = llm.query_session(&session, "How do you train one?").await?;
println!("Turn 2: {}", r2.text);
let r3 = llm.query_session(&session, "What about overfitting?").await?;
println!("Turn 3: {}", r3.text);
Ok(())
}
/// Provide feedback
async fn with_feedback(llm: &RuvLLM) -> Result<()> {
let response = llm.query("What is 2+2?").await?;
llm.feedback(Feedback {
request_id: response.request_id,
rating: 5,
correction: None,
}).await?;
Ok(())
}
/// Stream response
async fn streaming(llm: &RuvLLM) -> Result<()> {
let mut stream = llm.query_stream("Tell me a story").await?;
while let Some(chunk) = stream.next().await {
print!("{}", chunk?);
}
Ok(())
}
```
---
## 7. Future Roadmap
### 7.1 Short-Term (1-3 months)
- [ ] LFM2-VL integration (vision-language)
- [ ] Multi-GPU inference with tensor parallelism
- [ ] Retrieval-augmented fine-tuning pipeline
- [ ] Improved compression algorithms
- [ ] WebAssembly deployment target
### 7.2 Medium-Term (3-6 months)
- [ ] Federated learning across edge nodes
- [ ] LFM2-Audio integration (speech)
- [ ] Custom domain fine-tuning toolkit
- [ ] Advanced curriculum learning
- [ ] Hyperbolic embeddings for hierarchies
### 7.3 Long-Term (6-12 months)
- [ ] Multi-agent collaboration
- [ ] Neuro-symbolic reasoning integration
- [ ] Continuous pre-training pipeline
- [ ] Hardware-specific optimizations (NPU, TPU)
- [ ] Enterprise multi-tenancy
---
## 8. Success Criteria
### 8.1 Technical Metrics
| Metric | Target | Current |
|--------|--------|---------|
| Latency P50 | <500ms | - |
| Latency P99 | <2s | - |
| Quality Score | >0.8 | - |
| Router Accuracy | >90% | - |
| Memory Efficiency | <4GB (edge) | - |
| Throughput | 20 QPS (edge) | - |
| Forgetting Rate | <5%/10K | - |
| Test Coverage | >80% | - |
### 8.2 Business Metrics
| Metric | Target | Notes |
|--------|--------|-------|
| User Satisfaction | >4.0/5.0 | Survey scores |
| Response Relevance | >85% | Human eval |
| Knowledge Retention | >90% | Multi-turn coherence |
| Cost Reduction | >50% | vs. always-big baseline |
---
## 9. Conclusion
RuvLLM represents a paradigm shift from static LLMs to adaptive, self-learning systems. By treating:
- **LFM2 as the stable cortex** (reasoning)
- **Ruvector as the living synaptic mesh** (memory)
- **FastGRNN as the control circuit** (routing)
we create intelligence that emerges from the loop, not just from the model.
The three learning loops—memory growth, router optimization, and concept compression—enable continuous adaptation without the risks of in-place weight modification.
**The intelligence is not in one model anymore. It is in the loop.**
---
*Document Version: 1.0*
*Last Updated: 2025-12-02*
*Author: RuvLLM Architecture Team*