Squashed 'vendor/ruvector/' content from commit b64c2172
git-subtree-dir: vendor/ruvector git-subtree-split: b64c21726f2bb37286d9ee36a7869fef60cc6900
examples/ruvLLM/docs/sparc/01-specification.md
@@ -0,0 +1,612 @@
|
||||
# RuvLLM: Self-Learning LLM with LFM2 and Ruvector Integration
|
||||
|
||||
## SPARC Phase 1: Specification
|
||||
|
||||
---
|
||||
|
||||
## 1. Executive Summary
|
||||
|
||||
RuvLLM is a self-learning LLM architecture that integrates **Liquid Foundation Models (LFM2)** with **ruvector** as the world model and memory substrate. The system uses **FastGRNN** as an intelligent router to dynamically allocate computational resources based on query complexity, enabling efficient on-device inference with continuous learning capabilities.
|
||||
|
||||
### Core Innovation
|
||||
|
||||
The architecture treats:

- **LFM2** as the reasoning head (inference engine)
- **Ruvector** as the world model and episodic memory
- **FastGRNN** as the control circuit (routing decisions)

This triad creates a self-learning system where:

1. Queries are semantically embedded and matched against memory
2. Graph attention extracts relevant neighborhood context
3. FastGRNN routes to optimal model configuration
4. LFM2 generates responses with retrieved context
5. Successful interactions are written back to memory (self-improvement)
|
||||
|
||||
---
|
||||
|
||||
## 2. Technical Requirements
|
||||
|
||||
### 2.1 Functional Requirements
|
||||
|
||||
#### FR-001: LFM2 Model Integration
|
||||
- **Description**: Support LFM2 model family (350M, 700M, 1.2B, 2.6B parameters)
- **Acceptance Criteria**:
  - Load models via llama.cpp (CPU) or vLLM (server)
  - Support quantization: Q4/Q5 (CPU), 8-bit/4-bit weight-only (GPU)
  - Enable KV cache for context reuse
  - Achieve <500ms median latency (CPU), <100ms (GPU)
|
||||
|
||||
#### FR-002: Ruvector Memory Service
|
||||
- **Description**: Implement semantic memory with graph structure
- **Storage Schema**:
  ```
  Nodes: {
    id: UUID,
    vector: [f32; D],      // D = embedding dimension
    text: String,
    type: NodeType,        // Query | Document | AgentStep | Fact
    source: String,
    metadata: {
      timestamp: i64,
      tags: Vec<String>,
      domain: String,
      version: u32,
      confidence: f32
    }
  }

  Edges: {
    id: UUID,
    src: UUID,
    dst: UUID,
    rel: EdgeType,         // Cites | Follows | SameTopic | AgentStep | Derived
    weight: f32,
    metadata: {
      timestamp: i64,
      created_by: String,
      confidence: f32
    }
  }
  ```
- **Acceptance Criteria**:
  - HNSW index with M=32, efConstruction=200, efSearch=64
  - Sub-millisecond retrieval for k≤64
  - Graph attention over 2-hop neighborhoods
  - Support billion-scale corpora
|
||||
|
||||
#### FR-003: FastGRNN Router
|
||||
- **Description**: Implement gated recurrent router for intelligent resource allocation
- **Architecture** (per Kusupati et al.):
  - Hidden size: 32-64 units
  - Input: Fixed-length feature vector (128 dims)
  - Outputs: model_selection, context_size, temperature, top_p, confidence
- **Feature Vector Components** (128 dimensions):
  ```
  Query Stats [32 dims]:
    - token_count: f32
    - language_id: [f32; 8] (one-hot)
    - domain_encoding: [f32; 16]
    - user_frequency: f32
    - query_type: [f32; 6] (factual/reasoning/creative/...)

  Embedding Stats [16 dims; 15 assigned below, 1 unassigned]:
    - l2_norm: f32
    - principal_components: [f32; 8]
    - entropy: f32
    - sparsity: f32
    - cluster_assignment: [f32; 4]

  HNSW Search Stats [48 dims]:
    - k_retrieved: f32
    - distances: { mean, std, min, max }: [f32; 4]
    - entropy: f32
    - graph_depth: f32
    - recall_estimate: f32
    - neighborhood_density: [f32; 16]
    - semantic_coherence: [f32; 24]

  System Constraints [32 dims; 28 assigned below, 4 unassigned]:
    - latency_budget: f32
    - device_class: [f32; 4] (edge/mobile/server/cluster)
    - privacy_level: [f32; 4]
    - memory_available: f32
    - battery_level: f32 (for mobile)
    - concurrent_requests: f32
    - historical_accuracy: [f32; 16]
  ```
|
||||
|
||||
#### FR-004: Self-Learning Pipeline
|
||||
- **Description**: Implement continuous learning with forgetting mitigation
- **Components**:
  - Online learning from successful interactions
  - Elastic Weight Consolidation (EWC) for catastrophic forgetting prevention
  - Experience replay with reservoir sampling
  - Curriculum learning for progressive complexity
- **Acceptance Criteria**:
  - Quality regret <0.1 points vs. always-big baseline
  - No measurable forgetting over 10K update cycles (forgetting rate within the <5% bound in Section 8.2)
  - Router accuracy >95% for seen patterns
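
The "experience replay with reservoir sampling" component can be sketched as follows. This is a minimal illustration, not the ruvector implementation: `ReservoirBuffer` and its methods are hypothetical names, and the inline xorshift generator is only there to keep the example free of external crates.

```rust
/// Fixed-capacity replay buffer filled by reservoir sampling (Algorithm R).
/// After `seen` pushes, each pushed item has had an equal chance
/// (`capacity / seen`) of still being in the buffer.
pub struct ReservoirBuffer<T> {
    items: Vec<T>,
    capacity: usize,
    seen: u64,
    rng_state: u64, // xorshift64 state; must be non-zero
}

impl<T> ReservoirBuffer<T> {
    pub fn new(capacity: usize, seed: u64) -> Self {
        Self {
            items: Vec::with_capacity(capacity),
            capacity,
            seen: 0,
            rng_state: seed.max(1),
        }
    }

    /// Minimal xorshift64 generator (assumption: statistical quality is
    /// good enough for a sketch; use a real RNG crate in practice).
    fn next_u64(&mut self) -> u64 {
        let mut x = self.rng_state;
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        self.rng_state = x;
        x
    }

    /// Offer one experience to the buffer.
    pub fn push(&mut self, item: T) {
        self.seen += 1;
        if self.items.len() < self.capacity {
            // Buffer not yet full: always keep the item.
            self.items.push(item);
            return;
        }
        if self.capacity == 0 {
            return;
        }
        // Draw j in [0, seen); the new item replaces slot j only if j
        // falls inside the buffer, i.e. with probability capacity / seen.
        let j = self.next_u64() % self.seen;
        if (j as usize) < self.capacity {
            self.items[j as usize] = item;
        }
    }

    pub fn len(&self) -> usize {
        self.items.len()
    }

    pub fn seen(&self) -> u64 {
        self.seen
    }

    pub fn items(&self) -> &[T] {
        &self.items
    }
}
```

Reservoir sampling keeps memory bounded while leaving old and new interactions equally likely to be replayed, which is the property the replay buffer needs to counteract recency bias.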
|
||||
|
||||
#### FR-005: Graph Attention Engine
|
||||
- **Description**: Context extraction via graph-aware attention
- **Mechanism**:
  - Multi-head attention over retrieved nodes
  - Edge-weighted aggregation (confidence, recency)
  - Hyperbolic embeddings for hierarchical relationships
  - 2-hop neighborhood expansion
- **Integration with existing ruvector-attention**:
  - Leverage `EdgeFeaturedAttention` for edge attributes
  - Use `GraphRoPE` for positional encoding on graphs
  - Apply `DualSpaceAttention` for multi-manifold reasoning
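
To make "edge-weighted aggregation (confidence, recency)" concrete, here is a single-head, Euclidean sketch. It does not use the `ruvector-attention` API: `Neighbor`, `edge_weighted_context`, and the decay constant `tau_secs` are hypothetical, and folding the edge attributes into the logits as an additive bias is one possible design, assumed here for illustration.

```rust
/// One retrieved neighbor: its value vector plus the edge attributes
/// used to bias attention. Field names are illustrative.
pub struct Neighbor {
    pub value: Vec<f32>,
    pub confidence: f32, // edge confidence in (0, 1]
    pub age_secs: f32,   // time since the edge was created
}

/// Single-head attention over neighbors with an additive edge bias:
///   logit_i = (q · v_i) / sqrt(d) + ln(confidence_i) - age_i / tau
/// The softmax of the logits weights the sum of neighbor values.
pub fn edge_weighted_context(query: &[f32], neighbors: &[Neighbor], tau_secs: f32) -> Vec<f32> {
    let d = query.len();
    if d == 0 || neighbors.is_empty() {
        return vec![0.0; d];
    }
    let scale = (d as f32).sqrt();

    let logits: Vec<f32> = neighbors
        .iter()
        .map(|n| {
            let dot: f32 = query.iter().zip(&n.value).map(|(q, v)| q * v).sum();
            // Low confidence and old edges both push the logit down.
            let edge_bias = n.confidence.max(1e-6).ln() - n.age_secs / tau_secs;
            dot / scale + edge_bias
        })
        .collect();

    // Numerically stable softmax.
    let max = logits.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = logits.iter().map(|l| (l - max).exp()).collect();
    let z: f32 = exps.iter().sum();

    let mut out = vec![0.0f32; d];
    for (n, e) in neighbors.iter().zip(&exps) {
        let w = e / z;
        for (o, v) in out.iter_mut().zip(&n.value) {
            *o += w * v;
        }
    }
    out
}
```

With equal edge attributes this reduces to ordinary scaled dot-product attention; lowering a neighbor's confidence or increasing its age shrinks its share of the aggregated context.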
|
||||
|
||||
### 2.2 Non-Functional Requirements
|
||||
|
||||
#### NFR-001: Performance
|
||||
| Metric | Tier A (Server) | Tier B (Edge) | Tier C (Mobile) |
|--------|-----------------|---------------|-----------------|
| P50 Latency | <200ms | <500ms | <800ms |
| P99 Latency | <1s | <2s | <5s |
| Throughput | 100 QPS | 20 QPS | 5 QPS |
| Memory | <16GB | <4GB | <1GB |
|
||||
|
||||
#### NFR-002: Quality
|
||||
- **Accuracy**: F1 >0.85 on QA benchmarks
- **Retrieval**: R@10 >0.90 for relevant documents
- **Router**: Decision accuracy >95%
- **Judge Rating**: 4.2+/5.0 on LLM-as-judge evaluations

#### NFR-003: Scalability
- Support 10M+ vectors in memory
- Support 1B+ vectors with hybrid indexing
- Linear scaling with node count in cluster mode

#### NFR-004: Reliability
- Zero data loss on graceful shutdown
- Recovery from OOM within 30s
- Automatic failover in cluster mode
|
||||
|
||||
---
|
||||
|
||||
## 3. LFM2 Deep Dive
|
||||
|
||||
### 3.1 Architecture Analysis
|
||||
|
||||
LFM2 employs a **hybrid backbone** combining:

1. **Gated Short Convolutions**: Lightweight local feature processing
   - O(n) complexity in sequence length vs O(n²) for full attention
   - Captures local patterns efficiently
   - Enables up to 2x faster prefill on CPUs than similarly sized models (as reported for LFM2)

2. **Grouped Query Attention (GQA)**: Reduced KV heads
   - 4-8 KV heads vs 32+ in standard attention
   - Maintains quality with 4x KV-cache memory reduction (e.g. 8 KV heads instead of 32)
   - Critical for edge deployment
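
The GQA memory figure follows from the KV-cache size formula: the cache stores one key and one value vector per layer, per KV head, per token. A small sketch of the arithmetic; the function name and the layer count, head dimension, and sequence length in the usage note are illustrative assumptions, not LFM2's actual configuration.

```rust
/// Bytes needed for the KV cache of one sequence:
/// 2 (keys + values) * layers * kv_heads * head_dim * seq_len * bytes per element.
pub fn kv_cache_bytes(
    layers: usize,
    kv_heads: usize,
    head_dim: usize,
    seq_len: usize,
    bytes_per_elem: usize,
) -> usize {
    2 * layers * kv_heads * head_dim * seq_len * bytes_per_elem
}
```

For example, with 16 layers, head dimension 64, 4096 tokens, and 2-byte (fp16) elements, 32 KV heads need 512 MiB while 8 KV heads need 128 MiB. The ratio depends only on the head counts, so it stays 4x for any choice of the other parameters.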
|
||||
|
||||
### 3.2 Training Methodology
|
||||
|
||||
LFM2's training is relevant for our self-learning pipeline:

1. **Knowledge Distillation**: Tempered, decoupled Top-K
   - Teacher: Larger model (LFM1-7B, per Liquid AI's LFM2 release notes)
   - Student: LFM2 variants
   - **Insight**: We can distill router decisions from an expensive oracle

2. **Curriculum Learning**: Progressive complexity
   - Start with simple factual queries
   - Graduate to multi-step reasoning
   - **Application**: Router training follows the same progression

3. **Three-Stage Post-Training**:
   - SFT: Supervised fine-tuning on quality data
   - DPO: Direct preference optimization
   - Model merging: Combine specialists
   - **Application**: We merge domain-specific adapters
|
||||
|
||||
### 3.3 Multimodal Extensions (Future)
|
||||
|
||||
- **LFM2-VL**: Vision-language (image understanding)
- **LFM2-Audio**: Speech I/O
- **LFM2-ColBERT**: Low-latency retrieval encoder
|
||||
|
||||
---
|
||||
|
||||
## 4. Ruvector Integration Analysis
|
||||
|
||||
### 4.1 Existing Capabilities
|
||||
|
||||
| Component | Status | Integration Plan |
|-----------|--------|------------------|
| ruvector-core | ✅ Production | Primary vector store |
| ruvector-gnn | ✅ Production | Graph neural layer |
| ruvector-attention | ✅ Production | Attention mechanisms |
| ruvector-router-core | ✅ Production | Base routing |
| ruvector-graph | ✅ Production | Knowledge graph |
|
||||
|
||||
### 4.2 Required Extensions
|
||||
|
||||
#### 4.2.1 Embedding Adapter
|
||||
```rust
pub struct EmbeddingAdapter {
    /// LFM2 encoder for query embedding
    lfm2_encoder: Lfm2Encoder,
    /// Dimension alignment layer
    projection: Linear,
    /// Normalization
    layer_norm: LayerNorm,
}

impl EmbeddingAdapter {
    /// Encode text, project to the index dimension, then normalize.
    pub fn embed(&self, text: &str) -> Vec<f32> {
        let raw = self.lfm2_encoder.encode(text);
        let projected = self.projection.forward(&raw);
        self.layer_norm.forward(&projected)
    }
}
```
|
||||
|
||||
#### 4.2.2 Memory Writeback Service
|
||||
```rust
use serde_json::json;

pub struct MemoryWriteback {
    /// Quality threshold for writeback
    quality_threshold: f32,
    /// Deduplication via MinHash (not used below yet; the near-duplicate
    /// check currently relies on vector similarity only)
    dedup_hasher: MinHasher,
    /// Conflict resolution
    merger: ConflictMerger,
}

impl MemoryWriteback {
    /// Returns the ID of the written (or merged) entry, or `None` when
    /// the interaction falls below the quality threshold.
    pub async fn maybe_write(
        &self,
        query: &str,
        response: &str,
        quality_score: f32,
        db: &VectorDB,
    ) -> Result<Option<UUID>> {
        if quality_score < self.quality_threshold {
            return Ok(None);
        }

        // Check for near-duplicates (similarity >= 0.95).
        // `embed` and `now` are helpers assumed to be defined elsewhere.
        let embedding = embed(query, response);
        let similar = db.search_threshold(&embedding, 0.95)?;
        if !similar.is_empty() {
            return self.merger.resolve(similar, query, response);
        }

        // Insert new memory
        let entry = VectorEntry::new(embedding)
            .with_text(format!("Q: {}\nA: {}", query, response))
            .with_metadata(json!({
                "type": "qa_pair",
                "quality": quality_score,
                "timestamp": now(),
            }));

        Ok(Some(db.insert(entry)?))
    }
}
```
|
||||
|
||||
### 4.3 HNSW Parameter Tuning
|
||||
|
||||
Based on arxiv:2511.23404v1 insights on retrieval efficiency:
|
||||
|
||||
| Corpus Size | M | efConstruction | efSearch | Recall@10 |
|-------------|---|----------------|----------|-----------|
| <100K | 16 | 100 | 32 | 0.98 |
| 100K-1M | 32 | 200 | 64 | 0.96 |
| 1M-10M | 48 | 300 | 128 | 0.94 |
| 10M-100M | 64 | 400 | 256 | 0.92 |
| >100M | Hybrid | Tiered | Adaptive | 0.90 |
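
The table can be encoded as a lookup keyed on corpus size. A minimal sketch: `HnswParams` and `hnsw_params_for` are hypothetical names, not ruvector-core API, and treating each range as inclusive of its lower bound is an assumption, since the table leaves the boundaries open.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct HnswParams {
    pub m: usize,
    pub ef_construction: usize,
    pub ef_search: usize,
}

/// Fixed HNSW parameters for a corpus of `corpus_size` vectors.
/// Returns `None` above 100M vectors, where the table calls for a
/// hybrid/tiered index instead of a single parameter set.
pub fn hnsw_params_for(corpus_size: u64) -> Option<HnswParams> {
    let (m, ef_construction, ef_search) = match corpus_size {
        0..=99_999 => (16, 100, 32),
        100_000..=999_999 => (32, 200, 64),
        1_000_000..=9_999_999 => (48, 300, 128),
        10_000_000..=100_000_000 => (64, 400, 256),
        _ => return None,
    };
    Some(HnswParams { m, ef_construction, ef_search })
}
```

Returning `None` for the largest tier forces the caller to handle the hybrid case explicitly rather than silently reusing parameters tuned for a smaller index.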
|
||||
|
||||
---
|
||||
|
||||
## 5. FastGRNN Router Specification
|
||||
|
||||
### 5.1 Mathematical Formulation
|
||||
|
||||
FastGRNN (Fast, Accurate, Stable and Tiny Gated Recurrent Neural Network):

```
z_t = σ(W_z · x_t + U_z · h_{t-1} + b_z)
h̃_t = tanh(W_h · x_t + U_h · (r_t ⊙ h_{t-1}) + b_h)
h_t = (ζ · (1 - z_t) + ν) ⊙ h̃_t + z_t ⊙ h_{t-1}

where:
- ζ, ν: Learned scalars, each constrained to [0, 1]
- W_z, W_h: Input weight matrices (sparse)
- U_z, U_h: Recurrent weight matrices (low-rank)
- r_t: Optional reset gate (can be fixed to 1)

Note: the published FastGRNN shares weights between the gate and the
candidate (W_z = W_h, U_z = U_h) and has no reset gate (r_t = 1);
the separate matrices and r_t above are a generalization.
```
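
One recurrent step of the formulation above can be sketched as follows, with the reset gate fixed to 1. This is an illustration only: `FastGrnnCell`, `matvec`, and the dense row-major weight layout are assumptions made for readability, whereas the spec calls for sparse input weights and low-rank recurrent weights.

```rust
fn sigmoid(x: f32) -> f32 {
    1.0 / (1.0 + (-x).exp())
}

/// Dense row-major matrix-vector product; `m` holds `rows * v.len()` entries.
fn matvec(m: &[f32], v: &[f32], rows: usize) -> Vec<f32> {
    let cols = v.len();
    (0..rows)
        .map(|r| {
            m[r * cols..(r + 1) * cols]
                .iter()
                .zip(v)
                .map(|(a, b)| a * b)
                .sum::<f32>()
        })
        .collect()
}

pub struct FastGrnnCell {
    pub hidden: usize,
    pub w_z: Vec<f32>, // hidden x input
    pub u_z: Vec<f32>, // hidden x hidden
    pub b_z: Vec<f32>,
    pub w_h: Vec<f32>, // hidden x input
    pub u_h: Vec<f32>, // hidden x hidden
    pub b_h: Vec<f32>,
    pub zeta: f32,
    pub nu: f32,
}

impl FastGrnnCell {
    /// Computes h_t from x_t and h_{t-1} with r_t = 1.
    pub fn step(&self, x: &[f32], h_prev: &[f32]) -> Vec<f32> {
        let wz = matvec(&self.w_z, x, self.hidden);
        let uz = matvec(&self.u_z, h_prev, self.hidden);
        let wh = matvec(&self.w_h, x, self.hidden);
        let uh = matvec(&self.u_h, h_prev, self.hidden);
        (0..self.hidden)
            .map(|i| {
                let z = sigmoid(wz[i] + uz[i] + self.b_z[i]);
                let h_tilde = (wh[i] + uh[i] + self.b_h[i]).tanh();
                // h_t = (ζ(1 - z) + ν) ⊙ h̃ + z ⊙ h_{t-1}
                (self.zeta * (1.0 - z) + self.nu) * h_tilde + z * h_prev[i]
            })
            .collect()
    }
}
```

With all weights and biases at zero the gate is 0.5 and the candidate is 0, so the step simply halves the previous hidden state, which is a convenient sanity check for an implementation.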
|
||||
|
||||
### 5.2 Output Heads
|
||||
|
||||
```rust
pub struct RouterOutputs {
    /// Model selection: [350M, 700M, 1.2B, 2.6B] probabilities
    pub model_probs: [f32; 4],
    /// Context size bins: [256, 512, 1024, 2048, 4096] tokens
    pub context_probs: [f32; 5],
    /// Temperature: continuous [0.0, 2.0]
    pub temperature: f32,
    /// Top-p: continuous [0.0, 1.0]
    pub top_p: f32,
    /// Confidence score
    pub confidence: f32,
}
```
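
Turning these heads into a concrete decision might look like the sketch below. The struct is repeated so the example compiles on its own. `RoutingDecision`, `decide`, and the policy of falling back to the largest model and context when confidence is low are assumptions; the spec only states that confidence thresholds and a fallback exist.

```rust
pub struct RouterOutputs {
    pub model_probs: [f32; 4],
    pub context_probs: [f32; 5],
    pub temperature: f32,
    pub top_p: f32,
    pub confidence: f32,
}

pub const MODEL_SIZES: [&str; 4] = ["350M", "700M", "1.2B", "2.6B"];
pub const CONTEXT_BINS: [usize; 5] = [256, 512, 1024, 2048, 4096];

#[derive(Debug, PartialEq)]
pub struct RoutingDecision {
    pub model: &'static str,
    pub context_tokens: usize,
    pub temperature: f32,
    pub top_p: f32,
}

/// Index of the largest element (first one wins on ties).
fn argmax(xs: &[f32]) -> usize {
    let mut best = 0;
    for (i, &x) in xs.iter().enumerate() {
        if x > xs[best] {
            best = i;
        }
    }
    best
}

/// Picks the most probable model and context bin. Below `min_confidence`
/// it falls back to the largest configuration (assumed policy).
pub fn decide(out: &RouterOutputs, min_confidence: f32) -> RoutingDecision {
    let (model_idx, ctx_idx) = if out.confidence < min_confidence {
        (MODEL_SIZES.len() - 1, CONTEXT_BINS.len() - 1)
    } else {
        (argmax(&out.model_probs), argmax(&out.context_probs))
    };
    RoutingDecision {
        model: MODEL_SIZES[model_idx],
        context_tokens: CONTEXT_BINS[ctx_idx],
        // Clamp the continuous heads to their documented ranges.
        temperature: out.temperature.clamp(0.0, 2.0),
        top_p: out.top_p.clamp(0.0, 1.0),
    }
}
```

Falling back to the largest configuration trades latency for quality when the router is unsure, which matches the "always-big" baseline used for quality regret.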
|
||||
|
||||
### 5.3 Training Protocol
|
||||
|
||||
**Phase 1: Data Collection**
|
||||
```
For each query q:
  1. Run all model configurations (expensive baseline)
  2. Collect quality metrics Q, latency L, cost C
  3. Compute utility: U = Q - λ·L - μ·C
  4. Label: y_model = argmax(U), y_ctx = min viable context
```
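
Steps 3 and 4 amount to scoring each configuration and taking the argmax. A minimal sketch, assuming latency in milliseconds and an abstract cost unit; `ConfigResult`, `utility`, and `label_best` are hypothetical names.

```rust
/// Measured outcome of running one model configuration on a query.
pub struct ConfigResult {
    pub quality: f32,
    pub latency_ms: f32,
    pub cost: f32,
}

/// U = Q - λ·L - μ·C
pub fn utility(r: &ConfigResult, lambda: f32, mu: f32) -> f32 {
    r.quality - lambda * r.latency_ms - mu * r.cost
}

/// Index of the configuration with the highest utility (y_model),
/// or `None` when no configurations were run.
pub fn label_best(results: &[ConfigResult], lambda: f32, mu: f32) -> Option<usize> {
    results
        .iter()
        .enumerate()
        .map(|(i, r)| (i, utility(r, lambda, mu)))
        .max_by(|a, b| a.1.total_cmp(&b.1))
        .map(|(i, _)| i)
}
```

The weights λ and μ set the exchange rate between quality points, latency, and cost: with both at zero the label is always the highest-quality configuration, and raising them shifts labels toward smaller models.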
|
||||
|
||||
**Phase 2: Supervised Training**
|
||||
```
Loss = CE(model_pred, y_model)
     + CE(ctx_pred, y_ctx)
     + α·SmoothL1(temp_pred, y_temp)
     + β·SmoothL1(top_p_pred, y_top_p)
```
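
The combined loss can be written out for a single example as below. This is a sketch under stated assumptions: the classification heads are taken to output probabilities (post-softmax), SmoothL1 uses the common threshold of 1, and the `Prediction` and `Targets` types are hypothetical.

```rust
/// Cross-entropy of a probability vector against a class label: -ln p[label].
pub fn cross_entropy(probs: &[f32], label: usize) -> f32 {
    -(probs[label].max(1e-12)).ln()
}

/// Smooth L1 (Huber with threshold 1): quadratic near zero, linear beyond.
pub fn smooth_l1(pred: f32, target: f32) -> f32 {
    let d = (pred - target).abs();
    if d < 1.0 {
        0.5 * d * d
    } else {
        d - 0.5
    }
}

pub struct Prediction<'a> {
    pub model_probs: &'a [f32],
    pub ctx_probs: &'a [f32],
    pub temperature: f32,
    pub top_p: f32,
}

pub struct Targets {
    pub y_model: usize,
    pub y_ctx: usize,
    pub y_temp: f32,
    pub y_top_p: f32,
}

/// Loss = CE(model) + CE(ctx) + α·SmoothL1(temp) + β·SmoothL1(top_p)
pub fn router_loss(p: &Prediction, t: &Targets, alpha: f32, beta: f32) -> f32 {
    cross_entropy(p.model_probs, t.y_model)
        + cross_entropy(p.ctx_probs, t.y_ctx)
        + alpha * smooth_l1(p.temperature, t.y_temp)
        + beta * smooth_l1(p.top_p, t.y_top_p)
}
```

Using SmoothL1 rather than squared error for the continuous heads limits the influence of outlier temperature or top-p labels on the shared hidden state.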
|
||||
|
||||
**Phase 3: Online Refinement**
|
||||
```
Every N requests:
  1. Sample exploration (ε-greedy or Thompson)
  2. Compute regret vs. oracle
  3. Update weights with importance sampling
  4. Apply EWC regularization
```
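
For the ε-greedy variant of step 1, the importance weights in step 3 need the probability with which each action was chosen. A minimal sketch; `select_action` and `propensity` are hypothetical names, and the random draws are passed in by the caller (an assumption made to keep the example deterministic and free of RNG crates).

```rust
/// ε-greedy selection. `u` is a uniform draw in [0, 1) and `explore_draw`
/// an arbitrary random integer, both supplied by the caller.
pub fn select_action(
    greedy: usize,
    n_actions: usize,
    epsilon: f32,
    u: f32,
    explore_draw: usize,
) -> usize {
    if u < epsilon {
        explore_draw % n_actions // explore: uniform over all actions
    } else {
        greedy // exploit: the router's current best choice
    }
}

/// Probability that ε-greedy picks `action`:
/// ε/n for every action, plus (1 - ε) for the greedy one.
pub fn propensity(action: usize, greedy: usize, n_actions: usize, epsilon: f32) -> f32 {
    let uniform = epsilon / n_actions as f32;
    if action == greedy {
        1.0 - epsilon + uniform
    } else {
        uniform
    }
}
```

An importance-weighted update would scale each logged example by `1.0 / propensity(...)`, so rarely explored configurations are not under-represented in the gradient.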
|
||||
|
||||
---
|
||||
|
||||
## 6. Self-Learning Mechanisms
|
||||
|
||||
### 6.1 Continual Learning Architecture
|
||||
|
||||
```
┌─────────────────────────────────────────────────────────────────┐
│                     Self-Learning Pipeline                      │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  ┌─────────┐    ┌─────────┐    ┌─────────┐    ┌─────────┐       │
│  │  Query  │───▶│ Retrieve│───▶│ Generate│───▶│ Evaluate│       │
│  └─────────┘    └─────────┘    └─────────┘    └─────────┘       │
│       │              │              │              │            │
│       │              │              │              ▼            │
│       │              │              │         ┌─────────┐       │
│       │              │              │         │ Quality │       │
│       │              │              │         │  > θ ?  │       │
│       │              │              │         └────┬────┘       │
│       │              │              │              │            │
│       │              │              │       ┌──────┴──────┐     │
│       │              │              │       ▼             ▼     │
│       │              │              │   ┌───────┐     ┌───────┐ │
│       │              │              │   │ Write │     │ Skip  │ │
│       │              │              │   │ Back  │     │       │ │
│       │              │              │   └───┬───┘     └───────┘ │
│       │              │              │       │                   │
│       ▼              ▼              ▼       ▼                   │
│   ┌───────────────────────────────────────────────┐             │
│   │           Replay Buffer (Reservoir)           │             │
│   │    ┌─────┐ ┌─────┐ ┌─────┐ ┌─────┐ ┌─────┐    │             │
│   │    │ E_1 │ │ E_2 │ │ ... │ │E_n-1│ │ E_n │    │             │
│   │    └─────┘ └─────┘ └─────┘ └─────┘ └─────┘    │             │
│   └───────────────────────┬───────────────────────┘             │
│                           │                                     │
│                           ▼                                     │
│   ┌───────────────────────────────────────────────┐             │
│   │           EWC Regularization Layer            │             │
│   │                                               │             │
│   │  L_total = L_task + (λ/2)·Σ F_i·(θ_i - θ*_i)² │             │
│   │                                               │             │
│   │  F_i = Fisher Information (importance)        │             │
│   │  θ*_i = Optimal weights from previous task    │             │
│   └───────────────────────────────────────────────┘             │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
```
|
||||
|
||||
### 6.2 Quality Evaluation
|
||||
|
||||
**LLM-as-Judge Protocol**:
|
||||
```rust
pub struct QualityJudge {
    judge_model: Lfm2, // Use 2.6B for judging
    rubric: JudgeRubric,
}

impl QualityJudge {
    /// Scores a response on a 1-5 scale by prompting the judge model.
    pub fn evaluate(&self, query: &str, response: &str, context: &[&str]) -> f32 {
        let prompt = format!(r#"
Evaluate the response quality on a scale of 1-5:

Query: {query}
Retrieved Context: {context:?}
Response: {response}

Criteria:
1. Factual accuracy (grounded in context)
2. Completeness (addresses the query fully)
3. Coherence (logical flow)
4. Conciseness (no unnecessary verbosity)

Score (1-5):
"#);

        // Only a short completion is requested, since a single score is expected.
        let score_str = self.judge_model.generate(&prompt, 10);
        // `parse_score` is a helper assumed to be defined elsewhere.
        parse_score(&score_str)
    }
}
```
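
`parse_score` is referenced above but not defined in this document. One possible implementation is sketched below; it is an assumption, not the project's code. It differs from the call site by returning `Option<f32>`, so the caller must decide what to do with unparseable judge output. It takes the first number in the 1-5 range and therefore expects only the model's completion, not an echo of the prompt.

```rust
/// Extracts the first number in the range 1.0..=5.0 from the judge's
/// completion. Returns `None` when no such number is present.
pub fn parse_score(raw: &str) -> Option<f32> {
    raw.split(|c: char| !(c.is_ascii_digit() || c == '.'))
        .filter(|tok| !tok.is_empty())
        // Strip sentence punctuation such as the trailing dot in "4."
        .filter_map(|tok| tok.trim_matches('.').parse::<f32>().ok())
        .find(|s| (1.0..=5.0).contains(s))
}
```

Returning `None` instead of a default score keeps malformed judge output from being written back to memory as if it were a real rating.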
|
||||
|
||||
### 6.3 Forgetting Mitigation
|
||||
|
||||
**Elastic Weight Consolidation (EWC)**:
|
||||
|
||||
```rust
// From ruvector-gnn ewc module
pub struct ElasticWeightConsolidation {
    lambda: f32,               // Regularization strength
    fisher_info: Vec<f32>,     // Fisher information diagonal
    optimal_weights: Vec<f32>, // θ* from previous task
}

impl ElasticWeightConsolidation {
    /// (λ/2) · Σ F_i · (θ_i - θ*_i)²
    pub fn regularization_loss(&self, current_weights: &[f32]) -> f32 {
        self.fisher_info.iter()
            .zip(current_weights.iter())
            .zip(self.optimal_weights.iter())
            .map(|((f, w), w_star)| f * (w - w_star).powi(2))
            .sum::<f32>() * self.lambda / 2.0
    }

    /// `gradients[i]` holds the sampled gradients for parameter i.
    pub fn update_fisher(&mut self, gradients: &[Vec<f32>]) {
        // Fisher = E[(∇ log P(y|x;θ))²], estimated as the mean squared gradient
        for (fisher, grad_samples) in self.fisher_info.iter_mut().zip(gradients) {
            if grad_samples.is_empty() {
                continue; // keep the previous estimate; avoids division by zero
            }
            *fisher = grad_samples.iter()
                .map(|g| g.powi(2))
                .sum::<f32>() / grad_samples.len() as f32;
        }
    }
}
```
|
||||
|
||||
---
|
||||
|
||||
## 7. Performance Optimization Strategy
|
||||
|
||||
### 7.1 LFM2 Level
|
||||
|
||||
| Optimization | Speedup | Quality Impact | Implementation |
|--------------|---------|----------------|----------------|
| Model selection | 2-4x | <1% | FastGRNN router |
| KV cache reuse | 1.5-2x | 0% | llama.cpp native |
| Q4 quantization | 2-3x | <2% | GGUF format |
| Speculative decode | 1.3-1.5x | 0% | Draft model |
| Continuous batching | 2-4x | 0% | vLLM |
|
||||
|
||||
### 7.2 Ruvector Level
|
||||
|
||||
| Optimization | Speedup | Quality Impact | Implementation |
|--------------|---------|----------------|----------------|
| HNSW tuning | Variable | Recall tradeoff | efSearch adjustment |
| Product quantization | 4-8x memory | <5% | PQ in ruvector-core |
| Graph pruning | 1.2-1.5x | <1% | Edge weight threshold |
| Batch retrieval | 2-3x | 0% | Parallel HNSW |
| Caching | 10x+ (hits) | 0% | LRU with TTL |
|
||||
|
||||
### 7.3 Router Level
|
||||
|
||||
| Optimization | Speedup | Quality Impact | Implementation |
|--------------|---------|----------------|----------------|
| Sparse weights | 10-50x | <0.5% | Magnitude pruning |
| Low-rank U | 2-4x | <0.5% | SVD decomposition |
| Int8 quantization | 2-4x | <0.1% | Post-training quant |
| Cascade routing | 1.5-2x | 0% | Early exit |
|
||||
|
||||
---
|
||||
|
||||
## 8. Success Metrics
|
||||
|
||||
### 8.1 Primary Metrics
|
||||
|
||||
| Metric | Target | Measurement |
|--------|--------|-------------|
| End-to-end latency P50 | <500ms | Timer instrumentation |
| Quality (LLM judge) | 4.2+/5.0 | Automated evaluation |
| Router accuracy | >95% | Oracle comparison |
| Memory efficiency | <4GB (edge) | RSS monitoring |
| Throughput | 20 QPS (edge) | Load testing |
|
||||
|
||||
### 8.2 Secondary Metrics
|
||||
|
||||
| Metric | Target | Measurement |
|--------|--------|-------------|
| Retrieval R@10 | >0.90 | Benchmark suite |
| Forgetting rate | <5%/10K updates | Periodic eval |
| Cost reduction | >50% vs baseline | Token counting |
| Writeback rate | 10-30% | Database metrics |
|
||||
|
||||
### 8.3 Regret Analysis
|
||||
|
||||
```
Quality Regret = E[Q_baseline - Q_routed]
Latency Regret = E[L_routed - L_oracle]
Cost Regret    = E[C_routed - C_oracle]

Targets:
- Quality Regret < 0.1 points (1-5 scale)
- Latency Regret < 50ms
- Cost Regret < 10% (relative to C_oracle)
```
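
The three regret terms can be computed from per-query logs as sketched below. The `Outcome` and `Regret` types are hypothetical, the three slices are assumed to be aligned by query, and normalizing cost regret by total oracle cost is an assumption made because the cost target is stated as a percentage.

```rust
/// Measured outcome of one policy on one query.
#[derive(Clone, Copy)]
pub struct Outcome {
    pub quality: f32,
    pub latency_ms: f32,
    pub cost: f32,
}

#[derive(Debug)]
pub struct Regret {
    pub quality: f32,       // mean(Q_baseline - Q_routed), in judge points
    pub latency_ms: f32,    // mean(L_routed - L_oracle)
    pub cost_fraction: f32, // sum(C_routed - C_oracle) / sum(C_oracle)
}

/// Mean regret over paired observations. Returns `None` when the inputs
/// are empty or have different lengths.
pub fn mean_regret(routed: &[Outcome], baseline: &[Outcome], oracle: &[Outcome]) -> Option<Regret> {
    let n = routed.len();
    if n == 0 || baseline.len() != n || oracle.len() != n {
        return None;
    }
    let (mut dq, mut dl, mut dc, mut oracle_cost) = (0.0f32, 0.0f32, 0.0f32, 0.0f32);
    for i in 0..n {
        dq += baseline[i].quality - routed[i].quality;
        dl += routed[i].latency_ms - oracle[i].latency_ms;
        dc += routed[i].cost - oracle[i].cost;
        oracle_cost += oracle[i].cost;
    }
    let cost_fraction = if oracle_cost > 0.0 { dc / oracle_cost } else { 0.0 };
    Some(Regret {
        quality: dq / n as f32,
        latency_ms: dl / n as f32,
        cost_fraction,
    })
}
```

A run meets the targets when `quality < 0.1`, `latency_ms < 50.0`, and `cost_fraction < 0.10`.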
|
||||
|
||||
---
|
||||
|
||||
## 9. Risk Analysis
|
||||
|
||||
| Risk | Probability | Impact | Mitigation |
|------|-------------|--------|------------|
| Router misprediction | Medium | High | Confidence thresholds, fallback |
| Catastrophic forgetting | Low | Critical | EWC, replay buffer, checkpoints |
| Memory exhaustion | Medium | High | Streaming, tiered storage |
| Quality degradation | Medium | High | A/B testing, rollback |
| Latency spikes | High | Medium | Caching, async processing |
|
||||
|
||||
---
|
||||
|
||||
## 10. Dependencies
|
||||
|
||||
### 10.1 Internal Dependencies
|
||||
|
||||
```toml
[dependencies]
ruvector-core = { path = "../ruvector-core" }
ruvector-gnn = { path = "../ruvector-gnn" }
ruvector-attention = { path = "../ruvector-attention" }
ruvector-graph = { path = "../ruvector-graph" }
ruvector-router-core = { path = "../ruvector-router-core" }
```
|
||||
|
||||
### 10.2 External Dependencies
|
||||
|
||||
```toml
[dependencies]
# LLM runtime
llama-cpp-rs = "0.3"     # CPU inference
tokenizers = "0.15"      # Fast tokenization

# Async runtime
tokio = { version = "1.41", features = ["full"] }

# Serialization
serde = { version = "1.0", features = ["derive"] }

# Metrics
prometheus = "0.13"
tracing = "0.1"
```
|
||||
|
||||
---
|
||||
|
||||
## 11. References
|
||||
|
||||
1. **LFM2 Technical Report**: arxiv:2511.23404v1
2. **FastGRNN**: Kusupati et al., "FastGRNN: A Fast, Accurate, Stable and Tiny Kilobyte Sized Gated Recurrent Neural Network"
3. **EWC**: Kirkpatrick et al., "Overcoming catastrophic forgetting in neural networks"
4. **HNSW**: Malkov & Yashunin, "Efficient and robust approximate nearest neighbor search using Hierarchical Navigable Small World graphs"
5. **Graph Attention**: Veličković et al., "Graph Attention Networks"
|
||||
|
||||
---
|
||||
|
||||
*Document Version: 1.0*
*Last Updated: 2025-12-02*
*Author: RuvLLM Architecture Team*
|
||||