# RuvLLM: TDD and Iterative Refinement

## SPARC Phase 4: Refinement

---

## 1. Core Philosophy: Three-Layer Self-Learning

### 1.1 The Mental Model

> **"The intelligence is not in one model anymore. It is in the loop."**

RuvLLM treats:

- **LFM2 weights** as a **stable cortex** (fixed core reasoning engine)
- **Ruvector** as the **living synaptic mesh** (adapts continuously)
- **FastGRNN** as the **control circuit** (learns when to use what)

This creates a system that genuinely learns from experience without requiring constant model retraining.

### 1.2 Three Adaptation Timescales

| Timescale | Mechanism | What Changes | Frequency |
|-----------|-----------|--------------|-----------|
| **Short-term** | Memory + Routing | Graph structure, attention patterns, routing decisions | Every request |
| **Medium-term** | Compression | Concept nodes, graph hierarchy, router weights | Hourly/Daily |
| **Long-term** | Weight tuning | LFM2 fine-tuned variants | Weekly/Monthly |

---

## 2. Self-Learning Loop Architecture

### 2.1 Loop A: Memory Growth and Refinement

**What happens on every request:**

```
Request → Response → Outcome
              ↓
┌────────────────────────────────────────────────────────────────┐
│                     Memory Growth Loop                         │
├────────────────────────────────────────────────────────────────┤
│                                                                │
│  1. WRITE to ruvector:                                         │
│     ┌─────────────────────────────────────────────────────────┐│
│     │ - Question (query embedding + text)                     ││
│     │ - Answer (response embedding + text)                    ││
│     │ - Retrieved documents (context used)                    ││
│     │ - Final outcome (quality score, task success)           ││
│     │ - User feedback if any (explicit signals)               ││
│     └─────────────────────────────────────────────────────────┘│
│                                                                │
│  2. GRAPH RULES:                                               │
│     ┌─────────────────────────────────────────────────────────┐│
│     │ ✓ Strengthen edges between nodes that co-appear         ││
│     │   in good answers                                       ││
│     │ ✓ Weaken/prune edges rarely used or correlating         ││
│     │   with bad answers                                      ││
│     │ ✓ Update attention weights based on success patterns    ││
│     └─────────────────────────────────────────────────────────┘│
│                                                                │
│  3. RESULT:                                                    │
│     Same LFM2 checkpoint → Different answers over time         │
│     because the graph, weights, and attention improve          │
│                                                                │
└────────────────────────────────────────────────────────────────┘
```
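
The edge-strengthening and weakening rules above can be sketched as a single update function. This is a minimal illustration, not the shipped rule: it assumes a hypothetical `updated_weight` helper that nudges an edge weight toward the interaction's quality score and clamps it to `[0, 1]`.

```rust
/// Illustrative edge-weight update (an assumption, not the production rule):
/// move the weight toward the outcome's quality score. A quality above the
/// current weight strengthens the edge; a quality below it weakens the edge.
/// `lr` controls how fast the synaptic mesh adapts.
pub fn updated_weight(weight: f64, quality: f64, lr: f64) -> f64 {
    (weight + lr * (quality - weight)).clamp(0.0, 1.0)
}
```

With this form, a 0.9-quality outcome pushes a 0.5 edge up (to 0.54 at `lr = 0.1`), while a 0.3-quality outcome pulls it down, matching the behavior the Loop A tests below assert.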

**TDD Tests for Loop A:**

```rust
#[cfg(test)]
mod memory_growth_tests {
    use super::*;

    #[test]
    fn test_successful_interaction_strengthens_edges() {
        // Given: A memory with two related nodes
        let mut memory = RuvectorMemory::new_test();
        let node_a = memory.insert_node("Machine learning is a subset of AI");
        let node_b = memory.insert_node("Neural networks are ML models");
        memory.insert_edge(&node_a, &node_b, EdgeType::SameTopic, 0.5);

        // When: A successful query uses both nodes
        let outcome = InteractionOutcome {
            quality_score: 0.9,
            used_nodes: vec![node_a.clone(), node_b.clone()],
            task_success: true,
        };
        memory.apply_outcome(&outcome);

        // Then: Edge weight should increase
        let edge = memory.get_edge(&node_a, &node_b).unwrap();
        assert!(edge.weight > 0.5);
    }

    #[test]
    fn test_failed_interaction_weakens_edges() {
        // Given: A memory with an edge
        let mut memory = RuvectorMemory::new_test();
        let node_a = memory.insert_node("Topic A");
        let node_b = memory.insert_node("Unrelated B");
        memory.insert_edge(&node_a, &node_b, EdgeType::SameTopic, 0.5);

        // When: A query uses these nodes but fails
        let outcome = InteractionOutcome {
            quality_score: 0.3,
            used_nodes: vec![node_a.clone(), node_b.clone()],
            task_success: false,
        };
        memory.apply_outcome(&outcome);

        // Then: Edge weight should decrease
        let edge = memory.get_edge(&node_a, &node_b).unwrap();
        assert!(edge.weight < 0.5);
    }

    #[test]
    fn test_unused_edges_decay_over_time() {
        // Given: An edge that hasn't been used for 30 days
        let mut memory = RuvectorMemory::new_test();
        let edge = memory.create_edge_with_last_used(
            "node_a", "node_b",
            0.5,
            Instant::now() - Duration::from_secs(30 * 24 * 60 * 60),
        );

        // When: Periodic cleanup runs
        memory.apply_decay(DECAY_RATE, MIN_INTERACTIONS_BEFORE_PRUNE);

        // Then: Edge weight should have decayed
        let updated = memory.get_edge(&edge.src, &edge.dst).unwrap();
        assert!(updated.weight < 0.5);
    }

    #[test]
    fn test_attention_weights_update_from_success_patterns() {
        // Given: Graph attention engine with initial weights
        let mut attention = GraphAttentionEngine::new_test();
        let initial_weights = attention.get_edge_bias_weights();

        // When: Train on successful interaction patterns
        let patterns = vec![
            AttentionPattern {
                edges_used: vec![EdgeType::Cites],
                outcome_quality: 0.95,
            },
            AttentionPattern {
                edges_used: vec![EdgeType::Cites],
                outcome_quality: 0.90,
            },
        ];
        attention.train_on_patterns(&patterns);

        // Then: Edge type "Cites" should have a higher attention bias
        let updated_weights = attention.get_edge_bias_weights();
        assert!(updated_weights[EdgeType::Cites] > initial_weights[EdgeType::Cites]);
    }
}
```
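
The decay behavior asserted in `test_unused_edges_decay_over_time` can be sketched as exponential falloff in the days since last use. The half-life parameterization is an assumption for illustration; the actual `apply_decay` rate constant may differ.

```rust
/// Illustrative decay rule (an assumed form, not the production one):
/// halve an unused edge's weight every `half_life_days` days without use.
pub fn decayed_weight(weight: f64, days_unused: f64, half_life_days: f64) -> f64 {
    weight * 0.5f64.powf(days_unused / half_life_days)
}
```

Under this rule an edge at weight 0.5 that sits unused for one half-life drops to 0.25, while a freshly used edge keeps its weight unchanged.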

### 2.2 Loop B: Router Learning

**What the router learns:**

```
┌────────────────────────────────────────────────────────────────┐
│                     Router Learning Loop                       │
├────────────────────────────────────────────────────────────────┤
│                                                                │
│  For each query, LOG:                                          │
│     ┌─────────────────────────────────────────────────────────┐│
│     │ - Router features (128-dim input vector)                ││
│     │ - Chosen route (model, context, temp, top_p)            ││
│     │ - Actual latency and cost                               ││
│     │ - Quality score (judge model or task outcome)           ││
│     └─────────────────────────────────────────────────────────┘│
│                                                                │
│  Periodically RETRAIN FastGRNN:                                │
│     ┌─────────────────────────────────────────────────────────┐│
│     │ Objective: Prefer cheaper routes when quality holds     ││
│     │            Escalate only when necessary                 ││
│     │                                                         ││
│     │ Loss = -Quality + λ·Cost + μ·LatencyPenalty             ││
│     │                                                         ││
│     │ Constraints:                                            ││
│     │ - Quality must exceed threshold θ_min                   ││
│     │ - Latency must meet SLA                                 ││
│     └─────────────────────────────────────────────────────────┘│
│                                                                │
│  RESULT: Router becomes a self-learning policy over your stack │
│                                                                │
└────────────────────────────────────────────────────────────────┘
```
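
The loss in the diagram can be written out directly. This sketch assumes `LatencyPenalty` is the excess over the SLA (in seconds, zero when within budget); the SLA-excess form and the parameter values are illustrative, not the production choices.

```rust
/// Sketch of the routing objective from the diagram:
///   Loss = -Quality + λ·Cost + μ·LatencyPenalty
/// where LatencyPenalty is assumed to be seconds of latency beyond the SLA.
pub fn route_loss(
    quality: f64,
    cost: f64,
    latency_ms: f64,
    sla_ms: f64,
    lambda: f64,
    mu: f64,
) -> f64 {
    // No penalty while the route meets its latency SLA.
    let latency_penalty = ((latency_ms - sla_ms) / 1000.0).max(0.0);
    -quality + lambda * cost + mu * latency_penalty
}
```

With λ large enough, the 700M route from the test below (quality 0.92, cost 0.001) scores a lower loss than the 1.2B route (quality 0.93, cost 0.003), which is exactly the "prefer cheaper routes when quality holds" behavior.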

**TDD Tests for Loop B:**

```rust
#[cfg(test)]
mod router_learning_tests {
    use super::*;

    #[test]
    fn test_router_prefers_smaller_model_when_quality_sufficient() {
        // Given: Training data showing 700M achieves same quality as 1.2B
        let training_data = vec![
            RouterSample {
                features: simple_query_features(),
                model_used: ModelSize::M700,
                quality: 0.92,
                latency_ms: 150.0,
                cost: 0.001,
            },
            RouterSample {
                features: simple_query_features(),
                model_used: ModelSize::B1_2,
                quality: 0.93, // Only marginally better
                latency_ms: 300.0,
                cost: 0.003,
            },
        ];

        // When: Router is trained
        let mut router = FastGRNNRouter::new_test();
        router.train(&training_data, QUALITY_THRESHOLD);

        // Then: Router should prefer 700M for similar queries
        let decision = router.forward(&simple_query_features(), &initial_hidden());
        assert_eq!(decision.model, ModelSize::M700);
    }

    #[test]
    fn test_router_escalates_for_complex_queries() {
        // Given: Training data showing complex queries need larger models
        let training_data = vec![
            RouterSample {
                features: complex_query_features(),
                model_used: ModelSize::M700,
                quality: 0.45, // Poor quality
                latency_ms: 150.0,
                cost: 0.001,
            },
            RouterSample {
                features: complex_query_features(),
                model_used: ModelSize::B2_6,
                quality: 0.91, // Good quality
                latency_ms: 500.0,
                cost: 0.010,
            },
        ];

        // When: Router is trained
        let mut router = FastGRNNRouter::new_test();
        router.train(&training_data, QUALITY_THRESHOLD);

        // Then: Router should choose 2.6B for complex queries
        let decision = router.forward(&complex_query_features(), &initial_hidden());
        assert_eq!(decision.model, ModelSize::B2_6);
    }

    #[test]
    fn test_router_confidence_correlates_with_seen_patterns() {
        // Given: Router trained on specific feature patterns
        let mut router = FastGRNNRouter::new_test();
        let seen_features = vec![training_features_a(), training_features_b()];
        router.train(&samples_from_features(&seen_features), QUALITY_THRESHOLD);

        // When: Querying with seen vs unseen patterns
        let seen_decision = router.forward(&training_features_a(), &initial_hidden());
        let unseen_decision = router.forward(&novel_features(), &initial_hidden());

        // Then: Confidence should be higher for seen patterns
        assert!(seen_decision.confidence > unseen_decision.confidence);
    }

    #[test]
    fn test_router_ewc_prevents_forgetting() {
        // Given: Router trained on task A
        let mut router = FastGRNNRouter::new_test();
        let mut ewc = ElasticWeightConsolidation::new(0.4);
        router.train(&task_a_samples(), QUALITY_THRESHOLD);
        let task_a_accuracy_before = router.evaluate(&task_a_samples());

        // Compute Fisher and store optimal weights
        ewc.compute_fisher(&router, &task_a_samples());

        // When: Train on task B with EWC
        router.train_with_ewc(&task_b_samples(), &ewc, QUALITY_THRESHOLD);

        // Then: Task A accuracy should not significantly degrade
        let task_a_accuracy_after = router.evaluate(&task_a_samples());
        assert!(task_a_accuracy_after > task_a_accuracy_before - 0.05);
    }
}
```
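
The EWC regularizer behind `train_with_ewc` has a standard closed form: a quadratic penalty on drift away from the task-A weights, scaled by the Fisher diagonal. The sketch below uses illustrative names; only the formula itself, (λ/2)·Σᵢ Fᵢ·(θᵢ − θ*ᵢ)², is standard EWC.

```rust
/// EWC penalty: (lambda / 2) * Σ_i fisher[i] * (theta[i] - theta_star[i])²,
/// where `theta_star` holds the weights frozen after the previous task and
/// `fisher` is the diagonal Fisher information estimated on that task.
pub fn ewc_penalty(lambda: f64, fisher: &[f64], theta: &[f64], theta_star: &[f64]) -> f64 {
    assert_eq!(fisher.len(), theta.len());
    assert_eq!(theta.len(), theta_star.len());
    0.5 * lambda
        * fisher
            .iter()
            .zip(theta.iter().zip(theta_star))
            .map(|(f, (t, ts))| f * (t - ts).powi(2))
            .sum::<f64>()
}
```

The penalty is zero when the router has not moved, and grows fastest along directions the Fisher marks as important for task A, which is what makes the forgetting test above pass.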

### 2.3 Loop C: Compression and Abstraction

**How the system avoids bloat:**

```
┌────────────────────────────────────────────────────────────────┐
│              Compression and Abstraction Loop                  │
├────────────────────────────────────────────────────────────────┤
│                                                                │
│  PERIODICALLY (hourly/daily):                                  │
│                                                                │
│  1. CLUSTER DETECTION                                          │
│     ┌─────────────────────────────────────────────────────────┐│
│     │ Identify clusters of similar nodes in graph:            ││
│     │ - Dense neighborhoods with similar embeddings           ││
│     │ - Frequently co-retrieved node sets                     ││
│     │ - High edge connectivity within cluster                 ││
│     └─────────────────────────────────────────────────────────┘│
│                                                                │
│  2. LFM2 SUMMARIZATION                                         │
│     ┌─────────────────────────────────────────────────────────┐│
│     │ For each cluster:                                       ││
│     │ - Feed cluster nodes to LFM2                            ││
│     │ - Generate summary "concept" node                       ││
│     │ - Create embedding for concept                          ││
│     └─────────────────────────────────────────────────────────┘│
│                                                                │
│  3. HIERARCHICAL ATTACHMENT                                    │
│     ┌─────────────────────────────────────────────────────────┐│
│     │ - Concept node becomes parent of cluster members        ││
│     │ - Add "contains" edges from concept to members          ││
│     │ - Future queries see concept first in attention         ││
│     └─────────────────────────────────────────────────────────┘│
│                                                                │
│  4. ARCHIVAL                                                   │
│     ┌─────────────────────────────────────────────────────────┐│
│     │ - Old, rarely-used fine-grained nodes → cold storage    ││
│     │ - Concept summaries stay in hot tier                    ││
│     │ - Preserve graph structure for rehydration              ││
│     └─────────────────────────────────────────────────────────┘│
│                                                                │
│  RESULT: Hierarchy of concepts, not ever-growing bag of chunks │
│                                                                │
└────────────────────────────────────────────────────────────────┘
```
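
The "high edge connectivity within cluster" criterion from step 1 is usually measured as edge density: edges present divided by edges possible. A small sketch, assuming an undirected cluster (the production `detect_clusters` may weight edges rather than count them):

```rust
/// Edge density of an undirected cluster: present edges over the
/// n·(n-1)/2 possible pairs. 1.0 means fully connected; clusters
/// below MIN_EDGE_DENSITY would not qualify for summarization.
pub fn edge_density(node_count: usize, edge_count: usize) -> f64 {
    if node_count < 2 {
        return 0.0; // a single node has no internal connectivity
    }
    let possible = node_count * (node_count - 1) / 2;
    edge_count as f64 / possible as f64
}
```

The fully connected 3-node ML cluster in the test below has density 1.0, while 3 edges over 4 nodes would only reach 0.5.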

**TDD Tests for Loop C:**

```rust
#[cfg(test)]
mod compression_tests {
    use super::*;

    #[test]
    fn test_cluster_detection_finds_dense_neighborhoods() {
        // Given: Graph with clear clusters
        let mut memory = RuvectorMemory::new_test();

        // Cluster 1: ML topics (densely connected)
        let ml_nodes = vec![
            memory.insert_node("Neural networks learn patterns"),
            memory.insert_node("Deep learning uses multiple layers"),
            memory.insert_node("Backpropagation trains neural nets"),
        ];
        for i in 0..ml_nodes.len() {
            for j in i + 1..ml_nodes.len() {
                memory.insert_edge(&ml_nodes[i], &ml_nodes[j], EdgeType::SameTopic, 0.9);
            }
        }

        // Cluster 2: Cooking topics (densely connected)
        let cooking_nodes = vec![
            memory.insert_node("Sourdough needs starter"),
            memory.insert_node("Bread baking requires patience"),
        ];
        memory.insert_edge(&cooking_nodes[0], &cooking_nodes[1], EdgeType::SameTopic, 0.85);

        // When: Run cluster detection
        let clusters = memory.detect_clusters(MIN_CLUSTER_SIZE, MIN_EDGE_DENSITY);

        // Then: Should find two distinct clusters
        assert_eq!(clusters.len(), 2);
        assert!(clusters.iter().any(|c| c.nodes.len() == 3)); // ML cluster
        assert!(clusters.iter().any(|c| c.nodes.len() == 2)); // Cooking cluster
    }

    #[test]
    fn test_summarization_creates_concept_node() {
        // Given: A cluster of related nodes
        let nodes = vec![
            Node::new("Rust is memory safe"),
            Node::new("Rust has zero-cost abstractions"),
            Node::new("Rust prevents data races"),
        ];
        let cluster = Cluster {
            centroid: compute_centroid(&nodes),
            nodes,
        };

        // When: Generate summary
        let summarizer = ClusterSummarizer::new(lfm2_model());
        let concept = summarizer.summarize(&cluster);

        // Then: Concept should capture key themes
        assert!(concept.text.to_lowercase().contains("rust"));
        assert!(concept.node_type == NodeType::Concept);
        assert!(concept.metadata.contains_key("source_cluster_size"));
    }

    #[test]
    fn test_concept_nodes_are_prioritized_in_retrieval() {
        // Given: Memory with concept and detail nodes
        let mut memory = RuvectorMemory::new_test();
        let concept = memory.insert_node_typed(
            "Rust programming overview",
            NodeType::Concept,
        );
        let detail = memory.insert_node_typed(
            "Rust's borrow checker enforces ownership",
            NodeType::Document,
        );
        memory.insert_edge(&concept, &detail, EdgeType::Contains, 1.0);

        // When: Query about Rust
        let query_embedding = embed("Tell me about Rust");
        let results = memory.search_with_concept_boost(&query_embedding, 10);

        // Then: Concept should appear before (or with higher weight than) details
        let concept_idx = results.iter().position(|r| r.id == concept.id).unwrap();
        let detail_idx = results.iter().position(|r| r.id == detail.id).unwrap();
        assert!(concept_idx < detail_idx);
    }

    #[test]
    fn test_archival_moves_old_nodes_to_cold_storage() {
        // Given: Nodes with different access patterns
        let mut memory = RuvectorMemory::new_test();
        let hot_node = memory.insert_node_with_access(
            "Recently used content",
            AccessStats { last_used: now(), use_count: 50 },
        );
        let cold_node = memory.insert_node_with_access(
            "Old unused content",
            AccessStats {
                last_used: now() - Duration::from_secs(90 * 24 * 60 * 60),
                use_count: 1,
            },
        );

        // When: Run archival
        memory.run_archival(
            MAX_AGE_DAYS,
            MIN_USE_COUNT,
            COLD_STORAGE_PATH,
        );

        // Then: Hot node stays, cold node archived
        assert!(memory.contains(&hot_node.id));
        assert!(!memory.contains(&cold_node.id));
        assert!(cold_storage_contains(&cold_node.id));
    }
}
```
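
The archival rule exercised in the last test reduces to a small predicate. This sketch assumes the two thresholds act as a conjunction (old *and* rarely used); the real `run_archival` may combine them differently.

```rust
/// Illustrative archival predicate: a node moves to cold storage only
/// when it is both older than `max_age_days` since last use and used
/// fewer than `min_use_count` times. Hot or frequently used nodes stay.
pub fn should_archive(
    days_since_use: u64,
    use_count: u64,
    max_age_days: u64,
    min_use_count: u64,
) -> bool {
    days_since_use > max_age_days && use_count < min_use_count
}
```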

---

## 3. Weight-Level Self-Learning (Controlled)

### 3.1 The Safe Outer Loop

**Weight updates happen outside production, in a controlled pipeline:**

```
┌────────────────────────────────────────────────────────────────┐
│              Weight-Level Self-Learning Pipeline               │
├────────────────────────────────────────────────────────────────┤
│                                                                │
│  ┌──────────────────────────────────────────────────────────┐  │
│  │ STEP 1: COLLECT TRAINING TRACES (continuous)             │  │
│  │                                                          │  │
│  │ From live system, store:                                 │  │
│  │ - (prompt, retrieved_context, final_answer, outcome)     │  │
│  │ - Judge scores or human ratings                          │  │
│  │ - Explicit error cases                                   │  │
│  │                                                          │  │
│  │ Tag by: domain, difficulty, risk_level                   │  │
│  └──────────────────────────────────────────────────────────┘  │
│                              │                                 │
│                              ▼                                 │
│  ┌──────────────────────────────────────────────────────────┐  │
│  │ STEP 2: BUILD ROLLING CURRICULUM (nightly/weekly)        │  │
│  │                                                          │  │
│  │ Sample recent traces:                                    │  │
│  │ - Up-weight hard or high-value tasks                     │  │
│  │ - Filter out cases where context was wrong               │  │
│  │                                                          │  │
│  │ Create three sets:                                       │  │
│  │ ┌───────────────┐ ┌───────────────┐ ┌───────────────┐    │  │
│  │ │ SFT           │ │ Preference    │ │ Retrieval     │    │  │
│  │ │ (good         │ │ Pairs         │ │ Correction    │    │  │
│  │ │  answers)     │ │ (good vs bad) │ │ (context      │    │  │
│  │ │               │ │               │ │  selection)   │    │  │
│  │ └───────────────┘ └───────────────┘ └───────────────┘    │  │
│  └──────────────────────────────────────────────────────────┘  │
│                              │                                 │
│                              ▼                                 │
│  ┌──────────────────────────────────────────────────────────┐  │
│  │ STEP 3: TRAIN STUDENT VARIANTS (offline)                 │  │
│  │                                                          │  │
│  │ Take current best LFM2 checkpoint:                       │  │
│  │ 1. Run supervised fine-tuning on new traces              │  │
│  │ 2. Optionally run preference objective on pairs          │  │
│  │ 3. Validate on fixed holdout + public benchmarks         │  │
│  │                                                          │  │
│  │ Output: "LFM2-ruv-edition-vN"                            │  │
│  └──────────────────────────────────────────────────────────┘  │
│                              │                                 │
│                              ▼                                 │
│  ┌──────────────────────────────────────────────────────────┐  │
│  │ STEP 4: GATED DEPLOYMENT (A/B testing)                   │  │
│  │                                                          │  │
│  │ ┌──────────────────────────────────────────────────────┐ │  │
│  │ │                 Production Traffic                   │ │  │
│  │ │   ┌────────────────┐      ┌────────────────┐         │ │  │
│  │ │   │ 90% → Current  │      │ 10% → Student  │         │ │  │
│  │ │   │     Model      │      │      vN        │         │ │  │
│  │ │   └────────────────┘      └────────────────┘         │ │  │
│  │ └──────────────────────────────────────────────────────┘ │  │
│  │                                                          │  │
│  │ Compare: quality, latency, failure_rate                  │  │
│  │ Promote IFF: student dominates OR ties on key metrics    │  │
│  │                                                          │  │
│  │ ⚠️ Never free-write weights in-place                     │  │
│  │ ⚠️ Always retrain in controlled loop                     │  │
│  │ ⚠️ Promote only when safe                                │  │
│  └──────────────────────────────────────────────────────────┘  │
│                                                                │
└────────────────────────────────────────────────────────────────┘
```
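
Step 2's sampling policy can be sketched as a weight function over traces. The linear up-weighting form and the `hard_bonus` knob are assumptions for illustration; the curriculum builder in the tests below only needs the two behaviors shown (quality filtering and hard-task up-weighting).

```rust
/// Illustrative curriculum weighting: traces below the quality floor are
/// dropped entirely (`None`); surviving traces get a base weight of 1.0
/// plus a difficulty-proportional bonus, so hard successes dominate.
pub fn curriculum_weight(
    difficulty: f64,
    quality: f64,
    quality_floor: f64,
    hard_bonus: f64,
) -> Option<f64> {
    if quality < quality_floor {
        None // filtered out of the curriculum
    } else {
        Some(1.0 + hard_bonus * difficulty)
    }
}
```

With a 0.7 floor, the hard/good trace (difficulty 0.9, quality 0.85) outweighs the easy/good one, and the easy/bad one (quality 0.60) is filtered, matching `test_curriculum_upweights_hard_tasks`.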

**TDD Tests for Weight-Level Learning:**

```rust
#[cfg(test)]
mod weight_learning_tests {
    use super::*;

    #[test]
    fn test_trace_collection_captures_all_components() {
        // Given: A completed interaction
        let trace_collector = TraceCollector::new_test();
        let interaction = Interaction {
            prompt: "What is Rust?",
            context: vec!["Rust is a systems language"],
            response: "Rust is a memory-safe systems programming language",
            quality_score: 0.92,
            task_outcome: TaskOutcome::Success,
        };

        // When: Trace is collected
        let trace = trace_collector.collect(&interaction);

        // Then: All components should be present
        assert!(trace.prompt.is_some());
        assert!(!trace.context.is_empty());
        assert!(trace.response.is_some());
        assert!(trace.quality_score.is_some());
        assert!(!trace.domain_tags.is_empty());
    }

    #[test]
    fn test_curriculum_upweights_hard_tasks() {
        // Given: Mix of easy and hard traces
        let traces = vec![
            Trace { difficulty: 0.2, quality: 0.95, ..Default::default() }, // Easy, good
            Trace { difficulty: 0.9, quality: 0.85, ..Default::default() }, // Hard, good
            Trace { difficulty: 0.3, quality: 0.60, ..Default::default() }, // Easy, bad
        ];

        // When: Build curriculum
        let curriculum = CurriculumBuilder::new()
            .upweight_hard_tasks(true)
            .filter_bad_quality(0.7)
            .build(&traces);

        // Then: Hard successful trace should have higher weight
        let hard_weight = curriculum.weight_for(&traces[1]);
        let easy_weight = curriculum.weight_for(&traces[0]);
        assert!(hard_weight > easy_weight);

        // And: Bad quality trace should be filtered
        assert!(!curriculum.contains(&traces[2]));
    }

    #[test]
    fn test_preference_pairs_correctly_ordered() {
        // Given: Same query with different quality responses
        let good_response = Response { text: "Detailed answer...", quality: 0.9 };
        let bad_response = Response { text: "I don't know", quality: 0.3 };
        let query = "Explain backpropagation";

        // When: Create preference pair
        let pair = PreferencePair::from_responses(query, &good_response, &bad_response);

        // Then: Good should be preferred
        assert_eq!(pair.chosen, good_response.text);
        assert_eq!(pair.rejected, bad_response.text);
    }

    #[test]
    fn test_student_validation_gates_deployment() {
        // Given: Student model that underperforms on holdout
        let _student = StudentModel::new_test();
        let _holdout = HoldoutDataset::load_test();
        let baseline_accuracy = 0.85;
        let student_accuracy = 0.78; // Below baseline

        // When: Validate for deployment
        let validation = ValidationResult::new(student_accuracy, baseline_accuracy);

        // Then: Should NOT be approved for deployment
        assert!(!validation.approved_for_deployment());
        assert!(validation.rejection_reason().contains("accuracy"));
    }

    #[test]
    fn test_ab_test_detects_regression() {
        // Given: A/B test results
        let ab_results = ABTestResults {
            control: ABMetrics { quality: 0.90, latency_p50: 200.0, failure_rate: 0.02 },
            treatment: ABMetrics { quality: 0.88, latency_p50: 180.0, failure_rate: 0.05 },
        };

        // When: Evaluate for promotion
        let decision = ABDecision::evaluate(&ab_results, SIGNIFICANCE_THRESHOLD);

        // Then: Should NOT promote due to quality regression + higher failure rate
        assert_eq!(decision, ABDecision::KeepControl);
        assert!(decision.reasons().contains("quality_regression"));
    }
}
```
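
The step-4 promotion gate boils down to a comparison of the two A/B arms. A minimal sketch, assuming a quality-regression tolerance `epsilon` and a hard no-regression rule on failure rate (the `Metrics` struct here is illustrative, mirroring the `ABMetrics` fields used above):

```rust
/// Illustrative A/B arm summary (a stand-in for ABMetrics above).
pub struct Metrics {
    pub quality: f64,
    pub failure_rate: f64,
}

/// Promote the student only if its quality does not regress beyond
/// `epsilon` AND its failure rate does not rise. Anything else keeps
/// the control model — weights are never promoted on a tie-break loss.
pub fn promote_student(control: &Metrics, treatment: &Metrics, epsilon: f64) -> bool {
    treatment.quality >= control.quality - epsilon
        && treatment.failure_rate <= control.failure_rate
}
```

Fed the numbers from `test_ab_test_detects_regression` (quality 0.90 → 0.88, failures 0.02 → 0.05), the gate keeps the control model, which is the `ABDecision::KeepControl` outcome the test expects.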

---

## 4. Test-Driven Development Plan

### 4.1 Testing Pyramid

```
                    ┌─────────────────┐
                    │   E2E Tests     │  (5%)
                    │  Full pipeline  │
                    └────────┬────────┘
                             │
              ┌──────────────┴──────────────┐
              │     Integration Tests       │  (20%)
              │    Cross-component flows    │
              └──────────────┬──────────────┘
                             │
         ┌───────────────────┴───────────────────┐
         │              Unit Tests               │  (75%)
         │    Individual functions & modules     │
         └───────────────────────────────────────┘
```

### 4.2 Test Categories by Component

#### 4.2.1 Orchestrator Tests

```rust
#[cfg(test)]
mod orchestrator_tests {
    #[test]
    fn test_request_routing_respects_session() { }

    #[test]
    fn test_rate_limiting_rejects_excess_requests() { }

    #[test]
    fn test_cache_hit_bypasses_processing() { }

    #[test]
    fn test_cache_miss_triggers_full_pipeline() { }

    #[test]
    fn test_error_handling_returns_graceful_response() { }

    #[test]
    fn test_metrics_recorded_for_all_requests() { }
}
```

#### 4.2.2 Embedding Service Tests

```rust
#[cfg(test)]
mod embedding_tests {
    #[test]
    fn test_embedding_dimension_matches_config() { }

    #[test]
    fn test_similar_texts_have_similar_embeddings() { }

    #[test]
    fn test_different_texts_have_different_embeddings() { }

    #[test]
    fn test_long_text_truncation() { }

    #[test]
    fn test_batch_embedding_matches_individual() { }

    #[test]
    fn test_empty_string_handling() { }
}
```

#### 4.2.3 Router Tests

```rust
#[cfg(test)]
mod router_tests {
    #[test]
    fn test_forward_produces_valid_probabilities() { }

    #[test]
    fn test_hidden_state_updates_across_calls() { }

    #[test]
    fn test_confidence_threshold_triggers_fallback() { }

    #[test]
    fn test_gradient_computation() { }

    #[test]
    fn test_sparse_matrix_operations() { }

    #[test]
    fn test_low_rank_matrix_approximation() { }
}
```

#### 4.2.4 Memory Tests

```rust
#[cfg(test)]
mod memory_tests {
    #[test]
    fn test_hnsw_search_returns_k_neighbors() { }

    #[test]
    fn test_graph_expansion_respects_hop_limit() { }

    #[test]
    fn test_writeback_queue_batches_correctly() { }

    #[test]
    fn test_deduplication_prevents_near_duplicates() { }

    #[test]
    fn test_metadata_filtering() { }

    #[test]
    fn test_edge_weight_update() { }
}
```

#### 4.2.5 Attention Tests

```rust
#[cfg(test)]
mod attention_tests {
    #[test]
    fn test_attention_weights_sum_to_one() { }

    #[test]
    fn test_edge_features_influence_attention() { }

    #[test]
    fn test_multi_head_concatenation() { }

    #[test]
    fn test_residual_connection_preserved() { }

    #[test]
    fn test_layer_norm_normalization() { }

    #[test]
    fn test_attention_ranking_matches_weights() { }
}
```

#### 4.2.6 Inference Tests

```rust
#[cfg(test)]
mod inference_tests {
    #[test]
    fn test_model_loading_correct_size() { }

    #[test]
    fn test_kv_cache_reuse() { }

    #[test]
    fn test_generation_respects_max_tokens() { }

    #[test]
    fn test_temperature_affects_randomness() { }

    #[test]
    fn test_top_p_filtering() { }

    #[test]
    fn test_model_eviction_under_memory_pressure() { }
}
```

#### 4.2.7 Learning Tests

```rust
#[cfg(test)]
mod learning_tests {
    #[test]
    fn test_replay_buffer_reservoir_sampling() { }

    #[test]
    fn test_ewc_regularization_value() { }

    #[test]
    fn test_fisher_information_computation() { }

    #[test]
    fn test_quality_judge_score_range() { }

    #[test]
    fn test_writeback_threshold_filtering() { }

    #[test]
    fn test_background_training_thread() { }
}
```

### 4.3 Integration Test Scenarios

```rust
#[cfg(test)]
mod integration_tests {
    /// Test full request-response cycle
    #[tokio::test]
    async fn test_end_to_end_query() {
        let system = RuvLLMSystem::new_test().await;

        let response = system.process(Request {
            query: "What is machine learning?",
            session_id: Some("test-session"),
            constraints: Default::default(),
        }).await.unwrap();

        assert!(!response.text.is_empty());
        assert!(response.confidence > 0.0);
        assert!(!response.sources.is_empty());
    }

    /// Test multi-turn conversation with context
    #[tokio::test]
    async fn test_multi_turn_context() {
        let system = RuvLLMSystem::new_test().await;
        let session = "multi-turn-test";

        // Turn 1
        let _r1 = system.process(Request {
            query: "What is Rust?",
            session_id: Some(session),
            ..Default::default()
        }).await.unwrap();

        // Turn 2 (should use KV cache)
        let r2 = system.process(Request {
            query: "What are its main features?",
            session_id: Some(session),
            ..Default::default()
        }).await.unwrap();

        // Response should reference Rust from context
        assert!(r2.text.to_lowercase().contains("rust") ||
                r2.text.to_lowercase().contains("memory") ||
                r2.text.to_lowercase().contains("safety"));
    }

    /// Test that learning loop updates memory
    #[tokio::test]
    async fn test_learning_updates_memory() {
        let system = RuvLLMSystem::new_test().await;
        let initial_node_count = system.memory.node_count();

        // Process high-quality interaction
        system.process_with_feedback(
            Request { query: "Novel question...", ..Default::default() },
            Feedback { quality: 0.95, explicit_rating: Some(5) },
        ).await.unwrap();

        // Memory should have grown
        let final_node_count = system.memory.node_count();
        assert!(final_node_count > initial_node_count);
    }

    /// Test router learns from experience
    #[tokio::test]
    async fn test_router_adaptation() {
        let mut system = RuvLLMSystem::new_test().await;

        // Process many simple queries
        for _ in 0..100 {
            system.process(Request {
                query: "Simple factual question",
                ..Default::default()
            }).await.unwrap();
        }

        // Trigger training
        system.learning_service.train_router().await;

        // Router should now prefer smaller models for similar queries
        let decision = system.router.forward(
            &simple_query_features(),
            &initial_hidden(),
        );
        assert!(decision.model == ModelSize::M350 || decision.model == ModelSize::M700);
    }
}
```

---

## 5. Benchmarking Suite

### 5.1 Performance Benchmarks

```rust
use criterion::{criterion_group, criterion_main, Criterion, BenchmarkId};

fn embedding_benchmark(c: &mut Criterion) {
    let embedder = EmbeddingService::new_test();

    let mut group = c.benchmark_group("embedding");

    for size in [32, 128, 512, 2048].iter() {
        let text = "a".repeat(*size);
        group.bench_with_input(
            BenchmarkId::new("embed", size),
            &text,
            |b, t| b.iter(|| embedder.embed(t)),
        );
    }

    group.finish();
}

fn hnsw_search_benchmark(c: &mut Criterion) {
    let memory = RuvectorMemory::new_with_data(100_000); // 100K vectors
    let query = random_vector(384);

    let mut group = c.benchmark_group("hnsw_search");

    for k in [10, 32, 64].iter() {
        for ef in [32, 64, 128].iter() {
            group.bench_with_input(
                BenchmarkId::new(format!("k={},ef={}", k, ef), ""),
                &(k, ef),
                |b, (k, ef)| b.iter(|| memory.search(&query, **k, **ef)),
            );
        }
    }

    group.finish();
}

fn router_forward_benchmark(c: &mut Criterion) {
    let router = FastGRNNRouter::new_test();
    let features = random_vector(128);
    let hidden = random_vector(64);

    c.bench_function("router_forward", |b| {
        b.iter(|| router.forward(&features, &hidden))
    });
}

fn graph_attention_benchmark(c: &mut Criterion) {
    let attention = GraphAttentionEngine::new_test();
    let query = random_vector(384);
    let subgraph = generate_subgraph(50, 100); // 50 nodes, 100 edges

    c.bench_function("graph_attention", |b| {
        b.iter(|| attention.attend(&query, &subgraph))
    });
}

criterion_group!(
    benches,
    embedding_benchmark,
    hnsw_search_benchmark,
    router_forward_benchmark,
    graph_attention_benchmark
);
criterion_main!(benches);
```

### 5.2 Quality Benchmarks

```rust
/// Benchmark suite for quality metrics
pub struct QualityBenchmark {
    dataset: BenchmarkDataset,
    judge: QualityJudge,
}

impl QualityBenchmark {
    pub async fn run(&self, system: &RuvLLMSystem) -> QualityResults {
        let mut results = QualityResults::default();

        for sample in &self.dataset.samples {
            let response = system.process(Request {
                query: sample.query.clone(),
                ..Default::default()
            }).await.unwrap();

            // Judge quality
            let quality = self.judge.evaluate(
                &sample.query,
                &response.text,
                &response.sources,
            ).await;

            // Check against ground truth if available
            if let Some(expected) = &sample.expected_answer {
                let f1 = compute_f1(&response.text, expected);
                results.f1_scores.push(f1);
            }

            results.quality_scores.push(quality);
            results.latencies.push(response.latency);
        }

        results
    }
}
```
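
`compute_f1` above is left undefined; a common realization is token-overlap F1 against the ground truth. The sketch below assumes whitespace tokenization, which is only one plausible choice:

```rust
use std::collections::HashMap;

/// Token-overlap F1 between a prediction and an expected answer
/// (an assumed realization of the `compute_f1` used above):
/// precision and recall over multiset token overlap, combined
/// as their harmonic mean.
pub fn compute_f1(prediction: &str, expected: &str) -> f64 {
    fn counts(s: &str) -> HashMap<&str, usize> {
        let mut m = HashMap::new();
        for t in s.split_whitespace() {
            *m.entry(t).or_insert(0) += 1;
        }
        m
    }
    let (p, e) = (counts(prediction), counts(expected));
    // Multiset intersection size: shared tokens, counted with multiplicity.
    let overlap: usize = p
        .iter()
        .map(|(t, &c)| c.min(*e.get(t).unwrap_or(&0)))
        .sum();
    let (np, ne) = (p.values().sum::<usize>(), e.values().sum::<usize>());
    if overlap == 0 || np == 0 || ne == 0 {
        return 0.0;
    }
    let precision = overlap as f64 / np as f64;
    let recall = overlap as f64 / ne as f64;
    2.0 * precision * recall / (precision + recall)
}
```

An exact match scores 1.0, disjoint answers score 0.0, and partial overlap lands in between, giving the benchmark a graded signal rather than a pass/fail one.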

---

## 6. Iteration Milestones

### 6.1 Phase 1: Foundation (Weeks 1-2)

| Milestone | Deliverables | Tests |
|-----------|--------------|-------|
| M1.1 | Embedding service stub | Dimension tests |
| M1.2 | Memory service with HNSW | Search tests |
| M1.3 | Basic orchestrator | Integration smoke tests |
| M1.4 | Mock LFM2 interface | Interface contract tests |

### 6.2 Phase 2: Core Pipeline (Weeks 3-4)

| Milestone | Deliverables | Tests |
|-----------|--------------|-------|
| M2.1 | FastGRNN router | Forward pass tests |
| M2.2 | Graph attention engine | Attention computation tests |
| M2.3 | Context builder | Deduplication, truncation tests |
| M2.4 | End-to-end pipeline | Full flow integration tests |

### 6.3 Phase 3: Learning Loops (Weeks 5-6)

| Milestone | Deliverables | Tests |
|-----------|--------------|-------|
| M3.1 | Quality judge | Evaluation tests |
| M3.2 | Replay buffer | Sampling distribution tests |
| M3.3 | EWC integration | Forgetting prevention tests |
| M3.4 | Memory writeback | Graph update tests |

### 6.4 Phase 4: Optimization (Weeks 7-8)

| Milestone | Deliverables | Tests |
|-----------|--------------|-------|
| M4.1 | Router training loop | Learning convergence tests |
| M4.2 | Compression/abstraction | Cluster detection tests |
| M4.3 | Performance tuning | Benchmark suite |
| M4.4 | Production hardening | Load tests, failure injection |

---
|
|
|
|
## 7. Refinement Checklist
|
|
|
|
### 7.1 Per-Component Checklist
|
|
|
|
```
|
|
[ ] Orchestrator
|
|
[ ] Request validation
|
|
[ ] Session management
|
|
[ ] Rate limiting
|
|
[ ] Caching
|
|
[ ] Error handling
|
|
[ ] Metrics export
|
|
|
|
[ ] Embedding Service
|
|
[ ] LFM2 encoder integration
|
|
[ ] Dimension projection
|
|
[ ] Batch processing
|
|
[ ] Tokenization
|
|
[ ] Truncation handling
|
|
|
|
[ ] FastGRNN Router
|
|
[ ] Cell implementation
|
|
[ ] Sparse weight matrices
|
|
[ ] Low-rank recurrent matrices
|
|
[ ] Output heads
|
|
[ ] Confidence calibration
|
|
[ ] Training loop
|
|
|
|
[ ] Memory Service
|
|
[ ] HNSW configuration
|
|
[ ] Graph storage
|
|
[ ] Edge operations
|
|
[ ] Writeback queue
|
|
[ ] Deduplication
|
|
[ ] Archival
|
|
|
|
[ ] Graph Attention
|
|
[ ] Multi-head attention
|
|
[ ] Edge feature encoding
|
|
[ ] Layer stacking
|
|
[ ] Residual connections
|
|
[ ] Output ranking
|
|
|
|
[ ] Inference Pool
|
|
[ ] Model loading
|
|
[ ] Lazy initialization
|
|
[ ] KV cache management
|
|
[ ] Quantization selection
|
|
[ ] LRU eviction
|
|
|
|
[ ] Learning Service
|
|
[ ] Quality evaluation
|
|
[ ] Replay buffer
|
|
[ ] EWC regularization
|
|
[ ] Background training
|
|
[ ] Writeback logic
|
|
[ ] Compression jobs
|
|
```
|

### 7.2 Quality Gates

| Gate | Criteria | Status |
|------|----------|--------|
| Unit test coverage | >80% | ⬜ |
| Integration tests passing | 100% | ⬜ |
| Latency P50 | <500ms | ⬜ |
| Quality score mean | >0.8 | ⬜ |
| Router accuracy | >90% | ⬜ |
| Memory efficiency | <4GB | ⬜ |
| No memory leaks | 24h stress test | ⬜ |
| Forgetting rate | <5% per 10K interactions | ⬜ |

---

*Document Version: 1.0*

*Last Updated: 2025-12-02*

*Author: RuvLLM Architecture Team*