RuvLLM: TDD and Iterative Refinement
SPARC Phase 4: Refinement
1. Core Philosophy: Three-Layer Self-Learning
1.1 The Mental Model
"The intelligence is not in one model anymore. It is in the loop."
RuvLLM treats:
- LFM2 weights as a stable cortex (fixed core reasoning engine)
- Ruvector as the living synaptic mesh (adapts continuously)
- FastGRNN as the control circuit (learns when to use what)
This creates a system that genuinely learns from experience without requiring constant model retraining.
1.2 Three Adaptation Timescales
| Timescale | Mechanism | What Changes | Frequency |
|---|---|---|---|
| Short-term | Memory + Routing | Graph structure, attention patterns, routing decisions | Every request |
| Medium-term | Compression | Concept nodes, graph hierarchy, router weights | Hourly/Daily |
| Long-term | Weight tuning | LFM2 fine-tuned variants | Weekly/Monthly |
2. Self-Learning Loop Architecture
2.1 Loop A: Memory Growth and Refinement
What happens on every request:
Request → Response → Outcome
↓
┌────────────────────────────────────────────────────────────────┐
│ Memory Growth Loop │
├────────────────────────────────────────────────────────────────┤
│ │
│ 1. WRITE to ruvector: │
│ ┌─────────────────────────────────────────────────────────┐│
│ │ - Question (query embedding + text) ││
│ │ - Answer (response embedding + text) ││
│ │ - Retrieved documents (context used) ││
│ │ - Final outcome (quality score, task success) ││
│ │ - User feedback if any (explicit signals) ││
│ └─────────────────────────────────────────────────────────┘│
│ │
│ 2. GRAPH RULES: │
│ ┌─────────────────────────────────────────────────────────┐│
│ │ ✓ Strengthen edges between nodes that co-appear ││
│ │ in good answers ││
│ │ ✓ Weaken/prune edges rarely used or correlating ││
│ │ with bad answers ││
│ │ ✓ Update attention weights based on success patterns ││
│ └─────────────────────────────────────────────────────────┘│
│ │
│ 3. RESULT: │
│ Same LFM2 checkpoint → Different answers over time │
│ because the graph, weights, and attention improve │
│ │
└────────────────────────────────────────────────────────────────┘
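The graph rules in step 2 can be sketched as a small Rust helper. This is an illustrative sketch, not the actual RuvectorMemory implementation: the learning rate, the shape of the success/failure signal, and the exponential decay constant are assumptions chosen only to exhibit the strengthen/weaken/decay behavior described above.

```rust
/// A single graph edge with a learned weight in [0, 1].
pub struct Edge {
    pub weight: f32,
}

/// Outcome signal for one request, as logged by the memory loop.
pub struct Outcome {
    pub quality_score: f32, // judge score in [0, 1]
    pub task_success: bool,
}

/// Strengthen edges that co-appear in good answers, weaken them on
/// failures: a simple additive rule with clamping (illustrative).
pub fn apply_outcome_to_edge(edge: &mut Edge, outcome: &Outcome) {
    let lr = 0.1; // illustrative learning rate, not a prescribed value
    let signal = if outcome.task_success {
        outcome.quality_score // push weight up, scaled by quality
    } else {
        -(1.0 - outcome.quality_score) // push weight down on bad answers
    };
    edge.weight = (edge.weight + lr * signal).clamp(0.0, 1.0);
}

/// Periodic decay for unused edges: exponential in days since last use.
pub fn decayed_weight(weight: f32, days_unused: f32, decay_rate: f32) -> f32 {
    weight * (-decay_rate * days_unused).exp()
}
```

Under this rule the same edge drifts up with successful co-use and down otherwise, which is exactly the behavior the Loop A tests below assert.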
TDD Tests for Loop A:
#[cfg(test)]
mod memory_growth_tests {
use super::*;
#[test]
fn test_successful_interaction_strengthens_edges() {
// Given: A memory with two related nodes
let mut memory = RuvectorMemory::new_test();
let node_a = memory.insert_node("Machine learning is a subset of AI");
let node_b = memory.insert_node("Neural networks are ML models");
memory.insert_edge(&node_a, &node_b, EdgeType::SameTopic, 0.5);
// When: A successful query uses both nodes
let outcome = InteractionOutcome {
quality_score: 0.9,
used_nodes: vec![node_a.clone(), node_b.clone()],
task_success: true,
};
memory.apply_outcome(&outcome);
// Then: Edge weight should increase
let edge = memory.get_edge(&node_a, &node_b).unwrap();
assert!(edge.weight > 0.5);
}
#[test]
fn test_failed_interaction_weakens_edges() {
// Given: A memory with edge
let mut memory = RuvectorMemory::new_test();
let node_a = memory.insert_node("Topic A");
let node_b = memory.insert_node("Unrelated B");
memory.insert_edge(&node_a, &node_b, EdgeType::SameTopic, 0.5);
// When: Query uses these but fails
let outcome = InteractionOutcome {
quality_score: 0.3,
used_nodes: vec![node_a.clone(), node_b.clone()],
task_success: false,
};
memory.apply_outcome(&outcome);
// Then: Edge weight should decrease
let edge = memory.get_edge(&node_a, &node_b).unwrap();
assert!(edge.weight < 0.5);
}
#[test]
fn test_unused_edges_decay_over_time() {
// Given: An edge that hasn't been used
let mut memory = RuvectorMemory::new_test();
let edge = memory.create_edge_with_last_used(
"node_a", "node_b",
0.5,
Instant::now() - Duration::from_secs(30 * 24 * 60 * 60) // 30 days; Duration::from_days is not stable
);
// When: Periodic cleanup runs
memory.apply_decay(DECAY_RATE, MIN_INTERACTIONS_BEFORE_PRUNE);
// Then: Edge weight should have decayed
let updated = memory.get_edge(&edge.src, &edge.dst).unwrap();
assert!(updated.weight < 0.5);
}
#[test]
fn test_attention_weights_update_from_success_patterns() {
// Given: Graph attention engine with initial weights
let mut attention = GraphAttentionEngine::new_test();
let initial_weights = attention.get_edge_bias_weights();
// When: Train on successful interaction patterns
let patterns = vec![
AttentionPattern {
edges_used: vec![EdgeType::Cites],
outcome_quality: 0.95,
},
AttentionPattern {
edges_used: vec![EdgeType::Cites],
outcome_quality: 0.90,
},
];
attention.train_on_patterns(&patterns);
// Then: Edge type "Cites" should have higher attention bias
let updated_weights = attention.get_edge_bias_weights();
assert!(updated_weights[EdgeType::Cites] > initial_weights[EdgeType::Cites]);
}
}
2.2 Loop B: Router Learning
What the router learns:
┌────────────────────────────────────────────────────────────────┐
│ Router Learning Loop │
├────────────────────────────────────────────────────────────────┤
│ │
│ For each query, LOG: │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ - Router features (128-dim input vector) │ │
│ │ - Chosen route (model, context, temp, top_p) │ │
│ │ - Actual latency and cost │ │
│ │ - Quality score (judge model or task outcome) │ │
│ └─────────────────────────────────────────────────────────┘ │
│ │
│ Periodically RETRAIN FastGRNN: │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ Objective: Prefer cheaper routes when quality holds │ │
│ │ Escalate only when necessary │ │
│ │ │ │
│ │ Loss = -Quality + λ·Cost + μ·LatencyPenalty │ │
│ │ │ │
│ │ Constraints: │ │
│ │ - Quality must exceed threshold θ_min │ │
│ │ - Latency must meet SLA │ │
│ └─────────────────────────────────────────────────────────┘ │
│ │
│ RESULT: Router becomes self-learning policy over your stack │
│ │
└────────────────────────────────────────────────────────────────┘
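The retraining objective above can be written out directly. A minimal sketch, assuming the latency penalty is zero inside the SLA and grows linearly beyond it; λ, μ, θ_min, and the SLA budget are free parameters of the deployment, not values fixed by this design.

```rust
/// One logged routing outcome, mirroring the fields the loop records.
pub struct RouteOutcome {
    pub quality: f32,    // judge score in [0, 1]
    pub cost: f32,       // e.g. dollars per request
    pub latency_ms: f32,
}

/// Loss = -Quality + λ·Cost + μ·LatencyPenalty, where the latency
/// penalty is zero inside the SLA and linear in the overshoot.
pub fn route_loss(o: &RouteOutcome, lambda: f32, mu: f32, sla_ms: f32) -> f32 {
    let latency_penalty = (o.latency_ms - sla_ms).max(0.0) / sla_ms;
    -o.quality + lambda * o.cost + mu * latency_penalty
}

/// Hard constraints: a route is admissible only if quality clears
/// θ_min and latency meets the SLA.
pub fn route_admissible(o: &RouteOutcome, theta_min: f32, sla_ms: f32) -> bool {
    o.quality >= theta_min && o.latency_ms <= sla_ms
}
```

With this shape, a cheaper route that holds quality gets a strictly lower loss than an expensive route with marginally better quality, which is what drives the "prefer smaller model" behavior tested below.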
TDD Tests for Loop B:
#[cfg(test)]
mod router_learning_tests {
use super::*;
#[test]
fn test_router_prefers_smaller_model_when_quality_sufficient() {
// Given: Training data showing 700M achieves same quality as 1.2B
let training_data = vec![
RouterSample {
features: simple_query_features(),
model_used: ModelSize::M700,
quality: 0.92,
latency_ms: 150.0,
cost: 0.001,
},
RouterSample {
features: simple_query_features(),
model_used: ModelSize::B1_2,
quality: 0.93, // Only marginally better
latency_ms: 300.0,
cost: 0.003,
},
];
// When: Router is trained
let mut router = FastGRNNRouter::new_test();
router.train(&training_data, QUALITY_THRESHOLD);
// Then: Router should prefer 700M for similar queries
let decision = router.forward(&simple_query_features(), &initial_hidden());
assert_eq!(decision.model, ModelSize::M700);
}
#[test]
fn test_router_escalates_for_complex_queries() {
// Given: Training data showing complex queries need larger models
let training_data = vec![
RouterSample {
features: complex_query_features(),
model_used: ModelSize::M700,
quality: 0.45, // Poor quality
latency_ms: 150.0,
cost: 0.001,
},
RouterSample {
features: complex_query_features(),
model_used: ModelSize::B2_6,
quality: 0.91, // Good quality
latency_ms: 500.0,
cost: 0.010,
},
];
// When: Router is trained
let mut router = FastGRNNRouter::new_test();
router.train(&training_data, QUALITY_THRESHOLD);
// Then: Router should choose 2.6B for complex queries
let decision = router.forward(&complex_query_features(), &initial_hidden());
assert_eq!(decision.model, ModelSize::B2_6);
}
#[test]
fn test_router_confidence_correlates_with_seen_patterns() {
// Given: Router trained on specific feature patterns
let mut router = FastGRNNRouter::new_test();
let seen_features = vec![training_features_a(), training_features_b()];
router.train(&samples_from_features(&seen_features), QUALITY_THRESHOLD);
// When: Querying with seen vs unseen patterns
let seen_decision = router.forward(&training_features_a(), &initial_hidden());
let unseen_decision = router.forward(&novel_features(), &initial_hidden());
// Then: Confidence should be higher for seen patterns
assert!(seen_decision.confidence > unseen_decision.confidence);
}
#[test]
fn test_router_ewc_prevents_forgetting() {
// Given: Router trained on task A
let mut router = FastGRNNRouter::new_test();
let mut ewc = ElasticWeightConsolidation::new(0.4);
router.train(&task_a_samples(), QUALITY_THRESHOLD);
let task_a_accuracy_before = router.evaluate(&task_a_samples());
// Compute Fisher and store optimal weights
ewc.compute_fisher(&router, &task_a_samples());
// When: Train on task B with EWC
router.train_with_ewc(&task_b_samples(), &ewc, QUALITY_THRESHOLD);
// Then: Task A accuracy should not significantly degrade
let task_a_accuracy_after = router.evaluate(&task_a_samples());
assert!(task_a_accuracy_after > task_a_accuracy_before - 0.05);
}
}
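The EWC mechanism exercised by test_router_ewc_prevents_forgetting boils down to a quadratic penalty anchored at the task-A weights. Here is a sketch of that penalty assuming a diagonal Fisher estimate; the struct fields are illustrative and not the ElasticWeightConsolidation API used in the tests.

```rust
/// Fisher information and anchor weights stored after task A.
pub struct EwcState {
    pub fisher: Vec<f32>,  // diagonal Fisher estimate per parameter
    pub anchors: Vec<f32>, // parameter values at the end of task A
    pub lambda: f32,       // regularization strength
}

/// EWC penalty: (λ/2) · Σ_i F_i · (θ_i − θ*_i)², added to the task-B
/// loss so that parameters important for task A resist change while
/// low-Fisher parameters remain free to move.
pub fn ewc_penalty(state: &EwcState, params: &[f32]) -> f32 {
    0.5 * state.lambda
        * state
            .fisher
            .iter()
            .zip(state.anchors.iter())
            .zip(params.iter())
            .map(|((f, a), p)| f * (p - a) * (p - a))
            .sum::<f32>()
}
```

The penalty is zero at the task-A optimum and grows only along directions the Fisher estimate marks as important, which is why task-A accuracy degrades slowly under task-B training.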
2.3 Loop C: Compression and Abstraction
How the system avoids bloat:
┌────────────────────────────────────────────────────────────────┐
│ Compression and Abstraction Loop │
├────────────────────────────────────────────────────────────────┤
│ │
│ PERIODICALLY (hourly/daily): │
│ │
│ 1. CLUSTER DETECTION │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ Identify clusters of similar nodes in graph: │ │
│ │ - Dense neighborhoods with similar embeddings │ │
│ │ - Frequently co-retrieved node sets │ │
│ │ - High edge connectivity within cluster │ │
│ └─────────────────────────────────────────────────────────┘ │
│ │
│ 2. LFM2 SUMMARIZATION │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ For each cluster: │ │
│ │ - Feed cluster nodes to LFM2 │ │
│ │ - Generate summary "concept" node │ │
│ │ - Create embedding for concept │ │
│ └─────────────────────────────────────────────────────────┘ │
│ │
│ 3. HIERARCHICAL ATTACHMENT │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ - Concept node becomes parent of cluster members │ │
│ │ - Add "contains" edges from concept to members │ │
│ │ - Future queries see concept first in attention │ │
│ └─────────────────────────────────────────────────────────┘ │
│ │
│ 4. ARCHIVAL │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ - Old, rarely-used fine-grained nodes → cold storage │ │
│ │ - Concept summaries stay in hot tier │ │
│ │ - Preserve graph structure for rehydration │ │
│ └─────────────────────────────────────────────────────────┘ │
│ │
│ RESULT: Hierarchy of concepts, not ever-growing bag of chunks │
│ │
└────────────────────────────────────────────────────────────────┘
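The density criterion in step 1 can be made concrete. A sketch under the assumption that "high edge connectivity" means internal edges divided by possible edges; MIN_CLUSTER_SIZE and MIN_EDGE_DENSITY from the tests below map to the min_size and min_density parameters here.

```rust
/// Edge density of an undirected candidate cluster with `n` nodes and
/// `internal_edges` edges among them: |E| / (n·(n−1)/2).
pub fn cluster_density(n: usize, internal_edges: usize) -> f32 {
    if n < 2 {
        return 0.0; // a single node has no internal connectivity
    }
    let possible = (n * (n - 1) / 2) as f32;
    internal_edges as f32 / possible
}

/// A candidate cluster is kept for summarization when it is both big
/// enough and dense enough.
pub fn is_cluster(n: usize, internal_edges: usize, min_size: usize, min_density: f32) -> bool {
    n >= min_size && cluster_density(n, internal_edges) >= min_density
}
```

A fully connected triple (3 nodes, 3 edges) has density 1.0 and qualifies; a sparse triple (3 nodes, 1 edge) sits at 1/3 and is rejected at any reasonable threshold.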
TDD Tests for Loop C:
#[cfg(test)]
mod compression_tests {
use super::*;
#[test]
fn test_cluster_detection_finds_dense_neighborhoods() {
// Given: Graph with clear clusters
let mut memory = RuvectorMemory::new_test();
// Cluster 1: ML topics (densely connected)
let ml_nodes = vec![
memory.insert_node("Neural networks learn patterns"),
memory.insert_node("Deep learning uses multiple layers"),
memory.insert_node("Backpropagation trains neural nets"),
];
for i in 0..ml_nodes.len() {
for j in i+1..ml_nodes.len() {
memory.insert_edge(&ml_nodes[i], &ml_nodes[j], EdgeType::SameTopic, 0.9);
}
}
// Cluster 2: Cooking topics (densely connected)
let cooking_nodes = vec![
memory.insert_node("Sourdough needs starter"),
memory.insert_node("Bread baking requires patience"),
];
memory.insert_edge(&cooking_nodes[0], &cooking_nodes[1], EdgeType::SameTopic, 0.85);
// When: Run cluster detection
let clusters = memory.detect_clusters(MIN_CLUSTER_SIZE, MIN_EDGE_DENSITY);
// Then: Should find two distinct clusters
assert_eq!(clusters.len(), 2);
assert!(clusters.iter().any(|c| c.nodes.len() == 3)); // ML cluster
assert!(clusters.iter().any(|c| c.nodes.len() == 2)); // Cooking cluster
}
#[test]
fn test_summarization_creates_concept_node() {
// Given: A cluster of related nodes
let nodes = vec![
Node::new("Rust is memory safe"),
Node::new("Rust has zero-cost abstractions"),
Node::new("Rust prevents data races"),
];
let cluster = Cluster {
centroid: compute_centroid(&nodes),
nodes,
};
// When: Generate summary
let summarizer = ClusterSummarizer::new(lfm2_model());
let concept = summarizer.summarize(&cluster);
// Then: Concept should capture key themes
assert!(concept.text.to_lowercase().contains("rust"));
assert!(concept.node_type == NodeType::Concept);
assert!(concept.metadata.contains_key("source_cluster_size"));
}
#[test]
fn test_concept_nodes_are_prioritized_in_retrieval() {
// Given: Memory with concept and detail nodes
let mut memory = RuvectorMemory::new_test();
let concept = memory.insert_node_typed(
"Rust programming overview",
NodeType::Concept
);
let detail = memory.insert_node_typed(
"Rust's borrow checker enforces ownership",
NodeType::Document
);
memory.insert_edge(&concept, &detail, EdgeType::Contains, 1.0);
// When: Query about Rust
let query_embedding = embed("Tell me about Rust");
let results = memory.search_with_concept_boost(&query_embedding, 10);
// Then: Concept should appear before (or with higher weight than) details
let concept_idx = results.iter().position(|r| r.id == concept.id).unwrap();
let detail_idx = results.iter().position(|r| r.id == detail.id).unwrap();
assert!(concept_idx < detail_idx);
}
#[test]
fn test_archival_moves_old_nodes_to_cold_storage() {
// Given: Nodes with different access patterns
let mut memory = RuvectorMemory::new_test();
let hot_node = memory.insert_node_with_access(
"Recently used content",
AccessStats { last_used: now(), use_count: 50 }
);
let cold_node = memory.insert_node_with_access(
"Old unused content",
AccessStats { last_used: now() - Duration::from_secs(90 * 24 * 60 * 60), use_count: 1 } // 90 days
);
// When: Run archival
memory.run_archival(
MAX_AGE_DAYS,
MIN_USE_COUNT,
COLD_STORAGE_PATH
);
// Then: Hot node stays, cold node archived
assert!(memory.contains(&hot_node.id));
assert!(!memory.contains(&cold_node.id));
assert!(cold_storage_contains(&cold_node.id));
}
}
3. Weight-Level Self-Learning (Controlled)
3.1 The Safe Outer Loop
Weight updates happen outside production, in a controlled pipeline:
┌────────────────────────────────────────────────────────────────┐
│ Weight-Level Self-Learning Pipeline │
├────────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ STEP 1: COLLECT TRAINING TRACES (continuous) │ │
│ │ │ │
│ │ From live system, store: │ │
│ │ - (prompt, retrieved_context, final_answer, outcome) │ │
│ │ - Judge scores or human ratings │ │
│ │ - Explicit error cases │ │
│ │ │ │
│ │ Tag by: domain, difficulty, risk_level │ │
│ └─────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ STEP 2: BUILD ROLLING CURRICULUM (nightly/weekly) │ │
│ │ │ │
│ │ Sample recent traces: │ │
│ │ - Up-weight hard or high-value tasks │ │
│ │ - Filter out cases where context was wrong │ │
│ │ │ │
│ │ Create three sets: │ │
│ │ ┌───────────────┐ ┌───────────────┐ ┌───────────────┐ │ │
│ │ │ SFT │ │ Preference │ │ Retrieval │ │ │
│ │ │ (good │ │ Pairs │ │ Correction │ │ │
│ │ │ answers) │ │ (good vs bad) │ │ (context │ │ │
│ │ │ │ │ │ │ selection) │ │ │
│ │ └───────────────┘ └───────────────┘ └───────────────┘ │ │
│ └─────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ STEP 3: TRAIN STUDENT VARIANTS (offline) │ │
│ │ │ │
│ │ Take current best LFM2 checkpoint: │ │
│ │ 1. Run supervised fine-tuning on new traces │ │
│ │ 2. Optionally run preference objective on pairs │ │
│ │ 3. Validate on fixed holdout + public benchmarks │ │
│ │ │ │
│ │ Output: "LFM2-ruv-edition-vN" │ │
│ └─────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ STEP 4: GATED DEPLOYMENT (A/B testing) │ │
│ │ │ │
│ │ ┌─────────────────────────────────────────────────────┐│ │
│ │ │ Production Traffic ││ │
│ │ │ ┌────────────────┐ ┌────────────────┐ ││ │
│ │ │ │ 90% → Current │ │ 10% → Student │ ││ │
│ │ │ │ Model │ │ vN │ ││ │
│ │ │ └────────────────┘ └────────────────┘ ││ │
│ │ └─────────────────────────────────────────────────────┘│ │
│ │ │ │
│ │ Compare: quality, latency, failure_rate │ │
│ │ Promote IFF: student dominates OR ties on key metrics │ │
│ │ │ │
│ │ ⚠️ Never free-write weights in-place │ │
│ │ ⚠️ Always retrain in controlled loop │ │
│ │ ⚠️ Promote only when safe │ │
│ └─────────────────────────────────────────────────────────┘ │
│ │
└────────────────────────────────────────────────────────────────┘
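The promotion rule in step 4 ("promote IFF the student dominates or ties on key metrics") can be sketched as a pure decision function. The tie tolerance and the exact metric set are assumptions; latency is deliberately excluded from the promotion condition here, so a faster-but-worse student never wins.

```rust
#[derive(Debug, PartialEq)]
pub enum Decision {
    Promote,
    KeepControl,
}

/// The A/B metrics compared for each arm. latency_p50 is carried for
/// reporting but does not gate promotion in this sketch.
pub struct Metrics {
    pub quality: f32,
    pub latency_p50: f32,
    pub failure_rate: f32,
}

/// Promote only when the student's quality is no worse than the
/// control's (within a small tie tolerance) AND its failure rate does
/// not regress. Anything else keeps the current model.
pub fn evaluate(control: &Metrics, student: &Metrics, tol: f32) -> Decision {
    let quality_ok = student.quality >= control.quality - tol;
    let failures_ok = student.failure_rate <= control.failure_rate;
    if quality_ok && failures_ok {
        Decision::Promote
    } else {
        Decision::KeepControl
    }
}
```

Note the asymmetry: the gate can only ever keep the control by mistake, never promote a regressing student, which matches the "never free-write weights" stance above.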
TDD Tests for Weight-Level Learning:
#[cfg(test)]
mod weight_learning_tests {
use super::*;
#[test]
fn test_trace_collection_captures_all_components() {
// Given: A completed interaction
let trace_collector = TraceCollector::new_test();
let interaction = Interaction {
prompt: "What is Rust?",
context: vec!["Rust is a systems language"],
response: "Rust is a memory-safe systems programming language",
quality_score: 0.92,
task_outcome: TaskOutcome::Success,
};
// When: Trace is collected
let trace = trace_collector.collect(&interaction);
// Then: All components should be present
assert!(trace.prompt.is_some());
assert!(trace.context.len() > 0);
assert!(trace.response.is_some());
assert!(trace.quality_score.is_some());
assert!(trace.domain_tags.len() > 0);
}
#[test]
fn test_curriculum_upweights_hard_tasks() {
// Given: Mix of easy and hard traces
let traces = vec![
Trace { difficulty: 0.2, quality: 0.95, ..Default::default() }, // Easy, good
Trace { difficulty: 0.9, quality: 0.85, ..Default::default() }, // Hard, good
Trace { difficulty: 0.3, quality: 0.60, ..Default::default() }, // Easy, bad
];
// When: Build curriculum
let curriculum = CurriculumBuilder::new()
.upweight_hard_tasks(true)
.filter_bad_quality(0.7)
.build(&traces);
// Then: Hard successful trace should have higher weight
let hard_weight = curriculum.weight_for(&traces[1]);
let easy_weight = curriculum.weight_for(&traces[0]);
assert!(hard_weight > easy_weight);
// And: Bad quality trace should be filtered
assert!(!curriculum.contains(&traces[2]));
}
#[test]
fn test_preference_pairs_correctly_ordered() {
// Given: Same query with different quality responses
let good_response = Response { text: "Detailed answer...", quality: 0.9 };
let bad_response = Response { text: "I don't know", quality: 0.3 };
let query = "Explain backpropagation";
// When: Create preference pair
let pair = PreferencePair::from_responses(query, &good_response, &bad_response);
// Then: Good should be preferred
assert_eq!(pair.chosen, good_response.text);
assert_eq!(pair.rejected, bad_response.text);
}
#[test]
fn test_student_validation_gates_deployment() {
// Given: Student model that underperforms on holdout
let _student = StudentModel::new_test();
let _holdout = HoldoutDataset::load_test();
let baseline_accuracy = 0.85;
let student_accuracy = 0.78; // Below baseline
// When: Validate for deployment
let validation = ValidationResult::new(student_accuracy, baseline_accuracy);
// Then: Should NOT be approved for deployment
assert!(!validation.approved_for_deployment());
assert!(validation.rejection_reason().contains("accuracy"));
}
#[test]
fn test_ab_test_detects_regression() {
// Given: A/B test results
let ab_results = ABTestResults {
control: ABMetrics { quality: 0.90, latency_p50: 200.0, failure_rate: 0.02 },
treatment: ABMetrics { quality: 0.88, latency_p50: 180.0, failure_rate: 0.05 },
};
// When: Evaluate for promotion
let decision = ABDecision::evaluate(&ab_results, SIGNIFICANCE_THRESHOLD);
// Then: Should NOT promote due to quality regression + higher failure rate
assert_eq!(decision, ABDecision::KeepControl);
assert!(decision.reasons().contains("quality_regression"));
}
}
4. Test-Driven Development Plan
4.1 Testing Pyramid
┌─────────────────┐
│ E2E Tests │ (5%)
│ Full pipeline │
└────────┬────────┘
│
┌─────────────┴─────────────┐
│ Integration Tests │ (20%)
│ Cross-component flows │
└─────────────┬─────────────┘
│
┌────────────────────┴────────────────────┐
│ Unit Tests │ (75%)
│ Individual functions & modules │
└─────────────────────────────────────────┘
4.2 Test Categories by Component
4.2.1 Orchestrator Tests
#[cfg(test)]
mod orchestrator_tests {
#[test]
fn test_request_routing_respects_session() { }
#[test]
fn test_rate_limiting_rejects_excess_requests() { }
#[test]
fn test_cache_hit_bypasses_processing() { }
#[test]
fn test_cache_miss_triggers_full_pipeline() { }
#[test]
fn test_error_handling_returns_graceful_response() { }
#[test]
fn test_metrics_recorded_for_all_requests() { }
}
4.2.2 Embedding Service Tests
#[cfg(test)]
mod embedding_tests {
#[test]
fn test_embedding_dimension_matches_config() { }
#[test]
fn test_similar_texts_have_similar_embeddings() { }
#[test]
fn test_different_texts_have_different_embeddings() { }
#[test]
fn test_long_text_truncation() { }
#[test]
fn test_batch_embedding_matches_individual() { }
#[test]
fn test_empty_string_handling() { }
}
4.2.3 Router Tests
#[cfg(test)]
mod router_tests {
#[test]
fn test_forward_produces_valid_probabilities() { }
#[test]
fn test_hidden_state_updates_across_calls() { }
#[test]
fn test_confidence_threshold_triggers_fallback() { }
#[test]
fn test_gradient_computation() { }
#[test]
fn test_sparse_matrix_operations() { }
#[test]
fn test_low_rank_matrix_approximation() { }
}
4.2.4 Memory Tests
#[cfg(test)]
mod memory_tests {
#[test]
fn test_hnsw_search_returns_k_neighbors() { }
#[test]
fn test_graph_expansion_respects_hop_limit() { }
#[test]
fn test_writeback_queue_batches_correctly() { }
#[test]
fn test_deduplication_prevents_near_duplicates() { }
#[test]
fn test_metadata_filtering() { }
#[test]
fn test_edge_weight_update() { }
}
4.2.5 Attention Tests
#[cfg(test)]
mod attention_tests {
#[test]
fn test_attention_weights_sum_to_one() { }
#[test]
fn test_edge_features_influence_attention() { }
#[test]
fn test_multi_head_concatenation() { }
#[test]
fn test_residual_connection_preserved() { }
#[test]
fn test_layer_norm_normalization() { }
#[test]
fn test_attention_ranking_matches_weights() { }
}
4.2.6 Inference Tests
#[cfg(test)]
mod inference_tests {
#[test]
fn test_model_loading_correct_size() { }
#[test]
fn test_kv_cache_reuse() { }
#[test]
fn test_generation_respects_max_tokens() { }
#[test]
fn test_temperature_affects_randomness() { }
#[test]
fn test_top_p_filtering() { }
#[test]
fn test_model_eviction_under_memory_pressure() { }
}
4.2.7 Learning Tests
#[cfg(test)]
mod learning_tests {
#[test]
fn test_replay_buffer_reservoir_sampling() { }
#[test]
fn test_ewc_regularization_value() { }
#[test]
fn test_fisher_information_computation() { }
#[test]
fn test_quality_judge_score_range() { }
#[test]
fn test_writeback_threshold_filtering() { }
#[test]
fn test_background_training_thread() { }
}
4.3 Integration Test Scenarios
#[cfg(test)]
mod integration_tests {
/// Test full request-response cycle
#[tokio::test]
async fn test_end_to_end_query() {
let system = RuvLLMSystem::new_test().await;
let response = system.process(Request {
query: "What is machine learning?",
session_id: Some("test-session"),
constraints: Default::default(),
}).await.unwrap();
assert!(!response.text.is_empty());
assert!(response.confidence > 0.0);
assert!(!response.sources.is_empty());
}
/// Test multi-turn conversation with context
#[tokio::test]
async fn test_multi_turn_context() {
let system = RuvLLMSystem::new_test().await;
let session = "multi-turn-test";
// Turn 1
let r1 = system.process(Request {
query: "What is Rust?",
session_id: Some(session),
..Default::default()
}).await.unwrap();
// Turn 2 (should use KV cache)
let r2 = system.process(Request {
query: "What are its main features?",
session_id: Some(session),
..Default::default()
}).await.unwrap();
// Response should reference Rust from context
assert!(r2.text.to_lowercase().contains("rust") ||
r2.text.to_lowercase().contains("memory") ||
r2.text.to_lowercase().contains("safety"));
}
/// Test that learning loop updates memory
#[tokio::test]
async fn test_learning_updates_memory() {
let system = RuvLLMSystem::new_test().await;
let initial_node_count = system.memory.node_count();
// Process high-quality interaction
system.process_with_feedback(
Request { query: "Novel question...", ..Default::default() },
Feedback { quality: 0.95, explicit_rating: Some(5) }
).await.unwrap();
// Memory should have grown
let final_node_count = system.memory.node_count();
assert!(final_node_count > initial_node_count);
}
/// Test router learns from experience
#[tokio::test]
async fn test_router_adaptation() {
let mut system = RuvLLMSystem::new_test().await;
// Process many simple queries
for _ in 0..100 {
system.process(Request {
query: "Simple factual question",
..Default::default()
}).await.unwrap();
}
// Trigger training
system.learning_service.train_router().await;
// Router should now prefer smaller models for similar queries
let decision = system.router.forward(
&simple_query_features(),
&initial_hidden()
);
assert!(decision.model == ModelSize::M350 || decision.model == ModelSize::M700);
}
}
5. Benchmarking Suite
5.1 Performance Benchmarks
use criterion::{criterion_group, criterion_main, Criterion, BenchmarkId};
fn embedding_benchmark(c: &mut Criterion) {
let embedder = EmbeddingService::new_test();
let mut group = c.benchmark_group("embedding");
for size in [32, 128, 512, 2048].iter() {
let text = "a".repeat(*size);
group.bench_with_input(
BenchmarkId::new("embed", size),
&text,
|b, t| b.iter(|| embedder.embed(t))
);
}
group.finish();
}
fn hnsw_search_benchmark(c: &mut Criterion) {
let memory = RuvectorMemory::new_with_data(100_000); // 100K vectors
let query = random_vector(384);
let mut group = c.benchmark_group("hnsw_search");
for k in [10, 32, 64].iter() {
for ef in [32, 64, 128].iter() {
group.bench_with_input(
BenchmarkId::new(format!("k={},ef={}", k, ef), ""),
&(k, ef),
|b, (k, ef)| b.iter(|| memory.search(&query, **k, **ef))
);
}
}
group.finish();
}
fn router_forward_benchmark(c: &mut Criterion) {
let router = FastGRNNRouter::new_test();
let features = random_vector(128);
let hidden = random_vector(64);
c.bench_function("router_forward", |b| {
b.iter(|| router.forward(&features, &hidden))
});
}
fn graph_attention_benchmark(c: &mut Criterion) {
let attention = GraphAttentionEngine::new_test();
let query = random_vector(384);
let subgraph = generate_subgraph(50, 100); // 50 nodes, 100 edges
c.bench_function("graph_attention", |b| {
b.iter(|| attention.attend(&query, &subgraph))
});
}
criterion_group!(
benches,
embedding_benchmark,
hnsw_search_benchmark,
router_forward_benchmark,
graph_attention_benchmark
);
criterion_main!(benches);
5.2 Quality Benchmarks
/// Benchmark suite for quality metrics
pub struct QualityBenchmark {
dataset: BenchmarkDataset,
judge: QualityJudge,
}
impl QualityBenchmark {
pub async fn run(&self, system: &RuvLLMSystem) -> QualityResults {
let mut results = QualityResults::default();
for sample in &self.dataset.samples {
let response = system.process(Request {
query: sample.query.clone(),
..Default::default()
}).await.unwrap();
// Judge quality
let quality = self.judge.evaluate(
&sample.query,
&response.text,
&response.sources
).await;
// Check against ground truth if available
if let Some(expected) = &sample.expected_answer {
let f1 = compute_f1(&response.text, expected);
results.f1_scores.push(f1);
}
results.quality_scores.push(quality);
results.latencies.push(response.latency);
}
results
}
}
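compute_f1 above is not defined in this document; a common choice for grading free-text answers against a ground truth (assumed here, SQuAD-style) is token-overlap F1 between prediction and expected answer.

```rust
use std::collections::HashMap;

/// Token-overlap F1: harmonic mean of precision and recall over
/// lowercased whitespace tokens, counting multiplicities.
pub fn compute_f1(prediction: &str, expected: &str) -> f32 {
    let count = |s: &str| {
        let mut m: HashMap<String, usize> = HashMap::new();
        for t in s.split_whitespace() {
            *m.entry(t.to_lowercase()).or_insert(0) += 1;
        }
        m
    };
    let (p, e) = (count(prediction), count(expected));
    // Overlap = sum over tokens of min(count in prediction, count in expected)
    let overlap: usize = p
        .iter()
        .map(|(t, &c)| c.min(*e.get(t).unwrap_or(&0)))
        .sum();
    if overlap == 0 {
        return 0.0;
    }
    let p_len: usize = p.values().sum();
    let e_len: usize = e.values().sum();
    let precision = overlap as f32 / p_len as f32;
    let recall = overlap as f32 / e_len as f32;
    2.0 * precision * recall / (precision + recall)
}
```

An exact match scores 1.0, disjoint answers score 0.0, and partial overlap lands in between, which makes the f1_scores series above a usable regression signal across checkpoint versions.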
6. Iteration Milestones
6.1 Phase 1: Foundation (Weeks 1-2)
| Milestone | Deliverables | Tests |
|---|---|---|
| M1.1 | Embedding service stub | Dimension tests |
| M1.2 | Memory service with HNSW | Search tests |
| M1.3 | Basic orchestrator | Integration smoke tests |
| M1.4 | Mock LFM2 interface | Interface contract tests |
6.2 Phase 2: Core Pipeline (Weeks 3-4)
| Milestone | Deliverables | Tests |
|---|---|---|
| M2.1 | FastGRNN router | Forward pass tests |
| M2.2 | Graph attention engine | Attention computation tests |
| M2.3 | Context builder | Deduplication, truncation tests |
| M2.4 | End-to-end pipeline | Full flow integration tests |
6.3 Phase 3: Learning Loops (Weeks 5-6)
| Milestone | Deliverables | Tests |
|---|---|---|
| M3.1 | Quality judge | Evaluation tests |
| M3.2 | Replay buffer | Sampling distribution tests |
| M3.3 | EWC integration | Forgetting prevention tests |
| M3.4 | Memory writeback | Graph update tests |
6.4 Phase 4: Optimization (Weeks 7-8)
| Milestone | Deliverables | Tests |
|---|---|---|
| M4.1 | Router training loop | Learning convergence tests |
| M4.2 | Compression/abstraction | Cluster detection tests |
| M4.3 | Performance tuning | Benchmark suite |
| M4.4 | Production hardening | Load tests, failure injection |
7. Refinement Checklist
7.1 Per-Component Checklist
[ ] Orchestrator
[ ] Request validation
[ ] Session management
[ ] Rate limiting
[ ] Caching
[ ] Error handling
[ ] Metrics export
[ ] Embedding Service
[ ] LFM2 encoder integration
[ ] Dimension projection
[ ] Batch processing
[ ] Tokenization
[ ] Truncation handling
[ ] FastGRNN Router
[ ] Cell implementation
[ ] Sparse weight matrices
[ ] Low-rank recurrent matrices
[ ] Output heads
[ ] Confidence calibration
[ ] Training loop
[ ] Memory Service
[ ] HNSW configuration
[ ] Graph storage
[ ] Edge operations
[ ] Writeback queue
[ ] Deduplication
[ ] Archival
[ ] Graph Attention
[ ] Multi-head attention
[ ] Edge feature encoding
[ ] Layer stacking
[ ] Residual connections
[ ] Output ranking
[ ] Inference Pool
[ ] Model loading
[ ] Lazy initialization
[ ] KV cache management
[ ] Quantization selection
[ ] LRU eviction
[ ] Learning Service
[ ] Quality evaluation
[ ] Replay buffer
[ ] EWC regularization
[ ] Background training
[ ] Writeback logic
[ ] Compression jobs
7.2 Quality Gates
| Gate | Criteria | Status |
|---|---|---|
| Unit test coverage | >80% | ⬜ |
| Integration tests passing | 100% | ⬜ |
| Latency P50 | <500ms | ⬜ |
| Quality score mean | >0.8 | ⬜ |
| Router accuracy | >90% | ⬜ |
| Memory efficiency | <4GB | ⬜ |
| No memory leaks | 24h stress test | ⬜ |
| Forgetting rate | <5% accuracy loss per 10K interactions | ⬜ |
Document Version: 1.0 | Last Updated: 2025-12-02 | Author: RuvLLM Architecture Team