
RuvLLM: TDD and Iterative Refinement

SPARC Phase 4: Refinement


1. Core Philosophy: Three-Layer Self-Learning

1.1 The Mental Model

"The intelligence is not in one model anymore. It is in the loop."

RuvLLM treats:

  • LFM2 weights as a stable cortex (fixed core reasoning engine)
  • Ruvector as the living synaptic mesh (adapts continuously)
  • FastGRNN as the control circuit (learns when to use what)

This creates a system that genuinely learns from experience without requiring constant model retraining.

1.2 Three Adaptation Timescales

| Timescale   | Mechanism        | What Changes                                           | Frequency      |
|-------------|------------------|--------------------------------------------------------|----------------|
| Short-term  | Memory + Routing | Graph structure, attention patterns, routing decisions | Every request  |
| Medium-term | Compression      | Concept nodes, graph hierarchy, router weights         | Hourly/Daily   |
| Long-term   | Weight tuning    | LFM2 fine-tuned variants                               | Weekly/Monthly |
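One way to make the table concrete is to encode the three timescales as schedulable loops. The enum and intervals below are illustrative assumptions, not an API from the codebase:

```rust
/// Hypothetical encoding of the three adaptation timescales.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum AdaptationLoop {
    /// Short-term: graph structure and routing, applied on every request.
    MemoryAndRouting,
    /// Medium-term: compression, concept nodes, router retraining.
    Compression,
    /// Long-term: offline LFM2 fine-tuned variants.
    WeightTuning,
}

impl AdaptationLoop {
    /// Minimum interval between runs, in seconds (0 = every request).
    pub fn min_interval_secs(self) -> u64 {
        match self {
            AdaptationLoop::MemoryAndRouting => 0,
            AdaptationLoop::Compression => 60 * 60,        // hourly
            AdaptationLoop::WeightTuning => 7 * 24 * 3600, // weekly
        }
    }
}
```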

2. Self-Learning Loop Architecture

2.1 Loop A: Memory Growth and Refinement

What happens on every request:

Request → Response → Outcome
     ↓
┌────────────────────────────────────────────────────────────────┐
│                    Memory Growth Loop                          │
├────────────────────────────────────────────────────────────────┤
│                                                                  │
│  1. WRITE to ruvector:                                          │
│     ┌─────────────────────────────────────────────────────────┐│
│     │  - Question (query embedding + text)                     ││
│     │  - Answer (response embedding + text)                    ││
│     │  - Retrieved documents (context used)                    ││
│     │  - Final outcome (quality score, task success)           ││
│     │  - User feedback if any (explicit signals)               ││
│     └─────────────────────────────────────────────────────────┘│
│                                                                  │
│  2. GRAPH RULES:                                                │
│     ┌─────────────────────────────────────────────────────────┐│
│     │  ✓ Strengthen edges between nodes that co-appear        ││
│     │    in good answers                                       ││
│     │  ✓ Weaken/prune edges rarely used or correlating        ││
│     │    with bad answers                                      ││
│     │  ✓ Update attention weights based on success patterns   ││
│     └─────────────────────────────────────────────────────────┘│
│                                                                  │
│  3. RESULT:                                                     │
│     Same LFM2 checkpoint → Different answers over time         │
│     because the graph, weights, and attention improve          │
│                                                                  │
└────────────────────────────────────────────────────────────────┘
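The graph rules above reduce to a simple update: quality above a baseline strengthens an edge, quality below it weakens the edge. A minimal sketch of that rule; the linear form, `LEARNING_RATE`, and `QUALITY_BASELINE` are illustrative assumptions, not the shipped implementation:

```rust
// Illustrative constants, not tuned values.
const LEARNING_RATE: f32 = 0.1;
const QUALITY_BASELINE: f32 = 0.5;

/// Nudge an edge weight toward the outcome: a quality score above the
/// baseline increases the weight, one below it decreases the weight.
/// Weights stay clamped to [0, 1].
fn updated_edge_weight(weight: f32, quality_score: f32) -> f32 {
    let delta = LEARNING_RATE * (quality_score - QUALITY_BASELINE);
    (weight + delta).clamp(0.0, 1.0)
}
```

This is the behavior the first two Loop A tests pin down: a 0.9-quality outcome pushes a 0.5 edge up, a 0.3-quality outcome pushes it down.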

TDD Tests for Loop A:

#[cfg(test)]
mod memory_growth_tests {
    use super::*;

    #[test]
    fn test_successful_interaction_strengthens_edges() {
        // Given: A memory with two related nodes
        let mut memory = RuvectorMemory::new_test();
        let node_a = memory.insert_node("Machine learning is a subset of AI");
        let node_b = memory.insert_node("Neural networks are ML models");
        memory.insert_edge(&node_a, &node_b, EdgeType::SameTopic, 0.5);

        // When: A successful query uses both nodes
        let outcome = InteractionOutcome {
            quality_score: 0.9,
            used_nodes: vec![node_a.clone(), node_b.clone()],
            task_success: true,
        };
        memory.apply_outcome(&outcome);

        // Then: Edge weight should increase
        let edge = memory.get_edge(&node_a, &node_b).unwrap();
        assert!(edge.weight > 0.5);
    }

    #[test]
    fn test_failed_interaction_weakens_edges() {
        // Given: A memory with edge
        let mut memory = RuvectorMemory::new_test();
        let node_a = memory.insert_node("Topic A");
        let node_b = memory.insert_node("Unrelated B");
        memory.insert_edge(&node_a, &node_b, EdgeType::SameTopic, 0.5);

        // When: Query uses these but fails
        let outcome = InteractionOutcome {
            quality_score: 0.3,
            used_nodes: vec![node_a.clone(), node_b.clone()],
            task_success: false,
        };
        memory.apply_outcome(&outcome);

        // Then: Edge weight should decrease
        let edge = memory.get_edge(&node_a, &node_b).unwrap();
        assert!(edge.weight < 0.5);
    }

    #[test]
    fn test_unused_edges_decay_over_time() {
        // Given: An edge that hasn't been used
        let mut memory = RuvectorMemory::new_test();
        let edge = memory.create_edge_with_last_used(
            "node_a", "node_b",
            0.5,
            Instant::now() - Duration::from_secs(30 * 24 * 60 * 60) // 30 days
        );

        // When: Periodic cleanup runs
        memory.apply_decay(DECAY_RATE, MIN_INTERACTIONS_BEFORE_PRUNE);

        // Then: Edge weight should have decayed
        let updated = memory.get_edge(&edge.src, &edge.dst).unwrap();
        assert!(updated.weight < 0.5);
    }

    #[test]
    fn test_attention_weights_update_from_success_patterns() {
        // Given: Graph attention engine with initial weights
        let mut attention = GraphAttentionEngine::new_test();
        let initial_weights = attention.get_edge_bias_weights();

        // When: Train on successful interaction patterns
        let patterns = vec![
            AttentionPattern {
                edges_used: vec![EdgeType::Cites],
                outcome_quality: 0.95,
            },
            AttentionPattern {
                edges_used: vec![EdgeType::Cites],
                outcome_quality: 0.90,
            },
        ];
        attention.train_on_patterns(&patterns);

        // Then: Edge type "Cites" should have higher attention bias
        let updated_weights = attention.get_edge_bias_weights();
        assert!(updated_weights[EdgeType::Cites] > initial_weights[EdgeType::Cites]);
    }
}

2.2 Loop B: Router Learning

What the router learns:

┌────────────────────────────────────────────────────────────────┐
│                    Router Learning Loop                         │
├────────────────────────────────────────────────────────────────┤
│                                                                  │
│  For each query, LOG:                                           │
│  ┌─────────────────────────────────────────────────────────┐   │
│  │  - Router features (128-dim input vector)                │   │
│  │  - Chosen route (model, context, temp, top_p)            │   │
│  │  - Actual latency and cost                               │   │
│  │  - Quality score (judge model or task outcome)           │   │
│  └─────────────────────────────────────────────────────────┘   │
│                                                                  │
│  Periodically RETRAIN FastGRNN:                                │
│  ┌─────────────────────────────────────────────────────────┐   │
│  │  Objective: Prefer cheaper routes when quality holds     │   │
│  │            Escalate only when necessary                  │   │
│  │                                                           │   │
│  │  Loss = -Quality + λ·Cost + μ·LatencyPenalty             │   │
│  │                                                           │   │
│  │  Constraints:                                             │   │
│  │    - Quality must exceed threshold θ_min                  │   │
│  │    - Latency must meet SLA                                │   │
│  └─────────────────────────────────────────────────────────┘   │
│                                                                  │
│  RESULT: Router becomes self-learning policy over your stack   │
│                                                                  │
└────────────────────────────────────────────────────────────────┘
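The objective in the diagram can be written directly as a loss over candidate routes, with the quality floor enforced as a hard constraint. A minimal sketch, assuming illustrative values for λ, μ, θ_min, and the SLA:

```rust
// Illustrative weights and thresholds, not tuned values.
const LAMBDA: f64 = 10.0;   // λ: cost weight
const MU: f64 = 0.001;      // μ: penalty per ms of latency over the SLA
const THETA_MIN: f64 = 0.8; // θ_min: hard quality floor
const SLA_MS: f64 = 250.0;

/// Loss = -Quality + λ·Cost + μ·LatencyPenalty, subject to Quality ≥ θ_min.
/// Lower is better; routes below the quality floor are never selected.
fn route_loss(quality: f64, cost: f64, latency_ms: f64) -> f64 {
    if quality < THETA_MIN {
        return f64::INFINITY;
    }
    let latency_penalty = (latency_ms - SLA_MS).max(0.0);
    -quality + LAMBDA * cost + MU * latency_penalty
}
```

Under these weights, the cheap 700M route from the first router test (quality 0.92, cost 0.001, 150 ms) scores a lower loss than the marginally better 1.2B route (quality 0.93, cost 0.003, 300 ms), which is exactly the "prefer cheaper routes when quality holds" behavior.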

TDD Tests for Loop B:

#[cfg(test)]
mod router_learning_tests {
    use super::*;

    #[test]
    fn test_router_prefers_smaller_model_when_quality_sufficient() {
        // Given: Training data showing 700M achieves same quality as 1.2B
        let training_data = vec![
            RouterSample {
                features: simple_query_features(),
                model_used: ModelSize::M700,
                quality: 0.92,
                latency_ms: 150.0,
                cost: 0.001,
            },
            RouterSample {
                features: simple_query_features(),
                model_used: ModelSize::B1_2,
                quality: 0.93,  // Only marginally better
                latency_ms: 300.0,
                cost: 0.003,
            },
        ];

        // When: Router is trained
        let mut router = FastGRNNRouter::new_test();
        router.train(&training_data, QUALITY_THRESHOLD);

        // Then: Router should prefer 700M for similar queries
        let decision = router.forward(&simple_query_features(), &initial_hidden());
        assert_eq!(decision.model, ModelSize::M700);
    }

    #[test]
    fn test_router_escalates_for_complex_queries() {
        // Given: Training data showing complex queries need larger models
        let training_data = vec![
            RouterSample {
                features: complex_query_features(),
                model_used: ModelSize::M700,
                quality: 0.45,  // Poor quality
                latency_ms: 150.0,
                cost: 0.001,
            },
            RouterSample {
                features: complex_query_features(),
                model_used: ModelSize::B2_6,
                quality: 0.91,  // Good quality
                latency_ms: 500.0,
                cost: 0.010,
            },
        ];

        // When: Router is trained
        let mut router = FastGRNNRouter::new_test();
        router.train(&training_data, QUALITY_THRESHOLD);

        // Then: Router should choose 2.6B for complex queries
        let decision = router.forward(&complex_query_features(), &initial_hidden());
        assert_eq!(decision.model, ModelSize::B2_6);
    }

    #[test]
    fn test_router_confidence_correlates_with_seen_patterns() {
        // Given: Router trained on specific feature patterns
        let mut router = FastGRNNRouter::new_test();
        let seen_features = vec![training_features_a(), training_features_b()];
        router.train(&samples_from_features(&seen_features), QUALITY_THRESHOLD);

        // When: Querying with seen vs unseen patterns
        let seen_decision = router.forward(&training_features_a(), &initial_hidden());
        let unseen_decision = router.forward(&novel_features(), &initial_hidden());

        // Then: Confidence should be higher for seen patterns
        assert!(seen_decision.confidence > unseen_decision.confidence);
    }

    #[test]
    fn test_router_ewc_prevents_forgetting() {
        // Given: Router trained on task A
        let mut router = FastGRNNRouter::new_test();
        let mut ewc = ElasticWeightConsolidation::new(0.4);
        router.train(&task_a_samples(), QUALITY_THRESHOLD);
        let task_a_accuracy_before = router.evaluate(&task_a_samples());

        // Compute Fisher and store optimal weights
        ewc.compute_fisher(&router, &task_a_samples());

        // When: Train on task B with EWC
        router.train_with_ewc(&task_b_samples(), &ewc, QUALITY_THRESHOLD);

        // Then: Task A accuracy should not significantly degrade
        let task_a_accuracy_after = router.evaluate(&task_a_samples());
        assert!(task_a_accuracy_after > task_a_accuracy_before - 0.05);
    }
}
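The EWC test above assumes a quadratic penalty anchored at the task-A weights: penalty = (λ/2)·Σᵢ Fᵢ·(θᵢ − θ*ᵢ)², where F is the Fisher information and θ* the weights stored after task A. A sketch over hypothetical flat parameter vectors, with λ = 0.4 as in the test:

```rust
/// Elastic Weight Consolidation regularizer (sketch). `fisher`, `theta`,
/// and `theta_star` are assumed to be equal-length flat parameter vectors.
fn ewc_penalty(lambda: f32, fisher: &[f32], theta: &[f32], theta_star: &[f32]) -> f32 {
    let sum: f32 = fisher
        .iter()
        .zip(theta.iter().zip(theta_star.iter()))
        // Each weight's drift from its task-A optimum is penalized in
        // proportion to how important Fisher says that weight was.
        .map(|(f, (t, ts))| f * (t - ts).powi(2))
        .sum();
    0.5 * lambda * sum
}
```

Adding this term to the task-B loss is what keeps task-A accuracy from collapsing: high-Fisher weights are expensive to move, low-Fisher weights remain free to adapt.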

2.3 Loop C: Compression and Abstraction

How the system avoids bloat:

┌────────────────────────────────────────────────────────────────┐
│               Compression and Abstraction Loop                  │
├────────────────────────────────────────────────────────────────┤
│                                                                  │
│  PERIODICALLY (hourly/daily):                                   │
│                                                                  │
│  1. CLUSTER DETECTION                                           │
│  ┌─────────────────────────────────────────────────────────┐   │
│  │  Identify clusters of similar nodes in graph:            │   │
│  │  - Dense neighborhoods with similar embeddings           │   │
│  │  - Frequently co-retrieved node sets                     │   │
│  │  - High edge connectivity within cluster                 │   │
│  └─────────────────────────────────────────────────────────┘   │
│                                                                  │
│  2. LFM2 SUMMARIZATION                                         │
│  ┌─────────────────────────────────────────────────────────┐   │
│  │  For each cluster:                                        │   │
│  │  - Feed cluster nodes to LFM2                            │   │
│  │  - Generate summary "concept" node                        │   │
│  │  - Create embedding for concept                           │   │
│  └─────────────────────────────────────────────────────────┘   │
│                                                                  │
│  3. HIERARCHICAL ATTACHMENT                                    │
│  ┌─────────────────────────────────────────────────────────┐   │
│  │  - Concept node becomes parent of cluster members        │   │
│  │  - Add "contains" edges from concept to members          │   │
│  │  - Future queries see concept first in attention         │   │
│  └─────────────────────────────────────────────────────────┘   │
│                                                                  │
│  4. ARCHIVAL                                                    │
│  ┌─────────────────────────────────────────────────────────┐   │
│  │  - Old, rarely-used fine-grained nodes → cold storage    │   │
│  │  - Concept summaries stay in hot tier                    │   │
│  │  - Preserve graph structure for rehydration              │   │
│  └─────────────────────────────────────────────────────────┘   │
│                                                                  │
│  RESULT: Hierarchy of concepts, not ever-growing bag of chunks │
│                                                                  │
└────────────────────────────────────────────────────────────────┘
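Step 1's density criterion can be made concrete. The sketch below assumes undirected edges and defines density as internal edges over the maximum possible; the tests' `MIN_CLUSTER_SIZE` and `MIN_EDGE_DENSITY` constants would map onto the `min_size` and `min_density` parameters here:

```rust
/// Fraction of possible internal edges that actually exist, for an
/// undirected cluster of `node_count` nodes (0.0 for degenerate clusters).
fn edge_density(node_count: usize, internal_edges: usize) -> f64 {
    if node_count < 2 {
        return 0.0;
    }
    let max_edges = node_count * (node_count - 1) / 2;
    internal_edges as f64 / max_edges as f64
}

/// A candidate qualifies for summarization when it is both large enough
/// and densely connected enough.
fn is_dense_cluster(
    node_count: usize,
    internal_edges: usize,
    min_size: usize,
    min_density: f64,
) -> bool {
    node_count >= min_size && edge_density(node_count, internal_edges) >= min_density
}
```

The fully connected 3-node ML cluster in the detection test has density 1.0 and qualifies; a 4-node set with only 2 internal edges would not.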

TDD Tests for Loop C:

#[cfg(test)]
mod compression_tests {
    use super::*;

    #[test]
    fn test_cluster_detection_finds_dense_neighborhoods() {
        // Given: Graph with clear clusters
        let mut memory = RuvectorMemory::new_test();

        // Cluster 1: ML topics (densely connected)
        let ml_nodes = vec![
            memory.insert_node("Neural networks learn patterns"),
            memory.insert_node("Deep learning uses multiple layers"),
            memory.insert_node("Backpropagation trains neural nets"),
        ];
        for i in 0..ml_nodes.len() {
            for j in i+1..ml_nodes.len() {
                memory.insert_edge(&ml_nodes[i], &ml_nodes[j], EdgeType::SameTopic, 0.9);
            }
        }

        // Cluster 2: Cooking topics (densely connected)
        let cooking_nodes = vec![
            memory.insert_node("Sourdough needs starter"),
            memory.insert_node("Bread baking requires patience"),
        ];
        memory.insert_edge(&cooking_nodes[0], &cooking_nodes[1], EdgeType::SameTopic, 0.85);

        // When: Run cluster detection
        let clusters = memory.detect_clusters(MIN_CLUSTER_SIZE, MIN_EDGE_DENSITY);

        // Then: Should find two distinct clusters
        assert_eq!(clusters.len(), 2);
        assert!(clusters.iter().any(|c| c.nodes.len() == 3)); // ML cluster
        assert!(clusters.iter().any(|c| c.nodes.len() == 2)); // Cooking cluster
    }

    #[test]
    fn test_summarization_creates_concept_node() {
        // Given: A cluster of related nodes
        let nodes = vec![
            Node::new("Rust is memory safe"),
            Node::new("Rust has zero-cost abstractions"),
            Node::new("Rust prevents data races"),
        ];
        let cluster = Cluster {
            centroid: compute_centroid(&nodes),
            nodes,
        };

        // When: Generate summary
        let summarizer = ClusterSummarizer::new(lfm2_model());
        let concept = summarizer.summarize(&cluster);

        // Then: Concept should capture key themes
        assert!(concept.text.to_lowercase().contains("rust"));
        assert!(concept.node_type == NodeType::Concept);
        assert!(concept.metadata.contains_key("source_cluster_size"));
    }

    #[test]
    fn test_concept_nodes_are_prioritized_in_retrieval() {
        // Given: Memory with concept and detail nodes
        let mut memory = RuvectorMemory::new_test();
        let concept = memory.insert_node_typed(
            "Rust programming overview",
            NodeType::Concept
        );
        let detail = memory.insert_node_typed(
            "Rust's borrow checker enforces ownership",
            NodeType::Document
        );
        memory.insert_edge(&concept, &detail, EdgeType::Contains, 1.0);

        // When: Query about Rust
        let query_embedding = embed("Tell me about Rust");
        let results = memory.search_with_concept_boost(&query_embedding, 10);

        // Then: Concept should appear before (or with higher weight than) details
        let concept_idx = results.iter().position(|r| r.id == concept.id).unwrap();
        let detail_idx = results.iter().position(|r| r.id == detail.id).unwrap();
        assert!(concept_idx < detail_idx);
    }

    #[test]
    fn test_archival_moves_old_nodes_to_cold_storage() {
        // Given: Nodes with different access patterns
        let mut memory = RuvectorMemory::new_test();
        let hot_node = memory.insert_node_with_access(
            "Recently used content",
            AccessStats { last_used: now(), use_count: 50 }
        );
        let cold_node = memory.insert_node_with_access(
            "Old unused content",
            AccessStats { last_used: now() - Duration::from_secs(90 * 24 * 60 * 60), use_count: 1 } // 90 days
        );

        // When: Run archival
        memory.run_archival(
            MAX_AGE_DAYS,
            MIN_USE_COUNT,
            COLD_STORAGE_PATH
        );

        // Then: Hot node stays, cold node archived
        assert!(memory.contains(&hot_node.id));
        assert!(!memory.contains(&cold_node.id));
        assert!(cold_storage_contains(&cold_node.id));
    }
}

3. Weight-Level Self-Learning (Controlled)

3.1 The Safe Outer Loop

Weight updates happen outside production, in a controlled pipeline:

┌────────────────────────────────────────────────────────────────┐
│           Weight-Level Self-Learning Pipeline                   │
├────────────────────────────────────────────────────────────────┤
│                                                                  │
│  ┌─────────────────────────────────────────────────────────┐   │
│  │  STEP 1: COLLECT TRAINING TRACES (continuous)           │   │
│  │                                                           │   │
│  │  From live system, store:                                 │   │
│  │  - (prompt, retrieved_context, final_answer, outcome)    │   │
│  │  - Judge scores or human ratings                         │   │
│  │  - Explicit error cases                                   │   │
│  │                                                           │   │
│  │  Tag by: domain, difficulty, risk_level                  │   │
│  └─────────────────────────────────────────────────────────┘   │
│                           │                                     │
│                           ▼                                     │
│  ┌─────────────────────────────────────────────────────────┐   │
│  │  STEP 2: BUILD ROLLING CURRICULUM (nightly/weekly)      │   │
│  │                                                           │   │
│  │  Sample recent traces:                                    │   │
│  │  - Up-weight hard or high-value tasks                    │   │
│  │  - Filter out cases where context was wrong              │   │
│  │                                                           │   │
│  │  Create three sets:                                       │   │
│  │  ┌───────────────┐ ┌───────────────┐ ┌───────────────┐  │   │
│  │  │      SFT      │ │  Preference   │ │   Retrieval   │  │   │
│  │  │    (good      │ │    Pairs      │ │  Correction   │  │   │
│  │  │   answers)    │ │ (good vs bad) │ │    (context   │  │   │
│  │  │               │ │               │ │   selection)  │  │   │
│  │  └───────────────┘ └───────────────┘ └───────────────┘  │   │
│  └─────────────────────────────────────────────────────────┘   │
│                           │                                     │
│                           ▼                                     │
│  ┌─────────────────────────────────────────────────────────┐   │
│  │  STEP 3: TRAIN STUDENT VARIANTS (offline)               │   │
│  │                                                           │   │
│  │  Take current best LFM2 checkpoint:                      │   │
│  │  1. Run supervised fine-tuning on new traces             │   │
│  │  2. Optionally run preference objective on pairs         │   │
│  │  3. Validate on fixed holdout + public benchmarks        │   │
│  │                                                           │   │
│  │  Output: "LFM2-ruv-edition-vN"                           │   │
│  └─────────────────────────────────────────────────────────┘   │
│                           │                                     │
│                           ▼                                     │
│  ┌─────────────────────────────────────────────────────────┐   │
│  │  STEP 4: GATED DEPLOYMENT (A/B testing)                 │   │
│  │                                                           │   │
│  │  ┌─────────────────────────────────────────────────────┐│   │
│  │  │  Production Traffic                                  ││   │
│  │  │  ┌────────────────┐  ┌────────────────┐            ││   │
│  │  │  │  90% → Current │  │  10% → Student │            ││   │
│  │  │  │     Model      │  │     vN         │            ││   │
│  │  │  └────────────────┘  └────────────────┘            ││   │
│  │  └─────────────────────────────────────────────────────┘│   │
│  │                                                           │   │
│  │  Compare: quality, latency, failure_rate                 │   │
│  │  Promote IFF: student dominates OR ties on key metrics   │   │
│  │                                                           │   │
│  │  ⚠️  Never free-write weights in-place                  │   │
│  │  ⚠️  Always retrain in controlled loop                   │   │
│  │  ⚠️  Promote only when safe                              │   │
│  └─────────────────────────────────────────────────────────┘   │
│                                                                  │
└────────────────────────────────────────────────────────────────┘
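The step-4 promotion rule ("promote IFF the student dominates or ties on key metrics") can be sketched as a pure decision function. The 0.005 quality tolerance below is an illustrative assumption; the metric names mirror the A/B test structs used in the tests:

```rust
/// Per-arm A/B metrics (sketch).
struct Metrics {
    quality: f64,
    latency_p50: f64,
    failure_rate: f64,
}

/// Promote the student only if it does not regress on quality or failure
/// rate, and strictly improves on at least one key metric.
fn promote_student(control: &Metrics, student: &Metrics) -> bool {
    let quality_ok = student.quality >= control.quality - 0.005;
    let failures_ok = student.failure_rate <= control.failure_rate;
    let improves = student.quality > control.quality
        || student.latency_p50 < control.latency_p50
        || student.failure_rate < control.failure_rate;
    quality_ok && failures_ok && improves
}
```

The regression case from `test_ab_test_detects_regression` (quality 0.90 → 0.88, failure rate 0.02 → 0.05) fails both gates and keeps the control model, regardless of its latency win.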

TDD Tests for Weight-Level Learning:

#[cfg(test)]
mod weight_learning_tests {
    use super::*;

    #[test]
    fn test_trace_collection_captures_all_components() {
        // Given: A completed interaction
        let trace_collector = TraceCollector::new_test();
        let interaction = Interaction {
            prompt: "What is Rust?",
            context: vec!["Rust is a systems language"],
            response: "Rust is a memory-safe systems programming language",
            quality_score: 0.92,
            task_outcome: TaskOutcome::Success,
        };

        // When: Trace is collected
        let trace = trace_collector.collect(&interaction);

        // Then: All components should be present
        assert!(trace.prompt.is_some());
        assert!(trace.context.len() > 0);
        assert!(trace.response.is_some());
        assert!(trace.quality_score.is_some());
        assert!(trace.domain_tags.len() > 0);
    }

    #[test]
    fn test_curriculum_upweights_hard_tasks() {
        // Given: Mix of easy and hard traces
        let traces = vec![
            Trace { difficulty: 0.2, quality: 0.95, ..Default::default() },  // Easy, good
            Trace { difficulty: 0.9, quality: 0.85, ..Default::default() },  // Hard, good
            Trace { difficulty: 0.3, quality: 0.60, ..Default::default() },  // Easy, bad
        ];

        // When: Build curriculum
        let curriculum = CurriculumBuilder::new()
            .upweight_hard_tasks(true)
            .filter_bad_quality(0.7)
            .build(&traces);

        // Then: Hard successful trace should have higher weight
        let hard_weight = curriculum.weight_for(&traces[1]);
        let easy_weight = curriculum.weight_for(&traces[0]);
        assert!(hard_weight > easy_weight);

        // And: Bad quality trace should be filtered
        assert!(!curriculum.contains(&traces[2]));
    }

    #[test]
    fn test_preference_pairs_correctly_ordered() {
        // Given: Same query with different quality responses
        let good_response = Response { text: "Detailed answer...", quality: 0.9 };
        let bad_response = Response { text: "I don't know", quality: 0.3 };
        let query = "Explain backpropagation";

        // When: Create preference pair
        let pair = PreferencePair::from_responses(query, &good_response, &bad_response);

        // Then: Good should be preferred
        assert_eq!(pair.chosen, good_response.text);
        assert_eq!(pair.rejected, bad_response.text);
    }

    #[test]
    fn test_student_validation_gates_deployment() {
        // Given: Student model that underperforms on holdout
        let student = StudentModel::new_test();
        let holdout = HoldoutDataset::load_test();
        let baseline_accuracy = 0.85;
        let student_accuracy = 0.78;  // Below baseline

        // When: Validate for deployment
        let validation = ValidationResult::new(student_accuracy, baseline_accuracy);

        // Then: Should NOT be approved for deployment
        assert!(!validation.approved_for_deployment());
        assert!(validation.rejection_reason().contains("accuracy"));
    }

    #[test]
    fn test_ab_test_detects_regression() {
        // Given: A/B test results
        let ab_results = ABTestResults {
            control: ABMetrics { quality: 0.90, latency_p50: 200.0, failure_rate: 0.02 },
            treatment: ABMetrics { quality: 0.88, latency_p50: 180.0, failure_rate: 0.05 },
        };

        // When: Evaluate for promotion
        let decision = ABDecision::evaluate(&ab_results, SIGNIFICANCE_THRESHOLD);

        // Then: Should NOT promote due to quality regression + higher failure rate
        assert_eq!(decision, ABDecision::KeepControl);
        assert!(decision.reasons().contains("quality_regression"));
    }
}

4. Test-Driven Development Plan

4.1 Testing Pyramid

                    ┌─────────────────┐
                    │    E2E Tests    │  (5%)
                    │  Full pipeline  │
                    └────────┬────────┘
                             │
               ┌─────────────┴─────────────┐
               │    Integration Tests      │  (20%)
               │  Cross-component flows    │
               └─────────────┬─────────────┘
                             │
        ┌────────────────────┴────────────────────┐
        │           Unit Tests                    │  (75%)
        │  Individual functions & modules          │
        └─────────────────────────────────────────┘

4.2 Test Categories by Component

4.2.1 Orchestrator Tests

#[cfg(test)]
mod orchestrator_tests {
    #[test]
    fn test_request_routing_respects_session() { }

    #[test]
    fn test_rate_limiting_rejects_excess_requests() { }

    #[test]
    fn test_cache_hit_bypasses_processing() { }

    #[test]
    fn test_cache_miss_triggers_full_pipeline() { }

    #[test]
    fn test_error_handling_returns_graceful_response() { }

    #[test]
    fn test_metrics_recorded_for_all_requests() { }
}

4.2.2 Embedding Service Tests

#[cfg(test)]
mod embedding_tests {
    #[test]
    fn test_embedding_dimension_matches_config() { }

    #[test]
    fn test_similar_texts_have_similar_embeddings() { }

    #[test]
    fn test_different_texts_have_different_embeddings() { }

    #[test]
    fn test_long_text_truncation() { }

    #[test]
    fn test_batch_embedding_matches_individual() { }

    #[test]
    fn test_empty_string_handling() { }
}

4.2.3 Router Tests

#[cfg(test)]
mod router_tests {
    #[test]
    fn test_forward_produces_valid_probabilities() { }

    #[test]
    fn test_hidden_state_updates_across_calls() { }

    #[test]
    fn test_confidence_threshold_triggers_fallback() { }

    #[test]
    fn test_gradient_computation() { }

    #[test]
    fn test_sparse_matrix_operations() { }

    #[test]
    fn test_low_rank_matrix_approximation() { }
}

4.2.4 Memory Tests

#[cfg(test)]
mod memory_tests {
    #[test]
    fn test_hnsw_search_returns_k_neighbors() { }

    #[test]
    fn test_graph_expansion_respects_hop_limit() { }

    #[test]
    fn test_writeback_queue_batches_correctly() { }

    #[test]
    fn test_deduplication_prevents_near_duplicates() { }

    #[test]
    fn test_metadata_filtering() { }

    #[test]
    fn test_edge_weight_update() { }
}

4.2.5 Attention Tests

#[cfg(test)]
mod attention_tests {
    #[test]
    fn test_attention_weights_sum_to_one() { }

    #[test]
    fn test_edge_features_influence_attention() { }

    #[test]
    fn test_multi_head_concatenation() { }

    #[test]
    fn test_residual_connection_preserved() { }

    #[test]
    fn test_layer_norm_normalization() { }

    #[test]
    fn test_attention_ranking_matches_weights() { }
}

4.2.6 Inference Tests

#[cfg(test)]
mod inference_tests {
    #[test]
    fn test_model_loading_correct_size() { }

    #[test]
    fn test_kv_cache_reuse() { }

    #[test]
    fn test_generation_respects_max_tokens() { }

    #[test]
    fn test_temperature_affects_randomness() { }

    #[test]
    fn test_top_p_filtering() { }

    #[test]
    fn test_model_eviction_under_memory_pressure() { }
}

4.2.7 Learning Tests

#[cfg(test)]
mod learning_tests {
    #[test]
    fn test_replay_buffer_reservoir_sampling() { }

    #[test]
    fn test_ewc_regularization_value() { }

    #[test]
    fn test_fisher_information_computation() { }

    #[test]
    fn test_quality_judge_score_range() { }

    #[test]
    fn test_writeback_threshold_filtering() { }

    #[test]
    fn test_background_training_thread() { }
}

### 4.3 Integration Test Scenarios

```rust
#[cfg(test)]
mod integration_tests {
    /// Test full request-response cycle
    #[tokio::test]
    async fn test_end_to_end_query() {
        let system = RuvLLMSystem::new_test().await;

        let response = system.process(Request {
            query: "What is machine learning?",
            session_id: Some("test-session"),
            constraints: Default::default(),
        }).await.unwrap();

        assert!(!response.text.is_empty());
        assert!(response.confidence > 0.0);
        assert!(!response.sources.is_empty());
    }

    /// Test multi-turn conversation with context
    #[tokio::test]
    async fn test_multi_turn_context() {
        let system = RuvLLMSystem::new_test().await;
        let session = "multi-turn-test";

        // Turn 1
        let _r1 = system.process(Request {
            query: "What is Rust?",
            session_id: Some(session),
            ..Default::default()
        }).await.unwrap();

        // Turn 2 (should reuse the session KV cache)
        let r2 = system.process(Request {
            query: "What are its main features?",
            session_id: Some(session),
            ..Default::default()
        }).await.unwrap();

        // Response should reference Rust from context
        assert!(r2.text.to_lowercase().contains("rust") ||
                r2.text.to_lowercase().contains("memory") ||
                r2.text.to_lowercase().contains("safety"));
    }

    /// Test that learning loop updates memory
    #[tokio::test]
    async fn test_learning_updates_memory() {
        let system = RuvLLMSystem::new_test().await;
        let initial_node_count = system.memory.node_count();

        // Process a high-quality interaction
        let _response = system.process_with_feedback(
            Request { query: "Novel question...", ..Default::default() },
            Feedback { quality: 0.95, explicit_rating: Some(5) }
        ).await.unwrap();

        // Memory should have grown
        let final_node_count = system.memory.node_count();
        assert!(final_node_count > initial_node_count);
    }

    /// Test router learns from experience
    #[tokio::test]
    async fn test_router_adaptation() {
        let mut system = RuvLLMSystem::new_test().await;

        // Process many simple queries
        for _ in 0..100 {
            system.process(Request {
                query: "Simple factual question",
                ..Default::default()
            }).await.unwrap();
        }

        // Trigger training
        system.learning_service.train_router().await;

        // Router should now prefer smaller models for similar queries
        let decision = system.router.forward(
            &simple_query_features(),
            &initial_hidden()
        );
        assert!(decision.model == ModelSize::M350 || decision.model == ModelSize::M700);
    }
}
```

## 5. Benchmarking Suite

### 5.1 Performance Benchmarks

```rust
use criterion::{criterion_group, criterion_main, Criterion, BenchmarkId};

fn embedding_benchmark(c: &mut Criterion) {
    let embedder = EmbeddingService::new_test();

    let mut group = c.benchmark_group("embedding");

    for size in [32, 128, 512, 2048].iter() {
        let text = "a".repeat(*size);
        group.bench_with_input(
            BenchmarkId::new("embed", size),
            &text,
            |b, t| b.iter(|| embedder.embed(t))
        );
    }

    group.finish();
}

fn hnsw_search_benchmark(c: &mut Criterion) {
    let memory = RuvectorMemory::new_with_data(100_000);  // 100K vectors
    let query = random_vector(384);

    let mut group = c.benchmark_group("hnsw_search");

    for k in [10, 32, 64].iter() {
        for ef in [32, 64, 128].iter() {
            group.bench_with_input(
                BenchmarkId::from_parameter(format!("k={k},ef={ef}")),
                &(k, ef),
                |b, (k, ef)| b.iter(|| memory.search(&query, **k, **ef))
            );
        }
    }

    group.finish();
}

fn router_forward_benchmark(c: &mut Criterion) {
    let router = FastGRNNRouter::new_test();
    let features = random_vector(128);
    let hidden = random_vector(64);

    c.bench_function("router_forward", |b| {
        b.iter(|| router.forward(&features, &hidden))
    });
}

fn graph_attention_benchmark(c: &mut Criterion) {
    let attention = GraphAttentionEngine::new_test();
    let query = random_vector(384);
    let subgraph = generate_subgraph(50, 100);  // 50 nodes, 100 edges

    c.bench_function("graph_attention", |b| {
        b.iter(|| attention.attend(&query, &subgraph))
    });
}

criterion_group!(
    benches,
    embedding_benchmark,
    hnsw_search_benchmark,
    router_forward_benchmark,
    graph_attention_benchmark
);
criterion_main!(benches);
```
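For `criterion_group!`/`criterion_main!` to run, the bench target must opt out of Cargo's default libtest harness. A plausible `Cargo.toml` fragment (the file name `ruvllm_benches` and the Criterion version are assumptions, not taken from this spec):

```toml
[dev-dependencies]
criterion = "0.5"

[[bench]]
name = "ruvllm_benches"   # benches/ruvllm_benches.rs holds the groups above
harness = false           # let Criterion provide main() instead of libtest
```

With this in place, `cargo bench` runs the full suite and `cargo bench -- hnsw_search` runs a single group.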

### 5.2 Quality Benchmarks

```rust
/// Benchmark suite for quality metrics
pub struct QualityBenchmark {
    dataset: BenchmarkDataset,
    judge: QualityJudge,
}

impl QualityBenchmark {
    pub async fn run(&self, system: &RuvLLMSystem) -> QualityResults {
        let mut results = QualityResults::default();

        for sample in &self.dataset.samples {
            let response = system.process(Request {
                query: sample.query.clone(),
                ..Default::default()
            }).await.unwrap();

            // Judge quality
            let quality = self.judge.evaluate(
                &sample.query,
                &response.text,
                &response.sources
            ).await;

            // Check against ground truth if available
            if let Some(expected) = &sample.expected_answer {
                let f1 = compute_f1(&response.text, expected);
                results.f1_scores.push(f1);
            }

            results.quality_scores.push(quality);
            results.latencies.push(response.latency);
        }

        results
    }
}
```
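The benchmark above calls a `compute_f1` helper that the spec does not define. One common choice is SQuAD-style token-level F1 (bag-of-words overlap between prediction and reference); a self-contained sketch, where the tokenization rules are an assumption:

```rust
use std::collections::HashMap;

/// Token-level F1: precision/recall over the multiset of tokens shared
/// by the predicted and reference answers (illustrative sketch).
fn compute_f1(prediction: &str, expected: &str) -> f64 {
    let tokens = |s: &str| -> Vec<String> {
        s.to_lowercase()
            .split_whitespace()
            .map(|t| t.trim_matches(|c: char| !c.is_alphanumeric()).to_string())
            .filter(|t| !t.is_empty())
            .collect()
    };
    let pred = tokens(prediction);
    let gold = tokens(expected);
    if pred.is_empty() || gold.is_empty() {
        return if pred == gold { 1.0 } else { 0.0 };
    }
    // Multiset intersection of prediction and reference tokens.
    let mut counts: HashMap<&str, usize> = HashMap::new();
    for t in &gold {
        *counts.entry(t.as_str()).or_insert(0) += 1;
    }
    let mut overlap = 0usize;
    for t in &pred {
        if let Some(c) = counts.get_mut(t.as_str()) {
            if *c > 0 {
                *c -= 1;
                overlap += 1;
            }
        }
    }
    if overlap == 0 {
        return 0.0;
    }
    let precision = overlap as f64 / pred.len() as f64;
    let recall = overlap as f64 / gold.len() as f64;
    2.0 * precision * recall / (precision + recall)
}

fn main() {
    assert!((compute_f1("rust is fast", "rust is fast") - 1.0).abs() < 1e-9);
    assert_eq!(compute_f1("python", "rust is fast"), 0.0);
    // 2 of 3 tokens overlap: precision = recall = 2/3, so F1 = 2/3.
    assert!((compute_f1("rust is slow", "rust is fast") - 2.0 / 3.0).abs() < 1e-9);
}
```

Exact-match or semantic-similarity scoring could be layered on top, but token F1 is a reasonable floor for the `f1_scores` column.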

## 6. Iteration Milestones

### 6.1 Phase 1: Foundation (Weeks 1-2)

| Milestone | Deliverables | Tests |
|-----------|--------------|-------|
| M1.1 | Embedding service stub | Dimension tests |
| M1.2 | Memory service with HNSW | Search tests |
| M1.3 | Basic orchestrator | Integration smoke tests |
| M1.4 | Mock LFM2 interface | Interface contract tests |

### 6.2 Phase 2: Core Pipeline (Weeks 3-4)

| Milestone | Deliverables | Tests |
|-----------|--------------|-------|
| M2.1 | FastGRNN router | Forward pass tests |
| M2.2 | Graph attention engine | Attention computation tests |
| M2.3 | Context builder | Deduplication, truncation tests |
| M2.4 | End-to-end pipeline | Full flow integration tests |

### 6.3 Phase 3: Learning Loops (Weeks 5-6)

| Milestone | Deliverables | Tests |
|-----------|--------------|-------|
| M3.1 | Quality judge | Evaluation tests |
| M3.2 | Replay buffer | Sampling distribution tests |
| M3.3 | EWC integration | Forgetting prevention tests |
| M3.4 | Memory writeback | Graph update tests |

### 6.4 Phase 4: Optimization (Weeks 7-8)

| Milestone | Deliverables | Tests |
|-----------|--------------|-------|
| M4.1 | Router training loop | Learning convergence tests |
| M4.2 | Compression/abstraction | Cluster detection tests |
| M4.3 | Performance tuning | Benchmark suite |
| M4.4 | Production hardening | Load tests, failure injection |

## 7. Refinement Checklist

### 7.1 Per-Component Checklist

- [ ] **Orchestrator**
  - [ ] Request validation
  - [ ] Session management
  - [ ] Rate limiting
  - [ ] Caching
  - [ ] Error handling
  - [ ] Metrics export
- [ ] **Embedding Service**
  - [ ] LFM2 encoder integration
  - [ ] Dimension projection
  - [ ] Batch processing
  - [ ] Tokenization
  - [ ] Truncation handling
- [ ] **FastGRNN Router**
  - [ ] Cell implementation
  - [ ] Sparse weight matrices
  - [ ] Low-rank recurrent matrices
  - [ ] Output heads
  - [ ] Confidence calibration
  - [ ] Training loop
- [ ] **Memory Service**
  - [ ] HNSW configuration
  - [ ] Graph storage
  - [ ] Edge operations
  - [ ] Writeback queue
  - [ ] Deduplication
  - [ ] Archival
- [ ] **Graph Attention**
  - [ ] Multi-head attention
  - [ ] Edge feature encoding
  - [ ] Layer stacking
  - [ ] Residual connections
  - [ ] Output ranking
- [ ] **Inference Pool**
  - [ ] Model loading
  - [ ] Lazy initialization
  - [ ] KV cache management
  - [ ] Quantization selection
  - [ ] LRU eviction
- [ ] **Learning Service**
  - [ ] Quality evaluation
  - [ ] Replay buffer
  - [ ] EWC regularization
  - [ ] Background training
  - [ ] Writeback logic
  - [ ] Compression jobs
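The EWC items in the checklist refer to elastic weight consolidation. For reference, the standard regularized objective is given below; the notation is the generic EWC formulation (diagonal Fisher information F_i, post-task parameters θ*, penalty strength λ), not taken from this spec:

```latex
\mathcal{L}_{\text{total}}(\theta)
  = \mathcal{L}_{\text{task}}(\theta)
  + \frac{\lambda}{2} \sum_i F_i \left( \theta_i - \theta^{*}_i \right)^{2}
```

The `test_ewc_regularization_value` and `test_fisher_information_computation` stubs in Section 4.2.7 would pin down the penalty term and the F_i estimate respectively.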

### 7.2 Quality Gates

| Gate | Criteria | Status |
|------|----------|--------|
| Unit test coverage | >80% | |
| Integration tests passing | 100% | |
| Latency P50 | <500ms | |
| Quality score mean | >0.8 | |
| Router accuracy | >90% | |
| Memory efficiency | <4GB | |
| No memory leaks | 24h stress test | |
| Forgetting rate | <5% per 10K interactions | |

---

**Document Version:** 1.0 | **Last Updated:** 2025-12-02 | **Author:** RuvLLM Architecture Team