# RuvLLM: TDD and Iterative Refinement

## SPARC Phase 4: Refinement

---

## 1. Core Philosophy: Three-Layer Self-Learning

### 1.1 The Mental Model

> **"The intelligence is not in one model anymore. It is in the loop."**

RuvLLM treats:

- **LFM2 weights** as a **stable cortex** (fixed core reasoning engine)
- **Ruvector** as the **living synaptic mesh** (adapts continuously)
- **FastGRNN** as the **control circuit** (learns when to use what)

This creates a system that genuinely learns from experience without requiring constant model retraining.

### 1.2 Three Adaptation Timescales

| Timescale | Mechanism | What Changes | Frequency |
|-----------|-----------|--------------|-----------|
| **Short-term** | Memory + Routing | Graph structure, attention patterns, routing decisions | Every request |
| **Medium-term** | Compression | Concept nodes, graph hierarchy, router weights | Hourly/Daily |
| **Long-term** | Weight tuning | LFM2 fine-tuned variants | Weekly/Monthly |

---

## 2. Self-Learning Loop Architecture

### 2.1 Loop A: Memory Growth and Refinement

**What happens on every request:**

```
Request → Response → Outcome
              ↓
┌────────────────────────────────────────────────────────────────┐
│                     Memory Growth Loop                         │
├────────────────────────────────────────────────────────────────┤
│                                                                │
│  1. WRITE to ruvector:                                         │
│     ┌─────────────────────────────────────────────────────────┐│
│     │ - Question (query embedding + text)                     ││
│     │ - Answer (response embedding + text)                    ││
│     │ - Retrieved documents (context used)                    ││
│     │ - Final outcome (quality score, task success)           ││
│     │ - User feedback if any (explicit signals)               ││
│     └─────────────────────────────────────────────────────────┘│
│                                                                │
│  2. GRAPH RULES:                                               │
│     ┌─────────────────────────────────────────────────────────┐│
│     │ ✓ Strengthen edges between nodes that co-appear         ││
│     │   in good answers                                       ││
│     │ ✓ Weaken/prune edges rarely used or correlating         ││
│     │   with bad answers                                      ││
│     │ ✓ Update attention weights based on success patterns    ││
│     └─────────────────────────────────────────────────────────┘│
│                                                                │
│  3. RESULT:                                                    │
│     Same LFM2 checkpoint → Different answers over time         │
│     because the graph, weights, and attention improve          │
│                                                                │
└────────────────────────────────────────────────────────────────┘
```
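
The edge-strengthening and weakening rules above can be sketched as a single update function. This is a minimal illustration, not the shipped rule: it assumes a hypothetical `updated_weight` helper that nudges an edge weight toward the interaction's quality score and clamps it to `[0, 1]`.

```rust
/// Illustrative edge-weight update (an assumption, not the production rule):
/// move the weight toward the outcome's quality score. A quality above the
/// current weight strengthens the edge; a quality below it weakens the edge.
/// `lr` controls how fast the synaptic mesh adapts.
pub fn updated_weight(weight: f64, quality: f64, lr: f64) -> f64 {
    (weight + lr * (quality - weight)).clamp(0.0, 1.0)
}
```

With this form, a 0.9-quality outcome pushes a 0.5 edge up (to 0.54 at `lr = 0.1`), while a 0.3-quality outcome pulls it down, matching the behavior the Loop A tests below assert.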

**TDD Tests for Loop A:**

```rust
#[cfg(test)]
mod memory_growth_tests {
    use super::*;

    #[test]
    fn test_successful_interaction_strengthens_edges() {
        // Given: A memory with two related nodes
        let mut memory = RuvectorMemory::new_test();
        let node_a = memory.insert_node("Machine learning is a subset of AI");
        let node_b = memory.insert_node("Neural networks are ML models");
        memory.insert_edge(&node_a, &node_b, EdgeType::SameTopic, 0.5);

        // When: A successful query uses both nodes
        let outcome = InteractionOutcome {
            quality_score: 0.9,
            used_nodes: vec![node_a.clone(), node_b.clone()],
            task_success: true,
        };
        memory.apply_outcome(&outcome);

        // Then: Edge weight should increase
        let edge = memory.get_edge(&node_a, &node_b).unwrap();
        assert!(edge.weight > 0.5);
    }

    #[test]
    fn test_failed_interaction_weakens_edges() {
        // Given: A memory with an edge
        let mut memory = RuvectorMemory::new_test();
        let node_a = memory.insert_node("Topic A");
        let node_b = memory.insert_node("Unrelated B");
        memory.insert_edge(&node_a, &node_b, EdgeType::SameTopic, 0.5);

        // When: A query uses these nodes but fails
        let outcome = InteractionOutcome {
            quality_score: 0.3,
            used_nodes: vec![node_a.clone(), node_b.clone()],
            task_success: false,
        };
        memory.apply_outcome(&outcome);

        // Then: Edge weight should decrease
        let edge = memory.get_edge(&node_a, &node_b).unwrap();
        assert!(edge.weight < 0.5);
    }

    #[test]
    fn test_unused_edges_decay_over_time() {
        // Given: An edge that hasn't been used for 30 days
        let mut memory = RuvectorMemory::new_test();
        let edge = memory.create_edge_with_last_used(
            "node_a", "node_b",
            0.5,
            Instant::now() - Duration::from_secs(30 * 24 * 60 * 60),
        );

        // When: Periodic cleanup runs
        memory.apply_decay(DECAY_RATE, MIN_INTERACTIONS_BEFORE_PRUNE);

        // Then: Edge weight should have decayed
        let updated = memory.get_edge(&edge.src, &edge.dst).unwrap();
        assert!(updated.weight < 0.5);
    }

    #[test]
    fn test_attention_weights_update_from_success_patterns() {
        // Given: Graph attention engine with initial weights
        let mut attention = GraphAttentionEngine::new_test();
        let initial_weights = attention.get_edge_bias_weights();

        // When: Train on successful interaction patterns
        let patterns = vec![
            AttentionPattern {
                edges_used: vec![EdgeType::Cites],
                outcome_quality: 0.95,
            },
            AttentionPattern {
                edges_used: vec![EdgeType::Cites],
                outcome_quality: 0.90,
            },
        ];
        attention.train_on_patterns(&patterns);

        // Then: Edge type "Cites" should have a higher attention bias
        let updated_weights = attention.get_edge_bias_weights();
        assert!(updated_weights[EdgeType::Cites] > initial_weights[EdgeType::Cites]);
    }
}
```
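
The decay behavior asserted in `test_unused_edges_decay_over_time` can be sketched as exponential falloff in the days since last use. The half-life parameterization is an assumption for illustration; the actual `apply_decay` rate constant may differ.

```rust
/// Illustrative decay rule (an assumed form, not the production one):
/// halve an unused edge's weight every `half_life_days` days without use.
pub fn decayed_weight(weight: f64, days_unused: f64, half_life_days: f64) -> f64 {
    weight * 0.5f64.powf(days_unused / half_life_days)
}
```

Under this rule an edge at weight 0.5 that sits unused for one half-life drops to 0.25, while a freshly used edge keeps its weight unchanged.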

### 2.2 Loop B: Router Learning

**What the router learns:**

```
┌────────────────────────────────────────────────────────────────┐
│                     Router Learning Loop                       │
├────────────────────────────────────────────────────────────────┤
│                                                                │
│  For each query, LOG:                                          │
│     ┌─────────────────────────────────────────────────────────┐│
│     │ - Router features (128-dim input vector)                ││
│     │ - Chosen route (model, context, temp, top_p)            ││
│     │ - Actual latency and cost                               ││
│     │ - Quality score (judge model or task outcome)           ││
│     └─────────────────────────────────────────────────────────┘│
│                                                                │
│  Periodically RETRAIN FastGRNN:                                │
│     ┌─────────────────────────────────────────────────────────┐│
│     │ Objective: Prefer cheaper routes when quality holds     ││
│     │            Escalate only when necessary                 ││
│     │                                                         ││
│     │ Loss = -Quality + λ·Cost + μ·LatencyPenalty             ││
│     │                                                         ││
│     │ Constraints:                                            ││
│     │ - Quality must exceed threshold θ_min                   ││
│     │ - Latency must meet SLA                                 ││
│     └─────────────────────────────────────────────────────────┘│
│                                                                │
│  RESULT: Router becomes a self-learning policy over your stack │
│                                                                │
└────────────────────────────────────────────────────────────────┘
```
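
The loss in the diagram can be written out directly. This sketch assumes `LatencyPenalty` is the excess over the SLA (in seconds, zero when within budget); the SLA-excess form and the parameter values are illustrative, not the production choices.

```rust
/// Sketch of the routing objective from the diagram:
///   Loss = -Quality + λ·Cost + μ·LatencyPenalty
/// where LatencyPenalty is assumed to be seconds of latency beyond the SLA.
pub fn route_loss(
    quality: f64,
    cost: f64,
    latency_ms: f64,
    sla_ms: f64,
    lambda: f64,
    mu: f64,
) -> f64 {
    // No penalty while the route meets its latency SLA.
    let latency_penalty = ((latency_ms - sla_ms) / 1000.0).max(0.0);
    -quality + lambda * cost + mu * latency_penalty
}
```

With λ large enough, the 700M route from the test below (quality 0.92, cost 0.001) scores a lower loss than the 1.2B route (quality 0.93, cost 0.003), which is exactly the "prefer cheaper routes when quality holds" behavior.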

**TDD Tests for Loop B:**

```rust
#[cfg(test)]
mod router_learning_tests {
    use super::*;

    #[test]
    fn test_router_prefers_smaller_model_when_quality_sufficient() {
        // Given: Training data showing 700M achieves same quality as 1.2B
        let training_data = vec![
            RouterSample {
                features: simple_query_features(),
                model_used: ModelSize::M700,
                quality: 0.92,
                latency_ms: 150.0,
                cost: 0.001,
            },
            RouterSample {
                features: simple_query_features(),
                model_used: ModelSize::B1_2,
                quality: 0.93, // Only marginally better
                latency_ms: 300.0,
                cost: 0.003,
            },
        ];

        // When: Router is trained
        let mut router = FastGRNNRouter::new_test();
        router.train(&training_data, QUALITY_THRESHOLD);

        // Then: Router should prefer 700M for similar queries
        let decision = router.forward(&simple_query_features(), &initial_hidden());
        assert_eq!(decision.model, ModelSize::M700);
    }

    #[test]
    fn test_router_escalates_for_complex_queries() {
        // Given: Training data showing complex queries need larger models
        let training_data = vec![
            RouterSample {
                features: complex_query_features(),
                model_used: ModelSize::M700,
                quality: 0.45, // Poor quality
                latency_ms: 150.0,
                cost: 0.001,
            },
            RouterSample {
                features: complex_query_features(),
                model_used: ModelSize::B2_6,
                quality: 0.91, // Good quality
                latency_ms: 500.0,
                cost: 0.010,
            },
        ];

        // When: Router is trained
        let mut router = FastGRNNRouter::new_test();
        router.train(&training_data, QUALITY_THRESHOLD);

        // Then: Router should choose 2.6B for complex queries
        let decision = router.forward(&complex_query_features(), &initial_hidden());
        assert_eq!(decision.model, ModelSize::B2_6);
    }

    #[test]
    fn test_router_confidence_correlates_with_seen_patterns() {
        // Given: Router trained on specific feature patterns
        let mut router = FastGRNNRouter::new_test();
        let seen_features = vec![training_features_a(), training_features_b()];
        router.train(&samples_from_features(&seen_features), QUALITY_THRESHOLD);

        // When: Querying with seen vs unseen patterns
        let seen_decision = router.forward(&training_features_a(), &initial_hidden());
        let unseen_decision = router.forward(&novel_features(), &initial_hidden());

        // Then: Confidence should be higher for seen patterns
        assert!(seen_decision.confidence > unseen_decision.confidence);
    }

    #[test]
    fn test_router_ewc_prevents_forgetting() {
        // Given: Router trained on task A
        let mut router = FastGRNNRouter::new_test();
        let mut ewc = ElasticWeightConsolidation::new(0.4);
        router.train(&task_a_samples(), QUALITY_THRESHOLD);
        let task_a_accuracy_before = router.evaluate(&task_a_samples());

        // Compute Fisher and store optimal weights
        ewc.compute_fisher(&router, &task_a_samples());

        // When: Train on task B with EWC
        router.train_with_ewc(&task_b_samples(), &ewc, QUALITY_THRESHOLD);

        // Then: Task A accuracy should not significantly degrade
        let task_a_accuracy_after = router.evaluate(&task_a_samples());
        assert!(task_a_accuracy_after > task_a_accuracy_before - 0.05);
    }
}
```
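
The EWC regularizer behind `train_with_ewc` has a standard closed form: a quadratic penalty on drift away from the task-A weights, scaled by the Fisher diagonal. The sketch below uses illustrative names; only the formula itself, (λ/2)·Σᵢ Fᵢ·(θᵢ − θ*ᵢ)², is standard EWC.

```rust
/// EWC penalty: (lambda / 2) * Σ_i fisher[i] * (theta[i] - theta_star[i])²,
/// where `theta_star` holds the weights frozen after the previous task and
/// `fisher` is the diagonal Fisher information estimated on that task.
pub fn ewc_penalty(lambda: f64, fisher: &[f64], theta: &[f64], theta_star: &[f64]) -> f64 {
    assert_eq!(fisher.len(), theta.len());
    assert_eq!(theta.len(), theta_star.len());
    0.5 * lambda
        * fisher
            .iter()
            .zip(theta.iter().zip(theta_star))
            .map(|(f, (t, ts))| f * (t - ts).powi(2))
            .sum::<f64>()
}
```

The penalty is zero when the router has not moved, and grows fastest along directions the Fisher marks as important for task A, which is what makes the forgetting test above pass.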

### 2.3 Loop C: Compression and Abstraction

**How the system avoids bloat:**

```
┌────────────────────────────────────────────────────────────────┐
│              Compression and Abstraction Loop                  │
├────────────────────────────────────────────────────────────────┤
│                                                                │
│  PERIODICALLY (hourly/daily):                                  │
│                                                                │
│  1. CLUSTER DETECTION                                          │
│     ┌─────────────────────────────────────────────────────────┐│
│     │ Identify clusters of similar nodes in graph:            ││
│     │ - Dense neighborhoods with similar embeddings           ││
│     │ - Frequently co-retrieved node sets                     ││
│     │ - High edge connectivity within cluster                 ││
│     └─────────────────────────────────────────────────────────┘│
│                                                                │
│  2. LFM2 SUMMARIZATION                                         │
│     ┌─────────────────────────────────────────────────────────┐│
│     │ For each cluster:                                       ││
│     │ - Feed cluster nodes to LFM2                            ││
│     │ - Generate summary "concept" node                       ││
│     │ - Create embedding for concept                          ││
│     └─────────────────────────────────────────────────────────┘│
│                                                                │
│  3. HIERARCHICAL ATTACHMENT                                    │
│     ┌─────────────────────────────────────────────────────────┐│
│     │ - Concept node becomes parent of cluster members        ││
│     │ - Add "contains" edges from concept to members          ││
│     │ - Future queries see concept first in attention         ││
│     └─────────────────────────────────────────────────────────┘│
│                                                                │
│  4. ARCHIVAL                                                   │
│     ┌─────────────────────────────────────────────────────────┐│
│     │ - Old, rarely-used fine-grained nodes → cold storage    ││
│     │ - Concept summaries stay in hot tier                    ││
│     │ - Preserve graph structure for rehydration              ││
│     └─────────────────────────────────────────────────────────┘│
│                                                                │
│  RESULT: Hierarchy of concepts, not ever-growing bag of chunks │
│                                                                │
└────────────────────────────────────────────────────────────────┘
```
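
The "high edge connectivity within cluster" criterion from step 1 is usually measured as edge density: edges present divided by edges possible. A small sketch, assuming an undirected cluster (the production `detect_clusters` may weight edges rather than count them):

```rust
/// Edge density of an undirected cluster: present edges over the
/// n·(n-1)/2 possible pairs. 1.0 means fully connected; clusters
/// below MIN_EDGE_DENSITY would not qualify for summarization.
pub fn edge_density(node_count: usize, edge_count: usize) -> f64 {
    if node_count < 2 {
        return 0.0; // a single node has no internal connectivity
    }
    let possible = node_count * (node_count - 1) / 2;
    edge_count as f64 / possible as f64
}
```

The fully connected 3-node ML cluster in the test below has density 1.0, while 3 edges over 4 nodes would only reach 0.5.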

**TDD Tests for Loop C:**

```rust
#[cfg(test)]
mod compression_tests {
    use super::*;

    #[test]
    fn test_cluster_detection_finds_dense_neighborhoods() {
        // Given: Graph with clear clusters
        let mut memory = RuvectorMemory::new_test();

        // Cluster 1: ML topics (densely connected)
        let ml_nodes = vec![
            memory.insert_node("Neural networks learn patterns"),
            memory.insert_node("Deep learning uses multiple layers"),
            memory.insert_node("Backpropagation trains neural nets"),
        ];
        for i in 0..ml_nodes.len() {
            for j in i + 1..ml_nodes.len() {
                memory.insert_edge(&ml_nodes[i], &ml_nodes[j], EdgeType::SameTopic, 0.9);
            }
        }

        // Cluster 2: Cooking topics (densely connected)
        let cooking_nodes = vec![
            memory.insert_node("Sourdough needs starter"),
            memory.insert_node("Bread baking requires patience"),
        ];
        memory.insert_edge(&cooking_nodes[0], &cooking_nodes[1], EdgeType::SameTopic, 0.85);

        // When: Run cluster detection
        let clusters = memory.detect_clusters(MIN_CLUSTER_SIZE, MIN_EDGE_DENSITY);

        // Then: Should find two distinct clusters
        assert_eq!(clusters.len(), 2);
        assert!(clusters.iter().any(|c| c.nodes.len() == 3)); // ML cluster
        assert!(clusters.iter().any(|c| c.nodes.len() == 2)); // Cooking cluster
    }

    #[test]
    fn test_summarization_creates_concept_node() {
        // Given: A cluster of related nodes
        let nodes = vec![
            Node::new("Rust is memory safe"),
            Node::new("Rust has zero-cost abstractions"),
            Node::new("Rust prevents data races"),
        ];
        let cluster = Cluster {
            centroid: compute_centroid(&nodes),
            nodes,
        };

        // When: Generate summary
        let summarizer = ClusterSummarizer::new(lfm2_model());
        let concept = summarizer.summarize(&cluster);

        // Then: Concept should capture key themes
        assert!(concept.text.to_lowercase().contains("rust"));
        assert!(concept.node_type == NodeType::Concept);
        assert!(concept.metadata.contains_key("source_cluster_size"));
    }

    #[test]
    fn test_concept_nodes_are_prioritized_in_retrieval() {
        // Given: Memory with concept and detail nodes
        let mut memory = RuvectorMemory::new_test();
        let concept = memory.insert_node_typed(
            "Rust programming overview",
            NodeType::Concept,
        );
        let detail = memory.insert_node_typed(
            "Rust's borrow checker enforces ownership",
            NodeType::Document,
        );
        memory.insert_edge(&concept, &detail, EdgeType::Contains, 1.0);

        // When: Query about Rust
        let query_embedding = embed("Tell me about Rust");
        let results = memory.search_with_concept_boost(&query_embedding, 10);

        // Then: Concept should appear before (or with higher weight than) details
        let concept_idx = results.iter().position(|r| r.id == concept.id).unwrap();
        let detail_idx = results.iter().position(|r| r.id == detail.id).unwrap();
        assert!(concept_idx < detail_idx);
    }

    #[test]
    fn test_archival_moves_old_nodes_to_cold_storage() {
        // Given: Nodes with different access patterns
        let mut memory = RuvectorMemory::new_test();
        let hot_node = memory.insert_node_with_access(
            "Recently used content",
            AccessStats { last_used: now(), use_count: 50 },
        );
        let cold_node = memory.insert_node_with_access(
            "Old unused content",
            AccessStats {
                last_used: now() - Duration::from_secs(90 * 24 * 60 * 60),
                use_count: 1,
            },
        );

        // When: Run archival
        memory.run_archival(
            MAX_AGE_DAYS,
            MIN_USE_COUNT,
            COLD_STORAGE_PATH,
        );

        // Then: Hot node stays, cold node archived
        assert!(memory.contains(&hot_node.id));
        assert!(!memory.contains(&cold_node.id));
        assert!(cold_storage_contains(&cold_node.id));
    }
}
```
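
The archival rule exercised in the last test reduces to a small predicate. This sketch assumes the two thresholds act as a conjunction (old *and* rarely used); the real `run_archival` may combine them differently.

```rust
/// Illustrative archival predicate: a node moves to cold storage only
/// when it is both older than `max_age_days` since last use and used
/// fewer than `min_use_count` times. Hot or frequently used nodes stay.
pub fn should_archive(
    days_since_use: u64,
    use_count: u64,
    max_age_days: u64,
    min_use_count: u64,
) -> bool {
    days_since_use > max_age_days && use_count < min_use_count
}
```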

---

## 3. Weight-Level Self-Learning (Controlled)

### 3.1 The Safe Outer Loop

**Weight updates happen outside production, in a controlled pipeline:**

```
┌────────────────────────────────────────────────────────────────┐
│              Weight-Level Self-Learning Pipeline               │
├────────────────────────────────────────────────────────────────┤
│                                                                │
│  ┌──────────────────────────────────────────────────────────┐  │
│  │ STEP 1: COLLECT TRAINING TRACES (continuous)             │  │
│  │                                                          │  │
│  │ From live system, store:                                 │  │
│  │ - (prompt, retrieved_context, final_answer, outcome)     │  │
│  │ - Judge scores or human ratings                          │  │
│  │ - Explicit error cases                                   │  │
│  │                                                          │  │
│  │ Tag by: domain, difficulty, risk_level                   │  │
│  └──────────────────────────────────────────────────────────┘  │
│                              │                                 │
│                              ▼                                 │
│  ┌──────────────────────────────────────────────────────────┐  │
│  │ STEP 2: BUILD ROLLING CURRICULUM (nightly/weekly)        │  │
│  │                                                          │  │
│  │ Sample recent traces:                                    │  │
│  │ - Up-weight hard or high-value tasks                     │  │
│  │ - Filter out cases where context was wrong               │  │
│  │                                                          │  │
│  │ Create three sets:                                       │  │
│  │ ┌───────────────┐ ┌───────────────┐ ┌───────────────┐    │  │
│  │ │ SFT           │ │ Preference    │ │ Retrieval     │    │  │
│  │ │ (good         │ │ Pairs         │ │ Correction    │    │  │
│  │ │  answers)     │ │ (good vs bad) │ │ (context      │    │  │
│  │ │               │ │               │ │  selection)   │    │  │
│  │ └───────────────┘ └───────────────┘ └───────────────┘    │  │
│  └──────────────────────────────────────────────────────────┘  │
│                              │                                 │
│                              ▼                                 │
│  ┌──────────────────────────────────────────────────────────┐  │
│  │ STEP 3: TRAIN STUDENT VARIANTS (offline)                 │  │
│  │                                                          │  │
│  │ Take current best LFM2 checkpoint:                       │  │
│  │ 1. Run supervised fine-tuning on new traces              │  │
│  │ 2. Optionally run preference objective on pairs          │  │
│  │ 3. Validate on fixed holdout + public benchmarks         │  │
│  │                                                          │  │
│  │ Output: "LFM2-ruv-edition-vN"                            │  │
│  └──────────────────────────────────────────────────────────┘  │
│                              │                                 │
│                              ▼                                 │
│  ┌──────────────────────────────────────────────────────────┐  │
│  │ STEP 4: GATED DEPLOYMENT (A/B testing)                   │  │
│  │                                                          │  │
│  │ ┌──────────────────────────────────────────────────────┐ │  │
│  │ │                 Production Traffic                   │ │  │
│  │ │   ┌────────────────┐      ┌────────────────┐         │ │  │
│  │ │   │ 90% → Current  │      │ 10% → Student  │         │ │  │
│  │ │   │     Model      │      │      vN        │         │ │  │
│  │ │   └────────────────┘      └────────────────┘         │ │  │
│  │ └──────────────────────────────────────────────────────┘ │  │
│  │                                                          │  │
│  │ Compare: quality, latency, failure_rate                  │  │
│  │ Promote IFF: student dominates OR ties on key metrics    │  │
│  │                                                          │  │
│  │ ⚠️ Never free-write weights in-place                     │  │
│  │ ⚠️ Always retrain in controlled loop                     │  │
│  │ ⚠️ Promote only when safe                                │  │
│  └──────────────────────────────────────────────────────────┘  │
│                                                                │
└────────────────────────────────────────────────────────────────┘
```
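
Step 2's sampling policy can be sketched as a weight function over traces. The linear up-weighting form and the `hard_bonus` knob are assumptions for illustration; the curriculum builder in the tests below only needs the two behaviors shown (quality filtering and hard-task up-weighting).

```rust
/// Illustrative curriculum weighting: traces below the quality floor are
/// dropped entirely (`None`); surviving traces get a base weight of 1.0
/// plus a difficulty-proportional bonus, so hard successes dominate.
pub fn curriculum_weight(
    difficulty: f64,
    quality: f64,
    quality_floor: f64,
    hard_bonus: f64,
) -> Option<f64> {
    if quality < quality_floor {
        None // filtered out of the curriculum
    } else {
        Some(1.0 + hard_bonus * difficulty)
    }
}
```

With a 0.7 floor, the hard/good trace (difficulty 0.9, quality 0.85) outweighs the easy/good one, and the easy/bad one (quality 0.60) is filtered, matching `test_curriculum_upweights_hard_tasks`.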

**TDD Tests for Weight-Level Learning:**

```rust
#[cfg(test)]
mod weight_learning_tests {
    use super::*;

    #[test]
    fn test_trace_collection_captures_all_components() {
        // Given: A completed interaction
        let trace_collector = TraceCollector::new_test();
        let interaction = Interaction {
            prompt: "What is Rust?",
            context: vec!["Rust is a systems language"],
            response: "Rust is a memory-safe systems programming language",
            quality_score: 0.92,
            task_outcome: TaskOutcome::Success,
        };

        // When: Trace is collected
        let trace = trace_collector.collect(&interaction);

        // Then: All components should be present
        assert!(trace.prompt.is_some());
        assert!(!trace.context.is_empty());
        assert!(trace.response.is_some());
        assert!(trace.quality_score.is_some());
        assert!(!trace.domain_tags.is_empty());
    }

    #[test]
    fn test_curriculum_upweights_hard_tasks() {
        // Given: Mix of easy and hard traces
        let traces = vec![
            Trace { difficulty: 0.2, quality: 0.95, ..Default::default() }, // Easy, good
            Trace { difficulty: 0.9, quality: 0.85, ..Default::default() }, // Hard, good
            Trace { difficulty: 0.3, quality: 0.60, ..Default::default() }, // Easy, bad
        ];

        // When: Build curriculum
        let curriculum = CurriculumBuilder::new()
            .upweight_hard_tasks(true)
            .filter_bad_quality(0.7)
            .build(&traces);

        // Then: Hard successful trace should have higher weight
        let hard_weight = curriculum.weight_for(&traces[1]);
        let easy_weight = curriculum.weight_for(&traces[0]);
        assert!(hard_weight > easy_weight);

        // And: Bad quality trace should be filtered
        assert!(!curriculum.contains(&traces[2]));
    }

    #[test]
    fn test_preference_pairs_correctly_ordered() {
        // Given: Same query with different quality responses
        let good_response = Response { text: "Detailed answer...", quality: 0.9 };
        let bad_response = Response { text: "I don't know", quality: 0.3 };
        let query = "Explain backpropagation";

        // When: Create preference pair
        let pair = PreferencePair::from_responses(query, &good_response, &bad_response);

        // Then: Good should be preferred
        assert_eq!(pair.chosen, good_response.text);
        assert_eq!(pair.rejected, bad_response.text);
    }

    #[test]
    fn test_student_validation_gates_deployment() {
        // Given: Student model that underperforms on holdout
        let _student = StudentModel::new_test();
        let _holdout = HoldoutDataset::load_test();
        let baseline_accuracy = 0.85;
        let student_accuracy = 0.78; // Below baseline

        // When: Validate for deployment
        let validation = ValidationResult::new(student_accuracy, baseline_accuracy);

        // Then: Should NOT be approved for deployment
        assert!(!validation.approved_for_deployment());
        assert!(validation.rejection_reason().contains("accuracy"));
    }

    #[test]
    fn test_ab_test_detects_regression() {
        // Given: A/B test results
        let ab_results = ABTestResults {
            control: ABMetrics { quality: 0.90, latency_p50: 200.0, failure_rate: 0.02 },
            treatment: ABMetrics { quality: 0.88, latency_p50: 180.0, failure_rate: 0.05 },
        };

        // When: Evaluate for promotion
        let decision = ABDecision::evaluate(&ab_results, SIGNIFICANCE_THRESHOLD);

        // Then: Should NOT promote due to quality regression + higher failure rate
        assert_eq!(decision, ABDecision::KeepControl);
        assert!(decision.reasons().contains("quality_regression"));
    }
}
```
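
The step-4 promotion gate boils down to a comparison of the two A/B arms. A minimal sketch, assuming a quality-regression tolerance `epsilon` and a hard no-regression rule on failure rate (the `Metrics` struct here is illustrative, mirroring the `ABMetrics` fields used above):

```rust
/// Illustrative A/B arm summary (a stand-in for ABMetrics above).
pub struct Metrics {
    pub quality: f64,
    pub failure_rate: f64,
}

/// Promote the student only if its quality does not regress beyond
/// `epsilon` AND its failure rate does not rise. Anything else keeps
/// the control model — weights are never promoted on a tie-break loss.
pub fn promote_student(control: &Metrics, treatment: &Metrics, epsilon: f64) -> bool {
    treatment.quality >= control.quality - epsilon
        && treatment.failure_rate <= control.failure_rate
}
```

Fed the numbers from `test_ab_test_detects_regression` (quality 0.90 → 0.88, failures 0.02 → 0.05), the gate keeps the control model, which is the `ABDecision::KeepControl` outcome the test expects.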

---

## 4. Test-Driven Development Plan

### 4.1 Testing Pyramid

```
                    ┌─────────────────┐
                    │   E2E Tests     │  (5%)
                    │  Full pipeline  │
                    └────────┬────────┘
                             │
              ┌──────────────┴──────────────┐
              │     Integration Tests       │  (20%)
              │    Cross-component flows    │
              └──────────────┬──────────────┘
                             │
         ┌───────────────────┴───────────────────┐
         │              Unit Tests               │  (75%)
         │    Individual functions & modules     │
         └───────────────────────────────────────┘
```

### 4.2 Test Categories by Component

#### 4.2.1 Orchestrator Tests

```rust
#[cfg(test)]
mod orchestrator_tests {
    #[test]
    fn test_request_routing_respects_session() { }

    #[test]
    fn test_rate_limiting_rejects_excess_requests() { }

    #[test]
    fn test_cache_hit_bypasses_processing() { }

    #[test]
    fn test_cache_miss_triggers_full_pipeline() { }

    #[test]
    fn test_error_handling_returns_graceful_response() { }

    #[test]
    fn test_metrics_recorded_for_all_requests() { }
}
```

#### 4.2.2 Embedding Service Tests

```rust
#[cfg(test)]
mod embedding_tests {
    #[test]
    fn test_embedding_dimension_matches_config() { }

    #[test]
    fn test_similar_texts_have_similar_embeddings() { }

    #[test]
    fn test_different_texts_have_different_embeddings() { }

    #[test]
    fn test_long_text_truncation() { }

    #[test]
    fn test_batch_embedding_matches_individual() { }

    #[test]
    fn test_empty_string_handling() { }
}
```

#### 4.2.3 Router Tests

```rust
#[cfg(test)]
mod router_tests {
    #[test]
    fn test_forward_produces_valid_probabilities() { }

    #[test]
    fn test_hidden_state_updates_across_calls() { }

    #[test]
    fn test_confidence_threshold_triggers_fallback() { }

    #[test]
    fn test_gradient_computation() { }

    #[test]
    fn test_sparse_matrix_operations() { }

    #[test]
    fn test_low_rank_matrix_approximation() { }
}
```

#### 4.2.4 Memory Tests

```rust
#[cfg(test)]
mod memory_tests {
    #[test]
    fn test_hnsw_search_returns_k_neighbors() { }

    #[test]
    fn test_graph_expansion_respects_hop_limit() { }

    #[test]
    fn test_writeback_queue_batches_correctly() { }

    #[test]
    fn test_deduplication_prevents_near_duplicates() { }

    #[test]
    fn test_metadata_filtering() { }

    #[test]
    fn test_edge_weight_update() { }
}
```

#### 4.2.5 Attention Tests

```rust
#[cfg(test)]
mod attention_tests {
    #[test]
    fn test_attention_weights_sum_to_one() { }

    #[test]
    fn test_edge_features_influence_attention() { }

    #[test]
    fn test_multi_head_concatenation() { }

    #[test]
    fn test_residual_connection_preserved() { }

    #[test]
    fn test_layer_norm_normalization() { }

    #[test]
    fn test_attention_ranking_matches_weights() { }
}
```

#### 4.2.6 Inference Tests

```rust
#[cfg(test)]
mod inference_tests {
    #[test]
    fn test_model_loading_correct_size() { }

    #[test]
    fn test_kv_cache_reuse() { }

    #[test]
    fn test_generation_respects_max_tokens() { }

    #[test]
    fn test_temperature_affects_randomness() { }

    #[test]
    fn test_top_p_filtering() { }

    #[test]
    fn test_model_eviction_under_memory_pressure() { }
}
```

#### 4.2.7 Learning Tests

```rust
#[cfg(test)]
mod learning_tests {
    #[test]
    fn test_replay_buffer_reservoir_sampling() { }

    #[test]
    fn test_ewc_regularization_value() { }

    #[test]
    fn test_fisher_information_computation() { }

    #[test]
    fn test_quality_judge_score_range() { }

    #[test]
    fn test_writeback_threshold_filtering() { }

    #[test]
    fn test_background_training_thread() { }
}
```

### 4.3 Integration Test Scenarios

```rust
#[cfg(test)]
mod integration_tests {
    /// Test full request-response cycle
    #[tokio::test]
    async fn test_end_to_end_query() {
        let system = RuvLLMSystem::new_test().await;

        let response = system.process(Request {
            query: "What is machine learning?",
            session_id: Some("test-session"),
            constraints: Default::default(),
        }).await.unwrap();

        assert!(!response.text.is_empty());
        assert!(response.confidence > 0.0);
        assert!(!response.sources.is_empty());
    }

    /// Test multi-turn conversation with context
    #[tokio::test]
    async fn test_multi_turn_context() {
        let system = RuvLLMSystem::new_test().await;
        let session = "multi-turn-test";

        // Turn 1
        let _r1 = system.process(Request {
            query: "What is Rust?",
            session_id: Some(session),
            ..Default::default()
        }).await.unwrap();

        // Turn 2 (should use KV cache)
        let r2 = system.process(Request {
            query: "What are its main features?",
            session_id: Some(session),
            ..Default::default()
        }).await.unwrap();

        // Response should reference Rust from context
        assert!(r2.text.to_lowercase().contains("rust") ||
                r2.text.to_lowercase().contains("memory") ||
                r2.text.to_lowercase().contains("safety"));
    }

    /// Test that learning loop updates memory
    #[tokio::test]
    async fn test_learning_updates_memory() {
        let system = RuvLLMSystem::new_test().await;
        let initial_node_count = system.memory.node_count();

        // Process high-quality interaction
        system.process_with_feedback(
            Request { query: "Novel question...", ..Default::default() },
            Feedback { quality: 0.95, explicit_rating: Some(5) },
        ).await.unwrap();

        // Memory should have grown
        let final_node_count = system.memory.node_count();
        assert!(final_node_count > initial_node_count);
    }

    /// Test router learns from experience
    #[tokio::test]
    async fn test_router_adaptation() {
        let mut system = RuvLLMSystem::new_test().await;

        // Process many simple queries
        for _ in 0..100 {
            system.process(Request {
                query: "Simple factual question",
                ..Default::default()
            }).await.unwrap();
        }

        // Trigger training
        system.learning_service.train_router().await;

        // Router should now prefer smaller models for similar queries
        let decision = system.router.forward(
            &simple_query_features(),
            &initial_hidden(),
        );
        assert!(decision.model == ModelSize::M350 || decision.model == ModelSize::M700);
    }
}
```

---

## 5. Benchmarking Suite

### 5.1 Performance Benchmarks

```rust
use criterion::{criterion_group, criterion_main, Criterion, BenchmarkId};

fn embedding_benchmark(c: &mut Criterion) {
    let embedder = EmbeddingService::new_test();

    let mut group = c.benchmark_group("embedding");

    for size in [32, 128, 512, 2048].iter() {
        let text = "a".repeat(*size);
        group.bench_with_input(
            BenchmarkId::new("embed", size),
            &text,
            |b, t| b.iter(|| embedder.embed(t)),
        );
    }

    group.finish();
}

fn hnsw_search_benchmark(c: &mut Criterion) {
    let memory = RuvectorMemory::new_with_data(100_000); // 100K vectors
    let query = random_vector(384);

    let mut group = c.benchmark_group("hnsw_search");

    for k in [10, 32, 64].iter() {
        for ef in [32, 64, 128].iter() {
            group.bench_with_input(
                BenchmarkId::new(format!("k={},ef={}", k, ef), ""),
                &(k, ef),
                |b, (k, ef)| b.iter(|| memory.search(&query, **k, **ef)),
            );
        }
    }

    group.finish();
}

fn router_forward_benchmark(c: &mut Criterion) {
    let router = FastGRNNRouter::new_test();
    let features = random_vector(128);
    let hidden = random_vector(64);

    c.bench_function("router_forward", |b| {
        b.iter(|| router.forward(&features, &hidden))
    });
}

fn graph_attention_benchmark(c: &mut Criterion) {
    let attention = GraphAttentionEngine::new_test();
    let query = random_vector(384);
    let subgraph = generate_subgraph(50, 100); // 50 nodes, 100 edges

    c.bench_function("graph_attention", |b| {
        b.iter(|| attention.attend(&query, &subgraph))
    });
}

criterion_group!(
    benches,
    embedding_benchmark,
    hnsw_search_benchmark,
    router_forward_benchmark,
    graph_attention_benchmark
);
criterion_main!(benches);
```

### 5.2 Quality Benchmarks

```rust
/// Benchmark suite for quality metrics
pub struct QualityBenchmark {
    dataset: BenchmarkDataset,
    judge: QualityJudge,
}

impl QualityBenchmark {
    pub async fn run(&self, system: &RuvLLMSystem) -> QualityResults {
        let mut results = QualityResults::default();

        for sample in &self.dataset.samples {
            let response = system.process(Request {
                query: sample.query.clone(),
                ..Default::default()
            }).await.unwrap();

            // Judge quality
            let quality = self.judge.evaluate(
                &sample.query,
                &response.text,
                &response.sources,
            ).await;

            // Check against ground truth if available
            if let Some(expected) = &sample.expected_answer {
                let f1 = compute_f1(&response.text, expected);
                results.f1_scores.push(f1);
            }

            results.quality_scores.push(quality);
            results.latencies.push(response.latency);
        }

        results
    }
}
```
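
`compute_f1` above is left undefined; a common realization is token-overlap F1 against the ground truth. The sketch below assumes whitespace tokenization, which is only one plausible choice:

```rust
use std::collections::HashMap;

/// Token-overlap F1 between a prediction and an expected answer
/// (an assumed realization of the `compute_f1` used above):
/// precision and recall over multiset token overlap, combined
/// as their harmonic mean.
pub fn compute_f1(prediction: &str, expected: &str) -> f64 {
    fn counts(s: &str) -> HashMap<&str, usize> {
        let mut m = HashMap::new();
        for t in s.split_whitespace() {
            *m.entry(t).or_insert(0) += 1;
        }
        m
    }
    let (p, e) = (counts(prediction), counts(expected));
    // Multiset intersection size: shared tokens, counted with multiplicity.
    let overlap: usize = p
        .iter()
        .map(|(t, &c)| c.min(*e.get(t).unwrap_or(&0)))
        .sum();
    let (np, ne) = (p.values().sum::<usize>(), e.values().sum::<usize>());
    if overlap == 0 || np == 0 || ne == 0 {
        return 0.0;
    }
    let precision = overlap as f64 / np as f64;
    let recall = overlap as f64 / ne as f64;
    2.0 * precision * recall / (precision + recall)
}
```

An exact match scores 1.0, disjoint answers score 0.0, and partial overlap lands in between, giving the benchmark a graded signal rather than a pass/fail one.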

---

## 6. Iteration Milestones

### 6.1 Phase 1: Foundation (Weeks 1-2)

| Milestone | Deliverables | Tests |
|-----------|--------------|-------|
| M1.1 | Embedding service stub | Dimension tests |
| M1.2 | Memory service with HNSW | Search tests |
| M1.3 | Basic orchestrator | Integration smoke tests |
| M1.4 | Mock LFM2 interface | Interface contract tests |

### 6.2 Phase 2: Core Pipeline (Weeks 3-4)

| Milestone | Deliverables | Tests |
|-----------|--------------|-------|
| M2.1 | FastGRNN router | Forward pass tests |
| M2.2 | Graph attention engine | Attention computation tests |
| M2.3 | Context builder | Deduplication, truncation tests |
| M2.4 | End-to-end pipeline | Full flow integration tests |

### 6.3 Phase 3: Learning Loops (Weeks 5-6)

| Milestone | Deliverables | Tests |
|-----------|--------------|-------|
| M3.1 | Quality judge | Evaluation tests |
| M3.2 | Replay buffer | Sampling distribution tests |
| M3.3 | EWC integration | Forgetting prevention tests |
| M3.4 | Memory writeback | Graph update tests |

### 6.4 Phase 4: Optimization (Weeks 7-8)

| Milestone | Deliverables | Tests |
|-----------|--------------|-------|
| M4.1 | Router training loop | Learning convergence tests |
| M4.2 | Compression/abstraction | Cluster detection tests |
| M4.3 | Performance tuning | Benchmark suite |
| M4.4 | Production hardening | Load tests, failure injection |

---
|
|
|
|
## 7. Refinement Checklist
|
|
|
|
### 7.1 Per-Component Checklist
|
|
|
|
```
|
|
[ ] Orchestrator
|
|
[ ] Request validation
|
|
[ ] Session management
|
|
[ ] Rate limiting
|
|
[ ] Caching
|
|
[ ] Error handling
|
|
[ ] Metrics export
|
|
|
|
[ ] Embedding Service
|
|
[ ] LFM2 encoder integration
|
|
[ ] Dimension projection
|
|
[ ] Batch processing
|
|
[ ] Tokenization
|
|
[ ] Truncation handling
|
|
|
|
[ ] FastGRNN Router
|
|
[ ] Cell implementation
|
|
[ ] Sparse weight matrices
|
|
[ ] Low-rank recurrent matrices
|
|
[ ] Output heads
|
|
[ ] Confidence calibration
|
|
[ ] Training loop
|
|
|
|
[ ] Memory Service
|
|
[ ] HNSW configuration
|
|
[ ] Graph storage
|
|
[ ] Edge operations
|
|
[ ] Writeback queue
|
|
[ ] Deduplication
|
|
[ ] Archival
|
|
|
|
[ ] Graph Attention
|
|
[ ] Multi-head attention
|
|
[ ] Edge feature encoding
|
|
[ ] Layer stacking
|
|
[ ] Residual connections
|
|
[ ] Output ranking
|
|
|
|
[ ] Inference Pool
|
|
[ ] Model loading
|
|
[ ] Lazy initialization
|
|
[ ] KV cache management
|
|
[ ] Quantization selection
|
|
[ ] LRU eviction
|
|
|
|
[ ] Learning Service
|
|
[ ] Quality evaluation
|
|
[ ] Replay buffer
|
|
[ ] EWC regularization
|
|
[ ] Background training
|
|
[ ] Writeback logic
|
|
[ ] Compression jobs
|
|
```
|

### 7.2 Quality Gates

| Gate | Criteria | Status |
|------|----------|--------|
| Unit test coverage | >80% | ⬜ |
| Integration tests passing | 100% | ⬜ |
| Latency P50 | <500ms | ⬜ |
| Quality score mean | >0.8 | ⬜ |
| Router accuracy | >90% | ⬜ |
| Memory efficiency | <4GB | ⬜ |
| No memory leaks | 24h stress test | ⬜ |
| Forgetting rate | <5% per 10K interactions | ⬜ |

---

*Document Version: 1.0*

*Last Updated: 2025-12-02*

*Author: RuvLLM Architecture Team*