# Incremental Graph Learning (ATLAS) - Implementation Plan

## Overview

### Problem Statement

Current GNN computation in ruvector is **full-graph recomputation**: whenever the graph changes (new vectors added, edges modified), the entire GNN must re-run forward passes over all nodes. This causes severe performance bottlenecks:

- **Slow Updates**: Adding 1,000 vectors to a 1M-node graph requires recomputing 1M+ node embeddings
- **Wasted Computation**: Most nodes are unaffected by localized changes
- **Poor Scalability**: O(N) update time, where N = total graph size
- **Latency Spikes**: Updates block queries, degrading P99 latency
- **Memory Pressure**: Full-graph activations stored during backpropagation

Real-world impact:

- Vector insertion rate limited to ~100 vectors/second (vs 10,000+ for index-only updates)
- GNN updates take 10-100x longer than HNSW index updates
- Cannot support real-time streaming workloads

### Proposed Solution

**ATLAS (Adaptive Topology-Aware Learning Accelerator System)**: an incremental graph learning framework that updates only affected subgraphs:

1. **Dirty Node Tracking**: Mark nodes whose features/edges changed
2. **Dependency Propagation**: Compute the k-hop affected region (receptive field)
3. **Incremental Forward Pass**: Recompute only dirty + affected nodes
4. **Activation Caching**: Reuse cached activations for unchanged nodes
5. **Lazy Materialization**: Defer updates to batch changes efficiently

**Key Insight**: Graph neural networks have bounded receptive fields. A k-layer GNN only needs information from k-hop neighbors. If a node's k-hop neighborhood is unchanged, its embedding is unchanged.
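The bounded-receptive-field insight can be made concrete with a small sketch (plain Python, independent of the ruvector codebase; `adj` and `k_hop_region` are illustrative names, not part of the planned API):

```python
from collections import deque

def k_hop_region(adj, dirty, k):
    """BFS outward from the dirty nodes for k hops: only nodes whose
    k-hop neighborhood contains a dirty node need recomputation."""
    affected = set(dirty)
    frontier = set(dirty)
    for _ in range(k):
        nxt = set()
        for node in frontier:
            for nb in adj.get(node, ()):
                if nb not in affected:
                    affected.add(nb)
                    nxt.add(nb)
        frontier = nxt
        if not frontier:
            break  # propagation has saturated early
    return affected

# Path graph 0-1-2-3-4-5: a change at node 0 with a 2-layer GNN only
# invalidates nodes within 2 hops; nodes 3-5 keep their cached embeddings.
adj = {i: [j for j in (i - 1, i + 1) if 0 <= j <= 5] for i in range(6)}
print(sorted(k_hop_region(adj, {0}, 2)))  # → [0, 1, 2]
```

For a localized change, the affected region grows with the graph's local degree and k, not with total graph size N, which is the entire basis for the speedups claimed below.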
### Expected Benefits

**Quantified Performance Improvements:**

| Metric | Current (Full) | ATLAS (Incremental) | Improvement |
|--------|----------------|---------------------|-------------|
| Update Latency (1K vectors) | 500ms | 5ms | **100x faster** |
| Update Latency (10K vectors) | 5s | 50ms | **100x faster** |
| Throughput (vectors/sec) | 100 | 10,000 | **100x higher** |
| Memory (activation storage) | 1GB (full graph) | 10MB (dirty region) | **100x reduction** |
| Query Availability | Blocked during update | Concurrent | **Continuous** |

**Qualitative Benefits:**

- Real-time vector streaming support
- No query latency spikes during updates
- Memory-efficient updates
- Support for continuous learning workflows

## Technical Design

### Architecture Diagram (ASCII Art)

```
┌─────────────────────────────────────────────────────────────────┐
│                ATLAS Incremental Learning System                │
└─────────────────────────────────────────────────────────────────┘

 Vector Insert/Update/Delete
          │
          ▼
┌─────────────────────────────────────────────────────────────────┐
│                          Change Tracker                         │
│   Dirty Node Set (BitSet)                                       │
│   - Nodes with changed features: [42, 137, 1025, ...]           │
│   - Nodes with changed edges: [43, 138, ...]                    │
│   - Timestamps: last_modified[node_id] = timestamp              │
└─────────────────────────────────────────────────────────────────┘
          │
          ▼
┌─────────────────────────────────────────────────────────────────┐
│                       Dependency Analyzer                       │
│   Compute Affected Region (k-hop BFS)                           │
│                                                                 │
│   dirty_nodes = {42, 137, 1025}                                 │
│        │                                                        │
│        ▼                                                        │
│   1-hop neighbors: {41, 43, 136, 138, 1024, 1026}               │
│        │                                                        │
│        ▼                                                        │
│   2-hop neighbors: {40, 44, 135, 139, ...}                      │
│        │                                                        │
│        ▼  (repeat for k hops)                                   │
│   affected_region = dirty ∪ 1-hop ∪ 2-hop ∪ ... ∪ k-hop         │
└─────────────────────────────────────────────────────────────────┘
          │
          ▼
┌─────────────────────────────────────────────────────────────────┐
│                        Activation Cache                         │
│   Cached Embeddings (per layer)                                 │
│   Layer 0: {node_id → embedding, timestamp}                     │
│     42  → [0.1, 0.3, ...]  (STALE - dirty)                      │
│     100 → [0.5, 0.2, ...]  (FRESH - reuse!)                     │
│     137 → [0.8, 0.1, ...]  (STALE - affected)                   │
│   Layer 1: {node_id → embedding, timestamp}                     │
│     ...                                                         │
└─────────────────────────────────────────────────────────────────┘
          │
          ▼
┌─────────────────────────────────────────────────────────────────┐
│                    Incremental Forward Pass                     │
│   For each layer l in GNN:                                      │
│     For each node in affected_region:                           │
│       if cached[l-1][node].is_fresh():                          │
│         embedding[l][node] = cached[l][node]   # Reuse!         │
│       else:                                                     │
│         # Recompute from previous layer                         │
│         neighbor_embeddings = [cached[l-1][n] for n in N(v)]    │
│         embedding[l][node] = GNN_layer(neighbor_embeddings)     │
│       cached[l][node] = embedding[l][node]     # Update cache   │
└─────────────────────────────────────────────────────────────────┘
          │
          ▼
┌─────────────────────────────────────────────────────────────────┐
│                     Batch Update Optimizer                      │
│   Lazy Materialization:                                         │
│   - Buffer changes until threshold (time/count)                 │
│   - Coalesce dirty regions (merge overlapping k-hop sets)       │
│   - Sort affected nodes by layer propagation order              │
│   - Execute single batch update instead of N small updates      │
└─────────────────────────────────────────────────────────────────┘
          │
          ▼
 Updated GNN Embeddings (partial)


┌─────────────────────────────────────────────────────────────────┐
│               Query Path (Concurrent with Updates)              │
└─────────────────────────────────────────────────────────────────┘

 Query Request
          │
          ▼
┌─────────────────────────────────────────────────────────────────┐
│               Read-Write Lock (Activation Cache)                │
│   - Queries acquire read lock (concurrent reads OK)             │
│   - Updates acquire write lock (blocks queries briefly)         │
│   - Most queries see slightly stale embeddings (acceptable)     │
└─────────────────────────────────────────────────────────────────┘
          │
          ▼
 Retrieve embeddings from cache (mostly fresh)
          │
          ▼
 Return query results
```

### Core Data Structures (Rust)

```rust
// File: crates/ruvector-gnn/src/incremental/mod.rs

use std::collections::{HashMap, HashSet, VecDeque};
use std::sync::{Arc, RwLock};
use bitvec::prelude::*;
use ndarray::Array2;

/// ATLAS incremental learning system
pub struct IncrementalGnn {
    /// Tracks which nodes have changed
    change_tracker: ChangeTracker,
    /// Caches computed activations per layer
    activation_cache: ActivationCache,
    /// Dependency graph for k-hop propagation
    dependency_graph: DependencyGraph,
    /// Batch update configuration
    batch_config: BatchUpdateConfig,
    /// Performance metrics
    metrics: IncrementalMetrics,
    /// GNN layer count (determines receptive field)
    num_layers: usize,
}

/// Tracks which nodes are dirty (need recomputation)
pub struct ChangeTracker {
    /// Dirty nodes (changed features or edges)
    dirty_nodes: BitVec,
    /// Timestamp of last modification per node
    last_modified: HashMap<u32, u64>,
    /// Global update counter
    update_counter: u64,
    /// Pending changes (buffered for batch processing)
    pending_changes: VecDeque<NodeChange>,
}

#[derive(Debug, Clone)]
pub enum NodeChange {
    /// Node features changed
    FeatureUpdate { node_id: u32, timestamp: u64 },
    /// Edges added/removed
    EdgeUpdate { node_id: u32, timestamp: u64 },
    /// Node deleted
    NodeDeleted { node_id: u32, timestamp: u64 },
}

/// Caches GNN activations (embeddings) per layer
pub struct ActivationCache {
    /// Cached embeddings per layer: layer_idx -> (node_id -> embedding).
    /// Wrapped in RwLock for concurrent read access during queries.
    cache: Vec<RwLock<HashMap<u32, CachedActivation>>>,
    /// Maximum cache size per layer (LRU eviction)
    max_size_per_layer: usize,
    /// Total cache hits/misses
    stats: CacheStats,
}

#[derive(Debug, Clone)]
pub struct CachedActivation {
    /// Node embedding for this layer
    pub embedding: Array2<f32>,
    /// Timestamp when computed
    pub timestamp: u64,
    /// Whether this activation is still valid
    pub is_valid: bool,
}

/// Computes affected regions for incremental updates
pub struct DependencyGraph {
    /// Graph structure for k-hop traversal (type name assumed)
    graph: Arc<HnswGraph>,
    /// Precomputed k-hop neighborhoods (optional)
    khop_cache: HashMap<u32, Vec<HashSet<u32>>>,
    /// Number of GNN layers (k-hop receptive field)
    num_layers: usize,
}

/// Configuration for batch update optimization
#[derive(Debug, Clone)]
pub struct BatchUpdateConfig {
    /// Minimum changes to trigger batch update
    pub min_batch_size: usize,
    /// Maximum time to buffer changes (milliseconds)
    pub max_buffer_time_ms: u64,
    /// Whether to coalesce overlapping dirty regions
    pub coalesce_regions: bool,
    /// Whether to sort affected nodes topologically
    pub topological_sort: bool,
}

/// Performance metrics for incremental updates
#[derive(Debug, Default)]
pub struct IncrementalMetrics {
    /// Total incremental updates performed
    pub total_updates: u64,
    /// Average affected region size
    pub avg_affected_size: f64,
    /// Average update latency (microseconds)
    pub avg_update_latency_us: f64,
    /// Percentage of nodes recomputed (vs full graph)
    pub recompute_percentage: f64,
    /// Cache hit rate
    pub cache_hit_rate: f64,
    /// Time saved vs full recomputation
    pub time_saved_ratio: f64,
}

#[derive(Debug, Default)]
struct CacheStats {
    pub hits: u64,
    pub misses: u64,
    pub evictions: u64,
}

/// Result of dependency analysis
#[derive(Debug)]
pub struct AffectedRegion {
    /// Nodes that need recomputation
    pub affected_nodes: HashSet<u32>,
    /// Organized by layer (for ordered processing)
    pub by_layer: Vec<Vec<u32>>,
    /// Estimated computation cost
    pub estimated_cost: usize,
}

/// Update plan for batch processing
pub struct UpdatePlan {
    /// Changes to apply
    pub changes: Vec<NodeChange>,
    /// Affected region
    pub affected_region: AffectedRegion,
    /// Execution order (topologically sorted)
    pub execution_order: Vec<u32>,
    /// Whether to invalidate cache entries
    pub invalidate_cache: bool,
}
```

### Key Algorithms (Pseudocode)

#### 1. Incremental GNN Update Algorithm

```python
def incremental_gnn_update(gnn: IncrementalGnn, changes: List[NodeChange]):
    """
    Update GNN embeddings incrementally based on changed nodes.

    Key idea: Only recompute nodes whose k-hop neighborhoods changed.
""" # Step 1: Mark dirty nodes dirty_nodes = set() for change in changes: dirty_nodes.add(change.node_id) gnn.change_tracker.mark_dirty(change.node_id) # Step 2: Compute affected region (k-hop propagation) affected_region = compute_affected_region( dirty_nodes, gnn.dependency_graph, k=gnn.num_layers ) # Step 3: Invalidate cache for affected nodes for layer in range(gnn.num_layers): for node in affected_region.affected_nodes: gnn.activation_cache.invalidate(layer, node) # Step 4: Incremental forward pass (layer by layer) for layer in range(gnn.num_layers): # Get nodes to recompute at this layer nodes_to_compute = affected_region.by_layer[layer] for node in sorted(nodes_to_compute): # Topological order # Check if we can reuse cached activation if gnn.activation_cache.is_valid(layer, node): continue # Skip, already computed # Get neighbors from previous layer neighbors = gnn.dependency_graph.get_neighbors(node) neighbor_embeddings = [] for neighbor in neighbors: # Try to reuse cached embedding from previous layer if layer == 0: # Base features emb = gnn.get_node_features(neighbor) else: # Check cache first cached = gnn.activation_cache.get(layer - 1, neighbor) if cached is not None and cached.is_valid: emb = cached.embedding # Reuse! else: # Recursive recomputation (should not happen often) emb = recompute_node(gnn, neighbor, layer - 1) neighbor_embeddings.append(emb) # Apply GNN layer (attention, aggregation, etc.) 
new_embedding = gnn.gnn_layers[layer].forward( node_features=gnn.get_node_features(node), neighbor_embeddings=neighbor_embeddings, edge_features=gnn.get_edge_features(node, neighbors) ) # Update cache gnn.activation_cache.set( layer, node, CachedActivation( embedding=new_embedding, timestamp=gnn.change_tracker.update_counter, is_valid=True ) ) # Step 5: Clear dirty flags gnn.change_tracker.clear_dirty(dirty_nodes) gnn.change_tracker.update_counter += 1 # Step 6: Update metrics gnn.metrics.record_update( affected_size=len(affected_region.affected_nodes), total_nodes=gnn.dependency_graph.num_nodes() ) function compute_affected_region(dirty_nodes, graph, k): """ Compute k-hop affected region via BFS. Returns nodes that need recomputation due to changed neighborhoods. """ affected = set(dirty_nodes) current_frontier = set(dirty_nodes) # Propagate for k hops for hop in range(k): next_frontier = set() for node in current_frontier: # Get neighbors (reverse direction: who depends on this node?) # In GNN, node v depends on neighbors N(v), so we need reverse edges neighbors = graph.get_reverse_neighbors(node) for neighbor in neighbors: if neighbor not in affected: affected.add(neighbor) next_frontier.add(neighbor) current_frontier = next_frontier if not current_frontier: break # No more propagation needed # Organize by layer for ordered processing by_layer = organize_by_layer(affected, graph, k) return AffectedRegion( affected_nodes=affected, by_layer=by_layer, estimated_cost=len(affected) ) function organize_by_layer(affected_nodes, graph, num_layers): """ Organize affected nodes by layer for correct processing order. Layer 0 nodes must be computed before Layer 1, etc. 
""" by_layer = [[] for _ in range(num_layers)] # Topological sort by dependency depth for node in affected_nodes: # Compute minimum layer where this node needs recomputation # (based on its position in the dependency graph) layer = compute_required_layer(node, graph, num_layers) by_layer[layer].append(node) return by_layer function recompute_node(gnn, node, layer): """ Recursively recompute a node's embedding at a given layer. This should be rare if cache is working properly. """ if layer == 0: return gnn.get_node_features(node) # Get neighbors from previous layer neighbors = gnn.dependency_graph.get_neighbors(node) neighbor_embeddings = [ recompute_node(gnn, neighbor, layer - 1) for neighbor in neighbors ] # Apply GNN layer embedding = gnn.gnn_layers[layer].forward( node_features=gnn.get_node_features(node), neighbor_embeddings=neighbor_embeddings, edge_features=gnn.get_edge_features(node, neighbors) ) # Cache result gnn.activation_cache.set(layer, node, CachedActivation( embedding=embedding, timestamp=gnn.change_tracker.update_counter, is_valid=True )) return embedding ``` #### 2. Batch Update Optimization ```python function batch_update_optimizer(gnn: IncrementalGnn): """ Buffer and coalesce changes for efficient batch processing. Reduces overhead of many small updates. """ buffer = gnn.change_tracker.pending_changes config = gnn.batch_config while True: # Wait for trigger condition if len(buffer) < config.min_batch_size: sleep_until(timeout=config.max_buffer_time_ms) if len(buffer) == 0: continue # Collect all pending changes changes = buffer.drain() # Coalesce overlapping dirty regions if config.coalesce_regions: changes = coalesce_changes(changes) # Create update plan plan = create_update_plan(gnn, changes) # Execute batch update execute_update_plan(gnn, plan) function coalesce_changes(changes): """ Merge overlapping changes to reduce redundant computation. Example: If node A changes at t=1 and t=5, only keep t=5. 
""" # Deduplicate by node_id, keep latest timestamp latest_changes = {} for change in changes: node = change.node_id if node not in latest_changes or change.timestamp > latest_changes[node].timestamp: latest_changes[node] = change return list(latest_changes.values()) function create_update_plan(gnn, changes): """ Create optimized execution plan for batch update. """ # Compute affected region for all changes dirty_nodes = {change.node_id for change in changes} affected_region = compute_affected_region( dirty_nodes, gnn.dependency_graph, k=gnn.num_layers ) # Topologically sort affected nodes for correct order if gnn.batch_config.topological_sort: execution_order = topological_sort( affected_region.affected_nodes, gnn.dependency_graph ) else: execution_order = list(affected_region.affected_nodes) return UpdatePlan( changes=changes, affected_region=affected_region, execution_order=execution_order, invalidate_cache=True ) function execute_update_plan(gnn, plan): """ Execute batch update with write lock on activation cache. """ # Acquire write lock (blocks queries briefly) with gnn.activation_cache.write_lock(): incremental_gnn_update(gnn, plan.changes) # Queries can resume with updated embeddings ``` #### 3. Concurrent Query Support ```python function query_with_incremental_gnn(gnn, query_vector, k): """ Query GNN embeddings while updates are happening. Uses read-write locks to allow concurrent reads. 
""" # Acquire read lock (multiple queries can read concurrently) with gnn.activation_cache.read_lock(): # Get embeddings from cache (might be slightly stale) embeddings = [] for node_id in gnn.graph.all_nodes(): # Try to get from cache cached = gnn.activation_cache.get( layer=gnn.num_layers - 1, # Final layer node=node_id ) if cached is not None and cached.is_valid: embeddings.append((node_id, cached.embedding)) else: # Fallback: use base features (no GNN) base_features = gnn.get_node_features(node_id) embeddings.append((node_id, base_features)) # Perform similarity search results = search_similar(query_vector, embeddings, k) return results ``` ### API Design (Function Signatures) ```rust // File: crates/ruvector-gnn/src/incremental/mod.rs impl IncrementalGnn { /// Create a new incremental GNN system pub fn new( graph: Arc, num_layers: usize, batch_config: BatchUpdateConfig, ) -> Result; /// Record a node feature update (triggers incremental recomputation) pub fn update_node_features( &mut self, node_id: u32, new_features: &[f32], ) -> Result<(), GnnError>; /// Record edge changes (triggers incremental recomputation) pub fn update_edges( &mut self, node_id: u32, added_edges: &[(u32, u32)], removed_edges: &[(u32, u32)], ) -> Result<(), GnnError>; /// Perform incremental update based on pending changes pub fn apply_incremental_update(&mut self) -> Result; /// Force full graph recomputation (fallback) pub fn full_recompute(&mut self) -> Result<(), GnnError>; /// Get cached embedding for a node pub fn get_embedding( &self, node_id: u32, layer: usize, ) -> Option>; /// Check if cached embedding is valid pub fn is_embedding_valid( &self, node_id: u32, layer: usize, ) -> bool; /// Get incremental update metrics pub fn metrics(&self) -> &IncrementalMetrics; /// Clear all cached activations pub fn clear_cache(&mut self); } impl ChangeTracker { /// Mark a node as dirty (needs recomputation) pub fn mark_dirty(&mut self, node_id: u32); /// Check if a node is dirty pub fn 
is_dirty(&self, node_id: u32) -> bool; /// Clear dirty flag for a node pub fn clear_dirty(&mut self, node_id: u32); /// Get all dirty nodes pub fn get_dirty_nodes(&self) -> Vec; /// Buffer a change for batch processing pub fn buffer_change(&mut self, change: NodeChange); /// Drain all buffered changes pub fn drain_buffered(&mut self) -> Vec; } impl ActivationCache { /// Create a new activation cache pub fn new(num_layers: usize, max_size_per_layer: usize) -> Self; /// Get cached activation pub fn get(&self, layer: usize, node_id: u32) -> Option; /// Set cached activation pub fn set(&mut self, layer: usize, node_id: u32, activation: CachedActivation); /// Invalidate cached activation pub fn invalidate(&mut self, layer: usize, node_id: u32); /// Check if activation is valid pub fn is_valid(&self, layer: usize, node_id: u32) -> bool; /// Acquire read lock (for concurrent queries) pub fn read_lock(&self) -> RwLockReadGuard<'_, HashMap>; /// Acquire write lock (for updates) pub fn write_lock(&mut self) -> RwLockWriteGuard<'_, HashMap>; /// Get cache statistics pub fn stats(&self) -> &CacheStats; /// Clear all cached activations pub fn clear(&mut self); } impl DependencyGraph { /// Create dependency graph from HNSW graph pub fn from_hnsw(graph: Arc, num_layers: usize) -> Self; /// Compute k-hop affected region from dirty nodes pub fn compute_affected_region( &self, dirty_nodes: &HashSet, ) -> AffectedRegion; /// Get reverse neighbors (who depends on this node?) pub fn get_reverse_neighbors(&self, node_id: u32) -> Vec; /// Precompute k-hop neighborhoods (optional optimization) pub fn precompute_khop_cache(&mut self) -> Result<(), GnnError>; } #[derive(Debug)] pub struct UpdateStats { /// Number of nodes recomputed pub nodes_recomputed: usize, /// Total nodes in graph pub total_nodes: usize, /// Update latency (microseconds) pub latency_us: u64, /// Speedup vs full recomputation pub speedup_ratio: f64, } ``` ## Integration Points ### Affected Crates/Modules 1. 
**`ruvector-gnn`** (Primary)
   - New module: `src/incremental/mod.rs` - Core ATLAS system
   - New module: `src/incremental/change_tracker.rs` - Dirty node tracking
   - New module: `src/incremental/activation_cache.rs` - Embedding caching
   - New module: `src/incremental/dependency.rs` - Dependency analysis
   - Modified: `src/lib.rs` - Export incremental types

2. **`ruvector-core`** (Integration)
   - Modified: `src/index/hnsw.rs` - Notify GNN of graph changes
   - New: `src/index/hnsw_events.rs` - Event system for graph updates
   - Modified: `src/vector_store.rs` - Trigger incremental updates on insert/delete

3. **`ruvector-api`** (Configuration)
   - Modified: `src/config.rs` - Add incremental GNN config
   - Modified: `src/index_manager.rs` - Manage incremental update lifecycle

### New Modules to Create

```
crates/ruvector-gnn/
├── src/
│   ├── incremental/
│   │   ├── mod.rs               # Core IncrementalGnn
│   │   ├── change_tracker.rs    # ChangeTracker implementation
│   │   ├── activation_cache.rs  # ActivationCache implementation
│   │   ├── dependency.rs        # DependencyGraph implementation
│   │   ├── batch_optimizer.rs   # Batch update optimization
│   │   └── metrics.rs           # Performance tracking
crates/ruvector-core/
├── src/
│   ├── index/
│   │   ├── hnsw_events.rs       # Event system for graph changes
examples/
├── incremental_gnn/
│   ├── benchmark_updates.rs     # Benchmark incremental vs full
│   ├── streaming_workload.rs    # Real-time streaming example
│   └── README.md
```

### Dependencies on Other Features

**Depends On:**
- **GNN Layer Implementation (Issue #38)**: Needs working GNN layers to recompute embeddings
- **HNSW Index**: Needs the graph structure for dependency analysis

**Synergies With:**
- **GNN-Guided Routing (Feature 1)**: Incremental updates keep the routing model fresh
- **Neuro-Symbolic Query (Feature 3)**: Faster updates enable real-time constraint learning

**External Dependencies:**
- `bitvec` - Efficient BitSet for dirty node tracking
- `parking_lot` - RwLock for concurrent cache access
- `crossbeam` - Batch processing queue
(optional)

## Regression Prevention

### What Existing Functionality Could Break

1. **GNN Embedding Correctness**
   - Risk: Incremental updates produce different embeddings than full recomputation
   - Impact: Incorrect query results, embedding drift
2. **Memory Leaks**
   - Risk: Activation cache grows unbounded if not evicted
   - Impact: OOM crashes
3. **Deadlocks**
   - Risk: Read-write lock contention between queries and updates
   - Impact: System hangs
4. **Stale Embeddings**
   - Risk: Cache invalidation logic misses affected nodes
   - Impact: Queries use outdated embeddings
5. **Update Ordering**
   - Risk: Concurrent updates applied in the wrong order
   - Impact: Inconsistent graph state

### Test Cases to Prevent Regressions

```rust
// File: crates/ruvector-gnn/tests/incremental_regression_tests.rs

#[test]
fn test_incremental_matches_full_recomputation() {
    // Incremental updates must produce identical embeddings to a full recompute
    let graph = build_test_graph(1000);
    let mut gnn_full = FullGnn::new(&graph, 3); // num_layers = 3
    let mut gnn_inc = IncrementalGnn::new(&graph, 3);

    // Apply 100 random updates
    let updates = generate_random_updates(100);

    // Full recomputation
    for update in &updates {
        apply_update_full(&mut gnn_full, update);
    }
    gnn_full.recompute_all();

    // Incremental updates
    for update in &updates {
        apply_update_incremental(&mut gnn_inc, update);
    }
    gnn_inc.apply_incremental_update();

    // Compare embeddings (should be identical within floating-point tolerance)
    for node_id in 0..1000 {
        let emb_full = gnn_full.get_embedding(node_id, 2); // final layer
        let emb_inc = gnn_inc.get_embedding(node_id, 2).unwrap();
        assert_embeddings_equal(&emb_full, &emb_inc, 1e-5);
    }
}

#[test]
fn test_cache_invalidation_correctness() {
    // All affected nodes must have their cache entries invalidated
    let graph = build_test_graph(1000);
    let mut gnn = IncrementalGnn::new(&graph, 3);

    // Mark node 42 as dirty
    gnn.update_node_features(42, &random_features());

    // Compute the affected region (3-hop)
    let affected = gnn.dependency_graph.compute_affected_region(&hashset! {42});

    // Check cache invalidation
    for node in &affected.affected_nodes {
        for layer in 0..3 {
            assert!(
                !gnn.activation_cache.is_valid(layer, *node),
                "Node {} layer {} should be invalidated", node, layer
            );
        }
    }
}

#[test]
fn test_incremental_speedup() {
    // Incremental updates must be ≥10x faster than a full recompute
    let graph = build_test_graph(100_000);
    let mut gnn_full = FullGnn::new(&graph, 3);
    let mut gnn_inc = IncrementalGnn::new(&graph, 3);

    // Small update (100 nodes)
    let updates = generate_random_updates(100);

    // Benchmark full recomputation
    let start = Instant::now();
    for update in &updates {
        apply_update_full(&mut gnn_full, update);
    }
    gnn_full.recompute_all();
    let full_time = start.elapsed();

    // Benchmark incremental
    let start = Instant::now();
    for update in &updates {
        apply_update_incremental(&mut gnn_inc, update);
    }
    gnn_inc.apply_incremental_update();
    let inc_time = start.elapsed();

    let speedup = full_time.as_secs_f64() / inc_time.as_secs_f64();
    assert!(speedup >= 10.0, "Speedup: {:.1}x, expected ≥10x", speedup);
}

#[test]
fn test_concurrent_query_update() {
    // Queries must not block on updates (concurrent reads)
    let graph = Arc::new(build_test_graph(10_000));
    let gnn = Arc::new(RwLock::new(IncrementalGnn::new(&graph, 3)));

    // Spawn an update thread
    let gnn_update = Arc::clone(&gnn);
    let update_handle = thread::spawn(move || {
        loop {
            let mut g = gnn_update.write().unwrap();
            g.update_node_features(rand::random(), &random_features());
            g.apply_incremental_update().unwrap();
            drop(g); // Release the lock
            sleep(Duration::from_millis(10));
        }
    });

    // Spawn query threads
    let query_handles: Vec<_> = (0..8)
        .map(|_| {
            let gnn_query = Arc::clone(&gnn);
            thread::spawn(move || {
                for _ in 0..1000 {
                    let g = gnn_query.read().unwrap();
                    let emb = g.get_embedding(rand::random::<u32>() % 10_000, 2);
                    assert!(emb.is_some());
                    drop(g); // Release the lock
                }
            })
        })
        .collect();

    // Wait for queries to complete
    for handle in query_handles {
        handle.join().unwrap();
    }
    // Should complete without deadlocks
}

#[test]
fn test_cache_memory_bounded() {
    // The cache must not exceed its configured size limit
    let graph = build_test_graph(100_000);
    let mut gnn = IncrementalGnn::new(&graph, 3);

    // Configure a small cache (1000 entries per layer)
    gnn.activation_cache = ActivationCache::new(3, 1000);

    // Perform many updates (should trigger evictions)
    for _ in 0..10_000 {
        gnn.update_node_features(rand::random(), &random_features());
        gnn.apply_incremental_update().unwrap();
    }

    // Check the cache size
    for layer in 0..3 {
        let cache_size = gnn.activation_cache.layer_size(layer);
        assert!(
            cache_size <= 1000,
            "Layer {} cache size: {}, expected ≤1000", layer, cache_size
        );
    }
}
```

### Backward Compatibility Strategy

1. **Default Disabled**
   - Incremental GNN is opt-in via configuration
   - Existing code defaults to full recomputation
2. **Graceful Fallback**
   - If an incremental update fails, fall back to a full recompute
   - Log a warning but do not crash
3. **Configuration Schema**

   ```yaml
   gnn:
     incremental:
       enabled: false  # Default: disabled
       batch_size: 100
       max_buffer_time_ms: 1000
       cache_size_per_layer: 10000
   ```

4. **API Compatibility**
   - Existing `Gnn::recompute()` still works (full recompute)
   - New `Gnn::incremental_update()` method added

## Implementation Phases

### Phase 1: Core Infrastructure (Week 1-2)

**Goal**: Working change tracking and activation cache

**Tasks**:
1. Implement `ChangeTracker` with BitSet
2. Implement `ActivationCache` with RwLock
3. Add unit tests for both
4.
Benchmark cache performance (hit rate, contention)

**Deliverables**:
- `incremental/change_tracker.rs`
- `incremental/activation_cache.rs`
- Passing unit tests
- Benchmark report

**Success Criteria**:
- Change tracking overhead <1% of update time
- Cache hit rate >90% for typical workloads
- No deadlocks in concurrent access

### Phase 2: Dependency Analysis (Week 2-3)

**Goal**: Compute affected regions correctly

**Tasks**:
1. Implement `DependencyGraph` with k-hop BFS
2. Add topological sorting for update order
3. Test affected region computation on various graph topologies
4. Optimize with k-hop caching (optional)

**Deliverables**:
- `incremental/dependency.rs`
- Tests for k-hop propagation
- Performance benchmarks

**Success Criteria**:
- Affected region computation <10ms for 1K dirty nodes
- Correct propagation (matches ground truth)
- Handles edge cases (disconnected components, cycles)

### Phase 3: Incremental Forward Pass (Week 3-4)

**Goal**: Recompute only affected nodes

**Tasks**:
1. Implement the incremental forward pass algorithm
2. Integrate with existing GNN layers
3. Add cache reuse logic
4. Test correctness vs full recomputation
5. Benchmark speedup

**Deliverables**:
- `incremental/mod.rs` (core algorithm)
- Correctness tests
- Performance benchmarks

**Success Criteria**:
- Embeddings match full recomputation (within tolerance)
- ≥10x speedup for small updates (<1% of graph)
- ≥100x speedup for tiny updates (<0.1% of graph)

### Phase 4: Batch Optimization (Week 4-5)

**Goal**: Efficient batch processing of updates

**Tasks**:
1. Implement the batch update optimizer
2. Add change coalescing logic
3. Tune buffer size and timeout
4. Benchmark throughput improvement

**Deliverables**:
- `incremental/batch_optimizer.rs`
- Batch processing benchmarks
- Configuration guide

**Success Criteria**:
- Batch updates 2-5x faster than individual updates
- Latency <50ms for 1K batched changes
- No excessive buffering delays

### Phase 5: Production Hardening (Week 5-6)

**Goal**: Production-ready with safety guarantees

**Tasks**:
1. Add comprehensive error handling
2. Implement fallback to full recompute on errors
3. Add telemetry and observability
4. Write documentation
5. Stress testing (10M+ nodes, concurrent workloads)

**Deliverables**:
- Full error handling
- Regression test suite
- User documentation
- Performance report

**Success Criteria**:
- Zero crashes in stress tests
- Graceful degradation on errors
- Documentation complete

## Success Metrics

### Performance Benchmarks

**Primary Metrics** (Must Achieve):

| Workload | Current (Full) | Target (ATLAS) | Improvement |
|----------|----------------|----------------|-------------|
| 100 vector updates | 50ms | 0.5ms | **100x** |
| 1,000 vector updates | 500ms | 5ms | **100x** |
| 10,000 vector updates | 5s | 50ms | **100x** |
| Continuous stream (1K/s) | Blocked | 1K/s sustained | **∞** |

**Secondary Metrics**:

| Metric | Target |
|--------|--------|
| Cache hit rate | >90% |
| Memory overhead | <10% of base GNN |
| Concurrent query throughput | No degradation |
| Affected region ratio | <5% of graph (for 0.1% dirty nodes) |

### Accuracy Metrics

**Embedding Correctness**:
- Incremental embeddings must match full recomputation within a `1e-5` tolerance (floating-point)
- Zero embedding drift over 1M updates

**Cache Invalidation**:
- 100% of affected nodes have their cache entries invalidated (no stale embeddings used)
- Zero false negatives (missed invalidations)

### Memory/Latency Targets

**Memory**:
- Activation cache: <100MB per 1M nodes
- Change tracker: <10MB per 1M nodes (BitSet)
- Total overhead: <10% of base GNN memory

**Latency**:
- Update latency (100 vectors):
<1ms - Update latency (1K vectors): <10ms - Update latency (10K vectors): <100ms - Query latency: No increase (concurrent reads) **Throughput**: - Sustained update rate: 10,000 vectors/second - Batch update throughput: 100,000 vectors/second ## Risks and Mitigations ### Technical Risks **Risk 1: Cache Invalidation Bugs** *Probability: High | Impact: Critical* **Description**: Missing cache invalidations could cause stale embeddings to be used, leading to incorrect query results. **Mitigation**: - Extensive testing with known ground truth - Add assertion checks in debug builds (compare incremental vs full) - Implement cache consistency validation tool - Conservative invalidation (over-invalidate rather than under-invalidate) - Monitor embedding drift metrics in production **Contingency**: Add "full recompute verification" mode that periodically checks incremental results against full recompute. --- **Risk 2: Concurrency Bugs (Deadlocks, Race Conditions)** *Probability: Medium | Impact: High* **Description**: RwLock usage could introduce deadlocks or race conditions between queries and updates. **Mitigation**: - Use proven lock-free data structures where possible - Lock ordering discipline (always acquire in same order) - Timeout on lock acquisition - Extensive concurrency testing with ThreadSanitizer - Use parking_lot for better performance and diagnostics **Contingency**: Fallback to single-threaded updates if concurrency issues arise. --- **Risk 3: Memory Leak from Unbounded Cache** *Probability: Medium | Impact: Medium* **Description**: Activation cache could grow unbounded if eviction policy fails. **Mitigation**: - Implement strict LRU eviction - Set hard memory limits with monitoring - Add memory pressure detection - Test with long-running workloads - Provide cache clear API for manual intervention **Contingency**: Add periodic cache clearing (e.g., every 1M updates) as safety net. 
---

**Risk 4: k-Hop Propagation Overhead**

*Probability: Low | Impact: Medium*

**Description**: Computing k-hop affected regions could be slow on dense graphs.

**Mitigation**:
- Precompute k-hop neighborhoods (optional)
- Use approximate k-hop (prune low-degree nodes)
- Parallelize BFS traversal
- Cache affected regions for repeated patterns
- Profile and optimize hot paths

**Contingency**: Add a configurable k-hop limit (users can reduce k if needed).

---

**Risk 5: Divergence from Full Recomputation**

*Probability: Low | Impact: High*

**Description**: Incremental updates could accumulate numerical errors, causing embedding drift over time.

**Mitigation**:
- Use the same floating-point precision as full recompute
- Periodically run full recomputation to reset state (e.g., daily)
- Monitor embedding distance metrics
- Add numerical stability tests
- Use higher precision (f64) for accumulation if needed

**Contingency**: Implement a "full recompute every N updates" policy.

---

**Risk 6: Complex Debugging**

*Probability: High | Impact: Medium*

**Description**: Incremental update bugs are harder to debug than full-recomputation bugs.

**Mitigation**:
- Add extensive logging and telemetry
- Implement deterministic replay of update sequences
- Provide debugging tools (cache inspector, affected-region visualizer)
- Add assertion modes for validation
- Document common failure modes

**Contingency**: Provide a "debug mode" that runs both incremental and full computation in parallel for comparison.

---

### Summary Risk Matrix

| Risk | Probability | Impact | Mitigation Priority |
|------|-------------|--------|---------------------|
| Cache invalidation bugs | High | Critical | **CRITICAL** |
| Concurrency bugs | Medium | High | **HIGH** |
| Memory leak | Medium | Medium | HIGH |
| k-hop overhead | Low | Medium | Medium |
| Embedding divergence | Low | High | Medium |
| Complex debugging | High | Medium | LOW |

---

## Next Steps

1. **Prototype Phase 1**: Build change tracker and activation cache (1 week)
2. **Validate Approach**: Test on small graph (1K nodes), measure speedup (2 days)
3. **Scale Testing**: Test on realistic graph (100K nodes), identify bottlenecks (3 days)
4. **Integration**: Connect to HNSW index updates (1 week)
5. **Optimization**: Profile and optimize hot paths (ongoing)

**Key Decision Points**:
- After Phase 1: Is cache overhead acceptable? (<10% memory)
- After Phase 3: Does speedup meet targets? (≥10x required)
- After Phase 5: Are embeddings correct? (Pass all regression tests)

**Go/No-Go Criteria**:
- ✅ 10x+ speedup on small updates
- ✅ Zero embedding correctness regressions
- ✅ No concurrency bugs in stress tests
- ✅ Memory overhead <10%