# Incremental Graph Learning (ATLAS) - Implementation Plan

## Overview

### Problem Statement

GNN computation in ruvector currently uses **full-graph recomputation**: whenever the graph changes (vectors added, edges modified), the entire GNN re-runs its forward pass over all nodes. This causes severe performance bottlenecks:

- **Slow Updates**: Adding 1,000 vectors to a 1M-node graph requires recomputing 1M+ node embeddings
- **Wasted Computation**: Most nodes are unaffected by localized changes
- **Poor Scalability**: Update time is O(N), where N is the total graph size, regardless of how small the change is
- **Latency Spikes**: Updates block queries, degrading P99 latency
- **Memory Pressure**: Activations for the full graph are stored during backpropagation

Real-world impact:

- Vector insertion rate is limited to ~100 vectors/second (vs. 10,000+ for index-only updates)
- GNN updates take 10-100x longer than HNSW index updates
- Real-time streaming workloads cannot be supported

### Proposed Solution

**ATLAS (Adaptive Topology-Aware Learning Accelerator System)**: An incremental graph learning framework that updates only the affected subgraphs:

1. **Dirty Node Tracking**: Mark nodes whose features or edges changed
2. **Dependency Propagation**: Compute the k-hop affected region (receptive field)
3. **Incremental Forward Pass**: Recompute only dirty and affected nodes
4. **Activation Caching**: Reuse cached activations for unchanged nodes
5. **Lazy Materialization**: Defer updates so that changes can be applied in batches

**Key Insight**: Graph neural networks have bounded receptive fields. A k-layer GNN computes a node's embedding from its k-hop neighborhood only, so if nothing in that neighborhood changed (features or edges), the embedding is unchanged and the cached value can be reused.

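
To make the bounded-receptive-field argument concrete, the sketch below computes a rough upper bound on the affected-region size. The function name `affected_upper_bound` and the uniform `degree` parameter are illustrative assumptions, not part of ruvector. The bound ignores overlap between neighborhoods, so real regions are usually smaller.

```rust
/// Upper bound on the number of nodes affected by `dirty` changed nodes in a
/// graph of `total` nodes, assuming every node has `degree` dependents and the
/// GNN has `k` layers. Hypothetical helper for illustration only.
fn affected_upper_bound(dirty: u64, degree: u64, k: u32, total: u64) -> u64 {
    // One dirty node reaches at most 1 + d + d^2 + ... + d^k nodes in k hops.
    let mut per_node: u64 = 0;
    let mut term: u64 = 1;
    for _ in 0..=k {
        per_node = per_node.saturating_add(term);
        term = term.saturating_mul(degree);
    }
    // The region can never exceed the whole graph.
    dirty.saturating_mul(per_node).min(total)
}
```

Under these assumptions, 10 dirty nodes with degree 16 and k = 2 touch at most 10 x (1 + 16 + 256) = 2,730 nodes. With 1,000 dirty nodes and k = 3 the bound already reaches the size of a 1M-node graph, which is the regime where a full recompute may be the better choice.
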
### Expected Benefits

**Quantified Performance Improvements:**

| Metric | Current (Full) | ATLAS (Incremental) | Improvement |
|--------|----------------|---------------------|-------------|
| Update Latency (1K vectors) | 500ms | 5ms | **100x faster** |
| Update Latency (10K vectors) | 5s | 50ms | **100x faster** |
| Throughput (vectors/sec) | 100 | 10,000 | **100x higher** |
| Memory (activation storage) | 1GB (full graph) | 10MB (dirty region) | **100x reduction** |
| Query Availability | Blocked during update | Concurrent | **Continuous** |

**Qualitative Benefits:**

- Real-time vector streaming support
- No query latency spikes during updates
- Memory-efficient updates
- Support for continuous learning workflows

## Technical Design

### Architecture Diagram (ASCII Art)

```
┌────────────────────────────────────────────────────────────────┐
│               ATLAS Incremental Learning System                │
└────────────────────────────────────────────────────────────────┘

                   Vector Insert/Update/Delete
                                │
                                ▼
┌────────────────────────────────────────────────────────────────┐
│                         Change Tracker                         │
│ ┌────────────────────────────────────────────────────────────┐ │
│ │ Dirty Node Set (BitSet)                                    │ │
│ │ - Nodes with changed features: [42, 137, 1025, ...]        │ │
│ │ - Nodes with changed edges: [43, 138, ...]                 │ │
│ │ - Timestamps: last_modified[node_id] = timestamp           │ │
│ └────────────────────────────────────────────────────────────┘ │
└────────────────────────────────────────────────────────────────┘
                                │
                                ▼
┌────────────────────────────────────────────────────────────────┐
│                      Dependency Analyzer                       │
│ ┌────────────────────────────────────────────────────────────┐ │
│ │ Compute Affected Region (k-hop BFS)                        │ │
│ │                                                            │ │
│ │ dirty_nodes = {42, 137, 1025}                              │ │
│ │       │                                                    │ │
│ │       ▼                                                    │ │
│ │ 1-hop neighbors: {41, 43, 136, 138, 1024, 1026}            │ │
│ │       │                                                    │ │
│ │       ▼                                                    │ │
│ │ 2-hop neighbors: {40, 44, 135, 139, ...}                   │ │
│ │       │                                                    │ │
│ │       ▼ (repeat for k hops)                                │ │
│ │ affected_region = dirty ∪ 1-hop ∪ 2-hop ∪ ... ∪ k-hop      │ │
│ └────────────────────────────────────────────────────────────┘ │
└────────────────────────────────────────────────────────────────┘
                                │
                                ▼
┌────────────────────────────────────────────────────────────────┐
│                        Activation Cache                        │
│ ┌────────────────────────────────────────────────────────────┐ │
│ │ Cached Embeddings (per layer)                              │ │
│ │ ┌──────────────────────────────────────────────┐           │ │
│ │ │ Layer 0: {node_id → embedding, timestamp}    │           │ │
│ │ │    42 → [0.1, 0.3, ...] (STALE - dirty)      │           │ │
│ │ │   100 → [0.5, 0.2, ...] (FRESH - reuse!)     │           │ │
│ │ │   137 → [0.8, 0.1, ...] (STALE - affected)   │           │ │
│ │ └──────────────────────────────────────────────┘           │ │
│ │ ┌──────────────────────────────────────────────┐           │ │
│ │ │ Layer 1: {node_id → embedding, timestamp}    │           │ │
│ │ │   ...                                        │           │ │
│ │ └──────────────────────────────────────────────┘           │ │
│ └────────────────────────────────────────────────────────────┘ │
└────────────────────────────────────────────────────────────────┘
                                │
                                ▼
┌────────────────────────────────────────────────────────────────┐
│                    Incremental Forward Pass                    │
│ ┌────────────────────────────────────────────────────────────┐ │
│ │ For each layer l in GNN:                                   │ │
│ │   For each node v in affected_region:                      │ │
│ │     if cached[l][v].is_fresh():                            │ │
│ │       embedding[l][v] = cached[l][v]  # Reuse!             │ │
│ │     else:                                                  │ │
│ │       # Recompute from previous layer                      │ │
│ │       neighbor_embeddings = [cached[l-1][n] for n in N(v)] │ │
│ │       embedding[l][v] = GNN_layer(neighbor_embeddings)     │ │
│ │       cached[l][v] = embedding[l][v]  # Update cache       │ │
│ └────────────────────────────────────────────────────────────┘ │
└────────────────────────────────────────────────────────────────┘
                                │
                                ▼
┌────────────────────────────────────────────────────────────────┐
│                     Batch Update Optimizer                     │
│ ┌────────────────────────────────────────────────────────────┐ │
│ │ Lazy Materialization:                                      │ │
│ │ - Buffer changes until threshold (time/count)              │ │
│ │ - Coalesce dirty regions (merge overlapping k-hop sets)    │ │
│ │ - Sort affected nodes by layer propagation order           │ │
│ │ - Execute single batch update instead of N small updates   │ │
│ └────────────────────────────────────────────────────────────┘ │
└────────────────────────────────────────────────────────────────┘
                                │
                                ▼
                Updated GNN Embeddings (partial)


┌────────────────────────────────────────────────────────────────┐
│              Query Path (Concurrent with Updates)              │
└────────────────────────────────────────────────────────────────┘

                          Query Request
                                │
                                ▼
┌────────────────────────────────────────────────────────────────┐
│               Read-Write Lock (Activation Cache)               │
│  - Queries acquire read lock (concurrent reads OK)             │
│  - Updates acquire write lock (blocks queries briefly)         │
│  - Most queries see slightly stale embeddings (acceptable)     │
└────────────────────────────────────────────────────────────────┘
                                │
                                ▼
          Retrieve embeddings from cache (mostly fresh)
                                │
                                ▼
                      Return query results
```

### Core Data Structures (Rust)

```rust
// File: crates/ruvector-gnn/src/incremental/mod.rs

use std::collections::{HashMap, HashSet, VecDeque};
use std::sync::{Arc, RwLock};
use bitvec::prelude::*;
use ndarray::Array2;

/// ATLAS incremental learning system
pub struct IncrementalGnn {
    /// Tracks which nodes have changed
    change_tracker: ChangeTracker,

    /// Caches computed activations per layer
    activation_cache: ActivationCache,

    /// Dependency graph for k-hop propagation
    dependency_graph: DependencyGraph,

    /// Batch update configuration
    batch_config: BatchUpdateConfig,

    /// Performance metrics
    metrics: IncrementalMetrics,

    /// GNN layer count (determines receptive field)
    num_layers: usize,
}

/// Tracks which nodes are dirty (need recomputation)
pub struct ChangeTracker {
    /// Dirty nodes (changed features or edges)
    dirty_nodes: BitVec,

    /// Timestamp of last modification per node
    last_modified: HashMap<u32, u64>,

    /// Global update counter
    update_counter: u64,

    /// Pending changes (buffered for batch processing)
    pending_changes: VecDeque<NodeChange>,
}

#[derive(Debug, Clone)]
pub enum NodeChange {
    /// Node features changed
    FeatureUpdate { node_id: u32, timestamp: u64 },

    /// Edges added/removed
    EdgeUpdate { node_id: u32, timestamp: u64 },

    /// Node deleted
    NodeDeleted { node_id: u32, timestamp: u64 },
}

/// Caches GNN activations (embeddings) per layer
pub struct ActivationCache {
    /// Cached embeddings per layer: layer_idx -> (node_id -> embedding)
    /// Each layer has its own RwLock for concurrent read access during queries
    cache: Vec<Arc<RwLock<HashMap<u32, CachedActivation>>>>,

    /// Maximum cache size per layer (LRU eviction)
    max_size_per_layer: usize,

    /// Total cache hits/misses
    stats: CacheStats,
}

#[derive(Debug, Clone)]
pub struct CachedActivation {
    /// Node embedding for this layer
    pub embedding: Array2<f32>,

    /// Timestamp when computed
    pub timestamp: u64,

    /// Whether this activation is still valid
    pub is_valid: bool,
}

/// Computes affected regions for incremental updates
pub struct DependencyGraph {
    /// Graph structure for k-hop traversal
    graph: Arc<HnswGraph>,

    /// Precomputed k-hop neighborhoods (optional)
    khop_cache: HashMap<u32, Vec<HashSet<u32>>>,

    /// Number of GNN layers (k-hop receptive field)
    num_layers: usize,
}

/// Configuration for batch update optimization
#[derive(Debug, Clone)]
pub struct BatchUpdateConfig {
    /// Minimum changes to trigger batch update
    pub min_batch_size: usize,

    /// Maximum time to buffer changes (milliseconds)
    pub max_buffer_time_ms: u64,

    /// Whether to coalesce overlapping dirty regions
    pub coalesce_regions: bool,

    /// Whether to sort affected nodes topologically
    pub topological_sort: bool,
}

/// Performance metrics for incremental updates
#[derive(Debug, Default)]
pub struct IncrementalMetrics {
    /// Total incremental updates performed
    pub total_updates: u64,

    /// Average affected region size
    pub avg_affected_size: f64,

    /// Average update latency (microseconds)
    pub avg_update_latency_us: f64,

    /// Percentage of nodes recomputed (vs full graph)
    pub recompute_percentage: f64,

    /// Cache hit rate
    pub cache_hit_rate: f64,

    /// Time saved vs full recomputation
    pub time_saved_ratio: f64,
}

/// Cache counters. Public because `ActivationCache::stats()` returns it.
#[derive(Debug, Default)]
pub struct CacheStats {
    pub hits: u64,
    pub misses: u64,
    pub evictions: u64,
}

/// Result of dependency analysis
#[derive(Debug)]
pub struct AffectedRegion {
    /// Nodes that need recomputation
    pub affected_nodes: HashSet<u32>,

    /// Organized by layer (for ordered processing)
    pub by_layer: Vec<Vec<u32>>,

    /// Estimated computation cost
    pub estimated_cost: usize,
}

/// Update plan for batch processing
pub struct UpdatePlan {
    /// Changes to apply
    pub changes: Vec<NodeChange>,

    /// Affected region
    pub affected_region: AffectedRegion,

    /// Execution order (topologically sorted)
    pub execution_order: Vec<u32>,

    /// Whether to invalidate cache entries
    pub invalidate_cache: bool,
}
```

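
As an illustration of the dirty-node tracking role of `ChangeTracker`, here is a minimal sketch that uses only the standard library. It swaps the `BitVec` from the `bitvec` crate for a hand-rolled `Vec<u64>` bitset so that it compiles on its own. The type name `DirtySet` and its methods are hypothetical and only mirror the intent of `mark_dirty`, `is_dirty`, `clear_dirty`, and `get_dirty_nodes`.

```rust
/// Minimal dirty-node bitset (illustrative stand-in for `BitVec`).
struct DirtySet {
    words: Vec<u64>, // bit `b` of word `w` represents node `w * 64 + b`
}

impl DirtySet {
    fn with_capacity(num_nodes: usize) -> Self {
        Self { words: vec![0; (num_nodes + 63) / 64] }
    }

    fn split(node_id: u32) -> (usize, u32) {
        ((node_id / 64) as usize, node_id % 64)
    }

    /// Mark a node dirty, growing the bitset if the id is out of range.
    fn mark(&mut self, node_id: u32) {
        let (w, b) = Self::split(node_id);
        if w >= self.words.len() {
            self.words.resize(w + 1, 0);
        }
        self.words[w] |= 1u64 << b;
    }

    fn is_dirty(&self, node_id: u32) -> bool {
        let (w, b) = Self::split(node_id);
        self.words.get(w).map_or(false, |word| word & (1u64 << b) != 0)
    }

    fn clear(&mut self, node_id: u32) {
        let (w, b) = Self::split(node_id);
        if let Some(word) = self.words.get_mut(w) {
            *word &= !(1u64 << b);
        }
    }

    /// All dirty node ids in ascending order.
    fn dirty_nodes(&self) -> Vec<u32> {
        let mut out = Vec::new();
        for (w, &word) in self.words.iter().enumerate() {
            let mut bits = word;
            while bits != 0 {
                out.push(w as u32 * 64 + bits.trailing_zeros());
                bits &= bits - 1; // clear the lowest set bit
            }
        }
        out
    }
}
```

A bitset keeps the per-node cost at one bit, which is why it suits graphs with millions of nodes. Whether `bitvec` or a manual bitset is used is an implementation detail.
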
### Key Algorithms (Pseudocode)

#### 1. Incremental GNN Update Algorithm

```python
def incremental_gnn_update(gnn: IncrementalGnn, changes: list[NodeChange]) -> None:
    """
    Update GNN embeddings incrementally based on changed nodes.

    Key idea: Only recompute nodes whose k-hop neighborhoods changed.
    """
    # Step 1: Mark dirty nodes
    dirty_nodes = set()
    for change in changes:
        dirty_nodes.add(change.node_id)
        gnn.change_tracker.mark_dirty(change.node_id)

    # Step 2: Compute affected region (k-hop propagation)
    affected_region = compute_affected_region(
        dirty_nodes,
        gnn.dependency_graph,
        k=gnn.num_layers,
    )

    # Step 3: Invalidate the cache entries that are about to be recomputed.
    # by_layer[layer] lists the nodes whose output at that layer can change.
    for layer in range(gnn.num_layers):
        for node in affected_region.by_layer[layer]:
            gnn.activation_cache.invalidate(layer, node)

    # Step 4: Incremental forward pass (layer by layer).
    # Layer l reads only layer l - 1, so node order within a layer is irrelevant.
    for layer in range(gnn.num_layers):
        for node in affected_region.by_layer[layer]:
            recompute_node(gnn, node, layer)

    # Step 5: Clear dirty flags
    for node in dirty_nodes:
        gnn.change_tracker.clear_dirty(node)
    gnn.change_tracker.update_counter += 1

    # Step 6: Update metrics
    gnn.metrics.record_update(
        affected_size=len(affected_region.affected_nodes),
        total_nodes=gnn.dependency_graph.num_nodes(),
    )


def compute_affected_region(dirty_nodes, graph, k):
    """
    Compute k-hop affected region via BFS.

    Returns nodes that need recomputation due to changed neighborhoods.
    """
    affected = set(dirty_nodes)
    hop_distance = {node: 0 for node in dirty_nodes}
    current_frontier = set(dirty_nodes)

    # Propagate for k hops
    for hop in range(1, k + 1):
        next_frontier = set()

        for node in current_frontier:
            # Get neighbors (reverse direction: who depends on this node?)
            # In GNN, node v depends on neighbors N(v), so we need reverse edges
            for neighbor in graph.get_reverse_neighbors(node):
                if neighbor not in affected:
                    affected.add(neighbor)
                    hop_distance[neighbor] = hop
                    next_frontier.add(neighbor)

        current_frontier = next_frontier

        if not current_frontier:
            break  # No more propagation needed

    # Organize by layer for ordered processing
    by_layer = organize_by_layer(hop_distance, k)

    return AffectedRegion(
        affected_nodes=affected,
        by_layer=by_layer,
        estimated_cost=sum(len(nodes) for nodes in by_layer),
    )


def organize_by_layer(hop_distance, num_layers):
    """
    Organize affected nodes by layer for correct processing order.

    Layer 0 nodes must be computed before Layer 1, etc. The output of layer l
    (0-indexed) depends on l + 1 hops of input, so a node at hop distance d
    from a dirty node must be recomputed at every layer l with l + 1 >= d.
    """
    by_layer = [[] for _ in range(num_layers)]

    for node, distance in hop_distance.items():
        first_layer = max(distance - 1, 0)
        for layer in range(first_layer, num_layers):
            by_layer[layer].append(node)

    return by_layer


def layer_input(gnn, node, layer):
    """
    Return the input embedding of `node` for `layer`: base features for
    layer 0, otherwise the output of layer - 1 (taken from the cache when valid).
    """
    if layer == 0:
        return gnn.get_node_features(node)

    cached = gnn.activation_cache.get(layer - 1, node)
    if cached is not None and cached.is_valid:
        return cached.embedding  # Reuse!

    # Cache miss (e.g. evicted entry): recursive recomputation.
    # This should be rare if the cache is working properly.
    return recompute_node(gnn, node, layer - 1)


def recompute_node(gnn, node, layer):
    """
    Recompute a node's embedding at a given layer and cache the result.
    """
    # Get neighbor inputs from the previous layer
    neighbors = gnn.dependency_graph.get_neighbors(node)
    neighbor_embeddings = [layer_input(gnn, n, layer) for n in neighbors]

    # Apply GNN layer (attention, aggregation, etc.)
    embedding = gnn.gnn_layers[layer].forward(
        node_features=gnn.get_node_features(node),
        neighbor_embeddings=neighbor_embeddings,
        edge_features=gnn.get_edge_features(node, neighbors),
    )

    # Cache result
    gnn.activation_cache.set(layer, node, CachedActivation(
        embedding=embedding,
        timestamp=gnn.change_tracker.update_counter,
        is_valid=True,
    ))

    return embedding
```

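
The k-hop propagation above can also be sketched in Rust. This is a minimal, standard-library-only illustration: the reverse adjacency is passed in as a plain `HashMap` (an assumption made here for self-containment; the plan itself reads it from `DependencyGraph::get_reverse_neighbors`), and the function name `affected_by_layer` is hypothetical.

```rust
use std::collections::{HashMap, HashSet};

/// For each layer `l` in `0..k`, return the set of nodes whose layer-`l`
/// output can change when `dirty` nodes change. `reverse_adj[u]` lists the
/// nodes that aggregate from `u` (that is, the nodes that depend on `u`).
fn affected_by_layer(
    dirty: &HashSet<u32>,
    reverse_adj: &HashMap<u32, Vec<u32>>,
    k: usize,
) -> Vec<HashSet<u32>> {
    let mut by_layer = Vec::with_capacity(k);
    let mut affected: HashSet<u32> = dirty.clone();
    let mut frontier: Vec<u32> = dirty.iter().copied().collect();

    for _ in 0..k {
        let mut next = Vec::new();
        for node in &frontier {
            if let Some(dependents) = reverse_adj.get(node) {
                for &dep in dependents {
                    // `insert` returns true only for newly reached nodes.
                    if affected.insert(dep) {
                        next.push(dep);
                    }
                }
            }
        }
        frontier = next;
        // Layer l sees l + 1 hops: snapshot the region reached so far.
        by_layer.push(affected.clone());
    }
    by_layer
}
```

The last entry is the full k-hop region, and earlier entries are subsets of it, so shallow layers recompute fewer nodes than deep ones.
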
#### 2. Batch Update Optimization

```python
def batch_update_optimizer(gnn: IncrementalGnn) -> None:
    """
    Buffer and coalesce changes for efficient batch processing.

    Reduces overhead of many small updates.
    """
    tracker = gnn.change_tracker
    config = gnn.batch_config

    while True:
        # Wait for trigger condition: enough changes, or the buffer timeout
        if len(tracker.pending_changes) < config.min_batch_size:
            sleep_until(timeout=config.max_buffer_time_ms)

        if len(tracker.pending_changes) == 0:
            continue

        # Collect all pending changes
        changes = tracker.drain_buffered()

        # Coalesce overlapping dirty regions
        if config.coalesce_regions:
            changes = coalesce_changes(changes)

        # Create update plan
        plan = create_update_plan(gnn, changes)

        # Execute batch update
        execute_update_plan(gnn, plan)


def coalesce_changes(changes):
    """
    Merge overlapping changes to reduce redundant computation.

    Example: If node A changes at t=1 and t=5, only keep t=5.
    """
    # Deduplicate by node_id, keep latest timestamp
    latest_changes = {}
    for change in changes:
        node = change.node_id
        if node not in latest_changes or change.timestamp > latest_changes[node].timestamp:
            latest_changes[node] = change

    return list(latest_changes.values())


def create_update_plan(gnn, changes):
    """
    Create optimized execution plan for batch update.
    """
    # Compute affected region for all changes
    dirty_nodes = {change.node_id for change in changes}
    affected_region = compute_affected_region(
        dirty_nodes,
        gnn.dependency_graph,
        k=gnn.num_layers,
    )

    # Topologically sort affected nodes for correct order
    if gnn.batch_config.topological_sort:
        execution_order = topological_sort(
            affected_region.affected_nodes,
            gnn.dependency_graph,
        )
    else:
        execution_order = list(affected_region.affected_nodes)

    return UpdatePlan(
        changes=changes,
        affected_region=affected_region,
        execution_order=execution_order,
        invalidate_cache=True,
    )


def execute_update_plan(gnn, plan):
    """
    Execute batch update with write lock on activation cache.
    """
    # Acquire write lock (blocks queries briefly)
    with gnn.activation_cache.write_lock():
        # Note: incremental_gnn_update() derives the affected region again from
        # plan.changes; passing plan.affected_region through would avoid the
        # duplicate BFS.
        incremental_gnn_update(gnn, plan.changes)

    # Queries can resume with updated embeddings
```

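
The coalescing step can be written in Rust roughly as follows. This is a sketch under simplifying assumptions: the `Change` struct and `ChangeKind` enum are flattened, hypothetical stand-ins for `NodeChange`, and the output is sorted by timestamp only to make the result deterministic.

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ChangeKind {
    Feature,
    Edge,
    Deleted,
}

/// Flattened stand-in for `NodeChange` (illustrative only).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct Change {
    node_id: u32,
    timestamp: u64,
    kind: ChangeKind,
}

/// Keep only the most recent change per node.
fn coalesce(changes: &[Change]) -> Vec<Change> {
    let mut latest: HashMap<u32, Change> = HashMap::new();
    for &c in changes {
        latest
            .entry(c.node_id)
            .and_modify(|cur| {
                if c.timestamp >= cur.timestamp {
                    *cur = c;
                }
            })
            .or_insert(c);
    }
    let mut out: Vec<Change> = latest.into_values().collect();
    // HashMap iteration order is unspecified; sort for reproducible batches.
    out.sort_by_key(|c| (c.timestamp, c.node_id));
    out
}
```

Keeping only the latest entry is enough for dirty marking, because any change kind dirties the node. If later stages need to know that both features and edges changed, the kinds would have to be merged rather than overwritten.
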
#### 3. Concurrent Query Support

```python
def query_with_incremental_gnn(gnn, query_vector, k):
    """
    Query GNN embeddings while updates are happening.

    Uses read-write locks to allow concurrent reads.
    """
    # Acquire read lock (multiple queries can read concurrently)
    with gnn.activation_cache.read_lock():
        # Get embeddings from cache (might be slightly stale)
        embeddings = []
        for node_id in gnn.dependency_graph.all_nodes():
            # Try to get from cache
            cached = gnn.activation_cache.get(
                layer=gnn.num_layers - 1,  # Final layer
                node=node_id,
            )

            if cached is not None and cached.is_valid:
                embeddings.append((node_id, cached.embedding))
            else:
                # Fallback: use base features (no GNN). This is only meaningful
                # if base features and final-layer embeddings are comparable.
                base_features = gnn.get_node_features(node_id)
                embeddings.append((node_id, base_features))

        # Perform similarity search
        results = search_similar(query_vector, embeddings, k)

    return results
```

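
The read/write locking pattern can be sketched with `std::sync::RwLock` alone. The types `Cached` and `LayerCache` and the function `concurrent_demo` are hypothetical and much smaller than the planned `ActivationCache`. The design point illustrated is one possible way to keep the write-lock window short: new embeddings are computed outside the lock and the writer holds it only while publishing a finished batch.

```rust
use std::collections::HashMap;
use std::sync::{Arc, RwLock};
use std::thread;

#[derive(Debug, Clone, PartialEq)]
struct Cached {
    embedding: Vec<f32>,
    version: u64,
}

/// One cache layer shared between query threads and the updater.
#[derive(Clone, Default)]
struct LayerCache {
    inner: Arc<RwLock<HashMap<u32, Cached>>>,
}

impl LayerCache {
    /// Read path: many threads may hold the read lock at once.
    fn get(&self, node_id: u32) -> Option<Cached> {
        self.inner.read().unwrap().get(&node_id).cloned()
    }

    /// Write path: take the write lock only to swap in precomputed entries.
    fn publish(&self, batch: Vec<(u32, Cached)>) {
        let mut guard = self.inner.write().unwrap();
        for (node_id, entry) in batch {
            guard.insert(node_id, entry);
        }
    }
}

/// Run four readers concurrently with one update; returns the number of
/// readers that found an entry and the final cached value for node 42.
fn concurrent_demo() -> (usize, Option<Cached>) {
    let cache = LayerCache::default();
    cache.publish(vec![(42, Cached { embedding: vec![0.1, 0.3], version: 0 })]);

    let readers: Vec<_> = (0..4)
        .map(|_| {
            let c = cache.clone();
            thread::spawn(move || c.get(42).is_some())
        })
        .collect();

    // The updater replaces the entry while readers may still be running.
    cache.publish(vec![(42, Cached { embedding: vec![0.2, 0.4], version: 1 })]);

    let hits = readers
        .into_iter()
        .map(|h| h.join().unwrap())
        .filter(|&found| found)
        .count();
    (hits, cache.get(42))
}
```

Readers see either the old or the new entry, never a partially written one, which matches the "slightly stale but consistent" behavior described for the query path.
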
### API Design (Function Signatures)

```rust
// File: crates/ruvector-gnn/src/incremental/mod.rs

use std::sync::{RwLockReadGuard, RwLockWriteGuard};

impl IncrementalGnn {
    /// Create a new incremental GNN system
    pub fn new(
        graph: Arc<HnswGraph>,
        num_layers: usize,
        batch_config: BatchUpdateConfig,
    ) -> Result<Self, GnnError>;

    /// Record a node feature update (triggers incremental recomputation)
    pub fn update_node_features(
        &mut self,
        node_id: u32,
        new_features: &[f32],
    ) -> Result<(), GnnError>;

    /// Record edge changes (triggers incremental recomputation)
    pub fn update_edges(
        &mut self,
        node_id: u32,
        added_edges: &[(u32, u32)],
        removed_edges: &[(u32, u32)],
    ) -> Result<(), GnnError>;

    /// Perform incremental update based on pending changes
    pub fn apply_incremental_update(&mut self) -> Result<UpdateStats, GnnError>;

    /// Force full graph recomputation (fallback)
    pub fn full_recompute(&mut self) -> Result<(), GnnError>;

    /// Get cached embedding for a node
    pub fn get_embedding(
        &self,
        node_id: u32,
        layer: usize,
    ) -> Option<Array2<f32>>;

    /// Check if cached embedding is valid
    pub fn is_embedding_valid(
        &self,
        node_id: u32,
        layer: usize,
    ) -> bool;

    /// Get incremental update metrics
    pub fn metrics(&self) -> &IncrementalMetrics;

    /// Clear all cached activations
    pub fn clear_cache(&mut self);
}

impl ChangeTracker {
    /// Mark a node as dirty (needs recomputation)
    pub fn mark_dirty(&mut self, node_id: u32);

    /// Check if a node is dirty
    pub fn is_dirty(&self, node_id: u32) -> bool;

    /// Clear dirty flag for a node
    pub fn clear_dirty(&mut self, node_id: u32);

    /// Get all dirty nodes
    pub fn get_dirty_nodes(&self) -> Vec<u32>;

    /// Buffer a change for batch processing
    pub fn buffer_change(&mut self, change: NodeChange);

    /// Drain all buffered changes
    pub fn drain_buffered(&mut self) -> Vec<NodeChange>;
}

impl ActivationCache {
    /// Create a new activation cache
    pub fn new(num_layers: usize, max_size_per_layer: usize) -> Self;

    /// Get cached activation
    pub fn get(&self, layer: usize, node_id: u32) -> Option<CachedActivation>;

    /// Set cached activation
    pub fn set(&mut self, layer: usize, node_id: u32, activation: CachedActivation);

    /// Invalidate cached activation
    pub fn invalidate(&mut self, layer: usize, node_id: u32);

    /// Check if activation is valid
    pub fn is_valid(&self, layer: usize, node_id: u32) -> bool;

    /// Number of entries currently cached for a layer
    pub fn layer_size(&self, layer: usize) -> usize;

    /// Acquire read lock on one layer (for concurrent queries).
    /// The cache holds one `RwLock` per layer, so the layer must be named.
    pub fn read_lock(&self, layer: usize) -> RwLockReadGuard<'_, HashMap<u32, CachedActivation>>;

    /// Acquire write lock on one layer (for updates).
    /// `&self` suffices because `RwLock` provides interior mutability.
    pub fn write_lock(&self, layer: usize) -> RwLockWriteGuard<'_, HashMap<u32, CachedActivation>>;

    /// Get cache statistics
    pub fn stats(&self) -> &CacheStats;

    /// Clear all cached activations
    pub fn clear(&mut self);
}

impl DependencyGraph {
    /// Create dependency graph from HNSW graph
    pub fn from_hnsw(graph: Arc<HnswGraph>, num_layers: usize) -> Self;

    /// Compute k-hop affected region from dirty nodes
    pub fn compute_affected_region(
        &self,
        dirty_nodes: &HashSet<u32>,
    ) -> AffectedRegion;

    /// Get reverse neighbors (who depends on this node?)
    pub fn get_reverse_neighbors(&self, node_id: u32) -> Vec<u32>;

    /// Precompute k-hop neighborhoods (optional optimization)
    pub fn precompute_khop_cache(&mut self) -> Result<(), GnnError>;
}

#[derive(Debug)]
pub struct UpdateStats {
    /// Number of nodes recomputed
    pub nodes_recomputed: usize,

    /// Total nodes in graph
    pub total_nodes: usize,

    /// Update latency (microseconds)
    pub latency_us: u64,

    /// Speedup vs full recomputation
    pub speedup_ratio: f64,
}
```

## Integration Points

### Affected Crates/Modules

1. **`ruvector-gnn`** (Primary)
   - New module: `src/incremental/mod.rs` - Core ATLAS system
   - New module: `src/incremental/change_tracker.rs` - Dirty node tracking
   - New module: `src/incremental/activation_cache.rs` - Embedding caching
   - New module: `src/incremental/dependency.rs` - Dependency analysis
   - Modified: `src/lib.rs` - Export incremental types

2. **`ruvector-core`** (Integration)
   - Modified: `src/index/hnsw.rs` - Notify GNN of graph changes
   - New: `src/index/hnsw_events.rs` - Event system for graph updates
   - Modified: `src/vector_store.rs` - Trigger incremental updates on insert/delete

3. **`ruvector-api`** (Configuration)
   - Modified: `src/config.rs` - Add incremental GNN config
   - Modified: `src/index_manager.rs` - Manage incremental update lifecycle

### New Modules to Create

```
crates/ruvector-gnn/
├── src/
│   ├── incremental/
│   │   ├── mod.rs               # Core IncrementalGnn
│   │   ├── change_tracker.rs    # ChangeTracker implementation
│   │   ├── activation_cache.rs  # ActivationCache implementation
│   │   ├── dependency.rs        # DependencyGraph implementation
│   │   ├── batch_optimizer.rs   # Batch update optimization
│   │   └── metrics.rs           # Performance tracking

crates/ruvector-core/
├── src/
│   ├── index/
│   │   ├── hnsw_events.rs       # Event system for graph changes

examples/
├── incremental_gnn/
│   ├── benchmark_updates.rs     # Benchmark incremental vs full
│   ├── streaming_workload.rs    # Real-time streaming example
│   └── README.md
```

### Dependencies on Other Features

**Depends On:**

- **GNN Layer Implementation (Issue #38)**: Needs working GNN layers to recompute embeddings
- **HNSW Index**: Needs graph structure for dependency analysis

**Synergies With:**

- **GNN-Guided Routing (Feature 1)**: Incremental updates keep routing model fresh
- **Neuro-Symbolic Query (Feature 3)**: Faster updates enable real-time constraint learning

**External Dependencies:**

- `bitvec` - Efficient BitSet for dirty node tracking
- `parking_lot` - RwLock for concurrent cache access (the data structures above are written against `std::sync::RwLock`)
- `crossbeam` - Batch processing queue (optional)

## Regression Prevention

### What Existing Functionality Could Break

1. **GNN Embedding Correctness**
   - Risk: Incremental updates produce different embeddings than full recomputation
   - Impact: Incorrect query results, embedding drift

2. **Memory Leaks**
   - Risk: Activation cache grows without bound if entries are never evicted
   - Impact: OOM crashes

3. **Deadlocks**
   - Risk: Queries and updates acquire the per-layer read-write locks in conflicting order, or a long-held write lock starves queries
   - Impact: System hangs

4. **Stale Embeddings**
   - Risk: Cache invalidation logic misses affected nodes
   - Impact: Queries use outdated embeddings

5. **Update Ordering**
   - Risk: Concurrent updates applied in wrong order
   - Impact: Inconsistent graph state

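
For the unbounded-cache risk, the sketch below shows one simple way to enforce a per-layer size limit with least-recently-used eviction, using only the standard library. `BoundedCache` and its methods are hypothetical; the recency queue is a plain `VecDeque` with linear-time updates, which keeps the example short but would likely be replaced by a proper LRU structure in production code.

```rust
use std::collections::{HashMap, VecDeque};

/// Size-bounded embedding cache with LRU eviction (illustrative only).
struct BoundedCache {
    capacity: usize,
    entries: HashMap<u32, Vec<f32>>,
    order: VecDeque<u32>, // front = least recently used
    evictions: u64,
}

impl BoundedCache {
    fn new(capacity: usize) -> Self {
        Self { capacity, entries: HashMap::new(), order: VecDeque::new(), evictions: 0 }
    }

    /// Move `node_id` to the most-recently-used position.
    fn touch(&mut self, node_id: u32) {
        if let Some(pos) = self.order.iter().position(|&id| id == node_id) {
            self.order.remove(pos);
        }
        self.order.push_back(node_id);
    }

    fn insert(&mut self, node_id: u32, embedding: Vec<f32>) {
        if self.capacity == 0 {
            return;
        }
        // Evict the least recently used entry before adding a new key.
        if !self.entries.contains_key(&node_id) && self.entries.len() >= self.capacity {
            if let Some(oldest) = self.order.pop_front() {
                self.entries.remove(&oldest);
                self.evictions += 1;
            }
        }
        self.entries.insert(node_id, embedding);
        self.touch(node_id);
    }

    fn get(&mut self, node_id: u32) -> Option<&Vec<f32>> {
        if self.entries.contains_key(&node_id) {
            self.touch(node_id);
        }
        self.entries.get(&node_id)
    }

    fn len(&self) -> usize {
        self.entries.len()
    }
}
```

An evicted entry is not an error in the incremental scheme: the next access is a cache miss and the embedding is recomputed from the previous layer.
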
### Test Cases to Prevent Regressions
|
||
|
||
```rust
|
||
// File: crates/ruvector-gnn/tests/incremental_regression_tests.rs
|
||
|
||
#[test]
|
||
fn test_incremental_matches_full_recomputation() {
|
||
// Incremental updates must produce identical embeddings to full recompute
|
||
let graph = build_test_graph(1000);
|
||
let gnn_full = FullGnn::new(&graph, num_layers=3);
|
||
let gnn_inc = IncrementalGnn::new(&graph, num_layers=3);
|
||
|
||
// Apply 100 random updates
|
||
let updates = generate_random_updates(100);
|
||
|
||
// Full recomputation
|
||
for update in &updates {
|
||
apply_update_full(&mut gnn_full, update);
|
||
}
|
||
gnn_full.recompute_all();
|
||
|
||
// Incremental updates
|
||
for update in &updates {
|
||
apply_update_incremental(&mut gnn_inc, update);
|
||
}
|
||
gnn_inc.apply_incremental_update();
|
||
|
||
// Compare embeddings (should be identical within floating-point tolerance)
|
||
for node_id in 0..1000 {
|
||
let emb_full = gnn_full.get_embedding(node_id, layer=2);
|
||
let emb_inc = gnn_inc.get_embedding(node_id, layer=2).unwrap();
|
||
|
||
assert_embeddings_equal(&emb_full, &emb_inc, tolerance=1e-5);
|
||
}
|
||
}
|
||
|
||
#[test]
|
||
fn test_cache_invalidation_correctness() {
|
||
// All affected nodes must have cache invalidated
|
||
let graph = build_test_graph(1000);
|
||
let mut gnn = IncrementalGnn::new(&graph, num_layers=3);
|
||
|
||
// Mark node 42 as dirty
|
||
gnn.update_node_features(42, &random_features());
|
||
|
||
// Compute affected region (3-hop)
|
||
let affected = gnn.dependency_graph.compute_affected_region(&hashset!{42});
|
||
|
||
// Check cache invalidation
|
||
for node in &affected.affected_nodes {
|
||
for layer in 0..3 {
|
||
assert!(!gnn.activation_cache.is_valid(layer, *node),
|
||
"Node {} layer {} should be invalidated", node, layer);
|
||
}
|
||
}
|
||
}
|
||
|
||
#[test]
|
||
fn test_incremental_speedup() {
|
||
// Incremental updates must be ≥10x faster than full recompute
|
||
let graph = build_test_graph(100_000);
|
||
let mut gnn_full = FullGnn::new(&graph, num_layers=3);
|
||
let mut gnn_inc = IncrementalGnn::new(&graph, num_layers=3);
|
||
|
||
// Small update (100 nodes)
|
||
let updates = generate_random_updates(100);
|
||
|
||
// Benchmark full recomputation
|
||
let start = Instant::now();
|
||
for update in &updates {
|
||
apply_update_full(&mut gnn_full, update);
|
||
}
|
||
gnn_full.recompute_all();
|
||
let full_time = start.elapsed();
|
||
|
||
// Benchmark incremental
|
||
let start = Instant::now();
|
||
for update in &updates {
|
||
apply_update_incremental(&mut gnn_inc, update);
|
||
}
|
||
gnn_inc.apply_incremental_update();
|
||
let inc_time = start.elapsed();
|
||
|
||
let speedup = full_time.as_secs_f64() / inc_time.as_secs_f64();
|
||
assert!(speedup >= 10.0, "Speedup: {:.1}x, expected ≥10x", speedup);
|
||
}
|
||
|
||
#[test]
fn test_concurrent_query_update() {
    use std::sync::atomic::{AtomicBool, Ordering};

    // Queries and updates must make progress concurrently without deadlocking.
    // Readers share the RwLock; the writer holds it only for one update at a time.
    let graph = Arc::new(build_test_graph(10_000));
    let gnn = Arc::new(RwLock::new(IncrementalGnn::new(&graph, /* num_layers */ 3)));
    let stop = Arc::new(AtomicBool::new(false));

    // Spawn update thread (runs until the queries finish)
    let gnn_update = Arc::clone(&gnn);
    let stop_update = Arc::clone(&stop);
    let update_handle = thread::spawn(move || {
        while !stop_update.load(Ordering::Relaxed) {
            let mut g = gnn_update.write().unwrap();
            // Keep node IDs inside the 10,000-node test graph
            g.update_node_features(rand::random::<u32>() % 10_000, &random_features());
            g.apply_incremental_update().unwrap();
            drop(g); // Release lock before sleeping
            sleep(Duration::from_millis(10));
        }
    });

    // Spawn query threads
    let query_handles: Vec<_> = (0..8)
        .map(|_| {
            let gnn_query = Arc::clone(&gnn);
            thread::spawn(move || {
                for _ in 0..1000 {
                    let g = gnn_query.read().unwrap();
                    let emb = g.get_embedding(rand::random::<u32>() % 10_000, /* layer */ 2);
                    assert!(emb.is_some());
                    drop(g); // Release lock
                }
            })
        })
        .collect();

    // Wait for queries to complete
    for handle in query_handles {
        handle.join().unwrap();
    }

    // Stop the updater and make sure it did not panic
    stop.store(true, Ordering::Relaxed);
    update_handle.join().unwrap();

    // Reaching this point means no deadlock occurred
}

#[test]
fn test_cache_memory_bounded() {
    // Cache must not exceed configured size limit
    let graph = build_test_graph(100_000);
    let mut gnn = IncrementalGnn::new(&graph, /* num_layers */ 3);

    // Configure small cache (1000 entries per layer)
    gnn.activation_cache = ActivationCache::new(3, /* max_size_per_layer */ 1000);

    // Perform many updates (should trigger evictions)
    for _ in 0..10_000 {
        // Keep node IDs inside the 100,000-node test graph
        gnn.update_node_features(rand::random::<u32>() % 100_000, &random_features());
        gnn.apply_incremental_update().unwrap();
    }

    // Check cache size
    for layer in 0..3 {
        let cache_size = gnn.activation_cache.layer_size(layer);
        assert!(cache_size <= 1000, "Layer {} cache size: {}, expected ≤1000", layer, cache_size);
    }
}
```

### Backward Compatibility Strategy

1. **Default Disabled**
   - Incremental GNN is opt-in via configuration
   - Existing code defaults to full recomputation

2. **Graceful Fallback**
   - If an incremental update fails, fall back to full recompute
   - Log a warning but do not crash

3. **Configuration Schema**
   ```yaml
   gnn:
     incremental:
       enabled: false  # Default: disabled
       batch_size: 100
       max_buffer_time_ms: 1000
       cache_size_per_layer: 10000
   ```

4. **API Compatibility**
   - Existing `Gnn::recompute()` still works (full recompute)
   - New `Gnn::incremental_update()` method added

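The opt-in default and the graceful fallback can be sketched together. The snippet below is only an illustration under stated assumptions: the `Gnn` trait, its method signatures, `IncrementalConfig`, `UpdateError`, `UpdatePath`, and the `FlakyGnn` mock are hypothetical stand-ins, not the actual ruvector API. Only the method names `recompute` and `incremental_update` and the default values come from the plan above.

```rust
use std::fmt;

/// Hypothetical mirror of the `gnn.incremental` configuration block.
#[derive(Debug, Clone, PartialEq)]
pub struct IncrementalConfig {
    pub enabled: bool,
    pub batch_size: usize,
    pub max_buffer_time_ms: u64,
    pub cache_size_per_layer: usize,
}

impl Default for IncrementalConfig {
    /// Defaults copied from the schema: incremental mode is off unless enabled.
    fn default() -> Self {
        Self { enabled: false, batch_size: 100, max_buffer_time_ms: 1000, cache_size_per_layer: 10_000 }
    }
}

/// Hypothetical error type for a failed incremental update.
#[derive(Debug)]
pub struct UpdateError(pub String);

impl fmt::Display for UpdateError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "{}", self.0)
    }
}

/// Which path actually produced the embeddings.
#[derive(Debug, PartialEq, Eq)]
pub enum UpdatePath {
    Incremental,
    FullRecompute,
}

/// Assumed shape of the GNN interface (signatures are illustrative).
pub trait Gnn {
    fn incremental_update(&mut self) -> Result<(), UpdateError>;
    fn recompute(&mut self);
}

/// Tries the incremental path when enabled; on failure logs a warning
/// and falls back to a full recompute instead of propagating the error.
pub fn update_with_fallback<G: Gnn>(gnn: &mut G, cfg: &IncrementalConfig) -> UpdatePath {
    if !cfg.enabled {
        gnn.recompute();
        return UpdatePath::FullRecompute;
    }
    match gnn.incremental_update() {
        Ok(()) => UpdatePath::Incremental,
        Err(e) => {
            eprintln!("warning: incremental update failed ({e}); falling back to full recompute");
            gnn.recompute();
            UpdatePath::FullRecompute
        }
    }
}

/// Mock GNN used to exercise both paths.
pub struct FlakyGnn {
    pub fail: bool,
    pub full_recomputes: u32,
}

impl Gnn for FlakyGnn {
    fn incremental_update(&mut self) -> Result<(), UpdateError> {
        if self.fail { Err(UpdateError("stale cache".into())) } else { Ok(()) }
    }
    fn recompute(&mut self) {
        self.full_recomputes += 1;
    }
}
```

Returning the path taken, rather than hiding it, lets callers count fallbacks, which could feed the telemetry planned for Phase 5.
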
## Implementation Phases

### Phase 1: Core Infrastructure (Week 1-2)

**Goal**: Working change tracking and activation cache

**Tasks**:
1. Implement `ChangeTracker` with BitSet
2. Implement `ActivationCache` with RwLock
3. Add unit tests for both
4. Benchmark cache performance (hit rate, contention)

**Deliverables**:
- `incremental/change_tracker.rs`
- `incremental/activation_cache.rs`
- Passing unit tests
- Benchmark report

**Success Criteria**:
- Change tracking overhead <1% of update time
- Cache hit rate >90% for typical workloads
- No deadlocks in concurrent access

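As a rough illustration of the bitset-backed change tracker, the sketch below keeps one bit per node. It is a minimal example, not the planned `ChangeTracker` API: the method names (`mark_dirty`, `drain`, `memory_bytes`) and the dense `usize` node IDs are assumptions made for the example.

```rust
/// Minimal fixed-size bitset tracking dirty node IDs (one bit per node).
pub struct ChangeTracker {
    words: Vec<u64>,
    num_nodes: usize,
    dirty_count: usize,
}

impl ChangeTracker {
    pub fn new(num_nodes: usize) -> Self {
        // Round up so every node ID has a bit.
        Self { words: vec![0; (num_nodes + 63) / 64], num_nodes, dirty_count: 0 }
    }

    /// Marks a node dirty; returns true if it was clean before.
    pub fn mark_dirty(&mut self, node: usize) -> bool {
        assert!(node < self.num_nodes, "node id out of range");
        let word = node / 64;
        let bit = 1u64 << (node % 64);
        let newly_dirty = (self.words[word] & bit) == 0;
        if newly_dirty {
            self.words[word] |= bit;
            self.dirty_count += 1;
        }
        newly_dirty
    }

    pub fn is_dirty(&self, node: usize) -> bool {
        node < self.num_nodes && (self.words[node / 64] & (1u64 << (node % 64))) != 0
    }

    pub fn dirty_count(&self) -> usize {
        self.dirty_count
    }

    /// Returns all dirty node IDs in ascending order and resets the tracker.
    pub fn drain(&mut self) -> Vec<usize> {
        let mut out = Vec::with_capacity(self.dirty_count);
        for (i, word) in self.words.iter_mut().enumerate() {
            let mut w = *word;
            while w != 0 {
                out.push(i * 64 + w.trailing_zeros() as usize);
                w &= w - 1; // clear the lowest set bit
            }
            *word = 0;
        }
        self.dirty_count = 0;
        out
    }

    /// Bytes used by the bit storage itself.
    pub fn memory_bytes(&self) -> usize {
        self.words.len() * 8
    }
}
```

With this layout, tracking 1M nodes takes 125,000 bytes of bit storage, and marking the same node twice counts it only once.
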
### Phase 2: Dependency Analysis (Week 2-3)

**Goal**: Compute affected regions correctly

**Tasks**:
1. Implement `DependencyGraph` with k-hop BFS
2. Add topological sorting for update order
3. Test affected region computation on various graph topologies
4. Optimize with k-hop caching (optional)

**Deliverables**:
- `incremental/dependency.rs`
- Tests for k-hop propagation
- Performance benchmarks

**Success Criteria**:
- Affected region computation <10ms for 1K dirty nodes
- Correct propagation (matches ground truth)
- Handles edge cases (disconnected components, cycles)

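The k-hop BFS at the core of this phase could look roughly like the following. This is a sketch under assumptions: the adjacency-list layout, the free-function signature, and the `undirected` helper are illustrative and not taken from the planned `DependencyGraph` type.

```rust
use std::collections::{HashSet, VecDeque};

/// Returns every node within `k` hops of any dirty node (dirty nodes included).
/// Assumption: `adj[u]` lists the nodes whose embeddings read from `u`.
pub fn compute_affected_region(adj: &[Vec<usize>], dirty: &[usize], k: usize) -> HashSet<usize> {
    let mut affected: HashSet<usize> = HashSet::new();
    let mut queue: VecDeque<(usize, usize)> = VecDeque::new();

    // Multi-source BFS: all dirty nodes start at depth 0.
    for &d in dirty {
        if affected.insert(d) {
            queue.push_back((d, 0));
        }
    }

    while let Some((u, depth)) = queue.pop_front() {
        if depth == k {
            continue; // receptive-field boundary reached
        }
        for &v in &adj[u] {
            // The visited set also guarantees termination on cycles.
            if affected.insert(v) {
                queue.push_back((v, depth + 1));
            }
        }
    }
    affected
}

/// Helper for examples: builds an undirected adjacency list.
pub fn undirected(num_nodes: usize, edges: &[(usize, usize)]) -> Vec<Vec<usize>> {
    let mut adj = vec![Vec::new(); num_nodes];
    for &(a, b) in edges {
        adj[a].push(b);
        adj[b].push(a);
    }
    adj
}
```

Because BFS visits nodes in order of increasing depth, each node is first reached at its minimum distance from the dirty set, so the depth cutoff never excludes a node that lies within `k` hops.
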
### Phase 3: Incremental Forward Pass (Week 3-4)

**Goal**: Recompute only affected nodes

**Tasks**:
1. Implement incremental forward pass algorithm
2. Integrate with existing GNN layers
3. Add cache reuse logic
4. Test correctness vs full recomputation
5. Benchmark speedup

**Deliverables**:
- `incremental/mod.rs` (core algorithm)
- Correctness tests
- Performance benchmarks

**Success Criteria**:
- Embeddings match full recomputation (within tolerance)
- ≥10x speedup for small updates (<1% of graph)
- ≥100x speedup for tiny updates (<0.1% of graph)

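To make the cache-reuse idea concrete, here is a toy version of the incremental forward pass. It is a sketch, not the real layer code: it assumes scalar node features, an undirected adjacency list, and a parameter-free mean-aggregation layer, none of which come from the actual ruvector GNN.

```rust
/// One mean-aggregation step: out[v] = mean of input over {v} and its neighbours.
fn aggregate(adj: &[Vec<usize>], input: &[f32], v: usize) -> f32 {
    let sum: f32 = input[v] + adj[v].iter().map(|&u| input[u]).sum::<f32>();
    sum / (adj[v].len() as f32 + 1.0)
}

/// Full forward pass. Returns activations per layer; index 0 holds the input features.
pub fn forward_full(adj: &[Vec<usize>], features: &[f32], layers: usize) -> Vec<Vec<f32>> {
    let mut acts = vec![features.to_vec()];
    for l in 0..layers {
        let next: Vec<f32> = (0..adj.len()).map(|v| aggregate(adj, &acts[l], v)).collect();
        acts.push(next);
    }
    acts
}

/// Incremental pass over cached activations. Applies the feature changes, then
/// recomputes, layer by layer, only nodes whose inputs changed. Everything else
/// is read from the cache. Returns the number of node-layer recomputations.
pub fn forward_incremental(
    adj: &[Vec<usize>],
    cache: &mut [Vec<f32>],
    changed: &[(usize, f32)],
) -> usize {
    let mut frontier: Vec<usize> = Vec::new();
    for &(v, x) in changed {
        cache[0][v] = x;
        frontier.push(v);
    }
    frontier.sort_unstable();
    frontier.dedup();

    let mut recomputed = 0;
    for l in 1..cache.len() {
        // A node is affected at layer l if it or any neighbour changed at layer l - 1.
        let mut next = frontier.clone();
        for &u in &frontier {
            next.extend(adj[u].iter().copied());
        }
        next.sort_unstable();
        next.dedup();

        let (lower, upper) = cache.split_at_mut(l);
        for &v in &next {
            upper[0][v] = aggregate(adj, &lower[l - 1], v);
            recomputed += 1;
        }
        frontier = next;
    }
    recomputed
}
```

On a 10-node path graph with 2 layers, changing the feature of node 0 recomputes 5 node-layer pairs instead of the 20 a full pass would touch, and the cached result agrees with a fresh full pass.
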
### Phase 4: Batch Optimization (Week 4-5)

**Goal**: Efficient batch processing of updates

**Tasks**:
1. Implement batch update optimizer
2. Add change coalescing logic
3. Tune buffer size and timeout
4. Benchmark throughput improvement

**Deliverables**:
- `incremental/batch_optimizer.rs`
- Batch processing benchmarks
- Configuration guide

**Success Criteria**:
- Batch updates 2-5x faster than individual updates
- Latency <50ms for 1K batched changes
- No excessive buffering delays

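Change coalescing with a size and time threshold might be sketched as follows. The `UpdateBuffer` type and its methods are hypothetical; the two thresholds correspond to the `batch_size` and `max_buffer_time_ms` settings from the configuration schema. The current time is passed in explicitly so the flush logic can be tested deterministically.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Buffers feature updates and coalesces repeated writes to the same node.
pub struct UpdateBuffer {
    pending: HashMap<u32, Vec<f32>>, // node id -> latest features
    batch_size: usize,
    max_buffer_time: Duration,
    oldest: Option<Instant>, // arrival time of the oldest buffered update
}

impl UpdateBuffer {
    pub fn new(batch_size: usize, max_buffer_time: Duration) -> Self {
        Self { pending: HashMap::new(), batch_size, max_buffer_time, oldest: None }
    }

    /// Buffers an update. A later update to the same node replaces the earlier one.
    pub fn push(&mut self, node: u32, features: Vec<f32>, now: Instant) {
        if self.oldest.is_none() {
            self.oldest = Some(now);
        }
        self.pending.insert(node, features);
    }

    /// True once the batch is full or the oldest update has waited long enough.
    pub fn should_flush(&self, now: Instant) -> bool {
        self.pending.len() >= self.batch_size
            || self.oldest.map_or(false, |t| now.duration_since(t) >= self.max_buffer_time)
    }

    /// Drains the buffer, returning one entry per node sorted by node id.
    pub fn flush(&mut self) -> Vec<(u32, Vec<f32>)> {
        self.oldest = None;
        let mut batch: Vec<(u32, Vec<f32>)> = self.pending.drain().collect();
        batch.sort_unstable_by_key(|(node, _)| *node);
        batch
    }

    pub fn len(&self) -> usize {
        self.pending.len()
    }
}
```

Keying the buffer by node ID means a node updated many times within one window is recomputed once, which is where the batch speedup over individual updates would come from.
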
### Phase 5: Production Hardening (Week 5-6)

**Goal**: Production-ready with safety guarantees

**Tasks**:
1. Add comprehensive error handling
2. Implement fallback to full recompute on errors
3. Add telemetry and observability
4. Write documentation
5. Stress testing (10M+ nodes, concurrent workloads)

**Deliverables**:
- Full error handling
- Regression test suite
- User documentation
- Performance report

**Success Criteria**:
- Zero crashes in stress tests
- Graceful degradation on errors
- Documentation complete

## Success Metrics

### Performance Benchmarks

**Primary Metrics** (Must Achieve):

| Workload | Current (Full) | Target (ATLAS) | Improvement |
|----------|----------------|----------------|-------------|
| 100 vector updates | 50ms | 0.5ms | **100x** |
| 1,000 vector updates | 500ms | 5ms | **100x** |
| 10,000 vector updates | 5s | 50ms | **100x** |
| Continuous stream (1K/s) | Blocked | 1K/s sustained | **∞** |

**Secondary Metrics**:

| Metric | Target |
|--------|--------|
| Cache hit rate | >90% |
| Memory overhead | <10% of base GNN |
| Concurrent query throughput | No degradation |
| Affected region ratio | <5% of graph (for 0.1% dirty nodes) |

### Accuracy Metrics

**Embedding Correctness**:
- Incremental embeddings must match full recomputation within `1e-5` tolerance (floating-point)
- No accumulated drift beyond that tolerance after 1M updates

**Cache Invalidation**:
- 100% of affected nodes have cache invalidated (no stale embeddings used)
- Zero false negatives (missed invalidations)

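A tolerance check of this kind could be written as below. This is a minimal sketch with hypothetical helper names, and it assumes an element-wise absolute tolerance; the plan does not say whether the `1e-5` bound is absolute or relative.

```rust
/// True if both embeddings have the same length and every component
/// differs by at most `tol`. A NaN on either side counts as a mismatch,
/// because `NaN <= tol` is false.
pub fn embeddings_match(incremental: &[f32], full: &[f32], tol: f32) -> bool {
    incremental.len() == full.len()
        && incremental.iter().zip(full).all(|(a, b)| (a - b).abs() <= tol)
}

/// Largest absolute component difference, usable as a drift metric.
/// Returns None on length mismatch; a NaN difference propagates to the result.
pub fn max_abs_diff(incremental: &[f32], full: &[f32]) -> Option<f32> {
    if incremental.len() != full.len() {
        return None;
    }
    let mut worst = 0.0f32;
    for (a, b) in incremental.iter().zip(full) {
        let d = (a - b).abs();
        if d.is_nan() || d > worst {
            worst = d;
        }
        if worst.is_nan() {
            break; // nothing can override NaN, so stop early
        }
    }
    Some(worst)
}
```

The explicit NaN handling matters here: folding with `f32::max` would silently discard NaN differences and report a corrupted embedding as matching.
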
### Memory/Latency Targets

**Memory**:
- Activation cache: <100MB per 1M nodes
- Change tracker: <10MB per 1M nodes (BitSet)
- Total overhead: <10% of base GNN memory

**Latency**:
- Update latency (100 vectors): <1ms
- Update latency (1K vectors): <10ms
- Update latency (10K vectors): <100ms
- Query latency: No increase (concurrent reads)

**Throughput**:
- Sustained update rate: 10,000 vectors/second
- Batch update throughput: 100,000 vectors/second

## Risks and Mitigations

### Technical Risks

**Risk 1: Cache Invalidation Bugs**

*Probability: High | Impact: Critical*

**Description**: Missing cache invalidations could cause stale embeddings to be used, leading to incorrect query results.

**Mitigation**:
- Extensive testing with known ground truth
- Add assertion checks in debug builds (compare incremental vs full)
- Implement cache consistency validation tool
- Conservative invalidation (over-invalidate rather than under-invalidate)
- Monitor embedding drift metrics in production

**Contingency**: Add "full recompute verification" mode that periodically checks incremental results against full recompute.

---

**Risk 2: Concurrency Bugs (Deadlocks, Race Conditions)**

*Probability: Medium | Impact: High*

**Description**: RwLock usage could introduce deadlocks or race conditions between queries and updates.

**Mitigation**:
- Use proven lock-free data structures where possible
- Lock ordering discipline (always acquire in same order)
- Timeout on lock acquisition
- Extensive concurrency testing with ThreadSanitizer
- Use parking_lot for better performance and diagnostics

**Contingency**: Fall back to single-threaded updates if concurrency issues arise.

---

**Risk 3: Memory Leak from Unbounded Cache**

*Probability: Medium | Impact: Medium*

**Description**: Activation cache could grow unbounded if eviction policy fails.

**Mitigation**:
- Implement strict LRU eviction
- Set hard memory limits with monitoring
- Add memory pressure detection
- Test with long-running workloads
- Provide cache clear API for manual intervention

**Contingency**: Add periodic cache clearing (e.g., every 1M updates) as safety net.

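A strict LRU bound for one cache layer can be illustrated with the sketch below. The `LruLayerCache` type is hypothetical, and its recency update is a linear scan, so it shows the eviction invariant rather than a production-grade structure.

```rust
use std::collections::{HashMap, VecDeque};

/// Per-layer activation cache with a hard entry limit and LRU eviction.
pub struct LruLayerCache {
    capacity: usize,
    entries: HashMap<u32, Vec<f32>>,
    order: VecDeque<u32>, // front = least recently used
}

impl LruLayerCache {
    pub fn new(capacity: usize) -> Self {
        assert!(capacity > 0, "capacity must be positive");
        Self { capacity, entries: HashMap::new(), order: VecDeque::new() }
    }

    /// Moves `node` to the most-recently-used position (O(n) scan).
    fn touch(&mut self, node: u32) {
        if let Some(pos) = self.order.iter().position(|&n| n == node) {
            self.order.remove(pos);
        }
        self.order.push_back(node);
    }

    /// Inserts or overwrites an activation, evicting the LRU entry when full.
    pub fn insert(&mut self, node: u32, activation: Vec<f32>) {
        if !self.entries.contains_key(&node) && self.entries.len() == self.capacity {
            if let Some(victim) = self.order.pop_front() {
                self.entries.remove(&victim);
            }
        }
        self.entries.insert(node, activation);
        self.touch(node);
    }

    /// Looks up an activation and refreshes its recency on a hit.
    pub fn get(&mut self, node: u32) -> Option<&[f32]> {
        if self.entries.contains_key(&node) {
            self.touch(node);
        }
        self.entries.get(&node).map(|v| v.as_slice())
    }

    pub fn len(&self) -> usize {
        self.entries.len()
    }
}
```
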
---

**Risk 4: k-Hop Propagation Overhead**

*Probability: Low | Impact: Medium*

**Description**: Computing k-hop affected regions could be slow on dense graphs.

**Mitigation**:
- Precompute k-hop neighborhoods (optional)
- Use approximate k-hop (prune low-degree nodes)
- Parallelize BFS traversal
- Cache affected regions for repeated patterns
- Profile and optimize hot paths

**Contingency**: Add configurable k-hop limit (user can reduce k if needed).

---

**Risk 5: Divergence from Full Recomputation**

*Probability: Low | Impact: High*

**Description**: Incremental updates could accumulate numerical errors, causing embedding drift over time.

**Mitigation**:
- Use same floating-point precision as full recompute
- Periodically run full recomputation to reset (e.g., daily)
- Monitor embedding distance metrics
- Add numerical stability tests
- Use higher precision (f64) for accumulation if needed

**Contingency**: Implement "full recompute every N updates" policy.

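The "full recompute every N updates" contingency amounts to a small counter. The sketch below uses a hypothetical `RecomputePolicy` type; how the caller triggers the full recompute is left open.

```rust
/// Counts incremental updates and signals when a full recompute is due.
pub struct RecomputePolicy {
    every_n: u64,
    since_full: u64,
}

impl RecomputePolicy {
    pub fn new(every_n: u64) -> Self {
        assert!(every_n > 0, "every_n must be positive");
        Self { every_n, since_full: 0 }
    }

    /// Records one incremental update. Returns true when the caller should
    /// run a full recompute now; the counter resets at that point.
    pub fn record_update(&mut self) -> bool {
        self.since_full += 1;
        if self.since_full >= self.every_n {
            self.since_full = 0;
            true
        } else {
            false
        }
    }
}
```
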
---

**Risk 6: Complex Debugging**

*Probability: High | Impact: Medium*

**Description**: Incremental update bugs are harder to debug than full recomputation.

**Mitigation**:
- Add extensive logging and telemetry
- Implement deterministic replay of update sequences
- Provide debugging tools (cache inspector, affected region visualizer)
- Add assertion modes for validation
- Document common failure modes

**Contingency**: Provide "debug mode" that runs both incremental and full in parallel for comparison.

---

### Summary Risk Matrix

| Risk | Probability | Impact | Mitigation Priority |
|------|-------------|--------|---------------------|
| Cache invalidation bugs | High | Critical | **Critical** |
| Concurrency bugs | Medium | High | **High** |
| Memory leak | Medium | Medium | High |
| k-hop overhead | Low | Medium | Medium |
| Embedding divergence | Low | High | Medium |
| Complex debugging | High | Medium | Low |

---

## Next Steps

1. **Prototype Phase 1**: Build change tracker and activation cache (1 week)
2. **Validate Approach**: Test on small graph (1K nodes), measure speedup (2 days)
3. **Scale Testing**: Test on realistic graph (100K nodes), identify bottlenecks (3 days)
4. **Integration**: Connect to HNSW index updates (1 week)
5. **Optimization**: Profile and optimize hot paths (ongoing)

**Key Decision Points**:
- After Phase 1: Is cache overhead acceptable? (<10% memory)
- After Phase 3: Does speedup meet targets? (≥10x required)
- After Phase 5: Are embeddings correct? (Pass all regression tests)

**Go/No-Go Criteria**:
- ✅ 10x+ speedup on small updates
- ✅ Zero embedding correctness regressions
- ✅ No concurrency bugs in stress tests
- ✅ Memory overhead <10%