# Graph Condensation (SFGC) - Implementation Plan

## Overview

### Problem Statement

HNSW graphs in production vector databases face critical deployment challenges:

1. **Memory Footprint**: Full HNSW graphs require 40-120 bytes per vector for connectivity metadata
2. **Edge Deployment**: Mobile/IoT devices cannot hold million-node indexes (400MB-4.8GB total footprint, vectors included)
3. **Federated Learning**: Transferring full graphs between nodes is bandwidth-prohibitive
4. **Cold Start**: Initial graph construction is expensive for dynamic applications
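The footprint figures used throughout this plan follow from simple per-node arithmetic. A minimal sketch of that arithmetic (the 1024-dimension f32 vectors and 120-byte graph overhead are illustrative assumptions chosen to land near the 4.8GB figure, not measured values):

```rust
/// Approximate index footprint in bytes: raw f32 vectors plus per-node
/// graph metadata. Inputs here are assumptions for illustration only.
fn index_footprint(num_vectors: u64, dim: u64, graph_bytes_per_node: u64) -> u64 {
    num_vectors * (dim * 4 + graph_bytes_per_node) // f32 = 4 bytes
}

/// Footprint after condensing at the given compression ratio.
fn condensed_footprint(full_bytes: u64, ratio: u64) -> u64 {
    full_bytes / ratio
}
```

At 1M vectors, 1024 dimensions, and 120 bytes of metadata this gives roughly 4.2GB full and 84MB at 50x, the same order as the table below.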
### Proposed Solution

Implement Structure-Preserving Graph Condensation (SFGC) that creates synthetic "super-nodes" representing clusters of original nodes. The condensed graph:

- Reduces graph size by 10-100x (configurable compression ratio)
- Preserves topological properties (small-world, scale-free characteristics)
- Maintains search accuracy within 2-5% of the full graph
- Enables progressive graph expansion from condensed to full representation

**Core Innovation**: Unlike naive graph coarsening, SFGC learns synthetic node embeddings that maximize structural fidelity using a differentiable graph neural network.

### Expected Benefits (Quantified)

| Metric | Current (Full HNSW) | With SFGC (50x) | Improvement |
|--------|---------------------|-----------------|-------------|
| Memory footprint | 4.8GB (1M vectors) | 96MB | 50x reduction |
| Transfer bandwidth | 4.8GB | 96MB | 50x reduction |
| Edge device compatibility | Limited to 100K vectors | 5M vectors | 50x capacity |
| Cold start time | 120s | 8s + progressive | 15x faster |
| Search accuracy (recall@10) | 0.95 | 0.92-0.94 | 2-3% degradation |
| Search latency | 1.2ms | 1.5ms (initial), 1.2ms (expanded) | 25% slower → same |

**ROI Calculation**:

- Edge deployment: enables $500 devices vs $2000 workstations
- Federated learning: 50x faster synchronization (2.4s vs 120s)
- Multi-tenant SaaS: 50x more graphs per server
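The synchronization claim is linear in payload size. Assuming a 40 MB/s effective link (an assumption that reproduces the 120s baseline for 4.8GB), the 96MB condensed transfer takes 2.4s:

```rust
/// Transfer time for a payload over a link, both in consistent MB units.
/// Back-of-envelope only: real federated sync adds protocol overhead.
fn transfer_seconds(payload_mb: f64, link_mb_per_s: f64) -> f64 {
    payload_mb / link_mb_per_s
}
```

With `transfer_seconds(4800.0, 40.0)` vs `transfer_seconds(96.0, 40.0)`, the ratio is exactly the 50x compression ratio.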
## Technical Design

### Architecture Diagram (ASCII)

```
┌─────────────────────────────────────────────────────────────────┐
│                   Graph Condensation Pipeline                   │
└─────────────────────────────────────────────────────────────────┘
                                │
                  ┌─────────────┴─────────────┐
                  ▼                           ▼
     ┌───────────────────────┐    ┌───────────────────────┐
     │ Offline Condensation  │    │   Online Expansion    │
     │      (Training)       │    │       (Runtime)       │
     └───────────────────────┘    └───────────────────────┘
                 │                            │
     ┌───────────┼───────────┐                │
     ▼           ▼           ▼                ▼
┌────────┐  ┌────────┐  ┌─────────┐    ┌─────────────┐
│Cluster │  │Synth   │  │Edge     │    │Progressive  │
│ing     │  │Node    │  │Preserv  │    │Decompression│
│        │  │Learn   │  │ation    │    │             │
└────────┘  └────────┘  └─────────┘    └─────────────┘
     │           │           │                │
     └───────────┼───────────┘                │
                 ▼                            ▼
        ┌──────────────┐             ┌──────────────┐
        │  Condensed   │────────────▶│    Hybrid    │
        │  Graph File  │    Load     │ Graph Store  │
        │  (.cgraph)   │             │              │
        └──────────────┘             └──────────────┘
                                            │
                                            ▼
                                    ┌──────────────┐
                                    │  Search API  │
                                    │  (adaptive)  │
                                    └──────────────┘
```

**Component Flow**:

1. **Offline Condensation** (Training Phase):
   - Hierarchical clustering of original graph
   - GNN-based synthetic node embedding learning
   - Edge weight optimization via structure preservation loss
   - Export to `.cgraph` format

2. **Online Expansion** (Runtime):
   - Load condensed graph for fast cold start
   - Progressive decompression on cache misses
   - Adaptive switching between condensed/full graph
### Core Data Structures (Rust)

```rust
use std::collections::HashMap;
use std::sync::atomic::AtomicU64;
use std::sync::{Arc, RwLock};
use std::time::Instant;

// `LruCache` comes from the `lru` crate (or an in-house equivalent);
// `NodeId`, `HnswLayer`, `HnswIndex`, `FullNode`, and `LinkageType` are
// defined elsewhere in the crate.

/// Condensed graph representation with synthetic nodes
///
/// Note: `Clone` cannot simply be derived because `SyntheticNode` holds
/// an `AtomicU64`; implement it manually (loading the counter) if needed.
#[derive(Debug)]
pub struct CondensedGraph {
    /// Synthetic node embeddings (learned via GNN)
    pub synthetic_nodes: Vec<SyntheticNode>,

    /// Condensed HNSW layers (smaller topology)
    pub condensed_layers: Vec<HnswLayer>,

    /// Compression ratio (e.g., 50.0 for 50x)
    pub compression_ratio: f32,

    /// Mapping from synthetic node to original node IDs
    pub expansion_map: HashMap<NodeId, Vec<NodeId>>,

    /// Graph statistics for adaptive expansion
    pub stats: GraphStatistics,
}

/// Synthetic node representing a cluster of original nodes
#[derive(Debug)]
pub struct SyntheticNode {
    /// Learned embedding (centroid of cluster, refined via GNN)
    pub embedding: Vec<f32>,

    /// Original node IDs in this cluster
    pub cluster_members: Vec<NodeId>,

    /// Cluster radius (for expansion threshold)
    pub radius: f32,

    /// Connectivity in condensed graph
    pub neighbors: Vec<(NodeId, f32)>, // (neighbor_id, edge_weight)

    /// Access frequency (for adaptive expansion)
    pub access_count: AtomicU64,
}

/// Configuration for graph condensation process
///
/// Note: `Debug` must be implemented manually because the `Custom`
/// clustering variant wraps a function.
#[derive(Clone)]
pub struct CondensationConfig {
    /// Target compression ratio (10-100)
    pub compression_ratio: f32,

    /// Clustering method
    pub clustering_method: ClusteringMethod,

    /// GNN training epochs for synthetic nodes
    pub gnn_epochs: usize,

    /// Structure preservation weight (vs embedding quality)
    pub structure_weight: f32,

    /// Edge preservation strategy
    pub edge_strategy: EdgePreservationStrategy,
}

/// The `Custom` variant holds its function behind `Arc` so the enum stays
/// `Clone`; `Debug` needs a manual impl for that variant.
#[derive(Clone)]
pub enum ClusteringMethod {
    /// Hierarchical agglomerative clustering
    Hierarchical { linkage: LinkageType },

    /// Louvain modularity-based clustering
    Louvain { resolution: f32 },

    /// Spectral clustering via graph Laplacian
    Spectral { n_components: usize },

    /// Custom clustering function
    Custom(Arc<dyn Fn(&HnswIndex) -> Vec<Vec<NodeId>> + Send + Sync>),
}

#[derive(Clone, Debug)]
pub enum EdgePreservationStrategy {
    /// Keep edges if both endpoints map to different synthetic nodes
    InterCluster,

    /// Weighted by cluster similarity
    WeightedSimilarity,

    /// Learn edge weights via GNN
    Learned,
}

/// Hybrid graph store supporting both condensed and full graphs
pub struct HybridGraphStore {
    /// Condensed graph (always loaded)
    condensed: CondensedGraph,

    /// Full graph (lazily loaded regions)
    full_graph: Option<Arc<RwLock<HnswIndex>>>,

    /// Expanded regions cache
    expanded_cache: LruCache<NodeId, ExpandedRegion>,

    /// Expansion policy
    policy: ExpansionPolicy,
}

/// Policy for when to expand condensed nodes to full graph
#[derive(Clone, Debug)]
pub enum ExpansionPolicy {
    /// Never expand (use condensed graph only)
    Never,

    /// Expand on cache miss
    OnDemand { cache_size: usize },

    /// Expand regions with high query frequency
    Adaptive { threshold: f64 },

    /// Always use full graph
    Always,
}

/// Expanded region of the full graph
struct ExpandedRegion {
    /// Full node data for this region
    nodes: Vec<FullNode>,

    /// Last access timestamp
    last_access: Instant,

    /// Access count
    access_count: u64,
}

/// Statistics for monitoring condensation quality
#[derive(Clone, Debug, Default)]
pub struct GraphStatistics {
    /// Average cluster size
    pub avg_cluster_size: f32,

    /// Cluster size variance
    pub cluster_variance: f32,

    /// Edge preservation ratio (condensed edges / original edges)
    pub edge_preservation: f32,

    /// Average path length increase
    pub path_length_delta: f32,

    /// Clustering coefficient preservation
    pub clustering_coef_ratio: f32,
}
```
### Key Algorithms (Pseudocode)

#### Algorithm 1: Graph Condensation (Offline Training)

```
function condense_graph(hnsw_index, config):
    # Step 1: Hierarchical clustering
    clusters = hierarchical_cluster(
        hnsw_index.nodes,
        target_clusters = hnsw_index.size / config.compression_ratio
    )

    # Step 2: Initialize synthetic node embeddings
    synthetic_nodes = []
    for cluster in clusters:
        centroid = compute_centroid(cluster.members)
        synthetic_nodes.append(SyntheticNode {
            embedding: centroid,
            cluster_members: cluster.members,
            radius: compute_cluster_radius(cluster),
            neighbors: [],
            access_count: 0
        })

    # Step 3: Build condensed edges
    condensed_edges = build_condensed_edges(
        hnsw_index,
        clusters,
        config.edge_strategy
    )

    # Step 4: GNN-based refinement
    gnn_model = GraphNeuralNetwork(
        input_dim = embedding_dim,
        hidden_dims = [128, 64],
        output_dim = embedding_dim
    )

    optimizer = Adam(gnn_model.parameters(), lr=0.001)

    for epoch in 1..config.gnn_epochs:
        # Forward pass: refine synthetic embeddings
        refined_embeddings = gnn_model.forward(
            synthetic_nodes.embeddings,
            condensed_edges
        )

        # Compute structure preservation loss
        loss = compute_structure_loss(
            refined_embeddings,
            condensed_edges,
            original_graph = hnsw_index,
            expansion_map = clusters,
            structure_weight = config.structure_weight
        )

        # Backward pass
        loss.backward()
        optimizer.step()

        # Update synthetic embeddings
        for i, node in enumerate(synthetic_nodes):
            node.embedding = refined_embeddings[i]

    # Step 5: Build condensed HNSW layers
    condensed_layers = build_hnsw_layers(
        synthetic_nodes,
        condensed_edges,
        max_layer = hnsw_index.max_layer
    )

    return CondensedGraph {
        synthetic_nodes,
        condensed_layers,
        compression_ratio: config.compression_ratio,
        expansion_map: clusters,
        stats: compute_statistics(...)
    }

function compute_structure_loss(embeddings, edges, original_graph, expansion_map, structure_weight):
    # Part 1: Embedding quality (centroid fidelity)
    embedding_loss = 0
    for i, synthetic_node in enumerate(embeddings):
        cluster_members = expansion_map[i]
        original_embeddings = [original_graph.get_embedding(id) for id in cluster_members]
        true_centroid = mean(original_embeddings)
        embedding_loss += mse(synthetic_node, true_centroid)

    # Part 2: Structure preservation (edge connectivity)
    structure_loss = 0
    for (u, v, weight) in edges:
        # Check if original graph had path between clusters u and v
        cluster_u = expansion_map[u]
        cluster_v = expansion_map[v]
        original_connectivity = compute_inter_cluster_connectivity(
            original_graph, cluster_u, cluster_v
        )
        predicted_connectivity = cosine_similarity(embeddings[u], embeddings[v])
        structure_loss += mse(predicted_connectivity, original_connectivity)

    # Part 3: Topological invariants
    topo_loss = 0
    condensed_clustering_coef = compute_clustering_coefficient(embeddings, edges)
    original_clustering_coef = original_graph.clustering_coefficient
    topo_loss += abs(condensed_clustering_coef - original_clustering_coef)

    return (1 - structure_weight) * embedding_loss +
           structure_weight * (structure_loss + 0.1 * topo_loss)
```
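Step 2 of the pseudocode above can be sketched concretely in Rust. The helper names mirror the pseudocode; the Euclidean metric is an assumption (the deployed index may use cosine or inner product):

```rust
/// Mean of a cluster's member embeddings (assumes non-empty, equal dims).
fn compute_centroid(members: &[Vec<f32>]) -> Vec<f32> {
    let dim = members[0].len();
    let mut centroid = vec![0.0f32; dim];
    for m in members {
        for (c, x) in centroid.iter_mut().zip(m) {
            *c += x;
        }
    }
    let n = members.len() as f32;
    centroid.iter_mut().for_each(|c| *c /= n);
    centroid
}

/// Cluster radius = max Euclidean distance from any member to the centroid,
/// used later as the expansion-uncertainty threshold.
fn compute_cluster_radius(members: &[Vec<f32>], centroid: &[f32]) -> f32 {
    members
        .iter()
        .map(|m| {
            m.iter()
                .zip(centroid)
                .map(|(a, b)| (a - b) * (a - b))
                .sum::<f32>()
                .sqrt()
        })
        .fold(0.0f32, f32::max)
}
```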

#### Algorithm 2: Progressive Expansion (Online Runtime)

```
function search_hybrid_graph(query, k, hybrid_store):
    # Step 1: Search in condensed graph
    condensed_results = search_condensed(
        query,
        hybrid_store.condensed,
        k_initial = k * 2  # oversample
    )

    # Step 2: Decide whether to expand
    if hybrid_store.policy == ExpansionPolicy::Never:
        return refine_condensed_results(condensed_results, k)

    # Step 3: Identify expansion candidates
    expansion_candidates = []
    for result in condensed_results:
        synthetic_node = result.node

        # Expand if: high uncertainty OR cache miss OR high query frequency
        should_expand = (
            result.distance < synthetic_node.radius * 1.5 OR                # uncertainty
            not hybrid_store.expanded_cache.contains(synthetic_node.id) OR  # cache miss
            synthetic_node.access_count.load() > adaptive_threshold         # hot region
        )

        if should_expand:
            expansion_candidates.append(synthetic_node.id)

    # Step 4: Expand regions (lazily load from full graph)
    if len(expansion_candidates) > 0:
        expanded_regions = hybrid_store.expand_regions(expansion_candidates)

        # Step 5: Refine search in expanded regions
        refined_results = []
        for region in expanded_regions:
            local_results = search_full_graph(
                query,
                region.nodes,
                k_local = k
            )
            refined_results.extend(local_results)

        # Merge condensed and expanded results
        all_results = merge_results(condensed_results, refined_results)
        return top_k(all_results, k)
    else:
        # No expansion needed
        return refine_condensed_results(condensed_results, k)

function expand_regions(hybrid_store, synthetic_node_ids):
    expanded = []
    for node_id in synthetic_node_ids:
        # Check cache first
        if hybrid_store.expanded_cache.contains(node_id):
            expanded.append(hybrid_store.expanded_cache.get(node_id))
            continue

        # Load from full graph (disk or memory)
        synthetic_node = hybrid_store.condensed.synthetic_nodes[node_id]
        cluster_member_ids = synthetic_node.cluster_members

        full_nodes = []
        if hybrid_store.full_graph.is_some():
            # Full graph in memory
            full_graph = hybrid_store.full_graph.unwrap()
            for member_id in cluster_member_ids:
                full_nodes.append(full_graph.get_node(member_id))
        else:
            # Load from disk (mmap)
            full_nodes = load_nodes_from_disk(cluster_member_ids)

        region = ExpandedRegion {
            nodes: full_nodes,
            last_access: now(),
            access_count: 1
        }

        # Add to cache (evict LRU if full)
        hybrid_store.expanded_cache.put(node_id, region)
        expanded.append(region)

    return expanded
```
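The three-way expansion test in Step 3 is a pure predicate and can be sketched in isolation. `CandidateView` is a hypothetical flattened view of a search result plus its synthetic node (not a type from the design above); the 1.5 radius multiplier matches the pseudocode's uncertainty margin:

```rust
/// Hypothetical flattened view of one condensed search result.
struct CandidateView {
    distance: f32,     // query distance to the synthetic node
    radius: f32,       // cluster radius of the synthetic node
    cached: bool,      // is this region already in the expanded cache?
    access_count: u64, // query-frequency counter for the region
}

/// Expand if: high uncertainty OR cache miss OR hot region.
fn should_expand(c: &CandidateView, adaptive_threshold: u64) -> bool {
    c.distance < c.radius * 1.5                // query lands within the cluster's margin
        || !c.cached                           // cache miss
        || c.access_count > adaptive_threshold // frequently queried region
}
```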

### API Design (Function Signatures)

```rust
use std::collections::HashMap;
use std::io;
use std::path::Path;
use std::time::Duration;

// ============================================================
// Public API for Graph Condensation
// ============================================================

pub trait GraphCondensation {
    /// Condense an HNSW index into a smaller graph
    fn condense(
        &self,
        config: CondensationConfig,
    ) -> Result<CondensedGraph, CondensationError>;

    /// Save condensed graph to disk
    fn save_condensed(&self, path: &Path) -> Result<(), io::Error>;

    /// Load condensed graph from disk
    fn load_condensed(path: &Path) -> Result<CondensedGraph, io::Error>;

    /// Validate condensation quality
    fn validate_condensation(
        &self,
        condensed: &CondensedGraph,
        test_queries: &[Vec<f32>],
    ) -> ValidationMetrics;
}

pub trait HybridGraphSearch {
    /// Search using hybrid condensed/full graph
    fn search_hybrid(
        &self,
        query: &[f32],
        k: usize,
        policy: ExpansionPolicy,
    ) -> Result<Vec<SearchResult>, SearchError>;

    /// Adaptive search with automatic expansion
    fn search_adaptive(
        &self,
        query: &[f32],
        k: usize,
        recall_target: f32, // e.g., 0.95
    ) -> Result<Vec<SearchResult>, SearchError>;

    /// Get current cache statistics
    fn cache_stats(&self) -> CacheStatistics;

    /// Preload hot regions into cache
    fn warmup_cache(&mut self, query_log: &[Vec<f32>]) -> Result<(), CacheError>;
}

// ============================================================
// Configuration API
// ============================================================

impl CondensationConfig {
    /// Default configuration for 50x compression
    pub fn default_50x() -> Self {
        Self {
            compression_ratio: 50.0,
            clustering_method: ClusteringMethod::Hierarchical {
                linkage: LinkageType::Ward,
            },
            gnn_epochs: 100,
            structure_weight: 0.7,
            edge_strategy: EdgePreservationStrategy::Learned,
        }
    }

    /// Aggressive compression for edge devices (100x)
    pub fn edge_device() -> Self {
        Self {
            compression_ratio: 100.0,
            clustering_method: ClusteringMethod::Louvain {
                resolution: 1.2,
            },
            gnn_epochs: 50,
            structure_weight: 0.5,
            edge_strategy: EdgePreservationStrategy::InterCluster,
        }
    }

    /// Conservative compression for high accuracy (10x)
    pub fn high_accuracy() -> Self {
        Self {
            compression_ratio: 10.0,
            clustering_method: ClusteringMethod::Spectral {
                n_components: 128,
            },
            gnn_epochs: 200,
            structure_weight: 0.9,
            edge_strategy: EdgePreservationStrategy::Learned,
        }
    }
}

// ============================================================
// Monitoring and Metrics
// ============================================================

#[derive(Clone, Debug)]
pub struct ValidationMetrics {
    /// Recall at different k values
    pub recall_at_k: HashMap<usize, f32>,

    /// Average path length increase
    pub avg_path_length_ratio: f32,

    /// Search latency comparison
    pub latency_ratio: f32,

    /// Memory reduction achieved
    pub memory_reduction: f32,

    /// Graph property preservation
    pub property_preservation: PropertyPreservation,
}

#[derive(Clone, Debug)]
pub struct PropertyPreservation {
    pub clustering_coefficient: f32,
    pub average_degree: f32,
    pub diameter_ratio: f32,
}

#[derive(Clone, Debug)]
pub struct CacheStatistics {
    pub hit_rate: f32,
    pub eviction_count: u64,
    pub avg_expansion_time: Duration,
    pub total_expansions: u64,
}
```
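One way `validate_condensation` could fill the `recall_at_k` entries: score each query as the fraction of ground-truth neighbors that the condensed search also returned. A minimal sketch (the helper name is illustrative, not part of the trait):

```rust
use std::collections::HashSet;

/// Fraction of the first k ground-truth neighbor IDs that also appear
/// in the first k retrieved IDs. Assumes k <= both list lengths.
fn recall_at_k(ground_truth: &[u64], retrieved: &[u64], k: usize) -> f32 {
    let truth: HashSet<u64> = ground_truth.iter().take(k).copied().collect();
    let hits = retrieved
        .iter()
        .take(k)
        .filter(|id| truth.contains(id))
        .count();
    hits as f32 / k as f32
}
```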

## Integration Points

### Affected Crates/Modules

1. **`ruvector-gnn` (Core GNN crate)**:
   - Add `condensation/` module for graph compression
   - Extend `HnswIndex` with `condense()` method
   - Add GNN training loop for synthetic node refinement

2. **`ruvector-core`**:
   - Add `CondensedGraph` serialization format (`.cgraph`)
   - Extend search API with hybrid search modes
   - Add `HybridGraphStore` as alternative index backend

3. **`ruvector-gnn-node` (Node.js bindings)**:
   - Expose `condense()` API to JavaScript/TypeScript
   - Add configuration builder for condensation parameters
   - Provide progress callbacks for offline condensation

4. **`ruvector-cli`**:
   - Add `ruvector condense` command for offline condensation
   - Add `ruvector validate-condensed` for quality testing
   - Add visualization for condensed graph statistics

5. **`ruvector-distributed`**:
   - Use condensed graphs for federated learning synchronization
   - Implement condensed graph transfer protocol
   - Add merge logic for condensed graphs from multiple nodes

### New Modules to Create

```
crates/ruvector-gnn/src/condensation/
├── mod.rs               # Public API
├── clustering.rs        # Hierarchical/Louvain/Spectral clustering
├── synthetic_node.rs    # Synthetic node learning via GNN
├── edge_preservation.rs # Edge weight computation
├── gnn_trainer.rs       # GNN training loop
├── structure_loss.rs    # Loss functions for structure preservation
├── serialization.rs     # .cgraph format I/O
└── validation.rs        # Quality metrics

crates/ruvector-core/src/hybrid/
├── mod.rs               # HybridGraphStore
├── expansion_policy.rs  # Adaptive expansion logic
├── cache.rs             # LRU cache for expanded regions
└── search.rs            # Hybrid search algorithms

crates/ruvector-gnn-node/condensation/
├── bindings.rs          # NAPI bindings
└── typescript/
    └── condensation.d.ts # TypeScript definitions
```

### Dependencies on Other Features

1. **Prerequisite: Attention Mechanisms (Tier 1)**:
   - SFGC uses attention-weighted clustering
   - Synthetic node embeddings benefit from attention-based aggregation
   - **Action**: Ensure attention module is stable before SFGC integration

2. **Synergy: Adaptive HNSW (Tier 2, Feature #5)**:
   - Adaptive HNSW can use condensed graph for cold start
   - Layer-wise compression ratios (compress higher layers more aggressively)
   - **Integration**: Shared `ExpansionPolicy` trait

3. **Optional: Neuromorphic Spiking (Tier 2, Feature #6)**:
   - Spiking networks can accelerate GNN training for synthetic nodes
   - **Integration**: Conditional compilation flag for spiking backend

4. **Complementary: Sparse Attention (Tier 3, Feature #8)**:
   - Sparse attention patterns can guide clustering
   - **Integration**: Use learned attention masks as clustering hints

## Regression Prevention

### Existing Functionality at Risk

1. **HNSW Search Accuracy**:
   - **Risk**: Condensed graph returns lower-quality results
   - **Mitigation**:
     - Validate recall@10 >= 0.92 on standard benchmarks (SIFT1M, GIST1M)
     - Add A/B testing framework for condensed vs full graph
     - Default to conservative 10x compression

2. **Memory Safety (Rust)**:
   - **Risk**: Expansion cache causes use-after-free or data races
   - **Mitigation**:
     - Use `Arc<RwLock<...>>` for shared ownership
     - Fuzz testing with ThreadSanitizer
     - Property-based testing with `proptest`

3. **Serialization Format Compatibility**:
   - **Risk**: `.cgraph` format breaks existing index loading
   - **Mitigation**:
     - Separate file extension (`.cgraph` vs `.hnsw`)
     - Version magic number in header
     - Fallback to full graph if condensation fails

4. **Node.js Bindings Performance**:
   - **Risk**: Condensation adds latency to JavaScript API
   - **Mitigation**:
     - Make condensation opt-in (separate method)
     - Async/non-blocking condensation API
     - Progress callbacks to avoid blocking event loop
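The `Arc<RwLock<...>>` shared-ownership pattern from mitigation 2, in miniature: a writer populates shared expansion state, then multiple reader threads search it concurrently. `Vec<u64>` stands in for the real expanded-region map:

```rust
use std::sync::{Arc, RwLock};
use std::thread;

/// Demonstrates shared ownership: one writer, then four concurrent readers.
/// Returns the sum of the lengths each reader observed.
fn concurrent_reads_one_writer() -> usize {
    let regions = Arc::new(RwLock::new(Vec::<u64>::new()));

    // Writer: "expand" a few regions (write guard drops at end of statement).
    regions.write().unwrap().extend([1, 2, 3]);

    // Readers: concurrent searches over the shared state.
    let handles: Vec<_> = (0..4)
        .map(|_| {
            let r = Arc::clone(&regions);
            thread::spawn(move || r.read().unwrap().len())
        })
        .collect();

    handles.into_iter().map(|h| h.join().unwrap()).sum()
}
```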

### Test Cases to Prevent Regressions

```rust
use std::path::Path;
use std::sync::{Arc, RwLock};
use std::thread;

// Test 1: Search quality preservation
#[test]
fn test_condensed_search_recall() {
    let full_index = build_test_index(10000);
    let condensed = full_index.condense(CondensationConfig::default_50x()).unwrap();

    let test_queries = generate_test_queries(100);

    for query in test_queries {
        let full_results = full_index.search(&query, 10);
        let condensed_results = condensed.search(&query, 10);

        let recall = compute_recall(&full_results, &condensed_results);
        assert!(recall >= 0.92, "Recall dropped below 92%: {}", recall);
    }
}

// Test 2: Memory reduction
#[test]
fn test_memory_footprint() {
    let full_index = build_test_index(100000);
    let condensed = full_index.condense(CondensationConfig::default_50x()).unwrap();

    let full_size = full_index.memory_usage();
    let condensed_size = condensed.memory_usage();

    let reduction = full_size as f32 / condensed_size as f32;
    assert!(reduction >= 40.0, "Memory reduction below 40x: {}", reduction);
}

// Test 3: Serialization round-trip
#[test]
fn test_condensed_serialization() {
    let original = build_test_index(1000).condense(CondensationConfig::default_50x()).unwrap();

    let path = "/tmp/test.cgraph";
    original.save_condensed(Path::new(path)).unwrap();
    let loaded = CondensedGraph::load_condensed(Path::new(path)).unwrap();

    assert_eq!(original.synthetic_nodes.len(), loaded.synthetic_nodes.len());
    assert_eq!(original.compression_ratio, loaded.compression_ratio);
}

// Test 4: Hybrid search correctness
// Note: the full index is wrapped in Arc<RwLock<...>> up front so it can be
// shared with the hybrid store and still be queried directly afterwards.
#[test]
fn test_hybrid_search_equivalence() {
    let full_index = Arc::new(RwLock::new(build_test_index(5000)));
    let condensed = full_index
        .read()
        .unwrap()
        .condense(CondensationConfig::default_50x())
        .unwrap();

    let hybrid_store = HybridGraphStore::new(condensed, Some(Arc::clone(&full_index)));

    let query = generate_random_query();

    // With ExpansionPolicy::Always, hybrid should match full graph
    let hybrid_results = hybrid_store.search_hybrid(&query, 10, ExpansionPolicy::Always).unwrap();
    let full_results = full_index.read().unwrap().search(&query, 10);

    assert_eq!(hybrid_results, full_results);
}

// Test 5: Concurrent expansion safety
#[test]
fn test_concurrent_expansion() {
    let hybrid_store = Arc::new(RwLock::new(build_hybrid_store()));

    let handles: Vec<_> = (0..10).map(|_| {
        let store = Arc::clone(&hybrid_store);
        thread::spawn(move || {
            let query = generate_random_query();
            let results = store.write().unwrap().search_hybrid(
                &query, 10, ExpansionPolicy::OnDemand { cache_size: 100 }
            );
            assert!(results.is_ok());
        })
    }).collect();

    for handle in handles {
        handle.join().unwrap();
    }
}
```

### Backward Compatibility Strategy

1. **API Level**:
   - Keep existing `HnswIndex::search()` unchanged
   - Add new `HnswIndex::condense()` method (opt-in)
   - Condensed search via separate `HybridGraphStore` type

2. **File Format**:
   - Condensed graphs use `.cgraph` extension
   - Original `.hnsw` format unchanged
   - Metadata includes version + compression ratio

3. **Node.js Bindings**:
   - Add `index.condense(config)` method (returns new `CondensedIndex` instance)
   - Keep `index.search()` behavior identical
   - Add `condensedIndex.searchHybrid()` for hybrid mode

4. **CLI**:
   - `ruvector build` unchanged (builds full graph)
   - New `ruvector condense` command (separate step)
   - Auto-detect `.cgraph` vs `.hnsw` on load
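The CLI auto-detection rule reduces to an extension match. A sketch (`IndexKind` is a hypothetical enum for illustration, not part of the API above):

```rust
use std::path::Path;

/// Hypothetical result of sniffing an index file's kind from its extension.
#[derive(Debug, PartialEq)]
enum IndexKind {
    Condensed, // .cgraph
    Full,      // .hnsw
    Unknown,
}

fn detect_index_kind(path: &Path) -> IndexKind {
    match path.extension().and_then(|e| e.to_str()) {
        Some("cgraph") => IndexKind::Condensed,
        Some("hnsw") => IndexKind::Full,
        _ => IndexKind::Unknown,
    }
}
```

In practice the version magic number in the header (see the File Format point above) should confirm whatever the extension suggests.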

## Implementation Phases

### Phase 1: Core Implementation (Weeks 1-3)

**Goals**:

- Implement clustering algorithms (hierarchical, Louvain)
- Build basic synthetic node creation (centroid-based, no GNN)
- Implement condensed HNSW layer construction
- Basic serialization (`.cgraph` format)

**Deliverables**:

```rust
// Week 1: Clustering
crates/ruvector-gnn/src/condensation/clustering.rs
  ✓ hierarchical_cluster()
  ✓ louvain_cluster()
  ✓ spectral_cluster()

// Week 2: Synthetic nodes + edges
crates/ruvector-gnn/src/condensation/synthetic_node.rs
  ✓ create_synthetic_nodes()  // centroid-based
  ✓ build_condensed_edges()

// Week 3: Condensed graph + serialization
crates/ruvector-gnn/src/condensation/mod.rs
  ✓ CondensedGraph::from_hnsw()
  ✓ save_condensed() / load_condensed()
```

**Success Criteria**:

- Can condense 100K vector index to 2K synthetic nodes
- Serialization round-trip preserves graph structure
- Unit tests pass for clustering algorithms

### Phase 2: Integration (Weeks 4-6)

**Goals**:

- Integrate with `HnswIndex` API
- Add GNN-based synthetic node refinement
- Implement hybrid search with basic expansion policy
- Node.js bindings

**Deliverables**:

```rust
// Week 4: HNSW integration
crates/ruvector-gnn/src/hnsw/index.rs
  ✓ impl GraphCondensation for HnswIndex

// Week 5: GNN training
crates/ruvector-gnn/src/condensation/gnn_trainer.rs
  ✓ train_synthetic_embeddings()
  ✓ structure_preservation_loss()

// Week 6: Hybrid search
crates/ruvector-core/src/hybrid/
  ✓ HybridGraphStore::search_hybrid()
  ✓ ExpansionPolicy::OnDemand
```

**Success Criteria**:

- Recall@10 >= 0.90 on SIFT1M benchmark
- GNN training converges in <100 epochs
- Hybrid search passes correctness tests

### Phase 3: Optimization (Weeks 7-9)

**Goals**:

- Performance tuning (SIMD, caching)
- Adaptive expansion policy (query frequency tracking)
- Distributed condensation for federated learning
- CLI tool for offline condensation

**Deliverables**:

```rust
// Week 7: Performance optimization
crates/ruvector-gnn/src/condensation/
  ✓ SIMD-optimized centroid computation
  ✓ Parallel clustering (rayon)

// Week 8: Adaptive expansion
crates/ruvector-core/src/hybrid/
  ✓ ExpansionPolicy::Adaptive
  ✓ Query frequency tracking
  ✓ LRU cache tuning

// Week 9: CLI + distributed
crates/ruvector-cli/src/commands/condense.rs
  ✓ ruvector condense --ratio 50
crates/ruvector-distributed/src/sync.rs
  ✓ Condensed graph synchronization
```

**Success Criteria**:

- Condensation time <10s for 1M vectors
- Adaptive expansion improves latency by 20%+
- CLI can condense production-scale graphs

### Phase 4: Production Hardening (Weeks 10-12)

**Goals**:

- Comprehensive testing (property-based, fuzz, benchmarks)
- Documentation + examples
- Performance regression suite
- Multi-platform validation

**Deliverables**:

```rust
// Week 10: Testing
tests/condensation/
  ✓ Property-based tests (proptest)
  ✓ Fuzz testing (cargo-fuzz)
  ✓ Regression test suite

// Week 11: Documentation
docs/
  ✓ Graph Condensation Guide (user-facing)
  ✓ API documentation (rustdoc)
  ✓ Examples (edge device deployment)

// Week 12: Benchmarks + validation
benches/condensation.rs
  ✓ Condensation time benchmarks
  ✓ Search quality benchmarks
  ✓ Memory footprint benchmarks
```

**Success Criteria**:

- 100% code coverage for condensation module
- Passes all regression tests
- Documentation complete with 3+ examples
- Validated on ARM64, x86-64, WASM targets

## Success Metrics

### Performance Benchmarks

| Benchmark | Metric | Target | Measurement Method |
|-----------|--------|--------|--------------------|
| Condensation Time | Time to condense 1M vectors | <10s | `cargo bench condense_1m` |
| Memory Reduction | Footprint ratio (full/condensed) | 50x | `malloc_count` |
| Search Latency (condensed only) | p99 latency | <2ms | `criterion` benchmark |
| Search Latency (hybrid, cold) | p99 latency on first query | <3ms | Cache miss scenario |
| Search Latency (hybrid, warm) | p99 latency after warmup | <1.5ms | Cache hit scenario |
| Expansion Time | Time to expand 1 cluster | <0.5ms | `expand_regions()` profiling |

### Accuracy Metrics

| Dataset | Metric | Target | Baseline (Full Graph) |
|---------|--------|--------|-----------------------|
| SIFT1M | Recall@10 (50x compression) | >=0.92 | 0.95 |
| SIFT1M | Recall@100 (50x compression) | >=0.90 | 0.94 |
| GIST1M | Recall@10 (50x compression) | >=0.90 | 0.93 |
| GloVe-200 | Recall@10 (100x compression) | >=0.85 | 0.92 |
| Custom high-dim (1536d) | Recall@10 (50x compression) | >=0.88 | 0.94 |

### Memory/Latency Targets

| Configuration | Memory Footprint | Search Latency (p99) | Use Case |
|---------------|------------------|----------------------|----------|
| Full HNSW (1M vectors) | 4.8GB | 1.2ms | Server deployment |
| Condensed 50x (baseline) | 96MB | 1.5ms (cold), 1.2ms (warm) | Edge device |
| Condensed 100x (aggressive) | 48MB | 2.0ms (cold), 1.5ms (warm) | IoT device |
| Condensed 10x (conservative) | 480MB | 1.3ms (cold), 1.2ms (warm) | Embedded system |
| Hybrid (50x + on-demand) | 96MB + cache | 1.3ms (adaptive) | Mobile app |

**Measurement Tools**:

- Memory: `massif` (Valgrind), `heaptrack`, custom `malloc_count`
- Latency: `criterion` (Rust), `perf` (Linux profiling)
- Accuracy: Custom recall calculator against ground truth

### Quality Gates

All gates must pass before production release:

1. **Functional**:
   - ✓ All unit tests pass (100% coverage for core logic)
   - ✓ Integration tests pass on 3+ datasets
   - ✓ Serialization round-trip is lossless

2. **Performance**:
   - ✓ Memory reduction >= 40x (for 50x target config)
   - ✓ Condensation time <= 15s for 1M vectors
   - ✓ Search latency penalty <= 30% (cold start)

3. **Accuracy**:
   - ✓ Recall@10 >= 0.92 on SIFT1M (50x compression)
   - ✓ Recall@10 >= 0.85 on GIST1M (100x compression)
   - ✓ No catastrophic failures (recall < 0.5)

4. **Compatibility**:
   - ✓ Works on Linux x86-64, ARM64, macOS
   - ✓ Node.js bindings pass all tests
   - ✓ Backward compatible with existing indexes
|
|
|
|
## Risks and Mitigations

### Technical Risks

#### Risk 1: GNN Training Instability

**Description**:
Synthetic node embeddings may not converge during GNN training, leading to poor structure preservation.

**Probability**: Medium (30%)

**Impact**: High (blocks Phase 2)

**Mitigation**:

1. **Fallback**: Start with centroid-only embeddings (no GNN) in Phase 1
2. **Hyperparameter Tuning**: Grid search over learning rates (1e-4 to 1e-2)
3. **Loss Function Design**: Add a regularization term to prevent mode collapse
4. **Early Stopping**: Monitor validation recall and stop when it plateaus
5. **Alternative**: Use pre-trained graph embeddings (Node2Vec, GraphSAGE) if GNN training fails

**Contingency Plan**:
If GNN training is still unstable after 2 weeks of tuning, fall back to attention-weighted centroids (reusing the existing attention mechanisms from Tier 1).
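
The early-stopping mitigation can be sketched as a small tracker driven by validation recall. The struct, thresholds, and field names below are illustrative assumptions, not the actual training loop:

```rust
/// Hypothetical early-stopping tracker: stop when validation recall has not
/// improved by at least `min_delta` for `patience` consecutive evaluations.
struct EarlyStopping {
    patience: usize,
    min_delta: f64,
    best: f64,
    stale: usize,
}

impl EarlyStopping {
    fn new(patience: usize, min_delta: f64) -> Self {
        Self { patience, min_delta, best: f64::NEG_INFINITY, stale: 0 }
    }

    /// Feed one validation-recall measurement; returns true when training should stop.
    fn update(&mut self, val_recall: f64) -> bool {
        if val_recall > self.best + self.min_delta {
            self.best = val_recall; // improvement: reset the staleness counter
            self.stale = 0;
        } else {
            self.stale += 1;
        }
        self.stale >= self.patience
    }
}

fn main() {
    let mut stopper = EarlyStopping::new(3, 1e-3);
    // Recall improves, then plateaus for three evaluations -> stop at index 5.
    let history = [0.80, 0.85, 0.88, 0.88, 0.88, 0.88];
    let stopped_at = history.iter().position(|&r| stopper.update(r));
    assert_eq!(stopped_at, Some(5));
}
```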

#### Risk 2: Cold Start Latency Regression

**Description**:
Condensed graph search may be slower than expected due to poor synthetic node placement.

**Probability**: Medium (40%)

**Impact**: Medium (user-facing latency)

**Mitigation**:

1. **Profiling**: Use `perf` to identify bottlenecks (likely distance computations)
2. **SIMD Optimization**: Vectorize distance calculations for synthetic nodes
3. **Caching**: Precompute pairwise distances between synthetic nodes
4. **Pruning**: Reduce condensed graph connectivity (fewer edges per node)
5. **Hybrid Strategy**: Always expand the top-3 synthetic nodes to reduce uncertainty

**Contingency Plan**:
If cold start latency exceeds 2x the full graph, add a "warm cache" mode that preloads frequently accessed clusters based on the query distribution.
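
The caching mitigation (precomputed pairwise distances) can be sketched as a flat symmetric matrix built once at load time. Note this is only feasible when the synthetic node count is small; names and layout below are illustrative assumptions:

```rust
/// Sketch: precompute all pairwise L2 distances between synthetic nodes so
/// condensed-graph traversal never recomputes them. O(n^2) memory, so this
/// only pays off for a modest number of synthetic nodes.
struct DistanceCache {
    n: usize,
    dists: Vec<f32>, // row-major n x n matrix
}

impl DistanceCache {
    fn build(embeddings: &[Vec<f32>]) -> Self {
        let n = embeddings.len();
        let mut dists = vec![0.0f32; n * n];
        for i in 0..n {
            for j in (i + 1)..n {
                let d = embeddings[i]
                    .iter()
                    .zip(&embeddings[j])
                    .map(|(a, b)| (a - b) * (a - b))
                    .sum::<f32>()
                    .sqrt();
                dists[i * n + j] = d;
                dists[j * n + i] = d; // distance is symmetric
            }
        }
        Self { n, dists }
    }

    /// O(1) lookup replacing an O(d) distance computation during search.
    fn dist(&self, i: usize, j: usize) -> f32 {
        self.dists[i * self.n + j]
    }
}

fn main() {
    let nodes = vec![vec![0.0, 0.0], vec![3.0, 4.0], vec![6.0, 8.0]];
    let cache = DistanceCache::build(&nodes);
    assert!((cache.dist(0, 1) - 5.0).abs() < 1e-6);
    assert!((cache.dist(1, 2) - 5.0).abs() < 1e-6);
    assert_eq!(cache.dist(2, 2), 0.0);
}
```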

#### Risk 3: Memory Overhead from Expansion Cache

**Description**:
LRU cache for expanded regions may consume more memory than expected, negating compression benefits.

**Probability**: Low (20%)

**Impact**: Medium (defeats the purpose on edge devices)

**Mitigation**:

1. **Adaptive Cache Size**: Dynamically adjust cache size based on available memory
2. **Partial Expansion**: Only expand the k-nearest neighbors within a cluster (not the full cluster)
3. **Compression**: Store expanded regions in quantized format (int8 instead of float32)
4. **Eviction Policy**: Evict based on access frequency + recency (LFU + LRU hybrid)

**Contingency Plan**:
If cache overhead exceeds 20% of the condensed graph size, make expansion fully on-demand (no caching) and optimize expansion from disk (mmap).
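
The LFU+LRU hybrid eviction policy can be sketched as a scoring function that combines hit count and staleness, evicting the lowest-scoring cached cluster. The weighting formula and struct fields are illustrative assumptions:

```rust
/// Sketch of the hybrid eviction policy: frequency (LFU) raises an entry's
/// score, age since last access (LRU) decays it; the minimum-score entry
/// is the eviction victim.
struct CacheEntry {
    cluster_id: u32,
    hits: u32,      // frequency component (LFU)
    last_tick: u64, // recency component (LRU), in logical clock ticks
}

fn eviction_score(e: &CacheEntry, now: u64) -> f64 {
    let age = (now - e.last_tick) as f64;
    e.hits as f64 / (1.0 + age)
}

fn pick_victim(entries: &[CacheEntry], now: u64) -> u32 {
    entries
        .iter()
        .min_by(|a, b| {
            eviction_score(a, now)
                .partial_cmp(&eviction_score(b, now))
                .unwrap()
        })
        .expect("cache is not empty")
        .cluster_id
}

fn main() {
    let entries = vec![
        CacheEntry { cluster_id: 7, hits: 50, last_tick: 90 }, // hot and recent
        CacheEntry { cluster_id: 3, hits: 2, last_tick: 10 },  // cold and stale
        CacheEntry { cluster_id: 9, hits: 40, last_tick: 95 },
    ];
    // The cold, stale cluster is evicted first.
    assert_eq!(pick_victim(&entries, 100), 3);
}
```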

#### Risk 4: Clustering Quality for High-Dimensional Data

**Description**:
Hierarchical clustering may produce imbalanced clusters in high-dimensional spaces (curse of dimensionality).

**Probability**: High (60%)

**Impact**: Medium (poor compression or accuracy)

**Mitigation**:

1. **Dimensionality Reduction**: Apply PCA or UMAP before clustering
2. **Alternative Algorithms**: Try spectral clustering or Louvain (graph-based, not distance-based)
3. **Cluster Validation**: Measure the silhouette score and reject poor clusterings
4. **Adaptive Compression**: Use variable compression ratios per region (dense regions = higher compression)

**Contingency Plan**:
If clustering quality is poor (silhouette score < 0.3), switch to graph-based Louvain clustering using the HNSW edges as the adjacency matrix.
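
The silhouette-based validation gate can be sketched directly from its definition: for each point, compare the mean distance to its own cluster (a) against the mean distance to the nearest other cluster (b). This is a plain-Euclidean sketch with illustrative names, not the production validator:

```rust
fn euclid(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| (x - y) * (x - y)).sum::<f32>().sqrt()
}

/// Mean silhouette over all points: s = (b - a) / max(a, b), where a is the
/// mean intra-cluster distance and b the mean distance to the nearest other
/// cluster. A clustering scoring below 0.3 would be rejected.
fn silhouette(clusters: &[Vec<Vec<f32>>]) -> f32 {
    let mut total = 0.0f32;
    let mut count = 0usize;
    for (ci, cluster) in clusters.iter().enumerate() {
        if cluster.len() < 2 {
            continue; // silhouette is undefined for singleton clusters
        }
        for (pi, p) in cluster.iter().enumerate() {
            // a: mean distance to the other members of p's own cluster
            let a = cluster
                .iter()
                .enumerate()
                .filter(|(qi, _)| *qi != pi)
                .map(|(_, q)| euclid(p, q))
                .sum::<f32>()
                / (cluster.len() - 1) as f32;
            // b: mean distance to the nearest other cluster
            let b = clusters
                .iter()
                .enumerate()
                .filter(|(cj, c)| *cj != ci && !c.is_empty())
                .map(|(_, c)| c.iter().map(|q| euclid(p, q)).sum::<f32>() / c.len() as f32)
                .fold(f32::INFINITY, f32::min);
            total += (b - a) / a.max(b);
            count += 1;
        }
    }
    total / count as f32
}

fn main() {
    // Two tight, well-separated clusters -> silhouette close to 1.
    let good = vec![
        vec![vec![0.0, 0.0], vec![0.1, 0.0]],
        vec![vec![10.0, 0.0], vec![10.1, 0.0]],
    ];
    let s = silhouette(&good);
    assert!(s > 0.9, "expected high silhouette, got {s}");
}
```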

#### Risk 5: Serialization Format Bloat

**Description**:
`.cgraph` format may be larger than expected due to storing expansion maps and GNN weights.

**Probability**: Medium (35%)

**Impact**: Low (reduces compression benefits)

**Mitigation**:

1. **Sparse Storage**: Use sparse matrix formats (CSR) for expansion maps
2. **Quantization**: Store GNN embeddings in int8 (4x smaller than float32)
3. **Compression**: Apply zstd compression to the `.cgraph` file
4. **Lazy Loading**: Only load the expansion map on demand (not upfront)

**Contingency Plan**:
If the `.cgraph` file exceeds 50% of the condensed graph target size, remove GNN weights from serialization and recompute them on load (trading disk space for CPU time).
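
The quantization mitigation can be sketched as symmetric int8 quantization: one f32 scale per embedding, with a round-trip error bounded by half a quantization step. Function names are illustrative, not the `.cgraph` writer's actual API:

```rust
/// Sketch: symmetric int8 quantization of a GNN embedding before it is
/// serialized. The payload shrinks 4x (f32 -> i8) plus one f32 scale.
fn quantize(v: &[f32]) -> (f32, Vec<i8>) {
    let max_abs = v.iter().fold(0.0f32, |m, &x| m.max(x.abs()));
    // Map [-max_abs, max_abs] onto [-127, 127]; guard the all-zero vector.
    let scale = if max_abs == 0.0 { 1.0 } else { max_abs / 127.0 };
    let q = v.iter().map(|&x| (x / scale).round() as i8).collect();
    (scale, q)
}

fn dequantize(scale: f32, q: &[i8]) -> Vec<f32> {
    q.iter().map(|&b| b as f32 * scale).collect()
}

fn main() {
    let v = vec![0.5f32, -1.0, 0.25, 0.0];
    let (scale, q) = quantize(&v);
    let back = dequantize(scale, &q);
    // Round-trip error is bounded by half a quantization step.
    for (orig, rec) in v.iter().zip(&back) {
        assert!((orig - rec).abs() <= scale / 2.0 + 1e-6);
    }
}
```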

### Operational Risks

#### Risk 6: User Confusion with Hybrid API

**Description**:
Users may not understand when to use condensed vs full vs hybrid graphs.

**Probability**: High (70%)

**Impact**: Low (documentation issue)

**Mitigation**:

1. **Clear Documentation**: Add a decision tree (edge device → condensed, server → full, mobile → hybrid)
2. **Smart Defaults**: Auto-detect the environment (check available memory) and choose a policy
3. **Examples**: Provide 3 reference implementations (edge, mobile, server)
4. **Validation**: Add a `validate_condensed()` method that warns if recall is too low
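
The "smart defaults" idea can be sketched as a policy chooser driven by how much memory the host can spare relative to the full graph. The thresholds and names below are illustrative assumptions, not the shipped heuristic:

```rust
/// Sketch: pick a graph policy from available memory.
#[derive(Debug, PartialEq)]
enum GraphPolicy {
    Full,      // server: plenty of RAM, lowest latency
    Hybrid,    // mobile: condensed graph + expansion cache
    Condensed, // edge/IoT: condensed graph only
}

fn choose_policy(available_bytes: u64, full_graph_bytes: u64) -> GraphPolicy {
    if available_bytes >= 2 * full_graph_bytes {
        GraphPolicy::Full // full graph fits with comfortable headroom
    } else if available_bytes >= full_graph_bytes / 10 {
        GraphPolicy::Hybrid // room for the condensed graph plus a cache
    } else {
        GraphPolicy::Condensed // only the 50-100x compressed form fits
    }
}

fn main() {
    let full = 4_800_000_000u64; // 4.8GB full HNSW for 1M vectors
    assert_eq!(choose_policy(16_000_000_000, full), GraphPolicy::Full);
    assert_eq!(choose_policy(1_000_000_000, full), GraphPolicy::Hybrid);
    assert_eq!(choose_policy(128_000_000, full), GraphPolicy::Condensed);
}
```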

#### Risk 7: Debugging Difficulty

**Description**:
When condensed search returns wrong results, debugging is harder (no direct mapping to original nodes).

**Probability**: Medium (50%)

**Impact**: Medium (developer experience)

**Mitigation**:

1. **Logging**: Add verbose logging for expansion decisions
2. **Visualization**: Provide a tool to visualize the condensed graph and its clusters
3. **Explain API**: Add an `explain_search()` method that shows which clusters were searched
4. **Metrics**: Expose per-cluster recall metrics
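
The explain-API mitigation can be sketched as a per-query trace of which clusters were scored and which were expanded, so a wrong result can be traced to a cluster that was never opened. The struct and fields are hypothetical:

```rust
/// Hypothetical trace record that an `explain_search()`-style call could
/// return, one entry per cluster touched during a hybrid search.
struct ClusterTrace {
    cluster_id: u32,
    visited: bool,  // the synthetic node was scored during traversal
    expanded: bool, // the cluster was expanded to its original nodes
}

/// Summarize a trace: (clusters visited, clusters expanded).
fn summarize(traces: &[ClusterTrace]) -> (usize, usize) {
    let visited = traces.iter().filter(|t| t.visited).count();
    let expanded = traces.iter().filter(|t| t.expanded).count();
    (visited, expanded)
}

fn main() {
    let traces = vec![
        ClusterTrace { cluster_id: 4, visited: true, expanded: true },
        ClusterTrace { cluster_id: 11, visited: true, expanded: false },
        ClusterTrace { cluster_id: 29, visited: false, expanded: false },
    ];
    let (visited, expanded) = summarize(&traces);
    assert_eq!((visited, expanded), (2, 1));
    // Visited-but-never-expanded clusters are the usual recall suspects.
    for t in traces.iter().filter(|t| t.visited && !t.expanded) {
        println!("cluster {} scored but never expanded - recall risk", t.cluster_id);
    }
}
```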

---

## Appendix: Related Research

This design is based on:

1. **Graph Condensation for GNNs** (Jin et al., 2021): Core SFGC algorithm
2. **Structure-Preserving Graph Coarsening** (Loukas, 2019): Topological invariants
3. **Hierarchical Navigable Small Worlds** (Malkov & Yashunin, 2018): HNSW baseline
4. **Federated Graph Learning** (Wu et al., 2022): Distributed graph synchronization

Key differences from prior work:

- **Novel**: GNN-based synthetic node learning (prior work used simple centroids)
- **Novel**: Hybrid search with adaptive expansion (prior work searched only the condensed graph)
- **Engineering**: Production-ready Rust implementation with SIMD optimization