Merge commit 'd803bfe2b1fe7f5e219e50ac20d6801a0a58ac75' as 'vendor/ruvector'
This commit is contained in:
744
vendor/ruvector/docs/research/wasm-integration-2026/03-storage-gnn-acceleration.md
vendored
Normal file
744
vendor/ruvector/docs/research/wasm-integration-2026/03-storage-gnn-acceleration.md
vendored
Normal file
@@ -0,0 +1,744 @@
|
||||
# Storage-Based GNN Acceleration: Hyperbatch Training for Out-of-Core Graphs
|
||||
|
||||
**Document ID**: wasm-integration-2026/03-storage-gnn-acceleration
|
||||
**Date**: 2026-02-22
|
||||
**Status**: Research Complete
|
||||
**Classification**: Systems Research — Graph Neural Networks
|
||||
**Series**: [Executive Summary](./00-executive-summary.md) | [01](./01-pseudo-deterministic-mincut.md) | [02](./02-sublinear-spectral-solvers.md) | **03** | [04](./04-wasm-microkernel-architecture.md) | [05](./05-cross-stack-integration.md)
|
||||
|
||||
---
|
||||
|
||||
## Abstract
|
||||
|
||||
This document analyzes storage-based GNN acceleration techniques — particularly the AGNES-style hyperbatch approach — and maps them onto RuVector's `ruvector-gnn` crate. We show that the existing `mmap` feature flag and training pipeline can be extended with block-aligned I/O, hotset caching, and cold-tier graph streaming to enable GNN training on graphs that exceed available RAM, achieving 3-4x throughput improvements over naive disk-based approaches while maintaining training convergence guarantees.
|
||||
|
||||
---
|
||||
|
||||
## 1. The Out-of-Core GNN Challenge
|
||||
|
||||
### 1.1 Memory Wall for Graph Learning
|
||||
|
||||
Graph Neural Networks (GNNs) require simultaneous access to:
|
||||
1. **Node features**: X ∈ R^{n×d} (n nodes, d-dimensional features)
|
||||
2. **Adjacency structure**: A ∈ {0,1}^{n×n} (sparse, but neighborhoods fan out)
|
||||
3. **Intermediate activations**: H^{(l)} ∈ R^{n×d_l} per layer
|
||||
4. **Gradients**: Same size as activations for backpropagation
|
||||
|
||||
For large graphs, memory requirements scale as:
|
||||
|
||||
| Graph Size | Features (d=128) | Adjacency (avg deg=50) | Activations (3 layers) | Total |
|
||||
|-----------|-----------------|----------------------|---------------------|-------|
|
||||
| 100K nodes | 49 MB | 40 MB | 147 MB | ~236 MB |
|
||||
| 1M nodes | 488 MB | 400 MB | 1.4 GB | ~2.3 GB |
|
||||
| 10M nodes | 4.8 GB | 4 GB | 14 GB | ~23 GB |
|
||||
| 100M nodes | 48 GB | 40 GB | 144 GB | ~232 GB |
|
||||
| 1B nodes | 480 GB | 400 GB | 1.4 TB | ~2.3 TB |
|
||||
|
||||
At 10M+ nodes, the graph exceeds typical workstation RAM (32-64 GB). At 100M+, it exceeds high-memory servers. Yet real-world graphs (social networks, molecular databases, web crawls) routinely reach these scales.
|
||||
|
||||
### 1.2 Existing Approaches and Their Limitations
|
||||
|
||||
| Approach | Technique | Limitation |
|
||||
|----------|-----------|-----------|
|
||||
| Mini-batch sampling | Sample k-hop neighborhoods per node | Exponential neighborhood explosion; poor convergence |
|
||||
| Graph partitioning | Partition graph, train per partition | Cross-partition edges lost; partition quality affects accuracy |
|
||||
| Distributed training | Shard across machines | Communication overhead; requires cluster infrastructure |
|
||||
| Sampling + caching | Cache frequently accessed neighborhoods | Cache thrashing for power-law graphs; memory overhead |
|
||||
| **Hyperbatch (AGNES)** | **Block-aligned I/O with hotset caching** | **Requires SSD; I/O scheduling complexity** |
|
||||
|
||||
### 1.3 The AGNES Hyperbatch Insight
|
||||
|
||||
AGNES (Accelerating GNN training with Efficient Storage) introduces a key insight: **align GNN training batches with storage access patterns** rather than the reverse.
|
||||
|
||||
Traditional approach:
|
||||
```
|
||||
Training loop → Random mini-batch selection → Random I/O → Slow
|
||||
```
|
||||
|
||||
AGNES hyperbatch approach:
|
||||
```
|
||||
Storage layout → Block-aligned batches → Sequential I/O → Fast
|
||||
```
|
||||
|
||||
The hyperbatch is a training batch constructed to maximize **sequential I/O** by grouping nodes whose features and neighborhoods are physically co-located on storage.
|
||||
|
||||
---
|
||||
|
||||
## 2. Hyperbatch Architecture
|
||||
|
||||
### 2.1 Core Concepts
|
||||
|
||||
**Definition (Hyperbatch)**: A hyperbatch B ⊆ V is a subset of nodes such that:
|
||||
1. The features of all nodes in B are stored in a contiguous range of disk blocks
|
||||
2. The k-hop neighborhoods of nodes in B have maximum overlap with B itself
|
||||
3. |B| is chosen to fit in available RAM together with intermediate activations
|
||||
|
||||
**Definition (Hotset)**: The hotset H ⊆ V is the subset of high-degree "hub" nodes whose features are permanently cached in RAM. Hotset selection criterion:
|
||||
|
||||
```
|
||||
H = argmax_{S ⊆ V, |S| ≤ budget} Σ_{v ∈ S} degree(v) · access_frequency(v)
|
||||
```
|
||||
|
||||
### 2.2 Hyperbatch Construction Algorithm
|
||||
|
||||
```
|
||||
Algorithm: ConstructHyperbatch(G, block_size, ram_budget)
|
||||
Input: Graph G = (V, E), storage block size B, RAM budget M
|
||||
Output: Sequence of hyperbatches B₁, B₂, ..., B_k
|
||||
|
||||
1. Reorder vertices by graph clustering (e.g., Metis, Rabbit Order)
|
||||
→ Vertices in same community get adjacent storage positions
|
||||
|
||||
2. Select hotset H based on degree + access frequency
|
||||
→ Cache H in RAM permanently
|
||||
|
||||
3. Partition remaining vertices V \ H into blocks of size ⌊M / (d + sizeof(neighbor_list))⌋
|
||||
→ Each block fits entirely in RAM
|
||||
|
||||
4. For each block bₖ:
|
||||
a. Load features X[bₖ] from disk (sequential read)
|
||||
b. For each GNN layer l = 1, ..., L:
|
||||
- Identify required neighbors N(bₖ) at layer l
|
||||
- Partition N(bₖ) into: cached (in H) vs. cold (on disk)
|
||||
- Fetch cold neighbors with block-aligned prefetch
|
||||
c. Yield hyperbatch Bₖ = bₖ ∪ (N(bₖ) ∩ H) with all required data
|
||||
|
||||
5. Return B₁, ..., B_k
|
||||
```
|
||||
|
||||
### 2.3 I/O Scheduling
|
||||
|
||||
The hyperbatch scheduler interleaves I/O and computation:
|
||||
|
||||
```
|
||||
Thread 1 (I/O): [Load B₁] [Load B₂] [Load B₃] ...
|
||||
Thread 2 (Compute): idle [Train B₁] [Train B₂] ...
|
||||
```
|
||||
|
||||
With double-buffering, the I/O latency is fully hidden when:
|
||||
```
|
||||
T_io(Bₖ) ≤ T_compute(Bₖ₋₁)
|
||||
```
|
||||
|
||||
For modern NVMe SSDs (3-7 GB/s sequential read) and GNN training (~100 GFLOPS), this condition holds for most practical graph sizes.
|
||||
|
||||
### 2.4 Convergence Properties
|
||||
|
||||
**Theorem (Hyperbatch Convergence)**: Under standard GNN training assumptions (L-smooth loss, bounded gradients), hyperbatch SGD converges at rate:
|
||||
|
||||
```
|
||||
E[f(w_T) - f(w*)] ≤ O(1/√T + σ²_cross/√T)
|
||||
```
|
||||
|
||||
where σ²_cross is the variance introduced by cross-hyperbatch edge sampling. This matches standard mini-batch SGD up to the cross-batch term, which diminishes with good vertex reordering.
|
||||
|
||||
---
|
||||
|
||||
## 3. RuVector GNN Crate Mapping
|
||||
|
||||
### 3.1 Current State: `ruvector-gnn`
|
||||
|
||||
The `ruvector-gnn` crate provides:
|
||||
|
||||
**Core modules**:
|
||||
- `tensor`: Tensor operations for GNN computation
|
||||
- `layer`: GNN layer implementations (`RuvectorLayer`)
|
||||
- `training`: SGD, Adam optimizer, loss functions (InfoNCE, local contrastive)
|
||||
- `search`: Differentiable search, hierarchical forward pass
|
||||
- `compress`: Tensor compression with configurable levels
|
||||
- `query`: Subgraph queries with multiple modes
|
||||
- `ewc`: Elastic Weight Consolidation (prevents catastrophic forgetting)
|
||||
- `replay`: Experience replay buffer with reservoir sampling
|
||||
- `scheduler`: Learning rate scheduling (cosine annealing, plateau detection)
|
||||
|
||||
**Feature-gated modules**:
|
||||
- `mmap` (not on wasm32): Memory-mapped I/O via `MmapManager`, `MmapGradientAccumulator`, `AtomicBitmap`
|
||||
|
||||
### 3.2 Existing mmap Infrastructure
|
||||
|
||||
The `mmap` module already provides:
|
||||
|
||||
```rust
|
||||
// Behind #[cfg(all(not(target_arch = "wasm32"), feature = "mmap"))]
|
||||
pub struct MmapManager { /* ... */ }
|
||||
pub struct MmapGradientAccumulator { /* ... */ }
|
||||
pub struct AtomicBitmap { /* ... */ }
|
||||
```
|
||||
|
||||
This is the foundation for cold-tier storage. The `MmapManager` handles memory-mapped file access; the `MmapGradientAccumulator` accumulates gradients for out-of-core nodes; the `AtomicBitmap` tracks which nodes are currently in memory.
|
||||
|
||||
### 3.3 Integration Path: Adding Cold-Tier Training
|
||||
|
||||
```rust
|
||||
// Proposed: ruvector-gnn/src/cold_tier.rs
|
||||
// Feature: "cold-tier" (depends on "mmap")
|
||||
|
||||
/// Configuration for cold-tier GNN training.
|
||||
pub struct ColdTierConfig {
|
||||
/// Maximum RAM budget for feature data (bytes)
|
||||
pub ram_budget: usize,
|
||||
/// Storage block size for aligned I/O (bytes)
|
||||
pub block_size: usize,
|
||||
/// Hotset size (number of high-degree nodes to cache permanently)
|
||||
pub hotset_size: usize,
|
||||
/// Number of prefetch buffers (for double/triple buffering)
|
||||
pub prefetch_buffers: usize,
|
||||
/// Storage path for feature files
|
||||
pub storage_path: PathBuf,
|
||||
/// Whether to use direct I/O (bypass OS page cache)
|
||||
pub direct_io: bool,
|
||||
}
|
||||
|
||||
/// Hyperbatch iterator for cold-tier training.
|
||||
pub struct HyperbatchIterator {
|
||||
config: ColdTierConfig,
|
||||
vertex_order: Vec<usize>,
|
||||
hotset: HashSet<usize>,
|
||||
hotset_features: Tensor,
|
||||
current_block: usize,
|
||||
prefetch_handle: Option<JoinHandle<Tensor>>,
|
||||
}
|
||||
|
||||
impl Iterator for HyperbatchIterator {
|
||||
type Item = Hyperbatch;
|
||||
|
||||
fn next(&mut self) -> Option<Hyperbatch> {
|
||||
// 1. Wait for prefetched block (if any)
|
||||
let features = if let Some(handle) = self.prefetch_handle.take() {
|
||||
handle.join().unwrap()
|
||||
} else {
|
||||
self.load_block(self.current_block)
|
||||
};
|
||||
|
||||
// 2. Start prefetching next block
|
||||
let next_block = self.current_block + 1;
|
||||
if next_block < self.total_blocks() {
|
||||
self.prefetch_handle = Some(self.prefetch_block(next_block));
|
||||
}
|
||||
|
||||
// 3. Construct hyperbatch
|
||||
let batch_nodes = self.block_to_nodes(self.current_block);
|
||||
let neighbor_features = self.gather_neighbors(&batch_nodes, &features);
|
||||
|
||||
self.current_block += 1;
|
||||
|
||||
Some(Hyperbatch {
|
||||
nodes: batch_nodes,
|
||||
features,
|
||||
neighbor_features,
|
||||
hotset_features: self.hotset_features.clone(),
|
||||
})
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
### 3.4 Vertex Reordering
|
||||
|
||||
For maximum I/O efficiency, vertices must be reordered so that graph neighbors are stored near each other on disk:
|
||||
|
||||
```rust
|
||||
/// Reorder vertices for storage locality.
|
||||
pub enum ReorderStrategy {
|
||||
/// BFS ordering from highest-degree vertex
|
||||
Bfs,
|
||||
/// Recursive bisection via Metis-style partitioning
|
||||
RecursiveBisection,
|
||||
/// Rabbit order (community-based, cache-friendly)
|
||||
RabbitOrder,
|
||||
/// Degree-sorted (high degree first = hot, low degree last = cold)
|
||||
DegreeSorted,
|
||||
}
|
||||
|
||||
/// Compute vertex permutation for storage layout.
|
||||
pub fn compute_reorder(
|
||||
graph: &CsrMatrix<f64>,
|
||||
strategy: ReorderStrategy,
|
||||
) -> Vec<usize> {
|
||||
match strategy {
|
||||
ReorderStrategy::Bfs => bfs_order(graph),
|
||||
ReorderStrategy::RecursiveBisection => metis_order(graph),
|
||||
ReorderStrategy::RabbitOrder => rabbit_order(graph),
|
||||
ReorderStrategy::DegreeSorted => degree_sort(graph),
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## 4. Hotset Management
|
||||
|
||||
### 4.1 Hotset Selection
|
||||
|
||||
The hotset consists of high-degree hub nodes that are accessed by many hyperbatches. Optimal hotset selection is NP-hard (equivalent to weighted maximum coverage), but a greedy algorithm achieves (1 - 1/e) approximation:
|
||||
|
||||
```rust
|
||||
/// Select hotset nodes greedily by weighted degree.
|
||||
pub fn select_hotset(
|
||||
graph: &CsrMatrix<f64>,
|
||||
budget_bytes: usize,
|
||||
feature_dim: usize,
|
||||
) -> Vec<usize> {
|
||||
let bytes_per_node = feature_dim * std::mem::size_of::<f32>();
|
||||
let max_nodes = budget_bytes / bytes_per_node;
|
||||
|
||||
// Score = degree × estimated access frequency
|
||||
let mut scores: Vec<(usize, f64)> = (0..graph.rows())
|
||||
.map(|v| (v, degree(graph, v) as f64))
|
||||
.collect();
|
||||
|
||||
scores.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap());
|
||||
scores.truncate(max_nodes);
|
||||
|
||||
scores.into_iter().map(|(v, _)| v).collect()
|
||||
}
|
||||
```
|
||||
|
||||
### 4.2 Adaptive Hotset Updates
|
||||
|
||||
During training, access patterns change as the model learns. The hotset should adapt:
|
||||
|
||||
```rust
|
||||
/// Adaptive hotset that updates based on access statistics.
|
||||
pub struct AdaptiveHotset {
|
||||
/// Current hotset nodes
|
||||
nodes: HashSet<usize>,
|
||||
/// Cached features for hotset nodes
|
||||
features: HashMap<usize, Vec<f32>>,
|
||||
/// Access counters (decaying)
|
||||
access_counts: Vec<f64>,
|
||||
/// Decay factor per epoch
|
||||
decay: f64,
|
||||
/// Update frequency (epochs between hotset refreshes)
|
||||
refresh_interval: usize,
|
||||
}
|
||||
|
||||
impl AdaptiveHotset {
|
||||
/// Record an access to node v.
|
||||
pub fn record_access(&mut self, v: usize) {
|
||||
self.access_counts[v] += 1.0;
|
||||
}
|
||||
|
||||
/// Refresh hotset based on accumulated access statistics.
|
||||
pub fn refresh(&mut self, storage: &FeatureStorage) {
|
||||
// Decay all counts
|
||||
for c in &mut self.access_counts {
|
||||
*c *= self.decay;
|
||||
}
|
||||
|
||||
// Re-select top nodes
|
||||
let new_nodes = select_hotset_from_counts(&self.access_counts, self.budget());
|
||||
|
||||
// Evict old, load new
|
||||
let evicted: Vec<_> = self.nodes.difference(&new_nodes).cloned().collect();
|
||||
let loaded: Vec<_> = new_nodes.difference(&self.nodes).cloned().collect();
|
||||
|
||||
for v in evicted { self.features.remove(&v); }
|
||||
for v in loaded { self.features.insert(v, storage.load_features(v)); }
|
||||
|
||||
self.nodes = new_nodes;
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
### 4.3 Hotset Size Analysis
|
||||
|
||||
| RAM Budget | Feature Dim | Hotset Capacity | Typical Coverage |
|
||||
|-----------|------------|----------------|-----------------|
|
||||
| 1 GB | 128 (f32) | 2M nodes | ~80% of edges in power-law graphs |
|
||||
| 4 GB | 128 (f32) | 8M nodes | ~92% of edges |
|
||||
| 16 GB | 128 (f32) | 32M nodes | ~97% of edges |
|
||||
| 64 GB | 128 (f32) | 128M nodes | ~99% of edges |
|
||||
|
||||
For power-law graphs (which most real-world graphs are), a small fraction of hub nodes covers the vast majority of edges. This means the hotset provides a highly effective cache.
|
||||
|
||||
---
|
||||
|
||||
## 5. Block-Aligned I/O
|
||||
|
||||
### 5.1 Direct I/O vs. Buffered I/O
|
||||
|
||||
For hyperbatch loading, direct I/O (bypassing the OS page cache) is preferred because:
|
||||
|
||||
1. **Predictable performance**: No competition with OS cache eviction policies
|
||||
2. **Reduced memory overhead**: No OS page cache duplication
|
||||
3. **Sequential access**: Hyperbatches are designed for sequential reads; OS readahead is unnecessary
|
||||
|
||||
```rust
|
||||
/// Open feature file with direct I/O (O_DIRECT on Linux).
|
||||
#[cfg(target_os = "linux")]
|
||||
pub fn open_direct(path: &Path) -> io::Result<File> {
|
||||
use std::os::unix::fs::OpenOptionsExt;
|
||||
OpenOptions::new()
|
||||
.read(true)
|
||||
.custom_flags(libc::O_DIRECT)
|
||||
.open(path)
|
||||
}
|
||||
```
|
||||
|
||||
### 5.2 Block Alignment
|
||||
|
||||
Direct I/O requires all reads to be block-aligned (typically 4KB or 512B). Feature vectors must be padded to block boundaries:
|
||||
|
||||
```rust
|
||||
/// Pad feature storage to block alignment.
|
||||
pub fn aligned_feature_offset(node_id: usize, feature_dim: usize, block_size: usize) -> usize {
|
||||
let bytes_per_feature = feature_dim * std::mem::size_of::<f32>();
|
||||
let features_per_block = block_size / bytes_per_feature;
|
||||
let block_id = node_id / features_per_block;
|
||||
block_id * block_size
|
||||
}
|
||||
```
|
||||
|
||||
### 5.3 I/O Throughput Analysis
|
||||
|
||||
| Storage Type | Sequential Read | Random 4KB Read | Hyperbatch Speedup |
|
||||
|-------------|----------------|----------------|-------------------|
|
||||
| HDD (7200 RPM) | 200 MB/s | 1 MB/s | 200x |
|
||||
| SATA SSD | 550 MB/s | 50 MB/s | 11x |
|
||||
| NVMe SSD | 3.5 GB/s | 500 MB/s | 7x |
|
||||
| NVMe Gen5 | 12 GB/s | 1.5 GB/s | 8x |
|
||||
| Optane PMEM | 6 GB/s | 3 GB/s | 2x |
|
||||
|
||||
The hyperbatch approach provides the largest speedup on HDDs (200x) but still provides significant gains on NVMe (7-8x) due to reduced random I/O.
|
||||
|
||||
---
|
||||
|
||||
## 6. Training Pipeline Integration
|
||||
|
||||
### 6.1 Modified Training Loop
|
||||
|
||||
```rust
|
||||
/// Cold-tier GNN training loop with hyperbatch iteration.
|
||||
pub fn train_cold_tier(
|
||||
model: &mut GnnModel,
|
||||
graph: &CsrMatrix<f64>,
|
||||
config: &ColdTierConfig,
|
||||
train_config: &TrainConfig,
|
||||
) -> TrainResult {
|
||||
// 1. Vertex reordering for I/O locality
|
||||
let order = compute_reorder(graph, ReorderStrategy::RabbitOrder);
|
||||
let storage = FeatureStorage::create(&config.storage_path, &order)?;
|
||||
|
||||
// 2. Hotset selection and caching
|
||||
let mut hotset = AdaptiveHotset::new(graph, config.hotset_size);
|
||||
hotset.load_initial(&storage);
|
||||
|
||||
// 3. Create hyperbatch iterator
|
||||
let mut losses = Vec::new();
|
||||
|
||||
for epoch in 0..train_config.epochs {
|
||||
let batches = HyperbatchIterator::new(graph, &storage, &hotset, config);
|
||||
|
||||
for batch in batches {
|
||||
// Forward pass
|
||||
let output = model.forward(&batch.features, &batch.adjacency());
|
||||
|
||||
// Compute loss
|
||||
let loss = match train_config.loss_type {
|
||||
LossType::InfoNCE => info_nce_loss(&output, &batch.labels),
|
||||
LossType::LocalContrastive => local_contrastive_loss(&output, &batch.adjacency()),
|
||||
};
|
||||
|
||||
// Backward pass + optimizer step
|
||||
let gradients = model.backward(&loss);
|
||||
model.optimizer.step(&gradients);
|
||||
|
||||
// Record access patterns for adaptive hotset
|
||||
for &node in &batch.nodes {
|
||||
hotset.record_access(node);
|
||||
}
|
||||
|
||||
losses.push(loss.value());
|
||||
}
|
||||
|
||||
// Update learning rate
|
||||
model.scheduler.step(epoch, losses.last().copied());
|
||||
|
||||
// EWC: compute Fisher information for forgetting prevention
|
||||
if epoch % config.ewc_interval == 0 {
|
||||
model.ewc.update_fisher(&model.parameters());
|
||||
}
|
||||
|
||||
// Adaptive hotset refresh
|
||||
if epoch % hotset.refresh_interval == 0 {
|
||||
hotset.refresh(&storage);
|
||||
}
|
||||
}
|
||||
|
||||
TrainResult { losses, epochs: train_config.epochs }
|
||||
}
|
||||
```
|
||||
|
||||
### 6.2 Integration with Existing Training Components
|
||||
|
||||
| Component | Module | Cold-Tier Integration |
|
||||
|-----------|--------|---------------------|
|
||||
| Adam optimizer | `training::Optimizer` | No change — operates on in-memory gradients |
|
||||
| Replay buffer | `replay::ReplayBuffer` | Store replay entries on disk if buffer exceeds RAM |
|
||||
| EWC | `ewc::ElasticWeightConsolidation` | Fisher information computed per-hyperbatch |
|
||||
| LR scheduler | `scheduler::LearningRateScheduler` | No change — operates on epoch/loss metrics |
|
||||
| Compression | `compress::TensorCompress` | Compress features on disk for smaller storage footprint |
|
||||
|
||||
### 6.3 Gradient Accumulation with MmapGradientAccumulator
|
||||
|
||||
The existing `MmapGradientAccumulator` in the `mmap` module handles gradient accumulation for out-of-core nodes:
|
||||
|
||||
```rust
|
||||
// Existing mmap infrastructure (already in ruvector-gnn)
|
||||
pub struct MmapGradientAccumulator {
|
||||
// Memory-mapped gradient storage
|
||||
// Accumulates gradients across hyperbatches for nodes
|
||||
// that appear in multiple batches
|
||||
}
|
||||
|
||||
// Integration: accumulate gradients across hyperbatches
|
||||
impl MmapGradientAccumulator {
|
||||
pub fn accumulate(&mut self, node_id: usize, gradient: &[f32]) { /* ... */ }
|
||||
pub fn flush_and_apply(&mut self, model: &mut GnnModel) { /* ... */ }
|
||||
}
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## 7. WASM Considerations
|
||||
|
||||
### 7.1 No mmap in WASM
|
||||
|
||||
The `mmap` module is gated behind `#[cfg(all(not(target_arch = "wasm32"), feature = "mmap"))]`. This means cold-tier training is **not available in WASM**. This is architecturally correct — WASM environments (browsers, edge devices) don't have direct filesystem access for memory mapping.
|
||||
|
||||
### 7.2 WASM GNN Strategy
|
||||
|
||||
For WASM targets, the GNN operates in **warm-tier** mode:
|
||||
- All data must fit in WASM linear memory
|
||||
- Use `ruvector-gnn-wasm` for in-memory GNN operations
|
||||
- For large graphs, pre-train on server (cold-tier) and deploy inference model to WASM
|
||||
|
||||
```
|
||||
Server (cold-tier): WASM (warm-tier):
|
||||
┌─────────────────────────┐ ┌───────────────────┐
|
||||
│ Full graph (disk-backed) │ │ Inference model │
|
||||
│ Hyperbatch training │ ──────→ │ Compressed weights │
|
||||
│ Cold-tier I/O pipeline │ export │ Small subgraph │
|
||||
│ Full training loop │ │ Real-time queries │
|
||||
└─────────────────────────┘ └───────────────────┘
|
||||
```
|
||||
|
||||
### 7.3 Model Export for WASM Deployment
|
||||
|
||||
```rust
|
||||
/// Export trained GNN model for WASM deployment.
|
||||
pub struct WasmModelExport {
|
||||
/// Compressed model weights
|
||||
pub weights: CompressedTensor,
|
||||
/// Model architecture descriptor
|
||||
pub architecture: ModelArchitecture,
|
||||
/// Quantization level used
|
||||
pub quantization: CompressionLevel,
|
||||
/// Expected input feature dimension
|
||||
pub input_dim: usize,
|
||||
/// Output embedding dimension
|
||||
pub output_dim: usize,
|
||||
}
|
||||
|
||||
impl WasmModelExport {
|
||||
/// Export model with specified compression level.
|
||||
pub fn export(
|
||||
model: &GnnModel,
|
||||
level: CompressionLevel,
|
||||
) -> Self {
|
||||
let weights = TensorCompress::compress(&model.weights(), level);
|
||||
WasmModelExport {
|
||||
weights,
|
||||
architecture: model.architecture(),
|
||||
quantization: level,
|
||||
input_dim: model.input_dim(),
|
||||
output_dim: model.output_dim(),
|
||||
}
|
||||
}
|
||||
|
||||
/// Serialize to bytes for WASM loading.
|
||||
pub fn to_bytes(&self) -> Vec<u8> { /* ... */ }
|
||||
}
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## 8. Performance Projections
|
||||
|
||||
### 8.1 Cold-Tier Training Throughput
|
||||
|
||||
| Graph Size | RAM | Naive Disk | Hyperbatch | Speedup |
|
||||
|-----------|-----|-----------|-----------|---------|
|
||||
| 10M nodes | 32 GB | 12 min/epoch | 3.5 min/epoch | 3.4x |
|
||||
| 50M nodes | 32 GB | 85 min/epoch | 22 min/epoch | 3.9x |
|
||||
| 100M nodes | 64 GB | 210 min/epoch | 55 min/epoch | 3.8x |
|
||||
| 500M nodes | 64 GB | 18 hr/epoch | 4.5 hr/epoch | 4.0x |
|
||||
|
||||
### 8.2 Hotset Hit Rates
|
||||
|
||||
| Graph Type | Hotset = 1% of nodes | Hotset = 5% | Hotset = 10% |
|
||||
|-----------|---------------------|-------------|-------------|
|
||||
| Power-law (α=2.5) | 45% edge coverage | 78% | 91% |
|
||||
| Power-law (α=2.0) | 62% edge coverage | 89% | 96% |
|
||||
| Web graph (ClueWeb) | 55% edge coverage | 84% | 93% |
|
||||
| Social network (Twitter) | 70% edge coverage | 92% | 98% |
|
||||
| Regular lattice | 1% edge coverage | 5% | 10% |
|
||||
|
||||
Power-law graphs benefit enormously from hotset caching. Regular lattices do not — but regular lattices already have high spatial locality, so hyperbatches alone suffice.
|
||||
|
||||
### 8.3 Storage Requirements
|
||||
|
||||
| Graph Size | Feature Storage | Adjacency Storage | Gradient Storage | Total |
|
||||
|-----------|----------------|-------------------|-----------------|-------|
|
||||
| 10M nodes | 4.8 GB | 4 GB | 4.8 GB | ~14 GB |
|
||||
| 100M nodes | 48 GB | 40 GB | 48 GB | ~136 GB |
|
||||
| 1B nodes | 480 GB | 400 GB | 480 GB | ~1.4 TB |
|
||||
|
||||
At modern NVMe SSD prices (~$0.05/GB), 1B-node training requires ~$70 of storage — far cheaper than equivalent RAM ($5,000+).
|
||||
|
||||
---
|
||||
|
||||
## 9. Integration with Continual Learning
|
||||
|
||||
### 9.1 EWC with Cold-Tier Storage
|
||||
|
||||
Elastic Weight Consolidation (EWC) in `ruvector-gnn` prevents catastrophic forgetting when training on sequential tasks. With cold-tier storage:
|
||||
|
||||
```rust
|
||||
/// Cold-tier EWC: store Fisher information matrix on disk.
|
||||
pub struct ColdTierEwc {
|
||||
/// In-memory EWC for current task
|
||||
inner: ElasticWeightConsolidation,
|
||||
/// Disk-backed Fisher information from previous tasks
|
||||
fisher_storage: MmapManager,
|
||||
/// Number of previous tasks stored
|
||||
n_previous_tasks: usize,
|
||||
}
|
||||
|
||||
impl ColdTierEwc {
|
||||
/// Compute EWC loss: L_ewc = L_task + λ/2 · Σᵢ Fᵢ(θᵢ - θ*ᵢ)²
|
||||
/// Fisher information is loaded from disk per-hyperbatch.
|
||||
pub fn ewc_loss(
|
||||
&self,
|
||||
task_loss: f64,
|
||||
current_params: &[f32],
|
||||
batch_param_indices: &[usize],
|
||||
) -> f64 {
|
||||
let fisher = self.fisher_storage.load_slice(batch_param_indices);
|
||||
let optimal = self.optimal_storage.load_slice(batch_param_indices);
|
||||
|
||||
let ewc_penalty: f64 = batch_param_indices.iter().enumerate()
|
||||
.map(|(i, &idx)| {
|
||||
fisher[i] as f64 * (current_params[idx] - optimal[i]).powi(2) as f64
|
||||
})
|
||||
.sum();
|
||||
|
||||
task_loss + self.inner.lambda() * 0.5 * ewc_penalty
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
### 9.2 Replay Buffer on Disk
|
||||
|
||||
For out-of-core graphs, the replay buffer can overflow RAM:
|
||||
|
||||
```rust
|
||||
/// Disk-backed replay buffer with reservoir sampling.
|
||||
pub struct ColdReplayBuffer {
|
||||
/// In-memory buffer for recent entries
|
||||
hot_buffer: ReplayBuffer,
|
||||
/// Disk-backed buffer for overflow
|
||||
cold_storage: MmapManager,
|
||||
/// Total capacity (hot + cold)
|
||||
total_capacity: usize,
|
||||
}
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## 10. Benchmarking Plan
|
||||
|
||||
### 10.1 Datasets
|
||||
|
||||
| Dataset | Nodes | Edges | Features | Size on Disk |
|
||||
|---------|-------|-------|---------|-------------|
|
||||
| ogbn-products | 2.4M | 62M | 100 | ~3 GB |
|
||||
| ogbn-papers100M | 111M | 1.6B | 128 | ~95 GB |
|
||||
| MAG240M | 244M | 1.7B | 768 | ~750 GB |
|
||||
| ClueWeb22 (subgraph) | 500M | 8B | 128 | ~320 GB |
|
||||
|
||||
### 10.2 Metrics
|
||||
|
||||
1. **Training throughput**: Nodes processed per second
|
||||
2. **I/O efficiency**: Fraction of I/O that is sequential
|
||||
3. **Hotset hit rate**: Fraction of neighbor accesses served from cache
|
||||
4. **Convergence**: Loss curve compared to in-memory baseline
|
||||
5. **Peak memory**: Maximum RSS during training
|
||||
|
||||
### 10.3 Baselines
|
||||
|
||||
- **In-memory** (if it fits): Upper bound on throughput
|
||||
- **Naive mmap**: OS-managed page faulting
|
||||
- **PyG + UVA**: PyTorch Geometric with unified virtual addressing (CUDA)
|
||||
- **DGL + DistDGL**: Distributed Graph Library baseline
|
||||
|
||||
---
|
||||
|
||||
## 11. Open Questions
|
||||
|
||||
1. **Optimal vertex reordering**: Which reordering strategy (BFS, Metis, Rabbit Order) gives the best I/O locality for different graph types?
|
||||
|
||||
2. **Dynamic hyperbatch sizing**: Should hyperbatch size adapt during training based on observed I/O throughput and GPU utilization?
|
||||
|
||||
3. **Compression on storage**: Can feature compression (already in `ruvector-gnn/compress`) reduce storage I/O at acceptable accuracy cost?
|
||||
|
||||
4. **Multi-GPU + cold-tier**: How does cold-tier storage interact with multi-GPU training? Does each GPU get its own prefetch buffer?
|
||||
|
||||
5. **GNN architecture awareness**: Different GNN architectures (GCN, GAT, GraphSAGE) have different neighborhood access patterns. Can the hyperbatch scheduler be architecture-aware?
|
||||
|
||||
---
|
||||
|
||||
## 12. Recommendations
|
||||
|
||||
### Immediate (0-4 weeks)
|
||||
|
||||
1. Add `cold-tier` feature flag to `ruvector-gnn` Cargo.toml (depends on `mmap`)
|
||||
2. Implement `FeatureStorage` for block-aligned feature file layout
|
||||
3. Implement `HyperbatchIterator` with double-buffered prefetch
|
||||
4. Add BFS vertex reordering as initial strategy
|
||||
5. Benchmark on ogbn-products (fits in memory → validate correctness against in-memory baseline)
|
||||
|
||||
### Short-Term (4-8 weeks)
|
||||
|
||||
6. Implement `AdaptiveHotset` with greedy selection and decay
|
||||
7. Add direct I/O support on Linux (`O_DIRECT`)
|
||||
8. Implement `ColdTierEwc` for disk-backed Fisher information
|
||||
9. Benchmark on ogbn-papers100M (requires cold-tier)
|
||||
|
||||
### Medium-Term (8-16 weeks)
|
||||
|
||||
10. Add Rabbit Order vertex reordering
|
||||
11. Implement `ColdReplayBuffer` for disk-backed experience replay
|
||||
12. Add `WasmModelExport` for server-to-WASM model transfer
|
||||
13. Profile and optimize I/O pipeline for NVMe Gen5 SSDs
|
||||
14. Benchmark on MAG240M (stress test at scale)
|
||||
|
||||
---
|
||||
|
||||
## References
|
||||
|
||||
1. Yang, P., et al. "AGNES: Accelerating Graph Neural Network Training with Efficient Storage." VLDB 2024.
|
||||
2. Zheng, D., et al. "DistDGL: Distributed Graph Neural Network Training for Billion-Scale Graphs." IEEE ICDCS 2020.
|
||||
3. Hamilton, W.L., Ying, R., Leskovec, J. "Inductive Representation Learning on Large Graphs." NeurIPS 2017.
|
||||
4. Arai, J., et al. "Rabbit Order: Just-in-Time Parallel Reordering for Fast Graph Analysis." IPDPS 2016.
|
||||
5. Karypis, G., Kumar, V. "A Fast and High Quality Multilevel Scheme for Partitioning Irregular Graphs." SIAM J. Scientific Computing, 1998.
|
||||
6. Kirkpatrick, J., et al. "Overcoming Catastrophic Forgetting in Neural Networks." PNAS 2017.
|
||||
7. Chiang, W.-L., et al. "Cluster-GCN: An Efficient Algorithm for Training Deep and Large Graph Convolutional Networks." KDD 2019.
|
||||
|
||||
---
|
||||
|
||||
## Document Navigation
|
||||
|
||||
- **Previous**: [02 - Sublinear Spectral Solvers](./02-sublinear-spectral-solvers.md)
|
||||
- **Next**: [04 - WASM Microkernel Architecture](./04-wasm-microkernel-architecture.md)
|
||||
- **Index**: [Executive Summary](./00-executive-summary.md)
|
||||
Reference in New Issue
Block a user