# GNN Architecture Analysis: RuVector Implementation
## Executive Summary
RuVector implements a sophisticated Graph Neural Network architecture that operates on HNSW (Hierarchical Navigable Small World) graph topology. The architecture combines message passing, multi-head attention, gated recurrent updates, and differentiable search mechanisms to create a powerful framework for learning on graph-structured data.
**Key Components**: Linear transformations, Multi-head Attention, GRU cells, Layer Normalization, Hierarchical Search
**Code Location**: `crates/ruvector-gnn/src/layer.rs`, `crates/ruvector-gnn/src/search.rs`
---
## 1. Core Architecture: RuvectorLayer
### 1.1 Mathematical Formulation
The RuvectorLayer implements a message passing neural network with the following forward pass:
```
Given: node embedding h_v, neighbor embeddings {h_u}_u∈N(v), edge weights {e_uv}_u∈N(v)
1. Message Transformation:
   m_v = W_msg · h_v
   m_u = W_msg · h_u   for u ∈ N(v)
2. Multi-Head Attention:
   a_v = MultiHeadAttention(m_v, {m_u}, {m_u})
3. Weighted Aggregation:
   agg_v = Σ_u (e_uv / Σ_u' e_u'v) · m_u
4. Combination:
   combined    = a_v + agg_v
   transformed = W_agg · combined
5. GRU Update:
   h'_v = GRU(transformed, m_v)
6. Normalization & Regularization:
   output = LayerNorm(Dropout(h'_v))
```
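The edge-weighted aggregation in step 3 is simple enough to sketch directly. Below is a minimal, self-contained Rust illustration over plain `Vec<f32>` messages; `weighted_aggregate` is a hypothetical helper for exposition, not the crate's API:

```rust
/// Edge-weighted aggregation: agg_v = Σ_u (e_uv / Σ_u' e_u'v) · m_u
/// (hypothetical helper for illustration, not the ruvector-gnn API)
fn weighted_aggregate(neighbor_msgs: &[Vec<f32>], edge_weights: &[f32], dim: usize) -> Vec<f32> {
    // Normalize edge weights; the epsilon guards against empty or zero-weight neighborhoods
    let total: f32 = edge_weights.iter().sum::<f32>().max(1e-10);
    let mut agg = vec![0.0f32; dim];
    for (msg, &w) in neighbor_msgs.iter().zip(edge_weights) {
        let norm_w = w / total;
        for (a, &m) in agg.iter_mut().zip(msg) {
            *a += norm_w * m;
        }
    }
    agg
}

fn main() {
    let msgs = vec![vec![1.0, 0.0], vec![0.0, 1.0]];
    let weights = [3.0, 1.0];
    // Weights normalize to 0.75 / 0.25, so the result is [0.75, 0.25]
    println!("{:?}", weighted_aggregate(&msgs, &weights, 2));
}
```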
### 1.2 Implementation Details
**File**: `crates/ruvector-gnn/src/layer.rs:307-440`
```rust
pub struct RuvectorLayer {
    w_msg: Linear,     // Message weight matrix
    w_agg: Linear,     // Aggregation weight matrix
    w_update: GRUCell, // GRU update cell
    attention: MultiHeadAttention,
    norm: LayerNorm,
    dropout: f32,
}
```
**Design Choices**:
- **Xavier Initialization**: Weights drawn from N(0, σ²) with σ = √(2/(d_in + d_out)) (see the sketch after this list)
- **Numerical Stability**: Softmax uses max subtraction trick
- **Residual Connections**: Implicit through GRU's (1-z) term
- **Flexibility**: Handles empty neighbor sets gracefully
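The Xavier scheme above amounts to sampling each weight from a normal with standard deviation √(2/(d_in + d_out)). A minimal sketch, assuming a caller-supplied standard-normal sampler (`randn` is hypothetical; the crate manages its own RNG internally):

```rust
/// Xavier/Glorot-style init: each weight ~ N(0, σ²) with σ = sqrt(2 / (d_in + d_out)).
/// `randn` is any standard-normal sampler supplied by the caller (illustrative only).
fn xavier_init(d_in: usize, d_out: usize, mut randn: impl FnMut() -> f32) -> Vec<Vec<f32>> {
    let std = (2.0 / (d_in + d_out) as f32).sqrt();
    let mut w = vec![vec![0.0f32; d_in]; d_out];
    for row in &mut w {
        for x in row.iter_mut() {
            *x = std * randn(); // scale a unit-normal sample to the Xavier std
        }
    }
    w
}
```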
---
## 2. Multi-Head Attention Mechanism
### 2.1 Scaled Dot-Product Attention
**File**: `crates/ruvector-gnn/src/layer.rs:84-205`
The attention mechanism follows the Transformer architecture:
```
Attention(Q, K, V) = softmax(QK^T / √d_k) V
where:
- Q = W_q · h_v (query from target node)
- K = W_k · h_u (keys from neighbors)
- V = W_v · h_u (values from neighbors)
- d_k = hidden_dim / num_heads
```
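As a reference point, here is a self-contained sketch of single-head scaled dot-product attention for one query against a neighbor set. The crate's `MultiHeadAttention` additionally applies learned Q/K/V projections and splits the hidden dimension across heads; the slice-based helper below is illustrative only:

```rust
/// softmax(q·Kᵀ / sqrt(d_k)) · V for a single query (illustrative sketch).
/// Assumes at least one neighbor; the crate handles empty neighborhoods separately.
fn scaled_dot_attention(query: &[f32], keys: &[Vec<f32>], values: &[Vec<f32>]) -> Vec<f32> {
    let d_k = query.len() as f32;
    // Attention scores: (q · k_i) / sqrt(d_k)
    let scores: Vec<f32> = keys
        .iter()
        .map(|k| query.iter().zip(k).map(|(q, ki)| q * ki).sum::<f32>() / d_k.sqrt())
        .collect();
    // Numerically stable softmax (max subtraction, epsilon guard)
    let max = scores.iter().copied().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = scores.iter().map(|&s| (s - max).exp()).collect();
    let sum: f32 = exps.iter().sum::<f32>().max(1e-10);
    // Output = weighted sum of values
    let mut out = vec![0.0f32; values[0].len()];
    for (e, v) in exps.iter().zip(values) {
        let w = e / sum;
        for (o, &x) in out.iter_mut().zip(v) {
            *o += w * x;
        }
    }
    out
}
```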
### 2.2 Multi-Head Decomposition
```
MultiHead(Q, K, V) = Concat(head_1, ..., head_h) W_o
where head_i = Attention(Q W_q^i, K W_k^i, V W_v^i)
```
**Mathematical Properties**:
1. **Permutation Invariance**: Attention scores independent of neighbor ordering
2. **Soft Selection**: Differentiable alternative to hard neighbor selection
3. **Context Aware**: Each head can focus on different aspects of neighborhood
### 2.3 Numerical Stability
```rust
// Softmax with numerical stability
let max_score = scores.iter().copied().fold(f32::NEG_INFINITY, f32::max);
let exp_scores: Vec<f32> = scores.iter()
    .map(|&s| (s - max_score).exp())
    .collect();
let sum_exp: f32 = exp_scores.iter().sum::<f32>().max(1e-10);
```
**Key Features**:
- Prevents overflow with max subtraction
- Guards against division by zero with epsilon
- Maintains gradient flow through exp operations
---
## 3. Gated Recurrent Unit (GRU) Integration
### 3.1 GRU Cell Mathematics
**File**: `crates/ruvector-gnn/src/layer.rs:207-305`
```
z_t = σ(W_z x_t + U_z h_{t-1}) [Update Gate]
r_t = σ(W_r x_t + U_r h_{t-1}) [Reset Gate]
h̃_t = tanh(W_h x_t + U_h (r_t ⊙ h_{t-1})) [Candidate State]
h_t = (1 - z_t) ⊙ h_{t-1} + z_t ⊙ h̃_t [Final State]
```
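A minimal sketch of one such update step in Rust, using plain vectors and hypothetical weight matrices rather than the `GRUCell` from `layer.rs`:

```rust
// Dense matrix-vector product over plain Vec<f32> rows (illustrative only)
fn matvec(w: &[Vec<f32>], x: &[f32]) -> Vec<f32> {
    w.iter().map(|row| row.iter().zip(x).map(|(a, b)| a * b).sum()).collect()
}

fn sigmoid(x: f32) -> f32 {
    1.0 / (1.0 + (-x).exp())
}

/// One GRU step: `x` is the aggregated neighborhood message, `h` the node's
/// previous representation. Weight matrices are hypothetical placeholders and
/// bias terms are omitted to mirror the equations above.
#[allow(clippy::too_many_arguments)]
fn gru_step(
    x: &[f32], h: &[f32],
    w_z: &[Vec<f32>], u_z: &[Vec<f32>],
    w_r: &[Vec<f32>], u_r: &[Vec<f32>],
    w_h: &[Vec<f32>], u_h: &[Vec<f32>],
) -> Vec<f32> {
    // Update gate z_t and reset gate r_t
    let z: Vec<f32> = matvec(w_z, x).iter().zip(matvec(u_z, h)).map(|(a, b)| sigmoid(a + b)).collect();
    let r: Vec<f32> = matvec(w_r, x).iter().zip(matvec(u_r, h)).map(|(a, b)| sigmoid(a + b)).collect();
    // Candidate state uses the reset-gated previous state
    let rh: Vec<f32> = r.iter().zip(h).map(|(ri, hi)| ri * hi).collect();
    let h_tilde: Vec<f32> = matvec(w_h, x).iter().zip(matvec(u_h, &rh)).map(|(a, b)| (a + b).tanh()).collect();
    // h_t = (1 - z_t) ⊙ h_{t-1} + z_t ⊙ h̃_t
    z.iter().zip(h).zip(&h_tilde).map(|((zi, hi), ht)| (1.0 - zi) * hi + zi * ht).collect()
}
```

Because the final state interpolates between h_{t-1} and the candidate, the (1 - z) term acts as the implicit residual connection noted in Section 1.2.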
### 3.2 Why GRU for Graph Updates?
1. **Memory of Previous State**: Maintains information from earlier layers
2. **Selective Updates**: Update gate z_t controls how much to change
3. **Reset Mechanism**: Reset gate r_t decides relevance of previous state
4. **Gradient Flow**: Mitigates vanishing gradients in deep GNNs
**Connection to Graph Learning**:
- `h_{t-1}`: Node's current representation (before aggregation)
- `x_t`: Aggregated neighborhood information
- `h_t`: Updated node representation (after message passing)
---
## 4. Differentiable Search Mechanism
### 4.1 Soft Attention Over Candidates
**File**: `crates/ruvector-gnn/src/search.rs:38-86`
```
Given: query q, candidates C = {c_1, ..., c_n}
1. Compute Similarities:
   s_i = cosine_similarity(q, c_i)
2. Temperature-Scaled Softmax:
   w_i = exp(s_i / τ) / Σ_j exp(s_j / τ)
3. Soft Top-K Selection:
   indices = argsort(w, descending)[:k]
   weights = {w_i | i ∈ indices}
```
**Temperature Parameter τ**:
- **τ → 0**: Sharp selection (approximates hard argmax)
- **τ → ∞**: Uniform distribution (all candidates equal)
- **τ = 0.07-1.0**: Typical range balancing discrimination and smoothness
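A compact Rust sketch of steps 1-3, assuming cosine similarity and a hard cut to the k largest weights after the softmax; the function name and signature are illustrative, not the `search.rs` API:

```rust
/// Temperature-scaled softmax over cosine similarities, then keep the k
/// highest-weight candidates as (index, weight) pairs. Illustrative sketch.
fn soft_top_k(query: &[f32], candidates: &[Vec<f32>], k: usize, tau: f32) -> Vec<(usize, f32)> {
    fn cosine(a: &[f32], b: &[f32]) -> f32 {
        let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
        let na: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
        let nb: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();
        dot / (na * nb).max(1e-10)
    }
    // Steps 1-2: temperature-scaled softmax over similarities (with max subtraction)
    let scaled: Vec<f32> = candidates.iter().map(|c| cosine(query, c) / tau).collect();
    let max = scaled.iter().copied().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = scaled.iter().map(|&s| (s - max).exp()).collect();
    let sum: f32 = exps.iter().sum::<f32>().max(1e-10);
    let mut weighted: Vec<(usize, f32)> = exps.iter().map(|e| e / sum).enumerate().collect();
    // Step 3: soft top-k, i.e. the largest weights together with their indices
    weighted.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap());
    weighted.truncate(k);
    weighted
}
```

Small τ sharpens the distribution toward the nearest candidate, while large τ spreads weight across all of them, matching the limits listed above.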
### 4.2 Hierarchical Forward Pass
**File**: `crates/ruvector-gnn/src/search.rs:88-154`
Processes query through HNSW layers sequentially:
```
Input: query q, layer_embeddings L = {L_0, ..., L_d}, gnn_layers G
h_0 = q
for layer l = 0 to d:
    1. Find top-k nodes: indices, weights = DifferentiableSearch(h_l, L_l)
    2. Aggregate:        agg = Σ_i weights[i] · L_l[indices[i]]
    3. Combine:          combined = (h_l + agg) / 2
    4. Transform:        h_{l+1} = G_l(combined, neighbors, edge_weights)
Output: h_d
```
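A sketch of this loop in Rust, reusing the `soft_top_k` helper sketched in Section 4.1; the `gnn` closure stands in for a per-layer `RuvectorLayer` call and the nested `Vec` types are placeholders for the real embedding storage:

```rust
/// Hierarchical forward pass over HNSW levels (illustrative sketch).
/// `layers[l]` holds the node embeddings stored at level l.
fn hierarchical_forward(
    query: &[f32],
    layers: &[Vec<Vec<f32>>],
    k: usize,
    tau: f32,
    gnn: impl Fn(usize, &[f32]) -> Vec<f32>,
) -> Vec<f32> {
    let mut h = query.to_vec();
    for (level, nodes) in layers.iter().enumerate() {
        // 1. Differentiable search at this level (see soft_top_k in Section 4.1)
        let picked = soft_top_k(&h, nodes, k, tau);
        // 2. Aggregate the selected embeddings by their soft weights
        let mut agg = vec![0.0f32; h.len()];
        for (idx, w) in &picked {
            for (a, &x) in agg.iter_mut().zip(&nodes[*idx]) {
                *a += w * x;
            }
        }
        // 3. Combine with the current state, then 4. apply the GNN layer
        let combined: Vec<f32> = h.iter().zip(&agg).map(|(a, b)| (a + b) / 2.0).collect();
        h = gnn(level, &combined);
    }
    h
}
```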
**Gradient Flow Through Hierarchy**:
- Softmax ensures differentiability
- Enables end-to-end training of search process
- Backpropagation through entire HNSW traversal
---
## 5. Data Flow Architecture
### 5.1 Forward Pass Diagram
```
  Input Node Embedding (h_v)
             |
             v
     [W_msg Transform] ─────────────────────┐
             |                              |
             v                              |
        Message (m_v)                       |
             |                              |
             v                              |
    ┌─────────────────┐                     |
    │   Multi-Head    │                     |
    │   Attention     │ ← Neighbors (transformed)
    └─────────────────┘                     |
             |                              |
             v                              |
      Attention Output                      |
             |                              |
             v                              |
   [+ Weighted Agg] ← Edge Weights          |
             |                              |
             v                              |
     [W_agg Transform]                      |
             |                              |
             v                              |
     Aggregated Message                     |
             |                              |
             v                              |
    ┌─────────────────┐                     |
    │    GRU Cell     │ ←───────────────────┘  Previous State (m_v)
    └─────────────────┘
             |
             v
       Updated State
             |
             v
         [Dropout]
             |
             v
        [LayerNorm]
             |
             v
      Output Embedding
```
### 5.2 Information Bottlenecks
**Potential Bottlenecks**:
1. **Linear Transformations**: Fixed capacity W_msg, W_agg
2. **Attention Heads**: Limited parallelism (typically 2-8 heads)
3. **GRU Hidden State**: Fixed dimensionality
4. **Dropout**: Information loss during training
**Mitigation Strategies**:
- Residual connections via GRU gates
- Layer normalization prevents gradient explosion
- Xavier init maintains variance through layers
---
## 6. Comparison with Standard GNN Architectures
| Feature | RuVector | GCN | GAT | GraphSAGE |
|---------|----------|-----|-----|-----------|
| Aggregation | Attention + Weighted | Mean | Attention | Mean/Max/LSTM |
| Update | GRU | Linear | Linear | Linear |
| Normalization | LayerNorm | None/BatchNorm | None | None |
| Topology | HNSW | General | General | General |
| Differentiable Search | Yes | No | No | No |
| Multi-Head | Yes | No | Yes | No |
| Gated Updates | Yes (GRU) | No | No | No |
**RuVector Advantages**:
1. **Temporal Dynamics**: GRU captures evolution of node states
2. **Hierarchical Processing**: HNSW structure for efficient search
3. **Dual Aggregation**: Combines attention and edge-weighted aggregation
4. **Stable Training**: LayerNorm + Xavier init + numerical guards
---
## 7. Computational Complexity
### 7.1 Per-Layer Complexity
For a node with degree d, hidden dimension h, and k attention heads:
| Operation | Complexity | Notes |
|-----------|------------|-------|
| Message Transform | O(h²) | Linear layer |
| Multi-Head Attention | O(k·d·h²/k) = O(d·h²) | k heads, each h/k dim |
| Weighted Aggregation | O(d·h) | Sum over neighbors |
| GRU Update | O(h²) | 6 linear transformations |
| Layer Norm | O(h) | Mean + variance |
| **Total** | **O(d·h² + h²)** | Dominated by attention |
### 7.2 Hierarchical Search Complexity
```
For HNSW with L layers, M neighbors per node:
- Greedy search: O(L · M · log N)
- Differentiable search: O(L · k · h)
where k = top-k candidates per layer
```
---
## 8. Training Considerations
### 8.1 Contrastive Loss Functions
**File**: `crates/ruvector-gnn/src/training.rs:330-462`
**InfoNCE Loss**:
```
L_InfoNCE = -log(exp(sim(q, p⁺) / τ) / Σ_{p∈P} exp(sim(q, p) / τ))
where:
- q: anchor (query node)
- p⁺: positive sample (neighbor)
- P: all samples (positives + negatives)
- τ: temperature parameter
```
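A small sketch of this loss computed from precomputed similarities; names are illustrative, not the `training.rs` API:

```rust
/// InfoNCE from precomputed similarities. `pos_sim` is sim(q, p⁺); `all_sims`
/// holds sim(q, p) for every p in P (the positive plus the negatives).
fn info_nce(pos_sim: f32, all_sims: &[f32], tau: f32) -> f32 {
    let scaled: Vec<f32> = all_sims.iter().map(|&s| s / tau).collect();
    // log-sum-exp over all samples, with max subtraction for numerical stability
    let m = scaled.iter().copied().fold(f32::NEG_INFINITY, f32::max);
    let lse = m + scaled.iter().map(|&s| (s - m).exp()).sum::<f32>().ln();
    // L = -log( exp(pos/τ) / Σ_p exp(sim/τ) ) = lse - pos/τ
    lse - pos_sim / tau
}
```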
**Local Contrastive Loss**: encourages node embeddings to be similar to their graph neighbors and dissimilar to non-neighbors.
### 8.2 Elastic Weight Consolidation (EWC)
**File**: `crates/ruvector-gnn/src/ewc.rs`
Prevents catastrophic forgetting in continual learning:
```
L_total = L_task + (λ/2) Σ_i F_i (θ_i - θ*_i)²
where:
- L_task: Current task loss
- F_i: Fisher information (importance of parameter i)
- θ_i: Current parameter
- θ*_i: Anchor parameter from previous task
- λ: Regularization strength (10-10000)
```
**Fisher Information Approximation**:
```
F_i ≈ (1/N) Σ_{n=1}^N (∂L/∂θ_i)²
```
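A minimal sketch of the combined objective, with flat parameter, anchor, and Fisher slices standing in for whatever structure `ewc.rs` actually stores:

```rust
/// EWC-regularized loss: L_task + (λ/2) Σ_i F_i (θ_i - θ*_i)².
/// Slices are illustrative placeholders, not the ewc.rs types.
fn ewc_loss(task_loss: f32, params: &[f32], anchors: &[f32], fisher: &[f32], lambda: f32) -> f32 {
    let penalty: f32 = params
        .iter()
        .zip(anchors)
        .zip(fisher)
        .map(|((p, a), f)| f * (p - a).powi(2))
        .sum();
    task_loss + 0.5 * lambda * penalty
}
```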
---
## 9. Key Insights for Latent Space Design
### 9.1 Embedding Geometry
**Current Architecture Assumptions**:
1. **Euclidean Latent Space**: All operations assume flat geometry
2. **Cosine Similarity**: Angular distance metric in search
3. **Linear Projections**: Affine transformations preserve convexity
**Implications**:
- Tree-like graphs poorly represented in Euclidean space
- Hierarchical HNSW structure hints at hyperbolic geometry benefits
- Attention mechanism can partially compensate for metric mismatch
### 9.2 Information Flow Bottlenecks
**Critical Points**:
1. **Attention Softmax**: Hard selection at inference (argmax)
2. **GRU Gates**: Sigmoid saturation can block gradients
3. **Fixed Dimensions**: h_dim bottleneck between layers
**Potential Improvements**:
- Adaptive dimensionality per layer
- Sparse attention patterns
- Mixture of experts for different graph patterns
---
## 10. Connection to HNSW Topology
### 10.1 HNSW Structure
Hierarchical layers:
```
Layer 2: [sparse, long-range connections]
Layer 1: [medium density]
Layer 0: [dense, local connections]
```
### 10.2 GNN-HNSW Synergy
**Advantages**:
1. **Coarse-to-Fine**: Higher layers = global structure, lower = local
2. **Skip Connections**: Hierarchical search jumps across graph
3. **Differentiable**: Soft attention enables gradient-based optimization
**Challenges**:
1. **Layer Mismatch**: HNSW layers ≠ GNN layers
2. **Probabilistic Construction**: HNSW randomness vs. learned embeddings
3. **Online Updates**: Adding nodes requires GNN re-evaluation
---
## 11. Strengths and Limitations
### 11.1 Strengths
1. **Numerically Stable**: Extensive guards against overflow/underflow
2. **Flexible**: Handles variable-degree nodes and empty neighborhoods
3. **Rich Interactions**: Dual aggregation (attention + weighted)
4. **Recurrent Memory**: GRU maintains long-term dependencies
5. **End-to-End Differentiable**: Full gradient flow through search
### 11.2 Limitations
1. **Computational Cost**: O(d·h²) per node per layer
2. **Fixed Architecture**: Uniform layers, no adaptive depth
3. **Euclidean Bias**: May not suit hierarchical graphs
4. **Limited Expressiveness**: Single attention type (dot-product)
5. **No Edge Features**: Only uses edge weights, not attributes
---
## 12. Research Opportunities
### 12.1 Short-Term Enhancements
1. **Edge Features**: Extend attention to incorporate edge attributes
2. **Adaptive Heads**: Learn number of attention heads per layer
3. **Sparse Attention**: Local + global attention patterns
4. **Layer Skip Connections**: Direct paths from input to output
### 12.2 Long-Term Directions
1. **Hyperbolic GNN**: Replace Euclidean operations with Poincaré ball
2. **Graph Transformers**: Replace message passing with full attention
3. **Neural ODEs**: Continuous-depth GNN with differential equations
4. **Equivariant Networks**: SE(3) or E(n) equivariance for geometric graphs
---
## References
### Internal Code References
- `/crates/ruvector-gnn/src/layer.rs` - Core GNN layers
- `/crates/ruvector-gnn/src/search.rs` - Differentiable search
- `/crates/ruvector-gnn/src/training.rs` - Loss functions and optimizers
- `/crates/ruvector-gnn/src/ewc.rs` - Continual learning
- `/crates/ruvector-graph/src/hybrid/graph_neural.rs` - GNN engine interface
### Key Papers
- Kipf & Welling (2017) - Graph Convolutional Networks
- Veličković et al. (2018) - Graph Attention Networks
- Chung et al. (2014) - Gated Recurrent Units
- Vaswani et al. (2017) - Attention Is All You Need (Transformers)
- Malkov & Yashunin (2018) - HNSW for ANN search
---
**Document Version**: 1.0
**Last Updated**: 2025-11-30
**Author**: RuVector Research Team