# GNN Architecture Analysis: RuVector Implementation

## Executive Summary

RuVector implements a Graph Neural Network (GNN) architecture that operates on HNSW (Hierarchical Navigable Small World) graph topology. The architecture combines message passing, multi-head attention, gated recurrent updates, and differentiable search to learn on graph-structured data.

**Key Components**: Linear transformations, Multi-head Attention, GRU cells, Layer Normalization, Hierarchical Search

**Code Location**: `crates/ruvector-gnn/src/layer.rs`, `crates/ruvector-gnn/src/search.rs`

---

## 1. Core Architecture: RuvectorLayer

### 1.1 Mathematical Formulation

The RuvectorLayer implements a message-passing neural network with the following forward pass:

```
Given: node embedding h_v, neighbor embeddings {h_u} for u ∈ N(v), edge weights {e_uv}

1. Message Transformation:
   m_v = W_msg · h_v
   m_u = W_msg · h_u   for u ∈ N(v)

2. Multi-Head Attention:
   a_v = MultiHeadAttention(m_v, {m_u}, {m_u})

3. Weighted Aggregation:
   agg_v = Σ_u (e_uv / Σ_u' e_u'v) · m_u

4. Combination:
   combined = a_v + agg_v
   transformed = W_agg · combined

5. GRU Update:
   h'_v = GRU(transformed, m_v)

6. Normalization & Regularization:
   output = LayerNorm(Dropout(h'_v))
```
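
For illustration, here is a minimal Rust sketch of step 3 (the edge-weighted aggregation), assuming plain `Vec<f32>` embeddings; this is a hypothetical helper, not the crate's implementation:

```rust
/// Edge-weighted aggregation (step 3): agg_v = Σ_u (e_uv / Σ_u' e_u'v) · m_u.
/// `messages[i]` is the transformed neighbor message m_u, `edge_weights[i]` is e_uv.
/// Hypothetical helper, not the RuvectorLayer code itself.
fn weighted_aggregate(messages: &[Vec<f32>], edge_weights: &[f32], dim: usize) -> Vec<f32> {
    let mut agg = vec![0.0f32; dim];
    // Guard against an empty neighborhood or zero total weight.
    let total: f32 = edge_weights.iter().sum::<f32>().max(1e-10);
    for (msg, &w) in messages.iter().zip(edge_weights) {
        let norm_w = w / total;
        for (a, &m) in agg.iter_mut().zip(msg) {
            *a += norm_w * m;
        }
    }
    agg
}
```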

### 1.2 Implementation Details

**File**: `crates/ruvector-gnn/src/layer.rs:307-440`

```rust
pub struct RuvectorLayer {
    w_msg: Linear,        // Message weight matrix
    w_agg: Linear,        // Aggregation weight matrix
    w_update: GRUCell,    // GRU update cell
    attention: MultiHeadAttention,
    norm: LayerNorm,
    dropout: f32,
}
```

**Design Choices**:
- **Xavier Initialization**: Weights drawn from N(0, σ²) with σ = √(2/(d_in + d_out)) (a sketch follows this list)
- **Numerical Stability**: Softmax uses the max-subtraction trick
- **Residual Connections**: Implicit through the GRU's (1 - z) term
- **Flexibility**: Handles empty neighbor sets gracefully
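
For reference, a minimal Xavier-normal initialization sketch, assuming the `rand` and `rand_distr` crates; the crate's own initializer may differ in detail:

```rust
use rand_distr::{Distribution, Normal};

/// Xavier/Glorot-normal initialization: weights ~ N(0, 2 / (d_in + d_out)).
/// Hypothetical helper, not the crate's actual initialization code.
fn xavier_init(d_in: usize, d_out: usize) -> Vec<Vec<f32>> {
    let std_dev = (2.0 / (d_in + d_out) as f32).sqrt();
    let normal = Normal::new(0.0_f32, std_dev).expect("valid std dev");
    let mut rng = rand::thread_rng();
    (0..d_out)
        .map(|_| (0..d_in).map(|_| normal.sample(&mut rng)).collect())
        .collect()
}
```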

---

## 2. Multi-Head Attention Mechanism

### 2.1 Scaled Dot-Product Attention

**File**: `crates/ruvector-gnn/src/layer.rs:84-205`

The attention mechanism follows the Transformer architecture:

```
Attention(Q, K, V) = softmax(QK^T / √d_k) V

where:
- Q = W_q · h_v   (query from the target node)
- K = W_k · h_u   (keys from neighbors)
- V = W_v · h_u   (values from neighbors)
- d_k = hidden_dim / num_heads
```

### 2.2 Multi-Head Decomposition

```
MultiHead(Q, K, V) = Concat(head_1, ..., head_h) W_o

where head_i = Attention(Q W_q^i, K W_k^i, V W_v^i)
```
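
To make the single-head case concrete, here is a hypothetical sketch of scaled dot-product attention for one query against its neighbor keys and values (not the crate's `MultiHeadAttention`):

```rust
/// Single-head scaled dot-product attention for one query vector.
/// `keys[i]` and `values[i]` come from neighbor i; returns the attended value.
/// Hypothetical sketch only.
fn dot_product_attention(query: &[f32], keys: &[Vec<f32>], values: &[Vec<f32>]) -> Vec<f32> {
    let d_k = query.len() as f32;
    // Scaled scores: q · k_i / sqrt(d_k)
    let scores: Vec<f32> = keys
        .iter()
        .map(|k| query.iter().zip(k).map(|(q, k)| q * k).sum::<f32>() / d_k.sqrt())
        .collect();
    // Numerically stable softmax (max-subtraction trick, as in section 2.3).
    let max_score = scores.iter().copied().fold(f32::NEG_INFINITY, f32::max);
    let exp: Vec<f32> = scores.iter().map(|&s| (s - max_score).exp()).collect();
    let sum: f32 = exp.iter().sum::<f32>().max(1e-10);
    // Weighted sum of values.
    let dim = values.first().map_or(0, |v| v.len());
    let mut out = vec![0.0f32; dim];
    for (w, v) in exp.iter().zip(values) {
        for (o, &x) in out.iter_mut().zip(v) {
            *o += (w / sum) * x;
        }
    }
    out
}
```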

**Mathematical Properties**:
1. **Permutation Invariance**: Attention scores are independent of neighbor ordering
2. **Soft Selection**: A differentiable alternative to hard neighbor selection
3. **Context Awareness**: Each head can focus on a different aspect of the neighborhood

### 2.3 Numerical Stability

```rust
// Softmax with numerical stability: subtract the max score before exponentiating
let max_score = scores.iter().copied().fold(f32::NEG_INFINITY, f32::max);
let exp_scores: Vec<f32> = scores.iter()
    .map(|&s| (s - max_score).exp())
    .collect();
// Clamp the normalizer away from zero to avoid division by zero
let sum_exp: f32 = exp_scores.iter().sum::<f32>().max(1e-10);
```

**Key Features**:
- Prevents overflow with max subtraction
- Guards against division by zero with an epsilon
- Maintains gradient flow through the exp operations

---

## 3. Gated Recurrent Unit (GRU) Integration

### 3.1 GRU Cell Mathematics

**File**: `crates/ruvector-gnn/src/layer.rs:207-305`

```
z_t = σ(W_z x_t + U_z h_{t-1})               [Update Gate]
r_t = σ(W_r x_t + U_r h_{t-1})               [Reset Gate]
h̃_t = tanh(W_h x_t + U_h (r_t ⊙ h_{t-1}))    [Candidate State]
h_t = (1 - z_t) ⊙ h_{t-1} + z_t ⊙ h̃_t        [Final State]
```
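
A minimal dense sketch of one GRU step following these equations (biases omitted; names are illustrative, not the crate's `GRUCell`):

```rust
fn sigmoid(x: f32) -> f32 {
    1.0 / (1.0 + (-x).exp())
}

/// Matrix-vector products combined element-wise: W·x + U·h.
fn affine(w: &[Vec<f32>], x: &[f32], u: &[Vec<f32>], h: &[f32]) -> Vec<f32> {
    w.iter()
        .zip(u)
        .map(|(wr, ur)| {
            wr.iter().zip(x).map(|(a, b)| a * b).sum::<f32>()
                + ur.iter().zip(h).map(|(a, b)| a * b).sum::<f32>()
        })
        .collect()
}

/// One GRU step following the equations above (hypothetical sketch, biases omitted).
fn gru_step(
    x: &[f32], h_prev: &[f32],
    w_z: &[Vec<f32>], u_z: &[Vec<f32>],
    w_r: &[Vec<f32>], u_r: &[Vec<f32>],
    w_h: &[Vec<f32>], u_h: &[Vec<f32>],
) -> Vec<f32> {
    let z: Vec<f32> = affine(w_z, x, u_z, h_prev).into_iter().map(sigmoid).collect();
    let r: Vec<f32> = affine(w_r, x, u_r, h_prev).into_iter().map(sigmoid).collect();
    // Candidate state uses the reset-gated previous state: r ⊙ h_{t-1}.
    let rh: Vec<f32> = r.iter().zip(h_prev).map(|(r, h)| r * h).collect();
    let h_tilde: Vec<f32> = affine(w_h, x, u_h, &rh).into_iter().map(f32::tanh).collect();
    // h_t = (1 - z) ⊙ h_{t-1} + z ⊙ h̃_t
    z.iter()
        .zip(h_prev)
        .zip(&h_tilde)
        .map(|((z, h), ht)| (1.0 - z) * h + z * ht)
        .collect()
}
```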

### 3.2 Why GRU for Graph Updates?

1. **Memory of Previous State**: Maintains information from earlier layers
2. **Selective Updates**: The update gate z_t controls how much to change
3. **Reset Mechanism**: The reset gate r_t decides how relevant the previous state is
4. **Gradient Flow**: Mitigates vanishing gradients in deep GNNs

**Connection to Graph Learning**:
- `h_{t-1}`: Node's current representation (before aggregation)
- `x_t`: Aggregated neighborhood information
- `h_t`: Updated node representation (after message passing)

---

## 4. Differentiable Search Mechanism

### 4.1 Soft Attention Over Candidates

**File**: `crates/ruvector-gnn/src/search.rs:38-86`

```
Given: query q, candidates C = {c_1, ..., c_n}

1. Compute Similarities:
   s_i = cosine_similarity(q, c_i)

2. Temperature-Scaled Softmax:
   w_i = exp(s_i / τ) / Σ_j exp(s_j / τ)

3. Soft Top-K Selection:
   indices = argsort(w, descending)[:k]
   weights = {w_i | i ∈ indices}
```

**Temperature Parameter τ**:
- **τ → 0**: Sharp selection (approximates hard argmax)
- **τ → ∞**: Uniform distribution (all candidates weighted equally)
- **τ ≈ 0.07-1.0**: Typical range balancing discrimination and smoothness
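
A minimal sketch of this soft top-k procedure, assuming plain `Vec<f32>` embeddings; `soft_top_k` is a hypothetical helper, not the crate's search API:

```rust
/// Cosine similarity between two equal-length vectors.
fn cosine(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let na: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let nb: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    dot / (na * nb).max(1e-10)
}

/// Temperature-scaled softmax over candidate similarities, then soft top-k:
/// returns (index, weight) pairs for the k highest-weight candidates.
fn soft_top_k(query: &[f32], candidates: &[Vec<f32>], k: usize, tau: f32) -> Vec<(usize, f32)> {
    let scores: Vec<f32> = candidates.iter().map(|c| cosine(query, c) / tau).collect();
    // Numerically stable softmax.
    let max = scores.iter().copied().fold(f32::NEG_INFINITY, f32::max);
    let exp: Vec<f32> = scores.iter().map(|&s| (s - max).exp()).collect();
    let sum: f32 = exp.iter().sum::<f32>().max(1e-10);
    let mut weighted: Vec<(usize, f32)> = exp.iter().map(|&e| e / sum).enumerate().collect();
    // Sort by weight, descending, and keep the top k.
    weighted.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap());
    weighted.truncate(k);
    weighted
}
```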

### 4.2 Hierarchical Forward Pass

**File**: `crates/ruvector-gnn/src/search.rs:88-154`

The query is processed through the HNSW layers sequentially:

```
Input: query q, layer embeddings L = {L_0, ..., L_d}, GNN layers G

h_0 = q
for layer l = 0 to d:
    1. Find top-k nodes:  indices, weights = DifferentiableSearch(h_l, L_l)
    2. Aggregate:         agg = Σ_i weights[i] · L_l[indices[i]]
    3. Combine:           combined = (h_l + agg) / 2
    4. Transform:         h_{l+1} = G_l(combined, neighbors, edge_weights)

Output: h_d
```
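
A compact sketch of this loop, reusing the hypothetical `soft_top_k` helper from section 4.1 and treating each GNN layer as an opaque closure; the real `search.rs` interface will differ:

```rust
/// Hierarchical forward pass over HNSW layers (illustrative only).
/// `layers[l]` holds the embeddings of HNSW layer l; `gnn` applies layer l's GNN transform.
fn hierarchical_forward(
    query: &[f32],
    layers: &[Vec<Vec<f32>>],
    k: usize,
    tau: f32,
    gnn: &dyn Fn(usize, &[f32]) -> Vec<f32>,
) -> Vec<f32> {
    let mut h = query.to_vec();
    for (l, embeddings) in layers.iter().enumerate() {
        // 1. Soft top-k search against this layer's embeddings.
        let selected = soft_top_k(&h, embeddings, k, tau);
        // 2. Weighted aggregation of the selected embeddings.
        let mut agg = vec![0.0f32; h.len()];
        for (idx, w) in &selected {
            for (a, &x) in agg.iter_mut().zip(&embeddings[*idx]) {
                *a += w * x;
            }
        }
        // 3. Combine with the current state, then 4. transform with the layer's GNN.
        let combined: Vec<f32> = h.iter().zip(&agg).map(|(a, b)| (a + b) / 2.0).collect();
        h = gnn(l, &combined);
    }
    h
}
```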

**Gradient Flow Through Hierarchy**:
- Softmax ensures differentiability
- Enables end-to-end training of the search process
- Backpropagation flows through the entire HNSW traversal

---

## 5. Data Flow Architecture

### 5.1 Forward Pass Diagram

```
Input Node Embedding (h_v)
           |
           v
   [W_msg Transform] ───────────────────┐
           |                            |
           v                            |
      Message (m_v)                     |
           |                            |
           v                            |
  ┌─────────────────┐                   |
  │   Multi-Head    │                   |
  │   Attention     │ ← Neighbors (transformed)
  └─────────────────┘                   |
           |                            |
           v                            |
    Attention Output                    |
           |                            |
           v                            |
   [+ Weighted Agg] ← Edge Weights      |
           |                            |
           v                            |
   [W_agg Transform]                    |
           |                            |
           v                            |
   Aggregated Message                   |
           |                            |
           v                            |
  ┌─────────────────┐                   |
  │    GRU Cell     │ ← Previous State (m_v)
  └─────────────────┘
           |
           v
      Updated State
           |
           v
        [Dropout]
           |
           v
       [LayerNorm]
           |
           v
    Output Embedding
```

### 5.2 Information Bottlenecks

**Potential Bottlenecks**:
1. **Linear Transformations**: Fixed capacity of W_msg and W_agg
2. **Attention Heads**: Limited parallelism (typically 2-8 heads)
3. **GRU Hidden State**: Fixed dimensionality
4. **Dropout**: Information loss during training

**Mitigation Strategies**:
- Residual connections via the GRU gates
- Layer normalization prevents gradient explosion
- Xavier initialization maintains variance through layers

---

## 6. Comparison with Standard GNN Architectures

| Feature | RuVector | GCN | GAT | GraphSAGE |
|---------|----------|-----|-----|-----------|
| Aggregation | Attention + Weighted | Mean | Attention | Mean/Max/LSTM |
| Update | GRU | Linear | Linear | Linear |
| Normalization | LayerNorm | None/BatchNorm | None | None |
| Topology | HNSW | General | General | General |
| Differentiable Search | Yes | No | No | No |
| Multi-Head | Yes | No | Yes | No |
| Gated Updates | Yes (GRU) | No | No | No |

**RuVector Advantages**:
1. **Temporal Dynamics**: GRU captures the evolution of node states
2. **Hierarchical Processing**: HNSW structure for efficient search
3. **Dual Aggregation**: Combines attention and edge-weighted aggregation
4. **Stable Training**: LayerNorm + Xavier init + numerical guards

---

## 7. Computational Complexity

### 7.1 Per-Layer Complexity

For a node with degree d, hidden dimension h, and k attention heads:

| Operation | Complexity | Notes |
|-----------|------------|-------|
| Message Transform | O(h²) | Linear layer |
| Multi-Head Attention | O(k·d·h²/k) = O(d·h²) | k heads, each of dimension h/k |
| Weighted Aggregation | O(d·h) | Sum over neighbors |
| GRU Update | O(h²) | 6 linear transformations |
| Layer Norm | O(h) | Mean + variance |
| **Total** | **O(d·h² + h²)** | Dominated by attention |
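
As a rough illustration, with hidden dimension h = 256, node degree d = 16, and k = 4 heads, the attention term dominates: d·h² = 16 × 65,536 ≈ 1.0M multiply-accumulates per node per layer, versus h² ≈ 65K for the message transform and roughly 6·h² ≈ 0.4M for the GRU update.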

### 7.2 Hierarchical Search Complexity

```
For HNSW with L layers and M neighbors per node:
- Greedy search:          O(L · M · log N)
- Differentiable search:  O(L · k · h)
  where k = top-k candidates per layer
```

---

## 8. Training Considerations

### 8.1 Contrastive Loss Functions

**File**: `crates/ruvector-gnn/src/training.rs:330-462`

**InfoNCE Loss**:
```
L_InfoNCE = -log( exp(sim(q, p⁺) / τ) / Σ_{p∈P} exp(sim(q, p) / τ) )

where:
- q:  anchor (query node)
- p⁺: positive sample (neighbor)
- P:  all samples (positives + negatives)
- τ:  temperature parameter
```
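
A minimal sketch of the InfoNCE computation for a single anchor, using cosine similarity and a log-sum-exp guard (hypothetical, not the crate's `training.rs` code):

```rust
/// InfoNCE loss for one anchor: -log( exp(sim(q,p⁺)/τ) / Σ_p exp(sim(q,p)/τ) ).
/// `positive` is the index of p⁺ within `samples`.
fn info_nce(query: &[f32], samples: &[Vec<f32>], positive: usize, tau: f32) -> f32 {
    let sim = |a: &[f32], b: &[f32]| -> f32 {
        let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
        let na: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
        let nb: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();
        dot / (na * nb).max(1e-10)
    };
    let logits: Vec<f32> = samples.iter().map(|s| sim(query, s) / tau).collect();
    // Log-sum-exp with max subtraction for numerical stability.
    let max = logits.iter().copied().fold(f32::NEG_INFINITY, f32::max);
    let log_sum: f32 = logits.iter().map(|&l| (l - max).exp()).sum::<f32>().ln() + max;
    -(logits[positive] - log_sum)
}
```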

**Local Contrastive Loss**: encourages node embeddings to be similar to their graph neighbors and dissimilar to non-neighbors.

### 8.2 Elastic Weight Consolidation (EWC)

**File**: `crates/ruvector-gnn/src/ewc.rs`

EWC prevents catastrophic forgetting in continual learning:

```
L_total = L_task + (λ/2) Σ_i F_i (θ_i - θ*_i)²

where:
- L_task: current task loss
- F_i:    Fisher information (importance of parameter i)
- θ_i:    current parameter value
- θ*_i:   anchor parameter from the previous task
- λ:      regularization strength (10-10000)
```

**Fisher Information Approximation**:

```
F_i ≈ (1/N) Σ_{n=1}^{N} (∂L/∂θ_i)²
```
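
A minimal sketch of the penalty term over flattened parameter vectors (hypothetical helper, not the crate's `ewc.rs` implementation):

```rust
/// EWC regularizer: (λ/2) Σ_i F_i (θ_i - θ*_i)².
/// `params`, `anchors`, and `fisher` are flattened, same-length parameter views.
fn ewc_penalty(params: &[f32], anchors: &[f32], fisher: &[f32], lambda: f32) -> f32 {
    let quad: f32 = params
        .iter()
        .zip(anchors)
        .zip(fisher)
        .map(|((&p, &a), &f)| f * (p - a).powi(2))
        .sum();
    0.5 * lambda * quad
}

/// Total loss: the current task loss plus the EWC penalty.
fn ewc_total_loss(task_loss: f32, params: &[f32], anchors: &[f32], fisher: &[f32], lambda: f32) -> f32 {
    task_loss + ewc_penalty(params, anchors, fisher, lambda)
}
```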

---

## 9. Key Insights for Latent Space Design

### 9.1 Embedding Geometry

**Current Architecture Assumptions**:
1. **Euclidean Latent Space**: All operations assume flat geometry
2. **Cosine Similarity**: Angular distance metric used in search
3. **Linear Projections**: Affine transformations preserve convexity

**Implications**:
- Tree-like graphs are poorly represented in Euclidean space
- The hierarchical HNSW structure hints at potential benefits of hyperbolic geometry
- The attention mechanism can partially compensate for the metric mismatch

### 9.2 Information Flow Bottlenecks

**Critical Points**:
1. **Attention Softmax**: Hard selection at inference time (argmax)
2. **GRU Gates**: Sigmoid saturation can block gradients
3. **Fixed Dimensions**: The hidden dimension is a bottleneck between layers

**Potential Improvements**:
- Adaptive dimensionality per layer
- Sparse attention patterns
- Mixture of experts for different graph patterns

---

## 10. Connection to HNSW Topology

### 10.1 HNSW Structure

HNSW organizes nodes into hierarchical layers:

```
Layer 2: [sparse, long-range connections]
Layer 1: [medium density]
Layer 0: [dense, local connections]
```
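
For context, standard HNSW assigns each node a maximum layer by sampling from an exponentially decaying distribution, which is what produces the sparse-to-dense hierarchy above. A sketch of the usual rule (illustrative of the standard algorithm, not necessarily RuVector's index builder):

```rust
use rand::Rng;

/// Standard HNSW level assignment: l = floor(-ln(U) · m_L), with U ~ Uniform(0, 1).
/// Higher layers are exponentially rarer; m_L is typically 1 / ln(M).
fn sample_level(m_l: f64) -> usize {
    let u: f64 = rand::thread_rng().gen_range(f64::MIN_POSITIVE..1.0);
    (-u.ln() * m_l).floor() as usize
}
```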

### 10.2 GNN-HNSW Synergy

**Advantages**:
1. **Coarse-to-Fine**: Higher layers capture global structure, lower layers capture local structure
2. **Skip Connections**: Hierarchical search jumps across the graph
3. **Differentiable**: Soft attention enables gradient-based optimization

**Challenges**:
1. **Layer Mismatch**: HNSW layers ≠ GNN layers
2. **Probabilistic Construction**: HNSW's randomized construction vs. learned embeddings
3. **Online Updates**: Adding nodes requires GNN re-evaluation

---

## 11. Strengths and Limitations

### 11.1 Strengths

1. **Numerically Stable**: Extensive guards against overflow/underflow
2. **Flexible**: Handles variable-degree nodes and empty neighborhoods
3. **Rich Interactions**: Dual aggregation (attention + weighted)
4. **Recurrent Memory**: GRU maintains long-term dependencies
5. **End-to-End Differentiable**: Full gradient flow through search

### 11.2 Limitations

1. **Computational Cost**: O(d·h²) per node per layer
2. **Fixed Architecture**: Uniform layers, no adaptive depth
3. **Euclidean Bias**: May not suit hierarchical graphs
4. **Limited Expressiveness**: Single attention type (dot-product)
5. **No Edge Features**: Only uses edge weights, not attributes

---

## 12. Research Opportunities

### 12.1 Short-Term Enhancements

1. **Edge Features**: Extend attention to incorporate edge attributes
2. **Adaptive Heads**: Learn the number of attention heads per layer
3. **Sparse Attention**: Local + global attention patterns
4. **Layer Skip Connections**: Direct paths from input to output

### 12.2 Long-Term Directions

1. **Hyperbolic GNN**: Replace Euclidean operations with the Poincaré ball model
2. **Graph Transformers**: Replace message passing with full attention
3. **Neural ODEs**: Continuous-depth GNNs via differential equations
4. **Equivariant Networks**: SE(3) or E(n) equivariance for geometric graphs

---

## References

### Internal Code References

- `/crates/ruvector-gnn/src/layer.rs` - Core GNN layers
- `/crates/ruvector-gnn/src/search.rs` - Differentiable search
- `/crates/ruvector-gnn/src/training.rs` - Loss functions and optimizers
- `/crates/ruvector-gnn/src/ewc.rs` - Continual learning
- `/crates/ruvector-graph/src/hybrid/graph_neural.rs` - GNN engine interface

### Key Papers

- Kipf & Welling (2017) - Graph Convolutional Networks
- Veličković et al. (2018) - Graph Attention Networks
- Chung et al. (2014) - Gated Recurrent Units
- Vaswani et al. (2017) - Attention Is All You Need (Transformers)
- Malkov & Yashunin (2018) - HNSW for ANN search

---

**Document Version**: 1.0
**Last Updated**: 2025-11-30
**Author**: RuVector Research Team