GNN Architecture Analysis: RuVector Implementation
Executive Summary
RuVector implements a sophisticated Graph Neural Network architecture that operates on HNSW (Hierarchical Navigable Small World) graph topology. The architecture combines message passing, multi-head attention, gated recurrent updates, and differentiable search mechanisms to create a powerful framework for learning on graph-structured data.
Key Components: Linear transformations, Multi-head Attention, GRU cells, Layer Normalization, Hierarchical Search
Code Location: crates/ruvector-gnn/src/layer.rs, crates/ruvector-gnn/src/search.rs
1. Core Architecture: RuvectorLayer
1.1 Mathematical Formulation
The RuvectorLayer implements a message passing neural network with the following forward pass:
Given: node embedding h_v, neighbor embeddings {h_u}_u∈N(v), edge weights {e_uv}_u∈N(v)
1. Message Transformation:
m_v = W_msg · h_v
m_u = W_msg · h_u for u ∈ N(v)
2. Multi-Head Attention:
a_v = MultiHeadAttention(m_v, {m_u}, {m_u})
3. Weighted Aggregation:
agg_v = Σ_u (e_uv / Σ_u' e_u'v) · m_u
4. Combination:
combined = a_v + agg_v
transformed = W_agg · combined
5. GRU Update:
h'_v = GRU(transformed, m_v)
6. Normalization & Regularization:
output = LayerNorm(Dropout(h'_v))
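To make the aggregation concrete, below is a minimal sketch of steps 3-4 (edge-weighted aggregation and combination) on plain Vec&lt;f32&gt; slices. The function names are illustrative, not the actual RuvectorLayer API; the attention and GRU steps are sketched separately in Sections 2 and 3.

```rust
// Step 3: edge-weight-normalised aggregation of neighbor messages.
fn weighted_aggregate(messages: &[Vec<f32>], edge_weights: &[f32]) -> Vec<f32> {
    let dim = messages.first().map_or(0, |m| m.len());
    let mut agg = vec![0.0f32; dim];
    // Normalise edge weights into a convex combination (guard against /0).
    let total: f32 = edge_weights.iter().sum::<f32>().max(1e-10);
    for (m, &w) in messages.iter().zip(edge_weights) {
        for (a, &x) in agg.iter_mut().zip(m) {
            *a += (w / total) * x;
        }
    }
    agg
}

// Step 4: element-wise sum of the attention output and the weighted
// aggregation; the result would then pass through W_agg and the GRU update.
fn combine(attention_out: &[f32], agg: &[f32]) -> Vec<f32> {
    attention_out.iter().zip(agg).map(|(a, b)| a + b).collect()
}
```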
1.2 Implementation Details
File: crates/ruvector-gnn/src/layer.rs:307-440
```rust
pub struct RuvectorLayer {
    w_msg: Linear,        // Message weight matrix
    w_agg: Linear,        // Aggregation weight matrix
    w_update: GRUCell,    // GRU update cell
    attention: MultiHeadAttention,
    norm: LayerNorm,
    dropout: f32,
}
```
Design Choices:
- Xavier Initialization: Weights drawn from N(0, σ²) with σ = √(2/(d_in + d_out)) (see the sketch after this list)
- Numerical Stability: Softmax uses max subtraction trick
- Residual Connections: Implicit through GRU's (1-z) term
- Flexibility: Handles empty neighbor sets gracefully
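As a point of reference, here is a minimal sketch of Xavier-style initialization under the formula above. It assumes the rand crate and samples from a uniform distribution with matching variance; the actual initializer in layer.rs may sample differently.

```rust
use rand::Rng;

/// Xavier/Glorot initialisation: target variance 2 / (d_in + d_out).
/// Sampled here from Uniform(-a, a), whose variance a^2/3 is matched to
/// sigma^2 by choosing a = sigma * sqrt(3).
fn xavier_init(d_in: usize, d_out: usize) -> Vec<Vec<f32>> {
    let sigma = (2.0 / (d_in + d_out) as f32).sqrt();
    let a = sigma * 3.0f32.sqrt();
    let mut rng = rand::thread_rng();
    (0..d_out)
        .map(|_| (0..d_in).map(|_| rng.gen_range(-a..a)).collect())
        .collect()
}
```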
2. Multi-Head Attention Mechanism
2.1 Scaled Dot-Product Attention
File: crates/ruvector-gnn/src/layer.rs:84-205
The attention mechanism follows the Transformer architecture:
Attention(Q, K, V) = softmax(QK^T / √d_k) V
where:
- Q = W_q · h_v (query from target node)
- K = W_k · h_u (keys from neighbors)
- V = W_v · h_u (values from neighbors)
- d_k = hidden_dim / num_heads
2.2 Multi-Head Decomposition
MultiHead(Q, K, V) = Concat(head_1, ..., head_h) W_o
where head_i = Attention(Q W_q^i, K W_k^i, V W_v^i)
Mathematical Properties:
- Permutation Invariance: Attention scores independent of neighbor ordering
- Soft Selection: Differentiable alternative to hard neighbor selection
- Context Aware: Each head can focus on different aspects of neighborhood
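The single-head sketch below illustrates the scaled dot-product step together with the stable softmax described in Section 2.3. The multi-head version would split q, k, v into num_heads slices and concatenate the per-head outputs through W_o. Function names are illustrative, not the layer.rs API.

```rust
/// Single-head scaled dot-product attention for one query against its
/// neighbours: softmax(q . k_i / sqrt(d_k))-weighted sum of the values.
fn scaled_dot_attention(query: &[f32], keys: &[Vec<f32>], values: &[Vec<f32>]) -> Vec<f32> {
    let d_k = query.len() as f32;
    // Raw scores q . k_i / sqrt(d_k).
    let scores: Vec<f32> = keys
        .iter()
        .map(|k| query.iter().zip(k).map(|(a, b)| a * b).sum::<f32>() / d_k.sqrt())
        .collect();
    // Numerically stable softmax (max subtraction, epsilon guard).
    let max = scores.iter().copied().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = scores.iter().map(|&s| (s - max).exp()).collect();
    let sum: f32 = exps.iter().sum::<f32>().max(1e-10);
    // Attention-weighted sum of the value vectors.
    let mut out = vec![0.0f32; query.len()];
    for (v, &e) in values.iter().zip(&exps) {
        for (o, &x) in out.iter_mut().zip(v) {
            *o += (e / sum) * x;
        }
    }
    out
}
```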
2.3 Numerical Stability
```rust
// Softmax with numerical stability
let max_score = scores.iter().copied().fold(f32::NEG_INFINITY, f32::max);
let exp_scores: Vec<f32> = scores.iter()
    .map(|&s| (s - max_score).exp())
    .collect();
let sum_exp: f32 = exp_scores.iter().sum::<f32>().max(1e-10);
```
Key Features:
- Prevents overflow with max subtraction
- Guards against division by zero with epsilon
- Maintains gradient flow through exp operations
3. Gated Recurrent Unit (GRU) Integration
3.1 GRU Cell Mathematics
File: crates/ruvector-gnn/src/layer.rs:207-305
z_t = σ(W_z x_t + U_z h_{t-1}) [Update Gate]
r_t = σ(W_r x_t + U_r h_{t-1}) [Reset Gate]
h̃_t = tanh(W_h x_t + U_h (r_t ⊙ h_{t-1})) [Candidate State]
h_t = (1 - z_t) ⊙ h_{t-1} + z_t ⊙ h̃_t [Final State]
3.2 Why GRU for Graph Updates?
- Memory of Previous State: Maintains information from earlier layers
- Selective Updates: Update gate z_t controls how much to change
- Reset Mechanism: Reset gate r_t decides relevance of previous state
- Gradient Flow: Mitigates vanishing gradients in deep GNNs
Connection to Graph Learning:
- h_{t-1}: Node's current representation (before aggregation)
- x_t: Aggregated neighborhood information
- h_t: Updated node representation (after message passing)
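A minimal single-step GRU sketch matching the equations in 3.1 (biases omitted), where x is the aggregated message and h the previous node state. The matvec helper and parameter names are illustrative, not the layer.rs GRUCell API.

```rust
fn sigmoid(x: f32) -> f32 { 1.0 / (1.0 + (-x).exp()) }

fn matvec(w: &[Vec<f32>], x: &[f32]) -> Vec<f32> {
    w.iter().map(|row| row.iter().zip(x).map(|(a, b)| a * b).sum()).collect()
}

/// One GRU step: z gates how much of the candidate state replaces h,
/// r gates how much of the previous state feeds the candidate.
fn gru_step(
    w_z: &[Vec<f32>], u_z: &[Vec<f32>],
    w_r: &[Vec<f32>], u_r: &[Vec<f32>],
    w_h: &[Vec<f32>], u_h: &[Vec<f32>],
    x: &[f32], h: &[f32],
) -> Vec<f32> {
    let add = |a: Vec<f32>, b: Vec<f32>| -> Vec<f32> {
        a.iter().zip(&b).map(|(p, q)| p + q).collect()
    };
    // Update and reset gates.
    let z: Vec<f32> = add(matvec(w_z, x), matvec(u_z, h)).iter().map(|&v| sigmoid(v)).collect();
    let r: Vec<f32> = add(matvec(w_r, x), matvec(u_r, h)).iter().map(|&v| sigmoid(v)).collect();
    // Candidate state uses the reset-gated previous state.
    let rh: Vec<f32> = r.iter().zip(h).map(|(a, b)| a * b).collect();
    let h_tilde: Vec<f32> =
        add(matvec(w_h, x), matvec(u_h, &rh)).iter().map(|&v| v.tanh()).collect();
    // Blend previous and candidate state via the update gate.
    z.iter().zip(h).zip(&h_tilde)
        .map(|((&zi, &hi), &ci)| (1.0 - zi) * hi + zi * ci)
        .collect()
}
```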
4. Differentiable Search Mechanism
4.1 Soft Attention Over Candidates
File: crates/ruvector-gnn/src/search.rs:38-86
Given: query q, candidates C = {c_1, ..., c_n}
1. Compute Similarities:
s_i = cosine_similarity(q, c_i)
2. Temperature-Scaled Softmax:
w_i = exp(s_i / τ) / Σ_j exp(s_j / τ)
3. Soft Top-K Selection:
indices = argsort(w, descending)[:k]
weights = {w_i | i ∈ indices}
Temperature Parameter τ:
- τ → 0: Sharp selection (approximates hard argmax)
- τ → ∞: Uniform distribution (all candidates equal)
- τ = 0.07-1.0: Typical range balancing discrimination and smoothness
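Below is a sketch of the temperature-scaled soft top-k step, using the same max-subtraction and epsilon guards as the layer code. The name differentiable_topk is illustrative, not the search.rs API.

```rust
use std::cmp::Ordering;

/// Temperature-scaled softmax over similarities, then keep the k largest
/// (index, weight) pairs as in Section 4.1.
fn differentiable_topk(similarities: &[f32], k: usize, tau: f32) -> Vec<(usize, f32)> {
    // Stable softmax with temperature tau.
    let max = similarities.iter().copied().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = similarities.iter().map(|&s| ((s - max) / tau).exp()).collect();
    let sum: f32 = exps.iter().sum::<f32>().max(1e-10);
    // Pair each candidate index with its softmax weight, keep the k largest.
    let mut weighted: Vec<(usize, f32)> =
        exps.iter().enumerate().map(|(i, &e)| (i, e / sum)).collect();
    weighted.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap_or(Ordering::Equal));
    weighted.truncate(k);
    weighted
}
```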
4.2 Hierarchical Forward Pass
File: crates/ruvector-gnn/src/search.rs:88-154
Processes query through HNSW layers sequentially:
Input: query q, layer_embeddings L = {L_0, ..., L_d}, gnn_layers G
h_0 = q
for layer l = 0 to d:
1. Find top-k nodes: indices, weights = DifferentiableSearch(h_l, L_l)
2. Aggregate: agg = Σ_i weights[i] · L_l[indices[i]]
3. Combine: combined = (h_l + agg) / 2
4. Transform: h_{l+1} = G_l(combined, neighbors, edge_weights)
Output: h_d
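A sketch of this traversal follows, with the per-layer GNN transform left abstract as a closure and the soft selection reusing the differentiable_topk sketch from Section 4.1. All names are illustrative.

```rust
fn cosine(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let na: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let nb: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    dot / (na * nb).max(1e-10)
}

/// Layer-by-layer refinement of the query through the HNSW hierarchy.
fn hierarchical_forward(
    query: &[f32],
    layer_embeddings: &[Vec<Vec<f32>>],          // one embedding table per HNSW layer
    gnn_layers: &[&dyn Fn(&[f32]) -> Vec<f32>],  // one transform per layer
    k: usize,
    tau: f32,
) -> Vec<f32> {
    let mut h = query.to_vec();
    for (table, gnn) in layer_embeddings.iter().zip(gnn_layers) {
        // 1. Soft top-k over this layer's candidates.
        let sims: Vec<f32> = table.iter().map(|c| cosine(&h, c)).collect();
        let picked = differentiable_topk(&sims, k, tau);
        // 2. Weighted aggregation of the selected embeddings.
        let mut agg = vec![0.0f32; h.len()];
        for (idx, w) in &picked {
            for (a, &x) in agg.iter_mut().zip(&table[*idx]) {
                *a += *w * x;
            }
        }
        // 3. Average with the current state, 4. apply this layer's GNN transform.
        let combined: Vec<f32> = h.iter().zip(&agg).map(|(a, b)| (a + b) / 2.0).collect();
        h = gnn(&combined);
    }
    h
}
```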
Gradient Flow Through Hierarchy:
- Softmax ensures differentiability
- Enables end-to-end training of search process
- Backpropagation through entire HNSW traversal
5. Data Flow Architecture
5.1 Forward Pass Diagram
Input Node Embedding (h_v)
|
v
[W_msg Transform] ──────────────┐
| |
v |
Message (m_v) |
| |
v |
┌─────────────────┐ |
│ Multi-Head │ |
│ Attention │ ← Neighbors (transformed)
└─────────────────┘ |
| |
v |
Attention Output |
| |
v |
[+ Weighted Agg] ← Edge Weights |
| |
v |
[W_agg Transform] |
| |
v |
Aggregated Message |
| |
v |
┌─────────────────┐ |
│ GRU Cell │ ← Previous State (m_v)
└─────────────────┘
|
v
Updated State
|
v
[Dropout]
|
v
[LayerNorm]
|
v
Output Embedding
5.2 Information Bottlenecks
Potential Bottlenecks:
- Linear Transformations: Fixed capacity W_msg, W_agg
- Attention Heads: Limited parallelism (typically 2-8 heads)
- GRU Hidden State: Fixed dimensionality
- Dropout: Information loss during training
Mitigation Strategies:
- Residual connections via GRU gates
- Layer normalization prevents gradient explosion
- Xavier init maintains variance through layers
6. Comparison with Standard GNN Architectures
| Feature | RuVector | GCN | GAT | GraphSAGE |
|---|---|---|---|---|
| Aggregation | Attention + Weighted | Mean | Attention | Mean/Max/LSTM |
| Update | GRU | Linear | Linear | Linear |
| Normalization | LayerNorm | None/BatchNorm | None | None |
| Topology | HNSW | General | General | General |
| Differentiable Search | Yes | No | No | No |
| Multi-Head | Yes | No | Yes | No |
| Gated Updates | Yes (GRU) | No | No | No |
RuVector Advantages:
- Temporal Dynamics: GRU captures evolution of node states
- Hierarchical Processing: HNSW structure for efficient search
- Dual Aggregation: Combines attention and edge-weighted aggregation
- Stable Training: LayerNorm + Xavier init + numerical guards
7. Computational Complexity
7.1 Per-Layer Complexity
For a node with degree d, hidden dimension h, and k attention heads:
| Operation | Complexity | Notes |
|---|---|---|
| Message Transform | O(h²) | Linear layer |
| Multi-Head Attention | O(k · d · h · (h/k)) = O(d·h²) | k heads, each of dimension h/k |
| Weighted Aggregation | O(d·h) | Sum over neighbors |
| GRU Update | O(h²) | 6 linear transformations |
| Layer Norm | O(h) | Mean + variance |
| Total | O(d·h² + h²) | Dominated by attention |
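For a rough sense of scale, with an assumed degree d = 32 and hidden dimension h = 256, the attention term d·h² is about 2.1 M multiply-accumulates per node, versus roughly 65 K for the h² message-transform and GRU terms, so the neighbor-dependent term dominates once d exceeds a handful.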
7.2 Hierarchical Search Complexity
For HNSW with L layers, M neighbors per node:
- Greedy search: O(L · M · log N)
- Differentiable search: O(L · k · h)
where k = top-k candidates per layer
8. Training Considerations
8.1 Contrastive Loss Functions
File: crates/ruvector-gnn/src/training.rs:330-462
InfoNCE Loss:
L_InfoNCE = -log(exp(sim(q, p⁺) / τ) / Σ_{p∈P} exp(sim(q, p) / τ))
where:
- q: anchor (query node)
- p⁺: positive sample (neighbor)
- P: all samples (positives + negatives)
- τ: temperature parameter
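A sketch of the InfoNCE computation over cosine similarities is given below, using the log-sum-exp trick for stability. It is illustrative only, not the training.rs implementation.

```rust
/// InfoNCE loss for one anchor: the positive is scored against all samples
/// (positive + negatives) with temperature tau.
fn info_nce(query: &[f32], positive: &[f32], negatives: &[Vec<f32>], tau: f32) -> f32 {
    fn cosine(a: &[f32], b: &[f32]) -> f32 {
        let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
        let na: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
        let nb: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();
        dot / (na * nb).max(1e-10)
    }
    // Temperature-scaled logits: positive first, then all negatives.
    let pos = cosine(query, positive) / tau;
    let mut logits: Vec<f32> = vec![pos];
    logits.extend(negatives.iter().map(|n| cosine(query, n) / tau));
    // -log softmax(pos) via the log-sum-exp trick.
    let max = logits.iter().copied().fold(f32::NEG_INFINITY, f32::max);
    let log_sum: f32 = logits.iter().map(|&l| (l - max).exp()).sum::<f32>().ln() + max;
    log_sum - pos
}
```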
Local Contrastive Loss:
Encourages node embeddings to be similar to graph neighbors
and dissimilar to non-neighbors
8.2 Elastic Weight Consolidation (EWC)
File: crates/ruvector-gnn/src/ewc.rs
Prevents catastrophic forgetting in continual learning:
L_total = L_task + (λ/2) Σ_i F_i (θ_i - θ*_i)²
where:
- L_task: Current task loss
- F_i: Fisher information (importance of parameter i)
- θ_i: Current parameter
- θ*_i: Anchor parameter from previous task
- λ: Regularization strength (10-10000)
Fisher Information Approximation:
F_i ≈ (1/N) Σ_{n=1}^N (∂L/∂θ_i)²
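The following sketch covers both the EWC penalty and the diagonal Fisher estimate above; names are illustrative, not the ewc.rs API.

```rust
/// EWC regulariser: (lambda / 2) * sum_i F_i * (theta_i - theta*_i)^2,
/// added to the current task loss.
fn ewc_penalty(params: &[f32], anchors: &[f32], fisher: &[f32], lambda: f32) -> f32 {
    let quad: f32 = params
        .iter()
        .zip(anchors)
        .zip(fisher)
        .map(|((&p, &a), &f)| f * (p - a).powi(2))
        .sum();
    0.5 * lambda * quad
}

/// Diagonal Fisher information from per-sample gradients:
/// F_i ≈ (1/N) * sum_n g_{n,i}^2.
fn fisher_diagonal(sample_grads: &[Vec<f32>]) -> Vec<f32> {
    let n = sample_grads.len().max(1) as f32;
    let dim = sample_grads.first().map_or(0, |g| g.len());
    let mut fisher = vec![0.0f32; dim];
    for g in sample_grads {
        for (f, &gi) in fisher.iter_mut().zip(g) {
            *f += gi * gi / n;
        }
    }
    fisher
}
```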
9. Key Insights for Latent Space Design
9.1 Embedding Geometry
Current Architecture Assumptions:
- Euclidean Latent Space: All operations assume flat geometry
- Cosine Similarity: Angular distance metric in search
- Linear Projections: Affine transformations preserve convexity
Implications:
- Tree-like graphs are poorly represented in Euclidean space
- Hierarchical HNSW structure hints at hyperbolic geometry benefits
- Attention mechanism can partially compensate for metric mismatch
9.2 Information Flow Bottlenecks
Critical Points:
- Attention Softmax: Hard selection at inference (argmax)
- GRU Gates: Sigmoid saturation can block gradients
- Fixed Dimensions: h_dim bottleneck between layers
Potential Improvements:
- Adaptive dimensionality per layer
- Sparse attention patterns
- Mixture of experts for different graph patterns
10. Connection to HNSW Topology
10.1 HNSW Structure
Hierarchical layers:
Layer 2: [sparse, long-range connections]
Layer 1: [medium density]
Layer 0: [dense, local connections]
10.2 GNN-HNSW Synergy
Advantages:
- Coarse-to-Fine: Higher layers = global structure, lower = local
- Skip Connections: Hierarchical search jumps across graph
- Differentiable: Soft attention enables gradient-based optimization
Challenges:
- Layer Mismatch: HNSW layers ≠ GNN layers
- Probabilistic Construction: HNSW randomness vs. learned embeddings
- Online Updates: Adding nodes requires GNN re-evaluation
11. Strengths and Limitations
11.1 Strengths
- Numerically Stable: Extensive guards against overflow/underflow
- Flexible: Handles variable-degree nodes and empty neighborhoods
- Rich Interactions: Dual aggregation (attention + weighted)
- Recurrent Memory: GRU maintains long-term dependencies
- End-to-End Differentiable: Full gradient flow through search
11.2 Limitations
- Computational Cost: O(d·h²) per node per layer
- Fixed Architecture: Uniform layers, no adaptive depth
- Euclidean Bias: May not suit hierarchical graphs
- Limited Expressiveness: Single attention type (dot-product)
- No Edge Features: Only uses edge weights, not attributes
12. Research Opportunities
12.1 Short-Term Enhancements
- Edge Features: Extend attention to incorporate edge attributes
- Adaptive Heads: Learn number of attention heads per layer
- Sparse Attention: Local + global attention patterns
- Layer Skip Connections: Direct paths from input to output
12.2 Long-Term Directions
- Hyperbolic GNN: Replace Euclidean operations with Poincaré ball
- Graph Transformers: Replace message passing with full attention
- Neural ODEs: Continuous-depth GNN with differential equations
- Equivariant Networks: SE(3) or E(n) equivariance for geometric graphs
References
Internal Code References
- /crates/ruvector-gnn/src/layer.rs - Core GNN layers
- /crates/ruvector-gnn/src/search.rs - Differentiable search
- /crates/ruvector-gnn/src/training.rs - Loss functions and optimizers
- /crates/ruvector-gnn/src/ewc.rs - Continual learning
- /crates/ruvector-graph/src/hybrid/graph_neural.rs - GNN engine interface
Key Papers
- Kipf & Welling (2017) - Graph Convolutional Networks
- Veličković et al. (2018) - Graph Attention Networks
- Chung et al. (2014) - Gated Recurrent Units
- Vaswani et al. (2017) - Attention Is All You Need (Transformers)
- Malkov & Yashunin (2018) - HNSW for ANN search
Document Version: 1.0 Last Updated: 2025-11-30 Author: RuVector Research Team