GNN Architecture Analysis: RuVector Implementation
Executive Summary
RuVector implements a sophisticated Graph Neural Network architecture that operates on HNSW (Hierarchical Navigable Small World) graph topology. The architecture combines message passing, multi-head attention, gated recurrent updates, and differentiable search mechanisms to create a powerful framework for learning on graph-structured data.
Key Components: Linear transformations, Multi-head Attention, GRU cells, Layer Normalization, Hierarchical Search
Code Location: crates/ruvector-gnn/src/layer.rs, crates/ruvector-gnn/src/search.rs
1. Core Architecture: RuvectorLayer
1.1 Mathematical Formulation
The RuvectorLayer implements a message passing neural network with the following forward pass:
Given: node embedding h_v, neighbor embeddings {h_u}_u∈N(v), edge weights {e_uv}_u∈N(v)
1. Message Transformation:
m_v = W_msg · h_v
m_u = W_msg · h_u for u ∈ N(v)
2. Multi-Head Attention:
a_v = MultiHeadAttention(m_v, {m_u}, {m_u})
3. Weighted Aggregation:
agg_v = Σ_u (e_uv / Σ_u' e_u'v) · m_u
4. Combination:
combined = a_v + agg_v
transformed = W_agg · combined
5. GRU Update:
h'_v = GRU(transformed, m_v)
6. Normalization & Regularization:
output = LayerNorm(Dropout(h'_v))
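To make the aggregation concrete, below is a minimal sketch of steps 3-4 (edge-weighted aggregation and combination) on plain Vec&lt;f32&gt; slices. The function names are illustrative, not the actual RuvectorLayer API; the attention and GRU steps are sketched separately in Sections 2 and 3.

```rust
// Step 3: edge-weight-normalised aggregation of neighbor messages.
fn weighted_aggregate(messages: &[Vec<f32>], edge_weights: &[f32]) -> Vec<f32> {
    let dim = messages.first().map_or(0, |m| m.len());
    let mut agg = vec![0.0f32; dim];
    // Normalise edge weights into a convex combination (guard against /0).
    let total: f32 = edge_weights.iter().sum::<f32>().max(1e-10);
    for (m, &w) in messages.iter().zip(edge_weights) {
        for (a, &x) in agg.iter_mut().zip(m) {
            *a += (w / total) * x;
        }
    }
    agg
}

// Step 4: element-wise sum of the attention output and the weighted
// aggregation; the result would then pass through W_agg and the GRU update.
fn combine(attention_out: &[f32], agg: &[f32]) -> Vec<f32> {
    attention_out.iter().zip(agg).map(|(a, b)| a + b).collect()
}
```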
1.2 Implementation Details
File: crates/ruvector-gnn/src/layer.rs:307-440
```rust
pub struct RuvectorLayer {
    w_msg: Linear,        // Message weight matrix
    w_agg: Linear,        // Aggregation weight matrix
    w_update: GRUCell,    // GRU update cell
    attention: MultiHeadAttention,
    norm: LayerNorm,
    dropout: f32,
}
```
Design Choices:
- Xavier Initialization: Weights drawn from N(0, σ²) with σ = √(2/(d_in + d_out)) (see the sketch after this list)
- Numerical Stability: Softmax uses max subtraction trick
- Residual Connections: Implicit through GRU's (1-z) term
- Flexibility: Handles empty neighbor sets gracefully
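As a point of reference, here is a minimal sketch of Xavier-style initialization under the formula above. It assumes the rand crate and samples from a uniform distribution with matching variance; the actual initializer in layer.rs may sample differently.

```rust
use rand::Rng;

/// Xavier/Glorot initialisation: target variance 2 / (d_in + d_out).
/// Sampled here from Uniform(-a, a), whose variance a^2/3 is matched to
/// sigma^2 by choosing a = sigma * sqrt(3).
fn xavier_init(d_in: usize, d_out: usize) -> Vec<Vec<f32>> {
    let sigma = (2.0 / (d_in + d_out) as f32).sqrt();
    let a = sigma * 3.0f32.sqrt();
    let mut rng = rand::thread_rng();
    (0..d_out)
        .map(|_| (0..d_in).map(|_| rng.gen_range(-a..a)).collect())
        .collect()
}
```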
2. Multi-Head Attention Mechanism
2.1 Scaled Dot-Product Attention
File: crates/ruvector-gnn/src/layer.rs:84-205
The attention mechanism follows the Transformer architecture:
Attention(Q, K, V) = softmax(QK^T / √d_k) V
where:
- Q = W_q · h_v (query from target node)
- K = W_k · h_u (keys from neighbors)
- V = W_v · h_u (values from neighbors)
- d_k = hidden_dim / num_heads
2.2 Multi-Head Decomposition
MultiHead(Q, K, V) = Concat(head_1, ..., head_h) W_o
where head_i = Attention(Q W_q^i, K W_k^i, V W_v^i)
Mathematical Properties:
- Permutation Invariance: Attention scores independent of neighbor ordering
- Soft Selection: Differentiable alternative to hard neighbor selection
- Context Aware: Each head can focus on different aspects of neighborhood
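The single-head sketch below illustrates the scaled dot-product step together with the stable softmax described in Section 2.3. The multi-head version would split q, k, v into num_heads slices and concatenate the per-head outputs through W_o. Function names are illustrative, not the layer.rs API.

```rust
/// Single-head scaled dot-product attention for one query against its
/// neighbours: softmax(q . k_i / sqrt(d_k))-weighted sum of the values.
fn scaled_dot_attention(query: &[f32], keys: &[Vec<f32>], values: &[Vec<f32>]) -> Vec<f32> {
    let d_k = query.len() as f32;
    // Raw scores q . k_i / sqrt(d_k).
    let scores: Vec<f32> = keys
        .iter()
        .map(|k| query.iter().zip(k).map(|(a, b)| a * b).sum::<f32>() / d_k.sqrt())
        .collect();
    // Numerically stable softmax (max subtraction, epsilon guard).
    let max = scores.iter().copied().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = scores.iter().map(|&s| (s - max).exp()).collect();
    let sum: f32 = exps.iter().sum::<f32>().max(1e-10);
    // Attention-weighted sum of the value vectors.
    let mut out = vec![0.0f32; query.len()];
    for (v, &e) in values.iter().zip(&exps) {
        for (o, &x) in out.iter_mut().zip(v) {
            *o += (e / sum) * x;
        }
    }
    out
}
```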
2.3 Numerical Stability
```rust
// Softmax with numerical stability
let max_score = scores.iter().copied().fold(f32::NEG_INFINITY, f32::max);
let exp_scores: Vec<f32> = scores.iter()
    .map(|&s| (s - max_score).exp())
    .collect();
let sum_exp: f32 = exp_scores.iter().sum::<f32>().max(1e-10);
```
Key Features:
- Prevents overflow with max subtraction
- Guards against division by zero with epsilon
- Maintains gradient flow through exp operations
3. Gated Recurrent Unit (GRU) Integration
3.1 GRU Cell Mathematics
File: crates/ruvector-gnn/src/layer.rs:207-305
z_t = σ(W_z x_t + U_z h_{t-1}) [Update Gate]
r_t = σ(W_r x_t + U_r h_{t-1}) [Reset Gate]
h̃_t = tanh(W_h x_t + U_h (r_t ⊙ h_{t-1})) [Candidate State]
h_t = (1 - z_t) ⊙ h_{t-1} + z_t ⊙ h̃_t [Final State]
3.2 Why GRU for Graph Updates?
- Memory of Previous State: Maintains information from earlier layers
- Selective Updates: Update gate z_t controls how much to change
- Reset Mechanism: Reset gate r_t decides relevance of previous state
- Gradient Flow: Mitigates vanishing gradients in deep GNNs
Connection to Graph Learning:
- h_{t-1}: Node's current representation (before aggregation)
- x_t: Aggregated neighborhood information
- h_t: Updated node representation (after message passing)
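A minimal single-step GRU sketch matching the equations in 3.1 (biases omitted), where x is the aggregated message and h the previous node state. The matvec helper and parameter names are illustrative, not the layer.rs GRUCell API.

```rust
fn sigmoid(x: f32) -> f32 { 1.0 / (1.0 + (-x).exp()) }

fn matvec(w: &[Vec<f32>], x: &[f32]) -> Vec<f32> {
    w.iter().map(|row| row.iter().zip(x).map(|(a, b)| a * b).sum()).collect()
}

/// One GRU step: z gates how much of the candidate state replaces h,
/// r gates how much of the previous state feeds the candidate.
fn gru_step(
    w_z: &[Vec<f32>], u_z: &[Vec<f32>],
    w_r: &[Vec<f32>], u_r: &[Vec<f32>],
    w_h: &[Vec<f32>], u_h: &[Vec<f32>],
    x: &[f32], h: &[f32],
) -> Vec<f32> {
    let add = |a: Vec<f32>, b: Vec<f32>| -> Vec<f32> {
        a.iter().zip(&b).map(|(p, q)| p + q).collect()
    };
    // Update and reset gates.
    let z: Vec<f32> = add(matvec(w_z, x), matvec(u_z, h)).iter().map(|&v| sigmoid(v)).collect();
    let r: Vec<f32> = add(matvec(w_r, x), matvec(u_r, h)).iter().map(|&v| sigmoid(v)).collect();
    // Candidate state uses the reset-gated previous state.
    let rh: Vec<f32> = r.iter().zip(h).map(|(a, b)| a * b).collect();
    let h_tilde: Vec<f32> =
        add(matvec(w_h, x), matvec(u_h, &rh)).iter().map(|&v| v.tanh()).collect();
    // Blend previous and candidate state via the update gate.
    z.iter().zip(h).zip(&h_tilde)
        .map(|((&zi, &hi), &ci)| (1.0 - zi) * hi + zi * ci)
        .collect()
}
```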
4. Differentiable Search Mechanism
4.1 Soft Attention Over Candidates
File: crates/ruvector-gnn/src/search.rs:38-86
Given: query q, candidates C = {c_1, ..., c_n}
1. Compute Similarities:
s_i = cosine_similarity(q, c_i)
2. Temperature-Scaled Softmax:
w_i = exp(s_i / τ) / Σ_j exp(s_j / τ)
3. Soft Top-K Selection:
indices = argsort(w, descending)[:k]
weights = {w_i | i ∈ indices}
Temperature Parameter τ:
- τ → 0: Sharp selection (approximates hard argmax)
- τ → ∞: Uniform distribution (all candidates equal)
- τ = 0.07-1.0: Typical range balancing discrimination and smoothness
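Below is a sketch of the temperature-scaled soft top-k step, using the same max-subtraction and epsilon guards as the layer code. The name differentiable_topk is illustrative, not the search.rs API.

```rust
use std::cmp::Ordering;

/// Temperature-scaled softmax over similarities, then keep the k largest
/// (index, weight) pairs as in Section 4.1.
fn differentiable_topk(similarities: &[f32], k: usize, tau: f32) -> Vec<(usize, f32)> {
    // Stable softmax with temperature tau.
    let max = similarities.iter().copied().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = similarities.iter().map(|&s| ((s - max) / tau).exp()).collect();
    let sum: f32 = exps.iter().sum::<f32>().max(1e-10);
    // Pair each candidate index with its softmax weight, keep the k largest.
    let mut weighted: Vec<(usize, f32)> =
        exps.iter().enumerate().map(|(i, &e)| (i, e / sum)).collect();
    weighted.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap_or(Ordering::Equal));
    weighted.truncate(k);
    weighted
}
```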
4.2 Hierarchical Forward Pass
File: crates/ruvector-gnn/src/search.rs:88-154
Processes query through HNSW layers sequentially:
Input: query q, layer_embeddings L = {L_0, ..., L_d}, gnn_layers G
h_0 = q
for layer l = 0 to d:
1. Find top-k nodes: indices, weights = DifferentiableSearch(h_l, L_l)
2. Aggregate: agg = Σ_i weights[i] · L_l[indices[i]]
3. Combine: combined = (h_l + agg) / 2
4. Transform: h_{l+1} = G_l(combined, neighbors, edge_weights)
Output: h_d
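A sketch of this traversal follows, with the per-layer GNN transform left abstract as a closure and the soft selection reusing the differentiable_topk sketch from Section 4.1. All names are illustrative.

```rust
fn cosine(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let na: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let nb: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    dot / (na * nb).max(1e-10)
}

/// Layer-by-layer refinement of the query through the HNSW hierarchy.
fn hierarchical_forward(
    query: &[f32],
    layer_embeddings: &[Vec<Vec<f32>>],          // one embedding table per HNSW layer
    gnn_layers: &[&dyn Fn(&[f32]) -> Vec<f32>],  // one transform per layer
    k: usize,
    tau: f32,
) -> Vec<f32> {
    let mut h = query.to_vec();
    for (table, gnn) in layer_embeddings.iter().zip(gnn_layers) {
        // 1. Soft top-k over this layer's candidates.
        let sims: Vec<f32> = table.iter().map(|c| cosine(&h, c)).collect();
        let picked = differentiable_topk(&sims, k, tau);
        // 2. Weighted aggregation of the selected embeddings.
        let mut agg = vec![0.0f32; h.len()];
        for (idx, w) in &picked {
            for (a, &x) in agg.iter_mut().zip(&table[*idx]) {
                *a += *w * x;
            }
        }
        // 3. Average with the current state, 4. apply this layer's GNN transform.
        let combined: Vec<f32> = h.iter().zip(&agg).map(|(a, b)| (a + b) / 2.0).collect();
        h = gnn(&combined);
    }
    h
}
```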
Gradient Flow Through Hierarchy:
- Softmax ensures differentiability
- Enables end-to-end training of search process
- Backpropagation through entire HNSW traversal
5. Data Flow Architecture
5.1 Forward Pass Diagram
Input Node Embedding (h_v)
|
v
[W_msg Transform] ──────────────┐
| |
v |
Message (m_v) |
| |
v |
┌─────────────────┐ |
│ Multi-Head │ |
│ Attention │ ← Neighbors (transformed)
└─────────────────┘ |
| |
v |
Attention Output |
| |
v |
[+ Weighted Agg] ← Edge Weights |
| |
v |
[W_agg Transform] |
| |
v |
Aggregated Message |
| |
v |
┌─────────────────┐ |
│ GRU Cell │ ← Previous State (m_v)
└─────────────────┘
|
v
Updated State
|
v
[Dropout]
|
v
[LayerNorm]
|
v
Output Embedding
5.2 Information Bottlenecks
Potential Bottlenecks:
- Linear Transformations: Fixed capacity W_msg, W_agg
- Attention Heads: Limited parallelism (typically 2-8 heads)
- GRU Hidden State: Fixed dimensionality
- Dropout: Information loss during training
Mitigation Strategies:
- Residual connections via GRU gates
- Layer normalization prevents gradient explosion
- Xavier init maintains variance through layers
6. Comparison with Standard GNN Architectures
| Feature | RuVector | GCN | GAT | GraphSAGE |
|---|---|---|---|---|
| Aggregation | Attention + Weighted | Mean | Attention | Mean/Max/LSTM |
| Update | GRU | Linear | Linear | Linear |
| Normalization | LayerNorm | None/BatchNorm | None | None |
| Topology | HNSW | General | General | General |
| Differentiable Search | Yes | No | No | No |
| Multi-Head | Yes | No | Yes | No |
| Gated Updates | Yes (GRU) | No | No | No |
RuVector Advantages:
- Temporal Dynamics: GRU captures evolution of node states
- Hierarchical Processing: HNSW structure for efficient search
- Dual Aggregation: Combines attention and edge-weighted aggregation
- Stable Training: LayerNorm + Xavier init + numerical guards
7. Computational Complexity
7.1 Per-Layer Complexity
For a node with degree d, hidden dimension h, and k attention heads:
| Operation | Complexity | Notes |
|---|---|---|
| Message Transform | O(h²) | Linear layer |
| Multi-Head Attention | O(k · d · h · (h/k)) = O(d·h²) | k heads, each of dimension h/k |
| Weighted Aggregation | O(d·h) | Sum over neighbors |
| GRU Update | O(h²) | 6 linear transformations |
| Layer Norm | O(h) | Mean + variance |
| Total | O(d·h² + h²) | Dominated by attention |
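For a rough sense of scale, with an assumed degree d = 32 and hidden dimension h = 256, the attention term d·h² is about 2.1 M multiply-accumulates per node, versus roughly 65 K for the h² message-transform and GRU terms, so the neighbor-dependent term dominates once d exceeds a handful.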
7.2 Hierarchical Search Complexity
For HNSW with L layers, M neighbors per node:
- Greedy search: O(L · M · log N)
- Differentiable search: O(L · k · h)
where k = top-k candidates per layer
8. Training Considerations
8.1 Contrastive Loss Functions
File: crates/ruvector-gnn/src/training.rs:330-462
InfoNCE Loss:
L_InfoNCE = -log(exp(sim(q, p⁺) / τ) / Σ_{p∈P} exp(sim(q, p) / τ))
where:
- q: anchor (query node)
- p⁺: positive sample (neighbor)
- P: all samples (positives + negatives)
- τ: temperature parameter
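A sketch of the InfoNCE computation over cosine similarities is given below, using the log-sum-exp trick for stability. It is illustrative only, not the training.rs implementation.

```rust
/// InfoNCE loss for one anchor: the positive is scored against all samples
/// (positive + negatives) with temperature tau.
fn info_nce(query: &[f32], positive: &[f32], negatives: &[Vec<f32>], tau: f32) -> f32 {
    fn cosine(a: &[f32], b: &[f32]) -> f32 {
        let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
        let na: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
        let nb: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();
        dot / (na * nb).max(1e-10)
    }
    // Temperature-scaled logits: positive first, then all negatives.
    let pos = cosine(query, positive) / tau;
    let mut logits: Vec<f32> = vec![pos];
    logits.extend(negatives.iter().map(|n| cosine(query, n) / tau));
    // -log softmax(pos) via the log-sum-exp trick.
    let max = logits.iter().copied().fold(f32::NEG_INFINITY, f32::max);
    let log_sum: f32 = logits.iter().map(|&l| (l - max).exp()).sum::<f32>().ln() + max;
    log_sum - pos
}
```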
Local Contrastive Loss:
Encourages node embeddings to be similar to graph neighbors
and dissimilar to non-neighbors
8.2 Elastic Weight Consolidation (EWC)
File: crates/ruvector-gnn/src/ewc.rs
Prevents catastrophic forgetting in continual learning:
L_total = L_task + (λ/2) Σ_i F_i (θ_i - θ*_i)²
where:
- L_task: Current task loss
- F_i: Fisher information (importance of parameter i)
- θ_i: Current parameter
- θ*_i: Anchor parameter from previous task
- λ: Regularization strength (10-10000)
Fisher Information Approximation:
F_i ≈ (1/N) Σ_{n=1}^N (∂L/∂θ_i)²
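The following sketch covers both the EWC penalty and the diagonal Fisher estimate above; names are illustrative, not the ewc.rs API.

```rust
/// EWC regulariser: (lambda / 2) * sum_i F_i * (theta_i - theta*_i)^2,
/// added to the current task loss.
fn ewc_penalty(params: &[f32], anchors: &[f32], fisher: &[f32], lambda: f32) -> f32 {
    let quad: f32 = params
        .iter()
        .zip(anchors)
        .zip(fisher)
        .map(|((&p, &a), &f)| f * (p - a).powi(2))
        .sum();
    0.5 * lambda * quad
}

/// Diagonal Fisher information from per-sample gradients:
/// F_i ≈ (1/N) * sum_n g_{n,i}^2.
fn fisher_diagonal(sample_grads: &[Vec<f32>]) -> Vec<f32> {
    let n = sample_grads.len().max(1) as f32;
    let dim = sample_grads.first().map_or(0, |g| g.len());
    let mut fisher = vec![0.0f32; dim];
    for g in sample_grads {
        for (f, &gi) in fisher.iter_mut().zip(g) {
            *f += gi * gi / n;
        }
    }
    fisher
}
```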
9. Key Insights for Latent Space Design
9.1 Embedding Geometry
Current Architecture Assumptions:
- Euclidean Latent Space: All operations assume flat geometry
- Cosine Similarity: Angular distance metric in search
- Linear Projections: Affine transformations preserve convexity
Implications:
- Tree-like graphs are poorly represented in Euclidean space
- Hierarchical HNSW structure hints at hyperbolic geometry benefits
- Attention mechanism can partially compensate for metric mismatch
9.2 Information Flow Bottlenecks
Critical Points:
- Attention Softmax: Hard selection at inference (argmax)
- GRU Gates: Sigmoid saturation can block gradients
- Fixed Dimensions: h_dim bottleneck between layers
Potential Improvements:
- Adaptive dimensionality per layer
- Sparse attention patterns
- Mixture of experts for different graph patterns
10. Connection to HNSW Topology
10.1 HNSW Structure
Hierarchical layers:
Layer 2: [sparse, long-range connections]
Layer 1: [medium density]
Layer 0: [dense, local connections]
10.2 GNN-HNSW Synergy
Advantages:
- Coarse-to-Fine: Higher layers = global structure, lower = local
- Skip Connections: Hierarchical search jumps across graph
- Differentiable: Soft attention enables gradient-based optimization
Challenges:
- Layer Mismatch: HNSW layers ≠ GNN layers
- Probabilistic Construction: HNSW randomness vs. learned embeddings
- Online Updates: Adding nodes requires GNN re-evaluation
11. Strengths and Limitations
11.1 Strengths
- Numerically Stable: Extensive guards against overflow/underflow
- Flexible: Handles variable-degree nodes and empty neighborhoods
- Rich Interactions: Dual aggregation (attention + weighted)
- Recurrent Memory: GRU maintains long-term dependencies
- End-to-End Differentiable: Full gradient flow through search
11.2 Limitations
- Computational Cost: O(d·h²) per node per layer
- Fixed Architecture: Uniform layers, no adaptive depth
- Euclidean Bias: May not suit hierarchical graphs
- Limited Expressiveness: Single attention type (dot-product)
- No Edge Features: Only uses edge weights, not attributes
12. Research Opportunities
12.1 Short-Term Enhancements
- Edge Features: Extend attention to incorporate edge attributes
- Adaptive Heads: Learn number of attention heads per layer
- Sparse Attention: Local + global attention patterns
- Layer Skip Connections: Direct paths from input to output
12.2 Long-Term Directions
- Hyperbolic GNN: Replace Euclidean operations with Poincaré ball
- Graph Transformers: Replace message passing with full attention
- Neural ODEs: Continuous-depth GNN with differential equations
- Equivariant Networks: SE(3) or E(n) equivariance for geometric graphs
References
Internal Code References
- /crates/ruvector-gnn/src/layer.rs - Core GNN layers
- /crates/ruvector-gnn/src/search.rs - Differentiable search
- /crates/ruvector-gnn/src/training.rs - Loss functions and optimizers
- /crates/ruvector-gnn/src/ewc.rs - Continual learning
- /crates/ruvector-graph/src/hybrid/graph_neural.rs - GNN engine interface
Key Papers
- Kipf & Welling (2017) - Graph Convolutional Networks
- Veličković et al. (2018) - Graph Attention Networks
- Chung et al. (2014) - Gated Recurrent Units
- Vaswani et al. (2017) - Attention Is All You Need (Transformers)
- Malkov & Yashunin (2018) - HNSW for ANN search
Document Version: 1.0 Last Updated: 2025-11-30 Author: RuVector Research Team