GNN Architecture Analysis: RuVector Implementation

Executive Summary

RuVector implements a Graph Neural Network architecture that operates directly on the HNSW (Hierarchical Navigable Small World) graph topology. The architecture combines message passing, multi-head attention, gated recurrent updates, and a differentiable search mechanism into a single framework for learning on graph-structured data.

Key Components: Linear transformations, Multi-head Attention, GRU cells, Layer Normalization, Hierarchical Search

Code Location: crates/ruvector-gnn/src/layer.rs, crates/ruvector-gnn/src/search.rs


1. Core Architecture: RuvectorLayer

1.1 Mathematical Formulation

The RuvectorLayer implements a message passing neural network with the following forward pass:

Given: node embedding h_v, neighbor embeddings {h_u}_u∈N(v), edge weights {e_uv}_u∈N(v)

1. Message Transformation:
   m_v = W_msg · h_v
   m_u = W_msg · h_u  for u ∈ N(v)

2. Multi-Head Attention:
   a_v = MultiHeadAttention(m_v, {m_u}, {m_u})

3. Weighted Aggregation:
   agg_v = Σ_u (e_uv / Σ_u' e_u'v) · m_u

4. Combination:
   combined = a_v + agg_v
   transformed = W_agg · combined

5. GRU Update:
   h'_v = GRU(transformed, m_v)

6. Normalization & Regularization:
   output = LayerNorm(Dropout(h'_v))
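
As a concrete illustration of step 3, here is a minimal, self-contained Rust sketch of the edge-weight-normalized aggregation (step 4 then adds this result elementwise to the attention output); the function and variable names are illustrative, not the layer.rs API:

// Step 3 (sketch): normalize edge weights and aggregate neighbor messages.
// Illustrative names only, not the actual RuvectorLayer implementation.
fn weighted_aggregate(messages: &[Vec<f32>], edge_weights: &[f32]) -> Vec<f32> {
    let dim = messages.first().map_or(0, |m| m.len());
    let w_sum: f32 = edge_weights.iter().sum::<f32>().max(1e-10); // guard against zero
    let mut agg = vec![0.0f32; dim];
    for (m, &e) in messages.iter().zip(edge_weights) {
        for (a, &x) in agg.iter_mut().zip(m) {
            *a += (e / w_sum) * x;
        }
    }
    agg
}

fn main() {
    let neighbor_msgs = vec![vec![1.0, 0.0], vec![0.0, 1.0]];
    let edge_weights = vec![3.0, 1.0];
    // Normalized weights are 0.75 and 0.25, so the aggregate is [0.75, 0.25]
    println!("{:?}", weighted_aggregate(&neighbor_msgs, &edge_weights));
}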

1.2 Implementation Details

File: crates/ruvector-gnn/src/layer.rs:307-440

pub struct RuvectorLayer {
    w_msg: Linear,              // Message weight matrix
    w_agg: Linear,              // Aggregation weight matrix
    w_update: GRUCell,          // GRU update cell
    attention: MultiHeadAttention,
    norm: LayerNorm,
    dropout: f32,
}

Design Choices:

  • Xavier Initialization: Weights initialized as N(0, √(2/(d_in + d_out))); see the sketch after this list
  • Numerical Stability: Softmax uses max subtraction trick
  • Residual Connections: Implicit through GRU's (1-z) term
  • Flexibility: Handles empty neighbor sets gracefully
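
A minimal sketch of the Xavier scheme noted above, using the rand and rand_distr crates; the function name and flat row-major layout are illustrative, not the layer.rs implementation:

use rand::thread_rng;
use rand_distr::{Distribution, Normal};

// Xavier/Glorot initialization (sketch): zero-mean normal with
// std = sqrt(2 / (fan_in + fan_out)); returns a flat weight matrix.
fn xavier_init(fan_in: usize, fan_out: usize) -> Vec<f32> {
    let std = (2.0 / (fan_in + fan_out) as f32).sqrt();
    let normal = Normal::new(0.0_f32, std).expect("valid std");
    let mut rng = thread_rng();
    (0..fan_in * fan_out).map(|_| normal.sample(&mut rng)).collect()
}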

2. Multi-Head Attention Mechanism

2.1 Scaled Dot-Product Attention

File: crates/ruvector-gnn/src/layer.rs:84-205

The attention mechanism follows the Transformer architecture:

Attention(Q, K, V) = softmax(QK^T / √d_k) V

where:
- Q = W_q · h_v (query from target node)
- K = W_k · h_u (keys from neighbors)
- V = W_v · h_u (values from neighbors)
- d_k = hidden_dim / num_heads

2.2 Multi-Head Decomposition

MultiHead(Q, K, V) = Concat(head_1, ..., head_h) W_o

where head_i = Attention(Q W_q^i, K W_k^i, V W_v^i)
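
A self-contained Rust sketch of a single attention head under this formulation, assuming the query, keys, and values have already been projected; names are illustrative, and the snippet uses the max-subtraction stabilization discussed in Section 2.3:

// Single-head scaled dot-product attention over a node's neighbors (sketch).
fn dot(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

fn attention_head(query: &[f32], keys: &[Vec<f32>], values: &[Vec<f32>]) -> Vec<f32> {
    let d_k = query.len() as f32;
    // Scaled scores: q·k / sqrt(d_k)
    let scores: Vec<f32> = keys.iter().map(|k| dot(query, k) / d_k.sqrt()).collect();
    // Numerically stable softmax
    let max_s = scores.iter().copied().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = scores.iter().map(|&s| (s - max_s).exp()).collect();
    let sum: f32 = exps.iter().sum::<f32>().max(1e-10);
    // Attention-weighted sum of values (empty neighbor set yields an empty output)
    let mut out = vec![0.0f32; values.first().map_or(0, |v| v.len())];
    for (v, &e) in values.iter().zip(&exps) {
        for (o, &x) in out.iter_mut().zip(v) {
            *o += (e / sum) * x;
        }
    }
    out
}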

Mathematical Properties:

  1. Permutation Invariance: Attention scores independent of neighbor ordering
  2. Soft Selection: Differentiable alternative to hard neighbor selection
  3. Context Aware: Each head can focus on different aspects of neighborhood

2.3 Numerical Stability

// Softmax with numerical stability
let max_score = scores.iter().copied().fold(f32::NEG_INFINITY, f32::max);
let exp_scores: Vec<f32> = scores.iter()
    .map(|&s| (s - max_score).exp())
    .collect();
let sum_exp: f32 = exp_scores.iter().sum::<f32>().max(1e-10);
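
The attention weights then follow by dividing each exponentiated score by sum_exp; a plausible one-line continuation of the snippet (not copied from layer.rs):

// Normalize to obtain the attention distribution over neighbors (sketch).
let attn_weights: Vec<f32> = exp_scores.iter().map(|&s| s / sum_exp).collect();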

Key Features:

  • Prevents overflow with max subtraction
  • Guards against division by zero with epsilon
  • Maintains gradient flow through exp operations

3. Gated Recurrent Unit (GRU) Integration

3.1 GRU Cell Mathematics

File: crates/ruvector-gnn/src/layer.rs:207-305

z_t = σ(W_z x_t + U_z h_{t-1})        [Update Gate]
r_t = σ(W_r x_t + U_r h_{t-1})        [Reset Gate]
h̃_t = tanh(W_h x_t + U_h (r_t ⊙ h_{t-1}))  [Candidate State]
h_t = (1 - z_t) ⊙ h_{t-1} + z_t ⊙ h̃_t     [Final State]
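
A self-contained Rust sketch of one GRU step implementing these four equations on plain vectors; the struct layout and names are illustrative, not the GRUCell in layer.rs:

fn sigmoid(x: f32) -> f32 { 1.0 / (1.0 + (-x).exp()) }

fn matvec(w: &[Vec<f32>], x: &[f32]) -> Vec<f32> {
    w.iter().map(|row| row.iter().zip(x).map(|(a, b)| a * b).sum()).collect()
}

// Illustrative GRU cell: W_* act on the input x, U_* act on the previous state.
struct GruSketch {
    w_z: Vec<Vec<f32>>, u_z: Vec<Vec<f32>>,
    w_r: Vec<Vec<f32>>, u_r: Vec<Vec<f32>>,
    w_h: Vec<Vec<f32>>, u_h: Vec<Vec<f32>>,
}

impl GruSketch {
    fn step(&self, x: &[f32], h_prev: &[f32]) -> Vec<f32> {
        let add = |a: Vec<f32>, b: Vec<f32>| -> Vec<f32> {
            a.iter().zip(&b).map(|(p, q)| p + q).collect()
        };
        // Update gate z_t and reset gate r_t
        let z: Vec<f32> = add(matvec(&self.w_z, x), matvec(&self.u_z, h_prev))
            .iter().map(|&v| sigmoid(v)).collect();
        let r: Vec<f32> = add(matvec(&self.w_r, x), matvec(&self.u_r, h_prev))
            .iter().map(|&v| sigmoid(v)).collect();
        // Candidate state uses the reset-gated previous state
        let rh: Vec<f32> = r.iter().zip(h_prev).map(|(ri, hi)| ri * hi).collect();
        let h_tilde: Vec<f32> = add(matvec(&self.w_h, x), matvec(&self.u_h, &rh))
            .iter().map(|&v| v.tanh()).collect();
        // h_t = (1 - z_t) ⊙ h_{t-1} + z_t ⊙ h̃_t
        z.iter().zip(h_prev).zip(&h_tilde)
            .map(|((zi, hi), hti)| (1.0 - zi) * hi + zi * hti)
            .collect()
    }
}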

3.2 Why GRU for Graph Updates?

  1. Memory of Previous State: Maintains information from earlier layers
  2. Selective Updates: Update gate z_t controls how much to change
  3. Reset Mechanism: Reset gate r_t decides relevance of previous state
  4. Gradient Flow: Mitigates vanishing gradients in deep GNNs

Connection to Graph Learning:

  • h_{t-1}: Node's current representation (before aggregation)
  • x_t: Aggregated neighborhood information
  • h_t: Updated node representation (after message passing)

4. Differentiable Search Mechanism

4.1 Soft Attention Over Candidates

File: crates/ruvector-gnn/src/search.rs:38-86

Given: query q, candidates C = {c_1, ..., c_n}

1. Compute Similarities:
   s_i = cosine_similarity(q, c_i)

2. Temperature-Scaled Softmax:
   w_i = exp(s_i / τ) / Σ_j exp(s_j / τ)

3. Soft Top-K Selection:
   indices = argsort(w, descending)[:k]   (indices of the k largest weights)
   weights = {w_i | i ∈ indices}

Temperature Parameter τ:

  • τ → 0: Sharp selection (approximates hard argmax)
  • τ → ∞: Uniform distribution (all candidates equal)
  • τ = 0.07-1.0: Typical range balancing discrimination and smoothness
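
A self-contained Rust sketch of this cosine-similarity, temperature-softmax, soft top-k pipeline; the function names and the (index, weight) output shape are illustrative, not the search.rs API:

// Soft top-k selection over candidate similarities (sketch).
fn cosine(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let na: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let nb: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    dot / (na * nb).max(1e-10)
}

fn soft_top_k(query: &[f32], candidates: &[Vec<f32>], k: usize, tau: f32) -> Vec<(usize, f32)> {
    // Temperature-scaled softmax over cosine similarities
    let scores: Vec<f32> = candidates.iter().map(|c| cosine(query, c) / tau).collect();
    let max_s = scores.iter().copied().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = scores.iter().map(|&s| (s - max_s).exp()).collect();
    let sum: f32 = exps.iter().sum::<f32>().max(1e-10);
    let mut weighted: Vec<(usize, f32)> = exps.iter().map(|&e| e / sum).enumerate().collect();
    // Keep the k candidates with the largest softmax weights
    weighted.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap_or(std::cmp::Ordering::Equal));
    weighted.truncate(k);
    weighted
}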

4.2 Hierarchical Forward Pass

File: crates/ruvector-gnn/src/search.rs:88-154

Processes query through HNSW layers sequentially:

Input: query q, layer_embeddings L = {L_0, ..., L_d}, gnn_layers G

h_0 = q
for layer l = 0 to d:
    1. Find top-k nodes: indices, weights = DifferentiableSearch(h_l, L_l)
    2. Aggregate: agg = Σ_i weights[i] · L_l[indices[i]]
    3. Combine: combined = (h_l + agg) / 2
    4. Transform: h_{l+1} = G_l(combined, neighbors, edge_weights)

Output: h_d
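
A Rust sketch of this traversal, reusing soft_top_k from the Section 4.1 sketch and taking each per-layer GNN transform as a simplified closure; everything here is illustrative rather than the search.rs implementation:

// Hierarchical forward pass (sketch): one embedding table and one GNN
// transform per HNSW layer.
fn hierarchical_forward<F: Fn(&[f32]) -> Vec<f32>>(
    query: &[f32],
    layer_embeddings: &[Vec<Vec<f32>>],
    gnn_layers: &[F],
    k: usize,
    tau: f32,
) -> Vec<f32> {
    let mut h = query.to_vec();
    for (embeddings, gnn) in layer_embeddings.iter().zip(gnn_layers) {
        // 1-2. Soft top-k selection, then weighted aggregation of the selected nodes
        let selected = soft_top_k(&h, embeddings, k, tau);
        let mut agg = vec![0.0f32; h.len()];
        for (idx, w) in &selected {
            for (a, &x) in agg.iter_mut().zip(&embeddings[*idx]) {
                *a += w * x;
            }
        }
        // 3. Average the current state with the aggregate, 4. apply the layer transform
        let combined: Vec<f32> = h.iter().zip(&agg).map(|(a, b)| (a + b) / 2.0).collect();
        h = gnn(&combined);
    }
    h
}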

Gradient Flow Through Hierarchy:

  • Softmax ensures differentiability
  • Enables end-to-end training of search process
  • Backpropagation through entire HNSW traversal

5. Data Flow Architecture

5.1 Forward Pass Diagram

Input Node Embedding (h_v)
         |
         v
    [W_msg Transform] ──────────────┐
         |                          |
         v                          |
    Message (m_v)                   |
         |                          |
         v                          |
    ┌─────────────────┐             |
    │  Multi-Head     │             |
    │  Attention      │ ← Neighbors (transformed)
    └─────────────────┘             |
         |                          |
         v                          |
    Attention Output                |
         |                          |
         v                          |
    [+ Weighted Agg] ← Edge Weights |
         |                          |
         v                          |
    [W_agg Transform]               |
         |                          |
         v                          |
    Aggregated Message              |
         |                          |
         v                          |
    ┌─────────────────┐             |
    │   GRU Cell      │ ← Previous State (m_v)
    └─────────────────┘
         |
         v
    Updated State
         |
         v
    [Dropout]
         |
         v
    [LayerNorm]
         |
         v
    Output Embedding

5.2 Information Bottlenecks

Potential Bottlenecks:

  1. Linear Transformations: Fixed capacity W_msg, W_agg
  2. Attention Heads: Limited parallelism (typically 2-8 heads)
  3. GRU Hidden State: Fixed dimensionality
  4. Dropout: Information loss during training

Mitigation Strategies:

  • Residual connections via GRU gates
  • Layer normalization prevents gradient explosion
  • Xavier init maintains variance through layers

6. Comparison with Standard GNN Architectures

| Feature | RuVector | GCN | GAT | GraphSAGE |
|---|---|---|---|---|
| Aggregation | Attention + Weighted | Mean | Attention | Mean/Max/LSTM |
| Update | GRU | Linear | Linear | Linear |
| Normalization | LayerNorm | None/BatchNorm | None | None |
| Topology | HNSW | General | General | General |
| Differentiable Search | Yes | No | No | No |
| Multi-Head | Yes | No | Yes | No |
| Gated Updates | Yes (GRU) | No | No | No |

RuVector Advantages:

  1. Temporal Dynamics: GRU captures evolution of node states
  2. Hierarchical Processing: HNSW structure for efficient search
  3. Dual Aggregation: Combines attention and edge-weighted aggregation
  4. Stable Training: LayerNorm + Xavier init + numerical guards

7. Computational Complexity

7.1 Per-Layer Complexity

For a node with degree d, hidden dimension h, and k attention heads:

| Operation | Complexity | Notes |
|---|---|---|
| Message Transform | O(h²) | Linear layer |
| Multi-Head Attention | O(k·d·h²/k) = O(d·h²) | k heads, each h/k dim |
| Weighted Aggregation | O(d·h) | Sum over neighbors |
| GRU Update | O(h²) | 6 linear transformations |
| Layer Norm | O(h) | Mean + variance |
| Total | O(d·h² + h²) | Dominated by attention |

7.2 Hierarchical Search Complexity

For HNSW with L layers, M neighbors per node:
- Greedy search: O(L · M · log N)
- Differentiable search: O(L · k · h)
  where k = top-k candidates per layer

8. Training Considerations

8.1 Contrastive Loss Functions

File: crates/ruvector-gnn/src/training.rs:330-462

InfoNCE Loss:

L_InfoNCE = -log(exp(sim(q, p⁺) / τ) / Σ_{p∈P} exp(sim(q, p) / τ))

where:
- q: anchor (query node)
- p⁺: positive sample (neighbor)
- P: all samples (positives + negatives)
- τ: temperature parameter
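
A self-contained Rust sketch of this loss for a single anchor, with cosine similarity as sim(·,·); the names are illustrative, not the training.rs API:

// InfoNCE for one anchor (sketch): numerator is the positive pair,
// denominator sums over the positive and all negatives.
fn info_nce(anchor: &[f32], positive: &[f32], negatives: &[Vec<f32>], tau: f32) -> f32 {
    let sim = |a: &[f32], b: &[f32]| -> f32 {
        let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
        let na: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
        let nb: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();
        dot / (na * nb).max(1e-10)
    };
    let pos = (sim(anchor, positive) / tau).exp();
    let denom: f32 = pos
        + negatives.iter().map(|n| (sim(anchor, n) / tau).exp()).sum::<f32>();
    -(pos / denom.max(1e-10)).ln()
}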

Local Contrastive Loss:

Encourages node embeddings to be similar to graph neighbors
and dissimilar to non-neighbors

8.2 Elastic Weight Consolidation (EWC)

File: crates/ruvector-gnn/src/ewc.rs

Prevents catastrophic forgetting in continual learning:

L_total = L_task + (λ/2) Σ_i F_i (θ_i - θ*_i)²

where:
- L_task: Current task loss
- F_i: Fisher information (importance of parameter i)
- θ_i: Current parameter
- θ*_i: Anchor parameter from previous task
- λ: Regularization strength (10-10000)

Fisher Information Approximation:

F_i ≈ (1/N) Σ_{n=1}^N (∂L/∂θ_i)²
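
A minimal Rust sketch of the EWC penalty over flat parameter vectors; the names are illustrative, not the ewc.rs API:

// EWC penalty (sketch): (lambda / 2) * Σ_i F_i * (θ_i - θ*_i)².
// The total loss would then be L_task + ewc_penalty(...).
fn ewc_penalty(params: &[f32], anchors: &[f32], fisher: &[f32], lambda: f32) -> f32 {
    let quad: f32 = params.iter().zip(anchors).zip(fisher)
        .map(|((theta, theta_star), f)| f * (theta - theta_star).powi(2))
        .sum();
    0.5 * lambda * quad
}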

9. Key Insights for Latent Space Design

9.1 Embedding Geometry

Current Architecture Assumptions:

  1. Euclidean Latent Space: All operations assume flat geometry
  2. Cosine Similarity: Angular distance metric in search
  3. Linear Projections: Affine transformations preserve convexity

Implications:

  • Tree-like graphs poorly represented in Euclidean space
  • Hierarchical HNSW structure hints at hyperbolic geometry benefits
  • Attention mechanism can partially compensate for metric mismatch

9.2 Information Flow Bottlenecks

Critical Points:

  1. Attention Softmax: Hard selection at inference (argmax)
  2. GRU Gates: Sigmoid saturation can block gradients
  3. Fixed Dimensions: h_dim bottleneck between layers

Potential Improvements:

  • Adaptive dimensionality per layer
  • Sparse attention patterns
  • Mixture of experts for different graph patterns

10. Connection to HNSW Topology

10.1 HNSW Structure

Hierarchical layers:

Layer 2: [sparse, long-range connections]
Layer 1: [medium density]
Layer 0: [dense, local connections]

10.2 GNN-HNSW Synergy

Advantages:

  1. Coarse-to-Fine: Higher layers = global structure, lower = local
  2. Skip Connections: Hierarchical search jumps across graph
  3. Differentiable: Soft attention enables gradient-based optimization

Challenges:

  1. Layer Mismatch: HNSW layers ≠ GNN layers
  2. Probabilistic Construction: HNSW randomness vs. learned embeddings
  3. Online Updates: Adding nodes requires GNN re-evaluation

11. Strengths and Limitations

11.1 Strengths

  1. Numerically Stable: Extensive guards against overflow/underflow
  2. Flexible: Handles variable-degree nodes and empty neighborhoods
  3. Rich Interactions: Dual aggregation (attention + weighted)
  4. Recurrent Memory: GRU maintains long-term dependencies
  5. End-to-End Differentiable: Full gradient flow through search

11.2 Limitations

  1. Computational Cost: O(d·h²) per node per layer
  2. Fixed Architecture: Uniform layers, no adaptive depth
  3. Euclidean Bias: May not suit hierarchical graphs
  4. Limited Expressiveness: Single attention type (dot-product)
  5. No Edge Features: Only uses edge weights, not attributes

12. Research Opportunities

12.1 Short-Term Enhancements

  1. Edge Features: Extend attention to incorporate edge attributes
  2. Adaptive Heads: Learn number of attention heads per layer
  3. Sparse Attention: Local + global attention patterns
  4. Layer Skip Connections: Direct paths from input to output

12.2 Long-Term Directions

  1. Hyperbolic GNN: Replace Euclidean operations with Poincaré ball
  2. Graph Transformers: Replace message passing with full attention
  3. Neural ODEs: Continuous-depth GNN with differential equations
  4. Equivariant Networks: SE(3) or E(n) equivariance for geometric graphs

References

Internal Code References

  • /crates/ruvector-gnn/src/layer.rs - Core GNN layers
  • /crates/ruvector-gnn/src/search.rs - Differentiable search
  • /crates/ruvector-gnn/src/training.rs - Loss functions and optimizers
  • /crates/ruvector-gnn/src/ewc.rs - Continual learning
  • /crates/ruvector-graph/src/hybrid/graph_neural.rs - GNN engine interface

Key Papers

  • Kipf & Welling (2017) - Graph Convolutional Networks
  • Veličković et al. (2018) - Graph Attention Networks
  • Chung et al. (2014) - Gated Recurrent Units
  • Vaswani et al. (2017) - Attention Is All You Need (Transformers)
  • Malkov & Yashunin (2018) - HNSW for ANN search

Document Version: 1.0 | Last Updated: 2025-11-30 | Author: RuVector Research Team