# GNN Architecture Analysis: RuVector Implementation

## Executive Summary

RuVector implements a Graph Neural Network architecture that operates on HNSW (Hierarchical Navigable Small World) graph topology. The architecture combines message passing, multi-head attention, gated recurrent updates, and differentiable search into a single framework for learning on graph-structured data.

**Key Components**: Linear transformations, multi-head attention, GRU cells, layer normalization, hierarchical search

**Code Location**: `crates/ruvector-gnn/src/layer.rs`, `crates/ruvector-gnn/src/search.rs`

---

## 1. Core Architecture: RuvectorLayer

### 1.1 Mathematical Formulation

The RuvectorLayer implements a message-passing neural network with the following forward pass:

```
Given: node embedding h_v, neighbor embeddings {h_u} for u ∈ N(v),
       edge weights {e_uv} for u ∈ N(v)

1. Message Transformation:
   m_v = W_msg · h_v
   m_u = W_msg · h_u   for u ∈ N(v)

2. Multi-Head Attention:
   a_v = MultiHeadAttention(m_v, {m_u}, {m_u})

3. Weighted Aggregation:
   agg_v = Σ_u (e_uv / Σ_u' e_u'v) · m_u

4. Combination:
   combined    = a_v + agg_v
   transformed = W_agg · combined

5. GRU Update:
   h'_v = GRU(transformed, m_v)

6. Normalization & Regularization:
   output = LayerNorm(Dropout(h'_v))
```

### 1.2 Implementation Details

**File**: `crates/ruvector-gnn/src/layer.rs:307-440`

```rust
pub struct RuvectorLayer {
    w_msg: Linear,                 // Message weight matrix
    w_agg: Linear,                 // Aggregation weight matrix
    w_update: GRUCell,             // GRU update cell
    attention: MultiHeadAttention, // Attention over neighbors
    norm: LayerNorm,               // Layer normalization
    dropout: f32,                  // Dropout probability
}
```

**Design Choices**:

- **Xavier Initialization**: Weights drawn from a zero-mean normal with standard deviation √(2/(d_in + d_out))
- **Numerical Stability**: Softmax uses the max-subtraction trick
- **Residual Connections**: Implicit through the GRU's (1 - z) term
- **Flexibility**: Handles empty neighbor sets gracefully

---
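The edge-weighted aggregation in step 3 is simple enough to sketch in plain Rust over `Vec<f32>`. This is a hypothetical standalone function for illustration, not the crate's actual API; note the graceful fallback for empty neighborhoods mentioned above:

```rust
/// Edge-weight-normalized aggregation over transformed neighbor messages:
/// agg_v = Σ_u (e_uv / Σ_u' e_u'v) · m_u
/// Returns the zero vector when the neighborhood is empty or all weights
/// are zero, mirroring the layer's graceful handling of empty neighbor sets.
fn weighted_aggregate(messages: &[Vec<f32>], edge_weights: &[f32], dim: usize) -> Vec<f32> {
    let total: f32 = edge_weights.iter().sum();
    let mut agg = vec![0.0f32; dim];
    if total <= 0.0 {
        return agg; // empty (or degenerate) neighborhood
    }
    for (m, &w) in messages.iter().zip(edge_weights) {
        let coeff = w / total; // normalized edge weight e_uv / Σ e_u'v
        for (a, &x) in agg.iter_mut().zip(m) {
            *a += coeff * x;
        }
    }
    agg
}

fn main() {
    // Two neighbors with edge weights 3 and 1 → coefficients 0.75 and 0.25.
    let messages = vec![vec![1.0, 0.0], vec![0.0, 1.0]];
    let agg = weighted_aggregate(&messages, &[3.0, 1.0], 2);
    println!("{agg:?}"); // [0.75, 0.25]
}
```

Because the coefficients sum to one, the aggregate stays in the convex hull of the neighbor messages regardless of degree.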
## 2. Multi-Head Attention Mechanism

### 2.1 Scaled Dot-Product Attention

**File**: `crates/ruvector-gnn/src/layer.rs:84-205`

The attention mechanism follows the Transformer architecture:

```
Attention(Q, K, V) = softmax(QK^T / √d_k) V

where:
- Q = W_q · h_v   (query from target node)
- K = W_k · h_u   (keys from neighbors)
- V = W_v · h_u   (values from neighbors)
- d_k = hidden_dim / num_heads
```

### 2.2 Multi-Head Decomposition

```
MultiHead(Q, K, V) = Concat(head_1, ..., head_h) W_o

where head_i = Attention(Q W_q^i, K W_k^i, V W_v^i)
```

**Mathematical Properties**:

1. **Permutation Invariance**: Attention scores are independent of neighbor ordering
2. **Soft Selection**: A differentiable alternative to hard neighbor selection
3. **Context Awareness**: Each head can focus on a different aspect of the neighborhood

### 2.3 Numerical Stability

```rust
// Softmax with numerical stability
let max_score = scores.iter().copied().fold(f32::NEG_INFINITY, f32::max);
let exp_scores: Vec<f32> = scores.iter()
    .map(|&s| (s - max_score).exp())
    .collect();
let sum_exp: f32 = exp_scores.iter().sum::<f32>().max(1e-10);
```

**Key Features**:

- Prevents overflow via max subtraction
- Guards against division by zero with an epsilon floor
- Maintains gradient flow through the exp operations

---

## 3. Gated Recurrent Unit (GRU) Integration

### 3.1 GRU Cell Mathematics

**File**: `crates/ruvector-gnn/src/layer.rs:207-305`

```
z_t = σ(W_z x_t + U_z h_{t-1})                [Update Gate]
r_t = σ(W_r x_t + U_r h_{t-1})                [Reset Gate]
h̃_t = tanh(W_h x_t + U_h (r_t ⊙ h_{t-1}))    [Candidate State]
h_t = (1 - z_t) ⊙ h_{t-1} + z_t ⊙ h̃_t        [Final State]
```

### 3.2 Why GRU for Graph Updates?

1. **Memory of Previous State**: Maintains information from earlier layers
2. **Selective Updates**: The update gate z_t controls how much to change
3. **Reset Mechanism**: The reset gate r_t decides the relevance of the previous state
4. **Gradient Flow**: Mitigates vanishing gradients in deep GNNs
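To make the gate equations concrete, they can be reduced to the scalar case, where each weight matrix collapses to a single `f32`. This is a toy illustration only; the crate's `GRUCell` operates on full weight matrices:

```rust
fn sigmoid(x: f32) -> f32 {
    1.0 / (1.0 + (-x).exp())
}

/// Scalar GRU cell: every weight matrix collapsed to a single f32.
struct ScalarGru {
    w_z: f32, u_z: f32, // update gate weights
    w_r: f32, u_r: f32, // reset gate weights
    w_h: f32, u_h: f32, // candidate state weights
}

impl ScalarGru {
    /// One GRU step: x is the aggregated message, h_prev the previous state.
    fn step(&self, x: f32, h_prev: f32) -> f32 {
        let z = sigmoid(self.w_z * x + self.u_z * h_prev); // update gate
        let r = sigmoid(self.w_r * x + self.u_r * h_prev); // reset gate
        let h_cand = (self.w_h * x + self.u_h * (r * h_prev)).tanh(); // candidate
        (1.0 - z) * h_prev + z * h_cand // convex blend: the implicit residual path
    }
}

fn main() {
    // With the update gate forced shut (z ≈ 0), the state barely moves:
    // the (1 - z) term acts as a residual connection.
    let keep = ScalarGru { w_z: -20.0, u_z: 0.0, w_r: 0.0, u_r: 0.0, w_h: 1.0, u_h: 0.0 };
    println!("{}", keep.step(1.0, 0.5)); // ≈ 0.5
}
```

Flipping `w_z` to +20.0 drives z ≈ 1, and the output becomes essentially the candidate state, showing how the gate interpolates between keeping and overwriting the node representation.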
**Connection to Graph Learning**:

- `h_{t-1}`: Node's current representation (before aggregation)
- `x_t`: Aggregated neighborhood information
- `h_t`: Updated node representation (after message passing)

---

## 4. Differentiable Search Mechanism

### 4.1 Soft Attention Over Candidates

**File**: `crates/ruvector-gnn/src/search.rs:38-86`

```
Given: query q, candidates C = {c_1, ..., c_n}

1. Compute Similarities:
   s_i = cosine_similarity(q, c_i)

2. Temperature-Scaled Softmax:
   w_i = exp(s_i / τ) / Σ_j exp(s_j / τ)

3. Soft Top-K Selection:
   indices = argsort(w)[:k]
   weights = {w_i | i ∈ indices}
```

**Temperature Parameter τ**:

- **τ → 0**: Sharp selection (approximates hard argmax)
- **τ → ∞**: Uniform distribution (all candidates equal)
- **τ = 0.07–1.0**: Typical range balancing discrimination and smoothness

### 4.2 Hierarchical Forward Pass

**File**: `crates/ruvector-gnn/src/search.rs:88-154`

The query is processed through the HNSW layers sequentially:

```
Input: query q, layer embeddings L = {L_0, ..., L_d}, GNN layers G

h_0 = q
for layer l = 0 to d:
    1. Find top-k nodes:  indices, weights = DifferentiableSearch(h_l, L_l)
    2. Aggregate:         agg = Σ_i weights[i] · L_l[indices[i]]
    3. Combine:           combined = (h_l + agg) / 2
    4. Transform:         h_{l+1} = G_l(combined, neighbors, edge_weights)

Output: h_d
```

**Gradient Flow Through the Hierarchy**:

- Softmax keeps the selection differentiable
- Enables end-to-end training of the search process
- Backpropagation flows through the entire HNSW traversal

---
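Steps 2-3 of the soft selection can be sketched as follows. These are hypothetical free functions that assume the similarity scores are already computed; the real implementation lives in `search.rs`:

```rust
/// Temperature-scaled softmax with the max-subtraction trick and an
/// epsilon floor, matching the stability guards described earlier.
fn softmax_temp(scores: &[f32], tau: f32) -> Vec<f32> {
    let max = scores.iter().copied().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = scores.iter().map(|&s| ((s - max) / tau).exp()).collect();
    let sum = exps.iter().sum::<f32>().max(1e-10);
    exps.into_iter().map(|e| e / sum).collect()
}

/// Soft top-k: indices of the k largest weights together with their
/// softmax mass (the "weights" returned by the differentiable search).
fn soft_top_k(scores: &[f32], tau: f32, k: usize) -> Vec<(usize, f32)> {
    let w = softmax_temp(scores, tau);
    let mut idx: Vec<usize> = (0..w.len()).collect();
    idx.sort_by(|&a, &b| w[b].partial_cmp(&w[a]).unwrap());
    idx.into_iter().take(k).map(|i| (i, w[i])).collect()
}

fn main() {
    let sims = [0.9, 0.7, 0.1];
    // Low temperature → nearly all mass on the best candidate.
    println!("{:?}", soft_top_k(&sims, 0.05, 2));
    // High temperature → nearly uniform mass over all candidates.
    println!("{:?}", soft_top_k(&sims, 100.0, 2));
}
```

Running both temperature extremes makes the τ → 0 (argmax-like) and τ → ∞ (uniform) limits above directly visible.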
## 5. Data Flow Architecture

### 5.1 Forward Pass Diagram

```
Input Node Embedding (h_v)
          |
          v
  [W_msg Transform] ─────────────────────┐
          |                              |
          v                              |
     Message (m_v)                       |
          |                              |
          v                              |
 ┌─────────────────┐                     |
 │   Multi-Head    │                     |
 │   Attention     │ ← Neighbors         |
 └─────────────────┘   (transformed)     |
          |                              |
          v                              |
   Attention Output                      |
          |                              |
          v                              |
  [+ Weighted Agg] ← Edge Weights        |
          |                              |
          v                              |
  [W_agg Transform]                      |
          |                              |
          v                              |
  Aggregated Message                     |
          |                              |
          v                              |
 ┌─────────────────┐                     |
 │    GRU Cell     │ ←─ Previous State (m_v)
 └─────────────────┘
          |
          v
     Updated State
          |
          v
      [Dropout]
          |
          v
     [LayerNorm]
          |
          v
   Output Embedding
```

### 5.2 Information Bottlenecks

**Potential Bottlenecks**:

1. **Linear Transformations**: Fixed capacity of W_msg and W_agg
2. **Attention Heads**: Limited parallelism (typically 2-8 heads)
3. **GRU Hidden State**: Fixed dimensionality
4. **Dropout**: Information loss during training

**Mitigation Strategies**:

- Residual connections via the GRU gates
- Layer normalization prevents gradient explosion
- Xavier initialization maintains variance through layers

---

## 6. Comparison with Standard GNN Architectures

| Feature | RuVector | GCN | GAT | GraphSAGE |
|---------|----------|-----|-----|-----------|
| Aggregation | Attention + weighted | Mean | Attention | Mean/Max/LSTM |
| Update | GRU | Linear | Linear | Linear |
| Normalization | LayerNorm | None/BatchNorm | None | None |
| Topology | HNSW | General | General | General |
| Differentiable search | Yes | No | No | No |
| Multi-head | Yes | No | Yes | No |
| Gated updates | Yes (GRU) | No | No | No |

**RuVector Advantages**:

1. **Temporal Dynamics**: GRU captures the evolution of node states
2. **Hierarchical Processing**: HNSW structure enables efficient search
3. **Dual Aggregation**: Combines attention and edge-weighted aggregation
4. **Stable Training**: LayerNorm + Xavier initialization + numerical guards

---
## 7. Computational Complexity

### 7.1 Per-Layer Complexity

For a node with degree d, hidden dimension h, and k attention heads:

| Operation | Complexity | Notes |
|-----------|------------|-------|
| Message transform | O(h²) | Linear layer |
| Multi-head attention | O(k·d·h²/k) = O(d·h²) | k heads, each of dimension h/k |
| Weighted aggregation | O(d·h) | Sum over neighbors |
| GRU update | O(h²) | 6 linear transformations |
| Layer norm | O(h) | Mean + variance |
| **Total** | **O(d·h² + h²)** | Dominated by attention |

### 7.2 Hierarchical Search Complexity

```
For HNSW with L layers and M neighbors per node:
- Greedy search:         O(L · M · log N)
- Differentiable search: O(L · k · h)   where k = top-k candidates per layer
```

---

## 8. Training Considerations

### 8.1 Contrastive Loss Functions

**File**: `crates/ruvector-gnn/src/training.rs:330-462`

**InfoNCE Loss**:

```
L_InfoNCE = -log( exp(sim(q, p⁺) / τ) / Σ_{p∈P} exp(sim(q, p) / τ) )

where:
- q:  anchor (query node)
- p⁺: positive sample (neighbor)
- P:  all samples (positives + negatives)
- τ:  temperature parameter
```

**Local Contrastive Loss**: Encourages node embeddings to be similar to graph neighbors and dissimilar to non-neighbors.

### 8.2 Elastic Weight Consolidation (EWC)

**File**: `crates/ruvector-gnn/src/ewc.rs`

Prevents catastrophic forgetting in continual learning:

```
L_total = L_task + (λ/2) Σ_i F_i (θ_i - θ*_i)²

where:
- L_task: current task loss
- F_i:    Fisher information (importance of parameter i)
- θ_i:    current parameter
- θ*_i:   anchor parameter from the previous task
- λ:      regularization strength (10-10000)
```

**Fisher Information Approximation**:

```
F_i ≈ (1/N) Σ_{n=1}^N (∂L/∂θ_i)²
```

---

## 9. Key Insights for Latent Space Design

### 9.1 Embedding Geometry

**Current Architecture Assumptions**:

1. **Euclidean Latent Space**: All operations assume flat geometry
2. **Cosine Similarity**: Angular distance metric in search
3. **Linear Projections**: Affine transformations preserve convexity
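The angular metric in assumption 2 is what the differentiable search scores candidates with; a minimal sketch (a hypothetical free function, not the crate's API):

```rust
/// Cosine similarity: the angular metric assumed by the search layer.
/// Returns 0.0 when either vector has (near-)zero norm.
fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(&x, &y)| x * y).sum();
    let na: f32 = a.iter().map(|&x| x * x).sum::<f32>().sqrt();
    let nb: f32 = b.iter().map(|&y| y * y).sum::<f32>().sqrt();
    if na < 1e-10 || nb < 1e-10 {
        return 0.0; // degenerate input: avoid division by zero
    }
    dot / (na * nb)
}

fn main() {
    // Only the angle matters, not the magnitude: parallel vectors score 1,
    // orthogonal vectors score 0.
    println!("{}", cosine_similarity(&[2.0, 0.0], &[5.0, 0.0])); // 1
    println!("{}", cosine_similarity(&[1.0, 0.0], &[0.0, 3.0])); // 0
}
```

The magnitude invariance is exactly why this metric ignores radial structure, which is part of the Euclidean-geometry limitation discussed below.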
**Implications**:

- Tree-like graphs are poorly represented in Euclidean space
- The hierarchical HNSW structure hints at potential benefits from hyperbolic geometry
- The attention mechanism can partially compensate for metric mismatch

### 9.2 Information Flow Bottlenecks

**Critical Points**:

1. **Attention Softmax**: Hard selection at inference (argmax)
2. **GRU Gates**: Sigmoid saturation can block gradients
3. **Fixed Dimensions**: h_dim bottleneck between layers

**Potential Improvements**:

- Adaptive dimensionality per layer
- Sparse attention patterns
- Mixture of experts for different graph patterns

---

## 10. Connection to HNSW Topology

### 10.1 HNSW Structure

Hierarchical layers:

```
Layer 2: [sparse, long-range connections]
Layer 1: [medium density]
Layer 0: [dense, local connections]
```

### 10.2 GNN-HNSW Synergy

**Advantages**:

1. **Coarse-to-Fine**: Higher layers capture global structure, lower layers local structure
2. **Skip Connections**: Hierarchical search jumps across the graph
3. **Differentiable**: Soft attention enables gradient-based optimization

**Challenges**:

1. **Layer Mismatch**: HNSW layers ≠ GNN layers
2. **Probabilistic Construction**: HNSW randomness vs. learned embeddings
3. **Online Updates**: Adding nodes requires GNN re-evaluation

---

## 11. Strengths and Limitations

### 11.1 Strengths

1. **Numerically Stable**: Extensive guards against overflow/underflow
2. **Flexible**: Handles variable-degree nodes and empty neighborhoods
3. **Rich Interactions**: Dual aggregation (attention + weighted)
4. **Recurrent Memory**: GRU maintains long-term dependencies
5. **End-to-End Differentiable**: Full gradient flow through search

### 11.2 Limitations

1. **Computational Cost**: O(d·h²) per node per layer
2. **Fixed Architecture**: Uniform layers, no adaptive depth
3. **Euclidean Bias**: May not suit hierarchical graphs
4. **Limited Expressiveness**: Single attention type (dot-product)
5. **No Edge Features**: Only uses edge weights, not attributes

---

## 12. Research Opportunities

### 12.1 Short-Term Enhancements

1. **Edge Features**: Extend attention to incorporate edge attributes
2. **Adaptive Heads**: Learn the number of attention heads per layer
3. **Sparse Attention**: Local + global attention patterns
4. **Layer Skip Connections**: Direct paths from input to output

### 12.2 Long-Term Directions

1. **Hyperbolic GNN**: Replace Euclidean operations with the Poincaré ball model
2. **Graph Transformers**: Replace message passing with full attention
3. **Neural ODEs**: Continuous-depth GNNs via differential equations
4. **Equivariant Networks**: SE(3) or E(n) equivariance for geometric graphs

---

## References

### Internal Code References

- `/crates/ruvector-gnn/src/layer.rs` - Core GNN layers
- `/crates/ruvector-gnn/src/search.rs` - Differentiable search
- `/crates/ruvector-gnn/src/training.rs` - Loss functions and optimizers
- `/crates/ruvector-gnn/src/ewc.rs` - Continual learning
- `/crates/ruvector-graph/src/hybrid/graph_neural.rs` - GNN engine interface

### Key Papers

- Kipf & Welling (2017) - Graph Convolutional Networks
- Veličković et al. (2018) - Graph Attention Networks
- Chung et al. (2014) - Gated Recurrent Units
- Vaswani et al. (2017) - Attention Is All You Need (Transformers)
- Malkov & Yashunin (2018) - HNSW for Approximate Nearest Neighbor Search

---

**Document Version**: 1.0
**Last Updated**: 2025-11-30
**Author**: RuVector Research Team