Merge commit 'd803bfe2b1fe7f5e219e50ac20d6801a0a58ac75' as 'vendor/ruvector'

This commit is contained in:
ruv
2026-02-28 14:39:40 -05:00
7854 changed files with 3522914 additions and 0 deletions

View File

@@ -0,0 +1,461 @@
# GNN Architecture Analysis: RuVector Implementation
## Executive Summary
RuVector implements a sophisticated Graph Neural Network architecture that operates on HNSW (Hierarchical Navigable Small World) graph topology. The architecture combines message passing, multi-head attention, gated recurrent updates, and differentiable search mechanisms to create a powerful framework for learning on graph-structured data.
**Key Components**: Linear transformations, Multi-head Attention, GRU cells, Layer Normalization, Hierarchical Search
**Code Location**: `crates/ruvector-gnn/src/layer.rs`, `crates/ruvector-gnn/src/search.rs`
---
## 1. Core Architecture: RuvectorLayer
### 1.1 Mathematical Formulation
The RuvectorLayer implements a message passing neural network with the following forward pass:
```
Given: node embedding h_v, neighbor embeddings {h_u}_u∈N(v), edge weights {e_uv}_u∈N(v)
1. Message Transformation:
m_v = W_msg · h_v
m_u = W_msg · h_u for u ∈ N(v)
2. Multi-Head Attention:
a_v = MultiHeadAttention(m_v, {m_u}, {m_u})
3. Weighted Aggregation:
agg_v = Σ_u (e_uv / Σ_u' e_u'v) · m_u
4. Combination:
combined = a_v + agg_v
transformed = W_agg · combined
5. GRU Update:
h'_v = GRU(transformed, m_v)
6. Normalization & Regularization:
output = LayerNorm(Dropout(h'_v))
```
### 1.2 Implementation Details
**File**: `crates/ruvector-gnn/src/layer.rs:307-440`
```rust
pub struct RuvectorLayer {
w_msg: Linear, // Message weight matrix
w_agg: Linear, // Aggregation weight matrix
w_update: GRUCell, // GRU update cell
attention: MultiHeadAttention,
norm: LayerNorm,
dropout: f32,
}
```
**Design Choices**:
- **Xavier Initialization**: Weights initialized as N(0, √(2/(d_in + d_out)))
- **Numerical Stability**: Softmax uses max subtraction trick
- **Residual Connections**: Implicit through GRU's (1-z) term
- **Flexibility**: Handles empty neighbor sets gracefully
---
## 2. Multi-Head Attention Mechanism
### 2.1 Scaled Dot-Product Attention
**File**: `crates/ruvector-gnn/src/layer.rs:84-205`
The attention mechanism follows the Transformer architecture:
```
Attention(Q, K, V) = softmax(QK^T / √d_k) V
where:
- Q = W_q · h_v (query from target node)
- K = W_k · h_u (keys from neighbors)
- V = W_v · h_u (values from neighbors)
- d_k = hidden_dim / num_heads
```
### 2.2 Multi-Head Decomposition
```
MultiHead(Q, K, V) = Concat(head_1, ..., head_h) W_o
where head_i = Attention(Q W_q^i, K W_k^i, V W_v^i)
```
**Mathematical Properties**:
1. **Permutation Invariance**: Attention scores independent of neighbor ordering
2. **Soft Selection**: Differentiable alternative to hard neighbor selection
3. **Context Aware**: Each head can focus on different aspects of neighborhood
### 2.3 Numerical Stability
```rust
// Softmax with numerical stability
let max_score = scores.iter().copied().fold(f32::NEG_INFINITY, f32::max);
let exp_scores: Vec<f32> = scores.iter()
.map(|&s| (s - max_score).exp())
.collect();
let sum_exp: f32 = exp_scores.iter().sum::<f32>().max(1e-10);
```
**Key Features**:
- Prevents overflow with max subtraction
- Guards against division by zero with epsilon
- Maintains gradient flow through exp operations
---
## 3. Gated Recurrent Unit (GRU) Integration
### 3.1 GRU Cell Mathematics
**File**: `crates/ruvector-gnn/src/layer.rs:207-305`
```
z_t = σ(W_z x_t + U_z h_{t-1}) [Update Gate]
r_t = σ(W_r x_t + U_r h_{t-1}) [Reset Gate]
h̃_t = tanh(W_h x_t + U_h (r_t ⊙ h_{t-1})) [Candidate State]
h_t = (1 - z_t) ⊙ h_{t-1} + z_t ⊙ h̃_t [Final State]
```
### 3.2 Why GRU for Graph Updates?
1. **Memory of Previous State**: Maintains information from earlier layers
2. **Selective Updates**: Update gate z_t controls how much to change
3. **Reset Mechanism**: Reset gate r_t decides relevance of previous state
4. **Gradient Flow**: Mitigates vanishing gradients in deep GNNs
**Connection to Graph Learning**:
- `h_{t-1}`: Node's current representation (before aggregation)
- `x_t`: Aggregated neighborhood information
- `h_t`: Updated node representation (after message passing)
---
## 4. Differentiable Search Mechanism
### 4.1 Soft Attention Over Candidates
**File**: `crates/ruvector-gnn/src/search.rs:38-86`
```
Given: query q, candidates C = {c_1, ..., c_n}
1. Compute Similarities:
s_i = cosine_similarity(q, c_i)
2. Temperature-Scaled Softmax:
w_i = exp(s_i / τ) / Σ_j exp(s_j / τ)
3. Soft Top-K Selection:
indices = argsort(w)[:k]
weights = {w_i | i ∈ indices}
```
**Temperature Parameter τ**:
- **τ → 0**: Sharp selection (approximates hard argmax)
- **τ → ∞**: Uniform distribution (all candidates equal)
- **τ = 0.07-1.0**: Typical range balancing discrimination and smoothness
### 4.2 Hierarchical Forward Pass
**File**: `crates/ruvector-gnn/src/search.rs:88-154`
Processes query through HNSW layers sequentially:
```
Input: query q, layer_embeddings L = {L_0, ..., L_d}, gnn_layers G
h_0 = q
for layer l = 0 to d:
1. Find top-k nodes: indices, weights = DifferentiableSearch(h_l, L_l)
2. Aggregate: agg = Σ_i weights[i] · L_l[indices[i]]
3. Combine: combined = (h_l + agg) / 2
4. Transform: h_{l+1} = G_l(combined, neighbors, edge_weights)
Output: h_d
```
**Gradient Flow Through Hierarchy**:
- Softmax ensures differentiability
- Enables end-to-end training of search process
- Backpropagation through entire HNSW traversal
---
## 5. Data Flow Architecture
### 5.1 Forward Pass Diagram
```
Input Node Embedding (h_v)
|
v
[W_msg Transform] ──────────────┐
| |
v |
Message (m_v) |
| |
v |
┌─────────────────┐ |
│ Multi-Head │ |
│ Attention │ ← Neighbors (transformed)
└─────────────────┘ |
| |
v |
Attention Output |
| |
v |
[+ Weighted Agg] ← Edge Weights |
| |
v |
[W_agg Transform] |
| |
v |
Aggregated Message |
| |
v |
┌─────────────────┐ |
│ GRU Cell │ ← Previous State (m_v)
└─────────────────┘
|
v
Updated State
|
v
[Dropout]
|
v
[LayerNorm]
|
v
Output Embedding
```
### 5.2 Information Bottlenecks
**Potential Bottlenecks**:
1. **Linear Transformations**: Fixed capacity W_msg, W_agg
2. **Attention Heads**: Limited parallelism (typically 2-8 heads)
3. **GRU Hidden State**: Fixed dimensionality
4. **Dropout**: Information loss during training
**Mitigation Strategies**:
- Residual connections via GRU gates
- Layer normalization prevents gradient explosion
- Xavier init maintains variance through layers
---
## 6. Comparison with Standard GNN Architectures
| Feature | RuVector | GCN | GAT | GraphSAGE |
|---------|----------|-----|-----|-----------|
| Aggregation | Attention + Weighted | Mean | Attention | Mean/Max/LSTM |
| Update | GRU | Linear | Linear | Linear |
| Normalization | LayerNorm | None/BatchNorm | None | None |
| Topology | HNSW | General | General | General |
| Differentiable Search | Yes | No | No | No |
| Multi-Head | Yes | No | Yes | No |
| Gated Updates | Yes (GRU) | No | No | No |
**RuVector Advantages**:
1. **Temporal Dynamics**: GRU captures evolution of node states
2. **Hierarchical Processing**: HNSW structure for efficient search
3. **Dual Aggregation**: Combines attention and edge-weighted aggregation
4. **Stable Training**: LayerNorm + Xavier init + numerical guards
---
## 7. Computational Complexity
### 7.1 Per-Layer Complexity
For a node with degree d, hidden dimension h, and k attention heads:
| Operation | Complexity | Notes |
|-----------|------------|-------|
| Message Transform | O(h²) | Linear layer |
| Multi-Head Attention | O(k·d·h²/k) = O(d·h²) | k heads, each h/k dim |
| Weighted Aggregation | O(d·h) | Sum over neighbors |
| GRU Update | O(h²) | 6 linear transformations |
| Layer Norm | O(h) | Mean + variance |
| **Total** | **O(d·h² + h²)** | Dominated by attention |
### 7.2 Hierarchical Search Complexity
```
For HNSW with L layers, M neighbors per node:
- Greedy search: O(L · M · log N)
- Differentiable search: O(L · k · h)
where k = top-k candidates per layer
```
---
## 8. Training Considerations
### 8.1 Contrastive Loss Functions
**File**: `crates/ruvector-gnn/src/training.rs:330-462`
**InfoNCE Loss**:
```
L_InfoNCE = -log(exp(sim(q, p⁺) / τ) / Σ_{p∈P} exp(sim(q, p) / τ))
where:
- q: anchor (query node)
- p⁺: positive sample (neighbor)
- P: all samples (positives + negatives)
- τ: temperature parameter
```
**Local Contrastive Loss**:
```
Encourages node embeddings to be similar to graph neighbors
and dissimilar to non-neighbors
```
### 8.2 Elastic Weight Consolidation (EWC)
**File**: `crates/ruvector-gnn/src/ewc.rs`
Prevents catastrophic forgetting in continual learning:
```
L_total = L_task + (λ/2) Σ_i F_i (θ_i - θ*_i)²
where:
- L_task: Current task loss
- F_i: Fisher information (importance of parameter i)
- θ_i: Current parameter
- θ*_i: Anchor parameter from previous task
- λ: Regularization strength (10-10000)
```
**Fisher Information Approximation**:
```rust
F_i (1/N) Σ_{n=1}^N (L/θ_i)²
```
---
## 9. Key Insights for Latent Space Design
### 9.1 Embedding Geometry
**Current Architecture Assumptions**:
1. **Euclidean Latent Space**: All operations assume flat geometry
2. **Cosine Similarity**: Angular distance metric in search
3. **Linear Projections**: Affine transformations preserve convexity
**Implications**:
- Tree-like graphs poorly represented in Euclidean space
- Hierarchical HNSW structure hints at hyperbolic geometry benefits
- Attention mechanism can partially compensate for metric mismatch
### 9.2 Information Flow Bottlenecks
**Critical Points**:
1. **Attention Softmax**: Hard selection at inference (argmax)
2. **GRU Gates**: Sigmoid saturation can block gradients
3. **Fixed Dimensions**: h_dim bottleneck between layers
**Potential Improvements**:
- Adaptive dimensionality per layer
- Sparse attention patterns
- Mixture of experts for different graph patterns
---
## 10. Connection to HNSW Topology
### 10.1 HNSW Structure
Hierarchical layers:
```
Layer 2: [sparse, long-range connections]
Layer 1: [medium density]
Layer 0: [dense, local connections]
```
### 10.2 GNN-HNSW Synergy
**Advantages**:
1. **Coarse-to-Fine**: Higher layers = global structure, lower = local
2. **Skip Connections**: Hierarchical search jumps across graph
3. **Differentiable**: Soft attention enables gradient-based optimization
**Challenges**:
1. **Layer Mismatch**: HNSW layers ≠ GNN layers
2. **Probabilistic Construction**: HNSW randomness vs. learned embeddings
3. **Online Updates**: Adding nodes requires GNN re-evaluation
---
## 11. Strengths and Limitations
### 11.1 Strengths
1. **Numerically Stable**: Extensive guards against overflow/underflow
2. **Flexible**: Handles variable-degree nodes and empty neighborhoods
3. **Rich Interactions**: Dual aggregation (attention + weighted)
4. **Recurrent Memory**: GRU maintains long-term dependencies
5. **End-to-End Differentiable**: Full gradient flow through search
### 11.2 Limitations
1. **Computational Cost**: O(d·h²) per node per layer
2. **Fixed Architecture**: Uniform layers, no adaptive depth
3. **Euclidean Bias**: May not suit hierarchical graphs
4. **Limited Expressiveness**: Single attention type (dot-product)
5. **No Edge Features**: Only uses edge weights, not attributes
---
## 12. Research Opportunities
### 12.1 Short-Term Enhancements
1. **Edge Features**: Extend attention to incorporate edge attributes
2. **Adaptive Heads**: Learn number of attention heads per layer
3. **Sparse Attention**: Local + global attention patterns
4. **Layer Skip Connections**: Direct paths from input to output
### 12.2 Long-Term Directions
1. **Hyperbolic GNN**: Replace Euclidean operations with Poincaré ball
2. **Graph Transformers**: Replace message passing with full attention
3. **Neural ODEs**: Continuous-depth GNN with differential equations
4. **Equivariant Networks**: SE(3) or E(n) equivariance for geometric graphs
---
## References
### Internal Code References
- `/crates/ruvector-gnn/src/layer.rs` - Core GNN layers
- `/crates/ruvector-gnn/src/search.rs` - Differentiable search
- `/crates/ruvector-gnn/src/training.rs` - Loss functions and optimizers
- `/crates/ruvector-gnn/src/ewc.rs` - Continual learning
- `/crates/ruvector-graph/src/hybrid/graph_neural.rs` - GNN engine interface
### Key Papers
- Kipf & Welling (2017) - Graph Convolutional Networks
- Veličković et al. (2018) - Graph Attention Networks
- Chung et al. (2014) - Gated Recurrent Units
- Vaswani et al. (2017) - Attention Is All You Need (Transformers)
- Malkov & Yashunin (2018) - HNSW for ANN search
---
**Document Version**: 1.0
**Last Updated**: 2025-11-30
**Author**: RuVector Research Team