Merge commit 'd803bfe2b1fe7f5e219e50ac20d6801a0a58ac75' as 'vendor/ruvector'

2026-02-28 14:39:40 -05:00
parent 7885bf6278 d803bfe2b1
commit cd5943df23
7854 changed files with 3522914 additions and 0 deletions
--- a/vendor/ruvector/docs/research/latent-space/gnn-architecture-analysis.md
+++ b/vendor/ruvector/docs/research/latent-space/gnn-architecture-analysis.md
@@ -0,0 +1,461 @@
+# GNN Architecture Analysis: RuVector Implementation
+
+## Executive Summary
+
+RuVector implements a Graph Neural Network (GNN) layer that operates on the HNSW (Hierarchical Navigable Small World) graph topology. Each layer combines message passing, multi-head attention, a gated recurrent (GRU) update, and a differentiable search step, so that the search path can be trained end to end on graph-structured data.
+
+**Key Components**: Linear transformations, Multi-head Attention, GRU cells, Layer Normalization, Hierarchical Search
+
+**Code Location**: `crates/ruvector-gnn/src/layer.rs`, `crates/ruvector-gnn/src/search.rs`
+
+---
+
+## 1. Core Architecture: RuvectorLayer
+
+### 1.1 Mathematical Formulation
+
+The RuvectorLayer implements a message passing neural network with the following forward pass:
+
+```
+Given: node embedding h_v, neighbor embeddings {h_u}_u∈N(v), edge weights {e_uv}_u∈N(v)
+
+1. Message Transformation:
+   m_v = W_msg · h_v
+   m_u = W_msg · h_u  for u ∈ N(v)
+
+2. Multi-Head Attention:
+   a_v = MultiHeadAttention(m_v, {m_u}, {m_u})
+
+3. Weighted Aggregation:
+   agg_v = Σ_u (e_uv / Σ_u' e_u'v) · m_u
+
+4. Combination:
+   combined = a_v + agg_v
+   transformed = W_agg · combined
+
+5. GRU Update:
+   h'_v = GRU(transformed, m_v)
+
+6. Normalization & Regularization:
+   output = LayerNorm(Dropout(h'_v))
+```
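Step 3 is the only step above that consumes the edge weights directly. The following is a minimal sketch of that normalized weighted sum, not the code in `layer.rs`; the function name `weighted_aggregate`, its signature, and the zero-vector fallback for empty neighborhoods are illustrative assumptions.

```rust
/// Normalized edge-weighted sum of neighbor messages (step 3 of the forward pass).
/// Hypothetical helper: returns a zero vector when there are no neighbors
/// or when the edge weights sum to (almost) zero.
fn weighted_aggregate(messages: &[Vec<f32>], edge_weights: &[f32], dim: usize) -> Vec<f32> {
    let mut agg = vec![0.0_f32; dim];
    let total: f32 = edge_weights.iter().sum();
    if messages.is_empty() || total.abs() < 1e-10 {
        return agg;
    }
    for (m, &e) in messages.iter().zip(edge_weights) {
        let w = e / total; // e_uv / Σ_u' e_u'v
        for (a, &x) in agg.iter_mut().zip(m) {
            *a += w * x;
        }
    }
    agg
}
```

With two neighbor messages `[1, 0]` and `[0, 1]` and edge weights `1` and `3`, the normalized weights are `0.25` and `0.75`, so the aggregate is `[0.25, 0.75]`.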
+
+### 1.2 Implementation Details
+
+**File**: `crates/ruvector-gnn/src/layer.rs:307-440`
+
+```rust
+pub struct RuvectorLayer {
+    w_msg: Linear,              // Message weight matrix
+    w_agg: Linear,              // Aggregation weight matrix
+    w_update: GRUCell,          // GRU update cell
+    attention: MultiHeadAttention,
+    norm: LayerNorm,
+    dropout: f32,
+}
+```
+
+**Design Choices**:
+- **Xavier Initialization**: Weights drawn from N(0, σ²) with standard deviation σ = √(2/(d_in + d_out))
+- **Numerical Stability**: Softmax uses max subtraction trick
+- **Residual Connections**: Implicit through GRU's (1-z) term
+- **Flexibility**: Handles empty neighbor sets gracefully
+
+---
+
+## 2. Multi-Head Attention Mechanism
+
+### 2.1 Scaled Dot-Product Attention
+
+**File**: `crates/ruvector-gnn/src/layer.rs:84-205`
+
+The attention mechanism follows the Transformer architecture:
+
+```
+Attention(Q, K, V) = softmax(QK^T / √d_k) V
+
+where h_v and h_u are the attention inputs (the transformed messages m_v, m_u of Section 1.1):
+- Q = W_q · h_v (query from target node)
+- K = W_k · h_u (keys from neighbors)
+- V = W_v · h_u (values from neighbors)
+- d_k = hidden_dim / num_heads
+```
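To make the formula concrete, here is a hedged single-head sketch for one target node and its neighbors. It assumes the query, keys, and values have already been projected and share one dimension; `dot` and `attention` are hypothetical names and this is not the `MultiHeadAttention` code in `layer.rs`.

```rust
fn dot(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

/// Single-head scaled dot-product attention for one query over its neighbors.
/// `keys` and `values` hold one row per neighbor (assumed already projected).
fn attention(query: &[f32], keys: &[Vec<f32>], values: &[Vec<f32>]) -> Vec<f32> {
    let d_k = query.len() as f32;
    let mut out = vec![0.0_f32; values.first().map_or(0, |v| v.len())];
    if keys.is_empty() {
        return out; // no neighbors: nothing to attend to
    }
    // Scores q·k / sqrt(d_k), one per neighbor.
    let scores: Vec<f32> = keys.iter().map(|k| dot(query, k) / d_k.sqrt()).collect();
    // Softmax with max subtraction.
    let max = scores.iter().copied().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = scores.iter().map(|s| (s - max).exp()).collect();
    let sum: f32 = exps.iter().sum::<f32>().max(1e-10);
    // Weighted sum of the value rows.
    for (e, v) in exps.iter().zip(values) {
        let w = e / sum;
        for (o, x) in out.iter_mut().zip(v) {
            *o += w * x;
        }
    }
    out
}
```

Because the output is a softmax-weighted sum, reordering the neighbors (keys and values together) leaves the result unchanged up to floating-point rounding.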
+
+### 2.2 Multi-Head Decomposition
+
+```
+MultiHead(Q, K, V) = Concat(head_1, ..., head_h) W_o
+
+where head_i = Attention(Q W_q^i, K W_k^i, V W_v^i)
+```
+
+**Mathematical Properties**:
+1. **Permutation Invariance**: The attention output does not depend on neighbor ordering, because the softmax-weighted sum is symmetric in the neighbors
+2. **Soft Selection**: Differentiable alternative to hard neighbor selection
+3. **Context Aware**: Each head can focus on different aspects of neighborhood
+
+### 2.3 Numerical Stability
+
+```rust
+// Softmax with numerical stability
+let max_score = scores.iter().copied().fold(f32::NEG_INFINITY, f32::max);
+let exp_scores: Vec<f32> = scores.iter()
+    .map(|&s| (s - max_score).exp())
+    .collect();
+let sum_exp: f32 = exp_scores.iter().sum::<f32>().max(1e-10);
+```
+
+**Key Features**:
+- Prevents overflow with max subtraction
+- Guards against division by zero with epsilon
+- Maintains gradient flow through exp operations
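The excerpt above stops before the division. As a sketch only, it can be completed into a full function and compared against a naive variant to show what the max subtraction buys; `stable_softmax` and `naive_softmax` are hypothetical names, not functions taken from `layer.rs`.

```rust
/// Numerically stable softmax; an empty input yields an empty output.
fn stable_softmax(scores: &[f32]) -> Vec<f32> {
    let max_score = scores.iter().copied().fold(f32::NEG_INFINITY, f32::max);
    let exp_scores: Vec<f32> = scores.iter().map(|&s| (s - max_score).exp()).collect();
    // Epsilon guard against a zero denominator.
    let sum_exp: f32 = exp_scores.iter().sum::<f32>().max(1e-10);
    exp_scores.iter().map(|&e| e / sum_exp).collect()
}

/// Naive variant, kept only for comparison: exp() overflows for large scores.
fn naive_softmax(scores: &[f32]) -> Vec<f32> {
    let exp_scores: Vec<f32> = scores.iter().map(|&s| s.exp()).collect();
    let sum_exp: f32 = exp_scores.iter().sum();
    exp_scores.iter().map(|&e| e / sum_exp).collect()
}
```

For scores such as `[1000.0, 1001.0]`, `exp` overflows to infinity in `f32`, so the naive version produces NaN, while the stable version returns finite weights that sum to 1.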
+
+---
+
+## 3. Gated Recurrent Unit (GRU) Integration
+
+### 3.1 GRU Cell Mathematics
+
+**File**: `crates/ruvector-gnn/src/layer.rs:207-305`
+
+```
+z_t = σ(W_z x_t + U_z h_{t-1})        [Update Gate]
+r_t = σ(W_r x_t + U_r h_{t-1})        [Reset Gate]
+h̃_t = tanh(W_h x_t + U_h (r_t ⊙ h_{t-1}))  [Candidate State]
+h_t = (1 - z_t) ⊙ h_{t-1} + z_t ⊙ h̃_t     [Final State]
+```
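The four equations translate almost line for line into code. The sketch below is an assumption-laden illustration rather than the `GRUCell` in `layer.rs`: `Matrix`, `GruParams`, `matvec`, and `gru_step` are hypothetical names, matrices are plain nested vectors, and biases are omitted exactly as in the equations.

```rust
type Matrix = Vec<Vec<f32>>;

fn matvec(w: &Matrix, x: &[f32]) -> Vec<f32> {
    w.iter()
        .map(|row| row.iter().zip(x).map(|(a, b)| a * b).sum::<f32>())
        .collect()
}

fn sigmoid(x: f32) -> f32 {
    1.0 / (1.0 + (-x).exp())
}

/// Hypothetical parameter container for the six weight matrices (no biases).
struct GruParams {
    w_z: Matrix, u_z: Matrix,
    w_r: Matrix, u_r: Matrix,
    w_h: Matrix, u_h: Matrix,
}

/// One GRU step: x is the new input, h_prev the previous state.
fn gru_step(p: &GruParams, x: &[f32], h_prev: &[f32]) -> Vec<f32> {
    let add = |a: Vec<f32>, b: Vec<f32>| -> Vec<f32> {
        a.iter().zip(&b).map(|(a_i, b_i)| a_i + b_i).collect()
    };
    // Update gate z and reset gate r.
    let z: Vec<f32> = add(matvec(&p.w_z, x), matvec(&p.u_z, h_prev)).into_iter().map(sigmoid).collect();
    let r: Vec<f32> = add(matvec(&p.w_r, x), matvec(&p.u_r, h_prev)).into_iter().map(sigmoid).collect();
    // Candidate state uses the reset-gated previous state r ⊙ h_prev.
    let gated: Vec<f32> = r.iter().zip(h_prev).map(|(r_i, h_i)| r_i * h_i).collect();
    let cand: Vec<f32> = add(matvec(&p.w_h, x), matvec(&p.u_h, &gated)).into_iter().map(f32::tanh).collect();
    // Final state: (1 - z) ⊙ h_prev + z ⊙ candidate.
    (0..h_prev.len()).map(|i| (1.0 - z[i]) * h_prev[i] + z[i] * cand[i]).collect()
}
```

With all weights set to zero, both gates equal 0.5 and the candidate is 0, so the step simply halves the previous state. This makes the `(1 - z)` carry-over path easy to see in isolation.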
+
+### 3.2 Why GRU for Graph Updates?
+
+1. **Memory of Previous State**: Maintains information from earlier layers
+2. **Selective Updates**: Update gate z_t controls how much to change
+3. **Reset Mechanism**: Reset gate r_t decides relevance of previous state
+4. **Gradient Flow**: Mitigates vanishing gradients in deep GNNs
+
+**Connection to Graph Learning**:
+- `h_{t-1}`: Node's own transformed message `m_v` (its state before aggregation)
+- `x_t`: Aggregated neighborhood information (`transformed` in Section 1.1)
+- `h_t`: Updated node representation (after message passing)
+
+---
+
+## 4. Differentiable Search Mechanism
+
+### 4.1 Soft Attention Over Candidates
+
+**File**: `crates/ruvector-gnn/src/search.rs:38-86`
+
+```
+Given: query q, candidates C = {c_1, ..., c_n}
+
+1. Compute Similarities:
+   s_i = cosine_similarity(q, c_i)
+
+2. Temperature-Scaled Softmax:
+   w_i = exp(s_i / τ) / Σ_j exp(s_j / τ)
+
+3. Soft Top-K Selection:
+   indices = argsort(w, descending)[:k]
+   weights = {w_i | i ∈ indices}
+```
+
+**Temperature Parameter τ**:
+- **τ → 0**: Sharp selection (approximates hard argmax)
+- **τ → ∞**: Uniform distribution (all candidates equal)
+- **τ = 0.07-1.0**: Typical range balancing discrimination and smoothness
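The three steps and the effect of τ can be sketched as follows. This is an illustration under stated assumptions, not the function in `search.rs`: `cosine_similarity` and `soft_top_k` are hypothetical names, and the returned weights are the full-softmax weights of the selected candidates (they are not renormalized over the top k).

```rust
fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let na = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let nb = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    dot / (na * nb).max(1e-10)
}

/// Returns the indices of the k highest-weight candidates and their softmax weights.
fn soft_top_k(
    query: &[f32],
    candidates: &[Vec<f32>],
    k: usize,
    temperature: f32,
) -> (Vec<usize>, Vec<f32>) {
    // Steps 1 and 2: similarities, then temperature-scaled stable softmax.
    let scaled: Vec<f32> = candidates
        .iter()
        .map(|c| cosine_similarity(query, c) / temperature)
        .collect();
    let max = scaled.iter().copied().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = scaled.iter().map(|&s| (s - max).exp()).collect();
    let sum = exps.iter().sum::<f32>().max(1e-10);
    let weights: Vec<f32> = exps.iter().map(|&e| e / sum).collect();
    // Step 3: sort indices by weight, largest first, and keep k of them.
    let mut order: Vec<usize> = (0..weights.len()).collect();
    order.sort_by(|&i, &j| weights[j].total_cmp(&weights[i]));
    order.truncate(k);
    let top_weights = order.iter().map(|&i| weights[i]).collect();
    (order, top_weights)
}
```

A small τ concentrates nearly all weight on the best match, while a very large τ (for example 1000) makes the weights close to uniform, matching the two limits listed above.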
+
+### 4.2 Hierarchical Forward Pass
+
+**File**: `crates/ruvector-gnn/src/search.rs:88-154`
+
+Processes query through HNSW layers sequentially:
+
+```
+Input: query q, layer_embeddings L = {L_0, ..., L_d}, gnn_layers G
+
+h_0 = q
+for layer l = 0 to d:
+    1. Find top-k nodes: indices, weights = DifferentiableSearch(h_l, L_l)
+    2. Aggregate: agg = Σ_i weights[i] · L_l[indices[i]]
+    3. Combine: combined = (h_l + agg) / 2
+    4. Transform: h_{l+1} = G_l(combined, neighbors, edge_weights)
+
+Output: h_{d+1}  (the state produced by the last layer)
+```
+
+**Gradient Flow Through Hierarchy**:
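+- The softmax weights are differentiable in the query and the candidate embeddings
+- The top-k index choice itself is discrete, so gradients reach only the selected candidates, through their weights
+- This still permits end-to-end training by backpropagating through every layer of the HNSW traversal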
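The loop in the pseudocode can be sketched as below. This is a simplified illustration, not the code in `search.rs`: `cosine`, `soft_top_k`, and `hierarchical_forward` are hypothetical names, and the GNN layer G_l is replaced by a caller-supplied closure that receives only the layer index and the combined vector (the neighbor and edge-weight arguments are left out for brevity).

```rust
fn cosine(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let na = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let nb = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    dot / (na * nb).max(1e-10)
}

/// (index, softmax weight) pairs for the k best candidates, best first.
fn soft_top_k(query: &[f32], candidates: &[Vec<f32>], k: usize, tau: f32) -> Vec<(usize, f32)> {
    let scaled: Vec<f32> = candidates.iter().map(|c| cosine(query, c) / tau).collect();
    let max = scaled.iter().copied().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = scaled.iter().map(|&s| (s - max).exp()).collect();
    let sum = exps.iter().sum::<f32>().max(1e-10);
    let mut ranked: Vec<(usize, f32)> = exps.iter().map(|&e| e / sum).enumerate().collect();
    ranked.sort_by(|a, b| b.1.total_cmp(&a.1));
    ranked.truncate(k);
    ranked
}

/// One pass over all layers; `gnn_layer` is a stand-in for G_l.
fn hierarchical_forward<F>(
    query: &[f32],
    layer_embeddings: &[Vec<Vec<f32>>],
    k: usize,
    tau: f32,
    gnn_layer: F,
) -> Vec<f32>
where
    F: Fn(usize, &[f32]) -> Vec<f32>,
{
    let mut h = query.to_vec();
    for (l, nodes) in layer_embeddings.iter().enumerate() {
        // 1. Soft top-k search in this layer.
        let selected = soft_top_k(&h, nodes, k, tau);
        // 2. Weighted aggregate of the selected embeddings.
        let mut agg = vec![0.0_f32; h.len()];
        for (i, w) in selected {
            for (a, x) in agg.iter_mut().zip(&nodes[i]) {
                *a += w * x;
            }
        }
        // 3. Average with the current state, 4. apply the layer transform.
        let combined: Vec<f32> = h.iter().zip(&agg).map(|(a, b)| (a + b) / 2.0).collect();
        h = gnn_layer(l, &combined);
    }
    h
}
```

Passing an identity closure such as `|_l, x: &[f32]| x.to_vec()` isolates the search-and-combine behavior: each layer pulls the running state halfway toward the weighted aggregate of its selected nodes.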
+
+---
+
+## 5. Data Flow Architecture
+
+### 5.1 Forward Pass Diagram
+
+```
+Input Node Embedding (h_v)
+         |
+         v
+    [W_msg Transform] ──────────────┐
+         |                          |
+         v                          |
+    Message (m_v)                   |
+         |                          |
+         v                          |
+    ┌─────────────────┐             |
+    │  Multi-Head     │             |
+    │  Attention      │ ← Neighbors (transformed)
+    └─────────────────┘             |
+         |                          |
+         v                          |
+    Attention Output                |
+         |                          |
+         v                          |
+    [+ Weighted Agg] ← Edge Weights |
+         |                          |
+         v                          |
+    [W_agg Transform]               |
+         |                          |
+         v                          |
+    Aggregated Message              |
+         |                          |
+         v                          |
+    ┌─────────────────┐             |
+    │   GRU Cell      │ ← Previous State (m_v)
+    └─────────────────┘
+         |
+         v
+    Updated State
+         |
+         v
+    [Dropout]
+         |
+         v
+    [LayerNorm]
+         |
+         v
+    Output Embedding
+```
+
+### 5.2 Information Bottlenecks
+
+**Potential Bottlenecks**:
+1. **Linear Transformations**: Fixed capacity W_msg, W_agg
+2. **Attention Heads**: Limited parallelism (typically 2-8 heads)
+3. **GRU Hidden State**: Fixed dimensionality
+4. **Dropout**: Information loss during training
+
+**Mitigation Strategies**:
+- Residual connections via GRU gates
+- Layer normalization keeps the activation scale bounded, which helps against exploding gradients
+- Xavier init keeps activation variance roughly constant across layers at initialization
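As a small illustration of the layer normalization point, the sketch below normalizes one vector to zero mean and unit variance. It is an assumption-based example, not the `LayerNorm` in `layer.rs`: the name `layer_norm` is hypothetical and the learnable gain and bias are omitted.

```rust
/// Layer normalization over a single vector, without learnable gain/bias.
fn layer_norm(x: &[f32], eps: f32) -> Vec<f32> {
    let n = x.len().max(1) as f32;
    let mean = x.iter().sum::<f32>() / n;
    let var = x.iter().map(|v| (v - mean).powi(2)).sum::<f32>() / n;
    x.iter().map(|v| (v - mean) / (var + eps).sqrt()).collect()
}
```

Scaling the input by 100 leaves the output essentially unchanged, which is why the magnitude of activations entering the next layer stays controlled regardless of how large the GRU output becomes.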
+
+---
+
+## 6. Comparison with Standard GNN Architectures
+
+| Feature | RuVector | GCN | GAT | GraphSAGE |
+|---------|----------|-----|-----|-----------|
+| Aggregation | Attention + Weighted | Degree-normalized sum | Attention | Mean/Max/LSTM |
+| Update | GRU | Linear | Linear | Linear |
+| Normalization | LayerNorm | None/BatchNorm | None | None |
+| Topology | HNSW | General | General | General |
+| Differentiable Search | Yes | No | No | No |
+| Multi-Head | Yes | No | Yes | No |
+| Gated Updates | Yes (GRU) | No | No | No |
+
+**RuVector Advantages**:
+1. **Gated State Updates**: GRU controls how node states evolve from layer to layer
+2. **Hierarchical Processing**: HNSW structure for efficient search
+3. **Dual Aggregation**: Combines attention and edge-weighted aggregation
+4. **Stable Training**: LayerNorm + Xavier init + numerical guards
+
+---
+
+## 7. Computational Complexity
+
+### 7.1 Per-Layer Complexity
+
+For a node with degree d, hidden dimension h, and k attention heads:
+
+| Operation | Complexity | Notes |
+|-----------|------------|-------|
+| Message Transform | O(d·h²) | W_msg applied to the node and its d neighbors |
+| Multi-Head Attention | O(k·d·h²/k) = O(d·h²) | k heads, each h/k dim |
+| Weighted Aggregation | O(d·h) | Sum over neighbors |
+| GRU Update | O(h²) | 6 linear transformations |
+| Layer Norm | O(h) | Mean + variance |
+| **Total** | **O(d·h²)** | Dominated by message transform and attention projections |
+
+### 7.2 Hierarchical Search Complexity
+
+```
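+For HNSW with L layers, M neighbors per node, and N indexed vectors:
+- Greedy search: O(log N) hops in expectation, each scoring up to M neighbors
+  at O(h) per distance, i.e. roughly O(M · h · log N)
+- Differentiable search: O(Σ_l n_l · h) to score all n_l candidates in each layer,
+  plus O(L · k · h) to aggregate the top-k candidates per layer
+  (k here is the top-k size, not the head count of Section 7.1)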
+```
+
+---
+
+## 8. Training Considerations
+
+### 8.1 Contrastive Loss Functions
+
+**File**: `crates/ruvector-gnn/src/training.rs:330-462`
+
+**InfoNCE Loss**:
+```
+L_InfoNCE = -log(exp(sim(q, p⁺) / τ) / Σ_{p∈P} exp(sim(q, p) / τ))
+
+where:
+- q: anchor (query node)
+- p⁺: positive sample (neighbor)
+- P: all samples (positives + negatives)
+- τ: temperature parameter
+```
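For a single anchor, the loss reduces to a log-sum-exp minus the positive logit. The sketch below assumes the similarities have already been computed; `info_nce` is a hypothetical name and this is not the function in `training.rs`.

```rust
/// InfoNCE for one anchor, given sim(q, p+) and the similarities to the negatives.
fn info_nce(sim_positive: f32, sim_negatives: &[f32], temperature: f32) -> f32 {
    // Logits for all samples in P; the positive comes first.
    let logits: Vec<f32> = std::iter::once(sim_positive)
        .chain(sim_negatives.iter().copied())
        .map(|s| s / temperature)
        .collect();
    // -log(exp(l_0) / Σ exp(l_j)) = logsumexp(l) - l_0, computed stably.
    let max = logits.iter().copied().fold(f32::NEG_INFINITY, f32::max);
    let log_sum_exp = max + logits.iter().map(|&l| (l - max).exp()).sum::<f32>().ln();
    log_sum_exp - logits[0]
}
```

With no negatives the loss is 0; with one negative that is exactly as similar as the positive it is ln 2; and harder negatives (similarities close to the positive) give a larger loss than easy ones.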
+
+**Local Contrastive Loss**:
+```
+Encourages node embeddings to be similar to graph neighbors
+and dissimilar to non-neighbors
+```
+
+### 8.2 Elastic Weight Consolidation (EWC)
+
+**File**: `crates/ruvector-gnn/src/ewc.rs`
+
+Prevents catastrophic forgetting in continual learning:
+
+```
+L_total = L_task + (λ/2) Σ_i F_i (θ_i - θ*_i)²
+
+where:
+- L_task: Current task loss
+- F_i: Fisher information (importance of parameter i)
+- θ_i: Current parameter
+- θ*_i: Anchor parameter from previous task
+- λ: Regularization strength (10-10000)
+```
+
+**Fisher Information Approximation**:
+```
+F_i ≈ (1/N) Σ_{n=1}^N (∂L/∂θ_i)²
+```
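Both formulas are short enough to sketch directly. The code below is illustrative only and does not reproduce `ewc.rs`: `fisher_diagonal` and `ewc_penalty` are hypothetical names, and the per-sample gradients are assumed to be supplied by the caller.

```rust
/// Diagonal Fisher estimate: mean of the squared per-sample gradients.
fn fisher_diagonal(per_sample_grads: &[Vec<f32>]) -> Vec<f32> {
    let n = per_sample_grads.len().max(1) as f32;
    let dim = per_sample_grads.first().map_or(0, |g| g.len());
    let mut fisher = vec![0.0_f32; dim];
    for g in per_sample_grads {
        for (f, x) in fisher.iter_mut().zip(g) {
            *f += x * x / n;
        }
    }
    fisher
}

/// EWC regularizer (λ/2) Σ_i F_i (θ_i - θ*_i)², to be added to the task loss.
fn ewc_penalty(theta: &[f32], anchor: &[f32], fisher: &[f32], lambda: f32) -> f32 {
    let sum: f32 = theta
        .iter()
        .zip(anchor)
        .zip(fisher)
        .map(|((t, a), f)| f * (t - a).powi(2))
        .sum();
    0.5 * lambda * sum
}
```

For gradients `[1, 2]` and `[3, 0]` the Fisher diagonal is `[5, 2]`; with θ = `[1, 1]`, θ* = `[0, 3]`, and λ = 10 the penalty is 0.5 · 10 · (5·1 + 2·4) = 65. Parameters with a large Fisher value are pulled back toward the anchor more strongly.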
+
+---
+
+## 9. Key Insights for Latent Space Design
+
+### 9.1 Embedding Geometry
+
+**Current Architecture Assumptions**:
+1. **Euclidean Latent Space**: All operations assume flat geometry
+2. **Cosine Similarity**: Angular similarity measure used in search (not a true distance metric)
+3. **Linear Projections**: Affine transformations preserve convexity
+
+**Implications**:
+- Tree-like graphs embed with high distortion in low-dimensional Euclidean space
+- Hierarchical HNSW structure hints at hyperbolic geometry benefits
+- Attention mechanism can partially compensate for metric mismatch
+
+### 9.2 Information Flow Bottlenecks
+
+**Critical Points**:
+1. **Attention Softmax**: Hard selection at inference (argmax)
+2. **GRU Gates**: Sigmoid saturation can block gradients
+3. **Fixed Dimensions**: h_dim bottleneck between layers
+
+**Potential Improvements**:
+- Adaptive dimensionality per layer
+- Sparse attention patterns
+- Mixture of experts for different graph patterns
+
+---
+
+## 10. Connection to HNSW Topology
+
+### 10.1 HNSW Structure
+
+Hierarchical layers:
+```
+Layer 2: [sparse, long-range connections]
+Layer 1: [medium density]
+Layer 0: [dense, local connections]
+```
+
+### 10.2 GNN-HNSW Synergy
+
+**Advantages**:
+1. **Coarse-to-Fine**: Higher layers = global structure, lower = local
+2. **Skip Connections**: Hierarchical search jumps across graph
+3. **Differentiable**: Soft attention enables gradient-based optimization
+
+**Challenges**:
+1. **Layer Mismatch**: HNSW layers ≠ GNN layers
+2. **Probabilistic Construction**: HNSW randomness vs. learned embeddings
+3. **Online Updates**: Adding nodes requires GNN re-evaluation
+
+---
+
+## 11. Strengths and Limitations
+
+### 11.1 Strengths
+
+1. **Numerically Stable**: Extensive guards against overflow/underflow
+2. **Flexible**: Handles variable-degree nodes and empty neighborhoods
+3. **Rich Interactions**: Dual aggregation (attention + weighted)
+4. **Recurrent Memory**: GRU maintains long-term dependencies
+5. **End-to-End Differentiable**: Full gradient flow through search
+
+### 11.2 Limitations
+
+1. **Computational Cost**: O(d·h²) per node per layer
+2. **Fixed Architecture**: Uniform layers, no adaptive depth
+3. **Euclidean Bias**: May not suit hierarchical graphs
+4. **Limited Expressiveness**: Single attention type (dot-product)
+5. **No Edge Features**: Only uses edge weights, not attributes
+
+---
+
+## 12. Research Opportunities
+
+### 12.1 Short-Term Enhancements
+
+1. **Edge Features**: Extend attention to incorporate edge attributes
+2. **Adaptive Heads**: Learn number of attention heads per layer
+3. **Sparse Attention**: Local + global attention patterns
+4. **Layer Skip Connections**: Direct paths from input to output
+
+### 12.2 Long-Term Directions
+
+1. **Hyperbolic GNN**: Replace Euclidean operations with Poincaré ball
+2. **Graph Transformers**: Replace message passing with full attention
+3. **Neural ODEs**: Continuous-depth GNN with differential equations
+4. **Equivariant Networks**: SE(3) or E(n) equivariance for geometric graphs
+
+---
+
+## References
+
+### Internal Code References
+- `crates/ruvector-gnn/src/layer.rs` - Core GNN layers
+- `crates/ruvector-gnn/src/search.rs` - Differentiable search
+- `crates/ruvector-gnn/src/training.rs` - Loss functions and optimizers
+- `crates/ruvector-gnn/src/ewc.rs` - Continual learning
+- `crates/ruvector-graph/src/hybrid/graph_neural.rs` - GNN engine interface
+
+### Key Papers
+- Kipf & Welling (2017) - Graph Convolutional Networks
+- Veličković et al. (2018) - Graph Attention Networks
+- Chung et al. (2014) - Gated Recurrent Units
+- Vaswani et al. (2017) - Attention Is All You Need (Transformers)
+- Malkov & Yashunin (2018) - HNSW for ANN search
+
+---
+
+**Document Version**: 1.0
+**Last Updated**: 2025-11-30
+**Author**: RuVector Research Team