Merge commit 'd803bfe2b1fe7f5e219e50ac20d6801a0a58ac75' as 'vendor/ruvector'

2026-02-28 14:39:40 -05:00
parent 7885bf6278 d803bfe2b1
commit cd5943df23
7854 changed files with 3522914 additions and 0 deletions
--- a/vendor/ruvector/docs/research/latent-space/attention-mechanisms-research.md
+++ b/vendor/ruvector/docs/research/latent-space/attention-mechanisms-research.md
@@ -0,0 +1,925 @@
+# Alternative Attention Mechanisms for GNN Latent Space
+
+## Executive Summary
+
+This document explores alternative attention mechanisms beyond the current scaled dot-product multi-head attention used in RuVector. We analyze mechanisms that could better bridge the gap between high-dimensional latent spaces and graph topology, with emphasis on efficiency, expressiveness, and geometric awareness.
+
+**Current**: Multi-head scaled dot-product attention (O(n²) complexity)
+**Goal**: Enhance attention to capture graph structure, reduce complexity, and improve latent-graph interplay
+
+---
+
+## 1. Current Attention Mechanism Analysis
+
+### 1.1 Scaled Dot-Product Attention (Current Implementation)
+
+**File**: `crates/ruvector-gnn/src/layer.rs:84-205`
+
+```
+Attention(Q, K, V) = softmax(QK^T / √d_k) V
+```
+
+**Strengths**:
+- ✓ Permutation invariant
+- ✓ Differentiable
+- ✓ Well-understood training dynamics
+- ✓ Parallel computation
+
+**Weaknesses**:
+- ✗ No explicit edge features
+- ✗ No positional/structural encoding
+- ✗ Uniform geometric assumptions (Euclidean)
+- ✗ O(d·h²) computational cost
+- ✗ Attention scores independent of graph topology
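
As a point of reference for the formula above, here is a minimal single-query sketch of scaled dot-product attention over a node's neighbors. It is not the code in `layer.rs`; the function names and the flat `Vec<f32>` layout are assumptions made for illustration, and it assumes at least one neighbor.

```rust
/// Numerically stable softmax over a slice of scores.
fn softmax(scores: &[f32]) -> Vec<f32> {
    let max = scores.iter().copied().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = scores.iter().map(|&s| (s - max).exp()).collect();
    let sum: f32 = exps.iter().sum();
    exps.iter().map(|&e| e / sum).collect()
}

/// Attention for one query: softmax(q·k_j / sqrt(d_k)) weighted sum of v_j.
/// Assumes `keys` and `values` are non-empty and have equal length.
fn scaled_dot_product_attention(
    query: &[f32],
    keys: &[Vec<f32>],
    values: &[Vec<f32>],
) -> Vec<f32> {
    let scale = (query.len() as f32).sqrt();
    let scores: Vec<f32> = keys
        .iter()
        .map(|k| query.iter().zip(k).map(|(a, b)| a * b).sum::<f32>() / scale)
        .collect();
    let weights = softmax(&scores);

    let mut out = vec![0.0; values[0].len()];
    for (v, &w) in values.iter().zip(&weights) {
        for (o, &x) in out.iter_mut().zip(v) {
            *o += w * x;
        }
    }
    out
}
```

Because the scores depend only on `query` and each key, swapping the order of the neighbors leaves the output unchanged, and nothing about the graph (edge weights, layers) enters the computation. Those are the permutation invariance and topology-blindness noted in the lists above.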
+
+### 1.2 Multi-Head Decomposition (Current)
+
+```
+MultiHead(Q, K, V) = Concat(head_1, ..., head_h) W_o
+```
+
+**Strengths**:
+- ✓ Multiple representation subspaces
+- ✓ Different aspects of neighborhood
+
+**Weaknesses**:
+- ✗ Fixed number of heads
+- ✗ Heads learn similar patterns (redundancy)
+- ✗ No explicit head specialization
+
+---
+
+## 2. Graph Attention Networks (GAT) Extensions
+
+### 2.1 Edge-Featured Attention
+
+**Key Innovation**: Incorporate edge attributes into attention computation
+
+```
+e_{ij} = LeakyReLU(a^T [W h_i || W h_j || W_e edge_{ij}])
+α_{ij} = softmax_j(e_{ij})
+h'_i = σ(Σ_{j∈N(i)} α_{ij} W h_j)
+```
+
+**Implementation Proposal**:
+
+```rust
+pub struct EdgeFeaturedAttention {
+    w_node: Linear,      // Node transformation
+    w_edge: Linear,      // Edge transformation
+    a: Vec<f32>,         // Attention coefficients
+    activation: LeakyReLU,
+}
+
+impl EdgeFeaturedAttention {
+    fn forward(
+        &self,
+        query_node: &[f32],
+        neighbor_nodes: &[Vec<f32>],
+        edge_features: &[Vec<f32>],  // NEW
+    ) -> Vec<f32> {
+        // 1. Transform nodes and edges
+        let q_trans = self.w_node.forward(query_node);
+        let n_trans: Vec<_> = neighbor_nodes.iter()
+            .map(|n| self.w_node.forward(n))
+            .collect();
+        let e_trans: Vec<_> = edge_features.iter()
+            .map(|e| self.w_edge.forward(e))
+            .collect();
+
+        // 2. Compute attention with edge features
+        let mut scores = Vec::new();
+        for (n, e) in n_trans.iter().zip(e_trans.iter()) {
+            // Concatenate [query || neighbor || edge]
+            let concat = [&q_trans[..], &n[..], &e[..]].concat();
+            let score = dot_product(&self.a, &concat);
+            scores.push(self.activation.forward(score));
+        }
+
+        // 3. Softmax and aggregate
+        let weights = softmax(&scores);
+        weighted_sum(&n_trans, &weights)
+    }
+}
+```
+
+**Benefits for RuVector**:
+- Edge weights (distances) become learnable features
+- HNSW layer information can be encoded in edges
+- Better captures graph topology in latent space
+
+**Complexity**: O(d·(h_node + h_edge + h_attn))
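
One way the `edge_features` input could be built from HNSW data is sketched below. The `HnswEdge` struct, its field names, and the particular encoding (raw distance, an inverse-distance term, and a one-hot layer indicator) are all assumptions for illustration; RuVector's actual edge representation may differ.

```rust
/// Hypothetical edge descriptor; not a RuVector type.
struct HnswEdge {
    distance: f32, // metric distance between the two endpoints
    layer: usize,  // HNSW layer on which the edge exists
}

/// Encodes an edge as [distance, 1 / (1 + distance), one-hot(layer)].
/// Layers at or above `max_layers` share the last one-hot slot.
fn edge_features(edge: &HnswEdge, max_layers: usize) -> Vec<f32> {
    let mut feats = Vec::with_capacity(2 + max_layers);
    feats.push(edge.distance);
    feats.push(1.0 / (1.0 + edge.distance));

    let slot = edge.layer.min(max_layers.saturating_sub(1));
    for l in 0..max_layers {
        feats.push(if l == slot { 1.0 } else { 0.0 });
    }
    feats
}
```

The resulting vector has length `2 + max_layers`, which would be the input width of `w_edge` under this encoding.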
+
+---
+
+## 3. Hyperbolic Attention
+
+### 3.1 Motivation
+
+**Problem**: HNSW has hierarchical structure, but Euclidean space poorly represents trees/hierarchies
+
+**Solution**: Operate in hyperbolic space (Poincaré ball or hyperboloid model)
+
+### 3.2 Poincaré Ball Attention
+
+**Poincaré Ball Model**:
+```
+B^d = {x ∈ R^d : ||x|| < 1}
+Distance: d(x, y) = arcosh(1 + 2||x - y||² / ((1-||x||²)(1-||y||²)))
+```
+
+**Hyperbolic Attention Mechanism**:
+
+```
+# Key differences from Euclidean:
+1. Use hyperbolic distance for similarity
+2. Exponential map for transformations
+3. Logarithmic map for aggregation
+
+HyperbolicAttention(q, k, v):
+    # Compute hyperbolic similarity
+    sim_ij = -d_poincare(q, k_j)  # Negative distance
+
+    # Softmax over the scalar similarities (temperature τ)
+    α_ij = softmax(sim_ij / τ)
+
+    # Aggregate in hyperbolic space
+    result = ⊕_{j} (α_ij ⊗ v_j)  # ⊗ = Möbius scalar multiplication, ⊕ = Möbius addition
+
+    return result
+```
+
+**Implementation Sketch**:
+
+```rust
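+pub struct HyperbolicAttention {
+    curvature: f32,  // Negative curvature (e.g., -1.0); c = |curvature| below
+}
+
+impl HyperbolicAttention {
+    // Poincaré distance for curvature -c (reduces to the formula above when c = 1):
+    // d(x, y) = (1/√c) · arcosh(1 + 2c||x - y||² / ((1 - c||x||²)(1 - c||y||²)))
+    fn poincare_distance(&self, x: &[f32], y: &[f32]) -> f32 {
+        let c = self.curvature.abs();
+        let diff_norm_sq = l2_norm_squared(&subtract(x, y));
+        let x_norm_sq = l2_norm_squared(x);
+        let y_norm_sq = l2_norm_squared(y);
+
+        let numerator = 2.0 * c * diff_norm_sq;
+        let denominator = (1.0 - c * x_norm_sq) * (1.0 - c * y_norm_sq);
+
+        (1.0 + numerator / denominator).acosh() / c.sqrt()
+    }
+
+    // Möbius addition (hyperbolic vector addition):
+    // x ⊕ y = ((1 + 2c⟨x,y⟩ + c||y||²)x + (1 - c||x||²)y) / (1 + 2c⟨x,y⟩ + c²||x||²||y||²)
+    fn mobius_add(&self, x: &[f32], y: &[f32]) -> Vec<f32> {
+        let c = self.curvature.abs();
+        let x_norm_sq = l2_norm_squared(x);
+        let y_norm_sq = l2_norm_squared(y);
+        let xy_dot = dot_product(x, y);
+
+        let x_coef = 1.0 + 2.0 * c * xy_dot + c * y_norm_sq;
+        let y_coef = 1.0 - c * x_norm_sq;
+        let denominator = 1.0 + 2.0 * c * xy_dot + c * c * x_norm_sq * y_norm_sq;
+
+        let numerator = add(&scale(x, x_coef), &scale(y, y_coef));
+        scale(&numerator, 1.0 / denominator)
+    }
+
+    // Möbius scalar multiplication:
+    // r ⊗ x = (1/√c) · tanh(r · artanh(√c||x||)) · x / ||x||
+    fn mobius_scalar_mult(&self, r: f32, x: &[f32]) -> Vec<f32> {
+        let sqrt_c = self.curvature.abs().sqrt();
+        let norm = l2_norm_squared(x).sqrt();
+        if norm == 0.0 {
+            return x.to_vec();  // The origin is fixed by scalar multiplication
+        }
+        let factor = (r * (sqrt_c * norm).atanh()).tanh() / (sqrt_c * norm);
+        scale(x, factor)
+    }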
+
+    fn forward(
+        &self,
+        query: &[f32],
+        keys: &[Vec<f32>],
+        values: &[Vec<f32>],
+    ) -> Vec<f32> {
+        // 1. Compute hyperbolic similarities (negative distances)
+        let scores: Vec<f32> = keys.iter()
+            .map(|k| -self.poincare_distance(query, k))
+            .collect();
+
+        // 2. Softmax
+        let weights = softmax(&scores);
+
+        // 3. Hyperbolic aggregation
+        let mut result = vec![0.0; values[0].len()];
+        for (v, &w) in values.iter().zip(weights.iter()) {
+            let scaled = self.mobius_scalar_mult(w, v);
+            result = self.mobius_add(&result, &scaled);
+        }
+
+        result
+    }
+}
+```
+
+**Benefits for HNSW**:
+- Natural representation of hierarchical layers
+- Exponential capacity (tree-like structures)
+- Distance preserves hierarchy
+
+**Challenges**:
+- Numerical instability near ball boundary (||x|| → 1)
+- More complex backpropagation
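+- Requires hyperbolic embeddings throughout pipeline
+- Möbius addition is neither commutative nor associative, so sequential aggregation depends on neighbor order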
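
A common mitigation for the boundary instability is to clip embeddings to a slightly smaller ball before computing distances. The sketch below assumes the unit ball (curvature -1) and a hypothetical `eps` margin; the function names are illustrative and not part of RuVector.

```rust
/// Rescales `x` so that its norm is at most 1 - eps, keeping it strictly
/// inside the unit Poincaré ball. Points already inside are returned unchanged.
fn project_to_ball(x: &[f32], eps: f32) -> Vec<f32> {
    let norm = x.iter().map(|v| v * v).sum::<f32>().sqrt();
    let max_norm = 1.0 - eps;
    if norm > max_norm {
        let s = max_norm / norm;
        x.iter().map(|v| v * s).collect()
    } else {
        x.to_vec()
    }
}

/// Poincaré distance on the unit ball (curvature -1).
/// Both inputs are assumed to lie strictly inside the ball.
fn poincare_distance(x: &[f32], y: &[f32]) -> f32 {
    let sq = |v: &[f32]| v.iter().map(|a| a * a).sum::<f32>();
    let diff: Vec<f32> = x.iter().zip(y).map(|(a, b)| a - b).collect();
    let denom = (1.0 - sq(x)) * (1.0 - sq(y));
    (1.0 + 2.0 * sq(&diff) / denom).acosh()
}
```

With the margin in place the denominator `(1 - ||x||²)(1 - ||y||²)` stays bounded away from zero, so distances between near-boundary points remain finite, at the cost of capping how deep in the hierarchy a point can be placed.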
+
+---
+
+## 4. Sparse Attention Patterns
+
+### 4.1 Local + Global Attention (Longformer-style)
+
+**Motivation**: Full attention is O(n²), wasteful for graphs with local structure
+
+**Pattern**:
+```
+Attention Matrix Structure:
+  [L L L G 0 0 0 0]
+  [L L L L G 0 0 0]
+  [L L L L L G 0 0]
+  [G L L L L L G 0]
+  [0 G L L L L L G]
+  [0 0 G L L L L L]
+  [0 0 0 G L L L L]
+  [0 0 0 0 G L L L]
+
+L = Local attention (1-hop neighbors)
+G = Global attention (HNSW higher layers)
+0 = No attention
+```
+
+**Implementation**:
+
+```rust
+pub struct SparseGraphAttention {
+    local_attn: MultiHeadAttention,
+    global_attn: MultiHeadAttention,
+    local_window: usize,  // K-hop neighborhood
+}
+
+impl SparseGraphAttention {
+    fn forward(
+        &self,
+        query: &[f32],
+        neighbor_embeddings: &[Vec<f32>],
+        neighbor_layers: &[usize],  // HNSW layer for each neighbor
+    ) -> Vec<f32> {
+        // Split neighbors by locality (cloned so both attention calls receive &[Vec<f32>])
+        let local_neighbors: Vec<Vec<f32>> = neighbor_embeddings.iter()
+            .zip(neighbor_layers.iter())
+            .filter(|(_, layer)| **layer == 0)  // Layer 0 = local
+            .map(|(emb, _)| emb.clone())
+            .collect();
+
+        let global_neighbors: Vec<Vec<f32>> = neighbor_embeddings.iter()
+            .zip(neighbor_layers.iter())
+            .filter(|(_, layer)| **layer > 0)  // Higher layers = global
+            .map(|(emb, _)| emb.clone())
+            .collect();
+
+        // Compute local attention
+        let local_output = if !local_neighbors.is_empty() {
+            self.local_attn.forward(query, &local_neighbors, &local_neighbors)
+        } else {
+            vec![0.0; query.len()]
+        };
+
+        // Compute global attention
+        let global_output = if !global_neighbors.is_empty() {
+            self.global_attn.forward(query, &global_neighbors, &global_neighbors)
+        } else {
+            vec![0.0; query.len()]
+        };
+
+        // Combine (learned gating)
+        combine_local_global(&local_output, &global_output)
+    }
+}
+```
+
+**Complexity**: O(k_local + k_global) attention scores per query node, instead of O(n) per query (O(n²) over the whole graph)
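
The "learned gating" step is left abstract in the code above. One possible form is a scalar sigmoid gate computed from the concatenated outputs, sketched here. The `LocalGlobalGate` struct and its parameters are hypothetical; a per-dimension gate or a small MLP would be equally plausible choices.

```rust
/// Hypothetical learned gate: g = sigmoid(w · [local || global] + b).
struct LocalGlobalGate {
    weights: Vec<f32>, // length 2 * dim, learned
    bias: f32,         // learned
}

impl LocalGlobalGate {
    /// Returns g * local + (1 - g) * global with a single scalar gate g.
    fn combine(&self, local: &[f32], global: &[f32]) -> Vec<f32> {
        let logit: f32 = local
            .iter()
            .chain(global.iter())
            .zip(&self.weights)
            .map(|(x, w)| x * w)
            .sum::<f32>()
            + self.bias;
        let g = 1.0 / (1.0 + (-logit).exp());

        local
            .iter()
            .zip(global)
            .map(|(l, gl)| g * l + (1.0 - g) * gl)
            .collect()
    }
}
```

With zero weights and zero bias the gate is 0.5 and the result is the plain average of the two branches, which makes a reasonable initialization before training moves the gate toward whichever branch is more informative.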
+
+---
+
+## 5. Linear Attention (O(n) complexity)
+
+### 5.1 Kernel-Based Linear Attention
+
+**Key Idea**: Replace softmax with kernel feature map
+
+```
+Standard: Attention(Q, K, V) = softmax(QK^T) V
+Linear:   Attention(Q, K, V) = φ(Q) (φ(K)^T V) / (φ(Q) (φ(K)^T 1))
+
+where φ: R^d → R^D is a feature map
+```
+
+**Random Feature Approximation** (Performer):
+
+```rust
+pub struct LinearAttention {
+    num_features: usize,  // D (typically 256-512)
+    random_features: Array2<f32>,  // Random projection matrix
+}
+
+impl LinearAttention {
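+    fn feature_map(&self, x: &[f32]) -> Vec<f32> {
+        // Positive random features (Performer FAVOR+): φ(x)_i = exp(w_i·x - ||x||²/2) / √D,
+        // assuming the rows w_i of `random_features` are drawn from N(0, I).
+        // Unlike trigonometric features these are strictly positive, so the
+        // normalizer in `forward` cannot be zero or negative.
+        let proj = self.random_features.dot(&Array1::from_vec(x.to_vec()));
+        let scale = 1.0 / (self.num_features as f32).sqrt();
+        let half_norm_sq = 0.5 * x.iter().map(|v| v * v).sum::<f32>();
+
+        proj.mapv(|z| scale * (z - half_norm_sq).exp()).to_vec()
+    }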
+
+    fn forward(
+        &self,
+        query: &[f32],
+        keys: &[Vec<f32>],
+        values: &[Vec<f32>],
+    ) -> Vec<f32> {
+        // 1. Apply feature map
+        let q_feat = self.feature_map(query);
+        let k_feats: Vec<_> = keys.iter().map(|k| self.feature_map(k)).collect();
+
+        // 2. Accumulate K^T V (D × d_v) and K^T 1 (length D) over neighbors
+        let d_v = values[0].len();
+        let mut kv = vec![vec![0.0; d_v]; self.num_features];
+        let mut k_sum = vec![0.0; self.num_features];
+        for (k_feat, v) in k_feats.iter().zip(values.iter()) {
+            for (f, &k_f) in k_feat.iter().enumerate() {
+                k_sum[f] += k_f;
+                for (i, &v_i) in v.iter().enumerate() {
+                    kv[f][i] += k_f * v_i;
+                }
+            }
+        }
+
+        // 3. Compute φ(q)^T (K^T V)
+        let mut numerator = vec![0.0; d_v];
+        for (&q_f, kv_row) in q_feat.iter().zip(kv.iter()) {
+            for (n, &kv_fi) in numerator.iter_mut().zip(kv_row.iter()) {
+                *n += q_f * kv_fi;
+            }
+        }
+
+        // 4. Normalize by φ(q)^T (K^T 1)
+        let denominator: f32 = q_feat.iter().zip(k_sum.iter())
+            .map(|(&q_f, &k_f)| q_f * k_f)
+            .sum();
+
+        numerator.iter().map(|&n| n / denominator).collect()
+    }
+}
+```
+
+**Benefits**:
+- **O(n) complexity**: Scales linearly with graph size
+- **Theoretically grounded**: Approximates softmax attention
+- **Parallel friendly**: Matrix operations
+
+**Tradeoffs**:
+- Approximation error vs. exact softmax
+- Requires more random features for accuracy
+- Less interpretable attention weights
+
+---
+
+## 6. Rotary Position Embeddings (RoPE) for Graphs
+
+### 6.1 Motivation
+
+**Problem**: Graph attention has no notion of "position" or "distance" beyond explicit edge features
+
+**Solution**: Encode relative distances/positions via rotation
+
+### 6.2 RoPE Mathematics
+
+**Standard RoPE** (for sequences):
+```
+RoPE(x, m) = [
+    x₀ cos(mθ₀) - x₁ sin(mθ₀),
+    x₀ sin(mθ₀) + x₁ cos(mθ₀),
+    x₂ cos(mθ₁) - x₃ sin(mθ₁),
+    ...
+]
+
+where m = position index, θᵢ = 10000^(-2i/d)
+```
+
+**Graph RoPE Adaptation**:
+```
+Instead of sequential position m, use:
+- Graph distance (shortest path length)
+- HNSW layer index
+- Normalized edge weight
+```
+
+**Implementation**:
+
+```rust
+pub struct GraphRoPE {
+    dim: usize,
+    base: f32,  // Base frequency (default 10000)
+}
+
+impl GraphRoPE {
+    fn apply_rotation(&self, embedding: &[f32], distance: f32) -> Vec<f32> {
+        let mut rotated = vec![0.0; embedding.len()];
+
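+        // Assumes an even `dim`. `i` steps over pair starts, so the pair index is i/2
+        // and the exponent 2·(i/2)/dim simplifies to i/dim.
+        for i in (0..self.dim).step_by(2) {
+            let theta = distance / self.base.powf(i as f32 / self.dim as f32);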
+            let cos_theta = theta.cos();
+            let sin_theta = theta.sin();
+
+            rotated[i] = embedding[i] * cos_theta - embedding[i+1] * sin_theta;
+            rotated[i+1] = embedding[i] * sin_theta + embedding[i+1] * cos_theta;
+        }
+
+        rotated
+    }
+
+    fn forward_attention(
+        &self,
+        query: &[f32],
+        keys: &[Vec<f32>],
+        values: &[Vec<f32>],
+        distances: &[f32],  // NEW: graph distances
+    ) -> Vec<f32> {
+        // Apply RoPE to query and keys based on relative distance
+        let q_rotated = self.apply_rotation(query, 0.0);  // Query at "origin"
+
+        let mut scores = Vec::new();
+        for (k, &dist) in keys.iter().zip(distances.iter()) {
+            let k_rotated = self.apply_rotation(k, dist);
+            let score = dot_product(&q_rotated, &k_rotated);
+            scores.push(score);
+        }
+
+        let weights = softmax(&scores);
+        weighted_sum(values, &weights)
+    }
+}
+```
+
+**Benefits**:
+- Encodes distance without explicit features
+- Relative position encoding: scores depend only on the offset between query and key rotations, not on absolute positions
+- Efficient (just rotations, no extra parameters)
+
+**Graph-Specific Applications**:
+1. **HNSW Layer Distance**: Encode which layer neighbors come from
+2. **Shortest Path Distance**: Penalize far nodes in latent space
+3. **Edge Weight Encoding**: Continuous rotation based on edge weight
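
The relative-encoding property can be checked numerically: rotating the query by position `m` and the key by position `n` yields a dot product that depends only on `n - m`. The sketch below is standalone and uses hypothetical helper names `rope` and `dot`; here `p` is the pair index, so the angle is `pos * base^(-2p/dim)` as in the formula above.

```rust
/// Rotates consecutive pairs (x[2p], x[2p+1]) by the angle pos * base^(-2p/dim).
fn rope(x: &[f32], pos: f32, base: f32) -> Vec<f32> {
    let dim = x.len();
    assert!(dim % 2 == 0, "RoPE needs an even dimension");
    let mut out = vec![0.0; dim];
    for p in 0..dim / 2 {
        let theta = pos * base.powf(-2.0 * p as f32 / dim as f32);
        let (sin, cos) = theta.sin_cos();
        out[2 * p] = x[2 * p] * cos - x[2 * p + 1] * sin;
        out[2 * p + 1] = x[2 * p] * sin + x[2 * p + 1] * cos;
    }
    out
}

/// Plain dot product of two equal-length slices.
fn dot(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}
```

For example, `dot(&rope(&q, 3.0, 10000.0), &rope(&k, 5.0, 10000.0))` and `dot(&rope(&q, 0.0, 10000.0), &rope(&k, 2.0, 10000.0))` agree up to floating-point error, since both pairs are two positions apart. This is why placing the query at the "origin" and rotating each key by its graph distance is sufficient.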
+
+---
+
+## 7. Flash Attention (Memory-Efficient)
+
+### 7.1 Problem
+
+Standard attention materializes the full attention matrix in memory:
+```
+Memory: O(n²)  for n neighbors
+```
+
+For dense graphs or large neighborhoods, this is prohibitive.
+
+### 7.2 Flash Attention Algorithm
+
+**Key Ideas**:
+1. Tile the attention computation
+2. Recompute attention on-the-fly during backward pass
+3. Never materialize full attention matrix
+
+**Pseudocode**:
+
+```
+FlashAttention(Q, K, V):
+    # Divide Q, K, V into blocks
+    Q_blocks = split(Q, block_size)
+    K_blocks = split(K, block_size)
+    V_blocks = split(V, block_size)
+
+    O = zeros_like(Q)
+
+    # Outer loop: iterate over query blocks
+    for Q_i in Q_blocks:
+        row_max = -inf
+        row_sum = 0
+
+        # Inner loop: iterate over key blocks
+        for K_j, V_j in zip(K_blocks, V_blocks):
+            # Compute attention block
+            S_ij = Q_i @ K_j^T / sqrt(d)
+
+            # Online softmax (numerically stable)
+            new_max = max(row_max, max(S_ij))
+            exp_S = exp(S_ij - new_max)
+
+            # Update running statistics
+            correction = exp(row_max - new_max)
+            row_sum = row_sum * correction + sum(exp_S)
+            row_max = new_max
+
+            # Accumulate output (rescale the previous partial sum to the new max first)
+            O_i = O_i * correction + exp_S @ V_j
+
+        # Final normalization
+        O_i /= row_sum
+
+    return O
+```
+
+**Implementation Note**:
+
+Flash Attention requires careful low-level optimization (CUDA kernels, tiling, SRAM management). For RuVector:
+
+```rust
+// Simplified tiled version for CPU
+pub struct TiledAttention {
+    block_size: usize,
+}
+
+impl TiledAttention {
+    fn forward_tiled(
+        &self,
+        query: &[f32],
+        keys: &[Vec<f32>],
+        values: &[Vec<f32>],
+    ) -> Vec<f32> {
+        let n = keys.len();
+        let mut output = vec![0.0; query.len()];
+        let mut row_sum = 0.0;
+        let mut row_max = f32::NEG_INFINITY;
+
+        // Process keys in blocks
+        for chunk_start in (0..n).step_by(self.block_size) {
+            let chunk_end = (chunk_start + self.block_size).min(n);
+
+            // Compute attention for this block
+            let chunk_keys = &keys[chunk_start..chunk_end];
+            let chunk_values = &values[chunk_start..chunk_end];
+
+            let scale = (query.len() as f32).sqrt();  // √d, as in the pseudocode above
+            let scores: Vec<f32> = chunk_keys.iter()
+                .map(|k| dot_product(query, k) / scale)
+                .collect();
+
+            // Online softmax update
+            let new_max = scores.iter().copied().fold(row_max, f32::max);
+            let exp_scores: Vec<f32> = scores.iter()
+                .map(|&s| (s - new_max).exp())
+                .collect();
+
+            let correction = (row_max - new_max).exp();
+            row_sum = row_sum * correction + exp_scores.iter().sum::<f32>();
+            row_max = new_max;
+
+            // Rescale the previous accumulation once per block, then add this block
+            for o in output.iter_mut() {
+                *o *= correction;
+            }
+            for (v, &weight) in chunk_values.iter().zip(exp_scores.iter()) {
+                for (o, &v_i) in output.iter_mut().zip(v.iter()) {
+                    *o += weight * v_i;
+                }
+            }
+        }
+
+        // Final normalization
+        output.iter().map(|&o| o / row_sum).collect()
+    }
+}
+```
+
+**Benefits**:
+- **Memory**: O(n) instead of O(n²)
+- **Speed**: Can be faster due to better cache locality
+- **Scalability**: Handle larger neighborhoods
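
The online softmax at the heart of the tiled computation can be validated against an ordinary two-pass softmax. The sketch below uses scalar values to keep it short; the function names are illustrative, and `block` is assumed to be greater than zero.

```rust
/// Streaming softmax-weighted mean, processing scores in blocks of size `block`.
/// Keeps only a running max, a running normalizer, and a running weighted sum.
fn online_softmax_mean(scores: &[f32], values: &[f32], block: usize) -> f32 {
    let (mut max, mut sum, mut acc) = (f32::NEG_INFINITY, 0.0_f32, 0.0_f32);
    for (s_chunk, v_chunk) in scores.chunks(block).zip(values.chunks(block)) {
        let new_max = s_chunk.iter().copied().fold(max, f32::max);
        // Rescale what has been accumulated so far, once per block.
        let correction = (max - new_max).exp();
        sum *= correction;
        acc *= correction;
        for (&s, &v) in s_chunk.iter().zip(v_chunk) {
            let w = (s - new_max).exp();
            sum += w;
            acc += w * v;
        }
        max = new_max;
    }
    acc / sum
}

/// Reference implementation: global max first, then one pass over all scores.
fn two_pass_softmax_mean(scores: &[f32], values: &[f32]) -> f32 {
    let max = scores.iter().copied().fold(f32::NEG_INFINITY, f32::max);
    let weights: Vec<f32> = scores.iter().map(|&s| (s - max).exp()).collect();
    let sum: f32 = weights.iter().sum();
    weights.iter().zip(values).map(|(w, v)| w * v).sum::<f32>() / sum
}
```

The two functions should agree up to floating-point error for every block size, which makes this a convenient regression test when changing the tiling logic.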
+
+---
+
+## 8. Mixture of Experts (MoE) Attention
+
+### 8.1 Concept
+
+Different attention mechanisms for different graph patterns:
+
+```
+MoE-Attention(query, keys, values):
+    # Router decides which expert(s) to use
+    router_scores = Router(query)
+    expert_indices = topk(router_scores, k=2)
+
+    # Apply selected experts
+    outputs = []
+    for expert_idx in expert_indices:
+        expert_output = Experts[expert_idx](query, keys, values)
+        outputs.append(expert_output * router_scores[expert_idx])
+
+    return sum(outputs)
+```
+
+**Graph-Specific Experts**:
+1. **Local Expert**: For 1-hop neighbors (standard attention)
+2. **Hierarchical Expert**: For HNSW higher layers (hyperbolic attention)
+3. **Global Expert**: For distant nodes (linear attention)
+4. **Structural Expert**: Edge-featured attention
+
+### 8.2 Implementation
+
+```rust
+pub enum AttentionExpert {
+    Standard(MultiHeadAttention),
+    Hyperbolic(HyperbolicAttention),
+    Linear(LinearAttention),
+    EdgeFeatured(EdgeFeaturedAttention),
+}
+
+pub struct MoEAttention {
+    router: Linear,  // Maps query to expert scores
+    experts: Vec<AttentionExpert>,
+    top_k: usize,
+}
+
+impl MoEAttention {
+    fn forward(
+        &self,
+        query: &[f32],
+        keys: &[Vec<f32>],
+        values: &[Vec<f32>],
+        edge_features: Option<&[Vec<f32>]>,
+    ) -> Vec<f32> {
+        // 1. Route to experts
+        let router_scores = self.router.forward(query);
+        let expert_weights = softmax(&router_scores);
+        let top_experts = topk_indices(&expert_weights, self.top_k);
+
+        // 2. Compute weighted expert outputs
+        let mut output = vec![0.0; query.len()];
+        for &expert_idx in &top_experts {
+            let expert_output = match &self.experts[expert_idx] {
+                AttentionExpert::Standard(attn) =>
+                    attn.forward(query, keys, values),
+                AttentionExpert::Hyperbolic(attn) =>
+                    attn.forward(query, keys, values),
+                AttentionExpert::Linear(attn) =>
+                    attn.forward(query, keys, values),
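+                AttentionExpert::EdgeFeatured(attn) =>
+                    // EdgeFeaturedAttention::forward takes (query, neighbors, edge_features)
+                    attn.forward(query, keys, edge_features
+                        .expect("EdgeFeatured expert requires edge features")),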
+            };
+
+            let weight = expert_weights[expert_idx];
+            for (o, &e) in output.iter_mut().zip(expert_output.iter()) {
+                *o += weight * e;
+            }
+        }
+
+        output
+    }
+}
+```
+
+**Benefits**:
+- Adaptive to different graph neighborhoods
+- Sparse routing: only the top-k experts run per query, so cost grows with k rather than with the number of experts
+- Router learns which mechanism suits which context
+
+---
+
+## 9. Cross-Attention Between Graph and Latent
+
+### 9.1 Motivation
+
+**Problem**: Current attention only looks at graph neighbors. What about latent space neighbors?
+
+**Solution**: Cross-attention between topological neighbors (graph) and semantic neighbors (latent)
+
+### 9.2 Dual-Space Attention
+
+```
+Given node v:
+- Graph neighbors: N_G(v) = {u : (u,v) ∈ E}
+- Latent neighbors: N_L(v) = TopK({u : sim(h_u, h_v) > threshold})
+
+CrossAttention(v):
+    # Graph attention
+    graph_out = Attention(h_v, {h_u}_{u∈N_G}, {h_u}_{u∈N_G})
+
+    # Latent attention
+    latent_out = Attention(h_v, {h_u}_{u∈N_L}, {h_u}_{u∈N_L})
+
+    # Cross-attention: graph queries latent
+    cross_out = Attention(graph_out, {h_u}_{u∈N_L}, {h_u}_{u∈N_L})
+
+    # Fusion
+    return Combine(graph_out, latent_out, cross_out)
+```
+
+**Implementation**:
+
+```rust
+pub struct DualSpaceAttention {
+    graph_attn: MultiHeadAttention,
+    latent_attn: MultiHeadAttention,
+    cross_attn: MultiHeadAttention,
+    fusion: Linear,
+}
+
+impl DualSpaceAttention {
+    fn forward(
+        &self,
+        query: &[f32],
+        graph_neighbors: &[Vec<f32>],
+        all_embeddings: &[Vec<f32>],  // For latent neighbor search
+        k_latent: usize,
+    ) -> Vec<f32> {
+        // 1. Graph attention (topology-based)
+        let graph_output = self.graph_attn.forward(
+            query,
+            graph_neighbors,
+            graph_neighbors
+        );
+
+        // 2. Find latent neighbors (similarity-based)
+        let latent_neighbors = self.find_latent_neighbors(
+            query,
+            all_embeddings,
+            k_latent
+        );
+
+        // 3. Latent attention (embedding-based)
+        let latent_output = self.latent_attn.forward(
+            query,
+            &latent_neighbors,
+            &latent_neighbors
+        );
+
+        // 4. Cross-attention (graph context attends to latent space)
+        let cross_output = self.cross_attn.forward(
+            &graph_output,
+            &latent_neighbors,
+            &latent_neighbors
+        );
+
+        // 5. Fusion
+        let concatenated = [
+            &graph_output[..],
+            &latent_output[..],
+            &cross_output[..],
+        ].concat();
+
+        self.fusion.forward(&concatenated)
+    }
+
+    fn find_latent_neighbors(
+        &self,
+        query: &[f32],
+        all_embeddings: &[Vec<f32>],
+        k: usize,
+    ) -> Vec<Vec<f32>> {
+        // Compute similarities
+        let mut similarities: Vec<(usize, f32)> = all_embeddings
+            .iter()
+            .enumerate()
+            .map(|(i, emb)| (i, cosine_similarity(query, emb)))
+            .collect();
+
+        // Sort by similarity, descending (total_cmp orders NaN instead of panicking)
+        similarities.sort_by(|a, b| b.1.total_cmp(&a.1));
+
+        // Return top-k
+        similarities.iter()
+            .take(k)
+            .map(|(i, _)| all_embeddings[*i].clone())
+            .collect()
+    }
+}
+```
+
+**Benefits**:
+- Bridges topology and semantics
+- Captures "similar but not connected" nodes
+- Enriches latent space with graph structure
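
The brute-force neighbor search above sorts every embedding and may return the query node itself when it is part of `all_embeddings`. A slightly cheaper exact variant is sketched below, using partial selection and an optional index to skip. The names `top_k_latent` and `skip` are hypothetical; a production version would presumably query the existing HNSW index instead of scanning.

```rust
/// Cosine similarity; returns 0.0 when either vector has zero norm.
fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let na = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let nb = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if na == 0.0 || nb == 0.0 { 0.0 } else { dot / (na * nb) }
}

/// Indices of the k embeddings most similar to `query`, most similar first.
/// `skip` optionally excludes one index (e.g., the query node itself).
fn top_k_latent(
    query: &[f32],
    embeddings: &[Vec<f32>],
    k: usize,
    skip: Option<usize>,
) -> Vec<usize> {
    let mut scored: Vec<(usize, f32)> = embeddings
        .iter()
        .enumerate()
        .filter(|(i, _)| Some(*i) != skip)
        .map(|(i, e)| (i, cosine_similarity(query, e)))
        .collect();

    let k = k.min(scored.len());
    if k == 0 {
        return Vec::new();
    }
    // Partition so the k most similar entries occupy scored[..k] (average O(n)),
    // then sort only those k entries.
    scored.select_nth_unstable_by(k - 1, |a, b| b.1.total_cmp(&a.1));
    scored.truncate(k);
    scored.sort_by(|a, b| b.1.total_cmp(&a.1));
    scored.into_iter().map(|(i, _)| i).collect()
}
```

Returning indices rather than cloned embeddings also lets the caller check which latent neighbors are already graph neighbors, which is the "similar but not connected" set this mechanism is meant to surface.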
+
+---
+
+## 10. Comparison Matrix
+
+| Mechanism | Complexity | Edge Features | Geometry | Memory | Use Case |
+|-----------|------------|---------------|----------|--------|----------|
+| **Current (MHA)** | O(d·h²) | ✗ | Euclidean | O(d·h) | General purpose |
+| **GAT + Edges** | O(d·h²) | ✓ | Euclidean | O(d·h) | Rich edge info |
+| **Hyperbolic** | O(d·h²) | ✗ | Hyperbolic | O(d·h) | Hierarchical graphs |
+| **Sparse (Local+Global)** | O(k_l + k_g) | ✗ | Euclidean | O((k_l+k_g)·h) | Large graphs |
+| **Linear (Performer)** | O(d·D) | ✗ | Euclidean | O(D·h) | Scalability |
+| **RoPE** | O(d·h²) | Implicit | Euclidean | O(d·h) | Distance encoding |
+| **Flash Attention** | O(d·h²) | ✗ | Euclidean | O(h) | Memory efficiency |
+| **MoE** | Variable | ✓ | Mixed | Variable | Heterogeneous graphs |
+| **Cross (Dual-Space)** | O(d·h² + k²·h) | ✗ | Dual | O((d+k)·h) | Latent-graph bridge |
+
+---
+
+## 11. Recommendations for RuVector
+
+### 11.1 Short-Term (Immediate Implementation)
+
+**1. Edge-Featured Attention**
+- **Priority**: HIGH
+- **Effort**: LOW-MEDIUM
+- **Reason**: HNSW edge weights are currently underutilized
+- **Implementation**: Extend current `MultiHeadAttention` to include edge features
+
+**2. Sparse Attention (Local + Global)**
+- **Priority**: HIGH
+- **Effort**: MEDIUM
+- **Reason**: Natural fit for HNSW's layered structure
+- **Implementation**: Separate attention for layer 0 (local) vs. higher layers (global)
+
+**3. RoPE for Distance Encoding**
+- **Priority**: MEDIUM
+- **Effort**: LOW
+- **Reason**: Encode HNSW layer or edge distance without extra parameters
+- **Implementation**: Apply rotation based on layer index or edge weight
+
+### 11.2 Medium-Term (Next Quarter)
+
+**4. Linear Attention (Performer)**
+- **Priority**: MEDIUM
+- **Effort**: MEDIUM-HIGH
+- **Reason**: Scalability for large graphs
+- **Implementation**: Replace softmax with random feature approximation
+
+**5. Flash Attention**
+- **Priority**: LOW-MEDIUM
+- **Effort**: HIGH
+- **Reason**: Memory efficiency for dense neighborhoods
+- **Implementation**: Tiled computation, may need GPU optimization
+
+### 11.3 Long-Term (Research Exploration)
+
+**6. Hyperbolic Attention**
+- **Priority**: MEDIUM
+- **Effort**: HIGH
+- **Reason**: Hierarchical HNSW structure naturally hyperbolic
+- **Implementation**: Full pipeline change to hyperbolic embeddings
+
+**7. Mixture of Experts**
+- **Priority**: LOW
+- **Effort**: HIGH
+- **Reason**: Heterogeneous graph patterns
+- **Implementation**: Multiple attention types with learned routing
+
+**8. Cross-Attention (Dual-Space)**
+- **Priority**: HIGH (Research)
+- **Effort**: HIGH
+- **Reason**: Core to latent-graph interplay
+- **Implementation**: Requires efficient latent neighbor search (ANN)
+
+---
+
+## 12. Implementation Roadmap
+
+### Phase 1: Extend Current Attention (1-2 weeks)
+```rust
+// Add edge features to existing MultiHeadAttention
+impl MultiHeadAttention {
+    pub fn forward_with_edges(
+        &self,
+        query: &[f32],
+        keys: &[Vec<f32>],
+        values: &[Vec<f32>],
+        edge_features: &[Vec<f32>],  // NEW
+    ) -> Vec<f32> {
+        // Modify attention score computation to include edges
+    }
+}
+```
+
+### Phase 2: Sparse Attention Variant (2-3 weeks)
+```rust
+// Separate local and global attention based on HNSW layer
+pub struct HNSWAwareAttention {
+    local: MultiHeadAttention,
+    global: MultiHeadAttention,
+}
+```
+
+### Phase 3: Alternative Mechanisms (1-2 months)
+- Implement RoPE for distance encoding
+- Prototype Linear Attention
+- Benchmark all variants
+
+### Phase 4: Research Exploration (Ongoing)
+- Hyperbolic embeddings (full pipeline change)
+- MoE attention routing
+- Cross-attention with latent neighbors
+
+---
+
+## References
+
+### Papers
+1. **GAT**: Veličković et al. (2018) - Graph Attention Networks
+2. **Hyperbolic**: Chami et al. (2019) - Hyperbolic Graph Convolutional Neural Networks
+3. **Longformer**: Beltagy et al. (2020) - Longformer: The Long-Document Transformer
+4. **Performer**: Choromanski et al. (2020) - Rethinking Attention with Performers
+5. **RoPE**: Su et al. (2021) - RoFormer: Enhanced Transformer with Rotary Position Embedding
+6. **Flash Attention**: Dao et al. (2022) - FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness
+7. **MoE**: Shazeer et al. (2017) - Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer
+
+### RuVector Code References
+- `crates/ruvector-gnn/src/layer.rs:84-205` - Current MultiHeadAttention
+- `crates/ruvector-gnn/src/search.rs:38-86` - Differentiable search with softmax
+
+---
+
+**Document Version**: 1.0
+**Last Updated**: 2025-11-30
+**Author**: RuVector Research Team