# Alternative Attention Mechanisms for GNN Latent Space

## Executive Summary

This document explores alternative attention mechanisms beyond the scaled dot-product multi-head attention currently used in RuVector. We analyze mechanisms that could better bridge the gap between high-dimensional latent spaces and graph topology, with emphasis on efficiency, expressiveness, and geometric awareness.

**Current**: Multi-head scaled dot-product attention (O(n²) complexity)
**Goal**: Enhance attention to capture graph structure, reduce complexity, and improve latent-graph interplay

---

## 1. Current Attention Mechanism Analysis

### 1.1 Scaled Dot-Product Attention (Current Implementation)

**File**: `crates/ruvector-gnn/src/layer.rs:84-205`

```
Attention(Q, K, V) = softmax(QK^T / √d_k) V
```

**Strengths**:
- ✓ Permutation invariant
- ✓ Differentiable
- ✓ Well-understood training dynamics
- ✓ Parallel computation

**Weaknesses**:
- ✗ No explicit edge features
- ✗ No positional/structural encoding
- ✗ Uniform geometric assumptions (Euclidean)
- ✗ Quadratic O(n²) cost in neighborhood size
- ✗ Attention scores independent of graph topology

### 1.2 Multi-Head Decomposition (Current)

```
MultiHead(Q, K, V) = Concat(head_1, ..., head_h) W_o
```

**Strengths**:
- ✓ Multiple representation subspaces
- ✓ Heads can attend to different aspects of the neighborhood

**Weaknesses**:
- ✗ Fixed number of heads
- ✗ Heads often learn similar patterns (redundancy)
- ✗ No explicit head specialization

---
## 2. Graph Attention Networks (GAT) Extensions

### 2.1 Edge-Featured Attention

**Key Innovation**: Incorporate edge attributes into the attention computation

```
e_{ij} = LeakyReLU(a^T [W h_i || W h_j || W_e edge_{ij}])
α_{ij} = softmax_j(e_{ij})
h'_i = σ(Σ_{j∈N(i)} α_{ij} W h_j)
```

**Implementation Proposal**:

```rust
pub struct EdgeFeaturedAttention {
    w_node: Linear,        // Node transformation
    w_edge: Linear,        // Edge transformation
    a: Vec<f32>,           // Attention coefficients
    activation: LeakyReLU,
}

impl EdgeFeaturedAttention {
    fn forward(
        &self,
        query_node: &[f32],
        neighbor_nodes: &[Vec<f32>],
        edge_features: &[Vec<f32>], // NEW
    ) -> Vec<f32> {
        // 1. Transform nodes and edges
        let q_trans = self.w_node.forward(query_node);
        let n_trans: Vec<_> = neighbor_nodes.iter()
            .map(|n| self.w_node.forward(n))
            .collect();
        let e_trans: Vec<_> = edge_features.iter()
            .map(|e| self.w_edge.forward(e))
            .collect();

        // 2. Compute attention with edge features
        let mut scores = Vec::new();
        for (n, e) in n_trans.iter().zip(e_trans.iter()) {
            // Concatenate [query || neighbor || edge]
            let concat = [&q_trans[..], &n[..], &e[..]].concat();
            let score = dot_product(&self.a, &concat);
            scores.push(self.activation.forward(score));
        }

        // 3. Softmax and aggregate
        let weights = softmax(&scores);
        weighted_sum(&n_trans, &weights)
    }
}
```

**Benefits for RuVector**:
- Edge weights (distances) become learnable features
- HNSW layer information can be encoded in edges
- Better captures graph topology in latent space

**Complexity**: O(d·(h_node + h_edge + h_attn))

---

## 3. Hyperbolic Attention

### 3.1 Motivation

**Problem**: HNSW has hierarchical structure, but Euclidean space poorly represents trees/hierarchies

**Solution**: Operate in hyperbolic space (Poincaré ball or hyperboloid model)

### 3.2 Poincaré Ball Attention

**Poincaré Ball Model**:

```
B^d = {x ∈ R^d : ||x|| < 1}
Distance: d(x, y) = arcosh(1 + 2||x - y||² / ((1-||x||²)(1-||y||²)))
```

**Hyperbolic Attention Mechanism**:

```
# Key differences from Euclidean:
# 1. Use hyperbolic distance for similarity
# 2. Exponential map for transformations
# 3. Logarithmic map for aggregation

HyperbolicAttention(q, k, v):
    # Compute hyperbolic similarity
    sim_ij = -d_poincare(q, k_j)    # Negative distance

    # Softmax in tangent space
    α_ij = softmax(sim_ij / τ)

    # Aggregate in hyperbolic space
    result = ⊕_{j} (α_ij ⊗ v_j)    # Möbius addition
    return result
```

**Implementation Sketch**:

```rust
pub struct HyperbolicAttention {
    curvature: f32, // Negative curvature (e.g., -1.0)
}

impl HyperbolicAttention {
    // Poincaré distance (the formula above assumes curvature -1;
    // a general curvature -c rescales the distance by 1/√c)
    fn poincare_distance(&self, x: &[f32], y: &[f32]) -> f32 {
        let diff_norm_sq = l2_norm_squared(&subtract(x, y));
        let x_norm_sq = l2_norm_squared(x);
        let y_norm_sq = l2_norm_squared(y);

        let numerator = 2.0 * diff_norm_sq;
        let denominator = (1.0 - x_norm_sq) * (1.0 - y_norm_sq);

        (1.0 + numerator / denominator).acosh() / self.curvature.abs().sqrt()
    }

    // Möbius addition (hyperbolic vector addition):
    // x ⊕ y = [(1+2⟨x,y⟩+||y||²)x + (1-||x||²)y] / (1+2⟨x,y⟩+||x||²||y||²)
    fn mobius_add(&self, x: &[f32], y: &[f32]) -> Vec<f32> {
        let x_norm_sq = l2_norm_squared(x);
        let y_norm_sq = l2_norm_squared(y);
        let xy_dot = dot_product(x, y);

        let denominator = 1.0 + 2.0 * xy_dot + x_norm_sq * y_norm_sq;
        let numerator = add(
            &scale(x, 1.0 + 2.0 * xy_dot + y_norm_sq),
            &scale(y, 1.0 - x_norm_sq),
        );
        scale(&numerator, 1.0 / denominator)
    }

    fn forward(
        &self,
        query: &[f32],
        keys: &[Vec<f32>],
        values: &[Vec<f32>],
    ) -> Vec<f32> {
        // 1. Compute hyperbolic similarities (negative distances)
        let scores: Vec<f32> = keys.iter()
            .map(|k| -self.poincare_distance(query, k))
            .collect();

        // 2. Softmax
        let weights = softmax(&scores);
        // 3. Hyperbolic aggregation (mobius_scalar_mult, assumed here,
        // scales in the tangent space via the log/exp maps; note that
        // iterated Möbius addition is order-dependent)
        let mut result = vec![0.0; values[0].len()]; // origin: 0 ⊕ x = x
        for (v, &w) in values.iter().zip(weights.iter()) {
            let scaled = self.mobius_scalar_mult(w, v);
            result = self.mobius_add(&result, &scaled);
        }
        result
    }
}
```

**Benefits for HNSW**:
- Natural representation of hierarchical layers
- Exponential capacity (tree-like structures)
- Distance preserves hierarchy

**Challenges**:
- Numerical instability near the ball boundary (||x|| → 1)
- More complex backpropagation
- Requires hyperbolic embeddings throughout the pipeline

---

## 4. Sparse Attention Patterns

### 4.1 Local + Global Attention (Longformer-style)

**Motivation**: Full attention is O(n²), wasteful for graphs with local structure

**Pattern**:

```
Attention Matrix Structure:

[L L L G 0 0 0 0]
[L L L L G 0 0 0]
[L L L L L G 0 0]
[G L L L L L G 0]
[0 G L L L L L G]
[0 0 G L L L L L]
[0 0 0 G L L L L]
[0 0 0 0 G L L L]

L = Local attention (1-hop neighbors)
G = Global attention (HNSW higher layers)
0 = No attention
```

**Implementation**:

```rust
pub struct SparseGraphAttention {
    local_attn: MultiHeadAttention,
    global_attn: MultiHeadAttention,
    local_window: usize, // K-hop neighborhood
}

impl SparseGraphAttention {
    fn forward(
        &self,
        query: &[f32],
        neighbor_embeddings: &[Vec<f32>],
        neighbor_layers: &[usize], // HNSW layer for each neighbor
    ) -> Vec<f32> {
        // Split neighbors by locality (enumerate yields (index, item),
        // so unzip produces the indices first)
        let (_local_indices, local_neighbors): (Vec<usize>, Vec<_>) =
            neighbor_embeddings.iter().enumerate()
                .filter(|(i, _)| neighbor_layers[*i] == 0) // Layer 0 = local
                .unzip();
        let (_global_indices, global_neighbors): (Vec<usize>, Vec<_>) =
            neighbor_embeddings.iter().enumerate()
                .filter(|(i, _)| neighbor_layers[*i] > 0) // Higher layers = global
                .unzip();

        // Compute local attention
        let local_output = if !local_neighbors.is_empty() {
            self.local_attn.forward(query, &local_neighbors, &local_neighbors)
        } else {
            vec![0.0; query.len()]
        };

        // Compute global attention
        let global_output = if !global_neighbors.is_empty() {
            self.global_attn.forward(
                query,
                &global_neighbors,
                &global_neighbors,
            )
        } else {
            vec![0.0; query.len()]
        };

        // Combine (learned gating)
        combine_local_global(&local_output, &global_output)
    }
}
```

**Complexity**: O(k_local + k_global) instead of O(n²)

---

## 5. Linear Attention (O(n) complexity)

### 5.1 Kernel-Based Linear Attention

**Key Idea**: Replace softmax with a kernel feature map

```
Standard: Attention(Q, K, V) = softmax(QK^T) V
Linear:   Attention(Q, K, V) = φ(Q) (φ(K)^T V) / (φ(Q) (φ(K)^T 1))

where φ: R^d → R^D is a feature map
```

**Random Feature Approximation** (Performer):

```rust
pub struct LinearAttention {
    num_features: usize,          // D (typically 256-512)
    random_features: Array2<f32>, // Random projection matrix
}

impl LinearAttention {
    fn feature_map(&self, x: &[f32]) -> Vec<f32> {
        // Random Fourier Features (simplified; Performer's FAVOR+ uses
        // positive exp-based features to approximate the softmax kernel)
        let proj = self.random_features.dot(&Array1::from_vec(x.to_vec()));
        let scale = 1.0 / (self.num_features as f32).sqrt();
        proj.mapv(|z| scale * (z.cos() + z.sin())).to_vec()
    }

    fn forward(
        &self,
        query: &[f32],
        keys: &[Vec<f32>],
        values: &[Vec<f32>],
    ) -> Vec<f32> {
        // 1. Apply feature map
        let q_feat = self.feature_map(query);
        let k_feats: Vec<_> = keys.iter().map(|k| self.feature_map(k)).collect();

        // 2. Accumulate φ(K)^T V (a D×d matrix) and φ(K)^T 1 (a D-vector)
        //    over all neighbors
        let d_v = values[0].len();
        let mut kv = vec![vec![0.0f32; d_v]; self.num_features];
        let mut k_sum = vec![0.0f32; self.num_features];
        for (k_feat, v) in k_feats.iter().zip(values.iter()) {
            for (f, &kf) in k_feat.iter().enumerate() {
                k_sum[f] += kf;
                for (i, &v_i) in v.iter().enumerate() {
                    kv[f][i] += kf * v_i;
                }
            }
        }

        // 3. Numerator: φ(q)^T (φ(K)^T V)
        let mut numerator = vec![0.0f32; d_v];
        for (f, &qf) in q_feat.iter().enumerate() {
            for i in 0..d_v {
                numerator[i] += qf * kv[f][i];
            }
        }

        // 4. Normalize by φ(q)^T (φ(K)^T 1)
        let denominator: f32 = q_feat.iter().zip(k_sum.iter())
            .map(|(q, k)| q * k)
            .sum();
        numerator.iter().map(|&n| n / denominator).collect()
    }
}
```

**Benefits**:
- **O(n) complexity**: Scales linearly with graph size
- **Theoretically grounded**: Approximates softmax attention
- **Parallel friendly**: Matrix operations

**Tradeoffs**:
- Approximation error vs. exact softmax
- Requires more random features for accuracy
- Less interpretable attention weights

---

## 6. Rotary Position Embeddings (RoPE) for Graphs

### 6.1 Motivation

**Problem**: Graph attention has no notion of "position" or "distance" beyond explicit edge features

**Solution**: Encode relative distances/positions via rotation

### 6.2 RoPE Mathematics

**Standard RoPE** (for sequences):

```
RoPE(x, m) = [
    x₀ cos(mθ₀) - x₁ sin(mθ₀),
    x₀ sin(mθ₀) + x₁ cos(mθ₀),
    x₂ cos(mθ₁) - x₃ sin(mθ₁),
    ...
]

where m = position index, θᵢ = 10000^(-2i/d)
```

**Graph RoPE Adaptation**:

```
Instead of sequential position m, use:
- Graph distance (shortest path length)
- HNSW layer index
- Normalized edge weight
```

**Implementation**:

```rust
pub struct GraphRoPE {
    dim: usize,
    base: f32, // Base frequency (default 10000)
}

impl GraphRoPE {
    fn apply_rotation(&self, embedding: &[f32], distance: f32) -> Vec<f32> {
        // Rotate consecutive pairs of dimensions; assumes `dim` is even
        let mut rotated = vec![0.0; embedding.len()];
        for i in (0..self.dim).step_by(2) {
            let angle = distance / self.base.powf(2.0 * i as f32 / self.dim as f32);
            let cos_a = angle.cos();
            let sin_a = angle.sin();
            rotated[i] = embedding[i] * cos_a - embedding[i + 1] * sin_a;
            rotated[i + 1] = embedding[i] * sin_a + embedding[i + 1] * cos_a;
        }
        rotated
    }

    fn forward_attention(
        &self,
        query: &[f32],
        keys: &[Vec<f32>],
        values: &[Vec<f32>],
        distances: &[f32], // NEW: graph distances
    ) -> Vec<f32> {
        // Apply RoPE to query and keys based on relative distance
        let q_rotated = self.apply_rotation(query, 0.0); // Query at "origin"

        let mut scores = Vec::new();
        for (k, &dist) in keys.iter().zip(distances.iter()) {
            let k_rotated = self.apply_rotation(k, dist);
            let score = dot_product(&q_rotated, &k_rotated);
            scores.push(score);
        }

        let weights = softmax(&scores);
        weighted_sum(values, &weights)
    }
}
```

**Benefits**:
- Encodes distance without explicit features
- Relative position encoding (rotation-invariant)
- Efficient (just rotations, no extra parameters)

**Graph-Specific Applications**:
1. **HNSW Layer Distance**: Encode which layer neighbors come from
2. **Shortest Path Distance**: Penalize far nodes in latent space
3. **Edge Weight Encoding**: Continuous rotation based on edge weight

---

## 7. Flash Attention (Memory-Efficient)

### 7.1 Problem

Standard attention materializes the full attention matrix in memory:

```
Memory: O(n²) for n neighbors
```

For dense graphs or large neighborhoods, this is prohibitive.

### 7.2 Flash Attention Algorithm

**Key Ideas**:
1. Tile the attention computation
2. Recompute attention on-the-fly during the backward pass
3. Never materialize the full attention matrix

**Pseudocode**:

```
FlashAttention(Q, K, V):
    # Divide Q, K, V into blocks
    Q_blocks = split(Q, block_size)
    K_blocks = split(K, block_size)
    V_blocks = split(V, block_size)
    O = zeros_like(Q)

    # Outer loop: iterate over query blocks
    for Q_i in Q_blocks:
        row_max = -inf
        row_sum = 0

        # Inner loop: iterate over key blocks
        for K_j, V_j in zip(K_blocks, V_blocks):
            # Compute attention block
            S_ij = Q_i @ K_j^T / sqrt(d)

            # Online softmax (numerically stable)
            new_max = max(row_max, max(S_ij))
            exp_S = exp(S_ij - new_max)

            # Update running statistics
            correction = exp(row_max - new_max)
            row_sum = row_sum * correction + sum(exp_S)
            row_max = new_max

            # Accumulate output (the previous partial output must be
            # rescaled by the same correction factor)
            O_i = O_i * correction + exp_S @ V_j

        # Final normalization
        O_i /= row_sum

    return O
```

**Implementation Note**: Flash Attention requires careful low-level optimization (CUDA kernels, tiling, SRAM management).
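The online-softmax recurrence above is easy to sanity-check in isolation. The sketch below (illustrative names, scalar values for brevity, not part of the RuVector API) streams scores block by block while keeping only a running max, running sum, and partial output, and should agree with a naive two-pass softmax-weighted sum:

```rust
/// Naive two-pass softmax-weighted sum over scalar values.
pub fn softmax_sum_naive(scores: &[f32], values: &[f32]) -> f32 {
    let max = scores.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = scores.iter().map(|&s| (s - max).exp()).collect();
    let sum: f32 = exps.iter().sum();
    exps.iter().zip(values).map(|(e, v)| e * v).sum::<f32>() / sum
}

/// Streaming version: one pass over blocks, O(1) extra state.
pub fn softmax_sum_streaming(scores: &[f32], values: &[f32], block: usize) -> f32 {
    let (mut row_max, mut row_sum, mut out) = (f32::NEG_INFINITY, 0.0f32, 0.0f32);
    for (s_blk, v_blk) in scores.chunks(block).zip(values.chunks(block)) {
        let new_max = s_blk.iter().cloned().fold(row_max, f32::max);
        // Rescale the previously accumulated state exactly once per block
        let correction = (row_max - new_max).exp();
        row_sum *= correction;
        out *= correction;
        for (&s, &v) in s_blk.iter().zip(v_blk.iter()) {
            let e = (s - new_max).exp();
            row_sum += e;
            out += e * v;
        }
        row_max = new_max;
    }
    out / row_sum
}
```

The key invariant is that the rescale by `correction` is applied once per block, not once per element; the tiled CPU version below uses the same pattern per output coordinate.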
For RuVector:

```rust
// Simplified tiled version for CPU
pub struct TiledAttention {
    block_size: usize,
}

impl TiledAttention {
    fn forward_tiled(
        &self,
        query: &[f32],
        keys: &[Vec<f32>],
        values: &[Vec<f32>],
    ) -> Vec<f32> {
        let n = keys.len();
        let mut output = vec![0.0; query.len()];
        let mut row_sum = 0.0;
        let mut row_max = f32::NEG_INFINITY;

        // Process keys in blocks
        for chunk_start in (0..n).step_by(self.block_size) {
            let chunk_end = (chunk_start + self.block_size).min(n);

            // Compute attention for this block
            let chunk_keys = &keys[chunk_start..chunk_end];
            let chunk_values = &values[chunk_start..chunk_end];

            let scores: Vec<f32> = chunk_keys.iter()
                .map(|k| dot_product(query, k))
                .collect();

            // Online softmax update
            let new_max = scores.iter().copied().fold(row_max, f32::max);
            let exp_scores: Vec<f32> = scores.iter()
                .map(|&s| (s - new_max).exp())
                .collect();

            let correction = (row_max - new_max).exp();
            row_sum = row_sum * correction + exp_scores.iter().sum::<f32>();
            row_max = new_max;

            // Rescale the previously accumulated output once per block,
            // then accumulate this block's weighted values
            for o in output.iter_mut() {
                *o *= correction;
            }
            for (v, &weight) in chunk_values.iter().zip(exp_scores.iter()) {
                for (o, &v_i) in output.iter_mut().zip(v.iter()) {
                    *o += weight * v_i;
                }
            }
        }

        // Final normalization
        output.iter().map(|&o| o / row_sum).collect()
    }
}
```

**Benefits**:
- **Memory**: O(n) instead of O(n²)
- **Speed**: Can be faster due to better cache locality
- **Scalability**: Handles larger neighborhoods

---

## 8. Mixture of Experts (MoE) Attention

### 8.1 Concept

Different attention mechanisms for different graph patterns:

```
MoE-Attention(query, keys, values):
    # Router decides which expert(s) to use
    router_scores = Router(query)
    expert_indices = topk(router_scores, k=2)

    # Apply selected experts
    outputs = []
    for expert_idx in expert_indices:
        expert_output = Experts[expert_idx](query, keys, values)
        outputs.append(expert_output * router_scores[expert_idx])

    return sum(outputs)
```

**Graph-Specific Experts**:
1. **Local Expert**: For 1-hop neighbors (standard attention)
2. **Hierarchical Expert**: For HNSW higher layers (hyperbolic attention)
3. **Global Expert**: For distant nodes (linear attention)
4. **Structural Expert**: Edge-featured attention

### 8.2 Implementation

```rust
pub enum AttentionExpert {
    Standard(MultiHeadAttention),
    Hyperbolic(HyperbolicAttention),
    Linear(LinearAttention),
    EdgeFeatured(EdgeFeaturedAttention),
}

pub struct MoEAttention {
    router: Linear, // Maps query to expert scores
    experts: Vec<AttentionExpert>,
    top_k: usize,
}

impl MoEAttention {
    fn forward(
        &self,
        query: &[f32],
        keys: &[Vec<f32>],
        values: &[Vec<f32>],
        edge_features: Option<&[Vec<f32>]>,
    ) -> Vec<f32> {
        // 1. Route to experts
        let router_scores = self.router.forward(query);
        let expert_weights = softmax(&router_scores);
        let top_experts = topk_indices(&expert_weights, self.top_k);

        // 2. Compute weighted expert outputs
        let mut output = vec![0.0; query.len()];
        for &expert_idx in &top_experts {
            let expert_output = match &self.experts[expert_idx] {
                AttentionExpert::Standard(attn) =>
                    attn.forward(query, keys, values),
                AttentionExpert::Hyperbolic(attn) =>
                    attn.forward(query, keys, values),
                AttentionExpert::Linear(attn) =>
                    attn.forward(query, keys, values),
                AttentionExpert::EdgeFeatured(attn) =>
                    attn.forward(query, keys, values, edge_features.unwrap()),
            };

            let weight = expert_weights[expert_idx];
            for (o, &e) in output.iter_mut().zip(expert_output.iter()) {
                *o += weight * e;
            }
        }
        output
    }
}
```

**Benefits**:
- Adaptive to different graph neighborhoods
- Specialization reduces computation
- Router learns which mechanism suits which context

---

## 9. Cross-Attention Between Graph and Latent

### 9.1 Motivation

**Problem**: Current attention only looks at graph neighbors. What about latent space neighbors?
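To make "latent space neighbors" concrete: for a node embedding, they are simply the top-k most similar embeddings, regardless of whether an edge exists. A minimal brute-force sketch (O(n·d) per query; helper names are illustrative, and in practice this lookup would go through the ANN index itself):

```rust
/// Cosine similarity between two equal-length vectors.
pub fn cosine(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let na: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let nb: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    dot / (na * nb + 1e-12)
}

/// Indices of the k embeddings most similar to `query` (brute force).
pub fn latent_neighbors(query: &[f32], all: &[Vec<f32>], k: usize) -> Vec<usize> {
    let mut scored: Vec<(usize, f32)> = all.iter()
        .map(|e| cosine(query, e))
        .enumerate()
        .collect();
    // Sort by similarity, descending
    scored.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap());
    scored.into_iter().take(k).map(|(i, _)| i).collect()
}
```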
**Solution**: Cross-attention between topological neighbors (graph) and semantic neighbors (latent)

### 9.2 Dual-Space Attention

```
Given node v:
- Graph neighbors:  N_G(v) = {u : (u,v) ∈ E}
- Latent neighbors: N_L(v) = TopK({u : sim(h_u, h_v) > threshold})

CrossAttention(v):
    # Graph attention
    graph_out = Attention(h_v, {h_u}_{u∈N_G}, {h_u}_{u∈N_G})

    # Latent attention
    latent_out = Attention(h_v, {h_u}_{u∈N_L}, {h_u}_{u∈N_L})

    # Cross-attention: graph queries latent
    cross_out = Attention(graph_out, {h_u}_{u∈N_L}, {h_u}_{u∈N_L})

    # Fusion
    return Combine(graph_out, latent_out, cross_out)
```

**Implementation**:

```rust
pub struct DualSpaceAttention {
    graph_attn: MultiHeadAttention,
    latent_attn: MultiHeadAttention,
    cross_attn: MultiHeadAttention,
    fusion: Linear,
}

impl DualSpaceAttention {
    fn forward(
        &self,
        query: &[f32],
        graph_neighbors: &[Vec<f32>],
        all_embeddings: &[Vec<f32>], // For latent neighbor search
        k_latent: usize,
    ) -> Vec<f32> {
        // 1. Graph attention (topology-based)
        let graph_output = self.graph_attn.forward(
            query, graph_neighbors, graph_neighbors,
        );

        // 2. Find latent neighbors (similarity-based)
        let latent_neighbors = self.find_latent_neighbors(
            query, all_embeddings, k_latent,
        );

        // 3. Latent attention (embedding-based)
        let latent_output = self.latent_attn.forward(
            query, &latent_neighbors, &latent_neighbors,
        );

        // 4. Cross-attention (graph context attends to latent space)
        let cross_output = self.cross_attn.forward(
            &graph_output, &latent_neighbors, &latent_neighbors,
        );
        // 5. Fusion
        let concatenated = [
            &graph_output[..],
            &latent_output[..],
            &cross_output[..],
        ].concat();
        self.fusion.forward(&concatenated)
    }

    fn find_latent_neighbors(
        &self,
        query: &[f32],
        all_embeddings: &[Vec<f32>],
        k: usize,
    ) -> Vec<Vec<f32>> {
        // Compute similarities
        let mut similarities: Vec<(usize, f32)> = all_embeddings
            .iter()
            .enumerate()
            .map(|(i, emb)| (i, cosine_similarity(query, emb)))
            .collect();

        // Sort by similarity (descending)
        similarities.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap());

        // Return top-k
        similarities.iter()
            .take(k)
            .map(|(i, _)| all_embeddings[*i].clone())
            .collect()
    }
}
```

**Benefits**:
- Bridges topology and semantics
- Captures "similar but not connected" nodes
- Enriches latent space with graph structure

---

## 10. Comparison Matrix

| Mechanism | Complexity | Edge Features | Geometry | Memory | Use Case |
|-----------|------------|---------------|----------|--------|----------|
| **Current (MHA)** | O(d·h²) | ✗ | Euclidean | O(d·h) | General purpose |
| **GAT + Edges** | O(d·h²) | ✓ | Euclidean | O(d·h) | Rich edge info |
| **Hyperbolic** | O(d·h²) | ✗ | Hyperbolic | O(d·h) | Hierarchical graphs |
| **Sparse (Local+Global)** | O(k_l + k_g) | ✗ | Euclidean | O((k_l+k_g)·h) | Large graphs |
| **Linear (Performer)** | O(d·D) | ✗ | Euclidean | O(D·h) | Scalability |
| **RoPE** | O(d·h²) | Implicit | Euclidean | O(d·h) | Distance encoding |
| **Flash Attention** | O(d·h²) | ✗ | Euclidean | O(h) | Memory efficiency |
| **MoE** | Variable | ✓ | Mixed | Variable | Heterogeneous graphs |
| **Cross (Dual-Space)** | O(d·h² + k²·h) | ✗ | Dual | O((d+k)·h) | Latent-graph bridge |

---

## 11. Recommendations for RuVector

### 11.1 Short-Term (Immediate Implementation)

**1. Edge-Featured Attention**
- **Priority**: HIGH
- **Effort**: LOW-MEDIUM
- **Reason**: HNSW edge weights are currently underutilized
- **Implementation**: Extend the current `MultiHeadAttention` to include edge features
**2. Sparse Attention (Local + Global)**
- **Priority**: HIGH
- **Effort**: MEDIUM
- **Reason**: Natural fit for HNSW's layered structure
- **Implementation**: Separate attention for layer 0 (local) vs. higher layers (global)

**3. RoPE for Distance Encoding**
- **Priority**: MEDIUM
- **Effort**: LOW
- **Reason**: Encode HNSW layer or edge distance without extra parameters
- **Implementation**: Apply rotation based on layer index or edge weight

### 11.2 Medium-Term (Next Quarter)

**4. Linear Attention (Performer)**
- **Priority**: MEDIUM
- **Effort**: MEDIUM-HIGH
- **Reason**: Scalability for large graphs
- **Implementation**: Replace softmax with a random feature approximation

**5. Flash Attention**
- **Priority**: LOW-MEDIUM
- **Effort**: HIGH
- **Reason**: Memory efficiency for dense neighborhoods
- **Implementation**: Tiled computation, may need GPU optimization

### 11.3 Long-Term (Research Exploration)

**6. Hyperbolic Attention**
- **Priority**: MEDIUM
- **Effort**: HIGH
- **Reason**: Hierarchical HNSW structure is naturally hyperbolic
- **Implementation**: Full pipeline change to hyperbolic embeddings

**7. Mixture of Experts**
- **Priority**: LOW
- **Effort**: HIGH
- **Reason**: Heterogeneous graph patterns
- **Implementation**: Multiple attention types with learned routing

**8. Cross-Attention (Dual-Space)**
- **Priority**: HIGH (Research)
- **Effort**: HIGH
- **Reason**: Core to latent-graph interplay
- **Implementation**: Requires efficient latent neighbor search (ANN)

---
## 12. Implementation Roadmap

### Phase 1: Extend Current Attention (1-2 weeks)

```rust
// Add edge features to the existing MultiHeadAttention
impl MultiHeadAttention {
    pub fn forward_with_edges(
        &self,
        query: &[f32],
        keys: &[Vec<f32>],
        values: &[Vec<f32>],
        edge_features: &[Vec<f32>], // NEW
    ) -> Vec<f32> {
        // Modify attention score computation to include edges
    }
}
```

### Phase 2: Sparse Attention Variant (2-3 weeks)

```rust
// Separate local and global attention based on HNSW layer
pub struct HNSWAwareAttention {
    local: MultiHeadAttention,
    global: MultiHeadAttention,
}
```

### Phase 3: Alternative Mechanisms (1-2 months)
- Implement RoPE for distance encoding
- Prototype Linear Attention
- Benchmark all variants

### Phase 4: Research Exploration (Ongoing)
- Hyperbolic embeddings (full pipeline change)
- MoE attention routing
- Cross-attention with latent neighbors

---

## References

### Papers

1. **GAT**: Veličković et al. (2018) - Graph Attention Networks
2. **Hyperbolic**: Chami et al. (2019) - Hyperbolic Graph Convolutional Neural Networks
3. **Longformer**: Beltagy et al. (2020) - Longformer: The Long-Document Transformer
4. **Performer**: Choromanski et al. (2020) - Rethinking Attention with Performers
5. **RoPE**: Su et al. (2021) - RoFormer: Enhanced Transformer with Rotary Position Embedding
6. **Flash Attention**: Dao et al. (2022) - FlashAttention: Fast and Memory-Efficient Exact Attention
7. **MoE**: Shazeer et al. (2017) - Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer

### RuVector Code References

- `crates/ruvector-gnn/src/layer.rs:84-205` - Current MultiHeadAttention
- `crates/ruvector-gnn/src/search.rs:38-86` - Differentiable search with softmax

---

**Document Version**: 1.0
**Last Updated**: 2025-11-30
**Author**: RuVector Research Team