# SPARC Pseudocode: RuVector Attention Mechanisms

## Executive Summary

This document provides comprehensive pseudocode for all attention mechanisms proposed for the RuVector GNN latent-graph interplay system. Following SPARC methodology, this serves as the bridge between specification (requirements) and architecture (system design).

**Scope**: Complete algorithmic specifications for attention mechanisms, training procedures, and optimization strategies.

**Target Audience**: Implementers who will translate these algorithms into Rust code.

**Conventions**:
- UPPERCASE: Algorithm names, constants
- lowercase: Variables, parameters
- `←`: Assignment
- `∈`: Set membership
- Arrays are 0-indexed unless otherwise specified
- All complexity analysis uses Big-O notation

---

## Table of Contents

1. [Core Attention Mechanisms](#1-core-attention-mechanisms)
2. [Geometric Attention](#2-geometric-attention)
3. [Sparse Attention](#3-sparse-attention)
4. [Graph Attention](#4-graph-attention)
5. [Adaptive Attention](#5-adaptive-attention)
6. [Training Procedures](#6-training-procedures)
7. [Data Structures](#7-data-structures)
8. [Complexity Summary](#8-complexity-summary)

---

## 1. Core Attention Mechanisms

### 1.1 Scaled Dot-Product Attention

**Purpose**: Foundation attention mechanism for all variants

**Complexity**:
- Time: O(n·d) for a single query, where n = number of keys, d = embedding dimension (n dot products of cost O(d), then an O(n·d) weighted sum)
- Space: O(n)

```
ALGORITHM: ScaledDotProductAttention
INPUT:
    Q: query vector [d]
    K: key matrix [n × d]
    V: value matrix [n × d]
    d_k: key dimension (scalar)
OUTPUT:
    output: attention output [d]
    weights: attention weights [n]

BEGIN
    // 1. Compute attention scores
    scores ← EMPTY_ARRAY[n]
    FOR i ← 0 TO n-1 DO
        scores[i] ← DotProduct(Q, K[i]) / sqrt(d_k)
    END FOR

    // 2. Apply softmax for normalization
    weights ← Softmax(scores)

    // 3.
    // Weighted sum of values
    output ← ZeroVector(d)
    FOR i ← 0 TO n-1 DO
        output ← output + weights[i] * V[i]
    END FOR

    RETURN output, weights
END

SUBROUTINE: DotProduct
INPUT: x[d], y[d]
OUTPUT: scalar
BEGIN
    sum ← 0
    FOR i ← 0 TO d-1 DO
        sum ← sum + x[i] * y[i]
    END FOR
    RETURN sum
END

SUBROUTINE: Softmax
INPUT: scores[n]
OUTPUT: probabilities[n]
BEGIN
    // Numerical stability: subtract max
    max_score ← Max(scores)

    exp_scores ← EMPTY_ARRAY[n]
    sum_exp ← 0
    FOR i ← 0 TO n-1 DO
        exp_scores[i] ← exp(scores[i] - max_score)
        sum_exp ← sum_exp + exp_scores[i]
    END FOR

    probabilities ← EMPTY_ARRAY[n]
    FOR i ← 0 TO n-1 DO
        probabilities[i] ← exp_scores[i] / sum_exp
    END FOR

    RETURN probabilities
END
```

---

### 1.2 Multi-Head Attention

**Purpose**: Learn multiple representation subspaces simultaneously

**Complexity**:
- Time: O(n·d_model²) in total. Each head projects n keys and n values at O(n·d_model·d_head) = O(n·d_model²/h), where h = number of heads, and there are h heads
- Space: O(n·d_head + d_model) if the per-head buffers K_h and V_h are reused across heads

```
ALGORITHM: MultiHeadAttention
INPUT:
    Q: query vector [d_model]
    K: key matrix [n × d_model]
    V: value matrix [n × d_model]
    num_heads: number of attention heads
    W_Q: query projection weights [num_heads × d_head × d_model]
    W_K: key projection weights [num_heads × d_head × d_model]
    W_V: value projection weights [num_heads × d_head × d_model]
    W_O: output projection weights [d_model × d_model]
OUTPUT:
    output: multi-head attention output [d_model]

CONSTANTS:
    d_head ← d_model / num_heads

BEGIN
    heads ← EMPTY_ARRAY[num_heads]

    // 1. Project and compute attention for each head
    FOR h ← 0 TO num_heads-1 DO
        // Project query
        Q_h ← LinearTransform(Q, W_Q[h])  // [d_head]

        // Project keys
        K_h ← EMPTY_MATRIX[n × d_head]
        FOR i ← 0 TO n-1 DO
            K_h[i] ← LinearTransform(K[i], W_K[h])
        END FOR

        // Project values
        V_h ← EMPTY_MATRIX[n × d_head]
        FOR i ← 0 TO n-1 DO
            V_h[i] ← LinearTransform(V[i], W_V[h])
        END FOR

        // Compute attention for this head
        head_output, _ ← ScaledDotProductAttention(Q_h, K_h, V_h, d_head)
        heads[h] ← head_output
    END FOR

    // 2. Concatenate all heads
    concat ← Concatenate(heads[0], heads[1], ..., heads[num_heads-1])

    // 3.
    // Final linear projection
    output ← LinearTransform(concat, W_O)

    RETURN output
END

SUBROUTINE: LinearTransform
INPUT: x[d_in], W[d_out × d_in]
OUTPUT: y[d_out]
BEGIN
    y ← ZeroVector(d_out)
    FOR i ← 0 TO d_out-1 DO
        FOR j ← 0 TO d_in-1 DO
            y[i] ← y[i] + W[i][j] * x[j]
        END FOR
    END FOR
    RETURN y
END

SUBROUTINE: Concatenate
INPUT: vectors... (variable number of vectors)
OUTPUT: concatenated vector
BEGIN
    total_dim ← Sum of all input dimensions
    result ← EMPTY_ARRAY[total_dim]
    offset ← 0

    FOR EACH vector IN vectors DO
        FOR i ← 0 TO Length(vector)-1 DO
            result[offset + i] ← vector[i]
        END FOR
        offset ← offset + Length(vector)
    END FOR

    RETURN result
END
```

---

## 2. Geometric Attention

### 2.1 Hyperbolic Attention (Poincaré Ball Model)

**Purpose**: Capture hierarchical structure using hyperbolic geometry

**Complexity**:
- Time: O(n·d) for a single query (each distance and Möbius operation costs O(d)), with more expensive ops than the Euclidean case
- Space: O(n)

**Geometric Background**:
```
Poincaré Ball: B^d = {x ∈ R^d : ||x|| < 1}
Distance: d_P(x,y) = arcosh(1 + 2||x-y||²/((1-||x||²)(1-||y||²)))
```

```
ALGORITHM: HyperbolicAttention
INPUT:
    query: query point in Poincaré ball [d]
    keys: key points in Poincaré ball [n × d]
    values: value points in Poincaré ball [n × d]
    curvature: negative curvature (typically -1.0)
    temperature: softmax temperature
OUTPUT:
    output: aggregated point in Poincaré ball [d]

BEGIN
    // 1. Compute hyperbolic distances as similarity scores
    scores ← EMPTY_ARRAY[n]
    FOR i ← 0 TO n-1 DO
        // Negative distance = similarity (closer = higher score)
        scores[i] ← -PoincareDistance(query, keys[i], curvature)
    END FOR

    // 2. Softmax to get attention weights
    weights ← Softmax(scores / temperature)

    // 3.
    // Hyperbolic weighted aggregation using Möbius addition
    result ← ZeroVector(d)  // Origin in Poincaré ball

    FOR i ← 0 TO n-1 DO
        // Scale value by weight using Möbius scalar multiplication
        scaled_value ← MobiusScalarMult(weights[i], values[i], curvature)

        // Add to result using Möbius addition
        result ← MobiusAdd(result, scaled_value, curvature)
    END FOR

    RETURN result
END

SUBROUTINE: PoincareDistance
INPUT: x[d], y[d], curvature
OUTPUT: distance (scalar)
BEGIN
    c ← -curvature  // c > 0; for the unit Poincaré ball, c = 1

    // Compute squared norms
    x_norm_sq ← L2NormSquared(x)
    y_norm_sq ← L2NormSquared(y)

    // Ensure points are inside the ball (c·||x||² < 1, c·||y||² < 1)
    IF c * x_norm_sq >= 1.0 OR c * y_norm_sq >= 1.0 THEN
        ERROR "Points must be inside Poincaré ball"
    END IF

    // Compute squared distance between points
    diff ← Subtract(x, y)
    diff_norm_sq ← L2NormSquared(diff)

    // Poincaré distance formula for curvature -c:
    // d(x,y) = (1/√c) * arcosh(1 + 2c||x-y||² / ((1 - c||x||²)(1 - c||y||²)))
    // With c = 1 this reduces to the formula in the Geometric Background.
    numerator ← 2.0 * c * diff_norm_sq
    denominator ← (1.0 - c * x_norm_sq) * (1.0 - c * y_norm_sq)
    arg ← 1.0 + numerator / denominator

    // Numerical stability: clamp arg >= 1.0
    IF arg < 1.0 THEN
        arg ← 1.0
    END IF

    distance ← arcosh(arg) / sqrt(c)
    RETURN distance
END

SUBROUTINE: MobiusAdd
INPUT: x[d], y[d], curvature
OUTPUT: z[d] (Möbius sum x ⊕ y)
BEGIN
    // Special case: if x is origin, return y
    IF IsZero(x) THEN
        RETURN y
    END IF

    // Special case: if y is origin, return x
    IF IsZero(y) THEN
        RETURN x
    END IF

    // Compute norms and dot product
    x_norm_sq ← L2NormSquared(x)
    y_norm_sq ← L2NormSquared(y)
    xy_dot ← DotProduct(x, y)

    // Möbius addition formula:
    // z = ((1 + 2c⟨x,y⟩ + c||y||²)x + (1 - c||x||²)y) /
    //     (1 + 2c⟨x,y⟩ + c²||x||²||y||²)

    c ← -curvature  // For Poincaré ball, typically c = 1

    numerator_x_coef ← 1.0 + 2.0*c*xy_dot + c*y_norm_sq
    numerator_y_coef ← 1.0 - c*x_norm_sq
    denominator ← 1.0 + 2.0*c*xy_dot + c*c*x_norm_sq*y_norm_sq

    numerator ← Add(
        Scale(x, numerator_x_coef),
        Scale(y, numerator_y_coef)
    )

    z ← Scale(numerator, 1.0 / denominator)

    // Project back to ball if numerical errors pushed outside
    z_norm ← L2Norm(z)
    IF z_norm >= 1.0 THEN
        z ← Scale(z, 0.99 /
                  z_norm)  // Project to ball with margin
    END IF

    RETURN z
END

SUBROUTINE: MobiusScalarMult
INPUT: r (scalar), x[d], curvature
OUTPUT: r ⊗ x (Möbius scalar multiplication)
BEGIN
    // Handle special cases
    IF r == 0 OR IsZero(x) THEN
        RETURN ZeroVector(d)
    END IF

    x_norm ← L2Norm(x)
    c ← -curvature

    // Möbius scalar multiplication:
    // r ⊗ x = (1/√c) * tanh(r * arctanh(√c * ||x||)) * (x / ||x||)

    sqrt_c ← sqrt(c)
    arctanh_arg ← sqrt_c * x_norm

    // Numerical stability
    IF arctanh_arg >= 1.0 THEN
        arctanh_arg ← 0.999
    END IF

    arctanh_val ← arctanh(arctanh_arg)
    tanh_arg ← r * arctanh_val
    tanh_val ← tanh(tanh_arg)

    scale_factor ← (1.0 / sqrt_c) * tanh_val / x_norm

    result ← Scale(x, scale_factor)
    RETURN result
END

SUBROUTINE: L2NormSquared
INPUT: x[d]
OUTPUT: ||x||² (scalar)
BEGIN
    sum ← 0
    FOR i ← 0 TO d-1 DO
        sum ← sum + x[i] * x[i]
    END FOR
    RETURN sum
END

SUBROUTINE: L2Norm
INPUT: x[d]
OUTPUT: ||x|| (scalar)
BEGIN
    RETURN sqrt(L2NormSquared(x))
END

SUBROUTINE: Subtract
INPUT: x[d], y[d]
OUTPUT: x - y [d]
BEGIN
    result ← EMPTY_ARRAY[d]
    FOR i ← 0 TO d-1 DO
        result[i] ← x[i] - y[i]
    END FOR
    RETURN result
END

SUBROUTINE: Add
INPUT: x[d], y[d]
OUTPUT: x + y [d]
BEGIN
    result ← EMPTY_ARRAY[d]
    FOR i ← 0 TO d-1 DO
        result[i] ← x[i] + y[i]
    END FOR
    RETURN result
END

SUBROUTINE: Scale
INPUT: x[d], scalar
OUTPUT: scalar * x [d]
BEGIN
    result ← EMPTY_ARRAY[d]
    FOR i ← 0 TO d-1 DO
        result[i] ← scalar * x[i]
    END FOR
    RETURN result
END

SUBROUTINE: IsZero
INPUT: x[d]
OUTPUT: boolean
BEGIN
    epsilon ← 1e-10
    RETURN L2Norm(x) < epsilon
END
```

---

## 3. Sparse Attention

### 3.1 Local + Global Sparse Attention

**Purpose**: Reduce O(n²) to O(k_local + k_global) for large graphs

**Complexity**:
- Time: O(k_local·d + k_global·d) for the attention itself, where k_local, k_global << n, plus an O(n) scan to partition the neighbors
- Space: O(k_local + k_global)

```
ALGORITHM: SparseLocalGlobalAttention
INPUT:
    query: query vector [d]
    all_neighbors: all neighbor embeddings [n × d]
    neighbor_layers: HNSW layer for each neighbor [n]
    local_window: size of local neighborhood
    global_indices: indices of global attention nodes
OUTPUT:
    output: attention output [d]

BEGIN
    // 1. Partition neighbors into local and global
    local_neighbors ← EMPTY_LIST
    local_indices ← EMPTY_LIST
    global_neighbors ← EMPTY_LIST
    global_indices_actual ← EMPTY_LIST

    FOR i ← 0 TO n-1 DO
        IF neighbor_layers[i] == 0 AND Length(local_neighbors) < local_window THEN
            // Layer 0 = local neighbors
            local_neighbors.Append(all_neighbors[i])
            local_indices.Append(i)
        ELSE IF neighbor_layers[i] > 0 AND i IN global_indices THEN
            // Higher layers = global neighbors
            global_neighbors.Append(all_neighbors[i])
            global_indices_actual.Append(i)
        END IF
    END FOR

    // 2. Compute local attention
    local_output ← ZeroVector(d)
    IF Length(local_neighbors) > 0 THEN
        local_K ← ConvertToMatrix(local_neighbors)
        local_V ← local_K  // Self-attention

        local_output, _ ← ScaledDotProductAttention(
            query, local_K, local_V, d
        )
    END IF

    // 3. Compute global attention
    global_output ← ZeroVector(d)
    IF Length(global_neighbors) > 0 THEN
        global_K ← ConvertToMatrix(global_neighbors)
        global_V ← global_K

        global_output, _ ← ScaledDotProductAttention(
            query, global_K, global_V, d
        )
    END IF

    // 4. Learned gating to combine local and global
    alpha ← LearnedGate(query, local_output, global_output)

    // 5.
    // Combine outputs
    output ← ZeroVector(d)
    FOR i ← 0 TO d-1 DO
        output[i] ← alpha * local_output[i] + (1.0 - alpha) * global_output[i]
    END FOR

    RETURN output
END

SUBROUTINE: LearnedGate
INPUT: query[d], local_output[d], global_output[d]
OUTPUT: alpha (scalar in [0, 1])
BEGIN
    // Concatenate all inputs
    concat ← Concatenate(query, local_output, global_output)

    // Linear projection + sigmoid
    gate_weights ← LEARNED_PARAMETERS[3*d]  // Learned during training
    bias ← LEARNED_BIAS                     // Learned during training

    logit ← DotProduct(concat, gate_weights) + bias
    alpha ← Sigmoid(logit)

    RETURN alpha
END

SUBROUTINE: Sigmoid
INPUT: x (scalar)
OUTPUT: sigmoid(x) in [0, 1]
BEGIN
    RETURN 1.0 / (1.0 + exp(-x))
END
```

---

### 3.2 Linear Attention (Performer / Random Features)

**Purpose**: Complexity linear in n using kernel approximation

**Complexity**:
- Time: O(n·D·d) where D = number of random features
- Space: O(D·d)

```
ALGORITHM: LinearAttention
INPUT:
    query: query vector [d]
    keys: key matrix [n × d]
    values: value matrix [n × d]
    num_features: number of random features D
    random_matrix: random projection matrix [D × d]
OUTPUT:
    output: attention output [d]

BEGIN
    // 1. Apply feature map to query
    phi_Q ← FeatureMap(query, random_matrix, num_features)

    // 2. Apply feature map to all keys
    phi_K ← EMPTY_MATRIX[n × num_features]
    FOR i ← 0 TO n-1 DO
        phi_K[i] ← FeatureMap(keys[i], random_matrix, num_features)
    END FOR

    // 3. Compute K^T V (sum over neighbors) - O(n·D·d)
    KV_sum ← ZeroMatrix(num_features, d)
    FOR i ← 0 TO n-1 DO
        FOR j ← 0 TO num_features-1 DO
            FOR k ← 0 TO d-1 DO
                KV_sum[j][k] ← KV_sum[j][k] + phi_K[i][j] * values[i][k]
            END FOR
        END FOR
    END FOR

    // 4. Compute Q·(K^T V) - O(D·d)
    numerator ← ZeroVector(d)
    FOR k ← 0 TO d-1 DO
        FOR j ← 0 TO num_features-1 DO
            numerator[k] ← numerator[k] + phi_Q[j] * KV_sum[j][k]
        END FOR
    END FOR

    // 5.
    // Compute K^T 1 (sum of feature-mapped keys) - O(n·D)
    K_sum ← ZeroVector(num_features)
    FOR i ← 0 TO n-1 DO
        FOR j ← 0 TO num_features-1 DO
            K_sum[j] ← K_sum[j] + phi_K[i][j]
        END FOR
    END FOR

    // 6. Compute denominator Q·(K^T 1) - O(D)
    denominator ← DotProduct(phi_Q, K_sum)

    // 7. Normalize
    // The epsilon only guards against an exactly zero denominator; cos/sin
    // features can be negative, so the denominator is not guaranteed positive.
    output ← Scale(numerator, 1.0 / (denominator + 1e-10))

    RETURN output
END

SUBROUTINE: FeatureMap
INPUT: x[d], random_matrix[D × d], num_features D
OUTPUT: features[D]
BEGIN
    // Random Fourier Features
    // φ(x) = sqrt(1/D) * [cos(w₁·x), sin(w₁·x), cos(w₂·x), sin(w₂·x), ...]
    // Each row of random_matrix yields two features, so only the first D/2
    // rows are used; D is assumed to be even.

    scale ← 1.0 / sqrt(num_features)
    features ← EMPTY_ARRAY[num_features]

    FOR i ← 0 TO num_features/2 - 1 DO
        // Get random projection
        w ← random_matrix[i]
        projection ← DotProduct(w, x)

        // Apply cos and sin
        features[2*i] ← scale * cos(projection)
        features[2*i + 1] ← scale * sin(projection)
    END FOR

    RETURN features
END

SUBROUTINE: ZeroMatrix
INPUT: rows, cols
OUTPUT: matrix[rows × cols]
BEGIN
    matrix ← EMPTY_MATRIX[rows × cols]
    FOR i ← 0 TO rows-1 DO
        FOR j ← 0 TO cols-1 DO
            matrix[i][j] ← 0.0
        END FOR
    END FOR
    RETURN matrix
END
```

---

### 3.3 Flash Attention (Tiled / Memory-Efficient)

**Purpose**: O(n) memory instead of O(n²) through tiling

**Complexity**:
- Time: O(n·d) per query, i.e. O(n²·d) over n queries (same as standard, but better cache locality)
- Space: O(B + d) working memory per query for the current tile and the accumulator; over n queries this is O(n) instead of O(n²)

```
ALGORITHM: FlashAttention
INPUT:
    query: query vector [d]
    keys: key matrix [n × d]
    values: value matrix [n × d]
    block_size: tile size B (typically 64-128)
OUTPUT:
    output: attention output [d]

BEGIN
    n ← Length(keys)
    output ← ZeroVector(d)
    row_max ← -INFINITY
    row_sum ← 0.0

    num_blocks ← Ceiling(n / block_size)

    // Process keys/values in blocks (tiles)
    FOR block_idx ← 0 TO num_blocks-1 DO
        // 1. Define current block range
        chunk_start ← block_idx * block_size
        chunk_end ← Min(chunk_start + block_size, n)
        chunk_size ← chunk_end - chunk_start

        // 2. Extract block of keys and values
        chunk_K ← keys[chunk_start : chunk_end]
        chunk_V ← values[chunk_start : chunk_end]

        // 3.
        // Compute attention scores for this block
        scores ← EMPTY_ARRAY[chunk_size]
        FOR i ← 0 TO chunk_size-1 DO
            scores[i] ← DotProduct(query, chunk_K[i]) / sqrt(d)
        END FOR

        // 4. Online softmax: update running max
        new_max ← Max(row_max, Max(scores))

        // 5. Compute exponentials with new max
        exp_scores ← EMPTY_ARRAY[chunk_size]
        FOR i ← 0 TO chunk_size-1 DO
            exp_scores[i] ← exp(scores[i] - new_max)
        END FOR

        // 6. Correction factor for previous blocks
        correction ← exp(row_max - new_max)

        // 7. Update running sum of exponentials
        chunk_sum ← Sum(exp_scores)
        row_sum ← row_sum * correction + chunk_sum

        // 8. Update running max
        row_max ← new_max

        // 9. Accumulate weighted values with correction
        FOR i ← 0 TO d-1 DO
            output[i] ← output[i] * correction
        END FOR

        FOR i ← 0 TO chunk_size-1 DO
            FOR j ← 0 TO d-1 DO
                output[j] ← output[j] + exp_scores[i] * chunk_V[i][j]
            END FOR
        END FOR
    END FOR

    // 10. Final normalization
    FOR i ← 0 TO d-1 DO
        output[i] ← output[i] / row_sum
    END FOR

    RETURN output
END

SUBROUTINE: Max
INPUT: array[n] OR two scalars
OUTPUT: maximum value
BEGIN
    IF array is provided THEN
        max_val ← array[0]
        FOR i ← 1 TO Length(array)-1 DO
            IF array[i] > max_val THEN
                max_val ← array[i]
            END IF
        END FOR
        RETURN max_val
    ELSE
        // Two scalars
        RETURN IF (a > b) THEN a ELSE b
    END IF
END

SUBROUTINE: Sum
INPUT: array[n]
OUTPUT: sum of elements
BEGIN
    total ← 0
    FOR i ← 0 TO Length(array)-1 DO
        total ← total + array[i]
    END FOR
    RETURN total
END

SUBROUTINE: Ceiling
INPUT: x (real number)
OUTPUT: ⌈x⌉ (smallest integer >= x)
BEGIN
    RETURN integer ceiling of x
END

SUBROUTINE: Min
INPUT: a, b (scalars)
OUTPUT: minimum value
BEGIN
    RETURN IF (a < b) THEN a ELSE b
END
```

---

## 4. Graph Attention

### 4.1 Edge-Featured Attention

**Purpose**: Incorporate edge attributes into attention computation

**Complexity**:
- Time: O(n·(d² + d_edge·d_attn))
- Space: O(n)

```
ALGORITHM: EdgeFeaturedAttention
INPUT:
    query: query node embedding [d]
    keys: neighbor node embeddings [n × d]
    values: neighbor node embeddings [n × d]
    edge_features: edge attributes [n × d_edge]
    W_node: node transformation matrix [d × d]
    W_edge: edge transformation matrix [d_attn × d_edge]  // maps d_edge → d_attn
    a: attention coefficient vector [2d + d_attn]
OUTPUT:
    output: aggregated embedding [d]

BEGIN
    // 1. Transform query
    q_trans ← MatrixVectorMult(W_node, query)

    // 2. Transform all keys and edge features
    k_trans ← EMPTY_MATRIX[n × d]
    e_trans ← EMPTY_MATRIX[n × d_attn]

    FOR i ← 0 TO n-1 DO
        k_trans[i] ← MatrixVectorMult(W_node, keys[i])
        e_trans[i] ← MatrixVectorMult(W_edge, edge_features[i])
    END FOR

    // 3. Compute attention scores with edge features
    scores ← EMPTY_ARRAY[n]
    FOR i ← 0 TO n-1 DO
        // Concatenate [query || key || edge]
        concat ← Concatenate(q_trans, k_trans[i], e_trans[i])

        // Attention coefficient
        score ← DotProduct(a, concat)

        // Activation (LeakyReLU)
        scores[i] ← LeakyReLU(score, alpha=0.2)
    END FOR

    // 4. Softmax normalization
    weights ← Softmax(scores)

    // 5.
    // Weighted aggregation
    output ← WeightedSum(values, weights)

    RETURN output
END

SUBROUTINE: MatrixVectorMult
INPUT: M[m × n], v[n]
OUTPUT: result[m]
BEGIN
    result ← ZeroVector(m)
    FOR i ← 0 TO m-1 DO
        FOR j ← 0 TO n-1 DO
            result[i] ← result[i] + M[i][j] * v[j]
        END FOR
    END FOR
    RETURN result
END

SUBROUTINE: LeakyReLU
INPUT: x (scalar), alpha (negative slope)
OUTPUT: activated value
BEGIN
    IF x >= 0 THEN
        RETURN x
    ELSE
        RETURN alpha * x
    END IF
END

SUBROUTINE: WeightedSum
INPUT: vectors[n × d], weights[n]
OUTPUT: result[d]
BEGIN
    result ← ZeroVector(d)
    FOR i ← 0 TO n-1 DO
        FOR j ← 0 TO d-1 DO
            result[j] ← result[j] + weights[i] * vectors[i][j]
        END FOR
    END FOR
    RETURN result
END
```

---

### 4.2 RoPE Graph Attention

**Purpose**: Encode graph distances via rotary position embeddings

**Complexity**:
- Time: O(n·d) for a single query (each rotation and each dot product costs O(d))
- Space: O(n)

```
ALGORITHM: RoPEGraphAttention
INPUT:
    query: query node embedding [d]
    keys: neighbor node embeddings [n × d]
    values: neighbor node embeddings [n × d]
    distances: graph distances to neighbors [n]
    base: RoPE frequency base (default 10000)
OUTPUT:
    output: attention output [d]

BEGIN
    // 1. Apply RoPE rotation to query (at origin, distance = 0)
    Q_rotated ← ApplyRotation(query, distance=0.0, base)

    // 2. Apply RoPE rotation to keys based on their distances
    K_rotated ← EMPTY_MATRIX[n × d]
    FOR i ← 0 TO n-1 DO
        K_rotated[i] ← ApplyRotation(keys[i], distances[i], base)
    END FOR

    // 3. Compute attention scores with rotated embeddings
    scores ← EMPTY_ARRAY[n]
    FOR i ← 0 TO n-1 DO
        scores[i] ← DotProduct(Q_rotated, K_rotated[i])
    END FOR

    // 4.
    // Softmax and aggregate
    weights ← Softmax(scores)
    output ← WeightedSum(values, weights)

    RETURN output
END

SUBROUTINE: ApplyRotation
INPUT: embedding[d], distance (scalar), base
OUTPUT: rotated[d]
BEGIN
    // Assumes d is even; with odd d the last dimension would remain zero
    rotated ← ZeroVector(d)

    // Apply rotation to pairs of dimensions
    FOR i ← 0 TO d/2 - 1 DO
        // Compute rotation angle for this dimension pair
        theta ← distance / (base ^ (2.0 * i / d))

        cos_theta ← cos(theta)
        sin_theta ← sin(theta)

        // Rotate dimensions (2*i, 2*i+1)
        rotated[2*i] ← embedding[2*i] * cos_theta - embedding[2*i+1] * sin_theta
        rotated[2*i+1] ← embedding[2*i] * sin_theta + embedding[2*i+1] * cos_theta
    END FOR

    RETURN rotated
END
```

---

### 4.3 Cross-Space (Dual) Attention

**Purpose**: Bridge graph topology and latent space semantics

**Complexity**:
- Time: O(n_graph·d² + k_latent·d² + k_latent²·d)
- Space: O(n_graph + k_latent)

```
ALGORITHM: DualSpaceAttention
INPUT:
    query: query node embedding [d]
    graph_neighbors: topological neighbors [n_graph × d]
    all_embeddings: all node embeddings for latent search [N × d]
    k_latent: number of latent neighbors
OUTPUT:
    output: fused embedding [d]

BEGIN
    // MultiHeadAttention returns a single output vector (no weights);
    // its projection weights are omitted from the calls below for brevity.

    // 1. Graph attention (topology-based)
    graph_output ← MultiHeadAttention(
        query,
        graph_neighbors,
        graph_neighbors,
        num_heads=8
    )

    // 2. Find latent neighbors (similarity-based)
    latent_neighbors ← FindTopKSimilar(query, all_embeddings, k_latent)

    // 3. Latent attention (embedding-based)
    latent_output ← MultiHeadAttention(
        query,
        latent_neighbors,
        latent_neighbors,
        num_heads=8
    )

    // 4. Cross-attention (graph context queries latent space)
    cross_output ← MultiHeadAttention(
        graph_output,  // Use graph output as query
        latent_neighbors,
        latent_neighbors,
        num_heads=8
    )

    // 5. Fusion of all three outputs
    concatenated ← Concatenate(graph_output, latent_output, cross_output)

    // 6.
    // Final projection
    W_fusion ← LEARNED_WEIGHTS[d × 3d]
    output ← MatrixVectorMult(W_fusion, concatenated)

    RETURN output
END

SUBROUTINE: FindTopKSimilar
INPUT: query[d], all_embeddings[N × d], k
OUTPUT: top_k_embeddings[k × d]
BEGIN
    similarities ← EMPTY_ARRAY[N]

    // 1. Compute cosine similarity to all embeddings
    FOR i ← 0 TO N-1 DO
        similarities[i] ← CosineSimilarity(query, all_embeddings[i])
    END FOR

    // 2. Find top-k indices
    top_k_indices ← TopKIndices(similarities, k)

    // 3. Extract top-k embeddings
    top_k_embeddings ← EMPTY_MATRIX[k × d]
    FOR i ← 0 TO k-1 DO
        top_k_embeddings[i] ← all_embeddings[top_k_indices[i]]
    END FOR

    RETURN top_k_embeddings
END

SUBROUTINE: CosineSimilarity
INPUT: x[d], y[d]
OUTPUT: similarity in [-1, 1]
BEGIN
    dot ← DotProduct(x, y)
    norm_x ← L2Norm(x)
    norm_y ← L2Norm(y)

    // Avoid division by zero
    IF norm_x == 0 OR norm_y == 0 THEN
        RETURN 0.0
    END IF

    RETURN dot / (norm_x * norm_y)
END

SUBROUTINE: TopKIndices
INPUT: array[N], k
OUTPUT: indices[k]
BEGIN
    // Create (index, value) pairs
    pairs ← EMPTY_ARRAY[N]
    FOR i ← 0 TO N-1 DO
        pairs[i] ← (i, array[i])
    END FOR

    // Sort by value (descending)
    Sort(pairs, by=value, order=descending)

    // Extract top-k indices
    indices ← EMPTY_ARRAY[k]
    FOR i ← 0 TO k-1 DO
        indices[i] ← pairs[i].index
    END FOR

    RETURN indices
END
```

---

## 5. Adaptive Attention

### 5.1 Mixture of Experts (MoE) Attention

**Purpose**: Route to specialized attention mechanisms based on context

**Complexity**:
- Time: O(K · attention_complexity) where K = top-k experts (typically 2)
- Space: O(num_experts · model_size)

```
ALGORITHM: MoEAttention
INPUT:
    query: query node embedding [d]
    keys: neighbor embeddings [n × d]
    values: neighbor embeddings [n × d]
    experts: list of attention mechanisms
    router: routing network
    top_k: number of experts to use (typically 2)
OUTPUT:
    output: expert-mixed output [d]

EXPERT_TYPES:
    1. Standard Multi-Head Attention
    2. Hyperbolic Attention
    3. Linear Attention
    4. Edge-Featured Attention

BEGIN
    num_experts ← Length(experts)

    // 1.
    // Router computes expert scores
    router_logits ← RouterNetwork(query, router)
    router_probs ← Softmax(router_logits)

    // 2. Select top-k experts
    top_k_indices ← TopKIndices(router_probs, top_k)

    // 3. Normalize selected expert weights
    selected_weights ← EMPTY_ARRAY[top_k]
    weight_sum ← 0.0
    FOR i ← 0 TO top_k-1 DO
        expert_idx ← top_k_indices[i]
        selected_weights[i] ← router_probs[expert_idx]
        weight_sum ← weight_sum + selected_weights[i]
    END FOR

    // Normalize
    FOR i ← 0 TO top_k-1 DO
        selected_weights[i] ← selected_weights[i] / weight_sum
    END FOR

    // 4. Compute weighted expert outputs
    output ← ZeroVector(d)
    FOR i ← 0 TO top_k-1 DO
        expert_idx ← top_k_indices[i]
        expert ← experts[expert_idx]

        // Call appropriate expert
        expert_output ← CALL_EXPERT(expert, query, keys, values)

        // Weighted accumulation
        weight ← selected_weights[i]
        FOR j ← 0 TO d-1 DO
            output[j] ← output[j] + weight * expert_output[j]
        END FOR
    END FOR

    RETURN output
END

SUBROUTINE: RouterNetwork
INPUT: query[d], router_weights
OUTPUT: logits[num_experts]
BEGIN
    // Simple two-layer MLP
    hidden_size ← 4 * d

    // First layer
    W1 ← router_weights.layer1  // [hidden_size × d]
    b1 ← router_weights.bias1   // [hidden_size]

    hidden ← MatrixVectorMult(W1, query)
    FOR i ← 0 TO hidden_size-1 DO
        hidden[i] ← ReLU(hidden[i] + b1[i])
    END FOR

    // Second layer
    W2 ← router_weights.layer2  // [num_experts × hidden_size]
    b2 ← router_weights.bias2   // [num_experts]

    logits ← MatrixVectorMult(W2, hidden)
    FOR i ← 0 TO num_experts-1 DO
        logits[i] ← logits[i] + b2[i]
    END FOR

    RETURN logits
END

SUBROUTINE: CALL_EXPERT
INPUT: expert, query, keys, values
OUTPUT: expert_output[d]
BEGIN
    MATCH expert.type:
        CASE "standard":
            RETURN MultiHeadAttention(query, keys, values, num_heads=8)

        CASE "hyperbolic":
            RETURN HyperbolicAttention(query, keys, values, curvature=-1.0)

        CASE "linear":
            RETURN LinearAttention(query, keys, values, num_features=256)

        CASE "edge_featured":
            edge_features ← expert.edge_features
            RETURN EdgeFeaturedAttention(query, keys, values, edge_features)

        DEFAULT:
            ERROR "Unknown expert type"
    END MATCH
END

SUBROUTINE: ReLU
INPUT: x (scalar)
OUTPUT: max(0, x)
BEGIN
    RETURN IF (x > 0) THEN x ELSE 0
END
```

---

### 5.2 Learned Navigation (Reinforcement Learning)

**Purpose**: Learn optimal navigation policy for graph traversal

**Complexity**:
- Time: O(num_steps · d²) per navigation episode
- Space: O(graph_size + policy_params)

```
ALGORITHM: RLNavigationStep
INPUT:
    current_state: current navigation state
    policy_network: learned policy (neural network)
    value_network: value estimator
    graph: graph structure
OUTPUT:
    action: which neighbor to visit
    reward: immediate reward
    next_state: resulting state
    state_value: value estimate for the current state

STATE_REPRESENTATION:
    current_embedding: [d]
    query_embedding: [d]
    graph_features: [d_graph]
    history: [max_steps × d]

BEGIN
    // 1. Encode current state
    state_vector ← EncodeState(current_state)

    // 2. Policy network outputs action logits
    action_logits ← PolicyNetwork(state_vector, policy_network)

    // 3. Value network estimates state value
    state_value ← ValueNetwork(state_vector, value_network)

    // 4. Sample action from policy
    action_probs ← Softmax(action_logits)
    action ← SampleCategorical(action_probs)  // Which neighbor to visit

    // 5. Execute action (move to selected neighbor)
    next_node ← current_state.neighbors[action]

    // 6. Compute reward
    reward ← ComputeReward(current_state, next_node, current_state.query)

    // 7.
    // Update state
    next_state ← UpdateState(current_state, next_node, action)

    RETURN action, reward, next_state, state_value
END

SUBROUTINE: EncodeState
INPUT: state
OUTPUT: state_vector[d_state]
BEGIN
    // Concatenate all state components
    state_vector ← Concatenate(
        state.current_embedding,
        state.query_embedding,
        state.graph_features,
        Flatten(state.history)
    )
    RETURN state_vector
END

SUBROUTINE: PolicyNetwork
INPUT: state_vector[d_state], policy_params
OUTPUT: action_logits[num_neighbors]
BEGIN
    // Three-layer MLP (ReLU applied element-wise)
    hidden1 ← ReLU(Linear(state_vector, policy_params.W1, policy_params.b1))
    hidden2 ← ReLU(Linear(hidden1, policy_params.W2, policy_params.b2))
    logits ← Linear(hidden2, policy_params.W3, policy_params.b3)

    RETURN logits
END

SUBROUTINE: ValueNetwork
INPUT: state_vector[d_state], value_params
OUTPUT: value (scalar)
BEGIN
    // Three-layer MLP ending in scalar (ReLU applied element-wise)
    hidden1 ← ReLU(Linear(state_vector, value_params.W1, value_params.b1))
    hidden2 ← ReLU(Linear(hidden1, value_params.W2, value_params.b2))
    value ← Linear(hidden2, value_params.W3, value_params.b3)[0]  // Scalar output

    RETURN value
END

SUBROUTINE: ComputeReward
INPUT: current_state, next_node, query
OUTPUT: reward (scalar)
BEGIN
    // Reward based on similarity improvement
    current_similarity ← CosineSimilarity(
        current_state.current_embedding,
        query
    )

    next_similarity ← CosineSimilarity(
        next_node.embedding,
        query
    )

    // Positive reward if moving closer, negative if farther
    reward ← next_similarity - current_similarity

    // Bonus for reaching goal
    IF next_similarity > GOAL_THRESHOLD THEN
        reward ← reward + GOAL_BONUS
    END IF

    // Penalty for taking too many steps
    reward ← reward - STEP_PENALTY

    RETURN reward
END

SUBROUTINE: SampleCategorical
INPUT: probabilities[n]
OUTPUT: sampled_index in [0, n-1]
BEGIN
    // Sample from categorical distribution
    cumsum ← 0.0
    rand ← Random()  // Uniform [0, 1)

    FOR i ← 0 TO n-1 DO
        cumsum ← cumsum + probabilities[i]
        IF rand < cumsum THEN
            RETURN i
        END IF
    END FOR

    // Fallback (shouldn't reach here if
    // probabilities sum to 1)
    RETURN n-1
END

SUBROUTINE: UpdateState
INPUT: current_state, next_node, action
OUTPUT: new_state
BEGIN
    new_state ← COPY(current_state)

    // Update current node
    new_state.current_node ← next_node
    new_state.current_embedding ← next_node.embedding

    // Update history (sliding window)
    new_state.history.PopFirst()
    new_state.history.Append(next_node.embedding)

    // Increment step counter
    new_state.num_steps ← new_state.num_steps + 1

    RETURN new_state
END

SUBROUTINE: Linear
INPUT: x[d_in], W[d_out × d_in], b[d_out]
OUTPUT: y[d_out]
BEGIN
    y ← MatrixVectorMult(W, x)
    FOR i ← 0 TO d_out-1 DO
        y[i] ← y[i] + b[i]
    END FOR
    RETURN y
END
```

---

## 6. Training Procedures

### 6.1 InfoNCE Contrastive Loss

**Purpose**: Learn embeddings that are similar to positives and dissimilar to negatives

**Complexity**:
- Time: O((n_pos + n_neg) · d)
- Space: O(n_pos + n_neg)

```
ALGORITHM: InfoNCELoss
INPUT:
    anchor: anchor embedding [d]
    positives: positive samples [n_pos × d]
    negatives: negative samples [n_neg × d]
    temperature: softmax temperature (typically 0.07)
OUTPUT:
    loss: contrastive loss (scalar)

BEGIN
    // 1. Compute positive similarities
    pos_scores ← EMPTY_ARRAY[n_pos]
    FOR i ← 0 TO n_pos-1 DO
        sim ← CosineSimilarity(anchor, positives[i])
        pos_scores[i] ← sim / temperature
    END FOR

    // 2. Compute negative similarities
    neg_scores ← EMPTY_ARRAY[n_neg]
    FOR i ← 0 TO n_neg-1 DO
        sim ← CosineSimilarity(anchor, negatives[i])
        neg_scores[i] ← sim / temperature
    END FOR

    // 3.
    // InfoNCE loss (average over positives)
    total_loss ← 0.0

    FOR i ← 0 TO n_pos-1 DO
        // Numerator: exp(positive score)
        numerator ← exp(pos_scores[i])

        // Denominator: sum of exp(positive score) + all exp(negative scores)
        denominator ← numerator
        FOR j ← 0 TO n_neg-1 DO
            denominator ← denominator + exp(neg_scores[j])
        END FOR

        // Log probability
        log_prob ← log(numerator / denominator)

        // Accumulate negative log probability
        total_loss ← total_loss - log_prob
    END FOR

    // Average over positives
    loss ← total_loss / n_pos

    RETURN loss
END
```

---

### 6.2 Hard Negative Sampling

**Purpose**: Select informative negative samples for faster learning

**Complexity**:
- Time: O(N·d + N log N) where N = total number of samples (an O(N·d) similarity scan plus the full sort performed by TopKIndices)
- Space: O(k) where k = number of hard negatives

```
ALGORITHM: SampleHardNegatives
INPUT:
    anchor: anchor embedding [d]
    all_embeddings: all available embeddings [N × d]
    true_positives: indices of true positives
    k: number of hard negatives to sample
    strategy: sampling strategy ("distance", "degree", "mixed")
OUTPUT:
    hard_negatives: selected hard negative samples [k × d]

BEGIN
    // 1. Filter out true positives
    candidate_indices ← EMPTY_LIST
    FOR i ← 0 TO N-1 DO
        IF i NOT IN true_positives THEN
            candidate_indices.Append(i)
        END IF
    END FOR

    n_candidates ← Length(candidate_indices)

    // 2.
    // Select hard negatives based on strategy
    MATCH strategy:
        CASE "distance":
            hard_negatives ← SampleByDistance(
                anchor, all_embeddings, candidate_indices, k
            )

        CASE "degree":
            hard_negatives ← SampleByDegree(
                anchor, all_embeddings, candidate_indices, k
            )

        CASE "mixed":
            k_dist ← k / 2
            k_deg ← k - k_dist

            dist_negs ← SampleByDistance(
                anchor, all_embeddings, candidate_indices, k_dist
            )
            deg_negs ← SampleByDegree(
                anchor, all_embeddings, candidate_indices, k_deg
            )

            hard_negatives ← Concatenate(dist_negs, deg_negs)

        DEFAULT:
            ERROR "Unknown strategy"
    END MATCH

    RETURN hard_negatives
END

SUBROUTINE: SampleByDistance
INPUT: anchor[d], all_embeddings[N × d], candidate_indices, k
OUTPUT: hard_negatives[k × d]
BEGIN
    // Select k most similar candidates (hardest negatives)
    similarities ← EMPTY_ARRAY[Length(candidate_indices)]

    FOR i ← 0 TO Length(candidate_indices)-1 DO
        idx ← candidate_indices[i]
        similarities[i] ← CosineSimilarity(anchor, all_embeddings[idx])
    END FOR

    // Get top-k most similar (hardest)
    top_k_local_indices ← TopKIndices(similarities, k)

    // Map back to global indices
    hard_negatives ← EMPTY_MATRIX[k × d]
    FOR i ← 0 TO k-1 DO
        local_idx ← top_k_local_indices[i]
        global_idx ← candidate_indices[local_idx]
        hard_negatives[i] ← all_embeddings[global_idx]
    END FOR

    RETURN hard_negatives
END

SUBROUTINE: SampleByDegree
INPUT: anchor[d], all_embeddings[N × d], candidate_indices, k
OUTPUT: hard_negatives[k × d]
BEGIN
    // Select candidates with similar degree to anchor
    // GetDegree is not defined in this document; it is assumed to return the
    // graph degree of the node that owns the given embedding.
    anchor_degree ← GetDegree(anchor)

    degree_diffs ← EMPTY_ARRAY[Length(candidate_indices)]
    FOR i ← 0 TO Length(candidate_indices)-1 DO
        idx ← candidate_indices[i]
        candidate_degree ← GetDegree(all_embeddings[idx])
        degree_diffs[i] ← abs(anchor_degree - candidate_degree)
    END FOR

    // Get k candidates with most similar degree
    top_k_local_indices ← TopKIndices(
        NegateArray(degree_diffs),  // Negate for similarity
        k
    )

    hard_negatives ← EMPTY_MATRIX[k × d]
    FOR i ← 0 TO k-1 DO
        local_idx ← top_k_local_indices[i]
        global_idx ←
            candidate_indices[local_idx]
        hard_negatives[i] ← all_embeddings[global_idx]
    END FOR

    RETURN hard_negatives
END

SUBROUTINE: NegateArray
INPUT: array[n]
OUTPUT: negated[n]

BEGIN
    negated ← EMPTY_ARRAY[n]
    FOR i ← 0 TO n-1 DO
        negated[i] ← -array[i]
    END FOR
    RETURN negated
END
```

---

### 6.3 Curriculum Learning Schedule

**Purpose**: Gradually increase task difficulty during training by scheduling the weight of each loss component

**Complexity**:
- Time: O(1) per epoch (just weight computation)
- Space: O(num_losses)

```
ALGORITHM: CurriculumSchedule
INPUT:
    current_epoch: current training epoch
    total_epochs: total number of epochs
    loss_types: list of loss components
    using_continual_learning: whether the EWC penalty is active (boolean)
OUTPUT:
    loss_weights: weight for each loss component

LOSS_TYPES:
    - reconstruction: Autoencoder reconstruction loss
    - contrastive: InfoNCE contrastive loss
    - task: Downstream task loss
    - spectral: Laplacian regularization
    - ewc: Elastic Weight Consolidation

BEGIN
    loss_weights ← EMPTY_MAP

    // NOTE: the breakpoints below (10, 50, 100 epochs) are fixed constants;
    // total_epochs is not used to rescale them

    // 1. Reconstruction: High early, decay exponentially
    lambda_recon ← exp(-current_epoch / 50.0)
    loss_weights["reconstruction"] ← lambda_recon

    // 2. Contrastive: Ramp up linearly from 0.1 to 1.0 over the first 10 epochs
    IF current_epoch < 10 THEN
        lambda_contrast ← 0.1 + 0.9 * (current_epoch / 10.0)
    ELSE
        lambda_contrast ← 1.0
    END IF
    loss_weights["contrastive"] ← lambda_contrast

    // 3. Task: Hold at 0.1 for the first 50 epochs, then ramp linearly to 1.0 by epoch 100
    IF current_epoch < 50 THEN
        lambda_task ← 0.1
    ELSE
        lambda_task ← 0.1 + 0.9 * ((current_epoch - 50) / 50.0)
        lambda_task ← Min(lambda_task, 1.0)
    END IF
    loss_weights["task"] ← lambda_task

    // 4. Spectral: Constant moderate weight
    loss_weights["spectral"] ← 0.01

    // 5.
    // EWC: Increase if using continual learning
    IF using_continual_learning THEN
        lambda_ewc ← Min(current_epoch / 100.0, 1.0)
    ELSE
        lambda_ewc ← 0.0
    END IF
    loss_weights["ewc"] ← lambda_ewc

    RETURN loss_weights
END
```

---

### 6.4 Multi-Objective Loss Computation

**Purpose**: Combine multiple loss functions with learned or scheduled weights

**Complexity**:
- Time: O(num_losses)
- Space: O(1)

```
ALGORITHM: MultiObjectiveLoss
INPUT:
    loss_components: computed loss values
    loss_weights: weights for each component
    auto_balance: whether to auto-balance weights
OUTPUT:
    total_loss: weighted sum of losses
    updated_weights: potentially updated weights

LOSS_COMPONENTS:
    task_loss: Main task objective
    contrastive_loss: InfoNCE or similar
    reconstruction_loss: Autoencoder
    spectral_loss: Laplacian smoothness
    ewc_loss: Continual learning penalty

BEGIN
    // 1. Auto-balance (optional)
    IF auto_balance THEN
        loss_weights ← AutoBalance(loss_components, loss_weights)
    END IF

    // 2. Compute weighted sum
    total_loss ← 0.0

    total_loss ← total_loss + loss_weights["task"] * loss_components.task_loss
    total_loss ← total_loss + loss_weights["contrastive"] * loss_components.contrastive_loss
    total_loss ← total_loss + loss_weights["reconstruction"] * loss_components.reconstruction_loss
    total_loss ← total_loss + loss_weights["spectral"] * loss_components.spectral_loss
    total_loss ← total_loss + loss_weights["ewc"] * loss_components.ewc_loss

    RETURN total_loss, loss_weights
END

SUBROUTINE: AutoBalance
INPUT: loss_components, current_weights
OUTPUT: balanced_weights

BEGIN
    // Normalize so each loss contributes equally
    num_losses ← 5

    // Compute current contribution of each loss
    contributions ← EMPTY_MAP
    contributions["task"] ← current_weights["task"] * loss_components.task_loss
    contributions["contrastive"] ← current_weights["contrastive"] * loss_components.contrastive_loss
    contributions["reconstruction"] ← current_weights["reconstruction"] * loss_components.reconstruction_loss
    contributions["spectral"] ←
        current_weights["spectral"] * loss_components.spectral_loss
    contributions["ewc"] ← current_weights["ewc"] * loss_components.ewc_loss

    // Compute total and target per-loss contribution
    total ← Sum(contributions.values)
    target_contribution ← total / num_losses

    // Adjust weights to equalize contributions
    // NOTE: a component whose loss is near zero receives a very large weight
    // (target_contribution / epsilon); clamp or skip such components in practice
    balanced_weights ← EMPTY_MAP
    epsilon ← 1e-10  // Avoid division by zero

    balanced_weights["task"] ← target_contribution / Max(loss_components.task_loss, epsilon)
    balanced_weights["contrastive"] ← target_contribution / Max(loss_components.contrastive_loss, epsilon)
    balanced_weights["reconstruction"] ← target_contribution / Max(loss_components.reconstruction_loss, epsilon)
    balanced_weights["spectral"] ← target_contribution / Max(loss_components.spectral_loss, epsilon)
    balanced_weights["ewc"] ← target_contribution / Max(loss_components.ewc_loss, epsilon)

    RETURN balanced_weights
END
```

---

### 6.5 Spectral Regularization

**Purpose**: Preserve graph structure through Laplacian smoothness

**Complexity**:
- Time: O(|E|·d) where |E| = number of edges
- Space: O(1) (streaming computation)

```
ALGORITHM: LaplacianRegularization
INPUT:
    embeddings: node embeddings [N × d]
    edges: edge list [(u, v)]
    edge_weights: optional edge weights [|E|]
    normalized: whether to use normalized Laplacian
    node_degrees: node degrees [N]
OUTPUT:
    spectral_loss: smoothness penalty (scalar)

BEGIN
    total_loss ← 0.0
    num_edges ← Length(edges)

    // An empty edge list carries no smoothness penalty (and would divide by zero below)
    IF num_edges = 0 THEN
        RETURN 0.0
    END IF

    FOR i ← 0 TO num_edges-1 DO
        u, v ← edges[i]

        // Compute embedding difference
        diff ← Subtract(embeddings[u], embeddings[v])
        diff_norm_sq ← L2NormSquared(diff)

        // Get edge weight
        weight ← 1.0
        IF edge_weights PROVIDED THEN
            weight ← edge_weights[i]
        END IF

        // Normalized Laplacian: weight by degrees.
        // This scales the edge weight by 1 / sqrt(deg(u)·deg(v)), an approximation of the
        // normalized-Laplacian form, which scales each embedding by 1 / sqrt(degree) before differencing
        IF normalized THEN
            degree_norm ← sqrt(node_degrees[u] * node_degrees[v])
            weight ← weight / Max(degree_norm, 1.0)
        END IF

        // Accumulate weighted squared difference
        total_loss ← total_loss + weight * diff_norm_sq
    END FOR

    // Average over edges
    spectral_loss ← total_loss / num_edges
    RETURN spectral_loss
END
```

---

## 7. Data Structures

### 7.1 Attention State

```
STRUCTURE: AttentionState
FIELDS:
    query: [d]                  // Query embedding
    keys: [n × d]               // Key embeddings
    values: [n × d]             // Value embeddings
    attention_weights: [n]      // Computed weights
    output: [d]                 // Final output
    metadata: Map               // Additional info

OPERATIONS:
    Initialize(query, keys, values)
    ComputeWeights() → attention_weights
    ComputeOutput() → output
    GetMetadata(key) → value
```

---

### 7.2 Graph Structure

```
STRUCTURE: Graph
FIELDS:
    nodes: [N]                              // Node identifiers
    embeddings: [N × d]                     // Node embeddings
    adjacency: [N × N] OR SparseMatrix      // Adjacency matrix
    edge_list: [(u, v)]                     // Edge list
    edge_features: [|E| × d_edge]           // Edge attributes
    node_degrees: [N]                       // Degree of each node

OPERATIONS:
    GetNeighbors(node_id) → [neighbor_ids]
    GetEdgeFeature(u, v) → [d_edge]
    GetDegree(node_id) → scalar
    AddEdge(u, v, features)
    UpdateEmbedding(node_id, new_embedding)
```

---

### 7.3 HNSW-Specific Structure

```
STRUCTURE: HNSWGraph
EXTENDS: Graph

ADDITIONAL_FIELDS:
    layers: [max_layer]                     // Layer-wise graphs
    entry_point: node_id                    // Top-layer entry
    max_layer: integer                      // Maximum layer
    layer_neighbors: Map<(node, layer), [neighbors]>

OPERATIONS:
    GetLayerNeighbors(node_id, layer) → [neighbor_ids]
    GetNodeLayer(node_id) → layer
    NavigateLayer(query, layer, num_steps) → closest_node
    InsertNode(node_id, embedding, layer)
```

---

### 7.4 Training State

```
STRUCTURE: TrainingState
FIELDS:
    current_epoch: integer
    loss_history: [num_epochs]
    loss_weights: Map
    curriculum_schedule: CurriculumSchedule
    optimizer_state: OptimizerState
    best_model_params: ModelParams
    early_stopping_counter: integer

OPERATIONS:
    UpdateEpoch()
    RecordLoss(loss_value)
    GetLossWeight(loss_type) → weight
    UpdateBestModel(current_params)
    ShouldEarlyStop() → boolean
```

---

## 8. Complexity Summary

### 8.1 Attention Mechanisms

| Mechanism | Time Complexity | Space Complexity | Notes |
|-----------|----------------|------------------|-------|
| Scaled Dot-Product | O(n·d²) | O(n) | Standard attention |
| Multi-Head (h heads) | O(n·d²/h) | O(h·d) | Parallel heads |
| Hyperbolic | O(n·d²) | O(n) | More expensive ops |
| Sparse (Local+Global) | O((k_l + k_g)·d) | O(k_l + k_g) | k << n |
| Linear (Performer) | O(n·D·d) | O(D·d) | D = random features |
| Flash | O(n²·d) | O(n) | Better cache locality |
| Edge-Featured | O(n·(d² + d_edge·d)) | O(n) | Added edge cost |
| RoPE | O(n·d²) | O(n) | Rotation overhead minimal |
| Cross-Space | O(n_g·d² + k_l·d²) | O(n_g + k_l) | Dual attention |
| MoE (k experts) | O(k·base_complexity) | O(num_experts·model_size) | Expert routing |

**Legend**:
- n: number of neighbors/keys
- d: embedding dimension
- h: number of attention heads
- k_l, k_g: local and global neighbor counts
- D: number of random features
- d_edge: edge feature dimension

---

### 8.2 Training Operations

| Operation | Time Complexity | Space Complexity | Notes |
|-----------|----------------|------------------|-------|
| InfoNCE Loss | O((n_pos + n_neg)·d) | O(n_pos + n_neg) | Per anchor |
| Hard Negative Sampling | O(N·d) | O(k) | N = total samples |
| Spectral Regularization | O(\|E\|·d) | O(1) | E = edges |
| Curriculum Schedule | O(1) | O(num_losses) | Per epoch |
| Multi-Objective Loss | O(num_losses) | O(1) | Weighted sum |

---

## 9. Implementation Notes

### 9.1 Numerical Stability

**Softmax Stability**:
```
// Always subtract max before exp
max_score ← Max(scores)
exp_scores[i] ← exp(scores[i] - max_score)
```

**Hyperbolic Boundary**:
```
// Ensure points stay in Poincaré ball
IF ||x|| >= 1.0 THEN
    x ← 0.99 * x / ||x||  // Project back with margin
END IF
```

**Division by Zero**:
```
// Add epsilon to non-negative denominators
result ← numerator / (denominator + 1e-10)
```

---

### 9.2 Performance Optimization

**Vectorization**:
- Use SIMD operations for dot products
- Batch matrix multiplications
- Parallelize independent attention heads

**Memory Layout**:
- Contiguous memory for cache efficiency
- Match the layout to the access pattern: row-major for the row-wise K[i] and V[i] access used in this document, column-major where a matrix kernel expects it
- Pre-allocate buffers

**Lazy Computation**:
- Only compute attention weights when needed
- Cache frequently accessed embeddings
- Prune low-weight attention connections

---

### 9.3 Testing Strategies

**Unit Tests**:
```
TEST: ScaledDotProductAttention
    INPUT: Known query, keys, values
    EXPECTED: Hand-computed output
    VERIFY: Output matches expected within tolerance

TEST: Softmax Numerical Stability
    INPUT: Very large scores [1000, 999, 998]
    VERIFY: No NaN or Inf in output
    VERIFY: Probabilities sum to 1.0

TEST: Hyperbolic Boundary
    INPUT: Points near ball boundary (||x|| = 0.99)
    VERIFY: Result still in ball (||result|| < 1.0)
```

**Integration Tests**:
```
TEST: End-to-End Attention Pipeline
    INPUT: Real graph structure
    VERIFY: All mechanisms produce valid outputs
    VERIFY: Outputs are differentiable
```

**Performance Tests**:
```
BENCHMARK: Attention Complexity
    INPUT: Varying n = [10, 100, 1000, 10000]
    MEASURE: Time and memory usage
    VERIFY: Matches theoretical complexity
```

---

## 10. References

### 10.1 Core Papers

1. **Attention Mechanism**: Vaswani et al. (2017) - "Attention Is All You Need"
2. **GAT**: Veličković et al. (2018) - "Graph Attention Networks"
3. **Hyperbolic GNNs**: Chami et al. (2019) - "Hyperbolic Graph Convolutional Neural Networks"
4. **Performer**: Choromanski et al. (2020) - "Rethinking Attention with Performers"
5. **Flash Attention**: Dao et al. (2022) - "FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness"
6. **RoPE**: Su et al. (2021) - "RoFormer: Enhanced Transformer with Rotary Position Embedding"
7. **MoE**: Shazeer et al. (2017) - "Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer"

### 10.2 Mathematical Background

- **Hyperbolic Geometry**: Cannon et al. (1997) - "Hyperbolic Geometry"
- **Graph Laplacian**: Chung (1997) - "Spectral Graph Theory"
- **Contrastive Learning**: Chen et al. (2020) - "A Simple Framework for Contrastive Learning of Visual Representations"

---

## 11. Glossary

**Attention**: Mechanism to weight importance of different inputs

**Multi-Head**: Parallel attention with different learned projections

**Hyperbolic Space**: Non-Euclidean geometry with constant negative curvature

**Poincaré Ball**: Conformal model of hyperbolic space in unit ball

**Möbius Addition**: Hyperbolic vector addition operation

**Sparse Attention**: Attention over subset of inputs (not all pairs)

**Linear Attention**: O(n) complexity via kernel approximation

**Flash Attention**: Memory-efficient tiled attention computation

**RoPE**: Rotary Position Embedding for distance encoding

**Cross-Attention**: Attention between two different spaces

**MoE**: Mixture of Experts, routing to specialized sub-models

**InfoNCE**: Contrastive loss based on Noise Contrastive Estimation

**Hard Negatives**: Difficult negative samples that lie close to the anchor in embedding space

**Curriculum Learning**: Gradually increasing task difficulty

**Spectral Regularization**: Graph smoothness via Laplacian

---

**Document Version**: 1.0
**Last Updated**: 2025-11-30
**Author**: RuVector Research Team

**SPARC Phase**: Pseudocode (Phase 2)
**Next Phase**: Architecture (Phase 3) - See `04-architecture.md`

---

## Appendix A: Quick Reference

### Common Subroutines

```
DotProduct(x, y) → scalar
L2Norm(x) → scalar
L2NormSquared(x) → scalar
Softmax(scores) → probabilities
CosineSimilarity(x, y) → similarity ∈ [-1, 1]
Scale(x, scalar) → scaled_vector
Add(x, y) → sum_vector
Subtract(x, y) → diff_vector
Concatenate(vectors...) → concatenated_vector
ZeroVector(d) → zero-initialized vector
ZeroMatrix(rows, cols) → zero-initialized matrix
```

### Complexity Quick Reference

```
O(1)        - Constant time
O(d)        - Linear in dimension
O(n)        - Linear in number of items
O(n·d)      - Linear in both
O(n²)       - Quadratic (standard full attention)
O(n·d²)     - Attention complexity
O(|E|)      - Linear in number of edges
```

---
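### Rust Sketch of Common Subroutines

Since the pseudocode is meant to be translated into Rust, the scalar subroutines listed in this appendix can be sketched as below. This is a minimal illustration, not the project's implementation: the function names, the use of `f32` slices, and the choice to return 0.0 from `cosine_similarity` for near-zero vectors are all assumptions made for this example.

```rust
/// Dot product of two equal-length vectors (DotProduct in the pseudocode).
fn dot_product(x: &[f32], y: &[f32]) -> f32 {
    assert_eq!(x.len(), y.len(), "dimension mismatch");
    x.iter().zip(y).map(|(a, b)| a * b).sum()
}

/// Squared Euclidean norm (L2NormSquared).
fn l2_norm_squared(x: &[f32]) -> f32 {
    dot_product(x, x)
}

/// Euclidean norm (L2Norm).
fn l2_norm(x: &[f32]) -> f32 {
    l2_norm_squared(x).sqrt()
}

/// Cosine similarity in [-1, 1] (CosineSimilarity).
/// Assumption for this sketch: near-zero vectors yield 0.0 instead of NaN.
fn cosine_similarity(x: &[f32], y: &[f32]) -> f32 {
    let denom = l2_norm(x) * l2_norm(y);
    if denom < 1e-10 {
        0.0
    } else {
        dot_product(x, y) / denom
    }
}

/// Numerically stable softmax (Softmax): subtract the max before exponentiating.
fn softmax(scores: &[f32]) -> Vec<f32> {
    if scores.is_empty() {
        return Vec::new();
    }
    let max = scores.iter().copied().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = scores.iter().map(|s| (s - max).exp()).collect();
    let sum: f32 = exps.iter().sum();
    exps.iter().map(|e| e / sum).collect()
}
```

Calling `softmax(&[1000.0, 999.0, 998.0])` exercises the stability case from the testing strategies in 9.3: the max subtraction keeps every exponent at or below zero, so the result stays finite and sums to 1 within floating-point tolerance.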