SPARC Pseudocode: RuVector Attention Mechanisms
Executive Summary
This document provides comprehensive pseudocode for all attention mechanisms proposed for the RuVector GNN latent-graph interplay system. Following SPARC methodology, this serves as the bridge between specification (requirements) and architecture (system design).
Scope: Complete algorithmic specifications for attention mechanisms, training procedures, and optimization strategies.
Target Audience: Implementers who will translate these algorithms into Rust code.
Conventions:
- UPPERCASE: Algorithm names, constants
- lowercase: Variables, parameters
- ←: Assignment
- ∈: Set membership
- Arrays are 0-indexed unless specified
- All complexity analysis uses Big-O notation
Table of Contents
- Core Attention Mechanisms
- Geometric Attention
- Sparse Attention
- Graph Attention
- Adaptive Attention
- Training Procedures
- Data Structures
- Complexity Summary
1. Core Attention Mechanisms
1.1 Scaled Dot-Product Attention
Purpose: Foundation attention mechanism for all variants
Complexity:
- Time: O(n·d) where n = number of keys, d = embedding dimension
- Space: O(n)
ALGORITHM: ScaledDotProductAttention
INPUT:
Q: query vector [d]
K: key matrix [n × d]
V: value matrix [n × d]
d_k: key dimension (scalar)
OUTPUT:
output: attention output [d]
weights: attention weights [n]
BEGIN
// 1. Compute attention scores
scores ← EMPTY_ARRAY[n]
FOR i ← 0 TO n-1 DO
scores[i] ← DotProduct(Q, K[i]) / sqrt(d_k)
END FOR
// 2. Apply softmax for normalization
weights ← Softmax(scores)
// 3. Weighted sum of values
output ← ZeroVector(d)
FOR i ← 0 TO n-1 DO
output ← output + weights[i] * V[i]
END FOR
RETURN output, weights
END
SUBROUTINE: DotProduct
INPUT: x[d], y[d]
OUTPUT: scalar
BEGIN
sum ← 0
FOR i ← 0 TO d-1 DO
sum ← sum + x[i] * y[i]
END FOR
RETURN sum
END
SUBROUTINE: Softmax
INPUT: scores[n]
OUTPUT: probabilities[n]
BEGIN
// Numerical stability: subtract max
max_score ← Max(scores)
exp_scores ← EMPTY_ARRAY[n]
sum_exp ← 0
FOR i ← 0 TO n-1 DO
exp_scores[i] ← exp(scores[i] - max_score)
sum_exp ← sum_exp + exp_scores[i]
END FOR
probabilities ← EMPTY_ARRAY[n]
FOR i ← 0 TO n-1 DO
probabilities[i] ← exp_scores[i] / sum_exp
END FOR
RETURN probabilities
END
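The two routines above translate directly into code. The following is an illustrative Python sketch (the production target is Rust, so all names here are hypothetical, not part of the spec):

```python
import math

def softmax(scores):
    # Subtract the max before exponentiating, as in the Softmax subroutine
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_dot_product_attention(Q, K, V, d_k):
    # 1. Scores: dot(Q, K[i]) / sqrt(d_k)
    scores = [sum(q * k for q, k in zip(Q, row)) / math.sqrt(d_k) for row in K]
    # 2. Normalize scores into attention weights
    weights = softmax(scores)
    # 3. Weighted sum of value rows
    d = len(V[0])
    output = [sum(weights[i] * V[i][j] for i in range(len(V))) for j in range(d)]
    return output, weights
```

With one-hot keys and values the output simply echoes the attention weights, which makes the routine easy to unit-test.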
1.2 Multi-Head Attention
Purpose: Learn multiple representation subspaces simultaneously
Complexity:
- Time: O(h·n·d_head·d_model) = O(n·d_model²) where h = number of heads, d_head = d_model/h
- Space: O(n·d_model) for the projected keys and values
ALGORITHM: MultiHeadAttention
INPUT:
Q: query vector [d_model]
K: key matrix [n × d_model]
V: value matrix [n × d_model]
num_heads: number of attention heads
W_Q: query projection weights [num_heads × d_head × d_model]
W_K: key projection weights [num_heads × d_head × d_model]
W_V: value projection weights [num_heads × d_head × d_model]
W_O: output projection weights [d_model × d_model]
OUTPUT:
output: multi-head attention output [d_model]
CONSTANTS:
d_head ← d_model / num_heads
BEGIN
heads ← EMPTY_ARRAY[num_heads]
// 1. Project and compute attention for each head
FOR h ← 0 TO num_heads-1 DO
// Project query
Q_h ← LinearTransform(Q, W_Q[h]) // [d_head]
// Project keys
K_h ← EMPTY_MATRIX[n × d_head]
FOR i ← 0 TO n-1 DO
K_h[i] ← LinearTransform(K[i], W_K[h])
END FOR
// Project values
V_h ← EMPTY_MATRIX[n × d_head]
FOR i ← 0 TO n-1 DO
V_h[i] ← LinearTransform(V[i], W_V[h])
END FOR
// Compute attention for this head
head_output, _ ← ScaledDotProductAttention(Q_h, K_h, V_h, d_head)
heads[h] ← head_output
END FOR
// 2. Concatenate all heads
concat ← Concatenate(heads[0], heads[1], ..., heads[num_heads-1])
// 3. Final linear projection
output ← LinearTransform(concat, W_O)
RETURN output
END
SUBROUTINE: LinearTransform
INPUT: x[d_in], W[d_out × d_in]
OUTPUT: y[d_out]
BEGIN
y ← ZeroVector(d_out)
FOR i ← 0 TO d_out-1 DO
FOR j ← 0 TO d_in-1 DO
y[i] ← y[i] + W[i][j] * x[j]
END FOR
END FOR
RETURN y
END
SUBROUTINE: Concatenate
INPUT: vectors... (variable number of vectors)
OUTPUT: concatenated vector
BEGIN
total_dim ← Sum of all input dimensions
result ← EMPTY_ARRAY[total_dim]
offset ← 0
FOR EACH vector IN vectors DO
FOR i ← 0 TO Length(vector)-1 DO
result[offset + i] ← vector[i]
END FOR
offset ← offset + Length(vector)
END FOR
RETURN result
END
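The head loop above can be consolidated into a short sketch (Python for illustration only; the inner `attend` helper mirrors §1.1, and all names are hypothetical):

```python
import math

def matvec(W, x):
    # y = W x, with W stored row-major as [d_out][d_in]
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in W]

def attend(q, K, V):
    # Scaled dot-product attention with a numerically stable softmax
    d_k = len(q)
    scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    w = [e / total for e in exps]
    return [sum(w[i] * V[i][j] for i in range(len(V))) for j in range(len(V[0]))]

def multi_head_attention(q, K, V, W_Q, W_K, W_V, W_O):
    heads = []
    for h in range(len(W_Q)):                      # one pass per head
        q_h = matvec(W_Q[h], q)
        K_h = [matvec(W_K[h], k) for k in K]
        V_h = [matvec(W_V[h], v) for v in V]
        heads.append(attend(q_h, K_h, V_h))
    concat = [x for head in heads for x in head]   # concatenate head outputs
    return matvec(W_O, concat)                     # final output projection
```

With a single head and identity projections this degenerates to §1.1, which is a convenient sanity check.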
2. Geometric Attention
2.1 Hyperbolic Attention (Poincaré Ball Model)
Purpose: Capture hierarchical structure using hyperbolic geometry
Complexity:
- Time: O(n·d) (same asymptotics as Euclidean attention, but with more expensive per-element operations)
- Space: O(n)
Geometric Background:
Poincaré Ball: B^d = {x ∈ R^d : ||x|| < 1}
Distance: d_P(x,y) = arcosh(1 + 2||x-y||²/((1-||x||²)(1-||y||²)))
ALGORITHM: HyperbolicAttention
INPUT:
query: query point in Poincaré ball [d]
keys: key points in Poincaré ball [n × d]
values: value points in Poincaré ball [n × d]
curvature: negative curvature (typically -1.0)
temperature: softmax temperature
OUTPUT:
output: aggregated point in Poincaré ball [d]
BEGIN
// 1. Compute hyperbolic distances as similarity scores
scores ← EMPTY_ARRAY[n]
FOR i ← 0 TO n-1 DO
// Negative distance = similarity (closer = higher score)
scores[i] ← -PoincareDistance(query, keys[i], curvature)
END FOR
// 2. Softmax to get attention weights
weights ← Softmax(scores / temperature)
// 3. Hyperbolic weighted aggregation using Möbius addition
result ← ZeroVector(d) // Origin in Poincaré ball
FOR i ← 0 TO n-1 DO
// Scale value by weight using Möbius scalar multiplication
scaled_value ← MobiusScalarMult(weights[i], values[i], curvature)
// Add to result using Möbius addition
result ← MobiusAdd(result, scaled_value, curvature)
END FOR
RETURN result
END
SUBROUTINE: PoincareDistance
INPUT: x[d], y[d], curvature
OUTPUT: distance (scalar)
BEGIN
// Compute squared norms
x_norm_sq ← L2NormSquared(x)
y_norm_sq ← L2NormSquared(y)
// Ensure points are inside the ball (||x|| < 1, ||y|| < 1)
IF x_norm_sq >= 1.0 OR y_norm_sq >= 1.0 THEN
ERROR "Points must be inside Poincaré ball"
END IF
// Compute squared distance between points
diff ← Subtract(x, y)
diff_norm_sq ← L2NormSquared(diff)
// Poincaré distance formula
numerator ← 2.0 * diff_norm_sq
denominator ← (1.0 - x_norm_sq) * (1.0 - y_norm_sq)
arg ← 1.0 + numerator / denominator
// Numerical stability: clamp arg >= 1.0
IF arg < 1.0 THEN
arg ← 1.0
END IF
distance ← sqrt(abs(curvature)) * arcosh(arg)
RETURN distance
END
SUBROUTINE: MobiusAdd
INPUT: x[d], y[d], curvature
OUTPUT: z[d] (Möbius sum x ⊕ y)
BEGIN
// Special case: if x is origin, return y
IF IsZero(x) THEN
RETURN y
END IF
// Special case: if y is origin, return x
IF IsZero(y) THEN
RETURN x
END IF
// Compute norms and dot product
x_norm_sq ← L2NormSquared(x)
y_norm_sq ← L2NormSquared(y)
xy_dot ← DotProduct(x, y)
// Möbius addition formula:
// z = ((1 + 2c⟨x,y⟩ + c||y||²)x + (1 - c||x||²)y) /
// (1 + 2c⟨x,y⟩ + c²||x||²||y||²)
c ← -curvature // For Poincaré ball, typically c = 1
numerator_x_coef ← 1.0 + 2.0*c*xy_dot + c*y_norm_sq
numerator_y_coef ← 1.0 - c*x_norm_sq
denominator ← 1.0 + 2.0*c*xy_dot + c*c*x_norm_sq*y_norm_sq
numerator ← Add(
Scale(x, numerator_x_coef),
Scale(y, numerator_y_coef)
)
z ← Scale(numerator, 1.0 / denominator)
// Project back to ball if numerical errors pushed outside
z_norm ← L2Norm(z)
IF z_norm >= 1.0 THEN
z ← Scale(z, 0.99 / z_norm) // Project to ball with margin
END IF
RETURN z
END
SUBROUTINE: MobiusScalarMult
INPUT: r (scalar), x[d], curvature
OUTPUT: r ⊗ x (Möbius scalar multiplication)
BEGIN
// Handle special cases
IF r == 0 OR IsZero(x) THEN
RETURN ZeroVector(d)
END IF
x_norm ← L2Norm(x)
c ← -curvature
// Möbius scalar multiplication:
// r ⊗ x = (1/√c) * tanh(r * arctanh(√c * ||x||)) * (x / ||x||)
sqrt_c ← sqrt(c)
arctanh_arg ← sqrt_c * x_norm
// Numerical stability
IF arctanh_arg >= 1.0 THEN
arctanh_arg ← 0.999
END IF
arctanh_val ← arctanh(arctanh_arg)
tanh_arg ← r * arctanh_val
tanh_val ← tanh(tanh_arg)
scale_factor ← (1.0 / sqrt_c) * tanh_val / x_norm
result ← Scale(x, scale_factor)
RETURN result
END
SUBROUTINE: L2NormSquared
INPUT: x[d]
OUTPUT: ||x||² (scalar)
BEGIN
sum ← 0
FOR i ← 0 TO d-1 DO
sum ← sum + x[i] * x[i]
END FOR
RETURN sum
END
SUBROUTINE: L2Norm
INPUT: x[d]
OUTPUT: ||x|| (scalar)
BEGIN
RETURN sqrt(L2NormSquared(x))
END
SUBROUTINE: Subtract
INPUT: x[d], y[d]
OUTPUT: x - y [d]
BEGIN
result ← EMPTY_ARRAY[d]
FOR i ← 0 TO d-1 DO
result[i] ← x[i] - y[i]
END FOR
RETURN result
END
SUBROUTINE: Add
INPUT: x[d], y[d]
OUTPUT: x + y [d]
BEGIN
result ← EMPTY_ARRAY[d]
FOR i ← 0 TO d-1 DO
result[i] ← x[i] + y[i]
END FOR
RETURN result
END
SUBROUTINE: Scale
INPUT: x[d], scalar
OUTPUT: scalar * x [d]
BEGIN
result ← EMPTY_ARRAY[d]
FOR i ← 0 TO d-1 DO
result[i] ← scalar * x[i]
END FOR
RETURN result
END
SUBROUTINE: IsZero
INPUT: x[d]
OUTPUT: boolean
BEGIN
epsilon ← 1e-10
RETURN L2Norm(x) < epsilon
END
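A minimal Python sketch of the two core hyperbolic primitives, with curvature fixed at -1 (i.e. c = 1); function names are illustrative, not part of the spec:

```python
import math

def l2_norm_squared(x):
    return sum(v * v for v in x)

def poincare_distance(x, y):
    # d_P(x,y) = arcosh(1 + 2||x-y||² / ((1-||x||²)(1-||y||²))), curvature -1
    diff_sq = sum((a - b) ** 2 for a, b in zip(x, y))
    arg = 1.0 + 2.0 * diff_sq / ((1.0 - l2_norm_squared(x)) * (1.0 - l2_norm_squared(y)))
    return math.acosh(max(arg, 1.0))   # clamp arg >= 1 for numerical stability

def mobius_add(x, y, c=1.0):
    # ((1 + 2c⟨x,y⟩ + c||y||²)x + (1 - c||x||²)y) / (1 + 2c⟨x,y⟩ + c²||x||²||y||²)
    xy = sum(a * b for a, b in zip(x, y))
    xx, yy = l2_norm_squared(x), l2_norm_squared(y)
    coef_x = 1.0 + 2.0 * c * xy + c * yy
    coef_y = 1.0 - c * xx
    den = 1.0 + 2.0 * c * xy + c * c * xx * yy
    return [(coef_x * a + coef_y * b) / den for a, b in zip(x, y)]
```

The identity d_P(0, x) = 2·artanh(||x||) gives a convenient sanity check. Note that Möbius addition is neither commutative nor associative, so the order of the aggregation loop in HyperbolicAttention affects the final result.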
3. Sparse Attention
3.1 Local + Global Sparse Attention
Purpose: Reduce O(n²) to O(k_local + k_global) for large graphs
Complexity:
- Time: O(k_local·d + k_global·d) where k_local, k_global << n
- Space: O(k_local + k_global)
ALGORITHM: SparseLocalGlobalAttention
INPUT:
query: query vector [d]
all_neighbors: all neighbor embeddings [n × d]
neighbor_layers: HNSW layer for each neighbor [n]
local_window: size of local neighborhood
global_indices: indices of global attention nodes
OUTPUT:
output: attention output [d]
BEGIN
// 1. Partition neighbors into local and global
local_neighbors ← EMPTY_LIST
local_indices ← EMPTY_LIST
global_neighbors ← EMPTY_LIST
global_indices_actual ← EMPTY_LIST
FOR i ← 0 TO n-1 DO
IF neighbor_layers[i] == 0 AND Length(local_neighbors) < local_window THEN
// Layer 0 = local neighbors
local_neighbors.Append(all_neighbors[i])
local_indices.Append(i)
ELSE IF neighbor_layers[i] > 0 AND i IN global_indices THEN
// Higher layers = global neighbors
global_neighbors.Append(all_neighbors[i])
global_indices_actual.Append(i)
END IF
END FOR
// 2. Compute local attention
local_output ← ZeroVector(d)
IF Length(local_neighbors) > 0 THEN
local_K ← ConvertToMatrix(local_neighbors)
local_V ← local_K // Self-attention
local_output, _ ← ScaledDotProductAttention(
query, local_K, local_V, d
)
END IF
// 3. Compute global attention
global_output ← ZeroVector(d)
IF Length(global_neighbors) > 0 THEN
global_K ← ConvertToMatrix(global_neighbors)
global_V ← global_K
global_output, _ ← ScaledDotProductAttention(
query, global_K, global_V, d
)
END IF
// 4. Learned gating to combine local and global
alpha ← LearnedGate(query, local_output, global_output)
// 5. Combine outputs
output ← ZeroVector(d)
FOR i ← 0 TO d-1 DO
output[i] ← alpha * local_output[i] + (1.0 - alpha) * global_output[i]
END FOR
RETURN output
END
SUBROUTINE: LearnedGate
INPUT: query[d], local_output[d], global_output[d]
OUTPUT: alpha (scalar in [0, 1])
BEGIN
// Concatenate all inputs
concat ← Concatenate(query, local_output, global_output)
// Linear projection + sigmoid
gate_weights ← LEARNED_PARAMETERS[3*d] // Learned during training
bias ← LEARNED_BIAS // Learned during training
logit ← DotProduct(concat, gate_weights) + bias
alpha ← Sigmoid(logit)
RETURN alpha
END
SUBROUTINE: Sigmoid
INPUT: x (scalar)
OUTPUT: sigmoid(x) in [0, 1]
BEGIN
RETURN 1.0 / (1.0 + exp(-x))
END
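The gating step (steps 4-5 above, plus LearnedGate) reduces to a sigmoid over a learned projection of the concatenated context. A hedged sketch, where `gate_w` and `gate_b` stand in for the learned parameters:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gated_combine(query, local_out, global_out, gate_w, gate_b):
    # alpha = sigmoid(w · [query || local || global] + b)
    concat = list(query) + list(local_out) + list(global_out)
    alpha = sigmoid(sum(w * v for w, v in zip(gate_w, concat)) + gate_b)
    # Convex combination of local and global attention outputs
    return [alpha * l + (1.0 - alpha) * g for l, g in zip(local_out, global_out)]
```

With zero gate parameters alpha is exactly 0.5, so the output is the average of the two branches.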
3.2 Linear Attention (Performer / Random Features)
Purpose: O(n·d) complexity using kernel approximation
Complexity:
- Time: O(n·D·d) where D = number of random features
- Space: O(D·d)
ALGORITHM: LinearAttention
INPUT:
query: query vector [d]
keys: key matrix [n × d]
values: value matrix [n × d]
num_features: number of random features D
random_matrix: random projection matrix [D/2 × d] (each row yields one cos/sin feature pair)
OUTPUT:
output: attention output [d]
BEGIN
// 1. Apply feature map to query
phi_Q ← FeatureMap(query, random_matrix, num_features)
// 2. Apply feature map to all keys
phi_K ← EMPTY_MATRIX[n × num_features]
FOR i ← 0 TO n-1 DO
phi_K[i] ← FeatureMap(keys[i], random_matrix, num_features)
END FOR
// 3. Compute K^T V (sum over neighbors) - O(n·D·d)
KV_sum ← ZeroMatrix(num_features, d)
FOR i ← 0 TO n-1 DO
FOR j ← 0 TO num_features-1 DO
FOR k ← 0 TO d-1 DO
KV_sum[j][k] ← KV_sum[j][k] + phi_K[i][j] * values[i][k]
END FOR
END FOR
END FOR
// 4. Compute Q·(K^T V) - O(D·d)
numerator ← ZeroVector(d)
FOR k ← 0 TO d-1 DO
FOR j ← 0 TO num_features-1 DO
numerator[k] ← numerator[k] + phi_Q[j] * KV_sum[j][k]
END FOR
END FOR
// 5. Compute K^T 1 (sum of feature-mapped keys) - O(n·D)
K_sum ← ZeroVector(num_features)
FOR i ← 0 TO n-1 DO
FOR j ← 0 TO num_features-1 DO
K_sum[j] ← K_sum[j] + phi_K[i][j]
END FOR
END FOR
// 6. Compute denominator Q·(K^T 1) - O(D)
denominator ← DotProduct(phi_Q, K_sum)
// 7. Normalize
output ← Scale(numerator, 1.0 / (denominator + 1e-10))
RETURN output
END
SUBROUTINE: FeatureMap
INPUT: x[d], random_matrix[D/2 × d], num_features D
OUTPUT: features[D]
BEGIN
// Random Fourier Features
// φ(x) = sqrt(1/D) * [cos(w₁·x), sin(w₁·x), cos(w₂·x), sin(w₂·x), ...]
scale ← 1.0 / sqrt(num_features)
features ← EMPTY_ARRAY[num_features]
FOR i ← 0 TO num_features/2 - 1 DO
// Get random projection
w ← random_matrix[i]
projection ← DotProduct(w, x)
// Apply cos and sin
features[2*i] ← scale * cos(projection)
features[2*i + 1] ← scale * sin(projection)
END FOR
RETURN features
END
SUBROUTINE: ZeroMatrix
INPUT: rows, cols
OUTPUT: matrix[rows × cols]
BEGIN
matrix ← EMPTY_MATRIX[rows × cols]
FOR i ← 0 TO rows-1 DO
FOR j ← 0 TO cols-1 DO
matrix[i][j] ← 0.0
END FOR
END FOR
RETURN matrix
END
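Steps 1-7 can be condensed into a short sketch (illustrative Python; note the feature map consumes D/2 projection rows, each yielding a cos/sin pair):

```python
import math

def feature_map(x, random_matrix):
    # Random Fourier features: φ(x) = (1/√D) [cos(w_i·x), sin(w_i·x), ...]
    D = 2 * len(random_matrix)
    scale = 1.0 / math.sqrt(D)
    feats = []
    for w in random_matrix:
        p = sum(wi * xi for wi, xi in zip(w, x))
        feats.extend([scale * math.cos(p), scale * math.sin(p)])
    return feats

def linear_attention(query, keys, values, random_matrix):
    phi_q = feature_map(query, random_matrix)
    phi_K = [feature_map(k, random_matrix) for k in keys]
    D, d = len(phi_q), len(values[0])
    # KV_sum = φ(K)^T V, accumulated over neighbors: O(n·D·d)
    KV = [[sum(phi_K[i][j] * values[i][k] for i in range(len(keys)))
           for k in range(d)] for j in range(D)]
    K_sum = [sum(col) for col in zip(*phi_K)]               # φ(K)^T 1
    denom = sum(a * b for a, b in zip(phi_q, K_sum)) + 1e-10
    return [sum(phi_q[j] * KV[j][k] for j in range(D)) / denom for k in range(d)]
```

With a single key identical to the query, the output collapses to that key's value row, which is a useful correctness check.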
3.3 Flash Attention (Tiled / Memory-Efficient)
Purpose: Avoid materializing attention scores by tiling with an online softmax (O(n) memory instead of O(n²) across all queries)
Complexity:
- Time: O(n·d) per query (same as standard attention, but with better cache locality)
- Space: O(block_size + d) per query instead of O(n)
ALGORITHM: FlashAttention
INPUT:
query: query vector [d]
keys: key matrix [n × d]
values: value matrix [n × d]
block_size: tile size B (typically 64-128)
OUTPUT:
output: attention output [d]
BEGIN
n ← Length(keys)
output ← ZeroVector(d)
row_max ← -INFINITY
row_sum ← 0.0
num_blocks ← Ceiling(n / block_size)
// Process keys/values in blocks (tiles)
FOR block_idx ← 0 TO num_blocks-1 DO
// 1. Define current block range
chunk_start ← block_idx * block_size
chunk_end ← Min(chunk_start + block_size, n)
chunk_size ← chunk_end - chunk_start
// 2. Extract block of keys and values
chunk_K ← keys[chunk_start : chunk_end]
chunk_V ← values[chunk_start : chunk_end]
// 3. Compute attention scores for this block
scores ← EMPTY_ARRAY[chunk_size]
FOR i ← 0 TO chunk_size-1 DO
scores[i] ← DotProduct(query, chunk_K[i]) / sqrt(d)
END FOR
// 4. Online softmax: update running max
new_max ← Max(row_max, Max(scores))
// 5. Compute exponentials with new max
exp_scores ← EMPTY_ARRAY[chunk_size]
FOR i ← 0 TO chunk_size-1 DO
exp_scores[i] ← exp(scores[i] - new_max)
END FOR
// 6. Correction factor for previous blocks
correction ← exp(row_max - new_max)
// 7. Update running sum of exponentials
chunk_sum ← Sum(exp_scores)
row_sum ← row_sum * correction + chunk_sum
// 8. Update running max
row_max ← new_max
// 9. Accumulate weighted values with correction
FOR i ← 0 TO d-1 DO
output[i] ← output[i] * correction
END FOR
FOR i ← 0 TO chunk_size-1 DO
FOR j ← 0 TO d-1 DO
output[j] ← output[j] + exp_scores[i] * chunk_V[i][j]
END FOR
END FOR
END FOR
// 10. Final normalization
FOR i ← 0 TO d-1 DO
output[i] ← output[i] / row_sum
END FOR
RETURN output
END
SUBROUTINE: Max
INPUT: array[n] OR two scalars a, b
OUTPUT: maximum value
BEGIN
IF array is provided THEN
max_val ← array[0]
FOR i ← 1 TO Length(array)-1 DO
IF array[i] > max_val THEN
max_val ← array[i]
END IF
END FOR
RETURN max_val
ELSE
// Two scalars
RETURN IF (a > b) THEN a ELSE b
END IF
END
SUBROUTINE: Sum
INPUT: array[n]
OUTPUT: sum of elements
BEGIN
total ← 0
FOR i ← 0 TO Length(array)-1 DO
total ← total + array[i]
END FOR
RETURN total
END
SUBROUTINE: Ceiling
INPUT: x (real number)
OUTPUT: ⌈x⌉ (smallest integer >= x)
BEGIN
RETURN integer ceiling of x
END
SUBROUTINE: Min
INPUT: a, b (scalars)
OUTPUT: minimum value
BEGIN
RETURN IF (a < b) THEN a ELSE b
END
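The online-softmax recurrence above is exact: the final result does not depend on the tiling. A hedged single-query sketch (illustrative names, not the spec's API):

```python
import math

def flash_attention(query, keys, values, block_size):
    n, d = len(keys), len(values[0])
    output = [0.0] * d
    row_max, row_sum = -math.inf, 0.0
    for start in range(0, n, block_size):
        chunk_K = keys[start:start + block_size]
        chunk_V = values[start:start + block_size]
        scores = [sum(q * k for q, k in zip(query, row)) / math.sqrt(d) for row in chunk_K]
        new_max = max(row_max, max(scores))
        exps = [math.exp(s - new_max) for s in scores]
        correction = math.exp(row_max - new_max)   # rescales earlier blocks (0.0 on first block)
        row_sum = row_sum * correction + sum(exps)
        row_max = new_max
        output = [o * correction for o in output]
        for e, v in zip(exps, chunk_V):
            output = [o + e * vi for o, vi in zip(output, v)]
    return [o / row_sum for o in output]
```

A quick correctness check: changing block_size must not change the result.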
4. Graph Attention
4.1 Edge-Featured Attention
Purpose: Incorporate edge attributes into attention computation
Complexity:
- Time: O(n·(d² + d_attn·d_edge))
- Space: O(n)
ALGORITHM: EdgeFeaturedAttention
INPUT:
query: query node embedding [d]
keys: neighbor node embeddings [n × d]
values: neighbor node embeddings [n × d]
edge_features: edge attributes [n × d_edge]
W_node: node transformation matrix [d × d]
W_edge: edge transformation matrix [d_edge × d_attn]
a: attention coefficient vector [2d + d_attn]
OUTPUT:
output: aggregated embedding [d]
BEGIN
// 1. Transform query
q_trans ← MatrixVectorMult(W_node, query)
// 2. Transform all keys and edge features
k_trans ← EMPTY_MATRIX[n × d]
e_trans ← EMPTY_MATRIX[n × d_attn]
FOR i ← 0 TO n-1 DO
k_trans[i] ← MatrixVectorMult(W_node, keys[i])
e_trans[i] ← MatrixVectorMult(W_edge, edge_features[i])
END FOR
// 3. Compute attention scores with edge features
scores ← EMPTY_ARRAY[n]
FOR i ← 0 TO n-1 DO
// Concatenate [query || key || edge]
concat ← Concatenate(q_trans, k_trans[i], e_trans[i])
// Attention coefficient
score ← DotProduct(a, concat)
// Activation (LeakyReLU)
scores[i] ← LeakyReLU(score, alpha=0.2)
END FOR
// 4. Softmax normalization
weights ← Softmax(scores)
// 5. Weighted aggregation
output ← WeightedSum(values, weights)
RETURN output
END
SUBROUTINE: MatrixVectorMult
INPUT: M[m × n], v[n]
OUTPUT: result[m]
BEGIN
result ← ZeroVector(m)
FOR i ← 0 TO m-1 DO
FOR j ← 0 TO n-1 DO
result[i] ← result[i] + M[i][j] * v[j]
END FOR
END FOR
RETURN result
END
SUBROUTINE: LeakyReLU
INPUT: x (scalar), alpha (negative slope)
OUTPUT: activated value
BEGIN
IF x >= 0 THEN
RETURN x
ELSE
RETURN alpha * x
END IF
END
SUBROUTINE: WeightedSum
INPUT: vectors[n × d], weights[n]
OUTPUT: result[d]
BEGIN
result ← ZeroVector(d)
FOR i ← 0 TO n-1 DO
FOR j ← 0 TO d-1 DO
result[j] ← result[j] + weights[i] * vectors[i][j]
END FOR
END FOR
RETURN result
END
4.2 RoPE Graph Attention
Purpose: Encode graph distances via rotary position embeddings
Complexity:
- Time: O(n·d²)
- Space: O(n)
ALGORITHM: RoPEGraphAttention
INPUT:
query: query node embedding [d]
keys: neighbor node embeddings [n × d]
values: neighbor node embeddings [n × d]
distances: graph distances to neighbors [n]
base: RoPE frequency base (default 10000)
OUTPUT:
output: attention output [d]
BEGIN
// 1. Apply RoPE rotation to query (at origin, distance = 0)
Q_rotated ← ApplyRotation(query, distance=0.0, base)
// 2. Apply RoPE rotation to keys based on their distances
K_rotated ← EMPTY_MATRIX[n × d]
FOR i ← 0 TO n-1 DO
K_rotated[i] ← ApplyRotation(keys[i], distances[i], base)
END FOR
// 3. Compute attention scores with rotated embeddings
scores ← EMPTY_ARRAY[n]
FOR i ← 0 TO n-1 DO
scores[i] ← DotProduct(Q_rotated, K_rotated[i])
END FOR
// 4. Softmax and aggregate
weights ← Softmax(scores)
output ← WeightedSum(values, weights)
RETURN output
END
SUBROUTINE: ApplyRotation
INPUT: embedding[d], distance (scalar), base
OUTPUT: rotated[d]
BEGIN
rotated ← ZeroVector(d)
// Apply rotation to pairs of dimensions
FOR i ← 0 TO d/2 - 1 DO
// Compute rotation angle for this dimension pair
theta ← distance / (base ^ (2.0 * i / d))
cos_theta ← cos(theta)
sin_theta ← sin(theta)
// Rotate dimensions (2*i, 2*i+1)
rotated[2*i] ← embedding[2*i] * cos_theta - embedding[2*i+1] * sin_theta
rotated[2*i+1] ← embedding[2*i] * sin_theta + embedding[2*i+1] * cos_theta
END FOR
RETURN rotated
END
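Because each dimension pair undergoes a plane rotation whose angle is linear in the distance, rotated dot products depend only on the relative distance between query and key: ⟨R(d₁)x, R(d₂)y⟩ = ⟨x, R(d₂−d₁)y⟩. A sketch of ApplyRotation (illustrative Python; the implementation target is Rust):

```python
import math

def apply_rotation(embedding, distance, base=10000.0):
    d = len(embedding)
    rotated = [0.0] * d
    for i in range(d // 2):
        # Rotation angle for dimension pair (2i, 2i+1)
        theta = distance / (base ** (2.0 * i / d))
        c, s = math.cos(theta), math.sin(theta)
        rotated[2 * i] = embedding[2 * i] * c - embedding[2 * i + 1] * s
        rotated[2 * i + 1] = embedding[2 * i] * s + embedding[2 * i + 1] * c
    return rotated
```

The rotation is norm-preserving, and the relative-distance property above holds exactly (up to floating point).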
4.3 Cross-Space (Dual) Attention
Purpose: Bridge graph topology and latent space semantics
Complexity:
- Time: O(n_graph·d² + k_latent·d² + k_latent²·d)
- Space: O(n_graph + k_latent)
ALGORITHM: DualSpaceAttention
INPUT:
query: query node embedding [d]
graph_neighbors: topological neighbors [n_graph × d]
all_embeddings: all node embeddings for latent search [N × d]
k_latent: number of latent neighbors
OUTPUT:
output: fused embedding [d]
BEGIN
// 1. Graph attention (topology-based)
graph_output ← MultiHeadAttention(
query,
graph_neighbors,
graph_neighbors,
num_heads=8
)
// 2. Find latent neighbors (similarity-based)
latent_neighbors ← FindTopKSimilar(query, all_embeddings, k_latent)
// 3. Latent attention (embedding-based)
latent_output ← MultiHeadAttention(
query,
latent_neighbors,
latent_neighbors,
num_heads=8
)
// 4. Cross-attention (graph context queries latent space)
cross_output ← MultiHeadAttention(
graph_output, // Use graph output as query
latent_neighbors,
latent_neighbors,
num_heads=8
)
// 5. Fusion of all three outputs
concatenated ← Concatenate(graph_output, latent_output, cross_output)
// 6. Final projection
W_fusion ← LEARNED_WEIGHTS[d × 3d]
output ← MatrixVectorMult(W_fusion, concatenated)
RETURN output
END
SUBROUTINE: FindTopKSimilar
INPUT: query[d], all_embeddings[N × d], k
OUTPUT: top_k_embeddings[k × d]
BEGIN
similarities ← EMPTY_ARRAY[N]
// 1. Compute cosine similarity to all embeddings
FOR i ← 0 TO N-1 DO
similarities[i] ← CosineSimilarity(query, all_embeddings[i])
END FOR
// 2. Find top-k indices
top_k_indices ← TopKIndices(similarities, k)
// 3. Extract top-k embeddings
top_k_embeddings ← EMPTY_MATRIX[k × d]
FOR i ← 0 TO k-1 DO
top_k_embeddings[i] ← all_embeddings[top_k_indices[i]]
END FOR
RETURN top_k_embeddings
END
SUBROUTINE: CosineSimilarity
INPUT: x[d], y[d]
OUTPUT: similarity in [-1, 1]
BEGIN
dot ← DotProduct(x, y)
norm_x ← L2Norm(x)
norm_y ← L2Norm(y)
// Avoid division by zero
IF norm_x == 0 OR norm_y == 0 THEN
RETURN 0.0
END IF
RETURN dot / (norm_x * norm_y)
END
SUBROUTINE: TopKIndices
INPUT: array[N], k
OUTPUT: indices[k]
BEGIN
// Create (index, value) pairs
pairs ← EMPTY_ARRAY[N]
FOR i ← 0 TO N-1 DO
pairs[i] ← (i, array[i])
END FOR
// Sort by value (descending)
Sort(pairs, by=value, order=descending)
// Extract top-k indices
indices ← EMPTY_ARRAY[k]
FOR i ← 0 TO k-1 DO
indices[i] ← pairs[i].index
END FOR
RETURN indices
END
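FindTopKSimilar and its helpers reduce to a sort by cosine similarity. A brute-force O(N log N) sketch for illustration (in practice the latent search would go through the HNSW index rather than a full scan):

```python
import math

def cosine_similarity(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    if nx == 0.0 or ny == 0.0:
        return 0.0            # zero-vector guard, as in the pseudocode
    return dot / (nx * ny)

def find_top_k_similar(query, all_embeddings, k):
    # Sort candidate indices by similarity, descending, and keep the k best rows
    sims = [cosine_similarity(query, e) for e in all_embeddings]
    order = sorted(range(len(sims)), key=lambda i: sims[i], reverse=True)
    return [all_embeddings[i] for i in order[:k]]
```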
5. Adaptive Attention
5.1 Mixture of Experts (MoE) Attention
Purpose: Route to specialized attention mechanisms based on context
Complexity:
- Time: O(K · attention_complexity) where K = top-k experts (typically 2)
- Space: O(num_experts · model_size)
ALGORITHM: MoEAttention
INPUT:
query: query node embedding [d]
keys: neighbor embeddings [n × d]
values: neighbor embeddings [n × d]
experts: list of attention mechanisms
router: routing network
top_k: number of experts to use (typically 2)
OUTPUT:
output: expert-mixed output [d]
EXPERT_TYPES:
1. Standard Multi-Head Attention
2. Hyperbolic Attention
3. Linear Attention
4. Edge-Featured Attention
BEGIN
num_experts ← Length(experts)
// 1. Router computes expert scores
router_logits ← RouterNetwork(query, router)
router_probs ← Softmax(router_logits)
// 2. Select top-k experts
top_k_indices ← TopKIndices(router_probs, top_k)
// 3. Normalize selected expert weights
selected_weights ← EMPTY_ARRAY[top_k]
weight_sum ← 0.0
FOR i ← 0 TO top_k-1 DO
expert_idx ← top_k_indices[i]
selected_weights[i] ← router_probs[expert_idx]
weight_sum ← weight_sum + selected_weights[i]
END FOR
// Normalize
FOR i ← 0 TO top_k-1 DO
selected_weights[i] ← selected_weights[i] / weight_sum
END FOR
// 4. Compute weighted expert outputs
output ← ZeroVector(d)
FOR i ← 0 TO top_k-1 DO
expert_idx ← top_k_indices[i]
expert ← experts[expert_idx]
// Call appropriate expert
expert_output ← CALL_EXPERT(expert, query, keys, values)
// Weighted accumulation
weight ← selected_weights[i]
FOR j ← 0 TO d-1 DO
output[j] ← output[j] + weight * expert_output[j]
END FOR
END FOR
RETURN output
END
SUBROUTINE: RouterNetwork
INPUT: query[d], router_weights
OUTPUT: logits[num_experts]
BEGIN
// Simple two-layer MLP
hidden_size ← 4 * d
// First layer
W1 ← router_weights.layer1 // [hidden_size × d]
b1 ← router_weights.bias1 // [hidden_size]
hidden ← MatrixVectorMult(W1, query)
FOR i ← 0 TO hidden_size-1 DO
hidden[i] ← ReLU(hidden[i] + b1[i])
END FOR
// Second layer
W2 ← router_weights.layer2 // [num_experts × hidden_size]
b2 ← router_weights.bias2 // [num_experts]
logits ← MatrixVectorMult(W2, hidden)
FOR i ← 0 TO num_experts-1 DO
logits[i] ← logits[i] + b2[i]
END FOR
RETURN logits
END
SUBROUTINE: CALL_EXPERT
INPUT: expert, query, keys, values
OUTPUT: expert_output[d]
BEGIN
MATCH expert.type:
CASE "standard":
RETURN MultiHeadAttention(query, keys, values, num_heads=8)
CASE "hyperbolic":
RETURN HyperbolicAttention(query, keys, values, curvature=-1.0)
CASE "linear":
RETURN LinearAttention(query, keys, values, num_features=256)
CASE "edge_featured":
edge_features ← expert.edge_features
RETURN EdgeFeaturedAttention(query, keys, values, edge_features)
DEFAULT:
ERROR "Unknown expert type"
END MATCH
END
SUBROUTINE: ReLU
INPUT: x (scalar)
OUTPUT: max(0, x)
BEGIN
RETURN IF (x > 0) THEN x ELSE 0
END
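Steps 2-3 of MoEAttention (top-k selection and weight renormalization) in isolation, as a hedged sketch with illustrative names:

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def route_top_k(router_logits, top_k):
    # Keep the k most probable experts and renormalize their weights to sum to 1
    probs = softmax(router_logits)
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    total = sum(probs[i] for i in chosen)
    return [(i, probs[i] / total) for i in chosen]
```

The returned (index, weight) pairs drive the weighted accumulation over expert outputs in step 4.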
5.2 Learned Navigation (Reinforcement Learning)
Purpose: Learn optimal navigation policy for graph traversal
Complexity:
- Time: O(num_steps · d²) per navigation episode
- Space: O(graph_size + policy_params)
ALGORITHM: RLNavigationStep
INPUT:
current_state: current navigation state
policy_network: learned policy (neural network)
value_network: value estimator
graph: graph structure
OUTPUT:
action: which neighbor to visit
reward: immediate reward
next_state: resulting state
state_value: critic's value estimate for the current state
STATE_REPRESENTATION:
current_embedding: [d]
query_embedding: [d]
graph_features: [d_graph]
history: [max_steps × d]
BEGIN
// 1. Encode current state
state_vector ← EncodeState(current_state)
// 2. Policy network outputs action logits
action_logits ← PolicyNetwork(state_vector, policy_network)
// 3. Value network estimates state value
state_value ← ValueNetwork(state_vector, value_network)
// 4. Sample action from policy
action_probs ← Softmax(action_logits)
action ← SampleCategorical(action_probs) // Which neighbor to visit
// 5. Execute action (move to selected neighbor)
next_node ← current_state.neighbors[action]
// 6. Compute reward
reward ← ComputeReward(current_state, next_node, current_state.query)
// 7. Update state
next_state ← UpdateState(current_state, next_node, action)
RETURN action, reward, next_state, state_value
END
SUBROUTINE: EncodeState
INPUT: state
OUTPUT: state_vector[d_state]
BEGIN
// Concatenate all state components
state_vector ← Concatenate(
state.current_embedding,
state.query_embedding,
state.graph_features,
Flatten(state.history)
)
RETURN state_vector
END
SUBROUTINE: PolicyNetwork
INPUT: state_vector[d_state], policy_params
OUTPUT: action_logits[num_neighbors]
BEGIN
// Three-layer MLP
hidden1 ← ReLU(Linear(state_vector, policy_params.W1, policy_params.b1))
hidden2 ← ReLU(Linear(hidden1, policy_params.W2, policy_params.b2))
logits ← Linear(hidden2, policy_params.W3, policy_params.b3)
RETURN logits
END
SUBROUTINE: ValueNetwork
INPUT: state_vector[d_state], value_params
OUTPUT: value (scalar)
BEGIN
// Three-layer MLP ending in scalar
hidden1 ← ReLU(Linear(state_vector, value_params.W1, value_params.b1))
hidden2 ← ReLU(Linear(hidden1, value_params.W2, value_params.b2))
value ← Linear(hidden2, value_params.W3, value_params.b3)[0] // Scalar output
RETURN value
END
SUBROUTINE: ComputeReward
INPUT: current_state, next_node, query
OUTPUT: reward (scalar)
BEGIN
// Reward based on similarity improvement
current_similarity ← CosineSimilarity(
current_state.current_embedding,
query
)
next_similarity ← CosineSimilarity(
next_node.embedding,
query
)
// Positive reward if moving closer, negative if farther
reward ← next_similarity - current_similarity
// Bonus for reaching goal
IF next_similarity > GOAL_THRESHOLD THEN
reward ← reward + GOAL_BONUS
END IF
// Penalty for taking too many steps
reward ← reward - STEP_PENALTY
RETURN reward
END
SUBROUTINE: SampleCategorical
INPUT: probabilities[n]
OUTPUT: sampled_index in [0, n-1]
BEGIN
// Sample from categorical distribution
cumsum ← 0.0
rand ← Random() // Uniform [0, 1)
FOR i ← 0 TO n-1 DO
cumsum ← cumsum + probabilities[i]
IF rand < cumsum THEN
RETURN i
END IF
END FOR
// Fallback (shouldn't reach here if probabilities sum to 1)
RETURN n-1
END
SUBROUTINE: UpdateState
INPUT: current_state, next_node, action
OUTPUT: new_state
BEGIN
new_state ← COPY(current_state)
// Update current node
new_state.current_node ← next_node
new_state.current_embedding ← next_node.embedding
// Update history (sliding window)
new_state.history.PopFirst()
new_state.history.Append(next_node.embedding)
// Increment step counter
new_state.num_steps ← new_state.num_steps + 1
RETURN new_state
END
SUBROUTINE: Linear
INPUT: x[d_in], W[d_out × d_in], b[d_out]
OUTPUT: y[d_out]
BEGIN
y ← MatrixVectorMult(W, x)
FOR i ← 0 TO d_out-1 DO
y[i] ← y[i] + b[i]
END FOR
RETURN y
END
6. Training Procedures
6.1 InfoNCE Contrastive Loss
Purpose: Learn embeddings that are similar to positives and dissimilar to negatives
Complexity:
- Time: O((n_pos + n_neg) · d)
- Space: O(n_pos + n_neg)
ALGORITHM: InfoNCELoss
INPUT:
anchor: anchor embedding [d]
positives: positive samples [n_pos × d]
negatives: negative samples [n_neg × d]
temperature: softmax temperature (typically 0.07)
OUTPUT:
loss: contrastive loss (scalar)
BEGIN
// 1. Compute positive similarities
pos_scores ← EMPTY_ARRAY[n_pos]
FOR i ← 0 TO n_pos-1 DO
sim ← CosineSimilarity(anchor, positives[i])
pos_scores[i] ← sim / temperature
END FOR
// 2. Compute negative similarities
neg_scores ← EMPTY_ARRAY[n_neg]
FOR i ← 0 TO n_neg-1 DO
sim ← CosineSimilarity(anchor, negatives[i])
neg_scores[i] ← sim / temperature
END FOR
// 3. InfoNCE loss (average over positives)
total_loss ← 0.0
FOR i ← 0 TO n_pos-1 DO
// Numerator: exp(positive score)
numerator ← exp(pos_scores[i])
// Denominator: sum of exp(positive score) + all exp(negative scores)
denominator ← numerator
FOR j ← 0 TO n_neg-1 DO
denominator ← denominator + exp(neg_scores[j])
END FOR
// Log probability
log_prob ← log(numerator / denominator)
// Accumulate negative log probability
total_loss ← total_loss - log_prob
END FOR
// Average over positives
loss ← total_loss / n_pos
RETURN loss
END
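The loss above, condensed into a sketch (illustrative names; the inner sum over negatives is hoisted out of the positive loop since it does not depend on i):

```python
import math

def cosine(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    return dot / (nx * ny) if nx > 0.0 and ny > 0.0 else 0.0

def info_nce_loss(anchor, positives, negatives, temperature=0.07):
    pos = [cosine(anchor, p) / temperature for p in positives]
    neg = [cosine(anchor, n) / temperature for n in negatives]
    neg_exp = sum(math.exp(s) for s in neg)
    total = 0.0
    for s in pos:
        # -log( exp(pos) / (exp(pos) + Σ exp(neg)) ), averaged over positives
        total -= math.log(math.exp(s) / (math.exp(s) + neg_exp))
    return total / len(pos)
```

A well-separated anchor (positive aligned, negative opposed) should yield a much smaller loss than the reversed configuration.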
6.2 Hard Negative Sampling
Purpose: Select informative negative samples for faster learning
Complexity:
- Time: O(N·d) where N = total number of samples
- Space: O(k) where k = number of hard negatives
ALGORITHM: SampleHardNegatives
INPUT:
anchor: anchor embedding [d]
all_embeddings: all available embeddings [N × d]
true_positives: indices of true positives
k: number of hard negatives to sample
strategy: sampling strategy ("distance", "degree", "mixed")
OUTPUT:
hard_negatives: selected hard negative samples [k × d]
BEGIN
// 1. Filter out true positives
candidate_indices ← EMPTY_LIST
FOR i ← 0 TO N-1 DO
IF i NOT IN true_positives THEN
candidate_indices.Append(i)
END IF
END FOR
n_candidates ← Length(candidate_indices)
// 2. Select hard negatives based on strategy
MATCH strategy:
CASE "distance":
hard_negatives ← SampleByDistance(
anchor, all_embeddings, candidate_indices, k
)
CASE "degree":
hard_negatives ← SampleByDegree(
anchor, all_embeddings, candidate_indices, k
)
CASE "mixed":
k_dist ← k / 2
k_deg ← k - k_dist
dist_negs ← SampleByDistance(
anchor, all_embeddings, candidate_indices, k_dist
)
deg_negs ← SampleByDegree(
anchor, all_embeddings, candidate_indices, k_deg
)
hard_negatives ← Concatenate(dist_negs, deg_negs)
DEFAULT:
ERROR "Unknown strategy"
END MATCH
RETURN hard_negatives
END
SUBROUTINE: SampleByDistance
INPUT: anchor[d], all_embeddings[N × d], candidate_indices, k
OUTPUT: hard_negatives[k × d]
BEGIN
// Select k most similar candidates (hardest negatives)
similarities ← EMPTY_ARRAY[Length(candidate_indices)]
FOR i ← 0 TO Length(candidate_indices)-1 DO
idx ← candidate_indices[i]
similarities[i] ← CosineSimilarity(anchor, all_embeddings[idx])
END FOR
// Get top-k most similar (hardest)
top_k_local_indices ← TopKIndices(similarities, k)
// Map back to global indices
hard_negatives ← EMPTY_MATRIX[k × d]
FOR i ← 0 TO k-1 DO
local_idx ← top_k_local_indices[i]
global_idx ← candidate_indices[local_idx]
hard_negatives[i] ← all_embeddings[global_idx]
END FOR
RETURN hard_negatives
END
SUBROUTINE: SampleByDegree
INPUT: anchor[d], all_embeddings[N × d], candidate_indices, k
OUTPUT: hard_negatives[k × d]
BEGIN
// Select candidates whose graph degree is similar to the anchor's
// (GetDegree resolves the node's degree from the graph; the embedding serves as a node handle)
anchor_degree ← GetDegree(anchor)
degree_diffs ← EMPTY_ARRAY[Length(candidate_indices)]
FOR i ← 0 TO Length(candidate_indices)-1 DO
idx ← candidate_indices[i]
candidate_degree ← GetDegree(all_embeddings[idx])
degree_diffs[i] ← abs(anchor_degree - candidate_degree)
END FOR
// Get k candidates with most similar degree
top_k_local_indices ← TopKIndices(
NegateArray(degree_diffs), // Negate for similarity
k
)
hard_negatives ← EMPTY_MATRIX[k × d]
FOR i ← 0 TO k-1 DO
local_idx ← top_k_local_indices[i]
global_idx ← candidate_indices[local_idx]
hard_negatives[i] ← all_embeddings[global_idx]
END FOR
RETURN hard_negatives
END
SUBROUTINE: NegateArray
INPUT: array[n]
OUTPUT: negated[n]
BEGIN
negated ← EMPTY_ARRAY[n]
FOR i ← 0 TO n-1 DO
negated[i] ← -array[i]
END FOR
RETURN negated
END
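As a cross-check of the distance-based branch above, here is a minimal NumPy sketch (function and variable names are illustrative, not part of the spec):

```python
import numpy as np

def sample_by_distance(anchor, all_embeddings, candidate_indices, k):
    """Return the k candidates most cosine-similar to the anchor
    (the 'hardest' negatives), mirroring SampleByDistance above."""
    cands = all_embeddings[candidate_indices]                   # [m, d]
    sims = (cands @ anchor) / (
        np.linalg.norm(cands, axis=1) * np.linalg.norm(anchor) + 1e-10
    )
    top_k_local = np.argsort(-sims)[:k]                         # most similar first
    global_idx = np.asarray(candidate_indices)[top_k_local]     # map back to global ids
    return all_embeddings[global_idx], global_idx

# Usage: the anchor points almost exactly at embedding 0,
# so among candidates {0, 1} it is the hardest negative.
emb = np.array([[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]])
anchor = np.array([1.0, 0.05])
negs, idx = sample_by_distance(anchor, emb, [0, 1], k=1)
```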
6.3 Curriculum Learning Schedule
Purpose: Gradually increase task difficulty during training
Complexity:
- Time: O(1) per epoch (just weight computation)
- Space: O(num_losses)
ALGORITHM: CurriculumSchedule
INPUT:
    current_epoch: current training epoch
    total_epochs: total number of epochs
    loss_types: list of loss components
    using_continual_learning: whether EWC is active (boolean)
OUTPUT:
loss_weights: weight for each loss component
LOSS_TYPES:
- reconstruction: Autoencoder reconstruction loss
- contrastive: InfoNCE contrastive loss
- task: Downstream task loss
- spectral: Laplacian regularization
- ewc: Elastic Weight Consolidation
BEGIN
loss_weights ← EMPTY_MAP
// 1. Reconstruction: High early, decay exponentially
lambda_recon ← exp(-current_epoch / 50.0)
loss_weights["reconstruction"] ← lambda_recon
// 2. Contrastive: Ramp up linearly in first 10 epochs
IF current_epoch < 10 THEN
lambda_contrast ← 0.1 + 0.9 * (current_epoch / 10.0)
ELSE
lambda_contrast ← 1.0
END IF
loss_weights["contrastive"] ← lambda_contrast
// 3. Task: small constant weight before epoch 50, then ramp up linearly
IF current_epoch < 50 THEN
lambda_task ← 0.1
ELSE
lambda_task ← 0.1 + 0.9 * ((current_epoch - 50) / 50.0)
lambda_task ← Min(lambda_task, 1.0)
END IF
loss_weights["task"] ← lambda_task
// 4. Spectral: Constant moderate weight
loss_weights["spectral"] ← 0.01
// 5. EWC: Increase if using continual learning
IF using_continual_learning THEN
lambda_ewc ← Min(current_epoch / 100.0, 1.0)
ELSE
lambda_ewc ← 0.0
END IF
loss_weights["ewc"] ← lambda_ewc
RETURN loss_weights
END
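The schedule above is simple enough to verify numerically. A direct Python transcription (the epoch thresholds are the document's; the function name is illustrative):

```python
import math

def curriculum_weights(epoch, using_continual_learning=False):
    """Per-epoch loss weights, mirroring CurriculumSchedule above."""
    w = {}
    # Reconstruction: high early, exponential decay
    w["reconstruction"] = math.exp(-epoch / 50.0)
    # Contrastive: linear ramp over the first 10 epochs
    w["contrastive"] = 0.1 + 0.9 * (epoch / 10.0) if epoch < 10 else 1.0
    # Task: low constant weight, then linear ramp from epoch 50
    if epoch < 50:
        w["task"] = 0.1
    else:
        w["task"] = min(0.1 + 0.9 * ((epoch - 50) / 50.0), 1.0)
    # Spectral: constant moderate weight
    w["spectral"] = 0.01
    # EWC: ramps up only under continual learning
    w["ewc"] = min(epoch / 100.0, 1.0) if using_continual_learning else 0.0
    return w
```

For example, at epoch 0 reconstruction dominates (weight 1.0) while task and contrastive sit at 0.1; by epoch 100 the task weight has ramped to 1.0.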
6.4 Multi-Objective Loss Computation
Purpose: Combine multiple loss functions with learned or scheduled weights
Complexity:
- Time: O(num_losses)
- Space: O(1)
ALGORITHM: MultiObjectiveLoss
INPUT:
loss_components: computed loss values
loss_weights: weights for each component
auto_balance: whether to auto-balance weights
OUTPUT:
total_loss: weighted sum of losses
updated_weights: potentially updated weights
LOSS_COMPONENTS:
task_loss: Main task objective
contrastive_loss: InfoNCE or similar
reconstruction_loss: Autoencoder
spectral_loss: Laplacian smoothness
ewc_loss: Continual learning penalty
BEGIN
// 1. Auto-balance (optional)
IF auto_balance THEN
loss_weights ← AutoBalance(loss_components, loss_weights)
END IF
// 2. Compute weighted sum
total_loss ← 0.0
total_loss ← total_loss + loss_weights["task"] * loss_components.task_loss
total_loss ← total_loss + loss_weights["contrastive"] * loss_components.contrastive_loss
total_loss ← total_loss + loss_weights["reconstruction"] * loss_components.reconstruction_loss
total_loss ← total_loss + loss_weights["spectral"] * loss_components.spectral_loss
total_loss ← total_loss + loss_weights["ewc"] * loss_components.ewc_loss
RETURN total_loss, loss_weights
END
SUBROUTINE: AutoBalance
INPUT: loss_components, current_weights
OUTPUT: balanced_weights
BEGIN
// Normalize so each loss contributes equally
num_losses ← 5
// Compute current contribution of each loss
contributions ← EMPTY_MAP
contributions["task"] ← current_weights["task"] * loss_components.task_loss
contributions["contrastive"] ← current_weights["contrastive"] * loss_components.contrastive_loss
contributions["reconstruction"] ← current_weights["reconstruction"] * loss_components.reconstruction_loss
contributions["spectral"] ← current_weights["spectral"] * loss_components.spectral_loss
contributions["ewc"] ← current_weights["ewc"] * loss_components.ewc_loss
// Compute total and target per-loss contribution
total ← Sum(contributions.values)
target_contribution ← total / num_losses
// Adjust weights to equalize contributions
balanced_weights ← EMPTY_MAP
epsilon ← 1e-10 // Avoid division by zero
balanced_weights["task"] ← target_contribution / Max(loss_components.task_loss, epsilon)
balanced_weights["contrastive"] ← target_contribution / Max(loss_components.contrastive_loss, epsilon)
balanced_weights["reconstruction"] ← target_contribution / Max(loss_components.reconstruction_loss, epsilon)
balanced_weights["spectral"] ← target_contribution / Max(loss_components.spectral_loss, epsilon)
balanced_weights["ewc"] ← target_contribution / Max(loss_components.ewc_loss, epsilon)
RETURN balanced_weights
END
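The balancing step can be checked directly: after rescaling, every component should contribute the same amount to the weighted sum. A minimal sketch using dicts in place of the named struct (names are illustrative):

```python
def auto_balance(losses, weights, eps=1e-10):
    """Rescale weights so each loss component contributes equally
    to the total, as in AutoBalance above."""
    contributions = {k: weights[k] * losses[k] for k in losses}
    target = sum(contributions.values()) / len(losses)
    # New weight = target contribution / raw loss (eps guards zero losses)
    return {k: target / max(losses[k], eps) for k in losses}

losses = {"task": 2.0, "contrastive": 0.5}
balanced = auto_balance(losses, {"task": 1.0, "contrastive": 1.0})
```

Here the total contribution is 2.5, so the target per component is 1.25; the task weight shrinks to 0.625 and the contrastive weight grows to 2.5.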
6.5 Spectral Regularization
Purpose: Preserve graph structure through Laplacian smoothness
Complexity:
- Time: O(|E|·d) where |E| = number of edges
- Space: O(1) (streaming computation)
ALGORITHM: LaplacianRegularization
INPUT:
embeddings: node embeddings [N × d]
edges: edge list [(u, v)]
edge_weights: optional edge weights [|E|]
normalized: whether to use normalized Laplacian
node_degrees: node degrees [N]
OUTPUT:
spectral_loss: smoothness penalty (scalar)
BEGIN
total_loss ← 0.0
num_edges ← Length(edges)
FOR i ← 0 TO num_edges-1 DO
u, v ← edges[i]
// Compute embedding difference
diff ← Subtract(embeddings[u], embeddings[v])
diff_norm_sq ← L2NormSquared(diff)
// Get edge weight
weight ← 1.0
IF edge_weights PROVIDED THEN
weight ← edge_weights[i]
END IF
// Normalized Laplacian: weight by degrees
IF normalized THEN
degree_norm ← sqrt(node_degrees[u] * node_degrees[v])
weight ← weight / Max(degree_norm, 1.0)
END IF
// Accumulate weighted squared difference
total_loss ← total_loss + weight * diff_norm_sq
END FOR
// Average over edges (guard against an empty edge list)
spectral_loss ← total_loss / Max(num_edges, 1)
RETURN spectral_loss
END
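A minimal NumPy sketch of the streaming computation above, assuming unit edge weights (function name illustrative):

```python
import numpy as np

def laplacian_loss(embeddings, edges, degrees=None, normalized=False):
    """Mean weighted squared embedding difference over edges,
    as in LaplacianRegularization above."""
    total = 0.0
    for u, v in edges:
        diff_sq = float(np.sum((embeddings[u] - embeddings[v]) ** 2))
        w = 1.0
        if normalized:
            # Normalized Laplacian: down-weight high-degree endpoints
            w /= max(np.sqrt(degrees[u] * degrees[v]), 1.0)
        total += w * diff_sq
    return total / max(len(edges), 1)

emb = np.array([[0.0, 0.0], [1.0, 0.0], [1.0, 1.0]])
loss = laplacian_loss(emb, [(0, 1), (1, 2)])
```

Both edges span a unit distance, so the unnormalized loss is exactly 1.0; identical endpoint embeddings would drive it to 0, which is the smoothness the regularizer rewards.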
7. Data Structures
7.1 Attention State
STRUCTURE: AttentionState
FIELDS:
query: [d] // Query embedding
keys: [n × d] // Key embeddings
values: [n × d] // Value embeddings
attention_weights: [n] // Computed weights
output: [d] // Final output
metadata: Map<String, Any> // Additional info
OPERATIONS:
Initialize(query, keys, values)
ComputeWeights() → attention_weights
ComputeOutput() → output
GetMetadata(key) → value
7.2 Graph Structure
STRUCTURE: Graph
FIELDS:
nodes: [N] // Node identifiers
embeddings: [N × d] // Node embeddings
adjacency: [N × N] OR SparseMatrix // Adjacency matrix
edge_list: [(u, v)] // Edge list
edge_features: [|E| × d_edge] // Edge attributes
node_degrees: [N] // Degree of each node
OPERATIONS:
GetNeighbors(node_id) → [neighbor_ids]
GetEdgeFeature(u, v) → [d_edge]
GetDegree(node_id) → scalar
AddEdge(u, v, features)
UpdateEmbedding(node_id, new_embedding)
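For orientation, a minimal adjacency-list realization of the Graph operations above (a sketch only; the eventual Rust implementation will use its own types and a sparse representation):

```python
class Graph:
    """Minimal adjacency-list version of the Graph structure above."""

    def __init__(self, num_nodes, dim):
        self.embeddings = [[0.0] * dim for _ in range(num_nodes)]
        self.adj = {i: [] for i in range(num_nodes)}
        self.edge_features = {}

    def add_edge(self, u, v, features=None):
        self.adj[u].append(v)
        self.adj[v].append(u)              # undirected graph assumed
        self.edge_features[(u, v)] = features

    def get_neighbors(self, node_id):
        return self.adj[node_id]

    def get_degree(self, node_id):
        return len(self.adj[node_id])

    def update_embedding(self, node_id, new_embedding):
        self.embeddings[node_id] = list(new_embedding)
```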
7.3 HNSW-Specific Structure
STRUCTURE: HNSWGraph
EXTENDS: Graph
ADDITIONAL_FIELDS:
layers: [max_layer] // Layer-wise graphs
entry_point: node_id // Top-layer entry
max_layer: integer // Maximum layer
layer_neighbors: Map<(node, layer), [neighbors]>
OPERATIONS:
GetLayerNeighbors(node_id, layer) → [neighbor_ids]
GetNodeLayer(node_id) → layer
NavigateLayer(query, layer, num_steps) → closest_node
InsertNode(node_id, embedding, layer)
7.4 Training State
STRUCTURE: TrainingState
FIELDS:
current_epoch: integer
loss_history: [num_epochs]
loss_weights: Map<loss_type, weight>
curriculum_schedule: CurriculumSchedule
optimizer_state: OptimizerState
best_model_params: ModelParams
early_stopping_counter: integer
OPERATIONS:
UpdateEpoch()
RecordLoss(loss_value)
GetLossWeight(loss_type) → weight
UpdateBestModel(current_params)
ShouldEarlyStop() → boolean
8. Complexity Summary
8.1 Attention Mechanisms
| Mechanism | Time Complexity | Space Complexity | Notes |
|---|---|---|---|
| Scaled Dot-Product | O(n·d²) | O(n) | Standard attention |
| Multi-Head (h heads) | O(n·d²/h) | O(h·d) | Parallel heads |
| Hyperbolic | O(n·d²) | O(n) | More expensive ops |
| Sparse (Local+Global) | O((k_l + k_g)·d) | O(k_l + k_g) | k << n |
| Linear (Performer) | O(n·D·d) | O(D·d) | D = random features |
| Flash | O(n²·d) | O(n) | Exact attention; tiling gives O(n) memory and better cache locality |
| Edge-Featured | O(n·(d² + d_edge·d)) | O(n) | Added edge cost |
| RoPE | O(n·d²) | O(n) | Rotation overhead minimal |
| Cross-Space | O(n_g·d² + k_l·d²) | O(n_g + k_l) | Dual attention |
| MoE (k experts) | O(k·base_complexity) | O(num_experts·model_size) | Expert routing |
Legend:
- n: number of neighbors/keys
- d: embedding dimension
- h: number of attention heads
- k_l, k_g: local and global neighbor counts
- D: number of random features
- d_edge: edge feature dimension
8.2 Training Operations
| Operation | Time Complexity | Space Complexity | Notes |
|---|---|---|---|
| InfoNCE Loss | O((n_pos + n_neg)·d) | O(n_pos + n_neg) | Per anchor |
| Hard Negative Sampling | O(N·d) | O(k) | N = total samples |
| Spectral Regularization | O(|E|·d) | O(1) | E = edges |
| Curriculum Schedule | O(1) | O(num_losses) | Per epoch |
| Multi-Objective Loss | O(num_losses) | O(1) | Weighted sum |
9. Implementation Notes
9.1 Numerical Stability
Softmax Stability:
// Always subtract max before exp
max_score ← Max(scores)
exp_scores[i] ← exp(scores[i] - max_score)
Hyperbolic Boundary:
// Ensure points stay in Poincaré ball
IF ||x|| >= 1.0 THEN
x ← 0.99 * x / ||x|| // Project back with margin
END IF
Division by Zero:
// Add epsilon to denominators
result ← numerator / (denominator + 1e-10)
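The three stability guards above, collected into one runnable NumPy sketch (function names are illustrative):

```python
import numpy as np

def stable_softmax(scores):
    """Subtract the max before exponentiating to avoid overflow."""
    shifted = np.exp(scores - np.max(scores))
    return shifted / np.sum(shifted)

def project_to_ball(x, margin=0.99):
    """Keep a point strictly inside the Poincaré ball, with a margin."""
    norm = np.linalg.norm(x)
    return (margin / norm) * x if norm >= 1.0 else x

# Large scores that would overflow a naive softmax
probs = stable_softmax(np.array([1000.0, 999.0, 998.0]))
# A point well outside the unit ball, projected back
p = project_to_ball(np.array([3.0, 4.0]))
```

The epsilon guard for divisions is already shown inline above; both helpers here follow the same defensive pattern.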
9.2 Performance Optimization
Vectorization:
- Use SIMD operations for dot products
- Batch matrix multiplications
- Parallelize independent attention heads
Memory Layout:
- Contiguous memory for cache efficiency
- Column-major for matrix operations
- Pre-allocate buffers
Lazy Computation:
- Only compute attention weights when needed
- Cache frequently accessed embeddings
- Prune low-weight attention connections
9.3 Testing Strategies
Unit Tests:
TEST: ScaledDotProductAttention
INPUT: Known query, keys, values
EXPECTED: Hand-computed output
VERIFY: Output matches expected within tolerance
TEST: Softmax Numerical Stability
INPUT: Very large scores [1000, 999, 998]
VERIFY: No NaN or Inf in output
VERIFY: Probabilities sum to 1.0
TEST: Hyperbolic Boundary
INPUT: Points near ball boundary (||x|| = 0.99)
VERIFY: Result still in ball (||result|| < 1.0)
Integration Tests:
TEST: End-to-End Attention Pipeline
INPUT: Real graph structure
VERIFY: All mechanisms produce valid outputs
VERIFY: Outputs are differentiable
Performance Tests:
BENCHMARK: Attention Complexity
INPUT: Varying n = [10, 100, 1000, 10000]
MEASURE: Time and memory usage
VERIFY: Matches theoretical complexity
10. References
10.1 Core Papers
- Attention Mechanism: Vaswani et al. (2017) - "Attention Is All You Need"
- GAT: Veličković et al. (2018) - "Graph Attention Networks"
- Hyperbolic GNNs: Chami et al. (2019) - "Hyperbolic Graph Convolutional Neural Networks"
- Performer: Choromanski et al. (2020) - "Rethinking Attention with Performers"
- Flash Attention: Dao et al. (2022) - "FlashAttention: Fast and Memory-Efficient Exact Attention"
- RoPE: Su et al. (2021) - "RoFormer: Enhanced Transformer with Rotary Position Embedding"
- MoE: Shazeer et al. (2017) - "Outrageously Large Neural Networks"
10.2 Mathematical Background
- Hyperbolic Geometry: Cannon et al. (1997) - "Hyperbolic Geometry"
- Graph Laplacian: Chung (1997) - "Spectral Graph Theory"
- Contrastive Learning: Chen et al. (2020) - "A Simple Framework for Contrastive Learning"
11. Glossary
- Attention: Mechanism to weight the importance of different inputs
- Multi-Head: Parallel attention with different learned projections
- Hyperbolic Space: Non-Euclidean geometry with constant negative curvature
- Poincaré Ball: Conformal model of hyperbolic space in the unit ball
- Möbius Addition: Hyperbolic vector addition operation
- Sparse Attention: Attention over a subset of inputs (not all pairs)
- Linear Attention: O(n) complexity via kernel approximation
- Flash Attention: Memory-efficient tiled attention computation
- RoPE: Rotary Position Embedding for distance encoding
- Cross-Attention: Attention between two different spaces
- MoE: Mixture of Experts, routing to specialized sub-models
- InfoNCE: Noise Contrastive Estimation loss for contrastive learning
- Hard Negatives: Difficult negative samples close to positives
- Curriculum Learning: Gradually increasing task difficulty
- Spectral Regularization: Graph smoothness via the Laplacian
Document Version: 1.0
Last Updated: 2025-11-30
Author: RuVector Research Team
SPARC Phase: Pseudocode (Phase 2)
Next Phase: Architecture (Phase 3) - See 04-architecture.md
Appendix A: Quick Reference
Common Subroutines
DotProduct(x, y) → scalar
L2Norm(x) → scalar
L2NormSquared(x) → scalar
Softmax(scores) → probabilities
CosineSimilarity(x, y) → similarity ∈ [-1, 1]
Scale(x, scalar) → scaled_vector
Add(x, y) → sum_vector
Subtract(x, y) → diff_vector
Concatenate(vectors...) → concatenated_vector
ZeroVector(d) → zero-initialized vector
ZeroMatrix(rows, cols) → zero-initialized matrix
Complexity Quick Reference
O(1) - Constant time
O(d) - Linear in dimension
O(n) - Linear in number of items
O(n·d) - Linear in both
O(n²) - Quadratic (standard full attention)
O(n·d²) - Attention complexity
O(|E|) - Linear in number of edges