Files
wifi-densepose/docs/research/latent-space/implementation-plans/03-pseudocode.md
ruv d803bfe2b1 Squashed 'vendor/ruvector/' content from commit b64c2172
git-subtree-dir: vendor/ruvector
git-subtree-split: b64c21726f2bb37286d9ee36a7869fef60cc6900
2026-02-28 14:39:40 -05:00

2031 lines
52 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# SPARC Pseudocode: RuVector Attention Mechanisms
## Executive Summary
This document provides comprehensive pseudocode for all attention mechanisms proposed for the RuVector GNN latent-graph interplay system. Following SPARC methodology, this serves as the bridge between specification (requirements) and architecture (system design).
**Scope**: Complete algorithmic specifications for attention mechanisms, training procedures, and optimization strategies.
**Target Audience**: Implementers who will translate these algorithms into Rust code.
**Conventions**:
- UPPERCASE: Algorithm names, constants
- lowercase: Variables, parameters
- `←`: Assignment
- `∈`: Set membership
- Arrays are 0-indexed unless specified
- All complexity analysis uses Big-O notation
---
## Table of Contents
1. [Core Attention Mechanisms](#1-core-attention-mechanisms)
2. [Geometric Attention](#2-geometric-attention)
3. [Sparse Attention](#3-sparse-attention)
4. [Graph Attention](#4-graph-attention)
5. [Adaptive Attention](#5-adaptive-attention)
6. [Training Procedures](#6-training-procedures)
7. [Data Structures](#7-data-structures)
8. [Complexity Summary](#8-complexity-summary)
---
## 1. Core Attention Mechanisms
### 1.1 Scaled Dot-Product Attention
**Purpose**: Foundation attention mechanism for all variants
**Complexity**:
- Time: O(n·d²) where n = number of keys, d = embedding dimension
- Space: O(n)
```
ALGORITHM: ScaledDotProductAttention
INPUT:
Q: query vector [d]
K: key matrix [n × d]
V: value matrix [n × d]
d_k: key dimension (scalar)
OUTPUT:
output: attention output [d]
weights: attention weights [n]
BEGIN
// 1. Compute attention scores
scores ← EMPTY_ARRAY[n]
FOR i ← 0 TO n-1 DO
scores[i] ← DotProduct(Q, K[i]) / sqrt(d_k)
END FOR
// 2. Apply softmax for normalization
weights ← Softmax(scores)
// 3. Weighted sum of values
output ← ZeroVector(d)
FOR i ← 0 TO n-1 DO
output ← output + weights[i] * V[i]
END FOR
RETURN output, weights
END
SUBROUTINE: DotProduct
INPUT: x[d], y[d]
OUTPUT: scalar
BEGIN
sum ← 0
FOR i ← 0 TO d-1 DO
sum ← sum + x[i] * y[i]
END FOR
RETURN sum
END
SUBROUTINE: Softmax
INPUT: scores[n]
OUTPUT: probabilities[n]
BEGIN
// Numerical stability: subtract max
max_score ← Max(scores)
exp_scores ← EMPTY_ARRAY[n]
sum_exp ← 0
FOR i ← 0 TO n-1 DO
exp_scores[i] ← exp(scores[i] - max_score)
sum_exp ← sum_exp + exp_scores[i]
END FOR
probabilities ← EMPTY_ARRAY[n]
FOR i ← 0 TO n-1 DO
probabilities[i] ← exp_scores[i] / sum_exp
END FOR
RETURN probabilities
END
```
---
### 1.2 Multi-Head Attention
**Purpose**: Learn multiple representation subspaces simultaneously
**Complexity**:
- Time: O(h·n·d²/h²) = O(n·d²/h) where h = number of heads
- Space: O(h·d)
```
ALGORITHM: MultiHeadAttention
INPUT:
Q: query vector [d_model]
K: key matrix [n × d_model]
V: value matrix [n × d_model]
num_heads: number of attention heads
W_Q: query projection weights [num_heads × d_head × d_model]
W_K: key projection weights [num_heads × d_head × d_model]
W_V: value projection weights [num_heads × d_head × d_model]
W_O: output projection weights [d_model × d_model]
OUTPUT:
output: multi-head attention output [d_model]
CONSTANTS:
d_head ← d_model / num_heads
BEGIN
heads ← EMPTY_ARRAY[num_heads]
// 1. Project and compute attention for each head
FOR h ← 0 TO num_heads-1 DO
// Project query
Q_h ← LinearTransform(Q, W_Q[h]) // [d_head]
// Project keys
K_h ← EMPTY_MATRIX[n × d_head]
FOR i ← 0 TO n-1 DO
K_h[i] ← LinearTransform(K[i], W_K[h])
END FOR
// Project values
V_h ← EMPTY_MATRIX[n × d_head]
FOR i ← 0 TO n-1 DO
V_h[i] ← LinearTransform(V[i], W_V[h])
END FOR
// Compute attention for this head
head_output, _ ← ScaledDotProductAttention(Q_h, K_h, V_h, d_head)
heads[h] ← head_output
END FOR
// 2. Concatenate all heads
concat ← Concatenate(heads[0], heads[1], ..., heads[num_heads-1])
// 3. Final linear projection
output ← LinearTransform(concat, W_O)
RETURN output
END
SUBROUTINE: LinearTransform
INPUT: x[d_in], W[d_out × d_in]
OUTPUT: y[d_out]
BEGIN
y ← ZeroVector(d_out)
FOR i ← 0 TO d_out-1 DO
FOR j ← 0 TO d_in-1 DO
y[i] ← y[i] + W[i][j] * x[j]
END FOR
END FOR
RETURN y
END
SUBROUTINE: Concatenate
INPUT: vectors... (variable number of vectors)
OUTPUT: concatenated vector
BEGIN
total_dim ← Sum of all input dimensions
result ← EMPTY_ARRAY[total_dim]
offset ← 0
FOR EACH vector IN vectors DO
FOR i ← 0 TO Length(vector)-1 DO
result[offset + i] ← vector[i]
END FOR
offset ← offset + Length(vector)
END FOR
RETURN result
END
```
---
## 2. Geometric Attention
### 2.1 Hyperbolic Attention (Poincaré Ball Model)
**Purpose**: Capture hierarchical structure using hyperbolic geometry
**Complexity**:
- Time: O(n·d²) (same as Euclidean, but with more expensive ops)
- Space: O(n)
**Geometric Background**:
```
Poincaré Ball: B^d = {x ∈ R^d : ||x|| < 1}
Distance: d_P(x,y) = arcosh(1 + 2||x-y||²/((1-||x||²)(1-||y||²)))
```
```
ALGORITHM: HyperbolicAttention
INPUT:
query: query point in Poincaré ball [d]
keys: key points in Poincaré ball [n × d]
values: value points in Poincaré ball [n × d]
curvature: negative curvature (typically -1.0)
temperature: softmax temperature
OUTPUT:
output: aggregated point in Poincaré ball [d]
BEGIN
// 1. Compute hyperbolic distances as similarity scores
scores ← EMPTY_ARRAY[n]
FOR i ← 0 TO n-1 DO
// Negative distance = similarity (closer = higher score)
scores[i] ← -PoincareDistance(query, keys[i], curvature)
END FOR
// 2. Softmax to get attention weights
weights ← Softmax(scores / temperature)
// 3. Hyperbolic weighted aggregation using Möbius addition
result ← ZeroVector(d) // Origin in Poincaré ball
FOR i ← 0 TO n-1 DO
// Scale value by weight using Möbius scalar multiplication
scaled_value ← MobiusScalarMult(weights[i], values[i], curvature)
// Add to result using Möbius addition
result ← MobiusAdd(result, scaled_value, curvature)
END FOR
RETURN result
END
SUBROUTINE: PoincareDistance
INPUT: x[d], y[d], curvature
OUTPUT: distance (scalar)
BEGIN
// Compute squared norms
x_norm_sq ← L2NormSquared(x)
y_norm_sq ← L2NormSquared(y)
// Ensure points are inside the ball (||x|| < 1, ||y|| < 1)
IF x_norm_sq >= 1.0 OR y_norm_sq >= 1.0 THEN
ERROR "Points must be inside Poincaré ball"
END IF
// Compute squared distance between points
diff ← Subtract(x, y)
diff_norm_sq ← L2NormSquared(diff)
// Poincaré distance formula
numerator ← 2.0 * diff_norm_sq
denominator ← (1.0 - x_norm_sq) * (1.0 - y_norm_sq)
arg ← 1.0 + numerator / denominator
// Numerical stability: clamp arg >= 1.0
IF arg < 1.0 THEN
arg ← 1.0
END IF
distance ← sqrt(abs(curvature)) * arcosh(arg)
RETURN distance
END
SUBROUTINE: MobiusAdd
INPUT: x[d], y[d], curvature
OUTPUT: z[d] (Möbius sum x ⊕ y)
BEGIN
// Special case: if x is origin, return y
IF IsZero(x) THEN
RETURN y
END IF
// Special case: if y is origin, return x
IF IsZero(y) THEN
RETURN x
END IF
// Compute norms and dot product
x_norm_sq ← L2NormSquared(x)
y_norm_sq ← L2NormSquared(y)
xy_dot ← DotProduct(x, y)
// Möbius addition formula:
// z = ((1 + 2c⟨x,y⟩ + c||y||²)x + (1 - c||x||²)y) /
// (1 + 2c⟨x,y⟩ + c²||x||²||y||²)
c ← -curvature // For Poincaré ball, typically c = 1
numerator_x_coef ← 1.0 + 2.0*c*xy_dot + c*y_norm_sq
numerator_y_coef ← 1.0 - c*x_norm_sq
denominator ← 1.0 + 2.0*c*xy_dot + c*c*x_norm_sq*y_norm_sq
numerator ← Add(
Scale(x, numerator_x_coef),
Scale(y, numerator_y_coef)
)
z ← Scale(numerator, 1.0 / denominator)
// Project back to ball if numerical errors pushed outside
z_norm ← L2Norm(z)
IF z_norm >= 1.0 THEN
z ← Scale(z, 0.99 / z_norm) // Project to ball with margin
END IF
RETURN z
END
SUBROUTINE: MobiusScalarMult
INPUT: r (scalar), x[d], curvature
OUTPUT: r ⊗ x (Möbius scalar multiplication)
BEGIN
// Handle special cases
IF r == 0 OR IsZero(x) THEN
RETURN ZeroVector(d)
END IF
x_norm ← L2Norm(x)
c ← -curvature
// Möbius scalar multiplication:
// r ⊗ x = (1/√c) * tanh(r * arctanh(√c * ||x||)) * (x / ||x||)
sqrt_c ← sqrt(c)
arctanh_arg ← sqrt_c * x_norm
// Numerical stability
IF arctanh_arg >= 1.0 THEN
arctanh_arg ← 0.999
END IF
arctanh_val ← arctanh(arctanh_arg)
tanh_arg ← r * arctanh_val
tanh_val ← tanh(tanh_arg)
scale_factor ← (1.0 / sqrt_c) * tanh_val / x_norm
result ← Scale(x, scale_factor)
RETURN result
END
SUBROUTINE: L2NormSquared
INPUT: x[d]
OUTPUT: ||x||² (scalar)
BEGIN
sum ← 0
FOR i ← 0 TO d-1 DO
sum ← sum + x[i] * x[i]
END FOR
RETURN sum
END
SUBROUTINE: L2Norm
INPUT: x[d]
OUTPUT: ||x|| (scalar)
BEGIN
RETURN sqrt(L2NormSquared(x))
END
SUBROUTINE: Subtract
INPUT: x[d], y[d]
OUTPUT: x - y [d]
BEGIN
result ← EMPTY_ARRAY[d]
FOR i ← 0 TO d-1 DO
result[i] ← x[i] - y[i]
END FOR
RETURN result
END
SUBROUTINE: Add
INPUT: x[d], y[d]
OUTPUT: x + y [d]
BEGIN
result ← EMPTY_ARRAY[d]
FOR i ← 0 TO d-1 DO
result[i] ← x[i] + y[i]
END FOR
RETURN result
END
SUBROUTINE: Scale
INPUT: x[d], scalar
OUTPUT: scalar * x [d]
BEGIN
result ← EMPTY_ARRAY[d]
FOR i ← 0 TO d-1 DO
result[i] ← scalar * x[i]
END FOR
RETURN result
END
SUBROUTINE: IsZero
INPUT: x[d]
OUTPUT: boolean
BEGIN
epsilon ← 1e-10
RETURN L2Norm(x) < epsilon
END
```
---
## 3. Sparse Attention
### 3.1 Local + Global Sparse Attention
**Purpose**: Reduce O(n²) to O(k_local + k_global) for large graphs
**Complexity**:
- Time: O(k_local·d + k_global·d) where k_local, k_global << n
- Space: O(k_local + k_global)
```
ALGORITHM: SparseLocalGlobalAttention
INPUT:
query: query vector [d]
all_neighbors: all neighbor embeddings [n × d]
neighbor_layers: HNSW layer for each neighbor [n]
local_window: size of local neighborhood
global_indices: indices of global attention nodes
OUTPUT:
output: attention output [d]
BEGIN
// 1. Partition neighbors into local and global
local_neighbors ← EMPTY_LIST
local_indices ← EMPTY_LIST
global_neighbors ← EMPTY_LIST
global_indices_actual ← EMPTY_LIST
FOR i ← 0 TO n-1 DO
IF neighbor_layers[i] == 0 AND Length(local_neighbors) < local_window THEN
// Layer 0 = local neighbors
local_neighbors.Append(all_neighbors[i])
local_indices.Append(i)
ELSE IF neighbor_layers[i] > 0 AND i IN global_indices THEN
// Higher layers = global neighbors
global_neighbors.Append(all_neighbors[i])
global_indices_actual.Append(i)
END IF
END FOR
// 2. Compute local attention
local_output ← ZeroVector(d)
IF Length(local_neighbors) > 0 THEN
local_K ← ConvertToMatrix(local_neighbors)
local_V ← local_K // Self-attention
local_output, _ ← ScaledDotProductAttention(
query, local_K, local_V, d
)
END IF
// 3. Compute global attention
global_output ← ZeroVector(d)
IF Length(global_neighbors) > 0 THEN
global_K ← ConvertToMatrix(global_neighbors)
global_V ← global_K
global_output, _ ← ScaledDotProductAttention(
query, global_K, global_V, d
)
END IF
// 4. Learned gating to combine local and global
alpha ← LearnedGate(query, local_output, global_output)
// 5. Combine outputs
output ← ZeroVector(d)
FOR i ← 0 TO d-1 DO
output[i] ← alpha * local_output[i] + (1.0 - alpha) * global_output[i]
END FOR
RETURN output
END
SUBROUTINE: LearnedGate
INPUT: query[d], local_output[d], global_output[d]
OUTPUT: alpha (scalar in [0, 1])
BEGIN
// Concatenate all inputs
concat ← Concatenate(query, local_output, global_output)
// Linear projection + sigmoid
gate_weights ← LEARNED_PARAMETERS[3*d] // Learned during training
bias ← LEARNED_BIAS // Learned during training
logit ← DotProduct(concat, gate_weights) + bias
alpha ← Sigmoid(logit)
RETURN alpha
END
SUBROUTINE: Sigmoid
INPUT: x (scalar)
OUTPUT: sigmoid(x) in [0, 1]
BEGIN
RETURN 1.0 / (1.0 + exp(-x))
END
```
---
### 3.2 Linear Attention (Performer / Random Features)
**Purpose**: O(n·d) complexity using kernel approximation
**Complexity**:
- Time: O(n·D·d) where D = number of random features
- Space: O(D·d)
```
ALGORITHM: LinearAttention
INPUT:
query: query vector [d]
keys: key matrix [n × d]
values: value matrix [n × d]
num_features: number of random features D
random_matrix: random projection matrix [D × d]
OUTPUT:
output: attention output [d]
BEGIN
// 1. Apply feature map to query
phi_Q ← FeatureMap(query, random_matrix, num_features)
// 2. Apply feature map to all keys
phi_K ← EMPTY_MATRIX[n × num_features]
FOR i ← 0 TO n-1 DO
phi_K[i] ← FeatureMap(keys[i], random_matrix, num_features)
END FOR
// 3. Compute K^T V (sum over neighbors) - O(n·D·d)
KV_sum ← ZeroMatrix(num_features, d)
FOR i ← 0 TO n-1 DO
FOR j ← 0 TO num_features-1 DO
FOR k ← 0 TO d-1 DO
KV_sum[j][k] ← KV_sum[j][k] + phi_K[i][j] * values[i][k]
END FOR
END FOR
END FOR
// 4. Compute Q·(K^T V) - O(D·d)
numerator ← ZeroVector(d)
FOR k ← 0 TO d-1 DO
FOR j ← 0 TO num_features-1 DO
numerator[k] ← numerator[k] + phi_Q[j] * KV_sum[j][k]
END FOR
END FOR
// 5. Compute K^T 1 (sum of feature-mapped keys) - O(n·D)
K_sum ← ZeroVector(num_features)
FOR i ← 0 TO n-1 DO
FOR j ← 0 TO num_features-1 DO
K_sum[j] ← K_sum[j] + phi_K[i][j]
END FOR
END FOR
// 6. Compute denominator Q·(K^T 1) - O(D)
denominator ← DotProduct(phi_Q, K_sum)
// 7. Normalize
output ← Scale(numerator, 1.0 / (denominator + 1e-10))
RETURN output
END
SUBROUTINE: FeatureMap
INPUT: x[d], random_matrix[D × d], num_features D
OUTPUT: features[D]
BEGIN
// Random Fourier Features
// φ(x) = sqrt(1/D) * [cos(w₁·x), sin(w₁·x), cos(w₂·x), sin(w₂·x), ...]
scale ← 1.0 / sqrt(num_features)
features ← EMPTY_ARRAY[num_features]
FOR i ← 0 TO num_features/2 - 1 DO
// Get random projection
w ← random_matrix[i]
projection ← DotProduct(w, x)
// Apply cos and sin
features[2*i] ← scale * cos(projection)
features[2*i + 1] ← scale * sin(projection)
END FOR
RETURN features
END
SUBROUTINE: ZeroMatrix
INPUT: rows, cols
OUTPUT: matrix[rows × cols]
BEGIN
matrix ← EMPTY_MATRIX[rows × cols]
FOR i ← 0 TO rows-1 DO
FOR j ← 0 TO cols-1 DO
matrix[i][j] ← 0.0
END FOR
END FOR
RETURN matrix
END
```
---
### 3.3 Flash Attention (Tiled / Memory-Efficient)
**Purpose**: O(n) memory instead of O(n²) through tiling
**Complexity**:
- Time: O(n²·d) (same as standard, but better cache locality)
- Space: O(n) instead of O(n²)
```
ALGORITHM: FlashAttention
INPUT:
query: query vector [d]
keys: key matrix [n × d]
values: value matrix [n × d]
block_size: tile size B (typically 64-128)
OUTPUT:
output: attention output [d]
BEGIN
n ← Length(keys)
output ← ZeroVector(d)
row_max ← -INFINITY
row_sum ← 0.0
num_blocks ← Ceiling(n / block_size)
// Process keys/values in blocks (tiles)
FOR block_idx ← 0 TO num_blocks-1 DO
// 1. Define current block range
chunk_start ← block_idx * block_size
chunk_end ← Min(chunk_start + block_size, n)
chunk_size ← chunk_end - chunk_start
// 2. Extract block of keys and values
chunk_K ← keys[chunk_start : chunk_end]
chunk_V ← values[chunk_start : chunk_end]
// 3. Compute attention scores for this block
scores ← EMPTY_ARRAY[chunk_size]
FOR i ← 0 TO chunk_size-1 DO
scores[i] ← DotProduct(query, chunk_K[i]) / sqrt(d)
END FOR
// 4. Online softmax: update running max
new_max ← Max(row_max, Max(scores))
// 5. Compute exponentials with new max
exp_scores ← EMPTY_ARRAY[chunk_size]
FOR i ← 0 TO chunk_size-1 DO
exp_scores[i] ← exp(scores[i] - new_max)
END FOR
// 6. Correction factor for previous blocks
correction ← exp(row_max - new_max)
// 7. Update running sum of exponentials
chunk_sum ← Sum(exp_scores)
row_sum ← row_sum * correction + chunk_sum
// 8. Update running max
row_max ← new_max
// 9. Accumulate weighted values with correction
FOR i ← 0 TO d-1 DO
output[i] ← output[i] * correction
END FOR
FOR i ← 0 TO chunk_size-1 DO
FOR j ← 0 TO d-1 DO
output[j] ← output[j] + exp_scores[i] * chunk_V[i][j]
END FOR
END FOR
END FOR
// 10. Final normalization
FOR i ← 0 TO d-1 DO
output[i] ← output[i] / row_sum
END FOR
RETURN output
END
SUBROUTINE: Max
INPUT: array[n] OR two scalars
OUTPUT: maximum value
BEGIN
IF array is provided THEN
max_val ← array[0]
FOR i ← 1 TO Length(array)-1 DO
IF array[i] > max_val THEN
max_val ← array[i]
END IF
END FOR
RETURN max_val
ELSE
// Two scalars
RETURN IF (a > b) THEN a ELSE b
END IF
END
SUBROUTINE: Sum
INPUT: array[n]
OUTPUT: sum of elements
BEGIN
total ← 0
FOR i ← 0 TO Length(array)-1 DO
total ← total + array[i]
END FOR
RETURN total
END
SUBROUTINE: Ceiling
INPUT: x (real number)
OUTPUT: ⌈x⌉ (smallest integer >= x)
BEGIN
RETURN integer ceiling of x
END
SUBROUTINE: Min
INPUT: a, b (scalars)
OUTPUT: minimum value
BEGIN
RETURN IF (a < b) THEN a ELSE b
END
```
---
## 4. Graph Attention
### 4.1 Edge-Featured Attention
**Purpose**: Incorporate edge attributes into attention computation
**Complexity**:
- Time: O(n·(d² + d_edge·d))
- Space: O(n)
```
ALGORITHM: EdgeFeaturedAttention
INPUT:
query: query node embedding [d]
keys: neighbor node embeddings [n × d]
values: neighbor node embeddings [n × d]
edge_features: edge attributes [n × d_edge]
W_node: node transformation matrix [d × d]
W_edge: edge transformation matrix [d_edge × d_attn]
a: attention coefficient vector [2d + d_attn]
OUTPUT:
output: aggregated embedding [d]
BEGIN
// 1. Transform query
q_trans ← MatrixVectorMult(W_node, query)
// 2. Transform all keys and edge features
k_trans ← EMPTY_MATRIX[n × d]
e_trans ← EMPTY_MATRIX[n × d_attn]
FOR i ← 0 TO n-1 DO
k_trans[i] ← MatrixVectorMult(W_node, keys[i])
e_trans[i] ← MatrixVectorMult(W_edge, edge_features[i])
END FOR
// 3. Compute attention scores with edge features
scores ← EMPTY_ARRAY[n]
FOR i ← 0 TO n-1 DO
// Concatenate [query || key || edge]
concat ← Concatenate(q_trans, k_trans[i], e_trans[i])
// Attention coefficient
score ← DotProduct(a, concat)
// Activation (LeakyReLU)
scores[i] ← LeakyReLU(score, alpha=0.2)
END FOR
// 4. Softmax normalization
weights ← Softmax(scores)
// 5. Weighted aggregation
output ← WeightedSum(values, weights)
RETURN output
END
SUBROUTINE: MatrixVectorMult
INPUT: M[m × n], v[n]
OUTPUT: result[m]
BEGIN
result ← ZeroVector(m)
FOR i ← 0 TO m-1 DO
FOR j ← 0 TO n-1 DO
result[i] ← result[i] + M[i][j] * v[j]
END FOR
END FOR
RETURN result
END
SUBROUTINE: LeakyReLU
INPUT: x (scalar), alpha (negative slope)
OUTPUT: activated value
BEGIN
IF x >= 0 THEN
RETURN x
ELSE
RETURN alpha * x
END IF
END
SUBROUTINE: WeightedSum
INPUT: vectors[n × d], weights[n]
OUTPUT: result[d]
BEGIN
result ← ZeroVector(d)
FOR i ← 0 TO n-1 DO
FOR j ← 0 TO d-1 DO
result[j] ← result[j] + weights[i] * vectors[i][j]
END FOR
END FOR
RETURN result
END
```
---
### 4.2 RoPE Graph Attention
**Purpose**: Encode graph distances via rotary position embeddings
**Complexity**:
- Time: O(n·d²)
- Space: O(n)
```
ALGORITHM: RoPEGraphAttention
INPUT:
query: query node embedding [d]
keys: neighbor node embeddings [n × d]
values: neighbor node embeddings [n × d]
distances: graph distances to neighbors [n]
base: RoPE frequency base (default 10000)
OUTPUT:
output: attention output [d]
BEGIN
// 1. Apply RoPE rotation to query (at origin, distance = 0)
Q_rotated ← ApplyRotation(query, distance=0.0, base)
// 2. Apply RoPE rotation to keys based on their distances
K_rotated ← EMPTY_MATRIX[n × d]
FOR i ← 0 TO n-1 DO
K_rotated[i] ← ApplyRotation(keys[i], distances[i], base)
END FOR
// 3. Compute attention scores with rotated embeddings
scores ← EMPTY_ARRAY[n]
FOR i ← 0 TO n-1 DO
scores[i] ← DotProduct(Q_rotated, K_rotated[i])
END FOR
// 4. Softmax and aggregate
weights ← Softmax(scores)
output ← WeightedSum(values, weights)
RETURN output
END
SUBROUTINE: ApplyRotation
INPUT: embedding[d], distance (scalar), base
OUTPUT: rotated[d]
BEGIN
rotated ← ZeroVector(d)
// Apply rotation to pairs of dimensions
FOR i ← 0 TO d/2 - 1 DO
// Compute rotation angle for this dimension pair
theta ← distance / (base ^ (2.0 * i / d))
cos_theta ← cos(theta)
sin_theta ← sin(theta)
// Rotate dimensions (2*i, 2*i+1)
rotated[2*i] ← embedding[2*i] * cos_theta - embedding[2*i+1] * sin_theta
rotated[2*i+1] ← embedding[2*i] * sin_theta + embedding[2*i+1] * cos_theta
END FOR
RETURN rotated
END
```
---
### 4.3 Cross-Space (Dual) Attention
**Purpose**: Bridge graph topology and latent space semantics
**Complexity**:
- Time: O(n_graph·d² + k_latent·d² + k_latent²·d)
- Space: O(n_graph + k_latent)
```
ALGORITHM: DualSpaceAttention
INPUT:
query: query node embedding [d]
graph_neighbors: topological neighbors [n_graph × d]
all_embeddings: all node embeddings for latent search [N × d]
k_latent: number of latent neighbors
OUTPUT:
output: fused embedding [d]
BEGIN
// 1. Graph attention (topology-based)
graph_output, _ ← MultiHeadAttention(
query,
graph_neighbors,
graph_neighbors,
num_heads=8
)
// 2. Find latent neighbors (similarity-based)
latent_neighbors ← FindTopKSimilar(query, all_embeddings, k_latent)
// 3. Latent attention (embedding-based)
latent_output, _ ← MultiHeadAttention(
query,
latent_neighbors,
latent_neighbors,
num_heads=8
)
// 4. Cross-attention (graph context queries latent space)
cross_output, _ ← MultiHeadAttention(
graph_output, // Use graph output as query
latent_neighbors,
latent_neighbors,
num_heads=8
)
// 5. Fusion of all three outputs
concatenated ← Concatenate(graph_output, latent_output, cross_output)
// 6. Final projection
W_fusion ← LEARNED_WEIGHTS[d × 3d]
output ← MatrixVectorMult(W_fusion, concatenated)
RETURN output
END
SUBROUTINE: FindTopKSimilar
INPUT: query[d], all_embeddings[N × d], k
OUTPUT: top_k_embeddings[k × d]
BEGIN
similarities ← EMPTY_ARRAY[N]
// 1. Compute cosine similarity to all embeddings
FOR i ← 0 TO N-1 DO
similarities[i] ← CosineSimilarity(query, all_embeddings[i])
END FOR
// 2. Find top-k indices
top_k_indices ← TopKIndices(similarities, k)
// 3. Extract top-k embeddings
top_k_embeddings ← EMPTY_MATRIX[k × d]
FOR i ← 0 TO k-1 DO
top_k_embeddings[i] ← all_embeddings[top_k_indices[i]]
END FOR
RETURN top_k_embeddings
END
SUBROUTINE: CosineSimilarity
INPUT: x[d], y[d]
OUTPUT: similarity in [-1, 1]
BEGIN
dot ← DotProduct(x, y)
norm_x ← L2Norm(x)
norm_y ← L2Norm(y)
// Avoid division by zero
IF norm_x == 0 OR norm_y == 0 THEN
RETURN 0.0
END IF
RETURN dot / (norm_x * norm_y)
END
SUBROUTINE: TopKIndices
INPUT: array[N], k
OUTPUT: indices[k]
BEGIN
// Create (index, value) pairs
pairs ← EMPTY_ARRAY[N]
FOR i ← 0 TO N-1 DO
pairs[i] ← (i, array[i])
END FOR
// Sort by value (descending)
Sort(pairs, by=value, order=descending)
// Extract top-k indices
indices ← EMPTY_ARRAY[k]
FOR i ← 0 TO k-1 DO
indices[i] ← pairs[i].index
END FOR
RETURN indices
END
```
---
## 5. Adaptive Attention
### 5.1 Mixture of Experts (MoE) Attention
**Purpose**: Route to specialized attention mechanisms based on context
**Complexity**:
- Time: O(K · attention_complexity) where K = top-k experts (typically 2)
- Space: O(num_experts · model_size)
```
ALGORITHM: MoEAttention
INPUT:
query: query node embedding [d]
keys: neighbor embeddings [n × d]
values: neighbor embeddings [n × d]
experts: list of attention mechanisms
router: routing network
top_k: number of experts to use (typically 2)
OUTPUT:
output: expert-mixed output [d]
EXPERT_TYPES:
1. Standard Multi-Head Attention
2. Hyperbolic Attention
3. Linear Attention
4. Edge-Featured Attention
BEGIN
num_experts ← Length(experts)
// 1. Router computes expert scores
router_logits ← RouterNetwork(query, router)
router_probs ← Softmax(router_logits)
// 2. Select top-k experts
top_k_indices ← TopKIndices(router_probs, top_k)
// 3. Normalize selected expert weights
selected_weights ← EMPTY_ARRAY[top_k]
weight_sum ← 0.0
FOR i ← 0 TO top_k-1 DO
expert_idx ← top_k_indices[i]
selected_weights[i] ← router_probs[expert_idx]
weight_sum ← weight_sum + selected_weights[i]
END FOR
// Normalize
FOR i ← 0 TO top_k-1 DO
selected_weights[i] ← selected_weights[i] / weight_sum
END FOR
// 4. Compute weighted expert outputs
output ← ZeroVector(d)
FOR i ← 0 TO top_k-1 DO
expert_idx ← top_k_indices[i]
expert ← experts[expert_idx]
// Call appropriate expert
expert_output ← CALL_EXPERT(expert, query, keys, values)
// Weighted accumulation
weight ← selected_weights[i]
FOR j ← 0 TO d-1 DO
output[j] ← output[j] + weight * expert_output[j]
END FOR
END FOR
RETURN output
END
SUBROUTINE: RouterNetwork
INPUT: query[d], router_weights
OUTPUT: logits[num_experts]
BEGIN
// Simple two-layer MLP
hidden_size ← 4 * d
// First layer
W1 ← router_weights.layer1 // [hidden_size × d]
b1 ← router_weights.bias1 // [hidden_size]
hidden ← MatrixVectorMult(W1, query)
FOR i ← 0 TO hidden_size-1 DO
hidden[i] ← ReLU(hidden[i] + b1[i])
END FOR
// Second layer
W2 ← router_weights.layer2 // [num_experts × hidden_size]
b2 ← router_weights.bias2 // [num_experts]
logits ← MatrixVectorMult(W2, hidden)
FOR i ← 0 TO num_experts-1 DO
logits[i] ← logits[i] + b2[i]
END FOR
RETURN logits
END
SUBROUTINE: CALL_EXPERT
INPUT: expert, query, keys, values
OUTPUT: expert_output[d]
BEGIN
MATCH expert.type:
CASE "standard":
RETURN MultiHeadAttention(query, keys, values, num_heads=8)
CASE "hyperbolic":
RETURN HyperbolicAttention(query, keys, values, curvature=-1.0)
CASE "linear":
RETURN LinearAttention(query, keys, values, num_features=256)
CASE "edge_featured":
edge_features ← expert.edge_features
RETURN EdgeFeaturedAttention(query, keys, values, edge_features)
DEFAULT:
ERROR "Unknown expert type"
END MATCH
END
SUBROUTINE: ReLU
INPUT: x (scalar)
OUTPUT: max(0, x)
BEGIN
RETURN IF (x > 0) THEN x ELSE 0
END
```
---
### 5.2 Learned Navigation (Reinforcement Learning)
**Purpose**: Learn optimal navigation policy for graph traversal
**Complexity**:
- Time: O(num_steps · d²) per navigation episode
- Space: O(graph_size + policy_params)
```
ALGORITHM: RLNavigationStep
INPUT:
current_state: current navigation state
policy_network: learned policy (neural network)
value_network: value estimator
graph: graph structure
OUTPUT:
action: which neighbor to visit
reward: immediate reward
next_state: resulting state
STATE_REPRESENTATION:
current_embedding: [d]
query_embedding: [d]
graph_features: [d_graph]
history: [max_steps × d]
BEGIN
// 1. Encode current state
state_vector ← EncodeState(current_state)
// 2. Policy network outputs action logits
action_logits ← PolicyNetwork(state_vector, policy_network)
// 3. Value network estimates state value
state_value ← ValueNetwork(state_vector, value_network)
// 4. Sample action from policy
action_probs ← Softmax(action_logits)
action ← SampleCategorical(action_probs) // Which neighbor to visit
// 5. Execute action (move to selected neighbor)
next_node ← current_state.neighbors[action]
// 6. Compute reward
reward ← ComputeReward(current_state, next_node, current_state.query)
// 7. Update state
next_state ← UpdateState(current_state, next_node, action)
RETURN action, reward, next_state, state_value
END
SUBROUTINE: EncodeState
INPUT: state
OUTPUT: state_vector[d_state]
BEGIN
// Concatenate all state components
state_vector ← Concatenate(
state.current_embedding,
state.query_embedding,
state.graph_features,
Flatten(state.history)
)
RETURN state_vector
END
SUBROUTINE: PolicyNetwork
INPUT: state_vector[d_state], policy_params
OUTPUT: action_logits[num_neighbors]
BEGIN
// Three-layer MLP
hidden1 ← ReLU(Linear(state_vector, policy_params.W1, policy_params.b1))
hidden2 ← ReLU(Linear(hidden1, policy_params.W2, policy_params.b2))
logits ← Linear(hidden2, policy_params.W3, policy_params.b3)
RETURN logits
END
SUBROUTINE: ValueNetwork
INPUT: state_vector[d_state], value_params
OUTPUT: value (scalar)
BEGIN
// Three-layer MLP ending in scalar
hidden1 ← ReLU(Linear(state_vector, value_params.W1, value_params.b1))
hidden2 ← ReLU(Linear(hidden1, value_params.W2, value_params.b2))
value ← Linear(hidden2, value_params.W3, value_params.b3)[0] // Scalar output
RETURN value
END
SUBROUTINE: ComputeReward
INPUT: current_state, next_node, query
OUTPUT: reward (scalar)
BEGIN
// Reward based on similarity improvement
current_similarity ← CosineSimilarity(
current_state.current_embedding,
query
)
next_similarity ← CosineSimilarity(
next_node.embedding,
query
)
// Positive reward if moving closer, negative if farther
reward ← next_similarity - current_similarity
// Bonus for reaching goal
IF next_similarity > GOAL_THRESHOLD THEN
reward ← reward + GOAL_BONUS
END IF
// Penalty for taking too many steps
reward ← reward - STEP_PENALTY
RETURN reward
END
SUBROUTINE: SampleCategorical
INPUT: probabilities[n]
OUTPUT: sampled_index in [0, n-1]
BEGIN
// Sample from categorical distribution
cumsum ← 0.0
rand ← Random() // Uniform [0, 1)
FOR i ← 0 TO n-1 DO
cumsum ← cumsum + probabilities[i]
IF rand < cumsum THEN
RETURN i
END IF
END FOR
// Fallback (shouldn't reach here if probabilities sum to 1)
RETURN n-1
END
SUBROUTINE: UpdateState
INPUT: current_state, next_node, action
OUTPUT: new_state
BEGIN
new_state ← COPY(current_state)
// Update current node
new_state.current_node ← next_node
new_state.current_embedding ← next_node.embedding
// Update history (sliding window)
new_state.history.PopFirst()
new_state.history.Append(next_node.embedding)
// Increment step counter
new_state.num_steps ← new_state.num_steps + 1
RETURN new_state
END
SUBROUTINE: Linear
INPUT: x[d_in], W[d_out × d_in], b[d_out]
OUTPUT: y[d_out]
BEGIN
y ← MatrixVectorMult(W, x)
FOR i ← 0 TO d_out-1 DO
y[i] ← y[i] + b[i]
END FOR
RETURN y
END
```
---
## 6. Training Procedures
### 6.1 InfoNCE Contrastive Loss
**Purpose**: Learn embeddings that are similar to positives and dissimilar to negatives
**Complexity**:
- Time: O((n_pos + n_neg) · d)
- Space: O(n_pos + n_neg)
```
ALGORITHM: InfoNCELoss
INPUT:
anchor: anchor embedding [d]
positives: positive samples [n_pos × d]
negatives: negative samples [n_neg × d]
temperature: softmax temperature (typically 0.07)
OUTPUT:
loss: contrastive loss (scalar)
BEGIN
// 1. Compute positive similarities
pos_scores ← EMPTY_ARRAY[n_pos]
FOR i ← 0 TO n_pos-1 DO
sim ← CosineSimilarity(anchor, positives[i])
pos_scores[i] ← sim / temperature
END FOR
// 2. Compute negative similarities
neg_scores ← EMPTY_ARRAY[n_neg]
FOR i ← 0 TO n_neg-1 DO
sim ← CosineSimilarity(anchor, negatives[i])
neg_scores[i] ← sim / temperature
END FOR
// 3. InfoNCE loss (average over positives)
total_loss ← 0.0
FOR i ← 0 TO n_pos-1 DO
// Numerator: exp(positive score)
numerator ← exp(pos_scores[i])
// Denominator: sum of exp(positive score) + all exp(negative scores)
denominator ← numerator
FOR j ← 0 TO n_neg-1 DO
denominator ← denominator + exp(neg_scores[j])
END FOR
// Log probability
log_prob ← log(numerator / denominator)
// Accumulate negative log probability
total_loss ← total_loss - log_prob
END FOR
// Average over positives
loss ← total_loss / n_pos
RETURN loss
END
```
---
### 6.2 Hard Negative Sampling
**Purpose**: Select informative negative samples for faster learning
**Complexity**:
- Time: O(N·d) where N = total number of samples
- Space: O(k) where k = number of hard negatives
```
ALGORITHM: SampleHardNegatives
INPUT:
anchor: anchor embedding [d]
all_embeddings: all available embeddings [N × d]
true_positives: indices of true positives
k: number of hard negatives to sample
strategy: sampling strategy ("distance", "degree", "mixed")
OUTPUT:
hard_negatives: selected hard negative samples [k × d]
BEGIN
// 1. Filter out true positives
candidate_indices ← EMPTY_LIST
FOR i ← 0 TO N-1 DO
IF i NOT IN true_positives THEN
candidate_indices.Append(i)
END IF
END FOR
n_candidates ← Length(candidate_indices)
// 2. Select hard negatives based on strategy
MATCH strategy:
CASE "distance":
hard_negatives ← SampleByDistance(
anchor, all_embeddings, candidate_indices, k
)
CASE "degree":
hard_negatives ← SampleByDegree(
anchor, all_embeddings, candidate_indices, k
)
CASE "mixed":
k_dist ← k / 2
k_deg ← k - k_dist
dist_negs ← SampleByDistance(
anchor, all_embeddings, candidate_indices, k_dist
)
deg_negs ← SampleByDegree(
anchor, all_embeddings, candidate_indices, k_deg
)
hard_negatives ← Concatenate(dist_negs, deg_negs)
DEFAULT:
ERROR "Unknown strategy"
END MATCH
RETURN hard_negatives
END
SUBROUTINE: SampleByDistance
INPUT: anchor[d], all_embeddings[N × d], candidate_indices, k
OUTPUT: hard_negatives[k × d]
BEGIN
// Select k most similar candidates (hardest negatives)
similarities ← EMPTY_ARRAY[Length(candidate_indices)]
FOR i ← 0 TO Length(candidate_indices)-1 DO
idx ← candidate_indices[i]
similarities[i] ← CosineSimilarity(anchor, all_embeddings[idx])
END FOR
// Get top-k most similar (hardest)
top_k_local_indices ← TopKIndices(similarities, k)
// Map back to global indices
hard_negatives ← EMPTY_MATRIX[k × d]
FOR i ← 0 TO k-1 DO
local_idx ← top_k_local_indices[i]
global_idx ← candidate_indices[local_idx]
hard_negatives[i] ← all_embeddings[global_idx]
END FOR
RETURN hard_negatives
END
SUBROUTINE: SampleByDegree
INPUT: anchor[d], all_embeddings[N × d], candidate_indices, k
OUTPUT: hard_negatives[k × d]
BEGIN
// Select candidates with similar degree to anchor
anchor_degree ← GetDegree(anchor)
degree_diffs ← EMPTY_ARRAY[Length(candidate_indices)]
FOR i ← 0 TO Length(candidate_indices)-1 DO
idx ← candidate_indices[i]
candidate_degree ← GetDegree(all_embeddings[idx])
degree_diffs[i] ← abs(anchor_degree - candidate_degree)
END FOR
// Get k candidates with most similar degree
top_k_local_indices ← TopKIndices(
NegateArray(degree_diffs), // Negate for similarity
k
)
hard_negatives ← EMPTY_MATRIX[k × d]
FOR i ← 0 TO k-1 DO
local_idx ← top_k_local_indices[i]
global_idx ← candidate_indices[local_idx]
hard_negatives[i] ← all_embeddings[global_idx]
END FOR
RETURN hard_negatives
END
SUBROUTINE: NegateArray
INPUT: array[n]
OUTPUT: negated[n]
BEGIN
negated ← EMPTY_ARRAY[n]
FOR i ← 0 TO n-1 DO
negated[i] ← -array[i]
END FOR
RETURN negated
END
```
---
### 6.3 Curriculum Learning Schedule
**Purpose**: Gradually increase task difficulty during training
**Complexity**:
- Time: O(1) per epoch (just weight computation)
- Space: O(num_losses)
```
ALGORITHM: CurriculumSchedule
INPUT:
current_epoch: current training epoch
total_epochs: total number of epochs
loss_types: list of loss components
OUTPUT:
loss_weights: weight for each loss component
LOSS_TYPES:
- reconstruction: Autoencoder reconstruction loss
- contrastive: InfoNCE contrastive loss
- task: Downstream task loss
- spectral: Laplacian regularization
- ewc: Elastic Weight Consolidation
BEGIN
loss_weights ← EMPTY_MAP
// 1. Reconstruction: High early, decay exponentially
lambda_recon ← exp(-current_epoch / 50.0)
loss_weights["reconstruction"] ← lambda_recon
// 2. Contrastive: Ramp up linearly in first 10 epochs
IF current_epoch < 10 THEN
lambda_contrast ← 0.1 + 0.9 * (current_epoch / 10.0)
ELSE
lambda_contrast ← 1.0
END IF
loss_weights["contrastive"] ← lambda_contrast
// 3. Task: Start after 50 epochs, ramp up
IF current_epoch < 50 THEN
lambda_task ← 0.1
ELSE
lambda_task ← 0.1 + 0.9 * ((current_epoch - 50) / 50.0)
lambda_task ← Min(lambda_task, 1.0)
END IF
loss_weights["task"] ← lambda_task
// 4. Spectral: Constant moderate weight
loss_weights["spectral"] ← 0.01
// 5. EWC: Increase if using continual learning
IF using_continual_learning THEN
lambda_ewc ← Min(current_epoch / 100.0, 1.0)
ELSE
lambda_ewc ← 0.0
END IF
loss_weights["ewc"] ← lambda_ewc
RETURN loss_weights
END
```
---
### 6.4 Multi-Objective Loss Computation
**Purpose**: Combine multiple loss functions with learned or scheduled weights
**Complexity**:
- Time: O(num_losses)
- Space: O(1)
```
ALGORITHM: MultiObjectiveLoss
INPUT:
loss_components: computed loss values
loss_weights: weights for each component
auto_balance: whether to auto-balance weights
OUTPUT:
total_loss: weighted sum of losses
updated_weights: potentially updated weights
LOSS_COMPONENTS:
task_loss: Main task objective
contrastive_loss: InfoNCE or similar
reconstruction_loss: Autoencoder
spectral_loss: Laplacian smoothness
ewc_loss: Continual learning penalty
BEGIN
// 1. Auto-balance (optional)
IF auto_balance THEN
loss_weights ← AutoBalance(loss_components, loss_weights)
END IF
// 2. Compute weighted sum
total_loss ← 0.0
total_loss ← total_loss + loss_weights["task"] * loss_components.task_loss
total_loss ← total_loss + loss_weights["contrastive"] * loss_components.contrastive_loss
total_loss ← total_loss + loss_weights["reconstruction"] * loss_components.reconstruction_loss
total_loss ← total_loss + loss_weights["spectral"] * loss_components.spectral_loss
total_loss ← total_loss + loss_weights["ewc"] * loss_components.ewc_loss
RETURN total_loss, loss_weights
END
SUBROUTINE: AutoBalance
INPUT: loss_components, current_weights
OUTPUT: balanced_weights
BEGIN
// Normalize so each loss contributes equally
num_losses ← 5
// Compute current contribution of each loss
contributions ← EMPTY_MAP
contributions["task"] ← current_weights["task"] * loss_components.task_loss
contributions["contrastive"] ← current_weights["contrastive"] * loss_components.contrastive_loss
contributions["reconstruction"] ← current_weights["reconstruction"] * loss_components.reconstruction_loss
contributions["spectral"] ← current_weights["spectral"] * loss_components.spectral_loss
contributions["ewc"] ← current_weights["ewc"] * loss_components.ewc_loss
// Compute total and target per-loss contribution
total ← Sum(contributions.values)
target_contribution ← total / num_losses
// Adjust weights to equalize contributions
balanced_weights ← EMPTY_MAP
epsilon ← 1e-10 // Avoid division by zero
balanced_weights["task"] ← target_contribution / Max(loss_components.task_loss, epsilon)
balanced_weights["contrastive"] ← target_contribution / Max(loss_components.contrastive_loss, epsilon)
balanced_weights["reconstruction"] ← target_contribution / Max(loss_components.reconstruction_loss, epsilon)
balanced_weights["spectral"] ← target_contribution / Max(loss_components.spectral_loss, epsilon)
balanced_weights["ewc"] ← target_contribution / Max(loss_components.ewc_loss, epsilon)
RETURN balanced_weights
END
```
---
### 6.5 Spectral Regularization
**Purpose**: Preserve graph structure through Laplacian smoothness
**Complexity**:
- Time: O(|E|·d) where |E| = number of edges
- Space: O(1) (streaming computation)
```
ALGORITHM: LaplacianRegularization
INPUT:
embeddings: node embeddings [N × d]
edges: edge list [(u, v)]
edge_weights: optional edge weights [|E|]
normalized: whether to use normalized Laplacian
node_degrees: node degrees [N]
OUTPUT:
spectral_loss: smoothness penalty (scalar)
BEGIN
total_loss ← 0.0
num_edges ← Length(edges)
FOR i ← 0 TO num_edges-1 DO
u, v ← edges[i]
// Compute embedding difference
diff ← Subtract(embeddings[u], embeddings[v])
diff_norm_sq ← L2NormSquared(diff)
// Get edge weight
weight ← 1.0
IF edge_weights PROVIDED THEN
weight ← edge_weights[i]
END IF
// Normalized Laplacian: weight by degrees
IF normalized THEN
degree_norm ← sqrt(node_degrees[u] * node_degrees[v])
weight ← weight / Max(degree_norm, 1.0)
END IF
// Accumulate weighted squared difference
total_loss ← total_loss + weight * diff_norm_sq
END FOR
// Average over edges
spectral_loss ← total_loss / num_edges
RETURN spectral_loss
END
```
---
## 7. Data Structures
### 7.1 Attention State
```
STRUCTURE: AttentionState
FIELDS:
query: [d] // Query embedding
keys: [n × d] // Key embeddings
values: [n × d] // Value embeddings
attention_weights: [n] // Computed weights
output: [d] // Final output
metadata: Map<String, Any> // Additional info
OPERATIONS:
Initialize(query, keys, values)
ComputeWeights() → attention_weights
ComputeOutput() → output
GetMetadata(key) → value
```
---
### 7.2 Graph Structure
```
STRUCTURE: Graph
FIELDS:
nodes: [N] // Node identifiers
embeddings: [N × d] // Node embeddings
adjacency: [N × N] OR SparseMatrix // Adjacency matrix
edge_list: [(u, v)] // Edge list
edge_features: [|E| × d_edge] // Edge attributes
node_degrees: [N] // Degree of each node
OPERATIONS:
GetNeighbors(node_id) → [neighbor_ids]
GetEdgeFeature(u, v) → [d_edge]
GetDegree(node_id) → scalar
AddEdge(u, v, features)
UpdateEmbedding(node_id, new_embedding)
```
---
### 7.3 HNSW-Specific Structure
```
STRUCTURE: HNSWGraph
EXTENDS: Graph
ADDITIONAL_FIELDS:
layers: [max_layer] // Layer-wise graphs
entry_point: node_id // Top-layer entry
max_layer: integer // Maximum layer
layer_neighbors: Map<(node, layer), [neighbors]>
OPERATIONS:
GetLayerNeighbors(node_id, layer) → [neighbor_ids]
GetNodeLayer(node_id) → layer
NavigateLayer(query, layer, num_steps) → closest_node
InsertNode(node_id, embedding, layer)
```
---
### 7.4 Training State
```
STRUCTURE: TrainingState
FIELDS:
current_epoch: integer
loss_history: [num_epochs]
loss_weights: Map<loss_type, weight>
curriculum_schedule: CurriculumSchedule
optimizer_state: OptimizerState
best_model_params: ModelParams
early_stopping_counter: integer
OPERATIONS:
UpdateEpoch()
RecordLoss(loss_value)
GetLossWeight(loss_type) → weight
UpdateBestModel(current_params)
ShouldEarlystop() → boolean
```
---
## 8. Complexity Summary
### 8.1 Attention Mechanisms
| Mechanism | Time Complexity | Space Complexity | Notes |
|-----------|----------------|------------------|-------|
| Scaled Dot-Product | O(n·d²) | O(n) | Standard attention |
| Multi-Head (h heads) | O(n·d²/h) | O(h·d) | Parallel heads |
| Hyperbolic | O(n·d²) | O(n) | More expensive ops |
| Sparse (Local+Global) | O((k_l + k_g)·d) | O(k_l + k_g) | k << n |
| Linear (Performer) | O(n·D·d) | O(D·d) | D = random features |
| Flash | O(n²·d) | O(n) | Better cache locality |
| Edge-Featured | O(n·(d² + d_edge·d)) | O(n) | Added edge cost |
| RoPE | O(n·d²) | O(n) | Rotation overhead minimal |
| Cross-Space | O(n_g·d² + k_l·d²) | O(n_g + k_l) | Dual attention |
| MoE (k experts) | O(k·base_complexity) | O(num_experts·model_size) | Expert routing |
**Legend**:
- n: number of neighbors/keys
- d: embedding dimension
- h: number of attention heads
- k_l, k_g: local and global neighbor counts
- D: number of random features
- d_edge: edge feature dimension
---
### 8.2 Training Operations
| Operation | Time Complexity | Space Complexity | Notes |
|-----------|----------------|------------------|-------|
| InfoNCE Loss | O((n_pos + n_neg)·d) | O(n_pos + n_neg) | Per anchor |
| Hard Negative Sampling | O(N·d) | O(k) | N = total samples |
| Spectral Regularization | O(\|E\|·d) | O(1) | E = edges |
| Curriculum Schedule | O(1) | O(num_losses) | Per epoch |
| Multi-Objective Loss | O(num_losses) | O(1) | Weighted sum |
---
## 9. Implementation Notes
### 9.1 Numerical Stability
**Softmax Stability**:
```
// Always subtract max before exp
max_score ← Max(scores)
exp_scores[i] ← exp(scores[i] - max_score)
```
**Hyperbolic Boundary**:
```
// Ensure points stay in Poincaré ball
IF ||x|| >= 1.0 THEN
x ← 0.99 * x / ||x|| // Project back with margin
END IF
```
**Division by Zero**:
```
// Add epsilon to denominators
result ← numerator / (denominator + 1e-10)
```
---
### 9.2 Performance Optimization
**Vectorization**:
- Use SIMD operations for dot products
- Batch matrix multiplications
- Parallelize independent attention heads
**Memory Layout**:
- Contiguous memory for cache efficiency
- Column-major for matrix operations
- Pre-allocate buffers
**Lazy Computation**:
- Only compute attention weights when needed
- Cache frequently accessed embeddings
- Prune low-weight attention connections
---
### 9.3 Testing Strategies
**Unit Tests**:
```
TEST: ScaledDotProductAttention
INPUT: Known query, keys, values
EXPECTED: Hand-computed output
VERIFY: Output matches expected within tolerance
TEST: Softmax Numerical Stability
INPUT: Very large scores [1000, 999, 998]
VERIFY: No NaN or Inf in output
VERIFY: Probabilities sum to 1.0
TEST: Hyperbolic Boundary
INPUT: Points near ball boundary (||x|| = 0.99)
VERIFY: Result still in ball (||result|| < 1.0)
```
**Integration Tests**:
```
TEST: End-to-End Attention Pipeline
INPUT: Real graph structure
VERIFY: All mechanisms produce valid outputs
VERIFY: Outputs are differentiable
```
**Performance Tests**:
```
BENCHMARK: Attention Complexity
INPUT: Varying n = [10, 100, 1000, 10000]
MEASURE: Time and memory usage
VERIFY: Matches theoretical complexity
```
---
## 10. References
### 10.1 Core Papers
1. **Attention Mechanism**: Vaswani et al. (2017) - "Attention Is All You Need"
2. **GAT**: Veličković et al. (2018) - "Graph Attention Networks"
3. **Hyperbolic GNNs**: Chami et al. (2019) - "Hyperbolic Graph Convolutional Neural Networks"
4. **Performer**: Choromanski et al. (2020) - "Rethinking Attention with Performers"
5. **Flash Attention**: Dao et al. (2022) - "FlashAttention: Fast and Memory-Efficient Exact Attention"
6. **RoPE**: Su et al. (2021) - "RoFormer: Enhanced Transformer with Rotary Position Embedding"
7. **MoE**: Shazeer et al. (2017) - "Outrageously Large Neural Networks"
### 10.2 Mathematical Background
- **Hyperbolic Geometry**: Cannon et al. (1997) - "Hyperbolic Geometry"
- **Graph Laplacian**: Chung (1997) - "Spectral Graph Theory"
- **Contrastive Learning**: Chen et al. (2020) - "A Simple Framework for Contrastive Learning"
---
## 11. Glossary
**Attention**: Mechanism to weight importance of different inputs
**Multi-Head**: Parallel attention with different learned projections
**Hyperbolic Space**: Non-Euclidean geometry with constant negative curvature
**Poincaré Ball**: Conformal model of hyperbolic space in unit ball
**Möbius Addition**: Hyperbolic vector addition operation
**Sparse Attention**: Attention over subset of inputs (not all pairs)
**Linear Attention**: O(n) complexity via kernel approximation
**Flash Attention**: Memory-efficient tiled attention computation
**RoPE**: Rotary Position Embedding for distance encoding
**Cross-Attention**: Attention between two different spaces
**MoE**: Mixture of Experts, routing to specialized sub-models
**InfoNCE**: Noise Contrastive Estimation loss for contrastive learning
**Hard Negatives**: Difficult negative samples close to positives
**Curriculum Learning**: Gradually increasing task difficulty
**Spectral Regularization**: Graph smoothness via Laplacian
---
**Document Version**: 1.0
**Last Updated**: 2025-11-30
**Author**: RuVector Research Team
**SPARC Phase**: Pseudocode (Phase 2)
**Next Phase**: Architecture (Phase 3) - See `04-architecture.md`
---
## Appendix A: Quick Reference
### Common Subroutines
```
DotProduct(x, y) → scalar
L2Norm(x) → scalar
L2NormSquared(x) → scalar
Softmax(scores) → probabilities
CosineSimilarity(x, y) → similarity ∈ [-1, 1]
Scale(x, scalar) → scaled_vector
Add(x, y) → sum_vector
Subtract(x, y) → diff_vector
Concatenate(vectors...) → concatenated_vector
ZeroVector(d) → zero-initialized vector
ZeroMatrix(rows, cols) → zero-initialized matrix
```
### Complexity Quick Reference
```
O(1) - Constant time
O(d) - Linear in dimension
O(n) - Linear in number of items
O(n·d) - Linear in both
O(n²) - Quadratic (standard full attention)
O(n·d²) - Attention complexity
O(|E|) - Linear in number of edges
```
---