# SPARC Pseudocode: RuVector Attention Mechanisms

## Executive Summary

This document provides comprehensive pseudocode for all attention mechanisms proposed for the RuVector GNN latent-graph interplay system. Following SPARC methodology, this serves as the bridge between specification (requirements) and architecture (system design).

**Scope**: Complete algorithmic specifications for attention mechanisms, training procedures, and optimization strategies.

**Target Audience**: Implementers who will translate these algorithms into Rust code.

**Conventions**:
- UPPERCASE: Algorithm names, constants
- lowercase: Variables, parameters
- `←`: Assignment
- `∈`: Set membership
- Arrays are 0-indexed unless specified
- All complexity analysis uses Big-O notation

---

## Table of Contents

1. [Core Attention Mechanisms](#1-core-attention-mechanisms)
2. [Geometric Attention](#2-geometric-attention)
3. [Sparse Attention](#3-sparse-attention)
4. [Graph Attention](#4-graph-attention)
5. [Adaptive Attention](#5-adaptive-attention)
6. [Training Procedures](#6-training-procedures)
7. [Data Structures](#7-data-structures)
8. [Complexity Summary](#8-complexity-summary)

---

## 1. Core Attention Mechanisms

### 1.1 Scaled Dot-Product Attention

**Purpose**: Foundation attention mechanism for all variants

**Complexity**:
- Time: O(n·d) per query, where n = number of keys, d = embedding dimension
- Space: O(n)

```
ALGORITHM: ScaledDotProductAttention
INPUT:
    Q: query vector [d]
    K: key matrix [n × d]
    V: value matrix [n × d]
    d_k: key dimension (scalar)
OUTPUT:
    output: attention output [d]
    weights: attention weights [n]

BEGIN
    // 1. Compute attention scores
    scores ← EMPTY_ARRAY[n]
    FOR i ← 0 TO n-1 DO
        scores[i] ← DotProduct(Q, K[i]) / sqrt(d_k)
    END FOR

    // 2. Apply softmax for normalization
    weights ← Softmax(scores)

    // 3. Weighted sum of values
    output ← ZeroVector(d)
    FOR i ← 0 TO n-1 DO
        output ← output + weights[i] * V[i]
    END FOR

    RETURN output, weights
END

SUBROUTINE: DotProduct
INPUT: x[d], y[d]
OUTPUT: scalar
BEGIN
    sum ← 0
    FOR i ← 0 TO d-1 DO
        sum ← sum + x[i] * y[i]
    END FOR
    RETURN sum
END

SUBROUTINE: Softmax
INPUT: scores[n]
OUTPUT: probabilities[n]
BEGIN
    // Numerical stability: subtract max
    max_score ← Max(scores)

    exp_scores ← EMPTY_ARRAY[n]
    sum_exp ← 0

    FOR i ← 0 TO n-1 DO
        exp_scores[i] ← exp(scores[i] - max_score)
        sum_exp ← sum_exp + exp_scores[i]
    END FOR

    probabilities ← EMPTY_ARRAY[n]
    FOR i ← 0 TO n-1 DO
        probabilities[i] ← exp_scores[i] / sum_exp
    END FOR

    RETURN probabilities
END
```
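
The pseudocode above maps almost one-to-one onto Rust, the target implementation language. A minimal sketch under plain `Vec` types, not the RuVector API; function names are illustrative:

```rust
/// Numerically stable softmax (max-subtraction), as in the Softmax subroutine.
fn softmax(scores: &[f64]) -> Vec<f64> {
    let max = scores.iter().cloned().fold(f64::NEG_INFINITY, f64::max);
    let exps: Vec<f64> = scores.iter().map(|s| (s - max).exp()).collect();
    let sum: f64 = exps.iter().sum();
    exps.iter().map(|e| e / sum).collect()
}

/// Single-query scaled dot-product attention: O(n·d) time, O(n) extra space.
fn scaled_dot_product_attention(
    q: &[f64],
    k: &[Vec<f64>],
    v: &[Vec<f64>],
) -> (Vec<f64>, Vec<f64>) {
    let scale = (q.len() as f64).sqrt();
    // 1. Scores: q·k_i / sqrt(d_k)
    let scores: Vec<f64> = k
        .iter()
        .map(|ki| q.iter().zip(ki).map(|(a, b)| a * b).sum::<f64>() / scale)
        .collect();
    // 2. Normalize
    let weights = softmax(&scores);
    // 3. Weighted sum of values
    let mut output = vec![0.0; v[0].len()];
    for (w, vi) in weights.iter().zip(v) {
        for (o, x) in output.iter_mut().zip(vi) {
            *o += w * x;
        }
    }
    (output, weights)
}
```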

---

### 1.2 Multi-Head Attention

**Purpose**: Learn multiple representation subspaces simultaneously

**Complexity**:
- Time: O(n·d²) total: each of the h heads projects the n keys/values at O(n·d²/h)
- Space: O(n·d/h) per head for projected keys/values, plus O(d) for the concatenated head outputs

```
ALGORITHM: MultiHeadAttention
INPUT:
    Q: query vector [d_model]
    K: key matrix [n × d_model]
    V: value matrix [n × d_model]
    num_heads: number of attention heads
    W_Q: query projection weights [num_heads × d_head × d_model]
    W_K: key projection weights [num_heads × d_head × d_model]
    W_V: value projection weights [num_heads × d_head × d_model]
    W_O: output projection weights [d_model × d_model]
OUTPUT:
    output: multi-head attention output [d_model]

CONSTANTS:
    d_head ← d_model / num_heads

BEGIN
    heads ← EMPTY_ARRAY[num_heads]

    // 1. Project and compute attention for each head
    FOR h ← 0 TO num_heads-1 DO
        // Project query
        Q_h ← LinearTransform(Q, W_Q[h])    // [d_head]

        // Project keys
        K_h ← EMPTY_MATRIX[n × d_head]
        FOR i ← 0 TO n-1 DO
            K_h[i] ← LinearTransform(K[i], W_K[h])
        END FOR

        // Project values
        V_h ← EMPTY_MATRIX[n × d_head]
        FOR i ← 0 TO n-1 DO
            V_h[i] ← LinearTransform(V[i], W_V[h])
        END FOR

        // Compute attention for this head
        head_output, _ ← ScaledDotProductAttention(Q_h, K_h, V_h, d_head)
        heads[h] ← head_output
    END FOR

    // 2. Concatenate all heads
    concat ← Concatenate(heads[0], heads[1], ..., heads[num_heads-1])

    // 3. Final linear projection
    output ← LinearTransform(concat, W_O)

    RETURN output
END

SUBROUTINE: LinearTransform
INPUT: x[d_in], W[d_out × d_in]
OUTPUT: y[d_out]
BEGIN
    y ← ZeroVector(d_out)
    FOR i ← 0 TO d_out-1 DO
        FOR j ← 0 TO d_in-1 DO
            y[i] ← y[i] + W[i][j] * x[j]
        END FOR
    END FOR
    RETURN y
END

SUBROUTINE: Concatenate
INPUT: vectors... (variable number of vectors)
OUTPUT: concatenated vector
BEGIN
    total_dim ← Sum of all input dimensions
    result ← EMPTY_ARRAY[total_dim]
    offset ← 0

    FOR EACH vector IN vectors DO
        FOR i ← 0 TO Length(vector)-1 DO
            result[offset + i] ← vector[i]
        END FOR
        offset ← offset + Length(vector)
    END FOR

    RETURN result
END
```
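
A compact Rust sketch of the head loop, concatenation, and output projection. Weights are passed as plain nested `Vec`s rather than the eventual RuVector types, and the per-head attention is inlined; names are illustrative:

```rust
/// y = W·x for a row-major weight matrix (the LinearTransform subroutine).
fn linear(w: &[Vec<f64>], x: &[f64]) -> Vec<f64> {
    w.iter()
        .map(|row| row.iter().zip(x).map(|(a, b)| a * b).sum())
        .collect()
}

fn softmax(scores: &[f64]) -> Vec<f64> {
    let max = scores.iter().cloned().fold(f64::NEG_INFINITY, f64::max);
    let exps: Vec<f64> = scores.iter().map(|s| (s - max).exp()).collect();
    let sum: f64 = exps.iter().sum();
    exps.iter().map(|e| e / sum).collect()
}

/// Single-head scaled dot-product attention (used per head below).
fn attend(q: &[f64], k: &[Vec<f64>], v: &[Vec<f64>]) -> Vec<f64> {
    let scale = (q.len() as f64).sqrt();
    let scores: Vec<f64> = k
        .iter()
        .map(|ki| q.iter().zip(ki).map(|(a, b)| a * b).sum::<f64>() / scale)
        .collect();
    let w = softmax(&scores);
    let mut out = vec![0.0; v[0].len()];
    for (wi, vi) in w.iter().zip(v) {
        for (o, x) in out.iter_mut().zip(vi) {
            *o += wi * x;
        }
    }
    out
}

/// Multi-head attention: one projection triple per head, then concat + W_O.
fn multi_head_attention(
    q: &[f64],
    k: &[Vec<f64>],
    v: &[Vec<f64>],
    w_q: &[Vec<Vec<f64>>],
    w_k: &[Vec<Vec<f64>>],
    w_v: &[Vec<Vec<f64>>],
    w_o: &[Vec<f64>],
) -> Vec<f64> {
    let mut concat = Vec::new();
    for h in 0..w_q.len() {
        let qh = linear(&w_q[h], q);
        let kh: Vec<Vec<f64>> = k.iter().map(|x| linear(&w_k[h], x)).collect();
        let vh: Vec<Vec<f64>> = v.iter().map(|x| linear(&w_v[h], x)).collect();
        concat.extend(attend(&qh, &kh, &vh));
    }
    linear(w_o, &concat)
}
```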

---

## 2. Geometric Attention

### 2.1 Hyperbolic Attention (Poincaré Ball Model)

**Purpose**: Capture hierarchical structure using hyperbolic geometry

**Complexity**:
- Time: O(n·d) (same asymptotics as Euclidean attention, but with more expensive per-element ops)
- Space: O(n)

**Geometric Background**:
```
Poincaré Ball: B^d = {x ∈ R^d : ||x|| < 1}
Distance: d_P(x,y) = arcosh(1 + 2||x-y||²/((1-||x||²)(1-||y||²)))
```

```
ALGORITHM: HyperbolicAttention
INPUT:
    query: query point in Poincaré ball [d]
    keys: key points in Poincaré ball [n × d]
    values: value points in Poincaré ball [n × d]
    curvature: negative curvature (typically -1.0)
    temperature: softmax temperature
OUTPUT:
    output: aggregated point in Poincaré ball [d]

BEGIN
    // 1. Compute hyperbolic distances as similarity scores
    scores ← EMPTY_ARRAY[n]
    FOR i ← 0 TO n-1 DO
        // Negative distance = similarity (closer = higher score)
        scores[i] ← -PoincareDistance(query, keys[i], curvature)
    END FOR

    // 2. Softmax to get attention weights
    weights ← Softmax(scores / temperature)

    // 3. Hyperbolic weighted aggregation using Möbius addition
    result ← ZeroVector(d)    // Origin in Poincaré ball

    FOR i ← 0 TO n-1 DO
        // Scale value by weight using Möbius scalar multiplication
        scaled_value ← MobiusScalarMult(weights[i], values[i], curvature)

        // Add to result using Möbius addition
        result ← MobiusAdd(result, scaled_value, curvature)
    END FOR

    RETURN result
END

SUBROUTINE: PoincareDistance
INPUT: x[d], y[d], curvature
OUTPUT: distance (scalar)
BEGIN
    // Compute squared norms
    x_norm_sq ← L2NormSquared(x)
    y_norm_sq ← L2NormSquared(y)

    // Ensure points are inside the ball (||x|| < 1, ||y|| < 1)
    IF x_norm_sq >= 1.0 OR y_norm_sq >= 1.0 THEN
        ERROR "Points must be inside Poincaré ball"
    END IF

    // Compute squared distance between points
    diff ← Subtract(x, y)
    diff_norm_sq ← L2NormSquared(diff)

    // Poincaré distance formula
    numerator ← 2.0 * diff_norm_sq
    denominator ← (1.0 - x_norm_sq) * (1.0 - y_norm_sq)

    arg ← 1.0 + numerator / denominator

    // Numerical stability: clamp arg >= 1.0
    IF arg < 1.0 THEN
        arg ← 1.0
    END IF

    // Distances scale as 1/√|c|; arg as written assumes |curvature| = 1
    distance ← arcosh(arg) / sqrt(abs(curvature))

    RETURN distance
END

SUBROUTINE: MobiusAdd
INPUT: x[d], y[d], curvature
OUTPUT: z[d] (Möbius sum x ⊕ y)
BEGIN
    // Special case: if x is origin, return y
    IF IsZero(x) THEN
        RETURN y
    END IF

    // Special case: if y is origin, return x
    IF IsZero(y) THEN
        RETURN x
    END IF

    // Compute norms and dot product
    x_norm_sq ← L2NormSquared(x)
    y_norm_sq ← L2NormSquared(y)
    xy_dot ← DotProduct(x, y)

    // Möbius addition formula:
    //   z = ((1 + 2c⟨x,y⟩ + c||y||²)x + (1 - c||x||²)y) /
    //       (1 + 2c⟨x,y⟩ + c²||x||²||y||²)

    c ← -curvature    // For Poincaré ball, typically c = 1

    numerator_x_coef ← 1.0 + 2.0*c*xy_dot + c*y_norm_sq
    numerator_y_coef ← 1.0 - c*x_norm_sq
    denominator ← 1.0 + 2.0*c*xy_dot + c*c*x_norm_sq*y_norm_sq

    numerator ← Add(
        Scale(x, numerator_x_coef),
        Scale(y, numerator_y_coef)
    )

    z ← Scale(numerator, 1.0 / denominator)

    // Project back to ball if numerical errors pushed outside
    z_norm ← L2Norm(z)
    IF z_norm >= 1.0 THEN
        z ← Scale(z, 0.99 / z_norm)    // Project to ball with margin
    END IF

    RETURN z
END

SUBROUTINE: MobiusScalarMult
INPUT: r (scalar), x[d], curvature
OUTPUT: r ⊗ x (Möbius scalar multiplication)
BEGIN
    // Handle special cases
    IF r == 0 OR IsZero(x) THEN
        RETURN ZeroVector(d)
    END IF

    x_norm ← L2Norm(x)
    c ← -curvature

    // Möbius scalar multiplication:
    //   r ⊗ x = (1/√c) * tanh(r * arctanh(√c * ||x||)) * (x / ||x||)

    sqrt_c ← sqrt(c)
    arctanh_arg ← sqrt_c * x_norm

    // Numerical stability
    IF arctanh_arg >= 1.0 THEN
        arctanh_arg ← 0.999
    END IF

    arctanh_val ← arctanh(arctanh_arg)
    tanh_arg ← r * arctanh_val
    tanh_val ← tanh(tanh_arg)

    scale_factor ← (1.0 / sqrt_c) * tanh_val / x_norm

    result ← Scale(x, scale_factor)

    RETURN result
END

SUBROUTINE: L2NormSquared
INPUT: x[d]
OUTPUT: ||x||² (scalar)
BEGIN
    sum ← 0
    FOR i ← 0 TO d-1 DO
        sum ← sum + x[i] * x[i]
    END FOR
    RETURN sum
END

SUBROUTINE: L2Norm
INPUT: x[d]
OUTPUT: ||x|| (scalar)
BEGIN
    RETURN sqrt(L2NormSquared(x))
END

SUBROUTINE: Subtract
INPUT: x[d], y[d]
OUTPUT: x - y [d]
BEGIN
    result ← EMPTY_ARRAY[d]
    FOR i ← 0 TO d-1 DO
        result[i] ← x[i] - y[i]
    END FOR
    RETURN result
END

SUBROUTINE: Add
INPUT: x[d], y[d]
OUTPUT: x + y [d]
BEGIN
    result ← EMPTY_ARRAY[d]
    FOR i ← 0 TO d-1 DO
        result[i] ← x[i] + y[i]
    END FOR
    RETURN result
END

SUBROUTINE: Scale
INPUT: x[d], scalar
OUTPUT: scalar * x [d]
BEGIN
    result ← EMPTY_ARRAY[d]
    FOR i ← 0 TO d-1 DO
        result[i] ← scalar * x[i]
    END FOR
    RETURN result
END

SUBROUTINE: IsZero
INPUT: x[d]
OUTPUT: boolean
BEGIN
    epsilon ← 1e-10
    RETURN L2Norm(x) < epsilon
END
```
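
The Möbius operations can be sanity-checked with small Rust helpers. This sketch fixes c = 1 (curvature -1) and tests the key invariants: the origin is the identity for Möbius addition, results stay inside the ball, and the distance is symmetric with d(x, x) = 0:

```rust
fn norm_sq(x: &[f64]) -> f64 {
    x.iter().map(|v| v * v).sum()
}

fn dot(x: &[f64], y: &[f64]) -> f64 {
    x.iter().zip(y).map(|(a, b)| a * b).sum()
}

/// Möbius addition on the Poincaré ball with c = 1 (curvature -1).
fn mobius_add(x: &[f64], y: &[f64]) -> Vec<f64> {
    let (xx, yy, xy) = (norm_sq(x), norm_sq(y), dot(x, y));
    let coef_x = 1.0 + 2.0 * xy + yy;
    let coef_y = 1.0 - xx;
    let denom = 1.0 + 2.0 * xy + xx * yy;
    x.iter()
        .zip(y)
        .map(|(a, b)| (coef_x * a + coef_y * b) / denom)
        .collect()
}

/// Poincaré distance for curvature -1 (clamps arg for numerical safety).
fn poincare_distance(x: &[f64], y: &[f64]) -> f64 {
    let diff: Vec<f64> = x.iter().zip(y).map(|(a, b)| a - b).collect();
    let arg = 1.0 + 2.0 * norm_sq(&diff) / ((1.0 - norm_sq(x)) * (1.0 - norm_sq(y)));
    arg.max(1.0).acosh()
}
```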

---

## 3. Sparse Attention

### 3.1 Local + Global Sparse Attention

**Purpose**: Reduce per-query attention cost from O(n) to O(k_local + k_global) for large graphs

**Complexity**:
- Time: O(k_local·d + k_global·d) where k_local, k_global << n
- Space: O(k_local + k_global)

```
ALGORITHM: SparseLocalGlobalAttention
INPUT:
    query: query vector [d]
    all_neighbors: all neighbor embeddings [n × d]
    neighbor_layers: HNSW layer for each neighbor [n]
    local_window: size of local neighborhood
    global_indices: indices of global attention nodes
OUTPUT:
    output: attention output [d]

BEGIN
    // 1. Partition neighbors into local and global
    local_neighbors ← EMPTY_LIST
    local_indices ← EMPTY_LIST
    global_neighbors ← EMPTY_LIST
    global_indices_actual ← EMPTY_LIST

    FOR i ← 0 TO n-1 DO
        IF neighbor_layers[i] == 0 AND Length(local_neighbors) < local_window THEN
            // Layer 0 = local neighbors
            local_neighbors.Append(all_neighbors[i])
            local_indices.Append(i)
        ELSE IF neighbor_layers[i] > 0 AND i IN global_indices THEN
            // Higher layers = global neighbors
            global_neighbors.Append(all_neighbors[i])
            global_indices_actual.Append(i)
        END IF
    END FOR

    // 2. Compute local attention
    local_output ← ZeroVector(d)
    IF Length(local_neighbors) > 0 THEN
        local_K ← ConvertToMatrix(local_neighbors)
        local_V ← local_K    // Self-attention
        local_output, _ ← ScaledDotProductAttention(query, local_K, local_V, d)
    END IF

    // 3. Compute global attention
    global_output ← ZeroVector(d)
    IF Length(global_neighbors) > 0 THEN
        global_K ← ConvertToMatrix(global_neighbors)
        global_V ← global_K
        global_output, _ ← ScaledDotProductAttention(query, global_K, global_V, d)
    END IF

    // 4. Learned gating to combine local and global
    alpha ← LearnedGate(query, local_output, global_output)

    // 5. Combine outputs
    output ← ZeroVector(d)
    FOR i ← 0 TO d-1 DO
        output[i] ← alpha * local_output[i] + (1.0 - alpha) * global_output[i]
    END FOR

    RETURN output
END

SUBROUTINE: LearnedGate
INPUT: query[d], local_output[d], global_output[d]
OUTPUT: alpha (scalar in [0, 1])
BEGIN
    // Concatenate all inputs
    concat ← Concatenate(query, local_output, global_output)

    // Linear projection + sigmoid
    gate_weights ← LEARNED_PARAMETERS[3*d]    // Learned during training
    bias ← LEARNED_BIAS                       // Learned during training

    logit ← DotProduct(concat, gate_weights) + bias
    alpha ← Sigmoid(logit)

    RETURN alpha
END

SUBROUTINE: Sigmoid
INPUT: x (scalar)
OUTPUT: sigmoid(x) in [0, 1]
BEGIN
    RETURN 1.0 / (1.0 + exp(-x))
END
```
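
The gating combination (steps 4-5) reduces to a sigmoid over a learned linear projection of the concatenated inputs. A hedged Rust sketch, with `gate_w` and `bias` standing in for the learned parameters:

```rust
fn sigmoid(x: f64) -> f64 {
    1.0 / (1.0 + (-x).exp())
}

/// Gated blend of local and global attention outputs:
/// alpha = sigmoid(gate_w · [query ‖ local ‖ global] + bias),
/// output = alpha·local + (1 - alpha)·global.
fn gated_combine(
    query: &[f64],
    local: &[f64],
    global_out: &[f64],
    gate_w: &[f64], // learned, length 3·d
    bias: f64,      // learned
) -> Vec<f64> {
    let concat: Vec<f64> = query.iter().chain(local).chain(global_out).cloned().collect();
    let logit: f64 = concat.iter().zip(gate_w).map(|(a, b)| a * b).sum::<f64>() + bias;
    let alpha = sigmoid(logit);
    local
        .iter()
        .zip(global_out)
        .map(|(l, g)| alpha * l + (1.0 - alpha) * g)
        .collect()
}
```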

---

### 3.2 Linear Attention (Performer / Random Features)

**Purpose**: O(n·d) complexity using kernel approximation

**Complexity**:
- Time: O(n·D·d) where D = number of random features
- Space: O(D·d)

```
ALGORITHM: LinearAttention
INPUT:
    query: query vector [d]
    keys: key matrix [n × d]
    values: value matrix [n × d]
    num_features: number of random features D
    random_matrix: random projection matrix [D × d]
OUTPUT:
    output: attention output [d]

BEGIN
    // 1. Apply feature map to query
    phi_Q ← FeatureMap(query, random_matrix, num_features)

    // 2. Apply feature map to all keys
    phi_K ← EMPTY_MATRIX[n × num_features]
    FOR i ← 0 TO n-1 DO
        phi_K[i] ← FeatureMap(keys[i], random_matrix, num_features)
    END FOR

    // 3. Compute K^T V (sum over neighbors) - O(n·D·d)
    KV_sum ← ZeroMatrix(num_features, d)
    FOR i ← 0 TO n-1 DO
        FOR j ← 0 TO num_features-1 DO
            FOR k ← 0 TO d-1 DO
                KV_sum[j][k] ← KV_sum[j][k] + phi_K[i][j] * values[i][k]
            END FOR
        END FOR
    END FOR

    // 4. Compute Q·(K^T V) - O(D·d)
    numerator ← ZeroVector(d)
    FOR k ← 0 TO d-1 DO
        FOR j ← 0 TO num_features-1 DO
            numerator[k] ← numerator[k] + phi_Q[j] * KV_sum[j][k]
        END FOR
    END FOR

    // 5. Compute K^T 1 (sum of feature-mapped keys) - O(n·D)
    K_sum ← ZeroVector(num_features)
    FOR i ← 0 TO n-1 DO
        FOR j ← 0 TO num_features-1 DO
            K_sum[j] ← K_sum[j] + phi_K[i][j]
        END FOR
    END FOR

    // 6. Compute denominator Q·(K^T 1) - O(D)
    denominator ← DotProduct(phi_Q, K_sum)

    // 7. Normalize
    output ← Scale(numerator, 1.0 / (denominator + 1e-10))

    RETURN output
END

SUBROUTINE: FeatureMap
INPUT: x[d], random_matrix[D × d], num_features D
OUTPUT: features[D]
BEGIN
    // Random Fourier Features (D must be even; rows 0..D/2-1 are used)
    // φ(x) = sqrt(1/D) * [cos(w₁·x), sin(w₁·x), cos(w₂·x), sin(w₂·x), ...]
    // NOTE: Performer's FAVOR+ uses positive random features for the softmax
    // kernel; the cos/sin map here approximates a shift-invariant kernel.

    scale ← 1.0 / sqrt(num_features)
    features ← EMPTY_ARRAY[num_features]

    FOR i ← 0 TO num_features/2 - 1 DO
        // Get random projection
        w ← random_matrix[i]
        projection ← DotProduct(w, x)

        // Apply cos and sin
        features[2*i] ← scale * cos(projection)
        features[2*i + 1] ← scale * sin(projection)
    END FOR

    RETURN features
END

SUBROUTINE: ZeroMatrix
INPUT: rows, cols
OUTPUT: matrix[rows × cols]
BEGIN
    matrix ← EMPTY_MATRIX[rows × cols]
    FOR i ← 0 TO rows-1 DO
        FOR j ← 0 TO cols-1 DO
            matrix[i][j] ← 0.0
        END FOR
    END FOR
    RETURN matrix
END
```
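
A Rust sketch of the feature map. It stores one frequency row per cos/sin pair (D/2 rows), a packing choice for this sketch rather than part of the spec. A useful invariant for testing: with the 1/√D scaling and paired cos/sin features, ‖φ(x)‖² = 1/2 exactly for every x, since cos² + sin² = 1 per pair:

```rust
/// Random Fourier feature map:
/// φ(x) = (1/√D)·[cos(w₁·x), sin(w₁·x), cos(w₂·x), sin(w₂·x), ...].
/// `random_matrix` holds D/2 frequency rows (one per cos/sin pair).
fn feature_map(x: &[f64], random_matrix: &[Vec<f64>]) -> Vec<f64> {
    let d_feat = 2 * random_matrix.len();
    let scale = 1.0 / (d_feat as f64).sqrt();
    let mut features = Vec::with_capacity(d_feat);
    for w in random_matrix {
        let proj: f64 = w.iter().zip(x).map(|(a, b)| a * b).sum();
        features.push(scale * proj.cos());
        features.push(scale * proj.sin());
    }
    features
}
```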

---

### 3.3 Flash Attention (Tiled / Memory-Efficient)

**Purpose**: Reduce attention working memory through tiling and an online softmax

**Complexity**:
- Time: O(n·d) per query (same as standard, but with better cache locality)
- Space: O(block_size) working memory instead of O(n) for the full score vector

```
ALGORITHM: FlashAttention
INPUT:
    query: query vector [d]
    keys: key matrix [n × d]
    values: value matrix [n × d]
    block_size: tile size B (typically 64-128)
OUTPUT:
    output: attention output [d]

BEGIN
    n ← Length(keys)
    output ← ZeroVector(d)
    row_max ← -INFINITY
    row_sum ← 0.0

    num_blocks ← Ceiling(n / block_size)

    // Process keys/values in blocks (tiles)
    FOR block_idx ← 0 TO num_blocks-1 DO
        // 1. Define current block range
        chunk_start ← block_idx * block_size
        chunk_end ← Min(chunk_start + block_size, n)
        chunk_size ← chunk_end - chunk_start

        // 2. Extract block of keys and values
        chunk_K ← keys[chunk_start : chunk_end]
        chunk_V ← values[chunk_start : chunk_end]

        // 3. Compute attention scores for this block
        scores ← EMPTY_ARRAY[chunk_size]
        FOR i ← 0 TO chunk_size-1 DO
            scores[i] ← DotProduct(query, chunk_K[i]) / sqrt(d)
        END FOR

        // 4. Online softmax: update running max
        new_max ← Max(row_max, Max(scores))

        // 5. Compute exponentials with new max
        exp_scores ← EMPTY_ARRAY[chunk_size]
        FOR i ← 0 TO chunk_size-1 DO
            exp_scores[i] ← exp(scores[i] - new_max)
        END FOR

        // 6. Correction factor for previous blocks
        correction ← exp(row_max - new_max)

        // 7. Update running sum of exponentials
        chunk_sum ← Sum(exp_scores)
        row_sum ← row_sum * correction + chunk_sum

        // 8. Update running max
        row_max ← new_max

        // 9. Accumulate weighted values with correction
        FOR i ← 0 TO d-1 DO
            output[i] ← output[i] * correction
        END FOR

        FOR i ← 0 TO chunk_size-1 DO
            FOR j ← 0 TO d-1 DO
                output[j] ← output[j] + exp_scores[i] * chunk_V[i][j]
            END FOR
        END FOR
    END FOR

    // 10. Final normalization
    FOR i ← 0 TO d-1 DO
        output[i] ← output[i] / row_sum
    END FOR

    RETURN output
END

SUBROUTINE: Max
INPUT: array[n] OR two scalars a, b
OUTPUT: maximum value
BEGIN
    IF array is provided THEN
        max_val ← array[0]
        FOR i ← 1 TO Length(array)-1 DO
            IF array[i] > max_val THEN
                max_val ← array[i]
            END IF
        END FOR
        RETURN max_val
    ELSE
        // Two scalars
        RETURN IF (a > b) THEN a ELSE b
    END IF
END

SUBROUTINE: Sum
INPUT: array[n]
OUTPUT: sum of elements
BEGIN
    total ← 0
    FOR i ← 0 TO Length(array)-1 DO
        total ← total + array[i]
    END FOR
    RETURN total
END

SUBROUTINE: Ceiling
INPUT: x (real number)
OUTPUT: ⌈x⌉ (smallest integer >= x)
BEGIN
    RETURN integer ceiling of x
END

SUBROUTINE: Min
INPUT: a, b (scalars)
OUTPUT: minimum value
BEGIN
    RETURN IF (a < b) THEN a ELSE b
END
```
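
The online-softmax recurrence is exact, not approximate: the tiled result must match a two-pass softmax to floating-point accuracy, which makes a good unit test. A minimal single-query Rust sketch (illustrative names):

```rust
fn dot(x: &[f64], y: &[f64]) -> f64 {
    x.iter().zip(y).map(|(a, b)| a * b).sum()
}

/// Single-query tiled attention with an online softmax. Only one block of
/// scores is live at a time; the running (max, sum, output) state is rescaled
/// whenever the running max increases.
fn flash_attention(q: &[f64], keys: &[Vec<f64>], values: &[Vec<f64>], block: usize) -> Vec<f64> {
    let scale = (q.len() as f64).sqrt();
    let mut output = vec![0.0; values[0].len()];
    let mut row_max = f64::NEG_INFINITY;
    let mut row_sum = 0.0;
    for (ck, cv) in keys.chunks(block).zip(values.chunks(block)) {
        let scores: Vec<f64> = ck.iter().map(|k| dot(q, k) / scale).collect();
        let block_max = scores.iter().cloned().fold(f64::NEG_INFINITY, f64::max);
        let new_max = row_max.max(block_max);
        let correction = (row_max - new_max).exp(); // 0.0 on the first block
        let exps: Vec<f64> = scores.iter().map(|s| (s - new_max).exp()).collect();
        row_sum = row_sum * correction + exps.iter().sum::<f64>();
        row_max = new_max;
        for o in output.iter_mut() {
            *o *= correction;
        }
        for (e, v) in exps.iter().zip(cv) {
            for (o, x) in output.iter_mut().zip(v) {
                *o += e * x;
            }
        }
    }
    output.iter().map(|o| o / row_sum).collect()
}
```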

---

## 4. Graph Attention

### 4.1 Edge-Featured Attention

**Purpose**: Incorporate edge attributes into attention computation

**Complexity**:
- Time: O(n·(d² + d_edge·d_attn))
- Space: O(n)

```
ALGORITHM: EdgeFeaturedAttention
INPUT:
    query: query node embedding [d]
    keys: neighbor node embeddings [n × d]
    values: neighbor node embeddings [n × d]
    edge_features: edge attributes [n × d_edge]
    W_node: node transformation matrix [d × d]
    W_edge: edge transformation matrix [d_attn × d_edge]
    a: attention coefficient vector [2d + d_attn]
OUTPUT:
    output: aggregated embedding [d]

BEGIN
    // 1. Transform query
    q_trans ← MatrixVectorMult(W_node, query)

    // 2. Transform all keys and edge features
    k_trans ← EMPTY_MATRIX[n × d]
    e_trans ← EMPTY_MATRIX[n × d_attn]

    FOR i ← 0 TO n-1 DO
        k_trans[i] ← MatrixVectorMult(W_node, keys[i])
        e_trans[i] ← MatrixVectorMult(W_edge, edge_features[i])
    END FOR

    // 3. Compute attention scores with edge features
    scores ← EMPTY_ARRAY[n]
    FOR i ← 0 TO n-1 DO
        // Concatenate [query || key || edge]
        concat ← Concatenate(q_trans, k_trans[i], e_trans[i])

        // Attention coefficient
        score ← DotProduct(a, concat)

        // Activation (LeakyReLU)
        scores[i] ← LeakyReLU(score, alpha=0.2)
    END FOR

    // 4. Softmax normalization
    weights ← Softmax(scores)

    // 5. Weighted aggregation
    output ← WeightedSum(values, weights)

    RETURN output
END

SUBROUTINE: MatrixVectorMult
INPUT: M[m × n], v[n]
OUTPUT: result[m]
BEGIN
    result ← ZeroVector(m)
    FOR i ← 0 TO m-1 DO
        FOR j ← 0 TO n-1 DO
            result[i] ← result[i] + M[i][j] * v[j]
        END FOR
    END FOR
    RETURN result
END

SUBROUTINE: LeakyReLU
INPUT: x (scalar), alpha (negative slope)
OUTPUT: activated value
BEGIN
    IF x >= 0 THEN
        RETURN x
    ELSE
        RETURN alpha * x
    END IF
END

SUBROUTINE: WeightedSum
INPUT: vectors[n × d], weights[n]
OUTPUT: result[d]
BEGIN
    result ← ZeroVector(d)
    FOR i ← 0 TO n-1 DO
        FOR j ← 0 TO d-1 DO
            result[j] ← result[j] + weights[i] * vectors[i][j]
        END FOR
    END FOR
    RETURN result
END
```
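
The GAT-style scoring step (3) in Rust, taking already-transformed vectors to keep the sketch short (names illustrative; the full transform-then-score pipeline follows the pseudocode above):

```rust
fn leaky_relu(x: f64, alpha: f64) -> f64 {
    if x >= 0.0 { x } else { alpha * x }
}

/// Score for one neighbor: LeakyReLU(a · [Wq ‖ Wk_i ‖ We_i]).
/// `q_trans`, `k_trans`, `e_trans` are the pre-transformed query, key, and
/// edge vectors; `a` is the attention coefficient vector of length 2d + d_attn.
fn edge_attention_score(q_trans: &[f64], k_trans: &[f64], e_trans: &[f64], a: &[f64]) -> f64 {
    let concat: Vec<f64> = q_trans.iter().chain(k_trans).chain(e_trans).cloned().collect();
    let score: f64 = concat.iter().zip(a).map(|(x, w)| x * w).sum();
    leaky_relu(score, 0.2)
}
```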

---

### 4.2 RoPE Graph Attention

**Purpose**: Encode graph distances via rotary position embeddings

**Complexity**:
- Time: O(n·d)
- Space: O(n)

```
ALGORITHM: RoPEGraphAttention
INPUT:
    query: query node embedding [d]
    keys: neighbor node embeddings [n × d]
    values: neighbor node embeddings [n × d]
    distances: graph distances to neighbors [n]
    base: RoPE frequency base (default 10000)
OUTPUT:
    output: attention output [d]

BEGIN
    // 1. Apply RoPE rotation to query (at origin, distance = 0)
    Q_rotated ← ApplyRotation(query, distance=0.0, base)

    // 2. Apply RoPE rotation to keys based on their distances
    K_rotated ← EMPTY_MATRIX[n × d]
    FOR i ← 0 TO n-1 DO
        K_rotated[i] ← ApplyRotation(keys[i], distances[i], base)
    END FOR

    // 3. Compute attention scores with rotated embeddings
    scores ← EMPTY_ARRAY[n]
    FOR i ← 0 TO n-1 DO
        scores[i] ← DotProduct(Q_rotated, K_rotated[i])
    END FOR

    // 4. Softmax and aggregate
    weights ← Softmax(scores)
    output ← WeightedSum(values, weights)

    RETURN output
END

SUBROUTINE: ApplyRotation
INPUT: embedding[d], distance (scalar), base
OUTPUT: rotated[d]
BEGIN
    rotated ← ZeroVector(d)

    // Apply rotation to pairs of dimensions
    FOR i ← 0 TO d/2 - 1 DO
        // Compute rotation angle for this dimension pair
        theta ← distance / (base ^ (2.0 * i / d))

        cos_theta ← cos(theta)
        sin_theta ← sin(theta)

        // Rotate dimensions (2*i, 2*i+1)
        rotated[2*i] ← embedding[2*i] * cos_theta - embedding[2*i+1] * sin_theta
        rotated[2*i+1] ← embedding[2*i] * sin_theta + embedding[2*i+1] * cos_theta
    END FOR

    RETURN rotated
END
```
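
The rotation subroutine in Rust. Since each dimension pair undergoes a plane rotation, the embedding norm is invariant and distance 0 is the identity, both easy to test:

```rust
/// RoPE rotation by graph distance: dimension pair (2i, 2i+1) is rotated by
/// θ_i = distance / base^(2i/d). Assumes d is even.
fn apply_rotation(embedding: &[f64], distance: f64, base: f64) -> Vec<f64> {
    let d = embedding.len();
    let mut rotated = vec![0.0; d];
    for i in 0..d / 2 {
        let theta = distance / base.powf(2.0 * i as f64 / d as f64);
        let (sin_t, cos_t) = theta.sin_cos();
        rotated[2 * i] = embedding[2 * i] * cos_t - embedding[2 * i + 1] * sin_t;
        rotated[2 * i + 1] = embedding[2 * i] * sin_t + embedding[2 * i + 1] * cos_t;
    }
    rotated
}
```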

---

### 4.3 Cross-Space (Dual) Attention

**Purpose**: Bridge graph topology and latent space semantics

**Complexity**:
- Time: O(n_graph·d² + k_latent·d²) for the three attention passes, plus the latent neighbor search (O(N·d) brute force; sublinear with the HNSW index)
- Space: O(n_graph + k_latent)

```
ALGORITHM: DualSpaceAttention
INPUT:
    query: query node embedding [d]
    graph_neighbors: topological neighbors [n_graph × d]
    all_embeddings: all node embeddings for latent search [N × d]
    k_latent: number of latent neighbors
OUTPUT:
    output: fused embedding [d]

BEGIN
    // 1. Graph attention (topology-based)
    graph_output ← MultiHeadAttention(
        query,
        graph_neighbors,
        graph_neighbors,
        num_heads=8
    )

    // 2. Find latent neighbors (similarity-based)
    latent_neighbors ← FindTopKSimilar(query, all_embeddings, k_latent)

    // 3. Latent attention (embedding-based)
    latent_output ← MultiHeadAttention(
        query,
        latent_neighbors,
        latent_neighbors,
        num_heads=8
    )

    // 4. Cross-attention (graph context queries latent space)
    cross_output ← MultiHeadAttention(
        graph_output,    // Use graph output as query
        latent_neighbors,
        latent_neighbors,
        num_heads=8
    )

    // 5. Fusion of all three outputs
    concatenated ← Concatenate(graph_output, latent_output, cross_output)

    // 6. Final projection
    W_fusion ← LEARNED_WEIGHTS[d × 3d]
    output ← MatrixVectorMult(W_fusion, concatenated)

    RETURN output
END

SUBROUTINE: FindTopKSimilar
INPUT: query[d], all_embeddings[N × d], k
OUTPUT: top_k_embeddings[k × d]
BEGIN
    similarities ← EMPTY_ARRAY[N]

    // 1. Compute cosine similarity to all embeddings
    FOR i ← 0 TO N-1 DO
        similarities[i] ← CosineSimilarity(query, all_embeddings[i])
    END FOR

    // 2. Find top-k indices
    top_k_indices ← TopKIndices(similarities, k)

    // 3. Extract top-k embeddings
    top_k_embeddings ← EMPTY_MATRIX[k × d]
    FOR i ← 0 TO k-1 DO
        top_k_embeddings[i] ← all_embeddings[top_k_indices[i]]
    END FOR

    RETURN top_k_embeddings
END

SUBROUTINE: CosineSimilarity
INPUT: x[d], y[d]
OUTPUT: similarity in [-1, 1]
BEGIN
    dot ← DotProduct(x, y)
    norm_x ← L2Norm(x)
    norm_y ← L2Norm(y)

    // Avoid division by zero
    IF norm_x == 0 OR norm_y == 0 THEN
        RETURN 0.0
    END IF

    RETURN dot / (norm_x * norm_y)
END

SUBROUTINE: TopKIndices
INPUT: array[N], k
OUTPUT: indices[k]
BEGIN
    // Create (index, value) pairs
    pairs ← EMPTY_ARRAY[N]
    FOR i ← 0 TO N-1 DO
        pairs[i] ← (i, array[i])
    END FOR

    // Sort by value (descending)
    Sort(pairs, by=value, order=descending)

    // Extract top-k indices
    indices ← EMPTY_ARRAY[k]
    FOR i ← 0 TO k-1 DO
        indices[i] ← pairs[i].index
    END FOR

    RETURN indices
END
```
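
A brute-force Rust sketch of FindTopKSimilar (the production path would query the HNSW index rather than scanning all N embeddings; names are illustrative):

```rust
/// Cosine similarity with the zero-vector guard from the pseudocode.
fn cosine(x: &[f64], y: &[f64]) -> f64 {
    let dot: f64 = x.iter().zip(y).map(|(a, b)| a * b).sum();
    let nx = x.iter().map(|a| a * a).sum::<f64>().sqrt();
    let ny = y.iter().map(|a| a * a).sum::<f64>().sqrt();
    if nx == 0.0 || ny == 0.0 { 0.0 } else { dot / (nx * ny) }
}

/// Indices of the k embeddings most similar to `query` (linear scan + sort).
fn top_k_similar(query: &[f64], all: &[Vec<f64>], k: usize) -> Vec<usize> {
    let mut idx: Vec<usize> = (0..all.len()).collect();
    idx.sort_by(|&a, &b| {
        cosine(query, &all[b]).partial_cmp(&cosine(query, &all[a])).unwrap()
    });
    idx.truncate(k);
    idx
}
```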

---

## 5. Adaptive Attention

### 5.1 Mixture of Experts (MoE) Attention

**Purpose**: Route to specialized attention mechanisms based on context

**Complexity**:
- Time: O(K · attention_complexity) where K = top-k experts (typically 2)
- Space: O(num_experts · model_size)

```
ALGORITHM: MoEAttention
INPUT:
    query: query node embedding [d]
    keys: neighbor embeddings [n × d]
    values: neighbor embeddings [n × d]
    experts: list of attention mechanisms
    router: routing network
    top_k: number of experts to use (typically 2)
OUTPUT:
    output: expert-mixed output [d]

EXPERT_TYPES:
    1. Standard Multi-Head Attention
    2. Hyperbolic Attention
    3. Linear Attention
    4. Edge-Featured Attention

BEGIN
    num_experts ← Length(experts)

    // 1. Router computes expert scores
    router_logits ← RouterNetwork(query, router)
    router_probs ← Softmax(router_logits)

    // 2. Select top-k experts
    top_k_indices ← TopKIndices(router_probs, top_k)

    // 3. Normalize selected expert weights
    selected_weights ← EMPTY_ARRAY[top_k]
    weight_sum ← 0.0
    FOR i ← 0 TO top_k-1 DO
        expert_idx ← top_k_indices[i]
        selected_weights[i] ← router_probs[expert_idx]
        weight_sum ← weight_sum + selected_weights[i]
    END FOR

    // Normalize
    FOR i ← 0 TO top_k-1 DO
        selected_weights[i] ← selected_weights[i] / weight_sum
    END FOR

    // 4. Compute weighted expert outputs
    output ← ZeroVector(d)
    FOR i ← 0 TO top_k-1 DO
        expert_idx ← top_k_indices[i]
        expert ← experts[expert_idx]

        // Call appropriate expert
        expert_output ← CALL_EXPERT(expert, query, keys, values)

        // Weighted accumulation
        weight ← selected_weights[i]
        FOR j ← 0 TO d-1 DO
            output[j] ← output[j] + weight * expert_output[j]
        END FOR
    END FOR

    RETURN output
END

SUBROUTINE: RouterNetwork
INPUT: query[d], router_weights
OUTPUT: logits[num_experts]
BEGIN
    // Simple two-layer MLP
    hidden_size ← 4 * d

    // First layer
    W1 ← router_weights.layer1    // [hidden_size × d]
    b1 ← router_weights.bias1     // [hidden_size]
    hidden ← MatrixVectorMult(W1, query)
    FOR i ← 0 TO hidden_size-1 DO
        hidden[i] ← ReLU(hidden[i] + b1[i])
    END FOR

    // Second layer
    W2 ← router_weights.layer2    // [num_experts × hidden_size]
    b2 ← router_weights.bias2     // [num_experts]
    logits ← MatrixVectorMult(W2, hidden)
    FOR i ← 0 TO num_experts-1 DO
        logits[i] ← logits[i] + b2[i]
    END FOR

    RETURN logits
END

SUBROUTINE: CALL_EXPERT
INPUT: expert, query, keys, values
OUTPUT: expert_output[d]
BEGIN
    MATCH expert.type:
        CASE "standard":
            RETURN MultiHeadAttention(query, keys, values, num_heads=8)

        CASE "hyperbolic":
            RETURN HyperbolicAttention(query, keys, values, curvature=-1.0, temperature=1.0)

        CASE "linear":
            RETURN LinearAttention(query, keys, values, num_features=256)

        CASE "edge_featured":
            edge_features ← expert.edge_features
            RETURN EdgeFeaturedAttention(query, keys, values, edge_features)

        DEFAULT:
            ERROR "Unknown expert type"
    END MATCH
END

SUBROUTINE: ReLU
INPUT: x (scalar)
OUTPUT: max(0, x)
BEGIN
    RETURN IF (x > 0) THEN x ELSE 0
END
```
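
Steps 1-3 (route, select top-k, renormalize) in Rust; the expert call itself is elided. The key property to test is that the renormalized weights of the selected experts sum to 1:

```rust
fn softmax(logits: &[f64]) -> Vec<f64> {
    let max = logits.iter().cloned().fold(f64::NEG_INFINITY, f64::max);
    let exps: Vec<f64> = logits.iter().map(|x| (x - max).exp()).collect();
    let sum: f64 = exps.iter().sum();
    exps.iter().map(|e| e / sum).collect()
}

/// Softmax-route to the top-k experts and renormalize their weights so the
/// selected mixture weights sum to 1. Returns (expert_index, weight) pairs.
fn route_top_k(logits: &[f64], k: usize) -> Vec<(usize, f64)> {
    let probs = softmax(logits);
    let mut idx: Vec<usize> = (0..probs.len()).collect();
    idx.sort_by(|&a, &b| probs[b].partial_cmp(&probs[a]).unwrap());
    idx.truncate(k);
    let total: f64 = idx.iter().map(|&i| probs[i]).sum();
    idx.into_iter().map(|i| (i, probs[i] / total)).collect()
}
```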

---

### 5.2 Learned Navigation (Reinforcement Learning)

**Purpose**: Learn optimal navigation policy for graph traversal

**Complexity**:
- Time: O(num_steps · d²) per navigation episode
- Space: O(graph_size + policy_params)

```
ALGORITHM: RLNavigationStep
INPUT:
    current_state: current navigation state
    policy_network: learned policy (neural network)
    value_network: value estimator
    graph: graph structure
OUTPUT:
    action: which neighbor to visit
    reward: immediate reward
    next_state: resulting state
    state_value: value estimate for the current state

STATE_REPRESENTATION:
    current_embedding: [d]
    query_embedding: [d]
    graph_features: [d_graph]
    history: [max_steps × d]

BEGIN
    // 1. Encode current state
    state_vector ← EncodeState(current_state)

    // 2. Policy network outputs action logits
    action_logits ← PolicyNetwork(state_vector, policy_network)

    // 3. Value network estimates state value
    state_value ← ValueNetwork(state_vector, value_network)

    // 4. Sample action from policy
    action_probs ← Softmax(action_logits)
    action ← SampleCategorical(action_probs)    // Which neighbor to visit

    // 5. Execute action (move to selected neighbor)
    next_node ← current_state.neighbors[action]

    // 6. Compute reward
    reward ← ComputeReward(current_state, next_node, current_state.query)

    // 7. Update state
    next_state ← UpdateState(current_state, next_node, action)

    RETURN action, reward, next_state, state_value
END

SUBROUTINE: EncodeState
INPUT: state
OUTPUT: state_vector[d_state]
BEGIN
    // Concatenate all state components
    state_vector ← Concatenate(
        state.current_embedding,
        state.query_embedding,
        state.graph_features,
        Flatten(state.history)
    )

    RETURN state_vector
END

SUBROUTINE: PolicyNetwork
INPUT: state_vector[d_state], policy_params
OUTPUT: action_logits[num_neighbors]
BEGIN
    // Three-layer MLP
    hidden1 ← ReLU(Linear(state_vector, policy_params.W1, policy_params.b1))
    hidden2 ← ReLU(Linear(hidden1, policy_params.W2, policy_params.b2))
    logits ← Linear(hidden2, policy_params.W3, policy_params.b3)

    RETURN logits
END

SUBROUTINE: ValueNetwork
INPUT: state_vector[d_state], value_params
OUTPUT: value (scalar)
BEGIN
    // Three-layer MLP ending in scalar
    hidden1 ← ReLU(Linear(state_vector, value_params.W1, value_params.b1))
    hidden2 ← ReLU(Linear(hidden1, value_params.W2, value_params.b2))
    value ← Linear(hidden2, value_params.W3, value_params.b3)[0]    // Scalar output

    RETURN value
END

SUBROUTINE: ComputeReward
INPUT: current_state, next_node, query
OUTPUT: reward (scalar)
BEGIN
    // Reward based on similarity improvement
    current_similarity ← CosineSimilarity(
        current_state.current_embedding,
        query
    )

    next_similarity ← CosineSimilarity(
        next_node.embedding,
        query
    )

    // Positive reward if moving closer, negative if farther
|
||
reward ← next_similarity - current_similarity
|
||
|
||
// Bonus for reaching goal
|
||
IF next_similarity > GOAL_THRESHOLD THEN
|
||
reward ← reward + GOAL_BONUS
|
||
END IF
|
||
|
||
// Penalty for taking too many steps
|
||
reward ← reward - STEP_PENALTY
|
||
|
||
RETURN reward
|
||
END
|
||
|
||
SUBROUTINE: SampleCategorical
|
||
INPUT: probabilities[n]
|
||
OUTPUT: sampled_index in [0, n-1]
|
||
BEGIN
|
||
// Sample from categorical distribution
|
||
cumsum ← 0.0
|
||
rand ← Random() // Uniform [0, 1)
|
||
|
||
FOR i ← 0 TO n-1 DO
|
||
cumsum ← cumsum + probabilities[i]
|
||
IF rand < cumsum THEN
|
||
RETURN i
|
||
END IF
|
||
END FOR
|
||
|
||
// Fallback (shouldn't reach here if probabilities sum to 1)
|
||
RETURN n-1
|
||
END
|
||
|
||
SUBROUTINE: UpdateState
|
||
INPUT: current_state, next_node, action
|
||
OUTPUT: new_state
|
||
BEGIN
|
||
new_state ← COPY(current_state)
|
||
|
||
// Update current node
|
||
new_state.current_node ← next_node
|
||
new_state.current_embedding ← next_node.embedding
|
||
|
||
// Update history (sliding window)
|
||
new_state.history.PopFirst()
|
||
new_state.history.Append(next_node.embedding)
|
||
|
||
// Increment step counter
|
||
new_state.num_steps ← new_state.num_steps + 1
|
||
|
||
RETURN new_state
|
||
END
|
||
|
||
SUBROUTINE: Linear
|
||
INPUT: x[d_in], W[d_out × d_in], b[d_out]
|
||
OUTPUT: y[d_out]
|
||
BEGIN
|
||
y ← MatrixVectorMult(W, x)
|
||
FOR i ← 0 TO d_out-1 DO
|
||
y[i] ← y[i] + b[i]
|
||
END FOR
|
||
RETURN y
|
||
END
|
||
```
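SampleCategorical above is plain inverse-CDF sampling, which translates to Rust almost line for line. In this sketch the uniform draw is passed in as a parameter (an assumption made for determinism and testability); a real caller would supply a fresh RNG sample instead.

```rust
// Sketch of SampleCategorical with the uniform draw `u` supplied by the
// caller. The final return mirrors the pseudocode's round-off fallback.
fn sample_categorical(probs: &[f64], u: f64) -> usize {
    let mut cumsum = 0.0;
    for (i, p) in probs.iter().enumerate() {
        cumsum += p;
        if u < cumsum {
            return i;
        }
    }
    probs.len() - 1 // fallback when probabilities sum to slightly under 1
}

fn main() {
    let probs = [0.1, 0.6, 0.3];
    assert_eq!(sample_categorical(&probs, 0.05), 0);
    assert_eq!(sample_categorical(&probs, 0.50), 1);
    assert_eq!(sample_categorical(&probs, 0.95), 2);
}
```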

---

## 6. Training Procedures

### 6.1 InfoNCE Contrastive Loss

**Purpose**: Learn embeddings that are similar to positives and dissimilar to negatives

**Complexity**:
- Time: O((n_pos + n_neg) · d)
- Space: O(n_pos + n_neg)

```
ALGORITHM: InfoNCELoss
INPUT:
    anchor: anchor embedding [d]
    positives: positive samples [n_pos × d]
    negatives: negative samples [n_neg × d]
    temperature: softmax temperature (typically 0.07)
OUTPUT:
    loss: contrastive loss (scalar)

BEGIN
    // 1. Compute positive similarities
    pos_scores ← EMPTY_ARRAY[n_pos]
    FOR i ← 0 TO n_pos-1 DO
        sim ← CosineSimilarity(anchor, positives[i])
        pos_scores[i] ← sim / temperature
    END FOR

    // 2. Compute negative similarities
    neg_scores ← EMPTY_ARRAY[n_neg]
    FOR i ← 0 TO n_neg-1 DO
        sim ← CosineSimilarity(anchor, negatives[i])
        neg_scores[i] ← sim / temperature
    END FOR

    // 3. InfoNCE loss (averaged over positives)
    total_loss ← 0.0

    FOR i ← 0 TO n_pos-1 DO
        // Numerator: exp(positive score)
        numerator ← exp(pos_scores[i])

        // Denominator: exp(positive score) + sum of exp(negative scores)
        denominator ← numerator
        FOR j ← 0 TO n_neg-1 DO
            denominator ← denominator + exp(neg_scores[j])
        END FOR

        // Log probability
        log_prob ← log(numerator / denominator)

        // Accumulate negative log probability
        total_loss ← total_loss - log_prob
    END FOR

    // Average over positives
    loss ← total_loss / n_pos

    RETURN loss
END
```
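As a sanity check on the loop structure above, here is a direct Rust sketch of InfoNCELoss. It assumes embeddings are already unit-normalized, so CosineSimilarity reduces to a dot product; that normalization is an assumption of this sketch, not something the pseudocode requires.

```rust
// Sketch of InfoNCELoss for unit-norm embeddings.
fn dot(a: &[f64], b: &[f64]) -> f64 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

fn info_nce_loss(
    anchor: &[f64],
    positives: &[Vec<f64>],
    negatives: &[Vec<f64>],
    temperature: f64,
) -> f64 {
    // Negative terms are shared across all positives, so compute them once.
    let neg_exp_sum: f64 = negatives
        .iter()
        .map(|n| (dot(anchor, n) / temperature).exp())
        .sum();
    let mut total = 0.0;
    for p in positives {
        let num = (dot(anchor, p) / temperature).exp();
        total -= (num / (num + neg_exp_sum)).ln(); // -log p(positive)
    }
    total / positives.len() as f64
}

fn main() {
    // Positive identical to anchor, negative orthogonal: loss is near zero.
    let anchor = vec![1.0, 0.0];
    let loss = info_nce_loss(&anchor, &[vec![1.0, 0.0]], &[vec![0.0, 1.0]], 0.07);
    assert!(loss >= 0.0 && loss < 1e-3);
}
```

Hoisting the shared negative sum out of the positive loop is the one deviation from the literal pseudocode; it changes nothing numerically but drops the inner loop's cost.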

---

### 6.2 Hard Negative Sampling

**Purpose**: Select informative negative samples for faster learning

**Complexity**:
- Time: O(N·d) where N = total number of samples
- Space: O(k) where k = number of hard negatives

```
ALGORITHM: SampleHardNegatives
INPUT:
    anchor: anchor embedding [d]
    all_embeddings: all available embeddings [N × d]
    true_positives: indices of true positives
    k: number of hard negatives to sample
    strategy: sampling strategy ("distance", "degree", "mixed")
OUTPUT:
    hard_negatives: selected hard negative samples [k × d]

BEGIN
    // 1. Filter out true positives
    candidate_indices ← EMPTY_LIST
    FOR i ← 0 TO N-1 DO
        IF i NOT IN true_positives THEN
            candidate_indices.Append(i)
        END IF
    END FOR

    n_candidates ← Length(candidate_indices)

    // 2. Select hard negatives based on strategy
    MATCH strategy:
        CASE "distance":
            hard_negatives ← SampleByDistance(
                anchor, all_embeddings, candidate_indices, k
            )

        CASE "degree":
            hard_negatives ← SampleByDegree(
                anchor, all_embeddings, candidate_indices, k
            )

        CASE "mixed":
            k_dist ← FLOOR(k / 2)
            k_deg ← k - k_dist

            dist_negs ← SampleByDistance(
                anchor, all_embeddings, candidate_indices, k_dist
            )
            deg_negs ← SampleByDegree(
                anchor, all_embeddings, candidate_indices, k_deg
            )

            hard_negatives ← Concatenate(dist_negs, deg_negs)

        DEFAULT:
            ERROR "Unknown strategy"
    END MATCH

    RETURN hard_negatives
END

SUBROUTINE: SampleByDistance
INPUT: anchor[d], all_embeddings[N × d], candidate_indices, k
OUTPUT: hard_negatives[k × d]
BEGIN
    // Select the k most similar candidates (hardest negatives)
    similarities ← EMPTY_ARRAY[Length(candidate_indices)]

    FOR i ← 0 TO Length(candidate_indices)-1 DO
        idx ← candidate_indices[i]
        similarities[i] ← CosineSimilarity(anchor, all_embeddings[idx])
    END FOR

    // Get top-k most similar (hardest)
    top_k_local_indices ← TopKIndices(similarities, k)

    // Map back to global indices
    hard_negatives ← EMPTY_MATRIX[k × d]
    FOR i ← 0 TO k-1 DO
        local_idx ← top_k_local_indices[i]
        global_idx ← candidate_indices[local_idx]
        hard_negatives[i] ← all_embeddings[global_idx]
    END FOR

    RETURN hard_negatives
END

SUBROUTINE: SampleByDegree
INPUT: anchor[d], all_embeddings[N × d], candidate_indices, k
OUTPUT: hard_negatives[k × d]
BEGIN
    // Select candidates whose graph degree is close to the anchor's
    // (degree lookups resolve through the underlying graph node)
    anchor_degree ← GetDegree(anchor)

    degree_diffs ← EMPTY_ARRAY[Length(candidate_indices)]
    FOR i ← 0 TO Length(candidate_indices)-1 DO
        idx ← candidate_indices[i]
        candidate_degree ← GetDegree(all_embeddings[idx])
        degree_diffs[i] ← abs(anchor_degree - candidate_degree)
    END FOR

    // Get the k candidates with the most similar degree
    top_k_local_indices ← TopKIndices(
        NegateArray(degree_diffs),  // Negate so smaller differences rank higher
        k
    )

    hard_negatives ← EMPTY_MATRIX[k × d]
    FOR i ← 0 TO k-1 DO
        local_idx ← top_k_local_indices[i]
        global_idx ← candidate_indices[local_idx]
        hard_negatives[i] ← all_embeddings[global_idx]
    END FOR

    RETURN hard_negatives
END

SUBROUTINE: NegateArray
INPUT: array[n]
OUTPUT: negated[n]
BEGIN
    negated ← EMPTY_ARRAY[n]
    FOR i ← 0 TO n-1 DO
        negated[i] ← -array[i]
    END FOR
    RETURN negated
END
```
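The core of SampleByDistance is TopKIndices over similarity scores. A minimal sketch, using a full sort for clarity (a heap or `select_nth_unstable` would avoid the O(n log n) cost on large candidate sets):

```rust
// Sketch of TopKIndices: indices of the k largest values, descending.
fn top_k_by_similarity(sims: &[f64], k: usize) -> Vec<usize> {
    let mut idx: Vec<usize> = (0..sims.len()).collect();
    // Sort descending by similarity; assumes no NaN scores.
    idx.sort_by(|&a, &b| sims[b].partial_cmp(&sims[a]).unwrap());
    idx.truncate(k);
    idx
}

fn main() {
    let sims = [0.2, 0.9, 0.5, 0.7];
    // The two "hardest" negatives are the most similar candidates.
    assert_eq!(top_k_by_similarity(&sims, 2), vec![1, 3]);
}
```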

---

### 6.3 Curriculum Learning Schedule

**Purpose**: Gradually increase task difficulty during training

**Complexity**:
- Time: O(1) per epoch (just weight computation)
- Space: O(num_losses)

```
ALGORITHM: CurriculumSchedule
INPUT:
    current_epoch: current training epoch
    total_epochs: total number of epochs
    loss_types: list of loss components
    using_continual_learning: whether EWC-based continual learning is active
OUTPUT:
    loss_weights: weight for each loss component

LOSS_TYPES:
    - reconstruction: Autoencoder reconstruction loss
    - contrastive: InfoNCE contrastive loss
    - task: Downstream task loss
    - spectral: Laplacian regularization
    - ewc: Elastic Weight Consolidation

BEGIN
    loss_weights ← EMPTY_MAP

    // 1. Reconstruction: high early, decays exponentially
    lambda_recon ← exp(-current_epoch / 50.0)
    loss_weights["reconstruction"] ← lambda_recon

    // 2. Contrastive: ramp up linearly over the first 10 epochs
    IF current_epoch < 10 THEN
        lambda_contrast ← 0.1 + 0.9 * (current_epoch / 10.0)
    ELSE
        lambda_contrast ← 1.0
    END IF
    loss_weights["contrastive"] ← lambda_contrast

    // 3. Task: start after 50 epochs, then ramp up
    IF current_epoch < 50 THEN
        lambda_task ← 0.1
    ELSE
        lambda_task ← 0.1 + 0.9 * ((current_epoch - 50) / 50.0)
        lambda_task ← Min(lambda_task, 1.0)
    END IF
    loss_weights["task"] ← lambda_task

    // 4. Spectral: constant moderate weight
    loss_weights["spectral"] ← 0.01

    // 5. EWC: increase if using continual learning
    IF using_continual_learning THEN
        lambda_ewc ← Min(current_epoch / 100.0, 1.0)
    ELSE
        lambda_ewc ← 0.0
    END IF
    loss_weights["ewc"] ← lambda_ewc

    RETURN loss_weights
END
```
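The schedules above are closed-form functions of the epoch, so they are cheap to verify in isolation. A sketch of the reconstruction and contrastive terms, with the constants taken straight from the pseudocode (decay constant 50, a 10-epoch ramp from 0.1 to 1.0):

```rust
// Sketch of two CurriculumSchedule terms as pure functions of the epoch.
fn reconstruction_weight(epoch: u32) -> f64 {
    // Exponential decay: 1.0 at epoch 0, ~0.37 at epoch 50.
    (-(epoch as f64) / 50.0).exp()
}

fn contrastive_weight(epoch: u32) -> f64 {
    // Linear ramp from 0.1 to 1.0 over the first 10 epochs, then flat.
    if epoch < 10 {
        0.1 + 0.9 * (epoch as f64 / 10.0)
    } else {
        1.0
    }
}

fn main() {
    assert!((reconstruction_weight(0) - 1.0).abs() < 1e-12);
    assert!((contrastive_weight(0) - 0.1).abs() < 1e-12);
    assert!((contrastive_weight(5) - 0.55).abs() < 1e-12);
    assert_eq!(contrastive_weight(10), 1.0);
}
```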

---

### 6.4 Multi-Objective Loss Computation

**Purpose**: Combine multiple loss functions with learned or scheduled weights

**Complexity**:
- Time: O(num_losses)
- Space: O(1)

```
ALGORITHM: MultiObjectiveLoss
INPUT:
    loss_components: computed loss values
    loss_weights: weights for each component
    auto_balance: whether to auto-balance weights
OUTPUT:
    total_loss: weighted sum of losses
    updated_weights: potentially updated weights

LOSS_COMPONENTS:
    task_loss: Main task objective
    contrastive_loss: InfoNCE or similar
    reconstruction_loss: Autoencoder
    spectral_loss: Laplacian smoothness
    ewc_loss: Continual learning penalty

BEGIN
    // 1. Auto-balance (optional)
    IF auto_balance THEN
        loss_weights ← AutoBalance(loss_components, loss_weights)
    END IF

    // 2. Compute weighted sum
    total_loss ← 0.0

    total_loss ← total_loss + loss_weights["task"] * loss_components.task_loss
    total_loss ← total_loss + loss_weights["contrastive"] * loss_components.contrastive_loss
    total_loss ← total_loss + loss_weights["reconstruction"] * loss_components.reconstruction_loss
    total_loss ← total_loss + loss_weights["spectral"] * loss_components.spectral_loss
    total_loss ← total_loss + loss_weights["ewc"] * loss_components.ewc_loss

    RETURN total_loss, loss_weights
END

SUBROUTINE: AutoBalance
INPUT: loss_components, current_weights
OUTPUT: balanced_weights
BEGIN
    // Normalize so each loss contributes equally
    num_losses ← 5

    // Compute the current contribution of each loss
    contributions ← EMPTY_MAP
    contributions["task"] ← current_weights["task"] * loss_components.task_loss
    contributions["contrastive"] ← current_weights["contrastive"] * loss_components.contrastive_loss
    contributions["reconstruction"] ← current_weights["reconstruction"] * loss_components.reconstruction_loss
    contributions["spectral"] ← current_weights["spectral"] * loss_components.spectral_loss
    contributions["ewc"] ← current_weights["ewc"] * loss_components.ewc_loss

    // Compute total and target per-loss contribution
    total ← Sum(contributions.values)
    target_contribution ← total / num_losses

    // Adjust weights to equalize contributions
    balanced_weights ← EMPTY_MAP
    epsilon ← 1e-10  // Avoid division by zero

    balanced_weights["task"] ← target_contribution / Max(loss_components.task_loss, epsilon)
    balanced_weights["contrastive"] ← target_contribution / Max(loss_components.contrastive_loss, epsilon)
    balanced_weights["reconstruction"] ← target_contribution / Max(loss_components.reconstruction_loss, epsilon)
    balanced_weights["spectral"] ← target_contribution / Max(loss_components.spectral_loss, epsilon)
    balanced_weights["ewc"] ← target_contribution / Max(loss_components.ewc_loss, epsilon)

    RETURN balanced_weights
END
```
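AutoBalance reduces to three steps: compute the weighted total, divide by the number of terms to get a target, and rescale each weight so its term hits that target. The sketch below flattens the named map from the pseudocode into positional slices, which is purely a convenience of this illustration:

```rust
// Sketch of AutoBalance over positional loss/weight slices.
fn auto_balance(losses: &[f64], weights: &[f64]) -> Vec<f64> {
    let eps = 1e-10; // avoid division by zero, as in the pseudocode
    let total: f64 = losses.iter().zip(weights).map(|(l, w)| l * w).sum();
    let target = total / losses.len() as f64;
    // New weight per term: target contribution divided by the raw loss.
    losses.iter().map(|&l| target / l.max(eps)).collect()
}

fn main() {
    let losses = [2.0, 0.5];
    let balanced = auto_balance(&losses, &[1.0, 1.0]);
    // After rebalancing, both weighted terms contribute equally.
    let c0 = balanced[0] * losses[0];
    let c1 = balanced[1] * losses[1];
    assert!((c0 - c1).abs() < 1e-12);
}
```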

---

### 6.5 Spectral Regularization

**Purpose**: Preserve graph structure through Laplacian smoothness

**Complexity**:
- Time: O(|E|·d) where |E| = number of edges
- Space: O(1) (streaming computation)

```
ALGORITHM: LaplacianRegularization
INPUT:
    embeddings: node embeddings [N × d]
    edges: edge list [(u, v)]
    edge_weights: optional edge weights [|E|]
    normalized: whether to use the normalized Laplacian
    node_degrees: node degrees [N]
OUTPUT:
    spectral_loss: smoothness penalty (scalar)

BEGIN
    total_loss ← 0.0
    num_edges ← Length(edges)

    FOR i ← 0 TO num_edges-1 DO
        u, v ← edges[i]

        // Compute embedding difference
        diff ← Subtract(embeddings[u], embeddings[v])
        diff_norm_sq ← L2NormSquared(diff)

        // Get edge weight
        weight ← 1.0
        IF edge_weights PROVIDED THEN
            weight ← edge_weights[i]
        END IF

        // Normalized Laplacian: weight by degrees
        IF normalized THEN
            degree_norm ← sqrt(node_degrees[u] * node_degrees[v])
            weight ← weight / Max(degree_norm, 1.0)
        END IF

        // Accumulate weighted squared difference
        total_loss ← total_loss + weight * diff_norm_sq
    END FOR

    // Average over edges (guard against an empty edge list)
    spectral_loss ← total_loss / Max(num_edges, 1)

    RETURN spectral_loss
END
```
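A sketch of the simplest case of LaplacianRegularization, with unit edge weights and no degree normalization; the optional branches are omitted here for brevity:

```rust
// Sketch of unnormalized Laplacian smoothness: the average squared
// embedding difference across edges.
fn laplacian_loss(embeddings: &[Vec<f64>], edges: &[(usize, usize)]) -> f64 {
    if edges.is_empty() {
        return 0.0; // guard: no edges, no penalty
    }
    let total: f64 = edges
        .iter()
        .map(|&(u, v)| {
            embeddings[u]
                .iter()
                .zip(&embeddings[v])
                .map(|(a, b)| (a - b) * (a - b))
                .sum::<f64>()
        })
        .sum();
    total / edges.len() as f64
}

fn main() {
    let emb = vec![vec![0.0, 0.0], vec![3.0, 4.0]];
    let loss = laplacian_loss(&emb, &[(0, 1)]);
    assert!((loss - 25.0).abs() < 1e-12); // ||(3, 4)||² = 25
}
```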

---

## 7. Data Structures

### 7.1 Attention State

```
STRUCTURE: AttentionState
    FIELDS:
        query: [d]                       // Query embedding
        keys: [n × d]                    // Key embeddings
        values: [n × d]                  // Value embeddings
        attention_weights: [n]           // Computed weights
        output: [d]                      // Final output
        metadata: Map<String, Any>       // Additional info

    OPERATIONS:
        Initialize(query, keys, values)
        ComputeWeights() → attention_weights
        ComputeOutput() → output
        GetMetadata(key) → value
```

---

### 7.2 Graph Structure

```
STRUCTURE: Graph
    FIELDS:
        nodes: [N]                           // Node identifiers
        embeddings: [N × d]                  // Node embeddings
        adjacency: [N × N] OR SparseMatrix   // Adjacency matrix
        edge_list: [(u, v)]                  // Edge list
        edge_features: [|E| × d_edge]        // Edge attributes
        node_degrees: [N]                    // Degree of each node

    OPERATIONS:
        GetNeighbors(node_id) → [neighbor_ids]
        GetEdgeFeature(u, v) → [d_edge]
        GetDegree(node_id) → scalar
        AddEdge(u, v, features)
        UpdateEmbedding(node_id, new_embedding)
```

---

### 7.3 HNSW-Specific Structure

```
STRUCTURE: HNSWGraph
    EXTENDS: Graph
    ADDITIONAL_FIELDS:
        layers: [max_layer]                  // Layer-wise graphs
        entry_point: node_id                 // Top-layer entry
        max_layer: integer                   // Maximum layer
        layer_neighbors: Map<(node, layer), [neighbors]>

    OPERATIONS:
        GetLayerNeighbors(node_id, layer) → [neighbor_ids]
        GetNodeLayer(node_id) → layer
        NavigateLayer(query, layer, num_steps) → closest_node
        InsertNode(node_id, embedding, layer)
```

---

### 7.4 Training State

```
STRUCTURE: TrainingState
    FIELDS:
        current_epoch: integer
        loss_history: [num_epochs]
        loss_weights: Map<loss_type, weight>
        curriculum_schedule: CurriculumSchedule
        optimizer_state: OptimizerState
        best_model_params: ModelParams
        early_stopping_counter: integer

    OPERATIONS:
        UpdateEpoch()
        RecordLoss(loss_value)
        GetLossWeight(loss_type) → weight
        UpdateBestModel(current_params)
        ShouldEarlyStop() → boolean
```
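A minimal Rust rendering of the Graph structure, limited to the operations the pseudocode in earlier sections actually calls. The adjacency-list representation (rather than a dense matrix) and the field choices are assumptions of this sketch:

```rust
// Sketch of the Graph structure with an adjacency list per node.
struct Graph {
    embeddings: Vec<Vec<f32>>,  // [N × d]
    neighbors: Vec<Vec<usize>>, // adjacency list per node
}

impl Graph {
    fn get_neighbors(&self, node_id: usize) -> &[usize] {
        &self.neighbors[node_id]
    }

    fn get_degree(&self, node_id: usize) -> usize {
        self.neighbors[node_id].len()
    }

    fn add_edge(&mut self, u: usize, v: usize) {
        // Undirected edge: record both directions.
        self.neighbors[u].push(v);
        self.neighbors[v].push(u);
    }
}

fn main() {
    let mut g = Graph {
        embeddings: vec![vec![0.0; 4]; 3],
        neighbors: vec![Vec::new(); 3],
    };
    g.add_edge(0, 1);
    g.add_edge(0, 2);
    assert_eq!(g.get_degree(0), 2);
    assert_eq!(g.get_neighbors(1), &[0][..]);
    assert_eq!(g.embeddings.len(), 3);
}
```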

---

## 8. Complexity Summary

### 8.1 Attention Mechanisms

| Mechanism | Time Complexity | Space Complexity | Notes |
|-----------|----------------|------------------|-------|
| Scaled Dot-Product | O(n·d²) | O(n) | Standard attention |
| Multi-Head (h heads) | O(n·d²) | O(h·d) | h parallel heads of width d/h; same total cost |
| Hyperbolic | O(n·d²) | O(n) | More expensive ops |
| Sparse (Local+Global) | O((k_l + k_g)·d) | O(k_l + k_g) | k << n |
| Linear (Performer) | O(n·D·d) | O(D·d) | D = random features |
| Flash | O(n²·d) | O(n) | Better cache locality |
| Edge-Featured | O(n·(d² + d_edge·d)) | O(n) | Added edge cost |
| RoPE | O(n·d²) | O(n) | Rotation overhead minimal |
| Cross-Space | O(n_g·d² + k_l·d²) | O(n_g + k_l) | Dual attention |
| MoE (k experts) | O(k·base_complexity) | O(num_experts·model_size) | Expert routing |

**Legend**:
- n: number of neighbors/keys
- d: embedding dimension
- h: number of attention heads
- k_l, k_g: local and global neighbor counts
- D: number of random features
- d_edge: edge feature dimension

---

### 8.2 Training Operations

| Operation | Time Complexity | Space Complexity | Notes |
|-----------|----------------|------------------|-------|
| InfoNCE Loss | O((n_pos + n_neg)·d) | O(n_pos + n_neg) | Per anchor |
| Hard Negative Sampling | O(N·d) | O(k) | N = total samples |
| Spectral Regularization | O(\|E\|·d) | O(1) | E = edges |
| Curriculum Schedule | O(1) | O(num_losses) | Per epoch |
| Multi-Objective Loss | O(num_losses) | O(1) | Weighted sum |

---

## 9. Implementation Notes

### 9.1 Numerical Stability

**Softmax Stability**:
```
// Always subtract the max before exp
max_score ← Max(scores)
exp_scores[i] ← exp(scores[i] - max_score)
```
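A runnable rendering of the max-subtraction trick in Rust, exercised on scores large enough to overflow a naive `exp`:

```rust
// Numerically stable softmax: subtracting the max keeps every
// exponent ≤ 0, so exp never overflows.
fn stable_softmax(scores: &[f64]) -> Vec<f64> {
    let max = scores.iter().cloned().fold(f64::NEG_INFINITY, f64::max);
    let exps: Vec<f64> = scores.iter().map(|s| (s - max).exp()).collect();
    let sum: f64 = exps.iter().sum();
    exps.iter().map(|e| e / sum).collect()
}

fn main() {
    // Naive exp(1000.0) would be +inf; the stable form stays finite.
    let probs = stable_softmax(&[1000.0, 999.0, 998.0]);
    assert!(probs.iter().all(|p| p.is_finite()));
    assert!((probs.iter().sum::<f64>() - 1.0).abs() < 1e-12);
}
```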

**Hyperbolic Boundary**:
```
// Ensure points stay in the Poincaré ball
IF ||x|| >= 1.0 THEN
    x ← 0.99 * x / ||x||  // Project back with a small margin
END IF
```
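The corresponding projection, sketched with the margin as a parameter; the pseudocode's fixed 0.99 factor corresponds to `margin = 0.01`:

```rust
// Sketch of the Poincaré-ball projection: pull points with norm ≥ 1
// back inside the unit ball, leaving a small margin.
fn project_to_ball(x: &mut [f64], margin: f64) {
    let norm = x.iter().map(|v| v * v).sum::<f64>().sqrt();
    if norm >= 1.0 {
        let scale = (1.0 - margin) / norm;
        for v in x.iter_mut() {
            *v *= scale;
        }
    }
}

fn main() {
    let mut x = vec![3.0, 4.0]; // norm 5, far outside the ball
    project_to_ball(&mut x, 0.01);
    let norm = x.iter().map(|v| v * v).sum::<f64>().sqrt();
    assert!(norm < 1.0);
}
```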

**Division by Zero**:
```
// Add epsilon to denominators
result ← numerator / (denominator + 1e-10)
```

---

### 9.2 Performance Optimization

**Vectorization**:
- Use SIMD operations for dot products
- Batch matrix multiplications
- Parallelize independent attention heads

**Memory Layout**:
- Contiguous memory for cache efficiency
- Column-major layout for matrix operations
- Pre-allocate buffers

**Lazy Computation**:
- Only compute attention weights when needed
- Cache frequently accessed embeddings
- Prune low-weight attention connections

---

### 9.3 Testing Strategies

**Unit Tests**:
```
TEST: ScaledDotProductAttention
    INPUT: Known query, keys, values
    EXPECTED: Hand-computed output
    VERIFY: Output matches expected within tolerance

TEST: Softmax Numerical Stability
    INPUT: Very large scores [1000, 999, 998]
    VERIFY: No NaN or Inf in output
    VERIFY: Probabilities sum to 1.0

TEST: Hyperbolic Boundary
    INPUT: Points near the ball boundary (||x|| = 0.99)
    VERIFY: Result still in the ball (||result|| < 1.0)
```

**Integration Tests**:
```
TEST: End-to-End Attention Pipeline
    INPUT: Real graph structure
    VERIFY: All mechanisms produce valid outputs
    VERIFY: Outputs are differentiable
```

**Performance Tests**:
```
BENCHMARK: Attention Complexity
    INPUT: Varying n = [10, 100, 1000, 10000]
    MEASURE: Time and memory usage
    VERIFY: Matches theoretical complexity
```

---

## 10. References

### 10.1 Core Papers

1. **Attention Mechanism**: Vaswani et al. (2017) - "Attention Is All You Need"
2. **GAT**: Veličković et al. (2018) - "Graph Attention Networks"
3. **Hyperbolic GNNs**: Chami et al. (2019) - "Hyperbolic Graph Convolutional Neural Networks"
4. **Performer**: Choromanski et al. (2020) - "Rethinking Attention with Performers"
5. **Flash Attention**: Dao et al. (2022) - "FlashAttention: Fast and Memory-Efficient Exact Attention"
6. **RoPE**: Su et al. (2021) - "RoFormer: Enhanced Transformer with Rotary Position Embedding"
7. **MoE**: Shazeer et al. (2017) - "Outrageously Large Neural Networks"

### 10.2 Mathematical Background

- **Hyperbolic Geometry**: Cannon et al. (1997) - "Hyperbolic Geometry"
- **Graph Laplacian**: Chung (1997) - "Spectral Graph Theory"
- **Contrastive Learning**: Chen et al. (2020) - "A Simple Framework for Contrastive Learning"

---

## 11. Glossary

- **Attention**: Mechanism to weight the importance of different inputs
- **Multi-Head**: Parallel attention with different learned projections
- **Hyperbolic Space**: Non-Euclidean geometry with constant negative curvature
- **Poincaré Ball**: Conformal model of hyperbolic space in the unit ball
- **Möbius Addition**: Hyperbolic vector addition operation
- **Sparse Attention**: Attention over a subset of inputs (not all pairs)
- **Linear Attention**: O(n) complexity via kernel approximation
- **Flash Attention**: Memory-efficient tiled attention computation
- **RoPE**: Rotary Position Embedding for distance encoding
- **Cross-Attention**: Attention between two different spaces
- **MoE**: Mixture of Experts, routing to specialized sub-models
- **InfoNCE**: Noise Contrastive Estimation loss for contrastive learning
- **Hard Negatives**: Difficult negative samples close to positives
- **Curriculum Learning**: Gradually increasing task difficulty
- **Spectral Regularization**: Graph smoothness via the Laplacian

---

**Document Version**: 1.0
**Last Updated**: 2025-11-30
**Author**: RuVector Research Team
**SPARC Phase**: Pseudocode (Phase 2)
**Next Phase**: Architecture (Phase 3) - See `04-architecture.md`

---

## Appendix A: Quick Reference

### Common Subroutines

```
DotProduct(x, y) → scalar
L2Norm(x) → scalar
L2NormSquared(x) → scalar
Softmax(scores) → probabilities
CosineSimilarity(x, y) → similarity ∈ [-1, 1]
Scale(x, scalar) → scaled_vector
Add(x, y) → sum_vector
Subtract(x, y) → diff_vector
Concatenate(vectors...) → concatenated_vector
ZeroVector(d) → zero-initialized vector
ZeroMatrix(rows, cols) → zero-initialized matrix
```

### Complexity Quick Reference

```
O(1)     - Constant time
O(d)     - Linear in dimension
O(n)     - Linear in number of items
O(n·d)   - Linear in both
O(n²)    - Quadratic (standard full attention)
O(n·d²)  - Attention complexity
O(|E|)   - Linear in number of edges
```

---