Geometric Foundations of Hyperbolic Attention
Mathematical Prerequisites
This document provides rigorous mathematical foundations for implementing hyperbolic attention mechanisms with provable geometric properties.
Table of Contents
- Hyperbolic Geometry Basics
- Poincaré Ball Model
- Lorentz (Hyperboloid) Model
- Isometries and Transformations
- Hyperbolic Neural Operations
- Attention Mechanisms in Hyperbolic Space
- Curvature Adaptation
- Numerical Stability
- Complexity Analysis
Hyperbolic Geometry Basics
Definition
Hyperbolic space ℍⁿ is a complete, simply-connected Riemannian manifold of constant negative curvature κ < 0.
Key Properties:
- Exponential volume growth: Volume of ball of radius r grows as ~exp(r√|κ|)
- Unique geodesics: Any two points connected by unique shortest path
- Thin triangles: angles of a geodesic triangle sum to < π (vs = π in Euclidean)
- Tree embedding: Finite trees embed with arbitrarily low distortion in ℍ²
Curvature Parameter
Define curvature radius K > 0 such that κ = -1/K².
Normalization:
- κ = -1: Unit hyperbolic space (mathematical convention)
- κ = -1/K²: Learnable curvature (K is learned parameter)
Models of Hyperbolic Space
Five isometric models:
- Poincaré ball: {x ∈ ℝⁿ : ||x|| < 1}
- Lorentz (hyperboloid): {x ∈ ℝⁿ⁺¹ : ⟨x,x⟩_L = -1, x₀ > 0}
- Poincaré half-space: {x ∈ ℝⁿ : xₙ > 0}
- Klein disk: {x ∈ ℝⁿ : ||x|| < 1}
- Hemisphere
We focus on Poincaré ball (intuitive) and Lorentz (stable).
Poincaré Ball Model
Metric
Riemannian metric:
ds² = 4 / (1 - ||x||²/K²)² · ||dx||²
Distance between points x, y:
d_P(x, y) = K · arcosh(1 + (2||x - y||²/K²) / ((1 - ||x||²/K²)(1 - ||y||²/K²)))
Simplified formula (numerically stable):
d_P(x, y) = 2K · artanh(||(-x) ⊕_K y|| / K)
Möbius Gyrovector Operations
Möbius Addition (generalized):
x ⊕_K y = ((1 + 2⟨x,y⟩/K² + ||y||²/K²)x + (1 - ||x||²/K²)y) /
(1 + 2⟨x,y⟩/K² + ||x||²||y||²/K⁴)
Special case (K = 1):
x ⊕ y = ((1 + 2⟨x,y⟩ + ||y||²)x + (1 - ||x||²)y) /
(1 + 2⟨x,y⟩ + ||x||²||y||²)
Properties:
- Identity: x ⊕ 0 = x
- Inverse: x ⊕ (-x) = 0 (the Möbius inverse of x is simply -x)
- Non-commutative: x ⊕ y ≠ y ⊕ x (in general)
- Non-associative: (x ⊕ y) ⊕ z ≠ x ⊕ (y ⊕ z)
Computational Complexity: O(n) for n-dimensional vectors
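As a minimal NumPy sketch of the operations above (function names `mobius_add` and `poincare_dist` are ours, not from any particular library):

```python
import numpy as np

def mobius_add(x, y, K=1.0):
    """Mobius addition x (+)_K y on the Poincare ball of radius K."""
    c = 1.0 / K**2                      # curvature magnitude, kappa = -c
    xy, x2, y2 = np.dot(x, y), np.dot(x, x), np.dot(y, y)
    num = (1 + 2*c*xy + c*y2) * x + (1 - c*x2) * y
    den = 1 + 2*c*xy + c**2 * x2 * y2
    return num / den

def poincare_dist(x, y, K=1.0):
    """Stable distance: d_P(x, y) = 2K * artanh(||(-x) (+)_K y|| / K)."""
    diff = mobius_add(-x, y, K)
    return 2 * K * np.arctanh(np.linalg.norm(diff) / K)
```

Note that the distance function is symmetric even though Möbius addition is not.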
Exponential and Logarithmic Maps
Exponential Map (tangent space → manifold):
exp_x^K(v) = x ⊕_K (K · tanh(λ_x ||v|| / (2K)) · v / ||v||)
where λ_x = 2 / (1 - ||x||²/K²) is the conformal factor, and the tangent norm is ||v||_x = λ_x ||v||
Logarithmic Map (manifold → tangent space):
log_x^K(y) = K (1 - ||x||²/K²) · artanh(||(-x) ⊕_K y|| / K) ·
             ((-x) ⊕_K y) / ||(-x) ⊕_K y||
Usage:
- exp: Apply Euclidean gradients to hyperbolic points
- log: Compute "hyperbolic difference" between points
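At the origin these maps take a particularly simple form, sketched below in NumPy (names `expmap0`/`logmap0` are ours; with λ_0 = 2 the formulas reduce to radial tanh/artanh scalings):

```python
import numpy as np

def expmap0(v, K=1.0):
    """Exponential map at the origin: tangent vector -> Poincare ball."""
    n = np.linalg.norm(v)
    if n < 1e-15:
        return np.zeros_like(v)
    return K * np.tanh(n / K) * v / n   # result always has norm < K

def logmap0(y, K=1.0):
    """Logarithmic map at the origin: Poincare ball -> tangent space."""
    n = np.linalg.norm(y)
    if n < 1e-15:
        return np.zeros_like(y)
    return K * np.arctanh(n / K) * y / n
```

The two maps are mutual inverses, which is the round-trip property hyperbolic layers rely on.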
Parallel Transport
Problem: Moving tangent vectors along geodesics while preserving inner products.
Formula (transport v from x to y, via the gyration operator):
P_{x→y}(v) = (λ_x / λ_y) · gyr[y, -x] v
where:
λ_x = 2 / (1 - ||x||²/K²) (conformal factor)
gyr[a, b] v = -(a ⊕_K b) ⊕_K (a ⊕_K (b ⊕_K v)) (gyration operator)
Lorentz (Hyperboloid) Model
Minkowski Space
Ambient space: ℝⁿ⁺¹ with Minkowski inner product:
⟨x, y⟩_L = -x₀y₀ + x₁y₁ + ... + xₙyₙ
Hyperboloid constraint:
ℍⁿ = {x ∈ ℝⁿ⁺¹ : ⟨x, x⟩_L = -K², x₀ > 0}
Distance
Formula:
d_L(x, y) = K · arcosh(-⟨x, y⟩_L / K²)
Numerically stable variant:
d_L(x, y) = K · ln(-⟨x, y⟩_L / K² + √((-⟨x, y⟩_L / K²)² - 1))
Exponential Map
Formula:
exp_x^L(v) = cosh(||v||_L / K) x + K sinh(||v||_L / K) · v / ||v||_L
where ||v||_L = √⟨v, v⟩_L (Minkowski norm; tangent vectors at x satisfy ⟨x, v⟩_L = 0 and are space-like)
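A compact NumPy sketch of the Lorentz-model primitives above (function names are ours):

```python
import numpy as np

def minkowski_dot(x, y):
    """<x, y>_L = -x0*y0 + x1*y1 + ... + xn*yn."""
    return -x[0] * y[0] + np.dot(x[1:], y[1:])

def lorentz_dist(x, y, K=1.0):
    """d_L(x, y) = K * arcosh(-<x, y>_L / K^2)."""
    z = -minkowski_dot(x, y) / K**2
    return K * np.arccosh(max(z, 1.0))   # guard against z slightly below 1

def lorentz_expmap(x, v, K=1.0):
    """exp_x(v) for a space-like tangent vector v with <x, v>_L = 0."""
    n = np.sqrt(minkowski_dot(v, v))
    if n < 1e-15:
        return x
    return np.cosh(n / K) * x + K * np.sinh(n / K) * v / n
```

The exp map is distance-preserving: the geodesic from x to exp_x(v) has length ||v||_L.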
Lorentz Transformations
Lorentz Boost (hyperbolic translation with rapidity φ along a unit space-like direction v̂):
Boost_{φ,v̂}(x₀, x_s) = (cosh(φ) x₀ + sinh(φ) ⟨v̂, x_s⟩,
                        x_s + (cosh(φ) - 1) ⟨v̂, x_s⟩ v̂ + sinh(φ) x₀ v̂)
where:
x = (x₀, x_s) splits x into time and space components
v̂ ∈ ℝⁿ is a unit vector
Lorentz Rotation (rotation in space-like plane):
R_θ(x) = x + sin(θ)(e₁e₂ᵀ - e₂e₁ᵀ)x + (cos(θ) - 1)(e₁e₁ᵀ + e₂e₂ᵀ)x
where e₁, e₂ are orthonormal space-like vectors (the time coordinate is fixed)
Isometries and Transformations
Möbius Transformations (Poincaré Ball)
General form: every isometry of the Poincaré ball decomposes as a Möbius translation followed by a rotation/reflection:
M(x) = A((-a) ⊕_K x)
for some A ∈ O(n) and a ∈ 𝔹ⁿ(K)
Special case - Translation:
T_a(x) = (-a) ⊕_K x
Gyrovector Multiplication
Scalar multiplication:
r ⊗ x = K · tanh(r · artanh(||x|| / K)) · x / ||x||
for r ∈ ℝ, x ∈ ℍⁿ
Properties:
- (r + s) ⊗ x = (r ⊗ x) ⊕ (s ⊗ x) (scalar distributivity; both sides lie on the geodesic through 0 and x)
- r ⊗ (s ⊗ x) = (rs) ⊗ x (scalar associativity)
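A NumPy sketch of gyrovector scalar multiplication (the name `mobius_scalar_mul` is ours); scalar associativity makes a convenient sanity check:

```python
import numpy as np

def mobius_scalar_mul(r, x, K=1.0):
    """r (x) x = K * tanh(r * artanh(||x||/K)) * x/||x||."""
    n = np.linalg.norm(x)
    if n < 1e-15:
        return np.zeros_like(x)
    return K * np.tanh(r * np.arctanh(n / K)) * x / n
```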
Hyperbolic Neural Operations
Hyperbolic Linear Layer
Euclidean linear layer: y = Wx + b
Hyperbolic equivalent:
y = exp_0(W · log_0(x) + b)
Steps:
- Map x from manifold to tangent space at origin: v = log_0(x)
- Apply Euclidean linear transformation: v' = Wv + b
- Map back to manifold: y = exp_0(v')
Learnable parameters: W ∈ ℝᵐˣⁿ, b ∈ ℝᵐ
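The three steps above can be sketched in NumPy as follows (a minimal illustration using tangent-space maps at the origin; names are ours):

```python
import numpy as np

def expmap0(v, K=1.0):
    n = np.linalg.norm(v)
    return v if n < 1e-15 else K * np.tanh(n / K) * v / n

def logmap0(y, K=1.0):
    n = np.linalg.norm(y)
    return y if n < 1e-15 else K * np.arctanh(n / K) * y / n

def hyp_linear(x, W, b, K=1.0):
    """Hyperbolic linear layer: exp_0(W log_0(x) + b)."""
    return expmap0(W @ logmap0(x, K) + b, K)
```

With W = I and b = 0 the layer is the identity, and any output stays strictly inside the ball because exp_0 saturates via tanh.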
Hyperbolic ReLU
Problem: ReLU is defined in tangent space, not on manifold.
Solution:
ReLU_hyp(x) = exp_0(ReLU(log_0(x)))
Component-wise variant: apply max(0, ·) to each coordinate of the tangent vector log_0(x), then map the rectified vector back to the manifold with exp_0.
Hyperbolic Batch Normalization
Challenge: Mean and variance are Euclidean concepts.
Hyperbolic mean (Fréchet mean):
μ = argmin_p Σ_i d(p, x_i)²
Approximation (geodesic midpoint):
μ ≈ exp_0(mean(log_0(x_1), ..., log_0(x_n)))
Normalization (log_μ(x) is already centered at μ, since log_μ(μ) = 0):
x_norm = exp_μ(log_μ(x) / σ_tangent)
Attention Mechanisms in Hyperbolic Space
Hyperbolic Dot-Product Attention
Euclidean attention:
Attention(Q, K, V) = softmax(QKᵀ / √d) V
Hyperbolic variant:
Attention_hyp(Q, K, V)_i = ⊕_j (softmax_j(-d(Q_i, K_j)² / τ) ⊗ V_j)
Components:
- Similarity: -d(q, k)² (negative squared distance)
- Normalization: softmax with temperature τ
- Aggregation: Möbius weighted sum
Complexity: O(n²d) for n tokens, d dimensions (same as Euclidean)
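A small NumPy sketch of this mechanism (names are ours; the Möbius weighted sum is approximated here by averaging in the tangent space at the origin, a common simplification rather than the exact gyromidpoint):

```python
import numpy as np

def mobius_add(x, y, K=1.0):
    c = 1.0 / K**2
    xy, x2, y2 = x @ y, x @ x, y @ y
    num = (1 + 2*c*xy + c*y2) * x + (1 - c*x2) * y
    return num / (1 + 2*c*xy + c**2 * x2 * y2)

def dist(x, y, K=1.0):
    return 2 * K * np.arctanh(np.linalg.norm(mobius_add(-x, y, K)) / K)

def expmap0(v, K=1.0):
    n = np.linalg.norm(v)
    return v if n < 1e-15 else K * np.tanh(n / K) * v / n

def logmap0(y, K=1.0):
    n = np.linalg.norm(y)
    return y if n < 1e-15 else K * np.arctanh(n / K) * y / n

def hyp_attention(Q, keys, V, tau=1.0, K=1.0):
    """Negative-squared-distance scores, softmax, tangent-space aggregation."""
    out = []
    for q in Q:
        scores = np.array([-dist(q, k, K)**2 / tau for k in keys])
        w = np.exp(scores - scores.max())
        w /= w.sum()
        out.append(expmap0(sum(wi * logmap0(v, K) for wi, v in zip(w, V)), K))
    return np.array(out)
```

With a single key/value pair the softmax weight is 1 and the output reduces to that value, which makes a simple correctness check.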
Hyperbolic Linear Attention (Hypformer)
Problem: Quadratic complexity O(n²)
Solution: Kernel approximation
φ(q)ᵀ φ(k) ≈ sim(q, k), a kernel feature map approximating the distance-based similarity (e.g., exp(-d_hyp(q, k)² / τ))
Linear attention:
Attention_linear(Q, K, V) = (Σ_j φ(K_j)⊗V_j) ⊘ (Σ_j φ(K_j))
Hyperbolic kernel (proposal):
φ_hyp(x) = [cosh(||x||/K), sinh(||x||/K) · x/||x||]
Properties:
⟨φ_hyp(x), φ_hyp(y)⟩_L ≈ -cosh(d(x,y)/K)
Complexity: O(nd²) vs O(n²d)
Speedup: ~10x for n > 10d, as reported by Hypformer (KDD 2024)
Multi-Head Hyperbolic Attention
Extension:
MultiHead(Q, K, V) = Concat(head₁, ..., headₕ) W^O
where head_i = Attention_hyp(QW_i^Q, KW_i^K, VW_i^V)
Learnable per-head curvature:
head_i operates in space with curvature κ_i
Rationale: Different heads capture different hierarchical depths.
Curvature Adaptation
Learnable Curvature
Parameterization: K ∈ ℝ⁺ (learned via gradient descent)
Gradient w.r.t. curvature:
∂L/∂K = ∂L/∂d · ∂d/∂K
where:
∂d/∂K = ∂/∂K[K · arcosh(1 + (2||x-y||²/K²)/((1-||x||²/K²)(1-||y||²/K²)))]
Numerical trick: Reparameterize as K = exp(k) to ensure K > 0.
Coupled Optimization
Problem: Naively updating K breaks Riemannian optimizer assumptions.
Solution (from "Optimizing Curvature Learning" 2024):
1. Compute gradients in current manifold (curvature K_old)
2. Update parameters: θ_new = RiemannianSGD(θ, ∇_θ L, K_old)
3. Update curvature: K_new = K_old - α · ∂L/∂K
4. Rescale parameters to new manifold:
θ_rescaled = rescale_curvature(θ_new, K_old, K_new)
Rescaling formula (Poincaré ball):
rescale(x, K₁, K₂) = (K₂ / K₁) · x
Multi-Curvature Embeddings
Approach: Different dimensions/layers have different curvatures.
Product space:
ℍ^{n₁}(κ₁) × ℍ^{n₂}(κ₂) × ... × ℍ^{nₖ}(κₖ)
Distance:
d_product((x₁,...,xₖ), (y₁,...,yₖ)) = √(Σ_i w_i² d²(x_i, y_i))
where w_i are learnable weights
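A sketch of the product-space distance, assuming per-factor distance callables (the name `product_dist` is illustrative):

```python
import numpy as np

def product_dist(xs, ys, factor_dists, weights):
    """Weighted product-manifold distance:
    sqrt(sum_i w_i^2 * d_i(x_i, y_i)^2) over the factor manifolds."""
    return np.sqrt(sum(w**2 * d(x, y)**2
                       for w, d, x, y in zip(weights, factor_dists, xs, ys)))
```

Each factor can carry its own curvature; plugging in Euclidean factor distances recovers the ordinary Euclidean distance on the concatenated coordinates.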
Numerical Stability
Poincaré Ball Instabilities
Problem 1: Division by zero when ||x|| → 1
Solution: Clip to maximum norm
x_safe = x · min(1, K(1 - ε) / ||x||)
where ε = 1e-5, so that ||x_safe|| ≤ K(1 - ε) stays strictly inside the ball
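A one-liner sketch of this clipping in NumPy (the name `clip_to_ball` is ours):

```python
import numpy as np

def clip_to_ball(x, K=1.0, eps=1e-5):
    """Rescale x so that ||x|| <= K*(1 - eps); interior points pass through."""
    max_norm = K * (1 - eps)
    n = np.linalg.norm(x)
    return x if n <= max_norm else x * (max_norm / n)
```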
Problem 2: Möbius addition overflow
Solution: Rewrite using log1p, expm1
Instead of: (1 + 2⟨x,y⟩ + ||y||²) / (1 + 2⟨x,y⟩ + ||x||²||y||²)
Use: exp(log1p(2⟨x,y⟩ + ||y||²) - log1p(2⟨x,y⟩ + ||x||²||y||²))
Lorentz Model Stability
Advantage: No boundary singularities!
Constraint enforcement:
After each update, project back to hyperboloid:
x₀ = √(K² + x₁² + ... + xₙ²)
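The projection above amounts to recomputing the time coordinate from the spatial ones, sketched here in NumPy (the name `project_to_hyperboloid` is ours):

```python
import numpy as np

def project_to_hyperboloid(x, K=1.0):
    """Restore <x, x>_L = -K^2 (with x0 > 0) after a numerical update."""
    y = np.asarray(x, dtype=float).copy()
    y[0] = np.sqrt(K**2 + np.dot(y[1:], y[1:]))
    return y
```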
Geodesic computation (stable):
d_L(x, y) = K · log((-⟨x,y⟩ + √(⟨x,y⟩² - K⁴)) / K²)
Mixed Precision
Strategy:
- FP16 for forward pass (speed)
- FP32 for gradients (stability)
- FP64 for curvature updates (critical)
GeoOpt recommendation: Use FP32 minimum for hyperbolic operations.
Complexity Analysis
Space Complexity
Poincaré Ball:
- Point: O(n) storage (same as Euclidean)
- No auxiliary structures needed
Lorentz:
- Point: O(n+1) storage (extra time dimension)
- Constraint: ⟨x,x⟩_L = -K²
Curvature:
- Shared K: O(1) extra parameter
- Per-layer K: O(L) for L layers
- Per-dimension K: O(n) parameters
Time Complexity
| Operation | Euclidean | Poincaré | Lorentz |
|---|---|---|---|
| Distance | O(n) | O(n) | O(n) |
| Addition | O(n) | O(n) | O(n) |
| Exp/Log | - | O(n) | O(n) |
| Linear layer | O(n²) | O(n²) | O(n²) |
| Attention | O(n²d) | O(n²d) | O(n²d) |
| Linear attention | O(nd²) | O(nd²) | O(nd²) |
Key Insight: Asymptotic complexity same as Euclidean!
Constants: Hyperbolic ops are typically 2-5x slower (more FLOPs per operation)
SIMD Optimization: Vectorization (e.g., AVX2) can recover much of this gap; actual speedups depend heavily on hardware, dimension, and batch size.
Proofs of Key Properties
Theorem 1: Möbius Addition Preserves Poincaré Ball
Statement: If x, y ∈ 𝔹ⁿ(K) (Poincaré ball), then x ⊕_K y ∈ 𝔹ⁿ(K).
Proof sketch:
Let a = ||x|| / K, b = ||y|| / K, c = ⟨x, y⟩ / K², with a, b < 1.
By the triangle inequality applied to the numerator,
||x ⊕_K y||² / K² ≤ ((1 + 2c + b²)a + (1 - a²)b)² / (1 + 2c + a²b²)²
and expanding shows the right-hand side is < 1 whenever a, b < 1.
Theorem 2: Exponential Map is Diffeomorphism
Statement: exp_x: T_xℍⁿ → ℍⁿ is a diffeomorphism for each x.
Proof sketch:
- ℍⁿ is complete, simply connected, and has non-positive sectional curvature
- By the Cartan-Hadamard theorem, exp_x is therefore a global diffeomorphism for every x
- Its smooth inverse is log_x
Theorem 3: Capacity Advantage
Statement: Any n-node tree embeds in ℍ² with distortion 1 + ε for arbitrary ε > 0 (Sarkar's construction), while low-distortion embeddings into ℝᵏ require dimension k that grows with n.
Proof Sketch:
- Hyperbolic plane has exponential volume: V(r) ~ exp(r)
- Trees have exponential node count: N(depth d) ~ exp(d)
- Volume growth matches tree growth → O(1) average distortion
- Euclidean plane has polynomial volume: V(r) ~ r²
- Trees cannot fit without stretching → Ω(√n) average distortion
Implementation Checklist
Poincaré Ball Implementation
- Möbius addition with curvature K
- Exponential map with numerical stability
- Logarithmic map with safe arctanh
- Distance function with clipping
- Parallel transport
- Gradient clipping to prevent boundary
Lorentz Model Implementation
- Minkowski inner product
- Hyperboloid constraint projection
- Exponential map
- Distance function
- Lorentz boost and rotation
- Conversion to/from Poincaré
Hyperbolic Attention
- Hyperbolic query/key/value projections
- Distance-based similarity
- Softmax with temperature
- Möbius weighted aggregation
- Linear attention kernel approximation
Learnable Curvature
- Curvature parameter K with positive constraint
- Gradient computation w.r.t. K
- Coupled optimization with rescaling
- Per-layer or per-head curvature
SIMD Optimizations
- Vectorized Möbius addition (AVX2)
- Batch distance computation
- Fused exp/log operations
- Cache-aligned memory layout
References
Textbooks:
- "Riemannian Geometry" - do Carmo
- "Foundations of Hyperbolic Manifolds" - Ratcliffe
Papers:
- Ganea et al., "Hyperbolic Neural Networks" (NeurIPS 2018)
- Hypformer (KDD 2024) - Linear attention formulation
- Fully Hyperbolic NNs (ACL 2022) - Lorentz model analysis
Software:
- GeoOpt: PyTorch library for Riemannian optimization
- Hyperbolic Image Embeddings: Reference implementation
Conclusion
Hyperbolic geometry provides a mathematically rigorous framework for hierarchical neural representations with:
- Provable capacity: volume grows exponentially with radius (vs polynomially in Euclidean space)
- Stable operations: the Lorentz model avoids the Poincaré ball's boundary singularities
- Efficient algorithms: O(n²d) attention same as Euclidean
- Learnable curvature: Adapt to data hierarchy
All operations have closed-form solutions and computable gradients, making them suitable for modern automatic differentiation frameworks.