Geometric Foundations of Hyperbolic Attention
Mathematical Prerequisites
This document provides rigorous mathematical foundations for implementing hyperbolic attention mechanisms with provable geometric properties.
Table of Contents
- Hyperbolic Geometry Basics
- Poincaré Ball Model
- Lorentz (Hyperboloid) Model
- Isometries and Transformations
- Hyperbolic Neural Operations
- Attention Mechanisms in Hyperbolic Space
- Curvature Adaptation
- Numerical Stability
- Complexity Analysis
Hyperbolic Geometry Basics
Definition
Hyperbolic space ℍⁿ is a complete, simply-connected Riemannian manifold of constant negative curvature κ < 0.
Key Properties:
- Exponential volume growth: Volume of ball of radius r grows as ~exp(r√|κ|)
- Unique geodesics: Any two points connected by unique shortest path
- Thin triangles: angles of a geodesic triangle sum to < π (vs = π in Euclidean)
- Tree embedding: Finite trees embed with arbitrarily low distortion in ℍ²
Curvature Parameter
Define curvature radius K > 0 such that κ = -1/K².
Normalization:
- κ = -1: Unit hyperbolic space (mathematical convention)
- κ = -1/K²: Learnable curvature (K is learned parameter)
Models of Hyperbolic Space
Five isometric models:
- Poincaré ball: {x ∈ ℝⁿ : ||x|| < 1}
- Lorentz (hyperboloid): {x ∈ ℝⁿ⁺¹ : ⟨x,x⟩_L = -1, x₀ > 0}
- Poincaré half-space: {x ∈ ℝⁿ : xₙ > 0}
- Klein disk: {x ∈ ℝⁿ : ||x|| < 1}
- Hemisphere
We focus on Poincaré ball (intuitive) and Lorentz (stable).
Poincaré Ball Model
Metric
Riemannian metric:
ds² = 4 / (1 - ||x||²/K²)² · ||dx||²
Distance between points x, y:
d_P(x, y) = K · arcosh(1 + (2||x - y||²/K²) / ((1 - ||x||²/K²)(1 - ||y||²/K²)))
Simplified formula (numerically stable):
d_P(x, y) = 2K · artanh(||(-x) ⊕_K y|| / K)
Möbius Gyrovector Operations
Möbius Addition (generalized):
x ⊕_K y = ((1 + 2⟨x,y⟩/K² + ||y||²/K²)x + (1 - ||x||²/K²)y) /
(1 + 2⟨x,y⟩/K² + ||x||²||y||²/K⁴)
Special case (K = 1):
x ⊕ y = ((1 + 2⟨x,y⟩ + ||y||²)x + (1 - ||x||²)y) /
(1 + 2⟨x,y⟩ + ||x||²||y||²)
Properties:
- Identity: x ⊕ 0 = x
- Inverse: x ⊕ (-x) = 0 (the Möbius inverse of x is simply -x)
- Non-commutative: x ⊕ y ≠ y ⊕ x (in general)
- Non-associative: (x ⊕ y) ⊕ z ≠ x ⊕ (y ⊕ z)
Computational Complexity: O(n) for n-dimensional vectors
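As a minimal NumPy sketch of the operations above (function names `mobius_add` and `poincare_dist` are ours, not from any particular library):

```python
import numpy as np

def mobius_add(x, y, K=1.0):
    """Mobius addition x (+)_K y on the Poincare ball of radius K."""
    c = 1.0 / K**2                      # curvature magnitude, kappa = -c
    xy, x2, y2 = np.dot(x, y), np.dot(x, x), np.dot(y, y)
    num = (1 + 2*c*xy + c*y2) * x + (1 - c*x2) * y
    den = 1 + 2*c*xy + c**2 * x2 * y2
    return num / den

def poincare_dist(x, y, K=1.0):
    """Stable distance: d_P(x, y) = 2K * artanh(||(-x) (+)_K y|| / K)."""
    diff = mobius_add(-x, y, K)
    return 2 * K * np.arctanh(np.linalg.norm(diff) / K)
```

Note that the distance function is symmetric even though Möbius addition is not.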
Exponential and Logarithmic Maps
Exponential Map (tangent space → manifold):
exp_x^K(v) = x ⊕_K (K · tanh(λ_x ||v|| / (2K)) · v / ||v||)
where λ_x = 2 / (1 - ||x||²/K²) is the conformal factor, and the tangent norm is ||v||_x = λ_x ||v||
Logarithmic Map (manifold → tangent space):
log_x^K(y) = K (1 - ||x||²/K²) · artanh(||(-x) ⊕_K y|| / K) ·
             ((-x) ⊕_K y) / ||(-x) ⊕_K y||
Usage:
- exp: Apply Euclidean gradients to hyperbolic points
- log: Compute "hyperbolic difference" between points
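At the origin these maps take a particularly simple form, sketched below in NumPy (names `expmap0`/`logmap0` are ours; with λ_0 = 2 the formulas reduce to radial tanh/artanh scalings):

```python
import numpy as np

def expmap0(v, K=1.0):
    """Exponential map at the origin: tangent vector -> Poincare ball."""
    n = np.linalg.norm(v)
    if n < 1e-15:
        return np.zeros_like(v)
    return K * np.tanh(n / K) * v / n   # result always has norm < K

def logmap0(y, K=1.0):
    """Logarithmic map at the origin: Poincare ball -> tangent space."""
    n = np.linalg.norm(y)
    if n < 1e-15:
        return np.zeros_like(y)
    return K * np.arctanh(n / K) * y / n
```

The two maps are mutual inverses, which is the round-trip property hyperbolic layers rely on.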
Parallel Transport
Problem: Moving tangent vectors along geodesics while preserving inner products.
Formula (transport v from x to y, via the gyration operator):
P_{x→y}(v) = (λ_x / λ_y) · gyr[y, -x] v
where:
λ_x = 2 / (1 - ||x||²/K²) (conformal factor)
gyr[a, b] v = -(a ⊕_K b) ⊕_K (a ⊕_K (b ⊕_K v)) (gyration operator)
Lorentz (Hyperboloid) Model
Minkowski Space
Ambient space: ℝⁿ⁺¹ with Minkowski inner product:
⟨x, y⟩_L = -x₀y₀ + x₁y₁ + ... + xₙyₙ
Hyperboloid constraint:
ℍⁿ = {x ∈ ℝⁿ⁺¹ : ⟨x, x⟩_L = -K², x₀ > 0}
Distance
Formula:
d_L(x, y) = K · arcosh(-⟨x, y⟩_L / K²)
Numerically stable variant:
d_L(x, y) = K · ln(-⟨x, y⟩_L / K² + √((-⟨x, y⟩_L / K²)² - 1))
Exponential Map
Formula:
exp_x^L(v) = cosh(||v||_L / K) x + K sinh(||v||_L / K) · v / ||v||_L
where ||v||_L = √⟨v, v⟩_L (Minkowski norm; tangent vectors at x satisfy ⟨x, v⟩_L = 0 and are space-like)
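A compact NumPy sketch of the Lorentz-model primitives above (function names are ours):

```python
import numpy as np

def minkowski_dot(x, y):
    """<x, y>_L = -x0*y0 + x1*y1 + ... + xn*yn."""
    return -x[0] * y[0] + np.dot(x[1:], y[1:])

def lorentz_dist(x, y, K=1.0):
    """d_L(x, y) = K * arcosh(-<x, y>_L / K^2)."""
    z = -minkowski_dot(x, y) / K**2
    return K * np.arccosh(max(z, 1.0))   # guard against z slightly below 1

def lorentz_expmap(x, v, K=1.0):
    """exp_x(v) for a space-like tangent vector v with <x, v>_L = 0."""
    n = np.sqrt(minkowski_dot(v, v))
    if n < 1e-15:
        return x
    return np.cosh(n / K) * x + K * np.sinh(n / K) * v / n
```

The exp map is distance-preserving: the geodesic from x to exp_x(v) has length ||v||_L.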
Lorentz Transformations
Lorentz Boost (hyperbolic translation with rapidity φ along a unit space-like direction v̂):
Boost_{φ,v̂}(x₀, x_s) = (cosh(φ) x₀ + sinh(φ) ⟨v̂, x_s⟩,
                        x_s + (cosh(φ) - 1) ⟨v̂, x_s⟩ v̂ + sinh(φ) x₀ v̂)
where:
x = (x₀, x_s) splits x into time and space components
v̂ ∈ ℝⁿ is a unit vector
Lorentz Rotation (rotation in space-like plane):
R_θ(x) = x + sin(θ)(e₁e₂ᵀ - e₂e₁ᵀ)x + (cos(θ) - 1)(e₁e₁ᵀ + e₂e₂ᵀ)x
where e₁, e₂ are orthonormal space-like vectors (the time coordinate is fixed)
Isometries and Transformations
Möbius Transformations (Poincaré Ball)
General form: every isometry of the Poincaré ball decomposes as a Möbius translation followed by a rotation/reflection:
M(x) = A((-a) ⊕_K x)
for some A ∈ O(n) and a ∈ 𝔹ⁿ(K)
Special case - Translation:
T_a(x) = (-a) ⊕_K x
Gyrovector Multiplication
Scalar multiplication:
r ⊗ x = K · tanh(r · artanh(||x|| / K)) · x / ||x||
for r ∈ ℝ, x ∈ ℍⁿ
Properties:
- (r + s) ⊗ x = (r ⊗ x) ⊕ (s ⊗ x) (scalar distributivity; both sides lie on the geodesic through 0 and x)
- r ⊗ (s ⊗ x) = (rs) ⊗ x (scalar associativity)
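A NumPy sketch of gyrovector scalar multiplication (the name `mobius_scalar_mul` is ours); scalar associativity makes a convenient sanity check:

```python
import numpy as np

def mobius_scalar_mul(r, x, K=1.0):
    """r (x) x = K * tanh(r * artanh(||x||/K)) * x/||x||."""
    n = np.linalg.norm(x)
    if n < 1e-15:
        return np.zeros_like(x)
    return K * np.tanh(r * np.arctanh(n / K)) * x / n
```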
Hyperbolic Neural Operations
Hyperbolic Linear Layer
Euclidean linear layer: y = Wx + b
Hyperbolic equivalent:
y = exp_0(W · log_0(x) + b)
Steps:
- Map x from manifold to tangent space at origin: v = log_0(x)
- Apply Euclidean linear transformation: v' = Wv + b
- Map back to manifold: y = exp_0(v')
Learnable parameters: W ∈ ℝᵐˣⁿ, b ∈ ℝᵐ
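The three steps above can be sketched in NumPy as follows (a minimal illustration using tangent-space maps at the origin; names are ours):

```python
import numpy as np

def expmap0(v, K=1.0):
    n = np.linalg.norm(v)
    return v if n < 1e-15 else K * np.tanh(n / K) * v / n

def logmap0(y, K=1.0):
    n = np.linalg.norm(y)
    return y if n < 1e-15 else K * np.arctanh(n / K) * y / n

def hyp_linear(x, W, b, K=1.0):
    """Hyperbolic linear layer: exp_0(W log_0(x) + b)."""
    return expmap0(W @ logmap0(x, K) + b, K)
```

With W = I and b = 0 the layer is the identity, and any output stays strictly inside the ball because exp_0 saturates via tanh.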
Hyperbolic ReLU
Problem: ReLU is defined in tangent space, not on manifold.
Solution:
ReLU_hyp(x) = exp_0(ReLU(log_0(x)))
Component-wise variant: apply max(0, ·) to each coordinate of the tangent vector log_0(x), then map the rectified vector back to the manifold with exp_0.
Hyperbolic Batch Normalization
Challenge: Mean and variance are Euclidean concepts.
Hyperbolic mean (Fréchet mean):
μ = argmin_p Σ_i d(p, x_i)²
Approximation (geodesic midpoint):
μ ≈ exp_0(mean(log_0(x_1), ..., log_0(x_n)))
Normalization (log_μ(x) is already centered at μ, since log_μ(μ) = 0):
x_norm = exp_μ(log_μ(x) / σ_tangent)
Attention Mechanisms in Hyperbolic Space
Hyperbolic Dot-Product Attention
Euclidean attention:
Attention(Q, K, V) = softmax(QKᵀ / √d) V
Hyperbolic variant:
Attention_hyp(Q, K, V)_i = ⊕_j (softmax_j(-d(Q_i, K_j)² / τ) ⊗ V_j)
Components:
- Similarity: -d(q, k)² (negative squared distance)
- Normalization: softmax with temperature τ
- Aggregation: Möbius weighted sum
Complexity: O(n²d) for n tokens, d dimensions (same as Euclidean)
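A small NumPy sketch of this mechanism (names are ours; the Möbius weighted sum is approximated here by averaging in the tangent space at the origin, a common simplification rather than the exact gyromidpoint):

```python
import numpy as np

def mobius_add(x, y, K=1.0):
    c = 1.0 / K**2
    xy, x2, y2 = x @ y, x @ x, y @ y
    num = (1 + 2*c*xy + c*y2) * x + (1 - c*x2) * y
    return num / (1 + 2*c*xy + c**2 * x2 * y2)

def dist(x, y, K=1.0):
    return 2 * K * np.arctanh(np.linalg.norm(mobius_add(-x, y, K)) / K)

def expmap0(v, K=1.0):
    n = np.linalg.norm(v)
    return v if n < 1e-15 else K * np.tanh(n / K) * v / n

def logmap0(y, K=1.0):
    n = np.linalg.norm(y)
    return y if n < 1e-15 else K * np.arctanh(n / K) * y / n

def hyp_attention(Q, keys, V, tau=1.0, K=1.0):
    """Negative-squared-distance scores, softmax, tangent-space aggregation."""
    out = []
    for q in Q:
        scores = np.array([-dist(q, k, K)**2 / tau for k in keys])
        w = np.exp(scores - scores.max())
        w /= w.sum()
        out.append(expmap0(sum(wi * logmap0(v, K) for wi, v in zip(w, V)), K))
    return np.array(out)
```

With a single key/value pair the softmax weight is 1 and the output reduces to that value, which makes a simple correctness check.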
Hyperbolic Linear Attention (Hypformer)
Problem: Quadratic complexity O(n²)
Solution: Kernel approximation
φ(q)ᵀ φ(k) ≈ sim(q, k), a kernel feature map approximating the distance-based similarity (e.g., exp(-d_hyp(q, k)² / τ))
Linear attention:
Attention_linear(Q, K, V) = (Σ_j φ(K_j)⊗V_j) ⊘ (Σ_j φ(K_j))
Hyperbolic kernel (proposal):
φ_hyp(x) = [cosh(||x||/K), sinh(||x||/K) · x/||x||]
Properties:
⟨φ_hyp(x), φ_hyp(y)⟩_L ≈ -cosh(d(x,y)/K)
Complexity: O(nd²) vs O(n²d)
Speedup: ~10x for n > 10d, as reported by Hypformer (KDD 2024)
Multi-Head Hyperbolic Attention
Extension:
MultiHead(Q, K, V) = Concat(head₁, ..., headₕ) W^O
where head_i = Attention_hyp(QW_i^Q, KW_i^K, VW_i^V)
Learnable per-head curvature:
head_i operates in space with curvature κ_i
Rationale: Different heads capture different hierarchical depths.
Curvature Adaptation
Learnable Curvature
Parameterization: K ∈ ℝ⁺ (learned via gradient descent)
Gradient w.r.t. curvature:
∂L/∂K = ∂L/∂d · ∂d/∂K
where:
∂d/∂K = ∂/∂K[K · arcosh(1 + (2||x-y||²/K²)/((1-||x||²/K²)(1-||y||²/K²)))]
Numerical trick: Reparameterize as K = exp(k) to ensure K > 0.
Coupled Optimization
Problem: Naively updating K breaks Riemannian optimizer assumptions.
Solution (from "Optimizing Curvature Learning" 2024):
1. Compute gradients in current manifold (curvature K_old)
2. Update parameters: θ_new = RiemannianSGD(θ, ∇_θ L, K_old)
3. Update curvature: K_new = K_old - α · ∂L/∂K
4. Rescale parameters to new manifold:
θ_rescaled = rescale_curvature(θ_new, K_old, K_new)
Rescaling formula (Poincaré ball):
rescale(x, K₁, K₂) = (K₂ / K₁) · x
Multi-Curvature Embeddings
Approach: Different dimensions/layers have different curvatures.
Product space:
ℍ^{n₁}(κ₁) × ℍ^{n₂}(κ₂) × ... × ℍ^{nₖ}(κₖ)
Distance:
d_product((x₁,...,xₖ), (y₁,...,yₖ)) = √(Σ_i w_i² d²(x_i, y_i))
where w_i are learnable weights
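A sketch of the product-space distance, assuming per-factor distance callables (the name `product_dist` is illustrative):

```python
import numpy as np

def product_dist(xs, ys, factor_dists, weights):
    """Weighted product-manifold distance:
    sqrt(sum_i w_i^2 * d_i(x_i, y_i)^2) over the factor manifolds."""
    return np.sqrt(sum(w**2 * d(x, y)**2
                       for w, d, x, y in zip(weights, factor_dists, xs, ys)))
```

Each factor can carry its own curvature; plugging in Euclidean factor distances recovers the ordinary Euclidean distance on the concatenated coordinates.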
Numerical Stability
Poincaré Ball Instabilities
Problem 1: Division by zero when ||x|| → 1
Solution: Clip to maximum norm
x_safe = x · min(1, K(1 - ε) / ||x||)
where ε = 1e-5, so that ||x_safe|| ≤ K(1 - ε) stays strictly inside the ball
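A one-liner sketch of this clipping in NumPy (the name `clip_to_ball` is ours):

```python
import numpy as np

def clip_to_ball(x, K=1.0, eps=1e-5):
    """Rescale x so that ||x|| <= K*(1 - eps); interior points pass through."""
    max_norm = K * (1 - eps)
    n = np.linalg.norm(x)
    return x if n <= max_norm else x * (max_norm / n)
```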
Problem 2: Möbius addition overflow
Solution: Rewrite using log1p, expm1
Instead of: (1 + 2⟨x,y⟩ + ||y||²) / (1 + 2⟨x,y⟩ + ||x||²||y||²)
Use: exp(log1p(2⟨x,y⟩ + ||y||²) - log1p(2⟨x,y⟩ + ||x||²||y||²))
Lorentz Model Stability
Advantage: No boundary singularities!
Constraint enforcement:
After each update, project back to hyperboloid:
x₀ = √(K² + x₁² + ... + xₙ²)
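The projection above amounts to recomputing the time coordinate from the spatial ones, sketched here in NumPy (the name `project_to_hyperboloid` is ours):

```python
import numpy as np

def project_to_hyperboloid(x, K=1.0):
    """Restore <x, x>_L = -K^2 (with x0 > 0) after a numerical update."""
    y = np.asarray(x, dtype=float).copy()
    y[0] = np.sqrt(K**2 + np.dot(y[1:], y[1:]))
    return y
```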
Geodesic computation (stable):
d_L(x, y) = K · log((-⟨x,y⟩ + √(⟨x,y⟩² - K⁴)) / K²)
Mixed Precision
Strategy:
- FP16 for forward pass (speed)
- FP32 for gradients (stability)
- FP64 for curvature updates (critical)
GeoOpt recommendation: Use FP32 minimum for hyperbolic operations.
Complexity Analysis
Space Complexity
Poincaré Ball:
- Point: O(n) storage (same as Euclidean)
- No auxiliary structures needed
Lorentz:
- Point: O(n+1) storage (extra time dimension)
- Constraint: ⟨x,x⟩_L = -K²
Curvature:
- Shared K: O(1) extra parameter
- Per-layer K: O(L) for L layers
- Per-dimension K: O(n) parameters
Time Complexity
| Operation | Euclidean | Poincaré | Lorentz |
|---|---|---|---|
| Distance | O(n) | O(n) | O(n) |
| Addition | O(n) | O(n) | O(n) |
| Exp/Log | - | O(n) | O(n) |
| Linear layer | O(n²) | O(n²) | O(n²) |
| Attention | O(n²d) | O(n²d) | O(n²d) |
| Linear attention | O(nd²) | O(nd²) | O(nd²) |
Key Insight: Asymptotic complexity same as Euclidean!
Constants: Hyperbolic ops are typically 2-5x slower (more FLOPs per operation)
SIMD Optimization: Vectorization (e.g., AVX2) can recover much of this gap; actual speedups depend heavily on hardware, dimension, and batch size.
Proofs of Key Properties
Theorem 1: Möbius Addition Preserves Poincaré Ball
Statement: If x, y ∈ 𝔹ⁿ(K) (Poincaré ball), then x ⊕_K y ∈ 𝔹ⁿ(K).
Proof sketch:
Let a = ||x|| / K, b = ||y|| / K, c = ⟨x, y⟩ / K², with a, b < 1.
By the triangle inequality applied to the numerator,
||x ⊕_K y||² / K² ≤ ((1 + 2c + b²)a + (1 - a²)b)² / (1 + 2c + a²b²)²
and expanding shows the right-hand side is < 1 whenever a, b < 1.
Theorem 2: Exponential Map is Diffeomorphism
Statement: exp_x: T_xℍⁿ → ℍⁿ is a diffeomorphism for each x.
Proof sketch:
- ℍⁿ is complete, simply connected, and has non-positive sectional curvature
- By the Cartan-Hadamard theorem, exp_x is therefore a global diffeomorphism for every x
- Its smooth inverse is log_x
Theorem 3: Capacity Advantage
Statement: Any n-node tree embeds in ℍ² with distortion 1 + ε for arbitrary ε > 0 (Sarkar's construction), while low-distortion embeddings into ℝᵏ require dimension k that grows with n.
Proof Sketch:
- Hyperbolic plane has exponential volume: V(r) ~ exp(r)
- Trees have exponential node count: N(depth d) ~ exp(d)
- Volume growth matches tree growth → O(1) average distortion
- Euclidean plane has polynomial volume: V(r) ~ r²
- Trees cannot fit without stretching → Ω(√n) average distortion
Implementation Checklist
Poincaré Ball Implementation
- Möbius addition with curvature K
- Exponential map with numerical stability
- Logarithmic map with safe arctanh
- Distance function with clipping
- Parallel transport
- Gradient clipping to prevent boundary
Lorentz Model Implementation
- Minkowski inner product
- Hyperboloid constraint projection
- Exponential map
- Distance function
- Lorentz boost and rotation
- Conversion to/from Poincaré
Hyperbolic Attention
- Hyperbolic query/key/value projections
- Distance-based similarity
- Softmax with temperature
- Möbius weighted aggregation
- Linear attention kernel approximation
Learnable Curvature
- Curvature parameter K with positive constraint
- Gradient computation w.r.t. K
- Coupled optimization with rescaling
- Per-layer or per-head curvature
SIMD Optimizations
- Vectorized Möbius addition (AVX2)
- Batch distance computation
- Fused exp/log operations
- Cache-aligned memory layout
References
Textbooks:
- "Riemannian Geometry" - do Carmo
- "Foundations of Hyperbolic Manifolds" - Ratcliffe
Papers:
- Ganea et al., "Hyperbolic Neural Networks" (NeurIPS 2018)
- Hypformer (KDD 2024) - Linear attention formulation
- Fully Hyperbolic NNs (ACL 2022) - Lorentz model analysis
Software:
- GeoOpt: PyTorch library for Riemannian optimization
- Hyperbolic Image Embeddings: Reference implementation
Conclusion
Hyperbolic geometry provides a mathematically rigorous framework for hierarchical neural representations with:
- Provable capacity: volume grows exponentially with radius (vs polynomially in Euclidean space)
- Stable operations: the Lorentz model avoids the Poincaré ball's boundary singularities
- Efficient algorithms: O(n²d) attention same as Euclidean
- Learnable curvature: Adapt to data hierarchy
All operations have closed-form solutions and computable gradients, making them suitable for modern automatic differentiation frameworks.