wifi-densepose/examples/exo-ai-2025/research/09-hyperbolic-attention/geometric_foundations.md

Geometric Foundations of Hyperbolic Attention

Mathematical Prerequisites

This document provides rigorous mathematical foundations for implementing hyperbolic attention mechanisms with provable geometric properties.


Table of Contents

  1. Hyperbolic Geometry Basics
  2. Poincaré Ball Model
  3. Lorentz (Hyperboloid) Model
  4. Isometries and Transformations
  5. Hyperbolic Neural Operations
  6. Attention Mechanisms in Hyperbolic Space
  7. Curvature Adaptation
  8. Numerical Stability
  9. Complexity Analysis

Hyperbolic Geometry Basics

Definition

Hyperbolic space ℍⁿ is a complete, simply-connected Riemannian manifold of constant negative curvature κ < 0.

Key Properties:

  1. Exponential volume growth: Volume of ball of radius r grows as ~exp(r√|κ|)
  2. Unique geodesics: Any two points connected by unique shortest path
  3. Thin triangles: angle sum of every geodesic triangle is < π (vs exactly π in Euclidean space)
  4. Tree embedding: Finite trees embed with arbitrarily low distortion in ℍ²

Curvature Parameter

Define curvature radius K > 0 such that κ = -1/K².

Normalization:

  • κ = -1: Unit hyperbolic space (mathematical convention)
  • κ = -1/K²: Learnable curvature (K is learned parameter)

Models of Hyperbolic Space

Five isometric models:

  1. Poincaré ball: {x ∈ ℝⁿ : ||x|| < 1}
  2. Lorentz (hyperboloid): {x ∈ ℝⁿ⁺¹ : ⟨x,x⟩_L = -1, x₀ > 0}
  3. Poincaré half-space: {x ∈ ℝⁿ : xₙ > 0}
  4. Klein disk: {x ∈ ℝⁿ : ||x|| < 1}
  5. Hemisphere

We focus on Poincaré ball (intuitive) and Lorentz (stable).


Poincaré Ball Model

Metric

Riemannian metric:

ds² = 4 / (1 - ||x||²/K²)² · ||dx||²

Distance between points x, y:

d_P(x, y) = K · arcosh(1 + 2||x - y||² / (K²(1 - ||x||²/K²)(1 - ||y||²/K²)))

Simplified formula (numerically stable):

d_P(x, y) = 2K · artanh(||(-x) ⊕_K y|| / K)

Möbius Gyrovector Operations

Möbius Addition (generalized):

x ⊕_K y = ((1 + 2⟨x,y⟩/K² + ||y||²/K²)x + (1 - ||x||²/K²)y) /
          (1 + 2⟨x,y⟩/K² + ||x||²||y||²/K⁴)

Special case (K = 1):

x ⊕ y = ((1 + 2⟨x,y⟩ + ||y||²)x + (1 - ||x||²)y) /
        (1 + 2⟨x,y⟩ + ||x||²||y||²)

Properties:

  • Identity: x ⊕ 0 = x
  • Inverse: x ⊕ (-x) = 0 (the gyrogroup inverse of x is simply -x)
  • Non-commutative: x ⊕ y ≠ y ⊕ x (in general)
  • Non-associative: (x ⊕ y) ⊕ z ≠ x ⊕ (y ⊕ z)

Computational Complexity: O(n) for n-dimensional vectors
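As a sanity check, the generalized Möbius addition and the stable distance formula above can be transcribed directly into NumPy (a minimal sketch, not an optimized implementation; the names `mobius_add` and `poincare_dist` are ours):

```python
import numpy as np

def mobius_add(x, y, K=1.0):
    """Mobius addition x (+)_K y on the Poincare ball of radius K."""
    xy = np.dot(x, y) / K**2          # <x,y>/K^2
    xx = np.dot(x, x) / K**2          # ||x||^2/K^2
    yy = np.dot(y, y) / K**2          # ||y||^2/K^2
    num = (1 + 2 * xy + yy) * x + (1 - xx) * y
    den = 1 + 2 * xy + xx * yy
    return num / den

def poincare_dist(x, y, K=1.0):
    """Geodesic distance d_P(x, y) = 2K artanh(||(-x) (+)_K y|| / K)."""
    return 2 * K * np.arctanh(np.linalg.norm(mobius_add(-x, y, K)) / K)
```

The identity, inverse, and distance-symmetry properties listed above all hold numerically, e.g. `mobius_add(x, -x)` returns the origin exactly.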

Exponential and Logarithmic Maps

Exponential Map (tangent space → manifold):

exp_x^K(v) = x ⊕_K (K · tanh(λ_x^K ||v|| / (2K)) · v / ||v||)

where λ_x^K = 2 / (1 - ||x||²/K²)  (conformal factor at x)

Logarithmic Map (manifold → tangent space):

log_x^K(y) = K(1 - ||x||²/K²) · artanh(||(-x) ⊕_K y|| / K) ·
             ((-x) ⊕_K y) / ||(-x) ⊕_K y||

Usage:

  • exp: Apply Euclidean gradients to hyperbolic points
  • log: Compute "hyperbolic difference" between points
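At the origin the conformal factor is 2 and both maps collapse to simple closed forms, sketched below (the names `exp0`/`log0` are ours):

```python
import numpy as np

def exp0(v, K=1.0):
    """Exponential map at the origin: exp_0(v) = K tanh(||v||/K) v/||v||."""
    n = np.linalg.norm(v)
    if n < 1e-15:
        return np.zeros_like(v)
    return K * np.tanh(n / K) * v / n

def log0(x, K=1.0):
    """Logarithmic map at the origin: log_0(x) = K artanh(||x||/K) x/||x||."""
    n = np.linalg.norm(x)
    if n < 1e-15:
        return np.zeros_like(x)
    return K * np.arctanh(n / K) * x / n
```

The two maps are exact inverses, and `exp0` always lands strictly inside the ball because tanh < 1.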

Parallel Transport

Problem: Moving tangent vectors along geodesics while preserving inner products.

Formula (transport v from x to y):

P_{x→y}(v) = (λ_x^K / λ_y^K) · gyr[y, -x] v

where:
  λ_x^K = 2 / (1 - ||x||²/K²)  (conformal factor at x)
  gyr[a, b]w = ⊖(a ⊕_K b) ⊕_K (a ⊕_K (b ⊕_K w))  (gyration; a linear isometry of the tangent space)

Lorentz (Hyperboloid) Model

Minkowski Space

Ambient space: ℝⁿ⁺¹ with Minkowski inner product:

⟨x, y⟩_L = -x₀y₀ + x₁y₁ + ... + xₙyₙ

Hyperboloid constraint:

ℍⁿ = {x ∈ ℝⁿ⁺¹ : ⟨x, x⟩_L = -K², x₀ > 0}

Distance

Formula:

d_L(x, y) = K · arcosh(-⟨x, y⟩_L / K²)

Numerically stable variant:

d_L(x, y) = K · ln(-⟨x, y⟩_L / K² + √((-⟨x, y⟩_L / K²)² - 1))

Exponential Map

Formula:

exp_x^L(v) = cosh(||v||_L / K) · x + K · sinh(||v||_L / K) · v / ||v||_L

where ||v||_L = √⟨v, v⟩_L  (Minkowski norm of the tangent vector)
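The Minkowski inner product, the distance, and the exponential map fit in a few lines, and the hyperboloid constraint ⟨exp_x(v), exp_x(v)⟩_L = -K² makes a convenient unit test (a sketch; function names are ours):

```python
import numpy as np

def minkowski_dot(x, y):
    """<x, y>_L = -x0*y0 + x1*y1 + ... + xn*yn."""
    return -x[0] * y[0] + np.dot(x[1:], y[1:])

def lorentz_dist(x, y, K=1.0):
    """d_L(x, y) = K * arcosh(-<x, y>_L / K^2)."""
    z = -minkowski_dot(x, y) / K**2
    return K * np.arccosh(np.clip(z, 1.0, None))  # clip guards rounding below 1

def lorentz_exp(x, v, K=1.0):
    """Exponential map at x for a tangent vector v (requires <x, v>_L = 0)."""
    n = np.sqrt(max(minkowski_dot(v, v), 0.0))    # Minkowski norm (space-like v)
    if n < 1e-15:
        return x.copy()
    return np.cosh(n / K) * x + K * np.sinh(n / K) * v / n
```

Following a unit tangent vector for length t moves a point geodesic distance t, with the constraint preserved exactly up to rounding.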

Lorentz Transformations

Lorentz Boost (hyperbolic translation with rapidity φ along a unit space-like direction n̂):

Boost(x)₀ = cosh(φ) x₀ + sinh(φ) ⟨n̂, x_s⟩
Boost(x)_s = x_s + (cosh(φ) - 1) ⟨n̂, x_s⟩ n̂ + sinh(φ) x₀ n̂

where x = (x₀, x_s) splits into time-like and space-like components

Lorentz Rotation (rotation in the space-like plane spanned by e₁, e₂):

R_θ(x) = x + sin(θ)(e₂e₁ᵀ - e₁e₂ᵀ)x + (cos(θ) - 1)(e₁e₁ᵀ + e₂e₂ᵀ)x

where e₁, e₂ are orthonormal space-like vectors

Isometries and Transformations

Möbius Transformations (Poincaré Ball)

General form:

M(x) = (Ax + b) / (⟨c, x⟩ + d)

subject to: A ∈ SO(n), ad - ⟨b, c⟩ = 1

Special case - Translation:

T_a(x) = (-a) ⊕ x

Gyrovector Multiplication

Scalar multiplication:

r ⊗ x = K · tanh(r · artanh(||x|| / K)) · x / ||x||

for r ∈ ℝ, x ∈ 𝔹ⁿ(K)

Properties:

  • (r + s) ⊗ x = (r ⊗ x) ⊕ (s ⊗ x) (scalar distributivity, a gyrovector space axiom)
  • r ⊗ (s ⊗ x) = (rs) ⊗ x (associative)
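Scalar associativity is easy to verify numerically with a direct transcription (a sketch; the name `mobius_scalar_mul` is ours):

```python
import numpy as np

def mobius_scalar_mul(r, x, K=1.0):
    """Gyrovector scalar multiplication: r (*) x = K tanh(r artanh(||x||/K)) x/||x||."""
    n = np.linalg.norm(x)
    if n < 1e-15:
        return np.zeros_like(x)
    return K * np.tanh(r * np.arctanh(n / K)) * x / n
```

Geometrically, r ⊗ x slides x along the geodesic through the origin so that d(0, r ⊗ x) = |r| · d(0, x).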

Hyperbolic Neural Operations

Hyperbolic Linear Layer

Euclidean linear layer: y = Wx + b

Hyperbolic equivalent:

y = exp_0(W · log_0(x) + b)

Steps:

  1. Map x from manifold to tangent space at origin: v = log_0(x)
  2. Apply Euclidean linear transformation: v' = Wv + b
  3. Map back to manifold: y = exp_0(v')

Learnable parameters: W ∈ ℝᵐˣⁿ, b ∈ ℝᵐ
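The three steps can be sketched as a single function (a minimal illustration; the tangent-space round trip guarantees the output lands back inside the ball):

```python
import numpy as np

def exp0(v, K=1.0):
    n = np.linalg.norm(v)
    return np.zeros_like(v) if n < 1e-15 else K * np.tanh(n / K) * v / n

def log0(x, K=1.0):
    n = np.linalg.norm(x)
    return np.zeros_like(x) if n < 1e-15 else K * np.arctanh(n / K) * x / n

def hyp_linear(x, W, b, K=1.0):
    """Hyperbolic linear layer: lift to tangent space at 0, apply Wv + b, map back."""
    return exp0(W @ log0(x, K) + b, K)
```

Note that W and b are plain Euclidean parameters; only the input and output live on the manifold.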

Hyperbolic ReLU

Problem: ReLU is defined in tangent space, not on manifold.

Solution:

ReLU_hyp(x) = exp_0(ReLU(log_0(x)))

Component-wise variant:

ReLU_hyp(x)_i = exp_0,i(max(0, log_0(x)_i))

Hyperbolic Batch Normalization

Challenge: Mean and variance are Euclidean concepts.

Hyperbolic mean (Fréchet mean):

μ = argmin_p Σ_i d(p, x_i)²

Approximation (geodesic midpoint):

μ ≈ exp_0(mean(log_0(x_1), ..., log_0(x_n)))

Normalization:

x_norm = exp_μ((log_μ(x) - μ_tangent) / σ_tangent)
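The geodesic-midpoint approximation of the Fréchet mean amounts to a tangent-space average at the origin (a one-shot approximation, not the exact argmin; the name `approx_frechet_mean` is ours):

```python
import numpy as np

def exp0(v, K=1.0):
    n = np.linalg.norm(v)
    return np.zeros_like(v) if n < 1e-15 else K * np.tanh(n / K) * v / n

def log0(x, K=1.0):
    n = np.linalg.norm(x)
    return np.zeros_like(x) if n < 1e-15 else K * np.arctanh(n / K) * x / n

def approx_frechet_mean(points, K=1.0):
    """exp_0(mean(log_0(x_1), ..., log_0(x_n))): one-shot tangent-space average."""
    return exp0(np.mean([log0(p, K) for p in points], axis=0), K)
```

For symmetric configurations the approximation is exact: the mean of {x, -x} is the origin, and the mean of identical points is the point itself.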

Attention Mechanisms in Hyperbolic Space

Hyperbolic Dot-Product Attention

Euclidean attention:

Attention(Q, K, V) = softmax(QKᵀ / √d) V

Hyperbolic variant:

Attention_hyp(Q, K, V)_i = ⊕_j (softmax_j(-d(Q_i, K_j)² / τ) ⊗ V_j)

Components:

  1. Similarity: -d(q, k)² (negative squared distance)
  2. Normalization: softmax with temperature τ
  3. Aggregation: Möbius weighted sum

Complexity: O(n²d) for n tokens, d dimensions (same as Euclidean)
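The three components can be sketched end to end. One caveat: the exact Möbius weighted sum is order-dependent, so this sketch uses the common tangent-space approximation at the origin for the aggregation step (function names are ours; `curv` denotes the curvature radius K):

```python
import numpy as np

def mobius_add(x, y, curv=1.0):
    xy = np.dot(x, y) / curv**2; xx = np.dot(x, x) / curv**2; yy = np.dot(y, y) / curv**2
    return ((1 + 2*xy + yy) * x + (1 - xx) * y) / (1 + 2*xy + xx*yy)

def poincare_dist(x, y, curv=1.0):
    return 2 * curv * np.arctanh(np.linalg.norm(mobius_add(-x, y, curv)) / curv)

def exp0(v, curv=1.0):
    n = np.linalg.norm(v)
    return np.zeros_like(v) if n < 1e-15 else curv * np.tanh(n / curv) * v / n

def log0(x, curv=1.0):
    n = np.linalg.norm(x)
    return np.zeros_like(x) if n < 1e-15 else curv * np.arctanh(n / curv) * x / n

def hyp_attention(Q, Keys, V, curv=1.0, tau=1.0):
    """Distance-based attention with tangent-space aggregation at the origin."""
    out = []
    for q in Q:
        scores = np.array([-poincare_dist(q, k, curv)**2 / tau for k in Keys])
        w = np.exp(scores - scores.max()); w /= w.sum()        # stable softmax
        out.append(exp0(sum(wi * log0(v, curv) for wi, v in zip(w, V)), curv))
    return np.array(out)
```

A quick sanity check: when all value vectors coincide, every output row equals that shared value, because the softmax weights sum to 1.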

Hyperbolic Linear Attention (Hypformer)

Problem: Quadratic complexity O(n²)

Solution: Kernel approximation. Find a feature map φ whose inner product approximates the distance-based similarity (not the distance itself):

φ(q)ᵀ φ(k) ≈ sim_hyp(q, k)

Linear attention:
Attention_linear(Q, K, V) = (Σ_j φ(K_j)⊗V_j) ⊘ (Σ_j φ(K_j))

Hyperbolic kernel (proposal):

φ_hyp(x) = [cosh(||x||/K), sinh(||x||/K) · x/||x||]

Properties:
  ⟨φ_hyp(x), φ_hyp(y)⟩_L ≈ -cosh(d(x,y)/K)

Complexity: O(nd²) vs O(n²d)

Speedup: 10x for n > 10d (verified by Hypformer, KDD 2024)

Multi-Head Hyperbolic Attention

Extension:

MultiHead(Q, K, V) = Concat(head₁, ..., headₕ) W^O

where head_i = Attention_hyp(QW_i^Q, KW_i^K, VW_i^V)

Learnable per-head curvature:

head_i operates in space with curvature κ_i

Rationale: Different heads capture different hierarchical depths.


Curvature Adaptation

Learnable Curvature

Parameterization: K ∈ ℝ⁺ (learned via gradient descent)

Gradient w.r.t. curvature:

∂L/∂K = ∂L/∂d · ∂d/∂K

where:
  ∂d/∂K = ∂/∂K[K · arcosh(1 + 2||x-y||²/(K²(1-||x||²/K²)(1-||y||²/K²)))]

Numerical trick: Reparameterize as K = exp(k) to ensure K > 0.

Coupled Optimization

Problem: Naively updating K breaks Riemannian optimizer assumptions.

Solution (from "Optimizing Curvature Learning" 2024):

1. Compute gradients in current manifold (curvature K_old)
2. Update parameters: θ_new = RiemannianSGD(θ, ∇_θ L, K_old)
3. Update curvature: K_new = K_old - α · ∂L/∂K
4. Rescale parameters to new manifold:
   θ_rescaled = rescale_curvature(θ_new, K_old, K_new)

Rescaling formula (Poincaré ball):

rescale(x, K₁, K₂) = (K₂ / K₁) · x
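Rescaling just multiplies coordinates by the ratio of radii; distances then scale by the same ratio, which makes a handy invariance test (a sketch; helper names are ours):

```python
import numpy as np

def mobius_add(x, y, K=1.0):
    xy = np.dot(x, y) / K**2; xx = np.dot(x, x) / K**2; yy = np.dot(y, y) / K**2
    return ((1 + 2*xy + yy) * x + (1 - xx) * y) / (1 + 2*xy + xx*yy)

def poincare_dist(x, y, K=1.0):
    return 2 * K * np.arctanh(np.linalg.norm(mobius_add(-x, y, K)) / K)

def rescale(x, K_old, K_new):
    """Move a point from the ball of radius K_old to the ball of radius K_new."""
    return (K_new / K_old) * x
```

Because every term in the Möbius addition depends only on ||x||/K and ⟨x,y⟩/K², the relative geometry is preserved and d_{K₂} = (K₂/K₁) · d_{K₁} after rescaling.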

Multi-Curvature Embeddings

Approach: Different dimensions/layers have different curvatures.

Product space:

ℍ^{n₁}(κ₁) × ℍ^{n₂}(κ₂) × ... × ℍ^{nₖ}(κₖ)

Distance:

d_product((x₁,...,xₖ), (y₁,...,yₖ)) = √(Σ_i w_i² d²(x_i, y_i))

where w_i are learnable weights

Numerical Stability

Poincaré Ball Instabilities

Problem 1: Division by zero when ||x|| → 1

Solution: Clip to maximum norm

x_safe = x / max(1, ||x|| / ((1 - ε)K))

where ε = 1e-5

Problem 2: Möbius addition overflow

Solution: Rewrite using log1p, expm1

Instead of: (1 + 2⟨x,y⟩ + ||y||²) / (1 + 2⟨x,y⟩ + ||x||²||y||²)
Use: exp(log1p(2⟨x,y⟩ + ||y||²) - log1p(2⟨x,y⟩ + ||x||²||y||²))
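Both guards can be sketched compactly (ε follows the text; the clamp constant inside `safe_artanh` is our choice):

```python
import numpy as np

EPS = 1e-5

def clip_to_ball(x, K=1.0, eps=EPS):
    """Enforce ||x|| <= (1 - eps) * K so conformal factors stay finite."""
    n = np.linalg.norm(x)
    max_norm = (1 - eps) * K
    return x if n <= max_norm else x * (max_norm / n)

def safe_artanh(z, bound=1 - 1e-7):
    """arctanh with its argument clamped away from +/-1."""
    return np.arctanh(np.clip(z, -bound, bound))
```

Points already inside the safe radius pass through untouched; everything else is pulled radially back to the boundary margin.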

Lorentz Model Stability

Advantage: No boundary singularities!

Constraint enforcement:

After each update, project back to hyperboloid:
  x₀ = √(K² + x₁² + ... + xₙ²)

Geodesic computation (stable):

d_L(x, y) = K · log((-⟨x,y⟩ + √(⟨x,y⟩² - K⁴)) / K²)
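The constraint projection described above can be sketched and verified directly (function names are ours):

```python
import numpy as np

def minkowski_dot(x, y):
    return -x[0] * y[0] + np.dot(x[1:], y[1:])

def project_to_hyperboloid(x, K=1.0):
    """Recompute the time coordinate so that <x, x>_L = -K^2 holds exactly."""
    y = np.array(x, dtype=float)
    y[0] = np.sqrt(K**2 + np.dot(y[1:], y[1:]))
    return y
```

Only the time-like coordinate changes, so the spatial part of an updated point is left intact.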

Mixed Precision

Strategy:

  • FP16 for forward pass (speed)
  • FP32 for gradients (stability)
  • FP64 for curvature updates (critical)

GeoOpt recommendation: Use FP32 minimum for hyperbolic operations.


Complexity Analysis

Space Complexity

Poincaré Ball:

  • Point: O(n) storage (same as Euclidean)
  • No auxiliary structures needed

Lorentz:

  • Point: n + 1 coordinates, O(n) storage (one extra time-like dimension)
  • Constraint: ⟨x,x⟩_L = -K²

Curvature:

  • Shared K: O(1) extra parameter
  • Per-layer K: O(L) for L layers
  • Per-dimension K: O(n) parameters

Time Complexity

Operation         Euclidean   Poincaré   Lorentz
Distance          O(n)        O(n)       O(n)
Addition          O(n)        O(n)       O(n)
Exp/Log           -           O(n)       O(n)
Linear layer      O(n²)       O(n²)      O(n²)
Attention         O(n²d)      O(n²d)     O(n²d)
Linear attention  O(nd²)      O(nd²)     O(nd²)

Key Insight: Asymptotic complexity same as Euclidean!

Constants: Hyperbolic ops 2-5x slower (more FLOPs per operation)

SIMD Optimization: Vectorization can recover much of this constant-factor overhead (reported 8-50x kernel speedups), making optimized hyperbolic ops competitive with unoptimized Euclidean ones.


Proofs of Key Properties

Theorem 1: Möbius Addition Preserves Poincaré Ball

Statement: If x, y ∈ 𝔹ⁿ(K) (Poincaré ball), then x ⊕_K y ∈ 𝔹ⁿ(K).

Proof:

Let ||x||² / K² = a², ||y||² / K² = b², ⟨x,y⟩ / K² = c
where a, b < 1.

||x ⊕_K y||² / K² = ||(1+2c+b²)x + (1-a²)y||² / (K²(1+2c+a²b²)²)
                   ≤ ((1+2c+b²)a + (1-a²)b)² / (1+2c+a²b²)²   (triangle inequality)
                   < 1 for all a, b < 1   (direct algebra, using |c| ≤ ab from Cauchy–Schwarz)

Theorem 2: Exponential Map is Diffeomorphism

Statement: exp_x: T_xℍⁿ → ℍⁿ is a diffeomorphism for each x ∈ ℍⁿ.

Proof:

  • Inverse given by log_x
  • Both are smooth (analytic)
  • Jacobian is full rank everywhere
  • QED.

Theorem 3: Capacity Advantage

Statement: Any n-node tree embeds in ℍ² with distortion at most 1 + ε for any ε > 0 (Sarkar's construction), while Euclidean embeddings of trees incur distortion that grows with n regardless of dimension (e.g. Ω(√log n) for the complete binary tree).

Proof Sketch:

  • Hyperbolic plane has exponential volume growth: V(r) ~ exp(r)
  • Trees have exponential node count: N(depth d) ~ exp(d)
  • Volume growth matches tree growth → a near-isometric embedding exists
  • Euclidean space has polynomial volume growth: V(r) ~ rᵏ
  • Trees cannot fit without stretching → distortion necessarily grows with n

Implementation Checklist

Poincaré Ball Implementation

  • Möbius addition with curvature K
  • Exponential map with numerical stability
  • Logarithmic map with safe arctanh
  • Distance function with clipping
  • Parallel transport
  • Gradient clipping to prevent boundary

Lorentz Model Implementation

  • Minkowski inner product
  • Hyperboloid constraint projection
  • Exponential map
  • Distance function
  • Lorentz boost and rotation
  • Conversion to/from Poincaré

Hyperbolic Attention

  • Hyperbolic query/key/value projections
  • Distance-based similarity
  • Softmax with temperature
  • Möbius weighted aggregation
  • Linear attention kernel approximation

Learnable Curvature

  • Curvature parameter K with positive constraint
  • Gradient computation w.r.t. K
  • Coupled optimization with rescaling
  • Per-layer or per-head curvature

SIMD Optimizations

  • Vectorized Möbius addition (AVX2)
  • Batch distance computation
  • Fused exp/log operations
  • Cache-aligned memory layout

References

Textbooks:

  1. "Riemannian Geometry" - do Carmo
  2. "Foundations of Hyperbolic Manifolds" - Ratcliffe

Papers:

  1. Ganea et al., "Hyperbolic Neural Networks" (NeurIPS 2018)
  2. Hypformer (KDD 2024) - Linear attention formulation
  3. Fully Hyperbolic NNs (ACL 2022) - Lorentz model analysis

Software:

  • GeoOpt: PyTorch library for Riemannian optimization
  • Hyperbolic Image Embeddings: Reference implementation

Conclusion

Hyperbolic geometry provides a mathematically rigorous framework for hierarchical neural representations with:

  • Provable capacity: exponential volume growth (vs polynomial in Euclidean space), matching hierarchical data
  • Stable operations: Lorentz model superior to Poincaré
  • Efficient algorithms: O(n²d) attention same as Euclidean
  • Learnable curvature: Adapt to data hierarchy

All operations have closed-form solutions and computable gradients, making them suitable for modern automatic differentiation frameworks.