# Geometric Foundations of Hyperbolic Attention

## Mathematical Prerequisites

This document provides rigorous mathematical foundations for implementing hyperbolic attention mechanisms with **provable geometric properties**.

---

## Table of Contents

1. [Hyperbolic Geometry Basics](#hyperbolic-geometry-basics)
2. [Poincaré Ball Model](#poincaré-ball-model)
3. [Lorentz (Hyperboloid) Model](#lorentz-hyperboloid-model)
4. [Isometries and Transformations](#isometries-and-transformations)
5. [Hyperbolic Neural Operations](#hyperbolic-neural-operations)
6. [Attention Mechanisms in Hyperbolic Space](#attention-mechanisms-in-hyperbolic-space)
7. [Curvature Adaptation](#curvature-adaptation)
8. [Numerical Stability](#numerical-stability)
9. [Complexity Analysis](#complexity-analysis)

---

## Hyperbolic Geometry Basics

### Definition

**Hyperbolic space** ℍⁿ is a complete, simply connected Riemannian manifold of constant **negative curvature** κ < 0.

**Key Properties**:

1. **Exponential volume growth**: the volume of a ball of radius r grows as ~exp(r√|κ|)
2. **Unique geodesics**: any two points are connected by a unique shortest path
3. **Angle deficit**: the angles of a geodesic triangle sum to less than π (vs. exactly π in Euclidean space)
4. **Tree embedding**: finite trees embed in ℍ² with arbitrarily low distortion

### Curvature Parameter

Define the **curvature radius** K > 0 such that κ = -1/K².

**Normalization**:

- **κ = -1**: unit hyperbolic space (mathematical convention)
- **κ = -1/K²**: learnable curvature (K is a learned parameter)

### Models of Hyperbolic Space

Five isometric models:

1. **Poincaré ball**: {x ∈ ℝⁿ : ||x|| < 1}
2. **Lorentz (hyperboloid)**: {x ∈ ℝⁿ⁺¹ : ⟨x,x⟩_L = -1, x₀ > 0}
3. **Poincaré half-space**: {x ∈ ℝⁿ : xₙ > 0}
4. **Klein disk**: {x ∈ ℝⁿ : ||x|| < 1}
5. **Hemisphere**

We focus on the **Poincaré ball** (intuitive) and the **Lorentz model** (numerically stable).
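Because the models are isometric, moving a point between them is a closed-form change of coordinates. As a minimal numpy sketch (unit curvature K = 1; the function names are illustrative, not from any library), here is the standard Lorentz ↔ Poincaré conversion:

```python
import numpy as np

def lorentz_to_poincare(x):
    """Project a hyperboloid point (x0, x1, ..., xn) into the Poincaré ball (K = 1)."""
    return x[1:] / (1.0 + x[0])

def poincare_to_lorentz(y):
    """Lift a Poincaré-ball point onto the hyperboloid (K = 1)."""
    sq = np.dot(y, y)
    return np.concatenate(([1.0 + sq], 2.0 * y)) / (1.0 - sq)

# Round trip: a ball point survives the lift and the projection.
y = np.array([0.3, -0.2, 0.1])
x = poincare_to_lorentz(y)
# The hyperboloid constraint <x,x>_L = -1 holds after the lift.
assert abs(-x[0]**2 + np.dot(x[1:], x[1:]) + 1.0) < 1e-12
assert np.allclose(lorentz_to_poincare(x), y)
```

The Lorentz → Poincaré map is a stereographic projection through (-1, 0, ..., 0); the round trip recovers the original ball point exactly.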
---

## Poincaré Ball Model

### Metric

**Riemannian metric** (conformal to the Euclidean one):

```
ds² = λ_x² · ||dx||²,   λ_x = 2 / (1 - ||x||²/K²)   (conformal factor)
```

**Distance between points x, y**:

```
d_P(x, y) = K · arcosh(1 + 2||x - y||² / (K² (1 - ||x||²/K²)(1 - ||y||²/K²)))
```

**Simplified formula** (numerically stable):

```
d_P(x, y) = 2K · artanh(||(-x) ⊕_K y|| / K)
```

### Möbius Gyrovector Operations

**Möbius addition** (curvature -1/K²):

```
x ⊕_K y = ((1 + 2⟨x,y⟩/K² + ||y||²/K²) x + (1 - ||x||²/K²) y)
          / (1 + 2⟨x,y⟩/K² + ||x||² ||y||² / K⁴)
```

**Special case** (K = 1):

```
x ⊕ y = ((1 + 2⟨x,y⟩ + ||y||²) x + (1 - ||x||²) y) / (1 + 2⟨x,y⟩ + ||x||² ||y||²)
```

**Properties**:

- **Identity**: x ⊕ 0 = 0 ⊕ x = x
- **Inverse**: x ⊕ (-x) = 0 (the gyro-inverse of x is simply -x)
- **Non-commutative**: x ⊕ y ≠ y ⊕ x (in general)
- **Non-associative**: (x ⊕ y) ⊕ z ≠ x ⊕ (y ⊕ z)

**Computational complexity**: O(n) for n-dimensional vectors

### Exponential and Logarithmic Maps

**Exponential map** (tangent space → manifold):

```
exp_x^K(v) = x ⊕_K (K · tanh(λ_x ||v|| / (2K)) · v / ||v||)

where λ_x = 2 / (1 - ||x||²/K²)  (conformal factor)
```

**Logarithmic map** (manifold → tangent space):

```
log_x^K(y) = (2K / λ_x) · artanh(||(-x) ⊕_K y|| / K) · ((-x) ⊕_K y) / ||(-x) ⊕_K y||
```

**Usage**:

- **exp**: apply tangent-space (Euclidean) updates such as gradients to hyperbolic points
- **log**: compute the "hyperbolic difference" between points

### Parallel Transport

**Problem**: moving tangent vectors along geodesics while preserving inner products.

**Formula** (transport v from x to y, via the gyration operator):

```
P_{x→y}(v) = (λ_x / λ_y) · gyr[y, -x] v

where gyr[a, b] v = -(a ⊕_K b) ⊕_K (a ⊕_K (b ⊕_K v))
```

---

## Lorentz (Hyperboloid) Model

### Minkowski Space

**Ambient space**: ℝⁿ⁺¹ with the **Minkowski inner product**:

```
⟨x, y⟩_L = -x₀y₀ + x₁y₁ + ...
          + xₙyₙ
```

**Hyperboloid constraint**:

```
ℍⁿ = {x ∈ ℝⁿ⁺¹ : ⟨x, x⟩_L = -K², x₀ > 0}
```

### Distance

**Formula**:

```
d_L(x, y) = K · arcosh(-⟨x, y⟩_L / K²)
```

**Numerically stable variant**:

```
d_L(x, y) = K · ln(-⟨x, y⟩_L / K² + √((-⟨x, y⟩_L / K²)² - 1))
```

### Exponential Map

**Formula**:

```
exp_x^L(v) = cosh(||v||_L / K) · x + K · sinh(||v||_L / K) · v / ||v||_L

where ||v||_L = √⟨v, v⟩_L  (Minkowski norm; tangent vectors are space-like)
```

### Lorentz Transformations

**Lorentz boost** (hyperbolic "translation" along a spatial unit direction v̂ with rapidity φ):

```
x₀'  = cosh(φ) x₀ + sinh(φ) ⟨x_s, v̂⟩
x_s' = x_s + (cosh(φ) - 1) ⟨x_s, v̂⟩ v̂ + sinh(φ) x₀ v̂

where x = (x₀, x_s) splits into time and space components
```

**Lorentz rotation** (rotation in the space-like plane spanned by orthonormal e₁, e₂):

```
R_θ = I + sin(θ)(e₂e₁ᵀ - e₁e₂ᵀ) + (cos(θ) - 1)(e₁e₁ᵀ + e₂e₂ᵀ)
```

---

## Isometries and Transformations

### Möbius Transformations (Poincaré Ball)

**General form**: every isometry of the Poincaré ball decomposes into a Möbius translation followed by an orthogonal rotation:

```
M(x) = R((-a) ⊕_K x),   R ∈ O(n), a ∈ 𝔹ⁿ(K)
```

**Special case (translation)**:

```
T_a(x) = (-a) ⊕_K x   (moves a to the origin)
```

### Gyrovector Scalar Multiplication

```
r ⊗ x = K · tanh(r · artanh(||x|| / K)) · x / ||x||,   for r ∈ ℝ, x ∈ 𝔹ⁿ(K)
```

**Properties**:

- (r + s) ⊗ x = (r ⊗ x) ⊕ (s ⊗ x)  (scalar distributivity over a common point holds)
- r ⊗ (s ⊗ x) = (rs) ⊗ x  (scalar associativity)
- r ⊗ (x ⊕ y) ≠ (r ⊗ x) ⊕ (r ⊗ y)  (no distributivity over ⊕)

---

## Hyperbolic Neural Operations

### Hyperbolic Linear Layer

**Euclidean linear layer**: y = Wx + b

**Hyperbolic equivalent** (tangent-space formulation):

```
y = exp_0(W · log_0(x) + b)
```

**Steps**:

1. Map x from the manifold to the tangent space at the origin: v = log_0(x)
2. Apply a Euclidean linear transformation: v' = Wv + b
3. Map back to the manifold: y = exp_0(v')

**Learnable parameters**: W ∈ ℝᵐˣⁿ, b ∈ ℝᵐ

### Hyperbolic ReLU

**Problem**: ReLU is defined on vector spaces, not on the manifold.

**Solution**: apply it componentwise in the tangent space at the origin:

```
ReLU_hyp(x) = exp_0(max(0, log_0(x)))   (max taken componentwise)
```

### Hyperbolic Batch Normalization

**Challenge**: mean and variance are Euclidean concepts.
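The tangent-space recipe above (lift with log_0, apply the Euclidean operation, map back with exp_0) can be sketched in a few lines of numpy. This assumes unit curvature K = 1; `hyp_linear` and `hyp_relu` are illustrative names, not a library API:

```python
import numpy as np

def exp0(v):
    """Exponential map at the origin of the unit Poincaré ball (K = 1)."""
    n = np.linalg.norm(v)
    return v if n < 1e-15 else np.tanh(n) * v / n

def log0(x):
    """Logarithmic map at the origin (inverse of exp0)."""
    n = np.linalg.norm(x)
    return x if n < 1e-15 else np.arctanh(n) * x / n

def hyp_linear(W, b, x):
    """Tangent-space linear layer: lift to T_0, apply Wv + b, map back."""
    return exp0(W @ log0(x) + b)

def hyp_relu(x):
    """Tangent-space ReLU: componentwise max(0, .) applied in T_0."""
    return exp0(np.maximum(0.0, log0(x)))

x = np.array([0.4, -0.3])
# Identity weights and zero bias leave the point unchanged (exp0/log0 are inverse).
assert np.allclose(hyp_linear(np.eye(2), np.zeros(2), x), x)
# The image stays strictly inside the unit ball.
assert np.linalg.norm(hyp_relu(x)) < 1.0
```

The identity-weight check is a quick sanity test for any exp/log implementation, since the layer must then be a no-op.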
**Hyperbolic mean** (Fréchet mean):

```
μ = argmin_p Σ_i d(p, x_i)²
```

**Approximation** (tangent-space mean at the origin):

```
μ ≈ exp_0(mean(log_0(x_1), ..., log_0(x_n)))
```

**Normalization** (the lifted vectors log_μ(x_i) are already centered at μ, so it suffices to rescale them):

```
x_norm = exp_μ(log_μ(x) / σ_tangent)
```

---

## Attention Mechanisms in Hyperbolic Space

### Hyperbolic Dot-Product Attention

**Euclidean attention**:

```
Attention(Q, K, V) = softmax(QKᵀ / √d) V
```

**Hyperbolic variant** (for query i):

```
Attention_hyp(Q, K, V)_i = ⊕_j (w_ij ⊗ V_j),   w_ij = softmax_j(-d(Q_i, K_j)² / τ)
```

**Components**:

1. **Similarity**: -d(q, k)² (negative squared geodesic distance)
2. **Normalization**: softmax with temperature τ
3. **Aggregation**: Möbius weighted combination (⊕ is non-associative, so the weighted "sum" is order-dependent; in practice it is often replaced by a tangent-space average or the Einstein midpoint)

**Complexity**: O(n²d) for n tokens, d dimensions (same as Euclidean)

### Hyperbolic Linear Attention (Hypformer)

**Problem**: quadratic complexity O(n²)

**Solution**: kernelize the similarity so attention factorizes:

```
φ(q)ᵀ φ(k) ≈ sim(q, k)

Linear attention:
Attention_linear(Q, K, V)_i = φ(Q_i)ᵀ (Σ_j φ(K_j) V_jᵀ) / (φ(Q_i)ᵀ Σ_j φ(K_j))
```

**Hyperbolic kernel** (proposal; this is the exponential lift onto the unit hyperboloid):

```
φ_hyp(x) = [cosh(||x||/K), sinh(||x||/K) · x/||x||]

Property: ⟨φ_hyp(x), φ_hyp(y)⟩_L ≈ -cosh(d(x, y)/K)
```

**Complexity**: **O(nd²)** vs O(n²d)

**Speedup**: ~10x for n > 10d (reported by Hypformer, KDD 2024)

### Multi-Head Hyperbolic Attention

**Extension**:

```
MultiHead(Q, K, V) = Concat(head₁, ..., headₕ) W^O

where head_i = Attention_hyp(QW_i^Q, KW_i^K, VW_i^V)
```

**Learnable per-head curvature**: head_i operates in a space with its own curvature κ_i.

**Rationale**: different heads can capture different hierarchical depths.

---

## Curvature Adaptation

### Learnable Curvature

**Parameterization**: K ∈ ℝ⁺ (learned via gradient descent)

**Gradient w.r.t. curvature** (chain rule through the distance):

```
∂L/∂K = ∂L/∂d · ∂d/∂K

where:
∂d/∂K = ∂/∂K [K · arcosh(1 + 2||x-y||² / (K² (1 - ||x||²/K²)(1 - ||y||²/K²)))]
```

**Numerical trick**: reparameterize K = exp(k) to ensure K > 0.

### Coupled Optimization

**Problem**: naively updating K breaks the Riemannian optimizer's assumptions.
**Solution** (from "Optimizing Curvature Learning", 2024):

```
1. Compute gradients in the current manifold (curvature K_old)
2. Update parameters: θ_new = RiemannianSGD(θ, ∇_θ L, K_old)
3. Update curvature: K_new = K_old - α · ∂L/∂K
4. Rescale parameters to the new manifold:
   θ_rescaled = rescale_curvature(θ_new, K_old, K_new)
```

**Rescaling formula** (Poincaré ball; preserves each point's relative position in the ball):

```
rescale(x, K₁, K₂) = (K₂ / K₁) · x
```

### Multi-Curvature Embeddings

**Approach**: different dimensions/layers have different curvatures.

**Product space**:

```
ℍ^{n₁}(κ₁) × ℍ^{n₂}(κ₂) × ... × ℍ^{nₖ}(κₖ)
```

**Distance**:

```
d_product((x₁,...,xₖ), (y₁,...,yₖ)) = √(Σ_i w_i² d²(x_i, y_i))

where w_i are learnable weights
```

---

## Numerical Stability

### Poincaré Ball Instabilities

**Problem 1**: division by zero as ||x|| → K (points drift to the boundary)

**Solution**: clip to a maximum norm strictly inside the ball:

```
x_safe = x / max(1, ||x|| / (K(1 - ε))),   ε = 1e-5
```

**Problem 2**: loss of precision in the Möbius-addition coefficients

**Solution**: compute the scalar coefficient ratio with log1p (shown for K = 1; both arguments stay above -1 inside the ball):

```
Instead of:
  (1 + 2⟨x,y⟩ + ||y||²) / (1 + 2⟨x,y⟩ + ||x||² ||y||²)

Use:
  exp(log1p(2⟨x,y⟩ + ||y||²) - log1p(2⟨x,y⟩ + ||x||² ||y||²))
```

### Lorentz Model Stability

**Advantage**: no boundary singularities!

**Constraint enforcement**:

```
After each update, project back to the hyperboloid:
x₀ = √(K² + x₁² + ... + xₙ²)
```

**Distance computation** (stable, with arcosh expanded as a logarithm):

```
d_L(x, y) = K · log((-⟨x,y⟩_L + √(⟨x,y⟩_L² - K⁴)) / K²)
```

### Mixed Precision

**Strategy**:

- **FP16** for the forward pass (speed)
- **FP32** for gradients (stability)
- **FP64** for curvature updates (critical)

**GeoOpt recommendation**: use at least FP32 for hyperbolic operations.
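The clipping, projection, and stable-distance tricks above can be sketched as follows (numpy, hypothetical helper names; the curvature radius K is passed explicitly):

```python
import numpy as np

EPS = 1e-5

def clip_to_ball(x, K=1.0):
    """Clip a point so it stays strictly inside the Poincaré ball of radius K."""
    n = np.linalg.norm(x)
    max_n = K * (1.0 - EPS)
    return x if n <= max_n else x * (max_n / n)

def project_to_hyperboloid(x, K=1.0):
    """Restore <x,x>_L = -K^2 by recomputing the time coordinate x0."""
    x = x.copy()
    x[0] = np.sqrt(K**2 + np.dot(x[1:], x[1:]))
    return x

def lorentz_dist(x, y, K=1.0):
    """Hyperboloid distance; the arcosh argument is clamped to >= 1
    to guard against rounding pushing it just below the domain."""
    inner = -x[0] * y[0] + np.dot(x[1:], y[1:])
    z = np.maximum(-inner / K**2, 1.0)
    return K * np.arccosh(z)

o = project_to_hyperboloid(np.array([0.0, 0.0, 0.0]))
p = project_to_hyperboloid(np.array([0.0, 0.6, 0.8]))
assert lorentz_dist(o, o) == 0.0   # clamp makes d(x, x) exactly zero
assert lorentz_dist(o, p) > 0.0
```

Without the clamp, `d(x, x)` can evaluate to NaN in floating point because `-<x,x>_L / K^2` may round to slightly below 1.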
---

## Complexity Analysis

### Space Complexity

**Poincaré ball**:

- Point: O(n) storage (same as Euclidean)
- No auxiliary structures needed

**Lorentz**:

- Point: O(n+1) storage (one extra time coordinate)
- Constraint: ⟨x,x⟩_L = -K²

**Curvature**:

- Shared K: O(1) extra parameter
- Per-layer K: O(L) for L layers
- Per-dimension K: O(n) parameters

### Time Complexity

| Operation | Euclidean | Poincaré | Lorentz |
|-----------|-----------|----------|---------|
| **Distance** | O(n) | O(n) | O(n) |
| **Addition** | O(n) | O(n) | O(n) |
| **Exp/Log** | - | O(n) | O(n) |
| **Linear layer** | O(n²) | O(n²) | O(n²) |
| **Attention** | O(n²d) | O(n²d) | O(n²d) |
| **Linear attention** | O(nd²) | O(nd²) | O(nd²) |

**Key insight**: the asymptotic complexity is the **same as Euclidean**!

**Constants**: hyperbolic ops are roughly 2-5x slower (more FLOPs per operation)

**SIMD optimization**: vectorization can recover 8-50x over scalar code, making well-tuned hyperbolic kernels competitive with, and sometimes faster than, naive Euclidean implementations.

---

## Proofs of Key Properties

### Theorem 1: Möbius Addition Preserves the Poincaré Ball

**Statement**: If x, y ∈ 𝔹ⁿ(K) (the Poincaré ball), then x ⊕_K y ∈ 𝔹ⁿ(K).

**Proof sketch** (K = 1): a direct expansion yields the identity

```
1 - ||x ⊕ y||² = (1 - ||x||²)(1 - ||y||²) / (1 + 2⟨x,y⟩ + ||x||² ||y||²)
```

By Cauchy-Schwarz the denominator is at least (1 - ||x|| ||y||)² > 0, and both numerator factors are positive for ||x||, ||y|| < 1. Hence 1 - ||x ⊕ y||² > 0, i.e. x ⊕ y stays in the ball. ∎

### Theorem 2: Exponential Map is a Diffeomorphism

**Statement**: exp_x: T_xℍⁿ → ℍⁿ is a diffeomorphism for each x.

**Proof sketch**:

- ℍⁿ is complete, simply connected, and non-positively curved, so the Cartan-Hadamard theorem applies
- The inverse is given explicitly by log_x
- Both maps are smooth (indeed analytic), and the differential of exp_x has full rank everywhere ∎

### Theorem 3: Capacity Advantage

**Statement**: every finite tree embeds in ℍ² with distortion at most 1 + ε for any ε > 0 (Sarkar's construction), whereas embedding the complete binary tree on n nodes into Euclidean space of **any** dimension incurs distortion Ω(√(log n)).
**Proof sketch**:

- The hyperbolic plane has exponential volume growth: V(r) ~ exp(r)
- Trees have exponentially many nodes: N(depth d) ~ exp(d)
- Volume growth matches tree growth, so subtrees can be placed without crowding, giving distortion 1 + ε
- The Euclidean plane has only polynomial volume growth: V(r) ~ r²
- Exponentially many leaves cannot be placed at radius r without stretching, which forces Ω(√(log n)) distortion

---

## Implementation Checklist

### Poincaré Ball Implementation

- [ ] Möbius addition with curvature K
- [ ] Exponential map with numerical stability
- [ ] Logarithmic map with safe artanh
- [ ] Distance function with clipping
- [ ] Parallel transport
- [ ] Gradient clipping to keep points away from the boundary

### Lorentz Model Implementation

- [ ] Minkowski inner product
- [ ] Hyperboloid constraint projection
- [ ] Exponential map
- [ ] Distance function
- [ ] Lorentz boost and rotation
- [ ] Conversion to/from Poincaré

### Hyperbolic Attention

- [ ] Hyperbolic query/key/value projections
- [ ] Distance-based similarity
- [ ] Softmax with temperature
- [ ] Möbius weighted aggregation
- [ ] Linear attention kernel approximation

### Learnable Curvature

- [ ] Curvature parameter K with positivity constraint
- [ ] Gradient computation w.r.t. K
- [ ] Coupled optimization with rescaling
- [ ] Per-layer or per-head curvature

### SIMD Optimizations

- [ ] Vectorized Möbius addition (AVX2)
- [ ] Batch distance computation
- [ ] Fused exp/log operations
- [ ] Cache-aligned memory layout

---

## References

**Textbooks**:

1. M. do Carmo, *Riemannian Geometry*
2. J. Ratcliffe, *Foundations of Hyperbolic Manifolds*

**Papers**:

1. Ganea et al., "Hyperbolic Neural Networks" (NeurIPS 2018)
2. Hypformer (KDD 2024) - linear attention formulation
3.
Fully Hyperbolic Neural Networks (ACL 2022) - Lorentz model analysis

**Software**:

- **GeoOpt**: PyTorch library for Riemannian optimization
- **Hyperbolic Image Embeddings**: reference implementation

---

## Conclusion

Hyperbolic geometry provides a mathematically rigorous framework for hierarchical neural representations with:

- **Provable capacity**: exponential volume growth matches the branching of trees, enabling low-distortion hierarchical embeddings
- **Stable operations**: the Lorentz model is numerically more robust than the Poincaré ball
- **Efficient algorithms**: O(n²d) attention with the same asymptotics as Euclidean, plus an O(nd²) linear variant
- **Learnable curvature**: the geometry adapts to the hierarchy in the data

All operations have **closed-form expressions** and **computable gradients**, making them suitable for modern automatic differentiation frameworks.
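As a closing illustration, the distance-based attention described earlier can be prototyped end to end in a few lines. This sketch uses numpy, unit curvature, and a tangent-space average in place of the exact Möbius weighted sum (a common simplification); all names are illustrative:

```python
import numpy as np

def poincare_dist(x, y):
    """Poincaré-ball geodesic distance with K = 1."""
    diff = np.sum((x - y) ** 2)
    denom = (1.0 - np.sum(x**2)) * (1.0 - np.sum(y**2))
    return np.arccosh(1.0 + 2.0 * diff / denom)

def exp0(v):
    n = np.linalg.norm(v)
    return v if n < 1e-15 else np.tanh(n) * v / n

def log0(x):
    n = np.linalg.norm(x)
    return x if n < 1e-15 else np.arctanh(n) * x / n

def hyp_attention(Q, K_, V, tau=1.0):
    """Distance-based attention; values are aggregated in the tangent space at 0."""
    out = np.zeros_like(V)
    for i, q in enumerate(Q):
        # Similarity: negative squared geodesic distance, softmax-normalized.
        scores = np.array([-poincare_dist(q, k) ** 2 / tau for k in K_])
        w = np.exp(scores - scores.max())
        w /= w.sum()
        # Weighted Möbius combination approximated by a tangent-space average.
        out[i] = exp0(sum(wj * log0(vj) for wj, vj in zip(w, V)))
    return out

Q = np.array([[0.1, 0.2], [0.0, -0.3]])
V = np.array([[0.2, 0.0], [0.0, 0.2]])
out = hyp_attention(Q, Q, V)
assert out.shape == (2, 2)
# Outputs remain strictly inside the unit ball.
assert np.all(np.linalg.norm(out, axis=1) < 1.0)
```

Each output row is a softmax-weighted combination of the values, aggregated in the tangent space at the origin, so it stays strictly inside the unit ball by construction.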