# Geometric Foundations of Hyperbolic Attention
## Mathematical Prerequisites
This document provides rigorous mathematical foundations for implementing hyperbolic attention mechanisms with **provable geometric properties**.
---
## Table of Contents
1. [Hyperbolic Geometry Basics](#hyperbolic-geometry-basics)
2. [Poincaré Ball Model](#poincaré-ball-model)
3. [Lorentz (Hyperboloid) Model](#lorentz-hyperboloid-model)
4. [Isometries and Transformations](#isometries-and-transformations)
5. [Hyperbolic Neural Operations](#hyperbolic-neural-operations)
6. [Attention Mechanisms in Hyperbolic Space](#attention-mechanisms-in-hyperbolic-space)
7. [Curvature Adaptation](#curvature-adaptation)
8. [Numerical Stability](#numerical-stability)
9. [Complexity Analysis](#complexity-analysis)
---
## Hyperbolic Geometry Basics
### Definition
**Hyperbolic space** ℍⁿ is a complete, simply-connected Riemannian manifold of constant **negative curvature** κ < 0.
**Key Properties**:
1. **Exponential volume growth**: Volume of ball of radius r grows as ~exp(r√|κ|)
2. **Unique geodesics**: Any two points connected by unique shortest path
3. **Thin triangles**: sum of interior angles < π (vs = π in Euclidean)
4. **Tree embedding**: Finite trees embed with arbitrarily low distortion in ℍ²
### Curvature Parameter
Define **curvature radius** K > 0 such that κ = -1/K².
**Normalization**:
- **κ = -1**: Unit hyperbolic space (mathematical convention)
- **κ = -1/K²**: Learnable curvature (K is learned parameter)
### Models of Hyperbolic Space
Five isometric models:
1. **Poincaré ball**: {x ∈ ℝⁿ : ||x|| < 1}
2. **Lorentz (hyperboloid)**: {x ∈ ℝⁿ⁺¹ : ⟨x,x⟩_L = -1, x₀ > 0}
3. **Poincaré half-space**: {x ∈ ℝⁿ : xₙ > 0}
4. **Klein disk**: {x ∈ ℝⁿ : ||x|| < 1}
5. **Hemisphere**
We focus on **Poincaré ball** (intuitive) and **Lorentz** (stable).
---
## Poincaré Ball Model
### Metric
**Riemannian metric**:
```
ds² = 4 / (1 - ||x||²/K²)² · ||dx||²
```
**Distance between points x, y**:
```
d_P(x, y) = K · arcosh(1 + 2||x - y||² / (K² (1 - ||x||²/K²)(1 - ||y||²/K²)))
```
**Simplified formula** (numerically stable):
```
d_P(x, y) = 2K · artanh(||(-x) ⊕_K y|| / K)
```
### Möbius Gyrovector Operations
**Möbius Addition** (generalized):
```
x ⊕_K y = ((1 + 2⟨x,y⟩/K² + ||y||²/K²)x + (1 - ||x||²/K²)y) /
(1 + 2⟨x,y⟩/K² + ||x||²||y||²/K⁴)
```
**Special case** (K = 1):
```
x ⊕ y = ((1 + 2⟨x,y⟩ + ||y||²)x + (1 - ||x||²)y) /
(1 + 2⟨x,y⟩ + ||x||²||y||²)
```
**Properties**:
- **Identity**: x ⊕ 0 = x
- **Inverse**: x ⊕ (-x) = 0 (the Möbius inverse of x is -x)
- **Non-commutative**: x ⊕ y ≠ y ⊕ x (in general)
- **Non-associative**: (x ⊕ y) ⊕ z ≠ x ⊕ (y ⊕ z)
**Computational Complexity**: O(n) for n-dimensional vectors
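As a concreteness check, the generalized Möbius addition above can be sketched directly in NumPy (`mobius_add` is an illustrative name, not the API of any particular library):

```python
import numpy as np

def mobius_add(x, y, K=1.0):
    """Mobius addition on the Poincare ball of radius K (curvature -1/K^2).
    Sketch of the closed form above; assumes x, y strictly inside the ball."""
    xy = np.dot(x, y) / K**2
    x2 = np.dot(x, x) / K**2
    y2 = np.dot(y, y) / K**2
    num = (1 + 2 * xy + y2) * x + (1 - x2) * y
    den = 1 + 2 * xy + x2 * y2
    return num / den
```

The identity, inverse, and non-commutativity properties listed above can then be verified numerically on sample points.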
### Exponential and Logarithmic Maps
**Exponential Map** (tangent space → manifold):
```
exp_x^K(v) = x ⊕_K (K · tanh(λ_x ||v|| / 2K) · v / ||v||)
where λ_x = 2 / (1 - ||x||²/K²)   (conformal factor at x)
```
**Logarithmic Map** (manifold → tangent space):
```
log_x^K(y) = (2K / λ_x) · artanh(||(-x) ⊕_K y|| / K) ·
             ((-x) ⊕_K y) / ||(-x) ⊕_K y||
where λ_x = 2 / (1 - ||x||²/K²)
```
**Usage**:
- **exp**: Apply Euclidean gradients to hyperbolic points
- **log**: Compute "hyperbolic difference" between points
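At the origin both maps reduce to simple radial formulas (λ_0 = 2), which makes the round-trip property log_0(exp_0(v)) = v easy to check. A minimal sketch with illustrative names `exp0`/`log0`:

```python
import numpy as np

def exp0(v, K=1.0):
    """Exponential map at the origin of the Poincare ball (radial form)."""
    n = np.linalg.norm(v)
    if n < 1e-12:
        return np.zeros_like(v)
    return K * np.tanh(n / K) * v / n

def log0(y, K=1.0):
    """Logarithmic map at the origin (inverse of exp0)."""
    n = np.linalg.norm(y)
    if n < 1e-12:
        return np.zeros_like(y)
    return K * np.arctanh(n / K) * y / n
```

Note that exp0 maps arbitrarily large tangent vectors to points strictly inside the ball, which is exactly the "exponential compression" the model relies on.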
### Parallel Transport
**Problem**: Moving tangent vectors along geodesics while preserving inner products.
**Formula** (transport v from x to y):
```
P_{x→y}(v) = (λ_x / λ_y) · gyr[y, -x] v
where:
  λ_x = 2 / (1 - ||x||²/K²)
  gyr[a, b] v = -(a ⊕_K b) ⊕_K (a ⊕_K (b ⊕_K v))   (gyration operator)
```
---
## Lorentz (Hyperboloid) Model
### Minkowski Space
**Ambient space**: ℝⁿ⁺¹ with **Minkowski inner product**:
```
⟨x, y⟩_L = -x₀y₀ + x₁y₁ + ... + xₙyₙ
```
**Hyperboloid constraint**:
```
ℍⁿ = {x ∈ ℝⁿ⁺¹ : ⟨x, x⟩_L = -K², x₀ > 0}
```
### Distance
**Formula**:
```
d_L(x, y) = K · arcosh(-⟨x, y⟩_L / K²)
```
**Numerically stable variant**:
```
d_L(x, y) = K · ln(-⟨x, y⟩_L / K² + √((-⟨x, y⟩_L / K²)² - 1))
```
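A minimal sketch of the Minkowski inner product and this distance, with a clamp to absorb round-off near the arcosh domain boundary (function names are illustrative):

```python
import numpy as np

def minkowski_dot(x, y):
    """Minkowski inner product <x,y>_L = -x0*y0 + x1*y1 + ... + xn*yn."""
    return -x[0] * y[0] + np.dot(x[1:], y[1:])

def lorentz_dist(x, y, K=1.0):
    """Geodesic distance on the hyperboloid <x,x>_L = -K^2, x0 > 0."""
    u = -minkowski_dot(x, y) / K**2
    # Round-off can push u slightly below 1 for nearby points; clamp it.
    return K * np.arccosh(max(u, 1.0))
```

For K = 1, the point (cosh t, sinh t, 0) lies at geodesic distance t from the "origin" (1, 0, 0), which gives a direct numerical check.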
### Exponential Map
**Formula**:
```
exp_x^L(v) = cosh(||v||_L / K) x + K · sinh(||v||_L / K) · v / ||v||_L
where ||v||_L = √⟨v, v⟩_L (Minkowski norm; real for space-like tangent vectors)
```
### Lorentz Transformations
**Lorentz Boost** (translation along time-like direction):
```
Boost(x)₀  = cosh(φ) x₀ + sinh(φ) ⟨x_s, v̂⟩
Boost(x)_s = x_s + (cosh(φ) - 1) ⟨x_s, v̂⟩ v̂ + sinh(φ) x₀ v̂
where:
  x_s = spatial part of x, v̂ = unit spatial direction, φ = rapidity
```
**Lorentz Rotation** (rotation in space-like plane):
```
R_θ(x) = x + sin(θ)(e₂e₁ᵀ - e₁e₂ᵀ)x + (cos(θ) - 1)(e₁e₁ᵀ + e₂e₂ᵀ)x
where e₁, e₂ are orthonormal space-like vectors spanning the rotation plane
```
---
## Isometries and Transformations
### Möbius Transformations (Poincaré Ball)
**General form**:
```
M(x) = (Ax + b) / (⟨c, x⟩ + d)
subject to: A ∈ SO(n), ad - ⟨b, c⟩ = 1
```
**Special case - Translation**:
```
T_a(x) = (-a) ⊕ x
```
### Gyrovector Multiplication
**Scalar multiplication**:
```
r ⊗ x = K · tanh(r · artanh(||x|| / K)) · x / ||x||
for r ∈ ℝ, x ∈ 𝔹ⁿ(K)
```
**Properties**:
- (r + s) ⊗ x = (r ⊗ x) ⊕ (s ⊗ x) (scalar distributivity)
- r ⊗ (s ⊗ x) = (rs) ⊗ x (scalar associativity)
- r ⊗ (x ⊕ y) ≠ (r ⊗ x) ⊕ (r ⊗ y) in general (no distributivity over ⊕)
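These properties are easy to verify numerically from a sketch of r ⊗ x, together with the distance-from-origin formula d(0, x) = 2K · artanh(||x||/K) (illustrative names):

```python
import numpy as np

def gyro_scale(r, x, K=1.0):
    """Gyrovector scalar multiplication r (x) on the Poincare ball."""
    n = np.linalg.norm(x)
    if n < 1e-12:
        return np.zeros_like(x)
    return K * np.tanh(r * np.arctanh(n / K)) * x / n

def dist0(x, K=1.0):
    """Distance from the origin: d(0, x) = 2K * artanh(||x|| / K)."""
    return 2 * K * np.arctanh(np.linalg.norm(x) / K)
```

Scalar multiplication rescales geodesic distance from the origin exactly: d(0, r ⊗ x) = r · d(0, x).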
---
## Hyperbolic Neural Operations
### Hyperbolic Linear Layer
**Euclidean linear layer**: y = Wx + b
**Hyperbolic equivalent**:
```
y = exp_0(W · log_0(x) + b)
```
**Steps**:
1. Map x from manifold to tangent space at origin: v = log_0(x)
2. Apply Euclidean linear transformation: v' = Wv + b
3. Map back to manifold: y = exp_0(v')
**Learnable parameters**: W ∈ ℝᵐˣⁿ, b ∈ ℝᵐ
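The three steps can be sketched end-to-end with the origin maps in closed form (a minimal NumPy version; `hyp_linear` is an illustrative name, not the API of a library such as GeoOpt):

```python
import numpy as np

def exp0(v, K=1.0):
    n = np.linalg.norm(v)
    return np.zeros_like(v) if n < 1e-12 else K * np.tanh(n / K) * v / n

def log0(y, K=1.0):
    n = np.linalg.norm(y)
    return np.zeros_like(y) if n < 1e-12 else K * np.arctanh(n / K) * y / n

def hyp_linear(x, W, b, K=1.0):
    """Hyperbolic linear layer: pull x to the tangent space at the origin,
    apply a Euclidean affine map, push the result back to the manifold."""
    return exp0(W @ log0(x, K) + b, K)
```

With W = I and b = 0 the layer is the identity, and any W, b keep the output strictly inside the ball because exp0 saturates at radius K.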
### Hyperbolic ReLU
**Problem**: ReLU is defined in tangent space, not on manifold.
**Solution**:
```
ReLU_hyp(x) = exp_0(ReLU(log_0(x)))
```
**Component-wise variant**:
```
ReLU_hyp(x)_i = exp_0,i(max(0, log_0(x)_i))
```
### Hyperbolic Batch Normalization
**Challenge**: Mean and variance are Euclidean concepts.
**Hyperbolic mean** (Fréchet mean):
```
μ = argmin_p Σ_i d(p, x_i)²
```
**Approximation** (geodesic midpoint):
```
μ ≈ exp_0(mean(log_0(x_1), ..., log_0(x_n)))
```
**Normalization**:
```
x_norm = exp_μ((log_μ(x) - μ_tangent) / σ_tangent)
where μ_tangent, σ_tangent = mean and std of {log_μ(x_i)} in the tangent space at μ
```
---
## Attention Mechanisms in Hyperbolic Space
### Hyperbolic Dot-Product Attention
**Euclidean attention**:
```
Attention(Q, K, V) = softmax(QKᵀ / √d) V
```
**Hyperbolic variant**:
```
Attention_hyp(Q, K, V)_i = ⊕_j (α_ij ⊗ V_j),   α_ij = softmax_j(-d(Q_i, K_j)² / τ)
```
**Components**:
1. **Similarity**: -d(q, k)² (negative squared distance)
2. **Normalization**: softmax with temperature τ
3. **Aggregation**: Möbius weighted sum
**Complexity**: O(n²d) for n tokens, d dimensions (same as Euclidean)
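A runnable sketch of this attention, with one simplification flagged loudly: the Möbius weighted sum ⊕_j (α_ij ⊗ V_j) is approximated by a weighted mean in the tangent space at the origin, a common tractable stand-in rather than the exact gyro-sum. All names are illustrative:

```python
import numpy as np

def mobius_add(x, y, K=1.0):
    xy = np.dot(x, y) / K**2
    x2 = np.dot(x, x) / K**2
    y2 = np.dot(y, y) / K**2
    return ((1 + 2*xy + y2) * x + (1 - x2) * y) / (1 + 2*xy + x2 * y2)

def poincare_dist(x, y, K=1.0):
    # d_P(x, y) = 2K * artanh(||(-x) (+) y|| / K)
    return 2 * K * np.arctanh(np.linalg.norm(mobius_add(-x, y, K)) / K)

def _log0(y, K):
    n = np.linalg.norm(y)
    return np.zeros_like(y) if n < 1e-12 else K * np.arctanh(n / K) * y / n

def _exp0(v, K):
    n = np.linalg.norm(v)
    return np.zeros_like(v) if n < 1e-12 else K * np.tanh(n / K) * v / n

def hyp_attention(Q, Km, V, tau=1.0, K=1.0):
    """Negative-squared-distance scores + softmax, then a tangent-space
    weighted mean as the aggregation step (Mobius-sum approximation)."""
    scores = np.array([[-poincare_dist(q, k, K)**2 / tau for k in Km] for q in Q])
    w = np.exp(scores - scores.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)
    Vt = np.array([_log0(v, K) for v in V])
    return np.array([_exp0(wi @ Vt, K) for wi in w])
```

With a single query/key/value the output must reproduce the value exactly, which gives a quick correctness check.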
### Hyperbolic Linear Attention (Hypformer)
**Problem**: Quadratic complexity O(n²)
**Solution**: Kernel approximation
```
⟨φ(q), φ(k)⟩ ≈ sim(q, k)   (feature map approximates the attention similarity)
Linear attention:
Attention_linear(Q, K, V) = (Σ_j φ(K_j)⊗V_j) ⊘ (Σ_j φ(K_j))
```
**Hyperbolic kernel** (proposal):
```
φ_hyp(x) = [cosh(||x||/K), sinh(||x||/K) · x/||x||]
Properties:
⟨φ_hyp(x), φ_hyp(y)⟩_L ≈ -cosh(d(x,y)/K)
```
**Complexity**: **O(nd²)** vs O(n²d)
**Speedup**: ~10x reported for n > 10d (Hypformer, KDD 2024)
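The claimed kernel identity can be checked numerically: interpreting ||x|| as a radial geodesic coordinate, the lift is exact for inputs along a common ray, where ⟨φ(x), φ(y)⟩_L = -cosh((||x|| - ||y||)/K). A sketch with illustrative names:

```python
import numpy as np

def phi_hyp(x, K=1.0):
    """Proposed feature map: lift x onto the hyperboloid, treating ||x||
    as a radial geodesic coordinate (sketch, not a library API)."""
    n = np.linalg.norm(x)
    if n < 1e-12:
        return np.concatenate([[1.0], np.zeros_like(x)])
    return np.concatenate([[np.cosh(n / K)], np.sinh(n / K) * x / n])

def minkowski_dot(a, b):
    return -a[0] * b[0] + np.dot(a[1:], b[1:])
```

For non-collinear inputs the identity holds only approximately, which is the source of the "≈" above.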
### Multi-Head Hyperbolic Attention
**Extension**:
```
MultiHead(Q, K, V) = Concat(head₁, ..., headₕ) W^O
where head_i = Attention_hyp(QW_i^Q, KW_i^K, VW_i^V)
```
**Learnable per-head curvature**:
```
head_i operates in space with curvature κ_i
```
**Rationale**: Different heads capture different hierarchical depths.
---
## Curvature Adaptation
### Learnable Curvature
**Parameterization**: K ∈ ℝ⁺ (learned via gradient descent)
**Gradient w.r.t. curvature**:
```
∂L/∂K = ∂L/∂d · ∂d/∂K
where:
∂d/∂K = ∂/∂K[K · arcosh(1 + 2||x-y||²/((1-||x||²/K²)(1-||y||²/K²)))]
```
**Numerical trick**: Reparameterize as K = exp(k) to ensure K > 0.
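The reparameterization is one line; a sketch showing the induced chain rule ∂L/∂k = ∂L/∂K · K (illustrative names):

```python
import numpy as np

def curvature_and_grad(k, dL_dK):
    """Reparameterize K = exp(k) so K > 0 always holds; the chain rule
    converts a gradient w.r.t. K into one w.r.t. the free parameter k."""
    K = np.exp(k)
    return K, dL_dK * K
```

An alternative with the same effect is a softplus reparameterization, which behaves more linearly for large k.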
### Coupled Optimization
**Problem**: Naively updating K breaks Riemannian optimizer assumptions.
**Solution** (from "Optimizing Curvature Learning" 2024):
```
1. Compute gradients in current manifold (curvature K_old)
2. Update parameters: θ_new = RiemannianSGD(θ, ∇_θ L, K_old)
3. Update curvature: K_new = K_old - α · ∂L/∂K
4. Rescale parameters to new manifold:
θ_rescaled = rescale_curvature(θ_new, K_old, K_new)
```
**Rescaling formula** (Poincaré ball):
```
rescale(x, K₁, K₂) = (K₂ / K₁) · x
```
### Multi-Curvature Embeddings
**Approach**: Different dimensions/layers have different curvatures.
**Product space**:
```
ℍ^{n₁}(κ₁) × ℍ^{n₂}(κ₂) × ... × ℍ^{nₖ}(κₖ)
```
**Distance**:
```
d_product((x₁,...,xₖ), (y₁,...,yₖ)) = √(Σ_i w_i² d²(x_i, y_i))
where w_i are learnable weights
```
---
## Numerical Stability
### Poincaré Ball Instabilities
**Problem 1**: Division by zero when ||x|| → 1
**Solution**: Clip to maximum norm
```
x_safe = x / max(1, ||x|| / (K(1 - ε)))
where ε = 1e-5 (keeps ||x_safe|| ≤ K(1 - ε))
```
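A sketch of this clipping step, generalized to ball radius K (illustrative name):

```python
import numpy as np

def clip_to_ball(x, K=1.0, eps=1e-5):
    """Project x back just inside the Poincare ball of radius K: points
    already inside pass through unchanged, outliers are rescaled radially."""
    max_norm = K * (1 - eps)
    n = np.linalg.norm(x)
    return x if n <= max_norm else x * (max_norm / n)
```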
**Problem 2**: Möbius addition overflow
**Solution**: Rewrite using log1p, expm1
```
Instead of: (1 + 2⟨x,y⟩ + ||y||²) / (1 + 2⟨x,y⟩ + ||x||²||y||²)
Use: exp(log1p(2⟨x,y⟩ + ||y||²) - log1p(2⟨x,y⟩ + ||x||²||y||²))
```
### Lorentz Model Stability
**Advantage**: No boundary singularities!
**Constraint enforcement**:
```
After each update, project back to hyperboloid:
x₀ = √(K² + x₁² + ... + xₙ²)
```
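The projection itself is a one-liner; a sketch with a constraint check (illustrative names):

```python
import numpy as np

def minkowski_dot(a, b):
    return -a[0] * b[0] + np.dot(a[1:], b[1:])

def project_hyperboloid(x, K=1.0):
    """Restore <x,x>_L = -K^2 by recomputing the time coordinate from
    the spatial coordinates; also enforces x0 > 0."""
    out = np.asarray(x, dtype=float).copy()
    out[0] = np.sqrt(K**2 + np.dot(out[1:], out[1:]))
    return out
```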
**Geodesic computation** (stable):
```
d_L(x, y) = K · log((-⟨x,y⟩ + √(⟨x,y⟩² - K⁴)) / K²)
```
### Mixed Precision
**Strategy**:
- **FP16** for forward pass (speed)
- **FP32** for gradients (stability)
- **FP64** for curvature updates (critical)
**GeoOpt recommendation**: Use FP32 minimum for hyperbolic operations.
---
## Complexity Analysis
### Space Complexity
**Poincaré Ball**:
- Point: O(n) storage (same as Euclidean)
- No auxiliary structures needed
**Lorentz**:
- Point: O(n) storage, using n+1 coordinates (one extra time-like dimension)
- Constraint: ⟨x,x⟩_L = -K²
**Curvature**:
- Shared K: O(1) extra parameter
- Per-layer K: O(L) for L layers
- Per-dimension K: O(n) parameters
### Time Complexity
| Operation | Euclidean | Poincaré | Lorentz |
|-----------|-----------|----------|---------|
| **Distance** | O(n) | O(n) | O(n) |
| **Addition** | O(n) | O(n) | O(n) |
| **Exp/Log** | - | O(n) | O(n) |
| **Linear layer** | O(n²) | O(n²) | O(n²) |
| **Attention** | O(n²d) | O(n²d) | O(n²d) |
| **Linear attention** | O(nd²) | O(nd²) | O(nd²) |
**Key Insight**: Asymptotic complexity **same as Euclidean**!
**Constants**: Hyperbolic ops 2-5x slower (more FLOPs per operation)
**SIMD Optimization**: Vectorization can recover 8-50x kernel-level speedups, making hyperbolic ops competitive with (or faster than) an unvectorized Euclidean baseline.
---
## Proofs of Key Properties
### Theorem 1: Möbius Addition Preserves Poincaré Ball
**Statement**: If x, y ∈ 𝔹ⁿ(K) (Poincaré ball), then x ⊕_K y ∈ 𝔹ⁿ(K).
**Proof sketch**:
```
Let ||x||² / K² = a², ||y||² / K² = b², ⟨x,y⟩ / K² = c
where a, b < 1.
||x ⊕_K y||² / K² = ||(1+2c+b²)x + (1-a²)y||² / (1+2c+a²b²)²
                  ≤ ((1+2c+b²)a + (1-a²)b)² / (1+2c+a²b²)²   (triangle inequality on the numerator)
                  < 1   (expand both sides and compare term by term, using a, b < 1)
```
### Theorem 2: Exponential Map is Diffeomorphism
**Statement**: exp_x: T_xℍⁿ → ℍⁿ is a diffeomorphism for each x ∈ ℍⁿ.
**Proof**:
- Inverse given by log_x
- Both are smooth (analytic)
- Jacobian is full rank everywhere
- QED.
### Theorem 3: Capacity Advantage
**Statement**: Any n-node tree embeds in ℍ² with distortion arbitrarily close to 1 (Sarkar's construction), whereas achieving comparably low distortion in ℝᵏ requires dimension k that grows with n.
**Proof Sketch**:
- Hyperbolic plane has exponential volume: V(r) ~ exp(r)
- Trees have exponential node count: N(depth d) ~ exp(d)
- Volume growth matches tree growth → O(1) average distortion
- Euclidean plane has polynomial volume: V(r) ~ r²
- Tree growth outpaces Euclidean volume growth → distortion must grow with n
---
## Implementation Checklist
### Poincaré Ball Implementation
- [ ] Möbius addition with curvature K
- [ ] Exponential map with numerical stability
- [ ] Logarithmic map with safe arctanh
- [ ] Distance function with clipping
- [ ] Parallel transport
- [ ] Gradient clipping to prevent boundary
### Lorentz Model Implementation
- [ ] Minkowski inner product
- [ ] Hyperboloid constraint projection
- [ ] Exponential map
- [ ] Distance function
- [ ] Lorentz boost and rotation
- [ ] Conversion to/from Poincaré
### Hyperbolic Attention
- [ ] Hyperbolic query/key/value projections
- [ ] Distance-based similarity
- [ ] Softmax with temperature
- [ ] Möbius weighted aggregation
- [ ] Linear attention kernel approximation
### Learnable Curvature
- [ ] Curvature parameter K with positive constraint
- [ ] Gradient computation w.r.t. K
- [ ] Coupled optimization with rescaling
- [ ] Per-layer or per-head curvature
### SIMD Optimizations
- [ ] Vectorized Möbius addition (AVX2)
- [ ] Batch distance computation
- [ ] Fused exp/log operations
- [ ] Cache-aligned memory layout
---
## References
**Textbooks**:
1. "Riemannian Geometry" - do Carmo
2. "Foundations of Hyperbolic Manifolds" - Ratcliffe
**Papers**:
1. Ganea et al., "Hyperbolic Neural Networks" (NeurIPS 2018)
2. Hypformer (KDD 2024) - Linear attention formulation
3. Fully Hyperbolic NNs (ACL 2022) - Lorentz model analysis
**Software**:
- **GeoOpt**: PyTorch library for Riemannian optimization
- **Hyperbolic Image Embeddings**: Reference implementation
---
## Conclusion
Hyperbolic geometry provides a mathematically rigorous framework for hierarchical neural representations with:
- **Provable capacity**: exponential volume growth matches tree-like hierarchies, vs polynomial growth in Euclidean space
- **Stable operations**: Lorentz model superior to Poincaré
- **Efficient algorithms**: attention matches Euclidean O(n²d), with an O(nd²) linear variant
- **Learnable curvature**: Adapt to data hierarchy
All operations have **closed-form solutions** and **computable gradients**, making them suitable for modern automatic differentiation frameworks.