# Geometric Foundations of Hyperbolic Attention
## Mathematical Prerequisites
This document provides rigorous mathematical foundations for implementing hyperbolic attention mechanisms with **provable geometric properties**.
---
## Table of Contents
1. [Hyperbolic Geometry Basics](#hyperbolic-geometry-basics)
2. [Poincaré Ball Model](#poincaré-ball-model)
3. [Lorentz (Hyperboloid) Model](#lorentz-hyperboloid-model)
4. [Isometries and Transformations](#isometries-and-transformations)
5. [Hyperbolic Neural Operations](#hyperbolic-neural-operations)
6. [Attention Mechanisms in Hyperbolic Space](#attention-mechanisms-in-hyperbolic-space)
7. [Curvature Adaptation](#curvature-adaptation)
8. [Numerical Stability](#numerical-stability)
9. [Complexity Analysis](#complexity-analysis)
---
## Hyperbolic Geometry Basics
### Definition
**Hyperbolic space** ℍⁿ is a complete, simply-connected Riemannian manifold of constant **negative curvature** κ < 0.
**Key Properties**:
1. **Exponential volume growth**: Volume of ball of radius r grows as ~exp(r√|κ|)
2. **Unique geodesics**: Any two points connected by unique shortest path
3. **Thin triangles**: sum of interior angles < π (vs = π in Euclidean)
4. **Tree embedding**: Finite trees embed with arbitrarily low distortion in ℍ²
### Curvature Parameter
Define **curvature radius** K > 0 such that κ = -1/K².
**Normalization**:
- **κ = -1**: Unit hyperbolic space (mathematical convention)
- **κ = -1/K²**: Learnable curvature (K is learned parameter)
### Models of Hyperbolic Space
Five isometric models:
1. **Poincaré ball**: {x ∈ ℝⁿ : ||x|| < 1}
2. **Lorentz (hyperboloid)**: {x ∈ ℝⁿ⁺¹ : ⟨x,x⟩_L = -1, x₀ > 0}
3. **Poincaré half-space**: {x ∈ ℝⁿ : xₙ > 0}
4. **Klein disk**: {x ∈ ℝⁿ : ||x|| < 1}
5. **Hemisphere**
We focus on **Poincaré ball** (intuitive) and **Lorentz** (stable).
---
## Poincaré Ball Model
### Metric
**Riemannian metric**:
```
ds² = 4 / (1 - ||x||²/K²)² · ||dx||²
```
**Distance between points x, y**:
```
d_P(x, y) = K · arcosh(1 + 2||x - y||² / (K² (1 - ||x||²/K²)(1 - ||y||²/K²)))
```
**Simplified formula** (numerically stable):
```
d_P(x, y) = 2K · artanh(||(-x) ⊕_K y|| / K)
```
### Möbius Gyrovector Operations
**Möbius Addition** (generalized):
```
x ⊕_K y = ((1 + 2⟨x,y⟩/K² + ||y||²/K²)x + (1 - ||x||²/K²)y) /
(1 + 2⟨x,y⟩/K² + ||x||²||y||²/K⁴)
```
**Special case** (K = 1):
```
x ⊕ y = ((1 + 2⟨x,y⟩ + ||y||²)x + (1 - ||x||²)y) /
(1 + 2⟨x,y⟩ + ||x||²||y||²)
```
**Properties**:
- **Identity**: x ⊕ 0 = x
- **Inverse**: x ⊕ (-x) = 0 (the Möbius inverse of x is -x)
- **Non-commutative**: x ⊕ y ≠ y ⊕ x (in general)
- **Non-associative**: (x ⊕ y) ⊕ z ≠ x ⊕ (y ⊕ z)
**Computational Complexity**: O(n) for n-dimensional vectors
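As a concreteness check, the generalized Möbius addition above can be sketched directly in NumPy (`mobius_add` is an illustrative name, not the API of any particular library):

```python
import numpy as np

def mobius_add(x, y, K=1.0):
    """Mobius addition on the Poincare ball of radius K (curvature -1/K^2).
    Sketch of the closed form above; assumes x, y strictly inside the ball."""
    xy = np.dot(x, y) / K**2
    x2 = np.dot(x, x) / K**2
    y2 = np.dot(y, y) / K**2
    num = (1 + 2 * xy + y2) * x + (1 - x2) * y
    den = 1 + 2 * xy + x2 * y2
    return num / den
```

The identity, inverse, and non-commutativity properties listed above can then be verified numerically on sample points.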
### Exponential and Logarithmic Maps
**Exponential Map** (tangent space → manifold):
```
exp_x^K(v) = x ⊕_K (K · tanh(λ_x ||v|| / 2K) · v / ||v||)
where λ_x = 2 / (1 - ||x||²/K²)   (conformal factor at x)
```
**Logarithmic Map** (manifold → tangent space):
```
log_x^K(y) = (2K / λ_x) · artanh(||(-x) ⊕_K y|| / K) ·
             ((-x) ⊕_K y) / ||(-x) ⊕_K y||
where λ_x = 2 / (1 - ||x||²/K²)
```
**Usage**:
- **exp**: Apply Euclidean gradients to hyperbolic points
- **log**: Compute "hyperbolic difference" between points
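At the origin both maps reduce to simple radial formulas (λ_0 = 2), which makes the round-trip property log_0(exp_0(v)) = v easy to check. A minimal sketch with illustrative names `exp0`/`log0`:

```python
import numpy as np

def exp0(v, K=1.0):
    """Exponential map at the origin of the Poincare ball (radial form)."""
    n = np.linalg.norm(v)
    if n < 1e-12:
        return np.zeros_like(v)
    return K * np.tanh(n / K) * v / n

def log0(y, K=1.0):
    """Logarithmic map at the origin (inverse of exp0)."""
    n = np.linalg.norm(y)
    if n < 1e-12:
        return np.zeros_like(y)
    return K * np.arctanh(n / K) * y / n
```

Note that exp0 maps arbitrarily large tangent vectors to points strictly inside the ball, which is exactly the "exponential compression" the model relies on.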
### Parallel Transport
**Problem**: Moving tangent vectors along geodesics while preserving inner products.
**Formula** (transport v from x to y):
```
P_{x→y}(v) = (λ_x / λ_y) · gyr[y, -x] v
where:
  λ_x = 2 / (1 - ||x||²/K²)
  gyr[a, b] v = -(a ⊕_K b) ⊕_K (a ⊕_K (b ⊕_K v))   (gyration operator)
```
---
## Lorentz (Hyperboloid) Model
### Minkowski Space
**Ambient space**: ℝⁿ⁺¹ with **Minkowski inner product**:
```
⟨x, y⟩_L = -x₀y₀ + x₁y₁ + ... + xₙyₙ
```
**Hyperboloid constraint**:
```
ℍⁿ = {x ∈ ℝⁿ⁺¹ : ⟨x, x⟩_L = -K², x₀ > 0}
```
### Distance
**Formula**:
```
d_L(x, y) = K · arcosh(-⟨x, y⟩_L / K²)
```
**Numerically stable variant**:
```
d_L(x, y) = K · ln(-⟨x, y⟩_L / K² + √((-⟨x, y⟩_L / K²)² - 1))
```
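A minimal sketch of the Minkowski inner product and this distance, with a clamp to absorb round-off near the arcosh domain boundary (function names are illustrative):

```python
import numpy as np

def minkowski_dot(x, y):
    """Minkowski inner product <x,y>_L = -x0*y0 + x1*y1 + ... + xn*yn."""
    return -x[0] * y[0] + np.dot(x[1:], y[1:])

def lorentz_dist(x, y, K=1.0):
    """Geodesic distance on the hyperboloid <x,x>_L = -K^2, x0 > 0."""
    u = -minkowski_dot(x, y) / K**2
    # Round-off can push u slightly below 1 for nearby points; clamp it.
    return K * np.arccosh(max(u, 1.0))
```

For K = 1, the point (cosh t, sinh t, 0) lies at geodesic distance t from the "origin" (1, 0, 0), which gives a direct numerical check.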
### Exponential Map
**Formula**:
```
exp_x^L(v) = cosh(||v||_L / K) x + K · sinh(||v||_L / K) · v / ||v||_L
where ||v||_L = √⟨v, v⟩_L (Minkowski norm; real for space-like tangent vectors)
```
### Lorentz Transformations
**Lorentz Boost** (translation along time-like direction):
```
Boost(x)₀  = cosh(φ) x₀ + sinh(φ) ⟨x_s, v̂⟩
Boost(x)_s = x_s + (cosh(φ) - 1) ⟨x_s, v̂⟩ v̂ + sinh(φ) x₀ v̂
where:
  x_s = spatial part of x, v̂ = unit spatial direction, φ = rapidity
```
**Lorentz Rotation** (rotation in space-like plane):
```
R_θ(x) = x + sin(θ)(e₂e₁ᵀ - e₁e₂ᵀ)x + (cos(θ) - 1)(e₁e₁ᵀ + e₂e₂ᵀ)x
where e₁, e₂ are orthonormal space-like vectors spanning the rotation plane
```
---
## Isometries and Transformations
### Möbius Transformations (Poincaré Ball)
**General form**:
```
M(x) = (Ax + b) / (⟨c, x⟩ + d)
subject to: A ∈ SO(n), ad - ⟨b, c⟩ = 1
```
**Special case - Translation**:
```
T_a(x) = (-a) ⊕ x
```
### Gyrovector Multiplication
**Scalar multiplication**:
```
r ⊗ x = K · tanh(r · artanh(||x|| / K)) · x / ||x||
for r ∈ ℝ, x ∈ 𝔹ⁿ(K)
```
**Properties**:
- (r + s) ⊗ x = (r ⊗ x) ⊕ (s ⊗ x) (scalar distributivity)
- r ⊗ (s ⊗ x) = (rs) ⊗ x (scalar associativity)
- r ⊗ (x ⊕ y) ≠ (r ⊗ x) ⊕ (r ⊗ y) in general (no distributivity over ⊕)
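These properties are easy to verify numerically from a sketch of r ⊗ x, together with the distance-from-origin formula d(0, x) = 2K · artanh(||x||/K) (illustrative names):

```python
import numpy as np

def gyro_scale(r, x, K=1.0):
    """Gyrovector scalar multiplication r (x) on the Poincare ball."""
    n = np.linalg.norm(x)
    if n < 1e-12:
        return np.zeros_like(x)
    return K * np.tanh(r * np.arctanh(n / K)) * x / n

def dist0(x, K=1.0):
    """Distance from the origin: d(0, x) = 2K * artanh(||x|| / K)."""
    return 2 * K * np.arctanh(np.linalg.norm(x) / K)
```

Scalar multiplication rescales geodesic distance from the origin exactly: d(0, r ⊗ x) = r · d(0, x).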
---
## Hyperbolic Neural Operations
### Hyperbolic Linear Layer
**Euclidean linear layer**: y = Wx + b
**Hyperbolic equivalent**:
```
y = exp_0(W · log_0(x) + b)
```
**Steps**:
1. Map x from manifold to tangent space at origin: v = log_0(x)
2. Apply Euclidean linear transformation: v' = Wv + b
3. Map back to manifold: y = exp_0(v')
**Learnable parameters**: W ∈ ℝᵐˣⁿ, b ∈ ℝᵐ
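The three steps can be sketched end-to-end with the origin maps in closed form (a minimal NumPy version; `hyp_linear` is an illustrative name, not the API of a library such as GeoOpt):

```python
import numpy as np

def exp0(v, K=1.0):
    n = np.linalg.norm(v)
    return np.zeros_like(v) if n < 1e-12 else K * np.tanh(n / K) * v / n

def log0(y, K=1.0):
    n = np.linalg.norm(y)
    return np.zeros_like(y) if n < 1e-12 else K * np.arctanh(n / K) * y / n

def hyp_linear(x, W, b, K=1.0):
    """Hyperbolic linear layer: pull x to the tangent space at the origin,
    apply a Euclidean affine map, push the result back to the manifold."""
    return exp0(W @ log0(x, K) + b, K)
```

With W = I and b = 0 the layer is the identity, and any W, b keep the output strictly inside the ball because exp0 saturates at radius K.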
### Hyperbolic ReLU
**Problem**: ReLU is defined in tangent space, not on manifold.
**Solution**:
```
ReLU_hyp(x) = exp_0(ReLU(log_0(x)))
```
**Component-wise variant**:
```
ReLU_hyp(x)_i = exp_0,i(max(0, log_0(x)_i))
```
### Hyperbolic Batch Normalization
**Challenge**: Mean and variance are Euclidean concepts.
**Hyperbolic mean** (Fréchet mean):
```
μ = argmin_p Σ_i d(p, x_i)²
```
**Approximation** (geodesic midpoint):
```
μ ≈ exp_0(mean(log_0(x_1), ..., log_0(x_n)))
```
**Normalization**:
```
x_norm = exp_μ((log_μ(x) - μ_tangent) / σ_tangent)
where μ_tangent, σ_tangent = mean and std of {log_μ(x_i)} in the tangent space at μ
```
---
## Attention Mechanisms in Hyperbolic Space
### Hyperbolic Dot-Product Attention
**Euclidean attention**:
```
Attention(Q, K, V) = softmax(QKᵀ / √d) V
```
**Hyperbolic variant**:
```
Attention_hyp(Q, K, V)_i = ⊕_j (α_ij ⊗ V_j),   α_ij = softmax_j(-d(Q_i, K_j)² / τ)
```
**Components**:
1. **Similarity**: -d(q, k)² (negative squared distance)
2. **Normalization**: softmax with temperature τ
3. **Aggregation**: Möbius weighted sum
**Complexity**: O(n²d) for n tokens, d dimensions (same as Euclidean)
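A runnable sketch of this attention, with one simplification flagged loudly: the Möbius weighted sum ⊕_j (α_ij ⊗ V_j) is approximated by a weighted mean in the tangent space at the origin, a common tractable stand-in rather than the exact gyro-sum. All names are illustrative:

```python
import numpy as np

def mobius_add(x, y, K=1.0):
    xy = np.dot(x, y) / K**2
    x2 = np.dot(x, x) / K**2
    y2 = np.dot(y, y) / K**2
    return ((1 + 2*xy + y2) * x + (1 - x2) * y) / (1 + 2*xy + x2 * y2)

def poincare_dist(x, y, K=1.0):
    # d_P(x, y) = 2K * artanh(||(-x) (+) y|| / K)
    return 2 * K * np.arctanh(np.linalg.norm(mobius_add(-x, y, K)) / K)

def _log0(y, K):
    n = np.linalg.norm(y)
    return np.zeros_like(y) if n < 1e-12 else K * np.arctanh(n / K) * y / n

def _exp0(v, K):
    n = np.linalg.norm(v)
    return np.zeros_like(v) if n < 1e-12 else K * np.tanh(n / K) * v / n

def hyp_attention(Q, Km, V, tau=1.0, K=1.0):
    """Negative-squared-distance scores + softmax, then a tangent-space
    weighted mean as the aggregation step (Mobius-sum approximation)."""
    scores = np.array([[-poincare_dist(q, k, K)**2 / tau for k in Km] for q in Q])
    w = np.exp(scores - scores.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)
    Vt = np.array([_log0(v, K) for v in V])
    return np.array([_exp0(wi @ Vt, K) for wi in w])
```

With a single query/key/value the output must reproduce the value exactly, which gives a quick correctness check.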
### Hyperbolic Linear Attention (Hypformer)
**Problem**: Quadratic complexity O(n²)
**Solution**: Kernel approximation
```
⟨φ(q), φ(k)⟩ ≈ sim(q, k)   (feature map approximates the attention similarity)
Linear attention:
Attention_linear(Q, K, V) = (Σ_j φ(K_j)⊗V_j) ⊘ (Σ_j φ(K_j))
```
**Hyperbolic kernel** (proposal):
```
φ_hyp(x) = [cosh(||x||/K), sinh(||x||/K) · x/||x||]
Properties:
⟨φ_hyp(x), φ_hyp(y)⟩_L ≈ -cosh(d(x,y)/K)
```
**Complexity**: **O(nd²)** vs O(n²d)
**Speedup**: ~10x reported for n > 10d (Hypformer, KDD 2024)
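The claimed kernel identity can be checked numerically: interpreting ||x|| as a radial geodesic coordinate, the lift is exact for inputs along a common ray, where ⟨φ(x), φ(y)⟩_L = -cosh((||x|| - ||y||)/K). A sketch with illustrative names:

```python
import numpy as np

def phi_hyp(x, K=1.0):
    """Proposed feature map: lift x onto the hyperboloid, treating ||x||
    as a radial geodesic coordinate (sketch, not a library API)."""
    n = np.linalg.norm(x)
    if n < 1e-12:
        return np.concatenate([[1.0], np.zeros_like(x)])
    return np.concatenate([[np.cosh(n / K)], np.sinh(n / K) * x / n])

def minkowski_dot(a, b):
    return -a[0] * b[0] + np.dot(a[1:], b[1:])
```

For non-collinear inputs the identity holds only approximately, which is the source of the "≈" above.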
### Multi-Head Hyperbolic Attention
**Extension**:
```
MultiHead(Q, K, V) = Concat(head₁, ..., headₕ) W^O
where head_i = Attention_hyp(QW_i^Q, KW_i^K, VW_i^V)
```
**Learnable per-head curvature**:
```
head_i operates in space with curvature κ_i
```
**Rationale**: Different heads capture different hierarchical depths.
---
## Curvature Adaptation
### Learnable Curvature
**Parameterization**: K ∈ ℝ⁺ (learned via gradient descent)
**Gradient w.r.t. curvature**:
```
∂L/∂K = ∂L/∂d · ∂d/∂K
where:
∂d/∂K = ∂/∂K[K · arcosh(1 + 2||x-y||²/((1-||x||²/K²)(1-||y||²/K²)))]
```
**Numerical trick**: Reparameterize as K = exp(k) to ensure K > 0.
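The reparameterization is one line; a sketch showing the induced chain rule ∂L/∂k = ∂L/∂K · K (illustrative names):

```python
import numpy as np

def curvature_and_grad(k, dL_dK):
    """Reparameterize K = exp(k) so K > 0 always holds; the chain rule
    converts a gradient w.r.t. K into one w.r.t. the free parameter k."""
    K = np.exp(k)
    return K, dL_dK * K
```

An alternative with the same effect is a softplus reparameterization, which behaves more linearly for large k.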
### Coupled Optimization
**Problem**: Naively updating K breaks Riemannian optimizer assumptions.
**Solution** (from "Optimizing Curvature Learning" 2024):
```
1. Compute gradients in current manifold (curvature K_old)
2. Update parameters: θ_new = RiemannianSGD(θ, ∇_θ L, K_old)
3. Update curvature: K_new = K_old - α · ∂L/∂K
4. Rescale parameters to new manifold:
θ_rescaled = rescale_curvature(θ_new, K_old, K_new)
```
**Rescaling formula** (Poincaré ball):
```
rescale(x, K₁, K₂) = (K₂ / K₁) · x
```
### Multi-Curvature Embeddings
**Approach**: Different dimensions/layers have different curvatures.
**Product space**:
```
ℍ^{n₁}(κ₁) × ℍ^{n₂}(κ₂) × ... × ℍ^{nₖ}(κₖ)
```
**Distance**:
```
d_product((x₁,...,xₖ), (y₁,...,yₖ)) = √(Σ_i w_i² d²(x_i, y_i))
where w_i are learnable weights
```
---
## Numerical Stability
### Poincaré Ball Instabilities
**Problem 1**: Division by zero when ||x|| → 1
**Solution**: Clip to maximum norm
```
x_safe = x / max(1, ||x|| / (K(1 - ε)))
where ε = 1e-5 (keeps ||x_safe|| ≤ K(1 - ε))
```
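A sketch of this clipping step, generalized to ball radius K (illustrative name):

```python
import numpy as np

def clip_to_ball(x, K=1.0, eps=1e-5):
    """Project x back just inside the Poincare ball of radius K: points
    already inside pass through unchanged, outliers are rescaled radially."""
    max_norm = K * (1 - eps)
    n = np.linalg.norm(x)
    return x if n <= max_norm else x * (max_norm / n)
```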
**Problem 2**: Möbius addition overflow
**Solution**: Rewrite using log1p, expm1
```
Instead of: (1 + 2⟨x,y⟩ + ||y||²) / (1 + 2⟨x,y⟩ + ||x||²||y||²)
Use: exp(log1p(2⟨x,y⟩ + ||y||²) - log1p(2⟨x,y⟩ + ||x||²||y||²))
```
### Lorentz Model Stability
**Advantage**: No boundary singularities!
**Constraint enforcement**:
```
After each update, project back to hyperboloid:
x₀ = √(K² + x₁² + ... + xₙ²)
```
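The projection itself is a one-liner; a sketch with a constraint check (illustrative names):

```python
import numpy as np

def minkowski_dot(a, b):
    return -a[0] * b[0] + np.dot(a[1:], b[1:])

def project_hyperboloid(x, K=1.0):
    """Restore <x,x>_L = -K^2 by recomputing the time coordinate from
    the spatial coordinates; also enforces x0 > 0."""
    out = np.asarray(x, dtype=float).copy()
    out[0] = np.sqrt(K**2 + np.dot(out[1:], out[1:]))
    return out
```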
**Geodesic computation** (stable):
```
d_L(x, y) = K · log((-⟨x,y⟩ + √(⟨x,y⟩² - K⁴)) / K²)
```
### Mixed Precision
**Strategy**:
- **FP16** for forward pass (speed)
- **FP32** for gradients (stability)
- **FP64** for curvature updates (critical)
**GeoOpt recommendation**: Use FP32 minimum for hyperbolic operations.
---
## Complexity Analysis
### Space Complexity
**Poincaré Ball**:
- Point: O(n) storage (same as Euclidean)
- No auxiliary structures needed
**Lorentz**:
- Point: O(n) storage, using n+1 coordinates (one extra time-like dimension)
- Constraint: ⟨x,x⟩_L = -K²
**Curvature**:
- Shared K: O(1) extra parameter
- Per-layer K: O(L) for L layers
- Per-dimension K: O(n) parameters
### Time Complexity
| Operation | Euclidean | Poincaré | Lorentz |
|-----------|-----------|----------|---------|
| **Distance** | O(n) | O(n) | O(n) |
| **Addition** | O(n) | O(n) | O(n) |
| **Exp/Log** | - | O(n) | O(n) |
| **Linear layer** | O(n²) | O(n²) | O(n²) |
| **Attention** | O(n²d) | O(n²d) | O(n²d) |
| **Linear attention** | O(nd²) | O(nd²) | O(nd²) |
**Key Insight**: Asymptotic complexity **same as Euclidean**!
**Constants**: Hyperbolic ops 2-5x slower (more FLOPs per operation)
**SIMD Optimization**: Vectorization can recover 8-50x kernel-level speedups, making hyperbolic ops competitive with (or faster than) an unvectorized Euclidean baseline.
---
## Proofs of Key Properties
### Theorem 1: Möbius Addition Preserves Poincaré Ball
**Statement**: If x, y ∈ 𝔹ⁿ(K) (Poincaré ball), then x ⊕_K y ∈ 𝔹ⁿ(K).
**Proof sketch**:
```
Let ||x||² / K² = a², ||y||² / K² = b², ⟨x,y⟩ / K² = c
where a, b < 1.
||x ⊕_K y||² / K² = ||(1+2c+b²)x + (1-a²)y||² / (1+2c+a²b²)²
                  ≤ ((1+2c+b²)a + (1-a²)b)² / (1+2c+a²b²)²   (triangle inequality on the numerator)
                  < 1   (expand both sides and compare term by term, using a, b < 1)
```
### Theorem 2: Exponential Map is Diffeomorphism
**Statement**: exp_x: T_xℍⁿ → ℍⁿ is a diffeomorphism for each x ∈ ℍⁿ.
**Proof**:
- Inverse given by log_x
- Both are smooth (analytic)
- Jacobian is full rank everywhere
- QED.
### Theorem 3: Capacity Advantage
**Statement**: Any n-node tree embeds in ℍ² with distortion arbitrarily close to 1 (Sarkar's construction), whereas achieving comparably low distortion in ℝᵏ requires dimension k that grows with n.
**Proof Sketch**:
- Hyperbolic plane has exponential volume: V(r) ~ exp(r)
- Trees have exponential node count: N(depth d) ~ exp(d)
- Volume growth matches tree growth → O(1) average distortion
- Euclidean plane has polynomial volume: V(r) ~ r²
- Tree growth outpaces Euclidean volume growth → distortion must grow with n
---
## Implementation Checklist
### Poincaré Ball Implementation
- [ ] Möbius addition with curvature K
- [ ] Exponential map with numerical stability
- [ ] Logarithmic map with safe arctanh
- [ ] Distance function with clipping
- [ ] Parallel transport
- [ ] Gradient clipping to prevent boundary
### Lorentz Model Implementation
- [ ] Minkowski inner product
- [ ] Hyperboloid constraint projection
- [ ] Exponential map
- [ ] Distance function
- [ ] Lorentz boost and rotation
- [ ] Conversion to/from Poincaré
### Hyperbolic Attention
- [ ] Hyperbolic query/key/value projections
- [ ] Distance-based similarity
- [ ] Softmax with temperature
- [ ] Möbius weighted aggregation
- [ ] Linear attention kernel approximation
### Learnable Curvature
- [ ] Curvature parameter K with positive constraint
- [ ] Gradient computation w.r.t. K
- [ ] Coupled optimization with rescaling
- [ ] Per-layer or per-head curvature
### SIMD Optimizations
- [ ] Vectorized Möbius addition (AVX2)
- [ ] Batch distance computation
- [ ] Fused exp/log operations
- [ ] Cache-aligned memory layout
---
## References
**Textbooks**:
1. "Riemannian Geometry" - do Carmo
2. "Foundations of Hyperbolic Manifolds" - Ratcliffe
**Papers**:
1. Ganea et al., "Hyperbolic Neural Networks" (NeurIPS 2018)
2. Hypformer (KDD 2024) - Linear attention formulation
3. Fully Hyperbolic NNs (ACL 2022) - Lorentz model analysis
**Software**:
- **GeoOpt**: PyTorch library for Riemannian optimization
- **Hyperbolic Image Embeddings**: Reference implementation
---
## Conclusion
Hyperbolic geometry provides a mathematically rigorous framework for hierarchical neural representations with:
- **Provable capacity**: exponential volume growth matches tree-like hierarchies, vs polynomial growth in Euclidean space
- **Stable operations**: Lorentz model superior to Poincaré
- **Efficient algorithms**: attention matches Euclidean O(n²d), with an O(nd²) linear variant
- **Learnable curvature**: Adapt to data hierarchy
All operations have **closed-form solutions** and **computable gradients**, making them suitable for modern automatic differentiation frameworks.