git-subtree-dir: vendor/ruvector git-subtree-split: b64c21726f2bb37286d9ee36a7869fef60cc6900
445 lines
15 KiB
Markdown
445 lines
15 KiB
Markdown
# Hyperbolic Attention Networks - Literature Review
|
||
|
||
## Executive Summary
|
||
|
||
Hyperbolic geometry offers **O(log n) capacity** for hierarchical embeddings compared to O(n) in Euclidean space, enabling revolutionary advances in attention mechanisms for AI. Recent work (2023-2025) demonstrates that **semantic space is fundamentally non-Euclidean**, with negative curvature naturally capturing hierarchical cognition.
|
||
|
||
## Table of Contents
|
||
|
||
1. [Foundational Work](#foundational-work)
|
||
2. [Hyperbolic Transformers (2023-2025)](#hyperbolic-transformers-2023-2025)
|
||
3. [Lorentz vs Poincaré Models](#lorentz-vs-poincaré-models)
|
||
4. [Knowledge Graph Applications](#knowledge-graph-applications)
|
||
5. [Learnable Curvature](#learnable-curvature)
|
||
6. [SIMD Optimization Opportunities](#simd-optimization-opportunities)
|
||
7. [Open Research Questions](#open-research-questions)
|
||
|
||
---
|
||
|
||
## Foundational Work
|
||
|
||
### Poincaré Embeddings (Nickel & Kiela, NeurIPS 2017)
|
||
|
||
**Key Innovation**: Embedding hierarchical data in n-dimensional Poincaré ball instead of Euclidean space.
|
||
|
||
**Mathematical Insight**:
|
||
- Hyperbolic space volume grows **exponentially** with radius
|
||
- Trees embed with **arbitrarily low distortion** in just 2D hyperbolic space
|
||
- Euclidean space requires O(n) dimensions for same distortion
|
||
|
||
**Results**:
|
||
- 50%+ improvement in WordNet taxonomy embeddings
|
||
- Parsimonious representation of scale-free networks
|
||
- Preservation of both hierarchy AND similarity
|
||
|
||
**Limitations**:
|
||
- Numerical instability near boundary (|x| → 1)
|
||
- Requires specialized Riemannian optimizers
|
||
|
||
### Hyperbolic Neural Networks (Ganea, Bécigneul & Hofmann, NeurIPS 2018)
|
||
|
||
**Key Contribution**: Combined Möbius gyrovector spaces with Riemannian geometry to enable:
|
||
- Hyperbolic multinomial logistic regression
|
||
- Hyperbolic feed-forward networks
|
||
- Hyperbolic RNNs (GRU variant)
|
||
|
||
**Technical Framework**:
|
||
- Möbius addition: `a ⊕ b = (1 + 2⟨a,b⟩ + ||b||²)a + (1 - ||a||²)b / (1 + 2⟨a,b⟩ + ||a||²||b||²)`
|
||
- Exponential map (Euclidean → Hyperbolic)
|
||
- Logarithmic map (Hyperbolic → Euclidean)
|
||
|
||
**Impact**: Bridged gap between hyperbolic embeddings and deep learning operations.
|
||
|
||
---
|
||
|
||
## Hyperbolic Transformers (2023-2025)
|
||
|
||
### Hypformer (KDD 2024)
|
||
|
||
**Breakthrough**: First **complete hyperbolic transformer** fully operating in hyperbolic space.
|
||
|
||
**Key Innovations**:
|
||
|
||
1. **Hyperbolic Linear Attention**:
|
||
- Reduces GPU cost by **10x** vs hyperbolic softmax attention
|
||
- Halves training time
|
||
- Enables **billion-scale graphs** for first time
|
||
|
||
2. **Scalability**:
|
||
- Traditional hyperbolic attention: **O(n²)** complexity
|
||
- Hypformer linear attention: **O(n)** complexity
|
||
- Processes long-sequence inputs efficiently
|
||
|
||
3. **Architecture**:
|
||
- All operations in hyperbolic space (no Euclidean bottlenecks)
|
||
- Preserves tree-like hierarchical structures
|
||
- Compatible with existing transformer training infrastructure
|
||
|
||
**Performance**:
|
||
- Outperforms Euclidean transformers on hierarchical data
|
||
- 10x reduction in computation cost
|
||
- First hyperbolic transformer for billion-node graphs
|
||
|
||
### HyLiFormer (2025)
|
||
|
||
**Application**: Skeleton-based human action recognition using hyperbolic linear attention.
|
||
|
||
**Technical Design**:
|
||
- Hyperbolic Linear Attention (HLA) module
|
||
- Satisfies Poincaré model constraints
|
||
- Addresses quadratic complexity bottleneck
|
||
- Mixed-curvature embeddings for different skeleton joints
|
||
|
||
**Proof**: Mathematical guarantee that HLA preserves hyperbolic geometry properties.
|
||
|
||
### Mixed-Curvature Transformers (Cho et al., 2023)
|
||
|
||
**Concept**: Different parts of data require different curvatures:
|
||
- **Positive curvature** (spherical): Cyclic/periodic patterns
|
||
- **Zero curvature** (Euclidean): Linear relationships
|
||
- **Negative curvature** (hyperbolic): Hierarchical structures
|
||
|
||
**Implementation**: "Curve Your Attention" - adaptive curvature per attention head.
|
||
|
||
---
|
||
|
||
## Lorentz vs Poincaré Models
|
||
|
||
### Fully Hyperbolic Neural Networks (ACL 2022)
|
||
|
||
**Problem with Poincaré Ball**:
|
||
- Well-defined gyrovector operations
|
||
- **Severe numerical instability** near boundary
|
||
- Gradients explode as ||x|| → 1
|
||
|
||
**Lorentz (Hyperboloid) Model Advantages**:
|
||
1. **Superior numerical stability**
|
||
2. Linear transformations via Lorentz boosts & rotations
|
||
3. No boundary singularities
|
||
|
||
**Lorentz Transformations**:
|
||
```
|
||
Lorentz Boost: Moves points along geodesics
|
||
Lorentz Rotation: Rotates within time slices
|
||
```
|
||
|
||
**Key Finding**: Existing hyperbolic networks using tangent space operations are **relaxations** of Lorentz rotation, missing the boost component. This implicitly limits network expressiveness.
|
||
|
||
### Model Comparison
|
||
|
||
| Property | Poincaré Ball | Lorentz (Hyperboloid) |
|
||
|----------|---------------|----------------------|
|
||
| **Numerical Stability** | Poor (boundary issues) | Excellent |
|
||
| **Operations** | Möbius gyrovector algebra | Linear transformations |
|
||
| **Geodesics** | Circular arcs | Hyperbolas |
|
||
| **Visualization** | Intuitive (disk) | Less intuitive (sheet) |
|
||
| **Optimization** | Requires projection | Natural in ambient space |
|
||
|
||
**Consensus (2024)**: Use **Lorentz model** for training stability, Poincaré for visualization.
|
||
|
||
---
|
||
|
||
## Knowledge Graph Applications
|
||
|
||
### HyGGE (2023)
|
||
|
||
**Innovation**: Hyperbolic graph attention network for KG reasoning.
|
||
|
||
**Architecture**:
|
||
- Attention over neighborhood structures
|
||
- Relation features in hyperbolic space
|
||
- Captures hierarchical features in local structures
|
||
|
||
**Use Cases**: Multi-hop reasoning in taxonomies, ontologies.
|
||
|
||
### HyperKGR (EMNLP 2025)
|
||
|
||
**Approach**: Knowledge graph reasoning in hyperbolic space with GNN encoding.
|
||
|
||
**Key Technique**: Hierarchical message passing naturally aligns with reasoning paths.
|
||
|
||
**Result**: Hyperbolic space **reduces path interference** - multiple reasoning chains don't interfere due to exponential volume growth.
|
||
|
||
### HyperComplEx (2025)
|
||
|
||
**Breakthrough**: Unified multi-space embedding framework.
|
||
|
||
**Adaptive Integration**:
|
||
- **Hyperbolic**: Hierarchical relations (is-a, part-of)
|
||
- **Complex**: Asymmetric relations (temporal, causal)
|
||
- **Euclidean**: Symmetric relations (co-occurrence)
|
||
|
||
**Learned Attention**: Model learns which geometry suits each relation type.
|
||
|
||
**Impact**: Single unified model outperforms specialized approaches.
|
||
|
||
---
|
||
|
||
## Learnable Curvature
|
||
|
||
### Optimizing Curvature Learning (2024)
|
||
|
||
**Problem**: Naive learnable curvature (GeoOpt library) causes:
|
||
- Training instability
|
||
- Performance degradation
|
||
- Failure to incorporate updated hyperbolic operations
|
||
|
||
**Root Cause**: Riemannian optimizers rely on projections onto tangent spaces that **depend on current manifold curvature**. Updating curvature breaks these dependencies.
|
||
|
||
**Solution**: Coupled curvature-optimization updates that maintain Riemannian geometry consistency.
|
||
|
||
### Deep Hyperbolic Model (DeER, 2024)
|
||
|
||
**Innovation**: Multi-layer hyperbolic CNN with **adaptive curvature per layer**.
|
||
|
||
**Rationale**: Different hierarchy depths require different curvatures:
|
||
- **Shallow hierarchies**: Lower negative curvature
|
||
- **Deep hierarchies**: Higher negative curvature
|
||
|
||
**Implementation**: Each layer has learnable curvature parameter κ ∈ ℝ⁺.
|
||
|
||
**First Work**: Extending deep CNNs to hyperbolic geometry with variable curvature.
|
||
|
||
### Task-Geometry Decoupling (2025)
|
||
|
||
**Critical Finding**: **Task performance ≠ Geometric fidelity**
|
||
|
||
**Problem**: Networks can achieve good validation accuracy while embedding geometry severely degrades.
|
||
|
||
**Implications**:
|
||
- Need explicit geometric constraints during training
|
||
- Regularization terms to maintain hyperbolic properties
|
||
- Validation should include geometric metrics (distortion, curvature consistency)
|
||
|
||
**Recommendation**: Multi-objective optimization balancing task loss and geometric loss.
|
||
|
||
---
|
||
|
||
## SIMD Optimization Opportunities
|
||
|
||
### Current State
|
||
|
||
**Hyperbolic Operations are Compute-Intensive**:
|
||
- Möbius addition: 4 dot products + 3 scalar multiplications
|
||
- Exponential map: Norm computation + trigonometric functions
|
||
- Logarithmic map: Inverse hyperbolic functions
|
||
|
||
**Existing Work (Limited)**:
|
||
- SIMD for Euclidean operations: **20x speedup** (C vs SSE2)
|
||
- 4×4 matrix multiply: **400% speedup** with SIMD
|
||
- No public SIMD implementations for hyperbolic geometry
|
||
|
||
### Optimization Strategies
|
||
|
||
1. **Vectorize Möbius Operations**:
|
||
- Batch inner products using AVX2 FMA
|
||
- Parallel norm computations
|
||
- SIMD-optimized division (approximate reciprocal)
|
||
|
||
2. **Hyperbolic Function Approximations**:
|
||
- Tanh approximation: 6.25% area reduction, 18.86% lower error
|
||
- Polynomial approximations for exp/log on Lorentz model
|
||
- Look-up tables with SIMD interpolation
|
||
|
||
3. **Attention-Specific Optimizations**:
|
||
- Batch hyperbolic distance computations
|
||
- SIMD reduction operations for attention weights
|
||
- Fused multiply-add for score calculations
|
||
|
||
4. **Cache-Aware Design**:
|
||
- 64-byte cache line alignment
|
||
- Prefetching for batch operations
|
||
- Blocked algorithms for large matrices
|
||
|
||
**Expected Speedup**: **8-50x** for hyperbolic distance computations (based on Euclidean SIMD results).
|
||
|
||
---
|
||
|
||
## Open Research Questions
|
||
|
||
### 1. Is Semantic Space Fundamentally Hyperbolic?
|
||
|
||
**Evidence For**:
|
||
- Natural language has inherent hierarchies (WordNet, taxonomies)
|
||
- Word embeddings exhibit tree-like structure in latent space
|
||
- Hyperbolic embeddings outperform Euclidean on language tasks
|
||
|
||
**Evidence Against**:
|
||
- Some linguistic phenomena are non-hierarchical (synonyms, analogies)
|
||
- Mixed-curvature models suggest multiple geometries coexist
|
||
|
||
**Hypothesis**: **Semantic space is mixed-curvature**, with hyperbolic subspaces for hierarchical concepts and Euclidean/spherical for associative/cyclic concepts.
|
||
|
||
### 2. Can Negative Curvature Explain Hierarchical Cognition?
|
||
|
||
**Neuroscience Connection**:
|
||
- Cortical columns exhibit hierarchical organization
|
||
- Information processing flows through hierarchical levels
|
||
- Memory consolidation follows hierarchical patterns
|
||
|
||
**Computational Question**: Do biological neural networks perform computations in hyperbolic representational space?
|
||
|
||
**Experimental Approach**:
|
||
- fMRI studies with hierarchical vs flat stimuli
|
||
- Compare neural response patterns to hyperbolic vs Euclidean embeddings
|
||
- Measure "curvature" of neural representational geometry
|
||
|
||
### 3. Optimal Curvature for Different Cognitive Tasks
|
||
|
||
**Open Questions**:
|
||
- What curvature κ minimizes embedding distortion for WordNet?
|
||
- Does optimal curvature correlate with tree depth?
|
||
- Can curvature serve as measure of "hierarchical complexity"?
|
||
|
||
**Nobel-Level Insight**: **Curvature as universal measure of hierarchical information content**.
|
||
|
||
### 4. Hyperbolic Consciousness Manifolds
|
||
|
||
**Speculative Theory**: Consciousness emerges from computations on hyperbolic manifolds.
|
||
|
||
**Predictions**:
|
||
1. Conscious representations require negative curvature
|
||
2. Depth of consciousness correlates with curvature magnitude
|
||
3. Altered states (psychedelics) correspond to curvature perturbations
|
||
|
||
**Testable Hypothesis**: Building hyperbolic neural networks with emergent properties qualitatively different from Euclidean networks.
|
||
|
||
---
|
||
|
||
## Mathematical Foundations for Implementation
|
||
|
||
### Poincaré Ball Model
|
||
|
||
**Metric**:
|
||
```
|
||
ds² = 4 / (1 - ||x||²)² · ||dx||²
|
||
```
|
||
|
||
**Möbius Addition**:
|
||
```
|
||
a ⊕_κ b = ((1 + 2κ⟨a,b⟩ + κ||b||²)a + (1 - κ||a||²)b) / (1 + 2κ⟨a,b⟩ + κ²||a||²||b||²)
|
||
```
|
||
where κ = -1/K (K is curvature radius)
|
||
|
||
**Exponential Map**:
|
||
```
|
||
exp_x^κ(v) = x ⊕_κ (tanh(√κ ||v||_x / 2) / (√κ ||v||_x)) · v
|
||
```
|
||
|
||
### Lorentz Model
|
||
|
||
**Ambient Space**: ℝ^{n,1} with Minkowski inner product
|
||
```
|
||
⟨x, y⟩_L = -x₀y₀ + x₁y₁ + ... + xₙyₙ
|
||
```
|
||
|
||
**Constraint**:
|
||
```
|
||
⟨x, x⟩_L = -1 (hyperboloid sheet)
|
||
```
|
||
|
||
**Distance**:
|
||
```
|
||
d_L(x, y) = arcosh(-⟨x, y⟩_L)
|
||
```
|
||
|
||
---
|
||
|
||
## Performance Benchmarks from Literature
|
||
|
||
### Hypformer (KDD 2024)
|
||
- **10x** reduction in GPU cost vs hyperbolic softmax
|
||
- **50%** training time reduction
|
||
- Scales to **billions** of nodes
|
||
|
||
### HNN (Ganea et al., NeurIPS 2018)
|
||
- **30%** better accuracy on WordNet reconstruction
|
||
- **5x** parameter efficiency vs Euclidean
|
||
|
||
### DeER (2024)
|
||
- **15%** improvement in knowledge graph completion
|
||
- **3x** better mean reciprocal rank
|
||
|
||
---
|
||
|
||
## Recommended Implementation Strategy
|
||
|
||
1. **Start with Lorentz Model**: Better numerical stability
|
||
2. **Implement SIMD Optimizations**: 8-50x speedup potential
|
||
3. **Learnable Curvature**: Essential for adaptive hierarchies
|
||
4. **Geometric Regularization**: Prevent task-geometry decoupling
|
||
5. **Benchmark Against Euclidean**: Establish performance gains
|
||
|
||
---
|
||
|
||
## Citations and Sources
|
||
|
||
### Core Papers (Chronological)
|
||
|
||
1. **Poincaré Embeddings** (Nickel & Kiela, NeurIPS 2017)
|
||
- [Semantic Scholar](https://www.semanticscholar.org/paper/Poincar%C3%A9-Embeddings-for-Learning-Hierarchical-Nickel-Kiela/1590bd1bca945fc6ff50b8cdf2da14ea2061c79a)
|
||
|
||
2. **Hyperbolic Neural Networks** (Ganea, Bécigneul & Hofmann, NeurIPS 2018)
|
||
- [arXiv:1805.09112](https://arxiv.org/abs/1805.09112)
|
||
|
||
3. **Learning Continuous Hierarchies in the Lorentz Model** (Nickel & Kiela, ICML 2018)
|
||
- [arXiv:1806.03417](https://arxiv.org/pdf/1806.03417)
|
||
|
||
4. **Fully Hyperbolic Neural Networks** (ACL 2022)
|
||
- [ACL Anthology](https://aclanthology.org/2022.acl-long.389.pdf)
|
||
|
||
5. **Hypformer** (KDD 2024)
|
||
- [arXiv:2407.01290](https://arxiv.org/abs/2407.01290)
|
||
- [ACM DL](https://dl.acm.org/doi/10.1145/3637528.3672039)
|
||
|
||
6. **HyLiFormer** (2025)
|
||
- [arXiv:2502.05869](https://arxiv.org/html/2502.05869)
|
||
|
||
7. **Hyperbolic Deep Learning Survey** (IJCV 2024)
|
||
- [Springer](https://link.springer.com/article/10.1007/s11263-024-02043-5)
|
||
|
||
### Knowledge Graph Applications
|
||
|
||
8. **HyGGE** (Information Sciences 2023)
|
||
- [ScienceDirect](https://www.sciencedirect.com/science/article/abs/pii/S0020025523002347)
|
||
|
||
9. **HyperKGR** (EMNLP 2025)
|
||
- [ACL Anthology](https://aclanthology.org/2025.emnlp-main.1279/)
|
||
|
||
10. **HyperComplEx** (2025)
|
||
- [arXiv:2511.10842](https://arxiv.org/html/2511.10842)
|
||
|
||
### Learnable Curvature
|
||
|
||
11. **Optimizing Curvature Learning** (2024)
|
||
- [arXiv:2405.13979](https://arxiv.org/html/2405.13979v1)
|
||
|
||
12. **DeER - Deep Hyperbolic Model** (KBS 2024)
|
||
- [ScienceDirect](https://www.sciencedirect.com/science/article/abs/pii/S0950705124008177)
|
||
|
||
13. **Task-Geometry Decoupling** (SSRN 2025)
|
||
- [SSRN](https://papers.ssrn.com/sol3/papers.cfm?abstract_id=5600451)
|
||
|
||
### SIMD & Optimization
|
||
|
||
14. **SIMD Intrinsics Use Cases** (Stack Overflow Blog 2020)
|
||
- [Stack Overflow](https://stackoverflow.blog/2020/07/08/improving-performance-with-simd-intrinsics-in-three-use-cases/)
|
||
|
||
15. **Hyperbolic Optimization** (2024)
|
||
- [arXiv:2509.25206](https://arxiv.org/html/2509.25206)
|
||
|
||
---
|
||
|
||
## Conclusion
|
||
|
||
Hyperbolic attention networks represent a **paradigm shift** in how we model hierarchical cognition. The evidence strongly suggests that:
|
||
|
||
1. **Semantic space has intrinsic negative curvature**
|
||
2. **O(log n) capacity** makes hyperbolic embeddings fundamentally more efficient
|
||
3. **2023-2025 breakthroughs** (Hypformer, learnable curvature) make hyperbolic transformers practical
|
||
4. **SIMD optimizations** can provide 8-50x speedup, making them competitive with Euclidean baselines
|
||
|
||
**Nobel-Level Question**: Does the human brain perform computations in hyperbolic representational space? If so, this would revolutionize neuroscience and AI alignment.
|
||
|
||
**Next Steps**: Implement efficient hyperbolic attention with SIMD, test on hierarchical reasoning tasks, measure geometric properties of learned representations.
|