# Axis 7: Hyperbolic & Mixed-Curvature Graph Transformers
**Document:** 27 of 30

**Series:** Graph Transformers: 2026-2036 and Beyond

**Last Updated:** 2026-02-25

**Status:** Research Prospectus

---
## 1. Problem Statement

Euclidean space is the wrong geometry for most real-world graphs. Hierarchical data (taxonomies, organizational charts, phylogenetic trees) embeds naturally into hyperbolic space, where the volume of a ball grows exponentially with radius -- matching the exponential branching of trees. Cyclical data (molecular rings, social cycles) embeds into spherical space. Most real graphs contain a mixture of hierarchical, cyclical, and flat substructures.

The mixed-curvature axis asks: how do we build graph transformers that operate in the right geometry for each part of the graph?

### 1.1 Why Geometry Matters

**Distortion (Bourgain, 1985).** Any metric space with n points can be embedded in Euclidean space with O(log n) distortion, and even in arbitrarily high dimension, tree metrics incur Euclidean distortion that grows with n. Hyperbolic space, by contrast, embeds trees with O(1) distortion in as few as two dimensions. The gap between the two geometries grows without bound as trees deepen.

**Practical impact:**

| Graph Structure | Euclidean (d=128) | Hyperbolic (d=128) | Improvement |
|----------------|-------------------|--------------------|-------------|
| Tree (branching=3, depth=10) | 40% recall@10 | 95% recall@10 | 2.4x |
| Social network (power-law) | 70% | 92% | 1.3x |
| Molecular graph (cycles) | 85% | 75% | Worse |
| Mixed (wiki hyperlinks) | 75% | 80% | 1.07x |

Hyperbolic helps hierarchies but hurts cycles. We need both.

### 1.2 RuVector Baseline

- **`ruvector-hyperbolic-hnsw`**: Poincare ball model (`poincare.rs`), hyperbolic HNSW search (`hnsw.rs`), tangent-space operations (`tangent.rs`), sharding (`shard.rs`)
- **`ruvector-attention`**: hyperbolic attention (`hyperbolic/`), curvature attention (`curvature/`), info-geometry (`info_geometry/`), transport attention (`transport/`)

---

## 2. Hyperbolic Graph Attention

### 2.1 The Poincare Ball Model

The Poincare ball B_c^d = {x in R^d : c * ||x||^2 < 1} has constant curvature -c (with c > 0). Key operations:

**Mobius addition:**
```
x (+)_c y = ((1 + 2c<x,y> + c||y||^2) * x + (1 - c||x||^2) * y)
          / (1 + 2c<x,y> + c^2 * ||x||^2 * ||y||^2)
```

**Hyperbolic distance:**
```
d_c(x, y) = (2/sqrt(c)) * arctanh(sqrt(c) * ||(-x) (+)_c y||)
```

**Exponential map (tangent -> ball):**
```
exp_x^c(v) = x (+)_c (tanh(sqrt(c) * lambda_x * ||v|| / 2) * v / (sqrt(c) * ||v||))

where lambda_x = 2 / (1 - c * ||x||^2)   (conformal factor)
```

**Logarithmic map (ball -> tangent):**
```
log_x^c(y) = (2 / (sqrt(c) * lambda_x)) * arctanh(sqrt(c) * ||(-x) (+)_c y||)
           * ((-x) (+)_c y) / ||(-x) (+)_c y||
```

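These primitives are simple to implement directly. The sketch below is a minimal, dependency-free Rust version for illustration only -- the production implementation lives in `ruvector-hyperbolic-hnsw`'s `poincare.rs`, whose API may differ:

```rust
// Minimal Poincare-ball primitives (curvature -c, c > 0). Illustrative
// sketch only; not the ruvector-hyperbolic-hnsw API.
fn dot(x: &[f64], y: &[f64]) -> f64 {
    x.iter().zip(y).map(|(a, b)| a * b).sum()
}

/// Mobius addition x (+)_c y.
fn mobius_add(x: &[f64], y: &[f64], c: f64) -> Vec<f64> {
    let (xy, x2, y2) = (dot(x, y), dot(x, x), dot(y, y));
    let num_x = 1.0 + 2.0 * c * xy + c * y2;
    let num_y = 1.0 - c * x2;
    let denom = 1.0 + 2.0 * c * xy + c * c * x2 * y2;
    x.iter().zip(y).map(|(a, b)| (num_x * a + num_y * b) / denom).collect()
}

/// d_c(x, y) = (2/sqrt(c)) * artanh(sqrt(c) * ||(-x) (+)_c y||)
fn poincare_dist(x: &[f64], y: &[f64], c: f64) -> f64 {
    let neg_x: Vec<f64> = x.iter().map(|a| -a).collect();
    let diff = mobius_add(&neg_x, y, c);
    (2.0 / c.sqrt()) * (c.sqrt() * dot(&diff, &diff).sqrt()).atanh()
}

fn main() {
    let (x, y) = ([0.1, 0.2], [0.3, -0.1]);
    // d(x, x) = 0 because (-x) (+)_c x = 0.
    assert!(poincare_dist(&x, &x, 1.0).abs() < 1e-9);
    // Distance is symmetric and positive for distinct points.
    let d = poincare_dist(&x, &y, 1.0);
    assert!(d > 0.0 && (d - poincare_dist(&y, &x, 1.0)).abs() < 1e-9);
    println!("d_c(x, y) = {d:.4}");
}
```

Note that Mobius addition is neither commutative nor associative; what makes the distance well-defined is the identity `(-x) (+)_c x = 0`.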
### 2.2 Hyperbolic Multi-Head Attention

Standard multi-head attention operates in Euclidean space. Hyperbolic MHA works in the Poincare ball:

```
HyperbolicMHA(Q, K, V):

  For each head h:
    1. Project to tangent space at origin:
       Q_h = log_0(Q) * W_Q^h
       K_h = log_0(K) * W_K^h
       V_h = log_0(V) * W_V^h

    2. Compute attention in tangent space (Euclidean):
       alpha_h = softmax(Q_h * K_h^T / sqrt(d_h))

    3. Aggregate values in tangent space:
       Z_h = alpha_h * V_h

    4. Map back to hyperbolic space:
       O_h = exp_0(Z_h)

  Concatenate and project:
    O = exp_0(concat(log_0(O_1), ..., log_0(O_H)) * W_O)
```

**Advantage:** Attention weights computed from hyperbolic distances naturally give more weight to nodes that are semantically close in the tree hierarchy.

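At the origin the conformal factor is lambda_0 = 2, so the maps in steps 1 and 4 simplify to exp_0^c(v) = tanh(sqrt(c) * ||v||) * v / (sqrt(c) * ||v||) and its inverse. A minimal Rust sketch of these two maps (hypothetical helper names, not the `ruvector-attention` API):

```rust
// exp/log at the origin of the Poincare ball (lambda_0 = 2), the maps used
// by tangent-space attention. Standalone sketch with illustrative names.
fn exp0(v: &[f64], c: f64) -> Vec<f64> {
    let n = v.iter().map(|a| a * a).sum::<f64>().sqrt();
    if n == 0.0 { return v.to_vec(); }
    let s = (c.sqrt() * n).tanh() / (c.sqrt() * n);
    v.iter().map(|a| s * a).collect()
}

fn log0(y: &[f64], c: f64) -> Vec<f64> {
    let n = y.iter().map(|a| a * a).sum::<f64>().sqrt();
    if n == 0.0 { return y.to_vec(); }
    let s = (c.sqrt() * n).atanh() / (c.sqrt() * n);
    y.iter().map(|a| s * a).collect()
}

fn main() {
    // log_0 inverts exp_0 on tangent vectors.
    let v = [0.3, -0.4, 0.1];
    let back = log0(&exp0(&v, 1.0), 1.0);
    for (a, b) in v.iter().zip(&back) {
        assert!((a - b).abs() < 1e-10);
    }
    // exp_0 always lands strictly inside the unit ball (c = 1).
    let y = exp0(&[5.0, 5.0], 1.0);
    assert!(y.iter().map(|a| a * a).sum::<f64>() < 1.0);
}
```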
### 2.3 Fully Hyperbolic Attention (No Tangent Space)

The tangent space approach "flattens" the hyperbolic geometry. Fully hyperbolic attention operates entirely in the ball:

```
FullyHyperbolicAttention(q, K, V):

  For each key k_j:
    // Hyperbolic attention score
    score_j = -beta * d_c(q, k_j)^2 + <q, k_j>_L
    // where <.,.>_L is the Lorentzian inner product

  alpha = softmax(scores)

  // Hyperbolic weighted midpoint (Einstein midpoint)
  z = EinsteinMidpoint(V, alpha, c)
    = exp_0(sum_j alpha_j * gamma_j * log_0(v_j) / sum_j alpha_j * gamma_j)
  // where gamma_j = 1 / sqrt(1 - c * ||v_j||^2) is the Lorentz factor
```

**Complexity:** Same as Euclidean attention, O(n^2 * d), but with a roughly 3x constant factor due to hyperbolic arithmetic.

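The Einstein-midpoint aggregation above can be sketched by mapping values to the tangent space at the origin, averaging with Lorentz-factor-corrected weights, and mapping back. A dependency-free Rust illustration (c > 0 assumed; helper names are hypothetical):

```rust
// Sketch of the weighted Einstein-style midpoint used as the aggregation
// step above. Illustrative only; not the ruvector implementation.
fn norm(x: &[f64]) -> f64 { x.iter().map(|a| a * a).sum::<f64>().sqrt() }

fn exp0(v: &[f64], c: f64) -> Vec<f64> {
    let n = norm(v);
    if n == 0.0 { return v.to_vec(); }
    let s = (c.sqrt() * n).tanh() / (c.sqrt() * n);
    v.iter().map(|a| s * a).collect()
}

fn log0(y: &[f64], c: f64) -> Vec<f64> {
    let n = norm(y);
    if n == 0.0 { return y.to_vec(); }
    let s = (c.sqrt() * n).atanh() / (c.sqrt() * n);
    y.iter().map(|a| s * a).collect()
}

fn einstein_midpoint(values: &[Vec<f64>], alpha: &[f64], c: f64) -> Vec<f64> {
    let d = values[0].len();
    let mut acc = vec![0.0; d];
    let mut total = 0.0;
    for (v, &a) in values.iter().zip(alpha) {
        // Lorentz factor gamma_j = 1 / sqrt(1 - c * ||v_j||^2)
        let gamma = 1.0 / (1.0 - c * norm(v).powi(2)).sqrt();
        let t = log0(v, c);
        for i in 0..d { acc[i] += a * gamma * t[i]; }
        total += a * gamma;
    }
    for x in acc.iter_mut() { *x /= total; }
    exp0(&acc, c)
}

fn main() {
    let vals = vec![vec![0.2, 0.1], vec![-0.1, 0.3]];
    // Degenerate weights recover the corresponding value exactly.
    let z = einstein_midpoint(&vals, &[1.0, 0.0], 1.0);
    assert!((z[0] - 0.2).abs() < 1e-9 && (z[1] - 0.1).abs() < 1e-9);
    // The midpoint of points inside the ball stays inside the ball.
    assert!(norm(&einstein_midpoint(&vals, &[0.5, 0.5], 1.0)) < 1.0);
}
```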
---

## 3. Product Manifold Transformers

### 3.1 Product Spaces

Real graphs have mixed curvature. We use product manifolds:

```
M = H_{c1}^{d1} x S_{c2}^{d2} x R^{d3}

where:
  H_c^d = hyperbolic space (curvature -c) -- for hierarchies
  S_c^d = spherical space (curvature +c)  -- for cycles
  R^d   = Euclidean space (curvature 0)   -- for flat structures

Total dimension: d = d1 + d2 + d3
```

**Distance in product space:**
```
d_M(x, y) = sqrt(w_H * d_H(x_H, y_H)^2 + w_S * d_S(x_S, y_S)^2 + w_R * d_R(x_R, y_R)^2)
```
where w_H, w_S, w_R are learned weights.

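A sketch of this distance with one Poincare component (c = 1), one unit-sphere component, and one Euclidean component. The weights here are plain parameters standing in for the learned w_H, w_S, w_R:

```rust
// Product-space distance sketch: hyperbolic (Poincare, c = 1) x spherical
// (unit sphere) x Euclidean. Illustrative only.
fn dot(x: &[f64], y: &[f64]) -> f64 { x.iter().zip(y).map(|(a, b)| a * b).sum() }
fn norm(x: &[f64]) -> f64 { dot(x, x).sqrt() }

fn d_hyperbolic(x: &[f64], y: &[f64]) -> f64 {
    // d(x, y) = 2 * artanh(||(-x) (+) y||) for c = 1.
    let (xy, x2, y2) = (dot(x, y), dot(x, x), dot(y, y));
    let denom = 1.0 - 2.0 * xy + x2 * y2;
    let m: Vec<f64> = x.iter().zip(y)
        .map(|(a, b)| ((1.0 - 2.0 * xy + y2) * (-a) + (1.0 - x2) * b) / denom)
        .collect();
    2.0 * norm(&m).atanh()
}

fn d_spherical(x: &[f64], y: &[f64]) -> f64 {
    // Assumes x, y are unit vectors; clamp guards rounding error.
    dot(x, y).clamp(-1.0, 1.0).acos()
}

fn d_euclidean(x: &[f64], y: &[f64]) -> f64 {
    x.iter().zip(y).map(|(a, b)| (a - b) * (a - b)).sum::<f64>().sqrt()
}

fn product_distance(
    x: (&[f64], &[f64], &[f64]),
    y: (&[f64], &[f64], &[f64]),
    w: (f64, f64, f64),
) -> f64 {
    (w.0 * d_hyperbolic(x.0, y.0).powi(2)
        + w.1 * d_spherical(x.1, y.1).powi(2)
        + w.2 * d_euclidean(x.2, y.2).powi(2))
    .sqrt()
}

fn main() {
    let x = (vec![0.1, 0.2], vec![1.0, 0.0], vec![3.0]);
    let y = (vec![0.2, 0.0], vec![0.0, 1.0], vec![1.0]);
    let d = product_distance((&x.0, &x.1, &x.2), (&y.0, &y.1, &y.2), (1.0, 1.0, 1.0));
    assert!(d > 0.0);
    // Identical points have distance zero in every component.
    let z = product_distance((&x.0, &x.1, &x.2), (&x.0, &x.1, &x.2), (1.0, 1.0, 1.0));
    assert!(z.abs() < 1e-9);
}
```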
### 3.2 Product Manifold Attention

```
ProductAttention(Q, K, V):

  // Split embeddings into manifold components
  Q_H, Q_S, Q_R = split(Q, [d1, d2, d3])
  K_H, K_S, K_R = split(K, [d1, d2, d3])
  V_H, V_S, V_R = split(V, [d1, d2, d3])

  // Attention scores from each manifold
  score_H = -d_H(Q_H, K_H)^2        // hyperbolic distance
  score_S = <Q_S, K_S>_S            // spherical inner product
  score_R = Q_R . K_R^T / sqrt(d3)  // Euclidean dot product

  // Combined attention
  alpha = softmax(w_H * score_H + w_S * score_S + w_R * score_R)

  // Aggregate per manifold
  Z_H = HyperbolicMidpoint(V_H, alpha)
  Z_S = SphericalMidpoint(V_S, alpha)
  Z_R = EuclideanWeightedSum(V_R, alpha)

  return concat(Z_H, Z_S, Z_R)
```

### 3.3 Learned Dimension Allocation

**Key question:** How many dimensions should each manifold component receive?

**Differentiable allocation:**

```
Input: total dimension budget d, curvature signal from the data

1. Compute curvature estimates per subgraph:
   kappa_i = estimated_sectional_curvature(subgraph_i)

2. Classify:
   if kappa_i < -threshold: assign subgraph to H (hyperbolic)
   if kappa_i > +threshold: assign subgraph to S (spherical)
   else:                    assign subgraph to R (Euclidean)

3. Allocate dimensions in proportion to each class:
   d_H = d * fraction_hyperbolic
   d_S = d * fraction_spherical
   d_R = d * fraction_euclidean
```

**Continuous relaxation:** Use Gumbel-Softmax to make the dimension allocation differentiable and trainable end-to-end.

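A deterministic sketch of the relaxation: a temperature-controlled softmax over three allocation logits. During training, Gumbel noise would be added to the logits before the softmax (omitted here to keep the example dependency-free); as the temperature anneals toward zero the soft fractions approach a hard one-hot choice. All names are illustrative:

```rust
// Temperature-controlled soft dimension allocation over (H, S, R).
// Gumbel noise omitted for determinism; illustrative sketch only.
fn soft_allocation(logits: [f64; 3], temperature: f64, d: usize) -> [usize; 3] {
    let scaled: Vec<f64> = logits.iter().map(|l| l / temperature).collect();
    let max = scaled.iter().cloned().fold(f64::MIN, f64::max);
    let exps: Vec<f64> = scaled.iter().map(|s| (s - max).exp()).collect();
    let total: f64 = exps.iter().sum();
    let fracs: Vec<f64> = exps.iter().map(|e| e / total).collect();
    // Round H and S down and give the remainder to the Euclidean
    // component, so the budget d is exactly preserved.
    let d_h = (fracs[0] * d as f64).floor() as usize;
    let d_s = (fracs[1] * d as f64).floor() as usize;
    [d_h, d_s, d - d_h - d_s]
}

fn main() {
    let alloc = soft_allocation([2.0, 0.5, 1.0], 1.0, 128);
    assert_eq!(alloc.iter().sum::<usize>(), 128); // budget preserved
    // Lower temperature concentrates dimensions on the largest logit.
    let sharp = soft_allocation([2.0, 0.5, 1.0], 0.1, 128);
    assert!(sharp[0] > alloc[0]);
}
```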
---

## 4. Lorentzian Graph Neural Networks

### 4.1 The Hyperboloid Model

The hyperboloid (Lorentz) model represents hyperbolic space as the upper sheet:

```
L_c^d = {x in R^{d+1} : <x, x>_L = -1/c, x_0 > 0}

Lorentzian inner product:
  <x, y>_L = -x_0 * y_0 + x_1 * y_1 + ... + x_d * y_d
```

**Advantages over the Poincare ball:**

- Numerically stable (no division by small numbers near the boundary)
- Natural connection to special relativity
- Efficient parallel transport

### 4.2 Lorentzian Attention

```
LorentzianAttention(Q, K, V):

  For each query q_i, key k_j:
    // Lorentzian inner product as attention score
    score_{ij} = -<q_i, k_j>_L - 1/c

    // This is related to hyperbolic distance:
    // d_L(x, y) = (1/sqrt(c)) * arccosh(-c * <x, y>_L)

  alpha = softmax(scores / sqrt(d))

  // Lorentzian centroid (Frechet mean on the hyperboloid)
  z_i = LorentzianCentroid(V, alpha[i])
```

**Lorentzian centroid computation:**
```
LorentzianCentroid(points, weights):
  1. Weighted sum in ambient space:
     s = sum_j w_j * v_j

  2. Project back to the hyperboloid:
     z = s / sqrt(|<s, s>_L| * c)
     // Ensures <z, z>_L = -1/c
```

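The projection step can be checked mechanically: after rescaling, the centroid satisfies <z, z>_L = -1/c again. A small Rust sketch (helper names are hypothetical; `lift` places a Euclidean point on the upper sheet):

```rust
// Lorentzian centroid sketch: weighted ambient sum, then rescale back onto
// the hyperboloid <z, z>_L = -1/c. Illustrative only.
fn lorentz_inner(x: &[f64], y: &[f64]) -> f64 {
    -x[0] * y[0] + x[1..].iter().zip(&y[1..]).map(|(a, b)| a * b).sum::<f64>()
}

/// Lift a Euclidean point onto the upper sheet of <x, x>_L = -1/c.
fn lift(spatial: &[f64], c: f64) -> Vec<f64> {
    let s2: f64 = spatial.iter().map(|a| a * a).sum();
    let mut x = vec![(s2 + 1.0 / c).sqrt()];
    x.extend_from_slice(spatial);
    x
}

fn lorentz_centroid(points: &[Vec<f64>], weights: &[f64], c: f64) -> Vec<f64> {
    let d = points[0].len();
    let mut s = vec![0.0; d];
    for (p, &w) in points.iter().zip(weights) {
        for i in 0..d { s[i] += w * p[i]; }
    }
    let scale = 1.0 / (lorentz_inner(&s, &s).abs() * c).sqrt();
    s.iter().map(|a| scale * a).collect()
}

fn main() {
    let c = 1.0;
    let pts = vec![lift(&[0.5, 0.0], c), lift(&[0.0, 0.5], c)];
    let z = lorentz_centroid(&pts, &[0.5, 0.5], c);
    // The centroid lies back on the hyperboloid: <z, z>_L = -1/c.
    assert!((lorentz_inner(&z, &z) + 1.0 / c).abs() < 1e-9);
    assert!(z[0] > 0.0); // upper sheet
}
```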
### 4.3 Causal Structure in Lorentzian Graphs

In Minkowski space, the Lorentzian metric defines a causal structure: event A can influence event B only if A is in B's past light cone.

**Causal attention:** Only allow attention from past to future:

```
alpha_{ij} = softmax(score_{ij}) * causal_mask_{ij}

causal_mask_{ij} = 1 if <x_i - x_j, x_i - x_j>_L <= 0 and x_j^0 < x_i^0
                   0 otherwise

// Interpretation: query i can attend to key j only if j is in i's causal past
```

This naturally enforces causality in temporal graph transformers.

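The mask reduces to two checks on the coordinate difference. A minimal sketch with signature (-, +, ..., +):

```rust
// Minkowski causal mask sketch: key j is visible to query i iff the
// separation is timelike or lightlike and j is earlier in coordinate time.
fn lorentz_inner(x: &[f64], y: &[f64]) -> f64 {
    -x[0] * y[0] + x[1..].iter().zip(&y[1..]).map(|(a, b)| a * b).sum::<f64>()
}

fn causal_mask(xi: &[f64], xj: &[f64]) -> bool {
    let diff: Vec<f64> = xi.iter().zip(xj).map(|(a, b)| a - b).collect();
    lorentz_inner(&diff, &diff) <= 0.0 && xj[0] < xi[0]
}

fn main() {
    // (t, x): the origin lies in the past light cone of (2, 1) ...
    assert!(causal_mask(&[2.0, 1.0], &[0.0, 0.0]));
    // ... but a spacelike-separated event (0, 5) is not visible ...
    assert!(!causal_mask(&[2.0, 1.0], &[0.0, 5.0]));
    // ... and attention never flows from future to past.
    assert!(!causal_mask(&[0.0, 0.0], &[2.0, 1.0]));
}
```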
### 4.4 Lorentz Boosts as Attention Transformations

In special relativity, Lorentz boosts map between reference frames. In Lorentzian GNNs, we use boosts as learned transformations:

```
Boost(x, v):
  // Boost embedding x by velocity v (||v|| < 1)
  gamma    = 1 / sqrt(1 - ||v||^2)
  x_0'     = gamma * (x_0 - v . x_{1:d})
  x_{1:d}' = x_{1:d} + (gamma - 1) * (v . x_{1:d}) / ||v||^2 * v - gamma * v * x_0
  return (x_0', x_{1:d}')
```

**Boost-equivariant attention:** Attention weights are invariant under Lorentz boosts:
```
alpha(Boost(x, v), Boost(y, v)) = alpha(x, y)
// Same attention regardless of reference frame
```

---

## 5. Curvature-Adaptive Routing

### 5.1 The Problem

Different parts of a graph have different optimal curvatures. A single global curvature is suboptimal. We need per-node or per-subgraph curvature.

### 5.2 Sectional Curvature Estimation

For a small triangle (u, v, w) in the graph, estimate sectional curvature using Toponogov comparison:

```
Given a triangle with side lengths a = d(u,v), b = d(v,w), c = d(u,w):

Euclidean comparison angle:
  cos(alpha_0) = (a^2 + b^2 - c^2) / (2ab)

Actual angle (from embeddings):
  cos(alpha) = <h_u - h_v, h_w - h_v> / (||h_u - h_v|| * ||h_w - h_v||)

Curvature estimate:
  kappa ~ 3 * (alpha - alpha_0) / (a * b * sin(alpha_0))

kappa < 0: locally hyperbolic (tree-like)
kappa > 0: locally spherical (cycle-like)
kappa = 0: locally Euclidean (flat)
```

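The estimate needs only the three graph distances and the three embeddings. A direct Rust transcription of the formulas above (illustrative only):

```rust
// Triangle-comparison curvature estimate: compare the embedded angle at v
// against the Euclidean comparison angle computed from graph distances.
fn angle_at(hv: &[f64], hu: &[f64], hw: &[f64]) -> f64 {
    let e1: Vec<f64> = hu.iter().zip(hv).map(|(a, b)| a - b).collect();
    let e2: Vec<f64> = hw.iter().zip(hv).map(|(a, b)| a - b).collect();
    let dot: f64 = e1.iter().zip(&e2).map(|(a, b)| a * b).sum();
    let n1 = e1.iter().map(|a| a * a).sum::<f64>().sqrt();
    let n2 = e2.iter().map(|a| a * a).sum::<f64>().sqrt();
    (dot / (n1 * n2)).clamp(-1.0, 1.0).acos()
}

/// kappa ~ 3 * (alpha - alpha_0) / (a * b * sin(alpha_0))
fn estimate_curvature(
    a: f64, b: f64, c: f64,              // graph distances d(u,v), d(v,w), d(u,w)
    hv: &[f64], hu: &[f64], hw: &[f64],  // embeddings of v, u, w
) -> f64 {
    let cos_a0 = ((a * a + b * b - c * c) / (2.0 * a * b)).clamp(-1.0, 1.0);
    let alpha0 = cos_a0.acos();
    let alpha = angle_at(hv, hu, hw);
    3.0 * (alpha - alpha0) / (a * b * alpha0.sin())
}

fn main() {
    // A Euclidean-consistent triangle: the embedded angle matches the
    // comparison angle, so the estimate is ~0 (locally flat).
    let (hv, hu, hw) = ([0.0, 0.0], [1.0, 0.0], [0.0, 1.0]);
    let k = estimate_curvature(1.0, 1.0, 2f64.sqrt(), &hv, &hu, &hw);
    assert!(k.abs() < 1e-9);
}
```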
### 5.3 Adaptive Curvature Attention

```
CurvatureAdaptiveAttention(Q, K, V, G):

  For each node v:
    // Estimate local curvature
    kappa_v = estimate_curvature(v, G)

    // Select attention mechanism based on curvature
    if kappa_v < -threshold:
      attn_v = HyperbolicAttention(Q[v], K[N(v)], V[N(v)], c=-kappa_v)
    elif kappa_v > threshold:
      attn_v = SphericalAttention(Q[v], K[N(v)], V[N(v)], c=kappa_v)
    else:
      attn_v = EuclideanAttention(Q[v], K[N(v)], V[N(v)])

  // Smooth blending at curvature transitions
  For boundary nodes (where curvature changes sign):
    attn_v = lerp(attn_neg, attn_pos, sigmoid(kappa_v / sigma))
```

**RuVector integration:**

```rust
/// Curvature-adaptive graph attention
pub trait CurvatureAdaptiveAttention {
    /// Estimate local curvature at each node
    fn estimate_curvature(
        &self,
        graph: &PropertyGraph,
        features: &Tensor,
        node: NodeId,
    ) -> f64;

    /// Compute attention with locally-adapted geometry
    fn attend(
        &self,
        graph: &PropertyGraph,
        features: &Tensor,
        curvatures: &[f64],
    ) -> Result<Tensor, CurvatureError>;

    /// Get curvature distribution statistics
    fn curvature_stats(&self) -> CurvatureDistribution;
}

pub struct CurvatureDistribution {
    pub mean: f64,
    pub std: f64,
    pub min: f64,
    pub max: f64,
    pub fraction_hyperbolic: f64,
    pub fraction_spherical: f64,
    pub fraction_euclidean: f64,
    pub per_node: Vec<f64>,
}
```

---

## 6. Riemannian Optimization on Graphs

### 6.1 Riemannian Gradient Descent

Standard gradient descent does not preserve manifold constraints. Riemannian GD operates on the manifold directly:

```
Riemannian SGD update:

1. Compute Euclidean gradient:  g = dL/dtheta
2. Project to tangent space:    g_R = proj_{T_theta M}(g)
3. Retract to manifold:         theta' = Retract_theta(-lr * g_R)

For the Poincare ball:
  proj(g)    = g / (lambda_theta)^2  // rescale by the conformal factor
  Retract(v) = exp_theta(v)          // exponential map (lr already applied in step 3)

For the hyperboloid (c = 1):
  proj(g)    = g + <g, theta>_L * theta  // Lorentzian projection
  Retract(v) = cosh(||v||_L) * theta + sinh(||v||_L) * v / ||v||_L
```

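A single Poincare-ball RSGD step (c = 1), combining the projection and retraction above. Dependency-free sketch, not the RuVector optimizer:

```rust
// One Riemannian SGD step on the Poincare ball (c = 1): rescale the
// Euclidean gradient by 1/lambda^2, then retract with the exponential map.
fn norm(x: &[f64]) -> f64 { x.iter().map(|a| a * a).sum::<f64>().sqrt() }

fn mobius_add(x: &[f64], y: &[f64]) -> Vec<f64> {
    let xy: f64 = x.iter().zip(y).map(|(a, b)| a * b).sum();
    let (x2, y2) = (norm(x).powi(2), norm(y).powi(2));
    let denom = 1.0 + 2.0 * xy + x2 * y2;
    x.iter().zip(y)
        .map(|(a, b)| ((1.0 + 2.0 * xy + y2) * a + (1.0 - x2) * b) / denom)
        .collect()
}

fn exp_map(theta: &[f64], v: &[f64]) -> Vec<f64> {
    let nv = norm(v);
    if nv == 0.0 { return theta.to_vec(); }
    let lambda = 2.0 / (1.0 - norm(theta).powi(2));
    let scale = (lambda * nv / 2.0).tanh() / nv;
    let step: Vec<f64> = v.iter().map(|a| scale * a).collect();
    mobius_add(theta, &step)
}

fn rsgd_step(theta: &[f64], grad: &[f64], lr: f64) -> Vec<f64> {
    let lambda = 2.0 / (1.0 - norm(theta).powi(2));
    // Riemannian gradient: Euclidean gradient rescaled by 1 / lambda^2.
    let v: Vec<f64> = grad.iter().map(|g| -lr * g / (lambda * lambda)).collect();
    exp_map(theta, &v)
}

fn main() {
    // Gradient points away from the origin, so the step moves toward it.
    let next = rsgd_step(&[0.6, 0.0], &[1.2, 0.0], 0.1);
    assert!(next[0] < 0.6 && next[0] > 0.0);
    // The iterate always stays inside the unit ball.
    assert!(norm(&next) < 1.0);
}
```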
### 6.2 Mixed-Curvature Optimization

For the product manifold M = H x S x R:
```
1. Split gradient: g = (g_H, g_S, g_R)

2. Project each component:
   g_H' = proj_{T_H}(g_H)  // hyperbolic projection
   g_S' = proj_{T_S}(g_S)  // spherical projection
   g_R' = g_R              // Euclidean (no projection needed)

3. Retract each component:
   theta_H' = exp_H(-lr_H * g_H')
   theta_S' = exp_S(-lr_S * g_S')
   theta_R' = theta_R - lr_R * g_R'
```

**Per-manifold learning rates:** Different curvatures need different learning rates. Hyperbolic components typically need smaller learning rates to avoid exploding gradients near the boundary.

---

## 7. Projections

### 7.1 By 2030

**Likely:**

- Product manifold transformers with learned dimension allocation standard for heterogeneous graphs
- Curvature-adaptive attention for knowledge graphs (hierarchical + cyclical)
- Riemannian optimization integrated into standard training frameworks

**Possible:**

- Lorentzian graph neural networks for spacetime-structured data
- Per-node curvature adaptation (not just per-subgraph)
- Curvature-based architecture search (select geometry by task)

**Speculative:**

- General Riemannian manifold attention (beyond constant-curvature spaces)
- Learned metric tensors that define a custom geometry per graph

### 7.2 By 2033

**Likely:**

- Mixed-curvature graph transformers as the default for graph ML
- Hardware-accelerated hyperbolic operations

**Possible:**

- Finsler manifold attention (asymmetric distances for directed graphs)
- Sub-Riemannian attention (constrained movement in embedding space)
- Connection to physics: graph attention in curved spacetime

### 7.3 By 2036+

**Possible:**

- Emergent geometry: graph transformers that discover the right manifold
- Geometric deep learning unification: all attention as parallel transport on bundles
- Quantum hyperbolic attention on quantum hardware

**Speculative:**

- Graph transformers operating in exotic manifolds (Calabi-Yau, spin manifolds)
- Attention as geodesic flow on the manifold of distributions

---

## 8. RuVector Implementation Roadmap

### Phase 1: Product Manifolds (2026-2027)

- Extend `ruvector-hyperbolic-hnsw` with spherical and product space support
- Implement product manifold attention in `ruvector-attention/src/hyperbolic/`
- Learned dimension allocation with Gumbel-Softmax
- Benchmark on mixed-curvature datasets

### Phase 2: Lorentzian & Curvature-Adaptive (2027-2028)

- Implement the Lorentzian (hyperboloid) model alongside the Poincare ball
- Curvature estimation module
- Curvature-adaptive attention routing
- Riemannian optimizer for mixed-curvature training
- Integration with the existing `ruvector-attention/src/curvature/` infrastructure

### Phase 3: Advanced Geometry (2028-2030)

- Finsler manifold attention for directed graphs
- General Riemannian attention with learned metric tensors
- Causal Lorentzian attention for temporal graphs
- Integration with the physics-informed axis (Doc 22)

---

## References

1. Chami et al., "Hyperbolic Graph Convolutional Neural Networks," NeurIPS 2019
2. Bachmann et al., "Constant Curvature Graph Convolutional Networks," ICML 2020
3. Gu et al., "Learning Mixed-Curvature Representations in Product Spaces," ICLR 2019
4. Law et al., "Lorentzian Distance Learning for Hyperbolic Representations," ICML 2019
5. Nickel & Kiela, "Poincare Embeddings for Learning Hierarchical Representations," NeurIPS 2017
6. Bonnabel, "Stochastic Gradient Descent on Riemannian Manifolds," IEEE TAC 2013
7. RuVector `ruvector-hyperbolic-hnsw` documentation (internal)

---

**End of Document 27**

**Next:** [Doc 28 - Temporal: Causal & Retrocausal Attention](28-temporal-causal-retrocausal.md)