# Axis 7: Hyperbolic & Mixed-Curvature Graph Transformers

**Document:** 27 of 30
**Series:** Graph Transformers: 2026-2036 and Beyond
**Last Updated:** 2026-02-25
**Status:** Research Prospectus

---

## 1. Problem Statement

Euclidean space is the wrong geometry for most real-world graphs. Hierarchical data (taxonomies, organizational charts, phylogenetic trees) embeds naturally into hyperbolic space, where the volume of a ball grows exponentially with radius -- matching the exponential branching of trees. Cyclical data (molecular rings, social cycles) embeds into spherical space. Most real graphs contain a mixture of hierarchical, cyclical, and flat substructures.

The mixed-curvature axis asks: how do we build graph transformers that operate in the right geometry for each part of the graph?

### 1.1 Why Geometry Matters

**Distortion theorem (Bourgain, 1985).** Any metric space with n points can be embedded in Euclidean space with O(log n) distortion. For trees, hyperbolic space achieves O(1) distortion. The gap is exponential.

**Practical impact:**

| Graph Structure | Euclidean (d=128) | Hyperbolic (d=128) | Improvement |
|----------------|-------------------|-------------------|-------------|
| Tree (branching=3, depth=10) | 40% recall@10 | 95% recall@10 | 2.4x |
| Social network (power-law) | 70% | 92% | 1.3x |
| Molecular graph (cycles) | 85% | 75% | Worse |
| Mixed (wiki hyperlinks) | 75% | 80% | 1.07x |

Hyperbolic helps hierarchies but hurts cycles. We need both.

### 1.2 RuVector Baseline

- **`ruvector-hyperbolic-hnsw`**: Poincare ball model (`poincare.rs`), hyperbolic HNSW search (`hnsw.rs`), tangent space operations (`tangent.rs`), sharding (`shard.rs`)
- **`ruvector-attention`**: Hyperbolic attention (`hyperbolic/`), curvature attention (`curvature/`)
- **`ruvector-attention`**: Info-geometry (`info_geometry/`), transport attention (`transport/`)

---

## 2. Hyperbolic Graph Attention

### 2.1 The Poincare Ball Model

The Poincare ball B_c^d = {x in R^d : c * ||x||^2 < 1} with curvature -1/c. Key operations:

**Mobius addition:**
```
x (+)_c y = ((1 + 2c<x,y> + c||y||^2) * x + (1 - c||x||^2) * y)
             / (1 + 2c<x,y> + c^2 * ||x||^2 * ||y||^2)
```

**Hyperbolic distance:**
```
d_c(x, y) = (2/sqrt(c)) * arctanh(sqrt(c) * ||(-x) (+)_c y||)
```

**Exponential map (tangent -> ball):**
```
exp_x^c(v) = x (+)_c (tanh(sqrt(c) * lambda_x * ||v|| / 2) * v / (sqrt(c) * ||v||))
where lambda_x = 2 / (1 - c * ||x||^2)  (conformal factor)
```

**Logarithmic map (ball -> tangent):**
```
log_x^c(y) = (2 / (sqrt(c) * lambda_x)) * arctanh(sqrt(c) * ||(-x) (+)_c y||)
             * ((-x) (+)_c y) / ||(-x) (+)_c y||
```

### 2.2 Hyperbolic Multi-Head Attention

Standard multi-head attention operates in Euclidean space. Hyperbolic MHA works in the Poincare ball:

```
HyperbolicMHA(Q, K, V):

For each head h:
  1. Project to tangent space at origin:
     Q_h = log_0(Q) * W_Q^h
     K_h = log_0(K) * W_K^h
     V_h = log_0(V) * W_V^h

  2. Compute attention in tangent space (Euclidean):
     alpha_h = softmax(Q_h * K_h^T / sqrt(d_h))

  3. Aggregate values in tangent space:
     Z_h = alpha_h * V_h

  4. Map back to hyperbolic space:
     O_h = exp_0(Z_h)

Concatenate and project:
  O = exp_0(concat(log_0(O_1), ..., log_0(O_H)) * W_O)
```

**Advantage:** Attention weights computed from hyperbolic distances naturally give more weight to semantically close nodes in the tree hierarchy.

### 2.3 Fully Hyperbolic Attention (No Tangent Space)

The tangent space approach "flattens" the hyperbolic geometry. Fully hyperbolic attention operates entirely in the ball:

```
FullyHyperbolicAttention(q, K, V):

  For each key k_j:
    // Hyperbolic attention score
    score_j = -beta * d_c(q, k_j)^2 + <q, k_j>_L
    // where <.,.>_L is the Lorentzian inner product

  alpha = softmax(scores)

  // Hyperbolic weighted midpoint (Einstein midpoint)
  z = EinsteinMidpoint(V, alpha, c)
    = exp_0(sum_j alpha_j * gamma_j * log_0(v_j) / sum_j alpha_j * gamma_j)
    // where gamma_j = 1 / sqrt(1 - c * ||v_j||^2) is the Lorentz factor
```

**Complexity:** Same as Euclidean attention O(n^2 * d), but with ~3x constant factor due to hyperbolic arithmetic.

---

## 3. Product Manifold Transformers

### 3.1 Product Spaces

Real graphs have mixed curvature. We use product manifolds:

```
M = H_{c1}^{d1} x S_{c2}^{d2} x R^{d3}

where:
  H_c^d = Hyperbolic space (curvature -1/c)  -- for hierarchies
  S_c^d = Spherical space (curvature 1/c)    -- for cycles
  R^d   = Euclidean space (curvature 0)      -- for flat structures

Total dimension: d = d1 + d2 + d3
```

**Distance in product space:**
```
d_M(x, y) = sqrt(w_H * d_H(x_H, y_H)^2 + w_S * d_S(x_S, y_S)^2 + w_R * d_R(x_R, y_R)^2)
```
where w_H, w_S, w_R are learned weights.

### 3.2 Product Manifold Attention

```
ProductAttention(Q, K, V):

  // Split embeddings into manifold components
  Q_H, Q_S, Q_R = split(Q, [d1, d2, d3])
  K_H, K_S, K_R = split(K, [d1, d2, d3])
  V_H, V_S, V_R = split(V, [d1, d2, d3])

  // Attention scores from each manifold
  score_H = -d_H(Q_H, K_H)^2        // Hyperbolic distance
  score_S = <Q_S, K_S>_S              // Spherical inner product
  score_R = Q_R . K_R^T / sqrt(d3)   // Euclidean dot product

  // Combined attention
  alpha = softmax(w_H * score_H + w_S * score_S + w_R * score_R)

  // Aggregate per manifold
  Z_H = HyperbolicMidpoint(V_H, alpha)
  Z_S = SphericalMidpoint(V_S, alpha)
  Z_R = EuclideanWeightedSum(V_R, alpha)

  return concat(Z_H, Z_S, Z_R)
```

### 3.3 Learned Dimension Allocation

**Key question:** How many dimensions to allocate to each manifold component?

**Differentiable allocation:**
```
Input: Total dimension budget d, curvature signal from data

1. Compute curvature estimates per subgraph:
   kappa_i = estimated_sectional_curvature(subgraph_i)

2. Classify:
   if kappa_i < -threshold: allocate to H (hyperbolic)
   if kappa_i > +threshold: allocate to S (spherical)
   else: allocate to R (Euclidean)

3. Dimension allocation:
   d_H = d * fraction_hyperbolic
   d_S = d * fraction_spherical
   d_R = d * fraction_euclidean
```

**Continuous relaxation:** Use Gumbel-Softmax to make dimension allocation differentiable and trainable end-to-end.

---

## 4. Lorentzian Graph Neural Networks

### 4.1 The Hyperboloid Model

The hyperboloid (Lorentz) model represents hyperbolic space as:

```
L_c^d = {x in R^{d+1} : <x, x>_L = -1/c}

Lorentzian inner product:
  <x, y>_L = -x_0 * y_0 + x_1 * y_1 + ... + x_d * y_d
```

**Advantages over Poincare ball:**
- Numerically stable (no division by small numbers near boundary)
- Natural connection to special relativity
- Efficient parallel transport

### 4.2 Lorentzian Attention

```
LorentzianAttention(Q, K, V):

  For each query q_i, key k_j:
    // Lorentzian inner product as attention score
    score_{ij} = -<q_i, k_j>_L - 1/c

    // This is related to hyperbolic distance:
    // d_L(x,y) = (1/sqrt(c)) * arccosh(-c * <x, y>_L)

  alpha = softmax(scores / sqrt(d))

  // Lorentzian centroid (Frechet mean on hyperboloid)
  z_i = LorentzianCentroid(V, alpha[i])
```

**Lorentzian centroid computation:**
```
LorentzianCentroid(points, weights):
  1. Weighted sum in ambient space:
     s = sum_j w_j * v_j

  2. Project back to hyperboloid:
     z = s / sqrt(|<s, s>_L| * c)
     // Ensures <z, z>_L = -1/c
```

### 4.3 Causal Structure in Lorentzian Graphs

In Minkowski space, the Lorentzian metric defines a causal structure: event A can influence event B only if A is in B's past light cone.

**Causal attention:** Only allow attention from past to future:
```
alpha_{ij} = softmax(score_{ij}) * causal_mask_{ij}

causal_mask_{ij} = 1  if <x_i - x_j, x_i - x_j>_L <= 0 and x_j^0 < x_i^0
                   0  otherwise

// Interpretation: j can attend to i only if i is in j's causal past
```

This naturally enforces causality in temporal graph transformers.

### 4.4 Lorentz Boosts as Attention Transformations

In special relativity, Lorentz boosts map between reference frames. In Lorentzian GNNs, we use boosts as learned transformations:

```
Boost(x, v):
  // Boost embedding x by velocity v
  gamma = 1 / sqrt(1 - ||v||^2)
  x_0' = gamma * (x_0 - v . x_{1:d})
  x_{1:d}' = x_{1:d} + (gamma - 1) * (v . x_{1:d}) / ||v||^2 * v - gamma * v * x_0
  return (x_0', x_{1:d}')
```

**Boost-equivariant attention:** Attention weights are invariant under Lorentz boosts:
```
alpha(Boost(x, v), Boost(y, v)) = alpha(x, y)
// Same attention regardless of reference frame
```

---

## 5. Curvature-Adaptive Routing

### 5.1 The Problem

Different parts of a graph have different optimal curvatures. A single global curvature is suboptimal. We need per-node or per-subgraph curvature.

### 5.2 Sectional Curvature Estimation

For a small triangle (u, v, w) in the graph, estimate sectional curvature using the Toponogov comparison:

```
Given triangle with side lengths a = d(u,v), b = d(v,w), c = d(u,w):

Euclidean comparison angle:
  cos(alpha_0) = (a^2 + b^2 - c^2) / (2ab)

Actual angle (from embeddings):
  cos(alpha) = <h_u - h_v, h_w - h_v> / (||h_u - h_v|| * ||h_w - h_v||)

Curvature estimate:
  kappa ~ 3 * (alpha - alpha_0) / (a * b * sin(alpha_0))

  kappa < 0: locally hyperbolic (tree-like)
  kappa > 0: locally spherical (cycle-like)
  kappa = 0: locally Euclidean (flat)
```

### 5.3 Adaptive Curvature Attention

```
CurvatureAdaptiveAttention(Q, K, V, G):

  For each node v:
    // Estimate local curvature
    kappa_v = estimate_curvature(v, G)

    // Select attention mechanism based on curvature
    if kappa_v < -threshold:
      attn_v = HyperbolicAttention(Q[v], K[N(v)], V[N(v)], c=-1/kappa_v)
    elif kappa_v > threshold:
      attn_v = SphericalAttention(Q[v], K[N(v)], V[N(v)], c=1/kappa_v)
    else:
      attn_v = EuclideanAttention(Q[v], K[N(v)], V[N(v)])

  // Smooth blending at curvature transitions
  For boundary nodes (where curvature changes sign):
    attn_v = lerp(attn_neg, attn_pos, sigmoid(kappa_v / sigma))
```

**RuVector integration:**

```rust
/// Curvature-adaptive graph attention
pub trait CurvatureAdaptiveAttention {
    /// Estimate local curvature at each node
    fn estimate_curvature(
        &self,
        graph: &PropertyGraph,
        features: &Tensor,
        node: NodeId,
    ) -> f64;

    /// Compute attention with locally-adapted geometry
    fn attend(
        &self,
        graph: &PropertyGraph,
        features: &Tensor,
        curvatures: &[f64],
    ) -> Result<Tensor, CurvatureError>;

    /// Get curvature distribution statistics
    fn curvature_stats(&self) -> CurvatureDistribution;
}

pub struct CurvatureDistribution {
    pub mean: f64,
    pub std: f64,
    pub min: f64,
    pub max: f64,
    pub fraction_hyperbolic: f64,
    pub fraction_spherical: f64,
    pub fraction_euclidean: f64,
    pub per_node: Vec<f64>,
}
```

---

## 6. Riemannian Optimization on Graphs

### 6.1 Riemannian Gradient Descent

Standard gradient descent does not preserve manifold constraints. Riemannian GD operates on the manifold directly:

```
Riemannian SGD update:

1. Compute Euclidean gradient: g = dL/dtheta
2. Project to tangent space: g_R = proj_{T_theta M}(g)
3. Retract to manifold: theta' = Retract_theta(-lr * g_R)

For Poincare ball:
  proj(g) = g / (lambda_theta)^2         // Rescale by conformal factor
  Retract(v) = exp_theta(-lr * v)         // Exponential map

For Hyperboloid:
  proj(g) = g + <g, theta>_L * theta      // Lorentzian projection
  Retract(v) = cosh(||v||_L) * theta + sinh(||v||_L) * v / ||v||_L
```

### 6.2 Mixed-Curvature Optimization

For product manifold M = H x S x R:
```
1. Split gradient: g = (g_H, g_S, g_R)
2. Project each component:
   g_H' = proj_{T_H}(g_H)   // Hyperbolic projection
   g_S' = proj_{T_S}(g_S)   // Spherical projection
   g_R' = g_R                 // Euclidean (no projection needed)
3. Retract each component:
   theta_H' = exp_H(-lr_H * g_H')
   theta_S' = exp_S(-lr_S * g_S')
   theta_R' = theta_R - lr_R * g_R'
```

**Per-manifold learning rates:** Different curvatures need different learning rates. Hyperbolic components typically need smaller learning rates to avoid exploding gradients near the boundary.

---

## 7. Projections

### 7.1 By 2030

**Likely:**
- Product manifold transformers with learned dimension allocation standard for heterogeneous graphs
- Curvature-adaptive attention for knowledge graphs (hierarchical + cyclical)
- Riemannian optimization integrated into standard training frameworks

**Possible:**
- Lorentzian graph neural networks for spacetime-structured data
- Per-node curvature adaptation (not just per-subgraph)
- Curvature-based architecture search (select geometry by task)

**Speculative:**
- General Riemannian manifold attention (beyond constant-curvature spaces)
- Learned metric tensors that define custom geometry per graph

### 7.2 By 2033

**Likely:**
- Mixed-curvature graph transformers as default for graph ML
- Hardware-accelerated hyperbolic operations

**Possible:**
- Finsler manifold attention (asymmetric distances for directed graphs)
- Sub-Riemannian attention (constrained movement in embedding space)
- Connection to physics: graph attention in curved spacetime

### 7.3 By 2036+

**Possible:**
- Emergent geometry: graph transformers that discover the right manifold
- Geometric deep learning unification: all attention as parallel transport on bundles
- Quantum hyperbolic attention on quantum hardware

**Speculative:**
- Graph transformers operating in exotic manifolds (Calabi-Yau, spin manifolds)
- Attention as geodesic flow on the manifold of distributions

---

## 8. RuVector Implementation Roadmap

### Phase 1: Product Manifolds (2026-2027)
- Extend `ruvector-hyperbolic-hnsw` with spherical and product space support
- Implement product manifold attention in `ruvector-attention/src/hyperbolic/`
- Learned dimension allocation with Gumbel-Softmax
- Benchmark on mixed-curvature datasets

### Phase 2: Lorentzian & Curvature-Adaptive (2027-2028)
- Implement Lorentzian (hyperboloid) model alongside Poincare ball
- Curvature estimation module
- Curvature-adaptive attention routing
- Riemannian optimizer for mixed-curvature training
- Integration with `ruvector-attention/src/curvature/` existing infrastructure

### Phase 3: Advanced Geometry (2028-2030)
- Finsler manifold attention for directed graphs
- General Riemannian attention with learned metric tensors
- Causal Lorentzian attention for temporal graphs
- Integration with physics-informed axis (Doc 22)

---

## References

1. Chami et al., "Hyperbolic Graph Convolutional Neural Networks," NeurIPS 2019
2. Bachmann et al., "Constant Curvature Graph Convolutional Networks," ICML 2020
3. Gu et al., "Learning Mixed-Curvature Representations in Product Spaces," ICLR 2019
4. Law et al., "Lorentzian Distance Learning for Hyperbolic Representations," ICML 2019
5. Nickel & Kiela, "Poincare Embeddings for Learning Hierarchical Representations," NeurIPS 2017
6. Bonnabel, "Stochastic Gradient Descent on Riemannian Manifolds," IEEE TAC 2013
7. RuVector `ruvector-hyperbolic-hnsw` documentation (internal)

---

**End of Document 27**

**Next:** [Doc 28 - Temporal: Causal & Retrocausal Attention](28-temporal-causal-retrocausal.md)