# Hyperbolic and Mixed-Curvature Graph Transformers: Product Manifold Attention

## Overview

### Problem Statement

Graph Transformers have become the dominant architecture for learning on relational data, yet nearly all deployed systems operate in flat Euclidean space. This is a geometric mismatch: most real-world graphs are not flat.

**Why Euclidean space fails for real-world graphs:**

1. **Power-law degree distributions** (social networks, citation graphs, the web) exhibit tree-like branching that requires exponentially many dimensions to embed in Euclidean space without distortion. A binary tree of depth $d$ has $2^d$ leaves, but fitting them equidistantly in $\mathbb{R}^n$ requires $n \geq 2^d - 1$ dimensions.

2. **Hierarchical structures** (taxonomies, organizational charts, ontologies) naturally live in hyperbolic space, where the volume of a ball grows exponentially with radius -- matching the exponential growth of tree levels.

3. **Cyclic substructures** (molecular rings, periodic lattices, social cliques) have positive curvature and embed naturally on spheres $S^n$.

4. **Hybrid graphs** (knowledge graphs combining hierarchies with lateral associations) require multiple curvature regimes simultaneously.

The consequence: flat-space Graph Transformers waste capacity representing geometric structure that comes for free in the correct curved space, leading to higher distortion, larger models, and slower convergence.

### Proposed Solution

Develop **Product Manifold Graph Transformers** that operate natively on mixed-curvature spaces. The core decomposition is:

$$\mathcal{M} = S^{n_1} \times H^{n_2} \times \mathbb{R}^{n_3}$$

where $S^{n_1}$ captures cyclic/clustered structure, $H^{n_2}$ captures hierarchical structure, and $\mathbb{R}^{n_3}$ captures flat semantic similarity. Every component of the attention mechanism -- queries, keys, values, aggregation, and optimization -- operates in its geometrically appropriate space.

### Connection to RuVector

RuVector already has substantial infrastructure for this research direction:

- **`ruvector-attention/src/hyperbolic/`**: Poincare ball operations (`poincare.rs`), Lorentz cascade attention with Busemann scoring and Einstein midpoint (`lorentz_cascade.rs`), mixed-curvature attention (`mixed_curvature.rs`)
- **`ruvector-attention/src/curvature/`**: Fused E x H x S attention (`fused_attention.rs`), tangent space mapping (`tangent_space.rs`), component quantizer (`component_quantizer.rs`)
- **`ruvector-attention/src/transport/`**: Sliced Wasserstein and centroid optimal transport attention
- **`ruvector-attention/src/topology/`**: Topology-gated attention with coherence metrics
- **`ruvector-graph/`**: Full property graph with Cypher queries, distributed federation, and hybrid vector-graph search
- **`ruvector-solver/`**: Sublinear graph solvers (forward/backward push, CG, random walk, BMSSP)

This document extends RuVector's existing mixed-curvature capabilities toward full product manifold Graph Transformers with learned curvature fields.

---

## Technical Deep Dive

### 1. Hyperbolic Graph Transformers

#### Poincare Ball Attention

In the Poincare ball model $\mathbb{B}^n_c = \{x \in \mathbb{R}^n : c\|x\|^2 < 1\}$, the standard dot-product attention $\text{softmax}(QK^T / \sqrt{d})$ is replaced with geodesic attention:

$$\alpha_{ij} = \frac{\exp(-d_{\mathbb{B}}(q_i, k_j) / \tau)}{\sum_l \exp(-d_{\mathbb{B}}(q_i, k_l) / \tau)}$$

where $d_{\mathbb{B}}(x, y) = \frac{1}{\sqrt{c}} \operatorname{arcosh}\left(1 + \frac{2c\|x - y\|^2}{(1 - c\|x\|^2)(1 - c\|y\|^2)}\right)$.

RuVector's `poincare.rs` already implements this with numerical stability via epsilon-buffered projection. The key insight from Lorentz cascade attention (`lorentz_cascade.rs`) is that the **Lorentz model avoids boundary instability entirely**: points live on the hyperboloid $\{x : \langle x, x \rangle_L = -1/c,\ x_0 > 0\}$ rather than inside a ball, and attention scores reduce to Busemann functions (single dot products).
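
As a concrete illustration of the scoring above, here is a minimal sketch of geodesic attention weights on the Poincare ball. The helper names are illustrative, not the actual `poincare.rs` API, and the clamps stand in for its epsilon-buffered projection.

```rust
/// Poincare distance d_B(x, y) = (1/sqrt(c)) * arcosh(1 + 2c||x-y||^2 / ((1-c||x||^2)(1-c||y||^2))).
fn poincare_distance(x: &[f32], y: &[f32], c: f32) -> f32 {
    let sq = |v: &[f32]| v.iter().map(|a| a * a).sum::<f32>();
    let diff_sq: f32 = x.iter().zip(y).map(|(a, b)| (a - b) * (a - b)).sum();
    let denom = ((1.0 - c * sq(x)) * (1.0 - c * sq(y))).max(1e-6);
    let arg = (1.0 + 2.0 * c * diff_sq / denom).max(1.0);
    // arcosh(z) = ln(z + sqrt(z^2 - 1))
    (arg + (arg * arg - 1.0).sqrt()).ln() / c.sqrt()
}

/// Softmax over negative geodesic distances with temperature tau.
fn geodesic_attention(q: &[f32], keys: &[Vec<f32>], c: f32, tau: f32) -> Vec<f32> {
    let logits: Vec<f32> = keys.iter().map(|k| -poincare_distance(q, k, c) / tau).collect();
    let max = logits.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = logits.iter().map(|l| (l - max).exp()).collect();
    let sum: f32 = exps.iter().sum();
    exps.iter().map(|e| e / sum).collect()
}
```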

#### Lorentz Model Message Passing

In the Lorentz model, message passing between graph nodes proceeds as:

1. **Embed** each node $v$ onto the hyperboloid: $h_v \in H^n_c$
2. **Attend** using Busemann scoring: $B_\xi(x) = \ln(-\langle x, \xi \rangle_L)$, where $\xi$ is a light-like focal direction defining the hierarchy
3. **Aggregate** via the Einstein midpoint (closed-form, unlike the iterative Frechet mean): $\bar{h} = \text{proj}_H\left(\sum_i w_i \gamma_i h_i \,/\, \|\sum_i w_i \gamma_i h_i\|_L\right)$, where $\gamma_i$ is the Lorentz factor

RuVector's `LorentzCascadeAttention` implements this with multi-curvature heads operating at logarithmically spaced curvatures, capturing hierarchy at multiple scales simultaneously.
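
One common way to realize the projection formula in step 3 is to sum the (already Lorentz-factored) hyperboloid coordinates and renormalize back onto the hyperboloid. The sketch below assumes points are stored with the time coordinate first; the names are illustrative, not RuVector's `einstein_midpoint`.

```rust
/// Minkowski inner product <x, y>_L = -x_0 y_0 + sum_{i>0} x_i y_i.
fn lorentz_inner(x: &[f32], y: &[f32]) -> f32 {
    -x[0] * y[0] + x[1..].iter().zip(&y[1..]).map(|(a, b)| a * b).sum::<f32>()
}

/// Weighted midpoint on the hyperboloid { x : <x, x>_L = -1/c } (c > 0 is the
/// curvature magnitude): sum the weighted points, then rescale so the result
/// satisfies <m, m>_L = -1/c again.
fn lorentz_midpoint(points: &[Vec<f32>], weights: &[f32], c: f32) -> Vec<f32> {
    let dim = points[0].len();
    let mut m = vec![0.0f32; dim];
    for (p, &w) in points.iter().zip(weights) {
        for (d, &v) in p.iter().enumerate() {
            m[d] += w * v;
        }
    }
    // A positive combination of future-pointing timelike vectors is timelike,
    // so <m, m>_L < 0 and the rescaling below is well defined.
    let norm = (-c * lorentz_inner(&m, &m)).max(1e-12).sqrt();
    m.iter().map(|v| v / norm).collect()
}
```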

#### Gyrovector Aggregation

Standard weighted averaging in Euclidean space ($\bar{v} = \sum_i w_i v_i$) does not preserve the Poincare ball constraint. Instead, aggregation must use Mobius operations:

$$\text{AGGREGATE}(\{(w_i, v_i)\}) = \bigoplus_{i=1}^n (w_i \otimes_c v_i)$$

where $\oplus_c$ is Mobius addition and $\otimes_c$ is Mobius scalar multiplication. RuVector's `poincare.rs` provides `mobius_add` and `mobius_scalar_mult` with full numerical stability.

The practical limitation is that Mobius aggregation is sequential -- each addition depends on the previous result. The Frechet mean (`frechet_mean` in RuVector) offers a parallel alternative via Riemannian gradient descent in the tangent space.
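
For reference, the two Mobius primitives written out in scalar Rust from the standard gyrovector formulas. This is a hedged sketch; the `mobius_add` / `mobius_scalar_mult` in `poincare.rs` are the numerically hardened versions.

```rust
/// Mobius addition x (+)_c y on the ball of curvature -c (c > 0).
fn mobius_add(x: &[f32], y: &[f32], c: f32) -> Vec<f32> {
    let dot: f32 = x.iter().zip(y).map(|(a, b)| a * b).sum();
    let x2: f32 = x.iter().map(|a| a * a).sum();
    let y2: f32 = y.iter().map(|a| a * a).sum();
    let denom = (1.0 + 2.0 * c * dot + c * c * x2 * y2).max(1e-6);
    let coef_x = (1.0 + 2.0 * c * dot + c * y2) / denom;
    let coef_y = (1.0 - c * x2) / denom;
    x.iter().zip(y).map(|(a, b)| coef_x * a + coef_y * b).collect()
}

/// Mobius scalar multiplication r (x)_c x = (1/sqrt(c)) tanh(r * artanh(sqrt(c)||x||)) x/||x||.
fn mobius_scalar_mult(r: f32, x: &[f32], c: f32) -> Vec<f32> {
    let norm = x.iter().map(|a| a * a).sum::<f32>().sqrt().max(1e-12);
    let t = (c.sqrt() * norm).min(1.0 - 1e-6); // stay strictly inside the ball
    let scale = (r * t.atanh()).tanh() / (c.sqrt() * norm);
    x.iter().map(|a| scale * a).collect()
}
```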

### 2. Mixed-Curvature Product Manifolds

#### $S^n \times H^m \times \mathbb{R}^k$ Decomposition

A product manifold $\mathcal{M} = \mathcal{M}_1 \times \mathcal{M}_2 \times \cdots \times \mathcal{M}_p$ has the metric:

$$d_{\mathcal{M}}(x, y)^2 = \sum_{i=1}^p \beta_i \cdot d_{\mathcal{M}_i}(x^{(i)}, y^{(i)})^2$$

where the $\beta_i$ are learnable mixing weights and each $\mathcal{M}_i$ is either spherical ($\kappa_i > 0$), hyperbolic ($\kappa_i < 0$), or Euclidean ($\kappa_i = 0$).

RuVector's `FusedCurvatureConfig` already defines this decomposition:

```rust
pub struct FusedCurvatureConfig {
    pub euclidean_dim: usize,      // R^k component
    pub hyperbolic_dim: usize,     // H^m component
    pub spherical_dim: usize,      // S^n component
    pub weight_e: f32,             // beta_E
    pub weight_h: f32,             // beta_H
    pub weight_s: f32,             // beta_S
    pub hyperbolic_curvature: f32,
}
```

The fused attention kernel computes all three similarities in a single vectorized pass:

$$\text{logit}(q, k) = \beta_E \langle q_E, k_E \rangle + \beta_H \langle q_{H}^{\text{tan}}, k_{H}^{\text{tan}} \rangle + \beta_S \langle q_S, k_S \rangle_S$$

where the hyperbolic component uses tangent-space dot products (10-100x faster than geodesic distance, per RuVector's `TangentSpaceMapper`) and the spherical component uses normalized inner products on the unit sphere.
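
Tying this back to the `FusedCurvatureConfig` above, here is a sketch of the fused logit for one query-key pair, assuming the embedding has already been split into component slices and the hyperbolic slices are already mapped to the tangent space at the origin. The helpers are illustrative, not the `fused_attention.rs` kernel.

```rust
fn fused_logit(
    cfg: &FusedCurvatureConfig,
    q_e: &[f32], q_h_tan: &[f32], q_s: &[f32],
    k_e: &[f32], k_h_tan: &[f32], k_s: &[f32],
) -> f32 {
    let dot = |a: &[f32], b: &[f32]| a.iter().zip(b).map(|(x, y)| x * y).sum::<f32>();
    let norm = |a: &[f32]| dot(a, a).sqrt().max(1e-12);
    cfg.weight_e * dot(q_e, k_e)                                  // Euclidean inner product
        + cfg.weight_h * dot(q_h_tan, k_h_tan)                    // tangent-space dot product
        + cfg.weight_s * dot(q_s, k_s) / (norm(q_s) * norm(k_s))  // cosine on the sphere
}
```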

#### Curvature-Per-Component

Rather than a single global curvature, each dimension group can have its own learned curvature. For a product of $p$ components:

$$\mathcal{M} = \mathcal{M}_1^{\kappa_1} \times \mathcal{M}_2^{\kappa_2} \times \cdots \times \mathcal{M}_p^{\kappa_p}$$

This is the key extension beyond RuVector's current `MixedCurvatureConfig` (which uses a single curvature for the hyperbolic component). The research direction is to make $\kappa_i$ **learnable per-component**, enabling the model to discover which curvature best fits each subspace of the embedding.

#### Optimal Curvature Learning

Given a graph $G = (V, E)$ with known structure, the optimal curvature for a hyperbolic component can be estimated as:

$$\kappa^* = -\frac{4\delta^2}{(\operatorname{diam}(G))^2}$$

where $\delta$ is the Gromov hyperbolicity (measuring how tree-like the graph is) and $\operatorname{diam}(G)$ is the graph diameter. RuVector's solver crate provides the graph traversal primitives needed to compute both quantities sublinearly.

For learnable curvatures during training, the gradient flows through the exponential map:

$$\frac{\partial \mathcal{L}}{\partial \kappa} = \frac{\partial \mathcal{L}}{\partial d_\kappa} \cdot \frac{\partial d_\kappa}{\partial \kappa}$$

The curvature gradient for the Poincare distance is:

$$\frac{\partial d_c}{\partial c} = -\frac{1}{2c^{3/2}} \operatorname{arcosh}(\alpha) + \frac{1}{\sqrt{c}} \frac{1}{\sqrt{\alpha^2 - 1}} \frac{\partial \alpha}{\partial c}$$

where $\alpha = 1 + 2c\|x - y\|^2 / \left((1 - c\|x\|^2)(1 - c\|y\|^2)\right)$.
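
A direct transcription of that closed form, useful as a reference when checking a learned-curvature implementation. It is purely illustrative and unfused; a production kernel would share work with the distance computation.

```rust
/// d/dc of the Poincare distance, following the expression above.
fn poincare_distance_curvature_grad(x: &[f32], y: &[f32], c: f32) -> f32 {
    let sq = |v: &[f32]| v.iter().map(|a| a * a).sum::<f32>();
    let d2: f32 = x.iter().zip(y).map(|(a, b)| (a - b) * (a - b)).sum();
    let (xx, yy) = (sq(x), sq(y));
    let v = (1.0 - c * xx) * (1.0 - c * yy);              // denominator inside alpha
    let alpha = (1.0 + 2.0 * c * d2 / v).max(1.0 + 1e-6);
    // d(alpha)/dc by the quotient rule on u/v with u = 2c||x - y||^2
    let dv_dc = -xx * (1.0 - c * yy) - yy * (1.0 - c * xx);
    let dalpha_dc = (2.0 * d2 * v - 2.0 * c * d2 * dv_dc) / (v * v);
    let arcosh = |z: f32| (z + (z * z - 1.0).sqrt()).ln();
    -arcosh(alpha) / (2.0 * c.powf(1.5)) + dalpha_dc / (c.sqrt() * (alpha * alpha - 1.0).sqrt())
}
```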

### 3. Curvature-Adaptive Routing

#### Attention Weights as Parallel Transport

In a curved space, moving a vector from one tangent space to another requires **parallel transport** along the geodesic connecting them. Standard attention aggregation implicitly assumes all values live in the same space, which is only true in flat space.

For a message from node $j$ to node $i$, the value $v_j$ must be parallel-transported from $T_{h_j}\mathcal{M}$ to $T_{h_i}\mathcal{M}$:

$$\tilde{v}_j = \Gamma_{h_j \to h_i}(v_j)$$

In the Poincare ball, parallel transport along the geodesic from $x$ to $y$ is:

$$\Gamma_{x \to y}(v) = \frac{\lambda_x}{\lambda_y} \cdot \operatorname{gyr}[y, -x](v)$$

where $\lambda_x = 2/(1 - c\|x\|^2)$ is the conformal factor and $\operatorname{gyr}$ is the gyration operator (Thomas precession). This connects to RuVector's transport module (`ruvector-attention/src/transport/`), which uses optimal transport for attention -- the Wasserstein distance provides a natural way to compute transport plans between distributions on manifolds.
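
A sketch of this transport using the gyration identity $\operatorname{gyr}[a, b]v = \ominus(a \oplus_c b) \oplus_c (a \oplus_c (b \oplus_c v))$ together with the `mobius_add` sketch from the gyrovector section. This is illustrative only and not the transport module's API; it also glosses over the distinction between ball points and tangent vectors, which a careful implementation handles explicitly.

```rust
/// Gyration gyr[a, b](v) expressed through Mobius addition.
fn gyration(a: &[f32], b: &[f32], v: &[f32], c: f32) -> Vec<f32> {
    let neg = |u: &[f32]| u.iter().map(|x| -x).collect::<Vec<f32>>();
    let inner = mobius_add(a, &mobius_add(b, v, c), c);
    mobius_add(&neg(&mobius_add(a, b, c)), &inner, c)
}

/// Transport a tangent vector v from T_x to T_y: (lambda_x / lambda_y) * gyr[y, -x](v).
fn parallel_transport(x: &[f32], y: &[f32], v: &[f32], c: f32) -> Vec<f32> {
    let sq = |u: &[f32]| u.iter().map(|a| a * a).sum::<f32>();
    let lambda = |u: &[f32]| 2.0 / (1.0 - c * sq(u)).max(1e-6);
    let scale = lambda(x) / lambda(y);
    let neg_x: Vec<f32> = x.iter().map(|a| -a).collect();
    gyration(y, &neg_x, v, c).iter().map(|a| scale * a).collect()
}
```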

#### Levi-Civita Connection for Message Passing

The Levi-Civita connection $\nabla$ provides the unique torsion-free, metric-compatible way to differentiate vector fields on a manifold. For graph message passing on a Riemannian manifold $(\mathcal{M}, g)$:

$$m_{i \leftarrow j} = \alpha_{ij} \cdot \Gamma_{j \to i}^{\nabla}(W_v h_j)$$

where $\Gamma_{j \to i}^{\nabla}$ is parallel transport along the Levi-Civita connection. The Christoffel symbols $\Gamma^k_{ij}$ encode the connection in coordinates:

$$\Gamma^k_{ij} = \frac{1}{2} g^{kl}\left(\frac{\partial g_{jl}}{\partial x^i} + \frac{\partial g_{il}}{\partial x^j} - \frac{\partial g_{ij}}{\partial x^l}\right)$$

For the Poincare ball with conformal factor $\lambda_x = 2/(1 - c\|x\|^2)$, the Christoffel symbols simplify considerably, enabling efficient implementation.
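
As a reference point for that simplification, the standard Christoffel-symbol identity for a conformally flat metric $g_{ij} = \lambda_x^2 \delta_{ij}$ (the Poincare ball is the case $\lambda_x = 2/(1 - c\|x\|^2)$) is:

$$\Gamma^k_{ij} = \delta^k_i \, \partial_j \ln \lambda_x + \delta^k_j \, \partial_i \ln \lambda_x - \delta_{ij} \, \delta^{kl} \, \partial_l \ln \lambda_x$$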

### 4. Riemannian Optimization for Graph Transformers

#### Riemannian Adam

Standard Adam cannot be applied directly on manifolds because the update rule $\theta_{t+1} = \theta_t - \eta \cdot \hat{m}_t / (\sqrt{\hat{v}_t} + \epsilon)$ does not preserve manifold constraints. Riemannian Adam replaces Euclidean operations with their Riemannian counterparts:

```
Algorithm: Riemannian Adam on Product Manifold M

Input: Learning rate eta, decay rates beta_1, beta_2, parameters theta in M
Initialize: m_0 = 0, v_0 = 0 (in tangent space at theta_0)

For t = 1, 2, ...:
    g_t   = Riemannian_gradient(L, theta_{t-1})        // Project Euclidean grad to tangent space
    m_t   = beta_1 * PT(m_{t-1}) + (1 - beta_1) * g_t  // Parallel-transport first moment
    v_t   = beta_2 * v_{t-1} + (1 - beta_2) * g_t^2    // Second moment (scalar, no transport)
    m_hat = m_t / (1 - beta_1^t)
    v_hat = v_t / (1 - beta_2^t)
    update = -eta * m_hat / (sqrt(v_hat) + epsilon)
    theta_t = Exp_{theta_{t-1}}(update)                // Exponential map back to manifold
```

The key operations are:

- **Riemannian gradient**: $\text{grad}_\mathcal{M} f = \frac{1}{\lambda_x^2} \nabla_E f$ (rescale the Euclidean gradient by the inverse metric)
- **Exponential map**: $\text{Exp}_x(v)$ moves from $x$ in direction $v$ along the geodesic
- **Parallel transport**: $\text{PT}_{x \to y}(m)$ moves the momentum from the old tangent space to the new one

RuVector's `ruvector-attention/src/training/optimizer.rs` provides the foundation; extending it to Riemannian Adam requires adding `exp_map` and `log_map` calls (already available in `poincare.rs` and `lorentz_cascade.rs::tangent`).
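
A minimal sketch of the retraction step for a single Poincare-ball parameter, shown as plain Riemannian SGD for brevity (Riemannian Adam adds the parallel-transported moments on top). The `exp_map` below follows the standard closed form and reuses the `mobius_add` sketch above; the names are illustrative, not the `optimizer.rs` API.

```rust
/// Exponential map on the Poincare ball:
/// Exp_x(v) = x (+)_c tanh(sqrt(c) * lambda_x * ||v|| / 2) * v / (sqrt(c) * ||v||).
fn exp_map(x: &[f32], v: &[f32], c: f32) -> Vec<f32> {
    let sq = |u: &[f32]| u.iter().map(|a| a * a).sum::<f32>();
    let v_norm = sq(v).sqrt().max(1e-12);
    let lambda_x = 2.0 / (1.0 - c * sq(x)).max(1e-6);
    let scale = (c.sqrt() * lambda_x * v_norm / 2.0).tanh() / (c.sqrt() * v_norm);
    let step: Vec<f32> = v.iter().map(|a| scale * a).collect();
    mobius_add(x, &step, c)
}

/// One Riemannian gradient step: rescale the Euclidean gradient by the inverse
/// metric (1 / lambda_x^2), then move along the geodesic via the exponential map.
fn riemannian_sgd_step(theta: &[f32], euclid_grad: &[f32], lr: f32, c: f32) -> Vec<f32> {
    let sq = |u: &[f32]| u.iter().map(|a| a * a).sum::<f32>();
    let lambda_sq = (2.0 / (1.0 - c * sq(theta)).max(1e-6)).powi(2);
    let update: Vec<f32> = euclid_grad.iter().map(|g| -lr * g / lambda_sq).collect();
    exp_map(theta, &update, c)
}
```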

#### Projection-Free Training on Manifolds

An alternative to Riemannian optimization is **projection-free training**, where parameters are optimized in the ambient Euclidean space and projected back to the manifold after each step:

$$\theta_{t+1} = \text{proj}_\mathcal{M}(\theta_t - \eta \nabla_E \mathcal{L})$$

For the Poincare ball, this is simply `project_to_ball`. For the hyperboloid, `project_hyperboloid`. For the sphere, normalize to unit length. The advantage is compatibility with existing optimizers (Adam, SGD); the disadvantage is that projection introduces bias proportional to the step size.

RuVector's tangent-space approach (`TangentSpaceMapper`) offers a practical middle ground: map to the tangent space at the origin, perform standard operations, then map back. This is exact for small perturbations and provides a 10-100x speedup over full geodesic operations.
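
A sketch of the per-component projections this describes. The behavior of `project_to_ball` is a guess consistent with the epsilon-buffered projection mentioned earlier, not the actual implementation; the sphere case is just L2 normalization.

```rust
/// Pull a point back inside the ball of radius 1/sqrt(c), leaving an epsilon buffer.
fn project_to_ball_sketch(x: &mut [f32], c: f32, eps: f32) {
    let norm = x.iter().map(|a| a * a).sum::<f32>().sqrt();
    let max_norm = (1.0 - eps) / c.sqrt();
    if norm > max_norm {
        let scale = max_norm / norm;
        x.iter_mut().for_each(|a| *a *= scale);
    }
}

/// Normalize the spherical component back to unit length.
fn project_to_sphere_sketch(x: &mut [f32]) {
    let norm = x.iter().map(|a| a * a).sum::<f32>().sqrt().max(1e-12);
    x.iter_mut().for_each(|a| *a /= norm);
}
```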

### 5. Lie Group Equivariant Graph Attention

#### SE(3) and SO(3) Equivariance

For molecular graphs and physical simulations, attention must respect the symmetries of 3D space. An **SE(3)-equivariant** Graph Transformer satisfies:

$$f(Rx + t, Rh) = Rf(x, h)$$

for all rotations $R \in SO(3)$ and translations $t \in \mathbb{R}^3$. This means the model's output transforms consistently with rigid-body motions.

The key construction is **equivariant attention** using invariant features:

$$\alpha_{ij} = \phi\left(\|x_i - x_j\|, \langle h_i, h_j \rangle, h_i^T(x_i - x_j)\right)$$

The attention weights depend only on invariants (distances, inner products, projections), ensuring equivariance of the full attention layer. Value messages are constructed using equivariant basis functions:

$$m_{ij} = \alpha_{ij} \left(w_0 h_j + w_1 (x_i - x_j) + w_2 (x_i - x_j) \times h_j\right)$$

where the cross product ensures the message transforms correctly under rotations.
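
A small sketch of one value message under this basis, treating $h_j$ as a 3D vector feature so the cross product is defined. The scalar weights $w_0, w_1, w_2$ would come from an invariant MLP in a real layer; everything here is illustrative.

```rust
/// Equivariant message m_ij = alpha_ij * (w0 * h_j + w1 * (x_i - x_j) + w2 * (x_i - x_j) x h_j).
fn equivariant_message(
    alpha_ij: f32,
    x_i: [f32; 3],
    x_j: [f32; 3],
    h_j: [f32; 3],
    w: [f32; 3], // (w0, w1, w2), produced from invariant features
) -> [f32; 3] {
    let rel = [x_i[0] - x_j[0], x_i[1] - x_j[1], x_i[2] - x_j[2]];
    let cross = [
        rel[1] * h_j[2] - rel[2] * h_j[1],
        rel[2] * h_j[0] - rel[0] * h_j[2],
        rel[0] * h_j[1] - rel[1] * h_j[0],
    ];
    let mut m = [0.0f32; 3];
    for d in 0..3 {
        m[d] = alpha_ij * (w[0] * h_j[d] + w[1] * rel[d] + w[2] * cross[d]);
    }
    m
}
```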

#### General Lie Group Equivariance

Beyond SE(3), graphs with symmetry group $G$ require $G$-equivariant attention. The general framework uses **fiber bundles**: each node carries a feature that transforms under a representation $\rho$ of $G$, and message passing uses intertwining operators.

For a Lie group $G$ acting on the graph, equivariant attention decomposes into irreducible representations:

$$\alpha_{ij} = \sum_l \alpha_{ij}^{(l)} \cdot \rho^{(l)}(g_{ij})$$

where $g_{ij} \in G$ is the relative group element between nodes $i$ and $j$, and $\rho^{(l)}$ is the $l$-th irreducible representation.

This connects to RuVector's sheaf attention module (`ruvector-attention/src/sheaf/`), where restriction maps between stalks play a role analogous to parallel transport between fibers in the Lie group setting.

---

## Research Timeline

### 2026-2030: Mixed-Curvature GNNs Become Standard

**Knowledge Graphs (2026-2028):** Knowledge graphs like Wikidata and Freebase combine deep hierarchies (is-a relations), lateral associations (related-to), and cyclic patterns (mutual relations). Product manifold embeddings $H^{64} \times S^{32} \times \mathbb{R}^{128}$ achieve 15-25% better link prediction than flat embeddings at half the dimensionality. RuVector's existing `FusedCurvatureConfig` provides the production-ready kernel.

**Molecular Design (2027-2029):** Drug discovery graphs have hierarchical scaffolds, cyclic ring systems, and flat functional-group features. SE(3)-equivariant product manifold transformers replace flat-space message passing networks, achieving state-of-the-art results on molecular property prediction benchmarks.

**Social Networks (2028-2030):** Community detection in social networks benefits from hyperbolic embeddings (communities are hierarchical), spherical embeddings (cliques are cyclic), and Euclidean embeddings (content similarity). Mixed-curvature Graph Transformers become the standard architecture for large-scale social graph analysis.

### 2030-2036: Continuous Manifold Graph Transformers

**Learned Curvature Fields (2030-2032):** Instead of a fixed product manifold, the curvature becomes a learned function of position: $\kappa(x): \mathcal{M} \to \mathbb{R}$. The manifold itself adapts to the local structure of the graph. Regions with tree-like structure automatically develop negative curvature; regions with cliques develop positive curvature; transition zones have near-zero curvature. This requires solving geodesic equations numerically on the learned manifold.

**Arbitrary Riemannian Manifolds (2032-2034):** Graph Transformers operate on manifolds defined by their learned metric tensor $g_{ij}(x)$ rather than being restricted to constant-curvature spaces. The exponential map, parallel transport, and geodesic attention are computed via neural ODE solvers. RuVector's PDE attention module (`ruvector-attention/src/pde_attention/`) provides the diffusion-based foundation.

**Manifold-Valued Graph Neural Fields (2034-2036):** The discrete graph is replaced by a continuous neural field on a manifold: $f: \mathcal{M} \to \mathcal{N}$, where both the domain manifold $\mathcal{M}$ and the codomain manifold $\mathcal{N}$ are learned. Attention becomes a kernel on the product manifold $\mathcal{M} \times \mathcal{N}$. This unifies graph transformers with neural radiance fields, geometric deep learning, and topological data analysis.

---

## Architecture Proposals

### Product Manifold Attention Layer

```
Input: node embeddings x_i = (x_i^E, x_i^H, x_i^S) in R^k x H^m x S^n

For each component space M_j in {R^k, H^m, S^n}:
    Q_j = W_Q^j * x^j                                     // Linear projection (in tangent space for H, S)
    K_j = W_K^j * x^j
    V_j = W_V^j * x^j

    alpha_ij^j = softmax(-d_{M_j}(Q_j_i, K_j_l) / tau_j)  // Geodesic attention
    out_j_i = AGGREGATE_{M_j}({alpha_ij^j, V_j_l})        // Manifold-aware aggregation

// Fused attention (single kernel, as in RuVector's fused_attention.rs):
alpha_ij = softmax(beta_E * <Q_E_i, K_E_j> + beta_H * <Q_H_i, K_H_j>_tan + beta_S * <Q_S_i, K_S_j>_S)

// Aggregation per component:
out_E_i = sum_j alpha_ij * V_E_j                  // Euclidean: weighted average
out_H_i = einstein_midpoint({alpha_ij, V_H_j}, c) // Hyperbolic: Einstein midpoint
out_S_i = normalize(sum_j alpha_ij * V_S_j)       // Spherical: weighted sum + project

Output: (out_E_i, out_H_i, out_S_i)
```

### Rust Pseudocode: Product Manifold Attention

```rust
/// Product manifold attention layer operating on S^n x H^m x R^k
pub struct ProductManifoldAttention {
    /// Per-component configurations with learned curvatures
    components: Vec<ManifoldComponent>,
    /// Fused attention kernel for single-pass computation
    fused_kernel: FusedCurvatureKernel,
    /// Tangent space mapper for fast hyperbolic operations
    tangent_mapper: TangentSpaceMapper,
    /// Riemannian optimizer state
    optimizer: RiemannianAdamState,
}

#[derive(Clone)]
pub enum ManifoldComponent {
    Euclidean { dim: usize },
    Hyperbolic { dim: usize, curvature: f32 }, // curvature < 0
    Spherical { dim: usize, curvature: f32 },  // curvature > 0
}

impl ProductManifoldAttention {
    /// Compute product manifold attention with geodesic scoring
    pub fn forward(
        &self,
        queries: &[Vec<f32>],  // [N, D_total]
        keys: &[Vec<f32>],     // [M, D_total]
        values: &[Vec<f32>],   // [M, D_total]
        graph_adj: &CsrMatrix, // Sparse adjacency (attention mask)
    ) -> Vec<Vec<f32>> {
        let n = queries.len();
        let mut outputs = Vec::with_capacity(n);

        for i in 0..n {
            let q = &queries[i];
            let neighbors = graph_adj.neighbors(i);

            // Split query into component spaces
            let (q_e, q_h, q_s) = self.split_components(q);

            // Compute fused attention scores in a single pass
            let mut logits = Vec::with_capacity(neighbors.len());
            for &j in &neighbors {
                let k = &keys[j];
                let (k_e, k_h, k_s) = self.split_components(k);

                // Euclidean: dot product
                let score_e = dot_product(q_e, k_e);

                // Hyperbolic: tangent-space dot product (fast path)
                let q_h_tan = self.tangent_mapper.log_map(q_h);
                let k_h_tan = self.tangent_mapper.log_map(k_h);
                let score_h = dot_product(&q_h_tan, &k_h_tan);

                // Spherical: cosine similarity on the unit sphere
                let score_s = cosine_similarity(q_s, k_s);

                // Fused logit with learned mixing weights
                let logit = self.fused_kernel.weight_e * score_e
                    + self.fused_kernel.weight_h * score_h
                    + self.fused_kernel.weight_s * score_s;

                logits.push(logit);
            }

            // Softmax over neighbor logits
            let weights = softmax_with_temperature(&logits, self.fused_kernel.temperature);

            // Per-component aggregation
            let mut out_e = vec![0.0; self.euclidean_dim()];
            let mut out_h_weighted = Vec::new(); // for Einstein midpoint
            let mut out_s = vec![0.0; self.spherical_dim()];

            for (idx, &j) in neighbors.iter().enumerate() {
                let v = &values[j];
                let (v_e, v_h, v_s) = self.split_components(v);
                let w = weights[idx];

                // Euclidean: simple weighted sum
                for (d, &val) in v_e.iter().enumerate() {
                    out_e[d] += w * val;
                }

                // Hyperbolic: collect for Einstein midpoint
                out_h_weighted.push((w, v_h.to_vec()));

                // Spherical: weighted sum then project
                for (d, &val) in v_s.iter().enumerate() {
                    out_s[d] += w * val;
                }
            }

            // Hyperbolic aggregation via Einstein midpoint (closed-form)
            let hyp_curvature = self.hyperbolic_curvature();
            let hyp_points: Vec<&[f32]> = out_h_weighted.iter()
                .map(|(_, v)| v.as_slice()).collect();
            let hyp_weights: Vec<f32> = out_h_weighted.iter()
                .map(|(w, _)| *w).collect();
            let out_h = einstein_midpoint(&hyp_points, &hyp_weights, hyp_curvature);

            // Spherical: project the weighted sum back to the unit sphere
            let out_s = l2_normalize(&out_s);

            // Concatenate component outputs
            let output = concat_components(&out_e, &out_h, &out_s);
            outputs.push(output);
        }

        outputs
    }

    /// Riemannian gradient step: compute gradients in tangent space,
    /// then retract back to the manifold via the exponential map
    pub fn riemannian_step(&mut self, loss: f32, learning_rate: f32) {
        for component in &mut self.components {
            match component {
                ManifoldComponent::Euclidean { .. } => {
                    // Standard Euclidean Adam step
                }
                ManifoldComponent::Hyperbolic { curvature, .. } => {
                    // 1. Project Euclidean gradient to tangent space
                    // 2. Riemannian Adam update in tangent space
                    // 3. Exponential map back to Poincare ball / hyperboloid
                    let c = curvature.abs();
                    // grad_riemannian = (1 / lambda_x^2) * grad_euclidean
                    // theta_new = exp_map(theta_old, -lr * grad_riemannian)
                }
                ManifoldComponent::Spherical { .. } => {
                    // 1. Project gradient to the tangent plane of the sphere
                    // 2. Update in tangent space
                    // 3. Normalize back to the unit sphere
                }
            }
        }

        // Optionally update curvatures via gradient descent:
        // d(loss)/d(kappa) flows through the geodesic distance
    }
}
```

### Curvature-Adaptive Graph Transformer Block

```
Input: x in M = S^n x H^m x R^k
             |
     +-------+--------+
     |                |
Product Manifold   Curvature
Self-Attention     Estimator
(geodesic QKV)     (kappa = f(x))
     |                |
     +-------+--------+
             |
Parallel Transport Aggregation
  (Levi-Civita connection)
             |
Tangent Space Feed-Forward
(operate in T_x M, map back via exp)
             |
   Riemannian LayerNorm
  (normalize on manifold)
             |
Output: x' in M
```

---

## Mathematical Formulations

### Geodesic Attention

For two points $x, y$ on a Riemannian manifold $(\mathcal{M}, g)$:

$$\text{GeodesicAttention}(Q, K, V) = \text{Agg}_{\mathcal{M}}\left(\text{softmax}\left(-\frac{d_g(Q, K)}{\tau}\right) \cdot V\right)$$

where $d_g$ is the geodesic distance induced by the metric $g$, and $\text{Agg}_{\mathcal{M}}$ is the manifold-appropriate aggregation.

### Exponential Map Aggregation

Given weights $w_i$ and values $v_i \in \mathcal{M}$, aggregated in the tangent space $T_x\mathcal{M}$:

$$\text{Agg}(x, \{w_i, v_i\}) = \text{Exp}_x\left(\sum_i w_i \cdot \text{Log}_x(v_i)\right)$$

This is equivalent to one step of Riemannian gradient descent toward the weighted Frechet mean.
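
A sketch of this aggregation on the Poincare ball, reusing the `mobius_add` and `exp_map` sketches above. The log map follows the standard closed form; the names are illustrative.

```rust
/// Log map on the Poincare ball:
/// Log_x(y) = (2 / (sqrt(c) * lambda_x)) * artanh(sqrt(c) * ||(-x) (+)_c y||) * ((-x) (+)_c y) / ||...||.
fn log_map(x: &[f32], y: &[f32], c: f32) -> Vec<f32> {
    let sq = |u: &[f32]| u.iter().map(|a| a * a).sum::<f32>();
    let neg_x: Vec<f32> = x.iter().map(|a| -a).collect();
    let diff = mobius_add(&neg_x, y, c);
    let d_norm = sq(&diff).sqrt().max(1e-12);
    let lambda_x = 2.0 / (1.0 - c * sq(x)).max(1e-6);
    let scale = 2.0 / (c.sqrt() * lambda_x) * (c.sqrt() * d_norm).min(1.0 - 1e-6).atanh() / d_norm;
    diff.iter().map(|a| scale * a).collect()
}

/// Pull each value to T_x M, take the weighted sum there, and push back with Exp_x.
fn exp_map_aggregate(x: &[f32], values: &[Vec<f32>], weights: &[f32], c: f32) -> Vec<f32> {
    let mut tangent_sum = vec![0.0f32; x.len()];
    for (v, &w) in values.iter().zip(weights) {
        for (d, t) in log_map(x, v, c).iter().enumerate() {
            tangent_sum[d] += w * t;
        }
    }
    exp_map(x, &tangent_sum, c)
}
```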

### Product Manifold Distance

For $x = (x^{(1)}, \ldots, x^{(p)})$ and $y = (y^{(1)}, \ldots, y^{(p)})$ in $\mathcal{M} = \prod_i \mathcal{M}_i^{\kappa_i}$:

$$d_{\mathcal{M}}(x, y)^2 = \sum_{i=1}^p \beta_i \cdot d_{\mathcal{M}_i}^{\kappa_i}(x^{(i)}, y^{(i)})^2$$

where each $d_{\mathcal{M}_i}^{\kappa_i}$ is the sectional-curvature-$\kappa_i$ geodesic distance.

### Curvature Gradient

For learned curvature $c$ in the Poincare model, the gradient of the distance with respect to curvature is (as derived in the Optimal Curvature Learning section above):

$$\frac{\partial d_c(x,y)}{\partial c} = -\frac{1}{2c^{3/2}} \operatorname{arcosh}(\alpha) + \frac{1}{\sqrt{c}} \frac{1}{\sqrt{\alpha^2 - 1}} \frac{\partial \alpha}{\partial c}$$

where $\alpha = 1 + 2c\|x-y\|^2 / \left((1-c\|x\|^2)(1-c\|y\|^2)\right)$.

---

## Implementation Roadmap for RuVector

### Phase 1: Extend Fused Curvature Attention (3-4 months)

- Add learned per-component curvature to `FusedCurvatureConfig`
- Implement curvature gradient computation in `ruvector-attention/src/curvature/`
- Extend `TangentSpaceMapper` to handle variable curvatures per batch element
- Add spherical aggregation (normalize after weighted sum) alongside Einstein midpoint
- Benchmark against fixed-curvature baseline

### Phase 2: Parallel Transport and Riemannian Optimization (4-6 months)

- Implement parallel transport for Poincare ball and Lorentz model
- Build `RiemannianAdam` optimizer extending `ruvector-attention/src/training/optimizer.rs`
- Add Levi-Civita connection-based message passing to `ruvector-graph`
- Integrate with `ruvector-solver` for sublinear geodesic computation on large graphs

### Phase 3: Lie Group Equivariance (6-9 months)

- Add SE(3)-equivariant attention for molecular graphs
- Implement fiber bundle framework connecting to `ruvector-attention/src/sheaf/`
- Extend `ruvector-graph` property graph to carry manifold-valued node features
- Develop equivariant sparse attention using `ruvector-dag/src/mincut/` for graph sparsification

### Phase 4: Continuous Curvature Fields (12-18 months)

- Implement neural curvature field $\kappa(x)$ using small MLP
- Develop numerical geodesic solver for non-constant curvature (connect to PDE attention module)
- Build differentiable metric tensor learning
- Integrate with `ruvector-temporal-tensor` for time-varying curvature fields

---

## Success Metrics

| Metric | Baseline (Euclidean) | Target (Product Manifold) |
|--------|----------------------|---------------------------|
| Knowledge graph link prediction (MRR) | 0.45 | 0.55-0.60 |
| Hierarchy reconstruction accuracy | 65% | 85-95% |
| Embedding dimension for same quality | 256 | 128 |
| Attention computation (fused kernel) | 1.0x | 1.2x (overhead acceptable) |
| Training convergence (epochs) | 100 | 60-70 |
| Molecular property prediction (MAE) | 1.0x | 0.80-0.85x |
---
|
|
|
|
## References
|
|
|
|
1. Bachmann, Becigneul, Ganea (2020). "Constant Curvature Graph Convolutional Networks." ICML.
|
|
2. Chami, Ying, Re, Leskovec (2019). "Hyperbolic Graph Convolutional Neural Networks." NeurIPS.
|
|
3. Gu, Sala, Gunel, Re (2019). "Learning Mixed-Curvature Representations in Product Spaces." ICLR.
|
|
4. Nickel, Kiela (2017). "Poincare Embeddings for Learning Hierarchical Representations." NeurIPS.
|
|
5. Sala, De Sa, Gu, Re (2018). "Representation Tradeoffs for Hyperbolic Embeddings." ICML.
|
|
6. Ungar (2008). "Analytic Hyperbolic Geometry and Albert Einstein's Special Theory of Relativity."
|
|
7. Ganea, Becigneul, Hofmann (2018). "Hyperbolic Neural Networks." NeurIPS.
|
|
8. Fuchs, Worrall, Fischer, Welling (2020). "SE(3)-Transformers: 3D Roto-Translation Equivariant Attention Networks." NeurIPS.
|
|
9. Brandstetter, Hesselink, van der Pol, Bekkers, Welling (2022). "Geometric and Physical Quantities Improve E(3) Equivariant Message Passing." ICLR.
|
|
10. Skopek, Ganea, Becigneul (2020). "Mixed-curvature Variational Autoencoders." ICLR.
|
|
11. Lou, Nickel, Zantedeschi (2020). "Differentiating through the Frechet Mean." ICML.
|
|
12. Xiong, Zhu, Hsieh, Ma, Liu (2022). "Pseudo-Riemannian Graph Convolutional Networks." NeurIPS.

---

**Document Status:** Research Proposal
**Last Updated:** 2026-02-25
**Owner:** RuVector Architecture Team
**Related ADRs:** ADR-045 (Lean Agentic Integration)
**Related Crates:** ruvector-attention, ruvector-graph, ruvector-solver, ruvector-dag