24 KiB
Hyperbolic Embeddings for Hierarchical Vector Representations
Overview
Problem Statement
Traditional Euclidean embeddings struggle to represent hierarchical structures efficiently. Tree-like and scale-free graphs (common in knowledge graphs, social networks, and taxonomies) require exponentially growing dimensions in Euclidean space to preserve hierarchical distances. This leads to:
- High dimensionality requirements: 100+ dimensions for modest hierarchies
- Poor distance preservation: Hierarchical relationships get distorted
- Inefficient similarity search: HNSW performance degrades with unnecessary dimensions
- Loss of structural information: Parent-child relationships not explicitly encoded
Proposed Solution
Implement a Hybrid Euclidean-Hyperbolic Embedding System that combines:
- Poincaré Ball Model for hyperbolic space (hierarchy representation)
- Euclidean Space for traditional similarity features
- Möbius Gyrovector Algebra for vector operations in hyperbolic space
- Adaptive Blending to balance hierarchical vs. similarity features
The system maintains dual representations:
- Hyperbolic component: Captures tree-like hierarchies (20-40% of vector)
- Euclidean component: Captures semantic similarity (60-80% of vector)
Expected Benefits
Quantified Improvements:
- Dimension Reduction: 30-50% fewer dimensions for hierarchical data
- Hierarchy Preservation: 85-95% hierarchy accuracy vs. 60-70% in Euclidean
- Search Speed: 1.5-2x faster due to reduced dimensionality
- Memory Savings: 25-40% reduction in total storage
- Distortion: 2-3x lower distortion for tree-like structures
Use Cases:
- Knowledge graph embeddings (WordNet, Wikidata)
- Organizational hierarchies
- Taxonomy classification
- Document topic hierarchies
Technical Design
Architecture Diagram
┌─────────────────────────────────────────────────────────────┐
│ HybridEmbedding<T> │
├─────────────────────────────────────────────────────────────┤
│ - euclidean_component: Vec<T> [60-80% of dimensions] │
│ - hyperbolic_component: Vec<T> [20-40% of dimensions] │
│ - blend_ratio: f32 │
│ - curvature: f32 [typically -1.0] │
└─────────────────────────────────────────────────────────────┘
▲
│
┌───────────────┴───────────────┐
│ │
┌─────────▼──────────┐ ┌─────────▼──────────┐
│ PoincareOps<T> │ │ EuclideanOps<T> │
├────────────────────┤ ├────────────────────┤
│ - mobius_add() │ │ - dot_product() │
│ - exp_map() │ │ - cosine_sim() │
│ - log_map() │ │ - l2_norm() │
│ - distance() │ │ - normalize() │
│ - gyration() │ └────────────────────┘
└────────────────────┘
│
▼
┌─────────────────────┐
│ HyperbolicHNSW<T> │
├─────────────────────┤
│ - hybrid_distance() │ ← Combines both distances
│ - insert() │
│ - search() │
└─────────────────────┘
Core Data Structures
/// Hybrid embedding combining Euclidean and Hyperbolic spaces
#[derive(Clone, Debug)]
pub struct HybridEmbedding<T: Float> {
/// Euclidean component (semantic similarity)
pub euclidean: Vec<T>,
/// Hyperbolic component (hierarchy in Poincaré ball)
/// Each coordinate constrained to ||x|| < 1
pub hyperbolic: Vec<T>,
/// Blend ratio (0.0 = pure Euclidean, 1.0 = pure hyperbolic)
pub blend_ratio: f32,
/// Hyperbolic space curvature (typically -1.0)
pub curvature: f32,
/// Total dimension
pub dimension: usize,
}
/// Poincaré ball operations (Möbius gyrovector algebra)
pub struct PoincareOps<T: Float> {
curvature: T,
epsilon: T, // Numerical stability (1e-8)
}
impl<T: Float> PoincareOps<T> {
/// Möbius addition: x ⊕ y
/// (x⊕y) = ((1+2⟨x,y⟩+||y||²)x + (1-||x||²)y) / (1+2⟨x,y⟩+||x||²||y||²)
pub fn mobius_add(&self, x: &[T], y: &[T]) -> Vec<T>;
/// Exponential map: TₓM → M (tangent to manifold)
pub fn exp_map(&self, x: &[T], v: &[T]) -> Vec<T>;
/// Logarithmic map: M → TₓM (manifold to tangent)
pub fn log_map(&self, x: &[T], y: &[T]) -> Vec<T>;
/// Poincaré distance
/// d(x,y) = acosh(1 + 2||x-y||²/((1-||x||²)(1-||y||²)))
pub fn distance(&self, x: &[T], y: &[T]) -> T;
/// Project vector to Poincaré ball (ensure ||x|| < 1)
pub fn project(&self, x: &[T]) -> Vec<T>;
}
/// Hybrid HNSW index supporting both distance metrics
pub struct HybridHNSW<T: Float> {
/// Standard HNSW graph structure
layers: Vec<HNSWLayer>,
/// Hybrid embeddings
embeddings: Vec<HybridEmbedding<T>>,
/// Distance computation strategy
distance_fn: HybridDistanceFunction,
/// HNSW parameters
params: HNSWParams,
}
/// Distance function combining Euclidean and hyperbolic metrics
pub enum HybridDistanceFunction {
/// Weighted combination
Weighted { euclidean_weight: f32, hyperbolic_weight: f32 },
/// Adaptive based on query context
Adaptive,
/// Hierarchical first, then Euclidean for tie-breaking
Hierarchical,
}
/// Configuration for hybrid embeddings
#[derive(Clone)]
pub struct HybridConfig {
/// Total embedding dimension
pub total_dim: usize,
/// Fraction allocated to hyperbolic space (0.2-0.4)
pub hyperbolic_ratio: f32,
/// Hyperbolic space curvature
pub curvature: f32,
/// Distance blending strategy
pub distance_strategy: HybridDistanceFunction,
/// Numerical stability epsilon
pub epsilon: f32,
}
Key Algorithms
Algorithm 1: Hybrid Distance Computation
function hybrid_distance(emb1: HybridEmbedding, emb2: HybridEmbedding) -> float:
// Compute Euclidean component distance
d_euclidean = cosine_distance(emb1.euclidean, emb2.euclidean)
// Compute hyperbolic component distance (Poincaré)
d_hyperbolic = poincare_distance(emb1.hyperbolic, emb2.hyperbolic)
// Normalize distances to [0, 1] range
d_euclidean_norm = d_euclidean / 2.0 // cosine ∈ [0, 2]
d_hyperbolic_norm = tanh(d_hyperbolic / 2.0) // hyperbolic ∈ [0, ∞)
// Blend based on strategy
match emb1.blend_strategy:
Weighted(w_e, w_h):
return w_e * d_euclidean_norm + w_h * d_hyperbolic_norm
Adaptive:
// Use hyperbolic more for hierarchical queries
hierarchy_score = detect_hierarchy(emb1, emb2)
w_h = hierarchy_score
w_e = 1.0 - hierarchy_score
return w_e * d_euclidean_norm + w_h * d_hyperbolic_norm
Hierarchical:
// Use hyperbolic for pruning, Euclidean for ranking
if d_hyperbolic_norm > threshold:
return d_hyperbolic_norm
else:
return 0.3 * d_hyperbolic_norm + 0.7 * d_euclidean_norm
Algorithm 2: Poincaré Distance (Optimized)
function poincare_distance(x: Vec<T>, y: Vec<T>, curvature: T) -> T:
// Compute ||x - y||²
diff_norm_sq = 0.0
for i in 0..x.len():
diff = x[i] - y[i]
diff_norm_sq += diff * diff
// Compute ||x||² and ||y||²
x_norm_sq = dot(x, x)
y_norm_sq = dot(y, y)
// Numerical stability: ensure norms < 1
x_norm_sq = min(x_norm_sq, 1.0 - epsilon)
y_norm_sq = min(y_norm_sq, 1.0 - epsilon)
// Poincaré distance formula
numerator = 2.0 * diff_norm_sq
denominator = (1.0 - x_norm_sq) * (1.0 - y_norm_sq)
ratio = numerator / (denominator + epsilon)
// d = acosh(1 + ratio)
// Numerically stable: acosh(x) = log(x + sqrt(x²-1))
inner = 1.0 + ratio
if inner < 1.0 + epsilon:
return 0.0 // Points are identical
return log(inner + sqrt(inner * inner - 1.0)) / sqrt(abs(curvature))
Algorithm 3: Möbius Addition (Core Operation)
function mobius_add(x: Vec<T>, y: Vec<T>, curvature: T) -> Vec<T]:
// Compute scalar products
xy_dot = dot(x, y)
x_norm_sq = dot(x, x)
y_norm_sq = dot(y, y)
// Conformal factor
denominator = 1.0 + 2.0 * curvature * xy_dot +
curvature² * x_norm_sq * y_norm_sq
// Numerator terms
numerator_x_coeff = 1.0 + 2.0 * curvature * xy_dot +
curvature * y_norm_sq
numerator_y_coeff = 1.0 - curvature * x_norm_sq
// Result
result = Vec::new()
for i in 0..x.len():
value = (numerator_x_coeff * x[i] + numerator_y_coeff * y[i]) /
(denominator + epsilon)
result.push(value)
// Project back to ball (ensure ||result|| < 1)
return project_to_ball(result)
function project_to_ball(x: Vec<T>) -> Vec<T]:
norm = sqrt(dot(x, x))
if norm >= 1.0:
// Project to ball with radius 1 - epsilon
scale = (1.0 - epsilon) / norm
return x.map(|xi| xi * scale)
return x
API Design
// Public API for hybrid embeddings
pub mod hybrid {
use super::*;
/// Create hybrid embedding from separate components
pub fn create_hybrid<T: Float>(
euclidean: Vec<T>,
hyperbolic: Vec<T>,
config: HybridConfig,
) -> Result<HybridEmbedding<T>, Error>;
/// Convert standard embedding to hybrid (automatic split)
pub fn euclidean_to_hybrid<T: Float>(
embedding: &[T],
config: HybridConfig,
) -> Result<HybridEmbedding<T>, Error>;
/// Compute distance between hybrid embeddings
pub fn distance<T: Float>(
a: &HybridEmbedding<T>,
b: &HybridEmbedding<T>,
) -> T;
/// Create HNSW index with hybrid embeddings
pub fn build_index<T: Float>(
embeddings: Vec<HybridEmbedding<T>>,
config: HybridConfig,
hnsw_params: HNSWParams,
) -> Result<HybridHNSW<T>, Error>;
}
// Poincaré ball operations (advanced users)
pub mod poincare {
/// Möbius addition in Poincaré ball
pub fn mobius_add<T: Float>(
x: &[T],
y: &[T],
curvature: T,
) -> Vec<T>;
/// Exponential map (tangent to manifold)
pub fn exp_map<T: Float>(
base: &[T],
tangent: &[T],
curvature: T,
) -> Vec<T>;
/// Logarithmic map (manifold to tangent)
pub fn log_map<T: Float>(
base: &[T],
point: &[T],
curvature: T,
) -> Vec<T>;
/// Poincaré distance
pub fn distance<T: Float>(
x: &[T],
y: &[T],
curvature: T,
) -> T;
}
Integration Points
Affected Crates/Modules
-
ruvector-core (Major Changes)
- Add
hybrid_embedding.rsmodule - Extend
Distancetrait withHybridDistancevariant - Update
Embeddingenum to includeHybridvariant
- Add
-
ruvector-hnsw (Moderate Changes)
- Modify distance computation in
hnsw/search.rs - Add hybrid-aware layer construction
- Update serialization for hybrid embeddings
- Modify distance computation in
-
ruvector-gnn-node (Minor Changes)
- Add TypeScript bindings for hybrid embeddings
- Export Poincaré operations to JavaScript
-
ruvector-quantization (Future Integration)
- Separate quantization strategies for Euclidean vs. hyperbolic components
- Hyperbolic component needs special handling (preserve ball constraint)
New Modules to Create
crates/ruvector-hyperbolic/
├── src/
│ ├── lib.rs # Public API
│ ├── poincare/
│ │ ├── mod.rs # Poincaré ball model
│ │ ├── ops.rs # Möbius operations
│ │ ├── distance.rs # Distance computation
│ │ └── projection.rs # Ball projection
│ ├── hybrid/
│ │ ├── mod.rs # Hybrid embeddings
│ │ ├── embedding.rs # HybridEmbedding struct
│ │ ├── distance.rs # Hybrid distance
│ │ └── conversion.rs # Euclidean ↔ Hybrid
│ ├── hnsw/
│ │ ├── mod.rs # Hybrid HNSW
│ │ └── index.rs # HybridHNSW implementation
│ └── math/
│ ├── gyrovector.rs # Gyrovector algebra
│ └── numerics.rs # Numerical stability
├── tests/
│ ├── poincare_tests.rs # Poincaré operations
│ ├── hierarchy_tests.rs # Hierarchy preservation
│ └── integration_tests.rs # End-to-end
├── benches/
│ ├── distance_bench.rs # Distance computation
│ └── hnsw_bench.rs # HNSW performance
└── Cargo.toml
Dependencies on Other Features
- Independent: Can be implemented standalone
- Synergies:
- Adaptive Precision (Feature 5): Hyperbolic components may benefit from higher precision near ball boundary
- Temporal GNN (Feature 6): Time-evolving hierarchies (e.g., organizational changes)
- Attention Mechanisms (Existing): Attention weights could adapt based on hierarchy depth
Regression Prevention
What Existing Functionality Could Break
-
HNSW Search Performance
- Risk: Hybrid distance computation is more expensive
- Impact: 10-20% search latency increase
-
Serialization Format
- Risk: Existing indexes won't deserialize
- Impact: Breaking change for stored indexes
-
Memory Layout
- Risk: Hybrid embeddings require metadata (blend ratio, curvature)
- Impact: 5-10% memory overhead
-
Distance Metric Assumptions
- Risk: Some code assumes Euclidean properties (triangle inequality)
- Impact: Graph construction may be affected
Test Cases to Prevent Regressions
#[cfg(test)]
mod regression_tests {
use super::*;
#[test]
fn test_pure_euclidean_mode_matches_original() {
// Hybrid with blend_ratio=0.0 should match Euclidean exactly
let config = HybridConfig {
hyperbolic_ratio: 0.0, // No hyperbolic component
..Default::default()
};
let euclidean_dist = cosine_distance(&emb1, &emb2);
let hybrid_dist = hybrid_distance(&hybrid_emb1, &hybrid_emb2);
assert!((euclidean_dist - hybrid_dist).abs() < 1e-6);
}
#[test]
fn test_hnsw_recall_not_degraded() {
// HNSW recall should remain >= 95% with hybrid embeddings
let recall = benchmark_hnsw_recall(&hybrid_index, &queries);
assert!(recall >= 0.95);
}
#[test]
fn test_backward_compatibility_serialization() {
// Old indexes should still deserialize
let legacy_index = deserialize_legacy_index("test.hnsw");
assert!(legacy_index.is_ok());
}
#[test]
fn test_numerical_stability_edge_cases() {
// Test with points near ball boundary (||x|| ≈ 1)
let near_boundary = vec![0.999, 0.0, 0.0];
let result = mobius_add(&near_boundary, &near_boundary);
// Should not produce NaN or overflow
assert!(result.iter().all(|x| x.is_finite()));
assert!(l2_norm(&result) < 1.0); // Still in ball
}
}
Backward Compatibility Strategy
-
Versioned Serialization
enum EmbeddingFormat { V1Euclidean, // Legacy format V2Hybrid, // New format } -
Feature Flag
[features] default = ["euclidean"] hyperbolic = ["dep:special-functions"] -
Migration Path
// Automatic conversion utility pub fn migrate_index_to_hybrid( old_index: &Path, config: HybridConfig, ) -> Result<HybridHNSW, Error> { // Read old Euclidean index // Convert embeddings to hybrid // Rebuild graph structure }
Implementation Phases
Phase 1: Core Implementation (Weeks 1-2)
Goal: Implement Poincaré ball operations and hybrid embeddings
Tasks:
- Create
ruvector-hyperboliccrate - Implement
PoincareOps:- Möbius addition
- Exponential/logarithmic maps
- Distance computation
- Projection to ball
- Implement
HybridEmbeddingstruct - Write comprehensive unit tests
- Add numerical stability tests
Deliverables:
- Working Poincaré operations (100% test coverage)
- Hybrid embedding data structure
- Benchmark suite for distance computation
Success Criteria:
- All Poincaré operations pass property tests (associativity, etc.)
- Numerical stability for edge cases (||x|| → 1)
- Distance computation < 2µs per pair (f32)
Phase 2: Integration (Weeks 3-4)
Goal: Integrate hybrid embeddings with HNSW
Tasks:
- Extend
Distancetrait withHybridDistance - Implement
HybridHNSWindex - Add serialization/deserialization
- Create migration utilities for legacy indexes
- Add TypeScript/JavaScript bindings
Deliverables:
- Functioning
HybridHNSWindex - Backward-compatible serialization
- Node.js bindings with examples
Success Criteria:
- HNSW search works with hybrid embeddings
- Recall >= 95% (compared to brute force)
- Legacy indexes still load correctly
Phase 3: Optimization (Weeks 5-6)
Goal: Optimize performance and memory usage
Tasks:
- SIMD optimization for Poincaré distance
- Cache-friendly memory layout
- Parallel distance computation
- Benchmark against pure Euclidean baseline
- Profile and optimize hotspots
Deliverables:
- SIMD-accelerated distance computation
- Performance benchmarks
- Memory profiling report
Success Criteria:
- Distance computation within 1.5x of Euclidean baseline
- Memory overhead < 10%
- Parallel search scales linearly to 8 threads
Phase 4: Production Hardening (Weeks 7-8)
Goal: Production-ready with documentation and examples
Tasks:
- Write comprehensive documentation
- Create example applications:
- Knowledge graph embeddings
- Hierarchical taxonomy search
- Add monitoring/observability
- Performance tuning for specific use cases
- Create migration guide
Deliverables:
- API documentation
- 3+ example applications
- Migration guide from Euclidean
- Production deployment checklist
Success Criteria:
- Documentation completeness score > 90%
- Examples run successfully
- Zero P0/P1 bugs in testing
Success Metrics
Performance Benchmarks
Latency Targets:
- Poincaré distance computation: < 2.0µs (f32), < 1.0µs (SIMD)
- Hybrid distance computation: < 2.5µs (f32)
- HNSW search (100k vectors): < 500µs (p95)
- Index construction: < 10 minutes (1M vectors)
Comparison Baseline (Pure Euclidean):
- Distance computation slowdown: < 1.5x
- Search latency slowdown: < 1.3x
- Index size increase: < 10%
Throughput Targets:
- Distance computation: > 400k pairs/sec (single thread)
- HNSW search: > 2000 QPS (8 threads)
Accuracy Metrics
Hierarchy Preservation:
- Tree reconstruction accuracy: > 90%
- Parent-child relationship recall: > 85%
- Hierarchy depth correlation: > 0.90
HNSW Recall:
- Top-10 recall @ ef=50: >= 95%
- Top-100 recall @ ef=200: >= 98%
Distance Distortion:
- Average distortion (vs. ground truth): < 0.15
- Max distortion (99th percentile): < 0.30
Memory/Latency Targets
Memory Reduction (vs. pure Euclidean with same hierarchy quality):
- Total embedding size: 30-50% reduction
- HNSW index size: 25-40% reduction
- Runtime memory: < 5% overhead for metadata
Latency Breakdown:
- Euclidean component: 40-50% of time
- Hyperbolic component: 40-50% of time
- Blending/normalization: < 10% of time
Scalability:
- Linear scaling to 10M vectors
- Sub-linear scaling to 100M vectors (with sharding)
Risks and Mitigations
Technical Risks
Risk 1: Numerical Instability near Ball Boundary
- Severity: High
- Impact: NaN/Inf values, incorrect distances
- Probability: Medium
- Mitigation:
- Use epsilon-buffered projection (||x|| < 1 - ε)
- Employ numerically stable formulas (log-sum-exp tricks)
- Add extensive edge case tests
- Use higher precision (f64) for critical operations
Risk 2: Performance Degradation
- Severity: Medium
- Impact: Slower search, higher latency
- Probability: High
- Mitigation:
- SIMD optimization for distance computation
- Precompute and cache norm squares
- Profile-guided optimization
- Provide performance tuning guide
Risk 3: Complex API Confusion
- Severity: Medium
- Impact: User adoption issues, misconfiguration
- Probability: Medium
- Mitigation:
- Provide sensible defaults (blend_ratio=0.3, curvature=-1.0)
- Create configuration presets (taxonomy, knowledge-graph, etc.)
- Write comprehensive examples
- Add validation with helpful error messages
Risk 4: Serialization Compatibility
- Severity: High
- Impact: Breaking changes, migration pain
- Probability: High
- Mitigation:
- Version serialization format
- Provide automatic migration tool
- Support reading legacy formats
- Comprehensive migration guide
Risk 5: Integration with Quantization
- Severity: Medium
- Impact: Quantization may break ball constraints
- Probability: High
- Mitigation:
- Defer quantization for hyperbolic component
- Research hyperbolic-aware quantization schemes
- Document incompatibilities clearly
- Provide fallback to f32 for hyperbolic
Risk 6: Limited Use Case Applicability
- Severity: Low
- Impact: Feature underutilized if data isn't hierarchical
- Probability: Medium
- Mitigation:
- Provide hierarchy detection tool
- Make hyperbolic component optional (blend_ratio=0)
- Document ideal use cases clearly
- Add auto-configuration based on data analysis
Mitigation Summary Table
| Risk | Mitigation Strategy | Owner | Timeline |
|---|---|---|---|
| Numerical instability | Epsilon buffering + stable formulas | Core team | Phase 1 |
| Performance degradation | SIMD + profiling + caching | Optimization team | Phase 3 |
| API complexity | Defaults + examples + validation | API team | Phase 4 |
| Serialization breaks | Versioning + migration tool | Integration team | Phase 2 |
| Quantization conflict | Defer integration + research | Research team | Post-v1 |
| Limited applicability | Detection tool + documentation | Product team | Phase 4 |
References
- Nickel & Kiela (2017): "Poincaré Embeddings for Learning Hierarchical Representations"
- Sala et al. (2018): "Representation Tradeoffs for Hyperbolic Embeddings"
- Chami et al. (2019): "Hyperbolic Graph Convolutional Neural Networks"
- Ganea et al. (2018): "Hyperbolic Neural Networks"
Appendix: Mathematical Foundations
Poincaré Ball Model
The Poincaré ball model represents hyperbolic space as:
B^n = {x ∈ ℝ^n : ||x|| < 1}
with metric tensor:
g_x = (2 / (1 - ||x||²))² δ_ij
Möbius Addition Formula
x ⊕_c y = ((1 + 2c⟨x,y⟩ + c||y||²)x + (1 - c||x||²)y) / (1 + 2c⟨x,y⟩ + c²||x||²||y||²)
where c is the absolute curvature (typically c = 1, curvature = -1).
Distance Formula
d_c(x, y) = (1/√c) acosh(1 + 2c ||x - y||² / ((1 - c||x||²)(1 - c||y||²)))
Exponential Map (Tangent to Manifold)
exp_x^c(v) = x ⊕_c (tanh(√c ||v|| / 2) / (√c ||v||)) v
Logarithmic Map (Manifold to Tangent)
log_x^c(y) = (2 / (√c λ_x)) atanh(√c ||(-x) ⊕_c y||) · ((-x) ⊕_c y) / ||(-x) ⊕_c y||
where λ_x = 1 / (1 - c||x||²) is the conformal factor.