git-subtree-dir: vendor/ruvector git-subtree-split: b64c21726f2bb37286d9ee36a7869fef60cc6900
774 lines
24 KiB
Markdown
774 lines
24 KiB
Markdown
# Hyperbolic Embeddings for Hierarchical Vector Representations
|
||
|
||
## Overview
|
||
|
||
### Problem Statement
|
||
|
||
Traditional Euclidean embeddings struggle to represent hierarchical structures efficiently. Tree-like and scale-free graphs (common in knowledge graphs, social networks, and taxonomies) require exponentially growing dimensions in Euclidean space to preserve hierarchical distances. This leads to:
|
||
|
||
- **High dimensionality requirements**: 100+ dimensions for modest hierarchies
|
||
- **Poor distance preservation**: Hierarchical relationships get distorted
|
||
- **Inefficient similarity search**: HNSW performance degrades with unnecessary dimensions
|
||
- **Loss of structural information**: Parent-child relationships not explicitly encoded
|
||
|
||
### Proposed Solution
|
||
|
||
Implement a **Hybrid Euclidean-Hyperbolic Embedding System** that combines:
|
||
|
||
1. **Poincaré Ball Model** for hyperbolic space (hierarchy representation)
|
||
2. **Euclidean Space** for traditional similarity features
|
||
3. **Möbius Gyrovector Algebra** for vector operations in hyperbolic space
|
||
4. **Adaptive Blending** to balance hierarchical vs. similarity features
|
||
|
||
The system maintains dual representations:
|
||
- Hyperbolic component: Captures tree-like hierarchies (20-40% of vector)
|
||
- Euclidean component: Captures semantic similarity (60-80% of vector)
|
||
|
||
### Expected Benefits
|
||
|
||
**Quantified Improvements:**
|
||
- **Dimension Reduction**: 30-50% fewer dimensions for hierarchical data
|
||
- **Hierarchy Preservation**: 85-95% hierarchy accuracy vs. 60-70% in Euclidean
|
||
- **Search Speed**: 1.5-2x faster due to reduced dimensionality
|
||
- **Memory Savings**: 25-40% reduction in total storage
|
||
- **Distortion**: 2-3x lower distortion for tree-like structures
|
||
|
||
**Use Cases:**
|
||
- Knowledge graph embeddings (WordNet, Wikidata)
|
||
- Organizational hierarchies
|
||
- Taxonomy classification
|
||
- Document topic hierarchies
|
||
|
||
## Technical Design
|
||
|
||
### Architecture Diagram
|
||
|
||
```
|
||
┌─────────────────────────────────────────────────────────────┐
|
||
│ HybridEmbedding<T> │
|
||
├─────────────────────────────────────────────────────────────┤
|
||
│ - euclidean_component: Vec<T> [60-80% of dimensions] │
|
||
│ - hyperbolic_component: Vec<T> [20-40% of dimensions] │
|
||
│ - blend_ratio: f32 │
|
||
│ - curvature: f32 [typically -1.0] │
|
||
└─────────────────────────────────────────────────────────────┘
|
||
▲
|
||
│
|
||
┌───────────────┴───────────────┐
|
||
│ │
|
||
┌─────────▼──────────┐ ┌─────────▼──────────┐
|
||
│ PoincareOps<T> │ │ EuclideanOps<T> │
|
||
├────────────────────┤ ├────────────────────┤
|
||
│ - mobius_add() │ │ - dot_product() │
|
||
│ - exp_map() │ │ - cosine_sim() │
|
||
│ - log_map() │ │ - l2_norm() │
|
||
│ - distance() │ │ - normalize() │
|
||
│ - gyration() │ └────────────────────┘
|
||
└────────────────────┘
|
||
│
|
||
▼
|
||
┌─────────────────────┐
|
||
│ HyperbolicHNSW<T> │
|
||
├─────────────────────┤
|
||
│ - hybrid_distance() │ ← Combines both distances
|
||
│ - insert() │
|
||
│ - search() │
|
||
└─────────────────────┘
|
||
```
|
||
|
||
### Core Data Structures
|
||
|
||
```rust
|
||
/// Hybrid embedding combining Euclidean and Hyperbolic spaces
|
||
#[derive(Clone, Debug)]
|
||
pub struct HybridEmbedding<T: Float> {
|
||
/// Euclidean component (semantic similarity)
|
||
pub euclidean: Vec<T>,
|
||
|
||
/// Hyperbolic component (hierarchy in Poincaré ball)
|
||
/// Each coordinate constrained to ||x|| < 1
|
||
pub hyperbolic: Vec<T>,
|
||
|
||
/// Blend ratio (0.0 = pure Euclidean, 1.0 = pure hyperbolic)
|
||
pub blend_ratio: f32,
|
||
|
||
/// Hyperbolic space curvature (typically -1.0)
|
||
pub curvature: f32,
|
||
|
||
/// Total dimension
|
||
pub dimension: usize,
|
||
}
|
||
|
||
/// Poincaré ball operations (Möbius gyrovector algebra)
|
||
pub struct PoincareOps<T: Float> {
|
||
curvature: T,
|
||
epsilon: T, // Numerical stability (1e-8)
|
||
}
|
||
|
||
impl<T: Float> PoincareOps<T> {
|
||
/// Möbius addition: x ⊕ y
|
||
/// (x⊕y) = ((1+2⟨x,y⟩+||y||²)x + (1-||x||²)y) / (1+2⟨x,y⟩+||x||²||y||²)
|
||
pub fn mobius_add(&self, x: &[T], y: &[T]) -> Vec<T>;
|
||
|
||
/// Exponential map: TₓM → M (tangent to manifold)
|
||
pub fn exp_map(&self, x: &[T], v: &[T]) -> Vec<T>;
|
||
|
||
/// Logarithmic map: M → TₓM (manifold to tangent)
|
||
pub fn log_map(&self, x: &[T], y: &[T]) -> Vec<T>;
|
||
|
||
/// Poincaré distance
|
||
/// d(x,y) = acosh(1 + 2||x-y||²/((1-||x||²)(1-||y||²)))
|
||
pub fn distance(&self, x: &[T], y: &[T]) -> T;
|
||
|
||
/// Project vector to Poincaré ball (ensure ||x|| < 1)
|
||
pub fn project(&self, x: &[T]) -> Vec<T>;
|
||
}
|
||
|
||
/// Hybrid HNSW index supporting both distance metrics
|
||
pub struct HybridHNSW<T: Float> {
|
||
/// Standard HNSW graph structure
|
||
layers: Vec<HNSWLayer>,
|
||
|
||
/// Hybrid embeddings
|
||
embeddings: Vec<HybridEmbedding<T>>,
|
||
|
||
/// Distance computation strategy
|
||
distance_fn: HybridDistanceFunction,
|
||
|
||
/// HNSW parameters
|
||
params: HNSWParams,
|
||
}
|
||
|
||
/// Distance function combining Euclidean and hyperbolic metrics
|
||
pub enum HybridDistanceFunction {
|
||
/// Weighted combination
|
||
Weighted { euclidean_weight: f32, hyperbolic_weight: f32 },
|
||
|
||
/// Adaptive based on query context
|
||
Adaptive,
|
||
|
||
/// Hierarchical first, then Euclidean for tie-breaking
|
||
Hierarchical,
|
||
}
|
||
|
||
/// Configuration for hybrid embeddings
|
||
#[derive(Clone)]
|
||
pub struct HybridConfig {
|
||
/// Total embedding dimension
|
||
pub total_dim: usize,
|
||
|
||
/// Fraction allocated to hyperbolic space (0.2-0.4)
|
||
pub hyperbolic_ratio: f32,
|
||
|
||
/// Hyperbolic space curvature
|
||
pub curvature: f32,
|
||
|
||
/// Distance blending strategy
|
||
pub distance_strategy: HybridDistanceFunction,
|
||
|
||
/// Numerical stability epsilon
|
||
pub epsilon: f32,
|
||
}
|
||
```
|
||
|
||
### Key Algorithms
|
||
|
||
#### Algorithm 1: Hybrid Distance Computation
|
||
|
||
```pseudocode
|
||
function hybrid_distance(emb1: HybridEmbedding, emb2: HybridEmbedding) -> float:
|
||
// Compute Euclidean component distance
|
||
d_euclidean = cosine_distance(emb1.euclidean, emb2.euclidean)
|
||
|
||
// Compute hyperbolic component distance (Poincaré)
|
||
d_hyperbolic = poincare_distance(emb1.hyperbolic, emb2.hyperbolic)
|
||
|
||
// Normalize distances to [0, 1] range
|
||
d_euclidean_norm = d_euclidean / 2.0 // cosine ∈ [0, 2]
|
||
d_hyperbolic_norm = tanh(d_hyperbolic / 2.0) // hyperbolic ∈ [0, ∞)
|
||
|
||
// Blend based on strategy
|
||
match emb1.blend_strategy:
|
||
Weighted(w_e, w_h):
|
||
return w_e * d_euclidean_norm + w_h * d_hyperbolic_norm
|
||
|
||
Adaptive:
|
||
// Use hyperbolic more for hierarchical queries
|
||
hierarchy_score = detect_hierarchy(emb1, emb2)
|
||
w_h = hierarchy_score
|
||
w_e = 1.0 - hierarchy_score
|
||
return w_e * d_euclidean_norm + w_h * d_hyperbolic_norm
|
||
|
||
Hierarchical:
|
||
// Use hyperbolic for pruning, Euclidean for ranking
|
||
if d_hyperbolic_norm > threshold:
|
||
return d_hyperbolic_norm
|
||
else:
|
||
return 0.3 * d_hyperbolic_norm + 0.7 * d_euclidean_norm
|
||
```
|
||
|
||
#### Algorithm 2: Poincaré Distance (Optimized)
|
||
|
||
```pseudocode
|
||
function poincare_distance(x: Vec<T>, y: Vec<T>, curvature: T) -> T:
|
||
// Compute ||x - y||²
|
||
diff_norm_sq = 0.0
|
||
for i in 0..x.len():
|
||
diff = x[i] - y[i]
|
||
diff_norm_sq += diff * diff
|
||
|
||
// Compute ||x||² and ||y||²
|
||
x_norm_sq = dot(x, x)
|
||
y_norm_sq = dot(y, y)
|
||
|
||
// Numerical stability: ensure norms < 1
|
||
x_norm_sq = min(x_norm_sq, 1.0 - epsilon)
|
||
y_norm_sq = min(y_norm_sq, 1.0 - epsilon)
|
||
|
||
// Poincaré distance formula
|
||
numerator = 2.0 * diff_norm_sq
|
||
denominator = (1.0 - x_norm_sq) * (1.0 - y_norm_sq)
|
||
|
||
ratio = numerator / (denominator + epsilon)
|
||
|
||
// d = acosh(1 + ratio)
|
||
// Numerically stable: acosh(x) = log(x + sqrt(x²-1))
|
||
inner = 1.0 + ratio
|
||
if inner < 1.0 + epsilon:
|
||
return 0.0 // Points are identical
|
||
|
||
return log(inner + sqrt(inner * inner - 1.0)) / sqrt(abs(curvature))
|
||
```
|
||
|
||
#### Algorithm 3: Möbius Addition (Core Operation)
|
||
|
||
```pseudocode
|
||
function mobius_add(x: Vec<T>, y: Vec<T>, curvature: T) -> Vec<T]:
|
||
// Compute scalar products
|
||
xy_dot = dot(x, y)
|
||
x_norm_sq = dot(x, x)
|
||
y_norm_sq = dot(y, y)
|
||
|
||
// Conformal factor
|
||
denominator = 1.0 + 2.0 * curvature * xy_dot +
|
||
curvature² * x_norm_sq * y_norm_sq
|
||
|
||
// Numerator terms
|
||
numerator_x_coeff = 1.0 + 2.0 * curvature * xy_dot +
|
||
curvature * y_norm_sq
|
||
numerator_y_coeff = 1.0 - curvature * x_norm_sq
|
||
|
||
// Result
|
||
result = Vec::new()
|
||
for i in 0..x.len():
|
||
value = (numerator_x_coeff * x[i] + numerator_y_coeff * y[i]) /
|
||
(denominator + epsilon)
|
||
result.push(value)
|
||
|
||
// Project back to ball (ensure ||result|| < 1)
|
||
return project_to_ball(result)
|
||
|
||
function project_to_ball(x: Vec<T>) -> Vec<T]:
|
||
norm = sqrt(dot(x, x))
|
||
if norm >= 1.0:
|
||
// Project to ball with radius 1 - epsilon
|
||
scale = (1.0 - epsilon) / norm
|
||
return x.map(|xi| xi * scale)
|
||
return x
|
||
```
|
||
|
||
### API Design
|
||
|
||
```rust
|
||
// Public API for hybrid embeddings
|
||
pub mod hybrid {
|
||
use super::*;
|
||
|
||
/// Create hybrid embedding from separate components
|
||
pub fn create_hybrid<T: Float>(
|
||
euclidean: Vec<T>,
|
||
hyperbolic: Vec<T>,
|
||
config: HybridConfig,
|
||
) -> Result<HybridEmbedding<T>, Error>;
|
||
|
||
/// Convert standard embedding to hybrid (automatic split)
|
||
pub fn euclidean_to_hybrid<T: Float>(
|
||
embedding: &[T],
|
||
config: HybridConfig,
|
||
) -> Result<HybridEmbedding<T>, Error>;
|
||
|
||
/// Compute distance between hybrid embeddings
|
||
pub fn distance<T: Float>(
|
||
a: &HybridEmbedding<T>,
|
||
b: &HybridEmbedding<T>,
|
||
) -> T;
|
||
|
||
/// Create HNSW index with hybrid embeddings
|
||
pub fn build_index<T: Float>(
|
||
embeddings: Vec<HybridEmbedding<T>>,
|
||
config: HybridConfig,
|
||
hnsw_params: HNSWParams,
|
||
) -> Result<HybridHNSW<T>, Error>;
|
||
}
|
||
|
||
// Poincaré ball operations (advanced users)
|
||
pub mod poincare {
|
||
/// Möbius addition in Poincaré ball
|
||
pub fn mobius_add<T: Float>(
|
||
x: &[T],
|
||
y: &[T],
|
||
curvature: T,
|
||
) -> Vec<T>;
|
||
|
||
/// Exponential map (tangent to manifold)
|
||
pub fn exp_map<T: Float>(
|
||
base: &[T],
|
||
tangent: &[T],
|
||
curvature: T,
|
||
) -> Vec<T>;
|
||
|
||
/// Logarithmic map (manifold to tangent)
|
||
pub fn log_map<T: Float>(
|
||
base: &[T],
|
||
point: &[T],
|
||
curvature: T,
|
||
) -> Vec<T>;
|
||
|
||
/// Poincaré distance
|
||
pub fn distance<T: Float>(
|
||
x: &[T],
|
||
y: &[T],
|
||
curvature: T,
|
||
) -> T;
|
||
}
|
||
```
|
||
|
||
## Integration Points
|
||
|
||
### Affected Crates/Modules
|
||
|
||
1. **ruvector-core** (Major Changes)
|
||
- Add `hybrid_embedding.rs` module
|
||
- Extend `Distance` trait with `HybridDistance` variant
|
||
- Update `Embedding` enum to include `Hybrid` variant
|
||
|
||
2. **ruvector-hnsw** (Moderate Changes)
|
||
- Modify distance computation in `hnsw/search.rs`
|
||
- Add hybrid-aware layer construction
|
||
- Update serialization for hybrid embeddings
|
||
|
||
3. **ruvector-gnn-node** (Minor Changes)
|
||
- Add TypeScript bindings for hybrid embeddings
|
||
- Export Poincaré operations to JavaScript
|
||
|
||
4. **ruvector-quantization** (Future Integration)
|
||
- Separate quantization strategies for Euclidean vs. hyperbolic components
|
||
- Hyperbolic component needs special handling (preserve ball constraint)
|
||
|
||
### New Modules to Create
|
||
|
||
```
|
||
crates/ruvector-hyperbolic/
|
||
├── src/
|
||
│ ├── lib.rs # Public API
|
||
│ ├── poincare/
|
||
│ │ ├── mod.rs # Poincaré ball model
|
||
│ │ ├── ops.rs # Möbius operations
|
||
│ │ ├── distance.rs # Distance computation
|
||
│ │ └── projection.rs # Ball projection
|
||
│ ├── hybrid/
|
||
│ │ ├── mod.rs # Hybrid embeddings
|
||
│ │ ├── embedding.rs # HybridEmbedding struct
|
||
│ │ ├── distance.rs # Hybrid distance
|
||
│ │ └── conversion.rs # Euclidean ↔ Hybrid
|
||
│ ├── hnsw/
|
||
│ │ ├── mod.rs # Hybrid HNSW
|
||
│ │ └── index.rs # HybridHNSW implementation
|
||
│ └── math/
|
||
│ ├── gyrovector.rs # Gyrovector algebra
|
||
│ └── numerics.rs # Numerical stability
|
||
├── tests/
|
||
│ ├── poincare_tests.rs # Poincaré operations
|
||
│ ├── hierarchy_tests.rs # Hierarchy preservation
|
||
│ └── integration_tests.rs # End-to-end
|
||
├── benches/
|
||
│ ├── distance_bench.rs # Distance computation
|
||
│ └── hnsw_bench.rs # HNSW performance
|
||
└── Cargo.toml
|
||
```
|
||
|
||
### Dependencies on Other Features
|
||
|
||
- **Independent**: Can be implemented standalone
|
||
- **Synergies**:
|
||
- **Adaptive Precision** (Feature 5): Hyperbolic components may benefit from higher precision near ball boundary
|
||
- **Temporal GNN** (Feature 6): Time-evolving hierarchies (e.g., organizational changes)
|
||
- **Attention Mechanisms** (Existing): Attention weights could adapt based on hierarchy depth
|
||
|
||
## Regression Prevention
|
||
|
||
### What Existing Functionality Could Break
|
||
|
||
1. **HNSW Search Performance**
|
||
- Risk: Hybrid distance computation is more expensive
|
||
- Impact: 10-20% search latency increase
|
||
|
||
2. **Serialization Format**
|
||
- Risk: Existing indexes won't deserialize
|
||
- Impact: Breaking change for stored indexes
|
||
|
||
3. **Memory Layout**
|
||
- Risk: Hybrid embeddings require metadata (blend ratio, curvature)
|
||
- Impact: 5-10% memory overhead
|
||
|
||
4. **Distance Metric Assumptions**
|
||
- Risk: Some code assumes Euclidean properties (triangle inequality)
|
||
- Impact: Graph construction may be affected
|
||
|
||
### Test Cases to Prevent Regressions
|
||
|
||
```rust
|
||
#[cfg(test)]
|
||
mod regression_tests {
|
||
use super::*;
|
||
|
||
#[test]
|
||
fn test_pure_euclidean_mode_matches_original() {
|
||
// Hybrid with blend_ratio=0.0 should match Euclidean exactly
|
||
let config = HybridConfig {
|
||
hyperbolic_ratio: 0.0, // No hyperbolic component
|
||
..Default::default()
|
||
};
|
||
|
||
let euclidean_dist = cosine_distance(&emb1, &emb2);
|
||
let hybrid_dist = hybrid_distance(&hybrid_emb1, &hybrid_emb2);
|
||
|
||
assert!((euclidean_dist - hybrid_dist).abs() < 1e-6);
|
||
}
|
||
|
||
#[test]
|
||
fn test_hnsw_recall_not_degraded() {
|
||
// HNSW recall should remain >= 95% with hybrid embeddings
|
||
let recall = benchmark_hnsw_recall(&hybrid_index, &queries);
|
||
assert!(recall >= 0.95);
|
||
}
|
||
|
||
#[test]
|
||
fn test_backward_compatibility_serialization() {
|
||
// Old indexes should still deserialize
|
||
let legacy_index = deserialize_legacy_index("test.hnsw");
|
||
assert!(legacy_index.is_ok());
|
||
}
|
||
|
||
#[test]
|
||
fn test_numerical_stability_edge_cases() {
|
||
// Test with points near ball boundary (||x|| ≈ 1)
|
||
let near_boundary = vec![0.999, 0.0, 0.0];
|
||
let result = mobius_add(&near_boundary, &near_boundary);
|
||
|
||
// Should not produce NaN or overflow
|
||
assert!(result.iter().all(|x| x.is_finite()));
|
||
assert!(l2_norm(&result) < 1.0); // Still in ball
|
||
}
|
||
}
|
||
```
|
||
|
||
### Backward Compatibility Strategy
|
||
|
||
1. **Versioned Serialization**
|
||
```rust
|
||
enum EmbeddingFormat {
|
||
V1Euclidean, // Legacy format
|
||
V2Hybrid, // New format
|
||
}
|
||
```
|
||
|
||
2. **Feature Flag**
|
||
```toml
|
||
[features]
|
||
default = ["euclidean"]
|
||
hyperbolic = ["dep:special-functions"]
|
||
```
|
||
|
||
3. **Migration Path**
|
||
```rust
|
||
// Automatic conversion utility
|
||
pub fn migrate_index_to_hybrid(
|
||
old_index: &Path,
|
||
config: HybridConfig,
|
||
) -> Result<HybridHNSW, Error> {
|
||
// Read old Euclidean index
|
||
// Convert embeddings to hybrid
|
||
// Rebuild graph structure
|
||
}
|
||
```
|
||
|
||
## Implementation Phases
|
||
|
||
### Phase 1: Core Implementation (Weeks 1-2)
|
||
|
||
**Goal**: Implement Poincaré ball operations and hybrid embeddings
|
||
|
||
**Tasks**:
|
||
1. Create `ruvector-hyperbolic` crate
|
||
2. Implement `PoincareOps`:
|
||
- Möbius addition
|
||
- Exponential/logarithmic maps
|
||
- Distance computation
|
||
- Projection to ball
|
||
3. Implement `HybridEmbedding` struct
|
||
4. Write comprehensive unit tests
|
||
5. Add numerical stability tests
|
||
|
||
**Deliverables**:
|
||
- Working Poincaré operations (100% test coverage)
|
||
- Hybrid embedding data structure
|
||
- Benchmark suite for distance computation
|
||
|
||
**Success Criteria**:
|
||
- All Poincaré operations pass property tests (associativity, etc.)
|
||
- Numerical stability for edge cases (||x|| → 1)
|
||
- Distance computation < 2µs per pair (f32)
|
||
|
||
### Phase 2: Integration (Weeks 3-4)
|
||
|
||
**Goal**: Integrate hybrid embeddings with HNSW
|
||
|
||
**Tasks**:
|
||
1. Extend `Distance` trait with `HybridDistance`
|
||
2. Implement `HybridHNSW` index
|
||
3. Add serialization/deserialization
|
||
4. Create migration utilities for legacy indexes
|
||
5. Add TypeScript/JavaScript bindings
|
||
|
||
**Deliverables**:
|
||
- Functioning `HybridHNSW` index
|
||
- Backward-compatible serialization
|
||
- Node.js bindings with examples
|
||
|
||
**Success Criteria**:
|
||
- HNSW search works with hybrid embeddings
|
||
- Recall >= 95% (compared to brute force)
|
||
- Legacy indexes still load correctly
|
||
|
||
### Phase 3: Optimization (Weeks 5-6)
|
||
|
||
**Goal**: Optimize performance and memory usage
|
||
|
||
**Tasks**:
|
||
1. SIMD optimization for Poincaré distance
|
||
2. Cache-friendly memory layout
|
||
3. Parallel distance computation
|
||
4. Benchmark against pure Euclidean baseline
|
||
5. Profile and optimize hotspots
|
||
|
||
**Deliverables**:
|
||
- SIMD-accelerated distance computation
|
||
- Performance benchmarks
|
||
- Memory profiling report
|
||
|
||
**Success Criteria**:
|
||
- Distance computation within 1.5x of Euclidean baseline
|
||
- Memory overhead < 10%
|
||
- Parallel search scales linearly to 8 threads
|
||
|
||
### Phase 4: Production Hardening (Weeks 7-8)
|
||
|
||
**Goal**: Production-ready with documentation and examples
|
||
|
||
**Tasks**:
|
||
1. Write comprehensive documentation
|
||
2. Create example applications:
|
||
- Knowledge graph embeddings
|
||
- Hierarchical taxonomy search
|
||
3. Add monitoring/observability
|
||
4. Performance tuning for specific use cases
|
||
5. Create migration guide
|
||
|
||
**Deliverables**:
|
||
- API documentation
|
||
- 3+ example applications
|
||
- Migration guide from Euclidean
|
||
- Production deployment checklist
|
||
|
||
**Success Criteria**:
|
||
- Documentation completeness score > 90%
|
||
- Examples run successfully
|
||
- Zero P0/P1 bugs in testing
|
||
|
||
## Success Metrics
|
||
|
||
### Performance Benchmarks
|
||
|
||
**Latency Targets**:
|
||
- Poincaré distance computation: < 2.0µs (f32), < 1.0µs (SIMD)
|
||
- Hybrid distance computation: < 2.5µs (f32)
|
||
- HNSW search (100k vectors): < 500µs (p95)
|
||
- Index construction: < 10 minutes (1M vectors)
|
||
|
||
**Comparison Baseline** (Pure Euclidean):
|
||
- Distance computation slowdown: < 1.5x
|
||
- Search latency slowdown: < 1.3x
|
||
- Index size increase: < 10%
|
||
|
||
**Throughput Targets**:
|
||
- Distance computation: > 400k pairs/sec (single thread)
|
||
- HNSW search: > 2000 QPS (8 threads)
|
||
|
||
### Accuracy Metrics
|
||
|
||
**Hierarchy Preservation**:
|
||
- Tree reconstruction accuracy: > 90%
|
||
- Parent-child relationship recall: > 85%
|
||
- Hierarchy depth correlation: > 0.90
|
||
|
||
**HNSW Recall**:
|
||
- Top-10 recall @ ef=50: >= 95%
|
||
- Top-100 recall @ ef=200: >= 98%
|
||
|
||
**Distance Distortion**:
|
||
- Average distortion (vs. ground truth): < 0.15
|
||
- Max distortion (99th percentile): < 0.30
|
||
|
||
### Memory/Latency Targets
|
||
|
||
**Memory Reduction** (vs. pure Euclidean with same hierarchy quality):
|
||
- Total embedding size: 30-50% reduction
|
||
- HNSW index size: 25-40% reduction
|
||
- Runtime memory: < 5% overhead for metadata
|
||
|
||
**Latency Breakdown**:
|
||
- Euclidean component: 40-50% of time
|
||
- Hyperbolic component: 40-50% of time
|
||
- Blending/normalization: < 10% of time
|
||
|
||
**Scalability**:
|
||
- Linear scaling to 10M vectors
|
||
- Sub-linear scaling to 100M vectors (with sharding)
|
||
|
||
## Risks and Mitigations
|
||
|
||
### Technical Risks
|
||
|
||
**Risk 1: Numerical Instability near Ball Boundary**
|
||
- **Severity**: High
|
||
- **Impact**: NaN/Inf values, incorrect distances
|
||
- **Probability**: Medium
|
||
- **Mitigation**:
|
||
- Use epsilon-buffered projection (||x|| < 1 - ε)
|
||
- Employ numerically stable formulas (log-sum-exp tricks)
|
||
- Add extensive edge case tests
|
||
- Use higher precision (f64) for critical operations
|
||
|
||
**Risk 2: Performance Degradation**
|
||
- **Severity**: Medium
|
||
- **Impact**: Slower search, higher latency
|
||
- **Probability**: High
|
||
- **Mitigation**:
|
||
- SIMD optimization for distance computation
|
||
- Precompute and cache norm squares
|
||
- Profile-guided optimization
|
||
- Provide performance tuning guide
|
||
|
||
**Risk 3: Complex API Confusion**
|
||
- **Severity**: Medium
|
||
- **Impact**: User adoption issues, misconfiguration
|
||
- **Probability**: Medium
|
||
- **Mitigation**:
|
||
- Provide sensible defaults (blend_ratio=0.3, curvature=-1.0)
|
||
- Create configuration presets (taxonomy, knowledge-graph, etc.)
|
||
- Write comprehensive examples
|
||
- Add validation with helpful error messages
|
||
|
||
**Risk 4: Serialization Compatibility**
|
||
- **Severity**: High
|
||
- **Impact**: Breaking changes, migration pain
|
||
- **Probability**: High
|
||
- **Mitigation**:
|
||
- Version serialization format
|
||
- Provide automatic migration tool
|
||
- Support reading legacy formats
|
||
- Comprehensive migration guide
|
||
|
||
**Risk 5: Integration with Quantization**
|
||
- **Severity**: Medium
|
||
- **Impact**: Quantization may break ball constraints
|
||
- **Probability**: High
|
||
- **Mitigation**:
|
||
- Defer quantization for hyperbolic component
|
||
- Research hyperbolic-aware quantization schemes
|
||
- Document incompatibilities clearly
|
||
- Provide fallback to f32 for hyperbolic
|
||
|
||
**Risk 6: Limited Use Case Applicability**
|
||
- **Severity**: Low
|
||
- **Impact**: Feature underutilized if data isn't hierarchical
|
||
- **Probability**: Medium
|
||
- **Mitigation**:
|
||
- Provide hierarchy detection tool
|
||
- Make hyperbolic component optional (blend_ratio=0)
|
||
- Document ideal use cases clearly
|
||
- Add auto-configuration based on data analysis
|
||
|
||
### Mitigation Summary Table
|
||
|
||
| Risk | Mitigation Strategy | Owner | Timeline |
|
||
|------|-------------------|-------|----------|
|
||
| Numerical instability | Epsilon buffering + stable formulas | Core team | Phase 1 |
|
||
| Performance degradation | SIMD + profiling + caching | Optimization team | Phase 3 |
|
||
| API complexity | Defaults + examples + validation | API team | Phase 4 |
|
||
| Serialization breaks | Versioning + migration tool | Integration team | Phase 2 |
|
||
| Quantization conflict | Defer integration + research | Research team | Post-v1 |
|
||
| Limited applicability | Detection tool + documentation | Product team | Phase 4 |
|
||
|
||
---
|
||
|
||
## References
|
||
|
||
1. **Nickel & Kiela (2017)**: "Poincaré Embeddings for Learning Hierarchical Representations"
|
||
2. **Sala et al. (2018)**: "Representation Tradeoffs for Hyperbolic Embeddings"
|
||
3. **Chami et al. (2019)**: "Hyperbolic Graph Convolutional Neural Networks"
|
||
4. **Ganea et al. (2018)**: "Hyperbolic Neural Networks"
|
||
|
||
## Appendix: Mathematical Foundations
|
||
|
||
### Poincaré Ball Model
|
||
|
||
The Poincaré ball model represents hyperbolic space as:
|
||
```
|
||
B^n = {x ∈ ℝ^n : ||x|| < 1}
|
||
```
|
||
|
||
with metric tensor:
|
||
```
|
||
g_x = (2 / (1 - ||x||²))² δ_ij
|
||
```
|
||
|
||
### Möbius Addition Formula
|
||
|
||
```
|
||
x ⊕_c y = ((1 + 2c⟨x,y⟩ + c||y||²)x + (1 - c||x||²)y) / (1 + 2c⟨x,y⟩ + c²||x||²||y||²)
|
||
```
|
||
|
||
where c is the absolute curvature (typically c = 1, curvature = -1).
|
||
|
||
### Distance Formula
|
||
|
||
```
|
||
d_c(x, y) = (1/√c) acosh(1 + 2c ||x - y||² / ((1 - c||x||²)(1 - c||y||²)))
|
||
```
|
||
|
||
### Exponential Map (Tangent to Manifold)
|
||
|
||
```
|
||
exp_x^c(v) = x ⊕_c (tanh(√c ||v|| / 2) / (√c ||v||)) v
|
||
```
|
||
|
||
### Logarithmic Map (Manifold to Tangent)
|
||
|
||
```
|
||
log_x^c(y) = (2 / (√c λ_x)) atanh(√c ||(-x) ⊕_c y||) · ((-x) ⊕_c y) / ||(-x) ⊕_c y||
|
||
```
|
||
|
||
where `λ_x = 1 / (1 - c||x||²)` is the conformal factor.
|