Squashed 'vendor/ruvector/' content from commit b64c2172

git-subtree-dir: vendor/ruvector
git-subtree-split: b64c21726f2bb37286d9ee36a7869fef60cc6900
ruv
2026-02-28 14:39:40 -05:00
commit d803bfe2b1
7854 changed files with 3522914 additions and 0 deletions

@@ -0,0 +1,773 @@
# Hyperbolic Embeddings for Hierarchical Vector Representations
## Overview
### Problem Statement
Traditional Euclidean embeddings struggle to represent hierarchical structures efficiently. Tree-like and scale-free graphs (common in knowledge graphs, social networks, and taxonomies) require exponentially growing dimensions in Euclidean space to preserve hierarchical distances. This leads to:
- **High dimensionality requirements**: 100+ dimensions for modest hierarchies
- **Poor distance preservation**: Hierarchical relationships get distorted
- **Inefficient similarity search**: HNSW performance degrades with unnecessary dimensions
- **Loss of structural information**: Parent-child relationships not explicitly encoded
### Proposed Solution
Implement a **Hybrid Euclidean-Hyperbolic Embedding System** that combines:
1. **Poincaré Ball Model** for hyperbolic space (hierarchy representation)
2. **Euclidean Space** for traditional similarity features
3. **Möbius Gyrovector Algebra** for vector operations in hyperbolic space
4. **Adaptive Blending** to balance hierarchical vs. similarity features
The system maintains dual representations:
- Hyperbolic component: Captures tree-like hierarchies (20-40% of vector)
- Euclidean component: Captures semantic similarity (60-80% of vector)
### Expected Benefits
**Projected Improvements (design targets, not yet measured):**
- **Dimension Reduction**: 30-50% fewer dimensions for hierarchical data
- **Hierarchy Preservation**: 85-95% hierarchy accuracy vs. 60-70% in Euclidean
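- **Search Speed**: 1.5-2x faster at equal hierarchy quality, due to reduced dimensionality (each hybrid distance evaluation is itself more expensive than a Euclidean one)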
- **Memory Savings**: 25-40% reduction in total storage
- **Distortion**: 2-3x lower distortion for tree-like structures
**Use Cases:**
- Knowledge graph embeddings (WordNet, Wikidata)
- Organizational hierarchies
- Taxonomy classification
- Document topic hierarchies
## Technical Design
### Architecture Diagram
```
┌─────────────────────────────────────────────────────────────┐
│ HybridEmbedding<T> │
├─────────────────────────────────────────────────────────────┤
│ - euclidean_component: Vec<T> [60-80% of dimensions] │
│ - hyperbolic_component: Vec<T> [20-40% of dimensions] │
│ - blend_ratio: f32 │
│ - curvature: f32 [typically -1.0] │
└─────────────────────────────────────────────────────────────┘
┌───────────────┴───────────────┐
│ │
┌─────────▼──────────┐ ┌─────────▼──────────┐
│ PoincareOps<T> │ │ EuclideanOps<T> │
├────────────────────┤ ├────────────────────┤
│ - mobius_add() │ │ - dot_product() │
│ - exp_map() │ │ - cosine_sim() │
│ - log_map() │ │ - l2_norm() │
│ - distance() │ │ - normalize() │
│ - gyration() │ └────────────────────┘
└────────────────────┘
┌─────────────────────┐
│ HybridHNSW<T> │
├─────────────────────┤
│ - hybrid_distance() │ ← Combines both distances
│ - insert() │
│ - search() │
└─────────────────────┘
```
### Core Data Structures
```rust
/// Hybrid embedding combining Euclidean and Hyperbolic spaces
#[derive(Clone, Debug)]
pub struct HybridEmbedding<T: Float> {
/// Euclidean component (semantic similarity)
pub euclidean: Vec<T>,
/// Hyperbolic component (hierarchy in Poincaré ball)
/// The vector as a whole is constrained to ||x|| < 1 (not each coordinate separately)
pub hyperbolic: Vec<T>,
/// Blend ratio (0.0 = pure Euclidean, 1.0 = pure hyperbolic)
pub blend_ratio: f32,
/// Hyperbolic space curvature (typically -1.0)
pub curvature: f32,
/// Total dimension
pub dimension: usize,
}
/// Poincaré ball operations (Möbius gyrovector algebra)
pub struct PoincareOps<T: Float> {
curvature: T,
epsilon: T, // Numerical stability (1e-8 suits f64; f32 needs roughly 1e-5 or larger)
}
impl<T: Float> PoincareOps<T> {
/// Möbius addition: x ⊕ y
/// (x⊕y) = ((1+2⟨x,y⟩+||y||²)x + (1-||x||²)y) / (1+2⟨x,y⟩+||x||²||y||²)
pub fn mobius_add(&self, x: &[T], y: &[T]) -> Vec<T>;
/// Exponential map: TₓM → M (tangent to manifold)
pub fn exp_map(&self, x: &[T], v: &[T]) -> Vec<T>;
/// Logarithmic map: M → TₓM (manifold to tangent)
pub fn log_map(&self, x: &[T], y: &[T]) -> Vec<T>;
/// Poincaré distance
/// d(x,y) = acosh(1 + 2||x-y||²/((1-||x||²)(1-||y||²)))
pub fn distance(&self, x: &[T], y: &[T]) -> T;
/// Project vector to Poincaré ball (ensure ||x|| < 1)
pub fn project(&self, x: &[T]) -> Vec<T>;
}
/// Hybrid HNSW index supporting both distance metrics
pub struct HybridHNSW<T: Float> {
/// Standard HNSW graph structure
layers: Vec<HNSWLayer>,
/// Hybrid embeddings
embeddings: Vec<HybridEmbedding<T>>,
/// Distance computation strategy
distance_fn: HybridDistanceFunction,
/// HNSW parameters
params: HNSWParams,
}
/// Distance function combining Euclidean and hyperbolic metrics
#[derive(Clone, Debug)] // required because HybridConfig derives Clone
pub enum HybridDistanceFunction {
/// Weighted combination
Weighted { euclidean_weight: f32, hyperbolic_weight: f32 },
/// Adaptive based on query context
Adaptive,
/// Hierarchical first, then Euclidean for tie-breaking
Hierarchical,
}
/// Configuration for hybrid embeddings
#[derive(Clone)]
pub struct HybridConfig {
/// Total embedding dimension
pub total_dim: usize,
/// Fraction allocated to hyperbolic space (0.2-0.4)
pub hyperbolic_ratio: f32,
/// Hyperbolic space curvature
pub curvature: f32,
/// Distance blending strategy
pub distance_strategy: HybridDistanceFunction,
/// Numerical stability epsilon
pub epsilon: f32,
}
```
### Key Algorithms
#### Algorithm 1: Hybrid Distance Computation
```pseudocode
function hybrid_distance(emb1: HybridEmbedding, emb2: HybridEmbedding,
                         strategy: HybridDistanceFunction) -> float:
    // Compute Euclidean component distance
    d_euclidean = cosine_distance(emb1.euclidean, emb2.euclidean)
    // Compute hyperbolic component distance (Poincaré)
    d_hyperbolic = poincare_distance(emb1.hyperbolic, emb2.hyperbolic, emb1.curvature)
    // Normalize distances to [0, 1] range
    d_euclidean_norm = d_euclidean / 2.0          // cosine distance ∈ [0, 2]
    d_hyperbolic_norm = tanh(d_hyperbolic / 2.0)  // hyperbolic distance ∈ [0, ∞)
    // Blend based on strategy
    match strategy:
        Weighted(w_e, w_h):
            return w_e * d_euclidean_norm + w_h * d_hyperbolic_norm
        Adaptive:
            // Use hyperbolic more for hierarchical queries
            hierarchy_score = detect_hierarchy(emb1, emb2)
            w_h = hierarchy_score
            w_e = 1.0 - hierarchy_score
            return w_e * d_euclidean_norm + w_h * d_hyperbolic_norm
        Hierarchical:
            // Use hyperbolic for pruning, Euclidean for ranking
            // (threshold is a tunable cutoff on the normalized hyperbolic distance)
            if d_hyperbolic_norm > threshold:
                return d_hyperbolic_norm
            else:
                return 0.3 * d_hyperbolic_norm + 0.7 * d_euclidean_norm
```
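The blending step of Algorithm 1 can be sketched in Rust as follows. This is a minimal illustration, not the crate's implementation: the `Blend` enum and `blend_distances` function are hypothetical names, the two raw distances are assumed to be computed elsewhere, and the `Adaptive` strategy is left out because `detect_hierarchy` is not specified in this document.

```rust
/// Hypothetical stand-in for two of the strategies in `HybridDistanceFunction`.
pub enum Blend {
    /// Fixed weights for the Euclidean and hyperbolic parts.
    Weighted { w_e: f64, w_h: f64 },
    /// Hyperbolic-only above `threshold`, fixed 0.3/0.7 mix below it.
    Hierarchical { threshold: f64 },
}

/// Normalizes a cosine distance (in [0, 2]) and a Poincaré distance
/// (in [0, inf)) to [0, 1] and blends them as in Algorithm 1.
pub fn blend_distances(d_cosine: f64, d_hyperbolic: f64, strategy: &Blend) -> f64 {
    let d_e = (d_cosine / 2.0).clamp(0.0, 1.0);
    let d_h = (d_hyperbolic / 2.0).tanh();
    match *strategy {
        Blend::Weighted { w_e, w_h } => w_e * d_e + w_h * d_h,
        Blend::Hierarchical { threshold } => {
            if d_h > threshold {
                d_h
            } else {
                0.3 * d_h + 0.7 * d_e
            }
        }
    }
}
```

One property worth noting: for curvature -1, `tanh(d / 2)` applied to the distance from the origin returns the Euclidean norm of the point, so the normalized hyperbolic term stays easy to interpret. The `Hierarchical` branch is discontinuous at the threshold, which may matter for graph construction.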
#### Algorithm 2: Poincaré Distance (Optimized)
```pseudocode
function poincare_distance(x: Vec<T>, y: Vec<T>, curvature: T) -> T:
    // c is the absolute curvature (c = 1 when curvature = -1)
    c = abs(curvature)
    // Compute ||x - y||²
    diff_norm_sq = 0.0
    for i in 0..x.len():
        diff = x[i] - y[i]
        diff_norm_sq += diff * diff
    // Compute ||x||² and ||y||²
    x_norm_sq = dot(x, x)
    y_norm_sq = dot(y, y)
    // Numerical stability: keep c||x||² and c||y||² strictly below 1
    cx = min(c * x_norm_sq, 1.0 - epsilon)
    cy = min(c * y_norm_sq, 1.0 - epsilon)
    // Poincaré distance formula (matches d_c in the appendix)
    numerator = 2.0 * c * diff_norm_sq
    denominator = (1.0 - cx) * (1.0 - cy)
    ratio = numerator / (denominator + epsilon)
    // d = acosh(1 + ratio) / sqrt(c)
    // acosh(z) = log(z + sqrt(z² - 1))
    inner = 1.0 + ratio
    if inner < 1.0 + epsilon:
        return 0.0 // Points are (numerically) identical
    return log(inner + sqrt(inner * inner - 1.0)) / sqrt(c)
```
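A minimal Rust sketch of Algorithm 2, assuming `f64` inputs and taking the absolute curvature `c > 0` directly. The function name mirrors the pseudocode, but the signature and the `EPS` constant are illustrative choices rather than the crate's API. It evaluates `acosh(1 + u)` through `ln_1p`, which keeps precision when the two points are very close.

```rust
/// Poincaré distance for curvature -c (c > 0):
/// d_c(x, y) = (1/sqrt(c)) * acosh(1 + 2c||x-y||^2 / ((1 - c||x||^2)(1 - c||y||^2))).
pub fn poincare_distance(x: &[f64], y: &[f64], c: f64) -> f64 {
    assert_eq!(x.len(), y.len(), "dimension mismatch");
    assert!(c > 0.0, "c is the absolute curvature and must be positive");
    // Illustrative guard value, chosen for f64.
    const EPS: f64 = 1e-12;

    let mut diff_sq = 0.0;
    let mut x_sq = 0.0;
    let mut y_sq = 0.0;
    for (a, b) in x.iter().zip(y) {
        diff_sq += (a - b) * (a - b);
        x_sq += a * a;
        y_sq += b * b;
    }

    // Clamp so both factors of the denominator stay strictly positive,
    // even if an input has drifted onto or past the ball boundary.
    let fx = (1.0 - c * x_sq).max(EPS);
    let fy = (1.0 - c * y_sq).max(EPS);
    let u = 2.0 * c * diff_sq / (fx * fy);

    // acosh(1 + u) = ln(1 + u + sqrt(u * (u + 2))).
    (u + (u * (u + 2.0)).sqrt()).ln_1p() / c.sqrt()
}
```

For `c = 1`, the distance from the origin to a point of norm `r` reduces to `2 * atanh(r)`, so the point `(0.5, 0)` lies at distance `ln 3` from the origin.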
#### Algorithm 3: Möbius Addition (Core Operation)
```pseudocode
function mobius_add(x: Vec<T>, y: Vec<T>, curvature: T) -> Vec<T>:
    // c is the absolute curvature (c = 1 when curvature = -1)
    c = abs(curvature)
    // Compute scalar products
    xy_dot = dot(x, y)
    x_norm_sq = dot(x, x)
    y_norm_sq = dot(y, y)
    // Denominator of the Möbius sum
    denominator = 1.0 + 2.0 * c * xy_dot +
                  c² * x_norm_sq * y_norm_sq
    // Numerator coefficients
    numerator_x_coeff = 1.0 + 2.0 * c * xy_dot +
                        c * y_norm_sq
    numerator_y_coeff = 1.0 - c * x_norm_sq
    // Result
    result = Vec::new()
    for i in 0..x.len():
        value = (numerator_x_coeff * x[i] + numerator_y_coeff * y[i]) /
                (denominator + epsilon)
        result.push(value)
    // Project back to ball (ensure sqrt(c) * ||result|| < 1)
    return project_to_ball(result, c)

function project_to_ball(x: Vec<T>, c: T) -> Vec<T>:
    norm = sqrt(dot(x, x))
    max_norm = (1.0 - epsilon) / sqrt(c)
    if norm > max_norm:
        // Rescale onto the sphere of radius (1 - epsilon) / sqrt(c)
        scale = max_norm / norm
        return x.map(|xi| xi * scale)
    return x
```
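A Rust sketch of Algorithm 3 under the same assumptions (`f64`, absolute curvature `c > 0` passed directly). The names follow the pseudocode; the signatures, including the explicit `eps` parameter, are illustrative. No epsilon is added to the denominator here because, for points strictly inside the ball, Cauchy-Schwarz gives `denominator >= (1 - c||x|| ||y||)^2 > 0`.

```rust
fn dot(a: &[f64], b: &[f64]) -> f64 {
    a.iter().zip(b).map(|(p, q)| p * q).sum()
}

/// Möbius addition x ⊕_c y in the Poincaré ball of curvature -c.
pub fn mobius_add(x: &[f64], y: &[f64], c: f64) -> Vec<f64> {
    assert_eq!(x.len(), y.len(), "dimension mismatch");
    let xy = dot(x, y);
    let x_sq = dot(x, x);
    let y_sq = dot(y, y);
    let coeff_x = 1.0 + 2.0 * c * xy + c * y_sq;
    let coeff_y = 1.0 - c * x_sq;
    let denom = 1.0 + 2.0 * c * xy + c * c * x_sq * y_sq;
    x.iter()
        .zip(y)
        .map(|(xi, yi)| (coeff_x * xi + coeff_y * yi) / denom)
        .collect()
}

/// Rescales x so that its norm is at most (1 - eps) / sqrt(c).
pub fn project_to_ball(x: &[f64], c: f64, eps: f64) -> Vec<f64> {
    let norm = dot(x, x).sqrt();
    let max_norm = (1.0 - eps) / c.sqrt();
    if norm > max_norm {
        let scale = max_norm / norm;
        x.iter().map(|xi| xi * scale).collect()
    } else {
        x.to_vec()
    }
}
```

Useful sanity checks for an implementation are the left identity `0 ⊕ y = y`, the inverse `(-x) ⊕ x = 0`, and left cancellation `(-x) ⊕ (x ⊕ y) = y`. Plain associativity and commutativity do not hold in general, so they are not suitable test properties.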
### API Design
```rust
// Public API for hybrid embeddings
pub mod hybrid {
use super::*;
/// Create hybrid embedding from separate components
pub fn create_hybrid<T: Float>(
euclidean: Vec<T>,
hyperbolic: Vec<T>,
config: HybridConfig,
) -> Result<HybridEmbedding<T>, Error>;
/// Convert standard embedding to hybrid (automatic split)
pub fn euclidean_to_hybrid<T: Float>(
embedding: &[T],
config: HybridConfig,
) -> Result<HybridEmbedding<T>, Error>;
/// Compute distance between hybrid embeddings
pub fn distance<T: Float>(
a: &HybridEmbedding<T>,
b: &HybridEmbedding<T>,
) -> T;
/// Create HNSW index with hybrid embeddings
pub fn build_index<T: Float>(
embeddings: Vec<HybridEmbedding<T>>,
config: HybridConfig,
hnsw_params: HNSWParams,
) -> Result<HybridHNSW<T>, Error>;
}
// Poincaré ball operations (advanced users)
pub mod poincare {
/// Möbius addition in Poincaré ball
pub fn mobius_add<T: Float>(
x: &[T],
y: &[T],
curvature: T,
) -> Vec<T>;
/// Exponential map (tangent to manifold)
pub fn exp_map<T: Float>(
base: &[T],
tangent: &[T],
curvature: T,
) -> Vec<T>;
/// Logarithmic map (manifold to tangent)
pub fn log_map<T: Float>(
base: &[T],
point: &[T],
curvature: T,
) -> Vec<T>;
/// Poincaré distance
pub fn distance<T: Float>(
x: &[T],
y: &[T],
curvature: T,
) -> T;
}
```
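The document does not say how `euclidean_to_hybrid` performs its "automatic split". One possible reading, sketched below as an assumption only, is to take the trailing `hyperbolic_ratio` fraction of the dimensions as the hyperbolic part and rescale it into the unit ball (curvature -1). The function `split_and_project` and its parameters are hypothetical and not part of the proposed API.

```rust
/// Splits `embedding` into (euclidean, hyperbolic) parts. The last
/// round(len * hyperbolic_ratio) coordinates become the hyperbolic part,
/// which is rescaled so that its norm is at most 1 - eps.
/// Assumption: tail split and curvature -1; the source leaves both open.
pub fn split_and_project(
    embedding: &[f32],
    hyperbolic_ratio: f32,
    eps: f32,
) -> (Vec<f32>, Vec<f32>) {
    let ratio = hyperbolic_ratio.clamp(0.0, 1.0);
    let hyp_dim = ((embedding.len() as f32) * ratio).round() as usize;
    let hyp_dim = hyp_dim.min(embedding.len());
    let (euc, hyp) = embedding.split_at(embedding.len() - hyp_dim);

    let norm = hyp.iter().map(|v| v * v).sum::<f32>().sqrt();
    let max_norm = 1.0 - eps;
    let hyp: Vec<f32> = if norm > max_norm {
        let scale = max_norm / norm;
        hyp.iter().map(|v| v * scale).collect()
    } else {
        hyp.to_vec()
    };
    (euc.to_vec(), hyp)
}
```

A plain rescale like this only enforces the ball constraint. It does not make the hyperbolic coordinates encode hierarchy, which would normally require training or a dedicated embedding procedure.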
## Integration Points
### Affected Crates/Modules
1. **ruvector-core** (Major Changes)
- Add `hybrid_embedding.rs` module
- Add a `HybridDistance` implementation of the `Distance` trait
- Update `Embedding` enum to include `Hybrid` variant
2. **ruvector-hnsw** (Moderate Changes)
- Modify distance computation in `hnsw/search.rs`
- Add hybrid-aware layer construction
- Update serialization for hybrid embeddings
3. **ruvector-gnn-node** (Minor Changes)
- Add TypeScript bindings for hybrid embeddings
- Export Poincaré operations to JavaScript
4. **ruvector-quantization** (Future Integration)
- Separate quantization strategies for Euclidean vs. hyperbolic components
- Hyperbolic component needs special handling (preserve ball constraint)
### New Modules to Create
```
crates/ruvector-hyperbolic/
├── src/
│ ├── lib.rs # Public API
│ ├── poincare/
│ │ ├── mod.rs # Poincaré ball model
│ │ ├── ops.rs # Möbius operations
│ │ ├── distance.rs # Distance computation
│ │ └── projection.rs # Ball projection
│ ├── hybrid/
│ │ ├── mod.rs # Hybrid embeddings
│ │ ├── embedding.rs # HybridEmbedding struct
│ │ ├── distance.rs # Hybrid distance
│ │ └── conversion.rs # Euclidean ↔ Hybrid
│ ├── hnsw/
│ │ ├── mod.rs # Hybrid HNSW
│ │ └── index.rs # HybridHNSW implementation
│ └── math/
│ ├── gyrovector.rs # Gyrovector algebra
│ └── numerics.rs # Numerical stability
├── tests/
│ ├── poincare_tests.rs # Poincaré operations
│ ├── hierarchy_tests.rs # Hierarchy preservation
│ └── integration_tests.rs # End-to-end
├── benches/
│ ├── distance_bench.rs # Distance computation
│ └── hnsw_bench.rs # HNSW performance
└── Cargo.toml
```
### Dependencies on Other Features
- **Independent**: Can be implemented standalone
- **Synergies**:
- **Adaptive Precision** (Feature 5): Hyperbolic components may benefit from higher precision near ball boundary
- **Temporal GNN** (Feature 6): Time-evolving hierarchies (e.g., organizational changes)
- **Attention Mechanisms** (Existing): Attention weights could adapt based on hierarchy depth
## Regression Prevention
### What Existing Functionality Could Break
1. **HNSW Search Performance**
- Risk: Hybrid distance computation is more expensive
- Impact: 10-20% search latency increase
2. **Serialization Format**
- Risk: Existing indexes won't deserialize
- Impact: Breaking change for stored indexes
3. **Memory Layout**
- Risk: Hybrid embeddings require metadata (blend ratio, curvature)
- Impact: 5-10% memory overhead
4. **Distance Metric Assumptions**
- Risk: Some code may assume metric properties such as the triangle inequality, which a blend involving cosine distance does not guarantee
- Impact: Graph construction may be affected
### Test Cases to Prevent Regressions
```rust
#[cfg(test)]
mod regression_tests {
use super::*;
#[test]
fn test_pure_euclidean_mode_matches_original() {
// Hybrid with hyperbolic_ratio=0.0 should match Euclidean exactly
let config = HybridConfig {
hyperbolic_ratio: 0.0, // No hyperbolic component
..Default::default()
};
let euclidean_dist = cosine_distance(&emb1, &emb2);
let hybrid_dist = hybrid_distance(&hybrid_emb1, &hybrid_emb2);
assert!((euclidean_dist - hybrid_dist).abs() < 1e-6);
}
#[test]
fn test_hnsw_recall_not_degraded() {
// HNSW recall should remain >= 95% with hybrid embeddings
let recall = benchmark_hnsw_recall(&hybrid_index, &queries);
assert!(recall >= 0.95);
}
#[test]
fn test_backward_compatibility_serialization() {
// Old indexes should still deserialize
let legacy_index = deserialize_legacy_index("test.hnsw");
assert!(legacy_index.is_ok());
}
#[test]
fn test_numerical_stability_edge_cases() {
// Test with points near ball boundary (||x|| ≈ 1)
let near_boundary = vec![0.999, 0.0, 0.0];
let result = mobius_add(&near_boundary, &near_boundary, -1.0); // curvature argument per the poincare API
// Should not produce NaN or overflow
assert!(result.iter().all(|x| x.is_finite()));
assert!(l2_norm(&result) < 1.0); // Still in ball
}
}
```
### Backward Compatibility Strategy
1. **Versioned Serialization**
```rust
enum EmbeddingFormat {
V1Euclidean, // Legacy format
V2Hybrid, // New format
}
```
2. **Feature Flag**
```toml
[features]
default = ["euclidean"]
hyperbolic = ["dep:special-functions"]
```
3. **Migration Path**
```rust
// Automatic conversion utility
pub fn migrate_index_to_hybrid<T: Float>(
old_index: &Path,
config: HybridConfig,
) -> Result<HybridHNSW<T>, Error> {
// 1. Read old Euclidean index
// 2. Convert embeddings to hybrid
// 3. Rebuild graph structure
todo!("migration steps are not specified yet")
}
```
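Versioned reading could be sketched as below. This assumes that files in the new format begin with a magic prefix and that anything else is treated as a legacy Euclidean index. The prefix `RVH2` and the function `detect_format` are made up for illustration, since the document does not define an on-disk layout.

```rust
/// Hypothetical magic prefix for the hybrid format (not defined by the design).
const V2_MAGIC: &[u8] = b"RVH2";

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum EmbeddingFormat {
    V1Euclidean, // Legacy format
    V2Hybrid,    // New format
}

/// Decides which reader to use from the first bytes of a serialized index.
/// Anything without the V2 prefix falls back to the legacy reader.
pub fn detect_format(bytes: &[u8]) -> EmbeddingFormat {
    if bytes.starts_with(V2_MAGIC) {
        EmbeddingFormat::V2Hybrid
    } else {
        EmbeddingFormat::V1Euclidean
    }
}
```

The fallback keeps old indexes loadable without rewriting them, at the cost of misclassifying a legacy file that happens to start with the same four bytes. A real format would need to rule that out, for example by checking the legacy header as well.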
## Implementation Phases
### Phase 1: Core Implementation (Weeks 1-2)
**Goal**: Implement Poincaré ball operations and hybrid embeddings
**Tasks**:
1. Create `ruvector-hyperbolic` crate
2. Implement `PoincareOps`:
- Möbius addition
- Exponential/logarithmic maps
- Distance computation
- Projection to ball
3. Implement `HybridEmbedding` struct
4. Write comprehensive unit tests
5. Add numerical stability tests
**Deliverables**:
- Working Poincaré operations (100% test coverage)
- Hybrid embedding data structure
- Benchmark suite for distance computation
**Success Criteria**:
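- All Poincaré operations pass property tests (left identity, left cancellation, gyroassociativity; note that Möbius addition is neither commutative nor associative in general)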
- Numerical stability for edge cases (||x|| → 1)
- Distance computation < 2µs per pair (f32)
### Phase 2: Integration (Weeks 3-4)
**Goal**: Integrate hybrid embeddings with HNSW
**Tasks**:
1. Extend `Distance` trait with `HybridDistance`
2. Implement `HybridHNSW` index
3. Add serialization/deserialization
4. Create migration utilities for legacy indexes
5. Add TypeScript/JavaScript bindings
**Deliverables**:
- Functioning `HybridHNSW` index
- Backward-compatible serialization
- Node.js bindings with examples
**Success Criteria**:
- HNSW search works with hybrid embeddings
- Recall >= 95% (compared to brute force)
- Legacy indexes still load correctly
### Phase 3: Optimization (Weeks 5-6)
**Goal**: Optimize performance and memory usage
**Tasks**:
1. SIMD optimization for Poincaré distance
2. Cache-friendly memory layout
3. Parallel distance computation
4. Benchmark against pure Euclidean baseline
5. Profile and optimize hotspots
**Deliverables**:
- SIMD-accelerated distance computation
- Performance benchmarks
- Memory profiling report
**Success Criteria**:
- Distance computation within 1.5x of Euclidean baseline
- Memory overhead < 10%
- Parallel search scales linearly to 8 threads
### Phase 4: Production Hardening (Weeks 7-8)
**Goal**: Production-ready with documentation and examples
**Tasks**:
1. Write comprehensive documentation
2. Create example applications:
- Knowledge graph embeddings
- Hierarchical taxonomy search
3. Add monitoring/observability
4. Performance tuning for specific use cases
5. Create migration guide
**Deliverables**:
- API documentation
- 3+ example applications
- Migration guide from Euclidean
- Production deployment checklist
**Success Criteria**:
- Documentation completeness score > 90%
- Examples run successfully
- Zero P0/P1 bugs in testing
## Success Metrics
### Performance Benchmarks
**Latency Targets**:
- Poincaré distance computation: < 2.0µs (f32), < 1.0µs (SIMD)
- Hybrid distance computation: < 2.5µs (f32)
- HNSW search (100k vectors): < 500µs (p95)
- Index construction: < 10 minutes (1M vectors)
**Comparison Baseline** (Pure Euclidean):
- Distance computation slowdown: < 1.5x
- Search latency slowdown: < 1.3x
- Index size increase: < 10%
**Throughput Targets**:
- Distance computation: > 400k pairs/sec (single thread)
- HNSW search: > 2000 QPS (8 threads)
### Accuracy Metrics
**Hierarchy Preservation**:
- Tree reconstruction accuracy: > 90%
- Parent-child relationship recall: > 85%
- Hierarchy depth correlation: > 0.90
**HNSW Recall**:
- Top-10 recall @ ef=50: >= 95%
- Top-100 recall @ ef=200: >= 98%
**Distance Distortion**:
- Average distortion (vs. ground truth): < 0.15
- Max distortion (99th percentile): < 0.30
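
The distortion targets above do not come with a formula. A common choice, assumed here, is the mean relative error between embedded distances and ground-truth (for example, tree or graph) distances over all pairs. The function name `average_distortion` is hypothetical.

```rust
/// Mean of |d_emb - d_true| / d_true over pairs with d_true > 0.
/// Returns None when the slices differ in length or no pair is usable.
/// Assumption: this relative-error definition; the design does not fix one.
pub fn average_distortion(embedded: &[f64], ground_truth: &[f64]) -> Option<f64> {
    if embedded.len() != ground_truth.len() {
        return None;
    }
    let mut total = 0.0;
    let mut count = 0usize;
    for (&d_emb, &d_true) in embedded.iter().zip(ground_truth) {
        if d_true > 0.0 {
            total += (d_emb - d_true).abs() / d_true;
            count += 1;
        }
    }
    if count == 0 {
        None
    } else {
        Some(total / count as f64)
    }
}
```

Under this definition, embedded distances of 1.1 and 1.8 against true distances of 1.0 and 2.0 give an average distortion of 0.1, which would meet the 0.15 target.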
### Memory/Latency Targets
**Memory Reduction** (vs. pure Euclidean with same hierarchy quality):
- Total embedding size: 30-50% reduction
- HNSW index size: 25-40% reduction
- Runtime memory: < 5% overhead for metadata
**Latency Breakdown**:
- Euclidean component: 40-50% of time
- Hyperbolic component: 40-50% of time
- Blending/normalization: < 10% of time
**Scalability**:
- Linear scaling to 10M vectors
- Sub-linear scaling to 100M vectors (with sharding)
## Risks and Mitigations
### Technical Risks
**Risk 1: Numerical Instability near Ball Boundary**
- **Severity**: High
- **Impact**: NaN/Inf values, incorrect distances
- **Probability**: Medium
- **Mitigation**:
- Use epsilon-buffered projection (||x|| < 1 - ε)
- Employ numerically stable formulas (e.g., evaluate acosh(1 + u) as log1p(u + sqrt(u(u + 2))) when u is small)
- Add extensive edge case tests
- Use higher precision (f64) for critical operations
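
The precision point can be made concrete with a small check. In `f32`, the spacing of representable values just below 1.0 is about 6e-8, so an epsilon as small as 1e-8 disappears in `1.0 - epsilon` and the clamp `min(norm_sq, 1.0 - epsilon)` no longer keeps points off the boundary. The helper names below are illustrative only.

```rust
/// Upper bound used when clamping a squared norm, computed in f32.
pub fn clamp_bound_f32(eps: f32) -> f32 {
    1.0_f32 - eps
}

/// The same bound computed in f64.
pub fn clamp_bound_f64(eps: f64) -> f64 {
    1.0_f64 - eps
}

/// True when the epsilon actually moves the f32 bound below 1.0.
pub fn epsilon_is_effective_f32(eps: f32) -> bool {
    clamp_bound_f32(eps) < 1.0
}
```

With these helpers, `epsilon_is_effective_f32(1e-8)` is false while `epsilon_is_effective_f32(1e-5)` is true, and the `f64` bound for 1e-8 is strictly below 1.0. This suggests choosing the epsilon per float type rather than using a single constant.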
**Risk 2: Performance Degradation**
- **Severity**: Medium
- **Impact**: Slower search, higher latency
- **Probability**: High
- **Mitigation**:
- SIMD optimization for distance computation
- Precompute and cache norm squares
- Profile-guided optimization
- Provide performance tuning guide
**Risk 3: Complex API Confusion**
- **Severity**: Medium
- **Impact**: User adoption issues, misconfiguration
- **Probability**: Medium
- **Mitigation**:
- Provide sensible defaults (blend_ratio=0.3, curvature=-1.0)
- Create configuration presets (taxonomy, knowledge-graph, etc.)
- Write comprehensive examples
- Add validation with helpful error messages
**Risk 4: Serialization Compatibility**
- **Severity**: High
- **Impact**: Breaking changes, migration pain
- **Probability**: High
- **Mitigation**:
- Version serialization format
- Provide automatic migration tool
- Support reading legacy formats
- Comprehensive migration guide
**Risk 5: Integration with Quantization**
- **Severity**: Medium
- **Impact**: Quantization may break ball constraints
- **Probability**: High
- **Mitigation**:
- Defer quantization for hyperbolic component
- Research hyperbolic-aware quantization schemes
- Document incompatibilities clearly
- Provide fallback to f32 for hyperbolic
**Risk 6: Limited Use Case Applicability**
- **Severity**: Low
- **Impact**: Feature underutilized if data isn't hierarchical
- **Probability**: Medium
- **Mitigation**:
- Provide hierarchy detection tool
- Make hyperbolic component optional (blend_ratio=0)
- Document ideal use cases clearly
- Add auto-configuration based on data analysis
### Mitigation Summary Table
| Risk | Mitigation Strategy | Owner | Timeline |
|------|-------------------|-------|----------|
| Numerical instability | Epsilon buffering + stable formulas | Core team | Phase 1 |
| Performance degradation | SIMD + profiling + caching | Optimization team | Phase 3 |
| API complexity | Defaults + examples + validation | API team | Phase 4 |
| Serialization breaks | Versioning + migration tool | Integration team | Phase 2 |
| Quantization conflict | Defer integration + research | Research team | Post-v1 |
| Limited applicability | Detection tool + documentation | Product team | Phase 4 |
---
## References
1. **Nickel & Kiela (2017)**: "Poincaré Embeddings for Learning Hierarchical Representations"
2. **Sala et al. (2018)**: "Representation Tradeoffs for Hyperbolic Embeddings"
3. **Chami et al. (2019)**: "Hyperbolic Graph Convolutional Neural Networks"
4. **Ganea et al. (2018)**: "Hyperbolic Neural Networks"
## Appendix: Mathematical Foundations
### Poincaré Ball Model
The Poincaré ball model represents hyperbolic space as:
```
B^n = {x ∈ ℝ^n : ||x|| < 1}
```
with metric tensor:
```
g_x = (2 / (1 - ||x||²))² δ_ij
```
### Möbius Addition Formula
```
x ⊕_c y = ((1 + 2c⟨x,y⟩ + c||y||²)x + (1 - c||x||²)y) / (1 + 2c⟨x,y⟩ + c²||x||²||y||²)
```
where c is the absolute curvature (typically c = 1, curvature = -1).
### Distance Formula
```
d_c(x, y) = (1/√c) acosh(1 + 2c ||x - y||² / ((1 - c||x||²)(1 - c||y||²)))
```
### Exponential Map (Tangent to Manifold)
```
exp_x^c(v) = x ⊕_c (tanh(√c λ_x ||v|| / 2) / (√c ||v||)) v
```
### Logarithmic Map (Manifold to Tangent)
```
log_x^c(y) = (2 / (√c λ_x)) atanh(√c ||(-x) ⊕_c y||) · ((-x) ⊕_c y) / ||(-x) ⊕_c y||
```
where `λ_x = 2 / (1 - c||x||²)` is the conformal factor, matching the metric tensor `g_x = λ_x² δ_ij` given above for c = 1.
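
At the origin the two maps simplify, because `0 ⊕_c u = u` and the conformal factor there equals 2. The maps become `exp_0(v) = tanh(√c ||v||) v / (√c ||v||)` and `log_0(y) = atanh(√c ||y||) y / (√c ||y||)`. The Rust sketch below implements only this special case. The function names are illustrative, and the general maps would additionally need Möbius addition.

```rust
fn norm(v: &[f64]) -> f64 {
    v.iter().map(|a| a * a).sum::<f64>().sqrt()
}

/// Exponential map at the origin of the Poincaré ball with curvature -c.
pub fn exp_map_origin(v: &[f64], c: f64) -> Vec<f64> {
    let n = norm(v);
    if n == 0.0 {
        return v.to_vec(); // exp_0(0) = 0
    }
    let s = c.sqrt() * n;
    let scale = s.tanh() / s;
    v.iter().map(|a| a * scale).collect()
}

/// Logarithmic map at the origin; the inverse of `exp_map_origin` on the open ball.
pub fn log_map_origin(y: &[f64], c: f64) -> Vec<f64> {
    let n = norm(y);
    if n == 0.0 {
        return y.to_vec(); // log_0(0) = 0
    }
    let s = c.sqrt() * n;
    // atanh diverges as s approaches 1, so callers should keep y strictly inside the ball.
    let scale = s.atanh() / s;
    y.iter().map(|a| a * scale).collect()
}
```

A round trip `log_map_origin(&exp_map_origin(&v, c), c)` should reproduce `v` up to rounding for moderate `||v||`. For very large tangent vectors `tanh` saturates to 1.0 in floating point, and the round trip is no longer recoverable.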