Squashed 'vendor/ruvector/' content from commit b64c2172
git-subtree-dir: vendor/ruvector git-subtree-split: b64c21726f2bb37286d9ee36a7869fef60cc6900
# Phase 6: Advanced Techniques - Implementation Guide

## Overview

Phase 6 implements cutting-edge features for next-generation vector search:

- **Hypergraphs**: N-ary relationships beyond pairwise similarity
- **Learned Indexes**: Neural network-based index structures (RMI)
- **Neural Hash Functions**: Similarity-preserving binary projections
- **Topological Data Analysis**: Embedding quality assessment

## Features Implemented

### 1. Hypergraph Support

**Location**: `/crates/ruvector-core/src/advanced/hypergraph.rs`

#### Core Components:

```rust
// Hyperedge connecting multiple vectors
pub struct Hyperedge {
    pub id: String,
    pub nodes: Vec<VectorId>,
    pub description: String,
    pub embedding: Vec<f32>,
    pub confidence: f32,
}

// Temporal hyperedge with time attributes
pub struct TemporalHyperedge {
    pub hyperedge: Hyperedge,
    pub timestamp: u64,
    pub granularity: TemporalGranularity,
}

// Hypergraph index with bipartite storage
pub struct HypergraphIndex {
    entities: HashMap<VectorId, Vec<f32>>,
    hyperedges: HashMap<String, Hyperedge>,
    temporal_index: HashMap<u64, Vec<String>>,
}
```

#### Key Features:
- ✅ N-ary relationships (3+ entities)
- ✅ Bipartite graph transformation for efficient storage
- ✅ Temporal indexing with multiple granularities
- ✅ K-hop neighbor traversal
- ✅ Semantic search over hyperedges

#### Use Cases:
- **Multi-document relationships**: Papers co-cited in reviews
- **Temporal patterns**: User interaction sequences
- **Complex knowledge graphs**: Multi-entity relationships
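As a concrete sketch of the bipartite traversal above, a 1-hop expansion returns every entity that shares a hyperedge with the query entity. The maps and `u64` ids here are illustrative stand-ins for the `HypergraphIndex` internals, not its actual API:

```rust
use std::collections::{HashMap, HashSet};

// Illustrative 1-hop expansion over the bipartite view: an entity's
// neighbours are all entities that co-occur with it in some hyperedge.
fn one_hop(
    entity: u64,
    edges_of: &HashMap<u64, Vec<String>>,   // entity id -> hyperedge ids
    members_of: &HashMap<String, Vec<u64>>, // hyperedge id -> member entities
) -> HashSet<u64> {
    let mut neighbours = HashSet::new();
    if let Some(edge_ids) = edges_of.get(&entity) {
        for edge_id in edge_ids {
            if let Some(members) = members_of.get(edge_id) {
                for &m in members {
                    if m != entity {
                        neighbours.insert(m);
                    }
                }
            }
        }
    }
    neighbours
}
```

Iterating this expansion k times yields the k-hop neighbourhood, which is why its cost grows exponentially in k.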
### 2. Causal Hypergraph Memory

**Location**: `/crates/ruvector-core/src/advanced/hypergraph.rs`

#### Core Component:

```rust
pub struct CausalMemory {
    index: HypergraphIndex,
    causal_counts: HashMap<(VectorId, VectorId), u32>,
    latencies: HashMap<VectorId, f32>,
    // Utility weights: α=0.7, β=0.2, γ=0.1
}
```

#### Utility Function:
```
U = α·semantic_similarity + β·causal_uplift - γ·latency
```

Where:
- **α = 0.7**: Weight for semantic similarity
- **β = 0.2**: Weight for causal strength (success count)
- **γ = 0.1**: Penalty for action latency
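Spelled out in code, the utility score is a plain weighted sum. This is a minimal sketch, not the `CausalMemory` API; the function name and the assumption that all three inputs are normalized to [0, 1] are illustrative:

```rust
// Utility U = α·semantic_similarity + β·causal_uplift - γ·latency,
// with the default weights α=0.7, β=0.2, γ=0.1 described above.
// All inputs are assumed normalized to [0, 1].
fn utility(semantic_similarity: f32, causal_uplift: f32, latency: f32) -> f32 {
    const ALPHA: f32 = 0.7; // semantic similarity weight
    const BETA: f32 = 0.2;  // causal uplift weight
    const GAMMA: f32 = 0.1; // latency penalty
    ALPHA * semantic_similarity + BETA * causal_uplift - GAMMA * latency
}
```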
#### Key Features:
- ✅ Cause-effect relationship tracking
- ✅ Multi-entity causal inference
- ✅ Confidence weights
- ✅ Latency-aware queries

#### Use Cases:
- **Agent reasoning**: Learn which actions lead to success
- **Skill consolidation**: Identify successful patterns
- **Reflexion memory**: Store self-critique with causal links

### 3. Learned Index Structures

**Location**: `/crates/ruvector-core/src/advanced/learned_index.rs`

#### Recursive Model Index (RMI):

```rust
pub struct RecursiveModelIndex {
    root_model: LinearModel,       // Coarse prediction
    leaf_models: Vec<LinearModel>, // Fine prediction
    data: Vec<(Vec<f32>, VectorId)>,
    max_error: usize,              // Bounded error for binary search
}
```

#### Implementation:
- Root model predicts which leaf model to use
- Leaf models predict positions
- Bounded-error correction with binary search
- Linear models for simplicity (production would use neural networks)
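The two-stage lookup can be sketched as follows for sorted 1-D keys, using linear models and a bounded-error search window as described above. All names here are illustrative, not the crate's actual API:

```rust
// Minimal two-stage RMI sketch over sorted 1-D keys.
struct Linear {
    slope: f64,
    intercept: f64,
}

impl Linear {
    fn predict(&self, key: f64) -> f64 {
        self.slope * key + self.intercept
    }
}

struct Rmi {
    root: Linear,        // coarse: picks a leaf model
    leaves: Vec<Linear>, // fine: predict positions (assumed non-empty)
    keys: Vec<f64>,      // sorted keys
    max_error: usize,    // bound on leaf prediction error
}

impl Rmi {
    fn lookup(&self, key: f64) -> Option<usize> {
        // Stage 1: root model selects a leaf model.
        let leaf = (self.root.predict(key).max(0.0) as usize).min(self.leaves.len() - 1);
        // Stage 2: leaf model predicts an approximate position.
        let guess = self.leaves[leaf].predict(key).max(0.0) as usize;
        // Bounded-error correction: binary search a small window.
        let lo = guess.saturating_sub(self.max_error);
        let hi = (guess + self.max_error + 1).min(self.keys.len());
        self.keys[lo..hi]
            .binary_search_by(|k| k.partial_cmp(&key).unwrap())
            .ok()
            .map(|i| lo + i)
    }
}
```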
#### Performance Targets:
- 1.5-3x lookup speedup on sorted data
- 10-100x space reduction vs traditional B-trees
- Best for read-heavy workloads

#### Hybrid Index:

```rust
pub struct HybridIndex {
    learned: RecursiveModelIndex, // Static segment
    dynamic_buffer: HashMap<...>, // Dynamic updates
    rebuild_threshold: usize,
}
```

- Learned index for static data
- Dynamic buffer for updates
- Periodic rebuilds

### 4. Neural Hash Functions

**Location**: `/crates/ruvector-core/src/advanced/neural_hash.rs`

#### Deep Hash Embedding:

```rust
pub struct DeepHashEmbedding {
    projections: Vec<Array2<f32>>, // Multi-layer projections
    biases: Vec<Array1<f32>>,
    output_bits: usize,
}
```

#### Training:
- Contrastive loss on positive/negative pairs
- Similar vectors → small Hamming distance
- Dissimilar vectors → large Hamming distance

#### Compression Ratios:
- **128D → 32 bits**: 128x compression
- **384D → 64 bits**: 192x compression
- **90-95% recall** with proper training

#### Simple LSH Baseline:

```rust
pub struct SimpleLSH {
    projections: Array2<f32>, // Random Gaussian projections
    num_bits: usize,
}
```

- Random projection baseline
- No training required
- 80-85% recall
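The baseline is easy to reproduce without any library: take the sign of each random projection as one bit, then compare codes by Hamming distance. This is a minimal sketch with hypothetical function names; a real implementation would sample Gaussian projections and pack the bits:

```rust
// One bit per projection: the sign of the dot product with the input.
fn hash_bits(v: &[f32], projections: &[Vec<f32>]) -> Vec<bool> {
    projections
        .iter()
        .map(|p| p.iter().zip(v).map(|(a, b)| a * b).sum::<f32>() >= 0.0)
        .collect()
}

// Hamming distance between two bit codes of equal length.
fn hamming(a: &[bool], b: &[bool]) -> usize {
    a.iter().zip(b).filter(|(x, y)| x != y).count()
}
```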
#### Hash Index:

```rust
pub struct HashIndex<H: NeuralHash> {
    hasher: H,
    tables: HashMap<Vec<u8>, Vec<VectorId>>,
    vectors: HashMap<VectorId, Vec<f32>>,
}
```

- Fast approximate nearest neighbor search
- Hamming distance filtering
- Re-ranking with full precision
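The last step, reranking the Hamming-filtered candidates with full-precision distances, can be sketched as follows (illustrative names; squared Euclidean distance assumed as the exact metric):

```rust
// Rerank candidate (id, vector) pairs by exact squared Euclidean
// distance to the query and keep the k best ids.
fn rerank(query: &[f32], candidates: &[(u64, Vec<f32>)], k: usize) -> Vec<u64> {
    let mut scored: Vec<(f32, u64)> = candidates
        .iter()
        .map(|(id, v)| {
            let d: f32 = query.iter().zip(v).map(|(a, b)| (a - b) * (a - b)).sum();
            (d, *id)
        })
        .collect();
    scored.sort_by(|a, b| a.0.partial_cmp(&b.0).unwrap());
    scored.into_iter().take(k).map(|(_, id)| id).collect()
}
```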
### 5. Topological Data Analysis

**Location**: `/crates/ruvector-core/src/advanced/tda.rs`

#### Topological Analyzer:

```rust
pub struct TopologicalAnalyzer {
    k_neighbors: usize,
    epsilon: f32,
}
```

#### Metrics Computed:

```rust
pub struct EmbeddingQuality {
    pub dimensions: usize,
    pub num_vectors: usize,
    pub connected_components: usize,
    pub clustering_coefficient: f32,
    pub mode_collapse_score: f32, // 0=collapsed, 1=good
    pub degeneracy_score: f32,    // 0=full rank, 1=degenerate
    pub quality_score: f32,       // Overall: 0-1
}
```

#### Detection Capabilities:
- **Mode collapse**: Vectors clustering too closely
- **Degeneracy**: Embeddings confined to a lower-dimensional manifold
- **Connectivity**: Graph structure analysis
- **Persistence**: Topological features across scales
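A crude version of the mode-collapse check above is just the mean pairwise distance of the embeddings: if it is tiny, the vectors have collapsed toward a point. The threshold here is a hypothetical knob for illustration, not how `mode_collapse_score` is actually computed:

```rust
// Mean pairwise Euclidean distance across all embeddings (O(n²)).
fn mean_pairwise_distance(vecs: &[Vec<f32>]) -> f32 {
    let n = vecs.len();
    if n < 2 {
        return 0.0;
    }
    let mut total = 0.0;
    let mut pairs = 0u32;
    for i in 0..n {
        for j in (i + 1)..n {
            let d: f32 = vecs[i]
                .iter()
                .zip(&vecs[j])
                .map(|(a, b)| (a - b) * (a - b))
                .sum::<f32>()
                .sqrt();
            total += d;
            pairs += 1;
        }
    }
    total / pairs as f32
}

// Heuristic: flag collapse when vectors sit closer than a chosen threshold.
fn looks_collapsed(vecs: &[Vec<f32>], threshold: f32) -> bool {
    mean_pairwise_distance(vecs) < threshold
}
```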
#### Use Cases:
- **Embedding quality assessment**: Detect training issues
- **Model validation**: Ensure diverse representations
- **Topological regularization**: Guide training

## Usage Examples

### Basic Hypergraph:

```rust
use ruvector_core::advanced::{HypergraphIndex, Hyperedge};
use ruvector_core::types::DistanceMetric;

let mut index = HypergraphIndex::new(DistanceMetric::Cosine);

// Add entities
index.add_entity(1, vec![1.0, 0.0, 0.0]);
index.add_entity(2, vec![0.0, 1.0, 0.0]);
index.add_entity(3, vec![0.0, 0.0, 1.0]);

// Add a hyperedge connecting all 3 entities
let edge = Hyperedge::new(
    vec![1, 2, 3],
    "Triple relationship".to_string(),
    vec![0.5, 0.5, 0.5],
    0.9,
);
index.add_hyperedge(edge)?;

// Search for similar relationships
let results = index.search_hyperedges(&[0.6, 0.3, 0.1], 5);
```
### Causal Memory:

```rust
use ruvector_core::advanced::CausalMemory;

let mut memory = CausalMemory::new(DistanceMetric::Cosine)
    .with_weights(0.7, 0.2, 0.1);

// Record a causal relationship
memory.add_causal_edge(
    1,       // cause action
    2,       // effect
    vec![3], // context
    "Action leads to success".to_string(),
    vec![0.5, 0.5, 0.0],
    100.0,   // latency in ms
)?;

// Query with the utility function
let results = memory.query_with_utility(&[0.6, 0.4, 0.0], 1, 5);
```
### Learned Index:

```rust
use ruvector_core::advanced::{RecursiveModelIndex, LearnedIndex};

let mut rmi = RecursiveModelIndex::new(2, 4);

// Build from sorted data
let data: Vec<(Vec<f32>, u64)> = /* ... */;
rmi.build(data)?;

// Fast lookup
let pos = rmi.predict(&[0.5, 0.25])?;
let result = rmi.search(&[0.5, 0.25])?;
```

### Neural Hashing:

```rust
use ruvector_core::advanced::{SimpleLSH, HashIndex};

let lsh = SimpleLSH::new(128, 32); // 128D -> 32 bits
let mut index = HashIndex::new(lsh, 32);

// Insert vectors
for (id, vec) in vectors {
    index.insert(id, vec);
}

// Fast search
let results = index.search(&query, 10, 8); // k=10, max_hamming=8
```
### Topological Analysis:

```rust
use ruvector_core::advanced::TopologicalAnalyzer;

let analyzer = TopologicalAnalyzer::new(5, 10.0);
let quality = analyzer.analyze(&embeddings)?;

println!("Quality: {}", quality.quality_score);
println!("Assessment: {}", quality.assessment());

if quality.has_mode_collapse() {
    eprintln!("Warning: Mode collapse detected!");
}
```

## Testing

All features include comprehensive tests:

**Location**: `/tests/advanced_tests.rs`

Run tests:
```bash
cargo test --test advanced_tests
```

Run examples:
```bash
cargo run --example advanced_features
```
## Performance Characteristics

### Hypergraphs:
- **Insert**: O(|E|) where |E| is the hyperedge size
- **Search**: O(k log n) for k results
- **K-hop**: O(exp(k)·N) - use sampling for large k

### Learned Indexes:
- **Build**: O(n log n) sorting + O(n) training
- **Lookup**: O(1) prediction + O(log error) correction
- **Speedup**: 1.5-3x on read-heavy workloads

### Neural Hashing:
- **Encoding**: O(d) forward pass
- **Search**: O(|B|·k) where |B| is the bucket size
- **Compression**: 32-128x with 90-95% recall

### TDA:
- **Analysis**: O(n²) for the distance matrix
- **Graph building**: O(n·k) for k-NN
- **Best use**: Offline quality assessment
## Integration with Existing Features

### With HNSW:
- Use neural hashing for filtering
- Hypergraphs for relationship queries
- TDA for index quality monitoring

### With AgenticDB:
- Causal memory for agent reasoning
- Skill consolidation via hypergraphs
- Reflexion episodes with causal links

### With Quantization:
- Combined with learned hash functions
- Three-tier search: binary → scalar → full precision

## Future Enhancements

### Short Term (Weeks):
- [ ] Proper neural network training (PyTorch/tch-rs)
- [ ] GPU-accelerated hash functions
- [ ] Persistent homology (full TDA)

### Medium Term (Months):
- [ ] Dynamic RMI updates
- [ ] Multi-level hypergraph indexing
- [ ] Causal inference algorithms

### Long Term (Year+):
- [ ] Neuromorphic hardware integration
- [ ] Quantum-inspired algorithms
- [ ] Advanced topology optimization

## References

1. **HyperGraphRAG** (NeurIPS 2025): Multi-entity relationships
2. **Learned Indexes** (SIGMOD 2018): RMI architecture
3. **Deep Hashing** (CVPR): Similarity-preserving codes
4. **Topological Data Analysis**: Persistent homology

## Notes

- All features are **opt-in** - no overhead if unused
- **Experimental status**: API may change
- **Production readiness**: Hypergraphs and TDA are ready; learned indexes remain experimental
- **Performance tuning**: Profile before production deployment

---

**Status**: ✅ Phase 6 Complete
**Next**: Integration testing and production deployment