Merge commit 'd803bfe2b1fe7f5e219e50ac20d6801a0a58ac75' as 'vendor/ruvector'

2026-02-28 14:39:40 -05:00
parent 7885bf6278 d803bfe2b1
commit cd5943df23
7854 changed files with 3522914 additions and 0 deletions
--- a/vendor/ruvector/docs/research/gnn-v2/10-gravitational-embedding-fields.md
+++ b/vendor/ruvector/docs/research/gnn-v2/10-gravitational-embedding-fields.md
@@ -0,0 +1,572 @@
+# Gravitational Embedding Fields (GEF)
+
+## Overview
+
+### Problem Statement
+Current vector search treats all embeddings equally, ignoring how important a node is or how often it is accessed. High-value documents (frequently queried, authoritative sources) should have stronger influence on search trajectories, similar to how massive objects exert stronger gravitational pull in physics.
+
+### Proposed Solution
+Implement a physics-inspired attention mechanism where embeddings exert "gravitational pull" proportional to their query frequency and importance. Search follows gradient descent through a potential field, naturally routing toward high-value nodes before exploring local neighborhoods.
+
+### Expected Benefits
+- **30-50% reduction in search hops**: High-frequency nodes act as routing landmarks
+- **15-25% improved relevance**: Important documents discovered earlier in search
+- **Adaptive importance**: Automatically learns document authority from usage patterns
+- **Natural load balancing**: Popular nodes become graph hubs, improving overall connectivity
+
+### Novelty Claim
+To our knowledge, this is the first application of gravitational field dynamics to vector search. Unlike PageRank (global static scores) or attention mechanisms (pairwise interactions), GEF creates a continuous potential field that guides search trajectories dynamically based on real-time usage patterns.
+
+## Technical Design
+
+### Architecture Diagram
+```
+┌─────────────────────────────────────────────────────────────┐
+│                   Gravitational Field Layer                  │
+│                                                              │
+│  ┌──────────┐      ┌──────────┐      ┌──────────┐         │
+│  │ Query    │      │ Potential│      │ Gradient │         │
+│  │ Vector   │─────▶│ Field    │─────▶│ Descent  │─────▶   │
+│  │ (q)      │      │ Φ(x)     │      │ ∇Φ(x)    │  Path   │
+│  └──────────┘      └──────────┘      └──────────┘         │
+│       │                  │                  │              │
+│       │                  ▼                  │              │
+│       │         ┌──────────────────┐        │              │
+│       │         │  Mass Assignment │        │              │
+│       │         │  m_i = f(freq_i) │        │              │
+│       │         └──────────────────┘        │              │
+│       │                  │                  │              │
+│       ▼                  ▼                  ▼              │
+│  ┌────────────────────────────────────────────────┐       │
+│  │         HNSW Graph with Masses                 │       │
+│  │                                                 │       │
+│  │   ○─────○─────●═════●─────○                   │       │
+│  │   │     │     ║     ║     │                   │       │
+│  │   ○     ●═════●     ●─────○    ● = high mass  │       │
+│  │   │     ║     │     ║     │    ○ = low mass   │       │
+│  │   ○─────●─────○─────●═════○    ═ = strong     │       │
+│  │                              pull              │       │
+│  └────────────────────────────────────────────────┘       │
+└─────────────────────────────────────────────────────────────┘
+```
+
+### Core Data Structures
+
+```rust
+/// Gravitational mass and frequency tracking for each node
+#[derive(Clone, Debug)]
+pub struct NodeMass {
+    /// Effective gravitational mass (learned from query frequency)
+    pub mass: f32,
+
+    /// Query frequency counter (exponentially decayed hit count)
+    pub query_frequency: f64,
+
+    /// Last update timestamp
+    pub last_update: SystemTime,
+
+    /// Decay factor applied to the frequency per elapsed hour (default: 0.95)
+    pub decay_rate: f32,
+}
+
+/// Gravitational field configuration
+#[derive(Clone, Debug)]
+pub struct GravitationalFieldConfig {
+    /// Gravitational constant (strength of attraction)
+    pub g_constant: f32,  // default: 1.0
+
+    /// Mass function type
+    pub mass_function: MassFunction,
+
+    /// Maximum influence radius (in embedding space)
+    pub max_radius: f32,  // default: 10.0
+
+    /// Softening parameter (prevents singularities at r=0)
+    pub softening: f32,   // default: 0.1
+
+    /// Field update frequency
+    pub update_interval: Duration,
+}
+
+/// Mass calculation strategies
+#[derive(Clone, Debug)]
+pub enum MassFunction {
+    /// Linear: m = frequency
+    Linear,
+
+    /// Logarithmic: m = log(1 + frequency)
+    Logarithmic,
+
+    /// Square root: m = sqrt(frequency)
+    SquareRoot,
+
+    /// Custom function
+    Custom(fn(f64) -> f32),
+}
+
+/// Gravitational potential field
+pub struct PotentialField {
+    /// Node masses indexed by node ID
+    masses: Vec<NodeMass>,
+
+    /// Spatial index for fast radius queries
+    spatial_index: KDTree<NodeId>,
+
+    /// Configuration
+    config: GravitationalFieldConfig,
+
+    /// Cached potential values (invalidated on mass updates)
+    potential_cache: LruCache<(NodeId, NodeId), f32>,
+}
+
+/// Search path with gravitational guidance
+pub struct GravitationalSearchPath {
+    /// Visited nodes
+    pub visited: Vec<NodeId>,
+
+    /// Potential energy at each step
+    pub potentials: Vec<f32>,
+
+    /// Gradient magnitudes
+    pub gradients: Vec<f32>,
+
+    /// Total energy consumed
+    pub total_energy: f32,
+}
+```
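The relationship between `query_frequency`, `decay_rate`, and `mass` can be illustrated with a minimal sketch. The names `MassFn` and `DecayedMass` are hypothetical stand-ins for `MassFunction` and `NodeMass`, the `Custom` variant is omitted, and the per-hour decay unit is an assumption taken from the update pseudocode in this document.

```rust
use std::time::Duration;

/// Hypothetical stand-in for `MassFunction` (the `Custom` variant is omitted).
#[derive(Clone, Copy, Debug)]
pub enum MassFn {
    Linear,
    Logarithmic,
    SquareRoot,
}

/// Hypothetical, simplified counterpart of `NodeMass`.
#[derive(Clone, Debug)]
pub struct DecayedMass {
    pub mass: f32,
    pub query_frequency: f64,
    /// Multiplicative decay applied per elapsed hour (assumed unit).
    pub decay_rate: f64,
}

impl DecayedMass {
    pub fn new(decay_rate: f64) -> Self {
        Self { mass: 0.0, query_frequency: 0.0, decay_rate }
    }

    /// Decay the stored frequency by the elapsed time, count one hit,
    /// then recompute the mass from the new frequency.
    pub fn record_hit(&mut self, elapsed: Duration, f: MassFn) {
        let hours = elapsed.as_secs_f64() / 3600.0;
        self.query_frequency = self.query_frequency * self.decay_rate.powf(hours) + 1.0;
        let mass = match f {
            MassFn::Linear => self.query_frequency,
            MassFn::Logarithmic => self.query_frequency.ln_1p(),
            MassFn::SquareRoot => self.query_frequency.sqrt(),
        };
        self.mass = mass as f32;
    }
}
```

With a decay rate of 0.95, two hits one hour apart give a frequency of 1.95. The logarithmic and square-root variants compress that value, which limits how far a very popular node can outweigh the rest.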
+
+### Key Algorithms
+
+```rust
+// Pseudocode for gravitational field search.
+// Convention: lower total potential is better, so similarity enters negated.
+
+fn gravitational_search(
+    query: &[f32],
+    field: &PotentialField,
+    graph: &HnswGraph,
+    k: usize,
+    semantic_weight: f32, // α: 0.7 semantic / 0.3 gravitational by default
+) -> Vec<NodeId> {
+    let mut visited = HashSet::new();
+    // Frontier and results hold (node, total potential) pairs
+    let mut frontier: Vec<(NodeId, f32)> = Vec::new();
+    let mut results: Vec<(NodeId, f32)> = Vec::new();
+
+    // Total potential: Φ_total = -α * similarity + (1 - α) * Φ_grav,
+    // with Φ_grav = -G * m / (r + ε), where ε is the softening parameter
+    let score = |node: NodeId| -> f32 {
+        let embedding = graph.get_embedding(node);
+        let distance = euclidean_distance(query, embedding);
+        let grav_potential = -field.config.g_constant * field.get_mass(node)
+                           / (distance + field.config.softening);
+        let similarity = cosine_similarity(query, embedding);
+        -semantic_weight * similarity + (1.0 - semantic_weight) * grav_potential
+    };
+
+    // Initialize at entry point
+    let mut current = graph.entry_point;
+    visited.insert(current);
+    results.push((current, score(current)));
+
+    while !converged(&results, k) {
+        // Get neighbors from the base layer (layer 0) of the HNSW graph
+        for neighbor in graph.get_neighbors(current, 0) {
+            // `insert` returns false if the node was already seen
+            if !visited.insert(neighbor) { continue; }
+            frontier.push((neighbor, score(neighbor)));
+        }
+
+        // Follow gradient: move to the frontier node with the lowest potential
+        let Some(best) = (0..frontier.len())
+            .min_by(|&a, &b| frontier[a].1.total_cmp(&frontier[b].1))
+        else { break }; // frontier exhausted
+        let next = frontier.swap_remove(best);
+        current = next.0;
+        results.push(next);
+    }
+
+    // Return the k expanded nodes with the lowest total potential
+    results.sort_by(|a, b| a.1.total_cmp(&b.1));
+    results.into_iter().take(k).map(|(id, _)| id).collect()
+}
+
+// Mass update from query patterns
+fn update_masses(field: &mut PotentialField, query_log: &[QueryEvent]) {
+    for event in query_log {
+        for visited_node in &event.visited_nodes {
+            let mass = &mut field.masses[*visited_node];
+
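+            // Exponentially decayed query count. `duration_since` returns an
+            // error if the event predates the last update; treat that case
+            // as zero elapsed time.
+            let time_delta = event.timestamp
+                .duration_since(mass.last_update)
+                .unwrap_or_default();
+            let decay = mass.decay_rate.powf(time_delta.as_secs_f32() / 3600.0);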
+
+            mass.query_frequency = mass.query_frequency * decay as f64 + 1.0;
+
+            // Update mass based on frequency
+            mass.mass = match field.config.mass_function {
+                MassFunction::Linear => mass.query_frequency as f32,
+                MassFunction::Logarithmic => (1.0 + mass.query_frequency).ln() as f32,
+                MassFunction::SquareRoot => mass.query_frequency.sqrt() as f32,
+                MassFunction::Custom(f) => f(mass.query_frequency),
+            };
+
+            mass.last_update = event.timestamp;
+        }
+    }
+
+    // Invalidate potential cache
+    field.potential_cache.clear();
+
+    // Rebuild spatial index if significant changes
+    if should_rebuild_index(field) {
+        field.rebuild_spatial_index();
+    }
+}
+```
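The sign convention in the combined score is easy to get wrong, so here is a small self-contained sketch of it. It assumes that lower total potential is better and that similarity therefore enters with a negative sign. The function `total_potential` and its parameter order are illustrative, not part of the proposed API.

```rust
/// Softened gravitational potential: Φ = -G * m / (r + ε).
pub fn grav_potential(g: f32, mass: f32, distance: f32, softening: f32) -> f32 {
    -g * mass / (distance + softening)
}

/// Euclidean distance between two equal-length vectors.
pub fn euclidean_distance(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| (x - y) * (x - y)).sum::<f32>().sqrt()
}

/// Cosine similarity; returns 0.0 if either vector has zero norm.
pub fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let na = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let nb = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if na == 0.0 || nb == 0.0 { 0.0 } else { dot / (na * nb) }
}

/// Combined potential (lower is better): -α * similarity + (1 - α) * Φ.
/// `semantic_weight` plays the role of α (illustrative signature).
pub fn total_potential(
    query: &[f32],
    embedding: &[f32],
    mass: f32,
    g: f32,
    softening: f32,
    semantic_weight: f32,
) -> f32 {
    let similarity = cosine_similarity(query, embedding);
    let phi = grav_potential(g, mass, euclidean_distance(query, embedding), softening);
    -semantic_weight * similarity + (1.0 - semantic_weight) * phi
}
```

Under this convention, two equally similar nodes are separated only by mass: the heavier one has the lower potential. Setting `semantic_weight` to 1.0 removes the gravitational term entirely.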
+
+### API Design
+
+```rust
+/// Public API for Gravitational Embedding Fields
+pub trait GravitationalField {
+    /// Create new gravitational field for graph
+    fn new(graph: &HnswGraph, config: GravitationalFieldConfig) -> Self;
+
+    /// Search with gravitational guidance
+    fn search(
+        &self,
+        query: &[f32],
+        k: usize,
+        options: SearchOptions,
+    ) -> Result<Vec<SearchResult>, GefError>;
+
+    /// Update masses from query log
+    fn update_masses(&mut self, query_log: &[QueryEvent]) -> Result<(), GefError>;
+
+    /// Get mass for specific node
+    fn get_mass(&self, node_id: NodeId) -> f32;
+
+    /// Calculate potential at point
+    fn calculate_potential(&self, point: &[f32], reference: NodeId) -> f32;
+
+    /// Calculate gradient at point
+    fn calculate_gradient(&self, point: &[f32]) -> Vec<f32>;
+
+    /// Export field visualization data
+    fn export_field(&self, resolution: usize) -> FieldVisualization;
+
+    /// Get field statistics
+    fn statistics(&self) -> FieldStatistics;
+}
+
+/// Search options for GEF
+#[derive(Clone, Debug)]
+pub struct SearchOptions {
+    /// Balance between semantic similarity and gravitational pull (0.0-1.0)
+    pub semantic_weight: f32,
+
+    /// Maximum search steps
+    pub max_steps: usize,
+
+    /// Enable path recording
+    pub record_path: bool,
+
+    /// Convergence threshold
+    pub convergence_threshold: f32,
+}
+
+/// Statistics about gravitational field
+#[derive(Clone, Debug)]
+pub struct FieldStatistics {
+    /// Total number of nodes
+    pub total_nodes: usize,
+
+    /// Mass distribution (min, max, mean, median)
+    pub mass_distribution: Distribution,
+
+    /// Number of high-mass nodes (top 10%)
+    pub high_mass_nodes: usize,
+
+    /// Average query frequency
+    pub avg_query_frequency: f64,
+
+    /// Last update timestamp
+    pub last_update: SystemTime,
+}
+```
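For `calculate_gradient`, one possible reading is the analytic gradient of the summed softened potential. Differentiating Φ = -G·m / (r + ε) with respect to the query point gives G·m / (r + ε)² · (x - e) / r per node. The sketch below assumes that reading; `MassPoint` and the free functions are hypothetical and ignore `max_radius` and the spatial index.

```rust
/// Hypothetical pairing of an embedding with its mass.
pub struct MassPoint {
    pub position: Vec<f32>,
    pub mass: f32,
}

fn euclidean(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| (x - y) * (x - y)).sum::<f32>().sqrt()
}

/// Summed softened potential at `point`: Φ(x) = Σ -G * m_i / (r_i + ε).
pub fn potential(point: &[f32], nodes: &[MassPoint], g: f32, softening: f32) -> f32 {
    nodes
        .iter()
        .map(|n| -g * n.mass / (euclidean(point, &n.position) + softening))
        .sum()
}

/// Analytic gradient ∇Φ(x) = Σ G * m_i / (r_i + ε)² * (x - e_i) / r_i.
/// The gradient points away from the masses, so descent moves toward them.
pub fn gradient(point: &[f32], nodes: &[MassPoint], g: f32, softening: f32) -> Vec<f32> {
    let mut grad = vec![0.0; point.len()];
    for n in nodes {
        let r = euclidean(point, &n.position);
        if r == 0.0 {
            continue; // direction is undefined exactly at the node
        }
        let scale = g * n.mass / ((r + softening) * (r + softening) * r);
        for (gi, (x, e)) in grad.iter_mut().zip(point.iter().zip(&n.position)) {
            *gi += scale * (x - e);
        }
    }
    grad
}
```

A single step `x - η·∇Φ(x)` with a small step size η moves the point toward nearby heavy nodes and lowers the potential.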
+
+## Integration Points
+
+### Affected Crates/Modules
+
+1. **`crates/ruvector-core/src/hnsw/`**
+   - Modify search algorithm to accept potential field guidance
+   - Add hooks for mass updates on queries
+   - Extend node metadata to store mass values
+
+2. **`crates/ruvector-gnn/src/attention/`**
+   - Integrate GEF as attention mechanism variant
+   - Combine with existing attention patterns
+
+3. **`crates/ruvector-core/src/distance/`**
+   - Add potential field distance metrics
+   - Implement gradient calculation utilities
+
+### New Modules to Create
+
+1. **`crates/ruvector-gnn/src/gravitational/`**
+   - `field.rs` - Core potential field implementation
+   - `mass.rs` - Mass calculation and updates
+   - `search.rs` - Gravitational-guided search algorithms
+   - `config.rs` - Configuration and tuning
+   - `visualization.rs` - Field visualization utilities
+
+2. **`crates/ruvector-core/src/query_log/`**
+   - `logger.rs` - Query event logging
+   - `analyzer.rs` - Query pattern analysis
+   - `replay.rs` - Query replay for testing
+
+### Dependencies on Other Features
+
+- **Feature 11 (Causal Attention Networks)**: GEF can respect causal ordering by preventing backward gravitational pull
+- **Feature 12 (Topology-Aware Gradient Routing)**: Combine graph topology with gravitational field for hybrid routing
+- **Feature 13 (Embedding Crystallization)**: High-mass nodes serve as natural crystallization nuclei
+
+## Regression Prevention
+
+### Existing Functionality at Risk
+
+1. **Standard HNSW Search Performance**
+   - Risk: Gravitational calculations add overhead
+   - Prevention: Make GEF optional, benchmark against baseline
+
+2. **Deterministic Search Results**
+   - Risk: Mass updates change results over time
+   - Prevention: Add `frozen_field` mode for reproducible searches
+
+3. **Memory Usage**
+   - Risk: Additional mass metadata per node
+   - Prevention: Use compact representations (f32 instead of f64), lazy cache
+
+4. **Concurrent Queries**
+   - Risk: Race conditions in mass updates
+   - Prevention: Use atomic updates or batch processing
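As one way to realize the "atomic updates" option for concurrent queries, an `f32` mass can be stored as raw bits in an `AtomicU32` and updated with a compare-and-swap loop. `AtomicMass` is a hypothetical type for illustration, and the memory orderings are a conservative choice rather than a tuned one.

```rust
use std::sync::atomic::{AtomicU32, Ordering};

/// Hypothetical lock-free mass cell: an f32 stored as raw bits.
pub struct AtomicMass(AtomicU32);

impl AtomicMass {
    pub fn new(value: f32) -> Self {
        Self(AtomicU32::new(value.to_bits()))
    }

    pub fn load(&self) -> f32 {
        f32::from_bits(self.0.load(Ordering::Acquire))
    }

    /// Atomically add `delta` and return the new value.
    /// Retries when another thread changed the cell in between.
    pub fn add(&self, delta: f32) -> f32 {
        let mut current = self.0.load(Ordering::Relaxed);
        loop {
            let new = (f32::from_bits(current) + delta).to_bits();
            match self.0.compare_exchange_weak(
                current,
                new,
                Ordering::AcqRel,
                Ordering::Relaxed,
            ) {
                Ok(_) => return f32::from_bits(new),
                Err(actual) => current = actual,
            }
        }
    }
}
```

This lets searches read masses without taking a lock, at the cost of updating each field independently. A batch design that swaps in a whole new mass table would instead keep `mass`, `query_frequency`, and `last_update` mutually consistent.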
+
+### Test Cases to Prevent Regressions
+
+```rust
+#[cfg(test)]
+mod regression_tests {
+    // Baseline performance should not degrade
+    #[test]
+    fn test_gef_disabled_matches_baseline() {
+        let graph = create_test_graph(10000);
+        let query = random_vector(128);
+
+        let baseline_results = graph.search(&query, 10);
+
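+        let gef_field = GravitationalField::new(&graph, GravitationalFieldConfig::default());
+        // `semantic_weight` lives on `SearchOptions`, not on the field config
+        let options = SearchOptions {
+            semantic_weight: 1.0,  // Pure semantic search
+            ..Default::default()
+        };
+        let gef_results = gef_field.search(&query, 10, options).unwrap();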
+
+        assert_eq!(baseline_results, gef_results);
+    }
+
+    // Frozen field produces deterministic results
+    #[test]
+    fn test_frozen_field_deterministic() {
+        let mut field = create_test_field();
+        field.freeze();
+
+        let query = random_vector(128);
+        let results1 = field.search(&query, 10, SearchOptions::default()).unwrap();
+        let results2 = field.search(&query, 10, SearchOptions::default()).unwrap();
+
+        assert_eq!(results1, results2);
+    }
+
+    // Mass updates don't break existing searches
+    #[test]
+    fn test_concurrent_search_and_update() {
+        let field = Arc::new(RwLock::new(create_test_field()));
+
+        let search_thread = spawn({
+            let field = field.clone();
+            move || {
+                for _ in 0..100 {
+                    let f = field.read().unwrap();
+                    f.search(&random_vector(128), 10, SearchOptions::default()).unwrap();
+                }
+            }
+        });
+
+        let update_thread = spawn({
+            let field = field.clone();
+            move || {
+                for _ in 0..10 {
+                    let mut f = field.write().unwrap();
+                    f.update_masses(&generate_query_log(10)).unwrap();
+                    thread::sleep(Duration::from_millis(10));
+                }
+            }
+        });
+
+        search_thread.join().unwrap();
+        update_thread.join().unwrap();
+    }
+}
+```
+
+### Backward Compatibility Strategy
+
+1. **Feature Flag**: GEF behind `gravitational-fields` feature flag
+2. **Opt-in**: Default `SearchOptions` use `semantic_weight = 1.0` (pure semantic search)
+3. **Migration Path**: Provide tools to analyze existing graphs and recommend GEF settings
+4. **Serialization**: Store mass data in separate file, gracefully handle missing data
+
+## Implementation Phases
+
+### Phase 1: Research Validation (2 weeks)
+**Goal**: Validate physics-inspired approach on synthetic data
+
+- Implement basic potential field calculations
+- Create toy dataset with known high-frequency nodes
+- Measure search efficiency improvements
+- Compare against baselines (pure HNSW, PageRank-weighted)
+- **Deliverable**: Research report with benchmarks
+
+### Phase 2: Core Implementation (3 weeks)
+**Goal**: Production-ready GEF implementation
+
+- Implement `PotentialField` and `NodeMass` structures
+- Develop mass update algorithms with decay
+- Integrate with HNSW search
+- Add configuration system
+- Implement caching and optimization
+- **Deliverable**: Working GEF module with unit tests
+
+### Phase 3: Integration (2 weeks)
+**Goal**: Integrate with existing RuVector systems
+
+- Add query logging infrastructure
+- Implement mass persistence (save/load)
+- Create API bindings (Python, Node.js)
+- Add monitoring and metrics
+- Write integration tests
+- **Deliverable**: GEF integrated into main codebase
+
+### Phase 4: Optimization (2 weeks)
+**Goal**: Production performance and tuning
+
+- Profile and optimize hot paths
+- Implement spatial indexing for large graphs
+- Add adaptive tuning (auto-adjust G constant)
+- Create visualization tools
+- Write documentation and examples
+- **Deliverable**: Production-ready, documented feature
+
+## Success Metrics
+
+### Performance Benchmarks
+
+| Metric | Baseline | Target | Measurement |
+|--------|----------|--------|-------------|
+| Search latency (10K nodes) | 1.2ms | <1.5ms | 99th percentile |
+| Search quality (recall@10) | 0.95 | >0.95 | Standard test set |
+| Hops to target | 12.3 | <9.0 | Average path length |
+| Memory overhead | 0MB | <50MB | Per 1M nodes |
+| Mass update latency | N/A | <10ms | Per 1K queries |
+
+### Accuracy Metrics
+
+1. **Authority Discovery**: High-authority nodes found in top-10 results
+   - Target: 80% of known authoritative nodes in top-10
+
+2. **Query Efficiency**: Reduction in nodes visited per search
+   - Target: 30% fewer nodes visited for same recall
+
+3. **Adaptive Learning**: Mass distribution correlates with true importance
+   - Target: Spearman correlation >0.7 with ground truth rankings
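The Spearman target above can be checked with a short routine that ranks both series (averaging ranks for ties) and takes the Pearson correlation of the ranks. This is a minimal sketch with a hypothetical `spearman` function, not an existing RuVector utility.

```rust
/// Rank values in ascending order, assigning tied values their average rank.
fn ranks(values: &[f64]) -> Vec<f64> {
    let mut idx: Vec<usize> = (0..values.len()).collect();
    idx.sort_by(|&a, &b| values[a].total_cmp(&values[b]));
    let mut r = vec![0.0; values.len()];
    let mut i = 0;
    while i < idx.len() {
        // Extend j over the run of values tied with position i
        let mut j = i;
        while j + 1 < idx.len() && values[idx[j + 1]] == values[idx[i]] {
            j += 1;
        }
        let avg = (i + j) as f64 / 2.0 + 1.0; // ranks are 1-based
        for &k in &idx[i..=j] {
            r[k] = avg;
        }
        i = j + 1;
    }
    r
}

/// Spearman rank correlation. Returns `None` for mismatched lengths,
/// fewer than two samples, or a series with no rank variance.
pub fn spearman(x: &[f64], y: &[f64]) -> Option<f64> {
    if x.len() != y.len() || x.len() < 2 {
        return None;
    }
    let (rx, ry) = (ranks(x), ranks(y));
    let n = x.len() as f64;
    let mx = rx.iter().sum::<f64>() / n;
    let my = ry.iter().sum::<f64>() / n;
    let (mut cov, mut vx, mut vy) = (0.0, 0.0, 0.0);
    for (a, b) in rx.iter().zip(&ry) {
        cov += (a - mx) * (b - my);
        vx += (a - mx).powi(2);
        vy += (b - my).powi(2);
    }
    if vx == 0.0 || vy == 0.0 {
        None
    } else {
        Some(cov / (vx * vy).sqrt())
    }
}
```

Here `x` would hold the learned masses and `y` the ground-truth importance scores for the same nodes. Because only ranks matter, the result does not depend on which mass function is configured, as long as that function is monotonic.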
+
+### Comparison to Baselines
+
+Test against:
+1. **Pure HNSW**: Standard implementation without GEF
+2. **PageRank-weighted**: Static global importance scores
+3. **Attention-based**: Standard attention mechanism from Feature 1
+4. **Hybrid**: GEF + Topology-Aware Routing (Feature 12)
+
+Datasets:
+- Wikipedia embeddings (1M articles)
+- ArXiv papers with citation counts (500K papers)
+- E-commerce products with view counts (2M products)
+
+## Risks and Mitigations
+
+### Technical Risks
+
+| Risk | Impact | Probability | Mitigation |
+|------|--------|-------------|------------|
+| Mass updates too slow | High | Medium | Batch updates, incremental computation |
+| Field calculations expensive | High | High | Spatial indexing, caching, approximations |
+| Over-attraction to popular nodes | Medium | High | Softening parameter, max influence radius |
+| Mass distribution unstable | Medium | Medium | Regularization, decay rates, bounds checking |
+| Poor generalization | High | Low | Multi-dataset validation, adaptive tuning |
+
+### Detailed Mitigations
+
+1. **Slow Mass Updates**
+   - Implement incremental updates (only changed nodes)
+   - Batch query logs and process asynchronously
+   - Use lock-free data structures for concurrent updates
+   - Fallback: Update masses periodically (e.g., hourly) instead of real-time
+
+2. **Expensive Field Calculations**
+   - Pre-compute potential fields for common queries
+   - Use spatial hashing for expected O(1) cell lookups in radius queries
+   - Approximate far-field contributions (multipole expansion)
+   - Fallback: Disable GEF for low-latency requirements
+
+3. **Over-Attraction to Popular Nodes**
+   - Tune softening parameter ε to prevent singularities
+   - Cap maximum mass value
+   - Implement repulsive forces for diversity
+   - Fallback: Reduce gravitational weight in combined score
+
+4. **Unstable Mass Distribution**
+   - Add L2 regularization to mass updates
+   - Implement mass normalization across graph
+   - Monitor mass variance, trigger rebalancing
+   - Fallback: Reset masses to uniform distribution
+
+5. **Poor Generalization**
+   - Test on diverse datasets (text, images, graphs)
+   - Implement domain-specific mass functions
+   - Provide configuration templates for common use cases
+   - Fallback: Disable GEF for unsupported domains
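Two of the mitigations above, capping the maximum mass and normalizing masses across the graph, can be combined in a single pass. The sketch below assumes normalization to a mean mass of 1.0, which is one possible choice and not something this document specifies; `cap_and_normalize` is a hypothetical helper.

```rust
/// Clamp every mass to [0, max_mass], then rescale so the mean mass is 1.0.
/// Assumes `max_mass` is non-negative and not NaN (`clamp` panics otherwise).
pub fn cap_and_normalize(masses: &mut [f32], max_mass: f32) {
    if masses.is_empty() {
        return;
    }
    for m in masses.iter_mut() {
        *m = (*m).clamp(0.0, max_mass);
    }
    let mean = masses.iter().sum::<f32>() / masses.len() as f32;
    if mean <= 0.0 {
        return; // all masses are zero; nothing to rescale
    }
    for m in masses.iter_mut() {
        *m /= mean;
    }
}
```

Capping before normalizing keeps a single runaway node from shrinking every other mass toward zero during the rescale.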
+
+## References
+
+### Physics Inspiration
+- Newtonian gravity: F = G·m₁·m₂/r²
+- Potential fields in robotics path planning
+- N-body simulations and Barnes-Hut algorithms
+
+### Related ML Techniques
+- PageRank and graph centrality measures
+- Attention mechanisms in transformers
+- Reinforcement learning value functions
+- Metric learning and embedding spaces
+
+### Implementation Precedents
+- Fast multipole methods (FMM)
+- Spatial hashing and KD-trees
+- Incremental graph algorithms
+- Online learning with exponential decay