dearsky/wifi-densepose

Fork 0

Files

ruv cd5943df23 Merge commit 'd803bfe2b1fe7f5e219e50ac20d6801a0a58ac75' as 'vendor/ruvector'

2026-02-28 14:39:40 -05:00

20 KiB

Raw Permalink Blame History

Gravitational Embedding Fields (GEF)

Overview

Problem Statement

Current vector search treats all embeddings equally, ignoring the importance or frequency of access to nodes. High-value documents (frequently queried, authoritative sources) should have stronger influence on search trajectories, similar to how massive objects exert stronger gravitational pull in physics.

Proposed Solution

Implement a physics-inspired attention mechanism where embeddings exert "gravitational pull" proportional to their query frequency and importance. Search follows gradient descent through a potential field, naturally routing toward high-value nodes before exploring local neighborhoods.

Expected Benefits

30-50% reduction in search hops: High-frequency nodes act as routing landmarks
15-25% improved relevance: Important documents discovered earlier in search
Adaptive importance: Automatically learns document authority from usage patterns
Natural load balancing: Popular nodes become graph hubs, improving overall connectivity

Novelty Claim

First application of gravitational field dynamics to vector search. Unlike PageRank (global static scores) or attention mechanisms (pairwise interactions), GEF creates a continuous potential field that guides search trajectories dynamically based on real-time usage patterns.

Technical Design

Architecture Diagram

┌─────────────────────────────────────────────────────────────┐
│                   Gravitational Field Layer                  │
│                                                              │
│  ┌──────────┐      ┌──────────┐      ┌──────────┐         │
│  │ Query    │      │ Potential│      │ Gradient │         │
│  │ Vector   │─────▶│ Field    │─────▶│ Descent  │─────▶   │
│  │ (q)      │      │ Φ(x)     │      │ ∇Φ(x)    │  Path   │
│  └──────────┘      └──────────┘      └──────────┘         │
│       │                  │                  │              │
│       │                  ▼                  │              │
│       │         ┌──────────────────┐        │              │
│       │         │  Mass Assignment │        │              │
│       │         │  m_i = f(freq_i) │        │              │
│       │         └──────────────────┘        │              │
│       │                  │                  │              │
│       ▼                  ▼                  ▼              │
│  ┌────────────────────────────────────────────────┐       │
│  │         HNSW Graph with Masses                 │       │
│  │                                                 │       │
│  │   ○─────○─────●═════●─────○                   │       │
│  │   │     │     ║     ║     │                   │       │
│  │   ○     ●═════●     ●─────○    ● = high mass  │       │
│  │   │     ║     │     ║     │    ○ = low mass   │       │
│  │   ○─────●─────○─────●═════○    ═ = strong     │       │
│  │                              pull              │       │
│  └────────────────────────────────────────────────┘       │
└─────────────────────────────────────────────────────────────┘

Core Data Structures

/// Gravitational mass and frequency tracking for each node
#[derive(Clone, Debug)]
pub struct NodeMass {
    /// Effective gravitational mass (learned from query frequency)
    pub mass: f32,

    /// Query frequency counter (exponential moving average)
    pub query_frequency: f64,

    /// Last update timestamp
    pub last_update: SystemTime,

    /// Decay rate for frequency (default: 0.95)
    pub decay_rate: f32,
}

/// Gravitational field configuration
#[derive(Clone, Debug)]
pub struct GravitationalFieldConfig {
    /// Gravitational constant (strength of attraction)
    pub g_constant: f32,  // default: 1.0

    /// Mass function type
    pub mass_function: MassFunction,

    /// Maximum influence radius (in embedding space)
    pub max_radius: f32,  // default: 10.0

    /// Softening parameter (prevents singularities at r=0)
    pub softening: f32,   // default: 0.1

    /// Field update frequency
    pub update_interval: Duration,
}

/// Mass calculation strategies
#[derive(Clone, Debug)]
pub enum MassFunction {
    /// Linear: m = frequency
    Linear,

    /// Logarithmic: m = log(1 + frequency)
    Logarithmic,

    /// Square root: m = sqrt(frequency)
    SquareRoot,

    /// Custom function
    Custom(fn(f64) -> f32),
}

/// Gravitational potential field
pub struct PotentialField {
    /// Node masses indexed by node ID
    masses: Vec<NodeMass>,

    /// Spatial index for fast radius queries
    spatial_index: KDTree<NodeId>,

    /// Configuration
    config: GravitationalFieldConfig,

    /// Cached potential values (invalidated on mass updates)
    potential_cache: LruCache<(NodeId, NodeId), f32>,
}

/// Search path with gravitational guidance
pub struct GravitationalSearchPath {
    /// Visited nodes
    pub visited: Vec<NodeId>,

    /// Potential energy at each step
    pub potentials: Vec<f32>,

    /// Gradient magnitudes
    pub gradients: Vec<f32>,

    /// Total energy consumed
    pub total_energy: f32,
}

Key Algorithms

// Pseudocode for gravitational field search

fn gravitational_search(
    query: &[f32],
    field: &PotentialField,
    graph: &HnswGraph,
    k: usize
) -> Vec<NodeId> {
    // Initialize at entry point
    let mut current = graph.entry_point;
    let mut visited = HashSet::new();
    let mut candidates = BinaryHeap::new();

    // Calculate initial potential
    let mut potential = field.calculate_potential(query, current);

    while !converged(&candidates, k) {
        visited.insert(current);

        // Get neighbors from HNSW graph
        let neighbors = graph.get_neighbors(current, layer=0);

        for neighbor in neighbors {
            if visited.contains(&neighbor) { continue; }

            // Calculate gravitational force contribution
            let neighbor_mass = field.get_mass(neighbor);
            let distance = euclidean_distance(query, graph.get_embedding(neighbor));

            // Gravitational potential: Φ = -G * m / (r + ε)
            // where ε is softening parameter
            let grav_potential = -field.config.g_constant * neighbor_mass
                               / (distance + field.config.softening);

            // Combine embedding similarity with gravitational pull
            let similarity = cosine_similarity(query, graph.get_embedding(neighbor));

            // Total potential: combine semantic similarity and gravitational field
            // α controls balance (default: 0.7 semantic, 0.3 gravitational)
            let total_potential = 0.7 * similarity + 0.3 * grav_potential;

            candidates.push((neighbor, total_potential));
        }

        // Follow gradient: move to node with lowest potential
        current = candidates.pop().unwrap().0;
        potential = field.calculate_potential(query, current);
    }

    // Return top-k by final similarity
    candidates.into_sorted_vec()
        .iter()
        .take(k)
        .map(|(id, _)| *id)
        .collect()
}

// Mass update from query patterns
fn update_masses(field: &mut PotentialField, query_log: &[QueryEvent]) {
    for event in query_log {
        for visited_node in &event.visited_nodes {
            let mass = &mut field.masses[*visited_node];

            // Exponential moving average of query frequency
            let time_delta = event.timestamp.duration_since(mass.last_update);
            let decay = mass.decay_rate.powf(time_delta.as_secs_f32() / 3600.0);

            mass.query_frequency = mass.query_frequency * decay as f64 + 1.0;

            // Update mass based on frequency
            mass.mass = match field.config.mass_function {
                MassFunction::Linear => mass.query_frequency as f32,
                MassFunction::Logarithmic => (1.0 + mass.query_frequency).ln() as f32,
                MassFunction::SquareRoot => mass.query_frequency.sqrt() as f32,
                MassFunction::Custom(f) => f(mass.query_frequency),
            };

            mass.last_update = event.timestamp;
        }
    }

    // Invalidate potential cache
    field.potential_cache.clear();

    // Rebuild spatial index if significant changes
    if should_rebuild_index(field) {
        field.rebuild_spatial_index();
    }
}

API Design

/// Public API for Gravitational Embedding Fields
pub trait GravitationalField {
    /// Create new gravitational field for graph
    fn new(graph: &HnswGraph, config: GravitationalFieldConfig) -> Self;

    /// Search with gravitational guidance
    fn search(
        &self,
        query: &[f32],
        k: usize,
        options: SearchOptions,
    ) -> Result<Vec<SearchResult>, GefError>;

    /// Update masses from query log
    fn update_masses(&mut self, query_log: &[QueryEvent]) -> Result<(), GefError>;

    /// Get mass for specific node
    fn get_mass(&self, node_id: NodeId) -> f32;

    /// Calculate potential at point
    fn calculate_potential(&self, point: &[f32], reference: NodeId) -> f32;

    /// Calculate gradient at point
    fn calculate_gradient(&self, point: &[f32]) -> Vec<f32>;

    /// Export field visualization data
    fn export_field(&self, resolution: usize) -> FieldVisualization;

    /// Get field statistics
    fn statistics(&self) -> FieldStatistics;
}

/// Search options for GEF
#[derive(Clone, Debug)]
pub struct SearchOptions {
    /// Balance between semantic similarity and gravitational pull (0.0-1.0)
    pub semantic_weight: f32,

    /// Maximum search steps
    pub max_steps: usize,

    /// Enable path recording
    pub record_path: bool,

    /// Convergence threshold
    pub convergence_threshold: f32,
}

/// Statistics about gravitational field
#[derive(Clone, Debug)]
pub struct FieldStatistics {
    /// Total number of nodes
    pub total_nodes: usize,

    /// Mass distribution (min, max, mean, median)
    pub mass_distribution: Distribution,

    /// Number of high-mass nodes (top 10%)
    pub high_mass_nodes: usize,

    /// Average query frequency
    pub avg_query_frequency: f64,

    /// Last update timestamp
    pub last_update: SystemTime,
}

Integration Points

Affected Crates/Modules

crates/ruvector-core/src/hnsw/
- Modify search algorithm to accept potential field guidance
- Add hooks for mass updates on queries
- Extend node metadata to store mass values
crates/ruvector-gnn/src/attention/
- Integrate GEF as attention mechanism variant
- Combine with existing attention patterns
crates/ruvector-core/src/distance/
- Add potential field distance metrics
- Implement gradient calculation utilities

New Modules to Create

crates/ruvector-gnn/src/gravitational/
- field.rs - Core potential field implementation
- mass.rs - Mass calculation and updates
- search.rs - Gravitational-guided search algorithms
- config.rs - Configuration and tuning
- visualization.rs - Field visualization utilities
crates/ruvector-core/src/query_log/
- logger.rs - Query event logging
- analyzer.rs - Query pattern analysis
- replay.rs - Query replay for testing

Dependencies on Other Features

Feature 11 (Causal Attention Networks): GEF can respect causal ordering by preventing backward gravitational pull
Feature 12 (Topology-Aware Gradient Routing): Combine graph topology with gravitational field for hybrid routing
Feature 13 (Embedding Crystallization): High-mass nodes serve as natural crystallization nuclei

Regression Prevention

Existing Functionality at Risk

Standard HNSW Search Performance
- Risk: Gravitational calculations add overhead
- Prevention: Make GEF optional, benchmark against baseline
Deterministic Search Results
- Risk: Mass updates change results over time
- Prevention: Add frozen_field mode for reproducible searches
Memory Usage
- Risk: Additional mass metadata per node
- Prevention: Use compact representations (f32 instead of f64), lazy cache
Concurrent Queries
- Risk: Race conditions in mass updates
- Prevention: Use atomic updates or batch processing

Test Cases to Prevent Regressions

#[cfg(test)]
mod regression_tests {
    // Baseline performance should not degrade
    #[test]
    fn test_gef_disabled_matches_baseline() {
        let graph = create_test_graph(10000);
        let query = random_vector(128);

        let baseline_results = graph.search(&query, 10);

        let gef_field = GravitationalField::new(&graph, GravitationalFieldConfig {
            semantic_weight: 1.0,  // Pure semantic search
            ..Default::default()
        });
        let gef_results = gef_field.search(&query, 10);

        assert_eq!(baseline_results, gef_results);
    }

    // Frozen field produces deterministic results
    #[test]
    fn test_frozen_field_deterministic() {
        let mut field = create_test_field();
        field.freeze();

        let query = random_vector(128);
        let results1 = field.search(&query, 10);
        let results2 = field.search(&query, 10);

        assert_eq!(results1, results2);
    }

    // Mass updates don't break existing searches
    #[test]
    fn test_concurrent_search_and_update() {
        let field = Arc::new(RwLock::new(create_test_field()));

        let search_thread = spawn({
            let field = field.clone();
            move || {
                for _ in 0..100 {
                    let f = field.read().unwrap();
                    f.search(&random_vector(128), 10).unwrap();
                }
            }
        });

        let update_thread = spawn({
            let field = field.clone();
            move || {
                for _ in 0..10 {
                    let mut f = field.write().unwrap();
                    f.update_masses(&generate_query_log(10)).unwrap();
                    thread::sleep(Duration::from_millis(10));
                }
            }
        });

        search_thread.join().unwrap();
        update_thread.join().unwrap();
    }
}

Backward Compatibility Strategy

Feature Flag: GEF behind gravitational-fields feature flag
Opt-in: Default config has semantic_weight = 1.0 (pure semantic search)
Migration Path: Provide tools to analyze existing graphs and recommend GEF settings
Serialization: Store mass data in separate file, gracefully handle missing data

Implementation Phases

Phase 1: Research Validation (2 weeks)

Goal: Validate physics-inspired approach on synthetic data

Implement basic potential field calculations
Create toy dataset with known high-frequency nodes
Measure search efficiency improvements
Compare against baselines (pure HNSW, PageRank-weighted)
Deliverable: Research report with benchmarks

Phase 2: Core Implementation (3 weeks)

Goal: Production-ready GEF implementation

Implement PotentialField and NodeMass structures
Develop mass update algorithms with decay
Integrate with HNSW search
Add configuration system
Implement caching and optimization
Deliverable: Working GEF module with unit tests

Phase 3: Integration (2 weeks)

Goal: Integrate with existing RuVector systems

Add query logging infrastructure
Implement mass persistence (save/load)
Create API bindings (Python, Node.js)
Add monitoring and metrics
Write integration tests
Deliverable: GEF integrated into main codebase

Phase 4: Optimization (2 weeks)

Goal: Production performance and tuning

Profile and optimize hot paths
Implement spatial indexing for large graphs
Add adaptive tuning (auto-adjust G constant)
Create visualization tools
Write documentation and examples
Deliverable: Production-ready, documented feature

Success Metrics

Performance Benchmarks

Metric	Baseline	Target	Measurement
Search latency (10K nodes)	1.2ms	<1.5ms	99th percentile
Search quality (recall@10)	0.95	>0.95	Standard test set
Hops to target	12.3	<9.0	Average path length
Memory overhead	0MB	<50MB	Per 1M nodes
Mass update latency	N/A	<10ms	Per 1K queries

Accuracy Metrics

Authority Discovery: High-authority nodes found in top-10 results
- Target: 80% of known authoritative nodes in top-10
Query Efficiency: Reduction in nodes visited per search
- Target: 30% fewer nodes visited for same recall
Adaptive Learning: Mass distribution correlates with true importance
- Target: Spearman correlation >0.7 with ground truth rankings

Comparison to Baselines

Test against:

Pure HNSW: Standard implementation without GEF
PageRank-weighted: Static global importance scores
Attention-based: Standard attention mechanism from Feature 1
Hybrid: GEF + Topology-Aware Routing (Feature 12)

Datasets:

Wikipedia embeddings (1M articles)
ArXiv papers with citation counts (500K papers)
E-commerce products with view counts (2M products)

Risks and Mitigations

Technical Risks

Risk	Impact	Probability	Mitigation
Mass updates too slow	High	Medium	Batch updates, incremental computation
Field calculations expensive	High	High	Spatial indexing, caching, approximations
Over-attraction to popular nodes	Medium	High	Softening parameter, max influence radius
Mass distribution unstable	Medium	Medium	Regularization, decay rates, bounds checking
Poor generalization	High	Low	Multi-dataset validation, adaptive tuning

Detailed Mitigations

Slow Mass Updates
- Implement incremental updates (only changed nodes)
- Batch query logs and process asynchronously
- Use lock-free data structures for concurrent updates
- Fallback: Update masses periodically (e.g., hourly) instead of real-time
Expensive Field Calculations
- Pre-compute potential fields for common queries
- Use spatial hashing for O(1) radius queries
- Approximate far-field contributions (multipole expansion)
- Fallback: Disable GEF for low-latency requirements
Over-Attraction to Popular Nodes
- Tune softening parameter ε to prevent singularities
- Cap maximum mass value
- Implement repulsive forces for diversity
- Fallback: Reduce gravitational weight in combined score
Unstable Mass Distribution
- Add L2 regularization to mass updates
- Implement mass normalization across graph
- Monitor mass variance, trigger rebalancing
- Fallback: Reset masses to uniform distribution
Poor Generalization
- Test on diverse datasets (text, images, graphs)
- Implement domain-specific mass functions
- Provide configuration templates for common use cases
- Fallback: Disable GEF for unsupported domains

References

Physics Inspiration

Newtonian gravity: F = G·m₁·m₂/r²
Potential fields in robotics path planning
N-body simulations and Barnes-Hut algorithms

PageRank and graph centrality measures
Attention mechanisms in transformers
Reinforcement learning value functions
Metric learning and embedding spaces

Implementation Precedents

Fast multipole methods (FMM)
Spatial hashing and KD-trees
Incremental graph algorithms
Online learning with exponential decay

20 KiB Raw Permalink Blame History Unescape Escape