Files
wifi-densepose/vendor/ruvector/docs/research/gnn-v2/10-gravitational-embedding-fields.md

20 KiB
Raw Permalink Blame History

Gravitational Embedding Fields (GEF)

Overview

Problem Statement

Current vector search treats all embeddings equally, ignoring the importance or frequency of access to nodes. High-value documents (frequently queried, authoritative sources) should have stronger influence on search trajectories, similar to how massive objects exert stronger gravitational pull in physics.

Proposed Solution

Implement a physics-inspired attention mechanism where embeddings exert "gravitational pull" proportional to their query frequency and importance. Search follows gradient descent through a potential field, naturally routing toward high-value nodes before exploring local neighborhoods.

Expected Benefits

  • 30-50% reduction in search hops: High-frequency nodes act as routing landmarks
  • 15-25% improved relevance: Important documents discovered earlier in search
  • Adaptive importance: Automatically learns document authority from usage patterns
  • Natural load balancing: Popular nodes become graph hubs, improving overall connectivity

Novelty Claim

First application of gravitational field dynamics to vector search. Unlike PageRank (global static scores) or attention mechanisms (pairwise interactions), GEF creates a continuous potential field that guides search trajectories dynamically based on real-time usage patterns.

Technical Design

Architecture Diagram

┌─────────────────────────────────────────────────────────────┐
│                   Gravitational Field Layer                  │
│                                                              │
│  ┌──────────┐      ┌──────────┐      ┌──────────┐         │
│  │ Query    │      │ Potential│      │ Gradient │         │
│  │ Vector   │─────▶│ Field    │─────▶│ Descent  │─────▶   │
│  │ (q)      │      │ Φ(x)     │      │ ∇Φ(x)    │  Path   │
│  └──────────┘      └──────────┘      └──────────┘         │
│       │                  │                  │              │
│       │                  ▼                  │              │
│       │         ┌──────────────────┐        │              │
│       │         │  Mass Assignment │        │              │
│       │         │  m_i = f(freq_i) │        │              │
│       │         └──────────────────┘        │              │
│       │                  │                  │              │
│       ▼                  ▼                  ▼              │
│  ┌────────────────────────────────────────────────┐       │
│  │         HNSW Graph with Masses                 │       │
│  │                                                 │       │
│  │   ○─────○─────●═════●─────○                   │       │
│  │   │     │     ║     ║     │                   │       │
│  │   ○     ●═════●     ●─────○    ● = high mass  │       │
│  │   │     ║     │     ║     │    ○ = low mass   │       │
│  │   ○─────●─────○─────●═════○    ═ = strong     │       │
│  │                              pull              │       │
│  └────────────────────────────────────────────────┘       │
└─────────────────────────────────────────────────────────────┘

Core Data Structures

/// Gravitational mass and frequency tracking for each node
#[derive(Clone, Debug)]
pub struct NodeMass {
    /// Effective gravitational mass (learned from query frequency)
    pub mass: f32,

    /// Query frequency counter (exponential moving average)
    pub query_frequency: f64,

    /// Last update timestamp
    pub last_update: SystemTime,

    /// Decay rate for frequency (default: 0.95)
    pub decay_rate: f32,
}

/// Gravitational field configuration
#[derive(Clone, Debug)]
pub struct GravitationalFieldConfig {
    /// Gravitational constant (strength of attraction)
    pub g_constant: f32,  // default: 1.0

    /// Mass function type
    pub mass_function: MassFunction,

    /// Maximum influence radius (in embedding space)
    pub max_radius: f32,  // default: 10.0

    /// Softening parameter (prevents singularities at r=0)
    pub softening: f32,   // default: 0.1

    /// Field update frequency
    pub update_interval: Duration,
}

/// Mass calculation strategies
#[derive(Clone, Debug)]
pub enum MassFunction {
    /// Linear: m = frequency
    Linear,

    /// Logarithmic: m = log(1 + frequency)
    Logarithmic,

    /// Square root: m = sqrt(frequency)
    SquareRoot,

    /// Custom function
    Custom(fn(f64) -> f32),
}

/// Gravitational potential field
pub struct PotentialField {
    /// Node masses indexed by node ID
    masses: Vec<NodeMass>,

    /// Spatial index for fast radius queries
    spatial_index: KDTree<NodeId>,

    /// Configuration
    config: GravitationalFieldConfig,

    /// Cached potential values (invalidated on mass updates)
    potential_cache: LruCache<(NodeId, NodeId), f32>,
}

/// Search path with gravitational guidance
pub struct GravitationalSearchPath {
    /// Visited nodes
    pub visited: Vec<NodeId>,

    /// Potential energy at each step
    pub potentials: Vec<f32>,

    /// Gradient magnitudes
    pub gradients: Vec<f32>,

    /// Total energy consumed
    pub total_energy: f32,
}

Key Algorithms

// Pseudocode for gravitational field search

fn gravitational_search(
    query: &[f32],
    field: &PotentialField,
    graph: &HnswGraph,
    k: usize
) -> Vec<NodeId> {
    // Initialize at entry point
    let mut current = graph.entry_point;
    let mut visited = HashSet::new();
    let mut candidates = BinaryHeap::new();

    // Calculate initial potential
    let mut potential = field.calculate_potential(query, current);

    while !converged(&candidates, k) {
        visited.insert(current);

        // Get neighbors from HNSW graph
        let neighbors = graph.get_neighbors(current, layer=0);

        for neighbor in neighbors {
            if visited.contains(&neighbor) { continue; }

            // Calculate gravitational force contribution
            let neighbor_mass = field.get_mass(neighbor);
            let distance = euclidean_distance(query, graph.get_embedding(neighbor));

            // Gravitational potential: Φ = -G * m / (r + ε)
            // where ε is softening parameter
            let grav_potential = -field.config.g_constant * neighbor_mass
                               / (distance + field.config.softening);

            // Combine embedding similarity with gravitational pull
            let similarity = cosine_similarity(query, graph.get_embedding(neighbor));

            // Total potential: combine semantic similarity and gravitational field
            // α controls balance (default: 0.7 semantic, 0.3 gravitational)
            let total_potential = 0.7 * similarity + 0.3 * grav_potential;

            candidates.push((neighbor, total_potential));
        }

        // Follow gradient: move to node with lowest potential
        current = candidates.pop().unwrap().0;
        potential = field.calculate_potential(query, current);
    }

    // Return top-k by final similarity
    candidates.into_sorted_vec()
        .iter()
        .take(k)
        .map(|(id, _)| *id)
        .collect()
}

// Mass update from query patterns
fn update_masses(field: &mut PotentialField, query_log: &[QueryEvent]) {
    for event in query_log {
        for visited_node in &event.visited_nodes {
            let mass = &mut field.masses[*visited_node];

            // Exponential moving average of query frequency
            let time_delta = event.timestamp.duration_since(mass.last_update);
            let decay = mass.decay_rate.powf(time_delta.as_secs_f32() / 3600.0);

            mass.query_frequency = mass.query_frequency * decay as f64 + 1.0;

            // Update mass based on frequency
            mass.mass = match field.config.mass_function {
                MassFunction::Linear => mass.query_frequency as f32,
                MassFunction::Logarithmic => (1.0 + mass.query_frequency).ln() as f32,
                MassFunction::SquareRoot => mass.query_frequency.sqrt() as f32,
                MassFunction::Custom(f) => f(mass.query_frequency),
            };

            mass.last_update = event.timestamp;
        }
    }

    // Invalidate potential cache
    field.potential_cache.clear();

    // Rebuild spatial index if significant changes
    if should_rebuild_index(field) {
        field.rebuild_spatial_index();
    }
}

API Design

/// Public API for Gravitational Embedding Fields
pub trait GravitationalField {
    /// Create new gravitational field for graph
    fn new(graph: &HnswGraph, config: GravitationalFieldConfig) -> Self;

    /// Search with gravitational guidance
    fn search(
        &self,
        query: &[f32],
        k: usize,
        options: SearchOptions,
    ) -> Result<Vec<SearchResult>, GefError>;

    /// Update masses from query log
    fn update_masses(&mut self, query_log: &[QueryEvent]) -> Result<(), GefError>;

    /// Get mass for specific node
    fn get_mass(&self, node_id: NodeId) -> f32;

    /// Calculate potential at point
    fn calculate_potential(&self, point: &[f32], reference: NodeId) -> f32;

    /// Calculate gradient at point
    fn calculate_gradient(&self, point: &[f32]) -> Vec<f32>;

    /// Export field visualization data
    fn export_field(&self, resolution: usize) -> FieldVisualization;

    /// Get field statistics
    fn statistics(&self) -> FieldStatistics;
}

/// Search options for GEF
#[derive(Clone, Debug)]
pub struct SearchOptions {
    /// Balance between semantic similarity and gravitational pull (0.0-1.0)
    pub semantic_weight: f32,

    /// Maximum search steps
    pub max_steps: usize,

    /// Enable path recording
    pub record_path: bool,

    /// Convergence threshold
    pub convergence_threshold: f32,
}

/// Statistics about gravitational field
#[derive(Clone, Debug)]
pub struct FieldStatistics {
    /// Total number of nodes
    pub total_nodes: usize,

    /// Mass distribution (min, max, mean, median)
    pub mass_distribution: Distribution,

    /// Number of high-mass nodes (top 10%)
    pub high_mass_nodes: usize,

    /// Average query frequency
    pub avg_query_frequency: f64,

    /// Last update timestamp
    pub last_update: SystemTime,
}

Integration Points

Affected Crates/Modules

  1. crates/ruvector-core/src/hnsw/

    • Modify search algorithm to accept potential field guidance
    • Add hooks for mass updates on queries
    • Extend node metadata to store mass values
  2. crates/ruvector-gnn/src/attention/

    • Integrate GEF as attention mechanism variant
    • Combine with existing attention patterns
  3. crates/ruvector-core/src/distance/

    • Add potential field distance metrics
    • Implement gradient calculation utilities

New Modules to Create

  1. crates/ruvector-gnn/src/gravitational/

    • field.rs - Core potential field implementation
    • mass.rs - Mass calculation and updates
    • search.rs - Gravitational-guided search algorithms
    • config.rs - Configuration and tuning
    • visualization.rs - Field visualization utilities
  2. crates/ruvector-core/src/query_log/

    • logger.rs - Query event logging
    • analyzer.rs - Query pattern analysis
    • replay.rs - Query replay for testing

Dependencies on Other Features

  • Feature 11 (Causal Attention Networks): GEF can respect causal ordering by preventing backward gravitational pull
  • Feature 12 (Topology-Aware Gradient Routing): Combine graph topology with gravitational field for hybrid routing
  • Feature 13 (Embedding Crystallization): High-mass nodes serve as natural crystallization nuclei

Regression Prevention

Existing Functionality at Risk

  1. Standard HNSW Search Performance

    • Risk: Gravitational calculations add overhead
    • Prevention: Make GEF optional, benchmark against baseline
  2. Deterministic Search Results

    • Risk: Mass updates change results over time
    • Prevention: Add frozen_field mode for reproducible searches
  3. Memory Usage

    • Risk: Additional mass metadata per node
    • Prevention: Use compact representations (f32 instead of f64), lazy cache
  4. Concurrent Queries

    • Risk: Race conditions in mass updates
    • Prevention: Use atomic updates or batch processing

Test Cases to Prevent Regressions

#[cfg(test)]
mod regression_tests {
    // Baseline performance should not degrade
    #[test]
    fn test_gef_disabled_matches_baseline() {
        let graph = create_test_graph(10000);
        let query = random_vector(128);

        let baseline_results = graph.search(&query, 10);

        let gef_field = GravitationalField::new(&graph, GravitationalFieldConfig {
            semantic_weight: 1.0,  // Pure semantic search
            ..Default::default()
        });
        let gef_results = gef_field.search(&query, 10);

        assert_eq!(baseline_results, gef_results);
    }

    // Frozen field produces deterministic results
    #[test]
    fn test_frozen_field_deterministic() {
        let mut field = create_test_field();
        field.freeze();

        let query = random_vector(128);
        let results1 = field.search(&query, 10);
        let results2 = field.search(&query, 10);

        assert_eq!(results1, results2);
    }

    // Mass updates don't break existing searches
    #[test]
    fn test_concurrent_search_and_update() {
        let field = Arc::new(RwLock::new(create_test_field()));

        let search_thread = spawn({
            let field = field.clone();
            move || {
                for _ in 0..100 {
                    let f = field.read().unwrap();
                    f.search(&random_vector(128), 10).unwrap();
                }
            }
        });

        let update_thread = spawn({
            let field = field.clone();
            move || {
                for _ in 0..10 {
                    let mut f = field.write().unwrap();
                    f.update_masses(&generate_query_log(10)).unwrap();
                    thread::sleep(Duration::from_millis(10));
                }
            }
        });

        search_thread.join().unwrap();
        update_thread.join().unwrap();
    }
}

Backward Compatibility Strategy

  1. Feature Flag: GEF behind gravitational-fields feature flag
  2. Opt-in: Default config has semantic_weight = 1.0 (pure semantic search)
  3. Migration Path: Provide tools to analyze existing graphs and recommend GEF settings
  4. Serialization: Store mass data in separate file, gracefully handle missing data

Implementation Phases

Phase 1: Research Validation (2 weeks)

Goal: Validate physics-inspired approach on synthetic data

  • Implement basic potential field calculations
  • Create toy dataset with known high-frequency nodes
  • Measure search efficiency improvements
  • Compare against baselines (pure HNSW, PageRank-weighted)
  • Deliverable: Research report with benchmarks

Phase 2: Core Implementation (3 weeks)

Goal: Production-ready GEF implementation

  • Implement PotentialField and NodeMass structures
  • Develop mass update algorithms with decay
  • Integrate with HNSW search
  • Add configuration system
  • Implement caching and optimization
  • Deliverable: Working GEF module with unit tests

Phase 3: Integration (2 weeks)

Goal: Integrate with existing RuVector systems

  • Add query logging infrastructure
  • Implement mass persistence (save/load)
  • Create API bindings (Python, Node.js)
  • Add monitoring and metrics
  • Write integration tests
  • Deliverable: GEF integrated into main codebase

Phase 4: Optimization (2 weeks)

Goal: Production performance and tuning

  • Profile and optimize hot paths
  • Implement spatial indexing for large graphs
  • Add adaptive tuning (auto-adjust G constant)
  • Create visualization tools
  • Write documentation and examples
  • Deliverable: Production-ready, documented feature

Success Metrics

Performance Benchmarks

Metric Baseline Target Measurement
Search latency (10K nodes) 1.2ms <1.5ms 99th percentile
Search quality (recall@10) 0.95 >0.95 Standard test set
Hops to target 12.3 <9.0 Average path length
Memory overhead 0MB <50MB Per 1M nodes
Mass update latency N/A <10ms Per 1K queries

Accuracy Metrics

  1. Authority Discovery: High-authority nodes found in top-10 results

    • Target: 80% of known authoritative nodes in top-10
  2. Query Efficiency: Reduction in nodes visited per search

    • Target: 30% fewer nodes visited for same recall
  3. Adaptive Learning: Mass distribution correlates with true importance

    • Target: Spearman correlation >0.7 with ground truth rankings

Comparison to Baselines

Test against:

  1. Pure HNSW: Standard implementation without GEF
  2. PageRank-weighted: Static global importance scores
  3. Attention-based: Standard attention mechanism from Feature 1
  4. Hybrid: GEF + Topology-Aware Routing (Feature 12)

Datasets:

  • Wikipedia embeddings (1M articles)
  • ArXiv papers with citation counts (500K papers)
  • E-commerce products with view counts (2M products)

Risks and Mitigations

Technical Risks

Risk Impact Probability Mitigation
Mass updates too slow High Medium Batch updates, incremental computation
Field calculations expensive High High Spatial indexing, caching, approximations
Over-attraction to popular nodes Medium High Softening parameter, max influence radius
Mass distribution unstable Medium Medium Regularization, decay rates, bounds checking
Poor generalization High Low Multi-dataset validation, adaptive tuning

Detailed Mitigations

  1. Slow Mass Updates

    • Implement incremental updates (only changed nodes)
    • Batch query logs and process asynchronously
    • Use lock-free data structures for concurrent updates
    • Fallback: Update masses periodically (e.g., hourly) instead of real-time
  2. Expensive Field Calculations

    • Pre-compute potential fields for common queries
    • Use spatial hashing for O(1) radius queries
    • Approximate far-field contributions (multipole expansion)
    • Fallback: Disable GEF for low-latency requirements
  3. Over-Attraction to Popular Nodes

    • Tune softening parameter ε to prevent singularities
    • Cap maximum mass value
    • Implement repulsive forces for diversity
    • Fallback: Reduce gravitational weight in combined score
  4. Unstable Mass Distribution

    • Add L2 regularization to mass updates
    • Implement mass normalization across graph
    • Monitor mass variance, trigger rebalancing
    • Fallback: Reset masses to uniform distribution
  5. Poor Generalization

    • Test on diverse datasets (text, images, graphs)
    • Implement domain-specific mass functions
    • Provide configuration templates for common use cases
    • Fallback: Disable GEF for unsupported domains

References

Physics Inspiration

  • Newtonian gravity: F = G·m₁·m₂/r²
  • Potential fields in robotics path planning
  • N-body simulations and Barnes-Hut algorithms
  • PageRank and graph centrality measures
  • Attention mechanisms in transformers
  • Reinforcement learning value functions
  • Metric learning and embedding spaces

Implementation Precedents

  • Fast multipole methods (FMM)
  • Spatial hashing and KD-trees
  • Incremental graph algorithms
  • Online learning with exponential decay