20 KiB
Gravitational Embedding Fields (GEF)
Overview
Problem Statement
Current vector search treats all embeddings equally, ignoring the importance or frequency of access to nodes. High-value documents (frequently queried, authoritative sources) should have stronger influence on search trajectories, similar to how massive objects exert stronger gravitational pull in physics.
Proposed Solution
Implement a physics-inspired attention mechanism where embeddings exert "gravitational pull" proportional to their query frequency and importance. Search follows gradient descent through a potential field, naturally routing toward high-value nodes before exploring local neighborhoods.
Expected Benefits
- 30-50% reduction in search hops: High-frequency nodes act as routing landmarks
- 15-25% improved relevance: Important documents discovered earlier in search
- Adaptive importance: Automatically learns document authority from usage patterns
- Natural load balancing: Popular nodes become graph hubs, improving overall connectivity
Novelty Claim
First application of gravitational field dynamics to vector search. Unlike PageRank (global static scores) or attention mechanisms (pairwise interactions), GEF creates a continuous potential field that guides search trajectories dynamically based on real-time usage patterns.
Technical Design
Architecture Diagram
┌─────────────────────────────────────────────────────────────┐
│ Gravitational Field Layer │
│ │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ │
│ │ Query │ │ Potential│ │ Gradient │ │
│ │ Vector │─────▶│ Field │─────▶│ Descent │─────▶ │
│ │ (q) │ │ Φ(x) │ │ ∇Φ(x) │ Path │
│ └──────────┘ └──────────┘ └──────────┘ │
│ │ │ │ │
│ │ ▼ │ │
│ │ ┌──────────────────┐ │ │
│ │ │ Mass Assignment │ │ │
│ │ │ m_i = f(freq_i) │ │ │
│ │ └──────────────────┘ │ │
│ │ │ │ │
│ ▼ ▼ ▼ │
│ ┌────────────────────────────────────────────────┐ │
│ │ HNSW Graph with Masses │ │
│ │ │ │
│ │ ○─────○─────●═════●─────○ │ │
│ │ │ │ ║ ║ │ │ │
│ │ ○ ●═════● ●─────○ ● = high mass │ │
│ │ │ ║ │ ║ │ ○ = low mass │ │
│ │ ○─────●─────○─────●═════○ ═ = strong │ │
│ │ pull │ │
│ └────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────┘
Core Data Structures
/// Gravitational mass and frequency tracking for each node
#[derive(Clone, Debug)]
pub struct NodeMass {
/// Effective gravitational mass (learned from query frequency)
pub mass: f32,
/// Query frequency counter (exponential moving average)
pub query_frequency: f64,
/// Last update timestamp
pub last_update: SystemTime,
/// Decay rate for frequency (default: 0.95)
pub decay_rate: f32,
}
/// Gravitational field configuration
#[derive(Clone, Debug)]
pub struct GravitationalFieldConfig {
/// Gravitational constant (strength of attraction)
pub g_constant: f32, // default: 1.0
/// Mass function type
pub mass_function: MassFunction,
/// Maximum influence radius (in embedding space)
pub max_radius: f32, // default: 10.0
/// Softening parameter (prevents singularities at r=0)
pub softening: f32, // default: 0.1
/// Field update frequency
pub update_interval: Duration,
}
/// Mass calculation strategies
#[derive(Clone, Debug)]
pub enum MassFunction {
/// Linear: m = frequency
Linear,
/// Logarithmic: m = log(1 + frequency)
Logarithmic,
/// Square root: m = sqrt(frequency)
SquareRoot,
/// Custom function
Custom(fn(f64) -> f32),
}
/// Gravitational potential field
pub struct PotentialField {
/// Node masses indexed by node ID
masses: Vec<NodeMass>,
/// Spatial index for fast radius queries
spatial_index: KDTree<NodeId>,
/// Configuration
config: GravitationalFieldConfig,
/// Cached potential values (invalidated on mass updates)
potential_cache: LruCache<(NodeId, NodeId), f32>,
}
/// Search path with gravitational guidance
pub struct GravitationalSearchPath {
/// Visited nodes
pub visited: Vec<NodeId>,
/// Potential energy at each step
pub potentials: Vec<f32>,
/// Gradient magnitudes
pub gradients: Vec<f32>,
/// Total energy consumed
pub total_energy: f32,
}
Key Algorithms
// Pseudocode for gravitational field search
fn gravitational_search(
query: &[f32],
field: &PotentialField,
graph: &HnswGraph,
k: usize
) -> Vec<NodeId> {
// Initialize at entry point
let mut current = graph.entry_point;
let mut visited = HashSet::new();
let mut candidates = BinaryHeap::new();
// Calculate initial potential
let mut potential = field.calculate_potential(query, current);
while !converged(&candidates, k) {
visited.insert(current);
// Get neighbors from HNSW graph
let neighbors = graph.get_neighbors(current, layer=0);
for neighbor in neighbors {
if visited.contains(&neighbor) { continue; }
// Calculate gravitational force contribution
let neighbor_mass = field.get_mass(neighbor);
let distance = euclidean_distance(query, graph.get_embedding(neighbor));
// Gravitational potential: Φ = -G * m / (r + ε)
// where ε is softening parameter
let grav_potential = -field.config.g_constant * neighbor_mass
/ (distance + field.config.softening);
// Combine embedding similarity with gravitational pull
let similarity = cosine_similarity(query, graph.get_embedding(neighbor));
// Total potential: combine semantic similarity and gravitational field
// α controls balance (default: 0.7 semantic, 0.3 gravitational)
let total_potential = 0.7 * similarity + 0.3 * grav_potential;
candidates.push((neighbor, total_potential));
}
// Follow gradient: move to node with lowest potential
current = candidates.pop().unwrap().0;
potential = field.calculate_potential(query, current);
}
// Return top-k by final similarity
candidates.into_sorted_vec()
.iter()
.take(k)
.map(|(id, _)| *id)
.collect()
}
// Mass update from query patterns
fn update_masses(field: &mut PotentialField, query_log: &[QueryEvent]) {
for event in query_log {
for visited_node in &event.visited_nodes {
let mass = &mut field.masses[*visited_node];
// Exponential moving average of query frequency
let time_delta = event.timestamp.duration_since(mass.last_update);
let decay = mass.decay_rate.powf(time_delta.as_secs_f32() / 3600.0);
mass.query_frequency = mass.query_frequency * decay as f64 + 1.0;
// Update mass based on frequency
mass.mass = match field.config.mass_function {
MassFunction::Linear => mass.query_frequency as f32,
MassFunction::Logarithmic => (1.0 + mass.query_frequency).ln() as f32,
MassFunction::SquareRoot => mass.query_frequency.sqrt() as f32,
MassFunction::Custom(f) => f(mass.query_frequency),
};
mass.last_update = event.timestamp;
}
}
// Invalidate potential cache
field.potential_cache.clear();
// Rebuild spatial index if significant changes
if should_rebuild_index(field) {
field.rebuild_spatial_index();
}
}
API Design
/// Public API for Gravitational Embedding Fields
pub trait GravitationalField {
/// Create new gravitational field for graph
fn new(graph: &HnswGraph, config: GravitationalFieldConfig) -> Self;
/// Search with gravitational guidance
fn search(
&self,
query: &[f32],
k: usize,
options: SearchOptions,
) -> Result<Vec<SearchResult>, GefError>;
/// Update masses from query log
fn update_masses(&mut self, query_log: &[QueryEvent]) -> Result<(), GefError>;
/// Get mass for specific node
fn get_mass(&self, node_id: NodeId) -> f32;
/// Calculate potential at point
fn calculate_potential(&self, point: &[f32], reference: NodeId) -> f32;
/// Calculate gradient at point
fn calculate_gradient(&self, point: &[f32]) -> Vec<f32>;
/// Export field visualization data
fn export_field(&self, resolution: usize) -> FieldVisualization;
/// Get field statistics
fn statistics(&self) -> FieldStatistics;
}
/// Search options for GEF
#[derive(Clone, Debug)]
pub struct SearchOptions {
/// Balance between semantic similarity and gravitational pull (0.0-1.0)
pub semantic_weight: f32,
/// Maximum search steps
pub max_steps: usize,
/// Enable path recording
pub record_path: bool,
/// Convergence threshold
pub convergence_threshold: f32,
}
/// Statistics about gravitational field
#[derive(Clone, Debug)]
pub struct FieldStatistics {
/// Total number of nodes
pub total_nodes: usize,
/// Mass distribution (min, max, mean, median)
pub mass_distribution: Distribution,
/// Number of high-mass nodes (top 10%)
pub high_mass_nodes: usize,
/// Average query frequency
pub avg_query_frequency: f64,
/// Last update timestamp
pub last_update: SystemTime,
}
Integration Points
Affected Crates/Modules
-
crates/ruvector-core/src/hnsw/- Modify search algorithm to accept potential field guidance
- Add hooks for mass updates on queries
- Extend node metadata to store mass values
-
crates/ruvector-gnn/src/attention/- Integrate GEF as attention mechanism variant
- Combine with existing attention patterns
-
crates/ruvector-core/src/distance/- Add potential field distance metrics
- Implement gradient calculation utilities
New Modules to Create
-
crates/ruvector-gnn/src/gravitational/field.rs- Core potential field implementationmass.rs- Mass calculation and updatessearch.rs- Gravitational-guided search algorithmsconfig.rs- Configuration and tuningvisualization.rs- Field visualization utilities
-
crates/ruvector-core/src/query_log/logger.rs- Query event logginganalyzer.rs- Query pattern analysisreplay.rs- Query replay for testing
Dependencies on Other Features
- Feature 11 (Causal Attention Networks): GEF can respect causal ordering by preventing backward gravitational pull
- Feature 12 (Topology-Aware Gradient Routing): Combine graph topology with gravitational field for hybrid routing
- Feature 13 (Embedding Crystallization): High-mass nodes serve as natural crystallization nuclei
Regression Prevention
Existing Functionality at Risk
-
Standard HNSW Search Performance
- Risk: Gravitational calculations add overhead
- Prevention: Make GEF optional, benchmark against baseline
-
Deterministic Search Results
- Risk: Mass updates change results over time
- Prevention: Add
frozen_fieldmode for reproducible searches
-
Memory Usage
- Risk: Additional mass metadata per node
- Prevention: Use compact representations (f32 instead of f64), lazy cache
-
Concurrent Queries
- Risk: Race conditions in mass updates
- Prevention: Use atomic updates or batch processing
Test Cases to Prevent Regressions
#[cfg(test)]
mod regression_tests {
// Baseline performance should not degrade
#[test]
fn test_gef_disabled_matches_baseline() {
let graph = create_test_graph(10000);
let query = random_vector(128);
let baseline_results = graph.search(&query, 10);
let gef_field = GravitationalField::new(&graph, GravitationalFieldConfig {
semantic_weight: 1.0, // Pure semantic search
..Default::default()
});
let gef_results = gef_field.search(&query, 10);
assert_eq!(baseline_results, gef_results);
}
// Frozen field produces deterministic results
#[test]
fn test_frozen_field_deterministic() {
let mut field = create_test_field();
field.freeze();
let query = random_vector(128);
let results1 = field.search(&query, 10);
let results2 = field.search(&query, 10);
assert_eq!(results1, results2);
}
// Mass updates don't break existing searches
#[test]
fn test_concurrent_search_and_update() {
let field = Arc::new(RwLock::new(create_test_field()));
let search_thread = spawn({
let field = field.clone();
move || {
for _ in 0..100 {
let f = field.read().unwrap();
f.search(&random_vector(128), 10).unwrap();
}
}
});
let update_thread = spawn({
let field = field.clone();
move || {
for _ in 0..10 {
let mut f = field.write().unwrap();
f.update_masses(&generate_query_log(10)).unwrap();
thread::sleep(Duration::from_millis(10));
}
}
});
search_thread.join().unwrap();
update_thread.join().unwrap();
}
}
Backward Compatibility Strategy
- Feature Flag: GEF behind
gravitational-fieldsfeature flag - Opt-in: Default config has
semantic_weight = 1.0(pure semantic search) - Migration Path: Provide tools to analyze existing graphs and recommend GEF settings
- Serialization: Store mass data in separate file, gracefully handle missing data
Implementation Phases
Phase 1: Research Validation (2 weeks)
Goal: Validate physics-inspired approach on synthetic data
- Implement basic potential field calculations
- Create toy dataset with known high-frequency nodes
- Measure search efficiency improvements
- Compare against baselines (pure HNSW, PageRank-weighted)
- Deliverable: Research report with benchmarks
Phase 2: Core Implementation (3 weeks)
Goal: Production-ready GEF implementation
- Implement
PotentialFieldandNodeMassstructures - Develop mass update algorithms with decay
- Integrate with HNSW search
- Add configuration system
- Implement caching and optimization
- Deliverable: Working GEF module with unit tests
Phase 3: Integration (2 weeks)
Goal: Integrate with existing RuVector systems
- Add query logging infrastructure
- Implement mass persistence (save/load)
- Create API bindings (Python, Node.js)
- Add monitoring and metrics
- Write integration tests
- Deliverable: GEF integrated into main codebase
Phase 4: Optimization (2 weeks)
Goal: Production performance and tuning
- Profile and optimize hot paths
- Implement spatial indexing for large graphs
- Add adaptive tuning (auto-adjust G constant)
- Create visualization tools
- Write documentation and examples
- Deliverable: Production-ready, documented feature
Success Metrics
Performance Benchmarks
| Metric | Baseline | Target | Measurement |
|---|---|---|---|
| Search latency (10K nodes) | 1.2ms | <1.5ms | 99th percentile |
| Search quality (recall@10) | 0.95 | >0.95 | Standard test set |
| Hops to target | 12.3 | <9.0 | Average path length |
| Memory overhead | 0MB | <50MB | Per 1M nodes |
| Mass update latency | N/A | <10ms | Per 1K queries |
Accuracy Metrics
-
Authority Discovery: High-authority nodes found in top-10 results
- Target: 80% of known authoritative nodes in top-10
-
Query Efficiency: Reduction in nodes visited per search
- Target: 30% fewer nodes visited for same recall
-
Adaptive Learning: Mass distribution correlates with true importance
- Target: Spearman correlation >0.7 with ground truth rankings
Comparison to Baselines
Test against:
- Pure HNSW: Standard implementation without GEF
- PageRank-weighted: Static global importance scores
- Attention-based: Standard attention mechanism from Feature 1
- Hybrid: GEF + Topology-Aware Routing (Feature 12)
Datasets:
- Wikipedia embeddings (1M articles)
- ArXiv papers with citation counts (500K papers)
- E-commerce products with view counts (2M products)
Risks and Mitigations
Technical Risks
| Risk | Impact | Probability | Mitigation |
|---|---|---|---|
| Mass updates too slow | High | Medium | Batch updates, incremental computation |
| Field calculations expensive | High | High | Spatial indexing, caching, approximations |
| Over-attraction to popular nodes | Medium | High | Softening parameter, max influence radius |
| Mass distribution unstable | Medium | Medium | Regularization, decay rates, bounds checking |
| Poor generalization | High | Low | Multi-dataset validation, adaptive tuning |
Detailed Mitigations
-
Slow Mass Updates
- Implement incremental updates (only changed nodes)
- Batch query logs and process asynchronously
- Use lock-free data structures for concurrent updates
- Fallback: Update masses periodically (e.g., hourly) instead of real-time
-
Expensive Field Calculations
- Pre-compute potential fields for common queries
- Use spatial hashing for O(1) radius queries
- Approximate far-field contributions (multipole expansion)
- Fallback: Disable GEF for low-latency requirements
-
Over-Attraction to Popular Nodes
- Tune softening parameter ε to prevent singularities
- Cap maximum mass value
- Implement repulsive forces for diversity
- Fallback: Reduce gravitational weight in combined score
-
Unstable Mass Distribution
- Add L2 regularization to mass updates
- Implement mass normalization across graph
- Monitor mass variance, trigger rebalancing
- Fallback: Reset masses to uniform distribution
-
Poor Generalization
- Test on diverse datasets (text, images, graphs)
- Implement domain-specific mass functions
- Provide configuration templates for common use cases
- Fallback: Disable GEF for unsupported domains
References
Physics Inspiration
- Newtonian gravity: F = G·m₁·m₂/r²
- Potential fields in robotics path planning
- N-body simulations and Barnes-Hut algorithms
Related ML Techniques
- PageRank and graph centrality measures
- Attention mechanisms in transformers
- Reinforcement learning value functions
- Metric learning and embedding spaces
Implementation Precedents
- Fast multipole methods (FMM)
- Spatial hashing and KD-trees
- Incremental graph algorithms
- Online learning with exponential decay