Squashed 'vendor/ruvector/' content from commit b64c2172
git-subtree-dir: vendor/ruvector git-subtree-split: b64c21726f2bb37286d9ee36a7869fef60cc6900
docs/research/gnn-v2/10-gravitational-embedding-fields.md
# Gravitational Embedding Fields (GEF)

## Overview

### Problem Statement

Current vector search treats all embeddings equally and ignores how important a node is or how often it is accessed. High-value documents (frequently queried, authoritative sources) should have a stronger influence on search trajectories, much as massive objects exert a stronger gravitational pull in physics.

### Proposed Solution

Implement a physics-inspired attention mechanism in which embeddings exert a "gravitational pull" proportional to their query frequency and importance. Search follows gradient descent through a potential field, routing toward high-value nodes first and then exploring their local neighborhoods.
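
One way to write the potential field down, as a sketch: the per-node term matches the softened potential used in the search pseudocode of this document, while summing the terms (superposition) and cutting them off at a radius $R$ (the role `max_radius` would play) are assumptions made here for illustration.

```latex
\Phi(x) \;=\; -\sum_{i \,:\, \lVert x - x_i \rVert \le R}
  \frac{G \, m_i}{\lVert x - x_i \rVert + \varepsilon},
\qquad
m_i = f(\mathrm{freq}_i)
```

Here $x_i$ is the embedding of node $i$, $m_i$ its mass, $G$ the gravitational constant, and $\varepsilon$ the softening parameter that keeps the term finite at zero distance. Under this reading, "following the gradient" means stepping along $-\nabla\Phi$, which points toward nearby heavy nodes.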

### Expected Benefits

- **30-50% reduction in search hops**: High-frequency nodes act as routing landmarks
- **15-25% improved relevance**: Important documents are discovered earlier in the search
- **Adaptive importance**: Document authority is learned automatically from usage patterns
- **Natural load balancing**: Popular nodes become graph hubs, improving overall connectivity

### Novelty Claim

To our knowledge, this is the first application of gravitational field dynamics to vector search. Unlike PageRank (global static scores) or attention mechanisms (pairwise interactions), GEF creates a continuous potential field that guides search trajectories dynamically based on real-time usage patterns.

## Technical Design

### Architecture Diagram

```
┌─────────────────────────────────────────────────────────────┐
│                  Gravitational Field Layer                  │
│                                                             │
│  ┌──────────┐      ┌──────────┐      ┌──────────┐           │
│  │  Query   │      │ Potential│      │ Gradient │           │
│  │  Vector  │─────▶│  Field   │─────▶│ Descent  │─────▶     │
│  │   (q)    │      │   Φ(x)   │      │  ∇Φ(x)   │   Path    │
│  └──────────┘      └──────────┘      └──────────┘           │
│       │                 │                 │                 │
│       │                 ▼                 │                 │
│       │        ┌──────────────────┐       │                 │
│       │        │ Mass Assignment  │       │                 │
│       │        │ m_i = f(freq_i)  │       │                 │
│       │        └──────────────────┘       │                 │
│       │                 │                 │                 │
│       ▼                 ▼                 ▼                 │
│  ┌────────────────────────────────────────────────┐         │
│  │             HNSW Graph with Masses             │         │
│  │                                                │         │
│  │    ○─────○─────●═════●─────○                   │         │
│  │    │     │     ║     ║     │                   │         │
│  │    ○     ●═════●     ●─────○   ● = high mass   │         │
│  │    │     ║     │     ║     │   ○ = low mass    │         │
│  │    ○─────●─────○─────●═════○   ═ = strong      │         │
│  │                                    pull        │         │
│  └────────────────────────────────────────────────┘         │
└─────────────────────────────────────────────────────────────┘
```
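
The Potential Field and Gradient Descent stages of the diagram can be sketched in a few lines. This is a minimal illustration rather than the planned implementation: the `Body` type and the free functions `potential` and `gradient` are hypothetical names, every body contributes regardless of distance (no `max_radius` cutoff), and the gradient is the analytic derivative of the softened potential Φ = -G·m / (r + ε).

```rust
/// A point mass in embedding space (hypothetical type for this sketch).
struct Body {
    position: Vec<f32>,
    mass: f32,
}

/// Euclidean distance between two equally sized vectors.
fn distance(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| (x - y).powi(2)).sum::<f32>().sqrt()
}

/// Φ(x) = -Σ G·m_i / (r_i + ε), summed over all bodies.
fn potential(x: &[f32], bodies: &[Body], g: f32, eps: f32) -> f32 {
    bodies
        .iter()
        .map(|b| -g * b.mass / (distance(x, &b.position) + eps))
        .sum()
}

/// ∇Φ(x) = Σ G·m_i·(x - x_i) / (r_i·(r_i + ε)²).
/// A body exactly at `x` (r = 0) is skipped because its direction is undefined.
fn gradient(x: &[f32], bodies: &[Body], g: f32, eps: f32) -> Vec<f32> {
    let mut grad = vec![0.0; x.len()];
    for b in bodies {
        let r = distance(x, &b.position);
        if r == 0.0 {
            continue;
        }
        let scale = g * b.mass / (r * (r + eps).powi(2));
        for (gi, (xi, pi)) in grad.iter_mut().zip(x.iter().zip(&b.position)) {
            *gi += scale * (xi - pi);
        }
    }
    grad
}

fn main() {
    // A heavy node at x = +1 and a light node at x = -1.
    let bodies = vec![
        Body { position: vec![1.0, 0.0], mass: 4.0 },
        Body { position: vec![-1.0, 0.0], mass: 1.0 },
    ];
    let x = [0.0, 0.0];
    let phi = potential(&x, &bodies, 1.0, 0.1);
    let grad = gradient(&x, &bodies, 1.0, 0.1);
    // The descent direction is -grad; here it points toward the heavier node.
    println!("phi = {phi:.3}, grad = {grad:?}");
}
```

In this toy setup the first gradient component is negative, so a descent step along `-grad` moves the query point toward the node with mass 4.0, which is the routing behavior the diagram describes.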

### Core Data Structures

```rust
use std::time::{Duration, SystemTime};

// `NodeId`, `KDTree`, and `LruCache` are assumed to be provided by the
// surrounding crate or its dependencies.

/// Gravitational mass and frequency tracking for each node
#[derive(Clone, Debug)]
pub struct NodeMass {
    /// Effective gravitational mass (learned from query frequency)
    pub mass: f32,

    /// Query frequency counter (exponential moving average)
    pub query_frequency: f64,

    /// Last update timestamp
    pub last_update: SystemTime,

    /// Decay rate for frequency (default: 0.95)
    pub decay_rate: f32,
}

/// Gravitational field configuration
#[derive(Clone, Debug)]
pub struct GravitationalFieldConfig {
    /// Gravitational constant (strength of attraction)
    pub g_constant: f32, // default: 1.0

    /// Mass function type
    pub mass_function: MassFunction,

    /// Maximum influence radius (in embedding space)
    pub max_radius: f32, // default: 10.0

    /// Softening parameter (prevents singularities at r=0)
    pub softening: f32, // default: 0.1

    /// Field update frequency
    pub update_interval: Duration,
}

/// Mass calculation strategies
#[derive(Clone, Debug)]
pub enum MassFunction {
    /// Linear: m = frequency
    Linear,

    /// Logarithmic: m = log(1 + frequency)
    Logarithmic,

    /// Square root: m = sqrt(frequency)
    SquareRoot,

    /// Custom function
    Custom(fn(f64) -> f32),
}

/// Gravitational potential field
pub struct PotentialField {
    /// Node masses indexed by node ID
    masses: Vec<NodeMass>,

    /// Spatial index for fast radius queries
    spatial_index: KDTree<NodeId>,

    /// Configuration
    config: GravitationalFieldConfig,

    /// Cached potential values (invalidated on mass updates)
    potential_cache: LruCache<(NodeId, NodeId), f32>,
}

/// Search path with gravitational guidance
pub struct GravitationalSearchPath {
    /// Visited nodes
    pub visited: Vec<NodeId>,

    /// Potential energy at each step
    pub potentials: Vec<f32>,

    /// Gradient magnitudes
    pub gradients: Vec<f32>,

    /// Total energy consumed
    pub total_energy: f32,
}
```
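
The defaults noted in the comments above could be collected in a `Default` implementation. The sketch below is self-contained, so it repeats a trimmed copy of the config types (without the `Custom` variant); the values for `g_constant`, `max_radius`, and `softening` come from those comments, while the default `mass_function` and `update_interval` are assumptions chosen only for illustration.

```rust
use std::time::Duration;

/// Trimmed copy of the mass function enum (no `Custom` variant) for this sketch.
#[derive(Clone, Debug, PartialEq)]
pub enum MassFunction {
    Linear,
    Logarithmic,
    SquareRoot,
}

#[derive(Clone, Debug)]
pub struct GravitationalFieldConfig {
    pub g_constant: f32,
    pub mass_function: MassFunction,
    pub max_radius: f32,
    pub softening: f32,
    pub update_interval: Duration,
}

impl Default for GravitationalFieldConfig {
    fn default() -> Self {
        Self {
            g_constant: 1.0,                          // documented default
            mass_function: MassFunction::Logarithmic, // assumption: not specified in the design
            max_radius: 10.0,                         // documented default
            softening: 0.1,                           // documented default
            update_interval: Duration::from_secs(60), // assumption: not specified in the design
        }
    }
}

fn main() {
    // Override one field and take the rest from the defaults.
    let config = GravitationalFieldConfig {
        g_constant: 0.5,
        ..Default::default()
    };
    println!("{config:?}");
}
```

A logarithmic default is a cautious guess because it damps the influence of very popular nodes; the design leaves the actual choice open.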

### Key Algorithms

```rust
// Sketch of gravitational field search. `HnswGraph`, `NodeId`, `QueryEvent`,
// the distance helpers, `converged`, and `should_rebuild_index` are assumed
// to exist elsewhere in the crate.
use std::cmp::Ordering;
use std::collections::{BinaryHeap, HashSet};

/// Heap entry ordered by score so that `BinaryHeap::pop` yields the best candidate
/// (`f32` alone is not `Ord`, so it cannot be stored in the heap directly).
struct Scored {
    score: f32,
    node: NodeId,
}

impl PartialEq for Scored {
    fn eq(&self, other: &Self) -> bool {
        self.cmp(other) == Ordering::Equal
    }
}
impl Eq for Scored {}
impl PartialOrd for Scored {
    fn partial_cmp(&self, other: &Self) -> Option<Ordering> {
        Some(self.cmp(other))
    }
}
impl Ord for Scored {
    fn cmp(&self, other: &Self) -> Ordering {
        self.score.total_cmp(&other.score)
    }
}

fn gravitational_search(
    query: &[f32],
    field: &PotentialField,
    graph: &HnswGraph,
    k: usize,
    semantic_weight: f32, // α in [0, 1], e.g. 0.7 semantic / 0.3 gravitational
) -> Vec<NodeId> {
    // Initialize at entry point
    let mut current = graph.entry_point;
    let mut visited = HashSet::new();
    let mut candidates = BinaryHeap::new();
    let mut results: Vec<(NodeId, f32)> = Vec::new(); // (node, similarity)

    while !converged(&results, k) {
        visited.insert(current);
        results.push((current, cosine_similarity(query, graph.get_embedding(current))));

        // Get neighbors from the base layer (layer 0) of the HNSW graph
        for neighbor in graph.get_neighbors(current, 0) {
            if visited.contains(&neighbor) {
                continue;
            }
            let embedding = graph.get_embedding(neighbor);

            // Gravitational potential: Φ = -G * m / (r + ε),
            // where ε is the softening parameter
            let distance = euclidean_distance(query, embedding);
            let grav_potential = -field.config.g_constant * field.get_mass(neighbor)
                / (distance + field.config.softening);

            // Combined score, higher is better. Φ is negative near heavy
            // nodes, so it is subtracted to reward gravitational pull.
            let similarity = cosine_similarity(query, embedding);
            let score = semantic_weight * similarity
                - (1.0 - semantic_weight) * grav_potential;

            candidates.push(Scored { score, node: neighbor });
        }

        // Follow the field: move to the best-scoring candidate that has not
        // been visited yet; stop when the frontier is exhausted.
        let next = std::iter::from_fn(|| candidates.pop())
            .find(|c| !visited.contains(&c.node));
        match next {
            Some(c) => current = c.node,
            None => break,
        }
    }

    // Return top-k visited nodes by final similarity (highest first)
    results.sort_by(|a, b| b.1.total_cmp(&a.1));
    results.into_iter().take(k).map(|(id, _)| id).collect()
}

// Mass update from query patterns
fn update_masses(field: &mut PotentialField, query_log: &[QueryEvent]) {
    for event in query_log {
        for visited_node in &event.visited_nodes {
            let mass = &mut field.masses[*visited_node];

            // Exponential moving average of query frequency.
            // `duration_since` fails if the event predates the last update;
            // treat that case as zero elapsed time.
            let time_delta = event
                .timestamp
                .duration_since(mass.last_update)
                .unwrap_or_default();
            let decay = mass.decay_rate.powf(time_delta.as_secs_f32() / 3600.0);

            mass.query_frequency = mass.query_frequency * f64::from(decay) + 1.0;

            // Update mass based on frequency
            mass.mass = match field.config.mass_function {
                MassFunction::Linear => mass.query_frequency as f32,
                MassFunction::Logarithmic => (1.0 + mass.query_frequency).ln() as f32,
                MassFunction::SquareRoot => mass.query_frequency.sqrt() as f32,
                MassFunction::Custom(f) => f(mass.query_frequency),
            };

            mass.last_update = event.timestamp;
        }
    }

    // Invalidate potential cache
    field.potential_cache.clear();

    // Rebuild spatial index if significant changes
    if should_rebuild_index(field) {
        field.rebuild_spatial_index();
    }
}
```
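
To make the decay arithmetic in `update_masses` concrete, here is a small self-contained walk-through. The helper name `bump_frequency` is hypothetical; it mirrors the update rule above, with `decay_rate` applied per hour of elapsed time and one hit added afterwards.

```rust
use std::time::Duration;

/// Decay `freq` for the elapsed time, then count the new hit.
/// `decay_rate` is applied per hour, as in `update_masses`.
fn bump_frequency(freq: f64, decay_rate: f32, elapsed: Duration) -> f64 {
    let hours = elapsed.as_secs_f32() / 3600.0;
    let decay = f64::from(decay_rate.powf(hours));
    freq * decay + 1.0
}

fn main() {
    let mut freq: f64 = 10.0;

    // Two hours pass before the next hit: 10 * 0.95^2 + 1 = 10.025
    // (up to f32 rounding of the decay factor).
    freq = bump_frequency(freq, 0.95, Duration::from_secs(2 * 3600));
    println!("frequency   = {freq:.3}");

    // The three built-in mass functions applied to the same frequency.
    println!("linear      = {:.3}", freq as f32);
    println!("logarithmic = {:.3}", (1.0 + freq).ln() as f32);
    println!("square root = {:.3}", freq.sqrt() as f32);
}
```

With a per-hour rate of 0.95, a node that receives no hits loses about half of its frequency in roughly 13.5 hours, since 0.95^13.5 ≈ 0.5; that figure is a useful sanity check when tuning `decay_rate`.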

### API Design

```rust
/// Public API for Gravitational Embedding Fields
pub trait GravitationalField {
    /// Create new gravitational field for graph
    fn new(graph: &HnswGraph, config: GravitationalFieldConfig) -> Self;

    /// Search with gravitational guidance
    fn search(
        &self,
        query: &[f32],
        k: usize,
        options: SearchOptions,
    ) -> Result<Vec<SearchResult>, GefError>;

    /// Update masses from query log
    fn update_masses(&mut self, query_log: &[QueryEvent]) -> Result<(), GefError>;

    /// Get mass for specific node
    fn get_mass(&self, node_id: NodeId) -> f32;

    /// Calculate potential at point
    fn calculate_potential(&self, point: &[f32], reference: NodeId) -> f32;

    /// Calculate gradient at point
    fn calculate_gradient(&self, point: &[f32]) -> Vec<f32>;

    /// Export field visualization data
    fn export_field(&self, resolution: usize) -> FieldVisualization;

    /// Get field statistics
    fn statistics(&self) -> FieldStatistics;
}

/// Search options for GEF
#[derive(Clone, Debug)]
pub struct SearchOptions {
    /// Balance between semantic similarity and gravitational pull (0.0-1.0)
    pub semantic_weight: f32,

    /// Maximum search steps
    pub max_steps: usize,

    /// Enable path recording
    pub record_path: bool,

    /// Convergence threshold
    pub convergence_threshold: f32,
}

/// Statistics about gravitational field
#[derive(Clone, Debug)]
pub struct FieldStatistics {
    /// Total number of nodes
    pub total_nodes: usize,

    /// Mass distribution (min, max, mean, median)
    pub mass_distribution: Distribution,

    /// Number of high-mass nodes (top 10%)
    pub high_mass_nodes: usize,

    /// Average query frequency
    pub avg_query_frequency: f64,

    /// Last update timestamp
    pub last_update: SystemTime,
}
```

## Integration Points

### Affected Crates/Modules

1. **`crates/ruvector-core/src/hnsw/`**
   - Modify search algorithm to accept potential field guidance
   - Add hooks for mass updates on queries
   - Extend node metadata to store mass values

2. **`crates/ruvector-gnn/src/attention/`**
   - Integrate GEF as attention mechanism variant
   - Combine with existing attention patterns

3. **`crates/ruvector-core/src/distance/`**
   - Add potential field distance metrics
   - Implement gradient calculation utilities

### New Modules to Create

1. **`crates/ruvector-gnn/src/gravitational/`**
   - `field.rs` - Core potential field implementation
   - `mass.rs` - Mass calculation and updates
   - `search.rs` - Gravitational-guided search algorithms
   - `config.rs` - Configuration and tuning
   - `visualization.rs` - Field visualization utilities

2. **`crates/ruvector-core/src/query_log/`**
   - `logger.rs` - Query event logging
   - `analyzer.rs` - Query pattern analysis
   - `replay.rs` - Query replay for testing
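
As a rough idea of what the `query_log` module might hold, the sketch below defines a `QueryEvent` with the two fields that `update_masses` reads (`timestamp` and `visited_nodes`) and a tiny in-memory logger that hands events out in batches. Everything else here is hypothetical: the `QueryLogger` type, its method names, and `NodeId` being a `usize` are assumptions for illustration only.

```rust
use std::time::SystemTime;

/// Assumption: node IDs are plain indices.
type NodeId = usize;

/// One logged query: when it ran and which nodes the search touched.
#[derive(Clone, Debug)]
pub struct QueryEvent {
    pub timestamp: SystemTime,
    pub visited_nodes: Vec<NodeId>,
}

/// Minimal in-memory logger (hypothetical); a real one would likely
/// persist events and bound its memory use.
#[derive(Default)]
pub struct QueryLogger {
    pending: Vec<QueryEvent>,
}

impl QueryLogger {
    /// Record a finished query with the current wall-clock time.
    pub fn record(&mut self, visited_nodes: Vec<NodeId>) {
        self.pending.push(QueryEvent {
            timestamp: SystemTime::now(),
            visited_nodes,
        });
    }

    /// Take all pending events, leaving the logger empty.
    pub fn drain_batch(&mut self) -> Vec<QueryEvent> {
        std::mem::take(&mut self.pending)
    }
}

fn main() {
    let mut logger = QueryLogger::default();
    logger.record(vec![3, 17, 42]);
    logger.record(vec![17]);

    // A mass updater would consume the batch here.
    let batch = logger.drain_batch();
    println!("draining {} events", batch.len());
}
```

Draining in batches keeps mass updates off the query path, which fits the batch-processing mitigation discussed elsewhere in this document.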

### Dependencies on Other Features

- **Feature 11 (Causal Attention Networks)**: GEF can respect causal ordering by preventing backward gravitational pull
- **Feature 12 (Topology-Aware Gradient Routing)**: Combine graph topology with gravitational field for hybrid routing
- **Feature 13 (Embedding Crystallization)**: High-mass nodes serve as natural crystallization nuclei

## Regression Prevention

### Existing Functionality at Risk

1. **Standard HNSW Search Performance**
   - Risk: Gravitational calculations add overhead
   - Prevention: Make GEF optional, benchmark against baseline

2. **Deterministic Search Results**
   - Risk: Mass updates change results over time
   - Prevention: Add `frozen_field` mode for reproducible searches

3. **Memory Usage**
   - Risk: Additional mass metadata per node
   - Prevention: Use compact representations (f32 instead of f64), lazy cache

4. **Concurrent Queries**
   - Risk: Race conditions in mass updates
   - Prevention: Use atomic updates or batch processing
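
For the "atomic updates" option, one possible shape is shown below. The standard library has no `AtomicF32`, so this sketch stores the `f32` bit pattern in an `AtomicU32` and retries with compare-and-swap through `fetch_update`. The `AtomicMass` wrapper is a hypothetical name, and relaxed ordering is assumed to be enough because each mass is an independent counter.

```rust
use std::sync::atomic::{AtomicU32, Ordering};
use std::sync::Arc;
use std::thread;

/// An `f32` mass stored as raw bits so it can be updated without a lock.
struct AtomicMass(AtomicU32);

impl AtomicMass {
    fn new(value: f32) -> Self {
        Self(AtomicU32::new(value.to_bits()))
    }

    fn load(&self) -> f32 {
        f32::from_bits(self.0.load(Ordering::Relaxed))
    }

    /// Atomically add `delta` using a compare-and-swap retry loop.
    fn add(&self, delta: f32) {
        self.0
            .fetch_update(Ordering::Relaxed, Ordering::Relaxed, |bits| {
                Some((f32::from_bits(bits) + delta).to_bits())
            })
            .unwrap(); // the closure always returns `Some`, so this cannot fail
    }
}

fn main() {
    let mass = Arc::new(AtomicMass::new(0.0));

    // Four writers bump the same mass concurrently.
    let handles: Vec<_> = (0..4)
        .map(|_| {
            let mass = Arc::clone(&mass);
            thread::spawn(move || {
                for _ in 0..1000 {
                    mass.add(1.0);
                }
            })
        })
        .collect();
    for handle in handles {
        handle.join().unwrap();
    }

    // No increments are lost despite the lack of a lock.
    println!("mass = {}", mass.load());
}
```

This covers a single counter only. The full update rule also touches `query_frequency` and `last_update`, so updating all three consistently would still need a per-node lock or the batch-processing route.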

### Test Cases to Prevent Regressions

```rust
#[cfg(test)]
mod regression_tests {
    // Assumes `PotentialField` implements `GravitationalField`, that
    // `GravitationalFieldConfig` and `SearchOptions` implement `Default`,
    // and that the test helpers (`create_test_graph`, `random_vector`, ...) exist.
    use super::*;
    use std::sync::{Arc, RwLock};
    use std::thread::{self, spawn};
    use std::time::Duration;

    // With gravity weighted out, GEF search must match the baseline results
    #[test]
    fn test_gef_disabled_matches_baseline() {
        let graph = create_test_graph(10000);
        let query = random_vector(128);

        let baseline_results = graph.search(&query, 10);

        let gef_field = PotentialField::new(&graph, GravitationalFieldConfig::default());
        let options = SearchOptions {
            semantic_weight: 1.0, // Pure semantic search
            ..Default::default()
        };
        let gef_results = gef_field.search(&query, 10, options).unwrap();

        assert_eq!(baseline_results, gef_results);
    }

    // Frozen field produces deterministic results
    #[test]
    fn test_frozen_field_deterministic() {
        let mut field = create_test_field();
        field.freeze();

        let query = random_vector(128);
        let results1 = field.search(&query, 10, SearchOptions::default()).unwrap();
        let results2 = field.search(&query, 10, SearchOptions::default()).unwrap();

        assert_eq!(results1, results2);
    }

    // Mass updates don't break existing searches
    #[test]
    fn test_concurrent_search_and_update() {
        let field = Arc::new(RwLock::new(create_test_field()));

        let search_thread = spawn({
            let field = field.clone();
            move || {
                for _ in 0..100 {
                    let f = field.read().unwrap();
                    f.search(&random_vector(128), 10, SearchOptions::default())
                        .unwrap();
                }
            }
        });

        let update_thread = spawn({
            let field = field.clone();
            move || {
                for _ in 0..10 {
                    {
                        let mut f = field.write().unwrap();
                        f.update_masses(&generate_query_log(10)).unwrap();
                    } // release the write lock before sleeping
                    thread::sleep(Duration::from_millis(10));
                }
            }
        });

        search_thread.join().unwrap();
        update_thread.join().unwrap();
    }
}
```

### Backward Compatibility Strategy

1. **Feature Flag**: GEF behind `gravitational-fields` feature flag
2. **Opt-in**: Default config has `semantic_weight = 1.0` (pure semantic search)
3. **Migration Path**: Provide tools to analyze existing graphs and recommend GEF settings
4. **Serialization**: Store mass data in separate file, gracefully handle missing data
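
The opt-in default can be illustrated with a `Default` implementation for `SearchOptions`. Only `semantic_weight = 1.0` comes from the strategy above; the other default values are placeholders. The `combined_score` helper is also hypothetical, and it takes the gravitational term as a non-negative pull G·m / (r + ε) so that a higher score is always better.

```rust
#[derive(Clone, Debug)]
pub struct SearchOptions {
    pub semantic_weight: f32,
    pub max_steps: usize,
    pub record_path: bool,
    pub convergence_threshold: f32,
}

impl Default for SearchOptions {
    fn default() -> Self {
        Self {
            semantic_weight: 1.0,        // stated default: pure semantic search
            max_steps: 1_000,            // assumption: placeholder value
            record_path: false,          // assumption: placeholder value
            convergence_threshold: 1e-4, // assumption: placeholder value
        }
    }
}

/// Blend semantic similarity with gravitational pull (both "higher is better").
fn combined_score(similarity: f32, pull: f32, opts: &SearchOptions) -> f32 {
    opts.semantic_weight * similarity + (1.0 - opts.semantic_weight) * pull
}

fn main() {
    let defaults = SearchOptions::default();
    // With the default weight the gravitational term has no effect.
    println!("default score = {}", combined_score(0.8, 5.0, &defaults));

    let tuned = SearchOptions { semantic_weight: 0.7, ..Default::default() };
    println!("tuned score   = {}", combined_score(0.8, 5.0, &tuned));
}
```

Because the gravitational term is multiplied by exactly zero under the default, existing callers would see unchanged rankings until they lower `semantic_weight` explicitly.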

## Implementation Phases

### Phase 1: Research Validation (2 weeks)
**Goal**: Validate physics-inspired approach on synthetic data

- Implement basic potential field calculations
- Create toy dataset with known high-frequency nodes
- Measure search efficiency improvements
- Compare against baselines (pure HNSW, PageRank-weighted)
- **Deliverable**: Research report with benchmarks

### Phase 2: Core Implementation (3 weeks)
**Goal**: Production-ready GEF implementation

- Implement `PotentialField` and `NodeMass` structures
- Develop mass update algorithms with decay
- Integrate with HNSW search
- Add configuration system
- Implement caching and optimization
- **Deliverable**: Working GEF module with unit tests

### Phase 3: Integration (2 weeks)
**Goal**: Integrate with existing RuVector systems

- Add query logging infrastructure
- Implement mass persistence (save/load)
- Create API bindings (Python, Node.js)
- Add monitoring and metrics
- Write integration tests
- **Deliverable**: GEF integrated into main codebase

### Phase 4: Optimization (2 weeks)
**Goal**: Production performance and tuning

- Profile and optimize hot paths
- Implement spatial indexing for large graphs
- Add adaptive tuning (auto-adjust G constant)
- Create visualization tools
- Write documentation and examples
- **Deliverable**: Production-ready, documented feature

## Success Metrics

### Performance Benchmarks

| Metric | Baseline | Target | Measurement |
|--------|----------|--------|-------------|
| Search latency (10K nodes) | 1.2ms | <1.5ms | 99th percentile |
| Search quality (recall@10) | 0.95 | >0.95 | Standard test set |
| Hops to target | 12.3 | <9.0 | Average path length |
| Memory overhead | 0MB | <50MB | Per 1M nodes |
| Mass update latency | N/A | <10ms | Per 1K queries |
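
A quick way to sanity-check the memory target is to measure the per-node metadata directly. The sketch below copies the `NodeMass` layout from this document and multiplies its size by the node count; the helper `estimated_mb` is a hypothetical name. It counts only the `NodeMass` entries, not the spatial index or the potential cache, and the exact size depends on the platform's `SystemTime` representation.

```rust
use std::mem::size_of;
use std::time::SystemTime;

/// Same fields as the `NodeMass` struct in this document.
#[allow(dead_code)]
struct NodeMass {
    mass: f32,
    query_frequency: f64,
    last_update: SystemTime,
    decay_rate: f32,
}

/// Estimated size of the mass table in megabytes (10^6 bytes).
fn estimated_mb(nodes: usize) -> f64 {
    (size_of::<NodeMass>() * nodes) as f64 / 1e6
}

fn main() {
    println!("NodeMass is {} bytes per node", size_of::<NodeMass>());
    println!("about {:.0} MB per 1M nodes", estimated_mb(1_000_000));
}
```

If the estimate comes too close to the budget once the index and cache are included, replacing `SystemTime` with a compact integer timestamp would be one way to shrink the struct.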

### Accuracy Metrics

1. **Authority Discovery**: High-authority nodes found in top-10 results
   - Target: 80% of known authoritative nodes in top-10

2. **Query Efficiency**: Reduction in nodes visited per search
   - Target: 30% fewer nodes visited for same recall

3. **Adaptive Learning**: Mass distribution correlates with true importance
   - Target: Spearman correlation >0.7 with ground truth rankings
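
The Adaptive Learning target can be checked with a plain Spearman rank correlation, computed as the Pearson correlation of the two rank vectors (ties share their average rank). The function names and the sample numbers below are made up for illustration; they are not measurements.

```rust
/// Average ranks (1-based); tied values share the mean of their positions.
fn ranks(values: &[f64]) -> Vec<f64> {
    let mut order: Vec<usize> = (0..values.len()).collect();
    order.sort_by(|&a, &b| values[a].total_cmp(&values[b]));

    let mut ranks = vec![0.0; values.len()];
    let mut i = 0;
    while i < order.len() {
        // Extend j to cover the run of values equal to values[order[i]].
        let mut j = i;
        while j + 1 < order.len() && values[order[j + 1]] == values[order[i]] {
            j += 1;
        }
        let avg = (i + j) as f64 / 2.0 + 1.0;
        for &idx in &order[i..=j] {
            ranks[idx] = avg;
        }
        i = j + 1;
    }
    ranks
}

/// Pearson correlation of two equally sized samples.
fn pearson(x: &[f64], y: &[f64]) -> f64 {
    let n = x.len() as f64;
    let mx = x.iter().sum::<f64>() / n;
    let my = y.iter().sum::<f64>() / n;
    let cov: f64 = x.iter().zip(y).map(|(a, b)| (a - mx) * (b - my)).sum();
    let vx: f64 = x.iter().map(|a| (a - mx).powi(2)).sum();
    let vy: f64 = y.iter().map(|b| (b - my).powi(2)).sum();
    cov / (vx.sqrt() * vy.sqrt())
}

/// Spearman's rho: Pearson correlation of the ranks.
fn spearman(x: &[f64], y: &[f64]) -> f64 {
    pearson(&ranks(x), &ranks(y))
}

fn main() {
    let learned_masses = [5.1, 0.3, 2.2, 9.8, 1.0]; // made-up values
    let true_importance = [4.0, 1.0, 2.0, 5.0, 3.0]; // made-up ground truth
    let rho = spearman(&learned_masses, &true_importance);
    println!("Spearman rho = {rho:.3}, meets >0.7 target: {}", rho > 0.7);
}
```

In this toy data only two items swap places between the rankings, which gives rho = 0.9 and would pass the >0.7 target.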

### Comparison to Baselines

Test against:
1. **Pure HNSW**: Standard implementation without GEF
2. **PageRank-weighted**: Static global importance scores
3. **Attention-based**: Standard attention mechanism from Feature 1
4. **Hybrid**: GEF + Topology-Aware Routing (Feature 12)

Datasets:
- Wikipedia embeddings (1M articles)
- ArXiv papers with citation counts (500K papers)
- E-commerce products with view counts (2M products)

## Risks and Mitigations

### Technical Risks

| Risk | Impact | Probability | Mitigation |
|------|--------|-------------|------------|
| Mass updates too slow | High | Medium | Batch updates, incremental computation |
| Field calculations expensive | High | High | Spatial indexing, caching, approximations |
| Over-attraction to popular nodes | Medium | High | Softening parameter, max influence radius |
| Mass distribution unstable | Medium | Medium | Regularization, decay rates, bounds checking |
| Poor generalization | High | Low | Multi-dataset validation, adaptive tuning |

### Detailed Mitigations

1. **Slow Mass Updates**
   - Implement incremental updates (only changed nodes)
   - Batch query logs and process asynchronously
   - Use lock-free data structures for concurrent updates
   - Fallback: Update masses periodically (e.g., hourly) instead of real-time

2. **Expensive Field Calculations**
   - Pre-compute potential fields for common queries
   - Use spatial hashing for roughly constant-time radius queries
   - Approximate far-field contributions (multipole expansion)
   - Fallback: Disable GEF for low-latency requirements

3. **Over-Attraction to Popular Nodes**
   - Tune softening parameter ε to prevent singularities
   - Cap maximum mass value
   - Implement repulsive forces for diversity
   - Fallback: Reduce gravitational weight in combined score

4. **Unstable Mass Distribution**
   - Add L2 regularization to mass updates
   - Implement mass normalization across graph
   - Monitor mass variance, trigger rebalancing
   - Fallback: Reset masses to uniform distribution

5. **Poor Generalization**
   - Test on diverse datasets (text, images, graphs)
   - Implement domain-specific mass functions
   - Provide configuration templates for common use cases
   - Fallback: Disable GEF for unsupported domains
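
Two of the mitigations above, capping the maximum mass and normalizing masses across the graph, fit in one small pass. The function name `cap_and_normalize` and the choice to normalize to a mean of 1.0 are assumptions for this sketch; other normalizations (for example, to a fixed total mass) would work the same way.

```rust
/// Clamp each mass to `[0, max_mass]`, then rescale so the mean mass is 1.0.
/// `max_mass` is assumed to be non-negative.
fn cap_and_normalize(masses: &mut [f32], max_mass: f32) {
    for m in masses.iter_mut() {
        *m = m.clamp(0.0, max_mass);
    }

    let mean = masses.iter().sum::<f32>() / masses.len() as f32;
    // Skip rescaling for empty input (mean is NaN) or an all-zero field.
    if mean > 0.0 {
        for m in masses.iter_mut() {
            *m /= mean;
        }
    }
}

fn main() {
    // One runaway node dominates before capping.
    let mut masses = vec![1.0, 3.0, 100.0];
    cap_and_normalize(&mut masses, 4.0);
    println!("{masses:?}");
}
```

After the pass the runaway node is only 1.5 times the mean instead of roughly 100 times the lightest node, which limits how strongly it can distort search trajectories.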

## References

### Physics Inspiration
- Newtonian gravity: F = G·m₁·m₂/r²
- Potential fields in robotics path planning
- N-body simulations and Barnes-Hut algorithms

### Related ML Techniques
- PageRank and graph centrality measures
- Attention mechanisms in transformers
- Reinforcement learning value functions
- Metric learning and embedding spaces

### Implementation Precedents
- Fast multipole methods (FMM)
- Spatial hashing and KD-trees
- Incremental graph algorithms
- Online learning with exponential decay