Merge commit 'd803bfe2b1fe7f5e219e50ac20d6801a0a58ac75' as 'vendor/ruvector'
This commit is contained in:
788
vendor/ruvector/docs/research/gnn-v2/13-embedding-crystallization.md
vendored
Normal file
788
vendor/ruvector/docs/research/gnn-v2/13-embedding-crystallization.md
vendored
Normal file
@@ -0,0 +1,788 @@
|
||||
# Embedding Crystallization
|
||||
|
||||
## Overview
|
||||
|
||||
### Problem Statement
|
||||
Most vector databases require pre-defined hierarchical structures or manual clustering. This creates several problems:
|
||||
1. **Static hierarchies**: Cannot adapt to changing data distributions
|
||||
2. **Manual tuning**: Requires expert knowledge to choose hierarchy depth and branching
|
||||
3. **Poor adaptation**: Hierarchy may not match natural data clusters
|
||||
4. **Rigid structure**: Cannot reorganize as data evolves
|
||||
|
||||
### Proposed Solution
|
||||
Automatically form hierarchical structure from flat embeddings through a physics-inspired crystallization process:
|
||||
1. **Nucleation**: Identify dense clusters as crystal "seeds"
|
||||
2. **Growth**: Expand crystals outward from nuclei
|
||||
3. **Competition**: Crystals compete for boundary regions
|
||||
4. **Equilibrium**: Self-organizing hierarchy emerges
|
||||
|
||||
Like physical crystals growing from a supersaturated solution, embedding crystals grow from dense regions in embedding space.
|
||||
|
||||
### Expected Benefits
|
||||
- **Automatic hierarchy**: No manual structure design needed
|
||||
- **Adaptive organization**: Hierarchy evolves with data
|
||||
- **Natural clusters**: Respects inherent data structure
|
||||
- **Multi-scale representation**: From coarse (crystal) to fine (individual points)
|
||||
- **20-40% faster search**: Hierarchical pruning reduces search space
|
||||
|
||||
### Novelty Claim
|
||||
First application of crystal growth dynamics to vector database organization. Unlike:
|
||||
- **K-means clustering**: Fixed K, no hierarchy
|
||||
- **Hierarchical clustering**: Bottom-up, computationally expensive
|
||||
- **LSH**: Random projections, no semantic structure
|
||||
|
||||
Embedding Crystallization uses physics-inspired dynamics to discover natural hierarchical organization.
|
||||
|
||||
## Technical Design
|
||||
|
||||
### Architecture Diagram
|
||||
```
|
||||
┌────────────────────────────────────────────────────────────────────┐
|
||||
│ Embedding Crystallization │
|
||||
│ │
|
||||
│ ┌────────────────────────────────────────────────────────────┐ │
|
||||
│ │ Phase 1: Nucleation Detection │ │
|
||||
│ │ │ │
|
||||
│ │ Flat Embedding Space │ │
|
||||
│ │ ┌──────────────────────────────────────────┐ │ │
|
||||
│ │ │ ● ●●● ● ● ●●●●● ● │ │ │
|
||||
│ │ │ ● ● ● ● ● ● │ │ │
|
||||
│ │ │ ● ●●● ● ● ●●●●● ● │ │ │
|
||||
│ │ │ ● │ │ │
|
||||
│ │ │ ●●●●● ●●● │ │ │
|
||||
│ │ │ ● ● ● ●● ● │ │ │
|
||||
│ │ │ ●●●●● ●●● │ │ │
|
||||
│ │ │ ▲ ▲ ▲ │ │ │
|
||||
│ │ └─────────│───────────│─────────│──────────┘ │ │
|
||||
│ │ Nucleus 1 Nucleus 2 Nucleus 3 │ │
|
||||
│ │ (ρ > ρ_crit) │ │
|
||||
│ └────────────────────────────────────────────────────────────┘ │
|
||||
│ │ │
|
||||
│ ▼ │
|
||||
│ ┌────────────────────────────────────────────────────────────┐ │
|
||||
│ │ Phase 2: Crystal Growth │ │
|
||||
│ │ │ │
|
||||
│ │ Iteration 0: Iteration 5: Iteration 10: │ │
|
||||
│ │ ┌──────┐ ┌──────┐ ┌──────┐ │ │
|
||||
│ │ │ ◎ │ │ ╔══╗ │ │╔════╗│ │ │
|
||||
│ │ │ │ │ ║ ║ │ │║ ║│ │ │
|
||||
│ │ │ ◎ │ ───▶ │ ╚══╝ │ ───▶ │╚════╝│ │ │
|
||||
│ │ │ │ │ │ │ │ │ │
|
||||
│ │ │ ◎ │ │ ╔══╗ │ │╔════╗│ │ │
|
||||
│ │ └──────┘ │ ║ ║ │ │║ ║│ │ │
|
||||
│ │ │ ╚══╝ │ │╚════╝│ │ │
|
||||
│ │ ◎ = Nucleus └──────┘ └──────┘ │ │
|
||||
│ │ ═ = Crystal Growth rate: v = -∇E │ │
|
||||
│ └────────────────────────────────────────────────────────────┘ │
|
||||
│ │ │
|
||||
│ ▼ │
|
||||
│ ┌────────────────────────────────────────────────────────────┐ │
|
||||
│ │ Phase 3: Hierarchical Organization │ │
|
||||
│ │ │ │
|
||||
│ │ Root (Global) │ │
|
||||
│ │ │ │ │
|
||||
│ │ ┌────────────┼────────────┐ │ │
|
||||
│ │ │ │ │ │ │
|
||||
│ │ Crystal 1 Crystal 2 Crystal 3 │ │
|
||||
│ │ (Topic 1) (Topic 2) (Topic 3) │ │
|
||||
│ │ │ │ │ │ │
|
||||
│ │ ┌────┴────┐ ┌───┴───┐ ┌────┴────┐ │ │
|
||||
│ │ │ │ │ │ │ │ │ │
|
||||
│ │ SubCrystal SubCrystal ... ... ... │ │
|
||||
│ │ (Subtopic) (Subtopic) │ │
|
||||
│ │ │ │ │ │
|
||||
│ │ ● ● ● ● ● ● ← Individual embeddings │ │
|
||||
│ │ │ │
|
||||
│ └────────────────────────────────────────────────────────────┘ │
|
||||
└────────────────────────────────────────────────────────────────────┘
|
||||
```
|
||||
|
||||
### Core Data Structures
|
||||
|
||||
```rust
|
||||
/// Crystal structure (hierarchical cluster)
|
||||
#[derive(Clone, Debug)]
|
||||
pub struct Crystal {
|
||||
/// Unique crystal identifier
|
||||
pub id: CrystalId,
|
||||
|
||||
/// Centroid (center of mass)
|
||||
pub centroid: Vec<f32>,
|
||||
|
||||
/// Radius (effective size)
|
||||
pub radius: f32,
|
||||
|
||||
/// Member nodes
|
||||
pub members: Vec<NodeId>,
|
||||
|
||||
/// Parent crystal (if not root)
|
||||
pub parent: Option<CrystalId>,
|
||||
|
||||
/// Child crystals (subclusters)
|
||||
pub children: Vec<CrystalId>,
|
||||
|
||||
/// Hierarchy level (0 = root)
|
||||
pub level: usize,
|
||||
|
||||
/// Density at nucleation
|
||||
pub density: f32,
|
||||
|
||||
/// Growth rate
|
||||
pub growth_rate: f32,
|
||||
|
||||
/// Energy (stability measure)
|
||||
pub energy: f32,
|
||||
|
||||
/// Metadata
|
||||
pub metadata: CrystalMetadata,
|
||||
}
|
||||
|
||||
/// Crystal metadata
|
||||
#[derive(Clone, Debug)]
|
||||
pub struct CrystalMetadata {
|
||||
/// Formation timestamp
|
||||
pub formed_at: SystemTime,
|
||||
|
||||
/// Number of growth iterations
|
||||
pub growth_iterations: usize,
|
||||
|
||||
/// Stability score (0-1)
|
||||
pub stability: f32,
|
||||
|
||||
/// Semantic label (if available)
|
||||
pub label: Option<String>,
|
||||
}
|
||||
|
||||
/// Nucleation site (seed for crystal)
|
||||
#[derive(Clone, Debug)]
|
||||
pub struct NucleationSite {
|
||||
/// Center point
|
||||
pub center: Vec<f32>,
|
||||
|
||||
/// Local density
|
||||
pub density: f32,
|
||||
|
||||
/// Seed nodes
|
||||
pub seeds: Vec<NodeId>,
|
||||
|
||||
/// Critical radius
|
||||
pub critical_radius: f32,
|
||||
}
|
||||
|
||||
/// Crystallization configuration
|
||||
#[derive(Clone, Debug)]
|
||||
pub struct CrystallizationConfig {
|
||||
/// Density threshold for nucleation
|
||||
pub nucleation_threshold: f32, // default: 0.7
|
||||
|
||||
/// Minimum nodes for nucleation
|
||||
pub min_nucleus_size: usize, // default: 10
|
||||
|
||||
/// Growth rate parameter
|
||||
pub growth_rate: f32, // default: 0.1
|
||||
|
||||
/// Maximum hierarchy depth
|
||||
pub max_depth: usize, // default: 5
|
||||
|
||||
/// Energy function
|
||||
pub energy_function: EnergyFunction,
|
||||
|
||||
/// Growth stopping criterion
|
||||
pub stopping_criterion: StoppingCriterion,
|
||||
|
||||
/// Allow crystal merging
|
||||
pub allow_merging: bool,
|
||||
}
|
||||
|
||||
/// Energy function for crystal stability
|
||||
#[derive(Clone, Debug)]
|
||||
pub enum EnergyFunction {
|
||||
/// Within-cluster variance
|
||||
Variance,
|
||||
|
||||
/// Silhouette score
|
||||
Silhouette,
|
||||
|
||||
/// Density-based
|
||||
Density,
|
||||
|
||||
/// Custom function
|
||||
Custom(fn(&Crystal, &[Vec<f32>]) -> f32),
|
||||
}
|
||||
|
||||
/// Stopping criterion for growth
|
||||
#[derive(Clone, Debug)]
|
||||
pub enum StoppingCriterion {
|
||||
/// Maximum iterations
|
||||
MaxIterations(usize),
|
||||
|
||||
/// Energy convergence
|
||||
EnergyConvergence { threshold: f32 },
|
||||
|
||||
/// No more boundary nodes
|
||||
NoBoundary,
|
||||
|
||||
/// Combined criteria
|
||||
Combined(Vec<StoppingCriterion>),
|
||||
}
|
||||
|
||||
/// Crystallization state
|
||||
pub struct CrystallizationState {
|
||||
/// All crystals (hierarchical)
|
||||
crystals: Vec<Crystal>,
|
||||
|
||||
/// Node to crystal assignment
|
||||
node_assignments: Vec<CrystalId>,
|
||||
|
||||
/// Hierarchy tree
|
||||
hierarchy: CrystalTree,
|
||||
|
||||
/// Configuration
|
||||
config: CrystallizationConfig,
|
||||
|
||||
/// Growth history (for analysis)
|
||||
growth_history: Vec<GrowthSnapshot>,
|
||||
}
|
||||
|
||||
/// Crystal hierarchy tree
|
||||
#[derive(Clone, Debug)]
|
||||
pub struct CrystalTree {
|
||||
/// Root crystal (entire dataset)
|
||||
root: CrystalId,
|
||||
|
||||
/// Tree structure
|
||||
nodes: HashMap<CrystalId, CrystalTreeNode>,
|
||||
|
||||
/// Fast level-based lookup
|
||||
levels: Vec<Vec<CrystalId>>,
|
||||
}
|
||||
|
||||
#[derive(Clone, Debug)]
|
||||
pub struct CrystalTreeNode {
|
||||
pub crystal_id: CrystalId,
|
||||
pub parent: Option<CrystalId>,
|
||||
pub children: Vec<CrystalId>,
|
||||
pub level: usize,
|
||||
}
|
||||
|
||||
/// Snapshot of growth process
|
||||
#[derive(Clone, Debug)]
|
||||
pub struct GrowthSnapshot {
|
||||
pub iteration: usize,
|
||||
pub num_crystals: usize,
|
||||
pub total_energy: f32,
|
||||
pub avg_crystal_size: f32,
|
||||
pub timestamp: SystemTime,
|
||||
}
|
||||
```
|
||||
|
||||
### Key Algorithms
|
||||
|
||||
```rust
|
||||
// Pseudocode for embedding crystallization
|
||||
|
||||
/// Main crystallization algorithm
|
||||
fn crystallize(
|
||||
embeddings: &[Vec<f32>],
|
||||
config: CrystallizationConfig
|
||||
) -> CrystallizationState {
|
||||
// Phase 1: Detect nucleation sites
|
||||
let nucleation_sites = detect_nucleation_sites(
|
||||
embeddings,
|
||||
config.nucleation_threshold,
|
||||
config.min_nucleus_size
|
||||
);
|
||||
|
||||
// Phase 2: Initialize crystals from nuclei
|
||||
let mut crystals = Vec::new();
|
||||
for (i, site) in nucleation_sites.iter().enumerate() {
|
||||
crystals.push(Crystal {
|
||||
id: i,
|
||||
centroid: site.center.clone(),
|
||||
radius: site.critical_radius,
|
||||
members: site.seeds.clone(),
|
||||
parent: None,
|
||||
children: Vec::new(),
|
||||
level: 0,
|
||||
density: site.density,
|
||||
growth_rate: config.growth_rate,
|
||||
energy: compute_energy(site.seeds, embeddings, &config),
|
||||
metadata: CrystalMetadata::new(),
|
||||
});
|
||||
}
|
||||
|
||||
// Phase 3: Grow crystals
|
||||
let mut node_assignments = vec![None; embeddings.len()];
|
||||
for crystal in &crystals {
|
||||
for &member in &crystal.members {
|
||||
node_assignments[member] = Some(crystal.id);
|
||||
}
|
||||
}
|
||||
|
||||
let mut iteration = 0;
|
||||
loop {
|
||||
let mut changed = false;
|
||||
|
||||
// Find boundary nodes (unassigned or contestable)
|
||||
let boundary_nodes = find_boundary_nodes(
|
||||
embeddings,
|
||||
&node_assignments,
|
||||
&crystals
|
||||
);
|
||||
|
||||
if boundary_nodes.is_empty() {
|
||||
break;
|
||||
}
|
||||
|
||||
// Assign boundary nodes to nearest growing crystal
|
||||
for node_id in boundary_nodes {
|
||||
let (best_crystal, energy_change) = find_best_crystal(
|
||||
node_id,
|
||||
embeddings,
|
||||
&crystals,
|
||||
&config
|
||||
);
|
||||
|
||||
// Only add if energy decreases (stability)
|
||||
if energy_change < 0.0 {
|
||||
crystals[best_crystal].members.push(node_id);
|
||||
node_assignments[node_id] = Some(best_crystal);
|
||||
changed = true;
|
||||
}
|
||||
}
|
||||
|
||||
// Update crystal properties
|
||||
for crystal in &mut crystals {
|
||||
update_centroid(crystal, embeddings);
|
||||
update_radius(crystal, embeddings);
|
||||
crystal.energy = compute_energy(&crystal.members, embeddings, &config);
|
||||
}
|
||||
|
||||
iteration += 1;
|
||||
|
||||
if !changed || should_stop(&config.stopping_criterion, iteration, &crystals) {
|
||||
break;
|
||||
}
|
||||
}
|
||||
|
||||
// Phase 4: Build hierarchy (recursive crystallization)
|
||||
let hierarchy = build_hierarchy(&mut crystals, embeddings, &config);
|
||||
|
||||
CrystallizationState {
|
||||
crystals,
|
||||
node_assignments,
|
||||
hierarchy,
|
||||
config,
|
||||
growth_history: Vec::new(),
|
||||
}
|
||||
}
|
||||
|
||||
/// Detect nucleation sites using density estimation
|
||||
fn detect_nucleation_sites(
|
||||
embeddings: &[Vec<f32>],
|
||||
threshold: f32,
|
||||
min_size: usize
|
||||
) -> Vec<NucleationSite> {
|
||||
let mut sites = Vec::new();
|
||||
|
||||
// Build density field using KDE
|
||||
let density_field = estimate_density(embeddings);
|
||||
|
||||
// Find local maxima above threshold
|
||||
for (i, &density) in density_field.iter().enumerate() {
|
||||
if density < threshold {
|
||||
continue;
|
||||
}
|
||||
|
||||
// Check if local maximum
|
||||
let neighbors = find_neighbors(i, embeddings, radius=1.0);
|
||||
let is_maximum = neighbors.iter().all(|&j| {
|
||||
density_field[j] <= density
|
||||
});
|
||||
|
||||
if !is_maximum {
|
||||
continue;
|
||||
}
|
||||
|
||||
// Collect seed nodes within critical radius
|
||||
let critical_radius = estimate_critical_radius(density);
|
||||
let seeds: Vec<NodeId> = embeddings.iter()
|
||||
.enumerate()
|
||||
.filter(|(j, emb)| {
|
||||
let dist = euclidean_distance(&embeddings[i], emb);
|
||||
dist <= critical_radius
|
||||
})
|
||||
.map(|(j, _)| j)
|
||||
.collect();
|
||||
|
||||
if seeds.len() >= min_size {
|
||||
sites.push(NucleationSite {
|
||||
center: embeddings[i].clone(),
|
||||
density,
|
||||
seeds,
|
||||
critical_radius,
|
||||
});
|
||||
}
|
||||
}
|
||||
|
||||
// Remove overlapping sites (keep higher density)
|
||||
sites = remove_overlapping_sites(sites);
|
||||
|
||||
sites
|
||||
}
|
||||
|
||||
/// Estimate density using Kernel Density Estimation
|
||||
fn estimate_density(embeddings: &[Vec<f32>]) -> Vec<f32> {
|
||||
let n = embeddings.len();
|
||||
let mut density = vec![0.0; n];
|
||||
|
||||
// Adaptive bandwidth (Scott's rule)
|
||||
let bandwidth = estimate_bandwidth(embeddings);
|
||||
|
||||
for i in 0..n {
|
||||
for j in 0..n {
|
||||
let dist = euclidean_distance(&embeddings[i], &embeddings[j]);
|
||||
density[i] += gaussian_kernel(dist, bandwidth);
|
||||
}
|
||||
density[i] /= n as f32;
|
||||
}
|
||||
|
||||
density
|
||||
}
|
||||
|
||||
/// Find best crystal for boundary node
|
||||
fn find_best_crystal(
|
||||
node_id: NodeId,
|
||||
embeddings: &[Vec<f32>],
|
||||
crystals: &[Crystal],
|
||||
config: &CrystallizationConfig
|
||||
) -> (CrystalId, f32) {
|
||||
let embedding = &embeddings[node_id];
|
||||
|
||||
let mut best_crystal = 0;
|
||||
let mut best_energy_change = f32::MAX;
|
||||
|
||||
for (i, crystal) in crystals.iter().enumerate() {
|
||||
// Distance to crystal centroid
|
||||
let dist = euclidean_distance(embedding, &crystal.centroid);
|
||||
|
||||
// Only consider if within growth radius
|
||||
if dist > crystal.radius + config.growth_rate {
|
||||
continue;
|
||||
}
|
||||
|
||||
// Compute energy change if node joins this crystal
|
||||
let mut temp_members = crystal.members.clone();
|
||||
temp_members.push(node_id);
|
||||
|
||||
let new_energy = compute_energy(&temp_members, embeddings, config);
|
||||
let energy_change = new_energy - crystal.energy;
|
||||
|
||||
if energy_change < best_energy_change {
|
||||
best_energy_change = energy_change;
|
||||
best_crystal = i;
|
||||
}
|
||||
}
|
||||
|
||||
(best_crystal, best_energy_change)
|
||||
}
|
||||
|
||||
/// Build hierarchical structure via recursive crystallization
|
||||
fn build_hierarchy(
|
||||
crystals: &mut Vec<Crystal>,
|
||||
embeddings: &[Vec<f32>],
|
||||
config: &CrystallizationConfig
|
||||
) -> CrystalTree {
|
||||
let mut tree = CrystalTree::new();
|
||||
|
||||
// Start with level 0 (base crystals)
|
||||
for crystal in crystals.iter_mut() {
|
||||
crystal.level = 0;
|
||||
tree.add_node(crystal.id, None, 0);
|
||||
}
|
||||
|
||||
// Recursively create parent levels
|
||||
for level in 0..config.max_depth {
|
||||
let current_level_crystals: Vec<_> = crystals.iter()
|
||||
.filter(|c| c.level == level)
|
||||
.map(|c| c.id)
|
||||
.collect();
|
||||
|
||||
if current_level_crystals.len() <= 1 {
|
||||
break; // Only one cluster, stop
|
||||
}
|
||||
|
||||
// Treat crystals as embeddings (their centroids)
|
||||
let crystal_centroids: Vec<_> = current_level_crystals.iter()
|
||||
.map(|&id| crystals[id].centroid.clone())
|
||||
.collect();
|
||||
|
||||
// Recursively crystallize at higher level
|
||||
let parent_config = CrystallizationConfig {
|
||||
nucleation_threshold: config.nucleation_threshold * 0.8, // Relax threshold
|
||||
..config.clone()
|
||||
};
|
||||
|
||||
let parent_sites = detect_nucleation_sites(
|
||||
&crystal_centroids,
|
||||
parent_config.nucleation_threshold,
|
||||
2 // At least 2 child crystals
|
||||
);
|
||||
|
||||
// Create parent crystals
|
||||
for (i, site) in parent_sites.iter().enumerate() {
|
||||
let parent_id = crystals.len();
|
||||
|
||||
// Children are crystals in this parent's region
|
||||
let children: Vec<CrystalId> = site.seeds.iter()
|
||||
.map(|&seed_idx| current_level_crystals[seed_idx])
|
||||
.collect();
|
||||
|
||||
// Collect all members from children
|
||||
let mut all_members = Vec::new();
|
||||
for &child_id in &children {
|
||||
all_members.extend(&crystals[child_id].members);
|
||||
}
|
||||
|
||||
let parent = Crystal {
|
||||
id: parent_id,
|
||||
centroid: site.center.clone(),
|
||||
radius: site.critical_radius,
|
||||
members: all_members,
|
||||
parent: None,
|
||||
children: children.clone(),
|
||||
level: level + 1,
|
||||
density: site.density,
|
||||
growth_rate: config.growth_rate,
|
||||
energy: 0.0, // Computed later
|
||||
metadata: CrystalMetadata::new(),
|
||||
};
|
||||
|
||||
crystals.push(parent);
|
||||
tree.add_node(parent_id, None, level + 1);
|
||||
|
||||
// Update children's parent pointers
|
||||
for &child_id in &children {
|
||||
crystals[child_id].parent = Some(parent_id);
|
||||
tree.set_parent(child_id, parent_id);
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
tree
|
||||
}
|
||||
```
|
||||
|
||||
### API Design
|
||||
|
||||
```rust
|
||||
/// Public API for Embedding Crystallization
|
||||
pub trait EmbeddingCrystallization {
|
||||
/// Crystallize flat embeddings into hierarchy
|
||||
fn crystallize(
|
||||
embeddings: &[Vec<f32>],
|
||||
config: CrystallizationConfig,
|
||||
) -> Result<CrystallizationState, CrystalError>;
|
||||
|
||||
/// Search using crystal hierarchy
|
||||
fn search(
|
||||
&self,
|
||||
query: &[f32],
|
||||
k: usize,
|
||||
options: CrystalSearchOptions,
|
||||
) -> Result<Vec<SearchResult>, CrystalError>;
|
||||
|
||||
/// Add new embeddings (incremental crystallization)
|
||||
fn add_embeddings(
|
||||
&mut self,
|
||||
new_embeddings: &[Vec<f32>],
|
||||
) -> Result<(), CrystalError>;
|
||||
|
||||
/// Get crystal by ID
|
||||
fn get_crystal(&self, id: CrystalId) -> Option<&Crystal>;
|
||||
|
||||
/// Get crystals at level
|
||||
fn get_level(&self, level: usize) -> Vec<&Crystal>;
|
||||
|
||||
/// Find crystal containing node
|
||||
fn find_crystal(&self, node_id: NodeId) -> Option<CrystalId>;
|
||||
|
||||
/// Traverse hierarchy
|
||||
fn traverse(&self, strategy: TraversalStrategy) -> CrystalIterator;
|
||||
|
||||
/// Export hierarchy for visualization
|
||||
fn export_hierarchy(&self) -> HierarchyExport;
|
||||
|
||||
/// Get crystallization statistics
|
||||
fn statistics(&self) -> CrystalStatistics;
|
||||
|
||||
/// Recrystallize (rebuild hierarchy)
|
||||
fn recrystallize(&mut self) -> Result<(), CrystalError>;
|
||||
}
|
||||
|
||||
/// Search options for crystallization
|
||||
#[derive(Clone, Debug)]
|
||||
pub struct CrystalSearchOptions {
|
||||
/// Start search at level
|
||||
pub start_level: usize,
|
||||
|
||||
/// Use hierarchical pruning
|
||||
pub enable_pruning: bool,
|
||||
|
||||
/// Pruning threshold (discard crystals with similarity < threshold)
|
||||
pub pruning_threshold: f32,
|
||||
|
||||
/// Maximum crystals to explore
|
||||
pub max_crystals: usize,
|
||||
}
|
||||
|
||||
/// Traversal strategies
|
||||
#[derive(Clone, Debug)]
|
||||
pub enum TraversalStrategy {
|
||||
/// Breadth-first (level by level)
|
||||
BreadthFirst,
|
||||
|
||||
/// Depth-first (branch by branch)
|
||||
DepthFirst,
|
||||
|
||||
/// Largest crystals first
|
||||
SizeOrder,
|
||||
|
||||
/// Highest density first
|
||||
DensityOrder,
|
||||
}
|
||||
|
||||
/// Hierarchy statistics
|
||||
#[derive(Clone, Debug)]
|
||||
pub struct CrystalStatistics {
|
||||
pub total_crystals: usize,
|
||||
pub depth: usize,
|
||||
pub avg_branching_factor: f32,
|
||||
pub avg_crystal_size: f32,
|
||||
pub density_distribution: Vec<f32>,
|
||||
pub energy_distribution: Vec<f32>,
|
||||
}
|
||||
|
||||
/// Hierarchy export for visualization
|
||||
#[derive(Clone, Debug, Serialize)]
|
||||
pub struct HierarchyExport {
|
||||
pub crystals: Vec<CrystalExport>,
|
||||
pub edges: Vec<HierarchyEdge>,
|
||||
pub statistics: CrystalStatistics,
|
||||
}
|
||||
|
||||
#[derive(Clone, Debug, Serialize)]
|
||||
pub struct CrystalExport {
|
||||
pub id: CrystalId,
|
||||
pub level: usize,
|
||||
pub size: usize,
|
||||
pub centroid: Vec<f32>,
|
||||
pub radius: f32,
|
||||
pub label: Option<String>,
|
||||
}
|
||||
|
||||
#[derive(Clone, Debug, Serialize)]
|
||||
pub struct HierarchyEdge {
|
||||
pub parent: CrystalId,
|
||||
pub child: CrystalId,
|
||||
}
|
||||
```
|
||||
|
||||
## Integration Points
|
||||
|
||||
### Affected Crates/Modules
|
||||
|
||||
1. **`crates/ruvector-core/src/hnsw/`**
|
||||
- Add hierarchical layer based on crystals
|
||||
- Integrate crystal-aware search
|
||||
|
||||
2. **`crates/ruvector-gnn/src/hierarchy/`**
|
||||
- Create hierarchy management module
|
||||
- Integrate with existing GNN layers
|
||||
|
||||
### New Modules to Create
|
||||
|
||||
1. **`crates/ruvector-gnn/src/crystallization/`**
|
||||
- `nucleation.rs` - Nucleation site detection
|
||||
- `growth.rs` - Crystal growth algorithms
|
||||
- `hierarchy.rs` - Hierarchy construction
|
||||
- `search.rs` - Crystal-aware search
|
||||
- `energy.rs` - Energy functions
|
||||
- `visualization.rs` - Hierarchy visualization
|
||||
|
||||
## Regression Prevention
|
||||
|
||||
### Test Cases
|
||||
|
||||
```rust
|
||||
#[test]
|
||||
fn test_hierarchy_coverage() {
|
||||
let state = crystallize_test_data();
|
||||
|
||||
// Every node should belong to exactly one crystal at level 0
|
||||
for node_id in 0..embeddings.len() {
|
||||
let crystal_id = state.find_crystal(node_id).unwrap();
|
||||
let crystal = state.get_crystal(crystal_id).unwrap();
|
||||
assert_eq!(crystal.level, 0);
|
||||
}
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn test_hierarchy_containment() {
|
||||
let state = crystallize_test_data();
|
||||
|
||||
// Parent crystals must contain all child members
|
||||
for crystal in &state.crystals {
|
||||
if let Some(parent_id) = crystal.parent {
|
||||
let parent = state.get_crystal(parent_id).unwrap();
|
||||
for &member in &crystal.members {
|
||||
assert!(parent.members.contains(&member));
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
## Implementation Phases
|
||||
|
||||
### Phase 1: Research Validation (2 weeks)
|
||||
- Implement nucleation detection
|
||||
- Test crystal growth on synthetic data
|
||||
- Measure hierarchy quality
|
||||
- **Deliverable**: Research report
|
||||
|
||||
### Phase 2: Core Implementation (3 weeks)
|
||||
- Full crystallization algorithm
|
||||
- Hierarchy construction
|
||||
- Energy functions
|
||||
- **Deliverable**: Working crystallization
|
||||
|
||||
### Phase 3: Integration (2 weeks)
|
||||
- HNSW integration
|
||||
- Search optimization
|
||||
- API bindings
|
||||
- **Deliverable**: Integrated feature
|
||||
|
||||
### Phase 4: Optimization (2 weeks)
|
||||
- Incremental updates
|
||||
- Performance tuning
|
||||
- Visualization tools
|
||||
- **Deliverable**: Production-ready
|
||||
|
||||
## Success Metrics
|
||||
|
||||
| Metric | Target |
|
||||
|--------|--------|
|
||||
| Search speedup | >30% |
|
||||
| Hierarchy depth | 3-5 levels |
|
||||
| Coverage | 100% nodes |
|
||||
| Energy reduction | >40% vs. random |
|
||||
|
||||
## Risks and Mitigations
|
||||
|
||||
| Risk | Mitigation |
|
||||
|------|------------|
|
||||
| Poor nucleation | Adaptive thresholds, multiple strategies |
|
||||
| Unstable growth | Energy-based stopping, regularization |
|
||||
| Deep hierarchies | Max depth limit, pruning |
|
||||
| High computation | Approximate methods, caching |
|
||||
Reference in New Issue
Block a user