# ADR-DB-005: Delta Index Updates

**Status**: Proposed
**Date**: 2026-01-28
**Authors**: RuVector Architecture Team
**Deciders**: Architecture Review Board
**Parent**: ADR-DB-001 Delta Behavior Core Architecture

## Version History

| Version | Date | Author | Changes |
|---------|------|--------|---------|
| 0.1 | 2026-01-28 | Architecture Team | Initial proposal |

---

## Context and Problem Statement

### The Index Update Challenge

HNSW (Hierarchical Navigable Small World) indexes pose several challenges for delta-based updates:

1. **Graph Structure**: HNSW is a proximity graph whose edges connect similar vectors, so changing a vector can invalidate its edges
2. **Insert Complexity**: Each insert costs O(log n * ef_construction) when the graph is maintained properly
3. **Update Semantics**: Standard HNSW has no native update operation
4. **Recall Sensitivity**: Graph quality directly impacts search recall
5. **Concurrent Access**: Updates must not corrupt concurrent searches

### Current HNSW Behavior

Ruvector's existing HNSW implementation (ADR-001) uses:

- `hnsw_rs` library for graph operations
- Mark-delete semantics (no graph restructuring)
- Full rebuild for significant changes
- No incremental edge updates

### Delta Update Scenarios

Percentages are the relative change of the vector (delta magnitude divided by the norm of the original vector).

| Scenario | Vector Change | Impact on Neighbors |
|----------|---------------|---------------------|
| Minor adjustment (<5%) | Negligible | Neighbors likely still valid |
| Moderate change (5-20%) | Moderate | Some edges may be suboptimal |
| Major change (>20%) | Significant | Many edges invalidated |
| Dimension shift | Variable | Depends on affected dimensions |

---

## Decision

### Adopt Lazy Repair with Quality Bounds

We implement a **lazy repair** strategy that:

1. Applies deltas immediately to vector data
2. Defers index repair until quality degrades
3. Uses quality bounds to trigger selective repair
4. Maintains search correctness through fallback mechanisms

### Architecture Overview

```
           ┌─────────────────────────────────────────────────────────────┐
           │                     DELTA INDEX MANAGER                     │
           └─────────────────────────────────────────────────────────────┘
                                          │
     ┌─────────────────┬─────────────────┬┴──────────────────┬─────────────────┐
     │                 │                 │                   │                 │
     v                 v                 v                   v                 v
┌─────────┐       ┌─────────┐      ┌───────────┐      ┌─────────────┐     ┌─────────┐
│  Delta  │       │ Quality │      │   Lazy    │      │ Checkpoint  │     │ Rebuild │
│ Tracker │       │ Monitor │      │  Repair   │      │  Manager    │     │ Trigger │
└─────────┘       └─────────┘      └───────────┘      └─────────────┘     └─────────┘
     │                 │                 │                   │                 │
     │                 │                 │                   │                 │
     v                 v                 v                   v                 v
┌───────────────────────────────────────────────────────────────────────────────────┐
│                                 HNSW INDEX LAYER                                  │
│    Vector Data │ Edge Graph │ Entry Points │ Layer Structure │ Distance Cache     │
└───────────────────────────────────────────────────────────────────────────────────┘
```

### Core Components

#### 1. Delta Tracker

```rust
use std::time::Instant;

use dashmap::DashMap;

/// Tracks pending index updates from deltas
pub struct DeltaTracker {
    /// Pending updates by vector ID
    pending: DashMap<VectorId, PendingUpdate>,
    /// Delta accumulation before index update
    /// (the element type is missing in the source; `VectorDelta` is assumed)
    delta_buffer: Vec<VectorDelta>,
    /// Configuration
    config: DeltaTrackerConfig,
}

#[derive(Debug, Clone)]
pub struct PendingUpdate {
    /// Original vector (before deltas)
    pub original: Vec<f32>,
    /// Current vector (after deltas)
    pub current: Vec<f32>,
    /// Accumulated delta magnitude (sum of per-delta L2 distances)
    pub total_delta_magnitude: f32,
    /// Number of deltas accumulated
    pub delta_count: u32,
    /// First delta timestamp
    pub first_delta_at: Instant,
    /// Index entry status
    pub index_status: IndexStatus,
}

#[derive(Debug, Clone, Copy)]
pub enum IndexStatus {
    /// Index matches vector exactly
    Synchronized,
    /// Index is stale but within bounds
    Stale { estimated_quality: f32 },
    /// Index needs repair
    NeedsRepair,
    /// Not yet indexed
    NotIndexed,
}

impl DeltaTracker {
    /// Record a delta application
    pub fn record_delta(
        &self,
        vector_id: &VectorId,
        old_vector: &[f32],
        new_vector: &[f32],
    ) {
        let delta_magnitude = compute_l2_delta(old_vector, new_vector);

        self.pending
            .entry(vector_id.clone())
            .and_modify(|update| {
                update.current = new_vector.to_vec();
                update.total_delta_magnitude += delta_magnitude;
                update.delta_count += 1;
                update.index_status = self.estimate_status(update);
            })
            .or_insert_with(|| {
                let mut update = PendingUpdate {
                    original: old_vector.to_vec(),
                    current: new_vector.to_vec(),
                    total_delta_magnitude: delta_magnitude,
                    delta_count: 1,
                    first_delta_at: Instant::now(),
                    index_status: IndexStatus::Stale {
                        estimated_quality: self.estimate_quality(delta_magnitude),
                    },
                };
                // Classify the first delta as well, so that a single large
                // change is flagged for repair instead of staying `Stale`.
                update.index_status = self.estimate_status(&update);
                update
            });
    }

    /// Get vectors needing repair
    pub fn get_repair_candidates(&self) -> Vec<VectorId> {
        self.pending
            .iter()
            .filter(|e| matches!(e.value().index_status, IndexStatus::NeedsRepair))
            .map(|e| e.key().clone())
            .collect()
    }

    fn estimate_status(&self, update: &PendingUpdate) -> IndexStatus {
        // Change relative to the norm of the original vector
        let relative_change = update.total_delta_magnitude
            / (vector_magnitude(&update.original) + 1e-10);

        if relative_change > self.config.repair_threshold {
            IndexStatus::NeedsRepair
        } else {
            IndexStatus::Stale {
                estimated_quality: self.estimate_quality(update.total_delta_magnitude),
            }
        }
    }

    fn estimate_quality(&self, delta_magnitude: f32) -> f32 {
        // Quality decays exponentially with the (absolute) delta magnitude.
        // Based on empirical HNSW edge validity studies.
        (-delta_magnitude / self.config.quality_decay_constant).exp()
    }
}
```

#### 2. Quality Monitor

```rust
use std::collections::HashSet;
use std::sync::atomic::Ordering;
use std::time::Instant;

/// Monitors index quality and triggers repairs
pub struct QualityMonitor {
    /// Sampled quality measurements
    measurements: RingBuffer<QualityMeasurement>,
    /// Current quality estimate
    /// (`AtomicF32` is not in std; assumed to come from a crate such as `atomic_float`)
    current_quality: AtomicF32,
    /// Quality bounds configuration
    bounds: QualityBounds,
    /// Repair trigger channel
    repair_trigger: Sender<RepairRequest>,
}

#[derive(Debug, Clone, Copy)]
pub struct QualityBounds {
    /// Minimum acceptable recall
    pub min_recall: f32,
    /// Target recall
    pub target_recall: f32,
    /// Sampling rate (fraction of searches)
    pub sample_rate: f32,
    /// Number of samples for estimate
    pub sample_window: usize,
}

impl Default for QualityBounds {
    fn default() -> Self {
        Self {
            min_recall: 0.90,
            target_recall: 0.95,
            sample_rate: 0.01, // Sample 1% of searches
            sample_window: 1000,
        }
    }
}

#[derive(Debug, Clone)]
pub struct QualityMeasurement {
    /// Estimated recall for this search
    pub recall: f32,
    /// Number of stale vectors encountered
    pub stale_vectors: u32,
    /// Timestamp
    pub timestamp: Instant,
}

impl QualityMonitor {
    /// Sample a search for quality estimation
    pub async fn sample_search(
        &self,
        query: &[f32],
        hnsw_results: &[SearchResult],
        k: usize,
    ) -> Option<QualityMeasurement> {
        // Only sample based on configured rate
        if !self.should_sample() {
            return None;
        }

        // Compute ground truth via exact search on sample
        let exact_results = self.exact_search_sample(query, k).await;

        // Calculate recall
        let hnsw_ids: HashSet<_> = hnsw_results.iter().map(|r| &r.id).collect();
        let exact_ids: HashSet<_> = exact_results.iter().map(|r| &r.id).collect();
        let overlap = hnsw_ids.intersection(&exact_ids).count();
        // Divide by the number of exact hits rather than k: the exact search
        // may return fewer than k results, and k may be 0.
        let denom = exact_ids.len().max(1);
        let recall = overlap as f32 / denom as f32;

        // Count stale vectors in results
        let stale_count = self.count_stale_in_results(hnsw_results);

        let measurement = QualityMeasurement {
            recall,
            stale_vectors: stale_count,
            timestamp: Instant::now(),
        };

        // Update estimates
        self.measurements.push(measurement.clone());
        self.update_quality_estimate();

        // Trigger repair if below bounds
        if recall < self.bounds.min_recall {
            let _ = self.repair_trigger.send(RepairRequest::QualityBelowBounds {
                current_recall: recall,
                min_recall: self.bounds.min_recall,
            });
        }

        Some(measurement)
    }

    fn update_quality_estimate(&self) {
        // Average recall over the most recent `sample_window` measurements
        let recent: Vec<_> = self.measurements
            .iter()
            .rev()
            .take(self.bounds.sample_window)
            .collect();

        if recent.is_empty() {
            return;
        }

        let avg_recall = recent.iter().map(|m| m.recall).sum::<f32>() / recent.len() as f32;
        self.current_quality.store(avg_recall, Ordering::Relaxed);
    }
}
```

#### 3. Lazy Repair Engine

```rust
use std::sync::Arc;
use std::time::Duration;

use tokio::sync::RwLock;
use tokio::task::JoinHandle;

/// Performs lazy index repair operations
pub struct LazyRepairEngine {
    /// HNSW index reference
    hnsw: Arc<RwLock<HnswIndex>>,
    /// Delta tracker reference
    tracker: Arc<DeltaTracker>,
    /// Repair configuration
    config: RepairConfig,
    /// Background repair task
    repair_task: Option<JoinHandle<()>>,
}

#[derive(Debug, Clone)]
pub struct RepairConfig {
    /// Maximum repairs per batch
    pub batch_size: usize,
    /// Repair interval
    pub repair_interval: Duration,
    /// Whether to use background repair
    pub background_repair: bool,
    /// Priority ordering for repairs
    pub priority: RepairPriority,
    /// Relative change below which only the vector data is updated
    pub soft_update_threshold: f32,
    /// Relative change below which the vector is re-inserted
    pub reinsert_threshold: f32,
    /// Relative change above which the local neighborhood is rebuilt
    pub full_repair_threshold: f32,
}

#[derive(Debug, Clone, Copy)]
pub enum RepairPriority {
    /// Repair most changed vectors first
    MostChanged,
    /// Repair oldest pending first
    Oldest,
    /// Repair most frequently accessed first
    MostAccessed,
    /// Round-robin
    RoundRobin,
}

impl LazyRepairEngine {
    /// Repair a single vector in the index
    pub async fn repair_vector(&self, vector_id: &VectorId) -> Result<RepairResult> {
        // Copy the current state and release the DashMap guard before
        // repairing: the repair paths call `pending.remove`, which can
        // deadlock while a reference into the map is still held.
        let (relative_change, current) = {
            let update = self.tracker.pending.get(vector_id)
                .ok_or(RepairError::VectorNotPending)?;
            // Thresholds are relative changes, matching `estimate_status`
            let relative_change = update.total_delta_magnitude
                / (vector_magnitude(&update.original) + 1e-10);
            (relative_change, update.current.clone())
        };

        let mut hnsw = self.hnsw.write().await;

        // Strategy 1: Soft update (if change is small)
        if relative_change < self.config.soft_update_threshold {
            return self.soft_update(&mut hnsw, vector_id, &current).await;
        }

        // Strategy 2: Re-insertion (moderate change)
        if relative_change < self.config.reinsert_threshold {
            return self.reinsert(&mut hnsw, vector_id, &current).await;
        }

        // Strategy 3: Full repair (large change)
        self.full_repair(&mut hnsw, vector_id, &current).await
    }

    /// Soft update: only update vector data, keep edges
    async fn soft_update(
        &self,
        hnsw: &mut HnswIndex,
        vector_id: &VectorId,
        new_vector: &[f32],
    ) -> Result<RepairResult> {
        // Update vector data without touching graph structure
        hnsw.update_vector_data(vector_id, new_vector)?;

        // Mark as synchronized
        self.tracker.pending.remove(vector_id);

        Ok(RepairResult::SoftUpdate {
            vector_id: vector_id.clone(),
            edges_preserved: true,
        })
    }

    /// Re-insertion: remove and re-add to graph
    async fn reinsert(
        &self,
        hnsw: &mut HnswIndex,
        vector_id: &VectorId,
        new_vector: &[f32],
    ) -> Result<RepairResult> {
        // Get current index position
        let old_idx = hnsw.get_index_for_vector(vector_id)?;

        // Mark old position as deleted
        hnsw.mark_deleted(old_idx)?;

        // Insert with new vector
        let new_idx = hnsw.insert_vector(vector_id.clone(), new_vector.to_vec())?;

        // Update tracker
        self.tracker.pending.remove(vector_id);

        Ok(RepairResult::Reinserted {
            vector_id: vector_id.clone(),
            old_idx,
            new_idx,
        })
    }

    /// Full repair: rebuild local neighborhood
    async fn full_repair(
        &self,
        hnsw: &mut HnswIndex,
        vector_id: &VectorId,
        new_vector: &[f32],
    ) -> Result<RepairResult> {
        // Get current neighbors
        let old_neighbors = hnsw.get_neighbors(vector_id)?;

        // Remove and reinsert
        self.reinsert(hnsw, vector_id, new_vector).await?;

        // Repair edges from old neighbors
        let repaired_edges = self.repair_neighbor_edges(hnsw, &old_neighbors).await?;

        Ok(RepairResult::FullRepair {
            vector_id: vector_id.clone(),
            repaired_edges,
        })
    }

    /// Background repair loop
    pub async fn run_background_repair(&self) {
        loop {
            tokio::time::sleep(self.config.repair_interval).await;

            // Get repair candidates
            let candidates = self.tracker.get_repair_candidates();
            if candidates.is_empty() {
                continue;
            }

            // Prioritize
            let prioritized = self.prioritize_repairs(candidates);

            // Repair batch
            for vector_id in prioritized.into_iter().take(self.config.batch_size) {
                if let Err(e) = self.repair_vector(&vector_id).await {
                    tracing::warn!("Repair failed for {}: {}", vector_id, e);
                }
            }
        }
    }
}
```

### Recall vs Latency Tradeoffs

```
     ┌──────────────────────────────────────────────────────────┐
     │                RECALL vs LATENCY TRADEOFF                │
     └──────────────────────────────────────────────────────────┘

Recall
100% │                                     ┌──────────────────┐
     │                                    /                   │
     │                                   /  Immediate Repair  │
     │                                  /                     │
 95% │    ┌───────────────────────────●───────────────────────┤
     │   /                            │                       │
     │  /         Lazy Repair         │                       │
     │ /                              │                       │
 90% │●───────────────────────────────┤                       │
     │                                │                       │
     │  Quality Bound                 │                       │
 85% │  (Min Acceptable)              │                       │
     │                                │                       │
     └────────────────────────────────┴───────────────────────┴───>
                    Low                        Medium          High
                             Write Latency

──── Lazy Repair (Selected): Best balance
- - - Immediate Repair: Highest recall, highest latency
· · · No Repair: Lowest latency, recall degrades
```

### Repair Strategy Selection

```rust
/// Select repair strategy based on delta characteristics
///
/// Note: this function reads `hot_vector_threshold` and `quality_bounds`
/// from the config; `RepairConfig` as declared above does not list them,
/// so they are assumed to be reachable from the configuration passed in.
pub fn select_repair_strategy(
    delta_magnitude: f32,
    vector_norm: f32,
    access_frequency: f32,
    current_recall: f32,
    config: &RepairConfig,
) -> RepairStrategy {
    let relative_change = delta_magnitude / (vector_norm + 1e-10);

    // High access frequency = repair sooner
    let access_weight = if access_frequency > config.hot_vector_threshold {
        0.7 // Reduce thresholds for hot vectors
    } else {
        1.0
    };

    // Low current recall = repair more aggressively
    let recall_weight = if current_recall < config.quality_bounds.min_recall {
        0.5 // Halve thresholds when recall is critical
    } else {
        1.0
    };

    // Both weights scale every threshold the same way
    let weight = access_weight * recall_weight;

    if relative_change < config.soft_update_threshold * weight {
        RepairStrategy::Deferred // No immediate action
    } else if relative_change < config.reinsert_threshold * weight {
        RepairStrategy::SoftUpdate
    } else if relative_change < config.full_repair_threshold * weight {
        RepairStrategy::Reinsert
    } else {
        RepairStrategy::FullRepair
    }
}
```

---

## Recall vs Latency Analysis

### Simulated Workload Results

| Strategy | Write Latency (p50) | Recall@10 | Recall@100 |
|----------|---------------------|-----------|------------|
| Immediate Repair | 2.1 ms | 99.2% | 98.7% |
| Lazy (aggressive) | 150 µs | 96.5% | 95.1% |
| Lazy (balanced) | 80 µs | 94.2% | 92.8% |
| Lazy (relaxed) | 50 µs | 91.3% | 89.5% |
| No Repair | 35 µs | 85.1%* | 82.3%* |

*Degrades over time with update volume

### Quality Degradation Curves

```
Recall over time (1000 updates/sec, no repair):

100% ├────────────
     │            \
 95% │             \──────────────
     │                            \
 90% │                             \────────────
     │                                          \
 85% │                                           \───────
     │
 80% │
     └─────────────────────────────────────────────────────>
     0          5          10          15          20   Minutes

With lazy repair (balanced):

100% ├────────────
     │            \     ┌─────┐     ┌─────┐     ┌─────┐
 95% │             \───┬┘     └───┬┘     └───┬┘     └───
     │                 │ Repair   │ Repair   │ Repair
 90% │                 │          │          │
     │
 85% │
     └─────────────────────────────────────────────────────>
     0          5          10          15          20   Minutes
```

---

## Considered Options

### Option 1: Immediate Rebuild

**Description**: Rebuild affected portions of graph on every delta.

**Pros**:
- Always accurate graph
- Maximum recall
- Simple correctness model

**Cons**:
- O(log n * ef_construction) per update
- High write latency
- Blocks concurrent searches

**Verdict**: Rejected - latency unacceptable for streaming updates.

### Option 2: Periodic Full Rebuild

**Description**: Allow degradation, rebuild entire index periodically.

**Pros**:
- Minimal write overhead
- Predictable rebuild schedule
- Simple implementation

**Cons**:
- Extended degradation periods
- Expensive rebuilds
- Resource spikes

**Verdict**: Available as configuration option, not default.

### Option 3: Lazy Update (Selected)

**Description**: Defer repairs, trigger on quality bounds.

**Pros**:
- Low write latency
- Bounded recall degradation
- Adaptive to workload
- Background repair

**Cons**:
- Complexity in quality monitoring
- Potential recall dips between repairs

**Verdict**: Adopted - optimal balance for delta workloads.

### Option 4: Learned Index Repair

**Description**: ML model predicts optimal repair timing.


**Pros**:
- Potentially optimal decisions
- Adapts to patterns

**Cons**:
- Training complexity
- Model maintenance
- Limited explainability

**Verdict**: Deferred to future version.

---

## Technical Specification

### Index Update API

```rust
use async_trait::async_trait;

/// Delta-aware HNSW index
#[async_trait]
pub trait DeltaAwareIndex: Send + Sync {
    /// Apply delta without immediate index update
    async fn apply_delta(&self, delta: &VectorDelta) -> Result<DeltaApplication>;

    /// Get current recall estimate
    fn current_recall(&self) -> f32;

    /// Get vectors pending repair
    fn pending_repairs(&self) -> Vec<VectorId>;

    /// Force repair of specific vectors
    async fn repair_vectors(&self, ids: &[VectorId]) -> Result<Vec<RepairResult>>;

    /// Trigger background repair cycle
    /// (the return type is missing in the source; `RepairCycleSummary` is a placeholder name)
    async fn trigger_repair_cycle(&self) -> Result<RepairCycleSummary>;

    /// Search with optional quality sampling
    async fn search_with_quality(
        &self,
        query: &[f32],
        k: usize,
        sample_quality: bool,
    ) -> Result<SearchWithQuality>;
}

#[derive(Debug)]
pub struct DeltaApplication {
    pub vector_id: VectorId,
    pub delta_id: DeltaId,
    pub strategy: RepairStrategy,
    pub deferred_repair: bool,
    pub estimated_recall_impact: f32,
}

#[derive(Debug)]
pub struct SearchWithQuality {
    pub results: Vec<SearchResult>,
    pub quality_sample: Option<QualityMeasurement>,
    pub stale_results: u32,
}
```

### Configuration

```rust
use std::time::Duration;

use serde::{Deserialize, Serialize};

#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct DeltaIndexConfig {
    /// Quality bounds for triggering repair
    pub quality_bounds: QualityBounds,
    /// Repair engine configuration
    pub repair_config: RepairConfig,
    /// Delta tracker configuration
    pub tracker_config: DeltaTrackerConfig,
    /// Enable background repair
    pub background_repair: bool,
    /// Checkpoint interval (for recovery)
    pub checkpoint_interval: Duration,
}

impl Default for DeltaIndexConfig {
    fn default() -> Self {
        Self {
            quality_bounds: QualityBounds::default(),
            repair_config: RepairConfig {
                batch_size: 100,
                repair_interval: Duration::from_secs(5),
                background_repair: true,
                priority: RepairPriority::MostChanged,
                soft_update_threshold: 0.05, // 5% change
                reinsert_threshold: 0.20,    // 20% change
                full_repair_threshold: 0.50, // 50% change
            },
            tracker_config: DeltaTrackerConfig {
                repair_threshold: 0.15,
                quality_decay_constant: 0.1,
            },
            background_repair: true,
            checkpoint_interval: Duration::from_secs(300),
        }
    }
}
```

---

## Consequences

### Benefits

1. **Low Write Latency**: Delta application stays sub-millisecond (35-150 µs p50 in the simulated workload, versus 2.1 ms for immediate repair)
2. **Bounded Degradation**: Quality monitoring triggers repair when sampled recall drops below the configured minimum (default 0.90)
3. **Adaptive**: Repairs are prioritized by impact and access patterns
4. **Background Processing**: Repairs don't block user operations
5. **Resource Efficient**: Avoids unnecessary graph restructuring

### Risks and Mitigations

| Risk | Probability | Impact | Mitigation |
|------|-------------|--------|------------|
| Recall below bounds | Low | High | Aggressive repair triggers |
| Repair backlog | Medium | Medium | Batch size tuning |
| Stale search results | Medium | Medium | Optional exact fallback |
| Checkpoint overhead | Low | Low | Incremental checkpoints |

---

## References

1. Malkov, Y., & Yashunin, D. "Efficient and robust approximate nearest neighbor search using Hierarchical Navigable Small World graphs."
2. Singh, A., et al. "FreshDiskANN: A Fast and Accurate Graph-Based ANN Index for Streaming Similarity Search."
3. ADR-001: Ruvector Core Architecture
4. ADR-DB-001: Delta Behavior Core Architecture

---

## Related Decisions

- **ADR-DB-001**: Delta Behavior Core Architecture
- **ADR-DB-003**: Delta Propagation Protocol
- **ADR-DB-007**: Delta Temporal Windows
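
---

## Appendix: Worked Example of Delta Magnitude and Quality Estimates

The Delta Tracker calls `compute_l2_delta` and `vector_magnitude` without defining them. The sketch below shows one plausible reading of those helpers, together with the exponential decay from `estimate_quality` and the relative-change test from `estimate_status`. It is an illustration under stated assumptions, not the project's implementation: the helper bodies, the free-function signatures, and the `relative_change` name are hypothetical, and only the two constants (`repair_threshold = 0.15`, `quality_decay_constant = 0.1`) come from the default configuration above.

```rust
/// Euclidean (L2) norm of a vector.
/// Assumed body for the `vector_magnitude` helper referenced in this ADR.
fn vector_magnitude(v: &[f32]) -> f32 {
    v.iter().map(|x| x * x).sum::<f32>().sqrt()
}

/// L2 distance between two vectors of the same dimension.
/// Assumed body for the `compute_l2_delta` helper referenced in this ADR.
fn compute_l2_delta(old: &[f32], new: &[f32]) -> f32 {
    assert_eq!(old.len(), new.len(), "vectors must have the same dimension");
    old.iter()
        .zip(new)
        .map(|(a, b)| (a - b) * (a - b))
        .sum::<f32>()
        .sqrt()
}

/// Exponential quality decay, mirroring `DeltaTracker::estimate_quality`.
/// Here the decay constant is passed explicitly instead of read from config.
fn estimate_quality(delta_magnitude: f32, decay_constant: f32) -> f32 {
    (-delta_magnitude / decay_constant).exp()
}

/// Accumulated delta relative to the norm of the original vector,
/// as computed inside `DeltaTracker::estimate_status`.
fn relative_change(total_delta: f32, original: &[f32]) -> f32 {
    total_delta / (vector_magnitude(original) + 1e-10)
}

fn main() {
    // Defaults taken from `DeltaIndexConfig::default()`.
    let repair_threshold = 0.15_f32;
    let quality_decay_constant = 0.1_f32;

    // A unit-length vector whose second component moves by 0.1.
    let original = [0.6_f32, 0.8];
    let updated = [0.6_f32, 0.9];

    let delta = compute_l2_delta(&original, &updated);
    let rel = relative_change(delta, &original);
    let quality = estimate_quality(delta, quality_decay_constant);

    println!("delta magnitude   = {delta:.4}");
    println!("relative change   = {rel:.4}");
    println!("estimated quality = {quality:.4}");
    println!("needs repair      = {}", rel > repair_threshold);
}
```

For this input the relative change is about 0.10, which is below the 0.15 repair threshold, so the entry would stay `Stale`; the quality estimate is about exp(-1) ≈ 0.37. Because `estimate_quality` uses the absolute delta magnitude while `estimate_status` uses the relative change, the two values only move together when vectors are normalized to similar lengths.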