Files

ruv cd5943df23 Merge commit 'd803bfe2b1fe7f5e219e50ac20d6801a0a58ac75' as 'vendor/ruvector'

2026-02-28 14:39:40 -05:00

25 KiB

Raw Blame History

ADR-DB-005: Delta Index Updates

Status: Proposed Date: 2026-01-28 Authors: RuVector Architecture Team Deciders: Architecture Review Board Parent: ADR-DB-001 Delta Behavior Core Architecture

Version History

Version	Date	Author	Changes
0.1	2026-01-28	Architecture Team	Initial proposal

Context and Problem Statement

The Index Update Challenge

HNSW (Hierarchical Navigable Small World) indexes present unique challenges for delta-based updates:

Graph Structure: HNSW is a proximity graph where edges connect similar vectors
Insert Complexity: O(log n * ef_construction) for proper graph maintenance
Update Semantics: Standard HNSW has no native update operation
Recall Sensitivity: Graph quality directly impacts search recall
Concurrent Access: Updates must not corrupt concurrent searches

Current HNSW Behavior

Ruvector's existing HNSW implementation (ADR-001) uses:

hnsw_rs library for graph operations
Mark-delete semantics (no graph restructuring)
Full rebuild for significant changes
No incremental edge updates

Delta Update Scenarios

Scenario	Vector Change	Impact on Neighbors
Minor adjustment (<5%)	Negligible	Neighbors likely still valid
Moderate change (5-20%)	Moderate	Some edges may be suboptimal
Major change (>20%)	Significant	Many edges invalidated
Dimension shift	Variable	Depends on affected dimensions

Decision

Adopt Lazy Repair with Quality Bounds

We implement a lazy repair strategy that:

Applies deltas immediately to vector data
Defers index repair until quality degrades
Uses quality bounds to trigger selective repair
Maintains search correctness through fallback mechanisms

Architecture Overview

                    ┌─────────────────────────────────────────────────────────────┐
                    │                    DELTA INDEX MANAGER                       │
                    └─────────────────────────────────────────────────────────────┘
                                               │
         ┌─────────────────┬─────────────────┬┴──────────────────┬─────────────────┐
         │                 │                 │                   │                 │
         v                 v                 v                   v                 v
    ┌─────────┐      ┌─────────┐      ┌───────────┐      ┌─────────────┐    ┌─────────┐
    │  Delta  │      │ Quality │      │   Lazy    │      │  Checkpoint │    │ Rebuild │
    │ Tracker │      │ Monitor │      │  Repair   │      │   Manager   │    │ Trigger │
    └─────────┘      └─────────┘      └───────────┘      └─────────────┘    └─────────┘
         │                 │                 │                   │                 │
         │                 │                 │                   │                 │
         v                 v                 v                   v                 v
    ┌─────────────────────────────────────────────────────────────────────────────────┐
    │                              HNSW INDEX LAYER                                   │
    │    Vector Data │ Edge Graph │ Entry Points │ Layer Structure │ Distance Cache  │
    └─────────────────────────────────────────────────────────────────────────────────┘

Core Components

1. Delta Tracker

/// Tracks pending index updates from deltas
pub struct DeltaTracker {
    /// Pending updates by vector ID
    pending: DashMap<VectorId, PendingUpdate>,
    /// Delta accumulation before index update
    delta_buffer: Vec<AccumulatedDelta>,
    /// Configuration
    config: DeltaTrackerConfig,
}

#[derive(Debug, Clone)]
pub struct PendingUpdate {
    /// Original vector (before deltas)
    pub original: Vec<f32>,
    /// Current vector (after deltas)
    pub current: Vec<f32>,
    /// Accumulated delta magnitude
    pub total_delta_magnitude: f32,
    /// Number of deltas accumulated
    pub delta_count: u32,
    /// First delta timestamp
    pub first_delta_at: Instant,
    /// Index entry status
    pub index_status: IndexStatus,
}

#[derive(Debug, Clone, Copy)]
pub enum IndexStatus {
    /// Index matches vector exactly
    Synchronized,
    /// Index is stale but within bounds
    Stale { estimated_quality: f32 },
    /// Index needs repair
    NeedsRepair,
    /// Not yet indexed
    NotIndexed,
}

impl DeltaTracker {
    /// Record a delta application
    pub fn record_delta(
        &self,
        vector_id: &VectorId,
        old_vector: &[f32],
        new_vector: &[f32],
    ) {
        let delta_magnitude = compute_l2_delta(old_vector, new_vector);

        self.pending
            .entry(vector_id.clone())
            .and_modify(|update| {
                update.current = new_vector.to_vec();
                update.total_delta_magnitude += delta_magnitude;
                update.delta_count += 1;
                update.index_status = self.estimate_status(update);
            })
            .or_insert_with(|| PendingUpdate {
                original: old_vector.to_vec(),
                current: new_vector.to_vec(),
                total_delta_magnitude: delta_magnitude,
                delta_count: 1,
                first_delta_at: Instant::now(),
                index_status: IndexStatus::Stale {
                    estimated_quality: self.estimate_quality(delta_magnitude),
                },
            });
    }

    /// Get vectors needing repair
    pub fn get_repair_candidates(&self) -> Vec<VectorId> {
        self.pending
            .iter()
            .filter(|e| matches!(e.index_status, IndexStatus::NeedsRepair))
            .map(|e| e.key().clone())
            .collect()
    }

    fn estimate_status(&self, update: &PendingUpdate) -> IndexStatus {
        let relative_change = update.total_delta_magnitude
            / (vector_magnitude(&update.original) + 1e-10);

        if relative_change > self.config.repair_threshold {
            IndexStatus::NeedsRepair
        } else {
            IndexStatus::Stale {
                estimated_quality: self.estimate_quality(update.total_delta_magnitude),
            }
        }
    }

    fn estimate_quality(&self, delta_magnitude: f32) -> f32 {
        // Quality decays with delta magnitude
        // Based on empirical HNSW edge validity studies
        (-delta_magnitude / self.config.quality_decay_constant).exp()
    }
}

2. Quality Monitor

/// Monitors index quality and triggers repairs
pub struct QualityMonitor {
    /// Sampled quality measurements
    measurements: RingBuffer<QualityMeasurement>,
    /// Current quality estimate
    current_quality: AtomicF32,
    /// Quality bounds configuration
    bounds: QualityBounds,
    /// Repair trigger channel
    repair_trigger: Sender<RepairRequest>,
}

#[derive(Debug, Clone, Copy)]
pub struct QualityBounds {
    /// Minimum acceptable recall
    pub min_recall: f32,
    /// Target recall
    pub target_recall: f32,
    /// Sampling rate (fraction of searches)
    pub sample_rate: f32,
    /// Number of samples for estimate
    pub sample_window: usize,
}

impl Default for QualityBounds {
    fn default() -> Self {
        Self {
            min_recall: 0.90,
            target_recall: 0.95,
            sample_rate: 0.01, // Sample 1% of searches
            sample_window: 1000,
        }
    }
}

#[derive(Debug, Clone)]
pub struct QualityMeasurement {
    /// Estimated recall for this search
    pub recall: f32,
    /// Number of stale vectors encountered
    pub stale_vectors: u32,
    /// Timestamp
    pub timestamp: Instant,
}

impl QualityMonitor {
    /// Sample a search for quality estimation
    pub async fn sample_search(
        &self,
        query: &[f32],
        hnsw_results: &[SearchResult],
        k: usize,
    ) -> Option<QualityMeasurement> {
        // Only sample based on configured rate
        if !self.should_sample() {
            return None;
        }

        // Compute ground truth via exact search on sample
        let exact_results = self.exact_search_sample(query, k).await;

        // Calculate recall
        let hnsw_ids: HashSet<_> = hnsw_results.iter().map(|r| &r.id).collect();
        let exact_ids: HashSet<_> = exact_results.iter().map(|r| &r.id).collect();
        let overlap = hnsw_ids.intersection(&exact_ids).count();
        let recall = overlap as f32 / k as f32;

        // Count stale vectors in results
        let stale_count = self.count_stale_in_results(hnsw_results);

        let measurement = QualityMeasurement {
            recall,
            stale_vectors: stale_count,
            timestamp: Instant::now(),
        };

        // Update estimates
        self.measurements.push(measurement.clone());
        self.update_quality_estimate();

        // Trigger repair if below bounds
        if recall < self.bounds.min_recall {
            let _ = self.repair_trigger.send(RepairRequest::QualityBelowBounds {
                current_recall: recall,
                min_recall: self.bounds.min_recall,
            });
        }

        Some(measurement)
    }

    fn update_quality_estimate(&self) {
        let recent: Vec<_> = self.measurements
            .iter()
            .rev()
            .take(self.bounds.sample_window)
            .collect();

        if recent.is_empty() {
            return;
        }

        let avg_recall = recent.iter().map(|m| m.recall).sum::<f32>() / recent.len() as f32;
        self.current_quality.store(avg_recall, Ordering::Relaxed);
    }
}

3. Lazy Repair Engine

/// Performs lazy index repair operations
pub struct LazyRepairEngine {
    /// HNSW index reference
    hnsw: Arc<RwLock<HnswIndex>>,
    /// Delta tracker reference
    tracker: Arc<DeltaTracker>,
    /// Repair configuration
    config: RepairConfig,
    /// Background repair task
    repair_task: Option<JoinHandle<()>>,
}

#[derive(Debug, Clone)]
pub struct RepairConfig {
    /// Maximum repairs per batch
    pub batch_size: usize,
    /// Repair interval
    pub repair_interval: Duration,
    /// Whether to use background repair
    pub background_repair: bool,
    /// Priority ordering for repairs
    pub priority: RepairPriority,
}

#[derive(Debug, Clone, Copy)]
pub enum RepairPriority {
    /// Repair most changed vectors first
    MostChanged,
    /// Repair oldest pending first
    Oldest,
    /// Repair most frequently accessed first
    MostAccessed,
    /// Round-robin
    RoundRobin,
}

impl LazyRepairEngine {
    /// Repair a single vector in the index
    pub async fn repair_vector(&self, vector_id: &VectorId) -> Result<RepairResult> {
        // Get current vector state
        let update = self.tracker.pending.get(vector_id)
            .ok_or(RepairError::VectorNotPending)?;

        let mut hnsw = self.hnsw.write().await;

        // Strategy 1: Soft update (if change is small)
        if update.total_delta_magnitude < self.config.soft_update_threshold {
            return self.soft_update(&mut hnsw, vector_id, &update.current).await;
        }

        // Strategy 2: Re-insertion (moderate change)
        if update.total_delta_magnitude < self.config.reinsert_threshold {
            return self.reinsert(&mut hnsw, vector_id, &update.current).await;
        }

        // Strategy 3: Full repair (large change)
        self.full_repair(&mut hnsw, vector_id, &update.current).await
    }

    /// Soft update: only update vector data, keep edges
    async fn soft_update(
        &self,
        hnsw: &mut HnswIndex,
        vector_id: &VectorId,
        new_vector: &[f32],
    ) -> Result<RepairResult> {
        // Update vector data without touching graph structure
        hnsw.update_vector_data(vector_id, new_vector)?;

        // Mark as synchronized
        self.tracker.pending.remove(vector_id);

        Ok(RepairResult::SoftUpdate {
            vector_id: vector_id.clone(),
            edges_preserved: true,
        })
    }

    /// Re-insertion: remove and re-add to graph
    async fn reinsert(
        &self,
        hnsw: &mut HnswIndex,
        vector_id: &VectorId,
        new_vector: &[f32],
    ) -> Result<RepairResult> {
        // Get current index position
        let old_idx = hnsw.get_index_for_vector(vector_id)?;

        // Mark old position as deleted
        hnsw.mark_deleted(old_idx)?;

        // Insert with new vector
        let new_idx = hnsw.insert_vector(vector_id.clone(), new_vector.to_vec())?;

        // Update tracker
        self.tracker.pending.remove(vector_id);

        Ok(RepairResult::Reinserted {
            vector_id: vector_id.clone(),
            old_idx,
            new_idx,
        })
    }

    /// Full repair: rebuild local neighborhood
    async fn full_repair(
        &self,
        hnsw: &mut HnswIndex,
        vector_id: &VectorId,
        new_vector: &[f32],
    ) -> Result<RepairResult> {
        // Get current neighbors
        let old_neighbors = hnsw.get_neighbors(vector_id)?;

        // Remove and reinsert
        self.reinsert(hnsw, vector_id, new_vector).await?;

        // Repair edges from old neighbors
        let repaired_edges = self.repair_neighbor_edges(hnsw, &old_neighbors).await?;

        Ok(RepairResult::FullRepair {
            vector_id: vector_id.clone(),
            repaired_edges,
        })
    }

    /// Background repair loop
    pub async fn run_background_repair(&self) {
        loop {
            tokio::time::sleep(self.config.repair_interval).await;

            // Get repair candidates
            let candidates = self.tracker.get_repair_candidates();

            if candidates.is_empty() {
                continue;
            }

            // Prioritize
            let prioritized = self.prioritize_repairs(candidates);

            // Repair batch
            for vector_id in prioritized.into_iter().take(self.config.batch_size) {
                if let Err(e) = self.repair_vector(&vector_id).await {
                    tracing::warn!("Repair failed for {}: {}", vector_id, e);
                }
            }
        }
    }
}

Recall vs Latency Tradeoffs

                    ┌──────────────────────────────────────────────────────────┐
                    │              RECALL vs LATENCY TRADEOFF                   │
                    └──────────────────────────────────────────────────────────┘

    Recall
    100% │                                    ┌──────────────────┐
         │                                   /                    │
         │                                  /   Immediate Repair  │
         │                                 /                      │
     95% │    ┌───────────────────────────●───────────────────────┤
         │   /                            │                       │
         │  /         Lazy Repair         │                       │
         │ /                              │                       │
     90% │●───────────────────────────────┤                       │
         │                                │                       │
         │    Quality Bound               │                       │
     85% │    (Min Acceptable)            │                       │
         │                                │                       │
         └────────────────────────────────┴───────────────────────┴───>
               Low                    Medium                    High
                              Write Latency

    ──── Lazy Repair (Selected): Best balance
    - - - Immediate Repair: Highest recall, highest latency
    · · · No Repair: Lowest latency, recall degrades

Repair Strategy Selection

/// Select repair strategy based on delta characteristics
pub fn select_repair_strategy(
    delta_magnitude: f32,
    vector_norm: f32,
    access_frequency: f32,
    current_recall: f32,
    config: &RepairConfig,
) -> RepairStrategy {
    let relative_change = delta_magnitude / (vector_norm + 1e-10);

    // High access frequency = repair sooner
    let access_weight = if access_frequency > config.hot_vector_threshold {
        0.7 // Reduce thresholds for hot vectors
    } else {
        1.0
    };

    // Low current recall = repair more aggressively
    let recall_weight = if current_recall < config.quality_bounds.min_recall {
        0.5 // Halve thresholds when recall is critical
    } else {
        1.0
    };

    let effective_threshold = config.soft_update_threshold * access_weight * recall_weight;

    if relative_change < effective_threshold {
        RepairStrategy::Deferred // No immediate action
    } else if relative_change < config.reinsert_threshold * access_weight * recall_weight {
        RepairStrategy::SoftUpdate
    } else if relative_change < config.full_repair_threshold * access_weight * recall_weight {
        RepairStrategy::Reinsert
    } else {
        RepairStrategy::FullRepair
    }
}

Recall vs Latency Analysis

Simulated Workload Results

Strategy	Write Latency (p50)	Recall@10	Recall@100
Immediate Repair	2.1ms	99.2%	98.7%
Lazy (aggressive)	150us	96.5%	95.1%
Lazy (balanced)	80us	94.2%	92.8%
Lazy (relaxed)	50us	91.3%	89.5%
No Repair	35us	85.1%*	82.3%*

*Degrades over time with update volume

Quality Degradation Curves

Recall over time (1000 updates/sec, no repair):

100% ├────────────
     │            \
 95% │             \──────────────
     │                            \
 90% │                             \────────────
     │                                          \
 85% │                                           \───────
     │
 80% │
     └─────────────────────────────────────────────────────>
     0          5          10         15         20    Minutes

With lazy repair (balanced):

100% ├────────────
     │            \     ┌─────┐     ┌─────┐     ┌─────┐
 95% │             \───┬┘     └───┬┘     └───┬┘     └───
     │                 │ Repair   │ Repair   │ Repair
 90% │                 │          │          │
     │
 85% │
     └─────────────────────────────────────────────────────>
     0          5          10         15         20    Minutes

Considered Options

Option 1: Immediate Rebuild

Description: Rebuild affected portions of graph on every delta.

Pros:

Always accurate graph
Maximum recall
Simple correctness model

Cons:

O(log n * ef_construction) per update
High write latency
Blocks concurrent searches

Verdict: Rejected - latency unacceptable for streaming updates.

Option 2: Periodic Full Rebuild

Description: Allow degradation, rebuild entire index periodically.

Pros:

Minimal write overhead
Predictable rebuild schedule
Simple implementation

Cons:

Extended degradation periods
Expensive rebuilds
Resource spikes

Verdict: Available as configuration option, not default.

Option 3: Lazy Update (Selected)

Description: Defer repairs, trigger on quality bounds.

Pros:

Low write latency
Bounded recall degradation
Adaptive to workload
Background repair

Cons:

Complexity in quality monitoring
Potential recall dips

Verdict: Adopted - optimal balance for delta workloads.

Option 4: Learned Index Repair

Description: ML model predicts optimal repair timing.

Pros:

Potentially optimal decisions
Adapts to patterns

Cons:

Training complexity
Model maintenance
Explainability

Verdict: Deferred to future version.

Technical Specification

Index Update API

/// Delta-aware HNSW index
#[async_trait]
pub trait DeltaAwareIndex: Send + Sync {
    /// Apply delta without immediate index update
    async fn apply_delta(&self, delta: &VectorDelta) -> Result<DeltaApplication>;

    /// Get current recall estimate
    fn current_recall(&self) -> f32;

    /// Get vectors pending repair
    fn pending_repairs(&self) -> Vec<VectorId>;

    /// Force repair of specific vectors
    async fn repair_vectors(&self, ids: &[VectorId]) -> Result<Vec<RepairResult>>;

    /// Trigger background repair cycle
    async fn trigger_repair_cycle(&self) -> Result<RepairCycleSummary>;

    /// Search with optional quality sampling
    async fn search_with_quality(
        &self,
        query: &[f32],
        k: usize,
        sample_quality: bool,
    ) -> Result<SearchWithQuality>;
}

#[derive(Debug)]
pub struct DeltaApplication {
    pub vector_id: VectorId,
    pub delta_id: DeltaId,
    pub strategy: RepairStrategy,
    pub deferred_repair: bool,
    pub estimated_recall_impact: f32,
}

#[derive(Debug)]
pub struct SearchWithQuality {
    pub results: Vec<SearchResult>,
    pub quality_sample: Option<QualityMeasurement>,
    pub stale_results: u32,
}

Configuration

#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct DeltaIndexConfig {
    /// Quality bounds for triggering repair
    pub quality_bounds: QualityBounds,
    /// Repair engine configuration
    pub repair_config: RepairConfig,
    /// Delta tracker configuration
    pub tracker_config: DeltaTrackerConfig,
    /// Enable background repair
    pub background_repair: bool,
    /// Checkpoint interval (for recovery)
    pub checkpoint_interval: Duration,
}

impl Default for DeltaIndexConfig {
    fn default() -> Self {
        Self {
            quality_bounds: QualityBounds::default(),
            repair_config: RepairConfig {
                batch_size: 100,
                repair_interval: Duration::from_secs(5),
                background_repair: true,
                priority: RepairPriority::MostChanged,
                soft_update_threshold: 0.05,    // 5% change
                reinsert_threshold: 0.20,       // 20% change
                full_repair_threshold: 0.50,    // 50% change
            },
            tracker_config: DeltaTrackerConfig {
                repair_threshold: 0.15,
                quality_decay_constant: 0.1,
            },
            background_repair: true,
            checkpoint_interval: Duration::from_secs(300),
        }
    }
}

Consequences

Benefits

Low Write Latency: Sub-millisecond delta application
Bounded Degradation: Quality monitoring prevents unacceptable recall
Adaptive: Repairs prioritized by impact and access patterns
Background Processing: Repairs don't block user operations
Resource Efficient: Avoids unnecessary graph restructuring

Risks and Mitigations

Risk	Probability	Impact	Mitigation
Recall below bounds	Low	High	Aggressive repair triggers
Repair backlog	Medium	Medium	Batch size tuning
Stale search results	Medium	Medium	Optional exact fallback
Checkpoint overhead	Low	Low	Incremental checkpoints

References

Malkov, Y., & Yashunin, D. "Efficient and robust approximate nearest neighbor search using HNSW graphs."
Singh, A., et al. "FreshDiskANN: A Fast and Accurate Graph-Based ANN Index for Streaming Similarity Search."
ADR-001: Ruvector Core Architecture
ADR-DB-001: Delta Behavior Core Architecture

ADR-DB-001: Delta Behavior Core Architecture
ADR-DB-003: Delta Propagation Protocol
ADR-DB-007: Delta Temporal Windows

25 KiB Raw Blame History

ADR-DB-005: Delta Index Updates

Version History

Context and Problem Statement

The Index Update Challenge

Current HNSW Behavior

Delta Update Scenarios

Decision

Adopt Lazy Repair with Quality Bounds

Architecture Overview

Core Components

1. Delta Tracker

2. Quality Monitor

3. Lazy Repair Engine

Recall vs Latency Tradeoffs

Repair Strategy Selection

Recall vs Latency Analysis

Simulated Workload Results

Quality Degradation Curves

Considered Options

Option 1: Immediate Rebuild

Option 2: Periodic Full Rebuild

Option 3: Lazy Update (Selected)

Option 4: Learned Index Repair

Technical Specification

Index Update API

Configuration

Consequences

Benefits

Risks and Mitigations

References

Related Decisions

25 KiB

Raw Blame History