Files
wifi-densepose/vendor/ruvector/docs/postgres/v2/13-self-healing.md

29 KiB

RuVector Postgres v2 - Self-Healing System

The Missing Piece

We built the sensor (mincut integrity monitoring). Now we build the actuator (automated remediation).

Most systems detect problems and alert humans. We detect problems and fix them automatically.

Traditional:    Detect → Alert → Human → Diagnose → Fix → Verify
Self-Healing:   Detect → Classify → Remediate → Verify → Learn

This completes the control loop and makes RuVector truly autonomous.


Design Principles

  1. Graduated response — Small problems get small fixes
  2. Reversible actions — Every remediation can be undone
  3. Blast radius limits — Never make things worse
  4. Audit trail — Every action is logged and signed
  5. Learn from outcomes — Improve remediation over time

Architecture

+------------------------------------------------------------------+
|                    Integrity Monitor                              |
|  - Computes lambda_cut on contracted graph                       |
|  - Detects state transitions (normal → stress → critical)        |
+------------------------------------------------------------------+
                              |
                              v
+------------------------------------------------------------------+
|                    Problem Classifier                             |
|  - Identifies root cause from witness edges                      |
|  - Maps symptoms to remediation strategies                       |
+------------------------------------------------------------------+
                              |
                              v
+------------------------------------------------------------------+
|                    Remediation Engine                             |
|  - Selects appropriate action                                    |
|  - Executes with timeout and rollback                            |
|  - Verifies improvement                                          |
+------------------------------------------------------------------+
                              |
                              v
+------------------------------------------------------------------+
|                    Outcome Tracker                                |
|  - Records success/failure                                       |
|  - Updates strategy weights                                      |
|  - Feeds learning pipeline                                       |
+------------------------------------------------------------------+

Problem Classification

Witness Edge Analysis

The mincut computation produces witness edges — the edges that would break the graph. These reveal the problem type.

// src/healing/classifier.rs

#[derive(Debug, Clone)]
pub enum ProblemClass {
    /// Hot partition overloaded
    HotspotCongestion {
        partition_ids: Vec<i64>,
        load_ratio: f32,  // vs average
    },

    /// Centroid imbalance in IVF index
    CentroidSkew {
        centroid_ids: Vec<i64>,
        skew_factor: f32,
    },

    /// Replication lag causing consistency issues
    ReplicationLag {
        replica_ids: Vec<i64>,
        lag_seconds: f32,
    },

    /// Background job contention
    MaintenanceContention {
        job_types: Vec<String>,
        queue_depth: usize,
    },

    /// Index fragmentation
    IndexFragmentation {
        index_ids: Vec<i64>,
        fragmentation_pct: f32,
    },

    /// Memory pressure
    MemoryPressure {
        current_usage_pct: f32,
        largest_consumers: Vec<(String, usize)>,
    },

    /// Unknown (needs human investigation)
    Unknown {
        witness_summary: String,
    },
}

pub fn classify_problem(
    witness_edges: &[WitnessEdge],
    metrics: &SystemMetrics,
) -> ProblemClass {
    // Analyze witness edge patterns
    let edge_types = count_edge_types(witness_edges);
    let node_types = count_node_types(witness_edges);

    // Pattern matching
    if edge_types.get("partition_link").unwrap_or(&0) > &3 {
        // Multiple partition links weak → hotspot
        let hot_partitions = find_hot_partitions(witness_edges, metrics);
        return ProblemClass::HotspotCongestion {
            partition_ids: hot_partitions,
            load_ratio: compute_load_ratio(&hot_partitions, metrics),
        };
    }

    if node_types.get("centroid").unwrap_or(&0) > &5 {
        // Centroid nodes in witness → skew
        let skewed = find_skewed_centroids(witness_edges, metrics);
        return ProblemClass::CentroidSkew {
            centroid_ids: skewed,
            skew_factor: compute_skew_factor(&skewed, metrics),
        };
    }

    if edge_types.get("replication").unwrap_or(&0) > &0 {
        // Replication edges weak
        let lagging = find_lagging_replicas(witness_edges, metrics);
        return ProblemClass::ReplicationLag {
            replica_ids: lagging,
            lag_seconds: get_max_lag(&lagging, metrics),
        };
    }

    if edge_types.get("dependency").unwrap_or(&0) > &2 {
        // Maintenance dependencies weak
        let jobs = find_contending_jobs(witness_edges, metrics);
        return ProblemClass::MaintenanceContention {
            job_types: jobs,
            queue_depth: metrics.maintenance_queue_depth,
        };
    }

    // Check metrics for other patterns
    if metrics.memory_usage_pct > 85.0 {
        return ProblemClass::MemoryPressure {
            current_usage_pct: metrics.memory_usage_pct,
            largest_consumers: metrics.top_memory_consumers.clone(),
        };
    }

    if metrics.index_fragmentation_pct > 30.0 {
        return ProblemClass::IndexFragmentation {
            index_ids: metrics.fragmented_indexes.clone(),
            fragmentation_pct: metrics.index_fragmentation_pct,
        };
    }

    ProblemClass::Unknown {
        witness_summary: summarize_witnesses(witness_edges),
    }
}

Remediation Strategies

Strategy Registry

// src/healing/strategies.rs

pub trait RemediationStrategy: Send + Sync {
    /// Human-readable name
    fn name(&self) -> &str;

    /// Problem classes this strategy handles
    fn handles(&self) -> Vec<ProblemClass>;

    /// Estimate impact (0-1, higher = more disruptive)
    fn impact(&self) -> f32;

    /// Estimate time to complete
    fn estimated_duration(&self) -> Duration;

    /// Can this be reversed?
    fn reversible(&self) -> bool;

    /// Execute the remediation
    fn execute(&self, context: &RemediationContext) -> Result<RemediationResult, Error>;

    /// Rollback if needed
    fn rollback(&self, context: &RemediationContext) -> Result<(), Error>;
}

/// Registry of all available strategies
pub struct StrategyRegistry {
    strategies: Vec<Box<dyn RemediationStrategy>>,
    weights: HashMap<String, f32>,  // Learned effectiveness weights
}

impl StrategyRegistry {
    pub fn new() -> Self {
        let mut registry = Self {
            strategies: vec![],
            weights: HashMap::new(),
        };

        // Register built-in strategies
        registry.register(Box::new(RebalancePartitions));
        registry.register(Box::new(RebuildCentroids));
        registry.register(Box::new(PauseMaintenanceJobs));
        registry.register(Box::new(ThrottleIngestion));
        registry.register(Box::new(EvictColdData));
        registry.register(Box::new(CompactFragmentedIndexes));
        registry.register(Box::new(ScaleReadReplicas));
        registry.register(Box::new(DrainHotPartition));

        registry
    }

    /// Select best strategy for a problem
    pub fn select(&self, problem: &ProblemClass, context: &SystemContext) -> Option<&dyn RemediationStrategy> {
        self.strategies.iter()
            .filter(|s| s.handles().iter().any(|h| matches_class(h, problem)))
            .filter(|s| s.impact() <= context.max_allowed_impact)
            .max_by(|a, b| {
                let weight_a = self.weights.get(a.name()).unwrap_or(&1.0);
                let weight_b = self.weights.get(b.name()).unwrap_or(&1.0);
                weight_a.partial_cmp(weight_b).unwrap()
            })
            .map(|s| s.as_ref())
    }
}

Built-in Strategies

1. Rebalance Partitions

pub struct RebalancePartitions;

impl RemediationStrategy for RebalancePartitions {
    fn name(&self) -> &str { "rebalance_partitions" }

    fn handles(&self) -> Vec<ProblemClass> {
        vec![ProblemClass::HotspotCongestion { .. }]
    }

    fn impact(&self) -> f32 { 0.3 }  // Medium impact

    fn reversible(&self) -> bool { true }

    fn execute(&self, ctx: &RemediationContext) -> Result<RemediationResult, Error> {
        let problem = ctx.problem.as_hotspot()?;

        // Find underutilized partitions
        let cold_partitions = find_cold_partitions(ctx.metrics);

        // Calculate rebalance plan
        let plan = compute_rebalance_plan(
            &problem.partition_ids,
            &cold_partitions,
            ctx.metrics,
        );

        // Execute moves incrementally
        for mv in plan.moves {
            // Move vectors from hot to cold partition
            move_vectors(mv.from_partition, mv.to_partition, mv.vector_ids)?;

            // Check if integrity improved
            let new_lambda = sample_integrity(ctx.collection_id)?;
            if new_lambda > ctx.initial_lambda * 1.1 {
                // Good progress, continue
            } else if new_lambda < ctx.initial_lambda * 0.9 {
                // Made things worse, rollback this move
                move_vectors(mv.to_partition, mv.from_partition, mv.vector_ids)?;
                break;
            }
        }

        Ok(RemediationResult {
            success: true,
            actions_taken: plan.moves.len(),
            improvement: compute_improvement(ctx),
        })
    }

    fn rollback(&self, ctx: &RemediationContext) -> Result<(), Error> {
        // Reverse all recorded moves
        for mv in ctx.recorded_moves.iter().rev() {
            move_vectors(mv.to_partition, mv.from_partition, mv.vector_ids)?;
        }
        Ok(())
    }
}

2. Pause Maintenance Jobs

pub struct PauseMaintenanceJobs;

impl RemediationStrategy for PauseMaintenanceJobs {
    fn name(&self) -> &str { "pause_maintenance" }

    fn handles(&self) -> Vec<ProblemClass> {
        vec![ProblemClass::MaintenanceContention { .. }]
    }

    fn impact(&self) -> f32 { 0.1 }  // Low impact

    fn reversible(&self) -> bool { true }

    fn execute(&self, ctx: &RemediationContext) -> Result<RemediationResult, Error> {
        let problem = ctx.problem.as_maintenance_contention()?;

        // Pause low-priority jobs
        let paused = problem.job_types.iter()
            .filter(|j| job_priority(j) < Priority::High)
            .map(|j| {
                pause_job(j)?;
                j.clone()
            })
            .collect::<Vec<_>>();

        // Wait for current operations to drain
        wait_for_drain(Duration::from_secs(30))?;

        // Verify improvement
        let new_lambda = sample_integrity(ctx.collection_id)?;

        Ok(RemediationResult {
            success: new_lambda > ctx.initial_lambda,
            actions_taken: paused.len(),
            improvement: (new_lambda - ctx.initial_lambda) / ctx.initial_lambda,
            metadata: json!({ "paused_jobs": paused }),
        })
    }

    fn rollback(&self, ctx: &RemediationContext) -> Result<(), Error> {
        // Resume paused jobs
        let paused: Vec<String> = ctx.result.metadata["paused_jobs"].as_array()
            .map(|a| a.iter().filter_map(|v| v.as_str().map(String::from)).collect())
            .unwrap_or_default();

        for job in paused {
            resume_job(&job)?;
        }
        Ok(())
    }
}

3. Throttle Ingestion

pub struct ThrottleIngestion;

impl RemediationStrategy for ThrottleIngestion {
    fn name(&self) -> &str { "throttle_ingestion" }

    fn handles(&self) -> Vec<ProblemClass> {
        vec![
            ProblemClass::HotspotCongestion { .. },
            ProblemClass::MemoryPressure { .. },
        ]
    }

    fn impact(&self) -> f32 { 0.4 }  // Medium-high (affects writes)

    fn reversible(&self) -> bool { true }

    fn execute(&self, ctx: &RemediationContext) -> Result<RemediationResult, Error> {
        // Calculate throttle percentage based on severity
        let throttle_pct = match ctx.state {
            IntegrityState::Stress => 50,
            IntegrityState::Critical => 90,
            _ => return Ok(RemediationResult::noop()),
        };

        // Apply throttle via shared memory
        set_throttle_percentage(ctx.collection_id, "insert", throttle_pct)?;
        set_throttle_percentage(ctx.collection_id, "bulk_insert", throttle_pct + 10)?;

        // Record for rollback
        ctx.record_action(ThrottleAction {
            collection_id: ctx.collection_id,
            previous_throttle: get_current_throttle(ctx.collection_id),
            new_throttle: throttle_pct,
        });

        Ok(RemediationResult {
            success: true,
            actions_taken: 1,
            improvement: 0.0,  // Preventive, not curative
            metadata: json!({ "throttle_pct": throttle_pct }),
        })
    }

    fn rollback(&self, ctx: &RemediationContext) -> Result<(), Error> {
        // Restore previous throttle level
        let action: ThrottleAction = ctx.get_action()?;
        set_throttle_percentage(ctx.collection_id, "insert", action.previous_throttle)?;
        Ok(())
    }
}

4. Scale Read Replicas (Kubernetes)

pub struct ScaleReadReplicas;

impl RemediationStrategy for ScaleReadReplicas {
    fn name(&self) -> &str { "scale_replicas" }

    fn handles(&self) -> Vec<ProblemClass> {
        vec![ProblemClass::HotspotCongestion { .. }]
    }

    fn impact(&self) -> f32 { 0.2 }  // Low impact

    fn reversible(&self) -> bool { true }

    fn execute(&self, ctx: &RemediationContext) -> Result<RemediationResult, Error> {
        // Only available in K8s environment
        let k8s = K8sClient::try_new()?;

        // Get current replica count
        let deployment = k8s.get_deployment("ruvector-read")?;
        let current = deployment.spec.replicas;

        // Scale up by 50% (capped)
        let new_count = (current as f32 * 1.5).ceil() as i32;
        let new_count = new_count.min(ctx.config.max_replicas);

        if new_count == current {
            return Ok(RemediationResult::noop());
        }

        // Apply scale
        k8s.scale_deployment("ruvector-read", new_count)?;

        // Wait for pods to be ready
        k8s.wait_for_ready("ruvector-read", Duration::from_secs(300))?;

        Ok(RemediationResult {
            success: true,
            actions_taken: 1,
            improvement: 0.0,  // Measured later
            metadata: json!({
                "previous_replicas": current,
                "new_replicas": new_count,
            }),
        })
    }

    fn rollback(&self, ctx: &RemediationContext) -> Result<(), Error> {
        let k8s = K8sClient::try_new()?;
        let previous: i32 = ctx.result.metadata["previous_replicas"].as_i64()? as i32;
        k8s.scale_deployment("ruvector-read", previous)?;
        Ok(())
    }
}

5. Compact Fragmented Indexes

pub struct CompactFragmentedIndexes;

impl RemediationStrategy for CompactFragmentedIndexes {
    fn name(&self) -> &str { "compact_indexes" }

    fn handles(&self) -> Vec<ProblemClass> {
        vec![ProblemClass::IndexFragmentation { .. }]
    }

    fn impact(&self) -> f32 { 0.5 }  // Higher impact (CPU intensive)

    fn reversible(&self) -> bool { false }  // Compaction is one-way

    fn execute(&self, ctx: &RemediationContext) -> Result<RemediationResult, Error> {
        let problem = ctx.problem.as_fragmentation()?;

        // Compact most fragmented indexes first
        let mut compacted = 0;
        for index_id in &problem.index_ids {
            // Check if we have time/resources
            if ctx.elapsed() > ctx.timeout / 2 {
                break;
            }

            // Run incremental compaction
            compact_index_incremental(*index_id, CompactConfig {
                max_duration: Duration::from_secs(60),
                batch_size: 10000,
            })?;

            compacted += 1;

            // Check improvement
            let new_lambda = sample_integrity(ctx.collection_id)?;
            if new_lambda > ctx.target_lambda {
                break;  // Good enough
            }
        }

        Ok(RemediationResult {
            success: compacted > 0,
            actions_taken: compacted,
            improvement: compute_improvement(ctx),
        })
    }

    fn rollback(&self, _ctx: &RemediationContext) -> Result<(), Error> {
        // Compaction is not reversible
        Err(Error::NotReversible)
    }
}

Remediation Engine

Execution Flow

// src/healing/engine.rs

pub struct RemediationEngine {
    registry: StrategyRegistry,
    config: HealingConfig,
    outcome_tracker: OutcomeTracker,
}

impl RemediationEngine {
    /// Main healing loop (called when integrity degrades)
    pub fn heal(&self, trigger: &IntegrityTrigger) -> HealingOutcome {
        let ctx = RemediationContext::new(trigger);

        // 1. Classify the problem
        let problem = classify_problem(&trigger.witness_edges, &ctx.metrics);
        log_problem(&problem);

        // 2. Check if we should auto-heal
        if !self.should_auto_heal(&problem, &ctx) {
            return HealingOutcome::Deferred {
                reason: "Problem requires human review",
                problem: problem.clone(),
            };
        }

        // 3. Select strategy
        let strategy = match self.registry.select(&problem, &ctx.system) {
            Some(s) => s,
            None => {
                return HealingOutcome::NoStrategy {
                    problem: problem.clone(),
                };
            }
        };

        log_strategy_selected(strategy.name(), &problem);

        // 4. Execute with timeout and monitoring
        let result = self.execute_with_safeguards(strategy, &ctx);

        // 5. Verify improvement
        let verified = self.verify_improvement(&ctx, &result);

        // 6. Rollback if needed
        if !verified && strategy.reversible() {
            log_rollback(strategy.name());
            if let Err(e) = strategy.rollback(&ctx) {
                log_rollback_failed(e);
            }
        }

        // 7. Record outcome for learning
        self.outcome_tracker.record(OutcomeRecord {
            problem: problem.clone(),
            strategy: strategy.name().to_string(),
            result: result.clone(),
            verified,
            timestamp: Utc::now(),
        });

        HealingOutcome::Completed {
            problem,
            strategy: strategy.name().to_string(),
            result,
            verified,
        }
    }

    fn should_auto_heal(&self, problem: &ProblemClass, ctx: &RemediationContext) -> bool {
        // Don't auto-heal Unknown problems
        if matches!(problem, ProblemClass::Unknown { .. }) {
            return false;
        }

        // Check cooldown
        if ctx.last_healing_attempt.elapsed() < self.config.min_healing_interval {
            return false;
        }

        // Check max attempts
        if ctx.healing_attempts_in_window > self.config.max_attempts_per_window {
            return false;
        }

        // Check if problem is getting worse despite healing
        if self.is_healing_ineffective(ctx) {
            return false;
        }

        true
    }

    fn execute_with_safeguards(
        &self,
        strategy: &dyn RemediationStrategy,
        ctx: &RemediationContext,
    ) -> RemediationResult {
        // Set up timeout
        let timeout = strategy.estimated_duration() * 2;

        // Execute in separate thread with panic catching
        let result = std::panic::catch_unwind(|| {
            tokio::time::timeout(timeout, async {
                strategy.execute(ctx)
            })
        });

        match result {
            Ok(Ok(Ok(r))) => r,
            Ok(Ok(Err(e))) => RemediationResult::failed(e.to_string()),
            Ok(Err(_)) => RemediationResult::failed("Timeout"),
            Err(_) => RemediationResult::failed("Panic during remediation"),
        }
    }

    fn verify_improvement(&self, ctx: &RemediationContext, result: &RemediationResult) -> bool {
        if !result.success {
            return false;
        }

        // Wait for system to stabilize
        std::thread::sleep(Duration::from_secs(10));

        // Sample integrity
        let new_lambda = sample_integrity(ctx.collection_id).unwrap_or(0.0);

        // Must improve by at least 10%
        new_lambda > ctx.initial_lambda * 1.1
    }
}

Safety Limits

// src/healing/config.rs

pub struct HealingConfig {
    /// Minimum time between healing attempts
    pub min_healing_interval: Duration,

    /// Maximum attempts per time window
    pub max_attempts_per_window: usize,

    /// Time window for attempt counting
    pub attempt_window: Duration,

    /// Maximum impact level for auto-healing
    pub max_auto_heal_impact: f32,

    /// Problems that require human approval
    pub require_approval: Vec<ProblemClass>,

    /// Strategies that require human approval
    pub require_approval_strategies: Vec<String>,

    /// Enable learning from outcomes
    pub learning_enabled: bool,
}

impl Default for HealingConfig {
    fn default() -> Self {
        Self {
            min_healing_interval: Duration::from_secs(300),  // 5 min
            max_attempts_per_window: 3,
            attempt_window: Duration::from_secs(3600),       // 1 hour
            max_auto_heal_impact: 0.5,
            require_approval: vec![],
            require_approval_strategies: vec!["scale_replicas".to_string()],
            learning_enabled: true,
        }
    }
}

Learning from Outcomes

Outcome Tracking

CREATE TABLE ruvector.healing_outcomes (
    id              BIGSERIAL PRIMARY KEY,
    collection_id   INTEGER NOT NULL,
    tenant_id       TEXT,

    -- Problem
    problem_class   TEXT NOT NULL,
    problem_details JSONB NOT NULL,

    -- Strategy
    strategy_name   TEXT NOT NULL,
    strategy_params JSONB,

    -- Execution
    started_at      TIMESTAMPTZ NOT NULL,
    completed_at    TIMESTAMPTZ,
    duration_ms     INTEGER,

    -- Result
    success         BOOLEAN NOT NULL,
    verified        BOOLEAN,
    actions_taken   INTEGER,
    improvement_pct REAL,
    error_message   TEXT,

    -- Context
    initial_lambda  REAL NOT NULL,
    final_lambda    REAL,
    witness_edges   JSONB,
    system_metrics  JSONB,

    -- Learning
    feedback_score  REAL,  -- Human feedback if provided

    created_at      TIMESTAMPTZ DEFAULT NOW()
);

CREATE INDEX idx_healing_outcomes_class ON ruvector.healing_outcomes(problem_class);
CREATE INDEX idx_healing_outcomes_strategy ON ruvector.healing_outcomes(strategy_name);
CREATE INDEX idx_healing_outcomes_success ON ruvector.healing_outcomes(success, verified);

Strategy Weight Updates

// src/healing/learning.rs

pub struct OutcomeTracker {
    db: DbPool,
    strategy_weights: RwLock<HashMap<String, f32>>,
}

impl OutcomeTracker {
    /// Update strategy weights based on outcomes
    pub fn update_weights(&self) {
        let outcomes = self.get_recent_outcomes(Duration::from_days(30));

        let mut new_weights = HashMap::new();

        for strategy in self.list_strategies() {
            let strategy_outcomes: Vec<_> = outcomes.iter()
                .filter(|o| o.strategy_name == strategy)
                .collect();

            if strategy_outcomes.is_empty() {
                continue;
            }

            // Calculate effectiveness
            let success_rate = strategy_outcomes.iter()
                .filter(|o| o.success && o.verified.unwrap_or(false))
                .count() as f32 / strategy_outcomes.len() as f32;

            let avg_improvement = strategy_outcomes.iter()
                .filter_map(|o| o.improvement_pct)
                .sum::<f32>() / strategy_outcomes.len() as f32;

            // Weight = success_rate * (1 + avg_improvement)
            let weight = success_rate * (1.0 + avg_improvement);
            new_weights.insert(strategy, weight);
        }

        *self.strategy_weights.write() = new_weights;
    }

    /// Get effectiveness report
    pub fn effectiveness_report(&self) -> EffectivenessReport {
        let weights = self.strategy_weights.read();

        EffectivenessReport {
            strategies: weights.iter()
                .map(|(name, weight)| StrategyEffectiveness {
                    name: name.clone(),
                    weight: *weight,
                    recent_outcomes: self.get_recent_outcomes_for(name, 10),
                })
                .collect(),
            overall_success_rate: self.compute_overall_success_rate(),
            avg_time_to_recovery: self.compute_avg_recovery_time(),
        }
    }
}

SQL Interface

Monitoring

-- View healing status
SELECT * FROM ruvector_healing_status();

-- Returns:
-- {
--   "enabled": true,
--   "last_healing": "2024-01-15T10:30:00Z",
--   "total_healings_24h": 3,
--   "success_rate": 0.67,
--   "active_remediations": [],
--   "cooldown_until": null
-- }

-- View recent healing events
SELECT * FROM ruvector_healing_history(
    since := NOW() - INTERVAL '24 hours',
    limit_ := 20
);

-- View strategy effectiveness
SELECT * FROM ruvector_healing_effectiveness();

Configuration

-- Configure healing behavior
SELECT ruvector_healing_configure('{
    "enabled": true,
    "min_healing_interval_seconds": 300,
    "max_attempts_per_hour": 3,
    "max_auto_heal_impact": 0.5,
    "require_approval_strategies": ["scale_replicas"],
    "learning_enabled": true
}'::jsonb);

-- Manually trigger healing (for testing)
SELECT ruvector_healing_trigger('embeddings');

-- Approve pending healing action
SELECT ruvector_healing_approve(action_id := 123);

-- Abort active healing
SELECT ruvector_healing_abort(action_id := 123);

Manual Remediation

-- Execute specific strategy manually
SELECT ruvector_healing_execute('embeddings', 'rebalance_partitions', '{
    "dry_run": false,
    "max_moves": 100
}'::jsonb);

-- Rollback last healing action
SELECT ruvector_healing_rollback('embeddings');

Prometheus Metrics

# Healing activity
ruvector_healing_attempts_total{collection="embeddings",strategy="rebalance"} 15
ruvector_healing_success_total{collection="embeddings",strategy="rebalance"} 12
ruvector_healing_duration_seconds{collection="embeddings",strategy="rebalance",quantile="0.99"} 45.2

# Current state
ruvector_healing_active{collection="embeddings"} 0
ruvector_healing_cooldown{collection="embeddings"} 0

# Effectiveness
ruvector_healing_success_rate{collection="embeddings"} 0.80
ruvector_healing_avg_improvement{collection="embeddings"} 0.25
ruvector_healing_time_to_recovery_seconds{collection="embeddings"} 120

Alerting Integration

# Alert when healing is failing
- alert: RuVectorHealingIneffective
  expr: ruvector_healing_success_rate < 0.5
  for: 1h
  labels:
    severity: warning
  annotations:
    summary: "Self-healing is not effective"
    description: "Healing success rate is {{ $value }} - human intervention may be required"

# Alert when healing is disabled
- alert: RuVectorHealingDisabled
  expr: ruvector_healing_enabled == 0
  for: 5m
  labels:
    severity: info
  annotations:
    summary: "Self-healing is disabled for {{ $labels.collection }}"

# Alert when in prolonged degraded state
- alert: RuVectorProlongedDegradation
  expr: ruvector_integrity_state > 1 and ruvector_healing_attempts_total == 0
  for: 30m
  labels:
    severity: critical
  annotations:
    summary: "Prolonged degradation without healing attempts"

Testing Requirements

Unit Tests

  • Problem classification accuracy
  • Strategy selection logic
  • Rollback correctness
  • Weight update calculations

Integration Tests

  • End-to-end healing cycle
  • Concurrent healing prevention
  • Timeout and panic handling
  • Kubernetes scaling (mock)

Chaos Tests

  • Random failure injection
  • Healing under load
  • Cascading failure prevention
  • Recovery time objectives

The Complete Loop

       +----------------+
       |  Normal State  |
       +-------+--------+
               |
               | (stress detected)
               v
       +-------+--------+
       |    Classify    |
       +-------+--------+
               |
               v
       +-------+--------+
       |    Remediate   |
       +-------+--------+
               |
               | (verify)
               v
       +-------+--------+
       |     Learn      |
       +-------+--------+
               |
               v
       +-------+--------+
       |  Normal State  |<----+
       +----------------+     |
                              |
                        (automatic recovery)

This is what makes RuVector truly different:

We don't just detect problems early. We fix them automatically before they become incidents.

The system is not just observable. It is self-aware and self-repairing.