wifi-densepose/docs/postgres/v2/13-self-healing.md
ruv d803bfe2b1 Squashed 'vendor/ruvector/' content from commit b64c2172
git-subtree-dir: vendor/ruvector
git-subtree-split: b64c21726f2bb37286d9ee36a7869fef60cc6900
2026-02-28 14:39:40 -05:00

# RuVector Postgres v2 - Self-Healing System
## The Missing Piece
We built the sensor (mincut integrity monitoring).
Now we build the actuator (automated remediation).
Most systems detect problems and alert humans. We detect problems and **fix them automatically**.
```
Traditional: Detect → Alert → Human → Diagnose → Fix → Verify
Self-Healing: Detect → Classify → Remediate → Verify → Learn
```
This closes the control loop: for the problem classes it can classify, RuVector remediates without waiting for a human.

---
## Design Principles
1. **Graduated response** — Small problems get small fixes
2. **Reversible actions** — Every remediation can be undone
3. **Blast radius limits** — Never make things worse
4. **Audit trail** — Every action is logged and signed
5. **Learn from outcomes** — Improve remediation over time
---
## Architecture
```
+------------------------------------------------------------------+
|  Integrity Monitor                                               |
|  - Computes lambda_cut on contracted graph                       |
|  - Detects state transitions (normal → stress → critical)        |
+------------------------------------------------------------------+
                                 |
                                 v
+------------------------------------------------------------------+
|  Problem Classifier                                              |
|  - Identifies root cause from witness edges                      |
|  - Maps symptoms to remediation strategies                       |
+------------------------------------------------------------------+
                                 |
                                 v
+------------------------------------------------------------------+
|  Remediation Engine                                              |
|  - Selects appropriate action                                    |
|  - Executes with timeout and rollback                            |
|  - Verifies improvement                                          |
+------------------------------------------------------------------+
                                 |
                                 v
+------------------------------------------------------------------+
|  Outcome Tracker                                                 |
|  - Records success/failure                                       |
|  - Updates strategy weights                                      |
|  - Feeds learning pipeline                                       |
+------------------------------------------------------------------+
```
---
## Problem Classification
### Witness Edge Analysis
The mincut computation produces **witness edges**: the edges in the minimum cut, whose removal would disconnect the graph. The types of these edges, and of the nodes they touch, indicate the problem class.
```rust
// src/healing/classifier.rs

#[derive(Debug, Clone)]
pub enum ProblemClass {
    /// Hot partition overloaded
    HotspotCongestion {
        partition_ids: Vec<i64>,
        load_ratio: f32, // vs average
    },
    /// Centroid imbalance in IVF index
    CentroidSkew {
        centroid_ids: Vec<i64>,
        skew_factor: f32,
    },
    /// Replication lag causing consistency issues
    ReplicationLag {
        replica_ids: Vec<i64>,
        lag_seconds: f32,
    },
    /// Background job contention
    MaintenanceContention {
        job_types: Vec<String>,
        queue_depth: usize,
    },
    /// Index fragmentation
    IndexFragmentation {
        index_ids: Vec<i64>,
        fragmentation_pct: f32,
    },
    /// Memory pressure
    MemoryPressure {
        current_usage_pct: f32,
        largest_consumers: Vec<(String, usize)>,
    },
    /// Unknown (needs human investigation)
    Unknown {
        witness_summary: String,
    },
}

pub fn classify_problem(
    witness_edges: &[WitnessEdge],
    metrics: &SystemMetrics,
) -> ProblemClass {
    // Analyze witness edge patterns
    let edge_types = count_edge_types(witness_edges);
    let node_types = count_node_types(witness_edges);

    // Pattern matching: rules are checked in order and the first match wins
    if edge_types.get("partition_link").copied().unwrap_or(0) > 3 {
        // Multiple partition links weak → hotspot
        let hot_partitions = find_hot_partitions(witness_edges, metrics);
        // Compute the ratio before the Vec is moved into the variant
        let load_ratio = compute_load_ratio(&hot_partitions, metrics);
        return ProblemClass::HotspotCongestion {
            partition_ids: hot_partitions,
            load_ratio,
        };
    }

    if node_types.get("centroid").copied().unwrap_or(0) > 5 {
        // Centroid nodes in witness → skew
        let skewed = find_skewed_centroids(witness_edges, metrics);
        let skew_factor = compute_skew_factor(&skewed, metrics);
        return ProblemClass::CentroidSkew {
            centroid_ids: skewed,
            skew_factor,
        };
    }

    if edge_types.get("replication").copied().unwrap_or(0) > 0 {
        // Replication edges weak
        let lagging = find_lagging_replicas(witness_edges, metrics);
        let lag_seconds = get_max_lag(&lagging, metrics);
        return ProblemClass::ReplicationLag {
            replica_ids: lagging,
            lag_seconds,
        };
    }

    if edge_types.get("dependency").copied().unwrap_or(0) > 2 {
        // Maintenance dependencies weak
        let jobs = find_contending_jobs(witness_edges, metrics);
        return ProblemClass::MaintenanceContention {
            job_types: jobs,
            queue_depth: metrics.maintenance_queue_depth,
        };
    }

    // Check metrics for other patterns
    if metrics.memory_usage_pct > 85.0 {
        return ProblemClass::MemoryPressure {
            current_usage_pct: metrics.memory_usage_pct,
            largest_consumers: metrics.top_memory_consumers.clone(),
        };
    }

    if metrics.index_fragmentation_pct > 30.0 {
        return ProblemClass::IndexFragmentation {
            index_ids: metrics.fragmented_indexes.clone(),
            fragmentation_pct: metrics.index_fragmentation_pct,
        };
    }

    ProblemClass::Unknown {
        witness_summary: summarize_witnesses(witness_edges),
    }
}
```
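The classifier relies on helpers such as `count_edge_types` that this document does not define. The following is a minimal sketch of the tallying step. It assumes a hypothetical `WitnessEdge` with a string `edge_type` field, and `count_of` is an illustrative helper; the real types may differ.

```rust
use std::collections::HashMap;

/// Hypothetical shape of a witness edge; the real type is not shown in this document.
#[derive(Debug, Clone)]
pub struct WitnessEdge {
    pub source: i64,
    pub target: i64,
    pub edge_type: String,
}

/// Count how many witness edges fall under each edge type.
pub fn count_edge_types(edges: &[WitnessEdge]) -> HashMap<&str, usize> {
    let mut counts = HashMap::new();
    for edge in edges {
        *counts.entry(edge.edge_type.as_str()).or_insert(0) += 1;
    }
    counts
}

/// Read a count, treating an absent edge type as zero.
pub fn count_of(counts: &HashMap<&str, usize>, edge_type: &str) -> usize {
    counts.get(edge_type).copied().unwrap_or(0)
}
```

With four `partition_link` edges in the witness set, `count_of(&counts, "partition_link") > 3` holds, which is the condition the first classification rule checks.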
---
## Remediation Strategies
### Strategy Registry
```rust
// src/healing/strategies.rs
use std::collections::HashMap;
use std::time::Duration;

pub trait RemediationStrategy: Send + Sync {
    /// Human-readable name
    fn name(&self) -> &str;

    /// Problem classes this strategy handles (matched by variant, ignoring fields)
    fn handles(&self) -> Vec<ProblemClass>;

    /// Estimate impact (0-1, higher = more disruptive)
    fn impact(&self) -> f32;

    /// Estimate time to complete
    fn estimated_duration(&self) -> Duration;

    /// Can this be reversed?
    fn reversible(&self) -> bool;

    /// Execute the remediation
    fn execute(&self, context: &RemediationContext) -> Result<RemediationResult, Error>;

    /// Rollback if needed
    fn rollback(&self, context: &RemediationContext) -> Result<(), Error>;
}

/// Registry of all available strategies
pub struct StrategyRegistry {
    strategies: Vec<Box<dyn RemediationStrategy>>,
    weights: HashMap<String, f32>, // Learned effectiveness weights
}

impl StrategyRegistry {
    pub fn new() -> Self {
        let mut registry = Self {
            strategies: vec![],
            weights: HashMap::new(),
        };

        // Register built-in strategies
        registry.register(Box::new(RebalancePartitions));
        registry.register(Box::new(RebuildCentroids));
        registry.register(Box::new(PauseMaintenanceJobs));
        registry.register(Box::new(ThrottleIngestion));
        registry.register(Box::new(EvictColdData));
        registry.register(Box::new(CompactFragmentedIndexes));
        registry.register(Box::new(ScaleReadReplicas));
        registry.register(Box::new(DrainHotPartition));

        registry
    }

    /// Add a strategy to the registry
    pub fn register(&mut self, strategy: Box<dyn RemediationStrategy>) {
        self.strategies.push(strategy);
    }

    /// Select best strategy for a problem
    pub fn select(
        &self,
        problem: &ProblemClass,
        context: &SystemContext,
    ) -> Option<&dyn RemediationStrategy> {
        self.strategies.iter()
            .filter(|s| s.handles().iter().any(|h| matches_class(h, problem)))
            .filter(|s| s.impact() <= context.max_allowed_impact)
            .max_by(|a, b| {
                // Strategies without a learned weight default to 1.0
                let weight_a = self.weights.get(a.name()).unwrap_or(&1.0);
                let weight_b = self.weights.get(b.name()).unwrap_or(&1.0);
                // total_cmp is a total order, so a NaN weight cannot panic here
                weight_a.total_cmp(weight_b)
            })
            .map(|s| s.as_ref())
    }
}
```
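In Rust, `ProblemClass::HotspotCongestion { .. }` is valid only in pattern position, so the `handles()` bodies in the strategies below read as shorthand rather than compilable code. One way to express the same idea is to match on a field-less kind. The sketch below assumes a hypothetical `ProblemKind` enum and a trimmed-down `Strategy` trait; neither name comes from the RuVector source.

```rust
use std::collections::HashMap;

/// Hypothetical field-less mirror of `ProblemClass`, used only for matching.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ProblemKind {
    HotspotCongestion,
    MaintenanceContention,
    MemoryPressure,
}

/// Trimmed-down strategy trait: only what selection needs.
pub trait Strategy {
    fn name(&self) -> &str;
    fn handles(&self) -> &[ProblemKind];
    fn impact(&self) -> f32;
}

/// Illustrative strategy backed by plain data.
pub struct Simple {
    pub name: &'static str,
    pub kinds: Vec<ProblemKind>,
    pub impact: f32,
}

impl Strategy for Simple {
    fn name(&self) -> &str { self.name }
    fn handles(&self) -> &[ProblemKind] { &self.kinds }
    fn impact(&self) -> f32 { self.impact }
}

/// Pick the highest-weighted strategy that handles `kind` within the impact budget.
/// Strategies without a learned weight default to 1.0.
pub fn select<'a>(
    strategies: &'a [Box<dyn Strategy>],
    weights: &HashMap<String, f32>,
    kind: ProblemKind,
    max_impact: f32,
) -> Option<&'a dyn Strategy> {
    strategies
        .iter()
        .filter(|s| s.handles().contains(&kind))
        .filter(|s| s.impact() <= max_impact)
        .max_by(|a, b| {
            let wa = weights.get(a.name()).copied().unwrap_or(1.0);
            let wb = weights.get(b.name()).copied().unwrap_or(1.0);
            wa.total_cmp(&wb) // total order, so a NaN weight cannot panic
        })
        .map(|s| s.as_ref())
}
```

Keeping the kind separate from the payload means a strategy can declare what it handles without constructing placeholder field values.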
### Built-in Strategies
#### 1. Rebalance Partitions
```rust
pub struct RebalancePartitions;

impl RemediationStrategy for RebalancePartitions {
    fn name(&self) -> &str { "rebalance_partitions" }

    fn handles(&self) -> Vec<ProblemClass> {
        // Shorthand: `{ .. }` is pattern syntax, read as "any HotspotCongestion"
        vec![ProblemClass::HotspotCongestion { .. }]
    }

    fn impact(&self) -> f32 { 0.3 } // Medium impact

    fn reversible(&self) -> bool { true }

    fn execute(&self, ctx: &RemediationContext) -> Result<RemediationResult, Error> {
        let problem = ctx.problem.as_hotspot()?;

        // Find underutilized partitions
        let cold_partitions = find_cold_partitions(ctx.metrics);

        // Calculate rebalance plan
        let plan = compute_rebalance_plan(
            &problem.partition_ids,
            &cold_partitions,
            ctx.metrics,
        );

        // Execute moves incrementally; borrow the plan so each move can be undone.
        // Applied moves are assumed to end up in ctx.recorded_moves for rollback().
        let mut applied = 0;
        for mv in &plan.moves {
            // Move vectors from hot to cold partition
            move_vectors(mv.from_partition, mv.to_partition, &mv.vector_ids)?;
            applied += 1;

            // Check if integrity improved
            let new_lambda = sample_integrity(ctx.collection_id)?;
            if new_lambda > ctx.initial_lambda * 1.1 {
                // Good progress, continue
            } else if new_lambda < ctx.initial_lambda * 0.9 {
                // Made things worse: undo this move and stop
                move_vectors(mv.to_partition, mv.from_partition, &mv.vector_ids)?;
                applied -= 1;
                break;
            }
        }

        Ok(RemediationResult {
            success: true,
            actions_taken: applied, // moves that remain applied, not the planned total
            improvement: compute_improvement(ctx),
        })
    }

    fn rollback(&self, ctx: &RemediationContext) -> Result<(), Error> {
        // Reverse all recorded moves, newest first
        for mv in ctx.recorded_moves.iter().rev() {
            move_vectors(mv.to_partition, mv.from_partition, &mv.vector_ids)?;
        }
        Ok(())
    }
}
```
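`rollback` replays `ctx.recorded_moves` in reverse, but the document does not show how moves are recorded. The sketch below shows one way to do it with an undo log. `Move`, `MoveLog`, `Placement`, and this two-argument `move_vectors` are illustrative stand-ins, not the actual RuVector types.

```rust
use std::collections::HashMap;

/// Hypothetical record of one applied move.
#[derive(Debug, Clone, PartialEq)]
pub struct Move {
    pub from_partition: i64,
    pub to_partition: i64,
    pub vector_ids: Vec<i64>,
}

/// In-memory stand-in for partition storage: vector id -> partition id.
pub type Placement = HashMap<i64, i64>;

/// Assign every listed vector to partition `to`.
fn move_vectors(placement: &mut Placement, to: i64, ids: &[i64]) {
    for id in ids {
        placement.insert(*id, to);
    }
}

/// Undo log: every applied move is remembered in order.
#[derive(Default)]
pub struct MoveLog {
    applied: Vec<Move>,
}

impl MoveLog {
    /// Apply a move and record it so it can be undone later.
    pub fn apply(&mut self, placement: &mut Placement, mv: Move) {
        move_vectors(placement, mv.to_partition, &mv.vector_ids);
        self.applied.push(mv);
    }

    /// Undo every recorded move, newest first.
    pub fn rollback(&mut self, placement: &mut Placement) {
        while let Some(mv) = self.applied.pop() {
            move_vectors(placement, mv.from_partition, &mv.vector_ids);
        }
    }
}
```

Undoing newest-first matters when the same vector is moved more than once: reversing in the original order would leave it in an intermediate partition.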
#### 2. Pause Maintenance Jobs
```rust
pub struct PauseMaintenanceJobs;

impl RemediationStrategy for PauseMaintenanceJobs {
    fn name(&self) -> &str { "pause_maintenance" }

    fn handles(&self) -> Vec<ProblemClass> {
        vec![ProblemClass::MaintenanceContention { .. }]
    }

    fn impact(&self) -> f32 { 0.1 } // Low impact

    fn reversible(&self) -> bool { true }

    fn execute(&self, ctx: &RemediationContext) -> Result<RemediationResult, Error> {
        let problem = ctx.problem.as_maintenance_contention()?;

        // Pause low-priority jobs. Collecting into a Result stops at the first
        // failure; `?` cannot be used directly inside the `map` closure.
        let paused = problem.job_types.iter()
            .filter(|j| job_priority(j) < Priority::High)
            .map(|j| pause_job(j).map(|_| j.clone()))
            .collect::<Result<Vec<_>, _>>()?;

        // Wait for current operations to drain
        wait_for_drain(Duration::from_secs(30))?;

        // Verify improvement
        let new_lambda = sample_integrity(ctx.collection_id)?;

        Ok(RemediationResult {
            success: new_lambda > ctx.initial_lambda,
            actions_taken: paused.len(),
            improvement: (new_lambda - ctx.initial_lambda) / ctx.initial_lambda,
            metadata: json!({ "paused_jobs": paused }),
        })
    }

    fn rollback(&self, ctx: &RemediationContext) -> Result<(), Error> {
        // Resume paused jobs
        let paused: Vec<String> = ctx.result.metadata["paused_jobs"].as_array()
            .map(|a| a.iter().filter_map(|v| v.as_str().map(String::from)).collect())
            .unwrap_or_default();

        for job in paused {
            resume_job(&job)?;
        }
        Ok(())
    }
}
```
#### 3. Throttle Ingestion
```rust
pub struct ThrottleIngestion;

impl RemediationStrategy for ThrottleIngestion {
    fn name(&self) -> &str { "throttle_ingestion" }

    fn handles(&self) -> Vec<ProblemClass> {
        vec![
            ProblemClass::HotspotCongestion { .. },
            ProblemClass::MemoryPressure { .. },
        ]
    }

    fn impact(&self) -> f32 { 0.4 } // Medium-high (affects writes)

    fn reversible(&self) -> bool { true }

    fn execute(&self, ctx: &RemediationContext) -> Result<RemediationResult, Error> {
        // Calculate throttle percentage based on severity
        let throttle_pct = match ctx.state {
            IntegrityState::Stress => 50,
            IntegrityState::Critical => 90,
            _ => return Ok(RemediationResult::noop()),
        };

        // Read the current throttle first, so rollback restores the real previous value
        let previous_throttle = get_current_throttle(ctx.collection_id);

        // Apply throttle via shared memory (bulk inserts 10 points higher, capped at 100)
        set_throttle_percentage(ctx.collection_id, "insert", throttle_pct)?;
        set_throttle_percentage(ctx.collection_id, "bulk_insert", (throttle_pct + 10).min(100))?;

        // Record for rollback
        ctx.record_action(ThrottleAction {
            collection_id: ctx.collection_id,
            previous_throttle,
            new_throttle: throttle_pct,
        });

        Ok(RemediationResult {
            success: true,
            actions_taken: 1,
            improvement: 0.0, // Preventive, not curative
            metadata: json!({ "throttle_pct": throttle_pct }),
        })
    }

    fn rollback(&self, ctx: &RemediationContext) -> Result<(), Error> {
        // Restore previous throttle level.
        // Note: only the "insert" level is recorded, so "bulk_insert" is not restored here.
        let action: ThrottleAction = ctx.get_action()?;
        set_throttle_percentage(ctx.collection_id, "insert", action.previous_throttle)?;
        Ok(())
    }
}
```
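For a throttle rollback to work, the previous levels have to be captured before the new ones are written, and both the `insert` and `bulk_insert` levels need to be kept. The sketch below shows the state-to-level mapping and a swap that returns the prior levels. `ThrottleLevels`, `throttle_for`, and `apply_throttle` are hypothetical names, and the enum is a local stand-in for `IntegrityState`.

```rust
/// Local stand-in for the integrity states used by `ThrottleIngestion`.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum IntegrityState {
    Normal,
    Stress,
    Critical,
}

/// Throttle percentages for one collection.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct ThrottleLevels {
    pub insert: u8,
    pub bulk_insert: u8,
}

/// Map an integrity state to throttle levels; `None` means no throttling.
pub fn throttle_for(state: IntegrityState) -> Option<ThrottleLevels> {
    let insert = match state {
        IntegrityState::Stress => 50,
        IntegrityState::Critical => 90,
        IntegrityState::Normal => return None,
    };
    // Bulk inserts are throttled 10 points harder, capped at 100%.
    Some(ThrottleLevels { insert, bulk_insert: (insert + 10).min(100) })
}

/// Install new levels and return the previous ones for a later rollback.
pub fn apply_throttle(current: &mut ThrottleLevels, new: ThrottleLevels) -> ThrottleLevels {
    std::mem::replace(current, new)
}
```

Because `apply_throttle` returns the old value in the same step that installs the new one, there is no window in which the "previous" reading can already reflect the new setting.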
#### 4. Scale Read Replicas (Kubernetes)
```rust
pub struct ScaleReadReplicas;

impl RemediationStrategy for ScaleReadReplicas {
    fn name(&self) -> &str { "scale_replicas" }

    fn handles(&self) -> Vec<ProblemClass> {
        vec![ProblemClass::HotspotCongestion { .. }]
    }

    fn impact(&self) -> f32 { 0.2 } // Low impact

    fn reversible(&self) -> bool { true }

    fn execute(&self, ctx: &RemediationContext) -> Result<RemediationResult, Error> {
        // Only available in K8s environment
        let k8s = K8sClient::try_new()?;

        // Get current replica count
        let deployment = k8s.get_deployment("ruvector-read")?;
        let current = deployment.spec.replicas;

        // Scale up by 50% (capped)
        let new_count = (current as f32 * 1.5).ceil() as i32;
        let new_count = new_count.min(ctx.config.max_replicas);

        // Never scale down here: the cap can fall below the current count
        if new_count <= current {
            return Ok(RemediationResult::noop());
        }

        // Apply scale
        k8s.scale_deployment("ruvector-read", new_count)?;

        // Wait for pods to be ready
        k8s.wait_for_ready("ruvector-read", Duration::from_secs(300))?;

        Ok(RemediationResult {
            success: true,
            actions_taken: 1,
            improvement: 0.0, // Measured later
            metadata: json!({
                "previous_replicas": current,
                "new_replicas": new_count,
            }),
        })
    }

    fn rollback(&self, ctx: &RemediationContext) -> Result<(), Error> {
        let k8s = K8sClient::try_new()?;
        // Without a recorded previous count there is nothing to restore
        let previous = ctx.result.metadata["previous_replicas"]
            .as_i64()
            .ok_or(Error::NotReversible)? as i32;
        k8s.scale_deployment("ruvector-read", previous)?;
        Ok(())
    }
}
```
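The replica arithmetic is easy to get subtly wrong at the boundaries, so it helps to isolate it as a pure function. In this sketch `next_replica_count` is a hypothetical helper name. It grows the count by 50%, rounds up, applies the cap, and reports `None` whenever the result would not be a scale-up.

```rust
/// Next replica count: grow by 50% (rounded up), capped at `max_replicas`.
/// Returns `None` when no scale-up is possible, including when the current
/// count already meets or exceeds the cap.
pub fn next_replica_count(current: i32, max_replicas: i32) -> Option<i32> {
    let grown = (current as f32 * 1.5).ceil() as i32;
    let target = grown.min(max_replicas);
    if target > current {
        Some(target)
    } else {
        None
    }
}
```

For example, with a cap of 10, a deployment at 8 replicas is scaled to 10 rather than 12, and a deployment already at 12 is left alone instead of being scaled down to the cap.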
#### 5. Compact Fragmented Indexes
```rust
pub struct CompactFragmentedIndexes;

impl RemediationStrategy for CompactFragmentedIndexes {
    fn name(&self) -> &str { "compact_indexes" }

    fn handles(&self) -> Vec<ProblemClass> {
        vec![ProblemClass::IndexFragmentation { .. }]
    }

    fn impact(&self) -> f32 { 0.5 } // Higher impact (CPU intensive)

    fn reversible(&self) -> bool { false } // Compaction is one-way

    fn execute(&self, ctx: &RemediationContext) -> Result<RemediationResult, Error> {
        let problem = ctx.problem.as_fragmentation()?;

        // Compact most fragmented indexes first
        let mut compacted = 0;
        for index_id in &problem.index_ids {
            // Check if we have time/resources
            if ctx.elapsed() > ctx.timeout / 2 {
                break;
            }

            // Run incremental compaction
            compact_index_incremental(*index_id, CompactConfig {
                max_duration: Duration::from_secs(60),
                batch_size: 10000,
            })?;
            compacted += 1;

            // Check improvement
            let new_lambda = sample_integrity(ctx.collection_id)?;
            if new_lambda > ctx.target_lambda {
                break; // Good enough
            }
        }

        Ok(RemediationResult {
            success: compacted > 0,
            actions_taken: compacted,
            improvement: compute_improvement(ctx),
        })
    }

    fn rollback(&self, _ctx: &RemediationContext) -> Result<(), Error> {
        // Compaction is not reversible
        Err(Error::NotReversible)
    }
}
```
---
## Remediation Engine
### Execution Flow
```rust
// src/healing/engine.rs

pub struct RemediationEngine {
    registry: StrategyRegistry,
    config: HealingConfig,
    outcome_tracker: OutcomeTracker,
}

impl RemediationEngine {
    /// Main healing loop (called when integrity degrades)
    pub fn heal(&self, trigger: &IntegrityTrigger) -> HealingOutcome {
        let ctx = RemediationContext::new(trigger);

        // 1. Classify the problem
        let problem = classify_problem(&trigger.witness_edges, &ctx.metrics);
        log_problem(&problem);

        // 2. Check if we should auto-heal
        if !self.should_auto_heal(&problem, &ctx) {
            return HealingOutcome::Deferred {
                reason: "Problem requires human review",
                problem: problem.clone(),
            };
        }

        // 3. Select strategy
        let strategy = match self.registry.select(&problem, &ctx.system) {
            Some(s) => s,
            None => {
                return HealingOutcome::NoStrategy {
                    problem: problem.clone(),
                };
            }
        };
        log_strategy_selected(strategy.name(), &problem);

        // 4. Execute with timeout and monitoring
        let result = self.execute_with_safeguards(strategy, &ctx);

        // 5. Verify improvement
        let verified = self.verify_improvement(&ctx, &result);

        // 6. Rollback if needed
        if !verified && strategy.reversible() {
            log_rollback(strategy.name());
            if let Err(e) = strategy.rollback(&ctx) {
                log_rollback_failed(e);
            }
        }

        // 7. Record outcome for learning
        self.outcome_tracker.record(OutcomeRecord {
            problem: problem.clone(),
            strategy: strategy.name().to_string(),
            result: result.clone(),
            verified,
            timestamp: Utc::now(),
        });

        HealingOutcome::Completed {
            problem,
            strategy: strategy.name().to_string(),
            result,
            verified,
        }
    }

    fn should_auto_heal(&self, problem: &ProblemClass, ctx: &RemediationContext) -> bool {
        // Don't auto-heal Unknown problems
        if matches!(problem, ProblemClass::Unknown { .. }) {
            return false;
        }

        // Check cooldown
        if ctx.last_healing_attempt.elapsed() < self.config.min_healing_interval {
            return false;
        }

        // Check max attempts (>= so the limit itself is never exceeded)
        if ctx.healing_attempts_in_window >= self.config.max_attempts_per_window {
            return false;
        }

        // Check if problem is getting worse despite healing
        if self.is_healing_ineffective(ctx) {
            return false;
        }

        true
    }

    fn execute_with_safeguards(
        &self,
        strategy: &dyn RemediationStrategy,
        ctx: &RemediationContext,
    ) -> RemediationResult {
        // Time budget: twice the strategy's own estimate
        let timeout = strategy.estimated_duration() * 2;
        let started = std::time::Instant::now();

        // Catch panics so a faulty strategy cannot take down the healing loop.
        // `execute` is synchronous and cannot be preempted from here, so the
        // budget is checked once it returns; long-running strategies are
        // expected to watch `ctx.elapsed()` themselves.
        let result = std::panic::catch_unwind(std::panic::AssertUnwindSafe(|| {
            strategy.execute(ctx)
        }));

        match result {
            Ok(Ok(_)) if started.elapsed() > timeout => RemediationResult::failed("Timeout"),
            Ok(Ok(r)) => r,
            Ok(Err(e)) => RemediationResult::failed(e.to_string()),
            Err(_) => RemediationResult::failed("Panic during remediation"),
        }
    }

    fn verify_improvement(&self, ctx: &RemediationContext, result: &RemediationResult) -> bool {
        if !result.success {
            return false;
        }

        // Wait for system to stabilize
        std::thread::sleep(Duration::from_secs(10));

        // Sample integrity (a failed sample counts as no improvement)
        let new_lambda = sample_integrity(ctx.collection_id).unwrap_or(0.0);

        // Must improve by at least 10%
        new_lambda > ctx.initial_lambda * 1.1
    }
}
```
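A synchronous call such as `strategy.execute(ctx)` cannot be cancelled by wrapping it in a future that is never polled. One way to bound how long the caller waits is to run the work on a helper thread and wait on a channel with a deadline. The sketch below uses only the standard library; `run_with_timeout` and `RunError` are illustrative names, and the `'static` bound means the real engine would need owned or shared context rather than borrowed references.

```rust
use std::panic::{catch_unwind, AssertUnwindSafe};
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

#[derive(Debug, PartialEq)]
pub enum RunError {
    Timeout,
    Panicked,
}

/// Run `work` on a helper thread, giving up after `timeout`.
/// On timeout the helper thread is detached, not killed: it keeps running
/// until `work` returns, so long-running work should also check a deadline itself.
pub fn run_with_timeout<T, F>(work: F, timeout: Duration) -> Result<T, RunError>
where
    T: Send + 'static,
    F: FnOnce() -> T + Send + 'static,
{
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        // Catch the panic inside the thread and report it over the channel.
        let outcome = catch_unwind(AssertUnwindSafe(work));
        let _ = tx.send(outcome); // the receiver may already have timed out
    });

    match rx.recv_timeout(timeout) {
        Ok(Ok(value)) => Ok(value),
        Ok(Err(_panic)) => Err(RunError::Panicked),
        Err(mpsc::RecvTimeoutError::Timeout) => Err(RunError::Timeout),
        // Sender dropped without sending: treat it as an abnormal exit.
        Err(mpsc::RecvTimeoutError::Disconnected) => Err(RunError::Panicked),
    }
}
```

The trade-off is that a timed-out remediation may still be mutating state in the background, so any rollback has to tolerate work that is still in flight.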
### Safety Limits
```rust
// src/healing/config.rs

pub struct HealingConfig {
    /// Minimum time between healing attempts
    pub min_healing_interval: Duration,
    /// Maximum attempts per time window
    pub max_attempts_per_window: usize,
    /// Time window for attempt counting
    pub attempt_window: Duration,
    /// Maximum impact level for auto-healing
    pub max_auto_heal_impact: f32,
    /// Problems that require human approval
    pub require_approval: Vec<ProblemClass>,
    /// Strategies that require human approval
    pub require_approval_strategies: Vec<String>,
    /// Enable learning from outcomes
    pub learning_enabled: bool,
}

impl Default for HealingConfig {
    fn default() -> Self {
        Self {
            min_healing_interval: Duration::from_secs(300), // 5 min
            max_attempts_per_window: 3,
            attempt_window: Duration::from_secs(3600), // 1 hour
            max_auto_heal_impact: 0.5,
            require_approval: vec![],
            require_approval_strategies: vec!["scale_replicas".to_string()],
            learning_enabled: true,
        }
    }
}
```
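The cooldown and the per-window cap interact, so a small worked model may help. The sketch below keeps attempt timestamps in a sliding window and uses the default values above (300 s interval, 3 attempts, 3600 s window) in its usage. `AttemptWindow` and `try_acquire` are hypothetical names, and taking `now` as a parameter is an assumption made to keep the logic testable.

```rust
use std::collections::VecDeque;
use std::time::{Duration, Instant};

/// Tracks recent healing attempts to enforce a cooldown and a per-window cap.
pub struct AttemptWindow {
    min_interval: Duration,
    max_attempts: usize,
    window: Duration,
    attempts: VecDeque<Instant>,
}

impl AttemptWindow {
    pub fn new(min_interval: Duration, max_attempts: usize, window: Duration) -> Self {
        Self { min_interval, max_attempts, window, attempts: VecDeque::new() }
    }

    /// Returns true and records the attempt if healing is allowed at `now`.
    pub fn try_acquire(&mut self, now: Instant) -> bool {
        // Drop attempts that have aged out of the window.
        while let Some(&oldest) = self.attempts.front() {
            if now.duration_since(oldest) >= self.window {
                self.attempts.pop_front();
            } else {
                break;
            }
        }
        // Cooldown: the most recent attempt must be old enough.
        if let Some(&last) = self.attempts.back() {
            if now.duration_since(last) < self.min_interval {
                return false;
            }
        }
        // Cap: at most `max_attempts` inside the window.
        if self.attempts.len() >= self.max_attempts {
            return false;
        }
        self.attempts.push_back(now);
        true
    }
}
```

With the defaults, attempts at 0 s, 300 s, and 600 s are allowed, one at 900 s is refused because the cap of three is reached, and one at 3600 s is allowed again once the first attempt has aged out.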
---
## Learning from Outcomes
### Outcome Tracking
```sql
CREATE TABLE ruvector.healing_outcomes (
    id              BIGSERIAL PRIMARY KEY,
    collection_id   INTEGER NOT NULL,
    tenant_id       TEXT,

    -- Problem
    problem_class   TEXT NOT NULL,
    problem_details JSONB NOT NULL,

    -- Strategy
    strategy_name   TEXT NOT NULL,
    strategy_params JSONB,

    -- Execution
    started_at      TIMESTAMPTZ NOT NULL,
    completed_at    TIMESTAMPTZ,
    duration_ms     INTEGER,

    -- Result
    success         BOOLEAN NOT NULL,
    verified        BOOLEAN,
    actions_taken   INTEGER,
    improvement_pct REAL,
    error_message   TEXT,

    -- Context
    initial_lambda  REAL NOT NULL,
    final_lambda    REAL,
    witness_edges   JSONB,
    system_metrics  JSONB,

    -- Learning
    feedback_score  REAL, -- Human feedback if provided

    created_at      TIMESTAMPTZ DEFAULT NOW()
);

CREATE INDEX idx_healing_outcomes_class ON ruvector.healing_outcomes(problem_class);
CREATE INDEX idx_healing_outcomes_strategy ON ruvector.healing_outcomes(strategy_name);
CREATE INDEX idx_healing_outcomes_success ON ruvector.healing_outcomes(success, verified);
```
### Strategy Weight Updates
```rust
// src/healing/learning.rs

pub struct OutcomeTracker {
    db: DbPool,
    strategy_weights: RwLock<HashMap<String, f32>>,
}

impl OutcomeTracker {
    /// Update strategy weights based on outcomes
    pub fn update_weights(&self) {
        // Look back 30 days (std's Duration has no stable day-based constructor)
        let outcomes = self.get_recent_outcomes(Duration::from_secs(30 * 24 * 60 * 60));

        let mut new_weights = HashMap::new();
        for strategy in self.list_strategies() {
            let strategy_outcomes: Vec<_> = outcomes.iter()
                .filter(|o| o.strategy_name == strategy)
                .collect();

            if strategy_outcomes.is_empty() {
                continue;
            }

            // Calculate effectiveness: only verified successes count
            let success_rate = strategy_outcomes.iter()
                .filter(|o| o.success && o.verified.unwrap_or(false))
                .count() as f32 / strategy_outcomes.len() as f32;

            // Outcomes without a recorded improvement contribute zero to the average
            let avg_improvement = strategy_outcomes.iter()
                .filter_map(|o| o.improvement_pct)
                .sum::<f32>() / strategy_outcomes.len() as f32;

            // Weight = success_rate * (1 + avg_improvement)
            let weight = success_rate * (1.0 + avg_improvement);
            new_weights.insert(strategy, weight);
        }

        *self.strategy_weights.write() = new_weights;
    }

    /// Get effectiveness report
    pub fn effectiveness_report(&self) -> EffectivenessReport {
        let weights = self.strategy_weights.read();

        EffectivenessReport {
            strategies: weights.iter()
                .map(|(name, weight)| StrategyEffectiveness {
                    name: name.clone(),
                    weight: *weight,
                    recent_outcomes: self.get_recent_outcomes_for(name, 10),
                })
                .collect(),
            overall_success_rate: self.compute_overall_success_rate(),
            avg_time_to_recovery: self.compute_avg_recovery_time(),
        }
    }
}
```
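The weight formula, `success_rate * (1 + avg_improvement)`, can be checked by hand on a small sample. The sketch below isolates it as a pure function. `Outcome` and `strategy_weight` are illustrative names, and the struct carries only the three `healing_outcomes` columns the formula reads.

```rust
/// Minimal view of one recorded outcome (a subset of the `healing_outcomes` columns).
pub struct Outcome {
    pub success: bool,
    pub verified: Option<bool>,
    pub improvement_pct: Option<f32>,
}

/// weight = success_rate * (1 + avg_improvement); `None` when there is no history.
pub fn strategy_weight(outcomes: &[Outcome]) -> Option<f32> {
    if outcomes.is_empty() {
        return None;
    }
    let n = outcomes.len() as f32;

    // Only outcomes that both succeeded and were verified count as successes.
    let verified_successes = outcomes
        .iter()
        .filter(|o| o.success && o.verified.unwrap_or(false))
        .count() as f32;

    // Missing improvements contribute zero, since the sum is divided by the full count.
    let total_improvement: f32 = outcomes.iter().filter_map(|o| o.improvement_pct).sum();

    Some((verified_successes / n) * (1.0 + total_improvement / n))
}
```

As a worked case: four outcomes with two verified successes give a success rate of 0.5, and improvements of 0.25, 0.5, and 0.25 (with one missing) average to 0.25 over four, so the weight is 0.5 × 1.25 = 0.625.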
---
## SQL Interface
### Monitoring
```sql
-- View healing status
SELECT * FROM ruvector_healing_status();
-- Returns:
-- {
--   "enabled": true,
--   "last_healing": "2024-01-15T10:30:00Z",
--   "total_healings_24h": 3,
--   "success_rate": 0.67,
--   "active_remediations": [],
--   "cooldown_until": null
-- }

-- View recent healing events
SELECT * FROM ruvector_healing_history(
    since := NOW() - INTERVAL '24 hours',
    limit_ := 20
);

-- View strategy effectiveness
SELECT * FROM ruvector_healing_effectiveness();
### Configuration
```sql
-- Configure healing behavior
SELECT ruvector_healing_configure('{
    "enabled": true,
    "min_healing_interval_seconds": 300,
    "max_attempts_per_hour": 3,
    "max_auto_heal_impact": 0.5,
    "require_approval_strategies": ["scale_replicas"],
    "learning_enabled": true
}'::jsonb);

-- Manually trigger healing (for testing)
SELECT ruvector_healing_trigger('embeddings');

-- Approve pending healing action
SELECT ruvector_healing_approve(action_id := 123);

-- Abort active healing
SELECT ruvector_healing_abort(action_id := 123);
### Manual Remediation
```sql
-- Execute specific strategy manually
SELECT ruvector_healing_execute('embeddings', 'rebalance_partitions', '{
    "dry_run": false,
    "max_moves": 100
}'::jsonb);

-- Rollback last healing action
SELECT ruvector_healing_rollback('embeddings');
---
## Prometheus Metrics
```
# Healing activity
ruvector_healing_attempts_total{collection="embeddings",strategy="rebalance"} 15
ruvector_healing_success_total{collection="embeddings",strategy="rebalance"} 12
ruvector_healing_duration_seconds{collection="embeddings",strategy="rebalance",quantile="0.99"} 45.2
# Current state
ruvector_healing_active{collection="embeddings"} 0
ruvector_healing_cooldown{collection="embeddings"} 0
# Effectiveness
ruvector_healing_success_rate{collection="embeddings"} 0.80
ruvector_healing_avg_improvement{collection="embeddings"} 0.25
ruvector_healing_time_to_recovery_seconds{collection="embeddings"} 120
```
---
## Alerting Integration
```yaml
# Alert when healing is failing
- alert: RuVectorHealingIneffective
  expr: ruvector_healing_success_rate < 0.5
  for: 1h
  labels:
    severity: warning
  annotations:
    summary: "Self-healing is not effective"
    description: "Healing success rate is {{ $value }} - human intervention may be required"

# Alert when healing is disabled
- alert: RuVectorHealingDisabled
  expr: ruvector_healing_enabled == 0
  for: 5m
  labels:
    severity: info
  annotations:
    summary: "Self-healing is disabled for {{ $labels.collection }}"

# Alert when in prolonged degraded state
# attempts_total is a counter with a strategy label, so compare its 30m increase per collection
- alert: RuVectorProlongedDegradation
  expr: ruvector_integrity_state > 1 and on(collection) (sum by (collection) (increase(ruvector_healing_attempts_total[30m])) == 0)
  for: 30m
  labels:
    severity: critical
  annotations:
    summary: "Prolonged degradation without healing attempts"
---
## Testing Requirements
### Unit Tests
- Problem classification accuracy
- Strategy selection logic
- Rollback correctness
- Weight update calculations
### Integration Tests
- End-to-end healing cycle
- Concurrent healing prevention
- Timeout and panic handling
- Kubernetes scaling (mock)
### Chaos Tests
- Random failure injection
- Healing under load
- Cascading failure prevention
- Recovery time objectives
---
## The Complete Loop
```
+----------------+
|  Normal State  |
+-------+--------+
        |
        | (stress detected)
        v
+-------+--------+
|    Classify    |
+-------+--------+
        |
        v
+-------+--------+
|   Remediate    |
+-------+--------+
        |
        | (verify)
        v
+-------+--------+
|     Learn      |
+-------+--------+
        |
        v
+-------+--------+
|  Normal State  |<----+
+----------------+     |
                       |
             (automatic recovery)
```
**This is what makes RuVector truly different:**
We don't just detect problems early. We **fix them automatically** before they become incidents.
The system is not just observable. It is **self-aware and self-repairing**.