git-subtree-dir: vendor/ruvector git-subtree-split: b64c21726f2bb37286d9ee36a7869fef60cc6900
1019 lines
29 KiB
Markdown
1019 lines
29 KiB
Markdown
# RuVector Postgres v2 - Self-Healing System
|
|
|
|
## The Missing Piece
|
|
|
|
We built the sensor (mincut integrity monitoring).
|
|
Now we build the actuator (automated remediation).
|
|
|
|
Most systems detect problems and alert humans. We detect problems and **fix them automatically**.
|
|
|
|
```
|
|
Traditional: Detect → Alert → Human → Diagnose → Fix → Verify
|
|
Self-Healing: Detect → Classify → Remediate → Verify → Learn
|
|
```
|
|
|
|
This completes the control loop and makes RuVector truly autonomous.
|
|
|
|
---
|
|
|
|
## Design Principles
|
|
|
|
1. **Graduated response** — Small problems get small fixes
|
|
2. **Reversible actions** — Every remediation can be undone
|
|
3. **Blast radius limits** — Never make things worse
|
|
4. **Audit trail** — Every action is logged and signed
|
|
5. **Learn from outcomes** — Improve remediation over time
|
|
|
|
---
|
|
|
|
## Architecture
|
|
|
|
```
|
|
+------------------------------------------------------------------+
|
|
| Integrity Monitor |
|
|
| - Computes lambda_cut on contracted graph |
|
|
| - Detects state transitions (normal → stress → critical) |
|
|
+------------------------------------------------------------------+
|
|
|
|
|
v
|
|
+------------------------------------------------------------------+
|
|
| Problem Classifier |
|
|
| - Identifies root cause from witness edges |
|
|
| - Maps symptoms to remediation strategies |
|
|
+------------------------------------------------------------------+
|
|
|
|
|
v
|
|
+------------------------------------------------------------------+
|
|
| Remediation Engine |
|
|
| - Selects appropriate action |
|
|
| - Executes with timeout and rollback |
|
|
| - Verifies improvement |
|
|
+------------------------------------------------------------------+
|
|
|
|
|
v
|
|
+------------------------------------------------------------------+
|
|
| Outcome Tracker |
|
|
| - Records success/failure |
|
|
| - Updates strategy weights |
|
|
| - Feeds learning pipeline |
|
|
+------------------------------------------------------------------+
|
|
```
|
|
|
|
---
|
|
|
|
## Problem Classification
|
|
|
|
### Witness Edge Analysis
|
|
|
|
The mincut computation produces **witness edges** — the edges that would break the graph. These reveal the problem type.
|
|
|
|
```rust
|
|
// src/healing/classifier.rs
|
|
|
|
#[derive(Debug, Clone)]
|
|
pub enum ProblemClass {
|
|
/// Hot partition overloaded
|
|
HotspotCongestion {
|
|
partition_ids: Vec<i64>,
|
|
load_ratio: f32, // vs average
|
|
},
|
|
|
|
/// Centroid imbalance in IVF index
|
|
CentroidSkew {
|
|
centroid_ids: Vec<i64>,
|
|
skew_factor: f32,
|
|
},
|
|
|
|
/// Replication lag causing consistency issues
|
|
ReplicationLag {
|
|
replica_ids: Vec<i64>,
|
|
lag_seconds: f32,
|
|
},
|
|
|
|
/// Background job contention
|
|
MaintenanceContention {
|
|
job_types: Vec<String>,
|
|
queue_depth: usize,
|
|
},
|
|
|
|
/// Index fragmentation
|
|
IndexFragmentation {
|
|
index_ids: Vec<i64>,
|
|
fragmentation_pct: f32,
|
|
},
|
|
|
|
/// Memory pressure
|
|
MemoryPressure {
|
|
current_usage_pct: f32,
|
|
largest_consumers: Vec<(String, usize)>,
|
|
},
|
|
|
|
/// Unknown (needs human investigation)
|
|
Unknown {
|
|
witness_summary: String,
|
|
},
|
|
}
|
|
|
|
pub fn classify_problem(
|
|
witness_edges: &[WitnessEdge],
|
|
metrics: &SystemMetrics,
|
|
) -> ProblemClass {
|
|
// Analyze witness edge patterns
|
|
let edge_types = count_edge_types(witness_edges);
|
|
let node_types = count_node_types(witness_edges);
|
|
|
|
// Pattern matching
|
|
if edge_types.get("partition_link").unwrap_or(&0) > &3 {
|
|
// Multiple partition links weak → hotspot
|
|
let hot_partitions = find_hot_partitions(witness_edges, metrics);
|
|
return ProblemClass::HotspotCongestion {
|
|
partition_ids: hot_partitions,
|
|
load_ratio: compute_load_ratio(&hot_partitions, metrics),
|
|
};
|
|
}
|
|
|
|
if node_types.get("centroid").unwrap_or(&0) > &5 {
|
|
// Centroid nodes in witness → skew
|
|
let skewed = find_skewed_centroids(witness_edges, metrics);
|
|
return ProblemClass::CentroidSkew {
|
|
centroid_ids: skewed,
|
|
skew_factor: compute_skew_factor(&skewed, metrics),
|
|
};
|
|
}
|
|
|
|
if edge_types.get("replication").unwrap_or(&0) > &0 {
|
|
// Replication edges weak
|
|
let lagging = find_lagging_replicas(witness_edges, metrics);
|
|
return ProblemClass::ReplicationLag {
|
|
replica_ids: lagging,
|
|
lag_seconds: get_max_lag(&lagging, metrics),
|
|
};
|
|
}
|
|
|
|
if edge_types.get("dependency").unwrap_or(&0) > &2 {
|
|
// Maintenance dependencies weak
|
|
let jobs = find_contending_jobs(witness_edges, metrics);
|
|
return ProblemClass::MaintenanceContention {
|
|
job_types: jobs,
|
|
queue_depth: metrics.maintenance_queue_depth,
|
|
};
|
|
}
|
|
|
|
// Check metrics for other patterns
|
|
if metrics.memory_usage_pct > 85.0 {
|
|
return ProblemClass::MemoryPressure {
|
|
current_usage_pct: metrics.memory_usage_pct,
|
|
largest_consumers: metrics.top_memory_consumers.clone(),
|
|
};
|
|
}
|
|
|
|
if metrics.index_fragmentation_pct > 30.0 {
|
|
return ProblemClass::IndexFragmentation {
|
|
index_ids: metrics.fragmented_indexes.clone(),
|
|
fragmentation_pct: metrics.index_fragmentation_pct,
|
|
};
|
|
}
|
|
|
|
ProblemClass::Unknown {
|
|
witness_summary: summarize_witnesses(witness_edges),
|
|
}
|
|
}
|
|
```
|
|
|
|
---
|
|
|
|
## Remediation Strategies
|
|
|
|
### Strategy Registry
|
|
|
|
```rust
|
|
// src/healing/strategies.rs
|
|
|
|
pub trait RemediationStrategy: Send + Sync {
|
|
/// Human-readable name
|
|
fn name(&self) -> &str;
|
|
|
|
/// Problem classes this strategy handles
|
|
fn handles(&self) -> Vec<ProblemClass>;
|
|
|
|
/// Estimate impact (0-1, higher = more disruptive)
|
|
fn impact(&self) -> f32;
|
|
|
|
/// Estimate time to complete
|
|
fn estimated_duration(&self) -> Duration;
|
|
|
|
/// Can this be reversed?
|
|
fn reversible(&self) -> bool;
|
|
|
|
/// Execute the remediation
|
|
fn execute(&self, context: &RemediationContext) -> Result<RemediationResult, Error>;
|
|
|
|
/// Rollback if needed
|
|
fn rollback(&self, context: &RemediationContext) -> Result<(), Error>;
|
|
}
|
|
|
|
/// Registry of all available strategies
|
|
pub struct StrategyRegistry {
|
|
strategies: Vec<Box<dyn RemediationStrategy>>,
|
|
weights: HashMap<String, f32>, // Learned effectiveness weights
|
|
}
|
|
|
|
impl StrategyRegistry {
|
|
pub fn new() -> Self {
|
|
let mut registry = Self {
|
|
strategies: vec![],
|
|
weights: HashMap::new(),
|
|
};
|
|
|
|
// Register built-in strategies
|
|
registry.register(Box::new(RebalancePartitions));
|
|
registry.register(Box::new(RebuildCentroids));
|
|
registry.register(Box::new(PauseMaintenanceJobs));
|
|
registry.register(Box::new(ThrottleIngestion));
|
|
registry.register(Box::new(EvictColdData));
|
|
registry.register(Box::new(CompactFragmentedIndexes));
|
|
registry.register(Box::new(ScaleReadReplicas));
|
|
registry.register(Box::new(DrainHotPartition));
|
|
|
|
registry
|
|
}
|
|
|
|
/// Select best strategy for a problem
|
|
pub fn select(&self, problem: &ProblemClass, context: &SystemContext) -> Option<&dyn RemediationStrategy> {
|
|
self.strategies.iter()
|
|
.filter(|s| s.handles().iter().any(|h| matches_class(h, problem)))
|
|
.filter(|s| s.impact() <= context.max_allowed_impact)
|
|
.max_by(|a, b| {
|
|
let weight_a = self.weights.get(a.name()).unwrap_or(&1.0);
|
|
let weight_b = self.weights.get(b.name()).unwrap_or(&1.0);
|
|
weight_a.partial_cmp(weight_b).unwrap()
|
|
})
|
|
.map(|s| s.as_ref())
|
|
}
|
|
}
|
|
```
|
|
|
|
### Built-in Strategies
|
|
|
|
#### 1. Rebalance Partitions
|
|
|
|
```rust
|
|
pub struct RebalancePartitions;
|
|
|
|
impl RemediationStrategy for RebalancePartitions {
|
|
fn name(&self) -> &str { "rebalance_partitions" }
|
|
|
|
fn handles(&self) -> Vec<ProblemClass> {
|
|
vec![ProblemClass::HotspotCongestion { .. }]
|
|
}
|
|
|
|
fn impact(&self) -> f32 { 0.3 } // Medium impact
|
|
|
|
fn reversible(&self) -> bool { true }
|
|
|
|
fn execute(&self, ctx: &RemediationContext) -> Result<RemediationResult, Error> {
|
|
let problem = ctx.problem.as_hotspot()?;
|
|
|
|
// Find underutilized partitions
|
|
let cold_partitions = find_cold_partitions(ctx.metrics);
|
|
|
|
// Calculate rebalance plan
|
|
let plan = compute_rebalance_plan(
|
|
&problem.partition_ids,
|
|
&cold_partitions,
|
|
ctx.metrics,
|
|
);
|
|
|
|
// Execute moves incrementally
|
|
for mv in plan.moves {
|
|
// Move vectors from hot to cold partition
|
|
move_vectors(mv.from_partition, mv.to_partition, mv.vector_ids)?;
|
|
|
|
// Check if integrity improved
|
|
let new_lambda = sample_integrity(ctx.collection_id)?;
|
|
if new_lambda > ctx.initial_lambda * 1.1 {
|
|
// Good progress, continue
|
|
} else if new_lambda < ctx.initial_lambda * 0.9 {
|
|
// Made things worse, rollback this move
|
|
move_vectors(mv.to_partition, mv.from_partition, mv.vector_ids)?;
|
|
break;
|
|
}
|
|
}
|
|
|
|
Ok(RemediationResult {
|
|
success: true,
|
|
actions_taken: plan.moves.len(),
|
|
improvement: compute_improvement(ctx),
|
|
})
|
|
}
|
|
|
|
fn rollback(&self, ctx: &RemediationContext) -> Result<(), Error> {
|
|
// Reverse all recorded moves
|
|
for mv in ctx.recorded_moves.iter().rev() {
|
|
move_vectors(mv.to_partition, mv.from_partition, mv.vector_ids)?;
|
|
}
|
|
Ok(())
|
|
}
|
|
}
|
|
```
|
|
|
|
#### 2. Pause Maintenance Jobs
|
|
|
|
```rust
|
|
pub struct PauseMaintenanceJobs;
|
|
|
|
impl RemediationStrategy for PauseMaintenanceJobs {
|
|
fn name(&self) -> &str { "pause_maintenance" }
|
|
|
|
fn handles(&self) -> Vec<ProblemClass> {
|
|
vec![ProblemClass::MaintenanceContention { .. }]
|
|
}
|
|
|
|
fn impact(&self) -> f32 { 0.1 } // Low impact
|
|
|
|
fn reversible(&self) -> bool { true }
|
|
|
|
fn execute(&self, ctx: &RemediationContext) -> Result<RemediationResult, Error> {
|
|
let problem = ctx.problem.as_maintenance_contention()?;
|
|
|
|
// Pause low-priority jobs
|
|
let paused = problem.job_types.iter()
|
|
.filter(|j| job_priority(j) < Priority::High)
|
|
.map(|j| {
|
|
pause_job(j)?;
|
|
j.clone()
|
|
})
|
|
.collect::<Vec<_>>();
|
|
|
|
// Wait for current operations to drain
|
|
wait_for_drain(Duration::from_secs(30))?;
|
|
|
|
// Verify improvement
|
|
let new_lambda = sample_integrity(ctx.collection_id)?;
|
|
|
|
Ok(RemediationResult {
|
|
success: new_lambda > ctx.initial_lambda,
|
|
actions_taken: paused.len(),
|
|
improvement: (new_lambda - ctx.initial_lambda) / ctx.initial_lambda,
|
|
metadata: json!({ "paused_jobs": paused }),
|
|
})
|
|
}
|
|
|
|
fn rollback(&self, ctx: &RemediationContext) -> Result<(), Error> {
|
|
// Resume paused jobs
|
|
let paused: Vec<String> = ctx.result.metadata["paused_jobs"].as_array()
|
|
.map(|a| a.iter().filter_map(|v| v.as_str().map(String::from)).collect())
|
|
.unwrap_or_default();
|
|
|
|
for job in paused {
|
|
resume_job(&job)?;
|
|
}
|
|
Ok(())
|
|
}
|
|
}
|
|
```
|
|
|
|
#### 3. Throttle Ingestion
|
|
|
|
```rust
|
|
pub struct ThrottleIngestion;
|
|
|
|
impl RemediationStrategy for ThrottleIngestion {
|
|
fn name(&self) -> &str { "throttle_ingestion" }
|
|
|
|
fn handles(&self) -> Vec<ProblemClass> {
|
|
vec![
|
|
ProblemClass::HotspotCongestion { .. },
|
|
ProblemClass::MemoryPressure { .. },
|
|
]
|
|
}
|
|
|
|
fn impact(&self) -> f32 { 0.4 } // Medium-high (affects writes)
|
|
|
|
fn reversible(&self) -> bool { true }
|
|
|
|
fn execute(&self, ctx: &RemediationContext) -> Result<RemediationResult, Error> {
|
|
// Calculate throttle percentage based on severity
|
|
let throttle_pct = match ctx.state {
|
|
IntegrityState::Stress => 50,
|
|
IntegrityState::Critical => 90,
|
|
_ => return Ok(RemediationResult::noop()),
|
|
};
|
|
|
|
// Apply throttle via shared memory
|
|
set_throttle_percentage(ctx.collection_id, "insert", throttle_pct)?;
|
|
set_throttle_percentage(ctx.collection_id, "bulk_insert", throttle_pct + 10)?;
|
|
|
|
// Record for rollback
|
|
ctx.record_action(ThrottleAction {
|
|
collection_id: ctx.collection_id,
|
|
previous_throttle: get_current_throttle(ctx.collection_id),
|
|
new_throttle: throttle_pct,
|
|
});
|
|
|
|
Ok(RemediationResult {
|
|
success: true,
|
|
actions_taken: 1,
|
|
improvement: 0.0, // Preventive, not curative
|
|
metadata: json!({ "throttle_pct": throttle_pct }),
|
|
})
|
|
}
|
|
|
|
fn rollback(&self, ctx: &RemediationContext) -> Result<(), Error> {
|
|
// Restore previous throttle level
|
|
let action: ThrottleAction = ctx.get_action()?;
|
|
set_throttle_percentage(ctx.collection_id, "insert", action.previous_throttle)?;
|
|
Ok(())
|
|
}
|
|
}
|
|
```
|
|
|
|
#### 4. Scale Read Replicas (Kubernetes)
|
|
|
|
```rust
|
|
pub struct ScaleReadReplicas;
|
|
|
|
impl RemediationStrategy for ScaleReadReplicas {
|
|
fn name(&self) -> &str { "scale_replicas" }
|
|
|
|
fn handles(&self) -> Vec<ProblemClass> {
|
|
vec![ProblemClass::HotspotCongestion { .. }]
|
|
}
|
|
|
|
fn impact(&self) -> f32 { 0.2 } // Low impact
|
|
|
|
fn reversible(&self) -> bool { true }
|
|
|
|
fn execute(&self, ctx: &RemediationContext) -> Result<RemediationResult, Error> {
|
|
// Only available in K8s environment
|
|
let k8s = K8sClient::try_new()?;
|
|
|
|
// Get current replica count
|
|
let deployment = k8s.get_deployment("ruvector-read")?;
|
|
let current = deployment.spec.replicas;
|
|
|
|
// Scale up by 50% (capped)
|
|
let new_count = (current as f32 * 1.5).ceil() as i32;
|
|
let new_count = new_count.min(ctx.config.max_replicas);
|
|
|
|
if new_count == current {
|
|
return Ok(RemediationResult::noop());
|
|
}
|
|
|
|
// Apply scale
|
|
k8s.scale_deployment("ruvector-read", new_count)?;
|
|
|
|
// Wait for pods to be ready
|
|
k8s.wait_for_ready("ruvector-read", Duration::from_secs(300))?;
|
|
|
|
Ok(RemediationResult {
|
|
success: true,
|
|
actions_taken: 1,
|
|
improvement: 0.0, // Measured later
|
|
metadata: json!({
|
|
"previous_replicas": current,
|
|
"new_replicas": new_count,
|
|
}),
|
|
})
|
|
}
|
|
|
|
fn rollback(&self, ctx: &RemediationContext) -> Result<(), Error> {
|
|
let k8s = K8sClient::try_new()?;
|
|
let previous: i32 = ctx.result.metadata["previous_replicas"].as_i64()? as i32;
|
|
k8s.scale_deployment("ruvector-read", previous)?;
|
|
Ok(())
|
|
}
|
|
}
|
|
```
|
|
|
|
#### 5. Compact Fragmented Indexes
|
|
|
|
```rust
|
|
pub struct CompactFragmentedIndexes;
|
|
|
|
impl RemediationStrategy for CompactFragmentedIndexes {
|
|
fn name(&self) -> &str { "compact_indexes" }
|
|
|
|
fn handles(&self) -> Vec<ProblemClass> {
|
|
vec![ProblemClass::IndexFragmentation { .. }]
|
|
}
|
|
|
|
fn impact(&self) -> f32 { 0.5 } // Higher impact (CPU intensive)
|
|
|
|
fn reversible(&self) -> bool { false } // Compaction is one-way
|
|
|
|
fn execute(&self, ctx: &RemediationContext) -> Result<RemediationResult, Error> {
|
|
let problem = ctx.problem.as_fragmentation()?;
|
|
|
|
// Compact most fragmented indexes first
|
|
let mut compacted = 0;
|
|
for index_id in &problem.index_ids {
|
|
// Check if we have time/resources
|
|
if ctx.elapsed() > ctx.timeout / 2 {
|
|
break;
|
|
}
|
|
|
|
// Run incremental compaction
|
|
compact_index_incremental(*index_id, CompactConfig {
|
|
max_duration: Duration::from_secs(60),
|
|
batch_size: 10000,
|
|
})?;
|
|
|
|
compacted += 1;
|
|
|
|
// Check improvement
|
|
let new_lambda = sample_integrity(ctx.collection_id)?;
|
|
if new_lambda > ctx.target_lambda {
|
|
break; // Good enough
|
|
}
|
|
}
|
|
|
|
Ok(RemediationResult {
|
|
success: compacted > 0,
|
|
actions_taken: compacted,
|
|
improvement: compute_improvement(ctx),
|
|
})
|
|
}
|
|
|
|
fn rollback(&self, _ctx: &RemediationContext) -> Result<(), Error> {
|
|
// Compaction is not reversible
|
|
Err(Error::NotReversible)
|
|
}
|
|
}
|
|
```
|
|
|
|
---
|
|
|
|
## Remediation Engine
|
|
|
|
### Execution Flow
|
|
|
|
```rust
|
|
// src/healing/engine.rs
|
|
|
|
pub struct RemediationEngine {
|
|
registry: StrategyRegistry,
|
|
config: HealingConfig,
|
|
outcome_tracker: OutcomeTracker,
|
|
}
|
|
|
|
impl RemediationEngine {
|
|
/// Main healing loop (called when integrity degrades)
|
|
pub fn heal(&self, trigger: &IntegrityTrigger) -> HealingOutcome {
|
|
let ctx = RemediationContext::new(trigger);
|
|
|
|
// 1. Classify the problem
|
|
let problem = classify_problem(&trigger.witness_edges, &ctx.metrics);
|
|
log_problem(&problem);
|
|
|
|
// 2. Check if we should auto-heal
|
|
if !self.should_auto_heal(&problem, &ctx) {
|
|
return HealingOutcome::Deferred {
|
|
reason: "Problem requires human review",
|
|
problem: problem.clone(),
|
|
};
|
|
}
|
|
|
|
// 3. Select strategy
|
|
let strategy = match self.registry.select(&problem, &ctx.system) {
|
|
Some(s) => s,
|
|
None => {
|
|
return HealingOutcome::NoStrategy {
|
|
problem: problem.clone(),
|
|
};
|
|
}
|
|
};
|
|
|
|
log_strategy_selected(strategy.name(), &problem);
|
|
|
|
// 4. Execute with timeout and monitoring
|
|
let result = self.execute_with_safeguards(strategy, &ctx);
|
|
|
|
// 5. Verify improvement
|
|
let verified = self.verify_improvement(&ctx, &result);
|
|
|
|
// 6. Rollback if needed
|
|
if !verified && strategy.reversible() {
|
|
log_rollback(strategy.name());
|
|
if let Err(e) = strategy.rollback(&ctx) {
|
|
log_rollback_failed(e);
|
|
}
|
|
}
|
|
|
|
// 7. Record outcome for learning
|
|
self.outcome_tracker.record(OutcomeRecord {
|
|
problem: problem.clone(),
|
|
strategy: strategy.name().to_string(),
|
|
result: result.clone(),
|
|
verified,
|
|
timestamp: Utc::now(),
|
|
});
|
|
|
|
HealingOutcome::Completed {
|
|
problem,
|
|
strategy: strategy.name().to_string(),
|
|
result,
|
|
verified,
|
|
}
|
|
}
|
|
|
|
fn should_auto_heal(&self, problem: &ProblemClass, ctx: &RemediationContext) -> bool {
|
|
// Don't auto-heal Unknown problems
|
|
if matches!(problem, ProblemClass::Unknown { .. }) {
|
|
return false;
|
|
}
|
|
|
|
// Check cooldown
|
|
if ctx.last_healing_attempt.elapsed() < self.config.min_healing_interval {
|
|
return false;
|
|
}
|
|
|
|
// Check max attempts
|
|
if ctx.healing_attempts_in_window > self.config.max_attempts_per_window {
|
|
return false;
|
|
}
|
|
|
|
// Check if problem is getting worse despite healing
|
|
if self.is_healing_ineffective(ctx) {
|
|
return false;
|
|
}
|
|
|
|
true
|
|
}
|
|
|
|
fn execute_with_safeguards(
|
|
&self,
|
|
strategy: &dyn RemediationStrategy,
|
|
ctx: &RemediationContext,
|
|
) -> RemediationResult {
|
|
// Set up timeout
|
|
let timeout = strategy.estimated_duration() * 2;
|
|
|
|
// Execute in separate thread with panic catching
|
|
let result = std::panic::catch_unwind(|| {
|
|
tokio::time::timeout(timeout, async {
|
|
strategy.execute(ctx)
|
|
})
|
|
});
|
|
|
|
match result {
|
|
Ok(Ok(Ok(r))) => r,
|
|
Ok(Ok(Err(e))) => RemediationResult::failed(e.to_string()),
|
|
Ok(Err(_)) => RemediationResult::failed("Timeout"),
|
|
Err(_) => RemediationResult::failed("Panic during remediation"),
|
|
}
|
|
}
|
|
|
|
fn verify_improvement(&self, ctx: &RemediationContext, result: &RemediationResult) -> bool {
|
|
if !result.success {
|
|
return false;
|
|
}
|
|
|
|
// Wait for system to stabilize
|
|
std::thread::sleep(Duration::from_secs(10));
|
|
|
|
// Sample integrity
|
|
let new_lambda = sample_integrity(ctx.collection_id).unwrap_or(0.0);
|
|
|
|
// Must improve by at least 10%
|
|
new_lambda > ctx.initial_lambda * 1.1
|
|
}
|
|
}
|
|
```
|
|
|
|
### Safety Limits
|
|
|
|
```rust
|
|
// src/healing/config.rs
|
|
|
|
pub struct HealingConfig {
|
|
/// Minimum time between healing attempts
|
|
pub min_healing_interval: Duration,
|
|
|
|
/// Maximum attempts per time window
|
|
pub max_attempts_per_window: usize,
|
|
|
|
/// Time window for attempt counting
|
|
pub attempt_window: Duration,
|
|
|
|
/// Maximum impact level for auto-healing
|
|
pub max_auto_heal_impact: f32,
|
|
|
|
/// Problems that require human approval
|
|
pub require_approval: Vec<ProblemClass>,
|
|
|
|
/// Strategies that require human approval
|
|
pub require_approval_strategies: Vec<String>,
|
|
|
|
/// Enable learning from outcomes
|
|
pub learning_enabled: bool,
|
|
}
|
|
|
|
impl Default for HealingConfig {
|
|
fn default() -> Self {
|
|
Self {
|
|
min_healing_interval: Duration::from_secs(300), // 5 min
|
|
max_attempts_per_window: 3,
|
|
attempt_window: Duration::from_secs(3600), // 1 hour
|
|
max_auto_heal_impact: 0.5,
|
|
require_approval: vec![],
|
|
require_approval_strategies: vec!["scale_replicas".to_string()],
|
|
learning_enabled: true,
|
|
}
|
|
}
|
|
}
|
|
```
|
|
|
|
---
|
|
|
|
## Learning from Outcomes
|
|
|
|
### Outcome Tracking
|
|
|
|
```sql
|
|
CREATE TABLE ruvector.healing_outcomes (
|
|
id BIGSERIAL PRIMARY KEY,
|
|
collection_id INTEGER NOT NULL,
|
|
tenant_id TEXT,
|
|
|
|
-- Problem
|
|
problem_class TEXT NOT NULL,
|
|
problem_details JSONB NOT NULL,
|
|
|
|
-- Strategy
|
|
strategy_name TEXT NOT NULL,
|
|
strategy_params JSONB,
|
|
|
|
-- Execution
|
|
started_at TIMESTAMPTZ NOT NULL,
|
|
completed_at TIMESTAMPTZ,
|
|
duration_ms INTEGER,
|
|
|
|
-- Result
|
|
success BOOLEAN NOT NULL,
|
|
verified BOOLEAN,
|
|
actions_taken INTEGER,
|
|
improvement_pct REAL,
|
|
error_message TEXT,
|
|
|
|
-- Context
|
|
initial_lambda REAL NOT NULL,
|
|
final_lambda REAL,
|
|
witness_edges JSONB,
|
|
system_metrics JSONB,
|
|
|
|
-- Learning
|
|
feedback_score REAL, -- Human feedback if provided
|
|
|
|
created_at TIMESTAMPTZ DEFAULT NOW()
|
|
);
|
|
|
|
CREATE INDEX idx_healing_outcomes_class ON ruvector.healing_outcomes(problem_class);
|
|
CREATE INDEX idx_healing_outcomes_strategy ON ruvector.healing_outcomes(strategy_name);
|
|
CREATE INDEX idx_healing_outcomes_success ON ruvector.healing_outcomes(success, verified);
|
|
```
|
|
|
|
### Strategy Weight Updates
|
|
|
|
```rust
|
|
// src/healing/learning.rs
|
|
|
|
pub struct OutcomeTracker {
|
|
db: DbPool,
|
|
strategy_weights: RwLock<HashMap<String, f32>>,
|
|
}
|
|
|
|
impl OutcomeTracker {
|
|
/// Update strategy weights based on outcomes
|
|
pub fn update_weights(&self) {
|
|
let outcomes = self.get_recent_outcomes(Duration::from_days(30));
|
|
|
|
let mut new_weights = HashMap::new();
|
|
|
|
for strategy in self.list_strategies() {
|
|
let strategy_outcomes: Vec<_> = outcomes.iter()
|
|
.filter(|o| o.strategy_name == strategy)
|
|
.collect();
|
|
|
|
if strategy_outcomes.is_empty() {
|
|
continue;
|
|
}
|
|
|
|
// Calculate effectiveness
|
|
let success_rate = strategy_outcomes.iter()
|
|
.filter(|o| o.success && o.verified.unwrap_or(false))
|
|
.count() as f32 / strategy_outcomes.len() as f32;
|
|
|
|
let avg_improvement = strategy_outcomes.iter()
|
|
.filter_map(|o| o.improvement_pct)
|
|
.sum::<f32>() / strategy_outcomes.len() as f32;
|
|
|
|
// Weight = success_rate * (1 + avg_improvement)
|
|
let weight = success_rate * (1.0 + avg_improvement);
|
|
new_weights.insert(strategy, weight);
|
|
}
|
|
|
|
*self.strategy_weights.write() = new_weights;
|
|
}
|
|
|
|
/// Get effectiveness report
|
|
pub fn effectiveness_report(&self) -> EffectivenessReport {
|
|
let weights = self.strategy_weights.read();
|
|
|
|
EffectivenessReport {
|
|
strategies: weights.iter()
|
|
.map(|(name, weight)| StrategyEffectiveness {
|
|
name: name.clone(),
|
|
weight: *weight,
|
|
recent_outcomes: self.get_recent_outcomes_for(name, 10),
|
|
})
|
|
.collect(),
|
|
overall_success_rate: self.compute_overall_success_rate(),
|
|
avg_time_to_recovery: self.compute_avg_recovery_time(),
|
|
}
|
|
}
|
|
}
|
|
```
|
|
|
|
---
|
|
|
|
## SQL Interface
|
|
|
|
### Monitoring
|
|
|
|
```sql
|
|
-- View healing status
|
|
SELECT * FROM ruvector_healing_status();
|
|
|
|
-- Returns:
|
|
-- {
|
|
-- "enabled": true,
|
|
-- "last_healing": "2024-01-15T10:30:00Z",
|
|
-- "total_healings_24h": 3,
|
|
-- "success_rate": 0.67,
|
|
-- "active_remediations": [],
|
|
-- "cooldown_until": null
|
|
-- }
|
|
|
|
-- View recent healing events
|
|
SELECT * FROM ruvector_healing_history(
|
|
since := NOW() - INTERVAL '24 hours',
|
|
limit_ := 20
|
|
);
|
|
|
|
-- View strategy effectiveness
|
|
SELECT * FROM ruvector_healing_effectiveness();
|
|
```
|
|
|
|
### Configuration
|
|
|
|
```sql
|
|
-- Configure healing behavior
|
|
SELECT ruvector_healing_configure('{
|
|
"enabled": true,
|
|
"min_healing_interval_seconds": 300,
|
|
"max_attempts_per_hour": 3,
|
|
"max_auto_heal_impact": 0.5,
|
|
"require_approval_strategies": ["scale_replicas"],
|
|
"learning_enabled": true
|
|
}'::jsonb);
|
|
|
|
-- Manually trigger healing (for testing)
|
|
SELECT ruvector_healing_trigger('embeddings');
|
|
|
|
-- Approve pending healing action
|
|
SELECT ruvector_healing_approve(action_id := 123);
|
|
|
|
-- Abort active healing
|
|
SELECT ruvector_healing_abort(action_id := 123);
|
|
```
|
|
|
|
### Manual Remediation
|
|
|
|
```sql
|
|
-- Execute specific strategy manually
|
|
SELECT ruvector_healing_execute('embeddings', 'rebalance_partitions', '{
|
|
"dry_run": false,
|
|
"max_moves": 100
|
|
}'::jsonb);
|
|
|
|
-- Rollback last healing action
|
|
SELECT ruvector_healing_rollback('embeddings');
|
|
```
|
|
|
|
---
|
|
|
|
## Prometheus Metrics
|
|
|
|
```
|
|
# Healing activity
|
|
ruvector_healing_attempts_total{collection="embeddings",strategy="rebalance"} 15
|
|
ruvector_healing_success_total{collection="embeddings",strategy="rebalance"} 12
|
|
ruvector_healing_duration_seconds{collection="embeddings",strategy="rebalance",quantile="0.99"} 45.2
|
|
|
|
# Current state
|
|
ruvector_healing_active{collection="embeddings"} 0
|
|
ruvector_healing_cooldown{collection="embeddings"} 0
|
|
|
|
# Effectiveness
|
|
ruvector_healing_success_rate{collection="embeddings"} 0.80
|
|
ruvector_healing_avg_improvement{collection="embeddings"} 0.25
|
|
ruvector_healing_time_to_recovery_seconds{collection="embeddings"} 120
|
|
```
|
|
|
|
---
|
|
|
|
## Alerting Integration
|
|
|
|
```yaml
|
|
# Alert when healing is failing
|
|
- alert: RuVectorHealingIneffective
|
|
expr: ruvector_healing_success_rate < 0.5
|
|
for: 1h
|
|
labels:
|
|
severity: warning
|
|
annotations:
|
|
summary: "Self-healing is not effective"
|
|
description: "Healing success rate is {{ $value }} - human intervention may be required"
|
|
|
|
# Alert when healing is disabled
|
|
- alert: RuVectorHealingDisabled
|
|
expr: ruvector_healing_enabled == 0
|
|
for: 5m
|
|
labels:
|
|
severity: info
|
|
annotations:
|
|
summary: "Self-healing is disabled for {{ $labels.collection }}"
|
|
|
|
# Alert when in prolonged degraded state
|
|
- alert: RuVectorProlongedDegradation
|
|
expr: ruvector_integrity_state > 1 and ruvector_healing_attempts_total == 0
|
|
for: 30m
|
|
labels:
|
|
severity: critical
|
|
annotations:
|
|
summary: "Prolonged degradation without healing attempts"
|
|
```
|
|
|
|
---
|
|
|
|
## Testing Requirements
|
|
|
|
### Unit Tests
|
|
- Problem classification accuracy
|
|
- Strategy selection logic
|
|
- Rollback correctness
|
|
- Weight update calculations
|
|
|
|
### Integration Tests
|
|
- End-to-end healing cycle
|
|
- Concurrent healing prevention
|
|
- Timeout and panic handling
|
|
- Kubernetes scaling (mock)
|
|
|
|
### Chaos Tests
|
|
- Random failure injection
|
|
- Healing under load
|
|
- Cascading failure prevention
|
|
- Recovery time objectives
|
|
|
|
---
|
|
|
|
## The Complete Loop
|
|
|
|
```
|
|
+----------------+
|
|
| Normal State |
|
|
+-------+--------+
|
|
|
|
|
| (stress detected)
|
|
v
|
|
+-------+--------+
|
|
| Classify |
|
|
+-------+--------+
|
|
|
|
|
v
|
|
+-------+--------+
|
|
| Remediate |
|
|
+-------+--------+
|
|
|
|
|
| (verify)
|
|
v
|
|
+-------+--------+
|
|
| Learn |
|
|
+-------+--------+
|
|
|
|
|
v
|
|
+-------+--------+
|
|
| Normal State |<----+
|
|
+----------------+ |
|
|
|
|
|
(automatic recovery)
|
|
```
|
|
|
|
**This is what makes RuVector truly different:**
|
|
|
|
We don't just detect problems early. We **fix them automatically** before they become incidents.
|
|
|
|
The system is not just observable. It is **self-aware and self-repairing**.
|