wifi-densepose/vendor/ruvector/docs/adr/ADR-049-verified-training-pipeline.md

# ADR-049: Verified Training Pipeline
## Status
Accepted
## Date
2026-02-25
## Context
Training graph transformers involves thousands of gradient steps, each of which modifies model weights. In safety-critical applications, we need guarantees that training did not introduce pathological behavior: unbounded loss spikes, conservation law violations, equivariance breakage, or adversarial vulnerability. Post-hoc auditing of trained models is expensive and often misses subtle training-time regressions.
The RuVector workspace provides the building blocks for verified training:
- `ruvector-gnn` provides `Optimizer` (SGD, Adam), `ElasticWeightConsolidation` (EWC), `LearningRateScheduler`, `ReplayBuffer`, and a training loop with `TrainConfig` in `crates/ruvector-gnn/src/training.rs`
- `ruvector-verified` provides `ProofEnvironment`, `ProofAttestation` (82 bytes), `FastTermArena` for high-throughput proof allocation, and tiered verification via `ProofTier`
- `ruvector-coherence` provides `SpectralCoherenceScore` and `SpectralTracker` (behind `spectral` feature) for monitoring model quality during training
- `ruvector-mincut-gated-transformer` provides `EnergyGate` in `crates/ruvector-mincut-gated-transformer/src/energy_gate.rs` for energy-based decision making
However, there is no mechanism for issuing per-step invariant proofs during training, no `TrainingCertificate` that attests to the training run's integrity, and no integration between the proof system and the gradient update loop.
## Decision
We will implement a `verified_training` module in `ruvector-graph-transformer` that wraps `ruvector-gnn`'s training infrastructure with proof gates, producing per-step invariant proofs and a final `TrainingCertificate` that attests to the entire training run.
### VerifiedTrainer
```rust
/// A training wrapper that issues proof attestations per gradient step.
///
/// Wraps ruvector_gnn::training::Optimizer and composes with
/// ruvector_verified::ProofEnvironment for per-step invariant verification.
pub struct VerifiedTrainer {
    /// The underlying GNN optimizer (SGD or Adam).
    optimizer: Optimizer,
    /// EWC for continual learning (optional).
    ewc: Option<ElasticWeightConsolidation>,
    /// Learning rate scheduler.
    scheduler: LearningRateScheduler,
    /// Proof environment for generating attestations.
    proof_env: ProofEnvironment,
    /// Fast arena for high-throughput proof allocation.
    arena: FastTermArena,
    /// Per-step invariant specifications.
    invariants: Vec<TrainingInvariant>,
    /// Accumulated attestations for the training run.
    ledger: MutationLedger,
    /// Energy gate (None = disabled); evaluated before each gradient application.
    energy_gate: Option<EnergyGate>,
    /// Proof tier forced by energy-gate escalation for the current step.
    current_tier_override: Option<ProofTier>,
    /// Number of steps that proceeded via OverrideProof.
    override_count: u64,
    /// Configuration.
    config: VerifiedTrainerConfig,
}
```
### Per-Step Invariant Proofs
Each gradient step is bracketed by invariant checks. The `TrainingInvariant` enum defines what is verified:
```rust
pub enum TrainingInvariant {
    /// Loss stability: loss stays within a bounded envelope relative to
    /// a moving average. Raw loss is NOT monotonic in SGD — this invariant
    /// captures what is actually enforceable: bounded deviation from trend.
    ///
    /// **This is a true invariant**, not a heuristic: the proof certifies
    /// that loss_t <= moving_avg(loss, window) * (1 + spike_cap).
    LossStabilityBound {
        /// Maximum spike relative to moving average (e.g., 0.10 = 10% above MA).
        spike_cap: f64,
        /// Window size for exponential moving average.
        window: usize,
        /// Gradient norm cap: reject step if ||grad|| > this value.
        max_gradient_norm: f64,
        /// Step size cap: reject step if effective lr * ||grad|| > this value.
        max_step_size: f64,
    },
    /// Weight norm conservation: ||W_t|| stays within bounds per layer.
    /// Prevents gradient explosion/vanishing.
    ///
    /// Rollback strategy: **delta-apply** — gradients are applied to a
    /// scratch buffer, norms checked, then committed only if bounds hold.
    /// This avoids doubling peak memory via full snapshots.
    WeightNormBound {
        /// Maximum L2 norm per layer.
        max_norm: f64,
        /// Minimum L2 norm per layer (prevents collapse).
        min_norm: f64,
        /// Rollback strategy.
        rollback: RollbackStrategy,
    },
    /// Equivariance: model output is equivariant to graph permutations.
    /// **This is a statistical test, not a formal proof.** The certificate
    /// records the exact scope: rng seed, sample count, permutation ID hashes.
    /// A verifier can replay the exact same permutations to confirm.
    PermutationEquivariance {
        /// Number of random permutations to test per check.
        samples: usize,
        /// Maximum allowed deviation (L2 distance / output norm).
        max_deviation: f64,
        /// RNG seed for reproducibility. Bound into the proof scope.
        rng_seed: u64,
    },
    /// Lipschitz bound: **estimated** Lipschitz constant stays below threshold.
    /// Verified per-layer via spectral norm power iteration.
    ///
    /// **Attestation scope:** The certificate records that the estimated bound
    /// (via K power iterations with tolerance eps) stayed below max_lipschitz.
    /// This does NOT certify the true Lipschitz constant — it certifies
    /// that the estimate with stated parameters was within bounds.
    LipschitzBound {
        /// Maximum Lipschitz constant per layer.
        max_lipschitz: f64,
        /// Power iteration steps for spectral norm estimation.
        power_iterations: usize,
        /// Convergence tolerance for power iteration.
        tolerance: f64,
    },
    /// Coherence: spectral coherence score stays above threshold.
    /// Uses ruvector-coherence::spectral::SpectralCoherenceScore.
    ///
    /// **Attestation scope:** Like Lipschitz, this is an estimate based on
    /// sampled eigenvalues. The certificate records the estimation parameters.
    CoherenceBound {
        /// Minimum coherence score.
        min_coherence: f64,
        /// Number of eigenvalue samples for estimation.
        eigenvalue_samples: usize,
    },
    /// Energy gate: compute energy or coherence proxy BEFORE applying
    /// gradients. If below threshold, require a stronger proof tier,
    /// reduce learning rate, or refuse the step entirely.
    ///
    /// Integrates with ruvector-mincut-gated-transformer::EnergyGate
    /// to make training behave like inference gating.
    EnergyGate {
        /// Minimum energy threshold for standard-tier step.
        min_energy: f64,
        /// If energy < min_energy, force this tier for verification.
        escalation_tier: ProofTier,
        /// If energy < critical_energy, refuse the step entirely.
        critical_energy: f64,
    },
    /// Custom invariant with a user-provided verification function.
    Custom {
        /// Name for logging and attestation.
        name: String,
        /// Estimated proof complexity (for tier routing).
        complexity: u32,
    },
}

/// Rollback strategy for failed invariant checks.
pub enum RollbackStrategy {
    /// Apply gradients to a scratch buffer, check invariants, then commit.
    /// Peak memory: weights + one layer's gradients. No full snapshot.
    DeltaApply,
    /// Store per-layer deltas, revert only modified layers on failure.
    /// Peak memory: weights + delta buffer (typically < 10% of weights).
    ChunkedRollback,
    /// Full snapshot (doubles peak memory). Use only when other strategies
    /// are insufficient (e.g., cross-layer invariants).
    FullSnapshot,
}
```
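The `LossStabilityBound` envelope check reduces to two pure functions. The sketch below is illustrative, not the crate's API: `ema_update` and `loss_within_envelope` are hypothetical names, and the smoothing factor 2/(window+1) is one conventional EMA choice.

```rust
/// Exponential moving average update with smoothing derived from `window`.
/// (Hypothetical helper; the module's actual internals may differ.)
fn ema_update(prev_ema: f64, loss: f64, window: usize) -> f64 {
    let alpha = 2.0 / (window as f64 + 1.0);
    alpha * loss + (1.0 - alpha) * prev_ema
}

/// The invariant certified per step: loss_t <= moving_avg * (1 + spike_cap).
fn loss_within_envelope(loss: f64, ema: f64, spike_cap: f64) -> bool {
    loss <= ema * (1.0 + spike_cap)
}

fn main() {
    let mut ema = 1.0;
    // A 5% fluctuation stays inside a 10% envelope.
    assert!(loss_within_envelope(1.05, ema, 0.10));
    // A 2x spike violates it and would trigger rollback.
    assert!(!loss_within_envelope(2.0, ema, 0.10));
    // The envelope tracks the trend as loss decreases.
    ema = ema_update(ema, 0.9, 10);
    assert!(ema > 0.9 && ema < 1.0);
}
```

Because the check is a bounded comparison against a value the trainer already maintains, it fits the Reflex tier: no allocation, no iteration over weights.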
### Invariant Verification Flow
```rust
impl VerifiedTrainer {
    /// Execute one verified training step.
    ///
    /// 1. Compute gradients via the underlying optimizer
    /// 2. Before applying gradients, verify pre-step invariants
    /// 3. Apply gradients
    /// 4. Verify post-step invariants
    /// 5. Issue attestation for this step
    /// 6. If any invariant fails, roll back gradients and return error
    pub fn step(
        &mut self,
        loss: f64,
        gradients: &Gradients,
        weights: &mut Weights,
    ) -> Result<StepAttestation> {
        // 1. Pre-step: verify gradient bounds and loss stability
        let pre_proofs = self.verify_invariants(
            InvariantPhase::PreStep,
            loss, weights,
        )?;
        // 2. Energy gate: compute energy/coherence proxy BEFORE mutation.
        //    If below threshold, escalate proof tier or refuse step.
        if let Some(energy_gate) = &self.energy_gate {
            let energy = energy_gate.evaluate(weights, gradients);
            if energy < energy_gate.critical_energy {
                return Err(GraphTransformerError::MutationRejected {
                    reason: format!(
                        "energy {} < critical {}",
                        energy, energy_gate.critical_energy,
                    ),
                });
            }
            if energy < energy_gate.min_energy {
                // Force escalation to stronger proof tier
                self.current_tier_override = Some(energy_gate.escalation_tier);
            }
        }
        // 3. Apply gradients via delta-apply strategy (default).
        //    Gradients go into a scratch buffer, not directly into weights.
        let delta = self.optimizer.compute_delta(gradients, weights)?;
        // 4. Post-step verification on proposed (weights + delta).
        //    No mutation has occurred yet.
        match self.verify_invariants_on_proposed(
            InvariantPhase::PostStep, loss, weights, &delta,
        ) {
            Ok(post_proofs) => {
                // 5. Commit: apply delta to actual weights.
                weights.apply_delta(&delta);
                // 6. Compose attestation and append to ledger.
                let attestation = self.compose_step_attestation(
                    pre_proofs, post_proofs,
                );
                self.ledger.append(attestation.clone());
                self.scheduler.step();
                self.current_tier_override = None;
                Ok(StepAttestation {
                    step: self.ledger.len() as u64,
                    attestation,
                    loss,
                    invariants_checked: self.invariants.len(),
                    overridden: false,
                })
            }
            Err(e) if self.config.allow_override => {
                // Degraded mode: step proceeds with OverrideProof.
                // The override is visible in the certificate.
                let override_proof = self.create_override_proof(&e)?;
                weights.apply_delta(&delta);
                self.ledger.append(override_proof.clone());
                self.override_count += 1;
                Ok(StepAttestation {
                    step: self.ledger.len() as u64,
                    attestation: override_proof,
                    loss,
                    invariants_checked: self.invariants.len(),
                    overridden: true,
                })
            }
            Err(e) => {
                // Fail-closed: delta is discarded, weights unchanged.
                // Refusal is recorded in the ledger.
                let refusal = self.create_refusal_attestation(&e);
                self.ledger.append(refusal);
                Err(e)
            }
        }
    }
}
```
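The delta-apply commit path (steps 3 to 5) can be sketched for a single layer under `WeightNormBound`. The names `Layer` and `try_commit` are illustrative stand-ins, assuming plain `Vec<f64>` weights rather than the crate's actual tensor types.

```rust
/// Illustrative single-layer model; the real trainer operates on
/// per-layer weight tensors.
struct Layer { weights: Vec<f64> }

fn l2_norm(xs: &[f64]) -> f64 {
    xs.iter().map(|x| x * x).sum::<f64>().sqrt()
}

/// Apply `delta` to a scratch copy, check the norm bound, and commit only
/// if it holds. Peak memory is one layer's weights, not a full snapshot.
fn try_commit(layer: &mut Layer, delta: &[f64], min_norm: f64, max_norm: f64) -> bool {
    let proposed: Vec<f64> = layer
        .weights
        .iter()
        .zip(delta)
        .map(|(w, d)| w + d)
        .collect();
    let norm = l2_norm(&proposed);
    if norm < min_norm || norm > max_norm {
        return false; // fail-closed: scratch buffer discarded, weights untouched
    }
    layer.weights = proposed; // commit
    true
}

fn main() {
    let mut layer = Layer { weights: vec![3.0, 4.0] }; // norm = 5
    assert!(try_commit(&mut layer, &[0.1, 0.1], 0.5, 10.0));
    // An explosive delta is rejected and the weights stay unchanged.
    let before = layer.weights.clone();
    assert!(!try_commit(&mut layer, &[100.0, 100.0], 0.5, 10.0));
    assert_eq!(layer.weights, before);
}
```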
### Tier Routing for Training Invariants
Training invariant verification uses the same three-tier routing as ADR-047:
| Invariant | Typical Tier | Rationale | Formally Proven? |
|-----------|-------------|-----------|------------------|
| `LossStabilityBound` | Reflex | Moving avg comparison + gradient norm check, < 10 ns | **Yes** — bounded comparison |
| `WeightNormBound` | Standard(100) | L2 norm computation, < 1 us | **Yes** — exact computation |
| `PermutationEquivariance` | Deep | Random permutation + forward pass, < 100 us | **No** — statistical test with bound scope |
| `LipschitzBound` | Standard(500) | Power iteration spectral norm, < 10 us | **No** — estimate with stated tolerance |
| `CoherenceBound` | Standard(200) | Spectral coherence from sampled eigenvalues, < 5 us | **No** — estimate with stated sample count |
| `EnergyGate` | Reflex/Standard | Energy proxy evaluation, < 100 ns | **Yes** — threshold comparison |
| `Custom` | Routed by `complexity` field | User-defined | Depends on implementation |
**Distinction between proven and estimated invariants:** The certificate explicitly records which invariants are formally proven (exact computation within the proof system) and which are statistical estimates with bound scope (rng_seed, sample_count, iterations, tolerance). A verifier knows exactly what was tested and can replay it.
The routing decision is made by converting each `TrainingInvariant` into a `ProofKind` and calling `ruvector_verified::gated::route_proof`. For example, `LossStabilityBound` maps to `ProofKind::DimensionEquality` (literal comparison), while `PermutationEquivariance` maps to `ProofKind::Custom { estimated_complexity: samples * 100 }`.
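The routing can be sketched with stand-in types. The real `ProofKind`, `ProofTier`, and `route_proof` live in `ruvector-verified`, and the complexity thresholds below are invented for illustration only.

```rust
/// Stand-in for ruvector_verified's tier type; thresholds are illustrative.
#[derive(Debug, PartialEq)]
enum Tier { Reflex, Standard, Deep }

/// Hypothetical stand-in for route_proof: route by estimated complexity.
fn route_by_complexity(complexity: u32) -> Tier {
    match complexity {
        0..=10 => Tier::Reflex,
        11..=1_000 => Tier::Standard,
        _ => Tier::Deep,
    }
}

/// Map an invariant name to an estimated complexity, mirroring the
/// LossStabilityBound -> literal-comparison example from the text.
fn complexity_for(invariant: &str, samples: u32) -> u32 {
    match invariant {
        "LossStabilityBound" => 1,                  // literal comparison
        "WeightNormBound" => 100,                   // exact L2 norm computation
        "PermutationEquivariance" => samples * 100, // statistical test
        _ => 500,
    }
}

fn main() {
    assert_eq!(route_by_complexity(complexity_for("LossStabilityBound", 0)), Tier::Reflex);
    assert_eq!(route_by_complexity(complexity_for("WeightNormBound", 0)), Tier::Standard);
    assert_eq!(route_by_complexity(complexity_for("PermutationEquivariance", 16)), Tier::Deep);
}
```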
### Certified Adversarial Robustness
For models that require adversarial robustness certification, the `verified_training` module provides an IBP (Interval Bound Propagation) / DeepPoly integration:
```rust
pub struct RobustnessCertifier {
    /// Perturbation radius (L-infinity norm).
    epsilon: f64,
    /// Certification method.
    method: CertificationMethod,
}

pub enum CertificationMethod {
    /// Interval Bound Propagation -- fast but loose.
    IBP,
    /// DeepPoly -- tighter but slower.
    DeepPoly,
    /// Combined: IBP for initial bound, DeepPoly for refinement.
    Hybrid { ibp_warmup_epochs: usize },
}

impl RobustnessCertifier {
    /// Certify that the model's output is stable within the epsilon-ball.
    /// Returns a ProofGate<RobustnessCertificate> with the certified radius.
    pub fn certify(
        &self,
        model: &GraphTransformer<impl GraphRepr>,
        input: &GraphBatch,
        env: &mut ProofEnvironment,
    ) -> Result<ProofGate<RobustnessCertificate>>;
}

pub struct RobustnessCertificate {
    /// Certified perturbation radius.
    pub certified_radius: f64,
    /// Fraction of nodes certified robust.
    pub certified_fraction: f64,
    /// Method used.
    pub method: CertificationMethod,
    /// Attestation.
    pub attestation: ProofAttestation,
}
```
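For intuition, IBP propagates per-feature intervals [lo, hi] through each layer: for an affine map, a positive weight carries lower bound to lower bound, and a negative weight swaps them; ReLU clamps both bounds at zero. A minimal sketch, independent of the crate's actual `RobustnessCertifier` internals:

```rust
/// Propagate intervals through one affine layer: out = W x + b.
/// Shapes and names are illustrative; w[j][i] maps input i to output j.
fn ibp_affine(lo: &[f64], hi: &[f64], w: &[Vec<f64>], b: &[f64]) -> (Vec<f64>, Vec<f64>) {
    let mut out_lo = b.to_vec();
    let mut out_hi = b.to_vec();
    for (j, row) in w.iter().enumerate() {
        for (i, &wij) in row.iter().enumerate() {
            if wij >= 0.0 {
                // Positive weight: lo -> lo, hi -> hi.
                out_lo[j] += wij * lo[i];
                out_hi[j] += wij * hi[i];
            } else {
                // Negative weight: bounds swap.
                out_lo[j] += wij * hi[i];
                out_hi[j] += wij * lo[i];
            }
        }
    }
    (out_lo, out_hi)
}

/// ReLU is monotone, so it maps intervals elementwise.
fn ibp_relu(lo: &mut [f64], hi: &mut [f64]) {
    for x in lo.iter_mut() { *x = x.max(0.0); }
    for x in hi.iter_mut() { *x = x.max(0.0); }
}

fn main() {
    // Input x = 1.0 with epsilon = 0.1 under L-infinity: interval [0.9, 1.1].
    let (mut lo, mut hi) = ibp_affine(&[0.9], &[1.1], &[vec![2.0], vec![-1.0]], &[0.0, 0.0]);
    ibp_relu(&mut lo, &mut hi);
    assert_eq!(lo, vec![1.8, 0.0]);
    assert_eq!(hi, vec![2.2, 0.0]);
}
```

Each layer widens the intervals, which is why IBP bounds are loose for deep models and DeepPoly's linear relaxations are used for refinement.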
### Training Certificate
At the end of a training run, a `TrainingCertificate` is produced by composing all step attestations:
```rust
pub struct TrainingCertificate {
    /// Total training steps completed.
    pub total_steps: u64,
    /// Total invariant violations (zero if fully verified).
    pub violations: u64,
    /// Number of steps that proceeded via OverrideProof (degraded mode).
    pub overridden_steps: u64,
    /// Composed attestation over all steps via compose_chain.
    pub attestation: ProofAttestation,
    /// Final loss value.
    pub final_loss: f64,
    /// Final coherence score (if CoherenceBound invariant was active).
    pub final_coherence: Option<f64>,
    /// Robustness certificate (if adversarial certification was run).
    pub robustness: Option<RobustnessCertificate>,
    /// Epoch at which the certificate was sealed.
    pub epoch: u64,
    /// Per-invariant statistics.
    pub invariant_stats: Vec<InvariantStats>,

    // --- Artifact binding (hardening move #7) ---
    /// BLAKE3 hash of the final model weights. Binds certificate to
    /// the exact model artifact. Cannot be separated.
    pub weights_hash: [u8; 32],
    /// BLAKE3 hash of the VerifiedTrainerConfig (serialized).
    pub config_hash: [u8; 32],
    /// BLAKE3 hash of the dataset manifest (or RVF manifest root).
    /// None if no dataset manifest was provided.
    pub dataset_manifest_hash: Option<[u8; 32]>,
    /// BLAKE3 hash of the code (build hash / git commit).
    /// None if not provided.
    pub code_build_hash: Option<[u8; 32]>,
}

pub struct InvariantStats {
    /// Invariant name.
    pub name: String,
    /// Whether this invariant is formally proven or a statistical estimate.
    pub proof_class: ProofClass,
    /// Number of times checked.
    pub checks: u64,
    /// Number of times satisfied.
    pub satisfied: u64,
    /// Number of times overridden (degraded mode).
    pub overridden: u64,
    /// Average verification latency.
    pub avg_latency_ns: u64,
    /// Proof tier distribution: [reflex_count, standard_count, deep_count].
    pub tier_distribution: [u64; 3],
}

pub enum ProofClass {
    /// Formally proven: exact computation within the proof system.
    Formal,
    /// Statistical estimate with bound scope. Certificate records
    /// the estimation parameters (rng_seed, iterations, tolerance).
    Statistical {
        rng_seed: Option<u64>,
        iterations: usize,
        tolerance: f64,
    },
}

impl VerifiedTrainer {
    /// Seal the training run and produce a certificate.
    ///
    /// 1. Compacts the mutation ledger (proof-gated: compaction itself
    ///    produces a composed attestation + witness that the compacted
    ///    chain corresponds exactly to the original sequence).
    /// 2. Computes BLAKE3 hashes of weights, config, and optional manifests.
    /// 3. Composes all attestations into the final certificate.
    ///
    /// The sealed certificate is a product artifact: verifiable by
    /// third parties without trusting training logs.
    pub fn seal(self, weights: &Weights) -> TrainingCertificate;
}
```
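The ordered, replayable nature of attestation composition can be illustrated with a stand-in hash chain. Here `DefaultHasher` substitutes for BLAKE3 and the real `compose_chain`; only the chaining structure is the point, not the hash function.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Fold a sequence of attestations into one chained digest.
/// Stand-in for the real compose_chain over 82-byte ProofAttestation values.
fn compose_chain(attestations: &[Vec<u8>]) -> u64 {
    let mut acc: u64 = 0;
    for att in attestations {
        let mut h = DefaultHasher::new();
        acc.hash(&mut h); // chain the previous digest...
        att.hash(&mut h); // ...then the next attestation
        acc = h.finish();
    }
    acc
}

fn main() {
    // Three mock 82-byte step attestations.
    let steps: Vec<Vec<u8>> = (0u8..3).map(|i| vec![i; 82]).collect();
    let sealed = compose_chain(&steps);
    // Replaying the same sequence reproduces the digest: verifiable.
    assert_eq!(sealed, compose_chain(&steps));
    // Reordering steps changes the composed digest: the chain is ordered.
    let mut reordered = steps.clone();
    reordered.swap(0, 1);
    assert_ne!(sealed, compose_chain(&reordered));
}
```

This is why compaction must carry a witness: a verifier holding only the composed digest cannot reconstruct the step sequence, so the witness attests that the compacted chain corresponds to the original one.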
### Performance Budget
The target is proof overhead < 5% of training step time. For a typical GNN training step of ~10 ms (on CPU):
- `LossStabilityBound` (Reflex): < 10 ns = 0.0001%
- `WeightNormBound` (Standard): < 1 us = 0.01%
- `LipschitzBound` (Standard): < 10 us = 0.1%
- `CoherenceBound` (Standard): < 5 us = 0.05%
- `PermutationEquivariance` (Deep, sampled): < 100 us = 1%
- Attestation composition: < 1 us = 0.01%
- **Total**: < 120 us = 1.2% (well within 5% budget)
For GPU-accelerated training (step time ~1 ms), `PermutationEquivariance` with many samples may exceed 5%. Mitigation: reduce sample count or check equivariance every N steps (configurable via `check_interval` in `VerifiedTrainerConfig`).
### Integration with EWC and Replay Buffer
The `VerifiedTrainer` composes with `ruvector-gnn`'s continual learning primitives:
```rust
pub struct VerifiedTrainerConfig {
    /// Optimizer type (from ruvector-gnn).
    pub optimizer: OptimizerType,
    /// EWC lambda (0.0 = disabled). Uses ruvector_gnn::ElasticWeightConsolidation.
    pub ewc_lambda: f64,
    /// Replay buffer size (0 = disabled). Uses ruvector_gnn::ReplayBuffer.
    pub replay_buffer_size: usize,
    /// Scheduler type (from ruvector-gnn).
    pub scheduler: SchedulerType,
    /// Invariants to verify per step.
    pub invariants: Vec<TrainingInvariant>,
    /// Check interval for expensive invariants (e.g., equivariance).
    /// Cheap invariants (Reflex tier) run every step.
    pub expensive_check_interval: usize,
    /// Warmup steps during which invariant violations are logged but
    /// do not trigger rollback. After warmup, fail-closed applies.
    pub warmup_steps: usize,
    /// Robustness certification config (None = disabled).
    pub robustness: Option<RobustnessCertifier>,
    /// Energy gate config (None = disabled).
    /// If enabled, energy is evaluated before every gradient application.
    pub energy_gate: Option<EnergyGateConfig>,
    /// Default rollback strategy for invariant failures.
    pub rollback_strategy: RollbackStrategy,
    /// Allow degraded mode: if true, failed invariant checks produce
    /// an OverrideProof and increment a visible violation counter
    /// instead of stopping the step. Default: false (fail-closed).
    pub allow_override: bool,
    /// Optional dataset manifest hash for binding to the certificate.
    pub dataset_manifest_hash: Option<[u8; 32]>,
    /// Optional code build hash for binding to the certificate.
    pub code_build_hash: Option<[u8; 32]>,
}
```
When EWC is enabled, the `WeightNormBound` invariant is automatically adjusted to account for the EWC penalty term. When the replay buffer is active, replayed samples also go through invariant verification.
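The EWC adjustment amounts to adding the penalty gradient lambda * F_i * (w_i - w*_i) to each raw gradient, i.e. the gradient of (lambda/2) * sum_i F_i (w_i - w*_i)^2 under the diagonal Fisher approximation. A minimal sketch with illustrative names:

```rust
/// Add the EWC penalty gradient to a raw gradient, elementwise.
/// `anchor` is w* from the previous task; `fisher` is the diagonal
/// Fisher information. Names are illustrative, not the crate's API.
fn ewc_adjusted_gradient(
    grad: &[f64],
    weights: &[f64],
    anchor: &[f64],
    fisher: &[f64],
    lambda: f64,
) -> Vec<f64> {
    grad.iter()
        .zip(weights)
        .zip(anchor)
        .zip(fisher)
        .map(|(((g, w), ws), f)| g + lambda * f * (w - ws))
        .collect()
}

fn main() {
    let g = ewc_adjusted_gradient(&[0.1, 0.1], &[1.0, 2.0], &[1.0, 1.0], &[0.5, 0.5], 2.0);
    // The first weight sits at its anchor: no penalty.
    assert_eq!(g[0], 0.1);
    // The second has drifted by 1.0 and is pulled back toward the anchor.
    assert_eq!(g[1], 0.1 + 2.0 * 0.5 * 1.0);
}
```

Since this penalty shifts gradient magnitudes, the `WeightNormBound` check must be evaluated on the adjusted gradients, which is the adjustment referred to above.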
## Consequences
### Positive
- Every training run produces a `TrainingCertificate` bound to the exact model weights via BLAKE3 hash — portable, verifiable by third parties without trusting logs
- Per-step invariant proofs catch regressions immediately — loss spikes, norm explosions, equivariance breaks become training-stopping events, not evaluation surprises
- Clear distinction between formally proven invariants and statistical estimates — the certificate is defensible because it states exactly what was proven and what was estimated
- EnergyGate integration makes training behave like inference gating — consistent proof-gated mutation across the full lifecycle
- Delta-apply rollback strategy avoids doubling peak memory while preserving proof-gated semantics
- Fail-closed by default with explicit OverrideProof for degraded mode — violations are visible, not silent
### Negative
- `PermutationEquivariance` is a statistical test, not a formal proof — the certificate is honest about this, but it means equivariance is not guaranteed, only tested with bound scope
- `LipschitzBound` via power iteration is an estimate — the certificate attests the estimate was within bounds, not the true Lipschitz constant
- The `TrainingCertificate` is only as strong as the invariants specified — missing invariants are not caught
- Robustness certification (IBP/DeepPoly) produces loose bounds for deep models; the certified radius may be conservative
- Over-conservative invariants can stop learning — mitigated by check intervals, warmup periods, and adaptive thresholds (which are themselves bounded)
### Risks
- **Proof cache hit rate drops**: High learning rate causes diverse weight states, Standard/Deep proofs dominate and exceed 5% budget. Mitigated by caching invariant structure (not values) — proof terms depend on structure, values are parameters. Monitor `ProofStats::cache_hit_rate` and alert below 80%
- **GPU steps dominated by Deep checks**: Schedule deep checks asynchronously with two-phase commit: provisional update, finalize after deep check, revert if failed. Mitigation preserves proof-gated semantics without blocking the training loop
- **EWC Fisher information**: O(n_params^2) in naive case. The existing diagonal approximation may miss cross-parameter interactions. Mitigated by periodic full Fisher computation (every K epochs) as a Deep-tier invariant
- **Attestation chain growth**: 82 bytes per step * 100,000 steps ≈ 8 MB. Mitigated by `MutationLedger::compact` — compaction is itself proof-gated: it produces a composed attestation plus a witness that the compacted chain corresponds exactly to the original sequence under the current epoch algebra
- **Certificate separation**: Without artifact binding, the certificate can be detached from the model. Mitigated by BLAKE3 hashes of weights, config, dataset manifest, and code build hash in the certificate
### Acceptance Test
Train 200 steps with invariants enabled, then intentionally inject one bad gradient update that would push a layer norm above `max_norm`. The system must:
1. Reject the step (fail-closed)
2. Emit a refusal attestation to the ledger
3. Leave weights unchanged (delta-apply was not committed)
4. The sealed `TrainingCertificate` must show exactly one violation with the correct step index and invariant name
5. The `weights_hash` in the certificate must match the actual final weights
## Implementation
1. Define `TrainingInvariant` enum and `VerifiedTrainerConfig` in `crates/ruvector-graph-transformer/src/verified_training/invariants.rs`
2. Implement `VerifiedTrainer` wrapping `ruvector_gnn::training::Optimizer` in `crates/ruvector-graph-transformer/src/verified_training/pipeline.rs`
3. Implement invariant-to-ProofKind mapping for tier routing
4. Implement `RobustnessCertifier` with IBP and DeepPoly in `crates/ruvector-graph-transformer/src/verified_training/mod.rs`
5. Implement `TrainingCertificate` and `seal()` method
6. Add benchmarks: verified training step overhead on a 3-layer GNN (128-dim, 10K nodes)
7. Integration test: train a small GNN for 100 steps with all invariants, verify certificate
## References
- ADR-045: Lean-Agentic Integration (`ProofEnvironment`, `FastTermArena`)
- ADR-046: Graph Transformer Unified Architecture (module structure)
- ADR-047: Proof-Gated Mutation Protocol (`ProofGate<T>`, `MutationLedger`, `compose_chain`)
- `crates/ruvector-gnn/src/training.rs`: `Optimizer`, `OptimizerType`, `TrainConfig`, `sgd_step`
- `crates/ruvector-gnn/src/ewc.rs`: `ElasticWeightConsolidation`
- `crates/ruvector-gnn/src/scheduler.rs`: `LearningRateScheduler`, `SchedulerType`
- `crates/ruvector-gnn/src/replay.rs`: `ReplayBuffer`, `ReplayEntry`
- `crates/ruvector-verified/src/gated.rs`: `ProofTier`, `route_proof`, `verify_tiered`
- `crates/ruvector-verified/src/proof_store.rs`: `ProofAttestation`, `create_attestation`
- `crates/ruvector-verified/src/fast_arena.rs`: `FastTermArena`
- `crates/ruvector-coherence/src/spectral.rs`: `SpectralCoherenceScore`, `SpectralTracker`
- `crates/ruvector-mincut-gated-transformer/src/energy_gate.rs`: `EnergyGate`
- Gowal et al., "Scalable Verified Training for Provably Robust Image Classification" (ICCV 2019) -- IBP training
- Singh et al., "An Abstract Domain for Certifying Neural Networks" (POPL 2019) -- DeepPoly