wifi-densepose/vendor/ruvector/docs/adr/ADR-049-verified-training-pipeline.md

# ADR-049: Verified Training Pipeline
## Status
Accepted
## Date
2026-02-25
## Context
Training graph transformers involves thousands of gradient steps, each of which modifies model weights. In safety-critical applications, we need guarantees that training did not introduce pathological behavior: unbounded loss spikes, conservation law violations, equivariance breakage, or adversarial vulnerability. Post-hoc auditing of trained models is expensive and often misses subtle training-time regressions.
The RuVector workspace provides the building blocks for verified training:
- `ruvector-gnn` provides `Optimizer` (SGD, Adam), `ElasticWeightConsolidation` (EWC), `LearningRateScheduler`, `ReplayBuffer`, and a training loop with `TrainConfig` in `crates/ruvector-gnn/src/training.rs`
- `ruvector-verified` provides `ProofEnvironment`, `ProofAttestation` (82 bytes), `FastTermArena` for high-throughput proof allocation, and tiered verification via `ProofTier`
- `ruvector-coherence` provides `SpectralCoherenceScore` and `SpectralTracker` (behind `spectral` feature) for monitoring model quality during training
- `ruvector-mincut-gated-transformer` provides `EnergyGate` in `crates/ruvector-mincut-gated-transformer/src/energy_gate.rs` for energy-based decision making
However, there is no mechanism for issuing per-step invariant proofs during training, no `TrainingCertificate` that attests to the training run's integrity, and no integration between the proof system and the gradient update loop.
## Decision
We will implement a `verified_training` module in `ruvector-graph-transformer` that wraps `ruvector-gnn`'s training infrastructure with proof gates, producing per-step invariant proofs and a final `TrainingCertificate` that attests to the entire training run.
### VerifiedTrainer
```rust
/// A training wrapper that issues proof attestations per gradient step.
///
/// Wraps ruvector_gnn::training::Optimizer and composes with
/// ruvector_verified::ProofEnvironment for per-step invariant verification.
pub struct VerifiedTrainer {
    /// The underlying GNN optimizer (SGD or Adam).
    optimizer: Optimizer,
    /// EWC for continual learning (optional).
    ewc: Option<ElasticWeightConsolidation>,
    /// Learning rate scheduler.
    scheduler: LearningRateScheduler,
    /// Proof environment for generating attestations.
    proof_env: ProofEnvironment,
    /// Fast arena for high-throughput proof allocation.
    arena: FastTermArena,
    /// Per-step invariant specifications.
    invariants: Vec<TrainingInvariant>,
    /// Accumulated attestations for the training run.
    ledger: MutationLedger,
    /// Energy gate (None = disabled); evaluated before each gradient application.
    energy_gate: Option<EnergyGate>,
    /// Proof tier forced by energy-gate escalation for the current step.
    current_tier_override: Option<ProofTier>,
    /// Number of steps that proceeded via OverrideProof.
    override_count: u64,
    /// Configuration.
    config: VerifiedTrainerConfig,
}
```
### Per-Step Invariant Proofs
Each gradient step is bracketed by invariant checks. The `TrainingInvariant` enum defines what is verified:
```rust
pub enum TrainingInvariant {
    /// Loss stability: loss stays within a bounded envelope relative to
    /// a moving average. Raw loss is NOT monotonic in SGD — this invariant
    /// captures what is actually enforceable: bounded deviation from trend.
    ///
    /// **This is a true invariant**, not a heuristic: the proof certifies
    /// that loss_t <= moving_avg(loss, window) * (1 + spike_cap).
    LossStabilityBound {
        /// Maximum spike relative to moving average (e.g., 0.10 = 10% above MA).
        spike_cap: f64,
        /// Window size for exponential moving average.
        window: usize,
        /// Gradient norm cap: reject step if ||grad|| > this value.
        max_gradient_norm: f64,
        /// Step size cap: reject step if effective lr * ||grad|| > this value.
        max_step_size: f64,
    },
    /// Weight norm conservation: ||W_t|| stays within bounds per layer.
    /// Prevents gradient explosion/vanishing.
    ///
    /// Rollback strategy: **delta-apply** — gradients are applied to a
    /// scratch buffer, norms checked, then committed only if bounds hold.
    /// This avoids doubling peak memory via full snapshots.
    WeightNormBound {
        /// Maximum L2 norm per layer.
        max_norm: f64,
        /// Minimum L2 norm per layer (prevents collapse).
        min_norm: f64,
        /// Rollback strategy.
        rollback: RollbackStrategy,
    },
    /// Equivariance: model output is equivariant to graph permutations.
    /// **This is a statistical test, not a formal proof.** The certificate
    /// records the exact scope: rng seed, sample count, permutation ID hashes.
    /// A verifier can replay the exact same permutations to confirm.
    PermutationEquivariance {
        /// Number of random permutations to test per check.
        samples: usize,
        /// Maximum allowed deviation (L2 distance / output norm).
        max_deviation: f64,
        /// RNG seed for reproducibility. Bound into the proof scope.
        rng_seed: u64,
    },
    /// Lipschitz bound: **estimated** Lipschitz constant stays below threshold.
    /// Verified per-layer via spectral norm power iteration.
    ///
    /// **Attestation scope:** The certificate records that the estimated bound
    /// (via K power iterations with tolerance eps) stayed below max_lipschitz.
    /// This does NOT certify the true Lipschitz constant — it certifies
    /// that the estimate with stated parameters was within bounds.
    LipschitzBound {
        /// Maximum Lipschitz constant per layer.
        max_lipschitz: f64,
        /// Power iteration steps for spectral norm estimation.
        power_iterations: usize,
        /// Convergence tolerance for power iteration.
        tolerance: f64,
    },
    /// Coherence: spectral coherence score stays above threshold.
    /// Uses ruvector-coherence::spectral::SpectralCoherenceScore.
    ///
    /// **Attestation scope:** Like Lipschitz, this is an estimate based on
    /// sampled eigenvalues. The certificate records the estimation parameters.
    CoherenceBound {
        /// Minimum coherence score.
        min_coherence: f64,
        /// Number of eigenvalue samples for estimation.
        eigenvalue_samples: usize,
    },
    /// Energy gate: compute energy or coherence proxy BEFORE applying
    /// gradients. If below threshold, require a stronger proof tier,
    /// reduce learning rate, or refuse the step entirely.
    ///
    /// Integrates with ruvector-mincut-gated-transformer::EnergyGate
    /// to make training behave like inference gating.
    EnergyGate {
        /// Minimum energy threshold for standard-tier step.
        min_energy: f64,
        /// If energy < min_energy, force this tier for verification.
        escalation_tier: ProofTier,
        /// If energy < critical_energy, refuse the step entirely.
        critical_energy: f64,
    },
    /// Custom invariant with a user-provided verification function.
    Custom {
        /// Name for logging and attestation.
        name: String,
        /// Estimated proof complexity (for tier routing).
        complexity: u32,
    },
}

/// Rollback strategy for failed invariant checks.
pub enum RollbackStrategy {
    /// Apply gradients to a scratch buffer, check invariants, then commit.
    /// Peak memory: weights + one layer's gradients. No full snapshot.
    DeltaApply,
    /// Store per-layer deltas, revert only modified layers on failure.
    /// Peak memory: weights + delta buffer (typically < 10% of weights).
    ChunkedRollback,
    /// Full snapshot (doubles peak memory). Use only when other strategies
    /// are insufficient (e.g., cross-layer invariants).
    FullSnapshot,
}
```
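The `LossStabilityBound` envelope check reduces to two pure functions. The sketch below is illustrative, not the crate's API: `ema_update` and `loss_within_envelope` are hypothetical names, and the smoothing factor 2/(window+1) is one conventional EMA choice.

```rust
/// Exponential moving average update with smoothing derived from `window`.
/// (Hypothetical helper; the module's actual internals may differ.)
fn ema_update(prev_ema: f64, loss: f64, window: usize) -> f64 {
    let alpha = 2.0 / (window as f64 + 1.0);
    alpha * loss + (1.0 - alpha) * prev_ema
}

/// The invariant certified per step: loss_t <= moving_avg * (1 + spike_cap).
fn loss_within_envelope(loss: f64, ema: f64, spike_cap: f64) -> bool {
    loss <= ema * (1.0 + spike_cap)
}

fn main() {
    let mut ema = 1.0;
    // A 5% fluctuation stays inside a 10% envelope.
    assert!(loss_within_envelope(1.05, ema, 0.10));
    // A 2x spike violates it and would trigger rollback.
    assert!(!loss_within_envelope(2.0, ema, 0.10));
    // The envelope tracks the trend as loss decreases.
    ema = ema_update(ema, 0.9, 10);
    assert!(ema > 0.9 && ema < 1.0);
}
```

Because the check is a bounded comparison against a value the trainer already maintains, it fits the Reflex tier: no allocation, no iteration over weights.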
### Invariant Verification Flow
```rust
impl VerifiedTrainer {
    /// Execute one verified training step.
    ///
    /// 1. Compute gradients via the underlying optimizer
    /// 2. Before applying gradients, verify pre-step invariants
    /// 3. Apply gradients
    /// 4. Verify post-step invariants
    /// 5. Issue attestation for this step
    /// 6. If any invariant fails, roll back gradients and return error
    pub fn step(
        &mut self,
        loss: f64,
        gradients: &Gradients,
        weights: &mut Weights,
    ) -> Result<StepAttestation> {
        // 1. Pre-step: verify gradient bounds and loss stability
        let pre_proofs = self.verify_invariants(
            InvariantPhase::PreStep,
            loss, weights,
        )?;
        // 2. Energy gate: compute energy/coherence proxy BEFORE mutation.
        //    If below threshold, escalate proof tier or refuse step.
        if let Some(energy_gate) = &self.energy_gate {
            let energy = energy_gate.evaluate(weights, gradients);
            if energy < energy_gate.critical_energy {
                return Err(GraphTransformerError::MutationRejected {
                    reason: format!(
                        "energy {} < critical {}",
                        energy, energy_gate.critical_energy,
                    ),
                });
            }
            if energy < energy_gate.min_energy {
                // Force escalation to stronger proof tier
                self.current_tier_override = Some(energy_gate.escalation_tier);
            }
        }
        // 3. Apply gradients via delta-apply strategy (default).
        //    Gradients go into a scratch buffer, not directly into weights.
        let delta = self.optimizer.compute_delta(gradients, weights)?;
        // 4. Post-step verification on proposed (weights + delta).
        //    No mutation has occurred yet.
        match self.verify_invariants_on_proposed(
            InvariantPhase::PostStep, loss, weights, &delta,
        ) {
            Ok(post_proofs) => {
                // 5. Commit: apply delta to actual weights.
                weights.apply_delta(&delta);
                // 6. Compose attestation and append to ledger.
                let attestation = self.compose_step_attestation(
                    pre_proofs, post_proofs,
                );
                self.ledger.append(attestation.clone());
                self.scheduler.step();
                self.current_tier_override = None;
                Ok(StepAttestation {
                    step: self.ledger.len() as u64,
                    attestation,
                    loss,
                    invariants_checked: self.invariants.len(),
                    overridden: false,
                })
            }
            Err(e) if self.config.allow_override => {
                // Degraded mode: step proceeds with OverrideProof.
                // The override is visible in the certificate.
                let override_proof = self.create_override_proof(&e)?;
                weights.apply_delta(&delta);
                self.ledger.append(override_proof.clone());
                self.override_count += 1;
                Ok(StepAttestation {
                    step: self.ledger.len() as u64,
                    attestation: override_proof,
                    loss,
                    invariants_checked: self.invariants.len(),
                    overridden: true,
                })
            }
            Err(e) => {
                // Fail-closed: delta is discarded, weights unchanged.
                // Refusal is recorded in the ledger.
                let refusal = self.create_refusal_attestation(&e);
                self.ledger.append(refusal);
                Err(e)
            }
        }
    }
}
```
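The delta-apply commit path (steps 3 to 5) can be sketched for a single layer under `WeightNormBound`. The names `Layer` and `try_commit` are illustrative stand-ins, assuming plain `Vec<f64>` weights rather than the crate's actual tensor types.

```rust
/// Illustrative single-layer model; the real trainer operates on
/// per-layer weight tensors.
struct Layer { weights: Vec<f64> }

fn l2_norm(xs: &[f64]) -> f64 {
    xs.iter().map(|x| x * x).sum::<f64>().sqrt()
}

/// Apply `delta` to a scratch copy, check the norm bound, and commit only
/// if it holds. Peak memory is one layer's weights, not a full snapshot.
fn try_commit(layer: &mut Layer, delta: &[f64], min_norm: f64, max_norm: f64) -> bool {
    let proposed: Vec<f64> = layer
        .weights
        .iter()
        .zip(delta)
        .map(|(w, d)| w + d)
        .collect();
    let norm = l2_norm(&proposed);
    if norm < min_norm || norm > max_norm {
        return false; // fail-closed: scratch buffer discarded, weights untouched
    }
    layer.weights = proposed; // commit
    true
}

fn main() {
    let mut layer = Layer { weights: vec![3.0, 4.0] }; // norm = 5
    assert!(try_commit(&mut layer, &[0.1, 0.1], 0.5, 10.0));
    // An explosive delta is rejected and the weights stay unchanged.
    let before = layer.weights.clone();
    assert!(!try_commit(&mut layer, &[100.0, 100.0], 0.5, 10.0));
    assert_eq!(layer.weights, before);
}
```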
### Tier Routing for Training Invariants
Training invariant verification uses the same three-tier routing as ADR-047:
| Invariant | Typical Tier | Rationale | Formally Proven? |
|-----------|-------------|-----------|------------------|
| `LossStabilityBound` | Reflex | Moving avg comparison + gradient norm check, < 10 ns | **Yes** — bounded comparison |
| `WeightNormBound` | Standard(100) | L2 norm computation, < 1 us | **Yes** — exact computation |
| `PermutationEquivariance` | Deep | Random permutation + forward pass, < 100 us | **No** — statistical test with bound scope |
| `LipschitzBound` | Standard(500) | Power iteration spectral norm, < 10 us | **No** — estimate with stated tolerance |
| `CoherenceBound` | Standard(200) | Spectral coherence from sampled eigenvalues, < 5 us | **No** — estimate with stated sample count |
| `EnergyGate` | Reflex/Standard | Energy proxy evaluation, < 100 ns | **Yes** — threshold comparison |
| `Custom` | Routed by `complexity` field | User-defined | Depends on implementation |
**Distinction between proven and estimated invariants:** The certificate explicitly records which invariants are formally proven (exact computation within the proof system) and which are statistical estimates with bound scope (rng_seed, sample_count, iterations, tolerance). A verifier knows exactly what was tested and can replay it.
The routing decision is made by converting each `TrainingInvariant` into a `ProofKind` and calling `ruvector_verified::gated::route_proof`. For example, `LossStabilityBound` maps to `ProofKind::DimensionEquality` (literal comparison), while `PermutationEquivariance` maps to `ProofKind::Custom { estimated_complexity: samples * 100 }`.
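The routing can be sketched with stand-in types. The real `ProofKind`, `ProofTier`, and `route_proof` live in `ruvector-verified`, and the complexity thresholds below are invented for illustration only.

```rust
/// Stand-in for ruvector_verified's tier type; thresholds are illustrative.
#[derive(Debug, PartialEq)]
enum Tier { Reflex, Standard, Deep }

/// Hypothetical stand-in for route_proof: route by estimated complexity.
fn route_by_complexity(complexity: u32) -> Tier {
    match complexity {
        0..=10 => Tier::Reflex,
        11..=1_000 => Tier::Standard,
        _ => Tier::Deep,
    }
}

/// Map an invariant name to an estimated complexity, mirroring the
/// LossStabilityBound -> literal-comparison example from the text.
fn complexity_for(invariant: &str, samples: u32) -> u32 {
    match invariant {
        "LossStabilityBound" => 1,                  // literal comparison
        "WeightNormBound" => 100,                   // exact L2 norm computation
        "PermutationEquivariance" => samples * 100, // statistical test
        _ => 500,
    }
}

fn main() {
    assert_eq!(route_by_complexity(complexity_for("LossStabilityBound", 0)), Tier::Reflex);
    assert_eq!(route_by_complexity(complexity_for("WeightNormBound", 0)), Tier::Standard);
    assert_eq!(route_by_complexity(complexity_for("PermutationEquivariance", 16)), Tier::Deep);
}
```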
### Certified Adversarial Robustness
For models that require adversarial robustness certification, the `verified_training` module provides an IBP (Interval Bound Propagation) / DeepPoly integration:
```rust
pub struct RobustnessCertifier {
    /// Perturbation radius (L-infinity norm).
    epsilon: f64,
    /// Certification method.
    method: CertificationMethod,
}

pub enum CertificationMethod {
    /// Interval Bound Propagation -- fast but loose.
    IBP,
    /// DeepPoly -- tighter but slower.
    DeepPoly,
    /// Combined: IBP for initial bound, DeepPoly for refinement.
    Hybrid { ibp_warmup_epochs: usize },
}

impl RobustnessCertifier {
    /// Certify that the model's output is stable within the epsilon-ball.
    /// Returns a ProofGate<RobustnessCertificate> with the certified radius.
    pub fn certify(
        &self,
        model: &GraphTransformer<impl GraphRepr>,
        input: &GraphBatch,
        env: &mut ProofEnvironment,
    ) -> Result<ProofGate<RobustnessCertificate>>;
}

pub struct RobustnessCertificate {
    /// Certified perturbation radius.
    pub certified_radius: f64,
    /// Fraction of nodes certified robust.
    pub certified_fraction: f64,
    /// Method used.
    pub method: CertificationMethod,
    /// Attestation.
    pub attestation: ProofAttestation,
}
```
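For intuition, IBP propagates per-feature intervals [lo, hi] through each layer: for an affine map, a positive weight carries lower bound to lower bound, and a negative weight swaps them; ReLU clamps both bounds at zero. A minimal sketch, independent of the crate's actual `RobustnessCertifier` internals:

```rust
/// Propagate intervals through one affine layer: out = W x + b.
/// Shapes and names are illustrative; w[j][i] maps input i to output j.
fn ibp_affine(lo: &[f64], hi: &[f64], w: &[Vec<f64>], b: &[f64]) -> (Vec<f64>, Vec<f64>) {
    let mut out_lo = b.to_vec();
    let mut out_hi = b.to_vec();
    for (j, row) in w.iter().enumerate() {
        for (i, &wij) in row.iter().enumerate() {
            if wij >= 0.0 {
                // Positive weight: lo -> lo, hi -> hi.
                out_lo[j] += wij * lo[i];
                out_hi[j] += wij * hi[i];
            } else {
                // Negative weight: bounds swap.
                out_lo[j] += wij * hi[i];
                out_hi[j] += wij * lo[i];
            }
        }
    }
    (out_lo, out_hi)
}

/// ReLU is monotone, so it maps intervals elementwise.
fn ibp_relu(lo: &mut [f64], hi: &mut [f64]) {
    for x in lo.iter_mut() { *x = x.max(0.0); }
    for x in hi.iter_mut() { *x = x.max(0.0); }
}

fn main() {
    // Input x = 1.0 with epsilon = 0.1 under L-infinity: interval [0.9, 1.1].
    let (mut lo, mut hi) = ibp_affine(&[0.9], &[1.1], &[vec![2.0], vec![-1.0]], &[0.0, 0.0]);
    ibp_relu(&mut lo, &mut hi);
    assert_eq!(lo, vec![1.8, 0.0]);
    assert_eq!(hi, vec![2.2, 0.0]);
}
```

Each layer widens the intervals, which is why IBP bounds are loose for deep models and DeepPoly's linear relaxations are used for refinement.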
### Training Certificate
At the end of a training run, a `TrainingCertificate` is produced by composing all step attestations:
```rust
pub struct TrainingCertificate {
    /// Total training steps completed.
    pub total_steps: u64,
    /// Total invariant violations (zero if fully verified).
    pub violations: u64,
    /// Number of steps that proceeded via OverrideProof (degraded mode).
    pub overridden_steps: u64,
    /// Composed attestation over all steps via compose_chain.
    pub attestation: ProofAttestation,
    /// Final loss value.
    pub final_loss: f64,
    /// Final coherence score (if CoherenceBound invariant was active).
    pub final_coherence: Option<f64>,
    /// Robustness certificate (if adversarial certification was run).
    pub robustness: Option<RobustnessCertificate>,
    /// Epoch at which the certificate was sealed.
    pub epoch: u64,
    /// Per-invariant statistics.
    pub invariant_stats: Vec<InvariantStats>,

    // --- Artifact binding (hardening move #7) ---
    /// BLAKE3 hash of the final model weights. Binds certificate to
    /// the exact model artifact. Cannot be separated.
    pub weights_hash: [u8; 32],
    /// BLAKE3 hash of the VerifiedTrainerConfig (serialized).
    pub config_hash: [u8; 32],
    /// BLAKE3 hash of the dataset manifest (or RVF manifest root).
    /// None if no dataset manifest was provided.
    pub dataset_manifest_hash: Option<[u8; 32]>,
    /// BLAKE3 hash of the code (build hash / git commit).
    /// None if not provided.
    pub code_build_hash: Option<[u8; 32]>,
}

pub struct InvariantStats {
    /// Invariant name.
    pub name: String,
    /// Whether this invariant is formally proven or a statistical estimate.
    pub proof_class: ProofClass,
    /// Number of times checked.
    pub checks: u64,
    /// Number of times satisfied.
    pub satisfied: u64,
    /// Number of times overridden (degraded mode).
    pub overridden: u64,
    /// Average verification latency.
    pub avg_latency_ns: u64,
    /// Proof tier distribution: [reflex_count, standard_count, deep_count].
    pub tier_distribution: [u64; 3],
}

pub enum ProofClass {
    /// Formally proven: exact computation within the proof system.
    Formal,
    /// Statistical estimate with bound scope. Certificate records
    /// the estimation parameters (rng_seed, iterations, tolerance).
    Statistical {
        rng_seed: Option<u64>,
        iterations: usize,
        tolerance: f64,
    },
}

impl VerifiedTrainer {
    /// Seal the training run and produce a certificate.
    ///
    /// 1. Compacts the mutation ledger (proof-gated: compaction itself
    ///    produces a composed attestation + witness that the compacted
    ///    chain corresponds exactly to the original sequence).
    /// 2. Computes BLAKE3 hashes of weights, config, and optional manifests.
    /// 3. Composes all attestations into the final certificate.
    ///
    /// The sealed certificate is a product artifact: verifiable by
    /// third parties without trusting training logs.
    pub fn seal(self, weights: &Weights) -> TrainingCertificate;
}
```
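The ordered, replayable nature of attestation composition can be illustrated with a stand-in hash chain. Here `DefaultHasher` substitutes for BLAKE3 and the real `compose_chain`; only the chaining structure is the point, not the hash function.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Fold a sequence of attestations into one chained digest.
/// Stand-in for the real compose_chain over 82-byte ProofAttestation values.
fn compose_chain(attestations: &[Vec<u8>]) -> u64 {
    let mut acc: u64 = 0;
    for att in attestations {
        let mut h = DefaultHasher::new();
        acc.hash(&mut h); // chain the previous digest...
        att.hash(&mut h); // ...then the next attestation
        acc = h.finish();
    }
    acc
}

fn main() {
    // Three mock 82-byte step attestations.
    let steps: Vec<Vec<u8>> = (0u8..3).map(|i| vec![i; 82]).collect();
    let sealed = compose_chain(&steps);
    // Replaying the same sequence reproduces the digest: verifiable.
    assert_eq!(sealed, compose_chain(&steps));
    // Reordering steps changes the composed digest: the chain is ordered.
    let mut reordered = steps.clone();
    reordered.swap(0, 1);
    assert_ne!(sealed, compose_chain(&reordered));
}
```

This is why compaction must carry a witness: a verifier holding only the composed digest cannot reconstruct the step sequence, so the witness attests that the compacted chain corresponds to the original one.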
### Performance Budget
The target is proof overhead < 5% of training step time. For a typical GNN training step of ~10 ms (on CPU):
- `LossStabilityBound` (Reflex): < 10 ns = 0.0001%
- `WeightNormBound` (Standard): < 1 us = 0.01%
- `LipschitzBound` (Standard): < 10 us = 0.1%
- `CoherenceBound` (Standard): < 5 us = 0.05%
- `PermutationEquivariance` (Deep, sampled): < 100 us = 1%
- Attestation composition: < 1 us = 0.01%
- **Total**: < 120 us = 1.2% (well within 5% budget)
For GPU-accelerated training (step time ~1 ms), `PermutationEquivariance` with many samples may exceed 5%. Mitigation: reduce sample count or check equivariance every N steps (configurable via `check_interval` in `VerifiedTrainerConfig`).
### Integration with EWC and Replay Buffer
The `VerifiedTrainer` composes with `ruvector-gnn`'s continual learning primitives:
```rust
pub struct VerifiedTrainerConfig {
    /// Optimizer type (from ruvector-gnn).
    pub optimizer: OptimizerType,
    /// EWC lambda (0.0 = disabled). Uses ruvector_gnn::ElasticWeightConsolidation.
    pub ewc_lambda: f64,
    /// Replay buffer size (0 = disabled). Uses ruvector_gnn::ReplayBuffer.
    pub replay_buffer_size: usize,
    /// Scheduler type (from ruvector-gnn).
    pub scheduler: SchedulerType,
    /// Invariants to verify per step.
    pub invariants: Vec<TrainingInvariant>,
    /// Check interval for expensive invariants (e.g., equivariance).
    /// Cheap invariants (Reflex tier) run every step.
    pub expensive_check_interval: usize,
    /// Warmup steps during which invariant violations are logged but
    /// do not trigger rollback. After warmup, fail-closed applies.
    pub warmup_steps: usize,
    /// Robustness certification config (None = disabled).
    pub robustness: Option<RobustnessCertifier>,
    /// Energy gate config (None = disabled).
    /// If enabled, energy is evaluated before every gradient application.
    pub energy_gate: Option<EnergyGateConfig>,
    /// Default rollback strategy for invariant failures.
    pub rollback_strategy: RollbackStrategy,
    /// Allow degraded mode: if true, failed invariant checks produce
    /// an OverrideProof and increment a visible violation counter
    /// instead of stopping the step. Default: false (fail-closed).
    pub allow_override: bool,
    /// Optional dataset manifest hash for binding to the certificate.
    pub dataset_manifest_hash: Option<[u8; 32]>,
    /// Optional code build hash for binding to the certificate.
    pub code_build_hash: Option<[u8; 32]>,
}
```
When EWC is enabled, the `WeightNormBound` invariant is automatically adjusted to account for the EWC penalty term. When the replay buffer is active, replayed samples also go through invariant verification.
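The EWC adjustment amounts to adding the penalty gradient lambda * F_i * (w_i - w*_i) to each raw gradient, i.e. the gradient of (lambda/2) * sum_i F_i (w_i - w*_i)^2 under the diagonal Fisher approximation. A minimal sketch with illustrative names:

```rust
/// Add the EWC penalty gradient to a raw gradient, elementwise.
/// `anchor` is w* from the previous task; `fisher` is the diagonal
/// Fisher information. Names are illustrative, not the crate's API.
fn ewc_adjusted_gradient(
    grad: &[f64],
    weights: &[f64],
    anchor: &[f64],
    fisher: &[f64],
    lambda: f64,
) -> Vec<f64> {
    grad.iter()
        .zip(weights)
        .zip(anchor)
        .zip(fisher)
        .map(|(((g, w), ws), f)| g + lambda * f * (w - ws))
        .collect()
}

fn main() {
    let g = ewc_adjusted_gradient(&[0.1, 0.1], &[1.0, 2.0], &[1.0, 1.0], &[0.5, 0.5], 2.0);
    // The first weight sits at its anchor: no penalty.
    assert_eq!(g[0], 0.1);
    // The second has drifted by 1.0 and is pulled back toward the anchor.
    assert_eq!(g[1], 0.1 + 2.0 * 0.5 * 1.0);
}
```

Since this penalty shifts gradient magnitudes, the `WeightNormBound` check must be evaluated on the adjusted gradients, which is the adjustment referred to above.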
## Consequences
### Positive
- Every training run produces a `TrainingCertificate` bound to the exact model weights via BLAKE3 hash — portable, verifiable by third parties without trusting logs
- Per-step invariant proofs catch regressions immediately — loss spikes, norm explosions, equivariance breaks become training-stopping events, not evaluation surprises
- Clear distinction between formally proven invariants and statistical estimates — the certificate is defensible because it states exactly what was proven and what was estimated
- EnergyGate integration makes training behave like inference gating — consistent proof-gated mutation across the full lifecycle
- Delta-apply rollback strategy avoids doubling peak memory while preserving proof-gated semantics
- Fail-closed by default with explicit OverrideProof for degraded mode — violations are visible, not silent
### Negative
- `PermutationEquivariance` is a statistical test, not a formal proof — the certificate is honest about this, but it means equivariance is not guaranteed, only tested with bound scope
- `LipschitzBound` via power iteration is an estimate — the certificate attests the estimate was within bounds, not the true Lipschitz constant
- The `TrainingCertificate` is only as strong as the invariants specified — missing invariants are not caught
- Robustness certification (IBP/DeepPoly) produces loose bounds for deep models; the certified radius may be conservative
- Over-conservative invariants can stop learning — mitigated by check intervals, warmup periods, and adaptive thresholds (which are themselves bounded)
### Risks
- **Proof cache hit rate drops**: High learning rate causes diverse weight states, Standard/Deep proofs dominate and exceed 5% budget. Mitigated by caching invariant structure (not values) — proof terms depend on structure, values are parameters. Monitor `ProofStats::cache_hit_rate` and alert below 80%
- **GPU steps dominated by Deep checks**: Schedule deep checks asynchronously with two-phase commit: provisional update, finalize after deep check, revert if failed. Mitigation preserves proof-gated semantics without blocking the training loop
- **EWC Fisher information**: O(n_params^2) in naive case. The existing diagonal approximation may miss cross-parameter interactions. Mitigated by periodic full Fisher computation (every K epochs) as a Deep-tier invariant
- **Attestation chain growth**: 82 bytes per step * 100,000 steps ≈ 8 MB. Mitigated by `MutationLedger::compact` — compaction is itself proof-gated: it produces a composed attestation plus a witness that the compacted chain corresponds exactly to the original sequence under the current epoch algebra
- **Certificate separation**: Without artifact binding, the certificate can be detached from the model. Mitigated by BLAKE3 hashes of weights, config, dataset manifest, and code build hash in the certificate
### Acceptance Test
Train 200 steps with invariants enabled, then intentionally inject one bad gradient update that would push a layer norm above `max_norm`. The system must:
1. Reject the step (fail-closed)
2. Emit a refusal attestation to the ledger
3. Leave weights unchanged (delta-apply was not committed)
4. The sealed `TrainingCertificate` must show exactly one violation with the correct step index and invariant name
5. The `weights_hash` in the certificate must match the actual final weights
## Implementation
1. Define `TrainingInvariant` enum and `VerifiedTrainerConfig` in `crates/ruvector-graph-transformer/src/verified_training/invariants.rs`
2. Implement `VerifiedTrainer` wrapping `ruvector_gnn::training::Optimizer` in `crates/ruvector-graph-transformer/src/verified_training/pipeline.rs`
3. Implement invariant-to-ProofKind mapping for tier routing
4. Implement `RobustnessCertifier` with IBP and DeepPoly in `crates/ruvector-graph-transformer/src/verified_training/mod.rs`
5. Implement `TrainingCertificate` and `seal()` method
6. Add benchmarks: verified training step overhead on a 3-layer GNN (128-dim, 10K nodes)
7. Integration test: train a small GNN for 100 steps with all invariants, verify certificate
## References
- ADR-045: Lean-Agentic Integration (`ProofEnvironment`, `FastTermArena`)
- ADR-046: Graph Transformer Unified Architecture (module structure)
- ADR-047: Proof-Gated Mutation Protocol (`ProofGate<T>`, `MutationLedger`, `compose_chain`)
- `crates/ruvector-gnn/src/training.rs`: `Optimizer`, `OptimizerType`, `TrainConfig`, `sgd_step`
- `crates/ruvector-gnn/src/ewc.rs`: `ElasticWeightConsolidation`
- `crates/ruvector-gnn/src/scheduler.rs`: `LearningRateScheduler`, `SchedulerType`
- `crates/ruvector-gnn/src/replay.rs`: `ReplayBuffer`, `ReplayEntry`
- `crates/ruvector-verified/src/gated.rs`: `ProofTier`, `route_proof`, `verify_tiered`
- `crates/ruvector-verified/src/proof_store.rs`: `ProofAttestation`, `create_attestation`
- `crates/ruvector-verified/src/fast_arena.rs`: `FastTermArena`
- `crates/ruvector-coherence/src/spectral.rs`: `SpectralCoherenceScore`, `SpectralTracker`
- `crates/ruvector-mincut-gated-transformer/src/energy_gate.rs`: `EnergyGate`
- Gowal et al., "Scalable Verified Training for Provably Robust Image Classification" (ICCV 2019) -- IBP training
- Singh et al., "An Abstract Domain for Certifying Neural Networks" (POPL 2019) -- DeepPoly