# ADR-049: Verified Training Pipeline

## Status

Accepted

## Date

2026-02-25

## Context

Training graph transformers involves thousands of gradient steps, each of which modifies model weights. In safety-critical applications, we need guarantees that training did not introduce pathological behavior: unbounded loss spikes, conservation law violations, equivariance breakage, or adversarial vulnerability. Post-hoc auditing of trained models is expensive and often misses subtle training-time regressions.

The RuVector workspace provides the building blocks for verified training:

- `ruvector-gnn` provides `Optimizer` (SGD, Adam), `ElasticWeightConsolidation` (EWC), `LearningRateScheduler`, `ReplayBuffer`, and a training loop with `TrainConfig` in `crates/ruvector-gnn/src/training.rs`
- `ruvector-verified` provides `ProofEnvironment`, `ProofAttestation` (82 bytes), `FastTermArena` for high-throughput proof allocation, and tiered verification via `ProofTier`
- `ruvector-coherence` provides `SpectralCoherenceScore` and `SpectralTracker` (behind the `spectral` feature) for monitoring model quality during training
- `ruvector-mincut-gated-transformer` provides `EnergyGate` in `crates/ruvector-mincut-gated-transformer/src/energy_gate.rs` for energy-based decision making

However, there is no mechanism for issuing per-step invariant proofs during training, no `TrainingCertificate` that attests to the training run's integrity, and no integration between the proof system and the gradient update loop.

## Decision

We will implement a `verified_training` module in `ruvector-graph-transformer` that wraps `ruvector-gnn`'s training infrastructure with proof gates, producing per-step invariant proofs and a final `TrainingCertificate` that attests to the entire training run.

### VerifiedTrainer

```rust
/// A training wrapper that issues proof attestations per gradient step.
///
/// Wraps `ruvector_gnn::training::Optimizer` and composes with
/// `ruvector_verified::ProofEnvironment` for per-step invariant verification.
pub struct VerifiedTrainer {
    /// The underlying GNN optimizer (SGD or Adam).
    optimizer: Optimizer,
    /// EWC for continual learning (optional).
    ewc: Option<ElasticWeightConsolidation>,
    /// Learning rate scheduler.
    scheduler: LearningRateScheduler,
    /// Proof environment for generating attestations.
    proof_env: ProofEnvironment,
    /// Fast arena for high-throughput proof allocation.
    arena: FastTermArena,
    /// Per-step invariant specifications.
    invariants: Vec<TrainingInvariant>,
    /// Accumulated attestations for the training run.
    ledger: MutationLedger,
    /// Configuration.
    config: VerifiedTrainerConfig,
}
```

### Per-Step Invariant Proofs

Each gradient step is bracketed by invariant checks. The `TrainingInvariant` enum defines what is verified:

```rust
pub enum TrainingInvariant {
    /// Loss stability: loss stays within a bounded envelope relative to
    /// a moving average. Raw loss is NOT monotonic in SGD — this invariant
    /// captures what is actually enforceable: bounded deviation from trend.
    ///
    /// **This is a true invariant**, not a heuristic: the proof certifies
    /// that loss_t <= moving_avg(loss, window) * (1 + spike_cap).
    LossStabilityBound {
        /// Maximum spike relative to moving average (e.g., 0.10 = 10% above MA).
        spike_cap: f64,
        /// Window size for exponential moving average.
        window: usize,
        /// Gradient norm cap: reject step if ||grad|| > this value.
        max_gradient_norm: f64,
        /// Step size cap: reject step if effective lr * ||grad|| > this value.
        max_step_size: f64,
    },

    /// Weight norm conservation: ||W_t|| stays within bounds per layer.
    /// Prevents gradient explosion/vanishing.
    ///
    /// Rollback strategy: **delta-apply** — gradients are applied to a
    /// scratch buffer, norms checked, then committed only if bounds hold.
    /// This avoids doubling peak memory via full snapshots.
    WeightNormBound {
        /// Maximum L2 norm per layer.
        max_norm: f64,
        /// Minimum L2 norm per layer (prevents collapse).
        min_norm: f64,
        /// Rollback strategy.
        rollback: RollbackStrategy,
    },

    /// Equivariance: model output is equivariant to graph permutations.
    /// **This is a statistical test, not a formal proof.** The certificate
    /// records the exact scope: rng seed, sample count, permutation ID hashes.
    /// A verifier can replay the exact same permutations to confirm.
    PermutationEquivariance {
        /// Number of random permutations to test per check.
        samples: usize,
        /// Maximum allowed deviation (L2 distance / output norm).
        max_deviation: f64,
        /// RNG seed for reproducibility. Bound into the proof scope.
        rng_seed: u64,
    },

    /// Lipschitz bound: **estimated** Lipschitz constant stays below threshold.
    /// Verified per-layer via spectral norm power iteration.
    ///
    /// **Attestation scope:** The certificate records that the estimated bound
    /// (via K power iterations with tolerance eps) stayed below max_lipschitz.
    /// This does NOT certify the true Lipschitz constant — it certifies
    /// that the estimate with stated parameters was within bounds.
    LipschitzBound {
        /// Maximum Lipschitz constant per layer.
        max_lipschitz: f64,
        /// Power iteration steps for spectral norm estimation.
        power_iterations: usize,
        /// Convergence tolerance for power iteration.
        tolerance: f64,
    },

    /// Coherence: spectral coherence score stays above threshold.
    /// Uses ruvector-coherence::spectral::SpectralCoherenceScore.
    ///
    /// **Attestation scope:** Like Lipschitz, this is an estimate based on
    /// sampled eigenvalues. The certificate records the estimation parameters.
    CoherenceBound {
        /// Minimum coherence score.
        min_coherence: f64,
        /// Number of eigenvalue samples for estimation.
        eigenvalue_samples: usize,
    },

    /// Energy gate: compute energy or coherence proxy BEFORE applying
    /// gradients. If below threshold, require a stronger proof tier,
    /// reduce learning rate, or refuse the step entirely.
    ///
    /// Integrates with ruvector-mincut-gated-transformer::EnergyGate
    /// to make training behave like inference gating.
    EnergyGate {
        /// Minimum energy threshold for a standard-tier step.
        min_energy: f64,
        /// If energy < min_energy, force this tier for verification.
        escalation_tier: ProofTier,
        /// If energy < critical_energy, refuse the step entirely.
        critical_energy: f64,
    },

    /// Custom invariant with a user-provided verification function.
    Custom {
        /// Name for logging and attestation.
        name: String,
        /// Estimated proof complexity (for tier routing).
        complexity: u32,
    },
}

/// Rollback strategy for failed invariant checks.
pub enum RollbackStrategy {
    /// Apply gradients to a scratch buffer, check invariants, then commit.
    /// Peak memory: weights + one layer's gradients. No full snapshot.
    DeltaApply,
    /// Store per-layer deltas, revert only modified layers on failure.
    /// Peak memory: weights + delta buffer (typically < 10% of weights).
    ChunkedRollback,
    /// Full snapshot (doubles peak memory). Use only when other strategies
    /// are insufficient (e.g., cross-layer invariants).
    FullSnapshot,
}
```
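To make the `LossStabilityBound` semantics concrete, here is a minimal sketch of the check as described above. The state type and function names are illustrative, not part of the module's API; the EMA smoothing factor uses the conventional `2 / (window + 1)` mapping.

```rust
/// Illustrative state for the LossStabilityBound check (hypothetical type).
struct LossStability {
    ema: Option<f64>, // exponential moving average of committed losses
    alpha: f64,       // smoothing factor, 2 / (window + 1)
}

impl LossStability {
    fn new(window: usize) -> Self {
        Self { ema: None, alpha: 2.0 / (window as f64 + 1.0) }
    }

    /// Returns true iff the step satisfies the invariant:
    ///   loss_t <= EMA * (1 + spike_cap),
    ///   ||grad|| <= max_gradient_norm,
    ///   lr * ||grad|| <= max_step_size.
    fn check(
        &mut self,
        loss: f64,
        grad_norm: f64,
        lr: f64,
        spike_cap: f64,
        max_gradient_norm: f64,
        max_step_size: f64,
    ) -> bool {
        let ok = match self.ema {
            // First step: no trend yet, only the gradient/step caps apply.
            None => true,
            Some(ema) => loss <= ema * (1.0 + spike_cap),
        } && grad_norm <= max_gradient_norm
            && lr * grad_norm <= max_step_size;
        if ok {
            // Only accepted steps move the trend, so a rejected spike
            // cannot drag the moving average upward.
            let prev = self.ema.unwrap_or(loss);
            self.ema = Some(self.alpha * loss + (1.0 - self.alpha) * prev);
        }
        ok
    }
}
```

Updating the EMA only on accepted steps is a design choice worth stating in the proof scope: it prevents a rejected spike from loosening the envelope for subsequent steps.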
### Invariant Verification Flow

```rust
impl VerifiedTrainer {
    /// Execute one verified training step.
    ///
    /// 1. Verify pre-step invariants (gradient bounds, loss stability)
    /// 2. Evaluate the energy gate before any mutation
    /// 3. Compute the proposed delta into a scratch buffer (delta-apply)
    /// 4. Verify post-step invariants on the proposed weights + delta
    /// 5. Commit the delta and issue an attestation for this step
    /// 6. If any invariant fails, discard the delta (or override) and record it
    pub fn step(
        &mut self,
        loss: f64,
        gradients: &Gradients,
        weights: &mut Weights,
    ) -> Result<StepAttestation> {
        // 1. Pre-step: verify gradient bounds and loss stability
        let pre_proofs = self.verify_invariants(
            InvariantPhase::PreStep,
            loss,
            weights,
        )?;

        // 2. Energy gate: compute energy/coherence proxy BEFORE mutation.
        //    If below threshold, escalate proof tier or refuse step.
        if let Some(energy_gate) = &self.config.energy_gate {
            let energy = energy_gate.evaluate(weights, gradients);
            if energy < energy_gate.critical_energy {
                return Err(GraphTransformerError::MutationRejected {
                    reason: format!(
                        "energy {} < critical {}",
                        energy, energy_gate.critical_energy
                    ),
                });
            }
            if energy < energy_gate.min_energy {
                // Force escalation to a stronger proof tier
                self.current_tier_override = Some(energy_gate.escalation_tier);
            }
        }

        // 3. Apply gradients via the delta-apply strategy (default).
        //    Gradients go into a scratch buffer, not directly into weights.
        let delta = self.optimizer.compute_delta(gradients, weights)?;

        // 4. Post-step verification on the proposed (weights + delta).
        //    No mutation has occurred yet.
        match self.verify_invariants_on_proposed(
            InvariantPhase::PostStep,
            loss,
            weights,
            &delta,
        ) {
            Ok(post_proofs) => {
                // 5. Commit: apply delta to the actual weights.
                weights.apply_delta(&delta);

                // 6. Compose attestation and append to the ledger.
                let attestation = self.compose_step_attestation(
                    pre_proofs,
                    post_proofs,
                );
                self.ledger.append(attestation.clone());
                self.scheduler.step();
                self.current_tier_override = None;

                Ok(StepAttestation {
                    step: self.ledger.len() as u64,
                    attestation,
                    loss,
                    invariants_checked: self.invariants.len(),
                    overridden: false,
                })
            }
            Err(e) if self.config.allow_override => {
                // Degraded mode: step proceeds with an OverrideProof.
                // The override is visible in the certificate.
                let override_proof = self.create_override_proof(&e)?;
                weights.apply_delta(&delta);
                self.ledger.append(override_proof.clone());
                self.override_count += 1;

                Ok(StepAttestation {
                    step: self.ledger.len() as u64,
                    attestation: override_proof,
                    loss,
                    invariants_checked: self.invariants.len(),
                    overridden: true,
                })
            }
            Err(e) => {
                // Fail-closed: delta is discarded, weights unchanged.
                // Refusal is recorded in the ledger.
                let refusal = self.create_refusal_attestation(&e);
                self.ledger.append(refusal);
                Err(e)
            }
        }
    }
}
```
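A minimal sketch of how a caller drives the verified loop follows. The `forward_backward`, `weights_mut`, and `weights` helpers are hypothetical stand-ins for whatever the model actually exposes; the three outcomes mirror the `match` arms above.

```rust
// Hypothetical driver loop; model accessor methods are illustrative,
// not part of this ADR's API.
fn train(
    mut trainer: VerifiedTrainer,
    model: &mut GraphTransformer,
    batches: &[GraphBatch],
) -> Result<TrainingCertificate> {
    for batch in batches {
        let (loss, grads) = model.forward_backward(batch)?;
        match trainer.step(loss, &grads, model.weights_mut()) {
            // Committed step; overridden == true means degraded mode fired.
            Ok(att) if att.overridden => {
                eprintln!("step {} proceeded via OverrideProof", att.step);
            }
            Ok(_) => {}
            // Fail-closed: the delta was discarded and the refusal is
            // already in the ledger, so weights are untouched.
            Err(e) => return Err(e),
        }
    }
    // seal() consumes the trainer and binds the certificate to the
    // exact final weights via BLAKE3.
    Ok(trainer.seal(model.weights()))
}
```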
### Tier Routing for Training Invariants

Training invariant verification uses the same three-tier routing as ADR-047:

| Invariant | Typical Tier | Rationale | Formally Proven? |
|-----------|--------------|-----------|------------------|
| `LossStabilityBound` | Reflex | Moving avg comparison + gradient norm check, < 10 ns | **Yes** — bounded comparison |
| `WeightNormBound` | Standard(100) | L2 norm computation, < 1 us | **Yes** — exact computation |
| `PermutationEquivariance` | Deep | Random permutation + forward pass, < 100 us | **No** — statistical test with bound scope |
| `LipschitzBound` | Standard(500) | Power iteration spectral norm, < 10 us | **No** — estimate with stated tolerance |
| `CoherenceBound` | Standard(200) | Spectral coherence from sampled eigenvalues, < 5 us | **No** — estimate with stated sample count |
| `EnergyGate` | Reflex/Standard | Energy proxy evaluation, < 100 ns | **Yes** — threshold comparison |
| `Custom` | Routed by `complexity` field | User-defined | Depends on implementation |

**Distinction between proven and estimated invariants:** The certificate explicitly records which invariants are formally proven (exact computation within the proof system) and which are statistical estimates with bound scope (rng_seed, sample_count, iterations, tolerance). A verifier knows exactly what was tested and can replay it.

The routing decision is made by converting each `TrainingInvariant` into a `ProofKind` and calling `ruvector_verified::gated::route_proof`. For example, `LossStabilityBound` maps to `ProofKind::DimensionEquality` (a literal comparison), while `PermutationEquivariance` maps to `ProofKind::Custom { estimated_complexity: samples * 100 }`.
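A sketch of the mapping is shown below. Only the `LossStabilityBound` and `PermutationEquivariance` arms come from the text above; the remaining complexity figures are illustrative placeholders, not decided values.

```rust
/// Convert an invariant into a ProofKind for tier routing (sketch).
/// Only the first two arms are specified by this ADR; the other
/// complexity estimates are placeholders for illustration.
fn to_proof_kind(inv: &TrainingInvariant) -> ProofKind {
    match inv {
        // Literal bounded comparison: routes to Reflex.
        TrainingInvariant::LossStabilityBound { .. } => ProofKind::DimensionEquality,
        // Statistical test: complexity scales with the sample count.
        TrainingInvariant::PermutationEquivariance { samples, .. } => {
            ProofKind::Custom { estimated_complexity: (samples * 100) as u32 }
        }
        TrainingInvariant::WeightNormBound { .. } => {
            ProofKind::Custom { estimated_complexity: 100 }
        }
        TrainingInvariant::LipschitzBound { .. } => {
            ProofKind::Custom { estimated_complexity: 500 }
        }
        TrainingInvariant::CoherenceBound { .. } => {
            ProofKind::Custom { estimated_complexity: 200 }
        }
        TrainingInvariant::EnergyGate { .. } => ProofKind::DimensionEquality,
        TrainingInvariant::Custom { complexity, .. } => {
            ProofKind::Custom { estimated_complexity: *complexity }
        }
    }
}

// Routing then reduces to: let tier = route_proof(&to_proof_kind(inv));
```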
### Certified Adversarial Robustness

For models that require adversarial robustness certification, the `verified_training` module provides an IBP (Interval Bound Propagation) / DeepPoly integration:

```rust
pub struct RobustnessCertifier {
    /// Perturbation radius (L-infinity norm).
    epsilon: f64,
    /// Certification method.
    method: CertificationMethod,
}

pub enum CertificationMethod {
    /// Interval Bound Propagation -- fast but loose.
    IBP,
    /// DeepPoly -- tighter but slower.
    DeepPoly,
    /// Combined: IBP for the initial bound, DeepPoly for refinement.
    Hybrid { ibp_warmup_epochs: usize },
}

impl RobustnessCertifier {
    /// Certify that the model's output is stable within the epsilon-ball.
    /// Returns a ProofGate with the certified radius.
    pub fn certify(
        &self,
        model: &GraphTransformer,
        input: &GraphBatch,
        env: &mut ProofEnvironment,
    ) -> Result<ProofGate<RobustnessCertificate>>;
}

pub struct RobustnessCertificate {
    /// Certified perturbation radius.
    pub certified_radius: f64,
    /// Fraction of nodes certified robust.
    pub certified_fraction: f64,
    /// Method used.
    pub method: CertificationMethod,
    /// Attestation.
    pub attestation: ProofAttestation,
}
```
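For intuition, here is a self-contained sketch of the IBP primitive: an interval is pushed through one affine layer in center-radius form (center' = W·center + b, radius' = |W|·radius) followed by a monotone activation. This illustrates the technique, not the crate's implementation, which operates on `GraphBatch` tensors.

```rust
/// Propagate an L-infinity interval (center ± radius) through one
/// affine layer followed by ReLU. Sketch only.
fn ibp_affine_relu(
    w: &[Vec<f64>],  // weight matrix, one row per output dimension
    b: &[f64],       // bias
    center: &[f64],  // interval midpoints
    radius: &[f64],  // interval half-widths (epsilon at the input layer)
) -> (Vec<f64>, Vec<f64>) {
    let mut out_c = Vec::with_capacity(w.len());
    let mut out_r = Vec::with_capacity(w.len());
    for (row, &bias) in w.iter().zip(b) {
        // center' = W * center + b
        let c: f64 =
            row.iter().zip(center).map(|(wi, xi)| wi * xi).sum::<f64>() + bias;
        // radius' = |W| * radius
        let r: f64 = row.iter().zip(radius).map(|(wi, ri)| wi.abs() * ri).sum();
        // ReLU is monotone, so the interval maps to [max(0, c-r), max(0, c+r)].
        let lo = (c - r).max(0.0);
        let hi = (c + r).max(0.0);
        out_c.push((lo + hi) / 2.0);
        out_r.push((hi - lo) / 2.0);
    }
    (out_c, out_r)
}
```

A node is certified robust when, after propagating through all layers, the lower bound of the true class's logit exceeds the upper bound of every other logit; DeepPoly tightens these bounds with per-neuron linear relaxations at higher cost.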
### Training Certificate

At the end of a training run, a `TrainingCertificate` is produced by composing all step attestations:

```rust
pub struct TrainingCertificate {
    /// Total training steps completed.
    pub total_steps: u64,
    /// Total invariant violations (zero if fully verified).
    pub violations: u64,
    /// Number of steps that proceeded via OverrideProof (degraded mode).
    pub overridden_steps: u64,
    /// Composed attestation over all steps via compose_chain.
    pub attestation: ProofAttestation,
    /// Final loss value.
    pub final_loss: f64,
    /// Final coherence score (if the CoherenceBound invariant was active).
    pub final_coherence: Option<f64>,
    /// Robustness certificate (if adversarial certification was run).
    pub robustness: Option<RobustnessCertificate>,
    /// Epoch at which the certificate was sealed.
    pub epoch: u64,
    /// Per-invariant statistics.
    pub invariant_stats: Vec<InvariantStats>,

    // --- Artifact binding (hardening move #7) ---

    /// BLAKE3 hash of the final model weights. Binds the certificate to
    /// the exact model artifact. Cannot be separated.
    pub weights_hash: [u8; 32],
    /// BLAKE3 hash of the VerifiedTrainerConfig (serialized).
    pub config_hash: [u8; 32],
    /// BLAKE3 hash of the dataset manifest (or RVF manifest root).
    /// None if no dataset manifest was provided.
    pub dataset_manifest_hash: Option<[u8; 32]>,
    /// BLAKE3 hash of the code (build hash / git commit).
    /// None if not provided.
    pub code_build_hash: Option<[u8; 32]>,
}

pub struct InvariantStats {
    /// Invariant name.
    pub name: String,
    /// Whether this invariant is formally proven or a statistical estimate.
    pub proof_class: ProofClass,
    /// Number of times checked.
    pub checks: u64,
    /// Number of times satisfied.
    pub satisfied: u64,
    /// Number of times overridden (degraded mode).
    pub overridden: u64,
    /// Average verification latency.
    pub avg_latency_ns: u64,
    /// Proof tier distribution: [reflex_count, standard_count, deep_count].
    pub tier_distribution: [u64; 3],
}

pub enum ProofClass {
    /// Formally proven: exact computation within the proof system.
    Formal,
    /// Statistical estimate with bound scope. The certificate records
    /// the estimation parameters (rng_seed, iterations, tolerance).
    Statistical {
        rng_seed: Option<u64>,
        iterations: usize,
        tolerance: f64,
    },
}

impl VerifiedTrainer {
    /// Seal the training run and produce a certificate.
    ///
    /// 1. Compacts the mutation ledger (proof-gated: compaction itself
    ///    produces a composed attestation + witness that the compacted
    ///    chain corresponds exactly to the original sequence).
    /// 2. Computes BLAKE3 hashes of weights, config, and optional manifests.
    /// 3. Composes all attestations into the final certificate.
    ///
    /// The sealed certificate is a product artifact: verifiable by
    /// third parties without trusting training logs.
    pub fn seal(self, weights: &Weights) -> TrainingCertificate;
}
```
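To illustrate artifact binding, here is a sketch of the third-party check using the `blake3` crate. `weights_bytes` assumes a hypothetical canonical serialization of the final weights, and verification of the composed attestation itself is elided to whatever `ruvector-verified` exposes.

```rust
/// Third-party verification sketch: recompute artifact hashes and compare
/// them against the certificate. Attestation verification is elided.
fn verify_binding(
    cert: &TrainingCertificate,
    weights_bytes: &[u8], // canonically serialized final weights (assumed)
    config_bytes: &[u8],  // serialized VerifiedTrainerConfig
) -> bool {
    let weights_ok =
        blake3::hash(weights_bytes).as_bytes() == &cert.weights_hash;
    let config_ok =
        blake3::hash(config_bytes).as_bytes() == &cert.config_hash;
    // A fully verified run has no violations and no degraded-mode steps.
    let clean = cert.violations == 0 && cert.overridden_steps == 0;
    weights_ok && config_ok && clean
}
```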
### Performance Budget

The target is proof overhead < 5% of training step time. For a typical GNN training step of ~10 ms (on CPU):

- `LossStabilityBound` (Reflex): < 10 ns = 0.0001%
- `WeightNormBound` (Standard): < 1 us = 0.01%
- `LipschitzBound` (Standard): < 10 us = 0.1%
- `CoherenceBound` (Standard): < 5 us = 0.05%
- `PermutationEquivariance` (Deep, sampled): < 100 us = 1%
- Attestation composition: < 1 us = 0.01%
- **Total**: < 120 us = 1.2% (well within the 5% budget)

For GPU-accelerated training (step time ~1 ms), `PermutationEquivariance` with many samples may exceed 5%. Mitigation: reduce the sample count or check equivariance every N steps (configurable via `expensive_check_interval` in `VerifiedTrainerConfig`).

### Integration with EWC and Replay Buffer

The `VerifiedTrainer` composes with `ruvector-gnn`'s continual learning primitives:

```rust
pub struct VerifiedTrainerConfig {
    /// Optimizer type (from ruvector-gnn).
    pub optimizer: OptimizerType,
    /// EWC lambda (0.0 = disabled). Uses ruvector_gnn::ElasticWeightConsolidation.
    pub ewc_lambda: f64,
    /// Replay buffer size (0 = disabled). Uses ruvector_gnn::ReplayBuffer.
    pub replay_buffer_size: usize,
    /// Scheduler type (from ruvector-gnn).
    pub scheduler: SchedulerType,
    /// Invariants to verify per step.
    pub invariants: Vec<TrainingInvariant>,
    /// Check interval for expensive invariants (e.g., equivariance).
    /// Cheap invariants (Reflex tier) run every step.
    pub expensive_check_interval: usize,
    /// Warmup steps during which invariant violations are logged but
    /// do not trigger rollback. After warmup, fail-closed applies.
    pub warmup_steps: usize,
    /// Robustness certification config (None = disabled).
    pub robustness: Option<RobustnessCertifier>,
    /// Energy gate config (None = disabled).
    /// If enabled, energy is evaluated before every gradient application.
    pub energy_gate: Option<EnergyGateConfig>,
    /// Default rollback strategy for invariant failures.
    pub rollback_strategy: RollbackStrategy,
    /// Allow degraded mode: if true, failed invariant checks produce
    /// an OverrideProof and increment a visible violation counter
    /// instead of stopping the step. Default: false (fail-closed).
    pub allow_override: bool,
    /// Optional dataset manifest hash for binding to the certificate.
    pub dataset_manifest_hash: Option<[u8; 32]>,
    /// Optional code build hash for binding to the certificate.
    pub code_build_hash: Option<[u8; 32]>,
}
```

When EWC is enabled, the `WeightNormBound` invariant is automatically adjusted to account for the EWC penalty term. When the replay buffer is active, replayed samples also go through invariant verification. A configuration sketch follows below.
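A sketch of a typical configuration, assuming `OptimizerType::Adam` and `SchedulerType::CosineAnnealing` exist under those names in `ruvector-gnn` (illustrative; the actual variant names may differ):

```rust
// Illustrative configuration; the optimizer/scheduler variant names and
// all threshold values are assumptions, not confirmed ruvector-gnn API.
let config = VerifiedTrainerConfig {
    optimizer: OptimizerType::Adam,
    ewc_lambda: 0.4,                  // EWC enabled, moderate penalty
    replay_buffer_size: 1024,
    scheduler: SchedulerType::CosineAnnealing,
    invariants: vec![
        TrainingInvariant::LossStabilityBound {
            spike_cap: 0.10,          // allow 10% above the moving average
            window: 64,
            max_gradient_norm: 10.0,
            max_step_size: 0.5,
        },
        TrainingInvariant::WeightNormBound {
            max_norm: 100.0,
            min_norm: 1e-3,
            rollback: RollbackStrategy::DeltaApply,
        },
        TrainingInvariant::PermutationEquivariance {
            samples: 8,
            max_deviation: 1e-4,
            rng_seed: 0xC0FFEE,       // bound into the proof scope
        },
    ],
    expensive_check_interval: 50,     // Deep checks every 50 steps
    warmup_steps: 100,                // log-only before fail-closed applies
    robustness: None,
    energy_gate: None,
    rollback_strategy: RollbackStrategy::DeltaApply,
    allow_override: false,            // fail-closed
    dataset_manifest_hash: None,
    code_build_hash: None,
};
```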
## Consequences

### Positive

- Every training run produces a `TrainingCertificate` bound to the exact model weights via a BLAKE3 hash — portable, verifiable by third parties without trusting logs
- Per-step invariant proofs catch regressions immediately — loss spikes, norm explosions, and equivariance breaks become training-stopping events, not evaluation surprises
- Clear distinction between formally proven invariants and statistical estimates — the certificate is defensible because it states exactly what was proven and what was estimated
- EnergyGate integration makes training behave like inference gating — consistent proof-gated mutation across the full lifecycle
- The delta-apply rollback strategy avoids doubling peak memory while preserving proof-gated semantics
- Fail-closed by default, with an explicit OverrideProof for degraded mode — violations are visible, not silent

### Negative

- `PermutationEquivariance` is a statistical test, not a formal proof — the certificate is honest about this, but it means equivariance is not guaranteed, only tested with bound scope
- `LipschitzBound` via power iteration is an estimate — the certificate attests that the estimate was within bounds, not the true Lipschitz constant
- The `TrainingCertificate` is only as strong as the invariants specified — missing invariants are not caught
- Robustness certification (IBP/DeepPoly) produces loose bounds for deep models; the certified radius may be conservative
- Over-conservative invariants can stop learning — mitigated by check intervals, warmup periods, and adaptive thresholds (which are themselves bounded)

### Risks

- **Proof cache hit rate drops**: A high learning rate produces diverse weight states, so Standard/Deep proofs dominate and exceed the 5% budget. Mitigated by caching invariant structure (not values) — proof terms depend on structure, values are parameters. Monitor `ProofStats::cache_hit_rate` and alert below 80%
- **GPU steps dominated by Deep checks**: Schedule deep checks asynchronously with a two-phase commit: provisional update, finalize after the deep check, revert if it failed. This mitigation preserves proof-gated semantics without blocking the training loop
- **EWC Fisher information**: O(n_params^2) in the naive case. The existing diagonal approximation may miss cross-parameter interactions. Mitigated by periodic full Fisher computation (every K epochs) as a Deep-tier invariant
- **Attestation chain growth**: 82 bytes per step * 100,000 steps ≈ 8 MB. Mitigated by `MutationLedger::compact` — compaction is itself proof-gated: it produces a composed attestation plus a witness that the compacted chain corresponds exactly to the original sequence under the current epoch algebra
- **Certificate separation**: Without artifact binding, the certificate can be detached from the model. Mitigated by BLAKE3 hashes of the weights, config, dataset manifest, and code build hash in the certificate

### Acceptance Test

Train 200 steps with invariants enabled, then intentionally inject one bad gradient update that would push a layer norm above `max_norm`. The system must:

1. Reject the step (fail-closed)
2. Emit a refusal attestation to the ledger
3. Leave the weights unchanged (the delta-apply was never committed)
4. Show exactly one violation in the sealed `TrainingCertificate`, with the correct step index and invariant name
5. Produce a `weights_hash` in the certificate that matches the actual final weights

## Implementation

1. Define the `TrainingInvariant` enum and `VerifiedTrainerConfig` in `crates/ruvector-graph-transformer/src/verified_training/invariants.rs`
2. Implement `VerifiedTrainer` wrapping `ruvector_gnn::training::Optimizer` in `crates/ruvector-graph-transformer/src/verified_training/pipeline.rs`
3. Implement the invariant-to-ProofKind mapping for tier routing
4. Implement `RobustnessCertifier` with IBP and DeepPoly in `crates/ruvector-graph-transformer/src/verified_training/mod.rs`
5. Implement `TrainingCertificate` and the `seal()` method
6. Add benchmarks: verified training step overhead on a 3-layer GNN (128-dim, 10K nodes)
7. Integration test: train a small GNN for 100 steps with all invariants enabled, then verify the certificate

## References

- ADR-045: Lean-Agentic Integration (`ProofEnvironment`, `FastTermArena`)
- ADR-046: Graph Transformer Unified Architecture (module structure)
- ADR-047: Proof-Gated Mutation Protocol (`ProofGate`, `MutationLedger`, `compose_chain`)
- `crates/ruvector-gnn/src/training.rs`: `Optimizer`, `OptimizerType`, `TrainConfig`, `sgd_step`
- `crates/ruvector-gnn/src/ewc.rs`: `ElasticWeightConsolidation`
- `crates/ruvector-gnn/src/scheduler.rs`: `LearningRateScheduler`, `SchedulerType`
- `crates/ruvector-gnn/src/replay.rs`: `ReplayBuffer`, `ReplayEntry`
- `crates/ruvector-verified/src/gated.rs`: `ProofTier`, `route_proof`, `verify_tiered`
- `crates/ruvector-verified/src/proof_store.rs`: `ProofAttestation`, `create_attestation`
- `crates/ruvector-verified/src/fast_arena.rs`: `FastTermArena`
- `crates/ruvector-coherence/src/spectral.rs`: `SpectralCoherenceScore`, `SpectralTracker`
- `crates/ruvector-mincut-gated-transformer/src/energy_gate.rs`: `EnergyGate`
- Gowal et al., "Scalable Verified Training" (ICML 2019) -- IBP training
- Singh et al., "Abstract Interpretation with DeepPoly" (POPL 2019)