Merge commit 'd803bfe2b1fe7f5e219e50ac20d6801a0a58ac75' as 'vendor/ruvector'

This commit is contained in:
ruv
2026-02-28 14:39:40 -05:00
# ADR-QE-001: Quantum Engine Core Architecture
**Status**: Proposed
**Date**: 2026-02-06
**Authors**: ruv.io, RuVector Team
**Deciders**: Architecture Review Board
## Context
### Problem Statement
ruVector needs a quantum simulation engine for on-device quantum algorithm
experimentation. The platform runs on distributed edge systems, primarily
targeting Cognitum's 256-core low-power processors, and emphasizes ultra-low-power
event-driven computing. Quantum simulation is a natural extension of ruVector's
mathematical computation capabilities: the same SIMD-optimized linear algebra
that powers vector search and neural inference can drive state-vector manipulation
for quantum circuits.
### Requirements
The engine must support gate-model quantum circuit simulation up to approximately
25 qubits, covering the following algorithm families:
| Algorithm Family | Use Case | Typical Qubits | Gate Depth |
|------------------|----------|-----------------|------------|
| VQE (Variational Quantum Eigensolver) | Molecular simulation, optimization | 8-20 | 50-500 per iteration |
| Grover's Search | Unstructured database search | 8-25 | O(sqrt(2^n)) |
| QAOA (Quantum Approximate Optimization) | Combinatorial optimization | 10-25 | O(p * edges) |
| Quantum Error Correction | Surface code, stabilizer circuits | 9-25 (logical + ancilla) | Repetitive syndrome rounds |
### Memory Scaling Analysis
Quantum state-vector simulation stores the full amplitude vector of 2^n complex
numbers. Each amplitude is a pair of f64 values (real + imaginary = 16 bytes).
Memory grows exponentially:
```
Qubits Amplitudes State Size With Scratch Buffer
------ ----------- ---------- -------------------
10 1,024 16 KB 32 KB
15 32,768 512 KB 1 MB
20 1,048,576 16 MB 32 MB
22 4,194,304 64 MB 128 MB
24 16,777,216 256 MB 512 MB
25 33,554,432 512 MB 1.07 GB
26 67,108,864 1.07 GB 2.14 GB
28 268,435,456 4.29 GB 8.59 GB
30 1,073,741,824 17.18 GB 34.36 GB
```
At 25 qubits the state vector requires approximately 512 MB (1.07 GB with a
scratch buffer for intermediate calculations). This is the practical ceiling
for WebAssembly's 32-bit address space. Native execution with sufficient RAM
can push to 30+ qubits.
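The scaling rule behind the table reduces to a one-line calculation. A minimal sketch (function names are ours, not from the engine's API):

```rust
/// Bytes needed for a full state vector of `n` qubits at Complex<f64>
/// precision: 2^n amplitudes x 16 bytes (real + imaginary as f64).
fn state_vector_bytes(n: u32) -> u64 {
    16u64 << n // 16 * 2^n
}

/// Peak footprint including the scratch buffer, which doubles the state size.
fn peak_bytes_with_scratch(n: u32) -> u64 {
    2 * state_vector_bytes(n)
}
```

At 25 qubits this reproduces the table's 512 MB state and ~1.07 GB peak.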
### Edge Computing Constraints
Cognitum's 256-core processors operate under strict power and memory budgets:
- **Power envelope**: Event-driven activation; cores idle at near-zero draw
- **Memory**: Shared pool, typically 2-8 GB per node
- **Interconnect**: Low-latency mesh between cores, suitable for parallel simulation
- **Workload model**: Burst computation triggered by agent events, not continuous
The quantum engine must respect this model: allocate state only when a simulation
is triggered, execute the circuit, return results, and immediately release all
memory.
## Decision
Implement a **pure Rust state-vector quantum simulator** as a new crate family
(the `ruqu-*` quantum engine) within the ruVector workspace. The following
architectural decisions define the engine.
### 1. Pure Rust Implementation (No C/C++ FFI)
The entire simulation engine is written in Rust with no foreign function interface
dependencies. This ensures:
- Compilation to `wasm32-unknown-unknown` without emscripten or C toolchains
- Memory safety guarantees throughout the simulation pipeline
- Unified build system via Cargo across all targets
- No external library version conflicts or platform-specific linking issues
### 2. State-Vector Simulation as Primary Backend
The engine uses explicit full-amplitude state-vector representation as its
primary simulation mode. Each gate application transforms the full 2^n
amplitude vector via matrix-vector multiplication.
```
Circuit Execution Model:
|psi_0> ──[H]──[CNOT]──[Rz(theta)]──[Measure]── classical bits
| | | |
v v v v
[init] [apply_H] [apply_CNOT] [apply_Rz] [sample]
| | | | |
2^n f64 2^n f64 2^n f64 2^n f64 collapse
complex complex complex complex to basis
```
Gate application follows the standard decomposition:
- **Single-qubit gates**: Iterate amplitude pairs (i, i XOR 2^target), apply 2x2
unitary. O(2^n) operations per gate.
- **Two-qubit gates**: Iterate amplitude quadruples, apply 4x4 unitary.
O(2^n) operations per gate.
- **Multi-qubit gates**: Decompose into single and two-qubit gates, or apply
directly via 2^k x 2^k matrix on k target qubits.
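The pair iteration for single-qubit gates can be made concrete. The sketch below applies a Pauli-X by swapping each amplitude pair (i, i XOR 2^target); real amplitudes stand in for Complex<f64> to keep it short, and the function name is illustrative:

```rust
/// Apply a Pauli-X gate to `target` by swapping each amplitude pair
/// (i, i ^ (1 << target)) -- the 2-element subspace the 2x2 unitary acts on.
fn apply_x(state: &mut [f64], target: usize) {
    let mask = 1usize << target;
    for i in 0..state.len() {
        if i & mask == 0 {
            state.swap(i, i | mask); // X is the [[0,1],[1,0]] unitary: a swap
        }
    }
}
```

Starting from |00> = [1, 0, 0, 0], applying X to qubit 1 moves the amplitude to index 2 (binary 10), i.e. |10> -- O(2^n) work per gate, as stated above.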
### 3. Qubit Limits and Precision
| Parameter | WASM Target | Native Target |
|-----------|-------------|---------------|
| Max qubits (default) | 25 | 30+ (RAM-dependent) |
| Max qubits (hard limit) | 26 (with f32) | Memory-limited |
| Precision (default) | Complex f64 | Complex f64 |
| Precision (optional) | Complex f32 | Complex f32 |
| State size at max | ~1.07 GB | ~17 GB at 30 qubits |
Complex f64 is the default precision, providing approximately 15 decimal digits
of accuracy -- sufficient for quantum chemistry applications and deep circuits
where accumulated floating-point error matters. An optional f32 mode halves
memory usage at the cost of precision, suitable for shallow circuits and
approximate optimization.
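The "~15 decimal digits" figure follows directly from machine epsilon; a quick check (helper name ours):

```rust
/// Approximate decimal digits of precision implied by a machine epsilon.
fn decimal_digits(eps: f64) -> f64 {
    -eps.log10()
}
// f64::EPSILON ~ 2.22e-16 -> ~15.7 digits; f32::EPSILON ~ 1.19e-7 -> ~6.9 digits
```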
### 4. Event-Driven Activation Model
The engine follows ruVector's event-driven philosophy:
```
Agent Context ruQu Engine Memory
| | |
|-- trigger(circuit) ->| |
| |-- allocate(2^n) ---->|
| |<---- state_ptr ------|
| | |
| |-- [execute gates] -->|
| |-- [measure] -------->|
| | |
|<-- results ---------| |
| |-- deallocate() ----->|
| | |
(idle) (inert) (freed)
```
- **Inert by default**: No background threads, no persistent allocations
- **Allocate on demand**: State vector created when circuit execution begins
- **Free immediately**: All simulation memory released upon result delivery
- **No global state**: Multiple concurrent simulations supported via independent
state handles (no shared mutable global)
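In Rust this lifecycle falls out of ownership: the allocation is scoped to a single run and dropped on return. A minimal sketch with assumed types, not the crate's actual API:

```rust
/// State allocated only for the duration of one triggered simulation.
struct SimState {
    amplitudes: Vec<(f64, f64)>, // (re, im) pairs, 2^n entries
}

/// Allocate on trigger, execute, return results; `SimState` is dropped on
/// return, so all simulation memory is released before the caller resumes.
fn run_triggered(num_qubits: usize) -> Vec<f64> {
    let mut state = SimState {
        amplitudes: vec![(0.0, 0.0); 1 << num_qubits],
    };
    state.amplitudes[0] = (1.0, 0.0); // initialize to |0...0>
    // ... gate application would go here ...
    state
        .amplitudes
        .iter()
        .map(|(re, im)| re * re + im * im)
        .collect()
    // <- `state` freed here: nothing persists between simulations
}
```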
### 5. Dual-Target Compilation
The crate supports two compilation targets from a single codebase:
```
ruqu-core
|
+----------+----------+
| |
[native target] [wasm32-unknown-unknown]
| |
- Full SIMD (AVX2, - WASM SIMD128
AVX-512, NEON) - 4GB address limit
- Rayon threading - Optional SharedArrayBuffer
- Optional GPU (wgpu) - No GPU
- 30+ qubits - 25 qubit ceiling
- Full OS integration - Sandboxed
```
Conditional compilation via Cargo feature flags controls target-specific code
paths. The public API surface is identical across targets.
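A minimal illustration of one API with per-target limits via `cfg` gating; the constants mirror the diagram and are illustrative, not actual crate code:

```rust
/// Default qubit ceiling per target; same public function, different body.
#[cfg(target_arch = "wasm32")]
fn default_max_qubits() -> usize {
    25 // bounded by the 32-bit WASM address space
}

#[cfg(not(target_arch = "wasm32"))]
fn default_max_qubits() -> usize {
    30 // RAM-dependent in practice; 30 qubits needs ~17 GB at Complex<f64>
}
```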
### 6. Optional Tensor Network Mode
For circuits with limited entanglement (e.g., shallow QAOA, certain VQE
ansatze), the engine offers an optional tensor network backend:
- Represents the quantum state as a network of tensors rather than a single
exponential vector
- Memory scales as O(n * chi^2) where chi is the bond dimension (maximum
entanglement width)
- Efficient for circuits where entanglement grows slowly or remains bounded
- Falls back to full state-vector when bond dimension exceeds threshold
- Enabled via the `tensor-network` feature flag
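The payoff of the O(n * chi^2) scaling is easy to quantify. A sketch comparing amplitude counts, assuming an MPS-style network with physical dimension 2 (helper names ours):

```rust
/// Amplitudes stored by a full state vector: 2^n.
fn statevector_amplitudes(n: u32) -> u64 {
    1u64 << n
}

/// Amplitudes stored by a matrix-product-state-style tensor network:
/// n tensors of shape (chi, 2, chi) -> n * 2 * chi^2 (boundary tensors
/// are smaller, so this is an upper bound).
fn tensor_network_amplitudes(n: u64, chi: u64) -> u64 {
    n * 2 * chi * chi
}
```

At 25 qubits with bond dimension 64, the tensor network holds ~205 K amplitudes versus ~33.6 M for the full vector; once chi must grow toward 2^(n/2), the advantage vanishes and the engine falls back to the state vector.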
## Alternatives Considered
### Alternative 1: Qukit (Rust, WASM-ready)
A pre-1.0 Rust quantum simulator with WASM support.
| Criterion | Assessment |
|-----------|------------|
| Maturity | Pre-1.0, limited community |
| WASM support | Present but untested at scale |
| Optimization | Basic; no SIMD, no gate fusion |
| Integration | Would require adapter layer |
| Maintenance | External dependency risk |
**Rejected**: Insufficient optimization depth and maturity for production use.
### Alternative 2: QuantRS2 (Rust, Python-focused)
A Rust quantum simulator primarily targeting Python bindings via PyO3.
| Criterion | Assessment |
|-----------|------------|
| Performance | Good benchmarks on native |
| WASM support | Not a design target |
| Dependencies | Heavy; Python-oriented build |
| API design | Python-first, Rust API secondary |
| Integration | Significant impedance mismatch |
**Rejected**: Python-centric design creates unnecessary weight and integration
friction for a Rust-native edge system.
### Alternative 3: roqoqo + QuEST (Rust frontend, C backend)
roqoqo provides a Rust circuit description layer; QuEST is a high-performance
C/C++ state-vector simulator.
| Criterion | Assessment |
|-----------|------------|
| Performance | Excellent (QuEST is highly optimized) |
| WASM support | QuEST's C code breaks WASM compilation |
| Maintenance | External C library maintenance burden |
| Memory safety | C backend outside Rust safety guarantees |
**Rejected**: C dependency is incompatible with WASM target requirement.
### Alternative 4: Quant-Iron (Rust + OpenCL)
A Rust simulator leveraging OpenCL for GPU acceleration.
| Criterion | Assessment |
|-----------|------------|
| Performance | Excellent on GPU-equipped hardware |
| WASM support | OpenCL incompatible with WASM |
| Edge deployment | Most edge nodes lack discrete GPUs |
| Complexity | OpenCL runtime adds operational burden |
**Rejected**: OpenCL dependency incompatible with WASM and edge deployment model.
### Alternative 5: No Simulator (Cloud Quantum APIs)
Delegate all quantum computation to cloud-based quantum simulators or hardware.
| Criterion | Assessment |
|-----------|------------|
| Performance | Network-bound latency |
| Offline support | None; requires connectivity |
| Cost | Per-execution charges |
| Privacy | Circuit data sent to third party |
| Edge philosophy | Violates offline-first design |
**Rejected**: Fundamentally incompatible with ruVector's offline-first edge
computing philosophy.
## Consequences
### Positive
- **Full control**: Complete ownership of the simulation pipeline, enabling
deep integration with ruVector's math, SIMD, and memory subsystems
- **WASM portable**: Single codebase compiles to any WASM runtime, enabling
browser-based quantum experimentation
- **No external dependencies**: Eliminates supply chain risk from C/C++ or
Python library dependencies
- **Edge-aligned**: Event-driven activation model matches Cognitum's power
architecture
- **Extensible**: Gate set, noise models, and backends can evolve independently
### Negative
- **Development effort**: Building a competitive quantum simulator from scratch
requires significant engineering investment
- **Maintenance burden**: Team must benchmark, optimize, and maintain the
simulation engine alongside the rest of ruVector
- **Classical simulation limits**: Exponential scaling is a fundamental physics
constraint; the engine cannot exceed ~30 qubits on practical hardware
### Risks and Mitigations
| Risk | Likelihood | Impact | Mitigation |
|------|-----------|--------|------------|
| Performance below competitors | Medium | High | Benchmark-driven development against QuantRS2/Qukit |
| Floating-point accuracy drift | Low | Medium | Comprehensive numerical tests, optional f64 enforcement |
| WASM memory exhaustion | Medium | Medium | Hard qubit limit with clear error messages (ADR-QE-003) |
| Scope creep into hardware simulation | Low | Low | Strict scope: gate-model only, no analog/pulse simulation |
## References
- [ADR-005: WASM Runtime Integration](/docs/adr/ADR-005-wasm-runtime-integration.md)
- [ADR-003: SIMD Optimization Strategy](/docs/adr/ADR-003-simd-optimization-strategy.md)
- [ADR-006: Memory Management](/docs/adr/ADR-006-memory-management.md)
- [ADR-014: Coherence Engine](/docs/adr/ADR-014-coherence-engine.md)
- [ADR-QE-002: Crate Structure & Integration](./ADR-QE-002-crate-structure-integration.md)
- [ADR-QE-003: WASM Compilation Strategy](./ADR-QE-003-wasm-compilation-strategy.md)
- [ADR-QE-004: Performance Optimization & Benchmarks](./ADR-QE-004-performance-optimization-benchmarks.md)
- Nielsen & Chuang, "Quantum Computation and Quantum Information" (2010)
- Aaronson & Gottesman, "Improved simulation of stabilizer circuits" (2004)

# ADR-QE-002: Crate Structure & ruVector Integration
**Status**: Proposed
**Date**: 2026-02-06
**Authors**: ruv.io, RuVector Team
**Deciders**: Architecture Review Board
## Context
### Problem Statement
The quantum engine must fit within the ruVector workspace, which currently
comprises 73+ crates following a consistent modular architecture. The existing
`ruQu` crate handles classical coherence monitoring -- specifically min-cut
analysis and MWPM (Minimum Weight Perfect Matching) decoding for error
correction analysis. The new quantum simulation capability requires clear
separation from this classical functionality while integrating deeply with
ruVector's shared infrastructure.
### Existing Workspace Patterns
The ruVector workspace follows established conventions that the quantum engine
must respect:
```
ruvector/
crates/
ruvector-math/ # SIMD-optimized linear algebra
ruvector-hnsw/ # Vector similarity search
ruvector-metrics/ # Observability and telemetry
ruvector-router-wasm/ # WASM bindings for routing
ruQu/ # Classical coherence (min-cut, MWPM)
...73+ crates
Cargo.toml # Workspace root
```
Key conventions observed:
- **`no_std` + `alloc`** for maximum portability
- **Feature flags** for optional capabilities (parallel, gpu, etc.)
- **Separate WASM crates** for browser-facing bindings (e.g., `ruvector-router-wasm`)
- **Metrics integration** via `ruvector-metrics` for observability
- **SIMD reuse** via `ruvector-math` for hot-path computations
### Integration Points
The quantum engine must interact with several existing subsystems:
```
+-------------------+
| Agent Framework |
+--------+----------+
|
trigger circuit execution
|
+--------v----------+
| ruqu-core |
| (quantum sim) |
+---+------+--------+
| |
+----------+ +----------+
| |
+--------v--------+ +-----------v---------+
| ruvector-math | | ruvector-metrics |
| (SIMD, linalg) | | (telemetry) |
+-----------------+ +---------------------+
|
+--------v--------+
| ruQu (existing) |
| (min-cut, MWPM) |
+-----------------+
```
## Decision
Adopt a **three-crate architecture** for the quantum engine, each with a
clearly defined responsibility boundary.
### Crate 1: `ruqu-core` -- Pure Rust Simulation Library
The core simulation engine, containing all quantum computation logic.
**Responsibilities**:
- `QuantumCircuit`: Circuit representation and manipulation
- `QuantumState`: State-vector storage and operations
- `Gate` enum: Full gate set (Pauli, Hadamard, CNOT, Toffoli, parametric rotations, etc.)
- Measurement operations (computational basis, Pauli basis, mid-circuit)
- Circuit optimization passes (gate fusion, cancellation)
- Noise model application (optional)
- Entanglement tracking for state splitting
**Design constraints**:
- `#![no_std]` with `alloc` for embedded/WASM portability
- Zero required external dependencies beyond `alloc`
- All platform-specific code behind feature flags
**Feature flags**:
| Flag | Default | Description |
|------|---------|-------------|
| `std` | off | Enable std library features (file I/O, advanced error types) |
| `parallel` | off | Enable Rayon-based multi-threaded gate application |
| `gpu` | off | Enable wgpu-based GPU acceleration for large states |
| `tensor-network` | off | Enable tensor network backend for shallow circuits |
| `noise-model` | off | Enable depolarizing, amplitude damping, and custom noise channels |
| `f32` | off | Use f32 precision instead of f64 (halves memory, reduces accuracy) |
| `serde` | off | Enable serialization of circuits and states |
**Module structure**:
```
ruqu-core/
src/
lib.rs # Crate root, feature flag gating
state.rs # QuantumState: amplitude storage, initialization
circuit.rs # QuantumCircuit: gate sequence, metadata
gates/
mod.rs # Gate enum and dispatch
single.rs # Single-qubit gates (H, X, Y, Z, S, T, Rx, Ry, Rz, U3)
two.rs # Two-qubit gates (CNOT, CZ, SWAP, Rxx, Ryy, Rzz)
multi.rs # Multi-qubit gates (Toffoli, Fredkin, custom unitaries)
parametric.rs # Parameterized gate support for variational algorithms
execution/
mod.rs # Execution engine dispatch
statevector.rs # Full state-vector simulation engine
tensor.rs # Tensor network backend (feature-gated)
noise.rs # Noise channel application (feature-gated)
measurement.rs # Measurement: sampling, expectation values
optimize/
mod.rs # Circuit optimization pipeline
fusion.rs # Gate fusion pass
cancel.rs # Gate cancellation (HH=I, XX=I, etc.)
commute.rs # Commutation-based reordering
entanglement.rs # Entanglement tracking and state splitting
types.rs # Complex number types, precision configuration
error.rs # Error types (QubitOverflow, InvalidGate, etc.)
Cargo.toml
benches/
statevector.rs # Criterion benchmarks for core operations
```
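The cancellation pass listed under `optimize/` (HH = I, XX = I) can be sketched as a single peephole scan over the gate list; the `Gate` enum here is a stand-in, not the crate's:

```rust
#[derive(Clone, Copy, PartialEq, Debug)]
enum Gate {
    H(usize), // Hadamard on a qubit
    X(usize), // Pauli-X on a qubit
    T(usize), // T gate (not self-inverse; never cancelled here)
}

/// Remove adjacent self-inverse pairs on the same qubit (HH = I, XX = I).
/// Popping from `out` lets cancellations cascade, e.g. H X X H -> H H -> I.
fn cancel_pass(gates: &[Gate]) -> Vec<Gate> {
    let mut out: Vec<Gate> = Vec::new();
    for &g in gates {
        match (out.last(), g) {
            (Some(&Gate::H(a)), Gate::H(b)) if a == b => { out.pop(); }
            (Some(&Gate::X(a)), Gate::X(b)) if a == b => { out.pop(); }
            _ => out.push(g),
        }
    }
    out
}
```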
**Public API surface**:
```rust
// Core types
pub struct QuantumState { /* ... */ }
pub struct QuantumCircuit { /* ... */ }
pub enum Gate { H, X, Y, Z, S, T, CNOT, CZ, Rx(f64), Ry(f64), Rz(f64), /* ... */ }
// Circuit construction
impl QuantumCircuit {
pub fn new(num_qubits: usize) -> Result<Self, QubitOverflow>;
pub fn gate(&mut self, gate: Gate, targets: &[usize]) -> &mut Self;
pub fn measure(&mut self, qubit: usize) -> &mut Self;
pub fn measure_all(&mut self) -> &mut Self;
pub fn barrier(&mut self) -> &mut Self;
pub fn depth(&self) -> usize;
pub fn gate_count(&self) -> usize;
pub fn optimize(&mut self) -> &mut Self;
}
// Execution
impl QuantumState {
pub fn new(num_qubits: usize) -> Result<Self, QubitOverflow>;
pub fn execute(&mut self, circuit: &QuantumCircuit) -> ExecutionResult;
pub fn sample(&self, shots: usize) -> Vec<BitString>;
pub fn expectation(&self, observable: &Observable) -> f64;
pub fn probabilities(&self) -> Vec<f64>;
pub fn amplitude(&self, basis_state: usize) -> Complex<f64>;
}
```
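Behind `probabilities()` and `sample()` sits the Born rule. A standalone sketch using plain tuples in place of `Complex<f64>` (function names ours):

```rust
/// Born rule: the probability of basis state i is |amplitude_i|^2.
fn born_probabilities(amps: &[(f64, f64)]) -> Vec<f64> {
    amps.iter().map(|(re, im)| re * re + im * im).collect()
}

/// Map a uniform draw u in [0, 1) to a basis state via the cumulative
/// distribution; `sample(shots)` would repeat this with fresh draws.
fn sample_one(probs: &[f64], u: f64) -> usize {
    let mut acc = 0.0;
    for (i, &p) in probs.iter().enumerate() {
        acc += p;
        if u < acc {
            return i;
        }
    }
    probs.len() - 1 // guard against rounding when u is close to 1
}
```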
### Crate 2: `ruqu-wasm` -- WebAssembly Bindings
WASM-specific bindings exposing the quantum engine to JavaScript environments.
**Responsibilities**:
- wasm-bindgen annotated wrapper types
- JavaScript-friendly API (string-based circuit construction, JSON results)
- Memory limit enforcement (reject circuits exceeding WASM address space)
- Optional multi-threading via wasm-bindgen-rayon
**Design constraints**:
- Mirrors the `ruvector-router-wasm` crate pattern
- Thin wrapper; all logic delegated to `ruqu-core`
- TypeScript type definitions auto-generated
**Module structure**:
```
ruqu-wasm/
src/
lib.rs # wasm-bindgen entry points
circuit.rs # JS-facing QuantumCircuit wrapper
state.rs # JS-facing QuantumState wrapper
types.rs # JS-compatible type conversions
limits.rs # WASM memory limit checks
Cargo.toml
pkg/ # wasm-pack output (generated)
tests/
web.rs # wasm-bindgen-test browser tests
```
**JavaScript API**:
```javascript
import { QuantumCircuit, QuantumState } from 'ruqu-wasm';
// Construct circuit
const circuit = new QuantumCircuit(4);
circuit.h(0);
circuit.cnot(0, 1);
circuit.cnot(1, 2);
circuit.cnot(2, 3);
circuit.measureAll();
// Execute
const state = new QuantumState(4);
const result = state.execute(circuit);
// Sample measurement outcomes
const counts = state.sample(1024);
console.log(counts); // approximately { "0000": 512, "1111": 512 }
// Get probabilities
const probs = state.probabilities();
```
**Memory limit enforcement**:
```rust
const WASM_MAX_QUBITS: usize = 25;
const WASM_MAX_STATE_BYTES: usize = 1 << 30; // 1 GB
pub fn check_wasm_limits(num_qubits: usize) -> Result<(), WasmLimitError> {
if num_qubits > WASM_MAX_QUBITS {
return Err(WasmLimitError::QubitOverflow {
requested: num_qubits,
maximum: WASM_MAX_QUBITS,
estimated_bytes: 16 * (1usize << num_qubits),
});
}
Ok(())
}
```
### Crate 3: `ruqu-algorithms` -- High-Level Algorithm Implementations
Quantum algorithm implementations built on top of `ruqu-core`.
**Responsibilities**:
- VQE (Variational Quantum Eigensolver) with classical optimizer integration
- Grover's search with oracle construction helpers
- QAOA (Quantum Approximate Optimization Algorithm)
- Quantum error correction (surface codes, stabilizer codes)
- Hamiltonian simulation primitives (Trotterization)
**Module structure**:
```
ruqu-algorithms/
src/
lib.rs
vqe/
mod.rs # VQE orchestration
ansatz.rs # Parameterized ansatz circuits (UCCSD, HEA)
hamiltonian.rs # Hamiltonian representation and decomposition
optimizer.rs # Classical optimizer trait + implementations
grover/
mod.rs # Grover's algorithm orchestration
oracle.rs # Oracle construction utilities
diffusion.rs # Diffusion operator
qaoa/
mod.rs # QAOA orchestration
mixer.rs # Mixer Hamiltonian circuits
cost.rs # Cost function encoding
qec/
mod.rs # QEC framework
surface.rs # Surface code implementation
stabilizer.rs # Stabilizer formalism
decoder.rs # Bridge to ruQu's MWPM decoder
trotter.rs # Trotterization for Hamiltonian simulation
utils.rs # Shared utilities (state preparation, etc.)
Cargo.toml
```
**VQE example**:
```rust
use ruqu_algorithms::vqe::{Hamiltonian, HardwareEfficientAnsatz, NelderMead, VqeSolver};
let hamiltonian = Hamiltonian::from_pauli_sum(&[
(0.5, "ZZ", &[0, 1]),
(0.3, "X", &[0]),
(0.3, "X", &[1]),
]);
let ansatz = HardwareEfficientAnsatz::new(2, 3); // 2 qubits, depth 3
let solver = VqeSolver::new(hamiltonian, ansatz)
.optimizer(NelderMead::default())
.max_iterations(200)
.convergence_threshold(1e-6);
let result = solver.solve();
println!("Ground state energy: {:.6}", result.energy);
```
### Integration Points
#### Agent Activation
Quantum circuits are triggered via the ruVector agent context system. An agent
can invoke simulation through graph query extensions:
```
Agent Query: "Simulate VQE for H2 molecule at bond length 0.74 A"
|
v
Agent Framework --> ruqu-algorithms::vqe::VqeSolver
| |
| +--> ruqu-core (multiple circuit executions)
| |
|<-- VqeResult ------+
|
v
Agent Response: { energy: -1.137, parameters: [...], iterations: 47 }
```
#### Memory Gating
Following ruVector's memory discipline (ADR-006):
- State vectors allocated exclusively within `QuantumState::new()` scope
- All amplitudes dropped when `QuantumState` goes out of scope
- No lazy or cached allocations persist between simulations
- Peak memory tracked and reported via `ruvector-metrics`
#### Observability
Every simulation reports metrics through the existing `ruvector-metrics` pipeline:
| Metric | Type | Description |
|--------|------|-------------|
| `ruqu.simulation.qubits` | Gauge | Number of qubits in current simulation |
| `ruqu.simulation.gates` | Counter | Total gates applied |
| `ruqu.simulation.depth` | Gauge | Circuit depth after optimization |
| `ruqu.simulation.duration_ns` | Histogram | Wall-clock simulation time |
| `ruqu.simulation.peak_memory_bytes` | Gauge | Peak memory during simulation |
| `ruqu.optimization.gates_eliminated` | Counter | Gates removed by optimization passes |
| `ruqu.measurement.shots` | Counter | Total measurement shots taken |
#### Coherence Bridge
The existing `ruQu` crate's min-cut analysis and MWPM decoders remain in place
and become accessible from `ruqu-algorithms` for quantum error correction:
```
ruqu-algorithms::qec::surface
|
+-- build syndrome graph
|
+-- invoke ruQu::mwpm::decode(syndrome)
|
+-- apply corrections to ruqu-core::QuantumState
```
This avoids duplicating decoding logic and leverages the existing, tested
classical infrastructure.
#### Math Reuse
`ruqu-core` depends on `ruvector-math` for SIMD-optimized operations:
- Complex number arithmetic (add, multiply, conjugate) using SIMD lanes
- Aligned memory allocation for state vectors
- Batch operations on amplitude arrays
- Norm calculation for state normalization
```rust
// In ruqu-core, the SIMD fast path for gate application builds on these
// ruvector-math helpers; the scalar reference kernel is shown below.
use ruvector_math::simd::{complex_mul_f64x4, complex_add_f64x4};
fn apply_single_qubit_gate(
state: &mut [Complex<f64>],
target: usize,
matrix: [[Complex<f64>; 2]; 2],
) {
let step = 1 << target;
for block in (0..state.len()).step_by(2 * step) {
for i in block..block + step {
let (a, b) = (state[i], state[i + step]);
state[i] = matrix[0][0] * a + matrix[0][1] * b;
state[i + step] = matrix[1][0] * a + matrix[1][1] * b;
}
}
}
```
### Dependency Graph
```
ruqu-algorithms
|
+---> ruqu-core
| |
| +---> ruvector-math (SIMD utilities)
| +---> ruvector-metrics (optional, behind "metrics" feature)
|
+---> ruQu (existing, for MWPM decoders in QEC)
ruqu-wasm
|
+---> ruqu-core
+---> wasm-bindgen
+---> wasm-bindgen-rayon (optional, behind "threads" feature)
```
### Workspace Cargo.toml Additions
```toml
[workspace]
members = [
# ... existing 73+ crates ...
"crates/ruqu-core",
"crates/ruqu-wasm",
"crates/ruqu-algorithms",
]
```
## Consequences
### Positive
- **Clean separation of concerns**: Each crate has a single, well-defined
responsibility -- simulation, WASM bindings, and algorithms respectively
- **Independent testing**: Each crate can be tested in isolation with its own
benchmark suite
- **Minimal WASM surface**: `ruqu-wasm` remains a thin wrapper, keeping the
compiled `.wasm` module small
- **Reuse of infrastructure**: SIMD, metrics, and classical decoders are shared,
not duplicated
- **Follows workspace conventions**: Same patterns as existing crates, reducing
onboarding friction for contributors
### Negative
- **Three crates to maintain**: Each requires its own CI, documentation, and
version management
- **Cross-crate API stabilization**: Changes to `ruqu-core`'s public API affect
both `ruqu-wasm` and `ruqu-algorithms`
- **Feature flag combinatorics**: Multiple feature flags across three crates
create a testing matrix that must be validated
### Risks and Mitigations
| Risk | Mitigation |
|------|------------|
| API churn in ruqu-core destabilizing dependents | Semver discipline; stabilize core types before 1.0 |
| Feature flag combinations causing compilation failures | CI matrix testing all supported flag combinations |
| Coherence bridge creating tight coupling with ruQu | Trait-based decoder interface; ruQu dependency optional |
| WASM crate size exceeding 2MB target | Regular binary size audits; aggressive dead code elimination |
## References
- [ADR-QE-001: Quantum Engine Core Architecture](./ADR-QE-001-quantum-engine-core-architecture.md)
- [ADR-QE-003: WASM Compilation Strategy](./ADR-QE-003-wasm-compilation-strategy.md)
- [ADR-QE-004: Performance Optimization & Benchmarks](./ADR-QE-004-performance-optimization-benchmarks.md)
- [Workspace Cargo.toml](/Cargo.toml)
- [ruvector-router-wasm pattern](/crates/ruvector-router-wasm/)
- [ruQu crate](/crates/ruQu/)
- [ruvector-math crate](/crates/ruvector-math/)
- [ruvector-metrics crate](/crates/ruvector-metrics/)

# ADR-QE-003: WebAssembly Compilation Strategy
**Status**: Proposed
**Date**: 2026-02-06
**Authors**: ruv.io, RuVector Team
**Deciders**: Architecture Review Board
## Context
### Problem Statement
ruVector targets browsers, embedded/edge runtimes, and IoT devices via
WebAssembly. The quantum simulation engine must compile to
`wasm32-unknown-unknown` and run correctly in these constrained environments.
WASM introduces fundamental constraints that differ significantly from native
execution and must be addressed at the architectural level rather than
worked around at runtime.
### WASM Execution Environment Constraints
| Constraint | Detail | Impact on Quantum Simulation |
|------------|--------|------------------------------|
| 32-bit address space | ~4 GB theoretical max, ~2 GB practical | Hard ceiling on state vector size |
| Memory model | Linear memory, grows in 64 KB pages | Allocation must be page-aware |
| No native threads | Web Workers required for parallelism | Requires SharedArrayBuffer + COOP/COEP headers |
| No direct GPU | WebGPU is separate API, not WASM-native | GPU acceleration unavailable in WASM path |
| No OS syscalls | Sandboxed execution, no file/network | All I/O must go through host bindings |
| JIT compilation | V8/SpiderMonkey JIT, not AOT | ~1.5-3x slower than native, variable warmup |
| SIMD support | 128-bit SIMD proposal (widely supported since 2021) | 4 f32 or 2 f64 per vector lane |
| Stack size | Default ~1 MB, configurable | Deep recursion limited |
### Memory Budget Analysis for Quantum Simulation
The critical constraint is WASM's 32-bit address space. With a practical
usable limit of approximately 2 GB (due to browser memory allocation
behavior and address space fragmentation), the maximum feasible state vector
size is bounded:
```
Available WASM Memory Budget:
Total addressable: 4,294,967,296 bytes (4 GB theoretical)
Practical usable: ~2,147,483,648 bytes (2 GB, browser-dependent)
WASM overhead: ~100,000,000 bytes (module, stack, heap metadata)
Application overhead: ~50,000,000 bytes (circuit data, scratch buffers)
-------------------------------------------------
Available for state: ~2,000,000,000 bytes (1.86 GB)
State vector sizes:
24 qubits: 268,435,456 bytes (256 MB) -- comfortable
25 qubits: 536,870,912 bytes (512 MB) -- feasible
25 + scratch: ~1,073,741,824 bytes -- tight but within budget
26 qubits: 1,073,741,824 bytes (1 GB) -- state alone, no scratch room
27 qubits: 2,147,483,648 bytes (2 GB) -- exceeds practical limit
```
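The budget arithmetic above can be captured as a predicate; the constants mirror the figures in this section, and the names are ours:

```rust
/// Practical WASM memory figures from the budget above (bytes).
const PRACTICAL_LIMIT: u64 = 2_147_483_648; // ~2 GB usable
const RUNTIME_OVERHEAD: u64 = 150_000_000; // WASM module/stack/heap + app overhead

/// Peak bytes for n qubits at Complex<f64>: state (16 * 2^n) plus an
/// equally sized scratch buffer.
fn peak_bytes(n: u32) -> u64 {
    2 * (16u64 << n)
}

/// Does an n-qubit simulation, including scratch, fit the practical budget?
fn fits_wasm_budget(n: u32) -> bool {
    peak_bytes(n) <= PRACTICAL_LIMIT - RUNTIME_OVERHEAD
}
```

25 qubits peaks at ~1.07 GB and fits; 26 already needs ~2.15 GB with scratch and does not, matching the "no scratch room" row above.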
### Existing WASM Patterns in ruVector
The `ruvector-router-wasm` crate establishes conventions for WASM compilation:
- `wasm-pack build` as the compilation tool
- `wasm-bindgen` for JavaScript interop
- TypeScript definition generation
- Feature-flag controlled inclusion/exclusion of capabilities
- Dedicated test suites using `wasm-bindgen-test`
## Decision
### 1. Target and Toolchain
**Target triple**: `wasm32-unknown-unknown`
**Build toolchain**: `wasm-pack` with `wasm-bindgen`
```bash
# Development build
wasm-pack build crates/ruqu-wasm --target web --dev
# Release build with size optimization
wasm-pack build crates/ruqu-wasm --target web --release
# Node.js target (for server-side WASM)
wasm-pack build crates/ruqu-wasm --target nodejs --release
```
**Cargo profile for WASM release**:
```toml
[profile.wasm-release]
inherits = "release"
opt-level = "z" # Optimize for binary size
lto = true # Link-time optimization
codegen-units = 1 # Single codegen unit for maximum optimization
strip = true # Strip debug symbols
panic = "abort" # Smaller panic handling
```
### 2. Memory Limit Enforcement
`ruqu-wasm` enforces qubit limits before any allocation occurs. This is a hard
gate, not a soft warning.
**Enforcement strategy**:
```
User requests N qubits
|
v
[N <= 25?] ---NO---> Return WasmLimitError {
| requested: N,
YES maximum: 25,
| estimated_memory: 16 * 2^N,
v suggestion: "Use native build for >25 qubits"
[Estimate total }
memory needed]
|
v
[< 1.5 GB?] ---NO---> Return WasmLimitError::InsufficientMemory
|
YES
|
v
Proceed with allocation
```
**Qubit limits by precision**:
| Precision | Max Qubits (WASM) | State Size | With Scratch |
|-----------|--------------------|------------|--------------|
| Complex f64 (default) | 25 | 512 MB | ~1.07 GB |
| Complex f32 (optional) | 26 | 512 MB | ~1.07 GB |
**Error reporting**:
```rust
#[wasm_bindgen(getter_with_clone)] // String field requires clone-on-get
#[derive(Debug)]
pub struct WasmLimitError {
pub requested_qubits: usize,
pub maximum_qubits: usize,
pub estimated_bytes: usize,
pub message: String,
}
impl WasmLimitError {
pub fn qubit_overflow(requested: usize) -> Self {
let max = if cfg!(feature = "f32") { 26 } else { 25 };
let bytes_per_amplitude = if cfg!(feature = "f32") { 8 } else { 16 };
Self {
requested_qubits: requested,
maximum_qubits: max,
estimated_bytes: bytes_per_amplitude * (1usize << requested),
message: format!(
"Cannot simulate {} qubits in WASM: requires {} bytes, \
exceeds WASM address space. Maximum: {} qubits. \
Use native build for larger simulations.",
requested,
bytes_per_amplitude * (1usize << requested),
max
),
}
}
}
```
### 3. Threading Strategy
WASM multi-threading requires SharedArrayBuffer, which in turn requires
specific HTTP security headers (Cross-Origin-Opener-Policy and
Cross-Origin-Embedder-Policy). Not all deployment environments support these.
**Strategy**: Optional multi-threading with graceful fallback.
```
ruqu-wasm execution
|
v
[SharedArrayBuffer
available?]
/ \
YES NO
/ \
[wasm-bindgen-rayon] [single-threaded
parallel execution] execution]
| |
Split state vector Sequential gate
across Web Workers application
| |
v v
Fast (N cores) Slower (1 core)
```
**Compile-time configuration**:
```toml
# In ruqu-wasm/Cargo.toml
[features]
default = []
threads = ["wasm-bindgen-rayon", "ruqu-core/parallel"]
```
**Runtime detection**:
```rust
#[wasm_bindgen]
pub fn threading_available() -> bool {
// Check if SharedArrayBuffer is available in this environment
js_sys::eval("typeof SharedArrayBuffer !== 'undefined'")
.ok()
.and_then(|v| v.as_bool())
.unwrap_or(false)
}
```
**Required HTTP headers for threading**:
```
Cross-Origin-Opener-Policy: same-origin
Cross-Origin-Embedder-Policy: require-corp
```
### 4. SIMD Utilization
The WASM SIMD proposal (128-bit vectors) is widely supported in modern browsers
and runtimes. The quantum engine uses SIMD for amplitude manipulation when
available.
**WASM SIMD capabilities**:
| Operation | WASM SIMD Instruction | Use in Quantum Sim |
|-----------|-----------------------|--------------------|
| f64x2 multiply | `f64x2.mul` | Complex multiplication (real part) |
| f64x2 add | `f64x2.add` | Amplitude accumulation |
| f64x2 sub | `f64x2.sub` | Complex multiplication (cross terms) |
| lane shuffle | `i8x16.shuffle` | Swapping real/imaginary parts |
| f32x4 multiply | `f32x4.mul` | f32 mode complex multiply |
| f32x4 fma | emulated | Fused multiply-add for accuracy |
**Conditional compilation**:
```rust
// In ruqu-core, WASM SIMD path
#[cfg(all(target_arch = "wasm32", target_feature = "simd128"))]
mod wasm_simd {
    use core::arch::wasm32::*;

    /// Complex multiply u * a, with a packed as [re, im] in f64x2 lanes.
    #[inline(always)]
    fn cmul(u_re: f64, u_im: f64, a: v128) -> v128 {
        let t0 = f64x2_mul(f64x2_splat(u_re), a);     // [u_re*a_re, u_re*a_im]
        let swap = i64x2_shuffle::<1, 0>(a, a);       // [a_im, a_re]
        let t1 = f64x2_mul(f64x2_splat(u_im), swap);  // [u_im*a_im, u_im*a_re]
        // Real lane subtracts the cross term; imaginary lane adds it.
        f64x2_add(t0, f64x2_mul(t1, f64x2(-1.0, 1.0)))
    }

    /// Apply 2x2 unitary to a pair of amplitudes using WASM SIMD.
    /// Computes c0 = u00*a + u01*b and c1 = u10*a + u11*b.
    #[inline(always)]
    pub fn apply_gate_2x2_simd(
        a_re: f64, a_im: f64,
        b_re: f64, b_im: f64,
        u00_re: f64, u00_im: f64,
        u01_re: f64, u01_im: f64,
        u10_re: f64, u10_im: f64,
        u11_re: f64, u11_im: f64,
    ) -> (f64, f64, f64, f64) {
        let a = f64x2(a_re, a_im);
        let b = f64x2(b_re, b_im);
        let c0 = f64x2_add(cmul(u00_re, u00_im, a), cmul(u01_re, u01_im, b));
        let c1 = f64x2_add(cmul(u10_re, u10_im, a), cmul(u11_re, u11_im, b));
        (
            f64x2_extract_lane::<0>(c0), f64x2_extract_lane::<1>(c0),
            f64x2_extract_lane::<0>(c1), f64x2_extract_lane::<1>(c1),
        )
    }
}

// Fallback scalar path
#[cfg(not(all(target_arch = "wasm32", target_feature = "simd128")))]
mod scalar {
    // Pure scalar complex arithmetic
}
```
**Comparison of SIMD widths across targets**:
```
Native (AVX-512): 512-bit = 8 f64 = 4 complex f64 per instruction
Native (AVX2): 256-bit = 4 f64 = 2 complex f64 per instruction
Native (NEON): 128-bit = 2 f64 = 1 complex f64 per instruction
WASM SIMD: 128-bit = 2 f64 = 1 complex f64 per instruction
```
WASM SIMD matches ARM NEON width but is slower due to JIT overhead. The engine
uses the same algorithmic structure as the NEON path, adapted for WASM SIMD
intrinsics.
### 5. No GPU in WASM
GPU acceleration is exclusively available in native builds. The WASM path
uses CPU-only simulation.
**Rationale**:
- WebGPU is a separate browser API; GPU buffers cannot alias WASM linear
  memory, so state data would have to be copied across the boundary
- Bridging WASM to WebGPU would require complex JavaScript glue code
- WebGPU compute shader support varies across browsers
- The performance benefit is uncertain for the 25-qubit WASM ceiling
**Future consideration**: If WebGPU stabilizes and WASM-WebGPU interop matures,
a `ruqu-webgpu` crate could provide browser-side GPU acceleration. This is out
of scope for the initial release.
### 6. API Parity
`ruqu-wasm` exposes an API that is functionally identical to `ruqu-core` native.
The same circuit description produces the same measurement results (within
floating-point tolerance). Only performance and capacity differ.
**Parity guarantee**:
```
Same Circuit
|
+------------+------------+
| |
ruqu-core (native) ruqu-wasm (browser)
| |
- 30+ qubits - 25 qubits max
- AVX2/AVX-512 SIMD - WASM SIMD128
- Rayon threading - Optional Web Workers
- Optional GPU - CPU only
- ~17.5M gates/sec - ~5-12M gates/sec
| |
+------------+------------+
|
Same Results
(within fp tolerance)
```
**Verified by**: Shared test suite that runs against both native and WASM targets,
comparing outputs bitwise (for deterministic operations) or statistically (for
measurement sampling).
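A minimal sketch of the tolerance comparison such a suite might use (the helper name and tolerance handling are illustrative, not the actual harness):

```rust
/// Compare two state vectors, stored as (re, im) pairs, lane by lane
/// within an absolute tolerance.
fn states_match(native: &[(f64, f64)], wasm: &[(f64, f64)], tol: f64) -> bool {
    native.len() == wasm.len()
        && native
            .iter()
            .zip(wasm)
            .all(|(a, b)| (a.0 - b.0).abs() <= tol && (a.1 - b.1).abs() <= tol)
}
```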
### 7. Module Size Target
Target `.wasm` binary size: **< 2 MB** for the default feature set.
**Size budget**:
| Component | Estimated Size |
|-----------|---------------|
| Core simulation engine | ~800 KB |
| Gate implementations | ~200 KB |
| Measurement and sampling | ~100 KB |
| wasm-bindgen glue | ~50 KB |
| Circuit optimization | ~150 KB |
| Error handling and validation | ~50 KB |
| **Total (default features)** | **~1.35 MB** |
| + noise-model feature | +200 KB |
| + tensor-network feature | +400 KB |
| **Total (all features)** | **~1.95 MB** |
**Size reduction techniques**:
- `opt-level = "z"` for size-optimized compilation
- LTO (Link-Time Optimization) for dead code elimination
- `wasm-opt` post-processing pass (binaryen)
- Feature flags to exclude unused capabilities
- `panic = "abort"` to eliminate unwinding machinery
- Avoid `format!` and `std::fmt` where possible in hot paths
**Build pipeline**:
```bash
# Build with wasm-pack
wasm-pack build crates/ruqu-wasm --target web --release
# Post-process with wasm-opt for additional size reduction
wasm-opt -Oz --enable-simd \
crates/ruqu-wasm/pkg/ruqu_wasm_bg.wasm \
-o crates/ruqu-wasm/pkg/ruqu_wasm_bg.wasm
# Verify size
ls -lh crates/ruqu-wasm/pkg/ruqu_wasm_bg.wasm
# Expected: < 2 MB
```
### 8. Future: wasm64 (Memory64 Proposal)
The WebAssembly Memory64 proposal extends the address space to 64 bits,
removing the 4 GB limitation. When this proposal reaches broad runtime support:
- Recompile `ruqu-wasm` targeting `wasm64-unknown-unknown`
- Lift the 25-qubit ceiling to match native limits
- Maintain backward compatibility with wasm32 via conditional compilation
**Current status**: Memory64 has reached Phase 5 (standardized) of the WASM
proposal process. Browser support is emerging but not yet universal.
**Migration path**:
```toml
# Future Cargo.toml
[features]
wasm64 = [] # Enable when targeting wasm64
# In code
#[cfg(feature = "wasm64")]
const MAX_QUBITS_WASM: usize = 30;
#[cfg(not(feature = "wasm64"))]
const MAX_QUBITS_WASM: usize = 25;
```
## Trade-offs Accepted
| Trade-off | Accepted Limitation | Justification |
|-----------|---------------------|---------------|
| Performance | ~1.5-3x slower than native | Universal deployment outweighs raw speed |
| Qubit ceiling | 25 qubits in WASM vs 30+ native | Sufficient for most educational and research workloads |
| Threading | Requires specific browser headers | Graceful fallback ensures always-works baseline |
| No GPU | CPU-only in browser | GPU simulation at 25 qubits shows minimal benefit |
| Binary size | ~1.35 MB module | Acceptable for a quantum simulation library |
## Consequences
### Positive
- **Universal deployment**: Any modern browser or WASM runtime can execute
quantum simulations without installation
- **Security sandboxing**: WASM's memory isolation prevents quantum simulation
code from accessing host resources
- **Edge-aligned**: Matches ruVector's philosophy of computation at the edge
- **Testable**: WASM builds can be tested in CI via headless browsers and
wasm-bindgen-test
- **Progressive enhancement**: Single-threaded baseline with optional threading
ensures broad compatibility
### Negative
- **Performance ceiling**: JIT overhead and narrower SIMD limit throughput
- **Memory limits**: 25-qubit hard ceiling until wasm64 adoption
- **Threading complexity**: SharedArrayBuffer requirement adds deployment
configuration burden
- **Debugging difficulty**: WASM debugging tools are less mature than native
debuggers
### Mitigations
| Issue | Mitigation |
|-------|------------|
| Performance gap | Document native vs WASM trade-offs; recommend native for >20 qubits |
| Memory exhaustion | Hard limit enforcement with informative error messages |
| Threading failures | Automatic fallback to single-threaded; no silent degradation |
| Debug difficulty | Source maps via wasm-pack; comprehensive logging to console |
| Binary size creep | CI size gate: fail build if .wasm exceeds 2 MB |
## References
- [ADR-QE-001: Quantum Engine Core Architecture](./ADR-QE-001-quantum-engine-core-architecture.md)
- [ADR-QE-002: Crate Structure & Integration](./ADR-QE-002-crate-structure-integration.md)
- [ADR-QE-004: Performance Optimization & Benchmarks](./ADR-QE-004-performance-optimization-benchmarks.md)
- [ADR-005: WASM Runtime Integration](/docs/adr/ADR-005-wasm-runtime-integration.md)
- [ruvector-router-wasm crate](/crates/ruvector-router-wasm/)
- [WebAssembly SIMD Proposal](https://github.com/WebAssembly/simd)
- [WebAssembly Memory64 Proposal](https://github.com/WebAssembly/memory64)
- [wasm-bindgen-rayon](https://github.com/RReverser/wasm-bindgen-rayon)
- [Cross-Origin Isolation Guide (MDN)](https://developer.mozilla.org/en-US/docs/Web/API/crossOriginIsolated)

# ADR-QE-004: Performance Optimization & Benchmarks
**Status**: Proposed
**Date**: 2026-02-06
**Authors**: ruv.io, RuVector Team
**Deciders**: Architecture Review Board
## Context
### Problem Statement
Quantum state-vector simulation is computationally expensive. Every gate
application touches the full amplitude vector of 2^n complex numbers, making
gate application O(2^n) per gate for n qubits. For the quantum engine to be
practical on edge devices and in browser environments, it must achieve
competitive performance: millions of gates per second for small circuits,
interactive latency for 10-20 qubit workloads, and the ability to handle
moderately deep circuits (thousands of gates) without unacceptable delays.
### Computational Cost Model
For a circuit with n qubits, g gates, and s measurement shots:
```
Total operations (approximate):
Single-qubit gate: 2^n complex multiplications + 2^n complex additions
Two-qubit gate: 2^(n+1) complex multiplications + 2^(n+1) complex additions
Measurement (1 shot): 2^n probability calculations + sampling
Full circuit: sum_i(cost(gate_i)) + s * 2^n
Example: 20-qubit circuit, 500 gates, 1024 shots
Gate cost: 500 * 2^20 * ~4 FLOP = ~2.1 billion FLOP
Measure: 1024 * 2^20 * ~2 FLOP = ~2.1 billion FLOP
Total: ~4.2 billion FLOP
```
At 10 GFLOP/s (realistic single-core throughput), this is ~420 ms. With SIMD
and multi-threading, we target 10-50x improvement.
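The cost model above fits in a few lines. The FLOP factors are the approximate constants used in the worked example, not measured values:

```rust
/// Approximate FLOP count: ~4 FLOP per amplitude per gate,
/// ~2 FLOP per amplitude per measurement shot.
fn estimated_flops(n_qubits: u32, gates: u64, shots: u64) -> u64 {
    let amps = 1u64 << n_qubits; // 2^n amplitudes
    gates * amps * 4 + shots * amps * 2
}
```

`estimated_flops(20, 500, 1024)` reproduces the ~4.2 billion FLOP figure from the example.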
### Performance Baseline from Comparable Systems
| Simulator | Language | 20-qubit H gate | Notes |
|-----------|----------|-----------------|-------|
| Qiskit Aer | C++/Python | ~50 ns | Heavily optimized, OpenMP |
| Cirq | Python/C++ | ~200 ns | Google, less optimized |
| QuantRS2 | Rust | ~57 ns | Rust-native, AVX2 |
| QuEST | C | ~40 ns | GPU-capable, highly tuned |
| Target (ruQu) | Rust | < 60 ns | Competitive with QuantRS2 |
These benchmarks measure per-gate time on a single-qubit Hadamard applied to
a 20-qubit state vector. Our target is to match or beat QuantRS2, the closest
comparable pure-Rust implementation.
## Decision
Implement a **multi-layered optimization strategy** with six complementary
techniques, each addressing a different performance bottleneck.
### Layer 1: SIMD Operations
Use `ruvector-math` SIMD utilities to vectorize amplitude manipulation.
Gate application fundamentally involves applying a 2x2 or 4x4 unitary matrix
to pairs/quadruples of complex amplitudes. SIMD processes multiple amplitude
components simultaneously.
**Native SIMD dispatch**:
```
Architecture Instruction Set Complex f64 per Cycle
----------- --------------- ---------------------
x86_64 AVX-512 4 (512-bit / 128-bit per complex)
x86_64 AVX2 2 (256-bit / 128-bit per complex)
ARM64 NEON 1 (128-bit / 128-bit per complex)
WASM SIMD128 1 (128-bit / 128-bit per complex)
Fallback Scalar 1 (sequential)
```
**Single-qubit gate application with AVX2**:
```
For each pair of amplitudes (a[i], a[i + 2^target]):
  Load: [a_re, a_im, b_re, b_im] = load_f64x4([a[i].re, a[i].im, a[i+step].re, a[i+step].im])
Compute c0 = u00 * a + u01 * b:
mul_re = u00_re * a_re - u00_im * a_im + u01_re * b_re - u01_im * b_im
mul_im = u00_re * a_im + u00_im * a_re + u01_re * b_im + u01_im * b_re
Compute c1 = u10 * a + u11 * b:
(analogous)
Store: [c0.re, c0.im, c1.re, c1.im]
```
With AVX2 (256-bit), we process 2 complex f64 values per instruction,
yielding a theoretical 2x speedup over scalar. With AVX-512, this doubles to
4x. Practical speedup is 1.5-3.5x due to instruction latency and memory
bandwidth.
**Target per-gate throughput**:
| Qubits | Amplitudes | AVX2 (est.) | AVX-512 (est.) | WASM SIMD (est.) |
|--------|------------|-------------|----------------|-------------------|
| 10 | 1,024 | ~15 ns | ~10 ns | ~30 ns |
| 15 | 32,768 | ~1 us | ~0.5 us | ~2 us |
| 20 | 1,048,576 | ~50 us | ~25 us | ~100 us |
| 25 | 33,554,432 | ~1.5 ms | ~0.8 ms | ~3 ms |
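As a scalar reference for the kernel sketched above (a minimal sketch, not the ruqu-core implementation; amplitudes stored as (re, im) pairs):

```rust
/// Apply a 2x2 unitary `u` to the target qubit of a state vector.
pub fn apply_single_qubit_gate(
    state: &mut [(f64, f64)],
    target: usize,
    u: [[(f64, f64); 2]; 2], // row-major complex 2x2 matrix
) {
    fn cmul(a: (f64, f64), b: (f64, f64)) -> (f64, f64) {
        (a.0 * b.0 - a.1 * b.1, a.0 * b.1 + a.1 * b.0)
    }
    fn cadd(a: (f64, f64), b: (f64, f64)) -> (f64, f64) {
        (a.0 + b.0, a.1 + b.1)
    }
    let step = 1 << target; // stride between paired amplitudes
    let mut i = 0;
    while i < state.len() {
        for j in i..i + step {
            let (a, b) = (state[j], state[j + step]);
            state[j] = cadd(cmul(u[0][0], a), cmul(u[0][1], b));        // c0 = u00*a + u01*b
            state[j + step] = cadd(cmul(u[1][0], a), cmul(u[1][1], b)); // c1 = u10*a + u11*b
        }
        i += 2 * step;
    }
}
```

The SIMD paths keep exactly this pair structure and vectorize the complex multiply-accumulate.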
### Layer 2: Multithreading
Rayon-based data parallelism splits the state vector across CPU cores for
gate application. Each thread processes an independent contiguous block of
amplitudes.
**Parallelization strategy**:
```
State vector: [amp_0, amp_1, ..., amp_{2^n - 1}]
Thread 0: [amp_0 ... amp_{2^n/T - 1}]
Thread 1: [amp_{2^n/T} ... amp_{2*2^n/T - 1}]
...
Thread T-1:[amp_{(T-1)*2^n/T} ... amp_{2^n - 1}]
Where T = number of threads (Rayon work-stealing pool)
```
**Gate application requires care with target qubit position**:
- If `target < log2(chunk_size)`: each chunk contains complete amplitude pairs.
Threads are fully independent. No synchronization needed.
- If `target >= log2(chunk_size)`: amplitude pairs span chunk boundaries.
Must adjust chunk boundaries to align with gate structure.
**Expected scaling**:
```
Qubits Amps 1 thread 8 threads Speedup
------ ---- -------- --------- -------
15 32K 1 us ~200 ns ~5x
20 1M 50 us ~8 us ~6x
22 4M 200 us ~30 us ~6.5x
24 16M 800 us ~120 us ~6.7x
25 32M 1.5 ms ~220 us ~6.8x
```
Speedup plateaus below linear (8x for 8 threads) due to memory bandwidth
saturation. At 24+ qubits, the state vector exceeds L3 cache and performance
becomes memory-bound.
**Parallelism threshold**: Do not parallelize below 14 qubits (16K amplitudes).
The overhead of Rayon's work-stealing exceeds the benefit for small states.
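The block split can be illustrated with `std::thread::scope` standing in for Rayon's work-stealing pool. This is illustrative only: the per-amplitude work here is a trivial global phase, and a real gate kernel would also need the chunk-boundary alignment described above:

```rust
use std::thread;

/// Split the state vector into contiguous blocks, one scoped thread per block.
pub fn apply_phase_parallel(state: &mut [(f64, f64)], n_threads: usize) {
    let chunk = ((state.len() + n_threads - 1) / n_threads).max(1);
    thread::scope(|s| {
        for block in state.chunks_mut(chunk) {
            s.spawn(move || {
                for amp in block.iter_mut() {
                    amp.0 = -amp.0; // apply a global phase of -1
                    amp.1 = -amp.1;
                }
            });
        }
    });
}
```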
### Layer 3: Gate Fusion
Preprocess circuits to combine consecutive gates into single matrix
operations, reducing the number of state vector passes.
**Fusion rules**:
```
Rule 1: Consecutive single-qubit gates on the same qubit
Rz(a) -> Rx(b) -> Rz(c) ==> U3(a, b, c) [single matrix multiply]
Rule 2: Consecutive two-qubit gates on the same pair
CNOT(0,1) -> CZ(0,1) ==> Fused_2Q(0,1) [4x4 matrix]
Rule 3: Single-qubit gate followed by controlled gate
H(0) -> CNOT(0,1) ==> Fused operation (absorb H into CNOT matrix)
Rule 4: Identity cancellation
H -> H ==> Identity (remove both)
X -> X ==> Identity
S -> S_dag ==> Identity
CNOT -> CNOT (same control/target) ==> Identity
```
**Fusion effectiveness by algorithm**:
| Algorithm | Typical Fusion Ratio | Gate Reduction |
|-----------|----------------------|----------------|
| VQE (UCCSD ansatz) | 1.8-2.5x | 30-50% fewer state passes |
| Grover's | 1.2-1.5x | 15-25% |
| QAOA | 1.5-2.0x | 25-40% |
| QFT | 2.0-3.0x | 40-60% |
| Random circuit | 1.1-1.3x | 5-15% |
**Implementation**:
```rust
pub struct FusionPass;
impl CircuitOptimizer for FusionPass {
fn optimize(&self, circuit: &mut QuantumCircuit) {
let mut i = 0;
        // i + 1 < len avoids underflow when the circuit is empty
        while i + 1 < circuit.gates.len() {
let current = &circuit.gates[i];
let next = &circuit.gates[i + 1];
if can_fuse(current, next) {
let fused = compute_fused_matrix(current, next);
circuit.gates[i] = fused;
circuit.gates.remove(i + 1);
// Don't advance i; check if we can fuse again
} else {
i += 1;
}
}
}
}
```
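For Rule 1, the `compute_fused_matrix` step reduces to a 2x2 complex matrix product, U_fused = U_next * U_current. A minimal sketch (the type alias and helper names are illustrative):

```rust
type C = (f64, f64); // complex number as (re, im)

fn cmul(a: C, b: C) -> C {
    (a.0 * b.0 - a.1 * b.1, a.0 * b.1 + a.1 * b.0)
}
fn cadd(a: C, b: C) -> C {
    (a.0 + b.0, a.1 + b.1)
}

/// Fuse two single-qubit gates: the later gate multiplies from the left.
pub fn fuse_2x2(next: [[C; 2]; 2], current: [[C; 2]; 2]) -> [[C; 2]; 2] {
    let mut out = [[(0.0, 0.0); 2]; 2];
    for i in 0..2 {
        for j in 0..2 {
            out[i][j] = cadd(
                cmul(next[i][0], current[0][j]),
                cmul(next[i][1], current[1][j]),
            );
        }
    }
    out
}
```

Rule 4 falls out numerically: fusing H with H yields the identity matrix up to floating-point error, at which point the pass can drop both gates.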
### Layer 4: Entanglement-Aware Splitting
Track which qubits have interacted via entangling gates. Simulate independent
qubit subsets as separate, smaller state vectors. Merge subsets when an
entangling gate connects them.
**Concept**:
```
Circuit: q0 --[H]--[CNOT(0,1)]--[Rz]--
q1 --[H]--[CNOT(0,1)]--[Ry]--
q2 --[H]--[X]---------[Rz]---[CNOT(2,0)]--
q3 --[H]--[Y]---------[Rx]--
Initially: {q0}, {q1}, {q2}, {q3} -- four 2^1 vectors (2 amps each)
After CNOT(0,1): {q0,q1}, {q2}, {q3} -- one 2^2 + two 2^1 vectors
After CNOT(2,0): {q0,q1,q2}, {q3} -- one 2^3 + one 2^1 vector
Memory: 8 + 2 = 10 amplitudes vs 2^4 = 16 amplitudes (full)
```
**Savings scale dramatically for circuits with late entanglement**:
```
Scenario: 20-qubit circuit, first 100 gates are local, then entangling
Without splitting: 2^20 = 1M amplitudes from gate 1
With splitting: 20 * 2^1 = 40 amplitudes until first entangling gate
Progressively merge as entanglement grows
```
**Data structure**:
```rust
pub struct SplitState {
/// Each subset: (qubit indices, state vector)
subsets: Vec<(Vec<usize>, QuantumState)>,
/// Union-Find structure for tracking connectivity
connectivity: UnionFind,
}
impl SplitState {
pub fn apply_gate(&mut self, gate: &Gate, targets: &[usize]) {
if gate.is_entangling() {
// Merge subsets containing target qubits
let merged = self.merge_subsets(targets);
// Apply gate to merged state
merged.apply_gate(gate, targets);
} else {
// Apply to the subset containing the target qubit
let subset = self.find_subset(targets[0]);
subset.apply_gate(gate, targets);
}
}
}
```
**When splitting helps vs. hurts**:
| Circuit Type | Splitting Benefit |
|-------------|-------------------|
| Shallow QAOA (p=1-3) | High (qubits entangle gradually) |
| VQE with local ansatz | High (many local rotations) |
| Grover's (full oracle) | Low (oracle entangles all qubits early) |
| QFT | Low (all-to-all entanglement) |
| Random circuits | Low (entangles quickly) |
The engine automatically disables splitting when all qubits are connected,
falling back to full state-vector simulation with zero overhead.
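The `UnionFind` connectivity tracker referenced in `SplitState` can be as small as this (a sketch with path compression only, no union by rank; the actual structure may differ):

```rust
pub struct UnionFind {
    parent: Vec<usize>,
}

impl UnionFind {
    pub fn new(n: usize) -> Self {
        Self { parent: (0..n).collect() }
    }

    /// Find the subset root for qubit `x`, compressing the path.
    pub fn find(&mut self, x: usize) -> usize {
        if self.parent[x] != x {
            let root = self.find(self.parent[x]);
            self.parent[x] = root;
        }
        self.parent[x]
    }

    /// Merge the subsets containing qubits `a` and `b` (entangling gate).
    pub fn union(&mut self, a: usize, b: usize) {
        let (ra, rb) = (self.find(a), self.find(b));
        if ra != rb {
            self.parent[ra] = rb;
        }
    }

    /// True once every qubit is in a single connected component,
    /// the trigger for disabling splitting entirely.
    pub fn fully_connected(&mut self) -> bool {
        let n = self.parent.len();
        let root = self.find(0);
        (1..n).all(|q| {
            let r = self.find(q);
            r == root
        })
    }
}
```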
### Layer 5: Cache-Local Processing
For large state vectors (>20 qubits), cache utilization becomes critical.
The state vector exceeds L2 cache (typically 256 KB - 1 MB) and potentially
L3 cache (8-32 MB).
**Cache analysis**:
```
Qubits  State Size  vs L2 (512 KB)  vs L3 (16 MB)
------  ----------  --------------  -------------
18      4 MB        8x oversize     fits in cache
20      16 MB       32x oversize    fits in cache
22      64 MB       128x oversize   4x oversize
24      256 MB      512x oversize   16x oversize
25      512 MB      1024x oversize  32x oversize
```
**Techniques**:
1. **Aligned allocation**: State vector aligned to cache line boundaries (64
bytes) for optimal prefetch behavior. Uses `ruvector-math` aligned allocator.
2. **Blocking/tiling**: For gates on high-index qubits, the stride between
amplitude pairs is large (2^target). Tiling the access pattern to process
cache-line-sized blocks sequentially improves spatial locality.
```
Without tiling (target qubit = 20):
Access pattern: amp[0], amp[1M], amp[1], amp[1M+1], ...
Cache misses: ~every access (stride = 16 MB)
With tiling (block size = L2/4):
Process block [0..64K], then [64K..128K], ...
Cache misses: ~1 per block (sequential within block)
```
3. **Prefetch hints**: Insert software prefetch instructions for the next block
of amplitudes while processing the current block.
```rust
// Prefetch next cache line while processing current
#[cfg(target_arch = "x86_64")]
unsafe {
core::arch::x86_64::_mm_prefetch(
state.as_ptr().add(i + CACHE_LINE_AMPS) as *const i8,
core::arch::x86_64::_MM_HINT_T0,
);
}
```
### Layer 6: Lazy Evaluation
Accumulate commuting rotations and defer their application until a
non-commuting gate appears. This reduces the number of full state-vector
passes for rotation-heavy circuits common in variational algorithms.
**Commutation rules**:
```
Rz(a) commutes with Rz(b) => Rz(a+b)
Rx(a) commutes with Rx(b) => Rx(a+b)
Rz commutes with CZ => Defer Rz
Diagonal gates commute => Combine phases
But:
Rz does NOT commute with H
Rx does NOT commute with CNOT (on target)
```
**Implementation sketch**:
```rust
pub struct LazyAccumulator {
/// Pending rotations per qubit: (axis, total_angle)
pending: HashMap<usize, Vec<(RotationAxis, f64)>>,
}
impl LazyAccumulator {
pub fn push_gate(&mut self, gate: &Gate, target: usize) -> Option<FlushedGate> {
if let Some(rotation) = gate.as_rotation() {
if let Some(existing) = self.pending.get_mut(&target) {
if existing.last().map_or(false, |(axis, _)| *axis == rotation.axis) {
// Same axis: accumulate angle
existing.last_mut().unwrap().1 += rotation.angle;
return None; // No gate emitted
}
}
self.pending.entry(target).or_default().push((rotation.axis, rotation.angle));
None
} else {
// Non-commuting gate: flush pending rotations for affected qubits
let flushed = self.flush(target);
Some(flushed)
}
}
}
```
**Effectiveness**: VQE circuits with alternating Rz-Rx-Rz layers see 20-40%
reduction in state-vector passes. QAOA circuits with repeated ZZ-rotation
layers see 15-30% reduction.
## Benchmark Targets
### Primary Benchmark Suite
| ID | Workload | Qubits | Gates | Target Time | Notes |
|----|----------|--------|-------|-------------|-------|
| B1 | Grover (8 qubits) | 8 | ~200 | < 1 ms | 3 Grover iterations |
| B2 | Grover (16 qubits) | 16 | ~3,000 | < 10 ms | ~64 iterations |
| B3 | VQE iteration (12 qubits) | 12 | ~120 | < 5 ms | Single parameter update |
| B4 | VQE iteration (20 qubits) | 20 | ~300 | < 50 ms | UCCSD ansatz |
| B5 | QAOA p=3 (10 nodes) | 10 | ~75 | < 1 ms | MaxCut on random graph |
| B6 | QAOA p=5 (20 nodes) | 20 | ~200 | < 200 ms | MaxCut on random graph |
| B7 | Surface code cycle (d=3) | 17 | ~20 | < 10 ms | Single syndrome round |
| B8 | 1000 surface code cycles | 17 | ~20,000 | < 2 s | Repeated error correction |
| B9 | QFT (20 qubits) | 20 | ~210 | < 30 ms | Full quantum Fourier transform |
| B10 | Random circuit (25 qubits) | 25 | 100 | < 10 s | Worst-case memory test |
### Micro-Benchmarks
Per-gate timing for individual operations:
| Gate | 10 qubits | 15 qubits | 20 qubits | 25 qubits |
|------|-----------|-----------|-----------|-----------|
| H | < 20 ns | < 0.5 us | < 50 us | < 1.5 ms |
| CNOT | < 30 ns | < 1 us | < 80 us | < 2.5 ms |
| Rz(theta) | < 15 ns | < 0.4 us | < 40 us | < 1.2 ms |
| Toffoli | < 50 ns | < 1.5 us | < 120 us | < 4 ms |
| Measure | < 10 ns | < 0.3 us | < 30 us | < 1 ms |
### WASM-Specific Benchmarks
| ID | Workload | Qubits | Target (WASM) | Target (Native) | Expected Ratio |
|----|----------|--------|---------------|-----------------|----------------|
| W1 | Grover (8) | 8 | < 3 ms | < 1 ms | ~3x |
| W2 | VQE iter (12) | 12 | < 12 ms | < 5 ms | ~2.5x |
| W3 | QAOA p=3 (10) | 10 | < 2.5 ms | < 1 ms | ~2.5x |
| W4 | Random (20) | 20 | < 500 ms | < 200 ms | ~2.5x |
| W5 | Random (25) | 25 | < 25 s | < 10 s | ~2.5x |
### Benchmark Infrastructure
Benchmarks use Criterion.rs for native and a custom timing harness for WASM:
```rust
// Native benchmarks (Criterion)
use criterion::{criterion_group, criterion_main, BenchmarkId, Criterion};
fn bench_grover_8(c: &mut Criterion) {
    let target_state = 42; // illustrative marked state for the Grover oracle
c.bench_function("grover_8_qubits", |b| {
b.iter(|| {
let mut state = QuantumState::new(8).unwrap();
let circuit = grover_circuit(8, &target_state);
state.execute(&circuit)
})
});
}
fn bench_single_gate_scaling(c: &mut Criterion) {
let mut group = c.benchmark_group("hadamard_scaling");
for n in [10, 12, 14, 16, 18, 20, 22, 24] {
group.bench_with_input(
BenchmarkId::from_parameter(n),
&n,
|b, &n| {
let mut state = QuantumState::new(n).unwrap();
let mut circuit = QuantumCircuit::new(n).unwrap();
circuit.gate(Gate::H, &[0]);
b.iter(|| state.execute(&circuit))
},
);
}
group.finish();
}
criterion_group!(benches, bench_grover_8, bench_single_gate_scaling);
criterion_main!(benches);
```
**WASM benchmark harness**:
```javascript
// Browser-based benchmark using performance.now()
async function benchmarkGrover8() {
const { QuantumCircuit, QuantumState } = await import('./ruqu_wasm.js');
const iterations = 100;
const start = performance.now();
for (let i = 0; i < iterations; i++) {
const circuit = QuantumCircuit.grover(8, 42);
const state = new QuantumState(8);
state.execute(circuit);
state.free();
circuit.free();
}
const elapsed = performance.now() - start;
console.log(`Grover 8-qubit: ${(elapsed / iterations).toFixed(3)} ms/iteration`);
}
```
### Performance Regression Detection
CI runs benchmark suite on every PR. Regressions exceeding 10% trigger a
warning; regressions exceeding 25% block the merge.
```yaml
# In CI pipeline
- name: Run benchmarks
run: |
cargo bench --package ruqu-core -- --save-baseline pr
cargo bench --package ruqu-core -- --baseline main --load-baseline pr
# critcmp compares and flags regressions
critcmp main pr --threshold 10
```
### Optimization Priority Matrix
Not all optimizations apply equally to all workloads. The priority matrix
guides implementation order:
| Optimization | Impact (small circuits) | Impact (large circuits) | Impl Effort | Priority |
|-------------|------------------------|------------------------|-------------|----------|
| SIMD | Medium (1.5-2x) | High (2-3.5x) | Medium | P0 |
| Multithreading | Low (overhead > benefit) | High (5-7x) | Medium | P1 |
| Gate fusion | High (30-50% fewer passes) | Medium (15-30%) | Low | P0 |
| Entanglement splitting | Variable (0-100x) | Low (quickly entangled) | High | P2 |
| Cache tiling | Low (fits in cache) | High (2-4x) | Medium | P1 |
| Lazy evaluation | Medium (20-40%) | Low (10-20%) | Low | P2 |
**Implementation order**: SIMD -> Gate Fusion -> Multithreading -> Cache Tiling
-> Lazy Evaluation -> Entanglement Splitting
## Consequences
### Positive
- **Competitive performance**: Multi-layered approach targets performance
parity with state-of-the-art Rust simulators (QuantRS2)
- **Interactive latency**: Most practical workloads (8-20 qubits) complete
in single-digit milliseconds, enabling real-time experimentation
- **Scalable**: Each optimization layer addresses a different bottleneck,
providing compounding benefits
- **Measurable**: Concrete benchmark targets enable objective progress tracking
and regression detection
### Negative
- **Optimization complexity**: Six optimization layers create significant
implementation and maintenance complexity
- **Ongoing tuning**: Performance characteristics vary across hardware;
benchmarks must cover representative platforms
- **Diminishing returns**: For >20 qubits, memory bandwidth dominates and
compute optimizations yield marginal gains
- **Testing burden**: Each optimization must be validated for numerical
correctness across all gate types
### Risks and Mitigations
| Risk | Likelihood | Impact | Mitigation |
|------|-----------|--------|------------|
| Memory bandwidth bottleneck at >20 qubits | High | Medium | Document expected scaling; recommend native for large circuits |
| Gate fusion introducing numerical error | Low | High | Comprehensive numerical tests comparing fused vs. unfused results |
| Entanglement tracking overhead exceeding savings | Medium | Low | Automatic disable when all qubits connected within first 10 gates |
| WASM SIMD not available in target runtime | Low | Medium | Graceful fallback to scalar; runtime feature detection |
| Benchmark targets too aggressive for edge hardware | Medium | Low | Separate targets for edge (Cognitum) vs. desktop; scale expectations |
## References
- [ADR-QE-001: Quantum Engine Core Architecture](./ADR-QE-001-quantum-engine-core-architecture.md)
- [ADR-QE-002: Crate Structure & Integration](./ADR-QE-002-crate-structure-integration.md)
- [ADR-QE-003: WASM Compilation Strategy](./ADR-QE-003-wasm-compilation-strategy.md)
- [ADR-003: SIMD Optimization Strategy](/docs/adr/ADR-003-simd-optimization-strategy.md)
- [ruvector-math crate](/crates/ruvector-math/)
- Guerreschi & Hogaboam, "Intel Quantum Simulator: A cloud-ready high-performance
simulator of quantum circuits" (2020)
- Jones et al., "QuEST and High Performance Simulation of Quantum Computers" (2019)
- QuantRS2 benchmark data (internal comparison)

# ADR-QE-005: Variational Quantum Eigensolver (VQE) Support
**Status**: Proposed
**Date**: 2026-02-06
**Authors**: ruv.io, RuVector Team
**Deciders**: Architecture Review Board
## Version History
| Version | Date | Author | Changes |
|---------|------|--------|---------|
| 0.1 | 2026-02-06 | ruv.io | Initial VQE architecture proposal |
---
## Context
### The Variational Quantum Eigensolver Problem
The Variational Quantum Eigensolver (VQE) is one of the most important near-term quantum
algorithms, with direct applications in computational chemistry, materials science, and
combinatorial optimization. VQE computes ground-state energies of molecular Hamiltonians
by variationally minimizing the expectation value of a Hamiltonian operator with respect
to a parameterized quantum state (ansatz).
### Why VQE Matters for ruQu
VQE sits at the intersection of quantum simulation and classical optimization, making it
a natural fit for ruQu's hybrid classical-quantum architecture:
1. **Chemistry applications**: Drug discovery, catalyst design, battery materials
2. **Optimization**: QUBO problems, portfolio optimization, logistics
3. **Benchmarking**: VQE circuits exercise the full gate set and serve as a representative
workload for evaluating simulator performance
4. **Agent integration**: ruVector agents can autonomously explore chemical configuration
spaces using VQE as the inner evaluation kernel
### Core Requirements
| Requirement | Description | Priority |
|-------------|-------------|----------|
| Parameterized circuits | Symbolic gate angles resolved at evaluation time | P0 |
| Hamiltonian decomposition | Represent H as sum of weighted Pauli strings | P0 |
| Exact expectation values | Direct state vector computation (no shot noise) | P0 |
| Gradient evaluation | Parameter-shift rule for classical optimizer | P0 |
| Shot-based sampling | Optional mode for hardware noise emulation | P1 |
| Classical optimizer interface | Trait-based abstraction for multiple optimizers | P1 |
| Hardware-efficient ansatz | Pre-built ansatz library for common topologies | P2 |
### Current Limitations
Without dedicated VQE support, users must manually:
- Construct parameterized circuits with explicit angle substitution per iteration
- Decompose Hamiltonians into individual Pauli measurements
- Implement gradient computation by duplicating circuit evaluations
- Wire up classical optimizers with no standard interface
This is error-prone and leaves significant performance on the table, since a state vector
simulator can compute exact expectation values in a single pass without sampling overhead.
---
## Decision
### 1. Parameterized Gate Architecture
Circuits accept symbolic parameters that are resolved to numeric values per evaluation.
This avoids circuit reconstruction on each VQE iteration.
```
┌────────────────────────────────────────────────┐
│  Parameterized Circuit                         │
│                                                │
│  |0> --[H]--[Ry(θ[0])]--[X]--[Rz(θ[2])]--      │
│                          │                     │
│  |0> --------------------●---[Ry(θ[1])]--      │
└────────────────────────────────────────────────┘

parameters: [θ[0], θ[1], θ[2]]
values:     [0.54, 1.23, -0.87]
```
**Data model**:
```rust
/// A symbolic parameter in a quantum circuit.
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
pub struct Parameter {
pub name: String,
pub index: usize,
}
/// A gate that may reference symbolic parameters.
pub enum ParameterizedGate {
/// Fixed gate (no parameters)
Fixed(Gate),
/// Rotation gate with a symbolic angle
Rx(ParameterExpr),
Ry(ParameterExpr),
Rz(ParameterExpr),
/// Parameterized two-qubit gate
Rzz(ParameterExpr, Qubit, Qubit),
}
/// Expression for a gate parameter (supports linear combinations).
pub enum ParameterExpr {
/// Direct parameter reference: θ[i]
Param(usize),
/// Scaled parameter: c * θ[i]
Scaled(f64, usize),
/// Sum of expressions
Sum(Box<ParameterExpr>, Box<ParameterExpr>),
/// Constant value
Constant(f64),
}
```
**Resolution**: When `evaluate(params: &[f64])` is called, each `ParameterExpr` is resolved
to a concrete `f64`, and the corresponding unitary matrix is computed. This happens once per
VQE iteration and is negligible compared to state vector manipulation.
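Resolution can be sketched as a recursive evaluation over `ParameterExpr`. The enum is repeated here so the sketch is self-contained, and the method name `resolve` is illustrative:

```rust
pub enum ParameterExpr {
    /// Direct parameter reference: θ[i]
    Param(usize),
    /// Scaled parameter: c * θ[i]
    Scaled(f64, usize),
    /// Sum of expressions
    Sum(Box<ParameterExpr>, Box<ParameterExpr>),
    /// Constant value
    Constant(f64),
}

impl ParameterExpr {
    /// Evaluate the expression against a concrete parameter vector.
    pub fn resolve(&self, params: &[f64]) -> f64 {
        match self {
            ParameterExpr::Param(i) => params[*i],
            ParameterExpr::Scaled(c, i) => c * params[*i],
            ParameterExpr::Sum(a, b) => a.resolve(params) + b.resolve(params),
            ParameterExpr::Constant(c) => *c,
        }
    }
}
```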
### 2. Hamiltonian Representation
The Hamiltonian is represented as a sum of weighted Pauli strings:
```
H = c_0 * I + c_1 * Z_0 + c_2 * Z_1 + c_3 * Z_0 Z_1 + c_4 * X_0 X_1 + ...
```
where each term is a tensor product of single-qubit Pauli operators {I, X, Y, Z}.
```rust
/// A single Pauli operator on one qubit.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Pauli {
    I,
    X,
    Y,
    Z,
}

/// A Pauli string: tensor product of single-qubit Paulis.
/// Stored as a compact bitfield for n-qubit systems.
///
/// Encoding: 2 bits per qubit (00=I, 01=X, 10=Y, 11=Z)
/// For n <= 32 qubits, fits in a single u64.
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
pub struct PauliString {
    /// Packed Pauli operators (2 bits each)
    pub ops: Vec<u64>,
    /// Number of qubits
    pub n_qubits: usize,
}

/// A Hamiltonian as a sum of weighted Pauli strings.
///
/// H = sum_j c_j P_j
pub struct PauliSum {
    /// Terms: (coefficient, Pauli string)
    pub terms: Vec<(Complex64, PauliString)>,
    /// Number of qubits
    pub n_qubits: usize,
}
```
**Optimization**: Identity terms (all-I Pauli strings) contribute a constant energy offset
and require no state vector computation. The implementation detects and separates these
before the expectation loop.
### 3. Direct Expectation Value Computation
This is the critical performance advantage of state vector simulation over real hardware.
On physical quantum computers, expectation values must be estimated via repeated
measurement (shot-based sampling), requiring O(1/epsilon^2) shots for epsilon precision.
In a state vector simulator, we compute the **exact** expectation value:
```
<psi| H |psi> = sum_j c_j * <psi| P_j |psi>
```
For each Pauli string P_j, the expectation value is:
```
<psi| P_j |psi> = sum_k psi_k* (P_j |psi>)_k
```
Since P_j is a tensor product of single-qubit Paulis, its action on a basis state |k> is:
- I: |k> -> |k>
- X: flips qubit, no phase
- Y: flips qubit, phase factor +/- i
- Z: no flip, phase factor +/- 1
This means each Pauli string maps each basis state to exactly one other basis state with
a phase factor. The expectation value reduces to a sum over 2^n amplitudes.
```rust
impl QuantumState {
    /// Compute the exact expectation value of a PauliSum.
    ///
    /// Complexity: O(T * 2^n) where T = number of Pauli terms, n = qubits.
    /// For a 12-qubit system with 100 Pauli terms:
    ///   100 * 4096 = 409,600 operations ~ 0.5ms
    pub fn expectation(&self, hamiltonian: &PauliSum) -> f64 {
        let mut total = 0.0_f64;
        for (coeff, pauli) in &hamiltonian.terms {
            let mut term_val = Complex64::zero();
            for k in 0..self.amplitudes.len() {
                // Compute P_j |k>: determine target index and phase
                let (target_idx, phase) = pauli.apply_to_basis(k);
                // <k| P_j |psi> = phase * psi[target_idx]
                // Accumulate psi[k]* * phase * psi[target_idx]
                term_val += self.amplitudes[k].conj()
                    * phase
                    * self.amplitudes[target_idx];
            }
            total += (coeff * term_val).re;
        }
        total
    }
}
```
**Function signature**: `QuantumState::expectation(&self, hamiltonian: &PauliSum) -> f64`
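The expectation loop relies on an `apply_to_basis` helper that this ADR assumes but does not define. A dependency-free sketch, operating on an unpacked `&[Pauli]` slice rather than the bitfield and tracking phases as `(re, im)` pairs in place of `Complex64`, could look like:

```rust
/// Sketch of the `apply_to_basis` helper assumed by `expectation`.
/// Returns (target index, phase) with P|k> = phase * |target>.
#[derive(Clone, Copy)]
pub enum Pauli { I, X, Y, Z }

pub fn apply_to_basis(paulis: &[Pauli], k: usize) -> (usize, (f64, f64)) {
    let mut target = k;
    let mut phase = (1.0_f64, 0.0_f64); // complex 1 as (re, im)
    for (q, p) in paulis.iter().enumerate() {
        let bit = (k >> q) & 1;
        match p {
            Pauli::I => {}
            Pauli::X => target ^= 1 << q, // flip, no phase
            Pauli::Y => {
                target ^= 1 << q;
                // Y|0> = i|1>, Y|1> = -i|0>
                let i = if bit == 0 { (0.0, 1.0) } else { (0.0, -1.0) };
                phase = (phase.0 * i.0 - phase.1 * i.1,
                         phase.0 * i.1 + phase.1 * i.0);
            }
            Pauli::Z => {
                // no flip, phase -1 if the bit is set
                if bit == 1 { phase = (-phase.0, -phase.1); }
            }
        }
    }
    (target, phase)
}

fn main() {
    // Z_0 on |1>: no flip, phase -1
    assert_eq!(apply_to_basis(&[Pauli::Z], 1), (1, (-1.0, 0.0)));
    // X_0 on |0>: flips to |1>, phase +1
    assert_eq!(apply_to_basis(&[Pauli::X], 0), (1, (1.0, 0.0)));
    println!("ok");
}
```

This makes explicit the claim in the text that each Pauli string maps a basis state to exactly one other basis state with a unit phase.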
#### Accuracy Advantage Over Sampling
| Method | Precision | Evaluations | 12-qubit Cost |
|--------|-----------|-------------|---------------|
| Shot-based (1000 shots) | ~3% | 1000 circuit runs per term | ~500ms |
| Shot-based (10000 shots) | ~1% | 10000 circuit runs per term | ~5s |
| Shot-based (1M shots) | ~0.1% | 1M circuit runs per term | ~500s |
| **Exact (state vector)** | **Machine epsilon** | **1 pass over state** | **~0.5ms** |
For VQE convergence, exact expectation values eliminate the statistical noise floor that
plagues hardware-based VQE. Classical optimizers receive clean gradients, leading to:
- Faster convergence (fewer iterations)
- No barren plateau artifacts from shot noise
- Deterministic reproducibility
### 4. Gradient Support via Parameter-Shift Rule
The parameter-shift rule provides exact analytic gradients for parameterized quantum gates.
For a gate with parameter theta:
```
d/d(theta) <H> = [<H>(theta + pi/2) - <H>(theta - pi/2)] / 2
```
This requires two circuit evaluations per parameter per gradient component.
```rust
/// Compute the gradient of the expectation value with respect to all parameters.
///
/// Uses the parameter-shift rule:
///   grad_i = [E(theta_i + pi/2) - E(theta_i - pi/2)] / 2
///
/// Complexity: O(2 * n_params * circuit_eval_cost)
/// For 12 qubits, 20 parameters, 100 Pauli terms:
///   2 * 20 * (circuit_sim + expectation) ~ 40 * 1ms = 40ms
pub fn gradient(
    circuit: &ParameterizedCircuit,
    hamiltonian: &PauliSum,
    params: &[f64],
) -> Vec<f64> {
    let n_params = params.len();
    let mut grad = vec![0.0; n_params];
    let shift = std::f64::consts::FRAC_PI_2; // pi/2

    for i in 0..n_params {
        // Forward shift
        let mut params_plus = params.to_vec();
        params_plus[i] += shift;
        let e_plus = evaluate_energy(circuit, hamiltonian, &params_plus);

        // Backward shift
        let mut params_minus = params.to_vec();
        params_minus[i] -= shift;
        let e_minus = evaluate_energy(circuit, hamiltonian, &params_minus);

        grad[i] = (e_plus - e_minus) / 2.0;
    }
    grad
}
```
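The rule can be sanity-checked numerically on a closed-form example without any simulator machinery. For |psi> = Ry(theta)|0> and H = Z, the energy is E(theta) = cos(theta), so the exact gradient is -sin(theta); the parameter-shift formula reproduces it to machine precision (the `energy` function below is just this closed form, not an engine API):

```rust
// Closed-form energy for the single-qubit example:
// <0| Ry(theta)^dag Z Ry(theta) |0> = cos(theta)
fn energy(theta: f64) -> f64 {
    theta.cos()
}

fn main() {
    let theta = 0.7_f64;
    let shift = std::f64::consts::FRAC_PI_2;
    // Parameter-shift rule as stated in this ADR
    let grad = (energy(theta + shift) - energy(theta - shift)) / 2.0;
    let exact = -theta.sin();
    assert!((grad - exact).abs() < 1e-12);
    println!("grad = {:.6}", grad);
}
```

Unlike finite differences, the result here is exact for any shift-rule-compatible gate, not an approximation controlled by a step size.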
### 5. Classical Optimizer Interface
A trait-based abstraction supports plugging in different classical optimizers without
changing the VQE loop:
```rust
/// Trait for classical optimizers used in the VQE outer loop.
pub trait ClassicalOptimizer: Send {
    /// Initialize the optimizer with the parameter count.
    fn initialize(&mut self, n_params: usize);

    /// Propose next parameter values given current energy and optional gradient.
    fn step(
        &mut self,
        params: &[f64],
        energy: f64,
        gradient: Option<&[f64]>,
    ) -> OptimizerResult;

    /// Whether this optimizer consumes gradients (checked by the VQE loop
    /// before paying for parameter-shift evaluation).
    fn needs_gradient(&self) -> bool;

    /// Check if the optimizer has converged.
    fn has_converged(&self) -> bool;

    /// Get optimizer name for logging.
    fn name(&self) -> &str;
}

/// Result of an optimizer step.
pub struct OptimizerResult {
    pub new_params: Vec<f64>,
    pub converged: bool,
    pub iteration: usize,
}
```
**Provided implementations**:
| Optimizer | Type | Gradient Required | Best For |
|-----------|------|-------------------|----------|
| `GradientDescent` | Gradient-based | Yes | Simple landscapes |
| `Adam` | Adaptive gradient | Yes | Noisy gradients, deep circuits |
| `LBFGS` | Quasi-Newton | Yes | Smooth landscapes, fast convergence |
| `COBYLA` | Derivative-free | No | Non-differentiable cost functions |
| `NelderMead` | Simplex | No | Low-dimensional problems |
| `SPSA` | Stochastic | No | Shot-based mode, noisy evaluations |
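A minimal sketch of what one of these implementations might look like, using plain gradient descent on a toy quadratic. For brevity the sketch takes the gradient directly rather than through the trait's `Option<&[f64]>` signature; field names and the gradient-norm convergence test are illustrative assumptions, not the shipped implementation.

```rust
/// Illustrative fixed-step gradient descent optimizer (sketch only).
pub struct OptimizerResult {
    pub new_params: Vec<f64>,
    pub converged: bool,
    pub iteration: usize,
}

pub struct GradientDescent {
    pub learning_rate: f64,
    pub grad_tol: f64,
    iteration: usize,
    converged: bool,
}

impl GradientDescent {
    pub fn new(learning_rate: f64, grad_tol: f64) -> Self {
        GradientDescent { learning_rate, grad_tol, iteration: 0, converged: false }
    }

    /// One descent step: theta <- theta - lr * grad.
    pub fn step(&mut self, params: &[f64], _energy: f64, gradient: &[f64]) -> OptimizerResult {
        let new_params: Vec<f64> = params.iter().zip(gradient)
            .map(|(p, g)| p - self.learning_rate * g)
            .collect();
        // Declare convergence when the gradient norm is negligible.
        let grad_norm = gradient.iter().map(|g| g * g).sum::<f64>().sqrt();
        self.converged = grad_norm < self.grad_tol;
        self.iteration += 1;
        OptimizerResult { new_params, converged: self.converged, iteration: self.iteration }
    }
}

fn main() {
    // Minimize f(x) = (x - 3)^2, whose gradient is 2(x - 3).
    let mut opt = GradientDescent::new(0.25, 1e-8);
    let mut x = vec![0.0_f64];
    for _ in 0..100 {
        let g = vec![2.0 * (x[0] - 3.0)];
        let r = opt.step(&x, (x[0] - 3.0).powi(2), &g);
        x = r.new_params;
        if r.converged { break; }
    }
    assert!((x[0] - 3.0).abs() < 1e-6);
    println!("x = {:.6}", x[0]);
}
```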
### 6. VQE Iteration Loop
The complete VQE algorithm proceeds as follows:
```
VQE Iteration Loop
==================
Input: Hamiltonian H (PauliSum), Ansatz A (ParameterizedCircuit),
Optimizer O (ClassicalOptimizer), initial params theta_0
Output: Minimum energy E_min, optimal params theta_opt
theta = theta_0
O.initialize(len(theta))
repeat:
┌─────────────────────────────────────────────┐
│ 1. PREPARE STATE │
│ |psi(theta)> = A(theta) |0...0> │
│ [Simulate parameterized circuit] │
│ Cost: O(G * 2^n) where G = gate count │
└─────────────────────────────────────────────┘
┌─────────────────────────────────────────────┐
│ 2. EVALUATE ENERGY │
│ E = <psi(theta)| H |psi(theta)> │
│ [Direct state vector expectation] │
│ Cost: O(T * 2^n) where T = Pauli terms │
└─────────────────────────────────────────────┘
┌─────────────────────────────────────────────┐
│ 3. COMPUTE GRADIENT (if optimizer needs it) │
│ grad = parameter_shift(A, H, theta) │
│ [2 * n_params circuit evaluations] │
│ Cost: O(2P * (G + T) * 2^n) │
└─────────────────────────────────────────────┘
┌─────────────────────────────────────────────┐
│ 4. CLASSICAL UPDATE │
│ theta_new = O.step(theta, E, grad) │
│ [Pure classical computation] │
│ Cost: O(P^2) for quasi-Newton │
└─────────────────────────────────────────────┘
┌─────────────────────────────────────────────┐
│ 5. CONVERGENCE CHECK │
│ if |E_new - E_old| < tol: STOP │
│ else: theta = theta_new, continue │
└─────────────────────────────────────────────┘
return (E_min, theta_opt)
```
**Pseudocode**:
```rust
pub fn vqe(
    ansatz: &ParameterizedCircuit,
    hamiltonian: &PauliSum,
    optimizer: &mut dyn ClassicalOptimizer,
    config: &VqeConfig,
) -> VqeResult {
    let n_params = ansatz.parameter_count();
    let mut params = config.initial_params.clone()
        .unwrap_or_else(|| vec![0.0; n_params]);
    optimizer.initialize(n_params);

    let mut best_energy = f64::INFINITY;
    let mut best_params = params.clone();
    let mut history = Vec::new();

    for iteration in 0..config.max_iterations {
        // Step 1+2: Simulate circuit and compute energy
        let state = ansatz.simulate(&params);
        let energy = state.expectation(hamiltonian);

        // Track best
        if energy < best_energy {
            best_energy = energy;
            best_params = params.clone();
        }

        // Step 3: Compute gradient if needed
        let grad = if optimizer.needs_gradient() {
            Some(gradient(ansatz, hamiltonian, &params))
        } else {
            None
        };

        history.push(VqeIteration { iteration, energy, params: params.clone() });

        // Step 4: Classical update
        let result = optimizer.step(&params, energy, grad.as_deref());
        params = result.new_params;

        // Step 5: Convergence check
        if result.converged || (iteration > 0 &&
            (history[iteration].energy - history[iteration - 1].energy).abs()
                < config.convergence_threshold) {
            break;
        }
    }

    VqeResult {
        energy: best_energy,
        optimal_params: best_params,
        iterations: history.len(),
        history,
        converged: optimizer.has_converged(),
    }
}
```
### 7. Optional Shot-Based Sampling Mode
For mimicking real hardware behavior and testing noise resilience:
```rust
/// Configuration for shot-based VQE mode.
pub struct ShotConfig {
    /// Number of measurement shots per expectation estimation
    pub shots: usize,
    /// Random seed for reproducibility
    pub seed: Option<u64>,
    /// Readout error rate (probability of bit flip on measurement)
    pub readout_error: f64,
}

impl QuantumState {
    /// Estimate expectation value via shot-based sampling.
    ///
    /// Samples the state `shots` times in the computational basis,
    /// then computes the empirical expectation of each Pauli term.
    /// Returns (mean, standard_error), where
    /// standard_error = std_dev / sqrt(shots).
    pub fn expectation_sampled(
        &self,
        hamiltonian: &PauliSum,
        config: &ShotConfig,
    ) -> (f64, f64) {
        todo!()
    }
}
```
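The sampling core of this mode can be sketched independently of the engine types. The inverse-CDF sampler below draws basis-state indices from a probability vector and estimates a <Z> expectation with its standard error; the tiny LCG stands in for whatever seedable PRNG the implementation actually uses (the Rust standard library ships none), so everything here is illustrative.

```rust
/// Deterministic 64-bit LCG, a stand-in for a real seedable PRNG.
struct Lcg(u64);

impl Lcg {
    /// Uniform f64 in [0, 1) from the high 53 bits of the state.
    fn next_f64(&mut self) -> f64 {
        self.0 = self.0
            .wrapping_mul(6364136223846793005)
            .wrapping_add(1442695040888963407);
        (self.0 >> 11) as f64 / (1u64 << 53) as f64
    }
}

/// Draw `shots` basis-state indices from probabilities p (must sum to 1)
/// via inverse-CDF sampling.
fn sample(p: &[f64], shots: usize, seed: u64) -> Vec<usize> {
    let mut rng = Lcg(seed);
    (0..shots).map(|_| {
        let r = rng.next_f64();
        let mut acc = 0.0;
        for (i, &pi) in p.iter().enumerate() {
            acc += pi;
            if r < acc { return i; }
        }
        p.len() - 1 // guard against rounding at the top of the CDF
    }).collect()
}

fn main() {
    // |psi> = (|0> + |1>)/sqrt(2): <Z> = 0, stderr ~ 1/sqrt(shots)
    let p = [0.5, 0.5];
    let shots = 10_000;
    let samples = sample(&p, shots, 42);
    // <Z> estimate: eigenvalue +1 for |0>, -1 for |1>
    let mean: f64 = samples.iter()
        .map(|&s| if s == 0 { 1.0 } else { -1.0 })
        .sum::<f64>() / shots as f64;
    let stderr = (1.0 - mean * mean).sqrt() / (shots as f64).sqrt();
    assert!(mean.abs() < 0.1); // well within noise of the exact value 0
    println!("mean = {:.4}, stderr = {:.4}", mean, stderr);
}
```

The ~1/sqrt(shots) standard error visible here is exactly the noise floor that the exact-expectation path of this ADR eliminates.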
### 8. Hardware-Efficient Ansatz Patterns
Pre-built ansatz constructors for common use cases:
```
Hardware-Efficient Ansatz (depth d, n qubits):
Layer 1..d:
┌─────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐
┤ Ry ├──┤ Rz ├──┤ CNOT ├──┤ Ry ├──
└─────┘ └──────────┘ │ ladder │ └──────────┘
┌─────┐ ┌──────────┐ │ │ ┌──────────┐
┤ Ry ├──┤ Rz ├──┤ ├──┤ Ry ├──
└─────┘ └──────────┘ └──────────┘ └──────────┘
Parameters per layer: 3n (Ry + Rz + Ry per qubit)
Total parameters: 3nd
```
```rust
/// Pre-built ansatz constructors.
pub mod ansatz {
    /// Hardware-efficient ansatz with Ry-Rz layers and linear CNOT entanglement.
    pub fn hardware_efficient(n_qubits: usize, depth: usize) -> ParameterizedCircuit;

    /// UCCSD (Unitary Coupled Cluster Singles and Doubles) for chemistry.
    /// Generates excitation operators based on active space.
    pub fn uccsd(n_electrons: usize, n_orbitals: usize) -> ParameterizedCircuit;

    /// Hamiltonian variational ansatz: layers of exp(-i * theta_j * P_j)
    /// for each term P_j in the Hamiltonian.
    pub fn hamiltonian_variational(
        hamiltonian: &PauliSum,
        depth: usize,
    ) -> ParameterizedCircuit;

    /// Symmetry-preserving ansatz that respects particle number conservation.
    pub fn symmetry_preserving(
        n_qubits: usize,
        n_particles: usize,
        depth: usize,
    ) -> ParameterizedCircuit;
}
```
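The layer structure above can be sketched as a flat gate list, which also verifies the quoted 3nd parameter count. The `GateSketch` type and the builder are illustrative local types, not the `ParameterizedCircuit` API.

```rust
/// Illustrative flat gate list for the hardware-efficient pattern.
#[derive(Debug)]
enum GateSketch {
    Ry(usize, usize),   // (qubit, parameter index)
    Rz(usize, usize),
    Cnot(usize, usize), // (control, target)
}

/// Build `depth` layers of Ry - Rz - CNOT ladder - Ry; returns the
/// gate list and the total parameter count.
fn hardware_efficient(n_qubits: usize, depth: usize) -> (Vec<GateSketch>, usize) {
    let mut gates = Vec::new();
    let mut p = 0;
    for _ in 0..depth {
        for q in 0..n_qubits { gates.push(GateSketch::Ry(q, p)); p += 1; }
        for q in 0..n_qubits { gates.push(GateSketch::Rz(q, p)); p += 1; }
        for q in 0..n_qubits - 1 { gates.push(GateSketch::Cnot(q, q + 1)); } // linear ladder
        for q in 0..n_qubits { gates.push(GateSketch::Ry(q, p)); p += 1; }
    }
    (gates, p)
}

fn main() {
    let (gates, n_params) = hardware_efficient(4, 2);
    assert_eq!(n_params, 3 * 4 * 2);          // 3nd parameters, as stated
    assert_eq!(gates.len(), (3 * 4 + 3) * 2); // 15 gates per layer
    println!("{} gates, {} params", gates.len(), n_params);
}
```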
### 9. Performance Analysis
#### 12-Qubit VQE Performance Estimate
| Component | Operations | Time |
|-----------|-----------|------|
| State vector size | 2^12 = 4,096 complex amplitudes | 64 KB |
| Circuit simulation (50 gates) | 50 * 4096 = 204,800 ops | ~0.3ms |
| Expectation (100 Pauli terms) | 100 * 4096 = 409,600 ops | ~0.5ms |
| Gradient (20 params) | 40 * (0.3 + 0.5) ms | ~32ms |
| Classical optimizer step | O(20^2) | ~0.001ms |
| **Total per iteration (with gradient)** | | **~33ms** |
| **Total per iteration (no gradient)** | | **~0.8ms** |
For gradient-free optimizers (COBYLA, Nelder-Mead), a 12-qubit VQE iteration completes
in under 1ms. With parameter-shift gradients, the cost scales linearly with parameter
count but remains under 50ms for typical chemistry ansatze.
**Scaling with qubit count**:
| Qubits | State Size | Memory | Energy Eval (100 terms) | Gradient (20 params) |
|--------|-----------|--------|------------------------|---------------------|
| 8 | 256 | 4 KB | ~0.03ms | ~2ms |
| 12 | 4,096 | 64 KB | ~0.5ms | ~33ms |
| 16 | 65,536 | 1 MB | ~8ms | ~500ms |
| 20 | 1,048,576 | 16 MB | ~130ms | ~8s |
| 24 | 16,777,216 | 256 MB | ~2s | ~130s |
| 28 | 268,435,456 | 4 GB | ~33s | ~35min |
### 10. Integration with ruVector Agent System
ruVector agents can drive autonomous chemistry optimization using VQE as the evaluation
kernel:
```
┌─────────────────────────────────────────────────────────────────┐
│ ruVector Agent Orchestration │
│ │
│ ┌──────────┐ ┌──────────────┐ ┌────────────────────┐ │
│ │ Research │───>│ Architecture │───>│ Chemistry Agent │ │
│ │ Agent │ │ Agent │ │ │ │
│ │ │ │ │ │ - Molecule spec │ │
│ │ Literature│ │ Hamiltonian │ │ - Basis set sel. │ │
│ │ search │ │ generation │ │ - Active space │ │
│ └──────────┘ └──────────────┘ │ - VQE execution │ │
│ │ - Result analysis │ │
│ └────────┬───────────┘ │
│ │ │
│ ┌────────▼───────────┐ │
│ │ ruQu VQE Engine │ │
│ │ │ │
│ │ Parameterized │ │
│ │ Circuit + PauliSum│ │
│ │ + Optimizer │ │
│ └────────────────────┘ │
└─────────────────────────────────────────────────────────────────┘
```
The agent workflow:
1. **Research agent** retrieves molecular structure and prior computational results
2. **Architecture agent** generates the qubit Hamiltonian (Jordan-Wigner or Bravyi-Kitaev
transformation from fermionic operators)
3. **Chemistry agent** selects ansatz, optimizer, and runs VQE iterations
4. **Results** are stored in ruVector memory for pattern learning across molecules
---
## Consequences
### Benefits
1. **Exact expectation values** eliminate sampling noise, enabling faster convergence and
deterministic reproducibility -- a major advantage over hardware VQE
2. **Symbolic parameterization** avoids circuit reconstruction overhead, reducing per-iteration
cost to pure state manipulation
3. **Trait-based optimizer interface** allows users to swap optimizers without touching VQE
logic, and supports custom optimizer implementations
4. **Hardware-efficient ansatz library** provides tested, production-quality circuit templates
for common use cases
5. **Gradient support** via parameter-shift rule enables modern gradient-based optimization
(Adam, L-BFGS) that converges significantly faster than derivative-free methods
6. **Agent integration** enables autonomous, memory-enhanced chemistry exploration that
learns from prior VQE runs across molecular configurations
### Risks
| Risk | Probability | Impact | Mitigation |
|------|------------|--------|------------|
| Exponential memory scaling limits qubit count | High | Medium | Tensor network backend for >30 qubits (future ADR) |
| Parameter-shift gradient cost scales with parameter count | Medium | Medium | Batched gradient evaluation, simultaneous perturbation (SPSA) fallback |
| Hamiltonian term count explosion for large molecules | Medium | High | Pauli grouping (qubit-wise commuting), measurement reduction techniques |
| Optimizer convergence to local minima | Medium | Medium | Multi-start strategies, QAOA-inspired initialization |
### Trade-offs
| Decision | Advantage | Disadvantage |
|----------|-----------|--------------|
| Exact expectation over sampling | Machine-precision accuracy | Not representative of real hardware noise |
| Parameter-shift over finite-difference | Exact gradients | 2x evaluations per parameter |
| Trait-based optimizer | Extensible | Slight abstraction overhead |
| Compact PauliString bitfield | Cache-friendly | Complex bit manipulation logic |
---
## References
- Peruzzo, A. et al. "A variational eigenvalue solver on a photonic quantum processor." Nature Communications 5, 4213 (2014)
- McClean, J.R. et al. "The theory of variational hybrid quantum-classical algorithms." New Journal of Physics 18, 023023 (2016)
- Kandala, A. et al. "Hardware-efficient variational quantum eigensolver for small molecules." Nature 549, 242-246 (2017)
- Schuld, M. et al. "Evaluating analytic gradients on quantum hardware." Physical Review A 99, 032331 (2019)
- ADR-001: ruQu Architecture - Classical Nervous System for Quantum Machines
- ADR-QE-001 through ADR-QE-004: Prior quantum engine architecture decisions
- ruQu crate: `crates/ruQu/src/` - existing syndrome processing and coherence gate infrastructure
- ruVector memory system: pattern storage for cross-molecule VQE learning

---
# ADR-QE-006: Grover's Search Algorithm Implementation
**Status**: Proposed
**Date**: 2026-02-06
**Authors**: ruv.io, RuVector Team
**Deciders**: Architecture Review Board
## Version History
| Version | Date | Author | Changes |
|---------|------|--------|---------|
| 0.1 | 2026-02-06 | ruv.io | Initial Grover's search architecture proposal |
---
## Context
### Unstructured Search and Quadratic Speedup
Grover's algorithm is one of the foundational quantum algorithms, providing a provable
quadratic speedup for unstructured search. Given a search space of N = 2^n items and an
oracle that marks one or more target items, Grover's algorithm finds a target in
O(sqrt(N)) oracle queries, compared to the classical O(N) lower bound.
### Building Blocks
The algorithm consists of two principal components applied repeatedly:
1. **Oracle (O)**: Flips the phase of marked (target) states
- On hardware: requires multi-controlled-Z decomposition into elementary gates
- In simulation: can be a single O(1) amplitude flip (key insight)
2. **Diffuser (D)**: Inversion about the mean amplitude (also called the Grover diffusion
operator)
- D = 2|s><s| - I, where |s> is the uniform superposition
- Implemented as: H^{otimes n} * (2|0><0| - I) * H^{otimes n}
### Why Simulation Unlocks a Unique Optimization
On real quantum hardware, the oracle must be decomposed into a circuit of elementary
gates. For a single marked state in n qubits, the oracle requires O(n) multi-controlled
gates, each of which may need further decomposition. The full gate count is O(n^2) or
worse depending on connectivity.
In a state vector simulator, we have **direct access to the amplitude array**. The oracle
for a known marked state at index t is simply:
```
amplitudes[t] *= -1
```
This is an O(1) operation, regardless of qubit count. This fundamentally changes the
performance profile of Grover simulation.
### Applications in ruVector
| Application | Description |
|-------------|-------------|
| Vector DB search | Encode HNSW candidate filtering as a Grover oracle |
| SAT solving | Map boolean satisfiability to oracle function |
| Cryptographic analysis | Brute-force key search with quadratic speedup |
| Database queries | Unstructured search over ruVector memory entries |
| Algorithm benchmarking | Reference implementation for quantum advantage studies |
---
## Decision
### 1. Oracle Implementation Strategy
We provide two oracle modes: optimized index-based for known targets, and general
unitary oracle for black-box functions.
#### Mode A: Index-Based Oracle (O(1) per application)
When the target index is known (or the oracle can be expressed as a predicate on
basis state indices), we bypass gate decomposition entirely:
```rust
impl QuantumState {
    /// Apply Grover oracle by direct amplitude negation.
    ///
    /// Flips the sign of amplitude at the given index.
    /// This is an O(1) operation -- the key simulation advantage.
    ///
    /// On hardware, this would require O(n) multi-controlled gates
    /// decomposed into O(n^2) elementary gates.
    #[inline]
    pub fn oracle_flip(&mut self, target_index: usize) {
        debug_assert!(target_index < self.amplitudes.len());
        self.amplitudes[target_index] = -self.amplitudes[target_index];
    }

    /// Apply Grover oracle for multiple marked states.
    ///
    /// Complexity: O(k) where k = number of marked states.
    /// Hardware equivalent: O(k * n^2) gates.
    pub fn oracle_flip_multi(&mut self, target_indices: &[usize]) {
        for &idx in target_indices {
            debug_assert!(idx < self.amplitudes.len());
            self.amplitudes[idx] = -self.amplitudes[idx];
        }
    }
}
```
**Why this is valid**: The oracle operator O is defined as the diagonal unitary
O = I - 2|t><t|, which maps |t> to -|t> and leaves all other basis states unchanged.
In the amplitude array, this is exactly `amplitudes[t] *= -1`. No physical gate
decomposition is needed because we are simulating the mathematical operator directly.
#### Mode B: General Unitary Oracle
For black-box oracle functions where the marked states are not known in advance:
```rust
/// A general oracle as a unitary operation on the state vector.
///
/// The oracle function receives a basis state index and returns
/// true if it should be marked (phase-flipped).
pub trait GroverOracle: Send {
    /// Evaluate whether basis state |index> is a target.
    fn is_marked(&self, index: usize, n_qubits: usize) -> bool;
}

impl QuantumState {
    /// Apply a general Grover oracle.
    ///
    /// Iterates over all 2^n amplitudes, evaluating the oracle predicate.
    /// Complexity: O(2^n) per application (equivalent to hardware cost).
    pub fn oracle_apply(&mut self, oracle: &dyn GroverOracle) {
        let n_qubits = self.n_qubits;
        for i in 0..self.amplitudes.len() {
            if oracle.is_marked(i, n_qubits) {
                self.amplitudes[i] = -self.amplitudes[i];
            }
        }
    }
}
```
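A concrete `GroverOracle` implementation makes the black-box mode tangible. The predicate here ("exactly two bits set") is a hypothetical example, chosen only because its marked set is easy to verify by hand:

```rust
/// Trait as declared in this ADR.
pub trait GroverOracle: Send {
    fn is_marked(&self, index: usize, n_qubits: usize) -> bool;
}

/// Example oracle: marks basis states whose index has exactly two set bits.
struct TwoBitsSet;

impl GroverOracle for TwoBitsSet {
    fn is_marked(&self, index: usize, _n_qubits: usize) -> bool {
        index.count_ones() == 2
    }
}

fn main() {
    let oracle = TwoBitsSet;
    // In a 4-qubit space there are C(4,2) = 6 such states.
    let marked: Vec<usize> = (0..16).filter(|&i| oracle.is_marked(i, 4)).collect();
    assert_eq!(marked.len(), 6);
    assert_eq!(marked, vec![3, 5, 6, 9, 10, 12]);
    println!("{:?}", marked);
}
```

Because the predicate is opaque to the simulator, this mode pays the full O(2^n) sweep per oracle application, exactly as the complexity note above states.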
### 2. Diffuser Implementation
The Grover diffuser (inversion about the mean) is decomposed as:
```
D = H^{otimes n} * phase_flip(|0>) * H^{otimes n}
```
where `phase_flip(|0>)` flips the sign of the all-zeros state: (2|0><0| - I).
```
Diffuser Circuit Decomposition:
|psi> ──[H]──[phase_flip(0)]──[H]──
Expanded:
┌───┐ ┌──────────────┐ ┌───┐
q[0] ──┤ H ├───┤ ├───┤ H ├──
└───┘ │ │ └───┘
┌───┐ │ 2|0><0| - I │ ┌───┐
q[1] ──┤ H ├───┤ ├───┤ H ├──
└───┘ │ │ └───┘
┌───┐ │ │ ┌───┐
q[2] ──┤ H ├───┤ ├───┤ H ├──
└───┘ └──────────────┘ └───┘
```
Both the H^{otimes n} layers and the phase_flip(0) benefit from simulation optimizations:
```rust
impl QuantumState {
    /// Apply Hadamard to all qubits.
    ///
    /// Optimized implementation using butterfly structure.
    /// Complexity: O(n * 2^n)
    pub fn hadamard_all(&mut self) {
        for qubit in 0..self.n_qubits {
            self.apply_hadamard(qubit);
        }
    }

    /// Flip the phase of the |0...0> state.
    ///
    /// O(1) operation via direct indexing -- another simulation advantage.
    /// On hardware, this requires an n-controlled-Z gate.
    #[inline]
    pub fn phase_flip_zero(&mut self) {
        // |0...0> is at index 0
        self.amplitudes[0] = -self.amplitudes[0];
    }

    /// Apply the full Grover diffuser.
    ///
    /// D = H^n * (2|0><0| - I) * H^n
    ///
    /// Implementation note: (2|0><0| - I) leaves |0> fixed and negates
    /// every other state. We realize it as a global negation followed by
    /// flipping amplitude[0] back -- one pass plus an O(1) index update.
    pub fn grover_diffuser(&mut self) {
        self.hadamard_all();
        // Apply 2|0><0| - I:
        // Negate all amplitudes, then flip sign of |0> again.
        // This gives: amp[0] -> amp[0], amp[k] -> -amp[k] for k != 0
        for amp in self.amplitudes.iter_mut() {
            *amp = -*amp;
        }
        self.amplitudes[0] = -self.amplitudes[0];
        self.hadamard_all();
    }
}
```
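The per-qubit `apply_hadamard` butterfly assumed by `hadamard_all` can be sketched over real amplitudes (sufficient for Grover, whose state stays real; the engine itself uses `Complex64`). Each pair of indices differing only in the target qubit's bit is mixed in place:

```rust
/// Butterfly Hadamard on one qubit of a real amplitude vector.
/// Pairs (i, i | 1<<qubit) are mixed: (a, b) -> ((a+b), (a-b)) / sqrt(2).
fn apply_hadamard(amplitudes: &mut [f64], qubit: usize) {
    let stride = 1 << qubit;
    let inv_sqrt2 = std::f64::consts::FRAC_1_SQRT_2;
    for i in 0..amplitudes.len() {
        if i & stride == 0 {
            let (a, b) = (amplitudes[i], amplitudes[i | stride]);
            amplitudes[i] = inv_sqrt2 * (a + b);
            amplitudes[i | stride] = inv_sqrt2 * (a - b);
        }
    }
}

fn main() {
    // H on |0> gives (|0> + |1>)/sqrt(2)
    let mut amps = vec![1.0, 0.0];
    apply_hadamard(&mut amps, 0);
    assert!((amps[0] - std::f64::consts::FRAC_1_SQRT_2).abs() < 1e-12);
    assert!((amps[1] - std::f64::consts::FRAC_1_SQRT_2).abs() < 1e-12);
    // H is its own inverse
    apply_hadamard(&mut amps, 0);
    assert!((amps[0] - 1.0).abs() < 1e-12);
    assert!(amps[1].abs() < 1e-12);
    println!("ok");
}
```

Each call touches every amplitude once, which is where the O(n * 2^n) cost of the full H^n layer comes from.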
### 3. Optimal Iteration Count
The optimal number of Grover iterations for k marked states out of N = 2^n total:
```
iterations = floor(pi/4 * sqrt(N/k))
```
For a single marked state (k=1):
| Qubits (n) | N = 2^n | Optimal Iterations | Classical Steps |
|------------|---------|-------------------|----------------|
| 4 | 16 | 3 | 16 |
| 8 | 256 | 12 | 256 |
| 12 | 4,096 | 50 | 4,096 |
| 16 | 65,536 | 201 | 65,536 |
| 20 | 1,048,576 | 804 | 1,048,576 |
```rust
/// Compute the optimal number of Grover iterations.
///
/// For k marked states in a search space of 2^n:
///   iterations = floor(pi/4 * sqrt(2^n / k))
pub fn optimal_iterations(n_qubits: usize, n_marked: usize) -> usize {
    debug_assert!(n_marked > 0, "at least one marked state is required");
    let n = (1_usize << n_qubits) as f64;
    let k = n_marked as f64;
    (std::f64::consts::FRAC_PI_4 * (n / k).sqrt()).floor() as usize
}
```
### 4. Complete Grover Algorithm
```rust
/// Configuration for Grover's search.
pub struct GroverConfig {
    /// Number of qubits
    pub n_qubits: usize,
    /// Target indices (for index-based oracle)
    pub targets: Vec<usize>,
    /// Custom oracle (overrides targets if set)
    pub oracle: Option<Box<dyn GroverOracle>>,
    /// Override iteration count (auto-computed if None)
    pub iterations: Option<usize>,
    /// Number of measurement shots (for probabilistic result)
    pub shots: usize,
}

/// Result of Grover's search.
pub struct GroverResult {
    /// Most likely measurement outcome (basis state index)
    pub found_index: usize,
    /// Probability of measuring the found state
    pub success_probability: f64,
    /// Number of Grover iterations performed
    pub iterations: usize,
    /// Total wall-clock time
    pub elapsed: Duration,
    /// Full probability distribution (optional, for analysis)
    pub probabilities: Option<Vec<f64>>,
}
```
**Pseudocode for the complete algorithm**:
```rust
pub fn grover_search(config: &GroverConfig) -> GroverResult {
    let n = config.n_qubits;
    let start = std::time::Instant::now();

    // Step 1: Initialize uniform superposition
    // |s> = H^n |0...0> = (1/sqrt(N)) * sum_k |k>
    let mut state = QuantumState::new(n);
    state.hadamard_all(); // O(n * 2^n)

    // Step 2: Determine iteration count
    let k = config.targets.len();
    let iterations = config.iterations
        .unwrap_or_else(|| optimal_iterations(n, k));

    // Step 3: Apply Grover iterations
    for _iter in 0..iterations {
        // Oracle: flip phase of marked states
        match &config.oracle {
            Some(oracle) => state.oracle_apply(oracle.as_ref()),
            None => state.oracle_flip_multi(&config.targets),
        }
        // Diffuser: inversion about the mean
        state.grover_diffuser();
    }

    // Step 4: Measure (find highest-probability state)
    let probabilities: Vec<f64> = state.amplitudes.iter()
        .map(|a| a.norm_sqr())
        .collect();
    let found_index = probabilities.iter()
        .enumerate()
        .max_by(|(_, a), (_, b)| a.partial_cmp(b).unwrap())
        .map(|(i, _)| i)
        .unwrap();

    GroverResult {
        found_index,
        success_probability: probabilities[found_index],
        iterations,
        elapsed: start.elapsed(),
        probabilities: Some(probabilities),
    }
}
```
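The full pipeline can be demonstrated end-to-end in a self-contained sketch over real amplitudes (the engine's `QuantumState` uses `Complex64`, but real arithmetic suffices for Grover). It combines the uniform-superposition step, the O(1) oracle, and the diffuser exactly as above:

```rust
/// Apply H to every qubit of a real amplitude vector (butterfly per qubit).
fn hadamard_all(amps: &mut [f64], n_qubits: usize) {
    let inv_sqrt2 = std::f64::consts::FRAC_1_SQRT_2;
    for q in 0..n_qubits {
        let stride = 1 << q;
        for i in 0..amps.len() {
            if i & stride == 0 {
                let (a, b) = (amps[i], amps[i | stride]);
                amps[i] = inv_sqrt2 * (a + b);
                amps[i | stride] = inv_sqrt2 * (a - b);
            }
        }
    }
}

/// Minimal single-target Grover search; returns (found index, probability).
fn grover(n_qubits: usize, target: usize) -> (usize, f64) {
    let n = 1usize << n_qubits;
    let mut amps = vec![0.0_f64; n];
    amps[0] = 1.0;
    hadamard_all(&mut amps, n_qubits); // uniform superposition
    let iters = (std::f64::consts::FRAC_PI_4 * (n as f64).sqrt()).floor() as usize;
    for _ in 0..iters {
        amps[target] = -amps[target]; // O(1) oracle flip
        // Diffuser: H^n, (2|0><0| - I), H^n
        hadamard_all(&mut amps, n_qubits);
        for a in amps.iter_mut() { *a = -*a; }
        amps[0] = -amps[0];
        hadamard_all(&mut amps, n_qubits);
    }
    amps.iter().enumerate()
        .map(|(i, a)| (i, a * a))
        .max_by(|x, y| x.1.partial_cmp(&y.1).unwrap())
        .unwrap()
}

fn main() {
    // 8 qubits, N = 256: 12 iterations should find the target
    // with probability very close to 1.
    let (found, p) = grover(8, 123);
    assert_eq!(found, 123);
    assert!(p > 0.99);
    println!("found {} with probability {:.4}", found, p);
}
```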
### 5. The O(1) Oracle Trick: Simulation-Unique Advantage
This section formalizes the performance advantage unique to state vector simulation.
**Hardware cost model** (per Grover iteration):
```
Oracle (hardware):
- Multi-controlled-Z gate: O(n) Toffoli gates
- Each Toffoli: ~6 CNOT + single-qubit gates
- Total: O(n) gates, each touching O(2^n) amplitudes in simulation
- Simulation cost: O(n * 2^n) per oracle application
Diffuser (hardware):
- H^n: n Hadamard gates = O(n * 2^n) simulation ops
- Multi-controlled-Z: same as oracle = O(n * 2^n) simulation ops
- H^n: O(n * 2^n) again
- Total: O(n * 2^n) per diffuser
Per iteration (hardware path): O(n * 2^n)
Total (hardware path): O(n * 2^n * sqrt(2^n)) = O(n * 2^(3n/2))
```
**Simulation cost model** (with O(1) oracle optimization):
```
Oracle (optimized):
- Direct amplitude flip: O(1) for single target, O(k) for k targets
- Simulation cost: O(k)
Diffuser (optimized):
- H^n: O(n * 2^n) -- unavoidable
- phase_flip(0): O(1) via direct index
- H^n: O(n * 2^n)
- Total: O(n * 2^n) per diffuser
Per iteration (optimized): O(n * 2^n) [dominated by diffuser]
Total (optimized): O(n * 2^n * sqrt(2^n)) = O(n * 2^(3n/2))
```
The asymptotic complexity is the same (diffuser dominates), but the constant factor
improvement is significant: the oracle step drops from O(n * 2^n) to O(k), saving
roughly 50% of per-iteration time for single-target search.
### 6. Multi-Target Grover Support
When multiple states are marked (k > 1), the algorithm converges faster:
```
iterations(k) = floor(pi/4 * sqrt(N/k))
```
The success probability oscillates sinusoidally. For k targets:
```
P(success after t iterations) = sin^2((2t+1) * arcsin(sqrt(k/N)))
```
```rust
/// Compute success probability after t Grover iterations.
pub fn success_probability(n_qubits: usize, n_marked: usize, iterations: usize) -> f64 {
    let n = (1_usize << n_qubits) as f64;
    let k = n_marked as f64;
    let theta = (k / n).sqrt().asin();
    let angle = (2.0 * iterations as f64 + 1.0) * theta;
    angle.sin().powi(2)
}
```
**Over-iteration risk**: If too many iterations are applied, the algorithm starts
"uncomputing" the answer. The success probability oscillates with period
~pi * sqrt(N/k) / 2. Our implementation auto-computes the optimal count and warns
if the user-specified count deviates significantly.
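The oscillation is easy to exhibit numerically with the `success_probability` formula from this section: at the 8-qubit single-target optimum of 12 iterations the probability is essentially 1, while doubling the count pushes the state well past the peak.

```rust
/// Success probability after t Grover iterations, as defined above.
fn success_probability(n_qubits: usize, n_marked: usize, iterations: usize) -> f64 {
    let n = (1_usize << n_qubits) as f64;
    let theta = (n_marked as f64 / n).sqrt().asin();
    ((2.0 * iterations as f64 + 1.0) * theta).sin().powi(2)
}

fn main() {
    // 8 qubits, 1 marked state: optimum is 12 iterations.
    let p_opt = success_probability(8, 1, 12);
    let p_over = success_probability(8, 1, 24); // double the optimum
    assert!(p_opt > 0.999);
    assert!(p_over < p_opt); // past the peak, probability falls again
    println!("p(12) = {:.5}, p(24) = {:.5}", p_opt, p_over);
}
```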
### 7. Performance Benchmarks
#### Measured Performance Estimates
| Qubits | States | Iterations | Oracle Cost | Diffuser Cost | Total |
|--------|--------|-----------|-------------|--------------|-------|
| 4 | 16 | 3 | 3 * O(1) | 3 * O(64) | <0.01ms |
| 8 | 256 | 12 | 12 * O(1) | 12 * O(2048) | <0.1ms |
| 12 | 4,096 | 50 | 50 * O(1) | 50 * O(49K) | ~1ms |
| 16 | 65,536 | 201 | 201 * O(1) | 201 * O(1M) | ~10ms |
| 20 | 1,048,576 | 804 | 804 * O(1) | 804 * O(20M) | ~500ms |
| 24 | 16,777,216 | 3,216 | 3216 * O(1) | 3216 * O(402M) | ~60s |
**Gate-count equivalent** (for comparison with hardware gate-based simulation):
| Qubits | Grover Iterations | Equivalent Gate Count | Index-Optimized Ops |
|--------|------------------|----------------------|---------------------|
| 8 | 12 | ~200 gates | ~25K ops |
| 12 | 50 | ~1,500 gates | ~2.5M ops |
| 16 | 201 | ~10,000 gates | ~200M ops |
| 20 | 804 | ~60,000 gates | ~16B ops |
The "gates" column counts oracle gates (decomposed) + diffuser gates. The "ops" column
counts actual floating-point operations in the optimized simulation path. The ratio
confirms that the O(1) oracle trick yields a roughly 2x constant-factor improvement
for the overall search.
### 8. Integration with HNSW Index for Hybrid Quantum-Classical Search
A speculative but architecturally sound integration path connects Grover's search with
ruVector's HNSW (Hierarchical Navigable Small World) index:
```
Hybrid Quantum-Classical Nearest-Neighbor Search
=================================================
Phase 1: Classical HNSW (coarse filtering)
- Navigate the HNSW graph to find candidate neighborhood
- Reduce search space from N to ~sqrt(N) candidates
- Time: O(log N)
Phase 2: Grover's Search (fine filtering)
- Encode candidate set as Grover oracle
- Search for exact nearest neighbor among candidates
- Quadratic speedup over brute-force comparison
- Time: O(N^{1/4}) for sqrt(N) candidates
Combined: O(log N + N^{1/4}) vs classical O(log N + sqrt(N))
┌──────────────────────────────────────────────┐
│ HNSW Layer Navigation │
│ │
│ Layer 3: o ─────────── o ────── o │
│ │ │ │
│ Layer 2: o ── o ────── o ── o ──o │
│ │ │ │ │ │ │
│ Layer 1: o─o──o──o──o──o─o──o──o─o │
│ │ │ │ │ │ │ │ │ │ │ │
│ Layer 0: o-o-oo-oo-oo-oo-o-oo-oo-o │
│ │ │
│ ┌───────▼────────┐ │
│ │ Candidate Pool │ │
│ │ ~sqrt(N) items│ │
│ └───────┬────────┘ │
│ │ │
└────────────────────┼───────────────────────────┘
┌──────────▼───────────┐
│ Grover's Search │
│ │
│ Oracle: distance │
│ threshold on │
│ candidate indices │
│ │
│ O(N^{1/4}) queries │
└──────────────────────┘
```
This integration is facilitated by ruVector's existing HNSW implementation
(150x-12,500x faster than baseline, per ruVector performance targets). The Grover
oracle would encode a distance-threshold predicate: "is vector[i] within distance d
of the query vector?"
```rust
/// Oracle that marks basis states corresponding to vectors
/// within distance threshold of a query.
pub struct HnswGroverOracle {
    /// Candidate indices from HNSW coarse search
    pub candidates: Vec<usize>,
    /// Query vector
    pub query: Vec<f32>,
    /// Distance threshold
    pub threshold: f32,
    /// Pre-computed distances (for O(1) oracle evaluation)
    pub distances: Vec<f32>,
}

impl GroverOracle for HnswGroverOracle {
    fn is_marked(&self, index: usize, _n_qubits: usize) -> bool {
        if index < self.distances.len() {
            self.distances[index] <= self.threshold
        } else {
            false
        }
    }
}
```
**Note**: This hybrid approach is currently theoretical for classical simulation.
Its value lies in (a) algorithm prototyping for future quantum hardware, and
(b) demonstrating integration patterns between quantum algorithms and classical
data structures.
---
## Consequences
### Benefits
1. **O(1) oracle optimization** provides a 2x constant-factor speedup unique to state
vector simulation, making Grover's algorithm practical for up to 20+ qubits
2. **Dual oracle modes** support both fast known-target search (index-based) and general
black-box function search (predicate-based)
3. **Auto-computed iteration count** prevents over-iteration and ensures near-optimal
success probability
4. **Multi-target support** handles the general case of k marked states with appropriate
iteration adjustment
5. **HNSW integration path** provides a concrete vision for hybrid quantum-classical
search that leverages ruVector's existing vector database infrastructure
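Benefit 3 can be made concrete. The near-optimal iteration count for k marked states among N = 2^n follows the Boyer et al. formula; a minimal illustrative sketch (the function name is not the engine's API):

```rust
/// Near-optimal Grover iteration count: round(pi / (4 * theta) - 1/2),
/// where sin(theta) = sqrt(k / N) and k >= 1 marked states.
/// Illustrative helper, not the engine API.
fn grover_iterations(n_qubits: u32, k_marked: u64) -> u64 {
    let n = (1u64 << n_qubits) as f64;
    let theta = (k_marked as f64 / n).sqrt().asin();
    ((std::f64::consts::PI / (4.0 * theta)) - 0.5).round().max(0.0) as u64
}
```

For a single target in 10 qubits this gives 25 iterations, close to (pi/4) * sqrt(1024) ≈ 25.1; iterating past this point rotates the state away from the marked subspace, which is exactly the over-iteration hazard the auto-compute guards against.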
### Risks
| Risk | Probability | Impact | Mitigation |
|------|------------|--------|------------|
| Diffuser dominates runtime, limiting oracle optimization benefit | High | Low | Accept 2x improvement; focus on SIMD-optimized Hadamard |
| Multi-target count unknown in practice | Medium | Medium | Quantum counting subroutine (future work) |
| HNSW integration adds complexity with unclear practical advantage | Low | Low | Keep as optional module, prototype-only initially |
| Over-iteration produces incorrect results | Low | High | Auto-compute + warning system + probability tracking |
### Trade-offs
| Decision | Advantage | Disadvantage |
|----------|-----------|--------------|
| O(1) index oracle | Massive speedup for known targets | Not applicable to true black-box search |
| Auto iteration count | Prevents user error | Less flexible for advanced use cases |
| General oracle trait | Supports arbitrary predicates | O(2^n) per application (no speedup over gates) |
| Eager probability tracking | Enables convergence monitoring | Memory overhead for probability vector |
---
## References
- Grover, L.K. "A fast quantum mechanical algorithm for database search." Proceedings of the 28th Annual ACM Symposium on Theory of Computing, 212-219 (1996)
- Boyer, M., Brassard, G., Hoyer, P., Tapp, A. "Tight bounds on quantum searching." Fortschritte der Physik 46, 493-505 (1998)
- Malviya, Y.K., Zapatero, R.A. "Quantum search algorithms for database search: A comprehensive review." arXiv:2311.01265 (2023)
- ADR-001: ruQu Architecture - Classical Nervous System for Quantum Machines
- ADR-QE-005: VQE Algorithm Support (parameterized circuits, expectation values)
- ruVector HNSW implementation: 150x-12,500x faster pattern search (CLAUDE.md performance targets)
- ruQu crate: `crates/ruQu/src/` - syndrome processing and state vector infrastructure

# ADR-QE-007: QAOA MaxCut Implementation
**Status**: Proposed
**Date**: 2026-02-06
**Authors**: ruv.io, RuVector Team
**Deciders**: Architecture Review Board
## Version History
| Version | Date | Author | Changes |
|---------|------|--------|---------|
| 0.1 | 2026-02-06 | ruv.io | Initial QAOA MaxCut architecture proposal |
---
## Context
### Combinatorial Optimization on Quantum Computers
The Quantum Approximate Optimization Algorithm (QAOA), introduced by Farhi, Goldstone,
and Gutmann (2014), is a leading candidate for demonstrating quantum advantage on
combinatorial optimization problems. QAOA constructs a parameterized quantum circuit that
encodes the cost function of an optimization problem and uses classical outer-loop
optimization to find parameters that maximize the expected cost.
### MaxCut as the Canonical QAOA Problem
MaxCut is the prototypical problem for QAOA: given a graph G = (V, E), partition the
vertices into two sets S and S-complement to maximize the number of edges crossing the
partition.
```
MaxCut Example (5 vertices, 6 edges):
0 ─── 1
│ \   │
│  \  │
3 ─── 2
      │
      4
Optimal cut: S = {0, 2}, S' = {1, 3, 4}
Cut value: 5 edges crossing (0-1, 0-3, 1-2, 2-3, 2-4)
```
The cost function is:
```
C(z) = sum_{(i,j) in E} (1 - z_i * z_j) / 2
```
where z_i in {+1, -1} encodes the partition assignment.
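As a sanity check, the cost function can be evaluated directly on a candidate partition; a minimal sketch (the helper name is illustrative):

```rust
/// C(z) = sum over edges of (1 - z_i * z_j) / 2, with z_i in {+1, -1}.
/// Each term contributes 1 if the edge crosses the partition, else 0.
fn cut_value(z: &[i32], edges: &[(usize, usize)]) -> i32 {
    edges.iter().map(|&(i, j)| (1 - z[i] * z[j]) / 2).sum()
}
```

For the five-vertex example with z = [+1, -1, +1, -1, -1], the six edges (0,1), (0,2), (0,3), (1,2), (2,3), (2,4) give a cut value of 5.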
### QAOA Circuit Structure
A depth-p QAOA circuit alternates two types of layers:
1. **Phase separation** (encodes the problem): For each edge (i,j), apply
exp(-i * gamma * Z_i Z_j / 2)
2. **Mixing** (explores the solution space): For each qubit i, apply
exp(-i * beta * X_i) = Rx(2*beta)
```
QAOA Circuit (p layers):
|+> ──[Phase(gamma_1)]──[Mix(beta_1)]── ... ──[Phase(gamma_p)]──[Mix(beta_p)]──[Measure]
                                                                                   │
Parameters: gamma = [gamma_1, ..., gamma_p], beta = [beta_1, ..., beta_p]          │
                 ▲                                                                 │
                 └───────────────── Classical Optimizer <──────────────────────────┘
```
### Why QAOA Matters for ruQu
| Motivation | Details |
|------------|---------|
| Optimization benchmarks | Standard workload for evaluating quantum simulator performance |
| Graph problems | Natural integration with ruVector graph database (ruvector-graph) |
| Variational algorithm | Shares infrastructure with VQE (ADR-QE-005): parameterized circuits, expectation values, classical optimizers |
| Scalability study | QAOA depth and graph size provide tunable complexity for benchmarking |
| Agent integration | ruVector agents can use QAOA to solve graph optimization tasks autonomously |
---
## Decision
### 1. Phase Separation Operator: Native Rzz Gate
The phase separation operator for MaxCut applies exp(-i * gamma * Z_i Z_j / 2) for
each edge (i,j). We implement this as a native two-qubit operation via direct amplitude
manipulation, avoiding CNOT decomposition.
**Mathematical basis**:
```
exp(-i * theta * Z_i Z_j / 2) acts on computational basis states as:
|00> -> e^{-i*theta/2} |00> (Z_i Z_j = +1)
|01> -> e^{+i*theta/2} |01> (Z_i Z_j = -1)
|10> -> e^{+i*theta/2} |10> (Z_i Z_j = -1)
|11> -> e^{-i*theta/2} |11> (Z_i Z_j = +1)
```
In the state vector, for each amplitude at index k:
- Extract bits i and j from k
- Compute parity = bit_i XOR bit_j
- Apply phase: `amp[k] *= exp(-i * theta * (-1)^parity / 2)`
- If parity = 0 (same bits): `amp[k] *= exp(-i * theta / 2)`
- If parity = 1 (different bits): `amp[k] *= exp(+i * theta / 2)`
```rust
impl QuantumState {
/// Apply Rzz(theta) = exp(-i * theta * Z_i Z_j / 2) via direct amplitude
/// manipulation.
///
/// For each basis state |k>:
/// - Compute parity of bits i and j in k
/// - Apply phase e^{-i * theta * (-1)^parity / 2}
///
/// Complexity: O(2^n) -- single pass over state vector.
/// Vectorizable: all amplitudes are independent (no swaps).
///
/// Hardware equivalent: CNOT(i,j) + Rz(theta, j) + CNOT(i,j) = 3 gates.
pub fn rzz(&mut self, theta: f64, qubit_i: usize, qubit_j: usize) {
let phase_same = Complex64::from_polar(1.0, -theta / 2.0);
let phase_diff = Complex64::from_polar(1.0, theta / 2.0);
let mask_i = 1_usize << qubit_i;
let mask_j = 1_usize << qubit_j;
for k in 0..self.amplitudes.len() {
let bit_i = (k & mask_i) >> qubit_i;
let bit_j = (k & mask_j) >> qubit_j;
let parity = bit_i ^ bit_j;
if parity == 0 {
self.amplitudes[k] *= phase_same;
} else {
self.amplitudes[k] *= phase_diff;
}
}
}
}
```
**Vectorization opportunity**: The inner loop is a streaming operation over the amplitude
array with no data dependencies between iterations. This is ideal for SIMD vectorization
(AVX-512 can process 8 complex64 values per instruction) and parallelization across
cores.
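To illustrate the parallelization claim, here is a hedged sketch of the same phase loop split across OS threads with `std::thread::scope`. Amplitudes are modeled as plain `(re, im)` pairs so the example stays self-contained; the actual engine would use its complex type, SIMD intrinsics, and a persistent thread pool.

```rust
use std::thread;

/// Parallel Rzz phase application over disjoint chunks of the amplitude
/// array. Each chunk is independent, so the only synchronization is the
/// implicit join at the end of the scope.
/// parity 0 -> multiply by e^{-i*theta/2}, parity 1 -> e^{+i*theta/2}.
fn rzz_parallel(amps: &mut [(f64, f64)], theta: f64, qi: usize, qj: usize, n_threads: usize) {
    let (cs, sn) = ((theta / 2.0).cos(), (theta / 2.0).sin());
    let chunk = ((amps.len() + n_threads - 1) / n_threads).max(1);
    thread::scope(|scope| {
        for (c, slice) in amps.chunks_mut(chunk).enumerate() {
            scope.spawn(move || {
                for (off, a) in slice.iter_mut().enumerate() {
                    let k = c * chunk + off; // global basis-state index
                    let parity = ((k >> qi) ^ (k >> qj)) & 1;
                    let s = if parity == 0 { -sn } else { sn };
                    // complex multiply (re + i*im) * (cs + i*s)
                    let (re, im) = *a;
                    *a = (re * cs - im * s, re * s + im * cs);
                }
            });
        }
    });
}
```

The chunked structure maps directly onto the 256-tile fabric: each tile owns a contiguous slab of amplitudes and applies the phase locally with no inter-tile communication.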
### 2. Mixing Operator
The mixing operator applies Rx(2*beta) to each qubit:
```
Rx(2*beta) = exp(-i * beta * X) = [[cos(beta), -i*sin(beta)],
[-i*sin(beta), cos(beta)]]
```
This uses the standard single-qubit gate application from the simulator core:
```rust
impl QuantumState {
/// Apply the QAOA mixing operator: Rx(2*beta) on each qubit.
///
/// Complexity: O(n * 2^n) for n qubits.
pub fn qaoa_mixing(&mut self, beta: f64) {
for qubit in 0..self.n_qubits {
self.rx(2.0 * beta, qubit);
}
}
}
```
### 3. QAOA Circuit Construction
A convenience function builds the full QAOA circuit from a graph and parameters:
```rust
/// A graph represented as an edge list with optional weights.
pub struct Graph {
/// Number of vertices
pub n_vertices: usize,
/// Edges: (vertex_i, vertex_j, weight)
pub edges: Vec<(usize, usize, f64)>,
}
impl Graph {
/// Construct from adjacency list.
pub fn from_adjacency_list(adj: &[Vec<usize>]) -> Self;
/// Construct from edge list (unweighted, weight = 1.0).
pub fn from_edge_list(n_vertices: usize, edges: &[(usize, usize)]) -> Self;
/// Load from ruVector graph query result.
pub fn from_ruvector_query(result: &GraphQueryResult) -> Self;
}
/// QAOA configuration.
pub struct QaoaConfig {
/// Graph defining the MaxCut instance
pub graph: Graph,
/// QAOA depth (number of layers)
pub p: usize,
/// Gamma parameters (phase separation angles), length = p
pub gammas: Vec<f64>,
/// Beta parameters (mixing angles), length = p
pub betas: Vec<f64>,
}
/// Build and simulate a QAOA circuit for MaxCut.
///
/// Circuit structure for depth p:
/// 1. Initialize |+>^n (Hadamard on all qubits)
/// 2. For layer l = 1..p:
/// a. Phase separation: Rzz(gamma_l, i, j) for each edge (i,j)
/// b. Mixing: Rx(2*beta_l) on each qubit
/// 3. Return final state
pub fn build_qaoa_circuit(config: &QaoaConfig) -> QuantumState {
let n = config.graph.n_vertices;
let mut state = QuantumState::new(n);
// Step 1: Initialize uniform superposition
state.hadamard_all();
// Step 2: Alternating phase separation and mixing layers
for layer in 0..config.p {
let gamma = config.gammas[layer];
let beta = config.betas[layer];
// Phase separation: apply Rzz for each edge
for &(i, j, weight) in &config.graph.edges {
state.rzz(gamma * weight, i, j);
}
// Mixing: Rx(2*beta) on each qubit
state.qaoa_mixing(beta);
}
state
}
```
**Pseudocode for the complete QAOA MaxCut solver**:
```rust
pub fn qaoa_maxcut(
graph: &Graph,
p: usize,
optimizer: &mut dyn ClassicalOptimizer,
config: &QaoaOptConfig,
) -> QaoaResult {
let n_params = 2 * p; // p gammas + p betas
optimizer.initialize(n_params);
let mut params = config.initial_params.clone()
.unwrap_or_else(|| {
// Standard initialization: gamma in [0, pi], beta in [0, pi/2]
let mut p_init = vec![0.0; n_params];
for i in 0..p {
p_init[i] = 0.5; // gamma_i
p_init[p + i] = 0.25; // beta_i
}
p_init
});
let mut best_cost = f64::NEG_INFINITY;
let mut best_params = params.clone();
let mut history = Vec::new();
for iteration in 0..config.max_iterations {
let gammas = params[..p].to_vec();
let betas = params[p..].to_vec();
// Build and simulate circuit
let qaoa_config = QaoaConfig {
graph: graph.clone(),
p,
gammas,
betas,
};
let state = build_qaoa_circuit(&qaoa_config);
// Evaluate MaxCut cost function
let cost = maxcut_expectation(&state, graph);
if cost > best_cost {
best_cost = cost;
best_params = params.clone();
}
// Gradient computation (parameter-shift rule, same as VQE)
let grad = if optimizer.needs_gradient() {
Some(qaoa_gradient(graph, p, &params))
} else {
None
};
history.push(QaoaIteration { iteration, cost, params: params.clone() });
        // Negate the cost: the optimizer minimizes, while MaxCut maximizes.
        let result = optimizer.step(&params, -cost, grad.as_deref());
params = result.new_params;
if result.converged {
break;
}
}
// Sample the final state to get candidate cuts
let final_state = build_qaoa_circuit(&QaoaConfig {
graph: graph.clone(),
p,
gammas: best_params[..p].to_vec(),
betas: best_params[p..].to_vec(),
});
let best_cut = sample_maxcut(&final_state, graph, config.sample_shots);
QaoaResult {
best_cost,
best_params,
best_cut,
iterations: history.len(),
history,
approximation_ratio: best_cost / graph.max_cut_upper_bound(),
}
}
```
### 4. Cost Function Evaluation
The MaxCut cost function in Pauli operator form is:
```
C = sum_{(i,j) in E} w_{ij} * (1 - Z_i Z_j) / 2
```
This reuses the PauliSum expectation API from ADR-QE-005:
```rust
/// Compute the MaxCut cost as the expectation value of the cost Hamiltonian.
///
/// C = sum_{(i,j) in E} w_ij * (1 - Z_i Z_j) / 2
/// = sum_{(i,j) in E} w_ij/2 - sum_{(i,j) in E} w_ij/2 * Z_i Z_j
/// = const - sum_{(i,j)} w_ij/2 * <Z_i Z_j>
///
/// Each Z_i Z_j expectation is computed via the efficient diagonal trick:
/// <psi| Z_i Z_j |psi> = sum_k |amp_k|^2 * (-1)^{bit_i(k) XOR bit_j(k)}
pub fn maxcut_expectation(state: &QuantumState, graph: &Graph) -> f64 {
let mut cost = 0.0;
for &(i, j, weight) in &graph.edges {
let mask_i = 1_usize << i;
let mask_j = 1_usize << j;
let mut zz_expectation = 0.0;
for k in 0..state.amplitudes.len() {
let bit_i = (k & mask_i) >> i;
let bit_j = (k & mask_j) >> j;
let parity = bit_i ^ bit_j;
let sign = 1.0 - 2.0 * parity as f64; // +1 if same, -1 if different
zz_expectation += state.amplitudes[k].norm_sqr() * sign;
}
cost += weight * (1.0 - zz_expectation) / 2.0;
}
cost
}
```
**Optimization**: Since Z_i Z_j is diagonal in the computational basis, the expectation
reduces to a weighted sum over probabilities. No amplitude swapping is needed, and the
computation is embarrassingly parallel.
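The diagonal trick in isolation, as a standalone sketch operating on a precomputed probability vector (function name illustrative):

```rust
/// <psi| Z_i Z_j |psi> = sum_k p_k * (-1)^{bit_i(k) XOR bit_j(k)},
/// where p_k = |amp_k|^2. A pure reduction over probabilities:
/// no amplitude swaps, trivially chunkable across cores.
fn zz_expectation(probs: &[f64], i: usize, j: usize) -> f64 {
    probs
        .iter()
        .enumerate()
        .map(|(k, &p)| if ((k >> i) ^ (k >> j)) & 1 == 0 { p } else { -p })
        .sum()
}
```

A Bell state (|00> + |11>)/sqrt(2) has probabilities [0.5, 0, 0, 0.5] and gives <Z0 Z1> = +1, while (|01> + |10>)/sqrt(2) gives -1.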
### 5. Sampling Mode
In addition to exact expectation values, we support sampling the final state to
obtain candidate cuts:
```rust
/// Sample the QAOA state to find candidate MaxCut solutions.
///
/// Returns the best cut found across `shots` samples.
pub fn sample_maxcut(
state: &QuantumState,
graph: &Graph,
shots: usize,
) -> MaxCutSolution {
let probabilities: Vec<f64> = state.amplitudes.iter()
.map(|a| a.norm_sqr())
.collect();
let mut best_cut_value = 0.0;
let mut best_bitstring = 0_usize;
let mut rng = thread_rng();
for _ in 0..shots {
// Sample from probability distribution
let sample = sample_from_distribution(&probabilities, &mut rng);
// Evaluate cut value for this bitstring
let cut_value = evaluate_cut(sample, graph);
if cut_value > best_cut_value {
best_cut_value = cut_value;
best_bitstring = sample;
}
}
MaxCutSolution {
partition: best_bitstring,
cut_value: best_cut_value,
set_s: (0..graph.n_vertices)
.filter(|&v| (best_bitstring >> v) & 1 == 1)
.collect(),
set_s_complement: (0..graph.n_vertices)
.filter(|&v| (best_bitstring >> v) & 1 == 0)
.collect(),
}
}
```
### 6. Graph Interface
Three input modes cover common use cases:
```rust
impl Graph {
/// From adjacency list (unweighted).
///
/// Example: adj[0] = [1, 3] means vertex 0 connects to 1 and 3.
pub fn from_adjacency_list(adj: &[Vec<usize>]) -> Self {
let n = adj.len();
let mut edges = Vec::new();
let mut seen = std::collections::HashSet::new();
for (u, neighbors) in adj.iter().enumerate() {
for &v in neighbors {
let edge = if u < v { (u, v) } else { (v, u) };
if seen.insert(edge) {
edges.push((edge.0, edge.1, 1.0));
}
}
}
Self { n_vertices: n, edges }
}
/// From edge list with uniform weight.
pub fn from_edge_list(n_vertices: usize, edge_list: &[(usize, usize)]) -> Self {
Self {
n_vertices,
edges: edge_list.iter().map(|&(u, v)| (u, v, 1.0)).collect(),
}
}
/// From ruVector graph database query result.
///
/// Enables QAOA MaxCut on graphs stored in ruvector-graph.
pub fn from_ruvector_query(result: &GraphQueryResult) -> Self {
// Convert ruvector-graph nodes and edges to QAOA format
// Vertex IDs are remapped to contiguous 0..n range
todo!()
}
}
```
### 7. Tensor Network Optimization for Sparse Graphs
For sparse or planar graphs, the QAOA state can be represented more efficiently using
tensor network contraction. The key insight is that QAOA circuits have a structure
dictated by the graph topology:
```
Tensor Network View of QAOA:
Qubit 0: ──[H]──[Rzz(0,1)]──[Rzz(0,3)]──[Rx]── ...
Qubit 1: ──[H]──[Rzz(0,1)]──[Rzz(1,2)]──[Rx]── ...
Qubit 2: ──[H]──[Rzz(1,2)]──[Rzz(2,3)]──[Rx]── ...
Qubit 3: ──[H]──[Rzz(0,3)]──[Rzz(2,3)]──[Rx]── ...
For a planar graph with treewidth w, tensor contraction costs O(2^w * poly(n))
instead of O(2^n). For many practical graphs, w << n.
```
```rust
/// Detect graph treewidth and decide simulation strategy.
pub fn select_simulation_strategy(graph: &Graph) -> SimulationStrategy {
let treewidth = estimate_treewidth(graph);
let n = graph.n_vertices;
if treewidth <= 20 && n > 24 {
// Tensor network contraction is cheaper than full state vector
SimulationStrategy::TensorNetwork {
contraction_order: compute_contraction_order(graph),
estimated_cost: (1 << treewidth) * n * n,
}
} else {
SimulationStrategy::StateVector {
estimated_cost: 1 << n,
}
}
}
pub enum SimulationStrategy {
StateVector { estimated_cost: usize },
TensorNetwork {
contraction_order: Vec<ContractionStep>,
estimated_cost: usize,
},
}
```
### 8. Performance Analysis
#### Gate Counts and Timing
For a graph with n vertices, m edges, and QAOA depth p:
| Operation | Gate Count per Layer | Total Gates (p layers) |
|-----------|---------------------|----------------------|
| Phase separation (Rzz) | m | p * m |
| Mixing (Rx) | n | p * n |
| **Total per layer** | **m + n** | **p * (m + n)** |
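The table reduces to a single formula; a trivial helper (illustrative, not the engine API) makes the benchmark rows below easy to verify:

```rust
/// Total QAOA gate count for n vertices, m edges, depth p: p * (m + n)
/// (one Rzz per edge plus one Rx per qubit, per layer).
fn qaoa_gate_count(n: usize, m: usize, p: usize) -> usize {
    p * (m + n)
}
```

The Petersen-graph configuration (n = 10, m = 15, p = 3) gives 75 gates; the complete-graph configuration (n = 20, m = 190, p = 3) gives 630.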
**Benchmark estimates**:
| Configuration | n | m | p | Total Gates | Estimated Time |
|---------------|---|---|---|-------------|---------------|
| Small triangle | 3 | 3 | 1 | 6 | <0.01ms |
| Petersen graph | 10 | 15 | 3 | 75 | <0.1ms |
| Random d-reg (d=3) | 10 | 15 | 5 | 125 | <0.5ms |
| Grid 4x5 | 20 | 31 | 3 | 153 | ~50ms |
| Grid 4x5 | 20 | 31 | 5 | 255 | ~100ms |
| Random d-reg (d=4) | 20 | 40 | 5 | 300 | ~200ms |
| Dense (complete) | 20 | 190 | 3 | 630 | ~300ms |
| Sparse large | 24 | 36 | 3 | 180 | ~5s |
| Dense large | 24 | 276 | 5 | 1500 | ~30s |
**Memory requirements**:
| Qubits | State Vector Size | Memory |
|--------|------------------|--------|
| 10 | 1,024 | 16 KB |
| 16 | 65,536 | 1 MB |
| 20 | 1,048,576 | 16 MB |
| 24 | 16,777,216 | 256 MB |
| 28 | 268,435,456 | 4 GB |
### 9. Integration with ruvector-graph
The connection to ruVector's graph database enables a powerful workflow:
```
┌─────────────────────────────────────────────────────────────────────┐
│ QAOA MaxCut Pipeline │
│ │
│ ┌──────────────┐ ┌────────────────┐ ┌──────────────────┐ │
│ │ ruvector-graph│ │ QAOA Engine │ │ Result Store │ │
│ │ │ │ │ │ │ │
│ │ Query: │────>│ Build circuit │────>│ Optimal cut │ │
│ │ "find all │ │ Optimize │ │ Partition │ │
│ │ connected │ │ Sample │ │ Approximation │ │
│ │ subgraphs │ │ │ │ ratio │ │
│ │ of size k" │ │ │ │ │ │
│ └──────────────┘ └────────────────┘ └──────────────────┘ │
│ │
│ Data Flow: │
│ 1. Agent queries ruvector-graph for subgraph │
│ 2. Graph converted to QAOA format via Graph::from_ruvector_query() │
│ 3. QAOA optimizer runs with configurable depth p │
│ 4. Results stored in ruVector memory for pattern learning │
│ 5. Agent uses learned patterns to choose p and initial parameters │
└─────────────────────────────────────────────────────────────────────┘
```
The ruvector-mincut integration is particularly relevant: the existing
`SubpolynomialMinCut` algorithm (El-Hayek/Henzinger/Li, O(n^{o(1)}) amortized) provides
exact min-cut values that serve as a lower bound for MaxCut verification. QAOA solutions
can be validated against this classical baseline.
---
## Consequences
### Benefits
1. **Native Rzz gate** via direct amplitude manipulation avoids CNOT decomposition,
yielding a simpler and faster phase separation implementation
2. **PauliSum expectation API reuse** from ADR-QE-005 provides a unified interface for
all variational algorithms (VQE, QAOA, and future extensions)
3. **Graph interface flexibility** supports adjacency lists, edge lists, and ruVector
graph queries, covering the most common input formats
4. **Tensor network fallback** for low-treewidth graphs extends QAOA to larger problem
instances than pure state vector simulation allows
5. **ruvector-graph integration** enables a seamless pipeline from graph storage to
quantum optimization to result analysis
### Risks
| Risk | Probability | Impact | Mitigation |
|------|------------|--------|------------|
| QAOA at low depth p gives poor approximation ratios | High | Medium | Support high-p QAOA, classical warm-starting |
| Treewidth estimation is NP-hard in general | Medium | Low | Use heuristic upper bounds (min-degree, greedy) |
| Parameter landscape has many local minima | Medium | Medium | Multi-start optimization, INTERP initialization |
| Large dense graphs exhaust memory | Medium | High | Tensor network fallback, graph coarsening |
### Trade-offs
| Decision | Advantage | Disadvantage |
|----------|-----------|--------------|
| Direct Rzz over CNOT decomposition | Simpler, faster | Not a one-to-one hardware circuit mapping |
| Exact expectation over sampling | No statistical noise | Does not model real hardware shot noise |
| Automatic strategy selection | Transparent to user | Additional complexity in simulation backend |
| Integrated graph interface | Seamless workflow | Coupling to ruvector-graph API |
---
## References
- Farhi, E., Goldstone, J., Gutmann, S. "A Quantum Approximate Optimization Algorithm." arXiv:1411.4028 (2014)
- Hadfield, S. et al. "From the Quantum Approximate Optimization Algorithm to a Quantum Alternating Operator Ansatz." Algorithms 12, 34 (2019)
- Zhou, L. et al. "Quantum Approximate Optimization Algorithm: Performance, Mechanism, and Implementation on Near-Term Devices." Physical Review X 10, 021067 (2020)
- Guerreschi, G.G., Matsuura, A.Y. "QAOA for Max-Cut requires hundreds of qubits for quantum speed-up." Scientific Reports 9, 6903 (2019)
- ADR-001: ruQu Architecture - Classical Nervous System for Quantum Machines
- ADR-QE-005: VQE Algorithm Support (shared parameterized circuit and optimizer infrastructure)
- ADR-QE-006: Grover's Search Implementation (quantum state manipulation primitives)
- ruvector-mincut: `crates/ruvector-mincut/` - El-Hayek/Henzinger/Li subpolynomial min-cut
- ruvector-graph: graph database integration for sourcing MaxCut instances

# ADR-QE-008: Surface Code Error Correction Simulation
**Status**: Proposed
**Date**: 2026-02-06
**Authors**: ruv.io, RuVector Team
**Deciders**: Architecture Review Board
## Version History
| Version | Date | Author | Changes |
|---------|------|--------|---------|
| 0.1 | 2026-02-06 | ruv.io | Initial surface code QEC simulation proposal |
---
## Context
### The Importance of QEC Simulation
Quantum Error Correction (QEC) is the bridge between noisy intermediate-scale quantum
(NISQ) devices and fault-tolerant quantum computing. Before deploying error correction
on real hardware, every aspect of the QEC stack must be validated through simulation:
1. **Decoder validation**: Verify that decoding algorithms (MWPM, Union-Find, neural
decoders) produce correct corrections under various noise models
2. **Threshold estimation**: Determine the physical error rate below which logical error
rate decreases with increasing code distance
3. **Architecture exploration**: Compare surface code layouts, flag qubit placements, and
scheduling strategies
4. **Noise model development**: Test decoder robustness against realistic noise (correlated
errors, leakage, crosstalk)
### Surface Codes as the Leading Architecture
The surface code is the most promising QEC architecture for superconducting qubit
platforms due to:
| Property | Value |
|----------|-------|
| Error threshold | ~1% (highest among practical codes) |
| Connectivity | Nearest-neighbor only (matches hardware) |
| Syndrome extraction | Local stabilizer measurements |
| Decoding | Efficient MWPM, Union-Find in O(n * alpha(n)) |
### Surface Code Layout (Distance-3)
```
Distance-3 Rotated Surface Code:
Data qubits: D0..D8 (9 total)
X-stabilizers: X0..X3 (4 ancilla qubits)
Z-stabilizers: Z0..Z3 (4 ancilla qubits)
Z0 Z1
/ \ / \
D0 ──── D1 ──── D2
| X0 | X1 |
D3 ──── D4 ──── D5
| X2 | X3 |
D6 ──── D7 ──── D8
\ / \ /
Z2 Z3
Qubit count: 9 data + 8 ancilla = 17 total qubits
State vector: 2^17 = 131,072 complex amplitudes
Memory: 2 MB per state vector
```
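The qubit and memory arithmetic above generalizes to any odd distance d; a small sketch (names illustrative):

```rust
/// Qubit count and state-vector bytes for a distance-d rotated surface code:
/// d^2 data qubits + (d^2 - 1) ancilla qubits, 16 bytes per complex amplitude.
fn surface_code_footprint(d: u32) -> (u32, u128) {
    let total = 2 * d * d - 1;       // data + ancilla
    let bytes = 16u128 << total;     // 16 * 2^total
    (total, bytes)
}
```

Distance 3 gives 17 qubits and 2 MB, matching the layout above. Distance 5 already requires 49 qubits, far beyond the ~25-qubit state-vector budget, which bounds full-state QEC simulation to small codes.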
### What ruQu Provides Today
The existing ruQu crate already implements key components for error correction:
| Component | Module | Status |
|-----------|--------|--------|
| Syndrome processing | `syndrome.rs` | Production-ready (1M rounds/sec) |
| MWPM decoder | `decoder.rs` | Integrated via fusion-blossom |
| Min-cut coherence | `mincut.rs` | El-Hayek/Henzinger/Li algorithm |
| Three-filter pipeline | `filters.rs` | Structural + Shift + Evidence |
| Tile architecture | `tile.rs`, `fabric.rs` | 256-tile WASM fabric |
| Stim integration | `stim.rs` | Syndrome generation |
What is **missing** is the ability to simulate the full quantum state evolution of a
surface code cycle: ancilla initialization, stabilizer circuits, projective measurement,
state collapse, decoder feedback, and correction application. This ADR fills that gap.
### Requirements
| Requirement | Description | Priority |
|-------------|-------------|----------|
| Mid-circuit measurement | Projective measurement of individual qubits | P0 |
| Qubit reset | Reinitialize ancilla qubits to |0> each cycle | P0 |
| Conditional operations | Apply gates conditioned on measurement outcomes | P0 |
| Noise injection | Depolarizing, bit-flip, phase-flip channels | P0 |
| Syndrome extraction | Extract syndrome bits from ancilla measurements | P0 |
| Decoder integration | Feed syndromes to MWPM/min-cut decoder | P0 |
| Logical error tracking | Determine if logical error occurred | P1 |
| Multi-cycle simulation | Run thousands of QEC cycles efficiently | P1 |
| Leakage modeling | Simulate qubit leakage to non-computational states | P2 |
---
## Decision
### 1. Mid-Circuit Measurement
Mid-circuit measurement is the most critical new capability. Unlike final-state
measurement (which collapses the entire state), mid-circuit measurement collapses a
single qubit while preserving the rest of the system for continued evolution.
**Mathematical formulation**:
For measuring qubit q in the computational basis:
1. Split the state into two subspaces:
- |psi_0>: amplitudes where qubit q = 0
- |psi_1>: amplitudes where qubit q = 1
2. Compute probabilities:
- P(0) = ||psi_0||^2 = sum_{k: bit_q(k)=0} |amp_k|^2
- P(1) = ||psi_1||^2 = sum_{k: bit_q(k)=1} |amp_k|^2
3. Sample outcome m in {0, 1} according to P(0), P(1)
4. Collapse: zero out amplitudes in the non-selected subspace
5. Renormalize: divide remaining amplitudes by sqrt(P(m))
```rust
/// Result of a mid-circuit measurement.
pub struct MeasurementResult {
/// The measured qubit index
pub qubit: usize,
/// The measurement outcome (0 or 1)
pub outcome: u8,
/// The probability of this outcome
pub probability: f64,
}
impl QuantumState {
/// Perform a projective measurement on a single qubit.
///
/// This collapses the qubit to |0> or |1> based on Born probabilities,
/// zeroes out amplitudes in the rejected subspace, and renormalizes.
///
/// The remaining qubits are left in a valid quantum state for continued
/// simulation (essential for mid-circuit measurement in QEC).
///
/// Complexity: O(2^n) -- two passes over the state vector.
/// Pass 1: Compute probabilities P(0), P(1)
/// Pass 2: Collapse and renormalize
pub fn measure_qubit(
&mut self,
qubit: usize,
rng: &mut impl Rng,
) -> MeasurementResult {
let mask = 1_usize << qubit;
let n = self.amplitudes.len();
// Pass 1: Compute P(0) and P(1)
let mut prob_0 = 0.0_f64;
let mut prob_1 = 0.0_f64;
for k in 0..n {
let p = self.amplitudes[k].norm_sqr();
if (k & mask) == 0 {
prob_0 += p;
} else {
prob_1 += p;
}
}
// Sample outcome
let outcome = if rng.gen::<f64>() < prob_0 { 0_u8 } else { 1_u8 };
let prob_selected = if outcome == 0 { prob_0 } else { prob_1 };
let norm_factor = 1.0 / prob_selected.sqrt();
// Pass 2: Collapse and renormalize
for k in 0..n {
let bit = ((k & mask) >> qubit) as u8;
if bit == outcome {
self.amplitudes[k] *= norm_factor;
} else {
self.amplitudes[k] = Complex64::zero();
}
}
MeasurementResult {
qubit,
outcome,
probability: prob_selected,
}
}
/// Measure multiple qubits (ancilla register).
///
/// Measures each qubit sequentially. The order matters because each
/// measurement collapses the state before the next measurement.
/// For stabilizer measurements, this correctly handles correlated outcomes.
pub fn measure_qubits(
&mut self,
qubits: &[usize],
rng: &mut impl Rng,
) -> Vec<MeasurementResult> {
qubits.iter()
.map(|&q| self.measure_qubit(q, rng))
.collect()
}
}
```
### 2. Qubit Reset
Ancilla qubits must be reinitialized to |0> at the start of each syndrome extraction
cycle. The reset operation projects onto the |0> subspace and renormalizes:
```rust
impl QuantumState {
    /// Reset a qubit to |0> by transferring each |1>-subspace amplitude
    /// onto its |0>-subspace partner state, then renormalizing.
    ///
    /// This coherent transfer is exact when the qubit is unentangled and in
    /// a computational basis state -- the situation for an ancilla that has
    /// just been measured. In that case it is equivalent to "measure, then
    /// apply X if the outcome was |1>". For a general entangled
    /// superposition, amplitudes from the two subspaces can interfere (and
    /// even cancel), so a faithful reset channel would require a
    /// density-matrix or stochastic-trajectory treatment. For an incoherent
    /// (thermal) reset, the |1> amplitudes would be zeroed out rather than
    /// transferred.
    ///
    /// Complexity: O(2^n) -- two passes over the state vector.
    ///
    /// Used for ancilla reinitialization in each QEC cycle.
    pub fn reset_qubit(&mut self, qubit: usize) {
        let mask = 1_usize << qubit;
        let n = self.amplitudes.len();
        // Pass 1: move amplitude from each |1> component to its |0> partner.
        for k in 0..n {
            if (k & mask) != 0 {
                // Qubit q is |1> in this basis state; transfer the amplitude
                // to the partner state with q = |0>.
                let partner = k & !mask;
                self.amplitudes[partner] += self.amplitudes[k];
                self.amplitudes[k] = Complex64::zero();
            }
        }
        // Pass 2: renormalize (guard against total cancellation).
        let norm_sq: f64 = self.amplitudes.iter().map(|a| a.norm_sqr()).sum();
        if norm_sq > 1e-12 {
            let norm_factor = 1.0 / norm_sq.sqrt();
            for amp in self.amplitudes.iter_mut() {
                *amp *= norm_factor;
            }
        }
    }
}
```
### 3. Noise Model
We implement three standard noise channels plus a combined depolarizing model.
Noise is applied by stochastically inserting Pauli gates after specified operations.
```
Noise Channels:
Bit-flip (X): rho -> (1-p) * rho + p * X * rho * X
Phase-flip (Z): rho -> (1-p) * rho + p * Z * rho * Z
Depolarizing: rho -> (1-p) * rho + p/3 * (X*rho*X + Y*rho*Y + Z*rho*Z)
```
For state vector simulation, noise is applied via **stochastic Pauli insertion**:
```rust
/// Noise model configuration.
#[derive(Debug, Clone)]
pub struct NoiseModel {
/// Single-qubit gate error rate
pub single_qubit_error: f64,
/// Two-qubit gate error rate
pub two_qubit_error: f64,
/// Measurement error rate (readout bit-flip)
pub measurement_error: f64,
/// Idle error rate (per qubit per cycle)
pub idle_error: f64,
/// Noise type
pub noise_type: NoiseType,
}
#[derive(Debug, Clone, Copy)]
pub enum NoiseType {
/// Random X errors with probability p
BitFlip,
/// Random Z errors with probability p
PhaseFlip,
/// Random X, Y, or Z errors each with probability p/3
Depolarizing,
/// Independent bit-flip (p_x) and phase-flip (p_z)
Independent { p_x: f64, p_z: f64 },
}
impl QuantumState {
/// Apply a noise channel to a single qubit.
///
/// For depolarizing noise with probability p:
/// - With probability 1-p: do nothing
/// - With probability p/3: apply X
/// - With probability p/3: apply Y
/// - With probability p/3: apply Z
///
/// This stochastic Pauli insertion is exact for Pauli channels
/// and a good approximation for general noise (Pauli twirl).
pub fn apply_noise(
&mut self,
qubit: usize,
error_rate: f64,
noise_type: NoiseType,
rng: &mut impl Rng,
) {
match noise_type {
NoiseType::BitFlip => {
if rng.gen::<f64>() < error_rate {
self.apply_x(qubit);
}
}
NoiseType::PhaseFlip => {
if rng.gen::<f64>() < error_rate {
self.apply_z(qubit);
}
}
NoiseType::Depolarizing => {
let r = rng.gen::<f64>();
if r < error_rate / 3.0 {
self.apply_x(qubit);
} else if r < 2.0 * error_rate / 3.0 {
self.apply_y(qubit);
} else if r < error_rate {
self.apply_z(qubit);
}
// else: no error (identity)
}
NoiseType::Independent { p_x, p_z } => {
if rng.gen::<f64>() < p_x {
self.apply_x(qubit);
}
if rng.gen::<f64>() < p_z {
self.apply_z(qubit);
}
}
}
}
/// Apply idle noise to all data qubits.
///
/// Called once per QEC cycle to model decoherence during idle periods.
pub fn apply_idle_noise(
&mut self,
data_qubits: &[usize],
noise: &NoiseModel,
rng: &mut impl Rng,
) {
for &q in data_qubits {
self.apply_noise(q, noise.idle_error, noise.noise_type, rng);
}
}
}
```
### 4. Syndrome Extraction Circuit
A complete surface code syndrome extraction cycle consists of:
1. Reset ancilla qubits to |0>
2. Apply CNOT chains from data qubits to ancilla (stabilizer circuits)
3. Measure ancilla qubits to extract syndrome bits
4. (Optionally) apply noise after each gate
```
Syndrome Extraction for X-Stabilizer X0 = X_D0 * X_D1 * X_D3 * X_D4:
D0: ────────────────⊕───────────────────────────
                    │
D1: ────────────────┼──────⊕────────────────────
                    │      │
D3: ────────────────┼──────┼──────⊕─────────────
                    │      │      │
D4: ────────────────┼──────┼──────┼──────⊕──────
                    │      │      │      │
X0: ──|0>──[H]──────●──────●──────●──────●──[H]──[M]── syndrome bit
(X-stabilizers: Hadamards on the ancilla; CNOTs with the ancilla as control, data as targets)
(Z-stabilizers: no Hadamards; CNOTs with each data qubit as control and the ancilla as target)
```
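In classical terms, the syndrome bit produced by this circuit is a parity: the X-stabilizer X0 drawn above fires exactly when an odd number of Z errors sits on {D0, D1, D3, D4}, and Z-stabilizers do the same for X errors. A small sketch of that parity rule, using a hypothetical flat error-flag array rather than the simulator's state:

```rust
/// Classical shortcut behind syndrome extraction: a stabilizer's syndrome
/// bit equals the parity of anticommuting errors on its support
/// (X-stabilizers detect Z errors; Z-stabilizers detect X errors).
fn syndrome_bit(support: &[usize], error_flags: &[bool]) -> u8 {
    support.iter().fold(0u8, |parity, &q| parity ^ (error_flags[q] as u8))
}
```

Two errors on the same support cancel, which is why single syndrome bits localize errors only up to chains.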
```rust
/// Surface code layout definition.
pub struct SurfaceCodeLayout {
/// Code distance
pub distance: usize,
/// Data qubit indices
pub data_qubits: Vec<usize>,
/// X-stabilizer definitions: (ancilla_qubit, [data_qubits])
pub x_stabilizers: Vec<(usize, Vec<usize>)>,
/// Z-stabilizer definitions: (ancilla_qubit, [data_qubits])
pub z_stabilizers: Vec<(usize, Vec<usize>)>,
/// Total qubit count (data + ancilla)
pub total_qubits: usize,
}
impl SurfaceCodeLayout {
/// Generate a distance-d rotated surface code layout.
pub fn rotated(distance: usize) -> Self {
let n_data = distance * distance;
let n_x_stab = (distance * distance - 1) / 2;
let n_z_stab = (distance * distance - 1) / 2;
let total = n_data + n_x_stab + n_z_stab;
// Assign qubit indices:
// 0..n_data: data qubits
// n_data..n_data+n_x_stab: X-stabilizer ancillae
// n_data+n_x_stab..total: Z-stabilizer ancillae
let data_qubits: Vec<usize> = (0..n_data).collect();
// Build stabilizer mappings based on rotated surface code geometry
let (x_stabilizers, z_stabilizers) =
build_rotated_stabilizers(distance, n_data);
Self {
distance,
data_qubits,
x_stabilizers,
z_stabilizers,
total_qubits: total,
}
}
}
/// One complete syndrome extraction cycle.
///
/// Returns the syndrome bitstring (one bit per stabilizer).
pub fn extract_syndrome(
state: &mut QuantumState,
layout: &SurfaceCodeLayout,
noise: &Option<NoiseModel>,
rng: &mut impl Rng,
) -> SyndromeBits {
let mut syndrome = SyndromeBits::new(
layout.x_stabilizers.len() + layout.z_stabilizers.len()
);
// Step 1: Reset all ancilla qubits
for &(ancilla, _) in layout.x_stabilizers.iter()
.chain(layout.z_stabilizers.iter())
{
state.reset_qubit(ancilla);
}
// Step 2: X-stabilizer circuits
for (stab_idx, &(ancilla, ref data)) in layout.x_stabilizers.iter().enumerate() {
// Hadamard on ancilla (transforms Z-basis CNOT to X-basis measurement)
state.apply_hadamard(ancilla);
if let Some(ref n) = noise {
state.apply_noise(ancilla, n.single_qubit_error, n.noise_type, rng);
}
// CNOT from each data qubit to ancilla
for &d in data {
state.apply_cnot(d, ancilla);
if let Some(ref n) = noise {
state.apply_noise(d, n.two_qubit_error, n.noise_type, rng);
state.apply_noise(ancilla, n.two_qubit_error, n.noise_type, rng);
}
}
// Hadamard on ancilla
state.apply_hadamard(ancilla);
if let Some(ref n) = noise {
state.apply_noise(ancilla, n.single_qubit_error, n.noise_type, rng);
}
// Measure ancilla
let result = state.measure_qubit(ancilla, rng);
// Apply measurement error
let mut outcome = result.outcome;
if let Some(ref n) = noise {
if rng.gen::<f64>() < n.measurement_error {
outcome ^= 1; // Flip the classical bit
}
}
syndrome.set(stab_idx, outcome);
}
// Step 3: Z-stabilizer circuits
let offset = layout.x_stabilizers.len();
for (stab_idx, &(ancilla, ref data)) in layout.z_stabilizers.iter().enumerate() {
// No Hadamard for Z-stabilizers
// CNOT from ancilla to each data qubit
for &d in data {
state.apply_cnot(ancilla, d);
if let Some(ref n) = noise {
state.apply_noise(d, n.two_qubit_error, n.noise_type, rng);
state.apply_noise(ancilla, n.two_qubit_error, n.noise_type, rng);
}
}
// Measure ancilla
let result = state.measure_qubit(ancilla, rng);
let mut outcome = result.outcome;
if let Some(ref n) = noise {
if rng.gen::<f64>() < n.measurement_error {
outcome ^= 1;
}
}
syndrome.set(offset + stab_idx, outcome);
}
// Step 4: Apply idle noise to data qubits
if let Some(ref n) = noise {
state.apply_idle_noise(&layout.data_qubits, n, rng);
}
syndrome
}
```
### 5. Decoder Integration
The syndrome bits feed into ruQu's existing decoder infrastructure:
```
Decoder Pipeline:
Syndrome Bits ──> SyndromeFilter ──> MWPM Decoder ──> Correction ──> Apply to State
│ │
│ ┌─────▼─────┐
│ │ ruvector- │
│ │ mincut │
└──────────────────────────────│ coherence │
│ validation │
└────────────┘
```
```rust
/// Decode syndrome and apply corrections.
///
/// This function bridges the quantum simulation (state vector) with
/// ruQu's classical decoder infrastructure.
pub fn decode_and_correct(
state: &mut QuantumState,
syndrome: &SyndromeBits,
layout: &SurfaceCodeLayout,
decoder: &mut MWPMDecoder,
) -> DecoderResult {
// Convert syndrome bits to DetectorBitmap (ruQu format)
let mut bitmap = DetectorBitmap::new(syndrome.len());
for i in 0..syndrome.len() {
bitmap.set(i, syndrome.get(i) == 1);
}
// Decode using MWPM
let correction = decoder.decode(&bitmap);
// Apply X corrections to data qubits
for &qubit in &correction.x_corrections {
state.apply_x(qubit);
}
// Apply Z corrections to data qubits
for &qubit in &correction.z_corrections {
state.apply_z(qubit);
}
DecoderResult {
correction,
syndrome: bitmap,
applied: true,
}
}
```
Integration with `ruvector-mincut` for coherence validation:
```rust
/// Validate decoder correction using min-cut coherence analysis.
///
/// Uses ruQu's existing DynamicMinCutEngine to assess whether the
/// post-correction state maintains structural coherence.
pub fn validate_correction(
syndrome: &SyndromeBits,
correction: &Correction,
mincut_engine: &mut DynamicMinCutEngine,
) -> CoherenceAssessment {
// Update min-cut graph edges based on syndrome pattern
// High syndrome density in a region lowers edge weights (less coherent)
// Correction success restores edge weights
let cut_value = mincut_engine.query_min_cut();
CoherenceAssessment {
min_cut_value: cut_value.value,
is_coherent: cut_value.value > COHERENCE_THRESHOLD,
witness: cut_value.witness_hash,
}
}
```
### 6. Logical Error Tracking
To determine if a logical error has occurred, we compare the initial and final
logical qubit states:
```rust
/// Track logical errors across QEC cycles.
///
/// A logical error occurs when the cumulative effect of physical errors
/// and decoder corrections results in a non-trivial logical operator
/// being applied to the encoded qubit.
pub struct LogicalErrorTracker {
/// Accumulated X corrections on data qubits
x_correction_parity: Vec<bool>,
/// Accumulated Z corrections on data qubits
z_correction_parity: Vec<bool>,
/// Known physical X errors (for debugging/validation)
x_error_parity: Vec<bool>,
/// Known physical Z errors
z_error_parity: Vec<bool>,
/// Logical X operator support (which data qubits)
logical_x_support: Vec<usize>,
/// Logical Z operator support
logical_z_support: Vec<usize>,
}
impl LogicalErrorTracker {
/// Check if a logical X error has occurred.
///
/// A logical X error occurs when the net X-type operator
/// (errors + corrections) has odd overlap with the logical Z operator.
pub fn has_logical_x_error(&self) -> bool {
let mut parity = false;
for &q in &self.logical_z_support {
parity ^= self.x_error_parity[q] ^ self.x_correction_parity[q];
}
parity
}
/// Check if a logical Z error has occurred.
pub fn has_logical_z_error(&self) -> bool {
let mut parity = false;
for &q in &self.logical_x_support {
parity ^= self.z_error_parity[q] ^ self.z_correction_parity[q];
}
parity
}
/// Check if any logical error has occurred.
pub fn has_logical_error(&self) -> bool {
self.has_logical_x_error() || self.has_logical_z_error()
}
}
```
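A worked distance-3 example of the parity rule, assuming (hypothetically) that the logical-Z support is the data-qubit column {0, 3, 6}: an error chain the decoder only half-corrects leaves a residual X operator with odd overlap, which registers as a logical flip:

```rust
/// Net-parity rule behind `has_logical_x_error`: the residual X operator
/// (physical errors XOR decoder corrections) flips the logical qubit iff
/// its overlap with the logical-Z support is odd.
fn logical_x_flipped(
    logical_z_support: &[usize],
    x_errors: &[bool],
    x_corrections: &[bool],
) -> bool {
    logical_z_support
        .iter()
        .fold(false, |parity, &q| parity ^ x_errors[q] ^ x_corrections[q])
}
```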
### 7. Full Surface Code Simulation Cycle
Putting it all together, the complete simulation loop:
```
Full Surface Code QEC Cycle
============================
Input: Code distance d, noise model, number of cycles T, decoder
Output: Logical error rate estimate
layout = SurfaceCodeLayout::rotated(d)
state = QuantumState::new(layout.total_qubits)
tracker = LogicalErrorTracker::new(layout)
decoder = MWPMDecoder::new(d)
mincut = DynamicMinCutEngine::new()
// Prepare initial logical |0> state
prepare_logical_zero(&mut state, &layout)
for cycle in 0..T:
┌─────────────────────────────────────────────────────┐
│ 1. INJECT NOISE │
│ Apply depolarizing noise to all data qubits │
│ (models decoherence during idle + gate errors) │
│ tracker.record_errors(noise_locations) │
└─────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────┐
│ 2. EXTRACT SYNDROME │
│ Reset ancillae -> stabilizer circuits -> measure │
│ Returns syndrome bitstring for this cycle │
└─────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────┐
│ 3. DECODE │
│ Feed syndrome to MWPM decoder │
│ Decoder returns correction (X and Z Pauli ops) │
└─────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────┐
│ 4. APPLY CORRECTION │
│ Apply Pauli corrections to data qubits │
│ tracker.record_corrections(corrections) │
└─────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────┐
│ 5. VALIDATE COHERENCE (optional) │
│ Run min-cut analysis on syndrome pattern │
│ Flag if coherence drops below threshold │
└─────────────────────────────────────────────────────┘
// After T cycles, check for logical error
logical_error = tracker.has_logical_error()
```
**Pseudocode for the full simulation**:
```rust
/// Run a complete surface code QEC simulation.
///
/// Returns the logical error rate estimated from `trials` independent runs,
/// each consisting of `cycles` QEC rounds.
pub fn simulate_surface_code(config: &SurfaceCodeConfig) -> SimulationResult {
let layout = SurfaceCodeLayout::rotated(config.distance);
let mut logical_errors = 0_u64;
let mut total_cycles = 0_u64;
for trial in 0..config.trials {
let mut state = QuantumState::new(layout.total_qubits);
let mut tracker = LogicalErrorTracker::new(&layout);
let mut decoder = MWPMDecoder::new(DecoderConfig {
distance: config.distance,
physical_error_rate: config.noise.idle_error,
..Default::default()
});
let mut rng = StdRng::seed_from_u64(config.seed + trial);
// Prepare logical |0>
prepare_logical_zero(&mut state, &layout);
for cycle in 0..config.cycles {
// 1. Inject noise
inject_data_noise(&mut state, &layout, &config.noise, &mut rng);
// 2. Extract syndrome
let syndrome = extract_syndrome(
&mut state, &layout, &Some(config.noise.clone()), &mut rng
);
// 3. Decode
let correction = decoder.decode_syndrome(&syndrome);
// 4. Apply correction
apply_correction(&mut state, &correction);
tracker.record_correction(&correction);
total_cycles += 1;
}
// Check for logical error
if tracker.has_logical_error() {
logical_errors += 1;
}
}
let logical_error_rate = logical_errors as f64 / config.trials as f64;
let error_per_cycle = 1.0 - (1.0 - logical_error_rate)
.powf(1.0 / config.cycles as f64);
SimulationResult {
logical_error_rate,
logical_error_per_cycle: error_per_cycle,
total_trials: config.trials,
total_cycles,
logical_errors,
distance: config.distance,
physical_error_rate: config.noise.idle_error,
}
}
```
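The per-cycle conversion at the end of the function deserves a sanity check: it inverts the compounding of T independent cycles. A minimal sketch of just that arithmetic:

```rust
/// Per-cycle conversion used at the end of `simulate_surface_code`:
/// p_cycle = 1 - (1 - p_trial)^(1/T).
fn per_cycle_rate(p_trial: f64, cycles: u32) -> f64 {
    1.0 - (1.0 - p_trial).powf(1.0 / cycles as f64)
}
```

Round-tripping (compounding the per-cycle rate back over T cycles) recovers the per-trial rate, confirming the formula treats cycles as independent Bernoulli events.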
### 8. Performance Estimates
#### Distance-3 Surface Code
| Parameter | Value |
|-----------|-------|
| Data qubits | 9 |
| Ancilla qubits | 8 |
| Total qubits | 17 |
| State vector entries | 2^17 = 131,072 |
| State vector memory | 2 MB |
| CNOTs per cycle | ~24 (total stabilizer weight: 4 weight-4 and 4 weight-2 stabilizers) |
| Measurements per cycle | 8 |
| Resets per cycle | 8 |
| **Time per cycle** | **~0.5ms** |
| **1000 cycles** | **~0.5s** |
#### Distance-5 Surface Code
| Parameter | Value |
|-----------|-------|
| Data qubits | 25 |
| Ancilla qubits | 24 |
| Total qubits | 49 |
| State vector entries | 2^49 ~ 5.6 * 10^14 |
| State vector memory | **8 PiB** (infeasible for full state vector) |
This highlights the fundamental scaling challenge: distance-5 surface codes cannot be
evolved as a full state vector and instead require stabilizer simulation or tensor
network methods. For the critical distance-3 case, however, state vector simulation
is fast and provides ground truth.
**Practical simulation envelope**:
| Distance | Qubits | State Vector | Feasible? | Cycles/sec |
|----------|--------|-------------|-----------|------------|
| 2 (toy) | 7 | 128 entries | Yes | ~50,000 |
| 3 | 17 | 131K entries | Yes | ~2,000 |
| 3 (with noise) | 17 | 131K entries | Yes | ~1,000 |
| 4 | 31 | 2B entries | Marginal (32 GiB) | ~0.1 |
| 5+ | 49+ | >10^14 | No (state vector) | -- |
For distance 5 and above, the implementation should fall back to **stabilizer
simulation** (Gottesman-Knill theorem: Clifford circuits on stabilizer states can be
simulated in polynomial time). Since surface code circuits consist entirely of Clifford
gates (H, CNOT, S) with Pauli noise, this is a natural fit.
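The quadratic-versus-exponential gap can be made concrete. A standard Aaronson-Gottesman (CHP-style) tableau stores 2n stabilizer/destabilizer rows of 2n+1 bits each, while the state vector stores 2^n complex amplitudes; the sketch below compares the two footprints (the tableau layout is the textbook one, not an API from this codebase):

```rust
/// Full state vector: 2^n complex-f64 amplitudes at 16 bytes each.
fn state_vector_bytes(n: u32) -> u128 {
    16u128 << n
}

/// Stabilizer tableau (Aaronson-Gottesman CHP style): 2n rows of
/// 2n + 1 bits, bit-packed. Quadratic, not exponential, in qubit count.
fn tableau_bytes(n: u64) -> u64 {
    let bits = 2 * n * (2 * n + 1);
    (bits + 7) / 8
}
```

For the distance-5 code's 97 qubits (data plus ancilla), the tableau fits in a few KiB where the state vector needs petabytes.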
### 9. Integration with Existing ruQu Pipeline
The surface code simulation integrates with the full ruQu stack:
```
┌─────────────────────────────────────────────────────────────────────┐
│ ruQu QEC Simulation Stack │
│ │
│ ┌─────────────┐ ┌──────────────┐ ┌───────────────────────────┐ │
│ │ State │ │ Syndrome │ │ Decoder Pipeline │ │
│ │ Vector │ │ Processing │ │ │ │
│ │ Engine │──│ (syndrome.rs)│──│ SyndromeFilter │ │
│ │ (new) │ │ │ │ ├── StructuralFilter │ │
│ │ │ │ DetectorBitmap │ │ ├── ShiftFilter │ │
│ │ measure() │ │ SyndromeBuffer │ │ ├── EvidenceFilter │ │
│ │ reset() │ │ SyndromeDelta │ │ └── MWPM Decoder │ │
│ │ noise() │ │ │ │ (decoder.rs) │ │
│ └─────────────┘ └──────────────┘ └───────────────────────────┘ │
│ │ │ │
│ │ ┌─────────────────────────┘ │
│ │ │ │
│ ▼ ▼ │
│ ┌──────────────────────────┐ ┌────────────────────────────────┐ │
│ │ Correction Application │ │ Coherence Validation │ │
│ │ │ │ │ │
│ │ apply_x(qubit) │ │ DynamicMinCutEngine │ │
│ │ apply_z(qubit) │ │ (mincut.rs) │ │
│ │ │ │ │ │
│ │ Logical Error Tracker │ │ El-Hayek/Henzinger/Li │ │
│ └──────────────────────────┘ │ O(n^{o(1)}) min-cut │ │
│ └────────────────────────────────┘ │
│ │
│ ┌───────────────────────────────────────────────────────────────┐ │
│ │ Tile Architecture (fabric.rs, tile.rs) │ │
│ │ │ │
│ │ TileZero (coordinator) + 255 WorkerTiles │ │
│ │ Can parallelize across stabilizer groups for large codes │ │
│ └───────────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────────┘
```
Key integration points:
1. **Syndrome bits** from `measure_qubit()` are converted to `DetectorBitmap` format
for compatibility with ruQu's existing syndrome processing pipeline
2. **MWPM decoder** from `decoder.rs` (backed by fusion-blossom) receives syndromes
and returns corrections
3. **Min-cut coherence** from `mincut.rs` validates post-correction state quality
4. **Tile architecture** from `fabric.rs` can distribute stabilizer measurements across
tiles for parallel processing of large codes
5. **Stim integration** from `stim.rs` provides reference syndrome distributions for
decoder benchmarking
### 10. Error Rate Estimation
To estimate the error threshold, we run simulations at multiple physical error rates
and code distances:
```rust
/// Estimate the error threshold by scanning physical error rates.
///
/// The threshold is the physical error rate p* at which logical error rate
/// is independent of code distance. Below p*, larger codes are better.
/// Above p*, larger codes are worse.
pub fn estimate_threshold(
distances: &[usize],
error_rates: &[f64],
cycles_per_trial: usize,
trials: usize,
) -> ThresholdResult {
let mut results = Vec::new();
for &d in distances {
for &p in error_rates {
let config = SurfaceCodeConfig {
distance: d,
noise: NoiseModel {
idle_error: p,
single_qubit_error: p / 10.0,
two_qubit_error: p,
measurement_error: p,
noise_type: NoiseType::Depolarizing,
},
cycles: cycles_per_trial,
trials: trials as u64,
seed: 42,
};
let sim_result = simulate_surface_code(&config);
results.push((d, p, sim_result.logical_error_per_cycle));
}
}
// Find crossing point of d=3 and d=5 curves
find_threshold_crossing(&results)
}
```
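`find_threshold_crossing` is left unspecified above. One plausible sketch (a hypothetical helper, not the actual implementation) linearly interpolates the sign change in the difference between the two distance curves, since below threshold the larger code has the lower logical error rate and above threshold the ordering reverses:

```rust
/// Sketch of a threshold-crossing finder: scan two logical-error-rate
/// curves (smaller vs larger distance) sampled at the same physical error
/// rates and interpolate where their difference changes sign.
fn threshold_crossing(ps: &[f64], small_d: &[f64], large_d: &[f64]) -> Option<f64> {
    for i in 1..ps.len() {
        let prev = small_d[i - 1] - large_d[i - 1];
        let curr = small_d[i] - large_d[i];
        if (prev > 0.0 && curr < 0.0) || (prev < 0.0 && curr > 0.0) {
            // Interpolate the zero of the difference within [ps[i-1], ps[i]].
            let t = prev / (prev - curr);
            return Some(ps[i - 1] + t * (ps[i] - ps[i - 1]));
        }
    }
    None
}
```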
---
## Consequences
### Benefits
1. **Full quantum state simulation** provides ground truth for decoder validation that
stabilizer simulation alone cannot (e.g., non-Clifford noise, leakage states)
2. **Seamless integration** with ruQu's existing syndrome processing, MWPM decoder,
and min-cut coherence infrastructure minimizes new code and leverages battle-tested
components
3. **Mid-circuit measurement** and qubit reset enable accurate simulation of the actual
hardware QEC cycle, not just the error model
4. **Noise model flexibility** (bit-flip, phase-flip, depolarizing, independent) covers
the standard noise models used in QEC research
5. **Logical error tracking** provides direct measurement of the quantity of interest
(logical error rate) without post-hoc analysis
6. **Integration with min-cut coherence** validates that decoder corrections maintain
structural coherence, bridging ruQu's unique coherence-gating approach with standard
QEC metrics
### Risks
| Risk | Probability | Impact | Mitigation |
|------|------------|--------|------------|
| State vector memory limits simulation to d <= 3 | High | High | Stabilizer simulation fallback for d >= 5 |
| Mid-circuit measurement breaks SIMD optimization | Medium | Medium | Separate hot/cold paths, measurement is infrequent |
| Noise model too simplistic for real hardware | Medium | Medium | Support custom noise channels, correlated errors |
| Decoder latency dominates simulation time | Low | Medium | Use streaming decoder, pre-built matching graphs |
| Logical error tracking complexity for higher distance | Low | Low | Automate logical operator computation from layout |
### Trade-offs
| Decision | Advantage | Disadvantage |
|----------|-----------|--------------|
| State vector over stabilizer simulation | Handles arbitrary noise and non-Clifford ops | Exponential memory, limited to d <= 3-4 |
| Stochastic Pauli insertion for noise | Simple, exact for Pauli channels | Approximate for non-Pauli noise |
| Sequential ancilla measurement | Correct correlated outcomes | Cannot parallelize measurement step |
| Integration with existing ruQu decoder | Reuses battle-tested code | Decoder API may not perfectly match simulation needs |
| Coherent reset (amplitude transfer) | Preserves entanglement structure | More complex than incoherent reset |
---
## References
- Fowler, A.G. et al. "Surface codes: Towards practical large-scale quantum computation." Physical Review A 86, 032324 (2012)
- Dennis, E. et al. "Topological quantum memory." Journal of Mathematical Physics 43, 4452-4505 (2002)
- Google Quantum AI. "Suppressing quantum errors by scaling a surface code logical qubit." Nature 614, 676-681 (2023)
- Higgott, O. "PyMatching: A Python package for decoding quantum codes with minimum-weight perfect matching." ACM Transactions on Quantum Computing 3, 1-16 (2022)
- Wu, Y. & Lin, H.H. "Hypergraph Decomposition and Secret Sharing." Discrete Applied Mathematics (2024)
- ADR-001: ruQu Architecture - Classical Nervous System for Quantum Machines
- ADR-QE-005: VQE Algorithm Support (quantum state manipulation, expectation values)
- ADR-QE-006: Grover's Search (state vector operations, measurement)
- ruQu syndrome module: `crates/ruQu/src/syndrome.rs` - DetectorBitmap, SyndromeBuffer
- ruQu decoder module: `crates/ruQu/src/decoder.rs` - MWPMDecoder, fusion-blossom
- ruQu mincut module: `crates/ruQu/src/mincut.rs` - DynamicMinCutEngine
- ruQu filters module: `crates/ruQu/src/filters.rs` - Three-filter coherence pipeline
- ruvector-mincut crate: `crates/ruvector-mincut/` - El-Hayek/Henzinger/Li algorithm

---
# ADR-QE-009: Tensor Network Evaluation Mode
**Status**: Proposed
**Date**: 2026-02-06
**Authors**: ruv.io, RuVector Team
**Deciders**: Architecture Review Board
---
## Context
Full state-vector simulation stores all 2^n complex amplitudes explicitly, yielding
O(2^n) memory and O(G * 2^n) time for G gates. At n=30 this is 16 GiB; at n=40 it
exceeds 16 TiB. Many practically interesting circuits, however, contain limited
entanglement:
| Circuit family | Entanglement structure | Treewidth |
|---|---|---|
| Shallow QAOA on sparse graphs | Bounded by graph degree | Low (often < 20) |
| Separate-register circuits | Disjoint qubit subsets | Sum of sub-widths |
| Near-Clifford circuits | Stabilizer + few T gates | Depends on T count |
| 1D brickwork (finite depth) | Area-law entanglement | O(depth) |
| Random deep circuits (all-to-all) | Volume-law entanglement | O(n) -- no gain |
For the first four families, tensor network (TN) methods can trade increased
computation for drastically reduced memory by representing each gate as a tensor and
contracting the resulting network in an optimized order. The contraction cost scales
exponentially in the *treewidth* of the circuit's line graph rather than in the total
qubit count.
QuantRS2 (the Rust quantum simulation reference) demonstrated tensor network
contraction for circuits up to 60 qubits on commodity hardware when treewidth
remained below ~25. ruVector's existing `ruvector-mincut` crate already solves graph
partitioning problems that are structurally identical to contraction-order
optimization, providing a natural integration point.
The ruQu engine needs this capability to support:
1. Surface code simulations at distance d >= 7 (49+ data qubits) for decoder
validation, where the syndrome extraction circuit is shallow and geometrically
local.
2. Variational algorithm prototyping (VQE, QAOA) on graphs larger than 30 nodes.
3. Hybrid workflows where part of the circuit is simulated via state vector and part
via tensor contraction.
## Decision
### 1. Feature-Gated Backend
Tensor network evaluation is implemented as an optional backend behind the
`tensor-network` feature flag in `ruqu-core`:
```toml
# ruqu-core/Cargo.toml
[features]
default = ["state-vector"]
state-vector = []
tensor-network = ["dep:ndarray", "dep:petgraph"]
all-backends = ["state-vector", "tensor-network"]
```
When both backends are compiled in, the engine selects the backend at runtime based
on circuit analysis (see Section 4 below).
### 2. Tensor Representation
Every gate becomes a tensor connecting the qubit wire indices it acts on:
| Gate type | Tensor rank | Shape | Example |
|---|---|---|---|
| Single-qubit (H, X, Rz, ...) | 2 | [2, 2] | Input wire -> output wire |
| Two-qubit (CNOT, CZ, ...) | 4 | [2, 2, 2, 2] | Two input wires -> two output wires |
| Three-qubit (Toffoli) | 6 | [2, 2, 2, 2, 2, 2] | Three input -> three output |
| Measurement projector | 2 | [2, 2] | Diagonal in computational basis |
| Initial state |0> | 1 | [2] | Single output wire |
The circuit is converted into a tensor network graph where:
- Each tensor is a node.
- Each shared index (qubit wire between consecutive gates) is an edge.
- Open indices represent initial states and final measurement outcomes.
```
|0>---[H]---[CNOT_ctrl]---[Rz]---<meas>
|
|0>-----------[CNOT_tgt]---------<meas>
```
Becomes:
```
Node: init_0 (rank 1)
|
Node: H_0 (rank 2)
|
Node: CNOT_01 (rank 4)
/ \
| Node: Rz_0 (rank 2)
| |
| Node: meas_0 (rank 2)
|
Node: init_1 (rank 1)
... (connected via CNOT shared index)
Node: meas_1 (rank 2)
```
### 3. Contraction Strategy
Contraction order determines whether the computation is tractable. The cost of
contracting two tensors is the product of the dimensions of all indices involved.
Finding the optimal contraction order is NP-hard (equivalent to finding minimum
treewidth), so we use heuristics.
#### Contraction Path Optimization Pseudocode
```
function find_contraction_path(tensor_network: TN) -> ContractionPath:
// Phase 1: Simplify the network
apply_trivial_contractions(tensor_network) // rank-1 tensors, diagonal pairs
// Phase 2: Detect community structure
communities = detect_communities(tensor_network.graph)
// Phase 3: Contract within communities first (small subproblems)
intra_paths = []
for community in communities:
subgraph = tensor_network.subgraph(community)
if subgraph.num_tensors <= 20:
// Exact dynamic programming for small subgraphs
path = optimal_einsum_dp(subgraph)
else:
// Greedy with lookahead for larger subgraphs
path = greedy_with_lookahead(subgraph, lookahead=2)
intra_paths.append(path)
// Phase 4: Contract inter-community edges
// Each community is now a single large tensor
meta_graph = contract_communities(tensor_network, intra_paths)
inter_path = greedy_with_lookahead(meta_graph, lookahead=3)
// Phase 5: Compose the full path
return compose_paths(intra_paths, inter_path)
function greedy_with_lookahead(tn: TN, lookahead: int) -> Path:
path = []
remaining = tn.clone()
while remaining.num_tensors > 1:
best_cost = INFINITY
best_pair = None
// Evaluate all candidate contractions
for (i, j) in remaining.candidate_pairs():
cost = contraction_cost(remaining, i, j)
// Lookahead: estimate cost of subsequent contractions
if lookahead > 0:
simulated = remaining.simulate_contraction(i, j)
future_cost = estimate_future_cost(simulated, lookahead - 1)
cost += future_cost * DISCOUNT_FACTOR
if cost < best_cost:
best_cost = cost
best_pair = (i, j)
path.append(best_pair)
remaining.contract(best_pair)
return path
```
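The `contraction_cost(remaining, i, j)` call is the workhorse of the greedy loop. For two dense tensors it is simply the product of the dimensions of all distinct indices involved, counting shared indices once. A self-contained sketch, representing indices as (id, dimension) pairs:

```rust
use std::collections::BTreeMap;

/// Cost of contracting two tensors: the product of the dimensions of
/// every distinct index on either tensor (shared indices counted once).
fn pairwise_cost(a: &[(u32, u64)], b: &[(u32, u64)]) -> u64 {
    let mut dims: BTreeMap<u32, u64> = BTreeMap::new();
    for &(idx, dim) in a.iter().chain(b.iter()) {
        dims.insert(idx, dim);
    }
    dims.values().product()
}
```

This is why a fat shared bond dominates: every surviving index multiplies the cost, so the heuristic's job is to keep large bonds internal to early contractions.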
#### Community Detection via ruvector-mincut
The `ruvector-mincut` crate provides graph partitioning that is directly applicable
to contraction ordering:
```rust
use ruvector_mincut::{partition, PartitionConfig};
fn partition_tensor_network(tn: &TensorNetwork) -> Vec<Vec<TensorId>> {
let graph = tn.to_adjacency_graph();
let config = PartitionConfig {
num_partitions: estimate_optimal_partitions(tn),
balance_factor: 1.1, // Allow 10% imbalance
minimize: Objective::EdgeCut, // Minimize inter-partition wires
};
partition(&graph, &config)
}
```
The edge cut directly corresponds to the bond dimension of the inter-community
contraction, so minimizing edge cut minimizes the most expensive contraction step.
### 4. MPS (Matrix Product State) Mode
For circuits with 1D-like connectivity (nearest-neighbor gates on a line), a Matrix
Product State representation is more efficient than general tensor contraction.
```
A[1] -- A[2] -- A[3] -- ... -- A[n]
| | | |
phys_1 phys_2 phys_3 phys_n
```
Each site tensor A[i] has shape `[bond_left, physical, bond_right]` where:
- `physical` = 2 (qubit dimension)
- `bond_left`, `bond_right` = bond dimension chi
| Bond dimension (chi) | Memory per site | Total memory (n qubits) | Approximation |
|---|---|---|---|
| 1 | 32 bytes | 32n bytes | Product state only |
| 16 | 8 KiB | 8n KiB | Low entanglement |
| 64 | 128 KiB | 128n KiB | Moderate entanglement |
| 256 | 2 MiB | 2n MiB | High entanglement |
| 1024 | 32 MiB | 32n MiB | Near exact for many circuits |
(Per interior site with shape `[chi, 2, chi]`: 2 * chi^2 complex f64 values = 32 * chi^2 bytes.)
**Truncation policy**: After each two-qubit gate, perform SVD on the updated bond.
If the bond dimension exceeds `chi_max`, truncate the smallest singular values.
Track the total discarded weight (sum of squared discarded singular values) as a
fidelity estimate:
```rust
pub struct MpsConfig {
/// Maximum bond dimension. Truncation occurs above this.
pub chi_max: usize,
/// Minimum singular value to retain (relative to largest).
pub svd_cutoff: f64,
/// Accumulated truncation error (updated during simulation).
pub fidelity_estimate: f64,
}
impl Default for MpsConfig {
fn default() -> Self {
Self {
chi_max: 256,
svd_cutoff: 1e-12,
fidelity_estimate: 1.0,
}
}
}
```
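The truncation policy can be sketched as a pure function over a descending singular-value spectrum (hypothetical helper name; real code would operate on the SVD factors themselves):

```rust
/// Truncation policy from `MpsConfig`: keep at most `chi_max` singular
/// values, drop anything below `cutoff` relative to the largest, and
/// return (kept_count, discarded_weight). `svals` must be descending.
fn truncate_bond(svals: &[f64], chi_max: usize, cutoff: f64) -> (usize, f64) {
    let largest = svals.first().copied().unwrap_or(0.0);
    let kept = svals
        .iter()
        .take(chi_max)
        .take_while(|&&s| s >= cutoff * largest)
        .count();
    // Discarded weight (sum of squared dropped values) feeds the running
    // fidelity estimate, e.g. fidelity *= 1 - discarded.
    let discarded: f64 = svals[kept..].iter().map(|s| s * s).sum();
    (kept, discarded)
}
```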
### 5. Automatic Mode Selection
The engine analyzes the circuit before execution to recommend a backend:
```rust
pub enum RecommendedBackend {
StateVector { reason: &'static str },
TensorNetwork { estimated_treewidth: usize, reason: &'static str },
Mps { estimated_max_bond: usize, reason: &'static str },
}
pub fn recommend_backend(circuit: &QuantumCircuit) -> RecommendedBackend {
let n = circuit.num_qubits();
let depth = circuit.depth();
let connectivity = circuit.connectivity_graph();
// Rule 1: Small circuits always use state vector
if n <= 20 {
return RecommendedBackend::StateVector {
reason: "Small circuit; state vector is fastest below 20 qubits",
};
}
// Rule 2: Check for 1D connectivity (MPS candidate)
if connectivity.max_degree() <= 2 && connectivity.is_path_graph() {
let estimated_bond = 2_usize.pow(depth.min(20) as u32);
return RecommendedBackend::Mps {
estimated_max_bond: estimated_bond,
reason: "1D nearest-neighbor connectivity detected",
};
}
// Rule 3: Estimate treewidth for general TN
let estimated_tw = estimate_treewidth(&connectivity, depth);
if estimated_tw < 25 && n > 25 {
return RecommendedBackend::TensorNetwork {
estimated_treewidth: estimated_tw,
reason: "Low treewidth relative to qubit count",
};
}
// Rule 4: Check memory feasibility for state vector
    let sv_memory = 16u128 << n; // bytes; u128 avoids shift overflow for large n
    let available = estimate_available_memory() as u128;
    if sv_memory > available {
// Force TN even if treewidth is high -- at least it has a chance
return RecommendedBackend::TensorNetwork {
estimated_treewidth: estimated_tw,
reason: "State vector exceeds available memory; TN is only option",
};
}
RecommendedBackend::StateVector {
reason: "High treewidth circuit; state vector is more efficient",
}
}
```
### 6. When Tensor Networks Win vs Lose
**Tensor networks win when:**
| Scenario | Why TN wins | Example |
|---|---|---|
| Shallow circuits on many qubits | Treewidth ~ depth, not n | 50-qubit depth-4 QAOA |
| Sparse graph connectivity | Low treewidth from graph structure | MaxCut on 3-regular graph |
| Separate registers | Independent contractions | n/2 Bell pairs |
| Near-Clifford | Stabilizer + few non-Clifford gates | Clifford + 5 T gates |
| Amplitude computation | Contract to single output, not full state | Sampling one bitstring |
**Tensor networks lose when:**
| Scenario | Why TN loses | Fallback |
|---|---|---|
| Deep random circuits | Treewidth ~ n | State vector (if n <= 30) |
| All-to-all connectivity | No structure to exploit | State vector |
| Full state tomography needed | Must contract once per amplitude | State vector |
| Very small circuits (n < 20) | Overhead exceeds state vector | State vector |
| High-fidelity MPS needed | Bond dimension grows exponentially | State vector or exact TN |
### 7. Example: 50-Qubit Shallow QAOA
Consider QAOA depth p=1 on a 50-node 3-regular graph:
```
Circuit structure:
- 50 qubits, initialized to |+>
- 75 ZZ gates (one per edge), parameterized by gamma
- 50 Rx gates, parameterized by beta
- Total: 125 + 50 = 175 gates
- Circuit depth: 4 (H layer, ZZ layer (3-colorable), Rx layer, measure)
Graph treewidth of 3-regular graph: typically 8-15
Tensor network contraction:
- Community detection finds ~5-8 communities of 6-10 nodes
- Intra-community contraction: O(2^10) ~ 1024 per community
- Inter-community bonds: ~15 edges cut
- Effective contraction complexity: O(2^15) = 32768
- Compare to state vector: O(2^50) = 1.1 * 10^15
Memory comparison:
- State vector: 2^50 * 16 bytes = 16 PiB (impossible)
- Tensor network: ~100 MiB working memory
- Speedup factor: practically infinite (feasible vs infeasible)
```
```
Contraction Diagram (simplified):
Community A Community B Community C
[q0-q9] [q10-q19] [q20-q29]
| | |
+--- bond=2^3 ----+---- bond=2^4 -----+
|
Community D Community E
[q30-q39] [q40-q49]
| |
+--- bond=2^3 ----+
Peak intermediate tensor: 2^15 elements = 512 KiB
```
### 8. Integration with State Vector Backend
Both backends implement the same trait:
```rust
pub trait SimulationBackend {
/// Execute the circuit and return measurement results.
fn execute(
&self,
circuit: &QuantumCircuit,
shots: usize,
config: &SimulationConfig,
) -> Result<SimulationResult, SimulationError>;
/// Compute expectation value of an observable.
fn expectation_value(
&self,
circuit: &QuantumCircuit,
observable: &Observable,
config: &SimulationConfig,
) -> Result<f64, SimulationError>;
/// Return the backend name for logging.
fn name(&self) -> &'static str;
}
```
Users interact through `QuantumCircuit` and never need to know which backend is
active:
```rust
let circuit = QuantumCircuit::new(50)
.h_all()
.append_qaoa_layer(graph, gamma, beta)
.measure_all();
// Automatic backend selection
let result = ruqu::execute(&circuit, 1000)?;
// -> Internally selects TensorNetwork backend due to n=50, low treewidth
// Or explicit backend override
let result = ruqu::execute_with_backend(
&circuit,
1000,
Backend::TensorNetwork(TnConfig::default()),
)?;
```
### 9. Future: ruvector-mincut Integration for Contraction Ordering
The `ruvector-mincut` crate currently solves balanced graph partitioning for vector
index sharding. The same algorithm directly applies to tensor network contraction
ordering via the following correspondence:
| Graph partitioning concept | TN contraction concept |
|---|---|
| Vertex | Tensor |
| Edge weight | Bond dimension (log2) |
| Partition | Contraction subtree |
| Edge cut | Inter-partition bond cost |
| Balanced partition | Balanced contraction tree |
Phase 1 (this ADR): Use `ruvector-mincut` for community detection in contraction
path optimization.
Phase 2 (future): Extend `ruvector-mincut` with hypergraph partitioning for
multi-index tensor contractions, enabling handling of higher-order tensor networks
(e.g., PEPS for 2D circuits).
## Consequences
### Positive
1. **Dramatically expanded qubit range**: Shallow circuits on 40-60 qubits become
tractable on commodity hardware.
2. **Surface code simulation**: Distance-7 surface codes (49 data + 48 ancilla = 97
qubits) can be simulated for decoder validation using MPS (the circuit is
geometrically local).
3. **Unified interface**: Users write circuits once; backend selection is automatic.
4. **Synergy with ruvector-mincut**: Leverages existing graph partitioning
investment.
5. **Complementary to state vector**: Each backend covers the other's weakness.
### Negative
1. **Implementation complexity**: Tensor contraction, SVD truncation, and path
optimization are non-trivial to implement correctly and efficiently.
2. **Approximation risk**: MPS truncation introduces controlled but nonzero error.
Users must understand fidelity estimates.
3. **Compilation time**: The `ndarray` and `petgraph` dependencies add to compile
time when the feature is enabled.
4. **Testing surface**: Supporting two backends doubles the testing matrix for
   correctness validation.
5. **Performance unpredictability**: Contraction cost depends on circuit structure
in ways that are hard to predict without running the path optimizer.
### Risks and Mitigations
| Risk | Likelihood | Impact | Mitigation |
|---|---|---|---|
| Path optimizer finds poor ordering | Medium | High cost | Multiple heuristics + timeout fallback to greedy |
| MPS fidelity silently degrades | Medium | Incorrect results | Track discarded weight; warn if fidelity < 0.99 |
| Feature interaction bugs | Low | Incorrect results | Shared test suite: both backends must agree on small circuits |
| Memory spike during contraction | Medium | OOM | Pre-estimate peak intermediate tensor size; abort if too large |
## References
- QuantRS2 tensor network implementation: internal reference
- Markov & Shi, "Simulating Quantum Computation by Contracting Tensor Networks" (2008)
- Gray & Kourtis, "Hyper-optimized tensor network contraction" (2021) -- cotengra
- Schollwöck, "The density-matrix renormalization group in the age of matrix product states" (2011)
- ADR-QE-001: Core Engine Architecture (state vector backend)
- ADR-QE-005: WASM Compilation Target
- `ruvector-mincut` crate documentation
- ADR-014: Coherence Engine (graph partitioning reuse)

# ADR-QE-010: Observability & Monitoring Integration
**Status**: Proposed
**Date**: 2026-02-06
**Authors**: ruv.io, RuVector Team
**Deciders**: Architecture Review Board
---
## Context
ruVector provides comprehensive observability through the `ruvector-metrics` crate,
which aggregates telemetry from all subsystems into a unified monitoring dashboard.
The quantum simulation engine is a new subsystem that must participate in this
observability infrastructure.
Effective monitoring of quantum simulation is essential for:
1. **Performance tuning**: Identifying bottlenecks in gate application, memory
allocation, and parallelization efficiency.
2. **Resource management**: Tracking memory consumption to prevent OOM conditions
and to inform auto-scaling decisions.
3. **Debugging**: Tracing the execution of specific circuits to diagnose incorrect
results or unexpected behavior.
4. **Capacity planning**: Understanding workload patterns (qubit counts, circuit
depths, simulation frequency) to plan infrastructure.
5. **Compliance**: Auditable logs of simulation executions for regulated
environments (cryptographic validation, safety-critical applications).
### WASM Constraint
In WebAssembly deployment, there is no direct filesystem access and no native
networking. Observability in WASM must use browser-compatible mechanisms:
`console.log`, `console.warn`, `console.error`, or JavaScript callback functions
registered by the host application.
### Existing Infrastructure
| Component | Role | Integration Point |
|---|---|---|
| `ruvector-metrics` | Metrics aggregation and export | Trait-based sink |
| `ruvector-monitor` | Real-time dashboard UI | WebSocket feed |
| Rust `tracing` crate | Structured logging and spans | Subscriber-based |
| Prometheus / OpenTelemetry | External monitoring | Exporter plugins |
| Ed25519 audit trail | Cryptographic logging | `ruqu-audit` crate |
## Decision
### 1. Metrics Schema
Every simulation execution emits a structured metrics record. The schema is
versioned to allow evolution without breaking consumers.
```rust
/// Metrics emitted after each quantum simulation execution.
/// Schema version: 1.0.0
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct SimulationMetrics {
/// Schema version for forward compatibility.
/// (`Cow` rather than `&'static str` so the struct can derive `Deserialize`.)
pub schema_version: Cow<'static, str>,
/// Unique identifier for this simulation run.
pub simulation_id: Uuid,
/// Timestamp when simulation started (UTC).
pub started_at: DateTime<Utc>,
/// Timestamp when simulation completed (UTC).
pub completed_at: DateTime<Utc>,
// -- Circuit characteristics --
/// Number of qubits in the circuit.
pub qubit_count: u32,
/// Total number of gates (before optimization).
pub gate_count_raw: u64,
/// Total number of gates (after optimization/fusion).
pub gate_count_optimized: u64,
/// Circuit depth (longest path from input to output).
pub circuit_depth: u32,
/// Number of two-qubit gates (entangling operations).
pub two_qubit_gate_count: u64,
// -- Execution metrics --
/// Total wall-clock execution time in milliseconds.
pub execution_time_ms: f64,
/// Time spent in gate application (excluding allocation, measurement).
pub gate_application_time_ms: f64,
/// Time spent in measurement sampling.
pub measurement_time_ms: f64,
/// Peak memory consumption in bytes during simulation.
pub peak_memory_bytes: u64,
/// Memory allocated for the state vector / tensor network.
pub state_memory_bytes: u64,
/// Backend used for this simulation.
pub backend: BackendType,
// -- Throughput --
/// Gates applied per second (optimized gate count / gate application time).
pub gates_per_second: f64,
/// Qubits * depth per second (a normalized throughput metric).
pub quantum_volume_rate: f64,
// -- Optimization statistics --
/// Number of gates eliminated by fusion.
pub gates_fused: u64,
/// Number of gates eliminated as identity or redundant.
pub gates_skipped: u64,
/// Number of gate commutations applied.
pub gates_commuted: u64,
// -- Entanglement analysis --
/// Number of independent qubit subsets (entanglement groups).
pub entanglement_groups: u32,
/// Sizes of each entanglement group.
pub entanglement_group_sizes: Vec<u32>,
// -- Measurement outcomes (if measured) --
/// Number of measurement shots executed.
pub measurement_shots: Option<u64>,
/// Distribution entropy of measurement outcomes (bits).
pub outcome_entropy: Option<f64>,
// -- MPS-specific (tensor network backend) --
/// Maximum bond dimension reached (MPS mode only).
pub max_bond_dimension: Option<u32>,
/// Estimated fidelity after MPS truncation.
pub mps_fidelity_estimate: Option<f64>,
// -- Error information --
/// Whether the simulation completed successfully.
pub success: bool,
/// Error message if simulation failed.
pub error: Option<String>,
/// Error category for programmatic handling.
pub error_kind: Option<SimulationErrorKind>,
}
#[derive(Debug, Clone, Serialize, Deserialize)]
pub enum BackendType {
StateVector,
TensorNetwork,
Mps,
Hybrid,
}
#[derive(Debug, Clone, Serialize, Deserialize)]
pub enum SimulationErrorKind {
QubitLimitExceeded,
MemoryAllocationFailed,
InvalidGateTarget,
InvalidParameter,
ContractionFailed,
MpsFidelityBelowThreshold,
Timeout,
InternalError,
}
```
### 2. Metrics Sink Trait
The engine publishes metrics through a trait abstraction, allowing different sinks
for native and WASM environments:
```rust
/// Trait for consuming simulation metrics.
/// Implementations exist for native (ruvector-metrics), WASM (JS callback),
/// and testing (in-memory collector).
pub trait MetricsSink: Send + Sync {
/// Publish a completed simulation's metrics.
fn publish(&self, metrics: &SimulationMetrics);
/// Publish an incremental progress update (for long-running simulations).
fn progress(&self, simulation_id: Uuid, percent_complete: f32, message: &str);
/// Publish a health status update.
fn health(&self, status: EngineHealthStatus);
}
/// Native implementation: forwards to ruvector-metrics.
pub struct NativeMetricsSink {
registry: Arc<ruvector_metrics::Registry>,
}
impl MetricsSink for NativeMetricsSink {
fn publish(&self, metrics: &SimulationMetrics) {
// Emit as histogram/counter/gauge values
self.registry.histogram("ruqu.execution_time_ms")
.record(metrics.execution_time_ms);
self.registry.gauge("ruqu.peak_memory_bytes")
.set(metrics.peak_memory_bytes as f64);
self.registry.counter("ruqu.simulations_total")
.increment(1);
self.registry.counter("ruqu.gates_applied_total")
.increment(metrics.gate_count_optimized);
self.registry.histogram("ruqu.gates_per_second")
.record(metrics.gates_per_second);
if !metrics.success {
self.registry.counter("ruqu.errors_total")
.increment(1);
}
}
fn progress(&self, _id: Uuid, percent: f32, _msg: &str) {
self.registry.gauge("ruqu.current_progress")
.set(percent as f64);
}
fn health(&self, status: EngineHealthStatus) {
self.registry.gauge("ruqu.health_status")
.set(status.as_numeric());
}
}
```
### 3. WASM Metrics Sink
In WASM, metrics are delivered via JavaScript callbacks:
```rust
#[cfg(target_arch = "wasm32")]
pub struct WasmMetricsSink {
/// JS callback function registered by host application.
callback: js_sys::Function,
}
#[cfg(target_arch = "wasm32")]
impl MetricsSink for WasmMetricsSink {
fn publish(&self, metrics: &SimulationMetrics) {
let json = serde_json::to_string(metrics)
.unwrap_or_else(|_| "{}".to_string());
let js_value = JsValue::from_str(&json);
let event_type = JsValue::from_str("simulation_complete");
let _ = self.callback.call2(&JsValue::NULL, &event_type, &js_value);
}
fn progress(&self, id: Uuid, percent: f32, message: &str) {
// Build via serde_json so quotes or backslashes in `message` are
// escaped correctly instead of corrupting the JSON payload.
let payload = serde_json::json!({
"simulation_id": id.to_string(),
"percent": percent,
"message": message,
})
.to_string();
let js_value = JsValue::from_str(&payload);
let event_type = JsValue::from_str("simulation_progress");
let _ = self.callback.call2(&JsValue::NULL, &event_type, &js_value);
}
fn health(&self, status: EngineHealthStatus) {
let payload = format!(r#"{{"status":"{}"}}"#, status.as_str());
let js_value = JsValue::from_str(&payload);
let event_type = JsValue::from_str("engine_health");
let _ = self.callback.call2(&JsValue::NULL, &event_type, &js_value);
}
}
```
JavaScript host registration:
```javascript
// Host application registers the metrics callback
import init, { set_metrics_callback } from 'ruqu-wasm';
await init();
set_metrics_callback((eventType, data) => {
const metrics = JSON.parse(data);
switch (eventType) {
case 'simulation_complete':
console.log(`Simulation ${metrics.simulation_id} completed in ${metrics.execution_time_ms}ms`);
dashboard.updateMetrics(metrics);
break;
case 'simulation_progress':
progressBar.update(metrics.percent);
break;
case 'engine_health':
healthIndicator.set(metrics.status);
break;
}
});
```
### 4. Tracing Integration
The engine integrates with the Rust `tracing` crate for structured logging and
distributed tracing.
#### Span Hierarchy
```
ruqu::simulation (root span for entire simulation)
|
+-- ruqu::circuit_validation (validate circuit structure)
|
+-- ruqu::backend_selection (automatic backend choice)
|
+-- ruqu::optimization (gate fusion, commutation, etc.)
| |
| +-- ruqu::optimization::fusion (individual fusion passes)
| +-- ruqu::optimization::cancel (gate cancellation)
|
+-- ruqu::state_init (allocate and initialize state)
|
+-- ruqu::gate_application (apply all gates)
| |
| +-- ruqu::gate (individual gate -- DEBUG level only)
|
+-- ruqu::measurement (perform measurement sampling)
|
+-- ruqu::metrics_publish (emit metrics to sink)
|
+-- ruqu::state_cleanup (deallocate state vector)
```
#### Instrumentation Code
```rust
use tracing::{info, warn, debug, trace, instrument, Span};
#[instrument(
name = "ruqu::simulation",
skip(circuit, config, metrics_sink),
fields(
qubit_count = circuit.num_qubits(),
gate_count = circuit.gate_count(),
simulation_id = %Uuid::new_v4(),
)
)]
pub fn execute(
circuit: &QuantumCircuit,
shots: usize,
config: &SimulationConfig,
metrics_sink: &dyn MetricsSink,
) -> Result<SimulationResult, SimulationError> {
info!(
qubits = circuit.num_qubits(),
gates = circuit.gate_count(),
depth = circuit.depth(),
shots = shots,
"Starting quantum simulation"
);
// Validate
let _validation_span = tracing::info_span!("ruqu::circuit_validation").entered();
validate_circuit(circuit)?;
drop(_validation_span);
// Select backend
let _backend_span = tracing::info_span!("ruqu::backend_selection").entered();
let backend = select_backend(circuit, config);
info!(backend = backend.name(), "Backend selected");
drop(_backend_span);
// Optimize
let _opt_span = tracing::info_span!("ruqu::optimization").entered();
let optimized = optimize_circuit(circuit, config)?;
info!(
original_gates = circuit.gate_count(),
optimized_gates = optimized.gate_count(),
gates_fused = circuit.gate_count() - optimized.gate_count(),
"Circuit optimization complete"
);
drop(_opt_span);
// Execute
let result = backend.execute(&optimized, shots, config)?;
// At DEBUG level, log per-gate details
debug!(
execution_time_ms = result.execution_time_ms,
peak_memory = result.peak_memory_bytes,
"Simulation execution complete"
);
// At TRACE level only for small circuits, log amplitude information
if circuit.num_qubits() <= 10 {
trace!(
amplitudes = ?result.state_vector_snapshot(),
"Final state vector (small circuit trace)"
);
}
Ok(result)
}
```
### 5. Structured Error Reporting
All errors carry structured context for programmatic handling:
```rust
#[derive(Debug, thiserror::Error)]
pub enum SimulationError {
#[error("Qubit limit exceeded: requested {requested}, maximum {maximum}")]
QubitLimitExceeded {
requested: u32,
maximum: u32,
estimated_memory_bytes: u64,
available_memory_bytes: u64,
},
#[error("Memory allocation failed for {requested_bytes} bytes")]
MemoryAllocationFailed {
requested_bytes: u64,
qubit_count: u32,
suggestion: &'static str,
},
#[error("Invalid gate target: qubit {qubit} in {qubit_count}-qubit circuit")]
InvalidGateTarget {
gate_name: String,
qubit: u32,
qubit_count: u32,
gate_index: usize,
},
#[error("Invalid gate parameter: {parameter_name} = {value} ({reason})")]
InvalidParameter {
gate_name: String,
parameter_name: String,
value: f64,
reason: &'static str,
},
#[error("Tensor contraction failed: {reason}")]
ContractionFailed {
reason: String,
estimated_treewidth: usize,
suggestion: &'static str,
},
#[error("MPS fidelity {fidelity:.6} below threshold {threshold:.6}")]
MpsFidelityBelowThreshold {
fidelity: f64,
threshold: f64,
max_bond_dimension: usize,
suggestion: &'static str,
},
#[error("Simulation timed out after {elapsed_ms}ms (limit: {timeout_ms}ms)")]
Timeout {
elapsed_ms: u64,
timeout_ms: u64,
gates_completed: u64,
gates_remaining: u64,
},
#[error("Internal error: {message}")]
InternalError {
message: String,
source: Option<Box<dyn std::error::Error + Send + Sync>>,
},
}
```
Each error variant includes a `suggestion` field where applicable, guiding users
toward resolution:
| Error | Suggestion |
|---|---|
| QubitLimitExceeded | "Reduce qubit count or enable tensor-network feature for large circuits" |
| MemoryAllocationFailed | "Try tensor-network backend or reduce qubit count by 1-2 (each qubit removed halves memory)" |
| ContractionFailed | "Circuit treewidth too high for tensor network; use state vector for <= 30 qubits" |
| MpsFidelityBelowThreshold | "Increase chi_max or switch to exact state vector for high-fidelity results" |
### 6. Health Checks
The engine exposes health status for monitoring systems:
```rust
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct EngineHealthStatus {
/// Whether the engine is ready to accept simulations.
pub ready: bool,
/// Maximum qubits supportable given current available memory.
pub max_supported_qubits: u32,
/// Available memory in bytes.
pub available_memory_bytes: u64,
/// Number of CPU cores available for parallel gate application.
pub available_cores: usize,
/// Whether the tensor-network backend is compiled in.
pub tensor_network_available: bool,
/// Current engine version.
pub version: &'static str,
/// Uptime since engine initialization (if applicable).
pub uptime_seconds: Option<f64>,
/// Number of simulations executed in current session.
pub simulations_executed: u64,
/// Total gates applied across all simulations in current session.
pub total_gates_applied: u64,
}
/// Check engine health. Callable at any time.
pub fn quantum_engine_ready() -> EngineHealthStatus {
let available_memory = estimate_available_memory();
let max_qubits = compute_max_qubits(available_memory);
EngineHealthStatus {
ready: max_qubits >= 4, // Minimum useful simulation
max_supported_qubits: max_qubits,
available_memory_bytes: available_memory,
available_cores: rayon::current_num_threads(),
tensor_network_available: cfg!(feature = "tensor-network"),
version: env!("CARGO_PKG_VERSION"),
uptime_seconds: None, // Library mode; no persistent uptime
simulations_executed: SESSION_COUNTER.load(Ordering::Relaxed),
total_gates_applied: SESSION_GATES.load(Ordering::Relaxed),
}
}
```
### 7. Logging Levels
| Level | Content | Audience | Performance Impact |
|---|---|---|---|
| ERROR | Simulation failures, OOM, invalid circuits | Operators, alerting | None |
| WARN | Approaching memory limits (>80%), MPS fidelity degradation, slow contraction | Operators | Negligible |
| INFO | Simulation start/end summaries, backend selection, optimization results | Developers, dashboards | Negligible |
| DEBUG | Per-optimization-pass details, memory allocation sizes, thread utilization | Developers debugging | Low |
| TRACE | Per-gate amplitude changes (small circuits only, n <= 10), SVD singular values | Deep debugging | High (small circuits only) |
TRACE level is gated on circuit size to prevent catastrophic log volume:
```rust
// TRACE-level amplitude logging is only emitted for circuits with <= 10 qubits.
// For larger circuits, TRACE only emits gate-level timing without amplitude data.
if tracing::enabled!(tracing::Level::TRACE) {
if circuit.num_qubits() <= 10 {
trace!(amplitudes = ?state.as_slice(), "Post-gate state");
} else {
trace!(gate_time_ns = elapsed.as_nanos(), "Gate applied");
}
}
```
### 8. Dashboard Integration
Metrics from the quantum engine appear in the ruVector monitoring UI as a dedicated
panel alongside vector operations, index health, and system resources.
```
+------------------------------------------------------------------+
| ruVector Monitoring Dashboard |
+------------------------------------------------------------------+
| |
| Vector Operations | Quantum Simulations |
| ------------------- | ----------------------- |
| Queries/sec: 12,450 | Simulations/min: 23 |
| P99 latency: 2.3ms | Avg execution: 145ms |
| Index size: 2.1M vectors | Avg qubits: 18.4 |
| | Peak memory: 4.2 GiB |
| | Backend: SV 87% / TN 13% |
| | Gates/sec: 2.1B |
| | Error rate: 0.02% |
| | |
| System Resources | Recent Simulations |
| ------------------- | ----------------------- |
| CPU: 34% | #a3f2.. 24q 230ms OK |
| Memory: 61% (49/80 GiB) | #b891.. 16q 12ms OK |
| Threads: 64/256 active | #c4d0.. 30q 1.2s OK |
| | #d122.. 35q ERR OOM |
+------------------------------------------------------------------+
```
Metrics are published via the existing `ruvector-metrics` WebSocket feed:
```json
{
"source": "ruqu",
"type": "simulation_complete",
"timestamp": "2026-02-06T14:23:01.442Z",
"data": {
"simulation_id": "a3f2e891-...",
"qubit_count": 24,
"execution_time_ms": 230.4,
"peak_memory_bytes": 268435456,
"backend": "StateVector",
"gates_per_second": 2147483648,
"success": true
}
}
```
### 9. Prometheus / OpenTelemetry Export
For external monitoring, the native metrics sink exports standard Prometheus
metrics:
```
# HELP ruqu_simulations_total Total quantum simulations executed
# TYPE ruqu_simulations_total counter
ruqu_simulations_total{backend="state_vector",status="success"} 1847
ruqu_simulations_total{backend="state_vector",status="error"} 3
ruqu_simulations_total{backend="tensor_network",status="success"} 241
# HELP ruqu_execution_time_ms Simulation execution time histogram
# TYPE ruqu_execution_time_ms histogram
ruqu_execution_time_ms_bucket{backend="state_vector",le="10"} 423
ruqu_execution_time_ms_bucket{backend="state_vector",le="100"} 1201
ruqu_execution_time_ms_bucket{backend="state_vector",le="1000"} 1834
ruqu_execution_time_ms_bucket{backend="state_vector",le="+Inf"} 1847
# HELP ruqu_peak_memory_bytes Peak memory during simulation
# TYPE ruqu_peak_memory_bytes gauge
ruqu_peak_memory_bytes 4294967296
# HELP ruqu_gates_per_second Gate application throughput
# TYPE ruqu_gates_per_second gauge
ruqu_gates_per_second 2.1e9
# HELP ruqu_max_supported_qubits Maximum qubits based on available memory
# TYPE ruqu_max_supported_qubits gauge
ruqu_max_supported_qubits 33
```
## Consequences
### Positive
1. **Unified observability**: Quantum simulation telemetry integrates seamlessly
with ruVector's existing monitoring infrastructure.
2. **Cross-platform**: The trait-based sink design supports native, WASM, and
testing environments without code changes in the engine.
3. **Actionable errors**: Structured errors with suggestions reduce debugging time
and improve developer experience.
4. **Performance visibility**: Gates-per-second, memory consumption, and backend
selection metrics enable informed performance tuning.
5. **Compliance ready**: Structured logging with simulation IDs supports audit
trail requirements.
### Negative
1. **Metric cardinality**: High-frequency simulations could generate significant
metric volume. Mitigated by aggregation at the sink level.
2. **WASM callback overhead**: JSON serialization for WASM metrics adds ~0.1ms per
simulation. Acceptable for typical workloads.
3. **Tracing overhead at DEBUG/TRACE**: Enabled tracing at low levels adds
measurable overhead. Production deployments should use INFO or above.
4. **Schema evolution**: Changes to `SimulationMetrics` require versioned handling
in consumers.
### Risks and Mitigations
| Risk | Mitigation |
|---|---|
| Metric volume overwhelming storage | Configurable sampling rate; aggregate in sink |
| WASM callback exceptions | Catch JS exceptions in callback wrapper; log to console |
| Schema breaking changes | Version field in metrics; consumer-side version dispatch |
| TRACE logging for large circuits | Qubit-count gate prevents amplitude logging above n=10 |
## References
- `ruvector-metrics` crate: internal metrics infrastructure
- Rust `tracing` crate: https://docs.rs/tracing
- OpenTelemetry Rust SDK: https://docs.rs/opentelemetry
- ADR-QE-005: WASM Compilation Target (WASM constraints)
- ADR-QE-011: Memory Gating & Power Management (resource monitoring)
- Prometheus exposition format: https://prometheus.io/docs/instrumenting/exposition_formats/

# ADR-QE-011: Memory Gating & Power Management
**Status**: Proposed
**Date**: 2026-02-06
**Authors**: ruv.io, RuVector Team
**Deciders**: Architecture Review Board
---
## Context
ruVector is designed to operate within the Cognitum computing paradigm: a tile-based
architecture with 256 low-power processor cores, event-driven activation, and
aggressive power gating. Agents (software components) remain fully dormant until an
event triggers their activation. Once their work completes, they release all
resources and return to dormancy.
The quantum simulation engine must adhere to this model:
1. **Zero idle footprint**: When no simulation is running, the engine consumes zero
CPU cycles and zero heap memory beyond its compiled code and static data.
2. **Rapid activation**: The engine must be ready to execute a simulation within
microseconds of receiving a request.
3. **Prompt resource release**: Upon simulation completion (or failure), all
allocated memory is immediately freed.
4. **Predictable memory**: Callers must be able to determine exact memory
requirements before committing to a simulation.
### Memory Scale
The state vector for n qubits requires 2^n complex amplitudes, each consuming 16
bytes (two f64 values):
| Qubits | Amplitudes | Memory | Notes |
|--------|-----------|--------|-------|
| 10 | 1,024 | 16 KiB | Trivial |
| 15 | 32,768 | 512 KiB | Small |
| 20 | 1,048,576 | 16 MiB | Moderate |
| 25 | 33,554,432 | 512 MiB | Large |
| 28 | 268,435,456 | 4 GiB | Needs dedicated memory |
| 30 | 1,073,741,824 | 16 GiB | Workstation-class |
| 32 | 4,294,967,296 | 64 GiB | Server-class |
| 35 | 34,359,738,368 | 512 GiB | HPC |
| 40 | 1,099,511,627,776 | 16 TiB | Infeasible (state vector) |
Each additional qubit doubles memory. This exponential scaling makes memory the
primary resource constraint and the most important resource to manage.
### Edge and Embedded Constraints
On edge devices (embedded ruVector nodes, IoT gateways, mobile processors), memory
is severely limited:
| Platform | Typical RAM | Max qubits (state vector) |
|----------|------------|--------------------------|
| Cognitum tile (single) | 256 MiB | 23 |
| Cognitum tile cluster (4) | 1 GiB | 25 |
| Raspberry Pi 4 | 8 GiB | 28 |
| Mobile device | 4-6 GiB | 27-28 (with other apps) |
| Laptop | 16-64 GiB | 29-31 |
| Server | 256-512 GiB | 33-34 |
### WASM Memory Model
WebAssembly uses a linear memory that can grow but cannot shrink. Once a large
simulation allocates pages, those pages remain mapped until the WASM instance is
destroyed. This is a fundamental platform limitation that must be documented and
accounted for.
## Decision
### 1. Zero-Idle Footprint Architecture
The quantum engine is implemented as a pure library with no runtime overhead:
```rust
// The engine is a collection of functions and types.
// No background threads, no event loops, no persistent state.
// When not called, it consumes exactly zero CPU and zero heap.
pub struct QuantumEngine; // Zero-sized type; purely a namespace
impl QuantumEngine {
/// Execute a simulation. All resources are allocated on entry
/// and freed on exit (or on error).
pub fn execute(
circuit: &QuantumCircuit,
shots: usize,
config: &SimulationConfig,
) -> Result<SimulationResult, SimulationError> {
// 1. Estimate and validate memory
let required = Self::estimate_memory(circuit.num_qubits());
Self::validate_memory_available(required)?;
// 2. Allocate state vector (the big allocation)
let mut state = Self::allocate_state(circuit.num_qubits())?;
// 3. Execute gates (all computation happens here)
Self::apply_gates(circuit, &mut state, config)?;
// 4. Measure (if requested)
let measurements = Self::measure(&state, shots)?;
// 5. Build result (copies out what we need)
let result = SimulationResult::from_state_and_measurements(
&state, measurements, circuit,
);
// 6. state is dropped here -- Vec<Complex<f64>> deallocated
// No cleanup needed. No finalizers. Just drop.
Ok(result)
}
// state goes out of scope and is deallocated by Rust's ownership system
}
```
Key properties:
- No `new()` or `init()` methods that create persistent state.
- No `Drop` impl with complex cleanup logic.
- No `Arc`, `Mutex`, or shared state between calls.
- Each call is fully independent and self-contained.
### 2. On-Demand Allocation Strategy
State vectors are allocated at simulation start and freed at simulation end:
```rust
fn allocate_state(n_qubits: u32) -> Result<StateVector, SimulationError> {
let num_amplitudes = 1_usize.checked_shl(n_qubits)
.ok_or(SimulationError::QubitLimitExceeded {
requested: n_qubits,
maximum: (usize::BITS - 1) as u32,
estimated_memory_bytes: u64::MAX,
available_memory_bytes: estimate_available_memory() as u64,
})?;
let required_bytes = num_amplitudes
.checked_mul(std::mem::size_of::<Complex<f64>>())
.ok_or(SimulationError::MemoryAllocationFailed {
requested_bytes: u64::MAX,
qubit_count: n_qubits,
suggestion: "Qubit count exceeds addressable memory",
})?;
// Attempt allocation gracefully: `try_reserve_exact` reports failure as an
// Err instead of aborting the process, letting us return a structured
// error rather than panicking or being OOM-killed.
let mut amplitudes = Vec::new();
amplitudes.try_reserve_exact(num_amplitudes)
.map_err(|_| SimulationError::MemoryAllocationFailed {
requested_bytes: required_bytes as u64,
qubit_count: n_qubits,
suggestion: "Reduce qubit count or use tensor-network backend",
})?;
// Initialize to |00...0> state
amplitudes.resize(num_amplitudes, Complex::new(0.0, 0.0));
amplitudes[0] = Complex::new(1.0, 0.0);
Ok(StateVector { amplitudes, n_qubits })
}
```
The allocation sequence:
```
IDLE (zero memory)
|
v
estimate_memory(n) --> returns bytes needed
|
v
validate_memory_available(bytes) --> checks against OS/platform limits
| returns Err if insufficient
v
Vec::try_reserve_exact(2^n) --> attempts allocation
| returns Err on failure (no panic)
v
ALLOCATED (2^n * 16 bytes on heap)
|
v
[... simulation runs ...]
|
v
Vec::drop() --> automatic deallocation
|
v
IDLE (zero memory)
```
### 3. Memory Estimation API
Callers can query exact memory requirements before committing:
```rust
/// Returns the number of bytes required to simulate n_qubits.
/// This accounts for the state vector plus working memory for
/// gate application (temporary buffers, measurement arrays, etc.).
///
/// # Returns
/// - `Ok(bytes)` if the qubit count is representable
/// - `Err(...)` if 2^n_qubits overflows usize
pub fn estimate_memory(n_qubits: u32) -> Result<MemoryEstimate, SimulationError> {
let num_amplitudes = 1_usize.checked_shl(n_qubits)
.ok_or(SimulationError::QubitLimitExceeded {
requested: n_qubits,
maximum: (usize::BITS - 1) as u32,
estimated_memory_bytes: u64::MAX,
available_memory_bytes: 0,
})?;
let state_vector_bytes = num_amplitudes * std::mem::size_of::<Complex<f64>>();
// Working memory: temporary buffer for gate application (1 amplitude slice)
// Plus measurement result storage
let working_bytes = num_amplitudes * std::mem::size_of::<Complex<f64>>() / 4;
// Thread-local scratch space (per Rayon thread)
let thread_count = rayon::current_num_threads();
let scratch_per_thread = 64 * 1024; // 64 KiB per thread for local buffers
let thread_scratch = thread_count * scratch_per_thread;
Ok(MemoryEstimate {
state_vector_bytes: state_vector_bytes as u64,
working_bytes: working_bytes as u64,
thread_scratch_bytes: thread_scratch as u64,
total_bytes: (state_vector_bytes + working_bytes + thread_scratch) as u64,
num_amplitudes: num_amplitudes as u64,
})
}
#[derive(Debug, Clone)]
pub struct MemoryEstimate {
/// Bytes for the state vector (dominant cost).
pub state_vector_bytes: u64,
/// Bytes for gate-application working memory.
pub working_bytes: u64,
/// Bytes for thread-local scratch space.
pub thread_scratch_bytes: u64,
/// Total estimated bytes.
pub total_bytes: u64,
/// Number of complex amplitudes.
pub num_amplitudes: u64,
}
impl MemoryEstimate {
/// Returns true if the estimate fits within the given byte budget.
pub fn fits_in(&self, available_bytes: u64) -> bool {
self.total_bytes <= available_bytes
}
/// Suggest the maximum qubits for a given memory budget.
pub fn max_qubits_for(available_bytes: u64) -> u32 {
// Each qubit doubles memory; find largest n where 20 * 2^n <= available
// Factor of 20 accounts for 16-byte amplitudes + 25% working memory
let effective = available_bytes / 20;
if effective == 0 { return 0; }
(effective.ilog2()) as u32
}
}
```
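The budget arithmetic above can be checked standalone with plain integer math. This sketch mirrors `max_qubits_for` (16 bytes per amplitude plus 25% working memory, hence the factor of 20) and assumes nothing beyond `std`:

```rust
/// Standalone sketch of the budget math in `max_qubits_for`:
/// each amplitude costs 16 bytes, plus 25% working memory = 20 bytes.
fn max_qubits_for(available_bytes: u64) -> u32 {
    let effective = available_bytes / 20;
    if effective == 0 {
        return 0;
    }
    effective.ilog2()
}

fn main() {
    // 32 MiB budget: a 20-qubit state vector (16 MiB) plus 4 MiB
    // working memory fits; 21 qubits would need 40 MiB.
    assert_eq!(max_qubits_for(32 * 1024 * 1024), 20);
    // 1 GiB budget: 25 qubits (512 MiB + 128 MiB) fits, 26 does not.
    assert_eq!(max_qubits_for(1 << 30), 25);
    assert_eq!(max_qubits_for(0), 0);
}
```

These values agree with the memory budget table in section 9.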
### 4. Allocation Failure Handling
The engine never panics on allocation failure. All paths return structured errors:
```rust
// Pattern: every allocation is fallible and returns a descriptive error.
// State vector allocation failure:
SimulationError::MemoryAllocationFailed {
requested_bytes: 17_179_869_184, // 16 GiB
qubit_count: 30,
suggestion: "Reduce qubit count by 2 (to 28, ~4 GiB) or enable tensor-network backend",
}
// Integer overflow (qubit count too large):
SimulationError::QubitLimitExceeded {
requested: 64,
    maximum: 31, // floor(log2(available / 20)) with 64 GiB available
estimated_memory_bytes: u64::MAX,
available_memory_bytes: 68_719_476_736, // 64 GiB
}
```
Decision tree on allocation failure:
```
Memory allocation failed
|
+-- Is tensor-network feature enabled?
| |
| +-- YES: Suggest tensor-network backend
| | (may work if circuit has low treewidth)
| |
| +-- NO: Suggest reducing qubit count
| Calculate: max_qubits = floor(log2(available / 20))
| Suggest: "Reduce to {max_qubits} qubits ({memory} bytes)"
|
+-- Is the request wildly over budget (>100x)?
| |
| +-- YES: "Circuit requires {X} GiB but only {Y} MiB available"
| |
| +-- NO: "Circuit requires {X} GiB, {Y} GiB available.
| Reducing by {delta} qubits would fit."
|
+-- Return SimulationError (no panic, no abort)
```
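The no-panic guarantee rests on fallible allocation. Below is a minimal standalone sketch of the pattern, using `Vec::try_reserve_exact` from `std`; a `[f64; 2]` pair stands in for `Complex<f64>`, and the error type is simplified to `String` for illustration:

```rust
use std::mem::size_of;

/// Sketch of the no-panic allocation pattern: `Vec::try_reserve_exact`
/// reports failure as a `TryReserveError` instead of aborting, so the
/// engine can convert it into a structured `SimulationError`.
fn allocate_state(n_qubits: u32) -> Result<Vec<[f64; 2]>, String> {
    let amps = 1usize
        .checked_shl(n_qubits)
        .ok_or_else(|| format!("2^{n_qubits} overflows usize"))?;
    let mut v: Vec<[f64; 2]> = Vec::new();
    v.try_reserve_exact(amps).map_err(|e| {
        format!(
            "allocation of {} bytes failed: {e}",
            amps * size_of::<[f64; 2]>()
        )
    })?;
    // Initialize |00...0>: amplitude 1 at index 0, zero elsewhere.
    v.resize(amps, [0.0, 0.0]);
    v[0] = [1.0, 0.0];
    Ok(v)
}

fn main() {
    let state = allocate_state(10).expect("10 qubits fits any budget");
    assert_eq!(state.len(), 1024);
    assert_eq!(state[0], [1.0, 0.0]);
    // An absurd qubit count fails gracefully instead of panicking.
    assert!(allocate_state(200).is_err());
}
```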
### 5. CPU Yielding for Long Simulations
For simulations estimated to exceed 100ms, the engine can optionally yield between
gate batches to allow the OS scheduler to manage power states:
```rust
pub struct YieldConfig {
/// Enable cooperative yielding between gate batches.
/// Default: false (maximum throughput).
pub enabled: bool,
/// Number of gates to apply before yielding.
/// Default: 1000.
pub gates_per_slice: usize,
/// Yield mechanism.
/// Default: ThreadYield (std::thread::yield_now).
pub yield_strategy: YieldStrategy,
}
pub enum YieldStrategy {
/// Call std::thread::yield_now() between slices.
ThreadYield,
/// Sleep for specified duration between slices.
Sleep(Duration),
/// Call a user-provided callback between slices.
Callback(Box<dyn Fn(SliceProgress) + Send>),
}
pub struct SliceProgress {
pub gates_completed: u64,
pub gates_remaining: u64,
pub elapsed: Duration,
pub estimated_remaining: Duration,
}
// Usage in gate application loop:
fn apply_gates_with_yield(
circuit: &QuantumCircuit,
state: &mut StateVector,
yield_config: &YieldConfig,
) -> Result<(), SimulationError> {
    let start = std::time::Instant::now();
    let gates = circuit.gates();
for (i, gate) in gates.iter().enumerate() {
apply_single_gate(gate, state)?;
if yield_config.enabled && (i + 1) % yield_config.gates_per_slice == 0 {
match &yield_config.yield_strategy {
YieldStrategy::ThreadYield => std::thread::yield_now(),
YieldStrategy::Sleep(d) => std::thread::sleep(*d),
YieldStrategy::Callback(cb) => cb(SliceProgress {
gates_completed: (i + 1) as u64,
gates_remaining: (gates.len() - i - 1) as u64,
elapsed: start.elapsed(),
estimated_remaining: estimate_remaining(i, gates.len(), start),
}),
}
}
}
Ok(())
}
```
Yield is **disabled by default** to maximize throughput. It is primarily intended
for:
- Edge devices where power management is critical.
- Interactive applications where UI responsiveness matters.
- Long-running simulations (>1 second) where progress reporting is needed.
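A std-only sketch of the slicing loop follows; the closure stands in for `apply_single_gate`, and only the default `ThreadYield` strategy is shown:

```rust
use std::time::Instant;

/// Minimal sketch of batched yielding: process `total_gates` items,
/// yielding to the scheduler every `slice` items. Returns the number
/// of yields performed. The closure replaces real gate application.
fn run_sliced(total_gates: usize, slice: usize, mut apply: impl FnMut(usize)) -> usize {
    let mut yields = 0;
    for i in 0..total_gates {
        apply(i);
        if (i + 1) % slice == 0 {
            std::thread::yield_now();
            yields += 1;
        }
    }
    yields
}

fn main() {
    let start = Instant::now();
    let mut applied = 0usize;
    let yields = run_sliced(2500, 1000, |_| applied += 1);
    assert_eq!(applied, 2500);
    // 2500 gates with a 1000-gate slice yields after gates 1000 and 2000.
    assert_eq!(yields, 2);
    println!("done in {:?}", start.elapsed());
}
```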
### 6. Thread Management
The quantum engine does not create or manage its own threads:
```
+-----------------------------------------------+
| Global Rayon Thread Pool |
| (shared by all ruVector subsystems) |
| |
| [Thread 0] [Thread 1] ... [Thread N-1] |
| ^ ^ ^ |
| | | | |
| +--+---+ +--+---+ +---+--+ |
| | ruQu | | ruQu | | idle | |
| | gate | | gate | | | |
| | apply | | apply| | | |
| +-------+ +------+ +------+ |
| |
| During simulation: threads work on gates |
| After simulation: threads return to pool |
| Pool idle: OS can power-gate cores |
+-----------------------------------------------+
```
Key properties:
- Rayon's global thread pool is initialized once by `ruvector-core` at startup.
- The quantum engine uses Rayon's parallel iterators (`par_iter()` and related
  APIs from `rayon::prelude`), borrowing threads temporarily.
- When simulation completes, all threads are returned to the global pool.
- If no ruVector work is pending, Rayon threads park (blocking on a condvar),
consuming zero CPU. The OS can then power-gate the underlying cores.
### 7. WASM Memory Considerations
WebAssembly linear memory has a specific behavior that affects resource management:
```
WASM Memory Layout
+------------------+------------------+
| Initial pages | Grown pages |
| (compiled size) | (runtime alloc) |
+------------------+------------------+
0 initial_size current_size
Growth: memory.grow(delta_pages) -> adds pages to the end
Shrink: NOT SUPPORTED in WASM spec
After 25-qubit simulation:
+------------------+----------------------------------+
| Initial (1 MiB) | Grown for state vec (512 MiB) | <- HIGH WATER MARK
+------------------+----------------------------------+
After simulation completes:
+------------------+----------------------------------+
| Initial (1 MiB) | FREED internally but pages |
| | still mapped (512 MiB virtual) |
+------------------+----------------------------------+
The Rust allocator returns memory to its free list,
but WASM pages are not returned to the host.
```
**Implications and mitigations**:
1. **Document the behavior**: Users must understand that WASM memory is a high-water
mark. A 25-qubit simulation permanently increases the WASM instance's memory
footprint to ~512 MiB.
2. **Instance recycling**: For applications that run multiple simulations, create a
new WASM instance periodically to reset the memory high-water mark.
3. **Memory budget enforcement**: The WASM host can set `WebAssembly.Memory` with a
`maximum` parameter to cap growth:
```javascript
const memory = new WebAssembly.Memory({
initial: 16, // 1 MiB
maximum: 8192, // 512 MiB cap
});
```
4. **Pre-check in WASM**: The engine's `estimate_memory()` function works in WASM
and should be called before simulation to verify the allocation will succeed.
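The pre-check amounts to page arithmetic: convert the byte estimate into 64 KiB WASM pages and compare against the host-configured `maximum`. A sketch (the helper names here are illustrative, not engine API):

```rust
/// WASM linear memory is managed in 64 KiB pages.
const WASM_PAGE_BYTES: u64 = 64 * 1024;

/// Round a byte estimate up to whole WASM pages.
fn pages_needed(total_bytes: u64) -> u64 {
    (total_bytes + WASM_PAGE_BYTES - 1) / WASM_PAGE_BYTES
}

/// Compare against the `maximum` set on the host's WebAssembly.Memory.
fn fits_in_wasm(total_bytes: u64, max_pages: u64) -> bool {
    pages_needed(total_bytes) <= max_pages
}

fn main() {
    // 24 qubits: 256 MiB state vector + 64 MiB working = 320 MiB,
    // which is 5120 pages and fits under an 8192-page (512 MiB) cap.
    let bytes_24q = 320 * 1024 * 1024;
    assert_eq!(pages_needed(bytes_24q), 5120);
    assert!(fits_in_wasm(bytes_24q, 8192));
    // 26 qubits (1 GiB + 256 MiB = 1280 MiB) does not.
    assert!(!fits_in_wasm(1280 * 1024 * 1024, 8192));
}
```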
### 8. Cognitum Tile Integration
On Cognitum's tile-based architecture, the quantum engine maps to tiles as follows:
```
Cognitum Processor (256 tiles)
+--------+--------+--------+--------+
| Tile 0 | Tile 1 | Tile 2 | Tile 3 | <- Assigned to quantum sim
| ACTIVE | ACTIVE | ACTIVE | ACTIVE |
+--------+--------+--------+--------+
| Tile 4 | Tile 5 | Tile 6 | Tile 7 | <- Other ruVector work (or sleeping)
| sleep | vecDB | sleep | sleep |
+--------+--------+--------+--------+
| ... | ... | ... | ... |
| sleep | sleep | sleep | sleep | <- Power gated (zero consumption)
+--------+--------+--------+--------+
```
**Power state diagram for a quantum simulation lifecycle**:
```
State: ALL_TILES_IDLE
|
| Simulation request arrives
v
State: ALLOCATING
Action: Wake tiles 0-3 (or however many are needed)
Action: Allocate state vector across tile-local memory
Power: Tiles 0-3 ACTIVE, rest SLEEP
|
v
State: SIMULATING
Action: Apply gates in parallel across active tiles
Power: Tiles 0-3 at full clock rate
Duration: microseconds to seconds depending on circuit
|
v
State: MEASURING
Action: Sample measurement outcomes
Power: Tile 0 only (measurement is sequential)
|
v
State: DEALLOCATING
Action: Free state vector
Action: Return tiles to idle pool
|
v
State: ALL_TILES_IDLE
Power: Tiles 0-3 back to SLEEP
Memory: Zero heap allocation
```
**Tile assignment policy**:
- Small simulations (n <= 20): 1 tile sufficient.
- Medium simulations (20 < n <= 25): 2-4 tiles for parallel gate application.
- Large simulations (25 < n <= 30): All available tiles.
- The tile scheduler (part of Cognitum runtime) handles assignment. The quantum
engine simply uses Rayon parallelism; the runtime maps Rayon threads to tiles.
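The policy above can be sketched as a pure lookup. The exact count for the medium band (fixed here at the upper end, 4) is a placeholder choice; the real assignment is made by the Cognitum tile scheduler:

```rust
/// Sketch of the tile assignment policy from this section.
/// Small: 1 tile; medium: up to 4; large: everything available.
fn tiles_for(n_qubits: u32, available_tiles: u32) -> u32 {
    match n_qubits {
        0..=20 => 1,
        21..=25 => 4.min(available_tiles),
        _ => available_tiles,
    }
}

fn main() {
    assert_eq!(tiles_for(18, 256), 1);
    assert_eq!(tiles_for(24, 256), 4);
    assert_eq!(tiles_for(28, 256), 256);
    assert_eq!(tiles_for(24, 2), 2); // never more tiles than exist
}
```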
### 9. Memory Budget Table
Quick reference for capacity planning:
| Qubits | State Vector | Working Memory | Total | Platform Fit |
|--------|-------------|---------------|-------|-------------|
| 10 | 16 KiB | 4 KiB | 20 KiB | Any |
| 12 | 64 KiB | 16 KiB | 80 KiB | Any |
| 14 | 256 KiB | 64 KiB | 320 KiB | Any |
| 16 | 1 MiB | 256 KiB | 1.3 MiB | Any |
| 18 | 4 MiB | 1 MiB | 5 MiB | Any |
| 20 | 16 MiB | 4 MiB | 20 MiB | Any |
| 22 | 64 MiB | 16 MiB | 80 MiB | Cognitum single tile |
| 24 | 256 MiB | 64 MiB | 320 MiB | Cognitum 2+ tiles |
| 26 | 1 GiB | 256 MiB | 1.3 GiB | Cognitum cluster |
| 28 | 4 GiB | 1 GiB | 5 GiB | Laptop / RPi 8GB |
| 30 | 16 GiB | 4 GiB | 20 GiB | Workstation |
| 32 | 64 GiB | 16 GiB | 80 GiB | Server |
| 34 | 256 GiB | 64 GiB | 320 GiB | Large server |
### 10. Allocation and Deallocation Sequence Diagram
```
Caller Engine OS/Allocator
| | |
| execute(circuit) | |
|-------------------->| |
| | |
| | estimate_memory(n) |
| | validate_available() |
| | |
| | try_reserve_exact(2^n) |
| |------------------------>|
| | |
| | Ok(ptr) or Err |
| |<------------------------|
| | |
| | [if Err: return |
| | SimulationError] |
| | |
| | initialize |00...0> |
| | apply gates |
| | measure |
| | |
| | build result |
| | (copies measurements, |
| | expectation values) |
| | |
| | drop(state_vector) |
| |------------------------>|
| | | free(ptr, 2^n * 16)
| | |
| Ok(result) | |
|<--------------------| |
| | |
| [Engine holds ZERO | |
| heap memory now] | |
```
## Consequences
### Positive
1. **True zero-idle cost**: No background resource consumption. Perfectly aligned
with Cognitum's event-driven architecture and power gating.
2. **Predictable memory**: `estimate_memory()` gives exact requirements before
committing, preventing OOM surprises.
3. **Graceful degradation**: Allocation failures return structured errors with
actionable suggestions, never panics.
4. **Platform portable**: The same allocation strategy works on native (Linux, macOS,
Windows), WASM, and embedded (Cognitum tiles).
5. **No resource leaks**: Rust's ownership system guarantees deallocation on all
exit paths (success, error, panic).
### Negative
1. **No state caching**: Each simulation allocates and deallocates independently.
Repeated simulations on the same qubit count pay allocation cost each time.
Mitigation: allocation is O(2^n) but fast compared to O(G * 2^n) simulation.
2. **WASM memory high-water mark**: Cannot reclaim WASM linear memory pages.
Documented as a platform limitation with instance-recycling workaround.
3. **No memory pooling**: Could theoretically amortize allocation across simulations,
but this conflicts with the zero-idle-footprint requirement.
4. **Yield overhead**: When enabled, cooperative yielding adds per-slice overhead.
Mitigated by making it opt-in and configurable.
### Risks and Mitigations
| Risk | Likelihood | Impact | Mitigation |
|---|---|---|---|
| OOM despite estimate_memory check | Low | Crash | Check returns conservative estimate including working memory |
| WASM instance runs out of address space | Medium | Failure | Set `WebAssembly.Memory` maximum; document limitation |
| Allocation latency spike (OS page faults) | Medium | Slow start | Consider `madvise` / `mlock` hints for large allocations |
| Rayon thread pool contention | Medium | Degraded perf | Quantum engine yields between slices; Rayon work-stealing handles contention |
## References
- Cognitum Architecture Specification: event-driven tile-based computing
- Rust `Vec::try_reserve_exact`: https://doc.rust-lang.org/std/vec/struct.Vec.html#method.try_reserve_exact
- WebAssembly Memory: https://webassembly.github.io/spec/core/syntax/modules.html#memories
- Rayon thread pool: https://docs.rs/rayon
- ADR-QE-001: Core Engine Architecture (zero-overhead design principle)
- ADR-QE-005: WASM Compilation Target (WASM constraints)
- ADR-QE-009: Tensor Network Evaluation Mode (alternative for large circuits)
- ADR-QE-010: Observability & Monitoring (memory metrics reporting)
# ADR-QE-012: Min-Cut Coherence Integration
**Status**: Proposed
**Date**: 2026-02-06
**Authors**: ruv.io, RuVector Team
**Deciders**: Architecture Review Board
---
## Context
The ruVector ecosystem contains several components that must work together for
quantum error correction (QEC) simulation:
1. **ruQu (existing)**: A real-time coherence gating system that performs
boundary-to-boundary min-cut analysis on surface code error patterns. It includes
a three-filter syndrome pipeline (Structural | Shift | Evidence), a Minimum Weight
Perfect Matching (MWPM) decoder, and an early warning system that predicts
correlated failures 100+ cycles ahead.
2. **ruvector-mincut (existing)**: A graph partitioning crate that computes minimum
cuts and balanced partitions. Currently used for vector index sharding but
directly applicable to syndrome graph decomposition.
3. **Coherence Engine (ADR-014)**: Computes coherence energy via sheaf Laplacian
analysis. The "mincut-gated-transformer" concept uses coherence energy to skip
computation on "healthy" regions, achieving up to 50% FLOPs reduction.
4. **Quantum Simulation Engine (new, ADR-QE-001 through ADR-QE-011)**: The
state-vector and tensor-network simulator being designed in this ADR series.
The challenge is integrating these components into a coherent (pun intended)
pipeline where simulated quantum circuits produce syndromes, those syndromes are
decoded in real-time, and coherence analysis feeds back into simulation parameters.
### Surface Code Background
A distance-d surface code encodes 1 logical qubit in d^2 data qubits + (d^2 - 1)
ancilla qubits:
| Distance | Data qubits | Ancilla qubits | Total qubits | Error threshold |
|----------|------------|----------------|--------------|----------------|
| 3 | 9 | 8 | 17 | ~1% |
| 5 | 25 | 24 | 49 | ~1% |
| 7 | 49 | 48 | 97 | ~1% |
| 9 | 81 | 80 | 161 | ~1% |
| 11 | 121 | 120 | 241 | ~1% |
Syndrome extraction involves measuring ancilla qubits each cycle. The measurement
outcomes (syndromes) indicate where errors may have occurred. The decoder's job is
to determine the most likely error pattern from the syndrome and apply corrections.
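The qubit counts in the table follow directly from the d^2 data plus d^2 - 1 ancilla layout; a quick arithmetic check:

```rust
/// A distance-d surface code uses d^2 data qubits and d^2 - 1 ancilla
/// qubits, for 2d^2 - 1 total; this reproduces the table above.
fn surface_code_qubits(d: u32) -> (u32, u32, u32) {
    let data = d * d;
    let ancilla = d * d - 1;
    (data, ancilla, data + ancilla)
}

fn main() {
    assert_eq!(surface_code_qubits(3), (9, 8, 17));
    assert_eq!(surface_code_qubits(5), (25, 24, 49));
    assert_eq!(surface_code_qubits(7), (49, 48, 97));
    assert_eq!(surface_code_qubits(11), (121, 120, 241));
}
```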
### Performance Requirements
ruQu's existing decoder targets P99 latency of <4 microseconds for syndrome
decoding. The integrated simulation + decode pipeline must meet:
| Operation | Target latency | Notes |
|-----------|---------------|-------|
| Single syndrome decode | <4 us | Existing ruQu target (MWPM) |
| Syndrome extraction sim | <5 ms | One round of ancilla measurement |
| Full cycle (sim + decode) | <10 ms | Distance-3, single error cycle |
| Full cycle (sim + decode) | <50 ms | Distance-5 |
| Full cycle (sim + decode) | <200 ms | Distance-7 (tensor network) |
| Early warning evaluation | <1 ms | Check predicted vs actual syndromes |
## Decision
### 1. Architecture Overview
The integration follows a pipeline architecture where data flows from quantum
simulation through syndrome extraction, filtering, decoding, and coherence analysis:
```
+------------------------------------------------------------------+
| Quantum Error Correction Pipeline |
+------------------------------------------------------------------+
| |
| +------------------+ +---------------------+ |
| | Quantum Circuit | | Error Model | |
| | (surface code |---->| (depolarizing, | |
| | syndrome | | biased noise, | |
| | extraction) | | correlated) | |
| +------------------+ +---------------------+ |
| | | |
| v v |
| +--------------------------------------------+ |
| | Quantum Simulation Engine | |
| | (state vector or tensor network) | |
| | - Simulates noisy syndrome extraction | |
| | - Outputs ancilla measurement outcomes | |
| +--------------------------------------------+ |
| | |
| | syndrome bitstring |
| v |
| +--------------------------------------------+ |
| | SyndromeFilter (ruQu) | |
| | Filter 1: Structural (lattice geometry) | |
| | Filter 2: Shift (temporal correlations) | |
| | Filter 3: Evidence (statistical weight) | |
| +--------------------------------------------+ |
| | |
| | filtered syndrome |
| v |
| +--------------------------------------------+ |
| | MWPM Decoder (ruQu) | |
| | - Minimum Weight Perfect Matching | |
| | - Returns Pauli correction operators | |
| | - Target: <4 us P99 latency | |
| +--------------------------------------------+ |
| | |
| | correction operators (X, Z Paulis) |
| v |
| +--------------------------------------------+ |
| | Correction Application | |
| | - Apply Pauli gates to simulated state | |
| | - Verify logical qubit integrity | |
| +--------------------------------------------+ |
| | |
| | corrected state |
| v |
| +-----------------------+ +-------------------------+ |
| | Coherence Engine | | Early Warning System | |
| | (sheaf Laplacian) | | (100+ cycle prediction) | |
| | - Compute coherence |<-->| - Correlate historical | |
| | energy | | syndromes | |
| | - Gate simulation | | - Predict failures | |
| | FLOPs if healthy | | - Feed back to sim | |
| +-----------------------+ +-------------------------+ |
| | | |
| v v |
| +--------------------------------------------+ |
| | Cryptographic Audit Trail | |
| | - Ed25519 signed decisions | |
| | - Blake3 hash chains | |
| | - Every syndrome, decode, correction logged | |
| +--------------------------------------------+ |
| |
+------------------------------------------------------------------+
```
### 2. Syndrome-to-Decoder Bridge
The quantum simulation engine outputs raw measurement bitstrings. These are
converted to the syndrome format expected by ruQu's decoder:
```rust
/// Bridge between quantum simulation output and ruQu decoder input.
pub struct SyndromeBridge;
impl SyndromeBridge {
/// Convert simulation measurement outcomes to ruQu syndrome format.
///
/// The simulation measures ancilla qubits. A detection event occurs
/// when an ancilla measurement differs from the previous round
/// (or from the expected value in the first round).
pub fn extract_syndrome(
measurements: &MeasurementOutcome,
code: &SurfaceCodeLayout,
previous_round: Option<&SyndromeRound>,
) -> SyndromeRound {
let mut detections = Vec::new();
for ancilla in code.ancilla_qubits() {
let current = measurements.get(ancilla.index());
let previous = previous_round
.map(|r| r.get(ancilla.id()))
.unwrap_or(0); // Expected value in first round
if current != previous {
detections.push(Detection {
ancilla_id: ancilla.id(),
ancilla_type: ancilla.stabilizer_type(), // X or Z
position: ancilla.lattice_position(),
round: measurements.round_number(),
});
}
}
SyndromeRound {
round: measurements.round_number(),
detections,
raw_measurements: measurements.ancilla_bits().to_vec(),
}
}
/// Apply decoder corrections back to the simulation state.
pub fn apply_corrections(
state: &mut StateVector,
corrections: &DecoderCorrection,
code: &SurfaceCodeLayout,
) {
for (qubit_id, pauli) in &corrections.operations {
let qubit_index = code.data_qubit_index(*qubit_id);
match pauli {
Pauli::X => state.apply_x(qubit_index),
Pauli::Z => state.apply_z(qubit_index),
Pauli::Y => {
state.apply_x(qubit_index);
state.apply_z(qubit_index);
}
Pauli::I => {} // No correction needed
}
}
}
}
```
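The bridge's detection rule reduces to an XOR of consecutive ancilla rounds, with the first round compared against the expected all-zero outcome. A standalone sketch with plain bit slices standing in for `MeasurementOutcome`:

```rust
/// Detection events are the XOR of consecutive ancilla measurement
/// rounds; the first round is compared against the expected all-zero
/// outcome. Returns the indices of ancillas that flipped.
fn detection_events(current: &[u8], previous: Option<&[u8]>) -> Vec<usize> {
    current
        .iter()
        .enumerate()
        .filter(|(i, &bit)| {
            let prev = previous.map_or(0, |p| p[*i]);
            bit != prev
        })
        .map(|(i, _)| i)
        .collect()
}

fn main() {
    // Round 1: ancilla 2 fires relative to the expected all-zero state.
    let r1 = [0, 0, 1, 0];
    assert_eq!(detection_events(&r1, None), vec![2]);
    // Round 2: ancilla 2 stays set (no new event); ancilla 3 flips.
    let r2 = [0, 0, 1, 1];
    assert_eq!(detection_events(&r2, Some(&r1)), vec![3]);
}
```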
### 3. SyndromeFilter Pipeline (ruQu Integration)
The three-filter pipeline processes raw syndromes before decoding:
```rust
/// ruQu's three-stage syndrome filtering pipeline.
pub struct SyndromeFilterPipeline {
structural: StructuralFilter,
shift: ShiftFilter,
evidence: EvidenceFilter,
}
impl SyndromeFilterPipeline {
/// Process a syndrome round through all three filters.
pub fn filter(&mut self, syndrome: SyndromeRound) -> FilteredSyndrome {
// Filter 1: Structural
// Removes detections inconsistent with lattice geometry.
// E.g., isolated detections with no nearby partner.
let after_structural = self.structural.apply(&syndrome);
// Filter 2: Shift
// Accounts for temporal correlations between rounds.
// Detections that appear and disappear in consecutive rounds
// may be measurement errors (not data errors).
let after_shift = self.shift.apply(&after_structural);
// Filter 3: Evidence
// Weights remaining detections by statistical evidence.
// Uses error model probabilities to assign confidence scores.
let after_evidence = self.evidence.apply(&after_shift);
after_evidence
}
}
```
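As a toy illustration of Filter 1's core heuristic (not ruQu's actual implementation), a detection with no partner within a small lattice distance can be treated as likely noise and dropped. Real structural filtering also accounts for lattice boundaries, where lone detections are legitimate:

```rust
/// Toy sketch of the Structural filter's pairing heuristic: drop any
/// detection with no partner within Manhattan distance 2 on the
/// lattice. Boundary handling is deliberately omitted.
fn structural_filter(detections: &[(i32, i32)]) -> Vec<(i32, i32)> {
    detections
        .iter()
        .copied()
        .filter(|&(x, y)| {
            detections.iter().any(|&(ox, oy)| {
                (ox, oy) != (x, y) && (ox - x).abs() + (oy - y).abs() <= 2
            })
        })
        .collect()
}

fn main() {
    // Two adjacent detections keep each other; the distant one drops.
    let raw = [(0, 0), (1, 0), (10, 10)];
    assert_eq!(structural_filter(&raw), vec![(0, 0), (1, 0)]);
}
```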
### 4. MWPM Decoder Integration
The filtered syndrome feeds into ruQu's MWPM decoder:
```rust
/// Interface to ruQu's Minimum Weight Perfect Matching decoder.
pub trait SyndromeDecoder {
/// Decode a filtered syndrome into correction operations.
/// Target: <4 microseconds P99 latency.
fn decode(
&self,
syndrome: &FilteredSyndrome,
code: &SurfaceCodeLayout,
) -> DecoderCorrection;
/// Decode with timing information for performance monitoring.
fn decode_timed(
&self,
syndrome: &FilteredSyndrome,
code: &SurfaceCodeLayout,
) -> (DecoderCorrection, DecoderTiming);
}
pub struct DecoderCorrection {
/// Pauli corrections to apply to data qubits.
pub operations: Vec<(QubitId, Pauli)>,
/// Confidence score (0.0 = no confidence, 1.0 = certain).
pub confidence: f64,
/// Whether a logical error was detected (correction may be wrong).
pub logical_error_detected: bool,
/// Matching weight (lower is more likely).
pub matching_weight: f64,
}
pub struct DecoderTiming {
/// Total decode time.
pub total_ns: u64,
/// Time spent building the matching graph.
pub graph_construction_ns: u64,
/// Time spent in the MWPM algorithm.
pub matching_ns: u64,
/// Number of detection events in the input.
pub num_detections: usize,
}
```
### 5. Min-Cut Graph Partitioning for Parallel Decoding
For large surface codes (distance >= 7), the syndrome graph can be partitioned
using `ruvector-mincut` for parallel decoding:
```rust
use rayon::prelude::*;
use ruvector_mincut::{partition, Objective, PartitionConfig, WeightedGraph};
/// Partition the syndrome graph for parallel decoding.
/// This exploits spatial locality in the surface code: errors in
/// distant regions can be decoded independently.
pub fn parallel_decode(
syndrome: &FilteredSyndrome,
code: &SurfaceCodeLayout,
decoder: &dyn SyndromeDecoder,
) -> DecoderCorrection {
// Build the detection graph (nodes = detections, edges = possible errors)
let detection_graph = build_detection_graph(syndrome, code);
// If small enough, decode directly
if detection_graph.num_nodes() <= 20 {
return decoder.decode(syndrome, code);
}
// Partition the detection graph using ruvector-mincut
let config = PartitionConfig {
num_partitions: estimate_partition_count(&detection_graph),
balance_factor: 1.2,
minimize: Objective::EdgeCut,
};
let partitions = partition(&detection_graph, &config);
// Decode each partition independently (in parallel via Rayon)
let partial_corrections: Vec<DecoderCorrection> = partitions
.par_iter()
.map(|partition| {
let sub_syndrome = syndrome.restrict_to(partition);
decoder.decode(&sub_syndrome, code)
})
.collect();
// Handle boundary edges (detections that span partitions)
let boundary_correction = decode_boundary_edges(
syndrome, code, &partitions, decoder,
);
// Merge all corrections
merge_corrections(partial_corrections, boundary_correction)
}
/// Estimate optimal partition count based on detection density.
fn estimate_partition_count(graph: &WeightedGraph) -> usize {
let n = graph.num_nodes();
if n <= 20 { 1 }
else if n <= 50 { 2 }
else if n <= 100 { 4 }
else { (n / 25).min(rayon::current_num_threads()) }
}
```
This matches ruQu's existing boundary-to-boundary min-cut analysis: the partition
boundaries correspond to the cuts in the syndrome graph where independent decoding
regions meet.
### 6. Coherence Gating for Simulation FLOPs Reduction
The sheaf Laplacian coherence energy (from ADR-014) provides a measure of how
"healthy" a quantum state region is. High coherence energy means the region is
behaving as expected (low error rate). This enables a novel optimization:
```
Coherence Gating Decision Tree
================================
For each region R of the surface code:
1. Compute coherence energy E(R) via sheaf Laplacian
2. Compare to thresholds:
E(R) > E_high (0.95)
|
+-- Region is HEALTHY
| Action: SKIP detailed simulation for this region
| Use: simplified noise model (Pauli channel approximation)
| Savings: ~50% FLOPs for this region
|
E_low (0.70) < E(R) <= E_high (0.95)
|
+-- Region is NOMINAL
| Action: STANDARD simulation
| Use: full gate-by-gate simulation with noise
| Savings: none
|
E(R) <= E_low (0.70)
|
+-- Region is DEGRADED
| Action: ENHANCED simulation
| Use: full simulation + additional diagnostics
| Extra: log detailed error patterns, trigger early warning
| Savings: negative (more work, but necessary)
```
Implementation:
```rust
/// Coherence-gated simulation mode.
/// Uses coherence energy to decide simulation fidelity per region.
pub struct CoherenceGatedSimulator {
/// Full-fidelity simulator for nominal/degraded regions.
full_simulator: Box<dyn SimulationBackend>,
/// Simplified simulator for healthy regions.
simplified_simulator: SimplifiedNoiseModel,
/// Coherence engine for computing region health.
coherence_engine: CoherenceEngine,
/// Thresholds for gating decisions.
high_threshold: f64,
low_threshold: f64,
}
impl CoherenceGatedSimulator {
/// Simulate one QEC cycle with coherence gating.
pub fn simulate_cycle(
&mut self,
state: &mut StateVector,
code: &SurfaceCodeLayout,
error_model: &ErrorModel,
history: &SyndromeHistory,
) -> CycleResult {
// Step 1: Compute coherence energy per region
let regions = code.spatial_regions();
let coherence = self.coherence_engine.compute_regional(
history, &regions,
);
// Step 2: Classify regions and simulate accordingly
let mut cycle_syndromes = Vec::new();
let mut flops_saved = 0_u64;
let mut flops_total = 0_u64;
for (region, energy) in regions.iter().zip(coherence.energies()) {
let region_qubits = code.qubits_in_region(region);
if *energy > self.high_threshold {
// HEALTHY: Use simplified Pauli noise model
let syndrome = self.simplified_simulator.simulate_region(
state, &region_qubits, error_model,
);
let full_cost = estimate_full_sim_cost(&region_qubits);
let simplified_cost = estimate_simplified_cost(&region_qubits);
flops_saved += full_cost - simplified_cost;
flops_total += simplified_cost;
cycle_syndromes.push(syndrome);
} else if *energy > self.low_threshold {
// NOMINAL: Full simulation
let syndrome = self.full_simulator.simulate_region(
state, &region_qubits, error_model,
);
let cost = estimate_full_sim_cost(&region_qubits);
flops_total += cost;
cycle_syndromes.push(syndrome);
} else {
// DEGRADED: Full simulation + diagnostics
let syndrome = self.full_simulator.simulate_region_with_diagnostics(
state, &region_qubits, error_model,
);
                // Diagnostics add ~20% over the full simulation cost.
                let cost = estimate_full_sim_cost(&region_qubits) * 12 / 10;
flops_total += cost;
cycle_syndromes.push(syndrome);
// Trigger early warning system
tracing::warn!(
region = %region.id(),
coherence_energy = energy,
"Degraded coherence detected; enhanced monitoring active"
);
}
}
CycleResult {
syndromes: merge_region_syndromes(cycle_syndromes),
flops_saved,
flops_total,
coherence_energies: coherence,
}
}
}
```
### 7. Cryptographic Audit Trail
All syndrome decisions are signed and chained for tamper-evident logging, following
the existing ruQu pattern:
```rust
use ed25519_dalek::{SigningKey, Signature, Signer};
use blake3::Hasher;
/// Cryptographically auditable decision record.
#[derive(Debug, Serialize, Deserialize)]
pub struct AuditRecord {
/// Sequence number in the audit chain.
pub sequence: u64,
/// Blake3 hash of the previous record (chain linkage).
pub previous_hash: [u8; 32],
/// Timestamp (nanosecond precision).
pub timestamp_ns: u128,
/// The decision being recorded.
pub decision: AuditableDecision,
/// Ed25519 signature over (sequence || previous_hash || timestamp || decision).
pub signature: Signature,
}
#[derive(Debug, Serialize, Deserialize)]
pub enum AuditableDecision {
/// Raw syndrome from simulation.
SyndromeExtracted {
round: u64,
detections: Vec<Detection>,
simulation_id: Uuid,
},
/// Filtered syndrome after pipeline.
SyndromeFiltered {
round: u64,
detections_before: usize,
detections_after: usize,
filters_applied: Vec<String>,
},
/// Decoder correction decision.
CorrectionApplied {
round: u64,
corrections: Vec<(QubitId, Pauli)>,
confidence: f64,
decode_time_ns: u64,
},
/// Coherence gating decision.
CoherenceGating {
round: u64,
region_id: String,
coherence_energy: f64,
decision: GatingDecision,
flops_saved: u64,
},
/// Early warning alert.
EarlyWarning {
round: u64,
predicted_failure_round: u64,
confidence: f64,
affected_region: String,
},
/// Logical error detected.
LogicalError {
round: u64,
error_type: String,
decoder_confidence: f64,
},
}
#[derive(Debug, Serialize, Deserialize)]
pub enum GatingDecision {
SkipDetailedSimulation,
StandardSimulation,
EnhancedSimulation,
}
/// Audit trail manager.
pub struct AuditTrail {
signing_key: SigningKey,
chain_head: [u8; 32],
sequence: u64,
}
impl AuditTrail {
/// Record a decision in the audit trail.
pub fn record(&mut self, decision: AuditableDecision) -> AuditRecord {
let timestamp_ns = std::time::SystemTime::now()
.duration_since(std::time::UNIX_EPOCH)
.unwrap()
.as_nanos();
// Compute hash of the decision content
let mut hasher = Hasher::new();
hasher.update(&self.sequence.to_le_bytes());
hasher.update(&self.chain_head);
hasher.update(&timestamp_ns.to_le_bytes());
hasher.update(&bincode::serialize(&decision).unwrap());
let content_hash = hasher.finalize();
// Sign the hash
let signature = self.signing_key.sign(content_hash.as_bytes());
let record = AuditRecord {
sequence: self.sequence,
previous_hash: self.chain_head,
timestamp_ns,
decision,
signature,
};
// Update chain
self.chain_head = *content_hash.as_bytes();
self.sequence += 1;
record
}
}
```
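The chain-linkage property can be demonstrated with `std`'s hasher standing in for Blake3 and signatures omitted; this shows only why tampering with an early record invalidates every later hash:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Stand-in for the Blake3 chain: each link hashes (sequence,
/// previous_hash, payload). DefaultHasher replaces Blake3 and Ed25519
/// signing is omitted; only the chain-linkage property is shown.
fn link(sequence: u64, previous: u64, payload: &str) -> u64 {
    let mut h = DefaultHasher::new();
    sequence.hash(&mut h);
    previous.hash(&mut h);
    payload.hash(&mut h);
    h.finish()
}

fn main() {
    let h0 = link(0, 0, "SyndromeExtracted round=1");
    let h1 = link(1, h0, "CorrectionApplied round=1");
    // Tampering with an earlier record changes every later hash.
    let h0_tampered = link(0, 0, "SyndromeExtracted round=2");
    let h1_tampered = link(1, h0_tampered, "CorrectionApplied round=1");
    assert_ne!(h1, h1_tampered);
    // Replaying identical content reproduces the same chain head.
    let replay = link(1, link(0, 0, "SyndromeExtracted round=1"),
                      "CorrectionApplied round=1");
    assert_eq!(h1, replay);
}
```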
### 8. Early Warning Feedback Loop
ruQu's early warning system predicts correlated failures 100+ cycles ahead. This
prediction feeds back into the simulation engine to validate decoder robustness:
```rust
/// Early warning integration with quantum simulation.
pub struct EarlyWarningIntegration {
warning_system: EarlyWarningSystem,
error_injector: ErrorInjector,
}
impl EarlyWarningIntegration {
/// Check early warning predictions and optionally inject
/// targeted errors to validate decoder response.
pub fn process_cycle(
&mut self,
history: &SyndromeHistory,
state: &mut StateVector,
code: &SurfaceCodeLayout,
) -> Vec<EarlyWarningAction> {
let predictions = self.warning_system.predict(history);
let mut actions = Vec::new();
for prediction in &predictions {
if prediction.confidence > 0.8 {
// High-confidence prediction: inject targeted errors
// to validate that the decoder handles this failure mode
let targeted_errors = self.error_injector.generate_targeted(
&prediction.affected_region,
&prediction.predicted_error_pattern,
code,
);
actions.push(EarlyWarningAction::InjectTargetedErrors {
region: prediction.affected_region.clone(),
errors: targeted_errors,
prediction_confidence: prediction.confidence,
predicted_failure_round: prediction.failure_round,
});
tracing::info!(
confidence = prediction.confidence,
failure_round = prediction.failure_round,
region = %prediction.affected_region,
"Early warning: injecting targeted errors for decoder validation"
);
} else if prediction.confidence > 0.5 {
// Moderate confidence: increase monitoring, do not inject
actions.push(EarlyWarningAction::IncreasedMonitoring {
region: prediction.affected_region.clone(),
enhanced_diagnostics: true,
});
}
}
actions
}
}
pub enum EarlyWarningAction {
/// Inject targeted errors to test decoder response.
InjectTargetedErrors {
region: String,
errors: Vec<InjectedError>,
prediction_confidence: f64,
predicted_failure_round: u64,
},
/// Increase monitoring without error injection.
IncreasedMonitoring {
region: String,
enhanced_diagnostics: bool,
},
}
```
### 9. Performance Targets
| Pipeline stage | Target latency | Distance-3 | Distance-5 | Distance-7 |
|---|---|---|---|---|
| Syndrome extraction (sim) | Varies | 2 ms | 15 ms | 80 ms |
| Syndrome filtering | <0.5 ms | 0.1 ms | 0.2 ms | 0.4 ms |
| MWPM decoding | <4 us | 1 us | 2 us | 3.5 us |
| Correction application | <0.1 ms | 0.01 ms | 0.05 ms | 0.08 ms |
| Coherence computation | <1 ms | 0.3 ms | 0.5 ms | 0.8 ms |
| Audit record creation | <0.05 ms | 0.02 ms | 0.03 ms | 0.04 ms |
| **Total cycle** | | **~3 ms** | **~16 ms** | **~82 ms** |
For distance-7 and above, the tensor network backend (ADR-QE-009) is used for
the syndrome extraction simulation, as 97 qubits exceeds state-vector capacity.
### 10. Integration Data Flow Summary
```
+-------------------+
| QuantumCircuit | Surface code syndrome extraction circuit
| (parameterized by | with noise model applied
| error model) |
+--------+----------+
|
v
+--------+----------+
| SimulationEngine | State vector (d<=5) or tensor network (d>=7)
| execute() |
+--------+----------+
|
| MeasurementOutcome (ancilla bitstring)
v
+--------+----------+
| SyndromeBridge | Convert measurements to detection events
| extract_syndrome()|
+--------+----------+
|
| SyndromeRound
v
+--------+----------+
| SyndromeFilter | Three-stage filtering (Structural|Shift|Evidence)
| Pipeline |
+--------+----------+
|
| FilteredSyndrome
v
+--------+----------+ +------------------+
| MWPM Decoder |<--->| ruvector-mincut | Parallel decoding
| (ruQu) | | graph partition | for large codes
+--------+----------+ +------------------+
|
| DecoderCorrection (Pauli operators)
v
+--------+----------+
| Correction Apply | Apply X/Z/Y Paulis to simulated state
+--------+----------+
|
| Corrected state
v
+--------+--+------+-----+---+
| | | |
v v v v
Coherence Early Warning Audit Trail
Engine System (Ed25519 +
(sheaf (100+ cycle Blake3)
Laplacian) prediction)
| |
| +---> Feeds back to simulation
| (targeted error injection)
|
+---> Coherence gating
(skip/standard/enhanced sim)
~50% FLOPs reduction when healthy
```
### 11. API Surface
The complete integration is exposed through a high-level API:
```rust
/// High-level QEC simulation with full pipeline integration.
pub struct QecSimulator {
engine: QuantumEngine,
bridge: SyndromeBridge,
filter: SyndromeFilterPipeline,
decoder: Box<dyn SyndromeDecoder>,
coherence: Option<CoherenceGatedSimulator>,
early_warning: Option<EarlyWarningIntegration>,
audit: AuditTrail,
history: SyndromeHistory,
}
impl QecSimulator {
/// Run N cycles of QEC simulation.
pub fn run_cycles(
&mut self,
code: &SurfaceCodeLayout,
error_model: &ErrorModel,
num_cycles: usize,
) -> QecSimulationResult {
let mut results = Vec::with_capacity(num_cycles);
for cycle in 0..num_cycles {
let cycle_result = self.run_single_cycle(code, error_model, cycle);
results.push(cycle_result);
}
// Compute summary metrics before `results` is moved into the struct;
// borrowing a moved value would not compile.
let logical_error_rate = self.compute_logical_error_rate(&results);
let total_flops_saved = results.iter().map(|r| r.flops_saved).sum();
let decoder_latency_p99 = self.compute_decoder_p99(&results);
QecSimulationResult {
cycles: results,
logical_error_rate,
total_flops_saved,
decoder_latency_p99,
}
}
fn run_single_cycle(
&mut self,
code: &SurfaceCodeLayout,
error_model: &ErrorModel,
cycle: usize,
) -> CycleResult {
// ... full pipeline as described above
}
}
```
## Consequences
### Positive
1. **Unified pipeline**: Simulation, decoding, coherence analysis, and auditing
work together seamlessly rather than as disconnected tools.
2. **Real performance gains**: Coherence gating can reduce simulation FLOPs by
~50% for healthy regions, directly applicable to long QEC simulations.
3. **Decoder validation**: The simulation engine provides a controlled environment
to test decoder correctness under various error models.
4. **Early warning validation**: Predicted failures can be injected and the decoder's
response verified, increasing confidence in the early warning system.
5. **Auditable**: Every decision in the pipeline is cryptographically signed and
hash-chained, meeting compliance requirements for safety-critical applications.
6. **Leverages existing infrastructure**: `ruvector-mincut`, ruQu's decoder, and
the coherence engine are reused rather than reimplemented.
### Negative
1. **Coupling**: The integration creates dependencies between previously independent
crates. Changes to ruQu's syndrome format require updates to the bridge.
Mitigation: trait abstractions at integration boundaries.
2. **Complexity**: The full pipeline has many stages, each with its own configuration
and failure modes. Mitigation: sensible defaults and the high-level `QecSimulator`
API that hides complexity.
3. **Performance overhead**: Coherence computation and audit trail signing add
latency to each cycle. Mitigation: both are optional and can be disabled.
4. **Tensor network dependency**: Distance >= 7 codes require the tensor network
backend, which is behind a feature flag and may not always be compiled in.
### Risks and Mitigations
| Risk | Likelihood | Impact | Mitigation |
|---|---|---|---|
| Coherence gating skips a region that has real errors | Low | Missed errors | Conservative thresholds; periodic full-fidelity verification cycles |
| MWPM decoder exceeds 4us on partitioned syndrome | Medium | Latency violation | Adaptive partition count; fallback to non-partitioned decode |
| Early warning false positives cause unnecessary error injection | Medium | Wasted cycles | Confidence threshold (>0.8) gates injection; injection is rate-limited |
| Audit trail storage grows unboundedly | Medium | Disk exhaustion | Configurable retention; periodic pruning of old records |
| Syndrome format version mismatch between sim and decoder | Low | Decode failure | Version field in SyndromeRound; compatibility checks at pipeline init |
## References
- ruQu crate: boundary-to-boundary min-cut coherence gating
- ruQu SyndromeFilter: three-filter pipeline (Structural | Shift | Evidence)
- `ruvector-mincut` crate: graph partitioning for parallel decoding
- ADR-014: Coherence Engine (sheaf Laplacian coherence computation)
- ADR-CE-001: Sheaf Laplacian (mathematical foundation)
- ADR-QE-001: Core Engine Architecture (simulation backends)
- ADR-QE-009: Tensor Network Evaluation Mode (large code simulation)
- ADR-QE-010: Observability & Monitoring (metrics for pipeline stages)
- ADR-QE-011: Memory Gating & Power Management (resource constraints)
- Fowler et al., "Surface codes: Towards practical large-scale quantum computation" (2012)
- Higgott, "PyMatching: A Python package for decoding quantum codes with MWPM" (2022)
- Dennis et al., "Topological quantum memory" (2002) -- MWPM decoding
- Ed25519: https://ed25519.cr.yp.to/
- Blake3: https://github.com/BLAKE3-team/BLAKE3
# ADR-QE-013: Deutsch's Theorem — Proof, Historical Comparison, and Verification
**Status**: Accepted
**Date**: 2026-02-06
**Authors**: ruv.io, RuVector Team
**Deciders**: Architecture Review Board
## Version History
| Version | Date | Author | Changes |
|---------|------|--------|---------|
| 1.0 | 2026-02-06 | ruv.io | Complete proof, historical comparison, ruqu verification |
---
## Context
Deutsch's theorem (1985) is the founding result of quantum computation. It demonstrates
that a quantum computer can extract a *global property* of a function using fewer queries
than any classical algorithm — the first provable quantum speedup. Our ruqu engine
(ADR-QE-001 through ADR-QE-008) implements the full gate set and state-vector simulator
required to verify this theorem programmatically.
This ADR provides:
1. A **rigorous proof** of Deutsch's theorem
2. A **comparative analysis** of the five major formulations by different authors
3. A **de-quantization critique** examining when the advantage truly holds
4. **Verification** via the ruqu-core simulator
---
## 1. Statement of the Theorem
**Deutsch's Problem.** Given a black-box oracle computing f: {0,1} → {0,1}, determine
whether f is *constant* (f(0) = f(1)) or *balanced* (f(0) ≠ f(1)).
**Theorem (Deutsch, 1985; deterministic form: Cleve et al., 1998).**
A quantum computer can solve Deutsch's problem with certainty using exactly **one** oracle
query. Any classical deterministic algorithm requires **two** queries.
---
## 2. Classical Lower Bound
**Claim.** Every classical deterministic algorithm requires 2 queries.
**Proof.** A classical algorithm queries f on inputs from {0,1} sequentially. After a
single query — say f(0) = b — both cases remain consistent with the observation:
- Constant: f(1) = b
- Balanced: f(1) = 1 − b
No deterministic strategy can distinguish these without a second query.
A probabilistic classical algorithm can guess with probability 1/2 after one query,
but cannot achieve certainty. ∎
---
## 3. Quantum Proof (Complete)
### 3.1 Oracle Definition
The quantum oracle U_f acts on two qubits as:
```
U_f |x⟩|y⟩ = |x⟩|y ⊕ f(x)⟩
```
where ⊕ is addition modulo 2. This is a unitary (and self-inverse) operation for all
four possible functions f.
### 3.2 Circuit
```
q0: |0⟩ ─── H ─── U_f ─── H ─── M ──→ result
q1: |1⟩ ─── H ──────────────────────
```
### 3.3 Step-by-Step Derivation
**Step 1. Initialization.**
```
|ψ₀⟩ = |0⟩|1⟩
```
**Step 2. Hadamard on both qubits.**
```
|ψ₁⟩ = H|0⟩ ⊗ H|1⟩
= (|0⟩ + |1⟩)/√2 ⊗ (|0⟩ − |1⟩)/√2
```
**Step 3. Phase Kickback Lemma.**
> **Lemma.** Let |y⁻⟩ = (|0⟩ − |1⟩)/√2. Then for any x ∈ {0,1}:
>
> U_f |x⟩|y⁻⟩ = (−1)^{f(x)} |x⟩|y⁻⟩
*Proof of Lemma.*
```
U_f |x⟩|y⁻⟩ = U_f |x⟩ (|0⟩ − |1⟩)/√2
= (|x⟩|f(x)⟩ − |x⟩|1⊕f(x)⟩) / √2
```
Case f(x) = 0:
```
= |x⟩(|0⟩ − |1⟩)/√2 = (+1)|x⟩|y⁻⟩
```
Case f(x) = 1:
```
= |x⟩(|1⟩ − |0⟩)/√2 = (−1)|x⟩|y⁻⟩
```
Therefore U_f |x⟩|y⁻⟩ = (−1)^{f(x)} |x⟩|y⁻⟩. ∎
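The lemma can be checked numerically without any simulator dependency. The sketch below is illustrative plain Rust (the function name `kickback_sign` is ours, not a ruqu API): it builds |x⟩|y⁻⟩ as a four-amplitude vector, applies U_f as a conditional swap of the target pair, and recovers the kicked-back sign as the inner product ⟨ψ|U_f|ψ⟩, which must equal (−1)^{f(x)}.

```rust
/// Inner product <psi| U_f |psi> for |psi> = |x> ⊗ |y⁻>.
/// Amplitudes are real here, so plain f64 suffices.
fn kickback_sign(f: [u8; 2], x: usize) -> f64 {
    let s = 1.0 / 2f64.sqrt();
    // Basis |x, y> with index = 2*x + y; |y⁻> = (|0> - |1>)/sqrt(2)
    let mut amp = [0.0f64; 4];
    amp[2 * x] = s;       // y = 0 component
    amp[2 * x + 1] = -s;  // y = 1 component
    // Oracle U_f |x,y> = |x, y XOR f(x)>: swap the y-pair when f(x) = 1
    let mut out = amp;
    if f[x] == 1 {
        out.swap(2 * x, 2 * x + 1);
    }
    // Since U_f|psi> = (-1)^{f(x)} |psi>, the overlap is exactly that sign
    amp.iter().zip(out.iter()).map(|(a, b)| a * b).sum()
}
```

Running `kickback_sign` over all four functions and both inputs reproduces the lemma: +1 whenever f(x) = 0 and −1 whenever f(x) = 1.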
**Step 4. Apply oracle to the superposition.**
By linearity of U_f and the Phase Kickback Lemma:
```
|ψ₂⟩ = [ (−1)^{f(0)} |0⟩ + (−1)^{f(1)} |1⟩ ] / √2 ⊗ |y⁻⟩
```
Factor out the global phase (−1)^{f(0)}:
```
|ψ₂⟩ = (−1)^{f(0)} · [ |0⟩ + (−1)^{f(0)⊕f(1)} |1⟩ ] / √2 ⊗ |y⁻⟩
```
**Step 5. Final Hadamard on first qubit.**
Using H|+⟩ = |0⟩ and H|−⟩ = |1⟩:
- If f(0) ⊕ f(1) = 0 (constant): first qubit is |+⟩, so H|+⟩ = |0⟩
- If f(0) ⊕ f(1) = 1 (balanced): first qubit is |−⟩, so H|−⟩ = |1⟩
Therefore:
```
|ψ₃⟩ = (−1)^{f(0)} · |f(0) ⊕ f(1)⟩ ⊗ |y⁻⟩
```
**Step 6. Measurement.**
| Measurement of q0 | Conclusion |
|---|---|
| \|0⟩ (probability 1) | f is **constant** |
| \|1⟩ (probability 1) | f is **balanced** |
The global phase (−1)^{f(0)} is physically unobservable. The measurement outcome is
**deterministic** — no probabilistic element remains. ∎
### 3.4 Why This Works
The quantum advantage arises from three principles acting together:
1. **Superposition**: The Hadamard gate creates a state that simultaneously probes
both inputs f(0) and f(1) in a single oracle call.
2. **Phase kickback**: The oracle encodes f(x) into relative phases rather than
bit values, moving information from the amplitude magnitudes into the complex
phases of the state vector.
3. **Interference**: The final Hadamard converts the relative phase between |0⟩
and |1⟩ into a computational basis state that can be measured. Constructive
interference amplifies the correct answer; destructive interference suppresses
the wrong one.
The algorithm extracts f(0) ⊕ f(1) — a *global* property — without ever learning
either f(0) or f(1) individually. This is impossible classically with one query.
---
## 4. Historical Comparison of Proofs
### 4.1 Timeline
| Year | Authors | Key Contribution |
|------|---------|------------------|
| 1985 | Deutsch | First quantum algorithm; probabilistic (50% success) |
| 1992 | Deutsch & Jozsa | Deterministic n-bit generalization; required 2 queries |
| 1998 | Cleve, Ekert, Macchiavello & Mosca | Deterministic + single query (modern form) |
| 2001 | Nielsen & Chuang | Canonical textbook presentation |
| 2006 | Calude | De-quantization of the single-bit case |
### 4.2 Deutsch's Original Proof (1985)
**Paper:** "Quantum Theory, the Church-Turing Principle and the Universal Quantum
Computer," *Proc. Royal Society London A* 400, pp. 97–117.
Deutsch's original algorithm was **probabilistic**, succeeding with probability 1/2.
The circuit prepared the first qubit in an eigenstate basis and relied on interference
at the output, but lacked the phase-kickback construction that the modern proof uses.
The key insight was not the algorithm itself but the *philosophical claim*: Deutsch
reformulated the Church-Turing thesis as a physical principle, arguing that since
physics is quantum mechanical, the correct model of computation must be quantum.
He noted that classical physics uses real numbers that cannot be represented by
Turing machines, and proposed the quantum Turing machine as the proper universal
model.
Deutsch also connected his work to the Everett many-worlds interpretation, arguing
that quantum parallelism could be understood as computation occurring across
parallel universes simultaneously.
**Limitations:**
- Only solved the 1-bit case
- Probabilistic (50% success rate)
- The advantage over classical was present but not deterministic
### 4.3 Deutsch-Jozsa Extension (1992)
**Paper:** "Rapid Solution of Problems by Quantum Computation," *Proc. Royal Society
London A* 439, pp. 553–558.
Deutsch and Jozsa generalized to n-bit functions f: {0,1}ⁿ → {0,1} where f is
promised to be either constant (same output on all inputs) or balanced (outputs 0
on exactly half the inputs and 1 on the other half).
**Key differences from 1985:**
- Deterministic algorithm (no probabilistic element)
- Required **two** oracle queries (not one)
- Demonstrated **exponential** speedup: quantum O(1) queries vs. classical
worst-case 2^(n−1) + 1 queries for n-bit functions
**Proof technique:** Applied Hadamard to all n input qubits, queried the oracle once,
applied Hadamard again, and measured. If f is constant, the output is always |0⟩ⁿ.
If balanced, the output is never |0⟩ⁿ. However, the original 1992 formulation used
a slightly different circuit that needed a second query for the single-bit case.
### 4.4 Cleve-Ekert-Macchiavello-Mosca Improvement (1998)
**Paper:** "Quantum Algorithms Revisited," *Proc. Royal Society London A* 454,
pp. 339–354. (arXiv: quant-ph/9708016)
This paper provided the **modern, textbook form** of the algorithm:
- Deterministic
- Single oracle query
- Works for all n, including n = 1
**Critical innovation:** The introduction of the ancilla qubit initialized to |1⟩ and
the explicit identification of the **phase kickback** mechanism. They recognized that
preparing the target qubit as H|1⟩ = |−⟩ converts the oracle's bit-flip action into
a phase change — a technique now fundamental to quantum algorithm design.
They also identified a unifying structure across quantum algorithms: "a Fourier
transform, followed by an f-controlled-U, followed by another Fourier transform."
This pattern later appeared in Shor's algorithm and the quantum phase estimation
framework.
### 4.5 Nielsen & Chuang Textbook Presentation (2000/2010)
**Book:** *Quantum Computation and Quantum Information*, Cambridge University Press.
(Section 1.4.3)
Nielsen and Chuang's presentation is the most widely taught version:
- Full density matrix formalism
- Explicit circuit diagram notation
- Rigorous bra-ket algebraic derivation
- Connects to quantum parallelism concept
- Treats it as a gateway to Deutsch-Jozsa (Section 1.4.4) and ultimately
to Shor and Grover
**Proof style:** Algebraic state-tracking through the circuit, step by step. Emphasis
on the tensor product structure and the role of entanglement (or rather, the lack
thereof — Deutsch's algorithm creates no entanglement between the query and
ancilla registers).
### 4.6 Comparison Matrix
| Aspect | Deutsch (1985) | Deutsch-Jozsa (1992) | Cleve et al. (1998) | Nielsen-Chuang (2000) |
|--------|----------------|----------------------|---------------------|-----------------------|
| **Input bits** | 1 | n | n | n |
| **Deterministic** | No (p = 1/2) | Yes | Yes | Yes |
| **Oracle queries** | 1 | 2 | 1 | 1 |
| **Ancilla init** | \|0⟩ | \|0⟩ | \|1⟩ (key insight) | \|1⟩ |
| **Phase kickback** | Implicit | Partial | Explicit | Explicit |
| **Proof technique** | Interference argument | Algebraic | Algebraic + structural | Full density matrix |
| **Fourier structure** | Not identified | Not identified | Identified | Inherited |
| **Entanglement needed** | Debated | Debated | No | No |
---
## 5. De-Quantization and the Limits of Quantum Advantage
### 5.1 Calude's De-Quantization (2006)
Cristian Calude showed that Deutsch's problem (single-bit case) can be solved
classically with one query if the black box is permitted to operate on
*higher-dimensional classical objects* ("complex bits" — classical analogues of
qubits).
**Mechanism:** Replace the Boolean black box f: {0,1} → {0,1} with a linear-algebraic
black box F: C² → C² that computes the same function on a 2-dimensional complex
vector space. A single application of F to a carefully chosen input vector produces
enough information to extract f(0) ⊕ f(1).
**Implication:** The quantum speedup in the 1-bit case may be an artifact of
comparing quantum registers (which carry 2-dimensional complex amplitudes) against
classical registers (which carry 1-bit Boolean values).
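A minimal sketch of the de-quantization idea, in the spirit of Calude's argument but not his exact construction: encode f as a linear black box F with F(e_x) = (−1)^{f(x)} e_x on a 2-dimensional real vector space, query it once on the "superposed" input (1, 1), and read f(0) ⊕ f(1) off the difference component. The function name `classical_one_query` is ours, for illustration only.

```rust
/// One classical query to a linear black box suffices for the 1-bit case
/// when the register carries a 2D vector instead of a single Boolean.
fn classical_one_query(f: [u8; 2]) -> u8 {
    let sign = |b: u8| if b == 1 { -1.0f64 } else { 1.0 };
    // Single application of F to the input vector (1, 1)
    let fv = [sign(f[0]), sign(f[1])];
    // Classical "interference": the difference component vanishes iff
    // f(0) = f(1), i.e. iff f is constant
    if (fv[0] - fv[1]).abs() > 1e-9 { 1 } else { 0 }
}
```

All four oracles are classified correctly with a single query, matching the claim that the 1-bit advantage disappears once the classical state space is enlarged.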
### 5.2 Abbott et al. — Entanglement and Scalability
Abbott and collaborators extended the de-quantization analysis:
- Any quantum algorithm with **bounded entanglement** can be de-quantized into an
equally efficient classical simulation.
- For the general n-bit Deutsch-Jozsa problem, the de-quantization does **not**
scale: classical simulation requires exponential resources when the quantum
algorithm maintains non-trivial entanglement.
- Key result: entanglement is not *essential* for quantum computation (some advantage
persists with separable states), but it is necessary for *exponential* speedup.
### 5.3 Classical Wave Analogies
Several groups demonstrated classical optical simulations of Deutsch-Jozsa:
| Group | Method | Insight |
|-------|--------|---------|
| Perez-Garcia et al. | Ring cavity + linear optics | Wave interference mimics quantum interference |
| Metamaterial groups | Electromagnetic waveguides | Constructive/destructive interference for constant/balanced |
| LCD programmable optics | Spatial light modulation | Classical coherence sufficient for small n |
These demonstrate that the *interference* ingredient is not uniquely quantum —
classical wave physics provides it too. What scales uniquely in quantum mechanics
is the exponential dimension of the Hilbert space (2ⁿ amplitudes from n qubits),
which classical wave systems cannot efficiently replicate.
### 5.4 Resolution
The modern consensus:
1. **For n = 1:** The quantum advantage is **real but modest** (1 query vs. 2), and
can be replicated classically by enlarging the state space (de-quantization).
2. **For general n:** The quantum advantage is **exponential and genuine**. The
Deutsch-Jozsa algorithm uses O(1) queries vs. classical Ω(2^(n−1)). No known
de-quantization scales to this regime without exponential classical resources.
3. **The true quantum resource** is not superposition alone (classical waves have it)
nor interference alone, but the **exponential state space** of multi-qubit systems
combined with the ability to manipulate phases coherently across that space.
---
## 6. The Four Oracles
The function f: {0,1} → {0,1} has exactly four possible instantiations:
| Oracle | f(0) | f(1) | Type | Circuit Implementation |
|--------|------|------|------|-----------------------|
| f₀ | 0 | 0 | Constant | Identity (no gates) |
| f₁ | 1 | 1 | Constant | X on ancilla (q1) |
| f₂ | 0 | 1 | Balanced | CNOT(q0, q1) |
| f₃ | 1 | 0 | Balanced | X(q0), CNOT(q0, q1), X(q0) |
### Expected measurement outcomes
For all four oracles, measurement of qubit 0 yields:
| Oracle | f(0) ⊕ f(1) | Measurement q0 | Classification |
|--------|-------------|----------------|----------------|
| f₀ | 0 | \|0⟩ (prob = 1.0) | Constant |
| f₁ | 0 | \|0⟩ (prob = 1.0) | Constant |
| f₂ | 1 | \|1⟩ (prob = 1.0) | Balanced |
| f₃ | 1 | \|1⟩ (prob = 1.0) | Balanced |
---
## 7. Verification via ruqu-core
The ruqu-core simulator can verify all four cases of Deutsch's algorithm. The
verification test constructs each oracle circuit and confirms the deterministic
measurement outcome:
```rust
use ruqu_core::prelude::*;
use ruqu_core::gate::Gate;
fn deutsch_algorithm(oracle: &str) -> bool {
let mut state = QuantumState::new(2).unwrap();
// Prepare |01⟩
state.apply_gate(&Gate::X(1)).unwrap();
// Hadamard both qubits
state.apply_gate(&Gate::H(0)).unwrap();
state.apply_gate(&Gate::H(1)).unwrap();
// Apply oracle
match oracle {
"f0" => { /* identity — f(x) = 0 */ }
"f1" => { state.apply_gate(&Gate::X(1)).unwrap(); }
"f2" => { state.apply_gate(&Gate::CNOT(0, 1)).unwrap(); }
"f3" => {
state.apply_gate(&Gate::X(0)).unwrap();
state.apply_gate(&Gate::CNOT(0, 1)).unwrap();
state.apply_gate(&Gate::X(0)).unwrap();
}
_ => panic!("Unknown oracle"),
}
// Hadamard on query qubit
state.apply_gate(&Gate::H(0)).unwrap();
// Measure qubit 0: |0⟩ = constant, |1⟩ = balanced
let probs = state.probabilities();
// prob(q0 = 1) = sum of probs where bit 0 is set
let prob_q0_one = probs[1] + probs[3]; // indices with bit 0 = 1
prob_q0_one > 0.5 // true = balanced, false = constant
}
// Verification (wrapped in a #[test] so the asserts compile at function scope):
#[test]
fn deutsch_classifies_all_four_oracles() {
assert!(!deutsch_algorithm("f0")); // constant
assert!(!deutsch_algorithm("f1")); // constant
assert!(deutsch_algorithm("f2")); // balanced
assert!(deutsch_algorithm("f3")); // balanced
}
```
This confirms that a single oracle query, using the ruqu state-vector simulator,
correctly classifies all four functions with probability 1.
---
## 8. Architectural Significance for ruVector
### 8.1 Validation of Core Primitives
Deutsch's algorithm exercises exactly the minimal set of quantum operations:
| Primitive | Used in Deutsch's Algorithm | ruqu Module |
|-----------|---------------------------|-------------|
| Qubit initialization | \|0⟩, \|1⟩ states | `state.rs` |
| Hadamard gate | Superposition creation | `gate.rs` |
| CNOT gate | Entangling oracle | `gate.rs` |
| Pauli-X gate | Bit flip oracle | `gate.rs` |
| Measurement | Extracting classical result | `state.rs` |
| Phase kickback | Core quantum mechanism | implicit |
Passing the Deutsch verification confirms that the simulator's gate kernels,
state-vector representation, and measurement machinery are correct — it is a
"minimum viable quantum correctness test."
### 8.2 Foundation for Advanced Algorithms
The phase-kickback technique proven here is the same mechanism used in:
- **Grover's algorithm** (ADR-QE-006): Oracle marks states via phase flip
- **VQE** (ADR-QE-005): Parameter-shift rule uses phase differences
- **Quantum Phase Estimation**: Controlled-U operators produce phase kickback
- **Shor's algorithm**: Order-finding oracle uses modular exponentiation kickback
---
## 9. References
| # | Reference | Year |
|---|-----------|------|
| 1 | D. Deutsch, "Quantum Theory, the Church-Turing Principle and the Universal Quantum Computer," *Proc. R. Soc. Lond. A* 400, 97–117 | 1985 |
| 2 | D. Deutsch & R. Jozsa, "Rapid Solution of Problems by Quantum Computation," *Proc. R. Soc. Lond. A* 439, 553–558 | 1992 |
| 3 | R. Cleve, A. Ekert, C. Macchiavello & M. Mosca, "Quantum Algorithms Revisited," *Proc. R. Soc. Lond. A* 454, 339–354 (arXiv: quant-ph/9708016) | 1998 |
| 4 | M.A. Nielsen & I.L. Chuang, *Quantum Computation and Quantum Information*, Cambridge University Press, 10th Anniversary Ed. | 2010 |
| 5 | C.S. Calude, "De-quantizing the Solution of Deutsch's Problem," *Int. J. Quantum Information* 5(3), 409–415 | 2007 |
| 6 | A.A. Abbott, "The Deutsch-Jozsa Problem: De-quantisation and Entanglement," *Natural Computing* 11(1), 3–11 | 2012 |
| 7 | R.P. Feynman, "Simulating Physics with Computers," *Int. J. Theoretical Physics* 21, 467–488 | 1982 |
| 8 | Perez-Garcia et al., "Quantum Computation with Classical Light," *Physics Letters A* 380(22), 1925–1931 | 2016 |
---
## Decision
**Accepted.** Deutsch's theorem is verified by the ruqu-core engine across all four
oracle cases. The proof and historical comparison are documented here as the
theoretical foundation underpinning all quantum algorithms implemented in the
ruqu-algorithms crate (Grover, VQE, QAOA, Surface Code).
The de-quantization analysis confirms that our simulator's true value emerges at
scale (n > 2 qubits), where classical de-quantization fails and the exponential
Hilbert space becomes a genuine computational resource.
# ADR-QE-014: Exotic Quantum-Classical Hybrid Discoveries
**Status:** Accepted
**Date:** 2026-02-06
**Crate:** `ruqu-exotic`
## Context
The `ruqu-exotic` crate implements 8 quantum-classical hybrid algorithms that use real quantum mechanics (superposition, interference, decoherence, error correction, entanglement) as computational primitives for classical AI/ML problems. These are not quantum computing on quantum hardware — they are quantum-*inspired* algorithms running on a classical simulator, where the quantum structure provides capabilities that classical approaches lack.
## Phase 1 Discoveries (Validated)
### Discovery 1: Decoherence Trajectory Fingerprinting
**Module:** `quantum_decay`
**Finding:** Similar embeddings decohere at similar rates. The fidelity loss trajectory is a fingerprint that clusters semantically related embeddings without any explicit similarity computation.
**Data:**
| Pair | Fidelity Difference |
|------|-------------------|
| Similar embeddings (A1 vs A2) | 0.008 |
| Different embeddings (A1 vs B) | 0.384 |
**Practical Application:** Replace TTL-based cache eviction with per-embedding fidelity thresholds. Stale detection becomes content-aware without knowing content semantics. The decoherence rate itself becomes a clustering signal — a new dimension for nearest-neighbor search.
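A toy sketch of the fingerprinting idea, under the assumed, illustrative model that each embedding decoheres exponentially at its own rate. The names `fidelity_trajectory` and `trajectory_distance` are hypothetical, not the `quantum_decay` module's API.

```rust
/// Fidelity trajectory under a simple exponential decay model (illustrative).
fn fidelity_trajectory(decay_rate: f64, steps: usize, dt: f64) -> Vec<f64> {
    (0..steps).map(|k| (-decay_rate * k as f64 * dt).exp()).collect()
}

/// Mean absolute difference between two trajectories: the "fingerprint"
/// distance used to cluster embeddings by how they decay.
fn trajectory_distance(a: &[f64], b: &[f64]) -> f64 {
    a.iter().zip(b).map(|(x, y)| (x - y).abs()).sum::<f64>() / a.len() as f64
}
```

Embeddings with similar decay rates yield trajectories that stay close, while dissimilar rates diverge quickly, which is the mechanism behind the small (0.008) versus large (0.384) fidelity differences reported above.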
### Discovery 2: Interference-Based Polysemy Resolution
**Module:** `interference_search`
**Finding:** Complex amplitude interference resolves polysemous terms at retrieval time with zero ML inference. Context vectors modulate meaning amplitudes through constructive/destructive interference.
**Data** (values are unnormalized interference intensities, not probabilities):
| Context | Top Meaning | Intensity |
|---------|-------------|-----------|
| Weather | "season" | 1.3252 |
| Geology | "water_source" | 1.3131 |
| Engineering | "mechanical" | 1.3252 |
**Practical Application:** Vector databases can disambiguate polysemous queries using only embedding arithmetic. Runs in microseconds vs. seconds for LLM-based reranking. Applicable to any search system dealing with ambiguous terms.
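A minimal sketch of the interference mechanism (illustrative; the amplitude and phase values are stand-ins, not the `interference_search` API): a context phase rotates a meaning's amplitude before it is summed with a fixed reference amplitude, and the squared magnitude of the sum is the retrieval score. Aligned phases interfere constructively; opposed phases cancel.

```rust
/// Score a candidate meaning by interfering its complex amplitude with a
/// context-dependent phase. Complex numbers are modeled as (re, im) pairs.
fn interference_score(meaning_amp: (f64, f64), context_phase: f64) -> f64 {
    let (re, im) = meaning_amp;
    let (c, s) = (context_phase.cos(), context_phase.sin());
    // Rotate the meaning amplitude by the context phase
    let rotated = (re * c - im * s, re * s + im * c);
    // Add the unmodulated reference amplitude (1, 0) and take |.|^2
    let sum = (1.0 + rotated.0, rotated.1);
    sum.0 * sum.0 + sum.1 * sum.1
}
```

With unit amplitudes, an aligned context scores |1 + 1|² = 4 and an opposed context scores |1 − 1|² ≈ 0, which is the constructive/destructive mechanism the finding describes.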
### Discovery 3: Counterfactual Dependency Mapping
**Module:** `reversible_memory`
**Finding:** Gate inversion enables counterfactual analysis: remove any operation from a sequence and measure divergence from the actual outcome. This quantitatively identifies critical vs. redundant steps.
**Data:**
| Step | Gate | Divergence | Classification |
|------|------|------------|----------------|
| 0 | H (superposition) | 0.500 | **Critical** |
| 1 | CNOT (entangle) | 0.500 | **Critical** |
| 2 | Rz(0.001) | 0.000 | **Redundant** |
| 3 | CNOT (propagate) | 0.000 | **Redundant** |
| 4 | H (mix) | 0.500 | **Critical** |
**Practical Application:** Automatic importance scoring for any pipeline of reversible transformations. Applicable to ML pipeline optimization, middleware chain debugging, database migration analysis. No source code analysis needed — works purely from operational traces.
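The counterfactual mechanism can be sketched classically with 2D rotations standing in for reversible gates (illustrative; not the `reversible_memory` API). Re-run the pipeline with one step removed and measure how far the counterfactual output diverges from the real one; large divergence marks a critical step.

```rust
/// Run a pipeline of 2D rotations, optionally skipping one step, and
/// return the Euclidean divergence between the counterfactual and full runs.
fn leave_one_out_divergence(angles: &[f64], skip: usize) -> f64 {
    let run = |skipped: Option<usize>| {
        let mut v = (1.0f64, 0.0f64);
        for (i, &a) in angles.iter().enumerate() {
            if Some(i) == skipped { continue; }
            // Each rotation is an invertible (reversible) transformation
            v = (v.0 * a.cos() - v.1 * a.sin(), v.0 * a.sin() + v.1 * a.cos());
        }
        v
    };
    let full = run(None);
    let cf = run(Some(skip));
    ((full.0 - cf.0).powi(2) + (full.1 - cf.1).powi(2)).sqrt()
}
```

Skipping a tiny rotation (analogous to the Rz(0.001) step above) produces near-zero divergence, while skipping a large one produces order-1 divergence, reproducing the critical/redundant classification from operational traces alone.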
### Discovery 4: Phase-Coherent Swarm Coordination
**Module:** `swarm_interference`
**Finding:** Agent phase alignment matters more than headcount. Three aligned agents produce an aggregate interference intensity of 9.0; two aligned plus one orthogonal produce only 5.0 — a 44% drop despite identical agent count.
**Data** (values are unnormalized intensities |Σ amplitude|², not probabilities):
| Configuration | Intensity |
|--------------|-----------|
| 3 agents, phase-aligned | 9.0 |
| 2 aligned + 1 orthogonal | 5.0 |
| 3 support + 3 oppose | ~0.0 |
**Practical Application:** Replace majority voting in multi-agent systems with interference-based aggregation. Naturally penalizes uncertain/confused agents and rewards aligned confident reasoning. Superior coordination primitive for LLM agent swarms and ensemble classifiers.
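The reported numbers follow directly from coherent amplitude addition: three aligned unit amplitudes give |1 + 1 + 1|² = 9, while two aligned plus one orthogonal give |2 + i|² = 5. A minimal sketch (illustrative; `swarm_intensity` is our name, not the `swarm_interference` API):

```rust
/// Coherently aggregate unit-amplitude agent "votes", each carrying a phase,
/// and return the resulting intensity |sum|^2.
fn swarm_intensity(phases: &[f64]) -> f64 {
    let (re, im) = phases.iter().fold((0.0f64, 0.0f64), |(re, im), &p| {
        (re + p.cos(), im + p.sin())
    });
    re * re + im * im
}
```

Three aligned agents (phase 0) give 9.0; replacing one with an orthogonal agent (phase π/2) drops the intensity to 5.0; three supporters against three opponents (phase π) cancel to ~0, matching the table.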
## Phase 2: Unexplored Cross-Module Interactions
The following cross-module experiments remain to be investigated:
### Hypothesis 5: Time-Dependent Disambiguation
**Modules:** `quantum_decay` + `interference_search`
**Question:** Does decoherence change which meaning wins? As an embedding ages, does its polysemy resolution shift?
### Hypothesis 6: QEC on Agent Swarm Reasoning
**Modules:** `reasoning_qec` + `swarm_interference`
**Question:** Can syndrome extraction detect when a swarm's collective reasoning chain has become incoherent?
### Hypothesis 7: Counterfactual Search Explanation
**Modules:** `quantum_collapse` + `reversible_memory`
**Question:** Can counterfactual analysis explain WHY a search collapsed to a particular result?
### Hypothesis 8: Diagnostic Swarm Health
**Modules:** `syndrome_diagnosis` + `swarm_interference`
**Question:** Can syndrome-based diagnosis identify which agent in a swarm is causing dysfunction?
### Hypothesis 9: Full Pipeline
**Modules:** All 8
**Question:** Decohere → Interfere → Collapse → QEC-verify → Diagnose: does the full pipeline produce emergent capabilities beyond what individual modules provide?
### Hypothesis 10: Decoherence as Privacy
**Modules:** `quantum_decay` + `quantum_collapse`
**Question:** Can controlled decoherence provide differential privacy for embedding search?
### Hypothesis 11: Interference Topology
**Modules:** `interference_search` + `swarm_interference`
**Question:** Do concept interference patterns predict optimal swarm topology?
### Hypothesis 12: Reality-Verified Reasoning
**Modules:** `reality_check` + `reasoning_qec`
**Question:** Can reality check circuits verify that QEC correction preserved reasoning fidelity?
## Architecture
All modules share the `ruqu-core` quantum simulator:
- State vectors up to 25 qubits (33M amplitudes)
- Full gate set: H, X, Y, Z, S, T, Rx, Ry, Rz, CNOT, CZ, SWAP, Rzz
- Measurement with collapse
- Fidelity comparison
- Compiles to WASM for browser execution
## Test Coverage
| Category | Tests | Status |
|----------|-------|--------|
| Unit tests (8 modules) | 57 | All pass |
| Integration tests | 42 | All pass |
| Discovery experiments | 4 | All validated |
| **Total** | **99** | **All pass** |
## Decision
Accept Phase 1 findings as validated. Proceed with Phase 2 cross-module discovery experiments to identify emergent capabilities.
# ADR-QE-015: Quantum Hardware Integration & Scientific Instrument Layer
**Status**: Accepted
**Date**: 2026-02-12
**Authors**: ruv.io, RuVector Team
**Deciders**: Architecture Review Board
**Supersedes**: None
**Extends**: ADR-QE-001, ADR-QE-002, ADR-QE-004
## Context
### Problem Statement
ruqu-core is currently a closed-world simulator: circuits run locally on state
vector, stabilizer, or tensor network backends with no path to real quantum
hardware, no cryptographic proof of execution, and no statistical rigor around
measurement confidence. For blockchain forensics and scientific applications,
three gaps must be closed:
1. **Hardware bridge**: Export circuits to OpenQASM 3.0, submit to IBM Quantum /
IonQ / Rigetti / Amazon Braket, and import calibration-aware noise models.
2. **Scientific rigor**: Every simulation result must carry confidence bounds,
be deterministically replayable, and be verifiable across backends.
3. **Audit trail**: A tamper-evident witness log must chain every execution so
results can be independently reproduced and verified.
These capabilities transform ruqu from a simulator into a **scientific
instrument** suitable for peer-reviewed quantum-enhanced forensics.
### Current State
| Component | Exists | Gap |
|-----------|--------|-----|
| State vector backend | Yes (ruqu-core) | No hardware export |
| Stabilizer backend | Yes (ruqu-core) | No cross-backend verification |
| Tensor network backend | Yes (ruqu-core) | No confidence bounds |
| Basic noise model | Yes (depolarizing, bit/phase flip) | No T1/T2/readout/crosstalk |
| Seeded RNG | Yes (SimConfig.seed) | No snapshot/restore, no replay log |
| Gate set | Complete (H,X,Y,Z,S,T,Rx,Ry,Rz,CNOT,CZ,SWAP,Rzz) | No QASM export |
| Circuit analyzer | Yes (Clifford fraction, depth) | No automatic verification |
## Decision
### Architecture Overview
```
ruqu-core (existing)
|
+------------------+------------------+
| | |
[OpenQASM 3.0] [Noise Models] [Scientific Layer]
Export Bridge Enhanced |
| | +----+----+--------+
| | | | |
[Hardware HAL] [Error [Replay] [Witness] [Confidence]
IBM/IonQ/ Mitigation] Engine Logger Bounds
Rigetti/Braket Pipeline
| | \ | /
+--------+------+ \ | /
| [Cross-Backend
[Transpiler] Verification]
Noise-Aware with
Live Calibration
```
All new code lives in `crates/ruqu-core/src/` as new modules, extending the
existing crate without breaking the public API.
### 1. OpenQASM 3.0 Export Bridge
**Module**: `src/qasm.rs`
Serializes any `QuantumCircuit` to valid OpenQASM 3.0 text. Supports the full
gate set in `Gate` enum, parameterized rotations, barriers, measurement, and
reset.
```
OPENQASM 3.0;
include "stdgates.inc";
qubit[n] q;
bit[n] c;
h q[0];
cx q[0], q[1];
rz(0.785398) q[2];
c[0] = measure q[0];
```
**Design decisions**:
- Gate names follow the OpenQASM 3.0 `stdgates.inc` naming convention
- `Unitary1Q` fused gates decompose to `U(theta, phi, lambda)` form
- Round-trip fidelity: `circuit -> qasm -> parse -> circuit` preserves
gate identity (not implemented here; parsing is out of scope)
- Output validated against IBM Quantum and IonQ acceptance criteria
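A minimal serializer sketch illustrates the mapping. The `Gate` enum here is a
simplified stand-in, not the actual ruqu-core type, and covers only a few
representative variants:

```rust
// Illustrative QASM 3.0 serializer; the real ruqu-core `Gate` enum is richer.
enum Gate {
    H(usize),
    Cx(usize, usize),
    Rz(usize, f64),
    Measure(usize, usize), // (qubit, classical bit)
}

fn to_qasm(num_qubits: usize, gates: &[Gate]) -> String {
    let mut out = String::from("OPENQASM 3.0;\ninclude \"stdgates.inc\";\n");
    out.push_str(&format!("qubit[{n}] q;\nbit[{n}] c;\n", n = num_qubits));
    for g in gates {
        let line = match g {
            Gate::H(q) => format!("h q[{q}];"),
            Gate::Cx(a, b) => format!("cx q[{a}], q[{b}];"),
            Gate::Rz(q, theta) => format!("rz({theta:.6}) q[{q}];"),
            Gate::Measure(q, c) => format!("c[{c}] = measure q[{q}];"),
        };
        out.push_str(&line);
        out.push('\n');
    }
    out
}
```

Each gate lowers to exactly one `stdgates.inc` statement, which keeps the
serializer a single linear pass over the circuit.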
### 2. Enhanced Noise Models
**Module**: `src/noise.rs`
Extends the existing `NoiseModel` with physically motivated channels:
| Channel | Parameters | Kraus Operators |
|---------|-----------|-----------------|
| Depolarizing | p (error rate) | K0=sqrt(1-p)I, K1-3=sqrt(p/3){X,Y,Z} |
| Amplitude damping (T1) | gamma=1-exp(-t/T1) | K0=[[1,0],[0,sqrt(1-gamma)]], K1=[[0,sqrt(gamma)],[0,0]] |
| Phase damping (T2) | lambda=1-exp(-t/T2') | K0=[[1,0],[0,sqrt(1-lambda)]], K1=[[0,0],[0,sqrt(lambda)]] |
| Readout error | p01, p10 | Confusion matrix applied at measurement |
| Thermal relaxation | T1, T2, gate_time | Combined T1+T2 during idle periods |
| Crosstalk (ZZ) | zz_strength | Unitary Rzz rotation on adjacent qubits |
**Simulation approach**: Monte Carlo trajectories on the state vector. For each
gate, sample which Kraus operator to apply according to its branch probability.
This avoids the quadratic memory blowup of a density-matrix representation
(4^n amplitudes rather than 2^n for a state vector) while still giving correct
statistics over many shots.
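One trajectory step can be sketched for the depolarizing channel. The uniform
draw `u` is passed in explicitly here; in the engine it would come from the
seeded ChaCha20 stream so that trajectories stay replayable:

```rust
// Sample which Kraus branch to apply for a depolarizing channel with
// error rate p, given a uniform draw u in [0, 1).
#[derive(Debug, PartialEq)]
enum Pauli { I, X, Y, Z }

fn sample_depolarizing(p: f64, u: f64) -> Pauli {
    if u < 1.0 - p {
        // Identity branch: probability 1 - p.
        Pauli::I
    } else {
        // Error branch: X, Y, Z each with probability p / 3.
        let r = (u - (1.0 - p)) / p; // re-normalize the tail to [0, 1)
        match (r * 3.0) as usize {
            0 => Pauli::X,
            1 => Pauli::Y,
            _ => Pauli::Z,
        }
    }
}
```

State-dependent channels such as amplitude damping additionally weight the
branch probabilities by the post-Kraus state norm before sampling.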
**Calibration import**: `DeviceCalibration` struct holds per-qubit T1/T2/readout
errors and per-gate error rates, importable from hardware API JSON responses.
### 3. Error Mitigation Pipeline
**Module**: `src/mitigation.rs`
Post-processing techniques that improve result accuracy without modifying the
quantum circuit:
| Technique | Input | Output | Overhead |
|-----------|-------|--------|----------|
| Zero-Noise Extrapolation (ZNE) | Results at noise scales [1, 1.5, 2, 3] | Extrapolated zero-noise value | 3-4x shots |
| Measurement Error Mitigation | Raw counts + calibration matrix | Corrected counts | O(2^n) for n measured qubits |
| Clifford Data Regression (CDR) | Noisy results + stabilizer reference | Bias-corrected expectation | 2x circuits |
**ZNE implementation**: Gate folding (G -> G G^dag G) amplifies noise by
odd-integer factors; folding only a subset of gates yields intermediate scales
such as 1.5. Richardson extrapolation then fits a polynomial to the scaled
results and evaluates it at noise_factor = 0.
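The extrapolation step amounts to evaluating at scale zero the unique
polynomial through the measured (noise_scale, expectation) points, which
Lagrange interpolation gives directly (the sample points below are
illustrative, not measured data):

```rust
// Richardson extrapolation: Lagrange-interpolate the (scale, value) points
// and evaluate the polynomial at scale = 0.
fn richardson_zero(points: &[(f64, f64)]) -> f64 {
    let mut est = 0.0;
    for (i, &(xi, yi)) in points.iter().enumerate() {
        let mut basis = 1.0; // Lagrange basis polynomial l_i evaluated at 0
        for (j, &(xj, _)) in points.iter().enumerate() {
            if i != j {
                basis *= (0.0 - xj) / (xi - xj);
            }
        }
        est += yi * basis;
    }
    est
}
```

With linearly decaying data such as (1, 0.9) and (2, 0.8), the extrapolated
zero-noise value is 1.0.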
**Measurement correction**: For <= 12 qubits, build full confusion matrix from
calibration data and invert via least-squares. For > 12 qubits, use tensor
product approximation assuming independent qubit readout errors.
### 4. Hardware Abstraction Layer
**Module**: `src/hardware.rs`
Trait-based provider abstraction for submitting circuits to real hardware:
```rust
pub trait HardwareProvider: Send + Sync {
fn name(&self) -> &str;
fn available_devices(&self) -> Vec<DeviceInfo>;
fn device_calibration(&self, device: &str) -> Option<DeviceCalibration>;
fn submit_circuit(&self, qasm: &str, shots: u32, device: &str)
-> Result<JobHandle>;
fn job_status(&self, handle: &JobHandle) -> Result<JobStatus>;
fn job_results(&self, handle: &JobHandle) -> Result<HardwareResult>;
}
```
**Provider adapters** (stubbed, not implementing actual HTTP clients):
| Provider | Auth | Circuit Format | API Style |
|----------|------|---------------|-----------|
| IBM Quantum | API key + token | OpenQASM 3.0 | REST |
| IonQ | API key (header) | OpenQASM 2.0 / native JSON | REST |
| Rigetti | OAuth2 / API key | Quil / OpenQASM | REST + gRPC |
| Amazon Braket | AWS credentials | OpenQASM 3.0 | AWS SDK |
Each adapter is a zero-dependency stub implementing the trait. Actual HTTP
clients are injected by the consumer, keeping ruqu-core `no_std`-compatible.
### 5. Noise-Aware Transpiler
**Module**: `src/transpiler.rs`
Maps abstract circuits to hardware-native gate sets using device calibration:
1. **Gate decomposition**: Decompose non-native gates into the target basis
(e.g., IBM: {CX, ID, RZ, SX, X}; IonQ: {GPI, GPI2, MS}).
2. **Qubit routing**: Map logical qubits to physical qubits respecting the
device coupling map (greedy nearest-neighbor heuristic).
3. **Noise-aware optimization**: Prefer gates/qubits with lower error rates
from live calibration data.
4. **Gate cancellation**: Cancel adjacent inverse gates (H-H, S-Sdg, etc.)
after routing.
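Step 4 can be sketched as a single stack pass. This simplified version handles
only self-inverse single-qubit gates; the real pass would also cancel inverse
pairs such as S/Sdg and match both operands of two-qubit gates:

```rust
// Cancel adjacent identical self-inverse gates acting on the same qubit.
#[derive(Clone, Copy, PartialEq, Debug)]
struct G { name: &'static str, qubit: usize }

fn is_self_inverse(name: &str) -> bool {
    matches!(name, "h" | "x" | "y" | "z")
}

fn cancel_pairs(gates: &[G]) -> Vec<G> {
    let mut out: Vec<G> = Vec::new();
    for &g in gates {
        match out.last() {
            // Top of stack is an identical self-inverse gate: both vanish.
            Some(&prev) if prev == g && is_self_inverse(g.name) => { out.pop(); }
            _ => out.push(g),
        }
    }
    out
}
```

Because cancellations expose new adjacent pairs on the stack, one pass removes
nested pairs such as H X X H in a single sweep.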
### 6. Deterministic Replay Engine
**Module**: `src/replay.rs`
Every simulation execution is fully reproducible:
```rust
pub struct ExecutionRecord {
pub circuit_hash: [u8; 32], // SHA-256 of QASM representation
pub seed: u64, // ChaCha20 RNG seed
pub backend: BackendType, // Which backend was used
pub noise_config: Option<NoiseModelConfig>,
pub shots: u32,
pub software_version: &'static str,
pub timestamp_utc: u64,
}
```
**Replay guarantee**: Given an `ExecutionRecord`, calling
`replay(record, circuit)` produces bit-identical results. This requires:
- Deterministic RNG: `ChaCha20Rng` (via `rand_chacha`), seeded per-shot as
`base_seed.wrapping_add(shot_index)`
- Deterministic gate application order (already guaranteed by `Vec<Gate>`)
- Deterministic noise sampling (same RNG stream)
**Snapshot/restore**: For long-running VQE iterations, the engine can serialize
the state vector to a checkpoint and restore it, enabling resumable computation.
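The per-shot seeding rule can be demonstrated with a stand-in generator (a
splitmix64 step here, in place of the ChaCha20 stream the engine actually
uses): shot i always derives its stream from `base_seed + i`, independent of
execution order.

```rust
// Stand-in for the seeded RNG stream; production uses ChaCha20Rng.
fn splitmix64(mut x: u64) -> u64 {
    x = x.wrapping_add(0x9E37_79B9_7F4A_7C15);
    let mut z = x;
    z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
    z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
    z ^ (z >> 31)
}

// First n values of the RNG stream for a given shot.
fn shot_stream(base_seed: u64, shot_index: u64, n: usize) -> Vec<u64> {
    let mut state = base_seed.wrapping_add(shot_index); // per-shot seed rule
    (0..n).map(|_| { state = splitmix64(state); state }).collect()
}
```

Replaying shot 3 of a record therefore reproduces exactly the random choices
the original run made for shot 3, which is what makes results bit-identical.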
### 7. Witness Logging (Cryptographic Audit Trail)
**Module**: `src/witness.rs`
A tamper-evident append-only log where each entry contains:
```rust
pub struct WitnessEntry {
pub sequence: u64, // Monotonic counter
pub prev_hash: [u8; 32], // SHA-256 of previous entry
pub execution: ExecutionRecord, // Full replay metadata
pub result_hash: [u8; 32], // SHA-256 of measurement outcomes
pub entry_hash: [u8; 32], // SHA-256(sequence || prev_hash || execution || result_hash)
}
```
**Hash chain**: Each entry's `entry_hash` incorporates the previous entry's
hash, forming a blockchain-style chain. Tampering with any entry invalidates
all subsequent hashes.
**Verification**: `verify_witness_chain(entries)` walks the chain and confirms:
1. Hash linkage: `entry[i].prev_hash == entry[i-1].entry_hash`
2. Self-consistency: Recomputed `entry_hash` matches stored value
3. Optional replay: Re-execute the circuit and confirm `result_hash` matches
**Format**: Entries are serialized as length-prefixed bincode with CRC32
checksums, stored in an append-only file. JSON export available for
interoperability.
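The chain walk in `verify_witness_chain` can be sketched as follows. A toy
FNV-1a hash over a few fields stands in for SHA-256 over the full serialized
entry, so the example stays dependency-free:

```rust
// Toy witness entry; hashes are u64 here instead of [u8; 32].
#[derive(Clone)]
struct Entry { sequence: u64, prev_hash: u64, result_hash: u64, entry_hash: u64 }

// FNV-1a stand-in for SHA-256.
fn fnv1a(parts: &[u64]) -> u64 {
    let mut h: u64 = 0xcbf2_9ce4_8422_2325;
    for p in parts {
        for b in p.to_le_bytes() {
            h ^= b as u64;
            h = h.wrapping_mul(0x0000_0100_0000_01b3);
        }
    }
    h
}

fn make_entry(sequence: u64, prev_hash: u64, result_hash: u64) -> Entry {
    let entry_hash = fnv1a(&[sequence, prev_hash, result_hash]);
    Entry { sequence, prev_hash, result_hash, entry_hash }
}

fn verify_chain(entries: &[Entry]) -> bool {
    entries.iter().enumerate().all(|(i, e)| {
        // 1. Hash linkage to the previous entry.
        let linked = i == 0 || e.prev_hash == entries[i - 1].entry_hash;
        // 2. Self-consistency of the stored entry hash.
        let consistent =
            e.entry_hash == fnv1a(&[e.sequence, e.prev_hash, e.result_hash]);
        linked && consistent
    })
}
```

Tampering with any field of any entry fails either the self-consistency check
on that entry or the linkage check on its successor.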
### 8. Confidence Bounds
**Module**: `src/confidence.rs`
Every measurement result carries statistical confidence:
| Metric | Method | Formula |
|--------|--------|---------|
| Probability CI | Wilson score | (p_hat + z^2/(2n) +/- z*sqrt(p_hat*(1-p_hat)/n + z^2/(4n^2))) / (1 + z^2/n) |
| Expectation value SE | Standard error | sigma / sqrt(n_shots) |
| Shot budget | Hoeffding bound | N >= ln(2/delta) / (2*epsilon^2) |
| Distribution distance | Total variation | TVD = 0.5 * sum(|p_i - q_i|) |
| Distribution test | Chi-squared | sum((O_i - E_i)^2 / E_i) |
**Confidence levels**: Results include 95% and 99% confidence intervals by
default. The user can request custom confidence levels.
**Convergence monitoring**: As shots accumulate, the engine tracks whether
confidence intervals have stabilized, enabling early termination when the
desired precision is reached.
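The Wilson interval from the table above can be sketched directly; z = 1.96
gives the default 95% interval:

```rust
// Wilson score interval for an observed outcome probability.
fn wilson_interval(successes: u64, shots: u64, z: f64) -> (f64, f64) {
    let n = shots as f64;
    let p = successes as f64 / n;
    let z2 = z * z;
    // Shrink the point estimate toward 1/2, then add the symmetric half-width.
    let center = (p + z2 / (2.0 * n)) / (1.0 + z2 / n);
    let half = (z / (1.0 + z2 / n))
        * (p * (1.0 - p) / n + z2 / (4.0 * n * n)).sqrt();
    (center - half, center + half)
}
```

Unlike the naive normal approximation, the Wilson interval stays inside [0, 1]
and behaves sensibly for rare outcomes, which matters for low-probability
bitstrings in deep circuits.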
### 9. Automatic Cross-Backend Verification
**Module**: `src/verification.rs`
Every simulation can be independently verified across backends:
```
Verification Protocol:
1. Analyze circuit (existing CircuitAnalysis)
2. If pure Clifford -> run on BOTH StateVector AND Stabilizer
-> compare measurement distributions (must match exactly)
3. If small enough for StateVector -> run on StateVector
-> compare with hardware results using chi-squared test
4. Report: {match_level, p_value, tvd, explanation}
```
**Verification levels**:
| Level | Comparison | Test | Threshold |
|-------|-----------|------|-----------|
| Exact | Stabilizer vs StateVector | Bitwise match | All probabilities equal |
| Statistical | Simulator vs Hardware | Chi-squared, p > 0.05 | TVD < 0.1 |
| Trend | VQE energy curves | Pearson correlation | r > 0.95 |
**Automatic Clifford detection**: Uses the existing `CircuitAnalysis.clifford_fraction`
to determine if stabilizer verification is applicable.
**Discrepancy report**: When backends disagree beyond statistical tolerance,
the engine produces a structured report identifying which qubits/gates show
the largest divergence.
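The two distribution comparisons from the verification table can be sketched
as pure functions over a count histogram and a reference distribution:

```rust
// Total variation distance between two probability distributions.
fn tvd(p: &[f64], q: &[f64]) -> f64 {
    0.5 * p.iter().zip(q).map(|(a, b)| (a - b).abs()).sum::<f64>()
}

// Chi-squared statistic of observed counts against expected probabilities.
fn chi_squared(observed: &[u64], expected_probs: &[f64], shots: u64) -> f64 {
    observed.iter().zip(expected_probs).map(|(&o, &pe)| {
        let e = pe * shots as f64; // expected count for this outcome
        (o as f64 - e).powi(2) / e
    }).sum()
}
```

The statistical verification level then reduces to two threshold checks:
TVD below 0.1 and a chi-squared p-value above 0.05 for the appropriate
degrees of freedom.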
## New Module Map
```
crates/ruqu-core/src/
lib.rs (existing, add mod declarations)
qasm.rs NEW - OpenQASM 3.0 serializer
noise.rs NEW - Enhanced noise models (T1/T2/readout/crosstalk)
mitigation.rs NEW - Error mitigation pipeline (ZNE, measurement correction)
hardware.rs NEW - Hardware abstraction layer + provider stubs
transpiler.rs NEW - Noise-aware circuit transpilation
replay.rs NEW - Deterministic replay engine
witness.rs NEW - Cryptographic witness logging
confidence.rs NEW - Statistical confidence bounds
verification.rs NEW - Cross-backend automatic verification
```
## Dependencies
New dependencies required in `ruqu-core/Cargo.toml`:
| Crate | Version | Feature | Purpose |
|-------|---------|---------|---------|
| `sha2` | 0.10 | optional: `witness` | SHA-256 hashing for witness chain |
| `rand_chacha` | 0.3 | optional: `replay` | Deterministic ChaCha20 RNG |
| `bincode` | 1.3 | optional: `witness` | Binary serialization for witness entries |
All new features are behind optional feature flags to keep the default build
minimal and `no_std`-compatible.
## Consequences
### Positive
- **Scientific credibility**: Every result carries confidence bounds, is
replayable, and has a tamper-evident audit trail
- **Hardware-ready**: Circuits can target real quantum processors via the HAL
- **Verifiable**: Cross-backend verification catches simulation bugs and
hardware errors automatically
- **Non-breaking**: All new modules are additive; existing API is unchanged
- **Minimal dependencies**: Core scientific features (confidence, replay) need
only `rand_chacha`; witness logging adds `sha2` + `bincode`
### Negative
- **Increased surface area**: 9 new modules add maintenance burden
- **Feature interaction complexity**: Noise + mitigation + verification creates
a combinatorial test space
- **Performance overhead**: Witness logging and confidence computation add
~5-10% per-shot overhead
### Risks and Mitigations
| Risk | Likelihood | Impact | Mitigation |
|------|-----------|--------|------------|
| RNG non-determinism across platforms | Low | High | Pin ChaCha20, test on x86+ARM+WASM |
| Hash chain corruption | Low | High | CRC32 per entry + full chain verification |
| Confidence bound miscalculation | Medium | High | Property-based testing with known distributions |
| Hardware API rate limits | Medium | Low | Exponential backoff + circuit batching |
## References
- [ADR-QE-001: Quantum Engine Core Architecture](./ADR-QE-001-quantum-engine-core-architecture.md)
- [ADR-QE-002: Crate Structure & Integration](./ADR-QE-002-crate-structure-integration.md)
- [ADR-QE-004: Performance Optimization & Benchmarks](./ADR-QE-004-performance-optimization-benchmarks.md)
- Wilson, E.B. "Probable inference, the law of succession, and statistical inference" (1927)
- Aaronson & Gottesman, "Improved simulation of stabilizer circuits" (2004)
- Temme, Bravyi, Gambetta, "Error mitigation for short-depth quantum circuits" (2017)
- OpenQASM 3.0 Specification, arXiv:2104.14722