Merge commit 'd803bfe2b1fe7f5e219e50ac20d6801a0a58ac75' as 'vendor/ruvector'

2026-02-28 14:39:40 -05:00
parent 7885bf6278 d803bfe2b1
commit cd5943df23
7854 changed files with 3522914 additions and 0 deletions
--- a/vendor/ruvector/docs/adr/quantum-engine/ADR-QE-004-performance-optimization-benchmarks.md
+++ b/vendor/ruvector/docs/adr/quantum-engine/ADR-QE-004-performance-optimization-benchmarks.md
@@ -0,0 +1,564 @@
+# ADR-QE-004: Performance Optimization & Benchmarks
+
+**Status**: Proposed
+**Date**: 2026-02-06
+**Authors**: ruv.io, RuVector Team
+**Deciders**: Architecture Review Board
+
+## Context
+
+### Problem Statement
+
+Quantum state-vector simulation is computationally expensive. Every gate
+application touches the full amplitude vector of 2^n complex numbers, making
+gate application O(2^n) per gate for n qubits. For the quantum engine to be
+practical on edge devices and in browser environments, it must achieve
+competitive performance: millions of gates per second for small circuits,
+interactive latency for 10-20 qubit workloads, and the ability to handle
+moderately deep circuits (thousands of gates) without unacceptable delays.
+
+### Computational Cost Model
+
+For a circuit with n qubits, g gates, and s measurement shots:
+
+```
+Total operations (approximate):
+
+  Single-qubit gate:   2^n complex multiplications + 2^n complex additions
+  Two-qubit gate:      2^(n+1) complex multiplications + 2^(n+1) complex additions
+  Measurement (1 shot): 2^n probability calculations + sampling
+  Full circuit:        sum_i(cost(gate_i)) + s * 2^n
+
+  Example: 20-qubit circuit, 500 gates, 1024 shots
+    Gate cost:  500 * 2^20 * ~4 FLOP = ~2.1 billion FLOP
+    Measure:    1024 * 2^20 * ~2 FLOP = ~2.1 billion FLOP
+    Total:      ~4.2 billion FLOP
+```
+
+At 10 GFLOP/s (realistic single-core throughput), this is ~420 ms. With SIMD
+and multi-threading, we target 10-50x improvement.
+
+### Performance Baseline from Comparable Systems
+
+| Simulator | Language | 20-qubit H gate | Notes |
+|-----------|----------|-----------------|-------|
+| Qiskit Aer | C++/Python | ~50 ns | Heavily optimized, OpenMP |
+| Cirq | Python/C++ | ~200 ns | Google, less optimized |
+| QuantRS2 | Rust | ~57 ns | Rust-native, AVX2 |
+| Quest | C | ~40 ns | GPU-capable, highly tuned |
+| Target (ruQu) | Rust | < 60 ns | Competitive with QuantRS2 |
+
+These benchmarks measure per-gate time on a single-qubit Hadamard applied to
+a 20-qubit state vector. Our target is to match or beat QuantRS2, the closest
+comparable pure-Rust implementation.
+
+## Decision
+
+Implement a **multi-layered optimization strategy** with six complementary
+techniques, each addressing a different performance bottleneck.
+
+### Layer 1: SIMD Operations
+
+Use `ruvector-math` SIMD utilities to vectorize amplitude manipulation.
+Gate application fundamentally involves applying a 2x2 or 4x4 unitary matrix
+to pairs/quadruples of complex amplitudes. SIMD processes multiple amplitude
+components simultaneously.
+
+**Native SIMD dispatch**:
+
+```
+Architecture     Instruction Set     Complex f64 per Cycle
+-----------      ---------------     ---------------------
+x86_64           AVX-512             4 (512-bit / 128-bit per complex)
+x86_64           AVX2                2 (256-bit / 128-bit per complex)
+ARM64            NEON                1 (128-bit / 128-bit per complex)
+WASM             SIMD128             1 (128-bit / 128-bit per complex)
+Fallback         Scalar              1 (sequential)
+```
+
+**Single-qubit gate application with AVX2**:
+
+```
+For each pair of amplitudes (a[i], a[i + 2^target]):
+
+  Load:  a_re, a_im = load_f64x4([a[i].re, a[i].im, a[i+step].re, a[i+step].im])
+
+  Compute c0 = u00 * a + u01 * b:
+    mul_re = u00_re * a_re - u00_im * a_im + u01_re * b_re - u01_im * b_im
+    mul_im = u00_re * a_im + u00_im * a_re + u01_re * b_im + u01_im * b_re
+
+  Compute c1 = u10 * a + u11 * b:
+    (analogous)
+
+  Store: [c0.re, c0.im, c1.re, c1.im]
+```
+
+With AVX2 (256-bit), we process 2 complex f64 values per instruction,
+yielding a theoretical 2x speedup over scalar. With AVX-512, this doubles to
+4x. Practical speedup is 1.5-3.5x due to instruction latency and memory
+bandwidth.
+
+**Target per-gate throughput**:
+
+| Qubits | Amplitudes | AVX2 (est.) | AVX-512 (est.) | WASM SIMD (est.) |
+|--------|------------|-------------|----------------|-------------------|
+| 10 | 1,024 | ~15 ns | ~10 ns | ~30 ns |
+| 15 | 32,768 | ~1 us | ~0.5 us | ~2 us |
+| 20 | 1,048,576 | ~50 us | ~25 us | ~100 us |
+| 25 | 33,554,432 | ~1.5 ms | ~0.8 ms | ~3 ms |
+
+### Layer 2: Multithreading
+
+Rayon-based data parallelism splits the state vector across CPU cores for
+gate application. Each thread processes an independent contiguous block of
+amplitudes.
+
+**Parallelization strategy**:
+
+```
+State vector: [amp_0, amp_1, ..., amp_{2^n - 1}]
+
+Thread 0:  [amp_0          ... amp_{2^n/T - 1}]
+Thread 1:  [amp_{2^n/T}    ... amp_{2*2^n/T - 1}]
+  ...
+Thread T-1:[amp_{(T-1)*2^n/T} ... amp_{2^n - 1}]
+
+Where T = number of threads (Rayon work-stealing pool)
+```
+
+**Gate application requires care with target qubit position**:
+
+- If `target < log2(chunk_size)`: each chunk contains complete amplitude pairs.
+  Threads are fully independent. No synchronization needed.
+- If `target >= log2(chunk_size)`: amplitude pairs span chunk boundaries.
+  Must adjust chunk boundaries to align with gate structure.
+
+**Expected scaling**:
+
+```
+Qubits    Amps         1 thread    8 threads    Speedup
+------    ----         --------    ---------    -------
+15        32K          1 us        ~200 ns      ~5x
+20        1M           50 us       ~8 us        ~6x
+22        4M           200 us      ~30 us       ~6.5x
+24        16M          800 us      ~120 us      ~6.7x
+25        32M          1.5 ms      ~220 us      ~6.8x
+```
+
+Speedup plateaus below linear (8x for 8 threads) due to memory bandwidth
+saturation. At 24+ qubits, the state vector exceeds L3 cache and performance
+becomes memory-bound.
+
+**Parallelism threshold**: Do not parallelize below 14 qubits (16K amplitudes).
+The overhead of Rayon's work-stealing exceeds the benefit for small states.
+
+### Layer 3: Gate Fusion
+
+Preprocess circuits to combine consecutive gates into single matrix
+operations, reducing the number of state vector passes.
+
+**Fusion rules**:
+
+```
+Rule 1: Consecutive single-qubit gates on the same qubit
+  Rz(a) -> Rx(b) -> Rz(c)  ==>  U3(a, b, c)  [single matrix multiply]
+
+Rule 2: Consecutive two-qubit gates on the same pair
+  CNOT(0,1) -> CZ(0,1)  ==>  Fused_2Q(0,1)  [4x4 matrix]
+
+Rule 3: Single-qubit gate followed by controlled gate
+  H(0) -> CNOT(0,1)  ==>  Fused operation (absorb H into CNOT matrix)
+
+Rule 4: Identity cancellation
+  H -> H  ==>  Identity (remove both)
+  X -> X  ==>  Identity
+  S -> S_dag  ==>  Identity
+  CNOT -> CNOT (same control/target)  ==>  Identity
+```
+
+**Fusion effectiveness by algorithm**:
+
+| Algorithm | Typical Fusion Ratio | Gate Reduction |
+|-----------|----------------------|----------------|
+| VQE (UCCSD ansatz) | 1.8-2.5x | 30-50% fewer state passes |
+| Grover's | 1.2-1.5x | 15-25% |
+| QAOA | 1.5-2.0x | 25-40% |
+| QFT | 2.0-3.0x | 40-60% |
+| Random circuit | 1.1-1.3x | 5-15% |
+
+**Implementation**:
+
+```rust
+pub struct FusionPass;
+
+impl CircuitOptimizer for FusionPass {
+    fn optimize(&self, circuit: &mut QuantumCircuit) {
+        let mut i = 0;
+        while i < circuit.gates.len() - 1 {
+            let current = &circuit.gates[i];
+            let next = &circuit.gates[i + 1];
+
+            if can_fuse(current, next) {
+                let fused = compute_fused_matrix(current, next);
+                circuit.gates[i] = fused;
+                circuit.gates.remove(i + 1);
+                // Don't advance i; check if we can fuse again
+            } else {
+                i += 1;
+            }
+        }
+    }
+}
+```
+
+### Layer 4: Entanglement-Aware Splitting
+
+Track which qubits have interacted via entangling gates. Simulate independent
+qubit subsets as separate, smaller state vectors. Merge subsets when an
+entangling gate connects them.
+
+**Concept**:
+
+```
+Circuit: q0 --[H]--[CNOT(0,1)]--[Rz]--
+         q1 --[H]--[CNOT(0,1)]--[Ry]--
+         q2 --[H]--[X]---------[Rz]---[CNOT(2,0)]--
+         q3 --[H]--[Y]---------[Rx]--
+
+Initially: {q0}, {q1}, {q2}, {q3}  -- four 2^1 vectors (2 amps each)
+After CNOT(0,1): {q0,q1}, {q2}, {q3}  -- one 2^2 + two 2^1 vectors
+After CNOT(2,0): {q0,q1,q2}, {q3}  -- one 2^3 + one 2^1 vector
+
+Memory: 8 + 2 = 10 amplitudes  vs  2^4 = 16 amplitudes (full)
+```
+
+**Savings scale dramatically for circuits with late entanglement**:
+
+```
+Scenario: 20-qubit circuit, first 100 gates are local, then entangling
+
+Without splitting: 2^20 = 1M amplitudes from gate 1
+With splitting:    20 * 2^1 = 40 amplitudes until first entangling gate
+                   Progressively merge as entanglement grows
+```
+
+**Data structure**:
+
+```rust
+pub struct SplitState {
+    /// Each subset: (qubit indices, state vector)
+    subsets: Vec<(Vec<usize>, QuantumState)>,
+    /// Union-Find structure for tracking connectivity
+    connectivity: UnionFind,
+}
+
+impl SplitState {
+    pub fn apply_gate(&mut self, gate: &Gate, targets: &[usize]) {
+        if gate.is_entangling() {
+            // Merge subsets containing target qubits
+            let merged = self.merge_subsets(targets);
+            // Apply gate to merged state
+            merged.apply_gate(gate, targets);
+        } else {
+            // Apply to the subset containing the target qubit
+            let subset = self.find_subset(targets[0]);
+            subset.apply_gate(gate, targets);
+        }
+    }
+}
+```
+
+**When splitting helps vs. hurts**:
+
+| Circuit Type | Splitting Benefit |
+|-------------|-------------------|
+| Shallow QAOA (p=1-3) | High (qubits entangle gradually) |
+| VQE with local ansatz | High (many local rotations) |
+| Grover's (full oracle) | Low (oracle entangles all qubits early) |
+| QFT | Low (all-to-all entanglement) |
+| Random circuits | Low (entangles quickly) |
+
+The engine automatically disables splitting when all qubits are connected,
+falling back to full state-vector simulation with zero overhead.
+
+### Layer 5: Cache-Local Processing
+
+For large state vectors (>20 qubits), cache utilization becomes critical.
+The state vector exceeds L2 cache (typically 256 KB - 1 MB) and potentially
+L3 cache (8-32 MB).
+
+**Cache analysis**:
+
+```
+Qubits    State Size     L2 (512KB)    L3 (16MB)
+------    ----------     ----------    ---------
+18        4 MB           8x oversize   in cache
+20        16 MB          32x           in cache
+22        64 MB          128x          4x oversize
+24        256 MB         512x          16x oversize
+25        512 MB         1024x         32x oversize
+```
+
+**Techniques**:
+
+1. **Aligned allocation**: State vector aligned to cache line boundaries (64
+   bytes) for optimal prefetch behavior. Uses `ruvector-math` aligned allocator.
+
+2. **Blocking/tiling**: For gates on high-index qubits, the stride between
+   amplitude pairs is large (2^target). Tiling the access pattern to process
+   cache-line-sized blocks sequentially improves spatial locality.
+
+   ```
+   Without tiling (target qubit = 20):
+     Access pattern: amp[0], amp[1M], amp[1], amp[1M+1], ...
+     Cache misses: ~every access (stride = 16 MB)
+
+   With tiling (block size = L2/4):
+     Process block [0..64K], then [64K..128K], ...
+     Cache misses: ~1 per block (sequential within block)
+   ```
+
+3. **Prefetch hints**: Insert software prefetch instructions for the next block
+   of amplitudes while processing the current block.
+
+   ```rust
+   // Prefetch next cache line while processing current
+   #[cfg(target_arch = "x86_64")]
+   unsafe {
+       core::arch::x86_64::_mm_prefetch(
+           state.as_ptr().add(i + CACHE_LINE_AMPS) as *const i8,
+           core::arch::x86_64::_MM_HINT_T0,
+       );
+   }
+   ```
+
+### Layer 6: Lazy Evaluation
+
+Accumulate commuting rotations and defer their application until a
+non-commuting gate appears. This reduces the number of full state-vector
+passes for rotation-heavy circuits common in variational algorithms.
+
+**Commutation rules**:
+
+```
+Rz(a) commutes with Rz(b)  =>  Rz(a+b)
+Rx(a) commutes with Rx(b)  =>  Rx(a+b)
+Rz commutes with CZ        =>  Defer Rz
+Diagonal gates commute      =>  Combine phases
+
+But:
+Rz does NOT commute with H
+Rx does NOT commute with CNOT (on target)
+```
+
+**Implementation sketch**:
+
+```rust
+pub struct LazyAccumulator {
+    /// Pending rotations per qubit: (axis, total_angle)
+    pending: HashMap<usize, Vec<(RotationAxis, f64)>>,
+}
+
+impl LazyAccumulator {
+    pub fn push_gate(&mut self, gate: &Gate, target: usize) -> Option<FlushedGate> {
+        if let Some(rotation) = gate.as_rotation() {
+            if let Some(existing) = self.pending.get_mut(&target) {
+                if existing.last().map_or(false, |(axis, _)| *axis == rotation.axis) {
+                    // Same axis: accumulate angle
+                    existing.last_mut().unwrap().1 += rotation.angle;
+                    return None; // No gate emitted
+                }
+            }
+            self.pending.entry(target).or_default().push((rotation.axis, rotation.angle));
+            None
+        } else {
+            // Non-commuting gate: flush pending rotations for affected qubits
+            let flushed = self.flush(target);
+            Some(flushed)
+        }
+    }
+}
+```
+
+**Effectiveness**: VQE circuits with alternating Rz-Rx-Rz layers see 20-40%
+reduction in state-vector passes. QAOA circuits with repeated ZZ-rotation
+layers see 15-30% reduction.
+
+## Benchmark Targets
+
+### Primary Benchmark Suite
+
+| ID | Workload | Qubits | Gates | Target Time | Notes |
+|----|----------|--------|-------|-------------|-------|
+| B1 | Grover (8 qubits) | 8 | ~200 | < 1 ms | 3 Grover iterations |
+| B2 | Grover (16 qubits) | 16 | ~3,000 | < 10 ms | ~64 iterations |
+| B3 | VQE iteration (12 qubits) | 12 | ~120 | < 5 ms | Single parameter update |
+| B4 | VQE iteration (20 qubits) | 20 | ~300 | < 50 ms | UCCSD ansatz |
+| B5 | QAOA p=3 (10 nodes) | 10 | ~75 | < 1 ms | MaxCut on random graph |
+| B6 | QAOA p=5 (20 nodes) | 20 | ~200 | < 200 ms | MaxCut on random graph |
+| B7 | Surface code cycle (d=3) | 17 | ~20 | < 10 ms | Single syndrome round |
+| B8 | 1000 surface code cycles | 17 | ~20,000 | < 2 s | Repeated error correction |
+| B9 | QFT (20 qubits) | 20 | ~210 | < 30 ms | Full quantum Fourier transform |
+| B10 | Random circuit (25 qubits) | 25 | 100 | < 10 s | Worst-case memory test |
+
+### Micro-Benchmarks
+
+Per-gate timing for individual operations:
+
+| Gate | 10 qubits | 15 qubits | 20 qubits | 25 qubits |
+|------|-----------|-----------|-----------|-----------|
+| H | < 20 ns | < 0.5 us | < 50 us | < 1.5 ms |
+| CNOT | < 30 ns | < 1 us | < 80 us | < 2.5 ms |
+| Rz(theta) | < 15 ns | < 0.4 us | < 40 us | < 1.2 ms |
+| Toffoli | < 50 ns | < 1.5 us | < 120 us | < 4 ms |
+| Measure | < 10 ns | < 0.3 us | < 30 us | < 1 ms |
+
+### WASM-Specific Benchmarks
+
+| ID | Workload | Qubits | Target (WASM) | Target (Native) | Expected Ratio |
+|----|----------|--------|---------------|-----------------|----------------|
+| W1 | Grover (8) | 8 | < 3 ms | < 1 ms | ~3x |
+| W2 | VQE iter (12) | 12 | < 12 ms | < 5 ms | ~2.5x |
+| W3 | QAOA p=3 (10) | 10 | < 2.5 ms | < 1 ms | ~2.5x |
+| W4 | Random (20) | 20 | < 500 ms | < 200 ms | ~2.5x |
+| W5 | Random (25) | 25 | < 25 s | < 10 s | ~2.5x |
+
+### Benchmark Infrastructure
+
+Benchmarks use Criterion.rs for native and a custom timing harness for WASM:
+
+```rust
+// Native benchmarks (Criterion)
+use criterion::{criterion_group, criterion_main, Criterion};
+
+fn bench_grover_8(c: &mut Criterion) {
+    c.bench_function("grover_8_qubits", |b| {
+        b.iter(|| {
+            let mut state = QuantumState::new(8).unwrap();
+            let circuit = grover_circuit(8, &target_state);
+            state.execute(&circuit)
+        })
+    });
+}
+
+fn bench_single_gate_scaling(c: &mut Criterion) {
+    let mut group = c.benchmark_group("hadamard_scaling");
+    for n in [10, 12, 14, 16, 18, 20, 22, 24] {
+        group.bench_with_input(
+            BenchmarkId::from_parameter(n),
+            &n,
+            |b, &n| {
+                let mut state = QuantumState::new(n).unwrap();
+                let mut circuit = QuantumCircuit::new(n).unwrap();
+                circuit.gate(Gate::H, &[0]);
+                b.iter(|| state.execute(&circuit))
+            },
+        );
+    }
+    group.finish();
+}
+
+criterion_group!(benches, bench_grover_8, bench_single_gate_scaling);
+criterion_main!(benches);
+```
+
+**WASM benchmark harness**:
+
+```javascript
+// Browser-based benchmark using performance.now()
+async function benchmarkGrover8() {
+    const { QuantumCircuit, QuantumState } = await import('./ruqu_wasm.js');
+
+    const iterations = 100;
+    const start = performance.now();
+
+    for (let i = 0; i < iterations; i++) {
+        const circuit = QuantumCircuit.grover(8, 42);
+        const state = new QuantumState(8);
+        state.execute(circuit);
+        state.free();
+        circuit.free();
+    }
+
+    const elapsed = performance.now() - start;
+    console.log(`Grover 8-qubit: ${(elapsed / iterations).toFixed(3)} ms/iteration`);
+}
+```
+
+### Performance Regression Detection
+
+CI runs benchmark suite on every PR. Regressions exceeding 10% trigger a
+warning; regressions exceeding 25% block the merge.
+
+```yaml
+# In CI pipeline
+- name: Run benchmarks
+  run: |
+    cargo bench --package ruqu-core -- --save-baseline pr
+    cargo bench --package ruqu-core -- --baseline main --load-baseline pr
+    # critcmp compares and flags regressions
+    critcmp main pr --threshold 10
+```
+
+### Optimization Priority Matrix
+
+Not all optimizations apply equally to all workloads. The priority matrix
+guides implementation order:
+
+| Optimization | Impact (small circuits) | Impact (large circuits) | Impl Effort | Priority |
+|-------------|------------------------|------------------------|-------------|----------|
+| SIMD | Medium (1.5-2x) | High (2-3.5x) | Medium | P0 |
+| Multithreading | Low (overhead > benefit) | High (5-7x) | Medium | P1 |
+| Gate fusion | High (30-50% fewer passes) | Medium (15-30%) | Low | P0 |
+| Entanglement splitting | Variable (0-100x) | Low (quickly entangled) | High | P2 |
+| Cache tiling | Low (fits in cache) | High (2-4x) | Medium | P1 |
+| Lazy evaluation | Medium (20-40%) | Low (10-20%) | Low | P2 |
+
+**Implementation order**: SIMD -> Gate Fusion -> Multithreading -> Cache Tiling
+-> Lazy Evaluation -> Entanglement Splitting
+
+## Consequences
+
+### Positive
+
+- **Competitive performance**: Multi-layered approach targets performance
+  parity with state-of-the-art Rust simulators (QuantRS2)
+- **Interactive latency**: Most practical workloads (8-20 qubits) complete
+  in single-digit milliseconds, enabling real-time experimentation
+- **Scalable**: Each optimization layer addresses a different bottleneck,
+  providing compounding benefits
+- **Measurable**: Concrete benchmark targets enable objective progress tracking
+  and regression detection
+
+### Negative
+
+- **Optimization complexity**: Six optimization layers create significant
+  implementation and maintenance complexity
+- **Ongoing tuning**: Performance characteristics vary across hardware;
+  benchmarks must cover representative platforms
+- **Diminishing returns**: For >20 qubits, memory bandwidth dominates and
+  compute optimizations yield marginal gains
+- **Testing burden**: Each optimization must be validated for numerical
+  correctness across all gate types
+
+### Risks and Mitigations
+
+| Risk | Likelihood | Impact | Mitigation |
+|------|-----------|--------|------------|
+| Memory bandwidth bottleneck at >20 qubits | High | Medium | Document expected scaling; recommend native for large circuits |
+| Gate fusion introducing numerical error | Low | High | Comprehensive numerical tests comparing fused vs. unfused results |
+| Entanglement tracking overhead exceeding savings | Medium | Low | Automatic disable when all qubits connected within first 10 gates |
+| WASM SIMD not available in target runtime | Low | Medium | Graceful fallback to scalar; runtime feature detection |
+| Benchmark targets too aggressive for edge hardware | Medium | Low | Separate targets for edge (Cognitum) vs. desktop; scale expectations |
+
+## References
+
+- [ADR-QE-001: Quantum Engine Core Architecture](./ADR-QE-001-quantum-engine-core-architecture.md)
+- [ADR-QE-002: Crate Structure & Integration](./ADR-QE-002-crate-structure-integration.md)
+- [ADR-QE-003: WASM Compilation Strategy](./ADR-QE-003-wasm-compilation-strategy.md)
+- [ADR-003: SIMD Optimization Strategy](/docs/adr/ADR-003-simd-optimization-strategy.md)
+- [ruvector-math crate](/crates/ruvector-math/)
+- Guerreschi & Hogaboam, "Intel Quantum Simulator: A cloud-ready high-performance
+  simulator of quantum circuits" (2020)
+- Jones et al., "QuEST and High Performance Simulation of Quantum Computers" (2019)
+- QuantRS2 benchmark data (internal comparison)