Squashed 'vendor/ruvector/' content from commit b64c2172
git-subtree-dir: vendor/ruvector git-subtree-split: b64c21726f2bb37286d9ee36a7869fef60cc6900
docs/adr/quantum-engine/ADR-QE-009-tensor-network-evaluation.md (new file, 480 lines)
# ADR-QE-009: Tensor Network Evaluation Mode

**Status**: Proposed

**Date**: 2026-02-06

**Authors**: ruv.io, RuVector Team

**Deciders**: Architecture Review Board

---
## Context

Full state-vector simulation stores all 2^n complex amplitudes explicitly, yielding O(2^n) memory and O(G * 2^n) time for G gates. At n=30 this is 16 GiB; at n=40 it reaches 16 TiB. Many practically interesting circuits, however, contain limited entanglement:

| Circuit family | Entanglement structure | Treewidth |
|---|---|---|
| Shallow QAOA on sparse graphs | Bounded by graph degree | Low (often < 20) |
| Separate-register circuits | Disjoint qubit subsets | Sum of sub-widths |
| Near-Clifford circuits | Stabilizer + few T gates | Depends on T count |
| 1D brickwork (finite depth) | Area-law entanglement | O(depth) |
| Random deep circuits (all-to-all) | Volume-law entanglement | O(n) -- no gain |

For the first four families, tensor network (TN) methods can trade increased computation for drastically reduced memory by representing each gate as a tensor and contracting the resulting network in an optimized order. The contraction cost scales exponentially in the *treewidth* of the circuit's line graph rather than in the total qubit count.

QuantRS2 (the Rust quantum simulation reference) demonstrated tensor network contraction for circuits up to 60 qubits on commodity hardware when treewidth remained below ~25. ruVector's existing `ruvector-mincut` crate already solves graph partitioning problems that are structurally identical to contraction-order optimization, providing a natural integration point.

The ruQu engine needs this capability to support:

1. Surface code simulations at distance d >= 7 (49+ data qubits) for decoder validation, where the syndrome extraction circuit is shallow and geometrically local.
2. Variational algorithm prototyping (VQE, QAOA) on graphs larger than 30 nodes.
3. Hybrid workflows where part of the circuit is simulated via state vector and part via tensor contraction.
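The memory scaling in the Context section can be checked with a one-line helper; `sv_bytes` is a hypothetical name used only for this sketch, not part of the engine:

```rust
/// Bytes needed to store all 2^n amplitudes at 16 bytes (complex128) each.
/// u128 keeps the shift from overflowing for any realistic qubit count.
fn sv_bytes(n: u32) -> u128 {
    16u128 << n
}

fn main() {
    // n=30 -> 16 GiB, n=40 -> 16 TiB, as stated above.
    assert_eq!(sv_bytes(30) >> 30, 16); // GiB
    assert_eq!(sv_bytes(40) >> 40, 16); // TiB
    println!("n=30: {} GiB, n=40: {} TiB", sv_bytes(30) >> 30, sv_bytes(40) >> 40);
}
```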
## Decision

### 1. Feature-Gated Backend

Tensor network evaluation is implemented as an optional backend behind the `tensor-network` feature flag in `ruqu-core`:

```toml
# ruqu-core/Cargo.toml
[features]
default = ["state-vector"]
state-vector = []
tensor-network = ["dep:ndarray", "dep:petgraph"]
all-backends = ["state-vector", "tensor-network"]
```

When both backends are compiled in, the engine selects the backend at runtime based on circuit analysis (see Section 5 below).
### 2. Tensor Representation

Every gate becomes a tensor connecting the qubit wire indices it acts on:

| Gate type | Tensor rank | Shape | Example |
|---|---|---|---|
| Single-qubit (H, X, Rz, ...) | 2 | [2, 2] | Input wire -> output wire |
| Two-qubit (CNOT, CZ, ...) | 4 | [2, 2, 2, 2] | Two input wires -> two output wires |
| Three-qubit (Toffoli) | 6 | [2, 2, 2, 2, 2, 2] | Three input -> three output |
| Measurement projector | 2 | [2, 2] | Diagonal in computational basis |
| Initial state \|0> | 1 | [2] | Single output wire |

The circuit is converted into a tensor network graph where:

- Each tensor is a node.
- Each shared index (qubit wire between consecutive gates) is an edge.
- Open indices represent initial states and final measurement outcomes.

```
|0>---[H]---[CNOT_ctrl]---[Rz]---<meas>
                 |
|0>-----------[CNOT_tgt]---------<meas>
```

Becomes:

```
Node: init_0 (rank 1)
         |
Node: H_0 (rank 2)
         |
Node: CNOT_01 (rank 4)
        /  \
       |    Node: Rz_0 (rank 2)
       |         |
       |    Node: meas_0 (rank 2)
       |
Node: init_1 (rank 1)
   ... (connected via CNOT shared index)
Node: meas_1 (rank 2)
```
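As a concrete sketch of the rank-4 row in the table above, a CNOT can be written as a plain nested array rather than the engine's real tensor type; the index ordering `[c_out][t_out][c_in][t_in]` is an assumption made for illustration:

```rust
/// CNOT as a rank-4 tensor of shape [2, 2, 2, 2], indexed as
/// t[c_out][t_out][c_in][t_in]. The entry is 1.0 exactly where the gate
/// maps basis state |c_in t_in> to |c_out t_out>.
fn cnot_tensor() -> [[[[f64; 2]; 2]; 2]; 2] {
    let mut t = [[[[0.0; 2]; 2]; 2]; 2];
    for c in 0..2usize {
        for tq in 0..2usize {
            // Control passes through; target flips iff control is 1.
            t[c][tq ^ c][c][tq] = 1.0;
        }
    }
    t
}

fn main() {
    let t = cnot_tensor();
    assert_eq!(t[1][1][1][0], 1.0); // |10> -> |11>
    assert_eq!(t[0][0][0][0], 1.0); // |00> -> |00>
    assert_eq!(t[1][0][1][0], 0.0); // |10> -> |10> never happens
}
```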
### 3. Contraction Strategy

Contraction order determines whether the computation is tractable. The cost of contracting two tensors is the product of the dimensions of all indices involved. Finding the optimal contraction order is NP-hard (equivalent to finding minimum treewidth), so we use heuristics.
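The pairwise cost rule (product of the dimensions of all indices involved, with shared indices counted once) can be sketched as a standalone helper; the label-to-dimension map is an illustrative representation, not the engine's data structure:

```rust
use std::collections::HashMap;

/// Cost of contracting two tensors, each described as a map from index
/// label to dimension. Shared labels appear once in the product, matching
/// "product of the dimensions of all indices involved".
fn contraction_cost(a: &HashMap<&str, usize>, b: &HashMap<&str, usize>) -> usize {
    let mut dims: HashMap<&str, usize> = a.clone();
    for (label, d) in b {
        dims.insert(*label, *d); // shared index overwrites, not duplicates
    }
    dims.values().product()
}

fn main() {
    // A has indices i (dim 2), j (dim 4); B has j (dim 4), k (dim 8).
    let a = HashMap::from([("i", 2), ("j", 4)]);
    let b = HashMap::from([("j", 4), ("k", 8)]);
    // Involved indices: i, j, k -> 2 * 4 * 8 = 64.
    assert_eq!(contraction_cost(&a, &b), 64);
}
```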
#### Contraction Path Optimization Pseudocode

```
function find_contraction_path(tensor_network: TN) -> ContractionPath:
    // Phase 1: Simplify the network
    apply_trivial_contractions(tensor_network)  // rank-1 tensors, diagonal pairs

    // Phase 2: Detect community structure
    communities = detect_communities(tensor_network.graph)

    // Phase 3: Contract within communities first (small subproblems)
    intra_paths = []
    for community in communities:
        subgraph = tensor_network.subgraph(community)
        if subgraph.num_tensors <= 20:
            // Exact dynamic programming for small subgraphs
            path = optimal_einsum_dp(subgraph)
        else:
            // Greedy with lookahead for larger subgraphs
            path = greedy_with_lookahead(subgraph, lookahead=2)
        intra_paths.append(path)

    // Phase 4: Contract inter-community edges
    // Each community is now a single large tensor
    meta_graph = contract_communities(tensor_network, intra_paths)
    inter_path = greedy_with_lookahead(meta_graph, lookahead=3)

    // Phase 5: Compose the full path
    return compose_paths(intra_paths, inter_path)


function greedy_with_lookahead(tn: TN, lookahead: int) -> Path:
    path = []
    remaining = tn.clone()

    while remaining.num_tensors > 1:
        best_cost = INFINITY
        best_pair = None

        // Evaluate all candidate contractions
        for (i, j) in remaining.candidate_pairs():
            cost = contraction_cost(remaining, i, j)

            // Lookahead: estimate cost of subsequent contractions
            if lookahead > 0:
                simulated = remaining.simulate_contraction(i, j)
                future_cost = estimate_future_cost(simulated, lookahead - 1)
                cost += future_cost * DISCOUNT_FACTOR

            if cost < best_cost:
                best_cost = cost
                best_pair = (i, j)

        path.append(best_pair)
        remaining.contract(best_pair)

    return path
```

#### Community Detection via ruvector-mincut

The `ruvector-mincut` crate provides graph partitioning that is directly applicable to contraction ordering:

```rust
use ruvector_mincut::{partition, PartitionConfig};

fn partition_tensor_network(tn: &TensorNetwork) -> Vec<Vec<TensorId>> {
    let graph = tn.to_adjacency_graph();
    let config = PartitionConfig {
        num_partitions: estimate_optimal_partitions(tn),
        balance_factor: 1.1,          // Allow 10% imbalance
        minimize: Objective::EdgeCut, // Minimize inter-partition wires
    };
    partition(&graph, &config)
}
```

The edge cut directly corresponds to the bond dimension of the inter-community contraction, so minimizing edge cut minimizes the most expensive contraction step.
### 4. MPS (Matrix Product State) Mode

For circuits with 1D-like connectivity (nearest-neighbor gates on a line), a Matrix Product State representation is more efficient than general tensor contraction.

```
A[1] -- A[2] -- A[3] -- ... -- A[n]
  |       |       |              |
phys_1  phys_2  phys_3         phys_n
```

Each site tensor A[i] has shape `[bond_left, physical, bond_right]` where:

- `physical` = 2 (qubit dimension)
- `bond_left`, `bond_right` = bond dimension chi

| Bond dimension (chi) | Memory per site | Total memory (n qubits) | Approximation |
|---|---|---|---|
| 1 | 16 bytes | 16n bytes | Product state only |
| 16 | 4 KiB | 4n KiB | Low entanglement |
| 64 | 64 KiB | 64n KiB | Moderate entanglement |
| 256 | 1 MiB | n MiB | High entanglement |
| 1024 | 16 MiB | 16n MiB | Near exact for many circuits |

**Truncation policy**: After each two-qubit gate, perform SVD on the updated bond. If the bond dimension exceeds `chi_max`, truncate the smallest singular values. Track the total discarded weight (sum of squared discarded singular values) as a fidelity estimate:

```rust
pub struct MpsConfig {
    /// Maximum bond dimension. Truncation occurs above this.
    pub chi_max: usize,
    /// Minimum singular value to retain (relative to largest).
    pub svd_cutoff: f64,
    /// Accumulated truncation error (updated during simulation).
    pub fidelity_estimate: f64,
}

impl Default for MpsConfig {
    fn default() -> Self {
        Self {
            chi_max: 256,
            svd_cutoff: 1e-12,
            fidelity_estimate: 1.0,
        }
    }
}
```
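The truncation policy reduces to bookkeeping on the list of singular values returned by the SVD (the decomposition itself is assumed to come from a linear-algebra crate); `truncate` is a hypothetical helper illustrating the `chi_max` / `svd_cutoff` / discarded-weight rules above:

```rust
/// Truncate a list of singular values to at most `chi_max` entries,
/// dropping values below `svd_cutoff` times the largest. Returns the kept
/// values and the discarded weight (sum of squared discarded values),
/// which feeds the running fidelity estimate.
fn truncate(mut svals: Vec<f64>, chi_max: usize, svd_cutoff: f64) -> (Vec<f64>, f64) {
    svals.sort_by(|x, y| y.partial_cmp(x).unwrap()); // descending order
    let largest = svals.first().copied().unwrap_or(0.0);
    let keep = svals
        .iter()
        .take(chi_max)
        .take_while(|s| **s >= svd_cutoff * largest)
        .count();
    let discarded: f64 = svals[keep..].iter().map(|s| s * s).sum();
    (svals[..keep].to_vec(), discarded)
}

fn main() {
    let (kept, w) = truncate(vec![0.8, 0.5, 0.3, 0.1], 2, 1e-12);
    assert_eq!(kept.len(), 2); // chi_max enforced
    assert!((w - (0.3 * 0.3 + 0.1 * 0.1)).abs() < 1e-12);
    // fidelity_estimate would then be scaled by (1 - w).
}
```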
### 5. Automatic Mode Selection

The engine analyzes the circuit before execution to recommend a backend:

```rust
pub enum RecommendedBackend {
    StateVector { reason: &'static str },
    TensorNetwork { estimated_treewidth: usize, reason: &'static str },
    Mps { estimated_max_bond: usize, reason: &'static str },
}

pub fn recommend_backend(circuit: &QuantumCircuit) -> RecommendedBackend {
    let n = circuit.num_qubits();
    let depth = circuit.depth();
    let connectivity = circuit.connectivity_graph();

    // Rule 1: Small circuits always use state vector
    if n <= 20 {
        return RecommendedBackend::StateVector {
            reason: "Small circuit; state vector is fastest below 20 qubits",
        };
    }

    // Rule 2: Check for 1D connectivity (MPS candidate)
    if connectivity.max_degree() <= 2 && connectivity.is_path_graph() {
        let estimated_bond = 2_usize.pow(depth.min(20) as u32);
        return RecommendedBackend::Mps {
            estimated_max_bond: estimated_bond,
            reason: "1D nearest-neighbor connectivity detected",
        };
    }

    // Rule 3: Estimate treewidth for general TN
    let estimated_tw = estimate_treewidth(&connectivity, depth);
    if estimated_tw < 25 && n > 25 {
        return RecommendedBackend::TensorNetwork {
            estimated_treewidth: estimated_tw,
            reason: "Low treewidth relative to qubit count",
        };
    }

    // Rule 4: Check memory feasibility for state vector
    // (u128 so the shift cannot overflow for large n)
    let sv_memory: u128 = 16u128 << n; // bytes
    let available = estimate_available_memory() as u128;
    if sv_memory > available {
        // Force TN even if treewidth is high -- at least it has a chance
        return RecommendedBackend::TensorNetwork {
            estimated_treewidth: estimated_tw,
            reason: "State vector exceeds available memory; TN is only option",
        };
    }

    RecommendedBackend::StateVector {
        reason: "High treewidth circuit; state vector is more efficient",
    }
}
```
### 6. When Tensor Networks Win vs Lose

**Tensor networks win when:**

| Scenario | Why TN wins | Example |
|---|---|---|
| Shallow circuits on many qubits | Treewidth ~ depth, not n | 50-qubit depth-4 QAOA |
| Sparse graph connectivity | Low treewidth from graph structure | MaxCut on 3-regular graph |
| Separate registers | Independent contractions | n/2 Bell pairs |
| Near-Clifford | Stabilizer + few non-Clifford gates | Clifford + 5 T gates |
| Amplitude computation | Contract to single output, not full state | Sampling one bitstring |

**Tensor networks lose when:**

| Scenario | Why TN loses | Fallback |
|---|---|---|
| Deep random circuits | Treewidth ~ n | State vector (if n <= 30) |
| All-to-all connectivity | No structure to exploit | State vector |
| Full state tomography needed | Must contract once per amplitude | State vector |
| Very small circuits (n < 20) | Overhead exceeds state vector | State vector |
| High-fidelity MPS needed | Bond dimension grows exponentially | State vector or exact TN |
### 7. Example: 50-Qubit Shallow QAOA

Consider QAOA depth p=1 on a 50-node 3-regular graph:

```
Circuit structure:
- 50 qubits, initialized to |+> (50 H gates)
- 75 ZZ gates (one per edge), parameterized by gamma
- 50 Rx gates, parameterized by beta
- Total: 125 + 50 = 175 gates
- Circuit depth: 4 (H layer, ZZ layer (3-colorable), Rx layer, measure)

Graph treewidth of 3-regular graph: typically 8-15

Tensor network contraction:
- Community detection finds ~5-8 communities of 6-10 nodes
- Intra-community contraction: O(2^10) ~ 1024 per community
- Inter-community bonds: ~15 edges cut
- Effective contraction complexity: O(2^15) = 32768
- Compare to state vector: O(2^50) = 1.1 * 10^15

Memory comparison:
- State vector: 2^50 * 16 bytes = 16 PiB (impossible)
- Tensor network: ~100 MiB working memory
- Speedup factor: practically infinite (feasible vs infeasible)
```

```
Contraction Diagram (simplified):

Community A        Community B        Community C
 [q0-q9]            [q10-q19]          [q20-q29]
    |                   |                  |
    +--- bond=2^3 ------+---- bond=2^4 ----+
                        |
Community D        Community E
 [q30-q39]          [q40-q49]
    |                   |
    +--- bond=2^3 ------+

Peak intermediate tensor: 2^15 elements = 512 KiB
```
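The arithmetic in the example can be verified mechanically; this standalone sketch only restates the numbers from the figures above:

```rust
fn main() {
    // 3-regular graph on 50 nodes has 50 * 3 / 2 = 75 edges.
    let nodes = 50u32;
    let edges = nodes * 3 / 2;
    assert_eq!(edges, 75);

    // Gate count: 50 H (|+> prep) + 75 ZZ + 50 Rx = 175.
    let gates = nodes + edges + nodes;
    assert_eq!(gates, 175);

    // State vector: 2^50 amplitudes * 16 bytes = 16 PiB.
    let sv_bytes: u128 = 16u128 << 50;
    assert_eq!(sv_bytes >> 50, 16); // PiB

    // Peak TN intermediate: 2^15 elements * 16 bytes = 512 KiB.
    let peak_bytes: u64 = (1u64 << 15) * 16;
    assert_eq!(peak_bytes, 512 * 1024);
}
```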
### 8. Integration with State Vector Backend

Both backends implement the same trait:

```rust
pub trait SimulationBackend {
    /// Execute the circuit and return measurement results.
    fn execute(
        &self,
        circuit: &QuantumCircuit,
        shots: usize,
        config: &SimulationConfig,
    ) -> Result<SimulationResult, SimulationError>;

    /// Compute expectation value of an observable.
    fn expectation_value(
        &self,
        circuit: &QuantumCircuit,
        observable: &Observable,
        config: &SimulationConfig,
    ) -> Result<f64, SimulationError>;

    /// Return the backend name for logging.
    fn name(&self) -> &'static str;
}
```

Users interact through `QuantumCircuit` and never need to know which backend is active:

```rust
let circuit = QuantumCircuit::new(50)
    .h_all()
    .append_qaoa_layer(graph, gamma, beta)
    .measure_all();

// Automatic backend selection
let result = ruqu::execute(&circuit, 1000)?;
// -> Internally selects TensorNetwork backend due to n=50, low treewidth

// Or explicit backend override
let result = ruqu::execute_with_backend(
    &circuit,
    1000,
    Backend::TensorNetwork(TnConfig::default()),
)?;
```
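A minimal sketch of how a backend plugs into the trait and how the engine could dispatch through a trait object. The stub types stand in for the real `ruqu-core` definitions, `expectation_value` is omitted for brevity, and `NullBackend` is a hypothetical toy, not a shipped backend:

```rust
// Stub types standing in for the real ruqu-core definitions.
pub struct QuantumCircuit;
pub struct SimulationConfig;
#[derive(Debug)]
pub struct SimulationResult { pub shots: usize }
#[derive(Debug)]
pub struct SimulationError;

pub trait SimulationBackend {
    fn execute(
        &self,
        circuit: &QuantumCircuit,
        shots: usize,
        config: &SimulationConfig,
    ) -> Result<SimulationResult, SimulationError>;
    fn name(&self) -> &'static str;
}

/// Toy backend: records the shot count so dispatch can be demonstrated.
struct NullBackend;

impl SimulationBackend for NullBackend {
    fn execute(
        &self,
        _circuit: &QuantumCircuit,
        shots: usize,
        _config: &SimulationConfig,
    ) -> Result<SimulationResult, SimulationError> {
        Ok(SimulationResult { shots })
    }
    fn name(&self) -> &'static str { "null" }
}

fn main() {
    // Runtime dispatch through a trait object, as the engine would do
    // after recommend_backend() picks a backend.
    let backend: Box<dyn SimulationBackend> = Box::new(NullBackend);
    let result = backend.execute(&QuantumCircuit, 1000, &SimulationConfig).unwrap();
    assert_eq!(result.shots, 1000);
    assert_eq!(backend.name(), "null");
}
```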
### 9. Future: ruvector-mincut Integration for Contraction Ordering

The `ruvector-mincut` crate currently solves balanced graph partitioning for vector index sharding. The same algorithm directly applies to tensor network contraction ordering via the following correspondence:

| Graph partitioning concept | TN contraction concept |
|---|---|
| Vertex | Tensor |
| Edge weight | Bond dimension (log2) |
| Partition | Contraction subtree |
| Edge cut | Inter-partition bond cost |
| Balanced partition | Balanced contraction tree |

Phase 1 (this ADR): Use `ruvector-mincut` for community detection in contraction path optimization.

Phase 2 (future): Extend `ruvector-mincut` with hypergraph partitioning for multi-index tensor contractions, enabling higher-order tensor networks (e.g., PEPS for 2D circuits).
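The edge-weight row of the correspondence table (weight = log2 of bond dimension) means a cut's contraction cost is 2 raised to the summed edge weights crossing it; a toy check of that mapping, independent of the crate's actual API:

```rust
fn main() {
    // Bond dimensions on three edges crossing a partition boundary.
    let bonds = [2u32, 4, 8];
    // A min-cut solver would see log2 weights: 1 + 2 + 3 = 6.
    // (trailing_zeros == log2 for exact powers of two.)
    let cut_weight: u32 = bonds.iter().map(|b| b.trailing_zeros()).sum();
    assert_eq!(cut_weight, 6);
    // Cost of the combined inter-partition index: 2^6 = 64.
    assert_eq!(1u64 << cut_weight, 64);
}
```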
## Consequences

### Positive

1. **Dramatically expanded qubit range**: Shallow circuits on 40-60 qubits become tractable on commodity hardware.
2. **Surface code simulation**: Distance-7 surface codes (49 data + 48 ancilla = 97 qubits) can be simulated for decoder validation using MPS (the circuit is geometrically local).
3. **Unified interface**: Users write circuits once; backend selection is automatic.
4. **Synergy with ruvector-mincut**: Leverages existing graph partitioning investment.
5. **Complementary to state vector**: Each backend covers the other's weakness.

### Negative

1. **Implementation complexity**: Tensor contraction, SVD truncation, and path optimization are non-trivial to implement correctly and efficiently.
2. **Approximation risk**: MPS truncation introduces controlled but nonzero error. Users must understand fidelity estimates.
3. **Compilation time**: The `ndarray` and `petgraph` dependencies add to compile time when the feature is enabled.
4. **Testing surface**: Two backends double the testing matrix for correctness validation.
5. **Performance unpredictability**: Contraction cost depends on circuit structure in ways that are hard to predict without running the path optimizer.
### Risks and Mitigations

| Risk | Likelihood | Impact | Mitigation |
|---|---|---|---|
| Path optimizer finds poor ordering | Medium | High cost | Multiple heuristics + timeout fallback to greedy |
| MPS fidelity silently degrades | Medium | Incorrect results | Track discarded weight; warn if fidelity < 0.99 |
| Feature interaction bugs | Low | Incorrect results | Shared test suite: both backends must agree on small circuits |
| Memory spike during contraction | Medium | OOM | Pre-estimate peak intermediate tensor size; abort if too large |
## References

- QuantRS2 tensor network implementation: internal reference
- Markov & Shi, "Simulating Quantum Computation by Contracting Tensor Networks" (2008)
- Gray & Kourtis, "Hyper-optimized tensor network contraction" (2021) -- cotengra
- Schollwöck, "The density-matrix renormalization group in the age of matrix product states" (2011)
- ADR-QE-001: Core Engine Architecture (state vector backend)
- ADR-QE-005: WASM Compilation Target
- `ruvector-mincut` crate documentation
- ADR-014: Coherence Engine (graph partitioning reuse)