# ADR-QE-010: Observability & Monitoring Integration

**Status**: Proposed
**Date**: 2026-02-06
**Authors**: ruv.io, RuVector Team
**Deciders**: Architecture Review Board

---

## Context

ruVector provides comprehensive observability through the `ruvector-metrics` crate, which aggregates telemetry from all subsystems into a unified monitoring dashboard. The quantum simulation engine is a new subsystem that must participate in this observability infrastructure.

Effective monitoring of quantum simulation is essential for:

1. **Performance tuning**: Identifying bottlenecks in gate application, memory allocation, and parallelization efficiency.
2. **Resource management**: Tracking memory consumption to prevent OOM conditions and to inform auto-scaling decisions.
3. **Debugging**: Tracing the execution of specific circuits to diagnose incorrect results or unexpected behavior.
4. **Capacity planning**: Understanding workload patterns (qubit counts, circuit depths, simulation frequency) to plan infrastructure.
5. **Compliance**: Auditable logs of simulation executions for regulated environments (cryptographic validation, safety-critical applications).

### WASM Constraint

In WebAssembly deployment, there is no direct filesystem access and no native networking. Observability in WASM must use browser-compatible mechanisms: `console.log`, `console.warn`, `console.error`, or JavaScript callback functions registered by the host application.

### Existing Infrastructure

| Component | Role | Integration Point |
|---|---|---|
| `ruvector-metrics` | Metrics aggregation and export | Trait-based sink |
| `ruvector-monitor` | Real-time dashboard UI | WebSocket feed |
| Rust `tracing` crate | Structured logging and spans | Subscriber-based |
| Prometheus / OpenTelemetry | External monitoring | Exporter plugins |
| Ed25519 audit trail | Cryptographic logging | `ruqu-audit` crate |

## Decision

### 1. Metrics Schema

Every simulation execution emits a structured metrics record. The schema is versioned to allow evolution without breaking consumers.

```rust
use chrono::{DateTime, Utc};
use serde::{Deserialize, Serialize};
use uuid::Uuid;

/// Metrics emitted after each quantum simulation execution.
/// Schema version: 1.0.0
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct SimulationMetrics {
    /// Schema version for forward compatibility.
    pub schema_version: &'static str,
    /// Unique identifier for this simulation run.
    pub simulation_id: Uuid,
    /// Timestamp when simulation started (UTC).
    pub started_at: DateTime<Utc>,
    /// Timestamp when simulation completed (UTC).
    pub completed_at: DateTime<Utc>,

    // -- Circuit characteristics --
    /// Number of qubits in the circuit.
    pub qubit_count: u32,
    /// Total number of gates (before optimization).
    pub gate_count_raw: u64,
    /// Total number of gates (after optimization/fusion).
    pub gate_count_optimized: u64,
    /// Circuit depth (longest path from input to output).
    pub circuit_depth: u32,
    /// Number of two-qubit gates (entangling operations).
    pub two_qubit_gate_count: u64,

    // -- Execution metrics --
    /// Total wall-clock execution time in milliseconds.
    pub execution_time_ms: f64,
    /// Time spent in gate application (excluding allocation, measurement).
    pub gate_application_time_ms: f64,
    /// Time spent in measurement sampling.
    pub measurement_time_ms: f64,
    /// Peak memory consumption in bytes during simulation.
    pub peak_memory_bytes: u64,
    /// Memory allocated for the state vector / tensor network.
    pub state_memory_bytes: u64,
    /// Backend used for this simulation.
    pub backend: BackendType,

    // -- Throughput --
    /// Gates applied per second (optimized gate count / gate application time).
    pub gates_per_second: f64,
    /// Qubits * depth per second (a normalized throughput metric).
    pub quantum_volume_rate: f64,

    // -- Optimization statistics --
    /// Number of gates eliminated by fusion.
    pub gates_fused: u64,
    /// Number of gates eliminated as identity or redundant.
    pub gates_skipped: u64,
    /// Number of gate commutations applied.
    pub gates_commuted: u64,

    // -- Entanglement analysis --
    /// Number of independent qubit subsets (entanglement groups).
    pub entanglement_groups: u32,
    /// Sizes of each entanglement group.
    pub entanglement_group_sizes: Vec<u32>,

    // -- Measurement outcomes (if measured) --
    /// Number of measurement shots executed.
    pub measurement_shots: Option<u64>,
    /// Distribution entropy of measurement outcomes (bits).
    pub outcome_entropy: Option<f64>,

    // -- MPS-specific (tensor network backend) --
    /// Maximum bond dimension reached (MPS mode only).
    pub max_bond_dimension: Option<usize>,
    /// Estimated fidelity after MPS truncation.
    pub mps_fidelity_estimate: Option<f64>,

    // -- Error information --
    /// Whether the simulation completed successfully.
    pub success: bool,
    /// Error message if simulation failed.
    pub error: Option<String>,
    /// Error category for programmatic handling.
    pub error_kind: Option<SimulationErrorKind>,
}

#[derive(Debug, Clone, Serialize, Deserialize)]
pub enum BackendType {
    StateVector,
    TensorNetwork,
    Mps,
    Hybrid,
}

#[derive(Debug, Clone, Serialize, Deserialize)]
pub enum SimulationErrorKind {
    QubitLimitExceeded,
    MemoryAllocationFailed,
    InvalidGateTarget,
    InvalidParameter,
    ContractionFailed,
    MpsFidelityBelowThreshold,
    Timeout,
    InternalError,
}
```

### 2. Metrics Sink Trait

The engine publishes metrics through a trait abstraction, allowing different sinks for native and WASM environments:

```rust
use std::sync::Arc;

/// Trait for consuming simulation metrics.
/// Implementations exist for native (ruvector-metrics), WASM (JS callback),
/// and testing (in-memory collector).
pub trait MetricsSink: Send + Sync {
    /// Publish a completed simulation's metrics.
    fn publish(&self, metrics: &SimulationMetrics);

    /// Publish an incremental progress update (for long-running simulations).
    fn progress(&self, simulation_id: Uuid, percent_complete: f32, message: &str);

    /// Publish a health status update.
    fn health(&self, status: EngineHealthStatus);
}

/// Native implementation: forwards to ruvector-metrics.
pub struct NativeMetricsSink {
    // The registry type name is an assumption; the concrete type is
    // provided by the ruvector-metrics crate.
    registry: Arc<MetricsRegistry>,
}

impl MetricsSink for NativeMetricsSink {
    fn publish(&self, metrics: &SimulationMetrics) {
        // Emit as histogram/counter/gauge values
        self.registry.histogram("ruqu.execution_time_ms")
            .record(metrics.execution_time_ms);
        self.registry.gauge("ruqu.peak_memory_bytes")
            .set(metrics.peak_memory_bytes as f64);
        self.registry.counter("ruqu.simulations_total")
            .increment(1);
        self.registry.counter("ruqu.gates_applied_total")
            .increment(metrics.gate_count_optimized);
        self.registry.histogram("ruqu.gates_per_second")
            .record(metrics.gates_per_second);

        if !metrics.success {
            self.registry.counter("ruqu.errors_total")
                .increment(1);
        }
    }

    fn progress(&self, _id: Uuid, percent: f32, _msg: &str) {
        self.registry.gauge("ruqu.current_progress")
            .set(percent as f64);
    }

    fn health(&self, status: EngineHealthStatus) {
        self.registry.gauge("ruqu.health_status")
            .set(status.as_numeric());
    }
}
```

### 3. WASM Metrics Sink

In WASM, metrics are delivered via JavaScript callbacks:

```rust
#[cfg(target_arch = "wasm32")]
use wasm_bindgen::JsValue;

#[cfg(target_arch = "wasm32")]
pub struct WasmMetricsSink {
    /// JS callback function registered by host application.
    callback: js_sys::Function,
}

#[cfg(target_arch = "wasm32")]
impl MetricsSink for WasmMetricsSink {
    fn publish(&self, metrics: &SimulationMetrics) {
        let json = serde_json::to_string(metrics)
            .unwrap_or_else(|_| "{}".to_string());
        let js_value = JsValue::from_str(&json);
        let event_type = JsValue::from_str("simulation_complete");
        let _ = self.callback.call2(&JsValue::NULL, &event_type, &js_value);
    }

    fn progress(&self, id: Uuid, percent: f32, message: &str) {
        // Build the payload with serde_json so that quotes and backslashes
        // in `message` are escaped and the host can always parse it.
        let payload = serde_json::json!({
            "simulation_id": id.to_string(),
            "percent": percent,
            "message": message,
        })
        .to_string();
        let js_value = JsValue::from_str(&payload);
        let event_type = JsValue::from_str("simulation_progress");
        let _ = self.callback.call2(&JsValue::NULL, &event_type, &js_value);
    }

    fn health(&self, status: EngineHealthStatus) {
        let payload = serde_json::json!({ "status": status.as_str() }).to_string();
        let js_value = JsValue::from_str(&payload);
        let event_type = JsValue::from_str("engine_health");
        let _ = self.callback.call2(&JsValue::NULL, &event_type, &js_value);
    }
}
```

JavaScript host registration:

```javascript
// Host application registers the metrics callback
import init, { set_metrics_callback } from 'ruqu-wasm';

await init();

set_metrics_callback((eventType, data) => {
    const metrics = JSON.parse(data);

    switch (eventType) {
        case 'simulation_complete':
            console.log(`Simulation ${metrics.simulation_id} completed in ${metrics.execution_time_ms}ms`);
            dashboard.updateMetrics(metrics);
            break;
        case 'simulation_progress':
            progressBar.update(metrics.percent);
            break;
        case 'engine_health':
            healthIndicator.set(metrics.status);
            break;
    }
});
```

### 4. Tracing Integration

The engine integrates with the Rust `tracing` crate for structured logging and distributed tracing.

#### Span Hierarchy

```
ruqu::simulation (root span for entire simulation)
  |
  +-- ruqu::circuit_validation (validate circuit structure)
  |
  +-- ruqu::backend_selection (automatic backend choice)
  |
  +-- ruqu::optimization (gate fusion, commutation, etc.)
  |     |
  |     +-- ruqu::optimization::fusion (individual fusion passes)
  |     +-- ruqu::optimization::cancel (gate cancellation)
  |
  +-- ruqu::state_init (allocate and initialize state)
  |
  +-- ruqu::gate_application (apply all gates)
  |     |
  |     +-- ruqu::gate (individual gate -- DEBUG level only)
  |
  +-- ruqu::measurement (perform measurement sampling)
  |
  +-- ruqu::metrics_publish (emit metrics to sink)
  |
  +-- ruqu::state_cleanup (deallocate state vector)
```

#### Instrumentation Code

```rust
use tracing::{debug, info, instrument, trace};
use uuid::Uuid;

#[instrument(
    name = "ruqu::simulation",
    skip(circuit, config, metrics_sink),
    fields(
        qubit_count = circuit.num_qubits(),
        gate_count = circuit.gate_count(),
        simulation_id = %Uuid::new_v4(),
    )
)]
pub fn execute(
    circuit: &QuantumCircuit,
    shots: usize,
    config: &SimulationConfig,
    metrics_sink: &dyn MetricsSink,
) -> Result<SimulationResult, SimulationError> {
    // `SimulationResult` is an assumed name for the success type.
    info!(
        qubits = circuit.num_qubits(),
        gates = circuit.gate_count(),
        depth = circuit.depth(),
        shots = shots,
        "Starting quantum simulation"
    );

    // Validate
    let validation_span = tracing::info_span!("ruqu::circuit_validation").entered();
    validate_circuit(circuit)?;
    drop(validation_span);

    // Select backend
    let backend_span = tracing::info_span!("ruqu::backend_selection").entered();
    let backend = select_backend(circuit, config);
    info!(backend = backend.name(), "Backend selected");
    drop(backend_span);

    // Optimize
    let opt_span = tracing::info_span!("ruqu::optimization").entered();
    let optimized = optimize_circuit(circuit, config)?;
    info!(
        original_gates = circuit.gate_count(),
        optimized_gates = optimized.gate_count(),
        // Covers fused, cancelled, and skipped gates; saturates at zero if
        // optimization ever increases the gate count.
        gates_eliminated = circuit.gate_count().saturating_sub(optimized.gate_count()),
        "Circuit optimization complete"
    );
    drop(opt_span);

    // Execute
    let result = backend.execute(&optimized, shots, config)?;

    // At DEBUG level, log the execution summary
    debug!(
        execution_time_ms = result.execution_time_ms,
        peak_memory = result.peak_memory_bytes,
        "Simulation execution complete"
    );

    // At TRACE level only for small circuits, log amplitude information
    if circuit.num_qubits() <= 10 {
        trace!(
            amplitudes = ?result.state_vector_snapshot(),
            "Final state vector (small circuit trace)"
        );
    }

    Ok(result)
}
```

### 5. Structured Error Reporting

All errors carry structured context for programmatic handling:

```rust
#[derive(Debug, thiserror::Error)]
pub enum SimulationError {
    #[error("Qubit limit exceeded: requested {requested}, maximum {maximum}")]
    QubitLimitExceeded {
        requested: u32,
        maximum: u32,
        estimated_memory_bytes: u64,
        available_memory_bytes: u64,
    },

    #[error("Memory allocation failed for {requested_bytes} bytes")]
    MemoryAllocationFailed {
        requested_bytes: u64,
        qubit_count: u32,
        suggestion: &'static str,
    },

    #[error("Invalid gate target: qubit {qubit} in {qubit_count}-qubit circuit")]
    InvalidGateTarget {
        gate_name: String,
        qubit: u32,
        qubit_count: u32,
        gate_index: usize,
    },

    #[error("Invalid gate parameter: {parameter_name} = {value} ({reason})")]
    InvalidParameter {
        gate_name: String,
        parameter_name: String,
        value: f64,
        reason: &'static str,
    },

    #[error("Tensor contraction failed: {reason}")]
    ContractionFailed {
        reason: String,
        estimated_treewidth: usize,
        suggestion: &'static str,
    },

    #[error("MPS fidelity {fidelity:.6} below threshold {threshold:.6}")]
    MpsFidelityBelowThreshold {
        fidelity: f64,
        threshold: f64,
        max_bond_dimension: usize,
        suggestion: &'static str,
    },

    #[error("Simulation timed out after {elapsed_ms}ms (limit: {timeout_ms}ms)")]
    Timeout {
        elapsed_ms: u64,
        timeout_ms: u64,
        gates_completed: u64,
        gates_remaining: u64,
    },

    #[error("Internal error: {message}")]
    InternalError {
        message: String,
        source: Option<Box<dyn std::error::Error + Send + Sync>>,
    },
}
```

Where applicable, errors carry a suggestion that guides users toward resolution:

| Error | Suggestion |
|---|---|
| QubitLimitExceeded | "Reduce qubit count or enable tensor-network feature for large circuits" |
| MemoryAllocationFailed | "Try tensor-network backend or reduce qubit count by 1-2 (halves/quarters memory)" |
| ContractionFailed | "Circuit treewidth too high for tensor network; use state vector for <= 30 qubits" |
| MpsFidelityBelowThreshold | "Increase chi_max or switch to exact state vector for high-fidelity results" |

### 6. Health Checks

The engine exposes health status for monitoring systems:

```rust
use std::sync::atomic::Ordering;

#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct EngineHealthStatus {
    /// Whether the engine is ready to accept simulations.
    pub ready: bool,
    /// Maximum qubits supportable given current available memory.
    pub max_supported_qubits: u32,
    /// Available memory in bytes.
    pub available_memory_bytes: u64,
    /// Number of CPU cores available for parallel gate application.
    pub available_cores: usize,
    /// Whether the tensor-network backend is compiled in.
    pub tensor_network_available: bool,
    /// Current engine version.
    pub version: &'static str,
    /// Uptime since engine initialization (if applicable).
    pub uptime_seconds: Option<u64>,
    /// Number of simulations executed in current session.
    pub simulations_executed: u64,
    /// Total gates applied across all simulations in current session.
    pub total_gates_applied: u64,
}

/// Check engine health. Callable at any time.
pub fn quantum_engine_ready() -> EngineHealthStatus {
    let available_memory = estimate_available_memory();
    let max_qubits = compute_max_qubits(available_memory);

    EngineHealthStatus {
        ready: max_qubits >= 4, // Minimum useful simulation
        max_supported_qubits: max_qubits,
        available_memory_bytes: available_memory,
        available_cores: rayon::current_num_threads(),
        tensor_network_available: cfg!(feature = "tensor-network"),
        version: env!("CARGO_PKG_VERSION"),
        uptime_seconds: None, // Library mode; no persistent uptime
        simulations_executed: SESSION_COUNTER.load(Ordering::Relaxed),
        total_gates_applied: SESSION_GATES.load(Ordering::Relaxed),
    }
}
```

### 7. Logging Levels

| Level | Content | Audience | Performance Impact |
|---|---|---|---|
| ERROR | Simulation failures, OOM, invalid circuits | Operators, alerting | None |
| WARN | Approaching memory limits (>80%), MPS fidelity degradation, slow contraction | Operators | Negligible |
| INFO | Simulation start/end summaries, backend selection, optimization results | Developers, dashboards | Negligible |
| DEBUG | Per-optimization-pass details, memory allocation sizes, thread utilization | Developers debugging | Low |
| TRACE | Per-gate amplitude changes (small circuits only, n <= 10), SVD singular values | Deep debugging | High (small circuits only) |

TRACE level is gated on circuit size to prevent catastrophic log volume:

```rust
// TRACE-level amplitude logging is only emitted for circuits with <= 10 qubits.
// For larger circuits, TRACE only emits gate-level timing without amplitude data.
if tracing::enabled!(tracing::Level::TRACE) {
    if circuit.num_qubits() <= 10 {
        trace!(amplitudes = ?state.as_slice(), "Post-gate state");
    } else {
        trace!(gate_time_ns = elapsed.as_nanos(), "Gate applied");
    }
}
```

### 8. Dashboard Integration

Metrics from the quantum engine appear in the ruVector monitoring UI as a dedicated panel alongside vector operations, index health, and system resources.

```
+------------------------------------------------------------------+
| ruVector Monitoring Dashboard                                    |
+------------------------------------------------------------------+
|                                                                  |
| Vector Operations           | Quantum Simulations                |
| -------------------         | -----------------------            |
| Queries/sec: 12,450         | Simulations/min: 23                |
| P99 latency: 2.3ms          | Avg execution: 145ms               |
| Index size: 2.1M vectors    | Avg qubits: 18.4                   |
|                             | Peak memory: 4.2 GiB               |
|                             | Backend: SV 87% / TN 13%           |
|                             | Gates/sec: 2.1B                    |
|                             | Error rate: 0.02%                  |
|                             |                                    |
| System Resources            | Recent Simulations                 |
| -------------------         | -----------------------            |
| CPU: 34%                    | #a3f2..  24q  230ms  OK            |
| Memory: 61% (49/80 GiB)     | #b891..  16q   12ms  OK            |
| Threads: 64/256 active      | #c4d0..  30q   1.2s  OK            |
|                             | #d122..  35q  ERR    OOM           |
+------------------------------------------------------------------+
```

Metrics are published via the existing `ruvector-metrics` WebSocket feed:

```json
{
  "source": "ruqu",
  "type": "simulation_complete",
  "timestamp": "2026-02-06T14:23:01.442Z",
  "data": {
    "simulation_id": "a3f2e891-...",
    "qubit_count": 24,
    "execution_time_ms": 230.4,
    "peak_memory_bytes": 268435456,
    "backend": "StateVector",
    "gates_per_second": 2147483648,
    "success": true
  }
}
```

### 9. Prometheus / OpenTelemetry Export

For external monitoring, the native metrics sink exports standard Prometheus metrics:

```
# HELP ruqu_simulations_total Total quantum simulations executed
# TYPE ruqu_simulations_total counter
ruqu_simulations_total{backend="state_vector",status="success"} 1847
ruqu_simulations_total{backend="state_vector",status="error"} 3
ruqu_simulations_total{backend="tensor_network",status="success"} 241

# HELP ruqu_execution_time_ms Simulation execution time histogram
# TYPE ruqu_execution_time_ms histogram
ruqu_execution_time_ms_bucket{backend="state_vector",le="10"} 423
ruqu_execution_time_ms_bucket{backend="state_vector",le="100"} 1201
ruqu_execution_time_ms_bucket{backend="state_vector",le="1000"} 1834
ruqu_execution_time_ms_bucket{backend="state_vector",le="+Inf"} 1847

# HELP ruqu_peak_memory_bytes Peak memory during simulation
# TYPE ruqu_peak_memory_bytes gauge
ruqu_peak_memory_bytes 4294967296

# HELP ruqu_gates_per_second Gate application throughput
# TYPE ruqu_gates_per_second gauge
ruqu_gates_per_second 2.1e9

# HELP ruqu_max_supported_qubits Maximum qubits based on available memory
# TYPE ruqu_max_supported_qubits gauge
ruqu_max_supported_qubits 33
```

## Consequences

### Positive

1. **Unified observability**: Quantum simulation telemetry integrates with ruVector's existing monitoring infrastructure through the same sinks, dashboard, and exporters.
2. **Cross-platform**: The trait-based sink design supports native, WASM, and testing environments without code changes in the engine.
3. **Actionable errors**: Structured errors with suggestions reduce debugging time and improve developer experience.
4. **Performance visibility**: Gates-per-second, memory consumption, and backend selection metrics enable informed performance tuning.
5. **Compliance ready**: Structured logging with simulation IDs supports audit trail requirements.

### Negative

1. **Metric volume**: High-frequency simulations could generate significant metric volume. Mitigated by aggregation at the sink level.
2. **WASM callback overhead**: JSON serialization for WASM metrics adds ~0.1ms per simulation. Acceptable for typical workloads.
3. **Tracing overhead at DEBUG/TRACE**: Enabling tracing at these levels adds measurable overhead. Production deployments should use INFO or above.
4. **Schema evolution**: Changes to `SimulationMetrics` require versioned handling in consumers.

### Risks and Mitigations

| Risk | Mitigation |
|---|---|
| Metric volume overwhelming storage | Configurable sampling rate; aggregate in sink |
| WASM callback exceptions | Catch JS exceptions in callback wrapper; log to console |
| Schema breaking changes | Version field in metrics; consumer-side version dispatch |
| TRACE logging for large circuits | Qubit-count gate prevents amplitude logging above n=10 |

## References

- `ruvector-metrics` crate: internal metrics infrastructure
- Rust `tracing` crate: https://docs.rs/tracing
- OpenTelemetry Rust SDK: https://docs.rs/opentelemetry
- ADR-QE-005: WASM Compilation Target (WASM constraints)
- ADR-QE-011: Memory Gating & Power Management (resource monitoring)
- Prometheus exposition format: https://prometheus.io/docs/instrumenting/exposition_formats/
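
The Metrics Sink Trait section mentions a testing implementation (an in-memory collector) without showing one. The following is a minimal sketch of what such a sink could look like, not the project's actual implementation. To keep it compilable with the standard library alone, `SimulationMetrics` is reduced to three fields and `MetricsSink` to its `publish` method; the `InMemoryMetricsSink` type and its `snapshot` and `error_count` helpers are hypothetical names introduced for illustration.

```rust
use std::sync::Mutex;

/// Simplified stand-in for the ADR's `SimulationMetrics`
/// (assumption: only three fields are kept for this sketch).
#[derive(Debug, Clone, PartialEq)]
pub struct SimulationMetrics {
    pub qubit_count: u32,
    pub execution_time_ms: f64,
    pub success: bool,
}

/// Reduced form of the `MetricsSink` trait: only `publish` is modeled.
pub trait MetricsSink: Send + Sync {
    fn publish(&self, metrics: &SimulationMetrics);
}

/// Hypothetical in-memory collector intended for unit tests.
/// A `Mutex` provides interior mutability so that `publish(&self, ...)`
/// can append records while the sink stays `Send + Sync`.
#[derive(Default)]
pub struct InMemoryMetricsSink {
    records: Mutex<Vec<SimulationMetrics>>,
}

impl InMemoryMetricsSink {
    pub fn new() -> Self {
        Self::default()
    }

    /// Returns a copy of every record published so far, in publish order.
    pub fn snapshot(&self) -> Vec<SimulationMetrics> {
        self.records.lock().unwrap().clone()
    }

    /// Counts the records whose `success` flag is false.
    pub fn error_count(&self) -> usize {
        self.records
            .lock()
            .unwrap()
            .iter()
            .filter(|m| !m.success)
            .count()
    }
}

impl MetricsSink for InMemoryMetricsSink {
    fn publish(&self, metrics: &SimulationMetrics) {
        self.records.lock().unwrap().push(metrics.clone());
    }
}
```

A test would hand `&InMemoryMetricsSink` to the engine wherever a `&dyn MetricsSink` is expected, run a circuit, and then assert on `snapshot()` or `error_count()`. Storing cloned records rather than aggregates keeps assertions simple, at the cost of unbounded growth, which is usually acceptable in short-lived tests.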