ADR-QE-010: Observability & Monitoring Integration
Status: Proposed
Date: 2026-02-06
Authors: ruv.io, RuVector Team
Deciders: Architecture Review Board
Context
ruVector provides comprehensive observability through the ruvector-metrics crate,
which aggregates telemetry from all subsystems into a unified monitoring dashboard.
The quantum simulation engine is a new subsystem that must participate in this
observability infrastructure.
Effective monitoring of quantum simulation is essential for:
- Performance tuning: Identifying bottlenecks in gate application, memory allocation, and parallelization efficiency.
- Resource management: Tracking memory consumption to prevent OOM conditions and to inform auto-scaling decisions.
- Debugging: Tracing the execution of specific circuits to diagnose incorrect results or unexpected behavior.
- Capacity planning: Understanding workload patterns (qubit counts, circuit depths, simulation frequency) to plan infrastructure.
- Compliance: Auditable logs of simulation executions for regulated environments (cryptographic validation, safety-critical applications).
WASM Constraint
In WebAssembly deployment, there is no direct filesystem access and no native
networking. Observability in WASM must use browser-compatible mechanisms:
console.log, console.warn, console.error, or JavaScript callback functions
registered by the host application.
Existing Infrastructure
| Component | Role | Integration Point |
|---|---|---|
| ruvector-metrics | Metrics aggregation and export | Trait-based sink |
| ruvector-monitor | Real-time dashboard UI | WebSocket feed |
| Rust tracing crate | Structured logging and spans | Subscriber-based |
| Prometheus / OpenTelemetry | External monitoring | Exporter plugins |
| Ed25519 audit trail | Cryptographic logging | ruqu-audit crate |
Decision
1. Metrics Schema
Every simulation execution emits a structured metrics record. The schema is versioned to allow evolution without breaking consumers.
/// Metrics emitted after each quantum simulation execution.
/// Schema version: 1.0.0
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct SimulationMetrics {
/// Schema version for forward compatibility.
pub schema_version: &'static str,
/// Unique identifier for this simulation run.
pub simulation_id: Uuid,
/// Timestamp when simulation started (UTC).
pub started_at: DateTime<Utc>,
/// Timestamp when simulation completed (UTC).
pub completed_at: DateTime<Utc>,
// -- Circuit characteristics --
/// Number of qubits in the circuit.
pub qubit_count: u32,
/// Total number of gates (before optimization).
pub gate_count_raw: u64,
/// Total number of gates (after optimization/fusion).
pub gate_count_optimized: u64,
/// Circuit depth (longest path from input to output).
pub circuit_depth: u32,
/// Number of two-qubit gates (entangling operations).
pub two_qubit_gate_count: u64,
// -- Execution metrics --
/// Total wall-clock execution time in milliseconds.
pub execution_time_ms: f64,
/// Time spent in gate application (excluding allocation, measurement).
pub gate_application_time_ms: f64,
/// Time spent in measurement sampling.
pub measurement_time_ms: f64,
/// Peak memory consumption in bytes during simulation.
pub peak_memory_bytes: u64,
/// Memory allocated for the state vector / tensor network.
pub state_memory_bytes: u64,
/// Backend used for this simulation.
pub backend: BackendType,
// -- Throughput --
/// Gates applied per second (optimized gate count / gate application time).
pub gates_per_second: f64,
/// Qubits * depth per second (a normalized throughput metric).
pub quantum_volume_rate: f64,
// -- Optimization statistics --
/// Number of gates eliminated by fusion.
pub gates_fused: u64,
/// Number of gates eliminated as identity or redundant.
pub gates_skipped: u64,
/// Number of gate commutations applied.
pub gates_commuted: u64,
// -- Entanglement analysis --
/// Number of independent qubit subsets (entanglement groups).
pub entanglement_groups: u32,
/// Sizes of each entanglement group.
pub entanglement_group_sizes: Vec<u32>,
// -- Measurement outcomes (if measured) --
/// Number of measurement shots executed.
pub measurement_shots: Option<u64>,
/// Distribution entropy of measurement outcomes (bits).
pub outcome_entropy: Option<f64>,
// -- MPS-specific (tensor network backend) --
/// Maximum bond dimension reached (MPS mode only).
pub max_bond_dimension: Option<u32>,
/// Estimated fidelity after MPS truncation.
pub mps_fidelity_estimate: Option<f64>,
// -- Error information --
/// Whether the simulation completed successfully.
pub success: bool,
/// Error message if simulation failed.
pub error: Option<String>,
/// Error category for programmatic handling.
pub error_kind: Option<SimulationErrorKind>,
}
#[derive(Debug, Clone, Serialize, Deserialize)]
pub enum BackendType {
StateVector,
TensorNetwork,
Mps,
Hybrid,
}
#[derive(Debug, Clone, Serialize, Deserialize)]
pub enum SimulationErrorKind {
QubitLimitExceeded,
MemoryAllocationFailed,
InvalidGateTarget,
InvalidParameter,
ContractionFailed,
MpsFidelityBelowThreshold,
Timeout,
InternalError,
}
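The derived throughput and entropy fields follow from the raw counters in the schema. A minimal sketch of how they might be computed is shown here. The helper names are illustrative and not part of the engine API. It assumes that `quantum_volume_rate` divides by total wall-clock time, that a zero duration should yield 0.0 rather than infinity, and that measurement outcomes arrive as a bitstring-to-count map.

```rust
use std::collections::HashMap;

/// Gates per second: optimized gate count / gate application time.
/// Returns 0.0 for a non-positive duration so sinks never see inf or NaN
/// (an assumption of this sketch, not a documented engine rule).
pub fn gates_per_second(gate_count_optimized: u64, gate_application_time_ms: f64) -> f64 {
    if gate_application_time_ms <= 0.0 {
        return 0.0;
    }
    gate_count_optimized as f64 / (gate_application_time_ms / 1000.0)
}

/// Qubits * depth per second, here normalized by total execution time.
pub fn quantum_volume_rate(qubit_count: u32, circuit_depth: u32, execution_time_ms: f64) -> f64 {
    if execution_time_ms <= 0.0 {
        return 0.0;
    }
    (qubit_count as f64 * circuit_depth as f64) / (execution_time_ms / 1000.0)
}

/// Shannon entropy, in bits, of the measured outcome distribution.
/// Returns None when no shots were recorded, matching the Option field.
pub fn outcome_entropy(counts: &HashMap<String, u64>) -> Option<f64> {
    let total: u64 = counts.values().sum();
    if total == 0 {
        return None;
    }
    let total = total as f64;
    let entropy = counts
        .values()
        .filter(|&&c| c > 0) // zero-count outcomes contribute nothing
        .map(|&c| {
            let p = c as f64 / total;
            -p * p.log2()
        })
        .sum::<f64>();
    Some(entropy)
}
```

A Bell-state measurement split evenly between "00" and "11" has an entropy of 1 bit under this definition, which gives a quick sanity check for dashboards.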
2. Metrics Sink Trait
The engine publishes metrics through a trait abstraction, allowing different sinks for native and WASM environments:
/// Trait for consuming simulation metrics.
/// Implementations exist for native (ruvector-metrics), WASM (JS callback),
/// and testing (in-memory collector).
pub trait MetricsSink: Send + Sync {
/// Publish a completed simulation's metrics.
fn publish(&self, metrics: &SimulationMetrics);
/// Publish an incremental progress update (for long-running simulations).
fn progress(&self, simulation_id: Uuid, percent_complete: f32, message: &str);
/// Publish a health status update.
fn health(&self, status: EngineHealthStatus);
}
/// Native implementation: forwards to ruvector-metrics.
pub struct NativeMetricsSink {
registry: Arc<ruvector_metrics::Registry>,
}
impl MetricsSink for NativeMetricsSink {
fn publish(&self, metrics: &SimulationMetrics) {
// Emit as histogram/counter/gauge values
self.registry.histogram("ruqu.execution_time_ms")
.record(metrics.execution_time_ms);
self.registry.gauge("ruqu.peak_memory_bytes")
.set(metrics.peak_memory_bytes as f64);
self.registry.counter("ruqu.simulations_total")
.increment(1);
self.registry.counter("ruqu.gates_applied_total")
.increment(metrics.gate_count_optimized);
self.registry.histogram("ruqu.gates_per_second")
.record(metrics.gates_per_second);
if !metrics.success {
self.registry.counter("ruqu.errors_total")
.increment(1);
}
}
fn progress(&self, _id: Uuid, percent: f32, _msg: &str) {
self.registry.gauge("ruqu.current_progress")
.set(percent as f64);
}
fn health(&self, status: EngineHealthStatus) {
self.registry.gauge("ruqu.health_status")
.set(status.as_numeric());
}
}
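The trait documentation mentions an in-memory collector for testing. One way such a sink could look is sketched below. To keep the example standalone, `SimulationMetrics`, `EngineHealthStatus`, and the `u128` simulation ID are reduced stand-ins for the real types, and `InMemorySink` with its accessor names is hypothetical.

```rust
use std::sync::Mutex;

// Reduced stand-ins for the real types, so the sketch compiles on its own.
#[derive(Debug, Clone, PartialEq)]
pub struct SimulationMetrics {
    pub simulation_id: u128, // the real schema uses Uuid
    pub execution_time_ms: f64,
    pub success: bool,
}

#[derive(Debug, Clone, PartialEq)]
pub struct EngineHealthStatus {
    pub ready: bool,
}

pub trait MetricsSink: Send + Sync {
    fn publish(&self, metrics: &SimulationMetrics);
    fn progress(&self, simulation_id: u128, percent_complete: f32, message: &str);
    fn health(&self, status: EngineHealthStatus);
}

/// Hypothetical test sink: records every event behind a Mutex so that
/// assertions can inspect exactly what the engine emitted.
#[derive(Default)]
pub struct InMemorySink {
    published: Mutex<Vec<SimulationMetrics>>,
    progress_events: Mutex<Vec<(u128, f32, String)>>,
    health_events: Mutex<Vec<EngineHealthStatus>>,
}

impl InMemorySink {
    pub fn published(&self) -> Vec<SimulationMetrics> {
        self.published.lock().unwrap().clone()
    }

    pub fn progress_events(&self) -> Vec<(u128, f32, String)> {
        self.progress_events.lock().unwrap().clone()
    }

    pub fn health_events(&self) -> Vec<EngineHealthStatus> {
        self.health_events.lock().unwrap().clone()
    }

    /// Number of published records with success == false.
    pub fn error_count(&self) -> usize {
        self.published.lock().unwrap().iter().filter(|m| !m.success).count()
    }
}

impl MetricsSink for InMemorySink {
    fn publish(&self, metrics: &SimulationMetrics) {
        self.published.lock().unwrap().push(metrics.clone());
    }

    fn progress(&self, simulation_id: u128, percent_complete: f32, message: &str) {
        self.progress_events
            .lock()
            .unwrap()
            .push((simulation_id, percent_complete, message.to_string()));
    }

    fn health(&self, status: EngineHealthStatus) {
        self.health_events.lock().unwrap().push(status);
    }
}
```

Because `Mutex<Vec<T>>` is `Send + Sync` for these payload types, the sink satisfies the trait bound without unsafe code and can be passed to the engine as `&dyn MetricsSink`.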
3. WASM Metrics Sink
In WASM, metrics are delivered via JavaScript callbacks:
#[cfg(target_arch = "wasm32")]
pub struct WasmMetricsSink {
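/// JS callback function registered by host application.
/// Note: js_sys::Function is neither Send nor Sync, so this impl only
/// compiles if the Send + Sync bound on MetricsSink is relaxed for wasm32
/// or the callback is held in a single-threaded wrapper.
callback: js_sys::Function,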
}
#[cfg(target_arch = "wasm32")]
impl MetricsSink for WasmMetricsSink {
fn publish(&self, metrics: &SimulationMetrics) {
let json = serde_json::to_string(metrics)
.unwrap_or_else(|_| "{}".to_string());
let js_value = JsValue::from_str(&json);
let event_type = JsValue::from_str("simulation_complete");
let _ = self.callback.call2(&JsValue::NULL, &event_type, &js_value);
}
fn progress(&self, id: Uuid, percent: f32, message: &str) {
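// serde_json escapes quotes and backslashes in `message`, so the host's
// JSON.parse cannot fail on arbitrary progress text.
let payload = serde_json::json!({
"simulation_id": id.to_string(),
"percent": percent,
"message": message,
})
.to_string();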
let js_value = JsValue::from_str(&payload);
let event_type = JsValue::from_str("simulation_progress");
let _ = self.callback.call2(&JsValue::NULL, &event_type, &js_value);
}
fn health(&self, status: EngineHealthStatus) {
let payload = format!(r#"{{"status":"{}"}}"#, status.as_str());
let js_value = JsValue::from_str(&payload);
let event_type = JsValue::from_str("engine_health");
let _ = self.callback.call2(&JsValue::NULL, &event_type, &js_value);
}
}
JavaScript host registration:
// Host application registers the metrics callback
import init, { set_metrics_callback } from 'ruqu-wasm';
await init();
set_metrics_callback((eventType, data) => {
const metrics = JSON.parse(data);
switch (eventType) {
case 'simulation_complete':
console.log(`Simulation ${metrics.simulation_id} completed in ${metrics.execution_time_ms}ms`);
dashboard.updateMetrics(metrics);
break;
case 'simulation_progress':
progressBar.update(metrics.percent);
break;
case 'engine_health':
healthIndicator.set(metrics.status);
break;
}
});
4. Tracing Integration
The engine integrates with the Rust tracing crate for structured logging and
distributed tracing.
Span Hierarchy
ruqu::simulation (root span for entire simulation)
|
+-- ruqu::circuit_validation (validate circuit structure)
|
+-- ruqu::backend_selection (automatic backend choice)
|
+-- ruqu::optimization (gate fusion, commutation, etc.)
| |
| +-- ruqu::optimization::fusion (individual fusion passes)
| +-- ruqu::optimization::cancel (gate cancellation)
|
+-- ruqu::state_init (allocate and initialize state)
|
+-- ruqu::gate_application (apply all gates)
| |
| +-- ruqu::gate (individual gate -- DEBUG level only)
|
+-- ruqu::measurement (perform measurement sampling)
|
+-- ruqu::metrics_publish (emit metrics to sink)
|
+-- ruqu::state_cleanup (deallocate state vector)
Instrumentation Code
use tracing::{debug, info, instrument, trace};
use uuid::Uuid;
#[instrument(
name = "ruqu::simulation",
skip(circuit, config, metrics_sink),
fields(
qubit_count = circuit.num_qubits(),
gate_count = circuit.gate_count(),
simulation_id = %Uuid::new_v4(),
)
)]
pub fn execute(
circuit: &QuantumCircuit,
shots: usize,
config: &SimulationConfig,
metrics_sink: &dyn MetricsSink,
) -> Result<SimulationResult, SimulationError> {
info!(
qubits = circuit.num_qubits(),
gates = circuit.gate_count(),
depth = circuit.depth(),
shots = shots,
"Starting quantum simulation"
);
// Validate
let _validation_span = tracing::info_span!("ruqu::circuit_validation").entered();
validate_circuit(circuit)?;
drop(_validation_span);
// Select backend
let _backend_span = tracing::info_span!("ruqu::backend_selection").entered();
let backend = select_backend(circuit, config);
info!(backend = backend.name(), "Backend selected");
drop(_backend_span);
// Optimize
let _opt_span = tracing::info_span!("ruqu::optimization").entered();
let optimized = optimize_circuit(circuit, config)?;
info!(
original_gates = circuit.gate_count(),
optimized_gates = optimized.gate_count(),
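// Covers fused, cancelled, and skipped gates; saturating_sub guards against underflow.
gates_eliminated = circuit.gate_count().saturating_sub(optimized.gate_count()),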
"Circuit optimization complete"
);
drop(_opt_span);
// Execute
let result = backend.execute(&optimized, shots, config)?;
// At DEBUG level, log the execution summary
debug!(
execution_time_ms = result.execution_time_ms,
peak_memory = result.peak_memory_bytes,
"Simulation execution complete"
);
// At TRACE level only for small circuits, log amplitude information
if circuit.num_qubits() <= 10 {
trace!(
amplitudes = ?result.state_vector_snapshot(),
"Final state vector (small circuit trace)"
);
}
Ok(result)
}
5. Structured Error Reporting
All errors carry structured context for programmatic handling:
#[derive(Debug, thiserror::Error)]
pub enum SimulationError {
#[error("Qubit limit exceeded: requested {requested}, maximum {maximum}")]
QubitLimitExceeded {
requested: u32,
maximum: u32,
estimated_memory_bytes: u64,
available_memory_bytes: u64,
},
#[error("Memory allocation failed for {requested_bytes} bytes")]
MemoryAllocationFailed {
requested_bytes: u64,
qubit_count: u32,
suggestion: &'static str,
},
#[error("Invalid gate target: qubit {qubit} in {qubit_count}-qubit circuit")]
InvalidGateTarget {
gate_name: String,
qubit: u32,
qubit_count: u32,
gate_index: usize,
},
#[error("Invalid gate parameter: {parameter_name} = {value} ({reason})")]
InvalidParameter {
gate_name: String,
parameter_name: String,
value: f64,
reason: &'static str,
},
#[error("Tensor contraction failed: {reason}")]
ContractionFailed {
reason: String,
estimated_treewidth: usize,
suggestion: &'static str,
},
#[error("MPS fidelity {fidelity:.6} below threshold {threshold:.6}")]
MpsFidelityBelowThreshold {
fidelity: f64,
threshold: f64,
max_bond_dimension: usize,
suggestion: &'static str,
},
#[error("Simulation timed out after {elapsed_ms}ms (limit: {timeout_ms}ms)")]
Timeout {
elapsed_ms: u64,
timeout_ms: u64,
gates_completed: u64,
gates_remaining: u64,
},
#[error("Internal error: {message}")]
InternalError {
message: String,
source: Option<Box<dyn std::error::Error + Send + Sync>>,
},
}
Each error variant includes a suggestion field where applicable, guiding users
toward resolution:
| Error | Suggestion |
|---|---|
| QubitLimitExceeded | "Reduce qubit count or enable tensor-network feature for large circuits" |
| MemoryAllocationFailed | "Try tensor-network backend or reduce qubit count by 1-2 (halves/quarters memory)" |
| ContractionFailed | "Circuit treewidth too high for tensor network; use state vector for <= 30 qubits" |
| MpsFidelityBelowThreshold | "Increase chi_max or switch to exact state vector for high-fidelity results" |
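The metrics record stores an `error_kind` alongside the error message, so the rich error has to be reduced to its category at publish time. A sketch of that mapping, together with a uniform accessor for the optional suggestion, might look as follows. Only a subset of variants is shown, the `thiserror` derive is omitted to keep the example dependency-free, and the `kind()` and `suggestion()` method names are assumptions.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SimulationErrorKind {
    QubitLimitExceeded,
    MemoryAllocationFailed,
    Timeout,
    InternalError,
}

// Reduced version of SimulationError (subset of variants and fields).
#[derive(Debug)]
pub enum SimulationError {
    QubitLimitExceeded { requested: u32, maximum: u32 },
    MemoryAllocationFailed { requested_bytes: u64, suggestion: &'static str },
    Timeout { elapsed_ms: u64, timeout_ms: u64 },
    InternalError { message: String },
}

impl SimulationError {
    /// Category used for SimulationMetrics::error_kind (hypothetical helper).
    pub fn kind(&self) -> SimulationErrorKind {
        match self {
            SimulationError::QubitLimitExceeded { .. } => SimulationErrorKind::QubitLimitExceeded,
            SimulationError::MemoryAllocationFailed { .. } => {
                SimulationErrorKind::MemoryAllocationFailed
            }
            SimulationError::Timeout { .. } => SimulationErrorKind::Timeout,
            SimulationError::InternalError { .. } => SimulationErrorKind::InternalError,
        }
    }

    /// Returns the remediation hint for variants that carry one.
    pub fn suggestion(&self) -> Option<&'static str> {
        match self {
            SimulationError::MemoryAllocationFailed { suggestion, .. } => Some(*suggestion),
            _ => None,
        }
    }
}
```

Keeping the match exhaustive (no wildcard arm in `kind()`) means that adding a new error variant fails to compile until a category is chosen for it.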
6. Health Checks
The engine exposes health status for monitoring systems:
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct EngineHealthStatus {
/// Whether the engine is ready to accept simulations.
pub ready: bool,
/// Maximum qubits supportable given current available memory.
pub max_supported_qubits: u32,
/// Available memory in bytes.
pub available_memory_bytes: u64,
/// Number of CPU cores available for parallel gate application.
pub available_cores: usize,
/// Whether the tensor-network backend is compiled in.
pub tensor_network_available: bool,
/// Current engine version.
pub version: &'static str,
/// Uptime since engine initialization (if applicable).
pub uptime_seconds: Option<f64>,
/// Number of simulations executed in current session.
pub simulations_executed: u64,
/// Total gates applied across all simulations in current session.
pub total_gates_applied: u64,
}
/// Check engine health. Callable at any time.
pub fn quantum_engine_ready() -> EngineHealthStatus {
let available_memory = estimate_available_memory();
let max_qubits = compute_max_qubits(available_memory);
EngineHealthStatus {
ready: max_qubits >= 4, // Minimum useful simulation
max_supported_qubits: max_qubits,
available_memory_bytes: available_memory,
available_cores: rayon::current_num_threads(),
tensor_network_available: cfg!(feature = "tensor-network"),
version: env!("CARGO_PKG_VERSION"),
uptime_seconds: None, // Library mode; no persistent uptime
simulations_executed: SESSION_COUNTER.load(Ordering::Relaxed),
total_gates_applied: SESSION_GATES.load(Ordering::Relaxed),
}
}
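The health check relies on `compute_max_qubits`, which is not defined in this ADR. A plausible sketch for the state-vector backend is given below, assuming 16 bytes per amplitude (complex128) and ignoring scratch buffers and other overhead. The constant and the `state_vector_bytes` helper are illustrative, and the real engine may reserve headroom or use a different amplitude width.

```rust
/// Assumed amplitude width: two f64 values (complex128).
const BYTES_PER_AMPLITUDE: u64 = 16;

/// Memory needed for a dense n-qubit state vector, or None on overflow.
pub fn state_vector_bytes(qubits: u32) -> Option<u64> {
    1u64.checked_shl(qubits)?.checked_mul(BYTES_PER_AMPLITUDE)
}

/// Largest n such that 2^n amplitudes fit in the available memory.
/// Returns 0 when not even a single amplitude fits.
pub fn compute_max_qubits(available_memory_bytes: u64) -> u32 {
    let amplitudes = available_memory_bytes / BYTES_PER_AMPLITUDE;
    if amplitudes == 0 {
        return 0;
    }
    // floor(log2(amplitudes)) via the position of the highest set bit.
    63 - amplitudes.leading_zeros()
}
```

Under these assumptions a 24-qubit state vector needs 268,435,456 bytes, which matches the `peak_memory_bytes` value in the sample WebSocket payload later in this document.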
7. Logging Levels
| Level | Content | Audience | Performance Impact |
|---|---|---|---|
| ERROR | Simulation failures, OOM, invalid circuits | Operators, alerting | None |
| WARN | Approaching memory limits (>80%), MPS fidelity degradation, slow contraction | Operators | Negligible |
| INFO | Simulation start/end summaries, backend selection, optimization results | Developers, dashboards | Negligible |
| DEBUG | Per-optimization-pass details, memory allocation sizes, thread utilization | Developers debugging | Low |
| TRACE | Per-gate amplitude changes (small circuits only, n <= 10), SVD singular values | Deep debugging | High (small circuits only) |
TRACE level is gated on circuit size to prevent catastrophic log volume:
// TRACE-level amplitude logging is only emitted for circuits with <= 10 qubits.
// For larger circuits, TRACE only emits gate-level timing without amplitude data.
if tracing::enabled!(tracing::Level::TRACE) {
if circuit.num_qubits() <= 10 {
trace!(amplitudes = ?state.as_slice(), "Post-gate state");
} else {
trace!(gate_time_ns = elapsed.as_nanos(), "Gate applied");
}
}
8. Dashboard Integration
Metrics from the quantum engine appear in the ruVector monitoring UI as a dedicated panel alongside vector operations, index health, and system resources.
+------------------------------------------------------------------+
| ruVector Monitoring Dashboard |
+------------------------------------------------------------------+
| |
| Vector Operations | Quantum Simulations |
| ------------------- | ----------------------- |
| Queries/sec: 12,450 | Simulations/min: 23 |
| P99 latency: 2.3ms | Avg execution: 145ms |
| Index size: 2.1M vectors | Avg qubits: 18.4 |
| | Peak memory: 4.2 GiB |
| | Backend: SV 87% / TN 13% |
| | Gates/sec: 2.1B |
| | Error rate: 0.02% |
| | |
| System Resources | Recent Simulations |
| ------------------- | ----------------------- |
| CPU: 34% | #a3f2.. 24q 230ms OK |
| Memory: 61% (49/80 GiB) | #b891.. 16q 12ms OK |
| Threads: 64/256 active | #c4d0.. 30q 1.2s OK |
| | #d122.. 35q ERR OOM |
+------------------------------------------------------------------+
Metrics are published via the existing ruvector-metrics WebSocket feed:
{
"source": "ruqu",
"type": "simulation_complete",
"timestamp": "2026-02-06T14:23:01.442Z",
"data": {
"simulation_id": "a3f2e891-...",
"qubit_count": 24,
"execution_time_ms": 230.4,
"peak_memory_bytes": 268435456,
"backend": "StateVector",
"gates_per_second": 2147483648,
"success": true
}
}
9. Prometheus / OpenTelemetry Export
For external monitoring, the native metrics sink exports standard Prometheus metrics:
# HELP ruqu_simulations_total Total quantum simulations executed
# TYPE ruqu_simulations_total counter
ruqu_simulations_total{backend="state_vector",status="success"} 1847
ruqu_simulations_total{backend="state_vector",status="error"} 3
ruqu_simulations_total{backend="tensor_network",status="success"} 241
# HELP ruqu_execution_time_ms Simulation execution time histogram
# TYPE ruqu_execution_time_ms histogram
ruqu_execution_time_ms_bucket{backend="state_vector",le="10"} 423
ruqu_execution_time_ms_bucket{backend="state_vector",le="100"} 1201
ruqu_execution_time_ms_bucket{backend="state_vector",le="1000"} 1834
ruqu_execution_time_ms_bucket{backend="state_vector",le="+Inf"} 1847
# HELP ruqu_peak_memory_bytes Peak memory during simulation
# TYPE ruqu_peak_memory_bytes gauge
ruqu_peak_memory_bytes 4294967296
# HELP ruqu_gates_per_second Gate application throughput
# TYPE ruqu_gates_per_second gauge
ruqu_gates_per_second 2.1e9
# HELP ruqu_max_supported_qubits Maximum qubits based on available memory
# TYPE ruqu_max_supported_qubits gauge
ruqu_max_supported_qubits 33
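To illustrate how labeled counters map onto the exposition text above, here is a small std-only sketch that renders `ruqu_simulations_total`. In practice an exporter plugin or Prometheus client library would produce this output. The `label()` method, the `mps` and `hybrid` label values, and the `(backend, success)` key shape are assumptions of the sketch.

```rust
use std::collections::BTreeMap;
use std::fmt::Write;

#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub enum BackendType {
    StateVector,
    TensorNetwork,
    Mps,
    Hybrid,
}

impl BackendType {
    /// snake_case label value; "mps" and "hybrid" are assumed spellings.
    pub fn label(self) -> &'static str {
        match self {
            BackendType::StateVector => "state_vector",
            BackendType::TensorNetwork => "tensor_network",
            BackendType::Mps => "mps",
            BackendType::Hybrid => "hybrid",
        }
    }
}

/// Renders the counter family in Prometheus text exposition format.
/// Keys are (backend, success); values are cumulative counts.
pub fn render_simulations_total(counts: &BTreeMap<(BackendType, bool), u64>) -> String {
    let mut out = String::new();
    out.push_str("# HELP ruqu_simulations_total Total quantum simulations executed\n");
    out.push_str("# TYPE ruqu_simulations_total counter\n");
    for (&(backend, success), &value) in counts {
        let status = if success { "success" } else { "error" };
        // Writing to a String cannot fail, so the fmt::Result is discarded.
        let _ = writeln!(
            out,
            "ruqu_simulations_total{{backend=\"{}\",status=\"{}\"}} {}",
            backend.label(),
            status,
            value
        );
    }
    out
}
```

Keeping `backend` and `status` as the only labels bounds the number of series at eight, regardless of how many simulations run.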
Consequences
Positive
- Unified observability: Quantum simulation telemetry integrates seamlessly with ruVector's existing monitoring infrastructure.
- Cross-platform: The trait-based sink design supports native, WASM, and testing environments without code changes in the engine.
- Actionable errors: Structured errors with suggestions reduce debugging time and improve developer experience.
- Performance visibility: Gates-per-second, memory consumption, and backend selection metrics enable informed performance tuning.
- Compliance ready: Structured logging with simulation IDs supports audit trail requirements.
Negative
- Metric volume: High-frequency simulations could generate a large number of metric records. Mitigated by aggregation at the sink level.
- WASM callback overhead: JSON serialization for WASM metrics adds ~0.1ms per simulation. Acceptable for typical workloads.
- Tracing overhead at DEBUG/TRACE: Enabling DEBUG or TRACE adds measurable overhead, so production deployments should run at INFO or above.
- Schema evolution: Changes to SimulationMetrics require versioned handling in consumers.
Risks and Mitigations
| Risk | Mitigation |
|---|---|
| Metric volume overwhelming storage | Configurable sampling rate; aggregate in sink |
| WASM callback exceptions | Catch JS exceptions in callback wrapper; log to console |
| Schema breaking changes | Version field in metrics; consumer-side version dispatch |
| TRACE logging for large circuits | Qubit-count gate prevents amplitude logging above n=10 |
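Consumer-side version dispatch, listed as the mitigation for schema breaking changes, could be as simple as the sketch below. It assumes a semver-style policy in which a matching major version is readable and a newer minor version only adds fields. That policy, the supported version constants, and the `SchemaSupport` and `classify` names are assumptions rather than part of this decision.

```rust
#[derive(Debug, PartialEq, Eq)]
pub enum SchemaSupport {
    /// Same major version, minor not newer than what the consumer knows.
    Supported,
    /// Same major version but newer minor: readable, unknown fields ignored.
    NewerMinor,
    /// Different major version or unparseable string: reject the record.
    Unsupported,
}

/// Parses "MAJOR.MINOR.PATCH" into its numeric components.
pub fn parse_version(version: &str) -> Option<(u32, u32, u32)> {
    let mut parts = version.split('.');
    let major = parts.next()?.parse().ok()?;
    let minor = parts.next()?.parse().ok()?;
    let patch = parts.next()?.parse().ok()?;
    if parts.next().is_some() {
        return None; // more than three components
    }
    Some((major, minor, patch))
}

/// Classifies a record's schema_version against what this consumer supports.
pub fn classify(schema_version: &str) -> SchemaSupport {
    const SUPPORTED_MAJOR: u32 = 1;
    const SUPPORTED_MINOR: u32 = 0;
    match parse_version(schema_version) {
        Some((major, minor, _)) if major == SUPPORTED_MAJOR && minor <= SUPPORTED_MINOR => {
            SchemaSupport::Supported
        }
        Some((major, _, _)) if major == SUPPORTED_MAJOR => SchemaSupport::NewerMinor,
        _ => SchemaSupport::Unsupported,
    }
}
```

A consumer would call `classify` on the `schema_version` field before deserializing the rest of the record, and route `Unsupported` records to a dead-letter log instead of failing the whole feed.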
References
- ruvector-metrics crate: internal metrics infrastructure
- Rust tracing crate: https://docs.rs/tracing
- OpenTelemetry Rust SDK: https://docs.rs/opentelemetry
- ADR-QE-005: WASM Compilation Target (WASM constraints)
- ADR-QE-011: Memory Gating & Power Management (resource monitoring)
- Prometheus exposition format: https://prometheus.io/docs/instrumenting/exposition_formats/