# ADR-QE-010: Observability & Monitoring Integration

**Status**: Proposed
**Date**: 2026-02-06
**Authors**: ruv.io, RuVector Team
**Deciders**: Architecture Review Board

---

## Context

ruVector provides comprehensive observability through the `ruvector-metrics` crate,
which aggregates telemetry from all subsystems into a unified monitoring dashboard.
The quantum simulation engine is a new subsystem that must participate in this
observability infrastructure.

Effective monitoring of quantum simulation is essential for:

1. **Performance tuning**: Identifying bottlenecks in gate application, memory
   allocation, and parallelization efficiency.
2. **Resource management**: Tracking memory consumption to prevent OOM conditions
   and to inform auto-scaling decisions.
3. **Debugging**: Tracing the execution of specific circuits to diagnose incorrect
   results or unexpected behavior.
4. **Capacity planning**: Understanding workload patterns (qubit counts, circuit
   depths, simulation frequency) to plan infrastructure.
5. **Compliance**: Auditable logs of simulation executions for regulated
   environments (cryptographic validation, safety-critical applications).

### WASM Constraint

In WebAssembly deployment, there is no direct filesystem access and no native
networking. Observability in WASM must use browser-compatible mechanisms:
`console.log`, `console.warn`, `console.error`, or JavaScript callback functions
registered by the host application.

### Existing Infrastructure

| Component | Role | Integration Point |
|---|---|---|
| `ruvector-metrics` | Metrics aggregation and export | Trait-based sink |
| `ruvector-monitor` | Real-time dashboard UI | WebSocket feed |
| Rust `tracing` crate | Structured logging and spans | Subscriber-based |
| Prometheus / OpenTelemetry | External monitoring | Exporter plugins |
| Ed25519 audit trail | Cryptographic logging | `ruqu-audit` crate |

## Decision

### 1. Metrics Schema

Every simulation execution emits a structured metrics record. The schema is
versioned to allow evolution without breaking consumers.

```rust
use chrono::{DateTime, Utc};
use serde::{Deserialize, Serialize};
use uuid::Uuid;

/// Metrics emitted after each quantum simulation execution.
/// Schema version: 1.0.0
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct SimulationMetrics {
    /// Schema version for forward compatibility.
    pub schema_version: &'static str,

    /// Unique identifier for this simulation run.
    pub simulation_id: Uuid,

    /// Timestamp when simulation started (UTC).
    pub started_at: DateTime<Utc>,

    /// Timestamp when simulation completed (UTC).
    pub completed_at: DateTime<Utc>,

    // -- Circuit characteristics --

    /// Number of qubits in the circuit.
    pub qubit_count: u32,

    /// Total number of gates (before optimization).
    pub gate_count_raw: u64,

    /// Total number of gates (after optimization/fusion).
    pub gate_count_optimized: u64,

    /// Circuit depth (longest path from input to output).
    pub circuit_depth: u32,

    /// Number of two-qubit gates (entangling operations).
    pub two_qubit_gate_count: u64,

    // -- Execution metrics --

    /// Total wall-clock execution time in milliseconds.
    pub execution_time_ms: f64,

    /// Time spent in gate application (excluding allocation, measurement).
    pub gate_application_time_ms: f64,

    /// Time spent in measurement sampling.
    pub measurement_time_ms: f64,

    /// Peak memory consumption in bytes during simulation.
    pub peak_memory_bytes: u64,

    /// Memory allocated for the state vector / tensor network.
    pub state_memory_bytes: u64,

    /// Backend used for this simulation.
    pub backend: BackendType,

    // -- Throughput --

    /// Gates applied per second (optimized gate count / gate application time).
    pub gates_per_second: f64,

    /// Qubits * depth per second (a normalized throughput metric).
    pub quantum_volume_rate: f64,

    // -- Optimization statistics --

    /// Number of gates eliminated by fusion.
    pub gates_fused: u64,

    /// Number of gates eliminated as identity or redundant.
    pub gates_skipped: u64,

    /// Number of gate commutations applied.
    pub gates_commuted: u64,

    // -- Entanglement analysis --

    /// Number of independent qubit subsets (entanglement groups).
    pub entanglement_groups: u32,

    /// Sizes of each entanglement group.
    pub entanglement_group_sizes: Vec<u32>,

    // -- Measurement outcomes (if measured) --

    /// Number of measurement shots executed.
    pub measurement_shots: Option<u64>,

    /// Distribution entropy of measurement outcomes (bits).
    pub outcome_entropy: Option<f64>,

    // -- MPS-specific (tensor network backend) --

    /// Maximum bond dimension reached (MPS mode only).
    pub max_bond_dimension: Option<u32>,

    /// Estimated fidelity after MPS truncation.
    pub mps_fidelity_estimate: Option<f64>,

    // -- Error information --

    /// Whether the simulation completed successfully.
    pub success: bool,

    /// Error message if simulation failed.
    pub error: Option<String>,

    /// Error category for programmatic handling.
    pub error_kind: Option<SimulationErrorKind>,
}

#[derive(Debug, Clone, Serialize, Deserialize)]
pub enum BackendType {
    StateVector,
    TensorNetwork,
    Mps,
    Hybrid,
}

#[derive(Debug, Clone, Serialize, Deserialize)]
pub enum SimulationErrorKind {
    QubitLimitExceeded,
    MemoryAllocationFailed,
    InvalidGateTarget,
    InvalidParameter,
    ContractionFailed,
    MpsFidelityBelowThreshold,
    Timeout,
    InternalError,
}
```

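
The derived fields in the schema can be computed from the raw values. The following is a minimal sketch, not the engine's actual implementation: the function names `gates_per_second` and `outcome_entropy_bits` and the `HashMap<String, u64>` representation of shot counts are assumptions made for illustration. It follows the doc comments above: throughput is the optimized gate count divided by gate-application time, and `outcome_entropy` is taken to be the Shannon entropy (in bits) of the empirical outcome distribution.

```rust
use std::collections::HashMap;

/// Gates applied per second: optimized gate count / gate application time.
/// Returns 0.0 for a non-positive duration so the metric never becomes inf or NaN.
pub fn gates_per_second(gate_count_optimized: u64, gate_application_time_ms: f64) -> f64 {
    if gate_application_time_ms <= 0.0 {
        return 0.0;
    }
    gate_count_optimized as f64 / (gate_application_time_ms / 1000.0)
}

/// Shannon entropy, in bits, of the empirical measurement distribution.
/// `counts` maps each observed bitstring to the number of shots that produced it
/// (hypothetical representation). Returns `None` when no shots were recorded.
pub fn outcome_entropy_bits(counts: &HashMap<String, u64>) -> Option<f64> {
    let total: u64 = counts.values().sum();
    if total == 0 {
        return None;
    }
    let entropy: f64 = counts
        .values()
        .filter(|&&c| c > 0)
        .map(|&c| {
            let p = c as f64 / total as f64;
            -p * p.log2()
        })
        .sum();
    Some(entropy)
}
```

Under these assumptions, a Bell-state circuit whose shots split evenly between `00` and `11` yields an entropy of about 1 bit, while a deterministic circuit yields 0.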
### 2. Metrics Sink Trait

The engine publishes metrics through a trait abstraction, allowing different sinks
for native and WASM environments:

```rust
use std::sync::Arc;
use uuid::Uuid;

/// Trait for consuming simulation metrics.
/// Implementations exist for native (ruvector-metrics), WASM (JS callback),
/// and testing (in-memory collector).
pub trait MetricsSink: Send + Sync {
    /// Publish a completed simulation's metrics.
    fn publish(&self, metrics: &SimulationMetrics);

    /// Publish an incremental progress update (for long-running simulations).
    fn progress(&self, simulation_id: Uuid, percent_complete: f32, message: &str);

    /// Publish a health status update.
    fn health(&self, status: EngineHealthStatus);
}

/// Native implementation: forwards to ruvector-metrics.
pub struct NativeMetricsSink {
    registry: Arc<ruvector_metrics::Registry>,
}

impl MetricsSink for NativeMetricsSink {
    fn publish(&self, metrics: &SimulationMetrics) {
        // Emit as histogram/counter/gauge values
        self.registry.histogram("ruqu.execution_time_ms")
            .record(metrics.execution_time_ms);
        self.registry.gauge("ruqu.peak_memory_bytes")
            .set(metrics.peak_memory_bytes as f64);
        self.registry.counter("ruqu.simulations_total")
            .increment(1);
        self.registry.counter("ruqu.gates_applied_total")
            .increment(metrics.gate_count_optimized);
        self.registry.histogram("ruqu.gates_per_second")
            .record(metrics.gates_per_second);

        if !metrics.success {
            self.registry.counter("ruqu.errors_total")
                .increment(1);
        }
    }

    fn progress(&self, _id: Uuid, percent: f32, _msg: &str) {
        self.registry.gauge("ruqu.current_progress")
            .set(percent as f64);
    }

    fn health(&self, status: EngineHealthStatus) {
        self.registry.gauge("ruqu.health_status")
            .set(status.as_numeric());
    }
}
```

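
The trait's doc comment mentions an in-memory collector for testing, which is not shown above. The following is a minimal sketch of how such a collector could look, using only the standard library. `MetricsRecord` and `RecordSink` are simplified, hypothetical stand-ins for `SimulationMetrics` and `MetricsSink`, kept small so the example is self-contained. The real collector would implement the full trait.

```rust
use std::sync::Mutex;

/// Simplified stand-in for `SimulationMetrics` (hypothetical, illustration only).
#[derive(Debug, Clone, PartialEq)]
pub struct MetricsRecord {
    pub simulation_id: u128,
    pub execution_time_ms: f64,
    pub success: bool,
}

/// Reduced stand-in for `MetricsSink` with only the `publish` method.
pub trait RecordSink: Send + Sync {
    fn publish(&self, metrics: &MetricsRecord);
}

/// Collects every published record in memory so tests can assert on them.
#[derive(Default)]
pub struct InMemorySink {
    records: Mutex<Vec<MetricsRecord>>,
}

impl InMemorySink {
    /// Snapshot of everything published so far, in publication order.
    pub fn records(&self) -> Vec<MetricsRecord> {
        self.records.lock().unwrap().clone()
    }

    /// Number of published records that describe a failed simulation.
    pub fn error_count(&self) -> usize {
        self.records.lock().unwrap().iter().filter(|r| !r.success).count()
    }
}

impl RecordSink for InMemorySink {
    fn publish(&self, metrics: &MetricsRecord) {
        // A Mutex gives the interior mutability needed behind `&self`
        // while keeping the sink `Send + Sync`.
        self.records.lock().unwrap().push(metrics.clone());
    }
}
```

A test would hand the sink to the engine, run a circuit, and then inspect `records()` or `error_count()` instead of scraping logs.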
### 3. WASM Metrics Sink

In WASM, metrics are delivered via JavaScript callbacks:

```rust
#[cfg(target_arch = "wasm32")]
use wasm_bindgen::JsValue;

#[cfg(target_arch = "wasm32")]
pub struct WasmMetricsSink {
    /// JS callback function registered by host application.
    callback: js_sys::Function,
}

// `MetricsSink` requires `Send + Sync`, but `js_sys::Function` is neither.
// These impls assume the single-threaded wasm32 target (no shared-memory threads).
#[cfg(target_arch = "wasm32")]
unsafe impl Send for WasmMetricsSink {}
#[cfg(target_arch = "wasm32")]
unsafe impl Sync for WasmMetricsSink {}

#[cfg(target_arch = "wasm32")]
impl MetricsSink for WasmMetricsSink {
    fn publish(&self, metrics: &SimulationMetrics) {
        let json = serde_json::to_string(metrics)
            .unwrap_or_else(|_| "{}".to_string());
        let js_value = JsValue::from_str(&json);
        let event_type = JsValue::from_str("simulation_complete");
        let _ = self.callback.call2(&JsValue::NULL, &event_type, &js_value);
    }

    fn progress(&self, id: Uuid, percent: f32, message: &str) {
        // Build the payload with serde_json so that quotes and control
        // characters in `message` are escaped and the JSON stays valid.
        let payload = serde_json::json!({
            "simulation_id": id.to_string(),
            "percent": percent,
            "message": message,
        })
        .to_string();
        let js_value = JsValue::from_str(&payload);
        let event_type = JsValue::from_str("simulation_progress");
        let _ = self.callback.call2(&JsValue::NULL, &event_type, &js_value);
    }

    fn health(&self, status: EngineHealthStatus) {
        let payload = serde_json::json!({ "status": status.as_str() }).to_string();
        let js_value = JsValue::from_str(&payload);
        let event_type = JsValue::from_str("engine_health");
        let _ = self.callback.call2(&JsValue::NULL, &event_type, &js_value);
    }
}
```

JavaScript host registration:

```javascript
// Host application registers the metrics callback
import init, { set_metrics_callback } from 'ruqu-wasm';

await init();

set_metrics_callback((eventType, data) => {
  const metrics = JSON.parse(data);
  switch (eventType) {
    case 'simulation_complete':
      console.log(`Simulation ${metrics.simulation_id} completed in ${metrics.execution_time_ms}ms`);
      dashboard.updateMetrics(metrics);
      break;
    case 'simulation_progress':
      progressBar.update(metrics.percent);
      break;
    case 'engine_health':
      healthIndicator.set(metrics.status);
      break;
  }
});
```

### 4. Tracing Integration

The engine integrates with the Rust `tracing` crate for structured logging and
distributed tracing.

#### Span Hierarchy

```
ruqu::simulation                         (root span for entire simulation)
  |
  +-- ruqu::circuit_validation           (validate circuit structure)
  |
  +-- ruqu::backend_selection            (automatic backend choice)
  |
  +-- ruqu::optimization                 (gate fusion, commutation, etc.)
  |     |
  |     +-- ruqu::optimization::fusion   (individual fusion passes)
  |     +-- ruqu::optimization::cancel   (gate cancellation)
  |
  +-- ruqu::state_init                   (allocate and initialize state)
  |
  +-- ruqu::gate_application             (apply all gates)
  |     |
  |     +-- ruqu::gate                   (individual gate -- DEBUG level only)
  |
  +-- ruqu::measurement                  (perform measurement sampling)
  |
  +-- ruqu::metrics_publish              (emit metrics to sink)
  |
  +-- ruqu::state_cleanup                (deallocate state vector)
```

#### Instrumentation Code

```rust
use tracing::{debug, info, instrument, trace};
use uuid::Uuid;

#[instrument(
    name = "ruqu::simulation",
    skip(circuit, config, metrics_sink),
    fields(
        qubit_count = circuit.num_qubits(),
        gate_count = circuit.gate_count(),
        simulation_id = %Uuid::new_v4(),
    )
)]
pub fn execute(
    circuit: &QuantumCircuit,
    shots: usize,
    config: &SimulationConfig,
    metrics_sink: &dyn MetricsSink,
) -> Result<SimulationResult, SimulationError> {
    info!(
        qubits = circuit.num_qubits(),
        gates = circuit.gate_count(),
        depth = circuit.depth(),
        shots = shots,
        "Starting quantum simulation"
    );

    // Validate
    let _validation_span = tracing::info_span!("ruqu::circuit_validation").entered();
    validate_circuit(circuit)?;
    drop(_validation_span);

    // Select backend
    let _backend_span = tracing::info_span!("ruqu::backend_selection").entered();
    let backend = select_backend(circuit, config);
    info!(backend = backend.name(), "Backend selected");
    drop(_backend_span);

    // Optimize
    let _opt_span = tracing::info_span!("ruqu::optimization").entered();
    let optimized = optimize_circuit(circuit, config)?;
    info!(
        original_gates = circuit.gate_count(),
        optimized_gates = optimized.gate_count(),
        // Raw minus optimized covers fused and skipped gates alike;
        // saturating_sub guards against underflow if optimization adds gates.
        gates_eliminated = circuit.gate_count().saturating_sub(optimized.gate_count()),
        "Circuit optimization complete"
    );
    drop(_opt_span);

    // Execute
    let result = backend.execute(&optimized, shots, config)?;

    // At DEBUG level, log an execution summary
    debug!(
        execution_time_ms = result.execution_time_ms,
        peak_memory = result.peak_memory_bytes,
        "Simulation execution complete"
    );

    // At TRACE level only for small circuits, log amplitude information
    if circuit.num_qubits() <= 10 {
        trace!(
            amplitudes = ?result.state_vector_snapshot(),
            "Final state vector (small circuit trace)"
        );
    }

    Ok(result)
}
```

### 5. Structured Error Reporting

All errors carry structured context for programmatic handling:

```rust
#[derive(Debug, thiserror::Error)]
pub enum SimulationError {
    #[error("Qubit limit exceeded: requested {requested}, maximum {maximum}")]
    QubitLimitExceeded {
        requested: u32,
        maximum: u32,
        estimated_memory_bytes: u64,
        available_memory_bytes: u64,
    },

    #[error("Memory allocation failed for {requested_bytes} bytes")]
    MemoryAllocationFailed {
        requested_bytes: u64,
        qubit_count: u32,
        suggestion: &'static str,
    },

    #[error("Invalid gate target: qubit {qubit} in {qubit_count}-qubit circuit")]
    InvalidGateTarget {
        gate_name: String,
        qubit: u32,
        qubit_count: u32,
        gate_index: usize,
    },

    #[error("Invalid gate parameter: {parameter_name} = {value} ({reason})")]
    InvalidParameter {
        gate_name: String,
        parameter_name: String,
        value: f64,
        reason: &'static str,
    },

    #[error("Tensor contraction failed: {reason}")]
    ContractionFailed {
        reason: String,
        estimated_treewidth: usize,
        suggestion: &'static str,
    },

    #[error("MPS fidelity {fidelity:.6} below threshold {threshold:.6}")]
    MpsFidelityBelowThreshold {
        fidelity: f64,
        threshold: f64,
        max_bond_dimension: usize,
        suggestion: &'static str,
    },

    #[error("Simulation timed out after {elapsed_ms}ms (limit: {timeout_ms}ms)")]
    Timeout {
        elapsed_ms: u64,
        timeout_ms: u64,
        gates_completed: u64,
        gates_remaining: u64,
    },

    #[error("Internal error: {message}")]
    InternalError {
        message: String,
        source: Option<Box<dyn std::error::Error + Send + Sync>>,
    },
}
```

Where applicable, errors carry a suggestion that guides users toward resolution:

| Error | Suggestion |
|---|---|
| `QubitLimitExceeded` | "Reduce qubit count or enable tensor-network feature for large circuits" |
| `MemoryAllocationFailed` | "Try tensor-network backend or reduce qubit count by 1-2 (halves/quarters memory)" |
| `ContractionFailed` | "Circuit treewidth too high for tensor network; use state vector for <= 30 qubits" |
| `MpsFidelityBelowThreshold` | "Increase chi_max or switch to exact state vector for high-fidelity results" |

### 6. Health Checks

The engine exposes health status for monitoring systems:

```rust
use std::sync::atomic::Ordering;

#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct EngineHealthStatus {
    /// Whether the engine is ready to accept simulations.
    pub ready: bool,

    /// Maximum qubits supportable given current available memory.
    pub max_supported_qubits: u32,

    /// Available memory in bytes.
    pub available_memory_bytes: u64,

    /// Number of CPU cores available for parallel gate application.
    pub available_cores: usize,

    /// Whether the tensor-network backend is compiled in.
    pub tensor_network_available: bool,

    /// Current engine version.
    pub version: &'static str,

    /// Uptime since engine initialization (if applicable).
    pub uptime_seconds: Option<f64>,

    /// Number of simulations executed in current session.
    pub simulations_executed: u64,

    /// Total gates applied across all simulations in current session.
    pub total_gates_applied: u64,
}

/// Check engine health. Callable at any time.
pub fn quantum_engine_ready() -> EngineHealthStatus {
    let available_memory = estimate_available_memory();
    let max_qubits = compute_max_qubits(available_memory);

    EngineHealthStatus {
        ready: max_qubits >= 4, // Minimum useful simulation
        max_supported_qubits: max_qubits,
        available_memory_bytes: available_memory,
        available_cores: rayon::current_num_threads(),
        tensor_network_available: cfg!(feature = "tensor-network"),
        version: env!("CARGO_PKG_VERSION"),
        uptime_seconds: None, // Library mode; no persistent uptime
        simulations_executed: SESSION_COUNTER.load(Ordering::Relaxed),
        total_gates_applied: SESSION_GATES.load(Ordering::Relaxed),
    }
}
```

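
The health check relies on `compute_max_qubits`, which is not defined in this ADR. The following is a minimal sketch of one way it could work, assuming a dense state-vector backend that stores each amplitude as a complex number of two `f64` values (16 bytes). The names `state_vector_bytes` and `max_qubits_for` and the 16-byte figure are assumptions for illustration. The real function may also account for scratch buffers or other backends.

```rust
/// Bytes per amplitude, assuming one complex number stored as two f64 values.
/// This is an assumption; a single-precision engine would use 8 bytes.
const BYTES_PER_AMPLITUDE: u64 = 16;

/// Memory needed for a dense n-qubit state vector (2^n amplitudes),
/// or `None` if the size does not fit in a u64.
pub fn state_vector_bytes(qubits: u32) -> Option<u64> {
    1u64.checked_shl(qubits)?.checked_mul(BYTES_PER_AMPLITUDE)
}

/// Largest qubit count whose state vector fits in `available_bytes`.
/// Returns 0 when not even a single-qubit state fits.
pub fn max_qubits_for(available_bytes: u64) -> u32 {
    let mut n = 0;
    // Each extra qubit doubles the state vector, so the loop runs at most
    // about 60 times before the size overflows u64 or exceeds the budget.
    while let Some(bytes) = state_vector_bytes(n + 1) {
        if bytes > available_bytes {
            break;
        }
        n += 1;
    }
    n
}
```

Under the 16-byte assumption, a 24-qubit state vector needs 2^24 × 16 = 268,435,456 bytes, and removing one qubit halves that figure, which is the reasoning behind the `MemoryAllocationFailed` suggestion to reduce the qubit count by 1-2.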
### 7. Logging Levels

| Level | Content | Audience | Performance Impact |
|---|---|---|---|
| ERROR | Simulation failures, OOM, invalid circuits | Operators, alerting | None |
| WARN | Approaching memory limits (>80%), MPS fidelity degradation, slow contraction | Operators | Negligible |
| INFO | Simulation start/end summaries, backend selection, optimization results | Developers, dashboards | Negligible |
| DEBUG | Per-optimization-pass details, memory allocation sizes, thread utilization | Developers debugging | Low |
| TRACE | Per-gate amplitude changes (small circuits only, n <= 10), SVD singular values | Deep debugging | High (small circuits only) |

TRACE level is gated on circuit size to prevent catastrophic log volume:

```rust
// TRACE-level amplitude logging is only emitted for circuits with <= 10 qubits.
// For larger circuits, TRACE only emits gate-level timing without amplitude data.
if tracing::enabled!(tracing::Level::TRACE) {
    if circuit.num_qubits() <= 10 {
        trace!(amplitudes = ?state.as_slice(), "Post-gate state");
    } else {
        trace!(gate_time_ns = elapsed.as_nanos(), "Gate applied");
    }
}
```

### 8. Dashboard Integration

Metrics from the quantum engine appear in the ruVector monitoring UI as a dedicated
panel alongside vector operations, index health, and system resources.

```
+------------------------------------------------------------------+
|                  ruVector Monitoring Dashboard                   |
+------------------------------------------------------------------+
|                                                                  |
| Vector Operations              | Quantum Simulations             |
| -------------------            | -----------------------         |
| Queries/sec: 12,450            | Simulations/min: 23             |
| P99 latency: 2.3ms             | Avg execution: 145ms            |
| Index size: 2.1M vectors       | Avg qubits: 18.4                |
|                                | Peak memory: 4.2 GiB            |
|                                | Backend: SV 87% / TN 13%        |
|                                | Gates/sec: 2.1B                 |
|                                | Error rate: 0.02%               |
|                                |                                 |
| System Resources               | Recent Simulations              |
| -------------------            | -----------------------         |
| CPU: 34%                       | #a3f2.. 24q 230ms OK            |
| Memory: 61% (49/80 GiB)        | #b891.. 16q 12ms OK             |
| Threads: 64/256 active         | #c4d0.. 30q 1.2s OK             |
|                                | #d122.. 35q ERR OOM             |
+------------------------------------------------------------------+
```

Metrics are published via the existing `ruvector-metrics` WebSocket feed:

```json
{
  "source": "ruqu",
  "type": "simulation_complete",
  "timestamp": "2026-02-06T14:23:01.442Z",
  "data": {
    "simulation_id": "a3f2e891-...",
    "qubit_count": 24,
    "execution_time_ms": 230.4,
    "peak_memory_bytes": 268435456,
    "backend": "StateVector",
    "gates_per_second": 2147483648,
    "success": true
  }
}
```

### 9. Prometheus / OpenTelemetry Export

For external monitoring, the native metrics sink exports standard Prometheus
metrics:

```
# HELP ruqu_simulations_total Total quantum simulations executed
# TYPE ruqu_simulations_total counter
ruqu_simulations_total{backend="state_vector",status="success"} 1847
ruqu_simulations_total{backend="state_vector",status="error"} 3
ruqu_simulations_total{backend="tensor_network",status="success"} 241

# HELP ruqu_execution_time_ms Simulation execution time histogram
# TYPE ruqu_execution_time_ms histogram
ruqu_execution_time_ms_bucket{backend="state_vector",le="10"} 423
ruqu_execution_time_ms_bucket{backend="state_vector",le="100"} 1201
ruqu_execution_time_ms_bucket{backend="state_vector",le="1000"} 1834
ruqu_execution_time_ms_bucket{backend="state_vector",le="+Inf"} 1847

# HELP ruqu_peak_memory_bytes Peak memory during simulation
# TYPE ruqu_peak_memory_bytes gauge
ruqu_peak_memory_bytes 4294967296

# HELP ruqu_gates_per_second Gate application throughput
# TYPE ruqu_gates_per_second gauge
ruqu_gates_per_second 2.1e9

# HELP ruqu_max_supported_qubits Maximum qubits based on available memory
# TYPE ruqu_max_supported_qubits gauge
ruqu_max_supported_qubits 33
```

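
The registry uses dotted names such as `ruqu.simulations_total`, while the exposition output uses underscores and labels. The following is a rough sketch of that mapping, assuming the exporter simply replaces characters that are not valid in a Prometheus metric name and escapes label values as the text format requires. The helpers `prometheus_name`, `escape_label_value`, and `render_sample` are hypothetical and not part of `ruvector-metrics`.

```rust
/// Converts an internal dotted metric name into a Prometheus-style name by
/// replacing every character outside [a-zA-Z0-9_:] with an underscore.
/// (A leading digit would still be invalid; this sketch does not handle it.)
pub fn prometheus_name(internal: &str) -> String {
    internal
        .chars()
        .map(|c| if c.is_ascii_alphanumeric() || c == '_' || c == ':' { c } else { '_' })
        .collect()
}

/// Escapes backslash, double quote, and newline in a label value.
fn escape_label_value(value: &str) -> String {
    value
        .replace('\\', "\\\\")
        .replace('"', "\\\"")
        .replace('\n', "\\n")
}

/// Renders a single sample line such as `name{k="v",k2="v2"} 42`.
pub fn render_sample(name: &str, labels: &[(&str, &str)], value: u64) -> String {
    if labels.is_empty() {
        return format!("{} {}", name, value);
    }
    let rendered: Vec<String> = labels
        .iter()
        .map(|(key, val)| format!("{}=\"{}\"", key, escape_label_value(val)))
        .collect();
    format!("{}{{{}}} {}", name, rendered.join(","), value)
}
```

With these helpers, `ruqu.simulations_total` with `backend` and `status` labels renders in the same shape as the first sample line of the exposition output above.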
## Consequences

### Positive

1. **Unified observability**: Quantum simulation telemetry integrates with
   ruVector's existing monitoring infrastructure.
2. **Cross-platform**: The trait-based sink design supports native, WASM, and
   testing environments without code changes in the engine.
3. **Actionable errors**: Structured errors with suggestions reduce debugging time
   and improve developer experience.
4. **Performance visibility**: Gates-per-second, memory consumption, and backend
   selection metrics enable informed performance tuning.
5. **Compliance ready**: Structured logging with simulation IDs supports audit
   trail requirements.

### Negative

1. **Metric volume**: High-frequency simulations could generate significant
   metric volume. Mitigated by sampling and aggregation at the sink level.
2. **WASM callback overhead**: JSON serialization for WASM metrics adds ~0.1ms per
   simulation. Acceptable for typical workloads.
3. **Tracing overhead at DEBUG/TRACE**: Enabling DEBUG or TRACE output adds
   measurable overhead. Production deployments should use INFO or above.
4. **Schema evolution**: Changes to `SimulationMetrics` require versioned handling
   in consumers.

### Risks and Mitigations

| Risk | Mitigation |
|---|---|
| Metric volume overwhelming storage | Configurable sampling rate; aggregate in sink |
| WASM callback exceptions | Catch JS exceptions in callback wrapper; log to console |
| Schema breaking changes | Version field in metrics; consumer-side version dispatch |
| TRACE logging for large circuits | Qubit-count gate prevents amplitude logging above n=10 |

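
The "configurable sampling rate" mitigation can be sketched as a small decision helper that a sink wrapper consults before forwarding a record. This is only an illustration under stated assumptions: the `Sampler` type, its every-Nth policy, and the choice to always keep failed simulations are hypothetical and not part of the decision above.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Decides which simulation records are forwarded to storage:
/// one out of every `every_n` successful runs, plus every failure.
pub struct Sampler {
    every_n: u64,
    seen: AtomicU64,
}

impl Sampler {
    /// `every_n` = 1 keeps everything; 0 is treated as 1.
    pub fn new(every_n: u64) -> Self {
        Self {
            every_n: every_n.max(1),
            seen: AtomicU64::new(0),
        }
    }

    /// Returns true when the current record should be forwarded.
    /// Failed simulations are always kept so error counts stay exact.
    pub fn should_publish(&self, success: bool) -> bool {
        // fetch_add returns the previous value, so indices start at 0.
        let index = self.seen.fetch_add(1, Ordering::Relaxed);
        !success || index % self.every_n == 0
    }
}
```

A deterministic every-Nth policy keeps the example easy to test. A production sink might prefer random sampling or time-window aggregation, and counters such as `ruqu.simulations_total` would still need to be incremented for every run so that totals remain accurate.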
## References

- `ruvector-metrics` crate: internal metrics infrastructure
- Rust `tracing` crate: https://docs.rs/tracing
- OpenTelemetry Rust SDK: https://docs.rs/opentelemetry
- ADR-QE-005: WASM Compilation Target (WASM constraints)
- ADR-QE-011: Memory Gating & Power Management (resource monitoring)
- Prometheus exposition format: https://prometheus.io/docs/instrumenting/exposition_formats/