Merge commit 'd803bfe2b1fe7f5e219e50ac20d6801a0a58ac75' as 'vendor/ruvector'
vendor/ruvector/docs/adr/quantum-engine/ADR-QE-010-observability-monitoring.md
# ADR-QE-010: Observability & Monitoring Integration

**Status**: Proposed
**Date**: 2026-02-06
**Authors**: ruv.io, RuVector Team
**Deciders**: Architecture Review Board

---

## Context

ruVector provides comprehensive observability through the `ruvector-metrics` crate,
which aggregates telemetry from all subsystems into a unified monitoring dashboard.
The quantum simulation engine is a new subsystem that must participate in this
observability infrastructure.

Effective monitoring of quantum simulation is essential for:

1. **Performance tuning**: Identifying bottlenecks in gate application, memory
   allocation, and parallelization efficiency.
2. **Resource management**: Tracking memory consumption to prevent OOM conditions
   and to inform auto-scaling decisions.
3. **Debugging**: Tracing the execution of specific circuits to diagnose incorrect
   results or unexpected behavior.
4. **Capacity planning**: Understanding workload patterns (qubit counts, circuit
   depths, simulation frequency) to plan infrastructure.
5. **Compliance**: Auditable logs of simulation executions for regulated
   environments (cryptographic validation, safety-critical applications).

### WASM Constraint

In WebAssembly deployment, there is no direct filesystem access and no native
networking. Observability in WASM must use browser-compatible mechanisms:
`console.log`, `console.warn`, `console.error`, or JavaScript callback functions
registered by the host application.

### Existing Infrastructure

| Component | Role | Integration Point |
|---|---|---|
| `ruvector-metrics` | Metrics aggregation and export | Trait-based sink |
| `ruvector-monitor` | Real-time dashboard UI | WebSocket feed |
| Rust `tracing` crate | Structured logging and spans | Subscriber-based |
| Prometheus / OpenTelemetry | External monitoring | Exporter plugins |
| Ed25519 audit trail | Cryptographic logging | `ruqu-audit` crate |

## Decision

### 1. Metrics Schema

Every simulation execution emits a structured metrics record. The schema is
versioned to allow evolution without breaking consumers.

```rust
use chrono::{DateTime, Utc};
use serde::{Deserialize, Serialize};
use uuid::Uuid;

/// Metrics emitted after each quantum simulation execution.
/// Schema version: 1.0.0
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct SimulationMetrics {
    /// Schema version for forward compatibility.
    pub schema_version: &'static str,

    /// Unique identifier for this simulation run.
    pub simulation_id: Uuid,

    /// Timestamp when simulation started (UTC).
    pub started_at: DateTime<Utc>,

    /// Timestamp when simulation completed (UTC).
    pub completed_at: DateTime<Utc>,

    // -- Circuit characteristics --

    /// Number of qubits in the circuit.
    pub qubit_count: u32,

    /// Total number of gates (before optimization).
    pub gate_count_raw: u64,

    /// Total number of gates (after optimization/fusion).
    pub gate_count_optimized: u64,

    /// Circuit depth (longest path from input to output).
    pub circuit_depth: u32,

    /// Number of two-qubit gates (entangling operations).
    pub two_qubit_gate_count: u64,

    // -- Execution metrics --

    /// Total wall-clock execution time in milliseconds.
    pub execution_time_ms: f64,

    /// Time spent in gate application (excluding allocation, measurement).
    pub gate_application_time_ms: f64,

    /// Time spent in measurement sampling.
    pub measurement_time_ms: f64,

    /// Peak memory consumption in bytes during simulation.
    pub peak_memory_bytes: u64,

    /// Memory allocated for the state vector / tensor network.
    pub state_memory_bytes: u64,

    /// Backend used for this simulation.
    pub backend: BackendType,

    // -- Throughput --

    /// Gates applied per second (optimized gate count / gate application time).
    pub gates_per_second: f64,

    /// Qubits * depth per second (a normalized throughput metric).
    pub quantum_volume_rate: f64,

    // -- Optimization statistics --

    /// Number of gates eliminated by fusion.
    pub gates_fused: u64,

    /// Number of gates eliminated as identity or redundant.
    pub gates_skipped: u64,

    /// Number of gate commutations applied.
    pub gates_commuted: u64,

    // -- Entanglement analysis --

    /// Number of independent qubit subsets (entanglement groups).
    pub entanglement_groups: u32,

    /// Sizes of each entanglement group.
    pub entanglement_group_sizes: Vec<u32>,

    // -- Measurement outcomes (if measured) --

    /// Number of measurement shots executed.
    pub measurement_shots: Option<u64>,

    /// Distribution entropy of measurement outcomes (bits).
    pub outcome_entropy: Option<f64>,

    // -- MPS-specific (tensor network backend) --

    /// Maximum bond dimension reached (MPS mode only).
    pub max_bond_dimension: Option<u32>,

    /// Estimated fidelity after MPS truncation.
    pub mps_fidelity_estimate: Option<f64>,

    // -- Error information --

    /// Whether the simulation completed successfully.
    pub success: bool,

    /// Error message if simulation failed.
    pub error: Option<String>,

    /// Error category for programmatic handling.
    pub error_kind: Option<SimulationErrorKind>,
}

#[derive(Debug, Clone, Serialize, Deserialize)]
pub enum BackendType {
    StateVector,
    TensorNetwork,
    Mps,
    Hybrid,
}

#[derive(Debug, Clone, Serialize, Deserialize)]
pub enum SimulationErrorKind {
    QubitLimitExceeded,
    MemoryAllocationFailed,
    InvalidGateTarget,
    InvalidParameter,
    ContractionFailed,
    MpsFidelityBelowThreshold,
    Timeout,
    InternalError,
}
```

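The derived fields in the schema (`gates_per_second`, `quantum_volume_rate`, `outcome_entropy`) can be computed from the raw fields. The following is a minimal std-only sketch, not the engine's actual implementation. The function names, the zero-time guard, the use of total execution time for `quantum_volume_rate`, and the `HashMap<String, u64>` shape of the outcome counts are all assumptions made for illustration.

```rust
use std::collections::HashMap;

/// Gates applied per second: optimized gate count / gate application time.
/// Returns 0.0 when no time was recorded (assumed guard against division by zero).
pub fn gates_per_second(gate_count_optimized: u64, gate_application_time_ms: f64) -> f64 {
    if gate_application_time_ms <= 0.0 {
        return 0.0;
    }
    gate_count_optimized as f64 / (gate_application_time_ms / 1000.0)
}

/// Qubits * depth per second. Using total execution time as the
/// denominator is an assumption; the ADR does not say which time applies.
pub fn quantum_volume_rate(qubit_count: u32, circuit_depth: u32, execution_time_ms: f64) -> f64 {
    if execution_time_ms <= 0.0 {
        return 0.0;
    }
    (f64::from(qubit_count) * f64::from(circuit_depth)) / (execution_time_ms / 1000.0)
}

/// Shannon entropy, in bits, of the empirical outcome distribution.
/// `counts` maps a measured bitstring to the number of shots that produced it.
/// Returns `None` when no shots were recorded.
pub fn outcome_entropy(counts: &HashMap<String, u64>) -> Option<f64> {
    let total: u64 = counts.values().sum();
    if total == 0 {
        return None;
    }
    let entropy = counts
        .values()
        .filter(|&&c| c > 0)
        .map(|&c| {
            let p = c as f64 / total as f64;
            -p * p.log2()
        })
        .sum::<f64>();
    Some(entropy)
}
```

For example, a Bell-state measurement that yields `00` and `11` equally often has an entropy of 1 bit, and 2000 gates applied in 500 ms gives 4000 gates per second.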
### 2. Metrics Sink Trait

The engine publishes metrics through a trait abstraction, allowing different sinks
for native and WASM environments:

```rust
use std::sync::Arc;

/// Trait for consuming simulation metrics.
/// Implementations exist for native (ruvector-metrics), WASM (JS callback),
/// and testing (in-memory collector).
pub trait MetricsSink: Send + Sync {
    /// Publish a completed simulation's metrics.
    fn publish(&self, metrics: &SimulationMetrics);

    /// Publish an incremental progress update (for long-running simulations).
    fn progress(&self, simulation_id: Uuid, percent_complete: f32, message: &str);

    /// Publish a health status update.
    fn health(&self, status: EngineHealthStatus);
}

/// Native implementation: forwards to ruvector-metrics.
pub struct NativeMetricsSink {
    registry: Arc<ruvector_metrics::Registry>,
}

impl MetricsSink for NativeMetricsSink {
    fn publish(&self, metrics: &SimulationMetrics) {
        // Emit as histogram/counter/gauge values
        self.registry.histogram("ruqu.execution_time_ms")
            .record(metrics.execution_time_ms);
        self.registry.gauge("ruqu.peak_memory_bytes")
            .set(metrics.peak_memory_bytes as f64);
        self.registry.counter("ruqu.simulations_total")
            .increment(1);
        self.registry.counter("ruqu.gates_applied_total")
            .increment(metrics.gate_count_optimized);
        self.registry.histogram("ruqu.gates_per_second")
            .record(metrics.gates_per_second);

        if !metrics.success {
            self.registry.counter("ruqu.errors_total")
                .increment(1);
        }
    }

    fn progress(&self, _id: Uuid, percent: f32, _msg: &str) {
        self.registry.gauge("ruqu.current_progress")
            .set(percent as f64);
    }

    fn health(&self, status: EngineHealthStatus) {
        self.registry.gauge("ruqu.health_status")
            .set(status.as_numeric());
    }
}
```

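The trait's doc comment mentions an in-memory collector for testing. A minimal std-only sketch of that idea is shown here. `MetricsRecord` and `RecordSink` are hypothetical, reduced stand-ins for `SimulationMetrics` and `MetricsSink` so the example compiles without the engine's crates; only `publish` is modeled.

```rust
use std::sync::Mutex;

/// Hypothetical, reduced stand-in for the ADR's `SimulationMetrics`.
#[derive(Debug, Clone, PartialEq)]
pub struct MetricsRecord {
    pub simulation_id: u64,
    pub execution_time_ms: f64,
    pub success: bool,
}

/// Hypothetical, reduced stand-in for `MetricsSink` (only `publish`).
pub trait RecordSink: Send + Sync {
    fn publish(&self, metrics: &MetricsRecord);
}

/// Test sink that keeps every published record in memory.
/// The `Mutex` lets `publish` take `&self` while staying `Send + Sync`.
#[derive(Default)]
pub struct InMemorySink {
    records: Mutex<Vec<MetricsRecord>>,
}

impl InMemorySink {
    pub fn new() -> Self {
        Self::default()
    }

    /// Snapshot of everything published so far.
    pub fn records(&self) -> Vec<MetricsRecord> {
        self.records.lock().unwrap().clone()
    }

    /// Number of published records with `success == false`.
    pub fn error_count(&self) -> usize {
        self.records
            .lock()
            .unwrap()
            .iter()
            .filter(|r| !r.success)
            .count()
    }
}

impl RecordSink for InMemorySink {
    fn publish(&self, metrics: &MetricsRecord) {
        self.records.lock().unwrap().push(metrics.clone());
    }
}
```

A test can pass `&InMemorySink` wherever a `&dyn RecordSink` is expected and then assert on `records()` or `error_count()` after the simulation returns.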
### 3. WASM Metrics Sink

In WASM, metrics are delivered via JavaScript callbacks:

```rust
#[cfg(target_arch = "wasm32")]
use wasm_bindgen::JsValue;

/// Note: `js_sys::Function` is neither `Send` nor `Sync`, so the
/// `MetricsSink: Send + Sync` bound must be relaxed or wrapped on wasm32.
#[cfg(target_arch = "wasm32")]
pub struct WasmMetricsSink {
    /// JS callback function registered by host application.
    callback: js_sys::Function,
}

#[cfg(target_arch = "wasm32")]
impl MetricsSink for WasmMetricsSink {
    fn publish(&self, metrics: &SimulationMetrics) {
        let json = serde_json::to_string(metrics)
            .unwrap_or_else(|_| "{}".to_string());
        let js_value = JsValue::from_str(&json);
        let event_type = JsValue::from_str("simulation_complete");
        let _ = self.callback.call2(&JsValue::NULL, &event_type, &js_value);
    }

    fn progress(&self, id: Uuid, percent: f32, message: &str) {
        // Build the payload with serde_json so that quotes or backslashes in
        // `message` are escaped and the host's JSON.parse does not fail.
        let payload = serde_json::json!({
            "simulation_id": id.to_string(),
            "percent": percent,
            "message": message,
        })
        .to_string();
        let js_value = JsValue::from_str(&payload);
        let event_type = JsValue::from_str("simulation_progress");
        let _ = self.callback.call2(&JsValue::NULL, &event_type, &js_value);
    }

    fn health(&self, status: EngineHealthStatus) {
        let payload = format!(r#"{{"status":"{}"}}"#, status.as_str());
        let js_value = JsValue::from_str(&payload);
        let event_type = JsValue::from_str("engine_health");
        let _ = self.callback.call2(&JsValue::NULL, &event_type, &js_value);
    }
}
```

JavaScript host registration:

```javascript
// Host application registers the metrics callback
import init, { set_metrics_callback } from 'ruqu-wasm';

await init();

set_metrics_callback((eventType, data) => {
  const metrics = JSON.parse(data);
  switch (eventType) {
    case 'simulation_complete':
      console.log(`Simulation ${metrics.simulation_id} completed in ${metrics.execution_time_ms}ms`);
      dashboard.updateMetrics(metrics);
      break;
    case 'simulation_progress':
      progressBar.update(metrics.percent);
      break;
    case 'engine_health':
      healthIndicator.set(metrics.status);
      break;
  }
});
```

### 4. Tracing Integration

The engine integrates with the Rust `tracing` crate for structured logging and
distributed tracing.

#### Span Hierarchy

```
ruqu::simulation                         (root span for entire simulation)
  |
  +-- ruqu::circuit_validation           (validate circuit structure)
  |
  +-- ruqu::backend_selection            (automatic backend choice)
  |
  +-- ruqu::optimization                 (gate fusion, commutation, etc.)
  |     |
  |     +-- ruqu::optimization::fusion   (individual fusion passes)
  |     +-- ruqu::optimization::cancel   (gate cancellation)
  |
  +-- ruqu::state_init                   (allocate and initialize state)
  |
  +-- ruqu::gate_application             (apply all gates)
  |     |
  |     +-- ruqu::gate                   (individual gate -- DEBUG level only)
  |
  +-- ruqu::measurement                  (perform measurement sampling)
  |
  +-- ruqu::metrics_publish              (emit metrics to sink)
  |
  +-- ruqu::state_cleanup                (deallocate state vector)
```

#### Instrumentation Code

```rust
use tracing::{debug, info, instrument, trace};
use uuid::Uuid;

#[instrument(
    name = "ruqu::simulation",
    skip(circuit, config, metrics_sink),
    fields(
        qubit_count = circuit.num_qubits(),
        gate_count = circuit.gate_count(),
        simulation_id = %Uuid::new_v4(),
    )
)]
pub fn execute(
    circuit: &QuantumCircuit,
    shots: usize,
    config: &SimulationConfig,
    metrics_sink: &dyn MetricsSink,
) -> Result<SimulationResult, SimulationError> {
    info!(
        qubits = circuit.num_qubits(),
        gates = circuit.gate_count(),
        depth = circuit.depth(),
        shots = shots,
        "Starting quantum simulation"
    );

    // Validate
    let _validation_span = tracing::info_span!("ruqu::circuit_validation").entered();
    validate_circuit(circuit)?;
    drop(_validation_span);

    // Select backend
    let _backend_span = tracing::info_span!("ruqu::backend_selection").entered();
    let backend = select_backend(circuit, config);
    info!(backend = backend.name(), "Backend selected");
    drop(_backend_span);

    // Optimize
    let _opt_span = tracing::info_span!("ruqu::optimization").entered();
    let optimized = optimize_circuit(circuit, config)?;
    info!(
        original_gates = circuit.gate_count(),
        optimized_gates = optimized.gate_count(),
        gates_fused = circuit.gate_count() - optimized.gate_count(),
        "Circuit optimization complete"
    );
    drop(_opt_span);

    // Execute
    let result = backend.execute(&optimized, shots, config)?;

    // At DEBUG level, log the execution summary
    debug!(
        execution_time_ms = result.execution_time_ms,
        peak_memory = result.peak_memory_bytes,
        "Simulation execution complete"
    );

    // At TRACE level only for small circuits, log amplitude information
    if circuit.num_qubits() <= 10 {
        trace!(
            amplitudes = ?result.state_vector_snapshot(),
            "Final state vector (small circuit trace)"
        );
    }

    Ok(result)
}
```

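The per-phase spans above line up with the per-phase timing fields in `SimulationMetrics` (`gate_application_time_ms`, `measurement_time_ms`). One way to collect those timings without depending on a tracing subscriber is a small accumulator, sketched here with only the standard library. `PhaseTimings` and its methods are hypothetical names, not part of the engine.

```rust
use std::time::{Duration, Instant};

/// Hypothetical accumulator for wall-clock time spent in each named phase.
#[derive(Debug, Default)]
pub struct PhaseTimings {
    phases: Vec<(&'static str, Duration)>,
}

impl PhaseTimings {
    pub fn new() -> Self {
        Self::default()
    }

    /// Runs `f`, records how long it took under `name`, and returns its result.
    pub fn time<T>(&mut self, name: &'static str, f: impl FnOnce() -> T) -> T {
        let start = Instant::now();
        let out = f();
        self.phases.push((name, start.elapsed()));
        out
    }

    /// Total milliseconds recorded under `name` (0.0 if the phase never ran).
    pub fn millis(&self, name: &str) -> f64 {
        self.phases
            .iter()
            .filter(|(n, _)| *n == name)
            .map(|(_, d)| d.as_secs_f64() * 1000.0)
            .sum()
    }

    /// Phase names in the order they were recorded.
    pub fn names(&self) -> Vec<&'static str> {
        self.phases.iter().map(|(n, _)| *n).collect()
    }
}
```

A caller would wrap each stage, for example `timings.time("gate_application", || apply_gates(...))`, and copy `timings.millis("gate_application")` into the metrics record before publishing.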
### 5. Structured Error Reporting

All errors carry structured context for programmatic handling:

```rust
#[derive(Debug, thiserror::Error)]
pub enum SimulationError {
    #[error("Qubit limit exceeded: requested {requested}, maximum {maximum}")]
    QubitLimitExceeded {
        requested: u32,
        maximum: u32,
        estimated_memory_bytes: u64,
        available_memory_bytes: u64,
    },

    #[error("Memory allocation failed for {requested_bytes} bytes")]
    MemoryAllocationFailed {
        requested_bytes: u64,
        qubit_count: u32,
        suggestion: &'static str,
    },

    #[error("Invalid gate target: qubit {qubit} in {qubit_count}-qubit circuit")]
    InvalidGateTarget {
        gate_name: String,
        qubit: u32,
        qubit_count: u32,
        gate_index: usize,
    },

    #[error("Invalid gate parameter: {parameter_name} = {value} ({reason})")]
    InvalidParameter {
        gate_name: String,
        parameter_name: String,
        value: f64,
        reason: &'static str,
    },

    #[error("Tensor contraction failed: {reason}")]
    ContractionFailed {
        reason: String,
        estimated_treewidth: usize,
        suggestion: &'static str,
    },

    #[error("MPS fidelity {fidelity:.6} below threshold {threshold:.6}")]
    MpsFidelityBelowThreshold {
        fidelity: f64,
        threshold: f64,
        max_bond_dimension: usize,
        suggestion: &'static str,
    },

    #[error("Simulation timed out after {elapsed_ms}ms (limit: {timeout_ms}ms)")]
    Timeout {
        elapsed_ms: u64,
        timeout_ms: u64,
        gates_completed: u64,
        gates_remaining: u64,
    },

    #[error("Internal error: {message}")]
    InternalError {
        message: String,
        source: Option<Box<dyn std::error::Error + Send + Sync>>,
    },
}
```

Each error variant includes a `suggestion` field where applicable, guiding users
toward resolution:

| Error | Suggestion |
|---|---|
| QubitLimitExceeded | "Reduce qubit count or enable tensor-network feature for large circuits" |
| MemoryAllocationFailed | "Try tensor-network backend or reduce qubit count by 1-2 (halves/quarters memory)" |
| ContractionFailed | "Circuit treewidth too high for tensor network; use state vector for <= 30 qubits" |
| MpsFidelityBelowThreshold | "Increase chi_max or switch to exact state vector for high-fidelity results" |

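`SimulationMetrics` carries an `error_kind` alongside the human-readable message, so something has to map a rich `SimulationError` onto the flat `SimulationErrorKind`. A minimal sketch of that mapping follows. `SimError` and `ErrorKind` are hypothetical, trimmed-down copies of the two enums (four variants only, no `thiserror`), and the `kind()` and `suggestion()` accessors are assumptions, not confirmed engine API.

```rust
/// Hypothetical, trimmed-down copy of `SimulationErrorKind`.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ErrorKind {
    QubitLimitExceeded,
    MemoryAllocationFailed,
    Timeout,
    InternalError,
}

/// Hypothetical, trimmed-down copy of `SimulationError`.
#[derive(Debug)]
pub enum SimError {
    QubitLimitExceeded { requested: u32, maximum: u32 },
    MemoryAllocationFailed { requested_bytes: u64, suggestion: &'static str },
    Timeout { elapsed_ms: u64, timeout_ms: u64 },
    InternalError { message: String },
}

impl SimError {
    /// Flat category used for the `error_kind` metrics field.
    pub fn kind(&self) -> ErrorKind {
        match self {
            SimError::QubitLimitExceeded { .. } => ErrorKind::QubitLimitExceeded,
            SimError::MemoryAllocationFailed { .. } => ErrorKind::MemoryAllocationFailed,
            SimError::Timeout { .. } => ErrorKind::Timeout,
            SimError::InternalError { .. } => ErrorKind::InternalError,
        }
    }

    /// Suggestion text, for variants that carry one.
    pub fn suggestion(&self) -> Option<&'static str> {
        match self {
            SimError::MemoryAllocationFailed { suggestion, .. } => Some(*suggestion),
            _ => None,
        }
    }
}
```

Because the `match` in `kind()` has no wildcard arm, adding a variant to the error enum forces the mapping to be updated at compile time.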
### 6. Health Checks

The engine exposes health status for monitoring systems:

```rust
use std::sync::atomic::Ordering;

#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct EngineHealthStatus {
    /// Whether the engine is ready to accept simulations.
    pub ready: bool,

    /// Maximum qubits supportable given current available memory.
    pub max_supported_qubits: u32,

    /// Available memory in bytes.
    pub available_memory_bytes: u64,

    /// Number of CPU cores available for parallel gate application.
    pub available_cores: usize,

    /// Whether the tensor-network backend is compiled in.
    pub tensor_network_available: bool,

    /// Current engine version.
    pub version: &'static str,

    /// Uptime since engine initialization (if applicable).
    pub uptime_seconds: Option<f64>,

    /// Number of simulations executed in current session.
    pub simulations_executed: u64,

    /// Total gates applied across all simulations in current session.
    pub total_gates_applied: u64,
}

/// Check engine health. Callable at any time.
pub fn quantum_engine_ready() -> EngineHealthStatus {
    let available_memory = estimate_available_memory();
    let max_qubits = compute_max_qubits(available_memory);

    EngineHealthStatus {
        ready: max_qubits >= 4, // Minimum useful simulation
        max_supported_qubits: max_qubits,
        available_memory_bytes: available_memory,
        available_cores: rayon::current_num_threads(),
        tensor_network_available: cfg!(feature = "tensor-network"),
        version: env!("CARGO_PKG_VERSION"),
        uptime_seconds: None, // Library mode; no persistent uptime
        simulations_executed: SESSION_COUNTER.load(Ordering::Relaxed),
        total_gates_applied: SESSION_GATES.load(Ordering::Relaxed),
    }
}
```

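`quantum_engine_ready` relies on `compute_max_qubits`, whose body the ADR does not show. One plausible sketch, for the dense state-vector backend only, is given here. It assumes 16 bytes per amplitude (two `f64` values), which matches the 24-qubit, 268435456-byte example in the WebSocket payload of this ADR; the constant, the helper `state_vector_bytes`, and the zero-memory behavior are assumptions.

```rust
/// Assumed size of one complex amplitude: two `f64` values.
pub const BYTES_PER_AMPLITUDE: u64 = 16;

/// Bytes needed for a dense state vector of `qubits` qubits,
/// or `None` if the size does not fit in a `u64`.
pub fn state_vector_bytes(qubits: u32) -> Option<u64> {
    1u64.checked_shl(qubits)?.checked_mul(BYTES_PER_AMPLITUDE)
}

/// Largest qubit count whose dense state vector fits in `available_bytes`.
/// Returns 0 when not even a single amplitude fits.
pub fn compute_max_qubits(available_bytes: u64) -> u32 {
    let amplitudes = available_bytes / BYTES_PER_AMPLITUDE;
    if amplitudes == 0 {
        return 0;
    }
    // floor(log2(amplitudes)): position of the highest set bit.
    63 - amplitudes.leading_zeros()
}
```

Each additional qubit doubles the requirement, which is why dropping one or two qubits halves or quarters memory. A real implementation would likely also reserve headroom for scratch buffers and the optimizer, which this sketch ignores.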
### 7. Logging Levels

| Level | Content | Audience | Performance Impact |
|---|---|---|---|
| ERROR | Simulation failures, OOM, invalid circuits | Operators, alerting | None |
| WARN | Approaching memory limits (>80%), MPS fidelity degradation, slow contraction | Operators | Negligible |
| INFO | Simulation start/end summaries, backend selection, optimization results | Developers, dashboards | Negligible |
| DEBUG | Per-optimization-pass details, memory allocation sizes, thread utilization | Developers debugging | Low |
| TRACE | Per-gate amplitude changes (small circuits only, n <= 10), SVD singular values | Deep debugging | High (small circuits only) |

TRACE level is gated on circuit size to prevent catastrophic log volume:

```rust
// TRACE-level amplitude logging is only emitted for circuits with <= 10 qubits.
// For larger circuits, TRACE only emits gate-level timing without amplitude data.
if tracing::enabled!(tracing::Level::TRACE) {
    if circuit.num_qubits() <= 10 {
        trace!(amplitudes = ?state.as_slice(), "Post-gate state");
    } else {
        trace!(gate_time_ns = elapsed.as_nanos(), "Gate applied");
    }
}
```

### 8. Dashboard Integration

Metrics from the quantum engine appear in the ruVector monitoring UI as a dedicated
panel alongside vector operations, index health, and system resources.

```
+------------------------------------------------------------------+
|                  ruVector Monitoring Dashboard                   |
+------------------------------------------------------------------+
|                                                                  |
|  Vector Operations              | Quantum Simulations            |
|  -------------------            | -----------------------        |
|  Queries/sec: 12,450            | Simulations/min: 23            |
|  P99 latency: 2.3ms             | Avg execution: 145ms           |
|  Index size: 2.1M vectors       | Avg qubits: 18.4               |
|                                 | Peak memory: 4.2 GiB           |
|                                 | Backend: SV 87% / TN 13%       |
|                                 | Gates/sec: 2.1B                |
|                                 | Error rate: 0.02%              |
|                                 |                                |
|  System Resources               | Recent Simulations             |
|  -------------------            | -----------------------        |
|  CPU: 34%                       | #a3f2.. 24q 230ms OK           |
|  Memory: 61% (49/80 GiB)        | #b891.. 16q 12ms OK            |
|  Threads: 64/256 active         | #c4d0.. 30q 1.2s OK            |
|                                 | #d122.. 35q ERR OOM            |
+------------------------------------------------------------------+
```

Metrics are published via the existing `ruvector-metrics` WebSocket feed:

```json
{
  "source": "ruqu",
  "type": "simulation_complete",
  "timestamp": "2026-02-06T14:23:01.442Z",
  "data": {
    "simulation_id": "a3f2e891-...",
    "qubit_count": 24,
    "execution_time_ms": 230.4,
    "peak_memory_bytes": 268435456,
    "backend": "StateVector",
    "gates_per_second": 2147483648,
    "success": true
  }
}
```

### 9. Prometheus / OpenTelemetry Export

For external monitoring, the native metrics sink exports standard Prometheus
metrics:

```
# HELP ruqu_simulations_total Total quantum simulations executed
# TYPE ruqu_simulations_total counter
ruqu_simulations_total{backend="state_vector",status="success"} 1847
ruqu_simulations_total{backend="state_vector",status="error"} 3
ruqu_simulations_total{backend="tensor_network",status="success"} 241

# HELP ruqu_execution_time_ms Simulation execution time histogram
# TYPE ruqu_execution_time_ms histogram
ruqu_execution_time_ms_bucket{backend="state_vector",le="10"} 423
ruqu_execution_time_ms_bucket{backend="state_vector",le="100"} 1201
ruqu_execution_time_ms_bucket{backend="state_vector",le="1000"} 1834
ruqu_execution_time_ms_bucket{backend="state_vector",le="+Inf"} 1847

# HELP ruqu_peak_memory_bytes Peak memory during simulation
# TYPE ruqu_peak_memory_bytes gauge
ruqu_peak_memory_bytes 4294967296

# HELP ruqu_gates_per_second Gate application throughput
# TYPE ruqu_gates_per_second gauge
ruqu_gates_per_second 2.1e9

# HELP ruqu_max_supported_qubits Maximum qubits based on available memory
# TYPE ruqu_max_supported_qubits gauge
ruqu_max_supported_qubits 33
```

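The `_bucket` lines in a Prometheus histogram are cumulative: each `le` bucket counts every observation less than or equal to its bound, so the `+Inf` bucket equals the total count. A complete histogram in the exposition format also carries `_sum` and `_count` series. The sketch below illustrates that bookkeeping with the standard library only. The `Histogram` type is hypothetical and is not the `ruvector-metrics` API; labels such as `backend` are omitted for brevity.

```rust
/// Hypothetical fixed-bucket histogram that renders Prometheus-style lines.
pub struct Histogram {
    /// Upper bounds in ascending order, excluding the implicit +Inf bucket.
    bounds: Vec<f64>,
    /// Non-cumulative count per bucket; the last slot is the +Inf bucket.
    counts: Vec<u64>,
    /// Sum of all observed values.
    sum: f64,
}

impl Histogram {
    pub fn new(bounds: &[f64]) -> Self {
        Self {
            bounds: bounds.to_vec(),
            counts: vec![0; bounds.len() + 1],
            sum: 0.0,
        }
    }

    /// Records one observation in the first bucket whose bound is >= value.
    pub fn observe(&mut self, value: f64) {
        let idx = self
            .bounds
            .iter()
            .position(|&b| value <= b)
            .unwrap_or(self.bounds.len());
        self.counts[idx] += 1;
        self.sum += value;
    }

    /// Renders cumulative `_bucket` lines followed by `_sum` and `_count`.
    pub fn render(&self, name: &str) -> Vec<String> {
        let mut lines = Vec::with_capacity(self.counts.len() + 2);
        let mut cumulative: u64 = 0;
        for (i, count) in self.counts.iter().enumerate() {
            cumulative += count;
            let le = match self.bounds.get(i) {
                Some(bound) => bound.to_string(),
                None => "+Inf".to_string(),
            };
            lines.push(format!("{name}_bucket{{le=\"{le}\"}} {cumulative}"));
        }
        lines.push(format!("{name}_sum {}", self.sum));
        lines.push(format!("{name}_count {cumulative}"));
        lines
    }
}
```

Storing per-bucket counts and accumulating them only at render time keeps `observe` to a single increment, which matters when it sits on the simulation's hot path.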
## Consequences

### Positive

1. **Unified observability**: Quantum simulation telemetry integrates seamlessly
   with ruVector's existing monitoring infrastructure.
2. **Cross-platform**: The trait-based sink design supports native, WASM, and
   testing environments without code changes in the engine.
3. **Actionable errors**: Structured errors with suggestions reduce debugging time
   and improve developer experience.
4. **Performance visibility**: Gates-per-second, memory consumption, and backend
   selection metrics enable informed performance tuning.
5. **Compliance-ready**: Structured logging with simulation IDs supports audit
   trail requirements.

### Negative

1. **Metric volume**: High-frequency simulations could generate significant
   metric volume. Mitigated by aggregation at the sink level.
2. **WASM callback overhead**: JSON serialization for WASM metrics adds ~0.1ms per
   simulation. Acceptable for typical workloads.
3. **Tracing overhead at DEBUG/TRACE**: Enabling tracing at DEBUG or TRACE adds
   measurable overhead. Production deployments should use INFO or above.
4. **Schema evolution**: Changes to `SimulationMetrics` require versioned handling
   in consumers.

### Risks and Mitigations

| Risk | Mitigation |
|---|---|
| Metric volume overwhelming storage | Configurable sampling rate; aggregate in sink |
| WASM callback exceptions | Catch JS exceptions in callback wrapper; log to console |
| Schema breaking changes | Version field in metrics; consumer-side version dispatch |
| TRACE logging for large circuits | Qubit-count gate prevents amplitude logging above n=10 |

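The "consumer-side version dispatch" mitigation can be sketched as a small compatibility check on the `schema_version` string. This sketch assumes the version follows a `major.minor.patch` convention in which a newer minor version only adds fields and a different major version is breaking; the ADR does not state that policy, and `SchemaSupport`, `parse_version`, and `check_schema` are hypothetical names.

```rust
/// Outcome of comparing a record's schema version with what a consumer knows.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SchemaSupport {
    /// Same major version, minor version known to this consumer.
    Supported,
    /// Same major version but a newer minor: readable, unknown fields ignored.
    NewerMinor,
    /// Different major version or unparseable string: do not interpret.
    Unsupported,
}

/// Parses "major.minor.patch" into its three numeric components.
pub fn parse_version(version: &str) -> Option<(u32, u32, u32)> {
    let mut parts = version.split('.');
    let major: u32 = parts.next()?.parse().ok()?;
    let minor: u32 = parts.next()?.parse().ok()?;
    let patch: u32 = parts.next()?.parse().ok()?;
    if parts.next().is_some() {
        return None; // more than three components
    }
    Some((major, minor, patch))
}

/// Decides how a consumer built for `known_major.known_minor` should
/// treat a record tagged with `version`.
pub fn check_schema(version: &str, known_major: u32, known_minor: u32) -> SchemaSupport {
    match parse_version(version) {
        Some((major, minor, _)) if major == known_major => {
            if minor <= known_minor {
                SchemaSupport::Supported
            } else {
                SchemaSupport::NewerMinor
            }
        }
        _ => SchemaSupport::Unsupported,
    }
}
```

A consumer would call `check_schema(record_version, 1, 0)` before decoding and route `Unsupported` records to a dead-letter log rather than guessing at their layout.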
## References

- `ruvector-metrics` crate: internal metrics infrastructure
- Rust `tracing` crate: https://docs.rs/tracing
- OpenTelemetry Rust SDK: https://docs.rs/opentelemetry
- ADR-QE-005: WASM Compilation Target (WASM constraints)
- ADR-QE-011: Memory Gating & Power Management (resource monitoring)
- Prometheus exposition format: https://prometheus.io/docs/instrumenting/exposition_formats/