Tiny Dancer Observability Guide
This guide covers the comprehensive observability features in Tiny Dancer, including Prometheus metrics, OpenTelemetry distributed tracing, and structured logging.
Table of Contents
- Overview
- Prometheus Metrics
- Distributed Tracing
- Structured Logging
- Integration Guide
- Examples
- Best Practices
Overview
Tiny Dancer provides three layers of observability:
- Prometheus Metrics: Real-time performance metrics and system health
- OpenTelemetry Tracing: Distributed tracing for request flow analysis
- Structured Logging: Context-rich logs with the tracing crate
All three work together to provide complete visibility into your routing system.
Prometheus Metrics
Available Metrics
Request Metrics
tiny_dancer_routing_requests_total{status="success|failure"}
Counter tracking total routing requests by status.
tiny_dancer_routing_latency_seconds{operation="total"}
Histogram of routing operation latency in seconds.
Feature Engineering Metrics
tiny_dancer_feature_engineering_duration_seconds{batch_size="1-10|11-50|51-100|100+"}
Histogram of feature engineering duration by batch size.
Model Inference Metrics
tiny_dancer_model_inference_duration_seconds{model_type="fastgrnn"}
Histogram of model inference duration.
Circuit Breaker Metrics
tiny_dancer_circuit_breaker_state
Gauge showing circuit breaker state:
- 0 = Closed (healthy)
- 1 = Half-Open (testing)
- 2 = Open (failing)
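The numeric encoding matters because alerting rules compare against it directly (the CircuitBreakerOpen rule later in this guide checks for the value 2). A minimal stdlib-only sketch of the mapping; the enum and function here are illustrative, not types from the crate:

```rust
// Illustrative sketch of the circuit breaker gauge encoding described above.
// This enum is NOT the crate's actual type; it only mirrors the 0/1/2 values.
#[derive(Debug, PartialEq)]
enum CircuitState {
    Closed,   // 0 = healthy
    HalfOpen, // 1 = testing
    Open,     // 2 = failing
}

fn gauge_value(state: &CircuitState) -> i64 {
    match state {
        CircuitState::Closed => 0,
        CircuitState::HalfOpen => 1,
        CircuitState::Open => 2,
    }
}

fn main() {
    // An alert like `tiny_dancer_circuit_breaker_state == 2` fires only
    // when the breaker is fully open.
    assert_eq!(gauge_value(&CircuitState::Open), 2);
    assert_eq!(gauge_value(&CircuitState::Closed), 0);
}
```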
Routing Decision Metrics
tiny_dancer_routing_decisions_total{model_type="lightweight|powerful"}
Counter of routing decisions by target model type.
tiny_dancer_confidence_scores{decision_type="lightweight|powerful"}
Histogram of confidence scores by decision type.
tiny_dancer_uncertainty_estimates{decision_type="lightweight|powerful"}
Histogram of uncertainty estimates.
Candidate Metrics
tiny_dancer_candidates_processed_total{batch_size_range="1-10|11-50|51-100|100+"}
Counter of total candidates processed by batch size range.
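Both batch-size labels above use the same fixed ranges (1-10, 11-50, 51-100, 100+), which keeps label cardinality bounded. A stdlib sketch of how such bucketing can be computed; batch_size_range here is a hypothetical helper, not the crate's API:

```rust
// Map a batch size onto the fixed label ranges used by the batch_size /
// batch_size_range metric labels. Fixed ranges keep cardinality low.
// (A batch of 0 would not normally occur; it falls into "1-10" here.)
fn batch_size_range(n: usize) -> &'static str {
    match n {
        0..=10 => "1-10",
        11..=50 => "11-50",
        51..=100 => "51-100",
        _ => "100+",
    }
}

fn main() {
    assert_eq!(batch_size_range(3), "1-10");
    assert_eq!(batch_size_range(64), "51-100");
    assert_eq!(batch_size_range(500), "100+");
}
```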
Error Metrics
tiny_dancer_errors_total{error_type="inference_error|circuit_breaker_open|..."}
Counter of errors by type.
Using Metrics
use ruvector_tiny_dancer_core::{Router, RouterConfig};
// Create router (metrics are automatically collected)
let router = Router::new(RouterConfig::default())?;
// Process requests...
let response = router.route(request)?;
// Export metrics in Prometheus format
let metrics = router.export_metrics()?;
println!("{}", metrics);
Prometheus Configuration
scrape_configs:
- job_name: 'tiny-dancer'
scrape_interval: 15s
static_configs:
- targets: ['localhost:9090']
Example Grafana Dashboard
{
"dashboard": {
"title": "Tiny Dancer Routing",
"panels": [
{
"title": "Request Rate",
"targets": [{
"expr": "rate(tiny_dancer_routing_requests_total[5m])"
}]
},
{
"title": "P95 Latency",
"targets": [{
"expr": "histogram_quantile(0.95, rate(tiny_dancer_routing_latency_seconds_bucket[5m]))"
}]
},
{
"title": "Circuit Breaker State",
"targets": [{
"expr": "tiny_dancer_circuit_breaker_state"
}]
},
{
"title": "Lightweight vs Powerful Routing",
"targets": [{
"expr": "rate(tiny_dancer_routing_decisions_total[5m])"
}]
}
]
}
}
Distributed Tracing
OpenTelemetry Integration
Tiny Dancer integrates with OpenTelemetry for distributed tracing, supporting exporters such as Jaeger and Zipkin.
Trace Spans
The following spans are automatically created:
- routing_request: Complete routing operation
- circuit_breaker_check: Circuit breaker validation
- feature_engineering: Feature extraction and engineering
- model_inference: Neural model inference (per candidate)
- uncertainty_estimation: Uncertainty quantification
Configuration
use ruvector_tiny_dancer_core::{TracingConfig, TracingSystem};
// Configure tracing
let config = TracingConfig {
service_name: "tiny-dancer".to_string(),
service_version: "1.0.0".to_string(),
jaeger_agent_endpoint: Some("localhost:6831".to_string()),
sampling_ratio: 1.0, // Sample 100% of traces
enable_stdout: false,
};
// Initialize tracing
let tracing_system = TracingSystem::new(config);
tracing_system.init()?;
// Your application code...
// Shutdown and flush traces
tracing_system.shutdown();
Jaeger Setup
# Run Jaeger all-in-one
docker run -d \
-p 6831:6831/udp \
-p 16686:16686 \
jaegertracing/all-in-one:latest
# Access Jaeger UI at http://localhost:16686
Trace Context Propagation
use ruvector_tiny_dancer_core::TraceContext;
// Get trace context from current span
if let Some(ctx) = TraceContext::from_current() {
println!("Trace ID: {}", ctx.trace_id);
println!("Span ID: {}", ctx.span_id);
// W3C Trace Context format for HTTP headers
let traceparent = ctx.to_w3c_traceparent();
// Example: "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
}
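The traceparent string above follows the W3C Trace Context format: four dash-separated hex fields (version, trace-id, span-id, flags). A minimal stdlib-only parser sketch; parse_traceparent is a hypothetical helper for illustration, not part of the crate:

```rust
// Parse a W3C traceparent header of the form
// "00-<32 hex trace id>-<16 hex span id>-<2 hex flags>".
fn parse_traceparent(header: &str) -> Option<(String, String, String, String)> {
    let parts: Vec<&str> = header.split('-').collect();
    if parts.len() != 4 || parts[1].len() != 32 || parts[2].len() != 16 {
        return None;
    }
    Some((
        parts[0].to_string(), // version
        parts[1].to_string(), // trace id
        parts[2].to_string(), // span id
        parts[3].to_string(), // flags
    ))
}

fn main() {
    let tp = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01";
    let (version, trace_id, span_id, flags) = parse_traceparent(tp).unwrap();
    assert_eq!(version, "00");
    assert_eq!(trace_id.len(), 32);
    assert_eq!(span_id, "00f067aa0ba902b7");
    assert_eq!(flags, "01");
}
```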
Custom Spans
use ruvector_tiny_dancer_core::RoutingSpan;
use tracing::info_span;
// Create custom span
let span = info_span!("my_operation", param1 = "value");
let _guard = span.enter();
// Or use pre-defined span helpers
let span = RoutingSpan::routing_request(candidate_count);
let _guard = span.enter();
Structured Logging
Log Levels
Tiny Dancer uses the tracing crate for structured logging:
- ERROR: Critical failures (circuit breaker open, inference errors)
- WARN: Warnings (model path not found, degraded performance)
- INFO: Normal operations (router initialization, request completion)
- DEBUG: Detailed information (feature extraction, inference results)
- TRACE: Very detailed information (internal state changes)
Example Logs
INFO tiny_dancer_router: Initializing Tiny Dancer router
INFO tiny_dancer_router: Circuit breaker enabled with threshold: 5
INFO tiny_dancer_router: Processing routing request candidate_count=3
DEBUG tiny_dancer_router: Extracting features batch_size=3
DEBUG tiny_dancer_router: Model inference completed candidate_id="candidate-1" confidence=0.92
DEBUG tiny_dancer_router: Routing decision made candidate_id="candidate-1" use_lightweight=true uncertainty=0.08
INFO tiny_dancer_router: Routing request completed successfully inference_time_us=245 lightweight_routes=2 powerful_routes=1
Configuring Logging
use tracing_subscriber::{layer::SubscriberExt, util::SubscriberInitExt};
// Basic setup
tracing_subscriber::fmt()
.with_max_level(tracing::Level::INFO)
.init();
// Advanced setup with JSON formatting
tracing_subscriber::registry()
.with(tracing_subscriber::fmt::layer().json())
.with(tracing_subscriber::filter::LevelFilter::from_level(
tracing::Level::DEBUG
))
.init();
Integration Guide
Complete Setup
use ruvector_tiny_dancer_core::{
Router, RouterConfig, TracingConfig, TracingSystem
};
use tracing_subscriber;
fn main() -> Result<(), Box<dyn std::error::Error>> {
// 1. Initialize structured logging
tracing_subscriber::fmt()
.with_max_level(tracing::Level::INFO)
.init();
// 2. Initialize distributed tracing
let tracing_config = TracingConfig {
service_name: "my-service".to_string(),
service_version: "1.0.0".to_string(),
jaeger_agent_endpoint: Some("localhost:6831".to_string()),
sampling_ratio: 0.1, // Sample 10% in production
enable_stdout: false,
};
let tracing_system = TracingSystem::new(tracing_config);
tracing_system.init()?;
// 3. Create router (metrics automatically enabled)
let router = Router::new(RouterConfig::default())?;
// 4. Process requests (all observability automatic)
let response = router.route(request)?;
// 5. Periodically export metrics (e.g., to HTTP endpoint)
let metrics = router.export_metrics()?;
// 6. Cleanup
tracing_system.shutdown();
Ok(())
}
HTTP Metrics Endpoint
use std::sync::Arc;
use axum::{extract::State, routing::get, Router as HttpRouter};

async fn metrics_handler(
    State(router): State<Arc<ruvector_tiny_dancer_core::Router>>,
) -> String {
    router.export_metrics().unwrap_or_default()
}

let app = HttpRouter::new()
    .route("/metrics", get(metrics_handler))
    .with_state(router);
Examples
1. Metrics Only
cargo run --example metrics_example
Demonstrates Prometheus metrics collection and export.
2. Tracing Only
# Start Jaeger first
docker run -d -p6831:6831/udp -p16686:16686 jaegertracing/all-in-one:latest
# Run example
cargo run --example tracing_example
Shows distributed tracing with OpenTelemetry.
3. Full Observability
cargo run --example full_observability
Combines metrics, tracing, and structured logging.
Best Practices
Production Configuration
- Sampling: Don't trace every request in production
  sampling_ratio: 0.01, // 1% sampling
- Log Levels: Use INFO or WARN in production
  .with_max_level(tracing::Level::INFO)
- Metrics Cardinality: Be careful with high-cardinality labels
  - ✓ Good: {model_type="lightweight"}
  - ✗ Bad: {candidate_id="12345"} (too many unique values)
- Performance: Metrics collection is very lightweight (<1μs overhead)
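To make the sampling recommendation concrete: head sampling typically hashes the trace ID and keeps a fixed fraction of traces, so every span of a given trace gets the same decision. A stdlib sketch of the idea (the real OpenTelemetry ratio-based sampler differs in detail):

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Head-sampling sketch: hash the trace id and keep a fixed fraction of
// traces. Hashing the id (rather than rolling a die per span) ensures all
// spans of one trace share the same keep/drop decision.
fn should_sample(trace_id: &str, sampling_ratio: f64) -> bool {
    let mut hasher = DefaultHasher::new();
    trace_id.hash(&mut hasher);
    ((hasher.finish() % 10_000) as f64) < sampling_ratio * 10_000.0
}

fn main() {
    // ratio 1.0 keeps everything; ratio 0.0 keeps nothing.
    assert!(should_sample("4bf92f3577b34da6a3ce929d0e0e4736", 1.0));
    assert!(!should_sample("4bf92f3577b34da6a3ce929d0e0e4736", 0.0));
}
```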
Alerting Rules
Example Prometheus alerting rules:
groups:
- name: tiny_dancer
rules:
- alert: HighErrorRate
expr: rate(tiny_dancer_errors_total[5m]) > 0.05
for: 5m
annotations:
summary: "High error rate detected"
- alert: CircuitBreakerOpen
expr: tiny_dancer_circuit_breaker_state == 2
for: 1m
annotations:
summary: "Circuit breaker is open"
- alert: HighLatency
expr: histogram_quantile(0.95, rate(tiny_dancer_routing_latency_seconds_bucket[5m])) > 0.01
for: 5m
annotations:
summary: "P95 latency above 10ms"
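The HighLatency rule compares a p95 against 0.01 s (10 ms). Prometheus's histogram_quantile interpolates within histogram buckets, but the intuition is the ordinary 95th percentile, sketched here over raw samples:

```rust
// Naive 95th percentile over raw samples (assumes a non-empty slice).
// Prometheus estimates this from histogram buckets by interpolation instead.
fn p95(samples: &mut [f64]) -> f64 {
    samples.sort_by(|a, b| a.partial_cmp(b).unwrap());
    let idx = ((samples.len() as f64) * 0.95).ceil() as usize;
    samples[idx.saturating_sub(1)]
}

fn main() {
    // 100 latency samples from 0.001 s to 0.100 s.
    let mut latencies: Vec<f64> = (1..=100).map(|ms| ms as f64 / 1000.0).collect();
    // 95% of the samples fall at or below the 95th value, 0.095 s.
    assert!((p95(&mut latencies) - 0.095).abs() < 1e-9);
}
```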
Debugging Performance Issues
- Check metrics for high-level patterns:
  rate(tiny_dancer_routing_requests_total[5m])
- Use traces to identify bottlenecks:
  - Look for long spans
  - Identify slow candidates
- Review logs for error details:
  grep "ERROR" logs.txt | jq .
Troubleshooting
Metrics Not Appearing
- Ensure the router is processing requests
- Check metrics export: router.export_metrics()?
- Verify the Prometheus scrape configuration
Traces Not in Jaeger
- Confirm Jaeger is running: docker ps
- Check the endpoint: jaeger_agent_endpoint: Some("localhost:6831")
- Verify sampling ratio > 0
- Call tracing_system.shutdown() to flush pending spans
High Memory Usage
- Reduce sampling ratio
- Decrease histogram buckets
- Lower log level to INFO or WARN