Tiny Dancer Observability Guide

This guide covers the comprehensive observability features in Tiny Dancer, including Prometheus metrics, OpenTelemetry distributed tracing, and structured logging.

Table of Contents

  1. Overview
  2. Prometheus Metrics
  3. Distributed Tracing
  4. Structured Logging
  5. Integration Guide
  6. Examples
  7. Best Practices

Overview

Tiny Dancer provides three layers of observability:

  • Prometheus Metrics: Real-time performance metrics and system health
  • OpenTelemetry Tracing: Distributed tracing for request flow analysis
  • Structured Logging: Context-rich logs with the tracing crate

All three work together to provide complete visibility into your routing system.

Prometheus Metrics

Available Metrics

Request Metrics

tiny_dancer_routing_requests_total{status="success|failure"}

Counter tracking total routing requests by status.

tiny_dancer_routing_latency_seconds{operation="total"}

Histogram of routing operation latency in seconds.

Feature Engineering Metrics

tiny_dancer_feature_engineering_duration_seconds{batch_size="1-10|11-50|51-100|100+"}

Histogram of feature engineering duration by batch size.

Model Inference Metrics

tiny_dancer_model_inference_duration_seconds{model_type="fastgrnn"}

Histogram of model inference duration.

Circuit Breaker Metrics

tiny_dancer_circuit_breaker_state

Gauge showing circuit breaker state:

  • 0 = Closed (healthy)
  • 1 = Half-Open (testing)
  • 2 = Open (failing)

Routing Decision Metrics

tiny_dancer_routing_decisions_total{model_type="lightweight|powerful"}

Counter of routing decisions by target model type.

tiny_dancer_confidence_scores{decision_type="lightweight|powerful"}

Histogram of confidence scores by decision type.

tiny_dancer_uncertainty_estimates{decision_type="lightweight|powerful"}

Histogram of uncertainty estimates.

Candidate Metrics

tiny_dancer_candidates_processed_total{batch_size_range="1-10|11-50|51-100|100+"}

Counter of total candidates processed by batch size range.

Error Metrics

tiny_dancer_errors_total{error_type="inference_error|circuit_breaker_open|..."}

Counter of errors by type.

Using Metrics

use ruvector_tiny_dancer_core::{Router, RouterConfig};

// Create router (metrics are automatically collected)
let router = Router::new(RouterConfig::default())?;

// Process requests...
let response = router.route(request)?;

// Export metrics in Prometheus format
let metrics = router.export_metrics()?;
println!("{}", metrics);
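
The exported string is in the standard Prometheus text exposition format. An illustrative excerpt (metric names and labels as documented above; the values are placeholders):

# TYPE tiny_dancer_routing_requests_total counter
tiny_dancer_routing_requests_total{status="success"} 42
tiny_dancer_routing_requests_total{status="failure"} 1
# TYPE tiny_dancer_circuit_breaker_state gauge
tiny_dancer_circuit_breaker_state 0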

Prometheus Configuration

scrape_configs:
  - job_name: 'tiny-dancer'
    scrape_interval: 15s
    static_configs:
      - targets: ['localhost:9090']

Example Grafana Dashboard

{
  "dashboard": {
    "title": "Tiny Dancer Routing",
    "panels": [
      {
        "title": "Request Rate",
        "targets": [{
          "expr": "rate(tiny_dancer_routing_requests_total[5m])"
        }]
      },
      {
        "title": "P95 Latency",
        "targets": [{
          "expr": "histogram_quantile(0.95, rate(tiny_dancer_routing_latency_seconds_bucket[5m]))"
        }]
      },
      {
        "title": "Circuit Breaker State",
        "targets": [{
          "expr": "tiny_dancer_circuit_breaker_state"
        }]
      },
      {
        "title": "Lightweight vs Powerful Routing",
        "targets": [{
          "expr": "rate(tiny_dancer_routing_decisions_total[5m])"
        }]
      }
    ]
  }
}

Distributed Tracing

OpenTelemetry Integration

Tiny Dancer integrates with OpenTelemetry for distributed tracing and supports exporters such as Jaeger and Zipkin.

Trace Spans

The following spans are automatically created:

  • routing_request: Complete routing operation
  • circuit_breaker_check: Circuit breaker validation
  • feature_engineering: Feature extraction and engineering
  • model_inference: Neural model inference (per candidate)
  • uncertainty_estimation: Uncertainty quantification
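
With tracing's default contextual parenting, these spans should appear as children of whatever span you have entered when calling route(), so once tracing is initialized (see Configuration below) you can group an entire routing call under your own request-level span. A minimal sketch; the handle_request span name and request_id field are illustrative, and router/request come from the earlier examples:

use tracing::info_span;

// The spans created inside route() (routing_request, feature_engineering, ...)
// should show up as children of this span in Jaeger.
let span = info_span!("handle_request", request_id = "req-42");
let _guard = span.enter();

let response = router.route(request)?;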

Configuration

use ruvector_tiny_dancer_core::{TracingConfig, TracingSystem};

// Configure tracing
let config = TracingConfig {
    service_name: "tiny-dancer".to_string(),
    service_version: "1.0.0".to_string(),
    jaeger_agent_endpoint: Some("localhost:6831".to_string()),
    sampling_ratio: 1.0, // Sample 100% of traces
    enable_stdout: false,
};

// Initialize tracing
let tracing_system = TracingSystem::new(config);
tracing_system.init()?;

// Your application code...

// Shutdown and flush traces
tracing_system.shutdown();

Jaeger Setup

# Run Jaeger all-in-one
docker run -d \
  -p 6831:6831/udp \
  -p 16686:16686 \
  jaegertracing/all-in-one:latest

# Access Jaeger UI at http://localhost:16686

Trace Context Propagation

use ruvector_tiny_dancer_core::TraceContext;

// Get trace context from current span
if let Some(ctx) = TraceContext::from_current() {
    println!("Trace ID: {}", ctx.trace_id);
    println!("Span ID: {}", ctx.span_id);

    // W3C Trace Context format for HTTP headers
    let traceparent = ctx.to_w3c_traceparent();
    // Example: "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
}
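
A common use of the W3C header is forwarding the context to a downstream service so its spans join the same trace. A minimal sketch assuming the reqwest crate (blocking client); the URL is illustrative:

use ruvector_tiny_dancer_core::TraceContext;

fn call_downstream(client: &reqwest::blocking::Client) -> Result<(), Box<dyn std::error::Error>> {
    if let Some(ctx) = TraceContext::from_current() {
        // Attach the W3C Trace Context header so the downstream service
        // can continue the same trace.
        client
            .post("http://downstream.local/infer") // illustrative URL
            .header("traceparent", ctx.to_w3c_traceparent())
            .send()?;
    }
    Ok(())
}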

Custom Spans

use ruvector_tiny_dancer_core::RoutingSpan;
use tracing::info_span;

// Create custom span
let span = info_span!("my_operation", param1 = "value");
let _guard = span.enter();

// Or use pre-defined span helpers
let span = RoutingSpan::routing_request(candidate_count);
let _guard = span.enter();
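
Functions can also be instrumented with the tracing crate's #[instrument] attribute macro, which opens a span for every call. A minimal sketch; score_candidates is a hypothetical helper, not part of the Tiny Dancer API:

use tracing::instrument;

// Opens a span named "score_candidates" on each call, recording the batch
// size as a field; `skip` keeps the full slice out of the span's fields.
#[instrument(skip(candidates), fields(batch_size = candidates.len()))]
fn score_candidates(candidates: &[String]) -> usize {
    // ... application-specific scoring logic ...
    candidates.len()
}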

Structured Logging

Log Levels

Tiny Dancer uses the tracing crate for structured logging:

  • ERROR: Critical failures (circuit breaker open, inference errors)
  • WARN: Warnings (model path not found, degraded performance)
  • INFO: Normal operations (router initialization, request completion)
  • DEBUG: Detailed information (feature extraction, inference results)
  • TRACE: Very detailed information (internal state changes)
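
Your own code can emit structured events with the same tracing macros; fields are captured as key-value pairs alongside the message. A minimal sketch (the field names are illustrative):

use tracing::{debug, error, info};

info!(candidate_count = 3, "Processing routing request");
debug!(candidate_id = "candidate-1", confidence = 0.92, "Model inference completed");
error!(error_type = "circuit_breaker_open", "Routing request rejected");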

Example Logs

INFO tiny_dancer_router: Initializing Tiny Dancer router
INFO tiny_dancer_router: Circuit breaker enabled with threshold: 5
INFO tiny_dancer_router: Processing routing request candidate_count=3
DEBUG tiny_dancer_router: Extracting features batch_size=3
DEBUG tiny_dancer_router: Model inference completed candidate_id="candidate-1" confidence=0.92
DEBUG tiny_dancer_router: Routing decision made candidate_id="candidate-1" use_lightweight=true uncertainty=0.08
INFO tiny_dancer_router: Routing request completed successfully inference_time_us=245 lightweight_routes=2 powerful_routes=1

Configuring Logging

use tracing_subscriber::{layer::SubscriberExt, util::SubscriberInitExt};

// Basic setup
tracing_subscriber::fmt()
    .with_max_level(tracing::Level::INFO)
    .init();

// Advanced setup with JSON formatting
tracing_subscriber::registry()
    .with(tracing_subscriber::fmt::layer().json())
    .with(tracing_subscriber::filter::LevelFilter::from_level(
        tracing::Level::DEBUG
    ))
    .init();
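
Log levels can also be controlled at runtime through the RUST_LOG environment variable by using tracing-subscriber's EnvFilter (requires the crate's env-filter feature); a minimal sketch:

use tracing_subscriber::EnvFilter;

// Honors e.g. RUST_LOG=info,tiny_dancer_router=debug
tracing_subscriber::fmt()
    .with_env_filter(EnvFilter::from_default_env())
    .init();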

Integration Guide

Complete Setup

use ruvector_tiny_dancer_core::{
    Router, RouterConfig, TracingConfig, TracingSystem
};
use tracing_subscriber;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // 1. Initialize structured logging
    tracing_subscriber::fmt()
        .with_max_level(tracing::Level::INFO)
        .init();

    // 2. Initialize distributed tracing
    let tracing_config = TracingConfig {
        service_name: "my-service".to_string(),
        service_version: "1.0.0".to_string(),
        jaeger_agent_endpoint: Some("localhost:6831".to_string()),
        sampling_ratio: 0.1, // Sample 10% in production
        enable_stdout: false,
    };
    let tracing_system = TracingSystem::new(tracing_config);
    tracing_system.init()?;

    // 3. Create router (metrics automatically enabled)
    let router = Router::new(RouterConfig::default())?;

    // 4. Process requests (metrics, traces, and logs are emitted automatically;
    //    building the routing request is elided here)
    let response = router.route(request)?;

    // 5. Periodically export metrics (e.g., to HTTP endpoint)
    let metrics = router.export_metrics()?;

    // 6. Cleanup
    tracing_system.shutdown();

    Ok(())
}

HTTP Metrics Endpoint

use std::sync::Arc;
use axum::{extract::State, routing::get, Router as AxumRouter};

async fn metrics_handler(
    State(router): State<Arc<ruvector_tiny_dancer_core::Router>>,
) -> String {
    router.export_metrics().unwrap_or_default()
}

let app = AxumRouter::new()
    .route("/metrics", get(metrics_handler))
    .with_state(router);
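
To expose the endpoint, bind and serve the app as usual; a sketch assuming axum 0.7 and a Tokio runtime (inside an async main), with the port matching the Prometheus scrape config above:

// Serves /metrics on the port configured in prometheus.yml above.
let listener = tokio::net::TcpListener::bind("0.0.0.0:9090").await?;
axum::serve(listener, app).await?;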

Examples

1. Metrics Only

cargo run --example metrics_example

Demonstrates Prometheus metrics collection and export.

2. Tracing Only

# Start Jaeger first
docker run -d -p6831:6831/udp -p16686:16686 jaegertracing/all-in-one:latest

# Run example
cargo run --example tracing_example

Shows distributed tracing with OpenTelemetry.

3. Full Observability

cargo run --example full_observability

Combines metrics, tracing, and structured logging.

Best Practices

Production Configuration

  1. Sampling: Don't trace every request in production

    sampling_ratio: 0.01, // 1% sampling
    
  2. Log Levels: Use INFO or WARN in production

    .with_max_level(tracing::Level::INFO)
    
  3. Metrics Cardinality: Be careful with high-cardinality labels

    • ✓ Good: {model_type="lightweight"}
    • ✗ Bad: {candidate_id="12345"} (too many unique values)
  4. Performance: Metrics collection is very lightweight (<1μs overhead)

Alerting Rules

Example Prometheus alerting rules:

groups:
  - name: tiny_dancer
    rules:
      - alert: HighErrorRate
        expr: rate(tiny_dancer_errors_total[5m]) > 0.05
        for: 5m
        annotations:
          summary: "High error rate detected"

      - alert: CircuitBreakerOpen
        expr: tiny_dancer_circuit_breaker_state == 2
        for: 1m
        annotations:
          summary: "Circuit breaker is open"

      - alert: HighLatency
        expr: histogram_quantile(0.95, rate(tiny_dancer_routing_latency_seconds_bucket[5m])) > 0.01
        for: 5m
        annotations:
          summary: "P95 latency above 10ms"

Debugging Performance Issues

  1. Check metrics for high-level patterns

    rate(tiny_dancer_routing_requests_total[5m])
    
  2. Use traces to identify bottlenecks

    • Look for long spans
    • Identify slow candidates
  3. Review logs for error details (assuming JSON-formatted logs)

    grep "ERROR" logs.txt | jq .
    

Troubleshooting

Metrics Not Appearing

  • Ensure router is processing requests
  • Check metrics export: router.export_metrics()?
  • Verify Prometheus scrape configuration

Traces Not in Jaeger

  • Confirm Jaeger is running: docker ps
  • Check endpoint: jaeger_agent_endpoint: Some("localhost:6831")
  • Verify sampling ratio > 0
  • Call tracing_system.shutdown() to flush

High Memory Usage

  • Reduce sampling ratio
  • Decrease histogram buckets
  • Lower log level to INFO or WARN

Reference