wifi-densepose/vendor/ruvector/npm/packages/burst-scaling/RUNBOOK.md

Ruvector Burst Scaling - Operations Runbook

Overview

This runbook provides operational procedures for managing the Ruvector adaptive burst scaling system, which handles traffic spikes from 500M to 25B concurrent streams while maintaining <50ms p99 latency.

Table of Contents

  1. Architecture Overview
  2. Normal Operations
  3. Burst Event Procedures
  4. Emergency Procedures
  5. Monitoring & Alerts
  6. Cost Management
  7. Troubleshooting
  8. Runbook Contacts

Architecture Overview

Components

  • Burst Predictor: Predicts upcoming traffic spikes using event calendars and ML
  • Reactive Scaler: Real-time auto-scaling based on metrics
  • Capacity Manager: Global orchestration with budget controls
  • Cloud Run: Containerized application with auto-scaling (10-1000 instances per region)
  • Global Load Balancer: Distributes traffic across regions
  • Cloud SQL: Database with read replicas
  • Redis: Caching layer
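To illustrate the Reactive Scaler's role, here is a minimal TypeScript sketch of a metrics-driven scaling decision bounded by the Cloud Run limits above (10-1000 instances per region). The thresholds, type names, and scale factors are illustrative assumptions, not the actual implementation.

```typescript
// Hypothetical reactive-scaling decision; thresholds are illustrative.
interface RegionMetrics {
  cpuUtilization: number;   // 0..1 average across instances
  p99LatencyMs: number;
  currentInstances: number;
}

const MIN_INSTANCES = 10;   // Cloud Run per-region floor
const MAX_INSTANCES = 1000; // Cloud Run per-region ceiling

function targetInstances(m: RegionMetrics): number {
  let target = m.currentInstances;
  if (m.cpuUtilization > 0.7 || m.p99LatencyMs > 50) {
    // Scale out aggressively when CPU or the latency SLO is at risk.
    target = Math.ceil(m.currentInstances * 1.5);
  } else if (m.cpuUtilization < 0.3 && m.p99LatencyMs < 25) {
    // Scale in gradually to avoid oscillation.
    target = Math.floor(m.currentInstances * 0.9);
  }
  return Math.min(MAX_INSTANCES, Math.max(MIN_INSTANCES, target));
}
```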

Regions

  • Primary: us-central1
  • Secondary: europe-west1, asia-east1
  • On-demand: Additional regions can be activated

Normal Operations

Daily Checks (Automated)

  • Verify all regions are healthy
  • Check p99 latency < 50ms
  • Confirm instance counts within expected range
  • Review cost vs budget (should be ~$240k/day baseline)
  • Check for upcoming predicted bursts

Weekly Review

  1. Review Prediction Accuracy

    npm run predictor -- --check-accuracy
    

    Target: >85% accuracy

  2. Analyze Cost Trends

    • Review Cloud Console billing dashboard
    • Compare actual vs predicted costs
    • Adjust budget thresholds if needed
  3. Update Event Calendar

    • Add known upcoming events (sports, releases)
    • Review historical patterns
    • Train ML models with recent data

Monthly Tasks

  • Review and update scaling thresholds
  • Audit degradation strategies
  • Conduct burst simulation testing
  • Update on-call documentation
  • Review SLA compliance (p99 < 50ms)

Burst Event Procedures

Pre-Event (15 minutes before)

Automatic: Burst Predictor triggers pre-warming

Manual Verification:

  1. Check Cloud Console for pre-warming status
  2. Verify instances scaling up in predicted regions
  3. Monitor cost dashboard for expected increases
  4. Alert team via Slack #burst-events
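Pre-warming capacity can be sized ahead of the predicted peak; a sketch of the arithmetic, assuming hypothetical per-instance connection capacity and headroom values (not production figures):

```typescript
// Illustrative pre-warm sizing; both constants are assumptions.
const CONNECTIONS_PER_INSTANCE = 25_000_000; // assumed per-instance capacity
const HEADROOM = 1.2;                        // 20% safety margin

function prewarmInstances(predictedPeakConnections: number): number {
  return Math.ceil(
    (predictedPeakConnections * HEADROOM) / CONNECTIONS_PER_INSTANCE
  );
}
```

For example, a predicted 500M-connection baseline would pre-warm 24 instances under these assumed constants.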

During Event

Monitor every 5 minutes (key metrics on the burst dashboard):

  • P99 latency (target < 50ms)
  • Error rate
  • Instance counts vs expected capacity
  • Hourly cost vs budget

Scaling Actions (if needed):

# Check current capacity
gcloud run services describe ruvector-us-central1 --region=us-central1

# Manual scale-out (emergency only)
gcloud run services update ruvector-us-central1 \
  --region=us-central1 \
  --max-instances=1500

# Check reactive scaler status
npm run scaler

# Check capacity manager
npm run manager

Post-Event (within 1 hour)

  1. Verify Scale-In

    • Instances should gradually reduce to normal levels
    • Should take 5-10 minutes after traffic normalizes
  2. Review Performance

    • Export metrics to CSV
    • Calculate actual vs predicted load
    • Document any issues
  3. Update Patterns

    # Train model with new data
    npm run predictor -- --train --event-id="world-cup-2026"
    
  4. Cost Analysis

    • Compare actual cost vs budget
    • Document any overages
    • Update cost projections
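The gradual scale-in expected in step 1 can be sketched as stepped reductions toward baseline rather than a single drop. The step fraction below is an illustrative assumption:

```typescript
// Sketch of post-burst scale-in: step down toward the baseline instance
// count over several intervals to avoid thrashing. stepFraction is assumed.
function scaleInSteps(
  current: number,
  baseline: number,
  stepFraction = 0.25
): number[] {
  const steps: number[] = [];
  let n = current;
  while (n > baseline) {
    // Each step removes a fixed fraction of the original excess.
    n = Math.max(baseline, Math.floor(n - (current - baseline) * stepFraction));
    steps.push(n);
  }
  return steps;
}
```

With a 5-10 minute target window, each step would correspond to roughly one or two minutes of wall-clock time.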

Emergency Procedures

Scenario 1: Latency Spike (p99 > 100ms)

Severity: HIGH
Response Time: 2 minutes

Actions:

  1. Immediate:

    # Force scale-out across all regions
    for region in us-central1 europe-west1 asia-east1; do
      gcloud run services update ruvector-$region \
        --region=$region \
        --min-instances=100 \
        --max-instances=1500
    done
    
  2. Investigate:

    • Check Cloud SQL connections (should be < 5000)
    • Verify Redis hit rate (should be > 90%)
    • Review application logs for slow queries
  3. Escalate if latency doesn't improve in 5 minutes

Scenario 2: Budget Exceeded (>120% hourly limit)

Severity: MEDIUM
Response Time: 5 minutes

Actions:

  1. Check if legitimate burst:

    npm run manager
    # Review degradation level
    
  2. If unexpected:

    • Enable minor degradation:
      # Shed free-tier traffic
      gcloud run services update-traffic ruvector-us-central1 \
        --to-tags=premium=100
      
  3. If critical (>150% budget):

    • Enable major degradation
    • Contact finance team
    • Consider enabling hard budget limit

Scenario 3: Region Failure

Severity: CRITICAL
Response Time: Immediate

Actions:

  1. Automatic: Load balancer should route around failed region

  2. Manual Verification:

    # Check backend health
    gcloud compute backend-services get-health ruvector-backend \
      --global
    
  3. If capacity issues:

    # Scale up remaining regions
    gcloud run services update ruvector-europe-west1 \
      --region=europe-west1 \
      --max-instances=2000
    
  4. Activate backup region:

    # Deploy to us-east1
    cd terraform
    terraform apply -var="regions=[\"us-central1\",\"europe-west1\",\"asia-east1\",\"us-east1\"]"
    

Scenario 4: Database Connection Exhaustion

Severity: HIGH
Response Time: 3 minutes

Actions:

  1. Immediate:

    # Scale up Cloud SQL
    gcloud sql instances patch ruvector-db-us-central1 \
      --cpu=32 \
      --memory=128GB
    
    # Increase max connections
    gcloud sql instances patch ruvector-db-us-central1 \
      --database-flags=max_connections=10000
    
  2. Temporary:

    • Increase Redis cache TTL
    • Enable read-only mode for non-critical endpoints
    • Route read queries to replicas
  3. Long-term:

    • Add more read replicas
    • Implement connection pooling
    • Review query optimization

Scenario 5: Cascading Failures

Severity: CRITICAL
Response Time: Immediate

Actions:

  1. Enable Circuit Breakers:

    • Automatic via load balancer configuration
    • Unhealthy backends ejected after 5 consecutive errors
  2. Graceful Degradation:

    # Enable critical degradation mode
    npm run manager -- --degrade=critical
    
    • Premium tier only
    • Disable non-essential features
    • Enable maintenance page for free tier
  3. Emergency Scale-Down:

    # If cascading continues, scale down to known-good state
    gcloud run services update ruvector-us-central1 \
      --region=us-central1 \
      --min-instances=50 \
      --max-instances=50
    
  4. Incident Response:

    • Page on-call SRE
    • Open war room
    • Activate disaster recovery plan
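The consecutive-error ejection rule from step 1 (eject after 5 consecutive errors) can be sketched as a minimal per-backend breaker. Re-admission and outlier analysis are omitted; class and member names are hypothetical:

```typescript
// Minimal sketch of consecutive-error ejection: a backend is ejected
// after 5 consecutive errors; any success resets the counter.
class BackendBreaker {
  private consecutiveErrors = 0;
  private static readonly THRESHOLD = 5;
  ejected = false;

  record(success: boolean): void {
    if (success) {
      this.consecutiveErrors = 0;
    } else if (++this.consecutiveErrors >= BackendBreaker.THRESHOLD) {
      this.ejected = true;
    }
  }
}
```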

Monitoring & Alerts

Cloud Monitoring Dashboard

URL: https://console.cloud.google.com/monitoring/dashboards/custom/ruvector-burst

Key Metrics:

  • Total connections (all regions)
  • Connections by region
  • P50/P95/P99 latency
  • Instance count
  • CPU/Memory utilization
  • Error rate
  • Hourly cost
  • Burst event timeline

Alert Policies

| Alert | Threshold | Severity | Response Time |
| --- | --- | --- | --- |
| High P99 Latency | >50ms for 2min | HIGH | 5 min |
| Critical Latency | >100ms for 1min | CRITICAL | 2 min |
| High Error Rate | >1% for 5min | HIGH | 5 min |
| Budget Warning | >80% hourly | MEDIUM | 15 min |
| Budget Critical | >100% hourly | HIGH | 5 min |
| Region Down | 0 healthy backends | CRITICAL | Immediate |
| CPU Critical | >90% for 5min | HIGH | 5 min |
| Memory Critical | >90% for 3min | CRITICAL | 2 min |
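The alert policies above can be sketched as threshold rules evaluated against a metrics snapshot. The "for N min" duration windows are omitted for brevity; the rule structure and type names are illustrative, not the actual alerting configuration:

```typescript
// Sketch of the alert-policy table as threshold rules (durations omitted).
interface Metrics {
  p99LatencyMs: number;
  errorRate: number;        // fraction, e.g. 0.02 = 2%
  hourlyBudgetUsed: number; // fraction of hourly budget
  healthyBackends: number;
}

interface AlertRule {
  name: string;
  severity: "MEDIUM" | "HIGH" | "CRITICAL";
  breached: (m: Metrics) => boolean;
}

const rules: AlertRule[] = [
  { name: "High P99 Latency", severity: "HIGH",     breached: m => m.p99LatencyMs > 50 },
  { name: "Critical Latency", severity: "CRITICAL", breached: m => m.p99LatencyMs > 100 },
  { name: "High Error Rate",  severity: "HIGH",     breached: m => m.errorRate > 0.01 },
  { name: "Budget Warning",   severity: "MEDIUM",   breached: m => m.hourlyBudgetUsed > 0.8 },
  { name: "Budget Critical",  severity: "HIGH",     breached: m => m.hourlyBudgetUsed > 1.0 },
  { name: "Region Down",      severity: "CRITICAL", breached: m => m.healthyBackends === 0 },
];

function firingAlerts(m: Metrics): string[] {
  return rules.filter(r => r.breached(m)).map(r => r.name);
}
```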

Notification Channels

  • Email: ops@ruvector.io
  • PagerDuty: Critical alerts only
  • Slack: #alerts-burst-scaling
  • Phone: On-call rotation (critical only)

Log Queries

High Latency Requests:

resource.type="cloud_run_revision"
jsonPayload.latency > 0.1
severity >= WARNING

Scaling Events:

resource.type="cloud_run_revision"
jsonPayload.message =~ "SCALING|SCALED"

Cost Events:

jsonPayload.message =~ "BUDGET"

Cost Management

Budget Structure

  • Hourly: $10,000 (~200-400 instances)
  • Daily: $200,000 (baseline + moderate bursts)
  • Monthly: $5,000,000 (includes major events)

Cost Thresholds

| Level | Action | Impact |
| --- | --- | --- |
| 50% | Info log | None |
| 80% | Warning alert | None |
| 90% | Critical alert | None |
| 100% | Minor degradation | Free tier limited |
| 120% | Major degradation | Free tier shed |
| 150% | Critical degradation | Premium only |
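The degradation tiers in the table above reduce to a simple mapping from budget utilization to degradation level; a sketch with thresholds taken directly from the table (function name is hypothetical):

```typescript
// Map hourly budget utilization (1.0 = 100%) to a degradation level,
// per the cost-threshold table. Below 100%, only alerts fire.
type Degradation = "none" | "minor" | "major" | "critical";

function degradationLevel(budgetUsed: number): Degradation {
  if (budgetUsed >= 1.5) return "critical"; // premium only
  if (budgetUsed >= 1.2) return "major";    // free tier shed
  if (budgetUsed >= 1.0) return "minor";    // free tier limited
  return "none";
}
```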

Cost Optimization

Automatic:

  • Gradual scale-in after bursts
  • Preemptible instances for batch jobs
  • Aggressive CDN caching
  • Connection pooling

Manual:

# Review cost by region
gcloud billing accounts list
gcloud billing projects describe ruvector-prod

# Analyze top cost drivers
gcloud alpha billing budgets list --billing-account=YOUR_ACCOUNT

# Optimize specific region
terraform apply -var="us-central1-max-instances=800"

Cost Forecasting

# Generate cost forecast
npm run manager -- --forecast=7days

# Expected costs:
# - Normal week: $1.4M
# - Major event week: $2.5M
# - World Cup week: $4.8M
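A naive version of the forecast arithmetic: baseline daily cost scaled by a per-day event multiplier. The baseline constant matches the daily budget above; the multiplier model is an illustrative assumption, not the actual forecasting logic:

```typescript
// Illustrative weekly cost forecast: baseline daily cost times a
// per-day event multiplier (1.0 = normal day). Model is assumed.
const BASELINE_DAILY_COST = 200_000; // USD, daily budget baseline

function forecastWeek(eventMultipliers: number[]): number {
  return eventMultipliers.reduce(
    (sum, m) => sum + BASELINE_DAILY_COST * m, 0
  );
}
```

A flat week of 1.0 multipliers reproduces the "Normal week: $1.4M" figure quoted above.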

Troubleshooting

Issue: Auto-scaling not responding

Symptoms: Load increasing but instances not scaling

Diagnosis:

# Check Cloud Run auto-scaling config
gcloud run services describe ruvector-us-central1 \
  --region=us-central1 \
  --format="value(spec.template.spec.scaling)"

# Check for quota limits
gcloud compute project-info describe --project=ruvector-prod \
  | grep -A5 CPUS

Resolution:

  • Verify max-instances not reached
  • Check quota limits
  • Review IAM permissions for service account
  • Restart capacity manager

Issue: Predictions inaccurate

Symptoms: Actual load differs significantly from predicted

Diagnosis:

npm run predictor -- --check-accuracy

Resolution:

  • Update event calendar with actual times
  • Retrain models with recent data
  • Adjust multiplier for event types
  • Review regional distribution assumptions
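One plausible way to compute the accuracy figure behind the >85% target is mean absolute percentage error between predicted and actual load; this formula is an illustrative assumption, not necessarily what `--check-accuracy` implements:

```typescript
// Accuracy as 1 - MAPE over paired predicted/actual load series.
// Illustrative metric; the predictor's real formula may differ.
function predictionAccuracy(predicted: number[], actual: number[]): number {
  if (predicted.length !== actual.length || actual.length === 0) {
    throw new Error("series must be the same non-zero length");
  }
  const mape = predicted.reduce(
    (sum, p, i) => sum + Math.abs(p - actual[i]) / actual[i], 0
  ) / actual.length;
  return Math.max(0, 1 - mape); // e.g. 0.90 = 90% accurate
}
```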

Issue: Database connection pool exhausted

Symptoms: Connection errors, high latency

Diagnosis:

# Check active connections
gcloud sql operations list --instance=ruvector-db-us-central1

# Check Cloud SQL metrics
gcloud monitoring time-series list \
  --filter='metric.type="cloudsql.googleapis.com/database/postgresql/num_backends"'

Resolution:

  • Scale up Cloud SQL instance
  • Increase max_connections
  • Add read replicas
  • Review connection pooling settings

Issue: Redis cache misses

Symptoms: High database load, increased latency

Diagnosis:

# Check Redis stats
gcloud redis instances describe ruvector-redis-us-central1 \
  --region=us-central1

# Check hit rate
gcloud monitoring time-series list \
  --filter='metric.type="redis.googleapis.com/stats/cache_hit_ratio"'

Resolution:

  • Increase Redis memory
  • Review cache TTL settings
  • Implement cache warming for predicted bursts
  • Review cache key patterns

Runbook Contacts

On-Call Rotation

Primary On-Call: Check PagerDuty
Secondary On-Call: Check PagerDuty
Escalation: VP Engineering

Team Contacts

| Role | Contact | Phone |
| --- | --- | --- |
| SRE Lead | sre-lead@ruvector.io | +1-XXX-XXX-XXXX |
| DevOps | devops@ruvector.io | +1-XXX-XXX-XXXX |
| Engineering Manager | eng-mgr@ruvector.io | +1-XXX-XXX-XXXX |
| VP Engineering | vp-eng@ruvector.io | +1-XXX-XXX-XXXX |

External Contacts

| Service | Contact | SLA |
| --- | --- | --- |
| GCP Support | Premium Support | 15 min |
| PagerDuty | support@pagerduty.com | 1 hour |
| Network Provider | NOC hotline | 30 min |

War Room

Zoom: https://zoom.us/j/ruvector-war-room
Slack: #incident-response
Docs: https://docs.ruvector.io/incidents


Appendix

Quick Reference Commands

# Check system status
npm run manager

# View current metrics
gcloud monitoring dashboards list

# Force scale-out
gcloud run services update ruvector-REGION --max-instances=1500

# Enable degradation
npm run manager -- --degrade=minor

# Check predictions
npm run predictor

# View logs
gcloud logging read "resource.type=cloud_run_revision" --limit=50

# Check costs
gcloud billing accounts list

Terraform Quick Reference

# Initialize
cd terraform && terraform init

# Plan changes
terraform plan -var-file="prod.tfvars"

# Apply changes
terraform apply -var-file="prod.tfvars"

# Emergency scale-out
terraform apply -var="max_instances=2000"

# Add region
terraform apply -var='regions=["us-central1","europe-west1","asia-east1","us-east1"]'

Health Check URLs

Disaster Recovery

RTO (Recovery Time Objective): 15 minutes
RPO (Recovery Point Objective): 5 minutes

Backup Locations:

  • Database: Point-in-time recovery (7 days)
  • Configuration: Git repository
  • Terraform state: GCS bucket (versioned)

Recovery Procedure:

  1. Restore from latest backup
  2. Deploy infrastructure via Terraform
  3. Validate health checks
  4. Update DNS if needed
  5. Resume traffic

Revision History

| Version | Date | Author | Changes |
| --- | --- | --- | --- |
| 1.0 | 2025-01-20 | DevOps Team | Initial version |

Last Updated: 2025-01-20
Next Review: 2025-02-20
Owner: SRE Team