wifi-densepose/vendor/ruvector/npm/packages/burst-scaling/RUNBOOK.md

Ruvector Burst Scaling - Operations Runbook

Overview

This runbook provides operational procedures for managing the Ruvector adaptive burst scaling system, which handles traffic spikes from 500M to 25B concurrent streams while maintaining <50ms p99 latency.

Table of Contents

  1. Architecture Overview
  2. Normal Operations
  3. Burst Event Procedures
  4. Emergency Procedures
  5. Monitoring & Alerts
  6. Cost Management
  7. Troubleshooting
  8. Runbook Contacts

Architecture Overview

Components

  • Burst Predictor: Predicts upcoming traffic spikes using event calendars and ML
  • Reactive Scaler: Real-time auto-scaling based on metrics
  • Capacity Manager: Global orchestration with budget controls
  • Cloud Run: Containerized application with auto-scaling (10-1000 instances per region)
  • Global Load Balancer: Distributes traffic across regions
  • Cloud SQL: Database with read replicas
  • Redis: Caching layer
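To illustrate the Reactive Scaler's role, here is a minimal TypeScript sketch of a metrics-driven scaling decision bounded by the Cloud Run limits above (10-1000 instances per region). The thresholds, type names, and scale factors are illustrative assumptions, not the actual implementation.

```typescript
// Hypothetical reactive-scaling decision; thresholds are illustrative.
interface RegionMetrics {
  cpuUtilization: number;   // 0..1 average across instances
  p99LatencyMs: number;
  currentInstances: number;
}

const MIN_INSTANCES = 10;   // Cloud Run per-region floor
const MAX_INSTANCES = 1000; // Cloud Run per-region ceiling

function targetInstances(m: RegionMetrics): number {
  let target = m.currentInstances;
  if (m.cpuUtilization > 0.7 || m.p99LatencyMs > 50) {
    // Scale out aggressively when CPU or the latency SLO is at risk.
    target = Math.ceil(m.currentInstances * 1.5);
  } else if (m.cpuUtilization < 0.3 && m.p99LatencyMs < 25) {
    // Scale in gradually to avoid oscillation.
    target = Math.floor(m.currentInstances * 0.9);
  }
  return Math.min(MAX_INSTANCES, Math.max(MIN_INSTANCES, target));
}
```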

Regions

  • Primary: us-central1
  • Secondary: europe-west1, asia-east1
  • On-demand: Additional regions can be activated

Normal Operations

Daily Checks (Automated)

  • Verify all regions are healthy
  • Check p99 latency < 50ms
  • Confirm instance counts within expected range
  • Review cost vs budget (should be ~$240k/day baseline)
  • Check for upcoming predicted bursts

Weekly Review

  1. Review Prediction Accuracy

    npm run predictor -- --check-accuracy
    

    Target: >85% accuracy

  2. Analyze Cost Trends

    • Review Cloud Console billing dashboard
    • Compare actual vs predicted costs
    • Adjust budget thresholds if needed
  3. Update Event Calendar

    • Add known upcoming events (sports, releases)
    • Review historical patterns
    • Train ML models with recent data

Monthly Tasks

  • Review and update scaling thresholds
  • Audit degradation strategies
  • Conduct burst simulation testing
  • Update on-call documentation
  • Review SLA compliance (p99 < 50ms)

Burst Event Procedures

Pre-Event (15 minutes before)

Automatic: Burst Predictor triggers pre-warming

Manual Verification:

  1. Check Cloud Console for pre-warming status
  2. Verify instances scaling up in predicted regions
  3. Monitor cost dashboard for expected increases
  4. Alert team via Slack #burst-events
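Pre-warming capacity can be sized ahead of the predicted peak; a sketch of the arithmetic, assuming hypothetical per-instance connection capacity and headroom values (not production figures):

```typescript
// Illustrative pre-warm sizing; both constants are assumptions.
const CONNECTIONS_PER_INSTANCE = 25_000_000; // assumed per-instance capacity
const HEADROOM = 1.2;                        // 20% safety margin

function prewarmInstances(predictedPeakConnections: number): number {
  return Math.ceil(
    (predictedPeakConnections * HEADROOM) / CONNECTIONS_PER_INSTANCE
  );
}
```

For example, a predicted 500M-connection baseline would pre-warm 24 instances under these assumed constants.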

During Event

Monitor every 5 minutes (key metrics on the burst dashboard):

  • P99 latency (target < 50ms)
  • Error rate
  • Instance counts vs expected capacity
  • Hourly cost vs budget

Scaling Actions (if needed):

# Check current capacity
gcloud run services describe ruvector-us-central1 --region=us-central1

# Manual scale-out (emergency only)
gcloud run services update ruvector-us-central1 \
  --region=us-central1 \
  --max-instances=1500

# Check reactive scaler status
npm run scaler

# Check capacity manager
npm run manager

Post-Event (within 1 hour)

  1. Verify Scale-In

    • Instances should gradually reduce to normal levels
    • Should take 5-10 minutes after traffic normalizes
  2. Review Performance

    • Export metrics to CSV
    • Calculate actual vs predicted load
    • Document any issues
  3. Update Patterns

    # Train model with new data
    npm run predictor -- --train --event-id="world-cup-2026"
    
  4. Cost Analysis

    • Compare actual cost vs budget
    • Document any overages
    • Update cost projections
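The gradual scale-in expected in step 1 can be sketched as stepped reductions toward baseline rather than a single drop. The step fraction below is an illustrative assumption:

```typescript
// Sketch of post-burst scale-in: step down toward the baseline instance
// count over several intervals to avoid thrashing. stepFraction is assumed.
function scaleInSteps(
  current: number,
  baseline: number,
  stepFraction = 0.25
): number[] {
  const steps: number[] = [];
  let n = current;
  while (n > baseline) {
    // Each step removes a fixed fraction of the original excess.
    n = Math.max(baseline, Math.floor(n - (current - baseline) * stepFraction));
    steps.push(n);
  }
  return steps;
}
```

With a 5-10 minute target window, each step would correspond to roughly one or two minutes of wall-clock time.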

Emergency Procedures

Scenario 1: Latency Spike (p99 > 100ms)

Severity: HIGH
Response Time: 2 minutes

Actions:

  1. Immediate:

    # Force scale-out across all regions
    for region in us-central1 europe-west1 asia-east1; do
      gcloud run services update ruvector-$region \
        --region=$region \
        --min-instances=100 \
        --max-instances=1500
    done
    
  2. Investigate:

    • Check Cloud SQL connections (should be < 5000)
    • Verify Redis hit rate (should be > 90%)
    • Review application logs for slow queries
  3. Escalate if latency doesn't improve in 5 minutes

Scenario 2: Budget Exceeded (>120% hourly limit)

Severity: MEDIUM
Response Time: 5 minutes

Actions:

  1. Check if legitimate burst:

    npm run manager
    # Review degradation level
    
  2. If unexpected:

    • Enable minor degradation:
      # Shed free-tier traffic
      gcloud run services update-traffic ruvector-us-central1 \
        --to-tags=premium=100
      
  3. If critical (>150% budget):

    • Enable major degradation
    • Contact finance team
    • Consider enabling hard budget limit

Scenario 3: Region Failure

Severity: CRITICAL
Response Time: Immediate

Actions:

  1. Automatic: Load balancer should route around failed region

  2. Manual Verification:

    # Check backend health
    gcloud compute backend-services get-health ruvector-backend \
      --global
    
  3. If capacity issues:

    # Scale up remaining regions
    gcloud run services update ruvector-europe-west1 \
      --region=europe-west1 \
      --max-instances=2000
    
  4. Activate backup region:

    # Deploy to us-east1
    cd terraform
    terraform apply -var="regions=[\"us-central1\",\"europe-west1\",\"asia-east1\",\"us-east1\"]"
    

Scenario 4: Database Connection Exhaustion

Severity: HIGH
Response Time: 3 minutes

Actions:

  1. Immediate:

    # Scale up Cloud SQL
    gcloud sql instances patch ruvector-db-us-central1 \
      --cpu=32 \
      --memory=128GB
    
    # Increase max connections
    gcloud sql instances patch ruvector-db-us-central1 \
      --database-flags=max_connections=10000
    
  2. Temporary:

    • Increase Redis cache TTL
    • Enable read-only mode for non-critical endpoints
    • Route read queries to replicas
  3. Long-term:

    • Add more read replicas
    • Implement connection pooling
    • Review query optimization

Scenario 5: Cascading Failures

Severity: CRITICAL
Response Time: Immediate

Actions:

  1. Enable Circuit Breakers:

    • Automatic via load balancer configuration
    • Unhealthy backends ejected after 5 consecutive errors
  2. Graceful Degradation:

    # Enable critical degradation mode
    npm run manager -- --degrade=critical
    
    • Premium tier only
    • Disable non-essential features
    • Enable maintenance page for free tier
  3. Emergency Scale-Down:

    # If cascading continues, scale down to known-good state
    gcloud run services update ruvector-us-central1 \
      --region=us-central1 \
      --min-instances=50 \
      --max-instances=50
    
  4. Incident Response:

    • Page on-call SRE
    • Open war room
    • Activate disaster recovery plan
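The consecutive-error ejection rule from step 1 (eject after 5 consecutive errors) can be sketched as a minimal per-backend breaker. Re-admission and outlier analysis are omitted; class and member names are hypothetical:

```typescript
// Minimal sketch of consecutive-error ejection: a backend is ejected
// after 5 consecutive errors; any success resets the counter.
class BackendBreaker {
  private consecutiveErrors = 0;
  private static readonly THRESHOLD = 5;
  ejected = false;

  record(success: boolean): void {
    if (success) {
      this.consecutiveErrors = 0;
    } else if (++this.consecutiveErrors >= BackendBreaker.THRESHOLD) {
      this.ejected = true;
    }
  }
}
```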

Monitoring & Alerts

Cloud Monitoring Dashboard

URL: https://console.cloud.google.com/monitoring/dashboards/custom/ruvector-burst

Key Metrics:

  • Total connections (all regions)
  • Connections by region
  • P50/P95/P99 latency
  • Instance count
  • CPU/Memory utilization
  • Error rate
  • Hourly cost
  • Burst event timeline

Alert Policies

| Alert | Threshold | Severity | Response Time |
| --- | --- | --- | --- |
| High P99 Latency | >50ms for 2min | HIGH | 5 min |
| Critical Latency | >100ms for 1min | CRITICAL | 2 min |
| High Error Rate | >1% for 5min | HIGH | 5 min |
| Budget Warning | >80% hourly | MEDIUM | 15 min |
| Budget Critical | >100% hourly | HIGH | 5 min |
| Region Down | 0 healthy backends | CRITICAL | Immediate |
| CPU Critical | >90% for 5min | HIGH | 5 min |
| Memory Critical | >90% for 3min | CRITICAL | 2 min |
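The alert policies above can be sketched as threshold rules evaluated against a metrics snapshot. The "for N min" duration windows are omitted for brevity; the rule structure and type names are illustrative, not the actual alerting configuration:

```typescript
// Sketch of the alert-policy table as threshold rules (durations omitted).
interface Metrics {
  p99LatencyMs: number;
  errorRate: number;        // fraction, e.g. 0.02 = 2%
  hourlyBudgetUsed: number; // fraction of hourly budget
  healthyBackends: number;
}

interface AlertRule {
  name: string;
  severity: "MEDIUM" | "HIGH" | "CRITICAL";
  breached: (m: Metrics) => boolean;
}

const rules: AlertRule[] = [
  { name: "High P99 Latency", severity: "HIGH",     breached: m => m.p99LatencyMs > 50 },
  { name: "Critical Latency", severity: "CRITICAL", breached: m => m.p99LatencyMs > 100 },
  { name: "High Error Rate",  severity: "HIGH",     breached: m => m.errorRate > 0.01 },
  { name: "Budget Warning",   severity: "MEDIUM",   breached: m => m.hourlyBudgetUsed > 0.8 },
  { name: "Budget Critical",  severity: "HIGH",     breached: m => m.hourlyBudgetUsed > 1.0 },
  { name: "Region Down",      severity: "CRITICAL", breached: m => m.healthyBackends === 0 },
];

function firingAlerts(m: Metrics): string[] {
  return rules.filter(r => r.breached(m)).map(r => r.name);
}
```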

Notification Channels

  • Email: ops@ruvector.io
  • PagerDuty: Critical alerts only
  • Slack: #alerts-burst-scaling
  • Phone: On-call rotation (critical only)

Log Queries

High Latency Requests:

resource.type="cloud_run_revision"
jsonPayload.latency > 0.1
severity >= WARNING

Scaling Events:

resource.type="cloud_run_revision"
jsonPayload.message =~ "SCALING|SCALED"

Cost Events:

jsonPayload.message =~ "BUDGET"

Cost Management

Budget Structure

  • Hourly: $10,000 (~200-400 instances)
  • Daily: $200,000 (baseline + moderate bursts)
  • Monthly: $5,000,000 (includes major events)

Cost Thresholds

| Level | Action | Impact |
| --- | --- | --- |
| 50% | Info log | None |
| 80% | Warning alert | None |
| 90% | Critical alert | None |
| 100% | Minor degradation | Free tier limited |
| 120% | Major degradation | Free tier shed |
| 150% | Critical degradation | Premium only |
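The degradation tiers in the table above reduce to a simple mapping from budget utilization to degradation level; a sketch with thresholds taken directly from the table (function name is hypothetical):

```typescript
// Map hourly budget utilization (1.0 = 100%) to a degradation level,
// per the cost-threshold table. Below 100%, only alerts fire.
type Degradation = "none" | "minor" | "major" | "critical";

function degradationLevel(budgetUsed: number): Degradation {
  if (budgetUsed >= 1.5) return "critical"; // premium only
  if (budgetUsed >= 1.2) return "major";    // free tier shed
  if (budgetUsed >= 1.0) return "minor";    // free tier limited
  return "none";
}
```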

Cost Optimization

Automatic:

  • Gradual scale-in after bursts
  • Preemptible instances for batch jobs
  • Aggressive CDN caching
  • Connection pooling

Manual:

# Review cost by region
gcloud billing accounts list
gcloud billing projects describe ruvector-prod

# Analyze top cost drivers
gcloud alpha billing budgets list --billing-account=YOUR_ACCOUNT

# Optimize specific region
terraform apply -var="us-central1-max-instances=800"

Cost Forecasting

# Generate cost forecast
npm run manager -- --forecast=7days

# Expected costs:
# - Normal week: $1.4M
# - Major event week: $2.5M
# - World Cup week: $4.8M
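A naive version of the forecast arithmetic: baseline daily cost scaled by a per-day event multiplier. The baseline constant matches the daily budget above; the multiplier model is an illustrative assumption, not the actual forecasting logic:

```typescript
// Illustrative weekly cost forecast: baseline daily cost times a
// per-day event multiplier (1.0 = normal day). Model is assumed.
const BASELINE_DAILY_COST = 200_000; // USD, daily budget baseline

function forecastWeek(eventMultipliers: number[]): number {
  return eventMultipliers.reduce(
    (sum, m) => sum + BASELINE_DAILY_COST * m, 0
  );
}
```

A flat week of 1.0 multipliers reproduces the "Normal week: $1.4M" figure quoted above.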

Troubleshooting

Issue: Auto-scaling not responding

Symptoms: Load increasing but instances not scaling

Diagnosis:

# Check Cloud Run auto-scaling config
gcloud run services describe ruvector-us-central1 \
  --region=us-central1 \
  --format="value(spec.template.spec.scaling)"

# Check for quota limits
gcloud compute project-info describe --project=ruvector-prod \
  | grep -A5 CPUS

Resolution:

  • Verify max-instances not reached
  • Check quota limits
  • Review IAM permissions for service account
  • Restart capacity manager

Issue: Predictions inaccurate

Symptoms: Actual load differs significantly from predicted

Diagnosis:

npm run predictor -- --check-accuracy

Resolution:

  • Update event calendar with actual times
  • Retrain models with recent data
  • Adjust multiplier for event types
  • Review regional distribution assumptions
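One plausible way to compute the accuracy figure behind the >85% target is mean absolute percentage error between predicted and actual load; this formula is an illustrative assumption, not necessarily what `--check-accuracy` implements:

```typescript
// Accuracy as 1 - MAPE over paired predicted/actual load series.
// Illustrative metric; the predictor's real formula may differ.
function predictionAccuracy(predicted: number[], actual: number[]): number {
  if (predicted.length !== actual.length || actual.length === 0) {
    throw new Error("series must be the same non-zero length");
  }
  const mape = predicted.reduce(
    (sum, p, i) => sum + Math.abs(p - actual[i]) / actual[i], 0
  ) / actual.length;
  return Math.max(0, 1 - mape); // e.g. 0.90 = 90% accurate
}
```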

Issue: Database connection pool exhausted

Symptoms: Connection errors, high latency

Diagnosis:

# Check active connections
gcloud sql operations list --instance=ruvector-db-us-central1

# Check Cloud SQL metrics
gcloud monitoring time-series list \
  --filter='metric.type="cloudsql.googleapis.com/database/postgresql/num_backends"'

Resolution:

  • Scale up Cloud SQL instance
  • Increase max_connections
  • Add read replicas
  • Review connection pooling settings

Issue: Redis cache misses

Symptoms: High database load, increased latency

Diagnosis:

# Check Redis stats
gcloud redis instances describe ruvector-redis-us-central1 \
  --region=us-central1

# Check hit rate
gcloud monitoring time-series list \
  --filter='metric.type="redis.googleapis.com/stats/cache_hit_ratio"'

Resolution:

  • Increase Redis memory
  • Review cache TTL settings
  • Implement cache warming for predicted bursts
  • Review cache key patterns

Runbook Contacts

On-Call Rotation

Primary On-Call: Check PagerDuty
Secondary On-Call: Check PagerDuty
Escalation: VP Engineering

Team Contacts

| Role | Contact | Phone |
| --- | --- | --- |
| SRE Lead | sre-lead@ruvector.io | +1-XXX-XXX-XXXX |
| DevOps | devops@ruvector.io | +1-XXX-XXX-XXXX |
| Engineering Manager | eng-mgr@ruvector.io | +1-XXX-XXX-XXXX |
| VP Engineering | vp-eng@ruvector.io | +1-XXX-XXX-XXXX |

External Contacts

| Service | Contact | SLA |
| --- | --- | --- |
| GCP Support | Premium Support | 15 min |
| PagerDuty | support@pagerduty.com | 1 hour |
| Network Provider | NOC hotline | 30 min |

War Room

Zoom: https://zoom.us/j/ruvector-war-room
Slack: #incident-response
Docs: https://docs.ruvector.io/incidents


Appendix

Quick Reference Commands

# Check system status
npm run manager

# View current metrics
gcloud monitoring dashboards list

# Force scale-out
gcloud run services update ruvector-REGION --max-instances=1500

# Enable degradation
npm run manager -- --degrade=minor

# Check predictions
npm run predictor

# View logs
gcloud logging read "resource.type=cloud_run_revision" --limit=50

# Check costs
gcloud billing accounts list

Terraform Quick Reference

# Initialize
cd terraform && terraform init

# Plan changes
terraform plan -var-file="prod.tfvars"

# Apply changes
terraform apply -var-file="prod.tfvars"

# Emergency scale-out
terraform apply -var="max_instances=2000"

# Add region
terraform apply -var='regions=["us-central1","europe-west1","asia-east1","us-east1"]'

Health Check URLs

Disaster Recovery

RTO (Recovery Time Objective): 15 minutes
RPO (Recovery Point Objective): 5 minutes

Backup Locations:

  • Database: Point-in-time recovery (7 days)
  • Configuration: Git repository
  • Terraform state: GCS bucket (versioned)

Recovery Procedure:

  1. Restore from latest backup
  2. Deploy infrastructure via Terraform
  3. Validate health checks
  4. Update DNS if needed
  5. Resume traffic

Revision History

| Version | Date | Author | Changes |
| --- | --- | --- | --- |
| 1.0 | 2025-01-20 | DevOps Team | Initial version |

Last Updated: 2025-01-20
Next Review: 2025-02-20
Owner: SRE Team