Files
wifi-densepose/npm/packages/burst-scaling/RUNBOOK.md
ruv d803bfe2b1 Squashed 'vendor/ruvector/' content from commit b64c2172
git-subtree-dir: vendor/ruvector
git-subtree-split: b64c21726f2bb37286d9ee36a7869fef60cc6900
2026-02-28 14:39:40 -05:00

595 lines
14 KiB
Markdown

# Ruvector Burst Scaling - Operations Runbook
## Overview
This runbook provides operational procedures for managing the Ruvector adaptive burst scaling system. This system handles traffic spikes from 500M to 25B concurrent streams while maintaining <50ms p99 latency.
## Table of Contents
1. [Architecture Overview](#architecture-overview)
2. [Normal Operations](#normal-operations)
3. [Burst Event Procedures](#burst-event-procedures)
4. [Emergency Procedures](#emergency-procedures)
5. [Monitoring & Alerts](#monitoring--alerts)
6. [Cost Management](#cost-management)
7. [Troubleshooting](#troubleshooting)
8. [Runbook Contacts](#runbook-contacts)
---
## Architecture Overview
### Components
- **Burst Predictor**: Predicts upcoming traffic spikes using event calendars and ML
- **Reactive Scaler**: Real-time auto-scaling based on metrics
- **Capacity Manager**: Global orchestration with budget controls
- **Cloud Run**: Containerized application with auto-scaling (10-1000 instances per region)
- **Global Load Balancer**: Distributes traffic across regions
- **Cloud SQL**: Database with read replicas
- **Redis**: Caching layer
### Regions
- Primary: us-central1
- Secondary: europe-west1, asia-east1
- On-demand: Additional regions can be activated
---
## Normal Operations
### Daily Checks (Automated)
✅ Verify all regions are healthy
✅ Check p99 latency < 50ms
✅ Confirm instance counts within expected range
✅ Review cost vs budget (should be ~$240k/day baseline)
✅ Check for upcoming predicted bursts
### Weekly Review
1. **Review Prediction Accuracy**
```bash
npm run predictor
```
Target: >85% accuracy
2. **Analyze Cost Trends**
- Review Cloud Console billing dashboard
- Compare actual vs predicted costs
- Adjust budget thresholds if needed
3. **Update Event Calendar**
- Add known upcoming events (sports, releases)
- Review historical patterns
- Train ML models with recent data
### Monthly Tasks
- Review and update scaling thresholds
- Audit degradation strategies
- Conduct burst simulation testing
- Update on-call documentation
- Review SLA compliance (p99 < 50ms)
---
## Burst Event Procedures
### Pre-Event (15 minutes before)
**Automatic**: Burst Predictor triggers pre-warming
**Manual Verification**:
1. Check Cloud Console for pre-warming status
2. Verify instances scaling up in predicted regions
3. Monitor cost dashboard for expected increases
4. Alert team via Slack #burst-events
### During Event
**Monitor (every 5 minutes)**:
- Dashboard: https://console.cloud.google.com/monitoring/dashboards/custom/ruvector-burst
- Key metrics:
- Connection count (should handle 10-50x)
- P99 latency (must stay < 50ms)
- Error rate (must stay < 1%)
- Instance count per region
**Scaling Actions** (if needed):
```bash
# Check current capacity
gcloud run services describe ruvector-us-central1 --region=us-central1
# Manual scale-out (emergency only)
gcloud run services update ruvector-us-central1 \
--region=us-central1 \
--max-instances=1500
# Check reactive scaler status
npm run scaler
# Check capacity manager
npm run manager
```
### Post-Event (within 1 hour)
1. **Verify Scale-In**
- Instances should gradually reduce to normal levels
- Should take 5-10 minutes after traffic normalizes
2. **Review Performance**
- Export metrics to CSV
- Calculate actual vs predicted load
- Document any issues
3. **Update Patterns**
```bash
# Train model with new data
npm run predictor -- --train --event-id="world-cup-2026"
```
4. **Cost Analysis**
- Compare actual cost vs budget
- Document any overages
- Update cost projections
---
## Emergency Procedures
### Scenario 1: Latency Spike (p99 > 100ms)
**Severity**: HIGH
**Response Time**: 2 minutes
**Actions**:
1. **Immediate**:
```bash
# Force scale-out across all regions
for region in us-central1 europe-west1 asia-east1; do
gcloud run services update ruvector-$region \
--region=$region \
--min-instances=100 \
--max-instances=1500
done
```
2. **Investigate**:
- Check Cloud SQL connections (should be < 5000)
- Verify Redis hit rate (should be > 90%)
- Review application logs for slow queries
3. **Escalate** if latency doesn't improve in 5 minutes
### Scenario 2: Budget Exceeded (>120% hourly limit)
**Severity**: MEDIUM
**Response Time**: 5 minutes
**Actions**:
1. **Check if legitimate burst**:
```bash
npm run manager
# Review degradation level
```
2. **If unexpected**:
- Enable minor degradation:
```bash
# Shed free-tier traffic
gcloud run services update-traffic ruvector-us-central1 \
--to-tags=premium=100
```
3. **If critical (>150% budget)**:
- Enable major degradation
- Contact finance team
- Consider enabling hard budget limit
### Scenario 3: Region Failure
**Severity**: CRITICAL
**Response Time**: Immediate
**Actions**:
1. **Automatic**: Load balancer should route around failed region
2. **Manual Verification**:
```bash
# Check backend health
gcloud compute backend-services get-health ruvector-backend \
--global
```
3. **If capacity issues**:
```bash
# Scale up remaining regions
gcloud run services update ruvector-europe-west1 \
--region=europe-west1 \
--max-instances=2000
```
4. **Activate backup region**:
```bash
# Deploy to us-east1
cd terraform
terraform apply -var="regions=[\"us-central1\",\"europe-west1\",\"asia-east1\",\"us-east1\"]"
```
### Scenario 4: Database Connection Exhaustion
**Severity**: HIGH
**Response Time**: 3 minutes
**Actions**:
1. **Immediate**:
```bash
# Scale up Cloud SQL
gcloud sql instances patch ruvector-db-us-central1 \
--cpu=32 \
--memory=128GB
# Increase max connections
gcloud sql instances patch ruvector-db-us-central1 \
--database-flags=max_connections=10000
```
2. **Temporary**:
- Increase Redis cache TTL
- Enable read-only mode for non-critical endpoints
- Route read queries to replicas
3. **Long-term**:
- Add more read replicas
- Implement connection pooling
- Review query optimization
### Scenario 5: Cascading Failures
**Severity**: CRITICAL
**Response Time**: Immediate
**Actions**:
1. **Enable Circuit Breakers**:
- Automatic via load balancer configuration
- Unhealthy backends ejected after 5 consecutive errors
2. **Graceful Degradation**:
```bash
# Enable critical degradation mode
npm run manager -- --degrade=critical
```
- Premium tier only
- Disable non-essential features
- Enable maintenance page for free tier
3. **Emergency Scale-Down**:
```bash
# If cascading continues, scale down to known-good state
gcloud run services update ruvector-us-central1 \
--region=us-central1 \
--min-instances=50 \
--max-instances=50
```
4. **Incident Response**:
- Page on-call SRE
- Open war room
- Activate disaster recovery plan
---
## Monitoring & Alerts
### Cloud Monitoring Dashboard
**URL**: https://console.cloud.google.com/monitoring/dashboards/custom/ruvector-burst
**Key Metrics**:
- Total connections (all regions)
- Connections by region
- P50/P95/P99 latency
- Instance count
- CPU/Memory utilization
- Error rate
- Hourly cost
- Burst event timeline
### Alert Policies
| Alert | Threshold | Severity | Response Time |
|-------|-----------|----------|---------------|
| High P99 Latency | >50ms for 2min | HIGH | 5 min |
| Critical Latency | >100ms for 1min | CRITICAL | 2 min |
| High Error Rate | >1% for 5min | HIGH | 5 min |
| Budget Warning | >80% hourly | MEDIUM | 15 min |
| Budget Critical | >100% hourly | HIGH | 5 min |
| Region Down | 0 healthy backends | CRITICAL | Immediate |
| CPU Critical | >90% for 5min | HIGH | 5 min |
| Memory Critical | >90% for 3min | CRITICAL | 2 min |
### Notification Channels
- **Email**: ops@ruvector.io
- **PagerDuty**: Critical alerts only
- **Slack**: #alerts-burst-scaling
- **Phone**: On-call rotation (critical only)
### Log Queries
**High Latency Requests**:
```sql
resource.type="cloud_run_revision"
jsonPayload.latency > 0.1
severity >= WARNING
```
**Scaling Events**:
```sql
resource.type="cloud_run_revision"
jsonPayload.message =~ "SCALING|SCALED"
```
**Cost Events**:
```sql
jsonPayload.message =~ "BUDGET"
```
---
## Cost Management
### Budget Structure
- **Hourly**: $10,000 (~200-400 instances)
- **Daily**: $200,000 (baseline + moderate bursts)
- **Monthly**: $5,000,000 (includes major events)
### Cost Thresholds
| Level | Action | Impact |
|-------|--------|--------|
| 50% | Info log | None |
| 80% | Warning alert | None |
| 90% | Critical alert | None |
| 100% | Minor degradation | Free tier limited |
| 120% | Major degradation | Free tier shed |
| 150% | Critical degradation | Premium only |
### Cost Optimization
**Automatic**:
- Gradual scale-in after bursts
- Preemptible instances for batch jobs
- Aggressive CDN caching
- Connection pooling
**Manual**:
```bash
# Review cost by region
gcloud billing accounts list
gcloud billing projects describe ruvector-prod
# Analyze top cost drivers
gcloud alpha billing budgets list --billing-account=YOUR_ACCOUNT
# Optimize specific region
terraform apply -var="us-central1-max-instances=800"
```
### Cost Forecasting
```bash
# Generate cost forecast
npm run manager -- --forecast=7days
# Expected costs:
# - Normal week: $1.4M
# - Major event week: $2.5M
# - World Cup week: $4.8M
```
---
## Troubleshooting
### Issue: Auto-scaling not responding
**Symptoms**: Load increasing but instances not scaling
**Diagnosis**:
```bash
# Check Cloud Run auto-scaling config
gcloud run services describe ruvector-us-central1 \
--region=us-central1 \
--format="value(spec.template.spec.scaling)"
# Check for quota limits
gcloud compute project-info describe --project=ruvector-prod \
| grep -A5 CPUS
```
**Resolution**:
- Verify max-instances not reached
- Check quota limits
- Review IAM permissions for service account
- Restart capacity manager
### Issue: Predictions inaccurate
**Symptoms**: Actual load differs significantly from predicted
**Diagnosis**:
```bash
npm run predictor -- --check-accuracy
```
**Resolution**:
- Update event calendar with actual times
- Retrain models with recent data
- Adjust multiplier for event types
- Review regional distribution assumptions
### Issue: Database connection pool exhausted
**Symptoms**: Connection errors, high latency
**Diagnosis**:
```bash
# Check active connections
gcloud sql operations list --instance=ruvector-db-us-central1
# Check Cloud SQL metrics
gcloud monitoring time-series list \
--filter='metric.type="cloudsql.googleapis.com/database/postgresql/num_backends"'
```
**Resolution**:
- Scale up Cloud SQL instance
- Increase max_connections
- Add read replicas
- Review connection pooling settings
### Issue: Redis cache misses
**Symptoms**: High database load, increased latency
**Diagnosis**:
```bash
# Check Redis stats
gcloud redis instances describe ruvector-redis-us-central1 \
--region=us-central1
# Check hit rate
gcloud monitoring time-series list \
--filter='metric.type="redis.googleapis.com/stats/cache_hit_ratio"'
```
**Resolution**:
- Increase Redis memory
- Review cache TTL settings
- Implement cache warming for predicted bursts
- Review cache key patterns
---
## Runbook Contacts
### On-Call Rotation
**Primary On-Call**: Check PagerDuty
**Secondary On-Call**: Check PagerDuty
**Escalation**: VP Engineering
### Team Contacts
| Role | Contact | Phone |
|------|---------|-------|
| SRE Lead | sre-lead@ruvector.io | +1-XXX-XXX-XXXX |
| DevOps | devops@ruvector.io | +1-XXX-XXX-XXXX |
| Engineering Manager | eng-mgr@ruvector.io | +1-XXX-XXX-XXXX |
| VP Engineering | vp-eng@ruvector.io | +1-XXX-XXX-XXXX |
### External Contacts
| Service | Contact | SLA |
|---------|---------|-----|
| GCP Support | Premium Support | 15 min |
| PagerDuty | support@pagerduty.com | 1 hour |
| Network Provider | NOC hotline | 30 min |
### War Room
**Zoom**: https://zoom.us/j/ruvector-war-room
**Slack**: #incident-response
**Docs**: https://docs.ruvector.io/incidents
---
## Appendix
### Quick Reference Commands
```bash
# Check system status
npm run manager
# View current metrics
gcloud monitoring dashboards list
# Force scale-out
gcloud run services update ruvector-REGION --max-instances=1500
# Enable degradation
npm run manager -- --degrade=minor
# Check predictions
npm run predictor
# View logs
gcloud logging read "resource.type=cloud_run_revision" --limit=50
# Check costs
gcloud billing accounts list
```
### Terraform Quick Reference
```bash
# Initialize
cd terraform && terraform init
# Plan changes
terraform plan -var-file="prod.tfvars"
# Apply changes
terraform apply -var-file="prod.tfvars"
# Emergency scale-out
terraform apply -var="max_instances=2000"
# Add region
terraform apply -var='regions=["us-central1","europe-west1","asia-east1","us-east1"]'
```
### Health Check URLs
- **Application**: https://api.ruvector.io/health
- **Database**: https://api.ruvector.io/health/db
- **Redis**: https://api.ruvector.io/health/redis
- **Load Balancer**: Check Cloud Console
### Disaster Recovery
**RTO (Recovery Time Objective)**: 15 minutes
**RPO (Recovery Point Objective)**: 5 minutes
**Backup Locations**:
- Database: Point-in-time recovery (7 days)
- Configuration: Git repository
- Terraform state: GCS bucket (versioned)
**Recovery Procedure**:
1. Restore from latest backup
2. Deploy infrastructure via Terraform
3. Validate health checks
4. Update DNS if needed
5. Resume traffic
---
## Revision History
| Version | Date | Author | Changes |
|---------|------|--------|---------|
| 1.0 | 2025-01-20 | DevOps Team | Initial version |
---
**Last Updated**: 2025-01-20
**Next Review**: 2025-02-20
**Owner**: SRE Team