wifi-densepose/npm/packages/burst-scaling/RUNBOOK.md

# Ruvector Burst Scaling - Operations Runbook

## Overview

This runbook provides operational procedures for managing the Ruvector adaptive burst scaling system. This system handles traffic spikes from 500M to 25B concurrent streams while maintaining <50ms p99 latency.

## Table of Contents

1. [Architecture Overview](#architecture-overview)
2. [Normal Operations](#normal-operations)
3. [Burst Event Procedures](#burst-event-procedures)
4. [Emergency Procedures](#emergency-procedures)
5. [Monitoring & Alerts](#monitoring--alerts)
6. [Cost Management](#cost-management)
7. [Troubleshooting](#troubleshooting)
8. [Runbook Contacts](#runbook-contacts)

---

## Architecture Overview

### Components

- **Burst Predictor**: Predicts upcoming traffic spikes using event calendars and ML
- **Reactive Scaler**: Real-time auto-scaling based on metrics
- **Capacity Manager**: Global orchestration with budget controls
- **Cloud Run**: Containerized application with auto-scaling (10-1000 instances per region)
- **Global Load Balancer**: Distributes traffic across regions
- **Cloud SQL**: Database with read replicas
- **Redis**: Caching layer

### Regions

- Primary: us-central1
- Secondary: europe-west1, asia-east1
- On-demand: Additional regions can be activated

---

## Normal Operations

### Daily Checks (Automated)

✅ Verify all regions are healthy
✅ Check p99 latency < 50ms
✅ Confirm instance counts within expected range
✅ Review cost vs budget (should be ~$240k/day baseline)
✅ Check for upcoming predicted bursts

### Weekly Review

1. **Review Prediction Accuracy**
   ```bash
   npm run predictor
   ```
   Target: >85% accuracy

2. **Analyze Cost Trends**
   - Review Cloud Console billing dashboard
   - Compare actual vs predicted costs
   - Adjust budget thresholds if needed

3. **Update Event Calendar**
   - Add known upcoming events (sports, releases)
   - Review historical patterns
   - Train ML models with recent data

### Monthly Tasks

- Review and update scaling thresholds
- Audit degradation strategies
- Conduct burst simulation testing
- Update on-call documentation
- Review SLA compliance (p99 < 50ms)

---

## Burst Event Procedures

### Pre-Event (15 minutes before)

**Automatic**: Burst Predictor triggers pre-warming

**Manual Verification**:
1. Check Cloud Console for pre-warming status
2. Verify instances scaling up in predicted regions
3. Monitor cost dashboard for expected increases
4. Alert team via Slack #burst-events

### During Event

**Monitor (every 5 minutes)**:
- Dashboard: https://console.cloud.google.com/monitoring/dashboards/custom/ruvector-burst
- Key metrics:
  - Connection count (should handle 10-50x)
  - P99 latency (must stay < 50ms)
  - Error rate (must stay < 1%)
  - Instance count per region

**Scaling Actions** (if needed):
```bash
# Check current capacity
gcloud run services describe ruvector-us-central1 --region=us-central1

# Manual scale-out (emergency only)
gcloud run services update ruvector-us-central1 \
  --region=us-central1 \
  --max-instances=1500

# Check reactive scaler status
npm run scaler

# Check capacity manager
npm run manager
```

### Post-Event (within 1 hour)

1. **Verify Scale-In**
   - Instances should gradually reduce to normal levels
   - Should take 5-10 minutes after traffic normalizes

2. **Review Performance**
   - Export metrics to CSV
   - Calculate actual vs predicted load
   - Document any issues

3. **Update Patterns**
   ```bash
   # Train model with new data
   npm run predictor -- --train --event-id="world-cup-2026"
   ```

4. **Cost Analysis**
   - Compare actual cost vs budget
   - Document any overages
   - Update cost projections

---

## Emergency Procedures

### Scenario 1: Latency Spike (p99 > 100ms)

**Severity**: HIGH
**Response Time**: 2 minutes

**Actions**:
1. **Immediate**:
   ```bash
   # Force scale-out across all regions
   for region in us-central1 europe-west1 asia-east1; do
     gcloud run services update ruvector-$region \
       --region=$region \
       --min-instances=100 \
       --max-instances=1500
   done
   ```

2. **Investigate**:
   - Check Cloud SQL connections (should be < 5000)
   - Verify Redis hit rate (should be > 90%)
   - Review application logs for slow queries

3. **Escalate** if latency doesn't improve in 5 minutes

### Scenario 2: Budget Exceeded (>120% hourly limit)

**Severity**: MEDIUM
**Response Time**: 5 minutes

**Actions**:
1. **Check if legitimate burst**:
   ```bash
   npm run manager
   # Review degradation level
   ```

2. **If unexpected**:
   - Enable minor degradation:
     ```bash
     # Shed free-tier traffic
     gcloud run services update-traffic ruvector-us-central1 \
       --to-tags=premium=100
     ```

3. **If critical (>150% budget)**:
   - Enable major degradation
   - Contact finance team
   - Consider enabling hard budget limit

### Scenario 3: Region Failure

**Severity**: CRITICAL
**Response Time**: Immediate

**Actions**:
1. **Automatic**: Load balancer should route around failed region

2. **Manual Verification**:
   ```bash
   # Check backend health
   gcloud compute backend-services get-health ruvector-backend \
     --global
   ```

3. **If capacity issues**:
   ```bash
   # Scale up remaining regions
   gcloud run services update ruvector-europe-west1 \
     --region=europe-west1 \
     --max-instances=2000
   ```

4. **Activate backup region**:
   ```bash
   # Deploy to us-east1
   cd terraform
   terraform apply -var="regions=[\"us-central1\",\"europe-west1\",\"asia-east1\",\"us-east1\"]"
   ```

### Scenario 4: Database Connection Exhaustion

**Severity**: HIGH
**Response Time**: 3 minutes

**Actions**:
1. **Immediate**:
   ```bash
   # Scale up Cloud SQL
   gcloud sql instances patch ruvector-db-us-central1 \
     --cpu=32 \
     --memory=128GB

   # Increase max connections
   gcloud sql instances patch ruvector-db-us-central1 \
     --database-flags=max_connections=10000
   ```

2. **Temporary**:
   - Increase Redis cache TTL
   - Enable read-only mode for non-critical endpoints
   - Route read queries to replicas

3. **Long-term**:
   - Add more read replicas
   - Implement connection pooling
   - Review query optimization

### Scenario 5: Cascading Failures

**Severity**: CRITICAL
**Response Time**: Immediate

**Actions**:
1. **Enable Circuit Breakers**:
   - Automatic via load balancer configuration
   - Unhealthy backends ejected after 5 consecutive errors

2. **Graceful Degradation**:
   ```bash
   # Enable critical degradation mode
   npm run manager -- --degrade=critical
   ```
   - Premium tier only
   - Disable non-essential features
   - Enable maintenance page for free tier

3. **Emergency Scale-Down**:
   ```bash
   # If cascading continues, scale down to known-good state
   gcloud run services update ruvector-us-central1 \
     --region=us-central1 \
     --min-instances=50 \
     --max-instances=50
   ```

4. **Incident Response**:
   - Page on-call SRE
   - Open war room
   - Activate disaster recovery plan

---

## Monitoring & Alerts

### Cloud Monitoring Dashboard

**URL**: https://console.cloud.google.com/monitoring/dashboards/custom/ruvector-burst

**Key Metrics**:
- Total connections (all regions)
- Connections by region
- P50/P95/P99 latency
- Instance count
- CPU/Memory utilization
- Error rate
- Hourly cost
- Burst event timeline

### Alert Policies

| Alert | Threshold | Severity | Response Time |
|-------|-----------|----------|---------------|
| High P99 Latency | >50ms for 2min | HIGH | 5 min |
| Critical Latency | >100ms for 1min | CRITICAL | 2 min |
| High Error Rate | >1% for 5min | HIGH | 5 min |
| Budget Warning | >80% hourly | MEDIUM | 15 min |
| Budget Critical | >100% hourly | HIGH | 5 min |
| Region Down | 0 healthy backends | CRITICAL | Immediate |
| CPU Critical | >90% for 5min | HIGH | 5 min |
| Memory Critical | >90% for 3min | CRITICAL | 2 min |

### Notification Channels

- **Email**: ops@ruvector.io
- **PagerDuty**: Critical alerts only
- **Slack**: #alerts-burst-scaling
- **Phone**: On-call rotation (critical only)

### Log Queries

**High Latency Requests**:
```sql
resource.type="cloud_run_revision"
jsonPayload.latency > 0.1
severity >= WARNING
```

**Scaling Events**:
```sql
resource.type="cloud_run_revision"
jsonPayload.message =~ "SCALING|SCALED"
```

**Cost Events**:
```sql
jsonPayload.message =~ "BUDGET"
```

---

## Cost Management

### Budget Structure

- **Hourly**: $10,000 (~200-400 instances)
- **Daily**: $200,000 (baseline + moderate bursts)
- **Monthly**: $5,000,000 (includes major events)

### Cost Thresholds

| Level | Action | Impact |
|-------|--------|--------|
| 50% | Info log | None |
| 80% | Warning alert | None |
| 90% | Critical alert | None |
| 100% | Minor degradation | Free tier limited |
| 120% | Major degradation | Free tier shed |
| 150% | Critical degradation | Premium only |

### Cost Optimization

**Automatic**:
- Gradual scale-in after bursts
- Preemptible instances for batch jobs
- Aggressive CDN caching
- Connection pooling

**Manual**:
```bash
# Review cost by region
gcloud billing accounts list
gcloud billing projects describe ruvector-prod

# Analyze top cost drivers
gcloud alpha billing budgets list --billing-account=YOUR_ACCOUNT

# Optimize specific region
terraform apply -var="us-central1-max-instances=800"
```

### Cost Forecasting

```bash
# Generate cost forecast
npm run manager -- --forecast=7days

# Expected costs:
# - Normal week: $1.4M
# - Major event week: $2.5M
# - World Cup week: $4.8M
```

---

## Troubleshooting

### Issue: Auto-scaling not responding

**Symptoms**: Load increasing but instances not scaling

**Diagnosis**:
```bash
# Check Cloud Run auto-scaling config
gcloud run services describe ruvector-us-central1 \
  --region=us-central1 \
  --format="value(spec.template.spec.scaling)"

# Check for quota limits
gcloud compute project-info describe --project=ruvector-prod \
  | grep -A5 CPUS
```

**Resolution**:
- Verify max-instances not reached
- Check quota limits
- Review IAM permissions for service account
- Restart capacity manager

### Issue: Predictions inaccurate

**Symptoms**: Actual load differs significantly from predicted

**Diagnosis**:
```bash
npm run predictor -- --check-accuracy
```

**Resolution**:
- Update event calendar with actual times
- Retrain models with recent data
- Adjust multiplier for event types
- Review regional distribution assumptions

### Issue: Database connection pool exhausted

**Symptoms**: Connection errors, high latency

**Diagnosis**:
```bash
# Check active connections
gcloud sql operations list --instance=ruvector-db-us-central1

# Check Cloud SQL metrics
gcloud monitoring time-series list \
  --filter='metric.type="cloudsql.googleapis.com/database/postgresql/num_backends"'
```

**Resolution**:
- Scale up Cloud SQL instance
- Increase max_connections
- Add read replicas
- Review connection pooling settings

### Issue: Redis cache misses

**Symptoms**: High database load, increased latency

**Diagnosis**:
```bash
# Check Redis stats
gcloud redis instances describe ruvector-redis-us-central1 \
  --region=us-central1

# Check hit rate
gcloud monitoring time-series list \
  --filter='metric.type="redis.googleapis.com/stats/cache_hit_ratio"'
```

**Resolution**:
- Increase Redis memory
- Review cache TTL settings
- Implement cache warming for predicted bursts
- Review cache key patterns

---

## Runbook Contacts

### On-Call Rotation

**Primary On-Call**: Check PagerDuty
**Secondary On-Call**: Check PagerDuty
**Escalation**: VP Engineering

### Team Contacts

| Role | Contact | Phone |
|------|---------|-------|
| SRE Lead | sre-lead@ruvector.io | +1-XXX-XXX-XXXX |
| DevOps | devops@ruvector.io | +1-XXX-XXX-XXXX |
| Engineering Manager | eng-mgr@ruvector.io | +1-XXX-XXX-XXXX |
| VP Engineering | vp-eng@ruvector.io | +1-XXX-XXX-XXXX |

### External Contacts

| Service | Contact | SLA |
|---------|---------|-----|
| GCP Support | Premium Support | 15 min |
| PagerDuty | support@pagerduty.com | 1 hour |
| Network Provider | NOC hotline | 30 min |

### War Room

**Zoom**: https://zoom.us/j/ruvector-war-room
**Slack**: #incident-response
**Docs**: https://docs.ruvector.io/incidents

---

## Appendix

### Quick Reference Commands

```bash
# Check system status
npm run manager

# View current metrics
gcloud monitoring dashboards list

# Force scale-out
gcloud run services update ruvector-REGION --max-instances=1500

# Enable degradation
npm run manager -- --degrade=minor

# Check predictions
npm run predictor

# View logs
gcloud logging read "resource.type=cloud_run_revision" --limit=50

# Check costs
gcloud billing accounts list
```

### Terraform Quick Reference

```bash
# Initialize
cd terraform && terraform init

# Plan changes
terraform plan -var-file="prod.tfvars"

# Apply changes
terraform apply -var-file="prod.tfvars"

# Emergency scale-out
terraform apply -var="max_instances=2000"

# Add region
terraform apply -var='regions=["us-central1","europe-west1","asia-east1","us-east1"]'
```

### Health Check URLs

- **Application**: https://api.ruvector.io/health
- **Database**: https://api.ruvector.io/health/db
- **Redis**: https://api.ruvector.io/health/redis
- **Load Balancer**: Check Cloud Console

### Disaster Recovery

**RTO (Recovery Time Objective)**: 15 minutes
**RPO (Recovery Point Objective)**: 5 minutes

**Backup Locations**:
- Database: Point-in-time recovery (7 days)
- Configuration: Git repository
- Terraform state: GCS bucket (versioned)

**Recovery Procedure**:
1. Restore from latest backup
2. Deploy infrastructure via Terraform
3. Validate health checks
4. Update DNS if needed
5. Resume traffic

---

## Revision History

| Version | Date | Author | Changes |
|---------|------|--------|---------|
| 1.0 | 2025-01-20 | DevOps Team | Initial version |

---

**Last Updated**: 2025-01-20
**Next Review**: 2025-02-20
**Owner**: SRE Team