git-subtree-dir: vendor/ruvector git-subtree-split: b64c21726f2bb37286d9ee36a7869fef60cc6900
595 lines
14 KiB
Markdown
595 lines
14 KiB
Markdown
# Ruvector Burst Scaling - Operations Runbook
|
|
|
|
## Overview
|
|
|
|
This runbook provides operational procedures for managing the Ruvector adaptive burst scaling system. This system handles traffic spikes from 500M to 25B concurrent streams while maintaining <50ms p99 latency.
|
|
|
|
## Table of Contents
|
|
|
|
1. [Architecture Overview](#architecture-overview)
|
|
2. [Normal Operations](#normal-operations)
|
|
3. [Burst Event Procedures](#burst-event-procedures)
|
|
4. [Emergency Procedures](#emergency-procedures)
|
|
5. [Monitoring & Alerts](#monitoring--alerts)
|
|
6. [Cost Management](#cost-management)
|
|
7. [Troubleshooting](#troubleshooting)
|
|
8. [Runbook Contacts](#runbook-contacts)
|
|
|
|
---
|
|
|
|
## Architecture Overview
|
|
|
|
### Components
|
|
|
|
- **Burst Predictor**: Predicts upcoming traffic spikes using event calendars and ML
|
|
- **Reactive Scaler**: Real-time auto-scaling based on metrics
|
|
- **Capacity Manager**: Global orchestration with budget controls
|
|
- **Cloud Run**: Containerized application with auto-scaling (10-1000 instances per region)
|
|
- **Global Load Balancer**: Distributes traffic across regions
|
|
- **Cloud SQL**: Database with read replicas
|
|
- **Redis**: Caching layer
|
|
|
|
### Regions
|
|
|
|
- Primary: us-central1
|
|
- Secondary: europe-west1, asia-east1
|
|
- On-demand: Additional regions can be activated
|
|
|
|
---
|
|
|
|
## Normal Operations
|
|
|
|
### Daily Checks (Automated)
|
|
|
|
✅ Verify all regions are healthy
|
|
✅ Check p99 latency < 50ms
|
|
✅ Confirm instance counts within expected range
|
|
✅ Review cost vs budget (should be ~$240k/day baseline)
|
|
✅ Check for upcoming predicted bursts
|
|
|
|
### Weekly Review
|
|
|
|
1. **Review Prediction Accuracy**
|
|
```bash
|
|
npm run predictor
|
|
```
|
|
Target: >85% accuracy
|
|
|
|
2. **Analyze Cost Trends**
|
|
- Review Cloud Console billing dashboard
|
|
- Compare actual vs predicted costs
|
|
- Adjust budget thresholds if needed
|
|
|
|
3. **Update Event Calendar**
|
|
- Add known upcoming events (sports, releases)
|
|
- Review historical patterns
|
|
- Train ML models with recent data
|
|
|
|
### Monthly Tasks
|
|
|
|
- Review and update scaling thresholds
|
|
- Audit degradation strategies
|
|
- Conduct burst simulation testing
|
|
- Update on-call documentation
|
|
- Review SLA compliance (p99 < 50ms)
|
|
|
|
---
|
|
|
|
## Burst Event Procedures
|
|
|
|
### Pre-Event (15 minutes before)
|
|
|
|
**Automatic**: Burst Predictor triggers pre-warming
|
|
|
|
**Manual Verification**:
|
|
1. Check Cloud Console for pre-warming status
|
|
2. Verify instances scaling up in predicted regions
|
|
3. Monitor cost dashboard for expected increases
|
|
4. Alert team via Slack #burst-events
|
|
|
|
### During Event
|
|
|
|
**Monitor (every 5 minutes)**:
|
|
- Dashboard: https://console.cloud.google.com/monitoring/dashboards/custom/ruvector-burst
|
|
- Key metrics:
|
|
- Connection count (should handle 10-50x)
|
|
- P99 latency (must stay < 50ms)
|
|
- Error rate (must stay < 1%)
|
|
- Instance count per region
|
|
|
|
**Scaling Actions** (if needed):
|
|
```bash
|
|
# Check current capacity
|
|
gcloud run services describe ruvector-us-central1 --region=us-central1
|
|
|
|
# Manual scale-out (emergency only)
|
|
gcloud run services update ruvector-us-central1 \
|
|
--region=us-central1 \
|
|
--max-instances=1500
|
|
|
|
# Check reactive scaler status
|
|
npm run scaler
|
|
|
|
# Check capacity manager
|
|
npm run manager
|
|
```
|
|
|
|
### Post-Event (within 1 hour)
|
|
|
|
1. **Verify Scale-In**
|
|
- Instances should gradually reduce to normal levels
|
|
- Should take 5-10 minutes after traffic normalizes
|
|
|
|
2. **Review Performance**
|
|
- Export metrics to CSV
|
|
- Calculate actual vs predicted load
|
|
- Document any issues
|
|
|
|
3. **Update Patterns**
|
|
```bash
|
|
# Train model with new data
|
|
npm run predictor -- --train --event-id="world-cup-2026"
|
|
```
|
|
|
|
4. **Cost Analysis**
|
|
- Compare actual cost vs budget
|
|
- Document any overages
|
|
- Update cost projections
|
|
|
|
---
|
|
|
|
## Emergency Procedures
|
|
|
|
### Scenario 1: Latency Spike (p99 > 100ms)
|
|
|
|
**Severity**: HIGH
|
|
**Response Time**: 2 minutes
|
|
|
|
**Actions**:
|
|
1. **Immediate**:
|
|
```bash
|
|
# Force scale-out across all regions
|
|
for region in us-central1 europe-west1 asia-east1; do
|
|
gcloud run services update ruvector-$region \
|
|
--region=$region \
|
|
--min-instances=100 \
|
|
--max-instances=1500
|
|
done
|
|
```
|
|
|
|
2. **Investigate**:
|
|
- Check Cloud SQL connections (should be < 5000)
|
|
- Verify Redis hit rate (should be > 90%)
|
|
- Review application logs for slow queries
|
|
|
|
3. **Escalate** if latency doesn't improve in 5 minutes
|
|
|
|
### Scenario 2: Budget Exceeded (>120% hourly limit)
|
|
|
|
**Severity**: MEDIUM
|
|
**Response Time**: 5 minutes
|
|
|
|
**Actions**:
|
|
1. **Check if legitimate burst**:
|
|
```bash
|
|
npm run manager
|
|
# Review degradation level
|
|
```
|
|
|
|
2. **If unexpected**:
|
|
- Enable minor degradation:
|
|
```bash
|
|
# Shed free-tier traffic
|
|
gcloud run services update-traffic ruvector-us-central1 \
|
|
--to-tags=premium=100
|
|
```
|
|
|
|
3. **If critical (>150% budget)**:
|
|
- Enable major degradation
|
|
- Contact finance team
|
|
- Consider enabling hard budget limit
|
|
|
|
### Scenario 3: Region Failure
|
|
|
|
**Severity**: CRITICAL
|
|
**Response Time**: Immediate
|
|
|
|
**Actions**:
|
|
1. **Automatic**: Load balancer should route around failed region
|
|
|
|
2. **Manual Verification**:
|
|
```bash
|
|
# Check backend health
|
|
gcloud compute backend-services get-health ruvector-backend \
|
|
--global
|
|
```
|
|
|
|
3. **If capacity issues**:
|
|
```bash
|
|
# Scale up remaining regions
|
|
gcloud run services update ruvector-europe-west1 \
|
|
--region=europe-west1 \
|
|
--max-instances=2000
|
|
```
|
|
|
|
4. **Activate backup region**:
|
|
```bash
|
|
# Deploy to us-east1
|
|
cd terraform
|
|
terraform apply -var="regions=[\"us-central1\",\"europe-west1\",\"asia-east1\",\"us-east1\"]"
|
|
```
|
|
|
|
### Scenario 4: Database Connection Exhaustion
|
|
|
|
**Severity**: HIGH
|
|
**Response Time**: 3 minutes
|
|
|
|
**Actions**:
|
|
1. **Immediate**:
|
|
```bash
|
|
# Scale up Cloud SQL
|
|
gcloud sql instances patch ruvector-db-us-central1 \
|
|
--cpu=32 \
|
|
--memory=128GB
|
|
|
|
# Increase max connections
|
|
gcloud sql instances patch ruvector-db-us-central1 \
|
|
--database-flags=max_connections=10000
|
|
```
|
|
|
|
2. **Temporary**:
|
|
- Increase Redis cache TTL
|
|
- Enable read-only mode for non-critical endpoints
|
|
- Route read queries to replicas
|
|
|
|
3. **Long-term**:
|
|
- Add more read replicas
|
|
- Implement connection pooling
|
|
- Review query optimization
|
|
|
|
### Scenario 5: Cascading Failures
|
|
|
|
**Severity**: CRITICAL
|
|
**Response Time**: Immediate
|
|
|
|
**Actions**:
|
|
1. **Enable Circuit Breakers**:
|
|
- Automatic via load balancer configuration
|
|
- Unhealthy backends ejected after 5 consecutive errors
|
|
|
|
2. **Graceful Degradation**:
|
|
```bash
|
|
# Enable critical degradation mode
|
|
npm run manager -- --degrade=critical
|
|
```
|
|
- Premium tier only
|
|
- Disable non-essential features
|
|
- Enable maintenance page for free tier
|
|
|
|
3. **Emergency Scale-Down**:
|
|
```bash
|
|
# If cascading continues, scale down to known-good state
|
|
gcloud run services update ruvector-us-central1 \
|
|
--region=us-central1 \
|
|
--min-instances=50 \
|
|
--max-instances=50
|
|
```
|
|
|
|
4. **Incident Response**:
|
|
- Page on-call SRE
|
|
- Open war room
|
|
- Activate disaster recovery plan
|
|
|
|
---
|
|
|
|
## Monitoring & Alerts
|
|
|
|
### Cloud Monitoring Dashboard
|
|
|
|
**URL**: https://console.cloud.google.com/monitoring/dashboards/custom/ruvector-burst
|
|
|
|
**Key Metrics**:
|
|
- Total connections (all regions)
|
|
- Connections by region
|
|
- P50/P95/P99 latency
|
|
- Instance count
|
|
- CPU/Memory utilization
|
|
- Error rate
|
|
- Hourly cost
|
|
- Burst event timeline
|
|
|
|
### Alert Policies
|
|
|
|
| Alert | Threshold | Severity | Response Time |
|
|
|-------|-----------|----------|---------------|
|
|
| High P99 Latency | >50ms for 2min | HIGH | 5 min |
|
|
| Critical Latency | >100ms for 1min | CRITICAL | 2 min |
|
|
| High Error Rate | >1% for 5min | HIGH | 5 min |
|
|
| Budget Warning | >80% hourly | MEDIUM | 15 min |
|
|
| Budget Critical | >100% hourly | HIGH | 5 min |
|
|
| Region Down | 0 healthy backends | CRITICAL | Immediate |
|
|
| CPU Critical | >90% for 5min | HIGH | 5 min |
|
|
| Memory Critical | >90% for 3min | CRITICAL | 2 min |
|
|
|
|
### Notification Channels
|
|
|
|
- **Email**: ops@ruvector.io
|
|
- **PagerDuty**: Critical alerts only
|
|
- **Slack**: #alerts-burst-scaling
|
|
- **Phone**: On-call rotation (critical only)
|
|
|
|
### Log Queries
|
|
|
|
**High Latency Requests**:
|
|
```sql
|
|
resource.type="cloud_run_revision"
|
|
jsonPayload.latency > 0.1
|
|
severity >= WARNING
|
|
```
|
|
|
|
**Scaling Events**:
|
|
```sql
|
|
resource.type="cloud_run_revision"
|
|
jsonPayload.message =~ "SCALING|SCALED"
|
|
```
|
|
|
|
**Cost Events**:
|
|
```sql
|
|
jsonPayload.message =~ "BUDGET"
|
|
```
|
|
|
|
---
|
|
|
|
## Cost Management
|
|
|
|
### Budget Structure
|
|
|
|
- **Hourly**: $10,000 (~200-400 instances)
|
|
- **Daily**: $200,000 (baseline + moderate bursts)
|
|
- **Monthly**: $5,000,000 (includes major events)
|
|
|
|
### Cost Thresholds
|
|
|
|
| Level | Action | Impact |
|
|
|-------|--------|--------|
|
|
| 50% | Info log | None |
|
|
| 80% | Warning alert | None |
|
|
| 90% | Critical alert | None |
|
|
| 100% | Minor degradation | Free tier limited |
|
|
| 120% | Major degradation | Free tier shed |
|
|
| 150% | Critical degradation | Premium only |
|
|
|
|
### Cost Optimization
|
|
|
|
**Automatic**:
|
|
- Gradual scale-in after bursts
|
|
- Preemptible instances for batch jobs
|
|
- Aggressive CDN caching
|
|
- Connection pooling
|
|
|
|
**Manual**:
|
|
```bash
|
|
# Review cost by region
|
|
gcloud billing accounts list
|
|
gcloud billing projects describe ruvector-prod
|
|
|
|
# Analyze top cost drivers
|
|
gcloud alpha billing budgets list --billing-account=YOUR_ACCOUNT
|
|
|
|
# Optimize specific region
|
|
terraform apply -var="us-central1-max-instances=800"
|
|
```
|
|
|
|
### Cost Forecasting
|
|
|
|
```bash
|
|
# Generate cost forecast
|
|
npm run manager -- --forecast=7days
|
|
|
|
# Expected costs:
|
|
# - Normal week: $1.4M
|
|
# - Major event week: $2.5M
|
|
# - World Cup week: $4.8M
|
|
```
|
|
|
|
---
|
|
|
|
## Troubleshooting
|
|
|
|
### Issue: Auto-scaling not responding
|
|
|
|
**Symptoms**: Load increasing but instances not scaling
|
|
|
|
**Diagnosis**:
|
|
```bash
|
|
# Check Cloud Run auto-scaling config
|
|
gcloud run services describe ruvector-us-central1 \
|
|
--region=us-central1 \
|
|
--format="value(spec.template.spec.scaling)"
|
|
|
|
# Check for quota limits
|
|
gcloud compute project-info describe --project=ruvector-prod \
|
|
| grep -A5 CPUS
|
|
```
|
|
|
|
**Resolution**:
|
|
- Verify max-instances not reached
|
|
- Check quota limits
|
|
- Review IAM permissions for service account
|
|
- Restart capacity manager
|
|
|
|
### Issue: Predictions inaccurate
|
|
|
|
**Symptoms**: Actual load differs significantly from predicted
|
|
|
|
**Diagnosis**:
|
|
```bash
|
|
npm run predictor -- --check-accuracy
|
|
```
|
|
|
|
**Resolution**:
|
|
- Update event calendar with actual times
|
|
- Retrain models with recent data
|
|
- Adjust multiplier for event types
|
|
- Review regional distribution assumptions
|
|
|
|
### Issue: Database connection pool exhausted
|
|
|
|
**Symptoms**: Connection errors, high latency
|
|
|
|
**Diagnosis**:
|
|
```bash
|
|
# Check active connections
|
|
gcloud sql operations list --instance=ruvector-db-us-central1
|
|
|
|
# Check Cloud SQL metrics
|
|
gcloud monitoring time-series list \
|
|
--filter='metric.type="cloudsql.googleapis.com/database/postgresql/num_backends"'
|
|
```
|
|
|
|
**Resolution**:
|
|
- Scale up Cloud SQL instance
|
|
- Increase max_connections
|
|
- Add read replicas
|
|
- Review connection pooling settings
|
|
|
|
### Issue: Redis cache misses
|
|
|
|
**Symptoms**: High database load, increased latency
|
|
|
|
**Diagnosis**:
|
|
```bash
|
|
# Check Redis stats
|
|
gcloud redis instances describe ruvector-redis-us-central1 \
|
|
--region=us-central1
|
|
|
|
# Check hit rate
|
|
gcloud monitoring time-series list \
|
|
--filter='metric.type="redis.googleapis.com/stats/cache_hit_ratio"'
|
|
```
|
|
|
|
**Resolution**:
|
|
- Increase Redis memory
|
|
- Review cache TTL settings
|
|
- Implement cache warming for predicted bursts
|
|
- Review cache key patterns
|
|
|
|
---
|
|
|
|
## Runbook Contacts
|
|
|
|
### On-Call Rotation
|
|
|
|
**Primary On-Call**: Check PagerDuty
|
|
**Secondary On-Call**: Check PagerDuty
|
|
**Escalation**: VP Engineering
|
|
|
|
### Team Contacts
|
|
|
|
| Role | Contact | Phone |
|
|
|------|---------|-------|
|
|
| SRE Lead | sre-lead@ruvector.io | +1-XXX-XXX-XXXX |
|
|
| DevOps | devops@ruvector.io | +1-XXX-XXX-XXXX |
|
|
| Engineering Manager | eng-mgr@ruvector.io | +1-XXX-XXX-XXXX |
|
|
| VP Engineering | vp-eng@ruvector.io | +1-XXX-XXX-XXXX |
|
|
|
|
### External Contacts
|
|
|
|
| Service | Contact | SLA |
|
|
|---------|---------|-----|
|
|
| GCP Support | Premium Support | 15 min |
|
|
| PagerDuty | support@pagerduty.com | 1 hour |
|
|
| Network Provider | NOC hotline | 30 min |
|
|
|
|
### War Room
|
|
|
|
**Zoom**: https://zoom.us/j/ruvector-war-room
|
|
**Slack**: #incident-response
|
|
**Docs**: https://docs.ruvector.io/incidents
|
|
|
|
---
|
|
|
|
## Appendix
|
|
|
|
### Quick Reference Commands
|
|
|
|
```bash
|
|
# Check system status
|
|
npm run manager
|
|
|
|
# View current metrics
|
|
gcloud monitoring dashboards list
|
|
|
|
# Force scale-out
|
|
gcloud run services update ruvector-REGION --max-instances=1500
|
|
|
|
# Enable degradation
|
|
npm run manager -- --degrade=minor
|
|
|
|
# Check predictions
|
|
npm run predictor
|
|
|
|
# View logs
|
|
gcloud logging read "resource.type=cloud_run_revision" --limit=50
|
|
|
|
# Check costs
|
|
gcloud billing accounts list
|
|
```
|
|
|
|
### Terraform Quick Reference
|
|
|
|
```bash
|
|
# Initialize
|
|
cd terraform && terraform init
|
|
|
|
# Plan changes
|
|
terraform plan -var-file="prod.tfvars"
|
|
|
|
# Apply changes
|
|
terraform apply -var-file="prod.tfvars"
|
|
|
|
# Emergency scale-out
|
|
terraform apply -var="max_instances=2000"
|
|
|
|
# Add region
|
|
terraform apply -var='regions=["us-central1","europe-west1","asia-east1","us-east1"]'
|
|
```
|
|
|
|
### Health Check URLs
|
|
|
|
- **Application**: https://api.ruvector.io/health
|
|
- **Database**: https://api.ruvector.io/health/db
|
|
- **Redis**: https://api.ruvector.io/health/redis
|
|
- **Load Balancer**: Check Cloud Console
|
|
|
|
### Disaster Recovery
|
|
|
|
**RTO (Recovery Time Objective)**: 15 minutes
|
|
**RPO (Recovery Point Objective)**: 5 minutes
|
|
|
|
**Backup Locations**:
|
|
- Database: Point-in-time recovery (7 days)
|
|
- Configuration: Git repository
|
|
- Terraform state: GCS bucket (versioned)
|
|
|
|
**Recovery Procedure**:
|
|
1. Restore from latest backup
|
|
2. Deploy infrastructure via Terraform
|
|
3. Validate health checks
|
|
4. Update DNS if needed
|
|
5. Resume traffic
|
|
|
|
---
|
|
|
|
## Revision History
|
|
|
|
| Version | Date | Author | Changes |
|
|
|---------|------|--------|---------|
|
|
| 1.0 | 2025-01-20 | DevOps Team | Initial version |
|
|
|
|
---
|
|
|
|
**Last Updated**: 2025-01-20
|
|
**Next Review**: 2025-02-20
|
|
**Owner**: SRE Team
|