# RuVector Global Deployment Guide ## Overview This guide provides step-by-step instructions for deploying RuVector's globally distributed streaming system capable of handling 500 million concurrent learning streams with burst capacity up to 25 billion. **Target Infrastructure**: Google Cloud Platform (GCP) **Architecture**: Multi-region Cloud Run with global load balancing **Deployment Time**: 4-6 hours for initial setup --- ## Table of Contents 1. [Prerequisites](#prerequisites) 2. [Phase 1: Initial Setup](#phase-1-initial-setup) 3. [Phase 2: Core Infrastructure](#phase-2-core-infrastructure) 4. [Phase 3: Multi-Region Deployment](#phase-3-multi-region-deployment) 5. [Phase 4: Load Balancing & CDN](#phase-4-load-balancing--cdn) 6. [Phase 5: Monitoring & Alerting](#phase-5-monitoring--alerting) 7. [Phase 6: Validation & Testing](#phase-6-validation--testing) 8. [Operations](#operations) 9. [Troubleshooting](#troubleshooting) --- ## Prerequisites ### Required Tools ```bash # Install gcloud CLI curl https://sdk.cloud.google.com | bash exec -l $SHELL # Install Terraform wget https://releases.hashicorp.com/terraform/1.6.0/terraform_1.6.0_linux_amd64.zip unzip terraform_1.6.0_linux_amd64.zip sudo mv terraform /usr/local/bin/ # Install Node.js 18+ curl -fsSL https://deb.nodesource.com/setup_18.x | sudo -E bash - sudo apt-get install -y nodejs # Install Rust (for building ruvector core) curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh source $HOME/.cargo/env # Install Docker sudo apt-get update sudo apt-get install -y docker.io sudo usermod -aG docker $USER # Install K6 (for load testing) sudo gpg -k sudo gpg --no-default-keyring --keyring /usr/share/keyrings/k6-archive-keyring.gpg --keyserver hkp://keyserver.ubuntu.com:80 --recv-keys C5AD17C747E3415A3642D57D77C6C491D6AC1D69 echo "deb [signed-by=/usr/share/keyrings/k6-archive-keyring.gpg] https://dl.k6.io/deb stable main" | sudo tee /etc/apt/sources.list.d/k6.list sudo apt-get update sudo apt-get install k6 ``` ### GCP Project Setup ```bash # Set project variables export PROJECT_ID="your-project-id" export PROJECT_NUMBER=$(gcloud projects describe $PROJECT_ID --format="value(projectNumber)") export BILLING_ACCOUNT="your-billing-account-id" # Authenticate gcloud auth login gcloud auth application-default login # Set default project gcloud config set project $PROJECT_ID # Enable billing gcloud billing projects link $PROJECT_ID --billing-account=$BILLING_ACCOUNT ``` ### Enable Required APIs ```bash # Enable all required GCP APIs gcloud services enable \ run.googleapis.com \ compute.googleapis.com \ sql-component.googleapis.com \ sqladmin.googleapis.com \ redis.googleapis.com \ servicenetworking.googleapis.com \ vpcaccess.googleapis.com \ cloudscheduler.googleapis.com \ cloudtasks.googleapis.com \ pubsub.googleapis.com \ monitoring.googleapis.com \ logging.googleapis.com \ cloudtrace.googleapis.com \ cloudbuild.googleapis.com \ artifactregistry.googleapis.com \ secretmanager.googleapis.com \ cloudresourcemanager.googleapis.com \ iamcredentials.googleapis.com \ cloudfunctions.googleapis.com \ networkconnectivity.googleapis.com ``` ### Service Accounts ```bash # Create service accounts gcloud iam service-accounts create ruvector-cloudrun \ --display-name="RuVector Cloud Run Service Account" gcloud iam service-accounts create ruvector-deployer \ --display-name="RuVector CI/CD Deployer" # Grant necessary permissions export CLOUDRUN_SA="ruvector-cloudrun@${PROJECT_ID}.iam.gserviceaccount.com" export DEPLOYER_SA="ruvector-deployer@${PROJECT_ID}.iam.gserviceaccount.com" # Cloud Run permissions gcloud projects add-iam-policy-binding $PROJECT_ID \ --member="serviceAccount:${CLOUDRUN_SA}" \ --role="roles/cloudsql.client" gcloud projects add-iam-policy-binding $PROJECT_ID \ --member="serviceAccount:${CLOUDRUN_SA}" \ --role="roles/redis.editor" gcloud projects add-iam-policy-binding $PROJECT_ID \ --member="serviceAccount:${CLOUDRUN_SA}" \ --role="roles/pubsub.publisher" gcloud projects add-iam-policy-binding $PROJECT_ID \ --member="serviceAccount:${CLOUDRUN_SA}" \ --role="roles/secretmanager.secretAccessor" # Deployer permissions gcloud projects add-iam-policy-binding $PROJECT_ID \ --member="serviceAccount:${DEPLOYER_SA}" \ --role="roles/run.admin" gcloud projects add-iam-policy-binding $PROJECT_ID \ --member="serviceAccount:${DEPLOYER_SA}" \ --role="roles/iam.serviceAccountUser" gcloud projects add-iam-policy-binding $PROJECT_ID \ --member="serviceAccount:${DEPLOYER_SA}" \ --role="roles/cloudbuild.builds.editor" ``` ### Budget Alerts ```bash # Create budget (adjust amounts as needed) gcloud billing budgets create \ --billing-account=$BILLING_ACCOUNT \ --display-name="RuVector Monthly Budget" \ --budget-amount=500000 \ --threshold-rule=percent=50 \ --threshold-rule=percent=80 \ --threshold-rule=percent=100 \ --threshold-rule=percent=120 ``` --- ## Phase 1: Initial Setup ### 1.1 Clone Repository ```bash cd /home/user git clone https://github.com/ruvnet/ruvector.git cd ruvector ``` ### 1.2 Build Rust Core ```bash # Build ruvector core cargo build --release # Build Node.js bindings cd crates/ruvector-node npm install npm run build cd ../.. ``` ### 1.3 Configure Environment ```bash # Create terraform variables file cd /home/user/ruvector/src/burst-scaling/terraform cat > terraform.tfvars < $LB_IP" echo "A record: *.ruvector.io -> $LB_IP" ``` **Manually configure in your DNS provider**: - `ruvector.io` A record → `$LB_IP` - `*.ruvector.io` A record → `$LB_IP` ### 4.5 SSL Certificate ```bash # Create managed SSL certificate (auto-renewal) gcloud compute ssl-certificates create ruvector-ssl-cert \ --domains=ruvector.io,api.ruvector.io,*.ruvector.io \ --global # Wait for certificate provisioning (can take 15-30 minutes) gcloud compute ssl-certificates list ``` --- ## Phase 5: Monitoring & Alerting ### 5.1 Import Monitoring Dashboard ```bash cd /home/user/ruvector/src/burst-scaling # Create dashboard gcloud monitoring dashboards create \ --config-from-file=monitoring-dashboard.json ``` **Dashboard includes**: - Connection counts per region - Latency percentiles - Error rates - Resource utilization - Cost tracking ### 5.2 Configure Alert Policies ```bash # High latency alert gcloud alpha monitoring policies create \ --notification-channels=CHANNEL_ID \ --display-name="High P99 Latency" \ --condition-display-name="P99 > 50ms" \ --condition-threshold-value=0.050 \ --condition-threshold-duration=60s \ --aggregation-alignment-period=60s # High error rate alert gcloud alpha monitoring policies create \ --notification-channels=CHANNEL_ID \ --display-name="High Error Rate" \ --condition-display-name="Errors > 1%" \ --condition-threshold-value=0.01 \ --condition-threshold-duration=300s # Region unhealthy alert gcloud alpha monitoring policies create \ --notification-channels=CHANNEL_ID \ --display-name="Region Unhealthy" \ --condition-display-name="Health Check Failed" \ --condition-threshold-value=1 \ --condition-threshold-duration=180s ``` ### 5.3 Log-Based Metrics ```bash # Create custom metrics from logs gcloud logging metrics create error_rate \ --description="Application error rate" \ --log-filter='resource.type="cloud_run_revision" severity>=ERROR' gcloud logging metrics create connection_count \ --description="Active connection count" \ --log-filter='resource.type="cloud_run_revision" jsonPayload.event="connection_established"' \ --value-extractor='EXTRACT(jsonPayload.connection_id)' ``` --- ## Phase 6: Validation & Testing ### 6.1 Smoke Test ```bash cd /home/user/ruvector/benchmarks # Run quick validation (2 hours) npm run test:quick # Expected output: # ✓ Baseline load test passed # ✓ Single region test passed # ✓ Basic failover test passed # ✓ Mixed workload test passed ``` ### 6.2 Load Test (Baseline) ```bash # Run baseline 500M concurrent test (4 hours) npm run scenario:baseline-500m # Monitor progress npm run dashboard # Open http://localhost:8000/visualization-dashboard.html ``` **Success criteria**: - P99 latency < 50ms - P50 latency < 10ms - Error rate < 0.1% - All regions healthy ### 6.3 Burst Test (10x) ```bash # Run 10x burst test (2 hours) npm run scenario:product-launch-10x # This will spike to 5B concurrent ``` **Success criteria**: - System survives without crash - P99 latency < 100ms - Auto-scaling completes within 60s - Error rate < 2% ### 6.4 Failover Test ```bash # Run regional failover test (1 hour) npm run scenario:region-failover # This will simulate region failure ``` **Success criteria**: - Failover completes within 60s - Connection loss < 5% - No cascading failures --- ## Operations ### Daily Operations #### Morning Checklist ```bash #!/bin/bash # Save as: /home/user/ruvector/scripts/daily-check.sh # Check service health echo "=== Service Health ===" for region in us-central1 europe-west1 asia-east1; do gcloud run services describe ruvector-streaming \ --region=$region \ --format='value(status.conditions[0].status)' | \ grep -q "True" && echo "✓ $region" || echo "✗ $region UNHEALTHY" done # Check error rates (last 24h) echo -e "\n=== Error Rates (24h) ===" gcloud logging read 'resource.type="cloud_run_revision" severity>=ERROR' \ --limit=10 \ --format=json | jq -r '.[].jsonPayload.message' # Check costs (yesterday) echo -e "\n=== Cost (Yesterday) ===" # Requires BigQuery billing export # bq query --use_legacy_sql=false "SELECT SUM(cost) FROM billing.gcp_billing_export WHERE DATE(usage_start_time) = CURRENT_DATE() - 1" # Check capacity echo -e "\n=== Capacity ===" gcloud run services describe ruvector-streaming \ --region=us-central1 \ --format='value(spec.template.spec.containerConcurrency,status.observedGeneration)' ``` #### Scaling Operations ```bash # Manually scale up for planned event gcloud run services update ruvector-streaming \ --region=us-central1 \ --min-instances=100 \ --max-instances=1000 # Manually scale down after event gcloud run services update ruvector-streaming \ --region=us-central1 \ --min-instances=10 \ --max-instances=500 # Or use the burst predictor cd /home/user/ruvector/src/burst-scaling node dist/burst-predictor.js --event "Product Launch" --time "2025-12-01T10:00:00Z" ``` ### Weekly Operations #### Performance Review ```bash # Generate weekly performance report cd /home/user/ruvector/benchmarks npm run report -- --period "last-7-days" --format pdf # Review metrics: # - Average latency trends # - Error rate trends # - Cost per million queries # - Capacity utilization ``` #### Cost Optimization ```bash # Identify idle resources gcloud run services list --format='table( metadata.name, metadata.namespace, status.url, status.traffic[0].percent )' | grep "0%" # Review committed use discounts gcloud compute commitments list # Check for underutilized databases gcloud sql instances list --format='table( name, region, settings.tier, state )' | grep RUNNABLE ``` ### Monthly Operations #### Capacity Planning ```bash # Analyze growth trends # Review last 3 months of connection counts # Project next month's capacity needs # Request quota increases if needed # Request quota increase gcloud compute project-info describe --project=$PROJECT_ID gcloud compute regions describe us-central1 --format='value(quotas)' # Submit increase request if needed gcloud compute project-info add-metadata \ --metadata=quotas='{"CPUS":"10000","DISKS_TOTAL_GB":"100000"}' ``` #### Security Updates ```bash # Update container images cd /home/user/ruvector git pull origin main docker build -t gcr.io/${PROJECT_ID}/ruvector-streaming:latest . docker push gcr.io/${PROJECT_ID}/ruvector-streaming:latest # Rolling update gcloud run services update ruvector-streaming \ --image=gcr.io/${PROJECT_ID}/ruvector-streaming:latest \ --region=us-central1 # Verify update gcloud run revisions list --service=ruvector-streaming --region=us-central1 ``` --- ## Troubleshooting ### Issue: High Latency (P99 > 50ms) **Diagnosis**: ```bash # Check database connections gcloud sql operations list --instance=ruvector-db --limit=10 # Check Redis hit rates gcloud redis instances describe ruvector-cache-us-central1 \ --region=us-central1 \ --format='value(metadata.stats.hitRate)' # Check Cloud Run cold starts gcloud run services describe ruvector-streaming \ --region=us-central1 \ --format='value(status.traffic[0].latestRevision)' ``` **Solutions**: 1. Increase min instances to reduce cold starts 2. Increase Redis memory or optimize cache keys 3. Add read replicas to database 4. Enable connection pooling 5. Review slow queries in database ### Issue: High Error Rate (> 1%) **Diagnosis**: ```bash # Check error types gcloud logging read 'resource.type="cloud_run_revision" severity>=ERROR' \ --limit=100 \ --format=json | jq -r '.[] | .jsonPayload.error_type' | sort | uniq -c # Check recent deployments gcloud run revisions list --service=ruvector-streaming --region=us-central1 --limit=5 ``` **Solutions**: 1. Rollback to previous revision if recent deploy 2. Check database connection pool exhaustion 3. Verify API rate limits not exceeded 4. Check for memory leaks (restart instances) 5. Review error logs for patterns ### Issue: Auto-Scaling Not Working **Diagnosis**: ```bash # Check scaling metrics gcloud monitoring time-series list \ --filter='metric.type="run.googleapis.com/container/instance_count"' \ --interval-start-time="2025-01-01T00:00:00Z" \ --interval-end-time="2025-01-02T00:00:00Z" # Check quotas gcloud compute project-info describe --project=$PROJECT_ID | grep -A 5 "CPUS" ``` **Solutions**: 1. Request quota increase if limits hit 2. Check budget caps (may block scaling) 3. Verify IAM permissions for auto-scaler 4. Review scaling policies (min/max instances) 5. Check for regional capacity issues ### Issue: Regional Failover Not Working **Diagnosis**: ```bash # Check health checks gcloud compute health-checks describe ruvector-health-check # Check backend service health gcloud compute backend-services get-health ruvector-backend-service --global # Check load balancer configuration gcloud compute url-maps describe ruvector-lb ``` **Solutions**: 1. Verify health check endpoints responding 2. Check firewall rules allow health checks 3. Verify backend services configured correctly 4. Check DNS propagation 5. Review load balancer logs ### Issue: Cost Overruns **Diagnosis**: ```bash # Check current spend gcloud billing accounts list # Identify expensive resources gcloud compute instances list --format='table(name,zone,machineType,status)' gcloud sql instances list --format='table(name,region,tier,status)' gcloud redis instances list --format='table(name,region,tier,memorySizeGb)' ``` **Solutions**: 1. Scale down min instances in low-traffic regions 2. Reduce Redis memory size if underutilized 3. Downgrade database tier if CPU/memory low 4. Enable more aggressive CDN caching 5. Review and delete unused resources --- ## Rollback Procedures ### Rollback Cloud Run Service ```bash # List revisions gcloud run revisions list --service=ruvector-streaming --region=us-central1 # Rollback to previous revision PREVIOUS_REVISION=$(gcloud run revisions list \ --service=ruvector-streaming \ --region=us-central1 \ --format='value(metadata.name)' \ --limit=2 | tail -n1) gcloud run services update-traffic ruvector-streaming \ --region=us-central1 \ --to-revisions=$PREVIOUS_REVISION=100 ``` ### Rollback Infrastructure Changes ```bash cd /home/user/ruvector/src/burst-scaling/terraform # Revert to previous state terraform state pull > current-state.tfstate terraform state push previous-state.tfstate terraform apply -auto-approve ``` ### Emergency Shutdown ```bash # Disable all traffic to service gcloud run services update ruvector-streaming \ --region=us-central1 \ --max-instances=0 # Or delete service entirely gcloud run services delete ruvector-streaming --region=us-central1 --quiet ``` --- ## Cost Summary ### Initial Setup Costs - One-time setup: ~$100 (testing, quota requests, etc.) ### Monthly Operating Costs (Baseline 500M concurrent) - **Cloud Run**: $2.4M ($0.0048 per connection) - **Cloud SQL**: $150K (3 regions, read replicas) - **Redis**: $45K (3 regions, 128GB each) - **Load Balancer + CDN**: $80K - **Networking**: $50K - **Monitoring + Logging**: $20K - **Storage**: $5K - **Total**: ~$2.75M/month (optimized) ### Burst Event Costs (World Cup 50x, 3 hours) - **Cloud Run**: ~$80K - **Database**: ~$2K (connection spikes) - **Redis**: ~$500 (included in monthly) - **Networking**: ~$5K - **Total**: ~$88K per event ### Cost Optimization Tips 1. Use committed use discounts (30% savings) 2. Enable auto-scaling to scale down when idle 3. Increase CDN cache hit rate to reduce backend load 4. Use preemptible instances for non-critical workloads 5. Regularly review and delete unused resources --- ## Next Steps 1. **Complete Initial Deployment** (Phases 1-5) 2. **Run Validation Tests** (Phase 6) 3. **Schedule Load Tests** (Baseline, then burst) 4. **Set Up Monitoring Dashboard** 5. **Configure Alert Policies** 6. **Create Runbook** (Already created: `/home/user/ruvector/src/burst-scaling/RUNBOOK.md`) 7. **Train Team on Operations** 8. **Plan First Production Event** (Start small, scale up) 9. **Iterate and Optimize** (Based on real traffic) --- ## Additional Resources - [Architecture Overview](./architecture-overview.md) - [Scaling Strategy](./scaling-strategy.md) - [Infrastructure Design](./infrastructure-design.md) - [Load Test Scenarios](/home/user/ruvector/benchmarks/LOAD_TEST_SCENARIOS.md) - [Operations Runbook](/home/user/ruvector/src/burst-scaling/RUNBOOK.md) - [Benchmarking Guide](/home/user/ruvector/benchmarks/README.md) - [GCP Cloud Run Docs](https://cloud.google.com/run/docs) - [GCP Load Balancing Docs](https://cloud.google.com/load-balancing/docs) --- ## Support For issues or questions: - GitHub Issues: https://github.com/ruvnet/ruvector/issues - Email: ops@ruvector.io - Slack: #ruvector-ops --- **Document Version**: 1.0 **Last Updated**: 2025-11-20 **Deployment Status**: Ready for Production