git-subtree-dir: vendor/ruvector git-subtree-split: b64c21726f2bb37286d9ee36a7869fef60cc6900
942 lines
24 KiB
Markdown
942 lines
24 KiB
Markdown
# RuVector Global Deployment Guide
|
|
|
|
## Overview
|
|
|
|
This guide provides step-by-step instructions for deploying RuVector's globally distributed streaming system capable of handling 500 million concurrent learning streams with burst capacity up to 25 billion.
|
|
|
|
**Target Infrastructure**: Google Cloud Platform (GCP)
|
|
**Architecture**: Multi-region Cloud Run with global load balancing
|
|
**Deployment Time**: 4-6 hours for initial setup
|
|
|
|
---
|
|
|
|
## Table of Contents
|
|
|
|
1. [Prerequisites](#prerequisites)
|
|
2. [Phase 1: Initial Setup](#phase-1-initial-setup)
|
|
3. [Phase 2: Core Infrastructure](#phase-2-core-infrastructure)
|
|
4. [Phase 3: Multi-Region Deployment](#phase-3-multi-region-deployment)
|
|
5. [Phase 4: Load Balancing & CDN](#phase-4-load-balancing--cdn)
|
|
6. [Phase 5: Monitoring & Alerting](#phase-5-monitoring--alerting)
|
|
7. [Phase 6: Validation & Testing](#phase-6-validation--testing)
|
|
8. [Operations](#operations)
|
|
9. [Troubleshooting](#troubleshooting)
|
|
|
|
---
|
|
|
|
## Prerequisites
|
|
|
|
### Required Tools
|
|
```bash
|
|
# Install gcloud CLI
|
|
curl https://sdk.cloud.google.com | bash
|
|
exec -l $SHELL
|
|
|
|
# Install Terraform
|
|
wget https://releases.hashicorp.com/terraform/1.6.0/terraform_1.6.0_linux_amd64.zip
|
|
unzip terraform_1.6.0_linux_amd64.zip
|
|
sudo mv terraform /usr/local/bin/
|
|
|
|
# Install Node.js 18+
|
|
curl -fsSL https://deb.nodesource.com/setup_18.x | sudo -E bash -
|
|
sudo apt-get install -y nodejs
|
|
|
|
# Install Rust (for building ruvector core)
|
|
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
|
|
source $HOME/.cargo/env
|
|
|
|
# Install Docker
|
|
sudo apt-get update
|
|
sudo apt-get install -y docker.io
|
|
sudo usermod -aG docker $USER
|
|
|
|
# Install K6 (for load testing)
|
|
sudo gpg -k
|
|
sudo gpg --no-default-keyring --keyring /usr/share/keyrings/k6-archive-keyring.gpg --keyserver hkp://keyserver.ubuntu.com:80 --recv-keys C5AD17C747E3415A3642D57D77C6C491D6AC1D69
|
|
echo "deb [signed-by=/usr/share/keyrings/k6-archive-keyring.gpg] https://dl.k6.io/deb stable main" | sudo tee /etc/apt/sources.list.d/k6.list
|
|
sudo apt-get update
|
|
sudo apt-get install k6
|
|
```
|
|
|
|
### GCP Project Setup
|
|
```bash
|
|
# Set project variables
|
|
export PROJECT_ID="your-project-id"
|
|
export PROJECT_NUMBER=$(gcloud projects describe $PROJECT_ID --format="value(projectNumber)")
|
|
export BILLING_ACCOUNT="your-billing-account-id"
|
|
|
|
# Authenticate
|
|
gcloud auth login
|
|
gcloud auth application-default login
|
|
|
|
# Set default project
|
|
gcloud config set project $PROJECT_ID
|
|
|
|
# Enable billing
|
|
gcloud billing projects link $PROJECT_ID --billing-account=$BILLING_ACCOUNT
|
|
```
|
|
|
|
### Enable Required APIs
|
|
```bash
|
|
# Enable all required GCP APIs
|
|
gcloud services enable \
|
|
run.googleapis.com \
|
|
compute.googleapis.com \
|
|
sql-component.googleapis.com \
|
|
sqladmin.googleapis.com \
|
|
redis.googleapis.com \
|
|
servicenetworking.googleapis.com \
|
|
vpcaccess.googleapis.com \
|
|
cloudscheduler.googleapis.com \
|
|
cloudtasks.googleapis.com \
|
|
pubsub.googleapis.com \
|
|
monitoring.googleapis.com \
|
|
logging.googleapis.com \
|
|
cloudtrace.googleapis.com \
|
|
cloudbuild.googleapis.com \
|
|
artifactregistry.googleapis.com \
|
|
secretmanager.googleapis.com \
|
|
cloudresourcemanager.googleapis.com \
|
|
iamcredentials.googleapis.com \
|
|
cloudfunctions.googleapis.com \
|
|
networkconnectivity.googleapis.com
|
|
```
|
|
|
|
### Service Accounts
|
|
```bash
|
|
# Create service accounts
|
|
gcloud iam service-accounts create ruvector-cloudrun \
|
|
--display-name="RuVector Cloud Run Service Account"
|
|
|
|
gcloud iam service-accounts create ruvector-deployer \
|
|
--display-name="RuVector CI/CD Deployer"
|
|
|
|
# Grant necessary permissions
|
|
export CLOUDRUN_SA="ruvector-cloudrun@${PROJECT_ID}.iam.gserviceaccount.com"
|
|
export DEPLOYER_SA="ruvector-deployer@${PROJECT_ID}.iam.gserviceaccount.com"
|
|
|
|
# Cloud Run permissions
|
|
gcloud projects add-iam-policy-binding $PROJECT_ID \
|
|
--member="serviceAccount:${CLOUDRUN_SA}" \
|
|
--role="roles/cloudsql.client"
|
|
|
|
gcloud projects add-iam-policy-binding $PROJECT_ID \
|
|
--member="serviceAccount:${CLOUDRUN_SA}" \
|
|
--role="roles/redis.editor"
|
|
|
|
gcloud projects add-iam-policy-binding $PROJECT_ID \
|
|
--member="serviceAccount:${CLOUDRUN_SA}" \
|
|
--role="roles/pubsub.publisher"
|
|
|
|
gcloud projects add-iam-policy-binding $PROJECT_ID \
|
|
--member="serviceAccount:${CLOUDRUN_SA}" \
|
|
--role="roles/secretmanager.secretAccessor"
|
|
|
|
# Deployer permissions
|
|
gcloud projects add-iam-policy-binding $PROJECT_ID \
|
|
--member="serviceAccount:${DEPLOYER_SA}" \
|
|
--role="roles/run.admin"
|
|
|
|
gcloud projects add-iam-policy-binding $PROJECT_ID \
|
|
--member="serviceAccount:${DEPLOYER_SA}" \
|
|
--role="roles/iam.serviceAccountUser"
|
|
|
|
gcloud projects add-iam-policy-binding $PROJECT_ID \
|
|
--member="serviceAccount:${DEPLOYER_SA}" \
|
|
--role="roles/cloudbuild.builds.editor"
|
|
```
|
|
|
|
### Budget Alerts
|
|
```bash
|
|
# Create budget (adjust amounts as needed)
|
|
gcloud billing budgets create \
|
|
--billing-account=$BILLING_ACCOUNT \
|
|
--display-name="RuVector Monthly Budget" \
|
|
--budget-amount=500000 \
|
|
--threshold-rule=percent=50 \
|
|
--threshold-rule=percent=80 \
|
|
--threshold-rule=percent=100 \
|
|
--threshold-rule=percent=120
|
|
```
|
|
|
|
---
|
|
|
|
## Phase 1: Initial Setup
|
|
|
|
### 1.1 Clone Repository
|
|
```bash
|
|
cd /home/user
|
|
git clone https://github.com/ruvnet/ruvector.git
|
|
cd ruvector
|
|
```
|
|
|
|
### 1.2 Build Rust Core
|
|
```bash
|
|
# Build ruvector core
|
|
cargo build --release
|
|
|
|
# Build Node.js bindings
|
|
cd crates/ruvector-node
|
|
npm install
|
|
npm run build
|
|
|
|
cd ../..
|
|
```
|
|
|
|
### 1.3 Configure Environment
|
|
```bash
|
|
# Create terraform variables file
|
|
cd /home/user/ruvector/src/burst-scaling/terraform
|
|
|
|
cat > terraform.tfvars <<EOF
|
|
project_id = "${PROJECT_ID}"
|
|
project_number = "${PROJECT_NUMBER}"
|
|
|
|
regions = [
|
|
{
|
|
name = "us-central1"
|
|
display_name = "Iowa (US Central)"
|
|
priority = 10
|
|
capacity = 80000000
|
|
cost_per_hour = 0.50
|
|
min_instances = 10
|
|
max_instances = 800
|
|
},
|
|
{
|
|
name = "europe-west1"
|
|
display_name = "Belgium (Europe West)"
|
|
priority = 9
|
|
capacity = 80000000
|
|
cost_per_hour = 0.55
|
|
min_instances = 10
|
|
max_instances = 800
|
|
},
|
|
{
|
|
name = "asia-east1"
|
|
display_name = "Taiwan (Asia East)"
|
|
priority = 8
|
|
capacity = 80000000
|
|
cost_per_hour = 0.60
|
|
min_instances = 10
|
|
max_instances = 800
|
|
}
|
|
]
|
|
|
|
service_config = {
|
|
cpu = "4"
|
|
memory = "16Gi"
|
|
concurrency = 100
|
|
timeout_seconds = 300
|
|
max_instances_per_region = 1000
|
|
}
|
|
|
|
scaling_config = {
|
|
enable_adaptive_scaling = true
|
|
target_cpu_utilization = 0.70
|
|
target_memory_utilization = 0.75
|
|
target_concurrency = 0.80
|
|
scale_down_delay = "300s"
|
|
min_scale_down_rate = 2
|
|
max_scale_up_rate = 10
|
|
}
|
|
|
|
budget_limits = {
|
|
hourly_limit_usd = 10000
|
|
daily_limit_usd = 200000
|
|
monthly_limit_usd = 5000000
|
|
alert_threshold = 0.80
|
|
}
|
|
|
|
# Customize with your settings
|
|
redis_memory_gb = 128
|
|
database_tier = "db-custom-4-16384"
|
|
enable_cdn = true
|
|
enable_armor = true
|
|
|
|
# Notification channels (replace with your values)
|
|
slack_webhook_url = "https://hooks.slack.com/services/YOUR/WEBHOOK/URL"
|
|
email_addresses = ["ops@yourcompany.com"]
|
|
EOF
|
|
```
|
|
|
|
---
|
|
|
|
## Phase 2: Core Infrastructure
|
|
|
|
### 2.1 Deploy with Terraform
|
|
```bash
|
|
cd /home/user/ruvector/src/burst-scaling/terraform
|
|
|
|
# Initialize Terraform
|
|
terraform init
|
|
|
|
# Review plan
|
|
terraform plan -out=tfplan
|
|
|
|
# Apply infrastructure
|
|
terraform apply tfplan
|
|
```
|
|
|
|
**This will create**:
|
|
- VPC network with subnets
|
|
- Cloud SQL PostgreSQL instance with replicas
|
|
- Memorystore Redis instances per region
|
|
- Cloud Storage buckets
|
|
- Pub/Sub topics
|
|
- Secrets Manager secrets
|
|
- IAM bindings
|
|
- Monitoring policies
|
|
|
|
**Expected time**: 30-45 minutes
|
|
|
|
### 2.2 Initialize Database
|
|
```bash
|
|
# Get Cloud SQL connection name
|
|
export SQL_CONNECTION=$(terraform output -raw sql_connection_name)
|
|
|
|
# Connect to database
|
|
gcloud sql connect ruvector-db --user=postgres --quiet
|
|
|
|
# Run initialization script
|
|
\i /home/user/ruvector/src/cloud-architecture/init-database.sql
|
|
|
|
# Exit psql
|
|
\q
|
|
```
|
|
|
|
**Database Schema Created**:
|
|
- `vectors` table with metadata
|
|
- `agents` table for coordination
|
|
- `tasks` table for distributed work
|
|
- `sync_queue` for cross-region sync
|
|
- Indexes and partitions
|
|
|
|
### 2.3 Configure Secrets
|
|
```bash
|
|
# Store database password
|
|
echo -n "your-secure-password" | gcloud secrets create db-password \
|
|
--data-file=- \
|
|
--replication-policy="automatic"
|
|
|
|
# Store Redis connection strings (per region)
|
|
for region in us-central1 europe-west1 asia-east1; do
|
|
REDIS_HOST=$(terraform output -json redis_hosts | jq -r ".${region}")
|
|
echo "redis://${REDIS_HOST}:6379" | gcloud secrets create redis-connection-${region} \
|
|
--data-file=- \
|
|
--replication-policy="automatic"
|
|
done
|
|
|
|
# Store API keys and other secrets
|
|
echo -n "your-api-key" | gcloud secrets create api-key \
|
|
--data-file=- \
|
|
--replication-policy="automatic"
|
|
```
|
|
|
|
---
|
|
|
|
## Phase 3: Multi-Region Deployment
|
|
|
|
### 3.1 Build Container Image
|
|
```bash
|
|
cd /home/user/ruvector
|
|
|
|
# Build multi-arch image
|
|
docker buildx create --use
|
|
docker buildx build --platform linux/amd64,linux/arm64 \
|
|
-t gcr.io/${PROJECT_ID}/ruvector-streaming:latest \
|
|
-f src/cloud-run/Dockerfile \
|
|
--push .
|
|
```
|
|
|
|
### 3.2 Deploy to Regions (Automated)
|
|
```bash
|
|
cd /home/user/ruvector/src/cloud-run
|
|
|
|
# Use Cloud Build for multi-region deployment
|
|
gcloud builds submit --config=cloudbuild.yaml
|
|
```
|
|
|
|
**This deploys to**:
|
|
- us-central1
|
|
- us-east1
|
|
- us-west1
|
|
- europe-west1
|
|
- europe-west3
|
|
- europe-north1
|
|
- asia-east1
|
|
- asia-southeast1
|
|
- asia-northeast1
|
|
|
|
**Deployment strategy**: Canary (10% → 50% → 100%)
|
|
|
|
**Expected time**: 20-30 minutes
|
|
|
|
### 3.3 Verify Regional Deployments
|
|
```bash
|
|
# List all deployments
|
|
gcloud run services list --platform managed
|
|
|
|
# Test each region
|
|
for region in us-central1 europe-west1 asia-east1; do
|
|
URL=$(gcloud run services describe ruvector-streaming \
|
|
--region=$region \
|
|
--format='value(status.url)')
|
|
|
|
curl -f "$URL/health" && echo "✓ $region healthy" || echo "✗ $region failed"
|
|
done
|
|
```
|
|
|
|
---
|
|
|
|
## Phase 4: Load Balancing & CDN
|
|
|
|
### 4.1 Create Backend Services
|
|
```bash
|
|
# This is handled by Terraform, but verify:
|
|
gcloud compute backend-services list
|
|
|
|
# Should show backend services for each region
|
|
```
|
|
|
|
### 4.2 Configure Global Load Balancer
|
|
```bash
|
|
# Create URL map
|
|
gcloud compute url-maps create ruvector-lb \
|
|
--default-backend-bucket=ruvector-static-assets
|
|
|
|
# Add backend service
|
|
gcloud compute url-maps add-path-matcher ruvector-lb \
|
|
--path-matcher-name=ruvector-matcher \
|
|
--default-service=ruvector-backend-service \
|
|
--new-hosts='*.ruvector.io'
|
|
|
|
# Create target HTTPS proxy
|
|
gcloud compute target-https-proxies create ruvector-https-proxy \
|
|
--url-map=ruvector-lb \
|
|
--ssl-certificates=ruvector-ssl-cert
|
|
|
|
# Create global forwarding rule
|
|
gcloud compute forwarding-rules create ruvector-https-rule \
|
|
--global \
|
|
--target-https-proxy=ruvector-https-proxy \
|
|
--ports=443
|
|
```
|
|
|
|
### 4.3 Configure Cloud CDN
|
|
```bash
|
|
# Enable CDN on backend service
|
|
gcloud compute backend-services update ruvector-backend-service \
|
|
--enable-cdn \
|
|
--cache-mode=CACHE_ALL_STATIC \
|
|
--default-ttl=3600 \
|
|
--max-ttl=86400 \
|
|
--client-ttl=3600 \
|
|
--global
|
|
```
|
|
|
|
### 4.4 DNS Configuration
|
|
```bash
|
|
# Get load balancer IP
|
|
LB_IP=$(gcloud compute addresses describe ruvector-global-ip \
|
|
--global \
|
|
--format='value(address)')
|
|
|
|
echo "Configure your DNS:"
|
|
echo "A record: ruvector.io -> $LB_IP"
|
|
echo "A record: *.ruvector.io -> $LB_IP"
|
|
```
|
|
|
|
**Manually configure in your DNS provider**:
|
|
- `ruvector.io` A record → `$LB_IP`
|
|
- `*.ruvector.io` A record → `$LB_IP`
|
|
|
|
### 4.5 SSL Certificate
|
|
```bash
|
|
# Create managed SSL certificate (auto-renewal)
|
|
gcloud compute ssl-certificates create ruvector-ssl-cert \
|
|
--domains=ruvector.io,api.ruvector.io,*.ruvector.io \
|
|
--global
|
|
|
|
# Wait for certificate provisioning (can take 15-30 minutes)
|
|
gcloud compute ssl-certificates list
|
|
```
|
|
|
|
---
|
|
|
|
## Phase 5: Monitoring & Alerting
|
|
|
|
### 5.1 Import Monitoring Dashboard
|
|
```bash
|
|
cd /home/user/ruvector/src/burst-scaling
|
|
|
|
# Create dashboard
|
|
gcloud monitoring dashboards create \
|
|
--config-from-file=monitoring-dashboard.json
|
|
```
|
|
|
|
**Dashboard includes**:
|
|
- Connection counts per region
|
|
- Latency percentiles
|
|
- Error rates
|
|
- Resource utilization
|
|
- Cost tracking
|
|
|
|
### 5.2 Configure Alert Policies
|
|
```bash
|
|
# High latency alert
|
|
gcloud alpha monitoring policies create \
|
|
--notification-channels=CHANNEL_ID \
|
|
--display-name="High P99 Latency" \
|
|
--condition-display-name="P99 > 50ms" \
|
|
--condition-threshold-value=0.050 \
|
|
--condition-threshold-duration=60s \
|
|
--aggregation-alignment-period=60s
|
|
|
|
# High error rate alert
|
|
gcloud alpha monitoring policies create \
|
|
--notification-channels=CHANNEL_ID \
|
|
--display-name="High Error Rate" \
|
|
--condition-display-name="Errors > 1%" \
|
|
--condition-threshold-value=0.01 \
|
|
--condition-threshold-duration=300s
|
|
|
|
# Region unhealthy alert
|
|
gcloud alpha monitoring policies create \
|
|
--notification-channels=CHANNEL_ID \
|
|
--display-name="Region Unhealthy" \
|
|
--condition-display-name="Health Check Failed" \
|
|
--condition-threshold-value=1 \
|
|
--condition-threshold-duration=180s
|
|
```
|
|
|
|
### 5.3 Log-Based Metrics
|
|
```bash
|
|
# Create custom metrics from logs
|
|
gcloud logging metrics create error_rate \
|
|
--description="Application error rate" \
|
|
--log-filter='resource.type="cloud_run_revision"
|
|
severity>=ERROR'
|
|
|
|
gcloud logging metrics create connection_count \
|
|
--description="Active connection count" \
|
|
--log-filter='resource.type="cloud_run_revision"
|
|
jsonPayload.event="connection_established"' \
|
|
--value-extractor='EXTRACT(jsonPayload.connection_id)'
|
|
```
|
|
|
|
---
|
|
|
|
## Phase 6: Validation & Testing
|
|
|
|
### 6.1 Smoke Test
|
|
```bash
|
|
cd /home/user/ruvector/benchmarks
|
|
|
|
# Run quick validation (2 hours)
|
|
npm run test:quick
|
|
|
|
# Expected output:
|
|
# ✓ Baseline load test passed
|
|
# ✓ Single region test passed
|
|
# ✓ Basic failover test passed
|
|
# ✓ Mixed workload test passed
|
|
```
|
|
|
|
### 6.2 Load Test (Baseline)
|
|
```bash
|
|
# Run baseline 500M concurrent test (4 hours)
|
|
npm run scenario:baseline-500m
|
|
|
|
# Monitor progress
|
|
npm run dashboard
|
|
# Open http://localhost:8000/visualization-dashboard.html
|
|
```
|
|
|
|
**Success criteria**:
|
|
- P99 latency < 50ms
|
|
- P50 latency < 10ms
|
|
- Error rate < 0.1%
|
|
- All regions healthy
|
|
|
|
### 6.3 Burst Test (10x)
|
|
```bash
|
|
# Run 10x burst test (2 hours)
|
|
npm run scenario:product-launch-10x
|
|
|
|
# This will spike to 5B concurrent
|
|
```
|
|
|
|
**Success criteria**:
|
|
- System survives without crash
|
|
- P99 latency < 100ms
|
|
- Auto-scaling completes within 60s
|
|
- Error rate < 2%
|
|
|
|
### 6.4 Failover Test
|
|
```bash
|
|
# Run regional failover test (1 hour)
|
|
npm run scenario:region-failover
|
|
|
|
# This will simulate region failure
|
|
```
|
|
|
|
**Success criteria**:
|
|
- Failover completes within 60s
|
|
- Connection loss < 5%
|
|
- No cascading failures
|
|
|
|
---
|
|
|
|
## Operations
|
|
|
|
### Daily Operations
|
|
|
|
#### Morning Checklist
|
|
```bash
|
|
#!/bin/bash
|
|
# Save as: /home/user/ruvector/scripts/daily-check.sh
|
|
|
|
# Check service health
|
|
echo "=== Service Health ==="
|
|
for region in us-central1 europe-west1 asia-east1; do
|
|
gcloud run services describe ruvector-streaming \
|
|
--region=$region \
|
|
--format='value(status.conditions[0].status)' | \
|
|
grep -q "True" && echo "✓ $region" || echo "✗ $region UNHEALTHY"
|
|
done
|
|
|
|
# Check error rates (last 24h)
|
|
echo -e "\n=== Error Rates (24h) ==="
|
|
gcloud logging read 'resource.type="cloud_run_revision" severity>=ERROR' \
|
|
--limit=10 \
|
|
--format=json | jq -r '.[].jsonPayload.message'
|
|
|
|
# Check costs (yesterday)
|
|
echo -e "\n=== Cost (Yesterday) ==="
|
|
# Requires BigQuery billing export
|
|
# bq query --use_legacy_sql=false "SELECT SUM(cost) FROM billing.gcp_billing_export WHERE DATE(usage_start_time) = CURRENT_DATE() - 1"
|
|
|
|
# Check capacity
|
|
echo -e "\n=== Capacity ==="
|
|
gcloud run services describe ruvector-streaming \
|
|
--region=us-central1 \
|
|
--format='value(spec.template.spec.containerConcurrency,status.observedGeneration)'
|
|
```
|
|
|
|
#### Scaling Operations
|
|
```bash
|
|
# Manually scale up for planned event
|
|
gcloud run services update ruvector-streaming \
|
|
--region=us-central1 \
|
|
--min-instances=100 \
|
|
--max-instances=1000
|
|
|
|
# Manually scale down after event
|
|
gcloud run services update ruvector-streaming \
|
|
--region=us-central1 \
|
|
--min-instances=10 \
|
|
--max-instances=500
|
|
|
|
# Or use the burst predictor
|
|
cd /home/user/ruvector/src/burst-scaling
|
|
node dist/burst-predictor.js --event "Product Launch" --time "2025-12-01T10:00:00Z"
|
|
```
|
|
|
|
### Weekly Operations
|
|
|
|
#### Performance Review
|
|
```bash
|
|
# Generate weekly performance report
|
|
cd /home/user/ruvector/benchmarks
|
|
npm run report -- --period "last-7-days" --format pdf
|
|
|
|
# Review metrics:
|
|
# - Average latency trends
|
|
# - Error rate trends
|
|
# - Cost per million queries
|
|
# - Capacity utilization
|
|
```
|
|
|
|
#### Cost Optimization
|
|
```bash
|
|
# Identify idle resources
|
|
gcloud run services list --format='table(
|
|
metadata.name,
|
|
metadata.namespace,
|
|
status.url,
|
|
status.traffic[0].percent
|
|
)' | grep "0%"
|
|
|
|
# Review committed use discounts
|
|
gcloud compute commitments list
|
|
|
|
# Check for underutilized databases
|
|
gcloud sql instances list --format='table(
|
|
name,
|
|
region,
|
|
settings.tier,
|
|
state
|
|
)' | grep RUNNABLE
|
|
```
|
|
|
|
### Monthly Operations
|
|
|
|
#### Capacity Planning
|
|
```bash
|
|
# Analyze growth trends
|
|
# Review last 3 months of connection counts
|
|
# Project next month's capacity needs
|
|
# Request quota increases if needed
|
|
|
|
# Request quota increase
|
|
gcloud compute project-info describe --project=$PROJECT_ID
|
|
gcloud compute regions describe us-central1 --format='value(quotas)'
|
|
|
|
# Submit increase request if needed
|
|
gcloud compute project-info add-metadata \
|
|
--metadata=quotas='{"CPUS":"10000","DISKS_TOTAL_GB":"100000"}'
|
|
```
|
|
|
|
#### Security Updates
|
|
```bash
|
|
# Update container images
|
|
cd /home/user/ruvector
|
|
git pull origin main
|
|
docker build -t gcr.io/${PROJECT_ID}/ruvector-streaming:latest .
|
|
docker push gcr.io/${PROJECT_ID}/ruvector-streaming:latest
|
|
|
|
# Rolling update
|
|
gcloud run services update ruvector-streaming \
|
|
--image=gcr.io/${PROJECT_ID}/ruvector-streaming:latest \
|
|
--region=us-central1
|
|
|
|
# Verify update
|
|
gcloud run revisions list --service=ruvector-streaming --region=us-central1
|
|
```
|
|
|
|
---
|
|
|
|
## Troubleshooting
|
|
|
|
### Issue: High Latency (P99 > 50ms)
|
|
|
|
**Diagnosis**:
|
|
```bash
|
|
# Check database connections
|
|
gcloud sql operations list --instance=ruvector-db --limit=10
|
|
|
|
# Check Redis hit rates
|
|
gcloud redis instances describe ruvector-cache-us-central1 \
|
|
--region=us-central1 \
|
|
--format='value(metadata.stats.hitRate)'
|
|
|
|
# Check Cloud Run cold starts
|
|
gcloud run services describe ruvector-streaming \
|
|
--region=us-central1 \
|
|
--format='value(status.traffic[0].latestRevision)'
|
|
```
|
|
|
|
**Solutions**:
|
|
1. Increase min instances to reduce cold starts
|
|
2. Increase Redis memory or optimize cache keys
|
|
3. Add read replicas to database
|
|
4. Enable connection pooling
|
|
5. Review slow queries in database
|
|
|
|
### Issue: High Error Rate (> 1%)
|
|
|
|
**Diagnosis**:
|
|
```bash
|
|
# Check error types
|
|
gcloud logging read 'resource.type="cloud_run_revision" severity>=ERROR' \
|
|
--limit=100 \
|
|
--format=json | jq -r '.[] | .jsonPayload.error_type' | sort | uniq -c
|
|
|
|
# Check recent deployments
|
|
gcloud run revisions list --service=ruvector-streaming --region=us-central1 --limit=5
|
|
```
|
|
|
|
**Solutions**:
|
|
1. Rollback to previous revision if recent deploy
|
|
2. Check database connection pool exhaustion
|
|
3. Verify API rate limits not exceeded
|
|
4. Check for memory leaks (restart instances)
|
|
5. Review error logs for patterns
|
|
|
|
### Issue: Auto-Scaling Not Working
|
|
|
|
**Diagnosis**:
|
|
```bash
|
|
# Check scaling metrics
|
|
gcloud monitoring time-series list \
|
|
--filter='metric.type="run.googleapis.com/container/instance_count"' \
|
|
--interval-start-time="2025-01-01T00:00:00Z" \
|
|
--interval-end-time="2025-01-02T00:00:00Z"
|
|
|
|
# Check quotas
|
|
gcloud compute project-info describe --project=$PROJECT_ID | grep -A 5 "CPUS"
|
|
```
|
|
|
|
**Solutions**:
|
|
1. Request quota increase if limits hit
|
|
2. Check budget caps (may block scaling)
|
|
3. Verify IAM permissions for auto-scaler
|
|
4. Review scaling policies (min/max instances)
|
|
5. Check for regional capacity issues
|
|
|
|
### Issue: Regional Failover Not Working
|
|
|
|
**Diagnosis**:
|
|
```bash
|
|
# Check health checks
|
|
gcloud compute health-checks describe ruvector-health-check
|
|
|
|
# Check backend service health
|
|
gcloud compute backend-services get-health ruvector-backend-service --global
|
|
|
|
# Check load balancer configuration
|
|
gcloud compute url-maps describe ruvector-lb
|
|
```
|
|
|
|
**Solutions**:
|
|
1. Verify health check endpoints responding
|
|
2. Check firewall rules allow health checks
|
|
3. Verify backend services configured correctly
|
|
4. Check DNS propagation
|
|
5. Review load balancer logs
|
|
|
|
### Issue: Cost Overruns
|
|
|
|
**Diagnosis**:
|
|
```bash
|
|
# Check current spend
|
|
gcloud billing accounts list
|
|
|
|
# Identify expensive resources
|
|
gcloud compute instances list --format='table(name,zone,machineType,status)'
|
|
gcloud sql instances list --format='table(name,region,tier,status)'
|
|
gcloud redis instances list --format='table(name,region,tier,memorySizeGb)'
|
|
```
|
|
|
|
**Solutions**:
|
|
1. Scale down min instances in low-traffic regions
|
|
2. Reduce Redis memory size if underutilized
|
|
3. Downgrade database tier if CPU/memory low
|
|
4. Enable more aggressive CDN caching
|
|
5. Review and delete unused resources
|
|
|
|
---
|
|
|
|
## Rollback Procedures
|
|
|
|
### Rollback Cloud Run Service
|
|
```bash
|
|
# List revisions
|
|
gcloud run revisions list --service=ruvector-streaming --region=us-central1
|
|
|
|
# Rollback to previous revision
|
|
PREVIOUS_REVISION=$(gcloud run revisions list \
|
|
--service=ruvector-streaming \
|
|
--region=us-central1 \
|
|
--format='value(metadata.name)' \
|
|
--limit=2 | tail -n1)
|
|
|
|
gcloud run services update-traffic ruvector-streaming \
|
|
--region=us-central1 \
|
|
--to-revisions=$PREVIOUS_REVISION=100
|
|
```
|
|
|
|
### Rollback Infrastructure Changes
|
|
```bash
|
|
cd /home/user/ruvector/src/burst-scaling/terraform
|
|
|
|
# Revert to previous state
|
|
terraform state pull > current-state.tfstate
|
|
terraform state push previous-state.tfstate
|
|
terraform apply -auto-approve
|
|
```
|
|
|
|
### Emergency Shutdown
|
|
```bash
|
|
# Disable all traffic to service
|
|
gcloud run services update ruvector-streaming \
|
|
--region=us-central1 \
|
|
--max-instances=0
|
|
|
|
# Or delete service entirely
|
|
gcloud run services delete ruvector-streaming --region=us-central1 --quiet
|
|
```
|
|
|
|
---
|
|
|
|
## Cost Summary
|
|
|
|
### Initial Setup Costs
|
|
- One-time setup: ~$100 (testing, quota requests, etc.)
|
|
|
|
### Monthly Operating Costs (Baseline 500M concurrent)
|
|
- **Cloud Run**: $2.4M ($0.0048 per connection)
|
|
- **Cloud SQL**: $150K (3 regions, read replicas)
|
|
- **Redis**: $45K (3 regions, 128GB each)
|
|
- **Load Balancer + CDN**: $80K
|
|
- **Networking**: $50K
|
|
- **Monitoring + Logging**: $20K
|
|
- **Storage**: $5K
|
|
- **Total**: ~$2.75M/month (optimized)
|
|
|
|
### Burst Event Costs (World Cup 50x, 3 hours)
|
|
- **Cloud Run**: ~$80K
|
|
- **Database**: ~$2K (connection spikes)
|
|
- **Redis**: ~$500 (included in monthly)
|
|
- **Networking**: ~$5K
|
|
- **Total**: ~$88K per event
|
|
|
|
### Cost Optimization Tips
|
|
1. Use committed use discounts (30% savings)
|
|
2. Enable auto-scaling to scale down when idle
|
|
3. Increase CDN cache hit rate to reduce backend load
|
|
4. Use preemptible instances for non-critical workloads
|
|
5. Regularly review and delete unused resources
|
|
|
|
---
|
|
|
|
## Next Steps
|
|
|
|
1. **Complete Initial Deployment** (Phases 1-5)
|
|
2. **Run Validation Tests** (Phase 6)
|
|
3. **Schedule Load Tests** (Baseline, then burst)
|
|
4. **Set Up Monitoring Dashboard**
|
|
5. **Configure Alert Policies**
|
|
6. **Create Runbook** (Already created: `/home/user/ruvector/src/burst-scaling/RUNBOOK.md`)
|
|
7. **Train Team on Operations**
|
|
8. **Plan First Production Event** (Start small, scale up)
|
|
9. **Iterate and Optimize** (Based on real traffic)
|
|
|
|
---
|
|
|
|
## Additional Resources
|
|
|
|
- [Architecture Overview](./architecture-overview.md)
|
|
- [Scaling Strategy](./scaling-strategy.md)
|
|
- [Infrastructure Design](./infrastructure-design.md)
|
|
- [Load Test Scenarios](/home/user/ruvector/benchmarks/LOAD_TEST_SCENARIOS.md)
|
|
- [Operations Runbook](/home/user/ruvector/src/burst-scaling/RUNBOOK.md)
|
|
- [Benchmarking Guide](/home/user/ruvector/benchmarks/README.md)
|
|
- [GCP Cloud Run Docs](https://cloud.google.com/run/docs)
|
|
- [GCP Load Balancing Docs](https://cloud.google.com/load-balancing/docs)
|
|
|
|
---
|
|
|
|
## Support
|
|
|
|
For issues or questions:
|
|
- GitHub Issues: https://github.com/ruvnet/ruvector/issues
|
|
- Email: ops@ruvector.io
|
|
- Slack: #ruvector-ops
|
|
|
|
---
|
|
|
|
**Document Version**: 1.0
|
|
**Last Updated**: 2025-11-20
|
|
**Deployment Status**: Ready for Production
|