Merge commit 'd803bfe2b1fe7f5e219e50ac20d6801a0a58ac75' as 'vendor/ruvector'
This commit is contained in:
941
vendor/ruvector/docs/cloud-architecture/DEPLOYMENT_GUIDE.md
vendored
Normal file
941
vendor/ruvector/docs/cloud-architecture/DEPLOYMENT_GUIDE.md
vendored
Normal file
@@ -0,0 +1,941 @@
|
||||
# RuVector Global Deployment Guide
|
||||
|
||||
## Overview
|
||||
|
||||
This guide provides step-by-step instructions for deploying RuVector's globally distributed streaming system capable of handling 500 million concurrent learning streams with burst capacity up to 25 billion.
|
||||
|
||||
**Target Infrastructure**: Google Cloud Platform (GCP)
|
||||
**Architecture**: Multi-region Cloud Run with global load balancing
|
||||
**Deployment Time**: 4-6 hours for initial setup
|
||||
|
||||
---
|
||||
|
||||
## Table of Contents
|
||||
|
||||
1. [Prerequisites](#prerequisites)
|
||||
2. [Phase 1: Initial Setup](#phase-1-initial-setup)
|
||||
3. [Phase 2: Core Infrastructure](#phase-2-core-infrastructure)
|
||||
4. [Phase 3: Multi-Region Deployment](#phase-3-multi-region-deployment)
|
||||
5. [Phase 4: Load Balancing & CDN](#phase-4-load-balancing--cdn)
|
||||
6. [Phase 5: Monitoring & Alerting](#phase-5-monitoring--alerting)
|
||||
7. [Phase 6: Validation & Testing](#phase-6-validation--testing)
|
||||
8. [Operations](#operations)
|
||||
9. [Troubleshooting](#troubleshooting)
|
||||
|
||||
---
|
||||
|
||||
## Prerequisites
|
||||
|
||||
### Required Tools
|
||||
```bash
|
||||
# Install gcloud CLI
|
||||
curl https://sdk.cloud.google.com | bash
|
||||
exec -l $SHELL
|
||||
|
||||
# Install Terraform
|
||||
wget https://releases.hashicorp.com/terraform/1.6.0/terraform_1.6.0_linux_amd64.zip
|
||||
unzip terraform_1.6.0_linux_amd64.zip
|
||||
sudo mv terraform /usr/local/bin/
|
||||
|
||||
# Install Node.js 18+
|
||||
curl -fsSL https://deb.nodesource.com/setup_18.x | sudo -E bash -
|
||||
sudo apt-get install -y nodejs
|
||||
|
||||
# Install Rust (for building ruvector core)
|
||||
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
|
||||
source $HOME/.cargo/env
|
||||
|
||||
# Install Docker
|
||||
sudo apt-get update
|
||||
sudo apt-get install -y docker.io
|
||||
sudo usermod -aG docker $USER
|
||||
|
||||
# Install K6 (for load testing)
|
||||
sudo gpg -k
|
||||
sudo gpg --no-default-keyring --keyring /usr/share/keyrings/k6-archive-keyring.gpg --keyserver hkp://keyserver.ubuntu.com:80 --recv-keys C5AD17C747E3415A3642D57D77C6C491D6AC1D69
|
||||
echo "deb [signed-by=/usr/share/keyrings/k6-archive-keyring.gpg] https://dl.k6.io/deb stable main" | sudo tee /etc/apt/sources.list.d/k6.list
|
||||
sudo apt-get update
|
||||
sudo apt-get install k6
|
||||
```
|
||||
|
||||
### GCP Project Setup
|
||||
```bash
|
||||
# Set project variables
|
||||
export PROJECT_ID="your-project-id"
|
||||
export PROJECT_NUMBER=$(gcloud projects describe $PROJECT_ID --format="value(projectNumber)")
|
||||
export BILLING_ACCOUNT="your-billing-account-id"
|
||||
|
||||
# Authenticate
|
||||
gcloud auth login
|
||||
gcloud auth application-default login
|
||||
|
||||
# Set default project
|
||||
gcloud config set project $PROJECT_ID
|
||||
|
||||
# Enable billing
|
||||
gcloud billing projects link $PROJECT_ID --billing-account=$BILLING_ACCOUNT
|
||||
```
|
||||
|
||||
### Enable Required APIs
|
||||
```bash
|
||||
# Enable all required GCP APIs
|
||||
gcloud services enable \
|
||||
run.googleapis.com \
|
||||
compute.googleapis.com \
|
||||
sql-component.googleapis.com \
|
||||
sqladmin.googleapis.com \
|
||||
redis.googleapis.com \
|
||||
servicenetworking.googleapis.com \
|
||||
vpcaccess.googleapis.com \
|
||||
cloudscheduler.googleapis.com \
|
||||
cloudtasks.googleapis.com \
|
||||
pubsub.googleapis.com \
|
||||
monitoring.googleapis.com \
|
||||
logging.googleapis.com \
|
||||
cloudtrace.googleapis.com \
|
||||
cloudbuild.googleapis.com \
|
||||
artifactregistry.googleapis.com \
|
||||
secretmanager.googleapis.com \
|
||||
cloudresourcemanager.googleapis.com \
|
||||
iamcredentials.googleapis.com \
|
||||
cloudfunctions.googleapis.com \
|
||||
networkconnectivity.googleapis.com
|
||||
```
|
||||
|
||||
### Service Accounts
|
||||
```bash
|
||||
# Create service accounts
|
||||
gcloud iam service-accounts create ruvector-cloudrun \
|
||||
--display-name="RuVector Cloud Run Service Account"
|
||||
|
||||
gcloud iam service-accounts create ruvector-deployer \
|
||||
--display-name="RuVector CI/CD Deployer"
|
||||
|
||||
# Grant necessary permissions
|
||||
export CLOUDRUN_SA="ruvector-cloudrun@${PROJECT_ID}.iam.gserviceaccount.com"
|
||||
export DEPLOYER_SA="ruvector-deployer@${PROJECT_ID}.iam.gserviceaccount.com"
|
||||
|
||||
# Cloud Run permissions
|
||||
gcloud projects add-iam-policy-binding $PROJECT_ID \
|
||||
--member="serviceAccount:${CLOUDRUN_SA}" \
|
||||
--role="roles/cloudsql.client"
|
||||
|
||||
gcloud projects add-iam-policy-binding $PROJECT_ID \
|
||||
--member="serviceAccount:${CLOUDRUN_SA}" \
|
||||
--role="roles/redis.editor"
|
||||
|
||||
gcloud projects add-iam-policy-binding $PROJECT_ID \
|
||||
--member="serviceAccount:${CLOUDRUN_SA}" \
|
||||
--role="roles/pubsub.publisher"
|
||||
|
||||
gcloud projects add-iam-policy-binding $PROJECT_ID \
|
||||
--member="serviceAccount:${CLOUDRUN_SA}" \
|
||||
--role="roles/secretmanager.secretAccessor"
|
||||
|
||||
# Deployer permissions
|
||||
gcloud projects add-iam-policy-binding $PROJECT_ID \
|
||||
--member="serviceAccount:${DEPLOYER_SA}" \
|
||||
--role="roles/run.admin"
|
||||
|
||||
gcloud projects add-iam-policy-binding $PROJECT_ID \
|
||||
--member="serviceAccount:${DEPLOYER_SA}" \
|
||||
--role="roles/iam.serviceAccountUser"
|
||||
|
||||
gcloud projects add-iam-policy-binding $PROJECT_ID \
|
||||
--member="serviceAccount:${DEPLOYER_SA}" \
|
||||
--role="roles/cloudbuild.builds.editor"
|
||||
```
|
||||
|
||||
### Budget Alerts
|
||||
```bash
|
||||
# Create budget (adjust amounts as needed)
|
||||
gcloud billing budgets create \
|
||||
--billing-account=$BILLING_ACCOUNT \
|
||||
--display-name="RuVector Monthly Budget" \
|
||||
--budget-amount=500000 \
|
||||
--threshold-rule=percent=50 \
|
||||
--threshold-rule=percent=80 \
|
||||
--threshold-rule=percent=100 \
|
||||
--threshold-rule=percent=120
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Phase 1: Initial Setup
|
||||
|
||||
### 1.1 Clone Repository
|
||||
```bash
|
||||
cd /home/user
|
||||
git clone https://github.com/ruvnet/ruvector.git
|
||||
cd ruvector
|
||||
```
|
||||
|
||||
### 1.2 Build Rust Core
|
||||
```bash
|
||||
# Build ruvector core
|
||||
cargo build --release
|
||||
|
||||
# Build Node.js bindings
|
||||
cd crates/ruvector-node
|
||||
npm install
|
||||
npm run build
|
||||
|
||||
cd ../..
|
||||
```
|
||||
|
||||
### 1.3 Configure Environment
|
||||
```bash
|
||||
# Create terraform variables file
|
||||
cd /home/user/ruvector/src/burst-scaling/terraform
|
||||
|
||||
cat > terraform.tfvars <<EOF
|
||||
project_id = "${PROJECT_ID}"
|
||||
project_number = "${PROJECT_NUMBER}"
|
||||
|
||||
regions = [
|
||||
{
|
||||
name = "us-central1"
|
||||
display_name = "Iowa (US Central)"
|
||||
priority = 10
|
||||
capacity = 80000000
|
||||
cost_per_hour = 0.50
|
||||
min_instances = 10
|
||||
max_instances = 800
|
||||
},
|
||||
{
|
||||
name = "europe-west1"
|
||||
display_name = "Belgium (Europe West)"
|
||||
priority = 9
|
||||
capacity = 80000000
|
||||
cost_per_hour = 0.55
|
||||
min_instances = 10
|
||||
max_instances = 800
|
||||
},
|
||||
{
|
||||
name = "asia-east1"
|
||||
display_name = "Taiwan (Asia East)"
|
||||
priority = 8
|
||||
capacity = 80000000
|
||||
cost_per_hour = 0.60
|
||||
min_instances = 10
|
||||
max_instances = 800
|
||||
}
|
||||
]
|
||||
|
||||
service_config = {
|
||||
cpu = "4"
|
||||
memory = "16Gi"
|
||||
concurrency = 100
|
||||
timeout_seconds = 300
|
||||
max_instances_per_region = 1000
|
||||
}
|
||||
|
||||
scaling_config = {
|
||||
enable_adaptive_scaling = true
|
||||
target_cpu_utilization = 0.70
|
||||
target_memory_utilization = 0.75
|
||||
target_concurrency = 0.80
|
||||
scale_down_delay = "300s"
|
||||
min_scale_down_rate = 2
|
||||
max_scale_up_rate = 10
|
||||
}
|
||||
|
||||
budget_limits = {
|
||||
hourly_limit_usd = 10000
|
||||
daily_limit_usd = 200000
|
||||
monthly_limit_usd = 5000000
|
||||
alert_threshold = 0.80
|
||||
}
|
||||
|
||||
# Customize with your settings
|
||||
redis_memory_gb = 128
|
||||
database_tier = "db-custom-4-16384"
|
||||
enable_cdn = true
|
||||
enable_armor = true
|
||||
|
||||
# Notification channels (replace with your values)
|
||||
slack_webhook_url = "https://hooks.slack.com/services/YOUR/WEBHOOK/URL"
|
||||
email_addresses = ["ops@yourcompany.com"]
|
||||
EOF
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Phase 2: Core Infrastructure
|
||||
|
||||
### 2.1 Deploy with Terraform
|
||||
```bash
|
||||
cd /home/user/ruvector/src/burst-scaling/terraform
|
||||
|
||||
# Initialize Terraform
|
||||
terraform init
|
||||
|
||||
# Review plan
|
||||
terraform plan -out=tfplan
|
||||
|
||||
# Apply infrastructure
|
||||
terraform apply tfplan
|
||||
```
|
||||
|
||||
**This will create**:
|
||||
- VPC network with subnets
|
||||
- Cloud SQL PostgreSQL instance with replicas
|
||||
- Memorystore Redis instances per region
|
||||
- Cloud Storage buckets
|
||||
- Pub/Sub topics
|
||||
- Secrets Manager secrets
|
||||
- IAM bindings
|
||||
- Monitoring policies
|
||||
|
||||
**Expected time**: 30-45 minutes
|
||||
|
||||
### 2.2 Initialize Database
|
||||
```bash
|
||||
# Get Cloud SQL connection name
|
||||
export SQL_CONNECTION=$(terraform output -raw sql_connection_name)
|
||||
|
||||
# Connect to database
|
||||
gcloud sql connect ruvector-db --user=postgres --quiet
|
||||
|
||||
# Run initialization script
|
||||
\i /home/user/ruvector/src/cloud-architecture/init-database.sql
|
||||
|
||||
# Exit psql
|
||||
\q
|
||||
```
|
||||
|
||||
**Database Schema Created**:
|
||||
- `vectors` table with metadata
|
||||
- `agents` table for coordination
|
||||
- `tasks` table for distributed work
|
||||
- `sync_queue` for cross-region sync
|
||||
- Indexes and partitions
|
||||
|
||||
### 2.3 Configure Secrets
|
||||
```bash
|
||||
# Store database password
|
||||
echo -n "your-secure-password" | gcloud secrets create db-password \
|
||||
--data-file=- \
|
||||
--replication-policy="automatic"
|
||||
|
||||
# Store Redis connection strings (per region)
|
||||
for region in us-central1 europe-west1 asia-east1; do
|
||||
REDIS_HOST=$(terraform output -json redis_hosts | jq -r ".${region}")
|
||||
echo "redis://${REDIS_HOST}:6379" | gcloud secrets create redis-connection-${region} \
|
||||
--data-file=- \
|
||||
--replication-policy="automatic"
|
||||
done
|
||||
|
||||
# Store API keys and other secrets
|
||||
echo -n "your-api-key" | gcloud secrets create api-key \
|
||||
--data-file=- \
|
||||
--replication-policy="automatic"
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Phase 3: Multi-Region Deployment
|
||||
|
||||
### 3.1 Build Container Image
|
||||
```bash
|
||||
cd /home/user/ruvector
|
||||
|
||||
# Build multi-arch image
|
||||
docker buildx create --use
|
||||
docker buildx build --platform linux/amd64,linux/arm64 \
|
||||
-t gcr.io/${PROJECT_ID}/ruvector-streaming:latest \
|
||||
-f src/cloud-run/Dockerfile \
|
||||
--push .
|
||||
```
|
||||
|
||||
### 3.2 Deploy to Regions (Automated)
|
||||
```bash
|
||||
cd /home/user/ruvector/src/cloud-run
|
||||
|
||||
# Use Cloud Build for multi-region deployment
|
||||
gcloud builds submit --config=cloudbuild.yaml
|
||||
```
|
||||
|
||||
**This deploys to**:
|
||||
- us-central1
|
||||
- us-east1
|
||||
- us-west1
|
||||
- europe-west1
|
||||
- europe-west3
|
||||
- europe-north1
|
||||
- asia-east1
|
||||
- asia-southeast1
|
||||
- asia-northeast1
|
||||
|
||||
**Deployment strategy**: Canary (10% → 50% → 100%)
|
||||
|
||||
**Expected time**: 20-30 minutes
|
||||
|
||||
### 3.3 Verify Regional Deployments
|
||||
```bash
|
||||
# List all deployments
|
||||
gcloud run services list --platform managed
|
||||
|
||||
# Test each region
|
||||
for region in us-central1 europe-west1 asia-east1; do
|
||||
URL=$(gcloud run services describe ruvector-streaming \
|
||||
--region=$region \
|
||||
--format='value(status.url)')
|
||||
|
||||
curl -f "$URL/health" && echo "✓ $region healthy" || echo "✗ $region failed"
|
||||
done
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Phase 4: Load Balancing & CDN
|
||||
|
||||
### 4.1 Create Backend Services
|
||||
```bash
|
||||
# This is handled by Terraform, but verify:
|
||||
gcloud compute backend-services list
|
||||
|
||||
# Should show backend services for each region
|
||||
```
|
||||
|
||||
### 4.2 Configure Global Load Balancer
|
||||
```bash
|
||||
# Create URL map
|
||||
gcloud compute url-maps create ruvector-lb \
|
||||
--default-backend-bucket=ruvector-static-assets
|
||||
|
||||
# Add backend service
|
||||
gcloud compute url-maps add-path-matcher ruvector-lb \
|
||||
--path-matcher-name=ruvector-matcher \
|
||||
--default-service=ruvector-backend-service \
|
||||
--new-hosts='*.ruvector.io'
|
||||
|
||||
# Create target HTTPS proxy
|
||||
gcloud compute target-https-proxies create ruvector-https-proxy \
|
||||
--url-map=ruvector-lb \
|
||||
--ssl-certificates=ruvector-ssl-cert
|
||||
|
||||
# Create global forwarding rule
|
||||
gcloud compute forwarding-rules create ruvector-https-rule \
|
||||
--global \
|
||||
--target-https-proxy=ruvector-https-proxy \
|
||||
--ports=443
|
||||
```
|
||||
|
||||
### 4.3 Configure Cloud CDN
|
||||
```bash
|
||||
# Enable CDN on backend service
|
||||
gcloud compute backend-services update ruvector-backend-service \
|
||||
--enable-cdn \
|
||||
--cache-mode=CACHE_ALL_STATIC \
|
||||
--default-ttl=3600 \
|
||||
--max-ttl=86400 \
|
||||
--client-ttl=3600 \
|
||||
--global
|
||||
```
|
||||
|
||||
### 4.4 DNS Configuration
|
||||
```bash
|
||||
# Get load balancer IP
|
||||
LB_IP=$(gcloud compute addresses describe ruvector-global-ip \
|
||||
--global \
|
||||
--format='value(address)')
|
||||
|
||||
echo "Configure your DNS:"
|
||||
echo "A record: ruvector.io -> $LB_IP"
|
||||
echo "A record: *.ruvector.io -> $LB_IP"
|
||||
```
|
||||
|
||||
**Manually configure in your DNS provider**:
|
||||
- `ruvector.io` A record → `$LB_IP`
|
||||
- `*.ruvector.io` A record → `$LB_IP`
|
||||
|
||||
### 4.5 SSL Certificate
|
||||
```bash
|
||||
# Create managed SSL certificate (auto-renewal)
|
||||
gcloud compute ssl-certificates create ruvector-ssl-cert \
|
||||
--domains=ruvector.io,api.ruvector.io,*.ruvector.io \
|
||||
--global
|
||||
|
||||
# Wait for certificate provisioning (can take 15-30 minutes)
|
||||
gcloud compute ssl-certificates list
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Phase 5: Monitoring & Alerting
|
||||
|
||||
### 5.1 Import Monitoring Dashboard
|
||||
```bash
|
||||
cd /home/user/ruvector/src/burst-scaling
|
||||
|
||||
# Create dashboard
|
||||
gcloud monitoring dashboards create \
|
||||
--config-from-file=monitoring-dashboard.json
|
||||
```
|
||||
|
||||
**Dashboard includes**:
|
||||
- Connection counts per region
|
||||
- Latency percentiles
|
||||
- Error rates
|
||||
- Resource utilization
|
||||
- Cost tracking
|
||||
|
||||
### 5.2 Configure Alert Policies
|
||||
```bash
|
||||
# High latency alert
|
||||
gcloud alpha monitoring policies create \
|
||||
--notification-channels=CHANNEL_ID \
|
||||
--display-name="High P99 Latency" \
|
||||
--condition-display-name="P99 > 50ms" \
|
||||
--condition-threshold-value=0.050 \
|
||||
--condition-threshold-duration=60s \
|
||||
--aggregation-alignment-period=60s
|
||||
|
||||
# High error rate alert
|
||||
gcloud alpha monitoring policies create \
|
||||
--notification-channels=CHANNEL_ID \
|
||||
--display-name="High Error Rate" \
|
||||
--condition-display-name="Errors > 1%" \
|
||||
--condition-threshold-value=0.01 \
|
||||
--condition-threshold-duration=300s
|
||||
|
||||
# Region unhealthy alert
|
||||
gcloud alpha monitoring policies create \
|
||||
--notification-channels=CHANNEL_ID \
|
||||
--display-name="Region Unhealthy" \
|
||||
--condition-display-name="Health Check Failed" \
|
||||
--condition-threshold-value=1 \
|
||||
--condition-threshold-duration=180s
|
||||
```
|
||||
|
||||
### 5.3 Log-Based Metrics
|
||||
```bash
|
||||
# Create custom metrics from logs
|
||||
gcloud logging metrics create error_rate \
|
||||
--description="Application error rate" \
|
||||
--log-filter='resource.type="cloud_run_revision"
|
||||
severity>=ERROR'
|
||||
|
||||
gcloud logging metrics create connection_count \
|
||||
--description="Active connection count" \
|
||||
--log-filter='resource.type="cloud_run_revision"
|
||||
jsonPayload.event="connection_established"' \
|
||||
--value-extractor='EXTRACT(jsonPayload.connection_id)'
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Phase 6: Validation & Testing
|
||||
|
||||
### 6.1 Smoke Test
|
||||
```bash
|
||||
cd /home/user/ruvector/benchmarks
|
||||
|
||||
# Run quick validation (2 hours)
|
||||
npm run test:quick
|
||||
|
||||
# Expected output:
|
||||
# ✓ Baseline load test passed
|
||||
# ✓ Single region test passed
|
||||
# ✓ Basic failover test passed
|
||||
# ✓ Mixed workload test passed
|
||||
```
|
||||
|
||||
### 6.2 Load Test (Baseline)
|
||||
```bash
|
||||
# Run baseline 500M concurrent test (4 hours)
|
||||
npm run scenario:baseline-500m
|
||||
|
||||
# Monitor progress
|
||||
npm run dashboard
|
||||
# Open http://localhost:8000/visualization-dashboard.html
|
||||
```
|
||||
|
||||
**Success criteria**:
|
||||
- P99 latency < 50ms
|
||||
- P50 latency < 10ms
|
||||
- Error rate < 0.1%
|
||||
- All regions healthy
|
||||
|
||||
### 6.3 Burst Test (10x)
|
||||
```bash
|
||||
# Run 10x burst test (2 hours)
|
||||
npm run scenario:product-launch-10x
|
||||
|
||||
# This will spike to 5B concurrent
|
||||
```
|
||||
|
||||
**Success criteria**:
|
||||
- System survives without crash
|
||||
- P99 latency < 100ms
|
||||
- Auto-scaling completes within 60s
|
||||
- Error rate < 2%
|
||||
|
||||
### 6.4 Failover Test
|
||||
```bash
|
||||
# Run regional failover test (1 hour)
|
||||
npm run scenario:region-failover
|
||||
|
||||
# This will simulate region failure
|
||||
```
|
||||
|
||||
**Success criteria**:
|
||||
- Failover completes within 60s
|
||||
- Connection loss < 5%
|
||||
- No cascading failures
|
||||
|
||||
---
|
||||
|
||||
## Operations
|
||||
|
||||
### Daily Operations
|
||||
|
||||
#### Morning Checklist
|
||||
```bash
|
||||
#!/bin/bash
|
||||
# Save as: /home/user/ruvector/scripts/daily-check.sh
|
||||
|
||||
# Check service health
|
||||
echo "=== Service Health ==="
|
||||
for region in us-central1 europe-west1 asia-east1; do
|
||||
gcloud run services describe ruvector-streaming \
|
||||
--region=$region \
|
||||
--format='value(status.conditions[0].status)' | \
|
||||
grep -q "True" && echo "✓ $region" || echo "✗ $region UNHEALTHY"
|
||||
done
|
||||
|
||||
# Check error rates (last 24h)
|
||||
echo -e "\n=== Error Rates (24h) ==="
|
||||
gcloud logging read 'resource.type="cloud_run_revision" severity>=ERROR' \
|
||||
--limit=10 \
|
||||
--format=json | jq -r '.[].jsonPayload.message'
|
||||
|
||||
# Check costs (yesterday)
|
||||
echo -e "\n=== Cost (Yesterday) ==="
|
||||
# Requires BigQuery billing export
|
||||
# bq query --use_legacy_sql=false "SELECT SUM(cost) FROM billing.gcp_billing_export WHERE DATE(usage_start_time) = CURRENT_DATE() - 1"
|
||||
|
||||
# Check capacity
|
||||
echo -e "\n=== Capacity ==="
|
||||
gcloud run services describe ruvector-streaming \
|
||||
--region=us-central1 \
|
||||
--format='value(spec.template.spec.containerConcurrency,status.observedGeneration)'
|
||||
```
|
||||
|
||||
#### Scaling Operations
|
||||
```bash
|
||||
# Manually scale up for planned event
|
||||
gcloud run services update ruvector-streaming \
|
||||
--region=us-central1 \
|
||||
--min-instances=100 \
|
||||
--max-instances=1000
|
||||
|
||||
# Manually scale down after event
|
||||
gcloud run services update ruvector-streaming \
|
||||
--region=us-central1 \
|
||||
--min-instances=10 \
|
||||
--max-instances=500
|
||||
|
||||
# Or use the burst predictor
|
||||
cd /home/user/ruvector/src/burst-scaling
|
||||
node dist/burst-predictor.js --event "Product Launch" --time "2025-12-01T10:00:00Z"
|
||||
```
|
||||
|
||||
### Weekly Operations
|
||||
|
||||
#### Performance Review
|
||||
```bash
|
||||
# Generate weekly performance report
|
||||
cd /home/user/ruvector/benchmarks
|
||||
npm run report -- --period "last-7-days" --format pdf
|
||||
|
||||
# Review metrics:
|
||||
# - Average latency trends
|
||||
# - Error rate trends
|
||||
# - Cost per million queries
|
||||
# - Capacity utilization
|
||||
```
|
||||
|
||||
#### Cost Optimization
|
||||
```bash
|
||||
# Identify idle resources
|
||||
gcloud run services list --format='table(
|
||||
metadata.name,
|
||||
metadata.namespace,
|
||||
status.url,
|
||||
status.traffic[0].percent
|
||||
)' | grep "0%"
|
||||
|
||||
# Review committed use discounts
|
||||
gcloud compute commitments list
|
||||
|
||||
# Check for underutilized databases
|
||||
gcloud sql instances list --format='table(
|
||||
name,
|
||||
region,
|
||||
settings.tier,
|
||||
state
|
||||
)' | grep RUNNABLE
|
||||
```
|
||||
|
||||
### Monthly Operations
|
||||
|
||||
#### Capacity Planning
|
||||
```bash
|
||||
# Analyze growth trends
|
||||
# Review last 3 months of connection counts
|
||||
# Project next month's capacity needs
|
||||
# Request quota increases if needed
|
||||
|
||||
# Request quota increase
|
||||
gcloud compute project-info describe --project=$PROJECT_ID
|
||||
gcloud compute regions describe us-central1 --format='value(quotas)'
|
||||
|
||||
# Submit increase request if needed
|
||||
gcloud compute project-info add-metadata \
|
||||
--metadata=quotas='{"CPUS":"10000","DISKS_TOTAL_GB":"100000"}'
|
||||
```
|
||||
|
||||
#### Security Updates
|
||||
```bash
|
||||
# Update container images
|
||||
cd /home/user/ruvector
|
||||
git pull origin main
|
||||
docker build -t gcr.io/${PROJECT_ID}/ruvector-streaming:latest .
|
||||
docker push gcr.io/${PROJECT_ID}/ruvector-streaming:latest
|
||||
|
||||
# Rolling update
|
||||
gcloud run services update ruvector-streaming \
|
||||
--image=gcr.io/${PROJECT_ID}/ruvector-streaming:latest \
|
||||
--region=us-central1
|
||||
|
||||
# Verify update
|
||||
gcloud run revisions list --service=ruvector-streaming --region=us-central1
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Troubleshooting
|
||||
|
||||
### Issue: High Latency (P99 > 50ms)
|
||||
|
||||
**Diagnosis**:
|
||||
```bash
|
||||
# Check database connections
|
||||
gcloud sql operations list --instance=ruvector-db --limit=10
|
||||
|
||||
# Check Redis hit rates
|
||||
gcloud redis instances describe ruvector-cache-us-central1 \
|
||||
--region=us-central1 \
|
||||
--format='value(metadata.stats.hitRate)'
|
||||
|
||||
# Check Cloud Run cold starts
|
||||
gcloud run services describe ruvector-streaming \
|
||||
--region=us-central1 \
|
||||
--format='value(status.traffic[0].latestRevision)'
|
||||
```
|
||||
|
||||
**Solutions**:
|
||||
1. Increase min instances to reduce cold starts
|
||||
2. Increase Redis memory or optimize cache keys
|
||||
3. Add read replicas to database
|
||||
4. Enable connection pooling
|
||||
5. Review slow queries in database
|
||||
|
||||
### Issue: High Error Rate (> 1%)
|
||||
|
||||
**Diagnosis**:
|
||||
```bash
|
||||
# Check error types
|
||||
gcloud logging read 'resource.type="cloud_run_revision" severity>=ERROR' \
|
||||
--limit=100 \
|
||||
--format=json | jq -r '.[] | .jsonPayload.error_type' | sort | uniq -c
|
||||
|
||||
# Check recent deployments
|
||||
gcloud run revisions list --service=ruvector-streaming --region=us-central1 --limit=5
|
||||
```
|
||||
|
||||
**Solutions**:
|
||||
1. Rollback to previous revision if recent deploy
|
||||
2. Check database connection pool exhaustion
|
||||
3. Verify API rate limits not exceeded
|
||||
4. Check for memory leaks (restart instances)
|
||||
5. Review error logs for patterns
|
||||
|
||||
### Issue: Auto-Scaling Not Working
|
||||
|
||||
**Diagnosis**:
|
||||
```bash
|
||||
# Check scaling metrics
|
||||
gcloud monitoring time-series list \
|
||||
--filter='metric.type="run.googleapis.com/container/instance_count"' \
|
||||
--interval-start-time="2025-01-01T00:00:00Z" \
|
||||
--interval-end-time="2025-01-02T00:00:00Z"
|
||||
|
||||
# Check quotas
|
||||
gcloud compute project-info describe --project=$PROJECT_ID | grep -A 5 "CPUS"
|
||||
```
|
||||
|
||||
**Solutions**:
|
||||
1. Request quota increase if limits hit
|
||||
2. Check budget caps (may block scaling)
|
||||
3. Verify IAM permissions for auto-scaler
|
||||
4. Review scaling policies (min/max instances)
|
||||
5. Check for regional capacity issues
|
||||
|
||||
### Issue: Regional Failover Not Working
|
||||
|
||||
**Diagnosis**:
|
||||
```bash
|
||||
# Check health checks
|
||||
gcloud compute health-checks describe ruvector-health-check
|
||||
|
||||
# Check backend service health
|
||||
gcloud compute backend-services get-health ruvector-backend-service --global
|
||||
|
||||
# Check load balancer configuration
|
||||
gcloud compute url-maps describe ruvector-lb
|
||||
```
|
||||
|
||||
**Solutions**:
|
||||
1. Verify health check endpoints responding
|
||||
2. Check firewall rules allow health checks
|
||||
3. Verify backend services configured correctly
|
||||
4. Check DNS propagation
|
||||
5. Review load balancer logs
|
||||
|
||||
### Issue: Cost Overruns
|
||||
|
||||
**Diagnosis**:
|
||||
```bash
|
||||
# Check current spend
|
||||
gcloud billing accounts list
|
||||
|
||||
# Identify expensive resources
|
||||
gcloud compute instances list --format='table(name,zone,machineType,status)'
|
||||
gcloud sql instances list --format='table(name,region,tier,status)'
|
||||
gcloud redis instances list --format='table(name,region,tier,memorySizeGb)'
|
||||
```
|
||||
|
||||
**Solutions**:
|
||||
1. Scale down min instances in low-traffic regions
|
||||
2. Reduce Redis memory size if underutilized
|
||||
3. Downgrade database tier if CPU/memory low
|
||||
4. Enable more aggressive CDN caching
|
||||
5. Review and delete unused resources
|
||||
|
||||
---
|
||||
|
||||
## Rollback Procedures
|
||||
|
||||
### Rollback Cloud Run Service
|
||||
```bash
|
||||
# List revisions
|
||||
gcloud run revisions list --service=ruvector-streaming --region=us-central1
|
||||
|
||||
# Rollback to previous revision
|
||||
PREVIOUS_REVISION=$(gcloud run revisions list \
|
||||
--service=ruvector-streaming \
|
||||
--region=us-central1 \
|
||||
--format='value(metadata.name)' \
|
||||
--limit=2 | tail -n1)
|
||||
|
||||
gcloud run services update-traffic ruvector-streaming \
|
||||
--region=us-central1 \
|
||||
--to-revisions=$PREVIOUS_REVISION=100
|
||||
```
|
||||
|
||||
### Rollback Infrastructure Changes
|
||||
```bash
|
||||
cd /home/user/ruvector/src/burst-scaling/terraform
|
||||
|
||||
# Revert to previous state
|
||||
terraform state pull > current-state.tfstate
|
||||
terraform state push previous-state.tfstate
|
||||
terraform apply -auto-approve
|
||||
```
|
||||
|
||||
### Emergency Shutdown
|
||||
```bash
|
||||
# Disable all traffic to service
|
||||
gcloud run services update ruvector-streaming \
|
||||
--region=us-central1 \
|
||||
--max-instances=0
|
||||
|
||||
# Or delete service entirely
|
||||
gcloud run services delete ruvector-streaming --region=us-central1 --quiet
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Cost Summary
|
||||
|
||||
### Initial Setup Costs
|
||||
- One-time setup: ~$100 (testing, quota requests, etc.)
|
||||
|
||||
### Monthly Operating Costs (Baseline 500M concurrent)
|
||||
- **Cloud Run**: $2.4M ($0.0048 per connection)
|
||||
- **Cloud SQL**: $150K (3 regions, read replicas)
|
||||
- **Redis**: $45K (3 regions, 128GB each)
|
||||
- **Load Balancer + CDN**: $80K
|
||||
- **Networking**: $50K
|
||||
- **Monitoring + Logging**: $20K
|
||||
- **Storage**: $5K
|
||||
- **Total**: ~$2.75M/month (optimized)
|
||||
|
||||
### Burst Event Costs (World Cup 50x, 3 hours)
|
||||
- **Cloud Run**: ~$80K
|
||||
- **Database**: ~$2K (connection spikes)
|
||||
- **Redis**: ~$500 (included in monthly)
|
||||
- **Networking**: ~$5K
|
||||
- **Total**: ~$88K per event
|
||||
|
||||
### Cost Optimization Tips
|
||||
1. Use committed use discounts (30% savings)
|
||||
2. Enable auto-scaling to scale down when idle
|
||||
3. Increase CDN cache hit rate to reduce backend load
|
||||
4. Use preemptible instances for non-critical workloads
|
||||
5. Regularly review and delete unused resources
|
||||
|
||||
---
|
||||
|
||||
## Next Steps
|
||||
|
||||
1. **Complete Initial Deployment** (Phases 1-5)
|
||||
2. **Run Validation Tests** (Phase 6)
|
||||
3. **Schedule Load Tests** (Baseline, then burst)
|
||||
4. **Set Up Monitoring Dashboard**
|
||||
5. **Configure Alert Policies**
|
||||
6. **Create Runbook** (Already created: `/home/user/ruvector/src/burst-scaling/RUNBOOK.md`)
|
||||
7. **Train Team on Operations**
|
||||
8. **Plan First Production Event** (Start small, scale up)
|
||||
9. **Iterate and Optimize** (Based on real traffic)
|
||||
|
||||
---
|
||||
|
||||
## Additional Resources
|
||||
|
||||
- [Architecture Overview](./architecture-overview.md)
|
||||
- [Scaling Strategy](./scaling-strategy.md)
|
||||
- [Infrastructure Design](./infrastructure-design.md)
|
||||
- [Load Test Scenarios](/home/user/ruvector/benchmarks/LOAD_TEST_SCENARIOS.md)
|
||||
- [Operations Runbook](/home/user/ruvector/src/burst-scaling/RUNBOOK.md)
|
||||
- [Benchmarking Guide](/home/user/ruvector/benchmarks/README.md)
|
||||
- [GCP Cloud Run Docs](https://cloud.google.com/run/docs)
|
||||
- [GCP Load Balancing Docs](https://cloud.google.com/load-balancing/docs)
|
||||
|
||||
---
|
||||
|
||||
## Support
|
||||
|
||||
For issues or questions:
|
||||
- GitHub Issues: https://github.com/ruvnet/ruvector/issues
|
||||
- Email: ops@ruvector.io
|
||||
- Slack: #ruvector-ops
|
||||
|
||||
---
|
||||
|
||||
**Document Version**: 1.0
|
||||
**Last Updated**: 2025-11-20
|
||||
**Deployment Status**: Ready for Production
|
||||
1190
vendor/ruvector/docs/cloud-architecture/PERFORMANCE_OPTIMIZATION_GUIDE.md
vendored
Normal file
1190
vendor/ruvector/docs/cloud-architecture/PERFORMANCE_OPTIMIZATION_GUIDE.md
vendored
Normal file
File diff suppressed because it is too large
Load Diff
1114
vendor/ruvector/docs/cloud-architecture/architecture-overview.md
vendored
Normal file
1114
vendor/ruvector/docs/cloud-architecture/architecture-overview.md
vendored
Normal file
File diff suppressed because it is too large
Load Diff
2034
vendor/ruvector/docs/cloud-architecture/infrastructure-design.md
vendored
Normal file
2034
vendor/ruvector/docs/cloud-architecture/infrastructure-design.md
vendored
Normal file
File diff suppressed because it is too large
Load Diff
1160
vendor/ruvector/docs/cloud-architecture/scaling-strategy.md
vendored
Normal file
1160
vendor/ruvector/docs/cloud-architecture/scaling-strategy.md
vendored
Normal file
File diff suppressed because it is too large
Load Diff
Reference in New Issue
Block a user