# Ruvector Global Streaming Architecture

## 500 Million Concurrent Streams on Google Cloud Run

**Version:** 1.0.0

**Last Updated:** 2025-11-20

**Target Scale:** 500M concurrent learning streams

**SLA Target:** 99.99% availability, <10ms p50, <50ms p99

---

## Executive Summary

This document outlines the architecture for scaling Ruvector to 500 million concurrent learning streams using Google Cloud Run with a global multi-region deployment. The design pairs Ruvector's Rust-native performance (<0.5ms base latency) with GCP's global infrastructure to deliver sub-10ms p50 latency and 99.99% availability.

**Key Architecture Principles:**

- **Stateless Service Layer**: Cloud Run services for horizontal scalability
- **Distributed State**: Regional vector data stores with eventual consistency
- **Edge-First Routing**: Cloud CDN + Load Balancer for proximity-based routing
- **Burst Resilience**: Predictive + reactive auto-scaling with 10-50x burst capacity
- **Multi-Region Active-Active**: 15+ global regions for low latency and fault tolerance

---

## 1. Global Multi-Region Topology

### 1.1 Regional Distribution

**Primary Regions (15 Core Deployments):**

```
Americas (5):
├── us-central1 (Iowa) - Primary US Hub
├── us-east1 (South Carolina) - East Coast
├── us-west1 (Oregon) - West Coast
├── southamerica-east1 (São Paulo) - LATAM Hub
└── northamerica-northeast1 (Montreal) - Canada

Europe (4):
├── europe-west1 (Belgium) - Primary EU Hub
├── europe-west2 (London) - UK/Finance
├── europe-west3 (Frankfurt) - Central Europe
└── europe-north1 (Finland) - Nordic Region

Asia-Pacific (5):
├── asia-northeast1 (Tokyo) - Japan Hub
├── asia-southeast1 (Singapore) - Southeast Asia Hub
├── australia-southeast1 (Sydney) - Australia/NZ
├── asia-south1 (Mumbai) - India Hub
└── asia-east1 (Taiwan) - Greater China

Middle East & Africa (1):
└── me-west1 (Tel Aviv) - MENA Region
```

**Capacity Distribution (Baseline):**

- Tier 1 Hubs (5): 80M streams each = 400M total
  - us-central1, europe-west1, asia-northeast1, asia-southeast1, southamerica-east1
- Tier 2 Regions (10): 10M streams each = 100M total
  - All other regions
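
These allocations sum exactly to the 500M target; a one-line sanity check:

```rust
fn main() {
    let tier1: u64 = 5 * 80_000_000; // 5 hubs x 80M streams
    let tier2: u64 = 10 * 10_000_000; // 10 regions x 10M streams
    assert_eq!(tier1 + tier2, 500_000_000);
    println!("Baseline capacity: {}M streams", (tier1 + tier2) / 1_000_000);
}
```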

**Geographic Load Distribution Strategy:**

```
User Location → Nearest Edge Location → Regional Cloud Run Service
                            ↓
                  Cloud CDN Cache Layer
                            ↓
               Regional Vector Data Store
                            ↓
            Cross-Region Replication (async)
```

### 1.2 Network Architecture

```
┌─────────────────────────────────────────────────────────────┐
│ Global Layer (Anycast IPv4/IPv6)                            │
│ ┌────────────────────────────────────────────────────┐      │
│ │ Cloud Load Balancer (Global HTTPS)                 │      │
│ │ - Anycast IP: 1 global IP address                  │      │
│ │ - SSL/TLS Termination (Google-managed certs)       │      │
│ │ - DDoS Protection (Cloud Armor)                    │      │
│ │ - Geo-routing based on client proximity            │      │
│ └────────────────────────────────────────────────────┘      │
└─────────────────────────────────────────────────────────────┘
                              ↓
┌─────────────────────────────────────────────────────────────┐
│ Edge Layer (120+ Edge Locations)                            │
│ ┌────────────────────────────────────────────────────┐      │
│ │ Cloud CDN                                          │      │
│ │ - Cache query responses (5-60s TTL)                │      │
│ │ - Cache embeddings/vectors (1-5 min TTL)           │      │
│ │ - Negative caching for rate limits                 │      │
│ │ - Compression (Brotli/gzip)                        │      │
│ │ - HTTP/3 (QUIC) support                            │      │
│ └────────────────────────────────────────────────────┘      │
└─────────────────────────────────────────────────────────────┘
                              ↓
┌─────────────────────────────────────────────────────────────┐
│ Regional Layer (15 Regions)                                 │
│ ┌────────────────────────────────────────────────────┐      │
│ │ Regional Backend Services                          │      │
│ │ - Load balancing algorithm: WEIGHTED_MAGLEV        │      │
│ │ - Session affinity: CLIENT_IP (5 min)              │      │
│ │ - Health checks: HTTP/2 gRPC (5s interval)         │      │
│ │ - Connection draining: 30s                         │      │
│ └────────────────────────────────────────────────────┘      │
└─────────────────────────────────────────────────────────────┘
                              ↓
┌─────────────────────────────────────────────────────────────┐
│ Compute Layer (Cloud Run Services)                          │
│ ┌────────────────────────────────────────────────────┐      │
│ │ Ruvector Streaming Service (per region)            │      │
│ │ - 500-5,000 instances (auto-scaled)                │      │
│ │ - 100 concurrent requests per instance             │      │
│ │ - HTTP/2 + gRPC streaming                          │      │
│ │ - WebSocket support for persistent connections     │      │
│ └────────────────────────────────────────────────────┘      │
└─────────────────────────────────────────────────────────────┘
```

---

## 2. Cloud Run Service Design

### 2.1 Service Architecture

**Ruvector Streaming Service Components:**

```
Core service structure (conceptual):

┌──────────────────────────────────────────┐
│ Cloud Run Container                      │
│ ┌────────────────────────────────────┐   │
│ │ HTTP/2 + gRPC Server               │   │
│ │ - Axum/Tonic framework             │   │
│ │ - 100 concurrent connections       │   │
│ │ - Keep-alive: 60s                  │   │
│ └────────────────────────────────────┘   │
│ ┌────────────────────────────────────┐   │
│ │ Ruvector Core Engine               │   │
│ │ - HNSW index (in-memory)           │   │
│ │ - SIMD-optimized search            │   │
│ │ - Product quantization             │   │
│ │ - Arena allocator                  │   │
│ └────────────────────────────────────┘   │
│ ┌────────────────────────────────────┐   │
│ │ Connection Pool Manager            │   │
│ │ - Redis (metadata)                 │   │
│ │ - Cloud Storage (vectors)          │   │
│ │ - Pub/Sub (coordination)           │   │
│ └────────────────────────────────────┘   │
│ ┌────────────────────────────────────┐   │
│ │ Memory-Mapped Vector Store         │   │
│ │ - Local NVMe SSD (hot data)        │   │
│ │ - 8GB vector cache per instance    │   │
│ │ - LRU eviction policy              │   │
│ └────────────────────────────────────┘   │
└──────────────────────────────────────────┘
```
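
As a rough sketch of the HTTP server layer above, a minimal Axum service exposing the health probe and a placeholder search route might look like this (the handlers and port handling are illustrative assumptions, not the actual Ruvector service code):

```rust
use axum::{routing::get, Router};
use std::net::SocketAddr;

// Readiness probe polled by the regional load balancer health checks.
async fn ready() -> &'static str {
    "ok"
}

// Placeholder search handler; the real service would query the
// in-memory HNSW index and stream results back.
async fn search() -> &'static str {
    r#"{"results":[]}"#
}

#[tokio::main]
async fn main() {
    let app = Router::new()
        .route("/health/ready", get(ready))
        .route("/api/v1/search", get(search));

    // Cloud Run injects the listening port via $PORT (defaults to 8080).
    let port: u16 = std::env::var("PORT")
        .ok()
        .and_then(|p| p.parse().ok())
        .unwrap_or(8080);
    let addr = SocketAddr::from(([0, 0, 0, 0], port));

    let listener = tokio::net::TcpListener::bind(addr).await.unwrap();
    axum::serve(listener, app).await.unwrap();
}
```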

### 2.2 Service Configuration

**Base Configuration (Per Instance):**

```yaml
service: ruvector-streaming
region: multi-region (15 regions)

resources:
  cpu: 4 vCPU
  memory: 16 GiB
  startup_cpu_boost: true

concurrency:
  max_per_instance: 100      # concurrent requests
  target_utilization: 0.70   # 70% target for headroom

scaling:
  min_instances: 500         # per region (baseline)
  max_instances: 5000        # per region (burst capacity)
  scale_down_delay: 180s     # 3 min cooldown

networking:
  vpc_connector: regional-vpc-connector
  vpc_egress: private-ranges-only

execution_environment: gen2
timeout: 300s                # 5 min for long-running streams
startup_timeout: 240s        # 4 min for HNSW index loading
```

**Container Specifications:**

- **Base Image:** `rust:1.77-alpine` (optimized for size)
- **Runtime:** Tokio async runtime with a Rayon thread pool
- **Binary Size:** ~15MB (stripped, LTO-optimized)
- **Cold Start:** <2s (with startup CPU boost)
- **Warm Start:** <100ms

### 2.3 Regional Deployment Strategy

**Deployment Topology:**

```
Each Region Deploys:
├── Primary Cluster (Active)
│   ├── 500-5,000 Cloud Run instances
│   ├── Regional Memorystore Redis (16GB-256GB)
│   ├── Regional Cloud SQL (metadata)
│   └── Regional Cloud Storage bucket (vectors)
├── Standby Cluster (Warm Standby)
│   ├── 50-100 instances (10% of primary)
│   └── Read-only replicas
└── Monitoring Stack
    ├── Cloud Monitoring dashboards
    ├── Cloud Logging (structured logs)
    └── Cloud Trace (distributed tracing)
```

**Traffic Distribution:**

- **Active-Active:** All regions serve traffic simultaneously
- **Geo-Routing:** Users routed to the nearest healthy region
- **Spillover:** Overloaded regions redirect to the nearest neighbor
- **Failover:** Automatic re-routing on region failure (<30s)

---

## 3. Load Balancing & Traffic Routing

### 3.1 Global Load Balancer Configuration

```yaml
load_balancer:
  type: EXTERNAL_MANAGED
  ip_version: IPV4_IPV6
  protocol: HTTPS

  ssl_policy:
    min_tls_version: TLS_1_2
    profile: MODERN

  backend_service:
    protocol: HTTP2
    port: 443
    timeout: 300s

    load_balancing_scheme: WEIGHTED_MAGLEV
    session_affinity: CLIENT_IP
    affinity_cookie_ttl: 300s  # 5 min

    health_check:
      type: HTTP2
      port: 8080
      request_path: /health/ready
      check_interval: 5s
      timeout: 3s
      healthy_threshold: 2
      unhealthy_threshold: 3

  cdn_policy:
    cache_mode: CACHE_ALL_STATIC
    default_ttl: 30s
    max_ttl: 300s
    client_ttl: 30s
    negative_caching: true
    negative_caching_policy:
      - code: 404
        ttl: 60s
      - code: 429  # Rate limit
        ttl: 10s
```

### 3.2 Routing Strategy

**Request Flow:**

```
1. Client Request
        ↓
2. DNS Resolution (Anycast IP)
        ↓
3. Edge Location (Cloud CDN)
   ├─→ Cache HIT: Return cached response (<5ms)
   └─→ Cache MISS: Forward to backend
        ↓
4. Global Load Balancer
   ├─→ Route to nearest region (latency-based)
   ├─→ Check region health
   └─→ Apply rate limiting (Cloud Armor)
        ↓
5. Regional Backend Service
   ├─→ Select healthy Cloud Run instance
   ├─→ Connection pooling (reuse existing)
   └─→ Session affinity (same user → same instance)
        ↓
6. Cloud Run Instance
   ├─→ Check local cache (Memorystore Redis)
   ├─→ Query HNSW index (in-memory)
   └─→ Return results
        ↓
7. Response Path
   ├─→ Cache at edge (CDN)
   ├─→ Compress (Brotli)
   └─→ Return to client
```

**Routing Rules:**

```javascript
// Pseudo-code for routing logic
function routeRequest(request, regions) {
  const userLocation = geolocate(request.clientIP);
  const nearestRegions = findNearestRegions(userLocation, 3);

  for (const region of nearestRegions) {
    // Healthy region with at least 20% spare capacity
    if (region.health === 'HEALTHY' && region.capacity > 0.20) {
      return region;
    }
  }

  // Spillover to the least-loaded healthy region
  return findLeastLoadedRegion(regions.filter(r => r.health === 'HEALTHY'));
}
```

### 3.3 Cloud CDN Configuration

**Cache Strategy:**

```yaml
cdn_configuration:
  cache_key_policy:
    include_protocol: true
    include_host: true
    include_query_string: true
    query_string_whitelist:
      - query_vector_id
      - k        # top-k results
      - metric   # distance metric

  cache_rules:
    # Vector embedding queries (high cache hit rate)
    - path: /api/v1/embed/*
      cache_mode: FORCE_CACHE_ALL
      default_ttl: 300s  # 5 min

    # Search queries (moderate cache hit rate)
    - path: /api/v1/search
      cache_mode: USE_ORIGIN_HEADERS
      default_ttl: 30s

    # Real-time updates (no cache)
    - path: /api/v1/insert
      cache_mode: USE_ORIGIN_HEADERS  # origin sends Cache-Control: no-store

  negative_caching:
    enabled: true
    ttl: 60s
    status_codes: [404, 429, 500, 502, 503, 504]
```

**Cache Performance Targets:**

- **Hit Rate:** >60% (steady state), >80% (burst events)
- **Latency Reduction:** 5-15ms (edge) vs 30-50ms (origin)
- **Bandwidth Savings:** 40-60% reduction in origin traffic

---

## 4. Data Replication & Consistency

### 4.1 Data Architecture

**Three-Tier Storage Model:**

```
┌─────────────────────────────────────────────────────────┐
│ Tier 1: Hot Data (In-Memory)                            │
│ - Cloud Run instance memory (16GB per instance)         │
│ - HNSW index for active vectors                         │
│ - LRU cache (most recent 100K vectors per instance)     │
│ - Latency: <0.5ms                                       │
└─────────────────────────────────────────────────────────┘
                            ↓
┌─────────────────────────────────────────────────────────┐
│ Tier 2: Warm Data (Regional Cache)                      │
│ - Memorystore Redis (16GB-256GB per region)             │
│ - Recently accessed vectors (1M-10M vectors)            │
│ - TTL: 1 hour (sliding window)                          │
│ - Latency: 1-3ms                                        │
└─────────────────────────────────────────────────────────┘
                            ↓
┌─────────────────────────────────────────────────────────┐
│ Tier 3: Cold Data (Object Storage)                      │
│ - Cloud Storage (multi-region buckets)                  │
│ - Full vector database (billions of vectors)            │
│ - Memory-mapped files for large datasets                │
│ - Latency: 10-30ms (first access)                       │
└─────────────────────────────────────────────────────────┘
```
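
A regional read following this hierarchy could be sketched as below; this is a simplified illustration assuming the `redis` (tokio-comp feature) and `anyhow` crates, with `fetch_from_gcs` and the byte encoding as hypothetical stand-ins for the real Cloud Storage client and wire format:

```rust
use redis::AsyncCommands;
use std::collections::HashMap;

type LocalCache = HashMap<String, Vec<f32>>; // Tier 1: instance memory

async fn read_vector(
    local: &mut LocalCache,
    redis: &mut redis::aio::MultiplexedConnection,
    id: &str,
) -> anyhow::Result<Vec<f32>> {
    // Tier 1: instance memory (<0.5ms)
    if let Some(v) = local.get(id) {
        return Ok(v.clone());
    }

    // Tier 2: regional Memorystore Redis (1-3ms)
    if let Some(bytes) = redis.get::<_, Option<Vec<u8>>>(id).await? {
        let v = decode_vector(&bytes);
        local.insert(id.to_owned(), v.clone());
        return Ok(v);
    }

    // Tier 3: Cloud Storage origin (10-30ms on first access)
    let v = fetch_from_gcs(id).await?;
    // Populate the warmer tiers on the way back (1h sliding TTL in Redis).
    let _: () = redis.set_ex(id, encode_vector(&v), 3600).await?;
    local.insert(id.to_owned(), v.clone());
    Ok(v)
}

fn decode_vector(bytes: &[u8]) -> Vec<f32> {
    bytes
        .chunks_exact(4)
        .map(|c| f32::from_le_bytes([c[0], c[1], c[2], c[3]]))
        .collect()
}

fn encode_vector(v: &[f32]) -> Vec<u8> {
    v.iter().flat_map(|f| f.to_le_bytes()).collect()
}

async fn fetch_from_gcs(_id: &str) -> anyhow::Result<Vec<f32>> {
    // Stand-in for a Cloud Storage object read.
    Ok(vec![0.0; 384])
}
```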

### 4.2 Replication Strategy

**Multi-Region Replication:**

```
Primary Region (us-central1)
        ↓ (real-time sync via Pub/Sub)
Regional Hubs (5 Tier-1 regions)
        ↓ (async replication, <5s lag)
Secondary Regions (10 Tier-2 regions)
        ↓ (periodic sync, <60s lag)
Cross-Region Backup (nearline storage)
```

**Consistency Model:**

- **Writes:** Eventually consistent (5-60s global propagation)
- **Reads:** Read-your-writes consistency within a region
- **Critical Metadata:** Strong consistency (Cloud Spanner or multi-region Cloud SQL)

**Replication Flow:**

```
Conceptual write path:

1. User writes vector to regional Cloud Run instance
        ↓
2. Instance writes to:
   a) Local memory (immediate)
   b) Regional Redis (1-2ms)
   c) Regional Cloud Storage (5-10ms)
        ↓
3. Pub/Sub message published to global topic
        ↓
4. Regional subscribers receive update (100-500ms)
        ↓
5. Subscribers update:
   a) Regional Redis cache (invalidate or update)
   b) Regional Cloud Storage (async copy)
        ↓
6. Background job syncs to other regions (5-60s)
```

### 4.3 Conflict Resolution

**Vector Update Conflicts:**

```
Strategy: Last-Write-Wins (LWW) with Vector Clocks

1. Each update includes:
   - Timestamp (Unix nanoseconds)
   - Region ID
   - Version number

2. On conflict:
   - Compare timestamps
   - If same timestamp: lexicographic order by Region ID
   - Update conflict counter metric

3. Rare conflicts (<0.01% of writes):
   - Log for analysis
   - Emit monitoring alert if rate exceeds threshold
```
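
A minimal sketch of the LWW comparator described above (struct and field names are assumptions):

```rust
use std::cmp::Ordering;

// Metadata carried with every replicated vector update.
#[derive(Debug, Clone, PartialEq, Eq)]
struct UpdateStamp {
    timestamp_ns: u128, // Unix nanoseconds
    region_id: String,  // tie-breaker
    version: u64,
}

// Last-write-wins: the newest timestamp wins; equal timestamps fall back to
// lexicographic region ID, so every region resolves the conflict identically.
fn resolve<'a>(a: &'a UpdateStamp, b: &'a UpdateStamp) -> &'a UpdateStamp {
    match a.timestamp_ns.cmp(&b.timestamp_ns) {
        Ordering::Greater => a,
        Ordering::Less => b,
        Ordering::Equal => {
            if a.region_id >= b.region_id { a } else { b }
        }
    }
}

fn main() {
    let a = UpdateStamp { timestamp_ns: 100, region_id: "us-central1".into(), version: 3 };
    let b = UpdateStamp { timestamp_ns: 100, region_id: "europe-west1".into(), version: 2 };
    // Same timestamp: "us-central1" > "europe-west1" lexicographically, so a wins.
    assert_eq!(resolve(&a, &b), &a);
}
```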

---

## 5. Edge Caching Strategy

### 5.1 Multi-Level Cache Hierarchy

```
L1: Browser/Client Cache (User Device)
 └─ TTL: 5 min
 └─ Size: ~10-50MB per client
 └─ Hit Rate: 70-80%
          ↓
L2: Cloud CDN Edge Cache (120+ edge locations)
 └─ TTL: 30-300s (content-dependent)
 └─ Size: ~100GB-1TB per edge
 └─ Hit Rate: 60-70%
          ↓
L3: Regional Memorystore Redis (15 regions)
 └─ TTL: 1 hour (sliding)
 └─ Size: 16GB-256GB per region
 └─ Hit Rate: 80-90%
          ↓
L4: Cloud Run Instance Memory (per instance)
 └─ TTL: Instance lifetime
 └─ Size: 8GB per instance
 └─ Hit Rate: 95%+
          ↓
L5: Cloud Storage (origin, multi-region)
 └─ Persistent storage
 └─ Size: Unlimited (petabytes)
 └─ Always available
```
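
For intuition, the share of traffic that reaches each tier is the product of the miss rates above it. A quick calculation using midpoints of the hit-rate ranges quoted above:

```rust
fn main() {
    // Midpoint hit rates for L1-L4 from the hierarchy above.
    let hit_rates = [0.75_f64, 0.65, 0.85, 0.95];

    let mut reach = 1.0; // fraction of requests reaching the current tier
    for (i, h) in hit_rates.iter().enumerate() {
        println!("L{}: reach {:.4}, served {:.4}", i + 1, reach, reach * h);
        reach *= 1.0 - h;
    }
    // Whatever is left falls through to L5 (Cloud Storage origin).
    println!("L5 (origin): reach {:.4}", reach);
}
```

Under these assumptions, well under 0.1% of requests fall through to the origin, which is what makes the hierarchy hold up during burst events.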

### 5.2 Cache Warming Strategy

**Pre-Event Warming (for predictable bursts):**

```
Example: World Cup event in 2 hours

1. Historical Analysis
   - Analyze similar events (previous World Cup matches)
   - Identify top 10K vectors likely to be queried
   - Estimate query patterns by region

2. Pre-Population (T-2 hours)
   - Batch load hot vectors into Redis (all regions)
   - Distribute to Cloud Run instances (rolling)
   - Trigger CDN cache pre-fetch for common queries

3. Validation (T-1 hour)
   - Run cache hit rate tests
   - Verify all regions have hot data
   - Scale up Cloud Run instances (50% → 100%)

4. Final Prep (T-30 min)
   - Scale to 120% capacity
   - Enable aggressive rate limiting for non-critical traffic
   - Activate burst alerting channels
```

**Real-Time Adaptive Warming:**

```rust
// Pseudo-code for adaptive cache warming
async fn adaptive_cache_warming() {
    // Watch query patterns over a sliding 5-minute window.
    monitor_query_patterns(Duration::from_secs(300));

    if detect_emerging_pattern() {
        let hot_vectors = identify_trending_vectors();

        // Update CDN cache keys for the trending queries.
        cdn_prefetch(&hot_vectors);

        // Async pre-load into every regional Redis cache (1h TTL).
        tokio::spawn(async move {
            for region in all_regions() {
                redis_mset(&region, &hot_vectors, 3600).await;
            }
        });
    }
}
```

### 5.3 Cache Invalidation

**Invalidation Strategies:**

```yaml
invalidation_rules:
  # Vector updates (immediate invalidation)
  - trigger: vector_update
    scope: global
    method: PURGE_BY_KEY
    propagation_time: <5s

  # Batch updates (lazy invalidation)
  - trigger: batch_insert
    scope: regional
    method: EXPIRE_BY_TTL
    ttl: 60s

  # Model updates (full cache clear)
  - trigger: model_version_change
    scope: global
    method: PURGE_ALL
    notice_period: 5min  # gradual rollout
```

---

## 6. Connection Pooling & Streaming Protocol

### 6.1 Connection Pool Architecture

**Regional Connection Pool:**

```
┌───────────────────────────────────────────────────────┐
│ Cloud Run Instance (4 vCPU, 16GB)                     │
│ ┌─────────────────────────────────────────────────┐   │
│ │ HTTP/2 Connection Pool                          │   │
│ │ - Max connections: 100 concurrent               │   │
│ │ - Keep-alive: 60s                               │   │
│ │ - Idle timeout: 90s                             │   │
│ │ - Max streams per conn: 100 (HTTP/2 multiplex)  │   │
│ └─────────────────────────────────────────────────┘   │
│ ┌─────────────────────────────────────────────────┐   │
│ │ Redis Connection Pool (Memorystore)             │   │
│ │ - Pool size: 50 connections                     │   │
│ │ - Max idle: 20                                  │   │
│ │ - Timeout: 5s                                   │   │
│ │ - Pipeline: 10 commands per batch               │   │
│ └─────────────────────────────────────────────────┘   │
│ ┌─────────────────────────────────────────────────┐   │
│ │ Pub/Sub Connection (coordination)               │   │
│ │ - Persistent gRPC stream                        │   │
│ │ - Auto-reconnect with exponential backoff       │   │
│ │ - Batched message publishing (100ms window)     │   │
│ └─────────────────────────────────────────────────┘   │
└───────────────────────────────────────────────────────┘
```
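
The Pub/Sub auto-reconnect with exponential backoff noted above could look like the following sketch (`connect_pubsub` and `consume` are hypothetical stand-ins for the real client calls):

```rust
use std::time::Duration;
use tokio::time::sleep;

async fn subscribe_with_backoff() {
    let mut delay = Duration::from_millis(100);
    let max_delay = Duration::from_secs(30);

    loop {
        match connect_pubsub().await {
            Ok(stream) => {
                consume(stream).await; // runs until the stream drops
                delay = Duration::from_millis(100); // reset after a healthy run
            }
            Err(e) => {
                eprintln!("pubsub connect failed: {e}; retrying in {delay:?}");
                sleep(delay).await;
                // Double the delay each failure, capped at 30s.
                delay = (delay * 2).min(max_delay);
            }
        }
    }
}

// Hypothetical stand-ins so the sketch compiles on its own.
struct PubsubStream;
async fn connect_pubsub() -> std::io::Result<PubsubStream> {
    Ok(PubsubStream)
}
async fn consume(_stream: PubsubStream) {}
```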

### 6.2 Streaming Protocol Design

**Supported Protocols:**

**1. HTTP/2 Server-Sent Events (SSE) - Primary**

```http
GET /api/v1/stream/search HTTP/2
Host: ruvector.example.com
Accept: text/event-stream
Authorization: Bearer <token>

# Response (streaming)
HTTP/2 200 OK
Content-Type: text/event-stream
Cache-Control: no-cache

data: {"event":"search_start","query_id":"abc123"}

data: {"event":"result","vector_id":"vec_001","score":0.95}

data: {"event":"result","vector_id":"vec_002","score":0.89}

data: {"event":"search_complete","total_results":50}
```
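
Server-side, an SSE stream like the one above can be produced with Axum's `sse` response type; the sketch below hard-codes events where a real handler would drive them from the HNSW search:

```rust
use axum::response::sse::{Event, KeepAlive, Sse};
use futures::stream::{self, Stream, StreamExt};
use std::convert::Infallible;

// GET /api/v1/stream/search
async fn stream_search() -> Sse<impl Stream<Item = Result<Event, Infallible>>> {
    let events = stream::iter(vec![
        Event::default().data(r#"{"event":"search_start","query_id":"abc123"}"#),
        Event::default().data(r#"{"event":"result","vector_id":"vec_001","score":0.95}"#),
        Event::default().data(r#"{"event":"search_complete","total_results":50}"#),
    ])
    .map(Ok);

    // Periodic keep-alive comments hold the connection open through proxies.
    Sse::new(events).keep_alive(KeepAlive::default())
}
```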

**2. WebSocket - For Bidirectional Streams**

```javascript
// Client-side
const ws = new WebSocket('wss://ruvector.example.com/api/v1/ws');

ws.send(JSON.stringify({
  type: 'search',
  query: [0.1, 0.2, 0.3, ...],
  k: 100,
  stream: true
}));

ws.onmessage = (event) => {
  const result = JSON.parse(event.data);
  // Process incremental results
};
```

**3. gRPC Streaming - For Backend Services**

```protobuf
syntax = "proto3";

service VectorSearch {
  rpc StreamSearch(SearchRequest) returns (stream SearchResult);
  rpc BidirectionalSearch(stream SearchRequest) returns (stream SearchResult);
}

message SearchRequest {
  repeated float query = 1;
  int32 k = 2;
  string metric = 3;
}

message SearchResult {
  string vector_id = 1;
  float score = 2;
  bytes metadata = 3;
}
```

### 6.3 Connection Management

**Connection Lifecycle:**

```rust
// Conceptual connection manager
struct ConnectionManager {
    active_connections: Arc<DashMap<ConnectionId, Connection>>,
    max_connections: usize,
    idle_timeout: Duration,
}

impl ConnectionManager {
    async fn handle_connection(&self, conn: Connection) -> Result<()> {
        // 1. Authentication & rate limiting
        let user = authenticate(&conn).await?;
        check_rate_limit(&user)?;

        // 2. Register connection
        let conn_id = conn.id;
        self.active_connections.insert(conn_id, conn.clone());

        // 3. Keep-alive loop: ping every 60s, drop after idle_timeout
        let idle_timeout = self.idle_timeout;
        let active = Arc::clone(&self.active_connections);
        tokio::spawn(async move {
            loop {
                select! {
                    msg = conn.recv() => process_message(msg),
                    _ = sleep(Duration::from_secs(60)) => conn.send_ping(),
                    _ = sleep(idle_timeout) => break,
                }
            }
            // 4. Cleanup on disconnect
            active.remove(&conn_id);
            log_connection_metrics(&conn);
        });
        Ok(())
    }

    async fn handle_overload(&self) {
        // Start shedding at 90% of per-instance capacity
        if self.active_connections.len() > self.max_connections * 9 / 10 {
            // Shed the least valuable connections first: idle for >5 min
            let idle = self.find_idle_connections(Duration::from_secs(300));
            for conn in idle.iter().take(100) {
                conn.close_gracefully("capacity");
            }
        }
    }
}
```

**Load Shedding Strategy:**

```yaml
load_shedding:
  triggers:
    - cpu_usage > 85%
    - memory_usage > 90%
    - connection_count > 95 (per instance)
    - latency_p99 > 100ms

  actions:
    - priority: reject_new_connections
      threshold: 95%

    - priority: close_idle_connections
      idle_time: >5min
      threshold: 90%

    - priority: rate_limit_aggressive
      limit: 10 req/s per user
      threshold: 85%

    - priority: shed_non_premium_traffic
      percentage: 20%
      threshold: 95%
```
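
As a sketch, the trigger thresholds above could be evaluated per instance roughly like this, collapsing the actions into a simple escalation ladder (struct and variant names are assumptions):

```rust
#[derive(Debug)]
struct InstanceStats {
    cpu_usage: f64,        // 0.0-1.0
    memory_usage: f64,     // 0.0-1.0
    connection_count: u32, // per instance
    latency_p99_ms: f64,
}

#[derive(Debug, PartialEq)]
enum ShedAction {
    None,
    RateLimitAggressive,
    CloseIdleConnections,
    RejectNewConnections,
}

// Escalate shedding as pressure rises, mirroring the thresholds above.
fn shed_action(s: &InstanceStats) -> ShedAction {
    if s.connection_count > 95 || s.cpu_usage > 0.95 {
        ShedAction::RejectNewConnections
    } else if s.memory_usage > 0.90 {
        ShedAction::CloseIdleConnections
    } else if s.cpu_usage > 0.85 || s.latency_p99_ms > 100.0 {
        ShedAction::RateLimitAggressive
    } else {
        ShedAction::None
    }
}

fn main() {
    let stats = InstanceStats {
        cpu_usage: 0.88,
        memory_usage: 0.70,
        connection_count: 80,
        latency_p99_ms: 42.0,
    };
    println!("{:?}", shed_action(&stats)); // RateLimitAggressive
}
```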

---

## 7. Monitoring & Observability

### 7.1 Key Metrics

**Service-Level Indicators (SLIs):**

```yaml
availability:
  target: 99.99%
  measurement: successful_requests / total_requests
  window: 30 days

latency:
  p50_target: <10ms
  p95_target: <30ms
  p99_target: <50ms
  measurement: time_to_first_byte

throughput:
  target: 500M concurrent streams
  measurement: active_websocket_connections

error_rate:
  target: <0.1%
  measurement: (4xx + 5xx) / total_requests
```
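
For context, the 99.99% availability target over a 30-day window leaves roughly 4.3 minutes of error budget:

```rust
fn main() {
    let availability_target = 0.9999_f64;
    let window_minutes = 30.0 * 24.0 * 60.0; // 30-day rolling window
    let error_budget = (1.0 - availability_target) * window_minutes;
    println!("Error budget: {error_budget:.1} minutes per 30 days"); // 4.3
}
```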

**Resource Metrics:**

```yaml
cloud_run:
  - instance_count (per region)
  - cpu_utilization
  - memory_utilization
  - container_startup_time
  - request_count
  - active_connections

redis:
  - cache_hit_rate
  - memory_usage
  - eviction_count
  - commands_per_second

cloud_storage:
  - read_operations
  - write_operations
  - bandwidth_usage
  - replication_lag
```

### 7.2 Distributed Tracing

**Trace Propagation:**

```
Request ID: req_abc123_us-central1_inst042

Span 1: Global Load Balancer (0-2ms)
└─ Span 2: Cloud CDN Edge (2-5ms)
   └─ Span 3: Regional LB (5-8ms)
      └─ Span 4: Cloud Run Instance (8-15ms)
         ├─ Span 5: Redis Lookup (8-11ms)
         │  └─ Result: CACHE_MISS
         ├─ Span 6: HNSW Search (11-14ms)
         │  └─ Result: 100 vectors found
         └─ Span 7: Response Serialization (14-15ms)

Total Latency: 15ms (p50 target: <10ms) ⚠️ SLOW
```
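
Spans like these can be emitted from the Rust service with the `tracing` crate; a minimal sketch (exporting to Cloud Trace is elided, and the handler names are illustrative):

```rust
use tracing::{info, instrument};

#[instrument(skip(query), fields(query_id = %query_id))]
async fn handle_search(query_id: String, query: Vec<f32>) {
    if redis_lookup(&query_id).await.is_none() {
        info!("cache miss, falling back to HNSW");
        hnsw_search(&query).await;
    }
}

#[instrument]
async fn redis_lookup(query_id: &str) -> Option<Vec<f32>> {
    // Span 5: regional Redis lookup (1-3ms)
    None
}

#[instrument(skip(query))]
async fn hnsw_search(query: &[f32]) {
    // Span 6: in-memory HNSW search (<0.5ms base latency)
}
```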

### 7.3 Alerting Rules

**Critical Alerts (PagerDuty):**

```yaml
alerts:
  - name: RegionDown
    condition: region_availability < 95%
    severity: critical
    notification: immediate

  - name: LatencyDegraded
    condition: p99_latency > 50ms for 5 min
    severity: critical
    notification: immediate

  - name: ErrorRateHigh
    condition: error_rate > 1% for 5 min
    severity: critical
    notification: immediate

  - name: CapacityExhausted
    condition: instance_count > 90% of max
    severity: warning
    notification: 15 min delay
    auto_remediation: scale_up
```

---

## 8. Disaster Recovery & Failover

### 8.1 Failure Scenarios

**Regional Failure:**

```
Scenario: us-central1 becomes unavailable

Automatic Response (<30s):
1. Global LB detects unhealthy region (health checks fail)
2. Traffic re-routes to nearby regions:
   - East Coast: us-east1
   - West Coast: us-west1
3. Spillover regions scale up 2x capacity (auto-scaling)
4. CDN cache serves stale content (5 min grace period)
5. Alerts sent to on-call team

Manual Response (<5 min):
1. Confirm outage scope and cause
2. Increase max_instances in spillover regions
3. Warm up additional regions if needed
4. Update status page

Recovery (<30 min):
1. Region comes back online
2. Gradual traffic shift (10% every 5 min)
3. Verify metrics return to normal
4. Post-mortem analysis
```
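
The gradual shift in recovery step 2 follows a simple linear schedule; spelled out (illustrative only):

```rust
fn main() {
    // Shift 10% of traffic back to the recovered region every 5 minutes,
    // checking metrics at each step before proceeding.
    let mut weight = 0u32;
    let mut minute = 0u32;
    while weight < 100 {
        weight += 10;
        minute += 5;
        println!("t+{minute:>2} min: recovered region at {weight}% of its share");
    }
    // At this rate the full share is restored after 50 minutes.
}
```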

**Multi-Region Failure (catastrophic):**

```
Scenario: 3+ regions simultaneously fail

Response:
1. Activate DR runbook
2. Promote standby clusters to active
3. Scale remaining healthy regions to 150% capacity
4. Enable aggressive caching (10 min TTL)
5. Activate read-only mode for non-critical operations
6. Coordinate with GCP support for expedited recovery
```

### 8.2 Backup & Recovery

**Data Backup Strategy:**

```yaml
backups:
  vector_data:
    frequency: continuous (Cloud Storage versioning)
    retention: 30 days
    storage_class: nearline

  metadata:
    frequency: every 6 hours (Cloud SQL automated backups)
    retention: 7 days
    point_in_time_recovery: enabled

  configuration:
    frequency: on change (Git repository)
    retention: indefinite

recovery_objectives:
  rpo: <1 hour (maximum data loss)
  rto: <30 min (maximum downtime)
```

---

## 9. Security & Compliance

### 9.1 Security Architecture

```
┌─────────────────────────────────────────────────────┐
│ Perimeter Security                                  │
│ - Cloud Armor (DDoS protection, WAF)                │
│ - SSL/TLS 1.2+ (Google-managed certificates)        │
│ - Rate limiting (100 req/s per IP)                  │
└─────────────────────────────────────────────────────┘
                          ↓
┌─────────────────────────────────────────────────────┐
│ Authentication & Authorization                      │
│ - OAuth 2.0 / JWT tokens                            │
│ - API keys with scoped permissions                  │
│ - Workload Identity (service-to-service)            │
└─────────────────────────────────────────────────────┘
                          ↓
┌─────────────────────────────────────────────────────┐
│ Network Security                                    │
│ - VPC Service Controls                              │
│ - Private Service Connect (Redis, SQL)              │
│ - VPC Peering (cross-region)                        │
│ - Cloud NAT (egress only for Cloud Run)             │
└─────────────────────────────────────────────────────┘
                          ↓
┌─────────────────────────────────────────────────────┐
│ Data Security                                       │
│ - Encryption at rest (CMEK for sensitive data)      │
│ - Encryption in transit (TLS 1.2+)                  │
│ - Customer-managed encryption keys (optional)       │
│ - Data residency controls (regional isolation)      │
└─────────────────────────────────────────────────────┘
```

### 9.2 Compliance

**Certifications & Standards:**

- SOC 2 Type II
- ISO 27001
- GDPR compliant (data residency in the EU for EU users)
- HIPAA compliant (for healthcare use cases)
- PCI DSS Level 1 (for payment-related vectors)

---

## 10. Integration with Agentic-Flow

### 10.1 Coordination Architecture

**Agentic-Flow Integration:**

```javascript
// Example: Distributed agent coordination via ruvector

const { AgenticFlow } = require('agentic-flow');
const { VectorDB } = require('ruvector');

// Initialize distributed vector memory
const flow = new AgenticFlow({
  vectorStore: new VectorDB({
    endpoint: 'https://ruvector.example.com',
    region: 'auto', // auto-selects nearest region
    streaming: true,
  }),
  topology: 'mesh',
  coordinationHooks: {
    preTask: async (task) => {
      // Store task embedding for similarity search
      const embedding = await embedTask(task);
      await flow.vectorStore.insert(task.id, embedding, {
        metadata: { type: 'task', status: 'pending' }
      });
    },
    postTask: async (task, result) => {
      // Update task with result
      await flow.vectorStore.update(task.id, {
        metadata: { status: 'completed', result }
      });
    }
  }
});

// Distributed agent search for similar tasks
async function findSimilarTasks(currentTask) {
  const stream = flow.vectorStore.searchStream(
    currentTask.embedding,
    { k: 10, filter: { type: 'task' } }
  );

  for await (const result of stream) {
    console.log(`Similar task: ${result.id}, score: ${result.score}`);
  }
}
```

### 10.2 Pub/Sub Coordination

**Cross-Region Agent Coordination:**

```yaml
pubsub_topics:
  agent-coordination:
    regions: all
    message_retention: 7 days
    ordering_key: agent_id

  task-distribution:
    regions: all
    message_retention: 1 day
    ordering_key: task_priority

  vector-updates:
    regions: all
    message_retention: 1 hour
    ordering_key: vector_id
```

---

## 11. Next Steps

### 11.1 Implementation Phases

**Phase 1: Foundation (Weeks 1-4)**

- Deploy to 3 pilot regions (us-central1, europe-west1, asia-northeast1)
- Baseline capacity: 30M concurrent streams
- Load testing and optimization

**Phase 2: Global Expansion (Weeks 5-8)**

- Deploy to all 15 regions
- Enable cross-region replication
- Capacity: 100M concurrent streams

**Phase 3: Optimization (Weeks 9-12)**

- Fine-tune auto-scaling policies
- Optimize cache hit rates
- Enable advanced features (predictive scaling)
- Capacity: 300M concurrent streams

**Phase 4: Full Scale (Weeks 13-16)**

- Scale to 500M concurrent streams
- Burst testing (10-50x load)
- Disaster recovery drills
- Production readiness review

### 11.2 Success Metrics

**Technical Metrics:**

- ✅ p50 latency: <10ms
- ✅ p99 latency: <50ms
- ✅ Availability: 99.99%
- ✅ Concurrent streams: 500M+
- ✅ Burst capacity: 10-50x baseline

**Business Metrics:**

- Cost per million requests: <$5
- Infrastructure cost as % of revenue: <15%
- Time to scale (0→500M): <30 minutes
- Mean time to recovery (MTTR): <30 minutes

---

## Appendix A: Reference Architecture Diagram

```
┌──────────────────────────────────────────────────────────────────────────┐
│                                                                          │
│                              GLOBAL INTERNET                             │
│                                                                          │
└────────────────────────────────────┬─────────────────────────────────────┘
                                     │
                                     │ Anycast IPv4/IPv6
                                     ↓
┌──────────────────────────────────────────────────────────────────────────┐
│                    GOOGLE CLOUD GLOBAL LOAD BALANCER                     │
│   • Single global IP address                                             │
│   • SSL/TLS termination                                                  │
│   • DDoS protection (Cloud Armor)                                        │
│   • Geo-routing (proximity-based)                                        │
└────┬──────────────────────┬──────────────────────┬──────────────────┬────┘
     │                      │                      │                  │
     ↓                      ↓                      ↓                  ↓
┌───────────┐        ┌───────────┐        ┌────────────┐       ┌───────────┐
│ Americas  │        │  Europe   │        │Asia-Pacific│       │MENA/Africa│
│ 5 Regions │        │ 4 Regions │        │ 5 Regions  │       │ 1 Region  │
│   190M    │        │   110M    │        │   190M     │       │    10M    │
│  streams  │        │  streams  │        │  streams   │       │  streams  │
└─────┬─────┘        └─────┬─────┘        └─────┬──────┘       └─────┬─────┘
      │                    │                    │                    │
      └────────────────────┴─────────┬──────────┴────────────────────┘
                                     │
                         ┌───────────┴───────────┐
                         │                       │
                         ↓                       ↓
               ┌──────────────────┐     ┌──────────────────┐
               │ Cloud CDN Edge   │     │ Regional Stack   │
               │ 120+ Locations   │     │ (per region)     │
               │ • Cache: 60-70%  │     │                  │
               │ • Latency: 5ms   │     │ ┌──────────────┐ │
               └──────────────────┘     │ │ Cloud Run    │ │
                                        │ │ 500-5000     │ │
                                        │ │ instances    │ │
                                        │ └──────────────┘ │
                                        │ ┌──────────────┐ │
                                        │ │ Memorystore  │ │
                                        │ │ Redis 256GB  │ │
                                        │ └──────────────┘ │
                                        │ ┌──────────────┐ │
                                        │ │Cloud Storage │ │
                                        │ │ Multi-Region │ │
                                        │ └──────────────┘ │
                                        └──────────────────┘

Per-area stream counts follow the Section 1.1 tier allocation
(Tier 1 hubs at 80M each, Tier 2 regions at 10M each), summing to 500M.
```

---

**Document Version:** 1.0.0

**Last Updated:** 2025-11-20

**Next Review:** 2025-12-20

**Owner:** Infrastructure Team

**Approval:** CTO, VP Engineering