# ADR-004: Performance Optimization Strategy

## Status

Proposed

## Date

2026-01-15

## Context

7sense is a bioacoustics platform that processes bird call audio using Perch 2.0 embeddings (1536-D vectors from 5-second audio segments at 32kHz) stored in a RuVector-based system with HNSW indexing and GNN learning capabilities. The system must handle:

- **Scale**: 1M+ bird call embeddings with sub-100ms query latency
- **Continuous Learning**: GNN refinement without blocking query operations
- **Hierarchical Data**: Poincaré-ball hyperbolic embeddings for species/call-type taxonomies
- **Real-time Ingestion**: Streaming audio from field sensors

This ADR defines the performance optimization strategy to meet these requirements while maintaining system reliability and cost efficiency.

## Decision

We adopt a multi-layered performance optimization approach covering HNSW tuning, embedding quantization, batch processing, memory management, caching, GNN scheduling, and horizontal scalability.

---

## 1. HNSW Parameter Tuning

### 1.1 Core Parameters

| Parameter | Value | Rationale |
|-----------|-------|-----------|
| **M** (max connections per node) | 32 | Optimal for 1536-D vectors; balances recall vs. memory. Higher than the default (16) because of the high dimensionality. |
| **efConstruction** | 200 | Build-time search depth. A higher value yields a higher-quality graph structure for dense embedding spaces. |
| **efSearch** | 128 (default) / 256 (high-recall) | Query-time search depth. Tunable per query based on precision requirements. |
| **maxLevel** | auto (log2(N)/log2(M)) | Automatically determined; ~4 levels for 1M vectors with M=32, since log2(10^6)/log2(32) ≈ 20/5 = 4. |

### 1.2 Dimensionality-Specific Adjustments

```
For 1536-D Perch embeddings:
- Use L2 (Euclidean) distance for normalized vectors
- Consider Product Quantization (PQ) for memory reduction (see Section 2)
- Enable SIMD acceleration (AVX-512 where available)
```
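
To make these settings concrete, here is a minimal configuration sketch carrying the Section 1.1 parameters. `createHnswIndex` and its option names are illustrative assumptions, not RuVector's documented API.

```typescript
// Illustrative only: `createHnswIndex` stands in for whatever constructor the
// vector store exposes; option names are assumptions, not RuVector's API.
interface HnswOptions {
  dim: number;
  metric: 'l2' | 'cosine';
  m: number;              // max connections per node (Section 1.1)
  efConstruction: number; // build-time search depth
  efSearchDefault: number;
}

declare function createHnswIndex(opts: HnswOptions): {
  search(q: Float32Array, opts: { k: number; ef?: number }): Promise<unknown[]>;
};

const perchIndex = createHnswIndex({
  dim: 1536,            // Perch 2.0 embedding dimensionality
  metric: 'l2',         // Euclidean on normalized vectors (Section 1.2)
  m: 32,
  efConstruction: 200,
  efSearchDefault: 128
});

// High-recall queries override ef per call:
// await perchIndex.search(query, { k: 10, ef: 256 });
```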

### 1.3 Benchmark Targets

| Metric | Target | Measurement Method |
|--------|--------|-------------------|
| Recall@10 | >= 0.95 | Compare against brute-force ground truth |
| Recall@100 | >= 0.98 | Same |
| Query Latency (p50) | < 10ms | Single-threaded, 1M vectors |
| Query Latency (p99) | < 50ms | Under concurrent load |
| Build Time | < 30 min | For 1M vectors, cold start |
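
The recall targets above can be checked with a small harness comparing ANN results to brute-force ground truth. A minimal sketch, where `bruteForceKnn` and `annSearch` are assumed stand-ins for the exact and HNSW search paths:

```typescript
interface Neighbor { id: string; distance: number; }

// Assumed stand-ins for the exact and HNSW search paths:
declare function bruteForceKnn(q: Float32Array, k: number): Promise<Neighbor[]>;
declare function annSearch(q: Float32Array, k: number): Promise<Neighbor[]>;

// Recall@k: fraction of the exact top-k neighbors that the ANN search returns.
function recallAtK(exactIds: string[], annIds: string[], k: number): number {
  const truth = new Set(exactIds.slice(0, k));
  let hits = 0;
  for (const id of annIds.slice(0, k)) {
    if (truth.has(id)) hits++;
  }
  return hits / k;
}

// Averaged over a query set:
async function measureRecall(queries: Float32Array[], k: number): Promise<number> {
  let total = 0;
  for (const q of queries) {
    const exact = await bruteForceKnn(q, k); // ground truth
    const approx = await annSearch(q, k);    // HNSW result
    total += recallAtK(exact.map(r => r.id), approx.map(r => r.id), k);
  }
  return total / queries.length;
}
```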

### 1.4 Tuning Protocol

```typescript
// Declared as a const object rather than an interface, since the values
// (not just the types) are the configuration.
const hnswTuningConfig = {
  // Phase 1: Initial calibration (10K sample)
  calibration: {
    sampleSize: 10000,
    mRange: [16, 24, 32, 48],
    efConstructionRange: [100, 200, 400],
    targetRecall: 0.95
  },

  // Phase 2: Full index build with optimal params
  production: {
    m: 32, // Determined from calibration
    efConstruction: 200,
    efSearchDefault: 128,
    efSearchHighRecall: 256
  },

  // Phase 3: Runtime adaptation
  adaptive: {
    enableDynamicEf: true,
    efFloor: 64,
    efCeiling: 512,
    latencyTarget: 50 // ms
  }
};
```
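
For Phase 3, one plausible controller shrinks efSearch when latency overshoots the target and grows it back when there is headroom. A sketch; the 0.8/1.1 step factors are assumptions, not benchmarked values:

```typescript
// Minimal adaptive-ef sketch: back off when over the latency target,
// recover recall when comfortably under it.
class AdaptiveEf {
  private ef = 128;

  constructor(
    private readonly floor = 64,
    private readonly ceiling = 512,
    private readonly latencyTargetMs = 50
  ) {}

  current(): number {
    return this.ef;
  }

  // Call with the observed latency after each query (or latency window).
  observe(latencyMs: number): void {
    if (latencyMs > this.latencyTargetMs) {
      this.ef = Math.max(this.floor, Math.floor(this.ef * 0.8));   // back off
    } else if (latencyMs < this.latencyTargetMs * 0.5) {
      this.ef = Math.min(this.ceiling, Math.ceil(this.ef * 1.1));  // recover recall
    }
  }
}
```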

---

## 2. Embedding Quantization Strategy

### 2.1 Tiered Storage Architecture

```
HOT TIER (Active Queries)
-------------------------
- Format: float32 (full precision)
- Size: 1536 * 4 = 6,144 bytes/vector
- Capacity: ~100K vectors (600MB RAM)
- Use: Real-time queries, recent recordings

WARM TIER (Frequent Access)
---------------------------
- Format: float16 (half precision)
- Size: 1536 * 2 = 3,072 bytes/vector
- Capacity: ~500K vectors (1.5GB RAM)
- Use: Weekly active data, popular species

COLD TIER (Archive)
-------------------
- Format: int8 (scalar quantization)
- Size: 1536 * 1 = 1,536 bytes/vector
- Capacity: ~2M+ vectors (3GB disk)
- Use: Historical data, rare species
```

### 2.2 Quantization Methods

| Method | Compression | Recall Impact | Use Case |
|--------|-------------|---------------|----------|
| **Scalar (int8)** | 4x | 2-3% recall loss | Cold storage, bulk search |
| **Product Quantization (PQ)** | 8-16x | 3-5% recall loss | Very large archives |
| **Binary** | 32x | 10-15% recall loss | First-pass filtering only |

### 2.3 Scalar Quantization Implementation

```typescript
class ScalarQuantizer {
  private static readonly DIM = 1536;

  // Per-dimension min/max for calibration
  private mins = new Float32Array(ScalarQuantizer.DIM);
  private maxs = new Float32Array(ScalarQuantizer.DIM);
  private scales = new Float32Array(ScalarQuantizer.DIM);

  calibrate(embeddings: Float32Array[], sampleSize: number = 10000): void {
    // Sample random embeddings for range estimation
    const sample = this.randomSample(embeddings, sampleSize);

    for (let d = 0; d < ScalarQuantizer.DIM; d++) {
      // Scan instead of Math.min(...values): spreading 10K args risks the stack
      let min = Infinity, max = -Infinity;
      for (const e of sample) {
        if (e[d] < min) min = e[d];
        if (e[d] > max) max = e[d];
      }
      this.mins[d] = min;
      this.maxs[d] = max;
      // Guard against a zero range (constant dimension)
      this.scales[d] = max > min ? 255 / (max - min) : 1;
    }
  }

  quantize(embedding: Float32Array): Uint8Array {
    const quantized = new Uint8Array(ScalarQuantizer.DIM);
    for (let d = 0; d < ScalarQuantizer.DIM; d++) {
      const normalized = (embedding[d] - this.mins[d]) * this.scales[d];
      quantized[d] = Math.round(Math.max(0, Math.min(255, normalized)));
    }
    return quantized;
  }

  dequantize(quantized: Uint8Array): Float32Array {
    const embedding = new Float32Array(ScalarQuantizer.DIM);
    for (let d = 0; d < ScalarQuantizer.DIM; d++) {
      embedding[d] = quantized[d] / this.scales[d] + this.mins[d];
    }
    return embedding;
  }

  private randomSample(items: Float32Array[], n: number): Float32Array[] {
    // Partial Fisher-Yates shuffle over a copy, then take the first n
    const copy = items.slice();
    const count = Math.min(n, copy.length);
    for (let i = 0; i < count; i++) {
      const j = i + Math.floor(Math.random() * (copy.length - i));
      [copy[i], copy[j]] = [copy[j], copy[i]];
    }
    return copy.slice(0, count);
  }
}
```
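
A usage sketch, assuming a hypothetical `loadEmbeddings()` helper supplies the calibration set:

```typescript
// `loadEmbeddings` is an assumed helper, not part of the system above.
declare function loadEmbeddings(): Float32Array[];

const quantizer = new ScalarQuantizer();
const calibrationSet = loadEmbeddings();
quantizer.calibrate(calibrationSet);

const original = calibrationSet[0];
const packed = quantizer.quantize(original);   // 1,536 bytes (4x smaller)
const restored = quantizer.dequantize(packed); // approximate reconstruction

// Per-dimension round-trip error is bounded by roughly half a quantization
// step, i.e. ~0.5 / scales[d].
```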

### 2.4 Promotion/Demotion Policy

```
PROMOTION (Cold -> Warm -> Hot)
-------------------------------
Trigger: Query frequency > threshold OR explicit prefetch
- Cold -> Warm: 5+ queries in 24h
- Warm -> Hot: 20+ queries in 1h

DEMOTION (Hot -> Warm -> Cold)
------------------------------
Trigger: Time-based decay OR memory pressure
- Hot -> Warm: No queries in 1h
- Warm -> Cold: No queries in 7d
- LRU eviction when tier exceeds capacity
```
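
A sketch of the promotion half of this policy using sliding-window query counters. The storage hooks are omitted, and the counter maps are assumed to be cleared by hourly/daily timers:

```typescript
type Tier = 'hot' | 'warm' | 'cold';

// Query counters drive tier promotion (thresholds from the policy above).
class TierManager {
  private lastHourCounts = new Map<string, number>();
  private lastDayCounts = new Map<string, number>();

  // Called on every query hit; returns the tier the embedding should be in.
  recordQuery(id: string, tier: Tier): Tier {
    const hourly = (this.lastHourCounts.get(id) ?? 0) + 1;
    const daily = (this.lastDayCounts.get(id) ?? 0) + 1;
    this.lastHourCounts.set(id, hourly);
    this.lastDayCounts.set(id, daily);

    if (tier === 'warm' && hourly >= 20) return 'hot';  // Warm -> Hot: 20+ in 1h
    if (tier === 'cold' && daily >= 5) return 'warm';   // Cold -> Warm: 5+ in 24h
    return tier;
  }

  // Counter maps are reset by hourly/daily timers (not shown); demotion runs
  // as a periodic sweep applying the time-based decay rules above.
}
```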

---

## 3. Batch Processing Pipeline

### 3.1 Audio Ingestion Architecture

```
┌─────────────────────────────────────────────────────────────────────┐
│                      AUDIO INGESTION PIPELINE                        │
├─────────────────────────────────────────────────────────────────────┤
│                                                                      │
│  [Sensors] ──> [Buffer Queue] ──> [Segment Detector]                 │
│      │              │                     │                          │
│      │        (5min chunks)          (5s windows)                    │
│      v              v                     v                          │
│  [Raw Storage]  [Batch Accumulator]  [Perch Embedder]                │
│                     │                     │                          │
│               (1000 segments)        (GPU batch)                     │
│                     v                     v                          │
│              [Embedding Queue] <── [1536-D vectors]                  │
│                     │                                                │
│                     v                                                │
│              [HNSW Batch Insert]                                     │
│                     │                                                │
│              (async, non-blocking)                                   │
│                     v                                                │
│              [Index + Metadata Store]                                │
│                                                                      │
└─────────────────────────────────────────────────────────────────────┘
```

### 3.2 Batch Sizing Parameters

| Stage | Batch Size | Latency Target | Throughput |
|-------|------------|----------------|------------|
| Audio buffer | 5 min chunks | < 1s queue delay | 100+ streams |
| Segment detection | 100 segments | < 500ms | 1000 segments/s |
| Perch embedding | 64 segments | < 2s GPU | 32 segments/s/GPU |
| HNSW insertion | 1000 vectors | < 100ms | 10K vectors/s |
| Metadata write | 1000 records | < 50ms | 20K records/s |
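
The embedding stage above batches 64 segments per GPU call. A sketch of a size- and time-bounded accumulator that flushes when full, or when the oldest item has waited too long so quiet periods still drain (the 250ms timeout is an assumption):

```typescript
// Flushes when the batch is full OR when the oldest item has waited too long.
class BatchAccumulator<T> {
  private items: T[] = [];
  private timer: ReturnType<typeof setTimeout> | null = null;

  constructor(
    private readonly maxSize: number,    // e.g. 64 for GPU embedding
    private readonly maxWaitMs: number,  // e.g. 250ms (assumed value)
    private readonly onFlush: (batch: T[]) => Promise<void>
  ) {}

  async add(item: T): Promise<void> {
    this.items.push(item);
    if (this.items.length >= this.maxSize) {
      await this.flush();
    } else if (this.timer === null) {
      this.timer = setTimeout(() => void this.flush(), this.maxWaitMs);
    }
  }

  private async flush(): Promise<void> {
    if (this.timer !== null) {
      clearTimeout(this.timer);
      this.timer = null;
    }
    if (this.items.length === 0) return;
    const batch = this.items;
    this.items = [];
    await this.onFlush(batch);
  }
}
```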

### 3.3 Backpressure Handling

```typescript
// Declared as a const object so the values are usable at runtime.
const backpressureConfig = {
  // Queue depth thresholds
  warningThreshold: 10000,   // Start logging warnings
  throttleThreshold: 50000,  // Reduce intake rate
  dropThreshold: 100000,     // Drop lowest-priority data

  // Priority levels for graceful degradation
  priorities: {
    critical: 'endangered_species',      // Never drop
    high: 'known_species_new_recording', // Drop last
    normal: 'routine_monitoring',        // Standard handling
    low: 'background_noise_samples'      // Drop first
  },

  // Rate limiting
  maxIngestionRate: 10000, // segments/minute
  burstAllowance: 5000     // temporary overflow
};
```
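
A small sketch of how these thresholds could map queue depth to an ingestion action:

```typescript
type IngestAction = 'accept' | 'warn' | 'throttle' | 'drop_low_priority';

// Pick the most severe action whose threshold the queue depth has crossed.
function ingestionAction(queueDepth: number): IngestAction {
  if (queueDepth >= backpressureConfig.dropThreshold) return 'drop_low_priority';
  if (queueDepth >= backpressureConfig.throttleThreshold) return 'throttle';
  if (queueDepth >= backpressureConfig.warningThreshold) return 'warn';
  return 'accept';
}
```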

### 3.4 Batch Insert Optimization

```typescript
interface EmbeddingRecord {
  embedding: Float32Array;
  metadata: EmbeddingMetadata;
}

async function batchInsertEmbeddings(
  embeddings: Float32Array[],
  metadata: EmbeddingMetadata[],
  config: BatchConfig
): Promise<BatchResult> {
  const batchSize = config.batchSize || 1000;
  const results: BatchResult = { inserted: 0, failed: 0, latencyMs: [] };

  // Pair vectors with their metadata, then sort by expected cluster
  // for better cache locality during insertion.
  const records: EmbeddingRecord[] = embeddings.map((e, i) => ({
    embedding: e,
    metadata: metadata[i]
  }));
  const sorted = sortByClusterHint(records);

  for (let i = 0; i < sorted.length; i += batchSize) {
    const batch = sorted.slice(i, i + batchSize);
    const start = performance.now();

    try {
      // Parallel insert with connection pooling
      await Promise.all([
        hnswIndex.batchAdd(batch.map(r => r.embedding)),
        metadataStore.batchInsert(batch.map(r => r.metadata))
      ]);
      results.inserted += batch.length;
    } catch {
      results.failed += batch.length;
    }

    results.latencyMs.push(performance.now() - start);
  }

  return results;
}
```

---

## 4. Memory Management

### 4.1 Streaming vs Batch Trade-offs

| Mode | Memory Footprint | Latency | Use Case |
|------|------------------|---------|----------|
| **Streaming** | O(window_size), ~50MB | Real-time (<1s) | Live monitoring |
| **Micro-batch** | O(batch_size), ~200MB | Near-real-time (<5s) | Standard ingestion |
| **Batch** | O(full_batch), ~2GB | Minutes | Bulk historical import |

### 4.2 Memory Budget Allocation

```
TOTAL MEMORY BUDGET: 16GB (single node)
=======================================

HNSW Index (Hot):      4GB (25%)
- ~650K float32 vectors
- Navigation structure overhead

Embedding Cache:       3GB (19%)
- LRU cache for frequent queries
- Warm tier spillover

GNN Model:             2GB (12%)
- Model parameters
- Gradient buffers
- Activation cache

Query Buffers:         2GB (12%)
- Concurrent query working memory
- Result aggregation

Ingestion Pipeline:    2GB (12%)
- Audio processing buffers
- Batch accumulation

Metadata/Index:        2GB (12%)
- SQLite/RocksDB buffers
- B-tree indices

OS/Overhead:           1GB (6%)
- System requirements
- Safety margin
```

### 4.3 Memory Pressure Response

```typescript
// Declared as a const object so thresholds and responses are usable at runtime.
const memoryManager = {
  thresholds: {
    warning: 0.75,   // 75% utilization
    critical: 0.90,  // 90% utilization
    emergency: 0.95  // 95% utilization
  },

  responses: {
    warning: [
      'reduce_batch_sizes',
      'increase_demotion_rate',
      'log_memory_profile'
    ],
    critical: [
      'pause_gnn_training',
      'aggressive_cache_eviction',
      'reject_low_priority_queries'
    ],
    emergency: [
      'stop_ingestion',
      'force_checkpoint',
      'alert_operations'
    ]
  }
};
```
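
A sketch wiring these thresholds to their response lists; `process.memoryUsage()` is Node's built-in, while `execute` is an assumed dispatcher from action names to handlers:

```typescript
// `execute` is an assumed dispatcher mapping action names to handlers.
declare function execute(action: string): void;

// Pick the most severe threshold crossed and run its response actions.
function respondToMemoryPressure(totalBudgetBytes: number): void {
  const used = process.memoryUsage().rss;
  const utilization = used / totalBudgetBytes;

  const level =
    utilization >= memoryManager.thresholds.emergency ? 'emergency' :
    utilization >= memoryManager.thresholds.critical ? 'critical' :
    utilization >= memoryManager.thresholds.warning ? 'warning' : null;

  if (level !== null) {
    for (const action of memoryManager.responses[level]) {
      execute(action);
    }
  }
}
```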

### 4.4 Zero-Copy Optimizations

```typescript
// Memory-map large read-only data. Node has no built-in mmap; this assumes
// a native binding exposing mmap() and madvise-style flags.
const coldTierIndex = mmap('/data/cold_embeddings.bin', {
  mode: 'readonly',
  advice: MADV_RANDOM // Optimize for random access
});

// Share embedding buffers between query worker threads
const sharedQueryBuffer = new SharedArrayBuffer(
  QUERY_BATCH_SIZE * EMBEDDING_DIM * 4 // 4 bytes per float32
);

// Avoid copies in pipeline stages: pass typed-array views, not copies
function processSegment(audio: Float32Array): EmbeddingResult {
  // subarray() returns a view over the same underlying buffer
  const spectrogram = computeMelSpectrogram(audio.subarray(0, WINDOW_SIZE));
  const embedding = perchModel.embed(spectrogram); // Returns a view
  return { embedding, metadata: extractMetadata(audio) };
}
```

---

## 5. Caching Strategy

### 5.1 Multi-Level Cache Architecture

```
┌────────────────────────────────────────────────────────────────┐
│                        CACHE HIERARCHY                         │
├────────────────────────────────────────────────────────────────┤
│                                                                │
│  L1: Query Result Cache (100MB)                                │
│  ├── Key: hash(query_embedding + search_params)                │
│  ├── TTL: 5 minutes                                            │
│  ├── Hit Rate Target: 40%+ for repeated queries                │
│  └── Eviction: LRU with frequency boost                        │
│                                                                │
│  L2: Nearest Neighbor Cache (500MB)                            │
│  ├── Key: embedding_id                                         │
│  ├── Value: precomputed k-NN list                              │
│  ├── TTL: 1 hour (invalidate on index update)                  │
│  └── Hit Rate Target: 60%+ for hot embeddings                  │
│                                                                │
│  L3: Cluster Centroid Cache (200MB)                            │
│  ├── Key: cluster_id                                           │
│  ├── Value: centroid + exemplar embeddings                     │
│  ├── TTL: 24 hours                                             │
│  └── Use: Fast cluster assignment for new embeddings           │
│                                                                │
│  L4: Metadata Cache (300MB)                                    │
│  ├── Key: embedding_id                                         │
│  ├── Value: species, location, timestamp, etc.                 │
│  ├── TTL: None (invalidate on update)                          │
│  └── Hit Rate Target: 90%+ (frequently accessed)               │
│                                                                │
└────────────────────────────────────────────────────────────────┘
```
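
For the L1 key scheme, one workable sketch hashes the raw query bytes together with the search parameters; FNV-1a is an illustrative choice here, not a mandated one:

```typescript
// FNV-1a over the query's raw bytes, combined with the search parameters.
function queryCacheKey(query: Float32Array, k: number, ef: number): string {
  const bytes = new Uint8Array(query.buffer, query.byteOffset, query.byteLength);
  let h = 0x811c9dc5; // FNV offset basis
  for (const b of bytes) {
    h ^= b;
    h = Math.imul(h, 0x01000193) >>> 0; // FNV prime, kept as uint32
  }
  return `${h.toString(16)}:k=${k}:ef=${ef}`;
}
```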

### 5.2 Cache Warming Strategy

```typescript
// Declared as a const object so the values are usable at runtime.
const cacheWarmingConfig = {
  // Startup warming
  startup: {
    // Load most-queried embeddings from the past 24h
    recentQueryEmbeddings: 10000,
    // Load all cluster centroids
    clusterCentroids: 'all',
    // Load endangered species data
    prioritySpecies: ['species_list_from_config']
  },

  // Predictive warming
  predictive: {
    // Time-based patterns
    schedules: [
      { time: '05:00', action: 'warm_dawn_chorus_species' },
      { time: '19:00', action: 'warm_dusk_species' }
    ],
    // Geographic patterns
    sensorActivation: {
      triggerRadius: '50km',
      preloadNeighborSites: true
    }
  },

  // Query-driven warming
  queryDriven: {
    // On any query, prefetch neighbors' neighbors
    prefetchDepth: 2,
    prefetchCount: 10
  }
};
```

### 5.3 Cache Invalidation

```typescript
// Event-driven invalidation
const cacheInvalidator = {
  onEmbeddingInsert(id: string, embedding: Float32Array): void {
    // Invalidate affected NN caches
    const affectedNeighbors = hnswIndex.getNeighbors(id, 50);
    affectedNeighbors.forEach(nid => nnCache.invalidate(nid));

    // Invalidate the cluster centroid if the new point sits far from it
    const cluster = clusterAssignment.get(id);
    if (cluster && distanceFromCentroid(embedding, cluster) > threshold) {
      centroidCache.invalidate(cluster.id);
    }
  },

  onClusterUpdate(clusterId: string): void {
    // Invalidate the centroid and all member NN caches
    centroidCache.invalidate(clusterId);
    const members = clusterMembers.get(clusterId);
    members?.forEach(mid => nnCache.invalidate(mid)); // guard missing cluster
  },

  onGNNTrainingComplete(): void {
    // Embeddings may have shifted -- invalidate distance-based caches
    nnCache.clear();
    queryCache.clear();
    // The centroid cache can remain (recomputed lazily)
  }
};
```

---

## 6. GNN Training Schedule

### 6.1 Training Architecture

```
┌─────────────────────────────────────────────────────────────────────┐
│                        GNN TRAINING PIPELINE                         │
├─────────────────────────────────────────────────────────────────────┤
│                                                                      │
│  ONLINE LEARNING (Continuous)                                        │
│  ├── Trigger: Every 1000 new embeddings                              │
│  ├── Scope: Local neighborhood refinement                            │
│  ├── Duration: < 100ms (non-blocking)                                │
│  └── Method: Single GNN message-passing step                         │
│                                                                      │
│  INCREMENTAL TRAINING (Scheduled)                                    │
│  ├── Trigger: Hourly (off-peak) or 10K new embeddings                │
│  ├── Scope: Updated subgraph (new nodes + 2-hop neighbors)           │
│  ├── Duration: 1-5 minutes                                           │
│  └── Method: 3-5 GNN epochs on affected subgraph                     │
│                                                                      │
│  FULL RETRAINING (Periodic)                                          │
│  ├── Trigger: Weekly (Sunday 02:00-06:00) or manual                  │
│  ├── Scope: Entire graph                                             │
│  ├── Duration: 1-4 hours                                             │
│  └── Method: Full GNN training with hyperparameter tuning            │
│                                                                      │
└─────────────────────────────────────────────────────────────────────┘
```

### 6.2 Non-Blocking Training Protocol

```typescript
class GNNTrainingScheduler {
  // Assumes a readers-writer lock utility: reads are shared (queries keep
  // running), writes are exclusive. Not the npm `async-lock` API.
  private queryPriorityLock = new ReadWriteLock();

  async onlineUpdate(newEmbeddings: EmbeddingBatch): Promise<void> {
    // Non-blocking: runs in the background without affecting queries
    setImmediate(async () => {
      const subgraph = this.extractLocalSubgraph(newEmbeddings);
      await this.gnn.singleStep(subgraph);
    });
  }

  async incrementalTrain(): Promise<TrainingResult> {
    // Acquire read lock (queries continue, writes pause)
    await this.queryPriorityLock.acquireRead();

    try {
      const updatedSubgraph = this.getUpdatedSubgraph();
      const result = await this.gnn.train(updatedSubgraph, {
        epochs: 5,
        earlyStop: { patience: 2, minDelta: 0.001 }
      });

      // Apply updates atomically
      await this.applyEmbeddingUpdates(result.refinedEmbeddings);
      return result;
    } finally {
      this.queryPriorityLock.release();
    }
  }

  async fullRetrain(): Promise<TrainingResult> {
    // Acquire write lock (pause all operations)
    await this.queryPriorityLock.acquireWrite();

    try {
      // Checkpoint current state for rollback
      await this.checkpoint();

      const result = await this.gnn.fullTrain({
        epochs: 50,
        learningRate: 0.001,
        earlyStop: { patience: 10, minDelta: 0.0001 }
      });

      // Validate before applying
      if (result.validationRecall < 0.90) {
        await this.rollback();
        throw new Error('Training degraded recall, rolled back');
      }

      await this.applyFullUpdate(result);
      return result;
    } finally {
      this.queryPriorityLock.release();
    }
  }
}
```

### 6.3 Training Resource Allocation

```
OFF-PEAK (02:00-06:00 local time)
---------------------------------
- Full retraining allowed
- 100% GPU utilization
- Query latency SLA relaxed to 200ms

PEAK HOURS (06:00-22:00)
------------------------
- Online updates only
- GPU limited to 20% for training
- Query latency SLA: 50ms p99

TRANSITION PERIODS
------------------
- Incremental training allowed
- GPU limited to 50% for training
- Query latency SLA: 100ms p99
```
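
A sketch of a gate mapping wall-clock hour to the allowed training mode, GPU share, and latency SLA from this schedule; it assumes the server clock is site-local:

```typescript
interface TrainingWindow {
  mode: 'full' | 'incremental' | 'online';
  gpuShare: number;     // fraction of GPU available to training
  latencySlaMs: number; // p99 query SLA in force during the window
}

// Off-peak 02:00-06:00, peak 06:00-22:00, transition otherwise.
function trainingWindow(now: Date = new Date()): TrainingWindow {
  const h = now.getHours(); // assumes server clock is site-local time
  if (h >= 2 && h < 6) return { mode: 'full', gpuShare: 1.0, latencySlaMs: 200 };
  if (h >= 6 && h < 22) return { mode: 'online', gpuShare: 0.2, latencySlaMs: 50 };
  return { mode: 'incremental', gpuShare: 0.5, latencySlaMs: 100 };
}
```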

---

## 7. Benchmarking Framework

### 7.1 Benchmark Suite

```typescript
// Declared as a const object so targets can be read by the benchmark runner.
const benchmarkSuite = {
  // Core HNSW benchmarks
  hnsw: {
    insertThroughput: {
      description: 'Vectors inserted per second',
      target: '>= 10,000 vectors/s',
      dataset: '1M random 1536-D vectors'
    },
    queryLatency: {
      description: 'Single query latency distribution',
      targets: {
        p50: '<= 10ms',
        p95: '<= 30ms',
        p99: '<= 50ms'
      },
      dataset: '1M indexed, 10K queries'
    },
    recallAtK: {
      description: 'Recall compared to brute force',
      targets: {
        recall10: '>= 0.95',
        recall100: '>= 0.98'
      }
    },
    concurrentQueries: {
      description: 'Throughput under concurrent load',
      target: '>= 1,000 QPS at p99 < 100ms',
      concurrency: [1, 10, 50, 100, 200]
    }
  },

  // End-to-end pipeline benchmarks
  pipeline: {
    audioToEmbedding: {
      description: 'Full audio processing latency',
      target: '<= 200ms per 5s segment',
      includeIO: true
    },
    ingestionThroughput: {
      description: 'Sustained ingestion rate',
      target: '>= 100 segments/second',
      duration: '1 hour'
    },
    queryWithMetadata: {
      description: 'Query + metadata fetch',
      target: '<= 75ms p99'
    }
  },

  // GNN-specific benchmarks
  gnn: {
    onlineUpdateLatency: {
      description: 'Single-step GNN update',
      target: '<= 100ms for 1K node subgraph'
    },
    incrementalTrainTime: {
      description: 'Hourly incremental training',
      target: '<= 5 minutes for 10K updates'
    },
    recallImprovement: {
      description: 'Recall gain from GNN refinement',
      target: '>= 2% improvement over baseline HNSW'
    }
  },

  // Memory benchmarks
  memory: {
    indexMemoryPerVector: {
      description: 'Memory per indexed vector',
      target: '<= 8KB (including overhead)'
    },
    cacheHitRate: {
      description: 'Cache effectiveness',
      targets: {
        queryCache: '>= 30%',
        nnCache: '>= 50%',
        metadataCache: '>= 80%'
      }
    },
    quantizationRecallLoss: {
      description: 'Recall loss from int8 quantization',
      target: '<= 3%'
    }
  }
};
```

### 7.2 SLA Definitions

| Operation | p50 | p95 | p99 | p99.9 |
|-----------|-----|-----|-----|-------|
| kNN Query (k=10) | 5ms | 20ms | 50ms | 100ms |
| kNN Query (k=100) | 10ms | 40ms | 80ms | 150ms |
| Range Query (r<0.5) | 15ms | 50ms | 100ms | 200ms |
| Insert Single | 1ms | 5ms | 10ms | 20ms |
| Batch Insert (1000) | 50ms | 100ms | 200ms | 500ms |
| Cluster Assignment | 20ms | 50ms | 100ms | 200ms |
| Full Pipeline (audio->result) | 200ms | 500ms | 1000ms | 2000ms |
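
To keep SLA checks unambiguous, a sketch of the percentile computation assumed here (nearest-rank over sorted samples):

```typescript
// Nearest-rank percentile over raw latency samples.
function percentile(samplesMs: number[], p: number): number {
  if (samplesMs.length === 0) throw new Error('no samples');
  const sorted = [...samplesMs].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(0, rank - 1)];
}

// Example: p99 of a latency window
// const p99 = percentile(windowLatencies, 99);
```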

### 7.3 Continuous Benchmarking

```yaml
# .github/workflows/performance.yml
benchmark_schedule:
  nightly:
    - hnsw_insert_throughput
    - hnsw_query_latency
    - hnsw_recall

  weekly:
    - full_pipeline_benchmark
    - gnn_training_benchmark
    - memory_pressure_test

  on_release:
    - all_benchmarks
    - scalability_test_10M
    - longevity_test_24h

regression_thresholds:
  latency_increase: 10%    # Alert if p99 increases by 10%
  throughput_decrease: 5%  # Alert if QPS drops by 5%
  recall_decrease: 1%      # Alert if recall drops by 1%
```

---

## 8. Horizontal Scalability

### 8.1 Sharding Strategy

```
SHARDING APPROACH: Geographic + Temporal Hybrid
===============================================

Primary Shard Key: Geographic Region (sensor cluster)
- Shard 0: North America West
- Shard 1: North America East
- Shard 2: Europe
- Shard 3: Asia-Pacific
- Shard 4: South America
- Shard 5: Africa

Secondary Partition: Temporal (within shard)
- Hot: Current month
- Warm: Past 12 months
- Cold: Archive (>12 months)

Cross-Shard Queries:
- Use scatter-gather pattern
- Merge results by distance
- Timeout per shard: 100ms
```
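
A sketch of the routing decision for an incoming embedding; the region codes and month arithmetic are illustrative assumptions:

```typescript
type TemporalPartition = 'hot' | 'warm' | 'cold';

// Region codes are illustrative; the shard numbering follows Section 8.1.
const REGION_SHARDS: Record<string, number> = {
  'na-west': 0, 'na-east': 1, 'europe': 2,
  'apac': 3, 'south-america': 4, 'africa': 5
};

function routeEmbedding(region: string, recordedAt: Date, now = new Date()) {
  const shard = REGION_SHARDS[region];
  if (shard === undefined) throw new Error(`unknown region: ${region}`);

  // Whole-month age: 0 = current month (hot), <= 12 = warm, else cold.
  const monthsOld =
    (now.getFullYear() - recordedAt.getFullYear()) * 12 +
    (now.getMonth() - recordedAt.getMonth());

  const partition: TemporalPartition =
    monthsOld === 0 ? 'hot' : monthsOld <= 12 ? 'warm' : 'cold';

  return { shard, partition };
}
```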

### 8.2 Cluster Architecture

```
┌─────────────────────────────────────────────────────────────────────┐
│                      DISTRIBUTED ARCHITECTURE                        │
├─────────────────────────────────────────────────────────────────────┤
│                                                                      │
│   ┌─────────────┐    ┌─────────────┐    ┌─────────────┐              │
│   │  Query LB   │    │  Query LB   │    │  Query LB   │              │
│   └──────┬──────┘    └──────┬──────┘    └──────┬──────┘              │
│          │                  │                  │                     │
│          v                  v                  v                     │
│   ┌─────────────────────────────────────────────────────┐            │
│   │           Query Router (Consistent Hash)            │            │
│   └─────────────────────────┬───────────────────────────┘            │
│                             │                                        │
│      ┌──────────────────────┼──────────────────────┐                 │
│      │                      │                      │                 │
│      v                      v                      v                 │
│  ┌─────────┐           ┌─────────┐            ┌─────────┐            │
│  │ Shard 0 │           │ Shard 1 │            │ Shard N │            │
│  │ (3 rep) │           │ (3 rep) │            │ (3 rep) │            │
│  └────┬────┘           └────┬────┘            └────┬────┘            │
│       │                     │                      │                 │
│       v                     v                      v                 │
│  ┌─────────┐           ┌─────────┐            ┌─────────┐            │
│  │  HNSW   │           │  HNSW   │            │  HNSW   │            │
│  │  Index  │           │  Index  │            │  Index  │            │
│  └─────────┘           └─────────┘            └─────────┘            │
│                                                                      │
│   ┌─────────────────────────────────────────────────────┐            │
│   │        Shared Metadata Store (Distributed)          │            │
│   └─────────────────────────────────────────────────────┘            │
│                                                                      │
│   ┌─────────────────────────────────────────────────────┐            │
│   │       GNN Training Coordinator (Async Gossip)       │            │
│   └─────────────────────────────────────────────────────┘            │
│                                                                      │
└─────────────────────────────────────────────────────────────────────┘
```

### 8.3 Scaling Thresholds

| Metric | Scale-Out Trigger | Scale-In Trigger |
|--------|-------------------|------------------|
| Query Latency p99 | > 80ms sustained 5min | < 30ms sustained 1h |
| CPU Utilization | > 70% sustained 5min | < 30% sustained 1h |
| Memory Utilization | > 80% | < 50% sustained 1h |
| Queue Depth | > 10K pending | < 1K sustained 30min |
| Shard Size | > 500K vectors | N/A (don't scale in) |
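
The "sustained" qualifiers matter for avoiding flappy autoscaling. A sketch of a sliding-window trigger that fires only when every sample in the window breaches the threshold:

```typescript
// Fires only when ALL samples in the window breach the threshold,
// approximating "sustained for N minutes" at a fixed sample interval.
class SustainedTrigger {
  private window: number[] = [];

  constructor(
    private readonly threshold: number,
    private readonly windowSize: number, // samples, e.g. 30 at 10s = 5min
    private readonly above: boolean      // true = scale-out style trigger
  ) {}

  observe(value: number): boolean {
    this.window.push(value);
    if (this.window.length > this.windowSize) this.window.shift();
    if (this.window.length < this.windowSize) return false;
    return this.above
      ? this.window.every(v => v > this.threshold)
      : this.window.every(v => v < this.threshold);
  }
}

// e.g. p99-latency scale-out: new SustainedTrigger(80, 30, true)
```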

### 8.4 Cross-Shard Query Protocol

```typescript
async function globalKnnQuery(
  query: Float32Array,
  k: number,
  options: QueryOptions
): Promise<SearchResult[]> {
  const shards = options.shards || getAllShards();
  const perShardK = Math.ceil(k * 1.5); // Over-fetch so the merge can fill top-k

  // Scatter phase
  const shardPromises = shards.map(shard =>
    queryShardWithTimeout(shard, query, perShardK, options.timeout || 100)
  );

  // Gather phase, keeping partial results when a shard times out
  const results = await Promise.allSettled(shardPromises);

  // Merge and re-rank (the type guard narrows to fulfilled results)
  const allResults = results
    .filter((r): r is PromiseFulfilledResult<SearchResult[]> => r.status === 'fulfilled')
    .flatMap(r => r.value);

  // Sort by distance and take the global top-k
  allResults.sort((a, b) => a.distance - b.distance);
  return allResults.slice(0, k);
}
```

---

## 9. Latency Budget Breakdown

### 9.1 Query Path Latency Budget

```
TOTAL BUDGET: 50ms (p99 target)
===============================

┌────────────────────────────────────────────────────┐
│ Component                  │ Budget │ % of Total   │
├────────────────────────────────────────────────────┤
│ Network (client -> LB)     │  5ms   │  10%         │
│ Load Balancer routing      │  1ms   │   2%         │
│ Query parsing/validation   │  1ms   │   2%         │
│ Cache lookup (L1-L4)       │  3ms   │   6%         │
│ HNSW search (k=10)         │ 25ms   │  50%         │
│ Metadata fetch             │  5ms   │  10%         │
│ Result serialization       │  2ms   │   4%         │
│ Network (LB -> client)     │  5ms   │  10%         │
│ Buffer/headroom            │  3ms   │   6%         │
├────────────────────────────────────────────────────┤
│ TOTAL                      │ 50ms   │ 100%         │
└────────────────────────────────────────────────────┘
```

### 9.2 Ingestion Path Latency Budget

```
TOTAL BUDGET: 200ms (p99 target for single segment)
===================================================

┌────────────────────────────────────────────────────┐
│ Component                  │ Budget │ % of Total   │
├────────────────────────────────────────────────────┤
│ Audio receive/decode       │  10ms  │   5%         │
│ Mel spectrogram compute    │  20ms  │  10%         │
│ Perch model inference      │  80ms  │  40%         │
│ Embedding normalization    │   5ms  │   2.5%       │
│ HNSW insertion             │  20ms  │  10%         │
│ Metadata write             │  10ms  │   5%         │
│ Cache invalidation         │  10ms  │   5%         │
│ Acknowledgment             │   5ms  │   2.5%       │
│ Buffer/headroom            │  40ms  │  20%         │
├────────────────────────────────────────────────────┤
│ TOTAL                      │ 200ms  │ 100%         │
└────────────────────────────────────────────────────┘
```

### 9.3 GNN Training Latency Constraints

```
ONLINE UPDATE: 100ms max (non-blocking)
---------------------------------------
- Subgraph extraction: 20ms
- Single GNN forward pass: 50ms
- Embedding update (async): 30ms

INCREMENTAL TRAINING: 5 min max
-------------------------------
- Subgraph construction: 30s
- Training (5 epochs): 4 min
- Embedding sync: 30s

FULL RETRAINING: 4 hour max
---------------------------
- Graph snapshot: 10 min
- Training (50 epochs): 3.5 hours
- Validation: 10 min
- Cutover: 10 min
```

---

## Consequences

### Positive

- **Sub-100ms query latency** achieved through HNSW tuning and multi-level caching
- **4x storage reduction** for cold data via int8 scalar quantization
- **Non-blocking GNN learning** enables continuous improvement without query degradation
- **Linear horizontal scaling** via geographic sharding
- **Clear SLAs** enable capacity planning and alerting

### Negative

- **Increased operational complexity** from multi-tier storage and the distributed architecture
- **Memory overhead** from caching layers (~1.1GB dedicated to caches)
- **Quantization recall loss** of 2-3% for cold-tier data
- **Cross-shard query overhead** adds latency for global searches

### Neutral

- **Trade-off flexibility** allows tuning precision vs. latency per use case
- **Benchmark-driven development** requires ongoing measurement infrastructure

---

## Implementation Phases

### Phase 1: Foundation (Weeks 1-2)

- Implement HNSW with tuned parameters
- Set up the benchmark suite
- Establish baseline metrics

### Phase 2: Optimization (Weeks 3-4)

- Implement scalar quantization
- Add multi-level caching
- Optimize the batch ingestion pipeline

### Phase 3: Learning (Weeks 5-6)

- Integrate the GNN training scheduler
- Implement non-blocking updates
- Validate recall improvements

### Phase 4: Scale (Weeks 7-8)

- Implement the sharding layer
- Deploy the distributed architecture
- Load test at 1M+ vectors

---

## References

- Perch 2.0: https://arxiv.org/abs/2508.04665
- RuVector: https://github.com/ruvnet/ruvector
- HNSW: Malkov, Y. & Yashunin, D. (2018), "Efficient and Robust Approximate Nearest Neighbor Search Using Hierarchical Navigable Small World Graphs"
- Product Quantization: Jégou, H., Douze, M. & Schmid, C. (2011), "Product Quantization for Nearest Neighbor Search"
- Graph Attention Networks: Veličković, P. et al. (2018)