Squashed 'vendor/ruvector/' content from commit b64c2172

git-subtree-dir: vendor/ruvector
git-subtree-split: b64c21726f2bb37286d9ee36a7869fef60cc6900
This commit is contained in:
ruv
2026-02-28 14:39:40 -05:00
commit d803bfe2b1
7854 changed files with 3522914 additions and 0 deletions

View File

@@ -0,0 +1,974 @@
# ADR-004: Performance Optimization Strategy
## Status
Proposed
## Date
2026-01-15
## Context
7sense is a bioacoustics platform that processes bird call audio using Perch 2.0 embeddings (1536-D vectors from 5-second audio segments at 32kHz) stored in a RuVector-based system with HNSW indexing and GNN learning capabilities. The system must handle:
- **Scale**: 1M+ bird call embeddings with sub-100ms query latency
- **Continuous Learning**: GNN refinement without blocking query operations
- **Hierarchical Data**: Poincare ball hyperbolic embeddings for species/call-type taxonomies
- **Real-time Ingestion**: Streaming audio from field sensors
This ADR defines the performance optimization strategy to meet these requirements while maintaining system reliability and cost efficiency.
## Decision
We adopt a multi-layered performance optimization approach covering HNSW tuning, embedding quantization, batch processing, memory management, caching, GNN scheduling, and horizontal scalability.
---
## 1. HNSW Parameter Tuning
### 1.1 Core Parameters
| Parameter | Value | Rationale |
|-----------|-------|-----------|
| **M** (max connections per node) | 32 | Optimal for 1536-D vectors; balances recall vs memory. Higher than default (16) due to high dimensionality. |
| **efConstruction** | 200 | Build-time search depth. Higher ensures quality graph structure for dense embedding spaces. |
| **efSearch** | 128 (default) / 256 (high-recall) | Query-time search depth. Tunable per query based on precision requirements. |
| **maxLevel** | auto (log2(N)/log2(M)) | Automatically determined; ~6-7 levels for 1M vectors with M=32. |
### 1.2 Dimensionality-Specific Adjustments
```
For 1536-D Perch embeddings:
- Use L2 distance (Euclidean) for normalized vectors
- Consider Product Quantization (PQ) for memory reduction (see Section 2)
- Enable SIMD acceleration (AVX-512 where available)
```
### 1.3 Benchmark Targets
| Metric | Target | Measurement Method |
|--------|--------|-------------------|
| Recall@10 | >= 0.95 | Compare against brute-force ground truth |
| Recall@100 | >= 0.98 | Same |
| Query Latency (p50) | < 10ms | Single-threaded, 1M vectors |
| Query Latency (p99) | < 50ms | Under concurrent load |
| Build Time | < 30 min | For 1M vectors cold start |
### 1.4 Tuning Protocol
```typescript
interface HNSWTuningConfig {
// Phase 1: Initial calibration (10K sample)
calibration: {
sampleSize: 10000,
mRange: [16, 24, 32, 48],
efConstructionRange: [100, 200, 400],
targetRecall: 0.95
},
// Phase 2: Full index build with optimal params
production: {
m: 32, // Determined from calibration
efConstruction: 200,
efSearchDefault: 128,
efSearchHighRecall: 256
},
// Phase 3: Runtime adaptation
adaptive: {
enableDynamicEf: true,
efFloor: 64,
efCeiling: 512,
latencyTarget: 50 // ms
}
}
```
---
## 2. Embedding Quantization Strategy
### 2.1 Tiered Storage Architecture
```
HOT TIER (Active Queries)
-------------------------
- Format: float32 (full precision)
- Size: 1536 * 4 = 6,144 bytes/vector
- Capacity: ~100K vectors (600MB RAM)
- Use: Real-time queries, recent recordings
WARM TIER (Frequent Access)
---------------------------
- Format: float16 (half precision)
- Size: 1536 * 2 = 3,072 bytes/vector
- Capacity: ~500K vectors (1.5GB RAM)
- Use: Weekly active data, popular species
COLD TIER (Archive)
-------------------
- Format: int8 (scalar quantization)
- Size: 1536 * 1 = 1,536 bytes/vector
- Capacity: ~2M+ vectors (3GB disk)
- Use: Historical data, rare species
```
### 2.2 Quantization Methods
| Method | Compression | Recall Impact | Use Case |
|--------|-------------|---------------|----------|
| **Scalar (int8)** | 4x | -2-3% recall | Cold storage, bulk search |
| **Product Quantization (PQ)** | 8-16x | -3-5% recall | Very large archives |
| **Binary** | 32x | -10-15% recall | First-pass filtering only |
### 2.3 Scalar Quantization Implementation
```typescript
class ScalarQuantizer {
// Per-dimension min/max for calibration
private mins: Float32Array;
private maxs: Float32Array;
private scales: Float32Array;
calibrate(embeddings: Float32Array[], sampleSize: number = 10000): void {
// Sample random embeddings for range estimation
const sample = this.randomSample(embeddings, sampleSize);
for (let d = 0; d < 1536; d++) {
const values = sample.map(e => e[d]);
this.mins[d] = Math.min(...values);
this.maxs[d] = Math.max(...values);
this.scales[d] = 255 / (this.maxs[d] - this.mins[d]);
}
}
quantize(embedding: Float32Array): Uint8Array {
const quantized = new Uint8Array(1536);
for (let d = 0; d < 1536; d++) {
const normalized = (embedding[d] - this.mins[d]) * this.scales[d];
quantized[d] = Math.round(Math.max(0, Math.min(255, normalized)));
}
return quantized;
}
dequantize(quantized: Uint8Array): Float32Array {
const embedding = new Float32Array(1536);
for (let d = 0; d < 1536; d++) {
embedding[d] = (quantized[d] / this.scales[d]) + this.mins[d];
}
return embedding;
}
}
```
### 2.4 Promotion/Demotion Policy
```
PROMOTION (Cold -> Warm -> Hot)
-------------------------------
Trigger: Query frequency > threshold OR explicit prefetch
- Cold -> Warm: 5+ queries in 24h
- Warm -> Hot: 20+ queries in 1h
DEMOTION (Hot -> Warm -> Cold)
------------------------------
Trigger: Time-based decay OR memory pressure
- Hot -> Warm: No queries in 1h
- Warm -> Cold: No queries in 7d
- LRU eviction when tier exceeds capacity
```
---
## 3. Batch Processing Pipeline
### 3.1 Audio Ingestion Architecture
```
┌─────────────────────────────────────────────────────────────────────┐
│ AUDIO INGESTION PIPELINE │
├─────────────────────────────────────────────────────────────────────┤
│ │
│ [Sensors] ──> [Buffer Queue] ──> [Segment Detector] │
│ │ │ │ │
│ │ (5min chunks) (5s windows) │
│ v v v │
│ [Raw Storage] [Batch Accumulator] [Perch Embedder] │
│ │ │ │
│ (1000 segments) (GPU batch) │
│ v v │
│ [Embedding Queue] <── [1536-D vectors] │
│ │ │
│ v │
│ [HNSW Batch Insert] │
│ │ │
│ (async, non-blocking) │
│ v │
│ [Index + Metadata Store] │
│ │
└─────────────────────────────────────────────────────────────────────┘
```
### 3.2 Batch Sizing Parameters
| Stage | Batch Size | Latency Target | Throughput |
|-------|------------|----------------|------------|
| Audio buffer | 5 min chunks | < 1s queue delay | 100+ streams |
| Segment detection | 100 segments | < 500ms | 1000 segments/s |
| Perch embedding | 64 segments | < 2s GPU | 32 segments/s/GPU |
| HNSW insertion | 1000 vectors | < 100ms | 10K vectors/s |
| Metadata write | 1000 records | < 50ms | 20K records/s |
### 3.3 Backpressure Handling
```typescript
interface BackpressureConfig {
// Queue depth thresholds
warningThreshold: 10000, // Start logging warnings
throttleThreshold: 50000, // Reduce intake rate
dropThreshold: 100000, // Drop lowest-priority data
// Priority levels for graceful degradation
priorities: {
critical: 'endangered_species', // Never drop
high: 'known_species_new_recording', // Drop last
normal: 'routine_monitoring', // Standard handling
low: 'background_noise_samples' // Drop first
},
// Rate limiting
maxIngestionRate: 10000, // segments/minute
burstAllowance: 5000, // temporary overflow
}
```
### 3.4 Batch Insert Optimization
```typescript
async function batchInsertEmbeddings(
embeddings: Float32Array[],
metadata: EmbeddingMetadata[],
config: BatchConfig
): Promise<BatchResult> {
const batchSize = config.batchSize || 1000;
const results: BatchResult = { inserted: 0, failed: 0, latencyMs: [] };
// Sort by expected cluster for better cache locality
const sorted = sortByClusterHint(embeddings, metadata);
for (let i = 0; i < sorted.length; i += batchSize) {
const batch = sorted.slice(i, i + batchSize);
const start = performance.now();
// Parallel insert with connection pooling
await Promise.all([
hnswIndex.batchAdd(batch.embeddings),
metadataStore.batchInsert(batch.metadata)
]);
results.latencyMs.push(performance.now() - start);
results.inserted += batch.length;
}
return results;
}
```
---
## 4. Memory Management
### 4.1 Streaming vs Batch Trade-offs
| Mode | Memory Footprint | Latency | Use Case |
|------|------------------|---------|----------|
| **Streaming** | O(window_size) ~50MB | Real-time (<1s) | Live monitoring |
| **Micro-batch** | O(batch_size) ~200MB | Near-real-time (<5s) | Standard ingestion |
| **Batch** | O(full_batch) ~2GB | Minutes | Bulk historical import |
### 4.2 Memory Budget Allocation
```
TOTAL MEMORY BUDGET: 16GB (single node)
=======================================
HNSW Index (Hot): 4GB (25%)
- ~650K float32 vectors
- Navigation structure overhead
Embedding Cache: 3GB (19%)
- LRU cache for frequent queries
- Warm tier spillover
GNN Model: 2GB (12%)
- Model parameters
- Gradient buffers
- Activation cache
Query Buffers: 2GB (12%)
- Concurrent query working memory
- Result aggregation
Ingestion Pipeline: 2GB (12%)
- Audio processing buffers
- Batch accumulation
Metadata/Index: 2GB (12%)
- SQLite/RocksDB buffers
- B-tree indices
OS/Overhead: 1GB (6%)
- System requirements
- Safety margin
```
### 4.3 Memory Pressure Response
```typescript
interface MemoryManager {
thresholds: {
warning: 0.75, // 75% utilization
critical: 0.90, // 90% utilization
emergency: 0.95 // 95% utilization
},
responses: {
warning: [
'reduce_batch_sizes',
'increase_demotion_rate',
'log_memory_profile'
],
critical: [
'pause_gnn_training',
'aggressive_cache_eviction',
'reject_low_priority_queries'
],
emergency: [
'stop_ingestion',
'force_checkpoint',
'alert_operations'
]
}
}
```
### 4.4 Zero-Copy Optimizations
```typescript
// Use memory-mapped files for large read-only data
const coldTierIndex = mmap('/data/cold_embeddings.bin', {
mode: 'readonly',
advice: MADV_RANDOM // Optimize for random access
});
// Share embedding buffers between query threads
const sharedQueryBuffer = new SharedArrayBuffer(
QUERY_BATCH_SIZE * EMBEDDING_DIM * 4
);
// Avoid copies in pipeline stages
function processSegment(audio: AudioBuffer): EmbeddingResult {
// Pass views, not copies
const spectrogram = computeMelSpectrogram(audio.subarray(0, WINDOW_SIZE));
const embedding = perchModel.embed(spectrogram); // Returns view
return { embedding, metadata: extractMetadata(audio) };
}
```
---
## 5. Caching Strategy
### 5.1 Multi-Level Cache Architecture
```
┌────────────────────────────────────────────────────────────────┐
│ CACHE HIERARCHY │
├────────────────────────────────────────────────────────────────┤
│ │
│ L1: Query Result Cache (100MB) │
│ ├── Key: hash(query_embedding + search_params) │
│ ├── TTL: 5 minutes │
│ ├── Hit Rate Target: 40%+ for repeated queries │
│ └── Eviction: LRU with frequency boost │
│ │
│ L2: Nearest Neighbor Cache (500MB) │
│ ├── Key: embedding_id │
│ ├── Value: precomputed k-NN list │
│ ├── TTL: 1 hour (invalidate on index update) │
│ └── Hit Rate Target: 60%+ for hot embeddings │
│ │
│ L3: Cluster Centroid Cache (200MB) │
│ ├── Key: cluster_id │
│ ├── Value: centroid + exemplar embeddings │
│ ├── TTL: 24 hours │
│ └── Use: Fast cluster assignment for new embeddings │
│ │
│ L4: Metadata Cache (300MB) │
│ ├── Key: embedding_id │
│ ├── Value: species, location, timestamp, etc. │
│ ├── TTL: None (invalidate on update) │
│ └── Hit Rate Target: 90%+ (frequently accessed) │
│ │
└────────────────────────────────────────────────────────────────┘
```
### 5.2 Cache Warming Strategy
```typescript
interface CacheWarmingConfig {
// Startup warming
startup: {
// Load most queried embeddings from past 24h
recentQueryEmbeddings: 10000,
// Load all cluster centroids
clusterCentroids: 'all',
// Load endangered species data
prioritySpecies: ['species_list_from_config']
},
// Predictive warming
predictive: {
// Time-based patterns
schedules: [
{ time: '05:00', action: 'warm_dawn_chorus_species' },
{ time: '19:00', action: 'warm_dusk_species' }
],
// Geographic patterns
sensorActivation: {
triggerRadius: '50km',
preloadNeighborSites: true
}
},
// Query-driven warming
queryDriven: {
// On any query, prefetch neighbors' neighbors
prefetchDepth: 2,
prefetchCount: 10
}
}
```
### 5.3 Cache Invalidation
```typescript
// Event-driven invalidation
const cacheInvalidator = {
onEmbeddingInsert(id: string, embedding: Float32Array): void {
// Invalidate affected NN caches
const affectedNeighbors = hnswIndex.getNeighbors(id, 50);
affectedNeighbors.forEach(nid => nnCache.invalidate(nid));
// Invalidate cluster centroid if significantly different
const cluster = clusterAssignment.get(id);
if (cluster && distanceFromCentroid(embedding, cluster) > threshold) {
centroidCache.invalidate(cluster.id);
}
},
onClusterUpdate(clusterId: string): void {
// Invalidate centroid and all member NN caches
centroidCache.invalidate(clusterId);
const members = clusterMembers.get(clusterId);
members.forEach(mid => nnCache.invalidate(mid));
},
onGNNTrainingComplete(): void {
// Embeddings may have shifted - invalidate distance-based caches
nnCache.clear();
queryCache.clear();
// Centroid cache can remain (recomputed lazily)
}
}
```
---
## 6. GNN Training Schedule
### 6.1 Training Architecture
```
┌─────────────────────────────────────────────────────────────────────┐
│ GNN TRAINING PIPELINE │
├─────────────────────────────────────────────────────────────────────┤
│ │
│ ONLINE LEARNING (Continuous) │
│ ├── Trigger: Every 1000 new embeddings │
│ ├── Scope: Local neighborhood refinement │
│ ├── Duration: < 100ms (non-blocking) │
│ └── Method: Single GNN message-passing step │
│ │
│ INCREMENTAL TRAINING (Scheduled) │
│ ├── Trigger: Hourly (off-peak) or 10K new embeddings │
│ ├── Scope: Updated subgraph (new nodes + 2-hop neighbors) │
│ ├── Duration: 1-5 minutes │
│ └── Method: 3-5 GNN epochs on affected subgraph │
│ │
│ FULL RETRAINING (Periodic) │
│ ├── Trigger: Weekly (Sunday 02:00-06:00) or manual │
│ ├── Scope: Entire graph │
│ ├── Duration: 1-4 hours │
│ └── Method: Full GNN training with hyperparameter tuning │
│ │
└─────────────────────────────────────────────────────────────────────┘
```
### 6.2 Non-Blocking Training Protocol
```typescript
class GNNTrainingScheduler {
private queryPriorityLock = new AsyncLock();
async onlineUpdate(newEmbeddings: EmbeddingBatch): Promise<void> {
// Non-blocking: runs in background, doesn't affect queries
setImmediate(async () => {
const subgraph = this.extractLocalSubgraph(newEmbeddings);
await this.gnn.singleStep(subgraph);
});
}
async incrementalTrain(): Promise<TrainingResult> {
// Acquire read lock (queries continue, writes pause)
await this.queryPriorityLock.acquireRead();
try {
const updatedSubgraph = this.getUpdatedSubgraph();
const result = await this.gnn.train(updatedSubgraph, {
epochs: 5,
earlyStop: { patience: 2, minDelta: 0.001 }
});
// Apply updates atomically
await this.applyEmbeddingUpdates(result.refinedEmbeddings);
return result;
} finally {
this.queryPriorityLock.release();
}
}
async fullRetrain(): Promise<TrainingResult> {
// Acquire write lock (pause all operations)
await this.queryPriorityLock.acquireWrite();
try {
// Checkpoint current state for rollback
await this.checkpoint();
const result = await this.gnn.fullTrain({
epochs: 50,
learningRate: 0.001,
earlyStop: { patience: 10, minDelta: 0.0001 }
});
// Validate before applying
if (result.validationRecall < 0.90) {
await this.rollback();
throw new Error('Training degraded recall, rolled back');
}
await this.applyFullUpdate(result);
return result;
} finally {
this.queryPriorityLock.release();
}
}
}
```
### 6.3 Training Resource Allocation
```
OFF-PEAK (02:00-06:00 local time)
---------------------------------
- Full retraining allowed
- 100% GPU utilization
- Query latency SLA relaxed to 200ms
PEAK HOURS (06:00-22:00)
------------------------
- Online updates only
- GPU limited to 20% for training
- Query latency SLA: 50ms p99
TRANSITION PERIODS
------------------
- Incremental training allowed
- GPU limited to 50% for training
- Query latency SLA: 100ms p99
```
---
## 7. Benchmarking Framework
### 7.1 Benchmark Suite
```typescript
interface BenchmarkSuite {
// Core HNSW benchmarks
hnsw: {
insertThroughput: {
description: 'Vectors inserted per second',
target: '>= 10,000 vectors/s',
dataset: '1M random 1536-D vectors'
},
queryLatency: {
description: 'Single query latency distribution',
targets: {
p50: '<= 10ms',
p95: '<= 30ms',
p99: '<= 50ms'
},
dataset: '1M indexed, 10K queries'
},
recallAtK: {
description: 'Recall compared to brute force',
targets: {
recall10: '>= 0.95',
recall100: '>= 0.98'
}
},
concurrentQueries: {
description: 'Throughput under concurrent load',
target: '>= 1,000 QPS at p99 < 100ms',
concurrency: [1, 10, 50, 100, 200]
}
},
// End-to-end pipeline benchmarks
pipeline: {
audioToEmbedding: {
description: 'Full audio processing latency',
target: '<= 200ms per 5s segment',
includeIO: true
},
ingestionThroughput: {
description: 'Sustained ingestion rate',
target: '>= 100 segments/second',
duration: '1 hour'
},
queryWithMetadata: {
description: 'Query + metadata fetch',
target: '<= 75ms p99'
}
},
// GNN-specific benchmarks
gnn: {
onlineUpdateLatency: {
description: 'Single-step GNN update',
target: '<= 100ms for 1K node subgraph'
},
incrementalTrainTime: {
description: 'Hourly incremental training',
target: '<= 5 minutes for 10K updates'
},
recallImprovement: {
description: 'Recall gain from GNN refinement',
target: '>= 2% improvement over baseline HNSW'
}
},
// Memory benchmarks
memory: {
indexMemoryPerVector: {
description: 'Memory per indexed vector',
target: '<= 8KB (including overhead)'
},
cacheHitRate: {
description: 'Cache effectiveness',
targets: {
queryCache: '>= 30%',
nnCache: '>= 50%',
metadataCache: '>= 80%'
}
},
quantizationRecallLoss: {
description: 'Recall loss from int8 quantization',
target: '<= 3%'
}
}
}
```
### 7.2 SLA Definitions
| Operation | p50 | p95 | p99 | p99.9 |
|-----------|-----|-----|-----|-------|
| kNN Query (k=10) | 5ms | 20ms | 50ms | 100ms |
| kNN Query (k=100) | 10ms | 40ms | 80ms | 150ms |
| Range Query (r<0.5) | 15ms | 50ms | 100ms | 200ms |
| Insert Single | 1ms | 5ms | 10ms | 20ms |
| Batch Insert (1000) | 50ms | 100ms | 200ms | 500ms |
| Cluster Assignment | 20ms | 50ms | 100ms | 200ms |
| Full Pipeline (audio->result) | 200ms | 500ms | 1000ms | 2000ms |
### 7.3 Continuous Benchmarking
```yaml
# .github/workflows/performance.yml
benchmark_schedule:
nightly:
- hnsw_insert_throughput
- hnsw_query_latency
- hnsw_recall
weekly:
- full_pipeline_benchmark
- gnn_training_benchmark
- memory_pressure_test
on_release:
- all_benchmarks
- scalability_test_10M
- longevity_test_24h
regression_thresholds:
latency_increase: 10% # Alert if p99 increases by 10%
throughput_decrease: 5% # Alert if QPS drops by 5%
recall_decrease: 1% # Alert if recall drops by 1%
```
---
## 8. Horizontal Scalability
### 8.1 Sharding Strategy
```
SHARDING APPROACH: Geographic + Temporal Hybrid
===============================================
Primary Shard Key: Geographic Region (sensor cluster)
- Shard 0: North America West
- Shard 1: North America East
- Shard 2: Europe
- Shard 3: Asia-Pacific
- Shard 4: South America
- Shard 5: Africa
Secondary Partition: Temporal (within shard)
- Hot: Current month
- Warm: Past 12 months
- Cold: Archive (>12 months)
Cross-Shard Queries:
- Use scatter-gather pattern
- Merge results by distance
- Timeout per shard: 100ms
```
### 8.2 Cluster Architecture
```
┌─────────────────────────────────────────────────────────────────────┐
│ DISTRIBUTED ARCHITECTURE │
├─────────────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
│ │ Query LB │ │ Query LB │ │ Query LB │ │
│ └──────┬──────┘ └──────┬──────┘ └──────┬──────┘ │
│ │ │ │ │
│ v v v │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ Query Router (Consistent Hash) │ │
│ └─────────────────────────┬───────────────────────────┘ │
│ │ │
│ ┌──────────────────────┼──────────────────────┐ │
│ │ │ │ │
│ v v v │
│ ┌─────────┐ ┌─────────┐ ┌─────────┐ │
│ │ Shard 0 │ │ Shard 1 │ │ Shard N │ │
│ │ (3 rep) │ │ (3 rep) │ │ (3 rep) │ │
│ └────┬────┘ └────┬────┘ └────┬────┘ │
│ │ │ │ │
│ v v v │
│ ┌─────────┐ ┌─────────┐ ┌─────────┐ │
│ │ HNSW │ │ HNSW │ │ HNSW │ │
│ │ Index │ │ Index │ │ Index │ │
│ └─────────┘ └─────────┘ └─────────┘ │
│ │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ Shared Metadata Store (Distributed) │ │
│ └─────────────────────────────────────────────────────┘ │
│ │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ GNN Training Coordinator (Async Gossip) │ │
│ └─────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────┘
```
### 8.3 Scaling Thresholds
| Metric | Scale-Out Trigger | Scale-In Trigger |
|--------|-------------------|------------------|
| Query Latency p99 | > 80ms sustained 5min | < 30ms sustained 1h |
| CPU Utilization | > 70% sustained 5min | < 30% sustained 1h |
| Memory Utilization | > 80% | < 50% sustained 1h |
| Queue Depth | > 10K pending | < 1K sustained 30min |
| Shard Size | > 500K vectors | N/A (don't scale in) |
### 8.4 Cross-Shard Query Protocol
```typescript
async function globalKnnQuery(
query: Float32Array,
k: number,
options: QueryOptions
): Promise<SearchResult[]> {
const shards = options.shards || getAllShards();
const perShardK = Math.ceil(k * 1.5); // Over-fetch for merge
// Scatter phase
const shardPromises = shards.map(shard =>
queryShardWithTimeout(shard, query, perShardK, options.timeout || 100)
);
// Gather phase with partial results on timeout
const results = await Promise.allSettled(shardPromises);
// Merge and re-rank
const allResults = results
.filter(r => r.status === 'fulfilled')
.flatMap(r => r.value);
// Sort by distance and take top-k
allResults.sort((a, b) => a.distance - b.distance);
return allResults.slice(0, k);
}
```
---
## 9. Latency Budget Breakdown
### 9.1 Query Path Latency Budget
```
TOTAL BUDGET: 50ms (p99 target)
===============================
┌────────────────────────────────────────────────────┐
│ Component │ Budget │ % of Total │
├────────────────────────────────────────────────────┤
│ Network (client -> LB) │ 5ms │ 10% │
│ Load Balancer routing │ 1ms │ 2% │
│ Query parsing/validation │ 1ms │ 2% │
│ Cache lookup (L1-L4) │ 3ms │ 6% │
│ HNSW search (k=10) │ 25ms │ 50% │
│ Metadata fetch │ 5ms │ 10% │
│ Result serialization │ 2ms │ 4% │
│ Network (LB -> client) │ 5ms │ 10% │
│ Buffer/headroom │ 3ms │ 6% │
├────────────────────────────────────────────────────┤
│ TOTAL │ 50ms │ 100% │
└────────────────────────────────────────────────────┘
```
### 9.2 Ingestion Path Latency Budget
```
TOTAL BUDGET: 200ms (p99 target for single segment)
==================================================
┌────────────────────────────────────────────────────┐
│ Component │ Budget │ % of Total │
├────────────────────────────────────────────────────┤
│ Audio receive/decode │ 10ms │ 5% │
│ Mel spectrogram compute │ 20ms │ 10% │
│ Perch model inference │ 80ms │ 40% │
│ Embedding normalization │ 5ms │ 2% │
│ HNSW insertion │ 20ms │ 10% │
│ Metadata write │ 10ms │ 5% │
│ Cache invalidation │ 10ms │ 5% │
│ Acknowledgment │ 5ms │ 2% │
│ Buffer/headroom │ 40ms │ 20% │
├────────────────────────────────────────────────────┤
│ TOTAL │200ms │ 100% │
└────────────────────────────────────────────────────┘
```
### 9.3 GNN Training Latency Constraints
```
ONLINE UPDATE: 100ms max (non-blocking)
---------------------------------------
- Subgraph extraction: 20ms
- Single GNN forward pass: 50ms
- Embedding update (async): 30ms
INCREMENTAL TRAINING: 5 min max
-------------------------------
- Subgraph construction: 30s
- Training (5 epochs): 4 min
- Embedding sync: 30s
FULL RETRAINING: 4 hour max
---------------------------
- Graph snapshot: 10 min
- Training (50 epochs): 3.5 hours
- Validation: 10 min
- Cutover: 10 min
```
---
## Consequences
### Positive
- **Sub-100ms query latency** achieved through HNSW tuning and multi-level caching
- **4x storage reduction** for cold data via int8 scalar quantization
- **Non-blocking GNN learning** enables continuous improvement without query degradation
- **Linear horizontal scaling** via geographic sharding
- **Clear SLAs** enable capacity planning and alerting
### Negative
- **Increased operational complexity** from multi-tier storage and distributed architecture
- **Memory overhead** from caching layers (~1.1GB dedicated to caches)
- **Quantization recall loss** of 2-3% for cold tier data
- **Cross-shard query overhead** adds latency for global searches
### Neutral
- **Trade-off flexibility** allows tuning precision vs. latency per use case
- **Benchmark-driven development** requires ongoing measurement infrastructure
---
## Implementation Phases
### Phase 1: Foundation (Weeks 1-2)
- Implement HNSW with tuned parameters
- Set up benchmark suite
- Establish baseline metrics
### Phase 2: Optimization (Weeks 3-4)
- Implement scalar quantization
- Add multi-level caching
- Optimize batch ingestion pipeline
### Phase 3: Learning (Weeks 5-6)
- Integrate GNN training scheduler
- Implement non-blocking updates
- Validate recall improvements
### Phase 4: Scale (Weeks 7-8)
- Implement sharding layer
- Deploy distributed architecture
- Load test at 1M+ vectors
---
## References
- Perch 2.0: https://arxiv.org/abs/2508.04665
- RuVector: https://github.com/ruvnet/ruvector
- HNSW Paper: Malkov & Yashunin, 2018
- Product Quantization: Jegou et al., 2011
- Graph Attention Networks: Velickovic et al., 2018