Merge commit 'd803bfe2b1fe7f5e219e50ac20d6801a0a58ac75' as 'vendor/ruvector'

This commit is contained in:
ruv
2026-02-28 14:39:40 -05:00
7854 changed files with 3522914 additions and 0 deletions

# ADR-001: System Architecture Overview
**Status:** Accepted
**Date:** 2026-01-15
**Decision Makers:** 7sense Architecture Team
**Technical Area:** System Architecture
---
## Context and Problem Statement
7sense aims to transform bioacoustic signals (primarily bird calls) into a navigable geometric space where meaningful structure emerges. The system must process audio recordings, generate high-dimensional embeddings using Perch 2.0 (1536-D vectors), organize them with HNSW indexing in RuVector, and apply GNN-based learning to surface patterns such as call types, motifs, and behavioral contexts.
The core challenge is designing an architecture that:
1. **Handles diverse data pipelines** - From raw 32kHz audio to queryable vector embeddings
2. **Scales to millions of call segments** - Real-world bioacoustic monitoring generates vast datasets
3. **Supports scientific workflows** - Researchers need reproducibility, transparency, and evidence-backed interpretations (RAB pattern)
4. **Enables real-time and batch processing** - Field deployments require streaming; research requires bulk analysis
5. **Integrates ML inference efficiently** - ONNX-based Perch 2.0 inference in Rust for performance
### Current State
This is a greenfield project building upon:
- **Perch 2.0**: Google DeepMind's bioacoustic embedding model (EfficientNet-B3 backbone, 1536-D output)
- **RuVector**: Rust-based vector database with HNSW indexing and self-learning GNN layers
- **RAB Pattern**: Retrieval-Augmented Bioacoustics for evidence-backed interpretation
---
## Decision Drivers
### Performance Requirements
- **Embedding generation**: Process 5-second audio segments at >100 segments/second
- **Vector search**: Sub-millisecond kNN queries on 1M+ vectors (HNSW target: ~100us)
- **Batch ingestion**: 1M vectors/minute build speed (RuVector baseline)
- **Memory efficiency**: Support 32x compression for cold data tiers
### Scalability Requirements
- **Data volume**: Support 10K to 10M+ call segments per deployment
- **Concurrent users**: Multiple researchers querying simultaneously
- **Geographic distribution**: Sensor networks across multiple sites
- **Temporal depth**: Years of historical recordings
### Scientific Rigor Requirements
- **Reproducibility**: Deterministic pipelines with versioned models and parameters
- **Transparency**: RAB-style evidence packs citing retrieved calls for any interpretation
- **Auditability**: Full provenance tracking from raw audio to conclusions
- **Validation**: Built-in verification against ground truth labels
### Operational Requirements
- **Deployment flexibility**: Edge (sensor), cloud, and hybrid deployments
- **Monitoring**: Health metrics, processing throughput, index quality
- **Updates**: Hot-swap embedding models without full reindexing
- **Recovery**: Graceful degradation and disaster recovery
---
## Considered Options
### Option A: Monolithic Architecture
A single application handling all concerns: audio processing, embedding generation, vector storage, GNN learning, API serving, and visualization.
**Pros:**
- Simplest deployment model
- No inter-service communication overhead
- Single codebase to maintain
**Cons:**
- Cannot scale components independently
- Single point of failure
- Difficult to update individual components
- Memory pressure from co-located ML models
- Not suitable for distributed sensor networks
### Option B: Microservices Architecture
Fully decomposed services: Audio Ingest Service, Embedding Service, Vector Store Service, GNN Learning Service, Query Service, Visualization Service, etc.
**Pros:**
- Independent scaling per service
- Technology flexibility per service
- Fault isolation
- Team parallelization
**Cons:**
- Significant operational complexity
- Network latency between services
- Data consistency challenges
- Overkill for initial team size
- Complex debugging across service boundaries
### Option C: Modular Monolith Architecture
A single deployable unit with clearly separated internal modules, designed for future extraction into services if needed.
**Pros:**
- Maintains deployment simplicity
- Clear module boundaries enable future splitting
- In-process communication for performance-critical paths
- Easier debugging and testing
- Appropriate for current team/project scale
- Can evolve toward microservices as needs emerge
**Cons:**
- Requires discipline to maintain module boundaries
- All modules share the same runtime resources
- Scaling requires scaling the entire application
---
## Decision Outcome
**Chosen Option: Option C - Modular Monolith Architecture**
We adopt a modular monolith architecture with clearly defined domain boundaries, designed with explicit seams that allow future extraction to services. This balances immediate development velocity with long-term architectural flexibility.
### Rationale
1. **Right-sized for current needs**: A small team building a new product benefits from deployment simplicity
2. **Performance-critical paths stay in-process**: Audio-to-embedding-to-index flow benefits from zero network hops
3. **Scientific workflow alignment**: Researchers prefer reproducible, debuggable systems over distributed complexity
4. **Evolution path preserved**: Module boundaries are designed as potential service boundaries
5. **RuVector integration**: RuVector is designed as an embeddable library, making monolith integration natural
---
## Technical Specifications
### Module Architecture
```
sevensense/
├── core/ # Domain-agnostic foundations
│ ├── config/ # Configuration management
│ ├── error/ # Error types and handling
│ ├── telemetry/ # Logging, metrics, tracing
│ └── storage/ # Abstract storage interfaces
├── audio/ # Audio Processing Domain
│ ├── ingest/ # Audio file reading, streaming
│ ├── segment/ # Call detection and segmentation
│ ├── features/ # Acoustic feature extraction
│ └── spectrogram/ # Mel spectrogram generation
├── embedding/ # Embedding Generation Domain
│ ├── perch/ # Perch 2.0 ONNX inference
│ ├── models/ # Model versioning and registry
│ ├── batch/ # Batch embedding pipelines
│ └── normalize/ # Vector normalization (L2, etc.)
├── vectordb/ # Vector Storage Domain (RuVector)
│ ├── index/ # HNSW index management
│ ├── graph/ # Graph structure (nodes, edges)
│ ├── query/ # Similarity search, Cypher queries
│ └── hyperbolic/ # Poincare ball embeddings
├── learning/ # GNN Learning Domain
│ ├── gnn/ # GNN layers (GCN, GAT, GraphSAGE)
│ ├── attention/ # Attention mechanisms
│ ├── training/ # Self-supervised training loops
│ └── refinement/ # Embedding refinement pipelines
├── analysis/ # Analysis Domain
│ ├── clustering/ # HDBSCAN, prototype extraction
│ ├── sequence/ # Motif detection, transition analysis
│ ├── entropy/ # Sequence entropy metrics
│ └── validation/ # Ground truth comparison
├── rab/ # Retrieval-Augmented Bioacoustics
│ ├── evidence/ # Evidence pack construction
│ ├── retrieval/ # Adaptive retrieval depth
│ ├── interpretation/ # Constrained interpretation generation
│ └── citation/ # Source attribution
├── api/ # API Layer
│ ├── rest/ # REST endpoints
│ ├── graphql/ # GraphQL schema and resolvers
│ ├── websocket/ # Real-time streaming
│ └── grpc/ # gRPC for inter-service (future)
├── visualization/ # Visualization Domain
│ ├── projection/ # UMAP/t-SNE dimensionality reduction
│ ├── graph_viz/ # Network visualization
│ ├── spectrogram_viz/ # Spectrogram rendering
│ └── export/ # Export formats (JSON, PNG, etc.)
└── cli/ # Command Line Interface
├── ingest/ # Batch ingestion commands
├── query/ # Query commands
├── train/ # Training commands
└── export/ # Export commands
```
### Data Model
#### Core Entities (Graph Nodes)
```rust
/// Raw audio recording from a sensor
struct Recording {
    id: Uuid,
    sensor_id: String,
    location: GeoPoint,          // lat, lon, elevation
    start_timestamp: DateTime,
    duration_ms: u32,
    sample_rate: u32,            // 32000 Hz for Perch 2.0
    channels: u8,
    habitat: Option<String>,
    weather: Option<WeatherData>,
    file_path: PathBuf,
    checksum: String,            // SHA-256 for reproducibility
}

/// Detected call segment within a recording
struct CallSegment {
    id: Uuid,
    recording_id: Uuid,
    start_ms: u32,
    end_ms: u32,
    snr_db: f32,                 // Signal-to-noise ratio
    peak_frequency_hz: f32,
    energy: f32,
    detection_confidence: f32,
    detection_method: String,    // "energy_threshold", "whisper_seg", etc.
}

/// Embedding vector for a call segment
struct Embedding {
    id: Uuid,
    segment_id: Uuid,
    model_id: String,            // "perch2_v1.0"
    dimensions: u16,             // 1536 for Perch 2.0
    vector: Vec<f32>,
    normalized: bool,
    created_at: DateTime,
}

/// Cluster prototype (centroid of similar calls)
struct Prototype {
    id: Uuid,
    cluster_id: Uuid,
    centroid_vector: Vec<f32>,
    exemplar_ids: Vec<Uuid>,     // Representative segments
    member_count: u32,
    coherence_score: f32,
}

/// Cluster of similar call segments
struct Cluster {
    id: Uuid,
    method: String,              // "hdbscan", "kmeans", etc.
    parameters: HashMap<String, Value>,
    created_at: DateTime,
    validation_score: Option<f32>,
}

/// Optional taxonomic reference
struct Taxon {
    id: Uuid,
    scientific_name: String,
    common_name: String,
    inat_id: Option<u64>,        // iNaturalist ID
    ebird_code: Option<String>,  // eBird species code
}
```
#### Relationships (Graph Edges)
```rust
/// Recording contains segments
edge HAS_SEGMENT: Recording -> CallSegment

/// Temporal sequence within recording
edge NEXT: CallSegment -> CallSegment {
    delta_ms: u32,    // Time gap between calls
}

/// Acoustic similarity from HNSW
edge SIMILAR: CallSegment -> CallSegment {
    distance: f32,    // Cosine or Euclidean
    rank: u8,         // kNN rank (1 = nearest)
}

/// Cluster membership
edge ASSIGNED_TO: CallSegment -> Cluster

/// Prototype ownership
edge HAS_PROTOTYPE: Cluster -> Prototype

/// Species identification (when available)
edge IDENTIFIED_AS: CallSegment -> Taxon {
    confidence: f32,
    method: String,   // "manual", "model", "consensus"
}
```
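As a concrete illustration of the `NEXT` edge semantics, the sketch below derives temporal edges from the time-ordered segments of one recording. It is written in TypeScript for brevity (the production code is Rust); the type and function names are illustrative, and the 2000ms gap cutoff mirrors the `max_gap_ms` analysis setting later in this document.

```typescript
// Hypothetical sketch: derive NEXT edges (with delta_ms) from one recording's segments.
interface Segment { id: string; startMs: number; endMs: number; }
interface NextEdge { from: string; to: string; deltaMs: number; }

function buildNextEdges(segments: Segment[], maxGapMs = 2000): NextEdge[] {
  // Sort by onset, then link consecutive calls whose silence gap is within maxGapMs
  const sorted = [...segments].sort((a, b) => a.startMs - b.startMs);
  const edges: NextEdge[] = [];
  for (let i = 0; i + 1 < sorted.length; i++) {
    const gap = sorted[i + 1].startMs - sorted[i].endMs;
    if (gap >= 0 && gap <= maxGapMs) {
      edges.push({ from: sorted[i].id, to: sorted[i + 1].id, deltaMs: gap });
    }
  }
  return edges;
}
```

Gaps larger than the cutoff deliberately break the chain, so motif mining later operates on bouts rather than entire recordings.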
### Processing Pipeline
```
┌─────────────────────────────────────────────────────────────────────────┐
│ INGESTION PIPELINE │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐ │
│ │ Audio │───▶│ Segment │───▶│ Mel │───▶│ Perch2.0 │ │
│ │ Input │ │Detection │ │Spectrogram│ │ ONNX │ │
│ │(32kHz,5s)│ │ │ │(500x128) │ │ │ │
│ └──────────┘ └──────────┘ └──────────┘ └──────────┘ │
│ │ │ │ │ │
│ │ │ │ ▼ │
│ │ │ │ ┌──────────┐ │
│ │ │ │ │Embedding │ │
│ │ │ │ │ (1536-D) │ │
│ │ │ │ └──────────┘ │
│ │ │ │ │ │
└───────┼───────────────┼───────────────┼───────────────┼──────────────────┘
│ │ │ │
▼ ▼ ▼ ▼
┌─────────────────────────────────────────────────────────────────────────┐
│ STORAGE LAYER │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ ┌──────────────────────────────────────────────────────────────┐ │
│ │ RuVector │ │
│ │ ┌─────────────┐ ┌─────────────┐ ┌─────────────────────┐ │ │
│ │ │ HNSW │ │ Graph │ │ Metadata Store │ │ │
│ │ │ Index │ │ Store │ │ (Recordings, │ │ │
│ │ │ │ │ (Edges) │ │ Segments, etc.) │ │ │
│ │ └─────────────┘ └─────────────┘ └─────────────────────┘ │ │
│ └──────────────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────┘
│ │ │ │
▼ ▼ ▼ ▼
┌─────────────────────────────────────────────────────────────────────────┐
│ LEARNING LAYER │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ GNN │ │ Attention │ │ Hyperbolic │ │
│ │ Reranker │───▶│ Layers │───▶│ Refinement │ │
│ │(GCN/GAT/SAGE)│ │ │ │ (Poincare) │ │
│ └──────────────┘ └──────────────┘ └──────────────┘ │
│ │ │ │ │
│ └───────────────────┴───────────────────┘ │
│ │ │
│ ▼ │
│ ┌──────────────┐ │
│ │ Refined │ │
│ │ Embeddings │ │
│ └──────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────┘
│ │ │ │
▼ ▼ ▼ ▼
┌─────────────────────────────────────────────────────────────────────────┐
│ ANALYSIS LAYER │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐ │
│ │Clustering│ │ Sequence │ │ Anomaly │ │ Entropy │ │ RAB │ │
│ │(HDBSCAN) │ │ Mining │ │Detection │ │ Metrics │ │ Evidence │ │
│ └──────────┘ └──────────┘ └──────────┘ └──────────┘ └──────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────────────┐
│ API / PRESENTATION │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐ │
│ │ REST │ │ GraphQL │ │WebSocket │ │ CLI │ │ WASM │ │
│ │ API │ │ API │ │(Streaming)│ │ │ │ (Browser)│ │
│ └──────────┘ └──────────┘ └──────────┘ └──────────┘ └──────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────┘
```
### Key Interfaces Between Modules
```rust
// Audio -> Embedding interface
trait AudioEmbedder {
    fn embed_segment(&self, audio: &AudioSegment) -> Result<Embedding>;
    fn embed_batch(&self, segments: &[AudioSegment]) -> Result<Vec<Embedding>>;
    fn model_info(&self) -> ModelInfo;
}

// Embedding -> VectorDB interface
trait VectorStore {
    fn insert(&mut self, embedding: &Embedding) -> Result<()>;
    fn search_knn(&self, query: &[f32], k: usize) -> Result<Vec<SearchResult>>;
    fn get_neighbors(&self, id: Uuid) -> Result<Vec<Neighbor>>;
    fn build_similarity_edges(&mut self, k: usize) -> Result<usize>;
}

// VectorDB -> Learning interface
trait GraphLearner {
    fn train_step(&mut self, graph: &Graph) -> Result<TrainMetrics>;
    fn refine_embeddings(&self, embeddings: &mut [Embedding]) -> Result<()>;
    fn attention_weights(&self, node_id: Uuid) -> Result<Vec<(Uuid, f32)>>;
}

// Learning -> Analysis interface
trait PatternAnalyzer {
    fn cluster(&self, embeddings: &[Embedding]) -> Result<Vec<Cluster>>;
    fn find_motifs(&self, sequences: &[Sequence]) -> Result<Vec<Motif>>;
    fn compute_entropy(&self, transitions: &TransitionMatrix) -> f32;
}

// Analysis -> RAB interface
trait EvidenceBuilder {
    fn build_pack(&self, query: &Query) -> Result<EvidencePack>;
    fn generate_interpretation(&self, pack: &EvidencePack) -> Result<Interpretation>;
    fn cite_sources(&self, interpretation: &Interpretation) -> Vec<Citation>;
}
```
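To make the `VectorStore` contract concrete, here is a minimal TypeScript analogue: a brute-force in-memory store (no HNSW graph) that honors the same `search_knn` semantics. This is an illustrative sketch, not RuVector code; the class and function names are assumptions.

```typescript
// Brute-force stand-in for the VectorStore interface: exact kNN by cosine distance.
interface SearchResult { id: string; distance: number; }

function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

class BruteForceStore {
  private vectors = new Map<string, number[]>();

  insert(id: string, vector: number[]): void {
    this.vectors.set(id, vector);
  }

  searchKnn(query: number[], k: number): SearchResult[] {
    const results: SearchResult[] = [];
    for (const [id, v] of this.vectors) {
      results.push({ id, distance: 1 - cosine(query, v) }); // cosine distance
    }
    results.sort((a, b) => a.distance - b.distance);
    return results.slice(0, k);
  }
}
```

An exact store like this doubles as the ground-truth oracle when measuring HNSW recall (see ADR-004's benchmark targets).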
### Configuration Structure
```yaml
# sevensense.yaml
sevensense:
  # Audio processing settings
  audio:
    sample_rate: 32000          # Perch 2.0 requirement
    segment_duration_ms: 5000   # 5 seconds
    segment_overlap_ms: 500     # Overlap for continuity
    min_snr_db: 10.0            # Minimum signal-to-noise
    detection_method: "energy"  # or "whisper_seg", "tweety"

  # Embedding generation
  embedding:
    model: "perch2_v1.0"
    onnx_path: "./models/perch2.onnx"
    dimensions: 1536
    normalize: true
    batch_size: 32

  # Vector database (RuVector)
  vectordb:
    index_type: "hnsw"
    hnsw:
      m: 16                     # Connections per node
      ef_construction: 200      # Build-time search width
      ef_search: 100            # Query-time search width
    distance_metric: "cosine"   # or "euclidean", "poincare"
    enable_hyperbolic: false    # Experimental
    compression:
      hot_tier: "none"
      warm_tier: "pq_8"         # Product quantization
      cold_tier: "pq_4"         # Aggressive compression

  # GNN learning
  learning:
    enabled: true
    gnn_type: "gat"             # GCN, GAT, or GraphSAGE
    hidden_dim: 256
    num_layers: 2
    attention_heads: 4
    learning_rate: 0.001
    training_interval_hours: 24

  # Analysis settings
  analysis:
    clustering:
      method: "hdbscan"
      min_cluster_size: 10
      min_samples: 5
    sequence:
      max_gap_ms: 2000          # Max silence between calls
      min_motif_length: 3

  # RAB settings
  rab:
    retrieval_k: 10             # Neighbors to retrieve
    min_confidence: 0.7
    cite_exemplars: true

  # API settings
  api:
    host: "0.0.0.0"
    port: 8080
    enable_graphql: true
    enable_websocket: true
    cors_origins: ["*"]

  # Telemetry
  telemetry:
    log_level: "info"
    metrics_port: 9090
    tracing_enabled: true
    tracing_endpoint: "http://localhost:4317"
```
---
## Consequences
### Positive Consequences
1. **Development velocity**: Single deployment simplifies CI/CD and local development
2. **Performance**: Critical audio-to-index path has zero network overhead
3. **Debugging**: Stack traces span the entire flow; no distributed tracing required initially
4. **Testing**: Integration tests run in-process without container orchestration
5. **Scientific reproducibility**: Single binary with pinned dependencies ensures consistent results
6. **Resource efficiency**: Shared memory pools and caches across modules
7. **Evolution path**: Clear module boundaries allow extraction to services when justified
### Negative Consequences
1. **Scaling limitations**: Cannot scale embedding generation independently from query serving
2. **Deployment coupling**: Updates to any module require full redeployment
3. **Resource contention**: GNN training may compete with query serving for CPU/memory
4. **Technology constraints**: All modules must work within Rust ecosystem (mitigated by FFI)
### Mitigation Strategies
| Risk | Mitigation |
|------|------------|
| Scaling limitations | Design async job queues that could become external workers |
| Deployment coupling | Blue-green deployments with health checks |
| Resource contention | Configurable resource limits per module; background training scheduling |
| Technology constraints | ONNX runtime for ML; FFI bindings for specialized libraries |
---
## Related Decisions
- **ADR-002**: Perch 2.0 Integration Strategy (ONNX vs. birdnet-onnx crate)
- **ADR-003**: HNSW vs. Hyperbolic Space Configuration
- **ADR-004**: GNN Training Strategy (Online vs. Batch)
- **ADR-005**: RAB Evidence Pack Schema
- **ADR-006**: API Design (REST/GraphQL/gRPC)
---
## Compliance and Standards
### Scientific Standards
- All embeddings include model version and parameters for reproducibility
- Evidence packs include full retrieval citations per RAB methodology
- Validation metrics align with published benchmarks (V-measure, silhouette scores)
### Data Standards
- Audio metadata follows Darwin Core / TDWG standards where applicable
- Taxonomic references link to iNaturalist and eBird identifiers
- Geospatial data uses WGS84 coordinates
### Security Considerations
- No PII in bioacoustic data (sensor IDs are pseudonymous)
- API authentication via JWT tokens
- Audit logging for all data modifications
---
## References
1. Perch 2.0 Paper: "The Bittern Lesson for Bioacoustics" (arXiv:2508.04665)
2. RuVector Documentation: https://github.com/ruvnet/ruvector
3. HNSW Paper: Malkov & Yashunin, "Efficient and Robust Approximate Nearest Neighbor Search Using Hierarchical Navigable Small World Graphs"
4. RAB Pattern: Retrieval-Augmented Bioacoustics methodology
5. AVN Deep Learning Study: "A deep learning approach for the analysis of birdsong" (eLife 2025)
---
## Revision History
| Version | Date | Author | Changes |
|---------|------|--------|---------|
| 1.0 | 2026-01-15 | 7sense Architecture Team | Initial version |

---
# ADR-004: Performance Optimization Strategy
## Status
Proposed
## Date
2026-01-15
## Context
7sense is a bioacoustics platform that processes bird call audio using Perch 2.0 embeddings (1536-D vectors from 5-second audio segments at 32kHz) stored in a RuVector-based system with HNSW indexing and GNN learning capabilities. The system must handle:
- **Scale**: 1M+ bird call embeddings with sub-100ms query latency
- **Continuous Learning**: GNN refinement without blocking query operations
- **Hierarchical Data**: Poincare ball hyperbolic embeddings for species/call-type taxonomies
- **Real-time Ingestion**: Streaming audio from field sensors
This ADR defines the performance optimization strategy to meet these requirements while maintaining system reliability and cost efficiency.
## Decision
We adopt a multi-layered performance optimization approach covering HNSW tuning, embedding quantization, batch processing, memory management, caching, GNN scheduling, and horizontal scalability.
---
## 1. HNSW Parameter Tuning
### 1.1 Core Parameters
| Parameter | Value | Rationale |
|-----------|-------|-----------|
| **M** (max connections per node) | 32 | Optimal for 1536-D vectors; balances recall vs memory. Higher than default (16) due to high dimensionality. |
| **efConstruction** | 200 | Build-time search depth. Higher ensures quality graph structure for dense embedding spaces. |
| **efSearch** | 128 (default) / 256 (high-recall) | Query-time search depth. Tunable per query based on precision requirements. |
| **maxLevel** | auto (log2(N)/log2(M)) | Automatically determined; log2(10^6)/log2(32) gives ~4 levels for 1M vectors with M=32. |
### 1.2 Dimensionality-Specific Adjustments
```
For 1536-D Perch embeddings:
- Use L2 distance (Euclidean) for normalized vectors
- Consider Product Quantization (PQ) for memory reduction (see Section 2)
- Enable SIMD acceleration (AVX-512 where available)
```
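The choice of L2 distance for normalized vectors is safe because, for unit-length vectors, squared Euclidean distance and cosine similarity are interchangeable: ||a − b||² = 2 − 2·cos(a, b). A minimal sketch verifying the identity (helper names are illustrative):

```typescript
// For L2-normalized vectors: squared Euclidean distance == 2 - 2 * cosine similarity.
function l2Normalize(v: number[]): number[] {
  const n = Math.sqrt(v.reduce((s, x) => s + x * x, 0));
  return v.map(x => x / n);
}

function sqEuclidean(a: number[], b: number[]): number {
  return a.reduce((s, x, i) => s + (x - b[i]) ** 2, 0);
}

function dot(a: number[], b: number[]): number {
  return a.reduce((s, x, i) => s + x * b[i], 0);
}
```

So ranking by L2 on normalized Perch embeddings yields the same neighbor order as ranking by cosine, while allowing SIMD-friendly distance kernels.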
### 1.3 Benchmark Targets
| Metric | Target | Measurement Method |
|--------|--------|-------------------|
| Recall@10 | >= 0.95 | Compare against brute-force ground truth |
| Recall@100 | >= 0.98 | Same |
| Query Latency (p50) | < 10ms | Single-threaded, 1M vectors |
| Query Latency (p99) | < 50ms | Under concurrent load |
| Build Time | < 30 min | For 1M vectors cold start |
### 1.4 Tuning Protocol
```typescript
const hnswTuningConfig = {
  // Phase 1: Initial calibration (10K sample)
  calibration: {
    sampleSize: 10000,
    mRange: [16, 24, 32, 48],
    efConstructionRange: [100, 200, 400],
    targetRecall: 0.95
  },
  // Phase 2: Full index build with optimal params
  production: {
    m: 32,                    // Determined from calibration
    efConstruction: 200,
    efSearchDefault: 128,
    efSearchHighRecall: 256
  },
  // Phase 3: Runtime adaptation
  adaptive: {
    enableDynamicEf: true,
    efFloor: 64,
    efCeiling: 512,
    latencyTarget: 50         // ms
  }
} as const;
```
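Phase 3's dynamic `ef` can be sketched as a simple feedback controller: shrink `ef` when observed latency exceeds the target, grow it when there is slack, always clamped to the configured floor and ceiling. The function name and scaling factors below are assumptions, not RuVector API.

```typescript
// Sketch of an adaptive efSearch controller driven by observed query latency.
function adaptEf(
  currentEf: number,
  observedLatencyMs: number,
  cfg = { efFloor: 64, efCeiling: 512, latencyTargetMs: 50 }
): number {
  let ef = currentEf;
  if (observedLatencyMs > cfg.latencyTargetMs) {
    ef = Math.floor(ef / 2);          // over budget: back off quickly
  } else if (observedLatencyMs < cfg.latencyTargetMs / 2) {
    ef = Math.floor(ef * 1.25);       // ample slack: trade latency for recall
  }
  return Math.min(cfg.efCeiling, Math.max(cfg.efFloor, ef));
}
```

Multiplicative-decrease keeps latency spikes short-lived, while the slower additive-style growth avoids oscillating around the target.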
---
## 2. Embedding Quantization Strategy
### 2.1 Tiered Storage Architecture
```
HOT TIER (Active Queries)
-------------------------
- Format: float32 (full precision)
- Size: 1536 * 4 = 6,144 bytes/vector
- Capacity: ~100K vectors (600MB RAM)
- Use: Real-time queries, recent recordings
WARM TIER (Frequent Access)
---------------------------
- Format: float16 (half precision)
- Size: 1536 * 2 = 3,072 bytes/vector
- Capacity: ~500K vectors (1.5GB RAM)
- Use: Weekly active data, popular species
COLD TIER (Archive)
-------------------
- Format: int8 (scalar quantization)
- Size: 1536 * 1 = 1,536 bytes/vector
- Capacity: ~2M+ vectors (3GB disk)
- Use: Historical data, rare species
```
### 2.2 Quantization Methods
| Method | Compression | Recall Impact | Use Case |
|--------|-------------|---------------|----------|
| **Scalar (int8)** | 4x | -2 to -3% recall | Cold storage, bulk search |
| **Product Quantization (PQ)** | 8-16x | -3 to -5% recall | Very large archives |
| **Binary** | 32x | -10 to -15% recall | First-pass filtering only |
### 2.3 Scalar Quantization Implementation
```typescript
class ScalarQuantizer {
  // Per-dimension min/max for calibration
  private mins = new Float32Array(1536);
  private maxs = new Float32Array(1536);
  private scales = new Float32Array(1536);

  calibrate(embeddings: Float32Array[], sampleSize: number = 10000): void {
    // Sample random embeddings for range estimation
    const sample = this.randomSample(embeddings, sampleSize);
    for (let d = 0; d < 1536; d++) {
      // Loop rather than spread: avoids call-stack limits on large samples
      let min = Infinity, max = -Infinity;
      for (const e of sample) {
        if (e[d] < min) min = e[d];
        if (e[d] > max) max = e[d];
      }
      this.mins[d] = min;
      this.maxs[d] = max;
      this.scales[d] = 255 / (max - min);
    }
  }

  quantize(embedding: Float32Array): Uint8Array {
    const quantized = new Uint8Array(1536);
    for (let d = 0; d < 1536; d++) {
      const normalized = (embedding[d] - this.mins[d]) * this.scales[d];
      quantized[d] = Math.round(Math.max(0, Math.min(255, normalized)));
    }
    return quantized;
  }

  dequantize(quantized: Uint8Array): Float32Array {
    const embedding = new Float32Array(1536);
    for (let d = 0; d < 1536; d++) {
      embedding[d] = (quantized[d] / this.scales[d]) + this.mins[d];
    }
    return embedding;
  }

  private randomSample(items: Float32Array[], n: number): Float32Array[] {
    // Partial Fisher-Yates shuffle over a copy, then take the first n
    const copy = [...items];
    const count = Math.min(n, copy.length);
    for (let i = 0; i < count; i++) {
      const j = i + Math.floor(Math.random() * (copy.length - i));
      [copy[i], copy[j]] = [copy[j], copy[i]];
    }
    return copy.slice(0, count);
  }
}
```
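The key property of this scheme is a bounded reconstruction error: per dimension, the round-trip error is at most half a quantization step, (max − min) / (2 · 255). A self-contained sketch checking that bound on a single range (function name illustrative):

```typescript
// Round-trip int8 scalar quantization over a known [min, max] range and verify
// the per-dimension error never exceeds half a quantization step.
function quantizeRoundTrip(v: number[], min: number, max: number): number[] {
  const scale = 255 / (max - min);
  return v.map(x => {
    const q = Math.round(Math.min(255, Math.max(0, (x - min) * scale)));
    return q / scale + min; // dequantize
  });
}
```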
### 2.4 Promotion/Demotion Policy
```
PROMOTION (Cold -> Warm -> Hot)
-------------------------------
Trigger: Query frequency > threshold OR explicit prefetch
- Cold -> Warm: 5+ queries in 24h
- Warm -> Hot: 20+ queries in 1h
DEMOTION (Hot -> Warm -> Cold)
------------------------------
Trigger: Time-based decay OR memory pressure
- Hot -> Warm: No queries in 1h
- Warm -> Cold: No queries in 7d
- LRU eviction when tier exceeds capacity
```
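The policy above reduces to a pure decision function over recent query counts and idle time, which makes it easy to unit-test. The signature and counters below are illustrative assumptions.

```typescript
// Promotion/demotion policy as a pure function (thresholds from the policy above).
type Tier = "hot" | "warm" | "cold";

function nextTier(
  tier: Tier,
  queriesLastHour: number,
  queriesLast24h: number,
  hoursSinceLastQuery: number
): Tier {
  // Promotions
  if (tier === "warm" && queriesLastHour >= 20) return "hot";
  if (tier === "cold" && queriesLast24h >= 5) return "warm";
  // Demotions (time-based decay; LRU eviction under memory pressure is separate)
  if (tier === "hot" && hoursSinceLastQuery >= 1) return "warm";
  if (tier === "warm" && hoursSinceLastQuery >= 24 * 7) return "cold";
  return tier;
}
```

A background sweep can apply `nextTier` per embedding, re-encoding vectors (float32/float16/int8) as they cross tier boundaries.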
---
## 3. Batch Processing Pipeline
### 3.1 Audio Ingestion Architecture
```
┌─────────────────────────────────────────────────────────────────────┐
│ AUDIO INGESTION PIPELINE │
├─────────────────────────────────────────────────────────────────────┤
│ │
│ [Sensors] ──> [Buffer Queue] ──> [Segment Detector] │
│ │ │ │ │
│ │ (5min chunks) (5s windows) │
│ v v v │
│ [Raw Storage] [Batch Accumulator] [Perch Embedder] │
│ │ │ │
│ (1000 segments) (GPU batch) │
│ v v │
│ [Embedding Queue] <── [1536-D vectors] │
│ │ │
│ v │
│ [HNSW Batch Insert] │
│ │ │
│ (async, non-blocking) │
│ v │
│ [Index + Metadata Store] │
│ │
└─────────────────────────────────────────────────────────────────────┘
```
### 3.2 Batch Sizing Parameters
| Stage | Batch Size | Latency Target | Throughput |
|-------|------------|----------------|------------|
| Audio buffer | 5 min chunks | < 1s queue delay | 100+ streams |
| Segment detection | 100 segments | < 500ms | 1000 segments/s |
| Perch embedding | 64 segments | < 2s GPU | 32 segments/s/GPU |
| HNSW insertion | 1000 vectors | < 100ms | 10K vectors/s |
| Metadata write | 1000 records | < 50ms | 20K records/s |
### 3.3 Backpressure Handling
```typescript
const backpressureConfig = {
  // Queue depth thresholds
  warningThreshold: 10000,    // Start logging warnings
  throttleThreshold: 50000,   // Reduce intake rate
  dropThreshold: 100000,      // Drop lowest-priority data
  // Priority levels for graceful degradation
  priorities: {
    critical: 'endangered_species',       // Never drop
    high: 'known_species_new_recording',  // Drop last
    normal: 'routine_monitoring',         // Standard handling
    low: 'background_noise_samples'       // Drop first
  },
  // Rate limiting
  maxIngestionRate: 10000,    // segments/minute
  burstAllowance: 5000        // temporary overflow
} as const;
```
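The thresholds above map queue depth to an escalating response, which can be expressed as a pure function (a sketch; the action names are illustrative):

```typescript
// Map current queue depth to a backpressure action using the thresholds above.
function backpressureAction(
  queueDepth: number
): "accept" | "warn" | "throttle" | "drop" {
  if (queueDepth >= 100000) return "drop";     // dropThreshold: shed low priority
  if (queueDepth >= 50000) return "throttle";  // throttleThreshold: slow intake
  if (queueDepth >= 10000) return "warn";      // warningThreshold: log only
  return "accept";
}
```

When the action is `"drop"`, segments are shed in ascending priority order (low first, critical never), per the `priorities` table.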
### 3.4 Batch Insert Optimization
```typescript
async function batchInsertEmbeddings(
  embeddings: Float32Array[],
  metadata: EmbeddingMetadata[],
  config: BatchConfig
): Promise<BatchResult> {
  const batchSize = config.batchSize || 1000;
  const results: BatchResult = { inserted: 0, failed: 0, latencyMs: [] };
  // Pair each embedding with its metadata so the two stay aligned, then sort
  // by expected cluster for better cache locality during insertion
  const pairs = embeddings.map((e, i) => ({ embedding: e, meta: metadata[i] }));
  const sorted = sortByClusterHint(pairs);
  for (let i = 0; i < sorted.length; i += batchSize) {
    const batch = sorted.slice(i, i + batchSize);
    const start = performance.now();
    // Parallel insert with connection pooling
    await Promise.all([
      hnswIndex.batchAdd(batch.map(p => p.embedding)),
      metadataStore.batchInsert(batch.map(p => p.meta))
    ]);
    results.latencyMs.push(performance.now() - start);
    results.inserted += batch.length;
  }
  return results;
}
```
---
## 4. Memory Management
### 4.1 Streaming vs Batch Trade-offs
| Mode | Memory Footprint | Latency | Use Case |
|------|------------------|---------|----------|
| **Streaming** | O(window_size) ~50MB | Real-time (<1s) | Live monitoring |
| **Micro-batch** | O(batch_size) ~200MB | Near-real-time (<5s) | Standard ingestion |
| **Batch** | O(full_batch) ~2GB | Minutes | Bulk historical import |
### 4.2 Memory Budget Allocation
```
TOTAL MEMORY BUDGET: 16GB (single node)
=======================================
HNSW Index (Hot): 4GB (25%)
- ~650K float32 vectors
- Navigation structure overhead
Embedding Cache: 3GB (19%)
- LRU cache for frequent queries
- Warm tier spillover
GNN Model: 2GB (12%)
- Model parameters
- Gradient buffers
- Activation cache
Query Buffers: 2GB (12%)
- Concurrent query working memory
- Result aggregation
Ingestion Pipeline: 2GB (12%)
- Audio processing buffers
- Batch accumulation
Metadata/Index: 2GB (12%)
- SQLite/RocksDB buffers
- B-tree indices
OS/Overhead: 1GB (6%)
- System requirements
- Safety margin
```
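As a sanity check on the hot-tier figure above: a 4GB budget holds at most ~699K raw float32 1536-D vectors; HNSW navigation-structure overhead is what brings the usable figure down toward ~650K. A trivial sketch of the arithmetic (function name illustrative):

```typescript
// Raw float32 vector capacity of a memory budget, before index overhead.
function float32Capacity(budgetBytes: number, dims: number): number {
  return Math.floor(budgetBytes / (dims * 4)); // 4 bytes per float32 component
}
```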
### 4.3 Memory Pressure Response
```typescript
const memoryManagerConfig = {
  thresholds: {
    warning: 0.75,    // 75% utilization
    critical: 0.90,   // 90% utilization
    emergency: 0.95   // 95% utilization
  },
  responses: {
    warning: [
      'reduce_batch_sizes',
      'increase_demotion_rate',
      'log_memory_profile'
    ],
    critical: [
      'pause_gnn_training',
      'aggressive_cache_eviction',
      'reject_low_priority_queries'
    ],
    emergency: [
      'stop_ingestion',
      'force_checkpoint',
      'alert_operations'
    ]
  }
} as const;
```
### 4.4 Zero-Copy Optimizations
```typescript
// Use memory-mapped files for large read-only data
const coldTierIndex = mmap('/data/cold_embeddings.bin', {
  mode: 'readonly',
  advice: MADV_RANDOM  // Optimize for random access
});

// Share embedding buffers between query threads
const sharedQueryBuffer = new SharedArrayBuffer(
  QUERY_BATCH_SIZE * EMBEDDING_DIM * 4
);

// Avoid copies in pipeline stages
function processSegment(audio: AudioBuffer): EmbeddingResult {
  // Pass views, not copies
  const spectrogram = computeMelSpectrogram(audio.subarray(0, WINDOW_SIZE));
  const embedding = perchModel.embed(spectrogram);  // Returns view
  return { embedding, metadata: extractMetadata(audio) };
}
```
---
## 5. Caching Strategy
### 5.1 Multi-Level Cache Architecture
```
┌────────────────────────────────────────────────────────────────┐
│ CACHE HIERARCHY │
├────────────────────────────────────────────────────────────────┤
│ │
│ L1: Query Result Cache (100MB) │
│ ├── Key: hash(query_embedding + search_params) │
│ ├── TTL: 5 minutes │
│ ├── Hit Rate Target: 40%+ for repeated queries │
│ └── Eviction: LRU with frequency boost │
│ │
│ L2: Nearest Neighbor Cache (500MB) │
│ ├── Key: embedding_id │
│ ├── Value: precomputed k-NN list │
│ ├── TTL: 1 hour (invalidate on index update) │
│ └── Hit Rate Target: 60%+ for hot embeddings │
│ │
│ L3: Cluster Centroid Cache (200MB) │
│ ├── Key: cluster_id │
│ ├── Value: centroid + exemplar embeddings │
│ ├── TTL: 24 hours │
│ └── Use: Fast cluster assignment for new embeddings │
│ │
│ L4: Metadata Cache (300MB) │
│ ├── Key: embedding_id │
│ ├── Value: species, location, timestamp, etc. │
│ ├── TTL: None (invalidate on update) │
│ └── Hit Rate Target: 90%+ (frequently accessed) │
│ │
└────────────────────────────────────────────────────────────────┘
```
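The L1 cache's key scheme and TTL behavior can be sketched as a small self-contained class. The key construction here (stringified, rounded vector components plus search params) is an assumption; a production system would likely hash the raw bytes instead.

```typescript
// Minimal sketch of the L1 query-result cache: TTL-expiring entries keyed by
// query vector + search parameters.
class QueryResultCache<V> {
  private store = new Map<string, { value: V; expiresAt: number }>();

  constructor(private ttlMs: number) {}

  // Structural key; rounding tolerates float jitter between identical queries
  key(query: number[], k: number, ef: number): string {
    return `${k}:${ef}:${query.map(x => x.toFixed(4)).join(",")}`;
  }

  get(key: string, now = Date.now()): V | undefined {
    const entry = this.store.get(key);
    if (!entry || entry.expiresAt < now) {
      this.store.delete(key); // lazily evict expired entries
      return undefined;
    }
    return entry.value;
  }

  set(key: string, value: V, now = Date.now()): void {
    this.store.set(key, { value, expiresAt: now + this.ttlMs });
  }
}
```

LRU-with-frequency-boost eviction (as specified for L1) would sit on top of this, bounding the map's size; it is omitted here for brevity.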
### 5.2 Cache Warming Strategy
```typescript
const cacheWarmingConfig = {
  // Startup warming
  startup: {
    // Load most queried embeddings from past 24h
    recentQueryEmbeddings: 10000,
    // Load all cluster centroids
    clusterCentroids: 'all',
    // Load endangered species data
    prioritySpecies: ['species_list_from_config']
  },
  // Predictive warming
  predictive: {
    // Time-based patterns
    schedules: [
      { time: '05:00', action: 'warm_dawn_chorus_species' },
      { time: '19:00', action: 'warm_dusk_species' }
    ],
    // Geographic patterns
    sensorActivation: {
      triggerRadius: '50km',
      preloadNeighborSites: true
    }
  },
  // Query-driven warming
  queryDriven: {
    // On any query, prefetch neighbors' neighbors
    prefetchDepth: 2,
    prefetchCount: 10
  }
} as const;
```
### 5.3 Cache Invalidation
```typescript
// Event-driven invalidation
const cacheInvalidator = {
  onEmbeddingInsert(id: string, embedding: Float32Array): void {
    // Invalidate affected NN caches
    const affectedNeighbors = hnswIndex.getNeighbors(id, 50);
    affectedNeighbors.forEach(nid => nnCache.invalidate(nid));
    // Invalidate cluster centroid if significantly different
    const cluster = clusterAssignment.get(id);
    if (cluster && distanceFromCentroid(embedding, cluster) > threshold) {
      centroidCache.invalidate(cluster.id);
    }
  },

  onClusterUpdate(clusterId: string): void {
    // Invalidate centroid and all member NN caches
    centroidCache.invalidate(clusterId);
    const members = clusterMembers.get(clusterId);
    members.forEach(mid => nnCache.invalidate(mid));
  },

  onGNNTrainingComplete(): void {
    // Embeddings may have shifted - invalidate distance-based caches
    nnCache.clear();
    queryCache.clear();
    // Centroid cache can remain (recomputed lazily)
  }
};
```
---
## 6. GNN Training Schedule
### 6.1 Training Architecture
```
┌─────────────────────────────────────────────────────────────────────┐
│ GNN TRAINING PIPELINE │
├─────────────────────────────────────────────────────────────────────┤
│ │
│ ONLINE LEARNING (Continuous) │
│ ├── Trigger: Every 1000 new embeddings │
│ ├── Scope: Local neighborhood refinement │
│ ├── Duration: < 100ms (non-blocking) │
│ └── Method: Single GNN message-passing step │
│ │
│ INCREMENTAL TRAINING (Scheduled) │
│ ├── Trigger: Hourly (off-peak) or 10K new embeddings │
│ ├── Scope: Updated subgraph (new nodes + 2-hop neighbors) │
│ ├── Duration: 1-5 minutes │
│ └── Method: 3-5 GNN epochs on affected subgraph │
│ │
│ FULL RETRAINING (Periodic) │
│ ├── Trigger: Weekly (Sunday 02:00-06:00) or manual │
│ ├── Scope: Entire graph │
│ ├── Duration: 1-4 hours │
│ └── Method: Full GNN training with hyperparameter tuning │
│ │
└─────────────────────────────────────────────────────────────────────┘
```
### 6.2 Non-Blocking Training Protocol
```typescript
class GNNTrainingScheduler {
private queryPriorityLock = new AsyncLock();
async onlineUpdate(newEmbeddings: EmbeddingBatch): Promise<void> {
// Non-blocking: runs in background, doesn't affect queries
setImmediate(async () => {
const subgraph = this.extractLocalSubgraph(newEmbeddings);
await this.gnn.singleStep(subgraph);
});
}
async incrementalTrain(): Promise<TrainingResult> {
// Acquire read lock (queries continue, writes pause)
await this.queryPriorityLock.acquireRead();
try {
const updatedSubgraph = this.getUpdatedSubgraph();
const result = await this.gnn.train(updatedSubgraph, {
epochs: 5,
earlyStop: { patience: 2, minDelta: 0.001 }
});
// Apply updates atomically
await this.applyEmbeddingUpdates(result.refinedEmbeddings);
return result;
} finally {
this.queryPriorityLock.release();
}
}
async fullRetrain(): Promise<TrainingResult> {
// Acquire write lock (pause all operations)
await this.queryPriorityLock.acquireWrite();
try {
// Checkpoint current state for rollback
await this.checkpoint();
const result = await this.gnn.fullTrain({
epochs: 50,
learningRate: 0.001,
earlyStop: { patience: 10, minDelta: 0.0001 }
});
// Validate before applying
if (result.validationRecall < 0.90) {
await this.rollback();
throw new Error('Training degraded recall, rolled back');
}
await this.applyFullUpdate(result);
return result;
} finally {
this.queryPriorityLock.release();
}
}
}
```
### 6.3 Training Resource Allocation
```
OFF-PEAK (02:00-06:00 local time)
---------------------------------
- Full retraining allowed
- 100% GPU utilization
- Query latency SLA relaxed to 200ms
PEAK HOURS (06:00-22:00)
------------------------
- Online updates only
- GPU limited to 20% for training
- Query latency SLA: 50ms p99
TRANSITION PERIODS
------------------
- Incremental training allowed
- GPU limited to 50% for training
- Query latency SLA: 100ms p99
```
---
## 7. Benchmarking Framework
### 7.1 Benchmark Suite
```typescript
interface BenchmarkSuite {
// Core HNSW benchmarks
hnsw: {
insertThroughput: {
description: 'Vectors inserted per second',
target: '>= 10,000 vectors/s',
dataset: '1M random 1536-D vectors'
},
queryLatency: {
description: 'Single query latency distribution',
targets: {
p50: '<= 10ms',
p95: '<= 30ms',
p99: '<= 50ms'
},
dataset: '1M indexed, 10K queries'
},
recallAtK: {
description: 'Recall compared to brute force',
targets: {
recall10: '>= 0.95',
recall100: '>= 0.98'
}
},
concurrentQueries: {
description: 'Throughput under concurrent load',
target: '>= 1,000 QPS at p99 < 100ms',
concurrency: [1, 10, 50, 100, 200]
}
},
// End-to-end pipeline benchmarks
pipeline: {
audioToEmbedding: {
description: 'Full audio processing latency',
target: '<= 200ms per 5s segment',
includeIO: true
},
ingestionThroughput: {
description: 'Sustained ingestion rate',
target: '>= 100 segments/second',
duration: '1 hour'
},
queryWithMetadata: {
description: 'Query + metadata fetch',
target: '<= 75ms p99'
}
},
// GNN-specific benchmarks
gnn: {
onlineUpdateLatency: {
description: 'Single-step GNN update',
target: '<= 100ms for 1K node subgraph'
},
incrementalTrainTime: {
description: 'Hourly incremental training',
target: '<= 5 minutes for 10K updates'
},
recallImprovement: {
description: 'Recall gain from GNN refinement',
target: '>= 2% improvement over baseline HNSW'
}
},
// Memory benchmarks
memory: {
indexMemoryPerVector: {
description: 'Memory per indexed vector',
target: '<= 8KB (including overhead)'
},
cacheHitRate: {
description: 'Cache effectiveness',
targets: {
queryCache: '>= 30%',
nnCache: '>= 50%',
metadataCache: '>= 80%'
}
},
quantizationRecallLoss: {
description: 'Recall loss from int8 quantization',
target: '<= 3%'
}
}
}
```
### 7.2 SLA Definitions
| Operation | p50 | p95 | p99 | p99.9 |
|-----------|-----|-----|-----|-------|
| kNN Query (k=10) | 5ms | 20ms | 50ms | 100ms |
| kNN Query (k=100) | 10ms | 40ms | 80ms | 150ms |
| Range Query (r<0.5) | 15ms | 50ms | 100ms | 200ms |
| Insert Single | 1ms | 5ms | 10ms | 20ms |
| Batch Insert (1000) | 50ms | 100ms | 200ms | 500ms |
| Cluster Assignment | 20ms | 50ms | 100ms | 200ms |
| Full Pipeline (audio->result) | 200ms | 500ms | 1000ms | 2000ms |
### 7.3 Continuous Benchmarking
```yaml
# .github/workflows/performance.yml
benchmark_schedule:
nightly:
- hnsw_insert_throughput
- hnsw_query_latency
- hnsw_recall
weekly:
- full_pipeline_benchmark
- gnn_training_benchmark
- memory_pressure_test
on_release:
- all_benchmarks
- scalability_test_10M
- longevity_test_24h
regression_thresholds:
latency_increase: 10% # Alert if p99 increases by 10%
throughput_decrease: 5% # Alert if QPS drops by 5%
recall_decrease: 1% # Alert if recall drops by 1%
```
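The regression gate above can be sketched as a comparison of two benchmark runs. The latency and throughput thresholds are relative; the 1% recall threshold is read here as an absolute one-point drop, which is an interpretation rather than something the YAML states.

```typescript
// Apply the regression thresholds above to a baseline and a current run.
interface BenchRun { p99LatencyMs: number; qps: number; recallAt10: number }

function regressions(baseline: BenchRun, current: BenchRun): string[] {
  const alerts: string[] = [];
  // Alert if p99 latency grows by more than 10%.
  if (current.p99LatencyMs > baseline.p99LatencyMs * 1.10) alerts.push('latency_increase');
  // Alert if throughput drops by more than 5%.
  if (current.qps < baseline.qps * 0.95) alerts.push('throughput_decrease');
  // Alert if recall@10 drops by more than one absolute point.
  if (current.recallAt10 < baseline.recallAt10 - 0.01) alerts.push('recall_decrease');
  return alerts;
}
```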
---
## 8. Horizontal Scalability
### 8.1 Sharding Strategy
```
SHARDING APPROACH: Geographic + Temporal Hybrid
===============================================
Primary Shard Key: Geographic Region (sensor cluster)
- Shard 0: North America West
- Shard 1: North America East
- Shard 2: Europe
- Shard 3: Asia-Pacific
- Shard 4: South America
- Shard 5: Africa
Secondary Partition: Temporal (within shard)
- Hot: Current month
- Warm: Past 12 months
- Cold: Archive (>12 months)
Cross-Shard Queries:
- Use scatter-gather pattern
- Merge results by distance
- Timeout per shard: 100ms
```
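The hybrid key above resolves in two steps: region picks the shard, recording age picks the temporal tier. The region keys and the `routeRecord` helper below are illustrative names, not part of the implementation.

```typescript
// Resolve a record to its shard and temporal tier per the hybrid scheme above.
const REGION_SHARDS: Record<string, number> = {
  'na-west': 0, 'na-east': 1, 'europe': 2,
  'asia-pacific': 3, 'south-america': 4, 'africa': 5,
};

type Tier = 'hot' | 'warm' | 'cold';

function routeRecord(region: string, recordedAt: Date, now: Date): { shard: number; tier: Tier } {
  const shard = REGION_SHARDS[region];
  if (shard === undefined) throw new Error(`unknown region: ${region}`);
  const monthsOld = (now.getFullYear() - recordedAt.getFullYear()) * 12
    + (now.getMonth() - recordedAt.getMonth());
  // Hot: current month; warm: past 12 months; cold: older archive.
  const tier: Tier = monthsOld <= 0 ? 'hot' : monthsOld <= 12 ? 'warm' : 'cold';
  return { shard, tier };
}
```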
### 8.2 Cluster Architecture
```
┌─────────────────────────────────────────────────────────────────────┐
│ DISTRIBUTED ARCHITECTURE │
├─────────────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
│ │ Query LB │ │ Query LB │ │ Query LB │ │
│ └──────┬──────┘ └──────┬──────┘ └──────┬──────┘ │
│ │ │ │ │
│ v v v │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ Query Router (Consistent Hash) │ │
│ └─────────────────────────┬───────────────────────────┘ │
│ │ │
│ ┌──────────────────────┼──────────────────────┐ │
│ │ │ │ │
│ v v v │
│ ┌─────────┐ ┌─────────┐ ┌─────────┐ │
│ │ Shard 0 │ │ Shard 1 │ │ Shard N │ │
│ │ (3 rep) │ │ (3 rep) │ │ (3 rep) │ │
│ └────┬────┘ └────┬────┘ └────┬────┘ │
│ │ │ │ │
│ v v v │
│ ┌─────────┐ ┌─────────┐ ┌─────────┐ │
│ │ HNSW │ │ HNSW │ │ HNSW │ │
│ │ Index │ │ Index │ │ Index │ │
│ └─────────┘ └─────────┘ └─────────┘ │
│ │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ Shared Metadata Store (Distributed) │ │
│ └─────────────────────────────────────────────────────┘ │
│ │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ GNN Training Coordinator (Async Gossip) │ │
│ └─────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────┘
```
### 8.3 Scaling Thresholds
| Metric | Scale-Out Trigger | Scale-In Trigger |
|--------|-------------------|------------------|
| Query Latency p99 | > 80ms sustained 5min | < 30ms sustained 1h |
| CPU Utilization | > 70% sustained 5min | < 30% sustained 1h |
| Memory Utilization | > 80% | < 50% sustained 1h |
| Queue Depth | > 10K pending | < 1K sustained 30min |
| Shard Size | > 500K vectors | N/A (don't scale in) |
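The trigger table can be sketched as a decision function. Inputs are assumed pre-aggregated over the sustained windows (5 min for scale-out, 1 h for scale-in); any single scale-out trigger suffices, while scale-in requires every scale-in condition to hold, and shard size never drives scale-in, per the table. The asymmetric thresholds provide hysteresis so the cluster does not flap.

```typescript
// Evaluate the scale-out / scale-in triggers above for one shard.
interface ShardMetrics {
  p99LatencyMs: number;
  cpuUtilization: number;   // 0..1
  memUtilization: number;   // 0..1
  queueDepth: number;
  vectorCount: number;
}

function scaleDecision(m: ShardMetrics): 'scale-out' | 'scale-in' | 'hold' {
  // Any one sustained scale-out trigger is enough.
  if (m.p99LatencyMs > 80 || m.cpuUtilization > 0.7 || m.memUtilization > 0.8
      || m.queueDepth > 10_000 || m.vectorCount > 500_000) {
    return 'scale-out';
  }
  // Scale in only when every scale-in condition holds (conservative).
  if (m.p99LatencyMs < 30 && m.cpuUtilization < 0.3 && m.memUtilization < 0.5
      && m.queueDepth < 1_000) {
    return 'scale-in';
  }
  return 'hold';
}
```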
### 8.4 Cross-Shard Query Protocol
```typescript
async function globalKnnQuery(
query: Float32Array,
k: number,
options: QueryOptions
): Promise<SearchResult[]> {
const shards = options.shards || getAllShards();
const perShardK = Math.ceil(k * 1.5); // Over-fetch for merge
// Scatter phase
const shardPromises = shards.map(shard =>
queryShardWithTimeout(shard, query, perShardK, options.timeout || 100)
);
// Gather phase with partial results on timeout
const results = await Promise.allSettled(shardPromises);
// Merge and re-rank
  const allResults = results
    .filter((r): r is PromiseFulfilledResult<SearchResult[]> => r.status === 'fulfilled')
    .flatMap(r => r.value);
// Sort by distance and take top-k
allResults.sort((a, b) => a.distance - b.distance);
return allResults.slice(0, k);
}
```
---
## 9. Latency Budget Breakdown
### 9.1 Query Path Latency Budget
```
TOTAL BUDGET: 50ms (p99 target)
===============================
┌────────────────────────────────────────────────────┐
│ Component │ Budget │ % of Total │
├────────────────────────────────────────────────────┤
│ Network (client -> LB) │ 5ms │ 10% │
│ Load Balancer routing │ 1ms │ 2% │
│ Query parsing/validation │ 1ms │ 2% │
│ Cache lookup (L1-L4) │ 3ms │ 6% │
│ HNSW search (k=10) │ 25ms │ 50% │
│ Metadata fetch │ 5ms │ 10% │
│ Result serialization │ 2ms │ 4% │
│ Network (LB -> client) │ 5ms │ 10% │
│ Buffer/headroom │ 3ms │ 6% │
├────────────────────────────────────────────────────┤
│ TOTAL │ 50ms │ 100% │
└────────────────────────────────────────────────────┘
```
### 9.2 Ingestion Path Latency Budget
```
TOTAL BUDGET: 200ms (p99 target for single segment)
==================================================
┌────────────────────────────────────────────────────┐
│ Component │ Budget │ % of Total │
├────────────────────────────────────────────────────┤
│ Audio receive/decode │ 10ms │ 5% │
│ Mel spectrogram compute │ 20ms │ 10% │
│ Perch model inference │ 80ms │ 40% │
│ Embedding normalization │ 5ms │ 2% │
│ HNSW insertion │ 20ms │ 10% │
│ Metadata write │ 10ms │ 5% │
│ Cache invalidation │ 10ms │ 5% │
│ Acknowledgment │ 5ms │ 2% │
│ Buffer/headroom │ 40ms │ 20% │
├────────────────────────────────────────────────────┤
│ TOTAL │200ms │ 100% │
└────────────────────────────────────────────────────┘
```
### 9.3 GNN Training Latency Constraints
```
ONLINE UPDATE: 100ms max (non-blocking)
---------------------------------------
- Subgraph extraction: 20ms
- Single GNN forward pass: 50ms
- Embedding update (async): 30ms
INCREMENTAL TRAINING: 5 min max
-------------------------------
- Subgraph construction: 30s
- Training (5 epochs): 4 min
- Embedding sync: 30s
FULL RETRAINING: 4 hour max
---------------------------
- Graph snapshot: 10 min
- Training (50 epochs): 3.5 hours
- Validation: 10 min
- Cutover: 10 min
```
---
## Consequences
### Positive
- **Sub-100ms query latency** achieved through HNSW tuning and multi-level caching
- **4x storage reduction** for cold data via int8 scalar quantization
- **Non-blocking GNN learning** enables continuous improvement without query degradation
- **Linear horizontal scaling** via geographic sharding
- **Clear SLAs** enable capacity planning and alerting
### Negative
- **Increased operational complexity** from multi-tier storage and distributed architecture
- **Memory overhead** from caching layers (~1.1GB dedicated to caches)
- **Quantization recall loss** of 2-3% for cold tier data
- **Cross-shard query overhead** adds latency for global searches
### Neutral
- **Trade-off flexibility** allows tuning precision vs. latency per use case
- **Benchmark-driven development** requires ongoing measurement infrastructure
---
## Implementation Phases
### Phase 1: Foundation (Weeks 1-2)
- Implement HNSW with tuned parameters
- Set up benchmark suite
- Establish baseline metrics
### Phase 2: Optimization (Weeks 3-4)
- Implement scalar quantization
- Add multi-level caching
- Optimize batch ingestion pipeline
### Phase 3: Learning (Weeks 5-6)
- Integrate GNN training scheduler
- Implement non-blocking updates
- Validate recall improvements
### Phase 4: Scale (Weeks 7-8)
- Implement sharding layer
- Deploy distributed architecture
- Load test at 1M+ vectors
---
## References
- Perch 2.0: https://arxiv.org/abs/2508.04665
- RuVector: https://github.com/ruvnet/ruvector
- HNSW Paper: Malkov & Yashunin, 2018
- Product Quantization: Jegou et al., 2011
- Graph Attention Networks: Velickovic et al., 2018
# ADR-005: Self-Learning and Hooks Integration
## Status
Proposed
## Date
2026-01-15
## Context
7sense processes bioacoustic data through Perch 2.0 embeddings (1536-D vectors) stored in RuVector with HNSW indexing. To maximize the value of this acoustic geometry, we need a self-learning system that:
1. Continuously improves retrieval quality based on user feedback
2. Discovers and consolidates successful clustering configurations
3. Learns species-specific embedding characteristics over time
4. Prevents catastrophic forgetting when adapting to new domains (marine vs avian vs terrestrial)
RuVector includes a built-in GNN layer designed for index self-improvement, and the claude-flow framework provides a comprehensive hooks system with 27 hooks and 12 background workers that can orchestrate continuous learning pipelines.
## Decision
We will implement a four-stage learning loop architecture integrated with claude-flow hooks, utilizing SONA (Self-Optimizing Neural Architecture) patterns and EWC++ (Elastic Weight Consolidation) for continual learning without forgetting.
### Learning Loop Architecture
```
+-------------------+      +------------------+      +-------------------+      +---------------------+
|     RETRIEVE      | -->  |      JUDGE       | -->  |      DISTILL      | -->  |     CONSOLIDATE     |
| (HNSW + Pattern)  |      | (Verdict System) |      | (LoRA Fine-tune)  |      | (EWC++ Integration) |
+-------------------+      +------------------+      +-------------------+      +---------------------+
          ^                                                                                |
          |                                                                                |
          +--------------------------------------------------------------------------------+
                                     Continuous Feedback Loop
```
#### Stage 1: RETRIEVE
Fetch relevant patterns from the ReasoningBank using HNSW-indexed vector search:
```bash
# Search for similar bioacoustic analysis patterns
npx @claude-flow/cli@latest memory search \
--query "whale song clustering high-frequency harmonics" \
--namespace patterns \
--limit 5 \
--threshold 0.7
# Retrieve species-specific embedding characteristics
npx @claude-flow/cli@latest hooks intelligence pattern-search \
--query "humpback whale vocalization" \
--namespace species \
--top-k 3
```
Performance characteristics:
- HNSW retrieval: 150x-12,500x faster than brute force
- Pattern matching: 761 decisions/sec
- Sub-millisecond adaptation via SONA
#### Stage 2: JUDGE
Evaluate retrieved patterns with a verdict system that scores relevance and success:
```typescript
interface BioacousticVerdict {
pattern_id: string;
task_type: 'clustering' | 'motif_discovery' | 'species_identification' | 'anomaly_detection';
verdict: 'success' | 'partial' | 'failure';
confidence: number; // 0.0 - 1.0
metrics: {
silhouette_score?: number; // For clustering
retrieval_precision?: number; // For search quality
user_correction_rate?: number; // For feedback integration
snr_threshold_effectiveness?: number;
};
feedback_source: 'automatic' | 'user_correction' | 'expert_annotation';
}
```
Verdict aggregation rules:
- Success (confidence > 0.85): Promote pattern to long-term memory
- Partial (0.5 < confidence < 0.85): Mark for refinement
- Failure (confidence < 0.5): Demote or archive with failure context
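The aggregation rules above reduce to a small dispatch on confidence. In this sketch, `aggregateVerdict` and the action names are illustrative, and boundary values (exactly 0.85 or 0.5) fall to the lower action, a choice the ADR leaves open.

```typescript
// Map a verdict confidence to the action prescribed by the rules above.
type VerdictAction = 'promote' | 'refine' | 'demote';

function aggregateVerdict(confidence: number): VerdictAction {
  if (confidence > 0.85) return 'promote'; // success -> long-term memory
  if (confidence > 0.5) return 'refine';   // partial -> mark for refinement
  return 'demote';                         // failure -> archive with context
}
```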
#### Stage 3: DISTILL
Extract key learnings via LoRA (Low-Rank Adaptation) fine-tuning:
```bash
# Train neural patterns on successful bioacoustic analysis
npx @claude-flow/cli@latest hooks intelligence trajectory-start \
--task "clustering whale songs by call type" \
--agent "bioacoustic-analyzer"
# Record analysis steps
npx @claude-flow/cli@latest hooks intelligence trajectory-step \
--trajectory-id "$TRAJ_ID" \
--action "applied hierarchical clustering with ward linkage" \
--result "silhouette score 0.78" \
--quality 0.85
# Complete trajectory with success
npx @claude-flow/cli@latest hooks intelligence trajectory-end \
--trajectory-id "$TRAJ_ID" \
--success true \
--feedback "user confirmed 23/25 clusters as valid call types"
```
LoRA benefits for bioacoustics:
- 99% parameter reduction (critical for edge deployment on field sensors)
- 10-100x faster training than full fine-tuning
- Minimal memory footprint for continuous learning
#### Stage 4: CONSOLIDATE
Prevent catastrophic forgetting via EWC++ when learning new domains:
```bash
# Force SONA learning cycle with EWC++ consolidation
npx @claude-flow/cli@latest hooks intelligence learn \
--consolidate true \
--trajectory-ids "$WHALE_TRAJ,$BIRD_TRAJ,$INSECT_TRAJ"
```
EWC++ strategy for bioacoustics:
- Compute Fisher information matrix for critical embedding dimensions
- Penalize changes to weights important for existing species recognition
- Allow plasticity for new acoustic domains (marine -> avian -> terrestrial)
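The penalty described by those three steps has a standard form: loss = taskLoss + (lambda / 2) * sum_i F_i * (w_i - wStar_i)^2. A minimal sketch with a diagonal Fisher estimate follows; high-Fisher dimensions (critical for known species) resist change, while low-Fisher dimensions stay plastic for new domains. Function and parameter names are illustrative.

```typescript
// Compute the diagonal-Fisher EWC penalty term described above.
function ewcPenalty(
  weights: Float32Array,  // current weights
  anchor: Float32Array,   // weights after the previous domain
  fisher: Float32Array,   // diagonal Fisher information estimate
  lambda: number,         // stability/plasticity trade-off
): number {
  let sum = 0;
  for (let i = 0; i < weights.length; i++) {
    const d = weights[i] - anchor[i];
    sum += fisher[i] * d * d; // Important weights (large F_i) are penalized hardest.
  }
  return (lambda / 2) * sum;
}
```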
### Claude-Flow Hooks Integration
#### Pre-Task Hook: Route Bioacoustic Analysis Tasks
The `pre-task` hook routes incoming analysis requests to optimal processing paths:
```bash
# Before starting any bioacoustic analysis
npx @claude-flow/cli@latest hooks pre-task \
--task-id "analysis-$(date +%s)" \
--description "cluster humpback whale songs from Pacific Northwest dataset"
```
Routing decisions based on task characteristics:
| Task Type | Recommended Agent | Model Tier | Rationale |
|-----------|-------------------|------------|-----------|
| Simple retrieval | retrieval-agent | Haiku | Fast kNN lookup |
| Clustering | clustering-specialist | Sonnet | Algorithm selection |
| Motif discovery | sequence-analyzer | Sonnet | Temporal pattern analysis |
| Cross-species analysis | bioacoustic-expert | Opus | Complex reasoning |
| Anomaly detection | anomaly-detector | Haiku | Real-time processing |
| Embedding refinement | ml-specialist | Opus | Architecture decisions |
Pre-task also retrieves relevant patterns:
```bash
# Get routing recommendation with pattern retrieval
npx @claude-flow/cli@latest hooks route \
--task "identify dialect variations in orca pod communications" \
--context "Pacific Northwest, 2024 field recordings"
```
Output includes:
- Recommended agent type and model tier
- Top-3 similar successful patterns from memory
- Suggested HNSW parameters based on past success
- Estimated confidence and processing time
#### Post-Task Hook: Store Successful Patterns
After successful analysis, store the pattern for future retrieval:
```bash
# Record task completion
npx @claude-flow/cli@latest hooks post-task \
--task-id "$TASK_ID" \
--success true \
--agent "clustering-specialist" \
--quality 0.92
# Store the successful pattern
npx @claude-flow/cli@latest memory store \
--namespace patterns \
--key "whale-clustering-hierarchical-ward-2026-01" \
--value '{
"task_type": "clustering",
"species_group": "cetacean",
"algorithm": "hierarchical",
"linkage": "ward",
"distance_metric": "cosine",
"min_cluster_size": 5,
"silhouette_score": 0.78,
"num_clusters_discovered": 23,
"snr_threshold": 15,
"embedding_preprocessing": "l2_normalize",
"hnsw_params": {"ef_construction": 200, "M": 32}
}'
# Train neural patterns on the success
npx @claude-flow/cli@latest hooks post-edit \
--file "analysis-results.json" \
--success true \
--train-neural true
```
#### Pre-Edit Hook: Context for Embedding Refinement
Before modifying embedding configurations or HNSW parameters:
```bash
# Get context before editing embedding pipeline
npx @claude-flow/cli@latest hooks pre-edit \
--file "src/embeddings/perch_config.rs" \
--operation "refactor"
```
Returns:
- Related patterns that worked for similar configurations
- Agent recommendations for the edit type
- Risk assessment for the change
- Suggested validation tests
#### Post-Edit Hook: Train Neural Patterns
After successful configuration changes:
```bash
# Record successful embedding refinement
npx @claude-flow/cli@latest hooks post-edit \
--file "src/embeddings/perch_config.rs" \
--success true \
--agent "ml-specialist"
# Store the refinement as a pattern
npx @claude-flow/cli@latest hooks intelligence pattern-store \
--pattern "HNSW ef_search=150 optimal for whale song retrieval" \
--type "configuration" \
--confidence 0.88 \
--metadata '{"species": "cetacean", "corpus_size": 500000}'
```
### Memory Namespaces for Bioacoustics
#### Namespace: `patterns`
Stores successful clustering and analysis configurations:
```bash
# Store clustering pattern
npx @claude-flow/cli@latest memory store \
--namespace patterns \
--key "birdsong-dbscan-dawn-chorus" \
--value '{
"algorithm": "DBSCAN",
"eps": 0.15,
"min_samples": 3,
"preprocessing": ["l2_normalize", "pca_128"],
"context": "dawn_chorus",
"success_rate": 0.91,
"species_groups": ["passerine", "corvid"],
"temporal_window": "04:00-07:00"
}'
# Search for relevant patterns
npx @claude-flow/cli@latest memory search \
--namespace patterns \
--query "clustering algorithm for dense dawn chorus recordings"
```
Pattern schema:
```typescript
interface ClusteringPattern {
algorithm: 'DBSCAN' | 'HDBSCAN' | 'hierarchical' | 'kmeans' | 'spectral';
parameters: Record<string, number | string>;
preprocessing: string[];
context: string;
success_rate: number;
species_groups: string[];
environmental_conditions?: {
habitat?: string;
time_of_day?: string;
season?: string;
weather?: string;
};
hnsw_tuning?: {
ef_construction: number;
ef_search: number;
M: number;
};
}
```
#### Namespace: `motifs`
Stores discovered sequence patterns and syntactic structures:
```bash
# Store discovered motif
npx @claude-flow/cli@latest memory store \
--namespace motifs \
--key "humpback-song-unit-sequence-A" \
--value '{
"species": "Megaptera novaeangliae",
"pattern_type": "song_unit_sequence",
"sequence": ["A1", "A2", "B1", "A1", "C1"],
"transition_probabilities": {
"A1->A2": 0.85,
"A2->B1": 0.72,
"B1->A1": 0.68,
"A1->C1": 0.45
},
"typical_duration_ms": 45000,
"occurrence_rate": 0.34,
"recording_ids": ["rec_2024_001", "rec_2024_002"],
"discovered_by": "sequence-analyzer",
"confidence": 0.89
}'
# Search for similar motifs
npx @claude-flow/cli@latest memory search \
--namespace motifs \
--query "humpback whale song phrase transitions"
```
Motif schema:
```typescript
interface SequenceMotif {
species: string;
pattern_type: 'song_unit_sequence' | 'call_response' | 'alarm_cascade' | 'contact_pattern';
sequence: string[];
transition_probabilities: Record<string, number>;
typical_duration_ms: number;
occurrence_rate: number;
temporal_context?: {
time_of_day?: string;
season?: string;
behavioral_context?: string;
};
recording_ids: string[];
discovered_by: string;
confidence: number;
validation_status: 'automatic' | 'expert_verified' | 'disputed';
}
```
#### Namespace: `species`
Stores species-specific embedding characteristics:
```bash
# Store species embedding profile
npx @claude-flow/cli@latest memory store \
--namespace species \
--key "orca-pacific-northwest-resident" \
--value '{
"species": "Orcinus orca",
"population": "Southern Resident",
"location": "Pacific Northwest",
"embedding_characteristics": {
"centroid_cluster_distance": 0.12,
"intra_pod_variance": 0.08,
"inter_pod_variance": 0.23,
"frequency_range_hz": [500, 12000],
"dominant_frequencies_hz": [2000, 5000, 8000]
},
"retrieval_optimization": {
"optimal_k": 15,
"distance_threshold": 0.25,
"ef_search": 200
},
"known_call_types": 34,
"dialect_markers": ["S01", "S02", "S03"],
"last_updated": "2026-01-15"
}'
# Search for species characteristics
npx @claude-flow/cli@latest memory search \
--namespace species \
--query "cetacean vocalization embedding characteristics Pacific"
```
Species schema:
```typescript
interface SpeciesEmbeddingProfile {
species: string;
population?: string;
location?: string;
embedding_characteristics: {
centroid_cluster_distance: number;
intra_population_variance: number;
inter_population_variance: number;
frequency_range_hz: [number, number];
dominant_frequencies_hz: number[];
embedding_norm_range?: [number, number];
};
retrieval_optimization: {
optimal_k: number;
distance_threshold: number;
ef_search: number;
ef_construction?: number;
};
known_call_types: number;
dialect_markers?: string[];
acoustic_niche?: {
typical_snr_db: number;
overlap_species: string[];
distinguishing_features: string[];
};
last_updated: string;
}
```
### Background Workers Utilization
#### Worker: `optimize` - HNSW Parameter Tuning
Continuously optimizes HNSW parameters based on retrieval quality:
```bash
# Dispatch HNSW optimization worker
npx @claude-flow/cli@latest hooks worker dispatch \
--trigger optimize \
--context "bioacoustic-hnsw" \
--priority high
# Check optimization status
npx @claude-flow/cli@latest hooks worker status
```
Optimization targets:
- `ef_construction`: Balance between index build time and recall
- `ef_search`: Balance between query latency and accuracy
- `M`: Balance between memory usage and graph connectivity
Automated tuning workflow:
1. Sample recent queries and their success rates
2. Run parameter sweep on subset
3. Evaluate recall@k and latency
4. Apply best parameters if improvement > 5%
5. Store successful configuration in `patterns` namespace
```typescript
interface HNSWOptimizationResult {
previous_params: { ef_construction: number; ef_search: number; M: number };
new_params: { ef_construction: number; ef_search: number; M: number };
improvement: {
recall_at_10: number; // Percentage improvement
latency_p99_ms: number;
memory_mb: number;
};
evaluation_corpus_size: number;
applied: boolean;
timestamp: string;
}
```
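Step 4 of the tuning workflow ("apply best parameters if improvement > 5%") can be sketched as a gate over sweep candidates. This reading treats the 5% as a relative recall@10 gain and adds a no-latency-regression guard, both of which are assumptions rather than stated policy.

```typescript
// Decide whether a parameter-sweep candidate should replace the baseline.
interface SweepCandidate {
  params: { ef_construction: number; ef_search: number; M: number };
  recallAt10: number;
  latencyP99Ms: number;
}

function shouldApply(baseline: SweepCandidate, candidate: SweepCandidate): boolean {
  const recallGain = (candidate.recallAt10 - baseline.recallAt10) / baseline.recallAt10;
  // Apply only on a >5% relative recall gain with no p99 latency regression.
  return recallGain > 0.05 && candidate.latencyP99Ms <= baseline.latencyP99Ms;
}
```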
#### Worker: `consolidate` - Memory Consolidation
Consolidates learned patterns and prevents memory fragmentation:
```bash
# Dispatch consolidation worker (low priority, runs during idle)
npx @claude-flow/cli@latest hooks worker dispatch \
--trigger consolidate \
--priority low \
--background true
```
Consolidation operations:
1. Merge similar patterns within each namespace
2. Archive low-confidence or stale patterns
3. Update pattern embeddings for improved retrieval
4. Compute and cache centroid patterns for fast routing
5. Run EWC++ to protect critical learned weights
```bash
# Force SONA learning cycle with consolidation
npx @claude-flow/cli@latest hooks intelligence learn \
--consolidate true
```
Consolidation schedule:
- Hourly: Merge patterns with >0.95 similarity
- Daily: Archive patterns not accessed in 30 days
- Weekly: Full EWC++ consolidation pass
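The hourly merge step can be sketched as a greedy pass: patterns whose embeddings exceed 0.95 cosine similarity collapse into the higher-confidence entry. `StoredPattern` and the keep-list approach are illustrative, not the RuVector implementation.

```typescript
// Merge near-duplicate patterns (cosine similarity > threshold) as in the
// hourly consolidation schedule above.
interface StoredPattern { key: string; embedding: Float32Array; confidence: number }

function cosine(a: Float32Array, b: Float32Array): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i]; na += a[i] * a[i]; nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

function mergeSimilar(patterns: StoredPattern[], threshold = 0.95): StoredPattern[] {
  const kept: StoredPattern[] = [];
  // Visit higher-confidence patterns first so they absorb near-duplicates.
  for (const p of [...patterns].sort((a, b) => b.confidence - a.confidence)) {
    if (!kept.some(k => cosine(k.embedding, p.embedding) > threshold)) kept.push(p);
  }
  return kept;
}
```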
#### Worker: `audit` - Data Quality Checks
Validates embedding quality and detects drift:
```bash
# Dispatch audit worker
npx @claude-flow/cli@latest hooks worker dispatch \
--trigger audit \
--context "embedding-quality" \
--priority critical
```
Audit checks:
1. **Embedding health**: Detect NaN, infinity, or collapsed embeddings
2. **Distribution drift**: Compare embedding statistics over time
3. **Retrieval quality**: Sample-based precision/recall checks
4. **Label consistency**: Cross-reference with expert annotations
5. **Temporal coherence**: Verify sequence relationships
```typescript
interface AuditResult {
check_type: 'embedding_health' | 'distribution_drift' | 'retrieval_quality' | 'label_consistency';
status: 'pass' | 'warning' | 'fail';
metrics: {
nan_rate?: number;
norm_variance?: number;
drift_score?: number;
precision_at_10?: number;
consistency_rate?: number;
};
affected_recordings?: string[];
recommended_action?: string;
timestamp: string;
}
```
Automated responses:
- Warning: Log and notify, continue processing
- Fail: Pause ingestion, alert operators, revert to last known good state
### Transfer Learning from Related Projects
#### Project Transfer Protocol
Leverage patterns from related bioacoustic projects:
```bash
# Transfer patterns from a related whale research project
npx @claude-flow/cli@latest hooks transfer \
--source-path "/projects/cetacean-acoustics" \
--min-confidence 0.8 \
--filter "species:cetacean"
# Transfer from IPFS-distributed pattern registry
npx @claude-flow/cli@latest hooks transfer store \
--pattern-id "marine-mammal-clustering-v2"
```
Transfer eligibility criteria:
1. Source project confidence > 0.8
2. Domain overlap > 50% (based on species groups)
3. No conflicting patterns in target
4. Embedding model compatibility (same Perch version)
Transfer adaptation process:
1. Retrieve candidate patterns from source
2. Validate against target domain characteristics
3. Apply domain adaptation if needed (fine-tune on local data)
4. Integrate with reduced initial confidence (0.7x)
5. Gradually increase confidence based on local success
```bash
# Check transfer candidates
npx @claude-flow/cli@latest transfer store-search \
--query "bioacoustic clustering" \
--category "marine" \
--min-rating 4.0 \
--verified true
```
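Steps 4 and 5 of the adaptation process can be sketched as follows: transferred patterns start at 0.7x the source confidence and regain confidence with confirmed local successes. The 10%-of-remaining-gap step per success is an illustrative choice, not from the ADR.

```typescript
// Import a transferred pattern at reduced confidence, then grow confidence
// as local successes accumulate (steps 4-5 above).
interface TransferredPattern { id: string; confidence: number; localSuccesses: number }

function importPattern(id: string, sourceConfidence: number): TransferredPattern {
  // Eligibility criterion 1: source confidence must exceed 0.8.
  if (sourceConfidence <= 0.8) throw new Error('below transfer eligibility threshold');
  return { id, confidence: sourceConfidence * 0.7, localSuccesses: 0 };
}

function recordLocalSuccess(p: TransferredPattern): TransferredPattern {
  // Close 10% of the gap to 1.0 per confirmed success (illustrative rate).
  const confidence = p.confidence + (1.0 - p.confidence) * 0.1;
  return { ...p, confidence, localSuccesses: p.localSuccesses + 1 };
}
```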
### Feedback Loops: User Corrections to Embedding Refinement
#### Correction Capture
```typescript
interface UserCorrection {
correction_id: string;
timestamp: string;
user_id: string;
expertise_level: 'novice' | 'intermediate' | 'expert' | 'domain_expert';
correction_type: 'cluster_assignment' | 'species_label' | 'call_type' | 'sequence_boundary';
original_prediction: {
value: string;
confidence: number;
source: 'automatic' | 'pattern_match';
};
corrected_value: string;
affected_segments: string[];
context?: string;
}
```
#### Feedback Integration Pipeline
```bash
# Step 1: Log user correction
npx @claude-flow/cli@latest memory store \
--namespace corrections \
--key "correction-$(date +%s)-$USER" \
--value '{
"correction_type": "species_label",
"original": {"value": "Megaptera novaeangliae", "confidence": 0.72},
"corrected": "Balaenoptera musculus",
"segment_ids": ["seg_001", "seg_002"],
"user_expertise": "domain_expert"
}'
# Step 2: Trigger learning from correction
npx @claude-flow/cli@latest hooks intelligence trajectory-start \
--task "learn from species misclassification correction"
npx @claude-flow/cli@latest hooks intelligence trajectory-step \
--trajectory-id "$TRAJ_ID" \
--action "analyzed embedding distance between humpback and blue whale" \
--result "found confounding frequency overlap in low-SNR segments" \
--quality 0.7
npx @claude-flow/cli@latest hooks intelligence trajectory-end \
--trajectory-id "$TRAJ_ID" \
--success true \
--feedback "updated SNR threshold from 10 to 15 dB for cetacean classification"
# Step 3: Update species namespace
npx @claude-flow/cli@latest memory store \
--namespace species \
--key "blue-whale-humpback-distinction" \
--value '{
"confusion_pair": ["Megaptera novaeangliae", "Balaenoptera musculus"],
"distinguishing_features": ["frequency_range", "call_duration"],
"recommended_snr_threshold": 15,
"embedding_distance_threshold": 0.18
}'
```
#### Feedback Weight by Expertise
| Expertise Level | Weight | Trigger Threshold | Immediate Action |
|-----------------|--------|-------------------|------------------|
| Domain Expert | 1.0 | 1 correction | Update pattern |
| Expert | 0.8 | 2 corrections | Update pattern |
| Intermediate | 0.5 | 5 corrections | Flag for review |
| Novice | 0.2 | 10 corrections | Queue for expert |
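The table above can be sketched as a per-tier trigger check: an update fires when any expertise tier reaches its own correction threshold. Checking tiers independently (rather than combining weighted counts into one score) is an assumption of this sketch; the weights are carried along for such a combined scheme.

```typescript
// Trigger a pattern update when any expertise tier hits its threshold.
type Expertise = 'domain_expert' | 'expert' | 'intermediate' | 'novice';

const FEEDBACK_POLICY: Record<Expertise, { weight: number; threshold: number }> = {
  domain_expert: { weight: 1.0, threshold: 1 },
  expert:        { weight: 0.8, threshold: 2 },
  intermediate:  { weight: 0.5, threshold: 5 },
  novice:        { weight: 0.2, threshold: 10 },
};

function shouldUpdatePattern(corrections: { expertise: Expertise }[]): boolean {
  const counts = new Map<Expertise, number>();
  for (const c of corrections) {
    counts.set(c.expertise, (counts.get(c.expertise) ?? 0) + 1);
  }
  for (const [level, n] of counts) {
    if (n >= FEEDBACK_POLICY[level].threshold) return true;
  }
  return false;
}
```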
#### Continuous Refinement Loop
```
User Correction
|
v
+------------------+
| Correction Store | (namespace: corrections)
+------------------+
|
v
+------------------+
| Pattern Analysis | (identify affected patterns)
+------------------+
|
v
+------------------+
| Verdict Update | (reduce confidence of failed patterns)
+------------------+
|
v
+------------------+
| SONA Learning | (trajectory-based fine-tuning)
+------------------+
|
v
+------------------+
| EWC++ Consolidate| (protect other learned patterns)
+------------------+
|
v
+------------------+
| Pattern Update | (store refined pattern)
+------------------+
|
v
Improved Retrieval
```
### Implementation Checklist
#### Phase 1: Core Infrastructure (Week 1-2)
- [ ] Set up memory namespaces (`patterns`, `motifs`, `species`, `corrections`)
- [ ] Implement pre-task hook for bioacoustic task routing
- [ ] Implement post-task hook for pattern storage
- [ ] Configure HNSW parameters for 1536-D Perch embeddings
- [ ] Set up audit worker for embedding health checks
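For the HNSW configuration item above, a reasonable starting point is sketched below. The parameter values are conventional HNSW defaults offered as a tuning baseline for 1536-D vectors; the exact config keys RuVector expects may differ:

```python
# Candidate HNSW starting parameters for 1536-D Perch 2.0 embeddings.
# Values are generic HNSW tuning conventions, not RuVector-verified
# settings; treat them as the baseline the optimize worker refines.

hnsw_config = {
    "dimensions": 1536,        # Perch 2.0 embedding size
    "metric": "cosine",        # angular similarity suits normalized embeddings
    "M": 16,                   # graph degree: higher -> better recall, more RAM
    "ef_construction": 200,    # build-time beam width: quality vs. index time
    "ef_search": 100,          # query-time beam width: recall vs. latency
}

# Rough per-vector memory: raw float32 values plus bidirectional graph links.
bytes_per_vector = hnsw_config["dimensions"] * 4 + hnsw_config["M"] * 2 * 8
print(f"~{bytes_per_vector * 1_000_000 / 2**30:.1f} GiB for 1M vectors")
```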
#### Phase 2: Learning Integration (Week 3-4)
- [ ] Implement trajectory tracking for analysis workflows
- [ ] Configure LoRA fine-tuning for embedding refinement
- [ ] Set up EWC++ consolidation schedule
- [ ] Implement feedback capture from user interface
- [ ] Configure optimize worker for HNSW tuning
#### Phase 3: Advanced Features (Week 5-6)
- [ ] Implement motif discovery and storage
- [ ] Set up species-specific embedding profiles
- [ ] Configure transfer learning from related projects
- [ ] Implement expertise-weighted feedback integration
- [ ] Set up consolidate worker for memory optimization
#### Phase 4: Monitoring and Refinement (Ongoing)
- [ ] Dashboard for learning metrics
- [ ] Alerting for quality degradation
- [ ] A/B testing for pattern effectiveness
- [ ] Regular audit of learned patterns
## Consequences
### Positive
1. **Continuous Improvement**: System gets better with every analysis task
2. **Domain Adaptation**: EWC++ allows learning new species without forgetting existing knowledge
3. **Expert Knowledge Capture**: User corrections are systematically integrated
4. **Efficient Processing**: Pattern reuse reduces computation for common tasks
5. **Transparent Learning**: Trajectory tracking provides explainability
6. **Cross-Project Synergy**: Transfer learning leverages community knowledge
### Negative
1. **Complexity**: Multiple interacting systems require careful orchestration
2. **Storage Growth**: Pattern storage will grow over time (mitigated by consolidation)
3. **Cold Start**: Initial deployments lack learned patterns (mitigated by transfer)
4. **Feedback Dependency**: Quality depends on user correction quality
### Neutral
1. **Operational Overhead**: Background workers require monitoring
2. **Parameter Tuning**: Initial HNSW parameters need manual optimization
3. **Expertise Requirements**: Domain experts needed for high-quality feedback
## References
1. RuVector GNN Architecture: https://github.com/ruvnet/ruvector
2. SONA Pattern Documentation: claude-flow v3 hooks system
3. EWC++ Paper: "Overcoming catastrophic forgetting in neural networks"
4. Perch 2.0 Embeddings: https://arxiv.org/abs/2508.04665
5. HNSW Algorithm: Malkov & Yashunin, "Efficient and robust approximate nearest neighbor search using Hierarchical Navigable Small World graphs"
6. LoRA Fine-tuning: "LoRA: Low-Rank Adaptation of Large Language Models"
