Merge commit 'd803bfe2b1fe7f5e219e50ac20d6801a0a58ac75' as 'vendor/ruvector'
This commit is contained in:

- vendor/ruvector/examples/vibecast-7sense/docs/DDD-IMPLEMENTATION-PLAN.md (2526 lines, vendored; diff suppressed because it is too large)
- vendor/ruvector/examples/vibecast-7sense/docs/adr/ADR-001-system-architecture.md (574 lines, vendored):
# ADR-001: System Architecture Overview

**Status:** Accepted
**Date:** 2026-01-15
**Decision Makers:** 7sense Architecture Team
**Technical Area:** System Architecture

---

## Context and Problem Statement

7sense aims to transform bioacoustic signals (primarily bird calls) into a navigable geometric space where meaningful structure emerges. The system must process audio recordings, generate high-dimensional embeddings using Perch 2.0 (1536-D vectors), organize them with HNSW indexing in RuVector, and apply GNN-based learning to surface patterns such as call types, motifs, and behavioral contexts.

The core challenge is designing an architecture that:

1. **Handles diverse data pipelines** - from raw 32kHz audio to queryable vector embeddings
2. **Scales to millions of call segments** - real-world bioacoustic monitoring generates vast datasets
3. **Supports scientific workflows** - researchers need reproducibility, transparency, and evidence-backed interpretations (RAB pattern)
4. **Enables real-time and batch processing** - field deployments require streaming; research requires bulk analysis
5. **Integrates ML inference efficiently** - ONNX-based Perch 2.0 inference in Rust for performance

### Current State

This is a greenfield project building upon:

- **Perch 2.0**: Google DeepMind's bioacoustic embedding model (EfficientNet-B3 backbone, 1536-D output)
- **RuVector**: Rust-based vector database with HNSW indexing and self-learning GNN layers
- **RAB Pattern**: Retrieval-Augmented Bioacoustics for evidence-backed interpretation

---
## Decision Drivers

### Performance Requirements

- **Embedding generation**: process 5-second audio segments at >100 segments/second
- **Vector search**: sub-millisecond kNN queries on 1M+ vectors (HNSW target: ~100 µs)
- **Batch ingestion**: 1M vectors/minute build speed (RuVector baseline)
- **Memory efficiency**: support 32x compression for cold data tiers

### Scalability Requirements

- **Data volume**: support 10K to 10M+ call segments per deployment
- **Concurrent users**: multiple researchers querying simultaneously
- **Geographic distribution**: sensor networks across multiple sites
- **Temporal depth**: years of historical recordings

### Scientific Rigor Requirements

- **Reproducibility**: deterministic pipelines with versioned models and parameters
- **Transparency**: RAB-style evidence packs citing retrieved calls for any interpretation
- **Auditability**: full provenance tracking from raw audio to conclusions
- **Validation**: built-in verification against ground-truth labels

### Operational Requirements

- **Deployment flexibility**: edge (sensor), cloud, and hybrid deployments
- **Monitoring**: health metrics, processing throughput, index quality
- **Updates**: hot-swap embedding models without full reindexing
- **Recovery**: graceful degradation and disaster recovery

---
## Considered Options

### Option A: Monolithic Architecture

A single application handling all concerns: audio processing, embedding generation, vector storage, GNN learning, API serving, and visualization.

**Pros:**

- Simplest deployment model
- No inter-service communication overhead
- Single codebase to maintain

**Cons:**

- Cannot scale components independently
- Single point of failure
- Difficult to update individual components
- Memory pressure from co-located ML models
- Not suitable for distributed sensor networks

### Option B: Microservices Architecture

Fully decomposed services: Audio Ingest Service, Embedding Service, Vector Store Service, GNN Learning Service, Query Service, Visualization Service, etc.

**Pros:**

- Independent scaling per service
- Technology flexibility per service
- Fault isolation
- Team parallelization

**Cons:**

- Significant operational complexity
- Network latency between services
- Data consistency challenges
- Overkill for the initial team size
- Complex debugging across service boundaries

### Option C: Modular Monolith Architecture

A single deployable unit with clearly separated internal modules, designed for future extraction into services if needed.

**Pros:**

- Maintains deployment simplicity
- Clear module boundaries enable future splitting
- In-process communication for performance-critical paths
- Easier debugging and testing
- Appropriate for current team/project scale
- Can evolve toward microservices as needs emerge

**Cons:**

- Requires discipline to maintain module boundaries
- All modules share the same runtime resources
- Scaling requires scaling the entire application

---
## Decision Outcome

**Chosen Option: Option C - Modular Monolith Architecture**

We adopt a modular monolith architecture with clearly defined domain boundaries, designed with explicit seams that allow future extraction to services. This balances immediate development velocity with long-term architectural flexibility.

### Rationale

1. **Right-sized for current needs**: a small team building a new product benefits from deployment simplicity
2. **Performance-critical paths stay in-process**: the audio-to-embedding-to-index flow benefits from zero network hops
3. **Scientific workflow alignment**: researchers prefer reproducible, debuggable systems over distributed complexity
4. **Evolution path preserved**: module boundaries are designed as potential service boundaries
5. **RuVector integration**: RuVector is designed as an embeddable library, making monolith integration natural

---
## Technical Specifications

### Module Architecture

```
sevensense/
├── core/                  # Domain-agnostic foundations
│   ├── config/            # Configuration management
│   ├── error/             # Error types and handling
│   ├── telemetry/         # Logging, metrics, tracing
│   └── storage/           # Abstract storage interfaces
│
├── audio/                 # Audio Processing Domain
│   ├── ingest/            # Audio file reading, streaming
│   ├── segment/           # Call detection and segmentation
│   ├── features/          # Acoustic feature extraction
│   └── spectrogram/       # Mel spectrogram generation
│
├── embedding/             # Embedding Generation Domain
│   ├── perch/             # Perch 2.0 ONNX inference
│   ├── models/            # Model versioning and registry
│   ├── batch/             # Batch embedding pipelines
│   └── normalize/         # Vector normalization (L2, etc.)
│
├── vectordb/              # Vector Storage Domain (RuVector)
│   ├── index/             # HNSW index management
│   ├── graph/             # Graph structure (nodes, edges)
│   ├── query/             # Similarity search, Cypher queries
│   └── hyperbolic/        # Poincare ball embeddings
│
├── learning/              # GNN Learning Domain
│   ├── gnn/               # GNN layers (GCN, GAT, GraphSAGE)
│   ├── attention/         # Attention mechanisms
│   ├── training/          # Self-supervised training loops
│   └── refinement/        # Embedding refinement pipelines
│
├── analysis/              # Analysis Domain
│   ├── clustering/        # HDBSCAN, prototype extraction
│   ├── sequence/          # Motif detection, transition analysis
│   ├── entropy/           # Sequence entropy metrics
│   └── validation/        # Ground truth comparison
│
├── rab/                   # Retrieval-Augmented Bioacoustics
│   ├── evidence/          # Evidence pack construction
│   ├── retrieval/         # Adaptive retrieval depth
│   ├── interpretation/    # Constrained interpretation generation
│   └── citation/          # Source attribution
│
├── api/                   # API Layer
│   ├── rest/              # REST endpoints
│   ├── graphql/           # GraphQL schema and resolvers
│   ├── websocket/         # Real-time streaming
│   └── grpc/              # gRPC for inter-service (future)
│
├── visualization/         # Visualization Domain
│   ├── projection/        # UMAP/t-SNE dimensionality reduction
│   ├── graph_viz/         # Network visualization
│   ├── spectrogram_viz/   # Spectrogram rendering
│   └── export/            # Export formats (JSON, PNG, etc.)
│
└── cli/                   # Command Line Interface
    ├── ingest/            # Batch ingestion commands
    ├── query/             # Query commands
    ├── train/             # Training commands
    └── export/            # Export commands
```
### Data Model

#### Core Entities (Graph Nodes)

```rust
/// Raw audio recording from a sensor
struct Recording {
    id: Uuid,
    sensor_id: String,
    location: GeoPoint,          // lat, lon, elevation
    start_timestamp: DateTime,
    duration_ms: u32,
    sample_rate: u32,            // 32000 Hz for Perch 2.0
    channels: u8,
    habitat: Option<String>,
    weather: Option<WeatherData>,
    file_path: PathBuf,
    checksum: String,            // SHA-256 for reproducibility
}

/// Detected call segment within a recording
struct CallSegment {
    id: Uuid,
    recording_id: Uuid,
    start_ms: u32,
    end_ms: u32,
    snr_db: f32,                 // Signal-to-noise ratio
    peak_frequency_hz: f32,
    energy: f32,
    detection_confidence: f32,
    detection_method: String,    // "energy_threshold", "whisper_seg", etc.
}

/// Embedding vector for a call segment
struct Embedding {
    id: Uuid,
    segment_id: Uuid,
    model_id: String,            // "perch2_v1.0"
    dimensions: u16,             // 1536 for Perch 2.0
    vector: Vec<f32>,
    normalized: bool,
    created_at: DateTime,
}

/// Cluster prototype (centroid of similar calls)
struct Prototype {
    id: Uuid,
    cluster_id: Uuid,
    centroid_vector: Vec<f32>,
    exemplar_ids: Vec<Uuid>,     // Representative segments
    member_count: u32,
    coherence_score: f32,
}

/// Cluster of similar call segments
struct Cluster {
    id: Uuid,
    method: String,              // "hdbscan", "kmeans", etc.
    parameters: HashMap<String, Value>,
    created_at: DateTime,
    validation_score: Option<f32>,
}

/// Optional taxonomic reference
struct Taxon {
    id: Uuid,
    scientific_name: String,
    common_name: String,
    inat_id: Option<u64>,        // iNaturalist ID
    ebird_code: Option<String>,  // eBird species code
}
```
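The `normalized` flag on `Embedding` implies an invariant worth enforcing before vectors reach the index. The sketch below shows one way to check it, assuming nothing beyond the data model above; the function names and the 1e-3 tolerance are illustrative, not part of the codebase.

```rust
/// Pre-insert check: Perch 2.0 vectors must be 1536-D, and when
/// `normalized` is claimed, the L2 norm should be ~1.0 within tolerance.
fn validate_vector(vector: &[f32], expect_normalized: bool) -> Result<(), String> {
    if vector.len() != 1536 {
        return Err(format!("expected 1536 dimensions, got {}", vector.len()));
    }
    if expect_normalized {
        let norm: f32 = vector.iter().map(|x| x * x).sum::<f32>().sqrt();
        if (norm - 1.0).abs() > 1e-3 {
            return Err(format!("vector not L2-normalized (norm = {norm})"));
        }
    }
    Ok(())
}

/// L2-normalize in place, returning the original norm
/// (the job of the `embedding/normalize` module).
fn l2_normalize(vector: &mut [f32]) -> f32 {
    let norm: f32 = vector.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm > 0.0 {
        for x in vector.iter_mut() {
            *x /= norm;
        }
    }
    norm
}
```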
#### Relationships (Graph Edges)

The edge catalog below uses an illustrative schema notation (not Rust syntax):

```
/// Recording contains segments
edge HAS_SEGMENT: Recording -> CallSegment

/// Temporal sequence within recording
edge NEXT: CallSegment -> CallSegment {
    delta_ms: u32,      // Time gap between calls
}

/// Acoustic similarity from HNSW
edge SIMILAR: CallSegment -> CallSegment {
    distance: f32,      // Cosine or Euclidean
    rank: u8,           // kNN rank (1 = nearest)
}

/// Cluster membership
edge ASSIGNED_TO: CallSegment -> Cluster

/// Prototype ownership
edge HAS_PROTOTYPE: Cluster -> Prototype

/// Species identification (when available)
edge IDENTIFIED_AS: CallSegment -> Taxon {
    confidence: f32,
    method: String,     // "manual", "model", "consensus"
}
```
### Processing Pipeline

```
┌──────────────────────────────────────────────────────────────────────┐
│                          INGESTION PIPELINE                          │
├──────────────────────────────────────────────────────────────────────┤
│                                                                      │
│  ┌──────────┐    ┌──────────┐    ┌───────────┐    ┌──────────┐       │
│  │  Audio   │───▶│ Segment  │───▶│    Mel    │───▶│ Perch2.0 │       │
│  │  Input   │    │Detection │    │Spectrogram│    │   ONNX   │       │
│  │(32kHz,5s)│    │          │    │ (500x128) │    │          │       │
│  └──────────┘    └──────────┘    └───────────┘    └────┬─────┘       │
│                                                        ▼             │
│                                                  ┌──────────┐        │
│                                                  │Embedding │        │
│                                                  │ (1536-D) │        │
│                                                  └──────────┘        │
└──────────────────────────────────┬───────────────────────────────────┘
                                   ▼
┌──────────────────────────────────────────────────────────────────────┐
│                            STORAGE LAYER                             │
├──────────────────────────────────────────────────────────────────────┤
│  RuVector                                                            │
│  ┌─────────────┐  ┌───────────────┐  ┌───────────────────────────┐   │
│  │    HNSW     │  │  Graph Store  │  │      Metadata Store       │   │
│  │    Index    │  │   (Edges)     │  │(Recordings, Segments, etc)│   │
│  └─────────────┘  └───────────────┘  └───────────────────────────┘   │
└──────────────────────────────────┬───────────────────────────────────┘
                                   ▼
┌──────────────────────────────────────────────────────────────────────┐
│                            LEARNING LAYER                            │
├──────────────────────────────────────────────────────────────────────┤
│  ┌──────────────┐    ┌──────────────┐    ┌──────────────┐            │
│  │     GNN      │───▶│  Attention   │───▶│  Hyperbolic  │            │
│  │   Reranker   │    │    Layers    │    │  Refinement  │            │
│  │(GCN/GAT/SAGE)│    │              │    │  (Poincare)  │            │
│  └──────────────┘    └──────────────┘    └──────┬───────┘            │
│                                                 ▼                    │
│                                         ┌──────────────┐             │
│                                         │   Refined    │             │
│                                         │  Embeddings  │             │
│                                         └──────────────┘             │
└──────────────────────────────────┬───────────────────────────────────┘
                                   ▼
┌──────────────────────────────────────────────────────────────────────┐
│                            ANALYSIS LAYER                            │
├──────────────────────────────────────────────────────────────────────┤
│  ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐    │
│  │Clustering│ │ Sequence │ │ Anomaly  │ │ Entropy  │ │   RAB    │    │
│  │(HDBSCAN) │ │  Mining  │ │Detection │ │ Metrics  │ │ Evidence │    │
│  └──────────┘ └──────────┘ └──────────┘ └──────────┘ └──────────┘    │
└──────────────────────────────────┬───────────────────────────────────┘
                                   ▼
┌──────────────────────────────────────────────────────────────────────┐
│                          API / PRESENTATION                          │
├──────────────────────────────────────────────────────────────────────┤
│  ┌──────────┐ ┌──────────┐ ┌───────────┐ ┌──────────┐ ┌──────────┐   │
│  │   REST   │ │ GraphQL  │ │ WebSocket │ │   CLI    │ │   WASM   │   │
│  │   API    │ │   API    │ │(Streaming)│ │          │ │ (Browser)│   │
│  └──────────┘ └──────────┘ └───────────┘ └──────────┘ └──────────┘   │
└──────────────────────────────────────────────────────────────────────┘
```
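The Segment Detection stage above (the `"energy_threshold"` method named in `CallSegment.detection_method`) can be sketched in a few lines. This is a minimal illustration, not the project's implementation; the window size, hop, and merging rule are assumptions.

```rust
/// Minimal energy-threshold detector: slide a non-overlapping window over
/// mono samples and return (start, end) sample ranges whose RMS energy
/// exceeds `threshold`, merging contiguous active windows.
fn detect_segments(samples: &[f32], window: usize, threshold: f32) -> Vec<(usize, usize)> {
    let mut segments: Vec<(usize, usize)> = Vec::new();
    let mut i = 0;
    while i + window <= samples.len() {
        let rms = (samples[i..i + window].iter().map(|x| x * x).sum::<f32>()
            / window as f32)
            .sqrt();
        if rms > threshold {
            match segments.last_mut() {
                // Extend the previous segment if this window is contiguous
                Some(last) if last.1 == i => last.1 = i + window,
                _ => segments.push((i, i + window)),
            }
        }
        i += window;
    }
    segments
}
```

A production detector would additionally apply pre-emphasis, a noise-floor estimate, and minimum-duration filtering before handing segments to the spectrogram stage.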
### Key Interfaces Between Modules

```rust
// Audio -> Embedding interface
trait AudioEmbedder {
    fn embed_segment(&self, audio: &AudioSegment) -> Result<Embedding>;
    fn embed_batch(&self, segments: &[AudioSegment]) -> Result<Vec<Embedding>>;
    fn model_info(&self) -> ModelInfo;
}

// Embedding -> VectorDB interface
trait VectorStore {
    fn insert(&mut self, embedding: &Embedding) -> Result<()>;
    fn search_knn(&self, query: &[f32], k: usize) -> Result<Vec<SearchResult>>;
    fn get_neighbors(&self, id: Uuid) -> Result<Vec<Neighbor>>;
    fn build_similarity_edges(&mut self, k: usize) -> Result<usize>;
}

// VectorDB -> Learning interface
trait GraphLearner {
    fn train_step(&mut self, graph: &Graph) -> Result<TrainMetrics>;
    fn refine_embeddings(&self, embeddings: &mut [Embedding]) -> Result<()>;
    fn attention_weights(&self, node_id: Uuid) -> Result<Vec<(Uuid, f32)>>;
}

// Learning -> Analysis interface
trait PatternAnalyzer {
    fn cluster(&self, embeddings: &[Embedding]) -> Result<Vec<Cluster>>;
    fn find_motifs(&self, sequences: &[Sequence]) -> Result<Vec<Motif>>;
    fn compute_entropy(&self, transitions: &TransitionMatrix) -> f32;
}

// Analysis -> RAB interface
trait EvidenceBuilder {
    fn build_pack(&self, query: &Query) -> Result<EvidencePack>;
    fn generate_interpretation(&self, pack: &EvidencePack) -> Result<Interpretation>;
    fn cite_sources(&self, interpretation: &Interpretation) -> Vec<Citation>;
}
```
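To make `PatternAnalyzer::compute_entropy` concrete, here is one plausible definition: Shannon entropy of each row's outgoing transition distribution, weighted by how often that row occurs. The `matrix[i][j]` layout (counts of transitions from call type `i` to `j`) is an assumption; the actual `TransitionMatrix` type may differ.

```rust
/// Weighted Shannon entropy (in bits) of a call-type transition matrix.
/// Returns 0.0 for fully deterministic sequences, log2(k) for uniform ones.
fn transition_entropy(matrix: &[Vec<f32>]) -> f32 {
    let total: f32 = matrix.iter().flatten().sum();
    if total == 0.0 {
        return 0.0;
    }
    let mut entropy = 0.0;
    for row in matrix {
        let row_sum: f32 = row.iter().sum();
        if row_sum == 0.0 {
            continue;
        }
        // Entropy of this row's outgoing distribution
        let row_entropy: f32 = row
            .iter()
            .filter(|&&c| c > 0.0)
            .map(|&c| {
                let p = c / row_sum;
                -p * p.log2()
            })
            .sum();
        // Weight by how much of the data this row accounts for
        entropy += (row_sum / total) * row_entropy;
    }
    entropy
}
```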
### Configuration Structure

```yaml
# sevensense.yaml
sevensense:
  # Audio processing settings
  audio:
    sample_rate: 32000            # Perch 2.0 requirement
    segment_duration_ms: 5000     # 5 seconds
    segment_overlap_ms: 500       # Overlap for continuity
    min_snr_db: 10.0              # Minimum signal-to-noise
    detection_method: "energy"    # or "whisper_seg", "tweety"

  # Embedding generation
  embedding:
    model: "perch2_v1.0"
    onnx_path: "./models/perch2.onnx"
    dimensions: 1536
    normalize: true
    batch_size: 32

  # Vector database (RuVector)
  vectordb:
    index_type: "hnsw"
    hnsw:
      m: 16                       # Connections per node
      ef_construction: 200        # Build-time search width
      ef_search: 100              # Query-time search width
    distance_metric: "cosine"     # or "euclidean", "poincare"
    enable_hyperbolic: false      # Experimental
    compression:
      hot_tier: "none"
      warm_tier: "pq_8"           # Product quantization
      cold_tier: "pq_4"           # Aggressive compression

  # GNN learning
  learning:
    enabled: true
    gnn_type: "gat"               # GCN, GAT, or GraphSAGE
    hidden_dim: 256
    num_layers: 2
    attention_heads: 4
    learning_rate: 0.001
    training_interval_hours: 24

  # Analysis settings
  analysis:
    clustering:
      method: "hdbscan"
      min_cluster_size: 10
      min_samples: 5
    sequence:
      max_gap_ms: 2000            # Max silence between calls
      min_motif_length: 3

  # RAB settings
  rab:
    retrieval_k: 10               # Neighbors to retrieve
    min_confidence: 0.7
    cite_exemplars: true

  # API settings
  api:
    host: "0.0.0.0"
    port: 8080
    enable_graphql: true
    enable_websocket: true
    cors_origins: ["*"]

  # Telemetry
  telemetry:
    log_level: "info"
    metrics_port: 9090
    tracing_enabled: true
    tracing_endpoint: "http://localhost:4317"
```
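A few of these settings constrain each other, so a config loader can catch mistakes early. The arithmetic below is implied by the values above (5000 ms at 32 kHz is exactly one 160,000-sample Perch input window); the function names are illustrative.

```rust
/// Samples per segment: 32000 Hz * 5000 ms / 1000 = 160,000.
fn samples_per_segment(sample_rate_hz: u32, segment_duration_ms: u32) -> u64 {
    sample_rate_hz as u64 * segment_duration_ms as u64 / 1000
}

/// Sanity checks tying the audio settings together.
fn check_audio_config(
    sample_rate_hz: u32,
    segment_ms: u32,
    overlap_ms: u32,
) -> Result<(), String> {
    if sample_rate_hz != 32000 {
        return Err("Perch 2.0 requires sample_rate = 32000".into());
    }
    if overlap_ms >= segment_ms {
        return Err("segment_overlap_ms must be smaller than segment_duration_ms".into());
    }
    Ok(())
}
```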
## Consequences

### Positive Consequences

1. **Development velocity**: a single deployment simplifies CI/CD and local development
2. **Performance**: the critical audio-to-index path has zero network overhead
3. **Debugging**: stack traces span the entire flow; no distributed tracing required initially
4. **Testing**: integration tests run in-process without container orchestration
5. **Scientific reproducibility**: a single binary with pinned dependencies ensures consistent results
6. **Resource efficiency**: shared memory pools and caches across modules
7. **Evolution path**: clear module boundaries allow extraction to services when justified

### Negative Consequences

1. **Scaling limitations**: cannot scale embedding generation independently from query serving
2. **Deployment coupling**: updates to any module require full redeployment
3. **Resource contention**: GNN training may compete with query serving for CPU/memory
4. **Technology constraints**: all modules must work within the Rust ecosystem (mitigated by FFI)

### Mitigation Strategies

| Risk | Mitigation |
|------|------------|
| Scaling limitations | Design async job queues that could become external workers |
| Deployment coupling | Blue-green deployments with health checks |
| Resource contention | Configurable resource limits per module; background training scheduling |
| Technology constraints | ONNX runtime for ML; FFI bindings for specialized libraries |

---
## Related Decisions

- **ADR-002**: Perch 2.0 Integration Strategy (ONNX vs. birdnet-onnx crate)
- **ADR-003**: HNSW vs. Hyperbolic Space Configuration
- **ADR-004**: GNN Training Strategy (Online vs. Batch)
- **ADR-005**: RAB Evidence Pack Schema
- **ADR-006**: API Design (REST/GraphQL/gRPC)

---
## Compliance and Standards

### Scientific Standards

- All embeddings include model version and parameters for reproducibility
- Evidence packs include full retrieval citations per RAB methodology
- Validation metrics align with published benchmarks (V-measure, silhouette scores)

### Data Standards

- Audio metadata follows Darwin Core / TDWG standards where applicable
- Taxonomic references link to iNaturalist and eBird identifiers
- Geospatial data uses WGS84 coordinates

### Security Considerations

- No PII in bioacoustic data (sensor IDs are pseudonymous)
- API authentication via JWT tokens
- Audit logging for all data modifications

---
## References

1. Perch 2.0 paper: "The Bittern Lesson for Bioacoustics" (arXiv:2508.04665)
2. RuVector documentation: https://github.com/ruvnet/ruvector
3. HNSW paper: Malkov & Yashunin, "Efficient and Robust Approximate Nearest Neighbor Search Using Hierarchical Navigable Small World Graphs"
4. RAB pattern: Retrieval-Augmented Bioacoustics methodology
5. AVN deep learning study: "A deep learning approach for the analysis of birdsong" (eLife, 2025)

---

## Revision History

| Version | Date | Author | Changes |
|---------|------|--------|---------|
| 1.0 | 2026-01-15 | 7sense Architecture Team | Initial version |
- vendor/ruvector/examples/vibecast-7sense/docs/adr/ADR-002-ddd-bounded-contexts.md (1341 lines, vendored; diff suppressed because it is too large)
- vendor/ruvector/examples/vibecast-7sense/docs/adr/ADR-003-security-architecture.md (1836 lines, vendored; diff suppressed because it is too large)
- vendor/ruvector/examples/vibecast-7sense/docs/adr/ADR-004-performance-optimization.md (974 lines, vendored):
# ADR-004: Performance Optimization Strategy

## Status

Proposed

## Date

2026-01-15

## Context

7sense is a bioacoustics platform that processes bird call audio using Perch 2.0 embeddings (1536-D vectors from 5-second audio segments at 32kHz), stored in a RuVector-based system with HNSW indexing and GNN learning capabilities. The system must handle:

- **Scale**: 1M+ bird call embeddings with sub-100ms query latency
- **Continuous learning**: GNN refinement without blocking query operations
- **Hierarchical data**: Poincare ball hyperbolic embeddings for species/call-type taxonomies
- **Real-time ingestion**: streaming audio from field sensors

This ADR defines the performance optimization strategy to meet these requirements while maintaining system reliability and cost efficiency.

## Decision

We adopt a multi-layered performance optimization approach covering HNSW tuning, embedding quantization, batch processing, memory management, caching, GNN scheduling, and horizontal scalability.

---
## 1. HNSW Parameter Tuning

### 1.1 Core Parameters

| Parameter | Value | Rationale |
|-----------|-------|-----------|
| **M** (max connections per node) | 32 | Optimal for 1536-D vectors; balances recall vs memory. Higher than the default (16) due to high dimensionality. |
| **efConstruction** | 200 | Build-time search depth. Higher values ensure a quality graph structure for dense embedding spaces. |
| **efSearch** | 128 (default) / 256 (high-recall) | Query-time search depth. Tunable per query based on precision requirements. |
| **maxLevel** | auto (log2(N)/log2(M)) | Automatically determined; ~4 levels for 1M vectors with M=32. |

### 1.2 Dimensionality-Specific Adjustments

```
For 1536-D Perch embeddings:
- Use L2 distance (Euclidean) for normalized vectors
- Consider Product Quantization (PQ) for memory reduction (see Section 2)
- Enable SIMD acceleration (AVX-512 where available)
```

### 1.3 Benchmark Targets

| Metric | Target | Measurement Method |
|--------|--------|-------------------|
| Recall@10 | >= 0.95 | Compare against brute-force ground truth |
| Recall@100 | >= 0.98 | Same |
| Query Latency (p50) | < 10ms | Single-threaded, 1M vectors |
| Query Latency (p99) | < 50ms | Under concurrent load |
| Build Time | < 30 min | For 1M vectors, cold start |

### 1.4 Tuning Protocol

```typescript
interface HNSWTuningConfig {
  // Phase 1: Initial calibration (10K sample)
  calibration: {
    sampleSize: 10000,
    mRange: [16, 24, 32, 48],
    efConstructionRange: [100, 200, 400],
    targetRecall: 0.95
  },

  // Phase 2: Full index build with optimal params
  production: {
    m: 32,                  // Determined from calibration
    efConstruction: 200,
    efSearchDefault: 128,
    efSearchHighRecall: 256
  },

  // Phase 3: Runtime adaptation
  adaptive: {
    enableDynamicEf: true,
    efFloor: 64,
    efCeiling: 512,
    latencyTarget: 50       // ms
  }
}
```

---
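The `maxLevel` heuristic and the per-vector memory figures can be sanity-checked with back-of-envelope arithmetic. This is a sketch written in Rust (the platform language); RuVector's actual level assignment and memory layout may differ.

```rust
/// Expected top level of an HNSW graph: roughly ln(N)/ln(M).
/// For N = 1,000,000 and M = 32 this is about 4.
fn expected_max_level(n: usize, m: usize) -> usize {
    ((n as f64).ln() / (m as f64).ln()).round() as usize
}

/// Raw vector storage at full float32 precision (6,144 bytes for 1536-D).
fn vector_bytes_f32(dims: usize) -> usize {
    dims * std::mem::size_of::<f32>()
}
```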
## 2. Embedding Quantization Strategy

### 2.1 Tiered Storage Architecture

```
HOT TIER (Active Queries)
-------------------------
- Format: float32 (full precision)
- Size: 1536 * 4 = 6,144 bytes/vector
- Capacity: ~100K vectors (600MB RAM)
- Use: Real-time queries, recent recordings

WARM TIER (Frequent Access)
---------------------------
- Format: float16 (half precision)
- Size: 1536 * 2 = 3,072 bytes/vector
- Capacity: ~500K vectors (1.5GB RAM)
- Use: Weekly active data, popular species

COLD TIER (Archive)
-------------------
- Format: int8 (scalar quantization)
- Size: 1536 * 1 = 1,536 bytes/vector
- Capacity: ~2M+ vectors (3GB disk)
- Use: Historical data, rare species
```

### 2.2 Quantization Methods

| Method | Compression | Recall Impact | Use Case |
|--------|-------------|---------------|----------|
| **Scalar (int8)** | 4x | ~2-3% recall loss | Cold storage, bulk search |
| **Product Quantization (PQ)** | 8-16x | ~3-5% recall loss | Very large archives |
| **Binary** | 32x | ~10-15% recall loss | First-pass filtering only |

### 2.3 Scalar Quantization Implementation

```typescript
class ScalarQuantizer {
  // Per-dimension ranges learned during calibration
  private mins = new Float32Array(1536);
  private maxs = new Float32Array(1536);
  private scales = new Float32Array(1536);

  calibrate(embeddings: Float32Array[], sampleSize: number = 10000): void {
    // Sample random embeddings for range estimation
    const sample = this.randomSample(embeddings, sampleSize);

    for (let d = 0; d < 1536; d++) {
      let min = Infinity;
      let max = -Infinity;
      for (const e of sample) {
        if (e[d] < min) min = e[d];
        if (e[d] > max) max = e[d];
      }
      this.mins[d] = min;
      this.maxs[d] = max;
      this.scales[d] = 255 / (max - min || 1); // guard against zero range
    }
  }

  quantize(embedding: Float32Array): Uint8Array {
    const quantized = new Uint8Array(1536);
    for (let d = 0; d < 1536; d++) {
      const normalized = (embedding[d] - this.mins[d]) * this.scales[d];
      quantized[d] = Math.round(Math.max(0, Math.min(255, normalized)));
    }
    return quantized;
  }

  dequantize(quantized: Uint8Array): Float32Array {
    const embedding = new Float32Array(1536);
    for (let d = 0; d < 1536; d++) {
      embedding[d] = quantized[d] / this.scales[d] + this.mins[d];
    }
    return embedding;
  }

  private randomSample(embeddings: Float32Array[], n: number): Float32Array[] {
    if (embeddings.length <= n) return embeddings;
    const picked: Float32Array[] = [];
    for (let i = 0; i < n; i++) {
      picked.push(embeddings[Math.floor(Math.random() * embeddings.length)]);
    }
    return picked;
  }
}
```

### 2.4 Promotion/Demotion Policy

```
PROMOTION (Cold -> Warm -> Hot)
-------------------------------
Trigger: Query frequency > threshold OR explicit prefetch
- Cold -> Warm: 5+ queries in 24h
- Warm -> Hot: 20+ queries in 1h

DEMOTION (Hot -> Warm -> Cold)
------------------------------
Trigger: Time-based decay OR memory pressure
- Hot -> Warm: No queries in 1h
- Warm -> Cold: No queries in 7d
- LRU eviction when tier exceeds capacity
```

---
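The recall-impact numbers for int8 follow from the scheme's error bound: each dimension is mapped affinely onto 0..=255, so the round-trip error per dimension is at most half a quantization step, (max − min) / 255 / 2. A minimal Rust sketch of the same per-dimension scheme (illustrative, not the RuVector implementation):

```rust
/// Map one dimension onto 0..=255 given its calibrated range.
fn quantize_dim(x: f32, min: f32, max: f32) -> u8 {
    let scale = 255.0 / (max - min);
    ((x - min) * scale).round().clamp(0.0, 255.0) as u8
}

/// Inverse affine map; error is bounded by half a step.
fn dequantize_dim(q: u8, min: f32, max: f32) -> f32 {
    let scale = 255.0 / (max - min);
    q as f32 / scale + min
}
```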
## 3. Batch Processing Pipeline
|
||||
|
||||
### 3.1 Audio Ingestion Architecture
|
||||
|
||||
```
|
||||
┌─────────────────────────────────────────────────────────────────────┐
|
||||
│ AUDIO INGESTION PIPELINE │
|
||||
├─────────────────────────────────────────────────────────────────────┤
|
||||
│ │
|
||||
│ [Sensors] ──> [Buffer Queue] ──> [Segment Detector] │
|
||||
│ │ │ │ │
|
||||
│ │ (5min chunks) (5s windows) │
|
||||
│ v v v │
|
||||
│ [Raw Storage] [Batch Accumulator] [Perch Embedder] │
|
||||
│ │ │ │
|
||||
│ (1000 segments) (GPU batch) │
|
||||
│ v v │
|
||||
│ [Embedding Queue] <── [1536-D vectors] │
|
||||
│ │ │
|
||||
│ v │
|
||||
│ [HNSW Batch Insert] │
|
||||
│ │ │
|
||||
│ (async, non-blocking) │
|
||||
│ v │
|
||||
│ [Index + Metadata Store] │
|
||||
│ │
|
||||
└─────────────────────────────────────────────────────────────────────┘
|
||||
```

### 3.2 Batch Sizing Parameters

| Stage | Batch Size | Latency Target | Throughput |
|-------|------------|----------------|------------|
| Audio buffer | 5 min chunks | < 1s queue delay | 100+ streams |
| Segment detection | 100 segments | < 500ms | 1000 segments/s |
| Perch embedding | 64 segments | < 2s GPU | 32 segments/s/GPU |
| HNSW insertion | 1000 vectors | < 100ms | 10K vectors/s |
| Metadata write | 1000 records | < 50ms | 20K records/s |

### 3.3 Backpressure Handling

```typescript
const backpressureConfig = {
  // Queue depth thresholds
  warningThreshold: 10000,    // Start logging warnings
  throttleThreshold: 50000,   // Reduce intake rate
  dropThreshold: 100000,      // Drop lowest-priority data

  // Priority levels for graceful degradation
  priorities: {
    critical: 'endangered_species',       // Never drop
    high: 'known_species_new_recording',  // Drop last
    normal: 'routine_monitoring',         // Standard handling
    low: 'background_noise_samples'       // Drop first
  },

  // Rate limiting
  maxIngestionRate: 10000,    // segments/minute
  burstAllowance: 5000        // temporary overflow
};
```
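
The thresholds and priorities above combine into a per-segment intake decision. A minimal sketch; the action names and exact drop ordering are assumptions, not the implemented policy:

```typescript
type Priority = 'critical' | 'high' | 'normal' | 'low';
type IntakeAction = 'accept' | 'throttle' | 'drop';

// Decide what to do with an incoming segment given the current queue depth.
// Thresholds mirror the config above; 'critical' (endangered species
// recordings) is never dropped.
function intakeDecision(queueDepth: number, priority: Priority): IntakeAction {
  if (queueDepth >= 100000) {
    // Above dropThreshold: shed everything except critical data
    return priority === 'critical' ? 'accept' : 'drop';
  }
  if (queueDepth >= 50000) {
    // Above throttleThreshold: slow intake, shed the lowest priority first
    return priority === 'low' ? 'drop' : 'throttle';
  }
  // Below throttleThreshold (warnings are logged from 10000 up, but intake continues)
  return 'accept';
}
```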

### 3.4 Batch Insert Optimization

```typescript
async function batchInsertEmbeddings(
  embeddings: Float32Array[],
  metadata: EmbeddingMetadata[],
  config: BatchConfig
): Promise<BatchResult> {
  const batchSize = config.batchSize || 1000;
  const results: BatchResult = { inserted: 0, failed: 0, latencyMs: [] };

  // Pair embeddings with metadata, sorted by expected cluster
  // for better cache locality: -> { embedding, metadata }[]
  const sorted = sortByClusterHint(embeddings, metadata);

  for (let i = 0; i < sorted.length; i += batchSize) {
    const batch = sorted.slice(i, i + batchSize);
    const start = performance.now();

    // Parallel insert with connection pooling
    await Promise.all([
      hnswIndex.batchAdd(batch.map(b => b.embedding)),
      metadataStore.batchInsert(batch.map(b => b.metadata))
    ]);

    results.latencyMs.push(performance.now() - start);
    results.inserted += batch.length;
  }

  return results;
}
```

---

## 4. Memory Management

### 4.1 Streaming vs Batch Trade-offs

| Mode | Memory Footprint | Latency | Use Case |
|------|------------------|---------|----------|
| **Streaming** | O(window_size) ~50MB | Real-time (<1s) | Live monitoring |
| **Micro-batch** | O(batch_size) ~200MB | Near-real-time (<5s) | Standard ingestion |
| **Batch** | O(full_batch) ~2GB | Minutes | Bulk historical import |

### 4.2 Memory Budget Allocation

```
TOTAL MEMORY BUDGET: 16GB (single node)
=======================================

HNSW Index (Hot):     4GB (25%)
  - ~650K float32 vectors
  - Navigation structure overhead

Embedding Cache:      3GB (19%)
  - LRU cache for frequent queries
  - Warm tier spillover

GNN Model:            2GB (12%)
  - Model parameters
  - Gradient buffers
  - Activation cache

Query Buffers:        2GB (12%)
  - Concurrent query working memory
  - Result aggregation

Ingestion Pipeline:   2GB (12%)
  - Audio processing buffers
  - Batch accumulation

Metadata/Index:       2GB (12%)
  - SQLite/RocksDB buffers
  - B-tree indices

OS/Overhead:          1GB (6%)
  - System requirements
  - Safety margin
```
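
As a sanity check on the "~650K float32 vectors" figure: a 1536-D float32 vector occupies 6,144 bytes, so a 4 GiB budget holds roughly 699K raw vectors, and ~650K once ~7% of the budget is reserved for HNSW navigation structures. A quick calculation (the overhead fraction is an illustrative assumption):

```typescript
const EMBEDDING_DIM = 1536;
const BYTES_PER_FLOAT = 4; // float32

// Estimate how many float32 vectors fit in a memory budget,
// reserving a fraction for HNSW graph/navigation overhead.
function vectorCapacity(budgetBytes: number, overheadFraction: number): number {
  const usable = budgetBytes * (1 - overheadFraction);
  return Math.floor(usable / (EMBEDDING_DIM * BYTES_PER_FLOAT));
}

const hotBudget = 4 * 1024 ** 3; // 4 GiB hot-tier budget
// vectorCapacity(hotBudget, 0)    -> 699050 raw vectors
// vectorCapacity(hotBudget, 0.25) -> 524288 with a pessimistic 25% overhead
```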

### 4.3 Memory Pressure Response

```typescript
const memoryManager = {
  thresholds: {
    warning: 0.75,     // 75% utilization
    critical: 0.90,    // 90% utilization
    emergency: 0.95    // 95% utilization
  },

  responses: {
    warning: [
      'reduce_batch_sizes',
      'increase_demotion_rate',
      'log_memory_profile'
    ],
    critical: [
      'pause_gnn_training',
      'aggressive_cache_eviction',
      'reject_low_priority_queries'
    ],
    emergency: [
      'stop_ingestion',
      'force_checkpoint',
      'alert_operations'
    ]
  }
};
```
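
The thresholds map to response levels with a simple dispatcher. A sketch; the response action names mirror the config above:

```typescript
type PressureLevel = 'normal' | 'warning' | 'critical' | 'emergency';

// Map current memory utilization (0.0-1.0) to a response level,
// using the thresholds from the config above.
function pressureLevel(utilization: number): PressureLevel {
  if (utilization >= 0.95) return 'emergency';
  if (utilization >= 0.90) return 'critical';
  if (utilization >= 0.75) return 'warning';
  return 'normal';
}

// Actions to dispatch at each level ('normal' requires none)
const pressureResponses: Record<PressureLevel, string[]> = {
  normal: [],
  warning: ['reduce_batch_sizes', 'increase_demotion_rate', 'log_memory_profile'],
  critical: ['pause_gnn_training', 'aggressive_cache_eviction', 'reject_low_priority_queries'],
  emergency: ['stop_ingestion', 'force_checkpoint', 'alert_operations']
};
```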

### 4.4 Zero-Copy Optimizations

```typescript
// Use memory-mapped files for large read-only data
// (illustrative: Node.js has no built-in mmap; a native binding is assumed)
const coldTierIndex = mmap('/data/cold_embeddings.bin', {
  mode: 'readonly',
  advice: MADV_RANDOM  // Optimize for random access
});

// Share embedding buffers between query threads
const sharedQueryBuffer = new SharedArrayBuffer(
  QUERY_BATCH_SIZE * EMBEDDING_DIM * 4
);

// Avoid copies in pipeline stages
function processSegment(audio: Float32Array): EmbeddingResult {
  // Pass views, not copies
  const spectrogram = computeMelSpectrogram(audio.subarray(0, WINDOW_SIZE));
  const embedding = perchModel.embed(spectrogram); // Returns view
  return { embedding, metadata: extractMetadata(audio) };
}
```

---

## 5. Caching Strategy

### 5.1 Multi-Level Cache Architecture

```
┌────────────────────────────────────────────────────────────────┐
│                        CACHE HIERARCHY                         │
├────────────────────────────────────────────────────────────────┤
│                                                                │
│  L1: Query Result Cache (100MB)                                │
│  ├── Key: hash(query_embedding + search_params)                │
│  ├── TTL: 5 minutes                                            │
│  ├── Hit Rate Target: 40%+ for repeated queries                │
│  └── Eviction: LRU with frequency boost                        │
│                                                                │
│  L2: Nearest Neighbor Cache (500MB)                            │
│  ├── Key: embedding_id                                         │
│  ├── Value: precomputed k-NN list                              │
│  ├── TTL: 1 hour (invalidate on index update)                  │
│  └── Hit Rate Target: 60%+ for hot embeddings                  │
│                                                                │
│  L3: Cluster Centroid Cache (200MB)                            │
│  ├── Key: cluster_id                                           │
│  ├── Value: centroid + exemplar embeddings                     │
│  ├── TTL: 24 hours                                             │
│  └── Use: Fast cluster assignment for new embeddings           │
│                                                                │
│  L4: Metadata Cache (300MB)                                    │
│  ├── Key: embedding_id                                         │
│  ├── Value: species, location, timestamp, etc.                 │
│  ├── TTL: None (invalidate on update)                          │
│  └── Hit Rate Target: 90%+ (frequently accessed)               │
│                                                                │
└────────────────────────────────────────────────────────────────┘
```
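
The L1 key, `hash(query_embedding + search_params)`, can be built with any stable hash over a coarsely quantized embedding, so numerically identical queries collide on purpose. A sketch using FNV-1a; the quantization step and hash choice are illustrative assumptions:

```typescript
// Build a deterministic L1 cache key from a query embedding and its
// search parameters. The embedding is quantized to int8-style buckets
// first so bitwise-identical queries always map to the same key.
function queryCacheKey(
  embedding: Float32Array,
  params: { k: number; ef: number }
): string {
  let h = 0x811c9dc5; // FNV-1a 32-bit offset basis
  const mix = (byte: number): void => {
    h ^= byte & 0xff;
    h = Math.imul(h, 0x01000193) >>> 0; // FNV prime, kept as uint32
  };
  for (const v of embedding) {
    mix(Math.round(v * 127)); // coarse quantization before hashing
  }
  for (const c of JSON.stringify(params)) {
    mix(c.charCodeAt(0));
  }
  return h.toString(16);
}
```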

### 5.2 Cache Warming Strategy

```typescript
const cacheWarmingConfig = {
  // Startup warming
  startup: {
    // Load most queried embeddings from past 24h
    recentQueryEmbeddings: 10000,
    // Load all cluster centroids
    clusterCentroids: 'all',
    // Load endangered species data
    prioritySpecies: ['species_list_from_config']
  },

  // Predictive warming
  predictive: {
    // Time-based patterns
    schedules: [
      { time: '05:00', action: 'warm_dawn_chorus_species' },
      { time: '19:00', action: 'warm_dusk_species' }
    ],
    // Geographic patterns
    sensorActivation: {
      triggerRadius: '50km',
      preloadNeighborSites: true
    }
  },

  // Query-driven warming
  queryDriven: {
    // On any query, prefetch neighbors' neighbors
    prefetchDepth: 2,
    prefetchCount: 10
  }
};
```

### 5.3 Cache Invalidation

```typescript
// Event-driven invalidation
const cacheInvalidator = {
  onEmbeddingInsert(id: string, embedding: Float32Array): void {
    // Invalidate affected NN caches
    const affectedNeighbors = hnswIndex.getNeighbors(id, 50);
    affectedNeighbors.forEach(nid => nnCache.invalidate(nid));

    // Invalidate cluster centroid if significantly different
    const cluster = clusterAssignment.get(id);
    if (cluster && distanceFromCentroid(embedding, cluster) > threshold) {
      centroidCache.invalidate(cluster.id);
    }
  },

  onClusterUpdate(clusterId: string): void {
    // Invalidate centroid and all member NN caches
    centroidCache.invalidate(clusterId);
    const members = clusterMembers.get(clusterId) ?? [];
    members.forEach(mid => nnCache.invalidate(mid));
  },

  onGNNTrainingComplete(): void {
    // Embeddings may have shifted - invalidate distance-based caches
    nnCache.clear();
    queryCache.clear();
    // Centroid cache can remain (recomputed lazily)
  }
};
```

---

## 6. GNN Training Schedule

### 6.1 Training Architecture

```
┌─────────────────────────────────────────────────────────────────────┐
│                        GNN TRAINING PIPELINE                        │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│  ONLINE LEARNING (Continuous)                                       │
│  ├── Trigger: Every 1000 new embeddings                             │
│  ├── Scope: Local neighborhood refinement                           │
│  ├── Duration: < 100ms (non-blocking)                               │
│  └── Method: Single GNN message-passing step                        │
│                                                                     │
│  INCREMENTAL TRAINING (Scheduled)                                   │
│  ├── Trigger: Hourly (off-peak) or 10K new embeddings               │
│  ├── Scope: Updated subgraph (new nodes + 2-hop neighbors)          │
│  ├── Duration: 1-5 minutes                                          │
│  └── Method: 3-5 GNN epochs on affected subgraph                    │
│                                                                     │
│  FULL RETRAINING (Periodic)                                         │
│  ├── Trigger: Weekly (Sunday 02:00-06:00) or manual                 │
│  ├── Scope: Entire graph                                            │
│  ├── Duration: 1-4 hours                                            │
│  └── Method: Full GNN training with hyperparameter tuning           │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘
```

### 6.2 Non-Blocking Training Protocol

```typescript
class GNNTrainingScheduler {
  private queryPriorityLock = new AsyncLock();

  async onlineUpdate(newEmbeddings: EmbeddingBatch): Promise<void> {
    // Non-blocking: runs in background, doesn't affect queries
    setImmediate(async () => {
      const subgraph = this.extractLocalSubgraph(newEmbeddings);
      await this.gnn.singleStep(subgraph);
    });
  }

  async incrementalTrain(): Promise<TrainingResult> {
    // Acquire read lock (queries continue, writes pause)
    await this.queryPriorityLock.acquireRead();

    try {
      const updatedSubgraph = this.getUpdatedSubgraph();
      const result = await this.gnn.train(updatedSubgraph, {
        epochs: 5,
        earlyStop: { patience: 2, minDelta: 0.001 }
      });

      // Apply updates atomically
      await this.applyEmbeddingUpdates(result.refinedEmbeddings);
      return result;
    } finally {
      this.queryPriorityLock.release();
    }
  }

  async fullRetrain(): Promise<TrainingResult> {
    // Acquire write lock (pause all operations)
    await this.queryPriorityLock.acquireWrite();

    try {
      // Checkpoint current state for rollback
      await this.checkpoint();

      const result = await this.gnn.fullTrain({
        epochs: 50,
        learningRate: 0.001,
        earlyStop: { patience: 10, minDelta: 0.0001 }
      });

      // Validate before applying
      if (result.validationRecall < 0.90) {
        await this.rollback();
        throw new Error('Training degraded recall, rolled back');
      }

      await this.applyFullUpdate(result);
      return result;
    } finally {
      this.queryPriorityLock.release();
    }
  }
}
```

### 6.3 Training Resource Allocation

```
OFF-PEAK (02:00-06:00 local time)
---------------------------------
- Full retraining allowed
- 100% GPU utilization
- Query latency SLA relaxed to 200ms

PEAK HOURS (06:00-22:00)
------------------------
- Online updates only
- GPU limited to 20% for training
- Query latency SLA: 50ms p99

TRANSITION PERIODS
------------------
- Incremental training allowed
- GPU limited to 50% for training
- Query latency SLA: 100ms p99
```

---

## 7. Benchmarking Framework

### 7.1 Benchmark Suite

```typescript
const benchmarkSuite = {
  // Core HNSW benchmarks
  hnsw: {
    insertThroughput: {
      description: 'Vectors inserted per second',
      target: '>= 10,000 vectors/s',
      dataset: '1M random 1536-D vectors'
    },
    queryLatency: {
      description: 'Single query latency distribution',
      targets: {
        p50: '<= 10ms',
        p95: '<= 30ms',
        p99: '<= 50ms'
      },
      dataset: '1M indexed, 10K queries'
    },
    recallAtK: {
      description: 'Recall compared to brute force',
      targets: {
        recall10: '>= 0.95',
        recall100: '>= 0.98'
      }
    },
    concurrentQueries: {
      description: 'Throughput under concurrent load',
      target: '>= 1,000 QPS at p99 < 100ms',
      concurrency: [1, 10, 50, 100, 200]
    }
  },

  // End-to-end pipeline benchmarks
  pipeline: {
    audioToEmbedding: {
      description: 'Full audio processing latency',
      target: '<= 200ms per 5s segment',
      includeIO: true
    },
    ingestionThroughput: {
      description: 'Sustained ingestion rate',
      target: '>= 100 segments/second',
      duration: '1 hour'
    },
    queryWithMetadata: {
      description: 'Query + metadata fetch',
      target: '<= 75ms p99'
    }
  },

  // GNN-specific benchmarks
  gnn: {
    onlineUpdateLatency: {
      description: 'Single-step GNN update',
      target: '<= 100ms for 1K node subgraph'
    },
    incrementalTrainTime: {
      description: 'Hourly incremental training',
      target: '<= 5 minutes for 10K updates'
    },
    recallImprovement: {
      description: 'Recall gain from GNN refinement',
      target: '>= 2% improvement over baseline HNSW'
    }
  },

  // Memory benchmarks
  memory: {
    indexMemoryPerVector: {
      description: 'Memory per indexed vector',
      target: '<= 8KB (including overhead)'
    },
    cacheHitRate: {
      description: 'Cache effectiveness',
      targets: {
        queryCache: '>= 30%',
        nnCache: '>= 50%',
        metadataCache: '>= 80%'
      }
    },
    quantizationRecallLoss: {
      description: 'Recall loss from int8 quantization',
      target: '<= 3%'
    }
  }
};
```

### 7.2 SLA Definitions

| Operation | p50 | p95 | p99 | p99.9 |
|-----------|-----|-----|-----|-------|
| kNN Query (k=10) | 5ms | 20ms | 50ms | 100ms |
| kNN Query (k=100) | 10ms | 40ms | 80ms | 150ms |
| Range Query (r<0.5) | 15ms | 50ms | 100ms | 200ms |
| Insert Single | 1ms | 5ms | 10ms | 20ms |
| Batch Insert (1000) | 50ms | 100ms | 200ms | 500ms |
| Cluster Assignment | 20ms | 50ms | 100ms | 200ms |
| Full Pipeline (audio->result) | 200ms | 500ms | 1000ms | 2000ms |

### 7.3 Continuous Benchmarking

```yaml
# .github/workflows/performance.yml
benchmark_schedule:
  nightly:
    - hnsw_insert_throughput
    - hnsw_query_latency
    - hnsw_recall

  weekly:
    - full_pipeline_benchmark
    - gnn_training_benchmark
    - memory_pressure_test

  on_release:
    - all_benchmarks
    - scalability_test_10M
    - longevity_test_24h

regression_thresholds:
  latency_increase: 10%      # Alert if p99 increases by 10%
  throughput_decrease: 5%    # Alert if QPS drops by 5%
  recall_decrease: 1%        # Alert if recall drops by 1%
```

---

## 8. Horizontal Scalability

### 8.1 Sharding Strategy

```
SHARDING APPROACH: Geographic + Temporal Hybrid
===============================================

Primary Shard Key: Geographic Region (sensor cluster)
- Shard 0: North America West
- Shard 1: North America East
- Shard 2: Europe
- Shard 3: Asia-Pacific
- Shard 4: South America
- Shard 5: Africa

Secondary Partition: Temporal (within shard)
- Hot: Current month
- Warm: Past 12 months
- Cold: Archive (>12 months)

Cross-Shard Queries:
- Use scatter-gather pattern
- Merge results by distance
- Timeout per shard: 100ms
```
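
Routing by the primary shard key can be sketched as a static region map with a stable hash fallback. The region names follow the table above; the fallback hashing is an assumption for illustration, not the actual consistent-hash router:

```typescript
// Static geographic shard assignment (primary shard key)
const REGION_SHARDS: Record<string, number> = {
  'north-america-west': 0,
  'north-america-east': 1,
  'europe': 2,
  'asia-pacific': 3,
  'south-america': 4,
  'africa': 5
};

// Resolve the shard for a recording: known regions use the static map;
// anything else falls back to a deterministic hash of the sensor id.
function shardFor(region: string, sensorId: string, numShards = 6): number {
  const fixed = REGION_SHARDS[region];
  if (fixed !== undefined) return fixed;

  let h = 0;
  for (const c of sensorId) {
    h = (Math.imul(h, 31) + c.charCodeAt(0)) >>> 0;
  }
  return h % numShards;
}
```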

### 8.2 Cluster Architecture

```
┌─────────────────────────────────────────────────────────────────────┐
│                      DISTRIBUTED ARCHITECTURE                       │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│   ┌─────────────┐   ┌─────────────┐   ┌─────────────┐               │
│   │  Query LB   │   │  Query LB   │   │  Query LB   │               │
│   └──────┬──────┘   └──────┬──────┘   └──────┬──────┘               │
│          │                 │                 │                      │
│          v                 v                 v                      │
│   ┌─────────────────────────────────────────────────────┐           │
│   │          Query Router (Consistent Hash)             │           │
│   └─────────────────────────┬───────────────────────────┘           │
│                             │                                       │
│      ┌──────────────────────┼──────────────────────┐                │
│      │                      │                      │                │
│      v                      v                      v                │
│ ┌─────────┐            ┌─────────┐            ┌─────────┐           │
│ │ Shard 0 │            │ Shard 1 │            │ Shard N │           │
│ │ (3 rep) │            │ (3 rep) │            │ (3 rep) │           │
│ └────┬────┘            └────┬────┘            └────┬────┘           │
│      │                      │                      │                │
│      v                      v                      v                │
│ ┌─────────┐            ┌─────────┐            ┌─────────┐           │
│ │  HNSW   │            │  HNSW   │            │  HNSW   │           │
│ │  Index  │            │  Index  │            │  Index  │           │
│ └─────────┘            └─────────┘            └─────────┘           │
│                                                                     │
│   ┌─────────────────────────────────────────────────────┐           │
│   │        Shared Metadata Store (Distributed)          │           │
│   └─────────────────────────────────────────────────────┘           │
│                                                                     │
│   ┌─────────────────────────────────────────────────────┐           │
│   │       GNN Training Coordinator (Async Gossip)       │           │
│   └─────────────────────────────────────────────────────┘           │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘
```

### 8.3 Scaling Thresholds

| Metric | Scale-Out Trigger | Scale-In Trigger |
|--------|-------------------|------------------|
| Query Latency p99 | > 80ms sustained 5min | < 30ms sustained 1h |
| CPU Utilization | > 70% sustained 5min | < 30% sustained 1h |
| Memory Utilization | > 80% | < 50% sustained 1h |
| Queue Depth | > 10K pending | < 1K sustained 30min |
| Shard Size | > 500K vectors | N/A (don't scale in) |

### 8.4 Cross-Shard Query Protocol

```typescript
async function globalKnnQuery(
  query: Float32Array,
  k: number,
  options: QueryOptions
): Promise<SearchResult[]> {
  const shards = options.shards || getAllShards();
  const perShardK = Math.ceil(k * 1.5); // Over-fetch for merge

  // Scatter phase
  const shardPromises = shards.map(shard =>
    queryShardWithTimeout(shard, query, perShardK, options.timeout || 100)
  );

  // Gather phase with partial results on timeout
  const results = await Promise.allSettled(shardPromises);

  // Merge and re-rank (type predicate narrows to fulfilled results)
  const allResults = results
    .filter((r): r is PromiseFulfilledResult<SearchResult[]> => r.status === 'fulfilled')
    .flatMap(r => r.value);

  // Sort by distance and take top-k
  allResults.sort((a, b) => a.distance - b.distance);
  return allResults.slice(0, k);
}
```

---

## 9. Latency Budget Breakdown

### 9.1 Query Path Latency Budget

```
TOTAL BUDGET: 50ms (p99 target)
===============================

┌───────────────────────────────────────────────────┐
│ Component                  │ Budget │ % of Total  │
├───────────────────────────────────────────────────┤
│ Network (client -> LB)     │  5ms   │  10%        │
│ Load Balancer routing      │  1ms   │   2%        │
│ Query parsing/validation   │  1ms   │   2%        │
│ Cache lookup (L1-L4)       │  3ms   │   6%        │
│ HNSW search (k=10)         │ 25ms   │  50%        │
│ Metadata fetch             │  5ms   │  10%        │
│ Result serialization       │  2ms   │   4%        │
│ Network (LB -> client)     │  5ms   │  10%        │
│ Buffer/headroom            │  3ms   │   6%        │
├───────────────────────────────────────────────────┤
│ TOTAL                      │ 50ms   │ 100%        │
└───────────────────────────────────────────────────┘
```
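
A budget table like this is easy to let drift as components change; keeping it as data with an asserted total guards against that (illustrative):

```typescript
// Query-path latency budget in ms per component; the entries mirror
// the table above and must sum to the 50ms p99 target.
const queryBudgetMs = {
  networkIn: 5,      // Network (client -> LB)
  loadBalancer: 1,   // Load Balancer routing
  parsing: 1,        // Query parsing/validation
  cacheLookup: 3,    // Cache lookup (L1-L4)
  hnswSearch: 25,    // HNSW search (k=10)
  metadataFetch: 5,  // Metadata fetch
  serialization: 2,  // Result serialization
  networkOut: 5,     // Network (LB -> client)
  headroom: 3        // Buffer/headroom
};

const totalMs = Object.values(queryBudgetMs).reduce((a, b) => a + b, 0);
// totalMs === 50, matching the p99 target
```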

### 9.2 Ingestion Path Latency Budget

```
TOTAL BUDGET: 200ms (p99 target for single segment)
===================================================

┌───────────────────────────────────────────────────┐
│ Component                  │ Budget │ % of Total  │
├───────────────────────────────────────────────────┤
│ Audio receive/decode       │  10ms  │   5%        │
│ Mel spectrogram compute    │  20ms  │  10%        │
│ Perch model inference      │  80ms  │  40%        │
│ Embedding normalization    │   5ms  │   2.5%      │
│ HNSW insertion             │  20ms  │  10%        │
│ Metadata write             │  10ms  │   5%        │
│ Cache invalidation         │  10ms  │   5%        │
│ Acknowledgment             │   5ms  │   2.5%      │
│ Buffer/headroom            │  40ms  │  20%        │
├───────────────────────────────────────────────────┤
│ TOTAL                      │ 200ms  │ 100%        │
└───────────────────────────────────────────────────┘
```

### 9.3 GNN Training Latency Constraints

```
ONLINE UPDATE: 100ms max (non-blocking)
---------------------------------------
- Subgraph extraction: 20ms
- Single GNN forward pass: 50ms
- Embedding update (async): 30ms

INCREMENTAL TRAINING: 5 min max
-------------------------------
- Subgraph construction: 30s
- Training (5 epochs): 4 min
- Embedding sync: 30s

FULL RETRAINING: 4 hour max
---------------------------
- Graph snapshot: 10 min
- Training (50 epochs): 3.5 hours
- Validation: 10 min
- Cutover: 10 min
```

---

## Consequences

### Positive

- **Sub-100ms query latency** achieved through HNSW tuning and multi-level caching
- **4x storage reduction** for cold data via int8 scalar quantization
- **Non-blocking GNN learning** enables continuous improvement without query degradation
- **Linear horizontal scaling** via geographic sharding
- **Clear SLAs** enable capacity planning and alerting

### Negative

- **Increased operational complexity** from multi-tier storage and distributed architecture
- **Memory overhead** from caching layers (~1.1GB dedicated to caches)
- **Quantization recall loss** of 2-3% for cold tier data
- **Cross-shard query overhead** adds latency for global searches

### Neutral

- **Trade-off flexibility** allows tuning precision vs. latency per use case
- **Benchmark-driven development** requires ongoing measurement infrastructure

---

## Implementation Phases

### Phase 1: Foundation (Weeks 1-2)
- Implement HNSW with tuned parameters
- Set up benchmark suite
- Establish baseline metrics

### Phase 2: Optimization (Weeks 3-4)
- Implement scalar quantization
- Add multi-level caching
- Optimize batch ingestion pipeline

### Phase 3: Learning (Weeks 5-6)
- Integrate GNN training scheduler
- Implement non-blocking updates
- Validate recall improvements

### Phase 4: Scale (Weeks 7-8)
- Implement sharding layer
- Deploy distributed architecture
- Load test at 1M+ vectors

---

## References

- Perch 2.0: https://arxiv.org/abs/2508.04665
- RuVector: https://github.com/ruvnet/ruvector
- HNSW Paper: Malkov & Yashunin, 2018
- Product Quantization: Jegou et al., 2011
- Graph Attention Networks: Velickovic et al., 2018

---

*File: vendor/ruvector/examples/vibecast-7sense/docs/adr/ADR-005-self-learning-hooks.md*

# ADR-005: Self-Learning and Hooks Integration

## Status

Proposed

## Date

2026-01-15

## Context

7sense processes bioacoustic data through Perch 2.0 embeddings (1536-D vectors) stored in RuVector with HNSW indexing. To maximize the value of this acoustic geometry, we need a self-learning system that:

1. Continuously improves retrieval quality based on user feedback
2. Discovers and consolidates successful clustering configurations
3. Learns species-specific embedding characteristics over time
4. Prevents catastrophic forgetting when adapting to new domains (marine vs avian vs terrestrial)

RuVector includes a built-in GNN layer designed for index self-improvement, and the claude-flow framework provides a comprehensive hooks system with 27 hooks and 12 background workers that can orchestrate continuous learning pipelines.

## Decision

We will implement a four-stage learning loop architecture integrated with claude-flow hooks, utilizing SONA (Self-Optimizing Neural Architecture) patterns and EWC++ (Elastic Weight Consolidation) for continual learning without forgetting.

### Learning Loop Architecture

```
+-------------------+     +------------------+     +-------------------+     +---------------------+
|     RETRIEVE      | --> |      JUDGE       | --> |      DISTILL      | --> |     CONSOLIDATE     |
| (HNSW + Pattern)  |     | (Verdict System) |     | (LoRA Fine-tune)  |     | (EWC++ Integration) |
+-------------------+     +------------------+     +-------------------+     +---------------------+
          ^                                                                            |
          |                                                                            |
          +----------------------------------------------------------------------------+
                                   Continuous Feedback Loop
```

#### Stage 1: RETRIEVE

Fetch relevant patterns from the ReasoningBank using HNSW-indexed vector search:

```bash
# Search for similar bioacoustic analysis patterns
npx @claude-flow/cli@latest memory search \
  --query "whale song clustering high-frequency harmonics" \
  --namespace patterns \
  --limit 5 \
  --threshold 0.7

# Retrieve species-specific embedding characteristics
npx @claude-flow/cli@latest hooks intelligence pattern-search \
  --query "humpback whale vocalization" \
  --namespace species \
  --top-k 3
```

Performance characteristics:

- HNSW retrieval: 150x-12,500x faster than brute force
- Pattern matching: 761 decisions/sec
- Sub-millisecond adaptation via SONA

#### Stage 2: JUDGE

Evaluate retrieved patterns with a verdict system that scores relevance and success:

```typescript
interface BioacousticVerdict {
  pattern_id: string;
  task_type: 'clustering' | 'motif_discovery' | 'species_identification' | 'anomaly_detection';
  verdict: 'success' | 'partial' | 'failure';
  confidence: number; // 0.0 - 1.0
  metrics: {
    silhouette_score?: number;       // For clustering
    retrieval_precision?: number;    // For search quality
    user_correction_rate?: number;   // For feedback integration
    snr_threshold_effectiveness?: number;
  };
  feedback_source: 'automatic' | 'user_correction' | 'expert_annotation';
}
```

Verdict aggregation rules:

- Success (confidence >= 0.85): Promote pattern to long-term memory
- Partial (0.5 <= confidence < 0.85): Mark for refinement
- Failure (confidence < 0.5): Demote or archive with failure context
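
The aggregation rules can be written directly as a function. Boundaries use `>=` so every confidence value maps to exactly one verdict; the disposition names are shorthand for the actions above:

```typescript
type Verdict = 'success' | 'partial' | 'failure';
type Disposition = 'promote' | 'refine' | 'demote';

// Map a confidence score to its verdict and resulting memory action,
// per the aggregation rules above.
function aggregateVerdict(confidence: number): { verdict: Verdict; disposition: Disposition } {
  if (confidence >= 0.85) return { verdict: 'success', disposition: 'promote' };
  if (confidence >= 0.5) return { verdict: 'partial', disposition: 'refine' };
  return { verdict: 'failure', disposition: 'demote' };
}
```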

#### Stage 3: DISTILL

Extract key learnings via LoRA (Low-Rank Adaptation) fine-tuning:

```bash
# Train neural patterns on successful bioacoustic analysis
npx @claude-flow/cli@latest hooks intelligence trajectory-start \
  --task "clustering whale songs by call type" \
  --agent "bioacoustic-analyzer"

# Record analysis steps
npx @claude-flow/cli@latest hooks intelligence trajectory-step \
  --trajectory-id "$TRAJ_ID" \
  --action "applied hierarchical clustering with ward linkage" \
  --result "silhouette score 0.78" \
  --quality 0.85

# Complete trajectory with success
npx @claude-flow/cli@latest hooks intelligence trajectory-end \
  --trajectory-id "$TRAJ_ID" \
  --success true \
  --feedback "user confirmed 23/25 clusters as valid call types"
```

LoRA benefits for bioacoustics:

- 99% parameter reduction (critical for edge deployment on field sensors)
- 10-100x faster training than full fine-tuning
- Minimal memory footprint for continuous learning

#### Stage 4: CONSOLIDATE

Prevent catastrophic forgetting via EWC++ when learning new domains:

```bash
# Force SONA learning cycle with EWC++ consolidation
npx @claude-flow/cli@latest hooks intelligence learn \
  --consolidate true \
  --trajectory-ids "$WHALE_TRAJ,$BIRD_TRAJ,$INSECT_TRAJ"
```

EWC++ strategy for bioacoustics:

- Compute Fisher information matrix for critical embedding dimensions
- Penalize changes to weights important for existing species recognition
- Allow plasticity for new acoustic domains (marine -> avian -> terrestrial)
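
For reference, the classic EWC penalty that consolidation relies on is (lambda / 2) * sum_i F_i * (w_i - w*_i)^2, where F_i is the Fisher information for weight i and w*_i the anchor weight after the previous task. A minimal sketch of that term; the EWC++ variant used by claude-flow additionally maintains F online, which is not shown here:

```typescript
// Elastic Weight Consolidation penalty: dimensions with high Fisher
// information (important for existing species recognition) are expensive
// to move; low-Fisher dimensions remain plastic for new acoustic domains.
function ewcPenalty(
  weights: number[],
  anchorWeights: number[],
  fisher: number[],
  lambda: number
): number {
  let penalty = 0;
  for (let i = 0; i < weights.length; i++) {
    const d = weights[i] - anchorWeights[i];
    penalty += fisher[i] * d * d;
  }
  return (lambda / 2) * penalty;
}
```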

### Claude-Flow Hooks Integration

#### Pre-Task Hook: Route Bioacoustic Analysis Tasks

The `pre-task` hook routes incoming analysis requests to optimal processing paths:

```bash
# Before starting any bioacoustic analysis
npx @claude-flow/cli@latest hooks pre-task \
  --task-id "analysis-$(date +%s)" \
  --description "cluster humpback whale songs from Pacific Northwest dataset"
```

Routing decisions based on task characteristics:

| Task Type | Recommended Agent | Model Tier | Rationale |
|-----------|-------------------|------------|-----------|
| Simple retrieval | retrieval-agent | Haiku | Fast kNN lookup |
| Clustering | clustering-specialist | Sonnet | Algorithm selection |
| Motif discovery | sequence-analyzer | Sonnet | Temporal pattern analysis |
| Cross-species analysis | bioacoustic-expert | Opus | Complex reasoning |
| Anomaly detection | anomaly-detector | Haiku | Real-time processing |
| Embedding refinement | ml-specialist | Opus | Architecture decisions |

Pre-task also retrieves relevant patterns:

```bash
# Get routing recommendation with pattern retrieval
npx @claude-flow/cli@latest hooks route \
  --task "identify dialect variations in orca pod communications" \
  --context "Pacific Northwest, 2024 field recordings"
```

Output includes:

- Recommended agent type and model tier
- Top-3 similar successful patterns from memory
- Suggested HNSW parameters based on past success
- Estimated confidence and processing time

#### Post-Task Hook: Store Successful Patterns

After successful analysis, store the pattern for future retrieval:

```bash
# Record task completion
npx @claude-flow/cli@latest hooks post-task \
  --task-id "$TASK_ID" \
  --success true \
  --agent "clustering-specialist" \
  --quality 0.92

# Store the successful pattern
npx @claude-flow/cli@latest memory store \
  --namespace patterns \
  --key "whale-clustering-hierarchical-ward-2026-01" \
  --value '{
    "task_type": "clustering",
    "species_group": "cetacean",
    "algorithm": "hierarchical",
    "linkage": "ward",
    "distance_metric": "cosine",
    "min_cluster_size": 5,
    "silhouette_score": 0.78,
    "num_clusters_discovered": 23,
    "snr_threshold": 15,
    "embedding_preprocessing": "l2_normalize",
    "hnsw_params": {"ef_construction": 200, "M": 32}
  }'

# Train neural patterns on the success
npx @claude-flow/cli@latest hooks post-edit \
  --file "analysis-results.json" \
  --success true \
  --train-neural true
```

#### Pre-Edit Hook: Context for Embedding Refinement

Before modifying embedding configurations or HNSW parameters:

```bash
# Get context before editing embedding pipeline
npx @claude-flow/cli@latest hooks pre-edit \
  --file "src/embeddings/perch_config.rs" \
  --operation "refactor"
```

Returns:
- Related patterns that worked for similar configurations
- Agent recommendations for the edit type
- Risk assessment for the change
- Suggested validation tests

#### Post-Edit Hook: Train Neural Patterns

After successful configuration changes:

```bash
# Record successful embedding refinement
npx @claude-flow/cli@latest hooks post-edit \
  --file "src/embeddings/perch_config.rs" \
  --success true \
  --agent "ml-specialist"

# Store the refinement as a pattern
npx @claude-flow/cli@latest hooks intelligence pattern-store \
  --pattern "HNSW ef_search=150 optimal for whale song retrieval" \
  --type "configuration" \
  --confidence 0.88 \
  --metadata '{"species": "cetacean", "corpus_size": 500000}'
```

### Memory Namespaces for Bioacoustics

#### Namespace: `patterns`

Stores successful clustering and analysis configurations:

```bash
# Store clustering pattern
npx @claude-flow/cli@latest memory store \
  --namespace patterns \
  --key "birdsong-dbscan-dawn-chorus" \
  --value '{
    "algorithm": "DBSCAN",
    "eps": 0.15,
    "min_samples": 3,
    "preprocessing": ["l2_normalize", "pca_128"],
    "context": "dawn_chorus",
    "success_rate": 0.91,
    "species_groups": ["passerine", "corvid"],
    "temporal_window": "04:00-07:00"
  }'

# Search for relevant patterns
npx @claude-flow/cli@latest memory search \
  --namespace patterns \
  --query "clustering algorithm for dense dawn chorus recordings"
```

Pattern schema:
```typescript
interface ClusteringPattern {
  algorithm: 'DBSCAN' | 'HDBSCAN' | 'hierarchical' | 'kmeans' | 'spectral';
  parameters: Record<string, number | string>;
  preprocessing: string[];
  context: string;
  success_rate: number;
  species_groups: string[];
  environmental_conditions?: {
    habitat?: string;
    time_of_day?: string;
    season?: string;
    weather?: string;
  };
  hnsw_tuning?: {
    ef_construction: number;
    ef_search: number;
    M: number;
  };
}
```
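
As a usage sketch, a candidate pattern can be checked against the schema before it is written to the `patterns` namespace. The validation rules and the trimmed-down interface below are illustrative assumptions, not part of claude-flow:

```typescript
// Trimmed copy of the ClusteringPattern schema above, limited to the fields
// the checks need; the full schema is defined in the document.
interface PatternCandidate {
  algorithm: string;
  preprocessing: string[];
  success_rate: number;
  species_groups: string[];
}

const KNOWN_ALGORITHMS = ['DBSCAN', 'HDBSCAN', 'hierarchical', 'kmeans', 'spectral'];

// Return a list of human-readable problems; an empty list means valid.
function validatePattern(p: PatternCandidate): string[] {
  const errors: string[] = [];
  if (!KNOWN_ALGORITHMS.includes(p.algorithm)) errors.push(`unknown algorithm: ${p.algorithm}`);
  if (p.success_rate < 0 || p.success_rate > 1) errors.push('success_rate must be in [0, 1]');
  if (p.species_groups.length === 0) errors.push('species_groups must not be empty');
  if (p.preprocessing.some((s) => s.trim() === '')) errors.push('empty preprocessing step');
  return errors;
}
```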

#### Namespace: `motifs`

Stores discovered sequence patterns and syntactic structures:

```bash
# Store discovered motif
npx @claude-flow/cli@latest memory store \
  --namespace motifs \
  --key "humpback-song-unit-sequence-A" \
  --value '{
    "species": "Megaptera novaeangliae",
    "pattern_type": "song_unit_sequence",
    "sequence": ["A1", "A2", "B1", "A1", "C1"],
    "transition_probabilities": {
      "A1->A2": 0.85,
      "A2->B1": 0.72,
      "B1->A1": 0.68,
      "A1->C1": 0.45
    },
    "typical_duration_ms": 45000,
    "occurrence_rate": 0.34,
    "recording_ids": ["rec_2024_001", "rec_2024_002"],
    "discovered_by": "sequence-analyzer",
    "confidence": 0.89
  }'

# Search for similar motifs
npx @claude-flow/cli@latest memory search \
  --namespace motifs \
  --query "humpback whale song phrase transitions"
```

Motif schema:
```typescript
interface SequenceMotif {
  species: string;
  pattern_type: 'song_unit_sequence' | 'call_response' | 'alarm_cascade' | 'contact_pattern';
  sequence: string[];
  transition_probabilities: Record<string, number>;
  typical_duration_ms: number;
  occurrence_rate: number;
  temporal_context?: {
    time_of_day?: string;
    season?: string;
    behavioral_context?: string;
  };
  recording_ids: string[];
  discovered_by: string;
  confidence: number;
  validation_status: 'automatic' | 'expert_verified' | 'disputed';
}
```
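
A stored motif's `transition_probabilities` map (keyed `"A1->A2"`, as in the example above) can score how well a newly observed unit sequence matches the motif. Scoring as a product of transition probabilities, with unseen transitions scored 0, is an illustrative choice rather than a documented behavior:

```typescript
// Score an observed unit sequence against a motif's first-order transition
// model. Keys follow the "A1->A2" convention used in the motifs namespace.
function sequenceLikelihood(
  sequence: string[],
  transitions: Record<string, number>,
): number {
  let likelihood = 1;
  for (let i = 0; i + 1 < sequence.length; i++) {
    // Unseen transitions zero out the score in this simple sketch.
    likelihood *= transitions[`${sequence[i]}->${sequence[i + 1]}`] ?? 0;
  }
  return likelihood;
}
```

A smoothed variant (small floor probability for unseen transitions) would be more forgiving for field data.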

#### Namespace: `species`

Stores species-specific embedding characteristics:

```bash
# Store species embedding profile
npx @claude-flow/cli@latest memory store \
  --namespace species \
  --key "orca-pacific-northwest-resident" \
  --value '{
    "species": "Orcinus orca",
    "population": "Southern Resident",
    "location": "Pacific Northwest",
    "embedding_characteristics": {
      "centroid_cluster_distance": 0.12,
      "intra_pod_variance": 0.08,
      "inter_pod_variance": 0.23,
      "frequency_range_hz": [500, 12000],
      "dominant_frequencies_hz": [2000, 5000, 8000]
    },
    "retrieval_optimization": {
      "optimal_k": 15,
      "distance_threshold": 0.25,
      "ef_search": 200
    },
    "known_call_types": 34,
    "dialect_markers": ["S01", "S02", "S03"],
    "last_updated": "2026-01-15"
  }'

# Search for species characteristics
npx @claude-flow/cli@latest memory search \
  --namespace species \
  --query "cetacean vocalization embedding characteristics Pacific"
```

Species schema:
```typescript
interface SpeciesEmbeddingProfile {
  species: string;
  population?: string;
  location?: string;
  embedding_characteristics: {
    centroid_cluster_distance: number;
    intra_population_variance: number;
    inter_population_variance: number;
    frequency_range_hz: [number, number];
    dominant_frequencies_hz: number[];
    embedding_norm_range?: [number, number];
  };
  retrieval_optimization: {
    optimal_k: number;
    distance_threshold: number;
    ef_search: number;
    ef_construction?: number;
  };
  known_call_types: number;
  dialect_markers?: string[];
  acoustic_niche?: {
    typical_snr_db: number;
    overlap_species: string[];
    distinguishing_features: string[];
  };
  last_updated: string;
}
```

### Background Workers Utilization

#### Worker: `optimize` - HNSW Parameter Tuning

Continuously optimizes HNSW parameters based on retrieval quality:

```bash
# Dispatch HNSW optimization worker
npx @claude-flow/cli@latest hooks worker dispatch \
  --trigger optimize \
  --context "bioacoustic-hnsw" \
  --priority high

# Check optimization status
npx @claude-flow/cli@latest hooks worker status
```

Optimization targets:
- `ef_construction`: Balance between index build time and recall
- `ef_search`: Balance between query latency and accuracy
- `M`: Balance between memory usage and graph connectivity

Automated tuning workflow:
1. Sample recent queries and their success rates
2. Run parameter sweep on subset
3. Evaluate recall@k and latency
4. Apply best parameters if improvement > 5%
5. Store successful configuration in `patterns` namespace
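
Step 4 can be made concrete as below. Reading the rule as "relative recall@10 gain above 5%, with no more than a 10% p99 latency regression" is an interpretation for illustration, not the worker's documented behavior:

```typescript
// Minimal acceptance test for a candidate HNSW configuration. The 5% recall
// gate comes from the workflow above; the 10% latency guard is an assumption.
interface EvalResult { recall_at_10: number; latency_p99_ms: number; }

function shouldApply(current: EvalResult, candidate: EvalResult): boolean {
  const recallGain =
    (candidate.recall_at_10 - current.recall_at_10) / current.recall_at_10;
  const latencyRegression =
    (candidate.latency_p99_ms - current.latency_p99_ms) / current.latency_p99_ms;
  return recallGain > 0.05 && latencyRegression <= 0.10;
}
```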

```typescript
interface HNSWOptimizationResult {
  previous_params: { ef_construction: number; ef_search: number; M: number };
  new_params: { ef_construction: number; ef_search: number; M: number };
  improvement: {
    recall_at_10: number; // Percentage improvement
    latency_p99_ms: number;
    memory_mb: number;
  };
  evaluation_corpus_size: number;
  applied: boolean;
  timestamp: string;
}
```

#### Worker: `consolidate` - Memory Consolidation

Consolidates learned patterns and prevents memory fragmentation:

```bash
# Dispatch consolidation worker (low priority, runs during idle)
npx @claude-flow/cli@latest hooks worker dispatch \
  --trigger consolidate \
  --priority low \
  --background true
```

Consolidation operations:
1. Merge similar patterns within each namespace
2. Archive low-confidence or stale patterns
3. Update pattern embeddings for improved retrieval
4. Compute and cache centroid patterns for fast routing
5. Run EWC++ to protect critical learned weights

```bash
# Force SONA learning cycle with consolidation
npx @claude-flow/cli@latest hooks intelligence learn \
  --consolidate true
```

Consolidation schedule:
- Hourly: Merge patterns with >0.95 similarity
- Daily: Archive patterns not accessed in 30 days
- Weekly: Full EWC++ consolidation pass
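
The hourly merge step can be sketched as a greedy single-pass merge of pattern embeddings whose cosine similarity exceeds 0.95. The greedy strategy and averaging of merged vectors are illustrative simplifications of whatever the worker actually does:

```typescript
// Cosine similarity between two equal-length vectors.
function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

// Greedily merge embeddings that are near-duplicates (cosine > threshold),
// keeping a running average for each merged group.
function consolidate(embeddings: number[][], threshold = 0.95): number[][] {
  const merged: number[][] = [];
  for (const e of embeddings) {
    const hit = merged.find((m) => cosine(m, e) > threshold);
    if (hit) {
      for (let i = 0; i < hit.length; i++) hit[i] = (hit[i] + e[i]) / 2;
    } else {
      merged.push([...e]);
    }
  }
  return merged;
}
```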

#### Worker: `audit` - Data Quality Checks

Validates embedding quality and detects drift:

```bash
# Dispatch audit worker
npx @claude-flow/cli@latest hooks worker dispatch \
  --trigger audit \
  --context "embedding-quality" \
  --priority critical
```

Audit checks:
1. **Embedding health**: Detect NaN, infinity, or collapsed embeddings
2. **Distribution drift**: Compare embedding statistics over time
3. **Retrieval quality**: Sample-based precision/recall checks
4. **Label consistency**: Cross-reference with expert annotations
5. **Temporal coherence**: Verify sequence relationships
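
Check 1 can be sketched as a single pass over the stored vectors. The status policy (any NaN/Infinity fails the audit, collapsed vectors only warn) and the norm threshold are illustrative assumptions:

```typescript
// Flag non-finite components and "collapsed" vectors (L2 norm near zero).
type HealthStatus = 'pass' | 'warning' | 'fail';

function auditEmbeddings(
  embeddings: number[][],
  minNorm = 1e-6,
): { status: HealthStatus; nan_rate: number; collapsed: number } {
  let nan = 0, collapsed = 0;
  for (const e of embeddings) {
    if (e.some((x) => !Number.isFinite(x))) { nan++; continue; }
    const norm = Math.sqrt(e.reduce((s, x) => s + x * x, 0));
    if (norm < minNorm) collapsed++;
  }
  const status: HealthStatus = nan > 0 ? 'fail' : collapsed > 0 ? 'warning' : 'pass';
  return { status, nan_rate: nan / embeddings.length, collapsed };
}
```

The returned fields line up with the `nan_rate` metric in the `AuditResult` schema below.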

```typescript
interface AuditResult {
  check_type: 'embedding_health' | 'distribution_drift' | 'retrieval_quality' | 'label_consistency' | 'temporal_coherence';
  status: 'pass' | 'warning' | 'fail';
  metrics: {
    nan_rate?: number;
    norm_variance?: number;
    drift_score?: number;
    precision_at_10?: number;
    consistency_rate?: number;
  };
  affected_recordings?: string[];
  recommended_action?: string;
  timestamp: string;
}
```

Automated responses:
- Warning: Log and notify, continue processing
- Fail: Pause ingestion, alert operators, revert to last known good state

### Transfer Learning from Related Projects

#### Project Transfer Protocol

Leverage patterns from related bioacoustic projects:

```bash
# Transfer patterns from a related whale research project
npx @claude-flow/cli@latest hooks transfer \
  --source-path "/projects/cetacean-acoustics" \
  --min-confidence 0.8 \
  --filter "species:cetacean"

# Transfer from IPFS-distributed pattern registry
npx @claude-flow/cli@latest hooks transfer store \
  --pattern-id "marine-mammal-clustering-v2"
```

Transfer eligibility criteria:
1. Source project confidence > 0.8
2. Domain overlap > 50% (based on species groups)
3. No conflicting patterns in target
4. Embedding model compatibility (same Perch version)

Transfer adaptation process:
1. Retrieve candidate patterns from source
2. Validate against target domain characteristics
3. Apply domain adaptation if needed (fine-tune on local data)
4. Integrate with reduced initial confidence (0.7x)
5. Gradually increase confidence based on local success
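
Steps 4-5 can be sketched as follows. Only the 0.7x discount comes from the list above; the exponential-style update with a fixed learning rate is an illustrative assumption:

```typescript
// Imported patterns start at a discounted confidence (step 4).
function initialTransferConfidence(sourceConfidence: number): number {
  return 0.7 * sourceConfidence;
}

// Confidence drifts toward 1 on local success and toward 0 on failure
// (step 5); the rate constant is an assumed tuning knob.
function updateConfidence(current: number, localSuccess: boolean, rate = 0.1): number {
  const target = localSuccess ? 1 : 0;
  return Math.min(1, current + rate * (target - current));
}
```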

```bash
# Check transfer candidates
npx @claude-flow/cli@latest transfer store-search \
  --query "bioacoustic clustering" \
  --category "marine" \
  --min-rating 4.0 \
  --verified true
```

### Feedback Loops: User Corrections to Embedding Refinement

#### Correction Capture

```typescript
interface UserCorrection {
  correction_id: string;
  timestamp: string;
  user_id: string;
  expertise_level: 'novice' | 'intermediate' | 'expert' | 'domain_expert';
  correction_type: 'cluster_assignment' | 'species_label' | 'call_type' | 'sequence_boundary';
  original_prediction: {
    value: string;
    confidence: number;
    source: 'automatic' | 'pattern_match';
  };
  corrected_value: string;
  affected_segments: string[];
  context?: string;
}
```

#### Feedback Integration Pipeline

```bash
# Step 1: Log user correction
npx @claude-flow/cli@latest memory store \
  --namespace corrections \
  --key "correction-$(date +%s)-$USER" \
  --value '{
    "correction_type": "species_label",
    "original": {"value": "Megaptera novaeangliae", "confidence": 0.72},
    "corrected": "Balaenoptera musculus",
    "segment_ids": ["seg_001", "seg_002"],
    "user_expertise": "domain_expert"
  }'

# Step 2: Trigger learning from correction
npx @claude-flow/cli@latest hooks intelligence trajectory-start \
  --task "learn from species misclassification correction"

npx @claude-flow/cli@latest hooks intelligence trajectory-step \
  --trajectory-id "$TRAJ_ID" \
  --action "analyzed embedding distance between humpback and blue whale" \
  --result "found confounding frequency overlap in low-SNR segments" \
  --quality 0.7

npx @claude-flow/cli@latest hooks intelligence trajectory-end \
  --trajectory-id "$TRAJ_ID" \
  --success true \
  --feedback "updated SNR threshold from 10 to 15 dB for cetacean classification"

# Step 3: Update species namespace
npx @claude-flow/cli@latest memory store \
  --namespace species \
  --key "blue-whale-humpback-distinction" \
  --value '{
    "confusion_pair": ["Megaptera novaeangliae", "Balaenoptera musculus"],
    "distinguishing_features": ["frequency_range", "call_duration"],
    "recommended_snr_threshold": 15,
    "embedding_distance_threshold": 0.18
  }'
```

#### Feedback Weight by Expertise

| Expertise Level | Weight | Trigger Threshold | Immediate Action |
|-----------------|--------|-------------------|------------------|
| Domain Expert | 1.0 | 1 correction | Update pattern |
| Expert | 0.8 | 2 corrections | Update pattern |
| Intermediate | 0.5 | 5 corrections | Flag for review |
| Novice | 0.2 | 10 corrections | Queue for expert |
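
The table can be encoded directly. The `POLICY` object mirrors the rows above; counting corrections per expertise level and firing the listed action once its threshold is met is an illustrative mechanism, not the documented one:

```typescript
type Expertise = 'novice' | 'intermediate' | 'expert' | 'domain_expert';

// Direct encoding of the expertise table above.
const POLICY: Record<Expertise, { weight: number; trigger: number; action: string }> = {
  domain_expert: { weight: 1.0, trigger: 1, action: 'update_pattern' },
  expert: { weight: 0.8, trigger: 2, action: 'update_pattern' },
  intermediate: { weight: 0.5, trigger: 5, action: 'flag_for_review' },
  novice: { weight: 0.2, trigger: 10, action: 'queue_for_expert' },
};

// Count corrections per level and return the actions whose thresholds are met.
function triggeredActions(corrections: Expertise[]): string[] {
  const counts: Partial<Record<Expertise, number>> = {};
  for (const c of corrections) counts[c] = (counts[c] ?? 0) + 1;
  const actions: string[] = [];
  for (const level of Object.keys(counts) as Expertise[]) {
    if ((counts[level] ?? 0) >= POLICY[level].trigger) actions.push(POLICY[level].action);
  }
  return actions;
}
```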

#### Continuous Refinement Loop

```
User Correction
        |
        v
+------------------+
| Correction Store |  (namespace: corrections)
+------------------+
        |
        v
+------------------+
| Pattern Analysis |  (identify affected patterns)
+------------------+
        |
        v
+------------------+
| Verdict Update   |  (reduce confidence of failed patterns)
+------------------+
        |
        v
+------------------+
| SONA Learning    |  (trajectory-based fine-tuning)
+------------------+
        |
        v
+------------------+
| EWC++ Consolidate|  (protect other learned patterns)
+------------------+
        |
        v
+------------------+
| Pattern Update   |  (store refined pattern)
+------------------+
        |
        v
Improved Retrieval
```

### Implementation Checklist

#### Phase 1: Core Infrastructure (Week 1-2)

- [ ] Set up memory namespaces (`patterns`, `motifs`, `species`, `corrections`)
- [ ] Implement pre-task hook for bioacoustic task routing
- [ ] Implement post-task hook for pattern storage
- [ ] Configure HNSW parameters for 1536-D Perch embeddings
- [ ] Set up audit worker for embedding health checks

#### Phase 2: Learning Integration (Week 3-4)

- [ ] Implement trajectory tracking for analysis workflows
- [ ] Configure LoRA fine-tuning for embedding refinement
- [ ] Set up EWC++ consolidation schedule
- [ ] Implement feedback capture from user interface
- [ ] Configure optimize worker for HNSW tuning

#### Phase 3: Advanced Features (Week 5-6)

- [ ] Implement motif discovery and storage
- [ ] Set up species-specific embedding profiles
- [ ] Configure transfer learning from related projects
- [ ] Implement expertise-weighted feedback integration
- [ ] Set up consolidate worker for memory optimization

#### Phase 4: Monitoring and Refinement (Ongoing)

- [ ] Dashboard for learning metrics
- [ ] Alerting for quality degradation
- [ ] A/B testing for pattern effectiveness
- [ ] Regular audit of learned patterns

## Consequences

### Positive

1. **Continuous Improvement**: System gets better with every analysis task
2. **Domain Adaptation**: EWC++ allows learning new species without forgetting existing knowledge
3. **Expert Knowledge Capture**: User corrections are systematically integrated
4. **Efficient Processing**: Pattern reuse reduces computation for common tasks
5. **Transparent Learning**: Trajectory tracking provides explainability
6. **Cross-Project Synergy**: Transfer learning leverages community knowledge

### Negative

1. **Complexity**: Multiple interacting systems require careful orchestration
2. **Storage Growth**: Pattern storage will grow over time (mitigated by consolidation)
3. **Cold Start**: Initial deployments lack learned patterns (mitigated by transfer)
4. **Feedback Dependency**: Quality depends on user correction quality

### Neutral

1. **Operational Overhead**: Background workers require monitoring
2. **Parameter Tuning**: Initial HNSW parameters need manual optimization
3. **Expertise Requirements**: Domain experts needed for high-quality feedback

## References

1. RuVector GNN Architecture: https://github.com/ruvnet/ruvector
2. SONA Pattern Documentation: claude-flow v3 hooks system
3. EWC: Kirkpatrick et al., "Overcoming catastrophic forgetting in neural networks"
4. Perch 2.0 Embeddings: https://arxiv.org/abs/2508.04665
5. HNSW: Malkov & Yashunin, "Efficient and robust approximate nearest neighbor search using Hierarchical Navigable Small World graphs"
6. LoRA: Hu et al., "LoRA: Low-Rank Adaptation of Large Language Models"

---

## vendor/ruvector/examples/vibecast-7sense/docs/plans/research/RESEARCH.txt

Transforming Bioacoustic Signals into a Navigable Geometric Space

Abstract: We propose a system that converts bioacoustic signals (e.g. birdsong) into a high-dimensional vector space where meaningful structure emerges. By mapping audio features (pitch, rhythm, repetition, spectral texture) into geometry, similar sounds cluster together and sequence patterns form visible trajectories. This leverages the RuVector platform – a Rust-based vector database with HNSW indexing and self-learning Graph Neural Network (GNN) layers – augmented with domain-specific audio processing. The goal is a full pipeline from audio to insight: extracting robust embeddings for calls, organizing them with HNSW in potentially hyperbolic space, and using GNN+attention mechanisms to learn and highlight relationships (motifs, transitions) in the data. We outline key design decisions and state-of-the-art techniques for implementation, along with strategies for verification and visualization of the resulting “sound map.”

Vision: From Sound to Geometry

Bioacoustic data, once mere “background noise,” can now be treated as a rich dataset. By translating thousands of bird calls into points in a multi-dimensional space, hidden structure becomes apparent. What initially sounds chaotic reveals clusters of repeated patterns and motifs, as well as distinct pathways corresponding to sequences of calls. Patterns that would elude the human ear can be detected at scale by this geometric approach. For example, recent work on unsupervised birdsong analysis showed that when individual syllables are embedded and plotted (via UMAP), they form multiple dense clusters that closely correspond to true syllable types. In the figure below, each point is a bird syllable embedded in 2D and colored by its automatically assigned cluster label, illustrating how similar calls group together visually:

[Figure: https://pmc.ncbi.nlm.nih.gov/articles/PMC12626419/]
Unsupervised clustering of zebra finch song syllables. Each point (in two-dimensional UMAP space) represents one syllable, colored by the cluster label assigned by an automated method. Such geometric embedding reveals distinct groups corresponding to different syllable types.

This “sound into geometry” paradigm provides a visual system to observe how vocalizations behave across time, regions, and species. For instance, alarm or alert calls might cluster in one region of the space, whereas coordination/contact calls occupy another, reflecting functional groupings. Sequences of calls trace trajectories through the space – revealing common transitions and phrase structures as connected patterns. We are not directly translating animal communication into human language, but uncovering its structural organization. Identifying these structures is a crucial first step toward understanding the latent grammar of animal communication.

Pipeline Depth: Audio Processing vs. Vector Space Focus

Full Audio-to-Vector Pipeline: One approach is to implement the entire pipeline in Rust – from raw audio to feature vector. This entails signal processing steps like FFT and Mel spectrogram computation, followed by neural feature extraction. For example, one could compute mel-frequency spectrograms of bird calls, then apply a learned embedding model (e.g. a CNN or transformer encoder) to produce a fixed-dimensional vector for each call. There is precedent for such end-to-end pipelines in Python (e.g. Avian Vocalization Analysis tools), where a deep network maps spectrograms of syllables into an embedding space. A Rust implementation could use libraries (or custom DSP code) for spectrograms, and potentially port or reimplement a neural network. The advantage of full pipeline control is optimization and integration – the feature extraction can be tuned to bioacoustic specifics (e.g. emphasize pitch contours, use log-mel scales suitable for bird hearing range). It would allow real-time or streaming processing of audio directly into the vector database.

Feature-Vector Input Focus: Alternatively, we might assume that audio is preprocessed externally (using existing ML models or Python pipelines) and focus our Rust implementation on the vector space organization layer. In this scenario, the input to RuVector would be high-quality feature vectors (embeddings) for each call, rather than raw waveforms. This approach leverages state-of-the-art acoustic embedding models (which could be trained in Python using large datasets and specialized libraries) and avoids reimplementing complex neural nets from scratch in Rust. We would then concentrate on how to index, connect, and analyze these vectors using HNSW and GNN. This division of labor can speed up development – we use the best available acoustic models, and RuVector handles the similarity search and learning on top of those embeddings.

Recommendation: We propose a hybrid approach: start by using existing tools to generate embeddings (for rapid prototyping and validation of the concept), then gradually port critical pieces to Rust for performance. For example, one could use a pre-trained model (like a contrastive audio encoder trained on bird calls) to embed each signal, and feed those vectors into RuVector. If needed, later implement a simplified version in Rust or integrate via FFI. This ensures we get the geometry and clustering right before expending effort on low-level DSP. In summary, focus initially on the vector space layer with the assumption of quality feature vectors as input, but design with a path to incorporate full audio processing in Rust down the line.

Integration Architecture with RuVector

We need to decide how to integrate this bioacoustic analysis into the RuVector ecosystem. Two possibilities emerge:

New RuVector Module/Feature: Extend RuVector with a built-in “bioacoustic” feature flag or module. This could include Rust code for audio processing, custom distance functions (if any), or Cypher query extensions for this domain. For example, a ruvector-bioacoustic crate might handle audio-specific tasks (like converting WAV files to mel-spectrum embeddings) and then use ruvector-core APIs to insert vectors and queries. The benefit is a seamless experience – users could point the system at audio data and use Cypher/Graph queries to traverse the acoustic similarity graph. It also could allow leveraging RuVector’s learning features directly on raw data (e.g. use the GNN to fine-tune the embedding model parameters via backprop, if that integration is made).

Standalone Example or Application: Build this as a separate application that uses RuVector as the storage and learning engine. In this case, RuVector remains domain-agnostic; we write an example (or reference implementation) demonstrating how to ingest bird call audio, generate embeddings (perhaps via an external model or a simple built-in one), and then load them into RuVector. The analysis (clustering, sequence detection, etc.) would be done through RuVector’s query interface and GNN features, but all domain logic lives in the example code. This has the advantage of keeping RuVector’s core clean and general, while still showcasing its capabilities on a compelling real-world use case. It could be a flagship demo for RuVector (“AI for Nature: indexing a million bird calls”).

Recommendation: Start with a standalone example application leveraging RuVector. This will be faster to iterate on (no need to modify RuVector’s core libraries initially) and can inform what generic features might be missing. If we find functionalities that are broadly useful (e.g. a specialized distance metric or a compression scheme for audio features), we can upstream those into RuVector as optional features later. By structuring the example well, we ensure that integrating it as a module later (if desired) is straightforward. In practice, we might create a small Rust program or library that uses ruvector-core and ruvector-gnn crates to build an index of bioacoustic vectors, and includes some domain-specific conveniences (like reading audio files, maybe using the hound crate for WAV I/O, etc.). This approach keeps the architecture modular: RuVector provides the vector DB and learning substrate; our code provides the domain conversion and interpretation.
|
||||
Algorithmic Focus: HNSW + GNN at the Core, with Domain-Specific Enhancements
|
||||
RuVector’s existing HNSW + GNN architecture is well-suited as the backbone for organizing and learning from bioacoustic embeddings. We will leverage this core and incorporate additional algorithms as needed:
|
||||
HNSW for Vector Organization: At the heart is the HNSW index, which efficiently organizes high-dimensional vectors in a multi-layer graph for fast similarity search
|
||||
. HNSW (Hierarchical Navigable Small World) will allow us to store hundreds of thousands of call embeddings and still retrieve nearest neighbors in milliseconds. This is crucial for mapping new data into the space and exploring clusters. Each bird call embedding becomes a node in the graph; edges link it to acoustically similar calls. The HNSW structure ensures a small-world property: calls can be navigated via short paths through both local and long-range connections. This provides the “navigable geometric space” – essentially a graph where distance correlates with acoustic similarity.
|
||||
Graph Neural Network Layer for Learning: What sets RuVector apart is its self-learning GNN component. RuVector uses GNNs to automatically enhance and adjust the vector index over time
|
||||
. In our context, we can harness this in two ways. First, as queries or insertions happen, the GNN can refine the embeddings or edge weights to improve clustering of frequently related calls. For example, if certain calls often appear in sequence (transitions), the system could learn to place them closer or strengthen their graph connection. The GNN’s role is to make the vector space adaptive: rather than a static index, it becomes a living representation that can be tuned via training objectives. We might define a self-supervised learning task on the graph – e.g. link prediction (neighboring calls in a recording should be predicted to connect) or contrastive learning (calls from the same context should be pulled together, different contexts pushed apart). RuVector’s GNN module provides the tools: it includes implementations of popular GNN layers (GCN, GraphSAGE, GAT) and even training utilities (optimizers, loss functions like InfoNCE)
|
||||
. This means we can perform differentiable search and fine-tune embeddings within the database. In summary, the GNN layer will be the “cognitive substrate” that learns the latent relationships beyond raw acoustic similarity, aligning with the idea of a cognitive map of sounds.
|
||||
**Hyperbolic Embeddings for Hierarchy:** Bioacoustic data can exhibit hierarchical structure (e.g. calls group by species, then by call type, etc.). We plan to explore hyperbolic embedding spaces (e.g. the Poincaré ball model), which are known to represent hierarchical relationships naturally and with low distortion. RuVector already supports hyperbolic math and distance functions (Poincaré and Lorentz models), so we can enable that for our index. In practice, this might mean storing and searching the embeddings in hyperbolic space (with the appropriate distance metric). If the data indeed has a hierarchy (say, a tree of call types or evolutionary relations across species), a hyperbolic space will embed it more uniformly (clusters radiating outwards for more specific groups). For example, generic calls (or background noise) might lie near the center and species-specific unique calls toward the periphery, mirroring a tree. Using hyperbolic space could improve the organization of the space if Euclidean assumptions fall short. We will need to experiment – RuVector's flexible support means we can easily switch the distance function to Poincaré distance and see if cluster quality improves.
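As a concrete reference for that experiment, the Poincaré ball distance is compact enough to sketch in plain Rust (an illustrative formula check, not RuVector's own API):

```rust
/// Distance in the Poincaré ball model:
/// d(u, v) = acosh(1 + 2·||u − v||² / ((1 − ||u||²)(1 − ||v||²))).
/// Assumes both points lie strictly inside the unit ball (norm < 1).
fn poincare_distance(u: &[f64], v: &[f64]) -> f64 {
    let diff_sq: f64 = u.iter().zip(v).map(|(a, b)| (a - b) * (a - b)).sum();
    let norm_sq = |x: &[f64]| -> f64 { x.iter().map(|a| a * a).sum() };
    let arg = 1.0 + 2.0 * diff_sq / ((1.0 - norm_sq(u)) * (1.0 - norm_sq(v)));
    arg.acosh()
}

fn main() {
    // Distances blow up near the boundary: that extra "room" at the edge is
    // what lets hyperbolic space spread out tree-like hierarchies.
    let near_center = poincare_distance(&[0.1, 0.0], &[0.0, 0.1]);
    let near_edge = poincare_distance(&[0.95, 0.0], &[0.9, 0.3]);
    println!("center: {near_center:.4}, edge: {near_edge:.4}");
    assert!(near_edge > near_center);
}
```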
**Attention Mechanisms:** Modern deep learning introduces attention to capture relationships in sequences and graphs. RuVector comes with a library of 39 attention mechanisms (including graph attention). We will leverage attention in a couple of ways.

*Graph Attention Networks (GAT):* By applying a GAT layer on the similarity graph of calls, the model can learn to weight the influence of neighboring nodes when updating embeddings. This is useful if some nearest neighbors are more important than others (e.g. a cluster of very similar calls should perhaps be given more weight than an outlier neighbor). GAT assigns learnable attention coefficients to edges, effectively learning which connections denote real pattern vs. noise. This aligns with our goal of highlighting key relationships – the attention will help focus on relevant motifs or transitions and ignore spurious links.

*Sequence modeling:* Additionally, if we treat a series of calls (like a bird's song bout or a dawn chorus recording) as a sequence, we could use a temporal attention model (akin to a Transformer) to find patterns. For instance, an attention-based sequence encoder could take the sequence of embedding vectors and learn an embedding for the whole sequence, or predict the next call type. This could unveil common phrases in birdsong by attending to repeating sub-sequences. While a full Transformer might be outside RuVector's immediate scope, we can use attention in a simpler way: e.g. use Dynamic Time Warping (DTW) alignment (discussed below) to get candidate similar sequences, then apply self-attention over the sequences to summarize them. Overall, attention provides a powerful mechanism to capture long-range dependencies – in our case, relationships between calls across time or across clusters – complementing the local similarity captured by HNSW.
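The core of any such attention layer is a softmax-normalized weighting of neighbor contributions. A minimal plain-Rust sketch, with a dot-product score standing in for GAT's learned scoring function (which RuVector's layers would provide):

```rust
/// Aggregate neighbor embeddings with softmax attention weights.
/// A real GAT learns the scoring function; a dot product stands in here.
fn attend(center: &[f32], neighbors: &[Vec<f32>]) -> Vec<f32> {
    let dot = |a: &[f32], b: &[f32]| a.iter().zip(b).map(|(x, y)| x * y).sum::<f32>();
    // Raw scores, shifted by the max for numerical stability, then softmax.
    let raw: Vec<f32> = neighbors.iter().map(|n| dot(center, n)).collect();
    let max = raw.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = raw.iter().map(|s| (s - max).exp()).collect();
    let z: f32 = exps.iter().sum();
    // Weighted sum of neighbors: influential neighbors dominate the update.
    let mut out = vec![0.0f32; center.len()];
    for (e, n) in exps.iter().zip(neighbors) {
        for (o, x) in out.iter_mut().zip(n) {
            *o += (e / z) * x;
        }
    }
    out
}

fn main() {
    let center = vec![1.0, 0.0];
    let neighbors = vec![vec![1.0, 0.0], vec![0.0, 1.0]];
    let agg = attend(&center, &neighbors);
    // The aligned neighbor gets the larger weight, pulling the update toward it.
    assert!(agg[0] > agg[1]);
}
```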
**Dynamic Time Warping (DTW) for Sequence Alignment:** DTW is a classic algorithm for directly aligning time series that may vary in speed or length. In bioacoustics, DTW has been used to compare bird song phrases by warping the time axis to find optimal matches. We will consider DTW for specific tasks like verifying that two sequences are the same motif. For example, if we suspect two different recordings contain the same pattern of calls, DTW can align their spectrograms or pitch contours to confirm similarity even if one is slower. However, DTW operates on raw sequences rather than on the learned embedding space. In many cases, our vector-space approach might make DTW unnecessary: if each syllable or call is well embedded, then a simple Euclidean or cosine distance between sequences of embeddings (perhaps averaged, or using Earth Mover's Distance as in some studies) could suffice. Nonetheless, DTW could be integrated as a post-processing verification step: after clustering calls and hypothesizing motifs, run DTW on the audio of calls within a cluster to ensure they indeed match. We note that past studies found DTW-based methods outperform simple cross-correlation in classifying call motifs, especially when call durations vary widely. Thus, for high-accuracy motif detection, a DTW refinement could boost precision. Implementation-wise, we might use DTW on extracted feature sequences (e.g. pitch trajectories) for a small subset of comparisons (not for every query, due to cost). This targeted use of DTW can complement the global embedding approach by handling edge cases (like subtle variations in a known motif).
**Topological Data Analysis (TDA) for Motif Discovery:** As an exploratory research direction, we can apply techniques from TDA (such as persistent homology) to the point cloud or sequence graph to find robust structures. Persistent homology can identify clusters, loops, and other topological features that persist across multiple scales. In the context of bird vocalizations, a loop in the state space might correspond to a repeated cycle of syllables (a chorus or song motif that loops back to the start). Similarly, highly persistent clusters would affirm strongly distinct call types. While the primary clustering will be done via HDBSCAN or similar on the embeddings, TDA could reveal higher-order structures like cycles (e.g. a bird alternates between two call types A and B in an A-B-A-B pattern, forming a loop in the embedding transition graph). There has been work applying TDA to time series and dynamic networks, suggesting we could construct a graph of transitions between calls and compute its persistent homology to detect repeating circuits. This is an advanced analysis layer that we might use for research validation rather than core implementation. If the user specifically wants motif detection, a simpler alternative is to compute an n-gram model or Markov chain of call sequences and find frequently occurring sequences (which is essentially what some behavioral analyses do via transition matrices). The entropy of the transition matrix can quantify how stereotyped a bird's song syntax is. In fact, the AVN paper computed entropy rates of syllable transitions to compare birds; we can replicate similar metrics from our data to verify that our learned structure correlates with known biological variation (e.g. more chaotic sequences yield higher entropy). TDA would be a novel angle, whereas established metrics (cluster purity, sequence entropy) will likely suffice for validation.
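That entropy metric needs no special machinery; a sketch of the transition-matrix entropy rate over call-type labels (a simple plug-in estimator that weights each row by its visit frequency):

```rust
use std::collections::HashMap;

/// Entropy rate (bits per transition) of a first-order Markov chain over
/// call-type labels, estimated from one observed sequence. A perfectly
/// stereotyped song (deterministic transitions) scores 0.
fn transition_entropy(seq: &[&str]) -> f64 {
    let mut rows: HashMap<&str, HashMap<&str, f64>> = HashMap::new();
    for w in seq.windows(2) {
        *rows.entry(w[0]).or_default().entry(w[1]).or_insert(0.0) += 1.0;
    }
    let total = (seq.len() - 1) as f64;
    let mut h = 0.0;
    for outs in rows.values() {
        let row_total: f64 = outs.values().sum();
        let p_state = row_total / total; // visit frequency of this state
        for &c in outs.values() {
            let p = c / row_total;
            h -= p_state * p * p.log2();
        }
    }
    h
}

fn main() {
    // Alternating A-B-A-B is fully predictable: entropy rate 0.
    assert_eq!(transition_entropy(&["A", "B", "A", "B", "A", "B"]), 0.0);
    // Mixed transitions score higher, as in less stereotyped song.
    assert!(transition_entropy(&["A", "A", "B", "A", "B", "B"]) > 0.0);
}
```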
In summary, our algorithmic focus is to primarily leverage RuVector’s HNSW/GNN infrastructure with domain-specific embedding strategies, while keeping an open mind to classical methods (DTW) and new ones (TDA) if they enhance results. The core idea is to obtain a meaningful embedding for each sound, index them in a graph that supports efficient similarity queries, and then apply learning on that graph (using GNN with attention, possibly in a hyperbolic space) to surface the patterns of interest: clusters (call types), motifs (repeat sequences), and context groupings (e.g. calls used in similar behavioral contexts cluster together).

## Implementation Plan and Components

With the design choices above, we outline a concrete implementation plan:
**Data Ingestion & Preprocessing:** Gather a sufficiently large and diverse dataset of bird audio recordings. This might include labeled datasets (for verification) like the ones used in the literature (zebra finch songs, field recordings of various species) and unlabeled soundscape recordings. We will split continuous recordings into discrete call/syllable segments. This can be done via an automated segmentation algorithm – e.g. WhisperSeg or TweetyNet (deep learning models proven to segment bird songs accurately). Alternatively, simpler energy-threshold methods could be a fallback for unlabeled data. The output of this stage is a collection of audio snippets, each presumably containing one call or syllable, with optional metadata (species, time, location if known).
**Feature Extraction (Audio to Embedding):** For each audio segment, compute features. We will start with standard spectral features: a mel spectrogram (possibly 32–128 mel bands, capturing the ~0–10 kHz range that covers most bird vocalizations). Then use a neural network to get an embedding. We have options here:

- Use a pre-trained embedding model such as OpenSoundscape or the BirdNET encoder, if available, to embed each spectrogram into, say, a 128-D vector.
- Train a custom embedding model using contrastive learning, e.g. a triplet loss where we anchor on a syllable, pull another instance of the same type closer, and push a different type away. This is essentially what Leblois et al. did, mapping syllables to an 8-D space with a triplet loss and achieving meaningful song comparisons. We could implement a small CNN in Rust (or, easier, in Python and then port the weights). RuVector's GNN module even provides losses like InfoNCE, which could possibly be repurposed to train the embedding online. For efficiency, initial training might happen outside RuVector, but the learned embedding function can then be used within Rust (e.g. via ndarray operations or the ONNX runtime).
- Include auxiliary features that domain experts find useful: pitch contour, duration, Wiener entropy, etc. Past research computed many acoustic features and used multivariate analysis. Instead of manually selecting features, a neural embedding should capture them implicitly, but we may log some for interpretability.
- Normalize and compress embeddings as needed (e.g. unit-normalize to use cosine distance). RuVector can handle large dimensions, but we might aim for 32–128 dims for balance.
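The normalization step is trivial but worth pinning down, since it fixes the index's distance metric; a sketch assuming cosine distance over unit vectors:

```rust
/// Unit-normalize an embedding in place so cosine similarity reduces
/// to a plain dot product (what the index will compare).
fn normalize(v: &mut [f32]) {
    let norm = v.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm > 0.0 {
        v.iter_mut().for_each(|x| *x /= norm);
    }
}

/// Cosine distance between two already-normalized vectors.
fn cosine_dist(a: &[f32], b: &[f32]) -> f32 {
    1.0 - a.iter().zip(b).map(|(x, y)| x * y).sum::<f32>()
}

fn main() {
    let mut v = vec![3.0, 4.0];
    normalize(&mut v);
    assert!((v[0] - 0.6).abs() < 1e-6 && (v[1] - 0.8).abs() < 1e-6);
    assert!(cosine_dist(&v, &v).abs() < 1e-6); // identical vectors: distance ~0
}
```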
**Building the Vector Space (RuVector Index):** Feed all embeddings into RuVector's vector database. We will use HNSW indexing (the default in ruvector-core) to build the graph of nearest neighbors. This gives us a navigable small-world graph where each embedding has edges to its M nearest neighbors. According to RuVector benchmarks, insertion and search are very fast (e.g. 1M vectors/min build speed), so even millions of calls are feasible. At this stage, we can already query for similar sounds: given a new call embedding, HNSW will return, say, the top 10 most similar calls in ~100 microseconds. This allows interactive exploration. We should verify qualitatively that similar calls (perhaps from the same species or call type) indeed retrieve each other – an initial sanity check of embedding quality.
**Applying GNN Learning in the Index:** With the graph constructed, we enable RuVector's learning mode. Concretely, we can define a training loop where the GNN layer processes the graph to refine embeddings. One approach: use GraphSAGE or GCN to propagate information between neighbors and train it to minimize a contrastive loss (like InfoNCE) that pulls neighbors closer and pushes random non-neighbors apart. Since RuVector is a database, an interesting twist is that we might use user interactions (or simulated ones) as signals – e.g., if a user clusters some calls or confirms certain calls are of the same type, feed that in as supervision. In the absence of external feedback, we rely on unsupervised signals: the graph itself (neighbors are likely similar by acoustic features, which we can trust to some extent) and sequence information.

We can incorporate temporal adjacency: create edges between calls that occur sequentially in a recording (these are known transitions), then train the GNN to predict those edges or to bring those connected nodes closer in embedding space. Essentially, we fuse the similarity graph with a sequence graph (making a multigraph or a hypergraph). RuVector's graph capabilities (Cypher queries) help to add those connections (e.g. create a relation :FOLLOWS between call A and call B if B comes after A in some recording). A GNN can then be trained to encode both modalities – acoustic similarity and temporal occurrence – potentially teasing apart different contexts (calls that are similar and often sequential might be the same phrase, whereas calls that are similar but never sequential might be similar call types used in different contexts). Technically, we will use Graph Attention Network (GAT) layers for this training, as discussed: GAT will learn weights for acoustic-similarity edges vs. sequential edges, etc. We will likely iterate the training: embed -> build graph -> GNN updates embeddings -> rebuild graph (if needed) -> etc.

Because RuVector supports differentiable search (end-to-end training), we might not even need an explicit rebuild; the embeddings adjust continuously and HNSW can accommodate slight moves (though large moves might need reinsertion). We should monitor whether the GNN learning converges (RuVector provides tools like replay buffers and EWC to help avoid catastrophic forgetting, which is remarkable in a database context). After training, we expect tighter clusters and more meaningful distances, effectively tuning the space to bioacoustic structure rather than just raw spectral similarity.
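For intuition, the InfoNCE objective mentioned above reduces to a cross-entropy over one positive and k negative similarities; a scalar sketch (RuVector's GNN module provides the real, differentiable version):

```rust
/// InfoNCE loss for one anchor: a positive-pair similarity against a set of
/// negative similarities, with temperature `tau`. Lower is better; the loss
/// shrinks as the positive pair becomes more similar than the negatives.
fn info_nce(pos_sim: f64, neg_sims: &[f64], tau: f64) -> f64 {
    let e_pos = (pos_sim / tau).exp();
    let denom: f64 = e_pos + neg_sims.iter().map(|s| (s / tau).exp()).sum::<f64>();
    -(e_pos / denom).ln()
}

fn main() {
    // Pulling the positive pair together (higher similarity) lowers the loss.
    let before = info_nce(0.0, &[0.0, 0.0], 0.5);
    let after = info_nce(2.0, &[0.0, 0.0], 0.5);
    println!("before: {before:.4}, after: {after:.4}");
    assert!(after < before);
}
```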
**Advanced Pattern Detection:** With a refined vector space, perform higher-level analyses:

- *Clustering:* Run a clustering algorithm (HDBSCAN or similar) on the embeddings to group calls into putative call types or motifs. As noted, thousands of signals often form clear clusters corresponding to syllable types. We will verify cluster consistency against any available labels (e.g. species or known call categories). If labels are not available, internal validation like silhouette scores or the V-measure (used in AVN to compare cluster labels to ground truth) can be used. RuVector might also allow Cypher queries to find connected components or densely connected subgraphs in the similarity graph as a form of clustering.
- *Sequence mining:* Analyze the graph of sequential transitions. This could be as simple as counting transitions to build a Markov chain (queryable from the graph: in Cypher, find all (:Call)-[:FOLLOWS]->(:Call) patterns). Identify frequent sequences (motifs) by looking for paths that repeat, or loops. If using TDA, compute the first homology group for cycles in the graph; any significant 1-cycles indicate looping sequences (repeated motifs). Alternatively, search for repeated path patterns of a given length (the graph is small enough per bird to brute-force short path patterns). We can also embed entire sequences: take the sequence of vector points for a recording and apply a sequence embedding (perhaps summing, or an RNN) to compare entire recordings.
- *Contextual clustering:* If our dataset spans different contexts (e.g. nighttime versus daytime calls, different geographies), we can project that metadata onto the space. Perhaps run dimension reduction (UMAP/t-SNE) on the final embeddings to visualize 2D plots, coloring points by metadata to see if patterns emerge (e.g. species separation, or before/after-sunrise differences). Such visualization can confirm that the geometric space captures meaningful axes of variation.
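The n-gram route to sequence mining is only a few lines of code; a sketch that surfaces repeated label sequences as candidate motifs:

```rust
use std::collections::HashMap;

/// Count n-grams over a sequence of call-type labels; n-grams that recur at
/// least `min_count` times are candidate motifs, sorted most frequent first.
fn frequent_ngrams<'a>(seq: &[&'a str], n: usize, min_count: usize) -> Vec<(Vec<&'a str>, usize)> {
    let mut counts: HashMap<Vec<&str>, usize> = HashMap::new();
    for w in seq.windows(n) {
        *counts.entry(w.to_vec()).or_insert(0) += 1;
    }
    let mut motifs: Vec<_> = counts.into_iter().filter(|(_, c)| *c >= min_count).collect();
    motifs.sort_by(|a, b| b.1.cmp(&a.1));
    motifs
}

fn main() {
    let bout = ["A", "B", "C", "A", "B", "C", "A", "B"];
    let motifs = frequent_ngrams(&bout, 2, 2);
    // "A B" occurs three times and should rank first.
    assert_eq!(motifs[0], (vec!["A", "B"], 3));
}
```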
**Verification and Evaluation:** We will rigorously verify that the system is discovering real structure:

- *Clustering accuracy:* If we have ground-truth labels for some calls (from expert annotations or known call catalogs), measure clustering quality (purity, V-measure, etc.). For example, do alarm calls from species X cluster together, distinct from contact calls? If using existing datasets like the zebra finch syllables, we can directly compare our automatic labels to manual ones (the AVN study reported ~0.80 V-measure for their automatic labeling; we aim for a similar ballpark).
- *Sequence pattern validation:* For any candidate motif the system finds, manually inspect spectrograms of those calls to ensure they truly match. Also cross-validate with the literature: e.g. if the system clusters a certain pattern, check whether that pattern was reported in ethological studies (perhaps known songs or call types). If possible, play back clustered sounds to bird experts for confirmation.
- *Quantitative metrics:* Use sequence entropy and consistency metrics similar to published studies to compare populations. For instance, measure the entropy of the transition matrix for each bird's song. Known results: isolated or neurologically impaired birds have higher entropy (less stereotypy) than wild-type birds. We can check whether our automatically derived sequences show that trend, which would validate that the structure we uncover correlates with biological reality.
- *Scalability and performance:* We should also verify that the pipeline runs efficiently on large data. HNSW has sub-linear query scaling, and RuVector's design targets high QPS, so we anticipate good performance. We will test with increasing dataset sizes (e.g. 10k, 100k, 1M calls) to ensure indexing and search remain fast. Memory usage is a concern; however, RuVector's adaptive compression can down-tier infrequently used vectors (up to 32× compression for cold data), which will help handle very large archives without losing much query performance.
**Visualization and Exploration Interface:** Finally, to truly make the geometric space navigable, we will implement visualization tools:

- Generate 2D projections (using UMAP or t-SNE) for snapshots of the dataset and provide interactive plots where a user can click on a point to hear the call, see its nearest neighbors, etc. This can be done by exporting data and using a Python notebook or a web D3.js interface.
- Possibly integrate with RuVector's WASM/browser support to run the vector search in a web app. One could imagine an explorer app where you upload a bird call, it finds similar calls in the database, and plots them in a local map.
- Create graph visualizations for motifs: e.g. show a network diagram of a set of calls with edges for transitions or for similarity above a threshold. This could highlight how certain calls link together forming a chain (a motif) or a star cluster (a call type with variants).
- Use color and shape to encode metadata in visualizations (each species a different color, each recording location a different marker) to leverage the multi-faceted nature of the data.

The end result will be a system where a researcher can navigate the space of sounds. They could query "find calls similar to this one" or "what does the acoustic space look like for region Y vs. region Z" and get immediate answers backed by the geometry. This moves bioacoustics into a data-driven discovery realm: instead of spending years listening for patterns, one can cluster and map millions of calls and watch the patterns emerge.
## Conclusion

In summary, our implementation will transform unstructured bioacoustic recordings into an organized, queryable geometric database using state-of-the-art techniques in machine listening and graph learning. By integrating a full audio processing pipeline with RuVector's vector search and GNN learning capabilities, we create a system that not only stores bioacoustic embeddings but continuously learns from them – adjusting to reveal the hidden structure of animal vocal communication. The "sound into geometry" approach lets us uncover clusters of calls and repeated motifs that hint at an underlying grammar in the sounds of nature. This aligns closely with RuVector's vision of a database that learns from data: every new call analyzed or query made can refine the map. Leveraging hyperbolic embeddings will capture hierarchical relations (species, call types), and attention-based GNN layers will focus the model on salient relationships (key transitions, contextual links). We will validate the system against known biological patterns and ensure it scales to real-world data volumes. Finally, through intuitive visualizations, we will make this complex multi-dimensional space navigable to users, turning passive audio data into an interactive atlas of animal communication. By revealing structure before meaning, we take the crucial first step toward deciphering the languages of nature – a frontier where machine learning and ecology meet, and one that this project is poised to explore.

## Sources
- Simmonds, D. et al. (2025). A deep learning approach for the analysis of birdsong. *eLife* 14: e101111. (Figures and results on unsupervised syllable clustering.)
- Cohen, R. (2026). RuVector: A Database that Autonomously Learns – project introduction post. (Describes RuVector's use of GNNs and attention to improve vector search.)
- RuVector Documentation (2025). Attention Mechanisms, GNNs, and Hyperbolic Embeddings in RuVector. (List of supported GNN layers and hyperbolic functions.)
- EmergentMind (2025). Hierarchical Navigable Small-World (HNSW) Graph. (Overview of HNSW for vector search.)
- Meliza, C.D. et al. (2013). Pitch- and spectral-based dynamic time warping for comparing avian vocalizations. *J. Acoust. Soc. Am.* 134(2): 1407–1415. (Demonstrates DTW's effectiveness in grouping call motifs.)
- Zhenyu, M. et al. (2020). Weighted persistent homology for biomolecular data analysis. *Sci. Rep.* 10, 2079. (Background on TDA and persistent homology for discovering data "shape".)
### Citations

- A deep learning approach for the analysis of birdsong – PMC: https://pmc.ncbi.nlm.nih.gov/articles/PMC12626419/
- Hierarchical Navigable Small-World Graph – EmergentMind: https://www.emergentmind.com/topics/hierarchical-navigable-small-world-hnsw-graph
- RuVector: AI-Powered Database with Graph Neural Networks – LinkedIn: https://www.linkedin.com/posts/reuvencohen_ruvector-a-database-that-automomously-learns-activity-7403869636058128384-5Q9Z
- ruvector_gnn – Rust documentation: https://docs.rs/ruvector-gnn/latest/ruvector_gnn/
- @ruvector/postgres-cli – npm: https://www.npmjs.com/package/@ruvector/postgres-cli
- ruvnet/ruvector – GitHub: https://github.com/ruvnet/ruvector
- Pitch- and spectral-based dynamic time warping methods for comparing field recordings of harmonic avian vocalizations – PMC: https://pmc.ncbi.nlm.nih.gov/articles/PMC3745477/
- Weighted persistent homology for biomolecular data analysis – Scientific Reports: https://www.nature.com/articles/s41598-019-55660-3
- Persistent homology of complex networks for dynamic state detection – Phys. Rev. E: https://link.aps.org/doi/10.1103/PhysRevE.100.022314
---

# vendor/ruvector/examples/vibecast-7sense/docs/plans/research/intro.md
## What Perch 2.0 changes for a RuVector pipeline

Perch 2.0 is explicitly designed to produce embeddings that stay useful under domain shift and support workflows like nearest-neighbor retrieval, clustering, and linear probes on modest hardware. ([arXiv][1])

Key technical facts that matter for engineering:

* Input is **5 second mono audio at 32 kHz** (160,000 samples), with a log-mel frontend producing **500 frames x 128 mel bins (60 Hz to 16 kHz)**. ([arXiv][2])
* Backbone is **EfficientNet-B3**, and the mean pooled embedding is **1536-D**. ([arXiv][2])
* Training includes:
  * supervised species classification,
  * a **prototype-learning classifier head** used for self-distillation,
  * and an auxiliary **source-prediction** objective. ([arXiv][2])
* It is multi-taxa and reports SOTA on BirdSet and BEANS, plus strong marine transfer despite little marine training data. ([arXiv][1])
* DeepMind describes this Perch release as an open model and points to Kaggle availability. ([Google DeepMind][3])

Why this is a big deal for RuVector: once embeddings are "good enough," HNSW stops being a storage trick and becomes a navigable map where neighborhoods are meaningful. RuVector's whole value proposition is then unlocked: fast HNSW retrieval, plus a learnable GNN reranker and attention on top of the neighbor graph. ([GitHub][4])
## RAB is the right framing for “interpretation” without hallucination

Retrieval-Augmented Bioacoustics (RAB) is basically "RAG for animal sound," with two design choices that align perfectly with a RuVector substrate:

1. adapt retrieval depth based on signal quality
2. cite the retrieved calls directly in the generated output for transparency

That is exactly how you keep "translation" honest: you are not translating meaning, you are producing an evidence-guided structural interpretation.
## Practical integration blueprint: Perch 2.0 + RuVector + RAB

### 1) Ingestion schema in RuVector

Model the world as both vectors and a graph:

**Nodes**

* `Recording {id, sensor_id, lat, lon, start_ts, habitat, weather, ...}`
* `CallSegment {id, recording_id, t0_ms, t1_ms, snr, energy, ...}`
* `Embedding {id, segment_id, model="perch2", dim=1536, ...}`
* `Prototype {id, cluster_id, centroid_vec, exemplars[]}`
* `Cluster {id, method, params, ...}`
* optional: `Taxon {inat_id, scientific_name, common_name}`

**Edges**

* `(:Recording)-[:HAS_SEGMENT]->(:CallSegment)`
* `(:CallSegment)-[:NEXT {dt_ms}]->(:CallSegment)` for sequences
* `(:CallSegment)-[:SIMILAR {dist}]->(:CallSegment)` from HNSW neighbors
* `(:Cluster)-[:HAS_PROTOTYPE]->(:Prototype)`
* `(:CallSegment)-[:ASSIGNED_TO]->(:Cluster)` (after clustering)

RuVector already supports storing embeddings and querying with Cypher-style graph queries, plus a GNN refinement layer that applies multi-head attention over neighbors. ([GitHub][4])
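As a shape-check for the schema, the node types map directly onto plain Rust structs (field names follow the node list above and are illustrative, not a layout RuVector mandates):

```rust
/// Illustrative mirror of the proposed node schema. Fields follow the plan's
/// node list; none of these names are imposed by RuVector itself.
#[allow(dead_code)]
struct Recording {
    id: u64,
    sensor_id: u32,
    lat: f64,
    lon: f64,
    start_ts: i64, // unix epoch seconds
}

#[allow(dead_code)]
struct CallSegment {
    id: u64,
    recording_id: u64,
    t0_ms: u32,
    t1_ms: u32,
    snr: f32,
}

struct Embedding {
    id: u64,
    segment_id: u64,
    model: &'static str,
    vec: Vec<f32>, // 1536-D for Perch 2.0
}

fn main() {
    let e = Embedding { id: 1, segment_id: 42, model: "perch2", vec: vec![0.0; 1536] };
    assert_eq!(e.vec.len(), 1536);
    assert_eq!(e.model, "perch2");
}
```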
### 2) Embedding in Rust, not Python

You have two very practical Rust-first options:

**Option A: ONNX Runtime**
There are published Perch v2 ONNX conversions with concrete tensor shapes:

* input: `['batch', 160000]`
* outputs include: `embedding ['batch', 1536]`, plus spectrogram and logits ([Hugging Face][5])

That gets you native Rust inference with `onnxruntime` bindings, and you can keep everything in the same process as RuVector.

**Option B: Use an existing Rust crate that already supports Perch v2**
There is a Rust library `birdnet-onnx` that supports Perch v2 inference (32 kHz, 5 s segments) and returns predictions. ([Docs.rs][6])
Even if you do not keep it long-term, it is an excellent "verification harness" to de-risk the pipeline.
### 3) The retrieval core: HNSW is your “acoustic cartography”

For each `CallSegment`:

1. embed with Perch 2.0 -> `Vec<f32>(1536)`
2. insert vector into RuVector
3. store metadata and computed features (snr, pitch stats, rhythm, spectral centroid)
4. periodically (or continuously) rebuild neighbor edges `SIMILAR` from top-k

Once you have this, you instantly get:

* nearest-neighbor "find similar calls"
* cluster discovery (call types, dialects, soundscape regimes)
* anomaly detection (rare calls, new species, anthropogenic intrusions)
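Before trusting the index, it helps to have the exact answer to compare against: brute-force top-k by cosine similarity is the recall oracle that HNSW approximates (a sketch assuming unit-normalized embeddings):

```rust
/// Exact top-k retrieval by cosine similarity over unit-normalized vectors.
/// Too slow for production, but the ground truth for measuring HNSW recall.
fn top_k(query: &[f32], db: &[Vec<f32>], k: usize) -> Vec<(usize, f32)> {
    let dot = |a: &[f32], b: &[f32]| a.iter().zip(b).map(|(x, y)| x * y).sum::<f32>();
    let mut scored: Vec<(usize, f32)> =
        db.iter().enumerate().map(|(i, v)| (i, dot(query, v))).collect();
    scored.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap());
    scored.truncate(k);
    scored
}

fn main() {
    let db = vec![vec![1.0, 0.0], vec![0.0, 1.0], vec![0.707, 0.707]];
    let hits = top_k(&[1.0, 0.0], &db, 2);
    // The identical vector ranks first, the 45-degree one second.
    assert_eq!(hits[0].0, 0);
    assert_eq!(hits[1].0, 2);
}
```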
### 4) Add the GNN and attention where it matters

Use the graph as supervision:

* acoustic edges from HNSW (similarity)
* temporal edges from `NEXT` (syntax)
* optional co-occurrence edges (same time window, same sensor neighborhood)

Then train a lightweight GNN reranker whose job is not "classify species," but to:

* re-rank neighbors for retrieval quality
* increase cluster coherence
* learn transition regularities

This matches RuVector's "HNSW retrieval then GNN enhancement" pattern. ([GitHub][4])
|
||||
|
||||
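The edge families above can be sketched as a plain multi-relational edge-list builder. `EdgeKind` and `build_edges` are hypothetical names, and co-occurrence edges are omitted for brevity; in the real system RuVector's graph layer would persist these relationships:

```rust
/// Edge types for the call graph: acoustic similarity vs temporal order.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum EdgeKind {
    Similar, // from HNSW top-k neighbors
    Next,    // consecutive segments within one recording
}

/// Build a multi-relational edge list from precomputed kNN neighbor lists
/// and the temporal order of segments. Illustrative sketch only.
fn build_edges(knn: &[Vec<usize>], order: &[usize]) -> Vec<(usize, usize, EdgeKind)> {
    let mut edges = Vec::new();
    // SIMILAR edges: one per (segment, neighbor) pair, skipping self-loops.
    for (src, nbrs) in knn.iter().enumerate() {
        for &dst in nbrs {
            if dst != src {
                edges.push((src, dst, EdgeKind::Similar));
            }
        }
    }
    // NEXT edges: consecutive segments in recording order.
    for w in order.windows(2) {
        edges.push((w[0], w[1], EdgeKind::Next));
    }
    edges
}
```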
### 5) RAB layer: evidence packs + constrained generation

For any query (a segment, a time interval, a habitat), build an **Evidence Pack**:

* top-k neighbors (IDs, distances)
* k cluster exemplars (prototype calls)
* top predicted taxa (if you choose to surface logits)
* local sequence context (previous and next segments)
* signal quality (SNR, clipping, overlap score)
* spectrogram thumbnails

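A minimal sketch of what such a pack might look like as a Rust struct. Every type and field name here is an assumption for illustration, not an existing 7sense or RuVector type:

```rust
/// One retrieved neighbor: ID plus distance in embedding space.
/// Hypothetical type, not part of any existing crate.
struct Neighbor {
    id: u64,
    distance: f32,
}

/// Hypothetical Evidence Pack: every generated statement must be
/// traceable back to the retrieved items collected here.
struct EvidencePack {
    query_id: u64,
    neighbors: Vec<Neighbor>,            // top-k IDs + distances
    exemplars: Vec<u64>,                 // cluster prototype call IDs
    top_taxa: Vec<(String, f32)>,        // optionally surfaced logits
    context: (Option<u64>, Option<u64>), // previous / next segment IDs
    snr_db: f32,                         // signal quality
    spectrogram_png: Option<Vec<u8>>,    // thumbnail bytes, if rendered
}
```

Keeping the pack a dumb data structure makes the attribution requirement below mechanical: the generator can only cite IDs that appear in these fields.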

Then generation produces only these kinds of outputs:

* monitoring summary
* annotation suggestions
* “this resembles X and Y exemplars, differs by Z”
* hypothesis prompts for researchers

And it must cite which retrieved calls informed each statement, matching the RAB proposal’s attribution emphasis.

## Verification that the geometry is real

Here is a verification stack that starts cheap and becomes rigorous.

### Level 1: Mechanical correctness

* audio is actually 32 kHz mono
* 5 s windows align with model expectations ([arXiv][2])
* embedding norms are stable (no NaNs, no collapse)
* duplicate audio -> near-identical embedding

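The norm and NaN checks above can be automated as a tiny gate run on every embedding before insertion. A sketch with illustrative, uncalibrated thresholds:

```rust
/// Mechanical sanity gate for one Perch embedding: expected length,
/// all values finite, and a norm in a plausible range. The norm bounds
/// here are illustrative assumptions, not calibrated thresholds.
fn embedding_ok(v: &[f32]) -> bool {
    if v.len() != 1536 {
        return false;
    }
    if v.iter().any(|x| !x.is_finite()) {
        return false; // NaN or infinity: reject
    }
    let norm = v.iter().map(|x| x * x).sum::<f32>().sqrt();
    norm > 1e-3 && norm < 1e3 // reject collapsed or exploding vectors
}
```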
### Level 2: Retrieval sanity

Pick 50 known calls (or manually curated exemplars):

* do nearest-neighbor retrieval
* manually check if the top 10 are genuinely similar

Perch’s own evaluation includes one-shot retrieval style tests using cosine distance as a proxy for clustering usefulness, which is exactly your use case. ([arXiv][7])

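Once a curator has marked which calls count as genuinely similar, the check above reduces to precision@k. A hedged sketch; the helper name and the divide-by-items-considered convention are choices made here, not a standard:

```rust
use std::collections::HashSet;

/// Fraction of the top-k retrieved IDs that a curator judged relevant.
/// Divides by the number of items actually considered (at most k);
/// this convention is an illustrative choice, not a fixed standard.
fn precision_at_k(retrieved: &[u64], relevant: &HashSet<u64>, k: usize) -> f64 {
    let considered = k.min(retrieved.len()).max(1);
    let hits = retrieved
        .iter()
        .take(k)
        .filter(|id| relevant.contains(id))
        .count();
    hits as f64 / considered as f64
}
```

Averaging this over the 50 curated exemplars gives a single number to track as the index or reranker changes.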
### Level 3: Few-shot probes

Train linear probes on small labeled subsets:

* species
* call type
* habitat context
* sensor ID (should be weak if embeddings are not overfitting device artifacts)

Perch 2.0 is explicitly oriented toward strong linear probing and retrieval without full fine-tuning. ([arXiv][1])

### Level 4: Sequence validity

Check whether your transition graph produces:

* stable motifs
* repeated trajectories
* entropy rates that differ by condition or location

If you want “motif truth,” DTW can be your high-precision confirmation step for a small subset, not your global engine.

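The entropy-rate comparison above can be estimated from a first-order Markov model over call-type symbols: a perfectly periodic sequence scores 0 bits per symbol, while noisier transition structure scores higher. A rough std-only sketch; real corpora would need smoothing and stationarity checks:

```rust
use std::collections::HashMap;

/// First-order entropy rate (bits per symbol) of a call-type sequence,
/// estimated from empirical transition counts:
/// H = -sum_i pi_i * sum_j p_ij * log2(p_ij), with pi_i taken from
/// each state's share of observed transitions. Illustrative estimator.
fn entropy_rate(seq: &[u32]) -> f64 {
    if seq.len() < 2 {
        return 0.0;
    }
    let mut trans: HashMap<u32, HashMap<u32, f64>> = HashMap::new();
    for w in seq.windows(2) {
        *trans.entry(w[0]).or_default().entry(w[1]).or_default() += 1.0;
    }
    let total = (seq.len() - 1) as f64;
    let mut h = 0.0;
    for outs in trans.values() {
        let row_total: f64 = outs.values().sum();
        let pi = row_total / total; // empirical weight of this state
        for &c in outs.values() {
            let p = c / row_total;
            h -= pi * p * p.log2();
        }
    }
    h
}
```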
## Visualization in Rust, end-to-end

You can do a fully Rust-native viz loop now:

1. Use RuVector to get kNN for each point (already computed by HNSW).
2. Feed that kNN graph into a Rust UMAP implementation such as `umap-rs` (it expects precomputed neighbors). ([Docs.rs][8])
3. Render interactive scatter plots using Rust bindings for Plotly, or export JSON for a web viewer. ([Crates.io][9])

Bonus: Perch outputs spectrogram tensors in some exported forms, so you can attach “what the model saw” to each point and show it on hover or click. ([Hugging Face][5])
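For the “export JSON for a web viewer” route in step 3, a dependency-free sketch is enough to start. The `{x, y, c}` schema is made up for illustration, and `serde_json` would be the idiomatic choice in a real crate:

```rust
/// Serialize 2-D projected points (x, y, cluster id) as a minimal JSON
/// array for a web viewer. Hand-rolled formatting to stay dependency-free;
/// the {"x","y","c"} schema is an illustrative assumption.
fn points_to_json(points: &[(f32, f32, u32)]) -> String {
    let items: Vec<String> = points
        .iter()
        .map(|(x, y, c)| format!(r#"{{"x":{x},"y":{y},"c":{c}}}"#))
        .collect();
    format!("[{}]", items.join(","))
}
```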
## “Translation” that stays scientifically honest

If you use the word “translation,” I would keep it scoped like this:

* Translate a call into:
  * nearest exemplars
  * cluster membership
  * structural descriptors (pitch contour stats, rhythm intervals, spectral texture)
  * sequence role (often followed by X, often precedes Y)

Not “the bird said danger,” but:

* “This call sits in the same neighborhood as known alarm exemplars and appears in similar sequence positions during disturbance periods.”

That is the RAB sweet spot: interpretable, evidence-backed, testable.

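One of the structural descriptors above, spectral texture via the spectral centroid, is cheap to compute from a magnitude spectrum. A sketch assuming linearly spaced FFT bins with resolution `bin_hz`:

```rust
/// Spectral centroid (Hz): the amplitude-weighted mean frequency of a
/// magnitude spectrum, a cheap "brightness" descriptor. `bin_hz` is the
/// frequency resolution of the FFT bins (assumed linearly spaced).
fn spectral_centroid(mag: &[f32], bin_hz: f32) -> f32 {
    let num: f32 = mag
        .iter()
        .enumerate()
        .map(|(i, m)| i as f32 * bin_hz * m)
        .sum();
    let den: f32 = mag.iter().sum();
    if den > 0.0 { num / den } else { 0.0 } // silent frame -> 0 Hz
}
```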
## Practical to exotic: what becomes feasible now

With Perch-grade embeddings, your ladder tightens:

**Practical**

* biodiversity indexing and monitoring summaries
* fast search over million-hour corpora
* sensor drift and anthropogenic anomaly alerts

**Advanced**

* few-shot adaptation for new sites with tiny labeled sets
* call library curation via cluster prototypes
* cross-taxa transfer experiments (insects vs birds vs amphibians)

**Exotic but defensible**

* closed-loop call-response experiments that probe structural sensitivity
* synthetic prototype interpolation (generate “between-cluster” calls) with strict ethics and permitting
* cross-species “structure maps” that compare signaling complexity without pretending semantics

## Two next moves that will accelerate you immediately

1. **Build the “call library + evidence pack” layer first.**
   It turns embeddings into a product and forces transparency.

2. **Treat the GNN as retrieval optimization, not a magic classifier.**
   Your win is better neighborhoods, cleaner motifs, and more stable trajectories.

If you want, I can turn this into:

* a concrete repo layout (`ruvector-bioacoustic/` crate + CLI + wasm viewer), or
* a short “vision memo” you can share publicly that frames Perch 2.0 + RuVector + RAB as the start of navigable animal communication geometry.

[1]: https://www.arxiv.org/pdf/2508.04665v2 "Perch 2.0: The Bittern Lesson for Bioacoustics"
[2]: https://arxiv.org/html/2508.04665v1 "Perch 2.0: The Bittern Lesson for Bioacoustics"
[3]: https://deepmind.google/blog/how-ai-is-helping-advance-the-science-of-bioacoustics-to-save-endangered-species/ "How AI is helping advance the science of bioacoustics to save endangered species - Google DeepMind"
[4]: https://github.com/ruvnet/ruvector "GitHub - ruvnet/ruvector: A distributed vector database that learns. Store embeddings, query with Cypher, scale horizontally with Raft consensus, and let the index improve itself through Graph Neural Networks."
[5]: https://huggingface.co/justinchuby/Perch-onnx?utm_source=chatgpt.com "justinchuby/Perch-onnx"
[6]: https://docs.rs/birdnet-onnx?utm_source=chatgpt.com "birdnet_onnx - Rust"
[7]: https://arxiv.org/html/2508.04665v1?utm_source=chatgpt.com "Perch 2.0: The Bittern Lesson for Bioacoustics"
[8]: https://docs.rs/umap-rs?utm_source=chatgpt.com "umap_rs - Rust"
[9]: https://crates.io/crates/plotly?utm_source=chatgpt.com "plotly - crates.io: Rust Package Registry"
---