Merge commit 'd803bfe2b1fe7f5e219e50ac20d6801a0a58ac75' as 'vendor/ruvector'

2026-02-28 14:39:40 -05:00
parent 7885bf6278 d803bfe2b1
commit cd5943df23
7854 changed files with 3522914 additions and 0 deletions
--- a/vendor/ruvector/examples/vibecast-7sense/docs/adr/ADR-001-system-architecture.md
+++ b/vendor/ruvector/examples/vibecast-7sense/docs/adr/ADR-001-system-architecture.md
@@ -0,0 +1,574 @@
+# ADR-001: System Architecture Overview
+
+**Status:** Accepted
+**Date:** 2026-01-15
+**Decision Makers:** 7sense Architecture Team
+**Technical Area:** System Architecture
+
+---
+
+## Context and Problem Statement
+
+7sense aims to transform bioacoustic signals (primarily bird calls) into a navigable geometric space where meaningful structure emerges. The system must process audio recordings, generate high-dimensional embeddings using Perch 2.0 (1536-D vectors), organize them with HNSW indexing in RuVector, and apply GNN-based learning to surface patterns such as call types, motifs, and behavioral contexts.
+
+The core challenge is designing an architecture that:
+
+1. **Handles diverse data pipelines** - From raw 32kHz audio to queryable vector embeddings
+2. **Scales to millions of call segments** - Real-world bioacoustic monitoring generates vast datasets
+3. **Supports scientific workflows** - Researchers need reproducibility, transparency, and evidence-backed interpretations (RAB pattern)
+4. **Enables real-time and batch processing** - Field deployments require streaming; research requires bulk analysis
+5. **Integrates ML inference efficiently** - ONNX-based Perch 2.0 inference in Rust for performance
+
+### Current State
+
+This is a greenfield project building upon:
+- **Perch 2.0**: Google DeepMind's bioacoustic embedding model (EfficientNet-B3 backbone, 1536-D output)
+- **RuVector**: Rust-based vector database with HNSW indexing and self-learning GNN layers
+- **RAB Pattern**: Retrieval-Augmented Bioacoustics for evidence-backed interpretation
+
+---
+
+## Decision Drivers
+
+### Performance Requirements
+- **Embedding generation**: Process 5-second audio segments at >100 segments/second
+- **Vector search**: Sub-millisecond kNN queries on 1M+ vectors (HNSW target: ~100 µs)
+- **Batch ingestion**: 1M vectors/minute build speed (RuVector baseline)
+- **Memory efficiency**: Support 32x compression for cold data tiers
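+
+To put the memory target in numbers: a 1536-D `f32` vector occupies 6,144 bytes, so 1M vectors need roughly 6.1 GB of raw payload, and a 32x ratio would bring that to about 192 MB. The sketch below only reproduces this arithmetic. It ignores HNSW graph overhead and quantization codebooks, and the function names are illustrative, not part of RuVector.
+
+```rust
+/// Bytes needed to store `count` uncompressed f32 vectors of `dims` dimensions.
+/// Hypothetical helper: covers the vector payload only, not index overhead.
+fn raw_bytes(count: u64, dims: u64) -> u64 {
+    count * dims * std::mem::size_of::<f32>() as u64
+}
+
+/// Bytes left after applying a fixed compression ratio (e.g. 32 for cold data).
+fn compressed_bytes(raw: u64, ratio: u64) -> u64 {
+    raw / ratio
+}
+```
+
+For example, `compressed_bytes(raw_bytes(1_000_000, 1536), 32)` evaluates to 192,000,000 bytes.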
+
+### Scalability Requirements
+- **Data volume**: Support 10K to 10M+ call segments per deployment
+- **Concurrent users**: Multiple researchers querying simultaneously
+- **Geographic distribution**: Sensor networks across multiple sites
+- **Temporal depth**: Years of historical recordings
+
+### Scientific Rigor Requirements
+- **Reproducibility**: Deterministic pipelines with versioned models and parameters
+- **Transparency**: RAB-style evidence packs citing retrieved calls for any interpretation
+- **Auditability**: Full provenance tracking from raw audio to conclusions
+- **Validation**: Built-in verification against ground truth labels
+
+### Operational Requirements
+- **Deployment flexibility**: Edge (sensor), cloud, and hybrid deployments
+- **Monitoring**: Health metrics, processing throughput, index quality
+- **Updates**: Hot-swap embedding models without full reindexing
+- **Recovery**: Graceful degradation and disaster recovery
+
+---
+
+## Considered Options
+
+### Option A: Monolithic Architecture
+
+A single application handling all concerns: audio processing, embedding generation, vector storage, GNN learning, API serving, and visualization.
+
+**Pros:**
+- Simplest deployment model
+- No inter-service communication overhead
+- Single codebase to maintain
+
+**Cons:**
+- Cannot scale components independently
+- Single point of failure
+- Difficult to update individual components
+- Memory pressure from co-located ML models
+- Not suitable for distributed sensor networks
+
+### Option B: Microservices Architecture
+
+Fully decomposed services: Audio Ingest Service, Embedding Service, Vector Store Service, GNN Learning Service, Query Service, Visualization Service, etc.
+
+**Pros:**
+- Independent scaling per service
+- Technology flexibility per service
+- Fault isolation
+- Team parallelization
+
+**Cons:**
+- Significant operational complexity
+- Network latency between services
+- Data consistency challenges
+- Overkill for initial team size
+- Complex debugging across service boundaries
+
+### Option C: Modular Monolith Architecture
+
+A single deployable unit with clearly separated internal modules, designed for future extraction into services if needed.
+
+**Pros:**
+- Maintains deployment simplicity
+- Clear module boundaries enable future splitting
+- In-process communication for performance-critical paths
+- Easier debugging and testing
+- Appropriate for current team/project scale
+- Can evolve toward microservices as needs emerge
+
+**Cons:**
+- Requires discipline to maintain module boundaries
+- All modules share the same runtime resources
+- Scaling requires scaling the entire application
+
+---
+
+## Decision Outcome
+
+**Chosen Option: Option C - Modular Monolith Architecture**
+
+We adopt a modular monolith architecture with clearly defined domain boundaries, designed with explicit seams that allow future extraction to services. This balances immediate development velocity with long-term architectural flexibility.
+
+### Rationale
+
+1. **Right-sized for current needs**: A small team building a new product benefits from deployment simplicity
+2. **Performance-critical paths stay in-process**: Audio-to-embedding-to-index flow benefits from zero network hops
+3. **Scientific workflow alignment**: Researchers prefer reproducible, debuggable systems over distributed complexity
+4. **Evolution path preserved**: Module boundaries are designed as potential service boundaries
+5. **RuVector integration**: RuVector is designed as an embeddable library, making monolith integration natural
+
+---
+
+## Technical Specifications
+
+### Module Architecture
+
+```
+sevensense/
+├── core/                      # Domain-agnostic foundations
+│   ├── config/               # Configuration management
+│   ├── error/                # Error types and handling
+│   ├── telemetry/            # Logging, metrics, tracing
+│   └── storage/              # Abstract storage interfaces
+│
+├── audio/                     # Audio Processing Domain
+│   ├── ingest/               # Audio file reading, streaming
+│   ├── segment/              # Call detection and segmentation
+│   ├── features/             # Acoustic feature extraction
+│   └── spectrogram/          # Mel spectrogram generation
+│
+├── embedding/                 # Embedding Generation Domain
+│   ├── perch/                # Perch 2.0 ONNX inference
+│   ├── models/               # Model versioning and registry
+│   ├── batch/                # Batch embedding pipelines
+│   └── normalize/            # Vector normalization (L2, etc.)
+│
+├── vectordb/                  # Vector Storage Domain (RuVector)
+│   ├── index/                # HNSW index management
+│   ├── graph/                # Graph structure (nodes, edges)
+│   ├── query/                # Similarity search, Cypher queries
+│   └── hyperbolic/           # Poincare ball embeddings
+│
+├── learning/                  # GNN Learning Domain
+│   ├── gnn/                  # GNN layers (GCN, GAT, GraphSAGE)
+│   ├── attention/            # Attention mechanisms
+│   ├── training/             # Self-supervised training loops
+│   └── refinement/           # Embedding refinement pipelines
+│
+├── analysis/                  # Analysis Domain
+│   ├── clustering/           # HDBSCAN, prototype extraction
+│   ├── sequence/             # Motif detection, transition analysis
+│   ├── entropy/              # Sequence entropy metrics
+│   └── validation/           # Ground truth comparison
+│
+├── rab/                       # Retrieval-Augmented Bioacoustics
+│   ├── evidence/             # Evidence pack construction
+│   ├── retrieval/            # Adaptive retrieval depth
+│   ├── interpretation/       # Constrained interpretation generation
+│   └── citation/             # Source attribution
+│
+├── api/                       # API Layer
+│   ├── rest/                 # REST endpoints
+│   ├── graphql/              # GraphQL schema and resolvers
+│   ├── websocket/            # Real-time streaming
+│   └── grpc/                 # gRPC for inter-service (future)
+│
+├── visualization/             # Visualization Domain
+│   ├── projection/           # UMAP/t-SNE dimensionality reduction
+│   ├── graph_viz/            # Network visualization
+│   ├── spectrogram_viz/      # Spectrogram rendering
+│   └── export/               # Export formats (JSON, PNG, etc.)
+│
+└── cli/                       # Command Line Interface
+    ├── ingest/               # Batch ingestion commands
+    ├── query/                # Query commands
+    ├── train/                # Training commands
+    └── export/               # Export commands
+```
+
+### Data Model
+
+#### Core Entities (Graph Nodes)
+
+```rust
+/// Raw audio recording from a sensor
+struct Recording {
+    id: Uuid,
+    sensor_id: String,
+    location: GeoPoint,          // lat, lon, elevation
+    start_timestamp: DateTime,
+    duration_ms: u32,
+    sample_rate: u32,            // 32000 Hz for Perch 2.0
+    channels: u8,
+    habitat: Option<String>,
+    weather: Option<WeatherData>,
+    file_path: PathBuf,
+    checksum: String,            // SHA-256 for reproducibility
+}
+
+/// Detected call segment within a recording
+struct CallSegment {
+    id: Uuid,
+    recording_id: Uuid,
+    start_ms: u32,
+    end_ms: u32,
+    snr_db: f32,                 // Signal-to-noise ratio
+    peak_frequency_hz: f32,
+    energy: f32,
+    detection_confidence: f32,
+    detection_method: String,    // "energy_threshold", "whisper_seg", etc.
+}
+
+/// Embedding vector for a call segment
+struct Embedding {
+    id: Uuid,
+    segment_id: Uuid,
+    model_id: String,            // "perch2_v1.0"
+    dimensions: u16,             // 1536 for Perch 2.0
+    vector: Vec<f32>,
+    normalized: bool,
+    created_at: DateTime,
+}
+
+/// Cluster prototype (centroid of similar calls)
+struct Prototype {
+    id: Uuid,
+    cluster_id: Uuid,
+    centroid_vector: Vec<f32>,
+    exemplar_ids: Vec<Uuid>,     // Representative segments
+    member_count: u32,
+    coherence_score: f32,
+}
+
+/// Cluster of similar call segments
+struct Cluster {
+    id: Uuid,
+    method: String,              // "hdbscan", "kmeans", etc.
+    parameters: HashMap<String, Value>,
+    created_at: DateTime,
+    validation_score: Option<f32>,
+}
+
+/// Optional taxonomic reference
+struct Taxon {
+    id: Uuid,
+    scientific_name: String,
+    common_name: String,
+    inat_id: Option<u64>,        // iNaturalist ID
+    ebird_code: Option<String>,  // eBird species code
+}
+```
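+
+The `normalized` flag on `Embedding` implies a normalization step before indexing. A minimal sketch of L2 normalization on a plain slice follows, assuming unit-length vectors are what the flag records. The function name and the choice to reject zero or non-finite norms are assumptions, not behavior taken from RuVector or Perch.
+
+```rust
+/// Scale `v` in place to unit L2 norm.
+/// Returns false and leaves `v` untouched when the norm is zero or not finite,
+/// so the caller can reject the segment instead of storing a bad vector.
+fn l2_normalize(v: &mut [f32]) -> bool {
+    let norm = v.iter().map(|x| x * x).sum::<f32>().sqrt();
+    if norm == 0.0 || !norm.is_finite() {
+        return false;
+    }
+    for x in v.iter_mut() {
+        *x /= norm;
+    }
+    true
+}
+```
+
+With unit-length vectors, cosine similarity reduces to a dot product, which is one reason to normalize once at ingestion rather than on every query.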
+
+#### Relationships (Graph Edges)
+
+```rust
+/// Recording contains segments
+edge HAS_SEGMENT: Recording -> CallSegment
+
+/// Temporal sequence within recording
+edge NEXT: CallSegment -> CallSegment {
+    delta_ms: u32,               // Time gap between calls
+}
+
+/// Acoustic similarity from HNSW
+edge SIMILAR: CallSegment -> CallSegment {
+    distance: f32,               // Cosine or Euclidean
+    rank: u8,                    // kNN rank (1 = nearest)
+}
+
+/// Cluster membership
+edge ASSIGNED_TO: CallSegment -> Cluster
+
+/// Prototype ownership
+edge HAS_PROTOTYPE: Cluster -> Prototype
+
+/// Species identification (when available)
+edge IDENTIFIED_AS: CallSegment -> Taxon {
+    confidence: f32,
+    method: String,              // "manual", "model", "consensus"
+}
+```
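+
+The `edge` notation above is schema shorthand rather than compilable Rust. One possible encoding is an enum with per-variant properties, sketched below. It uses `u64` ids in place of `Uuid` to avoid external crates and swaps the `method` string for an enum. Both are choices made for this illustration, and `next_segments` is a hypothetical helper.
+
+```rust
+/// Stand-in for `Uuid` so the sketch compiles without external crates.
+type NodeId = u64;
+
+/// How a species identification was produced.
+#[derive(Debug, Clone, Copy, PartialEq)]
+enum IdMethod { Manual, Model, Consensus }
+
+/// One variant per edge type; properties live on the variant that needs them.
+#[derive(Debug, Clone, PartialEq)]
+enum EdgeKind {
+    HasSegment,
+    Next { delta_ms: u32 },
+    Similar { distance: f32, rank: u8 },
+    AssignedTo,
+    HasPrototype,
+    IdentifiedAs { confidence: f32, method: IdMethod },
+}
+
+#[derive(Debug, Clone, PartialEq)]
+struct Edge { from: NodeId, to: NodeId, kind: EdgeKind }
+
+/// Ids of the segments that directly follow `segment` via NEXT edges.
+fn next_segments(edges: &[Edge], segment: NodeId) -> Vec<NodeId> {
+    edges.iter()
+        .filter(|e| e.from == segment && matches!(e.kind, EdgeKind::Next { .. }))
+        .map(|e| e.to)
+        .collect()
+}
+```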
+
+### Processing Pipeline
+
+```
+┌─────────────────────────────────────────────────────────────────────────┐
+│                         INGESTION PIPELINE                               │
+├─────────────────────────────────────────────────────────────────────────┤
+│                                                                          │
+│  ┌──────────┐    ┌──────────┐    ┌──────────┐    ┌──────────┐          │
+│  │  Audio   │───▶│ Segment  │───▶│   Mel    │───▶│ Perch2.0 │          │
+│  │  Input   │    │Detection │    │Spectrogram│   │  ONNX    │          │
+│  │(32kHz,5s)│    │          │    │(500x128) │    │          │          │
+│  └──────────┘    └──────────┘    └──────────┘    └──────────┘          │
+│       │               │               │               │                  │
+│       │               │               │               ▼                  │
+│       │               │               │         ┌──────────┐            │
+│       │               │               │         │Embedding │            │
+│       │               │               │         │ (1536-D) │            │
+│       │               │               │         └──────────┘            │
+│       │               │               │               │                  │
+└───────┼───────────────┼───────────────┼───────────────┼──────────────────┘
+        │               │               │               │
+        ▼               ▼               ▼               ▼
+┌─────────────────────────────────────────────────────────────────────────┐
+│                         STORAGE LAYER                                    │
+├─────────────────────────────────────────────────────────────────────────┤
+│                                                                          │
+│  ┌──────────────────────────────────────────────────────────────┐       │
+│  │                        RuVector                               │       │
+│  │  ┌─────────────┐  ┌─────────────┐  ┌─────────────────────┐  │       │
+│  │  │   HNSW      │  │   Graph     │  │   Metadata Store    │  │       │
+│  │  │   Index     │  │   Store     │  │   (Recordings,      │  │       │
+│  │  │             │  │   (Edges)   │  │    Segments, etc.)  │  │       │
+│  │  └─────────────┘  └─────────────┘  └─────────────────────┘  │       │
+│  └──────────────────────────────────────────────────────────────┘       │
+│                                                                          │
+└─────────────────────────────────────────────────────────────────────────┘
+        │               │               │               │
+        ▼               ▼               ▼               ▼
+┌─────────────────────────────────────────────────────────────────────────┐
+│                         LEARNING LAYER                                   │
+├─────────────────────────────────────────────────────────────────────────┤
+│                                                                          │
+│  ┌──────────────┐    ┌──────────────┐    ┌──────────────┐              │
+│  │    GNN       │    │  Attention   │    │  Hyperbolic  │              │
+│  │  Reranker    │───▶│   Layers     │───▶│  Refinement  │              │
+│  │(GCN/GAT/SAGE)│    │              │    │  (Poincare)  │              │
+│  └──────────────┘    └──────────────┘    └──────────────┘              │
+│         │                   │                   │                        │
+│         └───────────────────┴───────────────────┘                        │
+│                             │                                            │
+│                             ▼                                            │
+│                    ┌──────────────┐                                     │
+│                    │   Refined    │                                     │
+│                    │  Embeddings  │                                     │
+│                    └──────────────┘                                     │
+│                                                                          │
+└─────────────────────────────────────────────────────────────────────────┘
+        │               │               │               │
+        ▼               ▼               ▼               ▼
+┌─────────────────────────────────────────────────────────────────────────┐
+│                         ANALYSIS LAYER                                   │
+├─────────────────────────────────────────────────────────────────────────┤
+│                                                                          │
+│  ┌──────────┐  ┌──────────┐  ┌──────────┐  ┌──────────┐  ┌──────────┐ │
+│  │Clustering│  │ Sequence │  │ Anomaly  │  │  Entropy │  │   RAB    │ │
+│  │(HDBSCAN) │  │  Mining  │  │Detection │  │  Metrics │  │ Evidence │ │
+│  └──────────┘  └──────────┘  └──────────┘  └──────────┘  └──────────┘ │
+│                                                                          │
+└─────────────────────────────────────────────────────────────────────────┘
+        │
+        ▼
+┌─────────────────────────────────────────────────────────────────────────┐
+│                         API / PRESENTATION                               │
+├─────────────────────────────────────────────────────────────────────────┤
+│                                                                          │
+│  ┌──────────┐  ┌──────────┐  ┌──────────┐  ┌──────────┐  ┌──────────┐ │
+│  │   REST   │  │ GraphQL  │  │WebSocket │  │   CLI    │  │   WASM   │ │
+│  │   API    │  │   API    │  │(Streaming)│ │          │  │ (Browser)│ │
+│  └──────────┘  └──────────┘  └──────────┘  └──────────┘  └──────────┘ │
+│                                                                          │
+└─────────────────────────────────────────────────────────────────────────┘
+```
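+
+The shapes in the ingestion diagram can be sanity-checked with a little arithmetic. A 5-second window at 32 kHz holds 160,000 samples. If the 500 in `(500x128)` is the time-frame axis, the implied hop is 320 samples (10 ms). That reading is an assumption: the ADR does not state the hop or window length, and the helper names below are illustrative.
+
+```rust
+/// Number of samples in a window of `duration_ms` at `sample_rate` Hz.
+fn samples_per_window(sample_rate: u32, duration_ms: u32) -> u64 {
+    sample_rate as u64 * duration_ms as u64 / 1000
+}
+
+/// Hop size in samples that spreads `frames` evenly over `samples`.
+/// Assumes non-overlapping hops cover the window exactly.
+fn implied_hop(samples: u64, frames: u64) -> u64 {
+    samples / frames
+}
+```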
+
+### Key Interfaces Between Modules
+
+```rust
+// Audio -> Embedding interface
+trait AudioEmbedder {
+    fn embed_segment(&self, audio: &AudioSegment) -> Result<Embedding>;
+    fn embed_batch(&self, segments: &[AudioSegment]) -> Result<Vec<Embedding>>;
+    fn model_info(&self) -> ModelInfo;
+}
+
+// Embedding -> VectorDB interface
+trait VectorStore {
+    fn insert(&mut self, embedding: &Embedding) -> Result<()>;
+    fn search_knn(&self, query: &[f32], k: usize) -> Result<Vec<SearchResult>>;
+    fn get_neighbors(&self, id: Uuid) -> Result<Vec<Neighbor>>;
+    fn build_similarity_edges(&mut self, k: usize) -> Result<usize>;
+}
+
+// VectorDB -> Learning interface
+trait GraphLearner {
+    fn train_step(&mut self, graph: &Graph) -> Result<TrainMetrics>;
+    fn refine_embeddings(&self, embeddings: &mut [Embedding]) -> Result<()>;
+    fn attention_weights(&self, node_id: Uuid) -> Result<Vec<(Uuid, f32)>>;
+}
+
+// Learning -> Analysis interface
+trait PatternAnalyzer {
+    fn cluster(&self, embeddings: &[Embedding]) -> Result<Vec<Cluster>>;
+    fn find_motifs(&self, sequences: &[Sequence]) -> Result<Vec<Motif>>;
+    fn compute_entropy(&self, transitions: &TransitionMatrix) -> f32;
+}
+
+// Analysis -> RAB interface
+trait EvidenceBuilder {
+    fn build_pack(&self, query: &Query) -> Result<EvidencePack>;
+    fn generate_interpretation(&self, pack: &EvidencePack) -> Result<Interpretation>;
+    fn cite_sources(&self, interpretation: &Interpretation) -> Vec<Citation>;
+}
+```
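+
+Because these traits are the seams between modules, each can be backed by a simple in-memory implementation in tests. The sketch below is a brute-force stand-in for the `search_knn` part of `VectorStore`. It is deliberately simplified: `u64` ids instead of `Uuid`, no `Result` type, and an exhaustive cosine scan instead of HNSW. `KnnStore` and `BruteForceStore` are hypothetical names, not RuVector types.
+
+```rust
+/// Simplified stand-in for the ADR's `VectorStore` (u64 ids, no error type).
+trait KnnStore {
+    fn insert(&mut self, id: u64, vector: Vec<f32>);
+    fn search_knn(&self, query: &[f32], k: usize) -> Vec<(u64, f32)>;
+}
+
+#[derive(Default)]
+struct BruteForceStore {
+    items: Vec<(u64, Vec<f32>)>,
+}
+
+/// Cosine distance (1 - cosine similarity); 1.0 if either vector has zero norm.
+fn cosine_distance(a: &[f32], b: &[f32]) -> f32 {
+    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
+    let na = a.iter().map(|x| x * x).sum::<f32>().sqrt();
+    let nb = b.iter().map(|x| x * x).sum::<f32>().sqrt();
+    if na == 0.0 || nb == 0.0 { 1.0 } else { 1.0 - dot / (na * nb) }
+}
+
+impl KnnStore for BruteForceStore {
+    fn insert(&mut self, id: u64, vector: Vec<f32>) {
+        self.items.push((id, vector));
+    }
+
+    /// Exhaustive scan: score every stored vector, sort ascending, keep `k`.
+    fn search_knn(&self, query: &[f32], k: usize) -> Vec<(u64, f32)> {
+        let mut scored: Vec<(u64, f32)> = self.items.iter()
+            .map(|(id, v)| (*id, cosine_distance(query, v)))
+            .collect();
+        scored.sort_by(|a, b| a.1.total_cmp(&b.1));
+        scored.truncate(k);
+        scored
+    }
+}
+```
+
+An exact scan like this is O(n) per query and only suitable for small fixtures, but it gives approximate indexes a ground truth to be checked against.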
+
+### Configuration Structure
+
+```yaml
+# sevensense.yaml
+sevensense:
+  # Audio processing settings
+  audio:
+    sample_rate: 32000          # Perch 2.0 requirement
+    segment_duration_ms: 5000   # 5 seconds
+    segment_overlap_ms: 500     # Overlap for continuity
+    min_snr_db: 10.0            # Minimum signal-to-noise ratio (dB)
+    detection_method: "energy_threshold"  # or "whisper_seg", "tweety"
+
+  # Embedding generation
+  embedding:
+    model: "perch2_v1.0"
+    onnx_path: "./models/perch2.onnx"
+    dimensions: 1536
+    normalize: true
+    batch_size: 32
+
+  # Vector database (RuVector)
+  vectordb:
+    index_type: "hnsw"
+    hnsw:
+      m: 16                     # Connections per node
+      ef_construction: 200      # Build-time search width
+      ef_search: 100           # Query-time search width
+    distance_metric: "cosine"   # or "euclidean", "poincare"
+    enable_hyperbolic: false    # Experimental
+    compression:
+      hot_tier: "none"
+      warm_tier: "pq_8"        # Product quantization
+      cold_tier: "pq_4"        # Aggressive compression
+
+  # GNN learning
+  learning:
+    enabled: true
+    gnn_type: "gat"            # GCN, GAT, or GraphSAGE
+    hidden_dim: 256
+    num_layers: 2
+    attention_heads: 4
+    learning_rate: 0.001
+    training_interval_hours: 24
+
+  # Analysis settings
+  analysis:
+    clustering:
+      method: "hdbscan"
+      min_cluster_size: 10
+      min_samples: 5
+    sequence:
+      max_gap_ms: 2000         # Max silence between calls
+      min_motif_length: 3
+
+  # RAB settings
+  rab:
+    retrieval_k: 10            # Neighbors to retrieve
+    min_confidence: 0.7
+    cite_exemplars: true
+
+  # API settings
+  api:
+    host: "0.0.0.0"
+    port: 8080
+    enable_graphql: true
+    enable_websocket: true
+    cors_origins: ["*"]
+
+  # Telemetry
+  telemetry:
+    log_level: "info"
+    metrics_port: 9090
+    tracing_enabled: true
+    tracing_endpoint: "http://localhost:4317"
+```
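+
+The configuration does not say how `segment_duration_ms` and `segment_overlap_ms` combine. One plausible reading is fixed-length sliding windows whose hop is the duration minus the overlap (4,500 ms with the values above). The sketch below implements that reading only. Detection-driven segmentation may behave differently, and `window_bounds` is a hypothetical helper.
+
+```rust
+/// Start/end offsets (ms) of fixed-length windows over a recording.
+/// Consecutive windows share `overlap_ms`; a trailing partial window is dropped.
+fn window_bounds(total_ms: u64, window_ms: u64, overlap_ms: u64) -> Vec<(u64, u64)> {
+    assert!(window_ms > 0 && overlap_ms < window_ms, "overlap must be smaller than window");
+    let hop = window_ms - overlap_ms;
+    let mut out = Vec::new();
+    let mut start = 0u64;
+    while start + window_ms <= total_ms {
+        out.push((start, start + window_ms));
+        start += hop;
+    }
+    out
+}
+```
+
+For a 14-second recording with the values above, this yields three windows starting at 0, 4,500, and 9,000 ms.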
+
+---
+
+## Consequences
+
+### Positive Consequences
+
+1. **Development velocity**: Single deployment simplifies CI/CD and local development
+2. **Performance**: Critical audio-to-index path has zero network overhead
+3. **Debugging**: Stack traces span the entire flow; no distributed tracing required initially
+4. **Testing**: Integration tests run in-process without container orchestration
+5. **Scientific reproducibility**: A single binary with pinned dependencies helps keep results consistent across runs and machines
+6. **Resource efficiency**: Shared memory pools and caches across modules
+7. **Evolution path**: Clear module boundaries allow extraction to services when justified
+
+### Negative Consequences
+
+1. **Scaling limitations**: Cannot scale embedding generation independently from query serving
+2. **Deployment coupling**: Updates to any module require full redeployment
+3. **Resource contention**: GNN training may compete with query serving for CPU/memory
+4. **Technology constraints**: All modules must work within the Rust ecosystem (mitigated by FFI)
+
+### Mitigation Strategies
+
+| Risk | Mitigation |
+|------|------------|
+| Scaling limitations | Design async job queues that could become external workers |
+| Deployment coupling | Blue-green deployments with health checks |
+| Resource contention | Configurable resource limits per module; background training scheduling |
+| Technology constraints | ONNX runtime for ML; FFI bindings for specialized libraries |
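+
+The first mitigation, job queues that could later become external workers, can be prototyped in-process with standard-library channels. The sketch below assumes a single worker thread and a placeholder job body. `Job`, `spawn_worker`, and `run_batch` are hypothetical names, and a production queue would add backpressure, retries, and persistence.
+
+```rust
+use std::sync::mpsc;
+use std::thread;
+
+/// Hypothetical job type; a real job would carry segment ids or audio buffers.
+enum Job {
+    Embed(u64),
+    Shutdown,
+}
+
+/// Spawn a worker that drains the queue and reports finished job ids.
+fn spawn_worker(jobs: mpsc::Receiver<Job>, done: mpsc::Sender<u64>) -> thread::JoinHandle<()> {
+    thread::spawn(move || {
+        for job in jobs {
+            match job {
+                // Placeholder for the actual embedding work.
+                Job::Embed(id) => {
+                    if done.send(id).is_err() {
+                        break; // receiver is gone, stop working
+                    }
+                }
+                Job::Shutdown => break,
+            }
+        }
+    })
+}
+
+/// Submit `ids` as jobs and collect the completed ids in submission order.
+fn run_batch(ids: &[u64]) -> Vec<u64> {
+    let (job_tx, job_rx) = mpsc::channel();
+    let (done_tx, done_rx) = mpsc::channel();
+    let worker = spawn_worker(job_rx, done_tx);
+    for &id in ids {
+        job_tx.send(Job::Embed(id)).expect("worker alive");
+    }
+    job_tx.send(Job::Shutdown).expect("worker alive");
+    worker.join().expect("worker panicked");
+    // The worker dropped its sender on exit, so this iterator terminates.
+    done_rx.iter().collect()
+}
+```
+
+Keeping the producer and worker behind channel endpoints means the channel could later be replaced by a network queue without changing the callers' shape.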
+
+---
+
+## Related Decisions
+
+- **ADR-002**: Perch 2.0 Integration Strategy (ONNX vs. birdnet-onnx crate)
+- **ADR-003**: HNSW vs. Hyperbolic Space Configuration
+- **ADR-004**: GNN Training Strategy (Online vs. Batch)
+- **ADR-005**: RAB Evidence Pack Schema
+- **ADR-006**: API Design (REST/GraphQL/gRPC)
+
+---
+
+## Compliance and Standards
+
+### Scientific Standards
+- All embeddings include model version and parameters for reproducibility
+- Evidence packs include full retrieval citations per RAB methodology
+- Validation metrics align with published benchmarks (V-measure, silhouette scores)
+
+### Data Standards
+- Audio metadata follows Darwin Core / TDWG standards where applicable
+- Taxonomic references link to iNaturalist and eBird identifiers
+- Geospatial data uses WGS84 coordinates
+
+### Security Considerations
+- No PII in bioacoustic data (sensor IDs are pseudonymous)
+- API authentication via JSON Web Tokens (JWT)
+- Audit logging for all data modifications
+
+---
+
+## References
+
+1. Perch 2.0 Paper: "The Bittern Lesson for Bioacoustics" (arXiv:2508.04665)
+2. RuVector Documentation: https://github.com/ruvnet/ruvector
+3. HNSW Paper: "Efficient and Robust Approximate Nearest Neighbor Search Using Hierarchical Navigable Small World Graphs" (arXiv:1603.09320)
+4. RAB Pattern: Retrieval-Augmented Bioacoustics methodology
+5. AVN Deep Learning Study: "A deep learning approach for the analysis of birdsong" (eLife 2025)
+
+---
+
+## Revision History
+
+| Version | Date | Author | Changes |
+|---------|------|--------|---------|
+| 1.0 | 2026-01-15 | 7sense Architecture Team | Initial version |