## What Perch 2.0 changes for a RuVector pipeline
Perch 2.0 is explicitly designed to produce embeddings that stay useful under domain shift and support workflows like nearest-neighbor retrieval, clustering, and linear probes on modest hardware. ([arXiv][1])
Key technical facts that matter for engineering:
* Input is **5 second mono audio at 32 kHz** (160,000 samples), with a log-mel frontend producing **500 frames x 128 mel bins (60 Hz to 16 kHz)**. ([arXiv][2])
* Backbone is **EfficientNet-B3**, and the mean pooled embedding is **1536-D**. ([arXiv][2])
* Training includes:
* supervised species classification,
* **prototype-learning classifier head** used for self-distillation,
* and an auxiliary **source-prediction** objective. ([arXiv][2])
* It is multi-taxa and reports SOTA on BirdSet and BEANS, plus strong marine transfer despite little marine training data. ([arXiv][1])
* DeepMind describes this Perch release as an open model and points to Kaggle availability. ([Google DeepMind][3])
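The input contract above is worth pinning down as code, since every downstream buffer size follows from it. A minimal sketch of the published constants (5 s x 32 kHz mono, 1536-D embedding); the constant names are my own:

```rust
// Perch 2.0 input geometry as published: 5 s of mono audio at 32 kHz.
// Constant names are illustrative, not from any Perch API.
const SAMPLE_RATE_HZ: usize = 32_000;
const WINDOW_SECONDS: usize = 5;
const WINDOW_SAMPLES: usize = SAMPLE_RATE_HZ * WINDOW_SECONDS; // 160_000
const EMBEDDING_DIM: usize = 1536; // EfficientNet-B3 mean-pooled embedding

fn main() {
    assert_eq!(WINDOW_SAMPLES, 160_000);
    println!("{WINDOW_SAMPLES} samples per window, {EMBEDDING_DIM}-D embedding");
}
```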
Why this is a big deal for RuVector: once embeddings are “good enough,” HNSW stops being a storage trick and becomes a navigable map where neighborhoods are meaningful. RuVector's whole value proposition is then unlocked: fast HNSW retrieval, plus a learnable GNN reranker and attention on top of the neighbor graph. ([GitHub][4])
## RAB is the right framing for “interpretation” without hallucination
Retrieval-Augmented Bioacoustics (RAB) is basically “RAG for animal sound,” with two design choices that align perfectly with a RuVector substrate:
1. adapt retrieval depth based on signal quality
2. cite the retrieved calls directly in the generated output for transparency
That is exactly how you keep “translation” honest: you are not translating meaning, you are producing an evidence-guided structural interpretation.
## Practical integration blueprint: Perch 2.0 + RuVector + RAB
### 1) Ingestion schema in RuVector
Model the world as both vectors and a graph:
**Nodes**
* `Recording {id, sensor_id, lat, lon, start_ts, habitat, weather, ...}`
* `CallSegment {id, recording_id, t0_ms, t1_ms, snr, energy, ...}`
* `Embedding {id, segment_id, model="perch2", dim=1536, ...}`
* `Prototype {id, cluster_id, centroid_vec, exemplars[]}`
* `Cluster {id, method, params, ...}`
* optional: `Taxon {inat_id, scientific_name, common_name}`
**Edges**
* `(:Recording)-[:HAS_SEGMENT]->(:CallSegment)`
* `(:CallSegment)-[:NEXT {dt_ms}]->(:CallSegment)` for sequences
* `(:CallSegment)-[:SIMILAR {dist}]->(:CallSegment)` from HNSW neighbors
* `(:Cluster)-[:HAS_PROTOTYPE]->(:Prototype)`
* `(:CallSegment)-[:ASSIGNED_TO]->(:Cluster)` (after clustering)
RuVector already supports storing embeddings and querying with Cypher-style graph queries, plus a GNN refinement layer that applies multi-head attention over neighbors. ([GitHub][4])
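As a concreteness check, the node and edge types above can be mirrored as plain Rust structs. This is a minimal in-memory sketch; the field types are assumptions, not RuVector's actual storage types:

```rust
// In-memory mirrors of the ingestion schema. Field names follow the schema
// above; the concrete types are illustrative assumptions.
struct Recording { id: u64, sensor_id: u32, lat: f64, lon: f64, start_ts: i64 }
struct CallSegment { id: u64, recording_id: u64, t0_ms: u32, t1_ms: u32, snr: f32 }
struct Embedding { id: u64, segment_id: u64, model: &'static str, vec: Vec<f32> }

// A SIMILAR edge from HNSW neighbors: (from_segment, to_segment, distance).
type SimilarEdge = (u64, u64, f32);

fn main() {
    let seg = CallSegment { id: 1, recording_id: 7, t0_ms: 0, t1_ms: 5_000, snr: 12.5 };
    let emb = Embedding { id: 1, segment_id: seg.id, model: "perch2", vec: vec![0.0; 1536] };
    let edge: SimilarEdge = (1, 2, 0.13);
    assert_eq!(emb.vec.len(), 1536);
    println!("segment {} -> {}-D embedding, edge dist {}", seg.id, emb.vec.len(), edge.2);
}
```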
### 2) Embedding in Rust, not Python
You have two very practical Rust-first options:
**Option A: ONNX Runtime**
There are published Perch v2 ONNX conversions with concrete tensor shapes:
* input: `['batch', 160000]`
* outputs include: `embedding ['batch', 1536]`, plus spectrogram and logits ([Hugging Face][5])
That gets you native Rust inference with `onnxruntime` bindings, and you can keep everything in the same process as RuVector.
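Before any inference call, raw audio has to be cut into fixed 160,000-sample windows matching the `['batch', 160000]` input shape. A std-only sketch of that windowing step, zero-padding the final partial window (the padding strategy is an assumption; check it against how the model was exported):

```rust
// Split mono 32 kHz audio into fixed 5 s windows of 160_000 samples each,
// zero-padding the tail, to match the ONNX input shape ['batch', 160000].
const WINDOW_SAMPLES: usize = 32_000 * 5;

fn window_audio(samples: &[f32]) -> Vec<Vec<f32>> {
    samples
        .chunks(WINDOW_SAMPLES)
        .map(|c| {
            let mut w = c.to_vec();
            w.resize(WINDOW_SAMPLES, 0.0); // pad the final partial window
            w
        })
        .collect()
}

fn main() {
    // 12 s of audio -> 3 windows (5 s, 5 s, and 2 s padded to 5 s).
    let audio = vec![0.0f32; 32_000 * 12];
    let batch = window_audio(&audio);
    assert_eq!(batch.len(), 3);
    assert!(batch.iter().all(|w| w.len() == WINDOW_SAMPLES));
}
```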
**Option B: Use an existing Rust crate that already supports Perch v2**
There is a Rust library `birdnet-onnx` that supports Perch v2 inference (32 kHz, 5 s segments) and returns predictions. ([Docs.rs][6])
Even if you do not keep it long-term, it is an excellent “verification harness” to de-risk the pipeline.
### 3) The retrieval core: HNSW is your “acoustic cartography”
For each `CallSegment`:
1. embed with Perch 2.0 -> a 1536-element `Vec<f32>`
2. insert vector into RuVector
3. store metadata and computed features (snr, pitch stats, rhythm, spectral centroid)
4. periodically (or continuously) rebuild neighbor edges `SIMILAR` from top-k
Once you have this, you instantly get:
* nearest-neighbor “find similar calls”
* cluster discovery (call types, dialects, soundscape regimes)
* anomaly detection (rare calls, new species, anthropogenic intrusions)
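For small corpora, and for verifying HNSW recall on large ones, brute-force cosine kNN is a useful reference implementation of the `SIMILAR`-edge step. A minimal sketch (not RuVector's API, just the underlying math):

```rust
// Brute-force cosine kNN as a stand-in for HNSW: exact, O(n) per query,
// useful for building SIMILAR edges on small sets and checking HNSW recall.
fn cosine_dist(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let na: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let nb: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    1.0 - dot / (na * nb)
}

fn top_k(query: &[f32], corpus: &[Vec<f32>], k: usize) -> Vec<(usize, f32)> {
    let mut scored: Vec<(usize, f32)> = corpus
        .iter()
        .enumerate()
        .map(|(i, v)| (i, cosine_dist(query, v)))
        .collect();
    scored.sort_by(|a, b| a.1.partial_cmp(&b.1).unwrap());
    scored.truncate(k);
    scored
}

fn main() {
    let corpus = vec![vec![1.0, 0.0], vec![0.0, 1.0], vec![0.9, 0.1]];
    let nn = top_k(&[1.0, 0.0], &corpus, 2);
    assert_eq!(nn[0].0, 0); // identical direction ranks first
    assert_eq!(nn[1].0, 2); // near-duplicate ranks second
}
```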
### 4) Add the GNN and attention where it matters
Use the graph as supervision:
* acoustic edges from HNSW (similarity)
* temporal edges from `NEXT` (syntax)
* optional co-occurrence edges (same time window, same sensor neighborhood)
Then train a lightweight GNN reranker whose job is not “classify species,” but:
* re-rank neighbors for retrieval quality
* increase cluster coherence
* learn transition regularities
This matches RuVector's “HNSW retrieval then GNN enhancement” pattern. ([GitHub][4])
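To make the reranking objective concrete, here is a hand-weighted scorer as a placeholder for the learned GNN: each HNSW candidate is scored by vector distance plus graph-derived features (shared cluster, temporal adjacency). The feature set and weights are illustrative assumptions; the GNN's job is to learn them:

```rust
// Placeholder reranker: combine raw HNSW distance with graph evidence.
// In the real system these weights would be learned by the GNN, not fixed.
struct Candidate { id: u64, dist: f32, same_cluster: bool, follows_query: bool }

fn rerank(mut cands: Vec<Candidate>) -> Vec<u64> {
    let score = |c: &Candidate| {
        -c.dist
            + if c.same_cluster { 0.2 } else { 0.0 }   // cluster-coherence bonus
            + if c.follows_query { 0.1 } else { 0.0 }  // temporal-adjacency bonus
    };
    cands.sort_by(|a, b| score(b).partial_cmp(&score(a)).unwrap());
    cands.into_iter().map(|c| c.id).collect()
}

fn main() {
    let ranked = rerank(vec![
        Candidate { id: 1, dist: 0.10, same_cluster: false, follows_query: false },
        Candidate { id: 2, dist: 0.25, same_cluster: true, follows_query: true },
    ]);
    assert_eq!(ranked, vec![2, 1]); // graph evidence outranks raw distance here
}
```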
### 5) RAB layer: evidence packs + constrained generation
For any query (a segment, a time interval, a habitat), build an **Evidence Pack**:
* top-k neighbors (IDs, distances)
* k cluster exemplars (prototype calls)
* top predicted taxa (if you choose to surface logits)
* local sequence context (previous and next segments)
* signal quality (snr, clipping, overlap score)
* spectrogram thumbnails
Then generation produces only these kinds of outputs:
* monitoring summary
* annotation suggestions
* “this resembles X and Y exemplars, differs by Z”
* hypothesis prompts for researchers
And it must cite which retrieved calls informed each statement, matching the RAB proposal's attribution emphasis.
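The attribution requirement can be enforced structurally: if the pack carries no evidence, no statement is emitted. A minimal sketch of an Evidence Pack type with mandatory citations; all field names here are assumptions:

```rust
// Sketch of an Evidence Pack: every generated claim carries the IDs of the
// retrieved segments that back it. Field names are illustrative assumptions.
struct EvidencePack {
    query_segment: u64,
    neighbors: Vec<(u64, f32)>, // (segment_id, distance) from HNSW
    exemplars: Vec<u64>,        // cluster prototype segment IDs
    snr: f32,                   // signal quality gate
}

impl EvidencePack {
    // Render a cited statement; refuse to emit anything with zero evidence.
    fn cited_summary(&self) -> Option<String> {
        if self.neighbors.is_empty() {
            return None;
        }
        let ids: Vec<String> = self.neighbors.iter().map(|(id, _)| format!("#{id}")).collect();
        Some(format!("segment #{} resembles [{}]", self.query_segment, ids.join(", ")))
    }
}

fn main() {
    let pack = EvidencePack {
        query_segment: 42,
        neighbors: vec![(7, 0.12), (9, 0.15)],
        exemplars: vec![7],
        snr: 14.0,
    };
    assert_eq!(pack.cited_summary().unwrap(), "segment #42 resembles [#7, #9]");
}
```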
## Verification that the geometry is real
Here is a verification stack that starts cheap and becomes rigorous.
### Level 1: Mechanical correctness
* audio is actually 32 kHz mono
* 5s windows align with model expectations ([arXiv][2])
* embedding norms are stable (no NaNs, no collapse)
* duplicate audio -> near-identical embedding
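The Level 1 checks are cheap enough to run on every embedding at ingest time. A minimal sketch (the dimension and thresholds are assumptions):

```rust
// Level 1 mechanical checks: right dimension, finite values, no collapse
// to a near-zero vector. Thresholds here are illustrative.
fn check_embedding(v: &[f32]) -> Result<(), &'static str> {
    if v.len() != 1536 {
        return Err("wrong dimension");
    }
    if v.iter().any(|x| !x.is_finite()) {
        return Err("NaN or inf");
    }
    let norm = v.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm < 1e-6 {
        return Err("zero vector (collapse?)");
    }
    Ok(())
}

fn main() {
    let good = vec![0.01f32; 1536];
    let bad = vec![f32::NAN; 1536];
    assert!(check_embedding(&good).is_ok());
    assert_eq!(check_embedding(&bad), Err("NaN or inf"));
    assert_eq!(check_embedding(&[0.0f32; 10]), Err("wrong dimension"));
}
```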
### Level 2: Retrieval sanity
Pick 50 known calls (or manually curated exemplars):
* do nearest-neighbor retrieval
* manually check if top 10 are genuinely similar
Perch's own evaluation includes one-shot, retrieval-style tests using cosine distance as a proxy for clustering usefulness, which is exactly your use case. ([arXiv][7])
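The "manually check the top 10" step reduces to a precision@k number once a human has judged the neighbors. A minimal sketch of that metric over your 50 curated exemplars:

```rust
// Precision@k over a hand-curated set: the fraction of the top-k retrieved
// IDs that a human judged genuinely similar to the query.
fn precision_at_k(retrieved: &[u64], judged_similar: &[u64], k: usize) -> f32 {
    let hits = retrieved
        .iter()
        .take(k)
        .filter(|&&id| judged_similar.contains(&id))
        .count();
    hits as f32 / k as f32
}

fn main() {
    let retrieved = [3u64, 8, 5, 1];
    let judged = [3u64, 5, 9];
    assert_eq!(precision_at_k(&retrieved, &judged, 4), 0.5); // 2 of 4 hits
}
```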
### Level 3: Few-shot probes
Train linear probes on small labeled subsets:
* species
* call type
* habitat context
* sensor ID (should be weak if embeddings are not overfitting device artifacts)
Perch 2.0 is explicitly oriented toward strong linear probing and retrieval without full fine-tuning. ([arXiv][1])
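A full linear probe is typically a logistic regression on frozen embeddings; as an even cheaper first pass, a nearest-centroid probe already tells you whether classes separate in embedding space. A sketch of that stand-in (not the paper's evaluation protocol):

```rust
// Nearest-centroid probe: a cheap stand-in for a linear probe. If classes
// separate by centroid distance, a proper linear probe will usually work too.
fn centroid(vecs: &[&[f32]]) -> Vec<f32> {
    let mut c = vec![0.0; vecs[0].len()];
    for v in vecs {
        for (ci, vi) in c.iter_mut().zip(*v) {
            *ci += vi;
        }
    }
    for ci in c.iter_mut() {
        *ci /= vecs.len() as f32;
    }
    c
}

fn classify(x: &[f32], centroids: &[Vec<f32>]) -> usize {
    let d = |c: &[f32]| c.iter().zip(x).map(|(a, b)| (a - b).powi(2)).sum::<f32>();
    centroids
        .iter()
        .enumerate()
        .min_by(|a, b| d(a.1).partial_cmp(&d(b.1)).unwrap())
        .unwrap()
        .0
}

fn main() {
    let class0 = centroid(&[&[0.0, 0.1], &[0.1, 0.0]]);
    let class1 = centroid(&[&[1.0, 0.9], &[0.9, 1.0]]);
    assert_eq!(classify(&[0.05, 0.05], &[class0.clone(), class1.clone()]), 0);
    assert_eq!(classify(&[0.95, 0.95], &[class0, class1]), 1);
}
```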
### Level 4: Sequence validity
Check whether your transition graph produces:
* stable motifs
* repeated trajectories
* entropy rates that differ by condition or location
If you want “motif truth,” DTW can be your high-precision confirmation step for a small subset, not your global engine.
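The "entropy rates that differ by condition" check can start from first-order transition statistics over cluster labels. A minimal sketch computing the entropy of the transition-pair distribution; lower values mean more stereotyped "syntax":

```rust
// Entropy of the first-order transition distribution over a cluster-label
// sequence. Compare across sites or conditions; a rigid A-B-A-B pattern
// scores lower than a mixed one.
use std::collections::HashMap;

fn transition_entropy(labels: &[u32]) -> f64 {
    let mut counts: HashMap<(u32, u32), f64> = HashMap::new();
    for w in labels.windows(2) {
        *counts.entry((w[0], w[1])).or_insert(0.0) += 1.0;
    }
    let total: f64 = counts.values().sum();
    -counts
        .values()
        .map(|c| {
            let p = c / total;
            p * p.log2()
        })
        .sum::<f64>()
}

fn main() {
    let rigid = [1, 2, 1, 2, 1, 2, 1, 2]; // alternating motif
    let mixed = [1, 2, 2, 1, 1, 2, 1, 1]; // more varied transitions
    assert!(transition_entropy(&rigid) < transition_entropy(&mixed));
}
```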
## Visualization in Rust, end-to-end
You can do a fully Rust-native viz loop now:
1. Use RuVector to get kNN for each point (already computed by HNSW).
2. Feed that kNN graph into a Rust UMAP implementation such as `umap-rs` (it expects precomputed neighbors). ([Docs.rs][8])
3. Render interactive scatter plots using Rust bindings for Plotly, or export JSON for a web viewer. ([Crates.io][9])
Bonus: Perch outputs spectrogram tensors in some exported forms, so you can attach “what the model saw” to each point and show it on hover or click. ([Hugging Face][5])
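For the "export JSON for a web viewer" path, the kNN graph serializes trivially. A std-only sketch (in a real pipeline you would likely reach for `serde_json` instead of hand-rolled formatting):

```rust
// Export SIMILAR edges as a JSON edge list for a web-based graph viewer.
// Hand-rolled formatting, std only; serde_json is the obvious upgrade.
fn knn_to_json(edges: &[(u64, u64, f32)]) -> String {
    let items: Vec<String> = edges
        .iter()
        .map(|(s, t, d)| format!(r#"{{"source":{s},"target":{t},"dist":{d}}}"#))
        .collect();
    format!("[{}]", items.join(","))
}

fn main() {
    let json = knn_to_json(&[(1, 2, 0.5)]);
    assert_eq!(json, r#"[{"source":1,"target":2,"dist":0.5}]"#);
}
```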
## “Translation” that stays scientifically honest
If you use the word “translation,” I would keep it scoped like this:
* Translate a call into:
* nearest exemplars
* cluster membership
* structural descriptors (pitch contour stats, rhythm intervals, spectral texture)
* sequence role (often followed by X, often precedes Y)
Not “the bird said danger,” but:
* “This call sits in the same neighborhood as known alarm exemplars and appears in similar sequence positions during disturbance periods.”
That is the RAB sweet spot: interpretable, evidence-backed, testable.
## Practical to exotic: what becomes feasible now
With Perch-grade embeddings, your ladder tightens:
**Practical**
* biodiversity indexing and monitoring summaries
* fast search over million-hour corpora
* sensor drift and anthropogenic anomaly alerts
**Advanced**
* few-shot adaptation for new sites with tiny labeled sets
* call library curation via cluster prototypes
* cross-taxa transfer experiments (insects vs birds vs amphibians)
**Exotic but defensible**
* closed-loop call-response experiments that probe structural sensitivity
* synthetic prototype interpolation (generate “between-cluster” calls) with strict ethics and permitting
* cross-species “structure maps” that compare signaling complexity without pretending semantics
## Two next moves that will accelerate you immediately
1. **Build the “call library + evidence pack” layer first.**
It turns embeddings into a product and forces transparency.
2. **Treat GNN as retrieval optimization, not a magic classifier.**
Your win is better neighborhoods, cleaner motifs, and more stable trajectories.
If you want, I can turn this into:
* a concrete repo layout (`ruvector-bioacoustic/` crate + CLI + wasm viewer), or
* a short “vision memo” you can share publicly that frames Perch 2.0 + RuVector + RAB as the start of navigable animal communication geometry.
[1]: https://www.arxiv.org/pdf/2508.04665v2 "Perch 2.0: The Bittern Lesson for Bioacoustics"
[2]: https://arxiv.org/html/2508.04665v1 "Perch 2.0: The Bittern Lesson for Bioacoustics"
[3]: https://deepmind.google/blog/how-ai-is-helping-advance-the-science-of-bioacoustics-to-save-endangered-species/ "How AI is helping advance the science of bioacoustics to save endangered species - Google DeepMind"
[4]: https://github.com/ruvnet/ruvector "GitHub - ruvnet/ruvector: A distributed vector database that learns. Store embeddings, query with Cypher, scale horizontally with Raft consensus, and let the index improve itself through Graph Neural Networks."
[5]: https://huggingface.co/justinchuby/Perch-onnx "justinchuby/Perch-onnx"
[6]: https://docs.rs/birdnet-onnx "birdnet_onnx - Rust"
[7]: https://arxiv.org/html/2508.04665v1 "Perch 2.0: The Bittern Lesson for Bioacoustics"
[8]: https://docs.rs/umap-rs "umap_rs - Rust"
[9]: https://crates.io/crates/plotly "plotly - crates.io: Rust Package Registry"