# ADR-006: Data Architecture and Vector Storage **Status:** Accepted **Date:** 2026-01-15 **Deciders:** 7sense Architecture Team **Context:** Bioacoustic data pipeline for RuVector integration ## Context 7sense transforms bioacoustic signals (birdsong, wildlife vocalizations) into navigable geometric spaces using the RuVector platform. The system processes audio recordings through Perch 2.0 to generate 1536-dimensional embeddings, which are then indexed using HNSW for fast similarity search and organized via Graph Neural Networks (GNN) for pattern discovery. This ADR defines the complete data architecture including: - Entity schemas and relationships - Vector storage tiering strategy - Temporal data handling - Metadata enrichment - Data lifecycle management - Backup and recovery procedures ## Decision ### 1. Schema Design #### 1.1 Core Entities ``` ┌─────────────────────────────────────────────────────────────────────────────┐ │ ENTITY RELATIONSHIP DIAGRAM │ ├─────────────────────────────────────────────────────────────────────────────┤ │ │ │ ┌──────────────┐ ┌───────────────┐ ┌──────────────┐ │ │ │ Recording │──1:N──│ CallSegment │──1:1──│ Embedding │ │ │ └──────────────┘ └───────────────┘ └──────────────┘ │ │ │ │ │ │ │ │ │ │ │ │ │ ┌──────┴──────┐ │ │ │ │ │ │ │ │ │ ▼ ▼ ▼ ▼ │ │ ┌──────────────┐ ┌──────────┐ ┌──────────┐ ┌──────────────┐ │ │ │ Sensor │ │ Cluster │ │ Prototype│ │ Taxon │ │ │ └──────────────┘ └──────────┘ └──────────┘ └──────────────┘ │ │ │ └─────────────────────────────────────────────────────────────────────────────┘ ``` #### 1.2 Node Definitions **Recording** - Source audio file metadata ```typescript interface Recording { id: UUID; // Primary identifier sensor_id: UUID; // Reference to sensor file_path: string; // Storage location file_hash: string; // SHA-256 for deduplication duration_ms: number; // Total duration sample_rate: number; // Expected: 32000 Hz channels: number; // Expected: 1 (mono) bit_depth: number; // Audio bit depth start_ts: ISO8601; // Recording start timestamp end_ts: ISO8601; // Recording end timestamp lat: float; // GPS latitude (WGS84) lon: float; // GPS longitude (WGS84) altitude_m: float; // Elevation in meters habitat: HabitatType; // Enum: forest, wetland, urban, etc. weather: WeatherConditions; // Nested weather data quality_score: float; // 0.0-1.0 automated quality assessment processing_status: Status; // pending, processing, complete, failed created_at: ISO8601; updated_at: ISO8601; } interface WeatherConditions { temperature_c: float; humidity_pct: float; wind_speed_ms: float; wind_direction_deg: float; precipitation_mm: float; cloud_cover_pct: float; pressure_hpa: float; source: string; // weather API source } ``` **CallSegment** - Individual vocalization segment ```typescript interface CallSegment { id: UUID; recording_id: UUID; // Parent recording segment_index: number; // Order within recording t0_ms: number; // Start offset in milliseconds t1_ms: number; // End offset in milliseconds duration_ms: number; // Computed: t1_ms - t0_ms snr_db: float; // Signal-to-noise ratio energy: float; // RMS energy level peak_freq_hz: float; // Dominant frequency bandwidth_hz: float; // Frequency range entropy: float; // Spectral entropy (Wiener) pitch_contour: float[]; // Sampled pitch values rhythm_intervals: float[]; // Inter-onset intervals spectral_centroid: float; // Spectral center of mass spectral_flatness: float; // Tonality measure zero_crossing_rate: float; // Temporal texture clipping_detected: boolean; // Audio quality flag overlap_score: float; // Overlap with other calls segmentation_method: string; // whisper_seg, tweety_net, energy_threshold segmentation_confidence: float; created_at: ISO8601; } ``` **Embedding** - Vector representation from Perch 2.0 ```typescript interface Embedding { id: UUID; segment_id: UUID; // Parent segment model_name: string; // "perch_2.0" model_version: string; // Specific model version dimensions: number; // 1536 for Perch 2.0 vector: Float32Array; // Full-precision embedding vector_quantized: Int8Array; // Quantized for warm tier vector_compressed: Uint8Array; // Compressed for cold tier storage_tier: StorageTier; // hot, warm, cold norm: float; // L2 norm for validation generation_time_ms: number; // Inference latency checksum: string; // Integrity verification created_at: ISO8601; last_accessed: ISO8601; // For tiering decisions access_count: number; // Usage tracking } enum StorageTier { HOT = 'hot', // Full float32, in-memory HNSW WARM = 'warm', // int8 quantized, SSD-backed COLD = 'cold' // 32x compressed, archival } ``` **Prototype** - Cluster centroid/exemplar ```typescript interface Prototype { id: UUID; cluster_id: UUID; // Parent cluster centroid_vector: Float32Array; // Averaged embedding exemplar_ids: UUID[]; // Representative segment IDs exemplar_count: number; // Number of exemplars intra_cluster_variance: float; silhouette_score: float; // Cluster quality metric created_at: ISO8601; updated_at: ISO8601; } ``` **Cluster** - Grouping of similar calls ```typescript interface Cluster { id: UUID; name: string; // Human-readable label description: string; // Auto-generated or manual method: ClusterMethod; // hdbscan, kmeans, spectral params: Record; // Algorithm parameters member_count: number; // Number of assigned segments coherence_score: float; // Internal validity stability_score: float; // Bootstrap stability parent_cluster_id: UUID; // Hierarchical clustering level: number; // Hierarchy depth created_at: ISO8601; updated_at: ISO8601; } enum ClusterMethod { HDBSCAN = 'hdbscan', KMEANS = 'kmeans', SPECTRAL = 'spectral', AGGLOMERATIVE = 'agglomerative' } ``` **Taxon** - Species/taxonomic reference ```typescript interface Taxon { id: UUID; inat_id: number; // iNaturalist taxon ID scientific_name: string; // Binomial nomenclature common_name: string; // English common name family: string; // Taxonomic family order: string; // Taxonomic order class: string; // Taxonomic class conservation_status: string; // IUCN status frequency_range_hz: [number, number]; // Typical vocalization range habitat_types: HabitatType[]; created_at: ISO8601; } ``` **Sensor** - Recording device metadata ```typescript interface Sensor { id: UUID; name: string; // Device identifier model: string; // Hardware model manufacturer: string; serial_number: string; microphone_type: string; // omni, cardioid, etc. sensitivity_dbv: float; // Microphone sensitivity frequency_response: [number, number]; // Hz range deployment_lat: float; deployment_lon: float; deployment_altitude_m: float; deployment_habitat: HabitatType; deployment_start: ISO8601; deployment_end: ISO8601; calibration_date: ISO8601; calibration_factor: float; status: SensorStatus; created_at: ISO8601; updated_at: ISO8601; } enum SensorStatus { ACTIVE = 'active', INACTIVE = 'inactive', MAINTENANCE = 'maintenance', RETIRED = 'retired' } ``` #### 1.3 Edge Definitions (Graph Relationships) ```cypher // Structural relationships (:Recording)-[:HAS_SEGMENT {order: int}]->(:CallSegment) (:CallSegment)-[:HAS_EMBEDDING]->(:Embedding) (:Recording)-[:FROM_SENSOR]->(:Sensor) // Temporal sequence (syntax graph) (:CallSegment)-[:NEXT { dt_ms: int, // Time delta between segments same_speaker: boolean, // Likely same individual transition_prob: float // Learned transition probability }]->(:CallSegment) // Acoustic similarity (HNSW neighbors) (:CallSegment)-[:SIMILAR { distance: float, // Cosine distance rank: int, // Neighbor rank (1-k) tier: string // hot, warm, cold }]->(:CallSegment) // Cluster assignments (:Cluster)-[:HAS_PROTOTYPE]->(:Prototype) (:CallSegment)-[:ASSIGNED_TO { confidence: float, // Assignment confidence distance_to_centroid: float }]->(:Cluster) (:Cluster)-[:CHILD_OF]->(:Cluster) // Hierarchical // Taxonomic links (:CallSegment)-[:IDENTIFIED_AS { confidence: float, method: string, // model, manual, consensus verified: boolean }]->(:Taxon) // Co-occurrence (same time window, nearby sensors) (:CallSegment)-[:CO_OCCURS { time_overlap_ms: int, spatial_distance_m: float }]->(:CallSegment) ``` #### 1.4 Index Definitions **Primary Indexes** ```sql -- UUID lookups CREATE UNIQUE INDEX idx_recording_id ON recordings(id); CREATE UNIQUE INDEX idx_segment_id ON call_segments(id); CREATE UNIQUE INDEX idx_embedding_id ON embeddings(id); CREATE UNIQUE INDEX idx_cluster_id ON clusters(id); CREATE UNIQUE INDEX idx_taxon_id ON taxa(id); CREATE UNIQUE INDEX idx_sensor_id ON sensors(id); -- Foreign key relationships CREATE INDEX idx_segment_recording ON call_segments(recording_id); CREATE INDEX idx_embedding_segment ON embeddings(segment_id); CREATE INDEX idx_recording_sensor ON recordings(sensor_id); ``` **Temporal Indexes** ```sql -- Time-based queries CREATE INDEX idx_recording_start ON recordings(start_ts); CREATE INDEX idx_recording_timerange ON recordings USING GIST ( tstzrange(start_ts, end_ts) ); CREATE INDEX idx_segment_time ON call_segments(recording_id, t0_ms); ``` **Spatial Indexes** ```sql -- Geographic queries (PostGIS) CREATE INDEX idx_recording_location ON recordings USING GIST ( ST_SetSRID(ST_MakePoint(lon, lat), 4326) ); CREATE INDEX idx_sensor_location ON sensors USING GIST ( ST_SetSRID(ST_MakePoint(deployment_lon, deployment_lat), 4326) ); ``` **HNSW Vector Index** ```sql -- Hot tier: Full precision HNSW CREATE INDEX idx_embedding_hnsw_hot ON embeddings USING hnsw (vector vector_cosine_ops) WITH ( m = 16, -- Connections per layer ef_construction = 200, -- Build-time search width ef_search = 100 -- Query-time search width ) WHERE storage_tier = 'hot'; -- Warm tier: Quantized HNSW CREATE INDEX idx_embedding_hnsw_warm ON embeddings USING hnsw (vector_quantized vector_l2_ops) WITH (m = 12, ef_construction = 100) WHERE storage_tier = 'warm'; ``` --- ### 2. Vector Storage Strategy #### 2.1 Tiered Storage Architecture ``` ┌─────────────────────────────────────────────────────────────────────────────┐ │ VECTOR STORAGE TIERS │ ├─────────────────────────────────────────────────────────────────────────────┤ │ │ │ ┌────────────────────────────────────────────────────────────────────────┐ │ │ │ HOT TIER - Full Precision (float32) │ │ │ │ ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ │ │ │ │ Storage: In-memory + NVMe SSD │ │ │ │ Format: 1536 x float32 = 6,144 bytes/vector │ │ │ │ Index: HNSW (M=16, ef=200) │ │ │ │ Latency: <1ms query, <100us retrieval │ │ │ │ Capacity: ~1M vectors / 6GB RAM │ │ │ │ Use: Active queries, recent recordings, frequent access │ │ │ └────────────────────────────────────────────────────────────────────────┘ │ │ │ │ │ ▼ (access_count < threshold, age > 7 days) │ │ ┌────────────────────────────────────────────────────────────────────────┐ │ │ │ WARM TIER - Quantized (int8) │ │ │ │ ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ │ │ │ │ Storage: SSD-backed, memory-mapped │ │ │ │ Format: 1536 x int8 = 1,536 bytes/vector (4x compression) │ │ │ │ Index: HNSW (M=12, ef=100) │ │ │ │ Latency: <10ms query │ │ │ │ Capacity: ~10M vectors / 15GB SSD │ │ │ │ Use: Historical data, periodic access, batch processing │ │ │ └────────────────────────────────────────────────────────────────────────┘ │ │ │ │ │ ▼ (age > 90 days, access_count < 10) │ │ ┌────────────────────────────────────────────────────────────────────────┐ │ │ │ COLD TIER - Compressed (Product Quantization) │ │ │ │ ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ │ │ │ │ Storage: Object storage (S3/GCS) + local cache │ │ │ │ Format: PQ-encoded = ~192 bytes/vector (32x compression) │ │ │ │ Index: IVF-PQ (coarse quantizer + product quantization) │ │ │ │ Latency: <100ms query (cache hit), <1s (cache miss) │ │ │ │ Capacity: ~100M vectors / 20GB storage │ │ │ │ Use: Archival, compliance, rare access, bulk analytics │ │ │ └────────────────────────────────────────────────────────────────────────┘ │ │ │ └─────────────────────────────────────────────────────────────────────────────┘ ``` #### 2.2 Quantization Implementation **Scalar Quantization (Hot to Warm)** ```typescript interface ScalarQuantizer { scale: float; // Computed from data distribution zero_point: float; // Offset for centering quantize(vector: Float32Array): Int8Array { const quantized = new Int8Array(vector.length); for (let i = 0; i < vector.length; i++) { const scaled = (vector[i] - this.zero_point) / this.scale; quantized[i] = Math.max(-128, Math.min(127, Math.round(scaled))); } return quantized; } dequantize(quantized: Int8Array): Float32Array { const vector = new Float32Array(quantized.length); for (let i = 0; i < quantized.length; i++) { vector[i] = quantized[i] * this.scale + this.zero_point; } return vector; } } // Calibration: compute scale/zero_point from representative sample function calibrateQuantizer(samples: Float32Array[]): ScalarQuantizer { let min = Infinity, max = -Infinity; for (const sample of samples) { for (const val of sample) { min = Math.min(min, val); max = Math.max(max, val); } } return { scale: (max - min) / 255, zero_point: min + (max - min) / 2 }; } ``` **Product Quantization (Warm to Cold)** ```typescript interface ProductQuantizer { num_subvectors: number; // 192 (divide 1536 into 192 x 8) subvector_dim: number; // 8 dimensions per subvector num_centroids: number; // 256 (1 byte per subvector) codebooks: Float32Array[][]; // 192 codebooks, each with 256 centroids encode(vector: Float32Array): Uint8Array { const codes = new Uint8Array(this.num_subvectors); for (let i = 0; i < this.num_subvectors; i++) { const subvec = vector.slice( i * this.subvector_dim, (i + 1) * this.subvector_dim ); codes[i] = this.findNearestCentroid(subvec, this.codebooks[i]); } return codes; } decode(codes: Uint8Array): Float32Array { const vector = new Float32Array(1536); for (let i = 0; i < this.num_subvectors; i++) { const centroid = this.codebooks[i][codes[i]]; vector.set(centroid, i * this.subvector_dim); } return vector; } asymmetricDistance(query: Float32Array, codes: Uint8Array): float { let dist = 0; for (let i = 0; i < this.num_subvectors; i++) { const subquery = query.slice( i * this.subvector_dim, (i + 1) * this.subvector_dim ); const centroid = this.codebooks[i][codes[i]]; dist += euclideanDistance(subquery, centroid); } return dist; } } ``` #### 2.3 Tiering Policy ```typescript interface TieringPolicy { hot_to_warm: { age_days: 7, // Move after 7 days access_threshold: 100, // Unless accessed >100 times batch_size: 10000 // Process in batches }, warm_to_cold: { age_days: 90, // Move after 90 days access_threshold: 10, // Unless accessed >10 times in last 30 days batch_size: 50000 }, promotion: { cold_to_warm: { access_count: 5, // Promote after 5 accesses time_window_hours: 24 }, warm_to_hot: { access_count: 20, time_window_hours: 1 } } } // Background tiering job async function runTieringJob(policy: TieringPolicy): Promise { const stats = { demoted: 0, promoted: 0 }; // Demote hot -> warm const hotCandidates = await db.query(` SELECT id FROM embeddings WHERE storage_tier = 'hot' AND created_at < NOW() - INTERVAL '${policy.hot_to_warm.age_days} days' AND access_count < ${policy.hot_to_warm.access_threshold} ORDER BY last_accessed ASC LIMIT ${policy.hot_to_warm.batch_size} `); for (const batch of chunk(hotCandidates, 1000)) { await demoteToWarm(batch); stats.demoted += batch.length; } // Promote cold -> warm based on access patterns const coldAccessLog = await getRecentAccesses('cold', 24); const promotionCandidates = coldAccessLog .filter(e => e.count >= policy.promotion.cold_to_warm.access_count); for (const candidate of promotionCandidates) { await promoteToWarm(candidate.id); stats.promoted++; } return stats; } ``` #### 2.4 Hyperbolic Embedding Option For hierarchical species relationships, optionally store embeddings in Poincare ball space: ```typescript interface HyperbolicEmbedding { euclidean_vector: Float32Array; // Original 1536-D poincare_vector: Float32Array; // Mapped to Poincare ball curvature: float; // Ball curvature (typically -1) } // Map Euclidean to Poincare ball function exponentialMap( euclidean: Float32Array, curvature: float = -1 ): Float32Array { const norm = l2Norm(euclidean); const c = Math.abs(curvature); const factor = Math.tanh(Math.sqrt(c) * norm / 2) / (Math.sqrt(c) * norm); return euclidean.map(x => x * factor); } // Poincare distance for hyperbolic similarity function poincareDistance( u: Float32Array, v: Float32Array, curvature: float = -1 ): float { const c = Math.abs(curvature); const u_norm_sq = dotProduct(u, u); const v_norm_sq = dotProduct(v, v); const diff = subtract(u, v); const diff_norm_sq = dotProduct(diff, diff); const numerator = 2 * diff_norm_sq; const denominator = (1 - c * u_norm_sq) * (1 - c * v_norm_sq); return Math.acosh(1 + numerator / denominator) / Math.sqrt(c); } ``` --- ### 3. Graph Relationships for Cypher Queries #### 3.1 Common Query Patterns **Find Similar Calls** ```cypher // Top-k similar calls to a given segment MATCH (source:CallSegment {id: $segment_id})-[sim:SIMILAR]->(target:CallSegment) WHERE sim.distance < $threshold RETURN target, sim.distance, sim.rank ORDER BY sim.distance ASC LIMIT $k ``` **Temporal Sequence Analysis** ```cypher // Find call sequences (motifs) of length n MATCH path = (start:CallSegment)-[:NEXT*1..5]->(end:CallSegment) WHERE start.recording_id = $recording_id RETURN [node IN nodes(path) | node.id] AS sequence, [rel IN relationships(path) | rel.dt_ms] AS intervals, length(path) AS motif_length ORDER BY motif_length DESC ``` **Cluster Exploration** ```cypher // Get cluster members with their prototypes MATCH (c:Cluster {id: $cluster_id})-[:HAS_PROTOTYPE]->(p:Prototype) MATCH (seg:CallSegment)-[a:ASSIGNED_TO]->(c) WHERE a.confidence > 0.8 RETURN c, p, collect(seg)[..10] AS exemplars, count(seg) AS total_members ``` **Species Distribution** ```cypher // Calls by species in a geographic region MATCH (r:Recording)-[:HAS_SEGMENT]->(seg:CallSegment) -[:IDENTIFIED_AS]->(t:Taxon) WHERE point.distance( point({latitude: r.lat, longitude: r.lon}), point({latitude: $center_lat, longitude: $center_lon}) ) < $radius_m RETURN t.scientific_name, t.common_name, count(seg) AS call_count ORDER BY call_count DESC ``` **Co-occurrence Networks** ```cypher // Species co-occurring in same time windows MATCH (seg1:CallSegment)-[:IDENTIFIED_AS]->(t1:Taxon), (seg1)-[:CO_OCCURS]->(seg2:CallSegment)-[:IDENTIFIED_AS]->(t2:Taxon) WHERE t1.id <> t2.id RETURN t1.common_name, t2.common_name, count(*) AS co_occurrence_count ORDER BY co_occurrence_count DESC LIMIT 20 ``` **Transition Matrix** ```cypher // Markov transition probabilities between call types MATCH (c1:Cluster)<-[:ASSIGNED_TO]-(seg1:CallSegment) -[:NEXT]->(seg2:CallSegment)-[:ASSIGNED_TO]->(c2:Cluster) WITH c1.name AS from_cluster, c2.name AS to_cluster, count(*) AS transitions MATCH (c1:Cluster {name: from_cluster})<-[:ASSIGNED_TO]-(seg:CallSegment) WITH from_cluster, to_cluster, transitions, count(seg) AS from_total RETURN from_cluster, to_cluster, toFloat(transitions) / from_total AS transition_prob ORDER BY from_cluster, transition_prob DESC ``` #### 3.2 GNN Training Edges ```cypher // Create training edges for GNN // Acoustic similarity edges (from HNSW) MATCH (seg:CallSegment)-[:HAS_EMBEDDING]->(emb:Embedding) WITH seg, emb CALL { WITH seg, emb MATCH (other:CallSegment)-[:HAS_EMBEDDING]->(other_emb:Embedding) WHERE other.id <> seg.id WITH seg, other, gds.similarity.cosine(emb.vector, other_emb.vector) AS sim WHERE sim > 0.8 RETURN other, sim ORDER BY sim DESC LIMIT 10 } MERGE (seg)-[r:SIMILAR]->(other) SET r.distance = 1 - sim, r.tier = 'hot' ``` --- ### 4. Temporal Data Handling #### 4.1 Timestamp Standards ```typescript // All timestamps in ISO 8601 with timezone type ISO8601 = string; // e.g., "2026-01-15T08:30:00.000Z" interface TemporalMetadata { // Recording level recording_start_ts: ISO8601; // When recording began recording_end_ts: ISO8601; // When recording ended recording_timezone: string; // IANA timezone (e.g., "America/Los_Angeles") // Segment level (relative to recording) segment_offset_ms: number; // Milliseconds from recording start segment_absolute_ts: ISO8601; // Computed absolute timestamp // Derived temporal features time_of_day: TimeOfDay; // dawn, morning, midday, afternoon, dusk, night day_of_week: number; // 0-6 day_of_year: number; // 1-366 lunar_phase: LunarPhase; // new, waxing, full, waning sunrise_offset_min: number; // Minutes from local sunrise sunset_offset_min: number; // Minutes from local sunset } enum TimeOfDay { DAWN = 'dawn', // -30min to +30min of sunrise MORNING = 'morning', // sunrise+30min to noon MIDDAY = 'midday', // noon +/- 2 hours AFTERNOON = 'afternoon', // midday to sunset-30min DUSK = 'dusk', // -30min to +30min of sunset NIGHT = 'night' // sunset+30min to sunrise-30min } ``` #### 4.2 Sequence Ordering ```typescript interface SequenceManager { // Build sequence graph from recording buildSequenceGraph(recordingId: UUID): Promise { const segments = await db.query(` SELECT id, t0_ms, t1_ms, recording_id FROM call_segments WHERE recording_id = $1 ORDER BY t0_ms ASC `, [recordingId]); const edges: SequenceEdge[] = []; for (let i = 0; i < segments.length - 1; i++) { const current = segments[i]; const next = segments[i + 1]; const gap_ms = next.t0_ms - current.t1_ms; // Only link if gap is reasonable (< 5 seconds) if (gap_ms < 5000 && gap_ms >= 0) { edges.push({ source_id: current.id, target_id: next.id, dt_ms: gap_ms, same_speaker: gap_ms < 500, // Heuristic for same individual sequence_index: i }); } } return edges; } // Detect repeated sequences (motifs) findMotifs(recordingId: UUID, minLength: number = 3): Promise { // Build suffix array of cluster assignments const sequence = await this.getClusterSequence(recordingId); const motifs = this.findRepeatedSubstrings(sequence, minLength); return motifs; } } interface SequenceEdge { source_id: UUID; target_id: UUID; dt_ms: number; // Time gap same_speaker: boolean; // Likely same individual sequence_index: number; // Position in recording } interface Motif { pattern: string[]; // Cluster IDs in order occurrences: number; // How many times it appears positions: number[][]; // Start positions for each occurrence entropy: number; // Pattern entropy } ``` #### 4.3 Time-Series Partitioning ```sql -- Partition recordings by month for efficient time-range queries CREATE TABLE recordings ( id UUID PRIMARY KEY, start_ts TIMESTAMPTZ NOT NULL, -- ... other columns ) PARTITION BY RANGE (start_ts); -- Create partitions CREATE TABLE recordings_2026_01 PARTITION OF recordings FOR VALUES FROM ('2026-01-01') TO ('2026-02-01'); CREATE TABLE recordings_2026_02 PARTITION OF recordings FOR VALUES FROM ('2026-02-01') TO ('2026-03-01'); -- Automatic partition creation (pg_partman or similar) SELECT create_parent('public.recordings', 'start_ts', 'native', 'monthly'); ``` --- ### 5. Metadata Enrichment #### 5.1 Enrichment Pipeline ``` ┌─────────────────────────────────────────────────────────────────────────────┐ │ METADATA ENRICHMENT PIPELINE │ ├─────────────────────────────────────────────────────────────────────────────┤ │ │ │ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │ │ │ Raw Audio │────▶│ Segmented │────▶│ Embedded │ │ │ │ │ │ Calls │ │ Vectors │ │ │ └──────────────┘ └──────────────┘ └──────────────┘ │ │ │ │ │ │ │ ▼ ▼ ▼ │ │ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │ │ │ Recording │ │ Acoustic │ │ Similarity │ │ │ │ Metadata │ │ Features │ │ Neighbors │ │ │ └──────────────┘ └──────────────┘ └──────────────┘ │ │ │ │ │ │ │ │ │ │ │ │ ▼ ▼ ▼ │ │ ┌─────────────────────────────────────────────────────────────────────┐ │ │ │ ENRICHMENT SERVICES │ │ │ ├─────────────────────────────────────────────────────────────────────┤ │ │ │ • Weather API → temperature, humidity, wind, precipitation │ │ │ │ • Geocoding → habitat type, elevation, land cover │ │ │ │ • Astronomy → sunrise/sunset, lunar phase, day length │ │ │ │ • Taxonomy → species ID, conservation status, range maps │ │ │ │ • Soundscape → background noise level, anthropogenic detection │ │ │ └─────────────────────────────────────────────────────────────────────┘ │ │ │ │ │ ▼ │ │ ┌─────────────────────────────────────────────────────────────────────┐ │ │ │ ENRICHED RECORD │ │ │ │ Recording + Weather + Habitat + Temporal + Taxonomic + Quality │ │ │ └─────────────────────────────────────────────────────────────────────┘ │ │ │ └─────────────────────────────────────────────────────────────────────────────┘ ``` #### 5.2 Enrichment Sources ```typescript interface EnrichmentConfig { weather: { provider: 'openweathermap' | 'visualcrossing' | 'meteoblue'; api_key: string; cache_ttl_hours: 24; historical_lookback_days: 7; }; geocoding: { provider: 'nominatim' | 'google' | 'mapbox'; cache_ttl_days: 30; enrichments: ['habitat', 'elevation', 'land_cover', 'protected_area']; }; astronomy: { // Computed locally, no API needed compute: ['sunrise', 'sunset', 'civil_twilight', 'lunar_phase', 'day_length']; }; taxonomy: { sources: ['inat', 'ebird', 'xeno-canto']; auto_id_confidence_threshold: 0.8; human_review_threshold: 0.5; }; soundscape: { background_noise_window_ms: 100; anthropogenic_detection: boolean; frequency_bands: [[0, 2000], [2000, 8000], [8000, 16000]]; }; } // Enrichment job async function enrichRecording(recordingId: UUID): Promise { const recording = await db.getRecording(recordingId); const enrichments = await Promise.all([ weatherService.getHistorical(recording.lat, recording.lon, recording.start_ts), geocodingService.getHabitat(recording.lat, recording.lon), astronomyService.getSolarData(recording.lat, recording.lon, recording.start_ts), soundscapeAnalyzer.analyze(recording.file_path) ]); return { ...recording, weather: enrichments[0], habitat: enrichments[1], astronomy: enrichments[2], soundscape: enrichments[3] }; } ``` #### 5.3 Species Identification ```typescript interface SpeciesIdentification { segment_id: UUID; predictions: TaxonPrediction[]; method: 'perch_classifier' | 'birdnet' | 'manual' | 'consensus'; confidence_aggregation: 'max' | 'mean' | 'ensemble'; } interface TaxonPrediction { taxon_id: UUID; scientific_name: string; confidence: float; rank: number; } // Multi-model ensemble for species ID async function identifySpecies(segmentId: UUID): Promise { const segment = await db.getSegment(segmentId); const embedding = await db.getEmbedding(segmentId); // Get predictions from multiple sources const perchPreds = await perchClassifier.predict(embedding.vector); const nnPreds = await nearestNeighborTaxon(embedding.vector, k=10); // Ensemble combination const combined = ensemblePredictions([perchPreds, nnPreds], weights=[0.6, 0.4]); return { segment_id: segmentId, predictions: combined.slice(0, 5), method: 'ensemble', confidence_aggregation: 'weighted_mean' }; } ``` --- ### 6. Data Lifecycle #### 6.1 Lifecycle Stages ``` ┌─────────────────────────────────────────────────────────────────────────────┐ │ DATA LIFECYCLE │ ├─────────────────────────────────────────────────────────────────────────────┤ │ │ │ STAGE 1: INGESTION │ │ ━━━━━━━━━━━━━━━━━━ │ │ • Audio upload (S3/GCS/local) │ │ • Format validation (32kHz, mono, WAV/FLAC) │ │ • Deduplication (SHA-256 hash check) │ │ • Recording metadata extraction │ │ • Queue for processing │ │ │ │ STAGE 2: PROCESSING │ │ ━━━━━━━━━━━━━━━━━━━ │ │ • Audio segmentation (WhisperSeg/TweetyNet) │ │ • Acoustic feature extraction │ │ • Perch 2.0 embedding generation │ │ • Quality scoring │ │ • Initial species predictions │ │ │ │ STAGE 3: INDEXING │ │ ━━━━━━━━━━━━━━━━━━ │ │ • HNSW index insertion (hot tier) │ │ • Neighbor edge creation │ │ • Sequence graph construction │ │ • Cluster assignment │ │ • Metadata enrichment │ │ │ │ STAGE 4: ACTIVE USE │ │ ━━━━━━━━━━━━━━━━━━━ │ │ • Query serving │ │ • GNN refinement (continuous learning) │ │ • Access tracking │ │ • Cache warming │ │ │ │ STAGE 5: TIERING │ │ ━━━━━━━━━━━━━━━━━━ │ │ • Hot → Warm (7 days, quantization) │ │ • Warm → Cold (90 days, compression) │ │ • Promotion on access │ │ │ │ STAGE 6: ARCHIVAL │ │ ━━━━━━━━━━━━━━━━━━ │ │ • Cold storage (S3 Glacier/equivalent) │ │ • Metadata preserved in primary DB │ │ • On-demand retrieval (minutes latency) │ │ │ │ STAGE 7: RETENTION/DELETION │ │ ━━━━━━━━━━━━━━━━━━━━━━━━━━ │ │ • Configurable retention policies │ │ • Legal hold support │ │ • Secure deletion (GDPR compliance) │ │ │ └─────────────────────────────────────────────────────────────────────────────┘ ``` #### 6.2 Processing Pipeline ```typescript interface ProcessingPipeline { stages: PipelineStage[]; async process(recordingId: UUID): Promise { const context: ProcessingContext = { recordingId, startTime: Date.now() }; for (const stage of this.stages) { try { context[stage.name] = await stage.execute(context); await this.updateStatus(recordingId, stage.name, 'complete'); } catch (error) { await this.handleError(recordingId, stage.name, error); if (stage.critical) throw error; } } return this.summarize(context); } } const pipeline = new ProcessingPipeline({ stages: [ { name: 'validate', critical: true, execute: validateAudio }, { name: 'segment', critical: true, execute: segmentAudio }, { name: 'extract_features', critical: false, execute: extractAcousticFeatures }, { name: 'embed', critical: true, execute: generateEmbeddings }, { name: 'index', critical: true, execute: insertToHNSW }, { name: 'enrich', critical: false, execute: enrichMetadata }, { name: 'identify', critical: false, execute: identifySpecies }, { name: 'cluster', critical: false, execute: assignToClusters } ] }); ``` #### 6.3 Retention Policies ```typescript interface RetentionPolicy { name: string; conditions: RetentionCondition[]; action: 'archive' | 'delete' | 'anonymize'; grace_period_days: number; } const defaultPolicies: RetentionPolicy[] = [ { name: 'standard_archival', conditions: [ { field: 'age_days', operator: '>', value: 365 }, { field: 'access_count_last_180_days', operator: '<', value: 5 } ], action: 'archive', grace_period_days: 30 }, { name: 'low_quality_deletion', conditions: [ { field: 'quality_score', operator: '<', value: 0.3 }, { field: 'age_days', operator: '>', value: 90 }, { field: 'manually_reviewed', operator: '=', value: false } ], action: 'delete', grace_period_days: 14 }, { name: 'gdpr_deletion', conditions: [ { field: 'deletion_requested', operator: '=', value: true } ], action: 'delete', grace_period_days: 0 } ]; // Retention job async function enforceRetention(): Promise { const report: RetentionReport = { archived: 0, deleted: 0, errors: [] }; for (const policy of defaultPolicies) { const candidates = await findRetentionCandidates(policy); for (const candidate of candidates) { try { if (policy.action === 'archive') { await archiveRecording(candidate.id); report.archived++; } else if (policy.action === 'delete') { await secureDelete(candidate.id); report.deleted++; } } catch (error) { report.errors.push({ id: candidate.id, error: error.message }); } } } return report; } ``` --- ### 7. Backup and Recovery Strategy #### 7.1 Backup Architecture ``` ┌─────────────────────────────────────────────────────────────────────────────┐ │ BACKUP ARCHITECTURE │ ├─────────────────────────────────────────────────────────────────────────────┤ │ │ │ ┌─────────────────────────────────────────────────────────────────────┐ │ │ │ PRIMARY DATABASE (PostgreSQL + pgvector) │ │ │ │ ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ │ │ │ │ • Streaming replication to standby │ │ │ │ • WAL archiving to object storage │ │ │ │ • Point-in-time recovery enabled │ │ │ │ • pg_basebackup daily full backup │ │ │ └─────────────────────────────────────────────────────────────────────┘ │ │ │ │ │ ┌──────────────────┼──────────────────┐ │ │ ▼ ▼ ▼ │ │ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │ │ │ Standby │ │ WAL Archive │ │ Daily Full │ │ │ │ Replica │ │ (S3/GCS) │ │ Backup │ │ │ │ (sync) │ │ (15 min) │ │ (encrypted) │ │ │ └──────────────┘ └──────────────┘ └──────────────┘ │ │ │ │ ┌─────────────────────────────────────────────────────────────────────┐ │ │ │ VECTOR INDEX (HNSW) │ │ │ │ ━━━━━━━━━━━━━━━━━━━━ │ │ │ │ • Snapshot to object storage daily │ │ │ │ • Incremental updates via change log │ │ │ │ • Rebuild from source embeddings (disaster recovery) │ │ │ └─────────────────────────────────────────────────────────────────────┘ │ │ │ │ ┌─────────────────────────────────────────────────────────────────────┐ │ │ │ AUDIO FILES │ │ │ │ ━━━━━━━━━━━━━━━ │ │ │ │ • Primary: Object storage (S3/GCS) with versioning │ │ │ │ • Cross-region replication for disaster recovery │ │ │ │ • Glacier transition for cold data │ │ │ └─────────────────────────────────────────────────────────────────────┘ │ │ │ │ ┌─────────────────────────────────────────────────────────────────────┐ │ │ │ GRAPH DATABASE (Neo4j/RuVector Graph Layer) │ │ │ │ ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ │ │ │ │ • Online backup every 6 hours │ │ │ │ • Transaction log shipping │ │ │ │ • Cluster mode for HA │ │ │ └─────────────────────────────────────────────────────────────────────┘ │ │ │ └─────────────────────────────────────────────────────────────────────────────┘ ``` #### 7.2 Backup Schedule ```typescript interface BackupSchedule { database: { full_backup: { frequency: 'daily', time: '02:00 UTC', retention_days: 30 }, incremental_backup: { frequency: 'hourly', retention_days: 7 }, wal_archiving: { frequency: 'continuous', archive_timeout_seconds: 900, // 15 minutes max retention_days: 14 } }; vector_index: { snapshot: { frequency: 'daily', time: '03:00 UTC', retention_count: 7 }, change_log: { frequency: 'continuous', retention_hours: 72 } }; audio_files: { versioning: true, cross_region_replication: true, glacier_transition_days: 90 }; graph: { online_backup: { frequency: 'every 6 hours', retention_count: 28 } }; } ``` #### 7.3 Recovery Procedures ```typescript interface RecoveryProcedures { // Point-in-time recovery async recoverToPointInTime(targetTime: ISO8601): Promise { // 1. Stop application writes await this.enableMaintenanceMode(); // 2. Identify nearest full backup const baseBackup = await this.findBaseBackup(targetTime); // 3. Restore base backup await this.restoreBaseBackup(baseBackup); // 4. Apply WAL logs up to target time await this.applyWALTo(targetTime); // 5. Rebuild vector index from restored embeddings await this.rebuildVectorIndex(); // 6. Verify data integrity const integrity = await this.verifyIntegrity(); // 7. Resume operations await this.disableMaintenanceMode(); return { success: integrity.valid, restoredTo: targetTime, recordsRecovered: integrity.recordCount, duration: Date.now() - startTime }; } // Full disaster recovery async fullDisasterRecovery(targetRegion: string): Promise { // 1. Provision new infrastructure in target region await this.provisionInfrastructure(targetRegion); // 2. Restore latest database backup const latestBackup = await this.getLatestBackup(); await this.restoreBackup(latestBackup); // 3. Sync audio files from cross-region replica await this.syncAudioFiles(targetRegion); // 4. Rebuild all indexes await this.rebuildAllIndexes(); // 5. Update DNS/load balancer await this.updateRouting(targetRegion); return { success: true, region: targetRegion, rto: Date.now() - startTime }; } } ``` #### 7.4 Recovery Objectives | Metric | Target | Description | |--------|--------|-------------| | **RPO** (Recovery Point Objective) | 15 minutes | Maximum data loss in time | | **RTO** (Recovery Time Objective) | 1 hour | Time to restore service | | **RTO (Full DR)** | 4 hours | Cross-region disaster recovery | | **Backup Verification** | Daily | Automated restore test | --- ### 8. Data Validation Rules #### 8.1 Input Validation ```typescript interface ValidationRules { recording: { file_format: ['wav', 'flac', 'mp3'], sample_rate: { min: 16000, max: 96000, preferred: 32000 }, channels: { allowed: [1, 2], preferred: 1 }, duration: { min_seconds: 1, max_seconds: 3600 }, file_size: { max_mb: 500 }, required_fields: ['file_path', 'start_ts', 'lat', 'lon'] }; segment: { duration: { min_ms: 50, max_ms: 30000 }, snr: { min_db: -10, warn_below: 5 }, overlap: { warn_above: 0.5 }, required_fields: ['recording_id', 't0_ms', 't1_ms'] }; embedding: { dimensions: 1536, norm_range: { min: 0.5, max: 2.0 }, nan_allowed: false, inf_allowed: false, required_fields: ['segment_id', 'vector', 'model_name'] }; coordinates: { lat: { min: -90, max: 90 }, lon: { min: -180, max: 180 }, precision: 6 // decimal places }; } // Validation implementation class DataValidator { validateRecording(recording: Recording): ValidationResult { const errors: ValidationError[] = []; const warnings: ValidationWarning[] = []; // Format check const ext = recording.file_path.split('.').pop()?.toLowerCase(); if (!ValidationRules.recording.file_format.includes(ext)) { errors.push({ field: 'file_path', message: `Invalid format: ${ext}` }); } // Sample rate check if (recording.sample_rate !== ValidationRules.recording.sample_rate.preferred) { warnings.push({ field: 'sample_rate', message: `Non-preferred sample rate: ${recording.sample_rate}Hz` }); } // Coordinate validation if (recording.lat < -90 || recording.lat > 90) { errors.push({ field: 'lat', message: 'Latitude out of range' }); } // Required fields for (const field of ValidationRules.recording.required_fields) { if (!recording[field]) { errors.push({ field, message: 'Required field missing' }); } } return { valid: errors.length === 0, errors, warnings }; } validateEmbedding(embedding: Embedding): ValidationResult { const errors: ValidationError[] = []; // Dimension check if (embedding.vector.length !== ValidationRules.embedding.dimensions) { errors.push({ field: 'vector', message: `Expected ${ValidationRules.embedding.dimensions}D, got ${embedding.vector.length}D` }); } // NaN/Inf check for (let i = 0; i < embedding.vector.length; i++) { if (isNaN(embedding.vector[i])) { errors.push({ field: 'vector', message: `NaN at index ${i}` }); break; } if (!isFinite(embedding.vector[i])) { errors.push({ field: 'vector', message: `Infinity at index ${i}` }); break; } } // Norm check const norm = l2Norm(embedding.vector); if (norm < ValidationRules.embedding.norm_range.min || norm > ValidationRules.embedding.norm_range.max) { errors.push({ field: 'vector', message: `Norm ${norm.toFixed(3)} outside expected range` }); } return { valid: errors.length === 0, errors, warnings: [] }; } } ``` #### 8.2 Consistency Checks ```typescript interface ConsistencyChecker { // Run all consistency checks async runChecks(): Promise { const checks = await Promise.all([ this.checkOrphanedSegments(), this.checkMissingEmbeddings(), this.checkDuplicateRecordings(), this.checkBrokenReferences(), this.checkIndexSync(), this.checkTemporalConsistency() ]); return { timestamp: new Date().toISOString(), checks, overallHealth: checks.every(c => c.passed) ? 'healthy' : 'degraded' }; } // Find segments without parent recordings async checkOrphanedSegments(): Promise { const orphans = await db.query(` SELECT cs.id FROM call_segments cs LEFT JOIN recordings r ON cs.recording_id = r.id WHERE r.id IS NULL `); return { name: 'orphaned_segments', passed: orphans.length === 0, count: orphans.length, action: 'DELETE orphaned segments or restore recordings' }; } // Find segments without embeddings async checkMissingEmbeddings(): Promise { const missing = await db.query(` SELECT cs.id FROM call_segments cs LEFT JOIN embeddings e ON cs.id = e.segment_id WHERE e.id IS NULL AND cs.created_at < NOW() - INTERVAL '1 hour' `); return { name: 'missing_embeddings', passed: missing.length === 0, count: missing.length, action: 'Reprocess segments to generate embeddings' }; } // Verify HNSW index matches database async checkIndexSync(): Promise { const dbCount = await db.query(`SELECT COUNT(*) FROM embeddings WHERE storage_tier = 'hot'`); const indexCount = await hnsw.getVectorCount(); const diff = Math.abs(dbCount - indexCount); return { name: 'index_sync', passed: diff < 100, // Allow small discrepancy during updates count: diff, action: diff > 100 ? 'Rebuild HNSW index' : 'None required' }; } // Check temporal ordering of segments async checkTemporalConsistency(): Promise { const violations = await db.query(` SELECT recording_id, COUNT(*) as overlaps FROM ( SELECT recording_id, t0_ms, t1_ms, LEAD(t0_ms) OVER (PARTITION BY recording_id ORDER BY t0_ms) as next_t0 FROM call_segments ) sub WHERE t1_ms > next_t0 GROUP BY recording_id `); return { name: 'temporal_consistency', passed: violations.length === 0, count: violations.reduce((sum, v) => sum + v.overlaps, 0), action: 'Re-segment recordings with overlapping segments' }; } } ``` #### 8.3 Quality Gates ```typescript interface QualityGates { // Gate for accepting new recordings recording_acceptance: { min_quality_score: 0.3, min_snr_db: 0, max_clipping_ratio: 0.1, min_duration_seconds: 5 }; // Gate for including in HNSW hot tier hot_tier_eligibility: { embedding_norm_range: [0.8, 1.2], segmentation_confidence: 0.7, no_nan_values: true }; // Gate for species identification species_id_confidence: { auto_accept_threshold: 0.9, human_review_threshold: 0.5, reject_below: 0.3 }; // Gate for cluster assignment cluster_assignment: { min_confidence: 0.6, max_distance_to_centroid: 0.5 }; } // Quality gate enforcement class QualityGateEnforcer { async enforceRecordingGate(recording: Recording): Promise { const gates = QualityGates.recording_acceptance; const failures: string[] = []; if (recording.quality_score < gates.min_quality_score) { failures.push(`Quality score ${recording.quality_score} < ${gates.min_quality_score}`); } // Additional checks... return { passed: failures.length === 0, failures, action: failures.length > 0 ? 'quarantine' : 'proceed' }; } } ``` --- ## Consequences ### Positive 1. **Performance**: Tiered storage achieves 150x-12,500x search improvement via HNSW with graceful degradation to warm/cold tiers 2. **Scalability**: Architecture supports 100M+ vectors with sub-second queries 3. **Flexibility**: Graph relationships enable complex Cypher queries for motif and sequence analysis 4. **Data Quality**: Comprehensive validation prevents corrupt data from entering the system 5. **Recoverability**: RPO of 15 minutes and RTO of 1 hour meet operational requirements 6. **Cost Efficiency**: 32x compression for cold tier dramatically reduces storage costs ### Negative 1. **Complexity**: Three-tier storage adds operational overhead 2. **Latency Variability**: Cold tier queries are 100-1000x slower than hot tier 3. **Migration Risk**: Quantization introduces small accuracy loss (~2-5%) 4. **Storage Duplication**: Multiple tiers may temporarily hold same data during transitions ### Mitigations - Automated tiering policies minimize manual intervention - Warm tier serves as buffer, ensuring graceful degradation - Calibrated quantization preserves retrieval quality above 95% - Background jobs clean up duplicate data after successful tier migration --- ## References - [Perch 2.0: The Bittern Lesson for Bioacoustics](https://arxiv.org/abs/2508.04665) - [RuVector: A Database that Autonomously Learns](https://github.com/ruvnet/ruvector) - [HNSW: Hierarchical Navigable Small World Graphs](https://arxiv.org/abs/1603.09320) - [Product Quantization for Nearest Neighbor Search](https://ieeexplore.ieee.org/document/5432202) - [Poincare Embeddings for Learning Hierarchical Representations](https://arxiv.org/abs/1705.08039) --- **Document Version:** 1.0 **Last Updated:** 2026-01-15 **Next Review:** 2026-04-15