ADR-014: Health Biomarker Analysis Engine
Status: Accepted | Date: 2026-02-22 | Authors: RuVector Genomics Architecture Team | Parents: ADR-001 (Vision), ADR-003 (HNSW Index), ADR-004 (Attention), ADR-009 (Variant Calling), ADR-011 (Performance Targets), ADR-013 (RVDNA Format)
Context
The rvDNA crate already implements 17 clinically relevant health SNPs across 4 categories (Cancer Risk, Cardiovascular, Neurological, Metabolism) in health.rs, with dedicated analysis functions for APOE genotyping, MTHFR compound status, and COMT/OPRM1 pain profiling. The genotyping pipeline (genotyping.rs) provides end-to-end 23andMe analysis in 7 processing stages.
However, the current health variant analysis has several limitations:
| Limitation | Impact | Module |
|---|---|---|
| No polygenic risk scoring | Individual SNP effects miss gene-gene interactions | health.rs |
| No longitudinal tracking | Cannot monitor biomarker changes over time | None |
| No streaming data ingestion | Real-time health monitoring impossible | None |
| No vector-indexed biomarker search | Cannot correlate across populations | None |
| No composite health scoring | No unified risk quantification | health.rs |
| No RVDNA biomarker section | Health data not persisted in AI-native format | rvdna.rs |
The health biomarker domain requires three capabilities beyond SNP lookup: (1) composite risk scoring that aggregates across gene networks, (2) streaming ingestion for real-time monitoring, and (3) HNSW-indexed population-scale similarity search for correlating individual profiles against reference cohorts.
Decision: Health Biomarker Analysis Engine
We introduce a biomarker analysis engine (biomarker.rs) that extends the existing health.rs SNP analysis with:
- Composite Biomarker Profiles — Aggregate individual SNP results into category-level and global risk scores with configurable weighting
- Streaming Data Simulation — Simulated real-time biomarker data streams with configurable noise, drift, and anomaly injection for testing temporal analysis
- HNSW-Indexed Profile Search — Store biomarker profiles as dense vectors in an HNSW index for population-scale similarity search
- Temporal Biomarker Tracking — Time-series analysis with trend detection, moving averages, and anomaly detection
- Real Example Data — Curated biomarker datasets based on clinically validated reference ranges
Architecture Overview
┌─────────────────────────────────────────────────────────────────┐
│                     Health Biomarker Engine                     │
├──────────────┬──────────────┬───────────────┬───────────────────┤
│ Composite    │ Streaming    │ HNSW-Indexed  │ Temporal          │
│ Risk Score   │ Simulator    │ Population    │ Tracker           │
│              │              │ Search        │                   │
├──────────────┤              │               │                   │
│ Gene Network │ Noise Model  │ Profile Vec   │ Moving Average    │
│ Interaction  │ Drift Model  │ Quantization  │ Trend Detection   │
│ Weights      │ Anomalies    │ Similarity    │ Anomaly Detect    │
└──────┬───────┴──────┬───────┴───────┬───────┴───────┬───────────┘
       │              │               │               │
┌──────┴──────┐ ┌─────┴─────┐   ┌─────┴──────┐   ┌────┴────────┐
│ health.rs   │ │ tokio     │   │ ruvector   │   │ biomarker   │
│ 17 SNPs     │ │ streams   │   │ -core HNSW │   │ time series │
│ APOE/MTHFR  │ │ channels  │   │ VectorDB   │   │ ring buffer │
└─────────────┘ └───────────┘   └────────────┘   └─────────────┘
Component Specifications
1. Composite Biomarker Profile
pub struct BiomarkerProfile {
    pub subject_id: String,
    pub timestamp: i64,
    pub snp_results: Vec<HealthVariantResult>,
    pub category_scores: HashMap<String, CategoryScore>,
    pub global_risk_score: f64,
    pub profile_vector: Vec<f32>, // Dense vector for HNSW indexing
}

pub struct CategoryScore {
    pub category: String,
    pub score: f64,      // 0.0 (low risk) to 1.0 (high risk)
    pub confidence: f64, // Based on genotyped fraction
    pub contributing_variants: Vec<String>,
}
Scoring Algorithm:
- Each SNP contributes a risk weight based on its clinical significance and genotype
- Category scores aggregate SNP weights within gene-network boundaries
- Gene-gene interaction terms (e.g., COMT x OPRM1 for pain) apply multiplicative modifiers
- Global risk score uses weighted geometric mean across categories
- Profile vector is the concatenation of normalized category scores + individual SNP encodings (one-hot genotype)
Weight Matrix (evidence-based):
| Gene | Risk Weight (Hom Ref) | Risk Weight (Het) | Risk Weight (Hom Alt) | Category |
|---|---|---|---|---|
| APOE (rs429358) | 0.0 | 0.45 | 0.90 | Neurological |
| BRCA1 (rs80357906) | 0.0 | 0.70 | 0.95 | Cancer |
| MTHFR C677T | 0.0 | 0.30 | 0.65 | Metabolism |
| COMT Val158Met | 0.0 | 0.25 | 0.50 | Neurological |
| CYP1A2 | 0.0 | 0.15 | 0.35 | Metabolism |
| SLCO1B1 | 0.0 | 0.40 | 0.75 | Cardiovascular |
Interaction Terms:
| Interaction | Modifier | Rationale |
|---|---|---|
| COMT(AA) x OPRM1(GG) | 1.4x pain score | Synergistic pain sensitivity |
| MTHFR(677TT) x MTHFR(1298CC) | 1.3x metabolism score | Compound MTHFR genotype (both loci homozygous variant) |
| APOE(e4/e4) x TP53(variant) | 1.2x neurological score | Neurodegeneration + impaired DNA repair |
| BRCA1(carrier) x TP53(variant) | 1.5x cancer score | DNA repair pathway compromise |
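The scoring steps above can be sketched as follows. This is a minimal illustration, not the engine's implementation: the ADR does not say how SNP weights are aggregated within a category, so the sketch assumes a simple mean, clamps the modified score to the 0.0 to 1.0 range, and floors each category score at a small epsilon so that a single zero score does not collapse the geometric mean. The names `Genotype`, `SnpWeight`, `category_score`, and `global_risk` are hypothetical.

```rust
/// Genotype at one SNP relative to the reference allele.
#[derive(Clone, Copy, Debug, PartialEq)]
pub enum Genotype {
    HomRef,
    Het,
    HomAlt,
}

/// One row of the weight matrix: risk weights for (hom ref, het, hom alt).
pub struct SnpWeight {
    pub category: &'static str,
    pub weights: [f64; 3],
}

impl SnpWeight {
    /// Looks up the risk weight for the observed genotype.
    pub fn risk(&self, g: Genotype) -> f64 {
        match g {
            Genotype::HomRef => self.weights[0],
            Genotype::Het => self.weights[1],
            Genotype::HomAlt => self.weights[2],
        }
    }
}

/// Category score: mean SNP risk (an assumed aggregation rule) multiplied
/// by the interaction modifier, clamped to [0.0, 1.0].
pub fn category_score(risks: &[f64], modifier: f64) -> f64 {
    if risks.is_empty() {
        return 0.0;
    }
    let mean = risks.iter().sum::<f64>() / risks.len() as f64;
    (mean * modifier).clamp(0.0, 1.0)
}

/// Weighted geometric mean over (score, weight) pairs:
/// exp(sum(w_i * ln(s_i)) / sum(w_i)).
/// Scores are floored at EPS (an assumption) to keep ln() finite.
pub fn global_risk(scores: &[(f64, f64)]) -> f64 {
    const EPS: f64 = 1e-3;
    let total_w: f64 = scores.iter().map(|(_, w)| w).sum();
    if total_w <= 0.0 {
        return 0.0;
    }
    let log_sum: f64 = scores.iter().map(|(s, w)| w * s.max(EPS).ln()).sum();
    (log_sum / total_w).exp()
}
```

With the table values, a heterozygous COMT (0.25) and heterozygous APOE (0.45) average to 0.35; applying a 1.2x modifier gives a neurological score of about 0.42 under these assumptions.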
2. Streaming Biomarker Simulator
pub struct StreamConfig {
    pub base_interval_ms: u64,    // Interval between readings
    pub noise_amplitude: f64,     // Gaussian noise σ
    pub drift_rate: f64,          // Linear drift per reading
    pub anomaly_probability: f64, // Probability of anomalous reading
    pub anomaly_magnitude: f64,   // Size of anomaly spike
    pub num_biomarkers: usize,    // Number of parallel streams
    pub window_size: usize,       // Sliding window for statistics
}

pub struct BiomarkerReading {
    pub timestamp_ms: u64,
    pub biomarker_id: String,
    pub value: f64,
    pub reference_range: (f64, f64),
    pub is_anomaly: bool,
    pub z_score: f64,
}
Simulation Model:
- Base values drawn from clinically validated reference ranges (see Section 3)
- Gaussian noise with configurable σ (default: 2% of reference range)
- Linear drift models chronic condition progression
- Anomaly injection as an independent per-reading Bernoulli trial (default: p=0.02 per reading), a discrete-time approximation of a Poisson process
- Anomalies modeled as multiplicative spikes (default: 2.5x normal variation)
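The simulation model above can be sketched without external crates. The block below is an illustration under stated assumptions: `Rng` is a small xorshift64* generator with a Box-Muller transform standing in for whatever random source the real simulator uses, `SimParams` is a hypothetical subset of `StreamConfig`, and an anomaly is interpreted as the noise term scaled by `anomaly_magnitude`, which is only one reading of "multiplicative spike".

```rust
/// Minimal xorshift64* generator so the sketch needs no external crates.
pub struct Rng(u64);

impl Rng {
    /// The state must be nonzero, so a zero seed is bumped to 1.
    pub fn new(seed: u64) -> Self {
        Rng(seed.max(1))
    }

    fn next_u64(&mut self) -> u64 {
        let mut x = self.0;
        x ^= x >> 12;
        x ^= x << 25;
        x ^= x >> 27;
        self.0 = x;
        x.wrapping_mul(0x2545_F491_4F6C_DD1D)
    }

    /// Uniform sample in [0, 1).
    pub fn uniform(&mut self) -> f64 {
        (self.next_u64() >> 11) as f64 / (1u64 << 53) as f64
    }

    /// Standard normal sample via the Box-Muller transform.
    pub fn normal(&mut self) -> f64 {
        let u1 = 1.0 - self.uniform(); // in (0, 1], avoids ln(0)
        let u2 = self.uniform();
        (-2.0 * u1.ln()).sqrt() * (2.0 * std::f64::consts::PI * u2).cos()
    }
}

/// Hypothetical subset of StreamConfig used by this sketch.
pub struct SimParams {
    pub noise_amplitude: f64,     // Gaussian noise sigma
    pub drift_rate: f64,          // Linear drift per reading
    pub anomaly_probability: f64, // Per-reading anomaly probability
    pub anomaly_magnitude: f64,   // Noise multiplier for anomalous readings
}

/// Generates `n` readings as (value, is_anomaly) pairs:
/// value = base + drift_rate * i + noise, with noise scaled on anomalies.
pub fn simulate(base: f64, p: &SimParams, n: usize, rng: &mut Rng) -> Vec<(f64, bool)> {
    (0..n)
        .map(|i| {
            let drift = p.drift_rate * i as f64;
            let mut noise = p.noise_amplitude * rng.normal();
            let is_anomaly = rng.uniform() < p.anomaly_probability;
            if is_anomaly {
                noise *= p.anomaly_magnitude;
            }
            (base + drift + noise, is_anomaly)
        })
        .collect()
}
```

Seeding the generator explicitly keeps runs reproducible, which is useful when the stream is used to test temporal analysis.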
Streaming Protocol:
- Uses tokio::sync::mpsc channels for async data flow
- Ring buffer (capacity: 10,000 readings) for windowed statistics
- Moving average, exponential smoothing, and z-score computation in real-time
- Backpressure via bounded channels prevents memory exhaustion
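The windowed statistics described above can be illustrated with a fixed-capacity window that maintains running sums, so each reading costs constant time. This is a single-threaded sketch built on `std::collections::VecDeque`, not the lock-free ring buffer the ADR targets, and the names `Window` and `ema_step` are hypothetical. It uses the population standard deviation; running sums can accumulate floating-point error over very long streams, which a production version would need to address.

```rust
use std::collections::VecDeque;

/// Fixed-capacity sliding window with running sums for O(1) statistics.
pub struct Window {
    buf: VecDeque<f64>,
    capacity: usize,
    sum: f64,
    sum_sq: f64,
}

impl Window {
    pub fn new(capacity: usize) -> Self {
        assert!(capacity > 0, "window capacity must be positive");
        Window { buf: VecDeque::with_capacity(capacity), capacity, sum: 0.0, sum_sq: 0.0 }
    }

    /// Adds a reading, evicting the oldest one when the window is full.
    pub fn push(&mut self, v: f64) {
        if self.buf.len() == self.capacity {
            if let Some(old) = self.buf.pop_front() {
                self.sum -= old;
                self.sum_sq -= old * old;
            }
        }
        self.buf.push_back(v);
        self.sum += v;
        self.sum_sq += v * v;
    }

    /// Moving average over the current window, or None if it is empty.
    pub fn mean(&self) -> Option<f64> {
        if self.buf.is_empty() {
            None
        } else {
            Some(self.sum / self.buf.len() as f64)
        }
    }

    /// Population standard deviation over the current window.
    pub fn std_dev(&self) -> Option<f64> {
        let mean = self.mean()?;
        let var = (self.sum_sq / self.buf.len() as f64 - mean * mean).max(0.0);
        Some(var.sqrt())
    }

    /// Z-score of `v` against the window; None if the window is empty
    /// or has zero variance.
    pub fn z_score(&self, v: f64) -> Option<f64> {
        let sd = self.std_dev()?;
        if sd == 0.0 {
            return None;
        }
        Some((v - self.mean()?) / sd)
    }
}

/// One exponential smoothing step: s_t = alpha * x_t + (1 - alpha) * s_{t-1}.
pub fn ema_step(prev: f64, x: f64, alpha: f64) -> f64 {
    alpha * x + (1.0 - alpha) * prev
}
```

A reading would typically be scored with `z_score` before it is pushed, so that an anomalous value does not inflate the statistics it is compared against.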
3. HNSW-Indexed Population Search
Biomarker profile vectors are stored in RuVector's HNSW index for population-scale similarity search:
pub struct PopulationIndex {
    pub db: VectorDB,
    pub profile_dim: usize, // Vector dimension (typically 64)
    pub population_size: usize,
    pub metadata: HashMap<String, serde_json::Value>,
}
Vector Encoding:
- 17 SNPs x 3 genotype one-hot = 51 dimensions
- 4 category scores = 4 dimensions
- 1 global risk score = 1 dimension
- 4 interaction terms = 4 dimensions
- MTHFR score (1) + Pain score (1) + APOE risk (1) + Caffeine metabolism (1) = 4 dimensions
- Total: 64 dimensions (power of 2 for SIMD alignment)
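The layout above (51 + 4 + 1 + 4 + 4 = 64) can be sketched as a fixed-size encoder. The block order follows the list as written, which is an assumption about the actual layout; a missing genotype is encoded as three zeros, also an assumption. `encode_profile` and its parameter names are hypothetical.

```rust
pub const NUM_SNPS: usize = 17;
pub const PROFILE_DIM: usize = 64;

#[derive(Clone, Copy, Debug, PartialEq)]
pub enum Genotype {
    HomRef,
    Het,
    HomAlt,
}

/// Builds the 64-dim profile vector:
///   [0..51)  one-hot genotype, 3 slots per SNP (all zero if not genotyped)
///   [51..55) category scores
///   [55]     global risk score
///   [56..60) interaction terms
///   [60..64) MTHFR, pain, APOE, and caffeine metabolism scores
pub fn encode_profile(
    genotypes: &[Option<Genotype>; NUM_SNPS],
    category_scores: &[f32; 4],
    global_risk: f32,
    interactions: &[f32; 4],
    derived: &[f32; 4],
) -> [f32; PROFILE_DIM] {
    let mut v = [0.0f32; PROFILE_DIM];
    for (i, g) in genotypes.iter().enumerate() {
        if let Some(g) = g {
            let slot = match g {
                Genotype::HomRef => 0,
                Genotype::Het => 1,
                Genotype::HomAlt => 2,
            };
            v[i * 3 + slot] = 1.0;
        }
    }
    v[51..55].copy_from_slice(category_scores);
    v[55] = global_risk;
    v[56..60].copy_from_slice(interactions);
    v[60..64].copy_from_slice(derived);
    v
}
```

Returning a fixed-size array rather than a `Vec<f32>` lets the compiler check the dimension; converting to the `Vec<f32>` in `BiomarkerProfile` is a `to_vec()` call.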
Search Performance (from ADR-011):
- p50 latency: <100 μs at 10k profiles
- p99 latency: <250 μs at 10k profiles
- Recall@10: >97%
- HNSW config: M=16, ef_construction=200, ef_search=50
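To make the Recall@10 figure concrete, the sketch below shows one way recall could be measured: compute the exact top-k by brute force and count how many of those neighbors an approximate search also returned. It does not call the ruvector-core HNSW API. Cosine similarity is assumed as the metric because the ADR does not name one, and `cosine`, `exact_top_k`, and `recall_at_k` are hypothetical helpers.

```rust
/// Cosine similarity between two equal-length vectors (0.0 if either is zero).
pub fn cosine(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let na = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let nb = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if na == 0.0 || nb == 0.0 {
        0.0
    } else {
        dot / (na * nb)
    }
}

/// Exact top-k by brute force: indices of the k most similar profiles,
/// most similar first.
pub fn exact_top_k(query: &[f32], population: &[Vec<f32>], k: usize) -> Vec<usize> {
    let mut scored: Vec<(usize, f32)> = population
        .iter()
        .enumerate()
        .map(|(i, p)| (i, cosine(query, p)))
        .collect();
    scored.sort_by(|a, b| b.1.total_cmp(&a.1));
    scored.into_iter().take(k).map(|(i, _)| i).collect()
}

/// Recall@k: fraction of the exact neighbors that the approximate
/// result also contains (order is ignored).
pub fn recall_at_k(exact: &[usize], approx: &[usize]) -> f64 {
    if exact.is_empty() {
        return 1.0;
    }
    let hits = exact.iter().filter(|&&id| approx.contains(&id)).count();
    hits as f64 / exact.len() as f64
}
```

Averaging `recall_at_k` over a set of query profiles, with `approx` taken from the HNSW index, gives a number comparable to the target above.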
4. Reference Biomarker Data
Curated reference ranges from clinical literature (CDC, WHO, NCBI ClinVar):
| Biomarker | Unit | Low | Normal Low | Normal High | High | Critical |
|---|---|---|---|---|---|---|
| Total Cholesterol | mg/dL | - | <200 | 200-239 | >=240 | >300 |
| LDL Cholesterol | mg/dL | - | <100 | 100-159 | >=160 | >190 |
| HDL Cholesterol | mg/dL | <40 | 40-59 | >=60 | - | - |
| Triglycerides | mg/dL | - | <150 | 150-199 | >=200 | >500 |
| Fasting Glucose | mg/dL | <70 | 70-99 | 100-125 | >=126 | >300 |
| HbA1c | % | <4.0 | 4.0-5.6 | 5.7-6.4 | >=6.5 | >10.0 |
| Homocysteine | μmol/L | - | <10 | 10-15 | >15 | >30 |
| Vitamin D (25-OH) | ng/mL | <20 | 20-29 | 30-100 | >100 | >150 |
| CRP (hs) | mg/L | - | <1.0 | 1.0-3.0 | >3.0 | >10.0 |
| TSH | mIU/L | <0.4 | 0.4-2.0 | 2.0-4.0 | >4.0 | >10.0 |
| Ferritin | ng/mL | <12 | 12-150 | 150-300 | >300 | >1000 |
| Vitamin B12 | pg/mL | <200 | 200-300 | 300-900 | >900 | - |
These values are used to:
- Validate streaming simulator output
- Calculate z-scores for anomaly detection
- Generate realistic synthetic population data
- Provide clinical context in biomarker reports
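As an illustration of how a table row might be applied, the sketch below classifies a fasting glucose value using the thresholds from the table and derives a z-score from a reference range. The z-score rule is an assumption: it treats the normal range as mean ± 2 standard deviations, which the ADR does not specify. `Band`, `classify_fasting_glucose`, and `reference_z_score` are hypothetical names.

```rust
#[derive(Debug, PartialEq)]
pub enum Band {
    Low,
    NormalLow,
    NormalHigh,
    High,
    Critical,
}

/// Classifies fasting glucose (mg/dL) with the table's thresholds:
/// <70 low, 70-99 normal low, 100-125 normal high, >=126 high, >300 critical.
pub fn classify_fasting_glucose(mg_dl: f64) -> Band {
    if mg_dl > 300.0 {
        Band::Critical
    } else if mg_dl >= 126.0 {
        Band::High
    } else if mg_dl >= 100.0 {
        Band::NormalHigh
    } else if mg_dl >= 70.0 {
        Band::NormalLow
    } else {
        Band::Low
    }
}

/// Z-score relative to a reference range (lo, hi), assuming the range
/// spans mean ± 2 standard deviations. Returns None for an empty range.
pub fn reference_z_score(value: f64, range: (f64, f64)) -> Option<f64> {
    let (lo, hi) = range;
    if hi <= lo {
        return None;
    }
    let mean = (lo + hi) / 2.0;
    let sd = (hi - lo) / 4.0;
    Some((value - mean) / sd)
}
```

Under this assumption, a value at the upper edge of its reference range has a z-score of 2.0, and a value at the midpoint has a z-score of 0.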
Performance Targets
| Operation | Target | Mechanism |
|---|---|---|
| Composite score (17 SNPs) | <50 μs | In-memory weight matrix multiply |
| Profile vector encoding | <100 μs | One-hot + normalize |
| Population similarity top-10 | <150 μs | HNSW search on 64-dim vectors |
| Stream processing (single reading) | <10 μs | Ring buffer + running stats |
| Anomaly detection | <5 μs | Z-score against moving window |
| Full biomarker report | <1 ms | Score + encode + search |
| Population index build (10k) | <500 ms | Batch HNSW insert |
| Streaming throughput | >100k readings/sec | Lock-free ring buffer |
Integration Points
| Existing Module | Integration | Direction |
|---|---|---|
| health.rs | SNP results feed composite scorer | Input |
| genotyping.rs | 23andMe pipeline generates BiomarkerProfile | Input |
| ruvector-core | HNSW index stores profile vectors | Bidirectional |
| rvdna.rs | Profile vectors stored in metadata section | Output |
| epigenomics.rs | Methylation data enriches biomarker profile | Input |
| pharma.rs | CYP metabolizer status informs drug-related biomarkers | Input |
Consequences
Positive:
- Unified risk scoring replaces per-SNP interpretation with actionable composite scores
- Streaming architecture enables real-time health monitoring use cases
- HNSW indexing enables population-scale "patients like me" queries in <150 μs
- Reference biomarker data provides clinical validation framework
- 64-dim profile vectors are SIMD-aligned for maximum throughput
- Ring buffer streaming achieves >100k readings/sec without allocation pressure
Negative:
- Composite scoring weights are simplified; clinical deployment requires validated coefficients from GWAS
- Streaming simulator generates synthetic data only; real clinical integration requires HL7/FHIR adapters
- Additional 64-dim vector per profile increases RVDNA file size by ~256 bytes per subject (64 values × 4-byte f32)
Neutral:
- Risk scores are educational/research only; same disclaimer as existing health.rs
- Gene-gene interaction terms are limited to known pairs; extensible via configuration
Options Considered
- Extend health.rs with scoring — rejected: would grow file beyond 500-line limit; scoring + streaming + search are distinct bounded contexts
- Separate crate — rejected: too much coupling to existing types; shared types across modules
- New module (biomarker.rs) — selected: clean separation, imports from health.rs, integrates with ruvector-core HNSW, stays within the rvDNA crate boundary
Implementation Strategy
Phase 1 (This ADR):
- biomarker.rs: Composite scoring engine with reference data
- biomarker_stream.rs: Streaming simulator with ring buffer and anomaly detection
- Integration tests with realistic 23andMe-derived profiles
- Benchmark suite validating performance targets
Phase 2 (Future):
- RVDNA Section 7: Biomarker profile storage in binary format
- Population index persistence (serialize HNSW graph to RVDNA)
- WASM export for browser-based biomarker dashboards
- HL7/FHIR streaming adapter for clinical integration
Related Decisions
- ADR-001: Vision — health biomarker analysis is a key clinical application
- ADR-003: HNSW index — population search uses the same index infrastructure
- ADR-009: Variant calling — biomarker profiles integrate variant quality scores
- ADR-011: Performance targets — all biomarker operations must meet latency budgets
- ADR-013: RVDNA format — biomarker vectors stored in metadata section (Phase 1) or dedicated section (Phase 2)
References
- CPIC Guidelines — Pharmacogenomics dosing guidelines
- ClinVar — Clinical variant significance database
- gnomAD — Population allele frequencies
- Horvath Clock — Epigenetic age estimation
- APOE Alzheimer's Meta-Analysis — e4 odds ratios
- MTHFR Clinical Review — Compound heterozygote effects