Files
wifi-densepose/examples/dna/adr/ADR-014-health-biomarker-analysis.md
ruv d803bfe2b1 Squashed 'vendor/ruvector/' content from commit b64c2172
git-subtree-dir: vendor/ruvector
git-subtree-split: b64c21726f2bb37286d9ee36a7869fef60cc6900
2026-02-28 14:39:40 -05:00

271 lines
14 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# ADR-014: Health Biomarker Analysis Engine
**Status:** Accepted | **Date:** 2026-02-22 | **Authors:** RuVector Genomics Architecture Team
**Parents:** ADR-001 (Vision), ADR-003 (HNSW Index), ADR-004 (Attention), ADR-009 (Variant Calling), ADR-011 (Performance Targets), ADR-013 (RVDNA Format)
## Context
The rvDNA crate already implements 17 clinically-relevant health SNPs across 4 categories (Cancer Risk, Cardiovascular, Neurological, Metabolism) in `health.rs`, with dedicated analysis functions for APOE genotyping, MTHFR compound status, and COMT/OPRM1 pain profiling. The genotyping pipeline (`genotyping.rs`) provides end-to-end 23andMe analysis with 7-stage processing.
However, the current health variant analysis has several limitations:
| Limitation | Impact | Module |
|-----------|--------|--------|
| No polygenic risk scoring | Individual SNP effects miss gene-gene interactions | `health.rs` |
| No longitudinal tracking | Cannot monitor biomarker changes over time | None |
| No streaming data ingestion | Real-time health monitoring impossible | None |
| No vector-indexed biomarker search | Cannot correlate across populations | None |
| No composite health scoring | No unified risk quantification | `health.rs` |
| No RVDNA biomarker section | Health data not persisted in AI-native format | `rvdna.rs` |
The health biomarker domain requires three capabilities beyond SNP lookup: (1) composite risk scoring that aggregates across gene networks, (2) streaming ingestion for real-time monitoring, and (3) HNSW-indexed population-scale similarity search for correlating individual profiles against reference cohorts.
## Decision: Health Biomarker Analysis Engine
We introduce a biomarker analysis engine (`biomarker.rs`) that extends the existing `health.rs` SNP analysis with:
1. **Composite Biomarker Profiles** — Aggregate individual SNP results into category-level and global risk scores with configurable weighting
2. **Streaming Data Simulation** — Simulated real-time biomarker data streams with configurable noise, drift, and anomaly injection for testing temporal analysis
3. **HNSW-Indexed Profile Search** — Store biomarker profiles as dense vectors in HNSW index for population-scale similarity search
4. **Temporal Biomarker Tracking** — Time-series analysis with trend detection, moving averages, and anomaly detection
5. **Real Example Data** — Curated biomarker datasets based on clinically validated reference ranges
### Architecture Overview
```
┌─────────────────────────────────────────────────────────────────┐
│ Health Biomarker Engine │
├──────────────┬──────────────┬───────────────┬───────────────────┤
│ Composite │ Streaming │ HNSW-Indexed │ Temporal │
│ Risk Score │ Simulator │ Population │ Tracker │
│ │ │ Search │ │
├──────────────┤ │ │ │
│ Gene Network │ Noise Model │ Profile Vec │ Moving Average │
│ Interaction │ Drift Model │ Quantization │ Trend Detection │
│ Weights │ Anomalies │ Similarity │ Anomaly Detect │
└──────┬───────┴──────┬───────┴───────┬───────┴───────┬───────────┘
│ │ │ │
┌──────┴──────┐ ┌─────┴─────┐ ┌─────┴──────┐ ┌────┴────────┐
│ health.rs │ │ tokio │ │ ruvector │ │ biomarker │
│ 17 SNPs │ │ streams │ │ -core HNSW │ │ time series │
│ APOE/MTHFR │ │ channels │ │ VectorDB │ │ ring buffer │
└─────────────┘ └───────────┘ └────────────┘ └─────────────┘
```
### Component Specifications
#### 1. Composite Biomarker Profile
```rust
pub struct BiomarkerProfile {
pub subject_id: String,
pub timestamp: i64,
pub snp_results: Vec<HealthVariantResult>,
pub category_scores: HashMap<String, CategoryScore>,
pub global_risk_score: f64,
pub profile_vector: Vec<f32>, // Dense vector for HNSW indexing
}
pub struct CategoryScore {
pub category: String,
pub score: f64, // 0.0 (low risk) to 1.0 (high risk)
pub confidence: f64, // Based on genotyped fraction
pub contributing_variants: Vec<String>,
}
```
**Scoring Algorithm:**
- Each SNP contributes a risk weight based on its clinical significance and genotype
- Category scores aggregate SNP weights within gene-network boundaries
- Gene-gene interaction terms (e.g., COMT x OPRM1 for pain) apply multiplicative modifiers
- Global risk score uses weighted geometric mean across categories
- Profile vector is the concatenation of normalized category scores + individual SNP encodings (one-hot genotype)
**Weight Matrix (evidence-based):**
| Gene | Risk Weight (Hom Ref) | Risk Weight (Het) | Risk Weight (Hom Alt) | Category |
|------|----------------------|-------------------|----------------------|----------|
| APOE (rs429358) | 0.0 | 0.45 | 0.90 | Neurological |
| BRCA1 (rs80357906) | 0.0 | 0.70 | 0.95 | Cancer |
| MTHFR C677T | 0.0 | 0.30 | 0.65 | Metabolism |
| COMT Val158Met | 0.0 | 0.25 | 0.50 | Neurological |
| CYP1A2 | 0.0 | 0.15 | 0.35 | Metabolism |
| SLCO1B1 | 0.0 | 0.40 | 0.75 | Cardiovascular |
**Interaction Terms:**
| Interaction | Modifier | Rationale |
|------------|----------|-----------|
| COMT(AA) x OPRM1(GG) | 1.4x pain score | Synergistic pain sensitivity |
| MTHFR(677TT) x MTHFR(1298CC) | 1.3x metabolism score | Compound heterozygote |
| APOE(e4/e4) x TP53(variant) | 1.2x neurological score | Neurodegeneration + impaired DNA repair |
| BRCA1(carrier) x TP53(variant) | 1.5x cancer score | DNA repair pathway compromise |
#### 2. Streaming Biomarker Simulator
```rust
pub struct StreamConfig {
pub base_interval_ms: u64, // Interval between readings
pub noise_amplitude: f64, // Gaussian noise σ
pub drift_rate: f64, // Linear drift per reading
pub anomaly_probability: f64, // Probability of anomalous reading
pub anomaly_magnitude: f64, // Size of anomaly spike
pub num_biomarkers: usize, // Number of parallel streams
pub window_size: usize, // Sliding window for statistics
}
pub struct BiomarkerReading {
pub timestamp_ms: u64,
pub biomarker_id: String,
pub value: f64,
pub reference_range: (f64, f64),
pub is_anomaly: bool,
pub z_score: f64,
}
```
**Simulation Model:**
- Base values drawn from clinically validated reference ranges (see Section 3)
- Gaussian noise with configurable σ (default: 2% of reference range)
- Linear drift models chronic condition progression
- Anomaly injection via Poisson process (default: p=0.02 per reading)
- Anomalies modeled as multiplicative spikes (default: 2.5x normal variation)
**Streaming Protocol:**
- Uses `tokio::sync::mpsc` channels for async data flow
- Ring buffer (capacity: 10,000 readings) for windowed statistics
- Moving average, exponential smoothing, and z-score computation in real-time
- Backpressure via bounded channels prevents memory exhaustion
#### 3. HNSW-Indexed Population Search
Biomarker profile vectors are stored in RuVector's HNSW index for population-scale similarity search:
```rust
pub struct PopulationIndex {
pub db: VectorDB,
pub profile_dim: usize, // Vector dimension (typically 64)
pub population_size: usize,
pub metadata: HashMap<String, serde_json::Value>,
}
```
**Vector Encoding:**
- 17 SNPs x 3 genotype one-hot = 51 dimensions
- 4 category scores = 4 dimensions
- 1 global risk score = 1 dimension
- 4 interaction terms = 4 dimensions
- MTHFR score (1) + Pain score (1) + APOE risk (1) + Caffeine metabolism (1) = 4 dimensions
- **Total: 64 dimensions** (power of 2 for SIMD alignment)
**Search Performance (from ADR-011):**
- p50 latency: <100 μs at 10k profiles
- p99 latency: <250 μs at 10k profiles
- Recall@10: >97%
- HNSW config: M=16, ef_construction=200, ef_search=50
#### 4. Reference Biomarker Data
Curated reference ranges from clinical literature (CDC, WHO, NCBI ClinVar):
| Biomarker | Unit | Low | Normal Low | Normal High | High | Critical |
|-----------|------|-----|------------|-------------|------|----------|
| Total Cholesterol | mg/dL | - | <200 | 200-239 | >=240 | >300 |
| LDL Cholesterol | mg/dL | - | <100 | 100-159 | >=160 | >190 |
| HDL Cholesterol | mg/dL | <40 | 40-59 | >=60 | - | - |
| Triglycerides | mg/dL | - | <150 | 150-199 | >=200 | >500 |
| Fasting Glucose | mg/dL | <70 | 70-99 | 100-125 | >=126 | >300 |
| HbA1c | % | <4.0 | 4.0-5.6 | 5.7-6.4 | >=6.5 | >10.0 |
| Homocysteine | μmol/L | - | <10 | 10-15 | >15 | >30 |
| Vitamin D (25-OH) | ng/mL | <20 | 20-29 | 30-100 | >100 | >150 |
| CRP (hs) | mg/L | - | <1.0 | 1.0-3.0 | >3.0 | >10.0 |
| TSH | mIU/L | <0.4 | 0.4-2.0 | 2.0-4.0 | >4.0 | >10.0 |
| Ferritin | ng/mL | <12 | 12-150 | 150-300 | >300 | >1000 |
| Vitamin B12 | pg/mL | <200 | 200-300 | 300-900 | >900 | - |
These values are used to:
1. Validate streaming simulator output
2. Calculate z-scores for anomaly detection
3. Generate realistic synthetic population data
4. Provide clinical context in biomarker reports
### Performance Targets
| Operation | Target | Mechanism |
|-----------|--------|-----------|
| Composite score (17 SNPs) | <50 μs | In-memory weight matrix multiply |
| Profile vector encoding | <100 μs | One-hot + normalize |
| Population similarity top-10 | <150 μs | HNSW search on 64-dim vectors |
| Stream processing (single reading) | <10 μs | Ring buffer + running stats |
| Anomaly detection | <5 μs | Z-score against moving window |
| Full biomarker report | <1 ms | Score + encode + search |
| Population index build (10k) | <500 ms | Batch HNSW insert |
| Streaming throughput | >100k readings/sec | Lock-free ring buffer |
### Integration Points
| Existing Module | Integration | Direction |
|----------------|-------------|-----------|
| `health.rs` | SNP results feed composite scorer | Input |
| `genotyping.rs` | 23andMe pipeline generates BiomarkerProfile | Input |
| `ruvector-core` | HNSW index stores profile vectors | Bidirectional |
| `rvdna.rs` | Profile vectors stored in metadata section | Output |
| `epigenomics.rs` | Methylation data enriches biomarker profile | Input |
| `pharma.rs` | CYP metabolizer status informs drug-related biomarkers | Input |
## Consequences
**Positive:**
- Unified risk scoring replaces per-SNP interpretation with actionable composite scores
- Streaming architecture enables real-time health monitoring use cases
- HNSW indexing enables population-scale "patients like me" queries in <150 μs
- Reference biomarker data provides clinical validation framework
- 64-dim profile vectors are SIMD-aligned for maximum throughput
- Ring buffer streaming achieves >100k readings/sec without allocation pressure
**Negative:**
- Composite scoring weights are simplified; clinical deployment requires validated coefficients from GWAS
- Streaming simulator generates synthetic data only; real clinical integration requires HL7/FHIR adapters
- Additional 64-dim vector per profile increases RVDNA file size by ~256 bytes per subject
**Neutral:**
- Risk scores are educational/research only; same disclaimer as existing `health.rs`
- Gene-gene interaction terms are limited to known pairs; extensible via configuration
## Options Considered
1. **Extend health.rs with scoring** — rejected: would grow file beyond 500-line limit; scoring + streaming + search are distinct bounded contexts
2. **Separate crate** — rejected: too much coupling to existing types; shared types across modules
3. **New module (biomarker.rs)** — selected: clean separation, imports from `health.rs`, integrates with `ruvector-core` HNSW, stays within the rvDNA crate boundary
## Implementation Strategy
**Phase 1 (This ADR):**
- `biomarker.rs`: Composite scoring engine with reference data
- `biomarker_stream.rs`: Streaming simulator with ring buffer and anomaly detection
- Integration tests with realistic 23andMe-derived profiles
- Benchmark suite validating performance targets
**Phase 2 (Future):**
- RVDNA Section 7: Biomarker profile storage in binary format
- Population index persistence (serialize HNSW graph to RVDNA)
- WASM export for browser-based biomarker dashboards
- HL7/FHIR streaming adapter for clinical integration
## Related Decisions
- **ADR-001**: Vision — health biomarker analysis is a key clinical application
- **ADR-003**: HNSW index — population search uses the same index infrastructure
- **ADR-009**: Variant calling — biomarker profiles integrate variant quality scores
- **ADR-011**: Performance targets — all biomarker operations must meet latency budgets
- **ADR-013**: RVDNA format — biomarker vectors stored in metadata section (Phase 1) or dedicated section (Phase 2)
## References
- [CPIC Guidelines](https://cpicpgx.org/) — Pharmacogenomics dosing guidelines
- [ClinVar](https://www.ncbi.nlm.nih.gov/clinvar/) — Clinical variant significance database
- [gnomAD](https://gnomad.broadinstitute.org/) — Population allele frequencies
- [Horvath Clock](https://doi.org/10.1186/gb-2013-14-10-r115) — Epigenetic age estimation
- [APOE Alzheimer's Meta-Analysis](https://doi.org/10.1001/jama.278.16.1349) — e4 odds ratios
- [MTHFR Clinical Review](https://doi.org/10.1007/s12035-019-1547-z) — Compound heterozygote effects