Files
wifi-densepose/examples/dna/adr/ADR-014-health-biomarker-analysis.md
ruv d803bfe2b1 Squashed 'vendor/ruvector/' content from commit b64c2172
git-subtree-dir: vendor/ruvector
git-subtree-split: b64c21726f2bb37286d9ee36a7869fef60cc6900
2026-02-28 14:39:40 -05:00

14 KiB
Raw Blame History

ADR-014: Health Biomarker Analysis Engine

Status: Accepted | Date: 2026-02-22 | Authors: RuVector Genomics Architecture Team Parents: ADR-001 (Vision), ADR-003 (HNSW Index), ADR-004 (Attention), ADR-009 (Variant Calling), ADR-011 (Performance Targets), ADR-013 (RVDNA Format)

Context

The rvDNA crate already implements 17 clinically-relevant health SNPs across 4 categories (Cancer Risk, Cardiovascular, Neurological, Metabolism) in health.rs, with dedicated analysis functions for APOE genotyping, MTHFR compound status, and COMT/OPRM1 pain profiling. The genotyping pipeline (genotyping.rs) provides end-to-end 23andMe analysis with 7-stage processing.

However, the current health variant analysis has several limitations:

Limitation Impact Module
No polygenic risk scoring Individual SNP effects miss gene-gene interactions health.rs
No longitudinal tracking Cannot monitor biomarker changes over time None
No streaming data ingestion Real-time health monitoring impossible None
No vector-indexed biomarker search Cannot correlate across populations None
No composite health scoring No unified risk quantification health.rs
No RVDNA biomarker section Health data not persisted in AI-native format rvdna.rs

The health biomarker domain requires three capabilities beyond SNP lookup: (1) composite risk scoring that aggregates across gene networks, (2) streaming ingestion for real-time monitoring, and (3) HNSW-indexed population-scale similarity search for correlating individual profiles against reference cohorts.

Decision: Health Biomarker Analysis Engine

We introduce a biomarker analysis engine (biomarker.rs) that extends the existing health.rs SNP analysis with:

  1. Composite Biomarker Profiles — Aggregate individual SNP results into category-level and global risk scores with configurable weighting
  2. Streaming Data Simulation — Simulated real-time biomarker data streams with configurable noise, drift, and anomaly injection for testing temporal analysis
  3. HNSW-Indexed Profile Search — Store biomarker profiles as dense vectors in HNSW index for population-scale similarity search
  4. Temporal Biomarker Tracking — Time-series analysis with trend detection, moving averages, and anomaly detection
  5. Real Example Data — Curated biomarker datasets based on clinically validated reference ranges

Architecture Overview

┌─────────────────────────────────────────────────────────────────┐
│                   Health Biomarker Engine                        │
├──────────────┬──────────────┬───────────────┬───────────────────┤
│  Composite   │  Streaming   │  HNSW-Indexed │   Temporal        │
│  Risk Score  │  Simulator   │  Population   │   Tracker         │
│              │              │  Search       │                   │
├──────────────┤              │               │                   │
│ Gene Network │  Noise Model │  Profile Vec  │  Moving Average   │
│ Interaction  │  Drift Model │  Quantization │  Trend Detection  │
│ Weights      │  Anomalies   │  Similarity   │  Anomaly Detect   │
└──────┬───────┴──────┬───────┴───────┬───────┴───────┬───────────┘
       │              │               │               │
┌──────┴──────┐ ┌─────┴─────┐  ┌─────┴──────┐  ┌────┴────────┐
│ health.rs   │ │ tokio     │  │ ruvector   │  │ biomarker   │
│ 17 SNPs     │ │ streams   │  │ -core HNSW │  │ time series │
│ APOE/MTHFR  │ │ channels  │  │ VectorDB   │  │ ring buffer │
└─────────────┘ └───────────┘  └────────────┘  └─────────────┘

Component Specifications

1. Composite Biomarker Profile

pub struct BiomarkerProfile {
    pub subject_id: String,
    pub timestamp: i64,
    pub snp_results: Vec<HealthVariantResult>,
    pub category_scores: HashMap<String, CategoryScore>,
    pub global_risk_score: f64,
    pub profile_vector: Vec<f32>,      // Dense vector for HNSW indexing
}

pub struct CategoryScore {
    pub category: String,
    pub score: f64,                     // 0.0 (low risk) to 1.0 (high risk)
    pub confidence: f64,                // Based on genotyped fraction
    pub contributing_variants: Vec<String>,
}

Scoring Algorithm:

  • Each SNP contributes a risk weight based on its clinical significance and genotype
  • Category scores aggregate SNP weights within gene-network boundaries
  • Gene-gene interaction terms (e.g., COMT x OPRM1 for pain) apply multiplicative modifiers
  • Global risk score uses weighted geometric mean across categories
  • Profile vector is the concatenation of normalized category scores + individual SNP encodings (one-hot genotype)

Weight Matrix (evidence-based):

Gene Risk Weight (Hom Ref) Risk Weight (Het) Risk Weight (Hom Alt) Category
APOE (rs429358) 0.0 0.45 0.90 Neurological
BRCA1 (rs80357906) 0.0 0.70 0.95 Cancer
MTHFR C677T 0.0 0.30 0.65 Metabolism
COMT Val158Met 0.0 0.25 0.50 Neurological
CYP1A2 0.0 0.15 0.35 Metabolism
SLCO1B1 0.0 0.40 0.75 Cardiovascular

Interaction Terms:

Interaction Modifier Rationale
COMT(AA) x OPRM1(GG) 1.4x pain score Synergistic pain sensitivity
MTHFR(677TT) x MTHFR(1298CC) 1.3x metabolism score Compound heterozygote
APOE(e4/e4) x TP53(variant) 1.2x neurological score Neurodegeneration + impaired DNA repair
BRCA1(carrier) x TP53(variant) 1.5x cancer score DNA repair pathway compromise

2. Streaming Biomarker Simulator

pub struct StreamConfig {
    pub base_interval_ms: u64,          // Interval between readings
    pub noise_amplitude: f64,           // Gaussian noise σ
    pub drift_rate: f64,                // Linear drift per reading
    pub anomaly_probability: f64,       // Probability of anomalous reading
    pub anomaly_magnitude: f64,         // Size of anomaly spike
    pub num_biomarkers: usize,          // Number of parallel streams
    pub window_size: usize,             // Sliding window for statistics
}

pub struct BiomarkerReading {
    pub timestamp_ms: u64,
    pub biomarker_id: String,
    pub value: f64,
    pub reference_range: (f64, f64),
    pub is_anomaly: bool,
    pub z_score: f64,
}

Simulation Model:

  • Base values drawn from clinically validated reference ranges (see Section 3)
  • Gaussian noise with configurable σ (default: 2% of reference range)
  • Linear drift models chronic condition progression
  • Anomaly injection via Poisson process (default: p=0.02 per reading)
  • Anomalies modeled as multiplicative spikes (default: 2.5x normal variation)

Streaming Protocol:

  • Uses tokio::sync::mpsc channels for async data flow
  • Ring buffer (capacity: 10,000 readings) for windowed statistics
  • Moving average, exponential smoothing, and z-score computation in real-time
  • Backpressure via bounded channels prevents memory exhaustion

Biomarker profile vectors are stored in RuVector's HNSW index for population-scale similarity search:

pub struct PopulationIndex {
    pub db: VectorDB,
    pub profile_dim: usize,             // Vector dimension (typically 64)
    pub population_size: usize,
    pub metadata: HashMap<String, serde_json::Value>,
}

Vector Encoding:

  • 17 SNPs x 3 genotype one-hot = 51 dimensions
  • 4 category scores = 4 dimensions
  • 1 global risk score = 1 dimension
  • 4 interaction terms = 4 dimensions
  • MTHFR score (1) + Pain score (1) + APOE risk (1) + Caffeine metabolism (1) = 4 dimensions
  • Total: 64 dimensions (power of 2 for SIMD alignment)

Search Performance (from ADR-011):

  • p50 latency: <100 μs at 10k profiles
  • p99 latency: <250 μs at 10k profiles
  • Recall@10: >97%
  • HNSW config: M=16, ef_construction=200, ef_search=50

4. Reference Biomarker Data

Curated reference ranges from clinical literature (CDC, WHO, NCBI ClinVar):

Biomarker Unit Low Normal Low Normal High High Critical
Total Cholesterol mg/dL - <200 200-239 >=240 >300
LDL Cholesterol mg/dL - <100 100-159 >=160 >190
HDL Cholesterol mg/dL <40 40-59 >=60 - -
Triglycerides mg/dL - <150 150-199 >=200 >500
Fasting Glucose mg/dL <70 70-99 100-125 >=126 >300
HbA1c % <4.0 4.0-5.6 5.7-6.4 >=6.5 >10.0
Homocysteine μmol/L - <10 10-15 >15 >30
Vitamin D (25-OH) ng/mL <20 20-29 30-100 >100 >150
CRP (hs) mg/L - <1.0 1.0-3.0 >3.0 >10.0
TSH mIU/L <0.4 0.4-2.0 2.0-4.0 >4.0 >10.0
Ferritin ng/mL <12 12-150 150-300 >300 >1000
Vitamin B12 pg/mL <200 200-300 300-900 >900 -

These values are used to:

  1. Validate streaming simulator output
  2. Calculate z-scores for anomaly detection
  3. Generate realistic synthetic population data
  4. Provide clinical context in biomarker reports

Performance Targets

Operation Target Mechanism
Composite score (17 SNPs) <50 μs In-memory weight matrix multiply
Profile vector encoding <100 μs One-hot + normalize
Population similarity top-10 <150 μs HNSW search on 64-dim vectors
Stream processing (single reading) <10 μs Ring buffer + running stats
Anomaly detection <5 μs Z-score against moving window
Full biomarker report <1 ms Score + encode + search
Population index build (10k) <500 ms Batch HNSW insert
Streaming throughput >100k readings/sec Lock-free ring buffer

Integration Points

Existing Module Integration Direction
health.rs SNP results feed composite scorer Input
genotyping.rs 23andMe pipeline generates BiomarkerProfile Input
ruvector-core HNSW index stores profile vectors Bidirectional
rvdna.rs Profile vectors stored in metadata section Output
epigenomics.rs Methylation data enriches biomarker profile Input
pharma.rs CYP metabolizer status informs drug-related biomarkers Input

Consequences

Positive:

  • Unified risk scoring replaces per-SNP interpretation with actionable composite scores
  • Streaming architecture enables real-time health monitoring use cases
  • HNSW indexing enables population-scale "patients like me" queries in <150 μs
  • Reference biomarker data provides clinical validation framework
  • 64-dim profile vectors are SIMD-aligned for maximum throughput
  • Ring buffer streaming achieves >100k readings/sec without allocation pressure

Negative:

  • Composite scoring weights are simplified; clinical deployment requires validated coefficients from GWAS
  • Streaming simulator generates synthetic data only; real clinical integration requires HL7/FHIR adapters
  • Additional 64-dim vector per profile increases RVDNA file size by ~256 bytes per subject

Neutral:

  • Risk scores are educational/research only; same disclaimer as existing health.rs
  • Gene-gene interaction terms are limited to known pairs; extensible via configuration

Options Considered

  1. Extend health.rs with scoring — rejected: would grow file beyond 500-line limit; scoring + streaming + search are distinct bounded contexts
  2. Separate crate — rejected: too much coupling to existing types; shared types across modules
  3. New module (biomarker.rs) — selected: clean separation, imports from health.rs, integrates with ruvector-core HNSW, stays within the rvDNA crate boundary

Implementation Strategy

Phase 1 (This ADR):

  • biomarker.rs: Composite scoring engine with reference data
  • biomarker_stream.rs: Streaming simulator with ring buffer and anomaly detection
  • Integration tests with realistic 23andMe-derived profiles
  • Benchmark suite validating performance targets

Phase 2 (Future):

  • RVDNA Section 7: Biomarker profile storage in binary format
  • Population index persistence (serialize HNSW graph to RVDNA)
  • WASM export for browser-based biomarker dashboards
  • HL7/FHIR streaming adapter for clinical integration
  • ADR-001: Vision — health biomarker analysis is a key clinical application
  • ADR-003: HNSW index — population search uses the same index infrastructure
  • ADR-009: Variant calling — biomarker profiles integrate variant quality scores
  • ADR-011: Performance targets — all biomarker operations must meet latency budgets
  • ADR-013: RVDNA format — biomarker vectors stored in metadata section (Phase 1) or dedicated section (Phase 2)

References