ADR-015: npm/WASM Health Biomarker Engine
Status: Accepted | Date: 2026-02-22 | Authors: RuVector Genomics Architecture Team | Parents: ADR-001 (Vision), ADR-008 (WASM Edge), ADR-011 (Performance Targets), ADR-014 (Health Biomarker Analysis)
Context
ADR-014 delivered the Rust biomarker analysis engine (biomarker.rs, biomarker_stream.rs) with composite risk scoring across 20 SNPs, 6 gene-gene interactions, 64-dim L2-normalized profile vectors, and a streaming processor with RingBuffer, CUSUM changepoint detection, and Welford online statistics. ADR-008 established WASM as the delivery mechanism for browser-side genomic computation.
The @ruvector/rvdna npm package (v0.2.0) already exposes 2-bit encoding, protein translation, cosine similarity, and 23andMe genotyping via pure-JS fallbacks and optional NAPI-RS native bindings. However, it lacks the biomarker engine entirely:
| Gap | Impact | Severity |
|---|---|---|
| No biomarker risk scoring in JS | Browser/Node users cannot compute composite health risk | Critical |
| No streaming processor in JS | Real-time biomarker dashboards impossible without native | Critical |
| No profile vector encoding | Population similarity search unavailable in JS | High |
| No TypeScript types for biomarker API | Developer experience degraded | Medium |
| No benchmarks for JS path | Cannot validate performance parity claims | Medium |
The decision is whether to (a) require WASM/native for all biomarker features, (b) provide a pure-JS implementation that mirrors the Rust engine exactly, or (c) take a hybrid approach.
Decision: Pure-JS Biomarker Engine with WASM Acceleration Path
We implement a complete pure-JS biomarker engine in @ruvector/rvdna v0.3.0 that mirrors the Rust biomarker.rs and biomarker_stream.rs exactly, with a future WASM acceleration path for compute-intensive operations.
Rationale
- Zero-dependency accessibility — Any Node.js or browser environment can run biomarker analysis without compiling Rust or loading WASM
- Exact algorithmic parity — Same 20 SNPs, same 6 interactions, same 64-dim vector layout, same CUSUM parameters, same Welford statistics
- Progressive enhancement — Pure JS works everywhere; WASM (future) accelerates hot paths (vector encoding, population generation)
- Test oracle — JS implementation serves as a cross-language verification oracle against the Rust engine
Architecture
```
@ruvector/rvdna v0.3.0
├── index.js              # Entry point, re-exports all modules
├── index.d.ts            # Full TypeScript definitions
├── src/
│   ├── biomarker.js      # Risk scoring engine (mirrors biomarker.rs)
│   └── stream.js         # Streaming processor (mirrors biomarker_stream.rs)
└── tests/
    └── test-biomarker.js # Comprehensive test suite + benchmarks
```
Module 1: Biomarker Risk Scoring (src/biomarker.js)
Data Tables (exact mirror of Rust):
| Table | Entries | Fields |
|---|---|---|
| BIOMARKER_REFERENCES | 13 | name, unit, normalLow, normalHigh, criticalLow, criticalHigh, category |
| SNPS | 20 | rsid, category, wRef, wHet, wAlt, homRef, het, homAlt, maf |
| INTERACTIONS | 6 | rsidA, rsidB, modifier, category |
| CAT_ORDER | 4 | Cancer Risk, Cardiovascular, Neurological, Metabolism |
Functions:
| Function | Input | Output | Mirrors |
|---|---|---|---|
| biomarkerReferences() | none | BiomarkerReference[] | biomarker_references() |
| zScore(value, ref) | number, BiomarkerReference | number | z_score() |
| classifyBiomarker(value, ref) | number, BiomarkerReference | enum string | classify_biomarker() |
| computeRiskScores(genotypes) | Map<rsid,genotype> | BiomarkerProfile | compute_risk_scores() |
| encodeProfileVector(profile) | BiomarkerProfile | Float32Array(64) | encode_profile_vector() |
| generateSyntheticPopulation(count, seed) | number, number | BiomarkerProfile[] | generate_synthetic_population() |
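As a rough illustration of how a classifier over the reference fields listed above might look, here is a minimal sketch. The function name `classifySketch`, the boundary handling (strict comparisons), and the numeric values in `exampleRef` are assumptions made for this example; they are not taken from the package or from any clinical source.

```javascript
// Hypothetical sketch: maps a value to one of the five classification strings
// using the normalLow/normalHigh/criticalLow/criticalHigh fields.
// Boundary handling (strict < and >) is an assumption.
function classifySketch(value, ref) {
  if (value < ref.criticalLow) return 'CriticalLow';
  if (value < ref.normalLow) return 'Low';
  if (value > ref.criticalHigh) return 'CriticalHigh';
  if (value > ref.normalHigh) return 'High';
  return 'Normal';
}

// Illustrative reference entry; the numbers are placeholders, not clinical values.
const exampleRef = {
  name: 'Example marker',
  unit: 'mg/dL',
  normalLow: 70,
  normalHigh: 100,
  criticalLow: 50,
  criticalHigh: 250,
  category: 'Metabolism',
};

console.log(classifySketch(85, exampleRef));
```

Checking the critical bounds before the normal bounds keeps each branch a single comparison.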
Scoring Algorithm (identical to Rust):
- For each of 20 SNPs, look up genotype and compute weight (wRef/wHet/wAlt)
- Aggregate weights per category (Cancer Risk, Cardiovascular, Neurological, Metabolism)
- Apply 6 multiplicative interaction modifiers where both SNPs are non-reference
- Normalize each category: score = raw / maxPossible, clamped to [0, 1]
- Confidence = genotyped fraction per category
- Global risk = weighted average: sum(score * confidence) / sum(confidence)
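The normalization and aggregation steps at the end of this list can be sketched as follows. This is a minimal illustration under assumptions: `normalizeCategory` and `globalRisk` are hypothetical helper names, and the `{ score, confidence }` input shape is invented for the example rather than taken from the package.

```javascript
// Hypothetical helpers illustrating per-category normalization and the
// confidence-weighted global risk. Names and shapes are assumptions.

// score = raw / maxPossible, clamped to [0, 1]
function normalizeCategory(raw, maxPossible) {
  if (maxPossible <= 0) return 0; // guard against division by zero
  return Math.min(1, Math.max(0, raw / maxPossible));
}

// Global risk = sum(score * confidence) / sum(confidence)
function globalRisk(categories) {
  let weighted = 0;
  let totalConfidence = 0;
  for (const { score, confidence } of categories) {
    weighted += score * confidence;
    totalConfidence += confidence;
  }
  return totalConfidence > 0 ? weighted / totalConfidence : 0;
}

const categories = [
  { score: normalizeCategory(1.5, 3.0), confidence: 1.0 }, // fully genotyped
  { score: normalizeCategory(4.0, 2.0), confidence: 0.5 }, // raw exceeds max, clamps to 1
];
console.log(globalRisk(categories));
```

Weighting by confidence means a category with few genotyped SNPs contributes less to the global score than a fully genotyped one.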
Profile Vector Layout (64 dimensions, L2-normalized):
| Dims | Content | Count |
|---|---|---|
| 0–50 | One-hot genotype encoding (17 SNPs x 3) | 51 |
| 51–54 | Category scores | 4 |
| 55 | Global risk score | 1 |
| 56–59 | First 4 interaction modifiers | 4 |
| 60 | MTHFR score / 4 | 1 |
| 61 | Pain score / 4 | 1 |
| 62 | APOE risk code / 2 | 1 |
| 63 | LPA composite | 1 |
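The layout above could be filled and normalized roughly as shown in this sketch. The function name `encodeSketch`, its argument shapes, and the genotype coding (0 = homRef, 1 = het, 2 = homAlt, anything else = missing) are assumptions for illustration; the last eight dimensions are simply passed through rather than derived.

```javascript
// Hypothetical sketch of the 64-dim layout followed by L2 normalization.
// Argument shapes and genotype codes are assumptions, not the package's API.
function encodeSketch(genotypeCodes, categoryScores, globalRiskScore, tail) {
  const v = new Float32Array(64);

  // Dims 0-50: 17 SNPs x 3 one-hot slots
  for (let i = 0; i < 17; i++) {
    const code = genotypeCodes[i];
    if (code === 0 || code === 1 || code === 2) v[i * 3 + code] = 1;
  }

  // Dims 51-54: category scores; dim 55: global risk score
  for (let c = 0; c < 4; c++) v[51 + c] = categoryScores[c] ?? 0;
  v[55] = globalRiskScore;

  // Dims 56-63: interaction modifiers and gene-specific scores (passed through here)
  for (let k = 0; k < 8; k++) v[56 + k] = tail[k] ?? 0;

  // L2-normalize so the vector has unit length
  let sumSq = 0;
  for (let i = 0; i < 64; i++) sumSq += v[i] * v[i];
  const norm = Math.sqrt(sumSq);
  if (norm > 0) {
    for (let i = 0; i < 64; i++) v[i] /= norm;
  }
  return v;
}

const exampleVector = encodeSketch(
  new Array(17).fill(1),            // all heterozygous
  [0.2, 0.4, 0.1, 0.3],
  0.25,
  [1, 1, 1, 1, 0.5, 0.25, 0.5, 0.1]
);
console.log(exampleVector.length);
```

With unit-length vectors, cosine similarity between two profiles reduces to a plain dot product.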
PRNG: Mulberry32 (deterministic, no dependencies, matches seeded output for synthetic populations).
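For readers unfamiliar with Mulberry32, the commonly published form of the generator looks like the sketch below. Whether the package uses exactly this formulation is an assumption; the point is only to show why the same seed always yields the same sequence.

```javascript
// Mulberry32 in its commonly published form: a 32-bit seeded PRNG
// returning floats in [0, 1). Assumed, not copied from the package.
function mulberry32(seed) {
  let a = seed >>> 0; // keep state as an unsigned 32-bit integer
  return function next() {
    a = (a + 0x6d2b79f5) | 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Two generators built from the same seed produce identical streams.
const rngA = mulberry32(42);
const rngB = mulberry32(42);
console.log(rngA() === rngB());
```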
Module 2: Streaming Biomarker Processor (src/stream.js)
Data Structures:
| Structure | Purpose | Mirrors |
|---|---|---|
| RingBuffer | Fixed-capacity circular buffer, no allocation after init | RingBuffer<T> |
| StreamProcessor | Per-biomarker rolling stats, anomaly detection, trend analysis | StreamProcessor |
| StreamStats | mean, variance, min, max, EMA, CUSUM, changepoint | StreamStats |
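A fixed-capacity circular buffer of the kind described in the table can be sketched as below. The class name `RingBufferSketch` and its method set are assumptions chosen for illustration; the real class may differ in shape.

```javascript
// Hypothetical fixed-capacity circular buffer: storage is allocated once in
// the constructor, and the oldest value is overwritten when full.
class RingBufferSketch {
  constructor(capacity) {
    if (!Number.isInteger(capacity) || capacity < 1) {
      throw new RangeError('capacity must be a positive integer');
    }
    this.buf = new Float64Array(capacity);
    this.head = 0; // next write position
    this.len = 0;  // number of valid items
  }

  push(value) {
    this.buf[this.head] = value;
    this.head = (this.head + 1) % this.buf.length;
    if (this.len < this.buf.length) this.len++;
  }

  // Iterate from oldest to newest
  *[Symbol.iterator]() {
    const cap = this.buf.length;
    const start = (this.head - this.len + cap) % cap;
    for (let i = 0; i < this.len; i++) {
      yield this.buf[(start + i) % cap];
    }
  }

  clear() {
    this.head = 0;
    this.len = 0;
  }
}

const recent = new RingBufferSketch(3);
for (const x of [1, 2, 3, 4, 5]) recent.push(x);
console.log([...recent]);
```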
Constants (identical to Rust):
| Constant | Value | Purpose |
|---|---|---|
| EMA_ALPHA | 0.1 | Exponential moving average smoothing |
| Z_SCORE_THRESHOLD | 2.5 | Anomaly detection threshold |
| REF_OVERSHOOT | 0.20 | Out-of-range tolerance (20% of range) |
| CUSUM_THRESHOLD | 4.0 | Changepoint detection sensitivity |
| CUSUM_DRIFT | 0.5 | CUSUM allowable drift |
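To show how CUSUM_THRESHOLD and CUSUM_DRIFT might interact, here is a sketch of a textbook two-sided CUSUM with reset. It assumes the input is a standardized deviation (a z-score of the reading); the factory name `makeCusum` and that input convention are assumptions, not details confirmed by the source.

```javascript
// Textbook two-sided CUSUM with automatic reset. Constant values are the ones
// listed in the table; the z-score input convention is an assumption.
const CUSUM_THRESHOLD = 4.0;
const CUSUM_DRIFT = 0.5;

function makeCusum() {
  let pos = 0; // accumulates upward shifts
  let neg = 0; // accumulates downward shifts
  return function update(z) {
    pos = Math.max(0, pos + z - CUSUM_DRIFT);
    neg = Math.max(0, neg - z - CUSUM_DRIFT);
    const changepoint = pos > CUSUM_THRESHOLD || neg > CUSUM_THRESHOLD;
    if (changepoint) {
      pos = 0;
      neg = 0;
    }
    return changepoint;
  };
}

const detect = makeCusum();
console.log([2, 2, 2, 2].map((z) => detect(z)));
```

The drift term absorbs small deviations, so only a sustained shift accumulates enough to cross the threshold.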
Statistics:
- Welford's online algorithm for single-pass mean and sample standard deviation (2x fewer cache misses than two-pass)
- Simple linear regression for trend slope via least-squares
- CUSUM (Cumulative Sum) for changepoint detection with automatic reset
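The first two items can be illustrated with a short sketch. The class and function names are hypothetical, and the trend is fitted against the sample index (0, 1, 2, ...) as an assumption; the package may regress against timestamps instead.

```javascript
// Welford's online algorithm: single-pass mean and sample standard deviation.
class WelfordSketch {
  constructor() {
    this.n = 0;
    this.mean = 0;
    this.m2 = 0; // running sum of squared deviations from the mean
  }
  update(x) {
    this.n++;
    const delta = x - this.mean;
    this.mean += delta / this.n;
    this.m2 += delta * (x - this.mean);
  }
  sampleStdDev() {
    return this.n > 1 ? Math.sqrt(this.m2 / (this.n - 1)) : 0;
  }
}

// Least-squares slope of values against their index.
function trendSlope(values) {
  const n = values.length;
  if (n < 2) return 0;
  const xMean = (n - 1) / 2;
  let yMean = 0;
  for (const v of values) yMean += v;
  yMean /= n;
  let num = 0;
  let den = 0;
  for (let i = 0; i < n; i++) {
    const dx = i - xMean;
    num += dx * (values[i] - yMean);
    den += dx * dx;
  }
  return num / den;
}

const stats = new WelfordSketch();
for (const x of [2, 4, 4, 4, 5, 5, 7, 9]) stats.update(x);
console.log(stats.mean, stats.sampleStdDev(), trendSlope([1, 3, 5, 7]));
```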
Biomarker Definitions (6 streams):
| ID | Reference Low | Reference High |
|---|---|---|
| glucose | 70 | 100 |
| cholesterol_total | 150 | 200 |
| hdl | 40 | 60 |
| ldl | 70 | 130 |
| triglycerides | 50 | 150 |
| crp | 0.1 | 3.0 |
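One plausible reading of how REF_OVERSHOOT combines with these reference ranges is sketched below: a reading counts as out of range once it lies more than 20% of the range width beyond either bound. That interpretation, the function name `isOutOfRange`, and the `REFERENCE_RANGES` object are assumptions for illustration.

```javascript
// Hypothetical out-of-range check. The ranges are copied from the table above;
// how REF_OVERSHOOT is applied is an assumed interpretation.
const REF_OVERSHOOT = 0.20;

const REFERENCE_RANGES = {
  glucose: [70, 100],
  cholesterol_total: [150, 200],
  hdl: [40, 60],
  ldl: [70, 130],
  triglycerides: [50, 150],
  crp: [0.1, 3.0],
};

function isOutOfRange(id, value) {
  const range = REFERENCE_RANGES[id];
  if (!range) throw new Error(`unknown biomarker: ${id}`);
  const [low, high] = range;
  const margin = REF_OVERSHOOT * (high - low); // e.g. 6 for glucose
  return value < low - margin || value > high + margin;
}

console.log(isOutOfRange('glucose', 105), isOutOfRange('glucose', 107));
```

Under this reading, a glucose value of 105 is above the reference range but still within tolerance, while 107 is flagged.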
Performance Targets
| Operation | JS Target | Rust Baseline | Acceptable Ratio |
|---|---|---|---|
| computeRiskScores (20 SNPs) | <200 us | <50 us | 4x |
| encodeProfileVector (64-dim) | <300 us | <100 us | 3x |
| StreamProcessor.processReading | <50 us | <10 us | 5x |
| generateSyntheticPopulation(1000) | <100 ms | <20 ms | 5x |
| RingBuffer push+iter (100 items) | <20 us | <2 us | 10x |
Benchmark methodology: performance.now() with 1000-iteration warmup, 10000 measured iterations, report p50/p99.
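A harness following this methodology might look like the sketch below. The function name `bench`, its options object, and the percentile indexing are assumptions; it relies on the global `performance` object available in browsers and in recent Node.js versions.

```javascript
// Hypothetical benchmark harness: warmup, timed iterations, then p50/p99
// in microseconds. Names and option shapes are assumptions.
function bench(fn, { warmup = 1000, iterations = 10000 } = {}) {
  for (let i = 0; i < warmup; i++) fn();

  const samples = new Float64Array(iterations);
  for (let i = 0; i < iterations; i++) {
    const t0 = performance.now();
    fn();
    samples[i] = (performance.now() - t0) * 1000; // milliseconds to microseconds
  }

  samples.sort(); // typed arrays sort numerically by default
  const pct = (p) => samples[Math.min(iterations - 1, Math.floor(p * iterations))];
  return { p50: pct(0.5), p99: pct(0.99) };
}

const result = bench(() => Math.sqrt(2), { warmup: 100, iterations: 1000 });
console.log(`p50=${result.p50.toFixed(3)} us, p99=${result.p99.toFixed(3)} us`);
```

For operations in the low-microsecond range, timing each call individually approaches the timer's resolution, so batching several calls per sample may give steadier numbers.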
TypeScript Definitions
Full .d.ts types for every exported function, interface, and enum. Key types:
- BiomarkerReference: 13-field clinical reference range
- BiomarkerClassification: 'CriticalLow' | 'Low' | 'Normal' | 'High' | 'CriticalHigh'
- CategoryScore: per-category risk with confidence and contributing variants
- BiomarkerProfile: complete risk profile with 64-dim vector
- StreamConfig: streaming processor configuration
- BiomarkerReading: timestamped biomarker data point
- StreamStats: rolling statistics with CUSUM state
- ProcessingResult: per-reading anomaly detection result
- StreamSummary: aggregate statistics across all biomarker streams
Test Coverage
| Category | Tests | Coverage |
|---|---|---|
| Biomarker references | 2 | Count, z-score math |
| Classification | 5 | All 5 classification levels |
| Risk scoring | 4 | All-ref low risk, elevated cancer, interaction amplification, BRCA1+TP53 |
| Profile vectors | 3 | 64-dim, L2-normalized, deterministic |
| Population generation | 3 | Correct count, deterministic, MTHFR-homocysteine correlation |
| RingBuffer | 4 | Push/iter, overflow, capacity-1, clear |
| Stream processor | 3 | Stats computation, summary totals, throughput |
| Anomaly detection | 3 | Z-score anomaly, out-of-range, zero anomaly for constant |
| Trend detection | 3 | Positive, negative, exact slope |
| Z-score / EMA | 2 | Near-mean small z, EMA convergence |
| Benchmarks | 5 | All performance targets |
Total: 32 tests + 5 benchmarks (37 entries in all)
WASM Acceleration Path (Future — Phase 2)
When @ruvector/rvdna-wasm is available:
```javascript
// Automatic acceleration: same API, WASM hot path
const { computeRiskScores } = require('@ruvector/rvdna');
// Internally checks: nativeModule?.computeRiskScores ?? jsFallback
```
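The internal check mentioned in the comment could be implemented along the lines of this sketch. The helper name `pickImplementation` and the placeholder fallback are hypothetical; the only behavior taken from the text is "use the accelerated export if present, otherwise the JS fallback".

```javascript
// Hypothetical optional-dependency loader (CommonJS). Returns the accelerated
// export when the module is installed and exposes it, else the JS fallback.
function pickImplementation(moduleName, exportName, jsFallback) {
  let nativeModule = null;
  try {
    nativeModule = require(moduleName);
  } catch (err) {
    nativeModule = null; // optional module not installed
  }
  return nativeModule?.[exportName] ?? jsFallback;
}

// Placeholder standing in for the pure-JS implementation.
const jsComputeRiskScores = (genotypes) => ({ source: 'js', size: genotypes.size });

const computeRiskScoresImpl = pickImplementation(
  '@ruvector/rvdna-wasm',
  'computeRiskScores',
  jsComputeRiskScores
);
console.log(typeof computeRiskScoresImpl);
```

Resolving the implementation once at load time keeps the per-call path free of feature checks.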
WASM candidates (>10x speedup potential):
- encodeProfileVector: SIMD dot products for L2 normalization
- generateSyntheticPopulation: bulk PRNG + matrix operations
- StreamProcessor.processReading: vectorized Welford accumulation
Versioning
- @ruvector/rvdna bumps from 0.2.0 to 0.3.0 (new public API surface)
- files array in package.json updated to include src/ directory
- Keywords expanded: biomarker, health, risk-score, streaming, anomaly-detection
- No breaking changes to existing v0.2.0 API
Consequences
Positive:
- Full biomarker engine available in any JS runtime without native compilation
- Algorithmic parity with Rust ensures cross-language consistency
- Pure JS means zero WASM load time for initial render in browser dashboards
- Comprehensive test suite provides regression safety net
- TypeScript types enable IDE autocompletion and compile-time checking
- Benchmarks establish baseline for future WASM optimization
Negative:
- The JS path is expected to run 3-10x slower than Rust for numerical computation (the acceptable ratios under Performance Targets)
- Synthetic population generation uses the Mulberry32 PRNG, which is neither cryptographic nor the same algorithm as Rust's StdRng
- MTHFR/pain analysis simplified in JS (no cross-module dependency on health.rs internals)
Neutral:
- Same clinical disclaimers apply: research/educational use only
- Gene-gene interaction weights unchanged from ADR-014
Options Considered
- WASM-only — rejected: forces async init, 2MB+ download, excludes lightweight Node.js scripts
- Pure JS only, no WASM path — rejected: leaves performance on the table for browser dashboards
- Pure JS with WASM acceleration path — selected: immediate availability + future optimization
- Thin wrapper over native module — rejected: native bindings unavailable on most platforms
Related Decisions
- ADR-008: WASM Edge Genomics — establishes WASM as browser delivery mechanism
- ADR-011: Performance Targets — JS targets derived as acceptable multiples of Rust baselines
- ADR-014: Health Biomarker Analysis — Rust engine this ADR mirrors in JavaScript
References
- Mulberry32 PRNG — 32-bit deterministic PRNG
- Welford's Online Algorithm — Numerically stable variance
- CUSUM — Cumulative sum control chart for changepoint detection
- CPIC Guidelines — Pharmacogenomics evidence base