

ADR-015: npm/WASM Health Biomarker Engine

Status: Accepted | Date: 2026-02-22 | Authors: RuVector Genomics Architecture Team | Parents: ADR-001 (Vision), ADR-008 (WASM Edge), ADR-011 (Performance Targets), ADR-014 (Health Biomarker Analysis)

Context

ADR-014 delivered the Rust biomarker analysis engine (biomarker.rs, biomarker_stream.rs) with composite risk scoring across 20 SNPs, 6 gene-gene interactions, 64-dim L2-normalized profile vectors, and a streaming processor with RingBuffer, CUSUM changepoint detection, and Welford online statistics. ADR-008 established WASM as the delivery mechanism for browser-side genomic computation.

The @ruvector/rvdna npm package (v0.2.0) already exposes 2-bit encoding, protein translation, cosine similarity, and 23andMe genotyping via pure-JS fallbacks and optional NAPI-RS native bindings. However, it lacks the biomarker engine entirely:

| Gap | Impact | Severity |
|---|---|---|
| No biomarker risk scoring in JS | Browser/Node users cannot compute composite health risk | Critical |
| No streaming processor in JS | Real-time biomarker dashboards impossible without native | Critical |
| No profile vector encoding | Population similarity search unavailable in JS | High |
| No TypeScript types for biomarker API | Developer experience degraded | Medium |
| No benchmarks for JS path | Cannot validate performance parity claims | Medium |

The decision is whether to (a) require WASM/native for all biomarker features, (b) provide a pure-JS implementation that mirrors the Rust engine exactly, or (c) a hybrid approach.

Decision: Pure-JS Biomarker Engine with WASM Acceleration Path

We implement a complete pure-JS biomarker engine in @ruvector/rvdna v0.3.0 that mirrors the Rust biomarker.rs and biomarker_stream.rs exactly, with a future WASM acceleration path for compute-intensive operations.

Rationale

  1. Zero-dependency accessibility — Any Node.js or browser environment can run biomarker analysis without compiling Rust or loading WASM
  2. Exact algorithmic parity — Same 20 SNPs, same 6 interactions, same 64-dim vector layout, same CUSUM parameters, same Welford statistics
  3. Progressive enhancement — Pure JS works everywhere; WASM (future) accelerates hot paths (vector encoding, population generation)
  4. Test oracle — JS implementation serves as a cross-language verification oracle against the Rust engine

Architecture

@ruvector/rvdna v0.3.0
├── index.js                 # Entry point, re-exports all modules
├── index.d.ts               # Full TypeScript definitions
├── src/
│   ├── biomarker.js         # Risk scoring engine (mirrors biomarker.rs)
│   └── stream.js            # Streaming processor (mirrors biomarker_stream.rs)
└── tests/
    └── test-biomarker.js    # Comprehensive test suite + benchmarks

Module 1: Biomarker Risk Scoring (src/biomarker.js)

Data Tables (exact mirror of Rust):

| Table | Entries | Fields |
|---|---|---|
| BIOMARKER_REFERENCES | 13 | name, unit, normalLow, normalHigh, criticalLow, criticalHigh, category |
| SNPS | 20 | rsid, category, wRef, wHet, wAlt, homRef, het, homAlt, maf |
| INTERACTIONS | 6 | rsidA, rsidB, modifier, category |
| CAT_ORDER | 4 | Cancer Risk, Cardiovascular, Neurological, Metabolism |

Functions:

| Function | Input | Output | Mirrors |
|---|---|---|---|
| biomarkerReferences() | (none) | BiomarkerReference[] | biomarker_references() |
| zScore(value, ref) | number, BiomarkerReference | number | z_score() |
| classifyBiomarker(value, ref) | number, BiomarkerReference | enum string | classify_biomarker() |
| computeRiskScores(genotypes) | Map<rsid,genotype> | BiomarkerProfile | compute_risk_scores() |
| encodeProfileVector(profile) | BiomarkerProfile | Float32Array(64) | encode_profile_vector() |
| generateSyntheticPopulation(count, seed) | number, number | BiomarkerProfile[] | generate_synthetic_population() |
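
As a rough illustration of the two smallest functions in the table, the sketch below classifies a value against a reference range and computes a z-score. The boundary handling (strict vs. inclusive comparisons), the z-score convention (normal range treated as mean ± 2 SD), and the numeric values in `demoRef` are all assumptions made for this example; the authoritative behaviour is whatever biomarker.rs implements.

```javascript
// Hypothetical reference entry. Field names follow the BIOMARKER_REFERENCES
// table; the numeric values are illustrative only.
const demoRef = {
  name: 'Demo marker', unit: 'mg/dL',
  normalLow: 70, normalHigh: 100,
  criticalLow: 50, criticalHigh: 250,
  category: 'Metabolism',
};

// Assumption: the normal range spans mean ± 2 standard deviations.
function zScore(value, ref) {
  const mean = (ref.normalLow + ref.normalHigh) / 2;
  const sd = (ref.normalHigh - ref.normalLow) / 4;
  return sd === 0 ? 0 : (value - mean) / sd;
}

// Assumption: range bounds themselves count as Normal / High respectively.
function classifyBiomarker(value, ref) {
  if (value < ref.criticalLow) return 'CriticalLow';
  if (value < ref.normalLow) return 'Low';
  if (value <= ref.normalHigh) return 'Normal';
  if (value <= ref.criticalHigh) return 'High';
  return 'CriticalHigh';
}
```

The five return values correspond to the BiomarkerClassification union listed under TypeScript Definitions.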

Scoring Algorithm (identical to Rust):

  1. For each of 20 SNPs, look up genotype and compute weight (wRef/wHet/wAlt)
  2. Aggregate weights per category (Cancer Risk, Cardiovascular, Neurological, Metabolism)
  3. Apply 6 multiplicative interaction modifiers where both SNPs are non-reference
  4. Normalize each category: score = raw / maxPossible, clamped to [0, 1]
  5. Confidence = genotyped fraction per category
  6. Global risk = weighted average: sum(score * confidence) / sum(confidence)
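
The six steps above can be sketched as follows. This is a minimal illustration, not the package's code: `DEMO_SNPS`, `DEMO_INTERACTIONS`, the rsids, and every weight are made up, and details such as applying the modifier to the whole category's raw sum or taking `maxPossible` as the sum of each SNP's largest weight are assumptions.

```javascript
// Hypothetical tables; the real entries live in SNPS and INTERACTIONS.
const DEMO_SNPS = [
  { rsid: 'rs_demo1', category: 'Cardiovascular', wRef: 0, wHet: 0.4, wAlt: 0.9, homRef: 'GG', het: 'AG', homAlt: 'AA' },
  { rsid: 'rs_demo2', category: 'Cardiovascular', wRef: 0, wHet: 0.3, wAlt: 0.7, homRef: 'CC', het: 'CT', homAlt: 'TT' },
  { rsid: 'rs_demo3', category: 'Metabolism', wRef: 0, wHet: 0.5, wAlt: 1.0, homRef: 'TT', het: 'CT', homAlt: 'CC' },
];
const DEMO_INTERACTIONS = [
  { rsidA: 'rs_demo1', rsidB: 'rs_demo2', modifier: 1.5, category: 'Cardiovascular' },
];

function computeRiskScoresSketch(genotypes) {
  const cats = new Map();
  const acc = (c) => {
    if (!cats.has(c)) cats.set(c, { raw: 0, max: 0, total: 0, typed: 0 });
    return cats.get(c);
  };
  // Steps 1-2: per-SNP weight lookup, aggregated per category.
  for (const snp of DEMO_SNPS) {
    const a = acc(snp.category);
    a.total += 1;
    a.max += Math.max(snp.wRef, snp.wHet, snp.wAlt);
    const gt = genotypes.get(snp.rsid);
    if (gt === snp.homRef) { a.raw += snp.wRef; a.typed += 1; }
    else if (gt === snp.het) { a.raw += snp.wHet; a.typed += 1; }
    else if (gt === snp.homAlt) { a.raw += snp.wAlt; a.typed += 1; }
  }
  // Step 3: multiplicative modifier when both SNPs are non-reference.
  const byId = new Map(DEMO_SNPS.map((s) => [s.rsid, s]));
  const nonRef = (rsid) => {
    const s = byId.get(rsid);
    const gt = genotypes.get(rsid);
    return gt === s.het || gt === s.homAlt;
  };
  for (const ix of DEMO_INTERACTIONS) {
    if (nonRef(ix.rsidA) && nonRef(ix.rsidB)) acc(ix.category).raw *= ix.modifier;
  }
  // Steps 4-6: normalize, compute confidence, confidence-weighted average.
  const categoryScores = {};
  let num = 0;
  let den = 0;
  for (const [name, a] of cats) {
    const score = a.max > 0 ? Math.min(1, Math.max(0, a.raw / a.max)) : 0;
    const confidence = a.typed / a.total;
    categoryScores[name] = { score, confidence };
    num += score * confidence;
    den += confidence;
  }
  return { categoryScores, globalRisk: den > 0 ? num / den : 0 };
}
```

Because the global score is weighted by confidence, a category with no genotyped SNPs contributes nothing to either the numerator or the denominator.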

Profile Vector Layout (64 dimensions, L2-normalized):

| Dims | Content | Count |
|---|---|---|
| 0–50 | One-hot genotype encoding (17 SNPs x 3) | 51 |
| 51–54 | Category scores | 4 |
| 55 | Global risk score | 1 |
| 56–59 | First 4 interaction modifiers | 4 |
| 60 | MTHFR score / 4 | 1 |
| 61 | Pain score / 4 | 1 |
| 62 | APOE risk code / 2 | 1 |
| 63 | LPA composite | 1 |
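
To make the layout concrete, here is a partial sketch that fills dims 0 to 55 and then L2-normalizes the vector. The function names, the integer genotype codes (0 = hom-ref, 1 = het, 2 = hom-alt, -1 = not genotyped), and the decision to leave dims 56 to 63 at zero are assumptions for this example only.

```javascript
// Scale a vector in place to unit Euclidean length (no-op for the zero vector).
function l2Normalize(vec) {
  let sumSq = 0;
  for (const v of vec) sumSq += v * v;
  const norm = Math.sqrt(sumSq);
  if (norm > 0) {
    for (let i = 0; i < vec.length; i++) vec[i] /= norm;
  }
  return vec;
}

// genotypeCodes: 17 entries, one per encoded SNP (hypothetical representation).
// categoryScores: 4 numbers in CAT_ORDER order.
function encodeProfileVectorSketch(genotypeCodes, categoryScores, globalRisk) {
  if (genotypeCodes.length !== 17) throw new RangeError('expected 17 genotype codes');
  if (categoryScores.length !== 4) throw new RangeError('expected 4 category scores');
  const vec = new Float32Array(64);
  // Dims 0-50: three one-hot slots per SNP.
  genotypeCodes.forEach((code, i) => {
    if (code >= 0 && code <= 2) vec[i * 3 + code] = 1;
  });
  // Dims 51-54: category scores; dim 55: global risk.
  categoryScores.forEach((s, i) => { vec[51 + i] = s; });
  vec[55] = globalRisk;
  // Dims 56-63 (interaction modifiers, MTHFR, pain, APOE, LPA) are left at
  // zero here; the real encoder fills them as listed in the table.
  return l2Normalize(vec);
}
```

With unit-length vectors, cosine similarity between two profiles reduces to a plain dot product.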

PRNG: Mulberry32 (deterministic, no dependencies, matches seeded output for synthetic populations).
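
For reference, the widely circulated form of Mulberry32 looks like the sketch below. Whether the package uses exactly this variant and output scaling is an assumption; the point is only that a 32-bit seed yields a reproducible stream of floats in [0, 1).

```javascript
// Commonly published Mulberry32 variant: 32-bit state, returns floats in [0, 1).
function mulberry32(seed) {
  let a = seed | 0;
  return function next() {
    a = (a + 0x6D2B79F5) | 0;
    let t = Math.imul(a ^ (a >>> 15), 1 | a);
    t = (t + Math.imul(t ^ (t >>> 7), 61 | t)) ^ t;
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}
```

Two generators created with the same seed produce the same sequence, which is what makes a seeded `generateSyntheticPopulation(count, seed)` call repeatable.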

Module 2: Streaming Biomarker Processor (src/stream.js)

Data Structures:

| Structure | Purpose | Mirrors |
|---|---|---|
| RingBuffer | Fixed-capacity circular buffer, no allocation after init | RingBuffer<T> |
| StreamProcessor | Per-biomarker rolling stats, anomaly detection, trend analysis | StreamProcessor |
| StreamStats | mean, variance, min, max, EMA, CUSUM, changepoint | StreamStats |
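
A fixed-capacity circular buffer of the kind described for RingBuffer can be sketched as below. The method names (`push`, `clear`, iteration via `Symbol.iterator`) are assumptions chosen to match the test categories listed later in this ADR, not the package's confirmed API.

```javascript
class RingBufferSketch {
  constructor(capacity) {
    if (!Number.isInteger(capacity) || capacity < 1) {
      throw new RangeError('capacity must be a positive integer');
    }
    this.capacity = capacity;
    this.buf = new Array(capacity); // backing store allocated once
    this.head = 0;                  // index of the next write
    this.len = 0;                   // number of valid items
  }

  // Overwrites the oldest item once the buffer is full.
  push(value) {
    this.buf[this.head] = value;
    this.head = (this.head + 1) % this.capacity;
    if (this.len < this.capacity) this.len += 1;
  }

  // Yields items from oldest to newest.
  *[Symbol.iterator]() {
    const start = (this.head - this.len + this.capacity) % this.capacity;
    for (let i = 0; i < this.len; i++) {
      yield this.buf[(start + i) % this.capacity];
    }
  }

  clear() {
    this.head = 0;
    this.len = 0;
  }
}
```

Pushing never grows the backing array, so memory use stays constant however long a stream runs.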

Constants (identical to Rust):

| Constant | Value | Purpose |
|---|---|---|
| EMA_ALPHA | 0.1 | Exponential moving average smoothing |
| Z_SCORE_THRESHOLD | 2.5 | Anomaly detection threshold |
| REF_OVERSHOOT | 0.20 | Out-of-range tolerance (20% of range) |
| CUSUM_THRESHOLD | 4.0 | Changepoint detection sensitivity |
| CUSUM_DRIFT | 0.5 | CUSUM allowable drift |
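
To show how these constants might interact, the sketch below applies an EMA update and a two-sided CUSUM step to a standardized deviation `z`. The two-sided formulation, the reset-on-detection behaviour, and the function names are assumptions; only the constant values come from the table.

```javascript
const EMA_ALPHA = 0.1;
const CUSUM_THRESHOLD = 4.0;
const CUSUM_DRIFT = 0.5;

// Exponential moving average; the first reading seeds the average.
function emaStep(prev, x) {
  return prev === null ? x : EMA_ALPHA * x + (1 - EMA_ALPHA) * prev;
}

// Two-sided CUSUM on a standardized deviation z (assumed formulation).
// state = { pos, neg }; returns true when a changepoint is flagged.
function cusumStep(state, z) {
  state.pos = Math.max(0, state.pos + z - CUSUM_DRIFT);
  state.neg = Math.max(0, state.neg - z - CUSUM_DRIFT);
  const changepoint = state.pos > CUSUM_THRESHOLD || state.neg > CUSUM_THRESHOLD;
  if (changepoint) {
    // Automatic reset so the next shift is detected independently.
    state.pos = 0;
    state.neg = 0;
  }
  return changepoint;
}
```

Under these assumptions, a sustained shift of 2 standard deviations accumulates 1.5 per reading and crosses the threshold on the third reading, while deviations smaller than the drift never accumulate.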

Statistics:

  • Welford's online algorithm for single-pass mean and sample standard deviation (2x fewer cache misses than two-pass)
  • Simple linear regression for trend slope via least-squares
  • CUSUM (Cumulative Sum) for changepoint detection with automatic reset
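
The first two techniques can be sketched in a few lines. The class and function names are illustrative, and using the sample index as the x-coordinate for the regression (rather than the reading's timestamp) is an assumption.

```javascript
// Welford's online algorithm: single pass, numerically stable.
class WelfordSketch {
  constructor() {
    this.n = 0;
    this.mean = 0;
    this.m2 = 0; // running sum of squared deviations from the mean
  }
  update(x) {
    this.n += 1;
    const delta = x - this.mean;
    this.mean += delta / this.n;
    this.m2 += delta * (x - this.mean);
  }
  get sampleVariance() {
    return this.n > 1 ? this.m2 / (this.n - 1) : 0;
  }
  get sampleStd() {
    return Math.sqrt(this.sampleVariance);
  }
}

// Least-squares slope of values against their index 0..n-1.
function trendSlope(values) {
  const n = values.length;
  if (n < 2) return 0;
  const xMean = (n - 1) / 2;
  let yMean = 0;
  for (const y of values) yMean += y;
  yMean /= n;
  let sxy = 0;
  let sxx = 0;
  for (let i = 0; i < n; i++) {
    const dx = i - xMean;
    sxy += dx * (values[i] - yMean);
    sxx += dx * dx;
  }
  return sxy / sxx;
}
```

Welford's update needs only three numbers of state per stream, which suits a processor that keeps rolling statistics for every biomarker.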

Biomarker Definitions (6 streams):

| ID | Reference Low | Reference High |
|---|---|---|
| glucose | 70 | 100 |
| cholesterol_total | 150 | 200 |
| hdl | 40 | 60 |
| ldl | 70 | 130 |
| triglycerides | 50 | 150 |
| crp | 0.1 | 3.0 |
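
One plausible reading of REF_OVERSHOOT against these ranges is sketched below: a reading counts as out of range only when it lies more than 20% of the range width beyond either bound. That interpretation, and the names `STREAM_RANGES` and `isOutOfRange`, are assumptions; the range values are taken from the table.

```javascript
const REF_OVERSHOOT = 0.20;

// [referenceLow, referenceHigh] per stream, from the table above.
const STREAM_RANGES = {
  glucose: [70, 100],
  cholesterol_total: [150, 200],
  hdl: [40, 60],
  ldl: [70, 130],
  triglycerides: [50, 150],
  crp: [0.1, 3.0],
};

// Assumed rule: tolerate an overshoot of 20% of the range width on each side.
function isOutOfRange(id, value) {
  const range = STREAM_RANGES[id];
  if (!range) throw new Error(`unknown biomarker stream: ${id}`);
  const [low, high] = range;
  const margin = (high - low) * REF_OVERSHOOT;
  return value < low - margin || value > high + margin;
}
```

For glucose the margin under this rule is 6, so readings between 64 and 106 are tolerated and anything outside that band is flagged.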

Performance Targets

| Operation | JS Target | Rust Baseline | Acceptable Ratio |
|---|---|---|---|
| computeRiskScores (20 SNPs) | <200 µs | <50 µs | 4x |
| encodeProfileVector (64-dim) | <300 µs | <100 µs | 3x |
| StreamProcessor.processReading | <50 µs | <10 µs | 5x |
| generateSyntheticPopulation(1000) | <100 ms | <20 ms | 5x |
| RingBuffer push+iter (100 items) | <20 µs | <2 µs | 10x |

Benchmark methodology: performance.now() with 1000-iteration warmup, 10000 measured iterations, report p50/p99.
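
A harness following this methodology might look like the sketch below. The function name `bench`, its options object, and the nearest-rank percentile choice are assumptions; `performance.now()` is available as a global in browsers and in recent Node.js releases.

```javascript
// Warm up, time each call individually, then report p50 and p99 in microseconds.
function bench(fn, { warmup = 1000, iterations = 10000 } = {}) {
  for (let i = 0; i < warmup; i++) fn();

  const samples = new Float64Array(iterations);
  for (let i = 0; i < iterations; i++) {
    const t0 = performance.now();
    fn();
    samples[i] = (performance.now() - t0) * 1000; // milliseconds to microseconds
  }

  samples.sort(); // typed-array sort is numeric by default
  const pct = (p) => samples[Math.min(iterations - 1, Math.floor(p * iterations))];
  return { p50: pct(0.5), p99: pct(0.99) };
}
```

Timer resolution is a limitation of per-call timing: some runtimes coarsen `performance.now()`, so for operations in the low-microsecond range it may be more reliable to time a batch of calls and divide.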

TypeScript Definitions

Full .d.ts types for every exported function, interface, and enum. Key types:

  • BiomarkerReference — 7-field clinical reference range (one per entry in the 13-entry BIOMARKER_REFERENCES table)
  • BiomarkerClassification — 'CriticalLow' | 'Low' | 'Normal' | 'High' | 'CriticalHigh'
  • CategoryScore — per-category risk with confidence and contributing variants
  • BiomarkerProfile — complete risk profile with 64-dim vector
  • StreamConfig — streaming processor configuration
  • BiomarkerReading — timestamped biomarker data point
  • StreamStats — rolling statistics with CUSUM state
  • ProcessingResult — per-reading anomaly detection result
  • StreamSummary — aggregate statistics across all biomarker streams

Test Coverage

| Category | Tests | Coverage |
|---|---|---|
| Biomarker references | 2 | Count, z-score math |
| Classification | 5 | All 5 classification levels |
| Risk scoring | 4 | All-ref low risk, elevated cancer, interaction amplification, BRCA1+TP53 |
| Profile vectors | 3 | 64-dim, L2-normalized, deterministic |
| Population generation | 3 | Correct count, deterministic, MTHFR-homocysteine correlation |
| RingBuffer | 4 | Push/iter, overflow, capacity-1, clear |
| Stream processor | 3 | Stats computation, summary totals, throughput |
| Anomaly detection | 3 | Z-score anomaly, out-of-range, zero anomaly for constant |
| Trend detection | 3 | Positive, negative, exact slope |
| Z-score / EMA | 2 | Near-mean small z, EMA convergence |
| Benchmarks | 5 | All performance targets |

Total: 32 tests + 5 benchmarks (37 entries)

WASM Acceleration Path (Future — Phase 2)

When @ruvector/rvdna-wasm is available:

// Automatic acceleration — same API, WASM hot path
const { computeRiskScores } = require('@ruvector/rvdna');
// Internally checks: nativeModule?.computeRiskScores ?? jsFallback
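
The fallback check in the comment above could be factored into a small helper like the one sketched here. `resolveImpl` is a hypothetical name, and how the optional module gets loaded (for example a `require` wrapped in try/catch) is left to the caller.

```javascript
// Pick the accelerated implementation when the optional module provides it,
// otherwise fall back to the pure-JS function.
function resolveImpl(nativeModule, name, jsFallback) {
  const candidate = nativeModule?.[name];
  return typeof candidate === 'function' ? candidate : jsFallback;
}
```

Resolving once at module load keeps the exported function identity stable, so callers see the same API whether or not the accelerated module is installed.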

WASM candidates (>10x speedup potential):

  • encodeProfileVector — SIMD dot products for L2 normalization
  • generateSyntheticPopulation — bulk PRNG + matrix operations
  • StreamProcessor.processReading — vectorized Welford accumulation

Versioning

  • @ruvector/rvdna bumps from 0.2.0 to 0.3.0 (new public API surface)
  • files array in package.json updated to include src/ directory
  • Keywords expanded: biomarker, health, risk-score, streaming, anomaly-detection
  • No breaking changes to existing v0.2.0 API

Consequences

Positive:

  • Full biomarker engine available in any JS runtime without native compilation
  • Algorithmic parity with Rust ensures cross-language consistency
  • Pure JS means zero WASM load time for initial render in browser dashboards
  • Comprehensive test suite provides regression safety net
  • TypeScript types enable IDE autocompletion and compile-time checking
  • Benchmarks establish baseline for future WASM optimization

Negative:

  • JS is 3-10x slower than Rust for numerical computation
  • Synthetic population generation uses the Mulberry32 PRNG, so seeded populations are not bit-identical to those produced by Rust's StdRng
  • MTHFR/pain analysis simplified in JS (no cross-module dependency on health.rs internals)

Neutral:

  • Same clinical disclaimers apply: research/educational use only
  • Gene-gene interaction weights unchanged from ADR-014

Options Considered

  1. WASM-only — rejected: forces async init, 2MB+ download, excludes lightweight Node.js scripts
  2. Pure JS only, no WASM path — rejected: leaves performance on the table for browser dashboards
  3. Pure JS with WASM acceleration path — selected: immediate availability + future optimization
  4. Thin wrapper over native module — rejected: native bindings unavailable on most platforms

Related Decisions

  • ADR-008: WASM Edge Genomics — establishes WASM as browser delivery mechanism
  • ADR-011: Performance Targets — JS targets derived as acceptable multiples of Rust baselines
  • ADR-014: Health Biomarker Analysis — Rust engine this ADR mirrors in JavaScript

References