ADR-015: npm/WASM Health Biomarker Engine

Status: Accepted | Date: 2026-02-22 | Authors: RuVector Genomics Architecture Team | Parents: ADR-001 (Vision), ADR-008 (WASM Edge), ADR-011 (Performance Targets), ADR-014 (Health Biomarker Analysis)

Context

ADR-014 delivered the Rust biomarker analysis engine (biomarker.rs, biomarker_stream.rs) with composite risk scoring across 20 SNPs, 6 gene-gene interactions, 64-dim L2-normalized profile vectors, and a streaming processor with RingBuffer, CUSUM changepoint detection, and Welford online statistics. ADR-008 established WASM as the delivery mechanism for browser-side genomic computation.

The @ruvector/rvdna npm package (v0.2.0) already exposes 2-bit encoding, protein translation, cosine similarity, and 23andMe genotyping via pure-JS fallbacks and optional NAPI-RS native bindings. However, it lacks the biomarker engine entirely:

Gap	Impact	Severity
No biomarker risk scoring in JS	Browser/Node users cannot compute composite health risk	Critical
No streaming processor in JS	Real-time biomarker dashboards impossible without native	Critical
No profile vector encoding	Population similarity search unavailable in JS	High
No TypeScript types for biomarker API	Developer experience degraded	Medium
No benchmarks for JS path	Cannot validate performance parity claims	Medium

The decision is whether to (a) require WASM/native for all biomarker features, (b) provide a pure-JS implementation that mirrors the Rust engine exactly, or (c) adopt a hybrid approach.

Decision: Pure-JS Biomarker Engine with WASM Acceleration Path

We implement a complete pure-JS biomarker engine in @ruvector/rvdna v0.3.0 that mirrors the Rust biomarker.rs and biomarker_stream.rs exactly, with a future WASM acceleration path for compute-intensive operations.

Rationale

Zero-dependency accessibility — Any Node.js or browser environment can run biomarker analysis without compiling Rust or loading WASM
Exact algorithmic parity — Same 20 SNPs, same 6 interactions, same 64-dim vector layout, same CUSUM parameters, same Welford statistics
Progressive enhancement — Pure JS works everywhere; WASM (future) accelerates hot paths (vector encoding, population generation)
Test oracle — JS implementation serves as a cross-language verification oracle against the Rust engine

Architecture

```
@ruvector/rvdna v0.3.0
├── index.js                 # Entry point, re-exports all modules
├── index.d.ts               # Full TypeScript definitions
├── src/
│   ├── biomarker.js         # Risk scoring engine (mirrors biomarker.rs)
│   └── stream.js            # Streaming processor (mirrors biomarker_stream.rs)
└── tests/
    └── test-biomarker.js    # Comprehensive test suite + benchmarks
```

Module 1: Biomarker Risk Scoring (`src/biomarker.js`)

Data Tables (exact mirror of Rust):

Table	Entries	Fields
`BIOMARKER_REFERENCES`	13	name, unit, normalLow, normalHigh, criticalLow, criticalHigh, category
`SNPS`	20	rsid, category, wRef, wHet, wAlt, homRef, het, homAlt, maf
`INTERACTIONS`	6	rsidA, rsidB, modifier, category
`CAT_ORDER`	4	Cancer Risk, Cardiovascular, Neurological, Metabolism

Functions:

Function	Input	Output	Mirrors
`biomarkerReferences()`	—	`BiomarkerReference[]`	`biomarker_references()`
`zScore(value, ref)`	number, BiomarkerReference	number	`z_score()`
`classifyBiomarker(value, ref)`	number, BiomarkerReference	enum string	`classify_biomarker()`
`computeRiskScores(genotypes)`	`Map<rsid,genotype>`	`BiomarkerProfile`	`compute_risk_scores()`
`encodeProfileVector(profile)`	BiomarkerProfile	`Float32Array(64)`	`encode_profile_vector()`
`generateSyntheticPopulation(count, seed)`	number, number	`BiomarkerProfile[]`	`generate_synthetic_population()`
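
As an illustration of the two simplest functions in the table, the sketch below shows one way a value could be mapped onto the five classification levels using the four thresholds of a BiomarkerReference. The boundary handling (inclusive versus exclusive), the z-score convention, and the example reference values are assumptions made for this sketch; biomarker.rs remains the authority on the actual rules.

```javascript
// Hypothetical reference entry. Field names follow the BIOMARKER_REFERENCES
// table; the numeric values are illustrative only.
const exampleRef = {
  name: 'Example Marker', unit: 'mg/dL',
  normalLow: 70, normalHigh: 100,
  criticalLow: 50, criticalHigh: 250,
  category: 'Metabolism',
};

// Assumed convention: distance from the midpoint of the normal range,
// measured in quarters of the range width (the range then spans -2..+2).
function zScoreSketch(value, ref) {
  const mid = (ref.normalLow + ref.normalHigh) / 2;
  const sd = (ref.normalHigh - ref.normalLow) / 4;
  return sd === 0 ? 0 : (value - mid) / sd;
}

// Assumed boundaries: the normal and critical limits are inclusive.
function classifySketch(value, ref) {
  if (value < ref.criticalLow) return 'CriticalLow';
  if (value < ref.normalLow) return 'Low';
  if (value <= ref.normalHigh) return 'Normal';
  if (value <= ref.criticalHigh) return 'High';
  return 'CriticalHigh';
}
```

The returned strings match the BiomarkerClassification values listed under TypeScript Definitions.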

Scoring Algorithm (identical to Rust):

For each of 20 SNPs, look up genotype and compute weight (wRef/wHet/wAlt)
Aggregate weights per category (Cancer Risk, Cardiovascular, Neurological, Metabolism)
Apply 6 multiplicative interaction modifiers where both SNPs are non-reference
Normalize each category: score = raw / maxPossible, clamped to [0, 1]
Confidence = genotyped fraction per category
Global risk = weighted average: sum(score * confidence) / sum(confidence)
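
The six steps above can be sketched on a toy table. The rsids, genotype strings, weights, and modifier below are invented for illustration and are not the 20 SNPs or 6 interactions shipped in the package; treating unrecognised genotype strings as reference is likewise an assumption.

```javascript
// Toy data only: three made-up SNPs in two categories and one interaction.
const SNPS = [
  { rsid: 'rsA', category: 'Cardiovascular', homRef: 'GG', het: 'AG', homAlt: 'AA', wRef: 0, wHet: 0.5, wAlt: 1.0 },
  { rsid: 'rsB', category: 'Cardiovascular', homRef: 'CC', het: 'CT', homAlt: 'TT', wRef: 0, wHet: 0.3, wAlt: 0.6 },
  { rsid: 'rsC', category: 'Metabolism', homRef: 'TT', het: 'CT', homAlt: 'CC', wRef: 0, wHet: 0.4, wAlt: 0.8 },
];
const INTERACTIONS = [
  { rsidA: 'rsA', rsidB: 'rsB', modifier: 1.2, category: 'Cardiovascular' },
];

function scoreSketch(genotypes) {
  const cats = new Map(); // category -> { raw, max, seen, total }
  // Steps 1-2: look up each genotype's weight and aggregate per category.
  for (const snp of SNPS) {
    const c = cats.get(snp.category) ?? { raw: 0, max: 0, seen: 0, total: 0 };
    cats.set(snp.category, c);
    c.total += 1;
    c.max += Math.max(snp.wRef, snp.wHet, snp.wAlt);
    const gt = genotypes.get(snp.rsid);
    if (gt === undefined) continue; // not genotyped: lowers confidence only
    c.seen += 1;
    // Assumption: any string other than het/homAlt is scored as reference.
    c.raw += gt === snp.homAlt ? snp.wAlt : gt === snp.het ? snp.wHet : snp.wRef;
  }
  // Step 3: multiplicative modifier when both SNPs are non-reference.
  const nonRef = (rsid) => {
    const snp = SNPS.find((s) => s.rsid === rsid);
    const gt = genotypes.get(rsid);
    return gt !== undefined && gt !== snp.homRef;
  };
  for (const ix of INTERACTIONS) {
    if (nonRef(ix.rsidA) && nonRef(ix.rsidB)) cats.get(ix.category).raw *= ix.modifier;
  }
  // Steps 4-6: normalise and clamp, compute confidence, then the global risk.
  let num = 0;
  let den = 0;
  const categoryScores = {};
  for (const [name, c] of cats) {
    const score = c.max > 0 ? Math.min(1, Math.max(0, c.raw / c.max)) : 0;
    const confidence = c.seen / c.total;
    categoryScores[name] = { score, confidence };
    num += score * confidence;
    den += confidence;
  }
  return { categoryScores, globalRisk: den > 0 ? num / den : 0 };
}
```

Because the global risk is weighted by confidence, a category with no genotyped SNPs contributes nothing to either the numerator or the denominator.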

Profile Vector Layout (64 dimensions, L2-normalized):

Dims	Content	Count
0–50	One-hot genotype encoding (17 SNPs x 3)	51
51–54	Category scores	4
55	Global risk score	1
56–59	First 4 interaction modifiers	4
60	MTHFR score / 4	1
61	Pain score / 4	1
62	APOE risk code / 2	1
63	LPA composite	1
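
The layout above can be sketched as a fill-then-normalise routine. The input shape (numeric genotype codes, plain arrays for the category scores and modifiers) and the handling of missing genotypes are assumptions for this example, not the package's actual BiomarkerProfile structure.

```javascript
const DIM = 64;
const ONE_HOT_SNPS = 17; // occupies dims 0-50, three slots per SNP

// genotypeCodes: 17 entries, each 0 (hom-ref), 1 (het), 2 (hom-alt),
// or -1 for a missing genotype (assumed to leave all three slots at zero).
function encodeSketch({ genotypeCodes, categoryScores, globalRisk, modifiers, mthfr, pain, apoe, lpa }) {
  const v = new Float32Array(DIM);
  for (let i = 0; i < ONE_HOT_SNPS; i++) {
    const code = genotypeCodes[i];
    if (code >= 0 && code <= 2) v[i * 3 + code] = 1;
  }
  for (let i = 0; i < 4; i++) v[51 + i] = categoryScores[i]; // dims 51-54
  v[55] = globalRisk;
  for (let i = 0; i < 4; i++) v[56 + i] = modifiers[i];      // dims 56-59
  v[60] = mthfr / 4;
  v[61] = pain / 4;
  v[62] = apoe / 2;
  v[63] = lpa;

  // L2 normalisation: divide by the Euclidean norm unless the vector is all zeros.
  let sumSq = 0;
  for (let i = 0; i < DIM; i++) sumSq += v[i] * v[i];
  const norm = Math.sqrt(sumSq);
  if (norm > 0) for (let i = 0; i < DIM; i++) v[i] /= norm;
  return v;
}
```

With unit-length vectors, the cosine similarity between two profiles reduces to a plain dot product.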

PRNG: Mulberry32 (deterministic, no dependencies, matches seeded output for synthetic populations).
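
For reference, the widely circulated form of Mulberry32 is shown below. How the package seeds it and in what order it draws values for each synthetic profile are not specified here, so this is only a sketch of the generator itself.

```javascript
// Mulberry32: a 32-bit state PRNG returning floats in [0, 1).
function mulberry32(seed) {
  let a = seed >>> 0; // coerce the seed to an unsigned 32-bit integer
  return function next() {
    a = (a + 0x6D2B79F5) | 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Two generators built from the same seed produce the same sequence.
const rngA = mulberry32(42);
const rngB = mulberry32(42);
const sameSequence = [1, 2, 3].every(() => rngA() === rngB());
```

All arithmetic stays in 32-bit integers via Math.imul and the unsigned shift, which is why the generator needs no dependencies and behaves identically across JS engines.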

Module 2: Streaming Biomarker Processor (`src/stream.js`)

Data Structures:

Structure	Purpose	Mirrors
`RingBuffer`	Fixed-capacity circular buffer, no allocation after init	`RingBuffer<T>`
`StreamProcessor`	Per-biomarker rolling stats, anomaly detection, trend analysis	`StreamProcessor`
`StreamStats`	mean, variance, min, max, EMA, CUSUM, changepoint	`StreamStats`
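
A fixed-capacity circular buffer of the kind described in the table might look like the sketch below. The class and method names are placeholders, not the exported API of src/stream.js.

```javascript
class RingBufferSketch {
  constructor(capacity) {
    if (!Number.isInteger(capacity) || capacity < 1) {
      throw new RangeError('capacity must be a positive integer');
    }
    this.buf = new Array(capacity); // allocated once, never resized
    this.head = 0;                  // index of the next write
    this.len = 0;                   // number of valid items
  }

  // Overwrites the oldest item once the buffer is full.
  push(item) {
    this.buf[this.head] = item;
    this.head = (this.head + 1) % this.buf.length;
    if (this.len < this.buf.length) this.len += 1;
  }

  // Iterates from oldest to newest.
  *[Symbol.iterator]() {
    const cap = this.buf.length;
    const start = (this.head - this.len + cap) % cap;
    for (let i = 0; i < this.len; i++) yield this.buf[(start + i) % cap];
  }

  clear() {
    this.head = 0;
    this.len = 0;
  }
}

const rb = new RingBufferSketch(3);
for (const x of [1, 2, 3, 4, 5]) rb.push(x);
const latest = [...rb]; // [3, 4, 5]
```

The overflow, capacity-1, and clear cases listed under Test Coverage correspond to the wrap-around arithmetic in push and the iterator.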

Constants (identical to Rust):

Constant	Value	Purpose
`EMA_ALPHA`	0.1	Exponential moving average smoothing
`Z_SCORE_THRESHOLD`	2.5	Anomaly detection threshold
`REF_OVERSHOOT`	0.20	Out-of-range tolerance (20% of range)
`CUSUM_THRESHOLD`	4.0	Changepoint detection sensitivity
`CUSUM_DRIFT`	0.5	CUSUM allowable drift
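
To show how these constants could fit together, the sketch below implements an EMA update, a z-score anomaly test, and a two-sided CUSUM step with reset. The constant values come from the table; the two-sided form, the use of a standardised deviation z as input, and the strict greater-than comparisons are assumptions, so biomarker_stream.rs should be consulted for the exact update rules.

```javascript
const EMA_ALPHA = 0.1;
const Z_SCORE_THRESHOLD = 2.5;
const CUSUM_THRESHOLD = 4.0;
const CUSUM_DRIFT = 0.5;

// Exponential moving average; the first reading initialises the average.
function emaStep(prev, x) {
  return prev === null ? x : EMA_ALPHA * x + (1 - EMA_ALPHA) * prev;
}

// Flags a reading whose standardised deviation exceeds the threshold.
function isAnomaly(z) {
  return Math.abs(z) > Z_SCORE_THRESHOLD;
}

// Two-sided CUSUM on a standardised deviation z. Deviations smaller than
// the drift are absorbed; both sums reset when a changepoint is flagged.
function cusumStep(state, z) {
  const pos = Math.max(0, state.pos + z - CUSUM_DRIFT);
  const neg = Math.max(0, state.neg - z - CUSUM_DRIFT);
  const changepoint = pos > CUSUM_THRESHOLD || neg > CUSUM_THRESHOLD;
  return changepoint
    ? { pos: 0, neg: 0, changepoint }
    : { pos, neg, changepoint };
}
```

Under these assumptions a sustained shift of 1.5 standard deviations adds 1.0 to the positive sum per reading and trips the threshold on the fifth reading, while a single 1.5-sigma outlier is not flagged by either the anomaly test or CUSUM.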

Statistics:

Welford's online algorithm for single-pass mean and sample standard deviation (2x fewer cache misses than two-pass)
Simple linear regression for trend slope via least-squares
CUSUM (Cumulative Sum) for changepoint detection with automatic reset
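
The first two items can be illustrated with a short sketch: Welford's update for a running mean and sample standard deviation, and an ordinary least-squares slope over evenly spaced readings. Using the reading index as the x-axis is an assumption here; the streaming processor may regress against timestamps instead.

```javascript
// Welford's online algorithm: one pass, no stored samples.
class WelfordSketch {
  constructor() {
    this.n = 0;
    this.mean = 0;
    this.m2 = 0; // running sum of squared deviations from the mean
  }

  update(x) {
    this.n += 1;
    const delta = x - this.mean;
    this.mean += delta / this.n;
    this.m2 += delta * (x - this.mean); // uses the updated mean
  }

  // Sample standard deviation (n - 1 denominator); 0 with fewer than 2 points.
  sampleStd() {
    return this.n > 1 ? Math.sqrt(this.m2 / (this.n - 1)) : 0;
  }
}

// Least-squares slope of ys against x = 0, 1, 2, ...
function slopeSketch(ys) {
  const n = ys.length;
  if (n < 2) return 0;
  const xMean = (n - 1) / 2;
  let yMean = 0;
  for (const y of ys) yMean += y / n;
  let num = 0;
  let den = 0;
  for (let i = 0; i < n; i++) {
    num += (i - xMean) * (ys[i] - yMean);
    den += (i - xMean) ** 2;
  }
  return num / den;
}
```

Welford's form avoids the cancellation that affects the naive sum-of-squares formula when the mean is large relative to the spread, which matters for readings such as cholesterol that cluster around 150-200.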

Biomarker Definitions (6 streams):

ID	Reference Low	Reference High
glucose	70	100
cholesterol_total	150	200
hdl	40	60
ldl	70	130
triglycerides	50	150
crp	0.1	3.0
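
One plausible reading of REF_OVERSHOOT is that a value is flagged as out of range only when it lies beyond the reference interval by more than 20% of the interval's width. The sketch below encodes that interpretation with the six streams from the table; the function name and the exact comparison are assumptions rather than the documented behaviour.

```javascript
const REF_OVERSHOOT = 0.20;

// Reference intervals copied from the table above.
const STREAMS = {
  glucose: { low: 70, high: 100 },
  cholesterol_total: { low: 150, high: 200 },
  hdl: { low: 40, high: 60 },
  ldl: { low: 70, high: 130 },
  triglycerides: { low: 50, high: 150 },
  crp: { low: 0.1, high: 3.0 },
};

// Assumed rule: tolerate an overshoot of up to 20% of the range on either side.
function isOutOfRangeSketch(id, value) {
  const ref = STREAMS[id];
  if (!ref) throw new Error(`unknown biomarker stream: ${id}`);
  const margin = REF_OVERSHOOT * (ref.high - ref.low);
  return value < ref.low - margin || value > ref.high + margin;
}
```

For glucose this reading gives a margin of 6, so 105 would pass and 107 would be flagged.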

Performance Targets

Operation	JS Target	Rust Baseline	Acceptable Ratio
`computeRiskScores` (20 SNPs)	<200 us	<50 us	4x
`encodeProfileVector` (64-dim)	<300 us	<100 us	3x
`StreamProcessor.processReading`	<50 us	<10 us	5x
`generateSyntheticPopulation(1000)`	<100 ms	<20 ms	5x
RingBuffer push+iter (100 items)	<20 us	<2 us	10x

Benchmark methodology: performance.now() with 1000-iteration warmup, 10000 measured iterations, report p50/p99.
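
A minimal harness following that methodology could look like the sketch below. It relies on the global performance object available in browsers and recent Node.js releases; the function name, the options shape, and the percentile indexing are choices made for this example.

```javascript
// Times fn() after a warmup phase and reports p50/p99 in microseconds.
function benchSketch(fn, { warmup = 1000, iterations = 10000 } = {}) {
  for (let i = 0; i < warmup; i++) fn(); // let the JIT settle

  const samples = new Float64Array(iterations);
  for (let i = 0; i < iterations; i++) {
    const t0 = performance.now();
    fn();
    samples[i] = (performance.now() - t0) * 1000; // ms -> us
  }

  samples.sort(); // typed arrays sort numerically by default
  const pct = (p) => samples[Math.min(iterations - 1, Math.floor(p * iterations))];
  return { p50: pct(0.5), p99: pct(0.99) };
}

// Example: time a trivial 64-element reduction.
const data = new Float32Array(64).fill(1);
const result = benchSketch(() => {
  let s = 0;
  for (let i = 0; i < data.length; i++) s += data[i];
  return s;
});
```

Timer resolution is a practical caveat: performance.now() may be coarsened by the runtime, so for targets in the low-microsecond range it can be more reliable to time a batch of calls per sample and divide by the batch size.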

TypeScript Definitions

Full .d.ts types for every exported function, interface, and enum. Key types:

BiomarkerReference — clinical reference range for one biomarker (7 fields; the BIOMARKER_REFERENCES table holds 13 such entries)
BiomarkerClassification — 'CriticalLow' | 'Low' | 'Normal' | 'High' | 'CriticalHigh'
CategoryScore — per-category risk with confidence and contributing variants
BiomarkerProfile — complete risk profile with 64-dim vector
StreamConfig — streaming processor configuration
BiomarkerReading — timestamped biomarker data point
StreamStats — rolling statistics with CUSUM state
ProcessingResult — per-reading anomaly detection result
StreamSummary — aggregate statistics across all biomarker streams

Test Coverage

Category	Tests	Coverage
Biomarker references	2	Count, z-score math
Classification	5	All 5 classification levels
Risk scoring	4	All-ref low risk, elevated cancer, interaction amplification, BRCA1+TP53
Profile vectors	3	64-dim, L2-normalized, deterministic
Population generation	3	Correct count, deterministic, MTHFR-homocysteine correlation
RingBuffer	4	Push/iter, overflow, capacity-1, clear
Stream processor	3	Stats computation, summary totals, throughput
Anomaly detection	3	Z-score anomaly, out-of-range, zero anomaly for constant
Trend detection	3	Positive, negative, exact slope
Z-score / EMA	2	Near-mean small z, EMA convergence
Benchmarks	5	All performance targets

Total: 32 tests + 5 benchmarks (37 in all)

WASM Acceleration Path (Future — Phase 2)

When @ruvector/rvdna-wasm is available:

```js
// Automatic acceleration — same API, WASM hot path
const { computeRiskScores } = require('@ruvector/rvdna');
// Internally checks: nativeModule?.computeRiskScores ?? jsFallback
```

WASM candidates (>10x speedup potential):

encodeProfileVector — SIMD dot products for L2 normalization
generateSyntheticPopulation — bulk PRNG + matrix operations
StreamProcessor.processReading — vectorized Welford accumulation
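
The dispatch hinted at by the nativeModule?.computeRiskScores ?? jsFallback comment might be structured as in the sketch below. The helper name and the way the accelerated module is obtained are hypothetical; the point is only that callers keep a single API while the implementation is chosen once at load time.

```javascript
// Returns the accelerated implementation when the optional module provides
// one, otherwise the pure-JS fallback. nativeModule may be null or undefined.
function selectImpl(nativeModule, name, jsFallback) {
  const candidate = nativeModule?.[name];
  return typeof candidate === 'function' ? candidate : jsFallback;
}

// Stand-ins for the real implementations, for illustration only.
const jsComputeRiskScores = (genotypes) => ({ engine: 'js', size: genotypes.size });
const fakeWasmModule = {
  computeRiskScores: (genotypes) => ({ engine: 'wasm', size: genotypes.size }),
};

const withoutWasm = selectImpl(null, 'computeRiskScores', jsComputeRiskScores);
const withWasm = selectImpl(fakeWasmModule, 'computeRiskScores', jsComputeRiskScores);
```

Checking typeof rather than relying on ?? alone also covers a module that exists but exports a non-function value under the expected name.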

Versioning

@ruvector/rvdna bumps from 0.2.0 to 0.3.0 (new public API surface)
files array in package.json updated to include src/ directory
Keywords expanded: biomarker, health, risk-score, streaming, anomaly-detection
No breaking changes to existing v0.2.0 API

Consequences

Positive:

Full biomarker engine available in any JS runtime without native compilation
Algorithmic parity with Rust ensures cross-language consistency
Pure JS means zero WASM load time for initial render in browser dashboards
Comprehensive test suite provides regression safety net
TypeScript types enable IDE autocompletion and compile-time checking
Benchmarks establish baseline for future WASM optimization

Negative:

JS is 3-10x slower than Rust for numerical computation
Synthetic population generation uses the Mulberry32 PRNG (its output is not bit-identical to Rust's StdRng)
MTHFR/pain analysis simplified in JS (no cross-module dependency on health.rs internals)

Neutral:

Same clinical disclaimers apply: research/educational use only
Gene-gene interaction weights unchanged from ADR-014

Options Considered

WASM-only — rejected: forces async init, 2MB+ download, excludes lightweight Node.js scripts
Pure JS only, no WASM path — rejected: leaves performance on the table for browser dashboards
Pure JS with WASM acceleration path — selected: immediate availability + future optimization
Thin wrapper over native module — rejected: native bindings unavailable on most platforms

Related Decisions

ADR-008: WASM Edge Genomics — establishes WASM as browser delivery mechanism
ADR-011: Performance Targets — JS targets derived as acceptable multiples of Rust baselines
ADR-014: Health Biomarker Analysis — Rust engine this ADR mirrors in JavaScript

References

Mulberry32 PRNG — 32-bit deterministic PRNG
Welford's Online Algorithm — Numerically stable variance
CUSUM — Cumulative sum control chart for changepoint detection
CPIC Guidelines — Pharmacogenomics evidence base
