Squashed 'vendor/ruvector/' content from commit b64c2172

git-subtree-dir: vendor/ruvector
git-subtree-split: b64c21726f2bb37286d9ee36a7869fef60cc6900
This commit is contained in:
ruv
2026-02-28 14:39:40 -05:00
commit d803bfe2b1
7854 changed files with 3522914 additions and 0 deletions

View File

@@ -0,0 +1,230 @@
# ADR-015: npm/WASM Health Biomarker Engine
**Status:** Accepted | **Date:** 2026-02-22 | **Authors:** RuVector Genomics Architecture Team
**Parents:** ADR-001 (Vision), ADR-008 (WASM Edge), ADR-011 (Performance Targets), ADR-014 (Health Biomarker Analysis)
## Context
ADR-014 delivered the Rust biomarker analysis engine (`biomarker.rs`, `biomarker_stream.rs`) with composite risk scoring across 20 SNPs, 6 gene-gene interactions, 64-dim L2-normalized profile vectors, and a streaming processor with RingBuffer, CUSUM changepoint detection, and Welford online statistics. ADR-008 established WASM as the delivery mechanism for browser-side genomic computation.
The `@ruvector/rvdna` npm package (v0.2.0) already exposes 2-bit encoding, protein translation, cosine similarity, and 23andMe genotyping via pure-JS fallbacks and optional NAPI-RS native bindings. However, it lacks the biomarker engine entirely:
| Gap | Impact | Severity |
|-----|--------|----------|
| No biomarker risk scoring in JS | Browser/Node users cannot compute composite health risk | Critical |
| No streaming processor in JS | Real-time biomarker dashboards impossible without native | Critical |
| No profile vector encoding | Population similarity search unavailable in JS | High |
| No TypeScript types for biomarker API | Developer experience degraded | Medium |
| No benchmarks for JS path | Cannot validate performance parity claims | Medium |
The decision is whether to (a) require WASM/native for all biomarker features, (b) provide a pure-JS implementation that mirrors the Rust engine exactly, or (c) a hybrid approach.
## Decision: Pure-JS Biomarker Engine with WASM Acceleration Path
We implement a **complete pure-JS biomarker engine** in `@ruvector/rvdna` v0.3.0 that mirrors the Rust `biomarker.rs` and `biomarker_stream.rs` exactly, with a future WASM acceleration path for compute-intensive operations.
### Rationale
1. **Zero-dependency accessibility** — Any Node.js or browser environment can run biomarker analysis without compiling Rust or loading WASM
2. **Exact algorithmic parity** — Same 20 SNPs, same 6 interactions, same 64-dim vector layout, same CUSUM parameters, same Welford statistics
3. **Progressive enhancement** — Pure JS works everywhere; WASM (future) accelerates hot paths (vector encoding, population generation)
4. **Test oracle** — JS implementation serves as a cross-language verification oracle against the Rust engine
### Architecture
```
@ruvector/rvdna v0.3.0
├── index.js # Entry point, re-exports all modules
├── index.d.ts # Full TypeScript definitions
├── src/
│ ├── biomarker.js # Risk scoring engine (mirrors biomarker.rs)
│ └── stream.js # Streaming processor (mirrors biomarker_stream.rs)
└── tests/
└── test-biomarker.js # Comprehensive test suite + benchmarks
```
### Module 1: Biomarker Risk Scoring (`src/biomarker.js`)
**Data Tables (exact mirror of Rust):**
| Table | Entries | Fields |
|-------|---------|--------|
| `BIOMARKER_REFERENCES` | 13 | name, unit, normalLow, normalHigh, criticalLow, criticalHigh, category |
| `SNPS` | 20 | rsid, category, wRef, wHet, wAlt, homRef, het, homAlt, maf |
| `INTERACTIONS` | 6 | rsidA, rsidB, modifier, category |
| `CAT_ORDER` | 4 | Cancer Risk, Cardiovascular, Neurological, Metabolism |
**Functions:**
| Function | Input | Output | Mirrors |
|----------|-------|--------|---------|
| `biomarkerReferences()` | — | `BiomarkerReference[]` | `biomarker_references()` |
| `zScore(value, ref)` | number, BiomarkerReference | number | `z_score()` |
| `classifyBiomarker(value, ref)` | number, BiomarkerReference | enum string | `classify_biomarker()` |
| `computeRiskScores(genotypes)` | `Map<rsid,genotype>` | `BiomarkerProfile` | `compute_risk_scores()` |
| `encodeProfileVector(profile)` | BiomarkerProfile | `Float32Array(64)` | `encode_profile_vector()` |
| `generateSyntheticPopulation(count, seed)` | number, number | `BiomarkerProfile[]` | `generate_synthetic_population()` |
**Scoring Algorithm (identical to Rust):**
1. For each of 20 SNPs, look up genotype and compute weight (wRef/wHet/wAlt)
2. Aggregate weights per category (Cancer Risk, Cardiovascular, Neurological, Metabolism)
3. Apply 6 multiplicative interaction modifiers where both SNPs are non-reference
4. Normalize each category: `score = raw / maxPossible`, clamped to [0, 1]
5. Confidence = genotyped fraction per category
6. Global risk = weighted average: `sum(score * confidence) / sum(confidence)`
**Profile Vector Layout (64 dimensions, L2-normalized):**
| Dims | Content | Count |
|------|---------|-------|
| 050 | One-hot genotype encoding (17 SNPs x 3) | 51 |
| 5154 | Category scores | 4 |
| 55 | Global risk score | 1 |
| 5659 | First 4 interaction modifiers | 4 |
| 60 | MTHFR score / 4 | 1 |
| 61 | Pain score / 4 | 1 |
| 62 | APOE risk code / 2 | 1 |
| 63 | LPA composite | 1 |
**PRNG:** Mulberry32 (deterministic, no dependencies, matches seeded output for synthetic populations).
### Module 2: Streaming Biomarker Processor (`src/stream.js`)
**Data Structures:**
| Structure | Purpose | Mirrors |
|-----------|---------|---------|
| `RingBuffer` | Fixed-capacity circular buffer, no allocation after init | `RingBuffer<T>` |
| `StreamProcessor` | Per-biomarker rolling stats, anomaly detection, trend analysis | `StreamProcessor` |
| `StreamStats` | mean, variance, min, max, EMA, CUSUM, changepoint | `StreamStats` |
**Constants (identical to Rust):**
| Constant | Value | Purpose |
|----------|-------|---------|
| `EMA_ALPHA` | 0.1 | Exponential moving average smoothing |
| `Z_SCORE_THRESHOLD` | 2.5 | Anomaly detection threshold |
| `REF_OVERSHOOT` | 0.20 | Out-of-range tolerance (20% of range) |
| `CUSUM_THRESHOLD` | 4.0 | Changepoint detection sensitivity |
| `CUSUM_DRIFT` | 0.5 | CUSUM allowable drift |
**Statistics:**
- **Welford's online algorithm** for single-pass mean and sample standard deviation (2x fewer cache misses than two-pass)
- **Simple linear regression** for trend slope via least-squares
- **CUSUM** (Cumulative Sum) for changepoint detection with automatic reset
**Biomarker Definitions (6 streams):**
| ID | Reference Low | Reference High |
|----|--------------|---------------|
| glucose | 70 | 100 |
| cholesterol_total | 150 | 200 |
| hdl | 40 | 60 |
| ldl | 70 | 130 |
| triglycerides | 50 | 150 |
| crp | 0.1 | 3.0 |
### Performance Targets
| Operation | JS Target | Rust Baseline | Acceptable Ratio |
|-----------|-----------|---------------|------------------|
| `computeRiskScores` (20 SNPs) | <200 us | <50 us | 4x |
| `encodeProfileVector` (64-dim) | <300 us | <100 us | 3x |
| `StreamProcessor.processReading` | <50 us | <10 us | 5x |
| `generateSyntheticPopulation(1000)` | <100 ms | <20 ms | 5x |
| RingBuffer push+iter (100 items) | <20 us | <2 us | 10x |
**Benchmark methodology:** `performance.now()` with 1000-iteration warmup, 10000 measured iterations, report p50/p99.
### TypeScript Definitions
Full `.d.ts` types for every exported function, interface, and enum. Key types:
- `BiomarkerReference` — 13-field clinical reference range
- `BiomarkerClassification``'CriticalLow' | 'Low' | 'Normal' | 'High' | 'CriticalHigh'`
- `CategoryScore` — per-category risk with confidence and contributing variants
- `BiomarkerProfile` — complete risk profile with 64-dim vector
- `StreamConfig` — streaming processor configuration
- `BiomarkerReading` — timestamped biomarker data point
- `StreamStats` — rolling statistics with CUSUM state
- `ProcessingResult` — per-reading anomaly detection result
- `StreamSummary` — aggregate statistics across all biomarker streams
### Test Coverage
| Category | Tests | Coverage |
|----------|-------|----------|
| Biomarker references | 2 | Count, z-score math |
| Classification | 5 | All 5 classification levels |
| Risk scoring | 4 | All-ref low risk, elevated cancer, interaction amplification, BRCA1+TP53 |
| Profile vectors | 3 | 64-dim, L2-normalized, deterministic |
| Population generation | 3 | Correct count, deterministic, MTHFR-homocysteine correlation |
| RingBuffer | 4 | Push/iter, overflow, capacity-1, clear |
| Stream processor | 3 | Stats computation, summary totals, throughput |
| Anomaly detection | 3 | Z-score anomaly, out-of-range, zero anomaly for constant |
| Trend detection | 3 | Positive, negative, exact slope |
| Z-score / EMA | 2 | Near-mean small z, EMA convergence |
| Benchmarks | 5 | All performance targets |
**Total: 37 tests + 5 benchmarks**
### WASM Acceleration Path (Future — Phase 2)
When `@ruvector/rvdna-wasm` is available:
```js
// Automatic acceleration — same API, WASM hot path
const { computeRiskScores } = require('@ruvector/rvdna');
// Internally checks: nativeModule?.computeRiskScores ?? jsFallback
```
**WASM candidates (>10x speedup potential):**
- `encodeProfileVector` — SIMD dot products for L2 normalization
- `generateSyntheticPopulation` — bulk PRNG + matrix operations
- `StreamProcessor.processReading` — vectorized Welford accumulation
### Versioning
- `@ruvector/rvdna` bumps from `0.2.0` to `0.3.0` (new public API surface)
- `files` array in `package.json` updated to include `src/` directory
- Keywords expanded: `biomarker`, `health`, `risk-score`, `streaming`, `anomaly-detection`
- No breaking changes to existing v0.2.0 API
## Consequences
**Positive:**
- Full biomarker engine available in any JS runtime without native compilation
- Algorithmic parity with Rust ensures cross-language consistency
- Pure JS means zero WASM load time for initial render in browser dashboards
- Comprehensive test suite provides regression safety net
- TypeScript types enable IDE autocompletion and compile-time checking
- Benchmarks establish baseline for future WASM optimization
**Negative:**
- JS is 3-10x slower than Rust for numerical computation
- Synthetic population generation uses Mulberry32 PRNG (not cryptographically identical to Rust's StdRng)
- MTHFR/pain analysis simplified in JS (no cross-module dependency on health.rs internals)
**Neutral:**
- Same clinical disclaimers apply: research/educational use only
- Gene-gene interaction weights unchanged from ADR-014
## Options Considered
1. **WASM-only** — rejected: forces async init, 2MB+ download, excludes lightweight Node.js scripts
2. **Pure JS only, no WASM path** — rejected: leaves performance on the table for browser dashboards
3. **Pure JS with WASM acceleration path** — selected: immediate availability + future optimization
4. **Thin wrapper over native module** — rejected: native bindings unavailable on most platforms
## Related Decisions
- **ADR-008**: WASM Edge Genomics — establishes WASM as browser delivery mechanism
- **ADR-011**: Performance Targets — JS targets derived as acceptable multiples of Rust baselines
- **ADR-014**: Health Biomarker Analysis — Rust engine this ADR mirrors in JavaScript
## References
- [Mulberry32 PRNG](https://gist.github.com/tommyettinger/46a874533244883189143505d203312c) — 32-bit deterministic PRNG
- [Welford's Online Algorithm](https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Welford%27s_online_algorithm) — Numerically stable variance
- [CUSUM](https://en.wikipedia.org/wiki/CUSUM) — Cumulative sum control chart for changepoint detection
- [CPIC Guidelines](https://cpicpgx.org/) — Pharmacogenomics evidence base