Files
wifi-densepose/docs/research/sublinear-time-solver/adr/ADR-STS-004-wasm-cross-platform.md
ruv d803bfe2b1 Squashed 'vendor/ruvector/' content from commit b64c2172
git-subtree-dir: vendor/ruvector
git-subtree-split: b64c21726f2bb37286d9ee36a7869fef60cc6900
2026-02-28 14:39:40 -05:00

15 KiB

ADR-STS-004: WASM and Cross-Platform Compilation Strategy

Status: Accepted Date: 2026-02-20 Authors: RuVector Architecture Team Deciders: Architecture Review Board

Version History

Version Date Author Changes
0.1 2026-02-20 RuVector Team Initial proposal
1.0 2026-02-20 RuVector Team Accepted: full implementation complete

Context

Multi-Platform Deployment Requirement

RuVector deploys across four target platforms with distinct constraints:

Platform ISA SIMD Threads Memory Target Triple
Server (Linux/macOS) x86_64 AVX-512/AVX2/SSE4.1 Full (Rayon) 2+ GB x86_64-unknown-linux-gnu
Edge (Apple Silicon) ARM64 NEON Full (Rayon) 512 MB aarch64-apple-darwin
Browser wasm32 SIMD128 Web Workers 4-8 MB wasm32-unknown-unknown
Cloudflare Workers wasm32 None Single 128 MB wasm32-unknown-unknown
Node.js (NAPI) Native Native Full 512 MB via napi-rs

Existing WASM Infrastructure

RuVector has 15+ WASM crates following the Core-Binding-Surface pattern:

ruvector-core       →  ruvector-wasm         →  @ruvector/core (npm)
ruvector-graph      →  ruvector-graph-wasm   →  @ruvector/graph (npm)
ruvector-attention   →  ruvector-attention-wasm →  @ruvector/attention (npm)
ruvector-gnn        →  ruvector-gnn-wasm     →  @ruvector/gnn (npm)
ruvector-math       →  ruvector-math-wasm    →  @ruvector/math (npm)

Each WASM crate uses wasm-bindgen 0.2, serde-wasm-bindgen, js-sys 0.3, and getrandom 0.3 with wasm_js feature.

WASM Constraints for Solver

  • No std::thread — all parallelism via Web Workers
  • No std::fs / std::net — no persistent storage, no network
  • Default linear memory: 16 MB (expandable to ~4 GB)
  • parking_lot required instead of std::sync::Mutex
  • getrandom/wasm_js for randomness (Hybrid Random Walk, Monte Carlo)
  • No dynamic linking — all code in single module

Performance Targets

Platform 10K solve 100K solve Memory Budget
Server (AVX2) < 2 ms < 50 ms 2 GB
Edge (NEON) < 5 ms < 100 ms 512 MB
Browser (SIMD128) < 50 ms < 500 ms 8 MB
Edge (Cloudflare) < 10 ms < 200 ms 128 MB
Node.js (NAPI) < 3 ms < 60 ms 512 MB

Decision

1. Three-Crate Pattern

Follow established RuVector convention with three crates:

crates/ruvector-solver/          # Core Rust (no platform deps)
crates/ruvector-solver-wasm/     # wasm-bindgen bindings
crates/ruvector-solver-node/     # NAPI-RS bindings

Cargo.toml for ruvector-solver (core):

[package]
name = "ruvector-solver"
version = "0.1.0"
edition = "2021"
rust-version = "1.77"

[features]
default = []
nalgebra-backend = ["nalgebra"]
ndarray-backend = ["ndarray"]
parallel = ["rayon", "crossbeam"]
simd = []
wasm = []
full = ["nalgebra-backend", "ndarray-backend", "parallel"]

# Algorithm features
neumann = []
forward-push = []
backward-push = []
hybrid-random-walk = ["getrandom"]
true-solver = ["neumann"]  # TRUE uses Neumann internally
cg = []
bmssp = []
all-algorithms = ["neumann", "forward-push", "backward-push",
                  "hybrid-random-walk", "true-solver", "cg", "bmssp"]

[dependencies]
serde = { workspace = true, features = ["derive"] }
nalgebra = { workspace = true, optional = true, default-features = false }
ndarray = { workspace = true, optional = true }
rayon = { workspace = true, optional = true }
crossbeam = { workspace = true, optional = true }
getrandom = { workspace = true, optional = true }

[target.'cfg(target_arch = "wasm32")'.dependencies]
getrandom = { workspace = true, features = ["wasm_js"] }

Cargo.toml for ruvector-solver-wasm:

[package]
name = "ruvector-solver-wasm"
version = "0.1.0"
edition = "2021"

[lib]
crate-type = ["cdylib"]

[dependencies]
ruvector-solver = { path = "../ruvector-solver", default-features = false,
    features = ["wasm", "neumann", "forward-push", "backward-push", "cg"] }
wasm-bindgen = { workspace = true }
serde-wasm-bindgen = "0.6"
js-sys = { workspace = true }
web-sys = { workspace = true, features = ["console"] }
getrandom = { workspace = true, features = ["wasm_js"] }

[profile.release]
opt-level = "s"   # Optimize for size in WASM
lto = true

Cargo.toml for ruvector-solver-node:

[package]
name = "ruvector-solver-node"
version = "0.1.0"
edition = "2021"

[lib]
crate-type = ["cdylib"]

[dependencies]
ruvector-solver = { path = "../ruvector-solver",
    features = ["full", "all-algorithms"] }
napi = { workspace = true, features = ["async"] }
napi-derive = { workspace = true }
tokio = { workspace = true, features = ["rt-multi-thread"] }

2. SIMD Strategy Per Platform

Architecture Detection and Dispatch

/// SIMD dispatcher for solver hot paths
pub mod simd {
    #[cfg(target_arch = "x86_64")]
    pub fn spmv_simd(vals: &[f32], cols: &[u32], x: &[f32]) -> f32 {
        if is_x86_feature_detected!("avx512f") {
            unsafe { spmv_avx512(vals, cols, x) }
        } else if is_x86_feature_detected!("avx2") && is_x86_feature_detected!("fma") {
            unsafe { spmv_avx2_fma(vals, cols, x) }
        } else {
            spmv_scalar(vals, cols, x)
        }
    }

    #[cfg(target_arch = "aarch64")]
    pub fn spmv_simd(vals: &[f32], cols: &[u32], x: &[f32]) -> f32 {
        unsafe { spmv_neon_unrolled(vals, cols, x) }
    }

    #[cfg(target_arch = "wasm32")]
    pub fn spmv_simd(vals: &[f32], cols: &[u32], x: &[f32]) -> f32 {
        // WASM SIMD128 via core::arch::wasm32
        #[cfg(target_feature = "simd128")]
        {
            unsafe { spmv_wasm_simd128(vals, cols, x) }
        }
        #[cfg(not(target_feature = "simd128"))]
        {
            spmv_scalar(vals, cols, x)
        }
    }

    /// AVX2+FMA SpMV accumulation with 4x unrolling
    #[cfg(target_arch = "x86_64")]
    #[target_feature(enable = "avx2,fma")]
    unsafe fn spmv_avx2_fma(vals: &[f32], cols: &[u32], x: &[f32]) -> f32 {
        use std::arch::x86_64::*;
        let mut acc0 = _mm256_setzero_ps();
        let mut acc1 = _mm256_setzero_ps();
        let n = vals.len();
        let chunks = n / 16;

        for i in 0..chunks {
            let base = i * 16;
            // Gather x values using column indices
            let idx0 = _mm256_loadu_si256(cols.as_ptr().add(base) as *const __m256i);
            let idx1 = _mm256_loadu_si256(cols.as_ptr().add(base + 8) as *const __m256i);
            let x0 = _mm256_i32gather_ps::<4>(x.as_ptr(), idx0);
            let x1 = _mm256_i32gather_ps::<4>(x.as_ptr(), idx1);
            let v0 = _mm256_loadu_ps(vals.as_ptr().add(base));
            let v1 = _mm256_loadu_ps(vals.as_ptr().add(base + 8));
            acc0 = _mm256_fmadd_ps(v0, x0, acc0);
            acc1 = _mm256_fmadd_ps(v1, x1, acc1);
        }

        // Horizontal sum
        let sum = _mm256_add_ps(acc0, acc1);
        let hi = _mm256_extractf128_ps::<1>(sum);
        let lo = _mm256_castps256_ps128(sum);
        let sum128 = _mm_add_ps(hi, lo);
        let shuf = _mm_movehdup_ps(sum128);
        let sums = _mm_add_ps(sum128, shuf);
        let shuf2 = _mm_movehl_ps(sums, sums);
        let result = _mm_add_ss(sums, shuf2);

        let mut total = _mm_cvtss_f32(result);

        // Scalar remainder
        for j in (chunks * 16)..n {
            total += vals[j] * x[cols[j] as usize];
        }
        total
    }

    /// NEON SpMV with 4x unrolling for ARM64
    #[cfg(target_arch = "aarch64")]
    unsafe fn spmv_neon_unrolled(vals: &[f32], cols: &[u32], x: &[f32]) -> f32 {
        use std::arch::aarch64::*;
        let mut acc0 = vdupq_n_f32(0.0);
        let mut acc1 = vdupq_n_f32(0.0);
        let mut acc2 = vdupq_n_f32(0.0);
        let mut acc3 = vdupq_n_f32(0.0);
        let n = vals.len();
        let chunks = n / 16;

        for i in 0..chunks {
            let base = i * 16;
            // Manual gather for NEON (no hardware gather instruction)
            let mut xbuf = [0.0f32; 16];
            for k in 0..16 {
                xbuf[k] = *x.get_unchecked(cols[base + k] as usize);
            }
            let v0 = vld1q_f32(vals.as_ptr().add(base));
            let v1 = vld1q_f32(vals.as_ptr().add(base + 4));
            let v2 = vld1q_f32(vals.as_ptr().add(base + 8));
            let v3 = vld1q_f32(vals.as_ptr().add(base + 12));
            let x0 = vld1q_f32(xbuf.as_ptr());
            let x1 = vld1q_f32(xbuf.as_ptr().add(4));
            let x2 = vld1q_f32(xbuf.as_ptr().add(8));
            let x3 = vld1q_f32(xbuf.as_ptr().add(12));
            acc0 = vfmaq_f32(acc0, v0, x0);
            acc1 = vfmaq_f32(acc1, v1, x1);
            acc2 = vfmaq_f32(acc2, v2, x2);
            acc3 = vfmaq_f32(acc3, v3, x3);
        }

        let sum01 = vaddq_f32(acc0, acc1);
        let sum23 = vaddq_f32(acc2, acc3);
        let sum = vaddq_f32(sum01, sum23);
        let mut total = vaddvq_f32(sum);

        for j in (chunks * 16)..n {
            total += vals[j] * x[cols[j] as usize];
        }
        total
    }
}

3. Conditional Compilation Architecture

// Parallelism: Rayon on native, single-threaded on WASM
#[cfg(all(feature = "parallel", not(target_arch = "wasm32")))]
fn batch_solve_parallel(problems: &[SparseSystem]) -> Vec<SolverResult> {
    use rayon::prelude::*;
    problems.par_iter().map(|p| solve_single(p)).collect()
}

#[cfg(any(not(feature = "parallel"), target_arch = "wasm32"))]
fn batch_solve_parallel(problems: &[SparseSystem]) -> Vec<SolverResult> {
    problems.iter().map(|p| solve_single(p)).collect()
}

// Random number generation
#[cfg(not(target_arch = "wasm32"))]
fn random_seed() -> u64 {
    use std::time::SystemTime;
    SystemTime::now().duration_since(SystemTime::UNIX_EPOCH)
        .unwrap().as_nanos() as u64
}

#[cfg(target_arch = "wasm32")]
fn random_seed() -> u64 {
    let mut buf = [0u8; 8];
    getrandom::getrandom(&mut buf).expect("getrandom failed");
    u64::from_le_bytes(buf)
}

4. WASM-Specific Patterns

Web Worker Pool (JavaScript side):

// Following existing ruvector-wasm/src/worker-pool.js pattern
class SolverWorkerPool {
    constructor(numWorkers = navigator.hardwareConcurrency || 4) {
        this.workers = [];
        this.queue = [];
        for (let i = 0; i < numWorkers; i++) {
            const worker = new Worker(new URL('./solver-worker.js', import.meta.url));
            worker.onmessage = (e) => this._onResult(i, e.data);
            this.workers.push({ worker, busy: false });
        }
    }

    async solve(config) {
        return new Promise((resolve, reject) => {
            const free = this.workers.find(w => !w.busy);
            if (free) {
                free.busy = true;
                free.worker.postMessage({
                    type: 'solve',
                    config,
                    // Transfer ArrayBuffer for zero-copy
                    matrix: config.matrix
                }, [config.matrix.buffer]);
                free.resolve = resolve;
                free.reject = reject;
            } else {
                this.queue.push({ config, resolve, reject });
            }
        });
    }
}

SharedArrayBuffer (when COOP/COEP available):

// Check for cross-origin isolation
if (typeof SharedArrayBuffer !== 'undefined') {
    // Zero-copy shared matrix between main thread and workers
    const shared = new SharedArrayBuffer(matrix.byteLength);
    new Float32Array(shared).set(matrix);
    // Workers can read directly without transfer
    workers.forEach(w => w.postMessage({ type: 'set_matrix', buffer: shared }));
}

IndexedDB for Persistence:

// Cache solver preprocessing results (TRUE sparsifier, etc.)
class SolverCache {
    async store(key, sparsifier) {
        const db = await this._openDB();
        const tx = db.transaction('cache', 'readwrite');
        await tx.objectStore('cache').put({
            key,
            data: sparsifier.buffer,
            timestamp: Date.now()
        });
    }

    async load(key) {
        const db = await this._openDB();
        const tx = db.transaction('cache', 'readonly');
        return tx.objectStore('cache').get(key);
    }
}

5. Build Pipeline

# WASM build (production)
cd crates/ruvector-solver-wasm
wasm-pack build --target web --release
wasm-opt -O3 -o pkg/ruvector_solver_wasm_bg_opt.wasm pkg/ruvector_solver_wasm_bg.wasm
mv pkg/ruvector_solver_wasm_bg_opt.wasm pkg/ruvector_solver_wasm_bg.wasm

# WASM build with SIMD128
RUSTFLAGS="-C target-feature=+simd128" wasm-pack build --target web --release

# Node.js build
cd crates/ruvector-solver-node
npm run build  # napi build --release

# Multi-platform CI
cargo build --release --target x86_64-unknown-linux-gnu
cargo build --release --target aarch64-apple-darwin
cargo build --release --target wasm32-unknown-unknown

6. WASM Bundle Size Budget

Component Estimated Size (gzipped) Budget
Solver core (CG + Neumann + Push) ~80 KB 100 KB
SIMD128 kernels ~15 KB 20 KB
wasm-bindgen glue ~10 KB 15 KB
serde-wasm-bindgen ~20 KB 25 KB
Total ~125 KB 160 KB

Optimization: Use opt-level = "s" and wasm-opt -Oz for size-constrained deployments.


Consequences

Positive

  1. Universal deployment: Same solver logic runs on all 5 platforms
  2. Platform-optimized: Each target gets architecture-specific SIMD kernels
  3. Minimal overhead: WASM binary < 160 KB gzipped
  4. Web Worker parallelism: Browser gets multi-threaded solver via worker pool
  5. SharedArrayBuffer: Zero-copy where cross-origin isolation available
  6. Proven pattern: Follows RuVector's established Core-Binding-Surface architecture

Negative

  1. WASM algorithm subset: TRUE and BMSSP excluded from browser target (preprocessing cost)
  2. SIMD gap: WASM SIMD128 is 2-4x slower than AVX2 for equivalent operations
  3. No WASM threads: Web Workers add message-passing overhead vs native threads
  4. Gather limitation: NEON and WASM lack hardware gather; manual gather adds latency

Neutral

  1. nalgebra compiles to WASM with default-features = false — no code changes needed
  2. WASM SIMD128 support is universal in modern browsers (Chrome 91+, Firefox 89+, Safari 16.4+)

Implementation Status

WASM bindings complete via wasm-bindgen in ruvector-solver-wasm crate. All 7 algorithms exposed to JavaScript. TypedArray zero-copy for matrix data. Feature-gated compilation (wasm feature). Scalar SpMV fallback when SIMD unavailable. 32-bit index support for wasm32 memory model.


References