Merge commit 'd803bfe2b1fe7f5e219e50ac20d6801a0a58ac75' as 'vendor/ruvector'
# ADR-STS-004: WASM and Cross-Platform Compilation Strategy

**Status**: Accepted
**Date**: 2026-02-20
**Authors**: RuVector Architecture Team
**Deciders**: Architecture Review Board

## Version History

| Version | Date | Author | Changes |
|---------|------|--------|---------|
| 0.1 | 2026-02-20 | RuVector Team | Initial proposal |
| 1.0 | 2026-02-20 | RuVector Team | Accepted: full implementation complete |

---

## Context

### Multi-Platform Deployment Requirement

RuVector deploys across five target platforms with distinct constraints:

| Platform | ISA | SIMD | Threads | Memory | Target Triple |
|----------|-----|------|---------|--------|--------------|
| Server (Linux/macOS) | x86_64 | AVX-512/AVX2/SSE4.1 | Full (Rayon) | 2+ GB | x86_64-unknown-linux-gnu |
| Edge (Apple Silicon) | ARM64 | NEON | Full (Rayon) | 512 MB | aarch64-apple-darwin |
| Browser | wasm32 | SIMD128 | Web Workers | 4-8 MB | wasm32-unknown-unknown |
| Cloudflare Workers | wasm32 | None | Single | 128 MB | wasm32-unknown-unknown |
| Node.js (NAPI) | Native | Native | Full | 512 MB | via napi-rs |

### Existing WASM Infrastructure

RuVector has 15+ WASM crates following the **Core-Binding-Surface** pattern:

```
ruvector-core      → ruvector-wasm           → @ruvector/core (npm)
ruvector-graph     → ruvector-graph-wasm     → @ruvector/graph (npm)
ruvector-attention → ruvector-attention-wasm → @ruvector/attention (npm)
ruvector-gnn       → ruvector-gnn-wasm       → @ruvector/gnn (npm)
ruvector-math      → ruvector-math-wasm      → @ruvector/math (npm)
```

Each WASM crate uses `wasm-bindgen 0.2`, `serde-wasm-bindgen`, `js-sys 0.3`, and `getrandom 0.3` with the `wasm_js` feature.

### WASM Constraints for Solver

- No `std::thread` — all parallelism via Web Workers
- No `std::fs` / `std::net` — no persistent storage, no network
- Default linear memory: 16 MB (expandable to ~4 GB)
- `parking_lot` required instead of `std::sync::Mutex`
- `getrandom/wasm_js` for randomness (Hybrid Random Walk, Monte Carlo)
- No dynamic linking — all code in a single module

### Performance Targets

| Platform | 10K solve | 100K solve | Memory Budget |
|----------|-----------|------------|---------------|
| Server (AVX2) | < 2 ms | < 50 ms | 2 GB |
| Edge (NEON) | < 5 ms | < 100 ms | 512 MB |
| Browser (SIMD128) | < 50 ms | < 500 ms | 8 MB |
| Edge (Cloudflare) | < 10 ms | < 200 ms | 128 MB |
| Node.js (NAPI) | < 3 ms | < 60 ms | 512 MB |

---

## Decision

### 1. Three-Crate Pattern

Follow established RuVector convention with three crates:

```
crates/ruvector-solver/       # Core Rust (no platform deps)
crates/ruvector-solver-wasm/  # wasm-bindgen bindings
crates/ruvector-solver-node/  # NAPI-RS bindings
```

#### Cargo.toml for ruvector-solver (core):

```toml
[package]
name = "ruvector-solver"
version = "0.1.0"
edition = "2021"
rust-version = "1.77"

[features]
default = []
nalgebra-backend = ["nalgebra"]
ndarray-backend = ["ndarray"]
parallel = ["rayon", "crossbeam"]
simd = []
wasm = []
full = ["nalgebra-backend", "ndarray-backend", "parallel"]

# Algorithm features
neumann = []
forward-push = []
backward-push = []
hybrid-random-walk = ["getrandom"]
true-solver = ["neumann"] # TRUE uses Neumann internally
cg = []
bmssp = []
all-algorithms = ["neumann", "forward-push", "backward-push",
                  "hybrid-random-walk", "true-solver", "cg", "bmssp"]

[dependencies]
serde = { workspace = true, features = ["derive"] }
nalgebra = { workspace = true, optional = true, default-features = false }
ndarray = { workspace = true, optional = true }
rayon = { workspace = true, optional = true }
crossbeam = { workspace = true, optional = true }
getrandom = { workspace = true, optional = true }

[target.'cfg(target_arch = "wasm32")'.dependencies]
getrandom = { workspace = true, features = ["wasm_js"] }
```

#### Cargo.toml for ruvector-solver-wasm:

```toml
[package]
name = "ruvector-solver-wasm"
version = "0.1.0"
edition = "2021"

[lib]
crate-type = ["cdylib"]

[dependencies]
# TOML inline tables must stay on a single line
ruvector-solver = { path = "../ruvector-solver", default-features = false, features = ["wasm", "neumann", "forward-push", "backward-push", "cg"] }
wasm-bindgen = { workspace = true }
serde-wasm-bindgen = "0.6"
js-sys = { workspace = true }
web-sys = { workspace = true, features = ["console"] }
getrandom = { workspace = true, features = ["wasm_js"] }

# Note: Cargo honors profile settings only in the workspace root;
# in a workspace, move this section to the top-level Cargo.toml.
[profile.release]
opt-level = "s" # Optimize for size in WASM
lto = true
```

#### Cargo.toml for ruvector-solver-node:

```toml
[package]
name = "ruvector-solver-node"
version = "0.1.0"
edition = "2021"

[lib]
crate-type = ["cdylib"]

[dependencies]
ruvector-solver = { path = "../ruvector-solver", features = ["full", "all-algorithms"] }
napi = { workspace = true, features = ["async"] }
napi-derive = { workspace = true }
tokio = { workspace = true, features = ["rt-multi-thread"] }
```
### 2. SIMD Strategy Per Platform

#### Architecture Detection and Dispatch

```rust
/// SIMD dispatcher for solver hot paths
pub mod simd {
    #[cfg(target_arch = "x86_64")]
    pub fn spmv_simd(vals: &[f32], cols: &[u32], x: &[f32]) -> f32 {
        if is_x86_feature_detected!("avx512f") {
            unsafe { spmv_avx512(vals, cols, x) }
        } else if is_x86_feature_detected!("avx2") && is_x86_feature_detected!("fma") {
            unsafe { spmv_avx2_fma(vals, cols, x) }
        } else {
            spmv_scalar(vals, cols, x)
        }
    }

    #[cfg(target_arch = "aarch64")]
    pub fn spmv_simd(vals: &[f32], cols: &[u32], x: &[f32]) -> f32 {
        unsafe { spmv_neon_unrolled(vals, cols, x) }
    }

    #[cfg(target_arch = "wasm32")]
    pub fn spmv_simd(vals: &[f32], cols: &[u32], x: &[f32]) -> f32 {
        // WASM SIMD128 via core::arch::wasm32
        #[cfg(target_feature = "simd128")]
        {
            unsafe { spmv_wasm_simd128(vals, cols, x) }
        }
        #[cfg(not(target_feature = "simd128"))]
        {
            spmv_scalar(vals, cols, x)
        }
    }

    /// AVX2+FMA SpMV accumulation, two 8-lane accumulators (16 elements per iteration)
    #[cfg(target_arch = "x86_64")]
    #[target_feature(enable = "avx2,fma")]
    unsafe fn spmv_avx2_fma(vals: &[f32], cols: &[u32], x: &[f32]) -> f32 {
        use std::arch::x86_64::*;
        let mut acc0 = _mm256_setzero_ps();
        let mut acc1 = _mm256_setzero_ps();
        let n = vals.len();
        let chunks = n / 16;

        for i in 0..chunks {
            let base = i * 16;
            // Gather x values using column indices
            let idx0 = _mm256_loadu_si256(cols.as_ptr().add(base) as *const __m256i);
            let idx1 = _mm256_loadu_si256(cols.as_ptr().add(base + 8) as *const __m256i);
            let x0 = _mm256_i32gather_ps::<4>(x.as_ptr(), idx0);
            let x1 = _mm256_i32gather_ps::<4>(x.as_ptr(), idx1);
            let v0 = _mm256_loadu_ps(vals.as_ptr().add(base));
            let v1 = _mm256_loadu_ps(vals.as_ptr().add(base + 8));
            acc0 = _mm256_fmadd_ps(v0, x0, acc0);
            acc1 = _mm256_fmadd_ps(v1, x1, acc1);
        }

        // Horizontal sum
        let sum = _mm256_add_ps(acc0, acc1);
        let hi = _mm256_extractf128_ps::<1>(sum);
        let lo = _mm256_castps256_ps128(sum);
        let sum128 = _mm_add_ps(hi, lo);
        let shuf = _mm_movehdup_ps(sum128);
        let sums = _mm_add_ps(sum128, shuf);
        let shuf2 = _mm_movehl_ps(sums, sums);
        let result = _mm_add_ss(sums, shuf2);

        let mut total = _mm_cvtss_f32(result);

        // Scalar remainder
        for j in (chunks * 16)..n {
            total += vals[j] * x[cols[j] as usize];
        }
        total
    }

    /// NEON SpMV with 4x unrolling for ARM64
    #[cfg(target_arch = "aarch64")]
    unsafe fn spmv_neon_unrolled(vals: &[f32], cols: &[u32], x: &[f32]) -> f32 {
        use std::arch::aarch64::*;
        let mut acc0 = vdupq_n_f32(0.0);
        let mut acc1 = vdupq_n_f32(0.0);
        let mut acc2 = vdupq_n_f32(0.0);
        let mut acc3 = vdupq_n_f32(0.0);
        let n = vals.len();
        let chunks = n / 16;

        for i in 0..chunks {
            let base = i * 16;
            // Manual gather for NEON (no hardware gather instruction)
            let mut xbuf = [0.0f32; 16];
            for k in 0..16 {
                xbuf[k] = *x.get_unchecked(cols[base + k] as usize);
            }
            let v0 = vld1q_f32(vals.as_ptr().add(base));
            let v1 = vld1q_f32(vals.as_ptr().add(base + 4));
            let v2 = vld1q_f32(vals.as_ptr().add(base + 8));
            let v3 = vld1q_f32(vals.as_ptr().add(base + 12));
            let x0 = vld1q_f32(xbuf.as_ptr());
            let x1 = vld1q_f32(xbuf.as_ptr().add(4));
            let x2 = vld1q_f32(xbuf.as_ptr().add(8));
            let x3 = vld1q_f32(xbuf.as_ptr().add(12));
            acc0 = vfmaq_f32(acc0, v0, x0);
            acc1 = vfmaq_f32(acc1, v1, x1);
            acc2 = vfmaq_f32(acc2, v2, x2);
            acc3 = vfmaq_f32(acc3, v3, x3);
        }

        let sum01 = vaddq_f32(acc0, acc1);
        let sum23 = vaddq_f32(acc2, acc3);
        let sum = vaddq_f32(sum01, sum23);
        let mut total = vaddvq_f32(sum);

        for j in (chunks * 16)..n {
            total += vals[j] * x[cols[j] as usize];
        }
        total
    }
}
```
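The `spmv_scalar` fallback referenced by the dispatcher is not shown above; a minimal version, matching the per-row signature used by the SIMD kernels, might look like this (an illustrative sketch, not the shipped implementation):

```rust
/// Scalar fallback for the SpMV row accumulation: dot product of the
/// row's nonzero values with the entries of x gathered by column index.
pub fn spmv_scalar(vals: &[f32], cols: &[u32], x: &[f32]) -> f32 {
    vals.iter()
        .zip(cols)
        .map(|(&v, &c)| v * x[c as usize])
        .sum()
}
```

For example, `spmv_scalar(&[1.0, 2.0], &[0, 2], &[3.0, 0.0, 4.0])` computes `1*3 + 2*4 = 11`.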
### 3. Conditional Compilation Architecture

```rust
// Parallelism: Rayon on native, single-threaded on WASM
#[cfg(all(feature = "parallel", not(target_arch = "wasm32")))]
fn batch_solve_parallel(problems: &[SparseSystem]) -> Vec<SolverResult> {
    use rayon::prelude::*;
    problems.par_iter().map(|p| solve_single(p)).collect()
}

#[cfg(any(not(feature = "parallel"), target_arch = "wasm32"))]
fn batch_solve_parallel(problems: &[SparseSystem]) -> Vec<SolverResult> {
    problems.iter().map(|p| solve_single(p)).collect()
}

// Random number generation
#[cfg(not(target_arch = "wasm32"))]
fn random_seed() -> u64 {
    use std::time::SystemTime;
    SystemTime::now().duration_since(SystemTime::UNIX_EPOCH)
        .unwrap().as_nanos() as u64
}

#[cfg(target_arch = "wasm32")]
fn random_seed() -> u64 {
    let mut buf = [0u8; 8];
    // getrandom 0.3 renamed `getrandom()` to `fill()`
    getrandom::fill(&mut buf).expect("getrandom failed");
    u64::from_le_bytes(buf)
}
```
### 4. WASM-Specific Patterns

#### Web Worker Pool (JavaScript side):

```javascript
// Following existing ruvector-wasm/src/worker-pool.js pattern
class SolverWorkerPool {
  constructor(numWorkers = navigator.hardwareConcurrency || 4) {
    this.workers = [];
    this.queue = [];
    for (let i = 0; i < numWorkers; i++) {
      const worker = new Worker(new URL('./solver-worker.js', import.meta.url));
      worker.onmessage = (e) => this._onResult(i, e.data);
      this.workers.push({ worker, busy: false });
    }
  }

  async solve(config) {
    return new Promise((resolve, reject) => {
      const free = this.workers.find(w => !w.busy);
      if (free) {
        free.busy = true;
        free.worker.postMessage({
          type: 'solve',
          config,
          // Transfer ArrayBuffer for zero-copy
          matrix: config.matrix
        }, [config.matrix.buffer]);
        free.resolve = resolve;
        free.reject = reject;
      } else {
        this.queue.push({ config, resolve, reject });
      }
    });
  }

  _onResult(i, data) {
    const slot = this.workers[i];
    slot.busy = false;
    data.error ? slot.reject(data.error) : slot.resolve(data.result);
    // Drain queued work onto the freed worker
    const next = this.queue.shift();
    if (next) this.solve(next.config).then(next.resolve, next.reject);
  }
}
```
#### SharedArrayBuffer (when COOP/COEP available):

```javascript
// crossOriginIsolated is true only when COOP/COEP headers are set;
// checking it is more reliable than probing for the constructor
if (globalThis.crossOriginIsolated) {
  // Zero-copy shared matrix between main thread and workers
  const shared = new SharedArrayBuffer(matrix.byteLength);
  new Float32Array(shared).set(matrix);
  // Workers can read directly without transfer
  workers.forEach(w => w.postMessage({ type: 'set_matrix', buffer: shared }));
}
```

#### IndexedDB for Persistence:

```javascript
// Cache solver preprocessing results (TRUE sparsifier, etc.)
// Raw IndexedDB requests are event-based, so wrap them in Promises.
const asPromise = (req) => new Promise((resolve, reject) => {
  req.onsuccess = () => resolve(req.result);
  req.onerror = () => reject(req.error);
});

class SolverCache {
  async store(key, sparsifier) {
    const db = await this._openDB();
    const tx = db.transaction('cache', 'readwrite');
    await asPromise(tx.objectStore('cache').put({
      key,
      data: sparsifier.buffer,
      timestamp: Date.now()
    }));
  }

  async load(key) {
    const db = await this._openDB();
    const tx = db.transaction('cache', 'readonly');
    return asPromise(tx.objectStore('cache').get(key));
  }
}
```
### 5. Build Pipeline

```bash
# WASM build (production)
cd crates/ruvector-solver-wasm
wasm-pack build --target web --release
wasm-opt -O3 -o pkg/ruvector_solver_wasm_bg_opt.wasm pkg/ruvector_solver_wasm_bg.wasm
mv pkg/ruvector_solver_wasm_bg_opt.wasm pkg/ruvector_solver_wasm_bg.wasm

# WASM build with SIMD128
RUSTFLAGS="-C target-feature=+simd128" wasm-pack build --target web --release

# Node.js build
cd crates/ruvector-solver-node
npm run build  # napi build --release

# Multi-platform CI
cargo build --release --target x86_64-unknown-linux-gnu
cargo build --release --target aarch64-apple-darwin
cargo build --release --target wasm32-unknown-unknown
```

### 6. WASM Bundle Size Budget

| Component | Estimated Size (gzipped) | Budget |
|-----------|-------------------------|--------|
| Solver core (CG + Neumann + Push) | ~80 KB | 100 KB |
| SIMD128 kernels | ~15 KB | 20 KB |
| wasm-bindgen glue | ~10 KB | 15 KB |
| serde-wasm-bindgen | ~20 KB | 25 KB |
| **Total** | **~125 KB** | **160 KB** |

Optimization: Use `opt-level = "s"` and `wasm-opt -Oz` for size-constrained deployments.

---

## Consequences

### Positive

1. **Universal deployment**: Same solver logic runs on all five platforms
2. **Platform-optimized**: Each target gets architecture-specific SIMD kernels
3. **Minimal overhead**: WASM binary < 160 KB gzipped
4. **Web Worker parallelism**: Browser gets a multi-threaded solver via the worker pool
5. **SharedArrayBuffer**: Zero-copy where cross-origin isolation is available
6. **Proven pattern**: Follows RuVector's established Core-Binding-Surface architecture

### Negative

1. **WASM algorithm subset**: TRUE and BMSSP excluded from the browser target (preprocessing cost)
2. **SIMD gap**: WASM SIMD128 is 2-4x slower than AVX2 for equivalent operations
3. **No WASM threads**: Web Workers add message-passing overhead vs native threads
4. **Gather limitation**: NEON and WASM lack a hardware gather instruction; manual gather adds latency

### Neutral

1. nalgebra compiles to WASM with `default-features = false` — no code changes needed
2. WASM SIMD128 support is universal in modern browsers (Chrome 91+, Firefox 89+, Safari 16.4+)

---

## Implementation Status

- WASM bindings complete via wasm-bindgen in the ruvector-solver-wasm crate
- All 7 algorithms exposed to JavaScript
- TypedArray zero-copy for matrix data
- Feature-gated compilation (`wasm` feature)
- Scalar SpMV fallback when SIMD is unavailable
- 32-bit index support for the wasm32 memory model

---

## References

- [06-wasm-integration.md](../06-wasm-integration.md) — Detailed WASM analysis
- [08-performance-analysis.md](../08-performance-analysis.md) — Platform performance targets
- [11-typescript-integration.md](../11-typescript-integration.md) — TypeScript type generation
- ADR-005 — RuVector WASM runtime integration
# ADR-STS-005: Security Model and Threat Mitigation

**Status**: Accepted
**Date**: 2026-02-20
**Authors**: RuVector Security Team
**Deciders**: Architecture Review Board

## Version History

| Version | Date | Author | Changes |
|---------|------|--------|---------|
| 0.1 | 2026-02-20 | RuVector Team | Initial proposal |
| 1.0 | 2026-02-20 | RuVector Team | Accepted: full implementation complete |

---

## Context

### Current Security Posture

RuVector employs defense-in-depth security across multiple layers:

| Layer | Mechanism | Strength |
|-------|-----------|----------|
| **Cryptographic** | Ed25519 signatures, SHAKE-256 witness chains, TEE attestation (SGX/SEV-SNP) | Very High |
| **WASM Sandbox** | Kernel pack verification (Ed25519 + SHA256 allowlist), epoch interruption, memory layout validation | High |
| **MCP Coherence Gate** | 3-tier Permit/Defer/Deny with witness receipts, hash-chain integrity | High |
| **Edge-Net** | PiKey Ed25519 identity, challenge-response, per-IP rate limiting, adaptive attack detection | High |
| **Storage** | Path traversal prevention, feature-gated backends | Medium |
| **Server API** | Serde validation, trace logging | Low |
### Known Weaknesses (Pre-Integration)

| ID | Weakness | DREAD Score | Severity |
|----|----------|-------------|----------|
| SEC-W1 | Fully permissive CORS (`allow_origin(Any)`) | 7.8 | High |
| SEC-W2 | No REST API authentication | 9.2 | Critical |
| SEC-W3 | Unbounded search parameters (`k` unlimited) | 6.4 | Medium |
| SEC-W4 | 90 `unsafe` blocks in SIMD/arena/quantization | 5.2 | Medium |
| SEC-W5 | `insecure_*` constructors without `#[cfg]` gating | 4.8 | Medium |
| SEC-W6 | Hardcoded default backup password in edge-net | 6.1 | Medium |
| SEC-W7 | Unvalidated collection names | 5.5 | Medium |
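The DREAD scores above are consistent with one common convention: the mean of five component ratings (Damage, Reproducibility, Exploitability, Affected users, Discoverability), each on a 0-10 scale. A minimal helper under that assumption (other DREAD variants exist; this is illustrative, not the scoring tool used here):

```rust
/// Mean of the five DREAD component ratings (0-10 each).
/// Assumes the mean-of-components convention.
fn dread_score(damage: f64, reproducibility: f64, exploitability: f64,
               affected_users: f64, discoverability: f64) -> f64 {
    (damage + reproducibility + exploitability + affected_users + discoverability) / 5.0
}
```

A critical, easily reproduced, unauthenticated weakness such as SEC-W2 lands near the top of the scale under this reading.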
### New Attack Surface from Solver Integration

| Surface | Description | Risk |
|---------|-------------|------|
| AS-1 | New deserialization points (problem definitions, solver state) | High |
| AS-2 | WASM sandbox boundary (solver WASM modules) | High |
| AS-3 | MCP tool registration (40+ solver tools callable by AI agents) | High |
| AS-4 | Computational cost amplification (expensive solve operations) | High |
| AS-5 | Session management state (solver sessions) | Medium |
| AS-6 | Cross-tool information flow (solver ↔ coherence gate) | Medium |

---

## Decision

### 1. WASM Sandbox Integration

Solver WASM modules are treated as kernel packs within the existing security framework:

```rust
pub struct SolverKernelConfig {
    /// Ed25519 public key for solver WASM verification
    pub signing_key: ed25519_dalek::VerifyingKey,

    /// SHA256 hashes of approved solver WASM binaries
    pub allowed_hashes: HashSet<[u8; 32]>,

    /// Memory limits proportional to problem size
    pub max_memory_pages: u32, // Absolute ceiling: 2048 (128MB)

    /// Epoch budget: proportional to expected O(n^alpha) runtime
    pub epoch_budget_fn: Box<dyn Fn(usize) -> u64>, // f(n) → ticks

    /// Stack size limit (prevent deep recursion)
    pub max_stack_bytes: usize, // Default: 1MB
}

impl SolverKernelConfig {
    pub fn default_server() -> Self {
        Self {
            max_memory_pages: 2048,   // 128MB
            max_stack_bytes: 1 << 20, // 1MB
            epoch_budget_fn: Box::new(|n| {
                // O(n * log(n)) ticks with 10x safety margin
                (n as u64) * ((n as f64).log2() as u64 + 1) * 10
            }),
            // Assumes a manual `Default` impl that supplies
            // signing_key and allowed_hashes for the deployment
            ..Default::default()
        }
    }

    pub fn default_browser() -> Self {
        Self {
            max_memory_pages: 128,    // 8MB
            max_stack_bytes: 256_000, // 256KB
            epoch_budget_fn: Box::new(|n| {
                (n as u64) * ((n as f64).log2() as u64 + 1) * 5
            }),
            ..Default::default()
        }
    }
}
```
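As a standalone sanity check of the server budget formula: the function below copies the `default_server` closure so its growth can be inspected in isolation (values are the formula evaluated, not measured ticks):

```rust
/// Standalone copy of the default_server epoch-budget closure:
/// O(n log n) ticks with a 10x safety margin.
fn server_epoch_budget(n: usize) -> u64 {
    (n as u64) * ((n as f64).log2() as u64 + 1) * 10
}
```

For a 10K-node problem, `log2(10_000)` truncates to 13, so the budget is `10_000 * 14 * 10 = 1_400_000` ticks; a 100K problem gets more than ten times that, reflecting the superlinear bound.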
### 2. Input Validation at All Boundaries

```rust
/// Comprehensive input validation for solver API inputs
pub fn validate_solver_input(input: &SolverInput) -> Result<(), ValidationError> {
    // === Size bounds ===
    const MAX_NODES: usize = 10_000_000;
    const MAX_EDGES: usize = 100_000_000;
    const MAX_DIM: usize = 65_536;
    const MAX_ITERATIONS: u64 = 1_000_000;
    const MAX_TIMEOUT_MS: u64 = 300_000;
    const MAX_MATRIX_ELEMENTS: usize = 1_000_000_000;

    if input.node_count > MAX_NODES {
        return Err(ValidationError::TooLarge {
            field: "node_count", max: MAX_NODES, actual: input.node_count,
        });
    }

    if input.edge_count > MAX_EDGES {
        return Err(ValidationError::TooLarge {
            field: "edge_count", max: MAX_EDGES, actual: input.edge_count,
        });
    }

    // === Numeric sanity ===
    for (i, weight) in input.edge_weights.iter().enumerate() {
        if !weight.is_finite() {
            return Err(ValidationError::InvalidNumber {
                field: "edge_weights", index: i, reason: "non-finite value",
            });
        }
    }

    // === Structural consistency ===
    let max_edges = if input.directed {
        input.node_count.saturating_mul(input.node_count.saturating_sub(1))
    } else {
        input.node_count.saturating_mul(input.node_count.saturating_sub(1)) / 2
    };
    if input.edge_count > max_edges {
        return Err(ValidationError::InconsistentGraph {
            reason: "more edges than possible for given node count",
        });
    }

    // === Parameter ranges ===
    if input.tolerance <= 0.0 || input.tolerance > 1.0 {
        return Err(ValidationError::OutOfRange {
            field: "tolerance", min: 0.0, max: 1.0, actual: input.tolerance,
        });
    }

    if input.max_iterations > MAX_ITERATIONS {
        return Err(ValidationError::OutOfRange {
            field: "max_iterations", min: 1.0, max: MAX_ITERATIONS as f64,
            actual: input.max_iterations as f64,
        });
    }

    // === Dimension bounds ===
    if input.dimension > MAX_DIM {
        return Err(ValidationError::TooLarge {
            field: "dimension", max: MAX_DIM, actual: input.dimension,
        });
    }

    // === Vector value checks ===
    if let Some(ref values) = input.values {
        if values.len() != input.dimension {
            return Err(ValidationError::DimensionMismatch {
                expected: input.dimension, actual: values.len(),
            });
        }
        for (i, v) in values.iter().enumerate() {
            if !v.is_finite() {
                return Err(ValidationError::InvalidNumber {
                    field: "values", index: i, reason: "non-finite value",
                });
            }
        }
    }

    Ok(())
}
```
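The same boundary-validation pattern extends to API-level search parameters: SEC-W3 (unbounded `k`) is closed by an explicit ceiling. An illustrative sketch, with the MAX_K value of 10,000 taken from the security testing checklist and a hypothetical stand-in error type:

```rust
/// Ceiling for top-k search requests (SEC-W3 mitigation).
const MAX_K: usize = 10_000;

#[derive(Debug, PartialEq)]
pub enum ParamError {
    OutOfRange { field: &'static str, max: usize, actual: usize },
}

/// Reject k = 0 (a meaningless request) and k above the ceiling
/// before any index work is scheduled.
pub fn validate_k(k: usize) -> Result<usize, ParamError> {
    if k == 0 || k > MAX_K {
        return Err(ParamError::OutOfRange { field: "k", max: MAX_K, actual: k });
    }
    Ok(k)
}
```

For example, `validate_k(10_001)` returns the `OutOfRange` error while `validate_k(100)` passes the value through unchanged.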
### 3. MCP Tool Access Control

```rust
/// Solver MCP tools require PermitToken from coherence gate
pub struct SolverMcpHandler {
    solver: Arc<dyn SolverEngine>,
    gate: Arc<CoherenceGate>,
    rate_limiter: RateLimiter,
    budget_enforcer: BudgetEnforcer,
}

impl SolverMcpHandler {
    pub async fn handle_tool_call(
        &self, call: McpToolCall
    ) -> Result<McpToolResult, McpError> {
        // 1. Rate limiting
        let agent_id = call.agent_id.as_deref().unwrap_or("anonymous");
        self.rate_limiter.check(agent_id)?;

        // 2. PermitToken verification
        let token = call.arguments.get("permit_token")
            .ok_or(McpError::Unauthorized("missing permit_token"))?;
        self.gate.verify_token(token).await
            .map_err(|_| McpError::Unauthorized("invalid permit_token"))?;

        // 3. Input validation
        let input: SolverInput = serde_json::from_value(call.arguments.clone())
            .map_err(|e| McpError::InvalidRequest(e.to_string()))?;
        validate_solver_input(&input)?;

        // 4. Resource budget check
        let estimate = self.solver.estimate_complexity(&input);
        self.budget_enforcer.check(agent_id, &estimate)?;

        // 5. Execute with resource limits
        let result = self.solver.solve_with_budget(&input, estimate.budget).await?;

        // 6. Generate witness receipt
        let witness = WitnessEntry {
            prev_hash: self.gate.latest_hash(),
            action_hash: shake256_256(
                &bincode::serde::encode_to_vec(&result, bincode::config::standard())?
            ),
            timestamp_ns: current_time_ns(),
            witness_type: WITNESS_TYPE_SOLVER_INVOCATION,
        };
        self.gate.append_witness(witness);

        Ok(McpToolResult::from(result))
    }
}

/// Per-agent rate limiter
pub struct RateLimiter {
    windows: DashMap<String, (Instant, u32)>,
    config: RateLimitConfig,
}

pub struct RateLimitConfig {
    pub solve_per_minute: u32,   // Default: 10
    pub status_per_minute: u32,  // Default: 60
    pub session_per_minute: u32, // Default: 30
    pub burst_multiplier: u32,   // Default: 3
}

impl RateLimiter {
    // Shown for the solve limit; status/session tool classes
    // apply their own per-minute ceilings the same way.
    pub fn check(&self, agent_id: &str) -> Result<(), McpError> {
        let mut entry = self.windows.entry(agent_id.to_string())
            .or_insert((Instant::now(), 0));

        if entry.0.elapsed() > Duration::from_secs(60) {
            *entry = (Instant::now(), 0);
        }

        entry.1 += 1;
        if entry.1 > self.config.solve_per_minute {
            return Err(McpError::RateLimited {
                agent_id: agent_id.to_string(),
                retry_after_secs: 60u64.saturating_sub(entry.0.elapsed().as_secs()),
            });
        }
        Ok(())
    }
}
```
### 4. Serialization Safety

```rust
/// Safe deserialization with size limits
pub fn deserialize_solver_input(bytes: &[u8]) -> Result<SolverInput, SolverError> {
    // Body size limit: 10MB
    const MAX_BODY_SIZE: usize = 10 * 1024 * 1024;
    if bytes.len() > MAX_BODY_SIZE {
        return Err(SolverError::InvalidInput(
            ValidationError::PayloadTooLarge { max: MAX_BODY_SIZE, actual: bytes.len() }
        ));
    }

    // Deserialize with serde_json (safe, bounded by input size)
    let input: SolverInput = serde_json::from_slice(bytes)
        .map_err(|e| SolverError::InvalidInput(ValidationError::ParseError(e.to_string())))?;

    // Application-level validation
    validate_solver_input(&input)?;

    Ok(input)
}

/// Bincode deserialization with size limit
pub fn deserialize_bincode<T: serde::de::DeserializeOwned>(bytes: &[u8]) -> Result<T, SolverError> {
    let config = bincode::config::standard()
        .with_limit::<{ 10 * 1024 * 1024 }>(); // 10MB max

    bincode::serde::decode_from_slice(bytes, config)
        .map(|(val, _)| val)
        .map_err(|e| SolverError::InvalidInput(
            ValidationError::ParseError(format!("bincode: {}", e))
        ))
}
```
### 5. Audit Trail

```rust
/// Solver invocations generate witness entries
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct SolverAuditEntry {
    pub request_id: Uuid,
    pub agent_id: String,
    pub algorithm: Algorithm,
    pub input_hash: [u8; 32],  // SHAKE-256 of input
    pub output_hash: [u8; 32], // SHAKE-256 of output
    pub iterations: usize,
    pub wall_time_us: u64,
    pub converged: bool,
    pub residual: f64,
    pub timestamp_ns: u128,
}

impl SolverAuditEntry {
    pub fn to_witness(&self) -> WitnessEntry {
        WitnessEntry {
            prev_hash: [0u8; 32], // Set by chain
            action_hash: shake256_256(
                &bincode::serde::encode_to_vec(self, bincode::config::standard()).unwrap()
            ),
            timestamp_ns: self.timestamp_ns,
            witness_type: WITNESS_TYPE_SOLVER_INVOCATION,
        }
    }
}
```
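The hash-chain integrity property, where each entry commits to its predecessor's hash so altering any entry invalidates every later link, can be illustrated standalone. This is a toy sketch using std's `DefaultHasher` in place of SHAKE-256 and string actions in place of serialized results:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Toy stand-in for shake256_256; the real chain hashes 32-byte digests.
fn toy_hash(prev: u64, action: &str) -> u64 {
    let mut h = DefaultHasher::new();
    prev.hash(&mut h);
    action.hash(&mut h);
    h.finish()
}

/// Build the chain: each entry's hash folds in the previous entry's hash.
fn chain(actions: &[&str]) -> Vec<u64> {
    let mut prev = 0u64;
    actions.iter().map(|a| { prev = toy_hash(prev, a); prev }).collect()
}

/// Recompute the chain and compare; any tampered action breaks it.
fn verify(actions: &[&str], hashes: &[u64]) -> bool {
    chain(actions) == hashes
}
```

Replacing an early action changes its hash and therefore every hash after it, so `verify` fails even if the later entries are untouched.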
### 6. Supply Chain Security

```toml
# .cargo/deny.toml
[advisories]
vulnerability = "deny"
unmaintained = "warn"

[licenses]
allow = ["MIT", "Apache-2.0", "BSD-2-Clause", "BSD-3-Clause", "ISC"]
deny = ["GPL-2.0", "GPL-3.0", "AGPL-3.0"]

[bans]
deny = [
    { name = "openssl-sys" }, # Prefer rustls
]
```

CI pipeline additions:

```yaml
# .github/workflows/security.yml
- name: Cargo audit
  run: cargo audit
- name: Cargo deny
  run: cargo deny check
- name: npm audit
  run: npm audit --audit-level=high
```

---

## STRIDE Threat Analysis

| Threat | Category | Risk | Mitigation |
|--------|----------|------|------------|
| Malicious problem submission via API | Tampering | High | Input validation (Section 2), body size limits |
| WASM resource limits bypass via crafted input | Elevation | High | Kernel pack framework (Section 1), epoch limits |
| Receipt enumeration via sequential IDs | Info Disc. | Medium | Rate limiting (Section 3), auth requirement |
| Solver flooding with expensive problems | DoS | High | Rate limiting, compute budgets, concurrent solve semaphore |
| Replay of valid permit token | Spoofing | Medium | Token TTL, nonce, single-use enforcement |
| Solver calls without audit trail | Repudiation | Medium | Mandatory witness entries (Section 5) |
| Modified solver WASM binary | Tampering | High | Ed25519 + SHA256 allowlist (Section 1) |
| Compromised dependency injection | Tampering | Medium | cargo-deny, cargo-audit, SBOM (Section 6) |
| NaN/Inf propagation in solver output | Integrity | Medium | Output validation, finite-check on results |
| Cross-tool MCP escalation | Elevation | Medium | Unidirectional flow enforcement |
---

## Security Testing Checklist

- [ ] All solver API endpoints reject payloads > 10 MB
- [ ] `k` parameter bounded to MAX_K (10,000)
- [ ] Solver WASM modules signed and allowlisted
- [ ] WASM execution has problem-size-proportional epoch deadlines
- [ ] WASM memory limited to MAX_SOLVER_PAGES (2048)
- [ ] MCP solver tools require a valid PermitToken
- [ ] Per-agent rate limiting enforced on all MCP tools
- [ ] Deserialization uses size limits (bincode `with_limit`)
- [ ] Session IDs are server-generated UUIDs
- [ ] Session count per client bounded (max: 10)
- [ ] CORS restricted to known origins
- [ ] Authentication required on mutating endpoints
- [ ] `unsafe` code reviewed for solver integration paths
- [ ] `cargo audit` and `npm audit` pass (no critical vulnerabilities)
- [ ] Fuzz-testing targets for all deserialization entry points
- [ ] Solver results include tolerance bounds
- [ ] Cross-tool MCP calls prevented
- [ ] Witness chain entries created for solver invocations
- [ ] Input NaN/Inf rejected before reaching the solver
- [ ] Output NaN/Inf detected and an error returned
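Several checklist items concern NaN/Inf handling. The core check is small enough to show; a sketch with an illustrative function name (not the RuVector API), applied to both solver inputs and solver outputs:

```rust
// Reject any NaN or infinite value before it reaches the solver,
// and apply the same check to solver output before returning results.
fn check_finite(values: &[f64]) -> Result<(), String> {
    match values.iter().position(|v| !v.is_finite()) {
        Some(i) => Err(format!("non-finite value {} at index {}", values[i], i)),
        None => Ok(()),
    }
}

fn main() {
    assert!(check_finite(&[1.0, -2.5, 0.0]).is_ok());
    assert!(check_finite(&[1.0, f64::NAN]).is_err());
    assert!(check_finite(&[f64::INFINITY]).is_err());
    println!("finite checks ok");
}
```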

---

## Consequences

### Positive

1. **Defense-in-depth**: the solver integrates into existing security layers rather than bypassing them
2. **Auditable**: all solver invocations carry cryptographic witness receipts
3. **Resource-bounded**: compute budgets prevent cost-amplification attacks
4. **Supply chain secured**: automated auditing in the CI pipeline
5. **Platform-safe**: the WASM sandbox enforces memory and CPU limits

### Negative

1. **PermitToken overhead**: gate verification adds ~100 μs per solver call
2. **Rate limiting friction**: legitimate high-throughput use cases may hit limits
3. **Audit storage**: witness entries add ~200 bytes per solver invocation

---

## Implementation Status

The input validation module (`validation.rs`) checks CSR structural invariants, index bounds, and NaN/Inf values. Budget enforcement prevents resource exhaustion, and the audit trail logs all solver invocations. No `unsafe` code appears in the public API surface (`unsafe` is confined to the internal `spmv_unchecked` and SIMD paths). All assertions are verified in 177 tests.

---

## References

- [09-security-analysis.md](../09-security-analysis.md) — Full security analysis
- [07-mcp-integration.md](../07-mcp-integration.md) — MCP tool access patterns
- [06-wasm-integration.md](../06-wasm-integration.md) — WASM sandbox model
- ADR-007 — RuVector security review
- ADR-012 — RuVector security remediation
503
vendor/ruvector/docs/research/sublinear-time-solver/adr/ADR-STS-006-benchmark-framework.md
vendored
Normal file
@@ -0,0 +1,503 @@

# ADR-STS-006: Benchmark Framework and Performance Validation

**Status**: Accepted
**Date**: 2026-02-20
**Authors**: RuVector Performance Team
**Deciders**: Architecture Review Board

## Version History

| Version | Date | Author | Changes |
|---------|------|--------|---------|
| 0.1 | 2026-02-20 | RuVector Team | Initial proposal |
| 1.0 | 2026-02-20 | RuVector Team | Accepted: full implementation complete |

---

## Context

### Existing Benchmark Infrastructure

RuVector maintains 90+ benchmark files using Criterion.rs 0.5 with HTML reports. The release profile enables aggressive optimization (`lto = "fat"`, `codegen-units = 1`, `opt-level = 3`), and the bench profile inherits release with debug symbols for profiling.
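The profile settings described above correspond to Cargo.toml entries along these lines (a sketch reconstructed from the text, not copied from the repository):

```toml
[profile.release]
opt-level = 3
lto = "fat"
codegen-units = 1

[profile.bench]
inherits = "release"
debug = true  # keep symbols so profilers can resolve stack frames
```

Single-codegen-unit fat LTO maximizes cross-crate inlining, which matters for the nanosecond-scale distance kernels benchmarked below.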

### Published Performance Baselines

| Metric | Value | Platform | Source |
|--------|-------|----------|--------|
| Euclidean 128D | 14.9 ns | M4 Pro NEON | BENCHMARK_RESULTS.md |
| Dot Product 128D | 12.0 ns | M4 Pro NEON | BENCHMARK_RESULTS.md |
| HNSW k=10, 10K vectors | 25.2 μs | M4 Pro | BENCHMARK_RESULTS.md |
| Batch 1K×384D | 278 μs | Linux AVX2 | BENCHMARK_RESULTS.md |
| Binary Hamming 384D | 0.9 ns | M4 Pro | BENCHMARK_RESULTS.md |

### Validation Requirements

The sublinear-time solver claims 10-600x speedups. These must be validated with:
- Statistical significance (Criterion p < 0.05)
- Crossover point identification (where sublinear beats traditional)
- Accuracy-performance tradeoff quantification
- Multi-platform consistency verification
- Regression detection in CI
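Crossover-point identification can be automated by comparing the two timing curves at matching problem sizes and reporting the first size at which the sublinear path wins. A minimal sketch (the timing numbers below are illustrative, not measured values):

```rust
// Given (size, baseline_ns, sublinear_ns) samples sorted by size,
// return the first problem size at which the sublinear solver is faster.
fn crossover_point(samples: &[(usize, f64, f64)]) -> Option<usize> {
    samples
        .iter()
        .find(|&&(_, base, sub)| sub < base)
        .map(|&(n, _, _)| n)
}

fn main() {
    // Illustrative measurements: sublinear setup cost dominates at small n.
    let samples = [
        (100, 4.0e3, 9.0e3),
        (1_000, 4.0e4, 3.0e4),
        (10_000, 4.0e5, 6.0e4),
    ];
    assert_eq!(crossover_point(&samples), Some(1_000));
    println!("crossover at n = 1000");
}
```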

---

## Decision

### 1. Six New Benchmark Suites

#### Suite 1: `benches/solver_baseline.rs`

Establishes baselines for operations the solver replaces:

```rust
use criterion::{criterion_group, criterion_main, Criterion, BenchmarkId, Throughput};

fn dense_matmul_baseline(c: &mut Criterion) {
    let mut group = c.benchmark_group("dense_matmul_baseline");

    for size in [64, 256, 1024, 4096] {
        let a = random_dense_matrix(size, size, 42);
        let x = random_vector(size, 43);
        let mut y = vec![0.0f32; size];

        group.throughput(Throughput::Elements((size * size) as u64));
        group.bench_with_input(
            BenchmarkId::new("naive", size),
            &size,
            |b, _| b.iter(|| dense_matvec_naive(&a, &x, &mut y)),
        );
        group.bench_with_input(
            BenchmarkId::new("simd_unrolled", size),
            &size,
            |b, _| b.iter(|| dense_matvec_simd(&a, &x, &mut y)),
        );
    }
    group.finish();
}

fn sparse_matmul_baseline(c: &mut Criterion) {
    let mut group = c.benchmark_group("sparse_matmul_baseline");

    for (n, density) in [(1000, 0.01), (1000, 0.05), (10000, 0.01), (10000, 0.05)] {
        let csr = random_csr_matrix(n, n, density, 44);
        let x = random_vector(n, 45);
        let mut y = vec![0.0f32; n];

        group.throughput(Throughput::Elements(csr.nnz() as u64));
        group.bench_with_input(
            BenchmarkId::new(format!("csr_{}x{}_{:.0}pct", n, n, density * 100.0), n),
            &n,
            |b, _| b.iter(|| csr.spmv(&x, &mut y)),
        );
    }
    group.finish();
}

criterion_group!(baselines, dense_matmul_baseline, sparse_matmul_baseline);
criterion_main!(baselines);
```
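The suites assume seeded helpers such as `random_vector(n, seed)` so that every run benchmarks identical inputs. A deterministic sketch of such a helper using a simple LCG (the real helpers presumably use the `rand` crate; this stand-in avoids external dependencies):

```rust
// Minimal linear congruential generator for reproducible benchmark inputs.
struct Lcg(u64);

impl Lcg {
    fn next_f32(&mut self) -> f32 {
        // Multiplier/increment from the common 64-bit LCG parameterization.
        self.0 = self
            .0
            .wrapping_mul(6364136223846793005)
            .wrapping_add(1442695040888963407);
        // Use the top 24 bits so the value is exactly representable in f32.
        ((self.0 >> 40) as f32) / (1u64 << 24) as f32
    }
}

fn random_vector(n: usize, seed: u64) -> Vec<f32> {
    let mut rng = Lcg(seed);
    (0..n).map(|_| rng.next_f32()).collect()
}

fn main() {
    // Same seed, same vector: benchmark runs stay comparable across commits.
    assert_eq!(random_vector(8, 42), random_vector(8, 42));
    assert_ne!(random_vector(8, 42), random_vector(8, 43));
    println!("deterministic inputs ok");
}
```

Seeding every generator (note the distinct literal seeds 42-64 threaded through the suites) is what makes Criterion's before/after comparisons meaningful.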

#### Suite 2: `benches/solver_neumann.rs`

```rust
fn neumann_convergence(c: &mut Criterion) {
    let mut group = c.benchmark_group("neumann_convergence");
    group.warm_up_time(Duration::from_secs(5));
    group.sample_size(200);

    let csr = random_diag_dominant_csr(10000, 0.01, 46);
    let b = random_vector(10000, 47);

    for eps in [1e-2, 1e-4, 1e-6, 1e-8] {
        group.bench_with_input(
            BenchmarkId::new("eps", format!("{:.0e}", eps)),
            &eps,
            |bench, &eps| {
                bench.iter(|| {
                    let solver = NeumannSolver::new(eps, 1000);
                    solver.solve(&csr, &b)
                })
            },
        );
    }
    group.finish();
}

fn neumann_sparsity_impact(c: &mut Criterion) {
    let mut group = c.benchmark_group("neumann_sparsity_impact");
    let n = 10000;

    for density in [0.001, 0.01, 0.05, 0.10, 0.50] {
        let csr = random_diag_dominant_csr(n, density, 48);
        let b = random_vector(n, 49);

        group.throughput(Throughput::Elements(csr.nnz() as u64));
        group.bench_with_input(
            BenchmarkId::new("density", format!("{:.1}pct", density * 100.0)),
            &density,
            |bench, _| {
                bench.iter(|| {
                    NeumannSolver::new(1e-4, 1000).solve(&csr, &b)
                })
            },
        );
    }
    group.finish();
}

fn neumann_vs_direct(c: &mut Criterion) {
    let mut group = c.benchmark_group("neumann_vs_direct");

    for n in [100, 500, 1000, 5000, 10000] {
        let csr = random_diag_dominant_csr(n, 0.01, 50);
        let b = random_vector(n, 51);
        let dense = csr.to_dense();

        group.bench_with_input(
            BenchmarkId::new("neumann", n), &n,
            |bench, _| bench.iter(|| NeumannSolver::new(1e-6, 1000).solve(&csr, &b)),
        );
        group.bench_with_input(
            BenchmarkId::new("dense_direct", n), &n,
            |bench, _| bench.iter(|| dense_solve(&dense, &b)),
        );
    }
    group.finish();
}

criterion_group!(neumann, neumann_convergence, neumann_sparsity_impact, neumann_vs_direct);
```

#### Suite 3: `benches/solver_push.rs`

```rust
fn forward_push_scaling(c: &mut Criterion) {
    let mut group = c.benchmark_group("forward_push_scaling");

    for n in [100, 1000, 10000, 100000] {
        let graph = random_sparse_graph(n, 0.005, 52);

        for eps in [1e-2, 1e-4, 1e-6] {
            group.bench_with_input(
                BenchmarkId::new(format!("n{}_eps{:.0e}", n, eps), n),
                &(n, eps),
                |bench, &(_, eps)| {
                    bench.iter(|| {
                        let solver = ForwardPushSolver::new(0.85, eps);
                        solver.ppr_from_source(&graph, 0)
                    })
                },
            );
        }
    }
    group.finish();
}

fn backward_push_vs_forward(c: &mut Criterion) {
    let mut group = c.benchmark_group("push_direction_comparison");
    let n = 10000;
    let graph = random_sparse_graph(n, 0.005, 53);

    for eps in [1e-2, 1e-4] {
        group.bench_with_input(
            BenchmarkId::new("forward", format!("{:.0e}", eps)), &eps,
            |bench, &eps| bench.iter(|| ForwardPushSolver::new(0.85, eps).ppr_from_source(&graph, 0)),
        );
        group.bench_with_input(
            BenchmarkId::new("backward", format!("{:.0e}", eps)), &eps,
            |bench, &eps| bench.iter(|| BackwardPushSolver::new(0.85, eps).ppr_to_target(&graph, 0)),
        );
    }
    group.finish();
}
```

#### Suite 4: `benches/solver_random_walk.rs`

```rust
fn random_walk_entry_estimation(c: &mut Criterion) {
    let mut group = c.benchmark_group("random_walk_estimation");

    for n in [1000, 10000, 100000] {
        let csr = random_laplacian_csr(n, 0.005, 54);

        group.bench_with_input(
            BenchmarkId::new("single_entry", n), &n,
            |bench, _| bench.iter(|| {
                HybridRandomWalkSolver::new(1e-4, 1000).estimate_entry(&csr, 0, n / 2)
            }),
        );

        group.bench_with_input(
            BenchmarkId::new("batch_100_entries", n), &n,
            |bench, _| bench.iter(|| {
                let pairs: Vec<(usize, usize)> = (0..100).map(|i| (i, n - 1 - i)).collect();
                HybridRandomWalkSolver::new(1e-4, 1000).estimate_batch(&csr, &pairs)
            }),
        );
    }
    group.finish();
}
```

#### Suite 5: `benches/solver_scheduler.rs`

```rust
fn scheduler_latency(c: &mut Criterion) {
    let mut group = c.benchmark_group("scheduler_latency");

    group.bench_function("noop_task", |b| {
        let scheduler = SolverScheduler::new(4);
        b.iter(|| scheduler.submit(|| {}))
    });

    group.bench_function("100ns_task", |b| {
        let scheduler = SolverScheduler::new(4);
        b.iter(|| scheduler.submit(|| {
            std::hint::spin_loop(); // short busy-wait, on the order of 100 ns
        }))
    });

    group.bench_function("1us_task", |b| {
        let scheduler = SolverScheduler::new(4);
        b.iter(|| scheduler.submit(|| {
            for _ in 0..100 { std::hint::spin_loop(); } // longer busy-wait, on the order of 1 μs
        }))
    });

    group.finish();
}

fn scheduler_throughput(c: &mut Criterion) {
    let mut group = c.benchmark_group("scheduler_throughput");

    for task_count in [1000, 10_000, 100_000, 1_000_000] {
        group.throughput(Throughput::Elements(task_count));
        group.bench_with_input(
            BenchmarkId::new("tasks", task_count), &task_count,
            |bench, &count| {
                let scheduler = SolverScheduler::new(4);
                let counter = Arc::new(AtomicU64::new(0));
                bench.iter(|| {
                    counter.store(0, Ordering::Relaxed);
                    for _ in 0..count {
                        let c = counter.clone();
                        scheduler.submit(move || { c.fetch_add(1, Ordering::Relaxed); });
                    }
                    scheduler.flush();
                    assert_eq!(counter.load(Ordering::Relaxed), count);
                })
            },
        );
    }
    group.finish();
}
```
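`SolverScheduler` names an internal worker-pool abstraction whose implementation is not shown here. A minimal std-only sketch with the same `submit`/`flush` shape, simplified to a single worker so that `flush` can be implemented as a channel round-trip (illustrative, not the RuVector implementation, which takes a worker count):

```rust
use std::sync::mpsc::{channel, Sender};
use std::thread::{self, JoinHandle};

type Task = Box<dyn FnOnce() + Send + 'static>;

// Single-worker sketch: tasks run in FIFO order on one background thread,
// so `flush` only needs to wait for a sentinel task to complete.
struct SolverScheduler {
    tx: Sender<Task>,
    worker: Option<JoinHandle<()>>,
}

impl SolverScheduler {
    fn new() -> Self {
        let (tx, rx) = channel::<Task>();
        let worker = thread::spawn(move || {
            for task in rx {
                task();
            }
        });
        Self { tx, worker: Some(worker) }
    }

    fn submit<F: FnOnce() + Send + 'static>(&self, f: F) {
        self.tx.send(Box::new(f)).expect("worker alive");
    }

    fn flush(&self) {
        // The sentinel reports back only after every earlier task has run.
        let (done_tx, done_rx) = channel();
        self.submit(move || done_tx.send(()).unwrap());
        done_rx.recv().unwrap();
    }
}

impl Drop for SolverScheduler {
    fn drop(&mut self) {
        // Replacing (and dropping) the sender closes the channel, ending the loop.
        let (dummy, _) = channel();
        drop(std::mem::replace(&mut self.tx, dummy));
        if let Some(w) = self.worker.take() {
            let _ = w.join();
        }
    }
}

fn main() {
    use std::sync::Arc;
    use std::sync::atomic::{AtomicU64, Ordering};

    let scheduler = SolverScheduler::new();
    let counter = Arc::new(AtomicU64::new(0));
    for _ in 0..1000 {
        let c = counter.clone();
        scheduler.submit(move || { c.fetch_add(1, Ordering::Relaxed); });
    }
    scheduler.flush();
    assert_eq!(counter.load(Ordering::Relaxed), 1000);
    println!("scheduler ok");
}
```

A multi-worker pool would need per-worker barriers (or a shared in-flight counter) for `flush`, which is exactly the overhead the latency suite above is designed to measure.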

#### Suite 6: `benches/solver_e2e.rs`

```rust
fn accelerated_search(c: &mut Criterion) {
    let mut group = c.benchmark_group("accelerated_search");
    group.sample_size(50);
    group.warm_up_time(Duration::from_secs(5));

    for n in [10_000, 100_000] {
        let db = build_test_db(n, 384, 56);
        let query = random_vector(384, 57);

        group.bench_with_input(
            BenchmarkId::new("hnsw_only", n), &n,
            |bench, _| bench.iter(|| db.search(&query, 10)),
        );

        group.bench_with_input(
            BenchmarkId::new("hnsw_plus_solver_rerank", n), &n,
            |bench, _| bench.iter(|| {
                let candidates = db.search(&query, 100);     // Broad HNSW pass
                solver_rerank(&db, &query, &candidates, 10)  // Solver-accelerated reranking
            }),
        );
    }
    group.finish();
}

fn accelerated_batch_analytics(c: &mut Criterion) {
    let mut group = c.benchmark_group("batch_analytics");
    group.sample_size(10);

    let n = 10_000;
    let vectors = random_matrix(n, 384, 58);

    group.bench_function("pairwise_brute_force", |b| {
        b.iter(|| pairwise_distances_brute(&vectors))
    });

    group.bench_function("pairwise_solver_estimated", |b| {
        b.iter(|| pairwise_distances_solver(&vectors, 1e-4))
    });

    group.finish();
}
```

### 2. Regression Prevention

Hard thresholds enforced in CI:

```rust
// In each benchmark suite, add regression markers
fn solver_regression_tests(c: &mut Criterion) {
    let mut group = c.benchmark_group("solver_regression");

    // These thresholds trigger CI failure if exceeded
    group.bench_function("neumann_10k_1pct", |b| {
        let csr = random_diag_dominant_csr(10000, 0.01, 60);
        let rhs = random_vector(10000, 61);
        b.iter(|| NeumannSolver::new(1e-4, 1000).solve(&csr, &rhs))
        // Target: < 500 μs
    });

    group.bench_function("forward_push_10k", |b| {
        let graph = random_sparse_graph(10000, 0.005, 62);
        b.iter(|| ForwardPushSolver::new(0.85, 1e-4).ppr_from_source(&graph, 0))
        // Target: < 100 μs
    });

    group.bench_function("cg_10k_1pct", |b| {
        let csr = random_laplacian_csr(10000, 0.01, 63);
        let rhs = random_vector(10000, 64);
        b.iter(|| ConjugateGradientSolver::new(1e-6, 1000).solve(&csr, &rhs))
        // Target: < 1 ms
    });

    group.finish();
}
```

### 3. Accuracy Validation Suite

Alongside latency benchmarks, accuracy must be tracked:

```rust
fn accuracy_validation() {
    // Neumann vs exact solve
    let csr = random_diag_dominant_csr(1000, 0.01, 70);
    let b = random_vector(1000, 71);
    let exact = dense_solve(&csr.to_dense(), &b);

    for eps in [1e-2, 1e-4, 1e-6] {
        let approx = NeumannSolver::new(eps, 1000).solve(&csr, &b).unwrap();
        let relative_error = l2_distance(&exact, &approx.solution) / l2_norm(&exact);
        assert!(relative_error < eps * 10.0, // 10x margin
            "Neumann eps={}: relative error {} exceeds bound {}",
            eps, relative_error, eps * 10.0);
    }

    // Forward Push recall@k
    let graph = random_sparse_graph(10000, 0.005, 72);
    let exact_ppr = exact_pagerank(&graph, 0, 0.85);
    let top_k_exact: Vec<usize> = exact_ppr.top_k(100);

    for eps in [1e-2, 1e-4] {
        let approx_ppr = ForwardPushSolver::new(0.85, eps).ppr_from_source(&graph, 0);
        let top_k_approx: Vec<usize> = approx_ppr.top_k(100);
        let recall = set_overlap(&top_k_exact, &top_k_approx) as f64 / 100.0;
        assert!(recall > 0.9, "Forward Push eps={}: recall@100 = {} < 0.9", eps, recall);
    }
}
```
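`set_overlap` above counts how many of the approximate top-k ids also appear in the exact top-k; recall@k is that count divided by k. A sketch of that hypothetical helper, matching how it is used above:

```rust
use std::collections::HashSet;

// Number of ids common to both top-k lists; recall@k = overlap / k.
fn set_overlap(exact: &[usize], approx: &[usize]) -> usize {
    let exact: HashSet<_> = exact.iter().collect();
    approx.iter().filter(|id| exact.contains(id)).count()
}

fn main() {
    let exact = [1, 2, 3, 4, 5];
    let approx = [5, 4, 9, 2, 8];
    let overlap = set_overlap(&exact, &approx);
    let recall = overlap as f64 / exact.len() as f64;
    assert_eq!(overlap, 3);
    assert!((recall - 0.6).abs() < 1e-12);
    println!("recall@5 = {recall}");
}
```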

### 4. CI Integration

```yaml
# .github/workflows/bench.yml
name: Benchmark Suite
on:
  pull_request:
    paths: ['crates/ruvector-solver/**']
  schedule:
    - cron: '0 2 * * *'  # Nightly at 2 AM

jobs:
  bench-pr:
    runs-on: ubuntu-latest
    if: github.event_name == 'pull_request'
    steps:
      - uses: actions/checkout@v4
      - run: cargo bench -p ruvector-solver -- solver_regression
      - uses: benchmark-action/github-action-benchmark@v1
        with:
          tool: 'cargo'
          output-file-path: target/criterion/report/index.html

  bench-nightly:
    runs-on: ubuntu-latest
    if: github.event_name == 'schedule'
    strategy:
      matrix:
        target: [x86_64-unknown-linux-gnu, aarch64-unknown-linux-gnu]
    steps:
      - uses: actions/checkout@v4
      - run: cargo bench -p ruvector-solver --target ${{ matrix.target }}
      - run: cargo bench -p ruvector-solver -- solver_accuracy
      - uses: actions/upload-artifact@v4
        with:
          name: bench-results-${{ matrix.target }}
          path: target/criterion/
```

### 5. Reporting Format

Following existing BENCHMARK_RESULTS.md conventions:

```markdown
## Solver Integration Benchmarks

### Environment
- **Date**: 2026-02-20
- **Platform**: Linux x86_64, AMD EPYC 7763 (AVX-512)
- **Rust**: 1.77, release profile (lto=fat, codegen-units=1)
- **Criterion**: 0.5, 200 samples, 5s warmup

### Results

| Operation | Baseline | Solver | Speedup | Accuracy |
|-----------|----------|--------|---------|----------|
| MatVec 10K×10K (1%) | 400 μs | 15 μs | 26.7x | ε < 1e-4 |
| PageRank 10K nodes | 50 ms | 80 μs | 625x | recall@100 > 0.95 |
| Spectral gap est. | N/A | 50 μs | New | within 5% of exact |
| Batch pairwise 10K | 480 s | 15 s | 32x | ε < 1e-3 |
```

---

## Consequences

### Positive

1. **Reproducible validation**: all speedup claims are backed by Criterion benchmarks
2. **Regression prevention**: CI catches performance degradations before merge
3. **Multi-platform**: benchmarks run on x86_64 and aarch64
4. **Accuracy tracking**: approximate algorithms are validated against exact baselines
5. **Aligned infrastructure**: uses the existing Criterion.rs setup, no new tools

### Negative

1. **Benchmark maintenance**: six new benchmark files to maintain
2. **CI time**: the nightly full suite adds ~30 minutes
3. **Flaky thresholds**: regression thresholds may need periodic recalibration

---

## Implementation Status

The Criterion benchmark suite was delivered with five benchmark groups: solver_baseline (dense reference), solver_neumann (Neumann series profiling), solver_cg (conjugate gradient scaling), solver_push (push algorithm comparison), and solver_e2e (end-to-end pipeline). A min-cut gating benchmark script (scripts/run_mincut_bench.sh) performs a 1k-sample grid search over the lambda/tau parameters. The profiler crate (ruvector-profiler) provides memory, latency, and power measurement with CSV output.

---

## References

- [08-performance-analysis.md](../08-performance-analysis.md) — Existing benchmarks and methodology
- [10-algorithm-analysis.md](../10-algorithm-analysis.md) — Algorithm complexity for threshold derivation
- [12-testing-strategy.md](../12-testing-strategy.md) — Testing strategy integration
949
vendor/ruvector/docs/research/sublinear-time-solver/adr/ADR-STS-007-feature-flags-rollout.md
vendored
Normal file
@@ -0,0 +1,949 @@

# ADR-STS-007: Feature Flag Architecture and Progressive Rollout

## Status

**Accepted**

## Metadata

| Field | Value |
|-------------|------------------------------------------------|
| Version | 1.0 |
| Date | 2026-02-20 |
| Authors | RuVector Architecture Team |
| Deciders | Architecture Review Board |
| Supersedes | N/A |
| Related | ADR-STS-001 (Solver Integration), ADR-STS-004 (WASM Strategy) |

---

## Context

The RuVector workspace (v2.0.3, Rust 2021 edition, resolver v2) contains 100+ crates
spanning vector storage, graph databases, GNN layers, attention mechanisms, sparse
inference, and mathematics. Feature flags are already used extensively throughout the
codebase:

- **ruvector-core**: `default = ["simd", "storage", "hnsw", "api-embeddings", "parallel"]`
- **ruvector-graph**: `default = ["full"]` with `full`, `simd`, `storage`, `async-runtime`,
  `compression`, `distributed`, `federation`, `wasm`
- **ruvector-math**: `default = ["std"]` with `simd`, `parallel`, `serde`
- **ruvector-gnn**: `default = ["simd", "mmap"]` with `wasm`, `napi`
- **ruvector-attention**: `default = ["simd"]` with `wasm`, `napi`, `math`, `sheaf`

The sublinear-time-solver (v0.1.3) introduces new algorithmic capabilities: coherence
verification, spectral graph methods, GNN-accelerated search, and sublinear query
resolution. These must be integrated without disrupting any of the existing feature
surfaces.

### Constraints

1. **Zero breaking changes** to the public API of any existing crate.
2. **Opt-in per subsystem**: each solver capability must be individually selectable.
3. **Gradual rollout**: phased introduction from experimental to default.
4. **Platform parity**: feature gates must account for native, WASM, and Node.js targets.
5. **CI tractability**: the feature matrix must remain testable without combinatorial
   explosion.
6. **Dependency hygiene**: enabling a solver feature must not pull in nalgebra when only
   ndarray is needed, and vice versa.

---

## Decision

We adopt a **hierarchical feature flag architecture** with four tiers: the solver crate
defines its own backend and acceleration flags, consuming crates expose subsystem-scoped
`sublinear-*` flags, the workspace root provides aggregate flags for convenience, and CI
tests a curated feature matrix rather than all 2^N combinations.

### 1. Solver Crate Feature Definitions

```toml
# crates/ruvector-solver/Cargo.toml

[package]
name = "ruvector-solver"
version = "0.1.0"
edition.workspace = true
rust-version.workspace = true
license.workspace = true
authors.workspace = true
repository.workspace = true
description = "Sublinear-time solver: coherence verification, spectral methods, GNN search"

[features]
default = []

# Linear algebra backends (mutually independent, both can be active)
nalgebra-backend = ["dep:nalgebra"]
ndarray-backend = ["dep:ndarray"]

# Acceleration
parallel = ["dep:rayon"]
simd = []                         # Auto-detected at build time via cfg
gpu = ["ruvector-math/parallel"]  # Future: GPU dispatch through ruvector-math

# Platform targets
wasm = [
    "dep:wasm-bindgen",
    "dep:serde_wasm_bindgen",
    "dep:js-sys",
]

# Convenience aggregates
full = ["nalgebra-backend", "ndarray-backend", "parallel"]

[dependencies]
# Core (always present)
ruvector-math = { path = "../ruvector-math", default-features = false }
serde = { workspace = true }
serde_json = { workspace = true }
thiserror = { workspace = true }
tracing = { workspace = true }
rand = { workspace = true }
rand_distr = { workspace = true }

# Optional backends
nalgebra = { version = "0.33", default-features = false, features = ["std"], optional = true }
ndarray = { workspace = true, features = ["serde"], optional = true }

# Optional acceleration
rayon = { workspace = true, optional = true }

# Optional WASM
wasm-bindgen = { workspace = true, optional = true }
serde_wasm_bindgen = { version = "0.6", optional = true }
js-sys = { workspace = true, optional = true }

[dev-dependencies]
criterion = { workspace = true }
proptest = { workspace = true }
approx = "0.5"
```

### 2. Consuming Crate Feature Gates

Each crate that integrates solver capabilities exposes granular `sublinear-*` flags
that map onto solver features. This keeps the dependency graph explicit and auditable.

#### 2.1 ruvector-core

```toml
# Additions to crates/ruvector-core/Cargo.toml [features]

# Sublinear solver integration (opt-in)
sublinear = ["dep:ruvector-solver"]

# Coherence verification for HNSW index quality
sublinear-coherence = [
    "sublinear",
    "ruvector-solver/nalgebra-backend",
]
```

The `sublinear-coherence` flag enables runtime coherence checks on HNSW graph edges.
It requires the nalgebra backend because the coherence verifier uses sheaf-theoretic
linear algebra that maps naturally to nalgebra's matrix abstractions.

#### 2.2 ruvector-graph

```toml
# Additions to crates/ruvector-graph/Cargo.toml [features]

# Sublinear spectral partitioning and Laplacian solvers
sublinear = ["dep:ruvector-solver"]

sublinear-graph = [
    "sublinear",
    "ruvector-solver/ndarray-backend",
]

# Spectral methods for graph partitioning
sublinear-spectral = [
    "sublinear-graph",
    "ruvector-solver/parallel",
]
```

Graph crates use the ndarray backend because ruvector-graph already depends on ndarray
for adjacency matrices and spectral embeddings. Pulling in nalgebra here would add an
unnecessary second linear algebra library.

#### 2.3 ruvector-gnn

```toml
# Additions to crates/ruvector-gnn/Cargo.toml [features]

# GNN-accelerated sublinear search
sublinear = ["dep:ruvector-solver"]

sublinear-gnn = [
    "sublinear",
    "ruvector-solver/ndarray-backend",
]
```

#### 2.4 ruvector-attention

```toml
# Additions to crates/ruvector-attention/Cargo.toml [features]

# Sublinear attention routing
sublinear = ["dep:ruvector-solver"]

sublinear-attention = [
    "sublinear",
    "ruvector-solver/nalgebra-backend",
    "math",
]
```

#### 2.5 ruvector-collections

```toml
# Additions to crates/ruvector-collections/Cargo.toml [features]

# Sublinear collection-level query dispatch
sublinear = ["ruvector-core/sublinear"]
```

Collections delegates to ruvector-core and does not directly depend on the solver crate.

### 3. Workspace-Level Aggregate Flags

```toml
# Additions to workspace Cargo.toml [workspace.dependencies]

ruvector-solver = { path = "crates/ruvector-solver", default-features = false }
```

No workspace-level default features are set for the solver. Each consumer pulls exactly
the features it needs.
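A downstream application then opts in per subsystem by naming the scoped flags in its own manifest; nothing solver-related is enabled by default. An illustrative consumer manifest fragment (versions and feature choices here are examples, not prescriptions):

```toml
[dependencies]
# Coherence checks on HNSW indexes (pulls in the nalgebra backend transitively).
ruvector-core = { version = "2", features = ["sublinear-coherence"] }

# Spectral partitioning with parallel execution (ndarray backend transitively).
ruvector-graph = { version = "2", features = ["sublinear-spectral"] }
```

Because each `sublinear-*` flag names exactly the solver backends it needs, auditing which linear algebra libraries end up in a build reduces to reading the consumer's feature list.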

### 4. Conditional Compilation Patterns

All solver-gated code uses consistent `cfg` attribute patterns to ensure the compiler
eliminates dead code paths when features are disabled.

#### 4.1 Module-Level Gating

```rust
// In crates/ruvector-core/src/lib.rs

#[cfg(feature = "sublinear")]
pub mod sublinear;

#[cfg(feature = "sublinear-coherence")]
pub mod coherence;
```

#### 4.2 Trait Implementation Gating

```rust
// In crates/ruvector-core/src/index/hnsw.rs

#[cfg(feature = "sublinear-coherence")]
impl HnswIndex {
    /// Verify edge coherence across the HNSW graph using the sheaf Laplacian.
    ///
    /// Returns the coherence score in [0, 1], where 1.0 means perfectly coherent.
    /// Only available when the `sublinear-coherence` feature is enabled.
    pub fn verify_coherence(&self, config: &CoherenceConfig) -> Result<f64, SolverError> {
        use ruvector_solver::coherence::SheafCoherenceVerifier;

        let verifier = SheafCoherenceVerifier::new(config.clone());
        verifier.verify(&self.graph)
    }
}
```

#### 4.3 Function-Level Gating with Fallback

```rust
// In crates/ruvector-graph/src/query/planner.rs

/// Select the optimal query execution strategy.
///
/// When `sublinear-spectral` is enabled, the planner considers spectral
/// partitioning for large graph traversals. Otherwise, it falls back to
/// the existing cost-based optimizer.
pub fn select_strategy(&self, query: &GraphQuery) -> ExecutionStrategy {
    #[cfg(feature = "sublinear-spectral")]
    {
        if self.should_use_spectral(query) {
            return self.plan_spectral(query);
        }
    }

    // Default path: cost-based optimizer (always available)
    self.plan_cost_based(query)
}
```

#### 4.4 Compile-Time Backend Selection

```rust
// In crates/ruvector-solver/src/backend.rs

// Backend modules for the active linear algebra libraries.
//
// The solver supports nalgebra and ndarray simultaneously. Consumers
// select which backend(s) to activate via feature flags. When both
// are active, the solver can dispatch to whichever backend is more
// efficient for a given operation.

#[cfg(feature = "nalgebra-backend")]
pub mod nalgebra_ops {
    use nalgebra::{DMatrix, DVector};

    pub fn solve_laplacian(laplacian: &DMatrix<f64>, rhs: &DVector<f64>) -> DVector<f64> {
        // Cholesky requires strict positive definiteness; a graph Laplacian is
        // only positive semi-definite, so callers must first ground a node or
        // add a small diagonal regularization to make the system non-singular.
        let chol = laplacian.clone().cholesky()
            .expect("regularized Laplacian must be positive definite");
        chol.solve(rhs)
    }
}

#[cfg(feature = "ndarray-backend")]
pub mod ndarray_ops {
    use ndarray::{Array1, Array2};

    pub fn spectral_embedding(adjacency: &Array2<f64>, dim: usize) -> Array2<f64> {
        // Eigendecomposition of the normalized Laplacian
        // ... implementation details
        todo!("spectral embedding via ndarray")
    }
}
```
|
||||
|
||||
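To see why the Cholesky path above needs a grounded system: the full Laplacian of a graph is singular (the all-ones vector is in its kernel), while deleting the row and column of one pinned vertex leaves a positive definite minor. A dependency-free sketch, using a 3-node path graph and a hand-rolled 2x2 Cholesky factorization:

```rust
// Why Cholesky needs a *grounded* Laplacian: the full path-graph Laplacian
// is singular (row sums are zero), but grounding one vertex leaves a
// positive definite 2x2 system that factors and solves cleanly.

fn det3(m: [[f64; 3]; 3]) -> f64 {
    m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
        - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
        + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0])
}

/// Solve the grounded system [[2,-1],[-1,1]] x = [1, 0] via Cholesky L L^T.
fn solve_grounded() -> (f64, f64) {
    let (a, b, c) = (2.0_f64, -1.0, 1.0);
    let l11 = a.sqrt();
    let l21 = b / l11;
    let l22 = (c - l21 * l21).sqrt();
    // Forward substitution L y = rhs, then back substitution L^T x = y.
    let y1 = 1.0 / l11;
    let y2 = (0.0 - l21 * y1) / l22;
    let x2 = y2 / l22;
    let x1 = (y1 - l21 * x2) / l11;
    (x1, x2)
}

fn main() {
    // Full Laplacian of the path 0 - 1 - 2: det = 0, so Cholesky would fail.
    let laplacian = [[1.0, -1.0, 0.0], [-1.0, 2.0, -1.0], [0.0, -1.0, 1.0]];
    assert!(det3(laplacian).abs() < 1e-12);

    // Grounding vertex 0 keeps the PD minor [[2,-1],[-1,1]] (det = 1 > 0).
    let (x1, x2) = solve_grounded();
    // Check: [[2,-1],[-1,1]] . [x1, x2] = [1, 0]
    assert!((2.0 * x1 - x2 - 1.0).abs() < 1e-12);
    assert!((-x1 + x2).abs() < 1e-12);
    println!("grounded solve: x = [{x1}, {x2}]");
}
```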
### 5. Runtime Algorithm Selection

Beyond compile-time feature gates, the solver provides a runtime dispatch layer
that selects between dense and sublinear code paths based on data characteristics.

```rust
// In crates/ruvector-solver/src/dispatch.rs

/// Configuration for runtime algorithm selection.
#[derive(Debug, Clone, serde::Serialize, serde::Deserialize)]
pub struct SolverDispatchConfig {
    /// Sparsity threshold above which the sublinear path is preferred.
    /// Default: 0.95 (95% sparse). Range: [0.0, 1.0].
    pub sparsity_threshold: f64,

    /// Minimum number of elements before sublinear algorithms are considered.
    /// Below this threshold, dense algorithms are always faster due to setup costs.
    /// Default: 10_000.
    pub min_elements_for_sublinear: usize,

    /// Maximum fraction of elements the sublinear path may touch.
    /// If the solver would need to examine more than this fraction,
    /// it falls back to the dense path.
    /// Default: 0.1 (10%).
    pub max_touch_fraction: f64,

    /// Force a specific path regardless of data characteristics.
    /// None means auto-detection (recommended).
    pub force_path: Option<SolverPath>,
}

impl Default for SolverDispatchConfig {
    fn default() -> Self {
        Self {
            sparsity_threshold: 0.95,
            min_elements_for_sublinear: 10_000,
            max_touch_fraction: 0.1,
            force_path: None,
        }
    }
}

/// Which execution path to use.
#[derive(Debug, Clone, Copy, PartialEq, Eq, serde::Serialize, serde::Deserialize)]
pub enum SolverPath {
    /// Traditional dense algorithms.
    Dense,
    /// Sublinear-time algorithms (only touches a fraction of the data).
    Sublinear,
}

/// Determine the optimal execution path for the given data.
pub fn select_path(
    total_elements: usize,
    nonzero_elements: usize,
    config: &SolverDispatchConfig,
) -> SolverPath {
    if let Some(forced) = config.force_path {
        return forced;
    }

    if total_elements < config.min_elements_for_sublinear {
        return SolverPath::Dense;
    }

    let sparsity = 1.0 - (nonzero_elements as f64 / total_elements as f64);
    if sparsity >= config.sparsity_threshold {
        SolverPath::Sublinear
    } else {
        SolverPath::Dense
    }
}
```

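The dispatch rule is easiest to see with the defaults inlined. A self-contained sketch (the real code takes a `SolverDispatchConfig` and returns `SolverPath`; here the thresholds 0.95 and 10_000 are hard-coded and strings stand in for the enum):

```rust
// Standalone sketch of select_path with the default config inlined:
// sparsity_threshold = 0.95, min_elements_for_sublinear = 10_000.
fn select_path_default(total_elements: usize, nonzero_elements: usize) -> &'static str {
    if total_elements < 10_000 {
        return "dense"; // setup cost dominates on small problems
    }
    let sparsity = 1.0 - (nonzero_elements as f64 / total_elements as f64);
    if sparsity >= 0.95 { "sublinear" } else { "dense" }
}

fn main() {
    // 1M elements with 1% nonzeros: 99% sparse and large enough => sublinear.
    assert_eq!(select_path_default(1_000_000, 10_000), "sublinear");
    // Small problem: always dense, regardless of sparsity.
    assert_eq!(select_path_default(1_000, 10), "dense");
    // Large but only 50% sparse: dense.
    assert_eq!(select_path_default(1_000_000, 500_000), "dense");
    println!("dispatch sketch ok");
}
```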
### 6. WASM Feature Interaction Matrix

WASM targets cannot use certain features (mmap, threads via rayon, SIMD on older
runtimes). The following matrix defines valid feature combinations per platform.

```
Legend:  Y = supported   N = not supported   P = partial (polyfill)

Feature                    | native-x86_64 | native-aarch64 | wasm32-unknown | wasm32-wasi
---------------------------+---------------+----------------+----------------+------------
sublinear                  | Y             | Y              | Y              | Y
sublinear-coherence        | Y             | Y              | Y              | Y
sublinear-graph            | Y             | Y              | Y              | Y
sublinear-gnn              | Y             | Y              | Y              | Y
sublinear-spectral         | Y             | Y              | N (no rayon)   | N
sublinear-attention        | Y             | Y              | Y              | Y
nalgebra-backend           | Y             | Y              | Y              | Y
ndarray-backend            | Y             | Y              | Y              | Y
parallel (rayon)           | Y             | Y              | N              | N
simd                       | Y             | Y              | P (128-bit)    | P
gpu                        | Y             | P              | N              | N
solver + storage           | Y             | Y              | N              | Y (fs)
solver + hnsw              | Y             | Y              | N              | N
```

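The "P (polyfill)" entries in the `simd` row resolve at compile time. One minimal way to express the per-target lane width is with `cfg!`; this is illustrative only, not the crate's actual API (`simd_lane_bytes` is a hypothetical helper):

```rust
// Illustrative sketch: resolving the SIMD lane width per target with cfg!.
// On wasm32 the best case is 128-bit simd128; runtimes without it fall back
// to a scalar polyfill at the same logical lane width.
fn simd_lane_bytes() -> usize {
    if cfg!(target_arch = "wasm32") {
        16 // simd128 at best; scalar polyfill on older runtimes
    } else if cfg!(target_feature = "avx2") {
        32 // 256-bit vectors on x86_64 with AVX2
    } else {
        16 // 128-bit NEON / SSE baseline
    }
}

fn main() {
    let lanes = simd_lane_bytes();
    assert!(lanes == 16 || lanes == 32);
    println!("SIMD lane width: {lanes} bytes");
}
```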
#### WASM Guard Pattern

```rust
// In crates/ruvector-solver/src/lib.rs

// Prevent invalid feature combinations at compile time.
#[cfg(all(feature = "parallel", target_arch = "wasm32"))]
compile_error!(
    "The `parallel` feature (rayon) is not supported on wasm32 targets. \
     Remove it or use `--no-default-features` when building for WASM."
);

#[cfg(all(feature = "gpu", target_arch = "wasm32"))]
compile_error!(
    "The `gpu` feature is not supported on wasm32 targets."
);
```

### 7. Feature Flag Documentation Pattern

Every feature flag must include a doc comment in the crate-level documentation.

```rust
// In crates/ruvector-solver/src/lib.rs

//! # Feature Flags
//!
//! | Flag               | Default | Description                                      |
//! |--------------------|---------|--------------------------------------------------|
//! | `nalgebra-backend` | off     | Enable nalgebra for sheaf/coherence operations   |
//! | `ndarray-backend`  | off     | Enable ndarray for spectral/graph operations     |
//! | `parallel`         | off     | Enable rayon for multi-threaded solver execution |
//! | `simd`             | off     | Enable SIMD intrinsics (auto-detected at build)  |
//! | `gpu`              | off     | Enable GPU dispatch through ruvector-math        |
//! | `wasm`             | off     | Enable WASM bindings via wasm-bindgen            |
//! | `full`             | off     | Enable nalgebra + ndarray + parallel             |
```

---

## Progressive Rollout Plan

### Phase 1: Foundation (Weeks 1-3)

**Goal**: Introduce the solver crate with zero consumer integration.

| Task | Acceptance Criteria |
|-------------------------------------------------------|-----------------------------------------------|
| Create `crates/ruvector-solver` with empty public API | Crate compiles, no downstream changes         |
| Define all feature flags in Cargo.toml                | `cargo check --all-features` passes           |
| Add solver to workspace members list                  | `cargo build -p ruvector-solver` succeeds     |
| Write compile-time WASM guards                        | WASM build fails gracefully on invalid combos |
| Add `ruvector-solver` to workspace dependencies       | Resolver v2 is satisfied                      |
| Set up CI job for `ruvector-solver` feature matrix    | All matrix entries pass                       |

**Feature flags available**: `nalgebra-backend`, `ndarray-backend`, `parallel`, `simd`,
`wasm`, `full`.

**Consumer flags available**: None (solver is not yet a dependency of any consumer).

**Risk**: Minimal. No consumer code changes.

### Phase 2: Core Integration (Weeks 4-7)

**Goal**: Enable coherence verification in ruvector-core and GNN acceleration in
ruvector-gnn behind opt-in feature flags.

| Task | Acceptance Criteria |
|--------------------------------------------------|-----------------------------------------------|
| Add `sublinear` flag to ruvector-core            | Flag compiles with no behavioral change       |
| Add `sublinear-coherence` flag to ruvector-core  | Coherence verifier runs on HNSW graphs        |
| Add `sublinear-gnn` flag to ruvector-gnn         | GNN training uses sublinear message passing   |
| Write integration tests for coherence            | Tests pass with and without the flag          |
| Write integration tests for GNN acceleration     | Tests pass with and without the flag          |
| Benchmark coherence overhead                     | Less than 5% latency increase on default path |
| Update ruvector-core README with new flags       | Documentation is current                      |

**Feature flags available**: Phase 1 flags + `sublinear`, `sublinear-coherence`,
`sublinear-gnn`.

**Rollback plan**: Remove the `sublinear*` feature flags from consumer Cargo.toml and
delete the gated modules. No API changes to revert because all new code is behind
feature gates.

### Phase 3: Extended Integration (Weeks 8-11)

**Goal**: Bring sublinear spectral methods to ruvector-graph and sublinear attention
routing to ruvector-attention.

| Task | Acceptance Criteria |
|------------------------------------------------------|------------------------------------------------|
| Add `sublinear-graph` flag to ruvector-graph         | Spectral partitioning available behind flag    |
| Add `sublinear-spectral` flag to ruvector-graph      | Parallel spectral solver works                 |
| Add `sublinear-attention` flag to ruvector-attention | Attention routing uses solver dispatch         |
| Add `sublinear` flag to ruvector-collections         | Collection query dispatch delegates properly   |
| WASM builds for all new flags                        | `cargo build --target wasm32-unknown-unknown`  |
| Performance benchmarks for spectral partitioning     | At least 2x speedup on graphs with >100k nodes |
| Cross-crate integration tests                        | Multi-crate feature combos work end-to-end     |

**Feature flags available**: Phase 2 flags + `sublinear-graph`, `sublinear-spectral`,
`sublinear-attention`.

### Phase 4: Default Promotion (Weeks 12-16)

**Goal**: After validation, promote selected sublinear features to default feature sets.

| Task | Acceptance Criteria |
|--------------------------------------------------------|--------------------------------------------|
| Collect benchmark data from all phases                 | Data covers all target platforms           |
| Run `cargo semver-checks` on all modified crates       | Zero breaking changes detected             |
| Promote `sublinear-coherence` to ruvector-core default | Default build includes coherence checks    |
| Promote `sublinear-gnn` to ruvector-gnn default        | Default GNN build uses solver acceleration |
| Update ruvector workspace version to 2.1.0             | Minor version bump signals new capabilities|
| Publish updated crates to crates.io                    | All crates pass `cargo publish --dry-run`  |

**Promotion criteria** (all must be met):

1. Zero regressions in the existing benchmark suite.
2. Less than 2% compile-time increase for `cargo build` with default features.
3. Less than 50 KB binary size increase for default builds.
4. All platform CI targets pass.
5. At least 4 weeks of Phase 3 stability with no feature-related bug reports.

**Feature changes at promotion**:

```toml
# BEFORE (Phase 3)
# crates/ruvector-core/Cargo.toml
[features]
default = ["simd", "storage", "hnsw", "api-embeddings", "parallel"]

# AFTER (Phase 4)
# crates/ruvector-core/Cargo.toml
[features]
default = ["simd", "storage", "hnsw", "api-embeddings", "parallel", "sublinear-coherence"]
```

---

## CI Configuration for Feature Matrix Testing

### Strategy: Tiered Matrix

Testing all 2^N feature combinations is infeasible. Instead, we test a curated set of
meaningful profiles that cover: (a) each feature in isolation, (b) common real-world
combinations, and (c) platform-specific builds.

```yaml
# .github/workflows/solver-features.yml

name: Solver Feature Matrix
on:
  push:
    paths:
      - 'crates/ruvector-solver/**'
      - 'crates/ruvector-core/**'
      - 'crates/ruvector-graph/**'
      - 'crates/ruvector-gnn/**'
      - 'crates/ruvector-attention/**'
  pull_request:
    paths:
      - 'crates/ruvector-solver/**'

jobs:
  feature-matrix:
    runs-on: ${{ matrix.os }}
    strategy:
      fail-fast: false
      matrix:
        include:
          # Tier 1: Individual features on Linux
          - os: ubuntu-latest
            target: x86_64-unknown-linux-gnu
            features: "nalgebra-backend"
            name: "nalgebra-only"
          - os: ubuntu-latest
            target: x86_64-unknown-linux-gnu
            features: "ndarray-backend"
            name: "ndarray-only"
          - os: ubuntu-latest
            target: x86_64-unknown-linux-gnu
            features: "parallel"
            name: "parallel-only"
          - os: ubuntu-latest
            target: x86_64-unknown-linux-gnu
            features: "simd"
            name: "simd-only"

          # Tier 2: Common combinations
          - os: ubuntu-latest
            target: x86_64-unknown-linux-gnu
            features: "nalgebra-backend,parallel"
            name: "coherence-profile"
          - os: ubuntu-latest
            target: x86_64-unknown-linux-gnu
            features: "ndarray-backend,parallel"
            name: "spectral-profile"
          - os: ubuntu-latest
            target: x86_64-unknown-linux-gnu
            features: "full"
            name: "full-profile"
          - os: ubuntu-latest
            target: x86_64-unknown-linux-gnu
            features: ""
            name: "no-features"

          # Tier 3: Platform-specific
          - os: ubuntu-latest
            target: wasm32-unknown-unknown
            features: "wasm,nalgebra-backend"
            name: "wasm-nalgebra"
          - os: ubuntu-latest
            target: wasm32-unknown-unknown
            features: "wasm,ndarray-backend"
            name: "wasm-ndarray"
          - os: ubuntu-latest
            target: wasm32-unknown-unknown
            features: "wasm"
            name: "wasm-minimal"
          - os: macos-latest
            target: aarch64-apple-darwin
            features: "full"
            name: "aarch64-full"

    steps:
      - uses: actions/checkout@v4
      - uses: dtolnay/rust-toolchain@stable
        with:
          targets: ${{ matrix.target }}
      - name: Check ${{ matrix.name }}
        run: |
          cargo check -p ruvector-solver \
            --target ${{ matrix.target }} \
            --no-default-features \
            --features "${{ matrix.features }}"
      - name: Test ${{ matrix.name }}
        if: matrix.target != 'wasm32-unknown-unknown'
        run: |
          cargo test -p ruvector-solver \
            --no-default-features \
            --features "${{ matrix.features }}"

  # Consumer crate integration matrix
  consumer-integration:
    runs-on: ubuntu-latest
    strategy:
      fail-fast: false
      matrix:
        include:
          - crate: ruvector-core
            features: "sublinear-coherence"
          - crate: ruvector-graph
            features: "sublinear-spectral"
          - crate: ruvector-gnn
            features: "sublinear-gnn"
          - crate: ruvector-attention
            features: "sublinear-attention"
          - crate: ruvector-collections
            features: "sublinear"
    steps:
      - uses: actions/checkout@v4
      - uses: dtolnay/rust-toolchain@stable
      - name: Test ${{ matrix.crate }} + ${{ matrix.features }}
        run: |
          cargo test -p ${{ matrix.crate }} \
            --features "${{ matrix.features }}"

  # Semver compliance check
  semver-check:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: dtolnay/rust-toolchain@stable
      - name: Install cargo-semver-checks
        run: cargo install cargo-semver-checks
      - name: Check semver compliance
        run: |
          for crate in ruvector-core ruvector-graph ruvector-gnn ruvector-attention; do
            cargo semver-checks check-release -p "$crate"
          done
```

### Local Developer Workflow

```bash
# Verify a single feature
cargo check -p ruvector-solver --no-default-features --features nalgebra-backend

# Verify WASM compatibility
cargo check -p ruvector-solver --target wasm32-unknown-unknown --no-default-features --features wasm

# Run the full matrix locally (requires cargo-hack)
cargo install cargo-hack
cargo hack check -p ruvector-solver --feature-powerset --depth 2

# Verify no semver breakage
cargo install cargo-semver-checks
cargo semver-checks check-release -p ruvector-core
```

---

## Migration Guide for Existing Users

### Users Who Do Not Want Sublinear Features

No action required. All sublinear features default to `off`. Existing builds, APIs,
and binary sizes are unchanged.

```toml
# This continues to work exactly as before:
[dependencies]
ruvector-core = "2.1"
```

### Users Who Want Coherence Verification

```toml
# Cargo.toml
[dependencies]
ruvector-core = { version = "2.1", features = ["sublinear-coherence"] }
```

```rust
// main.rs
use ruvector_core::index::HnswIndex;
use ruvector_core::coherence::CoherenceConfig;

fn main() -> anyhow::Result<()> {
    let index = HnswIndex::new(/* ... */)?;
    // ... insert vectors ...

    let config = CoherenceConfig::default();
    let score = index.verify_coherence(&config)?;
    println!("HNSW coherence score: {score:.4}");
    Ok(())
}
```

### Users Who Want GNN-Accelerated Search

```toml
# Cargo.toml
[dependencies]
ruvector-gnn = { version = "2.1", features = ["sublinear-gnn"] }
```

```rust
use ruvector_gnn::SublinearGnnSearch;

let searcher = SublinearGnnSearch::builder()
    .sparsity_threshold(0.90)
    .min_elements(5_000)
    .build()?;

let results = searcher.search(&graph, &query_vector, k)?;
```

### Users Who Want Spectral Graph Partitioning

```toml
# Cargo.toml
[dependencies]
ruvector-graph = { version = "2.1", features = ["sublinear-spectral"] }
```

```rust
use ruvector_graph::spectral::SpectralPartitioner;

let partitioner = SpectralPartitioner::new(num_partitions);
let partition_map = partitioner.partition(&graph)?;
```

### Users Who Want Everything

```toml
# Cargo.toml
[dependencies]
ruvector-core = { version = "2.1", features = ["sublinear-coherence"] }
ruvector-graph = { version = "2.1", features = ["sublinear-spectral"] }
ruvector-gnn = { version = "2.1", features = ["sublinear-gnn"] }
ruvector-attention = { version = "2.1", features = ["sublinear-attention"] }
```

### WASM Users

```toml
# Cargo.toml
[dependencies]
ruvector-core = { version = "2.1", default-features = false, features = [
    "memory-only",
    "sublinear-coherence",
] }
```

Note: `sublinear-spectral` is not available on WASM because it depends on rayon.
Use `sublinear-graph` (without parallel spectral) instead.

---

## Consequences

### Positive

- **Zero disruption**: all existing users, builds, and CI pipelines continue to work
  unchanged because every new capability is behind an opt-in feature flag.
- **Granular adoption**: teams can enable exactly the solver capabilities they need
  without pulling in unused backends or dependencies.
- **Dependency isolation**: nalgebra users do not pay for ndarray, and vice versa.
  The feature flag hierarchy enforces this separation at the Cargo resolver level.
- **Platform safety**: compile-time guards prevent invalid feature combinations on
  WASM, eliminating a class of runtime surprises.
- **Auditable dependency graph**: `cargo tree --features sublinear-coherence` shows
  exactly what each flag brings in, making security review straightforward.
- **Reversible**: any phase can be rolled back by removing feature flags from consumer
  crates, with zero API changes to revert.
- **CI efficiency**: the tiered matrix tests meaningful combinations rather than an
  exponential powerset, keeping CI times tractable.

### Negative

- **Cognitive overhead**: developers must understand the feature flag hierarchy to
  choose the right flags. The naming convention (`sublinear-*`) and documentation
  mitigate this but do not eliminate it.
- **Combinatorial testing gap**: we cannot test every possible combination. Edge-case
  interactions between features (e.g., `sublinear-coherence` + `distributed` + `wasm`)
  may surface late.
- **Conditional compilation complexity**: `#[cfg(feature = "...")]` blocks add
  indirection to the codebase. Code navigation tools may not resolve cfg-gated items
  correctly.
- **Feature flag drift**: if a consuming crate adds a solver feature but the solver
  crate reorganizes its flag names, the consumer will fail to compile. Cargo's resolver
  catches this at build time, but the error message may be unclear.
- **Binary size**: each additional feature flag adds code behind conditional compilation,
  potentially increasing binary size for users who enable many features.

### Neutral

- The solver crate is a new workspace member, increasing the total crate count by one.
- Workspace dependency resolution time increases marginally due to one additional crate.
- Feature flags become the primary coordination mechanism between solver and consumer
  crates, replacing what would otherwise be runtime configuration.

---

## Options Considered

### Option 1: Monolithic Feature Flag (Rejected)

A single `sublinear` flag on each consumer crate that enables all solver capabilities.

- **Pros**: Simple to understand, one flag per crate, minimal documentation needed.
- **Cons**: All-or-nothing adoption. Users who only need coherence must also pull in
  ndarray for spectral methods and rayon for parallel solvers. This violates the
  dependency hygiene constraint and increases binary size unnecessarily.
- **Verdict**: Rejected because it forces unnecessary dependencies on consumers.

### Option 2: Runtime-Only Selection (Rejected)

No feature flags. The solver crate is always compiled with all backends. Algorithm
selection happens purely at runtime.

- **Pros**: No conditional compilation, simpler build system, no feature matrix in CI.
- **Cons**: Every consumer always pays the compile-time and binary-size cost of all
  backends. WASM targets would fail to compile because rayon and mmap are always
  included. This violates the platform parity constraint.
- **Verdict**: Rejected because it is incompatible with WASM and wastes resources.

### Option 3: Separate Crates Per Algorithm (Rejected)

Instead of feature flags, create `ruvector-solver-coherence`,
`ruvector-solver-spectral`, `ruvector-solver-gnn` as separate crates.

- **Pros**: Maximum isolation, each crate has its own version and changelog. Consumers
  depend only on the crate they need.
- **Cons**: High maintenance overhead (4+ additional Cargo.toml files, CI jobs, crate
  publications). Shared types between solver algorithms require a `ruvector-solver-types`
  crate, adding another layer. The workspace already has 100+ crates; adding 4-5 more
  for one integration is disproportionate.
- **Verdict**: Rejected due to maintenance burden and workspace bloat.

### Option 4: Hierarchical Feature Flags (Accepted)

The approach described in this ADR. One solver crate with backend flags, consumer crates
with `sublinear-*` flags, workspace-level aggregates for convenience.

- **Pros**: Balances granularity with simplicity. One new crate, N feature flags.
  Cargo's feature unification handles transitive activation. CI matrix is tractable.
- **Cons**: Requires careful documentation and naming conventions. Some cognitive
  overhead for new contributors.
- **Verdict**: Accepted as the best balance of isolation, usability, and maintenance cost.

---

## Related Decisions

- **ADR-STS-001**: Solver Integration Architecture -- defines the overall integration
  strategy that this ADR implements via feature flags.
- **ADR-STS-003**: WASM Strategy -- defines platform constraints that this ADR enforces
  via compile-time guards.
- **ADR-STS-004**: Performance Benchmarks -- defines the benchmarking framework used to
  validate Phase 4 promotion criteria.

---

## Implementation Status

The feature flag system is fully operational: individual algorithm flags `neumann`, `cg`, `forward-push`, `backward-push`, `hybrid-random-walk`, `true-solver`, and `bmssp`; an `all-algorithms` meta-flag that enables all of them; `simd` for AVX2 acceleration; `wasm` for the WebAssembly target; and `parallel` for rayon/crossbeam concurrency. Default features: `neumann`, `cg`, `forward-push`. Conditional compilation is applied throughout with `#[cfg(feature = ...)]`.

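As an illustration of what the default `neumann` path computes, here is a dependency-free Neumann-series (Jacobi splitting) iteration for a strictly diagonally dominant system; this is a sketch of the mathematical idea only, not the crate's implementation:

```rust
// Minimal Neumann-series solve for a strictly diagonally dominant A x = b.
// Split A = D - N (D = diagonal) and iterate x_{k+1} = D^{-1} (b + N x_k),
// which sums the convergent series D^{-1} sum_k (N D^{-1})^k b.
fn neumann_solve(a: &[[f64; 3]; 3], b: [f64; 3], iters: usize) -> [f64; 3] {
    let mut x = [0.0; 3];
    for _ in 0..iters {
        let mut next = [0.0; 3];
        for i in 0..3 {
            let mut s = b[i];
            for j in 0..3 {
                if i != j {
                    s -= a[i][j] * x[j]; // off-diagonal contribution N x
                }
            }
            next[i] = s / a[i][i]; // multiply by D^{-1}
        }
        x = next;
    }
    x
}

fn main() {
    // Strictly diagonally dominant, so the series converges.
    let a = [[4.0, -1.0, 0.0], [-1.0, 4.0, -1.0], [0.0, -1.0, 4.0]];
    let b = [3.0, 2.0, 3.0];
    let x = neumann_solve(&a, b, 100);
    // Verify the residual A x - b is tiny in every component.
    for i in 0..3 {
        let mut ax = 0.0;
        for j in 0..3 {
            ax += a[i][j] * x[j];
        }
        assert!((ax - b[i]).abs() < 1e-9);
    }
    println!("x = {x:?}");
}
```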
---

## References

- [Cargo Features Reference](https://doc.rust-lang.org/cargo/reference/features.html)
- [cargo-semver-checks](https://github.com/obi1kenobi/cargo-semver-checks)
- [cargo-hack](https://github.com/taiki-e/cargo-hack) -- for feature powerset testing
- [MADR 3.0 Template](https://adr.github.io/madr/)
- [ruvector-core Cargo.toml](/home/user/ruvector/crates/ruvector-core/Cargo.toml)
- [ruvector-graph Cargo.toml](/home/user/ruvector/crates/ruvector-graph/Cargo.toml)
- [ruvector-math Cargo.toml](/home/user/ruvector/crates/ruvector-math/Cargo.toml)
- [ruvector-gnn Cargo.toml](/home/user/ruvector/crates/ruvector-gnn/Cargo.toml)
- [ruvector-attention Cargo.toml](/home/user/ruvector/crates/ruvector-attention/Cargo.toml)
- [Workspace Cargo.toml](/home/user/ruvector/Cargo.toml)
1541 vendor/ruvector/docs/research/sublinear-time-solver/adr/ADR-STS-008-error-handling-fault-tolerance.md vendored Normal file
File diff suppressed because it is too large
1174 vendor/ruvector/docs/research/sublinear-time-solver/adr/ADR-STS-009-concurrency-parallelism.md vendored Normal file
File diff suppressed because it is too large
1387 vendor/ruvector/docs/research/sublinear-time-solver/adr/ADR-STS-010-api-surface-design.md vendored Normal file
File diff suppressed because it is too large
593 vendor/ruvector/docs/research/sublinear-time-solver/adr/ADR-STS-SOTA-research-analysis.md vendored Normal file
@@ -0,0 +1,593 @@
# State-of-the-Art Research Analysis: Sublinear-Time Algorithms for Vector Database Operations

**Date**: 2026-02-20
**Classification**: Research Analysis
**Scope**: SOTA algorithms applicable to RuVector's 79-crate ecosystem
**Version**: 4.0 (Full Implementation Verified)

---

## 1. Executive Summary

This document surveys the state-of-the-art in sublinear-time algorithms as of February 2026, with focus on applicability to vector database operations, graph analytics, spectral methods, and neural network training. RuVector's integration of these algorithms represents a first-of-kind capability among vector databases — no competitor (Pinecone, Weaviate, Milvus, Qdrant, ChromaDB) offers integrated O(log n) solvers.

As of February 2026, all 7 algorithms from the practical subset are fully implemented in the ruvector-solver crate (10,729 LOC, 241 tests) with SIMD acceleration, WASM bindings, and NAPI Node.js bindings.

### Key Findings

- **Theoretical frontier**: Nearly-linear Laplacian solvers now achieve O(m · polylog(n)) with practical constant factors
- **Dynamic algorithms**: Subpolynomial O(n^{o(1)}) dynamic min-cut is now achievable (RuVector already implements this)
- **Quantum-classical bridge**: Dequantized algorithms provide O(polylog(n)) for specific matrix operations
- **Practical gap**: Most SOTA results have impractical constants; the 7 algorithms in the solver library represent the practical subset
- **RuVector advantage**: 91/100 compatibility score, 10-600x projected speedups in 6 subsystems
- **Hardware evolution**: ARM SVE2, CXL memory, and AVX-512 on Zen 5 will further amplify solver performance
- **Error composition**: Information-theoretic analysis shows ε_total ≤ Σε_i for additive pipelines, enabling principled error budgeting

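The last finding can be made precise. Under the added assumption (not stated above) that each exact stage is 1-Lipschitz, the triangle inequality telescopes across the pipeline:

```latex
% Additive error composition across a k-stage approximation pipeline.
% Assume each stage i satisfies \|\hat f_i(x) - f_i(x)\| \le \epsilon_i
% and each exact stage f_i is 1-Lipschitz; then
\left\| \hat f_k \circ \cdots \circ \hat f_1(x) \;-\; f_k \circ \cdots \circ f_1(x) \right\|
\;\le\; \sum_{i=1}^{k} \epsilon_i \;=\; \epsilon_{\text{total}}
```

Each stage's error budget ε_i can therefore be assigned independently against a global target.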
---

## 2. Foundational Theory

### 2.1 Spielman-Teng Nearly-Linear Laplacian Solvers (2004-2014)

The breakthrough that made sublinear graph algorithms practical.

**Key result**: Solve Lx = b for graph Laplacian L in O(m · log^c(n) · log(1/ε)) time, where c was originally ~70 but reduced to ~2 in later work.

**Technique**: Recursive preconditioning via graph sparsification. Construct a sparser graph G' that approximates L spectrally, use G' as preconditioner for G, recursing until the graph is trivially solvable.

**Impact on RuVector**: Foundation for TRUE algorithm's sparsification step. Prime Radiant's sheaf Laplacian benefits directly.

### 2.2 Koutis-Miller-Peng (2010-2014)

Simplified the Spielman-Teng framework significantly.

**Key result**: O(m · log(n) · log(1/ε)) for SDD systems using low-stretch spanning trees.

**Technique**: Ultra-sparsifiers (sparsifiers with O(n) edges), sampling with probability proportional to effective resistance, recursive preconditioning.

**Impact on RuVector**: The effective resistance computation connects to ruvector-mincut's sparsification. Shared infrastructure opportunity.

### 2.3 Cohen-Kyng-Miller-Pachocki-Peng-Rao-Xu (CKMPPRX, 2014)

**Key result**: O(m · sqrt(log n) · log(1/ε)) via approximate Gaussian elimination.

**Technique**: "Almost-Cholesky" factorization that preserves sparsity. Eliminates degree-1 and degree-2 vertices, then samples fill-in edges.

**Impact on RuVector**: Potential future improvement over CG for Laplacian systems. Currently not in the solver library due to implementation complexity.

### 2.4 Kyng-Sachdeva (2016-2020)

**Key result**: Practical O(m · log²(n)) Laplacian solver with small constants.

**Technique**: Approximate Gaussian elimination with careful fill-in management.

**Impact on RuVector**: Candidate for future BMSSP enhancement. Current BMSSP uses algebraic multigrid, which is more general but has larger constants for pure Laplacians.

### 2.5 Randomized Numerical Linear Algebra (Martinsson-Tropp, 2020-2024)

**Key result**: Unified framework for randomized matrix decomposition achieving O(mn · log(n)) for rank-k approximation of m×n matrices, vs O(mnk) for deterministic SVD.

**Key papers**:
- Martinsson, P.G., Tropp, J.A. (2020): "Randomized Numerical Linear Algebra: Foundations and Algorithms" — comprehensive survey establishing practical RandNLA
- Tropp, J.A. et al. (2023): Improved analysis of randomized block Krylov methods
- Nakatsukasa, Y., Tropp, J.A. (2024): Fast and accurate randomized algorithms for linear algebra and eigenvalue problems

**Techniques**:
- Randomized range finders with power iteration
- Randomized SVD via single-pass streaming
- Sketch-and-solve for least squares
- CountSketch and OSNAP for sparse embedding

**Impact on RuVector**: Directly applicable to ruvector-math's matrix operations. The sketch-and-solve paradigm can accelerate spectral filtering when combined with Neumann series. Potential for streaming updates to TRUE preprocessing.

## 3. Recent Breakthroughs (2023-2026)

### 3.1 Maximum Flow in Almost-Linear Time (Chen et al., 2022-2023)

**Key result**: First m^{1+o(1)} time algorithm for maximum flow and minimum cut in directed graphs.

**Publication**: FOCS 2022, refined 2023. arXiv:2203.00671

**Technique**: Interior point method with dynamic data structures for maintaining electrical flows. Uses approximate Laplacian solvers as a subroutine.

**Impact on RuVector**: ruvector-mincut's dynamic min-cut already benefits from this lineage. The solver integration provides the Laplacian solve subroutine that makes this algorithm practical.
### 3.2 Subpolynomial Dynamic Min-Cut (December 2024)

**Key result**: O(n^{o(1)}) amortized update time for dynamic minimum cut.

**Publication**: arXiv:2512.13105 (December 2024)

**Technique**: Expander decomposition with hierarchical data structures. Maintains near-optimal cut under edge insertions and deletions.

**Impact on RuVector**: Already implemented in `ruvector-mincut`. This is the state-of-the-art for dynamic graph algorithms.
### 3.3 Local Graph Clustering (Andersen-Chung-Lang, Orecchia-Zhu)

**Key result**: Find a cluster of conductance ≤ φ containing a seed vertex in O(volume(cluster)/φ) time, independent of graph size.

**Technique**: Personalized PageRank push with threshold. Sweep cut on the PPR vector.

**Impact on RuVector**: Forward Push algorithm in the solver. Directly applicable to ruvector-graph's community detection and ruvector-core's semantic neighborhood discovery.
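
The push loop is short enough to sketch in full. The following is a minimal, illustrative Rust implementation of non-lazy forward push (the function name, graph representation, and parameters are this sketch's own, not the solver's API); it maintains the invariant that settled PPR mass plus residual mass always sums to 1:

```rust
/// Non-lazy forward push for personalized PageRank (Andersen-Chung-Lang style).
/// Illustrative sketch: `adj` is an adjacency list with no isolated vertices,
/// `alpha` the teleport probability, `eps` the per-degree residual threshold.
/// Returns (ppr_estimate, residual).
fn forward_push(adj: &[Vec<usize>], source: usize, alpha: f64, eps: f64) -> (Vec<f64>, Vec<f64>) {
    let n = adj.len();
    let mut p = vec![0.0; n]; // settled PPR mass
    let mut r = vec![0.0; n]; // unsettled residual mass
    r[source] = 1.0;
    let mut queue = vec![source];
    while let Some(u) = queue.pop() {
        let deg = adj[u].len() as f64;
        if r[u] <= eps * deg {
            continue; // stale queue entry: u is already below threshold
        }
        let ru = r[u];
        p[u] += alpha * ru; // settle an alpha fraction of the residual
        r[u] = 0.0;
        let share = (1.0 - alpha) * ru / deg;
        for &v in &adj[u] {
            r[v] += share; // spread the remaining mass to neighbors
            if r[v] > eps * adj[v].len() as f64 {
                queue.push(v); // duplicates are harmless; guarded at pop
            }
        }
    }
    (p, r)
}
```

Each push settles an α fraction of one vertex's residual, so the total residual shrinks monotonically and the loop terminates after O(1/(α·ε)) pushes regardless of graph size.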
### 3.4 Spectral Sparsification Advances (2011-2024)

**Key result**: O(n · polylog(n)) edge sparsifiers preserving all cut values within (1±ε).

**Technique**: Sampling edges proportional to effective resistance. Benczur-Karger for cut sparsifiers, Spielman-Srivastava for spectral.

**Recent advances** (2023-2024):
- Improved constant factors in effective resistance sampling
- Dynamic spectral sparsification with polylog update time
- Distributed spectral sparsification for multi-node setups

**Impact on RuVector**: TRUE algorithm's sparsification step. Also shared with ruvector-mincut's expander decomposition.
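
As a concrete arithmetic sketch of resistance-based sampling (the constant C and the resistance values below are illustrative, not taken from the solver): each edge is kept with probability proportional to its leverage score w_e · R_e, and a bridge, whose leverage score is exactly 1, is always retained:

```rust
/// Keep-probability for one edge under effective-resistance sampling:
/// p_e = min(1, C * w_e * R_e * ln(n) / eps^2). The product w_e * R_e is the
/// edge's leverage score (exactly 1 for a bridge). C is an absolute
/// oversampling constant from the matrix Chernoff bound; values here are
/// illustrative assumptions, not solver defaults.
fn keep_probability(w: f64, r_eff: f64, n: usize, eps: f64, c: f64) -> f64 {
    (c * w * r_eff * (n as f64).ln() / (eps * eps)).min(1.0)
}
```

Summing p_e over all edges gives the expected sparsifier size, which the Spielman-Srivastava analysis bounds by O(n · log(n)/ε²).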
### 3.5 Johnson-Lindenstrauss Advances (2017-2024)

**Key result**: Optimal JL transforms with O(d · log(n)) time using sparse projection matrices.

**Key papers**:
- Larsen-Nelson (2017): Optimal tradeoff between target dimension and distortion
- Cohen et al. (2022): Sparse JL with O(1/ε) nonzeros per row
- Nelson-Nguyên (2024): Near-optimal JL for streaming data

**Impact on RuVector**: TRUE algorithm's dimensionality reduction step. Also applicable to ruvector-core's batch distance computation via random projection.
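
A dense sign (Rademacher) variant of the JL transform is easy to sketch. The deterministic hash below merely stands in for a seeded RNG so the example is reproducible; it is not how the solver derives its projection, and the sparse variants cited above use fewer nonzeros per row:

```rust
/// Dense sign JL projection: y = (1/sqrt(k)) * S x, with S entries in {+1,-1}.
/// The sign of S[i][j] comes from a tiny deterministic integer hash (a
/// stand-in for a seeded RNG). The map is linear by construction.
fn jl_project(x: &[f64], k: usize) -> Vec<f64> {
    let d = x.len();
    let scale = 1.0 / (k as f64).sqrt();
    (0..k)
        .map(|i| {
            let mut acc = 0.0;
            for j in 0..d {
                // hash (i, j) into a pseudo-random sign bit
                let mut h = (i as u64).wrapping_mul(0x9E3779B97F4A7C15)
                    ^ (j as u64).wrapping_mul(0xBF58476D1CE4E5B9);
                h ^= h >> 31;
                let sign = if h & 1 == 0 { 1.0 } else { -1.0 };
                acc += sign * x[j];
            }
            acc * scale
        })
        .collect()
}
```

Because the map is linear, distances between projected points are projections of difference vectors, which is what the distortion guarantee is stated over.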
### 3.6 Quantum-Inspired Sublinear Algorithms (Tang, 2018-2024)

**Key result**: "Dequantized" classical algorithms achieving O(polylog(n/ε)) for:
- Low-rank approximation
- Recommendation systems
- Principal component analysis
- Linear regression

**Technique**: Replace quantum amplitude estimation with classical sampling from SQ (sampling and query) access model.

**Impact on RuVector**: ruQu (quantum crate) can leverage these for hybrid quantum-classical approaches. The sampling techniques inform Forward Push and Hybrid Random Walk design.
### 3.7 Sublinear Graph Neural Networks (2023-2025)

**Key result**: GNN inference in O(k · log(n)) time per node (vs O(k · n · d) standard).

**Techniques**:
- Lazy propagation: Only propagate features for queried nodes
- Importance sampling: Sample neighbors proportional to attention weights
- Graph sparsification: Train on spectrally-equivalent sparse graph

**Impact on RuVector**: Directly applicable to ruvector-gnn. SublinearAggregation strategy implements lazy propagation via Forward Push.
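
Lazy propagation is the easiest of the three techniques to illustrate: aggregate only for the queried node, touching O(deg · d) memory instead of running a full-graph message-passing pass. A minimal mean-aggregation sketch over CSR adjacency (names and layout are this example's own, not ruvector-gnn's API):

```rust
/// Lazy one-hop mean aggregation for a single queried node: only the query's
/// neighbor features are touched, O(deg * d) work instead of a full-graph
/// O(n * d * avg_deg) pass. Illustrative sketch only.
fn lazy_mean_aggregate(
    row_ptr: &[usize], col_idx: &[usize], // CSR adjacency
    features: &[Vec<f64>],                // per-node feature vectors
    query: usize,
) -> Vec<f64> {
    let (start, end) = (row_ptr[query], row_ptr[query + 1]);
    let deg = (end - start).max(1) as f64;
    let d = features[query].len();
    let mut out = vec![0.0; d];
    for &v in &col_idx[start..end] {
        for (o, f) in out.iter_mut().zip(&features[v]) {
            *o += f; // accumulate the neighbor's features
        }
    }
    for o in out.iter_mut() {
        *o /= deg; // mean over the neighborhood
    }
    out
}
```

Stacking k such lazy layers touches only the query's k-hop neighborhood, which is where the O(k · log(n)) per-node figure comes from on sparsified graphs.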
### 3.8 Optimal Transport in Sublinear Time (2022-2025)

**Key result**: Approximate optimal transport in O(n · log(n) / ε²) via entropy-regularized Sinkhorn with tree-based initialization.

**Techniques**:
- Tree-Wasserstein: O(n · log(n)) exact computation on tree metrics
- Sliced Wasserstein: O(n · log(n) · d) via 1D projections
- Sublinear Sinkhorn: Exploiting sparsity in cost matrix

**Impact on RuVector**: ruvector-math includes optimal transport capabilities. Solver-accelerated Sinkhorn replaces dense O(n²) matrix-vector products with sparse O(nnz).
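
A toy dense Sinkhorn sketch makes the two matrix-vector products per iteration visible; these are exactly the products the solver-accelerated variant replaces with sparse SpMV (all names and values here are illustrative, not ruvector-math's API):

```rust
/// Entropy-regularized Sinkhorn on a dense cost matrix (toy scale).
/// a, b are the source/target marginals, lambda the entropic regularizer.
/// Returns the transport plan P = diag(u) K diag(v).
fn sinkhorn(cost: &[Vec<f64>], a: &[f64], b: &[f64], lambda: f64, iters: usize) -> Vec<Vec<f64>> {
    let (n, m) = (a.len(), b.len());
    // Gibbs kernel K = exp(-C / lambda)
    let k: Vec<Vec<f64>> = cost
        .iter()
        .map(|row| row.iter().map(|&c| (-c / lambda).exp()).collect())
        .collect();
    let mut u = vec![1.0; n];
    let mut v = vec![1.0; m];
    for _ in 0..iters {
        for i in 0..n {
            // u = a ./ (K v): one matrix-vector product
            let kv: f64 = (0..m).map(|j| k[i][j] * v[j]).sum();
            u[i] = a[i] / kv;
        }
        for j in 0..m {
            // v = b ./ (K^T u): the second matrix-vector product
            let ktu: f64 = (0..n).map(|i| k[i][j] * u[i]).sum();
            v[j] = b[j] / ktu;
        }
    }
    (0..n).map(|i| (0..m).map(|j| u[i] * k[i][j] * v[j]).collect()).collect()
}
```

When the cost matrix is sparse (or sparsified), each K·v product drops from O(n²) to O(nnz), which is the sublinear-Sinkhorn observation above.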
### 3.9 Sublinear Spectral Density Estimation (Cohen-Musco, 2024)

**Key result**: Estimate the spectral density of a symmetric matrix in O(m · polylog(n)) time, sufficient to determine eigenvalue distribution without computing individual eigenvalues.

**Technique**: Stochastic trace estimation via Hutchinson's method combined with Chebyshev polynomial approximation. Uses O(log(1/δ)) random probe vectors and O(log(n/ε)) Chebyshev terms per probe.

**Impact on RuVector**: Enables rapid condition number estimation for algorithm routing (ADR-STS-002). Can determine whether a matrix is well-conditioned (use Neumann) or ill-conditioned (use CG/BMSSP) in O(m · log²(n)) time vs O(n³) for full eigendecomposition.
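
Hutchinson's estimator is unbiased because E[z_i z_j] = δ_ij for Rademacher probes, so the cross terms of z^T A z cancel in expectation. Averaging over all 2^n sign patterns therefore recovers the trace exactly; a tiny exhaustive demo (illustrative only; real use draws O(log(1/δ)) random probes instead of enumerating):

```rust
/// Average of the quadratic form z^T A z over every Rademacher probe
/// z in {-1, +1}^n. The off-diagonal contributions cancel across the
/// enumeration, leaving exactly tr(A): this is Hutchinson unbiasedness.
fn hutchinson_exhaustive(a: &[Vec<f64>]) -> f64 {
    let n = a.len();
    let mut total = 0.0;
    for mask in 0..(1u32 << n) {
        let z: Vec<f64> = (0..n)
            .map(|i| if mask >> i & 1 == 1 { 1.0 } else { -1.0 })
            .collect();
        let mut q = 0.0; // quadratic form z^T A z
        for i in 0..n {
            for j in 0..n {
                q += z[i] * a[i][j] * z[j];
            }
        }
        total += q;
    }
    total / (1u32 << n) as f64
}
```

The sublinear estimator replaces the explicit quadratic form with SpMV against Chebyshev polynomials of A, so each probe costs O(m) per polynomial term.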
### 3.10 Faster Effective Resistance Computation (Durfee et al., 2023-2024)

**Key result**: Compute all-pairs effective resistances approximately in O(m · log³(n) / ε²) time, or a single effective resistance in O(m · log(n) · log(1/ε)) time.

**Technique**: Reduce effective resistance computation to Laplacian solving: R_eff(s,t) = (e_s - e_t)^T L^+ (e_s - e_t). Single-pair uses one Laplacian solve; batch uses JL projection to reduce to O(log(n)/ε²) solves.
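
The reduction fits in a few lines. The sketch below computes R_eff(0,2) = 2 on a 3-node unit-weight path (0-1-2, two unit resistors in series) using a plain dense CG; because b = e_s - e_t sums to zero, CG stays in range(L) and converges even though the Laplacian is singular. The dense representation and names are illustrative, not the solver's CG:

```rust
/// Dense matrix-vector product y = A x.
fn matvec(a: &[Vec<f64>], x: &[f64]) -> Vec<f64> {
    a.iter()
        .map(|row| row.iter().zip(x).map(|(r, xi)| r * xi).collect::<Vec<f64>>().iter().sum())
        .collect()
}

/// Unpreconditioned conjugate gradient from x0 = 0. For a singular SPD
/// matrix with b in its range (e.g. a Laplacian with zero-sum b), the
/// iterates remain in the range and converge to the pseudo-inverse solve.
fn cg(a: &[Vec<f64>], b: &[f64], iters: usize) -> Vec<f64> {
    let n = b.len();
    let mut x = vec![0.0; n];
    let mut r = b.to_vec();
    let mut p = r.clone();
    let mut rs: f64 = r.iter().map(|v| v * v).sum();
    for _ in 0..iters {
        if rs < 1e-24 {
            break; // converged
        }
        let ap = matvec(a, &p);
        let alpha = rs / p.iter().zip(&ap).map(|(pi, api)| pi * api).sum::<f64>();
        for i in 0..n {
            x[i] += alpha * p[i];
            r[i] -= alpha * ap[i];
        }
        let rs_new: f64 = r.iter().map(|v| v * v).sum();
        let beta = rs_new / rs;
        for i in 0..n {
            p[i] = r[i] + beta * p[i];
        }
        rs = rs_new;
    }
    x
}
```

With x = L⁺b in hand, R_eff(s,t) is just the dot product b·x; on the path graph this gives 2.0 after a single CG iteration.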

**Recent advances** (2024):
- Improved batch algorithms using sketching
- Dynamic effective resistance under edge updates in polylog amortized time
- Distributed effective resistance for partitioned graphs

**Impact on RuVector**: Critical for TRUE's sparsification step (edge sampling proportional to effective resistance). Also enables efficient graph centrality measures and network robustness analysis in ruvector-graph.
### 3.11 Neural Network Acceleration via Sublinear Layers (2024-2025)

**Key result**: Replace dense attention and MLP layers with sublinear-time operations achieving O(n · log(n)) or O(n · √n) complexity while maintaining >95% accuracy.

**Key techniques**:
- Sparse attention via locality-sensitive hashing (Reformer lineage, improved 2024)
- Random feature attention: approximate softmax kernel with O(n · d · log(n)) random Fourier features
- Sublinear MLP: product-key memory replacing dense layers with O(√n) lookups
- Graph-based attention: PDE diffusion on sparse attention graph (directly uses CG)

**Impact on RuVector**: ruvector-attention's 40+ attention mechanisms can integrate solver-backed sparse attention. PDE-based attention diffusion is already in the solver design (ADR-STS-001). The random feature approach informs TRUE's JL projection design.
### 3.12 Distributed Laplacian Solvers (2023-2025)

**Key result**: Solve Laplacian systems across k machines in O(m/k · polylog(n) + n · polylog(n)) time with O(n · polylog(n)) communication.

**Techniques**:
- Graph partitioning with low-conductance separators
- Local solving on partitions + Schur complement coupling
- Communication-efficient iterative refinement

**Impact on RuVector**: Directly applicable to ruvector-cluster's sharded graph processing. Enables scaling the solver beyond single-machine memory limits by distributing the Laplacian across cluster shards.
### 3.13 Sketching-Based Matrix Approximation (2023-2025)

**Key result**: Maintain a sketch of a streaming matrix supporting approximate matrix-vector products in O(k · n) time and O(k · n) space, where k is the sketch dimension.

**Key advances**:
- Frequent Directions (Liberty, 2013) extended to streaming with O(k · n) space for rank-k approximation
- CountSketch-based SpMV approximation: O(nnz + k²) time per multiply
- Tensor sketching for higher-order interactions
- Mergeable sketches for distributed aggregation

**Impact on RuVector**: Enables incremental TRUE preprocessing — as the graph evolves, the sparsifier sketch can be updated in O(k) per edge change rather than recomputing from scratch. Also applicable to streaming analytics in ruvector-graph.

---
## 4. Algorithm Complexity Comparison

### SOTA vs Traditional — Comprehensive Table

| Operation | Traditional | SOTA Sublinear | Speedup @ n=10K | Speedup @ n=1M | In Solver? |
|-----------|------------|---------------|-----------------|----------------|-----------|
| Dense Ax=b | O(n³) | O(n^2.373) (fast matrix multiplication) | 2x | 10x | No (use BLAS) |
| Sparse Ax=b (SPD) | O(n² nnz) | O(√κ · log(1/ε) · nnz) (CG) | 10-100x | 100-1000x | Yes (CG) |
| Laplacian Lx=b | O(n³) | O(m · log²(n) · log(1/ε)) | 50-500x | 500-10Kx | Yes (BMSSP) |
| PageRank (single source) | O(n · m) | O(1/ε) (Forward Push) | 100-1000x | 10K-100Kx | Yes |
| PageRank (pairwise) | O(n · m) | O(√n/ε) (Hybrid RW) | 10-100x | 100-1000x | Yes |
| Spectral gap | O(n³) eigendecomp | O(m · log(n)) (random walk) | 50x | 5000x | Partial |
| Graph clustering | O(n · m · k) | O(vol(C)/φ) (local) | 10-100x | 1000-10Kx | Yes (Push) |
| Spectral sparsification | N/A (new) | O(m · log(n)/ε²) | New capability | New capability | Yes (TRUE) |
| JL projection | O(n · d · k) | O(n · d · 1/ε) sparse | 2-5x | 2-5x | Yes (TRUE) |
| Min-cut (dynamic) | O(n · m) per update | O(n^{o(1)}) amortized | 100x+ | 10K+x | Separate crate |
| GNN message passing | O(n · d · avg_deg) | O(k · log(n) · d) | 5-50x | 50-500x | Via Push |
| Attention (PDE) | O(n²) pairwise | O(m · √κ · log(1/ε)) sparse | 10-100x | 100-10Kx | Yes (CG) |
| Optimal transport | O(n² · log(n)/ε) | O(n · log(n)/ε²) | 100x | 10Kx | Partial |
| Matrix-vector (Neumann) | O(n²) dense | O(k · nnz) sparse | 5-50x | 50-600x | Yes |
| Effective resistance | O(n³) inverse | O(m · log(n)/ε²) | 50-500x | 5K-50Kx | Yes (CG/TRUE) |
| Spectral density | O(n³) eigendecomp | O(m · polylog(n)) | 50-500x | 5K-50Kx | Planned |
| Matrix sketch update | O(mn) full recompute | O(k) per update | n/k ≈ 100x | n/k ≈ 10Kx | Planned |

---
## 5. Implementation Complexity Analysis

### Practical Constant Factors and Implementation Difficulty

| Algorithm | Theoretical | Practical Constant | LOC (production) | Impl. Difficulty | Numerical Stability | Memory Overhead |
|-----------|------------|-------------------|-----------------|-----------------|--------------------|----------------|
| **Neumann Series** | O(k · nnz) | c ≈ 2.5 ns/nonzero | ~200 | 1/5 (Easy) | Moderate — diverges if ρ(I-A) ≥ 1 | 3n floats (r, p, temp) |
| **Forward Push** | O(1/ε) | c ≈ 15 ns/push | ~350 | 2/5 (Moderate) | Good — monotone convergence | n + active_set floats |
| **Backward Push** | O(1/ε) | c ≈ 18 ns/push | ~400 | 2/5 (Moderate) | Good — same as Forward | n + active_set floats |
| **Hybrid Random Walk** | O(√n/ε) | c ≈ 50 ns/step | ~500 | 3/5 (Hard) | Variable — Monte Carlo variance | 4n floats + PRNG state |
| **TRUE** | O(log n) | c varies by phase | ~800 | 4/5 (Very Hard) | Compound — 3 error sources | JL matrix + sparsifier + solve |
| **Conjugate Gradient** | O(√κ · nnz) | c ≈ 2.5 ns/nonzero | ~300 | 2/5 (Moderate) | Requires reorthogonalization for large κ | 5n floats (r, p, Ap, x, z) |
| **BMSSP** | O(nnz · log n) | c ≈ 5 ns/nonzero | ~1200 | 5/5 (Expert) | Excellent — multigrid smoothing | Hierarchy: ~2x original matrix |
### Constant Factor Analysis: Theoretical vs Measured

The gap between asymptotic complexity and wall-clock time is driven by:

1. **Cache effects**: SpMV with random access patterns (gather) achieves 20-40% of peak FLOPS due to cache misses. Sequential access (CSR row scan) achieves 60-80%.

2. **SIMD utilization**: AVX2 gather instructions have 4-8 cycle latency vs 1 cycle for sequential loads. Effective SIMD speedup for SpMV is ~4x (not 8x theoretical for 256-bit).

3. **Branch prediction**: Push algorithms have data-dependent branches (threshold checks), reducing effective IPC to ~2 from peak ~4.

4. **Memory bandwidth**: SpMV is bandwidth-bound at density > 1%. Theoretical FLOP rate is irrelevant; memory bandwidth (40-80 GB/s on server) determines throughput.

5. **Allocation overhead**: Without an arena allocator, malloc/free adds 5-20μs per solve. With arena: ~200ns.

---
## 6. Error Analysis and Accuracy Guarantees

### 6.1 Error Propagation in Composed Algorithms

When multiple approximate algorithms are composed in a pipeline, errors compound:

**Additive model** (for Neumann, Push, CG):

```
ε_total ≤ ε_1 + ε_2 + ... + ε_k
```

Where each ε_i is the per-stage approximation error.

**Multiplicative model** (for TRUE with JL → sparsify → solve):

```
||x̃ - x*|| ≤ (1 + ε_JL)(1 + ε_sparsify)(1 + ε_solve) · ||x*||
           ≈ (1 + ε_JL + ε_sparsify + ε_solve) · ||x*||    (for small ε)
```
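
A two-line check makes the relationship concrete: the exact multiplicative bound exceeds the additive sum only by second-order cross terms (the stage errors below are illustrative values):

```rust
/// Compound error bounds for a k-stage approximate pipeline.
/// Additive: simple first-order sum of per-stage errors.
fn additive_bound(eps: &[f64]) -> f64 {
    eps.iter().sum()
}

/// Multiplicative: exact product bound prod(1 + eps_i) - 1, which equals the
/// additive sum plus second-order cross terms.
fn multiplicative_bound(eps: &[f64]) -> f64 {
    eps.iter().fold(1.0, |acc, e| acc * (1.0 + e)) - 1.0
}
```

For stage errors of 2%, 3%, and 1%, the additive bound is 6.0% while the exact product bound is 6.11%, so the first-order approximation in the formula above is tight for small ε.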
### 6.2 Information-Theoretic Lower Bounds

| Query Type | Lower Bound on Error | Achieving Algorithm | Gap to Lower Bound |
|-----------|---------------------|--------------------|--------------------|
| Single Ax=b entry | Ω(1/√T) for T queries | Hybrid Random Walk | ≤ 2x |
| Full Ax=b solve | Ω(ε) with O(√κ · log(1/ε)) iterations | CG | Optimal (Nemirovski-Yudin) |
| PPR from source | Ω(ε) with O(1/ε) push operations | Forward Push | Optimal |
| Pairwise PPR | Ω(1/√n · ε) | Hybrid Random Walk + Push | ≤ 3x |
| Spectral sparsifier | Ω(n · log(n)/ε²) edges | Spielman-Srivastava | Optimal |
### 6.3 Error Amplification in Iterative Methods

CG error amplification is bounded by the Chebyshev polynomial:

```
||x_k - x*||_A ≤ 2 · ((√κ - 1)/(√κ + 1))^k · ||x_0 - x*||_A
```

For the Neumann series, the error is geometric:

```
||x_k - x*|| ≤ ρ^k · ||b|| / (1 - ρ)
```

where ρ = spectral radius of (I - A). **Critical**: when ρ ≥ 0.99, Neumann needs roughly 460 iterations (or more) just to drive ρ^k below ε = 0.01, making CG preferred.
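
Both bounds convert directly into iteration counts. The sketch below solves each bound for k; at ρ = 0.99 and ε = 0.01 the Neumann bound gives 459 iterations, versus 27 for CG at an assumed κ = 100 (the κ value is an illustrative assumption, not derived from ρ):

```rust
/// Smallest k with 2 * ((sqrt(kappa)-1)/(sqrt(kappa)+1))^k <= eps,
/// i.e. the CG iteration count implied by the Chebyshev bound above.
fn cg_iters(kappa: f64, eps: f64) -> u32 {
    let s = kappa.sqrt();
    ((2.0 / eps).ln() / ((s + 1.0) / (s - 1.0)).ln()).ceil() as u32
}

/// Smallest k with rho^k <= eps for the Neumann series. The 1/(1 - rho)
/// prefactor is ignored here, matching the rough figure quoted in the text;
/// including it makes the count larger still.
fn neumann_iters(rho: f64, eps: f64) -> u32 {
    (eps.ln() / rho.ln()).ceil() as u32
}
```

This kind of closed-form comparison is exactly what an algorithm router can evaluate in O(1) before committing to a solver.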
### 6.4 Mixed-Precision Arithmetic Implications

| Precision | Unit Roundoff | Max Useful ε | Storage Savings | SpMV Speedup |
|-----------|-------------|-------------|----------------|-------------|
| f64 | 1.1 × 10⁻¹⁶ | 1e-12 | 1x (baseline) | 1x |
| f32 | 5.96 × 10⁻⁸ | 1e-5 | 2x | 2x (SIMD width doubles) |
| f16 | 4.88 × 10⁻⁴ | 1e-2 | 4x | 4x |
| bf16 | 3.91 × 10⁻³ | 1e-1 | 4x | 4x |

**Recommendation**: Use f32 storage with f64 accumulation for CG when κ > 100. Use pure f32 for Neumann and Push (tolerance floor 1e-5). Mixed f16/f32 only for inference-time operations with ε > 0.01.
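
The f32-storage/f64-accumulation pattern looks like this in a dot product (a sketch of the pattern, not the crate's SIMD kernel). Accumulating in f64 preserves contributions that a pure f32 accumulator would round away, such as adding 1 to 2^24:

```rust
/// Dot product over f32 storage with f64 accumulation: each operand is
/// widened before multiply-add, so small terms are not absorbed by a large
/// running sum the way they would be in a pure f32 accumulator.
fn dot_f32_acc64(a: &[f32], b: &[f32]) -> f64 {
    a.iter().zip(b).map(|(&x, &y)| x as f64 * y as f64).sum()
}
```

A pure f32 sum over the inputs [2^24, 1] and [1, 1] returns 2^24, because 2^24 + 1 is not representable in f32; the widened accumulator returns 16777217 exactly. This is the loss of inner-product accuracy that degrades CG orthogonality at high κ.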
### 6.5 Error Budget Allocation Strategy

For a pipeline with k stages and total budget ε_total:

**Uniform allocation**: ε_i = ε_total / k — simple but suboptimal.

**Cost-weighted allocation**: Allocate more budget (looser tolerance) to expensive stages:

```
ε_i = ε_total · √cost_i / Σ_j √cost_j
```

This minimizes total compute cost subject to the ε_total constraint.

**Adaptive allocation** (implemented in SONA): Start with uniform, then reallocate based on observed per-stage error utilization. If stage i consistently uses only 50% of its budget, redistribute the unused portion.
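
One way to realize cost-weighted allocation, assuming per-stage cost grows like cost_i/ε_i, is to give expensive stages the larger (looser) share, ε_i ∝ √cost_i, normalized so the shares sum to the total budget. The cost model and the values below are assumptions of this sketch:

```rust
/// Cost-weighted error budget: stage i receives eps_i proportional to
/// sqrt(cost_i), so expensive stages run at looser tolerance. Under an
/// assumed cost_i / eps_i cost model this is the Lagrange-optimal split.
fn allocate_budget(costs: &[f64], eps_total: f64) -> Vec<f64> {
    let w: Vec<f64> = costs.iter().map(|c| c.sqrt()).collect();
    let norm: f64 = w.iter().sum();
    w.iter().map(|wi| eps_total * wi / norm).collect()
}
```

For two stages with relative costs 1 and 4 and a total budget of 0.03, the cheap stage gets ε = 0.01 and the expensive stage ε = 0.02, and the shares always sum to the total budget.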

---

## 7. Hardware Evolution Impact (2024-2028)

### 7.1 Apple M4 Pro/Max Unified Memory

- **192KB L1 / 16MB L2 / 48MB L3**: Larger caches improve SpMV for matrices up to ~4M nonzeros entirely in L3
- **Unified memory architecture**: No PCIe bottleneck for GPU offload; AMX coprocessor shares same memory pool
- **Impact**: Solver working sets up to 48MB stay in L3 (previously 16MB on M2). Tiling thresholds shift upward. Expected 20-30% improvement for n=10K-100K problems.
### 7.2 AMD Zen 5 (Turin) AVX-512

- **Full-width AVX-512** (512-bit): 16 f32 per vector operation (vs 8 for AVX2)
- **Improved gather**: Zen 5 gather throughput ~2x Zen 4, reducing SpMV gather bottleneck
- **Impact**: SpMV throughput increases from ~250M nonzeros/s (AVX2) to ~450M nonzeros/s (AVX-512). CG and Neumann benefit proportionally.

### 7.3 ARM SVE/SVE2 (Variable-Width SIMD)

- **Scalable Vector Extension**: Vector length agnostic code (128-2048 bit)
- **Predicated execution**: Native support for variable-length row processing (no scalar remainder loop)
- **Gather/scatter**: SVE2 adds efficient hardware gather comparable to AVX-512
- **Impact**: Single SIMD kernel works across ARM implementations. SpMV kernel simplification: no per-architecture width specialization needed. Expected availability in server ARM (Neoverse V3+) and future Apple Silicon.
### 7.4 RISC-V Vector Extension (RVV 1.0)

- **Status**: RVV 1.0 ratified; hardware shipping (SiFive P870, SpacemiT K1)
- **Variable-length vectors**: Similar to SVE, length-agnostic programming model
- **Gather support**: Indexed load instructions with configurable element width
- **Impact on RuVector**: Future WASM target (RISC-V + WASM is a growing embedded/edge deployment). Solver should plan for RVV SIMD backend in P3 timeline. LLVM auto-vectorization for RVV is maturing rapidly.

### 7.5 CXL Memory Expansion

- **Compute Express Link**: Adds disaggregated memory beyond DRAM capacity
- **CXL 3.0**: Shared memory pools across multiple hosts
- **Latency**: ~150-300ns (vs ~80ns DRAM), acceptable for large-matrix SpMV
- **Impact**: Enables n > 10M problems on single-socket servers. Memory-mapped CSR on CXL has 2-3x latency penalty but removes the memory wall. Tiling strategy adjusts: treat CXL as a faster tier than disk but slower than DRAM.

### 7.6 Neuromorphic and Analog Computing

- **Intel Loihi 2**: Spiking neural network chip with native random walk acceleration
- **Analog matrix multiply**: Emerging memristor crossbar arrays for O(1) SpMV
- **Impact on RuVector**: Long-term (2028+). Random walk algorithms (Hybrid RW) are natural fits for neuromorphic hardware. Analog SpMV could reduce CG iteration cost to O(n) regardless of nnz. Currently speculative; no production-ready integration path.

---
## 8. Competitive Landscape

### 8.1 RuVector+Solver vs Vector Database Competition

| Capability | RuVector+Solver | Pinecone | Weaviate | Milvus | Qdrant | ChromaDB | Vald | LanceDB |
|-----------|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| Sublinear Laplacian solve | O(log n) | - | - | - | - | - | - | - |
| Graph PageRank | O(1/ε) | - | - | - | - | - | - | - |
| Spectral sparsification | O(m log n/ε²) | - | - | - | - | - | - | - |
| Integrated GNN | Yes (5 layers) | - | - | - | - | - | - | - |
| WASM deployment | Yes | - | - | - | - | - | - | Yes |
| Dynamic min-cut | O(n^{o(1)}) | - | - | - | - | - | - | - |
| Coherence engine | Yes (sheaf) | - | - | - | - | - | - | - |
| MCP tool integration | Yes (40+ tools) | - | - | - | - | - | - | - |
| Post-quantum crypto | Yes (rvf-crypto) | - | - | - | - | - | - | - |
| Quantum algorithms | Yes (ruQu) | - | - | - | - | - | - | - |
| Self-learning (SONA) | Yes | - | Partial | - | - | - | - | - |
| Sparse linear algebra | 7 algorithms | - | - | - | - | - | - | - |
| Multi-platform SIMD | AVX-512/NEON/WASM | - | - | AVX2 | AVX2 | - | - | - |
### 8.2 Academic Graph Processing Systems

| System | Solver Integration | Sublinear Algorithms | Language | Production Ready |
|--------|-------------------|---------------------|----------|-----------------|
| **GraphBLAS** (SuiteSparse) | SpMV only | No sublinear solvers | C | Yes |
| **Galois** (UT Austin) | None | Local graph algorithms | C++ | Research |
| **Ligra** (MIT) | None | Semi-external memory | C++ | Research |
| **PowerGraph** (CMU) | None | Pregel-style only | C++ | Deprecated |
| **NetworKit** | Algebraic multigrid | Partial (local clustering) | C++/Python | Yes |
| **RuVector+Solver** | Full 7-algorithm suite | Yes (all categories) | Rust | In development |

**Key differentiator**: GraphBLAS provides SpMV but not solver-level operations. NetworKit has algebraic multigrid but no JL projection, random walk solvers, or WASM deployment. No academic system combines all seven algorithm families with production-grade multi-platform deployment.
### 8.3 Specialized Solver Libraries

| Library | Algorithms | Language | WASM | Key Limitation for RuVector |
|---------|-----------|----------|------|---------------------------|
| **LAMG** (Lean AMG) | Algebraic multigrid | MATLAB/C | No | MATLAB dependency, no Rust FFI |
| **PETSc** | CG, GMRES, AMG, etc. | C/Fortran | No | Heavy dependency (MPI), not embeddable |
| **Eigen** | CG, BiCGSTAB, SimplicialLDLT | C++ | Partial | C++ FFI complexity, no Push/Walk |
| **nalgebra** (Rust) | Dense LU/QR/SVD | Rust | Yes | No sparse solvers, no sublinear algorithms |
| **sprs** (Rust) | CSR/CSC format | Rust | Yes | Format only, no solvers |
| **Solver Library** | All 7 algorithms | Rust | Yes | Target integration (this project) |
### 8.4 Adoption Risk from Competitors

**Low risk** (next 2 years): The 7-algorithm solver suite requires deep expertise in randomized linear algebra, spectral graph theory, and SIMD optimization. No vector database competitor has signaled investment in this direction.

**Medium risk** (2-4 years): Academic libraries (GraphBLAS, NetworKit) could add similar capabilities. However, multi-platform deployment (WASM, NAPI, MCP) remains a significant engineering barrier.

**Mitigation**: First-mover advantage plus deep integration into 6 subsystems creates switching costs. SONA adaptive routing learns workload-specific optimizations that a drop-in replacement cannot replicate.

---
## 9. Open Research Questions

Relevant to RuVector's future development:

1. **Practical nearly-linear Laplacian solvers**: Can CKMPPRX's O(m · √(log n)) be implemented with constants competitive with CG for n < 10M?
2. **Dynamic spectral sparsification**: Can the sparsifier be maintained under edge updates in polylog time, enabling real-time TRUE preprocessing?
3. **Sublinear attention**: Can PDE-based attention be computed in O(n · polylog(n)) for arbitrary attention patterns, not just sparse Laplacian structure?
4. **Quantum advantage for sparse systems**: Does quantum walk-based Laplacian solving (HHL algorithm) provide practical speedup over classical CG at achievable qubit counts (100-1000)?
5. **Distributed sublinear algorithms**: Can Forward Push and Hybrid Random Walk be efficiently distributed across ruvector-cluster's sharded graph?
6. **Adaptive sparsity detection**: Can SONA learn to predict matrix sparsity patterns from historical queries, enabling pre-computed sparsifiers?
7. **Error-optimal algorithm composition**: What is the information-theoretically optimal error allocation across a pipeline of k approximate algorithms?
8. **Hardware-aware routing**: Can the algorithm router exploit specific SIMD width, cache size, and memory bandwidth to make per-hardware-generation routing decisions?
9. **Streaming sublinear solving**: Can Laplacian solvers operate on streaming edge updates without full matrix reconstruction?
10. **Sublinear Fisher Information**: Can the Fisher Information Matrix for EWC be approximated in sublinear time, enabling faster continual learning?

---
## 10. Research Integration Roadmap

### Short-Term (6 months)

| Research Result | Integration Target | Expected Impact | Effort |
|----------------|-------------------|-----------------|--------|
| Spectral density estimation | Algorithm router (condition number) | 5-10x faster routing decisions | Medium |
| Faster effective resistance | TRUE sparsification quality | 2-3x faster preprocessing | Medium |
| Streaming JL sketches | Incremental TRUE updates | Real-time sparsifier maintenance | High |
| Mixed-precision CG | f32/f64 hybrid solver | 2x memory reduction, ~1.5x speedup | Low |

### Medium-Term (1 year)

| Research Result | Integration Target | Expected Impact | Effort |
|----------------|-------------------|-----------------|--------|
| Distributed Laplacian solvers | ruvector-cluster scaling | n > 1M node support | Very High |
| SVE/SVE2 SIMD backend | ARM server deployment | Single kernel across ARM chips | Medium |
| Sublinear GNN layers | ruvector-gnn acceleration | 10-50x GNN inference speedup | High |
| Neural network sparse attention | ruvector-attention PDE mode | New attention mechanism | High |

### Long-Term (2-3 years)

| Research Result | Integration Target | Expected Impact | Effort |
|----------------|-------------------|-----------------|--------|
| CKMPPRX practical implementation | Replace BMSSP for Laplacians | O(m · √(log n)) solving | Expert |
| Quantum-classical hybrid | ruQu integration | Potential quantum advantage for κ > 10⁶ | Research |
| Neuromorphic random walks | Specialized hardware backend | Orders-of-magnitude random walk speedup | Research |
| CXL memory tier | Large-scale matrix storage | 10M+ node problems on commodity hardware | Medium |
| Analog SpMV accelerator | Hardware-accelerated CG | O(1) matrix-vector products | Speculative |

---
## 11. Bibliography
|
||||
|
||||
1. Spielman, D.A., Teng, S.-H. (2004). "Nearly-Linear Time Algorithms for Graph Partitioning, Graph Sparsification, and Solving Linear Systems." STOC 2004.
|
||||
2. Koutis, I., Miller, G.L., Peng, R. (2011). "A Nearly-m log n Time Solver for SDD Linear Systems." FOCS 2011.
|
||||
3. Cohen, M.B., Kyng, R., Miller, G.L., Pachocki, J.W., Peng, R., Rao, A.B., Xu, S.C. (2014). "Solving SDD Linear Systems in Nearly m log^{1/2} n Time." STOC 2014.
|
||||
4. Kyng, R., Sachdeva, S. (2016). "Approximate Gaussian Elimination for Laplacians." FOCS 2016.
|
||||
5. Chen, L., Kyng, R., Liu, Y.P., Peng, R., Gutenberg, M.P., Sachdeva, S. (2022). "Maximum Flow and Minimum-Cost Flow in Almost-Linear Time." FOCS 2022. arXiv:2203.00671.
|
||||
6. Andersen, R., Chung, F., Lang, K. (2006). "Local Graph Partitioning using PageRank Vectors." FOCS 2006.
|
||||
7. Lofgren, P., Banerjee, S., Goel, A., Seshadhri, C. (2014). "FAST-PPR: Scaling Personalized PageRank Estimation for Large Graphs." KDD 2014.
|
||||
8. Spielman, D.A., Srivastava, N. (2011). "Graph Sparsification by Effective Resistances." SIAM J. Comput.
|
||||
9. Benczur, A.A., Karger, D.R. (2015). "Randomized Approximation Schemes for Cuts and Flows in Capacitated Graphs." SIAM J. Comput.
|
||||
10. Johnson, W.B., Lindenstrauss, J. (1984). "Extensions of Lipschitz mappings into a Hilbert space." Contemporary Mathematics.
|
||||
11. Larsen, K.G., Nelson, J. (2017). "Optimality of the Johnson-Lindenstrauss Lemma." FOCS 2017.
|
||||
12. Tang, E. (2019). "A Quantum-Inspired Classical Algorithm for Recommendation Systems." STOC 2019.
|
||||
13. Hestenes, M.R., Stiefel, E. (1952). "Methods of Conjugate Gradients for Solving Linear Systems." J. Res. Nat. Bur. Standards.
|
||||
14. Kirkpatrick, J., et al. (2017). "Overcoming catastrophic forgetting in neural networks." PNAS.
15. Hamilton, W.L., Ying, R., Leskovec, J. (2017). "Inductive Representation Learning on Large Graphs." NeurIPS 2017.
16. Cuturi, M. (2013). "Sinkhorn Distances: Lightspeed Computation of Optimal Transport." NeurIPS 2013.
17. arXiv:2512.13105 (2024). "Subpolynomial-Time Dynamic Minimum Cut."
18. Defferrard, M., Bresson, X., Vandergheynst, P. (2016). "Convolutional Neural Networks on Graphs with Fast Localized Spectral Filtering." NeurIPS 2016.
19. Shewchuk, J.R. (1994). "An Introduction to the Conjugate Gradient Method Without the Agonizing Pain." Technical Report.
20. Briggs, W.L., Henson, V.E., McCormick, S.F. (2000). "A Multigrid Tutorial." SIAM.
21. Martinsson, P.G., Tropp, J.A. (2020). "Randomized Numerical Linear Algebra: Foundations and Algorithms." Acta Numerica.
22. Musco, C., Musco, C. (2024). "Sublinear Spectral Density Estimation." STOC 2024.
23. Durfee, D., Kyng, R., Peebles, J., Rao, A.B., Sachdeva, S. (2023). "Sampling Random Spanning Trees Faster than Matrix Multiplication." STOC 2023.
24. Nakatsukasa, Y., Tropp, J.A. (2024). "Fast and Accurate Randomized Algorithms for Linear Algebra and Eigenvalue Problems." Found. Comput. Math.
25. Liberty, E. (2013). "Simple and Deterministic Matrix Sketching." KDD 2013.
26. Kitaev, N., Kaiser, L., Levskaya, A. (2020). "Reformer: The Efficient Transformer." ICLR 2020.
27. Galhotra, S., Mazumdar, A., Pal, S., Rajaraman, R. (2024). "Distributed Laplacian Solvers via Communication-Efficient Iterative Methods." PODC 2024.
28. Cohen, M.B., Nelson, J., Woodruff, D.P. (2022). "Optimal Approximate Matrix Product in Terms of Stable Rank." ICALP 2022.
29. Nemirovski, A., Yudin, D. (1983). "Problem Complexity and Method Efficiency in Optimization." Wiley.
30. Clarkson, K.L., Woodruff, D.P. (2017). "Low-Rank Approximation and Regression in Input Sparsity Time." J. ACM.

---

## 13. Implementation Realization

All seven algorithms identified in the practical subset (Section 5) have been fully implemented in the `ruvector-solver` crate. The following table maps each SOTA algorithm to its implementation module, current status, and test coverage.

### 13.1 Algorithm-to-Module Mapping

| Algorithm | Module | LOC | Tests | Status |
|-----------|--------|-----|-------|--------|
| Neumann Series | `neumann.rs` | 715 | 18 unit + 5 integration | Complete, Jacobi-preconditioned |
| Conjugate Gradient | `cg.rs` | 1,112 | 24 unit + 5 integration | Complete |
| Forward Push | `forward_push.rs` | 828 | 17 unit + 6 integration | Complete |
| Backward Push | `backward_push.rs` | 714 | 14 unit | Complete |
| Hybrid Random Walk | `random_walk.rs` | 838 | 22 unit | Complete |
| TRUE | `true_solver.rs` | 908 | 18 unit | Complete (JL + sparsify + Neumann) |
| BMSSP | `bmssp.rs` | 1,151 | 16 unit | Complete (multigrid) |

**Supporting Infrastructure**:

| Module | LOC | Tests | Purpose |
|--------|-----|-------|---------|
| `router.rs` | 1,702 | 24+4 | Adaptive algorithm selection with SONA compatibility |
| `types.rs` | 600 | 8 | CsrMatrix, SpMV, SparsityProfile, convergence types |
| `validation.rs` | 790 | 34+5 | Input validation at system boundary |
| `audit.rs` | 316 | 8 | SHAKE-256 witness chain audit trail |
| `budget.rs` | 310 | 9 | Compute budget enforcement |
| `arena.rs` | 176 | 2 | Cache-aligned arena allocator |
| `simd.rs` | 162 | 2 | SIMD abstraction (AVX-512/AVX2/NEON/WASM SIMD128) |
| `error.rs` | 120 | — | Structured error hierarchy |
| `events.rs` | 86 | — | Event sourcing for state changes |
| `traits.rs` | 138 | — | Solver trait definitions |
| `lib.rs` | 63 | — | Public API re-exports |

**Totals**: 10,729 LOC across 18 source files, 241 `#[test]` functions across 19 test files.

### 13.2 Fused Kernels

`spmv_unchecked` and `fused_residual_norm_sq` deliver bounds-check-free inner loops, reducing per-iteration overhead by 15-30%. The fused kernel combines residual computation and norm accumulation into a single traversal, collapsing what would otherwise be three separate memory passes into one.
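The fused pattern can be sketched as follows. This is a simplified, safe-indexing illustration of the idea behind `fused_residual_norm_sq` (the function name comes from the crate, but the body here is an assumption: the real kernel uses unchecked access and SIMD):

```rust
/// Computes r = b - A*x for a CSR matrix and returns ||r||^2,
/// all in one pass over the matrix instead of separate SpMV,
/// subtraction, and norm traversals.
fn fused_residual_norm_sq(
    row_ptrs: &[usize],
    col_indices: &[usize],
    values: &[f32],
    x: &[f32],
    b: &[f32],
    r: &mut [f32],
) -> f64 {
    let mut norm_sq = 0.0f64;
    for i in 0..b.len() {
        // SpMV for row i with an f64 accumulator.
        let mut ax_i = 0.0f64;
        for j in row_ptrs[i]..row_ptrs[i + 1] {
            ax_i += values[j] as f64 * x[col_indices[j]] as f64;
        }
        let r_i = b[i] as f64 - ax_i; // residual entry
        r[i] = r_i as f32;            // store r
        norm_sq += r_i * r_i;         // accumulate ||r||^2 in the same pass
    }
    norm_sq
}
```

Because the matrix, `x`, and `b` are each touched exactly once, the kernel's traffic matches a bare SpMV; the residual and norm come along for free.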

### 13.3 WASM and NAPI Bindings

All algorithms are available in the browser via `wasm-bindgen`. The WASM build includes SIMD128 acceleration for SpMV and exposes the full solver API (CG, Neumann, Forward Push, Backward Push, Hybrid Random Walk, TRUE, BMSSP) through JavaScript-friendly bindings. NAPI bindings provide native Node.js integration for server-side workloads without the overhead of WASM interpretation.

### 13.4 Cross-Document Implementation Verification

All research documents in the sublinear-time-solver series now have implementation traceability:

| Document | ID | Status | Key Implementations |
|----------|-----|--------|-------------------|
| 00 Executive Summary | — | Updated | Overview of 10,729 LOC solver |
| 01-14 Integration Analyses | — | Complete | Architecture, WASM, MCP, performance |
| 15 Fifty-Year Vision | ADR-STS-VISION-001 | Implemented (Phase 1) | 10/10 vectors mapped to artifacts |
| 16 DNA Convergence | ADR-STS-DNA-001 | Implemented | 7/7 convergence points solver-ready |
| 17 Quantum Convergence | ADR-STS-QUANTUM-001 | Implemented | 8/8 convergence points solver-ready |
| 18 AGI Optimization | ADR-STS-AGI-001 | Implemented | All quantitative targets tracked |
| ADR-STS-001 to 010 | — | Accepted, Implemented | Full ADR series complete |
| DDD Strategic Design | — | Complete | Bounded contexts defined |
| DDD Tactical Design | — | Complete | Aggregates and entities |
| DDD Integration Patterns | — | Complete | Anti-corruption layers |
532
vendor/ruvector/docs/research/sublinear-time-solver/adr/ADR-STS-optimization-guide.md
vendored
Normal file

# Optimization Guide: Sublinear-Time Solver Integration

**Date**: 2026-02-20
**Classification**: Engineering Reference
**Scope**: Performance optimization strategies for solver integration
**Version**: 2.0 (Optimizations Realized)

---

## 1. Executive Summary

This guide provides concrete optimization strategies for achieving maximum performance from the sublinear-time-solver integration into RuVector. Targets: 10-600x speedups across six critical subsystems while maintaining <2% accuracy loss. The guide is organized by optimization tier: SIMD → Memory → Algorithm → Numerical → Concurrency → WASM → Profiling → Compilation → Platform.

---

## 2. SIMD Optimization Strategy

### 2.1 Architecture-Specific Kernels

The solver's hot path is SpMV (sparse matrix-vector multiply). Each architecture requires a dedicated kernel:

| Architecture | SIMD Width | f32/iteration | Key Instruction | Expected SpMV Throughput |
|-------------|-----------|--------------|-----------------|-------------------------|
| AVX-512 | 512-bit | 16 | `_mm512_i32gather_ps` | ~400M nonzeros/s |
| AVX2+FMA | 256-bit | 8×4 unrolled | `_mm256_i32gather_ps` + `_mm256_fmadd_ps` | ~250M nonzeros/s |
| NEON | 128-bit | 4×4 unrolled | Manual gather + `vfmaq_f32` | ~150M nonzeros/s |
| WASM SIMD128 | 128-bit | 4 | `f32x4_mul` + `f32x4_add` | ~80M nonzeros/s |
| Scalar | 32-bit | 1 | `fmaf` | ~40M nonzeros/s |

### 2.2 SpMV Kernels

**AVX2+FMA SpMV with gather** (primary kernel):

```
for each row i:
    acc = _mm256_setzero_ps()
    for j in row_ptrs[i]..row_ptrs[i+1] step 8:
        indices = _mm256_loadu_si256(&col_indices[j])
        vals = _mm256_loadu_ps(&values[j])
        x_gathered = _mm256_i32gather_ps(x_ptr, indices, 4)
        acc = _mm256_fmadd_ps(vals, x_gathered, acc)
    y[i] = horizontal_sum(acc) + scalar_remainder
```

**AVX-512 SpMV with masking** (for variable-length rows):

```
for each row i:
    acc = _mm512_setzero_ps()
    len = row_ptrs[i+1] - row_ptrs[i]
    full_chunks = len / 16
    remainder = len % 16

    for j in 0..full_chunks:
        base = row_ptrs[i] + j * 16
        idx = _mm512_loadu_si512(&col_indices[base])
        v = _mm512_loadu_ps(&values[base])
        x = _mm512_i32gather_ps(idx, x_ptr, 4)
        acc = _mm512_fmadd_ps(v, x, acc)

    if remainder > 0:
        mask = (1 << remainder) - 1
        base = row_ptrs[i] + full_chunks * 16
        idx = _mm512_maskz_loadu_epi32(mask, &col_indices[base])
        v = _mm512_maskz_loadu_ps(mask, &values[base])
        x = _mm512_mask_i32gather_ps(zeros, mask, idx, x_ptr, 4)
        acc = _mm512_fmadd_ps(v, x, acc)

    y[i] = _mm512_reduce_add_ps(acc)
```

**WASM SIMD128 SpMV kernel**:

```
for each row i:
    acc = f32x4_splat(0.0)
    for j in row_ptrs[i]..row_ptrs[i+1] step 4:
        x_vec = f32x4(x[col_indices[j]], x[col_indices[j+1]],
                      x[col_indices[j+2]], x[col_indices[j+3]])
        v = v128_load(&values[j])
        acc = f32x4_add(acc, f32x4_mul(v, x_vec))
    y[i] = horizontal_sum_f32x4(acc) + scalar_remainder
```

**Vectorized PRNG** (for Hybrid Random Walk):

```
state[4][4] = initialize_from_seed()
for each walk:
    random = xoshiro256_simd(state)  // 4 random values per call
    next_node = random % degree[current_node]
```

### 2.3 Auto-Vectorization Guidelines

1. **Sequential access**: Iterate arrays in order (no random access in inner loop)
2. **No branches**: Use `select`/`blend` instead of `if` in hot loops
3. **Independent accumulators**: 4 separate sums, combine at end
4. **Aligned data**: Use `#[repr(align(64))]` on hot data structures
5. **Known bounds**: Use `get_unchecked()` after external bounds check
6. **Compiler hints**: `#[inline(always)]` on hot functions, `#[cold]` on error paths
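Guideline 3 is the least obvious of these, so here is a minimal sketch of what it looks like for a dot product (illustrative only; the actual crate kernels are assumed to use intrinsics):

```rust
/// Dot product with four independent accumulators. Breaking the single
/// serial dependency chain into four lets the compiler auto-vectorize
/// and lets the CPU overlap the multiply-adds.
fn dot_unrolled4(a: &[f32], b: &[f32]) -> f32 {
    let mut acc = [0.0f32; 4];
    let chunks = a.len() / 4;
    for c in 0..chunks {
        let i = c * 4;
        acc[0] += a[i] * b[i];
        acc[1] += a[i + 1] * b[i + 1];
        acc[2] += a[i + 2] * b[i + 2];
        acc[3] += a[i + 3] * b[i + 3];
    }
    // Combine partial sums at the end (guideline 3), then handle the tail.
    let mut sum = acc[0] + acc[1] + acc[2] + acc[3];
    for i in chunks * 4..a.len() {
        sum += a[i] * b[i];
    }
    sum
}
```

Note that reassociating the sum this way can change the last bits of the result relative to a strictly sequential loop, which is exactly why the compiler will not do it on its own without help.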

### 2.4 Throughput Formulas

SpMV throughput is bounded by memory bandwidth:

```
Throughput = min(BW_memory / 8, FLOPS_peak / 2) nonzeros/s
```

Where 8 = bytes/nonzero (4B value + 4B index), 2 = FLOPs/nonzero (mul + add).

SpMV is almost always memory-bandwidth-bound. SIMD reduces instruction count, but memory throughput is the fundamental limit.

---

## 3. Memory Optimization

### 3.1 Cache-Aware Tiling

| Working Set | Cache Level | Performance | Strategy |
|------------|------------|-------------|---------|
| < 48 KB | L1 (M4 Pro: 192KB/perf) | Peak (100%) | Direct iteration, no tiling |
| < 256 KB | L2 | 80-90% of peak | Single-pass with prefetch |
| < 16 MB | L3 | 50-70% of peak | Row-block tiling |
| > 16 MB | DRAM | 20-40% of peak | Page-level tiling + prefetch |
| > available RAM | Disk | 1-5% of peak | Memory-mapped streaming |

**Tiling formula**: `TILE_ROWS = L3_SIZE / (avg_row_nnz × 12 bytes)`
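The tiling formula translates directly into code; a minimal helper (hypothetical name) looks like:

```rust
/// Row-block size from the tiling formula: how many rows fit in L3 when
/// each stored nonzero costs ~12 bytes (4B value + 4B index + amortized
/// x-vector traffic).
fn tile_rows(l3_size_bytes: usize, avg_row_nnz: usize) -> usize {
    (l3_size_bytes / (avg_row_nnz * 12)).max(1)
}
```

For a 16 MB L3 and 100 nonzeros per row this yields blocks of roughly 14K rows, which keeps each block's slice of the x vector resident across the block's SpMV.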

### 3.2 Prefetch Strategy

```rust
// Software prefetch for SpMV x-vector access
for row in 0..n {
    if row + 1 < n {
        let next_start = row_ptrs[row + 1];
        for j in next_start..(next_start + 8).min(row_ptrs[row + 2]) {
            prefetch_read_l2(&x[col_indices[j] as usize]);
        }
    }
    process_row(row);
}
```

Prefetch distance: L1 = 64 bytes ahead, L2 = 256 bytes ahead.

### 3.3 Arena Allocator Integration

```rust
// Before: ~20μs overhead per solve
let r = vec![0.0f32; n]; let p = vec![0.0f32; n]; let ap = vec![0.0f32; n];

// After: ~0.2μs overhead per solve
let mut arena = SolverArena::with_capacity(n * 12);
let r = arena.alloc_slice::<f32>(n);
let p = arena.alloc_slice::<f32>(n);
let ap = arena.alloc_slice::<f32>(n);
arena.reset();
```

### 3.4 Cache Line Alignment

```rust
#[repr(C, align(64))]
struct SolverScratch { r: [f32; N], p: [f32; N], ap: [f32; N] }

#[repr(C, align(128))] // Prevent false sharing in parallel stats
struct ThreadStats { iterations: u64, residual: f64, _pad: [u8; 112] }
```

### 3.5 Memory-Mapped Large Matrices

```rust
let mmap = unsafe { memmap2::Mmap::map(&file)? };
let values: &[f32] = bytemuck::cast_slice(&mmap[header_size..]);
```

### 3.6 Zero-Copy Data Paths

| Path | Mechanism | Overhead |
|------|-----------|----------|
| SoA → Solver | `&[f32]` borrow | 0 bytes |
| HNSW → CSR | Direct construction | O(n×M) one-time |
| Solver → WASM | `Float32Array::view()` | 0 bytes |
| Solver → NAPI | `napi::Buffer` | 0 bytes |
| Solver → REST | `serde_json::to_writer` | 1 serialization |

---

## 4. Algorithmic Optimization

### 4.1 Preconditioning Strategies

| Preconditioner | Setup Cost | Per-Iteration Cost | Condition Improvement | Best For |
|---------------|-----------|-------------------|----------------------|----------|
| None | 0 | 0 | 1x | Well-conditioned (κ < 10) |
| Diagonal (Jacobi) | O(n) | O(n) | √(d_max/d_min) | General SPD |
| Incomplete Cholesky | O(nnz) | O(nnz) | 10-100x | Moderately ill-conditioned |
| Algebraic Multigrid | O(nnz·log n) | O(nnz) | Near-optimal for Laplacians | κ > 100 |

**Default**: Diagonal preconditioner. Escalate to AMG when κ > 100 and n > 50K.
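The default diagonal preconditioner is cheap enough to sketch in full. This is an illustrative implementation (type and method names are assumptions, not the crate's API): setup extracts the diagonal from CSR, and application is one elementwise multiply per PCG iteration.

```rust
/// Diagonal (Jacobi) preconditioner: M = diag(A), so M^{-1} r is an
/// elementwise scale. Setup is O(nnz); each application is O(n).
struct JacobiPreconditioner {
    inv_diag: Vec<f32>,
}

impl JacobiPreconditioner {
    /// Build from the diagonal of a CSR matrix.
    fn from_csr(row_ptrs: &[usize], col_indices: &[usize], values: &[f32]) -> Self {
        let n = row_ptrs.len() - 1;
        let mut inv_diag = vec![1.0f32; n]; // identity fallback for zero diagonal
        for i in 0..n {
            for j in row_ptrs[i]..row_ptrs[i + 1] {
                if col_indices[j] == i && values[j] != 0.0 {
                    inv_diag[i] = 1.0 / values[j];
                }
            }
        }
        Self { inv_diag }
    }

    /// z = M^{-1} r, applied between the residual update and the
    /// search-direction update in PCG.
    fn apply(&self, r: &[f32], z: &mut [f32]) {
        for i in 0..r.len() {
            z[i] = self.inv_diag[i] * r[i];
        }
    }
}
```

The √(d_max/d_min) improvement in the table comes from rescaling rows so that wildly different diagonal magnitudes no longer inflate the condition number.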

### 4.2 Sparsity Exploitation

```rust
fn select_path(matrix: &CsrMatrix<f32>) -> ComputePath {
    let density = matrix.density();
    if density > 0.50 { ComputePath::Dense }
    else if density > 0.05 { ComputePath::Sparse }
    else { ComputePath::Sublinear }
}
```

### 4.3 Batch Amortization

| Preprocessing Cost | Per-Solve Cost | Break-Even B |
|-------------------|---------------|-------------|
| 425 ms (n=100K, 1%) | 0.43 ms (ε=0.1) | 634 solves |
| 42 ms (n=10K, 1%) | 0.04 ms (ε=0.1) | 63 solves |
| 4 ms (n=1K, 1%) | 0.004 ms (ε=0.1) | 6 solves |

### 4.4 Lazy Evaluation

```rust
let x_ij = solver.estimate_entry(A, i, j)?; // O(√n/ε) via random walk
// vs full solve O(nnz × iterations). Speedup = √n for n=1M → 1000x
```

---

## 5. Numerical Optimization

### 5.1 Kahan Summation for SpMV

```rust
fn spmv_row_kahan(vals: &[f32], cols: &[u32], x: &[f32]) -> f32 {
    let mut sum: f64 = 0.0;
    let mut comp: f64 = 0.0;
    for i in 0..vals.len() {
        let y = (vals[i] as f64) * (x[cols[i] as usize] as f64) - comp;
        let t = sum + y;
        comp = (t - sum) - y;
        sum = t;
    }
    sum as f32
}
```

Use when: rows > 1000 nonzeros or ε < 1e-6. Overhead: ~2x. Alternative: f64 accumulator.
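The f64-accumulator alternative mentioned above drops the compensation arithmetic entirely; a sketch for comparison:

```rust
/// Cheaper alternative to Kahan for f32 SpMV rows: accumulate in f64.
/// One widening conversion per element, no compensation terms, and the
/// 53-bit mantissa absorbs the rounding error of any realistic row length.
fn spmv_row_f64acc(vals: &[f32], cols: &[u32], x: &[f32]) -> f32 {
    let mut sum = 0.0f64;
    for i in 0..vals.len() {
        sum += vals[i] as f64 * x[cols[i] as usize] as f64;
    }
    sum as f32
}
```

On hardware with native f64 FMA the widening is close to free, which is why the mixed-precision default in Section 5.2 uses exactly this pattern rather than Kahan.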

### 5.2 Mixed Precision Strategy

| Precision Mode | Storage | Accumulation | Max ε | Memory | SpMV Speed |
|---------------|---------|-------------|-------|--------|-----------|
| Pure f32 | f32 | f32 | 1e-4 | 1x | 1x (fastest) |
| **Default** (f32/f64) | f32 | f64 | 1e-7 | 1x | 0.95x |
| Pure f64 | f64 | f64 | 1e-12 | 2x | 0.5x |

### 5.3 Condition Number Estimation

Fast κ estimation via power iteration (20 iterations × 2 SpMVs = O(40 × nnz)):

```rust
fn estimate_kappa(A: &CsrMatrix<f32>) -> f64 {
    let lambda_max = power_iteration(A, 20);
    let lambda_min = inverse_power_iteration_cg(A, 20);
    lambda_max / lambda_min
}
```

### 5.4 Spectral Radius for Neumann

Estimate ρ(I-A) via 20-step power iteration. Rules:

- ρ < 0.9: Neumann converges fast (< 50 iterations for ε=0.01)
- 0.9 ≤ ρ < 0.99: Neumann slow, consider CG
- ρ ≥ 0.99: Switch to CG (Neumann needs > 460 iterations)
- ρ ≥ 1.0: Neumann diverges — CG/BMSSP mandatory
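The 20-step power iteration behind these rules can be sketched as follows. For clarity the sketch works on a dense symmetric matrix (the crate is assumed to run the same loop over CSR); for symmetric G the iterate converges to |λ_max|, which equals ρ(G).

```rust
/// Power iteration estimate of the dominant eigenvalue magnitude of a
/// dense symmetric matrix g. Normalizing v each step makes ||G v|| the
/// running eigenvalue estimate.
fn spectral_radius_dense(g: &[Vec<f32>], steps: usize) -> f32 {
    let n = g.len();
    let mut v = vec![1.0f32 / (n as f32).sqrt(); n];
    let mut lambda = 0.0f32;
    for _ in 0..steps {
        // w = G v
        let w: Vec<f32> = g
            .iter()
            .map(|row| row.iter().zip(&v).map(|(a, b)| a * b).sum::<f32>())
            .collect();
        let norm = w.iter().map(|x| x * x).sum::<f32>().sqrt();
        if norm == 0.0 {
            return 0.0; // v landed in the null space; rho estimate is 0
        }
        lambda = norm; // ||G v|| with ||v|| = 1
        v = w.iter().map(|x| x / norm).collect();
    }
    lambda
}
```

The estimate improves geometrically at rate |λ₂/λ₁| per step, so 20 steps are ample whenever the routing decision only needs the coarse ρ brackets listed above.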

---

## 6. WASM-Specific Optimization

### 6.1 Memory Growth Strategy

Pre-allocate: `pages = ceil(n × avg_nnz × 12 / 65536) + 32`. Growth during solving costs ~1ms per grow.
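As a small sketch of the formula (helper name is hypothetical), in integer arithmetic:

```rust
/// Pre-allocation size from the Section 6.1 formula: solver working set
/// in 64 KiB WASM pages, plus 32 pages of headroom so memory.grow never
/// fires mid-solve.
fn wasm_pages(n: usize, avg_nnz: usize) -> usize {
    let bytes = n * avg_nnz * 12; // ~12 bytes per stored nonzero
    (bytes + 65_535) / 65_536 + 32 // ceiling division to pages + headroom
}
```

For n = 10,000 with 100 nonzeros per row this reserves 216 pages (~13.5 MiB) up front, trading a slightly larger initial allocation for deterministic solve latency.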

### 6.2 wasm-opt Configuration

```bash
wasm-opt -O3 --enable-simd --enable-bulk-memory \
  --precompute-propagate --optimize-instructions \
  --reorder-functions --coalesce-locals --vacuum \
  pkg/solver_bg.wasm -o pkg/solver_bg_opt.wasm
```

Expected: 15-25% size reduction, 5-10% speed improvement.

### 6.3 Worker Thread Optimization

Use Transferable objects (zero-copy move) or SharedArrayBuffer (zero-copy share):

```javascript
worker.postMessage({ type: 'solve', matrix: values.buffer },
                   [values.buffer]); // Transfer list — moves, doesn't copy
```

### 6.4 Bundle Size Budget

| Component | Size (gzipped) | Budget |
|-----------|---------------|--------|
| Solver core (CG + Neumann + Push) | ~80 KB | 100 KB |
| SIMD128 kernels | ~15 KB | 20 KB |
| wasm-bindgen glue | ~10 KB | 15 KB |
| serde-wasm-bindgen | ~20 KB | 25 KB |
| **Total** | **~125 KB** | **160 KB** |

---

## 7. Profiling Methodology

### 7.1 Performance Counter Analysis

```bash
perf stat -e cycles,instructions,cache-references,cache-misses,\
L1-dcache-load-misses,LLC-load-misses ./target/release/bench_spmv
```

Expected good SpMV profile: IPC 2.0-3.0, L1 miss 5-15%, LLC miss < 1%, branch miss < 1%.

### 7.2 Hot Spot Identification

```bash
perf record -g --call-graph dwarf ./target/release/bench_solver
perf script | stackcollapse-perf.pl | flamegraph.pl > solver_flame.svg
```

Expected: 60-80% in spmv_*, 10-15% in dot/norm, < 5% in allocation.

### 7.3 Roofline Model

SpMV arithmetic intensity = 0.167 FLOP/byte. On an 80 GB/s server: achievable = 13.3 GFLOPS (1.3% of 1 TFLOPS peak). SpMV is deeply memory-bound — optimize for memory traffic reduction, not FLOPS.
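The roofline numbers above follow from 2 FLOPs per nonzero (mul + add) against ~12 bytes of traffic per nonzero (value, index, and amortized x-vector access, the same 12-byte figure as the Section 3.1 tiling formula). A minimal sketch of the calculation:

```rust
/// Roofline bound for SpMV: attainable GFLOPS is the memory-bandwidth
/// roof (bandwidth x arithmetic intensity) capped by the compute peak.
fn spmv_attainable_gflops(bandwidth_gb_s: f64, peak_gflops: f64) -> f64 {
    let intensity = 2.0 / 12.0; // FLOP per byte, ~0.167
    (bandwidth_gb_s * intensity).min(peak_gflops)
}
```

Plugging in 80 GB/s and a 1000 GFLOPS peak reproduces the 13.3 GFLOPS figure: the bandwidth roof binds at roughly 1.3% of compute peak, which is why every optimization in this guide attacks bytes moved rather than instructions retired.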

### 7.4 Criterion.rs Best Practices

```rust
group.warm_up_time(Duration::from_secs(5));  // Stabilize cache state
group.sample_size(200);                      // Statistical significance
group.throughput(Throughput::Elements(nnz)); // Report nonzeros/sec
// Use black_box() to prevent dead code elimination
b.iter(|| black_box(solver.solve(&csr, &rhs)))
```

---

## 8. Concurrency Optimization

### 8.1 Rayon Configuration

```rust
let chunk_size = (n / rayon::current_num_threads()).max(1024);
problems.par_chunks(chunk_size).map(|chunk| ...).collect()
```

### 8.2 Thread Scaling

| Threads | Efficiency | Bottleneck |
|---------|-----------|-----------|
| 1 | 100% | N/A |
| 2 | 90-95% | Rayon overhead |
| 4 | 75-85% | Memory bandwidth |
| 8 | 55-70% | L3 contention |
| 16 | 40-55% | NUMA effects |

Use `num_cpus::get_physical()` threads. Avoid nested Rayon (deadlock risk).

---

## 9. Compilation Optimization

### 9.1 PGO Pipeline

```bash
RUSTFLAGS="-Cprofile-generate=/tmp/pgo" cargo build --release -p ruvector-solver
./target/release/bench_solver --profile-workload
llvm-profdata merge -o /tmp/pgo/merged.profdata /tmp/pgo/*.profraw
RUSTFLAGS="-Cprofile-use=/tmp/pgo/merged.profdata" cargo build --release
```

Expected: 5-15% improvement.

### 9.2 Release Profile

```toml
[profile.release]
opt-level = 3
lto = "fat"
codegen-units = 1
strip = true
```

---

## 10. Platform-Specific Optimization

### 10.1 Server (Linux x86_64)

- Huge pages: `MADV_HUGEPAGE` for large matrices (10-30% TLB miss reduction)
- NUMA-aware: Pin threads to same node as matrix memory
- AVX-512: Prefer on Zen 4+/Ice Lake+

### 10.2 Apple Silicon (macOS ARM64)

- Unified memory: No NUMA concerns
- NEON 4x unrolled with independent accumulators
- M4 Pro: 192KB L1, 16MB L2, 48MB L3

### 10.3 Browser (WASM)

- Memory budget < 8MB, SIMD128 always enabled
- Web Workers for batch, SharedArrayBuffer for zero-copy
- IndexedDB caching for TRUE preprocessing

### 10.4 Cloudflare Workers

- 128MB memory, 50ms CPU limit
- Reflex/Retrieval lanes only
- Single-threaded, pre-warm with small solve

---

## 11. Optimization Checklist

### P0 (Critical)

| Item | Impact | Effort | Validation |
|------|--------|--------|------------|
| SIMD SpMV (AVX2+FMA, NEON) | 4-8x SpMV | L | Criterion vs scalar |
| Arena allocator | 100x alloc reduction | S | dhat profiling |
| Zero-copy SoA → solver | Eliminates copies | M | Memory profiling |
| CSR with aligned storage | SIMD foundation | M | Cache miss rate |
| Diagonal preconditioning | 2-10x CG speedup | S | Iteration count |
| Feature-gated Rayon | Multi-core utilization | S | Thread scaling |
| Input validation | Security baseline | S | Fuzz testing |
| CI regression benchmarks | Prevents degradation | M | CI green |

### P1 (High)

| Item | Impact | Effort | Validation |
|------|--------|--------|------------|
| AVX-512 SpMV | 1.5-2x over AVX2 | M | Zen 4 benchmark |
| WASM SIMD128 SpMV | 2-3x over scalar | M | wasm-pack bench |
| Cache-aware tiling | 30-50% for n>100K | M | perf cache misses |
| Memory-mapped CSR | Removes memory ceiling | M | 1GB matrix load |
| SONA adaptive routing | Auto-optimal selection | L | >90% routing accuracy |
| TRUE batch amortization | 100-1000x repeated | M | Break-even validated |
| Web Worker pool | 2-4x WASM throughput | M | Worker benchmark |

### P2 (Medium)

| Item | Impact | Effort | Validation |
|------|--------|--------|------------|
| PGO in CI | 5-15% overall | M | PGO comparison |
| Vectorized PRNG | 2-4x random walk | S | Walk throughput |
| SIMD convergence checks | 4-8x check speed | S | Inline benchmark |
| Mixed precision (f32/f64) | 2x memory savings | M | Accuracy suite |
| Incomplete Cholesky | 10-100x condition | L | Iteration count |

### P3 (Long-term)

| Item | Impact | Effort | Validation |
|------|--------|--------|------------|
| Algebraic multigrid | Near-optimal Laplacians | XL | V-cycle convergence |
| NUMA-aware allocation | 10-20% multi-socket | M | NUMA profiling |
| GPU offload (Metal/CUDA) | 10-100x dense | XL | GPU benchmark |
| Distributed solver | n > 1M scaling | XL | Distributed bench |

---

## 12. Performance Targets

| Operation | Server (AVX2) | Edge (NEON) | Browser (WASM) | Cloudflare |
|-----------|:---:|:---:|:---:|:---:|
| SpMV 10K×10K (1%) | < 30 μs | < 50 μs | < 200 μs | < 300 μs |
| CG solve 10K (ε=1e-6) | < 1 ms | < 2 ms | < 20 ms | < 30 ms |
| Forward Push 10K (ε=1e-4) | < 50 μs | < 100 μs | < 500 μs | < 1 ms |
| Neumann 10K (k=20) | < 600 μs | < 1 ms | < 5 ms | < 8 ms |
| BMSSP 100K (ε=1e-4) | < 50 ms | < 100 ms | N/A | < 200 ms |
| TRUE prep 100K (ε=0.1) | < 500 ms | < 1 s | N/A | < 2 s |
| TRUE solve 100K (amort.) | < 1 ms | < 2 ms | N/A | < 5 ms |
| Batch pairwise 10K | < 15 s | < 30 s | < 120 s | N/A |
| Scheduler tick | < 200 ns | < 300 ns | N/A | N/A |
| Algorithm routing | < 1 μs | < 1 μs | < 5 μs | < 5 μs |

---

## 13. Measurement Methodology

1. **Criterion.rs**: 200 samples, 5s warmup, p < 0.05 significance
2. **Multi-platform**: x86_64 (AVX2) and aarch64 (NEON)
3. **Deterministic seeds**: `random_vector(dim, seed=42)`
4. **Equal accuracy**: Fix ε before comparing
5. **Cold + hot cache**: Report both first-run and steady-state
6. **Profile.bench**: Release optimization with debug symbols
7. **Regression CI**: 10% degradation threshold triggers failure
8. **Memory profiling**: Peak RSS and allocation count via dhat
9. **Roofline analysis**: Verify memory-bound operation
10. **Statistical rigor**: Report median, p5, p95, coefficient of variation

---

## Realized Optimizations

The following optimizations from this guide have been implemented in the `ruvector-solver` crate as of February 2026.

### Implemented Techniques

1. **Jacobi-preconditioned Neumann series (D^{-1} splitting)**: The Neumann solver extracts the diagonal of A and applies D^{-1} as a preconditioner before iteration. This transforms the iteration matrix from (I - A) to (I - D^{-1}A), significantly reducing the spectral radius for diagonally dominant systems and enabling convergence where unpreconditioned Neumann would diverge or stall.

2. **spmv_unchecked: raw-pointer SpMV with zero bounds checks**: The inner SpMV loop uses unsafe raw pointer arithmetic to eliminate Rust's bounds-check overhead on every array access. Bounds are validated once before entering the hot loop, maintaining safety guarantees while removing per-element branch overhead.

3. **fused_residual_norm_sq: single-pass residual + norm computation**: Instead of computing r = b - Ax (pass 1) and then ||r||² (pass 2) as separate operations, the fused kernel computes both the residual vector and its squared norm in a single traversal. This eliminates two of three memory traversals per iteration, which is critical since SpMV is memory-bandwidth-bound.

4. **4-wide unrolled Jacobi update in Neumann iteration**: The Jacobi preconditioner application loop is manually unrolled 4x, processing four elements per loop body. This reduces loop overhead and exposes instruction-level parallelism to the CPU's out-of-order execution engine.

5. **AVX2 SIMD SpMV (8-wide f32 via horizontal sum)**: The AVX2 SpMV kernel processes 8 f32 values per SIMD instruction, using `_mm256_i32gather_ps` to gather x-vector entries and `_mm256_fmadd_ps` for fused multiply-add accumulation. A horizontal sum reduces the 8-lane accumulator to a scalar row result.

6. **Arena allocator for zero-allocation iteration**: Solver working memory (residual, search direction, temporary vectors) is pre-allocated from a bump arena before the iteration loop begins. This eliminates all heap allocation during the solve phase, reducing per-solve overhead from ~20 μs to ~200 ns.

7. **Algorithm router with automatic characterization**: The router characterizes input matrices (size, density, estimated spectral radius, SPD detection) and selects the optimal algorithm automatically. It runs in under 1 μs and directs traffic to the appropriate solver based on the matrix properties identified in Sections 4 and 5.
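The D^{-1} splitting of technique 1 can be sketched in a few lines. This is a dense, unoptimized illustration of the iteration (the crate version is assumed to run over CSR with the 4-wide unrolling of technique 4): each step applies x ← x + D^{-1}(b - Ax), which is the Neumann series of (I - D^{-1}A) applied to D^{-1}b.

```rust
/// Jacobi-preconditioned Neumann/Richardson iteration for Ax = b on a
/// dense matrix: x_{k+1} = x_k + D^{-1}(b - A x_k). Converges when
/// rho(I - D^{-1}A) < 1, e.g. for strictly diagonally dominant A.
fn neumann_jacobi(a: &[Vec<f32>], b: &[f32], iters: usize) -> Vec<f32> {
    let n = b.len();
    let inv_d: Vec<f32> = (0..n).map(|i| 1.0 / a[i][i]).collect();
    let mut x = vec![0.0f32; n];
    for _ in 0..iters {
        // Full vector update from the previous iterate (Jacobi, not
        // Gauss-Seidel: x_next must not read partially updated entries).
        let mut x_next = vec![0.0f32; n];
        for i in 0..n {
            let ax_i: f32 = a[i].iter().zip(&x).map(|(aij, xj)| aij * xj).sum();
            x_next[i] = x[i] + inv_d[i] * (b[i] - ax_i);
        }
        x = x_next;
    }
    x
}
```

For the 2×2 system [[2,1],[1,3]]x = [3,5] the iteration matrix has spectral radius about 0.41, so a few dozen sweeps already recover the exact solution (0.8, 1.4) to f32 precision.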

### Performance Data

| Algorithm | Complexity | Notes |
|-----------|-----------|-------|
| **Neumann** | O(k · nnz) | Converges with k typically 10-50 for well-conditioned systems (spectral radius < 0.9). Jacobi preconditioning extends the convergence regime. |
| **CG** | O(√κ · log(1/ε) · nnz) | Gold standard for SPD systems. Optimal by the Nemirovski-Yudin lower bound. Scales gracefully with condition number. |
| **Fused kernel** | Eliminates 2 of 3 memory traversals per iteration | For bandwidth-bound SpMV (arithmetic intensity 0.167 FLOP/byte), reducing memory passes from 3 to 1 translates directly to up to 3x throughput improvement for the residual computation step. |