
Optimization Guide: Sublinear-Time Solver Integration

Date: 2026-02-20
Classification: Engineering Reference
Scope: Performance optimization strategies for solver integration
Version: 2.0 (Optimizations Realized)


1. Executive Summary

This guide provides concrete optimization strategies for achieving maximum performance from the sublinear-time-solver integration into RuVector. Targets: 10-600x speedups across 6 critical subsystems while maintaining <2% accuracy loss. Organized by optimization tier: SIMD → Memory → Algorithm → Numerical → Concurrency → WASM → Profiling → Compilation → Platform.


2. SIMD Optimization Strategy

2.1 Architecture-Specific Kernels

The solver's hot path is SpMV (sparse matrix-vector multiply). Each architecture requires a dedicated kernel:

| Architecture | SIMD Width | f32/iteration | Key Instruction | Expected SpMV Throughput |
|---|---|---|---|---|
| AVX-512 | 512-bit | 16 | _mm512_i32gather_ps | ~400M nonzeros/s |
| AVX2+FMA | 256-bit | 8×4 unrolled | _mm256_i32gather_ps + _mm256_fmadd_ps | ~250M nonzeros/s |
| NEON | 128-bit | 4×4 unrolled | Manual gather + vfmaq_f32 | ~150M nonzeros/s |
| WASM SIMD128 | 128-bit | 4 | f32x4_mul + f32x4_add | ~80M nonzeros/s |
| Scalar | 32-bit | 1 | fmaf | ~40M nonzeros/s |

2.2 SpMV Kernels

AVX2+FMA SpMV with gather (primary kernel):

```
for each row i:
    acc = _mm256_setzero_ps()
    for j in row_ptrs[i]..row_ptrs[i+1] step 8:
        indices = _mm256_loadu_si256(&col_indices[j])
        vals = _mm256_loadu_ps(&values[j])
        x_gathered = _mm256_i32gather_ps(x_ptr, indices, 4)
        acc = _mm256_fmadd_ps(vals, x_gathered, acc)
    y[i] = horizontal_sum(acc) + scalar_remainder
```

AVX-512 SpMV with masking (for variable-length rows):

```
for each row i:
    acc = _mm512_setzero_ps()
    len = row_ptrs[i+1] - row_ptrs[i]
    full_chunks = len / 16
    remainder = len % 16

    for j in 0..full_chunks:
        base = row_ptrs[i] + j * 16
        idx = _mm512_loadu_si512(&col_indices[base])
        v = _mm512_loadu_ps(&values[base])
        x = _mm512_i32gather_ps(idx, x_ptr, 4)
        acc = _mm512_fmadd_ps(v, x, acc)

    if remainder > 0:
        mask = (1 << remainder) - 1
        base = row_ptrs[i] + full_chunks * 16
        idx = _mm512_maskz_loadu_epi32(mask, &col_indices[base])
        v = _mm512_maskz_loadu_ps(mask, &values[base])
        x = _mm512_mask_i32gather_ps(zeros, mask, idx, x_ptr, 4)
        acc = _mm512_fmadd_ps(v, x, acc)

    y[i] = _mm512_reduce_add_ps(acc)
```

WASM SIMD128 SpMV kernel:

```
for each row i:
    acc = f32x4_splat(0.0)
    for j in row_ptrs[i]..row_ptrs[i+1] step 4:
        x_vec = f32x4(x[col_indices[j]], x[col_indices[j+1]],
                       x[col_indices[j+2]], x[col_indices[j+3]])
        v = v128_load(&values[j])
        acc = f32x4_add(acc, f32x4_mul(v, x_vec))
    y[i] = horizontal_sum_f32x4(acc) + scalar_remainder
```

Vectorized PRNG (for Hybrid Random Walk):

```
state[4][4] = initialize_from_seed()
for each walk:
    random = xoshiro256_simd(state)  // 4 random values per call
    next_node = random % degree[current_node]
```

2.3 Auto-Vectorization Guidelines

  1. Sequential access: Iterate arrays in order (no random access in inner loop)
  2. No branches: Use select/blend instead of if in hot loops
  3. Independent accumulators: 4 separate sums, combine at end
  4. Aligned data: Use #[repr(align(64))] on hot data structures
  5. Known bounds: Use get_unchecked() after external bounds check
  6. Compiler hints: #[inline(always)] on hot functions, #[cold] on error paths
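Guideline 3 (independent accumulators) can be sketched as a plain dot product — a minimal illustration, not the crate's kernel. Splitting the running sum into four independent partials breaks the loop-carried dependency chain, so the compiler can keep each partial in its own register or SIMD lane and hide FMA latency:

```rust
/// Dot product with four independent accumulators plus a scalar remainder loop.
fn dot_unrolled(a: &[f32], b: &[f32]) -> f32 {
    let n = a.len().min(b.len());
    let chunks = n / 4;
    let (mut s0, mut s1, mut s2, mut s3) = (0.0f32, 0.0f32, 0.0f32, 0.0f32);
    for c in 0..chunks {
        let i = c * 4;
        // the four partial sums have no dependency on each other
        s0 += a[i] * b[i];
        s1 += a[i + 1] * b[i + 1];
        s2 += a[i + 2] * b[i + 2];
        s3 += a[i + 3] * b[i + 3];
    }
    let mut tail = 0.0f32;
    for i in chunks * 4..n {
        tail += a[i] * b[i]; // scalar remainder
    }
    (s0 + s1) + (s2 + s3) + tail
}

fn main() {
    let a = [1.0f32, 2.0, 3.0, 4.0, 5.0];
    let b = [1.0f32, 1.0, 1.0, 1.0, 1.0];
    println!("{}", dot_unrolled(&a, &b)); // 15
}
```

Note that reassociating the sum changes rounding slightly versus a sequential loop; this is acceptable here because f32 accumulation order is already unspecified in the SIMD kernels.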

2.4 Throughput Formulas

SpMV throughput is bounded by memory bandwidth:

Throughput = min(BW_memory / 8, FLOPS_peak / 2) nonzeros/s

Where 8 = bytes/nonzero (4B value + 4B index), 2 = FLOPs/nonzero (mul + add).

SpMV is almost always memory-bandwidth-bound. SIMD reduces instruction count but memory throughput is the fundamental limit.
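The bound above can be written as a one-line function. Note it is an upper bound: the sustained figures in Section 2.1 are considerably lower, largely because gathered x-vector accesses fetch whole cache lines. The 80 GB/s and 1 TFLOP/s figures below are illustrative assumptions, not measured values:

```rust
/// Upper bound on SpMV throughput in nonzeros/s:
/// min(memory bandwidth / 8 bytes per nonzero, peak FLOPS / 2 FLOPs per nonzero).
fn spmv_throughput_bound(bw_bytes_per_s: f64, peak_flops: f64) -> f64 {
    (bw_bytes_per_s / 8.0).min(peak_flops / 2.0)
}

fn main() {
    // Illustrative server: 80 GB/s sustained bandwidth, 1 TFLOP/s f32 peak.
    // The memory term (10G nonzeros/s) is far below the compute term (500G),
    // confirming the memory-bound regime.
    let bound = spmv_throughput_bound(80e9, 1e12);
    println!("{} M nonzeros/s upper bound", bound / 1e6);
}
```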


3. Memory Optimization

3.1 Cache-Aware Tiling

| Working Set | Cache Level | Performance | Strategy |
|---|---|---|---|
| < 48 KB | L1 (M4 Pro: 192KB/perf) | Peak (100%) | Direct iteration, no tiling |
| < 256 KB | L2 | 80-90% of peak | Single-pass with prefetch |
| < 16 MB | L3 | 50-70% of peak | Row-block tiling |
| > 16 MB | DRAM | 20-40% of peak | Page-level tiling + prefetch |
| > available RAM | Disk | 1-5% of peak | Memory-mapped streaming |

Tiling formula: TILE_ROWS = L3_SIZE / (avg_row_nnz × 12 bytes)
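As a sketch, the tiling formula translates directly to code. The 12 bytes/nonzero figure is taken from the formula above (value + column index plus x-vector traffic, assumed here):

```rust
/// Rows per L3-resident tile: TILE_ROWS = L3_SIZE / (avg_row_nnz * 12 bytes).
fn tile_rows(l3_bytes: usize, avg_row_nnz: usize) -> usize {
    (l3_bytes / (avg_row_nnz * 12)).max(1)
}

fn main() {
    // 16 MB L3 with ~100 nonzeros per row -> roughly 14K rows per tile
    println!("{} rows per tile", tile_rows(16 * 1024 * 1024, 100));
}
```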

3.2 Prefetch Strategy

```rust
// Software prefetch for SpMV x-vector access
for row in 0..n {
    if row + 1 < n {
        let next_start = row_ptrs[row + 1];
        for j in next_start..(next_start + 8).min(row_ptrs[row + 2]) {
            prefetch_read_l2(&x[col_indices[j] as usize]);
        }
    }
    process_row(row);
}
```

Prefetch distance: L1 = 64 bytes ahead, L2 = 256 bytes ahead.

3.3 Arena Allocator Integration

// Before: ~20μs overhead per solve
let r = vec![0.0f32; n]; let p = vec![0.0f32; n]; let ap = vec![0.0f32; n];

// After: ~0.2μs overhead per solve
let mut arena = SolverArena::with_capacity(n * 12);
let r = arena.alloc_slice::<f32>(n);
let p = arena.alloc_slice::<f32>(n);
let ap = arena.alloc_slice::<f32>(n);
arena.reset();

3.4 Cache Line Alignment

```rust
#[repr(C, align(64))]
struct SolverScratch<const N: usize> { r: [f32; N], p: [f32; N], ap: [f32; N] }

#[repr(C, align(128))]  // Prevent false sharing in parallel stats
struct ThreadStats { iterations: u64, residual: f64, _pad: [u8; 112] }
```

3.5 Memory-Mapped Large Matrices

```rust
let mmap = unsafe { memmap2::Mmap::map(&file)? };
let values: &[f32] = bytemuck::cast_slice(&mmap[header_size..]);
```

3.6 Zero-Copy Data Paths

| Path | Mechanism | Overhead |
|---|---|---|
| SoA → Solver | &[f32] borrow | 0 bytes |
| HNSW → CSR | Direct construction | O(n×M) one-time |
| Solver → WASM | Float32Array::view() | 0 bytes |
| Solver → NAPI | napi::Buffer | 0 bytes |
| Solver → REST | serde_json::to_writer | 1 serialization |

4. Algorithmic Optimization

4.1 Preconditioning Strategies

| Preconditioner | Setup Cost | Per-Iteration Cost | Condition Improvement | Best For |
|---|---|---|---|---|
| None | 0 | 0 | 1x | Well-conditioned (κ < 10) |
| Diagonal (Jacobi) | O(n) | O(n) | √(d_max/d_min) | General SPD |
| Incomplete Cholesky | O(nnz) | O(nnz) | 10-100x | Moderately ill-conditioned |
| Algebraic Multigrid | O(nnz·log n) | O(nnz) | Near-optimal for Laplacians | κ > 100 |

Default: Diagonal preconditioner. Escalate to AMG when κ > 100 and n > 50K.
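Applying the default diagonal preconditioner is a single divide per element. A minimal sketch (the crate's actual API may differ; `diag` holds the diagonal of A):

```rust
/// Jacobi (diagonal) preconditioner application: z = D^{-1} r.
fn apply_jacobi(diag: &[f32], r: &[f32], z: &mut [f32]) {
    for ((zi, &ri), &di) in z.iter_mut().zip(r).zip(diag) {
        *zi = ri / di; // SPD assumption guarantees di > 0
    }
}

fn main() {
    let diag = [4.0f32, 2.0];
    let r = [8.0f32, 1.0];
    let mut z = [0.0f32; 2];
    apply_jacobi(&diag, &r, &mut z);
    println!("{:?}", z); // [2.0, 0.5]
}
```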

4.2 Sparsity Exploitation

```rust
fn select_path(matrix: &CsrMatrix<f32>) -> ComputePath {
    let density = matrix.density();
    if density > 0.50 { ComputePath::Dense }
    else if density > 0.05 { ComputePath::Sparse }
    else { ComputePath::Sublinear }
}
```

4.3 Batch Amortization

| Preprocessing Cost | Per-Solve Cost | Break-Even B |
|---|---|---|
| 425 ms (n=100K, 1%) | 0.43 ms (ε=0.1) | 634 solves |
| 42 ms (n=10K, 1%) | 0.04 ms (ε=0.1) | 63 solves |
| 4 ms (n=1K, 1%) | 0.004 ms (ε=0.1) | 6 solves |
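The break-even point compares the preprocessed path against solving each problem standalone. A sketch of that arithmetic — the standalone per-solve cost used below is an illustrative assumption, not a value from the table:

```rust
/// Smallest batch size B at which preprocessing amortizes:
/// prep + B * amortized <= B * standalone, so B >= prep / (standalone - amortized).
fn break_even(prep_ms: f64, amortized_ms: f64, standalone_ms: f64) -> u64 {
    assert!(standalone_ms > amortized_ms);
    (prep_ms / (standalone_ms - amortized_ms)).ceil() as u64
}

fn main() {
    // 42 ms preprocessing, 0.04 ms amortized solves, vs a hypothetical
    // 1.04 ms standalone solve
    println!("{} solves to break even", break_even(42.0, 0.04, 1.04));
}
```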

4.4 Lazy Evaluation

```rust
let x_ij = solver.estimate_entry(A, i, j)?;  // O(√n/ε) via random walk
// vs full solve O(nnz × iterations). Speedup = √n for n=1M → 1000x
```

5. Numerical Optimization

5.1 Kahan Summation for SpMV

```rust
/// Kahan-compensated row dot product: f32 storage and f32 accumulation, with a
/// running compensation term that recovers the low-order bits lost to rounding.
fn spmv_row_kahan(vals: &[f32], cols: &[u32], x: &[f32]) -> f32 {
    let mut sum = 0.0f32;
    let mut comp = 0.0f32; // running compensation for lost low-order bits
    for i in 0..vals.len() {
        let y = vals[i] * x[cols[i] as usize] - comp;
        let t = sum + y;
        comp = (t - sum) - y; // algebraically zero; captures the rounding error
        sum = t;
    }
    sum
}
```

Use when: rows > 1000 nonzeros or ε < 1e-6. Overhead: ~2x. Alternative: f64 accumulator.

5.2 Mixed Precision Strategy

| Precision Mode | Storage | Accumulation | Max ε | Memory | SpMV Speed |
|---|---|---|---|---|---|
| Pure f32 | f32 | f32 | 1e-4 | 1x | 1x (fastest) |
| Default (f32/f64) | f32 | f64 | 1e-7 | 1x | 0.95x |
| Pure f64 | f64 | f64 | 1e-12 | 2x | 0.5x |

5.3 Condition Number Estimation

Fast κ estimation via power iteration (20 iterations × 2 SpMVs = O(40 × nnz)):

```rust
fn estimate_kappa(A: &CsrMatrix<f32>) -> f64 {
    let lambda_max = power_iteration(A, 20);
    let lambda_min = inverse_power_iteration_cg(A, 20);
    lambda_max / lambda_min
}
```
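The λ_max half of the estimate can be sketched as plain power iteration. This version uses a dense symmetric matrix for brevity (the production path would run on the CSR SpMV kernels), and it omits the inverse-iteration-with-CG path for λ_min:

```rust
/// Power iteration for the dominant eigenvalue of a symmetric PSD matrix.
/// With ||x|| = 1, ||A x|| converges to lambda_max.
fn power_iteration(a: &[Vec<f64>], iters: usize) -> f64 {
    let n = a.len();
    let mut x = vec![1.0f64; n];
    let norm0 = x.iter().map(|v| v * v).sum::<f64>().sqrt();
    for v in &mut x { *v /= norm0; } // normalize the start vector
    let mut lambda = 0.0;
    for _ in 0..iters {
        // y = A x (dense stand-in for the CSR SpMV)
        let y: Vec<f64> = (0..n)
            .map(|i| (0..n).map(|j| a[i][j] * x[j]).sum())
            .collect();
        lambda = y.iter().map(|v| v * v).sum::<f64>().sqrt(); // ||A x||
        for (xi, yi) in x.iter_mut().zip(&y) { *xi = yi / lambda; }
    }
    lambda
}

fn main() {
    // eigenvalues 3 and 1; the estimate converges geometrically at rate 1/3
    let a = vec![vec![3.0, 0.0], vec![0.0, 1.0]];
    println!("lambda_max ~ {}", power_iteration(&a, 60));
}
```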

5.4 Spectral Radius for Neumann

Estimate ρ(I-A) via 20-step power iteration. Rules:

  • ρ < 0.9: Neumann converges fast (< 50 iterations for ε=0.01)
  • 0.9 ≤ ρ < 0.99: Neumann slow, consider CG
  • ρ ≥ 0.99: Switch to CG (Neumann needs > 460 iterations)
  • ρ ≥ 1.0: Neumann diverges — CG/BMSSP mandatory
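The iteration counts behind these thresholds follow from the geometric decay of the series: the error after k terms scales as ρ^k, so k = ceil(ln ε / ln ρ). A small sketch, consistent with the rules above (about 44 iterations at ρ = 0.9, and ~459 at ρ = 0.99):

```rust
/// Neumann terms needed to reach tolerance eps: smallest k with rho^k <= eps.
/// Only valid for 0 < rho < 1; the series diverges for rho >= 1.
fn neumann_iters(rho: f64, eps: f64) -> u64 {
    assert!(rho > 0.0 && rho < 1.0);
    (eps.ln() / rho.ln()).ceil() as u64
}

fn main() {
    println!("rho=0.90: {} iterations for eps=0.01", neumann_iters(0.90, 0.01));
    println!("rho=0.99: {} iterations for eps=0.01", neumann_iters(0.99, 0.01));
}
```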

6. WASM-Specific Optimization

6.1 Memory Growth Strategy

Pre-allocate: pages = ceil(n × avg_nnz × 12 / 65536) + 32. Growth during solving costs ~1ms per grow.
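The page count formula, sketched directly (WASM linear-memory pages are 64 KiB; the 12 bytes/nonzero and 32-page headroom come from the formula above):

```rust
/// Pages of WASM linear memory to pre-allocate:
/// ceil(n * avg_nnz * 12 / 65536) + 32 pages of headroom.
fn wasm_pages(n: u64, avg_nnz: u64) -> u64 {
    let bytes = n * avg_nnz * 12;
    (bytes + 65535) / 65536 + 32 // ceiling division by the 64 KiB page size
}

fn main() {
    // 10K-row matrix at ~100 nonzeros/row
    println!("{} pages", wasm_pages(10_000, 100));
}
```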

6.2 wasm-opt Configuration

```bash
wasm-opt -O3 --enable-simd --enable-bulk-memory \
  --precompute-propagate --optimize-instructions \
  --reorder-functions --coalesce-locals --vacuum \
  pkg/solver_bg.wasm -o pkg/solver_bg_opt.wasm
```

Expected: 15-25% size reduction, 5-10% speed improvement.

6.3 Worker Thread Optimization

Use Transferable objects (zero-copy move) or SharedArrayBuffer (zero-copy share):

```js
worker.postMessage({ type: 'solve', matrix: values.buffer },
    [values.buffer]);  // Transfer list — moves, doesn't copy
```

6.4 Bundle Size Budget

| Component | Size (gzipped) | Budget |
|---|---|---|
| Solver core (CG + Neumann + Push) | ~80 KB | 100 KB |
| SIMD128 kernels | ~15 KB | 20 KB |
| wasm-bindgen glue | ~10 KB | 15 KB |
| serde-wasm-bindgen | ~20 KB | 25 KB |
| Total | ~125 KB | 160 KB |

7. Profiling Methodology

7.1 Performance Counter Analysis

```bash
perf stat -e cycles,instructions,cache-references,cache-misses,\
  L1-dcache-load-misses,LLC-load-misses ./target/release/bench_spmv
```

Expected good SpMV profile: IPC 2.0-3.0, L1 miss 5-15%, LLC miss < 1%, branch miss < 1%.

7.2 Hot Spot Identification

```bash
perf record -g --call-graph dwarf ./target/release/bench_solver
perf script | stackcollapse-perf.pl | flamegraph.pl > solver_flame.svg
```

Expected: 60-80% in spmv_*, 10-15% in dot/norm, < 5% in allocation.

7.3 Roofline Model

SpMV arithmetic intensity = 0.167 FLOP/byte. On 80 GB/s server: achievable = 13.3 GFLOPS (1.3% of 1 TFLOPS peak). SpMV is deeply memory-bound — optimize for memory traffic reduction, not FLOPS.
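The roofline arithmetic can be sketched in a few lines. The 0.167 FLOP/byte intensity is 2 FLOPs over ~12 bytes of traffic per nonzero; the bandwidth and peak figures below mirror the illustrative numbers above:

```rust
/// Roofline: attainable GFLOPS = min(bandwidth * arithmetic intensity, peak).
fn roofline_gflops(bw_gb_s: f64, flop_per_byte: f64, peak_gflops: f64) -> f64 {
    (bw_gb_s * flop_per_byte).min(peak_gflops)
}

fn main() {
    let ai = 2.0 / 12.0; // 2 FLOPs per ~12 bytes of memory traffic per nonzero
    // 80 GB/s and 1 TFLOP/s peak -> ~13.3 GFLOPS attainable (1.3% of peak)
    println!("{:.1} GFLOPS attainable", roofline_gflops(80.0, ai, 1000.0));
}
```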

7.4 Criterion.rs Best Practices

```rust
group.warm_up_time(Duration::from_secs(5));  // Stabilize cache state
group.sample_size(200);                       // Statistical significance
group.throughput(Throughput::Elements(nnz));  // Report nonzeros/sec
// Use black_box() to prevent dead code elimination
b.iter(|| black_box(solver.solve(&csr, &rhs)))
```

8. Concurrency Optimization

8.1 Rayon Configuration

```rust
let chunk_size = (n / rayon::current_num_threads()).max(1024);
problems.par_chunks(chunk_size).map(|chunk| ...).collect()
```

8.2 Thread Scaling

| Threads | Efficiency | Bottleneck |
|---|---|---|
| 1 | 100% | N/A |
| 2 | 90-95% | Rayon overhead |
| 4 | 75-85% | Memory bandwidth |
| 8 | 55-70% | L3 contention |
| 16 | 40-55% | NUMA effects |

Use num_cpus::get_physical() threads. Avoid nested Rayon (deadlock risk).


9. Compilation Optimization

9.1 PGO Pipeline

```bash
RUSTFLAGS="-Cprofile-generate=/tmp/pgo" cargo build --release -p ruvector-solver
./target/release/bench_solver --profile-workload
llvm-profdata merge -o /tmp/pgo/merged.profdata /tmp/pgo/*.profraw
RUSTFLAGS="-Cprofile-use=/tmp/pgo/merged.profdata" cargo build --release
```

Expected: 5-15% improvement.

9.2 Release Profile

```toml
[profile.release]
opt-level = 3
lto = "fat"
codegen-units = 1
strip = true
```

10. Platform-Specific Optimization

10.1 Server (Linux x86_64)

  • Huge pages: MADV_HUGEPAGE for large matrices (10-30% TLB miss reduction)
  • NUMA-aware: Pin threads to same node as matrix memory
  • AVX-512: Prefer on Zen 4+/Ice Lake+

10.2 Apple Silicon (macOS ARM64)

  • Unified memory: No NUMA concerns
  • NEON 4x unrolled with independent accumulators
  • M4 Pro: 192KB L1, 16MB L2, 48MB L3

10.3 Browser (WASM)

  • Memory budget < 8MB, SIMD128 always enabled
  • Web Workers for batch, SharedArrayBuffer for zero-copy
  • IndexedDB caching for TRUE preprocessing

10.4 Cloudflare Workers

  • 128MB memory, 50ms CPU limit
  • Reflex/Retrieval lanes only
  • Single-threaded, pre-warm with small solve

11. Optimization Checklist

P0 (Critical)

| Item | Impact | Effort | Validation |
|---|---|---|---|
| SIMD SpMV (AVX2+FMA, NEON) | 4-8x SpMV | L | Criterion vs scalar |
| Arena allocator | 100x alloc reduction | S | dhat profiling |
| Zero-copy SoA → solver | Eliminates copies | M | Memory profiling |
| CSR with aligned storage | SIMD foundation | M | Cache miss rate |
| Diagonal preconditioning | 2-10x CG speedup | S | Iteration count |
| Feature-gated Rayon | Multi-core utilization | S | Thread scaling |
| Input validation | Security baseline | S | Fuzz testing |
| CI regression benchmarks | Prevents degradation | M | CI green |

P1 (High)

| Item | Impact | Effort | Validation |
|---|---|---|---|
| AVX-512 SpMV | 1.5-2x over AVX2 | M | Zen 4 benchmark |
| WASM SIMD128 SpMV | 2-3x over scalar | M | wasm-pack bench |
| Cache-aware tiling | 30-50% for n>100K | M | perf cache misses |
| Memory-mapped CSR | Removes memory ceiling | M | 1GB matrix load |
| SONA adaptive routing | Auto-optimal selection | L | >90% routing accuracy |
| TRUE batch amortization | 100-1000x repeated | M | Break-even validated |
| Web Worker pool | 2-4x WASM throughput | M | Worker benchmark |

P2 (Medium)

| Item | Impact | Effort | Validation |
|---|---|---|---|
| PGO in CI | 5-15% overall | M | PGO comparison |
| Vectorized PRNG | 2-4x random walk | S | Walk throughput |
| SIMD convergence checks | 4-8x check speed | S | Inline benchmark |
| Mixed precision (f32/f64) | 2x memory savings | M | Accuracy suite |
| Incomplete Cholesky | 10-100x condition | L | Iteration count |

P3 (Long-term)

| Item | Impact | Effort | Validation |
|---|---|---|---|
| Algebraic multigrid | Near-optimal Laplacians | XL | V-cycle convergence |
| NUMA-aware allocation | 10-20% multi-socket | M | NUMA profiling |
| GPU offload (Metal/CUDA) | 10-100x dense | XL | GPU benchmark |
| Distributed solver | n > 1M scaling | XL | Distributed bench |

12. Performance Targets

| Operation | Server (AVX2) | Edge (NEON) | Browser (WASM) | Cloudflare |
|---|---|---|---|---|
| SpMV 10K×10K (1%) | < 30 μs | < 50 μs | < 200 μs | < 300 μs |
| CG solve 10K (ε=1e-6) | < 1 ms | < 2 ms | < 20 ms | < 30 ms |
| Forward Push 10K (ε=1e-4) | < 50 μs | < 100 μs | < 500 μs | < 1 ms |
| Neumann 10K (k=20) | < 600 μs | < 1 ms | < 5 ms | < 8 ms |
| BMSSP 100K (ε=1e-4) | < 50 ms | < 100 ms | N/A | < 200 ms |
| TRUE prep 100K (ε=0.1) | < 500 ms | < 1 s | N/A | < 2 s |
| TRUE solve 100K (amort.) | < 1 ms | < 2 ms | N/A | < 5 ms |
| Batch pairwise 10K | < 15 s | < 30 s | < 120 s | N/A |
| Scheduler tick | < 200 ns | < 300 ns | N/A | N/A |
| Algorithm routing | < 1 μs | < 1 μs | < 5 μs | < 5 μs |

13. Measurement Methodology

  1. Criterion.rs: 200 samples, 5s warmup, p < 0.05 significance
  2. Multi-platform: x86_64 (AVX2) and aarch64 (NEON)
  3. Deterministic seeds: random_vector(dim, seed=42)
  4. Equal accuracy: Fix ε before comparing
  5. Cold + hot cache: Report both first-run and steady-state
  6. Profile.bench: Release optimization with debug symbols
  7. Regression CI: 10% degradation threshold triggers failure
  8. Memory profiling: Peak RSS and allocation count via dhat
  9. Roofline analysis: Verify memory-bound operation
  10. Statistical rigor: Report median, p5, p95, coefficient of variation

Realized Optimizations

The following optimizations from this guide have been implemented in the ruvector-solver crate as of February 2026.

Implemented Techniques

  1. Jacobi-preconditioned Neumann series (D^{-1} splitting): The Neumann solver extracts the diagonal of A and applies D^{-1} as a preconditioner before iteration. This transforms the iteration matrix from (I - A) to (I - D^{-1}A), significantly reducing the spectral radius for diagonally-dominant systems and enabling convergence where unpreconditioned Neumann would diverge or stall.

  2. spmv_unchecked: raw pointer SpMV with zero bounds checks: The inner SpMV loop uses unsafe raw pointer arithmetic to eliminate Rust's bounds-check overhead on every array access. An external bounds validation is performed once before entering the hot loop, maintaining safety guarantees while removing per-element branch overhead.

  3. fused_residual_norm_sq: single-pass residual + norm computation (3 memory passes to 1): Instead of computing r = b - Ax (pass 1), then ||r||^2 (pass 2) as separate operations, the fused kernel computes both the residual vector and its squared norm in a single traversal. This eliminates 2 of 3 memory traversals per iteration, which is critical since SpMV is memory-bandwidth-bound.

  4. 4-wide unrolled Jacobi update in Neumann iteration: The Jacobi preconditioner application loop is manually unrolled 4x, processing four elements per loop body. This reduces loop overhead and exposes instruction-level parallelism to the CPU's out-of-order execution engine.

  5. AVX2 SIMD SpMV (8-wide f32 via horizontal sum): The AVX2 SpMV kernel processes 8 f32 values per SIMD instruction using _mm256_i32gather_ps for gathering x-vector entries and _mm256_fmadd_ps for fused multiply-add accumulation. A horizontal sum reduces the 8-lane accumulator to a scalar row result.

  6. Arena allocator for zero-allocation iteration: Solver working memory (residual, search direction, temporary vectors) is pre-allocated from a bump arena before the iteration loop begins. This eliminates all heap allocation during the solve phase, reducing per-solve overhead from ~20 microseconds to ~200 nanoseconds.

  7. Algorithm router with automatic characterization: The solver includes an algorithm router that characterizes input matrices (size, density, estimated spectral radius, SPD detection) and selects the optimal algorithm automatically. The router runs in under 1 microsecond and directs traffic to the appropriate solver based on the matrix properties identified in Sections 4 and 5.
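A dense, illustrative sketch of how techniques 1, 3, and 4 above fit together — a Jacobi-preconditioned fixed-point iteration whose residual and squared norm come out of one fused pass. This shows only the numerical structure; the production kernels operate on CSR with unchecked indexing, SIMD, and 4-wide unrolling:

```rust
/// Fused kernel: r = b - A*x and ||r||^2 in a single traversal.
fn fused_residual_norm_sq(a: &[Vec<f64>], x: &[f64], b: &[f64], r: &mut [f64]) -> f64 {
    let mut norm_sq = 0.0;
    for i in 0..b.len() {
        let ax: f64 = (0..x.len()).map(|j| a[i][j] * x[j]).sum();
        r[i] = b[i] - ax;
        norm_sq += r[i] * r[i]; // accumulate while r[i] is still in register
    }
    norm_sq
}

/// Jacobi-preconditioned iteration: x_{k+1} = x_k + D^{-1}(b - A x_k).
fn jacobi_preconditioned_solve(a: &[Vec<f64>], b: &[f64], tol: f64, max_k: usize) -> Vec<f64> {
    let n = b.len();
    let mut x = vec![0.0f64; n];
    let mut r = vec![0.0f64; n];
    for _ in 0..max_k {
        // one memory pass yields both the residual and its squared norm
        if fused_residual_norm_sq(a, &x, b, &mut r) <= tol * tol {
            break;
        }
        // preconditioned update; diagonal dominance keeps this convergent
        for i in 0..n {
            x[i] += r[i] / a[i][i];
        }
    }
    x
}

fn main() {
    // diagonally dominant 2x2 system with exact solution (1/11, 7/11)
    let a = vec![vec![4.0, 1.0], vec![1.0, 3.0]];
    let b = vec![1.0, 2.0];
    let x = jacobi_preconditioned_solve(&a, &b, 1e-12, 100);
    println!("{:?}", x);
}
```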

Performance Data

| Algorithm | Complexity | Notes |
|---|---|---|
| Neumann | O(k × nnz) | Converges with k typically 10-50 for well-conditioned systems (spectral radius < 0.9). Jacobi preconditioning extends the convergence regime. |
| CG | O(√κ × log(1/ε) × nnz) | Gold standard for SPD systems. Optimal by the Nemirovski-Yudin lower bound. Scales gracefully with condition number. |
| Fused kernel | Eliminates 2 of 3 memory traversals per iteration | For bandwidth-bound SpMV (arithmetic intensity 0.167 FLOP/byte), reducing memory passes from 3 to 1 translates directly to up to 3x throughput improvement for the residual computation step. |