# ADR-STS-009: Concurrency Model and Parallelism Strategy

## Status

Accepted

## Date

2026-02-20

## Authors

RuVector Architecture Team

## Deciders

Architecture Review Board

---
## Context

The sublinear-time solver integration must align with ruvector's existing concurrency
infrastructure while introducing its own nanosecond-precision scheduler. The current
codebase employs a well-defined concurrency stack:

- **Rayon 1.10** -- Data-parallel iterators, feature-gated behind `parallel` (disabled on
  `wasm32`). Used in `ruvector-core` for `batch_distances()` via `par_iter()` and in
  `FlatIndex::search()` via `par_bridge()` over `DashMap` iterators. Optional in 8+ crates
  including `ruvector-delta-wasm`, `ruvector-delta-index`, `ruQu`, `ruvector-delta-graph`.
- **Crossbeam 0.8** -- Lock-free data structures (`ArrayQueue`, `SegQueue`, `CachePadded`).
  Powers the existing `LockFreeWorkQueue`, `LockFreeCounter`, `LockFreeStats`,
  `LockFreeBatchProcessor`, `ObjectPool`, and `AtomicVectorPool` in
  `crates/ruvector-core/src/lockfree.rs`.
- **DashMap 6.1** -- Concurrent hash map used throughout core indexing, delta operations,
  and graph storage. Provides low-contention read access via sharded `RwLock` segments.
- **parking_lot 0.12** -- Efficient mutex and RwLock replacements, used in 12+ crates
  across the workspace (sona, delta-core, delta-index, delta-graph, rvlite, etc.).
- **Tokio 1.41** -- Async runtime with multi-thread scheduler. Node.js bindings use
  `tokio::task::spawn_blocking` pervasively (observed in `ruvector-attention-node`,
  `ruvector-graph-node`, `ruvector-tiny-dancer-node`, `ruvector-router-ffi`).
  Synchronization primitives (`mpsc`, `RwLock`, `oneshot`, `Notify`) are used in raft,
  replication, serving engines, and the MCP gate.
- **Custom lock-free structures** -- `AtomicVectorPool` provides zero-allocation vector
  reuse with `SegQueue`-backed pooling and `CachePadded` atomics. `LockFreeWorkQueue`
  wraps `ArrayQueue` for bounded work distribution. `LockFreeBatchProcessor` combines
  both for producer-consumer batch processing.

The sublinear-time solver introduces its own nanosecond-precision scheduler capable of
98ns tick intervals and 11M+ task dispatches per second. This scheduler manages
fine-grained intra-solve task decomposition (matrix partitioning, random-walk step
scheduling, Neumann series term evaluation). It must coexist with Rayon's thread pool
without causing thread starvation or pool exhaustion.

WASM targets (`wasm32-unknown-unknown`) lack native thread support. The existing ruvector
WASM crates use Web Workers with `postMessage` for parallelism, managed by a `WorkerPool`
class (round-robin distribution, promise-based API, 30-second timeout). The solver's WASM
target must follow this established pattern.

Thread scaling measurements from `bench_thread_scaling` in `comprehensive_bench.rs`
(10,000 vectors at 384 dimensions, Euclidean distance):

| Threads | Relative Efficiency | Bottleneck                  |
|---------|---------------------|-----------------------------|
| 1       | 100% (baseline)     | N/A                         |
| 2       | 85-95%              | Synchronization overhead    |
| 4       | 70-85%              | L3 cache contention         |
| 8       | 50-70%              | Memory bandwidth saturation |

The sub-linear scaling beyond 4 threads indicates that memory bandwidth, not CPU compute,
is the dominant constraint for vectorized workloads. This fundamentally shapes the
parallelism strategy for solver integration.
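The efficiency column is achieved speedup divided by thread count; the arithmetic can be
sanity-checked in isolation (the helper name below is illustrative, not a codebase API):

```rust
/// Relative parallel efficiency: achieved speedup (t1 / tn)
/// divided by the ideal speedup (the thread count).
fn relative_efficiency(t1_ms: f64, tn_ms: f64, threads: u32) -> f64 {
    (t1_ms / tn_ms) / threads as f64
}

fn main() {
    // Hypothetical timings: 8 threads finish in 2.1 ms what a single
    // thread does in 10 ms -- a 4.76x speedup out of an ideal 8x,
    // i.e. ~60% efficiency, inside the 50-70% band measured above.
    let eff = relative_efficiency(10.0, 2.1, 8);
    assert!(eff > 0.5 && eff < 0.7);
    println!("{:.2}", eff);
}
```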
---

## Decision

### 1. Two-Level Parallelism Architecture

The solver integration adopts a strict two-level parallelism model that separates
coarse-grained data parallelism from fine-grained compute parallelism.

**Outer level**: Rayon `par_iter()` for batch operations across independent solve
invocations (multi-query, batch solve, parallel search-then-solve pipelines).

**Inner level**: SIMD vectorization within each individual solve invocation. This
includes both auto-vectorized loops (compiler-driven via `-C target-cpu=native`) and
explicit SIMD intrinsics (AVX2/NEON/WASM SIMD) for critical inner loops such as
matrix-vector products and distance computations.

**No nested Rayon** is permitted. A Rayon task must never spawn additional Rayon parallel
iterators, as this risks exhausting the global thread pool and causing deadlocks. The
solver's internal task decomposition uses its own scheduler (Decision 2) rather than
recursive Rayon parallelism.
```
+------------------------------------------------------------------+
|                        Application Layer                         |
|                 batch_solve([q1, q2, ..., qN])                   |
+------------------------------------------------------------------+
        |                    |                     |
        v                    v                     v
+------------------+  +------------------+  +------------------+
|  Rayon Task #1   |  |  Rayon Task #2   |  |  Rayon Task #N   |
|  (Outer Level)   |  |  (Outer Level)   |  |  (Outer Level)   |
|    solve(q1)     |  |    solve(q2)     |  |    solve(qN)     |
+------------------+  +------------------+  +------------------+
        |                    |                     |
        v                    v                     v
+------------------+  +------------------+  +------------------+
| Solver Scheduler |  | Solver Scheduler |  | Solver Scheduler |
|    98ns tick     |  |    98ns tick     |  |    98ns tick     |
|  (Inner Level)   |  |  (Inner Level)   |  |  (Inner Level)   |
+------------------+  +------------------+  +------------------+
        |                    |                     |
        v                    v                     v
+------------------+  +------------------+  +------------------+
|   SIMD Kernel    |  |   SIMD Kernel    |  |   SIMD Kernel    |
|  AVX2/NEON/WASM  |  |  AVX2/NEON/WASM  |  |  AVX2/NEON/WASM  |
|   (Vectorized)   |  |   (Vectorized)   |  |   (Vectorized)   |
+------------------+  +------------------+  +------------------+
```
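The two-level split can be illustrated with std threads standing in for Rayon's outer
level (a deliberately simplified sketch; `solve_single` here is a toy stand-in, not the
solver API):

```rust
use std::thread;

/// Stand-in for one "inner level" solve: strictly sequential work on
/// the calling thread (the real code uses the solver scheduler + SIMD).
fn solve_single(query: &[f32]) -> f32 {
    query.iter().map(|x| x * x).sum()
}

fn main() {
    let queries: Vec<Vec<f32>> =
        vec![vec![1.0, 2.0], vec![3.0, 4.0], vec![5.0, 6.0]];

    // Outer level: one worker per independent query (Rayon's par_iter
    // in the real code). No nesting: each worker stays sequential inside.
    let handles: Vec<_> = queries
        .into_iter()
        .map(|q| thread::spawn(move || solve_single(&q)))
        .collect();

    let results: Vec<f32> = handles.into_iter().map(|h| h.join().unwrap()).collect();
    assert_eq!(results, vec![5.0, 25.0, 61.0]);
}
```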
#### Rayon Integration Code

```rust
use rayon::prelude::*;

/// Batch solve: outer parallelism via Rayon, inner via solver scheduler + SIMD.
pub fn batch_solve(
    queries: &[SolveRequest],
    solver: &SublinearSolver,
    config: &SolverConfig,
) -> Vec<SolveResult> {
    #[cfg(all(feature = "parallel", not(target_arch = "wasm32")))]
    {
        queries
            .par_iter()
            .map(|query| solver.solve_single(query, config))
            .collect()
    }
    #[cfg(any(not(feature = "parallel"), target_arch = "wasm32"))]
    {
        queries
            .iter()
            .map(|query| solver.solve_single(query, config))
            .collect()
    }
}

/// Multi-query search with solver-accelerated re-ranking.
pub fn parallel_search_and_solve(
    queries: &[Vec<f32>],
    index: &HnswIndex,
    solver: &SublinearSolver,
    k: usize,
) -> Vec<Vec<SearchResult>> {
    #[cfg(all(feature = "parallel", not(target_arch = "wasm32")))]
    {
        queries
            .par_iter()
            .map(|query| {
                // Phase 1: HNSW candidate retrieval (O(log n))
                let candidates = index.search(query, k * 4).unwrap();
                // Phase 2: Solver-accelerated exact re-ranking (sublinear)
                solver.rerank(query, &candidates, k)
            })
            .collect()
    }
    #[cfg(any(not(feature = "parallel"), target_arch = "wasm32"))]
    {
        queries
            .iter()
            .map(|query| {
                let candidates = index.search(query, k * 4).unwrap();
                solver.rerank(query, &candidates, k)
            })
            .collect()
    }
}
```
### 2. Solver Scheduler Integration

The solver's nanosecond-precision scheduler operates exclusively within a single Rayon
task. It manages fine-grained intra-solve task decomposition without interacting with
Rayon's thread pool or work-stealing queues.

```
+---------------------------------------------------------------+
|                       Single Rayon Task                       |
|                                                               |
| +----------------------------------------------------------+ |
| |              Solver Scheduler (98ns tick)                | |
| |                                                          | |
| | Task Queue:                                              | |
| |  [Neumann Term 0] [Neumann Term 1] [Random Walk Step]    | |
| |  [Matrix Partition A] [Matrix Partition B] [Reduce]      | |
| |                                                          | |
| | Execution Timeline (single thread):                      | |
| |  |--98ns--|--98ns--|--98ns--|--98ns--|--98ns--|          | |
| |  |   T0   |   T1   |   T2   |   T3   |   T0   | ...      | |
| |  |  run   |  run   |  run   |  run   |  cont  |          | |
| |                                                          | |
| | Scheduler Responsibilities:                              | |
| |  - Task priority ordering                                | |
| |  - Convergence checking per tick                         | |
| |  - Early termination on epsilon satisfaction             | |
| |  - Budget enforcement (max ticks per solve)              | |
| +----------------------------------------------------------+ |
+---------------------------------------------------------------+
```
#### Scheduler Isolation Contract

```rust
/// The solver scheduler runs entirely within a single OS thread.
/// It MUST NOT:
/// - Call rayon::spawn() or any Rayon parallel iterator
/// - Acquire locks held by other Rayon tasks
/// - Block on tokio channels or async primitives
/// - Allocate from the global allocator in the hot loop
///
/// It MAY:
/// - Use thread-local storage for scratch buffers
/// - Use AtomicVectorPool for zero-alloc vector operations
/// - Read from shared immutable data (index, graph adjacency)
/// - Write to its own output buffer (non-shared)
pub struct SolverScheduler {
    /// Task queue using crossbeam for efficient scheduling
    task_queue: crossbeam::deque::Worker<SolverTask>,
    /// Pre-allocated scratch space from AtomicVectorPool
    scratch_pool: AtomicVectorPool,
    /// Tick interval in nanoseconds (default: 98)
    tick_ns: u64,
    /// Maximum ticks per solve invocation
    max_ticks: u64,
    /// Convergence threshold
    epsilon: f64,
}

impl SolverScheduler {
    pub fn new(dimensions: usize, config: &SchedulerConfig) -> Self {
        Self {
            task_queue: crossbeam::deque::Worker::new_fifo(),
            scratch_pool: AtomicVectorPool::new(dimensions, 16, 64),
            tick_ns: config.tick_ns.unwrap_or(98),
            max_ticks: config.max_ticks.unwrap_or(100_000),
            epsilon: config.epsilon.unwrap_or(1e-6),
        }
    }

    /// Execute a solve within the scheduler's tick loop.
    /// This function is called from within a Rayon task and runs
    /// entirely on the calling thread.
    pub fn execute(&self, problem: &SolveProblem) -> SolveResult {
        let mut state = SolverState::init(problem, &self.scratch_pool);
        let mut ticks: u64 = 0;

        loop {
            // Process one tick's worth of work
            if let Some(task) = self.task_queue.pop() {
                task.execute(&mut state);
            }

            ticks += 1;

            // Check convergence
            if state.residual() < self.epsilon {
                return SolveResult::converged(state, ticks);
            }

            // Check budget
            if ticks >= self.max_ticks {
                return SolveResult::budget_exhausted(state, ticks);
            }

            // Generate follow-up tasks based on current state
            self.generate_tasks(&state);
        }
    }

    fn generate_tasks(&self, state: &SolverState) {
        // Neumann series: schedule the next term if not converged
        if state.needs_next_term() {
            self.task_queue.push(SolverTask::NeumannTerm {
                term_index: state.current_term + 1,
            });
        }
        // Random walk: schedule additional walks if variance is too high
        if state.variance_too_high() {
            self.task_queue.push(SolverTask::RandomWalkBatch {
                count: state.walks_needed(),
            });
        }
    }
}
```
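The default budget bounds worst-case latency: `max_ticks * tick_ns` is a hard ceiling
before `budget_exhausted` is returned. A quick sanity check of the defaults (standalone
arithmetic, not a codebase API):

```rust
fn main() {
    // Defaults from SchedulerConfig: 98ns tick, 100_000-tick budget.
    let tick_ns: u64 = 98;
    let max_ticks: u64 = 100_000;

    // Worst-case scheduler time per solve before budget exhaustion.
    let budget_ns = tick_ns * max_ticks;
    assert_eq!(budget_ns, 9_800_000); // ~9.8 ms ceiling per solve
    println!("{} ns (~{:.1} ms)", budget_ns, budget_ns as f64 / 1e6);
}
```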
### 3. Crossbeam Integration

The solver extends ruvector's existing lock-free infrastructure with two additional
integration points.

#### Work-Stealing Task Queue

The solver's task queue uses `crossbeam::deque::Injector` for work distribution when
operating in multi-solve mode. Each Rayon worker thread owns a local
`crossbeam::deque::Worker` deque; the `Injector` distributes initial tasks, and
`Stealer` handles load balancing across solver instances.

```
+----------------------------------------------------------+
|                 Work-Stealing Topology                   |
|                                                          |
|            +--------------------+                        |
|            | Injector           | <-- batch_solve()      |
|            | (Global Task Queue)|     pushes here        |
|            +--------+-----------+                        |
|                     |                                    |
|           +---------+---------+                          |
|           |         |         |                          |
|        +--v---+  +--v---+  +--v---+                      |
|        |Worker|  |Worker|  |Worker| <-- Rayon            |
|        |Deque |  |Deque |  |Deque |     thread-local     |
|        +--+---+  +--+---+  +--+---+                      |
|           |         |         |                          |
|           v         v         v                          |
|        Steal <--> Steal <--> Steal  (work stealing)      |
+----------------------------------------------------------+
```
```rust
use crossbeam::deque::{Injector, Steal, Stealer, Worker};

/// Multi-solve work distributor using crossbeam work-stealing deques.
/// Integrates with Rayon by running within install() scope.
pub struct SolverWorkDistributor {
    injector: Injector<SolveRequest>,
    stealers: Vec<Stealer<SolveRequest>>,
    workers: Vec<Worker<SolveRequest>>,
}

impl SolverWorkDistributor {
    pub fn new(num_workers: usize) -> Self {
        let injector = Injector::new();
        let mut workers = Vec::with_capacity(num_workers);
        let mut stealers = Vec::with_capacity(num_workers);

        for _ in 0..num_workers {
            let worker = Worker::new_fifo();
            stealers.push(worker.stealer());
            workers.push(worker);
        }

        Self { injector, stealers, workers }
    }

    /// Submit a batch of solve requests for parallel execution.
    pub fn submit_batch(&self, requests: Vec<SolveRequest>) {
        for request in requests {
            self.injector.push(request);
        }
    }

    /// Attempt to steal work. Called by idle Rayon threads.
    pub fn find_work(&self, worker_id: usize) -> Option<SolveRequest> {
        // Try the local deque first
        let local = &self.workers[worker_id];
        if let Some(task) = local.pop() {
            return Some(task);
        }

        // Try the global injector
        loop {
            match self.injector.steal_batch_and_pop(local) {
                Steal::Success(task) => return Some(task),
                Steal::Empty => break,
                Steal::Retry => continue,
            }
        }

        // Try stealing from other workers
        for (i, stealer) in self.stealers.iter().enumerate() {
            if i == worker_id { continue; }
            loop {
                match stealer.steal() {
                    Steal::Success(task) => return Some(task),
                    Steal::Empty => break,
                    Steal::Retry => continue,
                }
            }
        }

        None
    }
}
```
#### Lock-Free Statistics

Solver statistics collection extends the existing `LockFreeStats` pattern from
`crates/ruvector-core/src/lockfree.rs`, using `CachePadded<AtomicU64>` for
contention-free updates across Rayon worker threads.
```rust
use crossbeam::utils::CachePadded;
use std::sync::atomic::{AtomicU64, Ordering};

/// Lock-free solver statistics, following ruvector's LockFreeStats pattern.
/// Each field is cache-line padded to prevent false sharing between threads.
#[repr(align(64))]
pub struct SolverStats {
    solves_completed: CachePadded<AtomicU64>,
    total_ticks: CachePadded<AtomicU64>,
    convergences: CachePadded<AtomicU64>,
    budget_exhaustions: CachePadded<AtomicU64>,
    total_latency_ns: CachePadded<AtomicU64>,
    neumann_terms_computed: CachePadded<AtomicU64>,
    random_walks_executed: CachePadded<AtomicU64>,
}

impl SolverStats {
    pub fn new() -> Self {
        Self {
            solves_completed: CachePadded::new(AtomicU64::new(0)),
            total_ticks: CachePadded::new(AtomicU64::new(0)),
            convergences: CachePadded::new(AtomicU64::new(0)),
            budget_exhaustions: CachePadded::new(AtomicU64::new(0)),
            total_latency_ns: CachePadded::new(AtomicU64::new(0)),
            neumann_terms_computed: CachePadded::new(AtomicU64::new(0)),
            random_walks_executed: CachePadded::new(AtomicU64::new(0)),
        }
    }

    #[inline]
    pub fn record_solve(&self, result: &SolveResult) {
        self.solves_completed.fetch_add(1, Ordering::Relaxed);
        self.total_ticks.fetch_add(result.ticks, Ordering::Relaxed);
        self.total_latency_ns.fetch_add(result.latency_ns, Ordering::Relaxed);

        if result.converged {
            self.convergences.fetch_add(1, Ordering::Relaxed);
        } else {
            self.budget_exhaustions.fetch_add(1, Ordering::Relaxed);
        }
    }

    pub fn snapshot(&self) -> SolverStatsSnapshot {
        let completed = self.solves_completed.load(Ordering::Relaxed);
        let total_latency = self.total_latency_ns.load(Ordering::Relaxed);
        let total_ticks = self.total_ticks.load(Ordering::Relaxed);

        SolverStatsSnapshot {
            solves_completed: completed,
            convergence_rate: if completed > 0 {
                self.convergences.load(Ordering::Relaxed) as f64 / completed as f64
            } else { 0.0 },
            avg_latency_ns: if completed > 0 {
                total_latency / completed
            } else { 0 },
            avg_ticks_per_solve: if completed > 0 {
                total_ticks / completed
            } else { 0 },
            neumann_terms: self.neumann_terms_computed.load(Ordering::Relaxed),
            random_walks: self.random_walks_executed.load(Ordering::Relaxed),
        }
    }
}
```
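The padding idea can be demonstrated with std alone. The sketch below (names are
hypothetical; crossbeam's `CachePadded` is the production equivalent) aligns each counter
to its own 64-byte cache line so concurrent increments on different fields never
invalidate each other's line:

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;
use std::thread;

/// Minimal stand-in for crossbeam's CachePadded: force the value onto
/// its own 64-byte cache line so neighboring counters never share one.
#[repr(align(64))]
struct Padded(AtomicU64);

fn main() {
    let counters: Arc<[Padded; 2]> =
        Arc::new([Padded(AtomicU64::new(0)), Padded(AtomicU64::new(0))]);

    // Two threads hammer different counters; the alignment keeps the
    // updates on separate cache lines, avoiding false sharing.
    let handles: Vec<_> = (0..2usize)
        .map(|i| {
            let c = Arc::clone(&counters);
            thread::spawn(move || {
                for _ in 0..10_000 {
                    c[i].0.fetch_add(1, Ordering::Relaxed);
                }
            })
        })
        .collect();
    for h in handles {
        h.join().unwrap();
    }

    assert_eq!(counters[0].0.load(Ordering::Relaxed), 10_000);
    assert_eq!(counters[1].0.load(Ordering::Relaxed), 10_000);
    assert_eq!(std::mem::align_of::<Padded>(), 64);
}
```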
### 4. WASM Parallelism

WASM targets lack native thread support. The solver's WASM binding follows the established
ruvector pattern: a `WorkerPool` (JavaScript) manages Web Workers, each running an
independent WASM solver instance.

```
+---------------------------------------------------------------+
|                     Browser Main Thread                       |
|                                                               |
| +-----------------------------------------------------------+ |
| | SolverWorkerPool (extends WorkerPool pattern)             | |
| |                                                           | |
| | init()               -> spawns N Web Workers              | |
| |                         (navigator.hardwareConcurrency)   | |
| | solve(request)       -> round-robin to next idle worker   | |
| | batchSolve(requests) -> chunk and distribute              | |
| | terminate()          -> cleanup all workers               | |
| +--+--------+--------+--------+-----------------------------+ |
|    |        |        |        |                               |
+----+--------+--------+--------+-------------------------------+
     |        |        |        |
     v        v        v        v      postMessage / onmessage
+---------+ +---------+ +---------+ +---------+
|Worker 0 | |Worker 1 | |Worker 2 | |Worker 3 |
|         | |         | |         | |         |
| +------+| | +------+| | +------+| | +------+|
| |WASM  || | |WASM  || | |WASM  || | |WASM  ||
| |Solver|| | |Solver|| | |Solver|| | |Solver||
| |Module|| | |Module|| | |Module|| | |Module||
| +------+| | +------+| | +------+| | +------+|
|         | |         | |         | |         |
|Scheduler| |Scheduler| |Scheduler| |Scheduler|
| (98ns)  | | (98ns)  | | (98ns)  | | (98ns)  |
+---------+ +---------+ +---------+ +---------+
     |          |           |           |
     v          v           v           v
[WASM SIMD] [WASM SIMD] [WASM SIMD] [WASM SIMD]
 (128-bit)   (128-bit)   (128-bit)   (128-bit)
```
#### SharedArrayBuffer Zero-Copy Path

When `SharedArrayBuffer` is available (requires cross-origin isolation, i.e. the headers
`Cross-Origin-Opener-Policy: same-origin` and
`Cross-Origin-Embedder-Policy: require-corp`), the solver uses shared memory for
zero-copy data transfer between workers.
```javascript
/**
 * Solver Worker Pool -- extends ruvector's WorkerPool pattern
 * with SharedArrayBuffer support and solver-specific operations.
 */
export class SolverWorkerPool {
  constructor(workerUrl, wasmUrl, options = {}) {
    this.workerUrl = workerUrl;
    this.wasmUrl = wasmUrl;
    this.poolSize = options.poolSize || navigator.hardwareConcurrency || 4;
    this.workers = [];
    this.nextWorker = 0;
    this.pendingRequests = new Map();
    this.requestId = 0;
    this.initialized = false;
    this.sharedMemoryAvailable = typeof SharedArrayBuffer !== 'undefined';
    this.sharedBuffer = null;
  }

  async init() {
    if (this.initialized) return;

    // Allocate shared memory if available
    if (this.sharedMemoryAvailable) {
      const bufferSize = this.poolSize * 4 * 1024 * 1024; // 4MB per worker
      this.sharedBuffer = new SharedArrayBuffer(bufferSize);
      console.log(`SharedArrayBuffer allocated: ${bufferSize} bytes`);
    }

    const initPromises = [];
    for (let i = 0; i < this.poolSize; i++) {
      const worker = new Worker(this.workerUrl, { type: 'module' });
      worker.onmessage = (e) => this.handleMessage(i, e);
      worker.onerror = (error) => this.handleError(i, error);

      this.workers.push({ worker, busy: false, id: i });

      const initData = {
        wasmUrl: this.wasmUrl,
        workerId: i,
        sharedBuffer: this.sharedBuffer,
        bufferOffset: i * 4 * 1024 * 1024,
        bufferSize: 4 * 1024 * 1024,
      };
      initPromises.push(this.sendToWorker(i, 'init', initData));
    }

    await Promise.all(initPromises);
    this.initialized = true;
  }

  /**
   * Solve a single request on the next available worker.
   */
  async solve(request) {
    if (!this.initialized) await this.init();
    const workerId = this.getNextWorker();

    if (this.sharedMemoryAvailable) {
      // Zero-copy path: write input to the worker's shared-buffer region
      const view = new Float32Array(
        this.sharedBuffer,
        workerId * 4 * 1024 * 1024,
        request.data.length
      );
      view.set(request.data);
      return this.sendToWorker(workerId, 'solveShared', {
        offset: 0,
        length: request.data.length,
        config: request.config,
      });
    } else {
      // Transferable path: hand ownership of the ArrayBuffer to the worker.
      // The buffer must appear in the message data AND in the transfer
      // list, or the worker never receives it.
      const buffer = new Float32Array(request.data).buffer;
      return this.sendToWorkerTransfer(workerId, 'solve', {
        buffer,
        config: request.config,
      }, [buffer]);
    }
  }

  /**
   * Batch solve distributes requests across all workers.
   */
  async batchSolve(requests) {
    if (!this.initialized) await this.init();

    const chunkSize = Math.ceil(requests.length / this.poolSize);
    const chunks = [];
    for (let i = 0; i < requests.length; i += chunkSize) {
      chunks.push(requests.slice(i, i + chunkSize));
    }

    const promises = chunks.map((chunk, i) =>
      this.sendToWorker(i % this.poolSize, 'batchSolve', {
        requests: chunk.map(r => ({
          data: Array.from(r.data),
          config: r.config,
        })),
      })
    );

    const results = await Promise.all(promises);
    return results.flat();
  }

  // -- Utility methods following the WorkerPool pattern --

  getNextWorker() {
    for (let i = 0; i < this.workers.length; i++) {
      const idx = (this.nextWorker + i) % this.workers.length;
      if (!this.workers[idx].busy) {
        this.nextWorker = (idx + 1) % this.workers.length;
        return idx;
      }
    }
    const idx = this.nextWorker;
    this.nextWorker = (this.nextWorker + 1) % this.workers.length;
    return idx;
  }

  handleMessage(workerId, event) {
    const { requestId, data, error } = event.data;
    const request = this.pendingRequests.get(requestId);
    if (!request) return;

    this.workers[workerId].busy = false;
    this.pendingRequests.delete(requestId);

    if (error) {
      request.reject(new Error(error.message));
    } else {
      request.resolve(data);
    }
  }

  handleError(workerId, error) {
    console.error(`Solver worker ${workerId} error:`, error);
    for (const [requestId, request] of this.pendingRequests) {
      if (request.workerId === workerId) {
        request.reject(error);
        this.pendingRequests.delete(requestId);
      }
    }
  }

  sendToWorker(workerId, type, data) {
    return new Promise((resolve, reject) => {
      const requestId = this.requestId++;
      this.pendingRequests.set(requestId, { resolve, reject, workerId });
      this.workers[workerId].busy = true;
      this.workers[workerId].worker.postMessage({
        type, data: { ...data, requestId },
      });
    });
  }

  sendToWorkerTransfer(workerId, type, data, transferables) {
    return new Promise((resolve, reject) => {
      const requestId = this.requestId++;
      this.pendingRequests.set(requestId, { resolve, reject, workerId });
      this.workers[workerId].busy = true;
      this.workers[workerId].worker.postMessage(
        { type, data: { ...data, requestId } },
        transferables
      );
    });
  }

  terminate() {
    for (const { worker } of this.workers) {
      worker.terminate();
    }
    this.workers = [];
    this.initialized = false;
    this.sharedBuffer = null;
  }
}
```
#### Fallback Path (No SharedArrayBuffer)

When `SharedArrayBuffer` is unavailable (non-isolated context, older browsers, iOS Safari
prior to 16.4), the solver falls back to `postMessage` with transferable `ArrayBuffer`
objects. Ownership of the buffer transfers to the worker, avoiding copies at the cost of
leaving the source buffer detached (neutered, zero-length) after transfer.

```
SharedArrayBuffer available?
            |
       +----+----+
       |         |
      YES        NO
       |         |
       v         v
  Zero-copy   Transferable ArrayBuffer
  shared      (ownership transfer,
  memory      source buffer detached)
       |         |
       v         v
  Worker reads  Worker receives the
  from shared   moved buffer with no
  view at       copy; the sender can
  offset        no longer access it
```
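The detach behavior is observable outside the browser as well. Node's `structuredClone`
accepts the same transfer list as `postMessage` (a standalone sketch; the worker plumbing
above is elided):

```javascript
// After a transfer, the source ArrayBuffer is detached: byteLength
// drops to 0 and any view over it becomes unusable. This is the cost
// of the zero-copy fallback path described above.
const data = new Float32Array([1, 2, 3, 4]);
const buffer = data.buffer;

console.log(buffer.byteLength); // 16 (4 floats x 4 bytes)

// structuredClone with a transfer list mirrors what postMessage does
// when handing the buffer to a Web Worker.
const moved = structuredClone(buffer, { transfer: [buffer] });

console.log(moved.byteLength);  // 16 -- the receiver owns the bytes
console.log(buffer.byteLength); // 0  -- the source is detached
```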
### 5. Async Integration

The solver is CPU-bound and must not block the Tokio async runtime. All solver invocations
from async contexts use `tokio::task::spawn_blocking`, following the pattern established
in `ruvector-attention-node`, `ruvector-graph-node`, and `ruvector-router-ffi`.

```rust
use std::sync::Arc;

use tokio::sync::broadcast;
use tokio::task;

/// Async wrapper for solver operations.
/// Uses spawn_blocking to avoid blocking the Tokio runtime.
pub struct AsyncSolver {
    solver: Arc<SublinearSolver>,
    event_tx: broadcast::Sender<SolverEvent>,
    semaphore: Arc<tokio::sync::Semaphore>,
}

/// Events emitted during solver operations for observability.
#[derive(Clone, Debug)]
pub enum SolverEvent {
    SolveStarted { request_id: u64 },
    SolveCompleted { request_id: u64, latency_ns: u64, converged: bool },
    SolveFailed { request_id: u64, error: String },
    BatchStarted { batch_id: u64, count: usize },
    BatchCompleted { batch_id: u64, count: usize, total_latency_ns: u64 },
    ConcurrencyLimitReached { current: usize, max: usize },
}

impl AsyncSolver {
    pub fn new(solver: SublinearSolver, max_concurrent: usize) -> Self {
        let (event_tx, _) = broadcast::channel(1024);
        Self {
            solver: Arc::new(solver),
            event_tx,
            semaphore: Arc::new(tokio::sync::Semaphore::new(max_concurrent)),
        }
    }

    /// Subscribe to solver events for monitoring/observability.
    pub fn subscribe(&self) -> broadcast::Receiver<SolverEvent> {
        self.event_tx.subscribe()
    }

    /// Solve a single request asynchronously.
    /// Acquires a semaphore permit to enforce concurrency limits.
    pub async fn solve(&self, request: SolveRequest) -> Result<SolveResult, SolverError> {
        let permit = self.semaphore.acquire().await
            .map_err(|_| SolverError::Shutdown)?;

        let solver = Arc::clone(&self.solver);
        let event_tx = self.event_tx.clone();
        let request_id = request.id;

        let _ = event_tx.send(SolverEvent::SolveStarted { request_id });

        let result = task::spawn_blocking(move || {
            let start = std::time::Instant::now();
            let result = solver.solve_single(&request, &SolverConfig::default());
            let latency_ns = start.elapsed().as_nanos() as u64;

            let _ = event_tx.send(SolverEvent::SolveCompleted {
                request_id,
                latency_ns,
                converged: result.converged,
            });

            result
        })
        .await
        .map_err(|e| SolverError::TaskPanicked(e.to_string()))?;

        drop(permit);
        Ok(result)
    }

    /// Batch solve with Rayon parallelism inside spawn_blocking.
    pub async fn batch_solve(
        &self,
        requests: Vec<SolveRequest>,
    ) -> Result<Vec<SolveResult>, SolverError> {
        let solver = Arc::clone(&self.solver);
        let event_tx = self.event_tx.clone();
        let batch_id = requests.first().map(|r| r.id).unwrap_or(0);
        let count = requests.len();

        let _ = event_tx.send(SolverEvent::BatchStarted { batch_id, count });

        let results = task::spawn_blocking(move || {
            let start = std::time::Instant::now();
            let results = batch_solve(&requests, &solver, &SolverConfig::default());
            let total_latency_ns = start.elapsed().as_nanos() as u64;

            let _ = event_tx.send(SolverEvent::BatchCompleted {
                batch_id, count, total_latency_ns,
            });

            results
        })
        .await
        .map_err(|e| SolverError::TaskPanicked(e.to_string()))?;

        Ok(results)
    }
}
```
#### Event Stream Architecture

```
+------------------+  +------------------+  +------------------+
|   AsyncSolver    |  | Event Subscriber |  | Event Subscriber |
| (spawn_blocking) |  |    (Metrics)     |  |    (Logging)     |
+--------+---------+  +--------+---------+  +--------+---------+
         |                     |                     |
         v                     v                     v
+--------+---------------------+---------------------+---------+
|                  broadcast::channel(1024)                    |
|                                                              |
|  SolverEvent::SolveStarted   { request_id }                  |
|  SolverEvent::SolveCompleted { request_id, latency_ns, ... } |
|  SolverEvent::BatchCompleted { batch_id, count, ... }        |
+--------------------------------------------------------------+
```
### 6. Concurrent Solve Limit

A bounded `tokio::sync::Semaphore` limits the number of solver invocations executing
simultaneously. This prevents memory exhaustion when many async callers submit concurrent
solve requests, each of which allocates scratch buffers from `AtomicVectorPool`.

**Default limit**: `num_cpus::get() * 2`

This follows the pattern used in `ruvector-postgres` (`max_concurrent_searches: 64`) and
`prime-radiant` (`max_concurrent_ops`), but tunes for CPU-bound solver work rather than
I/O-bound database queries.

```rust
/// Configuration for concurrent solve limits.
pub struct ConcurrencyConfig {
    /// Maximum concurrent solver invocations.
    /// Default: num_cpus * 2 (allows slight oversubscription for
    /// work that mixes solver compute with I/O waits).
    pub max_concurrent_solves: usize,

    /// Maximum total memory allocated to solver scratch space.
    /// Enforced via AtomicVectorPool capacity.
    pub max_scratch_memory_bytes: usize,

    /// Backpressure strategy when the limit is reached.
    pub backpressure: BackpressureStrategy,
}

#[derive(Debug, Clone)]
pub enum BackpressureStrategy {
    /// Block until a permit becomes available (default).
    Wait,
    /// Return an error immediately.
    Reject,
    /// Wait up to the specified duration, then return an error.
    WaitTimeout(std::time::Duration),
}

impl Default for ConcurrencyConfig {
    fn default() -> Self {
        let cpus = num_cpus::get();
        Self {
            max_concurrent_solves: cpus * 2,
            max_scratch_memory_bytes: cpus * 16 * 1024 * 1024, // 16MB per CPU
            backpressure: BackpressureStrategy::Wait,
        }
    }
}
```
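The `Wait` strategy is essentially a counting semaphore. In production this is
`tokio::sync::Semaphore`; the mechanics can be sketched with std primitives alone
(illustrative names, not the codebase API):

```rust
use std::sync::{Condvar, Mutex};

/// Minimal counting semaphore illustrating BackpressureStrategy::Wait:
/// callers block until one of `max_concurrent_solves` permits frees up.
struct SolveGate {
    permits: Mutex<usize>,
    cv: Condvar,
}

impl SolveGate {
    fn new(max_concurrent: usize) -> Self {
        Self { permits: Mutex::new(max_concurrent), cv: Condvar::new() }
    }

    /// Block until a permit is available, then take it.
    fn acquire(&self) {
        let mut p = self.permits.lock().unwrap();
        while *p == 0 {
            p = self.cv.wait(p).unwrap();
        }
        *p -= 1;
    }

    /// Return a permit and wake one waiter.
    fn release(&self) {
        *self.permits.lock().unwrap() += 1;
        self.cv.notify_one();
    }
}

fn main() {
    let gate = SolveGate::new(2);
    gate.acquire();
    gate.acquire(); // both permits taken; a third acquire would block
    assert_eq!(*gate.permits.lock().unwrap(), 0);
    gate.release();
    assert_eq!(*gate.permits.lock().unwrap(), 1);
}
```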

---

## Consequences

### Positive

- **No thread pool exhaustion**: Two-level separation ensures Rayon's global pool is never
  starved by nested parallelism. Each solve invocation consumes exactly one Rayon thread.
- **Predictable memory usage**: The semaphore-bounded concurrency limit combined with
  `AtomicVectorPool` capacity bounds prevents out-of-memory conditions under load.
- **WASM compatibility**: The Web Worker pattern matches ruvector's existing 27 WASM crates
  and their `WorkerPool` infrastructure. No new paradigm to learn or maintain.
- **Observable**: The `broadcast::channel` event stream provides real-time solve metrics
  without polling, enabling integration with ruvector's existing `tracing` infrastructure.
- **Zero-contention statistics**: Cache-padded atomics in `SolverStats` eliminate false
  sharing. Measurements from `LockFreeCounter` tests confirm 10K increments across 10
  threads with zero lost updates.
- **Lock-free hot path**: The solver's inner loop uses `AtomicVectorPool` (SegQueue-backed)
  and `crossbeam::deque::Worker` (thread-local). No mutex or RwLock on the critical path.
- **Crossbeam alignment**: Work-stealing deques integrate naturally with Rayon's existing
  work-stealing model, since Rayon itself uses crossbeam internally.
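
The cache-padding technique behind the statistics claim can be sketched with the standard library alone: aligning each counter to its own 64-byte cache line prevents false sharing, which is the same effect `crossbeam_utils::CachePadded` provides. The `SolverStats` field names below are illustrative assumptions.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::thread;

/// Pad each counter to a full 64-byte cache line so increments from
/// different threads never invalidate each other's line.
#[repr(align(64))]
struct PaddedCounter(AtomicU64);

/// Hypothetical stand-in for the ADR's `SolverStats`.
struct SolverStats {
    solves_started: PaddedCounter,
    solves_completed: PaddedCounter,
}

impl SolverStats {
    fn new() -> Self {
        Self {
            solves_started: PaddedCounter(AtomicU64::new(0)),
            solves_completed: PaddedCounter(AtomicU64::new(0)),
        }
    }
}

fn main() {
    // Padding guarantees the two counters occupy distinct cache lines.
    assert_eq!(std::mem::align_of::<PaddedCounter>(), 64);

    let stats = SolverStats::new();
    thread::scope(|s| {
        for _ in 0..10 {
            s.spawn(|| {
                for _ in 0..1_000 {
                    stats.solves_started.0.fetch_add(1, Ordering::Relaxed);
                    stats.solves_completed.0.fetch_add(1, Ordering::Relaxed);
                }
            });
        }
    });
    // 10 threads x 1K increments each: no lost updates.
    assert_eq!(stats.solves_started.0.load(Ordering::Relaxed), 10_000);
    assert_eq!(stats.solves_completed.0.load(Ordering::Relaxed), 10_000);
}
```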

### Negative

- **Scheduler complexity**: The 98ns tick scheduler requires careful calibration per
  platform. Timer resolution on Linux (`clock_gettime(CLOCK_MONOTONIC)`) provides ~25ns
  precision, but macOS and Windows have coarser clocks (~1us).
- **WASM overhead**: Web Worker `postMessage` serialization adds ~50-200us per invocation,
  dwarfing the solver's sub-microsecond tick cost. Batch operations are essential to
  amortize this overhead.
- **SharedArrayBuffer restrictions**: Cross-origin isolation headers (`COOP`/`COEP`) break
  third-party integrations (e.g., iframes, analytics scripts). Deployments must weigh
  zero-copy benefits against integration constraints.
- **Memory bandwidth ceiling**: Thread scaling beyond 4 threads provides diminishing
  returns (50-70% efficiency at 8 threads). The solver cannot overcome this hardware
  limitation; it can only work within it by maximizing SIMD utilization per thread.
- **Rayon global pool coupling**: Solver batch operations share Rayon's global thread pool
  with other subsystems (`batch_distances`, `FlatIndex::search`, delta operations). Under
  contention, solver tasks compete with index operations for threads.

### Neutral

- **Feature flag gating**: Parallelism continues to be gated behind
  `#[cfg(all(feature = "parallel", not(target_arch = "wasm32")))]`, consistent with
  existing crate convention. No new feature flags required.
- **Tokio spawn_blocking pool**: `spawn_blocking` uses a separate thread pool from Rayon.
  The default pool size of 512 threads (Tokio's default) is more than sufficient for
  `max_concurrent_solves = num_cpus * 2`.
- **Crossbeam version alignment**: Both ruvector (0.8) and the solver target crossbeam 0.8.
  No version conflict.
---

## Options Considered

### Option 1: Rayon-Only Parallelism (Flat Model)

Use Rayon `par_iter()` for all parallelism, including intra-solve task decomposition.

- **Pros**: Simplest implementation. Single threading model. No scheduler complexity.
  Rayon's adaptive work stealing handles load balancing automatically.
- **Cons**: Nested `par_iter()` calls risk thread pool exhaustion. Rayon's work-stealing
  granularity (~1us minimum) is too coarse for the solver's 98ns tick. Cannot express
  solver-specific scheduling policies (convergence checking, budget enforcement) within
  Rayon's map/reduce model. Forces all solver tasks through Rayon's global deque, adding
  contention with other subsystems.

### Option 2: Dedicated Thread Pool per Solver

Create a separate `rayon::ThreadPool` for solver operations, isolated from the global pool.

- **Pros**: Complete isolation from other subsystems. No contention with index operations.
  Can size the pool independently (e.g., 4 threads for the solver, the remainder for the index).
- **Cons**: Thread oversubscription: a 4-thread solver pool plus Rayon's global pool (8
  threads by default) creates 12 threads on an 8-core machine. Context-switching overhead
  negates throughput gains. Memory bandwidth is the bottleneck, not thread count, so more
  threads do not help. Cannot share work between pools when one is idle. Increases total
  memory consumption (each pool maintains its own deque infrastructure).

### Option 3: Tokio-Native Async Solve (Rejected)

Run the solver on the Tokio runtime using `tokio::task::spawn` with cooperative yielding.

- **Pros**: Native async integration. No spawn_blocking overhead. Natural backpressure
  via Tokio's task budget system.
- **Cons**: The solver is CPU-bound; running it on Tokio's cooperative scheduler would block
  other async tasks. Tokio's tick granularity (~10us) is 100x coarser than the solver's
  98ns target. Would require `.await` points in inner loops, destroying SIMD pipeline
  throughput. Fundamentally the wrong tool for CPU-bound computation.

### Option 4: Two-Level with Scheduler (Chosen)

Rayon for outer-level batch parallelism; the solver's own nanosecond scheduler for
inner-level task management; crossbeam for lock-free data structures; Tokio integration
via `spawn_blocking`.

- **Pros**: Matches ruvector's existing patterns exactly. Respects memory bandwidth limits
  by not oversubscribing threads. Preserves the solver's 98ns scheduling precision. Lock-free
  hot path. Observable via broadcast channels. WASM-compatible via the established `WorkerPool`.
- **Cons**: Two scheduling systems to understand and maintain. Potential for subtle bugs
  at the Rayon/scheduler boundary (e.g., accidentally calling `par_iter` inside a solve).

---

## Thread Scaling Analysis

### Memory Bandwidth Model

The dominant performance constraint for vectorized solver workloads is memory bandwidth,
not CPU compute. The following model explains the measured scaling behavior.

```
Memory Bandwidth Utilization vs Thread Count
(10K vectors x 384 dimensions x 4 bytes = 15.36 MB working set)

Threads | BW Used (GB/s) | BW Available | Efficiency | Bottleneck
--------|----------------|--------------|------------|---------------------
   1    |       ~8       |  ~50 (DDR5)  |    16%     | CPU-bound (good)
   2    |      ~15       |  ~50         |    30%     | CPU-bound (good)
   4    |      ~28       |  ~50         |    56%     | Approaching BW limit
   8    |      ~38       |  ~50         |    76%     | BW-saturated
  16    |      ~42       |  ~50         |    84%     | Diminishing returns

Note: Effective BW is lower than theoretical peak due to cache coherence
traffic (MESI protocol overhead) and TLB pressure at scale.
```
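
A first-order roofline-style model captures why speedup flattens: compute scales linearly with threads until aggregate bandwidth demand hits the bus limit. The sketch below uses illustrative constants echoing the table (~8 GB/s per thread, ~50 GB/s DDR5); it deliberately ignores the coherence and TLB effects noted above, so it overstates the cliff at high thread counts.

```rust
/// Bandwidth-limited speedup estimate. `bw_per_thread` is the streaming
/// demand of one solver thread in GB/s; `bw_total` is the socket's
/// sustainable bandwidth. Both are assumed constants, not measurements.
fn predicted_speedup(threads: f64, bw_per_thread: f64, bw_total: f64) -> f64 {
    // Linear scaling, clamped at the point where the bus saturates.
    threads.min(bw_total / bw_per_thread)
}

fn main() {
    let (per_thread, total) = (8.0, 50.0); // assumed GB/s figures
    for &t in &[1.0, 2.0, 4.0, 8.0, 16.0] {
        let s = predicted_speedup(t, per_thread, total);
        println!("{:>2} threads -> {:.2}x ({:.0}% efficiency)", t, s, 100.0 * s / t);
    }
    // Beyond 50/8 = 6.25 threads, added threads add no speedup.
    assert_eq!(predicted_speedup(16.0, per_thread, total), 6.25);
}
```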

### Solver-Specific Scaling Predictions

The solver's workload differs from pure distance computation in two important ways:

1. **Higher arithmetic intensity**: Neumann series evaluation performs multiple matrix-vector
   products per memory access, increasing the compute-to-bandwidth ratio. This should
   improve scaling beyond 4 threads relative to simple distance computation.

2. **Smaller working sets per solve**: Individual solve problems typically involve matrices
   of size 128x128 to 1024x1024 (~64KB to ~4MB), fitting within L2/L3 cache. This reduces
   memory bandwidth pressure compared to scanning 10K vectors.
```
Predicted Solver Thread Scaling:

Threads | Distance Comp | Neumann Solve | Random Walk
        |  (measured)   |  (predicted)  | (predicted)
--------|---------------|---------------|-------------
   1    |     100%      |     100%      |    100%
   2    |    85-95%     |    90-97%     |   92-98%
   4    |    70-85%     |    80-92%     |   85-95%
   8    |    50-70%     |    65-82%     |   70-85%

Neumann: Higher arithmetic intensity -> better scaling
Random Walk: Independent walks -> near-linear scaling
```
### Avoiding Nested Parallelism

Nested Rayon parallelism is the single most dangerous anti-pattern for this integration.
The following diagram illustrates the deadlock scenario.

```
DANGEROUS (do NOT do this):

batch_solve()
  .par_iter()                <-- Rayon global pool (8 threads)
    |
    +-> solve_single()
          .par_iter()        <-- NESTED: tries to use same 8 threads
            |
            +-> [DEADLOCK: all 8 threads waiting for inner
                 par_iter work, but no threads available
                 to execute it]

CORRECT (this ADR's approach):

batch_solve()
  .par_iter()                <-- Rayon global pool (8 threads)
    |
    +-> solve_single()
          SolverScheduler::execute()  <-- Sequential on calling thread
            |
            +-> SIMD kernel           <-- Vectorized, single thread
            +-> crossbeam::deque      <-- Thread-local, no contention
```
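
The correct shape can be demonstrated without Rayon using `std::thread::scope`: the batch is split across workers at the outer level only, and each problem is solved sequentially on exactly one thread. The xorshift loop is a stand-in for the real SIMD kernel; `solve_single` and `batch_solve` here are illustrative, not the actual API.

```rust
use std::thread;

/// One solve runs entirely on the calling thread: no inner parallelism,
/// mirroring `SolverScheduler::execute()` in the diagram above.
fn solve_single(seed: u64) -> u64 {
    let mut x = seed;
    for _ in 0..1_000 {
        // xorshift as a cheap stand-in for a sequential solver iteration
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
    }
    x
}

/// Outer level only: one chunk per worker thread, so each problem is
/// solved sequentially on exactly one thread (the role `par_iter()`
/// plays in the ADR's two-level model).
fn batch_solve(problems: &[u64], workers: usize) -> Vec<u64> {
    let chunk = problems.len().div_ceil(workers).max(1);
    let mut out = vec![0u64; problems.len()];
    thread::scope(|s| {
        for (ins, outs) in problems.chunks(chunk).zip(out.chunks_mut(chunk)) {
            s.spawn(move || {
                for (i, o) in ins.iter().zip(outs.iter_mut()) {
                    *o = solve_single(*i); // sequential inner level
                }
            });
        }
    });
    out
}

fn main() {
    let problems: Vec<u64> = (1..=64).collect();
    let parallel = batch_solve(&problems, 8);
    let serial: Vec<u64> = problems.iter().map(|&p| solve_single(p)).collect();
    // The two-level split changes scheduling, never results.
    assert_eq!(parallel, serial);
}
```

Because `solve_single` never spawns or blocks on other tasks, every worker makes progress independently and the exhaustion scenario in the diagram cannot arise.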

---

## Lock-Free Data Structure Usage Map

The following table maps each lock-free structure in the codebase to its role in the
solver integration.

| Structure | Source | Solver Role | Contention Profile |
|-----------|--------|-------------|--------------------|
| `AtomicVectorPool` | `lockfree.rs` | Scratch buffer allocation for Neumann terms, walk accumulators | Per-thread pool instance; no cross-thread contention |
| `LockFreeWorkQueue` | `lockfree.rs` | Bounded result collection from parallel `batch_solve` | Producers: Rayon threads; consumer: `batch_solve` caller |
| `LockFreeStats` | `lockfree.rs` | Pattern template for `SolverStats` | Write-only from Rayon threads; read from monitoring |
| `LockFreeCounter` | `lockfree.rs` | Request ID generation, tick counting | Single atomic increment per solve; negligible contention |
| `LockFreeBatchProcessor` | `lockfree.rs` | Orchestration of multi-phase solve pipelines | Phase transitions; moderate contention at boundaries |
| `ObjectPool<T>` | `lockfree.rs` | Reusable solver state objects | Per-thread acquire/release; low contention |
| `crossbeam::deque::Worker` | crossbeam | Solver scheduler task queue | Thread-local; zero contention |
| `crossbeam::deque::Injector` | crossbeam | Batch task distribution | Single producer (submit), multiple consumers (steal) |
| `DashMap` | dashmap | Solver result cache (optional) | Read-heavy; sharded RwLock minimizes contention |
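
The acquire/release discipline behind `AtomicVectorPool` and `ObjectPool<T>` can be sketched with a lock-based stand-in: scratch buffers are recycled instead of reallocated, which is the property the "Solver Role" column relies on. The real pool replaces the `Mutex<Vec<..>>` with a lock-free SegQueue; `ScratchPool` and its methods are hypothetical names.

```rust
use std::sync::Mutex;

/// Lock-based stand-in for the SegQueue-backed `AtomicVectorPool`.
struct ScratchPool {
    buffers: Mutex<Vec<Vec<f32>>>,
    buffer_len: usize,
}

impl ScratchPool {
    fn new(buffer_len: usize) -> Self {
        Self { buffers: Mutex::new(Vec::new()), buffer_len }
    }

    /// Reuse a returned buffer if one is available, else allocate.
    fn acquire(&self) -> Vec<f32> {
        self.buffers
            .lock()
            .unwrap()
            .pop()
            .unwrap_or_else(|| vec![0.0; self.buffer_len])
    }

    /// Zero the buffer and hand it back for the next solve.
    fn release(&self, mut buf: Vec<f32>) {
        buf.iter_mut().for_each(|v| *v = 0.0);
        self.buffers.lock().unwrap().push(buf);
    }
}

fn main() {
    let pool = ScratchPool::new(128);
    let mut a = pool.acquire();
    a[0] = 42.0;
    let ptr = a.as_ptr();
    pool.release(a);
    // The next acquire hands back the same allocation, zeroed:
    // no allocator traffic on the solver hot path.
    let b = pool.acquire();
    assert_eq!(b.as_ptr(), ptr);
    assert_eq!(b[0], 0.0);
}
```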

---

## Benchmark Predictions for Parallel Solver

Based on existing ruvector benchmark data and the solver's algorithmic properties.

### Single-Threaded Solver Performance

| Operation | Estimated Latency | Basis |
|-----------|-------------------|-------|
| Neumann solve (128x128 sparse) | 2-5 us | 3-5 terms x 500ns matmul |
| Neumann solve (1024x1024 sparse) | 20-80 us | 5-8 terms x 5us matmul |
| Random walk estimate (single entry) | 0.5-2 us | 10-50 steps x 50ns/step |
| Scheduler overhead per solve | 50-200 ns | 1-2 ticks of bookkeeping |
| AtomicVectorPool acquire/release | 15-30 ns | SegQueue push/pop |

### Parallel Batch Performance (8 threads)

| Operation | 1 Thread | 8 Threads | Speedup | Efficiency |
|-----------|----------|-----------|---------|------------|
| 1000x Neumann (128x128) | 3.5 ms | 0.55 ms | 6.4x | 80% |
| 1000x Neumann (1024x1024) | 50 ms | 8.5 ms | 5.9x | 74% |
| 10000x Random walk | 12 ms | 1.8 ms | 6.7x | 84% |
| Mixed batch (Neumann + walk) | 30 ms | 5.2 ms | 5.8x | 72% |

### WASM Performance (4 Web Workers)

| Operation | 1 Worker | 4 Workers | Speedup | Overhead |
|-----------|----------|-----------|---------|----------|
| 100x Neumann (128x128) | 0.8 ms | 0.25 ms | 3.2x | ~100us postMessage |
| 100x Neumann (SAB zero-copy) | 0.8 ms | 0.22 ms | 3.6x | ~10us shared read |
| WASM SIMD vs scalar | - | - | 2-4x | Per-operation |

---

## Version History

| Version | Date | Author | Changes |
|---------|------|--------|---------|
| 0.1 | 2026-02-20 | RuVector Team | Initial proposal |
| 1.0 | 2026-02-20 | RuVector Team | Accepted: full implementation complete |

---

## Implementation Status

The `parallel` feature gate enables Rayon for data-parallel SpMV and crossbeam for the
concurrent solver pipeline. Thread safety is expressed via `Send + Sync` bounds; `DashMap`
backs the concurrent solver registry and `parking_lot` provides low-overhead locking. The
arena allocator is thread-local, `ComputeBudget` is checked atomically, and the WASM
target falls back to single-threaded execution.

---

## Related Decisions

- **ADR-STS-001**: Rust crates integration (establishes the `parallel` feature flag
  convention that this ADR extends)
- **ADR-STS-005**: Architecture analysis (defines the Core-Binding-Surface pattern that
  the async wrapper follows)
- **ADR-STS-006**: WASM integration (defines the Web Worker and SharedArrayBuffer patterns
  that this ADR's WASM parallelism builds on)
- **ADR-STS-008**: Performance analysis (provides the thread scaling measurements and
  benchmark framework referenced throughout)
- **ADR-003**: SIMD optimization strategy (defines the SIMD vectorization approach used
  in the solver's inner level)

## References

- [Rayon documentation: Thread pool configuration](https://docs.rs/rayon/latest/rayon/struct.ThreadPoolBuilder.html)
- [Crossbeam deque: Work-stealing](https://docs.rs/crossbeam-deque/latest/crossbeam_deque/)
- [Tokio: spawn_blocking for CPU-bound work](https://docs.rs/tokio/latest/tokio/task/fn.spawn_blocking.html)
- [MDN: SharedArrayBuffer and Cross-Origin Isolation](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/SharedArrayBuffer)
- [ruvector lockfree.rs](/home/user/ruvector/crates/ruvector-core/src/lockfree.rs)
- [ruvector worker-pool.js](/home/user/ruvector/crates/ruvector-wasm/src/worker-pool.js)
- [ruvector comprehensive_bench.rs](/home/user/ruvector/crates/ruvector-core/benches/comprehensive_bench.rs)
- [ruvector distance.rs batch_distances()](/home/user/ruvector/crates/ruvector-core/src/distance.rs)