Files
wifi-densepose/docs/research/sublinear-time-solver/adr/ADR-STS-009-concurrency-parallelism.md
ruv d803bfe2b1 Squashed 'vendor/ruvector/' content from commit b64c2172
git-subtree-dir: vendor/ruvector
git-subtree-split: b64c21726f2bb37286d9ee36a7869fef60cc6900
2026-02-28 14:39:40 -05:00

1175 lines
46 KiB
Markdown

# ADR-STS-009: Concurrency Model and Parallelism Strategy
## Status
Accepted
## Date
2026-02-20
## Authors
RuVector Architecture Team
## Deciders
Architecture Review Board
---
## Context
The sublinear-time solver integration must align with ruvector's existing concurrency
infrastructure while introducing its own nanosecond-precision scheduler. The current
codebase employs a well-defined concurrency stack:
- **Rayon 1.10** -- Data-parallel iterators, feature-gated behind `parallel` (disabled on
`wasm32`). Used in `ruvector-core` for `batch_distances()` via `par_iter()` and in
`FlatIndex::search()` via `par_bridge()` over `DashMap` iterators. Optional in 8+ crates
including `ruvector-delta-wasm`, `ruvector-delta-index`, `ruQu`, `ruvector-delta-graph`.
- **Crossbeam 0.8** -- Lock-free data structures (`ArrayQueue`, `SegQueue`, `CachePadded`).
Powers the existing `LockFreeWorkQueue`, `LockFreeCounter`, `LockFreeStats`,
`LockFreeBatchProcessor`, `ObjectPool`, and `AtomicVectorPool` in
`crates/ruvector-core/src/lockfree.rs`.
- **DashMap 6.1** -- Concurrent hash map used throughout core indexing, delta operations,
and graph storage. Provides lock-free read access via sharded `RwLock` segments.
- **parking_lot 0.12** -- Efficient mutex and RwLock replacements, used in 12+ crates
across the workspace (sona, delta-core, delta-index, delta-graph, rvlite, etc.).
- **Tokio 1.41** -- Async runtime with multi-thread scheduler. Node.js bindings use
`tokio::task::spawn_blocking` pervasively (observed in `ruvector-attention-node`,
`ruvector-graph-node`, `ruvector-tiny-dancer-node`, `ruvector-router-ffi`).
Synchronization primitives (`mpsc`, `RwLock`, `oneshot`, `Notify`) used in raft,
replication, serving engines, and MCP gate.
- **Custom lock-free structures** -- `AtomicVectorPool` provides zero-allocation vector
reuse with `SegQueue`-backed pooling and `CachePadded` atomics. `LockFreeWorkQueue`
wraps `ArrayQueue` for bounded work distribution. `LockFreeBatchProcessor` combines both
for producer-consumer batch processing.
The sublinear-time solver introduces its own nanosecond-precision scheduler capable of
98ns tick intervals and 11M+ task dispatches per second. This scheduler manages fine-grained
intra-solve task decomposition (matrix partitioning, random-walk step scheduling, Neumann
series term evaluation). It must coexist with Rayon's thread pool without causing thread
starvation or pool exhaustion.
WASM targets (`wasm32-unknown-unknown`) lack native thread support. The existing ruvector
WASM crates use Web Workers with `postMessage` for parallelism, managed by a `WorkerPool`
class (round-robin distribution, promise-based API, 30-second timeout). The solver's WASM
target must follow this established pattern.
Thread scaling measurements from `bench_thread_scaling` in `comprehensive_bench.rs`
(10,000 vectors at 384 dimensions, Euclidean distance):
| Threads | Relative Efficiency | Bottleneck |
|---------|-------------------|------------|
| 1 | 100% (baseline) | N/A |
| 2 | 85-95% | Synchronization overhead |
| 4 | 70-85% | L3 cache contention |
| 8 | 50-70% | Memory bandwidth saturation |
The sub-linear scaling beyond 4 threads indicates memory bandwidth as the dominant
constraint for vectorized workloads, not CPU compute. This fundamentally shapes the
parallelism strategy for solver integration.
---
## Decision
### 1. Two-Level Parallelism Architecture
The solver integration adopts a strict two-level parallelism model that separates
coarse-grained data parallelism from fine-grained compute parallelism.
**Outer level**: Rayon `par_iter()` for batch operations across independent solve
invocations (multi-query, batch solve, parallel search-then-solve pipelines).
**Inner level**: SIMD vectorization within each individual solve invocation. This
includes both auto-vectorized loops (compiler-driven via `-C target-cpu=native`) and
explicit SIMD intrinsics (AVX2/NEON/WASM SIMD) for critical inner loops such as
matrix-vector products and distance computations.
**No nested Rayon** is permitted. A Rayon task must never spawn additional Rayon parallel
iterators, as this risks exhausting the global thread pool and causing deadlocks. The
solver's internal task decomposition uses its own scheduler (Decision 2) rather than
recursive Rayon parallelism.
```
+------------------------------------------------------------------+
| Application Layer |
| batch_solve([q1, q2, ..., qN]) |
+------------------------------------------------------------------+
| | |
v v v
+------------------+ +------------------+ +------------------+
| Rayon Task #1 | | Rayon Task #2 | | Rayon Task #N |
| (Outer Level) | | (Outer Level) | | (Outer Level) |
| solve(q1) | | solve(q2) | | solve(qN) |
+------------------+ +------------------+ +------------------+
| | |
v v v
+------------------+ +------------------+ +------------------+
| Solver Scheduler | | Solver Scheduler | | Solver Scheduler |
| 98ns tick | | 98ns tick | | 98ns tick |
| (Inner Level) | | (Inner Level) | | (Inner Level) |
+------------------+ +------------------+ +------------------+
| | |
v v v
+------------------+ +------------------+ +------------------+
| SIMD Kernel | | SIMD Kernel | | SIMD Kernel |
| AVX2/NEON/WASM | | AVX2/NEON/WASM | | AVX2/NEON/WASM |
| (Vectorized) | | (Vectorized) | | (Vectorized) |
+------------------+ +------------------+ +------------------+
```
#### Rayon Integration Code
```rust
use rayon::prelude::*;
/// Batch solve: outer parallelism via Rayon, inner via solver scheduler + SIMD
pub fn batch_solve(
queries: &[SolveRequest],
solver: &SublinearSolver,
config: &SolverConfig,
) -> Vec<SolveResult> {
#[cfg(all(feature = "parallel", not(target_arch = "wasm32")))]
{
queries
.par_iter()
.map(|query| solver.solve_single(query, config))
.collect()
}
#[cfg(any(not(feature = "parallel"), target_arch = "wasm32"))]
{
queries
.iter()
.map(|query| solver.solve_single(query, config))
.collect()
}
}
/// Multi-query search with solver-accelerated re-ranking
pub fn parallel_search_and_solve(
queries: &[Vec<f32>],
index: &HnswIndex,
solver: &SublinearSolver,
k: usize,
) -> Vec<Vec<SearchResult>> {
#[cfg(all(feature = "parallel", not(target_arch = "wasm32")))]
{
queries
.par_iter()
.map(|query| {
// Phase 1: HNSW candidate retrieval (O(log n))
let candidates = index.search(query, k * 4).unwrap();
// Phase 2: Solver-accelerated exact re-ranking (sublinear)
solver.rerank(query, &candidates, k)
})
.collect()
}
#[cfg(any(not(feature = "parallel"), target_arch = "wasm32"))]
{
queries
.iter()
.map(|query| {
let candidates = index.search(query, k * 4).unwrap();
solver.rerank(query, &candidates, k)
})
.collect()
}
}
```
### 2. Solver Scheduler Integration
The solver's nanosecond-precision scheduler operates exclusively within a single Rayon
task. It manages fine-grained intra-solve task decomposition without interacting with
Rayon's thread pool or work-stealing queues.
```
+---------------------------------------------------------------+
| Single Rayon Task |
| |
| +----------------------------------------------------------+ |
| | Solver Scheduler (98ns tick) | |
| | | |
| | Task Queue: | |
| | [Neumann Term 0] [Neumann Term 1] [Random Walk Step] | |
| | [Matrix Partition A] [Matrix Partition B] [Reduce] | |
| | | |
| | Execution Timeline (single thread): | |
| | |--98ns--|--98ns--|--98ns--|--98ns--|--98ns--| | |
| | | T0 | T1 | T2 | T3 | T0 | ... | |
| | | run | run | run | run | cont | | |
| | | |
| | Scheduler Responsibilities: | |
| | - Task priority ordering | |
| | - Convergence checking per tick | |
| | - Early termination on epsilon satisfaction | |
| | - Budget enforcement (max ticks per solve) | |
| +----------------------------------------------------------+ |
+---------------------------------------------------------------+
```
#### Scheduler Isolation Contract
```rust
/// The solver scheduler runs entirely within a single OS thread.
/// It MUST NOT:
/// - Call rayon::spawn() or any Rayon parallel iterator
/// - Acquire locks held by other Rayon tasks
/// - Block on tokio channels or async primitives
/// - Allocate from the global allocator in the hot loop
///
/// It MAY:
/// - Use thread-local storage for scratch buffers
/// - Use AtomicVectorPool for zero-alloc vector operations
/// - Read from shared immutable data (index, graph adjacency)
/// - Write to its own output buffer (non-shared)
pub struct SolverScheduler {
/// Task queue using crossbeam for efficient scheduling
task_queue: crossbeam::deque::Worker<SolverTask>,
/// Pre-allocated scratch space from AtomicVectorPool
scratch_pool: AtomicVectorPool,
/// Tick interval in nanoseconds (default: 98)
tick_ns: u64,
/// Maximum ticks per solve invocation
max_ticks: u64,
/// Convergence threshold
epsilon: f64,
}
impl SolverScheduler {
pub fn new(dimensions: usize, config: &SchedulerConfig) -> Self {
Self {
task_queue: crossbeam::deque::Worker::new_fifo(),
scratch_pool: AtomicVectorPool::new(dimensions, 16, 64),
tick_ns: config.tick_ns.unwrap_or(98),
max_ticks: config.max_ticks.unwrap_or(100_000),
epsilon: config.epsilon.unwrap_or(1e-6),
}
}
/// Execute a solve within the scheduler's tick loop.
/// This function is called from within a Rayon task and runs
/// entirely on the calling thread.
pub fn execute(&self, problem: &SolveProblem) -> SolveResult {
let mut state = SolverState::init(problem, &self.scratch_pool);
let mut ticks: u64 = 0;
loop {
// Process one tick worth of work
if let Some(task) = self.task_queue.pop() {
task.execute(&mut state);
}
ticks += 1;
// Check convergence
if state.residual() < self.epsilon {
return SolveResult::converged(state, ticks);
}
// Check budget
if ticks >= self.max_ticks {
return SolveResult::budget_exhausted(state, ticks);
}
// Generate follow-up tasks based on current state
self.generate_tasks(&state);
}
}
fn generate_tasks(&self, state: &SolverState) {
// Neumann series: schedule next term if not converged
if state.needs_next_term() {
self.task_queue.push(SolverTask::NeumannTerm {
term_index: state.current_term + 1,
});
}
// Random walk: schedule additional walks if variance too high
if state.variance_too_high() {
self.task_queue.push(SolverTask::RandomWalkBatch {
count: state.walks_needed(),
});
}
}
}
```
### 3. Crossbeam Integration
The solver extends ruvector's existing lock-free infrastructure with two additional
integration points.
#### Work-Stealing Task Queue
The solver's task queue uses `crossbeam::deque::Injector` for work distribution when
operating in multi-solve mode. Each Rayon worker thread owns a local
`crossbeam::deque::Worker` deque; the `Injector` distributes initial tasks, and
`Stealer` handles load balancing across solver instances.
```
+----------------------------------------------------------+
| Work-Stealing Topology |
| |
| +--------------------+ |
| | Injector | <-- batch_solve() pushes here |
| | (Global Task Queue)| |
| +--------+-----------+ |
| | |
| +-----+------+------+ |
| | | | |
| +--v---+ +----v-+ +-v----+ |
| |Worker| |Worker| |Worker| <-- Rayon thread-local |
| |Deque | |Deque | |Deque | |
| +--+---+ +--+---+ +--+---+ |
| | | | |
| v v v |
| Steal <----> Steal <--> Steal (work stealing) |
+----------------------------------------------------------+
```
```rust
use crossbeam::deque::{Injector, Stealer, Worker};
/// Multi-solve work distributor using crossbeam work-stealing deques.
/// Integrates with Rayon by running within install() scope.
pub struct SolverWorkDistributor {
injector: Injector<SolveRequest>,
stealers: Vec<Stealer<SolveRequest>>,
workers: Vec<Worker<SolveRequest>>,
}
impl SolverWorkDistributor {
pub fn new(num_workers: usize) -> Self {
let injector = Injector::new();
let mut workers = Vec::with_capacity(num_workers);
let mut stealers = Vec::with_capacity(num_workers);
for _ in 0..num_workers {
let worker = Worker::new_fifo();
stealers.push(worker.stealer());
workers.push(worker);
}
Self { injector, stealers, workers }
}
/// Submit a batch of solve requests for parallel execution.
pub fn submit_batch(&self, requests: Vec<SolveRequest>) {
for request in requests {
self.injector.push(request);
}
}
/// Attempt to steal work. Called by idle Rayon threads.
pub fn find_work(&self, worker_id: usize) -> Option<SolveRequest> {
// Try local deque first
let local = &self.workers[worker_id];
if let Some(task) = local.pop() {
return Some(task);
}
// Try global injector
loop {
match self.injector.steal_batch_and_pop(local) {
crossbeam::deque::Steal::Success(task) => return Some(task),
crossbeam::deque::Steal::Empty => break,
crossbeam::deque::Steal::Retry => continue,
}
}
// Try stealing from other workers
for (i, stealer) in self.stealers.iter().enumerate() {
if i == worker_id { continue; }
loop {
match stealer.steal() {
crossbeam::deque::Steal::Success(task) => return Some(task),
crossbeam::deque::Steal::Empty => break,
crossbeam::deque::Steal::Retry => continue,
}
}
}
None
}
}
```
#### Lock-Free Statistics
Solver statistics collection extends the existing `LockFreeStats` pattern from
`crates/ruvector-core/src/lockfree.rs`, using `CachePadded<AtomicU64>` for contention-free
updates across Rayon worker threads.
```rust
use crossbeam::utils::CachePadded;
use std::sync::atomic::{AtomicU64, Ordering};
/// Lock-free solver statistics, following ruvector's LockFreeStats pattern.
/// Each field is cache-line padded to prevent false sharing between threads.
#[repr(align(64))]
pub struct SolverStats {
solves_completed: CachePadded<AtomicU64>,
total_ticks: CachePadded<AtomicU64>,
convergences: CachePadded<AtomicU64>,
budget_exhaustions: CachePadded<AtomicU64>,
total_latency_ns: CachePadded<AtomicU64>,
neumann_terms_computed: CachePadded<AtomicU64>,
random_walks_executed: CachePadded<AtomicU64>,
}
impl SolverStats {
pub fn new() -> Self {
Self {
solves_completed: CachePadded::new(AtomicU64::new(0)),
total_ticks: CachePadded::new(AtomicU64::new(0)),
convergences: CachePadded::new(AtomicU64::new(0)),
budget_exhaustions: CachePadded::new(AtomicU64::new(0)),
total_latency_ns: CachePadded::new(AtomicU64::new(0)),
neumann_terms_computed: CachePadded::new(AtomicU64::new(0)),
random_walks_executed: CachePadded::new(AtomicU64::new(0)),
}
}
#[inline]
pub fn record_solve(&self, result: &SolveResult) {
self.solves_completed.fetch_add(1, Ordering::Relaxed);
self.total_ticks.fetch_add(result.ticks, Ordering::Relaxed);
self.total_latency_ns.fetch_add(result.latency_ns, Ordering::Relaxed);
if result.converged {
self.convergences.fetch_add(1, Ordering::Relaxed);
} else {
self.budget_exhaustions.fetch_add(1, Ordering::Relaxed);
}
}
pub fn snapshot(&self) -> SolverStatsSnapshot {
let completed = self.solves_completed.load(Ordering::Relaxed);
let total_latency = self.total_latency_ns.load(Ordering::Relaxed);
let total_ticks = self.total_ticks.load(Ordering::Relaxed);
SolverStatsSnapshot {
solves_completed: completed,
convergence_rate: if completed > 0 {
self.convergences.load(Ordering::Relaxed) as f64 / completed as f64
} else { 0.0 },
avg_latency_ns: if completed > 0 {
total_latency / completed
} else { 0 },
avg_ticks_per_solve: if completed > 0 {
total_ticks / completed
} else { 0 },
neumann_terms: self.neumann_terms_computed.load(Ordering::Relaxed),
random_walks: self.random_walks_executed.load(Ordering::Relaxed),
}
}
}
```
### 4. WASM Parallelism
WASM targets lack native thread support. The solver's WASM binding follows the established
ruvector pattern: a `WorkerPool` (JavaScript) manages Web Workers, each running an
independent WASM solver instance.
```
+---------------------------------------------------------------+
| Browser Main Thread |
| |
| +----------------------------------------------------------+ |
| | SolverWorkerPool (extends WorkerPool pattern) | |
| | | |
| | init() -> spawns N Web Workers (navigator.hardwareConcurrency) |
| | solve(request) -> round-robin to next idle worker | |
| | batchSolve(requests) -> chunk and distribute | |
| | terminate() -> cleanup all workers | |
| +--+--------+--------+--------+----------------------------+ |
| | | | | |
+-----+--------+--------+--------+-------------------------------+
| | | |
v v v v postMessage / onmessage
+--------+ +--------+ +--------+ +--------+
|Worker 0| |Worker 1| |Worker 2| |Worker 3|
| | | | | | | |
|+------+| |+------+| |+------+| |+------+|
||WASM || ||WASM || ||WASM || ||WASM ||
||Solver|| ||Solver|| ||Solver|| ||Solver||
||Module|| ||Module|| ||Module|| ||Module||
|+------+| |+------+| |+------+| |+------+|
| | | | | | | |
|Scheduler| |Scheduler| |Scheduler| |Scheduler|
|(98ns) | |(98ns) | |(98ns) | |(98ns) |
+--------+ +--------+ +--------+ +--------+
| | | |
v v v v
[WASM SIMD] [WASM SIMD] [WASM SIMD] [WASM SIMD]
(128-bit) (128-bit) (128-bit) (128-bit)
```
#### SharedArrayBuffer Zero-Copy Path
When `SharedArrayBuffer` is available (requires `Cross-Origin-Isolation` headers:
`Cross-Origin-Opener-Policy: same-origin` and
`Cross-Origin-Embedder-Policy: require-corp`), the solver uses shared memory for
zero-copy data transfer between workers.
```javascript
/**
* Solver Worker Pool -- extends ruvector's WorkerPool pattern
* with SharedArrayBuffer support and solver-specific operations.
*/
export class SolverWorkerPool {
constructor(workerUrl, wasmUrl, options = {}) {
this.workerUrl = workerUrl;
this.wasmUrl = wasmUrl;
this.poolSize = options.poolSize || navigator.hardwareConcurrency || 4;
this.workers = [];
this.nextWorker = 0;
this.pendingRequests = new Map();
this.requestId = 0;
this.initialized = false;
this.sharedMemoryAvailable = typeof SharedArrayBuffer !== 'undefined';
this.sharedBuffer = null;
}
async init() {
if (this.initialized) return;
// Allocate shared memory if available
if (this.sharedMemoryAvailable) {
const bufferSize = this.poolSize * 4 * 1024 * 1024; // 4MB per worker
this.sharedBuffer = new SharedArrayBuffer(bufferSize);
console.log(`SharedArrayBuffer allocated: ${bufferSize} bytes`);
}
const initPromises = [];
for (let i = 0; i < this.poolSize; i++) {
const worker = new Worker(this.workerUrl, { type: 'module' });
worker.onmessage = (e) => this.handleMessage(i, e);
worker.onerror = (error) => this.handleError(i, error);
this.workers.push({ worker, busy: false, id: i });
const initData = {
wasmUrl: this.wasmUrl,
workerId: i,
sharedBuffer: this.sharedBuffer,
bufferOffset: i * 4 * 1024 * 1024,
bufferSize: 4 * 1024 * 1024,
};
initPromises.push(this.sendToWorker(i, 'init', initData));
}
await Promise.all(initPromises);
this.initialized = true;
}
/**
* Solve a single request on the next available worker.
*/
async solve(request) {
if (!this.initialized) await this.init();
const workerId = this.getNextWorker();
if (this.sharedMemoryAvailable) {
// Zero-copy path: write input to shared buffer
const view = new Float32Array(
this.sharedBuffer,
workerId * 4 * 1024 * 1024,
request.data.length
);
view.set(request.data);
return this.sendToWorker(workerId, 'solveShared', {
offset: 0,
length: request.data.length,
config: request.config,
});
} else {
// Transferable path: transfer ownership of ArrayBuffer
const buffer = new Float32Array(request.data).buffer;
return this.sendToWorkerTransfer(workerId, 'solve', {
config: request.config,
}, [buffer]);
}
}
/**
* Batch solve distributes requests across all workers.
*/
async batchSolve(requests) {
if (!this.initialized) await this.init();
const chunkSize = Math.ceil(requests.length / this.poolSize);
const chunks = [];
for (let i = 0; i < requests.length; i += chunkSize) {
chunks.push(requests.slice(i, i + chunkSize));
}
const promises = chunks.map((chunk, i) =>
this.sendToWorker(i % this.poolSize, 'batchSolve', {
requests: chunk.map(r => ({
data: Array.from(r.data),
config: r.config,
})),
})
);
const results = await Promise.all(promises);
return results.flat();
}
// -- Utility methods following WorkerPool pattern --
getNextWorker() {
for (let i = 0; i < this.workers.length; i++) {
const idx = (this.nextWorker + i) % this.workers.length;
if (!this.workers[idx].busy) {
this.nextWorker = (idx + 1) % this.workers.length;
return idx;
}
}
const idx = this.nextWorker;
this.nextWorker = (this.nextWorker + 1) % this.workers.length;
return idx;
}
handleMessage(workerId, event) {
const { requestId, data, error } = event.data;
const request = this.pendingRequests.get(requestId);
if (!request) return;
this.workers[workerId].busy = false;
this.pendingRequests.delete(requestId);
if (error) {
request.reject(new Error(error.message));
} else {
request.resolve(data);
}
}
handleError(workerId, error) {
console.error(`Solver worker ${workerId} error:`, error);
for (const [requestId, request] of this.pendingRequests) {
if (request.workerId === workerId) {
request.reject(error);
this.pendingRequests.delete(requestId);
}
}
}
sendToWorker(workerId, type, data) {
return new Promise((resolve, reject) => {
const requestId = this.requestId++;
this.pendingRequests.set(requestId, { resolve, reject, workerId });
this.workers[workerId].busy = true;
this.workers[workerId].worker.postMessage({
type, data: { ...data, requestId },
});
});
}
sendToWorkerTransfer(workerId, type, data, transferables) {
return new Promise((resolve, reject) => {
const requestId = this.requestId++;
this.pendingRequests.set(requestId, { resolve, reject, workerId });
this.workers[workerId].busy = true;
this.workers[workerId].worker.postMessage(
{ type, data: { ...data, requestId } },
transferables
);
});
}
terminate() {
for (const { worker } of this.workers) {
worker.terminate();
}
this.workers = [];
this.initialized = false;
this.sharedBuffer = null;
}
}
```
#### Fallback Path (No SharedArrayBuffer)
When `SharedArrayBuffer` is unavailable (non-isolated context, older browsers, iOS Safari
prior to 16.4), the solver falls back to `postMessage` with transferable `ArrayBuffer`
objects. Ownership of the buffer transfers to the worker, avoiding copies at the cost of
making the source buffer neutered (zero-length) after transfer.
```
SharedArrayBuffer available?
|
+----+----+
| |
YES NO
| |
v v
Zero-copy Transferable ArrayBuffer
shared (ownership transfer,
memory source buffer neutered)
| |
v v
Worker reads Worker receives
from shared full copy via
view at structured clone
offset of postMessage
```
### 5. Async Integration
The solver is CPU-bound and must not block the Tokio async runtime. All solver invocations
from async contexts use `tokio::task::spawn_blocking`, following the pattern established
in `ruvector-attention-node`, `ruvector-graph-node`, and `ruvector-router-ffi`.
```rust
use tokio::sync::broadcast;
use tokio::task;
/// Async wrapper for solver operations.
/// Uses spawn_blocking to avoid blocking the Tokio runtime.
pub struct AsyncSolver {
solver: Arc<SublinearSolver>,
event_tx: broadcast::Sender<SolverEvent>,
semaphore: Arc<tokio::sync::Semaphore>,
}
/// Events emitted during solver operations for observability.
#[derive(Clone, Debug)]
pub enum SolverEvent {
SolveStarted { request_id: u64 },
SolveCompleted { request_id: u64, latency_ns: u64, converged: bool },
SolveFailed { request_id: u64, error: String },
BatchStarted { batch_id: u64, count: usize },
BatchCompleted { batch_id: u64, count: usize, total_latency_ns: u64 },
ConcurrencyLimitReached { current: usize, max: usize },
}
impl AsyncSolver {
pub fn new(solver: SublinearSolver, max_concurrent: usize) -> Self {
let (event_tx, _) = broadcast::channel(1024);
Self {
solver: Arc::new(solver),
event_tx,
semaphore: Arc::new(tokio::sync::Semaphore::new(max_concurrent)),
}
}
/// Subscribe to solver events for monitoring/observability.
pub fn subscribe(&self) -> broadcast::Receiver<SolverEvent> {
self.event_tx.subscribe()
}
/// Solve a single request asynchronously.
/// Acquires a semaphore permit to enforce concurrency limits.
pub async fn solve(&self, request: SolveRequest) -> Result<SolveResult, SolverError> {
let permit = self.semaphore.acquire().await
.map_err(|_| SolverError::Shutdown)?;
let solver = Arc::clone(&self.solver);
let event_tx = self.event_tx.clone();
let request_id = request.id;
let _ = event_tx.send(SolverEvent::SolveStarted { request_id });
let result = task::spawn_blocking(move || {
let start = std::time::Instant::now();
let result = solver.solve_single(&request, &SolverConfig::default());
let latency_ns = start.elapsed().as_nanos() as u64;
let _ = event_tx.send(SolverEvent::SolveCompleted {
request_id,
latency_ns,
converged: result.converged,
});
result
})
.await
.map_err(|e| SolverError::TaskPanicked(e.to_string()))?;
drop(permit);
Ok(result)
}
/// Batch solve with Rayon parallelism inside spawn_blocking.
pub async fn batch_solve(
&self,
requests: Vec<SolveRequest>,
) -> Result<Vec<SolveResult>, SolverError> {
let solver = Arc::clone(&self.solver);
let event_tx = self.event_tx.clone();
let batch_id = requests.first().map(|r| r.id).unwrap_or(0);
let count = requests.len();
let _ = event_tx.send(SolverEvent::BatchStarted { batch_id, count });
let results = task::spawn_blocking(move || {
let start = std::time::Instant::now();
let results = batch_solve(&requests, &solver, &SolverConfig::default());
let total_latency_ns = start.elapsed().as_nanos() as u64;
let _ = event_tx.send(SolverEvent::BatchCompleted {
batch_id, count, total_latency_ns,
});
results
})
.await
.map_err(|e| SolverError::TaskPanicked(e.to_string()))?;
Ok(results)
}
}
```
#### Event Stream Architecture
```
+------------------+ +------------------+ +------------------+
| AsyncSolver | | Event Subscriber | | Event Subscriber |
| (spawn_blocking) | | (Metrics) | | (Logging) |
+--------+---------+ +--------+---------+ +--------+---------+
| | |
v v v
+------+------------------------+------------------------+------+
| broadcast::channel(1024) |
| |
| SolverEvent::SolveStarted { request_id } |
| SolverEvent::SolveCompleted { request_id, latency_ns, ... } |
| SolverEvent::BatchCompleted { batch_id, count, ... } |
+----------------------------------------------------------------+
```
### 6. Concurrent Solve Limit
A bounded `tokio::sync::Semaphore` limits the number of solver invocations executing
simultaneously. This prevents memory exhaustion when many async callers submit concurrent
solve requests, each of which allocates scratch buffers from `AtomicVectorPool`.
**Default limit**: `num_cpus::get() * 2`
This follows the pattern used in `ruvector-postgres` (`max_concurrent_searches: 64`) and
`prime-radiant` (`max_concurrent_ops`), but tunes for CPU-bound solver work rather than
I/O-bound database queries.
```rust
/// Configuration for concurrent solve limits.
pub struct ConcurrencyConfig {
/// Maximum concurrent solver invocations.
/// Default: num_cpus * 2 (allows slight oversubscription for
/// work that mixes solver compute with I/O waits).
pub max_concurrent_solves: usize,
/// Maximum total memory allocated to solver scratch space.
/// Enforced via AtomicVectorPool capacity.
pub max_scratch_memory_bytes: usize,
/// Backpressure strategy when limit is reached.
pub backpressure: BackpressureStrategy,
}
#[derive(Debug, Clone)]
pub enum BackpressureStrategy {
/// Block until a permit becomes available (default).
Wait,
/// Return an error immediately.
Reject,
/// Wait up to the specified duration, then return an error.
WaitTimeout(std::time::Duration),
}
impl Default for ConcurrencyConfig {
fn default() -> Self {
let cpus = num_cpus::get();
Self {
max_concurrent_solves: cpus * 2,
max_scratch_memory_bytes: cpus * 16 * 1024 * 1024, // 16MB per CPU
backpressure: BackpressureStrategy::Wait,
}
}
}
```
---
## Consequences
### Positive
- **No thread pool exhaustion**: Two-level separation ensures Rayon's global pool is never
starved by nested parallelism. Each solve invocation consumes exactly one Rayon thread.
- **Predictable memory usage**: The semaphore-bounded concurrency limit combined with
`AtomicVectorPool` capacity bounds prevents out-of-memory conditions under load.
- **WASM compatibility**: The Web Worker pattern matches ruvector's existing 27 WASM crates
and their `WorkerPool` infrastructure. No new paradigm to learn or maintain.
- **Observable**: The `broadcast::channel` event stream provides real-time solve metrics
without polling, enabling integration with ruvector's existing `tracing` infrastructure.
- **Zero contention statistics**: Cache-padded atomics in `SolverStats` eliminate false
sharing. Measurements from `LockFreeCounter` tests confirm 10K increments across 10
threads with zero lost updates.
- **Lock-free hot path**: The solver's inner loop uses `AtomicVectorPool` (SegQueue-backed)
and `crossbeam::deque::Worker` (thread-local). No mutex or RwLock on the critical path.
- **Crossbeam alignment**: Work-stealing deques integrate naturally with Rayon's existing
work-stealing model since Rayon itself uses crossbeam internally.
### Negative
- **Scheduler complexity**: The 98ns tick scheduler requires careful calibration per
platform. Timer resolution on Linux (`clock_gettime(CLOCK_MONOTONIC)`) provides ~25ns
precision, but macOS and Windows have coarser clocks (~1us).
- **WASM overhead**: Web Worker `postMessage` serialization adds ~50-200us per invocation,
dwarfing the solver's sub-microsecond tick cost. Batch operations are essential to
amortize this overhead.
- **SharedArrayBuffer restrictions**: Cross-origin isolation headers (`COOP`/`COEP`) break
third-party integrations (e.g., iframes, analytics scripts). Deployments must weigh
zero-copy benefits against integration constraints.
- **Memory bandwidth ceiling**: Thread scaling beyond 4 threads provides diminishing
returns (50-70% efficiency at 8 threads). The solver cannot overcome this hardware
limitation; it can only work within it by maximizing SIMD utilization per thread.
- **Rayon global pool coupling**: Solver batch operations share Rayon's global thread pool
with other subsystems (`batch_distances`, `FlatIndex::search`, delta operations). Under
contention, solver tasks compete with index operations for threads.
### Neutral
- **Feature flag gating**: Parallelism continues to be gated behind
`#[cfg(all(feature = "parallel", not(target_arch = "wasm32")))]`, consistent with
existing crate convention. No new feature flags required.
- **Tokio spawn_blocking pool**: `spawn_blocking` uses a separate thread pool from Rayon.
Default pool size is 512 threads (Tokio default), which is more than sufficient for
`max_concurrent_solves = num_cpus * 2`.
- **Crossbeam version alignment**: Both ruvector (0.8) and the solver target crossbeam 0.8.
No version conflict.
---
## Options Considered
### Option 1: Rayon-Only Parallelism (Flat Model)
Use Rayon `par_iter()` for all parallelism, including intra-solve task decomposition.
- **Pros**: Simplest implementation. Single threading model. No scheduler complexity.
Rayon's adaptive work stealing handles load balancing automatically.
- **Cons**: Nested `par_iter()` calls risk thread pool exhaustion. Rayon's work-stealing
granularity (~1us minimum) is too coarse for the solver's 98ns tick. Cannot express
solver-specific scheduling policies (convergence checking, budget enforcement) within
Rayon's map/reduce model. Forces all solver tasks through Rayon's global deque, adding
contention with other subsystems.
### Option 2: Dedicated Thread Pool per Solver
Create a separate `rayon::ThreadPool` for solver operations, isolated from the global pool.
- **Pros**: Complete isolation from other subsystems. No contention with index operations.
Can size the pool independently (e.g., 4 threads for solver, remaining for index).
- **Cons**: Thread oversubscription: a 4-thread solver pool plus Rayon's global pool (8
threads by default) creates 12 threads on an 8-core machine. Context switching overhead
negates throughput gains. Memory bandwidth is the bottleneck, not thread count; more
threads do not help. Cannot share work between pools when one is idle. Increases total
memory consumption (each pool maintains its own deque infrastructure).
### Option 3: Tokio-Native Async Solve (Chosen Against)
Run the solver on the Tokio runtime using `tokio::task::spawn` with cooperative yielding.
- **Pros**: Native async integration. No spawn_blocking overhead. Natural backpressure
via Tokio's task budget system.
- **Cons**: The solver is CPU-bound; running on Tokio's cooperative scheduler would block
other async tasks. Tokio's tick granularity (~10us) is 100x coarser than the solver's
98ns target. Would require `.await` points in inner loops, destroying SIMD pipeline
throughput. Fundamentally wrong tool for CPU-bound computation.
### Option 4: Two-Level with Scheduler (Chosen)
Rayon for outer-level batch parallelism; solver's own nanosecond scheduler for inner-level
task management; crossbeam for lock-free data structures; Tokio integration via
spawn_blocking.
- **Pros**: Matches ruvector's existing patterns exactly. Respects memory bandwidth limits
by not oversubscribing threads. Preserves solver's 98ns scheduling precision. Lock-free
hot path. Observable via broadcast channels. WASM-compatible via established WorkerPool.
- **Cons**: Two scheduling systems to understand and maintain. Potential for subtle bugs
at the Rayon/scheduler boundary (e.g., accidentally calling par_iter inside a solve).
---
## Thread Scaling Analysis
### Memory Bandwidth Model
The dominant performance constraint for vectorized solver workloads is memory bandwidth,
not CPU compute. The following model explains the measured scaling behavior.
```
Memory Bandwidth Utilization vs Thread Count
(10K vectors x 384 dimensions x 4 bytes = 15.36 MB working set)
Threads | BW Used (GB/s) | BW Available | Efficiency | Bottleneck
---------|----------------|--------------|------------|------------------
1 | ~8 | ~50 (DDR5) | 16% | CPU-bound (good)
2 | ~15 | ~50 | 30% | CPU-bound (good)
4 | ~28 | ~50 | 56% | Approaching BW limit
8 | ~38 | ~50 | 76% | BW-saturated
16 | ~42 | ~50 | 84% | Diminishing returns
Note: Effective BW is lower than theoretical peak due to cache coherence
traffic (MESI protocol overhead) and TLB pressure at scale.
```
### Solver-Specific Scaling Predictions
The solver's workload differs from pure distance computation in two important ways:
1. **Higher arithmetic intensity**: Neumann series evaluation performs multiple matrix-vector
products per memory access, increasing the compute-to-bandwidth ratio. This should
improve scaling beyond 4 threads relative to simple distance computation.
2. **Smaller working sets per solve**: Individual solve problems typically involve matrices
of size 128x128 to 1024x1024 (~64KB to ~4MB), fitting within L2/L3 cache. This reduces
memory bandwidth pressure compared to scanning 10K vectors.
```
Predicted Solver Thread Scaling:
Threads | Distance Comp | Neumann Solve | Random Walk
| (measured) | (predicted) | (predicted)
---------|----------------|----------------|----------------
1 | 100% | 100% | 100%
2 | 85-95% | 90-97% | 92-98%
4 | 70-85% | 80-92% | 85-95%
8 | 50-70% | 65-82% | 70-85%
Neumann: Higher arithmetic intensity -> better scaling
Random Walk: Independent walks -> near-linear scaling
```
### Avoiding Nested Parallelism
Nested Rayon parallelism is the single most dangerous anti-pattern for this integration.
The following diagram illustrates the deadlock scenario.
```
DANGEROUS (do NOT do this):
batch_solve()
.par_iter() <-- Rayon global pool (8 threads)
|
+-> solve_single()
.par_iter() <-- NESTED: tries to use same 8 threads
|
+-> [DEADLOCK: all 8 threads waiting for inner
par_iter work, but no threads available
to execute it]
CORRECT (this ADR's approach):
batch_solve()
.par_iter() <-- Rayon global pool (8 threads)
|
+-> solve_single()
SolverScheduler::execute() <-- Sequential on calling thread
|
+-> SIMD kernel <-- Vectorized, single thread
+-> crossbeam::deque <-- Thread-local, no contention
```
---
## Lock-Free Data Structure Usage Map
The following table maps each lock-free structure in the codebase to its role in the
solver integration.
| Structure | Source | Solver Role | Contention Profile |
|-----------|--------|-------------|-------------------|
| `AtomicVectorPool` | `lockfree.rs` | Scratch buffer allocation for Neumann terms, walk accumulators | Per-thread pool instance; no cross-thread contention |
| `LockFreeWorkQueue` | `lockfree.rs` | Bounded result collection from parallel batch_solve | Producers: Rayon threads; Consumer: batch_solve caller |
| `LockFreeStats` | `lockfree.rs` | Pattern template for `SolverStats` | Write-only from Rayon threads; Read from monitoring |
| `LockFreeCounter` | `lockfree.rs` | Request ID generation, tick counting | Single atomic increment per solve; negligible contention |
| `LockFreeBatchProcessor` | `lockfree.rs` | Orchestration of multi-phase solve pipelines | Phase transitions; moderate contention at boundaries |
| `ObjectPool<T>` | `lockfree.rs` | Reusable solver state objects | Per-thread acquire/release; low contention |
| `crossbeam::deque::Worker` | crossbeam | Solver scheduler task queue | Thread-local; zero contention |
| `crossbeam::deque::Injector` | crossbeam | Batch task distribution | Single producer (submit), multiple consumers (steal) |
| `DashMap` | dashmap | Solver result cache (optional) | Read-heavy; sharded RwLock minimizes contention |
---
## Benchmark Predictions for Parallel Solver
Based on existing ruvector benchmark data and the solver's algorithmic properties.
### Single-Threaded Solver Performance
| Operation | Estimated Latency | Basis |
|-----------|------------------|-------|
| Neumann solve (128x128 sparse) | 2-5 us | 3-5 terms x 500ns matmul |
| Neumann solve (1024x1024 sparse) | 20-80 us | 5-8 terms x 5us matmul |
| Random walk estimate (single entry) | 0.5-2 us | 10-50 steps x 50ns/step |
| Scheduler overhead per solve | 50-200 ns | 1-2 ticks of bookkeeping |
| AtomicVectorPool acquire/release | 15-30 ns | SegQueue push/pop |
### Parallel Batch Performance (8 threads)
| Operation | 1 Thread | 8 Threads | Speedup | Efficiency |
|-----------|----------|-----------|---------|------------|
| 1000x Neumann (128x128) | 3.5 ms | 0.55 ms | 6.4x | 80% |
| 1000x Neumann (1024x1024) | 50 ms | 8.5 ms | 5.9x | 74% |
| 10000x Random walk | 12 ms | 1.8 ms | 6.7x | 84% |
| Mixed batch (Neumann + walk) | 30 ms | 5.2 ms | 5.8x | 72% |
### WASM Performance (4 Web Workers)
| Operation | 1 Worker | 4 Workers | Speedup | Overhead |
|-----------|----------|-----------|---------|----------|
| 100x Neumann (128x128) | 0.8 ms | 0.25 ms | 3.2x | ~100us postMessage |
| 100x Neumann (SAB zero-copy) | 0.8 ms | 0.22 ms | 3.6x | ~10us shared read |
| WASM SIMD vs scalar | - | - | 2-4x | Per-operation |
---
## Version History
| Version | Date | Author | Changes |
|---------|------|--------|---------|
| 0.1 | 2026-02-20 | RuVector Team | Initial proposal |
| 1.0 | 2026-02-20 | RuVector Team | Accepted: full implementation complete |
---
## Implementation Status
Parallel feature gate enables rayon for data-parallel SpMV and crossbeam for concurrent solver pipeline. Thread-safe types via Send + Sync bounds. DashMap for concurrent solver registry. parking_lot for low-overhead locking. Arena allocator is thread-local. ComputeBudget checked atomically. WASM target uses single-threaded fallback.
---
## Related Decisions
- **ADR-STS-001**: Rust crates integration (establishes the `parallel` feature flag
convention that this ADR extends)
- **ADR-STS-005**: Architecture analysis (defines the Core-Binding-Surface pattern that
the async wrapper follows)
- **ADR-STS-006**: WASM integration (defines Web Worker and SharedArrayBuffer patterns
that this ADR's WASM parallelism builds on)
- **ADR-STS-008**: Performance analysis (provides the thread scaling measurements and
benchmark framework referenced throughout)
- **ADR-003**: SIMD optimization strategy (defines the SIMD vectorization approach used
in the solver's inner level)
## References
- [Rayon documentation: Thread pool configuration](https://docs.rs/rayon/latest/rayon/struct.ThreadPoolBuilder.html)
- [Crossbeam deque: Work-stealing](https://docs.rs/crossbeam-deque/latest/crossbeam_deque/)
- [Tokio: spawn_blocking for CPU-bound work](https://docs.rs/tokio/latest/tokio/task/fn.spawn_blocking.html)
- [MDN: SharedArrayBuffer and Cross-Origin Isolation](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/SharedArrayBuffer)
- [ruvector lockfree.rs](/home/user/ruvector/crates/ruvector-core/src/lockfree.rs)
- [ruvector worker-pool.js](/home/user/ruvector/crates/ruvector-wasm/src/worker-pool.js)
- [ruvector comprehensive_bench.rs](/home/user/ruvector/crates/ruvector-core/benches/comprehensive_bench.rs)
- [ruvector distance.rs batch_distances()](/home/user/ruvector/crates/ruvector-core/src/distance.rs)