
ADR-STS-002: Algorithm Selection and Sublinear Routing Strategy

Status

Accepted

Date

2026-02-20

Authors

RuVector Architecture Team

Deciders

Architecture Review Board

Version History

| Version | Date | Author | Changes |
|---|---|---|---|
| 0.1 | 2026-02-20 | RuVector Team | Initial proposal |
| 0.2 | 2026-02-20 | RuVector Team | Comprehensive rewrite: crossover analysis, error budget decomposition, SONA/EWC integration, full decision matrix |
| 1.0 | 2026-02-20 | RuVector Team | Accepted: full implementation complete |

Context

RuVector integrates seven sublinear algorithms from the sublinear-time-solver library. Each algorithm occupies a distinct region of the problem-characteristic space, with non-trivial crossover boundaries determined by matrix size (n), sparsity (nnz/n^2), condition number (kappa), query type (single-source, pairwise, batch), target platform (native/WASM/edge), and available compute budget (wall-time, memory, energy).

Algorithm Portfolio

| Algorithm | Complexity | Primary Domain |
|---|---|---|
| Neumann Series | O(k * nnz) | Sparse SPD, diagonally dominant |
| Forward Push | O(1/eps) | Single-source PPR, graph exploration |
| Backward Push | O(1/eps) | Reverse relevance to target node |
| Hybrid Random Walk | O(sqrt(n)/eps) | Pairwise relevance, Monte Carlo |
| TRUE | O(log n) amortized | Large-scale Laplacian, JL + sparsification |
| Conjugate Gradient | O(sqrt(kappa) * log(1/eps) * nnz) | Gold-standard SPD solve |
| BMSSP | O(nnz * log n) | Multigrid hierarchical, near-linear |

RuVector Consumption Points

Each RuVector subsystem requires different algorithms:

Prime Radiant (sheaf Laplacian)     -> CG, TRUE, Neumann
ruvector-gnn (message passing)      -> Forward Push, Neumann
ruvector-math (spectral filtering)  -> Neumann, CG, Chebyshev (existing)
ruvector-graph (PageRank/centrality)-> Forward Push, Backward Push
ruvector-attention (PDE diffusion)  -> CG on sparse Laplacian
ruvector-mincut (effective resist.) -> CG, TRUE (shared sparsifier)
ruvector-core (distance approx.)    -> JL projection (TRUE component)

Motivating Constraints

Without a principled routing strategy, each RuVector subsystem would need to independently select an algorithm, leading to duplicated logic, inconsistent quality-of-service, missed optimization opportunities, and platform-incompatible choices (e.g., selecting TRUE with heavy preprocessing on a WASM target with a 4 MB memory budget).

The routing problem is compounded by:

  1. Heterogeneous platforms: Native (AVX-512, 64 GB RAM), WASM browser (SIMD128, 4 MB), Cloudflare edge (SIMD128, 128 MB), Apple Silicon (NEON, 16 GB).
  2. Diverse query types: RuVector's 10+ crates generate fundamentally different problem structures, from dense O(n^2) attention matrices to ultra-sparse HNSW adjacency graphs.
  3. Cascading error budgets: When multiple sublinear algorithms compose (e.g., JL projection -> sparsification -> Neumann iteration -> push aggregation), error accumulates and must be managed holistically.
  4. Latency constraints: Compute lanes range from Lane 0 Reflex (<1 ms) to Lane 3 Deliberate (unbounded), requiring algorithm choices that respect wall-time budgets.

Decision

Implement a three-tier routing system that combines compile-time platform constraints, runtime heuristic dispatch, and adaptive learning from historical performance.

Tier 1: Static Rules (Compile-Time)

Feature flags and cfg attributes select the set of available algorithms per target platform. These constraints are absolute and override all runtime decisions.

# WASM target -- exclude algorithms requiring heavy preprocessing
[target.'cfg(target_arch = "wasm32")'.dependencies]
ruvector-solver = { version = "0.1", default-features = false, features = [
    "neumann", "forward-push", "backward-push", "cg"
] }
# TRUE excluded: preprocessing O(m*log(n)/eps^2) exceeds WASM memory budget
# BMSSP excluded: hierarchy construction + storage O(nnz*log(n)) too large
# Hybrid Random Walk: conditional on getrandom/wasm_js for PRNG

# Native target -- all algorithms available
[target.'cfg(not(target_arch = "wasm32"))'.dependencies]
ruvector-solver = { version = "0.1", features = ["full"] }

Platform-algorithm availability matrix:

| Algorithm | Native x86_64 | Native ARM64 | WASM Browser (4 MB) | WASM Edge (128 MB) | NAPI Node |
|---|---|---|---|---|---|
| Neumann Series | Yes | Yes | Yes | Yes | Yes |
| Forward Push | Yes | Yes | Yes | Yes | Yes |
| Backward Push | Yes | Yes | Yes | Yes | Yes |
| Hybrid Random Walk | Yes | Yes | No (1) | Yes | Yes |
| TRUE | Yes | Yes | No (2) | Yes (n < 500K) | Yes |
| Conjugate Gradient | Yes | Yes | Yes | Yes | Yes |
| BMSSP | Yes | Yes | No (2) | Yes (n < 50K) | Yes |

(1) Requires getrandom/wasm_js feature for cryptographic PRNG in browser context. (2) Preprocessing memory exceeds browser budget for problems of practical size.

Feature flag structure:

[features]
default = ["solver-full"]
solver-full = ["solver-true", "solver-bmssp", "solver-neumann", "solver-push", "solver-cg"]
solver-true = []           # TRUE: JL + sparsification + adaptive Neumann
solver-bmssp = []          # BMSSP: multigrid hierarchical
solver-neumann = []        # Neumann series
solver-push = []           # Forward/Backward Push + Hybrid Random Walk
solver-cg = []             # Conjugate Gradient
solver-wasm = ["solver-neumann", "solver-push", "solver-cg"]  # WASM-safe subset
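The Tier 1 gates surface at runtime as the availability list that Tier 2 consults. A minimal sketch of how the feature flags above might be mirrored in code; the `Algorithm` enum shape and the `available_algorithms` function name are illustrative, not the crate's actual API:

```rust
#[derive(Clone, Copy, PartialEq, Debug)]
pub enum Algorithm {
    NeumannSeries,
    ForwardPush,
    BackwardPush,
    HybridRandomWalk,
    True,
    ConjugateGradient,
    Bmssp,
}

/// Algorithms compiled into this build. The WASM-safe subset is always
/// present; heavy-preprocessing algorithms are gated out on wasm32.
pub fn available_algorithms() -> Vec<Algorithm> {
    let mut algos = vec![
        Algorithm::NeumannSeries,
        Algorithm::ForwardPush,
        Algorithm::BackwardPush,
        Algorithm::ConjugateGradient,
    ];
    #[cfg(not(target_arch = "wasm32"))]
    algos.extend([
        Algorithm::HybridRandomWalk,
        Algorithm::True,
        Algorithm::Bmssp,
    ]);
    algos
}
```

On a native build this yields all seven algorithms; a wasm32 build compiles down to the four-entry WASM-safe subset with no runtime check needed.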

Tier 2: Heuristic Router (Runtime, <1 ms)

A deterministic decision tree selects the optimal algorithm based on problem characteristics. The router executes in under 1 ms and requires no heap allocation. All thresholds are derived from the crossover analysis below.

Router Input Signature

pub struct RoutingQuery {
    /// Matrix dimension (n x n) or graph vertex count
    pub n: usize,
    /// Number of non-zero entries (for sparse matrices) or edge count
    pub nnz: usize,
    /// Query type determines which algorithms are applicable
    pub query_type: QueryType,
    /// Target accuracy (epsilon)
    pub eps: f64,
    /// Condition number estimate (0.0 if unknown; see Appendix B)
    pub kappa_estimate: f64,
    /// Available wall-time budget
    pub budget: ComputeBudget,
    /// Whether preprocessing has been amortized (batch mode)
    pub batch_mode: bool,
    /// Number of right-hand sides (for batch Laplacian solves)
    pub num_rhs: usize,
    /// Diagonal dominance ratio: min_i(A_ii / sum_{j!=i} |A_ij|), range [0,inf)
    pub diagonal_dominance: f64,
}

pub enum QueryType {
    /// Solve Ax = b for SPD matrix A
    LinearSolve,
    /// Single-source personalized PageRank from vertex s
    SingleSourcePPR { source: usize },
    /// Reverse relevance: all sources relevant to target t
    ReverseRelevance { target: usize },
    /// Pairwise relevance between specific (s, t)
    PairwiseRelevance { source: usize, target: usize },
    /// Spectral graph filtering: apply h(L)x
    SpectralFilter { filter_type: FilterType },
    /// Eigenvector computation for spectral clustering
    SpectralClustering { num_clusters: usize },
    /// Dimension reduction via JL projection
    DimensionReduction { target_dim: usize },
    /// Multi-scale graph decomposition
    MultiScaleDecomposition,
    /// Batch Laplacian solves (same graph, multiple RHS)
    BatchLaplacian { count: usize },
}

pub enum FilterType {
    /// Rational filter: (I + alpha*L)^{-1}, heat kernel
    Rational { alpha: f64 },
    /// Polynomial filter: Chebyshev expansion
    Polynomial { degree: usize },
    /// General filter requiring inversion
    General,
}

Decision Tree

The heuristic router implements the following decision tree. At each node, the first matching rule fires.

ROOT
 |
 +-- QueryType::SingleSourcePPR
 |     +-- [ALWAYS] => ForwardPush
 |         Rationale: O(1/eps) independent of n, deterministic, no preprocessing.
 |         No other algorithm is competitive for single-source graph queries.
 |
 +-- QueryType::ReverseRelevance
 |     +-- [ALWAYS] => BackwardPush
 |         Rationale: Dual of ForwardPush, O(1/eps) for column queries.
 |
 +-- QueryType::PairwiseRelevance
 |     +-- n < 1,000 => ForwardPush (compute full PPR, read target entry)
 |     +-- n >= 1,000 => HybridRandomWalk
 |         Rationale: O(sqrt(n)/eps) beats full PPR computation for large n
 |         when only a single pairwise value is needed.
 |
 +-- QueryType::LinearSolve
 |     +-- n < 500 => CG (no preconditioner)
 |     |   Rationale: Below crossover for all sublinear methods. CG converges
 |     |   in O(sqrt(kappa)*log(1/eps)) iterations, each O(nnz). At n=500,
 |     |   even kappa=n^2 yields manageable iteration counts.
 |     |
 |     +-- sparsity_ratio < 0.01 AND kappa_estimate < 5 AND diagonal_dominance > 0.5
 |     |   => NeumannSeries
 |     |   Rationale: Very sparse, well-conditioned, diagonally dominant.
 |     |   Neumann converges geometrically with rate rho < 1-1/kappa.
 |     |   For kappa < 5: rho < 0.8, so k < 4.5*ln(1/eps) iterations (~62 at eps = 1e-6).
 |     |
 |     +-- sparsity_ratio < 0.05 AND kappa_estimate < 10,000
 |     |   => CG (diagonal preconditioner)
 |     |   Rationale: Moderate condition number. Diagonal preconditioning
 |     |   reduces effective kappa by ~10x. Iterations: O(sqrt(1000)*log(1/eps)).
 |     |
 |     +-- kappa_estimate >= 25 AND n > 50,000
 |     |   => BMSSP
 |     |   Rationale: Multigrid convergence is independent of kappa.
 |     |   O(nnz*log(1/eps)) per V-cycle. Beats CG's O(sqrt(kappa)*...).
 |     |
 |     +-- [DEFAULT] => CG (diagonal preconditioner)
 |         Rationale: Safe default for all SPD systems. Deterministic,
 |         well-understood convergence. Memory footprint O(nnz + 4n).
 |
 +-- QueryType::BatchLaplacian { count }
 |     +-- count >= 10 AND n >= 100,000 => TRUE
 |     |   Rationale: Amortize preprocessing O(m*log(n)/eps^2) over count solves.
 |     |   Per-solve cost O(log(n)) amortized. Break-even analysis: see Crossover Points.
 |     |
 |     +-- count >= 10 AND kappa_estimate >= 25 => BMSSP
 |     |   Rationale: Reuse multigrid hierarchy across batch. O(nnz*log(1/eps)) per solve.
 |     |
 |     +-- [DEFAULT] => CG
 |
 +-- QueryType::SpectralFilter { filter_type }
 |     +-- Rational { alpha } AND alpha >= 0.01
 |     |   => NeumannSeries on (I + alpha*L)
 |     |   Rationale: Guaranteed convergence (spectral radius of D^{-1}B < 1
 |     |   for alpha > 0). Iterations k = O(1/alpha). For alpha >= 0.01, k <= 100.
 |     |
 |     +-- General => CG
 |     |   Rationale: CG solves Lx = b directly for inversion-based filters.
 |     |
 |     +-- Polynomial { degree } => NoRoute (use existing Chebyshev)
 |         Rationale: Chebyshev recurrence is optimal for arbitrary polynomial
 |         filters. Router returns NoRoute; caller uses ruvector-math Chebyshev.
 |
 +-- QueryType::SpectralClustering { num_clusters }
 |     +-- n < 10,000 => CG (shift-invert for eigenvectors)
 |     +-- n >= 10,000 AND n < 100,000 => BMSSP (multigrid eigensolver)
 |     +-- n >= 100,000 => TRUE (sparsify + JL + adaptive Neumann)
 |
 +-- QueryType::DimensionReduction
 |     +-- [ALWAYS] => TRUE (JL component only)
 |         Rationale: JL projection to k = ceil(24*ln(n)/eps^2) dimensions.
 |
 +-- QueryType::MultiScaleDecomposition
       +-- [ALWAYS] => BMSSP
           Rationale: BMSSP coarsening hierarchy IS the decomposition.

Router Implementation

The router is a pure function with no allocations:

pub fn route(query: &RoutingQuery, available: &[Algorithm]) -> RoutingDecision {
    let sparsity_ratio = query.nnz as f64 / (query.n as f64 * query.n as f64);

    let candidate = match &query.query_type {
        QueryType::SingleSourcePPR { .. } => Algorithm::ForwardPush,
        QueryType::ReverseRelevance { .. } => Algorithm::BackwardPush,
        QueryType::PairwiseRelevance { .. } => {
            if query.n < 1_000 { Algorithm::ForwardPush }
            else { Algorithm::HybridRandomWalk }
        }
        QueryType::LinearSolve => route_linear_solve(query, sparsity_ratio),
        QueryType::BatchLaplacian { count } => route_batch(query, *count),
        QueryType::SpectralFilter { filter_type } => route_filter(query, filter_type),
        QueryType::SpectralClustering { .. } => route_clustering(query),
        QueryType::DimensionReduction { .. } => Algorithm::TRUE,
        QueryType::MultiScaleDecomposition => Algorithm::BMSSP,
    };

    if available.contains(&candidate) {
        RoutingDecision::Route(candidate)
    } else {
        RoutingDecision::Fallback(select_fallback(query, available))
    }
}

fn select_fallback(query: &RoutingQuery, available: &[Algorithm]) -> Algorithm {
    // Fallback priority: CG > Neumann > BMSSP > ForwardPush > BackwardPush > HybridRW > TRUE
    let priority = [
        Algorithm::ConjugateGradient,
        Algorithm::NeumannSeries,
        Algorithm::BMSSP,
        Algorithm::ForwardPush,
        Algorithm::BackwardPush,
        Algorithm::HybridRandomWalk,
        Algorithm::TRUE,
    ];
    priority.iter()
        .copied()
        .find(|a| available.contains(a))
        .unwrap_or(Algorithm::ConjugateGradient) // CG is compiled on every target
}

Tier 3: Adaptive Learning (Runtime, SONA-Powered)

The third tier uses RuVector's SONA (Self-Optimizing Neural Architecture) framework to learn from historical solve performance and adjust routing weights.

Architecture

                     RoutingQuery
                          |
                     [Tier 2 Heuristic]
                          |
                   candidate algorithm
                          |
                  [SONA Override Check]
                   /              \
              no override      override (confidence > 0.8)
                  |                  |
           use heuristic      use SONA prediction
                  |                  |
                  v                  v
              Execute Algorithm
                  |
             SolveOutcome {
               algorithm, wall_time, residual,
               iterations, memory_peak
             }
                  |
             [Feedback to SONA]
                  |
            Update routing weights

SONA Feature Extraction

SONA maintains a routing weight matrix W of shape [num_features x num_algorithms]. The features are derived from the RoutingQuery:

Feature vector f(query):
  f[0]  = log2(n)                           // Scale feature
  f[1]  = log2(nnz + 1)                     // Density feature
  f[2]  = nnz / n^2                         // Sparsity ratio
  f[3]  = log2(kappa_estimate + 1)          // Condition number
  f[4]  = log2(1/eps)                       // Precision requirement
  f[5]  = encode(query_type)                // One-hot (9 categories)
  f[6]  = encode(platform)                  // One-hot (4 categories)
  f[7]  = budget.max_wall_time_ms / 1000.0  // Normalized time budget
  f[8]  = num_rhs                           // Batch size
  f[9]  = diagonal_dominance                // Dominance ratio

The SONA model predicts algorithm performance scores:

scores = softmax(W^T * f(query))
selected = argmax(scores) if max(scores) > confidence_threshold (0.8)
         = heuristic_choice otherwise

Memory overhead of SONA model: 10 features x 7 algorithms = 70 floats = 280 bytes. EWC Fisher diagonal adds another 280 bytes. Total: < 1 KB.
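The override check reduces to a linear scorer over the feature vector. A sketch under the formulas above: `W` is stored row-major as `[num_features x num_algorithms]`, and the 0.8 confidence gate decides between SONA's prediction and the heuristic. The function names are illustrative, not the SONA API:

```rust
/// Numerically stable softmax over raw scores.
fn softmax(logits: &[f64]) -> Vec<f64> {
    let m = logits.iter().cloned().fold(f64::NEG_INFINITY, f64::max);
    let exps: Vec<f64> = logits.iter().map(|&z| (z - m).exp()).collect();
    let sum: f64 = exps.iter().sum();
    exps.iter().map(|e| e / sum).collect()
}

/// Returns Some(algorithm_index) when the model is confident enough to
/// override the heuristic, None otherwise (fall through to Tier 2).
fn sona_override(w: &[f64], f: &[f64], num_algos: usize, threshold: f64) -> Option<usize> {
    // logits[a] = sum_i W[i][a] * f[i]  (row-major weight layout)
    let logits: Vec<f64> = (0..num_algos)
        .map(|a| f.iter().enumerate().map(|(i, &fi)| w[i * num_algos + a] * fi).sum())
        .collect();
    let scores = softmax(&logits);
    let (best, &p) = scores
        .iter()
        .enumerate()
        .max_by(|a, b| a.1.partial_cmp(b.1).unwrap())?;
    if p > threshold { Some(best) } else { None }
}
```

A strongly weighted algorithm clears the gate; a near-uniform score vector returns None and the heuristic choice stands.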

EWC for Catastrophic Forgetting Prevention

When workload distribution shifts (e.g., a cluster transitions from graph queries to attention-matrix solves), the learned weights must adapt without losing knowledge of the previous workload. Elastic Weight Consolidation (EWC), already implemented in ruvector-gnn/src/ewc.rs, prevents this:

L_total = L_current + (lambda/2) * sum_i( F_i * (W_i - W*_i)^2 )

where:

  • F_i is the diagonal of the Fisher Information Matrix computed over the previous workload's routing decisions
  • W*_i are the weights learned for that workload
  • lambda controls the consolidation penalty strength (default: 100.0)

The Fisher diagonal is computed efficiently over the last N routing decisions:

F_i = (1/N) * sum_{j=1}^{N} (d log p(a_j | q_j; W) / dW_i)^2

EWC checkpoints every 1,000 outcomes to capture workload snapshots.
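The penalty term is a direct transcription of the formula: for each weight, the Fisher diagonal scales a quadratic pull toward the previous workload's optimum. A standalone sketch (the real implementation lives in ruvector-gnn/src/ewc.rs; this function name is illustrative):

```rust
/// EWC consolidation penalty: (lambda/2) * sum_i F_i * (W_i - W*_i)^2.
fn ewc_penalty(w: &[f64], w_star: &[f64], fisher: &[f64], lambda: f64) -> f64 {
    w.iter()
        .zip(w_star)
        .zip(fisher)
        .map(|((wi, wsi), fi)| fi * (wi - wsi).powi(2))
        .sum::<f64>()
        * lambda
        / 2.0
}
```

Weights with near-zero Fisher values (unimportant to the old workload) move freely; weights with large Fisher values are pinned, which is exactly the anti-forgetting behavior wanted here.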

Feedback Loop and Reward Function

After each solve, the actual performance is fed back:

pub struct SolveOutcome {
    pub query: RoutingQuery,
    pub algorithm_used: Algorithm,
    pub wall_time_us: u64,
    pub iterations: usize,
    pub final_residual: f64,
    pub memory_peak_bytes: usize,
    pub converged: bool,
}

impl SONARouter {
    pub fn record_outcome(&mut self, outcome: SolveOutcome) {
        let reward = self.compute_reward(&outcome);
        let features = self.extract_features(&outcome.query);
        self.sona.update(features, outcome.algorithm_used, reward);

        self.outcome_count += 1;
        if self.outcome_count % 1000 == 0 {
            self.sona.update_fisher_diagonal();
        }
    }

    fn compute_reward(&self, outcome: &SolveOutcome) -> f64 {
        if !outcome.converged {
            return -1.0;
        }
        let time_score = 1.0 - (outcome.wall_time_us as f64
            / outcome.query.budget.max_wall_time_us() as f64).min(1.0);
        let accuracy_score = (outcome.final_residual.log10()
            / outcome.query.eps.log10()).min(1.0).max(0.0);
        let memory_score = 1.0 - (outcome.memory_peak_bytes as f64
            / outcome.query.budget.max_memory_bytes as f64).min(1.0);
        0.5 * time_score + 0.3 * accuracy_score + 0.2 * memory_score
    }
}
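As a runnable check of the reward blend, the same scoring can be restated over scalars so it stands alone (a sketch mirroring compute_reward above, not the crate API):

```rust
/// Reward blend: 0.5 * time + 0.3 * accuracy + 0.2 * memory, or -1 on
/// non-convergence. Inputs are the raw outcome metrics and budgets.
fn reward(
    wall_us: f64,
    budget_us: f64,
    residual: f64,
    eps: f64,
    mem_bytes: f64,
    mem_budget: f64,
    converged: bool,
) -> f64 {
    if !converged {
        return -1.0;
    }
    let time_score = 1.0 - (wall_us / budget_us).min(1.0);
    // Ratio of log-residual to log-target: 1.0 when the target is met or beaten.
    let accuracy_score = (residual.log10() / eps.log10()).clamp(0.0, 1.0);
    let memory_score = 1.0 - (mem_bytes / mem_budget).min(1.0);
    0.5 * time_score + 0.3 * accuracy_score + 0.2 * memory_score
}
```

An instantaneous solve that exactly hits its residual target with zero memory pressure scores 1.0; a solve that burns its whole time budget but still converges keeps partial credit from the accuracy and memory terms.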

Cold start: SONA requires ~10,000 recorded outcomes before overrides become reliable. During cold start, all decisions fall through to the Tier 2 heuristic.


Algorithm Selection Decision Matrix

The authoritative reference for routing decisions across all relevant dimensions:

| Dimension | Neumann Series | Forward Push | Backward Push | Hybrid RW | TRUE | CG | BMSSP |
|---|---|---|---|---|---|---|---|
| Input type | Sparse SPD matrix A | Graph G + source s | Graph G + target t | Graph G + (s,t) | Sparse Laplacian L | Sparse SPD matrix A | Sparse Laplacian L |
| Output type | Approx A^{-1}b | PPR vector pi_s | PPR column pi_{*,t} | Scalar pi(s,t) | Approx L^{-1}b | Exact (to tol) x | Approx L^{-1}b |
| Best n range | 500 - 1M | 1 - unlimited | 1 - unlimited | 1K - 10M | 100K - unlimited | 100 - 10M | 50K - 10M |
| Sparsity req. | nnz/n^2 < 0.1 | Natural graph | Natural graph | Natural graph | Any sparse | nnz/n^2 < 0.5 | Hierarchical structure |
| Preprocessing | None, O(1) | None, O(1) | None, O(1) | None, O(1) | O(m*log(n)/eps^2) | O(n) diag precond | O(m*log(n)) coarsen |
| Per-solve cost | O(k*nnz) | O(1/eps) | O(1/eps) | O(sqrt(n)/eps) | O(log(n)) amortized | O(sqrt(kappa)*log(1/eps)*nnz) | O(nnz*log(1/eps)) |
| Deterministic? | Yes | Yes | Yes | No (Monte Carlo) | No (JL + sampling) | Yes | Partially (AMG coarsening) |
| Parallelizable? | SpMV parallel | Limited (push serial) | Limited (push serial) | Walk parallel (good) | High (all phases) | SpMV parallel | Level-parallel |
| WASM compatible? | Yes | Yes | Yes | Conditional (PRNG) | No (memory) | Yes | Conditional (n < 50K) |
| Numerical stability | Requires rho(D^{-1}B) < 1 | Kahan summation | Kahan summation | Variance management | 3-component error | Reorthogonalization | Coarsening quality |
| Convergence | Geometric: rho^k | Absolute: eps*vol(G) | Absolute: eps*vol(G) | Probabilistic | Relative energy norm | Deterministic A-norm | V-cycle: sigma < 1 |
| Memory footprint | O(nnz + n) | O(nonzero PPR entries) | O(nonzero PPR entries) | O(n + num_walks) | O(n*log(n)/eps^2) | O(nnz + 4n) | O(nnz*log(n)) |
| Condition sensitivity | High (diverges if rho >= 1) | None (topology only) | None (topology only) | None (topology only) | Low (sparsification) | High (sqrt(kappa) iters) | Low (multigrid) |
| Composability | Nestable | Chainable (multi-hop) | Chainable (multi-hop) | Terminal (point query) | Preprocessing reusable | Preconditioner swappable | Hierarchy reusable |

Crossover Points

The following analysis determines the exact n and kappa values where each algorithm becomes faster than alternatives. Constant factors are calibrated from RuVector's published benchmark results (doc 08) on Apple M4 Pro (NEON) and Linux/AVX2.

Constant Factor Calibration

From RuVector benchmarks:

c_spmv   = 2.5 ns per nonzero  (AVX2 SpMV, from prime-radiant SIMD benchmarks)
c_spmv_n = 3.5 ns per nonzero  (NEON SpMV, extrapolated from distance benchmarks)
c_push   = 15  ns per push op  (graph traversal, from HNSW benchmark overhead)
c_walk   = 50  ns per RW step  (includes PRNG + graph access, from ruvector-core)
c_jl     = 1.5 ns per f32 mul  (from dot product: 12 ns / 8 dims)
c_alloc  = 20  ns per arena alloc (from bench_memory.rs)

Crossover 1: Neumann Series vs Conjugate Gradient

Both solve Ax = b for sparse SPD A. Wall-time models:

T_neumann(n) = k_neumann * nnz * c_spmv
  where k_neumann = ceil( log(1/eps) / log(1/rho) )
  and rho = 1 - 1/kappa  (for regularized Laplacian with shift delta ~ lambda_min)

T_cg(n) = k_cg * nnz * c_spmv
  where k_cg = ceil( sqrt(kappa) * log(2/eps) )

Setting T_neumann = T_cg and simplifying (nnz and c_spmv cancel):

log(1/eps) / log(1/(1-1/kappa)) = sqrt(kappa) * log(2/eps)

For kappa >> 1, using log(1/(1-x)) ~ x for small x:

log(1/eps) / (1/kappa) ~ sqrt(kappa) * log(2/eps)
kappa * log(1/eps) ~ sqrt(kappa) * log(2/eps)
sqrt(kappa) ~ log(2/eps) / log(1/eps) ~ 1
kappa ~ 1

This means the Neumann iteration count grows as O(kappa * log(1/eps)), while CG's grows as O(sqrt(kappa) * log(1/eps)): the asymptotic crossover sits near kappa ~ 1, and with the calibrated constants CG dominates for kappa > ~4.

Concrete comparison at eps = 1e-6:

| kappa | k_neumann | k_cg | Winner |
|---|---|---|---|
| 2 | 20 | 20 | Tie |
| 4 | 55 | 28 | CG (2.0x) |
| 10 | 138 | 44 | CG (3.1x) |
| 100 | 1,382 | 138 | CG (10x) |
| 1,000 | 13,816 | 437 | CG (31.6x) |

Router threshold: Use Neumann only when kappa_estimate < 5 AND diagonal_dominance > 0.5. For graph Laplacians, expander graphs have kappa ~ O(1), making Neumann competitive.
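The two iteration models are easy to evaluate directly. A sketch using the exact log(1/rho) contraction rate rather than the kappa*log(1/eps) approximation, so counts differ slightly from the table above:

```rust
/// Neumann iterations to reach eps with contraction rate rho = 1 - 1/kappa:
/// smallest k with rho^k < eps.
fn k_neumann(kappa: f64, eps: f64) -> f64 {
    let rho = 1.0 - 1.0 / kappa;
    ((1.0 / eps).ln() / (1.0 / rho).ln()).ceil()
}

/// CG iteration bound: sqrt(kappa) * log(2/eps).
fn k_cg(kappa: f64, eps: f64) -> f64 {
    (kappa.sqrt() * (2.0_f64 / eps).ln()).ceil()
}
```

At kappa = 2 the two are effectively tied (about 20 iterations each at eps = 1e-6), while at kappa = 100 Neumann needs roughly an order of magnitude more sweeps, matching the router's kappa_estimate < 5 gate.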

Crossover 2: CG vs BMSSP

T_cg = sqrt(kappa) * log(1/eps) * nnz * c_spmv
T_bmssp = C_mg * nnz * log(1/eps) * c_spmv + T_coarsen
  where C_mg ~ 5 (multigrid cycle overhead: 2 smoothing sweeps + restriction + prolongation)
  and T_coarsen = nnz * log(n) * c_spmv (one-time hierarchy construction)

Ignoring preprocessing (batch mode or amortized):

sqrt(kappa) > C_mg = 5
kappa > 25

Including preprocessing over B solves:

sqrt(kappa) * log(1/eps) * nnz * c > C_mg * log(1/eps) * nnz * c + (nnz * log(n) * c) / B
sqrt(kappa) > 5 + log(n) / (B * log(1/eps))

For n = 100K, eps = 1e-6, B = 10:

sqrt(kappa) > 5 + 17 / (10 * 14) = 5.12
kappa > 26.2

Router threshold: Use BMSSP when kappa > 25 (single solve, n > 50K); in batch mode, when sqrt(kappa) > 5 + log(n) / (B * log(1/eps)).
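The batch-mode condition can be checked numerically. A sketch using natural logs throughout (the worked example above mixes log bases, so the break-even shifts slightly from 26.2); the function name is illustrative:

```rust
/// Batch-mode crossover test: BMSSP beats CG over B solves when
/// sqrt(kappa) > C_mg + log(n) / (B * log(1/eps)).
fn bmssp_beats_cg_batch(kappa: f64, n: f64, b: f64, eps: f64) -> bool {
    let c_mg = 5.0; // multigrid cycle overhead (2 smooths + restrict + prolong)
    kappa.sqrt() > c_mg + n.ln() / (b * (1.0 / eps).ln())
}
```

For n = 100K, eps = 1e-6, B = 10, the threshold lands just above kappa = 25: kappa = 27 clears it, kappa = 25 does not.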

Crossover 3: CG vs TRUE (Batch Mode)

TRUE has heavy preprocessing but very fast per-solve amortized cost:

T_true_prep = m * log(n) / eps^2 * c_spmv    (sparsification + JL)
T_true_solve = log^2(n) * n / eps^2 * c_jl   (per-solve, JL-dominated)
T_cg_solve = sqrt(kappa) * log(1/eps) * nnz * c_spmv  (per-solve)

TRUE wins over B solves when:

T_true_prep + B * T_true_solve < B * T_cg_solve
T_true_prep / B < T_cg_solve - T_true_solve

For n = 100K, nnz = 10n = 10^6, eps = 0.01, kappa = 1000, c_spmv = 2.5 ns, c_jl = 1.5 ns:

T_true_prep  = 10^6 * 17 / 0.0001 * 2.5e-9 = 425 ms
T_true_solve = 289 * 10^5 / 0.0001 * 1.5e-9 = 43 ms
T_cg_solve   = 31.6 * 14 * 10^6 * 2.5e-9 = 1.1 ms

TRUE per-solve (43 ms) is 39x slower than CG (1.1 ms) at this configuration. TRUE only becomes viable when the preprocessing is amortized over a very large batch:

425 ms / B + 43 ms < 1.1 ms  =>  impossible (43 ms > 1.1 ms)

TRUE per-solve dominates CG at this scale. TRUE becomes practical only for n >> 10^6 or when eps is relaxed to ~0.1 (reducing JL target dimension dramatically):

At eps = 0.1: T_true_solve = 289 * 10^5 / 0.01 * 1.5e-9 = 0.43 ms. Now: 425 ms / B + 0.43 ms < 1.1 ms => B > 425 / 0.67 = 634.

Router threshold: Use TRUE when n >= 100K AND batch_mode AND num_rhs >= max(10, n/1000) AND eps >= 0.05.
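The amortization inequality reduces to a one-liner. A sketch; fed the document's figures (prep 425, per-solve 0.43, CG 1.1, all in the same time unit) it reproduces the B > 634 break-even, and correctly reports "never" when the per-solve cost alone already loses:

```rust
/// Smallest batch size B satisfying prep/B + per_solve < alt_per_solve,
/// or None when amortization can never win (per_solve >= alt_per_solve).
fn break_even_batch(prep: f64, per_solve: f64, alt_per_solve: f64) -> Option<u64> {
    if per_solve >= alt_per_solve {
        return None;
    }
    Some((prep / (alt_per_solve - per_solve)).ceil() as u64)
}
```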

Crossover 4: Forward Push vs Hybrid Random Walk (Pairwise)

For pairwise PPR(s,t):

T_push = (1/eps) * c_push              (computes full PPR vector, reads entry t)
T_hybrid = (sqrt(n)/eps) * c_walk      (directly estimates pairwise value)

However, Forward Push for pairwise is wasteful because it computes the entire PPR vector but only needs one entry. The relevant comparison accounts for the push's "wasted work":

T_push_effective = (1/eps) * c_push    (total, regardless of target)
T_hybrid = (sqrt(n)/eps) * c_walk

Crossover at T_push = T_hybrid:

c_push / eps = sqrt(n) * c_walk / eps
sqrt(n) = c_push / c_walk = 15 / 50 = 0.3
n = 0.09

This suggests push is always cheaper in raw operations. But for large graphs (n > 10^5), the push generates O(1/eps) nonzero PPR entries that must be stored in memory, while Hybrid RW uses O(sqrt(n)) memory. The practical crossover considers memory pressure and cache behavior:

  • For n < 1,000: Push is faster and memory is not a concern.
  • For n >= 1,000: Hybrid RW has better cache locality (walks are sequential in the adjacency list) and provides probabilistic confidence bounds.
  • For n >= 10^6: Push may exceed L3 cache with its O(1/eps) nonzero entries when eps < 10^{-4}.

Router threshold: Use Forward Push for pairwise when n < 1,000. Use Hybrid RW otherwise.

Crossover Summary Table

| Algorithm A | Algorithm B | Crossover Condition | Winner Below | Winner Above |
|---|---|---|---|---|
| Neumann | CG | kappa ~ 4 | Neumann | CG |
| CG | BMSSP | kappa ~ 25 (n > 50K) | CG | BMSSP |
| CG | TRUE (batch) | n*B ~ 10^7, eps >= 0.05 | CG | TRUE |
| Forward Push | Hybrid RW | n ~ 1K (pairwise query) | Push | Hybrid RW |
| CG | BMSSP (batch) | sqrt(kappa) ~ 5 + log(n)/(B*log(1/eps)) | CG | BMSSP |
| Neumann | BMSSP | kappa ~ 5 (n > 50K) | Neumann | BMSSP |
| Chebyshev | Neumann | alpha ~ 0.01 (rational) | Chebyshev | Neumann |

Error Budget Decomposition

When multiple sublinear algorithms compose in a RuVector pipeline, the total approximation error is bounded by the sum of individual component errors.

Error Accumulation Model

For additive error components (independent approximations):

eps_total <= eps_quantization + eps_jl + eps_sparsify + eps_solver + eps_push

For multiplicative error components (chained distance-preserving transformations):

(1 + eps_total) <= (1 + eps_jl) * (1 + eps_sparsify) * (1 + eps_solver)

For small eps (< 0.1), the multiplicative model approximates to:

eps_total ~ eps_jl + eps_sparsify + eps_solver + O(eps^2)

We use the additive model with conservative bounds throughout.
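The two accumulation models are easy to compare numerically: for component errors around 0.02 the gap sits at the O(eps^2) level, which is what justifies the additive approximation. A sketch:

```rust
/// Additive error bound: sum of component epsilons.
fn eps_additive(components: &[f64]) -> f64 {
    components.iter().sum()
}

/// Multiplicative bound: prod(1 + eps_i) - 1.
fn eps_multiplicative(components: &[f64]) -> f64 {
    components.iter().map(|e| 1.0 + e).product::<f64>() - 1.0
}
```

Three chained 0.02 components give 0.060 additively versus about 0.0612 multiplicatively; the 0.0012 gap is the O(eps^2) term.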

Default Budget Allocation (eps_total = 0.1)

| Component | Symbol | Budget | Fraction | Rationale |
|---|---|---|---|---|
| Quantization | eps_q | 0.030 | 30% | Scalar u8: error per dim ~ range/255. For normalized [-1,1]: 2/255 ~ 0.0078/dim. Over d dims with sqrt cancellation: 0.0078*sqrt(d). At d=128: 0.088 (within budget). At d=384: 0.153 (exceeds; use f32 inputs to solver). |
| JL Projection | eps_jl | 0.020 | 20% | Target dim k = ceil(24*ln(n)/eps_jl^2). For n=1M, eps_jl=0.02: k=840K (impractical). JL is practical only when eps_jl >= 0.1 (k ~ 33K for n=1M). Reallocate when JL absent. |
| Sparsification | eps_s | 0.020 | 20% | Benczur-Karger: O(n*log(n)/eps_s^2) edges. For n=100K, eps_s=0.02: 4.25e9 edges (too many). Practical: eps_s >= 0.05 yields 6.8e8 edges. Adjust per problem. |
| Solver Residual | eps_r | 0.020 | 20% | CG: norm(r)/norm(b) < eps_r. Neumann: rho^k < eps_r. BMSSP: sigma^k < eps_r. Cheapest to improve (logarithmic in 1/eps_r). |
| Push Approx | eps_p | 0.010 | 10% | Forward/Backward Push: norm1(pi - pi_approx) < eps_p*vol(G). For search: PPR rank errors at eps_p=0.01 are negligible for top-k retrieval. |

Adaptive Budget Reallocation

Not all components are active in every pipeline. The router reallocates unused budget proportionally:

pub fn allocate_error_budget(
    eps_total: f64,
    active_components: &[ErrorComponent],
) -> HashMap<ErrorComponent, f64> {
    let base_weights: HashMap<ErrorComponent, f64> = [
        (ErrorComponent::Quantization,   0.30),
        (ErrorComponent::JLProjection,   0.20),
        (ErrorComponent::Sparsification, 0.20),
        (ErrorComponent::SolverResidual, 0.20),
        (ErrorComponent::PushApprox,     0.10),
    ].into_iter().collect();

    let active_weight_sum: f64 = active_components.iter()
        .filter_map(|c| base_weights.get(c))
        .sum();

    active_components.iter()
        .filter_map(|c| {
            base_weights.get(c).map(|w| (*c, eps_total * w / active_weight_sum))
        })
        .collect()
}

Example allocations for common pipelines:

Pipeline: CG-only linear solve (active: SolverResidual)

eps_solver = eps_total = 0.1

Pipeline: Forward Push for hybrid search (active: Quantization, PushApprox)

eps_quantization = 0.1 * 0.30 / 0.40 = 0.075
eps_push         = 0.1 * 0.10 / 0.40 = 0.025

Pipeline: TRUE for batch spectral clustering (active: JL, Sparsification, Solver)

eps_jl       = 0.1 * 0.20 / 0.60 = 0.0333
eps_sparsify = 0.1 * 0.20 / 0.60 = 0.0333
eps_solver   = 0.1 * 0.20 / 0.60 = 0.0333

Pipeline: Full stack (Quantized input -> JL -> Sparsify -> Neumann -> Push)

eps_q = 0.030, eps_jl = 0.020, eps_s = 0.020, eps_r = 0.020, eps_p = 0.010
Total = 0.100
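All of the pipeline allocations above come from one formula: scale each active component's base weight by eps_total over the sum of active weights. A slice-based sketch of that core (the HashMap-keyed API shown earlier is the real shape; this is just the arithmetic):

```rust
/// Proportionally reallocate eps_total across the active components,
/// given their base weights (e.g. [0.30, 0.10] for Quantization + Push).
fn reallocate(eps_total: f64, active_weights: &[f64]) -> Vec<f64> {
    let total: f64 = active_weights.iter().sum();
    active_weights.iter().map(|w| eps_total * w / total).collect()
}
```

The Forward Push pipeline above is reallocate(0.1, &[0.30, 0.10]), giving 0.075 for quantization and 0.025 for push.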

Precision Requirements by RuVector Use Case

| Use Case | Required eps | Recommended Algorithm | Justification |
|---|---|---|---|
| k-NN vector search | 0.1 | Forward Push + quantized dist | Top-k robust to 10% distance error |
| Spectral clustering | 0.05 | CG + diagonal preconditioner | Eigenvector sign determines partition |
| GNN attention weights | 0.01 | CG or Neumann | Softmax amplifies small errors |
| Optimal transport plan | 0.001 | CG (high precision) | Marginal constraints are strict |
| Min-cut value | 0.01 | Sparsification + exact | Cut value used for structural decisions |
| Natural gradient (FIM) | 0.1 | Diagonal approx or CG | FIM is inherently ill-conditioned |

Post-Solve Error Verification

pub struct ErrorAudit {
    pub component: ErrorComponent,
    pub budget: f64,
    pub actual: f64,
    pub within_budget: bool,
}

pub fn audit_error_budget(
    budgets: &HashMap<ErrorComponent, f64>,
    actuals: &HashMap<ErrorComponent, f64>,
) -> Vec<ErrorAudit> {
    budgets.iter().map(|(component, budget)| {
        let actual = actuals.get(component).copied().unwrap_or(0.0);
        ErrorAudit {
            component: *component,
            budget: *budget,
            actual,
            within_budget: actual <= budget * 1.1, // 10% slack
        }
    }).collect()
}

If any component exceeds its budget by more than 10%, the router logs a warning and the SONA feedback loop penalizes the algorithm selection for that problem profile.


Consequences

Positive

  1. Automatic optimization: Consumers (ruvector-math, ruvector-graph, ruvector-attention, ruvector-mincut, ruvector-gnn) call a single route() function instead of manually selecting algorithms. This eliminates duplicated selection logic across 10+ crates.

  2. Platform safety: Compile-time Tier 1 rules make it impossible to select memory-exceeding algorithms on WASM targets. Prevents runtime OOM in browsers.

  3. Quantified crossover points: The crossover analysis provides concrete thresholds (kappa < 4 for Neumann, kappa > 25 for BMSSP, n >= 100K + batch for TRUE) validated against benchmark-calibrated constant factors.

  4. Error budget composability: Adaptive error allocation ensures multi-algorithm pipelines maintain end-to-end accuracy guarantees without manual per-component tuning.

  5. Continuous improvement: SONA adaptive learning improves routing over time. EWC prevents catastrophic forgetting during workload shifts.

  6. Latency predictability: Tier 2 heuristic executes in <1 ms with zero heap allocation, making routing overhead negligible relative to solve times (10 us - 10 ms).

  7. Batch optimization: TRUE amortization analysis provides clear break-even criteria (num_rhs >= max(10, n/1000), eps >= 0.05) for when expensive preprocessing is justified.

Negative

  1. Implementation complexity: Three routing tiers add code. SONA adaptive tier requires training data collection, EWC checkpoint management, and model validation.

  2. Threshold brittleness: Tier 2 thresholds (kappa < 4, kappa > 25, n >= 100K) are calibrated on AVX2 at c_spmv = 2.5 ns. NEON and WASM have different constants, requiring per-platform threshold tuning or dynamic calibration.

  3. Condition number estimation cost: Several decisions depend on kappa_estimate. If the caller does not provide it, the router must use a default (risking suboptimal selection) or estimate it at O(40 * nnz) cost (~100 us for nnz = 10^6). See Appendix B.

  4. Error budget conservatism: The additive error model is conservative; in practice errors may partially cancel. This means allocated budgets are slightly tighter than necessary, leading to marginally more computation.

  5. SONA cold start: ~10,000 outcomes needed before adaptive overrides are reliable. During cold start, the system operates as a two-tier (static + heuristic) router.

  6. Testing surface: 7 algorithms x 8 query types x 5 platforms = 280 configurations. Exhaustive testing is infeasible; sampling-based CI validation is required.

Neutral

  1. Fallback degradation: When the preferred algorithm is excluded by Tier 1, the router selects the best available alternative. This degrades gracefully but may produce unexpected performance characteristics for platform-constrained targets.

  2. Chebyshev path preserved: The router returns NoRoute for polynomial spectral filters, preserving the existing ruvector-math Chebyshev infrastructure unchanged.

  3. SONA memory: < 1 KB total (280 bytes weights + 280 bytes Fisher diagonal). Negligible.


Options Considered

Option 1: Single-Tier Static Dispatch (Rejected)

Map each RuVector subsystem to a fixed algorithm at compile time:

  • ruvector-graph -> Forward Push
  • ruvector-attention -> CG
  • ruvector-math/spectral -> Neumann
  • ruvector-mincut -> BMSSP

Pros:

  • Zero runtime overhead.
  • Simple implementation, easy to test.

Cons:

  • No adaptation to problem characteristics. A 100-node graph gets the same algorithm as a 10M-node graph.
  • No error budget management across composed components.
  • Each subsystem locked to one algorithm regardless of query type.

Rejected: The problem-characteristic space is too varied (n from 100 to 10M, kappa from 1 to 10^6). A single algorithm cannot be optimal across this range.

Option 2: Per-Call Manual Selection (Rejected)

Expose all seven algorithms directly; each caller selects explicitly:

solver.solve_with(Algorithm::ConjugateGradient, input, eps)

Pros:

  • Maximum flexibility. No routing overhead.

Cons:

  • Duplicates selection logic across every call site (10+ crates).
  • Requires every caller to understand crossover analysis and numerical tradeoffs.
  • No centralized error budget management.

Rejected: Violates DRY. Algorithm selection expertise should be centralized.

Option 3: Two-Tier Without Adaptive Learning (Accepted as Phase 1)

Implement only Tier 1 (static rules) and Tier 2 (heuristic router), deferring SONA.

Pros:

  • Simpler implementation. Fully deterministic. No cold-start problem.
  • Easier to debug and validate.

Cons:

  • Cannot adapt to hardware-specific performance characteristics.
  • Cannot improve as workload patterns emerge.
  • Heuristic thresholds may become stale as hardware evolves.

Accepted as Phase 1: The two-tier system is the initial implementation. SONA Tier 3 is added in Phase 2 after the heuristic router is validated in production. This ADR documents all three tiers as the target architecture.
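The two accepted tiers compose as a filter-then-choose pipeline. A sketch under stated assumptions (types, names, and the reduced algorithm set are illustrative; thresholds are the Tier 2 values from this ADR, and the real router covers more algorithms and query types):

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum Algorithm { Neumann, ConjugateGradient, Bmssp, True }

struct Query { n: usize, kappa: f64, num_rhs: usize, eps: f64 }

/// Phase 1 routing: Tier 1 platform rules exclude candidates, then Tier 2
/// heuristics choose among the survivors. Tier 3 (SONA) would sit in front
/// of Tier 2 and override it on high-confidence predictions.
fn route(q: &Query, wasm_browser: bool) -> Algorithm {
    // Tier 1: TRUE's preprocessing exceeds browser memory budgets.
    let true_allowed = !wasm_browser;
    // Tier 2: thresholds from the crossover analysis.
    if true_allowed && q.n >= 100_000 && q.num_rhs >= 10 && q.eps >= 0.05 {
        Algorithm::True
    } else if q.kappa < 4.0 {
        Algorithm::Neumann
    } else if q.kappa > 25.0 && q.n >= 50_000 {
        Algorithm::Bmssp
    } else {
        Algorithm::ConjugateGradient
    }
}
```

Note how the Tier 1 exclusion degrades gracefully: the same large batch query that routes to TRUE natively falls through to BMSSP in a browser.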


Compliance

  • ADR-STS-001: Routing integrates within the SolverEngine trait hierarchy
  • ADR-STS-007: Feature flags control per-platform algorithm availability
  • ADR-STS-008: Fallback chain (sublinear -> CG -> dense) triggered by routing failures
  • ADR-STS-009: Parallel dispatch of solver operations via Rayon (feature-gated)
  • ADR-STS-010: Router exposed through the SolverEngine API surface

Related ADRs:

  • ADR-STS-001: Core Integration Architecture (trait hierarchy, crate structure)
  • ADR-002: Modular DDD Architecture (bounded context separation)
  • ADR-004: MCP Transport Optimization (solver routing exposed via MCP tools)
  • ADR-006: Unified Memory Service (SONA model + cache stored in memory service)
  • ADR-008: Neural Learning Integration (SONA framework for Tier 3)
  • ADR-009: Hybrid Memory Backend (HNSW search for similar routing queries in SONA)
  • ADR-026: 3-Tier Model Routing (solver tiers mirror agent model tiers)

References

  1. /home/user/ruvector/docs/research/sublinear-time-solver/10-algorithm-analysis.md -- Full mathematical analysis of all seven algorithms, convergence guarantees, error bounds, and RuVector use-case mappings.

  2. /home/user/ruvector/docs/research/sublinear-time-solver/08-performance-analysis.md -- Benchmark infrastructure, SIMD acceleration, memory efficiency, crossover projections.

  3. /home/user/ruvector/docs/research/sublinear-time-solver/05-architecture-analysis.md -- Layered integration strategy, module boundaries, event-driven patterns.

  4. Andersen, R., Chung, F., Lang, K. (2006). "Local Graph Partitioning using PageRank Vectors." FOCS 2006.

  5. Lofgren, P., Banerjee, S., Goel, A., Seshadhri, C. (2014). "FAST-PPR: Scaling Personalized PageRank Estimation for Large Graphs." KDD 2014.

  6. Spielman, D., Teng, S.-H. (2014). "Nearly Linear Time Algorithms for Preconditioning and Solving Symmetric, Diagonally Dominant Linear Systems." SIAM J. Matrix Anal. Appl.

  7. Koutis, I., Miller, G.L., Peng, R. (2011). "A Nearly-m log n Time Solver for SDD Linear Systems." FOCS 2011.

  8. Hestenes, M.R., Stiefel, E. (1952). "Methods of Conjugate Gradients for Solving Linear Systems." J. Res. Nat. Bur. Standards.

  9. Johnson, W.B., Lindenstrauss, J. (1984). "Extensions of Lipschitz mappings into a Hilbert space." Contemporary Mathematics.

  10. Kirkpatrick, J., et al. (2017). "Overcoming catastrophic forgetting in neural networks." PNAS.


Implementation Status

The algorithm router is implemented with the crossover analysis above: Neumann for diagonally dominant systems (fastest when well-conditioned), CG as the gold-standard SPD fallback, Forward/Backward Push for PageRank-style queries, TRUE for large-scale Laplacian batches, and BMSSP for multigrid hierarchies. The router characterizes each matrix (size, density, diagonal dominance, symmetry) and selects an algorithm automatically.


Appendix A: Router Configuration Schema

[solver.router]
# Tier 1: Platform (auto-detected, overridable)
platform = "auto"  # "native", "wasm-browser", "wasm-edge", "auto"

# Tier 2: Heuristic thresholds
[solver.router.thresholds]
neumann_max_kappa = 4.0
neumann_min_diagonal_dominance = 0.5
bmssp_min_kappa = 25.0
bmssp_min_n = 50_000
true_min_n = 100_000
true_min_batch_size = 10
true_min_eps = 0.05
hybrid_rw_min_n = 1_000
cg_default_preconditioner = "diagonal"
spectral_cluster_bmssp_min_n = 10_000
spectral_cluster_true_min_n = 100_000
neumann_filter_min_alpha = 0.01
small_n_threshold = 500

# Tier 3: SONA adaptive learning
[solver.router.sona]
enabled = false           # Enable after Phase 1 validation
confidence_threshold = 0.8
ewc_lambda = 100.0
ewc_checkpoint_interval = 1000
learning_rate = 0.001
feature_dim = 10
cold_start_outcomes = 10_000
exploration_rate_initial = 0.1
exploration_decay = 1000.0
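The thresholds table above can be mirrored as a typed struct whose defaults match the TOML values. A sketch (field names follow the keys above, but this is not necessarily the shipped configuration type):

```rust
/// Typed mirror of [solver.router.thresholds] with the TOML defaults.
#[derive(Debug, Clone)]
struct RouterThresholds {
    neumann_max_kappa: f64,
    neumann_min_diagonal_dominance: f64,
    bmssp_min_kappa: f64,
    bmssp_min_n: usize,
    true_min_n: usize,
    true_min_batch_size: usize,
    true_min_eps: f64,
    hybrid_rw_min_n: usize,
    cg_default_preconditioner: &'static str,
    spectral_cluster_bmssp_min_n: usize,
    spectral_cluster_true_min_n: usize,
    neumann_filter_min_alpha: f64,
    small_n_threshold: usize,
}

impl Default for RouterThresholds {
    fn default() -> Self {
        Self {
            neumann_max_kappa: 4.0,
            neumann_min_diagonal_dominance: 0.5,
            bmssp_min_kappa: 25.0,
            bmssp_min_n: 50_000,
            true_min_n: 100_000,
            true_min_batch_size: 10,
            true_min_eps: 0.05,
            hybrid_rw_min_n: 1_000,
            cg_default_preconditioner: "diagonal",
            spectral_cluster_bmssp_min_n: 10_000,
            spectral_cluster_true_min_n: 100_000,
            neumann_filter_min_alpha: 0.01,
            small_n_threshold: 500,
        }
    }
}
```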

Appendix B: Condition Number Estimation

When kappa_estimate is not provided, the router estimates it using power iteration:

Algorithm: Estimate kappa(A) for SPD matrix A

1. lambda_max via 20 power iterations:
   v = random unit vector
   for i in 1..20:
     v = A * v / ||A * v||
   lambda_max_est = v^T * A * v

2. lambda_min via a trace-based estimate (cheaper than shifted inverse iteration):
   lambda_min_est = trace(A)/n - sqrt( trace(A^2)/n - (trace(A)/n)^2 )
   (mean eigenvalue minus the eigenvalue standard deviation, a heuristic
   lower estimate; for symmetric A, trace(A^2) = sum of squared row norms,
   one O(nnz) pass)

3. kappa_est = lambda_max_est / max(lambda_min_est, 1e-15)

Cost: ~22 SpMV-equivalent passes = O(22 * nnz).
At nnz = 10^6, c_spmv = 2.5 ns: ~55 ms, a one-time cost amortized by the fingerprint cache below.
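The recipe can be checked on a small dense SPD matrix. A sketch (dense matvec for clarity; the router's version runs over the sparse SpMV kernel):

```rust
/// Appendix B recipe on a small dense symmetric matrix (illustrative).
fn estimate_kappa(a: &[Vec<f64>]) -> f64 {
    let n = a.len();
    let matvec = |v: &[f64]| -> Vec<f64> {
        a.iter()
            .map(|row| row.iter().zip(v).map(|(x, y)| x * y).sum())
            .collect()
    };
    // Step 1: lambda_max via 20 power iterations.
    let mut v = vec![1.0 / (n as f64).sqrt(); n];
    for _ in 0..20 {
        let av = matvec(&v);
        let norm = av.iter().map(|x| x * x).sum::<f64>().sqrt();
        v = av.iter().map(|x| x / norm).collect();
    }
    let av = matvec(&v);
    let lambda_max: f64 = v.iter().zip(&av).map(|(x, y)| x * y).sum();
    // Step 2: trace-based heuristic for lambda_min (mean eigenvalue minus
    // eigenvalue std dev; trace(A^2) = squared Frobenius norm for symmetric A).
    let tr: f64 = (0..n).map(|i| a[i][i]).sum();
    let tr_sq: f64 = a.iter().flatten().map(|x| x * x).sum();
    let mean = tr / n as f64;
    let lambda_min = mean - (tr_sq / n as f64 - mean * mean).max(0.0).sqrt();
    // Step 3: guard against near-zero or negative estimates.
    lambda_max / lambda_min.max(1e-15)
}
```

Because the lambda_min term is a heuristic, the result can under- or overestimate the true kappa; it is accurate enough for routing thresholds, not for numerical analysis.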

The router caches kappa estimates per matrix fingerprint (hash of dimensions + first 64 nonzero values) to avoid recomputation on repeated calls with the same matrix.
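The fingerprint cache described above can be sketched with std types (the hash function and cache API are illustrative; DefaultHasher stands in for whatever hash the router actually uses):

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

/// Fingerprint = hash of dimensions plus the first 64 nonzero values.
/// f64 does not implement Hash, so we hash its bit pattern.
fn matrix_fingerprint(rows: usize, cols: usize, nonzeros: &[f64]) -> u64 {
    let mut h = DefaultHasher::new();
    (rows, cols).hash(&mut h);
    for v in nonzeros.iter().take(64) {
        v.to_bits().hash(&mut h);
    }
    h.finish()
}

/// Cache of kappa estimates keyed by fingerprint.
#[derive(Default)]
struct KappaCache {
    map: HashMap<u64, f64>,
}

impl KappaCache {
    /// Returns the cached estimate, computing it at most once per fingerprint.
    fn get_or_estimate(&mut self, fp: u64, estimate: impl FnOnce() -> f64) -> f64 {
        *self.map.entry(fp).or_insert_with(estimate)
    }
}
```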

Appendix C: Platform Detection

pub fn detect_platform() -> Platform {
    #[cfg(target_arch = "wasm32")]
    {
        // Linear memory 0, in 64 KiB pages. memory_size takes the memory
        // index as a const generic parameter.
        let pages = core::arch::wasm32::memory_size::<0>();
        let bytes = pages * 65536;
        if bytes < 32 * 1024 * 1024 {
            Platform::WasmBrowser
        } else {
            Platform::WasmEdge
        }
    }

    #[cfg(all(not(target_arch = "wasm32"), target_arch = "x86_64"))]
    {
        // Runtime CPU feature detection (requires std).
        if is_x86_feature_detected!("avx512f") {
            Platform::NativeAVX512
        } else if is_x86_feature_detected!("avx2") {
            Platform::NativeAVX2
        } else {
            Platform::NativeScalar
        }
    }

    #[cfg(all(not(target_arch = "wasm32"), target_arch = "aarch64"))]
    {
        // NEON is mandatory on AArch64; no runtime detection needed.
        Platform::NativeNEON
    }

    #[cfg(not(any(target_arch = "wasm32", target_arch = "x86_64", target_arch = "aarch64")))]
    {
        // Conservative fallback for any other architecture.
        Platform::NativeScalar
    }
}

Appendix D: Notation Reference

| Symbol | Meaning |
|--------|---------|
| n | Matrix dimension or graph vertex count |
| m | Edge count (m = nnz/2 for symmetric graphs) |
| nnz | Number of nonzero entries in sparse matrix |
| d | Vector dimensionality |
| kappa | Condition number: lambda_max / lambda_min |
| eps | Target approximation accuracy |
| rho | Spectral radius of iteration matrix D^{-1}B |
| k | Number of iterations, clusters, or neighbors (context-dependent) |
| alpha | PPR teleportation probability (typically 0.15) or filter parameter |
| vol(G) | Volume of graph: sum of all vertex degrees = 2m |
| L | Graph Laplacian: L = D - A |
| D | Degree diagonal matrix |
| A | Adjacency matrix |
| sigma | Multigrid V-cycle convergence factor (< 1 for convergence) |
| delta | Failure probability for probabilistic algorithms |
| C_mg | Multigrid cycle overhead constant (~5 for typical AMG) |
| c_spmv | Nanoseconds per nonzero for sparse matrix-vector multiply |
| c_push | Nanoseconds per push operation in Forward/Backward Push |
| c_walk | Nanoseconds per random walk step (includes PRNG) |
| c_jl | Nanoseconds per f32 multiply in JL projection |
| F_i | Fisher Information Matrix diagonal entry (EWC) |
| W | SONA routing weight matrix, shape [features x algorithms] |
| B | Batch size (number of right-hand sides for batch solves) |