wifi-densepose/crates/ruvector-bench/docs/BENCHMARKS.md
ruv d803bfe2b1 Squashed 'vendor/ruvector/' content from commit b64c2172
git-subtree-dir: vendor/ruvector
git-subtree-split: b64c21726f2bb37286d9ee36a7869fef60cc6900
2026-02-28 14:39:40 -05:00

Ruvector Benchmark Suite Documentation

Comprehensive benchmarking tools for measuring and analyzing Ruvector's performance across various workloads and configurations.

Table of Contents

  1. Overview
  2. Installation
  3. Benchmark Tools
  4. Quick Start
  5. Detailed Usage
  6. Understanding Results
  7. Performance Targets
  8. Troubleshooting

Overview

The Ruvector benchmark suite provides:

  • ANN-Benchmarks Compatibility: Standard SIFT1M, GIST1M, Deep1M testing
  • AgenticDB Workloads: Reflexion episodes, skill libraries, causal graphs
  • Latency Analysis: p50, p95, p99, p99.9 percentile measurements
  • Memory Profiling: Usage at various scales with quantization effects
  • System Comparison: Ruvector vs other implementations
  • Performance Profiling: CPU flamegraphs and hotspot analysis

Installation

Prerequisites

# Install Rust (if not already installed)
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh

# Optional: HDF5 for loading real ANN benchmark datasets
# Ubuntu/Debian
sudo apt-get install libhdf5-dev

# macOS
brew install hdf5

# Optional: Profiling tools
sudo apt-get install linux-perf  # Debian package name; on Ubuntu, perf ships in the linux-tools-* packages

Build Benchmarks

cd crates/ruvector-bench

# Standard build
cargo build --release

# With profiling support
cargo build --release --features profiling

# With HDF5 dataset support
cargo build --release --features hdf5-datasets

Benchmark Tools

1. ANN Benchmark (ann-benchmark)

Tests standard ANN benchmark datasets with configurable HNSW parameters.

Features:

  • SIFT1M (128D, 1M vectors)
  • GIST1M (960D, 1M vectors)
  • Deep1M (96D, 1M vectors)
  • Synthetic dataset generation
  • Recall-QPS curves at 90%, 95%, 99%
  • Multiple ef_search values

2. AgenticDB Benchmark (agenticdb-benchmark)

Simulates agentic AI workloads.

Workloads:

  • Reflexion episode storage/retrieval
  • Skill library search
  • Causal graph queries
  • Learning session throughput (mixed read/write)

3. Latency Benchmark (latency-benchmark)

Measures detailed latency characteristics.

Tests:

  • Single-threaded latency
  • Multi-threaded latency (configurable thread counts)
  • Effect of ef_search on latency
  • Effect of quantization on latency/recall tradeoff

4. Memory Benchmark (memory-benchmark)

Profiles memory usage at scale.

Tests:

  • Memory at 10K, 100K, 1M vectors
  • Effect of quantization (none, scalar, binary)
  • Index overhead analysis
  • Memory per vector calculation

5. Comparison Benchmark (comparison-benchmark)

Compares Ruvector against other systems.

Comparisons:

  • Ruvector (optimized)
  • Ruvector (no quantization)
  • Simulated Python baseline
  • Simulated brute-force search

6. Profiling Benchmark (profiling-benchmark)

Generates performance profiles.

Outputs:

  • CPU flamegraphs (SVG)
  • Profiling reports
  • Hotspot identification
  • SIMD utilization analysis

Quick Start

Run All Benchmarks

# Full benchmark suite
./scripts/run_all_benchmarks.sh

# Quick mode (smaller datasets)
./scripts/run_all_benchmarks.sh --quick

# With profiling
./scripts/run_all_benchmarks.sh --profile

Run Individual Benchmarks

# ANN benchmarks
cargo run --release --bin ann-benchmark -- \
    --dataset synthetic \
    --num-vectors 100000 \
    --queries 1000

# AgenticDB workloads
cargo run --release --bin agenticdb-benchmark -- \
    --episodes 10000 \
    --queries 500

# Latency profiling
cargo run --release --bin latency-benchmark -- \
    --num-vectors 50000 \
    --threads "1,4,8,16"

# Memory profiling
cargo run --release --bin memory-benchmark -- \
    --scales "1000,10000,100000"

# System comparison
cargo run --release --bin comparison-benchmark -- \
    --num-vectors 50000

# Performance profiling
cargo run --release --features profiling --bin profiling-benchmark -- \
    --flamegraph

Detailed Usage

ANN Benchmark Options

cargo run --release --bin ann-benchmark -- --help

Options:
  -d, --dataset <DATASET>              Dataset: sift1m, gist1m, deep1m, synthetic [default: synthetic]
  -n, --num-vectors <NUM_VECTORS>      Number of vectors [default: 100000]
  -q, --queries <NUM_QUERIES>          Number of queries [default: 1000]
  -d, --dimensions <DIMENSIONS>        Vector dimensions [default: 128]
  -k, --k <K>                          K nearest neighbors [default: 10]
  -m, --m <M>                          HNSW M parameter [default: 32]
      --ef-construction <VALUE>        HNSW ef_construction [default: 200]
      --ef-search-values <VALUES>      HNSW ef_search values (comma-separated) [default: 50,100,200,400]
  -o, --output <OUTPUT>                Output directory [default: bench_results]
      --metric <METRIC>                Distance metric [default: cosine]
      --quantization <QUANT>           Quantization: none, scalar, binary [default: scalar]

AgenticDB Benchmark Options

cargo run --release --bin agenticdb-benchmark -- --help

Options:
      --episodes <EPISODES>    Number of episodes [default: 10000]
      --skills <SKILLS>        Number of skills [default: 1000]
  -q, --queries <QUERIES>      Number of queries [default: 500]
  -o, --output <OUTPUT>        Output directory [default: bench_results]

Latency Benchmark Options

cargo run --release --bin latency-benchmark -- --help

Options:
  -n, --num-vectors <NUM_VECTORS>    Number of vectors [default: 50000]
  -q, --queries <QUERIES>            Number of queries [default: 1000]
  -d, --dimensions <DIMENSIONS>      Vector dimensions [default: 384]
  -t, --threads <THREADS>            Thread counts to test [default: 1,4,8,16]
  -o, --output <OUTPUT>              Output directory [default: bench_results]

Understanding Results

Output Files

Each benchmark generates three output files:

  1. JSON ({benchmark}_benchmark.json): Raw data for programmatic analysis
  2. CSV ({benchmark}_benchmark.csv): Tabular data for spreadsheet analysis
  3. Markdown ({benchmark}_benchmark.md): Human-readable report

Key Metrics

QPS (Queries Per Second)

  • Higher is better
  • Measures throughput
  • Target: >10,000 QPS for 100K vectors
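
As a minimal sketch of how a throughput figure like this is derived, QPS is the number of completed queries divided by elapsed wall-clock time. The helper name `qps` is illustrative only and is not taken from the ruvector-bench source.

```rust
use std::time::Duration;

/// Queries per second: completed queries divided by elapsed wall-clock time.
/// `qps` is a hypothetical helper, not a ruvector-bench function.
fn qps(num_queries: usize, elapsed: Duration) -> f64 {
    let secs = elapsed.as_secs_f64();
    if secs == 0.0 {
        return 0.0; // avoid dividing by zero on an empty or instantaneous run
    }
    num_queries as f64 / secs
}
```

For example, 1,000 queries completed in 100 ms corresponds to roughly 10,000 QPS, which is the target quoted above.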

Latency Percentiles

  • p50: Median latency (typical user experience)
  • p95: 95th percentile (95% of queries finish at or below this latency)
  • p99: 99th percentile (tail latency; only 1 in 100 queries is slower)
  • p99.9: 99.9th percentile (extreme outliers)
  • Lower is better
  • Target: <5ms p99 for 100K vectors
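
The percentiles above can be computed from raw per-query latencies in several ways. The sketch below uses the nearest-rank method, which is an assumption: the benchmark tools may use a different estimator (for example a histogram), and the function name `percentile` is hypothetical.

```rust
/// Nearest-rank percentile over latency samples in milliseconds.
/// `p` is expected in the range (0, 100]; returns None for an empty slice.
/// Hypothetical helper, not part of ruvector-bench.
fn percentile(samples: &[f64], p: f64) -> Option<f64> {
    if samples.is_empty() {
        return None;
    }
    let mut sorted = samples.to_vec();
    sorted.sort_by(|a, b| a.total_cmp(b));
    // Nearest rank: the smallest sample with at least p% of samples at or below it.
    let rank = ((p / 100.0) * sorted.len() as f64).ceil() as usize;
    let idx = rank.clamp(1, sorted.len()) - 1;
    Some(sorted[idx])
}
```

With few samples the high percentiles collapse onto the maximum, which is one reason p99.9 is only meaningful when the query count is in the thousands.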

Recall

  • Recall@1: Percentage of times the true nearest neighbor is found
  • Recall@10: Percentage of true top-10 neighbors found
  • Recall@100: Percentage of true top-100 neighbors found
  • Higher is better
  • Target: >95% recall@10
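
As an illustration of the recall definition, the sketch below compares the IDs returned by an approximate search against exact ground-truth IDs. The function name and the use of `u64` IDs are assumptions made for the example, not details of the Ruvector API.

```rust
use std::collections::HashSet;

/// Recall@k: fraction of the true top-k neighbor IDs that appear in the
/// returned top-k. Both slices are assumed to be ordered best-first.
/// Hypothetical helper, not part of ruvector-bench.
fn recall_at_k(ground_truth: &[u64], returned: &[u64], k: usize) -> f64 {
    let truth: HashSet<u64> = ground_truth.iter().take(k).copied().collect();
    if truth.is_empty() {
        return 0.0;
    }
    let hits = returned
        .iter()
        .take(k)
        .filter(|id| truth.contains(*id))
        .count();
    hits as f64 / truth.len() as f64
}
```

A benchmark would average this value over all queries; finding 9 of the 10 true neighbors for a query gives a recall@10 of 0.9 for that query.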

Memory

  • Total memory usage in MB
  • Memory per vector in KB
  • Compression ratio with quantization
  • Target: <2KB per vector with quantization
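
To make the compression ratio concrete, the sketch below estimates the raw payload size of one vector under each quantization mode. The per-dimension sizes (4 bytes for `f32`, 1 byte for scalar, 1 bit for binary) are the conventional ones and are assumed here rather than read from Ruvector's source; the type and function names are hypothetical, and HNSW graph overhead is not included.

```rust
/// Quantization modes named in this document. The per-dimension sizes are
/// conventional values assumed for this sketch.
#[derive(Clone, Copy)]
enum Quantization {
    None,   // 4 bytes per dimension (f32)
    Scalar, // 1 byte per dimension (u8)
    Binary, // 1 bit per dimension
}

/// Bytes needed for one vector's payload, excluding index overhead.
fn vector_bytes(dimensions: usize, q: Quantization) -> usize {
    match q {
        Quantization::None => dimensions * 4,
        Quantization::Scalar => dimensions,
        Quantization::Binary => (dimensions + 7) / 8, // round up to whole bytes
    }
}

/// Compression ratio relative to unquantized f32 storage.
fn compression_ratio(dimensions: usize, q: Quantization) -> f64 {
    vector_bytes(dimensions, Quantization::None) as f64 / vector_bytes(dimensions, q) as f64
}
```

Under these assumptions a 384-dimensional vector takes 1,536 bytes unquantized, 384 bytes with scalar quantization (4x), and 48 bytes with binary quantization (32x). The measured memory per vector will be higher because it also includes the index structure.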

Reading Benchmark Reports

Example output interpretation:

ef_search  QPS    p50 (ms)  p99 (ms)  Recall@10  Memory (MB)
50         15234  0.05      0.12      92.5%      156.2
100        12456  0.06      0.15      96.8%      156.2
200        8932   0.08      0.20      98.9%      156.2

Analysis:

  • Increasing ef_search improves recall but reduces QPS
  • ef_search=100 offers good balance (96.8% recall, 12K QPS)
  • Memory usage constant across ef_search values
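
The trade-off described in this analysis can be turned into a simple selection rule: among the rows that meet a recall floor, take the one with the highest QPS. The sketch below applies that rule to the three example rows; the `Row` struct and function names are hypothetical and do not reflect the actual report schema.

```rust
/// One row of a benchmark report, mirroring the example table's columns.
/// Hypothetical type for illustration only.
struct Row {
    ef_search: usize,
    qps: f64,
    recall_at_10: f64, // fraction in [0, 1]
}

/// Return the highest-QPS row whose recall meets `min_recall`, if any.
fn pick_ef_search(rows: &[Row], min_recall: f64) -> Option<&Row> {
    rows.iter()
        .filter(|r| r.recall_at_10 >= min_recall)
        .max_by(|a, b| a.qps.total_cmp(&b.qps))
}

/// The three rows from the example report.
fn example_rows() -> Vec<Row> {
    vec![
        Row { ef_search: 50, qps: 15234.0, recall_at_10: 0.925 },
        Row { ef_search: 100, qps: 12456.0, recall_at_10: 0.968 },
        Row { ef_search: 200, qps: 8932.0, recall_at_10: 0.989 },
    ]
}
```

With a recall floor of 0.95 this picks ef_search=100, matching the analysis; raising the floor to 0.98 picks ef_search=200 at lower throughput.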

Performance Targets

AgenticDB Replacement Goals

Ruvector targets a 10-100x improvement over AgenticDB in throughput and index build time, plus roughly 5x lower memory usage:

Metric               AgenticDB (Python)  Ruvector (Target)  Speedup
Reflexion Retrieval  ~100 QPS            >5,000 QPS         50x
Skill Search         ~50 QPS             >2,000 QPS         40x
Index Build Time     ~60s/10K            <5s/10K            12x
Memory Usage         ~500MB/100K         <100MB/100K        5x

ANN-Benchmarks Targets

Competitive with state-of-the-art implementations:

Dataset  Recall@10  QPS Target  Latency p99
SIFT1M   >95%       >10,000     <1ms
GIST1M   >95%       >5,000      <2ms
Deep1M   >95%       >15,000     <0.5ms

Advanced Topics

Profiling with Flamegraphs

Generate CPU flamegraphs to identify performance bottlenecks:

cargo run --release --features profiling --bin profiling-benchmark -- \
    --flamegraph \
    --output bench_results/profiling

# View flamegraph
firefox bench_results/profiling/flamegraph.svg

Interpreting Flamegraphs:

  • Width = CPU time spent
  • Height = call stack depth
  • Look for wide plateaus (hotspots)
  • Focus optimization on the widest frames first, since they account for the most CPU time

Custom Benchmark Scenarios

Create custom benchmarks by modifying the tools:

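// Example: custom dimension sweep.
// `bench_custom` is a placeholder for your own benchmark function and
// is assumed to return a Result.
let dimensions = [64, 128, 256, 512, 768, 1024];
let mut results = Vec::new();
for dim in dimensions {
    let result = bench_custom(dim)?;
    results.push(result);
}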
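
The snippet above does not define `bench_custom`. One possible shape for it is sketched below, under stated assumptions: it uses only the standard library, generates deterministic synthetic vectors with a small linear congruential generator, and times a brute-force scan as a stand-in for a real index query. `DimResult`, `synthetic_vectors`, and `dist_sq` are hypothetical names; a real benchmark would call into Ruvector instead.

```rust
use std::time::Instant;

/// Result of one run. Hypothetical type, not part of ruvector-bench.
struct DimResult {
    dimensions: usize,
    mean_latency_us: f64,
}

/// Deterministic pseudo-random vectors from a small LCG, so the sketch
/// needs no external crates.
fn synthetic_vectors(count: usize, dim: usize, mut seed: u64) -> Vec<Vec<f32>> {
    let mut vectors = Vec::with_capacity(count);
    for _ in 0..count {
        let mut v = Vec::with_capacity(dim);
        for _ in 0..dim {
            seed = seed
                .wrapping_mul(6364136223846793005)
                .wrapping_add(1442695040888963407);
            v.push((seed >> 40) as f32 / (1u64 << 24) as f32); // value in [0, 1)
        }
        vectors.push(v);
    }
    vectors
}

/// Squared Euclidean distance between two equal-length vectors.
fn dist_sq(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| (x - y) * (x - y)).sum()
}

/// Time a brute-force nearest-neighbor scan at `dim` dimensions.
/// The scan stands in for a real index query; swap in your own search call.
fn bench_custom(dim: usize) -> Result<DimResult, String> {
    if dim == 0 {
        return Err("dimensions must be non-zero".to_string());
    }
    let data = synthetic_vectors(1_000, dim, 42);
    let queries = synthetic_vectors(20, dim, 7);
    let start = Instant::now();
    for q in &queries {
        let nearest = data
            .iter()
            .map(|v| dist_sq(q, v))
            .fold(f32::INFINITY, f32::min);
        std::hint::black_box(nearest); // keep the optimizer from removing the scan
    }
    let mean_latency_us = start.elapsed().as_secs_f64() * 1e6 / queries.len() as f64;
    Ok(DimResult { dimensions: dim, mean_latency_us })
}
```

Fixed seeds keep the input identical across runs, so differences between dimensions reflect the workload rather than the data.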

Continuous Benchmarking

Integrate with CI/CD:

# .github/workflows/benchmark.yml
name: Benchmarks
on: [push]
jobs:
  benchmark:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run benchmarks
        run: |
          cd crates/ruvector-bench
          ./scripts/run_all_benchmarks.sh --quick
      - name: Upload results
        uses: actions/upload-artifact@v4
        with:
          name: benchmark-results
          path: crates/ruvector-bench/bench_results/

Troubleshooting

Common Issues

"HDF5 not found"

# Install HDF5 development libraries
sudo apt-get install libhdf5-dev  # Ubuntu/Debian
brew install hdf5                 # macOS

# Or build without HDF5 support
cargo build --release --no-default-features

"Out of memory"

# Reduce dataset size
cargo run --release --bin ann-benchmark -- --num-vectors 10000

# Or use quick mode
./scripts/run_all_benchmarks.sh --quick

"Profiling not working"

# Ensure profiling feature is enabled
cargo build --release --features profiling

# Linux: May need perf permissions
echo -1 | sudo tee /proc/sys/kernel/perf_event_paranoid

"Benchmarks taking too long"

# Use quick mode
./scripts/run_all_benchmarks.sh --quick

# Or run individual benchmarks
cargo run --release --bin latency-benchmark -- --queries 100

Performance Debugging

If benchmarks show unexpectedly slow results:

  1. Check CPU governor:

    # Linux: Use performance mode
    sudo cpupower frequency-set -g performance
    
  2. Verify release build:

    cargo build --release  # Without --release, Cargo builds the much slower debug profile
    
  3. Check system load:

    htop  # Ensure no other heavy processes
    
  4. Review HNSW parameters:

    • Reduce ef_construction for faster indexing
    • Reduce ef_search for faster queries (at cost of recall)

Results Analysis

Comparing Runs

# Compare two benchmark runs
diff -u bench_results_old/ann_benchmark.csv bench_results_new/ann_benchmark.csv

# Plot results with Python
python3 scripts/plot_results.py bench_results/
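
A textual diff shows that numbers changed but not by how much. As a rough sketch, the relative change between two runs can be computed and checked against a tolerance. The function names and the idea of a percentage tolerance are assumptions for illustration; they are not part of the benchmark scripts.

```rust
/// Relative change from `old` to `new`, as a percentage.
/// Hypothetical helper, not part of ruvector-bench.
fn percent_change(old: f64, new: f64) -> f64 {
    (new - old) / old * 100.0
}

/// True when throughput dropped by more than `tolerance_pct` percent.
fn is_qps_regression(old_qps: f64, new_qps: f64, tolerance_pct: f64) -> bool {
    percent_change(old_qps, new_qps) < -tolerance_pct
}
```

For instance, a drop from 10,000 to 9,000 QPS is a 10% decrease and would be flagged at a 5% tolerance, while a drop to 9,800 QPS would not.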

Statistical Significance

For reliable benchmarks:

  • Run multiple iterations (3-5 times)
  • Use appropriate dataset sizes (>10K vectors)
  • Ensure consistent system load
  • Record system specs in metadata
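
When a benchmark is repeated several times, the spread between runs indicates how much to trust a single number. The sketch below computes the mean, the sample standard deviation, and the coefficient of variation for a set of repeated measurements. The helper names are hypothetical, and what counts as an acceptable spread is a judgment call rather than a rule from this suite.

```rust
/// Mean and sample standard deviation of repeated measurements.
/// Returns None when fewer than two runs are supplied.
/// Hypothetical helper, not part of ruvector-bench.
fn mean_and_stddev(runs: &[f64]) -> Option<(f64, f64)> {
    if runs.len() < 2 {
        return None;
    }
    let n = runs.len() as f64;
    let mean = runs.iter().sum::<f64>() / n;
    let variance = runs.iter().map(|x| (x - mean).powi(2)).sum::<f64>() / (n - 1.0);
    Some((mean, variance.sqrt()))
}

/// Coefficient of variation (stddev / mean) as a percentage.
fn cv_percent(runs: &[f64]) -> Option<f64> {
    mean_and_stddev(runs).map(|(mean, sd)| sd / mean * 100.0)
}
```

A difference between two configurations that is smaller than the run-to-run spread is hard to distinguish from noise, so it is worth collecting more runs before drawing conclusions from it.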

Contributing

To add new benchmarks:

  1. Create new binary in src/bin/
  2. Use ruvector_bench utilities
  3. Output results in standard format
  4. Update this documentation
  5. Add to run_all_benchmarks.sh

Support

For issues or questions:


Last updated: 2025-11-19