Ruvector Benchmark Suite Documentation
Comprehensive benchmarking tools for measuring and analyzing Ruvector's performance across various workloads and configurations.
Table of Contents
- Overview
- Installation
- Benchmark Tools
- Quick Start
- Detailed Usage
- Understanding Results
- Performance Targets
- Advanced Topics
- Troubleshooting
- Results Analysis
- Contributing
- Support
Overview
The Ruvector benchmark suite provides:
- ANN-Benchmarks Compatibility: Standard SIFT1M, GIST1M, Deep1M testing
- AgenticDB Workloads: Reflexion episodes, skill libraries, causal graphs
- Latency Analysis: p50, p95, p99, p99.9 percentile measurements
- Memory Profiling: Usage at various scales with quantization effects
- System Comparison: Ruvector vs other implementations
- Performance Profiling: CPU flamegraphs and hotspot analysis
Installation
Prerequisites
# Install Rust (if not already installed)
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
# Optional: HDF5 for loading real ANN benchmark datasets
# Ubuntu/Debian
sudo apt-get install libhdf5-dev
# macOS
brew install hdf5
# Optional: Profiling tools
sudo apt-get install linux-perf # Linux only
Build Benchmarks
cd crates/ruvector-bench
# Standard build
cargo build --release
# With profiling support
cargo build --release --features profiling
# With HDF5 dataset support
cargo build --release --features hdf5-datasets
Benchmark Tools
1. ANN Benchmark (ann-benchmark)
Tests standard ANN benchmark datasets with configurable HNSW parameters.
Features:
- SIFT1M (128D, 1M vectors)
- GIST1M (960D, 1M vectors)
- Deep1M (96D, 1M vectors)
- Synthetic dataset generation
- Recall-QPS curves at 90%, 95%, 99%
- Multiple ef_search values
2. AgenticDB Benchmark (agenticdb-benchmark)
Simulates agentic AI workloads.
Workloads:
- Reflexion episode storage/retrieval
- Skill library search
- Causal graph queries
- Learning session throughput (mixed read/write)
3. Latency Benchmark (latency-benchmark)
Measures detailed latency characteristics.
Tests:
- Single-threaded latency
- Multi-threaded latency (configurable thread counts)
- Effect of ef_search on latency
- Effect of quantization on latency/recall tradeoff
4. Memory Benchmark (memory-benchmark)
Profiles memory usage at scale.
Tests:
- Memory at 10K, 100K, 1M vectors
- Effect of quantization (none, scalar, binary)
- Index overhead analysis
- Memory per vector calculation
5. Comparison Benchmark (comparison-benchmark)
Compares Ruvector against other systems.
Comparisons:
- Ruvector (optimized)
- Ruvector (no quantization)
- Simulated Python baseline
- Simulated brute-force search
6. Profiling Benchmark (profiling-benchmark)
Generates performance profiles.
Outputs:
- CPU flamegraphs (SVG)
- Profiling reports
- Hotspot identification
- SIMD utilization analysis
Quick Start
Run All Benchmarks
# Full benchmark suite
./scripts/run_all_benchmarks.sh
# Quick mode (smaller datasets)
./scripts/run_all_benchmarks.sh --quick
# With profiling
./scripts/run_all_benchmarks.sh --profile
Run Individual Benchmarks
# ANN benchmarks
cargo run --release --bin ann-benchmark -- \
--dataset synthetic \
--num-vectors 100000 \
--queries 1000
# AgenticDB workloads
cargo run --release --bin agenticdb-benchmark -- \
--episodes 10000 \
--queries 500
# Latency profiling
cargo run --release --bin latency-benchmark -- \
--num-vectors 50000 \
--threads "1,4,8,16"
# Memory profiling
cargo run --release --bin memory-benchmark -- \
--scales "1000,10000,100000"
# System comparison
cargo run --release --bin comparison-benchmark -- \
--num-vectors 50000
# Performance profiling
cargo run --release --features profiling --bin profiling-benchmark -- \
--flamegraph
Detailed Usage
ANN Benchmark Options
cargo run --release --bin ann-benchmark -- --help
Options:
-d, --dataset <DATASET> Dataset: sift1m, gist1m, deep1m, synthetic [default: synthetic]
-n, --num-vectors <NUM_VECTORS> Number of vectors [default: 100000]
-q, --queries <NUM_QUERIES> Number of queries [default: 1000]
--dimensions <DIMENSIONS> Vector dimensions [default: 128]
-k, --k <K> K nearest neighbors [default: 10]
-m, --m <M> HNSW M parameter [default: 32]
--ef-construction <VALUE> HNSW ef_construction [default: 200]
--ef-search-values <VALUES> HNSW ef_search values (comma-separated) [default: 50,100,200,400]
-o, --output <OUTPUT> Output directory [default: bench_results]
--metric <METRIC> Distance metric [default: cosine]
--quantization <QUANT> Quantization: none, scalar, binary [default: scalar]
AgenticDB Benchmark Options
cargo run --release --bin agenticdb-benchmark -- --help
Options:
--episodes <EPISODES> Number of episodes [default: 10000]
--skills <SKILLS> Number of skills [default: 1000]
-q, --queries <QUERIES> Number of queries [default: 500]
-o, --output <OUTPUT> Output directory [default: bench_results]
Latency Benchmark Options
cargo run --release --bin latency-benchmark -- --help
Options:
-n, --num-vectors <NUM_VECTORS> Number of vectors [default: 50000]
-q, --queries <QUERIES> Number of queries [default: 1000]
-d, --dimensions <DIMENSIONS> Vector dimensions [default: 384]
-t, --threads <THREADS> Thread counts to test [default: 1,4,8,16]
-o, --output <OUTPUT> Output directory [default: bench_results]
Understanding Results
Output Files
Each benchmark generates three output files:
- JSON (`{benchmark}_benchmark.json`): Raw data for programmatic analysis
- CSV (`{benchmark}_benchmark.csv`): Tabular data for spreadsheet analysis
- Markdown (`{benchmark}_benchmark.md`): Human-readable report
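For programmatic analysis, the JSON file can be deserialized directly. A minimal sketch using `serde` and `serde_json`; the field names here are illustrative assumptions, so check the actual JSON output for the real schema:

```rust
use serde::Deserialize;

// Hypothetical record shape -- adjust the fields to match the real
// ann_benchmark.json schema before use.
#[derive(Debug, Deserialize)]
struct BenchResult {
    ef_search: usize,
    qps: f64,
    p99_ms: f64,
    recall_at_10: f64,
}

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let data = std::fs::read_to_string("bench_results/ann_benchmark.json")?;
    // Assumes the top-level JSON value is an array of result records.
    let results: Vec<BenchResult> = serde_json::from_str(&data)?;
    for r in &results {
        println!(
            "ef_search={} qps={:.0} p99={:.2}ms recall@10={:.1}%",
            r.ef_search, r.qps, r.p99_ms, r.recall_at_10 * 100.0
        );
    }
    Ok(())
}
```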
Key Metrics
QPS (Queries Per Second)
- Higher is better
- Measures throughput
- Target: >10,000 QPS for 100K vectors
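QPS is wall-clock throughput: completed queries divided by elapsed time. A minimal sketch of the measurement, assuming the index is already built and warmed up:

```rust
use std::time::Instant;

// Runs `num_queries` search closures and returns queries per second.
// A sketch: real harnesses also warm up caches and repeat the run.
fn measure_qps<F: FnMut()>(num_queries: usize, mut run_query: F) -> f64 {
    let start = Instant::now();
    for _ in 0..num_queries {
        run_query();
    }
    num_queries as f64 / start.elapsed().as_secs_f64()
}
```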
Latency Percentiles
- p50: Median latency (typical user experience)
- p95: 95th percentile (captures most outliers)
- p99: 99th percentile (worst-case for most users)
- p99.9: 99.9th percentile (extreme outliers)
- Lower is better
- Target: <5ms p99 for 100K vectors
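Percentiles are read from the sorted per-query latencies. A minimal nearest-rank sketch (the suite's exact interpolation method may differ):

```rust
// Nearest-rank percentile over latency samples in milliseconds.
// Assumes `samples` is non-empty and p is in (0, 100].
fn percentile(samples: &mut [f64], p: f64) -> f64 {
    samples.sort_by(|a, b| a.partial_cmp(b).unwrap());
    let rank = ((p / 100.0) * samples.len() as f64).ceil() as usize;
    samples[rank.saturating_sub(1).min(samples.len() - 1)]
}

// percentile(&mut latencies, 50.0) -> p50; 99.0 -> p99; 99.9 -> p99.9
```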
Recall
- Recall@1: Percentage of times the true nearest neighbor is found
- Recall@10: Percentage of true top-10 neighbors found
- Recall@100: Percentage of true top-100 neighbors found
- Higher is better
- Target: >95% recall@10
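Recall@k is computed against brute-force ground truth: the fraction of the true top-k IDs that appear in the returned top-k. A minimal sketch:

```rust
use std::collections::HashSet;

// Fraction of the true top-k neighbors found in the returned top-k.
// Assumes `ground_truth` holds at least k exact-neighbor IDs.
fn recall_at_k(returned: &[u64], ground_truth: &[u64], k: usize) -> f64 {
    let truth: HashSet<&u64> = ground_truth.iter().take(k).collect();
    let hits = returned
        .iter()
        .take(k)
        .filter(|id| truth.contains(*id))
        .count();
    hits as f64 / k as f64
}
```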
Memory
- Total memory usage in MB
- Memory per vector in KB
- Compression ratio with quantization
- Target: <2KB per vector with quantization
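Bytes per vector follow directly from dimensionality and quantization mode: a 128-dimensional f32 vector is 512 bytes raw, 128 bytes with scalar (u8) quantization (4x), and 16 bytes with binary quantization (32x), all before index overhead. A quick back-of-the-envelope sketch:

```rust
// Raw storage per vector for each quantization mode, excluding index
// overhead (graph links, metadata), which is why measured memory per
// vector is always somewhat higher.
fn bytes_per_vector(dims: usize, quantization: &str) -> f64 {
    match quantization {
        "none" => dims as f64 * 4.0,   // f32: 4 bytes per component
        "scalar" => dims as f64,       // u8: 1 byte per component
        "binary" => dims as f64 / 8.0, // 1 bit per component
        q => panic!("unknown quantization: {q}"),
    }
}

fn main() {
    let raw = bytes_per_vector(128, "none");
    for q in ["scalar", "binary"] {
        let b = bytes_per_vector(128, q);
        println!("{q}: {b} B/vector, {:.0}x compression", raw / b);
    }
}
```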
Reading Benchmark Reports
Example output interpretation:
| ef_search | QPS | p50 (ms) | p99 (ms) | Recall@10 | Memory (MB) |
|---|---|---|---|---|---|
| 50 | 15234 | 0.05 | 0.12 | 92.5% | 156.2 |
| 100 | 12456 | 0.06 | 0.15 | 96.8% | 156.2 |
| 200 | 8932 | 0.08 | 0.20 | 98.9% | 156.2 |
Analysis:
- Increasing ef_search improves recall but reduces QPS
- ef_search=100 offers a good balance (96.8% recall at ~12K QPS)
- Memory usage is constant across ef_search values, since ef_search only affects query-time traversal, not the index itself
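A common way to act on such a table is to pick the smallest ef_search that meets a recall target, which is also the highest-QPS point that is still accurate enough. A minimal sketch over (ef_search, QPS, recall) tuples:

```rust
// Lowest ef_search whose recall meets the target; returns None when no
// configuration is accurate enough.
fn pick_operating_point(
    points: &[(usize, f64, f64)], // (ef_search, qps, recall@10)
    min_recall: f64,
) -> Option<(usize, f64, f64)> {
    points
        .iter()
        .filter(|p| p.2 >= min_recall)
        .min_by_key(|p| p.0)
        .copied()
}

fn main() {
    let points = [
        (50, 15234.0, 0.925),
        (100, 12456.0, 0.968),
        (200, 8932.0, 0.989),
    ];
    // With a 95% recall target, ef_search=100 is selected.
    println!("{:?}", pick_operating_point(&points, 0.95));
}
```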
Performance Targets
AgenticDB Replacement Goals
Ruvector targets a 10-100x performance improvement over AgenticDB, with per-workload goals as follows:
| Metric | AgenticDB (Python) | Ruvector (Target) | Speedup |
|---|---|---|---|
| Reflexion Retrieval | ~100 QPS | >5,000 QPS | 50x |
| Skill Search | ~50 QPS | >2,000 QPS | 40x |
| Index Build Time | ~60s/10K | <5s/10K | 12x |
| Memory Usage | ~500MB/100K | <100MB/100K | 5x |
ANN-Benchmarks Targets
Competitive with state-of-the-art implementations:
| Dataset | Recall@10 | QPS Target | Latency p99 |
|---|---|---|---|
| SIFT1M | >95% | >10,000 | <1ms |
| GIST1M | >95% | >5,000 | <2ms |
| Deep1M | >95% | >15,000 | <0.5ms |
Advanced Topics
Profiling with Flamegraphs
Generate CPU flamegraphs to identify performance bottlenecks:
cargo run --release --features profiling --bin profiling-benchmark -- \
--flamegraph \
--output bench_results/profiling
# View flamegraph
firefox bench_results/profiling/flamegraph.svg
Interpreting Flamegraphs:
- Width = CPU time spent
- Height = call stack depth
- Look for wide plateaus (hotspots)
- Focus optimization on top 20% of time
Custom Benchmark Scenarios
Create custom benchmarks by modifying the tools:
// Example: sweep vector dimensionality. `bench_custom` is a hypothetical
// helper that builds an index and runs queries at the given dimension.
let dimensions = vec![64, 128, 256, 512, 768, 1024];
let mut results = Vec::new();
for dim in dimensions {
    let result = bench_custom(dim)?;
    results.push(result);
}
Continuous Benchmarking
Integrate with CI/CD:
# .github/workflows/benchmark.yml
name: Benchmarks
on: [push]
jobs:
  benchmark:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run benchmarks
        run: |
          cd crates/ruvector-bench
          ./scripts/run_all_benchmarks.sh --quick
      - name: Upload results
        uses: actions/upload-artifact@v4
        with:
          name: benchmark-results
          path: crates/ruvector-bench/bench_results/
Troubleshooting
Common Issues
"HDF5 not found"
# Install HDF5 development libraries
sudo apt-get install libhdf5-dev # Ubuntu/Debian
brew install hdf5 # macOS
# Or build without HDF5 support
cargo build --release --no-default-features
"Out of memory"
# Reduce dataset size
cargo run --release --bin ann-benchmark -- --num-vectors 10000
# Or use quick mode
./scripts/run_all_benchmarks.sh --quick
"Profiling not working"
# Ensure profiling feature is enabled
cargo build --release --features profiling
# Linux: May need perf permissions
echo -1 | sudo tee /proc/sys/kernel/perf_event_paranoid
"Benchmarks taking too long"
# Use quick mode
./scripts/run_all_benchmarks.sh --quick
# Or run individual benchmarks
cargo run --release --bin latency-benchmark -- --queries 100
Performance Debugging
If benchmarks show unexpectedly slow results:
- Check CPU governor:
# Linux: Use performance mode
sudo cpupower frequency-set -g performance
- Verify release build:
cargo build --release # Not a debug build!
- Check system load:
htop # Ensure no other heavy processes
- Review HNSW parameters:
- Reduce ef_construction for faster indexing
- Reduce ef_search for faster queries (at cost of recall)
Results Analysis
Comparing Runs
# Compare two benchmark runs
diff -u bench_results_old/ann_benchmark.csv bench_results_new/ann_benchmark.csv
# Plot results with Python
python3 scripts/plot_results.py bench_results/
Statistical Significance
For reliable benchmarks:
- Run multiple iterations (3-5 times)
- Use appropriate dataset sizes (>10K vectors)
- Ensure consistent system load
- Record system specs in metadata
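To compare runs meaningfully, report the mean and spread of repeated measurements rather than a single number. A minimal aggregation sketch:

```rust
// Mean and population standard deviation of repeated measurements.
// Assumes a non-empty sample set.
fn mean_stddev(samples: &[f64]) -> (f64, f64) {
    let n = samples.len() as f64;
    let mean = samples.iter().sum::<f64>() / n;
    let var = samples.iter().map(|x| (x - mean).powi(2)).sum::<f64>() / n;
    (mean, var.sqrt())
}

fn main() {
    // e.g. QPS from five repeated runs of the same benchmark
    let qps_runs = [12456.0, 12390.0, 12511.0, 12298.0, 12480.0];
    let (mean, sd) = mean_stddev(&qps_runs);
    println!("QPS: {mean:.0} ± {sd:.0}");
}
```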
Contributing
To add new benchmarks:
- Create a new binary in `src/bin/` (a skeleton is sketched below)
- Use the `ruvector_bench` utilities
- Output results in the standard JSON/CSV/Markdown format
- Update this documentation
- Add the benchmark to `run_all_benchmarks.sh`
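A minimal skeleton for a new benchmark binary. The helpers below are hypothetical placeholders, not the real `ruvector_bench` API; swap them for the crate's actual dataset and reporting utilities:

```rust
// src/bin/my-benchmark.rs -- hypothetical skeleton; adapt to the real
// ruvector_bench utilities before use.
use std::time::Instant;

fn main() {
    let num_vectors = 10_000;
    let num_queries = 500;

    // 1. Build the workload (replace with ruvector_bench dataset helpers).
    let vectors = generate_vectors(num_vectors, 128);

    // 2. Time the phase under test.
    let start = Instant::now();
    run_queries(&vectors, num_queries);
    let qps = num_queries as f64 / start.elapsed().as_secs_f64();

    // 3. Emit results in the standard JSON/CSV/Markdown format.
    println!("{{\"benchmark\": \"my-benchmark\", \"qps\": {qps:.1}}}");
}

// Placeholder data generator so the skeleton compiles standalone.
fn generate_vectors(n: usize, dims: usize) -> Vec<Vec<f32>> {
    (0..n).map(|i| vec![(i % 251) as f32; dims]).collect()
}

// Placeholder query loop; replace with real index searches.
fn run_queries(_vectors: &[Vec<f32>], _num_queries: usize) {}
```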
Support
For issues or questions:
- GitHub Issues: https://github.com/ruvnet/ruvector/issues
- Documentation: https://github.com/ruvnet/ruvector/docs
Last updated: 2025-11-19