# Ruvector Benchmark Suite Documentation

Comprehensive benchmarking tools for measuring and analyzing Ruvector's performance across various workloads and configurations.

## Table of Contents

1. [Overview](#overview)
2. [Installation](#installation)
3. [Benchmark Tools](#benchmark-tools)
4. [Quick Start](#quick-start)
5. [Detailed Usage](#detailed-usage)
6. [Understanding Results](#understanding-results)
7. [Performance Targets](#performance-targets)
8. [Advanced Topics](#advanced-topics)
9. [Troubleshooting](#troubleshooting)
10. [Results Analysis](#results-analysis)

## Overview

The Ruvector benchmark suite provides:

- **ANN-Benchmarks Compatibility**: Standard SIFT1M, GIST1M, Deep1M testing
- **AgenticDB Workloads**: Reflexion episodes, skill libraries, causal graphs
- **Latency Analysis**: p50, p95, p99, p99.9 percentile measurements
- **Memory Profiling**: Usage at various scales with quantization effects
- **System Comparison**: Ruvector vs other implementations
- **Performance Profiling**: CPU flamegraphs and hotspot analysis

## Installation

### Prerequisites

```bash
# Install Rust (if not already installed)
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh

# Optional: HDF5 for loading real ANN benchmark datasets
# Ubuntu/Debian
sudo apt-get install libhdf5-dev

# macOS
brew install hdf5

# Optional: Profiling tools
sudo apt-get install linux-perf  # Linux only
```

### Build Benchmarks

```bash
cd crates/ruvector-bench

# Standard build
cargo build --release

# With profiling support
cargo build --release --features profiling

# With HDF5 dataset support
cargo build --release --features hdf5-datasets
```

## Benchmark Tools

### 1. ANN Benchmark (`ann-benchmark`)

Tests standard ANN benchmark datasets with configurable HNSW parameters.

**Features:**
- SIFT1M (128D, 1M vectors)
- GIST1M (960D, 1M vectors)
- Deep1M (96D, 1M vectors)
- Synthetic dataset generation
- Recall-QPS curves at 90%, 95%, 99%
- Multiple ef_search values

### 2. AgenticDB Benchmark (`agenticdb-benchmark`)

Simulates agentic AI workloads.

**Workloads:**
- Reflexion episode storage/retrieval
- Skill library search
- Causal graph queries
- Learning session throughput (mixed read/write)

### 3. Latency Benchmark (`latency-benchmark`)

Measures detailed latency characteristics.

**Tests:**
- Single-threaded latency
- Multi-threaded latency (configurable thread counts)
- Effect of ef_search on latency
- Effect of quantization on the latency/recall tradeoff

### 4. Memory Benchmark (`memory-benchmark`)

Profiles memory usage at scale.

**Tests:**
- Memory at 10K, 100K, and 1M vectors
- Effect of quantization (none, scalar, binary)
- Index overhead analysis
- Memory per vector calculation

### 5. Comparison Benchmark (`comparison-benchmark`)

Compares Ruvector against other systems.

**Comparisons:**
- Ruvector (optimized)
- Ruvector (no quantization)
- Simulated Python baseline
- Simulated brute-force search

### 6. Profiling Benchmark (`profiling-benchmark`)

Generates performance profiles.

**Outputs:**
- CPU flamegraphs (SVG)
- Profiling reports
- Hotspot identification
- SIMD utilization analysis

## Quick Start

### Run All Benchmarks

```bash
# Full benchmark suite
./scripts/run_all_benchmarks.sh

# Quick mode (smaller datasets)
./scripts/run_all_benchmarks.sh --quick

# With profiling
./scripts/run_all_benchmarks.sh --profile
```

### Run Individual Benchmarks

```bash
# ANN benchmarks
cargo run --release --bin ann-benchmark -- \
  --dataset synthetic \
  --num-vectors 100000 \
  --queries 1000

# AgenticDB workloads
cargo run --release --bin agenticdb-benchmark -- \
  --episodes 10000 \
  --queries 500

# Latency profiling
cargo run --release --bin latency-benchmark -- \
  --num-vectors 50000 \
  --threads "1,4,8,16"

# Memory profiling
cargo run --release --bin memory-benchmark -- \
  --scales "1000,10000,100000"

# System comparison
cargo run --release --bin comparison-benchmark -- \
  --num-vectors 50000

# Performance profiling
cargo run --release --features profiling --bin profiling-benchmark -- \
  --flamegraph
```

## Detailed Usage

### ANN Benchmark Options

```bash
cargo run --release --bin ann-benchmark -- --help

Options:
  -d, --dataset <DATASET>            Dataset: sift1m, gist1m, deep1m, synthetic [default: synthetic]
  -n, --num-vectors <NUM_VECTORS>    Number of vectors [default: 100000]
  -q, --queries <NUM_QUERIES>        Number of queries [default: 1000]
      --dimensions <DIMENSIONS>      Vector dimensions [default: 128]
  -k, --k <K>                        K nearest neighbors [default: 10]
  -m, --m <M>                        HNSW M parameter [default: 32]
      --ef-construction <VALUE>      HNSW ef_construction [default: 200]
      --ef-search-values <VALUES>    HNSW ef_search values (comma-separated) [default: 50,100,200,400]
  -o, --output <OUTPUT>              Output directory [default: bench_results]
      --metric <METRIC>              Distance metric [default: cosine]
      --quantization <QUANT>         Quantization: none, scalar, binary [default: scalar]
```

### AgenticDB Benchmark Options

```bash
cargo run --release --bin agenticdb-benchmark -- --help

Options:
      --episodes <EPISODES>    Number of episodes [default: 10000]
      --skills <SKILLS>        Number of skills [default: 1000]
  -q, --queries <QUERIES>      Number of queries [default: 500]
  -o, --output <OUTPUT>        Output directory [default: bench_results]
```

### Latency Benchmark Options

```bash
cargo run --release --bin latency-benchmark -- --help

Options:
  -n, --num-vectors <NUM_VECTORS>    Number of vectors [default: 50000]
  -q, --queries <QUERIES>            Number of queries [default: 1000]
  -d, --dimensions <DIMENSIONS>      Vector dimensions [default: 384]
  -t, --threads <THREADS>            Thread counts to test [default: 1,4,8,16]
  -o, --output <OUTPUT>              Output directory [default: bench_results]
```

## Understanding Results

### Output Files

Each benchmark generates three output files:

1. **JSON** (`{benchmark}_benchmark.json`): Raw data for programmatic analysis
2. **CSV** (`{benchmark}_benchmark.csv`): Tabular data for spreadsheet analysis
3. **Markdown** (`{benchmark}_benchmark.md`): Human-readable report

### Key Metrics

#### QPS (Queries Per Second)
- Higher is better
- Measures throughput
- Target: >10,000 QPS for 100K vectors

#### Latency Percentiles
- **p50**: Median latency (typical user experience)
- **p95**: 95th percentile (captures most outliers)
- **p99**: 99th percentile (worst case for most users)
- **p99.9**: 99.9th percentile (extreme outliers)
- Lower is better
- Target: <5ms p99 for 100K vectors

#### Recall
- **Recall@1**: Percentage of times the true nearest neighbor is found
- **Recall@10**: Percentage of true top-10 neighbors found
- **Recall@100**: Percentage of true top-100 neighbors found
- Higher is better
- Target: >95% recall@10

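Recall@k reduces to a set intersection between the retrieved IDs and the exact ground truth. A sketch (the `recall_at_k` helper is illustrative, not a `ruvector_bench` API):

```rust
use std::collections::HashSet;

// Fraction of the true top-k neighbors that the index actually returned.
fn recall_at_k(retrieved: &[u64], ground_truth: &[u64], k: usize) -> f64 {
    let truth: HashSet<u64> = ground_truth.iter().take(k).copied().collect();
    let hits = retrieved.iter().take(k).filter(|id| truth.contains(id)).count();
    hits as f64 / k as f64
}

fn main() {
    let retrieved = [3, 1, 9, 4, 7, 2, 8, 5, 0, 11];
    let truth = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10];
    // 8 of the 10 true neighbors were found:
    println!("recall@10 = {:.0}%", 100.0 * recall_at_k(&retrieved, &truth, 10)); // 80%
}
```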
#### Memory
- Total memory usage in MB
- Memory per vector in KB
- Compression ratio with quantization
- Target: <2KB per vector with quantization

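For intuition on the compression ratios, raw vector storage can be estimated from dimensionality alone. The figures below are back-of-envelope assumptions (f32 components, one byte per component for scalar quantization, one bit for binary), not measured Ruvector numbers, and they exclude index overhead such as HNSW links:

```rust
// Estimated raw storage per vector under each quantization mode.
fn bytes_per_vector(dims: usize, quantization: &str) -> usize {
    match quantization {
        "none" => dims * 4,           // f32 per component
        "scalar" => dims,             // one u8 per component
        "binary" => dims.div_ceil(8), // one bit per component
        _ => panic!("unknown quantization"),
    }
}

fn main() {
    for q in ["none", "scalar", "binary"] {
        let b = bytes_per_vector(384, q);
        println!("{q:>6}: {b} bytes/vector ({:.0}x vs f32)", (384 * 4) as f64 / b as f64);
    }
}
```

For 384 dimensions this gives 1536, 384, and 48 bytes per vector, i.e. roughly 4x and 32x compression before index overhead.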
### Reading Benchmark Reports

Example output interpretation:

```
ef_search    QPS      p50 (ms)   p99 (ms)   Recall@10   Memory (MB)
50           15234    0.05       0.12       92.5%       156.2
100          12456    0.06       0.15       96.8%       156.2
200          8932     0.08       0.20       98.9%       156.2
```

**Analysis:**
- Increasing ef_search improves recall but reduces QPS
- ef_search=100 offers a good balance (96.8% recall, ~12K QPS)
- Memory usage is constant across ef_search values

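Choosing an operating point from such a sweep can be automated: pick the highest-QPS `ef_search` that still meets a recall floor. A sketch, using the numbers from the example table above:

```rust
// rows: (ef_search, QPS, recall@10). Returns the ef_search with the
// highest QPS among configurations meeting the recall target.
fn pick_ef(rows: &[(u32, f64, f64)], min_recall: f64) -> Option<u32> {
    rows.iter()
        .filter(|(_, _, recall)| *recall >= min_recall)
        .max_by(|a, b| a.1.partial_cmp(&b.1).unwrap())
        .map(|(ef, _, _)| *ef)
}

fn main() {
    let rows = [(50, 15234.0, 0.925), (100, 12456.0, 0.968), (200, 8932.0, 0.989)];
    println!("ef_search for >=95% recall: {:?}", pick_ef(&rows, 0.95)); // Some(100)
}
```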
## Performance Targets

### AgenticDB Replacement Goals

Ruvector targets **10-100x performance improvement** over AgenticDB:

| Metric | AgenticDB (Python) | Ruvector (Target) | Speedup |
|--------|-------------------|-------------------|---------|
| Reflexion Retrieval | ~100 QPS | >5,000 QPS | 50x |
| Skill Search | ~50 QPS | >2,000 QPS | 40x |
| Index Build Time | ~60s/10K | <5s/10K | 12x |
| Memory Usage | ~500MB/100K | <100MB/100K | 5x |

### ANN-Benchmarks Targets

Competitive with state-of-the-art implementations:

| Dataset | Recall@10 | QPS Target | Latency p99 |
|---------|-----------|------------|-------------|
| SIFT1M | >95% | >10,000 | <1ms |
| GIST1M | >95% | >5,000 | <2ms |
| Deep1M | >95% | >15,000 | <0.5ms |

## Advanced Topics

### Profiling with Flamegraphs

Generate CPU flamegraphs to identify performance bottlenecks:

```bash
cargo run --release --features profiling --bin profiling-benchmark -- \
  --flamegraph \
  --output bench_results/profiling

# View flamegraph
firefox bench_results/profiling/flamegraph.svg
```

**Interpreting Flamegraphs:**
- Width = CPU time spent
- Height = call stack depth
- Look for wide plateaus (hotspots)
- Focus optimization on the top 20% of time

### Custom Benchmark Scenarios

Create custom benchmarks by modifying the tools:

```rust
// Example: sweep vector dimensions through a custom benchmark routine.
// `bench_custom` stands in for your own measurement function.
let dimensions = vec![64, 128, 256, 512, 768, 1024];
let mut results = Vec::new();
for dim in dimensions {
    let result = bench_custom(dim)?;
    results.push(result);
}
```

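A custom scenario usually also needs timing. A self-contained harness sketch that measures QPS for any query closure; the workload below is a stand-in, and real index queries (e.g. a `bench_custom` routine) would replace it:

```rust
use std::time::Instant;

// Run `query` num_queries times and return queries per second.
fn measure_qps<F: FnMut()>(num_queries: usize, mut query: F) -> f64 {
    let start = Instant::now();
    for _ in 0..num_queries {
        query();
    }
    num_queries as f64 / start.elapsed().as_secs_f64()
}

fn main() {
    // Stand-in workload: a small dot product per "query".
    let (a, b) = (vec![1.0f32; 384], vec![2.0f32; 384]);
    let qps = measure_qps(10_000, || {
        let dot: f32 = a.iter().zip(&b).map(|(x, y)| x * y).sum();
        std::hint::black_box(dot); // keep the work from being optimized away
    });
    println!("{qps:.0} QPS");
}
```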
### Continuous Benchmarking

Integrate with CI/CD:

```yaml
# .github/workflows/benchmark.yml
name: Benchmarks
on: [push]
jobs:
  benchmark:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run benchmarks
        run: |
          cd crates/ruvector-bench
          ./scripts/run_all_benchmarks.sh --quick
      - name: Upload results
        uses: actions/upload-artifact@v4
        with:
          name: benchmark-results
          path: crates/ruvector-bench/bench_results/
```

## Troubleshooting

### Common Issues

#### "HDF5 not found"

```bash
# Install HDF5 development libraries
sudo apt-get install libhdf5-dev   # Ubuntu/Debian
brew install hdf5                  # macOS

# Or build without HDF5 support
cargo build --release --no-default-features
```

#### "Out of memory"

```bash
# Reduce dataset size
cargo run --release --bin ann-benchmark -- --num-vectors 10000

# Or use quick mode
./scripts/run_all_benchmarks.sh --quick
```

#### "Profiling not working"

```bash
# Ensure the profiling feature is enabled
cargo build --release --features profiling

# Linux: may need perf permissions
echo -1 | sudo tee /proc/sys/kernel/perf_event_paranoid
```

#### "Benchmarks taking too long"

```bash
# Use quick mode
./scripts/run_all_benchmarks.sh --quick

# Or run individual benchmarks with fewer queries
cargo run --release --bin latency-benchmark -- --queries 100
```

### Performance Debugging

If benchmarks show unexpectedly slow results:

1. **Check CPU governor:**
   ```bash
   # Linux: use performance mode
   sudo cpupower frequency-set -g performance
   ```

2. **Verify release build:**
   ```bash
   cargo build --release  # Debug builds are an order of magnitude slower
   ```

3. **Check system load:**
   ```bash
   htop  # Ensure no other heavy processes are running
   ```

4. **Review HNSW parameters:**
   - Reduce ef_construction for faster indexing
   - Reduce ef_search for faster queries (at the cost of recall)

## Results Analysis

### Comparing Runs

```bash
# Compare two benchmark runs
diff -u bench_results_old/ann_benchmark.csv bench_results_new/ann_benchmark.csv

# Plot results with Python
python3 scripts/plot_results.py bench_results/
```

### Statistical Significance

For reliable benchmarks:
- Run multiple iterations (3-5 times)
- Use appropriate dataset sizes (>10K vectors)
- Ensure consistent system load
- Record system specs in metadata

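The iteration guidance above can be made concrete: aggregate per-run QPS into a mean and sample standard deviation, and treat a high coefficient of variation as a sign of inconsistent system load. A sketch (the QPS values are made up for illustration):

```rust
// Mean and sample standard deviation of repeated benchmark runs.
fn mean_std(xs: &[f64]) -> (f64, f64) {
    let n = xs.len() as f64;
    let mean = xs.iter().sum::<f64>() / n;
    let var = xs.iter().map(|x| (x - mean).powi(2)).sum::<f64>() / (n - 1.0);
    (mean, var.sqrt())
}

fn main() {
    let qps_runs = [12456.0, 12390.0, 12611.0, 12348.0, 12502.0];
    let (mean, std) = mean_std(&qps_runs);
    let cv = std / mean; // coefficient of variation
    println!("QPS: {mean:.0} +/- {std:.0} (CV {:.1}%)", 100.0 * cv);
    if cv > 0.05 {
        println!("warning: runs vary by >5%; check system load");
    }
}
```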
## Contributing

To add new benchmarks:

1. Create a new binary in `src/bin/`
2. Use the `ruvector_bench` utilities
3. Output results in the standard format
4. Update this documentation
5. Add the new tool to `run_all_benchmarks.sh`

## References

- [ANN-Benchmarks](http://ann-benchmarks.com)
- [HNSW Paper](https://arxiv.org/abs/1603.09320)
- [AgenticDB Documentation](https://github.com/agenticdb/agenticdb)
- [Ruvector Repository](https://github.com/ruvnet/ruvector)

## Support

For issues or questions:
- GitHub Issues: https://github.com/ruvnet/ruvector/issues
- Documentation: https://github.com/ruvnet/ruvector/docs

---

Last updated: 2025-11-19