wifi-densepose/docs/benchmarks/BENCHMARKING_GUIDE.md

# Benchmarking Guide

This guide explains how to run, interpret, and contribute benchmarks for Ruvector.

## Table of Contents

1. [Running Benchmarks](#running-benchmarks)
2. [Benchmark Suite](#benchmark-suite)
3. [Interpreting Results](#interpreting-results)
4. [Performance Targets](#performance-targets)
5. [Comparison Methodology](#comparison-methodology)
6. [Contributing Benchmarks](#contributing-benchmarks)

## Running Benchmarks

### Quick Start

```bash
# Run all benchmarks
cargo bench

# Run specific benchmark
cargo bench distance_metrics
cargo bench hnsw_search
cargo bench batch_operations

# With flamegraph profiling
cargo flamegraph --bench hnsw_search

# With criterion reports
cargo bench -- --save-baseline main
git checkout feature-branch
cargo bench -- --baseline main
```

### Benchmark Crates

```bash
# Core benchmarks
cd crates/ruvector-bench
cargo bench

# Comparison benchmarks
cargo run --release --bin comparison_benchmark

# Memory benchmarks
cargo run --release --bin memory_benchmark

# Latency benchmarks
cargo run --release --bin latency_benchmark
```

### SIMD Optimization

Enable SIMD for maximum performance:

```bash
RUSTFLAGS="-C target-cpu=native" cargo bench

# Or specific features
RUSTFLAGS="-C target-feature=+avx2,+fma" cargo bench
```

## Benchmark Suite

### 1. Distance Metrics Benchmark

**File**: `crates/ruvector-core/benches/distance_metrics.rs`

**What it measures**: Raw distance calculation performance

**Metrics**:
- Euclidean (L2) distance
- Cosine similarity
- Dot product
- Manhattan (L1) distance
- SIMD vs scalar implementations

**Run**:
```bash
cargo bench distance_metrics
```

**Expected results**:
```
euclidean_128d/simd       time:   [45.234 ns 45.456 ns 45.678 ns]
euclidean_128d/scalar     time:   [312.45 ns 315.23 ns 318.91 ns]
                                  ↑ 7x slower
cosine_128d/simd          time:   [52.123 ns 52.345 ns 52.567 ns]
dotproduct_128d/simd      time:   [38.901 ns 39.123 ns 39.345 ns]
```

### 2. HNSW Search Benchmark

**File**: `crates/ruvector-core/benches/hnsw_search.rs`

**What it measures**: End-to-end search performance

**Metrics**:
- Search latency (p50, p95, p99)
- Queries per second (QPS)
- Recall accuracy
- Different dataset sizes (1K, 10K, 100K, 1M vectors)
- Different ef_search values (50, 100, 200, 500)

**Run**:
```bash
cargo bench hnsw_search
```

**Expected results**:
```
search_1M_vectors_k10_ef100
                        time:   [845.23 µs 856.78 µs 868.45 µs]
                        thrpt:  [1,151 queries/s]
                        recall: [95.2%]

search_1M_vectors_k10_ef200
                        time:   [1.678 ms 1.689 ms 1.701 ms]
                        thrpt:  [587 queries/s]
                        recall: [98.7%]
```

### 3. Batch Operations Benchmark

**File**: `crates/ruvector-core/benches/batch_operations.rs`

**What it measures**: Throughput for bulk operations

**Metrics**:
- Batch insert throughput
- Parallel vs sequential inserts
- Different batch sizes (100, 1K, 10K)

**Run**:
```bash
cargo bench batch_operations
```

**Expected results**:
```
batch_insert_1000_parallel
                        time:   [45.234 ms 46.123 ms 47.012 ms]
                        thrpt:  [21,271 vectors/s]

batch_insert_1000_sequential
                        time:   [234.56 ms 238.91 ms 243.27 ms]
                        thrpt:  [4,111 vectors/s]
                        ↑ 5x slower
```

### 4. Quantization Benchmark

**File**: `crates/ruvector-core/benches/quantization_bench.rs`

**What it measures**: Quantization performance and accuracy

**Metrics**:
- Quantization time
- Dequantization time
- Distance calculation with quantized vectors
- Recall impact

**Run**:
```bash
cargo bench quantization
```

**Expected results**:
```
scalar_quantize_128d      time:   [234.56 ns 236.78 ns 239.01 ns]
product_quantize_128d     time:   [1.234 µs 1.245 µs 1.256 µs]

search_with_scalar_quant  time:   [678.90 µs 685.12 µs 691.34 µs]
                          recall: [97.3%]

search_with_product_quant time:   [523.45 µs 528.67 µs 533.89 µs]
                          recall: [92.8%]
```

### 5. Comprehensive Benchmark

**File**: `crates/ruvector-core/benches/comprehensive_bench.rs`

**What it measures**: End-to-end system performance

**Run**:
```bash
cargo bench comprehensive
```

## Interpreting Results

### Criterion Output

```
test_name               time:   [lower_bound mean upper_bound]
                        thrpt:  [throughput]
                        change: [% change from baseline]
```

**Example**:
```
search_100K_vectors     time:   [234.56 µs 238.91 µs 243.27 µs]
                        thrpt:  [4,111 queries/s]
                        change: [-5.2% -3.8% -2.1%] (faster)
```

**Interpretation**:
- Mean: 238.91 µs
- 95% confidence interval: [234.56 µs, 243.27 µs]
- Throughput: ~4,111 queries/second
- 3.8% faster than baseline

### Latency Percentiles

```bash
cargo run --release --bin latency_benchmark
```

**Output**:
```
Latency percentiles (100K queries):
  p50:  0.85 ms
  p90:  1.23 ms
  p95:  1.67 ms
  p99:  3.45 ms
  p999: 8.91 ms
```

**Interpretation**:
- 50% of queries complete in < 0.85ms
- 95% of queries complete in < 1.67ms
- 99% of queries complete in < 3.45ms

### Memory Usage

```bash
cargo run --release --bin memory_benchmark
```

**Output**:
```
Memory usage (1M vectors, 128D):
  Vectors (full):        512.0 MB
  Vectors (scalar):      128.0 MB (4x compression)
  HNSW graph:           640.0 MB
  Metadata:              50.0 MB
  ──────────────────────────────
  Total:                818.0 MB
```

## Performance Targets

### Search Latency

| Dataset | Target p50 | Target p95 | Target QPS |
|---------|-----------|-----------|-----------|
| 10K vectors | < 100 µs | < 200 µs | 10,000+ |
| 100K vectors | < 500 µs | < 1 ms | 2,000+ |
| 1M vectors | < 1 ms | < 2 ms | 1,000+ |
| 10M vectors | < 2 ms | < 5 ms | 500+ |

### Insert Throughput

| Operation | Target |
|-----------|--------|
| Single insert | 1,000+ ops/sec |
| Batch insert (1K) | 10,000+ vectors/sec |
| Batch insert (10K) | 50,000+ vectors/sec |

### Memory Efficiency

| Configuration | Target Memory per Vector |
|---------------|-------------------------|
| Full precision | 512 bytes (128D) |
| Scalar quant | 128 bytes (4x compression) |
| Product quant | 16-32 bytes (16-32x compression) |

### Recall Accuracy

| Configuration | Target Recall |
|---------------|---------------|
| ef_search=50 | 85%+ |
| ef_search=100 | 90%+ |
| ef_search=200 | 95%+ |
| ef_search=500 | 99%+ |

## Comparison Methodology

### Against FAISS

```bash
cargo run --release --bin comparison_benchmark -- --system faiss
```

**Metrics compared**:
- Search latency (same dataset, same k)
- Memory usage
- Build time
- Recall@10

**Example output**:
```
Benchmark: 1M vectors, 128D, k=10

                  Ruvector    FAISS       Speedup
────────────────────────────────────────────────
Build time        245s        312s        1.27x
Search (p50)      0.85ms      2.34ms      2.75x
Search (p95)      1.67ms      4.56ms      2.73x
Memory            818MB       1,245MB     1.52x
Recall@10         95.2%       95.8%       ~same
```

### Versioned Benchmarks

Track performance over time:

```bash
# Save baseline
git checkout v0.1.0
cargo bench -- --save-baseline v0.1.0

# Compare to new version
git checkout v0.2.0
cargo bench -- --baseline v0.1.0
```

## Contributing Benchmarks

### Adding a New Benchmark

1. Create benchmark file:
```rust
// crates/ruvector-core/benches/my_benchmark.rs
use criterion::{black_box, criterion_group, criterion_main, Criterion};
use ruvector_core::*;

fn my_benchmark(c: &mut Criterion) {
    let db = setup_test_db();

    c.bench_function("my_operation", |b| {
        b.iter(|| {
            // Operation to benchmark
            db.my_operation(black_box(&input))
        })
    });
}

criterion_group!(benches, my_benchmark);
criterion_main!(benches);
```

2. Register in `Cargo.toml`:
```toml
[[bench]]
name = "my_benchmark"
harness = false
```

3. Run and verify:
```bash
cargo bench my_benchmark
```

### Benchmark Best Practices

1. **Use `black_box`**: Prevent compiler optimizations
   ```rust
   b.iter(|| db.search(black_box(&query)))
   ```

2. **Measure what matters**: Focus on user-facing operations

3. **Realistic workloads**: Use representative data sizes

4. **Multiple iterations**: Criterion handles this automatically

5. **Isolate variables**: Benchmark one thing at a time

6. **Document context**: Explain what's being measured

7. **CI integration**: Run benchmarks in CI to catch regressions

### Profiling

```bash
# Flamegraph
cargo flamegraph --bench hnsw_search

# perf (Linux)
perf record -g cargo bench hnsw_search
perf report

# Cachegrind (memory profiling)
valgrind --tool=cachegrind cargo bench hnsw_search
```

## CI/CD Integration

### GitHub Actions

``yaml
- name: Run benchmarks
  run: |
    cargo bench --bench distance_metrics -- --save-baseline main

- name: Compare to baseline
  run: |
    cargo bench --bench distance_metrics -- --baseline main
```

### Performance Regression Detection

Fail CI if performance regresses > 5%:

```rust
// In benchmark code
let previous_mean = load_baseline("main");
let current_mean = measure_current();
let regression = (current_mean - previous_mean) / previous_mean;

assert!(regression < 0.05, "Performance regression > 5%");
```

## Resources

- [Criterion.rs documentation](https://bheisler.github.io/criterion.rs/book/)
- [Rust Performance Book](https://nnethercote.github.io/perf-book/)
- [Benchmarking Rust programs](https://doc.rust-lang.org/cargo/commands/cargo-bench.html)
- [ANN-Benchmarks](http://ann-benchmarks.com/) - Standard vector search benchmarks

## Questions?

Open an issue: https://github.com/ruvnet/ruvector/issues