Merge commit 'd803bfe2b1fe7f5e219e50ac20d6801a0a58ac75' as 'vendor/ruvector'

This commit is contained in:
ruv
2026-02-28 14:39:40 -05:00
7854 changed files with 3522914 additions and 0 deletions

@@ -0,0 +1,467 @@
# Ruvector Benchmark Suite Documentation
Comprehensive benchmarking tools for measuring and analyzing Ruvector's performance across various workloads and configurations.
## Table of Contents
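1. [Overview](#overview)
2. [Installation](#installation)
3. [Benchmark Tools](#benchmark-tools)
4. [Quick Start](#quick-start)
5. [Detailed Usage](#detailed-usage)
6. [Understanding Results](#understanding-results)
7. [Performance Targets](#performance-targets)
8. [Advanced Topics](#advanced-topics)
9. [Troubleshooting](#troubleshooting)
10. [Results Analysis](#results-analysis)
11. [Contributing](#contributing)
12. [References](#references)
13. [Support](#support)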
## Overview
The Ruvector benchmark suite provides:
- **ANN-Benchmarks Compatibility**: Standard SIFT1M, GIST1M, Deep1M testing
- **AgenticDB Workloads**: Reflexion episodes, skill libraries, causal graphs
- **Latency Analysis**: p50, p95, p99, p99.9 percentile measurements
- **Memory Profiling**: Usage at various scales with quantization effects
- **System Comparison**: Ruvector vs other implementations
- **Performance Profiling**: CPU flamegraphs and hotspot analysis
## Installation
### Prerequisites
```bash
# Install Rust (if not already installed)
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
# Optional: HDF5 for loading real ANN benchmark datasets
# Ubuntu/Debian
sudo apt-get install libhdf5-dev
# macOS
brew install hdf5
# Optional: Profiling tools (Linux only)
sudo apt-get install linux-perf                              # Debian
sudo apt-get install linux-tools-common linux-tools-generic  # Ubuntu
```
### Build Benchmarks
```bash
cd crates/ruvector-bench
# Standard build
cargo build --release
# With profiling support
cargo build --release --features profiling
# With HDF5 dataset support
cargo build --release --features hdf5-datasets
```
## Benchmark Tools
### 1. ANN Benchmark (`ann-benchmark`)
Tests standard ANN benchmark datasets with configurable HNSW parameters.
**Features:**
- SIFT1M (128D, 1M vectors)
- GIST1M (960D, 1M vectors)
- Deep1M (96D, 1M vectors)
- Synthetic dataset generation
- Recall-QPS curves at 90%, 95%, and 99% recall
- Multiple ef_search values
### 2. AgenticDB Benchmark (`agenticdb-benchmark`)
Simulates agentic AI workloads.
**Workloads:**
- Reflexion episode storage/retrieval
- Skill library search
- Causal graph queries
- Learning session throughput (mixed read/write)
### 3. Latency Benchmark (`latency-benchmark`)
Measures detailed latency characteristics.
**Tests:**
- Single-threaded latency
- Multi-threaded latency (configurable thread counts)
- Effect of ef_search on latency
- Effect of quantization on latency/recall tradeoff
### 4. Memory Benchmark (`memory-benchmark`)
Profiles memory usage at scale.
**Tests:**
- Memory at 10K, 100K, 1M vectors
- Effect of quantization (none, scalar, binary)
- Index overhead analysis
- Memory per vector calculation
### 5. Comparison Benchmark (`comparison-benchmark`)
Compares Ruvector against other systems.
**Comparisons:**
- Ruvector (optimized)
- Ruvector (no quantization)
- Simulated Python baseline
- Simulated brute-force search
### 6. Profiling Benchmark (`profiling-benchmark`)
Generates performance profiles.
**Outputs:**
- CPU flamegraphs (SVG)
- Profiling reports
- Hotspot identification
- SIMD utilization analysis
## Quick Start
### Run All Benchmarks
```bash
# Full benchmark suite
./scripts/run_all_benchmarks.sh
# Quick mode (smaller datasets)
./scripts/run_all_benchmarks.sh --quick
# With profiling
./scripts/run_all_benchmarks.sh --profile
```
### Run Individual Benchmarks
```bash
# ANN benchmarks
cargo run --release --bin ann-benchmark -- \
  --dataset synthetic \
  --num-vectors 100000 \
  --queries 1000

# AgenticDB workloads
cargo run --release --bin agenticdb-benchmark -- \
  --episodes 10000 \
  --queries 500

# Latency profiling
cargo run --release --bin latency-benchmark -- \
  --num-vectors 50000 \
  --threads "1,4,8,16"

# Memory profiling
cargo run --release --bin memory-benchmark -- \
  --scales "1000,10000,100000"

# System comparison
cargo run --release --bin comparison-benchmark -- \
  --num-vectors 50000

# Performance profiling
cargo run --release --features profiling --bin profiling-benchmark -- \
  --flamegraph
```
## Detailed Usage
### ANN Benchmark Options
```bash
cargo run --release --bin ann-benchmark -- --help

Options:
  -d, --dataset <DATASET>          Dataset: sift1m, gist1m, deep1m, synthetic [default: synthetic]
  -n, --num-vectors <NUM_VECTORS>  Number of vectors [default: 100000]
  -q, --queries <NUM_QUERIES>      Number of queries [default: 1000]
  -d, --dimensions <DIMENSIONS>    Vector dimensions [default: 128]
  -k, --k <K>                      K nearest neighbors [default: 10]
  -m, --m <M>                      HNSW M parameter [default: 32]
      --ef-construction <VALUE>    HNSW ef_construction [default: 200]
      --ef-search-values <VALUES>  HNSW ef_search values (comma-separated) [default: 50,100,200,400]
  -o, --output <OUTPUT>            Output directory [default: bench_results]
      --metric <METRIC>            Distance metric [default: cosine]
      --quantization <QUANT>       Quantization: none, scalar, binary [default: scalar]
```
### AgenticDB Benchmark Options
```bash
cargo run --release --bin agenticdb-benchmark -- --help

Options:
      --episodes <EPISODES>  Number of episodes [default: 10000]
      --skills <SKILLS>      Number of skills [default: 1000]
  -q, --queries <QUERIES>    Number of queries [default: 500]
  -o, --output <OUTPUT>      Output directory [default: bench_results]
```
### Latency Benchmark Options
```bash
cargo run --release --bin latency-benchmark -- --help

Options:
  -n, --num-vectors <NUM_VECTORS>  Number of vectors [default: 50000]
  -q, --queries <QUERIES>          Number of queries [default: 1000]
  -d, --dimensions <DIMENSIONS>    Vector dimensions [default: 384]
  -t, --threads <THREADS>          Thread counts to test [default: 1,4,8,16]
  -o, --output <OUTPUT>            Output directory [default: bench_results]
```
## Understanding Results
### Output Files
Each benchmark generates three output files:
1. **JSON** (`{benchmark}_benchmark.json`): Raw data for programmatic analysis
2. **CSV** (`{benchmark}_benchmark.csv`): Tabular data for spreadsheet analysis
3. **Markdown** (`{benchmark}_benchmark.md`): Human-readable report
### Key Metrics
#### QPS (Queries Per Second)
- Higher is better
- Measures throughput
- Target: >10,000 QPS for 100K vectors
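As a rough illustration of how a QPS figure is derived, the sketch below divides the number of completed queries by the elapsed wall-clock time. The function names `qps` and `measure_qps` are hypothetical and are not taken from the Ruvector benchmark code; the closure stands in for whatever search call is being timed.

```rust
use std::time::{Duration, Instant};

/// Queries per second for `num_queries` completed in `elapsed`.
/// Returns 0.0 when no time has elapsed, to avoid dividing by zero.
fn qps(num_queries: usize, elapsed: Duration) -> f64 {
    let secs = elapsed.as_secs_f64();
    if secs == 0.0 {
        return 0.0;
    }
    num_queries as f64 / secs
}

/// Times `num_queries` calls of a caller-supplied closure and returns the QPS.
/// `run_query` is a placeholder for the real search call (hypothetical).
fn measure_qps<F: FnMut(usize)>(num_queries: usize, mut run_query: F) -> f64 {
    let start = Instant::now();
    for i in 0..num_queries {
        run_query(i);
    }
    qps(num_queries, start.elapsed())
}
```

For example, 20,000 queries completed in 2 seconds works out to 10,000 QPS, which sits right at the target above.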
#### Latency Percentiles
- **p50**: Median latency; half of all queries complete faster than this
- **p95**: 95% of queries complete within this time
- **p99**: 99% of queries complete within this time (a common tail-latency target)
- **p99.9**: 99.9% of queries complete within this time (extreme outliers)
- Lower is better
- Target: <5ms p99 for 100K vectors
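The percentiles above can be computed from raw per-query latencies in several ways. The following is a minimal sketch using the nearest-rank method, which may differ from the method the benchmark binaries actually use. `LatencySummary`, `percentile`, and `summarize` are hypothetical names. Percentiles are passed in per-mille (999 means p99.9) so the rank is computed with integer arithmetic and avoids floating-point rounding surprises.

```rust
/// Latency percentiles in milliseconds.
#[derive(Debug, Clone, Copy, PartialEq)]
struct LatencySummary {
    p50: f64,
    p95: f64,
    p99: f64,
    p999: f64,
}

/// Nearest-rank percentile of an ascending-sorted slice.
/// `per_mille` is the percentile times 10 (500 = p50, 999 = p99.9).
fn percentile(sorted: &[f64], per_mille: usize) -> Option<f64> {
    if sorted.is_empty() || per_mille == 0 || per_mille > 1000 {
        return None;
    }
    // rank = ceil(per_mille * n / 1000), which always lies in 1..=n here.
    let rank = (per_mille * sorted.len() + 999) / 1000;
    Some(sorted[rank - 1])
}

/// Sorts a copy of the samples and extracts p50, p95, p99, and p99.9.
fn summarize(samples_ms: &[f64]) -> Option<LatencySummary> {
    let mut sorted = samples_ms.to_vec();
    sorted.sort_by(|a, b| a.total_cmp(b));
    Some(LatencySummary {
        p50: percentile(&sorted, 500)?,
        p95: percentile(&sorted, 950)?,
        p99: percentile(&sorted, 990)?,
        p999: percentile(&sorted, 999)?,
    })
}
```

Note that p99.9 is only meaningful with at least 1,000 samples; with fewer, it collapses to the maximum observed latency.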
#### Recall
- **Recall@1**: Percentage of times the true nearest neighbor is found
- **Recall@10**: Percentage of true top-10 neighbors found
- **Recall@100**: Percentage of true top-100 neighbors found
- Higher is better
- Target: >95% recall@10
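To make the recall definition concrete, here is a small sketch that compares the IDs returned by an approximate search against exact (brute-force) ground truth. The function names and the use of `u64` IDs are assumptions for illustration, not the benchmark suite's actual types.

```rust
use std::collections::HashSet;

/// Fraction of the true top-`k` neighbor IDs that also appear in the
/// top-`k` results returned by the index. Both slices are ordered best-first.
fn recall_at_k(ground_truth: &[u64], retrieved: &[u64], k: usize) -> f64 {
    let truth: HashSet<u64> = ground_truth.iter().take(k).copied().collect();
    if truth.is_empty() {
        return 0.0;
    }
    let got: HashSet<u64> = retrieved.iter().take(k).copied().collect();
    truth.intersection(&got).count() as f64 / truth.len() as f64
}

/// Mean recall@k over a batch of queries (paired by position).
fn mean_recall_at_k(ground_truth: &[Vec<u64>], retrieved: &[Vec<u64>], k: usize) -> f64 {
    let n = ground_truth.len().min(retrieved.len());
    if n == 0 {
        return 0.0;
    }
    let total: f64 = ground_truth
        .iter()
        .zip(retrieved)
        .map(|(truth, got)| recall_at_k(truth, got, k))
        .sum();
    total / n as f64
}
```

If the index returns 9 of the 10 true nearest neighbors for a query, recall@10 for that query is 0.9; the reported figure is the mean over all queries.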
#### Memory
- Total memory usage in MB
- Memory per vector in KB
- Compression ratio with quantization
- Target: <2KB per vector with quantization
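The arithmetic behind these memory figures can be sketched as follows. This assumes unquantized vectors are stored as `f32` (4 bytes per dimension), scalar quantization uses 1 byte per dimension, and binary quantization uses 1 bit per dimension; Ruvector's actual storage layout may differ, and the figures cover the vector payload only, not HNSW graph links or metadata. The enum and function names are hypothetical.

```rust
/// Storage format for vector payloads (assumed layouts, see comments).
#[derive(Debug, Clone, Copy)]
enum Quantization {
    Float32, // no quantization: 4 bytes per dimension
    Scalar,  // assumed 1 byte (u8) per dimension
    Binary,  // assumed 1 bit per dimension, rounded up to whole bytes
}

/// Bytes needed to store one vector's payload.
fn vector_bytes(dimensions: usize, q: Quantization) -> usize {
    match q {
        Quantization::Float32 => dimensions * 4,
        Quantization::Scalar => dimensions,
        Quantization::Binary => (dimensions + 7) / 8,
    }
}

/// Payload size reduction relative to unquantized f32 storage.
fn compression_ratio(dimensions: usize, q: Quantization) -> f64 {
    vector_bytes(dimensions, Quantization::Float32) as f64 / vector_bytes(dimensions, q) as f64
}

/// Memory per vector in KB, given total usage in MB (1 MB = 1024 KB).
fn memory_per_vector_kb(total_mb: f64, num_vectors: usize) -> f64 {
    total_mb * 1024.0 / num_vectors as f64
}
```

Under these assumptions a 128-dimensional vector takes 512 bytes unquantized, 128 bytes with scalar quantization (4x smaller), and 16 bytes with binary quantization (32x smaller). A hypothetical total of 156.2 MB for 100,000 vectors comes to about 1.6 KB per vector.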
### Reading Benchmark Reports
Example output interpretation:
```
ef_search QPS p50 (ms) p99 (ms) Recall@10 Memory (MB)
50 15234 0.05 0.12 92.5% 156.2
100 12456 0.06 0.15 96.8% 156.2
200 8932 0.08 0.20 98.9% 156.2
```
**Analysis:**
- Increasing ef_search improves recall but reduces QPS
- ef_search=100 offers good balance (96.8% recall, 12K QPS)
- Memory usage constant across ef_search values
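The selection logic in this analysis can be expressed as: among the rows that meet a recall floor, pick the one with the highest QPS. The sketch below shows that rule; the `Row` struct and `best_ef_search` function are hypothetical and do not reflect the schema of the generated JSON or CSV files.

```rust
/// One row of a benchmark report (hypothetical, simplified shape).
#[derive(Debug, Clone, Copy)]
struct Row {
    ef_search: usize,
    qps: f64,
    recall_at_10: f64, // fraction in [0, 1]
}

/// Returns the row with the highest QPS among those meeting `min_recall`,
/// or `None` if no row reaches the recall floor.
fn best_ef_search(rows: &[Row], min_recall: f64) -> Option<&Row> {
    rows.iter()
        .filter(|r| r.recall_at_10 >= min_recall)
        .max_by(|a, b| a.qps.total_cmp(&b.qps))
}

/// The three rows from the example report above.
fn example_rows() -> Vec<Row> {
    vec![
        Row { ef_search: 50, qps: 15234.0, recall_at_10: 0.925 },
        Row { ef_search: 100, qps: 12456.0, recall_at_10: 0.968 },
        Row { ef_search: 200, qps: 8932.0, recall_at_10: 0.989 },
    ]
}
```

With a 95% recall floor this rule selects ef_search=100 from the example rows, matching the analysis above; raising the floor to 98% shifts the choice to ef_search=200.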
## Performance Targets
### AgenticDB Replacement Goals
Ruvector targets a **10-100x improvement** over AgenticDB in query throughput and index build time, along with roughly 5x lower memory usage:
| Metric | AgenticDB (Python) | Ruvector (Target) | Speedup |
|--------|-------------------|-------------------|---------|
| Reflexion Retrieval | ~100 QPS | >5,000 QPS | 50x |
| Skill Search | ~50 QPS | >2,000 QPS | 40x |
| Index Build Time | ~60s/10K | <5s/10K | 12x |
| Memory Usage | ~500MB/100K | <100MB/100K | 5x |
### ANN-Benchmarks Targets
Competitive with state-of-the-art implementations:
| Dataset | Recall@10 | QPS Target | Latency p99 |
|---------|-----------|------------|-------------|
| SIFT1M | >95% | >10,000 | <1ms |
| GIST1M | >95% | >5,000 | <2ms |
| Deep1M | >95% | >15,000 | <0.5ms |
## Advanced Topics
### Profiling with Flamegraphs
Generate CPU flamegraphs to identify performance bottlenecks:
```bash
cargo run --release --features profiling --bin profiling-benchmark -- \
  --flamegraph \
  --output bench_results/profiling

# View flamegraph
firefox bench_results/profiling/flamegraph.svg
```
**Interpreting Flamegraphs:**
- Width = share of CPU time spent in a function, including the functions it calls
- Height = call stack depth
- Look for wide plateaus (hotspots)
- Focus optimization on the widest frames first
### Custom Benchmark Scenarios
Create custom benchmarks by modifying the tools:
```rust
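// Example: custom dimension sweep.
// `bench_custom` is a placeholder for your own benchmark function,
// and this snippet is assumed to run inside a function that returns `Result`.
let dimensions = [64, 128, 256, 512, 768, 1024];
let mut results = Vec::with_capacity(dimensions.len());
for dim in dimensions {
    let result = bench_custom(dim)?;
    results.push(result);
}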
```
### Continuous Benchmarking
Integrate with CI/CD:
```yaml
# .github/workflows/benchmark.yml
name: Benchmarks
on: [push]
jobs:
  benchmark:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run benchmarks
        run: |
          cd crates/ruvector-bench
          ./scripts/run_all_benchmarks.sh --quick
      - name: Upload results
        uses: actions/upload-artifact@v4
        with:
          name: benchmark-results
          path: crates/ruvector-bench/bench_results/
```
## Troubleshooting
### Common Issues
#### "HDF5 not found"
```bash
# Install HDF5 development libraries
sudo apt-get install libhdf5-dev # Ubuntu/Debian
brew install hdf5 # macOS
# Or build without HDF5 support
cargo build --release --no-default-features
```
#### "Out of memory"
```bash
# Reduce dataset size
cargo run --release --bin ann-benchmark -- --num-vectors 10000
# Or use quick mode
./scripts/run_all_benchmarks.sh --quick
```
#### "Profiling not working"
```bash
# Ensure profiling feature is enabled
cargo build --release --features profiling
# Linux: perf may need relaxed permissions (this setting resets on reboot)
echo -1 | sudo tee /proc/sys/kernel/perf_event_paranoid
```
#### "Benchmarks taking too long"
```bash
# Use quick mode
./scripts/run_all_benchmarks.sh --quick
# Or run individual benchmarks
cargo run --release --bin latency-benchmark -- --queries 100
```
### Performance Debugging
If benchmarks show unexpectedly slow results:
1. **Check CPU governor:**
   ```bash
   # Linux: Use performance mode
   sudo cpupower frequency-set -g performance
   ```
2. **Verify release build:**
   ```bash
   cargo build --release  # Debug builds (the default without --release) are far slower
   ```
3. **Check system load:**
   ```bash
   htop  # Ensure no other heavy processes
   ```
4. **Review HNSW parameters:**
   - Reduce ef_construction for faster indexing
   - Reduce ef_search for faster queries (at cost of recall)
## Results Analysis
### Comparing Runs
```bash
# Compare two benchmark runs
diff -u bench_results_old/ann_benchmark.csv bench_results_new/ann_benchmark.csv
# Plot results with Python
python3 scripts/plot_results.py bench_results/
```
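A textual diff flags every changed number, including ordinary run-to-run noise. One way to separate noise from real regressions is to compute the percent change per metric and apply a tolerance, as in the sketch below. The names are hypothetical, the tolerance is something you would choose yourself, and the code makes no assumption about the CSV schema; it works on values you have already extracted.

```rust
/// Whether a larger or smaller value is the better outcome for a metric.
#[derive(Debug, Clone, Copy)]
enum Direction {
    HigherIsBetter, // e.g. QPS, recall
    LowerIsBetter,  // e.g. latency, memory
}

/// Percent change from `old` to `new`; `None` when `old` is zero.
fn percent_change(old: f64, new: f64) -> Option<f64> {
    if old == 0.0 {
        None
    } else {
        Some((new - old) / old * 100.0)
    }
}

/// True when the metric moved in the bad direction by more than
/// `tolerance_pct` percent. An undefined change is not flagged.
fn is_regression(old: f64, new: f64, direction: Direction, tolerance_pct: f64) -> bool {
    match percent_change(old, new) {
        Some(change) => match direction {
            Direction::HigherIsBetter => change < -tolerance_pct,
            Direction::LowerIsBetter => change > tolerance_pct,
        },
        None => false,
    }
}
```

For instance, with a 5% tolerance, QPS dropping from 12,456 to 11,000 (about -11.7%) is flagged, while p99 latency moving from 0.15 ms to 0.155 ms (about +3.3%) is not.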
### Statistical Significance
For reliable benchmarks:
- Run multiple iterations (3-5 times)
- Use appropriate dataset sizes (>10K vectors)
- Ensure consistent system load
- Record system specs in metadata
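When repeating a benchmark 3-5 times, a quick way to judge whether the runs are stable is to look at the mean, the sample standard deviation, and the coefficient of variation (standard deviation as a percentage of the mean). The sketch below computes these; it is a spread measure rather than a formal significance test, and `RunStats` and `run_stats` are hypothetical names.

```rust
/// Summary statistics over repeated runs of the same benchmark.
#[derive(Debug, Clone, Copy)]
struct RunStats {
    mean: f64,
    std_dev: f64,    // sample standard deviation (n - 1 denominator)
    cv_percent: f64, // std_dev / |mean| * 100; NaN when the mean is zero
}

/// Computes mean, sample standard deviation, and coefficient of variation.
/// Returns `None` for fewer than two runs, where spread is undefined.
fn run_stats(values: &[f64]) -> Option<RunStats> {
    let n = values.len();
    if n < 2 {
        return None;
    }
    let mean = values.iter().sum::<f64>() / n as f64;
    let variance = values.iter().map(|v| (v - mean).powi(2)).sum::<f64>() / (n - 1) as f64;
    let std_dev = variance.sqrt();
    let cv_percent = if mean != 0.0 {
        std_dev / mean.abs() * 100.0
    } else {
        f64::NAN
    };
    Some(RunStats { mean, std_dev, cv_percent })
}
```

As an illustration with made-up numbers, three runs at 12,000, 12,500, and 13,000 QPS give a mean of 12,500, a standard deviation of 500, and a coefficient of variation of 4%. Differences between configurations that are smaller than this spread are hard to distinguish from noise.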
## Contributing
To add new benchmarks:
1. Create new binary in `src/bin/`
2. Use `ruvector_bench` utilities
3. Output results in standard format
4. Update this documentation
5. Add to `run_all_benchmarks.sh`
## References
- [ANN-Benchmarks](https://ann-benchmarks.com)
- [HNSW Paper](https://arxiv.org/abs/1603.09320)
- [AgenticDB Documentation](https://github.com/agenticdb/agenticdb)
- [Ruvector Repository](https://github.com/ruvnet/ruvector)
## Support
For issues or questions:
- GitHub Issues: https://github.com/ruvnet/ruvector/issues
- Documentation: https://github.com/ruvnet/ruvector/docs
---
Last updated: 2025-11-19