Merge commit 'd803bfe2b1fe7f5e219e50ac20d6801a0a58ac75' as 'vendor/ruvector'

2026-02-28 14:39:40 -05:00
parent 7885bf6278 d803bfe2b1
commit cd5943df23
7854 changed files with 3522914 additions and 0 deletions
--- a/vendor/ruvector/crates/ruvector-postgres/benches/README.md
+++ b/vendor/ruvector/crates/ruvector-postgres/benches/README.md
@@ -0,0 +1,307 @@
+# RuVector Benchmark Suite
+
+Benchmarks comparing ruvector with pgvector across distance functions, HNSW indexing, quantization, and SQL workloads.
+
+## Overview
+
+This benchmark suite provides:
+
+1. **Rust Benchmarks** - Low-level performance testing using Criterion
+2. **SQL Benchmarks** - Realistic PostgreSQL workload testing
+3. **Automated CI** - GitHub Actions workflow for continuous benchmarking
+
+## Quick Start
+
+### Run All Benchmarks
+
+```bash
+cd crates/ruvector-postgres
+bash benches/scripts/run_benchmarks.sh
+```
+
+### Run Individual Benchmarks
+
+```bash
+# Distance function benchmarks
+cargo bench --bench distance_bench
+
+# HNSW index benchmarks
+cargo bench --bench index_bench
+
+# Quantization benchmarks
+cargo bench --bench quantization_bench
+
+# Quantized distance benchmarks
+cargo bench --bench quantized_distance_bench
+```
+
+### Run SQL Benchmarks
+
+```bash
+# Setup database
+createdb ruvector_bench
+psql -d ruvector_bench -c 'CREATE EXTENSION ruvector;'
+psql -d ruvector_bench -c 'CREATE EXTENSION vector;'  # pgvector registers its extension under the name "vector"
+
+# Quick benchmark (10k vectors)
+psql -d ruvector_bench -f benches/sql/quick_benchmark.sql
+
+# Full workload (1M vectors)
+psql -d ruvector_bench -f benches/sql/benchmark_workload.sql
+```
+
+## Benchmark Categories
+
+### 1. Distance Function Benchmarks (`distance_bench.rs`)
+
+Tests distance calculation performance across different vector dimensions:
+
+- **L2 (Euclidean) Distance**: Scalar vs SIMD implementations
+- **Cosine Distance**: Normalized similarity measurement
+- **Inner Product**: Dot product for maximum inner product search
+- **Batch Operations**: Sequential vs parallel processing
+
+**Dimensions tested**: 128, 384, 768, 1536, 3072
+
+**Key metrics**:
+- Single operation latency
+- Throughput (ops/sec)
+- SIMD speedup vs scalar
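+
+As a point of reference for what is being timed, the scalar versions of these metrics can be sketched as follows. This is a minimal illustration, not the crate's implementation: the function names are hypothetical, SIMD paths are omitted, and returning 1.0 for a zero-norm input is a choice made only for this sketch.
+
+```rust
+/// Euclidean (L2) distance between two equal-length vectors.
+fn l2_distance(a: &[f32], b: &[f32]) -> f32 {
+    assert_eq!(a.len(), b.len());
+    a.iter()
+        .zip(b)
+        .map(|(x, y)| (x - y) * (x - y))
+        .sum::<f32>()
+        .sqrt()
+}
+
+/// Inner (dot) product of two equal-length vectors.
+fn inner_product(a: &[f32], b: &[f32]) -> f32 {
+    assert_eq!(a.len(), b.len());
+    a.iter().zip(b).map(|(x, y)| x * y).sum()
+}
+
+/// Cosine distance, defined here as 1 - cosine similarity.
+/// Assumption for this sketch: a zero-norm input yields 1.0.
+fn cosine_distance(a: &[f32], b: &[f32]) -> f32 {
+    let dot = inner_product(a, b);
+    let norm_a = inner_product(a, a).sqrt();
+    let norm_b = inner_product(b, b).sqrt();
+    if norm_a == 0.0 || norm_b == 0.0 {
+        return 1.0;
+    }
+    1.0 - dot / (norm_a * norm_b)
+}
+
+fn main() {
+    let a = [1.0_f32, 2.0, 3.0];
+    let b = [4.0_f32, 5.0, 6.0];
+    println!("l2     = {}", l2_distance(&a, &b));
+    println!("dot    = {}", inner_product(&a, &b));
+    println!("cosine = {}", cosine_distance(&a, &b));
+}
+```
+
+The SIMD speedup metric is the ratio between a scalar loop of this kind and its vectorized counterpart at the same dimension.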
+
+### 2. HNSW Index Benchmarks (`index_bench.rs`)
+
+Tests the Hierarchical Navigable Small World (HNSW) graph index:
+
+#### Build Benchmarks
+- Index construction time vs dataset size (1K, 10K, 100K, 1M vectors)
+- Impact of `ef_construction` parameter (16, 32, 64, 128, 256)
+- Impact of `M` parameter (8, 12, 16, 24, 32, 48)
+
+#### Search Benchmarks
+- Query latency vs dataset size
+- Impact of `ef_search` parameter (10, 20, 40, 80, 160, 320)
+- Impact of `k` (number of neighbors: 1, 5, 10, 20, 50, 100)
+
+#### Recall Accuracy
+- Recall@10 vs `ef_search` values
+- Ground truth comparison
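+
+Recall@k is the fraction of the exact top-k neighbors that the approximate search also returns. A minimal sketch of that comparison is shown below; the function name `recall_at_k` and the use of `u64` row IDs are assumptions for illustration, not the benchmark's actual code.
+
+```rust
+use std::collections::HashSet;
+
+/// Fraction of the exact top-k IDs that also appear in the approximate top-k.
+/// Both slices are expected to be ordered by increasing distance.
+fn recall_at_k(ground_truth: &[u64], approx: &[u64], k: usize) -> f64 {
+    let truth: HashSet<u64> = ground_truth.iter().take(k).copied().collect();
+    if truth.is_empty() {
+        return 0.0;
+    }
+    let hits = approx
+        .iter()
+        .take(k)
+        .filter(|id| truth.contains(*id))
+        .count();
+    hits as f64 / truth.len() as f64
+}
+
+fn main() {
+    // Exact neighbors from a brute-force scan vs. IDs returned by the index.
+    let exact = [7, 3, 9, 1, 4];
+    let from_index = [7, 9, 3, 8, 4];
+    println!("recall@5 = {}", recall_at_k(&exact, &from_index, 5));
+}
+```
+
+Averaging this value over all queries gives the Recall@10 figure reported for each `ef_search` setting.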
+
+#### Memory Usage
+- Index size vs dataset size
+- Memory per vector overhead
+
+**Dimensions tested**: 128, 384, 768, 1536
+
+### 3. Quantization Benchmarks (`quantization_bench.rs`)
+
+Tests vector compression and quantized search:
+
+#### Scalar Quantization (SQ8)
+- Encoding/decoding speed
+- Distance calculation speedup
+- Recall vs exact search
+- Memory reduction (4x compression: one byte per dimension instead of a 4-byte f32)
+
+#### Binary Quantization
+- Encoding speed
+- Hamming distance calculation (SIMD)
+- 32x compression relative to f32 (one bit per dimension)
+- Re-ranking strategies
+
+#### Product Quantization (PQ)
+- ADC (Asymmetric Distance Computation)
+- SIMD vs scalar lookup
+- Configurable compression ratios
+
+**Key metrics**:
+- Speedup vs exact search
+- Recall@10 accuracy
+- Compression ratio
+- Throughput improvement
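+
+The compression ratios above follow from the storage per dimension: 8 bits for SQ8 and 1 bit for binary codes, against 32 bits for f32. The sketch below shows one simple way each encoding could work. It is an assumption-laden illustration (hypothetical function names, a caller-supplied `min`/`max` range with `min <= max`, and a sign-bit rule for binary codes), not the crate's quantizers.
+
+```rust
+/// SQ8: map each value in [min, max] to one byte. Requires min <= max.
+fn sq8_encode(v: &[f32], min: f32, max: f32) -> Vec<u8> {
+    let scale = if max > min { 255.0 / (max - min) } else { 0.0 };
+    v.iter()
+        .map(|&x| ((x.clamp(min, max) - min) * scale).round() as u8)
+        .collect()
+}
+
+/// Approximate reconstruction of the original values from SQ8 codes.
+fn sq8_decode(codes: &[u8], min: f32, max: f32) -> Vec<f32> {
+    let step = (max - min) / 255.0;
+    codes.iter().map(|&c| min + c as f32 * step).collect()
+}
+
+/// Binary quantization: one bit per dimension (set when the value is positive),
+/// packed into 64-bit words.
+fn binary_encode(v: &[f32]) -> Vec<u64> {
+    let mut words = vec![0u64; (v.len() + 63) / 64];
+    for (i, &x) in v.iter().enumerate() {
+        if x > 0.0 {
+            words[i / 64] |= 1u64 << (i % 64);
+        }
+    }
+    words
+}
+
+/// Hamming distance between two packed binary codes of equal length.
+fn hamming(a: &[u64], b: &[u64]) -> u32 {
+    a.iter().zip(b).map(|(x, y)| (x ^ y).count_ones()).sum()
+}
+
+fn main() {
+    let v = [0.1_f32, -0.4, 0.9, 0.0];
+    let codes = sq8_encode(&v, -1.0, 1.0);
+    println!("sq8 codes    = {:?}", codes);
+    println!("sq8 decoded  = {:?}", sq8_decode(&codes, -1.0, 1.0));
+    let bits = binary_encode(&v);
+    println!("binary words = {:?}", bits);
+    println!("hamming(self) = {}", hamming(&bits, &bits));
+}
+```
+
+Binary codes lose the most information, which is why the re-ranking step rescores the top candidates with exact distances.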
+
+### 4. SQL Workload Benchmarks
+
+Realistic PostgreSQL scenarios:
+
+#### Quick Benchmark (`quick_benchmark.sql`)
+- 10,000 vectors, 768 dimensions
+- Sequential scan baseline
+- HNSW index build
+- Index search performance
+- Distance function comparisons
+
+#### Full Workload (`benchmark_workload.sql`)
+- 1,000,000 vectors, 1536 dimensions
+- 1,000 queries, so that latency percentiles are stable
+- P50, P99 latency measurements
+- Memory usage analysis
+- Recall accuracy testing
+- ruvector vs pgvector comparison
+
+## Understanding Results
+
+### Criterion Output
+
+```
+Distance/euclidean/scalar/768
+                        time:   [2.1234 µs 2.1456 µs 2.1678 µs]
+                        thrpt: [354.23 Melem/s 357.89 Melem/s 361.55 Melem/s]
+```
+
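+- **time**: Estimated time per iteration, shown as the lower bound, point estimate, and upper bound of the confidence interval
+- **thrpt**: Throughput in the unit configured for the benchmark (here elements per second, Melem/s), with the same three-value layout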
+
+### Comparing Implementations
+
+```bash
+# Set baseline
+cargo bench --bench distance_bench -- --save-baseline main
+
+# Make changes, then compare
+cargo bench --bench distance_bench -- --baseline main
+```
+
+### SQL Benchmark Interpretation
+
+```
+ p50_ms | p99_ms | avg_ms | min_ms | max_ms
+--------+--------+--------+--------+--------
+  0.856 |  1.234 |  0.912 |  0.654 |  2.456
+```
+
+- **p50**: Median latency (50th percentile)
+- **p99**: 99th percentile latency (99% of queries finish at or below this value)
+- **avg**: Average latency
+- **min/max**: Best and worst case
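+
+To make the percentile columns concrete, the sketch below computes them from a list of per-query latencies using the nearest-rank method. The function name `percentile` is hypothetical, and the SQL scripts may compute percentiles differently (for example with interpolation), so treat this as an illustration of the definition only.
+
+```rust
+/// Nearest-rank percentile of an ascending-sorted sample; `p` is in 0..=100.
+/// Returns None for an empty sample.
+fn percentile(sorted_ms: &[f64], p: f64) -> Option<f64> {
+    if sorted_ms.is_empty() {
+        return None;
+    }
+    // Rank is 1-based: the smallest value whose rank covers p percent of the sample.
+    let rank = ((p / 100.0) * sorted_ms.len() as f64).ceil() as usize;
+    Some(sorted_ms[rank.clamp(1, sorted_ms.len()) - 1])
+}
+
+fn main() {
+    let mut latencies_ms: Vec<f64> = vec![0.912, 0.654, 2.456, 0.856, 1.234];
+    latencies_ms.sort_by(|a, b| a.total_cmp(b));
+    let avg = latencies_ms.iter().sum::<f64>() / latencies_ms.len() as f64;
+    println!("p50 = {:?}", percentile(&latencies_ms, 50.0));
+    println!("p99 = {:?}", percentile(&latencies_ms, 99.0));
+    println!("avg = {:.3}", avg);
+}
+```
+
+A p99 far above p50 points to occasional slow queries that the average alone would hide.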
+
+## Performance Targets
+
+### Distance Functions
+
+| Operation | Dimension | Target Throughput |
+|-----------|-----------|-------------------|
+| L2 (SIMD) | 768       | > 400 Mops/s     |
+| L2 (SIMD) | 1536      | > 200 Mops/s     |
+| Cosine    | 768       | > 300 Mops/s     |
+| Inner Product | 768   | > 500 Mops/s     |
+
+### HNSW Index
+
+| Dataset Size | Build Time | Search Latency | Recall@10 |
+|--------------|------------|----------------|-----------|
+| 100K         | < 30s      | < 1ms          | > 0.95    |
+| 1M           | < 5min     | < 2ms          | > 0.95    |
+| 10M          | < 1hr      | < 5ms          | > 0.90    |
+
+### Quantization
+
+| Method  | Compression | Speedup | Recall@10 |
+|---------|-------------|---------|-----------|
+| SQ8     | 4x          | 2-3x    | > 0.95    |
+| Binary  | 32x         | 10-20x  | > 0.85    |
+| PQ(8)   | 16x         | 5-10x   | > 0.90    |
+
+## Continuous Integration
+
+The GitHub Actions workflow runs automatically on:
+
+- Pull requests touching benchmark code
+- Pushes to `main` and `develop` branches
+- Manual workflow dispatch
+
+Results are:
+- Posted as PR comments
+- Stored as artifacts (30 day retention)
+- Tracked over time on main branch
+- Compared against baseline
+
+### Triggering Manual Runs
+
+```bash
+# From GitHub UI: Actions → Benchmarks → Run workflow
+
+# Or using gh CLI
+gh workflow run benchmarks.yml
+```
+
+### Enabling SQL Benchmarks in CI
+
+SQL benchmarks are disabled by default because they are too slow for routine CI runs. Enable them via workflow dispatch:
+
+```bash
+gh workflow run benchmarks.yml -f run_sql_benchmarks=true
+```
+
+## Advanced Usage
+
+### Profiling with Criterion
+
+```bash
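+# Run each benchmark for 5 seconds in profiling mode (analysis is skipped; attach a profiler to get a flamegraph)
+cargo bench --bench distance_bench -- --profile-time=5
+
+# Emit results in the bencher-compatible output format
+cargo bench --bench distance_bench -- --output-format bencher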
+```
+
+### Custom Benchmark Parameters
+
+Edit benchmark files to adjust:
+
+- Vector dimensions
+- Dataset sizes
+- Number of queries
+- HNSW parameters (M, ef_construction, ef_search)
+- Quantization settings
+
+### Comparing with pgvector
+
+Ensure pgvector is installed:
+
+```bash
+git clone https://github.com/pgvector/pgvector.git
+cd pgvector
+make
+sudo make install
+```
+
+Then run SQL benchmarks for side-by-side comparison.
+
+## Interpreting Regressions
+
+### Performance Degradation Alert
+
+If CI fails due to a performance regression:
+
+1. **Check the comparison**: Review the baseline vs current results
+2. **Validate the change**: Re-run to confirm it is not measurement noise
+3. **Profile the code**: Use flamegraphs to identify bottlenecks
+4. **Consider trade-offs**: A correctness fix can justify a slower result
+
+### Common Causes
+
+- **SIMD disabled**: Check compiler flags
+- **Debug build**: Confirm an optimized profile is in use (`cargo bench` uses one by default)
+- **Thermal throttling**: CPU overheating in CI
+- **Cache effects**: Different data access patterns
+
+## Contributing
+
+When adding benchmarks:
+
+1. Add to appropriate `*_bench.rs` file
+2. Update this README
+3. Ensure benchmarks complete in < 5 minutes
+4. Use `black_box()` to keep the compiler from optimizing away the measured work
+5. Test both small and large inputs
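+
+The effect of `black_box()` can be sketched without any dependencies. The example below uses `std::hint::black_box` and a hand-written timing loop rather than Criterion so that it stays self-contained; the helper names `dot` and `time_dot` are hypothetical, and a real benchmark in this suite would go through Criterion's harness instead.
+
+```rust
+use std::hint::black_box;
+use std::time::Instant;
+
+/// The work being measured: a plain dot product.
+fn dot(a: &[f32], b: &[f32]) -> f32 {
+    a.iter().zip(b).map(|(x, y)| x * y).sum()
+}
+
+/// Time `iters` calls to `dot` and return the mean nanoseconds per call.
+fn time_dot(a: &[f32], b: &[f32], iters: u32) -> f64 {
+    let start = Instant::now();
+    for _ in 0..iters {
+        // black_box on the inputs hides their values from the optimizer;
+        // black_box on the output marks the result as used, so the call is kept.
+        black_box(dot(black_box(a), black_box(b)));
+    }
+    start.elapsed().as_nanos() as f64 / iters as f64
+}
+
+fn main() {
+    // Small and large inputs, matching the guideline to test both.
+    for dim in [8usize, 1536] {
+        let a = vec![1.0_f32; dim];
+        let b = vec![0.5_f32; dim];
+        println!("dim {dim}: {:.1} ns/iter", time_dot(&a, &b, 10_000));
+    }
+}
+```
+
+Without the `black_box` calls, an optimizing build may hoist or delete the loop body and report times that are unrealistically low.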
+
+## Resources
+
+- [Criterion.rs Documentation](https://bheisler.github.io/criterion.rs/book/)
+- [HNSW Paper](https://arxiv.org/abs/1603.09320)
+- [Product Quantization Paper](https://ieeexplore.ieee.org/document/5432202)
+- [pgvector Repository](https://github.com/pgvector/pgvector)
+
+## License
+
+Same as ruvector project - MIT