Squashed 'vendor/ruvector/' content from commit b64c2172
git-subtree-dir: vendor/ruvector
git-subtree-split: b64c21726f2bb37286d9ee36a7869fef60cc6900
docs/benchmarks/BENCHMARK_RESULTS.md
# RuVector Benchmark Results

**Date**: January 18, 2026
**Hardware**: Apple M4 Pro, 48 GB RAM
**OS**: macOS 26.1 (Build 25B78)
**Rust Version**: rustc 1.92.0 (ded5c06cf 2025-12-08)

---

## Table of Contents

1. [SIMD Performance (NEON vs Scalar)](#simd-performance-neon-vs-scalar)
2. [Distance Metric Benchmarks](#distance-metric-benchmarks)
3. [HNSW Search Performance](#hnsw-search-performance)
4. [Vector Insert Performance](#vector-insert-performance)
5. [Quantization Performance](#quantization-performance)
6. [System Comparison](#system-comparison)
7. [Memory Usage](#memory-usage)
8. [Methodology](#methodology)

---

## SIMD Performance (NEON vs Scalar)

### Test Configuration

- **Dimensions**: 128
- **Vectors**: 10,000
- **Queries**: 1,000
- **Total distance calculations**: 10,000,000 (10,000 vectors x 1,000 queries)

### Results

| Operation | SIMD (ms) | Scalar (ms) | Speedup |
|-----------|-----------|-------------|---------|
| **Euclidean Distance** | 114.36 | 328.25 | **2.87x** |
| **Dot Product** | 97.68 | 287.22 | **2.94x** |
| **Cosine Similarity** | 133.61 | 794.74 | **5.95x** |

### Key Findings

- NEON SIMD speeds up every distance metric tested, by 2.87x to 5.95x over the scalar code
- Cosine similarity benefits most (5.95x), likely because it combines a dot product with two norm calculations
- NEON registers are 128 bits wide, so each vector instruction processes four `f32` values

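
One plausible reason cosine similarity gains the most is that the dot product and both norms can be accumulated in a single pass over the data. The sketch below illustrates that general technique under stated assumptions; it is not RuVector's actual kernel. `cosine_three_pass` and `cosine_fused` are hypothetical names, and the four-element accumulators only mimic the four `f32` lanes of a 128-bit NEON register in portable Rust (no intrinsics are used, so any vectorization is left to the compiler).

```rust
/// Reference version: three separate passes over the inputs.
fn cosine_three_pass(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    dot / (norm_a * norm_b)
}

/// Fused version: one pass, with four independent accumulator lanes for the
/// dot product and for each squared norm. Returns NaN for a zero vector.
fn cosine_fused(a: &[f32], b: &[f32]) -> f32 {
    assert_eq!(a.len(), b.len(), "vectors must have equal length");
    let split = a.len() / 4 * 4; // largest multiple of 4 that fits
    let (mut dot, mut na, mut nb) = ([0.0f32; 4], [0.0f32; 4], [0.0f32; 4]);
    for (x, y) in a[..split].chunks_exact(4).zip(b[..split].chunks_exact(4)) {
        for i in 0..4 {
            dot[i] += x[i] * y[i];
            na[i] += x[i] * x[i];
            nb[i] += y[i] * y[i];
        }
    }
    // Horizontal reduction of the lanes, then the scalar tail.
    let mut d: f32 = dot.iter().sum();
    let mut sa: f32 = na.iter().sum();
    let mut sb: f32 = nb.iter().sum();
    for (x, y) in a[split..].iter().zip(&b[split..]) {
        d += x * y;
        sa += x * x;
        sb += y * y;
    }
    d / (sa.sqrt() * sb.sqrt())
}
```
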
---

## Distance Metric Benchmarks

### Euclidean Distance (SIMD-Optimized)

| Dimensions | Latency (ns) | Throughput |
|------------|--------------|------------|
| 128 | 14.9 | 67M ops/s |
| 384 | 55.3 | 18M ops/s |
| 768 | 115.3 | 8.7M ops/s |
| 1536 | 279.6 | 3.6M ops/s |

### Cosine Distance (SIMD-Optimized)

| Dimensions | Latency (ns) | Throughput |
|------------|--------------|------------|
| 128 | 16.4 | 61M ops/s |
| 384 | 60.4 | 17M ops/s |
| 768 | 128.8 | 7.8M ops/s |
| 1536 | 302.9 | 3.3M ops/s |

### Dot Product (SIMD-Optimized)

| Dimensions | Latency (ns) | Throughput |
|------------|--------------|------------|
| 128 | 12.0 | 83M ops/s |
| 384 | 52.7 | 19M ops/s |
| 768 | 112.2 | 8.9M ops/s |
| 1536 | 292.3 | 3.4M ops/s |

### Batch Distance Calculation

| Configuration | Latency | Throughput |
|---------------|---------|------------|
| 1000 vectors x 384 dimensions | 161.2 us | 6.2M distances/s |

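
The throughput columns above are consistent with taking the reciprocal of the measured latency. A minimal sketch of that arithmetic, using hypothetical helper names that are not part of the benchmark suite:

```rust
/// Operations per second implied by a per-operation latency in nanoseconds.
fn ops_per_sec_from_ns(latency_ns: f64) -> f64 {
    1e9 / latency_ns
}

/// Items per second for a batch of `items` completed in `batch_us` microseconds.
fn items_per_sec_from_batch_us(items: u64, batch_us: f64) -> f64 {
    items as f64 / (batch_us * 1e-6)
}
```
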
---

## HNSW Search Performance

### Search Latency by k (top-k results)

| k | p50 Latency (us) | Throughput |
|---|------------------|------------|
| 1 | 18.9 | 53K queries/s |
| 10 | 25.2 | 40K queries/s |
| 100 | 77.9 | 13K queries/s |

### Index Configuration

- **Index Size**: 10,000 vectors
- **Dimensions**: 384 (standard embedding size)
- **ef_construction**: default (HNSW parameter)

---

## Vector Insert Performance

### Single Insert Throughput

| Dimensions | Latency (ms) | Throughput |
|------------|--------------|------------|
| 128 | 4.41 | 227 inserts/s |
| 256 | 4.63 | 216 inserts/s |
| 512 | 5.23 | 191 inserts/s |

### Batch Insert Throughput

| Batch Size | Latency (ms) | Throughput |
|------------|--------------|------------|
| 100 | 34.1 | 2,928 inserts/s |
| 500 | 72.8 | 6,865 inserts/s |
| 1000 | 152.0 | 6,580 inserts/s |

### Key Findings

- Batch inserts achieve up to **30x higher throughput** than single inserts (6,865 vs 227 inserts/s)
- Throughput levels off at batch sizes of 500 to 1,000 vectors, with the measured peak at 500
- HNSW index construction is the primary bottleneck

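
The roughly 30x figure can be reproduced from the two tables by comparing the amortized per-vector cost of a batch with the single-insert latency. The sketch below assumes the batch rows are compared against the 128-dimension single-insert latency (the batch table does not state its dimensionality); the function names are hypothetical.

```rust
/// Amortized per-vector cost of a batch insert, in milliseconds.
fn amortized_ms(batch_latency_ms: f64, batch_size: u32) -> f64 {
    batch_latency_ms / batch_size as f64
}

/// How many times cheaper one vector is inside a batch than as a single insert.
fn batch_speedup(single_ms: f64, batch_latency_ms: f64, batch_size: u32) -> f64 {
    single_ms / amortized_ms(batch_latency_ms, batch_size)
}
```
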
---

## Quantization Performance

### Scalar Quantization (INT8, 4x compression)

| Dimensions | Encode (ns) | Decode (ns) | Distance (ns) |
|------------|-------------|-------------|---------------|
| 384 | 213 | 215 | 31 |
| 768 | 427 | 425 | 63 |
| 1536 | 845 | 835 | 126 |

### Binary Quantization (32x compression)

| Dimensions | Encode (ns) | Decode (ns) | Hamming Distance (ns) |
|------------|-------------|-------------|----------------------|
| 384 | 208 | 215 | 0.9 |
| 768 | 427 | 425 | 1.8 |
| 1536 | 845 | 835 | 3.8 |

### Key Findings

- Binary quantization brings Hamming distance down to **0.9 ns** at 384 dimensions, and it stays under 4 ns at 1536 dimensions
- Scalar (INT8) quantized distance is about **1.8x to 2.2x faster** than the full-precision SIMD Euclidean distance at the same dimensionality (for example, 31 ns vs 55.3 ns at 384 dimensions)
- Encoding and decoding each stay under 1 us per vector at every tested dimensionality

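
To make the two schemes concrete, here is a minimal sketch of the textbook versions of each. It is an illustration under stated assumptions, not RuVector's implementation: the type and function names are hypothetical, the scalar variant uses a per-vector min/max range mapped to unsigned 8-bit codes, and the binary variant keeps one sign bit per dimension. It also shows why Hamming distance is so cheap: a 384-dimension vector packs into six `u64` words, and the distance is one XOR plus one popcount per word.

```rust
/// Scalar quantization sketch: map each f32 linearly from [min, max] to 0..=255.
struct ScalarQuantized {
    codes: Vec<u8>,
    min: f32,
    scale: f32, // width of one quantization step
}

fn scalar_encode(v: &[f32]) -> ScalarQuantized {
    let min = v.iter().copied().fold(f32::INFINITY, f32::min);
    let max = v.iter().copied().fold(f32::NEG_INFINITY, f32::max);
    // Guard against a constant (or empty) vector, where max <= min.
    let scale = if max > min { (max - min) / 255.0 } else { 1.0 };
    let codes = v.iter().map(|&x| ((x - min) / scale).round() as u8).collect();
    ScalarQuantized { codes, min, scale }
}

/// Approximate reconstruction; the error is at most half a step per element.
fn scalar_decode(q: &ScalarQuantized) -> Vec<f32> {
    q.codes.iter().map(|&c| q.min + c as f32 * q.scale).collect()
}

/// Binary quantization sketch: one sign bit per dimension, packed into u64 words.
fn binary_encode(v: &[f32]) -> Vec<u64> {
    let mut words = vec![0u64; (v.len() + 63) / 64];
    for (i, &x) in v.iter().enumerate() {
        if x > 0.0 {
            words[i / 64] |= 1u64 << (i % 64);
        }
    }
    words
}

/// Hamming distance: XOR the words and count the set bits.
fn hamming(a: &[u64], b: &[u64]) -> u32 {
    a.iter().zip(b).map(|(x, y)| (x ^ y).count_ones()).sum()
}
```
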
---

## System Comparison

### RuVector vs Alternatives (Simulated)

| System | QPS | p50 (ms) | p99 (ms) | Speedup vs Python |
|--------|-----|----------|----------|-------------------|
| **RuVector (Optimized)** | 1,216 | 0.78 | 0.78 | **15.7x** |
| **RuVector (No Quant)** | 1,218 | 0.78 | 0.78 | **15.7x** |
| Python Baseline | 77 | 11.88 | 11.88 | 1.0x |
| Brute-Force | 12 | 77.76 | 77.76 | 0.2x |

### Test Configuration

- **Vectors**: 10,000
- **Dimensions**: 384
- **Queries**: 100
- **Top-k**: 10

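
For context on the Brute-Force row, the sketch below shows what an exhaustive top-k search typically looks like, assuming that row refers to a full scan of all stored vectors. The function names are hypothetical and this is not the code used by the comparison benchmark.

```rust
/// Squared Euclidean distance; it ranks neighbours the same way as the true distance.
fn squared_l2(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| (x - y) * (x - y)).sum()
}

/// Exhaustive top-k: score every stored vector, sort by distance,
/// and return the indices of the k nearest.
fn brute_force_top_k(query: &[f32], vectors: &[Vec<f32>], k: usize) -> Vec<usize> {
    let mut scored: Vec<(f32, usize)> = vectors
        .iter()
        .enumerate()
        .map(|(id, v)| (squared_l2(query, v), id))
        .collect();
    // total_cmp gives a total order on f32, so NaN cannot make the sort panic.
    scored.sort_by(|a, b| a.0.total_cmp(&b.0));
    scored.into_iter().take(k).map(|(_, id)| id).collect()
}
```
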
---

## Memory Usage

### Memory Efficiency by Quantization

| Quantization | Compression | Memory per 1M vectors (384D) |
|--------------|-------------|------------------------------|
| None (f32) | 1x | 1,465 MiB (1.43 GiB) |
| Scalar (INT8) | 4x | 366 MiB |
| INT4 | 8x | 183 MiB |
| Binary | 32x | 46 MiB |

### HNSW Index Overhead

- Graph structure: ~100 bytes per vector (average)
- Total memory per vector: vector_size + 100 bytes

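
The table values follow from the bits stored per dimension, and they match when sizes are expressed in units of 1,048,576 bytes. A minimal sketch of the calculation, including the ~100 bytes per vector of graph overhead quoted above; the enum and function names are hypothetical:

```rust
#[derive(Clone, Copy)]
enum Quantization {
    F32,
    Int8,
    Int4,
    Binary,
}

impl Quantization {
    /// Bits stored per vector dimension.
    fn bits_per_dim(self) -> u64 {
        match self {
            Quantization::F32 => 32,
            Quantization::Int8 => 8,
            Quantization::Int4 => 4,
            Quantization::Binary => 1,
        }
    }
}

/// Bytes needed for the vector payloads alone.
fn vector_bytes(n: u64, dims: u64, q: Quantization) -> u64 {
    n * dims * q.bits_per_dim() / 8
}

/// Payload plus an assumed average of 100 bytes of HNSW graph data per vector.
fn total_bytes(n: u64, dims: u64, q: Quantization) -> u64 {
    vector_bytes(n, dims, q) + n * 100
}

/// Convert bytes to mebibytes (1 MiB = 1,048,576 bytes).
fn to_mib(bytes: u64) -> f64 {
    bytes as f64 / (1024.0 * 1024.0)
}
```
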
---

## Methodology

### Benchmark Environment

- All benchmarks run in release mode (`--release`)
- Criterion.rs used for statistical sampling (100 samples per benchmark)
- NEON SIMD auto-detected and enabled on Apple Silicon
- Warmed cache for consistent results

### How to Reproduce

```bash
# SIMD NEON Benchmark
cargo run --example neon_benchmark --release -p ruvector-core

# Criterion Benchmarks
cargo bench -p ruvector-core --bench distance_metrics
cargo bench -p ruvector-core --bench hnsw_search
cargo bench -p ruvector-core --bench quantization_bench
cargo bench -p ruvector-core --bench real_benchmark

# Comparison Benchmark
cargo run -p ruvector-bench --bin comparison-benchmark --release -- \
  --num-vectors 10000 --queries 100 --dimensions 384

# Run all benchmarks with CI script
./scripts/run_benchmarks.sh
```

### Performance Considerations

1. **SIMD Optimization**: NEON provides a 2.9x to 6.0x speedup over scalar code on the M4 Pro
2. **Quantization**: INT8 provides 4x compression, typically with little accuracy loss; verify recall on your own data
3. **Batch Operations**: Prefer batch inserts for bulk data loading
4. **Index Tuning**: Adjust `ef_construction` and `ef_search` to trade recall against speed

---

## Appendix: Raw Benchmark Data

### Criterion JSON Location

```
target/criterion/
```

### Comparison Benchmark Output

```
bench_results/comparison_benchmark.json
bench_results/comparison_benchmark.csv
bench_results/comparison_benchmark.md
```

---

*Generated by RuVector Benchmark Suite*