# RuVector Benchmark Results

- Date: January 18, 2026
- Hardware: Apple M4 Pro, 48GB RAM
- OS: macOS 26.1 (Build 25B78)
- Rust Version: rustc 1.92.0 (ded5c06cf 2025-12-08)
## Table of Contents

- SIMD Performance (NEON vs Scalar)
- Distance Metric Benchmarks
- HNSW Search Performance
- Vector Insert Performance
- Quantization Performance
- System Comparison
- Memory Usage
- Methodology
## SIMD Performance (NEON vs Scalar)

### Test Configuration

- Dimensions: 128
- Vectors: 10,000
- Queries: 1,000
- Total distance calculations: 10,000,000
### Results

| Operation | SIMD (ms) | Scalar (ms) | Speedup |
| --- | --- | --- | --- |
| Euclidean Distance | 114.36 | 328.25 | 2.87x |
| Dot Product | 97.68 | 287.22 | 2.94x |
| Cosine Similarity | 133.61 | 794.74 | 5.95x |
### Key Findings

- NEON SIMD provides significant speedups across all distance metrics
- Cosine similarity benefits most (5.95x) because it fuses the dot product and both norm accumulations into a single pass
- The M4 Pro's NEON unit processes four f32 values per 128-bit instruction
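The four-floats-per-instruction point can be illustrated with a portable sketch: keeping four independent partial sums mirrors the NEON lane structure and gives the compiler an easy path to 128-bit vectorization. This is an illustrative example, not RuVector's actual kernel.

```rust
/// Squared Euclidean distance with four independent accumulators,
/// mirroring how a NEON kernel processes four f32 lanes per instruction.
/// Illustrative sketch only, not RuVector's actual implementation.
fn euclidean_sq_4lane(a: &[f32], b: &[f32]) -> f32 {
    assert_eq!(a.len(), b.len());
    let mut acc = [0.0f32; 4];
    let chunks = a.len() / 4;
    for i in 0..chunks {
        for lane in 0..4 {
            let d = a[4 * i + lane] - b[4 * i + lane];
            acc[lane] += d * d; // each lane accumulates independently
        }
    }
    // Reduce the four lanes, then handle a tail shorter than 4 elements.
    let mut sum: f32 = acc.iter().sum();
    for j in (chunks * 4)..a.len() {
        let d = a[j] - b[j];
        sum += d * d;
    }
    sum
}

fn main() {
    let a = vec![1.0f32; 128];
    let b = vec![0.0f32; 128];
    // 128 unit differences -> squared distance of 128
    println!("{}", euclidean_sq_4lane(&a, &b)); // prints 128
}
```

The independent accumulators matter: a single running sum would create a loop-carried dependency that blocks vectorization of the floating-point adds.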
## Distance Metric Benchmarks

### Euclidean Distance (SIMD-Optimized)

| Dimensions | Latency (ns) | Throughput |
| --- | --- | --- |
| 128 | 14.9 | 67M ops/s |
| 384 | 55.3 | 18M ops/s |
| 768 | 115.3 | 8.7M ops/s |
| 1536 | 279.6 | 3.6M ops/s |
### Cosine Distance (SIMD-Optimized)

| Dimensions | Latency (ns) | Throughput |
| --- | --- | --- |
| 128 | 16.4 | 61M ops/s |
| 384 | 60.4 | 17M ops/s |
| 768 | 128.8 | 7.8M ops/s |
| 1536 | 302.9 | 3.3M ops/s |
### Dot Product (SIMD-Optimized)

| Dimensions | Latency (ns) | Throughput |
| --- | --- | --- |
| 128 | 12.0 | 83M ops/s |
| 384 | 52.7 | 19M ops/s |
| 768 | 112.2 | 8.9M ops/s |
| 1536 | 292.3 | 3.4M ops/s |
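The cosine numbers track the dot-product numbers closely because cosine similarity is a dot product plus two norm accumulations computed in the same pass, and all three accumulations vectorize identically. A minimal sketch of the fused loop (illustrative, not RuVector's kernel):

```rust
/// Cosine similarity in a single pass: the dot product and both squared
/// norms are accumulated together per element, which is why SIMD helps
/// cosine the most (three fused accumulations per iteration). Sketch only.
fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    assert_eq!(a.len(), b.len());
    let (mut dot, mut na, mut nb) = (0.0f32, 0.0f32, 0.0f32);
    for (&x, &y) in a.iter().zip(b) {
        dot += x * y; // dot product accumulator
        na += x * x;  // squared norm of a
        nb += y * y;  // squared norm of b
    }
    dot / (na.sqrt() * nb.sqrt())
}

fn main() {
    // Parallel vectors -> similarity close to 1.0
    println!("{}", cosine_similarity(&[1.0, 2.0], &[2.0, 4.0]));
}
```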
### Batch Distance Calculation

| Configuration | Latency | Throughput |
| --- | --- | --- |
| 1000 vectors x 384 dimensions | 161.2 us | 6.2M distances/s |
## HNSW Search Performance

### Search Latency by k (top-k results)

| k | p50 Latency (us) | Throughput |
| --- | --- | --- |
| 1 | 18.9 | 53K queries/s |
| 10 | 25.2 | 40K queries/s |
| 100 | 77.9 | 13K queries/s |
### Index Configuration

- Index Size: 10,000 vectors
- Dimensions: 384 (standard embedding size)
- `ef_construction`: default (HNSW build-time parameter)
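Latency grows with k because the search must maintain a larger candidate set throughout the traversal. The effect is visible even in a brute-force top-k (a sketch unrelated to RuVector's HNSW internals): the candidate heap holds k entries, so each of the n comparisons pays O(log k).

```rust
use std::collections::BinaryHeap;

/// Brute-force top-k nearest neighbors by squared Euclidean distance.
/// A max-heap of size k keeps the current worst candidate on top, so each
/// insertion/eviction is O(log k) -- larger k means more work per vector,
/// matching the latency growth in the table above. Illustrative sketch.
fn top_k(query: &[f32], vectors: &[Vec<f32>], k: usize) -> Vec<usize> {
    // (distance as ordered bit pattern, index); BinaryHeap is a max-heap.
    let mut heap: BinaryHeap<(u32, usize)> = BinaryHeap::new();
    for (i, v) in vectors.iter().enumerate() {
        let d: f32 = query.iter().zip(v).map(|(a, b)| (a - b) * (a - b)).sum();
        heap.push((d.to_bits(), i)); // non-negative f32 bits preserve order
        if heap.len() > k {
            heap.pop(); // evict the current farthest candidate
        }
    }
    let mut ids: Vec<usize> = heap.into_iter().map(|(_, i)| i).collect();
    ids.sort();
    ids
}

fn main() {
    let data = vec![vec![0.0, 0.0], vec![1.0, 1.0], vec![5.0, 5.0]];
    println!("{:?}", top_k(&[0.1, 0.1], &data, 2)); // indices of the 2 nearest
}
```

HNSW avoids the O(n) scan entirely, but its `ef_search` candidate list plays the same role as the heap here: widening it improves recall at the cost of more distance evaluations.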
## Vector Insert Performance

### Single Insert Throughput

| Dimensions | Latency (ms) | Throughput |
| --- | --- | --- |
| 128 | 4.41 | 227 inserts/s |
| 256 | 4.63 | 216 inserts/s |
| 512 | 5.23 | 191 inserts/s |
### Batch Insert Throughput

| Batch Size | Latency (ms) | Throughput |
| --- | --- | --- |
| 100 | 34.1 | 2,928 inserts/s |
| 500 | 72.8 | 6,865 inserts/s |
| 1000 | 152.0 | 6,580 inserts/s |
### Key Findings

- Batch inserts achieve roughly 30x higher throughput than single inserts (6,865 vs 227 inserts/s)
- The optimal batch size is around 500-1000 vectors
- HNSW graph construction is the primary bottleneck
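The batch advantage comes from amortizing per-call overhead and building graph links in bulk. Independently of RuVector's actual insert API (which is not shown in this report), the loading pattern is simply chunking the input; the `insert_batch` call mentioned in the comment below is a hypothetical stand-in:

```rust
/// Number of insert calls needed to load `total` vectors in batches of
/// `batch_size` (the last batch may be partial). Illustrative helper.
fn batch_count(total: usize, batch_size: usize) -> usize {
    (total + batch_size - 1) / batch_size
}

fn main() {
    // 2,500 vectors loaded in batches of 500, the measured sweet spot.
    let vectors: Vec<u32> = (0..2_500).collect();
    let mut calls = 0;
    for batch in vectors.chunks(500) {
        // index.insert_batch(batch) would go here -- hypothetical call,
        // not RuVector's confirmed API.
        assert!(batch.len() <= 500);
        calls += 1;
    }
    assert_eq!(calls, batch_count(vectors.len(), 500));
    println!("{calls} batch calls instead of {} single inserts", vectors.len());
}
```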
## Quantization Performance

### Scalar Quantization (INT8, 4x compression)

| Dimensions | Encode (ns) | Decode (ns) | Distance (ns) |
| --- | --- | --- | --- |
| 384 | 213 | 215 | 31 |
| 768 | 427 | 425 | 63 |
| 1536 | 845 | 835 | 126 |
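Scalar quantization maps each f32 onto one of 256 levels spanning the vector's [min, max] range, so a 4-byte float becomes a 1-byte code: the 4x compression figure. A minimal sketch of the encode/decode round trip (RuVector's exact scheme may differ, e.g. per-dimension ranges):

```rust
/// Encode f32 values to u8 codes over the slice's [min, max] range.
/// One byte per dimension gives the 4x compression quoted above.
fn sq_encode(v: &[f32]) -> (Vec<u8>, f32, f32) {
    let min = v.iter().cloned().fold(f32::INFINITY, f32::min);
    let max = v.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let scale = if max > min { 255.0 / (max - min) } else { 0.0 };
    let codes = v.iter().map(|&x| ((x - min) * scale).round() as u8).collect();
    (codes, min, max) // min/max must be stored to decode
}

/// Decode codes back to approximate f32 values (lossy: 8-bit resolution).
fn sq_decode(codes: &[u8], min: f32, max: f32) -> Vec<f32> {
    let step = if max > min { (max - min) / 255.0 } else { 0.0 };
    codes.iter().map(|&c| min + c as f32 * step).collect()
}

fn main() {
    let v = [0.0f32, 0.5, 1.0];
    let (codes, min, max) = sq_encode(&v);
    println!("{:?}", codes); // [0, 128, 255]
    println!("{:?}", sq_decode(&codes, min, max));
}
```

Distance between two INT8 codes can then be computed directly in the quantized domain, which is where the 31 ns figure comes from: integer SIMD processes far more lanes per instruction than f32.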
### Binary Quantization (32x compression)

| Dimensions | Encode (ns) | Decode (ns) | Hamming Distance (ns) |
| --- | --- | --- | --- |
| 384 | 208 | 215 | 0.9 |
| 768 | 427 | 425 | 1.8 |
| 1536 | 845 | 835 | 3.8 |
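Binary quantization keeps one bit per dimension (here, the sign of the value), so a 384D vector fits in 48 bytes and distance reduces to XOR plus a hardware popcount per 64-bit word: six instructions for 384 dimensions, which is how Hamming distance lands near a nanosecond. A sketch (RuVector's bit layout and thresholding may differ):

```rust
/// Pack the sign bits of an f32 vector into u64 words: 32 bits per f32
/// become 1 bit, the 32x compression quoted above. Illustrative sketch.
fn binarize(v: &[f32]) -> Vec<u64> {
    let mut words = vec![0u64; (v.len() + 63) / 64];
    for (i, &x) in v.iter().enumerate() {
        if x > 0.0 {
            words[i / 64] |= 1 << (i % 64); // set bit i for positive values
        }
    }
    words
}

/// Hamming distance: XOR the words, then count differing bits with the
/// hardware popcount instruction behind `count_ones`.
fn hamming(a: &[u64], b: &[u64]) -> u32 {
    a.iter().zip(b).map(|(x, y)| (x ^ y).count_ones()).sum()
}

fn main() {
    let a = binarize(&[1.0, -1.0, 1.0, -1.0]);
    let b = binarize(&[1.0, 1.0, -1.0, -1.0]);
    println!("{}", hamming(&a, &b)); // 2 differing bits
}
```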
### Key Findings

- Binary quantization brings Hamming distance below 1 ns at 384 dimensions (0.9 ns)
- Against full-precision Euclidean at 384D (55.3 ns), scalar quantization is roughly 1.8x faster (31 ns) and binary Hamming is roughly 60x faster (0.9 ns)
- Combined with SIMD, quantized distance kernels make index scans far cheaper than full-precision search
## System Comparison

### RuVector vs Alternatives (Simulated)

| System | QPS | p50 (ms) | p99 (ms) | Speedup vs Python |
| --- | --- | --- | --- | --- |
| RuVector (Optimized) | 1,216 | 0.78 | 0.78 | 15.7x |
| RuVector (No Quant) | 1,218 | 0.78 | 0.78 | 15.7x |
| Python Baseline | 77 | 11.88 | 11.88 | 1.0x |
| Brute-Force | 12 | 77.76 | 77.76 | 0.2x |
### Test Configuration

- Vectors: 10,000
- Dimensions: 384
- Queries: 100
- Top-k: 10
## Memory Usage

### Memory Efficiency by Quantization

| Quantization | Compression | Memory per 1M vectors (384D) |
| --- | --- | --- |
| None (f32) | 1x | 1.46 GB |
| Scalar (INT8) | 4x | 366 MB |
| INT4 | 8x | 183 MB |
| Binary | 32x | 46 MB |
### HNSW Index Overhead

- Graph structure: ~100 bytes per vector (average)
- Total memory per vector: vector_size + ~100 bytes
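The totals above can be checked directly: a 384D f32 vector is 384 × 4 = 1,536 bytes, plus roughly 100 bytes of graph links. A quick arithmetic sketch of the resulting footprint (the 100-byte overhead is the average quoted above, not an exact figure):

```rust
/// Approximate resident bytes for an HNSW index: raw f32 vector bytes
/// scaled by the quantization compression factor, plus ~100 bytes of
/// graph links per vector (the average overhead quoted above).
fn index_bytes(n_vectors: u64, dims: u64, compression: u64) -> u64 {
    let vector_bytes = dims * 4 / compression; // f32 baseline, compressed
    n_vectors * (vector_bytes + 100)
}

fn main() {
    // 1M x 384D, full precision: vectors + graph overhead
    println!("{}", index_bytes(1_000_000, 384, 1)); // 1,636,000,000 bytes
    // Binary quantization (32x): 48-byte codes + links per vector
    println!("{}", index_bytes(1_000_000, 384, 32)); // 148,000,000 bytes
}
```

Note how at 32x compression the graph links (100 bytes) dominate the 48-byte codes, so quantization alone cannot shrink an HNSW index below its graph overhead.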
## Methodology

### Benchmark Environment

- All benchmarks run in release mode (`--release`)
- Criterion.rs used for statistical sampling (100 samples per benchmark)
- NEON SIMD auto-detected and enabled on Apple Silicon
- Caches warmed before measurement for consistent results
### How to Reproduce

Run `cargo bench` from the repository root; Criterion benchmarks build with release optimizations, and HTML reports plus raw samples are written under `target/criterion/`.
### Performance Considerations

- SIMD Optimization: The M4 Pro's NEON unit provides a 2.9-6x speedup on distance kernels
- Quantization: INT8 gives 4x compression with minimal accuracy loss
- Batch Operations: Prefer batch inserts for bulk data loading
- Index Tuning: Adjust `ef_construction` and `ef_search` to trade recall for speed
## Appendix: Raw Benchmark Data

### Criterion JSON Location

Criterion stores raw sample data per benchmark under `target/criterion/` after each run.

### Comparison Benchmark Output
*Generated by RuVector Benchmark Suite*