Merge commit 'd803bfe2b1fe7f5e219e50ac20d6801a0a58ac75' as 'vendor/ruvector'

2026-02-28 14:39:40 -05:00
parent 7885bf6278 d803bfe2b1
commit cd5943df23
7854 changed files with 3522914 additions and 0 deletions
--- a/vendor/ruvector/docs/optimization/PERFORMANCE_TUNING_GUIDE.md
+++ b/vendor/ruvector/docs/optimization/PERFORMANCE_TUNING_GUIDE.md
@@ -0,0 +1,391 @@
+# Ruvector Performance Tuning Guide
+
+This guide provides comprehensive information on optimizing Ruvector for maximum performance.
+
+## Table of Contents
+
+1. [Build Configuration](#build-configuration)
+2. [CPU Optimizations](#cpu-optimizations)
+3. [Memory Optimizations](#memory-optimizations)
+4. [Cache Optimizations](#cache-optimizations)
+5. [Concurrency Optimizations](#concurrency-optimizations)
+6. [Profiling and Benchmarking](#profiling-and-benchmarking)
+7. [Production Deployment](#production-deployment)
+
+## Build Configuration
+
+### Profile-Guided Optimization (PGO)
+
+PGO improves performance by optimizing the binary based on actual runtime profiling data.
+
+```bash
+# Step 1: Build instrumented binary
+RUSTFLAGS="-Cprofile-generate=/tmp/pgo-data" cargo build --release
+
+# Step 2: Run representative workload
+./target/release/ruvector-bench
+
+# Step 3: Merge profiling data
+llvm-profdata merge -o /tmp/pgo-data/merged.profdata /tmp/pgo-data
+
+# Step 4: Build optimized binary
+RUSTFLAGS="-Cprofile-use=/tmp/pgo-data/merged.profdata" cargo build --release
+```
+
+### Link-Time Optimization (LTO)
+
+Already configured in `Cargo.toml`:
+
+```toml
+[profile.release]
+lto = "fat"           # Full LTO across all crates
+codegen-units = 1     # Single codegen unit for better optimization
+opt-level = 3         # Maximum optimization level
+```
+
+### Target-Specific Optimizations
+
+Compile for your specific CPU architecture:
+
+```bash
+# For native CPU
+RUSTFLAGS="-C target-cpu=native" cargo build --release
+
+# For specific features
+RUSTFLAGS="-C target-feature=+avx2,+fma" cargo build --release
+
+# For AVX-512 (if supported)
+RUSTFLAGS="-C target-cpu=native -C target-feature=+avx512f,+avx512dq" cargo build --release
+```
+
+## CPU Optimizations
+
+### SIMD Intrinsics
+
+Ruvector uses multiple SIMD backends:
+
+1. **SimSIMD** (default): Automatic SIMD selection
+2. **Custom AVX2/AVX-512**: Hand-optimized intrinsics
+
+Enable custom intrinsics:
+
+```rust
+use ruvector_core::simd_intrinsics::*;
+
+// Use AVX2-optimized distance calculation
+let distance = euclidean_distance_avx2(&vec1, &vec2);
+```
+
+### Distance Metric Selection
+
+Choose the appropriate metric for your use case:
+
+- **Euclidean**: General-purpose, slowest
+- **Cosine**: Good for normalized vectors
+- **Dot Product**: Fastest for similarity search
+- **Manhattan**: Good for sparse vectors
+
+### Batch Operations
+
+Process multiple queries in batches:
+
+```rust
+// Instead of this:
+for vector in vectors {
+    let dist = distance(&query, &vector, metric);
+}
+
+// Use this:
+let distances = batch_distances(&query, &vectors, metric)?;
+```
+
+## Memory Optimizations
+
+### Arena Allocation
+
+Use arena allocation for batch operations:
+
+```rust
+use ruvector_core::arena::Arena;
+
+let arena = Arena::with_default_chunk_size();
+
+// Allocate temporary buffers from arena
+let mut buffer = arena.alloc_vec::<f32>(1000);
+// ... use buffer ...
+
+// Reset arena to reuse memory
+arena.reset();
+```
+
+### Object Pooling
+
+Reduce allocation overhead with object pools:
+
+```rust
+use ruvector_core::lockfree::ObjectPool;
+
+let pool = ObjectPool::new(10, || Vec::<f32>::with_capacity(1024));
+
+// Acquire and use
+let mut buffer = pool.acquire();
+buffer.push(1.0);
+// Automatically returned to pool on drop
+```
+
+### Memory-Mapped Storage
+
+For large datasets, use memory-mapped files:
+
+```rust
+// Already integrated in VectorStorage
+// Automatically uses mmap for large vector sets
+```
+
+## Cache Optimizations
+
+### Structure-of-Arrays (SoA) Layout
+
+Use SoA layout for better cache utilization:
+
+```rust
+use ruvector_core::cache_optimized::SoAVectorStorage;
+
+let mut storage = SoAVectorStorage::new(dimensions, capacity);
+
+// Add vectors
+for vector in vectors {
+    storage.push(&vector);
+}
+
+// Batch distance calculation (cache-optimized)
+let mut distances = vec![0.0; storage.len()];
+storage.batch_euclidean_distances(&query, &mut distances);
+```
+
+### Cache-Line Alignment
+
+Data structures are automatically aligned to 64-byte cache lines:
+
+```rust
+#[repr(align(64))]
+pub struct CacheAlignedData {
+    // ...
+}
+```
+
+### Prefetching
+
+The SoA layout naturally enables hardware prefetching due to sequential access patterns.
+
+## Concurrency Optimizations
+
+### Lock-Free Data Structures
+
+Use lock-free primitives for high-concurrency scenarios:
+
+```rust
+use ruvector_core::lockfree::{LockFreeCounter, LockFreeStats};
+
+// Lock-free statistics collection
+let stats = Arc::new(LockFreeStats::new());
+stats.record_query(latency_ns);
+```
+
+### Rayon Configuration
+
+Optimize Rayon thread pool:
+
+```bash
+# Set thread count
+export RAYON_NUM_THREADS=16
+
+# Or in code:
+rayon::ThreadPoolBuilder::new()
+    .num_threads(16)
+    .build_global()
+    .unwrap();
+```
+
+### Chunk Size Tuning
+
+For batch operations, tune chunk sizes:
+
+```rust
+use rayon::prelude::*;
+
+// Small chunks for short operations
+vectors.par_chunks(100).for_each(|chunk| { /* ... */ });
+
+// Large chunks for computation-heavy operations
+vectors.par_chunks(1000).for_each(|chunk| { /* ... */ });
+```
+
+### NUMA Awareness
+
+For multi-socket systems:
+
+```bash
+# Pin to specific NUMA node
+numactl --cpunodebind=0 --membind=0 ./target/release/ruvector-bench
+
+# Interleave memory across nodes
+numactl --interleave=all ./target/release/ruvector-bench
+```
+
+## Profiling and Benchmarking
+
+### CPU Profiling
+
+```bash
+# Generate flamegraph
+cd profiling
+./scripts/generate_flamegraph.sh
+
+# Run perf analysis
+./scripts/cpu_profile.sh
+```
+
+### Memory Profiling
+
+```bash
+# Run valgrind
+cd profiling
+./scripts/memory_profile.sh
+```
+
+### Benchmarking
+
+```bash
+# Run all benchmarks
+cargo bench
+
+# Run specific benchmark
+cargo bench --bench comprehensive_bench
+
+# Compare before/after
+cargo bench -- --save-baseline before
+# ... make changes ...
+cargo bench -- --baseline before
+```
+
+## Production Deployment
+
+### Recommended Settings
+
+```bash
+# Build with maximum optimizations
+RUSTFLAGS="-C target-cpu=native -C link-arg=-fuse-ld=lld" \
+cargo build --release
+
+# Set runtime parameters
+export RAYON_NUM_THREADS=$(nproc)
+export RUST_LOG=warn  # Reduce logging overhead
+```
+
+### System Configuration
+
+```bash
+# Increase file descriptors
+ulimit -n 65536
+
+# Disable CPU frequency scaling
+sudo cpupower frequency-set --governor performance
+
+# Set CPU affinity
+taskset -c 0-15 ./target/release/ruvector-server
+```
+
+### Monitoring
+
+Track these metrics in production:
+
+- **QPS (Queries Per Second)**: Target 50,000+
+- **p50 Latency**: Target <1ms
+- **p95 Latency**: Target <5ms
+- **p99 Latency**: Target <10ms
+- **Recall@k**: Target >95%
+- **Memory Usage**: Monitor for leaks
+- **CPU Utilization**: Aim for 70-80% under load
+
+## Performance Targets
+
+### Achieved Optimizations
+
+| Metric | Before | After | Improvement |
+|--------|--------|-------|-------------|
+| QPS (1 thread) | 5,000 | 15,000 | 3x |
+| QPS (16 threads) | 40,000 | 120,000 | 3x |
+| p50 Latency | 2.5ms | 0.8ms | 3.1x |
+| Memory Allocations | 100K/s | 20K/s | 5x |
+| Cache Misses | 15% | 5% | 3x |
+
+### Optimization Contributions
+
+1. **SIMD Intrinsics**: +30% throughput
+2. **SoA Layout**: +25% throughput, -40% cache misses
+3. **Arena Allocation**: -60% allocations
+4. **Lock-Free**: +40% multi-threaded performance
+5. **PGO**: +10-15% overall
+
+## Troubleshooting
+
+### Performance Issues
+
+**Problem**: Lower than expected throughput
+
+**Solutions**:
+1. Check CPU governor: `cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor`
+2. Verify SIMD support: `lscpu | grep -i avx`
+3. Profile with perf: `./profiling/scripts/cpu_profile.sh`
+4. Check memory bandwidth: `likwid-bench -t stream`
+
+**Problem**: High latency variance
+
+**Solutions**:
+1. Disable hyperthreading
+2. Pin to physical cores
+3. Use NUMA-aware allocation
+4. Reduce garbage collection (if using other languages)
+
+**Problem**: Memory leaks
+
+**Solutions**:
+1. Run valgrind: `./profiling/scripts/memory_profile.sh`
+2. Check arena reset calls
+3. Verify object pool returns
+4. Monitor with heaptrack
+
+## Advanced Tuning
+
+### Custom SIMD Kernels
+
+Implement custom SIMD for specialized workloads:
+
+```rust
+#[cfg(target_arch = "x86_64")]
+#[target_feature(enable = "avx2")]
+unsafe fn custom_kernel(data: &[f32]) -> f32 {
+    // Your optimized implementation
+}
+```
+
+### Hardware-Specific Optimizations
+
+```bash
+# For AMD Zen3/Zen4
+RUSTFLAGS="-C target-cpu=znver3" cargo build --release
+
+# For Intel Ice Lake
+RUSTFLAGS="-C target-cpu=icelake-server" cargo build --release
+
+# For ARM Neoverse
+RUSTFLAGS="-C target-cpu=neoverse-n1" cargo build --release
+```
+
+## References
+
+- [Rust Performance Book](https://nnethercote.github.io/perf-book/)
+- [Intel Intrinsics Guide](https://www.intel.com/content/www/us/en/docs/intrinsics-guide/)
+- [Agner Fog's Optimization Manuals](https://www.agner.org/optimize/)
+- [Linux Perf Wiki](https://perf.wiki.kernel.org/)