# Performance Optimizations Guide

This document describes the performance optimizations available in ruvector-scipix and how to use them effectively.

## Overview

The optimization module provides several strategies for improving performance:

1. **SIMD operations**: vectorized image processing (AVX2, AVX-512, NEON)
2. **Parallel processing**: multi-threaded execution using Rayon
3. **Memory optimizations**: object pooling, memory mapping, zero-copy views
4. **Model quantization**: INT8 quantization for reduced memory use and faster inference
5. **Dynamic batching**: intelligent batching for throughput optimization

## Feature Detection

The library detects CPU capabilities automatically at runtime:

```rust
use ruvector_scipix::optimize::detect_features;

// Detect CPU features
let features = detect_features();
println!("AVX2: {}", features.avx2);
println!("AVX-512: {}", features.avx512f);
println!("NEON: {}", features.neon);
println!("SSE4.2: {}", features.sse4_2);
```

## SIMD Operations

### Grayscale Conversion

Convert RGBA images to grayscale using SIMD:

```rust
use ruvector_scipix::optimize::simd;

let rgba: Vec<u8> = /* your RGBA data */;
let mut gray = vec![0u8; rgba.len() / 4];

// Automatically uses the best SIMD implementation available
simd::simd_grayscale(&rgba, &mut gray);
```

Performance: up to 4x faster than the scalar implementation on AVX2 systems.
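For reference, the scalar equivalent of this conversion is sketched below, assuming the common BT.601 luma weights (the exact coefficients `simd_grayscale` uses may differ; check the library source):

```rust
/// Scalar RGBA -> grayscale conversion using BT.601 luma weights.
/// This is the reference behavior the SIMD path accelerates; the
/// coefficients here are an assumption, not confirmed library values.
pub fn scalar_grayscale(rgba: &[u8], gray: &mut [u8]) {
    assert_eq!(rgba.len(), gray.len() * 4, "expected 4 bytes per pixel");
    for (px, out) in rgba.chunks_exact(4).zip(gray.iter_mut()) {
        let (r, g, b) = (px[0] as f32, px[1] as f32, px[2] as f32);
        // Alpha (px[3]) is ignored; luma = 0.299 R + 0.587 G + 0.114 B.
        *out = (0.299 * r + 0.587 * g + 0.114 * b).round() as u8;
    }
}
```

The SIMD versions process 8-32 pixels per instruction instead of one, which is where the 4x figure comes from.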

### Threshold Operation

Fast binary thresholding:

```rust
let mut binary = vec![0u8; gray.len()];
simd::simd_threshold(&gray, 128, &mut binary);
```

Performance: up to 8x faster on AVX2 systems.

### Normalization

Fast tensor normalization for model inputs:

```rust
let mut tensor_data: Vec<f32> = /* your data */;
simd::simd_normalize(&mut tensor_data);
```

Performance: up to 3x faster on AVX2 systems.
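A scalar sketch of what normalization typically means for model inputs, assuming `simd_normalize` performs in-place zero-mean / unit-variance scaling (a common choice, but verify against the library's actual definition):

```rust
/// Scalar reference: shift to zero mean, scale to unit variance, in place.
/// Assumption about the semantics of `simd_normalize`; it may instead do
/// min-max scaling or use fixed mean/std constants.
pub fn scalar_normalize(data: &mut [f32]) {
    if data.is_empty() {
        return;
    }
    let n = data.len() as f32;
    let mean = data.iter().sum::<f32>() / n;
    let var = data.iter().map(|x| (x - mean) * (x - mean)).sum::<f32>() / n;
    let inv_std = 1.0 / var.sqrt().max(1e-12); // guard against zero variance
    for x in data.iter_mut() {
        *x = (*x - mean) * inv_std;
    }
}
```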

## Parallel Processing

### Parallel Image Preprocessing

Process multiple images in parallel:

```rust
use ruvector_scipix::optimize::parallel;
use image::DynamicImage;

let images: Vec<DynamicImage> = /* your images */;

let processed = parallel::parallel_preprocess(images, |img| {
    // Your preprocessing function
    preprocess_image(img)
});
```

### Pipeline Execution

Create processing pipelines with parallel stages:

```rust
use ruvector_scipix::optimize::parallel::Pipeline3;

let pipeline = Pipeline3::new(
    |img| preprocess(img),
    |img| detect_regions(img),
    |regions| recognize_text(regions),
);

let results = pipeline.execute_batch(images);
```

### Async Parallel Execution

Execute async operations with a concurrency limit:

```rust
use ruvector_scipix::optimize::parallel::AsyncParallelExecutor;

let executor = AsyncParallelExecutor::new(4); // Max 4 concurrent tasks

let results = executor.execute(tasks, |task| async move {
    process_async(task).await
}).await;
```

## Memory Optimizations

### Buffer Pooling

Reuse buffers to reduce allocations:

```rust
use ruvector_scipix::optimize::memory::{BufferPool, GlobalPools};

// Use the global pools
let pools = GlobalPools::get();
let mut buffer = pools.acquire_large(); // 1 MB buffer
buffer.extend_from_slice(&data);
// The buffer automatically returns to the pool when dropped

// Or create a custom pool (factory, initial size, max size)
let pool = BufferPool::new(|| Vec::with_capacity(1024), 10, 100);
```

Benefits: reduces allocation overhead and improves cache locality.

### Memory-Mapped Models

Load large models without copying them into memory:

```rust
use ruvector_scipix::optimize::memory::MmapModel;

let model = MmapModel::from_file("model.bin")?;
let data = model.as_slice(); // Zero-copy access
```

Benefits: faster loading, lower memory usage, and sharing across processes.

### Zero-Copy Image Views

Work with image data without copying it:

```rust
use ruvector_scipix::optimize::memory::ImageView;

let view = ImageView::new(&data, width, height, channels)?;
let pixel = view.pixel(x, y);

// Create a subview without copying
let roi = view.subview(x, y, width, height)?;
```
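The zero-copy property comes from plain index arithmetic over the original buffer. A minimal illustration of the idea (a hypothetical `View` type, not the library's `ImageView`, which adds bounds checks and strided subviews):

```rust
/// Minimal zero-copy view over an interleaved, row-major image buffer.
/// Hypothetical sketch of the indexing a type like `ImageView` relies on.
pub struct View<'a> {
    pub data: &'a [u8],
    pub width: usize,
    pub channels: usize,
}

impl<'a> View<'a> {
    /// Borrow one pixel's bytes: offset = (y * width + x) * channels.
    /// No bytes are copied; the returned slice points into `data`.
    pub fn pixel(&self, x: usize, y: usize) -> &'a [u8] {
        let off = (y * self.width + x) * self.channels;
        &self.data[off..off + self.channels]
    }
}
```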

### Arena Allocation

Fast temporary allocations:

```rust
use ruvector_scipix::optimize::memory::Arena;

let mut arena = Arena::with_capacity(1024 * 1024);

for _ in 0..iterations {
    let buffer = arena.alloc(size, alignment);
    // Use buffer...
    arena.reset(); // Reuse the capacity
}
```

## Model Quantization

### Basic Quantization

Quantize f32 weights to INT8:

```rust
use ruvector_scipix::optimize::quantize;

let weights: Vec<f32> = /* your model weights */;
let (quantized, params) = quantize::quantize_weights(&weights);

// Later, dequantize for inference
let restored = quantize::dequantize(&quantized, params);
```

Benefits: 4x memory reduction and faster inference on some hardware.
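Under the hood, INT8 quantization maps floats to bytes through a scale and zero-point. The sketch below shows the standard affine scheme; this is an assumption about what `quantize_weights` computes, and the library may use a symmetric variant or different rounding:

```rust
/// Affine INT8 quantization: q = round(x / scale) + zero_point,
/// with scale chosen so [min, max] maps onto [-128, 127]. Sketch only.
pub fn quantize_affine(xs: &[f32]) -> (Vec<i8>, f32, i32) {
    let (min, max) = xs
        .iter()
        .fold((f32::MAX, f32::MIN), |(lo, hi), &x| (lo.min(x), hi.max(x)));
    let scale = ((max - min) / 255.0).max(f32::EPSILON);
    let zero_point = (-128.0 - min / scale).round() as i32;
    let q = xs
        .iter()
        .map(|&x| ((x / scale).round() as i32 + zero_point).clamp(-128, 127) as i8)
        .collect();
    (q, scale, zero_point)
}

/// Inverse mapping: x_hat = (q - zero_point) * scale.
pub fn dequantize_affine(q: &[i8], scale: f32, zero_point: i32) -> Vec<f32> {
    q.iter().map(|&v| (v as i32 - zero_point) as f32 * scale).collect()
}
```

The 4x figure follows directly from storing one `i8` per weight instead of one `f32`, plus a small per-tensor overhead for the scale and zero-point.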

### Quantized Tensors

Work with quantized tensor representations:

```rust
use ruvector_scipix::optimize::quantize::QuantizedTensor;

let tensor = QuantizedTensor::from_f32(&data, vec![batch, channels, height, width]);
println!("Compression ratio: {:.2}x", tensor.compression_ratio());

// Dequantize when needed
let f32_data = tensor.to_f32();
```

### Per-Channel Quantization

Better accuracy for convolutional and linear layers:

```rust
use ruvector_scipix::optimize::quantize::PerChannelQuant;

// For a weight tensor [out_channels, in_channels, ...]
let quant = PerChannelQuant::from_f32(&weights, shape);

// Each output channel gets its own scale and zero-point
```
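The accuracy benefit comes from computing one scale per output channel rather than one global scale, so a single outlier channel does not crush the precision of the others. A minimal symmetric sketch (hypothetical helper; `PerChannelQuant` itself may use the affine scheme):

```rust
/// Symmetric per-channel scales for a weight tensor flattened as
/// [out_channels, elems_per_channel]. Each channel's scale maps its
/// largest |w| onto the INT8 extreme 127. Illustration only.
pub fn per_channel_scales(weights: &[f32], out_channels: usize) -> Vec<f32> {
    let per = weights.len() / out_channels;
    weights
        .chunks_exact(per)
        .map(|ch| {
            let max_abs = ch.iter().fold(0.0f32, |m, &w| m.max(w.abs()));
            (max_abs / 127.0).max(f32::EPSILON)
        })
        .collect()
}
```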

### Quality Metrics

Measure quantization quality:

```rust
use ruvector_scipix::optimize::quantize::{quantization_error, sqnr};

let (quantized, params) = quantize::quantize_weights(&original);

let mse = quantization_error(&original, &quantized, params);
let signal_noise_ratio = sqnr(&original, &quantized, params);

println!("MSE: {:.6}, SQNR: {:.2} dB", mse, signal_noise_ratio);
```
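SQNR here is presumably the standard signal-to-quantization-noise ratio. The reference computation below works on the original and dequantized values; the library's `sqnr` may define signal power slightly differently:

```rust
/// Signal-to-quantization-noise ratio in dB:
/// 10 * log10( sum(x^2) / sum((x - x_hat)^2) ).
/// Higher is better; each extra bit of precision adds roughly 6 dB.
pub fn sqnr_db(original: &[f32], restored: &[f32]) -> f32 {
    let signal: f32 = original.iter().map(|x| x * x).sum();
    let noise: f32 = original
        .iter()
        .zip(restored)
        .map(|(x, y)| (x - y) * (x - y))
        .sum();
    10.0 * (signal / noise.max(f32::MIN_POSITIVE)).log10()
}
```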

## Dynamic Batching

### Basic Batching

Automatically batch requests for better throughput:

```rust
use std::sync::Arc;
use ruvector_scipix::optimize::batch::{DynamicBatcher, BatchConfig};

let config = BatchConfig {
    max_batch_size: 32,
    max_wait_ms: 50,
    max_queue_size: 1000,
    preferred_batch_size: 16,
};

let batcher = Arc::new(DynamicBatcher::new(config, |items: Vec<Image>| {
    process_batch(items) // Your batch processing logic
}));

// Start the processing loop
tokio::spawn({
    let batcher = batcher.clone();
    async move { batcher.run().await }
});

// Add items
let result = batcher.add(image).await?;
```

### Adaptive Batching

Automatically adjust the batch size based on observed latency:

```rust
use ruvector_scipix::optimize::batch::AdaptiveBatcher;
use std::time::Duration;

let batcher = Arc::new(AdaptiveBatcher::new(
    config,
    Duration::from_millis(100), // Target latency
    processor,
));

// The batch size adapts to maintain the target latency
```
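One common adaptation policy is additive-increase/multiplicative-decrease on the batch size: grow while latency is under the target, back off sharply when it is over. The snippet below sketches that policy; it is an illustration of the general technique, not the actual `AdaptiveBatcher` algorithm:

```rust
/// AIMD-style batch size controller: grow by one while the last batch's
/// latency was under target, halve when it was over. Illustrative only.
pub fn next_batch_size(current: usize, last_latency_ms: u64, target_ms: u64, max: usize) -> usize {
    if last_latency_ms > target_ms {
        (current / 2).max(1) // multiplicative decrease: recover quickly
    } else {
        (current + 1).min(max) // additive increase: probe gently
    }
}
```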

## Optimization Levels

Control which optimizations are enabled:

```rust
use ruvector_scipix::optimize::{OptLevel, set_opt_level};

// Set the optimization level at startup
set_opt_level(OptLevel::Full); // All optimizations

// Available levels:
// - OptLevel::None:     No optimizations
// - OptLevel::Simd:     SIMD only
// - OptLevel::Parallel: SIMD + parallel
// - OptLevel::Full:     All optimizations (default)
```

## Benchmarking

Run benchmarks to compare the optimized and non-optimized implementations:

```bash
# Run all optimization benchmarks
cargo bench --bench optimization_bench

# Run a specific benchmark group
cargo bench --bench optimization_bench -- grayscale

# Generate detailed reports
cargo bench --bench optimization_bench -- --verbose
```

## Expected Performance Improvements

Based on benchmarks on modern x86_64 systems with AVX2:

| Operation                        | Improvement | Notes                |
|----------------------------------|-------------|----------------------|
| Grayscale conversion             | 3-4x        | AVX2 vs scalar       |
| Threshold                        | 6-8x        | AVX2 vs scalar       |
| Normalization                    | 2-3x        | AVX2 vs scalar       |
| Parallel preprocessing (8 cores) | 6-7x        | vs sequential        |
| Buffer pooling                   | 2-3x        | vs direct allocation |
| Quantization                     | 4x memory   | INT8 vs FP32         |

## Best Practices

1. **Enable optimizations by default**: use the optimize feature in production
2. **Profile first**: use benchmarks to identify bottlenecks
3. **Use appropriate batch sizes**: larger batches improve throughput but increase latency
4. **Pool buffers on hot paths**: this significantly reduces allocation overhead
5. **Quantize models**: 4x memory reduction with minimal accuracy loss
6. **Match parallelism to the workload**: keep the thread count ≤ the number of CPU cores

## Platform-Specific Notes

### x86_64

- **AVX2**: widely available on modern CPUs (2013+)
- **AVX-512**: available on newer server CPUs; provides marginal improvements
- Best performance on CPUs with strong SIMD execution units

### ARM (AArch64)

- **NEON**: available on all ARMv8+ CPUs
- Good SIMD performance, especially on Apple Silicon
- Some operations may be faster with scalar code due to different execution units

### WebAssembly

- SIMD support is limited and experimental
- Optimizations gracefully degrade to scalar implementations
- Focus on algorithmic optimizations and caching

## Troubleshooting

### Low SIMD Performance

If the SIMD optimizations are not providing the expected speedup:

1. Check CPU features: `cargo run -- detect-features`
2. Ensure data is properly aligned (16-byte alignment for SIMD)
3. Profile to confirm that the SIMD code paths are actually taken
4. Try different optimization levels
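The alignment check in step 2 can be done directly from the slice's data pointer; a quick helper (hypothetical, not part of the library's API):

```rust
/// Returns true if the slice's data pointer is aligned to `align` bytes.
/// SIMD loads are typically fastest on 16- or 32-byte aligned data.
pub fn is_aligned(data: &[u8], align: usize) -> bool {
    (data.as_ptr() as usize) % align == 0
}
```

Note that `Vec<u8>` gives no alignment guarantee beyond 1 byte; if alignment matters on your hot path, allocate through a type with stronger alignment (e.g. a `Vec<u64>` reinterpreted as bytes) or an arena that accepts an alignment parameter.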

### High Memory Usage

If memory usage is too high:

1. Enable buffer pooling for frequently allocated buffers
2. Use memory-mapped models instead of loading them into RAM
3. Enable model quantization
4. Reduce batch sizes

### Thread Contention

If parallel performance is poor:

1. Reduce the thread count: `set_thread_count(cores - 1)`
2. Use chunked parallel processing for better load balancing
3. Avoid fine-grained parallelism (prefer coarser chunks)
4. Profile mutex/lock contention
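Chunked processing (step 2) amortizes scheduling overhead: each worker takes a contiguous chunk rather than a single item. A std-only sketch of the pattern, assuming the library's `parallel_map_chunked` wraps Rayon's equivalent:

```rust
use std::thread;

/// Apply `f` to every element, one scoped thread per `chunk_size` chunk.
/// Coarse chunks keep per-task overhead low versus one task per element.
pub fn map_chunked(data: &mut [u32], chunk_size: usize, f: impl Fn(&mut u32) + Sync) {
    let f = &f; // share the function by reference across threads
    thread::scope(|s| {
        for chunk in data.chunks_mut(chunk_size) {
            // chunks_mut yields disjoint slices, so this is data-race free
            s.spawn(move || chunk.iter_mut().for_each(f));
        }
    });
}
```

In production you would cap the number of chunks at the core count (as Rayon's work-stealing pool does) rather than spawning one OS thread per chunk.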

## Integration Example

Complete example combining several optimizations:

```rust
use ruvector_scipix::optimize::*;
use std::sync::Arc;
use std::time::Duration;

#[tokio::main]
async fn main() -> Result<()> {
    // Set the optimization level
    set_opt_level(OptLevel::Full);

    // Detect CPU features
    let features = detect_features();
    println!("Features: {:?}", features);

    // Warm up the global buffer pools
    let pools = memory::GlobalPools::get();

    // Create an adaptive batcher
    let batcher = Arc::new(batch::AdaptiveBatcher::new(
        batch::BatchConfig::default(),
        Duration::from_millis(100),
        |images| process_images(images),
    ));

    // Start the batcher
    let batcher_clone = batcher.clone();
    tokio::spawn(async move { batcher_clone.run().await });

    // Process an image
    let result = batcher.add(image).await?;

    Ok(())
}

fn process_images(images: Vec<Image>) -> Vec<Result<Output, String>> {
    // Process in parallel, 8 images per chunk
    parallel::parallel_map_chunked(images, 8, |img| {
        // Get a pooled scratch buffer
        let mut buffer = memory::GlobalPools::get().acquire_large();

        // Use SIMD operations
        let mut gray = vec![0u8; (img.width() * img.height()) as usize];
        simd::simd_grayscale(img.as_rgba8(), &mut gray);

        // Process...
        Ok(output)
    })
}
```

## Future Optimizations

Planned improvements:

- GPU acceleration using wgpu
- Custom ONNX runtime integration
- Advanced quantization (INT4, mixed precision)
- Streaming processing for video
- Distributed inference
