# OCR System Optimization Roadmap
## Executive Summary
This document outlines a comprehensive optimization strategy for the ruvector-scipix OCR system, targeting progressive performance improvements from baseline (1000ms/image) to production-ready (50ms/image) latency.
**Target Performance Metrics:**
- **Phase 1 (Baseline)**: 1000ms/image, 80% CPU utilization
- **Phase 2 (Optimized)**: 100ms/image, 60% CPU utilization, 10x throughput improvement
- **Phase 3 (Production)**: 50ms/image, 40% CPU utilization, 20x throughput improvement
---
## 1. Model Optimization
### 1.1 ONNX Model Quantization
**Objective**: Reduce model size and inference time while maintaining accuracy.
#### FP16 (Half-Precision) Quantization
```rust
// Expected Improvement: 2x speed, 50% memory reduction, <1% accuracy loss
use ort::quantization::{QuantizationConfig, QuantizationType};

pub struct ModelOptimizer {
    quantization_config: QuantizationConfig,
}

impl ModelOptimizer {
    pub fn quantize_fp16(model_path: &str) -> Result<String> {
        let config = QuantizationConfig::new()
            .with_quantization_type(QuantizationType::FP16)
            .with_per_channel(true)
            .with_reduce_range(false);

        let output_path = model_path.replace(".onnx", "_fp16.onnx");
        ort::quantization::quantize(model_path, &output_path, config)?;
        Ok(output_path)
    }
}
```
**Expected Results:**
- Model size: 500MB → 250MB (50% reduction)
- Inference time: 1000ms → 500ms (2x speedup)
- Accuracy degradation: <1%
- Memory usage: 50% reduction
#### INT8 Quantization
```rust
// Expected Improvement: 4x speed, 75% memory reduction, 2-5% accuracy loss
pub fn quantize_int8_dynamic(model_path: &str) -> Result<String> {
    let config = QuantizationConfig::new()
        .with_quantization_type(QuantizationType::DynamicINT8)
        .with_per_channel(true)
        .with_optimize_model(true);

    let output_path = model_path.replace(".onnx", "_int8.onnx");
    ort::quantization::quantize(model_path, &output_path, config)?;
    Ok(output_path)
}

pub fn quantize_int8_static(
    model_path: &str,
    calibration_dataset: &[Tensor],
) -> Result<String> {
    let config = QuantizationConfig::new()
        .with_quantization_type(QuantizationType::StaticINT8)
        .with_calibration_method(CalibrationMethod::MinMax)
        .with_per_channel(true);

    let output_path = model_path.replace(".onnx", "_int8_static.onnx");

    // Calibrate using a representative dataset
    let calibrator = Calibrator::new(config, calibration_dataset);
    calibrator.quantize(model_path, &output_path)?;
    Ok(output_path)
}
```
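A minimal sketch tying the helpers above together. The model path and the `load_calibration_images` helper are hypothetical placeholders, not part of the API above:

```rust
fn build_quantized_models() -> Result<()> {
    // Weight-only conversions need no data
    let fp16_path = ModelOptimizer::quantize_fp16("models/ocr.onnx")?;
    let int8_path = quantize_int8_dynamic("models/ocr.onnx")?;

    // Static INT8 additionally needs a representative calibration set;
    // a few hundred samples drawn from production traffic is typical.
    let calibration: Vec<Tensor> = load_calibration_images("data/calibration")?;
    let int8_static_path = quantize_int8_static("models/ocr.onnx", &calibration)?;

    println!("Quantized: {fp16_path}, {int8_path}, {int8_static_path}");
    Ok(())
}
```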
**Expected Results:**
- Model size: 500MB → 125MB (75% reduction)
- Inference time: 1000ms → 250ms (4x speedup)
- Accuracy degradation: 2-5%
- Memory usage: 75% reduction
### 1.2 Model Pruning Strategies
**Objective**: Remove redundant weights and connections to reduce model complexity.
```rust
// Expected Improvement: 30-50% parameter reduction, 2-3x speed
pub struct ModelPruner {
    sparsity_target: f32,
    pruning_method: PruningMethod,
}

pub enum PruningMethod {
    MagnitudeBased,    // Remove smallest weights
    StructuredPruning, // Remove entire neurons/filters
    GradientBased,     // Remove low-gradient weights
}

impl ModelPruner {
    pub fn prune_magnitude_based(&self, model: &Model, threshold: f32) -> Model {
        // 1. Analyze weight magnitudes
        let weight_analysis = self.analyze_weight_importance(model);

        // 2. Apply the sparsity threshold: zero out weights below it
        let pruned_weights: Vec<Vec<f32>> = weight_analysis
            .iter()
            .map(|(_layer, weights)| {
                weights
                    .iter()
                    .map(|w| if w.abs() < threshold { 0.0 } else { *w })
                    .collect()
            })
            .collect();

        // 3. Reconstruct the model with the sparsified weights
        self.rebuild_model(model, pruned_weights)
    }

    pub fn structured_pruning(&self, model: &Model, prune_ratio: f32) -> Model {
        // Remove entire filter channels based on importance scores
        let channel_importance = self.compute_channel_importance(model);

        // Sort and prune the least important channels
        let channels_to_prune = self.select_channels_to_prune(channel_importance, prune_ratio);
        self.remove_channels(model, channels_to_prune)
    }
}
```
**Expected Results:**
- Parameters: 200M → 100M (50% reduction)
- Inference time: 1000ms → 400ms (2.5x speedup)
- Accuracy degradation: 3-7%
- Fine-tuning required: Yes (10-20 epochs)
### 1.3 Knowledge Distillation
**Objective**: Train a smaller student model to match larger teacher model performance.
```rust
// Expected Improvement: 5-10x speed, 80-90% size reduction, <5% accuracy loss
pub struct KnowledgeDistiller {
    teacher_model: Arc<Model>,
    student_model: Arc<Model>,
    temperature: f32,
    alpha: f32, // Balance between hard and soft targets
}

impl KnowledgeDistiller {
    pub async fn distill(&self, training_data: DataLoader) -> Result<Model> {
        let mut student = self.student_model.clone();

        for batch in training_data {
            // Get teacher predictions (soft targets)
            let teacher_output = self
                .teacher_model
                .forward(&batch.images)
                .await?
                .apply_temperature(self.temperature);

            // Get student predictions
            let student_output = student.forward(&batch.images).await?;

            // Compute the distillation loss. (Standard practice also scales
            // the soft-target term by temperature^2 so gradient magnitudes
            // stay comparable across temperatures.)
            let soft_loss = kl_divergence(
                &student_output.apply_temperature(self.temperature),
                &teacher_output,
            );
            let hard_loss = cross_entropy(&student_output, &batch.labels);
            let loss = self.alpha * soft_loss + (1.0 - self.alpha) * hard_loss;

            // Backpropagation and optimization
            loss.backward();
            student.optimize();
        }
        Ok(student)
    }
}

// Example architecture reduction
pub fn create_distilled_model() -> StudentModel {
    StudentModel::new()
        .with_encoder_layers(6)       // vs 12 in teacher
        .with_hidden_size(384)        // vs 768 in teacher
        .with_attention_heads(6)      // vs 12 in teacher
        .with_intermediate_size(1536) // vs 3072 in teacher
}
```
**Expected Results:**
- Model size: 500MB → 50MB (10x reduction)
- Parameters: 200M → 20M (10x reduction)
- Inference time: 1000ms → 100ms (10x speedup)
- Accuracy degradation: 3-5%
### 1.4 TensorRT/OpenVINO Integration
**Objective**: Leverage hardware-specific optimizations for maximum performance.
#### TensorRT Integration (NVIDIA GPUs)
```rust
// Expected Improvement: 3-5x speed on NVIDIA GPUs
use tensorrt_rs::{Builder, NetworkDefinition, IOptimizationProfile};

pub struct TensorRTOptimizer {
    builder: Builder,
    precision: Precision,
}

#[derive(PartialEq)]
pub enum Precision {
    FP32,
    FP16,
    INT8,
}

impl TensorRTOptimizer {
    pub fn optimize_for_tensorrt(&self, onnx_path: &str) -> Result<Vec<u8>> {
        // 1. Create a TensorRT network from the ONNX graph
        let network = self.builder.create_network_from_onnx(onnx_path)?;

        // 2. Configure the optimization profile (dynamic batch dimension)
        let profile = self
            .builder
            .create_optimization_profile()
            .set_shape(
                "input",
                Dims::new(&[1, 3, 224, 224]),  // min
                Dims::new(&[4, 3, 224, 224]),  // opt
                Dims::new(&[16, 3, 224, 224]), // max
            );

        // 3. Build the optimized engine
        let config = self
            .builder
            .create_builder_config()
            .set_max_workspace_size(1 << 30) // 1GB
            .set_flag(BuilderFlag::FP16, self.precision == Precision::FP16)
            .set_flag(BuilderFlag::INT8, self.precision == Precision::INT8)
            .add_optimization_profile(profile);

        let engine = self.builder.build_engine(&network, &config)?;

        // 4. Serialize the engine for deployment
        Ok(engine.serialize())
    }
}
```
**Expected Results (NVIDIA GPUs):**
- Inference time: 1000ms → 200ms (5x speedup)
- GPU utilization: 40% → 85%
- Memory bandwidth: Optimized kernel fusion
- Dynamic shape support: Yes
#### OpenVINO Integration (Intel CPUs/GPUs)
```rust
// Expected Improvement: 2-4x speed on Intel hardware
use openvino_rs::{Core, CompiledModel, InferRequest};

pub struct OpenVINOOptimizer {
    core: Core,
    device: String, // CPU, GPU, MYRIAD, etc.
}

impl OpenVINOOptimizer {
    pub fn optimize_for_openvino(&self, onnx_path: &str) -> Result<CompiledModel> {
        // 1. Read model
        let model = self.core.read_model(onnx_path, None)?;

        // 2. Configure optimization
        let mut config = HashMap::new();
        config.insert("PERFORMANCE_HINT", "THROUGHPUT");
        config.insert("NUM_STREAMS", "AUTO");
        config.insert("INFERENCE_PRECISION_HINT", "f16");

        // 3. Compile for the target device
        let compiled_model = self.core.compile_model(&model, &self.device, &config)?;
        Ok(compiled_model)
    }

    pub async fn infer_optimized(
        &self,
        compiled_model: &CompiledModel,
        input: &Tensor,
    ) -> Result<Tensor> {
        let infer_request = compiled_model.create_infer_request()?;

        // Set input tensor
        infer_request.set_input_tensor(0, input)?;

        // Asynchronous inference
        infer_request.start_async()?;
        infer_request.wait()?;

        // Get output tensor
        Ok(infer_request.get_output_tensor(0)?)
    }
}
```
**Expected Results (Intel Hardware):**
- Inference time (CPU): 1000ms → 300ms (3.3x speedup)
- Inference time (GPU): 1000ms → 250ms (4x speedup)
- AVX-512 utilization: Automatic
- Multi-stream execution: Auto-tuned
---
## 2. Inference Optimization
### 2.1 Batch Processing for Throughput
**Objective**: Process multiple images simultaneously to maximize GPU/CPU utilization.
```rust
// Expected Improvement: 3-5x throughput with batch size 16-32
use tokio::sync::mpsc;
use rayon::prelude::*;

pub struct BatchProcessor {
    batch_size: usize,
    timeout_ms: u64,
    inference_engine: Arc<InferenceEngine>,
}

impl BatchProcessor {
    // Takes `Arc<Self>` so the processor can be moved into the spawned task.
    pub async fn process_with_batching(
        self: Arc<Self>,
        mut input_stream: mpsc::Receiver<ImageRequest>,
    ) -> mpsc::Receiver<OCRResult> {
        let (tx, rx) = mpsc::channel(1000);

        tokio::spawn(async move {
            let mut batch_buffer = Vec::with_capacity(self.batch_size);
            let mut timeout = tokio::time::interval(Duration::from_millis(self.timeout_ms));

            loop {
                tokio::select! {
                    Some(request) = input_stream.recv() => {
                        batch_buffer.push(request);
                        if batch_buffer.len() >= self.batch_size {
                            self.process_batch(&batch_buffer, &tx).await;
                            batch_buffer.clear();
                        }
                    }
                    _ = timeout.tick() => {
                        // Flush partial batches so low traffic doesn't stall
                        if !batch_buffer.is_empty() {
                            self.process_batch(&batch_buffer, &tx).await;
                            batch_buffer.clear();
                        }
                    }
                }
            }
        });
        rx
    }

    async fn process_batch(&self, batch: &[ImageRequest], tx: &mpsc::Sender<OCRResult>) {
        // 1. Preprocess in parallel
        let preprocessed: Vec<Tensor> = batch
            .par_iter()
            .map(|req| self.preprocess(&req.image))
            .collect();

        // 2. Stack into a single batched tensor
        let batched_tensor = Tensor::stack(&preprocessed, 0);

        // 3. Single inference call for the whole batch
        let results = self.inference_engine.infer(&batched_tensor).await.unwrap();

        // 4. Split and send results
        for (_request, result) in batch.iter().zip(results.split(0)) {
            let ocr_result = self.postprocess(result);
            tx.send(ocr_result).await.unwrap();
        }
    }
}
```
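Wiring the batcher into a service might look like the following sketch; `ImageRequest::new` and the result's `text` field are assumptions for illustration:

```rust
async fn run_batched_ocr(engine: Arc<InferenceEngine>, image: Image) -> Result<()> {
    let batcher = Arc::new(BatchProcessor {
        batch_size: 16,
        timeout_ms: 25, // flush partial batches to bound tail latency
        inference_engine: engine,
    });

    let (req_tx, req_rx) = mpsc::channel(1000);
    let mut results = batcher.process_with_batching(req_rx).await;

    req_tx.send(ImageRequest::new(image)).await?;
    if let Some(result) = results.recv().await {
        println!("{}", result.text);
    }
    Ok(())
}
```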
**Expected Results:**
- Throughput: 1 img/s → 15-20 img/s (batch size 16)
- Latency (p50): 1000ms → 150ms
- Latency (p99): 1000ms → 400ms (due to batching delay)
- GPU utilization: 40% → 90%
### 2.2 Model Caching and Warm-up
**Objective**: Eliminate cold-start latency and optimize model loading.
```rust
// Expected Improvement: First inference 5000ms → 100ms
pub struct ModelCache {
    models: Arc<RwLock<LruCache<ModelKey, Arc<CompiledModel>>>>,
    warm_up_batches: usize,
}

impl ModelCache {
    pub async fn get_or_load_model(&self, model_key: ModelKey) -> Result<Arc<CompiledModel>> {
        // Try the cache first. `peek` works under a read lock because it
        // doesn't update LRU recency (unlike `get`, which needs &mut).
        {
            let cache = self.models.read().await;
            if let Some(model) = cache.peek(&model_key) {
                return Ok(model.clone());
            }
        }

        // Load and warm up the model
        let model = self.load_and_warmup(&model_key).await?;
        let model = Arc::new(model);

        // Cache for future use
        {
            let mut cache = self.models.write().await;
            cache.put(model_key, model.clone());
        }
        Ok(model)
    }

    async fn load_and_warmup(&self, model_key: &ModelKey) -> Result<CompiledModel> {
        // 1. Load model
        let model = self.load_model(model_key).await?;

        // 2. Warm up with dummy inputs
        let dummy_input = Tensor::zeros(&[1, 3, 224, 224]);
        for _ in 0..self.warm_up_batches {
            let _ = model.infer(&dummy_input).await?;
        }

        // 3. The model is now resident and optimized in GPU memory
        Ok(model)
    }

    pub async fn preload_models(&self, model_keys: &[ModelKey]) {
        // Load all models in parallel at startup
        futures::future::join_all(
            model_keys.iter().map(|key| self.get_or_load_model(key.clone())),
        )
        .await;
    }
}
```
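At service startup the load-and-warm-up path can be taken off the critical path entirely. A hedged sketch; the `ModelCache::new` constructor and the `ModelKey` variants are illustrative, not defined above:

```rust
async fn startup() -> Result<Arc<ModelCache>> {
    // Hypothetical constructor: cache capacity 8, 3 warm-up batches per model
    let cache = Arc::new(ModelCache::new(8, 3));

    // Load every model this deployment serves before accepting traffic,
    // so the first real request never pays the cold-start cost.
    cache
        .preload_models(&[ModelKey::MathOcr, ModelKey::Handwriting])
        .await;
    Ok(cache)
}
```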
**Expected Results:**
- First inference: 5000ms → 100ms (50x improvement)
- Model loading: Asynchronous, non-blocking
- Memory usage: +500MB per cached model
- Cache hit rate: 95%+ in production
### 2.3 Dynamic Batching
**Objective**: Adaptively adjust batch size based on load and latency requirements.
```rust
// Expected Improvement: Optimal throughput/latency trade-off
pub struct DynamicBatcher {
    min_batch_size: usize,
    max_batch_size: usize,
    target_latency_ms: u64,
    adaptive_controller: AdaptiveController,
}

struct AdaptiveController {
    current_batch_size: AtomicUsize,
    latency_history: RwLock<VecDeque<Duration>>,
    throughput_history: RwLock<VecDeque<f64>>,
    // Limits mirrored from the batcher so the controller is self-contained
    min_batch_size: usize,
    max_batch_size: usize,
    target_latency_ms: u64,
}

impl DynamicBatcher {
    // Takes `Arc<Self>` so the batcher can be moved into the spawned task.
    pub async fn process_adaptive(
        self: Arc<Self>,
        mut input_stream: mpsc::Receiver<ImageRequest>,
    ) -> mpsc::Receiver<OCRResult> {
        let (tx, rx) = mpsc::channel(1000);

        tokio::spawn(async move {
            loop {
                // Determine the optimal batch size for current conditions
                let batch_size = self.adaptive_controller.compute_optimal_batch_size();

                // Collect up to `batch_size` requests
                let batch = self.collect_batch(&mut input_stream, batch_size).await;

                // Process and measure
                let start = Instant::now();
                self.process_batch(&batch, &tx).await;
                let latency = start.elapsed();

                // Feed the measurement back into the controller
                self.adaptive_controller.update(batch_size, latency, batch.len());
            }
        });
        rx
    }
}

impl AdaptiveController {
    fn compute_optimal_batch_size(&self) -> usize {
        let current = self.current_batch_size.load(Ordering::Relaxed);
        let avg_latency_ms = self.average_latency().as_millis() as u64;

        // Gradient-style adjustment: grow while latency headroom remains and
        // throughput is still improving; shrink when over the latency target.
        if avg_latency_ms < self.target_latency_ms && self.throughput_trending_up() {
            (current + 2).min(self.max_batch_size)
        } else if avg_latency_ms > self.target_latency_ms {
            current.saturating_sub(2).max(self.min_batch_size)
        } else {
            current
        }
    }
}
```
**Expected Results:**
- Batch size adaptation: 1-32 based on load
- Latency (low load): 100ms (batch size 1-4)
- Latency (high load): 200ms (batch size 16-32)
- Throughput optimization: Automatic
- SLA compliance: 99%+
### 2.4 Speculative Decoding
**Objective**: Accelerate autoregressive decoding for text generation tasks.
```rust
// Expected Improvement: 2-3x speed for LaTeX generation
pub struct SpeculativeDecoder {
    draft_model: Arc<SmallModel>,  // Fast, less accurate
    target_model: Arc<LargeModel>, // Slow, accurate
    num_speculative_tokens: usize,
}

impl SpeculativeDecoder {
    pub async fn decode(&self, prompt: &Tensor) -> Result<String> {
        let mut output_tokens = Vec::new();
        let mut current_input = prompt.clone();

        loop {
            // 1. Draft model generates K tokens quickly
            let draft_tokens = self
                .draft_model
                .generate_n_tokens(&current_input, self.num_speculative_tokens)
                .await?;

            // 2. Target model verifies all K tokens in one parallel pass
            let verification_input =
                Tensor::concat(&[current_input.clone(), draft_tokens.clone()], 0);
            let target_logits = self.target_model.forward(&verification_input).await?;

            // 3. Accept draft tokens while they match the target model's
            // prediction. The logits at position (input_len + i - 1) predict
            // the token at draft position i.
            let input_len = current_input.len();
            for (i, draft_token) in draft_tokens.iter().enumerate() {
                let target_prediction = target_logits[input_len + i - 1].argmax();
                if *draft_token == target_prediction {
                    output_tokens.push(*draft_token);
                } else {
                    // Replace the first mismatch with the target's token and
                    // restart drafting from there
                    output_tokens.push(target_prediction);
                    break;
                }
            }

            // 4. Next iteration continues from prompt + accepted tokens
            current_input =
                Tensor::concat(&[prompt.clone(), Tensor::from_slice(&output_tokens)], 0);

            if self.is_complete(&output_tokens) {
                break;
            }
        }
        Ok(self.decode_tokens(&output_tokens))
    }
}
```
**Expected Results:**
- LaTeX generation: 2000ms → 700ms (2.8x speedup)
- Acceptance rate: 60-80% of draft tokens
- Quality: Identical to target model
- Best for: Long-form LaTeX, chemical formulas
---
## 3. Memory Optimization
### 3.1 Memory-Mapped Model Loading
**Objective**: Reduce memory footprint and enable instant model loading.
```rust
// Expected Improvement: 90% memory reduction, instant loading
use memmap2::{Mmap, MmapOptions};
use std::fs::File;

pub struct MemoryMappedModel {
    mmap: Mmap,
    metadata: ModelMetadata,
}

impl MemoryMappedModel {
    pub fn load(model_path: &str) -> Result<Self> {
        // 1. Open file
        let file = File::open(model_path)?;

        // 2. Create memory-mapped region
        let mmap = unsafe {
            MmapOptions::new()
                .populate() // Pre-fault pages
                .map(&file)?
        };

        // 3. Parse metadata from the header
        let metadata = ModelMetadata::parse(&mmap[0..4096])?;
        Ok(Self { mmap, metadata })
    }

    pub fn get_tensor(&self, layer_name: &str) -> Result<TensorView> {
        let offset = self
            .metadata
            .tensor_offsets
            .get(layer_name)
            .ok_or(Error::TensorNotFound)?;
        let size = self
            .metadata
            .tensor_sizes
            .get(layer_name)
            .ok_or(Error::TensorNotFound)?;

        // Zero-copy tensor view over the mapped bytes
        Ok(TensorView::from_bytes(&self.mmap[offset.start..offset.end], size))
    }

    pub async fn infer(&self, input: &Tensor) -> Result<Tensor> {
        // Inference operates directly on memory-mapped data; no copying required
        self.run_inference_on_mmap(input).await
    }
}
```
**Expected Results:**
- Model loading time: 2000ms → 10ms (200x improvement)
- Memory usage: 500MB RAM → 50MB RAM (model stays on disk)
- Page faults: Minimal with `populate()` flag
- Shared memory: Multiple processes share same model
### 3.2 Tensor Arena Allocation
**Objective**: Pre-allocate fixed memory pools to eliminate runtime allocation overhead.
```rust
// Expected Improvement: 30% reduction in memory fragmentation
pub struct TensorArena {
    memory_pool: Vec<u8>,
    allocator: BumpAllocator,
    checkpoints: Vec<usize>,
}

impl TensorArena {
    pub fn new(size_bytes: usize) -> Self {
        Self {
            memory_pool: vec![0u8; size_bytes],
            allocator: BumpAllocator::new(size_bytes),
            checkpoints: Vec::new(),
        }
    }

    pub fn allocate_tensor(&mut self, shape: &[usize], dtype: DType) -> TensorMut {
        let size_bytes = shape.iter().product::<usize>() * dtype.size_bytes();
        let offset = self
            .allocator
            .allocate(size_bytes)
            .expect("Arena out of memory");
        let slice = &mut self.memory_pool[offset..offset + size_bytes];
        TensorMut::from_slice_mut(slice, shape, dtype)
    }

    pub fn checkpoint(&mut self) {
        // Save the current allocation position
        self.checkpoints.push(self.allocator.position());
    }

    pub fn restore(&mut self) {
        // Roll back to the previous checkpoint, freeing everything since
        if let Some(position) = self.checkpoints.pop() {
            self.allocator.reset_to(position);
        }
    }

    pub fn reset(&mut self) {
        // Reset the entire arena
        self.allocator.reset();
        self.checkpoints.clear();
    }
}

// Usage in the inference pipeline
impl InferenceEngine {
    pub async fn infer_with_arena(&self, input: &Tensor) -> Result<Tensor> {
        let mut arena = TensorArena::new(100 * 1024 * 1024); // 100MB
        arena.checkpoint();

        // All intermediate tensors are allocated from the arena
        let preprocessed = self.preprocess_to_arena(input, &mut arena);
        let features = self.extract_features_to_arena(&preprocessed, &mut arena);
        let output = self.decode_to_arena(&features, &mut arena);

        // Clone the final output (the arena backing will be freed)
        let result = output.to_owned();
        arena.restore(); // Free all intermediate allocations
        Ok(result)
    }
}
```
**Expected Results:**
- Memory allocations: 1000+ calls → 1 allocation
- Allocation time: 50ms → 1ms (50x improvement)
- Memory fragmentation: Eliminated
- Cache locality: Improved
### 3.3 Zero-Copy Image Processing
**Objective**: Eliminate unnecessary data copies in preprocessing pipeline.
```rust
// Expected Improvement: 40% reduction in preprocessing time
use image::DynamicImage;

pub struct ZeroCopyPreprocessor {
    target_size: (usize, usize),
    normalization: NormalizationParams,
}

impl ZeroCopyPreprocessor {
    pub fn preprocess_inplace(&self, image: &DynamicImage) -> TensorView {
        // 1. Get raw pixel data (no copy)
        let rgb_image = image.to_rgb8();
        let raw_pixels = rgb_image.as_raw();

        // 2. Create a tensor view over the raw data. Note: this assumes the
        // buffer has already been converted in place to f32 (e.g. by the
        // SIMD u8 -> f32 pass in §6); viewing u8 bytes directly as f32
        // would be invalid.
        let tensor_view = unsafe {
            TensorView::from_raw_parts(
                raw_pixels.as_ptr() as *const f32,
                &[1, 3, image.height() as usize, image.width() as usize],
            )
        };

        // 3. Apply transformations in place
        let resized = self.resize_inplace(tensor_view, self.target_size);
        self.normalize_inplace(resized, &self.normalization)
    }

    fn resize_inplace(&self, input: TensorView, target_size: (usize, usize)) -> TensorView {
        // SIMD-accelerated resize, operating directly on the input buffer
        // when possible
        simd_resize::resize_rgb_inplace(input, target_size)
    }

    pub fn batch_preprocess_zero_copy(&self, images: &[DynamicImage]) -> Vec<TensorView> {
        images
            .par_iter()
            .map(|img| self.preprocess_inplace(img))
            .collect()
    }
}

// SIMD-accelerated normalization (SSE). Assumes RGBA-interleaved f32 data so
// each 4-lane chunk lines up with one pixel; the fourth lane passes through.
// Tail elements (len % 4) are left to a scalar fallback, elided here.
#[cfg(target_arch = "x86_64")]
use std::arch::x86_64::*;

pub fn normalize_simd(data: &mut [f32], mean: [f32; 3], std: [f32; 3]) {
    unsafe {
        let mean_vec = _mm_set_ps(0.0, mean[2], mean[1], mean[0]);
        let std_vec = _mm_set_ps(1.0, std[2], std[1], std[0]);

        for chunk in data.chunks_exact_mut(4) {
            let values = _mm_loadu_ps(chunk.as_ptr());
            let normalized = _mm_div_ps(_mm_sub_ps(values, mean_vec), std_vec);
            _mm_storeu_ps(chunk.as_mut_ptr(), normalized);
        }
    }
}
```
**Expected Results:**
- Preprocessing time: 100ms → 60ms (40% improvement)
- Memory copies: 3 copies → 0 copies
- Memory bandwidth: 50% reduction
- SIMD utilization: 90%+
### 3.4 Streaming for Large Documents
**Objective**: Process multi-page documents without loading entire document into memory.
```rust
// Expected Improvement: Process unlimited document sizes with constant memory
use tokio::io::{AsyncRead, AsyncReadExt};
use futures::stream::{Stream, StreamExt, TryStreamExt};

pub struct StreamingOCRProcessor {
    page_buffer_size: usize,
    max_concurrent_pages: usize,
    inference_engine: Arc<InferenceEngine>,
}

impl StreamingOCRProcessor {
    pub async fn process_document_stream<R: AsyncRead + Unpin>(
        &self,
        pdf_stream: R,
    ) -> impl Stream<Item = Result<PageResult>> + '_ {
        // 1. Create page stream
        let page_stream = self.extract_pages_streaming(pdf_stream);

        // 2. Process with bounded concurrency
        page_stream
            .map(|page_result| async move {
                let page = page_result?;
                let page_num = page.page_num;

                // Preprocess page
                let preprocessed = self.preprocess_page(&page).await?;

                // Run OCR
                let ocr_result = self.inference_engine.infer(&preprocessed).await?;

                // Free page data immediately after use
                drop(page);
                drop(preprocessed);

                Ok(PageResult { page_num, text: ocr_result })
            })
            .buffer_unordered(self.max_concurrent_pages)
    }

    fn extract_pages_streaming<R: AsyncRead + Unpin>(
        &self,
        pdf_stream: R,
    ) -> impl Stream<Item = Result<Page>> + '_ {
        futures::stream::unfold(
            (pdf_stream, 0usize),
            move |(mut stream, page_num)| async move {
                // Read the next page-sized chunk from the stream
                let mut page_buffer = vec![0u8; self.page_buffer_size];
                match stream.read(&mut page_buffer).await {
                    Ok(0) => None, // End of stream
                    Ok(n) => {
                        let page = self.decode_page(&page_buffer[..n], page_num).ok()?;
                        Some((Ok(page), (stream, page_num + 1)))
                    }
                    Err(e) => Some((Err(e.into()), (stream, page_num))),
                }
            },
        )
    }

    pub async fn process_large_pdf(&self, pdf_path: &str) -> Result<Vec<PageResult>> {
        let file = tokio::fs::File::open(pdf_path).await?;
        let stream = self.process_document_stream(file).await;
        stream.try_collect().await
    }
}
```
**Expected Results:**
- Memory usage: O(n) → O(1) (constant)
- Max document size: Unlimited (was limited by RAM)
- Concurrent page processing: 4-8 pages
- Throughput: 5-10 pages/second
---
## 4. Parallelization Strategy
### 4.1 Rayon for CPU Parallelism
**Objective**: Maximize CPU core utilization for data-parallel operations.
```rust
// Expected Improvement: Near-linear scaling with CPU cores
use rayon::prelude::*;

pub struct ParallelPreprocessor {
    thread_pool: rayon::ThreadPool,
}

impl ParallelPreprocessor {
    pub fn new(num_threads: usize) -> Self {
        let thread_pool = rayon::ThreadPoolBuilder::new()
            .num_threads(num_threads)
            .build()
            .unwrap();
        Self { thread_pool }
    }

    pub fn batch_preprocess(&self, images: &[DynamicImage]) -> Vec<Tensor> {
        self.thread_pool.install(|| {
            images
                .par_iter()
                .map(|img| {
                    // Each image is processed on a separate worker thread
                    self.preprocess_single(img)
                })
                .collect()
        })
    }

    pub fn parallel_postprocess(&self, outputs: &[Tensor]) -> Vec<OCRResult> {
        outputs
            .par_iter()
            .map(|output| {
                // Parallel decoding, NMS, text extraction
                self.decode_output(output)
            })
            .collect()
    }
}

// Nested parallelism for complex operations
pub fn parallel_nms(boxes: &[BoundingBox], threshold: f32) -> Vec<BoundingBox> {
    boxes
        .par_chunks(1000)
        .flat_map(|chunk| {
            // Each chunk is processed independently
            nms_sequential(chunk, threshold)
        })
        .collect()
}
```
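Pool sizing is the main knob here. A small sketch, using only the standard library, that matches the pool to the machine:

```rust
use std::thread;

fn make_preprocessor() -> ParallelPreprocessor {
    // available_parallelism() reports logical cores; fall back conservatively
    // if the query fails (e.g. in restricted containers).
    let cores = thread::available_parallelism().map(|n| n.get()).unwrap_or(4);
    ParallelPreprocessor::new(cores)
}
```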
**Expected Results (8-core CPU):**
- Preprocessing throughput: 1 img/s → 7-8 img/s (7-8x)
- CPU utilization: 12% → 95%
- Scaling efficiency: 90%+ up to 16 cores
- Memory overhead: Minimal
### 4.2 Tokio for Async I/O
**Objective**: Overlap I/O operations with computation for maximum throughput.
```rust
// Expected Improvement: 3-5x throughput with I/O-bound operations
use tokio::sync::Semaphore;
use futures::stream::{FuturesUnordered, StreamExt};

pub struct AsyncOCRService {
    inference_semaphore: Arc<Semaphore>,
    io_semaphore: Arc<Semaphore>,
    model: Arc<InferenceEngine>,
}

impl AsyncOCRService {
    pub async fn process_batch_async(
        &self,
        image_urls: Vec<String>,
    ) -> Vec<Result<OCRResult>> {
        let mut futures = FuturesUnordered::new();

        for url in image_urls {
            let model = self.model.clone();
            let inference_sem = self.inference_semaphore.clone();
            let io_sem = self.io_semaphore.clone();

            futures.push(async move {
                // 1. Download image (I/O bound)
                let _io_permit = io_sem.acquire().await?;
                let image_data = Self::download_image(&url).await?;
                drop(_io_permit);

                // 2. Preprocess (CPU bound)
                let preprocessed = Self::preprocess(&image_data)?;

                // 3. Inference (GPU/CPU bound)
                let _inference_permit = inference_sem.acquire().await?;
                let result = model.infer(&preprocessed).await?;
                drop(_inference_permit);

                // 4. Postprocess (CPU bound)
                Ok(Self::postprocess(result))
            });
        }
        futures.collect().await
    }

    async fn download_image(url: &str) -> Result<Vec<u8>> {
        let response = reqwest::get(url).await?;
        Ok(response.bytes().await?.to_vec())
    }
}

// Pipeline with async/await
pub struct AsyncPipeline {
    stages: Vec<Box<dyn AsyncStage>>,
}

impl AsyncPipeline {
    pub async fn execute(&self, input: Input) -> Result<Output> {
        let mut current = input;
        for stage in &self.stages {
            current = stage.process(current).await?;
        }
        Ok(current)
    }

    pub async fn execute_batch(&self, inputs: Vec<Input>) -> Vec<Result<Output>> {
        futures::future::join_all(inputs.into_iter().map(|input| self.execute(input))).await
    }
}
```
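The two semaphores are the main tuning surface: they independently bound how many downloads and how many inferences may be in flight. A sketch of plausible bounds, assuming `engine` is the shared `InferenceEngine` from earlier sections:

```rust
fn make_service(engine: Arc<InferenceEngine>) -> AsyncOCRService {
    AsyncOCRService {
        // Keep the GPU queue shallow: roughly one batch in flight per device
        inference_semaphore: Arc::new(Semaphore::new(2)),
        // Downloads are network-bound and cheap to overlap, so allow many
        io_semaphore: Arc::new(Semaphore::new(64)),
        model: engine,
    }
}
```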
**Expected Results:**
- Throughput (I/O bound): 5 img/s → 20 img/s (4x)
- Concurrent operations: 50-100 in-flight requests
- Resource utilization: Balanced I/O and compute
- Latency (p50): Unchanged
### 4.3 Pipeline Parallelism
**Objective**: Overlap different pipeline stages for continuous processing.
```rust
// Expected Improvement: 2-3x throughput with 4-stage pipeline
use std::sync::Arc;
use tokio::sync::{mpsc, Mutex};

pub struct PipelineProcessor {
    decode_workers: usize,
    preprocess_workers: usize,
    inference_workers: usize,
    postprocess_workers: usize,
    model: Arc<InferenceEngine>,
}

impl PipelineProcessor {
    pub async fn start_pipeline(
        &self,
        input_rx: mpsc::Receiver<Vec<u8>>,
    ) -> mpsc::Receiver<OCRResult> {
        // Create channels between stages. `mpsc::Receiver` is single-consumer,
        // so each stage's receiver is shared among its workers via a Mutex.
        let input_rx = Arc::new(Mutex::new(input_rx));
        let (decode_tx, decode_rx) = mpsc::channel(100);
        let decode_rx = Arc::new(Mutex::new(decode_rx));
        let (preprocess_tx, preprocess_rx) = mpsc::channel(100);
        let preprocess_rx = Arc::new(Mutex::new(preprocess_rx));
        let (inference_tx, inference_rx) = mpsc::channel(100);
        let inference_rx = Arc::new(Mutex::new(inference_rx));
        let (postprocess_tx, postprocess_rx) = mpsc::channel(100);

        // Stage 1: Image decoding
        for _ in 0..self.decode_workers {
            let rx = input_rx.clone();
            let tx = decode_tx.clone();
            tokio::spawn(async move {
                loop {
                    // Hold the lock only while pulling one item
                    let item = rx.lock().await.recv().await;
                    let Some(image_bytes) = item else { break };
                    let decoded = image::load_from_memory(&image_bytes).unwrap();
                    tx.send(decoded).await.unwrap();
                }
            });
        }

        // Stage 2: Preprocessing
        for _ in 0..self.preprocess_workers {
            let rx = decode_rx.clone();
            let tx = preprocess_tx.clone();
            tokio::spawn(async move {
                loop {
                    let item = rx.lock().await.recv().await;
                    let Some(image) = item else { break };
                    let preprocessed = preprocess_image(&image);
                    tx.send(preprocessed).await.unwrap();
                }
            });
        }

        // Stage 3: Inference (GPU bottleneck)
        for _ in 0..self.inference_workers {
            let rx = preprocess_rx.clone();
            let tx = inference_tx.clone();
            let model = self.model.clone();
            tokio::spawn(async move {
                loop {
                    let item = rx.lock().await.recv().await;
                    let Some(tensor) = item else { break };
                    let output = model.infer(&tensor).await.unwrap();
                    tx.send(output).await.unwrap();
                }
            });
        }

        // Stage 4: Postprocessing
        for _ in 0..self.postprocess_workers {
            let rx = inference_rx.clone();
            let tx = postprocess_tx.clone();
            tokio::spawn(async move {
                loop {
                    let item = rx.lock().await.recv().await;
                    let Some(output) = item else { break };
                    let result = postprocess_output(&output);
                    tx.send(result).await.unwrap();
                }
            });
        }

        postprocess_rx
    }
}
```
**Pipeline Configuration:**
```
Decode (4 workers) → Preprocess (4 workers) → Inference (2 workers) → Postprocess (4 workers)
     20ms/img             30ms/img                100ms/img               20ms/img
```
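As a back-of-envelope check, steady-state pipeline throughput is the minimum over stages of workers divided by per-item stage time. The sketch below assumes perfectly linear scaling per worker; two inference workers sharing a single GPU may serialize and land nearer the 10 img/s figure quoted below.

```rust
/// Throughput of a stage is workers / seconds_per_item; the pipeline runs
/// at the minimum across stages.
fn pipeline_throughput(stages: &[(usize, f64)]) -> f64 {
    stages
        .iter()
        .map(|&(workers, secs)| workers as f64 / secs)
        .fold(f64::INFINITY, f64::min)
}

fn main() {
    // (workers, seconds/item) for decode, preprocess, inference, postprocess
    let stages = [(4, 0.020), (4, 0.030), (2, 0.100), (4, 0.020)];
    println!("{:.0} img/s", pipeline_throughput(&stages)); // 20 img/s ideal
}
```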
**Expected Results:**
- Throughput: Limited by slowest stage (inference: 10 img/s with 2 workers)
- Latency: 170ms (sum of all stages)
- CPU utilization: 80-90% (balanced across stages)
- GPU utilization: 90%+
### 4.4 GPU Batch Scheduling
**Objective**: Optimize GPU utilization with intelligent batch scheduling.
```rust
// Expected Improvement: 40% better GPU utilization
pub struct GPUBatchScheduler {
    gpu_memory_limit: usize,
    max_batch_size: usize,
    scheduler: Arc<Mutex<Scheduler>>,
}

struct Scheduler {
    pending_queue: VecDeque<InferenceRequest>,
    current_gpu_memory: usize,
}

impl GPUBatchScheduler {
    pub async fn schedule_batch(&self) -> Option<Vec<InferenceRequest>> {
        let mut scheduler = self.scheduler.lock().await;
        let mut batch = Vec::new();
        let mut batch_memory = 0;

        while let Some(request) = scheduler.pending_queue.front() {
            let request_memory = self.estimate_memory(request);

            // Stop when either the batch-size or memory budget is exhausted
            if batch.len() >= self.max_batch_size {
                break;
            }
            if batch_memory + request_memory > self.gpu_memory_limit {
                break;
            }

            // Admit the request into the batch
            let request = scheduler.pending_queue.pop_front().unwrap();
            batch_memory += request_memory;
            batch.push(request);
        }

        if batch.is_empty() {
            None
        } else {
            scheduler.current_gpu_memory += batch_memory;
            Some(batch)
        }
    }

    pub async fn execute_with_scheduling(&self) {
        loop {
            if let Some(batch) = self.schedule_batch().await {
                let batch_memory: usize =
                    batch.iter().map(|r| self.estimate_memory(r)).sum();

                // Execute the batch
                self.execute_batch(batch).await;

                // Release the reserved GPU memory
                let mut scheduler = self.scheduler.lock().await;
                scheduler.current_gpu_memory -= batch_memory;
            } else {
                tokio::time::sleep(Duration::from_millis(10)).await;
            }
        }
    }

    fn estimate_memory(&self, request: &InferenceRequest) -> usize {
        // Rough estimate: input plus ~4x for activations
        let input_size = request.input_shape.iter().product::<usize>();
        let activation_size = input_size * 4;
        (input_size + activation_size) * std::mem::size_of::<f32>()
    }
}
```
**Expected Results:**
- GPU utilization: 60% → 85% (40% improvement)
- Memory efficiency: 70% → 95%
- Batch size variance: Reduced
- OOM errors: Eliminated
---
## 5. Caching Strategy
### 5.1 LRU Cache for Repeated Queries
**Objective**: Cache OCR results for frequently accessed images.
```rust
// Expected Improvement: ~1000x faster on cache hits (0.1ms vs 100ms)
use lru::LruCache;
use sha2::{Sha256, Digest};
use tokio::sync::Mutex;

pub struct OCRCache {
    cache: Arc<Mutex<LruCache<ImageHash, CachedResult>>>,
    ttl: Duration,
}

#[derive(Clone, Hash, Eq, PartialEq)]
struct ImageHash([u8; 32]);

struct CachedResult {
    result: OCRResult,
    timestamp: Instant,
}

impl OCRCache {
    pub fn new(capacity: usize, ttl: Duration) -> Self {
        Self {
            cache: Arc::new(Mutex::new(LruCache::new(capacity))),
            ttl,
        }
    }

    pub async fn get_or_compute<F>(
        &self,
        image: &DynamicImage,
        compute_fn: F,
    ) -> Result<OCRResult>
    where
        F: FnOnce(&DynamicImage) -> Result<OCRResult>,
    {
        // 1. Content-address the image by hashing its pixels
        let hash = self.hash_image(image);

        // 2. Check the cache
        {
            let mut cache = self.cache.lock().await;
            if let Some(cached) = cache.get(&hash) {
                // Honor the TTL
                if cached.timestamp.elapsed() < self.ttl {
                    return Ok(cached.result.clone());
                }
            }
        }

        // 3. Cache miss: compute the result
        let result = compute_fn(image)?;

        // 4. Store it for next time
        {
            let mut cache = self.cache.lock().await;
            cache.put(hash, CachedResult {
                result: result.clone(),
                timestamp: Instant::now(),
            });
        }
        Ok(result)
    }

    fn hash_image(&self, image: &DynamicImage) -> ImageHash {
        let mut hasher = Sha256::new();
        hasher.update(image.as_bytes());
        ImageHash(hasher.finalize().into())
    }

    pub async fn warm_cache(&self, common_images: Vec<(DynamicImage, OCRResult)>) {
        let mut cache = self.cache.lock().await;
        for (image, result) in common_images {
            let hash = self.hash_image(&image);
            cache.put(hash, CachedResult {
                result,
                timestamp: Instant::now(),
            });
        }
    }
}
```
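A hedged usage sketch; `engine.infer_blocking` stands in for whatever synchronous inference entry point the service exposes:

```rust
fn make_cache() -> OCRCache {
    // 1000 entries ≈ 100MB at typical result sizes; 1 hour TTL
    OCRCache::new(1000, Duration::from_secs(3600))
}

async fn cached_ocr(
    cache: &OCRCache,
    engine: &InferenceEngine,
    image: &DynamicImage,
) -> Result<OCRResult> {
    // Identical pixels hash to the same key, so re-submitted images
    // skip inference entirely.
    cache
        .get_or_compute(image, |img| engine.infer_blocking(img))
        .await
}
```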
**Expected Results:**
- Cache hit latency: 0.1ms (1000x speedup)
- Cache hit rate: 30-40% in production
- Memory overhead: ~100MB for 1000 cached results
- TTL: 1 hour (configurable)
### 5.2 Vector Embedding Cache (ruvector-core)
**Objective**: Cache embeddings for semantic search and deduplication.
```rust
// Expected Improvement: 95% faster similarity search
use ruvector_core::VectorDB;

pub struct EmbeddingCache {
    vector_db: VectorDB,
    embedding_model: Arc<EmbeddingModel>,
}

impl EmbeddingCache {
    pub async fn get_or_compute_embedding(&self, text: &str) -> Result<Vec<f32>> {
        // 1. Look for an existing embedding keyed by the text hash
        let query_hash = self.hash_text(text);
        if let Some(cached) = self.vector_db.get_by_id(&query_hash)? {
            return Ok(cached.vector);
        }

        // 2. Compute a new embedding
        let embedding = self.embedding_model.encode(text).await?;

        // 3. Store it in the vector DB
        self.vector_db.insert(
            query_hash,
            embedding.clone(),
            HashMap::from([
                ("text".to_string(), text.to_string()),
                ("timestamp".to_string(), Utc::now().to_rfc3339()),
            ]),
        )?;
        Ok(embedding)
    }

    pub async fn find_similar_results(
        &self,
        text: &str,
        top_k: usize,
    ) -> Result<Vec<OCRResult>> {
        // 1. Get the query embedding
        let embedding = self.get_or_compute_embedding(text).await?;

        // 2. Search the vector DB
        let similar = self.vector_db.search(&embedding, top_k)?;

        // 3. Return the cached results
        Ok(similar
            .into_iter()
            .map(|item| self.deserialize_result(&item.metadata))
            .collect())
    }

    pub async fn deduplicate_results(
        &self,
        results: Vec<OCRResult>,
        similarity_threshold: f32,
    ) -> Vec<OCRResult> {
        let mut deduplicated = Vec::new();

        for result in results {
            let embedding = self.get_or_compute_embedding(&result.text).await.unwrap();

            // Keep the result only if nothing sufficiently similar exists yet
            let similar = self.vector_db.search(&embedding, 1).unwrap();
            if similar.is_empty() || similar[0].score < similarity_threshold {
                deduplicated.push(result.clone());

                // Register it so later near-duplicates are dropped
                self.vector_db
                    .insert(
                        Uuid::new_v4().to_string(),
                        embedding,
                        HashMap::from([("text".to_string(), result.text.clone())]),
                    )
                    .unwrap();
            }
        }
        deduplicated
    }
}
}
```
**Expected Results:**
- Similarity search: 500ms → 25ms (20x speedup)
- Deduplication accuracy: 98%
- Storage efficiency: 768 dimensions × 4 bytes ≈ 3KB per embedding
- Scalability: Millions of embeddings
### 5.3 Result Memoization
**Objective**: Cache intermediate computation results for common patterns.
```rust
// Expected Improvement: 60% faster for repeated patterns
use moka::sync::Cache;

pub struct MemoizedOCR {
    model: Arc<InferenceEngine>,
    preprocessing_cache: Cache<PreprocessKey, Tensor>,
    inference_cache: Cache<InferenceKey, Tensor>,
    postprocessing_cache: Cache<PostprocessKey, OCRResult>,
}

#[derive(Clone, Hash, Eq, PartialEq)]
struct PreprocessKey {
    image_hash: [u8; 32],
    target_size: (usize, usize),
    normalization: NormalizationParams,
}

impl MemoizedOCR {
    pub fn new(model: Arc<InferenceEngine>) -> Self {
        Self {
            model,
            preprocessing_cache: Cache::builder()
                .max_capacity(1000)
                .time_to_live(Duration::from_secs(3600))
                .build(),
            inference_cache: Cache::builder()
                .max_capacity(500)
                .time_to_live(Duration::from_secs(1800))
                .build(),
            postprocessing_cache: Cache::builder()
                .max_capacity(2000)
                .time_to_live(Duration::from_secs(3600))
                .build(),
        }
    }

    pub async fn process_with_memoization(&self, image: &DynamicImage) -> Result<OCRResult> {
        // 1. Memoized preprocessing (synchronous closure)
        let preprocess_key = self.create_preprocess_key(image);
        let preprocessed = self
            .preprocessing_cache
            .get_with(preprocess_key, || self.preprocess(image));

        // 2. Memoized inference. Inference is async and a sync closure cannot
        // await, so the check-compute-insert steps are spelled out explicitly.
        let inference_key = self.create_inference_key(&preprocessed);
        let inference_output = match self.inference_cache.get(&inference_key) {
            Some(output) => output,
            None => {
                let output = self.model.infer(&preprocessed).await?;
                self.inference_cache.insert(inference_key, output.clone());
                output
            }
        };

        // 3. Memoized postprocessing
        let postprocess_key = self.create_postprocess_key(&inference_output);
        let result = self
            .postprocessing_cache
            .get_with(postprocess_key, || self.postprocess(&inference_output));
        Ok(result)
    }

    pub fn get_cache_stats(&self) -> CacheStats {
        // `hit_rate()` is illustrative rather than a moka API; hit/miss
        // counts are assumed to be tracked by wrapper counters.
        CacheStats {
            preprocessing_hit_rate: self.preprocessing_cache.hit_rate(),
            inference_hit_rate: self.inference_cache.hit_rate(),
            postprocessing_hit_rate: self.postprocessing_cache.hit_rate(),
        }
    }
}
```
**Expected Results:**
- Preprocessing cache hit rate: 40%
- Inference cache hit rate: 25%
- Postprocessing cache hit rate: 50%
- Overall speedup: 60% on cached patterns
---
## 6. Platform-Specific Optimizations
### 6.1 x86_64 AVX-512 Acceleration
**Objective**: Leverage AVX-512 for vectorized operations on modern Intel CPUs.
```rust
// Expected Improvement: 8-16x speedup for SIMD operations
#[cfg(target_arch = "x86_64")]
use std::arch::x86_64::*;

pub struct AVX512Processor {
    _phantom: std::marker::PhantomData<()>,
}

impl AVX512Processor {
    #[target_feature(enable = "avx512f")]
    pub unsafe fn batch_normalize_avx512(data: &mut [f32], mean: f32, std: f32) {
        let mean_vec = _mm512_set1_ps(mean);
        let std_vec = _mm512_set1_ps(std);

        // Process 16 floats at a time
        for chunk in data.chunks_exact_mut(16) {
            let values = _mm512_loadu_ps(chunk.as_ptr());
            let normalized = _mm512_div_ps(_mm512_sub_ps(values, mean_vec), std_vec);
            _mm512_storeu_ps(chunk.as_mut_ptr(), normalized);
        }

        // Handle the remainder with scalar operations
        let remainder_offset = (data.len() / 16) * 16;
        for i in remainder_offset..data.len() {
            data[i] = (data[i] - mean) / std;
        }
    }

    // Note: assumes n is a multiple of 16; pad or add a scalar tail otherwise.
    #[target_feature(enable = "avx512f")]
    pub unsafe fn matrix_multiply_avx512(
        a: &[f32],
        b: &[f32],
        c: &mut [f32],
        m: usize,
        n: usize,
        k: usize,
    ) {
        for i in 0..m {
            for j in (0..n).step_by(16) {
                let mut sum = _mm512_setzero_ps();
                for p in 0..k {
                    let a_val = _mm512_set1_ps(a[i * k + p]);
                    let b_vals = _mm512_loadu_ps(&b[p * n + j]);
                    sum = _mm512_fmadd_ps(a_val, b_vals, sum);
                }
                _mm512_storeu_ps(&mut c[i * n + j], sum);
            }
        }
    }

    #[target_feature(enable = "avx512f", enable = "avx512bw")]
    pub unsafe fn convert_u8_to_f32_avx512(input: &[u8], output: &mut [f32]) {
        // Process 16 bytes at a time
        for (chunk_in, chunk_out) in input.chunks_exact(16).zip(output.chunks_exact_mut(16)) {
            // Load 16 u8 values
            let u8_values = _mm_loadu_si128(chunk_in.as_ptr() as *const __m128i);
            // Widen to 32-bit integers
            let u32_values = _mm512_cvtepu8_epi32(u8_values);
            // Convert to f32
            let f32_values = _mm512_cvtepi32_ps(u32_values);
            // Store the result
            _mm512_storeu_ps(chunk_out.as_mut_ptr(), f32_values);
        }
    }
}
```
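The `#[target_feature]` functions above are `unsafe` to call unless the CPU actually supports AVX-512, so production code should dispatch through runtime detection. A minimal sketch with a scalar fallback:

```rust
pub fn batch_normalize(data: &mut [f32], mean: f32, std: f32) {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx512f") {
            // Safe: feature presence was just verified at runtime.
            unsafe { AVX512Processor::batch_normalize_avx512(data, mean, std) };
            return;
        }
    }
    // Scalar fallback for CPUs without AVX-512
    for v in data.iter_mut() {
        *v = (*v - mean) / std;
    }
}
```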
**Expected Results:**
- Normalization: 100ms → 8ms (12.5x speedup)
- Matrix multiplication: 500ms → 35ms (14x speedup)
- Type conversion: 50ms → 4ms (12.5x speedup)
- Throughput: 16 operations per cycle
### 6.2 ARM NEON for Mobile
**Objective**: Optimize for mobile devices using ARM NEON SIMD.
```rust
// Expected Improvement: 4-8x speedup on ARM devices
#[cfg(target_arch = "aarch64")]
use std::arch::aarch64::*;

pub struct NEONProcessor {
    _phantom: std::marker::PhantomData<()>,
}

impl NEONProcessor {
    #[target_feature(enable = "neon")]
    pub unsafe fn batch_normalize_neon(data: &mut [f32], mean: f32, std: f32) {
        let mean_vec = vdupq_n_f32(mean);
        let std_vec = vdupq_n_f32(std);

        // Process 4 floats at a time (len % 4 left to a scalar tail)
        for chunk in data.chunks_exact_mut(4) {
            let values = vld1q_f32(chunk.as_ptr());
            let sub_result = vsubq_f32(values, mean_vec);
            let div_result = vdivq_f32(sub_result, std_vec);
            vst1q_f32(chunk.as_mut_ptr(), div_result);
        }
    }

    // Heavily simplified sketch of a NEON resize kernel: the fractional
    // weights below would drive the full bilinear blend, which is elided.
    #[target_feature(enable = "neon")]
    pub unsafe fn resize_bilinear_neon(
        src: &[u8],
        dst: &mut [u8],
        src_width: usize,
        src_height: usize,
        dst_width: usize,
        dst_height: usize,
    ) {
        // 16.16 fixed-point scale factors
        let x_ratio = (src_width << 16) / dst_width;
        let y_ratio = (src_height << 16) / dst_height;

        for y in 0..dst_height {
            let src_y = (y * y_ratio) >> 16;
            let _y_diff = ((y * y_ratio) >> 8) & 0xFF; // vertical bilinear weight

            for x in (0..dst_width).step_by(4) {
                let src_x = (x * x_ratio) >> 16;
                let _x_diff = ((x * x_ratio) >> 8) & 0xFF; // horizontal bilinear weight

                // Load 8 source pixels and pass them through (interpolation elided)
                let pixels = vld1_u8(&src[src_y * src_width + src_x]);
                vst1_u8(&mut dst[y * dst_width + x], pixels);
            }
        }
    }
}
```
**Expected Results:**
- Mobile CPU usage: 80% → 40%
- Battery impact: 50% reduction
- Latency on mobile: 2000ms → 500ms (4x)
- Temperature: Reduced
### 6.3 WebAssembly SIMD
**Objective**: Enable high-performance OCR in browser environments.
```rust
// Expected Improvement: 2-4x speedup in browsers
#[cfg(target_arch = "wasm32")]
use std::arch::wasm32::*;

pub struct WasmSimdProcessor {
    _phantom: std::marker::PhantomData<()>,
}

#[cfg(target_arch = "wasm32")]
impl WasmSimdProcessor {
    pub fn batch_normalize_wasm_simd(data: &mut [f32], mean: f32, std: f32) {
        unsafe {
            let mean_vec = f32x4_splat(mean);
            let std_vec = f32x4_splat(std);

            // Process 4 floats at a time
            for chunk in data.chunks_exact_mut(4) {
                let values = v128_load(chunk.as_ptr() as *const v128);
                let sub_result = f32x4_sub(values, mean_vec);
                let div_result = f32x4_div(sub_result, std_vec);
                v128_store(chunk.as_mut_ptr() as *mut v128, div_result);
            }
        }
        // Scalar tail for lengths not divisible by 4
        let tail = data.chunks_exact_mut(4).into_remainder();
        for v in tail {
            *v = (*v - mean) / std;
        }
    }

    pub fn rgb_to_grayscale(rgb: &[u8], gray: &mut [u8]) {
        // Fixed-point Rec. 601 luma: Y ≈ (77·R + 150·G + 29·B) >> 8.
        // simd128 has no 8-bit multiply, so a vector version must first widen
        // u8 lanes to u16 (e.g. via the extending-multiply intrinsics); the
        // scalar form below keeps the sketch simple and auto-vectorizes well.
        for (px, y) in rgb.chunks_exact(3).zip(gray.iter_mut()) {
            let sum = 77 * px[0] as u32 + 150 * px[1] as u32 + 29 * px[2] as u32;
            *y = (sum >> 8) as u8;
        }
    }
}

// Compile with: --target wasm32-unknown-unknown -C target-feature=+simd128
```
**Expected Results:**
- Browser latency: 3000ms → 800ms (3.75x)
- CPU usage: 100% → 50%
- Memory: 200MB → 150MB
- Compatibility: Chrome 91+, Firefox 89+
### 6.4 GPU Acceleration
**Objective**: Leverage GPU compute for massive parallelism.
#### CUDA (NVIDIA)
```rust
// Expected Improvement: 10-50x speedup on high-end GPUs
use cudarc::driver::*;

pub struct CudaAccelerator {
    device: CudaDevice,
    kernel: CudaFunction,
}

impl CudaAccelerator {
    pub fn new() -> Result<Self> {
        let device = CudaDevice::new(0)?;

        // Load the CUDA kernel from embedded PTX
        let ptx = include_str!("kernels/ocr.ptx");
        device.load_ptx(ptx.into(), "ocr_module", &["preprocess_kernel"])?;
        let kernel = device.get_func("ocr_module", "preprocess_kernel")?;
        Ok(Self { device, kernel })
    }

    pub async fn preprocess_gpu(&self, images: &[u8]) -> Result<Tensor> {
        // 1. Allocate GPU memory and copy the input up
        let d_input = self.device.htod_copy(images.to_vec())?;
        let d_output = self.device.alloc_zeros::<f32>(images.len())?;

        // 2. Launch the kernel: one thread per byte, 256 threads per block
        let cfg = LaunchConfig {
            grid_dim: ((images.len() / 256 + 1) as u32, 1, 1),
            block_dim: (256, 1, 1),
            shared_mem_bytes: 0,
        };
        unsafe {
            self.kernel.launch(cfg, (&d_input, &d_output, images.len()))?;
        }

        // 3. Copy the result back to the host
        let output = self.device.dtoh_sync_copy(&d_output)?;
        Ok(Tensor::from_vec(output))
    }
}

// CUDA kernel (OCR preprocessing)
/*
__global__ void preprocess_kernel(
    const unsigned char* input,
    float* output,
    int size
) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < size) {
        // Normalize to [0, 1]
        output[idx] = input[idx] / 255.0f;
        // Apply mean/std normalization
        output[idx] = (output[idx] - 0.5f) / 0.5f;
    }
}
*/
```
**Expected Results:**
- Preprocessing: 100ms → 5ms (20x speedup)
- Batch processing: 1000 img/s on RTX 4090
- Memory bandwidth: 1TB/s (GPU memory)
- Power efficiency: 5x better than CPU
#### Metal (Apple Silicon)
```rust
// Expected Improvement: 15-30x speedup on M1/M2/M3
use metal::*;

pub struct MetalAccelerator {
    device: Device,
    command_queue: CommandQueue,
    pipeline: ComputePipelineState,
}

impl MetalAccelerator {
    pub fn new() -> Result<Self> {
        let device = Device::system_default().ok_or(Error::NoMetalDevice)?;
        let command_queue = device.new_command_queue();

        // Load the Metal shader
        let library = device.new_library_with_source(
            include_str!("shaders/ocr.metal"),
            &CompileOptions::new(),
        )?;
        let kernel = library.get_function("preprocess_kernel", None)?;
        let pipeline = device.new_compute_pipeline_state_with_function(&kernel)?;
        Ok(Self { device, command_queue, pipeline })
    }

    pub async fn preprocess_metal(&self, images: &[u8]) -> Result<Vec<f32>> {
        // 1. Create buffers
        let input_buffer = self.device.new_buffer_with_data(
            images.as_ptr() as *const _,
            images.len() as u64,
            MTLResourceOptions::StorageModeShared,
        );
        let output_buffer = self.device.new_buffer(
            (images.len() * std::mem::size_of::<f32>()) as u64,
            MTLResourceOptions::StorageModeShared,
        );

        // 2. Create command buffer and encoder
        let command_buffer = self.command_queue.new_command_buffer();
        let encoder = command_buffer.new_compute_command_encoder();

        // 3. Encode the kernel dispatch
        encoder.set_compute_pipeline_state(&self.pipeline);
        encoder.set_buffer(0, Some(&input_buffer), 0);
        encoder.set_buffer(1, Some(&output_buffer), 0);

        let grid_size = MTLSize::new(images.len() as u64, 1, 1);
        let threadgroup_size = MTLSize::new(256, 1, 1);
        encoder.dispatch_threads(grid_size, threadgroup_size);
        encoder.end_encoding();

        // 4. Execute and wait
        command_buffer.commit();
        command_buffer.wait_until_completed();

        // 5. Read results from shared memory
        let output_ptr = output_buffer.contents() as *const f32;
        let output = unsafe { std::slice::from_raw_parts(output_ptr, images.len()) };
        Ok(output.to_vec())
    }
}
```
**Expected Results (M2 Pro):**
- Preprocessing: 100ms → 4ms (25x speedup)
- Inference: 1000ms → 50ms (20x with CoreML)
- Power consumption: 10W vs 40W on Intel
- Unified memory: Zero-copy possible
---
## 7. Progressive Loading
### 7.1 Lazy Model Loading
**Objective**: Load model components on-demand to reduce initialization time.
```rust
// Expected Improvement: Startup time 5000ms → 500ms
use std::sync::OnceLock;

pub struct LazyModelLoader {
    encoder: OnceLock<Arc<EncoderModel>>,
    decoder: OnceLock<Arc<DecoderModel>>,
    postprocessor: OnceLock<Arc<Postprocessor>>,
    model_path: String,
}

impl LazyModelLoader {
    pub fn new(model_path: String) -> Self {
        Self {
            encoder: OnceLock::new(),
            decoder: OnceLock::new(),
            postprocessor: OnceLock::new(),
            model_path,
        }
    }

    pub async fn get_encoder(&self) -> &Arc<EncoderModel> {
        self.encoder
            .get_or_init(|| Arc::new(EncoderModel::load(&self.model_path).unwrap()))
    }

    pub async fn get_decoder(&self) -> &Arc<DecoderModel> {
        self.decoder
            .get_or_init(|| Arc::new(DecoderModel::load(&self.model_path).unwrap()))
    }

    pub async fn preload_all(&self) {
        // Load all components in parallel
        let _ = tokio::join!(
            async { self.get_encoder().await },
            async { self.get_decoder().await },
            async { self.get_postprocessor().await }
        );
    }
}

// Application with lazy loading
pub struct OCRApplication {
    model_loader: LazyModelLoader,
    feature_flags: FeatureFlags,
}

impl OCRApplication {
    pub async fn startup(&self) -> Result<()> {
        // Only load the components needed for the enabled features
        if self.feature_flags.math_ocr_enabled {
            self.model_loader.get_encoder().await;
        }
        // The decoder is loaded lazily on first use
        Ok(())
    }

    pub async fn process_first_request(&self, image: &Image) -> Result<String> {
        // Triggers lazy loading of the decoder if not yet loaded
        let encoder = self.model_loader.get_encoder().await;
        let decoder = self.model_loader.get_decoder().await;

        // Process normally
        let features = encoder.encode(image).await?;
        let text = decoder.decode(&features).await?;
        Ok(text)
    }
}
```
**Expected Results:**
- Initial startup: 5000ms → 500ms (10x faster)
- First request latency: +500ms (one-time cost)
- Memory usage: Reduced by 60% if not all features used
- User experience: App responsive immediately
### 7.2 Feature-Based Loading
**Objective**: Load only the model components needed for specific features.
```rust
// Expected Improvement: 70% memory reduction for specialized use cases
pub struct FeatureBasedModel {
    config: ModelConfig,
    loaded_features: Arc<RwLock<HashSet<Feature>>>,
    model_registry: Arc<RwLock<HashMap<Feature, Arc<dyn ModelComponent>>>>,
}

#[derive(Hash, Eq, PartialEq, Clone)]
pub enum Feature {
    MathOCR,
    HandwritingRecognition,
    DocumentLayout,
    TableExtraction,
    ChemicalFormulas,
    MusicNotation,
}

impl FeatureBasedModel {
    pub async fn load_feature(&self, feature: Feature) -> Result<()> {
        // Skip if already loaded
        {
            let loaded = self.loaded_features.read().await;
            if loaded.contains(&feature) {
                return Ok(());
            }
        }

        // Load the feature-specific model
        let model_component = match feature {
            Feature::MathOCR => {
                Arc::new(MathOCRModel::load(&self.config.math_model_path)?)
                    as Arc<dyn ModelComponent>
            }
            Feature::HandwritingRecognition => {
                Arc::new(HandwritingModel::load(&self.config.handwriting_model_path)?)
                    as Arc<dyn ModelComponent>
            }
            Feature::DocumentLayout => {
                Arc::new(LayoutModel::load(&self.config.layout_model_path)?)
                    as Arc<dyn ModelComponent>
            }
            // ... the remaining features follow the same pattern
            _ => unimplemented!(),
        };

        // Register the model
        {
            let mut registry = self.model_registry.write().await;
            registry.insert(feature.clone(), model_component);
        }

        // Mark the feature as loaded
        {
            let mut loaded = self.loaded_features.write().await;
            loaded.insert(feature);
        }
        Ok(())
    }

    pub async fn process_with_features(
        &self,
        image: &Image,
        required_features: &[Feature],
    ) -> Result<OCRResult> {
        // Load all required features first
        for feature in required_features {
            self.load_feature(feature.clone()).await?;
        }

        // Process with the loaded features and merge their outputs
        let registry = self.model_registry.read().await;
        let mut result = OCRResult::new();
        for feature in required_features {
            if let Some(model) = registry.get(feature) {
                let feature_result = model.process(image).await?;
                result.merge(feature_result);
            }
        }
        Ok(result)
    }

    pub async fn unload_feature(&self, feature: Feature) {
        let mut registry = self.model_registry.write().await;
        registry.remove(&feature);
        let mut loaded = self.loaded_features.write().await;
        loaded.remove(&feature);
    }
}

// Usage example
pub async fn process_math_document(image: &Image, config: ModelConfig) -> Result<OCRResult> {
    let model = FeatureBasedModel::new(config);

    // Only load the math OCR feature (much smaller than the full model)
    model
        .process_with_features(image, &[Feature::MathOCR, Feature::DocumentLayout])
        .await
}
```
**Model Sizes:**
- Full model: 500MB
- Math OCR only: 80MB (84% reduction)
- Handwriting only: 120MB (76% reduction)
- Document layout only: 50MB (90% reduction)
**Expected Results:**
- Memory usage: 500MB → 80-150MB (70-84% reduction)
- Loading time: 5000ms → 800ms (specialized features)
- Flexibility: Load/unload features dynamically
- Use case optimization: Perfect for specialized applications
---
## 8. Optimization Milestones
### Phase 1: Baseline (Current State)
**Target Metrics:**
- Inference latency: 1000ms/image
- Throughput: 1 image/second
- CPU utilization: 80%
- GPU utilization: 40%
- Memory usage: 2GB
- Model size: 500MB
**Implementation Status:**
- ✅ Basic ONNX Runtime integration
- ✅ Single-threaded inference
- ✅ Standard preprocessing
- ⬜ No caching
- ⬜ No batching
- ⬜ No SIMD optimizations
**Bottlenecks Identified:**
1. Sequential image processing
2. No GPU utilization optimization
3. Repeated preprocessing computations
4. Large model size
5. Memory allocation overhead
---
### Phase 2: Optimized (Target: 3 months)
**Target Metrics:**
- Inference latency: 100ms/image (10x improvement)
- Throughput: 15 images/second (15x improvement)
- CPU utilization: 60%
- GPU utilization: 85%
- Memory usage: 1GB (50% reduction)
- Model size: 125MB (75% reduction via INT8)
**Implementation Roadmap:**
#### Month 1: Model Optimization
- [ ] Implement INT8 quantization
- Expected: 4x speedup, 75% size reduction
- Risk: 2-5% accuracy loss
- Priority: HIGH
- [ ] Integrate TensorRT/OpenVINO
- Expected: 3-5x speedup
- Risk: Platform dependency
- Priority: HIGH
- [ ] Model warm-up and caching
- Expected: Eliminate cold start (5000ms → 100ms)
- Risk: Memory overhead
- Priority: MEDIUM
#### Month 2: Parallelization & Batching
- [ ] Implement batch processing
- Expected: 3-5x throughput improvement
- Risk: Increased latency for small loads
- Priority: HIGH
- [ ] Add pipeline parallelism
- Expected: 2-3x throughput
- Risk: Complexity
- Priority: MEDIUM
- [ ] Rayon for CPU parallelism
- Expected: 7-8x on 8-core CPU
- Risk: None
- Priority: HIGH
#### Month 3: Memory & Caching
- [ ] Implement LRU cache
- Expected: ~1000x faster on cache hits
- Risk: Memory overhead (100MB)
- Priority: HIGH
- [ ] Memory-mapped model loading
- Expected: 200x faster loading
- Risk: Platform compatibility
- Priority: MEDIUM
- [ ] Zero-copy preprocessing
- Expected: 40% faster preprocessing
- Risk: Complexity
- Priority: LOW
**Success Criteria:**
- ✅ Latency < 150ms (target: 100ms)
- ✅ Throughput > 10 img/s (target: 15 img/s)
- ✅ Memory < 1.5GB (target: 1GB)
- ✅ Accuracy degradation < 5%
---
### Phase 3: Production (Target: 6 months)
**Target Metrics:**
- Inference latency: 50ms/image (20x improvement)
- Throughput: 30 images/second (30x improvement)
- CPU utilization: 40%
- GPU utilization: 90%
- Memory usage: 500MB (75% reduction)
- Model size: 50MB (90% reduction via distillation)
**Implementation Roadmap:**
#### Month 4: Advanced Model Optimization
- [ ] Knowledge distillation
- Expected: 10x speedup, 80% size reduction
- Risk: 3-5% accuracy loss, requires retraining
- Priority: HIGH
- [ ] Structured pruning
- Expected: 2.5x speedup, 50% parameter reduction
- Risk: Requires fine-tuning
- Priority: MEDIUM
- [ ] Speculative decoding
- Expected: 2-3x faster text generation
- Risk: Complexity
- Priority: LOW
#### Month 5: Platform-Specific Optimization
- [ ] AVX-512 implementation
- Expected: 8-16x SIMD speedup
- Risk: Limited CPU support
- Priority: MEDIUM
- [ ] ARM NEON for mobile
- Expected: 4-8x speedup on mobile
- Risk: None
- Priority: MEDIUM
- [ ] Metal/CUDA acceleration
- Expected: 15-30x speedup
- Risk: Platform dependency
- Priority: HIGH
#### Month 6: Advanced Features
- [ ] Dynamic batching
- Expected: Optimal latency/throughput trade-off
- Risk: Complexity
- Priority: HIGH
- [ ] Streaming for large documents
- Expected: Unlimited document size
- Risk: Complexity
- Priority: MEDIUM
- [ ] Vector embedding cache
- Expected: 95% faster similarity search
- Risk: Memory overhead
- Priority: LOW
**Success Criteria:**
- ✅ Latency < 75ms (target: 50ms)
- ✅ Throughput > 25 img/s (target: 30 img/s)
- ✅ Memory < 750MB (target: 500MB)
- ✅ Accuracy degradation < 5% total
- ✅ 99.9% uptime in production
- ✅ Sub-100ms p99 latency
---
## Performance Benchmarking Suite
### Benchmark Implementation
```rust
use criterion::{black_box, criterion_group, criterion_main, Criterion, BenchmarkId};

pub fn benchmark_preprocessing(c: &mut Criterion) {
    let mut group = c.benchmark_group("preprocessing");

    for size in [224, 384, 512, 1024].iter() {
        group.bench_with_input(BenchmarkId::new("baseline", size), size, |b, &size| {
            let image = create_test_image(size, size);
            b.iter(|| preprocess_baseline(black_box(&image)))
        });
        group.bench_with_input(BenchmarkId::new("simd", size), size, |b, &size| {
            let image = create_test_image(size, size);
            b.iter(|| preprocess_simd(black_box(&image)))
        });
        group.bench_with_input(BenchmarkId::new("zero_copy", size), size, |b, &size| {
            let image = create_test_image(size, size);
            b.iter(|| preprocess_zero_copy(black_box(&image)))
        });
    }
    group.finish();
}

pub fn benchmark_inference(c: &mut Criterion) {
    let mut group = c.benchmark_group("inference");

    group.bench_function("baseline", |b| {
        let model = load_baseline_model();
        let input = create_test_tensor();
        b.iter(|| model.infer(black_box(&input)))
    });
    group.bench_function("int8_quantized", |b| {
        let model = load_int8_model();
        let input = create_test_tensor();
        b.iter(|| model.infer(black_box(&input)))
    });
    group.bench_function("distilled", |b| {
        let model = load_distilled_model();
        let input = create_test_tensor();
        b.iter(|| model.infer(black_box(&input)))
    });
    group.finish();
}

pub fn benchmark_batching(c: &mut Criterion) {
    let mut group = c.benchmark_group("batching");

    for batch_size in [1, 4, 8, 16, 32].iter() {
        group.bench_with_input(
            BenchmarkId::from_parameter(batch_size),
            batch_size,
            |b, &batch_size| {
                let images = create_test_batch(batch_size);
                b.iter(|| process_batch(black_box(&images)))
            },
        );
    }
    group.finish();
}

criterion_group!(
    benches,
    benchmark_preprocessing,
    benchmark_inference,
    benchmark_batching
);
criterion_main!(benches);
```
### Expected Benchmark Results
#### Phase 1 (Baseline)
```
preprocessing/baseline/224 100.5 ms
preprocessing/baseline/512 245.8 ms
inference/baseline 1000.2 ms
batching/1 1000.2 ms
batching/16 N/A (not implemented)
```
#### Phase 2 (Optimized)
```
preprocessing/simd/224 12.4 ms (8.1x improvement)
preprocessing/simd/512 31.2 ms (7.9x improvement)
inference/int8_quantized 248.5 ms (4.0x improvement)
batching/1 100.5 ms (10x improvement)
batching/16 65.2 ms/img (15.4x throughput)
```
#### Phase 3 (Production)
```
preprocessing/zero_copy/224 3.8 ms (26.4x improvement)
preprocessing/zero_copy/512 9.1 ms (27.0x improvement)
inference/distilled 98.3 ms (10.2x improvement)
inference/distilled+gpu 47.8 ms (20.9x improvement)
batching/1 50.2 ms (19.9x improvement)
batching/32 31.5 ms/img (31.8x throughput)
```
---
## Monitoring and Metrics
### Key Performance Indicators (KPIs)
1. **Latency Metrics**
- p50: Median latency
- p95: 95th percentile
- p99: 99th percentile
- p99.9: 99.9th percentile
2. **Throughput Metrics**
- Images/second
- Requests/second
- Tokens/second (for text generation)
3. **Resource Utilization**
- CPU usage (%)
- GPU usage (%)
- Memory usage (MB)
- Disk I/O (MB/s)
4. **Quality Metrics**
- Accuracy
- Character Error Rate (CER)
- Word Error Rate (WER)
- F1 Score
5. **Cost Metrics**
- Cost per 1000 images
- Infrastructure cost/month
- Power consumption (W)
### Continuous Monitoring
```rust
use prometheus::{Registry, Histogram, Counter, Gauge};

pub struct PerformanceMonitor {
    latency_histogram: Histogram,
    throughput_counter: Counter,
    memory_gauge: Gauge,
    accuracy_gauge: Gauge,
}

impl PerformanceMonitor {
    pub fn record_inference(&self, duration: Duration, accuracy: f32) {
        self.latency_histogram.observe(duration.as_secs_f64());
        self.throughput_counter.inc();
        self.accuracy_gauge.set(accuracy as f64);
    }

    pub fn get_report(&self) -> PerformanceReport {
        PerformanceReport {
            // Percentiles are estimated from the histogram buckets;
            // `calculate_percentile` is assumed to interpolate over them.
            p50_latency: self.calculate_percentile(50.0),
            p99_latency: self.calculate_percentile(99.0),
            // Counter total divided by the observation window (here 60s)
            throughput: self.throughput_counter.get() / 60.0,
            avg_accuracy: self.accuracy_gauge.get(),
        }
    }
}
```
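To expose these metrics for scraping, the collectors need to be registered and encoded; a sketch against the `prometheus` crate (the bucket bounds are illustrative latency bounds in seconds):

```rust
use prometheus::{Encoder, Histogram, HistogramOpts, Registry, TextEncoder};

fn register_latency_metric(registry: &Registry) -> prometheus::Result<Histogram> {
    let latency = Histogram::with_opts(
        HistogramOpts::new("ocr_latency_seconds", "End-to-end OCR latency")
            .buckets(vec![0.025, 0.05, 0.1, 0.25, 0.5, 1.0]),
    )?;
    registry.register(Box::new(latency.clone()))?;
    Ok(latency)
}

fn render_metrics(registry: &Registry) -> String {
    // Text exposition format, served from a /metrics endpoint
    let mut buf = Vec::new();
    TextEncoder::new()
        .encode(&registry.gather(), &mut buf)
        .expect("metrics encoding failed");
    String::from_utf8(buf).expect("metrics are valid UTF-8")
}
```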
---
## Conclusion
This optimization roadmap provides a systematic approach to improving the ruvector-scipix OCR system from baseline (1000ms/image) to production-ready (50ms/image) performance. The three-phase approach ensures:
1. **Quick Wins (Phase 1)**: Foundation with basic optimizations
2. **Substantial Improvements (Phase 2)**: 10x speedup through parallelization and quantization
3. **Production Excellence (Phase 3)**: 20x speedup with advanced techniques
**Key Success Factors:**
- Prioritize high-impact optimizations first
- Maintain accuracy within 5% degradation
- Benchmark continuously
- Monitor production metrics
- Iterate based on real-world usage
**Expected ROI:**
- **Performance**: 20x faster inference
- **Cost**: 75% reduction in compute costs
- **User Experience**: Sub-100ms latency
- **Scalability**: 30x throughput improvement
Implementation should follow agile methodology with 2-week sprints, continuous integration, and regular performance regression testing.