Ruvector Integration Architecture

ruvector-scipix Integration Design

Version: 1.0.0
Date: 2025-11-28
Status: Design Phase

Executive Summary

This document defines the integration architecture for ruvector-scipix, a specialized OCR crate for mathematical expressions, with the existing ruvector ecosystem. The integration leverages ruvector's high-performance vector database, HNSW indexing, distributed clustering, and WASM capabilities to provide scalable, intelligent mathematical OCR processing.

Key Integration Points:

Vector-based caching of OCR results using ruvector-core
REST API endpoints via ruvector-server extension
Browser-based OCR using ruvector-wasm
Distributed processing with ruvector-cluster
Performance tracking via ruvector-metrics
Shared configuration and error handling patterns

1. Workspace Integration

1.1 Adding to Workspace Members

Root Cargo.toml Modification:

[workspace]
members = [
    # ... existing members ...
    "crates/ruvector-gnn-wasm",

    # Scipix Integration - NEW
    "crates/ruvector-scipix-core",      # Core OCR logic
    "crates/ruvector-scipix-node",      # Node.js bindings
    "crates/ruvector-scipix-wasm",      # Browser WASM
    "crates/ruvector-scipix-server",    # HTTP server extension

    "examples/refrag-pipeline",
    "examples/scipix",                   # Examples and demos
]

[workspace.dependencies]
# ... existing dependencies ...

# Scipix-specific dependencies - NEW
reqwest = { version = "0.12", features = ["json", "multipart"] }
base64 = "0.22"
image = { version = "0.25", features = ["png", "jpeg"] }
# `optional` is not allowed in [workspace.dependencies]; member crates opt in
# with `{ workspace = true, optional = true }`.
tesseract-rs = { version = "0.14" }  # Local fallback
pdf-extract = { version = "0.7" }

Dependency Version Strategy:

Use version = "0.1.16" (workspace version) for internal crates
Use workspace = true for shared dependencies
Add scipix-specific deps to workspace.dependencies for consistency

1.2 Crate Structure

crates/
├── ruvector-scipix-core/      # Core OCR engine
│   ├── src/
│   │   ├── lib.rs              # Public API
│   │   ├── api_client.rs       # Scipix API client
│   │   ├── ocr_engine.rs       # OCR processing
│   │   ├── cache.rs            # Vector-based cache
│   │   ├── preprocessing.rs    # Image preprocessing
│   │   ├── postprocessing.rs   # LaTeX refinement
│   │   └── error.rs            # Error types
│   └── Cargo.toml
│
├── ruvector-scipix-node/      # Node.js bindings (NAPI-RS)
│   ├── src/
│   │   └── lib.rs
│   ├── npm/                    # Platform binaries
│   └── Cargo.toml
│
├── ruvector-scipix-wasm/      # WASM bindings
│   ├── src/
│   │   └── lib.rs
│   └── Cargo.toml
│
└── ruvector-scipix-server/    # Server extension
    ├── src/
    │   ├── main.rs
    │   ├── routes.rs
    │   └── middleware.rs
    └── Cargo.toml

examples/scipix/               # Examples (listed in workspace members, see 1.1)
├── src/
├── tests/
├── docs/
└── Cargo.toml                  # Standalone example

1.3 Feature Flags Strategy

Core Crate (ruvector-scipix-core):

[features]
default = ["api-client", "cache", "simd"]

# Backend features
api-client = ["reqwest", "base64"]
tesseract = ["tesseract-rs"]        # Local OCR fallback
pdf-support = ["pdf-extract"]

# Performance features
cache = ["ruvector-core/storage"]   # Vector cache
simd = ["ruvector-core/simd"]       # SIMD optimizations
quantization = ["ruvector-core"]    # Quantized embeddings

# Environment features
wasm = []                           # WASM-compatible mode
memory-only = []                    # No file I/O

2. ruvector-core Usage

2.1 Storing Math Expression Embeddings

Integration Pattern:

// crates/ruvector-scipix-core/src/cache.rs

use crate::error::Result;
use ruvector_core::{DistanceMetric, SearchQuery, VectorDB};
use std::path::Path;

/// OCR result cache using vector similarity
pub struct ScipixCache {
    /// Vector database for image embeddings
    image_db: VectorDB,
    /// Vector database for LaTeX embeddings
    latex_db: VectorDB,
    /// Embedding dimension
    dimension: usize,
}

impl ScipixCache {
    /// Create new cache with specified dimension
    pub fn new(cache_dir: &Path, dimension: usize) -> Result<Self> {
        let image_path = cache_dir.join("image_vectors.db");
        let latex_path = cache_dir.join("latex_vectors.db");

        Ok(Self {
            image_db: VectorDB::new(
                &image_path,
                dimension,
                DistanceMetric::Cosine,
            )?,
            latex_db: VectorDB::new(
                &latex_path,
                dimension,
                DistanceMetric::Cosine,
            )?,
            dimension,
        })
    }

    /// Store OCR result with image embedding
    pub fn store_result(
        &mut self,
        image_embedding: Vec<f32>,
        latex: String,
        confidence: f32,
    ) -> Result<uuid::Uuid> {
        // Store image embedding; the metadata carries the OCR output
        let id = uuid::Uuid::new_v4();
        self.image_db.add_vector(
            id,
            image_embedding,
            Some(serde_json::json!({
                "latex": latex,
                "confidence": confidence,
                // RFC 3339 string; avoids requiring chrono's `serde` feature
                "timestamp": chrono::Utc::now().to_rfc3339(),
            })),
        )?;

        // Also store LaTeX embedding for semantic search
        let latex_embedding = self.encode_latex(&latex)?;
        self.latex_db.add_vector(id, latex_embedding, None)?;

        Ok(id)
    }

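    /// Find a similar cached result.
    /// `threshold` is the minimum cosine similarity (e.g. 0.95) for a hit.
    pub fn find_similar(
        &self,
        image_embedding: Vec<f32>,
        threshold: f32,
    ) -> Result<Option<CachedResult>> {
        let query = SearchQuery::new(image_embedding)
            .with_k(1)
            .with_ef(50);

        let results = self.image_db.search(&query)?;

        if let Some(result) = results.first() {
            // Assumes `result.distance` is cosine distance (1 - similarity),
            // so a similarity threshold of 0.95 means distance <= 0.05.
            if result.distance <= 1.0 - threshold {
                let missing = |what: &str| {
                    crate::error::ScipixError::CacheError(format!("cached entry has no {what}"))
                };
                let metadata = result.metadata.as_ref().ok_or_else(|| missing("metadata"))?;
                let latex = metadata["latex"].as_str().ok_or_else(|| missing("latex"))?;
                let confidence = metadata["confidence"].as_f64()
                    .ok_or_else(|| missing("confidence"))?;

                return Ok(Some(CachedResult {
                    latex: latex.to_string(),
                    confidence: confidence as f32,
                    distance: result.distance,
                }));
            }
        }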

        Ok(None)
    }

    /// Encode LaTeX as a vector by hashing character trigrams
    fn encode_latex(&self, latex: &str) -> Result<Vec<f32>> {
        // Placeholder: a production version would use TF-IDF or learned
        // embeddings. Assumes `self.dimension > 0`.
        let mut embedding = vec![0.0; self.dimension];

        for ngram in latex.chars().collect::<Vec<_>>().windows(3) {
            let hash = ngram.iter().fold(0u64, |acc, &c| {
                acc.wrapping_mul(31).wrapping_add(c as u64)
            });
            let idx = (hash % self.dimension as u64) as usize;
            embedding[idx] += 1.0;
        }

        // Normalize
        let norm: f32 = embedding.iter().map(|&x| x * x).sum::<f32>().sqrt();
        if norm > 0.0 {
            embedding.iter_mut().for_each(|x| *x /= norm);
        }

        Ok(embedding)
    }
}

#[derive(Debug, Clone)]
pub struct CachedResult {
    pub latex: String,
    pub confidence: f32,
    pub distance: f32,
}
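The two ideas the cache relies on, trigram hashing and a similarity threshold, can be tried in isolation. The following is a minimal standalone sketch using only the standard library. The function names `encode_ngrams`, `cosine_similarity`, and `is_cache_hit` are illustrative and not part of ruvector-core, and it assumes the vector store reports cosine distance as 1 - similarity.

```rust
/// Hash character trigrams into a fixed-size, L2-normalized vector.
/// Mirrors the placeholder `encode_latex`; `dimension` must be non-zero.
pub fn encode_ngrams(text: &str, dimension: usize) -> Vec<f32> {
    assert!(dimension > 0, "dimension must be non-zero");
    let chars: Vec<char> = text.chars().collect();
    let mut embedding = vec![0.0f32; dimension];

    for ngram in chars.windows(3) {
        let hash = ngram.iter().fold(0u64, |acc, &c| {
            acc.wrapping_mul(31).wrapping_add(c as u64)
        });
        embedding[(hash % dimension as u64) as usize] += 1.0;
    }

    let norm = embedding.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm > 0.0 {
        embedding.iter_mut().for_each(|x| *x /= norm);
    }
    embedding
}

/// Cosine similarity of two equal-length vectors (0.0 if either is all zeros).
pub fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let na = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let nb = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if na == 0.0 || nb == 0.0 { 0.0 } else { dot / (na * nb) }
}

/// Cache-hit test: a similarity threshold of 0.95 corresponds to a
/// cosine distance of at most 0.05 (assuming distance = 1 - similarity).
pub fn is_cache_hit(distance: f32, similarity_threshold: f32) -> bool {
    distance <= 1.0 - similarity_threshold
}
```

Inputs shorter than three characters produce an all-zero vector, so a real cache would probably want a fallback for very short expressions.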

2.2 Quantization for Memory Efficiency

use ruvector_core::quantization::QuantizationConfig;

impl ScipixCache {
    /// Create cache with scalar quantization of the image index
    /// (relative to f32: 4x smaller at 8 bits, 8x smaller at 4 bits)
    pub fn new_quantized(
        cache_dir: &Path,
        dimension: usize,
        bits: u8,  // 4 or 8
    ) -> Result<Self> {
        let config = QuantizationConfig {
            bits,
            ..Default::default()
        };

        // Quantizer will be used internally by VectorDB
        let mut cache = Self::new(cache_dir, dimension)?;
        cache.image_db.enable_quantization(config)?;

        Ok(cache)
    }
}
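To make the memory figures concrete, here is a small standard-library sketch of the storage arithmetic and of 8-bit scalar quantization in general. It is a generic illustration, not the ruvector-core quantizer, whose internals are not shown in this document. The names `storage_bytes`, `quantize_u8`, and `dequantize_u8` are hypothetical.

```rust
/// Bytes needed to store `count` vectors of `dimension` components
/// at `bits` bits per component (32 = unquantized f32).
pub fn storage_bytes(count: usize, dimension: usize, bits: u32) -> usize {
    (count * dimension * bits as usize).div_ceil(8)
}

/// Map each component from [min, max] onto 0..=255 (8-bit scalar quantization).
/// Returns the codes together with the range needed to decode them.
pub fn quantize_u8(v: &[f32]) -> (Vec<u8>, f32, f32) {
    let min = v.iter().copied().fold(f32::INFINITY, f32::min);
    let max = v.iter().copied().fold(f32::NEG_INFINITY, f32::max);
    // A constant (or empty) vector has no range; all codes become 0.
    let scale = if max > min { 255.0 / (max - min) } else { 0.0 };
    let codes = v.iter().map(|&x| ((x - min) * scale).round() as u8).collect();
    (codes, min, max)
}

/// Approximate inverse of `quantize_u8`.
pub fn dequantize_u8(codes: &[u8], min: f32, max: f32) -> Vec<f32> {
    let step = (max - min) / 255.0;
    codes.iter().map(|&c| min + c as f32 * step).collect()
}
```

For the cache size used elsewhere in this document (100,000 vectors of 512 dimensions), the raw vectors take about 205 MB as f32 and about 51 MB at 8 bits, before any index overhead.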

2.3 HNSW Parameters for OCR Cache

use ruvector_core::index::HNSWConfig;

impl ScipixCache {
    /// Apply OCR-tuned HNSW parameters on top of a caller-supplied base config
    pub fn with_hnsw_config(mut self, config: HNSWConfig) -> Self {
        // Typical OCR workload:
        // - High recall needed (mathematical expressions must be accurate)
        // - Moderate write throughput
        // - Low latency reads

        let optimized = HNSWConfig {
            m: 32,              // Connections per layer (higher = better recall)
            ef_construction: 200, // Construction effort
            max_elements: 100_000, // Expected cache size
            ..config            // Remaining fields come from the caller
        };

        self.image_db.configure_hnsw(optimized);
        self
    }
}

3. ruvector-server Extension

3.1 Server Crate Structure

crates/ruvector-scipix-server/Cargo.toml:

[package]
name = "ruvector-scipix-server"
version.workspace = true
edition.workspace = true
license.workspace = true
authors.workspace = true
repository.workspace = true
description = "HTTP server for Scipix OCR with vector caching"

[dependencies]
# Core dependencies
ruvector-core = { version = "0.1.16", path = "../ruvector-core" }
ruvector-server = { version = "0.1.16", path = "../ruvector-server" }
ruvector-scipix-core = { version = "0.1.16", path = "../ruvector-scipix-core" }

# Web framework
axum = { version = "0.7", features = ["json", "multipart"] }
tower = "0.5"
tower-http = { version = "0.6", features = ["cors", "trace", "limit", "timeout"] }

# Async runtime
tokio = { workspace = true }

# Serialization
serde = { workspace = true }
serde_json = { workspace = true }

# Error handling
thiserror = { workspace = true }
anyhow = { workspace = true }

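# Utilities
tracing = { workspace = true }
uuid = { workspace = true }
base64 = { workspace = true }

# Optional metrics (path assumed to follow the other internal crates)
ruvector-metrics = { version = "0.1.16", path = "../ruvector-metrics", optional = true }

[features]
default = ["api-client"]
api-client = ["ruvector-scipix-core/api-client"]
metrics = ["dep:ruvector-metrics"]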

3.2 REST API Endpoints

crates/ruvector-scipix-server/src/routes.rs:

use axum::{
    Router,
    routing::{post, get},
    extract::{State, Multipart},
    Json,
    http::StatusCode,
};
use ruvector_scipix_core::{ScipixClient, ScipixCache};
use std::sync::Arc;

#[derive(Clone)]
pub struct AppState {
    pub scipix_client: Arc<ScipixClient>,
    pub cache: Arc<parking_lot::RwLock<ScipixCache>>,
}

/// Create Scipix routes
pub fn scipix_routes() -> Router<AppState> {
    Router::new()
        // Scipix API v3 endpoints
        .route("/v3/text", post(ocr_text))
        .route("/v3/pdf", post(ocr_pdf))
        .route("/v3/batch", post(ocr_batch))

        // Cache management
        .route("/cache/stats", get(cache_stats))
        .route("/cache/search", post(search_cache))
        .route("/cache/clear", post(clear_cache))
}

/// POST /v3/text - OCR text from image
async fn ocr_text(
    State(state): State<AppState>,
    mut multipart: Multipart,
) -> Result<Json<OcrResponse>, AppError> {
    let mut image_data = Vec::new();

    // Extract image from multipart
    while let Some(field) = multipart.next_field().await? {
        if field.name() == Some("image") {
            image_data = field.bytes().await?.to_vec();
        }
    }

    // Generate image embedding for cache lookup
    let embedding = state.scipix_client
        .generate_image_embedding(&image_data)?;

    // Check cache first
    if let Some(cached) = state.cache.read()
        .find_similar(embedding.clone(), 0.95)? {
        return Ok(Json(OcrResponse {
            latex: cached.latex,
            confidence: cached.confidence,
            cached: true,
        }));
    }

    // Cache miss - call Scipix API
    let result = state.scipix_client.ocr_image(&image_data).await?;

    // Store in cache
    state.cache.write().store_result(
        embedding,
        result.latex.clone(),
        result.confidence,
    )?;

    Ok(Json(OcrResponse {
        latex: result.latex,
        confidence: result.confidence,
        cached: false,
    }))
}

/// POST /v3/pdf - OCR entire PDF
async fn ocr_pdf(
    State(state): State<AppState>,
    mut multipart: Multipart,
) -> Result<Json<PdfOcrResponse>, AppError> {
    let mut pdf_data = Vec::new();

    while let Some(field) = multipart.next_field().await? {
        if field.name() == Some("pdf") {
            pdf_data = field.bytes().await?.to_vec();
        }
    }

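    // Extract pages and process them concurrently
    let pages = state.scipix_client.extract_pdf_pages(&pdf_data)?;
    let results = futures::future::join_all(
        pages.into_iter().enumerate().map(|(index, page)| {
            let client = state.scipix_client.clone();
            async move {
                // Map each OCR result to a PageResult; `page_num` is the
                // zero-based page index here
                client.ocr_image(&page).await.map(|result| PageResult {
                    page_num: index,
                    latex: result.latex,
                    confidence: result.confidence,
                })
            }
        })
    ).await;

    // Fail the whole request if any page failed
    let pages: Vec<PageResult> = results.into_iter()
        .collect::<Result<Vec<_>, _>>()?;

    Ok(Json(PdfOcrResponse { pages }))
}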

#[derive(serde::Serialize)]
struct OcrResponse {
    latex: String,
    confidence: f32,
    cached: bool,
}

#[derive(serde::Serialize)]
struct PdfOcrResponse {
    pages: Vec<PageResult>,
}

#[derive(serde::Serialize)]
struct PageResult {
    page_num: usize,
    latex: String,
    confidence: f32,
}

3.3 Authentication Integration

use axum::{
    extract::Request,
    middleware::Next,
    http::StatusCode,
};

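/// Caller identity attached to the request by `auth_middleware`
#[derive(Clone)]
pub struct ApiUser {
    pub key: String,
}

/// API key authentication middleware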
pub async fn auth_middleware(
    mut req: Request,
    next: Next,
) -> Result<axum::response::Response, StatusCode> {
    let auth_header = req.headers()
        .get("X-API-Key")
        .and_then(|h| h.to_str().ok());

    match auth_header {
        Some(key) if validate_api_key(key) => {
            // Store user context in extensions
            req.extensions_mut().insert(ApiUser {
                key: key.to_string(),
            });
            Ok(next.run(req).await)
        }
        _ => Err(StatusCode::UNAUTHORIZED),
    }
}

fn validate_api_key(key: &str) -> bool {
    // Check against database or environment
    std::env::var("MATHPIX_API_KEY")
        .map(|k| k == key)
        .unwrap_or(false)
}
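Comparing secrets with `==` can return as soon as the first byte differs, which may leak timing information. A best-effort constant-time comparison is sketched below using only the standard library. The helper names `constant_time_eq` and `validate_api_key_with` are hypothetical, and production code would more likely rely on a vetted crate such as `subtle`, since the compiler gives no hard guarantee about timing.

```rust
/// Compare two byte strings without short-circuiting on the first mismatch.
/// The length check does reveal whether the lengths match.
pub fn constant_time_eq(a: &[u8], b: &[u8]) -> bool {
    if a.len() != b.len() {
        return false;
    }
    let mut diff = 0u8;
    for (x, y) in a.iter().zip(b) {
        diff |= x ^ y;
    }
    diff == 0
}

/// Hypothetical variant of `validate_api_key` that takes the expected key
/// as a parameter, which makes it testable without environment state.
/// An unset or empty expected key never validates.
pub fn validate_api_key_with(expected: Option<&str>, provided: &str) -> bool {
    match expected {
        Some(k) if !k.is_empty() => constant_time_eq(k.as_bytes(), provided.as_bytes()),
        _ => false,
    }
}
```

In the middleware, the expected key could be read once at startup, for example with `std::env::var(...).ok()`, and passed in as `expected.as_deref()`.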

3.4 Rate Limiting

use tower::ServiceBuilder;
use tower_http::limit::RequestBodyLimitLayer;

pub fn create_server(state: AppState) -> Router {
    Router::new()
        .merge(scipix_routes())
        .layer(
            ServiceBuilder::new()
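                // Request timeout (30 s). Note: this is a timeout, not a rate
                // limit; the 100 req/min per-IP limit from the configuration
                // still needs a dedicated rate-limiting layer.
                .layer(tower_http::timeout::TimeoutLayer::new(
                    std::time::Duration::from_secs(30)
                ))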
                // Body size limit (10MB)
                .layer(RequestBodyLimitLayer::new(10 * 1024 * 1024))
                // Authentication
                .layer(axum::middleware::from_fn(auth_middleware))
        )
        .with_state(state)
}
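The 100 requests per minute per IP target can be illustrated with a fixed-window counter. This is a minimal standard-library sketch of the technique only. `FixedWindowLimiter` is a hypothetical type, not part of axum, tower, or ruvector-server, and wiring it into a middleware (shared state, locking, eviction of idle clients) is left out.

```rust
use std::collections::HashMap;
use std::net::IpAddr;
use std::time::{Duration, Instant};

/// Fixed-window request counter keyed by client IP.
pub struct FixedWindowLimiter {
    limit: u32,
    window: Duration,
    /// Per-client (window start, requests seen in that window)
    clients: HashMap<IpAddr, (Instant, u32)>,
}

impl FixedWindowLimiter {
    pub fn new(limit: u32, window: Duration) -> Self {
        Self { limit, window, clients: HashMap::new() }
    }

    /// Returns true if a request from `ip` at time `now` is allowed.
    pub fn check(&mut self, ip: IpAddr, now: Instant) -> bool {
        let entry = self.clients.entry(ip).or_insert((now, 0));
        // Start a fresh window once the previous one has elapsed.
        if now.duration_since(entry.0) >= self.window {
            *entry = (now, 0);
        }
        if entry.1 < self.limit {
            entry.1 += 1;
            true
        } else {
            false
        }
    }
}
```

Taking `now` as a parameter keeps the limiter deterministic in tests. A fixed window allows short bursts at window boundaries, so a sliding window or token bucket may be preferable if that matters.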

4. ruvector-wasm Integration

4.1 WASM Crate Configuration

crates/ruvector-scipix-wasm/Cargo.toml:

[package]
name = "ruvector-scipix-wasm"
version.workspace = true
edition.workspace = true
license.workspace = true
description = "Browser-based OCR for mathematical expressions"

[lib]
crate-type = ["cdylib", "rlib"]

[dependencies]
# Core - use memory-only features
# (TOML inline tables must stay on one line)
ruvector-core = { version = "0.1.16", path = "../ruvector-core", default-features = false, features = ["memory-only", "simd"] }
ruvector-wasm = { version = "0.1.16", path = "../ruvector-wasm" }
ruvector-scipix-core = { version = "0.1.16", path = "../ruvector-scipix-core", default-features = false, features = ["wasm"] }

# WASM bindings
wasm-bindgen = { workspace = true }
wasm-bindgen-futures = { workspace = true }
js-sys = { workspace = true }
web-sys = { workspace = true, features = [
    "CanvasRenderingContext2d",
    "Document",
    "HtmlCanvasElement",
    "ImageData",
    "Window",
    "console",
] }

# Utilities
serde = { workspace = true }
serde-wasm-bindgen = "0.6"
console_error_panic_hook = "0.1"
getrandom = { workspace = true, features = ["wasm_js"] }

[features]
default = []

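# Note: Cargo ignores [profile.*] in workspace member manifests;
# this section must live in the workspace root Cargo.toml to take effect.
[profile.release]
opt-level = "z"
lto = true
codegen-units = 1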

4.2 Browser API

crates/ruvector-scipix-wasm/src/lib.rs:

use wasm_bindgen::prelude::*;
use web_sys::{ImageData, CanvasRenderingContext2d};
use ruvector_scipix_core::{ScipixClient, ScipixCache};

#[wasm_bindgen]
pub struct ScipixWasm {
    client: ScipixClient,
    cache: ScipixCache,
}

#[wasm_bindgen]
impl ScipixWasm {
    /// Create new instance with API key
    #[wasm_bindgen(constructor)]
    pub fn new(api_key: String, app_id: String) -> Result<ScipixWasm, JsValue> {
        console_error_panic_hook::set_once();

        let client = ScipixClient::new(api_key, app_id)
            .map_err(|e| JsValue::from_str(&e.to_string()))?;

        // Use in-memory cache for WASM
        let cache = ScipixCache::new_memory(512) // 512-dim embeddings
            .map_err(|e| JsValue::from_str(&e.to_string()))?;

        Ok(Self { client, cache })
    }

    /// OCR from canvas ImageData
    #[wasm_bindgen]
    pub async fn ocr_image_data(
        &mut self,
        image_data: ImageData,
    ) -> Result<JsValue, JsValue> {
        let width = image_data.width();
        let height = image_data.height();
        let data = image_data.data().0;

        // Convert to PNG bytes
        let png_bytes = self.rgba_to_png(width, height, &data)
            .map_err(|e| JsValue::from_str(&e.to_string()))?;

        // Check cache
        let embedding = self.client.generate_image_embedding(&png_bytes)
            .map_err(|e| JsValue::from_str(&e.to_string()))?;

        if let Some(cached) = self.cache.find_similar(embedding.clone(), 0.95)
            .map_err(|e| JsValue::from_str(&e.to_string()))? {
            return Ok(serde_wasm_bindgen::to_value(&OcrResult {
                latex: cached.latex,
                confidence: cached.confidence,
                cached: true,
            })?);
        }

        // Call API
        let result = self.client.ocr_image(&png_bytes).await
            .map_err(|e| JsValue::from_str(&e.to_string()))?;

        // Cache result
        self.cache.store_result(embedding, result.latex.clone(), result.confidence)
            .map_err(|e| JsValue::from_str(&e.to_string()))?;

        Ok(serde_wasm_bindgen::to_value(&OcrResult {
            latex: result.latex,
            confidence: result.confidence,
            cached: false,
        })?)
    }

    /// OCR from canvas element
    #[wasm_bindgen]
    pub async fn ocr_canvas(
        &mut self,
        canvas_id: String,
    ) -> Result<JsValue, JsValue> {
        let window = web_sys::window()
            .ok_or_else(|| JsValue::from_str("No global window"))?;
        let document = window.document()
            .ok_or_else(|| JsValue::from_str("No document"))?;
        let canvas = document
            .get_element_by_id(&canvas_id)
            .ok_or_else(|| JsValue::from_str("Canvas not found"))?
            .dyn_into::<web_sys::HtmlCanvasElement>()?;

        let context = canvas
            .get_context("2d")?
            .ok_or_else(|| JsValue::from_str("2D context unavailable"))?
            .dyn_into::<CanvasRenderingContext2d>()?;

        let image_data = context.get_image_data(
            0.0, 0.0,
            canvas.width() as f64,
            canvas.height() as f64,
        )?;

        self.ocr_image_data(image_data).await
    }

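    fn rgba_to_png(&self, width: u32, height: u32, data: &[u8])
        -> Result<Vec<u8>, String> {
        // Encode raw RGBA8 pixels as PNG (requires the `image` crate with its
        // `png` feature as a dependency of this crate)
        use image::ImageEncoder;

        let mut png = Vec::new();
        image::codecs::png::PngEncoder::new(&mut png)
            .write_image(data, width, height, image::ExtendedColorType::Rgba8)
            .map_err(|e| e.to_string())?;
        Ok(png)
    }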
}

#[derive(serde::Serialize)]
struct OcrResult {
    latex: String,
    confidence: f32,
    cached: bool,
}

4.3 TypeScript Definitions

crates/ruvector-scipix-wasm/scipix.d.ts:

export class ScipixWasm {
  constructor(apiKey: string, appId: string);

  ocr_image_data(imageData: ImageData): Promise<OcrResult>;
  ocr_canvas(canvasId: string): Promise<OcrResult>;

  free(): void;
}

export interface OcrResult {
  latex: string;
  confidence: number;
  cached: boolean;
}

5. ruvector-metrics Integration

5.1 OCR-Specific Metrics

crates/ruvector-scipix-core/src/metrics.rs:

use prometheus::{
    Counter, Histogram, IntGauge, Registry,
    HistogramOpts, Opts,
};
use lazy_static::lazy_static;

lazy_static! {
    /// Total OCR requests
    pub static ref OCR_REQUESTS: Counter = Counter::new(
        "scipix_ocr_requests_total",
        "Total number of OCR requests"
    ).unwrap();

    /// Cache hits (hit rate = hits / (hits + misses))
    pub static ref CACHE_HITS: Counter = Counter::new(
        "scipix_cache_hits_total",
        "Number of cache hits"
    ).unwrap();

    pub static ref CACHE_MISSES: Counter = Counter::new(
        "scipix_cache_misses_total",
        "Number of cache misses"
    ).unwrap();

    /// OCR latency histogram
    pub static ref OCR_LATENCY: Histogram = Histogram::with_opts(
        HistogramOpts::new(
            "scipix_ocr_duration_seconds",
            "OCR processing duration"
        ).buckets(vec![0.1, 0.5, 1.0, 2.0, 5.0, 10.0])
    ).unwrap();

    /// Confidence score distribution
    pub static ref CONFIDENCE_SCORE: Histogram = Histogram::with_opts(
        HistogramOpts::new(
            "scipix_confidence_score",
            "OCR confidence scores"
        ).buckets(vec![0.5, 0.6, 0.7, 0.8, 0.9, 0.95, 0.99])
    ).unwrap();

    /// Active API calls
    pub static ref ACTIVE_CALLS: IntGauge = IntGauge::new(
        "scipix_active_calls",
        "Number of active API calls"
    ).unwrap();

    /// Total OCR errors (unlabeled; per-type counts would need a CounterVec)
    pub static ref OCR_ERRORS: Counter = Counter::new(
        "scipix_errors_total",
        "Total OCR errors"
    ).unwrap();
}

/// Register all metrics
pub fn register_metrics(registry: &Registry) -> Result<(), Box<dyn std::error::Error>> {
    registry.register(Box::new(OCR_REQUESTS.clone()))?;
    registry.register(Box::new(CACHE_HITS.clone()))?;
    registry.register(Box::new(CACHE_MISSES.clone()))?;
    registry.register(Box::new(OCR_LATENCY.clone()))?;
    registry.register(Box::new(CONFIDENCE_SCORE.clone()))?;
    registry.register(Box::new(ACTIVE_CALLS.clone()))?;
    registry.register(Box::new(OCR_ERRORS.clone()))?;
    Ok(())
}

/// Track OCR operation
pub struct OcrMetrics;

impl OcrMetrics {
    pub fn record_request() {
        OCR_REQUESTS.inc();
        ACTIVE_CALLS.inc();
    }

    pub fn record_cache_hit() {
        CACHE_HITS.inc();
    }

    pub fn record_cache_miss() {
        CACHE_MISSES.inc();
    }

    pub fn record_latency(duration: std::time::Duration) {
        OCR_LATENCY.observe(duration.as_secs_f64());
        ACTIVE_CALLS.dec();
    }

    pub fn record_confidence(score: f32) {
        CONFIDENCE_SCORE.observe(score as f64);
    }

    pub fn record_error() {
        OCR_ERRORS.inc();
        ACTIVE_CALLS.dec();
    }
}

5.2 Integration with ruvector-metrics

// In ScipixClient implementation
impl ScipixClient {
    pub async fn ocr_image(&self, image: &[u8]) -> Result<OcrResult> {
        use crate::metrics::OcrMetrics;

        OcrMetrics::record_request();
        let start = std::time::Instant::now();

        let result = self.ocr_image_internal(image).await;

        match result {
            Ok(ref res) => {
                OcrMetrics::record_latency(start.elapsed());
                OcrMetrics::record_confidence(res.confidence);
            }
            Err(_) => {
                OcrMetrics::record_error();
            }
        }

        result
    }
}
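Pairing every increment of an in-flight gauge with a manual decrement is easy to get wrong, for instance when a future is dropped before it completes. One common alternative is an RAII guard, sketched below with a plain atomic standing in for the Prometheus gauge. `ACTIVE`, `ActiveCallGuard`, and `guarded_call` are hypothetical names used only for illustration.

```rust
use std::sync::atomic::{AtomicI64, Ordering};

/// Stand-in for the `ACTIVE_CALLS` gauge (a plain atomic, not Prometheus).
pub static ACTIVE: AtomicI64 = AtomicI64::new(0);

/// Increments the counter on creation and decrements it on drop, so the
/// count stays balanced on success, error, early return, and cancellation.
pub struct ActiveCallGuard;

impl ActiveCallGuard {
    pub fn start() -> Self {
        ACTIVE.fetch_add(1, Ordering::Relaxed);
        ActiveCallGuard
    }
}

impl Drop for ActiveCallGuard {
    fn drop(&mut self) {
        ACTIVE.fetch_sub(1, Ordering::Relaxed);
    }
}

/// Hypothetical operation that may fail; the guard is released either way.
pub fn guarded_call(fail: bool) -> Result<u32, String> {
    let _guard = ActiveCallGuard::start();
    if fail {
        return Err("upstream error".to_string());
    }
    Ok(42)
}
```

With a guard like this, the latency and error recorders would no longer need to touch the gauge themselves.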

5.3 Prometheus Endpoint

// In server routes
use prometheus::{Encoder, TextEncoder};

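async fn metrics_handler() -> Result<String, AppError> {
    let encoder = TextEncoder::new();
    // `prometheus::gather()` reads the default registry, so the OCR metrics
    // must be registered there, e.g. `register_metrics(prometheus::default_registry())`
    let metric_families = prometheus::gather();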
    let mut buffer = Vec::new();
    encoder.encode(&metric_families, &mut buffer)?;
    Ok(String::from_utf8(buffer)?)
}

// Add to router
Router::new()
    .route("/metrics", get(metrics_handler))

6. ruvector-cluster for Distributed OCR

6.1 Sharding Strategy

crates/ruvector-scipix-core/src/distributed.rs:

use ruvector_cluster::{ClusterNode, ShardingStrategy, NodeId};
use std::sync::Arc;

/// Distributed OCR coordinator
pub struct DistributedOcr {
    cluster: Arc<ClusterNode>,
    shard_count: usize,
}

impl DistributedOcr {
    pub fn new(cluster: Arc<ClusterNode>, shard_count: usize) -> Self {
        Self { cluster, shard_count }
    }

    /// Process PDF across cluster
    pub async fn process_pdf_distributed(
        &self,
        pdf_data: Vec<u8>,
    ) -> Result<Vec<PageResult>> {
        // Extract pages
        let pages = extract_pdf_pages(&pdf_data)?;
        let total_pages = pages.len();

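        // Shard pages across cluster nodes
        let nodes = self.cluster.get_active_nodes().await?;
        if nodes.is_empty() {
            return Err(ruvector_core::RuvectorError::NoNodesAvailable.into());
        }
        // Ceiling division; `.max(1)` keeps `chunks` valid for an empty PDF
        let pages_per_node = total_pages.div_ceil(nodes.len()).max(1);

        // Distribute contiguous chunks. Computing `start..end` by index can
        // produce start > end when ceiling division leaves trailing nodes
        // without pages (e.g. 5 pages on 4 nodes), so iterate over chunks.
        let mut tasks = Vec::new();
        for (i, (node, node_pages)) in
            nodes.iter().zip(pages.chunks(pages_per_node)).enumerate()
        {
            let task = self.cluster.send_task(
                node.id,
                OcrTask {
                    pages: node_pages.to_vec(),
                    start_page: i * pages_per_node,
                },
            );
            tasks.push(task);
        }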

        // Collect results
        let results = futures::future::join_all(tasks).await;

        // Aggregate and sort by page number
        let mut all_results = Vec::new();
        for result in results {
            all_results.extend(result?);
        }
        all_results.sort_by_key(|r| r.page_num);

        Ok(all_results)
    }
}

#[derive(serde::Serialize, serde::Deserialize)]
struct OcrTask {
    pages: Vec<Vec<u8>>,
    start_page: usize,
}
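Ceiling-division sharding can leave trailing nodes without work: 5 pages over 4 nodes gives 3 shards of up to 2 pages each, and the fourth node gets nothing. The range computation therefore has to tolerate fewer shards than nodes. A minimal standalone sketch of the planning step follows. `plan_shards` is a hypothetical helper, not part of ruvector-cluster.

```rust
/// Split `total_pages` into contiguous half-open (start, end) ranges,
/// at most one per node. Returns an empty plan if there are no nodes
/// or no pages.
pub fn plan_shards(total_pages: usize, node_count: usize) -> Vec<(usize, usize)> {
    if total_pages == 0 || node_count == 0 {
        return Vec::new();
    }
    // Ceiling division: every shard except possibly the last is full.
    let per_node = total_pages.div_ceil(node_count);
    (0..total_pages)
        .step_by(per_node)
        .map(|start| (start, (start + per_node).min(total_pages)))
        .collect()
}
```

Each `start` doubles as the `start_page` offset a worker would need to number its pages.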

6.2 Load Balancing

use ruvector_cluster::LoadBalancer;

/// Smart load balancer for OCR workload
pub struct OcrLoadBalancer {
    balancer: LoadBalancer,
}

impl OcrLoadBalancer {
    /// Assign work based on node capacity and queue depth
    pub async fn assign_task(&self, task_size: usize) -> Result<NodeId> {
        let nodes = self.balancer.get_nodes().await?;

        // Score each node
        let mut best_node = None;
        let mut best_score = f64::MAX;

        for node in nodes {
            let metrics = self.balancer.get_node_metrics(node.id).await?;

            // Score based on:
            // - Queue depth (lower is better)
            // - CPU usage (lower is better)
            // - Task size compatibility
            let score =
                metrics.queue_depth as f64 * 10.0 +
                metrics.cpu_usage * 100.0 +
                (task_size as f64 - metrics.avg_task_size).abs();

            if score < best_score {
                best_score = score;
                best_node = Some(node.id);
            }
        }

        // `.into()` converts to ScipixError via its `From<RuvectorError>` impl
        best_node.ok_or_else(|| RuvectorError::NoNodesAvailable.into())
    }
}
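The scoring rule can be checked on its own with plain data. The sketch below reuses the same weights (10 per queued task, 100 per unit of CPU usage, plus the task-size mismatch). The `NodeLoad` struct is a hypothetical stand-in for whatever the cluster's node metrics type actually is, and `cpu_usage` is assumed to be a fraction between 0 and 1.

```rust
/// Per-node load snapshot; field names follow the metrics used in the scorer.
#[derive(Debug, Clone, Copy)]
pub struct NodeLoad {
    pub queue_depth: usize,
    /// Fraction in 0.0..=1.0 (assumed)
    pub cpu_usage: f64,
    pub avg_task_size: f64,
}

/// Lower is better: queue depth, CPU usage, and task-size mismatch all add cost.
pub fn score(load: &NodeLoad, task_size: usize) -> f64 {
    load.queue_depth as f64 * 10.0
        + load.cpu_usage * 100.0
        + (task_size as f64 - load.avg_task_size).abs()
}

/// Index of the lowest-scoring node, or None if `nodes` is empty.
pub fn pick_node(nodes: &[NodeLoad], task_size: usize) -> Option<usize> {
    nodes
        .iter()
        .enumerate()
        .min_by(|(_, a), (_, b)| score(a, task_size).total_cmp(&score(b, task_size)))
        .map(|(i, _)| i)
}
```

With these weights, one queued task costs as much as 10 percentage points of CPU, so a lightly queued but busy node can still win over an idle-CPU node with a deep queue.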

6.3 Result Aggregation

/// Aggregate OCR results from multiple nodes
pub struct ResultAggregator {
    results: dashmap::DashMap<uuid::Uuid, Vec<PageResult>>,
}

impl ResultAggregator {
    pub fn add_result(&self, job_id: uuid::Uuid, result: PageResult) {
        self.results.entry(job_id)
            .or_insert_with(Vec::new)
            .push(result);
    }

    pub fn get_results(&self, job_id: uuid::Uuid) -> Option<Vec<PageResult>> {
        self.results.get(&job_id).map(|r| {
            let mut results = r.clone();
            results.sort_by_key(|p| p.page_num);
            results
        })
    }

    pub fn is_complete(&self, job_id: uuid::Uuid, expected_pages: usize) -> bool {
        self.results.get(&job_id)
            .map(|r| r.len() == expected_pages)
            .unwrap_or(false)
    }
}
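An aggregator that only ever inserts will grow without bound as jobs finish. One possible refinement, sketched here with the standard library only, is to hand back the sorted pages and drop the job in a single step once it is complete. `Page` and `Aggregator` are hypothetical simplified types; a `Mutex<HashMap>` stands in for DashMap and a `u64` for the UUID job id.

```rust
use std::collections::HashMap;
use std::sync::Mutex;

#[derive(Debug, Clone, PartialEq)]
pub struct Page {
    pub page_num: usize,
    pub latex: String,
}

/// Collects pages per job until all expected pages have arrived.
#[derive(Default)]
pub struct Aggregator {
    jobs: Mutex<HashMap<u64, Vec<Page>>>,
}

impl Aggregator {
    pub fn add(&self, job_id: u64, page: Page) {
        self.jobs.lock().unwrap().entry(job_id).or_default().push(page);
    }

    /// Returns the pages sorted by page number and removes the job once
    /// exactly `expected_pages` have arrived; otherwise returns None.
    pub fn take_if_complete(&self, job_id: u64, expected_pages: usize) -> Option<Vec<Page>> {
        let mut jobs = self.jobs.lock().unwrap();
        if jobs.get(&job_id).map_or(0, Vec::len) != expected_pages {
            return None;
        }
        let mut pages = jobs.remove(&job_id)?;
        pages.sort_by_key(|p| p.page_num);
        Some(pages)
    }
}
```

Checking completeness and removing under the same lock avoids a window in which two callers both see the job as complete.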

7. Shared Configuration

7.1 Environment Variables

config/scipix.env:

# Scipix API Configuration
MATHPIX_API_KEY=your_api_key_here
MATHPIX_APP_ID=your_app_id_here
MATHPIX_API_URL=https://api.scipix.com/v3

# Cache Configuration
MATHPIX_CACHE_DIR=./data/scipix_cache
MATHPIX_CACHE_DIMENSION=512
MATHPIX_CACHE_SIZE_MB=1000
MATHPIX_CACHE_THRESHOLD=0.95

# Vector DB Configuration
RUVECTOR_HNSW_M=32
RUVECTOR_HNSW_EF_CONSTRUCTION=200
RUVECTOR_DISTANCE_METRIC=cosine

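# Quantization (0 disables quantization; the comment is kept on its own line
# because many .env parsers treat a trailing "# ..." as part of the value)
MATHPIX_QUANTIZE_BITS=8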

# Server Configuration
MATHPIX_SERVER_PORT=3000
MATHPIX_SERVER_HOST=0.0.0.0
MATHPIX_MAX_BODY_SIZE_MB=10
MATHPIX_RATE_LIMIT_PER_MIN=100

# Cluster Configuration
MATHPIX_CLUSTER_ENABLED=false
MATHPIX_CLUSTER_NODES=node1:8000,node2:8000
MATHPIX_SHARD_COUNT=4

# Metrics
MATHPIX_METRICS_ENABLED=true
MATHPIX_METRICS_PORT=9090

7.2 TOML Configuration

config/scipix.toml:

[api]
key = "${MATHPIX_API_KEY}"
app_id = "${MATHPIX_APP_ID}"
url = "https://api.scipix.com/v3"
timeout_secs = 30

[cache]
enabled = true
dir = "./data/scipix_cache"
dimension = 512
size_mb = 1000
threshold = 0.95

[cache.hnsw]
m = 32
ef_construction = 200
max_elements = 100_000

[cache.quantization]
enabled = true
bits = 8  # 4, 8, or 0 for disabled

[server]
host = "0.0.0.0"
port = 3000
max_body_size_mb = 10

[server.rate_limit]
enabled = true
requests_per_minute = 100

[cluster]
enabled = false
nodes = ["node1:8000", "node2:8000"]
shard_count = 4
replication_factor = 2

[metrics]
enabled = true
port = 9090
prometheus_endpoint = "/metrics"

[preprocessing]
# Image preprocessing options
auto_rotate = true
denoise = true
contrast_enhancement = true
dpi = 300

[postprocessing]
# LaTeX postprocessing
validate_syntax = true
normalize_symbols = true
confidence_threshold = 0.7

7.3 Configuration Loading

crates/ruvector-scipix-core/src/config.rs:

use crate::error::Result;
use serde::{Deserialize, Serialize};
use std::path::Path;

#[derive(Debug, Clone, Deserialize, Serialize)]
pub struct ScipixConfig {
    pub api: ApiConfig,
    pub cache: CacheConfig,
    pub server: ServerConfig,
    pub cluster: ClusterConfig,
    pub metrics: MetricsConfig,
    pub preprocessing: PreprocessingConfig,
    pub postprocessing: PostprocessingConfig,
}

#[derive(Debug, Clone, Deserialize, Serialize)]
pub struct ApiConfig {
    pub key: String,
    pub app_id: String,
    pub url: String,
    pub timeout_secs: u64,
}

#[derive(Debug, Clone, Deserialize, Serialize)]
pub struct CacheConfig {
    pub enabled: bool,
    pub dir: String,
    pub dimension: usize,
    pub size_mb: usize,
    pub threshold: f32,
    pub hnsw: HnswConfig,
    pub quantization: QuantizationConfig,
}

#[derive(Debug, Clone, Deserialize, Serialize)]
pub struct HnswConfig {
    pub m: usize,
    pub ef_construction: usize,
    pub max_elements: usize,
}

impl ScipixConfig {
    /// Load from TOML file with environment variable substitution
    pub fn from_file(path: &Path) -> Result<Self> {
        let content = std::fs::read_to_string(path)?;

        // Expand ${VAR} placeholders before parsing
        let expanded = Self::expand_env_vars(&content);

        // ScipixError has no `From<toml::de::Error>`, so map explicitly
        toml::from_str(&expanded)
            .map_err(|e| crate::error::ScipixError::ConfigError(e.to_string()))
    }

    /// Load from environment variables
    pub fn from_env() -> Result<Self> {
        Ok(Self {
            api: ApiConfig {
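                // ScipixError has no `From<VarError>`, so map explicitly
                key: std::env::var("MATHPIX_API_KEY").map_err(|e| {
                    crate::error::ScipixError::ConfigError(format!("MATHPIX_API_KEY: {e}"))
                })?,
                app_id: std::env::var("MATHPIX_APP_ID").map_err(|e| {
                    crate::error::ScipixError::ConfigError(format!("MATHPIX_APP_ID: {e}"))
                })?,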
                url: std::env::var("MATHPIX_API_URL")
                    .unwrap_or_else(|_| "https://api.scipix.com/v3".to_string()),
                timeout_secs: 30,
            },
            cache: CacheConfig::from_env()?,
            // ... rest of config
        })
    }

    /// Replace `${VAR}` placeholders; unset variables expand to an empty string
    fn expand_env_vars(s: &str) -> String {
        let re = regex::Regex::new(r"\$\{([^}]+)\}").unwrap();
        re.replace_all(s, |caps: &regex::Captures| {
            std::env::var(&caps[1]).unwrap_or_default()
        }).to_string()
    }
}
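Substituting an empty string for an unset variable can hide a missing API key until the first request fails. A stricter expansion that reports the missing name is sketched below without the regex dependency. `expand_vars` is a hypothetical helper, and the `lookup` parameter is an assumption introduced so the function can be tested without touching the process environment.

```rust
/// Replace every `${NAME}` in `input` using `lookup`.
/// Returns Err(name) for the first variable that `lookup` cannot resolve.
/// An unterminated `${` is copied through unchanged.
pub fn expand_vars<F>(input: &str, lookup: F) -> Result<String, String>
where
    F: Fn(&str) -> Option<String>,
{
    let mut out = String::with_capacity(input.len());
    let mut rest = input;

    while let Some(start) = rest.find("${") {
        let after = &rest[start + 2..];
        match after.find('}') {
            Some(end) => {
                let name = &after[..end];
                let value = lookup(name).ok_or_else(|| name.to_string())?;
                out.push_str(&rest[..start]);
                out.push_str(&value);
                rest = &after[end + 1..];
            }
            // No closing brace: stop and emit the remainder verbatim.
            None => break,
        }
    }
    out.push_str(rest);
    Ok(out)
}
```

Against the real environment, the lookup would be `|name| std::env::var(name).ok()`, and the `Err` could be mapped into a configuration error.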

8. Cross-Crate Types

8.1 Common Error Types

crates/ruvector-scipix-core/src/error.rs:

use thiserror::Error;
use ruvector_core::RuvectorError;

#[derive(Error, Debug)]
pub enum ScipixError {
    #[error("Scipix API error: {0}")]
    ApiError(String),

    #[error("HTTP request failed: {0}")]
    HttpError(#[from] reqwest::Error),

    #[error("Vector database error: {0}")]
    VectorDbError(#[from] RuvectorError),

    #[error("Image processing error: {0}")]
    ImageError(String),

    #[error("Invalid configuration: {0}")]
    ConfigError(String),

    #[error("Cache error: {0}")]
    CacheError(String),

    #[error("Serialization error: {0}")]
    SerializationError(#[from] serde_json::Error),

    #[error("IO error: {0}")]
    IoError(#[from] std::io::Error),

    #[error("LaTeX validation error: {0}")]
    LatexError(String),

    #[error("Rate limit exceeded")]
    RateLimitExceeded,

    #[error("Authentication failed")]
    AuthenticationFailed,

    #[error("Confidence too low: {0}")]
    LowConfidence(f32),
}

pub type Result<T> = std::result::Result<T, ScipixError>;

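impl ScipixError {
    /// Map each error variant to the HTTP status code returned to clients.
    // NOTE: this requires `axum` (or the `http` crate) as a dependency of the
    // crate that defines it; the core Cargo.toml in section 9.2 does not list it.
    pub fn status_code(&self) -> axum::http::StatusCode {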
        use axum::http::StatusCode;
        match self {
            Self::ApiError(_) => StatusCode::BAD_GATEWAY,
            Self::HttpError(_) => StatusCode::BAD_GATEWAY,
            Self::VectorDbError(_) => StatusCode::INTERNAL_SERVER_ERROR,
            Self::ImageError(_) => StatusCode::BAD_REQUEST,
            Self::ConfigError(_) => StatusCode::INTERNAL_SERVER_ERROR,
            Self::CacheError(_) => StatusCode::INTERNAL_SERVER_ERROR,
            Self::SerializationError(_) => StatusCode::BAD_REQUEST,
            Self::IoError(_) => StatusCode::INTERNAL_SERVER_ERROR,
            Self::LatexError(_) => StatusCode::UNPROCESSABLE_ENTITY,
            Self::RateLimitExceeded => StatusCode::TOO_MANY_REQUESTS,
            Self::AuthenticationFailed => StatusCode::UNAUTHORIZED,
            Self::LowConfidence(_) => StatusCode::UNPROCESSABLE_ENTITY,
        }
    }
}
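
To make the mapping above concrete, here is a minimal, std-only sketch of how a server-side error handler might turn an error into a response payload. `SketchError` is a trimmed-down stand-in for `ScipixError` (three of its variants), plain `u16` codes replace `axum::http::StatusCode`, and `ErrorBody` with its `code` strings is a hypothetical shape, not a contract defined by this document.

```rust
use std::fmt;

/// Trimmed-down stand-in for `ScipixError`, so the sketch compiles alone.
#[derive(Debug)]
pub enum SketchError {
    ApiError(String),
    RateLimitExceeded,
    LowConfidence(f32),
}

impl fmt::Display for SketchError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        // Messages mirror the `#[error(...)]` attributes on `ScipixError`.
        match self {
            Self::ApiError(msg) => write!(f, "Scipix API error: {msg}"),
            Self::RateLimitExceeded => write!(f, "Rate limit exceeded"),
            Self::LowConfidence(c) => write!(f, "Confidence too low: {c}"),
        }
    }
}

impl SketchError {
    /// Same status mapping as `ScipixError::status_code`, as plain numbers.
    pub fn status_code(&self) -> u16 {
        match self {
            Self::ApiError(_) => 502,       // BAD_GATEWAY
            Self::RateLimitExceeded => 429, // TOO_MANY_REQUESTS
            Self::LowConfidence(_) => 422,  // UNPROCESSABLE_ENTITY
        }
    }

    /// Stable machine-readable identifier for clients (hypothetical naming).
    pub fn code(&self) -> &'static str {
        match self {
            Self::ApiError(_) => "upstream_api_error",
            Self::RateLimitExceeded => "rate_limit_exceeded",
            Self::LowConfidence(_) => "low_confidence",
        }
    }
}

/// Hypothetical response payload built from an error.
#[derive(Debug, PartialEq)]
pub struct ErrorBody {
    pub status: u16,
    pub code: &'static str,
    pub message: String,
}

impl From<&SketchError> for ErrorBody {
    fn from(err: &SketchError) -> Self {
        Self {
            status: err.status_code(),
            code: err.code(),
            message: err.to_string(),
        }
    }
}
```

Keeping a separate machine-readable `code` lets clients branch on the error kind without parsing the human-readable message.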

8.2 Shared Traits

crates/ruvector-scipix-core/src/traits.rs:

use async_trait::async_trait;

use crate::error::Result;
use crate::types::OcrResult;

/// OCR engine trait (allows swapping implementations)
#[async_trait]
pub trait OcrEngine: Send + Sync {
    /// Process image to LaTeX
    async fn ocr(&self, image: &[u8]) -> Result<OcrResult>;

    /// Generate embedding for caching
    fn generate_embedding(&self, image: &[u8]) -> Result<Vec<f32>>;

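    /// Batch processing. The default implementation runs sequentially;
    /// implementors can override it to process images concurrently.
    async fn ocr_batch(&self, images: Vec<Vec<u8>>) -> Result<Vec<OcrResult>> {
        let mut results = Vec::with_capacity(images.len());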
        for image in images {
            results.push(self.ocr(&image).await?);
        }
        Ok(results)
    }
}

/// Cache trait (allows different cache backends)
pub trait OcrCache: Send + Sync {
    fn store(&mut self, embedding: Vec<f32>, result: OcrResult) -> Result<uuid::Uuid>;
    fn find_similar(&self, embedding: Vec<f32>, threshold: f32) -> Result<Option<OcrResult>>;
    fn clear(&mut self) -> Result<()>;
    fn stats(&self) -> CacheStats;
}

#[derive(Debug, Clone)]
pub struct CacheStats {
    pub total_entries: usize,
    pub memory_usage_mb: f64,
    pub hit_rate: f64,
}

/// Preprocessing trait
pub trait ImagePreprocessor: Send + Sync {
    fn preprocess(&self, image: &[u8]) -> Result<Vec<u8>>;
}

/// Postprocessing trait
pub trait LatexPostprocessor: Send + Sync {
    fn postprocess(&self, latex: &str) -> Result<String>;
    fn validate(&self, latex: &str) -> bool;
}
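
As an illustration of the `OcrCache` contract, the sketch below shows a linear-scan, in-memory backend of the kind `memory_cache.rs` might provide for WASM builds. It is a sketch under stated assumptions: `OcrResult` is reduced to two fields, entries are identified by their index instead of a `uuid::Uuid`, methods return plain values rather than the crate's `Result`, and `threshold` is assumed to mean minimum cosine similarity, which this document does not specify.

```rust
/// Simplified stand-in for the crate's `OcrResult`.
#[derive(Debug, Clone, PartialEq)]
pub struct OcrResult {
    pub latex: String,
    pub confidence: f32,
}

/// Linear-scan cache; a stand-in for an HNSW-backed store.
#[derive(Default)]
pub struct MemoryCache {
    entries: Vec<(Vec<f32>, OcrResult)>,
}

/// Cosine similarity; returns 0.0 for mismatched lengths or zero vectors.
fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    if a.len() != b.len() {
        return 0.0;
    }
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 {
        0.0
    } else {
        dot / (norm_a * norm_b)
    }
}

impl MemoryCache {
    /// Store a result and return its index (the real trait returns a UUID).
    pub fn store(&mut self, embedding: Vec<f32>, result: OcrResult) -> usize {
        self.entries.push((embedding, result));
        self.entries.len() - 1
    }

    /// Return the most similar entry whose similarity is at least `threshold`.
    pub fn find_similar(&self, embedding: &[f32], threshold: f32) -> Option<&OcrResult> {
        self.entries
            .iter()
            .map(|(stored, result)| (cosine_similarity(stored, embedding), result))
            .filter(|(score, _)| *score >= threshold)
            .max_by(|a, b| a.0.total_cmp(&b.0))
            .map(|(_, result)| result)
    }

    /// Number of cached entries.
    pub fn len(&self) -> usize {
        self.entries.len()
    }
}
```

A linear scan costs O(n) per lookup, which is acceptable for small browser-side caches; the HNSW-backed cache is what makes lookups scale on the server side.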

8.3 API Contracts

crates/ruvector-scipix-core/src/types.rs:

use serde::{Deserialize, Serialize};
use uuid::Uuid;

/// OCR result
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct OcrResult {
    pub latex: String,
    pub confidence: f32,
    pub timestamp: chrono::DateTime<chrono::Utc>,
    pub cached: bool,
    #[serde(skip_serializing_if = "Option::is_none")]
    pub metadata: Option<serde_json::Value>,
}

/// PDF page result
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct PageResult {
    pub page_num: usize,
    pub latex: String,
    pub confidence: f32,
    pub bounding_boxes: Vec<BoundingBox>,
}

/// Bounding box for detected regions
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct BoundingBox {
    pub x: u32,
    pub y: u32,
    pub width: u32,
    pub height: u32,
    pub confidence: f32,
}

/// Batch OCR request
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct BatchOcrRequest {
    pub images: Vec<ImageInput>,
    pub options: OcrOptions,
}

#[derive(Debug, Clone, Serialize, Deserialize)]
#[serde(tag = "type")]
pub enum ImageInput {
    #[serde(rename = "base64")]
    Base64 { data: String },
    #[serde(rename = "url")]
    Url { url: String },
    #[serde(rename = "bytes")]
    Bytes { data: Vec<u8> },
}

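#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct OcrOptions {
    #[serde(default)]
    pub preprocess: bool,
    #[serde(default)]
    pub postprocess: bool,
    #[serde(default = "default_confidence")]
    pub min_confidence: f32,
    #[serde(default)]
    pub use_cache: bool,
}

fn default_confidence() -> f32 { 0.7 }

// Implemented by hand so that `OcrOptions::default()` agrees with the serde
// default for `min_confidence` (a derived `Default` would yield 0.0).
impl Default for OcrOptions {
    fn default() -> Self {
        Self {
            preprocess: false,
            postprocess: false,
            min_confidence: default_confidence(),
            use_cache: false,
        }
    }
}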

/// Job status for async processing
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct JobStatus {
    pub job_id: Uuid,
    pub status: JobState,
    pub progress: f32,  // 0.0 to 1.0
    pub result: Option<Vec<PageResult>>,
    pub error: Option<String>,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq, Serialize, Deserialize)]
pub enum JobState {
    Pending,
    Processing,
    Completed,
    Failed,
}
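
The `JobState` and `progress` fields above imply a small state machine. The sketch below shows one way it could be enforced; the allowed transitions and the page-based progress calculation are assumptions for illustration, since this document does not define them, and the serde derives are omitted so the block stands alone.

```rust
/// Mirror of `JobState` without the serde derives.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum JobState {
    Pending,
    Processing,
    Completed,
    Failed,
}

impl JobState {
    /// A terminal job never changes state again.
    pub fn is_terminal(self) -> bool {
        matches!(self, JobState::Completed | JobState::Failed)
    }

    /// Assumed transition rules: Pending -> Processing | Failed,
    /// Processing -> Completed | Failed.
    pub fn can_transition_to(self, next: JobState) -> bool {
        matches!(
            (self, next),
            (JobState::Pending, JobState::Processing)
                | (JobState::Pending, JobState::Failed)
                | (JobState::Processing, JobState::Completed)
                | (JobState::Processing, JobState::Failed)
        )
    }
}

/// Progress as the fraction of pages processed, capped at 1.0.
/// An empty job reports 0.0 rather than dividing by zero.
pub fn progress(pages_done: usize, pages_total: usize) -> f32 {
    if pages_total == 0 {
        return 0.0;
    }
    (pages_done as f32 / pages_total as f32).min(1.0)
}
```

Rejecting invalid transitions at a single point keeps `JobStatus` consistent when several workers report on the same job.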

9. Workspace Cargo.toml Modifications

9.1 Complete Workspace Configuration

# Add to root Cargo.toml

[workspace]
members = [
    # ... existing members ...

    # Scipix Integration
    "crates/ruvector-scipix-core",
    "crates/ruvector-scipix-node",
    "crates/ruvector-scipix-wasm",
    "crates/ruvector-scipix-server",
]

[workspace.dependencies]
# ... existing dependencies ...

# Scipix-specific
reqwest = { version = "0.12", default-features = false, features = ["json", "multipart", "rustls-tls"] }
base64 = "0.22"
image = { version = "0.25", features = ["png", "jpeg", "webp"] }
async-trait = "0.1"
regex = "1.10"
toml = "0.8"

# Optional OCR backends. Cargo does not allow `optional` in
# [workspace.dependencies]; each member crate marks them `optional = true`.
tesseract-rs = "0.14"
pdf-extract = "0.7"

9.2 Individual Crate Cargo.toml

crates/ruvector-scipix-core/Cargo.toml:

[package]
name = "ruvector-scipix-core"
version.workspace = true
edition.workspace = true
license.workspace = true
authors.workspace = true
repository.workspace = true
description = "Mathematical OCR with vector-based caching"

[dependencies]
# Ruvector ecosystem
ruvector-core = { version = "0.1.16", path = "../ruvector-core" }
ruvector-metrics = { version = "0.1.16", path = "../ruvector-metrics", optional = true }

# HTTP client
reqwest = { workspace = true, optional = true }
base64 = { workspace = true }

# Image processing
image = { workspace = true, optional = true }

# Async
tokio = { workspace = true, features = ["rt-multi-thread"] }
async-trait = { workspace = true }

# Serialization
serde = { workspace = true }
serde_json = { workspace = true }

# Error handling
thiserror = { workspace = true }
anyhow = { workspace = true }

# Utilities
uuid = { workspace = true }
chrono = { workspace = true }
tracing = { workspace = true }
dashmap = { workspace = true }
parking_lot = { workspace = true }

# Configuration
toml = { workspace = true, optional = true }
regex = { workspace = true, optional = true }

# Optional backends
tesseract-rs = { workspace = true, optional = true }
pdf-extract = { workspace = true, optional = true }

# Metrics
prometheus = { version = "0.13", optional = true }
lazy_static = { version = "1.5", optional = true }

[dev-dependencies]
tokio = { workspace = true, features = ["macros", "test-util"] }
tempfile = "3.13"
mockall = { workspace = true }

[features]
default = ["api-client", "cache", "preprocessing"]

# Core features
api-client = ["dep:reqwest"]
cache = ["ruvector-core/storage"]
preprocessing = ["dep:image"]
metrics = ["dep:ruvector-metrics", "dep:prometheus", "dep:lazy_static"]
config = ["dep:toml", "dep:regex"]

# Optional backends
tesseract = ["dep:tesseract-rs"]
pdf = ["dep:pdf-extract"]

# Performance
simd = ["ruvector-core/simd"]
quantization = []

# Environment
wasm = []
memory-only = []

10. Module Structure

10.1 Core Module Organization

crates/ruvector-scipix-core/src/
├── lib.rs                      # Public API
├── error.rs                    # Error types
├── types.rs                    # Shared types
├── traits.rs                   # Shared traits
├── config.rs                   # Configuration
│
├── api/
│   ├── mod.rs
│   ├── client.rs              # Scipix API client
│   └── models.rs              # API request/response types
│
├── cache/
│   ├── mod.rs
│   ├── vector_cache.rs        # ruvector-core integration
│   ├── memory_cache.rs        # In-memory cache for WASM
│   └── stats.rs               # Cache statistics
│
├── ocr/
│   ├── mod.rs
│   ├── engine.rs              # Main OCR engine
│   ├── batch.rs               # Batch processing
│   └── backends/
│       ├── mod.rs
│       ├── scipix.rs         # Scipix backend
│       └── tesseract.rs       # Tesseract fallback
│
├── preprocessing/
│   ├── mod.rs
│   ├── image_ops.rs           # Image preprocessing
│   ├── filters.rs             # Denoising, enhancement
│   └── rotation.rs            # Auto-rotation
│
├── postprocessing/
│   ├── mod.rs
│   ├── latex_validate.rs      # LaTeX validation
│   └── normalize.rs           # Symbol normalization
│
├── embeddings/
│   ├── mod.rs
│   ├── image_embedder.rs      # Image to vector
│   └── latex_embedder.rs      # LaTeX to vector
│
├── distributed/
│   ├── mod.rs
│   ├── coordinator.rs         # Cluster coordination
│   ├── sharding.rs            # Work distribution
│   └── aggregator.rs          # Result aggregation
│
└── metrics/
    ├── mod.rs
    └── prometheus.rs          # Metrics collection

10.2 Server Module Organization

crates/ruvector-scipix-server/src/
├── main.rs                     # Server entry point
├── routes/
│   ├── mod.rs
│   ├── ocr.rs                 # OCR endpoints
│   ├── cache.rs               # Cache management
│   ├── health.rs              # Health checks
│   └── metrics.rs             # Metrics endpoint
│
├── middleware/
│   ├── mod.rs
│   ├── auth.rs                # API key auth
│   ├── rate_limit.rs          # Rate limiting
│   └── logging.rs             # Request logging
│
├── state.rs                    # Shared app state
└── error.rs                    # HTTP error handling

11. Integration Checklist

Phase 1: Core Integration

Create ruvector-scipix-core crate
Implement vector cache using ruvector-core
Add Scipix API client
Implement image preprocessing
Add metrics collection
Write unit tests

Phase 2: Server Extension

Create ruvector-scipix-server crate
Implement REST API endpoints
Add authentication middleware
Implement rate limiting
Add health checks
Integration tests

Phase 3: WASM Support

Create ruvector-scipix-wasm crate
Implement browser API
Add TypeScript definitions
Create example web app
Browser testing

Phase 4: Distributed Processing

Integrate ruvector-cluster
Implement work sharding
Add load balancing
Implement result aggregation
Distributed tests

Phase 5: Node.js Bindings

Create ruvector-scipix-node crate
Implement NAPI bindings
Add TypeScript types
Build platform binaries
NPM package

Phase 6: Optimization

Enable quantization
SIMD optimizations
Cache tuning
Performance benchmarks
Documentation

12. Performance Targets

Cache Performance

Hit Rate: >80% on repeated expressions
Lookup Latency: <10ms (p99)
Memory Overhead: 4-8x reduction with quantization
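
The 4-8x figure follows from storage width alone: relative to 32-bit floats, 8-bit components give 4x and 4-bit components give 8x. The sketch below works through that arithmetic. It assumes raw vector storage dominates memory use and ignores HNSW graph links and metadata; the function names are hypothetical.

```rust
/// Bytes needed to store one vector with the given component width,
/// rounded up to a whole byte.
pub fn bytes_per_vector(dims: usize, bits_per_component: usize) -> usize {
    (dims * bits_per_component + 7) / 8
}

/// Reduction factor relative to 32-bit float components.
pub fn reduction_factor(bits_per_component: usize) -> f64 {
    32.0 / bits_per_component as f64
}

/// How many vectors fit into a memory budget given in MiB.
/// Counts vector payload only; index and metadata overhead are ignored.
pub fn cache_capacity(budget_mib: usize, dims: usize, bits_per_component: usize) -> usize {
    let per_vector = bytes_per_vector(dims, bits_per_component);
    if per_vector == 0 {
        return 0;
    }
    budget_mib * 1024 * 1024 / per_vector
}
```

For an illustrative 512-dimensional embedding, this gives 2048 bytes per vector as `f32`, 512 bytes at 8 bits, and 256 bytes at 4 bits.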

API Performance

OCR Latency: <2s for single image
Throughput: >100 req/min per node
PDF Processing: <10s for 10-page document

Cluster Performance

Scaling Efficiency: >90% up to 8 nodes
Fault Tolerance: Continue with 1 node failure
Shard Rebalancing: <30s
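
This document does not define "scaling efficiency". A common definition, assumed here, is measured cluster throughput divided by the ideal linear throughput (single-node throughput times node count). The sketch below shows how a benchmark could check the >90% target under that assumption; the function names are hypothetical.

```rust
/// Scaling efficiency = cluster throughput / (single-node throughput * nodes).
/// Returns `None` when the inputs cannot produce a meaningful ratio.
pub fn scaling_efficiency(
    single_node_throughput: f64,
    cluster_throughput: f64,
    nodes: u32,
) -> Option<f64> {
    if nodes == 0 || single_node_throughput <= 0.0 {
        return None;
    }
    Some(cluster_throughput / (single_node_throughput * f64::from(nodes)))
}

/// Check the ">90% up to 8 nodes" target for one measurement.
pub fn meets_scaling_target(single: f64, cluster: f64, nodes: u32) -> bool {
    nodes <= 8 && scaling_efficiency(single, cluster, nodes).is_some_and(|e| e > 0.9)
}
```

For example, a single node at 100 req/min and an 8-node cluster at 740 req/min yields 740 / 800 = 0.925, which meets the target.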

13. Security Considerations

API Key Management

Never commit API keys to repository
Use environment variables or secure vaults
Rotate keys regularly
Implement key-per-user for multi-tenant

Rate Limiting

Per-IP and per-API-key limits
Sliding window algorithm
Graceful degradation under load
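
The sliding window algorithm named above can be sketched as a per-key log of request timestamps. This is a minimal, std-only illustration, not the middleware's actual implementation: `SlidingWindowLimiter` is a hypothetical type, keys stand for an IP address or API key, and the caller supplies the clock in milliseconds so the logic is deterministic.

```rust
use std::collections::{HashMap, VecDeque};

/// Sliding-window log limiter: at most `max_requests` per key
/// within any window of `window_ms` milliseconds.
pub struct SlidingWindowLimiter {
    max_requests: usize,
    window_ms: u64,
    hits: HashMap<String, VecDeque<u64>>,
}

impl SlidingWindowLimiter {
    pub fn new(max_requests: usize, window_ms: u64) -> Self {
        Self {
            max_requests,
            window_ms,
            hits: HashMap::new(),
        }
    }

    /// Returns `true` and records the request if it is within the limit.
    /// `now_ms` is expected to be non-decreasing per key.
    pub fn check(&mut self, key: &str, now_ms: u64) -> bool {
        let log = self.hits.entry(key.to_string()).or_default();

        // Evict timestamps that are a full window (or more) old.
        while let Some(&oldest) = log.front() {
            if now_ms.saturating_sub(oldest) >= self.window_ms {
                log.pop_front();
            } else {
                break;
            }
        }

        if log.len() < self.max_requests {
            log.push_back(now_ms);
            true
        } else {
            false
        }
    }
}
```

The log variant stores one timestamp per accepted request, so memory grows with the limit; a counter-based approximation trades some accuracy for constant memory per key.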

Input Validation

Image size limits (10MB default)
Format validation (PNG, JPEG only)
Sanitize LaTeX output
Prevent injection attacks
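
The size and format checks above can run before any decoding. The sketch below validates the size limit and the file signature (magic bytes) for PNG and JPEG. It assumes "10MB" means 10 MiB, and all type and function names are hypothetical; a signature check only identifies the container, so the image should still be decoded defensively afterwards.

```rust
/// Assumed default limit: 10 MiB.
pub const DEFAULT_MAX_IMAGE_BYTES: usize = 10 * 1024 * 1024;

#[derive(Debug, PartialEq, Eq)]
pub enum ImageFormat {
    Png,
    Jpeg,
}

#[derive(Debug, PartialEq, Eq)]
pub enum ValidationError {
    Empty,
    TooLarge { size: usize, limit: usize },
    UnsupportedFormat,
}

/// Reject empty, oversized, or non-PNG/JPEG input based on its leading bytes.
pub fn validate_image(bytes: &[u8], limit: usize) -> Result<ImageFormat, ValidationError> {
    // Standard file signatures for the two accepted formats.
    const PNG_MAGIC: [u8; 8] = [0x89, b'P', b'N', b'G', 0x0D, 0x0A, 0x1A, 0x0A];
    const JPEG_MAGIC: [u8; 3] = [0xFF, 0xD8, 0xFF];

    if bytes.is_empty() {
        return Err(ValidationError::Empty);
    }
    if bytes.len() > limit {
        return Err(ValidationError::TooLarge {
            size: bytes.len(),
            limit,
        });
    }
    if bytes.starts_with(&PNG_MAGIC) {
        Ok(ImageFormat::Png)
    } else if bytes.starts_with(&JPEG_MAGIC) {
        Ok(ImageFormat::Jpeg)
    } else {
        Err(ValidationError::UnsupportedFormat)
    }
}
```

Checking the size first means oversized uploads are rejected without inspecting their content.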

Cache Security

Encrypt sensitive cached data
Implement cache eviction policies
Prevent cache poisoning
Audit cache access

14. Monitoring & Observability

Key Metrics

scipix_ocr_requests_total - Total requests
scipix_cache_hit_rate - Cache effectiveness
scipix_ocr_duration_seconds - Latency distribution
scipix_confidence_score - Quality tracking
scipix_errors_total - Error rate

Dashboards

Real-time OCR throughput
Cache performance
Error rates by type
Confidence score distribution
Cluster health

Alerts

Error rate >5%
Latency p99 >5s
Cache hit rate <60%
Node failures
API quota exhaustion
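
In a Prometheus deployment these conditions would normally be expressed as alerting rules. Purely to make the thresholds concrete, the sketch below evaluates the three numeric alerts against a metrics snapshot; `MetricsSnapshot`, `Alert`, and `evaluate` are hypothetical names, and node failures and quota exhaustion are left out because they are not simple ratios.

```rust
/// Hypothetical point-in-time view of the counters listed under Key Metrics.
#[derive(Debug, Clone, Copy)]
pub struct MetricsSnapshot {
    pub requests: u64,
    pub errors: u64,
    pub cache_hits: u64,
    pub cache_misses: u64,
    pub latency_p99_secs: f64,
}

#[derive(Debug, PartialEq, Eq)]
pub enum Alert {
    HighErrorRate,
    HighLatency,
    LowCacheHitRate,
}

/// Apply the thresholds from the Alerts list: error rate >5%,
/// p99 latency >5s, cache hit rate <60%.
pub fn evaluate(snapshot: &MetricsSnapshot) -> Vec<Alert> {
    let mut alerts = Vec::new();

    if snapshot.requests > 0
        && snapshot.errors as f64 / snapshot.requests as f64 > 0.05
    {
        alerts.push(Alert::HighErrorRate);
    }

    if snapshot.latency_p99_secs > 5.0 {
        alerts.push(Alert::HighLatency);
    }

    // Hit rate is only meaningful once at least one lookup has happened.
    let lookups = snapshot.cache_hits + snapshot.cache_misses;
    if lookups > 0 && (snapshot.cache_hits as f64 / lookups as f64) < 0.60 {
        alerts.push(Alert::LowCacheHitRate);
    }

    alerts
}
```

Guarding on zero requests and zero lookups avoids spurious alerts right after startup, before any traffic has been observed.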

15. Migration Path

From Standalone to Integrated

Step 1: Add ruvector-core dependency

cd crates/ruvector-scipix-core
cargo add ruvector-core --path ../ruvector-core

Step 2: Migrate cache to VectorDB

// Old: HashMap-based cache
let cache = HashMap::new();

// New: Vector-based cache
let cache = ScipixCache::new("./cache", 512)?;

Step 3: Integrate metrics

use ruvector_scipix_core::metrics::OcrMetrics;

OcrMetrics::record_request();
// ... perform OCR ...
OcrMetrics::record_latency(duration);

Step 4: Deploy with cluster support

# Enable cluster feature
cargo build --release --features cluster

# Start with cluster config
MATHPIX_CLUSTER_ENABLED=true cargo run

16. Testing Strategy

Unit Tests

Vector cache operations
Embedding generation
LaTeX validation
Error handling

Integration Tests

End-to-end OCR flow
Cache hit/miss scenarios
Cluster coordination
API endpoint testing

Performance Tests

Cache lookup benchmarks
HNSW search performance
Quantization overhead
Distributed scaling

Browser Tests (WASM)

Canvas image capture
API calls from browser
Memory management
Error handling

17. Documentation Requirements

API Documentation

OpenAPI/Swagger spec
Example requests/responses
Error codes
Rate limits

Integration Guides

Quick start guide
Configuration reference
Cluster setup
WASM integration

Performance Tuning

Cache configuration
HNSW parameters
Quantization trade-offs
Cluster sizing

Conclusion

This integration architecture provides a blueprint for incorporating ruvector-scipix into the ruvector ecosystem. By leveraging existing infrastructure for vector storage, clustering, metrics, and WASM support, the design targets:

Performance: >80% cache hit rate on repeated expressions and <10ms (p99) cache lookup latency
Scalability: Horizontal scaling via ruvector-cluster
Flexibility: Multiple deployment targets (server, browser, Node.js)
Maintainability: Shared types, errors, and configuration patterns
Observability: Rich metrics and monitoring

The modular design allows incremental adoption, starting with core OCR functionality and progressively adding caching, clustering, and advanced features.

Next Steps:

Review and approve architecture
Create Phase 1 crates (ruvector-scipix-core)
Implement vector cache integration
Add comprehensive tests
Deploy initial server with basic endpoints
Iterate based on performance metrics
