# Ruvector Integration Architecture ## ruvector-scipix Integration Design **Version:** 1.0.0 **Date:** 2025-11-28 **Status:** Design Phase --- ## Executive Summary This document defines the integration architecture for `ruvector-scipix`, a specialized OCR crate for mathematical expressions, with the existing ruvector ecosystem. The integration leverages ruvector's high-performance vector database, HNSW indexing, distributed clustering, and WASM capabilities to provide scalable, intelligent mathematical OCR processing. **Key Integration Points:** - Vector-based caching of OCR results using ruvector-core - REST API endpoints via ruvector-server extension - Browser-based OCR using ruvector-wasm - Distributed processing with ruvector-cluster - Performance tracking via ruvector-metrics - Shared configuration and error handling patterns --- ## 1. Workspace Integration ### 1.1 Adding to Workspace Members **Root Cargo.toml Modification:** ```toml [workspace] members = [ # ... existing members ... "crates/ruvector-gnn-wasm", # Scipix Integration - NEW "crates/ruvector-scipix-core", # Core OCR logic "crates/ruvector-scipix-node", # Node.js bindings "crates/ruvector-scipix-wasm", # Browser WASM "crates/ruvector-scipix-server", # HTTP server extension "examples/refrag-pipeline", "examples/scipix", # Examples and demos ] [workspace.dependencies] # ... existing dependencies ... # Scipix-specific dependencies - NEW reqwest = { version = "0.12", features = ["json", "multipart"] } base64 = "0.22" image = { version = "0.25", features = ["png", "jpeg"] } tesseract-rs = { version = "0.14", optional = true } # Local fallback pdf-extract = { version = "0.7", optional = true } ``` **Dependency Version Strategy:** - Use `version = "0.1.16"` (workspace version) for internal crates - Use `workspace = true` for shared dependencies - Add scipix-specific deps to workspace.dependencies for consistency ### 1.2 Crate Structure ``` crates/ ├── ruvector-scipix-core/ # Core OCR engine │ ├── src/ │ │ ├── lib.rs # Public API │ │ ├── api_client.rs # Scipix API client │ │ ├── ocr_engine.rs # OCR processing │ │ ├── cache.rs # Vector-based cache │ │ ├── preprocessing.rs # Image preprocessing │ │ ├── postprocessing.rs # LaTeX refinement │ │ └── error.rs # Error types │ └── Cargo.toml │ ├── ruvector-scipix-node/ # Node.js bindings (NAPI-RS) │ ├── src/ │ │ └── lib.rs │ ├── npm/ # Platform binaries │ └── Cargo.toml │ ├── ruvector-scipix-wasm/ # WASM bindings │ ├── src/ │ │ └── lib.rs │ └── Cargo.toml │ └── ruvector-scipix-server/ # Server extension ├── src/ │ ├── main.rs │ ├── routes.rs │ └── middleware.rs └── Cargo.toml examples/scipix/ # Examples (NOT workspace member) ├── src/ ├── tests/ ├── docs/ └── Cargo.toml # Standalone example ``` ### 1.3 Feature Flags Strategy **Core Crate (ruvector-scipix-core):** ```toml [features] default = ["api-client", "cache", "simd"] # Backend features api-client = ["reqwest", "base64"] tesseract = ["tesseract-rs"] # Local OCR fallback pdf-support = ["pdf-extract"] # Performance features cache = ["ruvector-core/storage"] # Vector cache simd = ["ruvector-core/simd"] # SIMD optimizations quantization = ["ruvector-core"] # Quantized embeddings # Environment features wasm = [] # WASM-compatible mode memory-only = [] # No file I/O ``` --- ## 2. ruvector-core Usage ### 2.1 Storing Math Expression Embeddings **Integration Pattern:** ```rust // crates/ruvector-scipix-core/src/cache.rs use ruvector_core::{VectorDB, VectorEntry, DistanceMetric, SearchQuery}; use std::path::Path; /// OCR result cache using vector similarity pub struct ScipixCache { /// Vector database for image embeddings image_db: VectorDB, /// Vector database for LaTeX embeddings latex_db: VectorDB, /// Embedding dimension dimension: usize, } impl ScipixCache { /// Create new cache with specified dimension pub fn new(cache_dir: &Path, dimension: usize) -> Result { let image_path = cache_dir.join("image_vectors.db"); let latex_path = cache_dir.join("latex_vectors.db"); Ok(Self { image_db: VectorDB::new( &image_path, dimension, DistanceMetric::Cosine, )?, latex_db: VectorDB::new( &latex_path, dimension, DistanceMetric::Cosine, )?, dimension, }) } /// Store OCR result with image embedding pub fn store_result( &mut self, image_embedding: Vec, latex: String, confidence: f32, ) -> Result { // Store image embedding let id = uuid::Uuid::new_v4(); self.image_db.add_vector( id, image_embedding.clone(), Some(serde_json::json!({ "latex": latex, "confidence": confidence, "timestamp": chrono::Utc::now(), })), )?; // Also store LaTeX embedding for semantic search let latex_embedding = self.encode_latex(&latex)?; self.latex_db.add_vector(id, latex_embedding, None)?; Ok(id) } /// Find similar cached results pub fn find_similar( &self, image_embedding: Vec, threshold: f32, ) -> Result> { let query = SearchQuery::new(image_embedding) .with_k(1) .with_ef(50); let results = self.image_db.search(&query)?; if let Some(result) = results.first() { if result.distance <= threshold { let metadata = result.metadata.as_ref() .ok_or(RuvectorError::MetadataMissing)?; return Ok(Some(CachedResult { latex: metadata["latex"].as_str().unwrap().to_string(), confidence: metadata["confidence"].as_f64().unwrap() as f32, distance: result.distance, })); } } Ok(None) } /// Encode LaTeX to vector using simple hashing fn encode_latex(&self, latex: &str) -> Result> { // Use TF-IDF or learned embeddings // For now, simple character n-gram hashing let mut embedding = vec![0.0; self.dimension]; for ngram in latex.chars().collect::>().windows(3) { let hash = ngram.iter().fold(0u64, |acc, &c| { acc.wrapping_mul(31).wrapping_add(c as u64) }); let idx = (hash % self.dimension as u64) as usize; embedding[idx] += 1.0; } // Normalize let norm: f32 = embedding.iter().map(|&x| x * x).sum::().sqrt(); if norm > 0.0 { embedding.iter_mut().for_each(|x| *x /= norm); } Ok(embedding) } } #[derive(Debug, Clone)] pub struct CachedResult { pub latex: String, pub confidence: f32, pub distance: f32, } ``` ### 2.2 Quantization for Memory Efficiency ```rust use ruvector_core::quantization::{ScalarQuantizer, QuantizationConfig}; impl ScipixCache { /// Create cache with quantization (4-32x memory reduction) pub fn new_quantized( cache_dir: &Path, dimension: usize, bits: u8, // 4 or 8 ) -> Result { let config = QuantizationConfig { bits, ..Default::default() }; // Quantizer will be used internally by VectorDB let mut cache = Self::new(cache_dir, dimension)?; cache.image_db.enable_quantization(config)?; Ok(cache) } } ``` ### 2.3 HNSW Parameters for OCR Cache ```rust use ruvector_core::index::HNSWConfig; impl ScipixCache { /// Optimize HNSW for OCR workload pub fn with_hnsw_config(mut self, config: HNSWConfig) -> Self { // Typical OCR workload: // - High recall needed (mathematical expressions must be accurate) // - Moderate write throughput // - Low latency reads let optimized = HNSWConfig { m: 32, // Connections per layer (higher = better recall) ef_construction: 200, // Construction effort max_elements: 100_000, // Expected cache size ..Default::default() }; self.image_db.configure_hnsw(optimized); self } } ``` --- ## 3. ruvector-server Extension ### 3.1 Server Crate Structure **crates/ruvector-scipix-server/Cargo.toml:** ```toml [package] name = "ruvector-scipix-server" version.workspace = true edition.workspace = true license.workspace = true authors.workspace = true repository.workspace = true description = "HTTP server for Scipix OCR with vector caching" [dependencies] # Core dependencies ruvector-core = { version = "0.1.16", path = "../ruvector-core" } ruvector-server = { version = "0.1.16", path = "../ruvector-server" } ruvector-scipix-core = { version = "0.1.16", path = "../ruvector-scipix-core" } # Web framework axum = { version = "0.7", features = ["json", "multipart"] } tower = "0.5" tower-http = { version = "0.6", features = ["cors", "trace", "limit"] } # Async runtime tokio = { workspace = true } # Serialization serde = { workspace = true } serde_json = { workspace = true } # Error handling thiserror = { workspace = true } anyhow = { workspace = true } # Utilities tracing = { workspace = true } uuid = { workspace = true } base64 = { workspace = true } [features] default = ["api-client"] api-client = ["ruvector-scipix-core/api-client"] metrics = ["ruvector-metrics"] ``` ### 3.2 REST API Endpoints **crates/ruvector-scipix-server/src/routes.rs:** ```rust use axum::{ Router, routing::{post, get}, extract::{State, Multipart}, Json, http::StatusCode, }; use ruvector_scipix_core::{ScipixClient, ScipixCache}; use std::sync::Arc; #[derive(Clone)] pub struct AppState { pub scipix_client: Arc, pub cache: Arc>, } /// Create Scipix routes pub fn scipix_routes() -> Router { Router::new() // Scipix API v3 endpoints .route("/v3/text", post(ocr_text)) .route("/v3/pdf", post(ocr_pdf)) .route("/v3/batch", post(ocr_batch)) // Cache management .route("/cache/stats", get(cache_stats)) .route("/cache/search", post(search_cache)) .route("/cache/clear", post(clear_cache)) } /// POST /v3/text - OCR text from image async fn ocr_text( State(state): State, mut multipart: Multipart, ) -> Result, AppError> { let mut image_data = Vec::new(); // Extract image from multipart while let Some(field) = multipart.next_field().await? { if field.name() == Some("image") { image_data = field.bytes().await?.to_vec(); } } // Generate image embedding for cache lookup let embedding = state.scipix_client .generate_image_embedding(&image_data)?; // Check cache first if let Some(cached) = state.cache.read() .find_similar(embedding.clone(), 0.95)? { return Ok(Json(OcrResponse { latex: cached.latex, confidence: cached.confidence, cached: true, })); } // Cache miss - call Scipix API let result = state.scipix_client.ocr_image(&image_data).await?; // Store in cache state.cache.write().store_result( embedding, result.latex.clone(), result.confidence, )?; Ok(Json(OcrResponse { latex: result.latex, confidence: result.confidence, cached: false, })) } /// POST /v3/pdf - OCR entire PDF async fn ocr_pdf( State(state): State, mut multipart: Multipart, ) -> Result, AppError> { let mut pdf_data = Vec::new(); while let Some(field) = multipart.next_field().await? { if field.name() == Some("pdf") { pdf_data = field.bytes().await?.to_vec(); } } // Extract pages and process in parallel let pages = state.scipix_client.extract_pdf_pages(&pdf_data)?; let results = futures::future::join_all( pages.into_iter().map(|page| { let client = state.scipix_client.clone(); async move { client.ocr_image(&page).await } }) ).await; let pages: Vec<_> = results.into_iter() .collect::, _>>()?; Ok(Json(PdfOcrResponse { pages })) } #[derive(serde::Serialize)] struct OcrResponse { latex: String, confidence: f32, cached: bool, } #[derive(serde::Serialize)] struct PdfOcrResponse { pages: Vec, } #[derive(serde::Serialize)] struct PageResult { page_num: usize, latex: String, confidence: f32, } ``` ### 3.3 Authentication Integration ```rust use axum::{ extract::Request, middleware::Next, http::StatusCode, }; /// API key authentication middleware pub async fn auth_middleware( mut req: Request, next: Next, ) -> Result { let auth_header = req.headers() .get("X-API-Key") .and_then(|h| h.to_str().ok()); match auth_header { Some(key) if validate_api_key(key) => { // Store user context in extensions req.extensions_mut().insert(ApiUser { key: key.to_string(), }); Ok(next.run(req).await) } _ => Err(StatusCode::UNAUTHORIZED), } } fn validate_api_key(key: &str) -> bool { // Check against database or environment std::env::var("MATHPIX_API_KEY") .map(|k| k == key) .unwrap_or(false) } ``` ### 3.4 Rate Limiting ```rust use tower::ServiceBuilder; use tower_http::limit::RequestBodyLimitLayer; pub fn create_server(state: AppState) -> Router { Router::new() .merge(scipix_routes()) .layer( ServiceBuilder::new() // Rate limiting (100 req/min per IP) .layer(tower_http::timeout::TimeoutLayer::new( std::time::Duration::from_secs(30) )) // Body size limit (10MB) .layer(RequestBodyLimitLayer::new(10 * 1024 * 1024)) // Authentication .layer(axum::middleware::from_fn(auth_middleware)) ) .with_state(state) } ``` --- ## 4. ruvector-wasm Integration ### 4.1 WASM Crate Configuration **crates/ruvector-scipix-wasm/Cargo.toml:** ```toml [package] name = "ruvector-scipix-wasm" version.workspace = true edition.workspace = true license.workspace = true description = "Browser-based OCR for mathematical expressions" [lib] crate-type = ["cdylib", "rlib"] [dependencies] # Core - use memory-only features ruvector-core = { version = "0.1.16", path = "../ruvector-core", default-features = false, features = ["memory-only", "simd"] } ruvector-wasm = { version = "0.1.16", path = "../ruvector-wasm" } ruvector-scipix-core = { version = "0.1.16", path = "../ruvector-scipix-core", default-features = false, features = ["wasm"] } # WASM bindings wasm-bindgen = { workspace = true } wasm-bindgen-futures = { workspace = true } js-sys = { workspace = true } web-sys = { workspace = true, features = [ "CanvasRenderingContext2d", "HtmlCanvasElement", "ImageData", "console", ] } # Utilities serde = { workspace = true } serde-wasm-bindgen = "0.6" console_error_panic_hook = "0.1" getrandom = { workspace = true, features = ["wasm_js"] } [features] default = [] [profile.release] opt-level = "z" lto = true codegen-units = 1 ``` ### 4.2 Browser API **crates/ruvector-scipix-wasm/src/lib.rs:** ```rust use wasm_bindgen::prelude::*; use web_sys::{ImageData, CanvasRenderingContext2d}; use ruvector_scipix_core::{ScipixClient, ScipixCache}; #[wasm_bindgen] pub struct ScipixWasm { client: ScipixClient, cache: ScipixCache, } #[wasm_bindgen] impl ScipixWasm { /// Create new instance with API key #[wasm_bindgen(constructor)] pub fn new(api_key: String, app_id: String) -> Result { console_error_panic_hook::set_once(); let client = ScipixClient::new(api_key, app_id) .map_err(|e| JsValue::from_str(&e.to_string()))?; // Use in-memory cache for WASM let cache = ScipixCache::new_memory(512) // 512-dim embeddings .map_err(|e| JsValue::from_str(&e.to_string()))?; Ok(Self { client, cache }) } /// OCR from canvas ImageData #[wasm_bindgen] pub async fn ocr_image_data( &mut self, image_data: ImageData, ) -> Result { let width = image_data.width(); let height = image_data.height(); let data = image_data.data().0; // Convert to PNG bytes let png_bytes = self.rgba_to_png(width, height, &data) .map_err(|e| JsValue::from_str(&e.to_string()))?; // Check cache let embedding = self.client.generate_image_embedding(&png_bytes) .map_err(|e| JsValue::from_str(&e.to_string()))?; if let Some(cached) = self.cache.find_similar(embedding.clone(), 0.95) .map_err(|e| JsValue::from_str(&e.to_string()))? { return Ok(serde_wasm_bindgen::to_value(&OcrResult { latex: cached.latex, confidence: cached.confidence, cached: true, })?); } // Call API let result = self.client.ocr_image(&png_bytes).await .map_err(|e| JsValue::from_str(&e.to_string()))?; // Cache result self.cache.store_result(embedding, result.latex.clone(), result.confidence) .map_err(|e| JsValue::from_str(&e.to_string()))?; Ok(serde_wasm_bindgen::to_value(&OcrResult { latex: result.latex, confidence: result.confidence, cached: false, })?) } /// OCR from canvas element #[wasm_bindgen] pub async fn ocr_canvas( &mut self, canvas_id: String, ) -> Result { let window = web_sys::window().unwrap(); let document = window.document().unwrap(); let canvas = document .get_element_by_id(&canvas_id) .ok_or_else(|| JsValue::from_str("Canvas not found"))? .dyn_into::()?; let context = canvas .get_context("2d")? .unwrap() .dyn_into::()?; let image_data = context.get_image_data( 0.0, 0.0, canvas.width() as f64, canvas.height() as f64, )?; self.ocr_image_data(image_data).await } fn rgba_to_png(&self, width: u32, height: u32, data: &[u8]) -> Result, String> { // Use image crate to encode PNG // (simplified - actual implementation would use image crate) Ok(data.to_vec()) } } #[derive(serde::Serialize)] struct OcrResult { latex: String, confidence: f32, cached: bool, } ``` ### 4.3 TypeScript Definitions **crates/ruvector-scipix-wasm/scipix.d.ts:** ```typescript export class ScipixWasm { constructor(apiKey: string, appId: string); ocr_image_data(imageData: ImageData): Promise; ocr_canvas(canvasId: string): Promise; free(): void; } export interface OcrResult { latex: string; confidence: number; cached: boolean; } ``` --- ## 5. ruvector-metrics Integration ### 5.1 OCR-Specific Metrics **crates/ruvector-scipix-core/src/metrics.rs:** ```rust use prometheus::{ Counter, Histogram, IntGauge, Registry, HistogramOpts, Opts, }; use lazy_static::lazy_static; lazy_static! { /// Total OCR requests pub static ref OCR_REQUESTS: Counter = Counter::new( "scipix_ocr_requests_total", "Total number of OCR requests" ).unwrap(); /// Cache hit rate pub static ref CACHE_HITS: Counter = Counter::new( "scipix_cache_hits_total", "Number of cache hits" ).unwrap(); pub static ref CACHE_MISSES: Counter = Counter::new( "scipix_cache_misses_total", "Number of cache misses" ).unwrap(); /// OCR latency histogram pub static ref OCR_LATENCY: Histogram = Histogram::with_opts( HistogramOpts::new( "scipix_ocr_duration_seconds", "OCR processing duration" ).buckets(vec![0.1, 0.5, 1.0, 2.0, 5.0, 10.0]) ).unwrap(); /// Confidence score distribution pub static ref CONFIDENCE_SCORE: Histogram = Histogram::with_opts( HistogramOpts::new( "scipix_confidence_score", "OCR confidence scores" ).buckets(vec![0.5, 0.6, 0.7, 0.8, 0.9, 0.95, 0.99]) ).unwrap(); /// Active API calls pub static ref ACTIVE_CALLS: IntGauge = IntGauge::new( "scipix_active_calls", "Number of active API calls" ).unwrap(); /// Error counter by type pub static ref OCR_ERRORS: Counter = Counter::new( "scipix_errors_total", "Total OCR errors" ).unwrap(); } /// Register all metrics pub fn register_metrics(registry: &Registry) -> Result<(), Box> { registry.register(Box::new(OCR_REQUESTS.clone()))?; registry.register(Box::new(CACHE_HITS.clone()))?; registry.register(Box::new(CACHE_MISSES.clone()))?; registry.register(Box::new(OCR_LATENCY.clone()))?; registry.register(Box::new(CONFIDENCE_SCORE.clone()))?; registry.register(Box::new(ACTIVE_CALLS.clone()))?; registry.register(Box::new(OCR_ERRORS.clone()))?; Ok(()) } /// Track OCR operation pub struct OcrMetrics; impl OcrMetrics { pub fn record_request() { OCR_REQUESTS.inc(); ACTIVE_CALLS.inc(); } pub fn record_cache_hit() { CACHE_HITS.inc(); } pub fn record_cache_miss() { CACHE_MISSES.inc(); } pub fn record_latency(duration: std::time::Duration) { OCR_LATENCY.observe(duration.as_secs_f64()); ACTIVE_CALLS.dec(); } pub fn record_confidence(score: f32) { CONFIDENCE_SCORE.observe(score as f64); } pub fn record_error() { OCR_ERRORS.inc(); ACTIVE_CALLS.dec(); } } ``` ### 5.2 Integration with ruvector-metrics ```rust // In ScipixClient implementation impl ScipixClient { pub async fn ocr_image(&self, image: &[u8]) -> Result { use crate::metrics::OcrMetrics; OcrMetrics::record_request(); let start = std::time::Instant::now(); let result = self.ocr_image_internal(image).await; match result { Ok(ref res) => { OcrMetrics::record_latency(start.elapsed()); OcrMetrics::record_confidence(res.confidence); } Err(_) => { OcrMetrics::record_error(); } } result } } ``` ### 5.3 Prometheus Endpoint ```rust // In server routes use prometheus::{Encoder, TextEncoder}; async fn metrics_handler() -> Result { let encoder = TextEncoder::new(); let metric_families = prometheus::gather(); let mut buffer = Vec::new(); encoder.encode(&metric_families, &mut buffer)?; Ok(String::from_utf8(buffer)?) } // Add to router Router::new() .route("/metrics", get(metrics_handler)) ``` --- ## 6. ruvector-cluster for Distributed OCR ### 6.1 Sharding Strategy **crates/ruvector-scipix-core/src/distributed.rs:** ```rust use ruvector_cluster::{ClusterNode, ShardingStrategy, NodeId}; use std::sync::Arc; /// Distributed OCR coordinator pub struct DistributedOcr { cluster: Arc, shard_count: usize, } impl DistributedOcr { pub fn new(cluster: Arc, shard_count: usize) -> Self { Self { cluster, shard_count } } /// Process PDF across cluster pub async fn process_pdf_distributed( &self, pdf_data: Vec, ) -> Result> { // Extract pages let pages = extract_pdf_pages(&pdf_data)?; let total_pages = pages.len(); // Shard pages across cluster nodes let nodes = self.cluster.get_active_nodes().await?; let pages_per_node = (total_pages + nodes.len() - 1) / nodes.len(); // Distribute work let mut tasks = Vec::new(); for (i, node) in nodes.iter().enumerate() { let start = i * pages_per_node; let end = ((i + 1) * pages_per_node).min(total_pages); let node_pages: Vec<_> = pages[start..end].to_vec(); let task = self.cluster.send_task( node.id, OcrTask { pages: node_pages, start_page: start, }, ); tasks.push(task); } // Collect results let results = futures::future::join_all(tasks).await; // Aggregate and sort by page number let mut all_results = Vec::new(); for result in results { all_results.extend(result?); } all_results.sort_by_key(|r| r.page_num); Ok(all_results) } } #[derive(serde::Serialize, serde::Deserialize)] struct OcrTask { pages: Vec>, start_page: usize, } ``` ### 6.2 Load Balancing ```rust use ruvector_cluster::LoadBalancer; /// Smart load balancer for OCR workload pub struct OcrLoadBalancer { balancer: LoadBalancer, } impl OcrLoadBalancer { /// Assign work based on node capacity and queue depth pub async fn assign_task(&self, task_size: usize) -> Result { let nodes = self.balancer.get_nodes().await?; // Score each node let mut best_node = None; let mut best_score = f64::MAX; for node in nodes { let metrics = self.balancer.get_node_metrics(node.id).await?; // Score based on: // - Queue depth (lower is better) // - CPU usage (lower is better) // - Task size compatibility let score = metrics.queue_depth as f64 * 10.0 + metrics.cpu_usage * 100.0 + (task_size as f64 - metrics.avg_task_size).abs(); if score < best_score { best_score = score; best_node = Some(node.id); } } best_node.ok_or_else(|| RuvectorError::NoNodesAvailable) } } ``` ### 6.3 Result Aggregation ```rust /// Aggregate OCR results from multiple nodes pub struct ResultAggregator { results: dashmap::DashMap>, } impl ResultAggregator { pub fn add_result(&self, job_id: uuid::Uuid, result: PageResult) { self.results.entry(job_id) .or_insert_with(Vec::new) .push(result); } pub fn get_results(&self, job_id: uuid::Uuid) -> Option> { self.results.get(&job_id).map(|r| { let mut results = r.clone(); results.sort_by_key(|p| p.page_num); results }) } pub fn is_complete(&self, job_id: uuid::Uuid, expected_pages: usize) -> bool { self.results.get(&job_id) .map(|r| r.len() == expected_pages) .unwrap_or(false) } } ``` --- ## 7. Shared Configuration ### 7.1 Environment Variables **config/scipix.env:** ```bash # Scipix API Configuration MATHPIX_API_KEY=your_api_key_here MATHPIX_APP_ID=your_app_id_here MATHPIX_API_URL=https://api.scipix.com/v3 # Cache Configuration MATHPIX_CACHE_DIR=./data/scipix_cache MATHPIX_CACHE_DIMENSION=512 MATHPIX_CACHE_SIZE_MB=1000 MATHPIX_CACHE_THRESHOLD=0.95 # Vector DB Configuration RUVECTOR_HNSW_M=32 RUVECTOR_HNSW_EF_CONSTRUCTION=200 RUVECTOR_DISTANCE_METRIC=cosine # Quantization MATHPIX_QUANTIZE_BITS=8 # 0 for no quantization # Server Configuration MATHPIX_SERVER_PORT=3000 MATHPIX_SERVER_HOST=0.0.0.0 MATHPIX_MAX_BODY_SIZE_MB=10 MATHPIX_RATE_LIMIT_PER_MIN=100 # Cluster Configuration MATHPIX_CLUSTER_ENABLED=false MATHPIX_CLUSTER_NODES=node1:8000,node2:8000 MATHPIX_SHARD_COUNT=4 # Metrics MATHPIX_METRICS_ENABLED=true MATHPIX_METRICS_PORT=9090 ``` ### 7.2 TOML Configuration **config/scipix.toml:** ```toml [api] key = "${MATHPIX_API_KEY}" app_id = "${MATHPIX_APP_ID}" url = "https://api.scipix.com/v3" timeout_secs = 30 [cache] enabled = true dir = "./data/scipix_cache" dimension = 512 size_mb = 1000 threshold = 0.95 [cache.hnsw] m = 32 ef_construction = 200 max_elements = 100_000 [cache.quantization] enabled = true bits = 8 # 4, 8, or 0 for disabled [server] host = "0.0.0.0" port = 3000 max_body_size_mb = 10 [server.rate_limit] enabled = true requests_per_minute = 100 [cluster] enabled = false nodes = ["node1:8000", "node2:8000"] shard_count = 4 replication_factor = 2 [metrics] enabled = true port = 9090 prometheus_endpoint = "/metrics" [preprocessing] # Image preprocessing options auto_rotate = true denoise = true contrast_enhancement = true dpi = 300 [postprocessing] # LaTeX postprocessing validate_syntax = true normalize_symbols = true confidence_threshold = 0.7 ``` ### 7.3 Configuration Loading **crates/ruvector-scipix-core/src/config.rs:** ```rust use serde::{Deserialize, Serialize}; use std::path::Path; #[derive(Debug, Clone, Deserialize, Serialize)] pub struct ScipixConfig { pub api: ApiConfig, pub cache: CacheConfig, pub server: ServerConfig, pub cluster: ClusterConfig, pub metrics: MetricsConfig, pub preprocessing: PreprocessingConfig, pub postprocessing: PostprocessingConfig, } #[derive(Debug, Clone, Deserialize, Serialize)] pub struct ApiConfig { pub key: String, pub app_id: String, pub url: String, pub timeout_secs: u64, } #[derive(Debug, Clone, Deserialize, Serialize)] pub struct CacheConfig { pub enabled: bool, pub dir: String, pub dimension: usize, pub size_mb: usize, pub threshold: f32, pub hnsw: HnswConfig, pub quantization: QuantizationConfig, } #[derive(Debug, Clone, Deserialize, Serialize)] pub struct HnswConfig { pub m: usize, pub ef_construction: usize, pub max_elements: usize, } impl ScipixConfig { /// Load from TOML file with environment variable substitution pub fn from_file(path: &Path) -> Result { let content = std::fs::read_to_string(path)?; // Expand environment variables let expanded = Self::expand_env_vars(&content); let config: ScipixConfig = toml::from_str(&expanded)?; Ok(config) } /// Load from environment variables pub fn from_env() -> Result { Ok(Self { api: ApiConfig { key: std::env::var("MATHPIX_API_KEY")?, app_id: std::env::var("MATHPIX_APP_ID")?, url: std::env::var("MATHPIX_API_URL") .unwrap_or_else(|_| "https://api.scipix.com/v3".to_string()), timeout_secs: 30, }, cache: CacheConfig::from_env()?, // ... rest of config }) } fn expand_env_vars(s: &str) -> String { let re = regex::Regex::new(r"\$\{([^}]+)\}").unwrap(); re.replace_all(s, |caps: ®ex::Captures| { std::env::var(&caps[1]).unwrap_or_default() }).to_string() } } ``` --- ## 8. Cross-Crate Types ### 8.1 Common Error Types **crates/ruvector-scipix-core/src/error.rs:** ```rust use thiserror::Error; use ruvector_core::RuvectorError; #[derive(Error, Debug)] pub enum ScipixError { #[error("Scipix API error: {0}")] ApiError(String), #[error("HTTP request failed: {0}")] HttpError(#[from] reqwest::Error), #[error("Vector database error: {0}")] VectorDbError(#[from] RuvectorError), #[error("Image processing error: {0}")] ImageError(String), #[error("Invalid configuration: {0}")] ConfigError(String), #[error("Cache error: {0}")] CacheError(String), #[error("Serialization error: {0}")] SerializationError(#[from] serde_json::Error), #[error("IO error: {0}")] IoError(#[from] std::io::Error), #[error("LaTeX validation error: {0}")] LatexError(String), #[error("Rate limit exceeded")] RateLimitExceeded, #[error("Authentication failed")] AuthenticationFailed, #[error("Confidence too low: {0}")] LowConfidence(f32), } pub type Result = std::result::Result; /// Convert to HTTP status code impl ScipixError { pub fn status_code(&self) -> axum::http::StatusCode { use axum::http::StatusCode; match self { Self::ApiError(_) => StatusCode::BAD_GATEWAY, Self::HttpError(_) => StatusCode::BAD_GATEWAY, Self::VectorDbError(_) => StatusCode::INTERNAL_SERVER_ERROR, Self::ImageError(_) => StatusCode::BAD_REQUEST, Self::ConfigError(_) => StatusCode::INTERNAL_SERVER_ERROR, Self::CacheError(_) => StatusCode::INTERNAL_SERVER_ERROR, Self::SerializationError(_) => StatusCode::BAD_REQUEST, Self::IoError(_) => StatusCode::INTERNAL_SERVER_ERROR, Self::LatexError(_) => StatusCode::UNPROCESSABLE_ENTITY, Self::RateLimitExceeded => StatusCode::TOO_MANY_REQUESTS, Self::AuthenticationFailed => StatusCode::UNAUTHORIZED, Self::LowConfidence(_) => StatusCode::UNPROCESSABLE_ENTITY, } } } ``` ### 8.2 Shared Traits **crates/ruvector-scipix-core/src/traits.rs:** ```rust use async_trait::async_trait; /// OCR engine trait (allows swapping implementations) #[async_trait] pub trait OcrEngine: Send + Sync { /// Process image to LaTeX async fn ocr(&self, image: &[u8]) -> Result; /// Generate embedding for caching fn generate_embedding(&self, image: &[u8]) -> Result>; /// Batch processing async fn ocr_batch(&self, images: Vec>) -> Result> { let mut results = Vec::new(); for image in images { results.push(self.ocr(&image).await?); } Ok(results) } } /// Cache trait (allows different cache backends) pub trait OcrCache: Send + Sync { fn store(&mut self, embedding: Vec, result: OcrResult) -> Result; fn find_similar(&self, embedding: Vec, threshold: f32) -> Result>; fn clear(&mut self) -> Result<()>; fn stats(&self) -> CacheStats; } #[derive(Debug, Clone)] pub struct CacheStats { pub total_entries: usize, pub memory_usage_mb: f64, pub hit_rate: f64, } /// Preprocessing trait pub trait ImagePreprocessor: Send + Sync { fn preprocess(&self, image: &[u8]) -> Result>; } /// Postprocessing trait pub trait LatexPostprocessor: Send + Sync { fn postprocess(&self, latex: &str) -> Result; fn validate(&self, latex: &str) -> bool; } ``` ### 8.3 API Contracts **crates/ruvector-scipix-core/src/types.rs:** ```rust use serde::{Deserialize, Serialize}; use uuid::Uuid; /// OCR result #[derive(Debug, Clone, Serialize, Deserialize)] pub struct OcrResult { pub latex: String, pub confidence: f32, pub timestamp: chrono::DateTime, pub cached: bool, #[serde(skip_serializing_if = "Option::is_none")] pub metadata: Option, } /// PDF page result #[derive(Debug, Clone, Serialize, Deserialize)] pub struct PageResult { pub page_num: usize, pub latex: String, pub confidence: f32, pub bounding_boxes: Vec, } /// Bounding box for detected regions #[derive(Debug, Clone, Serialize, Deserialize)] pub struct BoundingBox { pub x: u32, pub y: u32, pub width: u32, pub height: u32, pub confidence: f32, } /// Batch OCR request #[derive(Debug, Clone, Serialize, Deserialize)] pub struct BatchOcrRequest { pub images: Vec, pub options: OcrOptions, } #[derive(Debug, Clone, Serialize, Deserialize)] #[serde(tag = "type")] pub enum ImageInput { #[serde(rename = "base64")] Base64 { data: String }, #[serde(rename = "url")] Url { url: String }, #[serde(rename = "bytes")] Bytes { data: Vec }, } #[derive(Debug, Clone, Serialize, Deserialize, Default)] pub struct OcrOptions { #[serde(default)] pub preprocess: bool, #[serde(default)] pub postprocess: bool, #[serde(default = "default_confidence")] pub min_confidence: f32, #[serde(default)] pub use_cache: bool, } fn default_confidence() -> f32 { 0.7 } /// Job status for async processing #[derive(Debug, Clone, Serialize, Deserialize)] pub struct JobStatus { pub job_id: Uuid, pub status: JobState, pub progress: f32, // 0.0 to 1.0 pub result: Option>, pub error: Option, } #[derive(Debug, Clone, Copy, PartialEq, Eq, Serialize, Deserialize)] pub enum JobState { Pending, Processing, Completed, Failed, } ``` --- ## 9. Workspace Cargo.toml Modifications ### 9.1 Complete Workspace Configuration ```toml # Add to root Cargo.toml [workspace] members = [ # ... existing members ... # Scipix Integration "crates/ruvector-scipix-core", "crates/ruvector-scipix-node", "crates/ruvector-scipix-wasm", "crates/ruvector-scipix-server", ] [workspace.dependencies] # ... existing dependencies ... # Scipix-specific reqwest = { version = "0.12", default-features = false, features = ["json", "multipart", "rustls-tls"] } base64 = "0.22" image = { version = "0.25", features = ["png", "jpeg", "webp"] } async-trait = "0.1" regex = "1.10" toml = "0.8" # Optional OCR backends tesseract-rs = { version = "0.14", optional = true } pdf-extract = { version = "0.7", optional = true } ``` ### 9.2 Individual Crate Cargo.toml **crates/ruvector-scipix-core/Cargo.toml:** ```toml [package] name = "ruvector-scipix-core" version.workspace = true edition.workspace = true license.workspace = true authors.workspace = true repository.workspace = true description = "Mathematical OCR with vector-based caching" [dependencies] # Ruvector ecosystem ruvector-core = { version = "0.1.16", path = "../ruvector-core" } ruvector-metrics = { version = "0.1.16", path = "../ruvector-metrics", optional = true } # HTTP client reqwest = { workspace = true, optional = true } base64 = { workspace = true } # Image processing image = { workspace = true, optional = true } # Async tokio = { workspace = true, features = ["rt-multi-thread"] } async-trait = { workspace = true } # Serialization serde = { workspace = true } serde_json = { workspace = true } # Error handling thiserror = { workspace = true } anyhow = { workspace = true } # Utilities uuid = { workspace = true } chrono = { workspace = true } tracing = { workspace = true } dashmap = { workspace = true } parking_lot = { workspace = true } # Configuration toml = { workspace = true, optional = true } regex = { workspace = true, optional = true } # Optional backends tesseract-rs = { workspace = true, optional = true } pdf-extract = { workspace = true, optional = true } # Metrics prometheus = { version = "0.13", optional = true } lazy_static = { version = "1.5", optional = true } [dev-dependencies] tokio = { workspace = true, features = ["macros", "test-util"] } tempfile = "3.13" mockall = { workspace = true } [features] default = ["api-client", "cache", "preprocessing"] # Core features api-client = ["reqwest"] cache = ["ruvector-core/storage"] preprocessing = ["image"] metrics = ["dep:ruvector-metrics", "prometheus", "lazy_static"] config = ["toml", "regex"] # Optional backends tesseract = ["dep:tesseract-rs"] pdf = ["dep:pdf-extract"] # Performance simd = ["ruvector-core/simd"] quantization = [] # Environment wasm = [] memory-only = [] ``` --- ## 10. Module Structure ### 10.1 Core Module Organization ``` crates/ruvector-scipix-core/src/ ├── lib.rs # Public API ├── error.rs # Error types ├── types.rs # Shared types ├── traits.rs # Shared traits ├── config.rs # Configuration │ ├── api/ │ ├── mod.rs │ ├── client.rs # Scipix API client │ └── models.rs # API request/response types │ ├── cache/ │ ├── mod.rs │ ├── vector_cache.rs # ruvector-core integration │ ├── memory_cache.rs # In-memory cache for WASM │ └── stats.rs # Cache statistics │ ├── ocr/ │ ├── mod.rs │ ├── engine.rs # Main OCR engine │ ├── batch.rs # Batch processing │ └── backends/ │ ├── mod.rs │ ├── scipix.rs # Scipix backend │ └── tesseract.rs # Tesseract fallback │ ├── preprocessing/ │ ├── mod.rs │ ├── image_ops.rs # Image preprocessing │ ├── filters.rs # Denoising, enhancement │ └── rotation.rs # Auto-rotation │ ├── postprocessing/ │ ├── mod.rs │ ├── latex_validate.rs # LaTeX validation │ └── normalize.rs # Symbol normalization │ ├── embeddings/ │ ├── mod.rs │ ├── image_embedder.rs # Image to vector │ └── latex_embedder.rs # LaTeX to vector │ ├── distributed/ │ ├── mod.rs │ ├── coordinator.rs # Cluster coordination │ ├── sharding.rs # Work distribution │ └── aggregator.rs # Result aggregation │ └── metrics/ ├── mod.rs └── prometheus.rs # Metrics collection ``` ### 10.2 Server Module Organization ``` crates/ruvector-scipix-server/src/ ├── main.rs # Server entry point ├── routes/ │ ├── mod.rs │ ├── ocr.rs # OCR endpoints │ ├── cache.rs # Cache management │ ├── health.rs # Health checks │ └── metrics.rs # Metrics endpoint │ ├── middleware/ │ ├── mod.rs │ ├── auth.rs # API key auth │ ├── rate_limit.rs # Rate limiting │ └── logging.rs # Request logging │ ├── state.rs # Shared app state └── error.rs # HTTP error handling ``` --- ## 11. Integration Checklist ### Phase 1: Core Integration - [ ] Create `ruvector-scipix-core` crate - [ ] Implement vector cache using `ruvector-core` - [ ] Add Scipix API client - [ ] Implement image preprocessing - [ ] Add metrics collection - [ ] Write unit tests ### Phase 2: Server Extension - [ ] Create `ruvector-scipix-server` crate - [ ] Implement REST API endpoints - [ ] Add authentication middleware - [ ] Implement rate limiting - [ ] Add health checks - [ ] Integration tests ### Phase 3: WASM Support - [ ] Create `ruvector-scipix-wasm` crate - [ ] Implement browser API - [ ] Add TypeScript definitions - [ ] Create example web app - [ ] Browser testing ### Phase 4: Distributed Processing - [ ] Integrate `ruvector-cluster` - [ ] Implement work sharding - [ ] Add load balancing - [ ] Implement result aggregation - [ ] Distributed tests ### Phase 5: Node.js Bindings - [ ] Create `ruvector-scipix-node` crate - [ ] Implement NAPI bindings - [ ] Add TypeScript types - [ ] Build platform binaries - [ ] NPM package ### Phase 6: Optimization - [ ] Enable quantization - [ ] SIMD optimizations - [ ] Cache tuning - [ ] Performance benchmarks - [ ] Documentation --- ## 12. Performance Targets ### Cache Performance - **Hit Rate:** >80% on repeated expressions - **Lookup Latency:** <10ms (p99) - **Memory Overhead:** 4-8x reduction with quantization ### API Performance - **OCR Latency:** <2s for single image - **Throughput:** >100 req/min per node - **PDF Processing:** <10s for 10-page document ### Cluster Performance - **Scaling Efficiency:** >90% up to 8 nodes - **Fault Tolerance:** Continue with 1 node failure - **Shard Rebalancing:** <30s --- ## 13. Security Considerations ### API Key Management - Never commit API keys to repository - Use environment variables or secure vaults - Rotate keys regularly - Implement key-per-user for multi-tenant ### Rate Limiting - Per-IP and per-API-key limits - Sliding window algorithm - Graceful degradation under load ### Input Validation - Image size limits (10MB default) - Format validation (PNG, JPEG only) - Sanitize LaTeX output - Prevent injection attacks ### Cache Security - Encrypt sensitive cached data - Implement cache eviction policies - Prevent cache poisoning - Audit cache access --- ## 14. Monitoring & Observability ### Key Metrics - `scipix_ocr_requests_total` - Total requests - `scipix_cache_hit_rate` - Cache effectiveness - `scipix_ocr_duration_seconds` - Latency distribution - `scipix_confidence_score` - Quality tracking - `scipix_errors_total` - Error rate ### Dashboards - Real-time OCR throughput - Cache performance - Error rates by type - Confidence score distribution - Cluster health ### Alerts - Error rate >5% - Latency p99 >5s - Cache hit rate <60% - Node failures - API quota exhaustion --- ## 15. Migration Path ### From Standalone to Integrated **Step 1:** Add ruvector-core dependency ```bash cd crates/ruvector-scipix-core cargo add ruvector-core --path ../ruvector-core ``` **Step 2:** Migrate cache to VectorDB ```rust // Old: HashMap-based cache let cache = HashMap::new(); // New: Vector-based cache let cache = ScipixCache::new("./cache", 512)?; ``` **Step 3:** Integrate metrics ```rust use ruvector_scipix_core::metrics::OcrMetrics; OcrMetrics::record_request(); // ... perform OCR ... OcrMetrics::record_latency(duration); ``` **Step 4:** Deploy with cluster support ```bash # Enable cluster feature cargo build --release --features cluster # Start with cluster config MATHPIX_CLUSTER_ENABLED=true cargo run ``` --- ## 16. Testing Strategy ### Unit Tests - Vector cache operations - Embedding generation - LaTeX validation - Error handling ### Integration Tests - End-to-end OCR flow - Cache hit/miss scenarios - Cluster coordination - API endpoint testing ### Performance Tests - Cache lookup benchmarks - HNSW search performance - Quantization overhead - Distributed scaling ### Browser Tests (WASM) - Canvas image capture - API calls from browser - Memory management - Error handling --- ## 17. Documentation Requirements ### API Documentation - OpenAPI/Swagger spec - Example requests/responses - Error codes - Rate limits ### Integration Guides - Quick start guide - Configuration reference - Cluster setup - WASM integration ### Performance Tuning - Cache configuration - HNSW parameters - Quantization trade-offs - Cluster sizing --- ## Conclusion This integration architecture provides a comprehensive blueprint for incorporating `ruvector-scipix` into the ruvector ecosystem. By leveraging existing infrastructure for vector storage, clustering, metrics, and WASM support, we achieve: 1. **Performance:** 80%+ cache hit rate, <10ms lookup latency 2. **Scalability:** Horizontal scaling via ruvector-cluster 3. **Flexibility:** Multiple deployment targets (server, browser, Node.js) 4. **Maintainability:** Shared types, errors, and configuration patterns 5. **Observability:** Rich metrics and monitoring The modular design allows incremental adoption, starting with core OCR functionality and progressively adding caching, clustering, and advanced features. --- **Next Steps:** 1. Review and approve architecture 2. Create Phase 1 crates (`ruvector-scipix-core`) 3. Implement vector cache integration 4. Add comprehensive tests 5. Deploy initial server with basic endpoints 6. Iterate based on performance metrics