Files
wifi-densepose/examples/scipix/docs/12_RUVECTOR_INTEGRATION.md
ruv d803bfe2b1 Squashed 'vendor/ruvector/' content from commit b64c2172
git-subtree-dir: vendor/ruvector
git-subtree-split: b64c21726f2bb37286d9ee36a7869fef60cc6900
2026-02-28 14:39:40 -05:00

1887 lines
47 KiB
Markdown

# Ruvector Integration Architecture
## ruvector-scipix Integration Design
**Version:** 1.0.0
**Date:** 2025-11-28
**Status:** Design Phase
---
## Executive Summary
This document defines the integration architecture for `ruvector-scipix`, a specialized OCR crate for mathematical expressions, with the existing ruvector ecosystem. The integration leverages ruvector's high-performance vector database, HNSW indexing, distributed clustering, and WASM capabilities to provide scalable, intelligent mathematical OCR processing.
**Key Integration Points:**
- Vector-based caching of OCR results using ruvector-core
- REST API endpoints via ruvector-server extension
- Browser-based OCR using ruvector-wasm
- Distributed processing with ruvector-cluster
- Performance tracking via ruvector-metrics
- Shared configuration and error handling patterns
---
## 1. Workspace Integration
### 1.1 Adding to Workspace Members
**Root Cargo.toml Modification:**
```toml
[workspace]
members = [
# ... existing members ...
"crates/ruvector-gnn-wasm",
# Scipix Integration - NEW
"crates/ruvector-scipix-core", # Core OCR logic
"crates/ruvector-scipix-node", # Node.js bindings
"crates/ruvector-scipix-wasm", # Browser WASM
"crates/ruvector-scipix-server", # HTTP server extension
"examples/refrag-pipeline",
"examples/scipix", # Examples and demos
]
[workspace.dependencies]
# ... existing dependencies ...
# Scipix-specific dependencies - NEW
reqwest = { version = "0.12", features = ["json", "multipart"] }
base64 = "0.22"
image = { version = "0.25", features = ["png", "jpeg"] }
tesseract-rs = { version = "0.14", optional = true } # Local fallback
pdf-extract = { version = "0.7", optional = true }
```
**Dependency Version Strategy:**
- Use `version = "0.1.16"` (workspace version) for internal crates
- Use `workspace = true` for shared dependencies
- Add scipix-specific deps to workspace.dependencies for consistency
### 1.2 Crate Structure
```
crates/
├── ruvector-scipix-core/ # Core OCR engine
│ ├── src/
│ │ ├── lib.rs # Public API
│ │ ├── api_client.rs # Scipix API client
│ │ ├── ocr_engine.rs # OCR processing
│ │ ├── cache.rs # Vector-based cache
│ │ ├── preprocessing.rs # Image preprocessing
│ │ ├── postprocessing.rs # LaTeX refinement
│ │ └── error.rs # Error types
│ └── Cargo.toml
├── ruvector-scipix-node/ # Node.js bindings (NAPI-RS)
│ ├── src/
│ │ └── lib.rs
│ ├── npm/ # Platform binaries
│ └── Cargo.toml
├── ruvector-scipix-wasm/ # WASM bindings
│ ├── src/
│ │ └── lib.rs
│ └── Cargo.toml
└── ruvector-scipix-server/ # Server extension
├── src/
│ ├── main.rs
│ ├── routes.rs
│ └── middleware.rs
└── Cargo.toml
examples/scipix/ # Examples (NOT workspace member)
├── src/
├── tests/
├── docs/
└── Cargo.toml # Standalone example
```
### 1.3 Feature Flags Strategy
**Core Crate (ruvector-scipix-core):**
```toml
[features]
default = ["api-client", "cache", "simd"]
# Backend features
api-client = ["reqwest", "base64"]
tesseract = ["tesseract-rs"] # Local OCR fallback
pdf-support = ["pdf-extract"]
# Performance features
cache = ["ruvector-core/storage"] # Vector cache
simd = ["ruvector-core/simd"] # SIMD optimizations
quantization = ["ruvector-core"] # Quantized embeddings
# Environment features
wasm = [] # WASM-compatible mode
memory-only = [] # No file I/O
```
---
## 2. ruvector-core Usage
### 2.1 Storing Math Expression Embeddings
**Integration Pattern:**
```rust
// crates/ruvector-scipix-core/src/cache.rs
use ruvector_core::{VectorDB, VectorEntry, DistanceMetric, SearchQuery};
use std::path::Path;
/// OCR result cache using vector similarity
pub struct ScipixCache {
/// Vector database for image embeddings
image_db: VectorDB,
/// Vector database for LaTeX embeddings
latex_db: VectorDB,
/// Embedding dimension
dimension: usize,
}
impl ScipixCache {
/// Create new cache with specified dimension
pub fn new(cache_dir: &Path, dimension: usize) -> Result<Self> {
let image_path = cache_dir.join("image_vectors.db");
let latex_path = cache_dir.join("latex_vectors.db");
Ok(Self {
image_db: VectorDB::new(
&image_path,
dimension,
DistanceMetric::Cosine,
)?,
latex_db: VectorDB::new(
&latex_path,
dimension,
DistanceMetric::Cosine,
)?,
dimension,
})
}
/// Store OCR result with image embedding
pub fn store_result(
&mut self,
image_embedding: Vec<f32>,
latex: String,
confidence: f32,
) -> Result<uuid::Uuid> {
// Store image embedding
let id = uuid::Uuid::new_v4();
self.image_db.add_vector(
id,
image_embedding.clone(),
Some(serde_json::json!({
"latex": latex,
"confidence": confidence,
"timestamp": chrono::Utc::now(),
})),
)?;
// Also store LaTeX embedding for semantic search
let latex_embedding = self.encode_latex(&latex)?;
self.latex_db.add_vector(id, latex_embedding, None)?;
Ok(id)
}
/// Find similar cached results
pub fn find_similar(
&self,
image_embedding: Vec<f32>,
threshold: f32,
) -> Result<Option<CachedResult>> {
let query = SearchQuery::new(image_embedding)
.with_k(1)
.with_ef(50);
let results = self.image_db.search(&query)?;
if let Some(result) = results.first() {
if result.distance <= threshold {
let metadata = result.metadata.as_ref()
.ok_or(RuvectorError::MetadataMissing)?;
return Ok(Some(CachedResult {
latex: metadata["latex"].as_str().unwrap().to_string(),
confidence: metadata["confidence"].as_f64().unwrap() as f32,
distance: result.distance,
}));
}
}
Ok(None)
}
/// Encode LaTeX to vector using simple hashing
fn encode_latex(&self, latex: &str) -> Result<Vec<f32>> {
// Use TF-IDF or learned embeddings
// For now, simple character n-gram hashing
let mut embedding = vec![0.0; self.dimension];
for ngram in latex.chars().collect::<Vec<_>>().windows(3) {
let hash = ngram.iter().fold(0u64, |acc, &c| {
acc.wrapping_mul(31).wrapping_add(c as u64)
});
let idx = (hash % self.dimension as u64) as usize;
embedding[idx] += 1.0;
}
// Normalize
let norm: f32 = embedding.iter().map(|&x| x * x).sum::<f32>().sqrt();
if norm > 0.0 {
embedding.iter_mut().for_each(|x| *x /= norm);
}
Ok(embedding)
}
}
#[derive(Debug, Clone)]
pub struct CachedResult {
pub latex: String,
pub confidence: f32,
pub distance: f32,
}
```
### 2.2 Quantization for Memory Efficiency
```rust
use ruvector_core::quantization::{ScalarQuantizer, QuantizationConfig};
impl ScipixCache {
/// Create cache with quantization (4-32x memory reduction)
pub fn new_quantized(
cache_dir: &Path,
dimension: usize,
bits: u8, // 4 or 8
) -> Result<Self> {
let config = QuantizationConfig {
bits,
..Default::default()
};
// Quantizer will be used internally by VectorDB
let mut cache = Self::new(cache_dir, dimension)?;
cache.image_db.enable_quantization(config)?;
Ok(cache)
}
}
```
### 2.3 HNSW Parameters for OCR Cache
```rust
use ruvector_core::index::HNSWConfig;
impl ScipixCache {
/// Optimize HNSW for OCR workload
pub fn with_hnsw_config(mut self, config: HNSWConfig) -> Self {
// Typical OCR workload:
// - High recall needed (mathematical expressions must be accurate)
// - Moderate write throughput
// - Low latency reads
let optimized = HNSWConfig {
m: 32, // Connections per layer (higher = better recall)
ef_construction: 200, // Construction effort
max_elements: 100_000, // Expected cache size
..Default::default()
};
self.image_db.configure_hnsw(optimized);
self
}
}
```
---
## 3. ruvector-server Extension
### 3.1 Server Crate Structure
**crates/ruvector-scipix-server/Cargo.toml:**
```toml
[package]
name = "ruvector-scipix-server"
version.workspace = true
edition.workspace = true
license.workspace = true
authors.workspace = true
repository.workspace = true
description = "HTTP server for Scipix OCR with vector caching"
[dependencies]
# Core dependencies
ruvector-core = { version = "0.1.16", path = "../ruvector-core" }
ruvector-server = { version = "0.1.16", path = "../ruvector-server" }
ruvector-scipix-core = { version = "0.1.16", path = "../ruvector-scipix-core" }
# Web framework
axum = { version = "0.7", features = ["json", "multipart"] }
tower = "0.5"
tower-http = { version = "0.6", features = ["cors", "trace", "limit"] }
# Async runtime
tokio = { workspace = true }
# Serialization
serde = { workspace = true }
serde_json = { workspace = true }
# Error handling
thiserror = { workspace = true }
anyhow = { workspace = true }
# Utilities
tracing = { workspace = true }
uuid = { workspace = true }
base64 = { workspace = true }
[features]
default = ["api-client"]
api-client = ["ruvector-scipix-core/api-client"]
metrics = ["ruvector-metrics"]
```
### 3.2 REST API Endpoints
**crates/ruvector-scipix-server/src/routes.rs:**
```rust
use axum::{
Router,
routing::{post, get},
extract::{State, Multipart},
Json,
http::StatusCode,
};
use ruvector_scipix_core::{ScipixClient, ScipixCache};
use std::sync::Arc;
#[derive(Clone)]
pub struct AppState {
pub scipix_client: Arc<ScipixClient>,
pub cache: Arc<parking_lot::RwLock<ScipixCache>>,
}
/// Create Scipix routes
pub fn scipix_routes() -> Router<AppState> {
Router::new()
// Scipix API v3 endpoints
.route("/v3/text", post(ocr_text))
.route("/v3/pdf", post(ocr_pdf))
.route("/v3/batch", post(ocr_batch))
// Cache management
.route("/cache/stats", get(cache_stats))
.route("/cache/search", post(search_cache))
.route("/cache/clear", post(clear_cache))
}
/// POST /v3/text - OCR text from image
async fn ocr_text(
State(state): State<AppState>,
mut multipart: Multipart,
) -> Result<Json<OcrResponse>, AppError> {
let mut image_data = Vec::new();
// Extract image from multipart
while let Some(field) = multipart.next_field().await? {
if field.name() == Some("image") {
image_data = field.bytes().await?.to_vec();
}
}
// Generate image embedding for cache lookup
let embedding = state.scipix_client
.generate_image_embedding(&image_data)?;
// Check cache first
if let Some(cached) = state.cache.read()
.find_similar(embedding.clone(), 0.95)? {
return Ok(Json(OcrResponse {
latex: cached.latex,
confidence: cached.confidence,
cached: true,
}));
}
// Cache miss - call Scipix API
let result = state.scipix_client.ocr_image(&image_data).await?;
// Store in cache
state.cache.write().store_result(
embedding,
result.latex.clone(),
result.confidence,
)?;
Ok(Json(OcrResponse {
latex: result.latex,
confidence: result.confidence,
cached: false,
}))
}
/// POST /v3/pdf - OCR entire PDF
async fn ocr_pdf(
State(state): State<AppState>,
mut multipart: Multipart,
) -> Result<Json<PdfOcrResponse>, AppError> {
let mut pdf_data = Vec::new();
while let Some(field) = multipart.next_field().await? {
if field.name() == Some("pdf") {
pdf_data = field.bytes().await?.to_vec();
}
}
// Extract pages and process in parallel
let pages = state.scipix_client.extract_pdf_pages(&pdf_data)?;
let results = futures::future::join_all(
pages.into_iter().map(|page| {
let client = state.scipix_client.clone();
async move { client.ocr_image(&page).await }
})
).await;
let pages: Vec<_> = results.into_iter()
.collect::<Result<Vec<_>, _>>()?;
Ok(Json(PdfOcrResponse { pages }))
}
#[derive(serde::Serialize)]
struct OcrResponse {
latex: String,
confidence: f32,
cached: bool,
}
#[derive(serde::Serialize)]
struct PdfOcrResponse {
pages: Vec<PageResult>,
}
#[derive(serde::Serialize)]
struct PageResult {
page_num: usize,
latex: String,
confidence: f32,
}
```
### 3.3 Authentication Integration
```rust
use axum::{
extract::Request,
middleware::Next,
http::StatusCode,
};
/// API key authentication middleware
pub async fn auth_middleware(
mut req: Request,
next: Next,
) -> Result<axum::response::Response, StatusCode> {
let auth_header = req.headers()
.get("X-API-Key")
.and_then(|h| h.to_str().ok());
match auth_header {
Some(key) if validate_api_key(key) => {
// Store user context in extensions
req.extensions_mut().insert(ApiUser {
key: key.to_string(),
});
Ok(next.run(req).await)
}
_ => Err(StatusCode::UNAUTHORIZED),
}
}
fn validate_api_key(key: &str) -> bool {
// Check against database or environment
std::env::var("MATHPIX_API_KEY")
.map(|k| k == key)
.unwrap_or(false)
}
```
### 3.4 Rate Limiting
```rust
use tower::ServiceBuilder;
use tower_http::limit::RequestBodyLimitLayer;
pub fn create_server(state: AppState) -> Router {
Router::new()
.merge(scipix_routes())
.layer(
ServiceBuilder::new()
// Rate limiting (100 req/min per IP)
.layer(tower_http::timeout::TimeoutLayer::new(
std::time::Duration::from_secs(30)
))
// Body size limit (10MB)
.layer(RequestBodyLimitLayer::new(10 * 1024 * 1024))
// Authentication
.layer(axum::middleware::from_fn(auth_middleware))
)
.with_state(state)
}
```
---
## 4. ruvector-wasm Integration
### 4.1 WASM Crate Configuration
**crates/ruvector-scipix-wasm/Cargo.toml:**
```toml
[package]
name = "ruvector-scipix-wasm"
version.workspace = true
edition.workspace = true
license.workspace = true
description = "Browser-based OCR for mathematical expressions"
[lib]
crate-type = ["cdylib", "rlib"]
[dependencies]
# Core - use memory-only features
ruvector-core = {
version = "0.1.16",
path = "../ruvector-core",
default-features = false,
features = ["memory-only", "simd"]
}
ruvector-wasm = { version = "0.1.16", path = "../ruvector-wasm" }
ruvector-scipix-core = {
version = "0.1.16",
path = "../ruvector-scipix-core",
default-features = false,
features = ["wasm"]
}
# WASM bindings
wasm-bindgen = { workspace = true }
wasm-bindgen-futures = { workspace = true }
js-sys = { workspace = true }
web-sys = { workspace = true, features = [
"CanvasRenderingContext2d",
"HtmlCanvasElement",
"ImageData",
"console",
] }
# Utilities
serde = { workspace = true }
serde-wasm-bindgen = "0.6"
console_error_panic_hook = "0.1"
getrandom = { workspace = true, features = ["wasm_js"] }
[features]
default = []
[profile.release]
opt-level = "z"
lto = true
codegen-units = 1
```
### 4.2 Browser API
**crates/ruvector-scipix-wasm/src/lib.rs:**
```rust
use wasm_bindgen::prelude::*;
use web_sys::{ImageData, CanvasRenderingContext2d};
use ruvector_scipix_core::{ScipixClient, ScipixCache};
#[wasm_bindgen]
pub struct ScipixWasm {
client: ScipixClient,
cache: ScipixCache,
}
#[wasm_bindgen]
impl ScipixWasm {
/// Create new instance with API key
#[wasm_bindgen(constructor)]
pub fn new(api_key: String, app_id: String) -> Result<ScipixWasm, JsValue> {
console_error_panic_hook::set_once();
let client = ScipixClient::new(api_key, app_id)
.map_err(|e| JsValue::from_str(&e.to_string()))?;
// Use in-memory cache for WASM
let cache = ScipixCache::new_memory(512) // 512-dim embeddings
.map_err(|e| JsValue::from_str(&e.to_string()))?;
Ok(Self { client, cache })
}
/// OCR from canvas ImageData
#[wasm_bindgen]
pub async fn ocr_image_data(
&mut self,
image_data: ImageData,
) -> Result<JsValue, JsValue> {
let width = image_data.width();
let height = image_data.height();
let data = image_data.data().0;
// Convert to PNG bytes
let png_bytes = self.rgba_to_png(width, height, &data)
.map_err(|e| JsValue::from_str(&e.to_string()))?;
// Check cache
let embedding = self.client.generate_image_embedding(&png_bytes)
.map_err(|e| JsValue::from_str(&e.to_string()))?;
if let Some(cached) = self.cache.find_similar(embedding.clone(), 0.95)
.map_err(|e| JsValue::from_str(&e.to_string()))? {
return Ok(serde_wasm_bindgen::to_value(&OcrResult {
latex: cached.latex,
confidence: cached.confidence,
cached: true,
})?);
}
// Call API
let result = self.client.ocr_image(&png_bytes).await
.map_err(|e| JsValue::from_str(&e.to_string()))?;
// Cache result
self.cache.store_result(embedding, result.latex.clone(), result.confidence)
.map_err(|e| JsValue::from_str(&e.to_string()))?;
Ok(serde_wasm_bindgen::to_value(&OcrResult {
latex: result.latex,
confidence: result.confidence,
cached: false,
})?)
}
/// OCR from canvas element
#[wasm_bindgen]
pub async fn ocr_canvas(
&mut self,
canvas_id: String,
) -> Result<JsValue, JsValue> {
let window = web_sys::window().unwrap();
let document = window.document().unwrap();
let canvas = document
.get_element_by_id(&canvas_id)
.ok_or_else(|| JsValue::from_str("Canvas not found"))?
.dyn_into::<web_sys::HtmlCanvasElement>()?;
let context = canvas
.get_context("2d")?
.unwrap()
.dyn_into::<CanvasRenderingContext2d>()?;
let image_data = context.get_image_data(
0.0, 0.0,
canvas.width() as f64,
canvas.height() as f64,
)?;
self.ocr_image_data(image_data).await
}
fn rgba_to_png(&self, width: u32, height: u32, data: &[u8])
-> Result<Vec<u8>, String> {
// Use image crate to encode PNG
// (simplified - actual implementation would use image crate)
Ok(data.to_vec())
}
}
#[derive(serde::Serialize)]
struct OcrResult {
latex: String,
confidence: f32,
cached: bool,
}
```
### 4.3 TypeScript Definitions
**crates/ruvector-scipix-wasm/scipix.d.ts:**
```typescript
export class ScipixWasm {
constructor(apiKey: string, appId: string);
ocr_image_data(imageData: ImageData): Promise<OcrResult>;
ocr_canvas(canvasId: string): Promise<OcrResult>;
free(): void;
}
export interface OcrResult {
latex: string;
confidence: number;
cached: boolean;
}
```
---
## 5. ruvector-metrics Integration
### 5.1 OCR-Specific Metrics
**crates/ruvector-scipix-core/src/metrics.rs:**
```rust
use prometheus::{
Counter, Histogram, IntGauge, Registry,
HistogramOpts, Opts,
};
use lazy_static::lazy_static;
lazy_static! {
/// Total OCR requests
pub static ref OCR_REQUESTS: Counter = Counter::new(
"scipix_ocr_requests_total",
"Total number of OCR requests"
).unwrap();
/// Cache hit rate
pub static ref CACHE_HITS: Counter = Counter::new(
"scipix_cache_hits_total",
"Number of cache hits"
).unwrap();
pub static ref CACHE_MISSES: Counter = Counter::new(
"scipix_cache_misses_total",
"Number of cache misses"
).unwrap();
/// OCR latency histogram
pub static ref OCR_LATENCY: Histogram = Histogram::with_opts(
HistogramOpts::new(
"scipix_ocr_duration_seconds",
"OCR processing duration"
).buckets(vec![0.1, 0.5, 1.0, 2.0, 5.0, 10.0])
).unwrap();
/// Confidence score distribution
pub static ref CONFIDENCE_SCORE: Histogram = Histogram::with_opts(
HistogramOpts::new(
"scipix_confidence_score",
"OCR confidence scores"
).buckets(vec![0.5, 0.6, 0.7, 0.8, 0.9, 0.95, 0.99])
).unwrap();
/// Active API calls
pub static ref ACTIVE_CALLS: IntGauge = IntGauge::new(
"scipix_active_calls",
"Number of active API calls"
).unwrap();
/// Error counter by type
pub static ref OCR_ERRORS: Counter = Counter::new(
"scipix_errors_total",
"Total OCR errors"
).unwrap();
}
/// Register all metrics
pub fn register_metrics(registry: &Registry) -> Result<(), Box<dyn std::error::Error>> {
registry.register(Box::new(OCR_REQUESTS.clone()))?;
registry.register(Box::new(CACHE_HITS.clone()))?;
registry.register(Box::new(CACHE_MISSES.clone()))?;
registry.register(Box::new(OCR_LATENCY.clone()))?;
registry.register(Box::new(CONFIDENCE_SCORE.clone()))?;
registry.register(Box::new(ACTIVE_CALLS.clone()))?;
registry.register(Box::new(OCR_ERRORS.clone()))?;
Ok(())
}
/// Track OCR operation
pub struct OcrMetrics;
impl OcrMetrics {
pub fn record_request() {
OCR_REQUESTS.inc();
ACTIVE_CALLS.inc();
}
pub fn record_cache_hit() {
CACHE_HITS.inc();
}
pub fn record_cache_miss() {
CACHE_MISSES.inc();
}
pub fn record_latency(duration: std::time::Duration) {
OCR_LATENCY.observe(duration.as_secs_f64());
ACTIVE_CALLS.dec();
}
pub fn record_confidence(score: f32) {
CONFIDENCE_SCORE.observe(score as f64);
}
pub fn record_error() {
OCR_ERRORS.inc();
ACTIVE_CALLS.dec();
}
}
```
### 5.2 Integration with ruvector-metrics
```rust
// In ScipixClient implementation
impl ScipixClient {
pub async fn ocr_image(&self, image: &[u8]) -> Result<OcrResult> {
use crate::metrics::OcrMetrics;
OcrMetrics::record_request();
let start = std::time::Instant::now();
let result = self.ocr_image_internal(image).await;
match result {
Ok(ref res) => {
OcrMetrics::record_latency(start.elapsed());
OcrMetrics::record_confidence(res.confidence);
}
Err(_) => {
OcrMetrics::record_error();
}
}
result
}
}
```
### 5.3 Prometheus Endpoint
```rust
// In server routes
use prometheus::{Encoder, TextEncoder};
async fn metrics_handler() -> Result<String, AppError> {
let encoder = TextEncoder::new();
let metric_families = prometheus::gather();
let mut buffer = Vec::new();
encoder.encode(&metric_families, &mut buffer)?;
Ok(String::from_utf8(buffer)?)
}
// Add to router
Router::new()
.route("/metrics", get(metrics_handler))
```
---
## 6. ruvector-cluster for Distributed OCR
### 6.1 Sharding Strategy
**crates/ruvector-scipix-core/src/distributed.rs:**
```rust
use ruvector_cluster::{ClusterNode, ShardingStrategy, NodeId};
use std::sync::Arc;
/// Distributed OCR coordinator
pub struct DistributedOcr {
cluster: Arc<ClusterNode>,
shard_count: usize,
}
impl DistributedOcr {
pub fn new(cluster: Arc<ClusterNode>, shard_count: usize) -> Self {
Self { cluster, shard_count }
}
/// Process PDF across cluster
pub async fn process_pdf_distributed(
&self,
pdf_data: Vec<u8>,
) -> Result<Vec<PageResult>> {
// Extract pages
let pages = extract_pdf_pages(&pdf_data)?;
let total_pages = pages.len();
// Shard pages across cluster nodes
let nodes = self.cluster.get_active_nodes().await?;
let pages_per_node = (total_pages + nodes.len() - 1) / nodes.len();
// Distribute work
let mut tasks = Vec::new();
for (i, node) in nodes.iter().enumerate() {
let start = i * pages_per_node;
let end = ((i + 1) * pages_per_node).min(total_pages);
let node_pages: Vec<_> = pages[start..end].to_vec();
let task = self.cluster.send_task(
node.id,
OcrTask {
pages: node_pages,
start_page: start,
},
);
tasks.push(task);
}
// Collect results
let results = futures::future::join_all(tasks).await;
// Aggregate and sort by page number
let mut all_results = Vec::new();
for result in results {
all_results.extend(result?);
}
all_results.sort_by_key(|r| r.page_num);
Ok(all_results)
}
}
#[derive(serde::Serialize, serde::Deserialize)]
struct OcrTask {
pages: Vec<Vec<u8>>,
start_page: usize,
}
```
### 6.2 Load Balancing
```rust
use ruvector_cluster::LoadBalancer;
/// Smart load balancer for OCR workload
pub struct OcrLoadBalancer {
balancer: LoadBalancer,
}
impl OcrLoadBalancer {
/// Assign work based on node capacity and queue depth
pub async fn assign_task(&self, task_size: usize) -> Result<NodeId> {
let nodes = self.balancer.get_nodes().await?;
// Score each node
let mut best_node = None;
let mut best_score = f64::MAX;
for node in nodes {
let metrics = self.balancer.get_node_metrics(node.id).await?;
// Score based on:
// - Queue depth (lower is better)
// - CPU usage (lower is better)
// - Task size compatibility
let score =
metrics.queue_depth as f64 * 10.0 +
metrics.cpu_usage * 100.0 +
(task_size as f64 - metrics.avg_task_size).abs();
if score < best_score {
best_score = score;
best_node = Some(node.id);
}
}
best_node.ok_or_else(|| RuvectorError::NoNodesAvailable)
}
}
```
### 6.3 Result Aggregation
```rust
/// Aggregate OCR results from multiple nodes
pub struct ResultAggregator {
results: dashmap::DashMap<uuid::Uuid, Vec<PageResult>>,
}
impl ResultAggregator {
pub fn add_result(&self, job_id: uuid::Uuid, result: PageResult) {
self.results.entry(job_id)
.or_insert_with(Vec::new)
.push(result);
}
pub fn get_results(&self, job_id: uuid::Uuid) -> Option<Vec<PageResult>> {
self.results.get(&job_id).map(|r| {
let mut results = r.clone();
results.sort_by_key(|p| p.page_num);
results
})
}
pub fn is_complete(&self, job_id: uuid::Uuid, expected_pages: usize) -> bool {
self.results.get(&job_id)
.map(|r| r.len() == expected_pages)
.unwrap_or(false)
}
}
```
---
## 7. Shared Configuration
### 7.1 Environment Variables
**config/scipix.env:**
```bash
# Scipix API Configuration
MATHPIX_API_KEY=your_api_key_here
MATHPIX_APP_ID=your_app_id_here
MATHPIX_API_URL=https://api.scipix.com/v3
# Cache Configuration
MATHPIX_CACHE_DIR=./data/scipix_cache
MATHPIX_CACHE_DIMENSION=512
MATHPIX_CACHE_SIZE_MB=1000
MATHPIX_CACHE_THRESHOLD=0.95
# Vector DB Configuration
RUVECTOR_HNSW_M=32
RUVECTOR_HNSW_EF_CONSTRUCTION=200
RUVECTOR_DISTANCE_METRIC=cosine
# Quantization
MATHPIX_QUANTIZE_BITS=8 # 0 for no quantization
# Server Configuration
MATHPIX_SERVER_PORT=3000
MATHPIX_SERVER_HOST=0.0.0.0
MATHPIX_MAX_BODY_SIZE_MB=10
MATHPIX_RATE_LIMIT_PER_MIN=100
# Cluster Configuration
MATHPIX_CLUSTER_ENABLED=false
MATHPIX_CLUSTER_NODES=node1:8000,node2:8000
MATHPIX_SHARD_COUNT=4
# Metrics
MATHPIX_METRICS_ENABLED=true
MATHPIX_METRICS_PORT=9090
```
### 7.2 TOML Configuration
**config/scipix.toml:**
```toml
[api]
key = "${MATHPIX_API_KEY}"
app_id = "${MATHPIX_APP_ID}"
url = "https://api.scipix.com/v3"
timeout_secs = 30
[cache]
enabled = true
dir = "./data/scipix_cache"
dimension = 512
size_mb = 1000
threshold = 0.95
[cache.hnsw]
m = 32
ef_construction = 200
max_elements = 100_000
[cache.quantization]
enabled = true
bits = 8 # 4, 8, or 0 for disabled
[server]
host = "0.0.0.0"
port = 3000
max_body_size_mb = 10
[server.rate_limit]
enabled = true
requests_per_minute = 100
[cluster]
enabled = false
nodes = ["node1:8000", "node2:8000"]
shard_count = 4
replication_factor = 2
[metrics]
enabled = true
port = 9090
prometheus_endpoint = "/metrics"
[preprocessing]
# Image preprocessing options
auto_rotate = true
denoise = true
contrast_enhancement = true
dpi = 300
[postprocessing]
# LaTeX postprocessing
validate_syntax = true
normalize_symbols = true
confidence_threshold = 0.7
```
### 7.3 Configuration Loading
**crates/ruvector-scipix-core/src/config.rs:**
```rust
use serde::{Deserialize, Serialize};
use std::path::Path;
#[derive(Debug, Clone, Deserialize, Serialize)]
pub struct ScipixConfig {
pub api: ApiConfig,
pub cache: CacheConfig,
pub server: ServerConfig,
pub cluster: ClusterConfig,
pub metrics: MetricsConfig,
pub preprocessing: PreprocessingConfig,
pub postprocessing: PostprocessingConfig,
}
#[derive(Debug, Clone, Deserialize, Serialize)]
pub struct ApiConfig {
pub key: String,
pub app_id: String,
pub url: String,
pub timeout_secs: u64,
}
#[derive(Debug, Clone, Deserialize, Serialize)]
pub struct CacheConfig {
pub enabled: bool,
pub dir: String,
pub dimension: usize,
pub size_mb: usize,
pub threshold: f32,
pub hnsw: HnswConfig,
pub quantization: QuantizationConfig,
}
#[derive(Debug, Clone, Deserialize, Serialize)]
pub struct HnswConfig {
pub m: usize,
pub ef_construction: usize,
pub max_elements: usize,
}
impl ScipixConfig {
/// Load from TOML file with environment variable substitution
pub fn from_file(path: &Path) -> Result<Self> {
let content = std::fs::read_to_string(path)?;
// Expand environment variables
let expanded = Self::expand_env_vars(&content);
let config: ScipixConfig = toml::from_str(&expanded)?;
Ok(config)
}
/// Load from environment variables
pub fn from_env() -> Result<Self> {
Ok(Self {
api: ApiConfig {
key: std::env::var("MATHPIX_API_KEY")?,
app_id: std::env::var("MATHPIX_APP_ID")?,
url: std::env::var("MATHPIX_API_URL")
.unwrap_or_else(|_| "https://api.scipix.com/v3".to_string()),
timeout_secs: 30,
},
cache: CacheConfig::from_env()?,
// ... rest of config
})
}
fn expand_env_vars(s: &str) -> String {
let re = regex::Regex::new(r"\$\{([^}]+)\}").unwrap();
re.replace_all(s, |caps: &regex::Captures| {
std::env::var(&caps[1]).unwrap_or_default()
}).to_string()
}
}
```
---
## 8. Cross-Crate Types
### 8.1 Common Error Types
**crates/ruvector-scipix-core/src/error.rs:**
```rust
use thiserror::Error;
use ruvector_core::RuvectorError;
#[derive(Error, Debug)]
pub enum ScipixError {
#[error("Scipix API error: {0}")]
ApiError(String),
#[error("HTTP request failed: {0}")]
HttpError(#[from] reqwest::Error),
#[error("Vector database error: {0}")]
VectorDbError(#[from] RuvectorError),
#[error("Image processing error: {0}")]
ImageError(String),
#[error("Invalid configuration: {0}")]
ConfigError(String),
#[error("Cache error: {0}")]
CacheError(String),
#[error("Serialization error: {0}")]
SerializationError(#[from] serde_json::Error),
#[error("IO error: {0}")]
IoError(#[from] std::io::Error),
#[error("LaTeX validation error: {0}")]
LatexError(String),
#[error("Rate limit exceeded")]
RateLimitExceeded,
#[error("Authentication failed")]
AuthenticationFailed,
#[error("Confidence too low: {0}")]
LowConfidence(f32),
}
pub type Result<T> = std::result::Result<T, ScipixError>;
/// Convert to HTTP status code
impl ScipixError {
pub fn status_code(&self) -> axum::http::StatusCode {
use axum::http::StatusCode;
match self {
Self::ApiError(_) => StatusCode::BAD_GATEWAY,
Self::HttpError(_) => StatusCode::BAD_GATEWAY,
Self::VectorDbError(_) => StatusCode::INTERNAL_SERVER_ERROR,
Self::ImageError(_) => StatusCode::BAD_REQUEST,
Self::ConfigError(_) => StatusCode::INTERNAL_SERVER_ERROR,
Self::CacheError(_) => StatusCode::INTERNAL_SERVER_ERROR,
Self::SerializationError(_) => StatusCode::BAD_REQUEST,
Self::IoError(_) => StatusCode::INTERNAL_SERVER_ERROR,
Self::LatexError(_) => StatusCode::UNPROCESSABLE_ENTITY,
Self::RateLimitExceeded => StatusCode::TOO_MANY_REQUESTS,
Self::AuthenticationFailed => StatusCode::UNAUTHORIZED,
Self::LowConfidence(_) => StatusCode::UNPROCESSABLE_ENTITY,
}
}
}
```
### 8.2 Shared Traits
**crates/ruvector-scipix-core/src/traits.rs:**
```rust
use async_trait::async_trait;
/// OCR engine trait (allows swapping implementations)
#[async_trait]
pub trait OcrEngine: Send + Sync {
/// Process image to LaTeX
async fn ocr(&self, image: &[u8]) -> Result<OcrResult>;
/// Generate embedding for caching
fn generate_embedding(&self, image: &[u8]) -> Result<Vec<f32>>;
/// Batch processing
async fn ocr_batch(&self, images: Vec<Vec<u8>>) -> Result<Vec<OcrResult>> {
let mut results = Vec::new();
for image in images {
results.push(self.ocr(&image).await?);
}
Ok(results)
}
}
/// Cache trait (allows different cache backends)
pub trait OcrCache: Send + Sync {
fn store(&mut self, embedding: Vec<f32>, result: OcrResult) -> Result<uuid::Uuid>;
fn find_similar(&self, embedding: Vec<f32>, threshold: f32) -> Result<Option<OcrResult>>;
fn clear(&mut self) -> Result<()>;
fn stats(&self) -> CacheStats;
}
#[derive(Debug, Clone)]
pub struct CacheStats {
pub total_entries: usize,
pub memory_usage_mb: f64,
pub hit_rate: f64,
}
/// Preprocessing trait
pub trait ImagePreprocessor: Send + Sync {
fn preprocess(&self, image: &[u8]) -> Result<Vec<u8>>;
}
/// Postprocessing trait
pub trait LatexPostprocessor: Send + Sync {
fn postprocess(&self, latex: &str) -> Result<String>;
fn validate(&self, latex: &str) -> bool;
}
```
### 8.3 API Contracts
**crates/ruvector-scipix-core/src/types.rs:**
```rust
use serde::{Deserialize, Serialize};
use uuid::Uuid;
/// OCR result
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct OcrResult {
pub latex: String,
pub confidence: f32,
pub timestamp: chrono::DateTime<chrono::Utc>,
pub cached: bool,
#[serde(skip_serializing_if = "Option::is_none")]
pub metadata: Option<serde_json::Value>,
}
/// PDF page result
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct PageResult {
pub page_num: usize,
pub latex: String,
pub confidence: f32,
pub bounding_boxes: Vec<BoundingBox>,
}
/// Bounding box for detected regions
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct BoundingBox {
pub x: u32,
pub y: u32,
pub width: u32,
pub height: u32,
pub confidence: f32,
}
/// Batch OCR request
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct BatchOcrRequest {
pub images: Vec<ImageInput>,
pub options: OcrOptions,
}
#[derive(Debug, Clone, Serialize, Deserialize)]
#[serde(tag = "type")]
pub enum ImageInput {
#[serde(rename = "base64")]
Base64 { data: String },
#[serde(rename = "url")]
Url { url: String },
#[serde(rename = "bytes")]
Bytes { data: Vec<u8> },
}
#[derive(Debug, Clone, Serialize, Deserialize, Default)]
pub struct OcrOptions {
#[serde(default)]
pub preprocess: bool,
#[serde(default)]
pub postprocess: bool,
#[serde(default = "default_confidence")]
pub min_confidence: f32,
#[serde(default)]
pub use_cache: bool,
}
fn default_confidence() -> f32 { 0.7 }
/// Job status for async processing
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct JobStatus {
pub job_id: Uuid,
pub status: JobState,
pub progress: f32, // 0.0 to 1.0
pub result: Option<Vec<PageResult>>,
pub error: Option<String>,
}
#[derive(Debug, Clone, Copy, PartialEq, Eq, Serialize, Deserialize)]
pub enum JobState {
Pending,
Processing,
Completed,
Failed,
}
```
---
## 9. Workspace Cargo.toml Modifications
### 9.1 Complete Workspace Configuration
```toml
# Add to root Cargo.toml
[workspace]
members = [
# ... existing members ...
# Scipix Integration
"crates/ruvector-scipix-core",
"crates/ruvector-scipix-node",
"crates/ruvector-scipix-wasm",
"crates/ruvector-scipix-server",
]
[workspace.dependencies]
# ... existing dependencies ...
# Scipix-specific
reqwest = { version = "0.12", default-features = false, features = ["json", "multipart", "rustls-tls"] }
base64 = "0.22"
image = { version = "0.25", features = ["png", "jpeg", "webp"] }
async-trait = "0.1"
regex = "1.10"
toml = "0.8"
# Optional OCR backends
tesseract-rs = { version = "0.14", optional = true }
pdf-extract = { version = "0.7", optional = true }
```
### 9.2 Individual Crate Cargo.toml
**crates/ruvector-scipix-core/Cargo.toml:**
```toml
[package]
name = "ruvector-scipix-core"
version.workspace = true
edition.workspace = true
license.workspace = true
authors.workspace = true
repository.workspace = true
description = "Mathematical OCR with vector-based caching"
[dependencies]
# Ruvector ecosystem
ruvector-core = { version = "0.1.16", path = "../ruvector-core" }
ruvector-metrics = { version = "0.1.16", path = "../ruvector-metrics", optional = true }
# HTTP client
reqwest = { workspace = true, optional = true }
base64 = { workspace = true }
# Image processing
image = { workspace = true, optional = true }
# Async
tokio = { workspace = true, features = ["rt-multi-thread"] }
async-trait = { workspace = true }
# Serialization
serde = { workspace = true }
serde_json = { workspace = true }
# Error handling
thiserror = { workspace = true }
anyhow = { workspace = true }
# Utilities
uuid = { workspace = true }
chrono = { workspace = true }
tracing = { workspace = true }
dashmap = { workspace = true }
parking_lot = { workspace = true }
# Configuration
toml = { workspace = true, optional = true }
regex = { workspace = true, optional = true }
# Optional backends
tesseract-rs = { workspace = true, optional = true }
pdf-extract = { workspace = true, optional = true }
# Metrics
prometheus = { version = "0.13", optional = true }
lazy_static = { version = "1.5", optional = true }
[dev-dependencies]
tokio = { workspace = true, features = ["macros", "test-util"] }
tempfile = "3.13"
mockall = { workspace = true }
[features]
default = ["api-client", "cache", "preprocessing"]
# Core features
api-client = ["reqwest"]
cache = ["ruvector-core/storage"]
preprocessing = ["image"]
metrics = ["dep:ruvector-metrics", "prometheus", "lazy_static"]
config = ["toml", "regex"]
# Optional backends
tesseract = ["dep:tesseract-rs"]
pdf = ["dep:pdf-extract"]
# Performance
simd = ["ruvector-core/simd"]
quantization = []
# Environment
wasm = []
memory-only = []
```
---
## 10. Module Structure
### 10.1 Core Module Organization
```
crates/ruvector-scipix-core/src/
├── lib.rs # Public API
├── error.rs # Error types
├── types.rs # Shared types
├── traits.rs # Shared traits
├── config.rs # Configuration
├── api/
│ ├── mod.rs
│ ├── client.rs # Scipix API client
│ └── models.rs # API request/response types
├── cache/
│ ├── mod.rs
│ ├── vector_cache.rs # ruvector-core integration
│ ├── memory_cache.rs # In-memory cache for WASM
│ └── stats.rs # Cache statistics
├── ocr/
│ ├── mod.rs
│ ├── engine.rs # Main OCR engine
│ ├── batch.rs # Batch processing
│ └── backends/
│ ├── mod.rs
│ ├── scipix.rs # Scipix backend
│ └── tesseract.rs # Tesseract fallback
├── preprocessing/
│ ├── mod.rs
│ ├── image_ops.rs # Image preprocessing
│ ├── filters.rs # Denoising, enhancement
│ └── rotation.rs # Auto-rotation
├── postprocessing/
│ ├── mod.rs
│ ├── latex_validate.rs # LaTeX validation
│ └── normalize.rs # Symbol normalization
├── embeddings/
│ ├── mod.rs
│ ├── image_embedder.rs # Image to vector
│ └── latex_embedder.rs # LaTeX to vector
├── distributed/
│ ├── mod.rs
│ ├── coordinator.rs # Cluster coordination
│ ├── sharding.rs # Work distribution
│ └── aggregator.rs # Result aggregation
└── metrics/
├── mod.rs
└── prometheus.rs # Metrics collection
```
### 10.2 Server Module Organization
```
crates/ruvector-scipix-server/src/
├── main.rs # Server entry point
├── routes/
│ ├── mod.rs
│ ├── ocr.rs # OCR endpoints
│ ├── cache.rs # Cache management
│ ├── health.rs # Health checks
│ └── metrics.rs # Metrics endpoint
├── middleware/
│ ├── mod.rs
│ ├── auth.rs # API key auth
│ ├── rate_limit.rs # Rate limiting
│ └── logging.rs # Request logging
├── state.rs # Shared app state
└── error.rs # HTTP error handling
```
---
## 11. Integration Checklist
### Phase 1: Core Integration
- [ ] Create `ruvector-scipix-core` crate
- [ ] Implement vector cache using `ruvector-core`
- [ ] Add Scipix API client
- [ ] Implement image preprocessing
- [ ] Add metrics collection
- [ ] Write unit tests
### Phase 2: Server Extension
- [ ] Create `ruvector-scipix-server` crate
- [ ] Implement REST API endpoints
- [ ] Add authentication middleware
- [ ] Implement rate limiting
- [ ] Add health checks
- [ ] Integration tests
### Phase 3: WASM Support
- [ ] Create `ruvector-scipix-wasm` crate
- [ ] Implement browser API
- [ ] Add TypeScript definitions
- [ ] Create example web app
- [ ] Browser testing
### Phase 4: Distributed Processing
- [ ] Integrate `ruvector-cluster`
- [ ] Implement work sharding
- [ ] Add load balancing
- [ ] Implement result aggregation
- [ ] Distributed tests
### Phase 5: Node.js Bindings
- [ ] Create `ruvector-scipix-node` crate
- [ ] Implement NAPI bindings
- [ ] Add TypeScript types
- [ ] Build platform binaries
- [ ] NPM package
### Phase 6: Optimization
- [ ] Enable quantization
- [ ] SIMD optimizations
- [ ] Cache tuning
- [ ] Performance benchmarks
- [ ] Documentation
---
## 12. Performance Targets
### Cache Performance
- **Hit Rate:** >80% on repeated expressions
- **Lookup Latency:** <10ms (p99)
- **Memory Overhead:** 4-8x reduction with quantization
### API Performance
- **OCR Latency:** <2s for single image
- **Throughput:** >100 req/min per node
- **PDF Processing:** <10s for 10-page document
### Cluster Performance
- **Scaling Efficiency:** >90% up to 8 nodes
- **Fault Tolerance:** Continue with 1 node failure
- **Shard Rebalancing:** <30s
---
## 13. Security Considerations
### API Key Management
- Never commit API keys to repository
- Use environment variables or secure vaults
- Rotate keys regularly
- Implement key-per-user for multi-tenant
### Rate Limiting
- Per-IP and per-API-key limits
- Sliding window algorithm
- Graceful degradation under load
### Input Validation
- Image size limits (10MB default)
- Format validation (PNG, JPEG only)
- Sanitize LaTeX output
- Prevent injection attacks
### Cache Security
- Encrypt sensitive cached data
- Implement cache eviction policies
- Prevent cache poisoning
- Audit cache access
---
## 14. Monitoring & Observability
### Key Metrics
- `scipix_ocr_requests_total` - Total requests
- `scipix_cache_hit_rate` - Cache effectiveness
- `scipix_ocr_duration_seconds` - Latency distribution
- `scipix_confidence_score` - Quality tracking
- `scipix_errors_total` - Error rate
### Dashboards
- Real-time OCR throughput
- Cache performance
- Error rates by type
- Confidence score distribution
- Cluster health
### Alerts
- Error rate >5%
- Latency p99 >5s
- Cache hit rate <60%
- Node failures
- API quota exhaustion
---
## 15. Migration Path
### From Standalone to Integrated
**Step 1:** Add ruvector-core dependency
```bash
cd crates/ruvector-scipix-core
cargo add ruvector-core --path ../ruvector-core
```
**Step 2:** Migrate cache to VectorDB
```rust
// Old: HashMap-based cache
let cache = HashMap::new();
// New: Vector-based cache
let cache = ScipixCache::new("./cache", 512)?;
```
**Step 3:** Integrate metrics
```rust
use ruvector_scipix_core::metrics::OcrMetrics;
OcrMetrics::record_request();
// ... perform OCR ...
OcrMetrics::record_latency(duration);
```
**Step 4:** Deploy with cluster support
```bash
# Enable cluster feature
cargo build --release --features cluster
# Start with cluster config
MATHPIX_CLUSTER_ENABLED=true cargo run
```
---
## 16. Testing Strategy
### Unit Tests
- Vector cache operations
- Embedding generation
- LaTeX validation
- Error handling
### Integration Tests
- End-to-end OCR flow
- Cache hit/miss scenarios
- Cluster coordination
- API endpoint testing
### Performance Tests
- Cache lookup benchmarks
- HNSW search performance
- Quantization overhead
- Distributed scaling
### Browser Tests (WASM)
- Canvas image capture
- API calls from browser
- Memory management
- Error handling
---
## 17. Documentation Requirements
### API Documentation
- OpenAPI/Swagger spec
- Example requests/responses
- Error codes
- Rate limits
### Integration Guides
- Quick start guide
- Configuration reference
- Cluster setup
- WASM integration
### Performance Tuning
- Cache configuration
- HNSW parameters
- Quantization trade-offs
- Cluster sizing
---
## Conclusion
This integration architecture provides a comprehensive blueprint for incorporating `ruvector-scipix` into the ruvector ecosystem. By leveraging existing infrastructure for vector storage, clustering, metrics, and WASM support, we achieve:
1. **Performance:** 80%+ cache hit rate, <10ms lookup latency
2. **Scalability:** Horizontal scaling via ruvector-cluster
3. **Flexibility:** Multiple deployment targets (server, browser, Node.js)
4. **Maintainability:** Shared types, errors, and configuration patterns
5. **Observability:** Rich metrics and monitoring
The modular design allows incremental adoption, starting with core OCR functionality and progressively adding caching, clustering, and advanced features.
---
**Next Steps:**
1. Review and approve architecture
2. Create Phase 1 crates (`ruvector-scipix-core`)
3. Implement vector cache integration
4. Add comprehensive tests
5. Deploy initial server with basic endpoints
6. Iterate based on performance metrics