# ruvector-sparse-inference-wasm WebAssembly bindings for PowerInfer-style sparse inference engine. ## Overview This crate provides WASM bindings for the RuVector sparse inference engine, enabling efficient neural network inference in web browsers and Node.js environments with: - **Sparse Activation**: PowerInfer-style neuron prediction for 2-3x speedup - **GGUF Support**: Load quantized models in GGUF format - **Streaming Loading**: Fetch large models incrementally - **Multiple Backends**: Embedding models and LLM text generation ## Building ### For Web Browsers ```bash wasm-pack build --target web --release ``` ### For Node.js ```bash wasm-pack build --target nodejs --release ``` ### For Bundlers (webpack, rollup, etc.) ```bash wasm-pack build --target bundler --release ``` ## Installation ```bash npm install ruvector-sparse-inference-wasm ``` Or build locally: ```bash wasm-pack build --target web cd pkg && npm link ``` ## Usage ### Basic Inference Engine ```typescript import init, { SparseInferenceEngine } from 'ruvector-sparse-inference-wasm'; // Initialize WASM module await init(); // Load model const modelBytes = await fetch('/models/llama-2-7b.gguf').then(r => r.arrayBuffer()); const config = { sparsity: { enabled: true, threshold: 0.1 // 10% neuron activation }, temperature: 0.7, top_k: 40 }; const engine = new SparseInferenceEngine( new Uint8Array(modelBytes), JSON.stringify(config) ); // Run inference const input = new Float32Array(4096); // Your input embedding const output = engine.infer(input); console.log('Sparsity stats:', engine.sparsity_stats()); console.log('Model metadata:', engine.metadata()); ``` ### Streaming Model Loading For large models (>1GB), use streaming: ```typescript const engine = await SparseInferenceEngine.load_streaming( 'https://example.com/large-model.gguf', JSON.stringify(config) ); ``` ### Embedding Models For sentence transformers and embedding generation: ```typescript import { EmbeddingModel } from 'ruvector-sparse-inference-wasm'; const modelBytes = await fetch('/models/all-MiniLM-L6-v2.gguf').then(r => r.arrayBuffer()); const embedder = new EmbeddingModel(new Uint8Array(modelBytes)); // Encode single sequence (requires tokenization first) const inputIds = new Uint32Array([101, 2023, 2003, ...]); // Tokenized input const embedding = embedder.encode(inputIds); console.log('Embedding dimension:', embedder.dimension()); // Batch encoding const batchIds = new Uint32Array([...all tokenized sequences...]); const lengths = new Uint32Array([10, 15, 12]); // Length of each sequence const embeddings = embedder.encode_batch(batchIds, lengths); ``` ### LLM Text Generation For autoregressive language models: ```typescript import { LLMModel } from 'ruvector-sparse-inference-wasm'; const modelBytes = await fetch('/models/llama-2-7b-chat.gguf').then(r => r.arrayBuffer()); const config = { sparsity: { enabled: true, threshold: 0.1 }, temperature: 0.7, top_k: 40 }; const llm = new LLMModel(new Uint8Array(modelBytes), JSON.stringify(config)); // Generate tokens one at a time const prompt = new Uint32Array([1, 4321, 1234, ...]); // Tokenized prompt let generatedTokens = []; for (let i = 0; i < 100; i++) { const nextToken = llm.next_token(prompt); generatedTokens.push(nextToken); // Append to prompt for next iteration prompt = new Uint32Array([...prompt, nextToken]); } // Or generate multiple tokens at once const tokens = llm.generate(prompt, 100); console.log('Generation stats:', llm.stats()); // Reset for new conversation llm.reset_cache(); ``` ### Calibration Improve predictor accuracy with sample data: ```typescript // Collect representative samples const samples = new Float32Array([ ...embedding1, // 512 dims ...embedding2, // 512 dims ...embedding3, // 512 dims ]); engine.calibrate(samples, 512); // 512 = dimension of each sample ``` ### Dynamic Sparsity Control Adjust sparsity threshold at runtime: ```typescript // More sparse = faster, less accurate engine.set_sparsity(0.2); // 20% activation // Less sparse = slower, more accurate engine.set_sparsity(0.05); // 5% activation ``` ### Performance Measurement ```typescript import { measure_inference_time } from 'ruvector-sparse-inference-wasm'; const input = new Float32Array(4096); const avgTime = measure_inference_time(engine, input, 100); // 100 iterations console.log(`Average inference time: ${avgTime.toFixed(2)}ms`); ``` ## Configuration Options ```typescript interface InferenceConfig { sparsity: { enabled: boolean; // Enable sparse inference threshold: number; // Activation threshold (0.0-1.0) }; temperature: number; // Sampling temperature (0.0-2.0) top_k: number; // Top-k sampling (1-100) top_p?: number; // Nucleus sampling (0.0-1.0) max_tokens?: number; // Max generation length } ``` ## Browser Compatibility - Chrome/Edge 91+ (WebAssembly SIMD) - Firefox 89+ - Safari 15+ - Node.js 16+ For older browsers, build without SIMD: ```bash wasm-pack build --target web -- --no-default-features ``` ## Performance Tips 1. **Enable SIMD**: Ensure `wasm32-simd` is enabled for 2-4x speedup 2. **Quantization**: Use 4-bit or 8-bit quantized GGUF models 3. **Sparsity**: Tune threshold based on accuracy/speed tradeoff 4. **Calibration**: Run calibration with representative data 5. **Batch Processing**: Use batch encoding for multiple inputs 6. **Worker Threads**: Run inference in Web Workers to avoid blocking UI ## Example: Web Worker Integration ```typescript // worker.js import init, { SparseInferenceEngine } from 'ruvector-sparse-inference-wasm'; let engine; self.onmessage = async (e) => { if (e.data.type === 'init') { await init(); engine = new SparseInferenceEngine(e.data.modelBytes, e.data.config); self.postMessage({ type: 'ready' }); } else if (e.data.type === 'infer') { const output = engine.infer(e.data.input); self.postMessage({ type: 'result', output }); } }; // main.js const worker = new Worker('worker.js', { type: 'module' }); worker.postMessage({ type: 'init', modelBytes: new Uint8Array(modelBytes), config: JSON.stringify(config) }); worker.onmessage = (e) => { if (e.data.type === 'ready') { worker.postMessage({ type: 'infer', input: new Float32Array([...]) }); } else if (e.data.type === 'result') { console.log('Inference result:', e.data.output); } }; ``` ## Benchmarks On Apple M1 Pro (browser): | Model | Size | Sparsity | Speed | Memory | |-------|------|----------|-------|--------| | Llama-2-7B | 3.8GB | 10% | 45 tok/s | 1.2GB | | MiniLM-L6 | 90MB | 15% | 120 emb/s | 180MB | | Mistral-7B | 4.1GB | 12% | 38 tok/s | 1.4GB | ## Error Handling ```typescript try { const engine = new SparseInferenceEngine(modelBytes, config); const output = engine.infer(input); } catch (error) { if (error.message.includes('parse')) { console.error('Invalid GGUF model format'); } else if (error.message.includes('config')) { console.error('Invalid configuration'); } else { console.error('Inference failed:', error); } } ``` ## Development ### Run Tests ```bash wasm-pack test --headless --chrome wasm-pack test --headless --firefox ``` ### Build Documentation ```bash cargo doc --open --target wasm32-unknown-unknown ``` ### Size Optimization ```bash # Optimize for size wasm-pack build --target web --release -- -Z build-std=std,panic_abort -Z build-std-features=panic_immediate_abort # Further compression with wasm-opt wasm-opt -Oz -o optimized.wasm pkg/ruvector_sparse_inference_wasm_bg.wasm ``` ## License Same as parent RuVector project. ## Related Crates - `ruvector-sparse-inference` - Core Rust implementation - `ruvector-core` - Main RuVector library - `rvlite` - Lightweight WASM vector database ## Contributing See main RuVector repository for contribution guidelines.