# ruvllm-wasm

[![Crates.io](https://img.shields.io/crates/v/ruvllm-wasm.svg)](https://crates.io/crates/ruvllm-wasm) [![Documentation](https://docs.rs/ruvllm-wasm/badge.svg)](https://docs.rs/ruvllm-wasm) [![License](https://img.shields.io/crates/l/ruvllm-wasm.svg)](https://github.com/ruvnet/ruvector/blob/main/LICENSE)

**WASM bindings for browser-based LLM inference** with WebGPU acceleration, SIMD optimizations, and intelligent routing.

## Features

- **WebGPU Acceleration** - 10-50x faster inference with GPU compute shaders
- **SIMD Optimizations** - Vectorized operations for CPU fallback
- **Web Workers** - Parallel inference without blocking the main thread
- **GGUF Support** - Load quantized models (Q4, Q5, Q8) for efficient browser inference
- **Streaming Tokens** - Real-time token generation for responsive UX
- **Intelligent Routing** - HNSW Router, MicroLoRA, and SONA for optimized inference

## Installation

Add to your `Cargo.toml`:

```toml
[dependencies]
ruvllm-wasm = "2.0"
```

Then build for WASM:

```bash
wasm-pack build --target web --release
```

## Quick Start

```rust
use ruvllm_wasm::{RuvLLMWasm, GenerationConfig};

// Initialize with WebGPU (if available)
let llm = RuvLLMWasm::new(true).await?;

// Load a GGUF model
llm.load_model_from_url("https://example.com/model.gguf").await?;

// Generate text
let config = GenerationConfig {
    max_tokens: 100,
    temperature: 0.7,
    top_p: 0.9,
    ..Default::default()
};

let result = llm.generate("What is the capital of France?", &config).await?;
println!("{}", result.text);
```

## JavaScript Usage

```javascript
import init, { RuvLLMWasm } from 'ruvllm-wasm';

await init();

// Create instance with WebGPU
const llm = await RuvLLMWasm.new(true);

// Load model with progress reporting
await llm.load_model_from_url('https://example.com/model.gguf', (loaded, total) => {
  console.log(`Loading: ${Math.round(loaded / total * 100)}%`);
});

// Generate with streaming
await llm.generate_stream('Tell me a story', {
  max_tokens: 200,
  temperature: 0.8,
}, (token) => {
  console.log(token); // in a real app, append each token to the DOM instead
});
```

## Feature Flags

### WebGPU Acceleration

```toml
[dependencies]
ruvllm-wasm = { version = "2.0", features = ["webgpu"] }
```

Enables GPU-accelerated inference using WebGPU compute shaders:

- Matrix multiplication kernels
- Attention computation
- 10-50x speedup on supported browsers

### Parallel Inference

```toml
[dependencies]
ruvllm-wasm = { version = "2.0", features = ["parallel"] }
```

Runs inference in Web Workers:

- Non-blocking main thread
- Multiple concurrent requests
- Automatic worker pool management

### SIMD Optimizations

```toml
[dependencies]
ruvllm-wasm = { version = "2.0", features = ["simd"] }
```

Requires building with the SIMD target feature enabled:

```bash
RUSTFLAGS="-C target-feature=+simd128" wasm-pack build --target web
```

### Intelligent Features

```toml
[dependencies]
ruvllm-wasm = { version = "2.0", features = ["intelligent"] }
```

Enables advanced AI features:

- **HNSW Router** - Semantic routing for multi-model deployments
- **MicroLoRA** - Lightweight adapter injection
- **SONA Instant** - Self-optimizing neural adaptation

## Browser Requirements

| Feature | Required | Benefit |
|---------|----------|---------|
| WebAssembly | Yes | Core execution |
| WebGPU | No (recommended) | 10-50x faster |
| SharedArrayBuffer | No | Multi-threading |
| SIMD | No | 2-4x faster math |

### Enable SharedArrayBuffer

Add these headers to your server responses:

```
Cross-Origin-Opener-Policy: same-origin
Cross-Origin-Embedder-Policy: require-corp
```
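For local development, a minimal Node static server that sends both headers might look like the sketch below. It uses only the Node standard library; the port, MIME table, and file handling are illustrative, not part of this crate:

```javascript
import { createServer } from 'node:http';
import { readFile } from 'node:fs/promises';
import { extname } from 'node:path';

// Content types for the few files a WASM demo typically serves (illustrative).
const MIME = {
  '.html': 'text/html',
  '.js': 'text/javascript',
  '.wasm': 'application/wasm',
};

createServer(async (req, res) => {
  try {
    const path = '.' + (req.url === '/' ? '/index.html' : req.url);
    const body = await readFile(path);
    res.writeHead(200, {
      'Content-Type': MIME[extname(path)] ?? 'application/octet-stream',
      // The two headers that make the page cross-origin isolated,
      // unlocking SharedArrayBuffer (and thus multi-threading).
      'Cross-Origin-Opener-Policy': 'same-origin',
      'Cross-Origin-Embedder-Policy': 'require-corp',
    });
    res.end(body);
  } catch {
    res.writeHead(404).end('Not found');
  }
}).listen(8080);
```

In production, set the same two headers in your web server or CDN configuration; every response on the page (including the `.wasm` and worker files) must be served with them for cross-origin isolation to take effect.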
## Recommended Models

| Model | Size | Use Case |
|-------|------|----------|
| TinyLlama-1.1B-Q4 | ~700 MB | General chat |
| Phi-2-Q4 | ~1.6 GB | Code, reasoning |
| Qwen2-0.5B-Q4 | ~400 MB | Fast responses |
| StableLM-Zephyr-3B-Q4 | ~2 GB | Quality chat |

## API Reference

### RuvLLMWasm

```rust
impl RuvLLMWasm {
    /// Create a new instance
    pub async fn new(use_webgpu: bool) -> Result<RuvLLMWasm, JsValue>;

    /// Load model from URL
    pub async fn load_model_from_url(&self, url: &str) -> Result<(), JsValue>;

    /// Load model from bytes
    pub async fn load_model_from_bytes(&self, bytes: &[u8]) -> Result<(), JsValue>;

    /// Generate text completion
    pub async fn generate(&self, prompt: &str, config: &GenerationConfig) -> Result<GenerationResult, JsValue>;

    /// Generate with streaming callback
    pub async fn generate_stream(&self, prompt: &str, config: &GenerationConfig, callback: js_sys::Function) -> Result<GenerationResult, JsValue>;

    /// Check WebGPU availability
    pub async fn check_webgpu() -> WebGPUStatus;

    /// Get browser capabilities
    pub async fn get_capabilities() -> BrowserCapabilities;

    /// Unload model and free memory
    pub fn unload(&self);
}
```

## Related Packages

- [ruvllm](https://crates.io/crates/ruvllm) - Core LLM runtime
- [ruvllm-cli](https://crates.io/crates/ruvllm-cli) - CLI for model inference
- [@ruvector/ruvllm-wasm](https://www.npmjs.com/package/@ruvector/ruvllm-wasm) - npm package

## License

MIT OR Apache-2.0

---

**Part of the [RuVector](https://github.com/ruvnet/ruvector) ecosystem** - High-performance vector database with self-learning capabilities.