# @ruvector/ruvllm-wasm

[![npm version](https://img.shields.io/npm/v/@ruvector/ruvllm-wasm.svg)](https://www.npmjs.com/package/@ruvector/ruvllm-wasm) [![npm downloads](https://img.shields.io/npm/dt/@ruvector/ruvllm-wasm.svg)](https://www.npmjs.com/package/@ruvector/ruvllm-wasm) [![npm downloads/month](https://img.shields.io/npm/dm/@ruvector/ruvllm-wasm.svg)](https://www.npmjs.com/package/@ruvector/ruvllm-wasm) [![License](https://img.shields.io/npm/l/@ruvector/ruvllm-wasm.svg)](https://github.com/ruvnet/ruvector/blob/main/LICENSE) [![TypeScript](https://img.shields.io/badge/TypeScript-5.0-blue.svg)](https://www.typescriptlang.org/)

**Run large language models directly in the browser** using WebAssembly, with optional WebGPU acceleration for faster inference.

## Features

- **Browser-Native** - No server required, runs entirely client-side
- **WebGPU Acceleration** - 10-50x faster inference with GPU support
- **GGUF Models** - Load quantized models for efficient browser inference
- **Streaming** - Real-time token streaming for responsive UX
- **IndexedDB Caching** - Cache models locally for instant reload
- **Privacy-First** - All processing happens on-device
- **SIMD Support** - Optimized WASM with SIMD instructions
- **Multi-Threading** - Parallel inference with SharedArrayBuffer

## Installation

```bash
npm install @ruvector/ruvllm-wasm
```

## Quick Start

```typescript
import { RuvLLMWasm, checkWebGPU } from '@ruvector/ruvllm-wasm';

// Check browser capabilities
const webgpu = await checkWebGPU();
console.log('WebGPU:', webgpu); // 'available' | 'unavailable' | 'not_supported'

// Create instance with WebGPU (if available)
const llm = await RuvLLMWasm.create({
  useWebGPU: true,
  memoryLimit: 4096, // 4 GB max
});

// Load a model (with progress tracking)
await llm.loadModel('https://huggingface.co/TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF/resolve/main/tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf', {
  onProgress: (loaded, total) => {
    console.log(`Loading: ${Math.round(loaded / total * 100)}%`);
  },
});

// Generate text
const result = await llm.generate('What is the capital of France?', {
  maxTokens: 100,
  temperature: 0.7,
});

console.log(result.text);
console.log(`${result.stats.tokensPerSecond.toFixed(1)} tokens/sec`);
```

## Streaming Tokens

```typescript
// Stream tokens as they're generated
let story = '';

await llm.generate('Tell me a story about a robot', {
  maxTokens: 200,
  stream: true,
}, (token, done) => {
  story += token; // append each token as it arrives, e.g. to a DOM element
  if (done) console.log('--- Done ---');
});
```

## Chat Interface

```typescript
import { ChatMessage } from '@ruvector/ruvllm-wasm';

const messages: ChatMessage[] = [
  { role: 'system', content: 'You are a helpful assistant.' },
  { role: 'user', content: 'What is 2 + 2?' },
];

const response = await llm.chat(messages, {
  maxTokens: 100,
  temperature: 0.5,
});

console.log(response.text); // "2 + 2 equals 4."
```
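The chat history is plain data, so a multi-turn conversation is just a matter of appending each reply before the next call. A minimal sketch, reusing the `llm` instance from the Quick Start; the `assistant` role for model replies is an assumption based on the usual chat-message convention:

```typescript
import { ChatMessage } from '@ruvector/ruvllm-wasm';

const history: ChatMessage[] = [
  { role: 'system', content: 'You are a helpful assistant.' },
];

async function ask(question: string): Promise<string> {
  history.push({ role: 'user', content: question });

  const response = await llm.chat(history, { maxTokens: 200 });

  // Feed the reply back into the history so the next turn has context
  history.push({ role: 'assistant', content: response.text });
  return response.text;
}

console.log(await ask('What is the capital of France?'));
console.log(await ask('And roughly how many people live there?'));
```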
## React Hook Example

```tsx
import { useState, useEffect } from 'react';
import { RuvLLMWasm, LoadingStatus } from '@ruvector/ruvllm-wasm';

function useLLM(modelUrl: string) {
  const [llm, setLLM] = useState<RuvLLMWasm | null>(null);
  const [status, setStatus] = useState<LoadingStatus>('idle');
  const [progress, setProgress] = useState(0);

  useEffect(() => {
    let instance: RuvLLMWasm;

    async function init() {
      instance = await RuvLLMWasm.create({ useWebGPU: true });
      setStatus('downloading');
      await instance.loadModel(modelUrl, {
        onProgress: (loaded, total) => setProgress(loaded / total),
      });
      setStatus('ready');
      setLLM(instance);
    }

    init();
    return () => instance?.unload();
  }, [modelUrl]);

  return { llm, status, progress };
}

// Usage
function ChatApp() {
  const { llm, status, progress } = useLLM('https://example.com/model.gguf');
  const [response, setResponse] = useState('');

  if (status !== 'ready') {
    return <div>Loading: {Math.round(progress * 100)}%</div>;
  }

  const generate = async () => {
    const result = await llm!.generate('Hello!', { maxTokens: 50 });
    setResponse(result.text);
  };

  return (
    <div>
      <button onClick={generate}>Generate</button>
      <p>{response}</p>
    </div>
  );
}
```
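## Capability-Based Configuration

Rather than hard-coding `useWebGPU: true`, production code can probe the browser first and fall back gracefully. A minimal sketch using `getCapabilities()` (documented under Browser Requirements below); the thread-count heuristic and the `threads: 1` single-threaded fallback are illustrative assumptions:

```typescript
import { RuvLLMWasm, getCapabilities } from '@ruvector/ruvllm-wasm';

// Probe the browser, then enable only the features it actually supports
const caps = await getCapabilities();

const llm = await RuvLLMWasm.create({
  useWebGPU: caps.webgpu === 'available',
  // Multi-threading needs SharedArrayBuffer (see "Enable SharedArrayBuffer" below)
  threads: caps.sharedArrayBuffer ? Math.min(4, navigator.hardwareConcurrency || 1) : 1,
  memoryLimit: 2048,
});
```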

## Browser Requirements

| Feature | Required | Benefit |
|---------|----------|---------|
| WebAssembly | Yes | Core execution |
| WebGPU | No (recommended) | 10-50x faster |
| SharedArrayBuffer | No | Multi-threading |
| SIMD | No | 2-4x faster math |

### Check Capabilities

```typescript
import { getCapabilities } from '@ruvector/ruvllm-wasm';

const caps = await getCapabilities();
console.log(caps);
// {
//   webgpu: 'available',
//   sharedArrayBuffer: true,
//   simd: true,
//   crossOriginIsolated: true
// }
```

### Enable SharedArrayBuffer

Add these headers to your server:

```
Cross-Origin-Opener-Policy: same-origin
Cross-Origin-Embedder-Policy: require-corp
```

## API Reference

### `RuvLLMWasm.create(options?)`

Create a new instance.

```typescript
const llm = await RuvLLMWasm.create({
  useWebGPU: true,   // Enable WebGPU acceleration
  threads: 4,        // CPU threads (requires SharedArrayBuffer)
  memoryLimit: 4096, // Max memory in MB
});
```

### `loadModel(source, options?)`

Load a GGUF model.

```typescript
await llm.loadModel(url, {
  onProgress: (loaded, total) => { /* ... */ }
});
```

### `generate(prompt, config?, onToken?)`

Generate a text completion.

```typescript
const result = await llm.generate('Hello', {
  maxTokens: 100,
  temperature: 0.7,
  topP: 0.9,
  topK: 40,
  repetitionPenalty: 1.1,
  stopSequences: ['\n\n'],
  stream: true,
}, (token, done) => { /* ... */ });
```

### `chat(messages, config?, onToken?)`

Chat completion with message history.

```typescript
const result = await llm.chat([
  { role: 'system', content: 'You are helpful.' },
  { role: 'user', content: 'Hi!' },
], { maxTokens: 100 });
```

### `unload()`

Free memory and unload the model.

```typescript
llm.unload();
```

## Recommended Models

Small models suitable for browser inference:

| Model | Size | Use Case |
|-------|------|----------|
| TinyLlama-1.1B-Q4 | ~700 MB | General chat |
| Phi-2-Q4 | ~1.6 GB | Code, reasoning |
| Qwen2-0.5B-Q4 | ~400 MB | Fast responses |
| StableLM-Zephyr-3B-Q4 | ~2 GB | Quality chat |

## Performance Tips

1. **Use WebGPU** - Check support and enable it for a 10-50x speedup
2. **Smaller models** - Q4_K_M quantization balances quality and size
3. **Cache models** - IndexedDB caching avoids re-downloads
4. **Limit context** - Smaller context = faster inference
5. **Stream tokens** - Better UX with progressive output

## Related Packages

- [@ruvector/ruvllm](https://www.npmjs.com/package/@ruvector/ruvllm) - Node.js LLM library
- [@ruvector/ruvllm-cli](https://www.npmjs.com/package/@ruvector/ruvllm-cli) - CLI tool
- [ruvector](https://www.npmjs.com/package/ruvector) - Vector database

## Documentation

- [WASM Crate](https://github.com/ruvnet/ruvector/tree/main/crates/ruvllm-wasm)
- [API Reference](https://docs.rs/ruvllm-wasm)
- [Examples](https://github.com/ruvnet/ruvector/tree/main/examples/ruvLLM)

## License

MIT OR Apache-2.0

---

**Part of the [RuVector](https://github.com/ruvnet/ruvector) ecosystem** - High-performance vector database with self-learning capabilities.