@ruvector/ruvllm-wasm
Run large language models directly in the browser using WebAssembly with optional WebGPU acceleration for faster inference.
Features
- Browser-Native - No server required, runs entirely client-side
- WebGPU Acceleration - 10-50x faster inference with GPU support
- GGUF Models - Load quantized models for efficient browser inference
- Streaming - Real-time token streaming for responsive UX
- IndexedDB Caching - Cache models locally for instant reload
- Privacy-First - All processing happens on-device
- SIMD Support - Optimized WASM with SIMD instructions
- Multi-Threading - Parallel inference with SharedArrayBuffer
Installation
npm install @ruvector/ruvllm-wasm
Quick Start
import { RuvLLMWasm, checkWebGPU } from '@ruvector/ruvllm-wasm';

// Check browser capabilities
const webgpu = await checkWebGPU();
console.log('WebGPU:', webgpu); // 'available' | 'unavailable' | 'not_supported'

// Create instance with WebGPU (if available)
const llm = await RuvLLMWasm.create({
  useWebGPU: true,
  memoryLimit: 4096, // 4GB max
});

// Load a model (with progress tracking)
await llm.loadModel('https://huggingface.co/TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF/resolve/main/tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf', {
  onProgress: (loaded, total) => {
    console.log(`Loading: ${Math.round(loaded / total * 100)}%`);
  }
});

// Generate text
const result = await llm.generate('What is the capital of France?', {
  maxTokens: 100,
  temperature: 0.7,
});
console.log(result.text);
console.log(`${result.stats.tokensPerSecond.toFixed(1)} tokens/sec`);
Streaming Tokens
// Stream tokens as they're generated
let story = '';
await llm.generate('Tell me a story about a robot', {
  maxTokens: 200,
  stream: true,
}, (token, done) => {
  // process.stdout does not exist in browsers; append to a string or a DOM node instead
  story += token;
  if (done) console.log(story, '\n--- Done ---');
});
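When streaming into a UI, it can help to accumulate tokens and hand the running text to a render function. The following is a minimal sketch, assuming only the `(token, done)` callback shape shown above; `createTokenCollector` and `render` are hypothetical names and not part of the package.

```typescript
// Hypothetical helper (not part of @ruvector/ruvllm-wasm): accumulates
// streamed tokens and reports the running text to a render callback.
type TokenCallback = (token: string, done: boolean) => void;

function createTokenCollector(
  render: (text: string, done: boolean) => void,
): { onToken: TokenCallback; text: () => string } {
  let buffer = '';
  return {
    onToken: (token, done) => {
      buffer += token; // an empty final token is harmless here
      render(buffer, done);
    },
    text: () => buffer,
  };
}

// Usage in a browser (assumes an element with id "output" exists):
// const out = document.getElementById('output')!;
// const collector = createTokenCollector((text) => { out.textContent = text; });
// await llm.generate(prompt, { maxTokens: 200, stream: true }, collector.onToken);
```

Keeping the accumulation outside the UI layer means the same collector can feed a DOM node, React state, or a plain log.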
Chat Interface
import { ChatMessage } from '@ruvector/ruvllm-wasm';

const messages: ChatMessage[] = [
  { role: 'system', content: 'You are a helpful assistant.' },
  { role: 'user', content: 'What is 2 + 2?' },
];
const response = await llm.chat(messages, {
  maxTokens: 100,
  temperature: 0.5,
});
console.log(response.text); // e.g. "2 + 2 equals 4." (exact wording varies by model)
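Long conversations grow the context, which slows inference. One way to bound it is to keep the system prompt and only the most recent messages before each `chat` call. This is a sketch under stated assumptions: `trimHistory` is a hypothetical helper, and the message shape is declared locally with `role` typed as a plain string because only `'system'` and `'user'` appear in the example above.

```typescript
// Local stand-in for the package's ChatMessage type (assumption: only the
// `role` and `content` fields shown in the example above are needed).
interface Message {
  role: string;
  content: string;
}

// Hypothetical helper: keeps every system message plus the most recent
// `maxRecent` non-system messages, preserving the original order.
function trimHistory(messages: Message[], maxRecent: number): Message[] {
  const nonSystem = messages.filter((m) => m.role !== 'system');
  // slice(-0) would return the whole array, so handle 0 explicitly.
  const recent = maxRecent > 0 ? nonSystem.slice(-maxRecent) : [];
  const keep = new Set(recent);
  return messages.filter((m) => m.role === 'system' || keep.has(m));
}

// Usage:
// const response = await llm.chat(trimHistory(messages, 6), { maxTokens: 100 });
```

Counting messages is a rough proxy; a token-based budget would be more precise but needs access to the tokenizer.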
React Hook Example
import { useState, useEffect } from 'react';
import { RuvLLMWasm, LoadingStatus } from '@ruvector/ruvllm-wasm';

function useLLM(modelUrl: string) {
  const [llm, setLLM] = useState<RuvLLMWasm | null>(null);
  const [status, setStatus] = useState<LoadingStatus>('idle');
  const [progress, setProgress] = useState(0);

  useEffect(() => {
    let instance: RuvLLMWasm | undefined;
    let cancelled = false; // set on cleanup so a stale load cannot update state

    async function init() {
      instance = await RuvLLMWasm.create({ useWebGPU: true });
      if (cancelled) {
        // Unmounted (or modelUrl changed) while create() was pending
        instance.unload();
        return;
      }
      setStatus('downloading');
      await instance.loadModel(modelUrl, {
        onProgress: (loaded, total) => {
          if (!cancelled) setProgress(loaded / total);
        },
      });
      if (cancelled) return;
      setStatus('ready');
      setLLM(instance);
    }

    init().catch(console.error); // avoid an unhandled promise rejection
    return () => {
      cancelled = true;
      instance?.unload();
    };
  }, [modelUrl]);

  return { llm, status, progress };
}
// Usage
function ChatApp() {
  const { llm, status, progress } = useLLM('https://example.com/model.gguf');
  const [response, setResponse] = useState('');

  if (status !== 'ready') {
    return <div>Loading: {Math.round(progress * 100)}%</div>;
  }

  const generate = async () => {
    const result = await llm!.generate('Hello!', { maxTokens: 50 });
    setResponse(result.text);
  };

  return (
    <div>
      <button onClick={generate}>Generate</button>
      <p>{response}</p>
    </div>
  );
}
Browser Requirements
| Feature | Required | Benefit |
|---|---|---|
| WebAssembly | Yes | Core execution |
| WebGPU | No (recommended) | 10-50x faster |
| SharedArrayBuffer | No | Multi-threading |
| SIMD | No | 2-4x faster math |
Check Capabilities
import { getCapabilities } from '@ruvector/ruvllm-wasm';

const caps = await getCapabilities();
console.log(caps);
// {
//   webgpu: 'available',
//   sharedArrayBuffer: true,
//   simd: true,
//   crossOriginIsolated: true
// }
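The detected capabilities can drive the options passed to `RuvLLMWasm.create`. The sketch below assumes the capability shape printed above and the `useWebGPU` and `threads` options from the API reference; `pickCreateOptions`, the cap of 4 threads, and the single-thread fallback are illustrative assumptions rather than documented behavior.

```typescript
// Shape copied from the getCapabilities() output shown above; declared
// locally so the sketch is self-contained.
interface Capabilities {
  webgpu: string; // 'available' | 'unavailable' | 'not_supported'
  sharedArrayBuffer: boolean;
  simd: boolean;
  crossOriginIsolated: boolean;
}

interface CreateOptions {
  useWebGPU: boolean;
  threads: number;
}

// Hypothetical helper: derive create() options from detected capabilities.
// `cores` would typically be navigator.hardwareConcurrency in a browser.
function pickCreateOptions(caps: Capabilities, cores: number): CreateOptions {
  // Multi-threading needs SharedArrayBuffer, which needs cross-origin isolation.
  const canThread = caps.sharedArrayBuffer && caps.crossOriginIsolated;
  return {
    useWebGPU: caps.webgpu === 'available',
    // The upper bound of 4 is an arbitrary choice for this sketch.
    threads: canThread ? Math.max(1, Math.min(cores, 4)) : 1,
  };
}

// Usage:
// const llm = await RuvLLMWasm.create(
//   pickCreateOptions(await getCapabilities(), navigator.hardwareConcurrency),
// );
```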
Enable SharedArrayBuffer
Browsers only expose SharedArrayBuffer on cross-origin isolated pages. To enable it, serve your app with these response headers:
Cross-Origin-Opener-Policy: same-origin
Cross-Origin-Embedder-Policy: require-corp
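As a rough illustration, the headers can be applied from any server that exposes a `setHeader`-style method. `applyIsolationHeaders` and `HeaderSink` are hypothetical names introduced for this sketch; the structural type is meant to match Node's `http.ServerResponse` and Express's `res` without importing either.

```typescript
// The two response headers listed above.
const ISOLATION_HEADERS: ReadonlyArray<readonly [string, string]> = [
  ['Cross-Origin-Opener-Policy', 'same-origin'],
  ['Cross-Origin-Embedder-Policy', 'require-corp'],
];

// Minimal structural type so the sketch needs no imports.
interface HeaderSink {
  setHeader(name: string, value: string): unknown;
}

// Hypothetical helper: set both headers on an outgoing response.
function applyIsolationHeaders(res: HeaderSink): void {
  for (const [name, value] of ISOLATION_HEADERS) {
    res.setHeader(name, value);
  }
}

// Usage with Node's built-in HTTP server:
// import { createServer } from 'node:http';
// createServer((req, res) => {
//   applyIsolationHeaders(res);
//   /* ...serve your files... */
// }).listen(8080);
```

Once the headers are in place, the standard browser global `crossOriginIsolated` should report `true` on the page. Note that `require-corp` also requires cross-origin resources loaded by the page to opt in via CORS or a `Cross-Origin-Resource-Policy` header.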
API Reference
RuvLLMWasm.create(options?)
Create a new instance.
const llm = await RuvLLMWasm.create({
  useWebGPU: true,   // Enable WebGPU acceleration
  threads: 4,        // CPU threads (requires SharedArrayBuffer)
  memoryLimit: 4096, // Max memory in MB
});
loadModel(source, options?)
Load a GGUF model.
await llm.loadModel(url, {
  onProgress: (loaded, total) => { /* ... */ }
});
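Computing `loaded / total` directly, as in the Quick Start, yields `NaN` or `Infinity` if `total` is ever 0. Whether the callback can report a missing total (for example when a server omits `Content-Length`) is an assumption here, not documented behavior; `formatProgress` is a hypothetical helper that guards against it.

```typescript
// Hypothetical helper: turn (loaded, total) byte counts into a display string.
// Falls back to a plain byte count when `total` is missing or not positive.
function formatProgress(loaded: number, total: number): string {
  const mb = (loaded / (1024 * 1024)).toFixed(1);
  if (!Number.isFinite(total) || total <= 0) {
    return `${mb} MB`;
  }
  const pct = Math.min(100, Math.round((loaded / total) * 100));
  return `${pct}% (${mb} MB)`;
}

// Usage:
// await llm.loadModel(url, {
//   onProgress: (loaded, total) => console.log(formatProgress(loaded, total)),
// });
```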
generate(prompt, config?, onToken?)
Generate text completion.
const result = await llm.generate('Hello', {
  maxTokens: 100,
  temperature: 0.7,
  topP: 0.9,
  topK: 40,
  repetitionPenalty: 1.1,
  stopSequences: ['\n\n'],
  stream: true,
}, (token, done) => { /* ... */ });
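For intuition about `temperature`, `topK`, and `topP`, the sketch below shows how these parameters are commonly applied to a next-token distribution. It is a generic illustration of the standard technique and not this package's implementation; `filterCandidates` is a hypothetical name, and the function assumes `temperature > 0`.

```typescript
interface Candidate {
  index: number;
  prob: number;
}

// Generic illustration (NOT the package's code): narrow a list of logits to
// the candidates that survive temperature scaling, top-k, and top-p.
function filterCandidates(
  logits: number[],
  temperature: number,
  topK: number,
  topP: number,
): Candidate[] {
  // 1. Temperature-scaled softmax (subtract the max for numerical stability).
  const scaled = logits.map((l) => l / temperature);
  const max = Math.max(...scaled);
  const exps = scaled.map((s) => Math.exp(s - max));
  const sum = exps.reduce((a, b) => a + b, 0);
  const ranked = exps
    .map((e, index) => ({ index, prob: e / sum }))
    .sort((a, b) => b.prob - a.prob);

  // 2. Top-k: keep only the k most likely tokens.
  const topKKept = ranked.slice(0, topK);

  // 3. Top-p: keep the shortest prefix whose cumulative probability reaches topP.
  const kept: Candidate[] = [];
  let cumulative = 0;
  for (const candidate of topKKept) {
    kept.push(candidate);
    cumulative += candidate.prob;
    if (cumulative >= topP) break;
  }

  // 4. Renormalise so the surviving probabilities sum to 1.
  const keptSum = kept.reduce((a, c) => a + c.prob, 0);
  return kept.map((c) => ({ index: c.index, prob: c.prob / keptSum }));
}
```

Lower `temperature` sharpens the distribution, while smaller `topK` or `topP` values shrink the candidate pool; the next token is then sampled from what remains.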
chat(messages, config?, onToken?)
Chat completion with message history.
const result = await llm.chat([
  { role: 'system', content: 'You are helpful.' },
  { role: 'user', content: 'Hi!' },
], { maxTokens: 100 });
unload()
Free memory and unload model.
llm.unload();
Recommended Models
Small models suitable for browser inference:
| Model | Size | Use Case |
|---|---|---|
| TinyLlama-1.1B-Q4 | ~700 MB | General chat |
| Phi-2-Q4 | ~1.6 GB | Code, reasoning |
| Qwen2-0.5B-Q4 | ~400 MB | Fast responses |
| StableLM-Zephyr-3B-Q4 | ~2 GB | Quality chat |
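One simple way to use this table is to pick the largest model that fits a size budget. In the sketch below the sizes are the approximate figures from the table; `pickModel` and the selection rule are illustrative assumptions, not part of the package.

```typescript
interface ModelOption {
  name: string;
  sizeMB: number; // approximate file size from the table above
  useCase: string;
}

const MODELS: ModelOption[] = [
  { name: 'Qwen2-0.5B-Q4', sizeMB: 400, useCase: 'Fast responses' },
  { name: 'TinyLlama-1.1B-Q4', sizeMB: 700, useCase: 'General chat' },
  { name: 'Phi-2-Q4', sizeMB: 1600, useCase: 'Code, reasoning' },
  { name: 'StableLM-Zephyr-3B-Q4', sizeMB: 2000, useCase: 'Quality chat' },
];

// Hypothetical helper: the largest model whose file fits within `budgetMB`,
// or undefined if none fits.
function pickModel(budgetMB: number): ModelOption | undefined {
  return MODELS
    .filter((m) => m.sizeMB <= budgetMB)
    .sort((a, b) => b.sizeMB - a.sizeMB)[0];
}

// Usage:
// const choice = pickModel(1000); // leaves room under a ~1 GB budget
```

Runtime memory use is typically higher than the file size, so it is worth leaving headroom below any `memoryLimit` you set.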
Performance Tips
- Use WebGPU - Check for support and enable it for a 10-50x speedup
- Prefer smaller models - Q4_K_M quantization balances quality and size
- Cache models - IndexedDB caching avoids re-downloading on later visits
- Limit context - A shorter context means faster inference
- Stream tokens - Progressive output makes the UI feel more responsive
Related Packages
- @ruvector/ruvllm - Node.js LLM library
- @ruvector/ruvllm-cli - CLI tool
- ruvector - Vector database
License
MIT OR Apache-2.0
Part of the RuVector ecosystem - High-performance vector database with self-learning capabilities.