# ruvllm-wasm

[![Crates.io](https://img.shields.io/crates/v/ruvllm-wasm.svg)](https://crates.io/crates/ruvllm-wasm)
[![Documentation](https://docs.rs/ruvllm-wasm/badge.svg)](https://docs.rs/ruvllm-wasm)
[![License](https://img.shields.io/badge/license-MIT%20OR%20Apache--2.0-blue.svg)](https://github.com/ruvnet/ruvector/blob/main/LICENSE)

**WASM bindings for browser-based LLM inference** with WebGPU acceleration, SIMD optimizations, and intelligent routing.
## Features

- **WebGPU Acceleration** - 10-50x faster inference with GPU compute shaders
- **SIMD Optimizations** - Vectorized operations for CPU fallback
- **Web Workers** - Parallel inference without blocking the main thread
- **GGUF Support** - Load quantized models (Q4, Q5, Q8) for efficient browser inference
- **Streaming Tokens** - Real-time token generation for responsive UX
- **Intelligent Routing** - HNSW Router, MicroLoRA, SONA for optimized inference
## Installation

Add to your `Cargo.toml`:

```toml
[dependencies]
ruvllm-wasm = "2.0"
```

Or build the WASM package yourself with `wasm-pack`:

```bash
wasm-pack build --target web --release
```
## Quick Start

```rust
use ruvllm_wasm::{RuvLLMWasm, GenerationConfig};

// Initialize with WebGPU (if available)
let llm = RuvLLMWasm::new(true).await?;

// Load a GGUF model
llm.load_model_from_url("https://example.com/model.gguf").await?;

// Generate text
let config = GenerationConfig {
    max_tokens: 100,
    temperature: 0.7,
    top_p: 0.9,
    ..Default::default()
};

let result = llm.generate("What is the capital of France?", &config).await?;
println!("{}", result.text);
```
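Streaming also works from Rust: per the API Reference below, `generate_stream` takes a `js_sys::Function`, so a Rust closure has to be bridged through wasm-bindgen's `Closure`. A minimal sketch continuing from the snippet above (the closure wiring is our assumption, not a documented helper):

```rust
use js_sys::Function;
use ruvllm_wasm::GenerationConfig;
use wasm_bindgen::{closure::Closure, JsCast, JsValue};

// Bridge a Rust closure into the js_sys::Function that generate_stream expects.
let on_token = Closure::<dyn Fn(JsValue)>::new(|token: JsValue| {
    // Each invocation delivers one generated token.
    web_sys::console::log_1(&token);
});
let callback: Function = on_token.as_ref().unchecked_ref::<Function>().clone();

let config = GenerationConfig {
    max_tokens: 200,
    ..Default::default()
};
let result = llm.generate_stream("Tell me a story", &config, callback).await?;
// `on_token` must outlive the call; it is dropped here, after the await completes.
```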
## JavaScript Usage

```javascript
import init, { RuvLLMWasm } from 'ruvllm-wasm';

await init();

// Create instance with WebGPU
const llm = await RuvLLMWasm.new(true);

// Load model with a progress callback
await llm.load_model_from_url('https://example.com/model.gguf', (loaded, total) => {
  console.log(`Loading: ${Math.round(loaded / total * 100)}%`);
});

// Generate with streaming
await llm.generate_stream('Tell me a story', {
  max_tokens: 200,
  temperature: 0.8,
}, (token) => {
  document.body.append(token); // in the browser; use process.stdout.write(token) under Node
});
```
## Feature Flags

### WebGPU Acceleration

```toml
[dependencies]
ruvllm-wasm = { version = "2.0", features = ["webgpu"] }
```

Enables GPU-accelerated inference using WebGPU compute shaders (see the detection sketch below):

- Matrix multiplication kernels
- Attention computation
- 10-50x speedup on supported browsers
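To decide the `use_webgpu` flag at runtime, the crate exposes `check_webgpu()` (see the API Reference), but since the fields of `WebGPUStatus` aren't documented here, this hedged sketch probes `navigator.gpu` directly instead; `init_llm` is a hypothetical helper, and `web-sys` with the `Window`/`Navigator` features is assumed:

```rust
use js_sys::Reflect;
use ruvllm_wasm::RuvLLMWasm;
use wasm_bindgen::JsValue;

// Hypothetical helper: request the WebGPU backend only when the browser
// actually exposes `navigator.gpu`, otherwise fall back to the CPU path.
async fn init_llm() -> Result<RuvLLMWasm, JsValue> {
    let navigator = web_sys::window()
        .ok_or_else(|| JsValue::from_str("no window"))?
        .navigator();
    // `navigator.gpu` is undefined on browsers without WebGPU support.
    let has_webgpu = Reflect::get(navigator.as_ref(), &JsValue::from_str("gpu"))
        .map(|gpu| !gpu.is_undefined())
        .unwrap_or(false);
    RuvLLMWasm::new(has_webgpu).await
}
```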
### Parallel Inference

```toml
[dependencies]
ruvllm-wasm = { version = "2.0", features = ["parallel"] }
```

Run inference in Web Workers (see the concurrency sketch below):

- Non-blocking main thread
- Multiple concurrent requests
- Automatic worker pool management
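A sketch of two concurrent requests from Rust using `futures::future::join` (the `futures` crate, and the safety of awaiting two `generate` calls on one shared instance, are assumptions; `llm` is the instance from Quick Start):

```rust
use futures::future::join;
use ruvllm_wasm::GenerationConfig;

// Await two generations concurrently; with the `parallel` feature the crate
// advertises worker-pool dispatch, keeping the main thread responsive.
let config = GenerationConfig {
    max_tokens: 64,
    ..Default::default()
};
let (summary, haiku) = join(
    llm.generate("Summarize WebGPU in one sentence.", &config),
    llm.generate("Write a haiku about WebAssembly.", &config),
)
.await;
println!("{}\n{}", summary?.text, haiku?.text);
```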
### SIMD Optimizations

```toml
[dependencies]
ruvllm-wasm = { version = "2.0", features = ["simd"] }
```

Requires building with the `simd128` target feature:

```bash
RUSTFLAGS="-C target-feature=+simd128" wasm-pack build --target web
```
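Because `simd128` is a compile-time target feature, a build without the `RUSTFLAGS` above silently falls back to scalar code. One way to catch this early (our suggestion, not part of the crate) is a `compile_error!` guard in your crate root:

```rust
// Fail the build early if the simd128 target feature was not enabled.
#[cfg(all(target_arch = "wasm32", not(target_feature = "simd128")))]
compile_error!("rebuild with RUSTFLAGS=\"-C target-feature=+simd128\" to enable SIMD");
```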
### Intelligent Features

```toml
[dependencies]
ruvllm-wasm = { version = "2.0", features = ["intelligent"] }
```

Enables advanced AI features:

- **HNSW Router** - Semantic routing for multi-model deployments
- **MicroLoRA** - Lightweight adapter injection
- **SONA Instant** - Self-optimizing neural adaptation
## Browser Requirements

| Feature | Required | Benefit |
|---------|----------|---------|
| WebAssembly | Yes | Core execution |
| WebGPU | No (recommended) | 10-50x faster |
| SharedArrayBuffer | No | Multi-threading |
| SIMD | No | 2-4x faster math |
### Enable SharedArrayBuffer

Add these headers to your server:

```
Cross-Origin-Opener-Policy: same-origin
Cross-Origin-Embedder-Policy: require-corp
```
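How you add the headers depends on your server. As one hedged example, an axum + tower-http static server (both crates are assumptions, not project dependencies; `pkg` is wasm-pack's default output directory) could attach them to every response:

```rust
use axum::{http::{header::HeaderName, HeaderValue}, Router};
use tower_http::{services::ServeDir, set_header::SetResponseHeaderLayer};

#[tokio::main]
async fn main() {
    // Serve the wasm-pack output with cross-origin isolation headers,
    // which browsers require before exposing SharedArrayBuffer.
    let app = Router::new()
        .fallback_service(ServeDir::new("pkg"))
        .layer(SetResponseHeaderLayer::overriding(
            HeaderName::from_static("cross-origin-opener-policy"),
            HeaderValue::from_static("same-origin"),
        ))
        .layer(SetResponseHeaderLayer::overriding(
            HeaderName::from_static("cross-origin-embedder-policy"),
            HeaderValue::from_static("require-corp"),
        ));

    let listener = tokio::net::TcpListener::bind("127.0.0.1:8080").await.unwrap();
    axum::serve(listener, app).await.unwrap();
}
```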
## Recommended Models

| Model | Size | Use Case |
|-------|------|----------|
| TinyLlama-1.1B-Q4 | ~700 MB | General chat |
| Phi-2-Q4 | ~1.6 GB | Code, reasoning |
| Qwen2-0.5B-Q4 | ~400 MB | Fast responses |
| StableLM-Zephyr-3B-Q4 | ~2 GB | Quality chat |
## API Reference

### RuvLLMWasm

```rust
impl RuvLLMWasm {
    /// Create a new instance
    pub async fn new(use_webgpu: bool) -> Result<Self, JsValue>;

    /// Load model from URL
    pub async fn load_model_from_url(&self, url: &str) -> Result<(), JsValue>;

    /// Load model from bytes
    pub async fn load_model_from_bytes(&self, bytes: &[u8]) -> Result<(), JsValue>;

    /// Generate text completion
    pub async fn generate(&self, prompt: &str, config: &GenerationConfig) -> Result<GenerationResult, JsValue>;

    /// Generate with streaming callback
    pub async fn generate_stream(&self, prompt: &str, config: &GenerationConfig, callback: js_sys::Function) -> Result<GenerationResult, JsValue>;

    /// Check WebGPU availability
    pub async fn check_webgpu() -> WebGPUStatus;

    /// Get browser capabilities
    pub async fn get_capabilities() -> BrowserCapabilities;

    /// Unload model and free memory
    pub fn unload(&self);
}
```
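All async methods return `Result<_, JsValue>`, and in the browser such futures are typically driven with `wasm_bindgen_futures::spawn_local`. A minimal driver sketch (the `run`/`start` wrappers are our naming, not crate APIs):

```rust
use ruvllm_wasm::{GenerationConfig, RuvLLMWasm};
use wasm_bindgen::JsValue;
use wasm_bindgen_futures::spawn_local;

async fn run() -> Result<(), JsValue> {
    let llm = RuvLLMWasm::new(true).await?;
    llm.load_model_from_url("https://example.com/model.gguf").await?;
    let result = llm.generate("Hello!", &GenerationConfig::default()).await?;
    web_sys::console::log_1(&result.text.into());
    llm.unload(); // free model memory when done
    Ok(())
}

// Fire and forget on the browser event loop; surface errors on the console.
pub fn start() {
    spawn_local(async {
        if let Err(e) = run().await {
            web_sys::console::error_1(&e);
        }
    });
}
```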
## Related Packages

- [ruvllm](https://crates.io/crates/ruvllm) - Core LLM runtime
- [ruvllm-cli](https://crates.io/crates/ruvllm-cli) - CLI for model inference
- [@ruvector/ruvllm-wasm](https://www.npmjs.com/package/@ruvector/ruvllm-wasm) - npm package

## License

MIT OR Apache-2.0

---

**Part of the [RuVector](https://github.com/ruvnet/ruvector) ecosystem** - High-performance vector database with self-learning capabilities.