# ruvllm-wasm
[![Crates.io](https://img.shields.io/crates/v/ruvllm-wasm.svg)](https://crates.io/crates/ruvllm-wasm)
[![Documentation](https://docs.rs/ruvllm-wasm/badge.svg)](https://docs.rs/ruvllm-wasm)
[![License](https://img.shields.io/crates/l/ruvllm-wasm.svg)](https://github.com/ruvnet/ruvector/blob/main/LICENSE)
**WASM bindings for browser-based LLM inference** with WebGPU acceleration, SIMD optimizations, and intelligent routing.
## Features
- **WebGPU Acceleration** - 10-50x faster inference with GPU compute shaders
- **SIMD Optimizations** - Vectorized operations for CPU fallback
- **Web Workers** - Parallel inference without blocking the main thread
- **GGUF Support** - Load quantized models (Q4, Q5, Q8) for efficient browser inference
- **Streaming Tokens** - Real-time token generation for responsive UX
- **Intelligent Routing** - HNSW Router, MicroLoRA, SONA for optimized inference
## Installation
Add to your `Cargo.toml`:
```toml
[dependencies]
ruvllm-wasm = "2.0"
```
Or build for WASM:
```bash
wasm-pack build --target web --release
```
## Quick Start
```rust
use ruvllm_wasm::{RuvLLMWasm, GenerationConfig};

// Initialize with WebGPU (if available)
let llm = RuvLLMWasm::new(true).await?;

// Load a GGUF model
llm.load_model_from_url("https://example.com/model.gguf").await?;

// Generate text
let config = GenerationConfig {
    max_tokens: 100,
    temperature: 0.7,
    top_p: 0.9,
    ..Default::default()
};
let result = llm.generate("What is the capital of France?", &config).await?;
println!("{}", result.text);
```
## JavaScript Usage
```javascript
import init, { RuvLLMWasm } from 'ruvllm-wasm';

await init();

// Create instance with WebGPU
const llm = await RuvLLMWasm.new(true);

// Load model with progress reporting
await llm.load_model_from_url('https://example.com/model.gguf', (loaded, total) => {
  console.log(`Loading: ${Math.round(loaded / total * 100)}%`);
});

// Generate with streaming
await llm.generate_stream('Tell me a story', {
  max_tokens: 200,
  temperature: 0.8,
}, (token) => {
  document.body.append(token); // append each token to the page as it arrives
});
```
## Feature Flags
### WebGPU Acceleration
```toml
[dependencies]
ruvllm-wasm = { version = "2.0", features = ["webgpu"] }
```
Enables GPU-accelerated inference using WebGPU compute shaders:
- Matrix multiplication kernels
- Attention computation
- 10-50x speedup on supported browsers
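If GPU initialization fails at runtime (for example, the browser exposes no WebGPU adapter), the caller can fall back to the CPU path. A minimal sketch, assuming `RuvLLMWasm::new(true)` returns an `Err` in that case:
```rust
// Try the GPU path first, then fall back to the SIMD/CPU path.
let llm = match RuvLLMWasm::new(true).await {
    Ok(gpu_llm) => gpu_llm,
    Err(_) => RuvLLMWasm::new(false).await?,
};
```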
### Parallel Inference
```toml
[dependencies]
ruvllm-wasm = { version = "2.0", features = ["parallel"] }
```
Run inference in Web Workers:
- Non-blocking main thread
- Multiple concurrent requests
- Automatic worker pool management
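Because `generate` takes `&self`, several requests can be driven concurrently from the same instance. A minimal sketch, assuming the `parallel` feature is enabled and using the `futures` crate (an extra dependency in this sketch) to await both requests together:
```rust
// Issue two generations concurrently; the worker pool schedules them off the main thread.
let (a, b) = futures::join!(
    llm.generate("Summarize WebGPU in one sentence.", &config),
    llm.generate("Summarize WASM SIMD in one sentence.", &config),
);
println!("{}", a?.text);
println!("{}", b?.text);
```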
### SIMD Optimizations
```toml
[dependencies]
ruvllm-wasm = { version = "2.0", features = ["simd"] }
```
Requires building with SIMD target:
```bash
RUSTFLAGS="-C target-feature=+simd128" wasm-pack build --target web
```
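To avoid setting `RUSTFLAGS` on every build, the same flag can be made persistent in `.cargo/config.toml` (an optional convenience, not required by the crate):
```toml
# .cargo/config.toml
[target.wasm32-unknown-unknown]
rustflags = ["-C", "target-feature=+simd128"]
```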
### Intelligent Features
```toml
[dependencies]
ruvllm-wasm = { version = "2.0", features = ["intelligent"] }
```
Enables advanced AI features:
- **HNSW Router** - Semantic routing for multi-model deployments
- **MicroLoRA** - Lightweight adapter injection
- **SONA Instant** - Self-optimizing neural adaptation
## Browser Requirements
| Feature | Required | Benefit |
|---------|----------|---------|
| WebAssembly | Yes | Core execution |
| WebGPU | No (recommended) | 10-50x faster |
| SharedArrayBuffer | No | Multi-threading |
| SIMD | No | 2-4x faster math |
### Enable SharedArrayBuffer
Add these headers to your server:
```
Cross-Origin-Opener-Policy: same-origin
Cross-Origin-Embedder-Policy: require-corp
```
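How the headers are added depends on your server. As one illustration (not specific to this crate), an nginx site could set them like this:
```nginx
# Send the cross-origin isolation headers required for SharedArrayBuffer.
location / {
    add_header Cross-Origin-Opener-Policy "same-origin";
    add_header Cross-Origin-Embedder-Policy "require-corp";
}
```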
## Recommended Models
| Model | Size | Use Case |
|-------|------|----------|
| TinyLlama-1.1B-Q4 | ~700 MB | General chat |
| Phi-2-Q4 | ~1.6 GB | Code, reasoning |
| Qwen2-0.5B-Q4 | ~400 MB | Fast responses |
| StableLM-Zephyr-3B-Q4 | ~2 GB | Quality chat |
## API Reference
### RuvLLMWasm
```rust
impl RuvLLMWasm {
/// Create a new instance
pub async fn new(use_webgpu: bool) -> Result<Self, JsValue>;
/// Load model from URL
pub async fn load_model_from_url(&self, url: &str) -> Result<(), JsValue>;
/// Load model from bytes
pub async fn load_model_from_bytes(&self, bytes: &[u8]) -> Result<(), JsValue>;
/// Generate text completion
pub async fn generate(&self, prompt: &str, config: &GenerationConfig) -> Result<GenerationResult, JsValue>;
/// Generate with streaming callback
pub async fn generate_stream(&self, prompt: &str, config: &GenerationConfig, callback: js_sys::Function) -> Result<GenerationResult, JsValue>;
/// Check WebGPU availability
pub async fn check_webgpu() -> WebGPUStatus;
/// Get browser capabilities
pub async fn get_capabilities() -> BrowserCapabilities;
/// Unload model and free memory
pub fn unload(&self);
}
```
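Streaming can also be driven from Rust. A minimal sketch, assuming the standard `wasm-bindgen`/`js-sys`/`web-sys` pattern for turning a Rust closure into the `js_sys::Function` callback, and assuming the callback receives each token as a single `JsValue` argument:
```rust
use wasm_bindgen::prelude::*;
use wasm_bindgen::JsCast;

// Wrap a Rust closure as the JS callback expected by `generate_stream`.
let on_token = Closure::wrap(Box::new(|token: JsValue| {
    web_sys::console::log_1(&token); // requires web-sys with the "console" feature
}) as Box<dyn FnMut(JsValue)>);

let result = llm
    .generate_stream(
        "Tell me a story",
        &config,
        on_token.as_ref().unchecked_ref::<js_sys::Function>().clone(),
    )
    .await?;

// The full completion is also available after streaming finishes.
println!("{}", result.text);
```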
## Related Packages
- [ruvllm](https://crates.io/crates/ruvllm) - Core LLM runtime
- [ruvllm-cli](https://crates.io/crates/ruvllm-cli) - CLI for model inference
- [@ruvector/ruvllm-wasm](https://www.npmjs.com/package/@ruvector/ruvllm-wasm) - npm package
## License
MIT OR Apache-2.0
---
**Part of the [RuVector](https://github.com/ruvnet/ruvector) ecosystem** - High-performance vector database with self-learning capabilities.