🚀 RuvLLM v2.0 - High-Performance LLM Inference for Apple Silicon

Run Large Language Models locally on your Mac with maximum performance
Website • Documentation • Discord • Twitter

What is RuvLLM?

RuvLLM is a blazing-fast LLM inference engine built in Rust, specifically optimized for Apple Silicon Macs (M1/M2/M3/M4). It lets you run AI models like Llama, Mistral, Phi, and Gemma directly on your laptop — no cloud, no API costs, complete privacy.

Why RuvLLM?

🔥 Fast — 40+ tokens/second on M4 Pro for Q4_K_M models up to about 4B parameters (see Performance Benchmarks), using optimized Metal shaders
🍎 Apple Silicon Native — Uses Metal GPU, Apple Neural Engine (ANE), and ARM NEON
🔒 Private — Everything runs locally, your data never leaves your device
📦 Easy — One command to install, one line to run
🌐 Cross-Platform — Works in Rust, Node.js, and browsers via WebAssembly

✨ Key Features

Core Capabilities

Feature	Description
Multi-Backend Support	Metal GPU, Core ML (ANE), CPU with NEON SIMD
Quantization	Q4, Q5, Q8 quantized models (roughly 4-8x smaller than FP32 weights)
GGUF Support	Load models directly from Hugging Face in GGUF format
Streaming	Real-time token-by-token generation
Continuous Batching	Efficient multi-request handling
KV Cache	Optimized key-value cache with paged attention
Speculative Decoding	1.5-2x speedup with draft models
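The 4-8x figure for quantization follows from the number of bits stored per weight. The sketch below works through that arithmetic. It is an approximation only: the function names are hypothetical (not part of RuvLLM), and it ignores per-block scale factors, activations, and the KV cache, so real GGUF files are somewhat larger.

```rust
/// Approximate weight storage in GiB for `params` parameters stored at
/// `bits_per_weight` bits each. Ignores quantization scales and metadata.
fn weight_gib(params: f64, bits_per_weight: f64) -> f64 {
    let bytes = params * bits_per_weight / 8.0;
    bytes / (1024.0 * 1024.0 * 1024.0)
}

/// How many times smaller a quantized format is than a baseline format.
fn saving_factor(baseline_bits: f64, quantized_bits: f64) -> f64 {
    baseline_bits / quantized_bits
}
```

Under these assumptions, 7 billion parameters take about 26.1 GiB at 32 bits per weight and about 3.3 GiB at 4 bits per weight, which is the 8x end of the range; 8-bit weights give the 4x end.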

v2.0 New Features

Feature	Improvement
Apple Neural Engine	38 TOPS dedicated ML acceleration on M4 Pro
Hybrid GPU+ANE Pipeline	Best of both worlds for optimal throughput
Flash Attention v2	2.5-7.5x faster attention computation
SONA Learning	Self-optimizing neural architecture for adaptive inference
Ruvector Integration	Built-in vector embeddings for RAG applications

🚀 Quickstart

Rust (Cargo)

# Add to Cargo.toml
cargo add ruvllm --features inference-metal

use ruvllm::{Engine, GenerateParams};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Load a model (downloads automatically from Hugging Face)
    let engine = Engine::from_pretrained("microsoft/Phi-3-mini-4k-instruct-gguf")?;

    // Generate text
    let response = engine.generate(
        "Explain quantum computing in simple terms:",
        GenerateParams::default()
    )?;

    println!("{}", response);
    Ok(())
}

Node.js (npm)

npm install @aspect/ruvllm

import { RuvLLM } from '@aspect/ruvllm';

// Initialize with a model
const llm = await RuvLLM.fromPretrained('microsoft/Phi-3-mini-4k-instruct-gguf');

// Generate text
const response = await llm.generate('Explain quantum computing in simple terms:');
console.log(response);

// Or stream tokens
for await (const token of llm.stream('Write a haiku about coding:')) {
    process.stdout.write(token);
}

CLI

# Install CLI
cargo install ruvllm-cli

# Run interactively
ruvllm chat --model microsoft/Phi-3-mini-4k-instruct-gguf

# One-shot generation
ruvllm generate "What is the meaning of life?" --model phi-3

📚 Tutorials

Tutorial 1: Building a Local Chatbot

Create a simple chatbot that runs entirely on your Mac:

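use std::io::Write;

use ruvllm::{Engine, GenerateParams, ChatMessage};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let engine = Engine::from_pretrained("meta-llama/Llama-3.2-1B-Instruct-GGUF")?;

    let mut history = vec![];

    loop {
        print!("You: ");
        // print! does not flush stdout, so flush before blocking on input
        std::io::stdout().flush()?;

        let mut input = String::new();
        // read_line returns 0 at end of input (Ctrl-D); exit instead of looping forever
        if std::io::stdin().read_line(&mut input)? == 0 {
            break;
        }

        // Strip the trailing newline before adding the message to the history
        history.push(ChatMessage::user(input.trim()));

        let response = engine.chat(&history, GenerateParams {
            max_tokens: 512,
            temperature: 0.7,
            ..Default::default()
        })?;

        println!("AI: {}", response);
        history.push(ChatMessage::assistant(&response));
    }

    Ok(())
}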

Tutorial 2: Streaming Responses in Node.js

Build a real-time streaming API:

import { RuvLLM } from '@aspect/ruvllm';
import express from 'express';

const app = express();
const llm = await RuvLLM.fromPretrained('phi-3-mini');

app.get('/stream', async (req, res) => {
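    const prompt = req.query.prompt;

    // req.query values can be missing, arrays, or objects; accept only a non-empty string
    if (typeof prompt !== 'string' || prompt.length === 0) {
        res.status(400).json({ error: 'prompt query parameter is required' });
        return;
    }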

    res.setHeader('Content-Type', 'text/event-stream');
    res.setHeader('Cache-Control', 'no-cache');

    for await (const token of llm.stream(prompt)) {
        res.write(`data: ${JSON.stringify({ token })}\n\n`);
    }

    res.write('data: [DONE]\n\n');
    res.end();
});

app.listen(3000);
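The `data: ...` strings written by the handler follow the Server-Sent Events wire format: every line of the payload is prefixed with `data: `, and a blank line ends the event. A minimal Rust sketch of that framing is shown here; the helper name `sse_frame` is hypothetical and not part of RuvLLM.

```rust
/// Formats one Server-Sent Events message.
/// Each payload line gets its own `data: ` prefix, and the event is
/// terminated by an empty line.
fn sse_frame(payload: &str) -> String {
    let mut out = String::new();
    for line in payload.split('\n') {
        out.push_str("data: ");
        out.push_str(line);
        out.push('\n');
    }
    // The blank line tells the client that the event is complete.
    out.push('\n');
    out
}
```

Splitting on newlines matters because a raw newline inside a single `data:` line would end the field early. Serializing the token as JSON, as the tutorial does, avoids the problem by escaping newlines.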

Tutorial 3: RAG with Ruvector

Combine RuvLLM with Ruvector for retrieval-augmented generation:

use ruvllm::Engine;
use ruvector_core::{VectorDb, HnswConfig};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Initialize vector database
    let db = VectorDb::new(HnswConfig::default())?;

    // Initialize LLM
    let llm = Engine::from_pretrained("phi-3-mini")?;

    // Add documents (embeddings generated automatically)
    db.add_document("doc1", "RuvLLM is a fast LLM inference engine.")?;
    db.add_document("doc2", "It supports Metal GPU acceleration.")?;

    // Query and generate
    let query = "What is RuvLLM?";
    let context = db.search(query, 3)?;

    let prompt = format!(
        "Context:\n{}\n\nQuestion: {}\nAnswer:",
        context.iter().map(|d| d.text.as_str()).collect::<Vec<_>>().join("\n"),
        query
    );

    let response = llm.generate(&prompt, Default::default())?;
    println!("{}", response);
    Ok(())
}

Tutorial 4: Browser-Based Inference (WebAssembly)

Run models directly in the browser:

<!DOCTYPE html>
<html>
<head>
    <script type="module">
        import init, { RuvLLM } from 'https://unpkg.com/@aspect/ruvllm-wasm/ruvllm.js';

        async function main() {
            await init();

            const llm = await RuvLLM.fromUrl('/models/phi-3-mini-q4.gguf');

            const output = document.getElementById('output');

            for await (const token of llm.stream('Write a poem about the web:')) {
                output.textContent += token;
            }
        }

        main();
    </script>
</head>
<body>
    <pre id="output"></pre>
</body>
</html>

🔧 Advanced Usage

Custom Model Configuration

Fine-tune model loading for your specific hardware:

use ruvllm::{Engine, ModelConfig, ComputeBackend, Quantization};

let engine = Engine::builder()
    .model_path("/path/to/model.gguf")
    .backend(ComputeBackend::Metal)          // Use Metal GPU
    .quantization(Quantization::Q4K)          // 4-bit quantization
    .context_length(8192)                     // Max context
    .num_gpu_layers(32)                       // Layers on GPU
    .use_flash_attention(true)                // Enable Flash Attention
    .build()?;

Apple Neural Engine (ANE) Configuration

Leverage the dedicated ML accelerator on Apple Silicon:

use ruvllm::{Engine, CoreMLBackend, ComputeUnits, GenerateParams, ModelConfig};

// Create Core ML backend with ANE
// (`tokenizer` must already be loaded; see Custom Tokenizer)
let mut backend = CoreMLBackend::new()?
    .with_compute_units(ComputeUnits::CpuAndNeuralEngine)  // Use ANE
    .with_tokenizer(tokenizer);

// Load Core ML model
backend.load_model("model.mlmodelc", ModelConfig::default())?;

// Generate (uses ANE for MLP, GPU for attention)
let response = backend.generate("Hello", GenerateParams::default())?;

Hybrid GPU + ANE Pipeline

Maximize throughput with intelligent workload distribution:

use ruvllm::kernels::{should_use_ane_matmul, get_ane_recommendation};

// Check if ANE is beneficial for your matrix size
let recommendation = get_ane_recommendation(batch_size, hidden_dim, vocab_size);

if recommendation.use_ane {
    println!("Using ANE: {} (confidence: {:.0}%)",
             recommendation.reason,
             recommendation.confidence * 100.0);
}

Continuous Batching Server

Build a high-throughput inference server:

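// PreemptionMode and KvCachePoolConfig are assumed to be exported from
// ruvllm::serving alongside the other scheduler types.
use ruvllm::serving::{
    ContinuousBatchScheduler, InferenceRequest, KvCacheManager, KvCachePoolConfig,
    PreemptionMode, SchedulerConfig,
};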

let config = SchedulerConfig {
    max_batch_size: 32,
    max_tokens_per_batch: 4096,
    preemption_mode: PreemptionMode::Swap,
    ..Default::default()
};

let mut scheduler = ContinuousBatchScheduler::new(config);
let mut kv_cache = KvCacheManager::new(KvCachePoolConfig::default());

// Add requests
scheduler.add_request(InferenceRequest::new(tokens, params));

// Process batches
while let Some(batch) = scheduler.schedule() {
    // Execute batch inference
    let outputs = engine.forward_batch(&batch)?;

    // Update scheduler with results
    scheduler.update(outputs);
}
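To make the two limits in `SchedulerConfig` concrete, here is a simplified sketch of how a scheduler might fill one batch. It is not RuvLLM's scheduler: the `Request` and `Limits` types and the `next_batch` function are hypothetical, requests are taken strictly in arrival order, and preemption is left out.

```rust
use std::collections::VecDeque;

/// A pending request, reduced to the fields this sketch needs.
#[derive(Debug, Clone, PartialEq)]
struct Request {
    id: u32,
    num_tokens: usize,
}

/// Hypothetical limits mirroring `max_batch_size` and `max_tokens_per_batch`.
struct Limits {
    max_batch_size: usize,
    max_tokens_per_batch: usize,
}

/// Takes requests from the front of the queue until adding the next one
/// would exceed either limit. Requests that do not fit stay queued.
/// Note: a single request larger than the token budget would block the
/// queue here; a real scheduler has to chunk or reject it.
fn next_batch(queue: &mut VecDeque<Request>, limits: &Limits) -> Vec<Request> {
    let mut batch: Vec<Request> = Vec::new();
    let mut tokens = 0usize;
    while let Some(n) = queue.front().map(|r| r.num_tokens) {
        let fits = batch.len() < limits.max_batch_size
            && tokens + n <= limits.max_tokens_per_batch;
        if !fits {
            break;
        }
        tokens += n;
        if let Some(req) = queue.pop_front() {
            batch.push(req);
        }
    }
    batch
}
```

With a 4096-token budget and queued requests of 1000, 2000, 2000, and 100 tokens, the first call returns the first two requests and the second call returns the remaining two.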

Speculative Decoding

Speed up generation with draft models:

use ruvllm::speculative::{SpeculativeDecoder, SpeculativeConfig};

let config = SpeculativeConfig {
    draft_model: "phi-3-mini-draft",    // Small, fast model
    target_model: "phi-3-medium",        // Large, accurate model
    num_speculative_tokens: 4,           // Tokens to speculate
    temperature: 0.8,
};

let decoder = SpeculativeDecoder::new(config)?;

// 1.5-2x faster than standard decoding
let response = decoder.generate("Explain relativity:", params)?;
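For readers new to the technique, the sketch below shows one draft-and-verify step. It is a simplified greedy variant with toy models passed in as closures, not RuvLLM's `SpeculativeDecoder`: the function name is hypothetical, and sampling with a nonzero temperature (as in the config above) needs a probabilistic acceptance rule that is not shown.

```rust
/// Runs one speculative step under greedy decoding and returns how many
/// draft tokens were accepted. `draft` and `target` map a token context
/// to the next token each model would pick.
fn speculative_step<D, T>(context: &mut Vec<u32>, k: usize, draft: D, target: T) -> usize
where
    D: Fn(&[u32]) -> u32,
    T: Fn(&[u32]) -> u32,
{
    // 1. The draft model proposes k tokens, one after another.
    let mut proposal = context.clone();
    for _ in 0..k {
        let next = draft(&proposal[..]);
        proposal.push(next);
    }

    // 2. The target model checks each proposed position. A real engine
    //    scores all k positions in one batched forward pass; this sketch
    //    calls the target once per position for clarity.
    let start = context.len();
    for i in 0..k {
        let expected = target(&proposal[..start + i]);
        context.push(expected);
        if expected != proposal[start + i] {
            // First disagreement: keep the target's token and stop.
            return i;
        }
    }

    // 3. Every draft token matched, so the target adds one bonus token.
    let bonus = target(&context[..]);
    context.push(bonus);
    k
}
```

Because every token appended to `context` is the target model's own greedy choice, the output matches plain greedy decoding with the target model. The speedup comes from verifying several positions per target pass, so it depends on how often the draft model agrees with the target.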

Custom Tokenizer

Use custom tokenizers for specialized models:

use ruvllm::tokenizer::{RuvTokenizer, TokenizerConfig};
use ruvllm::ChatMessage;

// Load from HuggingFace
let tokenizer = RuvTokenizer::from_pretrained("meta-llama/Llama-3.2-1B")?;

// Or from local file
let tokenizer = RuvTokenizer::from_file("./tokenizer.json")?;

// Encode/decode
let tokens = tokenizer.encode("Hello, world!")?;
let text = tokenizer.decode(&tokens)?;

// With chat template
let formatted = tokenizer.apply_chat_template(&[
    ChatMessage::system("You are a helpful assistant."),
    ChatMessage::user("What is 2+2?"),
])?;

Memory Optimization

Optimize for large models on limited memory:

// DType is assumed to be exported from the crate root
use ruvllm::{DType, Engine, MemoryConfig};

let engine = Engine::builder()
    .model_path("llama-70b.gguf")
    .memory_config(MemoryConfig {
        max_memory_gb: 24.0,           // Limit memory usage
        offload_to_cpu: true,          // Offload layers to CPU
        use_mmap: true,                // Memory-map model file
        kv_cache_dtype: DType::F16,    // Half-precision KV cache
    })
    .build()?;
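The `kv_cache_dtype` setting matters because the KV cache grows linearly with context length. The standard size estimate is sketched below; the function name and the example model shape are illustrative assumptions, not values read from RuvLLM or from any specific model file.

```rust
/// Estimated KV cache size in bytes for one sequence.
/// The factor 2 accounts for storing both keys and values.
fn kv_cache_bytes(
    layers: u64,
    kv_heads: u64,
    head_dim: u64,
    context_len: u64,
    bytes_per_elem: u64,
) -> u64 {
    2 * layers * kv_heads * head_dim * context_len * bytes_per_elem
}
```

For a hypothetical model with 32 layers, 8 KV heads, and a head dimension of 128, an 8192-token context needs exactly 1 GiB at F16 (2 bytes per element) and 2 GiB at F32, so a half-precision cache halves this part of the memory budget.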

Embeddings for RAG

Generate embeddings for retrieval applications:

use ruvllm::Engine;

let engine = Engine::from_pretrained("nomic-embed-text-v1.5")?;

// Single embedding
let embedding = engine.embed("What is machine learning?")?;

// Batch embeddings
let embeddings = engine.embed_batch(&[
    "Document 1 content",
    "Document 2 content",
    "Document 3 content",
])?;

// Cosine similarity
let similarity = ruvector_core::cosine_similarity(&embedding, &embeddings[0]);
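For reference, cosine similarity is the dot product of two vectors divided by the product of their lengths. The standalone sketch below shows the computation; it is a hypothetical stand-in and not the implementation in `ruvector_core`, which may differ in signature and error handling.

```rust
/// Cosine similarity of two vectors.
/// Returns None when the lengths differ, the input is empty, or either
/// vector has zero length (the similarity is undefined in those cases).
fn cosine(a: &[f32], b: &[f32]) -> Option<f32> {
    if a.len() != b.len() || a.is_empty() {
        return None;
    }
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 {
        return None;
    }
    Some(dot / (norm_a * norm_b))
}
```

Values near 1 mean the embeddings point in the same direction, values near 0 mean they are unrelated, and values near -1 mean they point in opposite directions.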

Node.js Advanced Configuration

import { RuvLLM, ModelConfig, ComputeBackend } from '@aspect/ruvllm';

const llm = await RuvLLM.create({
    modelPath: './models/phi-3-mini-q4.gguf',
    backend: ComputeBackend.Metal,
    contextLength: 8192,
    numGpuLayers: 32,
    flashAttention: true,

    // Callbacks
    onToken: (token) => process.stdout.write(token),
    onProgress: (progress) => console.log(`Loading: ${progress}%`),
});

// Structured output (JSON mode)
const result = await llm.generate('List 3 colors', {
    responseFormat: 'json',
    schema: {
        type: 'object',
        properties: {
            colors: { type: 'array', items: { type: 'string' } }
        }
    }
});

console.log(JSON.parse(result)); // e.g. { colors: ['red', 'blue', 'green'] }

📊 Performance Benchmarks

Tested on M4 Pro (14-core CPU, 20-core GPU, 38 TOPS ANE):

Model Inference Speed

Model	Size	Quantization	Tokens/sec	Memory
Phi-3 Mini	3.8B	Q4_K_M	52 t/s	2.4 GB
Llama 3.2	1B	Q4_K_M	78 t/s	0.8 GB
Llama 3.2	3B	Q4_K_M	45 t/s	2.1 GB
Mistral 7B	7B	Q4_K_M	28 t/s	4.2 GB
Gemma 2	9B	Q4_K_M	22 t/s	5.8 GB

🔥 ANE vs NEON Matrix Multiply (NEW in v2.0)

Dimension	ANE	NEON	Speedup
768×768	400 µs	104 ms	261x
1024×1024	1.2 ms	283 ms	243x
1536×1536	3.4 ms	1,028 ms	306x
2048×2048	8.5 ms	4,020 ms	473x
3072×3072	28.2 ms	15,240 ms	541x
4096×4096	66.1 ms	65,428 ms	989x

Hybrid Pipeline Performance

Mode	seq=128	seq=512	Speedup vs NEON (at seq=128)
Pure ANE	35.9 ms	112.9 ms	460x faster
Hybrid	862 ms	3,195 ms	19x faster
Pure NEON	16,529 ms	66,539 ms	baseline

Activation Functions (SiLU/GELU)

Size	NEON	ANE	Winner
32×4096	70 µs	152 µs	NEON 2.2x
64×4096	141 µs	303 µs	NEON 2.1x
128×4096	284 µs	613 µs	NEON 2.2x

Auto-dispatch correctly routes: ANE for matmul ≥768 dims, NEON for activations.
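The routing rule stated above can be sketched as a small function. This is an illustration only, not RuvLLM's dispatcher: the `Op` and `Unit` types are hypothetical, and the sketch assumes that "≥768 dims" means the smaller matrix dimension is at least 768.

```rust
/// Where an operation runs in this sketch.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Unit {
    Ane,
    Neon,
}

/// The two kinds of work covered by the benchmark tables.
enum Op {
    MatMul { rows: usize, cols: usize },
    Activation,
}

/// Smallest matrix dimension at which matmul is sent to the ANE
/// (assumed reading of the "≥768 dims" rule).
const ANE_MATMUL_MIN_DIM: usize = 768;

/// Large matrix multiplies go to the ANE; everything else stays on NEON.
fn dispatch(op: &Op) -> Unit {
    match *op {
        Op::MatMul { rows, cols } if rows.min(cols) >= ANE_MATMUL_MIN_DIM => Unit::Ane,
        _ => Unit::Neon,
    }
}
```

A fixed threshold like this reflects the tables: the ANE wins by a wide margin on the matrix sizes measured, while NEON is about 2x faster on the element-wise activations.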

Quantization Performance

Vector dimension	Encode time	Hamming distance time
128-dim	0.1 µs	<0.1 µs
384-dim	0.3 µs	<0.1 µs
768-dim	0.5 µs	<0.1 µs
1536-dim	1.0 µs	<0.1 µs
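The table does not say which quantization scheme is measured. Since it reports Hamming distance, the sketch below assumes sign-based binary quantization, where each dimension becomes one bit. Both function names are hypothetical and this is not taken from the RuvLLM or Ruvector source.

```rust
/// Packs one bit per dimension into u64 words: the bit is set when the
/// component is positive. (Assumed scheme; see the note above.)
fn encode_binary(v: &[f32]) -> Vec<u64> {
    let mut words = vec![0u64; (v.len() + 63) / 64];
    for (i, &x) in v.iter().enumerate() {
        if x > 0.0 {
            words[i / 64] |= 1u64 << (i % 64);
        }
    }
    words
}

/// Number of differing bits between two codes of equal length,
/// computed with XOR and popcount one word at a time.
fn hamming(a: &[u64], b: &[u64]) -> u32 {
    debug_assert_eq!(a.len(), b.len());
    a.iter().zip(b).map(|(x, y)| (x ^ y).count_ones()).sum()
}
```

Under this scheme a 1536-dimensional vector packs into 24 `u64` words, so a distance computation is 24 XOR and popcount operations.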

Benchmarks run with Criterion.rs, 50 samples per test, M4 Pro 48GB.

🔌 Supported Models

RuvLLM supports any model in GGUF format. Popular options:

Llama 3.2 (1B, 3B) — Meta's latest efficient models
Phi-3 (Mini, Small, Medium) — Microsoft's powerful small models
Mistral 7B — Excellent quality-to-size ratio
Gemma 2 (2B, 9B, 27B) — Google's open models
Qwen 2.5 (0.5B-72B) — Alibaba's multilingual models
DeepSeek Coder — Specialized for code generation

Download models from Hugging Face.

🛠️ Installation

Rust

[dependencies]
ruvllm = { version = "2.0", features = ["inference-metal"] }

# Or, instead of the line above, enable additional features:
# ruvllm = { version = "2.0", features = ["inference-metal", "coreml", "speculative"] }

Available features:

inference-metal — Metal GPU acceleration (recommended for Mac)
inference-cuda — CUDA acceleration (for NVIDIA GPUs)
coreml — Apple Neural Engine via Core ML
speculative — Speculative decoding support
async-runtime — Async/await support with Tokio

Node.js

npm install @aspect/ruvllm
# or
yarn add @aspect/ruvllm
# or
pnpm add @aspect/ruvllm

From Source

git clone https://github.com/aspect/ruvector
cd ruvector/crates/ruvllm
cargo build --release --features inference-metal

🤝 Contributing

We welcome contributions! See CONTRIBUTING.md for guidelines.

📄 License

RuvLLM is dual-licensed under MIT and Apache 2.0. See LICENSE-MIT and LICENSE-APACHE.

Made with ❤️ by ruv.io
Part of the Ruvector ecosystem
