Squashed 'vendor/ruvector/' content from commit b64c2172

git-subtree-dir: vendor/ruvector git-subtree-split: b64c21726f2bb37286d9ee36a7869fef60cc6900
2026-02-28 14:39:40 -05:00
commit d803bfe2b1
7854 changed files with 3522914 additions and 0 deletions
--- a/crates/ruvllm/docs/GITHUB_ISSUE_V2.md
+++ b/crates/ruvllm/docs/GITHUB_ISSUE_V2.md
@@ -0,0 +1,599 @@
+# 🚀 RuvLLM v2.0 - High-Performance LLM Inference for Apple Silicon
+
+[![Crates.io](https://img.shields.io/crates/v/ruvllm.svg)](https://crates.io/crates/ruvllm)
+[![npm](https://img.shields.io/npm/v/@aspect/ruvllm.svg)](https://www.npmjs.com/package/@aspect/ruvllm)
+[![Documentation](https://img.shields.io/badge/docs-ruv.io-blue)](https://ruv.io/docs/ruvllm)
+[![License](https://img.shields.io/badge/license-MIT%2FApache--2.0-green)](LICENSE)
+[![Build Status](https://img.shields.io/github/actions/workflow/status/aspect/ruvector/ci.yml?branch=main)](https://github.com/aspect/ruvector/actions)
+[![Discord](https://img.shields.io/discord/1234567890?logo=discord&label=discord)](https://discord.gg/ruv)
+
+<p align="center">
+  <img src="https://ruv.io/assets/ruvllm-logo.svg" alt="RuvLLM" width="200"/>
+  <br/>
+  <strong>Run Large Language Models locally on your Mac with maximum performance</strong>
+  <br/>
+  <a href="https://ruv.io/ruvllm">Website</a> •
+  <a href="https://ruv.io/docs/ruvllm">Documentation</a> •
+  <a href="https://discord.gg/ruv">Discord</a> •
+  <a href="https://twitter.com/raboruv">Twitter</a>
+</p>
+
+---
+
+## What is RuvLLM?
+
+**RuvLLM** is a blazing-fast LLM inference engine built in Rust, specifically optimized for Apple Silicon Macs (M1/M2/M3/M4). It lets you run AI models like Llama, Mistral, Phi, and Gemma directly on your laptop — no cloud, no API costs, complete privacy.
+
+### Why RuvLLM?
+
+- **🔥 Fast** — 40+ tokens/second on M4 Pro with optimized Metal shaders
+- **🍎 Apple Silicon Native** — Uses Metal GPU, Apple Neural Engine (ANE), and ARM NEON
+- **🔒 Private** — Everything runs locally, your data never leaves your device
+- **📦 Easy** — One command to install, one line to run
+- **🌐 Cross-Platform** — Works in Rust, Node.js, and browsers via WebAssembly
+
+---
+
+## ✨ Key Features
+
+### Core Capabilities
+
+| Feature | Description |
+|---------|-------------|
+| **Multi-Backend Support** | Metal GPU, Core ML (ANE), CPU with NEON SIMD |
+| **Quantization** | Q4, Q5, Q8 quantized models (4-8x memory savings) |
+| **GGUF Support** | Load models directly from Hugging Face in GGUF format |
+| **Streaming** | Real-time token-by-token generation |
+| **Continuous Batching** | Efficient multi-request handling |
+| **KV Cache** | Optimized key-value cache with paged attention |
+| **Speculative Decoding** | 1.5-2x speedup with draft models |
+
+### v2.0 New Features
+
+| Feature | Improvement |
+|---------|-------------|
+| **Apple Neural Engine** | 38 TOPS dedicated ML acceleration on M4 Pro |
+| **Hybrid GPU+ANE Pipeline** | Best of both worlds for optimal throughput |
+| **Flash Attention v2** | 2.5-7.5x faster attention computation |
+| **SONA Learning** | Self-optimizing neural architecture for adaptive inference |
+| **Ruvector Integration** | Built-in vector embeddings for RAG applications |
+
+---
+
+## 🚀 Quickstart
+
+### Rust (Cargo)
+
+```bash
+# Add to Cargo.toml
+cargo add ruvllm --features inference-metal
+```
+
+```rust
+use ruvllm::{Engine, GenerateParams};
+
+#[tokio::main]
+async fn main() -> Result<(), Box<dyn std::error::Error>> {
+    // Load a model (downloads automatically from Hugging Face)
+    let engine = Engine::from_pretrained("microsoft/Phi-3-mini-4k-instruct-gguf")?;
+
+    // Generate text
+    let response = engine.generate(
+        "Explain quantum computing in simple terms:",
+        GenerateParams::default()
+    )?;
+
+    println!("{}", response);
+    Ok(())
+}
+```
+
+### Node.js (npm)
+
+```bash
+npm install @aspect/ruvllm
+```
+
+```javascript
+import { RuvLLM } from '@aspect/ruvllm';
+
+// Initialize with a model
+const llm = await RuvLLM.fromPretrained('microsoft/Phi-3-mini-4k-instruct-gguf');
+
+// Generate text
+const response = await llm.generate('Explain quantum computing in simple terms:');
+console.log(response);
+
+// Or stream tokens
+for await (const token of llm.stream('Write a haiku about coding:')) {
+    process.stdout.write(token);
+}
+```
+
+### CLI
+
+```bash
+# Install CLI
+cargo install ruvllm-cli
+
+# Run interactively
+ruvllm chat --model microsoft/Phi-3-mini-4k-instruct-gguf
+
+# One-shot generation
+ruvllm generate "What is the meaning of life?" --model phi-3
+```
+
+---
+
+<details>
+<summary><h2>📚 Tutorials</h2></summary>
+
+### Tutorial 1: Building a Local Chatbot
+
+Create a simple chatbot that runs entirely on your Mac:
+
+```rust
+use ruvllm::{Engine, GenerateParams, ChatMessage};
+
+fn main() -> Result<(), Box<dyn std::error::Error>> {
+    let engine = Engine::from_pretrained("meta-llama/Llama-3.2-1B-Instruct-GGUF")?;
+
+    let mut history = vec![];
+
+    loop {
+        print!("You: ");
+        let mut input = String::new();
+        std::io::stdin().read_line(&mut input)?;
+
+        history.push(ChatMessage::user(&input));
+
+        let response = engine.chat(&history, GenerateParams {
+            max_tokens: 512,
+            temperature: 0.7,
+            ..Default::default()
+        })?;
+
+        println!("AI: {}", response);
+        history.push(ChatMessage::assistant(&response));
+    }
+}
+```
+
+### Tutorial 2: Streaming Responses in Node.js
+
+Build a real-time streaming API:
+
+```javascript
+import { RuvLLM } from '@aspect/ruvllm';
+import express from 'express';
+
+const app = express();
+const llm = await RuvLLM.fromPretrained('phi-3-mini');
+
+app.get('/stream', async (req, res) => {
+    const prompt = req.query.prompt;
+
+    res.setHeader('Content-Type', 'text/event-stream');
+    res.setHeader('Cache-Control', 'no-cache');
+
+    for await (const token of llm.stream(prompt)) {
+        res.write(`data: ${JSON.stringify({ token })}\n\n`);
+    }
+
+    res.write('data: [DONE]\n\n');
+    res.end();
+});
+
+app.listen(3000);
+```
+
+### Tutorial 3: RAG with Ruvector
+
+Combine RuvLLM with Ruvector for retrieval-augmented generation:
+
+```rust
+use ruvllm::Engine;
+use ruvector_core::{VectorDb, HnswConfig};
+
+fn main() -> Result<(), Box<dyn std::error::Error>> {
+    // Initialize vector database
+    let db = VectorDb::new(HnswConfig::default())?;
+
+    // Initialize LLM
+    let llm = Engine::from_pretrained("phi-3-mini")?;
+
+    // Add documents (embeddings generated automatically)
+    db.add_document("doc1", "RuvLLM is a fast LLM inference engine.")?;
+    db.add_document("doc2", "It supports Metal GPU acceleration.")?;
+
+    // Query and generate
+    let query = "What is RuvLLM?";
+    let context = db.search(query, 3)?;
+
+    let prompt = format!(
+        "Context:\n{}\n\nQuestion: {}\nAnswer:",
+        context.iter().map(|d| d.text.as_str()).collect::<Vec<_>>().join("\n"),
+        query
+    );
+
+    let response = llm.generate(&prompt, Default::default())?;
+    println!("{}", response);
+    Ok(())
+}
+```
+
+### Tutorial 4: Browser-Based Inference (WebAssembly)
+
+Run models directly in the browser:
+
+```html
+<!DOCTYPE html>
+<html>
+<head>
+    <script type="module">
+        import init, { RuvLLM } from 'https://unpkg.com/@aspect/ruvllm-wasm/ruvllm.js';
+
+        async function main() {
+            await init();
+
+            const llm = await RuvLLM.fromUrl('/models/phi-3-mini-q4.gguf');
+
+            const output = document.getElementById('output');
+
+            for await (const token of llm.stream('Write a poem about the web:')) {
+                output.textContent += token;
+            }
+        }
+
+        main();
+    </script>
+</head>
+<body>
+    <pre id="output"></pre>
+</body>
+</html>
+```
+
+</details>
+
+---
+
+<details>
+<summary><h2>🔧 Advanced Usage</h2></summary>
+
+### Custom Model Configuration
+
+Fine-tune model loading for your specific hardware:
+
+```rust
+use ruvllm::{Engine, ModelConfig, ComputeBackend, Quantization};
+
+let engine = Engine::builder()
+    .model_path("/path/to/model.gguf")
+    .backend(ComputeBackend::Metal)          // Use Metal GPU
+    .quantization(Quantization::Q4K)          // 4-bit quantization
+    .context_length(8192)                     // Max context
+    .num_gpu_layers(32)                       // Layers on GPU
+    .use_flash_attention(true)                // Enable Flash Attention
+    .build()?;
+```
+
+### Apple Neural Engine (ANE) Configuration
+
+Leverage the dedicated ML accelerator on Apple Silicon:
+
+```rust
+use ruvllm::{Engine, CoreMLBackend, ComputeUnits};
+
+// Create Core ML backend with ANE
+let backend = CoreMLBackend::new()?
+    .with_compute_units(ComputeUnits::CpuAndNeuralEngine)  // Use ANE
+    .with_tokenizer(tokenizer);
+
+// Load Core ML model
+backend.load_model("model.mlmodelc", ModelConfig::default())?;
+
+// Generate (uses ANE for MLP, GPU for attention)
+let response = backend.generate("Hello", GenerateParams::default())?;
+```
+
+### Hybrid GPU + ANE Pipeline
+
+Maximize throughput with intelligent workload distribution:
+
+```rust
+use ruvllm::kernels::{should_use_ane_matmul, get_ane_recommendation};
+
+// Check if ANE is beneficial for your matrix size
+let recommendation = get_ane_recommendation(batch_size, hidden_dim, vocab_size);
+
+if recommendation.use_ane {
+    println!("Using ANE: {} (confidence: {:.0}%)",
+             recommendation.reason,
+             recommendation.confidence * 100.0);
+}
+```
+
+### Continuous Batching Server
+
+Build a high-throughput inference server:
+
+```rust
+use ruvllm::serving::{
+    ContinuousBatchScheduler, KvCacheManager, InferenceRequest, SchedulerConfig
+};
+
+let config = SchedulerConfig {
+    max_batch_size: 32,
+    max_tokens_per_batch: 4096,
+    preemption_mode: PreemptionMode::Swap,
+    ..Default::default()
+};
+
+let mut scheduler = ContinuousBatchScheduler::new(config);
+let mut kv_cache = KvCacheManager::new(KvCachePoolConfig::default());
+
+// Add requests
+scheduler.add_request(InferenceRequest::new(tokens, params));
+
+// Process batches
+while let Some(batch) = scheduler.schedule() {
+    // Execute batch inference
+    let outputs = engine.forward_batch(&batch)?;
+
+    // Update scheduler with results
+    scheduler.update(outputs);
+}
+```
+
+### Speculative Decoding
+
+Speed up generation with draft models:
+
+```rust
+use ruvllm::speculative::{SpeculativeDecoder, SpeculativeConfig};
+
+let config = SpeculativeConfig {
+    draft_model: "phi-3-mini-draft",    // Small, fast model
+    target_model: "phi-3-medium",        // Large, accurate model
+    num_speculative_tokens: 4,           // Tokens to speculate
+    temperature: 0.8,
+};
+
+let decoder = SpeculativeDecoder::new(config)?;
+
+// 1.5-2x faster than standard decoding
+let response = decoder.generate("Explain relativity:", params)?;
+```
+
+### Custom Tokenizer
+
+Use custom tokenizers for specialized models:
+
+```rust
+use ruvllm::tokenizer::{RuvTokenizer, TokenizerConfig};
+
+// Load from HuggingFace
+let tokenizer = RuvTokenizer::from_pretrained("meta-llama/Llama-3.2-1B")?;
+
+// Or from local file
+let tokenizer = RuvTokenizer::from_file("./tokenizer.json")?;
+
+// Encode/decode
+let tokens = tokenizer.encode("Hello, world!")?;
+let text = tokenizer.decode(&tokens)?;
+
+// With chat template
+let formatted = tokenizer.apply_chat_template(&[
+    ChatMessage::system("You are a helpful assistant."),
+    ChatMessage::user("What is 2+2?"),
+])?;
+```
+
+### Memory Optimization
+
+Optimize for large models on limited memory:
+
+```rust
+use ruvllm::{Engine, MemoryConfig};
+
+let engine = Engine::builder()
+    .model_path("llama-70b.gguf")
+    .memory_config(MemoryConfig {
+        max_memory_gb: 24.0,           // Limit memory usage
+        offload_to_cpu: true,          // Offload layers to CPU
+        use_mmap: true,                // Memory-map model file
+        kv_cache_dtype: DType::F16,    // Half-precision KV cache
+    })
+    .build()?;
+```
+
+### Embeddings for RAG
+
+Generate embeddings for retrieval applications:
+
+```rust
+use ruvllm::Engine;
+
+let engine = Engine::from_pretrained("nomic-embed-text-v1.5")?;
+
+// Single embedding
+let embedding = engine.embed("What is machine learning?")?;
+
+// Batch embeddings
+let embeddings = engine.embed_batch(&[
+    "Document 1 content",
+    "Document 2 content",
+    "Document 3 content",
+])?;
+
+// Cosine similarity
+let similarity = ruvector_core::cosine_similarity(&embedding, &embeddings[0]);
+```
+
+### Node.js Advanced Configuration
+
+```javascript
+import { RuvLLM, ModelConfig, ComputeBackend } from '@aspect/ruvllm';
+
+const llm = await RuvLLM.create({
+    modelPath: './models/phi-3-mini-q4.gguf',
+    backend: ComputeBackend.Metal,
+    contextLength: 8192,
+    numGpuLayers: 32,
+    flashAttention: true,
+
+    // Callbacks
+    onToken: (token) => process.stdout.write(token),
+    onProgress: (progress) => console.log(`Loading: ${progress}%`),
+});
+
+// Structured output (JSON mode)
+const result = await llm.generate('List 3 colors', {
+    responseFormat: 'json',
+    schema: {
+        type: 'object',
+        properties: {
+            colors: { type: 'array', items: { type: 'string' } }
+        }
+    }
+});
+
+console.log(JSON.parse(result)); // { colors: ['red', 'blue', 'green'] }
+```
+
+</details>
+
+---
+
+## 📊 Performance Benchmarks
+
+Tested on M4 Pro (14-core CPU, 20-core GPU, 38 TOPS ANE):
+
+### Model Inference Speed
+
+| Model | Size | Quantization | Tokens/sec | Memory |
+|-------|------|--------------|------------|--------|
+| Phi-3 Mini | 3.8B | Q4_K_M | 52 t/s | 2.4 GB |
+| Llama 3.2 | 1B | Q4_K_M | 78 t/s | 0.8 GB |
+| Llama 3.2 | 3B | Q4_K_M | 45 t/s | 2.1 GB |
+| Mistral 7B | 7B | Q4_K_M | 28 t/s | 4.2 GB |
+| Gemma 2 | 9B | Q4_K_M | 22 t/s | 5.8 GB |
+
+### 🔥 ANE vs NEON Matrix Multiply (NEW in v2.0)
+
+| Dimension | ANE | NEON | Speedup |
+|-----------|-----|------|---------|
+| 768×768 | 400 µs | 104 ms | **261x** |
+| 1024×1024 | 1.2 ms | 283 ms | **243x** |
+| 1536×1536 | 3.4 ms | 1,028 ms | **306x** |
+| 2048×2048 | 8.5 ms | 4,020 ms | **473x** |
+| 3072×3072 | 28.2 ms | 15,240 ms | **541x** |
+| 4096×4096 | 66.1 ms | 65,428 ms | **989x** |
+
+### Hybrid Pipeline Performance
+
+| Mode | seq=128 | seq=512 | vs NEON |
+|------|---------|---------|---------|
+| **Pure ANE** | 35.9 ms | 112.9 ms | **460x faster** |
+| Hybrid | 862 ms | 3,195 ms | 19x faster |
+| Pure NEON | 16,529 ms | 66,539 ms | baseline |
+
+### Activation Functions (SiLU/GELU)
+
+| Size | NEON | ANE | Winner |
+|------|------|-----|--------|
+| 32×4096 | 70 µs | 152 µs | NEON 2.2x |
+| 64×4096 | 141 µs | 303 µs | NEON 2.1x |
+| 128×4096 | 284 µs | 613 µs | NEON 2.2x |
+
+**Auto-dispatch** correctly routes: ANE for matmul ≥768 dims, NEON for activations.
+
+### Quantization Performance
+
+| Dimension | Encode | Hamming Distance |
+|-----------|--------|------------------|
+| 128-dim | 0.1 µs | <0.1 µs |
+| 384-dim | 0.3 µs | <0.1 µs |
+| 768-dim | 0.5 µs | <0.1 µs |
+| 1536-dim | 1.0 µs | <0.1 µs |
+
+*Benchmarks run with Criterion.rs, 50 samples per test, M4 Pro 48GB.*
+
+---
+
+## 🔌 Supported Models
+
+RuvLLM supports any model in GGUF format. Popular options:
+
+- **Llama 3.2** (1B, 3B) — Meta's latest efficient models
+- **Phi-3** (Mini, Small, Medium) — Microsoft's powerful small models
+- **Mistral 7B** — Excellent quality-to-size ratio
+- **Gemma 2** (2B, 9B, 27B) — Google's open models
+- **Qwen 2.5** (0.5B-72B) — Alibaba's multilingual models
+- **DeepSeek Coder** — Specialized for code generation
+
+Download models from [Hugging Face](https://huggingface.co/models?library=gguf).
+
+---
+
+## 🛠️ Installation
+
+### Rust
+
+```toml
+[dependencies]
+ruvllm = { version = "2.0", features = ["inference-metal"] }
+
+# Or with all features
+ruvllm = { version = "2.0", features = ["inference-metal", "coreml", "speculative"] }
+```
+
+Available features:
+- `inference-metal` — Metal GPU acceleration (recommended for Mac)
+- `inference-cuda` — CUDA acceleration (for NVIDIA GPUs)
+- `coreml` — Apple Neural Engine via Core ML
+- `speculative` — Speculative decoding support
+- `async-runtime` — Async/await support with Tokio
+
+### Node.js
+
+```bash
+npm install @aspect/ruvllm
+# or
+yarn add @aspect/ruvllm
+# or
+pnpm add @aspect/ruvllm
+```
+
+### From Source
+
+```bash
+git clone https://github.com/aspect/ruvector
+cd ruvector/crates/ruvllm
+cargo build --release --features inference-metal
+```
+
+---
+
+## 🤝 Contributing
+
+We welcome contributions! See [CONTRIBUTING.md](CONTRIBUTING.md) for guidelines.
+
+- 🐛 [Report bugs](https://github.com/aspect/ruvector/issues/new?template=bug_report.md)
+- 💡 [Request features](https://github.com/aspect/ruvector/issues/new?template=feature_request.md)
+- 📖 [Improve docs](https://github.com/aspect/ruvector/tree/main/docs)
+
+---
+
+## 📄 License
+
+RuvLLM is dual-licensed under MIT and Apache 2.0. See [LICENSE-MIT](LICENSE-MIT) and [LICENSE-APACHE](LICENSE-APACHE).
+
+---
+
+<p align="center">
+  Made with ❤️ by <a href="https://ruv.io">ruv.io</a>
+  <br/>
+  <sub>Part of the <a href="https://github.com/aspect/ruvector">Ruvector</a> ecosystem</sub>
+</p>