@ruvector/ruvllm-cli


Command-line interface for local LLM inference and benchmarking - run AI models on your machine with Metal, CUDA, and CPU acceleration.

Features

  • Hardware Acceleration - Metal (macOS), CUDA (NVIDIA), Vulkan, Apple Neural Engine
  • GGUF Support - Load quantized models (Q4, Q5, Q6, Q8) for efficient inference
  • Interactive Chat - Terminal-based chat sessions with conversation history
  • Benchmarking - Measure tokens/second, memory usage, and time-to-first-token
  • HTTP Server - OpenAI-compatible API server for integration
  • Model Management - Download, list, and manage models from HuggingFace
  • Streaming Output - Real-time token streaming for responsive UX

Installation

# Install globally
npm install -g @ruvector/ruvllm-cli

# Or run directly with npx
npx @ruvector/ruvllm-cli --help

For full native performance, install the Rust binary:

cargo install ruvllm-cli

Quick Start

Run Inference

# Basic inference
ruvllm run --model ./llama-7b-q4.gguf --prompt "Explain quantum computing"

# With options
ruvllm run \
  --model ./model.gguf \
  --prompt "Write a haiku about Rust" \
  --temperature 0.8 \
  --max-tokens 100 \
  --backend metal

Interactive Chat

# Start chat session
ruvllm chat --model ./model.gguf

# With system prompt
ruvllm chat --model ./model.gguf --system "You are a helpful coding assistant"

Benchmark Performance

# Run benchmark
ruvllm bench --model ./model.gguf --iterations 20

# Compare backends
ruvllm bench --model ./model.gguf --backend metal
ruvllm bench --model ./model.gguf --backend cpu

Start Server

# OpenAI-compatible API server
ruvllm serve --model ./model.gguf --port 8080

# Then use with any OpenAI client
curl http://localhost:8080/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Hello", "max_tokens": 50}'

Model Management

# List available models
ruvllm list

# Download from HuggingFace
ruvllm download TheBloke/Llama-2-7B-GGUF

# Download specific quantization
ruvllm download TheBloke/Llama-2-7B-GGUF --quant q4_k_m

CLI Reference

| Command | Description |
|----------|--------------------------------|
| run | Run inference on a prompt |
| chat | Interactive chat session |
| bench | Benchmark model performance |
| serve | Start HTTP server |
| list | List downloaded models |
| download | Download model from HuggingFace |

Global Options

| Option | Description | Default |
|----------------|------------------------------------------|---------|
| --model, -m | Path to GGUF model file | - |
| --backend, -b | Acceleration backend (metal, cuda, cpu) | auto |
| --threads, -t | Number of CPU threads | auto |
| --gpu-layers | Layers to offload to GPU | all |
| --context-size | Context window size | 2048 |
| --verbose, -v | Enable verbose logging | false |

Generation Options

| Option | Description | Default |
|------------------|-----------------------------|---------|
| --temperature | Sampling temperature (0-2) | 0.7 |
| --top-p | Nucleus sampling threshold | 0.9 |
| --top-k | Top-k sampling | 40 |
| --max-tokens | Maximum tokens to generate | 256 |
| --repeat-penalty | Repetition penalty | 1.1 |
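
These flags can also be driven from a script by spawning the CLI. A hedged sketch using Node's child_process; it assumes only that ruvllm is on the PATH, and the flags are the ones documented above:

import { spawn } from 'node:child_process';

// Spawn `ruvllm run` with generation options and stream its output.
const proc = spawn('ruvllm', [
  'run',
  '--model', './model.gguf',
  '--prompt', 'Write a haiku about Rust',
  '--temperature', '0.8',
  '--max-tokens', '100',
]);

proc.stdout.on('data', (chunk) => process.stdout.write(chunk));
proc.on('close', (code) => {
  if (code !== 0) console.error(`ruvllm exited with code ${code}`);
});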

Programmatic Usage

import {
  parseArgs,
  formatBenchmarkTable,
  getAvailableBackends,
  ModelConfig,
  BenchmarkResult,
} from '@ruvector/ruvllm-cli';

// Parse CLI arguments
const args = parseArgs(['--model', './model.gguf', '--temperature', '0.8']);
console.log(args); // { model: './model.gguf', temperature: '0.8' }

// Check available backends
const backends = getAvailableBackends();
console.log('Available:', backends); // ['cpu', 'metal'] on macOS

// Format benchmark results
const results: BenchmarkResult[] = [
  {
    model: 'llama-7b',
    backend: 'metal',
    promptTokens: 50,
    generatedTokens: 100,
    promptTime: 120,
    generationTime: 2500,
    promptTPS: 416.7,
    generationTPS: 40.0,
    memoryUsage: 4200,
    peakMemory: 4800,
  },
];

console.log(formatBenchmarkTable(results));
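
In a real script, the raw argv can be forwarded directly; a one-line sketch, assuming parseArgs accepts any flag-style string array as shown above:

// Skip the node binary and script path, then parse the remaining flags.
const liveArgs = parseArgs(process.argv.slice(2));
console.log(liveArgs);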

Performance

Benchmarks on Apple M2 Pro with Q4_K_M quantization:

| Model | Prompt TPS | Gen TPS | Memory |
|----------------|------------|---------|--------|
| Llama-2-7B | 450 | 42 | 4.2 GB |
| Mistral-7B | 480 | 45 | 4.1 GB |
| Phi-2 | 820 | 85 | 1.8 GB |
| TinyLlama-1.1B | 1200 | 120 | 0.8 GB |

Configuration

Create ~/.ruvllm/config.json:

{
  "defaultBackend": "metal",
  "modelsDir": "~/.ruvllm/models",
  "cacheDir": "~/.ruvllm/cache",
  "streaming": true,
  "logLevel": "info"
}
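
A script can read the same file before shelling out to the CLI. A minimal sketch, assuming the JSON shape shown above; how the CLI itself expands paths like ~/.ruvllm/models may differ:

import { readFileSync } from 'node:fs';
import { homedir } from 'node:os';
import { join } from 'node:path';

interface RuvllmConfig {
  defaultBackend?: string;
  modelsDir?: string;
  cacheDir?: string;
  streaming?: boolean;
  logLevel?: string;
}

// Read ~/.ruvllm/config.json, treating a missing file as an empty config.
function loadConfig(): RuvllmConfig {
  const path = join(homedir(), '.ruvllm', 'config.json');
  try {
    return JSON.parse(readFileSync(path, 'utf8')) as RuvllmConfig;
  } catch {
    return {};
  }
}

console.log(loadConfig().defaultBackend ?? 'auto');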

Environment Variables

| Variable | Description |
|-------------------|-------------------------------------|
| RUVLLM_MODELS_DIR | Models directory |
| RUVLLM_CACHE_DIR | Cache directory |
| RUVLLM_BACKEND | Default backend |
| RUVLLM_THREADS | CPU threads |
| HF_TOKEN | HuggingFace token for gated models |

License

MIT OR Apache-2.0


Part of the RuVector ecosystem - a high-performance vector database with self-learning capabilities.