@ruvector/ruvllm-cli


Command-line interface for local LLM inference and benchmarking - run AI models on your machine with Metal, CUDA, and CPU acceleration.

Features

  • Hardware Acceleration - Metal (macOS), CUDA (NVIDIA), Vulkan, Apple Neural Engine
  • GGUF Support - Load quantized models (Q4, Q5, Q6, Q8) for efficient inference
  • Interactive Chat - Terminal-based chat sessions with conversation history
  • Benchmarking - Measure tokens/second, memory usage, and time-to-first-token
  • HTTP Server - OpenAI-compatible API server for integration
  • Model Management - Download, list, and manage models from HuggingFace
  • Streaming Output - Real-time token streaming for responsive UX

Installation

# Install globally
npm install -g @ruvector/ruvllm-cli

# Or run directly with npx
npx @ruvector/ruvllm-cli --help

For full native performance, install the Rust binary:

cargo install ruvllm-cli

Quick Start

Run Inference

# Basic inference
ruvllm run --model ./llama-7b-q4.gguf --prompt "Explain quantum computing"

# With options
ruvllm run \
  --model ./model.gguf \
  --prompt "Write a haiku about Rust" \
  --temperature 0.8 \
  --max-tokens 100 \
  --backend metal

Interactive Chat

# Start chat session
ruvllm chat --model ./model.gguf

# With system prompt
ruvllm chat --model ./model.gguf --system "You are a helpful coding assistant"

Benchmark Performance

# Run benchmark
ruvllm bench --model ./model.gguf --iterations 20

# Compare backends
ruvllm bench --model ./model.gguf --backend metal
ruvllm bench --model ./model.gguf --backend cpu

Start Server

# OpenAI-compatible API server
ruvllm serve --model ./model.gguf --port 8080

# Then use with any OpenAI client
curl http://localhost:8080/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Hello", "max_tokens": 50}'

Model Management

# List available models
ruvllm list

# Download from HuggingFace
ruvllm download TheBloke/Llama-2-7B-GGUF

# Download specific quantization
ruvllm download TheBloke/Llama-2-7B-GGUF --quant q4_k_m

CLI Reference

| Command | Description |
|----------|--------------------------------|
| run | Run inference on a prompt |
| chat | Interactive chat session |
| bench | Benchmark model performance |
| serve | Start HTTP server |
| list | List downloaded models |
| download | Download model from HuggingFace |

Global Options

| Option | Description | Default |
|----------------|------------------------------------------|---------|
| --model, -m | Path to GGUF model file | - |
| --backend, -b | Acceleration backend (metal, cuda, cpu) | auto |
| --threads, -t | Number of CPU threads | auto |
| --gpu-layers | Layers to offload to GPU | all |
| --context-size | Context window size | 2048 |
| --verbose, -v | Enable verbose logging | false |

Generation Options

| Option | Description | Default |
|------------------|-----------------------------|---------|
| --temperature | Sampling temperature (0-2) | 0.7 |
| --top-p | Nucleus sampling threshold | 0.9 |
| --top-k | Top-k sampling | 40 |
| --max-tokens | Maximum tokens to generate | 256 |
| --repeat-penalty | Repetition penalty | 1.1 |
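
These flags can also be driven from a script by spawning the CLI. A hedged sketch using Node's child_process; it assumes only that ruvllm is on the PATH, and the flags are the ones documented above:

import { spawn } from 'node:child_process';

// Spawn `ruvllm run` with generation options and stream its output.
const proc = spawn('ruvllm', [
  'run',
  '--model', './model.gguf',
  '--prompt', 'Write a haiku about Rust',
  '--temperature', '0.8',
  '--max-tokens', '100',
]);

proc.stdout.on('data', (chunk) => process.stdout.write(chunk));
proc.on('close', (code) => {
  if (code !== 0) console.error(`ruvllm exited with code ${code}`);
});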

Programmatic Usage

import {
  parseArgs,
  formatBenchmarkTable,
  getAvailableBackends,
  ModelConfig,
  BenchmarkResult,
} from '@ruvector/ruvllm-cli';

// Parse CLI arguments
const args = parseArgs(['--model', './model.gguf', '--temperature', '0.8']);
console.log(args); // { model: './model.gguf', temperature: '0.8' }

// Check available backends
const backends = getAvailableBackends();
console.log('Available:', backends); // ['cpu', 'metal'] on macOS

// Format benchmark results
const results: BenchmarkResult[] = [
  {
    model: 'llama-7b',
    backend: 'metal',
    promptTokens: 50,
    generatedTokens: 100,
    promptTime: 120,
    generationTime: 2500,
    promptTPS: 416.7,
    generationTPS: 40.0,
    memoryUsage: 4200,
    peakMemory: 4800,
  },
];

console.log(formatBenchmarkTable(results));
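
In a real script, the raw argv can be forwarded directly; a one-line sketch, assuming parseArgs accepts any flag-style string array as shown above:

// Skip the node binary and script path, then parse the remaining flags.
const liveArgs = parseArgs(process.argv.slice(2));
console.log(liveArgs);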

Performance

Benchmarks on Apple M2 Pro with Q4_K_M quantization:

| Model | Prompt TPS | Gen TPS | Memory |
|----------------|------------|---------|--------|
| Llama-2-7B | 450 | 42 | 4.2 GB |
| Mistral-7B | 480 | 45 | 4.1 GB |
| Phi-2 | 820 | 85 | 1.8 GB |
| TinyLlama-1.1B | 1200 | 120 | 0.8 GB |

Configuration

Create ~/.ruvllm/config.json:

{
  "defaultBackend": "metal",
  "modelsDir": "~/.ruvllm/models",
  "cacheDir": "~/.ruvllm/cache",
  "streaming": true,
  "logLevel": "info"
}
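
A script can read the same file before shelling out to the CLI. A minimal sketch, assuming the JSON shape shown above; how the CLI itself expands paths like ~/.ruvllm/models may differ:

import { readFileSync } from 'node:fs';
import { homedir } from 'node:os';
import { join } from 'node:path';

interface RuvllmConfig {
  defaultBackend?: string;
  modelsDir?: string;
  cacheDir?: string;
  streaming?: boolean;
  logLevel?: string;
}

// Read ~/.ruvllm/config.json, treating a missing file as an empty config.
function loadConfig(): RuvllmConfig {
  const path = join(homedir(), '.ruvllm', 'config.json');
  try {
    return JSON.parse(readFileSync(path, 'utf8')) as RuvllmConfig;
  } catch {
    return {};
  }
}

console.log(loadConfig().defaultBackend ?? 'auto');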

Environment Variables

| Variable | Description |
|-------------------|-------------------------------------|
| RUVLLM_MODELS_DIR | Models directory |
| RUVLLM_CACHE_DIR | Cache directory |
| RUVLLM_BACKEND | Default backend |
| RUVLLM_THREADS | CPU threads |
| HF_TOKEN | HuggingFace token for gated models |

License

MIT OR Apache-2.0


Part of the RuVector ecosystem - a high-performance vector database with self-learning capabilities.