# @ruvector/ruvllm-cli
[![npm version](https://img.shields.io/npm/v/@ruvector/ruvllm-cli.svg)](https://www.npmjs.com/package/@ruvector/ruvllm-cli)
[![npm downloads](https://img.shields.io/npm/dt/@ruvector/ruvllm-cli.svg)](https://www.npmjs.com/package/@ruvector/ruvllm-cli)
[![npm downloads/month](https://img.shields.io/npm/dm/@ruvector/ruvllm-cli.svg)](https://www.npmjs.com/package/@ruvector/ruvllm-cli)
[![License](https://img.shields.io/npm/l/@ruvector/ruvllm-cli.svg)](https://github.com/ruvnet/ruvector/blob/main/LICENSE)
[![TypeScript](https://img.shields.io/badge/TypeScript-5.0-blue.svg)](https://www.typescriptlang.org/)
**Command-line interface for local LLM inference and benchmarking** - run AI models on your machine with Metal, CUDA, and CPU acceleration.
## Features
- **Hardware Acceleration** - Metal (macOS), CUDA (NVIDIA), Vulkan, Apple Neural Engine
- **GGUF Support** - Load quantized models (Q4, Q5, Q6, Q8) for efficient inference
- **Interactive Chat** - Terminal-based chat sessions with conversation history
- **Benchmarking** - Measure tokens/second, memory usage, time-to-first-token
- **HTTP Server** - OpenAI-compatible API server for integration
- **Model Management** - Download, list, and manage models from HuggingFace
- **Streaming Output** - Real-time token streaming for responsive UX
## Installation
```bash
# Install globally
npm install -g @ruvector/ruvllm-cli

# Or run directly with npx
npx @ruvector/ruvllm-cli --help
```
For full native performance, install the Rust binary:
```bash
cargo install ruvllm-cli
```
## Quick Start
### Run Inference
```bash
# Basic inference
ruvllm run --model ./llama-7b-q4.gguf --prompt "Explain quantum computing"

# With options
ruvllm run \
  --model ./model.gguf \
  --prompt "Write a haiku about Rust" \
  --temperature 0.8 \
  --max-tokens 100 \
  --backend metal
```
### Interactive Chat
```bash
# Start chat session
ruvllm chat --model ./model.gguf

# With system prompt
ruvllm chat --model ./model.gguf --system "You are a helpful coding assistant"
```
### Benchmark Performance
```bash
# Run benchmark
ruvllm bench --model ./model.gguf --iterations 20

# Compare backends
ruvllm bench --model ./model.gguf --backend metal
ruvllm bench --model ./model.gguf --backend cpu
```
### Start Server
```bash
# OpenAI-compatible API server
ruvllm serve --model ./model.gguf --port 8080

# Then use with any OpenAI client
curl http://localhost:8080/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Hello", "max_tokens": 50}'
```
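
Because the server speaks the OpenAI completions wire format, any OpenAI-compatible client or plain HTTP request works. The TypeScript sketch below calls the same endpoint as the curl example above; note that the response shape (`choices[].text`) follows the standard OpenAI completions schema and is an assumption here, since this README only shows the request side.

```typescript
// Minimal client sketch for the endpoint above (Node 18+ for global fetch).
// Assumption: the response follows the standard OpenAI completions shape,
// { choices: [{ text: string }] }; the README confirms only the request fields.
async function complete(prompt: string, maxTokens = 50): Promise<string> {
  const res = await fetch('http://localhost:8080/v1/completions', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ prompt, max_tokens: maxTokens }),
  });
  if (!res.ok) throw new Error(`Server returned ${res.status}`);
  const data = (await res.json()) as { choices: Array<{ text: string }> };
  return data.choices[0]?.text ?? '';
}

complete('Hello').then(console.log).catch(console.error);
```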
### Model Management
```bash
# List available models
ruvllm list

# Download from HuggingFace
ruvllm download TheBloke/Llama-2-7B-GGUF

# Download specific quantization
ruvllm download TheBloke/Llama-2-7B-GGUF --quant q4_k_m
```
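
Downloaded models land in the models directory (`~/.ruvllm/models` in the sample configuration below, overridable via `RUVLLM_MODELS_DIR`); gated repositories additionally require `HF_TOKEN` (see Environment Variables).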
## CLI Reference
| Command | Description |
|---------|-------------|
| `run` | Run inference on a prompt |
| `chat` | Interactive chat session |
| `bench` | Benchmark model performance |
| `serve` | Start HTTP server |
| `list` | List downloaded models |
| `download` | Download model from HuggingFace |
### Global Options
| Option | Description | Default |
|--------|-------------|---------|
| `--model, -m` | Path to GGUF model file | - |
| `--backend, -b` | Acceleration backend (metal, cuda, cpu) | auto |
| `--threads, -t` | Number of CPU threads | auto |
| `--gpu-layers` | Layers to offload to GPU | all |
| `--context-size` | Context window size | 2048 |
| `--verbose, -v` | Enable verbose logging | false |
### Generation Options
| Option | Description | Default |
|--------|-------------|---------|
| `--temperature` | Sampling temperature (0-2) | 0.7 |
| `--top-p` | Nucleus sampling threshold | 0.9 |
| `--top-k` | Top-k sampling | 40 |
| `--max-tokens` | Maximum tokens to generate | 256 |
| `--repeat-penalty` | Repetition penalty | 1.1 |
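
For intuition about how these options interact, the sketch below implements the conventional sampling pipeline: logits are divided by `temperature`, the candidate set is cut to the `top-k` most likely tokens, then trimmed to the smallest prefix whose cumulative probability reaches `top-p`, and a token is drawn from what remains. This is a generic illustration of the standard algorithm, not ruvllm's internal sampler.

```typescript
// Generic temperature / top-k / top-p (nucleus) sampling, for intuition only.
// Not ruvllm's actual implementation, which this README does not show.
function sampleToken(logits: number[], temperature = 0.7, topK = 40, topP = 0.9): number {
  // Temperature scaling: lower values sharpen the distribution.
  const scaled = logits.map((l) => l / temperature);

  // Softmax (shifted by the max for numerical stability).
  const max = Math.max(...scaled);
  const exps = scaled.map((l) => Math.exp(l - max));
  const sum = exps.reduce((a, b) => a + b, 0);

  // Sort by probability and apply the top-k cut.
  let candidates = exps
    .map((e, i) => ({ token: i, p: e / sum }))
    .sort((a, b) => b.p - a.p)
    .slice(0, topK);

  // Nucleus cut: keep the smallest prefix reaching cumulative probability top-p.
  let cum = 0;
  candidates = candidates.filter((c) => (cum += c.p) - c.p < topP);

  // Renormalize over the survivors and draw one token.
  const total = candidates.reduce((a, c) => a + c.p, 0);
  let r = Math.random() * total;
  for (const c of candidates) {
    r -= c.p;
    if (r <= 0) return c.token;
  }
  return candidates[candidates.length - 1].token;
}
```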
## Programmatic Usage
```typescript
import {
  parseArgs,
  formatBenchmarkTable,
  getAvailableBackends,
  ModelConfig,
  BenchmarkResult,
} from '@ruvector/ruvllm-cli';

// Parse CLI arguments
const args = parseArgs(['--model', './model.gguf', '--temperature', '0.8']);
console.log(args); // { model: './model.gguf', temperature: '0.8' }

// Check available backends
const backends = getAvailableBackends();
console.log('Available:', backends); // ['cpu', 'metal'] on macOS

// Format benchmark results
const results: BenchmarkResult[] = [
  {
    model: 'llama-7b',
    backend: 'metal',
    promptTokens: 50,
    generatedTokens: 100,
    promptTime: 120,
    generationTime: 2500,
    promptTPS: 416.7,
    generationTPS: 40.0,
    memoryUsage: 4200,
    peakMemory: 4800,
  },
];

console.log(formatBenchmarkTable(results));
```
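
Reading the `BenchmarkResult` fields in the example: the numbers are self-consistent if times are in milliseconds and memory in MB, since 100 generated tokens over 2500 ms gives exactly the reported 40.0 generation TPS (and 4200 MB matches the 4.2 GB Llama-2-7B figure in the table below). The units themselves are an inference from the example values, not a documented contract.

```typescript
// Unit sanity check against the example values (an inference, not documented):
const generatedTokens = 100;
const generationTimeMs = 2500;
console.log(generatedTokens / (generationTimeMs / 1000)); // 40 tokens/sec = generationTPS
```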
## Performance
Benchmarks on an Apple M2 Pro with Q4_K_M quantization (TPS = tokens per second):
| Model | Prompt TPS | Gen TPS | Memory |
|-------|------------|---------|--------|
| Llama-2-7B | 450 | 42 | 4.2 GB |
| Mistral-7B | 480 | 45 | 4.1 GB |
| Phi-2 | 820 | 85 | 1.8 GB |
| TinyLlama-1.1B | 1200 | 120 | 0.8 GB |
## Configuration
Create `~/.ruvllm/config.json`:
```json
{
  "defaultBackend": "metal",
  "modelsDir": "~/.ruvllm/models",
  "cacheDir": "~/.ruvllm/cache",
  "streaming": true,
  "logLevel": "info"
}
```
## Environment Variables
| Variable | Description |
|----------|-------------|
| `RUVLLM_MODELS_DIR` | Models directory |
| `RUVLLM_CACHE_DIR` | Cache directory |
| `RUVLLM_BACKEND` | Default backend |
| `RUVLLM_THREADS` | CPU threads |
| `HF_TOKEN` | HuggingFace token for gated models |
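
Where both a config value and an environment variable are set, scripts that wrap the CLI can apply the conventional env-over-file precedence, as in the hypothetical resolver below; the CLI's own precedence rules are not documented in this README.

```typescript
import { readFileSync } from 'node:fs';
import { homedir } from 'node:os';
import { join } from 'node:path';

// Hypothetical helper for tooling that wraps ruvllm. The precedence
// (RUVLLM_MODELS_DIR > config.json modelsDir > built-in default) is the
// common convention, not a documented guarantee of the CLI itself.
function resolveModelsDir(): string {
  if (process.env.RUVLLM_MODELS_DIR) return process.env.RUVLLM_MODELS_DIR;
  try {
    const cfg = JSON.parse(readFileSync(join(homedir(), '.ruvllm', 'config.json'), 'utf8'));
    if (typeof cfg.modelsDir === 'string') return cfg.modelsDir;
  } catch {
    // No readable config file; fall through to the default.
  }
  return join(homedir(), '.ruvllm', 'models');
}

console.log(resolveModelsDir());
```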
## Related Packages
- [@ruvector/ruvllm](https://www.npmjs.com/package/@ruvector/ruvllm) - LLM orchestration library
- [@ruvector/ruvllm-wasm](https://www.npmjs.com/package/@ruvector/ruvllm-wasm) - Browser LLM inference
- [ruvector](https://www.npmjs.com/package/ruvector) - All-in-one vector database
## Documentation
- [RuvLLM Documentation](https://github.com/ruvnet/ruvector/tree/main/crates/ruvllm)
- [CLI Crate](https://github.com/ruvnet/ruvector/tree/main/crates/ruvllm-cli)
- [API Reference](https://docs.rs/ruvllm-cli)
## License
MIT OR Apache-2.0
---
**Part of the [RuVector](https://github.com/ruvnet/ruvector) ecosystem** - High-performance vector database with self-learning capabilities.