# RuvLLM CLI

Command-line interface for RuvLLM inference, optimized for Apple Silicon.

## Installation

```bash
# From crates.io
cargo install ruvllm-cli

# From source (with Metal acceleration)
cargo install --path . --features metal
```

## Commands

### Download Models

Download models from the HuggingFace Hub:

```bash
# Download Qwen with Q4K quantization (default)
ruvllm download qwen

# Download with specific quantization
ruvllm download qwen --quantization q8
ruvllm download mistral --quantization f16

# Force re-download
ruvllm download phi --force

# Download specific revision
ruvllm download llama --revision main
```

#### Model Aliases

| Alias | Model ID |
|-------|----------|
| `qwen` | `Qwen/Qwen2.5-7B-Instruct` |
| `mistral` | `mistralai/Mistral-7B-Instruct-v0.3` |
| `phi` | `microsoft/Phi-3-medium-4k-instruct` |
| `llama` | `meta-llama/Meta-Llama-3.1-8B-Instruct` |

#### Quantization Options

| Option | Description | Memory Savings |
|--------|-------------|----------------|
| `q4k` | 4-bit quantization (default) | ~75% |
| `q8` | 8-bit quantization | ~50% |
| `f16` | Half precision | ~50% |
| `none` | Full precision | 0% |
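
As a rough guide, these savings follow from bits per weight. A back-of-envelope sizing sketch for a 7B-parameter model (illustrative Python, not RuvLLM internals; `q4k` is assumed here to average ~4.5 bits per weight as in common K-quant schemes, and real model files add overhead for metadata and embeddings):

```python
def model_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GB, ignoring file overhead."""
    return n_params * bits_per_weight / 8 / 1e9

params = 7e9  # e.g. the `qwen` alias (Qwen2.5-7B)
for name, bits in [("f16", 16), ("q8", 8), ("q4k", 4.5)]:
    print(f"{name}: ~{model_size_gb(params, bits):.1f} GB")
```

Relative to 16-bit weights, `q8` roughly halves the footprint and `q4k` cuts it by about three quarters, matching the table above.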

### List Models

```bash
# List all available models
ruvllm list

# List only downloaded models
ruvllm list --downloaded

# Detailed listing with sizes
ruvllm list --long
```


### Model Information

```bash
# Show model details
ruvllm info qwen

# Output includes:
# - Model architecture
# - Parameter count
# - Download status
# - Disk usage
# - Supported features
```


### Interactive Chat

```bash
# Start chat with default settings
ruvllm chat qwen

# With custom system prompt
ruvllm chat qwen --system "You are a helpful coding assistant."

# Adjust generation parameters
ruvllm chat qwen --temperature 0.5 --max-tokens 1024

# Use specific quantization
ruvllm chat qwen --quantization q8
```


#### Chat Commands

During chat, use these commands:

| Command | Description |
|---------|-------------|
| `/help` | Show available commands |
| `/clear` | Clear conversation history |
| `/system <prompt>` | Change system prompt |
| `/temp <value>` | Change temperature |
| `/quit` or `/exit` | Exit chat |


### Start Server

Start an OpenAI-compatible inference server:

```bash
# Start with defaults
ruvllm serve qwen

# Custom host and port
ruvllm serve qwen --host 0.0.0.0 --port 8080

# Configure concurrency
ruvllm serve qwen --max-concurrent 8 --max-context 8192
```


#### API Endpoints

| Endpoint | Method | Description |
|----------|--------|-------------|
| `/v1/chat/completions` | POST | Chat completions |
| `/v1/completions` | POST | Text completions |
| `/v1/models` | GET | List models |
| `/health` | GET | Health check |


#### Example Request

```bash
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen",
    "messages": [
      {"role": "user", "content": "Hello!"}
    ],
    "max_tokens": 256
  }'
```
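
Because the server speaks the OpenAI chat-completions schema, any HTTP client works. A minimal Python sketch using only the standard library (the helper names and the localhost URL are illustrative; a running `ruvllm serve` instance is required before calling `chat`):

```python
import json
from urllib import request

BASE_URL = "http://localhost:8080"  # assumed default from `ruvllm serve`

def build_payload(model, messages, max_tokens=256):
    """Encode an OpenAI-style chat-completions request body."""
    return json.dumps(
        {"model": model, "messages": messages, "max_tokens": max_tokens}
    ).encode()

def chat(model, messages, max_tokens=256):
    """POST a chat request and return the assistant's reply text.

    Requires a live server at BASE_URL.
    """
    req = request.Request(
        f"{BASE_URL}/v1/chat/completions",
        data=build_payload(model, messages, max_tokens),
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

# chat("qwen", [{"role": "user", "content": "Hello!"}])  # needs a live server
```

The same payload shape should also work with the official OpenAI SDKs by pointing their `base_url` at the local server.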

### Run Benchmarks

```bash
# Basic benchmark
ruvllm benchmark qwen

# Configure benchmark
ruvllm benchmark qwen \
  --warmup 5 \
  --iterations 20 \
  --prompt-length 256 \
  --gen-length 128

# Output formats
ruvllm benchmark qwen --format json
ruvllm benchmark qwen --format csv
```


#### Benchmark Metrics

- **Prefill Latency**: Time to process input prompt
- **Decode Throughput**: Tokens per second during generation
- **Time to First Token (TTFT)**: Latency before first output token
- **Memory Usage**: Peak GPU/RAM consumption
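
These metrics are simple functions of wall-clock timings and token counts. A sketch of the arithmetic (illustrative numbers, not RuvLLM output; TTFT is approximated here by the prefill time):

```python
def benchmark_metrics(prompt_tokens, gen_tokens, prefill_s, decode_s):
    """Derive throughput metrics from raw timings.

    prefill_s: seconds spent processing the prompt
    decode_s:  seconds spent generating gen_tokens
    """
    return {
        "prefill_latency_s": prefill_s,
        "ttft_s": prefill_s,  # first token arrives roughly when prefill ends
        "decode_tok_per_s": gen_tokens / decode_s,
        "prompt_tok_per_s": prompt_tokens / prefill_s,
    }

m = benchmark_metrics(prompt_tokens=256, gen_tokens=128,
                      prefill_s=0.4, decode_s=2.0)
print(m["decode_tok_per_s"])  # 64.0
```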

## Global Options

```bash
# Enable verbose logging
ruvllm --verbose <command>

# Disable colored output
ruvllm --no-color <command>

# Custom cache directory
ruvllm --cache-dir /path/to/cache <command>

# Or via environment variable
export RUVLLM_CACHE_DIR=/path/to/cache
```


## Configuration

### Cache Directory

Models are cached in:

- **macOS**: `~/Library/Caches/ruvllm`
- **Linux**: `~/.cache/ruvllm`
- **Windows**: `%LOCALAPPDATA%\ruvllm`

Override with `--cache-dir` or `RUVLLM_CACHE_DIR`.

### Logging

Set the log level with `RUST_LOG`:

```bash
RUST_LOG=debug ruvllm chat qwen
RUST_LOG=ruvllm=trace ruvllm serve qwen
```


## Examples

### Basic Workflow

```bash
# 1. Download a model
ruvllm download qwen

# 2. Verify it's downloaded
ruvllm list --downloaded

# 3. Start chatting
ruvllm chat qwen
```


### Server Deployment

```bash
# Download model first
ruvllm download qwen --quantization q4k

# Start server with production settings
ruvllm serve qwen \
  --host 0.0.0.0 \
  --port 8080 \
  --max-concurrent 16 \
  --max-context 4096 \
  --quantization q4k
```


### Performance Testing

```bash
# Run comprehensive benchmarks
ruvllm benchmark qwen \
  --warmup 10 \
  --iterations 50 \
  --prompt-length 512 \
  --gen-length 256 \
  --format json > benchmark_results.json
```


## Troubleshooting

### Out of Memory

```bash
# Use smaller quantization
ruvllm chat qwen --quantization q4k

# Or reduce context length
ruvllm serve qwen --max-context 2048
```

### Slow Download

```bash
# Resume interrupted download
ruvllm download qwen

# Force fresh download
ruvllm download qwen --force
```


### Metal Issues (macOS)

Ensure Metal is available:

```bash
# Check Metal device
system_profiler SPDisplaysDataType | grep Metal

# Try with CPU fallback
RUVLLM_NO_METAL=1 ruvllm chat qwen
```


## Feature Flags

Build with specific features:

```bash
# Metal acceleration (macOS)
cargo install ruvllm-cli --features metal

# CUDA acceleration (NVIDIA)
cargo install ruvllm-cli --features cuda

# Both (if available)
cargo install ruvllm-cli --features "metal,cuda"
```


## License

Apache-2.0 / MIT dual license.