# @ruvector/ruvllm-cli

[npm](https://www.npmjs.com/package/@ruvector/ruvllm-cli) · [License](https://github.com/ruvnet/ruvector/blob/main/LICENSE) · [TypeScript](https://www.typescriptlang.org/)

**Command-line interface for local LLM inference and benchmarking** - run AI models on your machine with Metal, CUDA, and CPU acceleration.

## Features

- **Hardware Acceleration** - Metal (macOS), CUDA (NVIDIA), Vulkan, Apple Neural Engine
- **GGUF Support** - Load quantized models (Q4, Q5, Q6, Q8) for efficient inference
- **Interactive Chat** - Terminal-based chat sessions with conversation history
- **Benchmarking** - Measure tokens/second, memory usage, time-to-first-token
- **HTTP Server** - OpenAI-compatible API server for integration
- **Model Management** - Download, list, and manage models from HuggingFace
- **Streaming Output** - Real-time token streaming for responsive UX

## Installation

```bash
# Install globally
npm install -g @ruvector/ruvllm-cli

# Or run directly with npx
npx @ruvector/ruvllm-cli --help
```

For full native performance, install the Rust binary:

```bash
cargo install ruvllm-cli
```

## Quick Start

### Run Inference

```bash
# Basic inference
ruvllm run --model ./llama-7b-q4.gguf --prompt "Explain quantum computing"

# With options
ruvllm run \
  --model ./model.gguf \
  --prompt "Write a haiku about Rust" \
  --temperature 0.8 \
  --max-tokens 100 \
  --backend metal
```

### Interactive Chat

```bash
# Start chat session
ruvllm chat --model ./model.gguf

# With system prompt
ruvllm chat --model ./model.gguf --system "You are a helpful coding assistant"
```

### Benchmark Performance

```bash
# Run benchmark
ruvllm bench --model ./model.gguf --iterations 20

# Compare backends
ruvllm bench --model ./model.gguf --backend metal
ruvllm bench --model ./model.gguf --backend cpu
```

### Start Server

```bash
# OpenAI-compatible API server
ruvllm serve --model ./model.gguf --port 8080

# Then use with any OpenAI client
curl http://localhost:8080/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Hello", "max_tokens": 50}'
```

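Because the server speaks the OpenAI wire format, existing client libraries can be pointed at it directly. Below is a minimal sketch using the official `openai` npm package against the `/v1/completions` route shown above; the `model` value and the ignored API key are assumptions, since the server loads its model at startup.

```typescript
// Minimal sketch: call the local ruvllm server with the `openai` npm client.
// Assumes `ruvllm serve --model ./model.gguf --port 8080` is already running.
import OpenAI from 'openai';

const client = new OpenAI({
  baseURL: 'http://localhost:8080/v1', // point the client at the local server
  apiKey: 'not-needed',                // assumption: a local server ignores the key
});

const completion = await client.completions.create({
  model: 'local', // assumption: the server uses whatever model it was started with
  prompt: 'Hello',
  max_tokens: 50,
});

console.log(completion.choices[0].text);
```
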
### Model Management

```bash
# List available models
ruvllm list

# Download from HuggingFace
ruvllm download TheBloke/Llama-2-7B-GGUF

# Download specific quantization
ruvllm download TheBloke/Llama-2-7B-GGUF --quant q4_k_m
```

## CLI Reference

| Command | Description |
|---------|-------------|
| `run` | Run inference on a prompt |
| `chat` | Interactive chat session |
| `bench` | Benchmark model performance |
| `serve` | Start HTTP server |
| `list` | List downloaded models |
| `download` | Download model from HuggingFace |

### Global Options

| Option | Description | Default |
|--------|-------------|---------|
| `--model, -m` | Path to GGUF model file | - |
| `--backend, -b` | Acceleration backend (metal, cuda, cpu) | auto |
| `--threads, -t` | Number of CPU threads | auto |
| `--gpu-layers` | Layers to offload to GPU | all |
| `--context-size` | Context window size (tokens) | 2048 |
| `--verbose, -v` | Enable verbose logging | false |

### Generation Options

| Option | Description | Default |
|--------|-------------|---------|
| `--temperature` | Sampling temperature (0-2) | 0.7 |
| `--top-p` | Nucleus sampling threshold | 0.9 |
| `--top-k` | Top-k sampling | 40 |
| `--max-tokens` | Maximum tokens to generate | 256 |
| `--repeat-penalty` | Repetition penalty | 1.1 |

## Programmatic Usage

```typescript
import {
  parseArgs,
  formatBenchmarkTable,
  getAvailableBackends,
  ModelConfig,
  BenchmarkResult,
} from '@ruvector/ruvllm-cli';

// Parse CLI arguments
const args = parseArgs(['--model', './model.gguf', '--temperature', '0.8']);
console.log(args); // { model: './model.gguf', temperature: '0.8' }

// Check available backends
const backends = getAvailableBackends();
console.log('Available:', backends); // ['cpu', 'metal'] on macOS

// Format benchmark results
const results: BenchmarkResult[] = [
  {
    model: 'llama-7b',
    backend: 'metal',
    promptTokens: 50,
    generatedTokens: 100,
    promptTime: 120,
    generationTime: 2500,
    promptTPS: 416.7,
    generationTPS: 40.0,
    memoryUsage: 4200,
    peakMemory: 4800,
  },
];

console.log(formatBenchmarkTable(results));
```

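Note that `parseArgs` returns option values as raw strings (`temperature: '0.8'` in the example above), so numeric options need an explicit `Number(...)` conversion before use.
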
## Performance

Benchmarks on an Apple M2 Pro with Q4_K_M quantization:

| Model | Prompt TPS | Gen TPS | Memory |
|-------|------------|---------|--------|
| Llama-2-7B | 450 | 42 | 4.2 GB |
| Mistral-7B | 480 | 45 | 4.1 GB |
| Phi-2 | 820 | 85 | 1.8 GB |
| TinyLlama-1.1B | 1200 | 120 | 0.8 GB |

## Configuration

Create `~/.ruvllm/config.json`:

```json
{
  "defaultBackend": "metal",
  "modelsDir": "~/.ruvllm/models",
  "cacheDir": "~/.ruvllm/cache",
  "streaming": true,
  "logLevel": "info"
}
```

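For scripts that want to honor the same file, here is a minimal Node sketch for reading it; the CLI's own loader may behave differently, and the `~` expansion below is an assumption, since JSON has no shell expansion.

```typescript
// Minimal sketch: read ~/.ruvllm/config.json from a Node script.
// The interface mirrors the keys shown above; the CLI's own loader may differ.
import { readFileSync } from 'node:fs';
import { homedir } from 'node:os';
import { join } from 'node:path';

interface RuvllmConfig {
  defaultBackend?: string;
  modelsDir?: string;
  cacheDir?: string;
  streaming?: boolean;
  logLevel?: string;
}

const configPath = join(homedir(), '.ruvllm', 'config.json');
const config: RuvllmConfig = JSON.parse(readFileSync(configPath, 'utf8'));

// Expand a leading "~/" in path values (assumption: the CLI does the same).
const expand = (p: string) => (p.startsWith('~/') ? join(homedir(), p.slice(2)) : p);
console.log('models dir:', expand(config.modelsDir ?? '~/.ruvllm/models'));
```
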
## Environment Variables

| Variable | Description |
|----------|-------------|
| `RUVLLM_MODELS_DIR` | Models directory |
| `RUVLLM_CACHE_DIR` | Cache directory |
| `RUVLLM_BACKEND` | Default backend |
| `RUVLLM_THREADS` | CPU threads |
| `HF_TOKEN` | HuggingFace token for gated models |

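As a rough sketch of how these map onto code, a Node process reads them via `process.env`; the fallbacks below come from the defaults tables above, and the precedence between environment variables, the config file, and CLI flags is an assumption.

```typescript
// Illustrative only: read the documented variables, falling back to documented defaults.
const backend = process.env.RUVLLM_BACKEND ?? 'auto';
const threads = process.env.RUVLLM_THREADS
  ? Number(process.env.RUVLLM_THREADS)
  : undefined; // undefined = auto-detect
const hfToken = process.env.HF_TOKEN; // only needed for gated HuggingFace models
console.log({ backend, threads, hasToken: Boolean(hfToken) });
```
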
## Related Packages

- [@ruvector/ruvllm](https://www.npmjs.com/package/@ruvector/ruvllm) - LLM orchestration library
- [@ruvector/ruvllm-wasm](https://www.npmjs.com/package/@ruvector/ruvllm-wasm) - Browser LLM inference
- [ruvector](https://www.npmjs.com/package/ruvector) - All-in-one vector database

## Documentation

- [RuvLLM Documentation](https://github.com/ruvnet/ruvector/tree/main/crates/ruvllm)
- [CLI Crate](https://github.com/ruvnet/ruvector/tree/main/crates/ruvllm-cli)
- [API Reference](https://docs.rs/ruvllm-cli)

## License

MIT OR Apache-2.0

---

**Part of the [RuVector](https://github.com/ruvnet/ruvector) ecosystem** - a high-performance vector database with self-learning capabilities.