Squashed 'vendor/ruvector/' content from commit b64c2172

git-subtree-dir: vendor/ruvector
git-subtree-split: b64c21726f2bb37286d9ee36a7869fef60cc6900
2026-02-28 14:39:40 -05:00
commit d803bfe2b1
7854 changed files with 3522914 additions and 0 deletions
--- a/crates/ruvllm-cli/README.md
+++ b/crates/ruvllm-cli/README.md
@@ -0,0 +1,302 @@
+# RuvLLM CLI
+
+Command-line interface for RuvLLM inference, optimized for Apple Silicon.
+
+## Installation
+
+```bash
+# From crates.io
+cargo install ruvllm-cli
+
+# From source (with Metal acceleration)
+cargo install --path . --features metal
+```
+
+## Commands
+
+### Download Models
+
+Download models from HuggingFace Hub:
+
+```bash
+# Download Qwen with Q4K quantization (default)
+ruvllm download qwen
+
+# Download with specific quantization
+ruvllm download qwen --quantization q8
+ruvllm download mistral --quantization f16
+
+# Force re-download
+ruvllm download phi --force
+
+# Download specific revision
+ruvllm download llama --revision main
+```
+
+#### Model Aliases
+
+| Alias | Model ID |
+|-------|----------|
+| `qwen` | `Qwen/Qwen2.5-7B-Instruct` |
+| `mistral` | `mistralai/Mistral-7B-Instruct-v0.3` |
+| `phi` | `microsoft/Phi-3-medium-4k-instruct` |
+| `llama` | `meta-llama/Meta-Llama-3.1-8B-Instruct` |
+
+#### Quantization Options
+
+| Option | Description | Memory Savings (vs. full precision) |
+|--------|-------------|-------------------------------------|
+| `q4k` | 4-bit quantization (default) | ~87% |
+| `q8` | 8-bit quantization | ~75% |
+| `f16` | Half precision (16-bit) | ~50% |
+| `none` | Full precision (32-bit) | 0% |
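+
+Weight memory scales with bits per weight. As a rough sketch (not part of the CLI), the helper below estimates weight storage from a parameter count and a nominal bit width. The function name `weights_mb` and the round figure of 7 billion parameters are assumptions for illustration. Real quantized files also store per-block scales, and runtime memory adds the KV cache and activations, so actual usage will be somewhat higher.
+
+```bash
+# Estimate weight storage in decimal megabytes: params * bits / 8 bytes.
+# Hypothetical helper for illustration; not a ruvllm command.
+weights_mb() {
+  local params=$1 bits=$2
+  echo $(( params * bits / 8 / 1000000 ))
+}
+
+# Assumed round figure of 7 billion parameters.
+for bits in 32 16 8 4; do
+  printf '%2d-bit: %6d MB\n' "$bits" "$(weights_mb 7000000000 "$bits")"
+done
+```
+
+Under these assumptions the loop reports 28000, 14000, 7000, and 3500 MB, so each halving of the bit width halves the weight footprint.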
+
+### List Models
+
+```bash
+# List all available models
+ruvllm list
+
+# List only downloaded models
+ruvllm list --downloaded
+
+# Detailed listing with sizes
+ruvllm list --long
+```
+
+### Model Information
+
+```bash
+# Show model details
+ruvllm info qwen
+
+# Output includes:
+# - Model architecture
+# - Parameter count
+# - Download status
+# - Disk usage
+# - Supported features
+```
+
+### Interactive Chat
+
+```bash
+# Start chat with default settings
+ruvllm chat qwen
+
+# With custom system prompt
+ruvllm chat qwen --system "You are a helpful coding assistant."
+
+# Adjust generation parameters
+ruvllm chat qwen --temperature 0.5 --max-tokens 1024
+
+# Use specific quantization
+ruvllm chat qwen --quantization q8
+```
+
+#### Chat Commands
+
+During chat, use these commands:
+
+| Command | Description |
+|---------|-------------|
+| `/help` | Show available commands |
+| `/clear` | Clear conversation history |
+| `/system <prompt>` | Change system prompt |
+| `/temp <value>` | Change temperature |
+| `/quit` or `/exit` | Exit chat |
+
+### Start Server
+
+Start an OpenAI-compatible inference server:
+
+```bash
+# Start with defaults
+ruvllm serve qwen
+
+# Custom host and port
+ruvllm serve qwen --host 0.0.0.0 --port 8080
+
+# Configure concurrency
+ruvllm serve qwen --max-concurrent 8 --max-context 8192
+```
+
+#### API Endpoints
+
+| Endpoint | Method | Description |
+|----------|--------|-------------|
+| `/v1/chat/completions` | POST | Chat completions |
+| `/v1/completions` | POST | Text completions |
+| `/v1/models` | GET | List models |
+| `/health` | GET | Health check |
+
+#### Example Request
+
+```bash
+curl http://localhost:8080/v1/chat/completions \
+  -H "Content-Type: application/json" \
+  -d '{
+    "model": "qwen",
+    "messages": [
+      {"role": "user", "content": "Hello!"}
+    ],
+    "max_tokens": 256
+  }'
+```
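+
+For scripting, the same request can be wrapped in small shell functions. This is a minimal sketch, not something shipped with the CLI: the names `json_escape`, `chat_payload`, and `ask` are hypothetical, the server is assumed to listen on `localhost:8080` as in the request above, `jq` is assumed to be installed, and the reply is assumed to follow the usual OpenAI response shape (`choices[0].message.content`).
+
+```bash
+# Escape backslashes and double quotes so text is safe inside a JSON string.
+# Minimal on purpose: newlines and control characters are not handled.
+json_escape() {
+  printf '%s' "$1" | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g'
+}
+
+# Build a chat-completions request body for a single user message.
+# Arguments: model alias, message text, optional max_tokens (default 256).
+chat_payload() {
+  local model=$1 content=$2 max_tokens=${3:-256}
+  printf '{"model":"%s","messages":[{"role":"user","content":"%s"}],"max_tokens":%d}' \
+    "$model" "$(json_escape "$content")" "$max_tokens"
+}
+
+# Send one message and print the reply text.
+# Assumes localhost:8080, jq on PATH, and an OpenAI-style response body.
+ask() {
+  curl -sS http://localhost:8080/v1/chat/completions \
+    -H "Content-Type: application/json" \
+    -d "$(chat_payload qwen "$1")" |
+    jq -r '.choices[0].message.content'
+}
+```
+
+With a server running, `ask "Hello!"` would print only the assistant's reply. Keeping the payload builder separate from the network call makes it easy to inspect the JSON before sending it.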
+
+### Run Benchmarks
+
+```bash
+# Basic benchmark
+ruvllm benchmark qwen
+
+# Configure benchmark
+ruvllm benchmark qwen \
+  --warmup 5 \
+  --iterations 20 \
+  --prompt-length 256 \
+  --gen-length 128
+
+# Output formats
+ruvllm benchmark qwen --format json
+ruvllm benchmark qwen --format csv
+```
+
+#### Benchmark Metrics
+
+- **Prefill Latency**: Time to process input prompt
+- **Decode Throughput**: Tokens per second during generation
+- **Time to First Token (TTFT)**: Latency before first output token
+- **Memory Usage**: Peak GPU/RAM consumption
+
+## Global Options
+
+```bash
+# Enable verbose logging
+ruvllm --verbose <command>
+
+# Disable colored output
+ruvllm --no-color <command>
+
+# Custom cache directory
+ruvllm --cache-dir /path/to/cache <command>
+
+# Or via environment variable
+export RUVLLM_CACHE_DIR=/path/to/cache
+```
+
+## Configuration
+
+### Cache Directory
+
+Models are cached in:
+
+- **macOS**: `~/Library/Caches/ruvllm`
+- **Linux**: `~/.cache/ruvllm`
+- **Windows**: `%LOCALAPPDATA%\ruvllm`
+
+Override with `--cache-dir` or `RUVLLM_CACHE_DIR`.
+
+### Logging
+
+Set log level with `RUST_LOG`:
+
+```bash
+RUST_LOG=debug ruvllm chat qwen
+RUST_LOG=ruvllm=trace ruvllm serve qwen
+```
+
+## Examples
+
+### Basic Workflow
+
+```bash
+# 1. Download a model
+ruvllm download qwen
+
+# 2. Verify it's downloaded
+ruvllm list --downloaded
+
+# 3. Start chatting
+ruvllm chat qwen
+```
+
+### Server Deployment
+
+```bash
+# Download model first
+ruvllm download qwen --quantization q4k
+
+# Start server with production settings
+ruvllm serve qwen \
+  --host 0.0.0.0 \
+  --port 8080 \
+  --max-concurrent 16 \
+  --max-context 4096 \
+  --quantization q4k
+```
+
+### Performance Testing
+
+```bash
+# Run comprehensive benchmarks
+ruvllm benchmark qwen \
+  --warmup 10 \
+  --iterations 50 \
+  --prompt-length 512 \
+  --gen-length 256 \
+  --format json > benchmark_results.json
+```
+
+## Troubleshooting
+
+### Out of Memory
+
+```bash
+# Use a lower-bit quantization (q4k is the smallest option)
+ruvllm chat qwen --quantization q4k
+
+# Or reduce context length
+ruvllm serve qwen --max-context 2048
+```
+
+### Slow Download
+
+```bash
+# Resume interrupted download
+ruvllm download qwen
+
+# Force fresh download
+ruvllm download qwen --force
+```
+
+### Metal Issues (macOS)
+
+Ensure Metal is available:
+
+```bash
+# Check Metal device
+system_profiler SPDisplaysDataType | grep Metal
+
+# Try with CPU fallback
+RUVLLM_NO_METAL=1 ruvllm chat qwen
+```
+
+## Feature Flags
+
+Build with specific features:
+
+```bash
+# Metal acceleration (macOS)
+cargo install ruvllm-cli --features metal
+
+# CUDA acceleration (NVIDIA)
+cargo install ruvllm-cli --features cuda
+
+# Both (if available)
+cargo install ruvllm-cli --features "metal,cuda"
+```
+
+## License
+
+Dual-licensed under Apache-2.0 or MIT.