# RuvLLM Optimization Guide (v2.0.0)

This guide covers performance optimization strategies for RuvLLM, including SONA learning loops, batch sizing, KV cache management, and hardware-specific tuning.

## v2.0.0 Performance Highlights

| Feature | Improvement | Notes |
|---------|-------------|-------|
| Multi-threaded GEMM | 12.7x speedup | Rayon on M4 Pro 10-core |
| Flash Attention 2 | +10% throughput | Auto block sizing |
| Quantized Inference | 4-8x memory reduction | INT8/INT4/Q4_K |
| Metal GPU | 3x speedup | simdgroup_matrix |
| Memory Pool | Zero-alloc | Arena allocator |

## Performance Overview

### Key Metrics

| Metric | Target (M4 Pro) | Achieved (v2.0.0) | Description |
|--------|-----------------|-------------------|-------------|
| Prefill | >2000 tok/s | 3500 tok/s | Processing input tokens |
| Decode | >80 tok/s | 120 tok/s | Generating output tokens |
| TTFT | <50ms | 35ms | Time to first token |
| Memory | <8GB for 7B | 3.4GB (Q4K) | Peak memory usage |
| MicroLoRA | <1ms | 8.56us | Per-request adaptation |

### Architecture Impact

```
┌─────────────────────────────────────────────────────┐
│                 Optimization Layers                  │
├─────────────────────────────────────────────────────┤
│ SONA Learning │ Real-time adaptation, routing       │
├─────────────────────────────────────────────────────┤
│ Attention     │ Flash, Paged, GQA - 2-4x speedup    │
├─────────────────────────────────────────────────────┤
│ KV Cache      │ Two-tier, quantized - 4x memory     │
├─────────────────────────────────────────────────────┤
│ Quantization  │ Q4K, Q8 - 4-8x smaller              │
├─────────────────────────────────────────────────────┤
│ SIMD/GPU      │ NEON, Metal - hardware accel        │
└─────────────────────────────────────────────────────┘
```

## SONA Learning Optimization

### Instant Loop Tuning

The instant loop runs per-request with a <1ms target latency.

```rust
let config = SonaLlmConfig {
    // Learning rate for instant updates
    // Higher = faster adaptation, more variance
    // Lower = slower adaptation, more stable
    instant_lr: 0.01,

    // Quality threshold - skip low-quality samples
    training: TrainingConfig {
        quality_threshold: 0.5, // 0.0-1.0
        ..Default::default()
    },
    ..Default::default()
};
```

**Tuning Guidelines:**

| Use Case | instant_lr | quality_threshold |
|----------|------------|-------------------|
| High variance tasks | 0.005 | 0.7 |
| Stable domains | 0.02 | 0.3 |
| User personalization | 0.01 | 0.5 |

### Background Loop Tuning

The background loop consolidates patterns without blocking inference.

```rust
let config = SonaLlmConfig {
    // How often to run (milliseconds)
    background_interval_ms: 100,

    // Minimum samples before consolidation
    background_min_samples: 10,

    // Maximum pending samples (triggers forced consolidation)
    max_pending_samples: 1000,

    // Consolidation strategy
    consolidation_strategy: ConsolidationStrategy::EwcMerge,
    ..Default::default()
};
```

**Tuning Guidelines:**

| Priority | interval_ms | min_samples | Strategy |
|----------|-------------|-------------|----------|
| Latency | 200 | 20 | Average |
| Quality | 50 | 5 | EwcMerge |
| Memory | 100 | 50 | BestOnly |
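If several deployments share these priorities, the table can be encoded once as a small constructor instead of hand-copying values. The following is a minimal sketch: the `BackgroundPriority` enum and `background_loop_config` helper are illustrative (not part of RuvLLM), the import path is an assumption, and the `Average`/`BestOnly` variants are assumed to exist on `ConsolidationStrategy` based on the Strategy column above.

```rust
use ruvllm::{ConsolidationStrategy, SonaLlmConfig};

/// Hypothetical preset enum mirroring the tuning table above.
enum BackgroundPriority {
    Latency,
    Quality,
    Memory,
}

/// Build a background-loop config from a priority preset.
fn background_loop_config(priority: BackgroundPriority) -> SonaLlmConfig {
    // (interval_ms, min_samples, strategy) taken directly from the table.
    let (interval_ms, min_samples, strategy) = match priority {
        BackgroundPriority::Latency => (200, 20, ConsolidationStrategy::Average),
        BackgroundPriority::Quality => (50, 5, ConsolidationStrategy::EwcMerge),
        BackgroundPriority::Memory => (100, 50, ConsolidationStrategy::BestOnly),
    };

    SonaLlmConfig {
        background_interval_ms: interval_ms,
        background_min_samples: min_samples,
        consolidation_strategy: strategy,
        ..Default::default()
    }
}
```

Keeping the presets in code makes it easy to switch priorities per environment without re-deriving the numbers.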
### Deep Loop Optimization

The deep loop is triggered periodically for full optimization.

```rust
let config = SonaLlmConfig {
    // Accumulated quality threshold to trigger the deep loop
    deep_trigger_threshold: 100.0,
    ..Default::default()
};

// Manual trigger for scheduled optimization
if sona.should_trigger_deep() || is_scheduled_time() {
    let samples = collect_high_quality_samples();
    let result = sona.deep_optimize(&samples);

    // Log improvement
    println!("Deep optimization: quality delta = {:.3}", result.quality_delta);
}
```

## Batch Size Optimization

### Dynamic Batching

```rust
// Optimal batch sizes vary by operation
struct BatchConfig {
    prefill_batch: usize, // Process multiple prompts together
    decode_batch: usize,  // Parallel token generation
    lora_batch: usize,    // LoRA adaptation batch
}

impl BatchConfig {
    fn for_memory(available_gb: f32) -> Self {
        match available_gb {
            x if x < 8.0 => Self {
                prefill_batch: 1,
                decode_batch: 4,
                lora_batch: 16,
            },
            x if x < 16.0 => Self {
                prefill_batch: 2,
                decode_batch: 8,
                lora_batch: 32,
            },
            _ => Self {
                prefill_batch: 4,
                decode_batch: 16,
                lora_batch: 64,
            },
        }
    }
}
```

### Batch Size Impact

| Batch Size | Throughput | Latency | Memory |
|------------|------------|---------|--------|
| 1 | Low | Lowest | Lowest |
| 4 | Medium | Low | Medium |
| 8 | High | Medium | High |
| 16+ | Highest | Higher | Highest |

**Rule of thumb:** Increase the batch size until you hit memory pressure or latency constraints.

## KV Cache Optimization

### Two-Tier Configuration

```rust
let config = KvCacheConfig {
    // Tokens in the high-precision tail
    // More = better attention quality for recent context
    // Less = less memory usage
    tail_length: 256,

    // Tail precision (FP16 recommended)
    tail_precision: Precision::FP16,

    // Store precision (Q4 for 4x compression)
    store_precision: Precision::Q4,

    // Maximum context length
    max_tokens: 4096,

    // KV heads (depends on model architecture)
    num_kv_heads: 8,
    head_dim: 128,

    // Batch size for migration (affects latency spikes)
    migration_batch: 64,
};
```

### Memory Calculation

```
KV Cache Memory = num_layers * 2 * max_tokens * num_kv_heads * head_dim * bytes_per_element

Example (Qwen2.5-7B with 4096 context):
- Layers: 32
- KV heads: 8
- Head dim: 128
- FP16 tail (256 tokens): 32 * 2 * 256  * 8 * 128 * 2   = 33.5 MB
- Q4 store (3840 tokens): 32 * 2 * 3840 * 8 * 128 * 0.5 = 125.8 MB
- Total: ~160 MB (vs ~537 MB for full FP16)
```

A short code sketch reproducing this arithmetic follows the table below.

### Cache Strategies by Use Case

| Use Case | tail_length | store_precision | max_tokens |
|----------|-------------|-----------------|------------|
| Chat (short) | 128 | Q8 | 2048 |
| Chat (long) | 256 | Q4 | 8192 |
| Document QA | 512 | Q4 | 16384 |
| Code completion | 128 | Q8 | 4096 |
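To estimate the footprint of any of these presets, the formula from the Memory Calculation section can be reproduced in a few lines. This is a standalone sketch, not a RuvLLM API; the `kv_cache_bytes` helper is illustrative, and the worked example uses the Qwen2.5-7B numbers from above.

```rust
/// Two-tier KV cache memory estimate, following the formula above:
/// num_layers * 2 * tokens * num_kv_heads * head_dim * bytes_per_element
fn kv_cache_bytes(
    num_layers: usize,
    num_kv_heads: usize,
    head_dim: usize,
    tokens: usize,
    bytes_per_element: f64,
) -> f64 {
    (num_layers * 2 * tokens * num_kv_heads * head_dim) as f64 * bytes_per_element
}

fn main() {
    // Qwen2.5-7B, 4096-token context: 256-token FP16 tail + Q4 store.
    let tail = kv_cache_bytes(32, 8, 128, 256, 2.0); // ~33.5 MB
    let store = kv_cache_bytes(32, 8, 128, 4096 - 256, 0.5); // ~125.8 MB
    println!(
        "tail = {:.1} MB, store = {:.1} MB, total = {:.1} MB",
        tail / 1e6,
        store / 1e6,
        (tail + store) / 1e6
    );
}
```

Run against the Document QA preset (512-token FP16 tail, Q4 store, 16384-token context), the same arithmetic gives roughly 587 MB.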
## Attention Optimization

### Grouped-Query Attention (GQA)

```rust
let config = AttentionConfig {
    num_heads: 32,   // Query heads
    num_kv_heads: 8, // KV heads (4:1 ratio)
    head_dim: 128,
    causal: true,
    ..Default::default()
};

// The GQA ratio determines memory savings:
// 4:1 = ~4x KV cache reduction
// 8:1 = ~8x KV cache reduction
assert_eq!(config.gqa_ratio(), 4);
```

### Flash Attention Optimization

```rust
// Flash Attention is memory-efficient but has setup overhead.
// Best for longer sequences (>256 tokens); for short sequences,
// standard attention may be faster.
let use_flash = sequence_length > 256;

let output = if use_flash {
    flash_attention_neon(&query, &key, &value, scale, causal)
} else {
    standard_attention(&query, &key, &value, scale, causal)
};
```

### Paged Attention for Inference

```rust
// Paged attention enables a non-contiguous KV cache.
// Best for long-running inference with variable context.
let mut cache = PagedKvCache::new(
    16,  // block_size: tokens per block
    8,   // num_kv_heads
    128, // head_dim
);

// Append incrementally
for token in tokens {
    let (k, v) = compute_kv(token)?;
    cache.append(&k, &v);
}

// Efficient attention over the paged cache
let output = paged_attention_neon(&query, &cache, &block_tables, scale);
```

## Quantization Optimization

### Model Quantization

| Precision | Memory (vs Q8) | Quality | Speed |
|-----------|----------------|---------|-------|
| FP32 | 4x | Best | Slowest |
| FP16 | 2x | Excellent | Fast |
| Q8 | 1x | Very Good | Faster |
| Q4K | 0.5x | Good | Fastest |
| Q4 | 0.5x | Acceptable | Fastest |

**Recommendations:**

```rust
// High quality (16GB+ RAM)
let config = ModelConfig {
    quantization: Precision::Q8,
    ..Default::default()
};

// Balanced (8-16GB RAM)
let config = ModelConfig {
    quantization: Precision::Q4K, // K-quant preserves quality
    ..Default::default()
};

// Memory constrained (<8GB RAM)
let config = ModelConfig {
    quantization: Precision::Q4,
    ..Default::default()
};
```

### KV Cache Quantization

```rust
// Hybrid quantization: recent tokens in high precision
let config = KvCacheConfig {
    tail_length: 256,               // Recent: FP16
    tail_precision: Precision::FP16,
    store_precision: Precision::Q4, // Older: Q4
    ..Default::default()
};

// Quality impact by position:
// Position 0-256 (tail): full quality
// Position 256+: ~95% quality with Q4
```

## Hardware-Specific Optimization

### Apple Silicon (M1/M2/M3/M4)

```rust
// Metal backend for GPU acceleration
let backend = CandleBackend::with_device(DeviceType::Metal)?;

// Optimize for unified memory
let config = ModelConfig {
    // Unified memory = a larger KV cache is possible
    kv_cache_config: KvCacheConfig {
        max_tokens: 8192, // Can be larger on M-series
        ..Default::default()
    },
    ..Default::default()
};
```

**M4 Pro Specific:**

- Use the `metal` feature for GPU acceleration
- NEON SIMD is enabled by default
- Leverage unified memory for larger context

### NVIDIA GPUs

```rust
// CUDA backend
let backend = CandleBackend::with_device(DeviceType::Cuda(0))?;

// Optimize for separate VRAM
let config = ModelConfig {
    kv_cache_config: KvCacheConfig {
        // Conservative: VRAM is limited
        max_tokens: 4096,
        ..Default::default()
    },
    ..Default::default()
};
```

### CPU Fallback

```rust
// CPU with SIMD optimization
let backend = CandleBackend::with_device(DeviceType::Cpu)?;

// Reduce memory pressure
let config = ModelConfig {
    quantization: Precision::Q4,
    kv_cache_config: KvCacheConfig {
        tail_length: 128,
        max_tokens: 2048,
        ..Default::default()
    },
    ..Default::default()
};
```
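When one binary may run on any of these targets, the three presets can be folded into a single startup helper. This is a minimal sketch under stated assumptions: the import path, the `select_backend` function, and the Metal → CUDA → CPU fallback order are illustrative rather than RuvLLM behavior, and the compile-time checks reuse the `metal`/`cuda` feature names referenced elsewhere in this guide.

```rust
use std::error::Error;

use ruvllm::{CandleBackend, DeviceType, KvCacheConfig, ModelConfig, Precision};

/// Pick a backend and a matching config at startup (illustrative fallback order).
fn select_backend() -> Result<(CandleBackend, ModelConfig), Box<dyn Error>> {
    if cfg!(feature = "metal") {
        // Apple Silicon: unified memory allows a larger KV cache.
        let backend = CandleBackend::with_device(DeviceType::Metal)?;
        let config = ModelConfig {
            kv_cache_config: KvCacheConfig { max_tokens: 8192, ..Default::default() },
            ..Default::default()
        };
        Ok((backend, config))
    } else if cfg!(feature = "cuda") {
        // NVIDIA: be conservative with VRAM.
        let backend = CandleBackend::with_device(DeviceType::Cuda(0))?;
        let config = ModelConfig {
            kv_cache_config: KvCacheConfig { max_tokens: 4096, ..Default::default() },
            ..Default::default()
        };
        Ok((backend, config))
    } else {
        // CPU fallback: quantize aggressively and shrink the cache.
        let backend = CandleBackend::with_device(DeviceType::Cpu)?;
        let config = ModelConfig {
            quantization: Precision::Q4,
            kv_cache_config: KvCacheConfig {
                tail_length: 128,
                max_tokens: 2048,
                ..Default::default()
            },
            ..Default::default()
        };
        Ok((backend, config))
    }
}
```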
"prefill"), (self.decode_us, "decode"), (self.sampling_us, "sampling"), (self.lora_us, "lora"), ].into_iter().max_by_key(|(v, _)| *v).unwrap(); max.1 } } ``` ## Benchmarking ### Running Benchmarks ```bash # All benchmarks cargo bench # Specific benchmarks cargo bench --bench attention_bench cargo bench --bench lora_bench cargo bench --bench e2e_bench # With specific features cargo bench --features metal cargo bench --features cuda ``` ### Custom Benchmarks ```rust use criterion::{black_box, criterion_group, criterion_main, Criterion}; use ruvllm::kernels::attention::flash_attention_neon; fn bench_attention(c: &mut Criterion) { let query = vec![0.1f32; 128]; let key = vec![0.1f32; 512 * 128]; let value = vec![0.1f32; 512 * 128]; let scale = 1.0 / 128.0_f32.sqrt(); c.bench_function("flash_attention_512", |b| { b.iter(|| { flash_attention_neon( black_box(&query), black_box(&key), black_box(&value), scale, true, ) }) }); } criterion_group!(benches, bench_attention); criterion_main!(benches); ``` ## Optimization Checklist ### Before Deployment - [ ] Choose appropriate quantization (Q4K for most cases) - [ ] Configure KV cache for expected context length - [ ] Enable GQA if model supports it - [ ] Set appropriate batch sizes for memory - [ ] Configure SONA learning rates - [ ] Test with representative workloads ### Monitoring - [ ] Track prefill and decode throughput - [ ] Monitor memory usage over time - [ ] Log KV cache hit rates - [ ] Track SONA learning metrics - [ ] Alert on latency spikes ### Troubleshooting | Symptom | Likely Cause | Solution | |---------|--------------|----------| | High latency | Batch too large | Reduce batch size | | OOM errors | KV cache too large | Reduce max_tokens or use Q4 | | Quality degradation | Over-quantization | Use Q8 instead of Q4 | | Slow adaptation | Learning rate too low | Increase instant_lr | | Forgetting | EWC lambda too low | Increase ewc_lambda |