67 lines
2.7 KiB
Markdown
67 lines
2.7 KiB
Markdown
# Changelog
|
|
|
|
All notable changes to the ruvllm crate will be documented in this file.
|
|
|
|
The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
|
|
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
|
|
|
|
## [2.0.0] - 2025-01-19
|
|
|
|
### Added
|
|
- Multi-threaded GEMM/GEMV with Rayon (12.7x speedup on M4 Pro)
|
|
- Flash Attention 2 with auto block sizing (+10% throughput)
|
|
- INT8/INT4/Q4_K quantized inference kernels (4-8x memory reduction)
|
|
- Optimized Metal GPU shaders (simdgroup_matrix)
|
|
- Memory pool with arena allocator (zero-alloc inference)
|
|
- WASM support via ruvllm-wasm crate
|
|
- npm package integration (@ruvector/ruvllm v2)
|
|
- Paged attention for non-contiguous KV cache
|
|
- Grouped-Query Attention (GQA) and Multi-Query Attention (MQA) support
|
|
- Two-tier KV cache with FP16 tail and quantized cold storage
|
|
- MicroLoRA for real-time per-request adaptation (<1ms latency)
|
|
- EWC++ (Elastic Weight Consolidation) to prevent catastrophic forgetting
|
|
- SONA learning integration with three-tier loops (instant/background/deep)
|
|
- Native Metal compute shaders for M4 Pro optimization
|
|
- Candle backend integration for HuggingFace model loading
|
|
|
|
### Changed
|
|
- GEMV performance: 6 GFLOPS -> 35.9 GFLOPS (6x improvement)
|
|
- GEMM performance: 6 GFLOPS -> 19.2 GFLOPS (3.2x improvement)
|
|
- Cache blocking tuned for M4 Pro (96x64x256 tiles)
|
|
- 12x4 micro-kernel for better register utilization
|
|
- RMSNorm optimized with NEON SIMD (620ns for 4096 dim, 16x better than target)
|
|
- Flash Attention achieves 840us for 256-token sequences
|
|
- MicroLoRA forward pass: 8.56us scalar, 2.61us SIMD (117x/383x better than target)
|
|
|
|
### Fixed
|
|
- Parameter estimation accuracy for 7B models
|
|
- Doctest crate name compatibility
|
|
- KV cache migration batch sizing for latency spikes
|
|
- Memory bandwidth optimization for large matrix operations
|
|
|
|
### Performance Highlights (M4 Pro, 48GB RAM)
|
|
|
|
| Operation | Latency | Target | Status |
|
|
|-----------|---------|--------|--------|
|
|
| Flash Attention (256 seq) | 840us | <2ms | 2.4x better |
|
|
| RMSNorm (4096 dim) | 620ns | <10us | 16x better |
|
|
| GEMV (4096x4096) | 1.36ms | <5ms | 3.7x better |
|
|
| MicroLoRA forward (rank=2, dim=4096) | 8.56us | <1ms | 117x better |
|
|
| RoPE with tables (128 dim, 32 tokens) | 1.33us | <50us | 37x better |
|
|
|
|
## [0.1.32] - 2025-01-18
|
|
|
|
### Added
|
|
- Initial ruvllm-integration crate with basic LLM serving runtime
|
|
- Paged attention implementation
|
|
- KV cache management
|
|
- SONA learning integration scaffolding
|
|
- Basic NEON SIMD kernels for ARM64
|
|
|
|
### Dependencies
|
|
- ruvector-core for storage backend
|
|
- ruvector-sona for learning integration
|
|
- candle-core, candle-nn, candle-transformers for ML backend
|
|
- tokenizers for text processing
|
|
- hf-hub for model downloads
|