Squashed 'vendor/ruvector/' content from commit b64c2172

git-subtree-dir: vendor/ruvector git-subtree-split: b64c21726f2bb37286d9ee36a7869fef60cc6900
2026-02-28 14:39:40 -05:00
commit d803bfe2b1
7854 changed files with 3522914 additions and 0 deletions
--- a/crates/ruvllm/CHANGELOG.md
+++ b/crates/ruvllm/CHANGELOG.md
@@ -0,0 +1,66 @@
+# Changelog
+
+All notable changes to the ruvllm crate will be documented in this file.
+
+The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
+and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
+
+## [2.0.0] - 2025-01-19
+
+### Added
+- Multi-threaded GEMM/GEMV with Rayon (12.7x speedup on M4 Pro)
+- Flash Attention 2 with auto block sizing (+10% throughput)
+- INT8/INT4/Q4_K quantized inference kernels (4-8x memory reduction)
+- Optimized Metal GPU shaders (simdgroup_matrix)
+- Memory pool with arena allocator (zero-alloc inference)
+- WASM support via ruvllm-wasm crate
+- npm package integration (@ruvector/ruvllm v2)
+- Paged attention for non-contiguous KV cache
+- Grouped-Query Attention (GQA) and Multi-Query Attention (MQA) support
+- Two-tier KV cache with FP16 tail and quantized cold storage
+- MicroLoRA for real-time per-request adaptation (<1ms latency)
+- EWC++ (Elastic Weight Consolidation) to prevent catastrophic forgetting
+- SONA learning integration with three-tier loops (instant/background/deep)
+- Native Metal compute shaders for M4 Pro optimization
+- Candle backend integration for HuggingFace model loading
+
+### Changed
+- GEMV performance: 6 GFLOPS -> 35.9 GFLOPS (6x improvement)
+- GEMM performance: 6 GFLOPS -> 19.2 GFLOPS (3.2x improvement)
+- Cache blocking tuned for M4 Pro (96x64x256 tiles)
+- 12x4 micro-kernel for better register utilization
+- RMSNorm optimized with NEON SIMD (620ns for 4096 dim, 16x better than target)
+- Flash Attention achieves 840us for 256-token sequences
+- MicroLoRA forward pass: 8.56us scalar, 2.61us SIMD (117x/383x better than target)
+
+### Fixed
+- Parameter estimation accuracy for 7B models
+- Doctest crate name compatibility
+- KV cache migration batch sizing for latency spikes
+- Memory bandwidth optimization for large matrix operations
+
+### Performance Highlights (M4 Pro, 48GB RAM)
+
+| Operation | Latency | Target | Status |
+|-----------|---------|--------|--------|
+| Flash Attention (256 seq) | 840us | <2ms | 2.4x better |
+| RMSNorm (4096 dim) | 620ns | <10us | 16x better |
+| GEMV (4096x4096) | 1.36ms | <5ms | 3.7x better |
+| MicroLoRA forward (rank=2, dim=4096) | 8.56us | <1ms | 117x better |
+| RoPE with tables (128 dim, 32 tokens) | 1.33us | <50us | 37x better |
+
+## [0.1.32] - 2025-01-18
+
+### Added
+- Initial ruvllm-integration crate with basic LLM serving runtime
+- Paged attention implementation
+- KV cache management
+- SONA learning integration scaffolding
+- Basic NEON SIMD kernels for ARM64
+
+### Dependencies
+- ruvector-core for storage backend
+- ruvector-sona for learning integration
+- candle-core, candle-nn, candle-transformers for ML backend
+- tokenizers for text processing
+- hf-hub for model downloads