Merge commit 'd803bfe2b1fe7f5e219e50ac20d6801a0a58ac75' as 'vendor/ruvector'

2026-02-28 14:39:40 -05:00
parent 7885bf6278 d803bfe2b1
commit cd5943df23
7854 changed files with 3522914 additions and 0 deletions
--- a/vendor/ruvector/crates/ruvector-mincut-gated-transformer/README.md
+++ b/vendor/ruvector/crates/ruvector-mincut-gated-transformer/README.md
@@ -0,0 +1,469 @@
+# Mincut-Gated Transformer
+
+> **Ultra-low latency transformer inference with graph-theoretic coherence control, designed for real-time AI systems and edge deployment**
+
+[![Crates.io](https://img.shields.io/crates/v/ruvector-mincut-gated-transformer.svg)](https://crates.io/crates/ruvector-mincut-gated-transformer)
+[![Documentation](https://docs.rs/ruvector-mincut-gated-transformer/badge.svg)](https://docs.rs/ruvector-mincut-gated-transformer)
+[![License](https://img.shields.io/badge/license-MIT%2FApache--2.0-blue.svg)](LICENSE)
+
+## Introduction
+
+The **Mincut-Gated Transformer** is a production-grade inference engine that combines minimum cut (mincut) graph partitioning with adaptive compute allocation to achieve deterministic, ultra-low latency inference. Unlike traditional transformers that execute all layers uniformly, this architecture uses graph-theoretic coherence signals to dynamically skip computation, exit early, and control state updates—all while maintaining explainability and safety guarantees.
+
+**Why Mincut?** The minimum cut value (λ) of an attention graph provides a principled measure of information flow coherence. When λ is high and stable, the model can safely reduce computation. When λ drops or becomes unstable, the system conservatively executes more layers. This creates a natural feedback loop between model confidence and compute allocation.
+
+### Key Innovations
+
+| Innovation | Technique | Benefit |
+|-----------|-----------|---------|
+| **λ-based Mixture-of-Depths** | Route tokens using mincut delta instead of learned routers | 50% FLOPs reduction |
+| **Coherence-driven Early Exit** | Exit when λ stabilizes across layers | 30-50% latency reduction |
+| **Mincut Sparse Attention** | Use partition boundaries for sparse masks | 90% attention FLOPs reduction |
+| **Energy-based Gating** | Treat coherence as energy function | Principled compute-quality tradeoffs |
+| **Spike-driven Scheduling** | Event-driven inference on activity | 87× energy efficiency |
+| **Spectral Position Encoding** | Graph Laplacian eigenvectors via Lanczos | O(n) structural awareness |
+| **EAGLE-3 Speculative Decoding** | λ-guided draft tree verification | 3-5× decoding speedup |
+| **Mamba SSM Hybrid** | Selective state spaces with O(n) complexity | Linear-time sequence modeling |
+| **FlashAttention Tiling** | Block-wise attention with online softmax | O(n) memory, 2-4× faster |
+| **KV Cache INT4** | Hadamard transform + 2/4-bit quantization | 8-16× cache compression |
+| **RoPE with NTK/YaRN** | Context extension beyond training length | 4-32× context scaling |
+
+## Features
+
+### Core Capabilities
+
+- **Deterministic inference** — Same inputs always produce identical outputs (bit-exact)
+- **Bounded latency** — Predictable p99 guarantees through tier-based execution
+- **Explainable decisions** — Every inference produces a witness explaining all interventions
+- **Allocation-free hot path** — Zero heap allocations during inference after initialization
+- **Safety controls** — Coherence-gated state updates prevent contamination propagation
+
+### Quantization & Memory
+
+- **INT8 quantization** — Full model quantization with per-tensor and per-row scaling
+- **INT4 quantization** — 2× memory reduction with per-row and block-wise scaling
+- **Arena allocator** — Single contiguous allocation for weights, 64-byte cache-aligned
+- **Sparse CSR matrices** — Efficient storage for spectral graph operations
+
+### SIMD Acceleration
+
+- **AVX2/FMA** (x86_64) — Vectorized GEMM, GELU, quantization with 8×32 tiling
+- **NEON** (aarch64) — ARM SIMD for mobile and edge devices
+- **Scalar fallback** — Portable implementation for all platforms
+
+### Advanced Features
+
+- **Lanczos algorithm** — O(n) eigenvalue computation for spectral position encoding
+- **Power iteration** — Fast dominant eigenvector extraction
+- **Prefetch hints** — Memory access optimization for sequential patterns
+- **Benchmark utilities** — Built-in profiling with GFLOPS and bandwidth metrics
+
+### SOTA 2025 Features
+
+- **KV Cache INT4 (RotateKV)** — Hadamard transforms for outlier smoothing, 2-bit/4-bit quantization with <0.3 PPL degradation
+- **RoPE Embeddings** — Rotary position encoding with NTK-aware and YaRN scaling for 4-32× context extension
+- **EAGLE-3 Speculative Decoding** — λ-guided draft tree generation with rejection sampling for 3-5× faster decoding
+- **FlashAttention Tiling** — Block-wise computation with online softmax, O(n) memory instead of O(n²)
+- **Mamba SSM Layer** — Selective state space models with O(n) complexity and O(1) inference memory per step
+- **Criterion Benchmarks** — Comprehensive kernel performance profiling with GFLOPS metrics
+
+## Quick Start
+
+```rust
+use ruvector_mincut_gated_transformer::prelude::*;
+
+// Create configuration
+let config = TransformerConfig::micro();
+let policy = GatePolicy::default();
+
+// Load weights (or use empty for testing)
+let weights = QuantizedWeights::empty(&config);
+
+// Create transformer
+let mut transformer = MincutGatedTransformer::new(config, policy, weights)?;
+
+// Create gate packet from mincut signals
+let gate = GatePacket {
+    lambda: 100,              // Minimum cut value
+    lambda_prev: 95,          // Previous lambda for delta computation
+    boundary_edges: 5,        // Cross-partition edge count
+    boundary_concentration_q15: 8192,  // ~25% concentration (Q15 format)
+    partition_count: 3,       // Number of detected partitions
+    flags: 0,
+};
+
+// Prepare input
+let input = InferInput::from_tokens(&[1, 2, 3, 4], gate);
+
+// Allocate output buffer
+let mut logits = vec![0i32; config.logits as usize];
+let mut output = InferOutput::new(&mut logits);
+
+// Run inference
+transformer.infer(&input, &mut output)?;
+
+// Check witness for gate decisions
+println!("Decision: {:?}", output.witness.decision);
+println!("Reason: {:?}", output.witness.reason);
+println!("External writes allowed: {}", output.witness.external_writes_enabled);
+```
+
+## Architecture Overview
+
+```
+                    ┌─────────────────┐
+                    │   Gate Packet   │
+                    │  (λ, Δλ, edges) │
+                    └────────┬────────┘
+                             │
+    Input ──────────────────►│
+                             ▼
+                    ┌─────────────────┐
+                    │ Spike Scheduler │──── Skip (tier 3)
+                    │  Event-driven   │
+                    └────────┬────────┘
+                             │
+                             ▼
+                    ┌─────────────────┐
+                    │ Gate Controller │──── Select tier 0/1/2
+                    │ Coherence-gated │
+                    └────────┬────────┘
+                             │
+                             ▼
+              ┌──────────────┴──────────────┐
+              │      Transformer Core       │
+              │  ┌────────────────────────┐ │
+              │  │ MoD Router (λ-based)   │ │
+              │  └───────────┬────────────┘ │
+              │              ▼              │
+              │  ┌────────────────────────┐ │
+              │  │ Sparse Attention       │ │
+              │  │ (mincut boundaries)    │ │
+              │  └───────────┬────────────┘ │
+              │              ▼              │
+              │  ┌────────────────────────┐ │
+              │  │ Early Exit Check       │ │──── Exit if λ stable
+              │  │ (coherence threshold)  │ │
+              │  └───────────┬────────────┘ │
+              └──────────────┴──────────────┘
+                             │
+                             ▼
+                    ┌─────────────────┐
+                    │ Output + Witness│
+                    │  (explainable)  │
+                    └─────────────────┘
+```
+
+### Tier System
+
+| Tier | Layers | Seq Len | Window | Use Case | Speedup |
+|------|--------|---------|--------|----------|---------|
+| 0 | 4 | 64 | 16 | Normal (high λ) | 1× |
+| 1 | 2 | 32 | 8 | Reduced (moderate λ) | 2-3× |
+| 2 | 1 | 8 | 4 | Safe mode (low λ) | 5-10× |
+| 3 | 0 | 0 | 0 | Skip (no spike) | 50-200× |
+
+## Performance
+
+### Expected Speedups
+
+| Workload Type | Skip Rate | Speedup | Memory Reduction |
+|---------------|-----------|---------|------------------|
+| Streaming (low activity) | 70% | **10-15×** | 80% |
+| Interactive (bursty) | 40% | **4-6×** | 50% |
+| Continuous (high throughput) | 10% | **2-3×** | 40% |
+| Safety-critical (conservative) | 5% | **1.5-2×** | 25% |
+
+### SIMD Performance (on x86_64 AVX2)
+
+| Operation | Scalar | SIMD | Speedup |
+|-----------|--------|------|---------|
+| INT8 GEMM (256×256) | 12ms | 1.8ms | **6.7×** |
+| GELU activation (1024) | 45µs | 8µs | **5.6×** |
+| Quantize f32→i8 (1024) | 38µs | 7µs | **5.4×** |
+
+### Memory Footprint
+
+| Model Config | INT8 | INT4 | Arena Overhead |
+|-------------|------|------|----------------|
+| Micro (2L, 128H) | 1.2 MB | 0.6 MB | +64 bytes |
+| Baseline (4L, 256H) | 8.5 MB | 4.3 MB | +64 bytes |
+| Medium (12L, 768H) | ~85 MB | ~43 MB | +64 bytes |
+
+## Configuration
+
+### Preset Configurations
+
+```rust
+// Micro: WASM, edge gateways, embedded
+let config = TransformerConfig::micro();
+// Seq: 32, Hidden: 128, Heads: 4, Layers: 2
+
+// Baseline: CPU inference, development
+let config = TransformerConfig::baseline();
+// Seq: 64, Hidden: 256, Heads: 4, Layers: 4
+```
+
+### Gate Policy
+
+```rust
+let policy = GatePolicy {
+    lambda_min: 30,                         // Minimum coherence threshold
+    drop_ratio_q15_max: 16384,              // Max λ drop (50% in Q15)
+    boundary_edges_max: 20,                 // Max cross-partition edges
+    boundary_concentration_q15_max: 24576,  // Max concentration (75%)
+    partitions_max: 8,                      // Max partition count
+    spike_rate_q15_max: 26214,              // Max spike rate (80%)
+    allow_kv_write_when_unstable: false,    // Freeze KV cache
+    allow_external_write_when_unstable: false, // Block external writes
+};
+```
+
+## Feature Flags
+
+### Core Features
+- `sliding_window` (default) — Sliding window attention
+- `linear_attention` — Linear attention for O(n) scaling
+
+### Quantization
+- `simd` — AVX2/NEON SIMD acceleration
+- `int4` — INT4 quantization support
+- `fixed_point_softmax` — Fixed-point for embedded targets
+- `rmsnorm` — RMSNorm instead of LayerNorm
+
+### Advanced
+- `spectral_pe` — Spectral position encoding with Lanczos
+- `sparse_attention` — Mincut-guided sparse attention
+- `energy_gate` — Energy-based gate decisions
+- `spike_attention` — Spike-driven attention mechanism
+- `trace` — Runtime tracing and snapshots
+
+### Platform
+- `wasm` — WebAssembly support
+- `no_std_gateway` — No-std for embedded gateways
+
+## Current Limitations
+
+| Feature | Status | Notes |
+|---------|--------|-------|
+| GPU inference | Not implemented | CUDA/Metal kernels needed |
+| KV cache persistence | ✅ **Implemented** | INT4 with Hadamard transforms |
+| Multi-head grouped query | Not implemented | GQA for memory efficiency |
+| Flash Attention | ✅ **Implemented** | CPU tiled with online softmax |
+| Rotary position embeddings | ✅ **Implemented** | RoPE with NTK/YaRN scaling |
+| Criterion benchmarks | ✅ **Implemented** | Kernel, gate, latency benchmarks |
+| GGML/GGUF format | Not implemented | Model format compatibility |
+| Batched inference | Partial | Single-sequence optimized |
+| Async/streaming output | Not implemented | Token-by-token streaming |
+| Mamba/SSM hybrid | ✅ **Implemented** | Selective state space layer |
+| Speculative decoding | ✅ **Implemented** | EAGLE-3 style with λ-guidance |
+
+## Academic Foundations
+
+This implementation integrates peer-reviewed research:
+
+### Core Architecture
+1. **Mixture-of-Depths** (Raposo et al., 2024) — Dynamic compute allocation
+2. **LayerSkip** (Elhoushi et al., 2024) — Early exit and self-speculative decoding
+3. **MInference** (Jiang et al., 2024) — Dynamic sparse attention
+4. **Energy-Based Transformers** (Gladstone et al., 2025) — Energy-based decisions
+5. **Spike-driven Transformer** (Yao et al., 2023, 2024) — Event-driven inference
+6. **Spectral Attention** (Kreuzer et al., 2021) — Graph-based position encoding
+
+### SOTA 2025 Research
+7. **RotateKV** (IJCAI 2025) — Hadamard transforms for KV cache quantization
+8. **EAGLE-3** (NeurIPS 2025) — Speculative decoding with draft tree verification
+9. **FlashAttention-3** (Dao et al., 2024) — IO-aware attention with online softmax
+10. **Mamba** (Gu & Dao, 2023) — Selective State Space Models
+11. **Mamba-2** (Dao & Gu, 2024) — Structured state space duality
+12. **RoFormer** (Su et al., 2021) — Rotary position embeddings
+13. **YaRN** (Peng et al., 2023) — Efficient context window extension
+14. **NTK-Aware Scaling** (bloc97, 2023) — Base frequency adjustment for context extension
+
+See [docs/THEORY.md](docs/THEORY.md) for detailed theoretical foundations.
+
+## Integration
+
+### With RuVector Mincut
+
+```rust
+use ruvector_mincut_gated_transformer::prelude::*;
+use ruvector_mincut::MincutEngine;
+
+// Compute mincut from attention graph
+let mut mincut = MincutEngine::new(num_nodes);
+// ... add edges from attention weights ...
+let lambda = mincut.compute_mincut();
+
+// Create gate packet
+let gate = GatePacket {
+    lambda,
+    lambda_prev: prev_lambda,
+    boundary_edges: mincut.boundary_edge_count(),
+    ..Default::default()
+};
+
+// Run gated inference
+transformer.infer(&InferInput::from_tokens(tokens, gate), &mut output)?;
+```
+
+### Arena Allocator
+
+```rust
+use ruvector_mincut_gated_transformer::arena::{WeightArena, calculate_arena_size};
+
+// Calculate total size for model
+let size = calculate_arena_size(layers, hidden, ffn_mult, heads);
+let mut arena = WeightArena::new(size);
+
+// Allocate weight slices
+let w_q = arena.alloc_i8(hidden * hidden).unwrap();
+let scales = arena.alloc_f32(hidden).unwrap();
+```
+
+### INT4 Quantization
+
+```rust
+use ruvector_mincut_gated_transformer::kernel::quant4::{Int4Weights, int4_gemv};
+
+// Create INT4 weights from f32 (50% memory savings)
+let int4_w = Int4Weights::from_f32(&weights, rows, cols);
+
+// Matrix-vector multiplication
+int4_gemv(&int4_w, &input, 1.0, &mut output);
+```
+
+### KV Cache INT4 (RotateKV)
+
+```rust
+use ruvector_mincut_gated_transformer::kv_cache::{QuantizedKVCache, QuantBits};
+
+// Create 2-bit quantized KV cache (16× compression)
+let mut cache = QuantizedKVCache::new(
+    num_layers,
+    num_heads,
+    head_dim,
+    max_seq_len,
+    QuantBits::Two,
+);
+
+// Store key/value with automatic Hadamard transform
+cache.store_key(layer, head, position, &key_vector);
+cache.store_value(layer, head, position, &value_vector);
+
+// Retrieve (dequantize + inverse Hadamard)
+let key = cache.get_key(layer, head, position);
+```
+
+### RoPE Embeddings
+
+```rust
+use ruvector_mincut_gated_transformer::rope::{RopeConfig, RopeEmbedding, RopeScaling};
+
+// Standard RoPE
+let config = RopeConfig::default();
+let rope = RopeEmbedding::new(&config)?;
+
+// NTK-aware scaling for 4× context extension
+let config = RopeConfig {
+    scaling_type: RopeScaling::NTKAware { alpha: 4.0 },
+    ..Default::default()
+};
+
+// Apply to Q/K vectors
+rope.apply(&mut q, &mut k, position);
+```
+
+### FlashAttention Tiling
+
+```rust
+use ruvector_mincut_gated_transformer::flash_attention::{
+    FlashAttentionConfig, flash_attention_forward,
+};
+
+let config = FlashAttentionConfig {
+    block_size_q: 64,
+    block_size_kv: 64,
+    head_dim: 64,
+    causal: true,
+    softmax_scale: 0.125,
+};
+
+// O(n) memory attention
+flash_attention_forward(&config, &q, &k, &v, seq_len, seq_len, &mut output);
+```
+
+### Mamba SSM Layer
+
+```rust
+use ruvector_mincut_gated_transformer::mamba::{MambaConfig, MambaLayer};
+
+let config = MambaConfig::default();
+let mut layer = MambaLayer::new(config);
+
+// Recurrent mode (O(1) memory per step)
+for token in tokens.iter() {
+    let output = layer.step_recurrent(token);
+}
+
+// Batch mode for training
+let outputs = layer.forward_sequence(&input_sequence);
+```
+
+### EAGLE-3 Speculative Decoding
+
+```rust
+use ruvector_mincut_gated_transformer::speculative::{
+    SpeculativeConfig, SpeculativeDecoder,
+};
+
+let config = SpeculativeConfig {
+    max_draft_tokens: 8,
+    tree_width: 4,
+    acceptance_threshold: 0.9,
+    lambda_guidance: true,  // Use mincut λ for tree construction
+};
+
+let mut decoder = SpeculativeDecoder::new(config, &gate_policy);
+
+// Generate with speculation (3-5× faster)
+let (tokens, stats) = decoder.generate_with_speculation(
+    &draft_model,
+    &target_model,
+    &prompt,
+    max_new_tokens,
+);
+```
+
+## Safety & Determinism
+
+**Determinism guarantee:** For fixed `(weights, config, policy, input)`, inference always produces identical `(logits, witness)`.
+
+**Safety properties:**
+- External writes blocked when coherence is low
+- KV cache frozen/flushed on instability
+- All gate decisions recorded in witness
+- No hidden state or randomness
+
+**Witness fields:**
+```rust
+witness.decision        // ALLOW, DEFER, QUARANTINE, SKIP
+witness.reason          // Why this decision was made
+witness.external_writes_enabled  // Safe to persist?
+witness.kv_action       // WRITE, FREEZE, FLUSH
+```
+
+## License
+
+Licensed under either of Apache License 2.0 or MIT license at your option.
+
+## Contributing
+
+Contributions welcome! Areas of interest:
+
+- GPU kernel implementations (CUDA, Metal)
+- Additional quantization formats (GPTQ, AWQ)
+- Multi-head grouped query attention (GQA)
+- GGUF/Safetensors model format loaders
+- Batched inference optimization
+- Async/streaming token output