Merge commit 'd803bfe2b1fe7f5e219e50ac20d6801a0a58ac75' as 'vendor/ruvector'
This commit is contained in:
469
vendor/ruvector/crates/ruvector-mincut-gated-transformer/README.md
vendored
Normal file
469
vendor/ruvector/crates/ruvector-mincut-gated-transformer/README.md
vendored
Normal file
@@ -0,0 +1,469 @@
|
||||
# Mincut-Gated Transformer
|
||||
|
||||
> **Ultra-low latency transformer inference with graph-theoretic coherence control, designed for real-time AI systems and edge deployment**
|
||||
|
||||
[](https://crates.io/crates/ruvector-mincut-gated-transformer)
|
||||
[](https://docs.rs/ruvector-mincut-gated-transformer)
|
||||
[](LICENSE)
|
||||
|
||||
## Introduction
|
||||
|
||||
The **Mincut-Gated Transformer** is a production-grade inference engine that combines minimum cut (mincut) graph partitioning with adaptive compute allocation to achieve deterministic, ultra-low latency inference. Unlike traditional transformers that execute all layers uniformly, this architecture uses graph-theoretic coherence signals to dynamically skip computation, exit early, and control state updates—all while maintaining explainability and safety guarantees.
|
||||
|
||||
**Why Mincut?** The minimum cut value (λ) of an attention graph provides a principled measure of information flow coherence. When λ is high and stable, the model can safely reduce computation. When λ drops or becomes unstable, the system conservatively executes more layers. This creates a natural feedback loop between model confidence and compute allocation.
|
||||
|
||||
### Key Innovations
|
||||
|
||||
| Innovation | Technique | Benefit |
|
||||
|-----------|-----------|---------|
|
||||
| **λ-based Mixture-of-Depths** | Route tokens using mincut delta instead of learned routers | 50% FLOPs reduction |
|
||||
| **Coherence-driven Early Exit** | Exit when λ stabilizes across layers | 30-50% latency reduction |
|
||||
| **Mincut Sparse Attention** | Use partition boundaries for sparse masks | 90% attention FLOPs reduction |
|
||||
| **Energy-based Gating** | Treat coherence as energy function | Principled compute-quality tradeoffs |
|
||||
| **Spike-driven Scheduling** | Event-driven inference on activity | 87× energy efficiency |
|
||||
| **Spectral Position Encoding** | Graph Laplacian eigenvectors via Lanczos | O(n) structural awareness |
|
||||
| **EAGLE-3 Speculative Decoding** | λ-guided draft tree verification | 3-5× decoding speedup |
|
||||
| **Mamba SSM Hybrid** | Selective state spaces with O(n) complexity | Linear-time sequence modeling |
|
||||
| **FlashAttention Tiling** | Block-wise attention with online softmax | O(n) memory, 2-4× faster |
|
||||
| **KV Cache INT4** | Hadamard transform + 2/4-bit quantization | 8-16× cache compression |
|
||||
| **RoPE with NTK/YaRN** | Context extension beyond training length | 4-32× context scaling |
|
||||
|
||||
## Features
|
||||
|
||||
### Core Capabilities
|
||||
|
||||
- **Deterministic inference** — Same inputs always produce identical outputs (bit-exact)
|
||||
- **Bounded latency** — Predictable p99 guarantees through tier-based execution
|
||||
- **Explainable decisions** — Every inference produces a witness explaining all interventions
|
||||
- **Allocation-free hot path** — Zero heap allocations during inference after initialization
|
||||
- **Safety controls** — Coherence-gated state updates prevent contamination propagation
|
||||
|
||||
### Quantization & Memory
|
||||
|
||||
- **INT8 quantization** — Full model quantization with per-tensor and per-row scaling
|
||||
- **INT4 quantization** — 2× memory reduction with per-row and block-wise scaling
|
||||
- **Arena allocator** — Single contiguous allocation for weights, 64-byte cache-aligned
|
||||
- **Sparse CSR matrices** — Efficient storage for spectral graph operations
|
||||
|
||||
### SIMD Acceleration
|
||||
|
||||
- **AVX2/FMA** (x86_64) — Vectorized GEMM, GELU, quantization with 8×32 tiling
|
||||
- **NEON** (aarch64) — ARM SIMD for mobile and edge devices
|
||||
- **Scalar fallback** — Portable implementation for all platforms
|
||||
|
||||
### Advanced Features
|
||||
|
||||
- **Lanczos algorithm** — O(n) eigenvalue computation for spectral position encoding
|
||||
- **Power iteration** — Fast dominant eigenvector extraction
|
||||
- **Prefetch hints** — Memory access optimization for sequential patterns
|
||||
- **Benchmark utilities** — Built-in profiling with GFLOPS and bandwidth metrics
|
||||
|
||||
### SOTA 2025 Features
|
||||
|
||||
- **KV Cache INT4 (RotateKV)** — Hadamard transforms for outlier smoothing, 2-bit/4-bit quantization with <0.3 PPL degradation
|
||||
- **RoPE Embeddings** — Rotary position encoding with NTK-aware and YaRN scaling for 4-32× context extension
|
||||
- **EAGLE-3 Speculative Decoding** — λ-guided draft tree generation with rejection sampling for 3-5× faster decoding
|
||||
- **FlashAttention Tiling** — Block-wise computation with online softmax, O(n) memory instead of O(n²)
|
||||
- **Mamba SSM Layer** — Selective state space models with O(n) complexity and O(1) inference memory per step
|
||||
- **Criterion Benchmarks** — Comprehensive kernel performance profiling with GFLOPS metrics
|
||||
|
||||
## Quick Start
|
||||
|
||||
```rust
|
||||
use ruvector_mincut_gated_transformer::prelude::*;
|
||||
|
||||
// Create configuration
|
||||
let config = TransformerConfig::micro();
|
||||
let policy = GatePolicy::default();
|
||||
|
||||
// Load weights (or use empty for testing)
|
||||
let weights = QuantizedWeights::empty(&config);
|
||||
|
||||
// Create transformer
|
||||
let mut transformer = MincutGatedTransformer::new(config, policy, weights)?;
|
||||
|
||||
// Create gate packet from mincut signals
|
||||
let gate = GatePacket {
|
||||
lambda: 100, // Minimum cut value
|
||||
lambda_prev: 95, // Previous lambda for delta computation
|
||||
boundary_edges: 5, // Cross-partition edge count
|
||||
boundary_concentration_q15: 8192, // ~25% concentration (Q15 format)
|
||||
partition_count: 3, // Number of detected partitions
|
||||
flags: 0,
|
||||
};
|
||||
|
||||
// Prepare input
|
||||
let input = InferInput::from_tokens(&[1, 2, 3, 4], gate);
|
||||
|
||||
// Allocate output buffer
|
||||
let mut logits = vec![0i32; config.logits as usize];
|
||||
let mut output = InferOutput::new(&mut logits);
|
||||
|
||||
// Run inference
|
||||
transformer.infer(&input, &mut output)?;
|
||||
|
||||
// Check witness for gate decisions
|
||||
println!("Decision: {:?}", output.witness.decision);
|
||||
println!("Reason: {:?}", output.witness.reason);
|
||||
println!("External writes allowed: {}", output.witness.external_writes_enabled);
|
||||
```
|
||||
|
||||
## Architecture Overview
|
||||
|
||||
```
|
||||
┌─────────────────┐
|
||||
│ Gate Packet │
|
||||
│ (λ, Δλ, edges) │
|
||||
└────────┬────────┘
|
||||
│
|
||||
Input ──────────────────►│
|
||||
▼
|
||||
┌─────────────────┐
|
||||
│ Spike Scheduler │──── Skip (tier 3)
|
||||
│ Event-driven │
|
||||
└────────┬────────┘
|
||||
│
|
||||
▼
|
||||
┌─────────────────┐
|
||||
│ Gate Controller │──── Select tier 0/1/2
|
||||
│ Coherence-gated │
|
||||
└────────┬────────┘
|
||||
│
|
||||
▼
|
||||
┌──────────────┴──────────────┐
|
||||
│ Transformer Core │
|
||||
│ ┌────────────────────────┐ │
|
||||
│ │ MoD Router (λ-based) │ │
|
||||
│ └───────────┬────────────┘ │
|
||||
│ ▼ │
|
||||
│ ┌────────────────────────┐ │
|
||||
│ │ Sparse Attention │ │
|
||||
│ │ (mincut boundaries) │ │
|
||||
│ └───────────┬────────────┘ │
|
||||
│ ▼ │
|
||||
│ ┌────────────────────────┐ │
|
||||
│ │ Early Exit Check │ │──── Exit if λ stable
|
||||
│ │ (coherence threshold) │ │
|
||||
│ └───────────┬────────────┘ │
|
||||
└──────────────┴──────────────┘
|
||||
│
|
||||
▼
|
||||
┌─────────────────┐
|
||||
│ Output + Witness│
|
||||
│ (explainable) │
|
||||
└─────────────────┘
|
||||
```
|
||||
|
||||
### Tier System
|
||||
|
||||
| Tier | Layers | Seq Len | Window | Use Case | Speedup |
|
||||
|------|--------|---------|--------|----------|---------|
|
||||
| 0 | 4 | 64 | 16 | Normal (high λ) | 1× |
|
||||
| 1 | 2 | 32 | 8 | Reduced (moderate λ) | 2-3× |
|
||||
| 2 | 1 | 8 | 4 | Safe mode (low λ) | 5-10× |
|
||||
| 3 | 0 | 0 | 0 | Skip (no spike) | 50-200× |
|
||||
|
||||
## Performance
|
||||
|
||||
### Expected Speedups
|
||||
|
||||
| Workload Type | Skip Rate | Speedup | Memory Reduction |
|
||||
|---------------|-----------|---------|------------------|
|
||||
| Streaming (low activity) | 70% | **10-15×** | 80% |
|
||||
| Interactive (bursty) | 40% | **4-6×** | 50% |
|
||||
| Continuous (high throughput) | 10% | **2-3×** | 40% |
|
||||
| Safety-critical (conservative) | 5% | **1.5-2×** | 25% |
|
||||
|
||||
### SIMD Performance (on x86_64 AVX2)
|
||||
|
||||
| Operation | Scalar | SIMD | Speedup |
|
||||
|-----------|--------|------|---------|
|
||||
| INT8 GEMM (256×256) | 12ms | 1.8ms | **6.7×** |
|
||||
| GELU activation (1024) | 45µs | 8µs | **5.6×** |
|
||||
| Quantize f32→i8 (1024) | 38µs | 7µs | **5.4×** |
|
||||
|
||||
### Memory Footprint
|
||||
|
||||
| Model Config | INT8 | INT4 | Arena Overhead |
|
||||
|-------------|------|------|----------------|
|
||||
| Micro (2L, 128H) | 1.2 MB | 0.6 MB | +64 bytes |
|
||||
| Baseline (4L, 256H) | 8.5 MB | 4.3 MB | +64 bytes |
|
||||
| Medium (12L, 768H) | ~85 MB | ~43 MB | +64 bytes |
|
||||
|
||||
## Configuration
|
||||
|
||||
### Preset Configurations
|
||||
|
||||
```rust
|
||||
// Micro: WASM, edge gateways, embedded
|
||||
let config = TransformerConfig::micro();
|
||||
// Seq: 32, Hidden: 128, Heads: 4, Layers: 2
|
||||
|
||||
// Baseline: CPU inference, development
|
||||
let config = TransformerConfig::baseline();
|
||||
// Seq: 64, Hidden: 256, Heads: 4, Layers: 4
|
||||
```
|
||||
|
||||
### Gate Policy
|
||||
|
||||
```rust
|
||||
let policy = GatePolicy {
|
||||
lambda_min: 30, // Minimum coherence threshold
|
||||
drop_ratio_q15_max: 16384, // Max λ drop (50% in Q15)
|
||||
boundary_edges_max: 20, // Max cross-partition edges
|
||||
boundary_concentration_q15_max: 24576, // Max concentration (75%)
|
||||
partitions_max: 8, // Max partition count
|
||||
spike_rate_q15_max: 26214, // Max spike rate (80%)
|
||||
allow_kv_write_when_unstable: false, // Freeze KV cache
|
||||
allow_external_write_when_unstable: false, // Block external writes
|
||||
};
|
||||
```
|
||||
|
||||
## Feature Flags
|
||||
|
||||
### Core Features
|
||||
- `sliding_window` (default) — Sliding window attention
|
||||
- `linear_attention` — Linear attention for O(n) scaling
|
||||
|
||||
### Quantization
|
||||
- `simd` — AVX2/NEON SIMD acceleration
|
||||
- `int4` — INT4 quantization support
|
||||
- `fixed_point_softmax` — Fixed-point for embedded targets
|
||||
- `rmsnorm` — RMSNorm instead of LayerNorm
|
||||
|
||||
### Advanced
|
||||
- `spectral_pe` — Spectral position encoding with Lanczos
|
||||
- `sparse_attention` — Mincut-guided sparse attention
|
||||
- `energy_gate` — Energy-based gate decisions
|
||||
- `spike_attention` — Spike-driven attention mechanism
|
||||
- `trace` — Runtime tracing and snapshots
|
||||
|
||||
### Platform
|
||||
- `wasm` — WebAssembly support
|
||||
- `no_std_gateway` — No-std for embedded gateways
|
||||
|
||||
## Current Limitations
|
||||
|
||||
| Feature | Status | Notes |
|
||||
|---------|--------|-------|
|
||||
| GPU inference | Not implemented | CUDA/Metal kernels needed |
|
||||
| KV cache persistence | ✅ **Implemented** | INT4 with Hadamard transforms |
|
||||
| Multi-head grouped query | Not implemented | GQA for memory efficiency |
|
||||
| Flash Attention | ✅ **Implemented** | CPU tiled with online softmax |
|
||||
| Rotary position embeddings | ✅ **Implemented** | RoPE with NTK/YaRN scaling |
|
||||
| Criterion benchmarks | ✅ **Implemented** | Kernel, gate, latency benchmarks |
|
||||
| GGML/GGUF format | Not implemented | Model format compatibility |
|
||||
| Batched inference | Partial | Single-sequence optimized |
|
||||
| Async/streaming output | Not implemented | Token-by-token streaming |
|
||||
| Mamba/SSM hybrid | ✅ **Implemented** | Selective state space layer |
|
||||
| Speculative decoding | ✅ **Implemented** | EAGLE-3 style with λ-guidance |
|
||||
|
||||
## Academic Foundations
|
||||
|
||||
This implementation integrates peer-reviewed research:
|
||||
|
||||
### Core Architecture
|
||||
1. **Mixture-of-Depths** (Raposo et al., 2024) — Dynamic compute allocation
|
||||
2. **LayerSkip** (Elhoushi et al., 2024) — Early exit and self-speculative decoding
|
||||
3. **MInference** (Jiang et al., 2024) — Dynamic sparse attention
|
||||
4. **Energy-Based Transformers** (Gladstone et al., 2025) — Energy-based decisions
|
||||
5. **Spike-driven Transformer** (Yao et al., 2023, 2024) — Event-driven inference
|
||||
6. **Spectral Attention** (Kreuzer et al., 2021) — Graph-based position encoding
|
||||
|
||||
### SOTA 2025 Research
|
||||
7. **RotateKV** (IJCAI 2025) — Hadamard transforms for KV cache quantization
|
||||
8. **EAGLE-3** (NeurIPS 2025) — Speculative decoding with draft tree verification
|
||||
9. **FlashAttention-3** (Dao et al., 2024) — IO-aware attention with online softmax
|
||||
10. **Mamba** (Gu & Dao, 2023) — Selective State Space Models
|
||||
11. **Mamba-2** (Dao & Gu, 2024) — Structured state space duality
|
||||
12. **RoFormer** (Su et al., 2021) — Rotary position embeddings
|
||||
13. **YaRN** (Peng et al., 2023) — Efficient context window extension
|
||||
14. **NTK-Aware Scaling** (bloc97, 2023) — Base frequency adjustment for context extension
|
||||
|
||||
See [docs/THEORY.md](docs/THEORY.md) for detailed theoretical foundations.
|
||||
|
||||
## Integration
|
||||
|
||||
### With RuVector Mincut
|
||||
|
||||
```rust
|
||||
use ruvector_mincut_gated_transformer::prelude::*;
|
||||
use ruvector_mincut::MincutEngine;
|
||||
|
||||
// Compute mincut from attention graph
|
||||
let mut mincut = MincutEngine::new(num_nodes);
|
||||
// ... add edges from attention weights ...
|
||||
let lambda = mincut.compute_mincut();
|
||||
|
||||
// Create gate packet
|
||||
let gate = GatePacket {
|
||||
lambda,
|
||||
lambda_prev: prev_lambda,
|
||||
boundary_edges: mincut.boundary_edge_count(),
|
||||
..Default::default()
|
||||
};
|
||||
|
||||
// Run gated inference
|
||||
transformer.infer(&InferInput::from_tokens(tokens, gate), &mut output)?;
|
||||
```
|
||||
|
||||
### Arena Allocator
|
||||
|
||||
```rust
|
||||
use ruvector_mincut_gated_transformer::arena::{WeightArena, calculate_arena_size};
|
||||
|
||||
// Calculate total size for model
|
||||
let size = calculate_arena_size(layers, hidden, ffn_mult, heads);
|
||||
let mut arena = WeightArena::new(size);
|
||||
|
||||
// Allocate weight slices
|
||||
let w_q = arena.alloc_i8(hidden * hidden).unwrap();
|
||||
let scales = arena.alloc_f32(hidden).unwrap();
|
||||
```
|
||||
|
||||
### INT4 Quantization
|
||||
|
||||
```rust
|
||||
use ruvector_mincut_gated_transformer::kernel::quant4::{Int4Weights, int4_gemv};
|
||||
|
||||
// Create INT4 weights from f32 (50% memory savings)
|
||||
let int4_w = Int4Weights::from_f32(&weights, rows, cols);
|
||||
|
||||
// Matrix-vector multiplication
|
||||
int4_gemv(&int4_w, &input, 1.0, &mut output);
|
||||
```
|
||||
|
||||
### KV Cache INT4 (RotateKV)
|
||||
|
||||
```rust
|
||||
use ruvector_mincut_gated_transformer::kv_cache::{QuantizedKVCache, QuantBits};
|
||||
|
||||
// Create 2-bit quantized KV cache (16× compression)
|
||||
let mut cache = QuantizedKVCache::new(
|
||||
num_layers,
|
||||
num_heads,
|
||||
head_dim,
|
||||
max_seq_len,
|
||||
QuantBits::Two,
|
||||
);
|
||||
|
||||
// Store key/value with automatic Hadamard transform
|
||||
cache.store_key(layer, head, position, &key_vector);
|
||||
cache.store_value(layer, head, position, &value_vector);
|
||||
|
||||
// Retrieve (dequantize + inverse Hadamard)
|
||||
let key = cache.get_key(layer, head, position);
|
||||
```
|
||||
|
||||
### RoPE Embeddings
|
||||
|
||||
```rust
|
||||
use ruvector_mincut_gated_transformer::rope::{RopeConfig, RopeEmbedding, RopeScaling};
|
||||
|
||||
// Standard RoPE
|
||||
let config = RopeConfig::default();
|
||||
let rope = RopeEmbedding::new(&config)?;
|
||||
|
||||
// NTK-aware scaling for 4× context extension
|
||||
let config = RopeConfig {
|
||||
scaling_type: RopeScaling::NTKAware { alpha: 4.0 },
|
||||
..Default::default()
|
||||
};
|
||||
|
||||
// Apply to Q/K vectors
|
||||
rope.apply(&mut q, &mut k, position);
|
||||
```
|
||||
|
||||
### FlashAttention Tiling
|
||||
|
||||
```rust
|
||||
use ruvector_mincut_gated_transformer::flash_attention::{
|
||||
FlashAttentionConfig, flash_attention_forward,
|
||||
};
|
||||
|
||||
let config = FlashAttentionConfig {
|
||||
block_size_q: 64,
|
||||
block_size_kv: 64,
|
||||
head_dim: 64,
|
||||
causal: true,
|
||||
softmax_scale: 0.125,
|
||||
};
|
||||
|
||||
// O(n) memory attention
|
||||
flash_attention_forward(&config, &q, &k, &v, seq_len, seq_len, &mut output);
|
||||
```
|
||||
|
||||
### Mamba SSM Layer
|
||||
|
||||
```rust
|
||||
use ruvector_mincut_gated_transformer::mamba::{MambaConfig, MambaLayer};
|
||||
|
||||
let config = MambaConfig::default();
|
||||
let mut layer = MambaLayer::new(config);
|
||||
|
||||
// Recurrent mode (O(1) memory per step)
|
||||
for token in tokens.iter() {
|
||||
let output = layer.step_recurrent(token);
|
||||
}
|
||||
|
||||
// Batch mode for training
|
||||
let outputs = layer.forward_sequence(&input_sequence);
|
||||
```
|
||||
|
||||
### EAGLE-3 Speculative Decoding
|
||||
|
||||
```rust
|
||||
use ruvector_mincut_gated_transformer::speculative::{
|
||||
SpeculativeConfig, SpeculativeDecoder,
|
||||
};
|
||||
|
||||
let config = SpeculativeConfig {
|
||||
max_draft_tokens: 8,
|
||||
tree_width: 4,
|
||||
acceptance_threshold: 0.9,
|
||||
lambda_guidance: true, // Use mincut λ for tree construction
|
||||
};
|
||||
|
||||
let mut decoder = SpeculativeDecoder::new(config, &gate_policy);
|
||||
|
||||
// Generate with speculation (3-5× faster)
|
||||
let (tokens, stats) = decoder.generate_with_speculation(
|
||||
&draft_model,
|
||||
&target_model,
|
||||
&prompt,
|
||||
max_new_tokens,
|
||||
);
|
||||
```
|
||||
|
||||
## Safety & Determinism
|
||||
|
||||
**Determinism guarantee:** For fixed `(weights, config, policy, input)`, inference always produces identical `(logits, witness)`.
|
||||
|
||||
**Safety properties:**
|
||||
- External writes blocked when coherence is low
|
||||
- KV cache frozen/flushed on instability
|
||||
- All gate decisions recorded in witness
|
||||
- No hidden state or randomness
|
||||
|
||||
**Witness fields:**
|
||||
```rust
|
||||
witness.decision // ALLOW, DEFER, QUARANTINE, SKIP
|
||||
witness.reason // Why this decision was made
|
||||
witness.external_writes_enabled // Safe to persist?
|
||||
witness.kv_action // WRITE, FREEZE, FLUSH
|
||||
```
|
||||
|
||||
## License
|
||||
|
||||
Licensed under either of Apache License 2.0 or MIT license at your option.
|
||||
|
||||
## Contributing
|
||||
|
||||
Contributions welcome! Areas of interest:
|
||||
|
||||
- GPU kernel implementations (CUDA, Metal)
|
||||
- Additional quantization formats (GPTQ, AWQ)
|
||||
- Multi-head grouped query attention (GQA)
|
||||
- GGUF/Safetensors model format loaders
|
||||
- Batched inference optimization
|
||||
- Async/streaming token output
|
||||
Reference in New Issue
Block a user