git-subtree-dir: vendor/ruvector git-subtree-split: b64c21726f2bb37286d9ee36a7869fef60cc6900
470 lines
18 KiB
Markdown
470 lines
18 KiB
Markdown
# Mincut-Gated Transformer
|
||
|
||
> **Ultra-low latency transformer inference with graph-theoretic coherence control, designed for real-time AI systems and edge deployment**
|
||
|
||
[](https://crates.io/crates/ruvector-mincut-gated-transformer)
|
||
[](https://docs.rs/ruvector-mincut-gated-transformer)
|
||
[](LICENSE)
|
||
|
||
## Introduction
|
||
|
||
The **Mincut-Gated Transformer** is a production-grade inference engine that combines minimum cut (mincut) graph partitioning with adaptive compute allocation to achieve deterministic, ultra-low latency inference. Unlike traditional transformers that execute all layers uniformly, this architecture uses graph-theoretic coherence signals to dynamically skip computation, exit early, and control state updates—all while maintaining explainability and safety guarantees.
|
||
|
||
**Why Mincut?** The minimum cut value (λ) of an attention graph provides a principled measure of information flow coherence. When λ is high and stable, the model can safely reduce computation. When λ drops or becomes unstable, the system conservatively executes more layers. This creates a natural feedback loop between model confidence and compute allocation.
|
||
|
||
### Key Innovations
|
||
|
||
| Innovation | Technique | Benefit |
|
||
|-----------|-----------|---------|
|
||
| **λ-based Mixture-of-Depths** | Route tokens using mincut delta instead of learned routers | 50% FLOPs reduction |
|
||
| **Coherence-driven Early Exit** | Exit when λ stabilizes across layers | 30-50% latency reduction |
|
||
| **Mincut Sparse Attention** | Use partition boundaries for sparse masks | 90% attention FLOPs reduction |
|
||
| **Energy-based Gating** | Treat coherence as energy function | Principled compute-quality tradeoffs |
|
||
| **Spike-driven Scheduling** | Event-driven inference on activity | 87× energy efficiency |
|
||
| **Spectral Position Encoding** | Graph Laplacian eigenvectors via Lanczos | O(n) structural awareness |
|
||
| **EAGLE-3 Speculative Decoding** | λ-guided draft tree verification | 3-5× decoding speedup |
|
||
| **Mamba SSM Hybrid** | Selective state spaces with O(n) complexity | Linear-time sequence modeling |
|
||
| **FlashAttention Tiling** | Block-wise attention with online softmax | O(n) memory, 2-4× faster |
|
||
| **KV Cache INT4** | Hadamard transform + 2/4-bit quantization | 8-16× cache compression |
|
||
| **RoPE with NTK/YaRN** | Context extension beyond training length | 4-32× context scaling |
|
||
|
||
## Features
|
||
|
||
### Core Capabilities
|
||
|
||
- **Deterministic inference** — Same inputs always produce identical outputs (bit-exact)
|
||
- **Bounded latency** — Predictable p99 guarantees through tier-based execution
|
||
- **Explainable decisions** — Every inference produces a witness explaining all interventions
|
||
- **Allocation-free hot path** — Zero heap allocations during inference after initialization
|
||
- **Safety controls** — Coherence-gated state updates prevent contamination propagation
|
||
|
||
### Quantization & Memory
|
||
|
||
- **INT8 quantization** — Full model quantization with per-tensor and per-row scaling
|
||
- **INT4 quantization** — 2× memory reduction with per-row and block-wise scaling
|
||
- **Arena allocator** — Single contiguous allocation for weights, 64-byte cache-aligned
|
||
- **Sparse CSR matrices** — Efficient storage for spectral graph operations
|
||
|
||
### SIMD Acceleration
|
||
|
||
- **AVX2/FMA** (x86_64) — Vectorized GEMM, GELU, quantization with 8×32 tiling
|
||
- **NEON** (aarch64) — ARM SIMD for mobile and edge devices
|
||
- **Scalar fallback** — Portable implementation for all platforms
|
||
|
||
### Advanced Features
|
||
|
||
- **Lanczos algorithm** — O(n) eigenvalue computation for spectral position encoding
|
||
- **Power iteration** — Fast dominant eigenvector extraction
|
||
- **Prefetch hints** — Memory access optimization for sequential patterns
|
||
- **Benchmark utilities** — Built-in profiling with GFLOPS and bandwidth metrics
|
||
|
||
### SOTA 2025 Features
|
||
|
||
- **KV Cache INT4 (RotateKV)** — Hadamard transforms for outlier smoothing, 2-bit/4-bit quantization with <0.3 PPL degradation
|
||
- **RoPE Embeddings** — Rotary position encoding with NTK-aware and YaRN scaling for 4-32× context extension
|
||
- **EAGLE-3 Speculative Decoding** — λ-guided draft tree generation with rejection sampling for 3-5× faster decoding
|
||
- **FlashAttention Tiling** — Block-wise computation with online softmax, O(n) memory instead of O(n²)
|
||
- **Mamba SSM Layer** — Selective state space models with O(n) complexity and O(1) inference memory per step
|
||
- **Criterion Benchmarks** — Comprehensive kernel performance profiling with GFLOPS metrics
|
||
|
||
## Quick Start
|
||
|
||
```rust
|
||
use ruvector_mincut_gated_transformer::prelude::*;
|
||
|
||
// Create configuration
|
||
let config = TransformerConfig::micro();
|
||
let policy = GatePolicy::default();
|
||
|
||
// Load weights (or use empty for testing)
|
||
let weights = QuantizedWeights::empty(&config);
|
||
|
||
// Create transformer
|
||
let mut transformer = MincutGatedTransformer::new(config, policy, weights)?;
|
||
|
||
// Create gate packet from mincut signals
|
||
let gate = GatePacket {
|
||
lambda: 100, // Minimum cut value
|
||
lambda_prev: 95, // Previous lambda for delta computation
|
||
boundary_edges: 5, // Cross-partition edge count
|
||
boundary_concentration_q15: 8192, // ~25% concentration (Q15 format)
|
||
partition_count: 3, // Number of detected partitions
|
||
flags: 0,
|
||
};
|
||
|
||
// Prepare input
|
||
let input = InferInput::from_tokens(&[1, 2, 3, 4], gate);
|
||
|
||
// Allocate output buffer
|
||
let mut logits = vec![0i32; config.logits as usize];
|
||
let mut output = InferOutput::new(&mut logits);
|
||
|
||
// Run inference
|
||
transformer.infer(&input, &mut output)?;
|
||
|
||
// Check witness for gate decisions
|
||
println!("Decision: {:?}", output.witness.decision);
|
||
println!("Reason: {:?}", output.witness.reason);
|
||
println!("External writes allowed: {}", output.witness.external_writes_enabled);
|
||
```
|
||
|
||
## Architecture Overview
|
||
|
||
```
|
||
┌─────────────────┐
|
||
│ Gate Packet │
|
||
│ (λ, Δλ, edges) │
|
||
└────────┬────────┘
|
||
│
|
||
Input ──────────────────►│
|
||
▼
|
||
┌─────────────────┐
|
||
│ Spike Scheduler │──── Skip (tier 3)
|
||
│ Event-driven │
|
||
└────────┬────────┘
|
||
│
|
||
▼
|
||
┌─────────────────┐
|
||
│ Gate Controller │──── Select tier 0/1/2
|
||
│ Coherence-gated │
|
||
└────────┬────────┘
|
||
│
|
||
▼
|
||
┌──────────────┴──────────────┐
|
||
│ Transformer Core │
|
||
│ ┌────────────────────────┐ │
|
||
│ │ MoD Router (λ-based) │ │
|
||
│ └───────────┬────────────┘ │
|
||
│ ▼ │
|
||
│ ┌────────────────────────┐ │
|
||
│ │ Sparse Attention │ │
|
||
│ │ (mincut boundaries) │ │
|
||
│ └───────────┬────────────┘ │
|
||
│ ▼ │
|
||
│ ┌────────────────────────┐ │
|
||
│ │ Early Exit Check │ │──── Exit if λ stable
|
||
│ │ (coherence threshold) │ │
|
||
│ └───────────┬────────────┘ │
|
||
└──────────────┴──────────────┘
|
||
│
|
||
▼
|
||
┌─────────────────┐
|
||
│ Output + Witness│
|
||
│ (explainable) │
|
||
└─────────────────┘
|
||
```
|
||
|
||
### Tier System
|
||
|
||
| Tier | Layers | Seq Len | Window | Use Case | Speedup |
|
||
|------|--------|---------|--------|----------|---------|
|
||
| 0 | 4 | 64 | 16 | Normal (high λ) | 1× |
|
||
| 1 | 2 | 32 | 8 | Reduced (moderate λ) | 2-3× |
|
||
| 2 | 1 | 8 | 4 | Safe mode (low λ) | 5-10× |
|
||
| 3 | 0 | 0 | 0 | Skip (no spike) | 50-200× |
|
||
|
||
## Performance
|
||
|
||
### Expected Speedups
|
||
|
||
| Workload Type | Skip Rate | Speedup | Memory Reduction |
|
||
|---------------|-----------|---------|------------------|
|
||
| Streaming (low activity) | 70% | **10-15×** | 80% |
|
||
| Interactive (bursty) | 40% | **4-6×** | 50% |
|
||
| Continuous (high throughput) | 10% | **2-3×** | 40% |
|
||
| Safety-critical (conservative) | 5% | **1.5-2×** | 25% |
|
||
|
||
### SIMD Performance (on x86_64 AVX2)
|
||
|
||
| Operation | Scalar | SIMD | Speedup |
|
||
|-----------|--------|------|---------|
|
||
| INT8 GEMM (256×256) | 12ms | 1.8ms | **6.7×** |
|
||
| GELU activation (1024) | 45µs | 8µs | **5.6×** |
|
||
| Quantize f32→i8 (1024) | 38µs | 7µs | **5.4×** |
|
||
|
||
### Memory Footprint
|
||
|
||
| Model Config | INT8 | INT4 | Arena Overhead |
|
||
|-------------|------|------|----------------|
|
||
| Micro (2L, 128H) | 1.2 MB | 0.6 MB | +64 bytes |
|
||
| Baseline (4L, 256H) | 8.5 MB | 4.3 MB | +64 bytes |
|
||
| Medium (12L, 768H) | ~85 MB | ~43 MB | +64 bytes |
|
||
|
||
## Configuration
|
||
|
||
### Preset Configurations
|
||
|
||
```rust
|
||
// Micro: WASM, edge gateways, embedded
|
||
let config = TransformerConfig::micro();
|
||
// Seq: 32, Hidden: 128, Heads: 4, Layers: 2
|
||
|
||
// Baseline: CPU inference, development
|
||
let config = TransformerConfig::baseline();
|
||
// Seq: 64, Hidden: 256, Heads: 4, Layers: 4
|
||
```
|
||
|
||
### Gate Policy
|
||
|
||
```rust
|
||
let policy = GatePolicy {
|
||
lambda_min: 30, // Minimum coherence threshold
|
||
drop_ratio_q15_max: 16384, // Max λ drop (50% in Q15)
|
||
boundary_edges_max: 20, // Max cross-partition edges
|
||
boundary_concentration_q15_max: 24576, // Max concentration (75%)
|
||
partitions_max: 8, // Max partition count
|
||
spike_rate_q15_max: 26214, // Max spike rate (80%)
|
||
allow_kv_write_when_unstable: false, // Freeze KV cache
|
||
allow_external_write_when_unstable: false, // Block external writes
|
||
};
|
||
```
|
||
|
||
## Feature Flags
|
||
|
||
### Core Features
|
||
- `sliding_window` (default) — Sliding window attention
|
||
- `linear_attention` — Linear attention for O(n) scaling
|
||
|
||
### Quantization
|
||
- `simd` — AVX2/NEON SIMD acceleration
|
||
- `int4` — INT4 quantization support
|
||
- `fixed_point_softmax` — Fixed-point for embedded targets
|
||
- `rmsnorm` — RMSNorm instead of LayerNorm
|
||
|
||
### Advanced
|
||
- `spectral_pe` — Spectral position encoding with Lanczos
|
||
- `sparse_attention` — Mincut-guided sparse attention
|
||
- `energy_gate` — Energy-based gate decisions
|
||
- `spike_attention` — Spike-driven attention mechanism
|
||
- `trace` — Runtime tracing and snapshots
|
||
|
||
### Platform
|
||
- `wasm` — WebAssembly support
|
||
- `no_std_gateway` — No-std for embedded gateways
|
||
|
||
## Current Limitations
|
||
|
||
| Feature | Status | Notes |
|
||
|---------|--------|-------|
|
||
| GPU inference | Not implemented | CUDA/Metal kernels needed |
|
||
| KV cache persistence | ✅ **Implemented** | INT4 with Hadamard transforms |
|
||
| Multi-head grouped query | Not implemented | GQA for memory efficiency |
|
||
| Flash Attention | ✅ **Implemented** | CPU tiled with online softmax |
|
||
| Rotary position embeddings | ✅ **Implemented** | RoPE with NTK/YaRN scaling |
|
||
| Criterion benchmarks | ✅ **Implemented** | Kernel, gate, latency benchmarks |
|
||
| GGML/GGUF format | Not implemented | Model format compatibility |
|
||
| Batched inference | Partial | Single-sequence optimized |
|
||
| Async/streaming output | Not implemented | Token-by-token streaming |
|
||
| Mamba/SSM hybrid | ✅ **Implemented** | Selective state space layer |
|
||
| Speculative decoding | ✅ **Implemented** | EAGLE-3 style with λ-guidance |
|
||
|
||
## Academic Foundations
|
||
|
||
This implementation integrates peer-reviewed research:
|
||
|
||
### Core Architecture
|
||
1. **Mixture-of-Depths** (Raposo et al., 2024) — Dynamic compute allocation
|
||
2. **LayerSkip** (Elhoushi et al., 2024) — Early exit and self-speculative decoding
|
||
3. **MInference** (Jiang et al., 2024) — Dynamic sparse attention
|
||
4. **Energy-Based Transformers** (Gladstone et al., 2025) — Energy-based decisions
|
||
5. **Spike-driven Transformer** (Yao et al., 2023, 2024) — Event-driven inference
|
||
6. **Spectral Attention** (Kreuzer et al., 2021) — Graph-based position encoding
|
||
|
||
### SOTA 2025 Research
|
||
7. **RotateKV** (IJCAI 2025) — Hadamard transforms for KV cache quantization
|
||
8. **EAGLE-3** (NeurIPS 2025) — Speculative decoding with draft tree verification
|
||
9. **FlashAttention-3** (Dao et al., 2024) — IO-aware attention with online softmax
|
||
10. **Mamba** (Gu & Dao, 2023) — Selective State Space Models
|
||
11. **Mamba-2** (Dao & Gu, 2024) — Structured state space duality
|
||
12. **RoFormer** (Su et al., 2021) — Rotary position embeddings
|
||
13. **YaRN** (Peng et al., 2023) — Efficient context window extension
|
||
14. **NTK-Aware Scaling** (bloc97, 2023) — Base frequency adjustment for context extension
|
||
|
||
See [docs/THEORY.md](docs/THEORY.md) for detailed theoretical foundations.
|
||
|
||
## Integration
|
||
|
||
### With RuVector Mincut
|
||
|
||
```rust
|
||
use ruvector_mincut_gated_transformer::prelude::*;
|
||
use ruvector_mincut::MincutEngine;
|
||
|
||
// Compute mincut from attention graph
|
||
let mut mincut = MincutEngine::new(num_nodes);
|
||
// ... add edges from attention weights ...
|
||
let lambda = mincut.compute_mincut();
|
||
|
||
// Create gate packet
|
||
let gate = GatePacket {
|
||
lambda,
|
||
lambda_prev: prev_lambda,
|
||
boundary_edges: mincut.boundary_edge_count(),
|
||
..Default::default()
|
||
};
|
||
|
||
// Run gated inference
|
||
transformer.infer(&InferInput::from_tokens(tokens, gate), &mut output)?;
|
||
```
|
||
|
||
### Arena Allocator
|
||
|
||
```rust
|
||
use ruvector_mincut_gated_transformer::arena::{WeightArena, calculate_arena_size};
|
||
|
||
// Calculate total size for model
|
||
let size = calculate_arena_size(layers, hidden, ffn_mult, heads);
|
||
let mut arena = WeightArena::new(size);
|
||
|
||
// Allocate weight slices
|
||
let w_q = arena.alloc_i8(hidden * hidden).unwrap();
|
||
let scales = arena.alloc_f32(hidden).unwrap();
|
||
```
|
||
|
||
### INT4 Quantization
|
||
|
||
```rust
|
||
use ruvector_mincut_gated_transformer::kernel::quant4::{Int4Weights, int4_gemv};
|
||
|
||
// Create INT4 weights from f32 (50% memory savings)
|
||
let int4_w = Int4Weights::from_f32(&weights, rows, cols);
|
||
|
||
// Matrix-vector multiplication
|
||
int4_gemv(&int4_w, &input, 1.0, &mut output);
|
||
```
|
||
|
||
### KV Cache INT4 (RotateKV)
|
||
|
||
```rust
|
||
use ruvector_mincut_gated_transformer::kv_cache::{QuantizedKVCache, QuantBits};
|
||
|
||
// Create 2-bit quantized KV cache (16× compression)
|
||
let mut cache = QuantizedKVCache::new(
|
||
num_layers,
|
||
num_heads,
|
||
head_dim,
|
||
max_seq_len,
|
||
QuantBits::Two,
|
||
);
|
||
|
||
// Store key/value with automatic Hadamard transform
|
||
cache.store_key(layer, head, position, &key_vector);
|
||
cache.store_value(layer, head, position, &value_vector);
|
||
|
||
// Retrieve (dequantize + inverse Hadamard)
|
||
let key = cache.get_key(layer, head, position);
|
||
```
|
||
|
||
### RoPE Embeddings
|
||
|
||
```rust
|
||
use ruvector_mincut_gated_transformer::rope::{RopeConfig, RopeEmbedding, RopeScaling};
|
||
|
||
// Standard RoPE
|
||
let config = RopeConfig::default();
|
||
let rope = RopeEmbedding::new(&config)?;
|
||
|
||
// NTK-aware scaling for 4× context extension
|
||
let config = RopeConfig {
|
||
scaling_type: RopeScaling::NTKAware { alpha: 4.0 },
|
||
..Default::default()
|
||
};
|
||
|
||
// Apply to Q/K vectors
|
||
rope.apply(&mut q, &mut k, position);
|
||
```
|
||
|
||
### FlashAttention Tiling
|
||
|
||
```rust
|
||
use ruvector_mincut_gated_transformer::flash_attention::{
|
||
FlashAttentionConfig, flash_attention_forward,
|
||
};
|
||
|
||
let config = FlashAttentionConfig {
|
||
block_size_q: 64,
|
||
block_size_kv: 64,
|
||
head_dim: 64,
|
||
causal: true,
|
||
softmax_scale: 0.125,
|
||
};
|
||
|
||
// O(n) memory attention
|
||
flash_attention_forward(&config, &q, &k, &v, seq_len, seq_len, &mut output);
|
||
```
|
||
|
||
### Mamba SSM Layer
|
||
|
||
```rust
|
||
use ruvector_mincut_gated_transformer::mamba::{MambaConfig, MambaLayer};
|
||
|
||
let config = MambaConfig::default();
|
||
let mut layer = MambaLayer::new(config);
|
||
|
||
// Recurrent mode (O(1) memory per step)
|
||
for token in tokens.iter() {
|
||
let output = layer.step_recurrent(token);
|
||
}
|
||
|
||
// Batch mode for training
|
||
let outputs = layer.forward_sequence(&input_sequence);
|
||
```
|
||
|
||
### EAGLE-3 Speculative Decoding
|
||
|
||
```rust
|
||
use ruvector_mincut_gated_transformer::speculative::{
|
||
SpeculativeConfig, SpeculativeDecoder,
|
||
};
|
||
|
||
let config = SpeculativeConfig {
|
||
max_draft_tokens: 8,
|
||
tree_width: 4,
|
||
acceptance_threshold: 0.9,
|
||
lambda_guidance: true, // Use mincut λ for tree construction
|
||
};
|
||
|
||
let mut decoder = SpeculativeDecoder::new(config, &gate_policy);
|
||
|
||
// Generate with speculation (3-5× faster)
|
||
let (tokens, stats) = decoder.generate_with_speculation(
|
||
&draft_model,
|
||
&target_model,
|
||
&prompt,
|
||
max_new_tokens,
|
||
);
|
||
```
|
||
|
||
## Safety & Determinism
|
||
|
||
**Determinism guarantee:** For fixed `(weights, config, policy, input)`, inference always produces identical `(logits, witness)`.
|
||
|
||
**Safety properties:**
|
||
- External writes blocked when coherence is low
|
||
- KV cache frozen/flushed on instability
|
||
- All gate decisions recorded in witness
|
||
- No hidden state or randomness
|
||
|
||
**Witness fields:**
|
||
```rust
|
||
witness.decision // ALLOW, DEFER, QUARANTINE, SKIP
|
||
witness.reason // Why this decision was made
|
||
witness.external_writes_enabled // Safe to persist?
|
||
witness.kv_action // WRITE, FREEZE, FLUSH
|
||
```
|
||
|
||
## License
|
||
|
||
Licensed under either of Apache License 2.0 or MIT license at your option.
|
||
|
||
## Contributing
|
||
|
||
Contributions welcome! Areas of interest:
|
||
|
||
- GPU kernel implementations (CUDA, Metal)
|
||
- Additional quantization formats (GPTQ, AWQ)
|
||
- Multi-head grouped query attention (GQA)
|
||
- GGUF/Safetensors model format loaders
|
||
- Batched inference optimization
|
||
- Async/streaming token output
|