# ADR-015: Coherence-Gated Transformer (Sheaf Attention) **Status**: Proposed **Date**: 2026-01-22 **Authors**: ruv.io, RuVector Team **Deciders**: Architecture Review Board **Target Crate**: `ruvector-attention` ## Version History | Version | Date | Author | Changes | |---------|------|--------|---------| | 0.1 | 2026-01-22 | ruv.io | Initial proposal for coherence-gated attention | --- ## Context ### The Transformer Latency Problem Standard transformers have fundamental efficiency issues: 1. **Quadratic attention**: O(N²) for sequence length N 2. **Fixed computation**: Every token gets same compute regardless of difficulty 3. **Dense by default**: All attention weights computed even when most are near-zero 4. **Confidence-based exits**: Early exit uses unreliable confidence scores ### Existing Solutions and Their Limits | Approach | Method | Limitation | |----------|--------|------------| | Flash Attention | Memory-efficient matmul | Still O(N²) compute | | Sparse Attention | Fixed patterns (local, strided) | Patterns don't adapt to content | | Linear Attention | Kernel approximation | Quality degradation | | Early Exit | Confidence threshold | Confidence ≠ correctness | | MoE | Expert routing | Routing is learned, not principled | ### The Coherence Insight Prime-Radiant's coherence engine provides a **mathematically grounded** measure of consistency. This can be applied to attention: > **Core idea**: Tokens that are already coherent with context don't need expensive attention. Route computation based on coherence energy, not learned confidence. --- ## Decision ### Implement Coherence-Gated Transformer (CGT) in `ruvector-attention` A novel attention mechanism that uses sheaf coherence to: 1. **Route tokens** to different compute depths 2. **Sparsify attention** based on residual energy 3. **Exit early** when energy converges 4. 
**Replace QKV projections** with restriction maps --- ## Architecture ### High-Level Design ``` ┌─────────────────────────────────────────────────────────────────────────────┐ │ COHERENCE-GATED TRANSFORMER (CGT) │ │ │ │ ┌─────────────────────────────────────────────────────────────────────────┐│ │ │ INPUT PROCESSING ││ │ │ Tokens ──► Embedding ──► Initial Coherence Graph ││ │ └─────────────────────────────────────────────────────────────────────────┘│ │ │ │ │ ▼ │ │ ┌─────────────────────────────────────────────────────────────────────────┐│ │ │ COHERENCE ROUTER ││ │ │ ││ │ │ For each token t: ││ │ │ E(t) = Σ w_e ||ρ_t(x_t) - ρ_ctx(x_ctx)||² ││ │ │ ││ │ │ Route based on energy: ││ │ │ ┌──────────────┬──────────────┬──────────────┐ ││ │ │ │ E < θ_reflex │ E < θ_std │ E ≥ θ_std │ ││ │ │ │ │ │ │ │ │ │ ││ │ │ │ ▼ │ ▼ │ ▼ │ ││ │ │ │ LANE 0 │ LANE 1 │ LANE 2 │ ││ │ │ │ Reflex │ Standard │ Deep │ ││ │ │ └──────────────┴──────────────┴──────────────┘ ││ │ └─────────────────────────────────────────────────────────────────────────┘│ │ │ │ │ ┌────────────────────────────┼────────────────────────────┐ │ │ │ │ │ │ │ ▼ ▼ ▼ │ │ ┌──────────┐ ┌──────────┐ ┌──────────┐ │ │ │ LANE 0 │ │ LANE 1 │ │ LANE 2 │ │ │ │ REFLEX │ │ STANDARD │ │ DEEP │ │ │ │ │ │ │ │ │ │ │ │ • 1-2 layers │ • 6 layers│ │ • 12+ layers │ │ │ • Local attention │ • Sparse │ │ • Full + MoE │ │ │ (window=64) │ sheaf │ │ • All experts │ │ │ • No FFN │ attn │ │ • Spectral │ │ │ • <0.1ms │ • ~1ms │ │ analysis │ │ │ │ │ │ • ~5ms │ │ └──────────┘ └──────────┘ └──────────┘ │ │ │ │ │ │ │ └────────────────────────────┼────────────────────────────┘ │ │ ▼ │ │ ┌─────────────────────────────────────────────────────────────────────────┐│ │ │ COHERENCE VERIFICATION ││ │ │ ││ │ │ E_final = compute_energy(output_graph) ││ │ │ ││ │ │ if E_final > θ_max: ││ │ │ → Escalate to Lane 2 OR refuse generation ││ │ │ else: ││ │ │ → Output with witness ││ │ └─────────────────────────────────────────────────────────────────────────┘│ │ │ │ │ ▼ │ 
│ Output + Witness │ └─────────────────────────────────────────────────────────────────────────────┘ ``` ### Component Details #### 1. Sheaf Attention Layer Replace standard scaled dot-product attention with coherence-based attention: ``` Standard Attention: Attention(Q, K, V) = softmax(QK^T / √d) V Sheaf Attention: R_ij = ||ρ_i(x_i) - ρ_j(x_j)||² # Residual energy A_ij = exp(-β × R_ij) / Σ_k exp(-β × R_ik) # Coherence-based weight Output = A × V ``` **Key difference**: Attention weight is inversely proportional to residual energy. - High residual (incoherent) → Low attention (don't propagate inconsistency) - Low residual (coherent) → High attention (reinforce consistency) #### 2. Restriction Map Projections Replace learned W_q, W_k, W_v with restriction maps: ``` Standard: Q = W_q × x (learned projection) K = W_k × x V = W_v × x Sheaf: Q = ρ_q(x) (restriction map to query manifold) K = ρ_k(x) (restriction map to key manifold) V = ρ_v(x) (restriction map to value manifold) ``` **Benefits**: - Restriction maps have geometric meaning (project to shared space) - Can be initialized from domain knowledge - Residuals are interpretable #### 3. Token-Level Compute Routing ```python def route_token(token_embedding, context_graph): # Compute coherence energy with context energy = compute_token_energy(token_embedding, context_graph) if energy < THETA_REFLEX: return Lane.REFLEX # Minimal compute elif energy < THETA_STANDARD: return Lane.STANDARD # Normal compute else: return Lane.DEEP # Maximum compute ``` **Routing thresholds** (tunable via SONA): | Threshold | Default | Meaning | |-----------|---------|---------| | θ_reflex | 0.01 | Token is highly coherent with context | | θ_standard | 0.1 | Token has minor inconsistencies | | θ_deep | 1.0 | Token has major inconsistencies | #### 4. 
Residual-Sparse Attention Only compute attention for token pairs with high residual: ```python def sparse_sheaf_attention(X, threshold): N = len(X) attention_mask = zeros(N, N) for i in range(N): for j in range(N): residual = compute_residual(X[i], X[j]) if residual > threshold: # These tokens are incoherent - need attention attention_mask[i, j] = 1 # else: skip attention (already coherent) # Compute attention only for non-zero mask entries return masked_attention(X, attention_mask) ``` **Sparsity pattern**: Adapts to content, not fixed like local/strided attention. #### 5. Energy-Based Early Exit ```python def forward_with_early_exit(x, layers, epsilon=0.001): prev_energy = float('inf') for layer in layers: x = layer(x) curr_energy = compute_energy(x) delta = abs(curr_energy - prev_energy) if delta < epsilon: # Energy converged - no need for more layers return x prev_energy = curr_energy return x ``` **Exit criterion**: Energy convergence, not confidence threshold. --- ## Compute Lane Specifications ### Lane 0: Reflex (~0.1ms) ``` Layers: 1-2 Attention: Local only (window=64) FFN: Skip or minimal Use case: Common tokens, clear context Example: "the", "is", "and" in well-formed sentences ``` ### Lane 1: Standard (~1ms) ``` Layers: 6 Attention: Sparse sheaf (residual > 0.05) FFN: Standard Use case: Normal tokens requiring context integration Example: Most content words ``` ### Lane 2: Deep (~5ms) ``` Layers: 12+ Attention: Full sheaf + MoE routing FFN: Expert mixture Spectral: Eigenvalue analysis for structural issues Use case: Ambiguous, contradictory, or complex tokens Example: "bank" (river or financial?), negations, rare words ``` ### Lane 3: Escalate (async) ``` Action: Return uncertainty, request clarification Use case: Irreconcilable incoherence Example: "The cat is not a cat" - logical contradiction ``` --- ## Mathematical Foundation ### Sheaf Attention Formula Given tokens X = {x_1, ..., x_N} and restriction maps ρ_i, ρ_j: **Residual**: ``` r_ij = ρ_i(x_i) 
- ρ_j(x_j) ``` **Edge energy**: ``` E_ij = w_ij × ||r_ij||² ``` **Token energy**: ``` E_i = Σ_j E_ij (sum over edges incident to i) ``` **Attention weight** (coherence-based): ``` A_ij = exp(-β × E_ij) / Σ_k exp(-β × E_ik) ``` **Output**: ``` y_i = Σ_j A_ij × V_j ``` ### Complexity Analysis | Operation | Standard | Sheaf (Dense) | Sheaf (Sparse, s% non-zero) | |-----------|----------|---------------|----------------------------| | Attention | O(N²d) | O(N²d) | O(s×N²d) | | Routing | - | O(Nd) | O(Nd) | | Early exit | - | O(Ld) per check | O(Ld) per check | | **Total** | O(N²Ld) | O(N²Ld) | O(s×N²Ld + routing) | With typical s=10-20% sparsity and 50% early exit: **5-10x speedup**. --- ## Integration with `ruvector-attention` ### New Modules ``` ruvector-attention/ ├── src/ │ ├── sheaf/ # NEW: Sheaf attention │ │ ├── mod.rs │ │ ├── attention.rs # SheafAttention layer │ │ ├── restriction.rs # Restriction map projections │ │ ├── router.rs # Token-level routing │ │ ├── sparse.rs # Residual-sparse attention │ │ └── early_exit.rs # Energy-based early exit │ │ │ ├── coherence_gated/ # NEW: Full CGT implementation │ │ ├── mod.rs │ │ ├── transformer.rs # CoherenceGatedTransformer │ │ ├── lane.rs # ComputeLane enum + configs │ │ ├── config.rs # CGTConfig │ │ └── benchmark.rs # Latency/quality benchmarks │ │ │ └── ... 
(existing modules)
```

### New Types

```rust
/// Sheaf-based attention layer
pub struct SheafAttention {
    /// Restriction map for queries
    pub rho_query: RestrictionMap,
    /// Restriction map for keys
    pub rho_key: RestrictionMap,
    /// Restriction map for values
    pub rho_value: RestrictionMap,
    /// Inverse temperature for the attention softmax
    pub beta: f32,
    /// Sparsity threshold
    pub sparsity_threshold: f32,
}

/// Compute lane for token routing
#[derive(Debug, Clone, Copy)]
pub enum ComputeLane {
    /// Minimal compute (<0.1ms)
    Reflex,
    /// Standard compute (~1ms)
    Standard,
    /// Deep compute (~5ms)
    Deep,
    /// Escalate to caller
    Escalate,
}

/// Coherence-Gated Transformer configuration
pub struct CGTConfig {
    /// Embedding dimension
    pub d_model: usize,
    /// Layers per lane
    pub layers_per_lane: [usize; 3], // [reflex, standard, deep]
    /// Routing thresholds
    pub thresholds: CoherenceThresholds,
    /// Sparsity settings
    pub sparsity: SparsityConfig,
    /// Early exit settings
    pub early_exit: EarlyExitConfig,
}

/// Token routing decision
pub struct RoutingDecision {
    pub token_id: usize,
    pub energy: f32,
    pub lane: ComputeLane,
    pub attention_mask: Option<AttentionMask>,
}
```

### Feature Flags

```toml
[features]
# Sheaf attention (requires prime-radiant)
sheaf = ["dep:prime-radiant"]

# Full CGT implementation
coherence-gated = ["sheaf", "sparse", "moe"]

# Benchmarking utilities
cgt-bench = ["coherence-gated", "criterion"]
```

---

## Performance Targets

| Metric | Standard Transformer | CGT Target | Improvement |
|--------|---------------------|------------|-------------|
| Average latency (128 tokens) | 10ms | 1-2ms | 5-10x |
| P99 latency (128 tokens) | 15ms | 8ms | 2x |
| Memory (batch=32) | 2GB | 800MB | 2.5x |
| Quality (perplexity) | Baseline | <5% degradation | Acceptable |

### Latency Breakdown

```
Standard (10ms total):
  Attention: 6ms (60%)
  FFN:       3ms (30%)
  Other:     1ms (10%)

CGT Target (2ms total):
  Routing:             0.1ms (5%)
  Attention (sparse):  1ms   (50%)
  FFN (conditional):   0.7ms (35%)
  Other:               0.2ms (10%)
```

---

## Quality Guarantees

### Coherence Bound

Every output is guaranteed to have coherence energy below the threshold:

```
E(output) < θ_max   OR   escalate/refuse
```

This is **stronger** than confidence-based systems, which can be confidently wrong.

### Graceful Degradation

Under compute pressure:

1. Raise θ_reflex → more tokens to Lane 0
2. Increase sparsity threshold → fewer attention computations
3. Quality degrades **predictably** (energy increases)

### Interpretability

For any output:

- Which tokens went to which lane?
- Which token pairs had high residuals?
- Where did the model "struggle"?

---

## Comparison with Existing Approaches

| Feature | Flash Attention | Sparse Transformers | MoE | CGT (Ours) |
|---------|-----------------|---------------------|-----|------------|
| Adaptive compute | No | No | Yes | Yes |
| Content-based sparsity | No | No | Partial | Yes |
| Mathematical grounding | No | No | No | Yes (sheaf) |
| Quality guarantee | No | No | No | Yes (energy bound) |
| Interpretable routing | N/A | N/A | Partial | Yes |
| Early exit criterion | N/A | N/A | Confidence | Energy convergence |

---

## Research Questions

1. **Restriction map initialization**: Random vs. pre-trained vs. analytical?
2. **Threshold tuning**: Can SONA auto-tune θ values during inference?
3. **Multi-head sheaf attention**: One graph per head, or shared graph?
4. **Training objective**: Standard cross-entropy + energy regularization?
5. **Hardware optimization**: Can residual computation be fused with attention kernels?
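As a reference point for the implementation phases below, the formulas from the Mathematical Foundation section (residual r_ij, energy E_ij, coherence-based weight A_ij) can be condensed into a minimal, self-contained NumPy sketch. This is illustrative only, not the `ruvector-attention` API: restriction maps are modeled as plain linear projections, and edge weights w_ij are fixed to 1.

```python
import numpy as np

def sheaf_attention(X, rho_q, rho_k, rho_v, beta=1.0):
    """Minimal sheaf attention sketch.

    E_ij = ||rho_q(x_i) - rho_k(x_j)||^2   (residual energy, w_ij = 1)
    A_ij = softmax_j(-beta * E_ij)          (coherence-based weight)
    y_i  = sum_j A_ij * rho_v(x_j)
    """
    Q = X @ rho_q.T                              # restriction to query manifold
    K = X @ rho_k.T                              # restriction to key manifold
    V = X @ rho_v.T                              # restriction to value manifold
    diff = Q[:, None, :] - K[None, :, :]         # residual vectors r_ij
    E = np.sum(diff ** 2, axis=-1)               # residual energies E_ij >= 0
    logits = -beta * E
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    A = np.exp(logits)
    A /= A.sum(axis=1, keepdims=True)            # rows sum to 1
    return A @ V, A, E

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))
I = np.eye(8)
# With identical query/key maps, E_ii = 0, so each token attends most to itself.
Y, A, E = sheaf_attention(X, I, I, I, beta=0.5)
```

Because A_ij decays exponentially in E_ij, incoherent pairs contribute almost nothing to the output, matching the "don't propagate inconsistency" rule; the returned E matrix is also the quantity the residual-sparse mask and the router would threshold against.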
--- ## Implementation Phases ### Phase 1: Foundation (Weeks 1-4) - [ ] `SheafAttention` layer with restriction maps - [ ] Basic residual computation - [ ] Unit tests for mathematical correctness ### Phase 2: Routing (Weeks 5-8) - [ ] `ComputeLane` enum and routing logic - [ ] Token-level energy computation - [ ] Lane-specific layer configurations ### Phase 3: Sparsity (Weeks 9-12) - [ ] Residual-sparse attention mask generation - [ ] Efficient sparse attention kernel - [ ] Sparsity pattern analysis tools ### Phase 4: Integration (Weeks 13-16) - [ ] `CoherenceGatedTransformer` full implementation - [ ] Early exit with energy convergence - [ ] Benchmarking suite ### Phase 5: Optimization (Weeks 17-20) - [ ] SIMD optimization for residual computation - [ ] Kernel fusion opportunities - [ ] SONA integration for threshold tuning --- ## Dependencies ### Required - `prime-radiant` (coherence computation) - `ruvector-core` (vector operations) - `ndarray` (matrix operations) ### Optional - `rayon` (parallel routing) - `criterion` (benchmarking) --- ## References 1. Hansen, J., & Ghrist, R. (2019). "Toward a spectral theory of cellular sheaves." 2. Vaswani et al. (2017). "Attention Is All You Need." 3. Kitaev et al. (2020). "Reformer: The Efficient Transformer." 4. Fedus et al. (2022). "Switch Transformers: Scaling to Trillion Parameter Models." 5. 
ADR-014: Coherence Engine Architecture --- ## Related Decisions - **ADR-014**: Coherence Engine Architecture (Prime-Radiant) - **ADR-003**: SIMD Optimization Strategy - **ADR-006**: Memory Management --- ## Appendix: Name Options | Name | Rationale | |------|-----------| | **Coherence-Gated Transformer (CGT)** | Descriptive, clear function | | **Sheaf Attention** | Mathematical foundation | | **Residual-Routed Transformer** | Emphasizes routing mechanism | | **Energy-Adaptive Transformer** | Emphasizes efficiency | | **Prime Transformer** | Connection to Prime-Radiant | **Recommended**: "Coherence-Gated Transformer (CGT)" for the architecture, "Sheaf Attention" for the attention mechanism.
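## Appendix: Routing Sketch

To make the compute-lane mechanics concrete, here is a minimal sketch of the token router and the energy-convergence early exit. All names here are illustrative assumptions, not the crate's API: the threshold constants mirror the defaults in the routing table, `token_energy` is a toy stand-in for the sheaf energy E(t) (not the `prime-radiant` implementation), and escalation at θ_max is folded into the router for brevity, whereas the ADR performs it as a post-hoc verification step.

```python
import numpy as np
from enum import Enum

class Lane(Enum):
    REFLEX = 0
    STANDARD = 1
    DEEP = 2
    ESCALATE = 3

# Illustrative defaults, mirroring the routing-threshold table.
THETA_REFLEX, THETA_STANDARD, THETA_MAX = 0.01, 0.1, 1.0

def token_energy(x, context):
    """Mean squared residual of a token against its context vectors
    (toy stand-in for E(t) = sum_e w_e ||rho_t(x_t) - rho_ctx(x_ctx)||^2)."""
    return float(np.mean(np.sum((context - x) ** 2, axis=-1)))

def route_token(x, context):
    """Map coherence energy to a compute lane."""
    e = token_energy(x, context)
    if e < THETA_REFLEX:
        return Lane.REFLEX      # minimal compute
    elif e < THETA_STANDARD:
        return Lane.STANDARD    # normal compute
    elif e < THETA_MAX:
        return Lane.DEEP        # maximum compute
    return Lane.ESCALATE        # irreconcilable incoherence

def forward_with_early_exit(x, layers, energy, epsilon=1e-3):
    """Run layers until the energy stops changing by more than epsilon."""
    prev = float("inf")
    for layer in layers:
        x = layer(x)
        curr = energy(x)
        if abs(curr - prev) < epsilon:
            break               # energy converged: skip remaining layers
        prev = curr
    return x

ctx = np.zeros((3, 4))
print(route_token(np.zeros(4), ctx))       # coherent token -> Lane.REFLEX
print(route_token(np.full(4, 10.0), ctx))  # irreconcilable -> Lane.ESCALATE
```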