# ADR-015: Coherence-Gated Transformer (Sheaf Attention)
**Status**: Proposed
**Date**: 2026-01-22
**Authors**: ruv.io, RuVector Team
**Deciders**: Architecture Review Board
**Target Crate**: `ruvector-attention`
## Version History
| Version | Date | Author | Changes |
|---------|------|--------|---------|
| 0.1 | 2026-01-22 | ruv.io | Initial proposal for coherence-gated attention |
---
## Context
### The Transformer Latency Problem
Standard transformers, and their common efficiency variants, have fundamental problems:
1. **Quadratic attention**: O(N²) for sequence length N
2. **Fixed computation**: Every token gets same compute regardless of difficulty
3. **Dense by default**: All attention weights computed even when most are near-zero
4. **Unreliable early exit**: Existing early-exit schemes gate on confidence scores, which correlate poorly with correctness
### Existing Solutions and Their Limits
| Approach | Method | Limitation |
|----------|--------|------------|
| Flash Attention | Memory-efficient matmul | Still O(N²) compute |
| Sparse Attention | Fixed patterns (local, strided) | Patterns don't adapt to content |
| Linear Attention | Kernel approximation | Quality degradation |
| Early Exit | Confidence threshold | Confidence ≠ correctness |
| MoE | Expert routing | Routing is learned, not principled |
### The Coherence Insight
Prime-Radiant's coherence engine provides a **mathematically grounded** measure of consistency. This can be applied to attention:
> **Core idea**: Tokens that are already coherent with context don't need expensive attention. Route computation based on coherence energy, not learned confidence.
---
## Decision
### Implement Coherence-Gated Transformer (CGT) in `ruvector-attention`
A novel attention mechanism that uses sheaf coherence to:
1. **Route tokens** to different compute depths
2. **Sparsify attention** based on residual energy
3. **Exit early** when energy converges
4. **Replace QKV projections** with restriction maps
---
## Architecture
### High-Level Design
```
                  COHERENCE-GATED TRANSFORMER (CGT)

  INPUT PROCESSING
      Tokens ──► Embedding ──► Initial Coherence Graph
                                  │
                                  ▼
  COHERENCE ROUTER
      For each token t:
          E(t) = Σ w_e ||ρ_t(x_t) - ρ_ctx(x_ctx)||²

      Route based on energy:
          E < θ_reflex  ──►  LANE 0 (Reflex)
          E < θ_std     ──►  LANE 1 (Standard)
          E ≥ θ_std     ──►  LANE 2 (Deep)
                                  │
            ┌─────────────────────┼─────────────────────┐
            ▼                     ▼                     ▼
  ┌──────────────────┐  ┌──────────────────┐  ┌──────────────────┐
  │ LANE 0: REFLEX   │  │ LANE 1: STANDARD │  │ LANE 2: DEEP     │
  │                  │  │                  │  │                  │
  │ • 1-2 layers     │  │ • 6 layers       │  │ • 12+ layers     │
  │ • Local attention│  │ • Sparse sheaf   │  │ • Full + MoE     │
  │   (window=64)    │  │   attention      │  │ • All experts    │
  │ • No FFN         │  │ • ~1ms           │  │ • Spectral       │
  │ • <0.1ms         │  │                  │  │   analysis       │
  │                  │  │                  │  │ • ~5ms           │
  └──────────────────┘  └──────────────────┘  └──────────────────┘
            │                     │                     │
            └─────────────────────┼─────────────────────┘
                                  ▼
  COHERENCE VERIFICATION
      E_final = compute_energy(output_graph)

      if E_final > θ_max:
          → Escalate to Lane 2 OR refuse generation
      else:
          → Output with witness
                                  │
                                  ▼
                          Output + Witness
```
### Component Details
#### 1. Sheaf Attention Layer
Replace standard scaled dot-product attention with coherence-based attention:
```
Standard Attention:
    Attention(Q, K, V) = softmax(QK^T / √d) V

Sheaf Attention:
    R_ij = ||ρ_i(x_i) - ρ_j(x_j)||²               # Residual energy
    A_ij = exp(-β × R_ij) / Σ_k exp(-β × R_ik)    # Coherence-based weight
    Output = A × V
```
**Key difference**: The attention weight decays with residual energy instead of growing with dot-product similarity.
- High residual (incoherent) → Low attention (don't propagate inconsistency)
- Low residual (coherent) → High attention (reinforce consistency)
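To make the weighting concrete, here is a minimal numpy sketch (illustrative only, not the crate implementation). It assumes a single query/key/value restriction map shared across tokens and represented as a plain matrix, and it ignores batching, masking, and multi-head structure:

```python
import numpy as np

def sheaf_attention(X, rho_q, rho_k, rho_v, beta=1.0):
    """Toy dense sheaf attention: weights decay with residual energy.

    X:     (N, d_model) token embeddings
    rho_*: (d_shared, d_model) restriction maps, here plain matrices (an assumption)
    """
    Q = X @ rho_q.T                      # restrict tokens onto the query manifold
    K = X @ rho_k.T                      # restrict tokens onto the key manifold
    V = X @ rho_v.T                      # restrict tokens onto the value manifold

    # Residual energy R_ij = ||rho_q(x_i) - rho_k(x_j)||^2 for every pair
    diff = Q[:, None, :] - K[None, :, :]             # (N, N, d_shared)
    R = np.sum(diff ** 2, axis=-1)                    # (N, N)

    # Coherence-based weights: A_ij ∝ exp(-beta * R_ij), normalized per row.
    # Coherent pairs (small R_ij) dominate; incoherent pairs are suppressed.
    A = np.exp(-beta * R)
    A /= A.sum(axis=-1, keepdims=True)
    return A @ V
```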
#### 2. Restriction Map Projections
Replace learned W_q, W_k, W_v with restriction maps:
```
Standard:
    Q = W_q × x    (learned projection)
    K = W_k × x
    V = W_v × x

Sheaf:
    Q = ρ_q(x)     (restriction map to query manifold)
    K = ρ_k(x)     (restriction map to key manifold)
    V = ρ_v(x)     (restriction map to value manifold)
```
**Benefits**:
- Restriction maps have geometric meaning (project to shared space)
- Can be initialized from domain knowledge
- Residuals are interpretable
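As one illustration (not the crate's API), a restriction map can be realized as a linear projection into the shared space; the residual between two restricted tokens is then directly interpretable as their disagreement in that space. The names below are hypothetical:

```python
import numpy as np

class LinearRestrictionMap:
    """Hypothetical restriction map: a linear projection into a shared space."""

    def __init__(self, weight):
        # (d_shared, d_model); could be random, pre-trained, or analytically set
        self.weight = weight

    def __call__(self, x):
        return self.weight @ x           # project a token into the shared manifold

def residual_energy(rho_i, rho_j, x_i, x_j, w_ij=1.0):
    """Edge energy E_ij = w_ij * ||rho_i(x_i) - rho_j(x_j)||^2."""
    r = rho_i(x_i) - rho_j(x_j)
    return w_ij * float(r @ r)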
#### 3. Token-Level Compute Routing
```python
def route_token(token_embedding, context_graph):
    # Compute coherence energy of this token against its context graph
    energy = compute_token_energy(token_embedding, context_graph)
    if energy < THETA_REFLEX:
        return Lane.REFLEX       # Minimal compute
    elif energy < THETA_STANDARD:
        return Lane.STANDARD     # Normal compute
    else:
        return Lane.DEEP         # Maximum compute
```
**Routing thresholds** (tunable via SONA):
| Threshold | Default | Meaning |
|-----------|---------|---------|
| θ_reflex | 0.01 | Token is highly coherent with context |
| θ_standard | 0.1 | Token has minor inconsistencies |
| θ_deep | 1.0 | Token has major inconsistencies |
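One way to carry these thresholds, so that SONA (or any other tuner) can adjust them at run time, is a small config object. The sketch below is illustrative and simply mirrors the defaults in the table:

```python
from dataclasses import dataclass

@dataclass
class CoherenceThresholds:
    # Defaults taken from the table above; all are intended to be tunable.
    reflex: float = 0.01     # below this, token is highly coherent with context
    standard: float = 0.1    # below this, token has only minor inconsistencies
    deep: float = 1.0        # above this, inconsistencies are considered major
```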
#### 4. Residual-Sparse Attention
Only compute attention for token pairs with high residual:
```python
import numpy as np

def sparse_sheaf_attention(X, threshold):
    N = len(X)
    attention_mask = np.zeros((N, N))
    for i in range(N):
        for j in range(N):
            residual = compute_residual(X[i], X[j])
            if residual > threshold:
                # These tokens are incoherent - they need attention
                attention_mask[i, j] = 1
            # else: skip attention (already coherent)
    # Compute attention only for non-zero mask entries
    return masked_attention(X, attention_mask)
```
**Sparsity pattern**: Adapts to content, not fixed like local/strided attention.
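The loop above is written for clarity; the same content-adaptive mask can be built in a single vectorized pass. The sketch below is one possible formulation and assumes the residual is the squared distance between restricted embeddings, matching the formulas in this ADR:

```python
import numpy as np

def residual_sparse_mask(Q, K, threshold):
    """Vectorized mask: attend only where the pairwise residual is large.

    Q, K: (N, d_shared) restricted token embeddings, e.g. rho_q(x) and rho_k(x).
    Returns a boolean (N, N) mask that is True where attention must be computed.
    """
    diff = Q[:, None, :] - K[None, :, :]
    residuals = np.sum(diff ** 2, axis=-1)      # R_ij for all pairs at once
    return residuals > threshold                # incoherent pairs need attention
```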
#### 5. Energy-Based Early Exit
```python
def forward_with_early_exit(x, layers, epsilon=0.001):
    prev_energy = float('inf')
    for layer in layers:
        x = layer(x)
        curr_energy = compute_energy(x)
        delta = abs(curr_energy - prev_energy)
        if delta < epsilon:
            # Energy converged - no need for more layers
            return x
        prev_energy = curr_energy
    return x
```
**Exit criterion**: Energy convergence, not confidence threshold.
---
## Compute Lane Specifications
### Lane 0: Reflex (~0.1ms)
```
Layers: 1-2
Attention: Local only (window=64)
FFN: Skip or minimal
Use case: Common tokens, clear context
Example: "the", "is", "and" in well-formed sentences
```
### Lane 1: Standard (~1ms)
```
Layers: 6
Attention: Sparse sheaf (residual > 0.05)
FFN: Standard
Use case: Normal tokens requiring context integration
Example: Most content words
```
### Lane 2: Deep (~5ms)
```
Layers: 12+
Attention: Full sheaf + MoE routing
FFN: Expert mixture
Spectral: Eigenvalue analysis for structural issues
Use case: Ambiguous, contradictory, or complex tokens
Example: "bank" (river or financial?), negations, rare words
```
### Lane 3: Escalate (async)
```
Action: Return uncertainty, request clarification
Use case: Irreconcilable incoherence
Example: "The cat is not a cat" - logical contradiction
```
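For reference, the lane specifications above can be summarized as per-lane settings; the field names below are illustrative placeholders, not the crate's final types:

```python
from dataclasses import dataclass

@dataclass
class LaneSpec:
    layers: int
    attention: str            # "local(window=64)", "sparse-sheaf", "full-sheaf+moe"
    ffn: str                  # "none", "standard", "expert-mixture"
    latency_budget_ms: float

# Illustrative defaults mirroring the lane descriptions above.
LANES = {
    "reflex":   LaneSpec(2,  "local(window=64)", "none",           0.1),
    "standard": LaneSpec(6,  "sparse-sheaf",     "standard",       1.0),
    "deep":     LaneSpec(12, "full-sheaf+moe",   "expert-mixture", 5.0),
}
```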
---
## Mathematical Foundation
### Sheaf Attention Formula
Given tokens X = {x_1, ..., x_N} and restriction maps ρ_i, ρ_j:
**Residual**:
```
r_ij = ρ_i(x_i) - ρ_j(x_j)
```
**Edge energy**:
```
E_ij = w_ij × ||r_ij||²
```
**Token energy**:
```
E_i = Σ_j E_ij (sum over edges incident to i)
```
**Attention weight** (coherence-based):
```
A_ij = exp(-β × E_ij) / Σ_k exp(-β × E_ik)
```
**Output**:
```
y_i = Σ_j A_ij × V_j
```
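A tiny worked example (made-up numbers) shows how the weighting behaves: with two edge energies E_i1 = 0.01 and E_i2 = 1.0 and β = 2, the coherent neighbour receives roughly 88% of the attention mass.

```python
import numpy as np

# Edge energies from token i to its two neighbours, and temperature beta.
E = np.array([0.01, 1.0])
beta = 2.0

A = np.exp(-beta * E)
A /= A.sum()
# A ≈ [0.879, 0.121]: the coherent neighbour (E = 0.01) gets ~88% of the weight.
```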
### Complexity Analysis
| Operation | Standard | Sheaf (Dense) | Sheaf (Sparse, s% non-zero) |
|-----------|----------|---------------|----------------------------|
| Attention | O(N²d) | O(N²d) | O(s×N²d) |
| Routing | - | O(Nd) | O(Nd) |
| Early exit | - | O(Ld) per check | O(Ld) per check |
| **Total** | O(N²Ld) | O(N²Ld) | O(s×N²Ld + routing) |
With typical sparsity of s = 10-20% and roughly half the layers skipped via early exit, the attention term alone drops to about 0.15 × 0.5 ≈ 7.5% of its dense cost; after accounting for routing overhead and the remaining FFN work, the overall target is a **5-10x speedup**.
---
## Integration with `ruvector-attention`
### New Modules
```
ruvector-attention/
├── src/
│ ├── sheaf/ # NEW: Sheaf attention
│ │ ├── mod.rs
│ │ ├── attention.rs # SheafAttention layer
│ │ ├── restriction.rs # Restriction map projections
│ │ ├── router.rs # Token-level routing
│ │ ├── sparse.rs # Residual-sparse attention
│ │ └── early_exit.rs # Energy-based early exit
│ │
│ ├── coherence_gated/ # NEW: Full CGT implementation
│ │ ├── mod.rs
│ │ ├── transformer.rs # CoherenceGatedTransformer
│ │ ├── lane.rs # ComputeLane enum + configs
│ │ ├── config.rs # CGTConfig
│ │ └── benchmark.rs # Latency/quality benchmarks
│ │
│ └── ... (existing modules)
```
### New Types
```rust
/// Sheaf-based attention layer
pub struct SheafAttention {
/// Restriction map for queries
pub rho_query: RestrictionMap,
/// Restriction map for keys
pub rho_key: RestrictionMap,
/// Restriction map for values
pub rho_value: RestrictionMap,
/// Temperature for attention softmax
pub beta: f32,
/// Sparsity threshold
pub sparsity_threshold: f32,
}
/// Compute lane for token routing
#[derive(Debug, Clone, Copy)]
pub enum ComputeLane {
/// Minimal compute (<0.1ms)
Reflex,
/// Standard compute (~1ms)
Standard,
/// Deep compute (~5ms)
Deep,
/// Escalate to caller
Escalate,
}
/// Coherence-Gated Transformer configuration
pub struct CGTConfig {
/// Embedding dimension
pub d_model: usize,
/// Layers per lane
pub layers_per_lane: [usize; 3], // [reflex, standard, deep]
/// Routing thresholds
pub thresholds: CoherenceThresholds,
/// Sparsity settings
pub sparsity: SparsityConfig,
/// Early exit settings
pub early_exit: EarlyExitConfig,
}
/// Token routing decision
pub struct RoutingDecision {
pub token_id: usize,
pub energy: f32,
pub lane: ComputeLane,
pub attention_mask: Option<SparseMask>,
}
```
### Feature Flags
```toml
[features]
# Sheaf attention (requires prime-radiant)
sheaf = ["dep:prime-radiant"]
# Full CGT implementation
coherence-gated = ["sheaf", "sparse", "moe"]
# Benchmarking utilities
cgt-bench = ["coherence-gated", "criterion"]
```
---
## Performance Targets
| Metric | Standard Transformer | CGT Target | Improvement |
|--------|---------------------|------------|-------------|
| Average latency (128 tokens) | 10ms | 1-2ms | 5-10x |
| P99 latency (128 tokens) | 15ms | 8ms | 2x |
| Memory (batch=32) | 2GB | 800MB | 2.5x |
| Quality (perplexity) | Baseline | <5% degradation | Acceptable |
### Latency Breakdown
```
Standard (10ms total):
    Attention: 6ms  (60%)
    FFN:       3ms  (30%)
    Other:     1ms  (10%)

CGT Target (2ms total):
    Routing:             0.1ms  (5%)
    Attention (sparse):  1.0ms  (50%)
    FFN (conditional):   0.7ms  (35%)
    Other:               0.2ms  (10%)
```
---
## Quality Guarantees
### Coherence Bound
Every output is guaranteed to have coherence energy below threshold:
```
E(output) < θ_max OR escalate/refuse
```
This is **stronger** than confidence-based systems, which can be confidently wrong.
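A minimal sketch of this output gate, under the assumption that an energy function over the output graph is passed in (the helper and return convention are hypothetical):

```python
def verify_output(output_graph, theta_max, energy_fn):
    """Gate generation on coherence energy rather than model confidence."""
    e_final = energy_fn(output_graph)
    if e_final > theta_max:
        # Bound violated: escalate to Lane 2 or refuse generation.
        return None, e_final
    # Bound satisfied: emit the output together with its energy as a witness.
    return output_graph, e_final
```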
### Graceful Degradation
Under compute pressure, degradation is staged (one possible controller is sketched after this list):
1. Raise θ_reflex → more tokens to Lane 0
2. Increase sparsity threshold → fewer attention computations
3. Quality degrades **predictably** (energy increases)
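The sketch below is one possible degradation policy, not a committed design; the starting values 0.01 and 0.05 are the θ_reflex default and the Lane 1 residual threshold from this ADR, and the scaling factor is arbitrary:

```python
def degrade_under_pressure(theta_reflex, sparsity_threshold, load):
    """Illustrative policy only: trade coherence for latency as load grows.

    load is a normalized compute-pressure signal in [0, 1].
    """
    # 1. Raise theta_reflex so that more tokens are routed to Lane 0.
    theta_reflex = theta_reflex * (1.0 + 4.0 * load)
    # 2. Raise the residual threshold so that fewer token pairs get attention.
    sparsity_threshold = sparsity_threshold * (1.0 + 4.0 * load)
    # The quality loss is predictable: output energy rises and remains observable.
    return theta_reflex, sparsity_threshold
```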
### Interpretability
For any output:
- Which tokens went to which lane?
- Which token pairs had high residuals?
- Where did the model "struggle"?
---
## Comparison with Existing Approaches
| Feature | Flash Attention | Sparse Transformers | MoE | CGT (Ours) |
|---------|-----------------|---------------------|-----|------------|
| Adaptive compute | No | No | Yes | Yes |
| Content-based sparsity | No | No | Partial | Yes |
| Mathematical grounding | No | No | No | Yes (sheaf) |
| Quality guarantee | No | No | No | Yes (energy bound) |
| Interpretable routing | N/A | N/A | Partial | Yes |
| Early exit criterion | N/A | N/A | Confidence | Energy convergence |
---
## Research Questions
1. **Restriction map initialization**: Random vs. pre-trained vs. analytical?
2. **Threshold tuning**: Can SONA auto-tune θ values during inference?
3. **Multi-head sheaf attention**: One graph per head, or shared graph?
4. **Training objective**: Standard cross-entropy + energy regularization?
5. **Hardware optimization**: Can residual computation be fused with attention kernels?
---
## Implementation Phases
### Phase 1: Foundation (Weeks 1-4)
- [ ] `SheafAttention` layer with restriction maps
- [ ] Basic residual computation
- [ ] Unit tests for mathematical correctness
### Phase 2: Routing (Weeks 5-8)
- [ ] `ComputeLane` enum and routing logic
- [ ] Token-level energy computation
- [ ] Lane-specific layer configurations
### Phase 3: Sparsity (Weeks 9-12)
- [ ] Residual-sparse attention mask generation
- [ ] Efficient sparse attention kernel
- [ ] Sparsity pattern analysis tools
### Phase 4: Integration (Weeks 13-16)
- [ ] `CoherenceGatedTransformer` full implementation
- [ ] Early exit with energy convergence
- [ ] Benchmarking suite
### Phase 5: Optimization (Weeks 17-20)
- [ ] SIMD optimization for residual computation
- [ ] Kernel fusion opportunities
- [ ] SONA integration for threshold tuning
---
## Dependencies
### Required
- `prime-radiant` (coherence computation)
- `ruvector-core` (vector operations)
- `ndarray` (matrix operations)
### Optional
- `rayon` (parallel routing)
- `criterion` (benchmarking)
---
## References
1. Hansen, J., & Ghrist, R. (2019). "Toward a spectral theory of cellular sheaves."
2. Vaswani et al. (2017). "Attention Is All You Need."
3. Kitaev et al. (2020). "Reformer: The Efficient Transformer."
4. Fedus et al. (2022). "Switch Transformers: Scaling to Trillion Parameter Models."
5. ADR-014: Coherence Engine Architecture
---
## Related Decisions
- **ADR-014**: Coherence Engine Architecture (Prime-Radiant)
- **ADR-003**: SIMD Optimization Strategy
- **ADR-006**: Memory Management
---
## Appendix: Name Options
| Name | Rationale |
|------|-----------|
| **Coherence-Gated Transformer (CGT)** | Descriptive, clear function |
| **Sheaf Attention** | Mathematical foundation |
| **Residual-Routed Transformer** | Emphasizes routing mechanism |
| **Energy-Adaptive Transformer** | Emphasizes efficiency |
| **Prime Transformer** | Connection to Prime-Radiant |
**Recommended**: "Coherence-Gated Transformer (CGT)" for the architecture, "Sheaf Attention" for the attention mechanism.