# ADR-015: Coherence-Gated Transformer (Sheaf Attention)

**Status**: Proposed

**Date**: 2026-01-22

**Authors**: ruv.io, RuVector Team

**Deciders**: Architecture Review Board

**Target Crate**: `ruvector-attention`

## Version History

| Version | Date | Author | Changes |
|---------|------|--------|---------|
| 0.1 | 2026-01-22 | ruv.io | Initial proposal for coherence-gated attention |

---

## Context

### The Transformer Latency Problem

Standard transformers have fundamental efficiency issues:

1. **Quadratic attention**: O(N²) compute for sequence length N
2. **Fixed computation**: every token gets the same compute regardless of difficulty
3. **Dense by default**: all attention weights are computed even when most are near zero
4. **Confidence-based exits**: early exit relies on unreliable confidence scores

### Existing Solutions and Their Limits

| Approach | Method | Limitation |
|----------|--------|------------|
| Flash Attention | Memory-efficient matmul | Still O(N²) compute |
| Sparse Attention | Fixed patterns (local, strided) | Patterns don't adapt to content |
| Linear Attention | Kernel approximation | Quality degradation |
| Early Exit | Confidence threshold | Confidence ≠ correctness |
| MoE | Expert routing | Routing is learned, not principled |

### The Coherence Insight

Prime-Radiant's coherence engine provides a **mathematically grounded** measure of consistency. The same measure can be applied to attention:

> **Core idea**: Tokens that are already coherent with their context don't need expensive attention. Route computation based on coherence energy, not learned confidence.

---

## Decision

### Implement Coherence-Gated Transformer (CGT) in `ruvector-attention`

A novel attention mechanism that uses sheaf coherence to:

1. **Route tokens** to different compute depths
2. **Sparsify attention** based on residual energy
3. **Exit early** when energy converges
4. **Replace QKV projections** with restriction maps

---

## Architecture

### High-Level Design

```
┌─────────────────────────────────────────────────────────────────┐
│                COHERENCE-GATED TRANSFORMER (CGT)                │
│                                                                 │
│  ┌───────────────────────────────────────────────────────────┐  │
│  │                     INPUT PROCESSING                      │  │
│  │   Tokens ──► Embedding ──► Initial Coherence Graph        │  │
│  └───────────────────────────────────────────────────────────┘  │
│                               │                                 │
│                               ▼                                 │
│  ┌───────────────────────────────────────────────────────────┐  │
│  │                     COHERENCE ROUTER                      │  │
│  │                                                           │  │
│  │   For each token t:                                       │  │
│  │     E(t) = Σ w_e ||ρ_t(x_t) - ρ_ctx(x_ctx)||²             │  │
│  │                                                           │  │
│  │   Route based on energy:                                  │  │
│  │   ┌──────────────┬──────────────┬──────────────┐          │  │
│  │   │ E < θ_reflex │  E < θ_std   │  E ≥ θ_std   │          │  │
│  │   │      │       │      │       │      │       │          │  │
│  │   │      ▼       │      ▼       │      ▼       │          │  │
│  │   │    LANE 0    │    LANE 1    │    LANE 2    │          │  │
│  │   │    Reflex    │   Standard   │     Deep     │          │  │
│  │   └──────────────┴──────────────┴──────────────┘          │  │
│  └───────────────────────────────────────────────────────────┘  │
│                               │                                 │
│           ┌───────────────────┼───────────────────┐             │
│           │                   │                   │             │
│           ▼                   ▼                   ▼             │
│  ┌──────────────────┐ ┌──────────────────┐ ┌──────────────────┐ │
│  │      LANE 0      │ │      LANE 1      │ │      LANE 2      │ │
│  │      REFLEX      │ │     STANDARD     │ │       DEEP       │ │
│  │                  │ │                  │ │                  │ │
│  │ • 1-2 layers     │ │ • 6 layers       │ │ • 12+ layers     │ │
│  │ • Local attention│ │ • Sparse sheaf   │ │ • Full + MoE     │ │
│  │   (window=64)    │ │   attn           │ │ • All experts    │ │
│  │ • No FFN         │ │ • ~1ms           │ │ • Spectral       │ │
│  │ • <0.1ms         │ │                  │ │   analysis       │ │
│  │                  │ │                  │ │ • ~5ms           │ │
│  └──────────────────┘ └──────────────────┘ └──────────────────┘ │
│           │                   │                   │             │
│           └───────────────────┼───────────────────┘             │
│                               ▼                                 │
│  ┌───────────────────────────────────────────────────────────┐  │
│  │                  COHERENCE VERIFICATION                   │  │
│  │                                                           │  │
│  │   E_final = compute_energy(output_graph)                  │  │
│  │                                                           │  │
│  │   if E_final > θ_max:                                     │  │
│  │       → Escalate to Lane 2 OR refuse generation           │  │
│  │   else:                                                   │  │
│  │       → Output with witness                               │  │
│  └───────────────────────────────────────────────────────────┘  │
│                               │                                 │
│                               ▼                                 │
│                        Output + Witness                         │
└─────────────────────────────────────────────────────────────────┘
```


### Component Details

#### 1. Sheaf Attention Layer

Replace standard scaled dot-product attention with coherence-based attention:

```
Standard Attention:
    Attention(Q, K, V) = softmax(QK^T / √d) V

Sheaf Attention:
    R_ij = ||ρ_i(x_i) - ρ_j(x_j)||²                # Residual energy
    A_ij = exp(-β × R_ij) / Σ_k exp(-β × R_ik)     # Coherence-based weight
    Output = A × V
```


**Key difference**: the attention weight is inversely proportional to residual energy.

- High residual (incoherent) → low attention (don't propagate inconsistency)
- Low residual (coherent) → high attention (reinforce consistency)

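
As a concrete illustration, the formulas above can be run directly in NumPy. This is a hedged sketch, not the `ruvector-attention` API: `sheaf_attention`, the linear restriction maps, and the shapes are all illustrative assumptions.

```python
import numpy as np

def sheaf_attention(X, rho_q, rho_k, V, beta=1.0):
    """Coherence-based attention: weights decay with residual energy.

    X: (N, d) token states; rho_q, rho_k: (d, d_s) linear restriction maps
    into a shared comparison space (illustrative stand-ins).
    """
    Q = X @ rho_q                          # ρ_q(x_i)
    K = X @ rho_k                          # ρ_k(x_j)
    diff = Q[:, None, :] - K[None, :, :]   # pairwise residuals r_ij
    R = np.sum(diff ** 2, axis=-1)         # R_ij = ||r_ij||²
    # A_ij = exp(-β R_ij) / Σ_k exp(-β R_ik); subtract the row min for
    # numerical stability (normalization cancels the constant factor)
    A = np.exp(-beta * (R - R.min(axis=1, keepdims=True)))
    A /= A.sum(axis=1, keepdims=True)
    return A @ V, A

rng = np.random.default_rng(0)
N, d = 4, 8
X = rng.normal(size=(N, d))
rho = rng.normal(size=(d, d)) / np.sqrt(d)
out, A = sheaf_attention(X, rho, rho, X)
# With rho_q == rho_k, R_ii = 0, so each token attends most to itself.
```

Note that each attention row sums to 1, and the diagonal dominates exactly because self-residuals vanish.
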
#### 2. Restriction Map Projections

Replace learned W_q, W_k, W_v with restriction maps:

```
Standard:
    Q = W_q × x    (learned projection)
    K = W_k × x
    V = W_v × x

Sheaf:
    Q = ρ_q(x)     (restriction map to query manifold)
    K = ρ_k(x)     (restriction map to key manifold)
    V = ρ_v(x)     (restriction map to value manifold)
```


**Benefits**:

- Restriction maps have geometric meaning (they project to a shared space)
- They can be initialized from domain knowledge
- Residuals are interpretable

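
To make the "shared space" point concrete, here is a minimal hypothetical sketch: a restriction map that projects 4-d token states onto the 2-d subspace the two tokens share, so the residual ignores their private dimensions entirely.

```python
import numpy as np

# Hypothetical restriction map: keep the two shared dimensions, drop the rest.
rho = np.array([[1.0, 0.0, 0.0, 0.0],
                [0.0, 1.0, 0.0, 0.0]])

x_i = np.array([1.0, 2.0, 9.0, -3.0])   # agrees with x_j on shared dims 0-1
x_j = np.array([1.0, 2.0, -5.0, 7.0])   # differs only in private dims 2-3

r_ij = rho @ x_i - rho @ x_j            # residual lives in the shared space
energy = float(r_ij @ r_ij)             # ||r_ij||² = 0: this pair is coherent
```

The residual is interpretable by construction: a nonzero entry names the shared dimension on which the two tokens disagree.
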
#### 3. Token-Level Compute Routing
```python
def route_token(token_embedding, context_graph):
    # Compute coherence energy of this token against the context graph
    energy = compute_token_energy(token_embedding, context_graph)

    if energy < THETA_REFLEX:
        return Lane.REFLEX      # Minimal compute
    elif energy < THETA_STANDARD:
        return Lane.STANDARD    # Normal compute
    else:
        return Lane.DEEP        # Maximum compute
```


**Routing thresholds** (tunable via SONA):

| Threshold | Default | Meaning |
|-----------|---------|---------|
| θ_reflex | 0.01 | Token is highly coherent with context |
| θ_standard | 0.1 | Token has minor inconsistencies |
| θ_deep | 1.0 | Token has major inconsistencies |

#### 4. Residual-Sparse Attention

Only compute attention for token pairs with high residual:

```python
def sparse_sheaf_attention(X, threshold):
    N = len(X)
    attention_mask = zeros(N, N)

    for i in range(N):
        for j in range(N):
            residual = compute_residual(X[i], X[j])
            if residual > threshold:
                # These tokens are incoherent - they need attention
                attention_mask[i, j] = 1
            # else: skip attention (already coherent)

    # Compute attention only for non-zero mask entries
    return masked_attention(X, attention_mask)
```


**Sparsity pattern**: adapts to content, unlike fixed local or strided attention.

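
The O(N²) Python loop above is illustrative; the mask construction vectorizes directly. A sketch under stated assumptions (identity restriction map, arbitrary threshold; names are illustrative, not the crate's API):

```python
import numpy as np

def residual_sparse_mask(X, rho, threshold):
    """1 where the pairwise residual energy exceeds threshold, else 0."""
    P = X @ rho                            # project all tokens at once
    diff = P[:, None, :] - P[None, :, :]
    R = np.sum(diff ** 2, axis=-1)         # R_ij = ||ρ(x_i) - ρ(x_j)||²
    return (R > threshold).astype(np.int8)

rng = np.random.default_rng(1)
X = rng.normal(size=(16, 8))
mask = residual_sparse_mask(X, np.eye(8), threshold=8.0)
density = mask.mean()   # fraction of pairs that still need attention
```

The diagonal is always zero (a token has no residual against itself), and the achieved density depends on the data, which is exactly the content-adaptivity claimed above.
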
#### 5. Energy-Based Early Exit
```python
def forward_with_early_exit(x, layers, epsilon=0.001):
    prev_energy = float('inf')

    for layer in layers:
        x = layer(x)
        curr_energy = compute_energy(x)

        delta = abs(curr_energy - prev_energy)
        if delta < epsilon:
            # Energy converged - no need for more layers
            return x

        prev_energy = curr_energy

    return x
```


**Exit criterion**: energy convergence, not a confidence threshold.

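
A runnable toy version of this loop, under loud assumptions: the "layers" simply damp a disagreement vector and `compute_energy` is a stand-in for the coherence engine, not the prime-radiant API. It shows the loop stopping once successive energies differ by less than ε:

```python
import numpy as np

def compute_energy(x):
    # Toy stand-in: squared norm of the disagreement vector
    return float(np.sum(x ** 2))

def forward_with_early_exit(x, layers, epsilon=1e-3):
    prev_energy = float("inf")
    used = 0
    for layer in layers:
        x = layer(x)
        used += 1
        curr_energy = compute_energy(x)
        if abs(curr_energy - prev_energy) < epsilon:
            break                # energy converged: skip remaining layers
        prev_energy = curr_energy
    return x, used

# Each toy "layer" halves the residual, so the energy change shrinks
# geometrically and the loop exits well before all 12 layers run.
layers = [lambda x: 0.5 * x for _ in range(12)]
out, layers_used = forward_with_early_exit(np.ones(4), layers)
```
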

---

## Compute Lane Specifications

### Lane 0: Reflex (~0.1ms)

```
Layers:    1-2
Attention: Local only (window=64)
FFN:       Skip or minimal
Use case:  Common tokens, clear context
Example:   "the", "is", "and" in well-formed sentences
```

### Lane 1: Standard (~1ms)

```
Layers:    6
Attention: Sparse sheaf (residual > 0.05)
FFN:       Standard
Use case:  Normal tokens requiring context integration
Example:   Most content words
```

### Lane 2: Deep (~5ms)

```
Layers:    12+
Attention: Full sheaf + MoE routing
FFN:       Expert mixture
Spectral:  Eigenvalue analysis for structural issues
Use case:  Ambiguous, contradictory, or complex tokens
Example:   "bank" (river or financial?), negations, rare words
```

### Lane 3: Escalate (async)

```
Action:    Return uncertainty, request clarification
Use case:  Irreconcilable incoherence
Example:   "The cat is not a cat" - logical contradiction
```

---

## Mathematical Foundation

### Sheaf Attention Formula

Given tokens X = {x_1, ..., x_N} and restriction maps ρ_i, ρ_j:

**Residual**:
```
r_ij = ρ_i(x_i) - ρ_j(x_j)
```

**Edge energy**:
```
E_ij = w_ij × ||r_ij||²
```

**Token energy**:
```
E_i = Σ_j E_ij    (sum over edges incident to i)
```

**Attention weight** (coherence-based):
```
A_ij = exp(-β × E_ij) / Σ_k exp(-β × E_ik)
```

**Output**:
```
y_i = Σ_j A_ij × V_j
```

### Complexity Analysis

| Operation | Standard | Sheaf (Dense) | Sheaf (Sparse, s% non-zero) |
|-----------|----------|---------------|-----------------------------|
| Attention | O(N²d) | O(N²d) | O(s×N²d) |
| Routing | - | O(Nd) | O(Nd) |
| Early exit | - | O(Ld) per check | O(Ld) per check |
| **Total** | O(N²Ld) | O(N²Ld) | O(s×N²Ld + routing) |

With typical sparsity of s = 10-20% and early exit after ~50% of layers: **5-10x speedup**.

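
A back-of-envelope check of that range, treating sparse attention over the layers actually run as the dominant term from the table. The routing-overhead figure is an assumption for illustration, not a measurement:

```python
def cgt_speedup(sparsity, exit_fraction, routing_overhead=0.05):
    """Relative cost model: dense = 1; CGT ≈ s × (layers actually run) + routing."""
    layers_used = 1.0 - exit_fraction      # 50% early exit → half the layers
    return 1.0 / (sparsity * layers_used + routing_overhead)

low = cgt_speedup(sparsity=0.20, exit_fraction=0.5)    # pessimistic end ≈ 6.7x
high = cgt_speedup(sparsity=0.10, exit_fraction=0.5)   # optimistic end = 10x
```

Under this crude model the s = 10-20% range brackets the quoted 5-10x figure.
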
---

## Integration with `ruvector-attention`

### New Modules

```
ruvector-attention/
├── src/
│   ├── sheaf/                    # NEW: Sheaf attention
│   │   ├── mod.rs
│   │   ├── attention.rs          # SheafAttention layer
│   │   ├── restriction.rs        # Restriction map projections
│   │   ├── router.rs             # Token-level routing
│   │   ├── sparse.rs             # Residual-sparse attention
│   │   └── early_exit.rs         # Energy-based early exit
│   │
│   ├── coherence_gated/          # NEW: Full CGT implementation
│   │   ├── mod.rs
│   │   ├── transformer.rs        # CoherenceGatedTransformer
│   │   ├── lane.rs               # ComputeLane enum + configs
│   │   ├── config.rs             # CGTConfig
│   │   └── benchmark.rs          # Latency/quality benchmarks
│   │
│   └── ... (existing modules)
```


### New Types

```rust
/// Sheaf-based attention layer
pub struct SheafAttention {
    /// Restriction map for queries
    pub rho_query: RestrictionMap,
    /// Restriction map for keys
    pub rho_key: RestrictionMap,
    /// Restriction map for values
    pub rho_value: RestrictionMap,
    /// Temperature for attention softmax
    pub beta: f32,
    /// Sparsity threshold
    pub sparsity_threshold: f32,
}

/// Compute lane for token routing
#[derive(Debug, Clone, Copy)]
pub enum ComputeLane {
    /// Minimal compute (<0.1ms)
    Reflex,
    /// Standard compute (~1ms)
    Standard,
    /// Deep compute (~5ms)
    Deep,
    /// Escalate to caller
    Escalate,
}

/// Coherence-Gated Transformer configuration
pub struct CGTConfig {
    /// Embedding dimension
    pub d_model: usize,
    /// Layers per lane: [reflex, standard, deep]
    pub layers_per_lane: [usize; 3],
    /// Routing thresholds
    pub thresholds: CoherenceThresholds,
    /// Sparsity settings
    pub sparsity: SparsityConfig,
    /// Early exit settings
    pub early_exit: EarlyExitConfig,
}

/// Token routing decision
pub struct RoutingDecision {
    pub token_id: usize,
    pub energy: f32,
    pub lane: ComputeLane,
    pub attention_mask: Option<SparseMask>,
}
```


### Feature Flags

```toml
[features]
# Sheaf attention (requires prime-radiant)
sheaf = ["dep:prime-radiant"]

# Full CGT implementation
coherence-gated = ["sheaf", "sparse", "moe"]

# Benchmarking utilities
cgt-bench = ["coherence-gated", "criterion"]
```


---

## Performance Targets

| Metric | Standard Transformer | CGT Target | Improvement |
|--------|----------------------|------------|-------------|
| Average latency (128 tokens) | 10ms | 1-2ms | 5-10x |
| P99 latency (128 tokens) | 15ms | 8ms | 2x |
| Memory (batch=32) | 2GB | 800MB | 2.5x |
| Quality (perplexity) | Baseline | <5% degradation | Acceptable |

### Latency Breakdown

```
Standard (10ms total):
    Attention: 6ms   (60%)
    FFN:       3ms   (30%)
    Other:     1ms   (10%)

CGT Target (2ms total):
    Routing:             0.1ms (5%)
    Attention (sparse):  1ms   (50%)
    FFN (conditional):   0.7ms (35%)
    Other:               0.2ms (10%)
```


---

## Quality Guarantees

### Coherence Bound

Every output is guaranteed to have coherence energy below threshold:

```
E(output) < θ_max   OR   escalate/refuse
```


This is **stronger** than confidence-based systems, which can be confidently wrong.

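
The bound reduces to a small gate at the output boundary. A hedged sketch (`THETA_MAX` and the witness shape are illustrative, not the crate's API):

```python
THETA_MAX = 1.0   # illustrative escalation threshold (θ_max)

def verify_output(final_energy, theta_max=THETA_MAX):
    """Emit with an energy witness iff the coherence bound holds; else escalate."""
    if final_energy < theta_max:
        return "emit", {"energy": final_energy, "bound": theta_max}
    return "escalate", None

decision, witness = verify_output(0.3)   # coherent output: emitted with witness
fallback, _ = verify_output(2.5)         # bound violated: escalate or refuse
```

The witness records the measured energy and the bound it cleared, so acceptance is auditable after the fact.
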

### Graceful Degradation

Under compute pressure:

1. Raise θ_reflex → more tokens go to Lane 0
2. Raise the sparsity threshold → fewer attention computations
3. Quality degrades **predictably** (energy increases measurably)

### Interpretability

For any output, the routing trace answers:

- Which tokens went to which lane?
- Which token pairs had high residuals?
- Where did the model "struggle"?

---

## Comparison with Existing Approaches

| Feature | Flash Attention | Sparse Transformers | MoE | CGT (Ours) |
|---------|-----------------|---------------------|-----|------------|
| Adaptive compute | No | No | Yes | Yes |
| Content-based sparsity | No | No | Partial | Yes |
| Mathematical grounding | No | No | No | Yes (sheaf) |
| Quality guarantee | No | No | No | Yes (energy bound) |
| Interpretable routing | N/A | N/A | Partial | Yes |
| Early exit criterion | N/A | N/A | Confidence | Energy convergence |

---

## Research Questions

1. **Restriction map initialization**: Random, pre-trained, or analytical?
2. **Threshold tuning**: Can SONA auto-tune θ values during inference?
3. **Multi-head sheaf attention**: One graph per head, or a shared graph?
4. **Training objective**: Standard cross-entropy plus an energy regularizer?
5. **Hardware optimization**: Can residual computation be fused with attention kernels?

---

## Implementation Phases

### Phase 1: Foundation (Weeks 1-4)

- [ ] `SheafAttention` layer with restriction maps
- [ ] Basic residual computation
- [ ] Unit tests for mathematical correctness

### Phase 2: Routing (Weeks 5-8)

- [ ] `ComputeLane` enum and routing logic
- [ ] Token-level energy computation
- [ ] Lane-specific layer configurations

### Phase 3: Sparsity (Weeks 9-12)

- [ ] Residual-sparse attention mask generation
- [ ] Efficient sparse attention kernel
- [ ] Sparsity pattern analysis tools

### Phase 4: Integration (Weeks 13-16)

- [ ] `CoherenceGatedTransformer` full implementation
- [ ] Early exit with energy convergence
- [ ] Benchmarking suite

### Phase 5: Optimization (Weeks 17-20)

- [ ] SIMD optimization for residual computation
- [ ] Kernel fusion opportunities
- [ ] SONA integration for threshold tuning

---

## Dependencies

### Required

- `prime-radiant` (coherence computation)
- `ruvector-core` (vector operations)
- `ndarray` (matrix operations)

### Optional

- `rayon` (parallel routing)
- `criterion` (benchmarking)

---

## References

1. Hansen, J., & Ghrist, R. (2019). "Toward a Spectral Theory of Cellular Sheaves."
2. Vaswani, A., et al. (2017). "Attention Is All You Need."
3. Kitaev, N., et al. (2020). "Reformer: The Efficient Transformer."
4. Fedus, W., et al. (2022). "Switch Transformers: Scaling to Trillion Parameter Models."
5. ADR-014: Coherence Engine Architecture

---

## Related Decisions

- **ADR-014**: Coherence Engine Architecture (Prime-Radiant)
- **ADR-003**: SIMD Optimization Strategy
- **ADR-006**: Memory Management

---

## Appendix: Name Options

| Name | Rationale |
|------|-----------|
| **Coherence-Gated Transformer (CGT)** | Descriptive; states the function clearly |
| **Sheaf Attention** | Emphasizes the mathematical foundation |
| **Residual-Routed Transformer** | Emphasizes the routing mechanism |
| **Energy-Adaptive Transformer** | Emphasizes efficiency |
| **Prime Transformer** | Connection to Prime-Radiant |

**Recommended**: "Coherence-Gated Transformer (CGT)" for the architecture, "Sheaf Attention" for the attention mechanism.