Merge commit 'd803bfe2b1fe7f5e219e50ac20d6801a0a58ac75' as 'vendor/ruvector'
568
vendor/ruvector/docs/adr/ADR-015-coherence-gated-transformer.md
vendored
Normal file
@@ -0,0 +1,568 @@
# ADR-015: Coherence-Gated Transformer (Sheaf Attention)

**Status**: Proposed
**Date**: 2026-01-22
**Authors**: ruv.io, RuVector Team
**Deciders**: Architecture Review Board
**Target Crate**: `ruvector-attention`

## Version History

| Version | Date | Author | Changes |
|---------|------|--------|---------|
| 0.1 | 2026-01-22 | ruv.io | Initial proposal for coherence-gated attention |


---

## Context

### The Transformer Latency Problem

Standard transformers have fundamental efficiency issues:

1. **Quadratic attention**: O(N²) for sequence length N
2. **Fixed computation**: Every token gets the same compute regardless of difficulty
3. **Dense by default**: All attention weights are computed even when most are near zero
4. **Confidence-based exits**: Early exit relies on unreliable confidence scores

### Existing Solutions and Their Limits

| Approach | Method | Limitation |
|----------|--------|------------|
| Flash Attention | Memory-efficient matmul | Still O(N²) compute |
| Sparse Attention | Fixed patterns (local, strided) | Patterns don't adapt to content |
| Linear Attention | Kernel approximation | Quality degradation |
| Early Exit | Confidence threshold | Confidence ≠ correctness |
| MoE | Expert routing | Routing is learned, not principled |

### The Coherence Insight

Prime-Radiant's coherence engine provides a **mathematically grounded** measure of consistency. This measure can be applied to attention:

> **Core idea**: Tokens that are already coherent with context don't need expensive attention. Route computation based on coherence energy, not learned confidence.


---

## Decision

### Implement Coherence-Gated Transformer (CGT) in `ruvector-attention`

A novel attention mechanism that uses sheaf coherence to:

1. **Route tokens** to different compute depths
2. **Sparsify attention** based on residual energy
3. **Exit early** when energy converges
4. **Replace QKV projections** with restriction maps


---

## Architecture

### High-Level Design

```
┌─────────────────────────────────────────────────────────────────────────────┐
│                      COHERENCE-GATED TRANSFORMER (CGT)                      │
│                                                                             │
│  ┌─────────────────────────────────────────────────────────────────────────┐│
│  │                            INPUT PROCESSING                             ││
│  │   Tokens ──► Embedding ──► Initial Coherence Graph                      ││
│  └─────────────────────────────────────────────────────────────────────────┘│
│                                      │                                      │
│                                      ▼                                      │
│  ┌─────────────────────────────────────────────────────────────────────────┐│
│  │                            COHERENCE ROUTER                             ││
│  │                                                                         ││
│  │   For each token t:                                                     ││
│  │     E(t) = Σ w_e ||ρ_t(x_t) - ρ_ctx(x_ctx)||²                           ││
│  │                                                                         ││
│  │   Route based on energy:                                                ││
│  │   ┌──────────────┬──────────────┬──────────────┐                        ││
│  │   │ E < θ_reflex │  E < θ_std   │  E ≥ θ_std   │                        ││
│  │   │      │       │      │       │      │       │                        ││
│  │   │      ▼       │      ▼       │      ▼       │                        ││
│  │   │    LANE 0    │    LANE 1    │    LANE 2    │                        ││
│  │   │    Reflex    │   Standard   │     Deep     │                        ││
│  │   └──────────────┴──────────────┴──────────────┘                        ││
│  └─────────────────────────────────────────────────────────────────────────┘│
│                                      │                                      │
│            ┌─────────────────────────┼─────────────────────────┐            │
│            │                         │                         │            │
│            ▼                         ▼                         ▼            │
│  ┌───────────────────┐     ┌───────────────────┐     ┌───────────────────┐  │
│  │      LANE 0       │     │      LANE 1       │     │      LANE 2       │  │
│  │      REFLEX       │     │     STANDARD      │     │       DEEP        │  │
│  │                   │     │                   │     │                   │  │
│  │ • 1-2 layers      │     │ • 6 layers        │     │ • 12+ layers      │  │
│  │ • Local attention │     │ • Sparse sheaf    │     │ • Full + MoE      │  │
│  │   (window=64)     │     │   attention       │     │ • All experts     │  │
│  │ • No FFN          │     │ • ~1ms            │     │ • Spectral        │  │
│  │ • <0.1ms          │     │                   │     │   analysis        │  │
│  │                   │     │                   │     │ • ~5ms            │  │
│  └───────────────────┘     └───────────────────┘     └───────────────────┘  │
│            │                         │                         │            │
│            └─────────────────────────┼─────────────────────────┘            │
│                                      ▼                                      │
│  ┌─────────────────────────────────────────────────────────────────────────┐│
│  │                         COHERENCE VERIFICATION                          ││
│  │                                                                         ││
│  │   E_final = compute_energy(output_graph)                                ││
│  │                                                                         ││
│  │   if E_final > θ_max:                                                   ││
│  │       → Escalate to Lane 2 OR refuse generation                         ││
│  │   else:                                                                 ││
│  │       → Output with witness                                             ││
│  └─────────────────────────────────────────────────────────────────────────┘│
│                                      │                                      │
│                                      ▼                                      │
│                              Output + Witness                               │
└─────────────────────────────────────────────────────────────────────────────┘
```

### Component Details

#### 1. Sheaf Attention Layer

Replace standard scaled dot-product attention with coherence-based attention:

```
Standard Attention:
  Attention(Q, K, V) = softmax(QK^T / √d) V

Sheaf Attention:
  R_ij = ||ρ_i(x_i) - ρ_j(x_j)||²              # Residual energy
  A_ij = exp(-β × R_ij) / Σ_k exp(-β × R_ik)   # Coherence-based weight
  Output = A × V
```

**Key difference**: Attention weight decreases exponentially with residual energy.
- High residual (incoherent) → Low attention (don't propagate inconsistency)
- Low residual (coherent) → High attention (reinforce consistency)

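The weighting above can be sketched in a few lines of pure Python. This is a minimal illustration, not the `ruvector-attention` implementation: it assumes the restriction maps have already been applied, so `projected[i]` stands for ρ_i(x_i), and the function name `sheaf_attention_weights` and the value of β are hypothetical.

```python
import math

def sheaf_attention_weights(projected, beta=1.0):
    """Row-normalized weights A_ij = softmax_j(-beta * R_ij).

    `projected[i]` stands for rho_i(x_i); applying the restriction maps
    beforehand is an assumption of this sketch.
    """
    n = len(projected)
    weights = []
    for i in range(n):
        # R_ij: squared Euclidean distance between projected tokens
        residuals = [
            sum((a - b) ** 2 for a, b in zip(projected[i], projected[j]))
            for j in range(n)
        ]
        # Numerically stable softmax over -beta * R_ij
        logits = [-beta * r for r in residuals]
        peak = max(logits)
        exps = [math.exp(l - peak) for l in logits]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

tokens = [[1.0, 0.0], [0.9, 0.1], [-1.0, 0.5]]
A = sheaf_attention_weights(tokens, beta=2.0)
```

In this toy input, token 0 puts most of its weight on itself and on token 1 (small residual) and very little on token 2 (large residual).
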
#### 2. Restriction Map Projections

Replace learned W_q, W_k, W_v with restriction maps:

```
Standard:
  Q = W_q × x    (learned projection)
  K = W_k × x
  V = W_v × x

Sheaf:
  Q = ρ_q(x)     (restriction map to query manifold)
  K = ρ_k(x)     (restriction map to key manifold)
  V = ρ_v(x)     (restriction map to value manifold)
```

**Benefits**:
- Restriction maps have geometric meaning (project to shared space)
- Can be initialized from domain knowledge
- Residuals are interpretable

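To make "project to shared space" concrete, here is a small sketch that treats each restriction map as a plain matrix. That linearity is an assumption of the example (the ADR does not fix the form of ρ), and `apply_restriction` and `residual` are hypothetical helper names.

```python
def apply_restriction(matrix, x):
    """Apply a linear restriction map (a plain matrix here) to a vector."""
    return [sum(m * v for m, v in zip(row, x)) for row in matrix]

def residual(rho_i, x_i, rho_j, x_j):
    """r_ij = rho_i(x_i) - rho_j(x_j), computed in the shared space."""
    left = apply_restriction(rho_i, x_i)
    right = apply_restriction(rho_j, x_j)
    return [a - b for a, b in zip(left, right)]

# Hypothetical 3-d tokens projected into a 2-d shared space.
rho_a = [[1.0, 0.0, 0.0],
         [0.0, 1.0, 0.0]]   # keeps the first two coordinates
rho_b = [[1.0, 0.0, 0.0],
         [0.0, 0.0, 1.0]]   # keeps the first and third coordinates

r = residual(rho_a, [1.0, 2.0, 3.0], rho_b, [1.0, 5.0, 2.0])
energy = sum(c * c for c in r)
```

The two tokens differ in the raw space but agree after projection, so the residual is the zero vector. Each component of `r` can be read as disagreement along one shared dimension, which is the sense in which residuals are interpretable.
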
#### 3. Token-Level Compute Routing

```python
def route_token(token_embedding, context_graph):
    # Compute coherence energy with context
    energy = compute_token_energy(token_embedding, context_graph)

    if energy < THETA_REFLEX:
        return Lane.REFLEX    # Minimal compute
    elif energy < THETA_STANDARD:
        return Lane.STANDARD  # Normal compute
    else:
        return Lane.DEEP      # Maximum compute
```

**Routing thresholds** (tunable via SONA):

| Threshold | Default | Meaning |
|-----------|---------|---------|
| θ_reflex | 0.01 | Token is highly coherent with context |
| θ_standard | 0.1 | Token has minor inconsistencies |
| θ_deep | 1.0 | Token has major inconsistencies |

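A runnable version of the routing rule, using the default θ_reflex and θ_standard from the table. The energy is passed in directly, since `compute_token_energy` is not specified here, and `route_energy` is a hypothetical name for this sketch.

```python
from enum import Enum

class Lane(Enum):
    REFLEX = 0
    STANDARD = 1
    DEEP = 2

# Defaults taken from the threshold table
THETA_REFLEX = 0.01
THETA_STANDARD = 0.1

def route_energy(energy, theta_reflex=THETA_REFLEX, theta_standard=THETA_STANDARD):
    """Map a precomputed token energy to a compute lane."""
    if energy < theta_reflex:
        return Lane.REFLEX
    if energy < theta_standard:
        return Lane.STANDARD
    return Lane.DEEP

energies = [0.004, 0.05, 0.1, 0.7]
lanes = [route_energy(e) for e in energies]
```

Both comparisons are strict, so an energy exactly equal to θ_standard is routed to the deep lane. θ_deep is not consulted here because the routing function only compares against the first two thresholds.
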
#### 4. Residual-Sparse Attention

Only compute attention for token pairs with high residual:

```python
import numpy as np

def sparse_sheaf_attention(X, threshold):
    N = len(X)
    attention_mask = np.zeros((N, N))

    for i in range(N):
        for j in range(N):
            residual = compute_residual(X[i], X[j])
            if residual > threshold:
                # These tokens are incoherent - need attention
                attention_mask[i, j] = 1
            # else: skip attention (already coherent)

    # Compute attention only for non-zero mask entries
    return masked_attention(X, attention_mask)
```

**Sparsity pattern**: Adapts to content, not fixed like local/strided attention.

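The mask construction can be tried on a toy input. In this sketch the residual is taken to be the squared distance between already-projected tokens (an assumption), and `residual_mask` and `density` are hypothetical helpers. `density` returns the non-zero fraction of the mask.

```python
def residual_mask(projected, threshold):
    """Mask entry (i, j) is 1 when the squared residual exceeds `threshold`."""
    n = len(projected)
    mask = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            r = sum((a - b) ** 2 for a, b in zip(projected[i], projected[j]))
            if r > threshold:
                mask[i][j] = 1
    return mask

def density(mask):
    """Fraction of non-zero mask entries."""
    n = len(mask)
    return sum(map(sum, mask)) / (n * n)

# Three mutually close tokens and one outlier
tokens = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [2.0, 2.0]]
mask = residual_mask(tokens, threshold=0.05)
s = density(mask)
```

Only pairs involving the outlier token exceed the threshold, so 6 of the 16 entries are set. The sketch evaluates every pair to build the mask; it does not show how a production kernel would avoid that cost.
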
#### 5. Energy-Based Early Exit

```python
def forward_with_early_exit(x, layers, epsilon=0.001):
    prev_energy = float('inf')

    for layer in layers:
        x = layer(x)
        curr_energy = compute_energy(x)

        delta = abs(curr_energy - prev_energy)
        if delta < epsilon:
            # Energy converged - no need for more layers
            return x

        prev_energy = curr_energy

    return x
```

**Exit criterion**: Energy convergence, not confidence threshold.

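A self-contained toy run of the exit rule. The layer (`halve`) and the energy function (`squared_norm`) are stand-ins chosen so the arithmetic is easy to follow; they are not the real layers or the Prime-Radiant energy. This variant also returns the number of layers executed.

```python
def forward_with_early_exit(x, layers, energy_fn, epsilon=0.001):
    """Apply layers until the energy change drops below `epsilon`."""
    prev_energy = float("inf")
    used = 0
    for layer in layers:
        x = layer(x)
        used += 1
        curr_energy = energy_fn(x)
        if abs(curr_energy - prev_energy) < epsilon:
            break  # Energy converged - skip the remaining layers
        prev_energy = curr_energy
    return x, used

def halve(x):
    # Toy "layer": contracts every component toward zero
    return [0.5 * v for v in x]

def squared_norm(x):
    # Toy stand-in for the coherence energy
    return sum(v * v for v in x)

out, used = forward_with_early_exit([1.0, 1.0], [halve] * 12, squared_norm)
```

With these stand-ins the energy shrinks by a factor of four per layer, and the change first falls below `epsilon` at layer 7, so 5 of the 12 layers are skipped.
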

---

## Compute Lane Specifications

### Lane 0: Reflex (<0.1ms)

```
Layers: 1-2
Attention: Local only (window=64)
FFN: Skip or minimal
Use case: Common tokens, clear context
Example: "the", "is", "and" in well-formed sentences
```

### Lane 1: Standard (~1ms)

```
Layers: 6
Attention: Sparse sheaf (residual > 0.05)
FFN: Standard
Use case: Normal tokens requiring context integration
Example: Most content words
```

### Lane 2: Deep (~5ms)

```
Layers: 12+
Attention: Full sheaf + MoE routing
FFN: Expert mixture
Spectral: Eigenvalue analysis for structural issues
Use case: Ambiguous, contradictory, or complex tokens
Example: "bank" (river or financial?), negations, rare words
```

### Lane 3: Escalate (async)

```
Action: Return uncertainty, request clarification
Use case: Irreconcilable incoherence
Example: "The cat is not a cat" - logical contradiction
```


---

## Mathematical Foundation

### Sheaf Attention Formula

Given tokens X = {x_1, ..., x_N} and restriction maps ρ_i, ρ_j:

**Residual**:
```
r_ij = ρ_i(x_i) - ρ_j(x_j)
```

**Edge energy**:
```
E_ij = w_ij × ||r_ij||²
```

**Token energy**:
```
E_i = Σ_j E_ij    (sum over edges incident to i)
```

**Attention weight** (coherence-based):
```
A_ij = exp(-β × E_ij) / Σ_k exp(-β × E_ik)
```

**Output**:
```
y_i = Σ_j A_ij × V_j
```
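
A short worked example of the edge and token energies on a three-token chain. The edge weights and residual vectors are made-up numbers for illustration, and `edge_energy` and `token_energies` are hypothetical helpers.

```python
def edge_energy(weight, r):
    """E_ij = w_ij * ||r_ij||^2"""
    return weight * sum(c * c for c in r)

def token_energies(n, edges):
    """E_i = sum of E_ij over edges incident to i.

    `edges` maps (i, j) to (w_ij, r_ij); each undirected edge is listed once.
    """
    totals = [0.0] * n
    for (i, j), (w, r) in edges.items():
        e = edge_energy(w, r)
        totals[i] += e
        totals[j] += e
    return totals

edges = {
    (0, 1): (1.0, [0.5, 0.0]),   # E_01 = 1.0 * 0.25 = 0.25
    (1, 2): (2.0, [1.0, 1.0]),   # E_12 = 2.0 * 2.0  = 4.0
}
E = token_energies(3, edges)
```

Token 1 sits on both edges, so its energy is the sum 0.25 + 4.0 = 4.25. With the default thresholds all three tokens would land in the deep lane.
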

### Complexity Analysis

| Operation | Standard | Sheaf (Dense) | Sheaf (Sparse, s% non-zero) |
|-----------|----------|---------------|----------------------------|
| Attention | O(N²d) | O(N²d) | O(s×N²d) |
| Routing | - | O(Nd) | O(Nd) |
| Early exit | - | O(Ld) per check | O(Ld) per check |
| **Total** | O(N²Ld) | O(N²Ld) | O(s×N²Ld + routing) |

With s = 10-20% of attention entries non-zero and about 50% of layers skipped through early exit, the projected speedup is **5-10x**.

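The attention term of the table can be plugged into a small cost model. This is a rough sketch under stated assumptions: the values of N, d and L are illustrative, only the attention and routing terms are counted, and FFN and fixed overheads are left out, so the ratio is an estimate for the attention term rather than an end-to-end figure.

```python
def attention_cost(n, d, layers, density=1.0, layers_kept=1.0):
    """Attention term only: density * N^2 * d per layer that actually runs."""
    return density * n * n * d * layers * layers_kept

def routing_cost(n, d):
    """One O(N*d) routing pass."""
    return n * d

# Illustrative sizes (assumptions, not taken from the ADR)
n, d, layers = 128, 512, 12

dense = attention_cost(n, d, layers)
sparse = (
    attention_cost(n, d, layers, density=0.2, layers_kept=0.5)
    + routing_cost(n, d)
)
ratio = dense / sparse
```

At 20% density with half the layers skipped, the attention term shrinks by a little under 10x; the routing pass is small by comparison. Costs not modelled here (FFN, verification) pull the end-to-end figure below this ratio.
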

---

## Integration with `ruvector-attention`

### New Modules

```
ruvector-attention/
├── src/
│   ├── sheaf/                 # NEW: Sheaf attention
│   │   ├── mod.rs
│   │   ├── attention.rs       # SheafAttention layer
│   │   ├── restriction.rs     # Restriction map projections
│   │   ├── router.rs          # Token-level routing
│   │   ├── sparse.rs          # Residual-sparse attention
│   │   └── early_exit.rs      # Energy-based early exit
│   │
│   ├── coherence_gated/       # NEW: Full CGT implementation
│   │   ├── mod.rs
│   │   ├── transformer.rs     # CoherenceGatedTransformer
│   │   ├── lane.rs            # ComputeLane enum + configs
│   │   ├── config.rs          # CGTConfig
│   │   └── benchmark.rs       # Latency/quality benchmarks
│   │
│   └── ... (existing modules)
```

### New Types

```rust
/// Sheaf-based attention layer
pub struct SheafAttention {
    /// Restriction map for queries
    pub rho_query: RestrictionMap,
    /// Restriction map for keys
    pub rho_key: RestrictionMap,
    /// Restriction map for values
    pub rho_value: RestrictionMap,
    /// Temperature for attention softmax
    pub beta: f32,
    /// Sparsity threshold
    pub sparsity_threshold: f32,
}

/// Compute lane for token routing
#[derive(Debug, Clone, Copy)]
pub enum ComputeLane {
    /// Minimal compute (<0.1ms)
    Reflex,
    /// Standard compute (~1ms)
    Standard,
    /// Deep compute (~5ms)
    Deep,
    /// Escalate to caller
    Escalate,
}

/// Coherence-Gated Transformer configuration
pub struct CGTConfig {
    /// Embedding dimension
    pub d_model: usize,
    /// Layers per lane
    pub layers_per_lane: [usize; 3], // [reflex, standard, deep]
    /// Routing thresholds
    pub thresholds: CoherenceThresholds,
    /// Sparsity settings
    pub sparsity: SparsityConfig,
    /// Early exit settings
    pub early_exit: EarlyExitConfig,
}

/// Token routing decision
pub struct RoutingDecision {
    pub token_id: usize,
    pub energy: f32,
    pub lane: ComputeLane,
    pub attention_mask: Option<SparseMask>,
}
```

### Feature Flags

```toml
[features]
# Sheaf attention (requires prime-radiant)
sheaf = ["dep:prime-radiant"]

# Full CGT implementation
coherence-gated = ["sheaf", "sparse", "moe"]

# Benchmarking utilities
cgt-bench = ["coherence-gated", "criterion"]
```


---

## Performance Targets

| Metric | Standard Transformer | CGT Target | Improvement |
|--------|---------------------|------------|-------------|
| Average latency (128 tokens) | 10ms | 1-2ms | 5-10x |
| P99 latency (128 tokens) | 15ms | 8ms | ~2x |
| Memory (batch=32) | 2GB | 800MB | 2.5x |
| Quality (perplexity) | Baseline | <5% degradation | Acceptable |

### Latency Breakdown

```
Standard (10ms total):
  Attention: 6ms (60%)
  FFN: 3ms (30%)
  Other: 1ms (10%)

CGT Target (2ms total):
  Routing: 0.1ms (5%)
  Attention (sparse): 1ms (50%)
  FFN (conditional): 0.7ms (35%)
  Other: 0.2ms (10%)
```
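
The breakdown can be checked with a few lines of arithmetic. The component figures are copied from the block above; the dictionary keys and the `shares` helper are illustrative names only.

```python
standard_ms = {"attention": 6.0, "ffn": 3.0, "other": 1.0}
cgt_ms = {
    "routing": 0.1,
    "attention_sparse": 1.0,
    "ffn_conditional": 0.7,
    "other": 0.2,
}

def shares(parts):
    """Percentage share of each component, rounded to whole percent."""
    total = sum(parts.values())
    return {name: round(100 * ms / total) for name, ms in parts.items()}

standard_total = sum(standard_ms.values())
cgt_total = sum(cgt_ms.values())
speedup = standard_total / cgt_total
```

The 2ms target corresponds to a 5x speedup over the 10ms baseline, the low end of the 5-10x range; the 1ms end of the average-latency target corresponds to 10x.
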

---

## Quality Guarantees

### Coherence Bound

Every emitted output has coherence energy at or below the threshold; anything above it is escalated or refused:

```
E(output) ≤ θ_max OR escalate/refuse
```

This is **stronger** than confidence-based systems, which can be confidently wrong.

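The verification gate can be sketched as a single check. `Verdict` is a hypothetical, much-simplified stand-in for the witness, and the value of θ_max used below is an assumption, since this ADR does not fix it.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Verdict:
    accepted: bool
    energy: float
    output: Optional[str]  # None when the result is escalated or refused

def verify_output(output, energy, theta_max):
    """Release `output` only when its final energy is within `theta_max`."""
    if energy > theta_max:
        return Verdict(accepted=False, energy=energy, output=None)
    return Verdict(accepted=True, energy=energy, output=output)

# theta_max=1.0 is a placeholder value for this sketch
ok = verify_output("answer", energy=0.3, theta_max=1.0)
rejected = verify_output("answer", energy=2.5, theta_max=1.0)
```

The recorded energy travels with the verdict in both cases, so a caller can see how far a rejected output was from the bound.
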
### Graceful Degradation

Under compute pressure:
1. Raise θ_reflex → more tokens to Lane 0
2. Increase sparsity threshold → fewer attention computations
3. Quality degrades **predictably** (energy increases)

### Interpretability

For any output, the routing trace answers:
- Which tokens went to which lane?
- Which token pairs had high residuals?
- Where did the model "struggle"?


---

## Comparison with Existing Approaches

| Feature | Flash Attention | Sparse Transformers | MoE | CGT (Ours) |
|---------|-----------------|---------------------|-----|------------|
| Adaptive compute | No | No | Yes | Yes |
| Content-based sparsity | No | No | Partial | Yes |
| Mathematical grounding | No | No | No | Yes (sheaf) |
| Quality guarantee | No | No | No | Yes (energy bound) |
| Interpretable routing | N/A | N/A | Partial | Yes |
| Early exit criterion | N/A | N/A | Confidence | Energy convergence |

---

## Research Questions

1. **Restriction map initialization**: Random vs. pre-trained vs. analytical?

2. **Threshold tuning**: Can SONA auto-tune θ values during inference?

3. **Multi-head sheaf attention**: One graph per head, or shared graph?

4. **Training objective**: Standard cross-entropy + energy regularization?

5. **Hardware optimization**: Can residual computation be fused with attention kernels?


---

## Implementation Phases

### Phase 1: Foundation (Weeks 1-4)
- [ ] `SheafAttention` layer with restriction maps
- [ ] Basic residual computation
- [ ] Unit tests for mathematical correctness

### Phase 2: Routing (Weeks 5-8)
- [ ] `ComputeLane` enum and routing logic
- [ ] Token-level energy computation
- [ ] Lane-specific layer configurations

### Phase 3: Sparsity (Weeks 9-12)
- [ ] Residual-sparse attention mask generation
- [ ] Efficient sparse attention kernel
- [ ] Sparsity pattern analysis tools

### Phase 4: Integration (Weeks 13-16)
- [ ] `CoherenceGatedTransformer` full implementation
- [ ] Early exit with energy convergence
- [ ] Benchmarking suite

### Phase 5: Optimization (Weeks 17-20)
- [ ] SIMD optimization for residual computation
- [ ] Kernel fusion opportunities
- [ ] SONA integration for threshold tuning


---

## Dependencies

### Required
- `prime-radiant` (coherence computation)
- `ruvector-core` (vector operations)
- `ndarray` (matrix operations)

### Optional
- `rayon` (parallel routing)
- `criterion` (benchmarking)

---

## References

1. Hansen, J., & Ghrist, R. (2019). "Toward a spectral theory of cellular sheaves."

2. Vaswani et al. (2017). "Attention Is All You Need."

3. Kitaev et al. (2020). "Reformer: The Efficient Transformer."

4. Fedus et al. (2022). "Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity."

5. ADR-014: Coherence Engine Architecture


---

## Related Decisions

- **ADR-014**: Coherence Engine Architecture (Prime-Radiant)
- **ADR-003**: SIMD Optimization Strategy
- **ADR-006**: Memory Management

---

## Appendix: Name Options

| Name | Rationale |
|------|-----------|
| **Coherence-Gated Transformer (CGT)** | Descriptive, clear function |
| **Sheaf Attention** | Mathematical foundation |
| **Residual-Routed Transformer** | Emphasizes routing mechanism |
| **Energy-Adaptive Transformer** | Emphasizes efficiency |
| **Prime Transformer** | Connection to Prime-Radiant |

**Recommended**: "Coherence-Gated Transformer (CGT)" for the architecture, "Sheaf Attention" for the attention mechanism.