Merge commit 'd803bfe2b1fe7f5e219e50ac20d6801a0a58ac75' as 'vendor/ruvector'
This commit is contained in:
148
vendor/ruvector/docs/implementation/SPECULATIVE_DECODING.md
vendored
Normal file
148
vendor/ruvector/docs/implementation/SPECULATIVE_DECODING.md
vendored
Normal file
@@ -0,0 +1,148 @@
|
||||
# EAGLE-3 Speculative Decoding
|
||||
|
||||
Implementation of EAGLE-3 style speculative decoding for the mincut-gated-transformer crate.
|
||||
|
||||
## Overview
|
||||
|
||||
Speculative decoding accelerates inference by drafting multiple tokens in parallel and verifying them against the target model using rejection sampling. This implementation uses mincut λ-stability as a confidence signal to guide draft tree generation.
|
||||
|
||||
## Files
|
||||
|
||||
- `/home/user/ruvector/crates/ruvector-mincut-gated-transformer/src/speculative.rs` - Core implementation
|
||||
|
||||
## Key Features
|
||||
|
||||
### 1. Draft Tree Generation
|
||||
|
||||
Dynamic tree structure that adapts based on model confidence:
|
||||
|
||||
```rust
|
||||
let config = SpeculativeConfig {
|
||||
max_draft_tokens: 5, // Draft up to 5 tokens ahead
|
||||
tree_width: 3, // Up to 3 branches per node
|
||||
acceptance_threshold: 0.7, // 70% confidence for acceptance
|
||||
use_lambda_guidance: true, // Use λ as confidence signal
|
||||
};
|
||||
|
||||
let decoder = SpeculativeDecoder::new(config);
|
||||
let tree = decoder.generate_draft_tree(lambda, lambda_prev, draft_logits);
|
||||
```
|
||||
|
||||
### 2. λ-Guided Confidence
|
||||
|
||||
Uses mincut λ-stability to scale draft confidence:
|
||||
|
||||
- **Higher λ** = More stable partitioning = Higher draft confidence
|
||||
- **Increasing λ** = Improving stability = Confidence bonus
|
||||
- **Decreasing λ** = Degrading stability = Confidence penalty
|
||||
|
||||
### 3. Adaptive Tree Width
|
||||
|
||||
Tree branching adapts to confidence levels:
|
||||
|
||||
- **High confidence (≥0.9)**: Narrow tree (fewer branches)
|
||||
- **Medium confidence (0.6-0.9)**: Normal width
|
||||
- **Low confidence (0.3-0.6)**: Wider tree (more exploration)
|
||||
- **Very low confidence (<0.3)**: Minimal branching
|
||||
|
||||
### 4. Rejection Sampling Verification
|
||||
|
||||
EAGLE-3 style verification using:
|
||||
|
||||
```
|
||||
accept_prob = min(1, target_prob / draft_prob)
|
||||
```
|
||||
|
||||
Drafts are accepted if they match the target model's distribution.
|
||||
|
||||
### 5. Tree Attention Masks
|
||||
|
||||
Parallel verification of draft tokens using causal tree attention:
|
||||
|
||||
```rust
|
||||
let mask = generate_tree_attention_mask(&tree, seq_len);
|
||||
// Each token can attend to all ancestors in its path
|
||||
```
|
||||
|
||||
## Usage Example
|
||||
|
||||
```rust
|
||||
use ruvector_mincut_gated_transformer::prelude::*;
|
||||
|
||||
// Create decoder
|
||||
let config = SpeculativeConfig::default();
|
||||
let decoder = SpeculativeDecoder::new(config);
|
||||
|
||||
// Generate draft tree (5 tokens, dynamic structure)
|
||||
let lambda = 100; // Current mincut stability
|
||||
let lambda_prev = 95; // Previous stability
|
||||
let draft_logits = vec![vec![0.0; 1000]; 5]; // Draft model outputs
|
||||
|
||||
let tree = decoder.generate_draft_tree(lambda, lambda_prev, &draft_logits);
|
||||
|
||||
// Verify against target model
|
||||
let target_logits = vec![vec![0.0; 1000]; 5]; // Target model outputs
|
||||
let result = decoder.verify_drafts(&tree, &target_logits, 1.0);
|
||||
|
||||
println!("Accepted {} tokens with {:.1}% acceptance rate",
|
||||
result.accepted_count,
|
||||
result.acceptance_rate * 100.0);
|
||||
```
|
||||
|
||||
## Performance Characteristics
|
||||
|
||||
- **Speedup**: 2-5x for high acceptance rates
|
||||
- **Memory**: O(max_draft_tokens × tree_width × vocab_size)
|
||||
- **Overhead**: ~10% for low acceptance rates
|
||||
- **Best case**: Stable models (high λ) with predictable outputs
|
||||
|
||||
## Academic Foundation
|
||||
|
||||
Based on **EAGLE-3** (NeurIPS 2025):
|
||||
|
||||
1. **Dynamic tree structure**: Adapts to model confidence
|
||||
2. **Multi-level feature fusion**: Uses λ-stability as confidence signal
|
||||
3. **Rejection sampling**: Mathematically correct acceptance criteria
|
||||
4. **Tree attention**: Parallel draft verification
|
||||
|
||||
## Integration with Mincut Gating
|
||||
|
||||
The speculative decoder integrates with the mincut-gated-transformer's coherence signals:
|
||||
|
||||
- **λ-stability** guides draft confidence
|
||||
- **High λ** (stable partitioning) → More aggressive speculation
|
||||
- **Low λ** (unstable partitioning) → Conservative speculation
|
||||
- **λ trends** influence tree width adaptation
|
||||
|
||||
## Testing
|
||||
|
||||
Comprehensive test suite covering:
|
||||
|
||||
- ✓ Single-path speculation (sequential drafting)
|
||||
- ✓ Tree speculation with branching (parallel drafting)
|
||||
- ✓ Rejection sampling correctness
|
||||
- ✓ λ-guided confidence scaling
|
||||
- ✓ Draft verification against target model
|
||||
- ✓ Tree attention mask generation
|
||||
- ✓ Adaptive tree width calculation
|
||||
- ✓ Edge cases (empty inputs, etc.)
|
||||
|
||||
Run tests:
|
||||
|
||||
```bash
|
||||
cd crates/ruvector-mincut-gated-transformer
|
||||
cargo test --lib speculative
|
||||
```
|
||||
|
||||
All 8 tests pass successfully.
|
||||
|
||||
## Future Enhancements
|
||||
|
||||
Potential improvements:
|
||||
|
||||
1. **Multi-token drafting**: Draft multiple positions simultaneously
|
||||
2. **Learned draft models**: Train lightweight draft models
|
||||
3. **Dynamic threshold adaptation**: Adjust acceptance threshold based on λ
|
||||
4. **Quantized drafting**: Use INT8/INT4 for draft model
|
||||
5. **Cached drafts**: Reuse draft trees across timesteps
|
||||
6. **Hybrid verification**: Combine rejection sampling with direct comparison
|
||||
Reference in New Issue
Block a user