Squashed 'vendor/ruvector/' content from commit b64c2172

git-subtree-dir: vendor/ruvector
git-subtree-split: b64c21726f2bb37286d9ee36a7869fef60cc6900
ruv
2026-02-28 14:39:40 -05:00
commit d803bfe2b1
7854 changed files with 3522914 additions and 0 deletions

# FastGRNN Training Pipeline Guide
This guide covers the complete training pipeline for the FastGRNN model used in Tiny Dancer's neural routing system.
## Table of Contents
1. [Overview](#overview)
2. [Architecture](#architecture)
3. [Quick Start](#quick-start)
4. [Training Configuration](#training-configuration)
5. [Data Preparation](#data-preparation)
6. [Training Loop](#training-loop)
7. [Knowledge Distillation](#knowledge-distillation)
8. [Advanced Features](#advanced-features)
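9. [Production Deployment](#production-deployment)
10. [Performance Benchmarks](#performance-benchmarks)
11. [Troubleshooting](#troubleshooting)
12. [Next Steps](#next-steps)
13. [References](#references)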
## Overview
The FastGRNN training pipeline provides a complete solution for training lightweight recurrent neural networks for AI agent routing decisions. Key features include:
- **Adam Optimizer**: Adaptive per-parameter learning rates with bias-corrected moment estimates
- **Mini-batch Training**: Efficient batch processing with configurable batch sizes
- **Early Stopping**: Automatic stopping when validation loss stops improving
- **Learning Rate Scheduling**: Step-wise exponential decay for better convergence
- **Knowledge Distillation**: Learn from larger teacher models
- **Gradient Clipping**: Prevent exploding gradients
- **L2 Regularization**: Prevent overfitting
## Architecture
### FastGRNN Cell
The FastGRNN (Fast Gated Recurrent Neural Network) uses a simplified gating mechanism:
```
r_t = σ(W_r × x_t + b_r)                        [Reset gate]
u_t = σ(W_u × x_t + b_u)                        [Update gate]
c_t = tanh(W_c × x_t + W × (r_t ⊙ h_{t-1}))     [Candidate state]
h_t = u_t ⊙ h_{t-1} + (1 - u_t) ⊙ c_t           [Hidden state]
y_t = σ(W_out × h_t + b_out)                    [Output]
```
Where:
- `σ` is the sigmoid activation with scaling parameter `nu`
- `tanh` is the hyperbolic tangent with scaling parameter `zeta`
- `⊙` denotes element-wise multiplication
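
As a rough illustration of the equations above, the hidden-state update can be written with plain `Vec<f32>` arithmetic. This is a sketch only: the `CellParams` struct, its field names, and `cell_step` are hypothetical and not the crate's API, and it assumes `nu` and `zeta` multiply the inputs of the sigmoid and tanh respectively. The output layer `y_t` is omitted for brevity.

```rust
/// Hypothetical parameter container (dense, row-major); not the crate's type.
pub struct CellParams {
    pub w_r: Vec<Vec<f32>>,   // hidden x input
    pub w_u: Vec<Vec<f32>>,   // hidden x input
    pub w_c: Vec<Vec<f32>>,   // hidden x input
    pub w_rec: Vec<Vec<f32>>, // hidden x hidden (the `W` in the candidate equation)
    pub b_r: Vec<f32>,
    pub b_u: Vec<f32>,
    pub nu: f32,   // assumed to scale the sigmoid input
    pub zeta: f32, // assumed to scale the tanh input
}

/// Dense matrix-vector product.
fn matvec(m: &[Vec<f32>], v: &[f32]) -> Vec<f32> {
    m.iter()
        .map(|row| row.iter().zip(v).map(|(a, b)| a * b).sum::<f32>())
        .collect()
}

fn sigmoid(x: f32) -> f32 {
    1.0 / (1.0 + (-x).exp())
}

/// One recurrent step: computes h_t from x_t and h_{t-1}.
pub fn cell_step(p: &CellParams, x: &[f32], h_prev: &[f32]) -> Vec<f32> {
    let wr_x = matvec(&p.w_r, x);
    let wu_x = matvec(&p.w_u, x);
    let wc_x = matvec(&p.w_c, x);

    // Reset and update gates depend only on the current input.
    let r: Vec<f32> = wr_x.iter().zip(&p.b_r).map(|(a, b)| sigmoid(p.nu * (a + b))).collect();
    let u: Vec<f32> = wu_x.iter().zip(&p.b_u).map(|(a, b)| sigmoid(p.nu * (a + b))).collect();

    // Candidate state uses the reset-gated previous hidden state.
    let gated: Vec<f32> = r.iter().zip(h_prev).map(|(r, h)| r * h).collect();
    let rec = matvec(&p.w_rec, &gated);
    let c: Vec<f32> = wc_x.iter().zip(&rec).map(|(a, b)| (p.zeta * (a + b)).tanh()).collect();

    // Blend previous state and candidate with the update gate.
    u.iter()
        .zip(h_prev)
        .zip(&c)
        .map(|((u, h), c)| u * h + (1.0 - u) * c)
        .collect()
}
```

With all weights and biases at zero, both gates evaluate to 0.5 and the candidate to 0, so each step halves the previous hidden state. That makes a convenient sanity check.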
### Training Pipeline
```
┌─────────────────┐
│  Raw Features   │
│  + Labels       │
└────────┬────────┘
         ▼
┌─────────────────┐
│  Normalization  │
│  (z-score)      │
└────────┬────────┘
         ▼
┌─────────────────┐
│  Train/Val      │
│  Split          │
└────────┬────────┘
         ▼
┌─────────────────┐
│  Mini-batch     │
│  Training       │
│  (BPTT)         │
└────────┬────────┘
         ▼
┌─────────────────┐
│  Adam Update    │
│  + Grad Clip    │
└────────┬────────┘
         ▼
┌─────────────────┐
│  Validation     │
│  + Early Stop   │
└────────┬────────┘
         ▼
┌─────────────────┐
│  Trained Model  │
└─────────────────┘
```
## Quick Start
### Basic Training
```rust
use ruvector_tiny_dancer_core::{
    model::{FastGRNN, FastGRNNConfig},
    training::{TrainingConfig, TrainingDataset, Trainer},
};

// 1. Prepare your data
let features = vec![
    vec![0.8, 0.9, 0.7, 0.85, 0.2], // High confidence case
    vec![0.3, 0.2, 0.4, 0.35, 0.9], // Low confidence case
    // ... more samples
];
let labels = vec![1.0, 0.0, /* ... */]; // 1.0 = lightweight, 0.0 = powerful
let mut dataset = TrainingDataset::new(features, labels)?;

// 2. Normalize features
let (means, stds) = dataset.normalize()?;

// 3. Create model
let model_config = FastGRNNConfig {
    input_dim: 5,
    hidden_dim: 16,
    output_dim: 1,
    nu: 0.8,
    zeta: 1.2,
    rank: Some(8),
};
let mut model = FastGRNN::new(model_config.clone())?;

// 4. Configure training
let training_config = TrainingConfig {
    learning_rate: 0.01,
    batch_size: 32,
    epochs: 50,
    validation_split: 0.2,
    early_stopping_patience: Some(5),
    ..Default::default()
};

// 5. Train
let mut trainer = Trainer::new(&model_config, training_config);
let metrics = trainer.train(&mut model, &dataset)?;

// 6. Save model
model.save("models/fastgrnn.safetensors")?;
```
### Run the Example
```bash
cd crates/ruvector-tiny-dancer-core
cargo run --example train-model
```
## Training Configuration
### Hyperparameters
```rust
pub struct TrainingConfig {
    /// Learning rate (default: 0.001)
    pub learning_rate: f32,
    /// Batch size (default: 32)
    pub batch_size: usize,
    /// Number of epochs (default: 100)
    pub epochs: usize,
    /// Validation split ratio (default: 0.2)
    pub validation_split: f32,
    /// Early stopping patience (default: Some(10))
    pub early_stopping_patience: Option<usize>,
    /// Learning rate decay factor (default: 0.5)
    pub lr_decay: f32,
    /// Learning rate decay step in epochs (default: 20)
    pub lr_decay_step: usize,
    /// Gradient clipping threshold (default: 5.0)
    pub grad_clip: f32,
    /// Adam beta1 parameter (default: 0.9)
    pub adam_beta1: f32,
    /// Adam beta2 parameter (default: 0.999)
    pub adam_beta2: f32,
    /// Adam epsilon (default: 1e-8)
    pub adam_epsilon: f32,
    /// L2 regularization strength (default: 1e-5)
    pub l2_reg: f32,
}
```
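
To make the roles of `learning_rate`, `adam_beta1`, `adam_beta2`, and `adam_epsilon` concrete, here is a minimal sketch of the standard Adam update over a flat parameter slice. The `Adam` struct and its methods are illustrative names, not the crate's optimizer; the real trainer may organize its state differently.

```rust
/// Illustrative Adam state for a flat parameter vector (not the crate's type).
pub struct Adam {
    pub lr: f32,
    pub beta1: f32,
    pub beta2: f32,
    pub epsilon: f32,
    m: Vec<f32>, // first-moment (mean) estimates
    v: Vec<f32>, // second-moment (uncentered variance) estimates
    t: i32,      // number of steps taken so far
}

impl Adam {
    pub fn new(n_params: usize, lr: f32, beta1: f32, beta2: f32, epsilon: f32) -> Self {
        Self {
            lr,
            beta1,
            beta2,
            epsilon,
            m: vec![0.0; n_params],
            v: vec![0.0; n_params],
            t: 0,
        }
    }

    /// Apply one update in place, given the gradient of the loss.
    pub fn step(&mut self, params: &mut [f32], grads: &[f32]) {
        assert_eq!(params.len(), grads.len());
        assert_eq!(params.len(), self.m.len());
        self.t += 1;
        // Bias-correction terms compensate for the zero-initialized moments.
        let bc1 = 1.0 - self.beta1.powi(self.t);
        let bc2 = 1.0 - self.beta2.powi(self.t);
        for i in 0..params.len() {
            let g = grads[i];
            self.m[i] = self.beta1 * self.m[i] + (1.0 - self.beta1) * g;
            self.v[i] = self.beta2 * self.v[i] + (1.0 - self.beta2) * g * g;
            let m_hat = self.m[i] / bc1;
            let v_hat = self.v[i] / bc2;
            params[i] -= self.lr * m_hat / (v_hat.sqrt() + self.epsilon);
        }
    }
}
```

On the very first step the bias-corrected ratio is close to the sign of the gradient, so each parameter moves by roughly `learning_rate` regardless of gradient magnitude. This is one reason the default of 0.001 is a conservative starting point.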
### Recommended Settings
#### Small Datasets (< 1,000 samples)
```rust
TrainingConfig {
    learning_rate: 0.01,
    batch_size: 16,
    epochs: 100,
    validation_split: 0.2,
    early_stopping_patience: Some(10),
    lr_decay: 0.8,
    lr_decay_step: 20,
    l2_reg: 1e-4,
    ..Default::default()
}
```
#### Medium Datasets (1,000 - 10,000 samples)
```rust
TrainingConfig {
    learning_rate: 0.005,
    batch_size: 32,
    epochs: 50,
    validation_split: 0.15,
    early_stopping_patience: Some(5),
    lr_decay: 0.7,
    lr_decay_step: 10,
    l2_reg: 1e-5,
    ..Default::default()
}
```
#### Large Datasets (> 10,000 samples)
```rust
TrainingConfig {
    learning_rate: 0.001,
    batch_size: 64,
    epochs: 30,
    validation_split: 0.1,
    early_stopping_patience: Some(3),
    lr_decay: 0.5,
    lr_decay_step: 5,
    l2_reg: 1e-6,
    ..Default::default()
}
```
## Data Preparation
### Feature Engineering
For routing decisions, typical features include:
```rust
pub struct RoutingFeatures {
    /// Semantic similarity between query and candidate (0.0 to 1.0)
    pub similarity: f32,
    /// Recency score - how recently was this candidate accessed (0.0 to 1.0)
    pub recency: f32,
    /// Popularity score - how often is this candidate used (0.0 to 1.0)
    pub popularity: f32,
    /// Historical success rate for this candidate (0.0 to 1.0)
    pub success_rate: f32,
    /// Query complexity estimate (0.0 to 1.0)
    pub complexity: f32,
}

impl RoutingFeatures {
    fn to_vector(&self) -> Vec<f32> {
        vec![
            self.similarity,
            self.recency,
            self.popularity,
            self.success_rate,
            self.complexity,
        ]
    }
}
```
### Data Collection
```rust
// Collect training data from production logs
fn collect_training_data(logs: &[RoutingLog]) -> (Vec<Vec<f32>>, Vec<f32>) {
    let mut features = Vec::new();
    let mut labels = Vec::new();

    for log in logs {
        // Extract features
        let feature_vec = vec![
            log.similarity_score,
            log.recency_score,
            log.popularity_score,
            log.success_rate,
            log.complexity_score,
        ];

        // Label based on actual outcome
        // 1.0 if lightweight model was sufficient
        // 0.0 if powerful model was needed
        let label = if log.lightweight_successful { 1.0 } else { 0.0 };

        features.push(feature_vec);
        labels.push(label);
    }

    (features, labels)
}
```
### Data Normalization
Always normalize your features before training:
```rust
let mut dataset = TrainingDataset::new(features, labels)?;
let (means, stds) = dataset.normalize()?;
// Save normalization parameters for inference
save_normalization_params("models/normalization.json", &means, &stds)?;
```
During inference, apply the same normalization:
```rust
fn normalize_features(features: &mut [f32], means: &[f32], stds: &[f32]) {
    for (i, feat) in features.iter_mut().enumerate() {
        *feat = (*feat - means[i]) / stds[i];
    }
}
```
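
For reference, the z-score statistics themselves can be computed as in the sketch below. The function names `fit_zscore` and `apply_zscore` are hypothetical, the population standard deviation (dividing by `n`) is an assumption about what `dataset.normalize()` does, and the small epsilon guard for constant features is an addition that the crate may or may not apply.

```rust
/// Compute per-feature mean and population standard deviation.
/// Assumes every row has the same length as the first row.
pub fn fit_zscore(features: &[Vec<f32>]) -> (Vec<f32>, Vec<f32>) {
    let n = features.len() as f32;
    let dim = features.first().map_or(0, |row| row.len());
    let mut means = vec![0.0f32; dim];
    let mut stds = vec![0.0f32; dim];

    // First pass: accumulate the mean of each column.
    for row in features {
        for (m, x) in means.iter_mut().zip(row) {
            *m += x / n;
        }
    }
    // Second pass: accumulate the variance, then take the square root.
    for row in features {
        for ((s, m), x) in stds.iter_mut().zip(&means).zip(row) {
            *s += (x - m).powi(2) / n;
        }
    }
    for s in stds.iter_mut() {
        *s = (*s).sqrt();
    }
    (means, stds)
}

/// Normalize one sample in place using previously fitted statistics.
/// A constant feature has std = 0, so the divisor is floored at EPS.
pub fn apply_zscore(sample: &mut [f32], means: &[f32], stds: &[f32]) {
    const EPS: f32 = 1e-8;
    for ((x, m), s) in sample.iter_mut().zip(means).zip(stds) {
        *x = (*x - m) / (*s).max(EPS);
    }
}
```

The statistics must come from the training set only and be reused unchanged at inference time; recomputing them on live traffic would shift the inputs away from what the model saw during training.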
## Training Loop
### Basic Training
```rust
let mut trainer = Trainer::new(&model_config, training_config);
let metrics = trainer.train(&mut model, &dataset)?;
// Print final results
if let Some(last) = metrics.last() {
    println!("Final validation accuracy: {:.2}%", last.val_accuracy * 100.0);
}
```
### Custom Training Loop
For more control, implement your own training loop:
```rust
use ruvector_tiny_dancer_core::training::BatchIterator;

// Assumes `model`, `config`, `train_dataset`, and a
// `binary_cross_entropy(pred, target)` helper are already in scope.
for epoch in 0..config.epochs {
    let mut epoch_loss = 0.0;
    let mut n_batches = 0;

    // Training phase
    let batch_iter = BatchIterator::new(&train_dataset, config.batch_size, true);
    for (features, labels, _) in batch_iter {
        // Forward pass
        let predictions: Vec<f32> = features
            .iter()
            .map(|f| model.forward(f, None).unwrap())
            .collect();

        // Compute mean loss over the batch
        let batch_loss: f32 = predictions
            .iter()
            .zip(&labels)
            .map(|(&pred, &target)| binary_cross_entropy(pred, target))
            .sum::<f32>() / predictions.len() as f32;

        epoch_loss += batch_loss;
        n_batches += 1;

        // Backward pass (simplified - real implementation needs BPTT)
        // ...
    }

    println!("Epoch {}: loss = {:.4}", epoch, epoch_loss / n_batches as f32);
}
```
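
The loop above calls a `binary_cross_entropy` helper without defining it. One possible definition is sketched below; it assumes predictions are probabilities in [0, 1] and clamps them away from the endpoints so the logarithm stays finite. The `mean_bce` wrapper is a hypothetical convenience, not part of the crate.

```rust
/// Binary cross-entropy for a single prediction.
/// `pred` is a probability; `target` is 0.0 or 1.0 (or a soft label in between).
pub fn binary_cross_entropy(pred: f32, target: f32) -> f32 {
    const EPS: f32 = 1e-7;
    // Clamp so that ln(0) never occurs for saturated predictions.
    let p = pred.clamp(EPS, 1.0 - EPS);
    -(target * p.ln() + (1.0 - target) * (1.0 - p).ln())
}

/// Mean loss over a batch; returns 0.0 for an empty batch.
pub fn mean_bce(preds: &[f32], targets: &[f32]) -> f32 {
    if preds.is_empty() {
        return 0.0;
    }
    let total: f32 = preds
        .iter()
        .zip(targets)
        .map(|(&p, &t)| binary_cross_entropy(p, t))
        .sum();
    total / preds.len() as f32
}
```

An uninformative prediction of 0.5 costs ln 2 ≈ 0.693 per sample, which is a useful baseline: a training loss that stays near this value suggests the model has not learned anything yet.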
## Knowledge Distillation
Knowledge distillation allows a smaller "student" model to learn from a larger "teacher" model.
### Setup
```rust
use ruvector_tiny_dancer_core::training::{
    generate_teacher_predictions,
    temperature_softmax,
};

// 1. Create/load teacher model (larger, pre-trained)
let teacher_config = FastGRNNConfig {
    input_dim: 5,
    hidden_dim: 32, // Larger than student
    output_dim: 1,
    ..Default::default()
};
let teacher = FastGRNN::load("models/teacher.safetensors")?;

// 2. Generate soft targets
let temperature = 3.0; // Higher = softer probabilities
let soft_targets = generate_teacher_predictions(
    &teacher,
    &dataset.features,
    temperature
)?;

// 3. Add soft targets to dataset
let dataset = dataset.with_soft_targets(soft_targets)?;

// 4. Enable distillation in training config
let training_config = TrainingConfig {
    enable_distillation: true,
    distillation_temperature: temperature,
    distillation_alpha: 0.7, // 70% soft targets, 30% hard targets
    ..Default::default()
};
```
### Distillation Loss
The total loss combines hard and soft targets:
```
L_total = α × L_soft + (1 - α) × L_hard

where:
- L_soft = BCE(σ(student_logit / T), σ(teacher_logit / T))
- L_hard = BCE(σ(student_logit), true_label)
- α = distillation_alpha (typically 0.5 to 0.9)
- T = temperature (typically 2.0 to 5.0)
```
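
The combined loss can be sketched as a single function. This assumes the student and teacher logits are turned into probabilities with a sigmoid before the cross-entropy is taken, which is the usual reading for a binary output; the function name `distillation_loss` and its signature are hypothetical and the crate's internal implementation may differ.

```rust
fn sigmoid(z: f32) -> f32 {
    1.0 / (1.0 + (-z).exp())
}

/// Binary cross-entropy between a predicted probability and a (possibly soft) target.
fn bce(pred: f32, target: f32) -> f32 {
    const EPS: f32 = 1e-7;
    let p = pred.clamp(EPS, 1.0 - EPS);
    -(target * p.ln() + (1.0 - target) * (1.0 - p).ln())
}

/// L_total = alpha * L_soft + (1 - alpha) * L_hard for one sample.
pub fn distillation_loss(
    student_logit: f32,
    teacher_logit: f32,
    true_label: f32,
    alpha: f32,
    temperature: f32,
) -> f32 {
    // Soft term: both logits are softened by the same temperature.
    let soft = bce(
        sigmoid(student_logit / temperature),
        sigmoid(teacher_logit / temperature),
    );
    // Hard term: ordinary BCE against the ground-truth label.
    let hard = bce(sigmoid(student_logit), true_label);
    alpha * soft + (1.0 - alpha) * hard
}
```

Setting `alpha` to 0.0 recovers plain supervised training, and setting it to 1.0 trains on the teacher's outputs alone, so the two extremes are easy to compare when tuning.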
### Benefits
- **Faster Inference**: Student model is smaller and faster
- **Better Accuracy**: Soft targets carry the teacher's confidence, which often helps the student generalize better than hard labels alone
- **Compression**: 2-4x smaller models with minimal accuracy loss
- **Transfer Learning**: Transfer knowledge across architectures
## Advanced Features
### Learning Rate Scheduling
Step-wise exponential decay: the learning rate is multiplied by `lr_decay` once every `lr_decay_step` epochs:
```rust
TrainingConfig {
    learning_rate: 0.01, // Initial LR
    lr_decay: 0.8,       // Multiply by 0.8 every lr_decay_step epochs
    lr_decay_step: 10,   // Decay every 10 epochs
    ..Default::default()
}
// Schedule:
// Epochs 0-9: LR = 0.01
// Epochs 10-19: LR = 0.008
// Epochs 20-29: LR = 0.0064
// Epochs 30-39: LR = 0.00512
// ...
```
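
The schedule in the comments above follows from a one-line formula, `lr = initial × decay^(epoch / step)` with integer division. A small sketch, where the function name `lr_at_epoch` is hypothetical and treating a step of 0 as "no decay" is an assumption:

```rust
/// Learning rate in effect during `epoch` (0-based) under step decay.
pub fn lr_at_epoch(initial_lr: f32, lr_decay: f32, lr_decay_step: usize, epoch: usize) -> f32 {
    // Assumption: a step of 0 disables decay rather than dividing by zero.
    if lr_decay_step == 0 {
        return initial_lr;
    }
    // Integer division counts how many decay boundaries have been crossed.
    let n_decays = (epoch / lr_decay_step) as i32;
    initial_lr * lr_decay.powi(n_decays)
}
```

For example, `lr_at_epoch(0.01, 0.8, 10, 25)` falls in the third band of the schedule, matching the 0.0064 listed for epochs 20-29.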
### Early Stopping
Prevent overfitting by stopping when validation loss stops improving:
```rust
TrainingConfig {
    early_stopping_patience: Some(5), // Stop after 5 epochs without improvement
    ..Default::default()
}
```
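
The patience logic can be pictured as a small counter that resets whenever the validation loss reaches a new minimum. The `EarlyStopping` type below is a hypothetical sketch of that idea, not the trainer's internal implementation, which may also restore the best weights or use a minimum-improvement threshold.

```rust
/// Tracks the best validation loss and how long it has gone unimproved.
pub struct EarlyStopping {
    patience: usize,
    best_loss: f32,
    epochs_without_improvement: usize,
}

impl EarlyStopping {
    pub fn new(patience: usize) -> Self {
        Self {
            patience,
            best_loss: f32::INFINITY,
            epochs_without_improvement: 0,
        }
    }

    /// Record this epoch's validation loss; returns true when training should stop.
    pub fn should_stop(&mut self, val_loss: f32) -> bool {
        if val_loss < self.best_loss {
            // New best: remember it and reset the counter.
            self.best_loss = val_loss;
            self.epochs_without_improvement = 0;
        } else {
            self.epochs_without_improvement += 1;
        }
        self.epochs_without_improvement >= self.patience
    }
}
```

Calling `should_stop` once per epoch with a patience of 2 and losses 1.0, 0.9, 0.95, 0.92 stops on the fourth call, because neither of the last two epochs beat 0.9.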
### Gradient Clipping
Prevent exploding gradients in RNNs:
```rust
TrainingConfig {
    grad_clip: 5.0, // Clip gradients to [-5.0, 5.0]
    ..Default::default()
}
```
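
Two common ways to apply a clipping threshold are sketched below: element-wise clipping, which matches the `[-5.0, 5.0]` comment above, and clipping by global L2 norm, which preserves the gradient's direction. Both function names are hypothetical, and which variant the trainer actually uses is not stated here.

```rust
/// Clamp every gradient component into [-threshold, threshold].
pub fn clip_by_value(grads: &mut [f32], threshold: f32) {
    for g in grads.iter_mut() {
        *g = (*g).clamp(-threshold, threshold);
    }
}

/// Rescale the whole gradient vector if its L2 norm exceeds `max_norm`.
pub fn clip_by_norm(grads: &mut [f32], max_norm: f32) {
    let norm = grads.iter().map(|g| g * g).sum::<f32>().sqrt();
    if norm > max_norm {
        let scale = max_norm / norm;
        for g in grads.iter_mut() {
            *g *= scale;
        }
    }
}
```

Value clipping can change the direction of the update when only some components are large, whereas norm clipping shrinks all components by the same factor.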
### Regularization
L2 weight decay to prevent overfitting:
```rust
TrainingConfig {
    l2_reg: 1e-5, // Add L2 penalty to loss
    ..Default::default()
}
```
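
As a sketch of what the penalty adds, assuming the convention `penalty = l2_reg × Σ w²` (some implementations use `l2_reg / 2` instead, which only rescales the coefficient), the loss term and its gradient contribution look like this. The function names are hypothetical.

```rust
/// L2 penalty added to the loss: l2_reg * sum(w^2).
pub fn l2_penalty(weights: &[f32], l2_reg: f32) -> f32 {
    l2_reg * weights.iter().map(|w| w * w).sum::<f32>()
}

/// Add the penalty's gradient, 2 * l2_reg * w, to existing gradients in place.
pub fn add_l2_gradient(grads: &mut [f32], weights: &[f32], l2_reg: f32) {
    for (g, w) in grads.iter_mut().zip(weights) {
        *g += 2.0 * l2_reg * w;
    }
}
```

Because the extra gradient is proportional to each weight, large weights are pulled toward zero more strongly than small ones.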
## Production Deployment
### Training Pipeline
1. **Data Collection**
```rust
// Collect production logs
let logs = collect_routing_logs_from_db(db_path)?;
let (features, labels) = extract_features_and_labels(&logs);
```
2. **Data Validation**
```rust
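// Check data quality
assert!(features.len() >= 1000, "Need at least 1000 samples");
// Require a minimum number of examples of each class, not just positives
let positives = labels.iter().filter(|&&l| l > 0.5).count();
let negatives = labels.len() - positives;
assert!(positives > 100 && negatives > 100, "Need balanced dataset");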
```
3. **Training**
```rust
let mut dataset = TrainingDataset::new(features, labels)?;
let (means, stds) = dataset.normalize()?;
let mut trainer = Trainer::new(&model_config, training_config);
let metrics = trainer.train(&mut model, &dataset)?;
```
4. **Validation**
```rust
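// Test on holdout set.
// Note: splitting after training means these samples may overlap the data
// used for fitting; for an unbiased estimate, set the test set aside before training.
let (_, test_dataset) = dataset.split(0.2)?;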
let (test_loss, test_accuracy) = evaluate_model(&model, &test_dataset)?;
assert!(test_accuracy > 0.85, "Model accuracy too low");
```
5. **Save Artifacts**
```rust
// Save model
model.save("models/fastgrnn_v1.safetensors")?;
// Save normalization params
save_normalization_params("models/normalization_v1.json", &means, &stds)?;
// Save metrics
trainer.save_metrics("models/metrics_v1.json")?;
```
6. **Optimization**
```rust
// Quantize for production
model.quantize()?;
// Optional: Prune weights
model.prune(0.3)?; // 30% sparsity
```
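
To give a sense of what the quantization step involves, here is a sketch of symmetric per-tensor INT8 quantization, one common scheme that yields the 4x size reduction relative to FP32. It is not necessarily what `model.quantize()` does internally; the function names and the choice of a single scale per weight vector are assumptions.

```rust
/// Quantize weights to i8 with a single symmetric scale.
/// Returns the quantized values and the scale needed to recover them.
pub fn quantize_int8(weights: &[f32]) -> (Vec<i8>, f32) {
    let max_abs = weights.iter().fold(0.0f32, |acc, w| acc.max(w.abs()));
    // An all-zero tensor gets scale 1.0 to avoid dividing by zero.
    let scale = if max_abs > 0.0 { max_abs / 127.0 } else { 1.0 };
    let q: Vec<i8> = weights
        .iter()
        .map(|w| (w / scale).round().clamp(-127.0, 127.0) as i8)
        .collect();
    (q, scale)
}

/// Map quantized values back to approximate f32 weights.
pub fn dequantize_int8(q: &[i8], scale: f32) -> Vec<f32> {
    q.iter().map(|&v| v as f32 * scale).collect()
}
```

With this scheme the round-trip error of each weight is at most half the scale, so tensors with a few very large weights lose more precision on the small ones. Evaluating accuracy again after quantization is a sensible check.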
### Continual Learning
Update the model with new data:
```rust
// Load existing model
let mut model = FastGRNN::load("models/current.safetensors")?;

// Collect new data
let new_logs = collect_recent_logs(since_timestamp)?;
let (mut new_features, new_labels) = extract_features_and_labels(&new_logs);

// Apply the normalization parameters saved with the original model
// (`means` and `stds` loaded from disk), not freshly computed ones
for sample in new_features.iter_mut() {
    normalize_features(sample, &means, &stds);
}

// Create dataset
let new_dataset = TrainingDataset::new(new_features, new_labels)?;

// Fine-tune with lower learning rate
let training_config = TrainingConfig {
    learning_rate: 0.0001, // Lower LR for fine-tuning
    epochs: 10,
    ..Default::default()
};
let mut trainer = Trainer::new(model.config(), training_config);
trainer.train(&mut model, &new_dataset)?;

// Save updated model
model.save("models/current_v2.safetensors")?;
```
### Model Versioning
```rust
use chrono::Utc;

pub struct ModelVersion {
    pub version: String,
    pub timestamp: i64,
    pub model_path: String,
    pub metrics_path: String,
    pub normalization_path: String,
    pub test_accuracy: f32,
    pub model_size_bytes: usize,
}

impl ModelVersion {
    pub fn create_new(model: &FastGRNN, metrics: &[TrainingMetrics]) -> Self {
        let timestamp = Utc::now().timestamp();
        let version = format!("v{}", timestamp);

        Self {
            version: version.clone(),
            timestamp,
            model_path: format!("models/fastgrnn_{}.safetensors", version),
            metrics_path: format!("models/metrics_{}.json", version),
            normalization_path: format!("models/norm_{}.json", version),
            test_accuracy: metrics.last().unwrap().val_accuracy,
            model_size_bytes: model.size_bytes(),
        }
    }
}
```
## Performance Benchmarks
### Training Speed
| Dataset Size | Batch Size | Epoch Time | Total Time (50 epochs) |
|--------------|------------|------------|------------------------|
| 1,000 | 32 | 0.2s | 10s |
| 10,000 | 64 | 1.5s | 75s |
| 100,000 | 128 | 12s | 600s (10 min) |
### Model Size
| Configuration | Parameters | FP32 Size | INT8 Size | Compression |
|--------------------|------------|-----------|-----------|-------------|
| Tiny (8 hidden) | ~250 | 1 KB | 256 B | 4x |
| Small (16 hidden) | ~850 | 3.4 KB | 850 B | 4x |
| Medium (32 hidden) | ~3,200 | 12.8 KB | 3.2 KB | 4x |
### Inference Speed
After training and quantization:
- **Inference time**: < 100 μs per sample
- **Batch inference** (32 samples): < 1 ms
- **Memory footprint**: < 5 KB
## Troubleshooting
### Common Issues
#### 1. Loss Not Decreasing
**Symptoms**: Training loss stays high or increases

**Solutions**:
- Reduce learning rate (try 0.001 or lower)
- Increase batch size
- Check data normalization
- Verify labels are correct (0.0 or 1.0)
- Add more training data
#### 2. Overfitting
**Symptoms**: Training accuracy high, validation accuracy low

**Solutions**:
- Increase L2 regularization (try 1e-4)
- Reduce model size (fewer hidden units)
- Use early stopping
- Add more training data
- Increase validation split
#### 3. Slow Convergence
**Symptoms**: Training takes too many epochs

**Solutions**:
- Increase learning rate (try 0.01 or 0.1)
- Use knowledge distillation
- Better feature engineering
- Use larger batch sizes
#### 4. Gradient Explosion
**Symptoms**: Loss becomes NaN, training crashes

**Solutions**:
- Enable gradient clipping (grad_clip: 1.0 or 5.0)
- Reduce learning rate
- Check for invalid data (NaN, Inf values)
## Next Steps
1. **Run the example**: `cargo run --example train-model`
2. **Collect your own data**: Integrate with production logs
3. **Experiment with hyperparameters**: Find optimal settings
4. **Deploy to production**: Integrate with the Router
5. **Monitor performance**: Track accuracy and latency
6. **Iterate**: Collect more data and retrain regularly
## References
- FastGRNN Paper: [FastGRNN: A Fast, Accurate, Stable and Tiny Kilobyte Sized Gated Recurrent Neural Network](https://arxiv.org/abs/1901.02358)
- Knowledge Distillation: [Distilling the Knowledge in a Neural Network](https://arxiv.org/abs/1503.02531)
- Adam Optimizer: [Adam: A Method for Stochastic Optimization](https://arxiv.org/abs/1412.6980)