git-subtree-dir: vendor/ruvector git-subtree-split: b64c21726f2bb37286d9ee36a7869fef60cc6900
318 lines
7.6 KiB
Markdown
318 lines
7.6 KiB
Markdown
# RuvLTRA Training Datasets
|
|
|
|
Complete guide to fine-tuning datasets for RuvLTRA models.
|
|
|
|
## Available Datasets
|
|
|
|
### 1. Claude Task Routing Dataset
|
|
|
|
**Purpose**: Train models to intelligently route tasks to Claude Flow agents and select optimal Claude models (Haiku/Sonnet/Opus).
|
|
|
|
**Location**: `crates/ruvllm/src/training/claude_dataset.rs`
|
|
|
|
**Size**: ~2,700 examples (configurable)
|
|
|
|
**Categories**:
|
|
- Coder (20%) - Code generation, debugging, refactoring
|
|
- Researcher (20%) - Analysis, exploration, documentation
|
|
- Security (20%) - Audit, vulnerability analysis
|
|
- Architecture (20%) - System design, planning
|
|
- Reviewer (20%) - Code review, quality assessment
|
|
|
|
**Quick Start**:
|
|
```bash
|
|
cargo run --example generate_claude_dataset --release
|
|
```
|
|
|
|
**Documentation**:
|
|
- [Quick Start Guide](QUICKSTART.md)
|
|
- [Format Specification](../claude_dataset_format.md)
|
|
- [Implementation Summary](SUMMARY.md)
|
|
|
|
## Dataset Comparison
|
|
|
|
| Dataset | Examples | Categories | Quality | Use Case |
|
|
|---------|----------|------------|---------|----------|
|
|
| Claude Task | 2,700 | 5 | 0.87 | Task routing, model selection |
|
|
| (Future) Code Completion | TBD | - | - | Code generation |
|
|
| (Future) Security Audit | TBD | - | - | Vulnerability detection |
|
|
|
|
## Dataset Format
|
|
|
|
All datasets use consistent JSONL format:
|
|
|
|
```json
|
|
{
|
|
"input": "Task description",
|
|
"context": "Additional context",
|
|
"output_agent": "target_agent",
|
|
"metadata": {
|
|
"category": "TaskCategory",
|
|
"complexity": "ComplexityLevel",
|
|
"domain": "DomainType",
|
|
"expected_model": "haiku|sonnet|opus",
|
|
"quality_score": 0.87,
|
|
"tags": ["tag1", "tag2"]
|
|
}
|
|
}
|
|
```
|
|
|
|
## Data Splits
|
|
|
|
Standard splits for all datasets:
|
|
- **Training**: 70%
|
|
- **Validation**: 15%
|
|
- **Test**: 15%
|
|
|
|
Stratified sampling ensures balanced representation across categories.
|
|
|
|
## Quality Standards
|
|
|
|
All datasets follow quality guidelines:
|
|
|
|
**Quality Score Ranges**:
|
|
- 0.90-1.00: Excellent (security, critical tasks)
|
|
- 0.85-0.90: Good (architecture, complex code)
|
|
- 0.80-0.85: Adequate (research, reviews)
|
|
|
|
**Minimum Standards**:
|
|
- Input clarity: Must be unambiguous
|
|
- Context completeness: All necessary details
|
|
- Output correctness: Verified agent/model selection
|
|
- Metadata accuracy: Properly labeled
|
|
|
|
## Generation Pipeline
|
|
|
|
```
|
|
1. Template Definition
|
|
↓
|
|
Hand-crafted task templates
|
|
↓
|
|
Quality review (0.90+ for seeds)
|
|
|
|
2. Base Generation
|
|
↓
|
|
Fill templates with variations
|
|
↓
|
|
Validate quality/correctness
|
|
|
|
3. Augmentation (optional)
|
|
↓
|
|
Paraphrasing
|
|
↓
|
|
Complexity variations
|
|
↓
|
|
Domain transfer
|
|
↓
|
|
Filter invalid examples
|
|
|
|
4. Export
|
|
↓
|
|
JSONL, JSON, Parquet
|
|
↓
|
|
Statistics and analysis
|
|
```
|
|
|
|
## Usage Patterns
|
|
|
|
### Generate Default Dataset
|
|
```rust
|
|
use ruvllm::training::{DatasetGenerator, DatasetConfig};
|
|
|
|
let config = DatasetConfig::default();
|
|
let mut generator = DatasetGenerator::new(config);
|
|
let dataset = generator.generate();
|
|
|
|
dataset.export_jsonl("training.jsonl")?;
|
|
```
|
|
|
|
### Custom Configuration
|
|
```rust
|
|
let config = DatasetConfig {
|
|
examples_per_category: 200,
|
|
enable_augmentation: true,
|
|
augmentation: AugmentationConfig {
|
|
paraphrases_per_example: 3,
|
|
complexity_variations: 2,
|
|
enable_domain_transfer: true,
|
|
},
|
|
seed: 42,
|
|
};
|
|
```
|
|
|
|
### Filter by Category
|
|
```rust
|
|
let security_tasks: Vec<_> = dataset.examples
|
|
.iter()
|
|
.filter(|e| e.metadata.category == TaskCategory::Security)
|
|
.collect();
|
|
```
|
|
|
|
### Filter by Complexity
|
|
```rust
|
|
let simple_tasks: Vec<_> = dataset.examples
|
|
.iter()
|
|
.filter(|e| e.metadata.complexity == ComplexityLevel::Simple)
|
|
.collect();
|
|
```
|
|
|
|
## Integration with RuvLTRA
|
|
|
|
### Training Pipeline
|
|
|
|
```rust
|
|
use ruvllm::training::DatasetGenerator;
|
|
use ruvllm::SonaLlm;
|
|
|
|
// 1. Generate dataset
|
|
let dataset = DatasetGenerator::new(config).generate();
|
|
|
|
// 2. Split data
|
|
let (train, val, test) = dataset.split(0.7, 0.15, 0.15, 42);
|
|
|
|
// 3. Train model
|
|
let mut model = SonaLlm::new(config)?;
|
|
for example in train {
|
|
let features = model.extract_features(&example.input)?;
|
|
let target = encode_target(&example.output_agent);
|
|
model.train(features, target)?;
|
|
}
|
|
|
|
// 4. Validate
|
|
let accuracy = evaluate_model(&model, &val)?;
|
|
println!("Validation accuracy: {:.2}%", accuracy * 100.0);
|
|
```
|
|
|
|
### Model Heads
|
|
|
|
**1. Task Embedding**:
|
|
- Input: Task description + context
|
|
- Output: 768-dim semantic vector
|
|
|
|
**2. Agent Classification**:
|
|
- Input: Task embedding
|
|
- Output: 5-way softmax (agent types)
|
|
|
|
**3. Model Selection**:
|
|
- Input: Task embedding + complexity
|
|
- Output: 3-way softmax (Haiku/Sonnet/Opus)
|
|
|
|
**4. Quality Prediction**:
|
|
- Input: Task embedding
|
|
- Output: Quality score (0-1)
|
|
|
|
## Performance Metrics
|
|
|
|
### Generation Performance
|
|
- **Speed**: ~7,000 examples/second
|
|
- **Memory**: ~200 MB for 2,700 examples
|
|
- **Disk**: ~10 MB JSONL for 2,700 examples
|
|
|
|
### Training Performance
|
|
- **Accuracy**: 95%+ for agent classification
|
|
- **Cost Savings**: 50%+ with model selection
|
|
- **Latency**: <10ms for routing decision
|
|
|
|
## Best Practices
|
|
|
|
### 1. Dataset Size
|
|
- **Minimum**: 1,000 examples total (200 per category)
|
|
- **Recommended**: 2,500-5,000 examples
|
|
- **Maximum**: 10,000+ for production
|
|
|
|
### 2. Quality Over Quantity
|
|
- Prefer fewer high-quality examples (0.90+)
|
|
- Review augmented examples for correctness
|
|
- Filter low-quality generations
|
|
|
|
### 3. Balanced Representation
|
|
- Equal distribution across categories
|
|
- Mix of complexity levels (33% Simple, 40% Moderate, 27% Complex)
|
|
- Diverse domain coverage
|
|
|
|
### 4. Regular Updates
|
|
- Add new task patterns as they emerge
|
|
- Update templates based on user feedback
|
|
- Retrain models quarterly
|
|
|
|
### 5. Validation
|
|
- Hold out 15% for validation
|
|
- Monitor accuracy on validation set
|
|
- A/B test routing decisions
|
|
|
|
## Common Issues
|
|
|
|
### Issue: Low Quality Scores
|
|
**Solution**: Disable augmentation or review templates
|
|
```rust
|
|
let config = DatasetConfig {
|
|
enable_augmentation: false,
|
|
..Default::default()
|
|
};
|
|
```
|
|
|
|
### Issue: Imbalanced Categories
|
|
**Solution**: Adjust examples per category
|
|
```rust
|
|
let config = DatasetConfig {
|
|
examples_per_category: 500, // Increase for balance
|
|
..Default::default()
|
|
};
|
|
```
|
|
|
|
### Issue: Too Much Variation
|
|
**Solution**: Reduce augmentation rates
|
|
```rust
|
|
augmentation: AugmentationConfig {
|
|
paraphrases_per_example: 1,
|
|
complexity_variations: 1,
|
|
enable_domain_transfer: false,
|
|
}
|
|
```
|
|
|
|
## Roadmap
|
|
|
|
### Short Term (Q1 2024)
|
|
- [ ] Parquet export format
|
|
- [ ] Custom template loading
|
|
- [ ] Multi-language support
|
|
- [ ] HuggingFace Datasets integration
|
|
|
|
### Medium Term (Q2-Q3 2024)
|
|
- [ ] Code completion dataset
|
|
- [ ] Security audit dataset
|
|
- [ ] Multi-turn conversation dataset
|
|
- [ ] Active learning integration
|
|
|
|
### Long Term (Q4 2024+)
|
|
- [ ] Few-shot learning examples
|
|
- [ ] Code execution feedback
|
|
- [ ] Self-improvement trajectories
|
|
- [ ] Cross-lingual transfer
|
|
|
|
## Resources
|
|
|
|
### Documentation
|
|
- [Quick Start Guide](QUICKSTART.md) - Get started in 5 minutes
|
|
- [Format Specification](../claude_dataset_format.md) - Detailed format docs
|
|
- [Implementation Summary](SUMMARY.md) - Technical deep-dive
|
|
- [Module README](../../crates/ruvllm/src/training/README.md) - API reference
|
|
|
|
### Examples
|
|
- [Dataset Generator](../../crates/ruvllm/examples/generate_claude_dataset.rs)
|
|
- [Fine-Tuning Pipeline](../../crates/ruvllm/examples/finetune_routing.rs) (coming soon)
|
|
|
|
### Code
|
|
- [claude_dataset.rs](../../crates/ruvllm/src/training/claude_dataset.rs) - Core implementation
|
|
- [tests.rs](../../crates/ruvllm/src/training/tests.rs) - Test suite
|
|
|
|
## Support
|
|
|
|
- **Issues**: https://github.com/ruvector/issues
|
|
- **Discussions**: https://github.com/ruvector/discussions
|
|
- **Documentation**: https://docs.ruvector.io
|
|
|
|
## License
|
|
|
|
All datasets are licensed under MIT OR Apache-2.0, same as RuvLTRA.
|