# Quick Start: Claude Task Dataset Generation
Generate fine-tuning datasets for RuvLTRA models in 5 minutes.
## Installation

Add to your `Cargo.toml`:

```toml
[dependencies]
ruvllm = { version = "0.1.0", features = ["training"] }
```
## Basic Usage

### 1. Generate a Dataset

```rust
use ruvllm::training::{DatasetGenerator, DatasetConfig};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Create a generator with the default config
    let config = DatasetConfig::default();
    let mut generator = DatasetGenerator::new(config);

    // Generate the dataset
    let dataset = generator.generate();
    println!("Generated {} examples", dataset.examples.len());

    Ok(())
}
```
### 2. Export to JSONL

```rust
// Export the full dataset
dataset.export_jsonl("training.jsonl")?;

// Export statistics
dataset.export_stats("stats.json")?;
```
### 3. Create Train/Val/Test Splits

```rust
// 70% train, 15% validation, 15% test
let (train, val, test) = dataset.split(0.7, 0.15, 0.15, 42);

// Export each split
ClaudeTaskDataset::new(train).export_jsonl("train.jsonl")?;
ClaudeTaskDataset::new(val).export_jsonl("val.jsonl")?;
ClaudeTaskDataset::new(test).export_jsonl("test.jsonl")?;
```
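As a sanity check on how the ratios translate into counts, here is a minimal sketch; the `split_sizes` helper is hypothetical, not part of the ruvllm API, and only illustrates the arithmetic:

```rust
// Hypothetical helper illustrating how split ratios map to example counts.
// Not part of ruvllm; `dataset.split` handles this (plus shuffling) for you.
fn split_sizes(n: usize, train: f64, val: f64) -> (usize, usize, usize) {
    let t = (n as f64 * train).round() as usize;
    let v = (n as f64 * val).round() as usize;
    (t, v, n - t - v) // the test split takes the remainder
}

fn main() {
    let (train, val, test) = split_sizes(2_700, 0.70, 0.15);
    println!("{train} train / {val} val / {test} test"); // 1890 / 405 / 405
}
```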
## Run the Example

```bash
# Generate a complete dataset
cargo run --example generate_claude_dataset --release

# Output:
# - claude_training_full.jsonl  (~2,700 examples)
# - claude_training_train.jsonl (70% split)
# - claude_training_val.jsonl   (15% split)
# - claude_training_test.jsonl  (15% split)
# - claude_training_stats.json  (statistics)
```
## Custom Configuration

### Control Dataset Size

```rust
let config = DatasetConfig {
    examples_per_category: 200, // 200 examples per category
    ..Default::default()
};
```
### Disable Augmentation

```rust
let config = DatasetConfig {
    examples_per_category: 100,
    enable_augmentation: false, // no augmentation
    ..Default::default()
};
```
### Fine-Tune Augmentation

```rust
use ruvllm::training::AugmentationConfig;

let config = DatasetConfig {
    examples_per_category: 100,
    enable_augmentation: true,
    augmentation: AugmentationConfig {
        paraphrases_per_example: 3,   // 3 paraphrases per base example
        complexity_variations: 2,     // 2 complexity variants
        enable_domain_transfer: true, // cross-domain transfer
    },
    seed: 42, // for reproducibility
    ..Default::default()
};
```
## Understanding the Data

### Dataset Structure

Each example contains:

```json
{
  "input": "Implement JWT authentication middleware in TypeScript",
  "context": "Should verify Bearer tokens, check expiration, validate RS256 signature",
  "output_agent": "coder",
  "metadata": {
    "category": "Coder",
    "complexity": "Moderate",
    "domain": "Web",
    "expected_model": "sonnet",
    "quality_score": 0.87,
    "tags": ["authentication", "middleware", "jwt"]
  }
}
```
### Task Categories
- Coder (20%) - Code generation, debugging, refactoring
- Researcher (20%) - Analysis, exploration, documentation
- Security (20%) - Audits, vulnerabilities, compliance
- Architecture (20%) - System design, planning
- Reviewer (20%) - Code review, quality assessment
### Model Selection
The dataset includes intelligent routing:
- Haiku: Simple tasks (cheap, fast)
- Sonnet: Moderate complexity (balanced)
- Opus: Complex/security tasks (highest quality)
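The routing rules above can be sketched as a small function; the enum and `select_model` here are illustrative stand-ins, not part of the ruvllm API:

```rust
// Illustrative routing heuristic; the names here are hypothetical.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Complexity {
    Simple,
    Moderate,
    Complex,
}

// Security tasks always route to the strongest model; otherwise the
// tier follows task complexity.
fn select_model(complexity: Complexity, is_security: bool) -> &'static str {
    match (complexity, is_security) {
        (_, true) => "opus",
        (Complexity::Simple, false) => "haiku",
        (Complexity::Moderate, false) => "sonnet",
        (Complexity::Complex, false) => "opus",
    }
}

fn main() {
    println!("{}", select_model(Complexity::Simple, false));  // haiku
    println!("{}", select_model(Complexity::Complex, false)); // opus
}
```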
### Dataset Statistics

The default configuration generates:

```text
Base examples:       500  (5 categories × 100)
Paraphrased:       1,000  (500 × 2)
Complexity varied:   800  (500 × 2, filtered)
Domain transfer:     400  (500 × 1, filtered)
──────────────────────────────────────────
Total:            ~2,700 examples
```
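The total follows from the breakdown above; note that the ~80% retention factor for the filtered augmentations is inferred from the table, not read from the library:

```rust
// Back-of-the-envelope check of the default dataset size.
// The 8/10 retention factor is inferred from the table above.
fn expected_total(categories: usize, per_category: usize) -> usize {
    let base = categories * per_category;      // 500
    let paraphrased = base * 2;                // 1,000
    let complexity_varied = base * 2 * 8 / 10; // ~800 after filtering
    let domain_transfer = base * 8 / 10;       // ~400 after filtering
    base + paraphrased + complexity_varied + domain_transfer
}

fn main() {
    println!("≈{} examples", expected_total(5, 100)); // ≈2700 examples
}
```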
Category distribution:

```text
Coder:        ~540 examples (20%)
Researcher:   ~540 examples (20%)
Security:     ~540 examples (20%)
Architecture: ~540 examples (20%)
Reviewer:     ~540 examples (20%)
```

Model distribution:

```text
Haiku:    ~730 examples (27%) - cost-effective
Sonnet: ~1,270 examples (47%) - balanced
Opus:     ~700 examples (26%) - high quality
```
## Inspect the Data

```rust
// Print the first 5 examples
for (i, example) in dataset.examples.iter().take(5).enumerate() {
    println!("Example {}:", i + 1);
    println!("  Input: {}", example.input);
    println!("  Agent: {}", example.output_agent);
    println!("  Model: {}", example.metadata.expected_model);
    println!("  Quality: {:.2}\n", example.metadata.quality_score);
}
```
### Filter by Category

```rust
// Get all security tasks
let security_tasks: Vec<_> = dataset.examples
    .iter()
    .filter(|e| e.metadata.category == TaskCategory::Security)
    .collect();

println!("Security tasks: {}", security_tasks.len());
```
### Filter by Complexity

```rust
// Get all simple tasks
let simple_tasks: Vec<_> = dataset.examples
    .iter()
    .filter(|e| e.metadata.complexity == ComplexityLevel::Simple)
    .collect();

println!("Simple tasks: {}", simple_tasks.len());
```
## Next Steps

- Fine-tune a model: Use the generated JSONL files with your favorite ML framework
- Customize templates: Modify `claude_dataset.rs` to add domain-specific tasks
- Integrate with SONA: Use RuvLLM's SONA learning for continuous improvement
- Deploy: Use RuvLLM's serving engine for production inference
## Common Issues

### "Not enough examples"

Increase `examples_per_category`:

```rust
let config = DatasetConfig {
    examples_per_category: 500, // generate more base examples
    ..Default::default()
};
```
### "Too much variation"

Disable augmentation:

```rust
let config = DatasetConfig {
    enable_augmentation: false,
    ..Default::default()
};
```
### "Need specific domain"

Filter after generation:

```rust
let web_tasks: Vec<_> = dataset.examples
    .iter()
    .filter(|e| e.metadata.domain == DomainType::Web)
    .cloned()
    .collect();

ClaudeTaskDataset::new(web_tasks).export_jsonl("web_tasks.jsonl")?;
```
## Resources

- Full Documentation: `../crates/ruvllm/src/training/README.md`
- Format Spec: `../docs/claude_dataset_format.md`
- Example Code: `../crates/ruvllm/examples/generate_claude_dataset.rs`
- Tests: `../crates/ruvllm/src/training/tests.rs`
## Support
- GitHub Issues: https://github.com/ruvector/issues
- Documentation: https://docs.ruvector.io