Files
wifi-densepose/docs/training/claude_dataset_format.md
ruv d803bfe2b1 Squashed 'vendor/ruvector/' content from commit b64c2172
git-subtree-dir: vendor/ruvector
git-subtree-split: b64c21726f2bb37286d9ee36a7869fef60cc6900
2026-02-28 14:39:40 -05:00

331 lines
9.1 KiB
Markdown

# Claude Task Dataset Format Specification
## Overview
The Claude Task Fine-Tuning Dataset is designed for training RuvLTRA models to intelligently route tasks to appropriate Claude Flow agents and select optimal Claude models (Haiku/Sonnet/Opus) based on task complexity.
## Dataset Categories
### 1. Coder Tasks
**Agent:** `coder`
**Focus:** Code generation, debugging, refactoring
**Model Routing:**
- Simple: Haiku (quick fixes, simple functions)
- Moderate: Sonnet (component development, API integration)
- Complex: Opus (complex algorithms, system-level code)
**Example Tasks:**
- Implement authentication middleware
- Debug race condition in concurrent code
- Refactor monolithic service into microservices
- Write unit tests with 90% coverage
### 2. Researcher Tasks
**Agent:** `researcher`
**Focus:** Analysis, exploration, documentation
**Model Routing:**
- Simple: Haiku (basic documentation)
- Moderate: Sonnet (most research tasks)
- Complex: Sonnet (deep analysis)
**Example Tasks:**
- Analyze performance bottlenecks
- Research best practices for GraphQL
- Document API endpoints
- Compare database solutions
### 3. Security Tasks
**Agent:** `security`
**Focus:** Audit, vulnerability analysis, threat detection
**Model Routing:**
- All: Opus (security requires highest quality)
**Example Tasks:**
- Audit authentication flow for vulnerabilities
- Review cryptographic implementation
- Identify SQL injection vectors
- Ensure GDPR compliance
### 4. Architecture Tasks
**Agent:** `architecture`
**Focus:** System design, planning, architecture
**Model Routing:**
- Simple: Sonnet (basic schemas)
- Moderate: Opus (microservices, APIs)
- Complex: Opus (distributed systems)
**Example Tasks:**
- Design microservices architecture
- Plan database schema for e-commerce
- Architect caching strategy
- Design disaster recovery system
### 5. Reviewer Tasks
**Agent:** `reviewer`
**Focus:** Code review, quality assessment
**Model Routing:**
- Simple: Haiku (standards compliance)
- Moderate: Sonnet (quality review, performance)
- Complex: Sonnet (architecture review)
**Example Tasks:**
- Review pull request for best practices
- Assess code quality and maintainability
- Review error handling patterns
- Analyze scalability of design
## JSONL Format
Each line in the JSONL file represents a single training example:
```json
{
"input": "Implement async authentication middleware in TypeScript for JWT validation",
"context": "The middleware should verify JWT tokens from Bearer header, check expiration, and validate signature using RS256",
"output_agent": "coder",
"metadata": {
"category": "Coder",
"complexity": "Moderate",
"domain": "Web",
"expected_model": "sonnet",
"quality_score": 0.87,
"tags": ["authentication", "middleware", "jwt", "security"]
}
}
```
## Fields Description
### Input
**Type:** String
**Description:** The task description or request from the user. This is what the model receives as input.
### Context
**Type:** String
**Description:** Additional context, requirements, constraints, or details about the task. Provides necessary background information.
### Output Agent
**Type:** String
**Enum:** `"coder"`, `"researcher"`, `"security"`, `"architecture"`, `"reviewer"`
**Description:** The expected agent that should handle this task.
### Metadata
#### Category
**Type:** TaskCategory enum
**Values:** `Coder`, `Researcher`, `Security`, `Architecture`, `Reviewer`
**Description:** Primary task category
#### Complexity
**Type:** ComplexityLevel enum
**Values:** `Simple`, `Moderate`, `Complex`
**Description:** Task complexity level determining model selection
#### Domain
**Type:** DomainType enum
**Values:** `Web`, `Systems`, `DataScience`, `Mobile`, `DevOps`, `Security`, `Database`, `Api`
**Description:** Technical domain context
#### Expected Model
**Type:** String
**Values:** `"haiku"`, `"sonnet"`, `"opus"`
**Description:** Recommended Claude model for this task based on complexity and category
**Cost Optimization:**
- Haiku: ~75% cheaper than Opus, 2-3x faster
- Sonnet: Balanced cost/quality, handles most tasks
- Opus: Highest quality, use for complex/critical tasks
#### Quality Score
**Type:** Float (0.0-1.0)
**Description:** Quality rating of this training example. Higher scores indicate more reliable examples for training.
#### Tags
**Type:** Array of strings
**Description:** Descriptive tags for filtering and analysis
## Data Augmentation
The dataset generator applies three augmentation techniques:
### 1. Paraphrasing
**Purpose:** Increase linguistic diversity
**Method:** Synonym replacement, phrase restructuring
**Example:**
- Original: "Implement a function to validate user input"
- Paraphrased: "Create a function to validate user input"
### 2. Complexity Variations
**Purpose:** Create training examples at different complexity levels
**Method:** Vary complexity while keeping core task same
**Example:**
- Simple: "Add error handling to API endpoint"
- Moderate: "Implement comprehensive error handling with retry logic"
- Complex: "Design fault-tolerant error handling with circuit breakers"
### 3. Domain Transfer
**Purpose:** Generalize across technical domains
**Method:** Apply same task pattern to different domains
**Example:**
- Web: "Optimize React component rendering"
- Mobile: "Optimize Flutter widget rendering"
- Systems: "Optimize kernel thread scheduling"
## Dataset Statistics
Typical generated dataset (100 base examples per category + augmentation):
```
Total Examples: ~1,500 (500 base + 1,000 augmented)
By Category:
- Coder: ~300 (20%)
- Researcher: ~300 (20%)
- Security: ~300 (20%)
- Architecture: ~300 (20%)
- Reviewer: ~300 (20%)
By Complexity:
- Simple: ~500 (33%)
- Moderate: ~600 (40%)
- Complex: ~400 (27%)
By Model:
- Haiku: ~400 (27%) - Cost-effective for simple tasks
- Sonnet: ~700 (47%) - Balanced for most tasks
- Opus: ~400 (27%) - High-quality for complex/security
```
## Training Splits
Recommended split ratios:
- **Training:** 70% (~1,050 examples)
- **Validation:** 15% (~225 examples)
- **Test:** 15% (~225 examples)
Stratified sampling ensures balanced representation across categories and complexity levels.
## Quality Assurance
Each training example includes a quality score (0.0-1.0) based on:
1. **Template Quality** (0.8-0.96)
- Seed templates: Hand-crafted, highest quality
- Paraphrased: Slightly lower due to automated generation
2. **Category Appropriateness**
- Security tasks: Higher scores (0.90-0.96)
- Code generation: Good scores (0.83-0.90)
3. **Complexity Alignment**
- Well-defined complexity: Higher scores
- Ambiguous complexity: Lower scores
## Usage in Fine-Tuning
### For Task Routing
Train model to predict `output_agent` given `input` and `context`.
```python
# Pseudo-code
def train_task_router(dataset):
for example in dataset:
x = embed(example.input + example.context)
y = encode_agent(example.output_agent)
model.train(x, y)
```
### For Model Selection
Train model to predict `expected_model` given task characteristics.
```python
# Pseudo-code
def train_model_selector(dataset):
for example in dataset:
features = extract_features(example.input, example.context)
complexity = encode_complexity(example.metadata.complexity)
category = encode_category(example.metadata.category)
x = [features, complexity, category]
y = encode_model(example.metadata.expected_model)
model.train(x, y)
```
## Export Formats
### JSONL (Recommended)
- One example per line
- Memory-efficient streaming
- Standard for LLM fine-tuning
- File: `claude_training_full.jsonl`
### JSON
- Full array of examples
- Human-readable
- Good for inspection
- File: `claude_training_full.json`
### Parquet (Planned)
- Columnar format
- Highly compressed
- Fast for analytics
- Integration with Arrow/Polars
## Example Generation Code
```rust
use ruvllm::training::{DatasetGenerator, DatasetConfig};
// Configure dataset
let config = DatasetConfig {
examples_per_category: 100,
enable_augmentation: true,
..Default::default()
};
// Generate dataset
let mut generator = DatasetGenerator::new(config);
let dataset = generator.generate();
// Export to JSONL
dataset.export_jsonl("training.jsonl")?;
// Split for training
let (train, val, test) = dataset.split(0.7, 0.15, 0.15, 42);
```
## Integration with RuvLTRA
The dataset is designed for fine-tuning RuvLTRA models with:
1. **Task Embedding Layer**
- Input: Task description + context
- Output: 768-dim semantic embedding
2. **Agent Classification Head**
- Input: Task embedding
- Output: 5-way classification (5 agent types)
3. **Model Selection Head**
- Input: Task embedding + complexity features
- Output: 3-way classification (Haiku/Sonnet/Opus)
4. **Quality Prediction Head**
- Input: Task embedding
- Output: Quality score (0-1)
## Versioning
**Current Version:** 1.0.0
**Format Version:** 1.0
**Last Updated:** 2024-01
## License
Training data follows the same license as RuvLTRA (MIT/Apache-2.0).
## References
- Claude Flow Documentation: https://github.com/ruvnet/claude-flow
- RuvLTRA Architecture: `../crates/ruvllm/README.md`
- SONA Learning: `../crates/sona/README.md`