# DSPy Multi-Model Benchmark Suite

Comprehensive benchmarking system for comparing multiple language models using real **dspy.ts v2.1.1** features.

## Features

### Real DSPy.ts Components

- ✅ **ChainOfThought** - For reasoning-based synthetic data generation
- ✅ **ReAct** - For iterative data quality validation
- ✅ **BootstrapFewShot** - Learn from successful examples (5 rounds)
- ✅ **MIPROv2** - Bayesian prompt optimization (3 trials)
- ✅ **Real Metrics** - f1Score, exactMatch, bleuScore, rougeScore

### Benchmark Capabilities

1. **Multi-Model Comparison**
   - OpenAI models (GPT-4, GPT-3.5-turbo)
   - Anthropic models (Claude 3 Sonnet, Claude 3 Haiku)
   - Automatic model registration and configuration

2. **Quality Metrics**
   - F1 Score
   - Exact Match
   - BLEU Score
   - ROUGE Score
   - Overall quality score

3. **Performance Metrics**
   - Latency (P50, P95, P99)
   - Throughput (samples/second)
   - Success rate
   - Average latency

4. **Cost Analysis** (see the cost sketch after this list)
   - Total cost tracking
   - Cost per sample
   - Cost per quality point
   - Token usage (input/output)

5. **Optimization Comparison**
   - Baseline quality
   - BootstrapFewShot improvement
   - MIPROv2 improvement
   - Quality progression tracking
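
For intuition on the cost accounting, here is a minimal sketch. The type and helper names are illustrative, not the package's actual API; the rates match the `costPer1kTokens` config used in the `addModel()` examples below.

```typescript
// Rates are per 1k tokens, matching the costPer1kTokens config shown below.
interface TokenUsage {
  input: number;
  output: number;
}

function sampleCost(
  usage: TokenUsage,
  rates: { input: number; output: number }
): number {
  // Per-sample cost: prorate each side's token count against its 1k-token rate.
  return (usage.input / 1000) * rates.input + (usage.output / 1000) * rates.output;
}

// e.g. GPT-4 at $0.03 / $0.06 per 1k tokens, 500 input + 200 output tokens:
sampleCost({ input: 500, output: 200 }, { input: 0.03, output: 0.06 }); // 0.027
```
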
## Installation

```bash
cd /home/user/ruvector/packages/agentic-synth
npm install
```

## Setup

Set your API keys as environment variables:

```bash
export OPENAI_API_KEY="your-openai-key"
export ANTHROPIC_API_KEY="your-anthropic-key"
```

Or create a `.env` file:

```env
OPENAI_API_KEY=your-openai-key
ANTHROPIC_API_KEY=your-anthropic-key
SAMPLE_SIZE=100
```
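
The shell exports above are picked up automatically. For the `.env` route, the entry script needs something to load the file; one common approach is the `dotenv` package. This is an assumption, so check whether the script already handles `.env` loading before adding it:

```typescript
// npm install dotenv
// At the very top of the entry script:
import 'dotenv/config';

// process.env.OPENAI_API_KEY etc. are now populated from .env
```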

## Usage

### Basic Usage

```bash
npx tsx training/dspy-multi-model-benchmark.ts
```

### Custom Sample Size

```bash
SAMPLE_SIZE=1000 npx tsx training/dspy-multi-model-benchmark.ts
```

### Programmatic Usage

```typescript
import { DSPyMultiModelBenchmark } from './training/dspy-multi-model-benchmark';

const benchmark = new DSPyMultiModelBenchmark('./results');

// Add models
benchmark.addModel({
  name: 'GPT-4',
  provider: 'openai',
  modelId: 'gpt-4',
  apiKey: process.env.OPENAI_API_KEY,
  costPer1kTokens: { input: 0.03, output: 0.06 },
  maxTokens: 8192
});

// Run comparison across all registered models (argument is the sample size)
const results = await benchmark.runComparison(1000);

// Generate report
await benchmark.generateReport(results);
```

## Output

The benchmark generates two files:

1. **Markdown Report** (`benchmark-report-TIMESTAMP.md`)
   - Executive summary with winners
   - Detailed metrics for each model
   - Rankings by category
   - Recommendations for different use cases

2. **JSON Results** (`benchmark-results-TIMESTAMP.json`)
   - Complete benchmark data
   - Raw metrics
   - Optimization history
   - Structured for further analysis

### Sample Output Structure

```
training/results/multi-model/
├── benchmark-report-2025-01-22T10-30-45-123Z.md
└── benchmark-results-2025-01-22T10-30-45-123Z.json
```
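
To build further analysis on the JSON results, load and inspect them first. A minimal sketch; the file's internal field names are not documented here, so check the actual output before relying on specific fields:

```typescript
import { readFileSync } from 'node:fs';

// Substitute the real timestamp from your run for TIMESTAMP.
const raw = readFileSync(
  'training/results/multi-model/benchmark-results-TIMESTAMP.json',
  'utf8'
);
const results = JSON.parse(raw);

// Inspect the top-level structure before building analysis on it.
console.log(Object.keys(results));
```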

## Benchmark Workflow

```
┌─────────────────────────────────────────────────────────┐
│ For Each Model                                          │
└─────────────────────────────────────────────────────────┘
                             │
                             ▼
┌─────────────────────────────────────────────────────────┐
│ 1. Baseline Quality                                     │
│    └─ Test with basic ChainOfThought module             │
└─────────────────────────────────────────────────────────┘
                             │
                             ▼
┌─────────────────────────────────────────────────────────┐
│ 2. BootstrapFewShot Optimization                        │
│    └─ 5 rounds of few-shot learning                     │
│    └─ Learn from successful examples                    │
└─────────────────────────────────────────────────────────┘
                             │
                             ▼
┌─────────────────────────────────────────────────────────┐
│ 3. MIPROv2 Optimization                                 │
│    └─ 3 trials of Bayesian optimization                 │
│    └─ Expected Improvement acquisition                  │
└─────────────────────────────────────────────────────────┘
                             │
                             ▼
┌─────────────────────────────────────────────────────────┐
│ 4. Performance Testing                                  │
│    └─ Measure latency (P50, P95, P99)                   │
│    └─ Calculate throughput                              │
└─────────────────────────────────────────────────────────┘
                             │
                             ▼
┌─────────────────────────────────────────────────────────┐
│ 5. Cost Analysis                                        │
│    └─ Track token usage                                 │
│    └─ Calculate cost efficiency                         │
└─────────────────────────────────────────────────────────┘
```

## Metrics Explained

### Quality Metrics

- **F1 Score**: Harmonic mean of precision and recall (see the sketch below)
- **Exact Match**: Percentage of exact matches with the expected output
- **BLEU Score**: Bilingual Evaluation Understudy (text similarity)
- **ROUGE Score**: Recall-Oriented Understudy for Gisting Evaluation
- **Overall**: Weighted average of all quality metrics
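
For intuition, here is a token-level F1 in its standard form. This is a sketch mirroring the definition above; the package's own `f1Score` may differ in tokenization details:

```typescript
// Token-level F1: harmonic mean of precision and recall over overlapping tokens.
function f1Score(predicted: string, expected: string): number {
  const pred = predicted.toLowerCase().split(/\s+/).filter(Boolean);
  const gold = expected.toLowerCase().split(/\s+/).filter(Boolean);
  if (pred.length === 0 || gold.length === 0) return 0;

  // Count gold tokens, then consume them as predicted tokens match.
  const counts = new Map<string, number>();
  for (const t of gold) counts.set(t, (counts.get(t) ?? 0) + 1);
  let overlap = 0;
  for (const t of pred) {
    const c = counts.get(t) ?? 0;
    if (c > 0) {
      overlap += 1;
      counts.set(t, c - 1);
    }
  }
  if (overlap === 0) return 0;

  const precision = overlap / pred.length;
  const recall = overlap / gold.length;
  return (2 * precision * recall) / (precision + recall);
}
```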

### Performance Metrics

- **P50 Latency**: Median response time
- **P95 Latency**: 95th percentile response time
- **P99 Latency**: 99th percentile response time
- **Throughput**: Samples processed per second (see the sketch below)
- **Success Rate**: Percentage of successful generations
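
The latency percentiles and throughput reduce to a few lines. A sketch of the usual definitions (nearest-rank percentile); the benchmark's own implementation may differ:

```typescript
// Nearest-rank percentile over recorded per-sample latencies.
function percentile(latenciesMs: number[], p: number): number {
  const sorted = [...latenciesMs].sort((a, b) => a - b);
  if (sorted.length === 0) return 0;
  const rank = Math.max(1, Math.ceil((p / 100) * sorted.length));
  return sorted[rank - 1];
}

// Throughput: successful samples per second of wall-clock time.
const throughput = (successCount: number, elapsedMs: number) =>
  successCount / (elapsedMs / 1000);

percentile([120, 150, 900, 200, 180], 95); // 900 (P95 of five samples)
```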

### Optimization Metrics

- **Baseline Quality**: Initial quality without optimization
- **Bootstrap Improvement**: Quality gain from BootstrapFewShot
- **MIPRO Improvement**: Quality gain from MIPROv2
- **Improvement %**: Relative improvement over baseline (see below)
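
Concretely, the percentage is computed against the unoptimized baseline. The numbers below are illustrative, chosen to be consistent with the sample run shown later:

```typescript
// Relative improvement over the unoptimized baseline, in percent.
const improvementPct = (optimized: number, baseline: number) =>
  ((optimized - baseline) / baseline) * 100;

improvementPct(0.875, 0.737); // ≈ 18.7, i.e. the "+18.7%" in the example output
```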

## Customization

### Add Custom Models

```typescript
benchmark.addModel({
  name: 'Custom Model',
  provider: 'openrouter',
  modelId: 'model-id',
  apiKey: 'your-key',
  costPer1kTokens: { input: 0.001, output: 0.002 },
  maxTokens: 4096
});
```

### Custom Schema

Modify the schema in `benchmarkModel()`:

```typescript
const schema = {
  id: 'UUID',
  name: 'string (person name)',
  email: 'string (valid email)',
  age: 'number (18-80)',
  // Add your custom fields...
};
```

### Custom Metrics

Implement custom quality scoring:

```typescript
private calculateQualityScore(output: any, expected: any): number {
  // Your custom scoring logic; for example, strict structural equality:
  const score = JSON.stringify(output) === JSON.stringify(expected) ? 1 : 0;
  return score;
}
```

## Performance Tips

1. **Start Small**: Use `SAMPLE_SIZE=10` for quick tests
2. **Increase Gradually**: Scale to 100, 1000, 10000 as needed
3. **Parallel Testing**: Run different models separately
4. **Cost Monitoring**: Check costs before large runs
5. **Rate Limits**: Be aware of API rate limits

## Example Results

```
🔬 DSPy Multi-Model Benchmark Suite
======================================================================
Models: 4
Sample Size: 100
======================================================================

📊 Benchmarking: GPT-4
----------------------------------------------------------------------
→ Running baseline...
→ Optimizing with BootstrapFewShot...
→ Optimizing with MIPROv2...
✓ Quality Score: 0.875
✓ P95 Latency: 1234ms
✓ Cost/Sample: $0.000543
✓ Bootstrap Improvement: +12.3%
✓ MIPRO Improvement: +18.7%

📊 Benchmarking: Claude 3 Sonnet
----------------------------------------------------------------------
→ Running baseline...
→ Optimizing with BootstrapFewShot...
→ Optimizing with MIPROv2...
✓ Quality Score: 0.892
✓ P95 Latency: 987ms
✓ Cost/Sample: $0.000234
✓ Bootstrap Improvement: +14.2%
✓ MIPRO Improvement: +21.5%

======================================================================
✅ Benchmark completed successfully!
📊 Check the results directory for detailed reports.
======================================================================
```

## Troubleshooting

### API Key Issues

```bash
# Check if keys are set
echo $OPENAI_API_KEY
echo $ANTHROPIC_API_KEY

# Set keys temporarily
export OPENAI_API_KEY="sk-..."
export ANTHROPIC_API_KEY="sk-ant-..."
```

### Import Errors

```bash
# Rebuild the package
npm run build

# Check the dspy.ts installation
npm list dspy.ts
```

### Out of Memory

```bash
# Reduce the sample size
SAMPLE_SIZE=10 npx tsx training/dspy-multi-model-benchmark.ts
```

### Rate Limiting

Add delays between requests:

```typescript
// In measurePerformance(): wait 100 ms between consecutive API calls
await new Promise(resolve => setTimeout(resolve, 100));
```

## Architecture

```
DSPyMultiModelBenchmark
├── Model Management
│   ├── OpenAILM (GPT-4, GPT-3.5)
│   ├── AnthropicLM (Claude 3)
│   └── Token tracking
│
├── DSPy Modules
│   ├── SyntheticDataModule (ChainOfThought)
│   └── DataQualityModule (ReAct)
│
├── Optimizers
│   ├── BootstrapFewShot (5 rounds)
│   └── MIPROv2 (3 trials, Bayesian)
│
├── Metrics
│   ├── Quality (F1, EM, BLEU, ROUGE)
│   ├── Performance (latency, throughput)
│   └── Cost (tokens, efficiency)
│
└── Reporting
    ├── Markdown reports
    └── JSON results
```

## Contributing

To add new features:

1. Extend `ModelConfig` for new providers (its apparent shape is sketched after this list)
2. Implement new LM classes
3. Add custom DSPy modules
4. Enhance quality metrics
5. Extend reporting formats
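
For reference, the apparent shape of `ModelConfig`, inferred from the `addModel()` calls in this README; a sketch, with the authoritative definition in the source:

```typescript
// Inferred from usage above; verify against the actual type in the source.
interface ModelConfig {
  name: string;
  provider: string; // e.g. 'openai', 'anthropic', 'openrouter'
  modelId: string;
  apiKey?: string;
  costPer1kTokens: { input: number; output: number };
  maxTokens: number;
}
```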

## License

MIT - Same as dspy.ts and agentic-synth

## References

- [dspy.ts Documentation](https://github.com/ruvnet/dspy.ts)
- [DSPy Paper](https://arxiv.org/abs/2310.03714)
- [MIPROv2 Paper](https://arxiv.org/abs/2406.11695)

---

**Built with dspy.ts v2.1.1** - Declarative AI framework for TypeScript