# DSPy Multi-Model Benchmark Suite
Comprehensive benchmarking system for comparing multiple language models using real **dspy.ts v2.1.1** features.
## Features
### Real DSPy.ts Components
- **ChainOfThought** - For reasoning-based synthetic data generation
- **ReAct** - For iterative data quality validation
- **BootstrapFewShot** - Learn from successful examples (5 rounds)
- **MIPROv2** - Bayesian prompt optimization (3 trials)
- **Real Metrics** - f1Score, exactMatch, bleuScore, rougeScore
### Benchmark Capabilities
1. **Multi-Model Comparison**
- OpenAI models (GPT-4, GPT-3.5-turbo)
- Anthropic models (Claude 3 Sonnet, Claude 3 Haiku)
- Automatic model registration and configuration
2. **Quality Metrics**
- F1 Score
- Exact Match
- BLEU Score
- ROUGE Score
- Overall quality score
3. **Performance Metrics**
- Latency (P50, P95, P99)
- Throughput (samples/second)
- Success rate
- Average latency
4. **Cost Analysis**
- Total cost tracking
- Cost per sample
- Cost per quality point
- Token usage (input/output)
5. **Optimization Comparison**
- Baseline quality
- BootstrapFewShot improvement
- MIPROv2 improvement
- Quality progression tracking
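The cost figures above can be derived from token counts and per-1k-token pricing. A minimal sketch of that arithmetic (the `costPer1kTokens` shape mirrors the model config shown under Programmatic Usage; the function names are illustrative, not the benchmark's actual API):

```typescript
// Illustrative cost arithmetic; `costPer1kTokens` mirrors the model
// config used by the benchmark, the function names are hypothetical.
interface Usage { inputTokens: number; outputTokens: number; }
interface Pricing { input: number; output: number; } // USD per 1k tokens

function sampleCost(usage: Usage, price: Pricing): number {
  return (usage.inputTokens / 1000) * price.input +
         (usage.outputTokens / 1000) * price.output;
}

function costPerQualityPoint(totalCost: number, qualityScore: number): number {
  // Lower is better: dollars spent per unit of quality achieved.
  return totalCost / qualityScore;
}

// GPT-4-style pricing from the config example below:
const cost = sampleCost(
  { inputTokens: 500, outputTokens: 200 },
  { input: 0.03, output: 0.06 }
);
console.log(cost.toFixed(4)); // 0.0270
```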
## Installation
```bash
cd /home/user/ruvector/packages/agentic-synth
npm install
```
## Setup
Set your API keys as environment variables:
```bash
export OPENAI_API_KEY="your-openai-key"
export ANTHROPIC_API_KEY="your-anthropic-key"
```
Or create a `.env` file:
```env
OPENAI_API_KEY=your-openai-key
ANTHROPIC_API_KEY=your-anthropic-key
SAMPLE_SIZE=100
```
## Usage
### Basic Usage
```bash
npx tsx training/dspy-multi-model-benchmark.ts
```
### Custom Sample Size
```bash
SAMPLE_SIZE=1000 npx tsx training/dspy-multi-model-benchmark.ts
```
### Programmatic Usage
```typescript
import { DSPyMultiModelBenchmark } from './training/dspy-multi-model-benchmark';
const benchmark = new DSPyMultiModelBenchmark('./results');
// Add models
benchmark.addModel({
name: 'GPT-4',
provider: 'openai',
modelId: 'gpt-4',
  apiKey: process.env.OPENAI_API_KEY!, // non-null assertion for strict mode
costPer1kTokens: { input: 0.03, output: 0.06 },
maxTokens: 8192
});
// Run comparison
const results = await benchmark.runComparison(1000);
// Generate report
await benchmark.generateReport(results);
```
## Output
The benchmark generates two files:
1. **Markdown Report** (`benchmark-report-TIMESTAMP.md`)
- Executive summary with winners
- Detailed metrics for each model
- Rankings by category
- Recommendations for different use cases
2. **JSON Results** (`benchmark-results-TIMESTAMP.json`)
- Complete benchmark data
- Raw metrics
- Optimization history
- Structured for further analysis
### Sample Output Structure
```
training/results/multi-model/
├── benchmark-report-2025-01-22T10-30-45-123Z.md
└── benchmark-results-2025-01-22T10-30-45-123Z.json
```
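The timestamps in the filenames above are ISO-8601 with filesystem-unsafe characters replaced. A sketch of one way to produce them (the actual naming code in the benchmark may differ):

```typescript
// Illustrative timestamped-filename generation matching the layout
// above; colons and dots are replaced so the name is filesystem-safe.
function timestampedName(prefix: string, ext: string, date = new Date()): string {
  const stamp = date.toISOString().replace(/[:.]/g, "-");
  return `${prefix}-${stamp}.${ext}`;
}

const d = new Date("2025-01-22T10:30:45.123Z");
console.log(timestampedName("benchmark-report", "md", d));
// benchmark-report-2025-01-22T10-30-45-123Z.md
```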
## Benchmark Workflow
```
┌─────────────────────────────────────────────────────────┐
│ For Each Model │
└─────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────┐
│ 1. Baseline Quality │
│ └─ Test with basic ChainOfThought module │
└─────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────┐
│ 2. BootstrapFewShot Optimization │
│ └─ 5 rounds of few-shot learning │
│ └─ Learn from successful examples │
└─────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────┐
│ 3. MIPROv2 Optimization │
│ └─ 3 trials of Bayesian optimization │
│ └─ Expected Improvement acquisition │
└─────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────┐
│ 4. Performance Testing │
│ └─ Measure latency (P50, P95, P99) │
│ └─ Calculate throughput │
└─────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────┐
│ 5. Cost Analysis │
│ └─ Track token usage │
│ └─ Calculate cost efficiency │
└─────────────────────────────────────────────────────────┘
```
## Metrics Explained
### Quality Metrics
- **F1 Score**: Harmonic mean of precision and recall
- **Exact Match**: Percentage of exact matches with expected output
- **BLEU Score**: Bilingual Evaluation Understudy (text similarity)
- **ROUGE Score**: Recall-Oriented Understudy for Gisting Evaluation
- **Overall**: Weighted average of all quality metrics
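To make F1 and Exact Match concrete, here is a minimal token-level implementation; the real dspy.ts metrics may tokenize, normalize, and weight differently:

```typescript
// Illustrative token-level F1 and exact-match scoring; the actual
// dspy.ts f1Score/exactMatch implementations may differ in detail.
function tokenize(text: string): string[] {
  return text.toLowerCase().split(/\s+/).filter(Boolean);
}

function f1Score(output: string, expected: string): number {
  const out = tokenize(output);
  const exp = tokenize(expected);
  const expCounts = new Map<string, number>();
  for (const t of exp) expCounts.set(t, (expCounts.get(t) ?? 0) + 1);
  let overlap = 0;
  for (const t of out) {
    const n = expCounts.get(t) ?? 0;
    if (n > 0) { overlap++; expCounts.set(t, n - 1); }
  }
  if (overlap === 0) return 0;
  const precision = overlap / out.length;
  const recall = overlap / exp.length;
  return (2 * precision * recall) / (precision + recall);
}

function exactMatch(output: string, expected: string): number {
  return output.trim() === expected.trim() ? 1 : 0;
}

console.log(f1Score("the quick fox", "the quick brown fox")); // ≈ 0.857
```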
### Performance Metrics
- **P50 Latency**: Median response time
- **P95 Latency**: 95th percentile response time
- **P99 Latency**: 99th percentile response time
- **Throughput**: Samples processed per second
- **Success Rate**: Percentage of successful generations
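The percentile latencies can be computed with a nearest-rank percentile over the sorted samples (one common convention; the benchmark's internal method may interpolate instead):

```typescript
// Nearest-rank percentile over latency samples; illustrative only.
function percentile(latenciesMs: number[], p: number): number {
  const sorted = [...latenciesMs].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(0, rank - 1)];
}

function throughput(sampleCount: number, totalSeconds: number): number {
  return sampleCount / totalSeconds; // samples per second
}

const latencies = [120, 95, 240, 180, 110, 990, 130, 105, 150, 160];
console.log(percentile(latencies, 50)); // 130
console.log(percentile(latencies, 95)); // 990
```

Note how a single slow outlier (990 ms) dominates P95 while leaving P50 untouched, which is why both are reported.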
### Optimization Metrics
- **Baseline Quality**: Initial quality without optimization
- **Bootstrap Improvement**: Quality gain from BootstrapFewShot
- **MIPRO Improvement**: Quality gain from MIPROv2
- **Improvement %**: Relative improvement over baseline
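The improvement percentages in the example output are relative to baseline quality. The arithmetic, sketched with made-up numbers:

```typescript
// Relative improvement over baseline, as reported in the example
// output (e.g. "+18.7%"); the input values here are illustrative.
function improvementPct(baseline: number, optimized: number): number {
  return ((optimized - baseline) / baseline) * 100;
}

const baseline = 0.737;
const afterMipro = 0.875;
console.log(`+${improvementPct(baseline, afterMipro).toFixed(1)}%`); // +18.7%
```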
## Customization
### Add Custom Models
```typescript
benchmark.addModel({
name: 'Custom Model',
provider: 'openrouter',
modelId: 'model-id',
apiKey: 'your-key',
costPer1kTokens: { input: 0.001, output: 0.002 },
maxTokens: 4096
});
```
### Custom Schema
Modify the schema in `benchmarkModel()`:
```typescript
const schema = {
id: 'UUID',
name: 'string (person name)',
email: 'string (valid email)',
age: 'number (18-80)',
// Add your custom fields...
};
```
### Custom Metrics
Implement custom quality scoring:
```typescript
private calculateQualityScore(output: any, expected: any): number {
  // Your custom scoring logic, e.g. a simple exact-match check:
  const score = JSON.stringify(output) === JSON.stringify(expected) ? 1 : 0;
  return score;
}
```
## Performance Tips
1. **Start Small**: Use `SAMPLE_SIZE=10` for quick tests
2. **Increase Gradually**: Scale to 100, 1000, 10000 as needed
3. **Parallel Testing**: Run different models separately
4. **Cost Monitoring**: Check costs before large runs
5. **Rate Limits**: Be aware of API rate limits
## Example Results
```
🔬 DSPy Multi-Model Benchmark Suite
======================================================================
Models: 4
Sample Size: 100
======================================================================
📊 Benchmarking: GPT-4
----------------------------------------------------------------------
→ Running baseline...
→ Optimizing with BootstrapFewShot...
→ Optimizing with MIPROv2...
✓ Quality Score: 0.875
✓ P95 Latency: 1234ms
✓ Cost/Sample: $0.000543
✓ Bootstrap Improvement: +12.3%
✓ MIPRO Improvement: +18.7%
📊 Benchmarking: Claude 3 Sonnet
----------------------------------------------------------------------
→ Running baseline...
→ Optimizing with BootstrapFewShot...
→ Optimizing with MIPROv2...
✓ Quality Score: 0.892
✓ P95 Latency: 987ms
✓ Cost/Sample: $0.000234
✓ Bootstrap Improvement: +14.2%
✓ MIPRO Improvement: +21.5%
======================================================================
✅ Benchmark completed successfully!
📊 Check the results directory for detailed reports.
======================================================================
```
## Troubleshooting
### API Key Issues
```bash
# Check if keys are set
echo $OPENAI_API_KEY
echo $ANTHROPIC_API_KEY
# Set keys temporarily
export OPENAI_API_KEY="sk-..."
export ANTHROPIC_API_KEY="sk-ant-..."
```
### Import Errors
```bash
# Rebuild the package
npm run build
# Check dspy.ts installation
npm list dspy.ts
```
### Out of Memory
```bash
# Reduce sample size
SAMPLE_SIZE=10 npx tsx training/dspy-multi-model-benchmark.ts
```
### Rate Limiting
Add delays between requests:
```typescript
// In measurePerformance()
await new Promise(resolve => setTimeout(resolve, 100));
```
## Architecture
```
DSPyMultiModelBenchmark
├── Model Management
│ ├── OpenAILM (GPT-4, GPT-3.5)
│ ├── AnthropicLM (Claude 3)
│ └── Token tracking
├── DSPy Modules
│ ├── SyntheticDataModule (ChainOfThought)
│ └── DataQualityModule (ReAct)
├── Optimizers
│ ├── BootstrapFewShot (5 rounds)
│ └── MIPROv2 (3 trials, Bayesian)
├── Metrics
│ ├── Quality (F1, EM, BLEU, ROUGE)
│ ├── Performance (latency, throughput)
│ └── Cost (tokens, efficiency)
└── Reporting
├── Markdown reports
└── JSON results
```
## Contributing
To add new features:
1. Extend `ModelConfig` for new providers
2. Implement new LM classes
3. Add custom DSPy modules
4. Enhance quality metrics
5. Extend reporting formats
## License
MIT, the same license as dspy.ts and agentic-synth.
## References
- [dspy.ts Documentation](https://github.com/ruvnet/dspy.ts)
- [DSPy Paper](https://arxiv.org/abs/2310.03714)
- [MIPROv2 Paper](https://arxiv.org/abs/2406.11695)
---
**Built with dspy.ts v2.1.1** - Declarative AI framework for TypeScript