# DSPy Multi-Model Benchmark Suite

Comprehensive benchmarking system for comparing multiple language models using real **dspy.ts v2.1.1** features.

## Features

### Real DSPy.ts Components

- ✅ **ChainOfThought** - For reasoning-based synthetic data generation
- ✅ **ReAct** - For iterative data quality validation
- ✅ **BootstrapFewShot** - Learn from successful examples (5 rounds)
- ✅ **MIPROv2** - Bayesian prompt optimization (3 trials)
- ✅ **Real Metrics** - f1Score, exactMatch, bleuScore, rougeScore

### Benchmark Capabilities

1. **Multi-Model Comparison**
   - OpenAI models (GPT-4, GPT-3.5-turbo)
   - Anthropic models (Claude 3 Sonnet, Claude 3 Haiku)
   - Automatic model registration and configuration

2. **Quality Metrics**
   - F1 Score
   - Exact Match
   - BLEU Score
   - ROUGE Score
   - Overall quality score

3. **Performance Metrics**
   - Latency (P50, P95, P99)
   - Throughput (samples/second)
   - Success rate
   - Average latency

4. **Cost Analysis**
   - Total cost tracking
   - Cost per sample
   - Cost per quality point
   - Token usage (input/output)

5. **Optimization Comparison**
   - Baseline quality
   - BootstrapFewShot improvement
   - MIPROv2 improvement
   - Quality progression tracking

## Installation

```bash
cd /home/user/ruvector/packages/agentic-synth
npm install
```

## Setup

Set your API keys as environment variables:

```bash
export OPENAI_API_KEY="your-openai-key"
export ANTHROPIC_API_KEY="your-anthropic-key"
```

Or create a `.env` file:

```env
OPENAI_API_KEY=your-openai-key
ANTHROPIC_API_KEY=your-anthropic-key
SAMPLE_SIZE=100
```

## Usage

### Basic Usage

```bash
npx tsx training/dspy-multi-model-benchmark.ts
```

### Custom Sample Size

```bash
SAMPLE_SIZE=1000 npx tsx training/dspy-multi-model-benchmark.ts
```

### Programmatic Usage

```typescript
import { DSPyMultiModelBenchmark } from './training/dspy-multi-model-benchmark';

const benchmark = new DSPyMultiModelBenchmark('./results');

// Add models
benchmark.addModel({
  name: 'GPT-4',
  provider: 'openai',
  modelId: 'gpt-4',
  apiKey: process.env.OPENAI_API_KEY,
  costPer1kTokens: { input: 0.03, output: 0.06 },
  maxTokens: 8192
});

// Run comparison
const results = await benchmark.runComparison(1000);

// Generate report
await benchmark.generateReport(results);
```

## Output

The benchmark generates two files:

1. **Markdown Report** (`benchmark-report-TIMESTAMP.md`)
   - Executive summary with winners
   - Detailed metrics for each model
   - Rankings by category
   - Recommendations for different use cases

2. **JSON Results** (`benchmark-results-TIMESTAMP.json`)
   - Complete benchmark data
   - Raw metrics
   - Optimization history
   - Structured for further analysis

### Sample Output Structure

```
training/results/multi-model/
├── benchmark-report-2025-01-22T10-30-45-123Z.md
└── benchmark-results-2025-01-22T10-30-45-123Z.json
```
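The JSON file is the one to reach for in further analysis. A minimal loading sketch, assuming only that the timestamped file from your run is valid JSON (inspect the top-level keys before coding against specific fields, since the exact layout may differ between benchmark versions):

```typescript
import { readFileSync } from 'node:fs';

// Example path from the structure above; substitute the timestamped file your run produced.
const resultsPath =
  'training/results/multi-model/benchmark-results-2025-01-22T10-30-45-123Z.json';

// Parse the raw benchmark data and inspect its top-level shape
// before writing analysis code against specific fields.
const results = JSON.parse(readFileSync(resultsPath, 'utf8'));
console.log(Object.keys(results));
```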
## Benchmark Workflow

```
┌──────────────────────────────────────────────┐
│                For Each Model                │
└──────────────────────────────────────────────┘
                       │
                       ▼
┌──────────────────────────────────────────────┐
│ 1. Baseline Quality                          │
│    └─ Test with basic ChainOfThought module  │
└──────────────────────────────────────────────┘
                       │
                       ▼
┌──────────────────────────────────────────────┐
│ 2. BootstrapFewShot Optimization             │
│    └─ 5 rounds of few-shot learning          │
│    └─ Learn from successful examples         │
└──────────────────────────────────────────────┘
                       │
                       ▼
┌──────────────────────────────────────────────┐
│ 3. MIPROv2 Optimization                      │
│    └─ 3 trials of Bayesian optimization      │
│    └─ Expected Improvement acquisition       │
└──────────────────────────────────────────────┘
                       │
                       ▼
┌──────────────────────────────────────────────┐
│ 4. Performance Testing                       │
│    └─ Measure latency (P50, P95, P99)        │
│    └─ Calculate throughput                   │
└──────────────────────────────────────────────┘
                       │
                       ▼
┌──────────────────────────────────────────────┐
│ 5. Cost Analysis                             │
│    └─ Track token usage                      │
│    └─ Calculate cost efficiency              │
└──────────────────────────────────────────────┘
```

## Metrics Explained

### Quality Metrics

- **F1 Score**: Harmonic mean of precision and recall
- **Exact Match**: Percentage of exact matches with the expected output
- **BLEU Score**: Bilingual Evaluation Understudy (text similarity)
- **ROUGE Score**: Recall-Oriented Understudy for Gisting Evaluation
- **Overall**: Weighted average of all quality metrics

### Performance Metrics

- **P50 Latency**: Median response time
- **P95 Latency**: 95th percentile response time
- **P99 Latency**: 99th percentile response time
- **Throughput**: Samples processed per second
- **Success Rate**: Percentage of successful generations

### Optimization Metrics

- **Baseline Quality**: Initial quality without optimization
- **Bootstrap Improvement**: Quality gain from BootstrapFewShot
- **MIPRO Improvement**: Quality gain from MIPROv2
- **Improvement %**: Relative improvement over baseline

## Customization

### Add Custom Models

```typescript
benchmark.addModel({
  name: 'Custom Model',
  provider: 'openrouter',
  modelId: 'model-id',
  apiKey: 'your-key',
  costPer1kTokens: { input: 0.001, output: 0.002 },
  maxTokens: 4096
});
```

### Custom Schema

Modify the schema in `benchmarkModel()`:

```typescript
const schema = {
  id: 'UUID',
  name: 'string (person name)',
  email: 'string (valid email)',
  age: 'number (18-80)',
  // Add your custom fields...
};
```

### Custom Metrics

Implement custom quality scoring:

```typescript
private calculateQualityScore(output: any, expected: any): number {
  // Your custom scoring logic, e.g. a simple exact-match check:
  return JSON.stringify(output) === JSON.stringify(expected) ? 1 : 0;
}
```

## Performance Tips

1. **Start Small**: Use `SAMPLE_SIZE=10` for quick tests
2. **Increase Gradually**: Scale to 100, 1000, 10000 as needed
3. **Parallel Testing**: Run different models separately
4. **Cost Monitoring**: Check costs before large runs (a rough estimate is sketched after the example results below)
5. **Rate Limits**: Be aware of API rate limits

## Example Results

```
🔬 DSPy Multi-Model Benchmark Suite
======================================================================
Models: 4
Sample Size: 100
======================================================================

📊 Benchmarking: GPT-4
----------------------------------------------------------------------
  → Running baseline...
  → Optimizing with BootstrapFewShot...
  → Optimizing with MIPROv2...
  ✓ Quality Score: 0.875
  ✓ P95 Latency: 1234ms
  ✓ Cost/Sample: $0.000543
  ✓ Bootstrap Improvement: +12.3%
  ✓ MIPRO Improvement: +18.7%

📊 Benchmarking: Claude 3 Sonnet
----------------------------------------------------------------------
  → Running baseline...
  → Optimizing with BootstrapFewShot...
  → Optimizing with MIPROv2...
  ✓ Quality Score: 0.892
  ✓ P95 Latency: 987ms
  ✓ Cost/Sample: $0.000234
  ✓ Bootstrap Improvement: +14.2%
  ✓ MIPRO Improvement: +21.5%

======================================================================
✅ Benchmark completed successfully!
📊 Check the results directory for detailed reports.
======================================================================
```
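The Cost/Sample figures above are derived from each model's registered `costPer1kTokens` rates and its tracked token usage. Before committing to a large run (see the Cost Monitoring tip), you can ballpark total spend; the sketch below is a rough, hypothetical estimate in which the average token counts are assumptions you should adjust for your schema:

```typescript
// Rough pre-run cost estimate; the token averages are assumed, not measured.
function estimateRunCost(
  costPer1kTokens: { input: number; output: number },
  sampleSize: number,
  avgInputTokens = 500,   // assumed prompt tokens per sample
  avgOutputTokens = 200   // assumed completion tokens per sample
): number {
  const perSample =
    (avgInputTokens / 1000) * costPer1kTokens.input +
    (avgOutputTokens / 1000) * costPer1kTokens.output;
  return perSample * sampleSize;
}

// Example: the GPT-4 rates from the configuration above, 1,000 samples ≈ $27.
console.log(estimateRunCost({ input: 0.03, output: 0.06 }, 1000));
```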
## Troubleshooting

### API Key Issues

```bash
# Check if keys are set
echo $OPENAI_API_KEY
echo $ANTHROPIC_API_KEY

# Set keys temporarily
export OPENAI_API_KEY="sk-..."
export ANTHROPIC_API_KEY="sk-ant-..."
```

### Import Errors

```bash
# Rebuild the package
npm run build

# Check dspy.ts installation
npm list dspy.ts
```

### Out of Memory

```bash
# Reduce sample size
SAMPLE_SIZE=10 npx tsx training/dspy-multi-model-benchmark.ts
```

### Rate Limiting

Add delays between requests:

```typescript
// In measurePerformance()
await new Promise(resolve => setTimeout(resolve, 100));
```

## Architecture

```
DSPyMultiModelBenchmark
├── Model Management
│   ├── OpenAILM (GPT-4, GPT-3.5)
│   ├── AnthropicLM (Claude 3)
│   └── Token tracking
│
├── DSPy Modules
│   ├── SyntheticDataModule (ChainOfThought)
│   └── DataQualityModule (ReAct)
│
├── Optimizers
│   ├── BootstrapFewShot (5 rounds)
│   └── MIPROv2 (3 trials, Bayesian)
│
├── Metrics
│   ├── Quality (F1, EM, BLEU, ROUGE)
│   ├── Performance (latency, throughput)
│   └── Cost (tokens, efficiency)
│
└── Reporting
    ├── Markdown reports
    └── JSON results
```

## Contributing

To add new features:

1. Extend `ModelConfig` for new providers
2. Implement new LM classes
3. Add custom DSPy modules
4. Enhance quality metrics
5. Extend reporting formats

## License

MIT - Same as dspy.ts and agentic-synth

## References

- [dspy.ts Documentation](https://github.com/ruvnet/dspy.ts)
- [DSPy Paper](https://arxiv.org/abs/2310.03714)
- [MIPROv2 Paper](https://arxiv.org/abs/2406.11695)

---

**Built with dspy.ts v2.1.1** - Declarative AI framework for TypeScript