# DSPy Multi-Model Benchmark Suite
Comprehensive benchmarking system for comparing multiple language models using real dspy.ts v2.1.1 features.
## Features

### Real DSPy.ts Components
- ✅ **ChainOfThought** - For reasoning-based synthetic data generation
- ✅ **ReAct** - For iterative data quality validation
- ✅ **BootstrapFewShot** - Learn from successful examples (5 rounds)
- ✅ **MIPROv2** - Bayesian prompt optimization (3 trials)
- ✅ **Real Metrics** - `f1Score`, `exactMatch`, `bleuScore`, `rougeScore`
### Benchmark Capabilities
- **Multi-Model Comparison**
  - OpenAI models (GPT-4, GPT-3.5-turbo)
  - Anthropic models (Claude 3 Sonnet, Claude 3 Haiku)
  - Automatic model registration and configuration
- **Quality Metrics**
  - F1 Score
  - Exact Match
  - BLEU Score
  - ROUGE Score
  - Overall quality score
- **Performance Metrics**
  - Latency (P50, P95, P99)
  - Throughput (samples/second)
  - Success rate
  - Average latency
- **Cost Analysis**
  - Total cost tracking
  - Cost per sample
  - Cost per quality point
  - Token usage (input/output)
- **Optimization Comparison**
  - Baseline quality
  - BootstrapFewShot improvement
  - MIPROv2 improvement
  - Quality progression tracking
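The cost analysis above comes down to simple token arithmetic. A minimal sketch (the `costPerSample` helper is illustrative, not part of the suite's API; the rates match the GPT-4 config shown later in this README):

```typescript
// Illustrative cost arithmetic; not the suite's actual internals.
interface TokenUsage { input: number; output: number }  // total tokens for a run

function costPerSample(
  usage: TokenUsage,
  ratePer1k: { input: number; output: number },
  samples: number
): number {
  const totalCost =
    (usage.input / 1000) * ratePer1k.input +
    (usage.output / 1000) * ratePer1k.output;
  return totalCost / samples;
}

// 50k input + 20k output tokens at GPT-4 rates, spread over 100 samples:
// 50 * $0.03 + 20 * $0.06 = $2.70 total, i.e. $0.027 per sample
const perSample = costPerSample(
  { input: 50_000, output: 20_000 },
  { input: 0.03, output: 0.06 },
  100
);
```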
## Installation

```bash
cd /home/user/ruvector/packages/agentic-synth
npm install
```
## Setup

Set your API keys as environment variables:

```bash
export OPENAI_API_KEY="your-openai-key"
export ANTHROPIC_API_KEY="your-anthropic-key"
```

Or create a `.env` file:

```bash
OPENAI_API_KEY=your-openai-key
ANTHROPIC_API_KEY=your-anthropic-key
SAMPLE_SIZE=100
```
## Usage

### Basic Usage

```bash
npx tsx training/dspy-multi-model-benchmark.ts
```

### Custom Sample Size

```bash
SAMPLE_SIZE=1000 npx tsx training/dspy-multi-model-benchmark.ts
```
### Programmatic Usage

```typescript
import { DSPyMultiModelBenchmark } from './training/dspy-multi-model-benchmark';

const benchmark = new DSPyMultiModelBenchmark('./results');

// Add models
benchmark.addModel({
  name: 'GPT-4',
  provider: 'openai',
  modelId: 'gpt-4',
  apiKey: process.env.OPENAI_API_KEY,
  costPer1kTokens: { input: 0.03, output: 0.06 },
  maxTokens: 8192
});

// Run comparison
const results = await benchmark.runComparison(1000);

// Generate report
await benchmark.generateReport(results);
```
## Output

The benchmark generates two files:

1. **Markdown Report** (`benchmark-report-TIMESTAMP.md`)
   - Executive summary with winners
   - Detailed metrics for each model
   - Rankings by category
   - Recommendations for different use cases
2. **JSON Results** (`benchmark-results-TIMESTAMP.json`)
   - Complete benchmark data
   - Raw metrics
   - Optimization history
   - Structured for further analysis
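The JSON file is meant for downstream analysis. As a hedged sketch (the field names `model`, `quality.overall`, and `costPerSample` are assumptions here — check a generated `benchmark-results-*.json` for the real schema), ranking models by overall quality might look like:

```typescript
// Assumed result shape; verify against an actual results file.
interface ModelResult {
  model: string;
  quality: { overall: number };
  costPerSample: number;
}

function rankByQuality(results: ModelResult[]): string[] {
  return [...results]
    .sort((a, b) => b.quality.overall - a.quality.overall)
    .map(r => r.model);
}

// Figures taken from the example results later in this README
const ranked = rankByQuality([
  { model: 'GPT-4', quality: { overall: 0.875 }, costPerSample: 0.000543 },
  { model: 'Claude 3 Sonnet', quality: { overall: 0.892 }, costPerSample: 0.000234 },
]);
```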
### Sample Output Structure

```text
training/results/multi-model/
├── benchmark-report-2025-01-22T10-30-45-123Z.md
└── benchmark-results-2025-01-22T10-30-45-123Z.json
```
## Benchmark Workflow

```text
┌─────────────────────────────────────────────────────────┐
│                    For Each Model                       │
└─────────────────────────────────────────────────────────┘
                          │
                          ▼
┌─────────────────────────────────────────────────────────┐
│ 1. Baseline Quality                                     │
│    └─ Test with basic ChainOfThought module             │
└─────────────────────────────────────────────────────────┘
                          │
                          ▼
┌─────────────────────────────────────────────────────────┐
│ 2. BootstrapFewShot Optimization                        │
│    └─ 5 rounds of few-shot learning                     │
│    └─ Learn from successful examples                    │
└─────────────────────────────────────────────────────────┘
                          │
                          ▼
┌─────────────────────────────────────────────────────────┐
│ 3. MIPROv2 Optimization                                 │
│    └─ 3 trials of Bayesian optimization                 │
│    └─ Expected Improvement acquisition                  │
└─────────────────────────────────────────────────────────┘
                          │
                          ▼
┌─────────────────────────────────────────────────────────┐
│ 4. Performance Testing                                  │
│    └─ Measure latency (P50, P95, P99)                   │
│    └─ Calculate throughput                              │
└─────────────────────────────────────────────────────────┘
                          │
                          ▼
┌─────────────────────────────────────────────────────────┐
│ 5. Cost Analysis                                        │
│    └─ Track token usage                                 │
│    └─ Calculate cost efficiency                         │
└─────────────────────────────────────────────────────────┘
```
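The optimization stages of this loop can be sketched as a driver that records quality after each step. The stage functions below are stand-in stubs with made-up gains, not the suite's internals; the point is the quality-progression tracking listed under Benchmark Capabilities:

```typescript
// Stand-in stages with illustrative numbers; the real suite runs DSPy
// optimizers here. Each stage maps the current quality to a new quality.
type Stage = (quality: number) => number;

const stages: Array<[string, Stage]> = [
  ['baseline',  () => 0.70],     // 1. plain ChainOfThought
  ['bootstrap', q => q * 1.12],  // 2. BootstrapFewShot, 5 rounds
  ['mipro',     q => q * 1.06],  // 3. MIPROv2, 3 Bayesian trials
];

function trackProgression(): Map<string, number> {
  const progression = new Map<string, number>();
  let quality = 0;
  for (const [name, stage] of stages) {
    quality = stage(quality);
    progression.set(name, quality);  // record quality after each stage
  }
  return progression;
}

const prog = trackProgression();
```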
## Metrics Explained

### Quality Metrics
- F1 Score: Harmonic mean of precision and recall
- Exact Match: Percentage of exact matches with expected output
- BLEU Score: Bilingual Evaluation Understudy (text similarity)
- ROUGE Score: Recall-Oriented Understudy for Gisting Evaluation
- Overall: Weighted average of all quality metrics
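As a rough illustration of the first two metrics, here is a common token-level formulation (the suite's own `f1Score`/`exactMatch` implementations may tokenize or normalize differently):

```typescript
// Token-level F1 and exact match -- a common formulation, shown for
// illustration; not necessarily identical to dspy.ts's metrics.
function tokens(s: string): string[] {
  return s.toLowerCase().split(/\s+/).filter(Boolean);
}

function exactMatch(pred: string, gold: string): number {
  return pred.trim() === gold.trim() ? 1 : 0;
}

function f1(pred: string, gold: string): number {
  const p = tokens(pred), g = tokens(gold);
  // Count overlapping tokens, respecting multiplicity
  const goldCounts = new Map<string, number>();
  for (const t of g) goldCounts.set(t, (goldCounts.get(t) ?? 0) + 1);
  let overlap = 0;
  for (const t of p) {
    const c = goldCounts.get(t) ?? 0;
    if (c > 0) { overlap++; goldCounts.set(t, c - 1); }
  }
  if (overlap === 0) return 0;
  const precision = overlap / p.length;
  const recall = overlap / g.length;
  return (2 * precision * recall) / (precision + recall);  // harmonic mean
}
```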
### Performance Metrics
- P50 Latency: Median response time
- P95 Latency: 95th percentile response time
- P99 Latency: 99th percentile response time
- Throughput: Samples processed per second
- Success Rate: Percentage of successful generations
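The percentile figures reduce to order statistics over recorded latencies. A minimal nearest-rank sketch (the suite's exact interpolation method is not specified here):

```typescript
// Nearest-rank percentile over a list of latency samples (ms).
function percentile(latenciesMs: number[], p: number): number {
  const sorted = [...latenciesMs].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);  // nearest-rank method
  return sorted[Math.max(0, rank - 1)];
}

// One slow outlier (990ms) dominates the tail percentiles
const lat = [120, 95, 210, 180, 150, 990, 130, 160, 140, 175];
const p50 = percentile(lat, 50);
const p95 = percentile(lat, 95);
const p99 = percentile(lat, 99);
```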
### Optimization Metrics
- Baseline Quality: Initial quality without optimization
- Bootstrap Improvement: Quality gain from BootstrapFewShot
- MIPRO Improvement: Quality gain from MIPROv2
- Improvement %: Relative improvement over baseline
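The improvement percentage is a plain relative gain over the baseline quality:

```typescript
// Relative improvement over baseline, as reported per optimizer.
function improvementPct(baseline: number, optimized: number): number {
  return ((optimized - baseline) / baseline) * 100;
}

// A model whose quality rises from 0.75 (baseline) to 0.89 after
// optimization has improved by (0.89 - 0.75) / 0.75 * 100 ≈ 18.7%.
const gain = improvementPct(0.75, 0.89);
```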
## Customization

### Add Custom Models
```typescript
benchmark.addModel({
  name: 'Custom Model',
  provider: 'openrouter',
  modelId: 'model-id',
  apiKey: 'your-key',
  costPer1kTokens: { input: 0.001, output: 0.002 },
  maxTokens: 4096
});
```
### Custom Schema

Modify the schema in `benchmarkModel()`:

```typescript
const schema = {
  id: 'UUID',
  name: 'string (person name)',
  email: 'string (valid email)',
  age: 'number (18-80)',
  // Add your custom fields...
};
```
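Generated rows can then be checked against such a schema. A hedged sketch for the example fields above (this validator is illustrative and not part of the suite):

```typescript
// Illustrative row validator for the example schema; not the suite's API.
interface Person { id: string; name: string; email: string; age: number }

function isValid(row: Person): boolean {
  const uuid = /^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$/i;
  const email = /^[^\s@]+@[^\s@]+\.[^\s@]+$/;  // coarse sanity check only
  return (
    uuid.test(row.id) &&
    row.name.trim().length > 0 &&
    email.test(row.email) &&
    Number.isInteger(row.age) && row.age >= 18 && row.age <= 80
  );
}

const ok = isValid({
  id: '123e4567-e89b-12d3-a456-426614174000',
  name: 'Ada Lovelace',
  email: 'ada@example.com',
  age: 36,
});
```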
### Custom Metrics

Implement custom quality scoring:

```typescript
private calculateQualityScore(output: any, expected: any): number {
  // Your custom scoring logic -- e.g. a simple structural exact match:
  return JSON.stringify(output) === JSON.stringify(expected) ? 1 : 0;
}
```
## Performance Tips

- **Start Small**: Use `SAMPLE_SIZE=10` for quick tests
- **Increase Gradually**: Scale to 100, 1000, 10000 as needed
- **Parallel Testing**: Run different models separately
- **Cost Monitoring**: Check costs before large runs
- **Rate Limits**: Be aware of API rate limits
## Example Results

```text
🔬 DSPy Multi-Model Benchmark Suite
======================================================================
Models: 4
Sample Size: 100
======================================================================

📊 Benchmarking: GPT-4
----------------------------------------------------------------------
  → Running baseline...
  → Optimizing with BootstrapFewShot...
  → Optimizing with MIPROv2...
  ✓ Quality Score: 0.875
  ✓ P95 Latency: 1234ms
  ✓ Cost/Sample: $0.000543
  ✓ Bootstrap Improvement: +12.3%
  ✓ MIPRO Improvement: +18.7%

📊 Benchmarking: Claude 3 Sonnet
----------------------------------------------------------------------
  → Running baseline...
  → Optimizing with BootstrapFewShot...
  → Optimizing with MIPROv2...
  ✓ Quality Score: 0.892
  ✓ P95 Latency: 987ms
  ✓ Cost/Sample: $0.000234
  ✓ Bootstrap Improvement: +14.2%
  ✓ MIPRO Improvement: +21.5%

======================================================================
✅ Benchmark completed successfully!
📊 Check the results directory for detailed reports.
======================================================================
```
## Troubleshooting

### API Key Issues

```bash
# Check if keys are set
echo $OPENAI_API_KEY
echo $ANTHROPIC_API_KEY

# Set keys temporarily
export OPENAI_API_KEY="sk-..."
export ANTHROPIC_API_KEY="sk-ant-..."
```

### Import Errors

```bash
# Rebuild the package
npm run build

# Check dspy.ts installation
npm list dspy.ts
```

### Out of Memory

```bash
# Reduce sample size
SAMPLE_SIZE=10 npx tsx training/dspy-multi-model-benchmark.ts
```

### Rate Limiting

Add delays between requests:

```typescript
// In measurePerformance()
await new Promise(resolve => setTimeout(resolve, 100));
```
## Architecture

```text
DSPyMultiModelBenchmark
├── Model Management
│   ├── OpenAILM (GPT-4, GPT-3.5)
│   ├── AnthropicLM (Claude 3)
│   └── Token tracking
│
├── DSPy Modules
│   ├── SyntheticDataModule (ChainOfThought)
│   └── DataQualityModule (ReAct)
│
├── Optimizers
│   ├── BootstrapFewShot (5 rounds)
│   └── MIPROv2 (3 trials, Bayesian)
│
├── Metrics
│   ├── Quality (F1, EM, BLEU, ROUGE)
│   ├── Performance (latency, throughput)
│   └── Cost (tokens, efficiency)
│
└── Reporting
    ├── Markdown reports
    └── JSON results
```
## Contributing

To add new features:

- Extend `ModelConfig` for new providers
- Implement new LM classes
- Add custom DSPy modules
- Enhance quality metrics
- Extend reporting formats
## License

MIT - Same as dspy.ts and agentic-synth

## References

Built with dspy.ts v2.1.1 - Declarative AI framework for TypeScript