DSPy Multi-Model Benchmark Suite

Comprehensive benchmarking system for comparing multiple language models using real dspy.ts v2.1.1 features.

Features

Real DSPy.ts Components

  • ChainOfThought - For reasoning-based synthetic data generation
  • ReAct - For iterative data quality validation
  • BootstrapFewShot - Learn from successful examples (5 rounds)
  • MIPROv2 - Bayesian prompt optimization (3 trials)
  • Real Metrics - f1Score, exactMatch, bleuScore, rougeScore

Benchmark Capabilities

  1. Multi-Model Comparison

    • OpenAI models (GPT-4, GPT-3.5-turbo)
    • Anthropic models (Claude 3 Sonnet, Claude 3 Haiku)
    • Automatic model registration and configuration
  2. Quality Metrics

    • F1 Score
    • Exact Match
    • BLEU Score
    • ROUGE Score
    • Overall quality score
  3. Performance Metrics

    • Latency (P50, P95, P99)
    • Throughput (samples/second)
    • Success rate
    • Average latency
  4. Cost Analysis

    • Total cost tracking
    • Cost per sample
    • Cost per quality point
    • Token usage (input/output)
  5. Optimization Comparison

    • Baseline quality
    • BootstrapFewShot improvement
    • MIPROv2 improvement
    • Quality progression tracking

Installation

# From the repository root
cd packages/agentic-synth
npm install

Setup

Set your API keys as environment variables:

export OPENAI_API_KEY="your-openai-key"
export ANTHROPIC_API_KEY="your-anthropic-key"

Or create a .env file:

OPENAI_API_KEY=your-openai-key
ANTHROPIC_API_KEY=your-anthropic-key
SAMPLE_SIZE=100

Usage

Basic Usage

npx tsx training/dspy-multi-model-benchmark.ts

Custom Sample Size

SAMPLE_SIZE=1000 npx tsx training/dspy-multi-model-benchmark.ts

Programmatic Usage

import { DSPyMultiModelBenchmark } from './training/dspy-multi-model-benchmark';

const benchmark = new DSPyMultiModelBenchmark('./results');

// Add models
benchmark.addModel({
  name: 'GPT-4',
  provider: 'openai',
  modelId: 'gpt-4',
  apiKey: process.env.OPENAI_API_KEY,
  costPer1kTokens: { input: 0.03, output: 0.06 },
  maxTokens: 8192
});

// Run comparison
const results = await benchmark.runComparison(1000);

// Generate report
await benchmark.generateReport(results);
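
To actually compare providers, register at least one more model before calling runComparison(). The shape mirrors the call above; the provider string, model ID, and pricing below are illustrative assumptions you should adjust to your account's current rates.

// A second model for comparison (values here are placeholders, not official pricing)
benchmark.addModel({
  name: 'Claude 3 Sonnet',
  provider: 'anthropic',
  modelId: 'claude-3-sonnet-20240229',
  apiKey: process.env.ANTHROPIC_API_KEY,
  costPer1kTokens: { input: 0.003, output: 0.015 },
  maxTokens: 4096
});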

Output

The benchmark generates two files:

  1. Markdown Report (benchmark-report-TIMESTAMP.md)

    • Executive summary with winners
    • Detailed metrics for each model
    • Rankings by category
    • Recommendations for different use cases
  2. JSON Results (benchmark-results-TIMESTAMP.json)

    • Complete benchmark data
    • Raw metrics
    • Optimization history
    • Structured for further analysis

Sample Output Structure

training/results/multi-model/
├── benchmark-report-2025-01-22T10-30-45-123Z.md
└── benchmark-results-2025-01-22T10-30-45-123Z.json

Benchmark Workflow

┌─────────────────────────────────────────────────────────┐
│                   For Each Model                        │
└─────────────────────────────────────────────────────────┘
                           │
                           ▼
┌─────────────────────────────────────────────────────────┐
│ 1. Baseline Quality                                     │
│    └─ Test with basic ChainOfThought module            │
└─────────────────────────────────────────────────────────┘
                           │
                           ▼
┌─────────────────────────────────────────────────────────┐
│ 2. BootstrapFewShot Optimization                        │
│    └─ 5 rounds of few-shot learning                    │
│    └─ Learn from successful examples                   │
└─────────────────────────────────────────────────────────┘
                           │
                           ▼
┌─────────────────────────────────────────────────────────┐
│ 3. MIPROv2 Optimization                                 │
│    └─ 3 trials of Bayesian optimization                │
│    └─ Expected Improvement acquisition                 │
└─────────────────────────────────────────────────────────┘
                           │
                           ▼
┌─────────────────────────────────────────────────────────┐
│ 4. Performance Testing                                  │
│    └─ Measure latency (P50, P95, P99)                  │
│    └─ Calculate throughput                             │
└─────────────────────────────────────────────────────────┘
                           │
                           ▼
┌─────────────────────────────────────────────────────────┐
│ 5. Cost Analysis                                        │
│    └─ Track token usage                                │
│    └─ Calculate cost efficiency                        │
└─────────────────────────────────────────────────────────┘

Metrics Explained

Quality Metrics

  • F1 Score: Harmonic mean of precision and recall
  • Exact Match: Percentage of exact matches with expected output
  • BLEU Score: Bilingual Evaluation Understudy (text similarity)
  • ROUGE Score: Recall-Oriented Understudy for Gisting Evaluation
  • Overall: Weighted average of all quality metrics
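
As a rough sketch only (not the suite's actual implementation), token-level F1 and exact match can be computed like this:

// Illustrative metric helpers; the real f1Score/exactMatch come from dspy.ts
function exactMatch(output: string, expected: string): number {
  return output.trim() === expected.trim() ? 1 : 0;
}

function f1Score(output: string, expected: string): number {
  const out = output.toLowerCase().split(/\s+/).filter(Boolean);
  const exp = expected.toLowerCase().split(/\s+/).filter(Boolean);
  if (out.length === 0 || exp.length === 0) return 0;
  const counts = new Map<string, number>();
  for (const t of exp) counts.set(t, (counts.get(t) ?? 0) + 1);
  let overlap = 0;
  for (const t of out) {
    const c = counts.get(t) ?? 0;
    if (c > 0) { overlap++; counts.set(t, c - 1); }
  }
  const precision = overlap / out.length;
  const recall = overlap / exp.length;
  return precision + recall === 0 ? 0 : (2 * precision * recall) / (precision + recall);
}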

Performance Metrics

  • P50 Latency: Median response time
  • P95 Latency: 95th percentile response time
  • P99 Latency: 99th percentile response time
  • Throughput: Samples processed per second
  • Success Rate: Percentage of successful generations
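
For reference, percentile latency and throughput over a batch of timed samples reduce to simple arithmetic (a generic sketch, not the benchmark's exact code):

// Generic latency percentile and throughput helpers (illustrative)
function percentile(latenciesMs: number[], p: number): number {
  if (latenciesMs.length === 0) return 0;
  const sorted = [...latenciesMs].sort((a, b) => a - b);
  const idx = Math.min(sorted.length - 1, Math.max(0, Math.ceil((p / 100) * sorted.length) - 1));
  return sorted[idx];
}

function throughput(sampleCount: number, totalDurationMs: number): number {
  return sampleCount / (totalDurationMs / 1000); // samples per second
}

// percentile(latencies, 50) -> P50, percentile(latencies, 95) -> P95, percentile(latencies, 99) -> P99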

Optimization Metrics

  • Baseline Quality: Initial quality without optimization
  • Bootstrap Improvement: Quality gain from BootstrapFewShot
  • MIPRO Improvement: Quality gain from MIPROv2
  • Improvement %: Relative improvement over baseline
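
Improvement percentages are relative to the baseline; the arithmetic is simply:

// Relative improvement over the baseline quality score (illustrative)
function improvementPercent(baseline: number, optimized: number): number {
  if (baseline === 0) return 0;
  return ((optimized - baseline) / baseline) * 100;
}

// e.g. a baseline of 0.75 improving to 0.89 is roughly +18.7%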

Customization

Add Custom Models

benchmark.addModel({
  name: 'Custom Model',
  provider: 'openrouter',
  modelId: 'model-id',
  apiKey: 'your-key',
  costPer1kTokens: { input: 0.001, output: 0.002 },
  maxTokens: 4096
});

Custom Schema

Modify the schema in benchmarkModel():

const schema = {
  id: 'UUID',
  name: 'string (person name)',
  email: 'string (valid email)',
  age: 'number (18-80)',
  // Add your custom fields...
};

Custom Metrics

Implement custom quality scoring:

private calculateQualityScore(output: any, expected: any): number {
  let score = 0;
  // Your custom scoring logic (return a value between 0 and 1)
  return score;
}
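
For example, a hand-rolled scorer for the schema above could check field-level agreement. This is purely illustrative and not part of the suite:

// Illustrative field-level scorer (hypothetical helper, adapt to your own fields)
function scoreRecord(output: Record<string, any>, expected: Record<string, any>): number {
  const fields = Object.keys(expected);
  if (fields.length === 0) return 0;
  let matched = 0;
  for (const field of fields) {
    if (String(output[field] ?? '') === String(expected[field])) matched++;
  }
  return matched / fields.length; // fraction of fields that match exactly
}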

Performance Tips

  1. Start Small: Use SAMPLE_SIZE=10 for quick tests
  2. Increase Gradually: Scale to 100, 1000, 10000 as needed
  3. Parallel Testing: Run different models separately
  4. Cost Monitoring: Check costs before large runs (see the estimate sketch after this list)
  5. Rate Limits: Be aware of API rate limits
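
For tip 4, a back-of-the-envelope estimate before a large run can use the costPer1kTokens values from your model configs; the per-sample token counts below are assumptions to adjust:

// Rough pre-run cost estimate (illustrative; tune the token estimates per model)
function estimateRunCost(
  sampleSize: number,
  costPer1kTokens: { input: number; output: number },
  avgInputTokens = 500,   // assumed prompt tokens per sample
  avgOutputTokens = 300   // assumed completion tokens per sample
): number {
  const inputCost = (sampleSize * avgInputTokens / 1000) * costPer1kTokens.input;
  const outputCost = (sampleSize * avgOutputTokens / 1000) * costPer1kTokens.output;
  return inputCost + outputCost;
}

// estimateRunCost(1000, { input: 0.03, output: 0.06 }) ≈ $33 with the GPT-4 pricing shown earlier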

Example Results

🔬 DSPy Multi-Model Benchmark Suite
======================================================================
Models: 4
Sample Size: 100
======================================================================

📊 Benchmarking: GPT-4
----------------------------------------------------------------------
  → Running baseline...
  → Optimizing with BootstrapFewShot...
  → Optimizing with MIPROv2...
  ✓ Quality Score: 0.875
  ✓ P95 Latency: 1234ms
  ✓ Cost/Sample: $0.000543
  ✓ Bootstrap Improvement: +12.3%
  ✓ MIPRO Improvement: +18.7%

📊 Benchmarking: Claude 3 Sonnet
----------------------------------------------------------------------
  → Running baseline...
  → Optimizing with BootstrapFewShot...
  → Optimizing with MIPROv2...
  ✓ Quality Score: 0.892
  ✓ P95 Latency: 987ms
  ✓ Cost/Sample: $0.000234
  ✓ Bootstrap Improvement: +14.2%
  ✓ MIPRO Improvement: +21.5%

======================================================================
✅ Benchmark completed successfully!
📊 Check the results directory for detailed reports.
======================================================================

Troubleshooting

API Key Issues

# Check if keys are set
echo $OPENAI_API_KEY
echo $ANTHROPIC_API_KEY

# Set keys temporarily
export OPENAI_API_KEY="sk-..."
export ANTHROPIC_API_KEY="sk-ant-..."

Import Errors

# Rebuild the package
npm run build

# Check dspy.ts installation
npm list dspy.ts

Out of Memory

# Reduce sample size
SAMPLE_SIZE=10 npx tsx training/dspy-multi-model-benchmark.ts

Rate Limiting

Add delays between requests:

// In measurePerformance()
await new Promise(resolve => setTimeout(resolve, 100));
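
If a fixed delay is not enough, a simple retry-with-backoff wrapper (generic, not part of the suite) can absorb transient rate-limit errors:

// Generic retry with exponential backoff for rate-limited calls (illustrative)
async function withBackoff<T>(fn: () => Promise<T>, maxRetries = 3): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      if (attempt >= maxRetries) throw err;
      const delayMs = 500 * 2 ** attempt; // 500ms, 1s, 2s, ...
      await new Promise(resolve => setTimeout(resolve, delayMs));
    }
  }
}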

Architecture

DSPyMultiModelBenchmark
├── Model Management
│   ├── OpenAILM (GPT-4, GPT-3.5)
│   ├── AnthropicLM (Claude 3)
│   └── Token tracking
│
├── DSPy Modules
│   ├── SyntheticDataModule (ChainOfThought)
│   └── DataQualityModule (ReAct)
│
├── Optimizers
│   ├── BootstrapFewShot (5 rounds)
│   └── MIPROv2 (3 trials, Bayesian)
│
├── Metrics
│   ├── Quality (F1, EM, BLEU, ROUGE)
│   ├── Performance (latency, throughput)
│   └── Cost (tokens, efficiency)
│
└── Reporting
    ├── Markdown reports
    └── JSON results

Contributing

To add new features:

  1. Extend ModelConfig for new providers (see the sketch after this list)
  2. Implement new LM classes
  3. Add custom DSPy modules
  4. Enhance quality metrics
  5. Extend reporting formats
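
A ModelConfig shape consistent with the addModel() calls in this README, plus a minimal skeleton for a new LM class. The field names are inferred from the examples above and the class below is hypothetical; verify both against the actual source before relying on them:

// Inferred from the addModel() examples in this README; check the real interface in the code
interface ModelConfig {
  name: string;
  provider: string;            // e.g. 'openai', 'anthropic', 'openrouter'
  modelId: string;
  apiKey: string | undefined;
  costPer1kTokens: { input: number; output: number };
  maxTokens: number;
}

// Hypothetical skeleton for a provider-specific LM class
class CustomLM {
  constructor(private config: ModelConfig) {}

  async generate(prompt: string): Promise<string> {
    // Call your provider's API with this.config.modelId / this.config.apiKey,
    // return the completion text, and track token usage for cost analysis.
    throw new Error('not implemented');
  }
}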

License

MIT - Same as dspy.ts and agentic-synth

Built with dspy.ts v2.1.1 - Declarative AI framework for TypeScript