# DSPy Benchmark Comparison Framework
A comprehensive benchmarking suite for comparing multiple models across quality, performance, cost, learning, and diversity metrics.
## Features

### 🎯 Core Capabilities
- **Multi-Model Comparison**
  - Compare unlimited models side-by-side
  - Statistical significance testing
  - Pareto frontier analysis
  - Weighted scoring across dimensions
- **Scalability Testing**
  - Test from 100 to 100,000 samples
  - Measure latency, throughput, cost at scale
  - Calculate scaling efficiency
  - Identify performance bottlenecks
- **Cost Analysis**
  - Track total cost per run
  - Calculate cost per sample
  - Compute cost per quality point
  - Efficiency rankings
- **Quality Convergence**
  - Measure learning rates
  - Track improvement over generations
  - Identify plateau points
  - Convergence speed analysis
- **Diversity Analysis**
  - Unique value counting
  - Pattern variety measurement
  - Shannon entropy calculation
  - Coverage scoring
### 📊 Metrics Collected

#### Quality Metrics
- Accuracy: Correctness of generated data
- Coherence: Logical consistency and flow
- Validity: Adherence to schema and constraints
- Consistency: Uniformity across samples
- Completeness: Coverage of all required fields
- Overall: Weighted average of all quality metrics
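
For illustration, the overall score can be computed as a weighted mean of the five sub-metrics. A minimal TypeScript sketch, with weights chosen only as an assumption for the example (the suite's actual weights may differ):

```typescript
// Illustrative only: weighted aggregation of quality sub-metrics.
// The weights below are assumptions, not the suite's exact values.
interface QualityMetrics {
  accuracy: number;
  coherence: number;
  validity: number;
  consistency: number;
  completeness: number;
}

function overallQuality(m: QualityMetrics): number {
  return (
    m.accuracy * 0.3 +
    m.coherence * 0.2 +
    m.validity * 0.2 +
    m.consistency * 0.15 +
    m.completeness * 0.15
  );
}
```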
#### Performance Metrics
- Latency P50/P95/P99: Response time percentiles
- Average Latency: Mean response time
- Min/Max Latency: Range of response times
- Throughput: Samples generated per second
- Success Rate: Percentage of successful generations
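
These values can be derived from raw per-sample timings. A minimal sketch assuming a sequential run and nearest-rank percentiles (not the suite's internal code):

```typescript
// Sketch: derive latency percentiles, throughput, and success rate
// from an array of per-sample latencies (milliseconds).
function percentile(sortedMs: number[], p: number): number {
  const idx = Math.max(0, Math.ceil((p / 100) * sortedMs.length) - 1);
  return sortedMs[Math.min(idx, sortedMs.length - 1)];
}

function performanceSummary(latenciesMs: number[], successes: number) {
  const sorted = [...latenciesMs].sort((a, b) => a - b);
  const totalMs = latenciesMs.reduce((sum, x) => sum + x, 0);
  return {
    p50: percentile(sorted, 50),
    p95: percentile(sorted, 95),
    p99: percentile(sorted, 99),
    avgLatency: totalMs / latenciesMs.length,
    minLatency: sorted[0],
    maxLatency: sorted[sorted.length - 1],
    throughput: latenciesMs.length / (totalMs / 1000), // samples per second
    successRate: successes / latenciesMs.length,
  };
}
```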
#### Cost Metrics
- Total Cost: Total expenditure for test run
- Cost per Sample: Average cost per generated sample
- Cost per Quality Point: Cost normalized by quality
- Tokens Used: Total tokens consumed
- Efficiency: Quality per unit cost
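
A sketch of the arithmetic behind these metrics, assuming token-based pricing via `costPer1kTokens` (the other field names are illustrative):

```typescript
// Sketch: cost metrics derived from token usage and quality.
function costSummary(tokensUsed: number, costPer1kTokens: number, samples: number, quality: number) {
  const totalCost = (tokensUsed / 1000) * costPer1kTokens;
  return {
    totalCost,
    costPerSample: totalCost / samples,
    costPerQualityPoint: totalCost / Math.max(quality, 1e-9),
    efficiency: quality / Math.max(totalCost, 1e-9), // quality per unit cost
  };
}
```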
#### Learning Metrics
- Improvement Rate: Quality gain per generation
- Convergence Speed: Generations until plateau
- Learning Curve: Quality progression over time
- Plateau Generation: When learning stabilizes
- Final Quality: Ultimate quality achieved
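
One way to derive these values from a per-generation quality curve; the plateau threshold below is an assumption for illustration, not the suite's setting:

```typescript
// Sketch: learning metrics from a quality-per-generation curve.
function learningSummary(curve: number[], plateauThreshold = 0.005) {
  const deltas = curve.slice(1).map((q, i) => q - curve[i]);
  const improvementRate = deltas.reduce((sum, d) => sum + d, 0) / Math.max(deltas.length, 1);
  const plateauIndex = deltas.findIndex(d => Math.abs(d) < plateauThreshold);
  return {
    improvementRate,                       // average quality gain per generation
    plateauGeneration: plateauIndex === -1 ? curve.length : plateauIndex + 1,
    finalQuality: curve[curve.length - 1],
  };
}
```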
#### Diversity Metrics
- Unique Values: Number of distinct samples
- Pattern Variety: Ratio of unique to total samples
- Distribution Entropy: Shannon entropy of data
- Coverage Score: Field-level diversity measure
- Novelty Rate: Rate of new pattern generation
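
A simplified sketch of the diversity calculations over serialized samples (Shannon entropy in bits; not the framework's exact implementation):

```typescript
// Sketch: unique values, pattern variety, and Shannon entropy.
function diversitySummary(samples: string[]) {
  const counts = new Map<string, number>();
  for (const s of samples) counts.set(s, (counts.get(s) ?? 0) + 1);

  let entropy = 0;
  for (const count of counts.values()) {
    const p = count / samples.length;
    entropy -= p * Math.log2(p);
  }

  return {
    uniqueValues: counts.size,
    patternVariety: counts.size / samples.length,
    distributionEntropy: entropy,
  };
}
```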
## Usage

### Quick Start

```typescript
import { BenchmarkSuite } from './dspy-benchmarks.js';

const suite = new BenchmarkSuite();

// Add common models
suite.addCommonModels();

// Run comprehensive comparison
const comparison = await suite.runModelComparison(1000);

// Generate reports
await suite.generateJSONReport(comparison);
await suite.generateMarkdownReport(comparison);
```
### Custom Models

```typescript
import { BenchmarkSuite, ModelConfig } from './dspy-benchmarks.js';

const suite = new BenchmarkSuite();

// Add custom model
const customModel: ModelConfig = {
  name: 'My Custom Model',
  provider: 'openrouter',
  model: 'my-model',
  costPer1kTokens: 0.002,
  maxTokens: 8192,
  apiKey: process.env.API_KEY, // Optional
};
suite.addModel(customModel);

// Run benchmarks
const comparison = await suite.runModelComparison(1000);
```
### Running from CLI

```bash
# Full benchmark suite
npx tsx training/run-benchmarks.ts full

# Quick comparison (3 models, 500 samples)
npx tsx training/run-benchmarks.ts quick

# Scalability test only
npx tsx training/run-benchmarks.ts scalability

# Cost analysis only
npx tsx training/run-benchmarks.ts cost
```
## API Reference

### `BenchmarkSuite` Class

#### Constructor

```typescript
constructor(outputDir?: string)
```

Creates a new benchmark suite instance.

- `outputDir`: Optional output directory (default: `./training/results/benchmarks`)

#### Methods

##### `addModel(config: ModelConfig)`

Add a model to the benchmark suite.

```typescript
suite.addModel({
  name: 'GPT-4',
  provider: 'openai',
  model: 'gpt-4',
  costPer1kTokens: 0.03,
  maxTokens: 8192,
});
```
##### `addCommonModels()`

Add 6 pre-configured common models for quick testing:

- GPT-4
- Claude 3.5 Sonnet
- Gemini Pro
- GPT-3.5 Turbo
- Llama 3 70B
- Mixtral 8x7B

```typescript
suite.addCommonModels();
```
##### `runModelComparison(sampleSize?: number): Promise<ComparisonResult>`

Run comprehensive comparison across all models.

```typescript
const comparison = await suite.runModelComparison(1000);
```

Returns: `ComparisonResult` with winners, statistical significance, Pareto frontier, and recommendations.
##### `runScalabilityTest(): Promise<ScalabilityResult[]>`

Test scalability from 100 to 100K samples.

```typescript
const results = await suite.runScalabilityTest();
```

Tests: 100, 500, 1K, 5K, 10K, 50K, 100K samples
##### `runCostAnalysis(): Promise`

Analyze cost-effectiveness across models.

```typescript
await suite.runCostAnalysis();
```

Outputs: Cost rankings, efficiency scores, cost/quality trade-offs

##### `runQualityConvergence(generations?: number): Promise`

Measure learning rates and quality convergence.

```typescript
await suite.runQualityConvergence(10);
```

Default: 10 generations

##### `runDiversityAnalysis(sampleSize?: number): Promise`

Analyze data diversity and variety.

```typescript
await suite.runDiversityAnalysis(5000);
```

Default: 5000 samples

##### `generateJSONReport(comparison: ComparisonResult): Promise`

Generate comprehensive JSON report.

```typescript
await suite.generateJSONReport(comparison);
```

Output: `benchmark-comparison.json`

##### `generateMarkdownReport(comparison: ComparisonResult): Promise`

Generate human-readable Markdown report.

```typescript
await suite.generateMarkdownReport(comparison);
```

Output: `BENCHMARK_REPORT.md`
## Output Files

### JSON Reports

#### `benchmark-comparison.json`

Complete benchmark results including:

- Metadata and timestamps
- Comparison results
- All model results
- Statistical summaries

#### `scalability-results.json`

Scalability test results including:

- Latencies at each scale
- Throughput measurements
- Cost progression
- Scaling efficiency

#### `convergence-data.json`

Learning convergence data including:

- Quality curves
- Improvement rates
- Plateau generations
### Markdown Reports

#### `BENCHMARK_REPORT.md`

Comprehensive human-readable report including:

- Executive summary
- Detailed results per model
- Comparative tables
- Pareto frontier analysis
- Use case recommendations
- Statistical significance
- Methodology explanation
- Conclusions
## Use Case Recommendations

The benchmark suite automatically recommends models for different scenarios:

### High-Quality, Low-Volume (Research)

Best for research, high-stakes decisions, and scenarios where quality is paramount.

Optimizes for: Maximum quality, learning capability

### High-Volume, Low-Latency (Production)

Best for production systems requiring high throughput and low latency.

Optimizes for: Throughput, low latency, success rate

### Cost-Optimized (Batch Processing)

Best for batch processing, large-scale data generation, and cost-sensitive applications.

Optimizes for: Lowest cost per sample, efficiency

### Balanced (General Purpose)

Best for general-purpose applications requiring a good balance of quality, performance, and cost.

Optimizes for: Weighted score across all metrics
## Statistical Analysis

### T-Test for Significance

The suite performs t-tests to determine if quality differences between models are statistically significant:

- p < 0.01: Highly significant difference
- p < 0.05: Significant difference
- p ≥ 0.05: No significant difference
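
As a rough illustration, a Welch-style two-sample t-statistic over per-sample quality scores could look like the sketch below; the suite's own test (noted as simplified under Limitations) may differ:

```typescript
// Sketch: Welch t-statistic for two arrays of per-sample quality scores.
// The statistic is then mapped to a p-value via a t-distribution.
function welchT(a: number[], b: number[]): number {
  const mean = (x: number[]) => x.reduce((sum, v) => sum + v, 0) / x.length;
  const variance = (x: number[], m: number) =>
    x.reduce((sum, v) => sum + (v - m) ** 2, 0) / (x.length - 1);
  const ma = mean(a);
  const mb = mean(b);
  return (ma - mb) / Math.sqrt(variance(a, ma) / a.length + variance(b, mb) / b.length);
}
```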
### Pareto Frontier

Identifies models with optimal quality/cost trade-offs. A model is on the Pareto frontier if no other model is better in both quality AND cost.
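
A minimal sketch of that dominance check over (quality, cost per sample) pairs; the field names are assumptions for illustration:

```typescript
// Sketch: keep only models not dominated in both quality and cost.
interface ModelPoint {
  name: string;
  quality: number;        // higher is better
  costPerSample: number;  // lower is better
}

function paretoFrontier(models: ModelPoint[]): ModelPoint[] {
  return models.filter(m =>
    !models.some(o =>
      o !== m &&
      o.quality >= m.quality &&
      o.costPerSample <= m.costPerSample &&
      (o.quality > m.quality || o.costPerSample < m.costPerSample)
    )
  );
}
```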
## Mock Data Generation

The framework includes a sophisticated mock data generator for demonstration purposes:

- Realistic Latencies: Based on actual model characteristics
- Learning Simulation: Quality improves over generations
- Quality Differentiation: Different models have different base qualities
- Schema Support: Handles various field types (UUID, email, name, numbers, etc.)
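
A hypothetical sketch of field-aware mock generation of this kind; the field types and helper shown are assumptions, not the framework's actual generator:

```typescript
import { randomUUID } from 'node:crypto';

// Sketch: generate a mock value for a schema field type.
function mockField(type: 'uuid' | 'email' | 'name' | 'number'): string | number {
  switch (type) {
    case 'uuid':
      return randomUUID();
    case 'email':
      return `user${Math.floor(Math.random() * 10_000)}@example.com`;
    case 'name':
      return ['Ada', 'Grace', 'Alan', 'Edsger'][Math.floor(Math.random() * 4)];
    case 'number':
      return Math.random() * 100;
  }
}
```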
## Example Output

```text
🔬 Running Model Comparison (1000 samples)
======================================================================

Testing GPT-4...
  Quality: 0.872
  Latency P95: 1589ms
  Cost/Sample: $0.004500
  Diversity: 0.843

Testing Claude 3.5 Sonnet...
  Quality: 0.891
  Latency P95: 1267ms
  Cost/Sample: $0.002250
  Diversity: 0.867

...

✅ All benchmarks completed!

📊 Key Findings:
  Overall Winner: Claude 3.5 Sonnet
  Best Quality: Claude 3.5 Sonnet
  Best Performance: Mixtral 8x7B
  Most Cost-Effective: Gemini Pro
  Pareto Frontier: Claude 3.5 Sonnet, Gemini Pro, Mixtral 8x7B

💡 Recommendations by Use Case:
  high-quality-low-volume: Claude 3.5 Sonnet
  high-volume-low-latency: Mixtral 8x7B
  cost-optimized: Gemini Pro
  balanced: Claude 3.5 Sonnet
  research: Claude 3.5 Sonnet
  production: Claude 3.5 Sonnet
```
## Advanced Features

### Custom Weighting

You can modify the overall winner calculation by adjusting weights in the `compareResults()` method:

```typescript
const score =
  quality * 0.3 +      // 30% quality
  performance * 0.2 +  // 20% performance
  (1 / cost) * 0.2 +   // 20% cost
  learning * 0.15 +    // 15% learning
  diversity * 0.15;    // 15% diversity
```
### Statistical Utilities

The `StatisticalAnalyzer` class provides utilities for:
- Mean and standard deviation
- Percentile calculation
- T-test for significance
- Shannon entropy
- Distribution analysis
### Extensibility

Easily extend the framework:

- Add new metrics: Extend metric interfaces
- Add new models: Implement `ModelConfig`
- Add new tests: Add methods to `BenchmarkSuite`
- Custom analysis: Use `StatisticalAnalyzer` utilities
## Performance Considerations
- Mock Mode: Runs without API calls for testing
- Parallel Testing: Could be extended for concurrent model testing
- Caching: Results are cached to disk
- Memory Efficient: Processes samples in batches
## Limitations
- Mock data generator simulates behavior (no actual API calls)
- Quality metrics are approximations based on model characteristics
- Statistical tests use simplified distributions
- Assumes consistent model behavior
## Future Enhancements
- Real API integration with actual model calls
- Parallel model testing for faster benchmarks
- More sophisticated quality assessment
- Interactive visualization dashboard
- A/B testing framework
- Confidence interval calculation
- Cost prediction modeling
- Automated model selection
## License

MIT
## Contributing

Contributions welcome! Please ensure:
- TypeScript type safety
- Comprehensive documentation
- Test coverage
- Performance optimization
## Support

For issues or questions:
- GitHub Issues: https://github.com/ruvnet/ruvector/issues
- Documentation: See main project README