# DSPy Benchmark Comparison Framework
A comprehensive benchmarking suite for comparing multiple models across quality, performance, cost, learning, and diversity metrics.
## Features
### 🎯 Core Capabilities
1. **Multi-Model Comparison**
   - Compare unlimited models side-by-side
   - Statistical significance testing
   - Pareto frontier analysis
   - Weighted scoring across dimensions
2. **Scalability Testing**
   - Test from 100 to 100,000 samples
   - Measure latency, throughput, cost at scale
   - Calculate scaling efficiency
   - Identify performance bottlenecks
3. **Cost Analysis**
   - Track total cost per run
   - Calculate cost per sample
   - Compute cost per quality point
   - Efficiency rankings
4. **Quality Convergence**
   - Measure learning rates
   - Track improvement over generations
   - Identify plateau points
   - Convergence speed analysis
5. **Diversity Analysis**
   - Unique value counting
   - Pattern variety measurement
   - Shannon entropy calculation
   - Coverage scoring
### 📊 Metrics Collected
#### Quality Metrics
- **Accuracy**: Correctness of generated data
- **Coherence**: Logical consistency and flow
- **Validity**: Adherence to schema and constraints
- **Consistency**: Uniformity across samples
- **Completeness**: Coverage of all required fields
- **Overall**: Weighted average of all quality metrics
#### Performance Metrics
- **Latency P50/P95/P99**: Response time percentiles
- **Average Latency**: Mean response time
- **Min/Max Latency**: Range of response times
- **Throughput**: Samples generated per second
- **Success Rate**: Percentage of successful generations
#### Cost Metrics
- **Total Cost**: Total expenditure for test run
- **Cost per Sample**: Average cost per generated sample
- **Cost per Quality Point**: Cost normalized by quality
- **Tokens Used**: Total tokens consumed
- **Efficiency**: Quality per unit cost
#### Learning Metrics
- **Improvement Rate**: Quality gain per generation
- **Convergence Speed**: Generations until plateau
- **Learning Curve**: Quality progression over time
- **Plateau Generation**: When learning stabilizes
- **Final Quality**: Ultimate quality achieved
#### Diversity Metrics
- **Unique Values**: Number of distinct samples
- **Pattern Variety**: Ratio of unique to total samples
- **Distribution Entropy**: Shannon entropy of the generated data (see the sketch after this list)
- **Coverage Score**: Field-level diversity measure
- **Novelty Rate**: Rate of new pattern generation
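A minimal sketch of the entropy and variety calculations above; the function names are illustrative, not the framework's actual API:
```typescript
// Shannon entropy in bits over the distribution of distinct samples.
function shannonEntropy(samples: string[]): number {
  const counts = new Map<string, number>();
  for (const s of samples) counts.set(s, (counts.get(s) ?? 0) + 1);
  let entropy = 0;
  for (const count of counts.values()) {
    const p = count / samples.length;
    entropy -= p * Math.log2(p); // contribution of each distinct value
  }
  return entropy;
}

// Pattern variety: ratio of unique samples to total samples.
function patternVariety(samples: string[]): number {
  return new Set(samples).size / samples.length;
}
```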
## Usage
### Quick Start
```typescript
import { BenchmarkSuite } from './dspy-benchmarks.js';
const suite = new BenchmarkSuite();
// Add common models
suite.addCommonModels();
// Run comprehensive comparison
const comparison = await suite.runModelComparison(1000);
// Generate reports
await suite.generateJSONReport(comparison);
await suite.generateMarkdownReport(comparison);
```
### Custom Models
```typescript
import { BenchmarkSuite, ModelConfig } from './dspy-benchmarks.js';
const suite = new BenchmarkSuite();
// Add custom model
const customModel: ModelConfig = {
  name: 'My Custom Model',
  provider: 'openrouter',
  model: 'my-model',
  costPer1kTokens: 0.002,
  maxTokens: 8192,
  apiKey: process.env.API_KEY, // Optional
};
suite.addModel(customModel);
// Run benchmarks
const comparison = await suite.runModelComparison(1000);
```
### Running from CLI
```bash
# Full benchmark suite
npx tsx training/run-benchmarks.ts full
# Quick comparison (3 models, 500 samples)
npx tsx training/run-benchmarks.ts quick
# Scalability test only
npx tsx training/run-benchmarks.ts scalability
# Cost analysis only
npx tsx training/run-benchmarks.ts cost
```
## API Reference
### BenchmarkSuite Class
#### Constructor
```typescript
constructor(outputDir?: string)
```
Creates a new benchmark suite instance.
- `outputDir`: Optional output directory (default: `./training/results/benchmarks`)
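For example, to write results somewhere other than the default:
```typescript
import { BenchmarkSuite } from './dspy-benchmarks.js';

// Reports and intermediate results will land under ./my-results.
const suite = new BenchmarkSuite('./my-results');
```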
#### Methods
##### `addModel(config: ModelConfig)`
Add a model to the benchmark suite.
```typescript
suite.addModel({
  name: 'GPT-4',
  provider: 'openai',
  model: 'gpt-4',
  costPer1kTokens: 0.03,
  maxTokens: 8192,
});
```
##### `addCommonModels()`
Add 6 pre-configured common models for quick testing:
- GPT-4
- Claude 3.5 Sonnet
- Gemini Pro
- GPT-3.5 Turbo
- Llama 3 70B
- Mixtral 8x7B
```typescript
suite.addCommonModels();
```
##### `runModelComparison(sampleSize?: number): Promise<ComparisonResult>`
Run comprehensive comparison across all models.
```typescript
const comparison = await suite.runModelComparison(1000);
```
**Returns**: A `ComparisonResult` with winners, statistical significance, Pareto frontier, and recommendations.
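A sketch of how the result might be consumed; the field names below are illustrative, so check the `ComparisonResult` type for the real shape:
```typescript
const comparison = await suite.runModelComparison(1000);

// Hypothetical field names, for illustration only.
console.log(comparison.overallWinner);   // e.g. 'Claude 3.5 Sonnet'
console.log(comparison.paretoFrontier);  // models with optimal quality/cost trade-offs
console.log(comparison.recommendations); // per-use-case model picks
```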
##### `runScalabilityTest(): Promise<ScalabilityResult[]>`
Test scalability from 100 to 100K samples.
```typescript
const results = await suite.runScalabilityTest();
```
**Tests**: 100, 500, 1K, 5K, 10K, 50K, 100K samples
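One way to read the output is to compare throughput across scales. The sketch below assumes a simple definition of scaling efficiency (throughput at each scale relative to the smallest run), which may differ from the suite's internal formula:
```typescript
interface ScalePoint {
  samples: number;    // e.g. 100, 500, 1000, ...
  throughput: number; // samples per second at that scale
}

// Efficiency of 1.0 means throughput held steady as scale grew;
// values below 1.0 indicate degradation.
function scalingEfficiency(results: ScalePoint[]): number[] {
  const base = results[0].throughput;
  return results.map(r => r.throughput / base);
}
```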
##### `runCostAnalysis(): Promise<void>`
Analyze cost-effectiveness across models.
```typescript
await suite.runCostAnalysis();
```
**Outputs**: Cost rankings, efficiency scores, cost/quality trade-offs
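The cost-per-quality-point metric (see Cost Metrics above) can be understood as spend normalized by the quality it buys. This helper is illustrative, not the suite's exact implementation:
```typescript
// If a run cost $4.50 and achieved an overall quality of 0.9,
// the cost per quality point is $5.00.
function costPerQualityPoint(totalCost: number, overallQuality: number): number {
  return totalCost / overallQuality;
}
```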
##### `runQualityConvergence(generations?: number): Promise<void>`
Measure learning rates and quality convergence.
```typescript
await suite.runQualityConvergence(10);
```
**Default**: 10 generations
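A minimal sketch of plateau detection, assuming a plateau is the first generation whose quality gain falls below a small epsilon (the suite's criterion may differ):
```typescript
// qualityCurve[g] is the quality achieved at generation g.
function plateauGeneration(qualityCurve: number[], epsilon = 0.005): number {
  for (let g = 1; g < qualityCurve.length; g++) {
    if (qualityCurve[g] - qualityCurve[g - 1] < epsilon) return g;
  }
  return qualityCurve.length - 1; // never plateaued within the run
}
```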
##### `runDiversityAnalysis(sampleSize?: number): Promise<void>`
Analyze data diversity and variety.
```typescript
await suite.runDiversityAnalysis(5000);
```
**Default**: 5000 samples
##### `generateJSONReport(comparison: ComparisonResult): Promise<void>`
Generate comprehensive JSON report.
```typescript
await suite.generateJSONReport(comparison);
```
**Output**: `benchmark-comparison.json`
##### `generateMarkdownReport(comparison: ComparisonResult): Promise<void>`
Generate human-readable Markdown report.
```typescript
await suite.generateMarkdownReport(comparison);
```
**Output**: `BENCHMARK_REPORT.md`
## Output Files
### JSON Reports
#### benchmark-comparison.json
Complete benchmark results including:
- Metadata and timestamps
- Comparison results
- All model results
- Statistical summaries
#### scalability-results.json
Scalability test results including:
- Latencies at each scale
- Throughput measurements
- Cost progression
- Scaling efficiency
#### convergence-data.json
Learning convergence data including:
- Quality curves
- Improvement rates
- Plateau generations
### Markdown Reports
#### BENCHMARK_REPORT.md
Comprehensive human-readable report including:
- Executive summary
- Detailed results per model
- Comparative tables
- Pareto frontier analysis
- Use case recommendations
- Statistical significance
- Methodology explanation
- Conclusions
## Use Case Recommendations
The benchmark suite automatically recommends models for different scenarios:
### High-Quality, Low-Volume (Research)
Best for research, high-stakes decisions, and scenarios where quality is paramount.
**Optimizes for**: Maximum quality, learning capability
### High-Volume, Low-Latency (Production)
Best for production systems requiring high throughput and low latency.
**Optimizes for**: Throughput, low latency, success rate
### Cost-Optimized (Batch Processing)
Best for batch processing, large-scale data generation, and cost-sensitive applications.
**Optimizes for**: Lowest cost per sample, efficiency
### Balanced (General Purpose)
Best for general-purpose applications requiring a good balance of quality, performance, and cost.
**Optimizes for**: Weighted score across all metrics
## Statistical Analysis
### T-Test for Significance
The suite performs t-tests to determine whether quality differences between models are statistically significant (a sketch of the statistic follows the thresholds below):
- **p < 0.01**: Highly significant difference
- **p < 0.05**: Significant difference
- **p ≥ 0.05**: No significant difference
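For reference, here is a sketch of Welch's two-sample t-statistic, one common choice for this kind of comparison; the suite's `StatisticalAnalyzer` may implement the test differently, and turning t into a p-value additionally requires the t-distribution CDF, omitted here:
```typescript
function welchT(a: number[], b: number[]): number {
  const mean = (xs: number[]) => xs.reduce((s, x) => s + x, 0) / xs.length;
  const variance = (xs: number[]) => {
    const m = mean(xs);
    return xs.reduce((s, x) => s + (x - m) ** 2, 0) / (xs.length - 1);
  };
  // Standard error of the difference in means (unequal variances).
  const se = Math.sqrt(variance(a) / a.length + variance(b) / b.length);
  return (mean(a) - mean(b)) / se;
}
```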
### Pareto Frontier
Identifies models with optimal quality/cost trade-offs. A model is on the Pareto frontier if no other model is better in both quality AND cost.
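The stated rule translates directly into a filter. A minimal sketch with illustrative types (not the framework's):
```typescript
interface Candidate {
  name: string;
  quality: number; // higher is better
  cost: number;    // lower is better
}

// Keep a model unless some other model beats it on BOTH axes.
function paretoFrontier(models: Candidate[]): Candidate[] {
  return models.filter(m =>
    !models.some(o => o.quality > m.quality && o.cost < m.cost)
  );
}
```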
## Mock Data Generation
The framework includes a sophisticated mock data generator for demonstration purposes (a latency sketch follows this list):
- **Realistic Latencies**: Based on actual model characteristics
- **Learning Simulation**: Quality improves over generations
- **Quality Differentiation**: Different models have different base qualities
- **Schema Support**: Handles various field types (UUID, email, name, numbers, etc.)
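As a rough illustration of how such latencies might be simulated (hypothetical, not the generator's actual code): draw a base latency per model and multiply by log-normal jitter.
```typescript
// Box–Muller transform: a standard normal draw from two uniforms.
function gaussian(): number {
  const u = 1 - Math.random();
  const v = Math.random();
  return Math.sqrt(-2 * Math.log(u)) * Math.cos(2 * Math.PI * v);
}

// Multiplicative log-normal jitter keeps latencies positive and
// right-skewed, matching real-world response-time distributions.
function mockLatencyMs(baseMs: number, sigma = 0.3): number {
  return baseMs * Math.exp(sigma * gaussian());
}
```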
## Example Output
```
🔬 Running Model Comparison (1000 samples)
======================================================================

Testing GPT-4...
  Quality: 0.872
  Latency P95: 1589ms
  Cost/Sample: $0.004500
  Diversity: 0.843

Testing Claude 3.5 Sonnet...
  Quality: 0.891
  Latency P95: 1267ms
  Cost/Sample: $0.002250
  Diversity: 0.867

...

✅ All benchmarks completed!

📊 Key Findings:
  Overall Winner: Claude 3.5 Sonnet
  Best Quality: Claude 3.5 Sonnet
  Best Performance: Mixtral 8x7B
  Most Cost-Effective: Gemini Pro
  Pareto Frontier: Claude 3.5 Sonnet, Gemini Pro, Mixtral 8x7B

💡 Recommendations by Use Case:
  high-quality-low-volume: Claude 3.5 Sonnet
  high-volume-low-latency: Mixtral 8x7B
  cost-optimized: Gemini Pro
  balanced: Claude 3.5 Sonnet
  research: Claude 3.5 Sonnet
  production: Claude 3.5 Sonnet
```
## Advanced Features
### Custom Weighting
You can modify the overall winner calculation by adjusting weights in the `compareResults()` method:
```typescript
const score =
  quality * 0.3 +      // 30% quality
  performance * 0.2 +  // 20% performance
  (1 / cost) * 0.2 +   // 20% cost
  learning * 0.15 +    // 15% learning
  diversity * 0.15;    // 15% diversity
```
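Note that the weights above sum to 1.0; if you adjust them, keeping the total at 1.0 keeps the composite score on a consistent scale across runs.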
### Statistical Utilities
The `StatisticalAnalyzer` class provides utilities for:
- Mean and standard deviation
- Percentile calculation (sketched below)
- T-test for significance
- Shannon entropy
- Distribution analysis
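For example, percentile calculation with linear interpolation is commonly implemented like this (an illustration; `StatisticalAnalyzer`'s exact method may vary):
```typescript
// percentile(values, 95) returns the P95 of the sample.
function percentile(values: number[], p: number): number {
  const sorted = [...values].sort((a, b) => a - b);
  const idx = (p / 100) * (sorted.length - 1);
  const lo = Math.floor(idx);
  const hi = Math.ceil(idx);
  return sorted[lo] + (sorted[hi] - sorted[lo]) * (idx - lo);
}
```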
### Extensibility
Easily extend the framework:
1. **Add new metrics**: Extend metric interfaces
2. **Add new models**: Implement `ModelConfig`
3. **Add new tests**: Add methods to `BenchmarkSuite`
4. **Custom analysis**: Use `StatisticalAnalyzer` utilities
## Performance Considerations
- **Mock Mode**: Runs without API calls for testing
- **Parallel Testing**: Could be extended for concurrent model testing
- **Caching**: Results are cached to disk
- **Memory Efficient**: Processes samples in batches
## Limitations
- Mock data generator simulates behavior (no actual API calls)
- Quality metrics are approximations based on model characteristics
- Statistical tests use simplified distributions
- Assumes consistent model behavior
## Future Enhancements
- [ ] Real API integration with actual model calls
- [ ] Parallel model testing for faster benchmarks
- [ ] More sophisticated quality assessment
- [ ] Interactive visualization dashboard
- [ ] A/B testing framework
- [ ] Confidence interval calculation
- [ ] Cost prediction modeling
- [ ] Automated model selection
## License
MIT
## Contributing
Contributions welcome! Please ensure:
- TypeScript type safety
- Comprehensive documentation
- Test coverage
- Performance optimization
## Support
For issues or questions:
- GitHub Issues: https://github.com/ruvnet/ruvector/issues
- Documentation: See main project README