# DSPy Benchmark Comparison Framework

A comprehensive benchmarking suite for comparing multiple models across quality, performance, cost, learning, and diversity metrics.

## Features

### 🎯 Core Capabilities

1. **Multi-Model Comparison**
   - Compare unlimited models side-by-side
   - Statistical significance testing
   - Pareto frontier analysis
   - Weighted scoring across dimensions

2. **Scalability Testing**
   - Test from 100 to 100,000 samples
   - Measure latency, throughput, cost at scale
   - Calculate scaling efficiency
   - Identify performance bottlenecks

3. **Cost Analysis**
   - Track total cost per run
   - Calculate cost per sample
   - Compute cost per quality point
   - Efficiency rankings

4. **Quality Convergence**
   - Measure learning rates
   - Track improvement over generations
   - Identify plateau points
   - Convergence speed analysis

5. **Diversity Analysis**
   - Unique value counting
   - Pattern variety measurement
   - Shannon entropy calculation
   - Coverage scoring

### 📊 Metrics Collected

#### Quality Metrics
- **Accuracy**: Correctness of generated data
- **Coherence**: Logical consistency and flow
- **Validity**: Adherence to schema and constraints
- **Consistency**: Uniformity across samples
- **Completeness**: Coverage of all required fields
- **Overall**: Weighted average of all quality metrics
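
The overall score is a weighted average of the sub-metrics above. A minimal sketch of the idea, assuming equal weights (the framework's actual weights are not specified here):

```typescript
// Sketch only: equal weights are an assumption, not the framework's defaults.
const quality = { accuracy: 0.91, coherence: 0.88, validity: 0.93, consistency: 0.86, completeness: 0.9 };

const values = Object.values(quality);
const overall = values.reduce((sum, v) => sum + v, 0) / values.length;
```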

#### Performance Metrics
- **Latency P50/P95/P99**: Response time percentiles
- **Average Latency**: Mean response time
- **Min/Max Latency**: Range of response times
- **Throughput**: Samples generated per second
- **Success Rate**: Percentage of successful generations
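
Latency percentiles are computed from the per-sample response times recorded during a run. A minimal nearest-rank sketch (the suite's exact interpolation method is not documented here):

```typescript
// Sketch only: nearest-rank percentile over recorded latencies in milliseconds.
function percentile(samples: number[], p: number): number {
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.min(sorted.length - 1, Math.max(0, rank - 1))];
}

const latencies = [820, 910, 1010, 1267, 1589]; // example values
const p50 = percentile(latencies, 50);
const p95 = percentile(latencies, 95);
const p99 = percentile(latencies, 99);
```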

#### Cost Metrics
- **Total Cost**: Total expenditure for a test run
- **Cost per Sample**: Average cost per generated sample
- **Cost per Quality Point**: Cost normalized by quality
- **Tokens Used**: Total tokens consumed
- **Efficiency**: Quality per unit cost
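
These metrics follow directly from a run's totals. A minimal sketch with illustrative numbers:

```typescript
// Sketch only: deriving the cost metrics from one run's totals (example values).
const totalCost = 4.5;       // USD spent on the run
const sampleCount = 1000;
const overallQuality = 0.87; // overall quality score in [0, 1]

const costPerSample = totalCost / sampleCount;              // 0.0045
const costPerQualityPoint = costPerSample / overallQuality;
const efficiency = overallQuality / costPerSample;          // quality per unit cost
```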

#### Learning Metrics
- **Improvement Rate**: Quality gain per generation
- **Convergence Speed**: Generations until plateau
- **Learning Curve**: Quality progression over time
- **Plateau Generation**: When learning stabilizes
- **Final Quality**: Ultimate quality achieved
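
A minimal sketch of how an improvement rate and a plateau generation can be read off a learning curve; the 0.01 threshold is an illustrative assumption, not a framework setting:

```typescript
// Sketch only: per-generation quality scores for one model (example values).
const learningCurve = [0.62, 0.71, 0.78, 0.83, 0.85, 0.86, 0.86];

// Average quality gain per generation, first to last.
const improvementRate =
  (learningCurve[learningCurve.length - 1] - learningCurve[0]) / (learningCurve.length - 1);

// First generation whose gain over the previous one falls below the threshold.
const plateauGeneration = learningCurve.findIndex(
  (q, i) => i > 0 && q - learningCurve[i - 1] < 0.01,
);
```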

#### Diversity Metrics
- **Unique Values**: Number of distinct samples
- **Pattern Variety**: Ratio of unique to total samples
- **Distribution Entropy**: Shannon entropy of data
- **Coverage Score**: Field-level diversity measure
- **Novelty Rate**: Rate of new pattern generation
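
A minimal sketch of two of these measures; the function names are illustrative, not the framework's API:

```typescript
// Sketch only: Shannon entropy (in bits) over the empirical value distribution.
function shannonEntropy(values: string[]): number {
  const counts = new Map<string, number>();
  for (const v of values) counts.set(v, (counts.get(v) ?? 0) + 1);
  let entropy = 0;
  for (const count of counts.values()) {
    const p = count / values.length;
    entropy -= p * Math.log2(p);
  }
  return entropy;
}

// Pattern variety: ratio of unique samples to total samples.
function patternVariety(values: string[]): number {
  return new Set(values).size / values.length;
}
```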

## Usage

### Quick Start

```typescript
import { BenchmarkSuite } from './dspy-benchmarks.js';

const suite = new BenchmarkSuite();

// Add common models
suite.addCommonModels();

// Run comprehensive comparison
const comparison = await suite.runModelComparison(1000);

// Generate reports
await suite.generateJSONReport(comparison);
await suite.generateMarkdownReport(comparison);
```

### Custom Models

```typescript
import { BenchmarkSuite, ModelConfig } from './dspy-benchmarks.js';

const suite = new BenchmarkSuite();

// Add custom model
const customModel: ModelConfig = {
  name: 'My Custom Model',
  provider: 'openrouter',
  model: 'my-model',
  costPer1kTokens: 0.002,
  maxTokens: 8192,
  apiKey: process.env.API_KEY, // Optional
};

suite.addModel(customModel);

// Run benchmarks
const comparison = await suite.runModelComparison(1000);
```

### Running from CLI

```bash
# Full benchmark suite
npx tsx training/run-benchmarks.ts full

# Quick comparison (3 models, 500 samples)
npx tsx training/run-benchmarks.ts quick

# Scalability test only
npx tsx training/run-benchmarks.ts scalability

# Cost analysis only
npx tsx training/run-benchmarks.ts cost
```

## API Reference

### BenchmarkSuite Class

#### Constructor

```typescript
constructor(outputDir?: string)
```

Creates a new benchmark suite instance.

- `outputDir`: Optional output directory (default: `./training/results/benchmarks`)

#### Methods

##### `addModel(config: ModelConfig)`

Add a model to the benchmark suite.

```typescript
suite.addModel({
  name: 'GPT-4',
  provider: 'openai',
  model: 'gpt-4',
  costPer1kTokens: 0.03,
  maxTokens: 8192,
});
```

##### `addCommonModels()`

Add 6 pre-configured common models for quick testing:
- GPT-4
- Claude 3.5 Sonnet
- Gemini Pro
- GPT-3.5 Turbo
- Llama 3 70B
- Mixtral 8x7B

```typescript
suite.addCommonModels();
```

##### `runModelComparison(sampleSize?: number): Promise<ComparisonResult>`

Run comprehensive comparison across all models.

```typescript
const comparison = await suite.runModelComparison(1000);
```

**Returns**: `ComparisonResult` with winners, statistical significance, Pareto frontier, and recommendations.

##### `runScalabilityTest(): Promise<ScalabilityResult[]>`

Test scalability from 100 to 100K samples.

```typescript
const results = await suite.runScalabilityTest();
```

**Tests**: 100, 500, 1K, 5K, 10K, 50K, 100K samples
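
Scaling efficiency can be read as throughput at a given scale relative to the smallest scale. This definition is an assumption for illustration, not necessarily what `runScalabilityTest()` reports:

```typescript
// Sketch only: 1.0 means throughput held up perfectly as the sample size grew;
// lower values indicate degradation at scale.
interface ScalePoint {
  sampleSize: number;
  throughput: number; // samples per second
}

function scalingEfficiency(points: ScalePoint[]): Map<number, number> {
  const baseline = points[0].throughput;
  return new Map(points.map((p) => [p.sampleSize, p.throughput / baseline]));
}
```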

##### `runCostAnalysis(): Promise<void>`

Analyze cost-effectiveness across models.

```typescript
await suite.runCostAnalysis();
```

**Outputs**: Cost rankings, efficiency scores, cost/quality trade-offs

##### `runQualityConvergence(generations?: number): Promise<void>`

Measure learning rates and quality convergence.

```typescript
await suite.runQualityConvergence(10);
```

**Default**: 10 generations

##### `runDiversityAnalysis(sampleSize?: number): Promise<void>`

Analyze data diversity and variety.

```typescript
await suite.runDiversityAnalysis(5000);
```

**Default**: 5000 samples

##### `generateJSONReport(comparison: ComparisonResult): Promise<void>`

Generate comprehensive JSON report.

```typescript
await suite.generateJSONReport(comparison);
```

**Output**: `benchmark-comparison.json`

##### `generateMarkdownReport(comparison: ComparisonResult): Promise<void>`

Generate human-readable Markdown report.

```typescript
await suite.generateMarkdownReport(comparison);
```

**Output**: `BENCHMARK_REPORT.md`

## Output Files

### JSON Reports

#### benchmark-comparison.json
Complete benchmark results including:
- Metadata and timestamps
- Comparison results
- All model results
- Statistical summaries

#### scalability-results.json
Scalability test results including:
- Latencies at each scale
- Throughput measurements
- Cost progression
- Scaling efficiency

#### convergence-data.json
Learning convergence data including:
- Quality curves
- Improvement rates
- Plateau generations

### Markdown Reports

#### BENCHMARK_REPORT.md
Comprehensive human-readable report including:
- Executive summary
- Detailed results per model
- Comparative tables
- Pareto frontier analysis
- Use case recommendations
- Statistical significance
- Methodology explanation
- Conclusions

## Use Case Recommendations

The benchmark suite automatically recommends models for different scenarios:

### High-Quality, Low-Volume (Research)
Best for research, high-stakes decisions, and scenarios where quality is paramount.

**Optimizes for**: Maximum quality, learning capability

### High-Volume, Low-Latency (Production)
Best for production systems requiring high throughput and low latency.

**Optimizes for**: Throughput, low latency, success rate

### Cost-Optimized (Batch Processing)
Best for batch processing, large-scale data generation, and cost-sensitive applications.

**Optimizes for**: Lowest cost per sample, efficiency

### Balanced (General Purpose)
Best for general-purpose applications requiring a good balance of quality, performance, and cost.

**Optimizes for**: Weighted score across all metrics

## Statistical Analysis

### T-Test for Significance

The suite performs t-tests to determine if quality differences between models are statistically significant:

- **p < 0.01**: Highly significant difference
- **p < 0.05**: Significant difference
- **p ≥ 0.05**: No significant difference
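
A minimal sketch of the underlying statistic, here Welch's t over per-sample quality scores for two models; the p-value is then taken from the t-distribution with the computed degrees of freedom:

```typescript
// Sketch only: Welch's t-statistic and degrees of freedom for two score sets.
function welchT(a: number[], b: number[]): { t: number; df: number } {
  const mean = (xs: number[]) => xs.reduce((s, x) => s + x, 0) / xs.length;
  const variance = (xs: number[]) => {
    const m = mean(xs);
    return xs.reduce((s, x) => s + (x - m) ** 2, 0) / (xs.length - 1);
  };
  const va = variance(a) / a.length;
  const vb = variance(b) / b.length;
  const t = (mean(a) - mean(b)) / Math.sqrt(va + vb);
  const df = (va + vb) ** 2 / (va ** 2 / (a.length - 1) + vb ** 2 / (b.length - 1));
  return { t, df };
}
```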

### Pareto Frontier

Identifies models with optimal quality/cost trade-offs. A model is on the Pareto frontier if no other model is better in both quality AND cost.
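
A minimal sketch of that dominance check over (quality, cost) points; the type and field names are illustrative:

```typescript
// Sketch only: keep every model that no other model beats on both axes.
interface ModelPoint {
  name: string;
  quality: number;       // higher is better
  costPerSample: number; // lower is better
}

function paretoFrontier(models: ModelPoint[]): ModelPoint[] {
  return models.filter(
    (m) => !models.some((o) => o.quality > m.quality && o.costPerSample < m.costPerSample),
  );
}
```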

## Mock Data Generation

The framework includes a sophisticated mock data generator for demonstration purposes:

- **Realistic Latencies**: Based on actual model characteristics
- **Learning Simulation**: Quality improves over generations
- **Quality Differentiation**: Different models have different base qualities
- **Schema Support**: Handles various field types (UUID, email, name, numbers, etc.)
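
A minimal sketch of schema-driven field mocking in the same spirit; the field types and output formats here are assumptions, not the generator's actual implementation:

```typescript
import { randomUUID } from 'node:crypto';

// Sketch only: produce a plausible value for a handful of field types.
type FieldType = 'uuid' | 'email' | 'name' | 'number';

function mockField(type: FieldType): string | number {
  switch (type) {
    case 'uuid':
      return randomUUID();
    case 'email':
      return `user${Math.floor(Math.random() * 10_000)}@example.com`;
    case 'name':
      return ['Ada', 'Grace', 'Alan', 'Edsger'][Math.floor(Math.random() * 4)];
    case 'number':
      return Math.floor(Math.random() * 1_000);
  }
}
```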

## Example Output

```
🔬 Running Model Comparison (1000 samples)
======================================================================

Testing GPT-4...
  Quality: 0.872
  Latency P95: 1589ms
  Cost/Sample: $0.004500
  Diversity: 0.843

Testing Claude 3.5 Sonnet...
  Quality: 0.891
  Latency P95: 1267ms
  Cost/Sample: $0.002250
  Diversity: 0.867

...

✅ All benchmarks completed!

📊 Key Findings:
  Overall Winner: Claude 3.5 Sonnet
  Best Quality: Claude 3.5 Sonnet
  Best Performance: Mixtral 8x7B
  Most Cost-Effective: Gemini Pro
  Pareto Frontier: Claude 3.5 Sonnet, Gemini Pro, Mixtral 8x7B

💡 Recommendations by Use Case:
  high-quality-low-volume: Claude 3.5 Sonnet
  high-volume-low-latency: Mixtral 8x7B
  cost-optimized: Gemini Pro
  balanced: Claude 3.5 Sonnet
  research: Claude 3.5 Sonnet
  production: Claude 3.5 Sonnet
```

## Advanced Features

### Custom Weighting

You can modify the overall winner calculation by adjusting weights in the `compareResults()` method:

```typescript
const score =
  quality * 0.3 +      // 30% quality
  performance * 0.2 +  // 20% performance
  (1 / cost) * 0.2 +   // 20% cost (inverted so that cheaper models score higher)
  learning * 0.15 +    // 15% learning
  diversity * 0.15;    // 15% diversity
```
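
Because raw cost is on a different scale than the 0-1 quality metrics, it can help to normalize each dimension across the compared models before applying the weights. Min-max normalization is shown here as a suggestion, not as what `compareResults()` currently does:

```typescript
// Sketch only: rescale one dimension across all models into [0, 1].
// For cost, use 1 - normalized value so that cheaper models score higher.
function minMaxNormalize(values: number[]): number[] {
  const min = Math.min(...values);
  const max = Math.max(...values);
  return values.map((v) => (max === min ? 1 : (v - min) / (max - min)));
}
```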

### Statistical Utilities

The `StatisticalAnalyzer` class provides utilities for:
- Mean and standard deviation
- Percentile calculation
- T-test for significance
- Shannon entropy
- Distribution analysis

### Extensibility

Easily extend the framework:

1. **Add new metrics**: Extend metric interfaces
2. **Add new models**: Implement `ModelConfig`
3. **Add new tests**: Add methods to `BenchmarkSuite`
4. **Custom analysis**: Use `StatisticalAnalyzer` utilities
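
For example, a new metric can be added by extending one of the metric interfaces. The interface and field names below are hypothetical, chosen only to illustrate the shape of such an extension:

```typescript
// Hypothetical names, for illustration only.
interface DiversityMetrics {
  uniqueValues: number;
  patternVariety: number;
  distributionEntropy: number;
}

// An extension adds a project-specific metric without touching existing code.
interface ExtendedDiversityMetrics extends DiversityMetrics {
  semanticSpread: number;
}
```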

## Performance Considerations

- **Mock Mode**: Runs without API calls for testing
- **Parallel Testing**: Could be extended for concurrent model testing
- **Caching**: Results are cached to disk
- **Memory Efficient**: Processes samples in batches
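
A minimal sketch of the batching idea, assuming a fixed batch size (the suite's actual batch size is not documented here):

```typescript
// Sketch only: walk a large sample set in fixed-size slices to bound memory use.
function* inBatches<T>(items: T[], batchSize = 100): Generator<T[]> {
  for (let i = 0; i < items.length; i += batchSize) {
    yield items.slice(i, i + batchSize);
  }
}
```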

## Limitations

- Mock data generator simulates behavior (no actual API calls)
- Quality metrics are approximations based on model characteristics
- Statistical tests use simplified distributions
- Assumes consistent model behavior

## Future Enhancements

- [ ] Real API integration with actual model calls
- [ ] Parallel model testing for faster benchmarks
- [ ] More sophisticated quality assessment
- [ ] Interactive visualization dashboard
- [ ] A/B testing framework
- [ ] Confidence interval calculation
- [ ] Cost prediction modeling
- [ ] Automated model selection

## License

MIT

## Contributing

Contributions welcome! Please ensure:
- TypeScript type safety
- Comprehensive documentation
- Test coverage
- Performance optimization

## Support

For issues or questions:
- GitHub Issues: https://github.com/ruvnet/ruvector/issues
- Documentation: See main project README