
DSPy Benchmark Comparison Framework

A comprehensive benchmarking suite for comparing multiple models across quality, performance, cost, learning, and diversity metrics.

Features

🎯 Core Capabilities

  1. Multi-Model Comparison

    • Compare any number of models side by side
    • Statistical significance testing
    • Pareto frontier analysis
    • Weighted scoring across dimensions
  2. Scalability Testing

    • Test from 100 to 100,000 samples
    • Measure latency, throughput, cost at scale
    • Calculate scaling efficiency
    • Identify performance bottlenecks
  3. Cost Analysis

    • Track total cost per run
    • Calculate cost per sample
    • Compute cost per quality point
    • Efficiency rankings
  4. Quality Convergence

    • Measure learning rates
    • Track improvement over generations
    • Identify plateau points
    • Convergence speed analysis
  5. Diversity Analysis

    • Unique value counting
    • Pattern variety measurement
    • Shannon entropy calculation
    • Coverage scoring

📊 Metrics Collected

Quality Metrics

  • Accuracy: Correctness of generated data
  • Coherence: Logical consistency and flow
  • Validity: Adherence to schema and constraints
  • Consistency: Uniformity across samples
  • Completeness: Coverage of all required fields
  • Overall: Weighted average of all quality metrics

Performance Metrics

  • Latency P50/P95/P99: Response time percentiles
  • Average Latency: Mean response time
  • Min/Max Latency: Range of response times
  • Throughput: Samples generated per second
  • Success Rate: Percentage of successful generations
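
The latency percentiles above can be computed from the raw per-sample latencies. A minimal sketch using the nearest-rank method; the suite's StatisticalAnalyzer may use a different interpolation:

// Sketch: nearest-rank percentile over raw latency samples (in ms).
function percentile(values: number[], p: number): number {
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length) - 1;
  return sorted[Math.max(0, rank)];
}

const latencies = [820, 940, 1010, 1180, 1589, 2050]; // example measurements
const p50 = percentile(latencies, 50);
const p95 = percentile(latencies, 95);
const p99 = percentile(latencies, 99);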

Cost Metrics

  • Total Cost: Total expenditure for test run
  • Cost per Sample: Average cost per generated sample
  • Cost per Quality Point: Cost normalized by quality
  • Tokens Used: Total tokens consumed
  • Efficiency: Quality per unit cost
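
One plausible reading of the derived cost figures, assuming per-run totals are already collected (the suite's exact normalization may differ):

// Sketch: derived cost metrics from per-run totals (values are illustrative).
const totalCost = 4.5;        // USD spent for the run
const samples = 1000;         // samples generated
const overallQuality = 0.87;  // overall quality score in [0, 1]

const costPerSample = totalCost / samples;                   // $0.0045
const costPerQualityPoint = costPerSample / overallQuality;  // cost normalized by quality
const efficiency = overallQuality / costPerSample;           // quality per unit cost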

Learning Metrics

  • Improvement Rate: Quality gain per generation
  • Convergence Speed: Generations until plateau
  • Learning Curve: Quality progression over time
  • Plateau Generation: When learning stabilizes
  • Final Quality: Ultimate quality achieved
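
A rough sketch of how the improvement rate and plateau generation could be derived from a quality-per-generation curve; the suite's own smoothing and plateau threshold may differ:

// Sketch: derive learning metrics from a quality curve (one value per generation).
const learningCurve = [0.62, 0.70, 0.76, 0.80, 0.82, 0.83, 0.83, 0.84];

const finalQuality = learningCurve[learningCurve.length - 1];
const improvementRate =
  (finalQuality - learningCurve[0]) / (learningCurve.length - 1); // quality gain per generation

// Plateau: first generation where the per-step gain drops below a small epsilon.
const epsilon = 0.005;
const plateauGeneration = learningCurve.findIndex(
  (q, i) => i > 0 && q - learningCurve[i - 1] < epsilon,
);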

Diversity Metrics

  • Unique Values: Number of distinct samples
  • Pattern Variety: Ratio of unique to total samples
  • Distribution Entropy: Shannon entropy of data
  • Coverage Score: Field-level diversity measure
  • Novelty Rate: Rate of new pattern generation
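
The unique-value, pattern-variety, and entropy figures above are standard calculations. A minimal, self-contained sketch over a batch of generated samples (independent of the suite's internal implementation):

// Sketch: unique values, pattern variety, and Shannon entropy for a batch of samples.
function diversity(samples: string[]) {
  const counts = new Map<string, number>();
  for (const s of samples) counts.set(s, (counts.get(s) ?? 0) + 1);

  const uniqueValues = counts.size;
  const patternVariety = uniqueValues / samples.length; // ratio of unique to total

  // Shannon entropy (in bits) of the sample distribution.
  let entropy = 0;
  for (const count of counts.values()) {
    const p = count / samples.length;
    entropy -= p * Math.log2(p);
  }

  return { uniqueValues, patternVariety, entropy };
}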

Usage

Quick Start

import { BenchmarkSuite } from './dspy-benchmarks.js';

const suite = new BenchmarkSuite();

// Add common models
suite.addCommonModels();

// Run comprehensive comparison
const comparison = await suite.runModelComparison(1000);

// Generate reports
await suite.generateJSONReport(comparison);
await suite.generateMarkdownReport(comparison);

Custom Models

import { BenchmarkSuite, ModelConfig } from './dspy-benchmarks.js';

const suite = new BenchmarkSuite();

// Add custom model
const customModel: ModelConfig = {
  name: 'My Custom Model',
  provider: 'openrouter',
  model: 'my-model',
  costPer1kTokens: 0.002,
  maxTokens: 8192,
  apiKey: process.env.API_KEY, // Optional
};

suite.addModel(customModel);

// Run benchmarks
const comparison = await suite.runModelComparison(1000);

Running from CLI

# Full benchmark suite
npx tsx training/run-benchmarks.ts full

# Quick comparison (3 models, 500 samples)
npx tsx training/run-benchmarks.ts quick

# Scalability test only
npx tsx training/run-benchmarks.ts scalability

# Cost analysis only
npx tsx training/run-benchmarks.ts cost

API Reference

BenchmarkSuite Class

Constructor

constructor(outputDir?: string)

Creates a new benchmark suite instance.

  • outputDir: Optional output directory (default: ./training/results/benchmarks)
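
For example, to write results somewhere other than the default directory:

import { BenchmarkSuite } from './dspy-benchmarks.js';

const suite = new BenchmarkSuite('./my-results/benchmarks'); // custom output directory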

Methods

addModel(config: ModelConfig)

Add a model to the benchmark suite.

suite.addModel({
  name: 'GPT-4',
  provider: 'openai',
  model: 'gpt-4',
  costPer1kTokens: 0.03,
  maxTokens: 8192,
});

addCommonModels()

Add 6 pre-configured common models for quick testing:

  • GPT-4
  • Claude 3.5 Sonnet
  • Gemini Pro
  • GPT-3.5 Turbo
  • Llama 3 70B
  • Mixtral 8x7B

suite.addCommonModels();

runModelComparison(sampleSize?: number): Promise<ComparisonResult>

Run comprehensive comparison across all models.

const comparison = await suite.runModelComparison(1000);

Returns: ComparisonResult with winners, statistical significance, Pareto frontier, and recommendations.

runScalabilityTest(): Promise<ScalabilityResult[]>

Test scalability from 100 to 100K samples.

const results = await suite.runScalabilityTest();

Tests: 100, 500, 1K, 5K, 10K, 50K, 100K samples

runCostAnalysis(): Promise

Analyze cost-effectiveness across models.

await suite.runCostAnalysis();

Outputs: Cost rankings, efficiency scores, cost/quality trade-offs

runQualityConvergence(generations?: number): Promise

Measure learning rates and quality convergence.

await suite.runQualityConvergence(10);

Default: 10 generations

runDiversityAnalysis(sampleSize?: number): Promise

Analyze data diversity and variety.

await suite.runDiversityAnalysis(5000);

Default: 5000 samples

generateJSONReport(comparison: ComparisonResult): Promise

Generate comprehensive JSON report.

await suite.generateJSONReport(comparison);

Output: benchmark-comparison.json

generateMarkdownReport(comparison: ComparisonResult): Promise

Generate human-readable Markdown report.

await suite.generateMarkdownReport(comparison);

Output: BENCHMARK_REPORT.md

Output Files

JSON Reports

benchmark-comparison.json

Complete benchmark results including:

  • Metadata and timestamps
  • Comparison results
  • All model results
  • Statistical summaries

scalability-results.json

Scalability test results including:

  • Latencies at each scale
  • Throughput measurements
  • Cost progression
  • Scaling efficiency

convergence-data.json

Learning convergence data including:

  • Quality curves
  • Improvement rates
  • Plateau generations

Markdown Reports

BENCHMARK_REPORT.md

Comprehensive human-readable report including:

  • Executive summary
  • Detailed results per model
  • Comparative tables
  • Pareto frontier analysis
  • Use case recommendations
  • Statistical significance
  • Methodology explanation
  • Conclusions

Use Case Recommendations

The benchmark suite automatically recommends models for different scenarios:

High-Quality, Low-Volume (Research)

Best for research, high-stakes decisions, and scenarios where quality is paramount.

Optimizes for: Maximum quality, learning capability

High-Volume, Low-Latency (Production)

Best for production systems requiring high throughput and low latency.

Optimizes for: Throughput, low latency, success rate

Cost-Optimized (Batch Processing)

Best for batch processing, large-scale data generation, and cost-sensitive applications.

Optimizes for: Lowest cost per sample, efficiency

Balanced (General Purpose)

Best for general-purpose applications requiring a good balance of quality, performance, and cost.

Optimizes for: Weighted score across all metrics

Statistical Analysis

T-Test for Significance

The suite performs t-tests to determine if quality differences between models are statistically significant:

  • p < 0.01: Highly significant difference
  • p < 0.05: Significant difference
  • p ≥ 0.05: No significant difference
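
For reference, a self-contained sketch of a Welch's t-test of the kind involved here; the suite's StatisticalAnalyzer may implement this differently, and turning the statistic into an exact p-value additionally requires a t-distribution CDF (e.g. from a stats library):

// Sketch: Welch's t-statistic for two sets of per-sample quality scores.
function welchT(a: number[], b: number[]) {
  const mean = (xs: number[]) => xs.reduce((s, x) => s + x, 0) / xs.length;
  const variance = (xs: number[]) => {
    const m = mean(xs);
    return xs.reduce((s, x) => s + (x - m) ** 2, 0) / (xs.length - 1);
  };

  const va = variance(a) / a.length;
  const vb = variance(b) / b.length;
  const t = (mean(a) - mean(b)) / Math.sqrt(va + vb);

  // Welch-Satterthwaite degrees of freedom.
  const df = (va + vb) ** 2 / (va ** 2 / (a.length - 1) + vb ** 2 / (b.length - 1));

  return { t, df }; // p-value follows from the t-distribution CDF with df degrees of freedom
}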

Pareto Frontier

Identifies models with optimal quality/cost trade-offs. A model is on the Pareto frontier if no other model is better in both quality AND cost.
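
In code, that dominance check can be expressed roughly as follows (a sketch with illustrative types; the suite's own implementation may handle ties differently):

// Sketch: a model is Pareto-optimal if no other model has quality >= its quality
// AND cost <= its cost, with at least one of the two strictly better.
interface Point { name: string; quality: number; costPerSample: number; }

function paretoFrontier(models: Point[]): Point[] {
  return models.filter((m) =>
    !models.some(
      (o) =>
        o !== m &&
        o.quality >= m.quality &&
        o.costPerSample <= m.costPerSample &&
        (o.quality > m.quality || o.costPerSample < m.costPerSample),
    ),
  );
}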

Mock Data Generation

The framework includes a sophisticated mock data generator for demonstration purposes:

  • Realistic Latencies: Based on actual model characteristics
  • Learning Simulation: Quality improves over generations
  • Quality Differentiation: Different models have different base qualities
  • Schema Support: Handles various field types (UUID, email, name, numbers, etc.)

Example Output

🔬 Running Model Comparison (1000 samples)
======================================================================

Testing GPT-4...
  Quality: 0.872
  Latency P95: 1589ms
  Cost/Sample: $0.004500
  Diversity: 0.843

Testing Claude 3.5 Sonnet...
  Quality: 0.891
  Latency P95: 1267ms
  Cost/Sample: $0.002250
  Diversity: 0.867

...

✅ All benchmarks completed!

📊 Key Findings:
   Overall Winner: Claude 3.5 Sonnet
   Best Quality: Claude 3.5 Sonnet
   Best Performance: Mixtral 8x7B
   Most Cost-Effective: Gemini Pro
   Pareto Frontier: Claude 3.5 Sonnet, Gemini Pro, Mixtral 8x7B

💡 Recommendations by Use Case:
   high-quality-low-volume: Claude 3.5 Sonnet
   high-volume-low-latency: Mixtral 8x7B
   cost-optimized: Gemini Pro
   balanced: Claude 3.5 Sonnet
   research: Claude 3.5 Sonnet
   production: Claude 3.5 Sonnet

Advanced Features

Custom Weighting

You can modify the overall winner calculation by adjusting weights in the compareResults() method:

const score =
  quality * 0.3 +           // 30% quality
  performance * 0.2 +       // 20% performance
  (1/cost) * 0.2 +         // 20% cost
  learning * 0.15 +        // 15% learning
  diversity * 0.15;        // 15% diversity

Statistical Utilities

The StatisticalAnalyzer class provides utilities for:

  • Mean and standard deviation
  • Percentile calculation
  • T-test for significance
  • Shannon entropy
  • Distribution analysis

Extensibility

Easily extend the framework:

  1. Add new metrics: Extend metric interfaces
  2. Add new models: Implement ModelConfig
  3. Add new tests: Add methods to BenchmarkSuite
  4. Custom analysis: Use StatisticalAnalyzer utilities

Performance Considerations

  • Mock Mode: Runs without API calls for testing
  • Parallel Testing: Could be extended for concurrent model testing
  • Caching: Results are cached to disk
  • Memory Efficient: Processes samples in batches

Limitations

  • Mock data generator simulates behavior (no actual API calls)
  • Quality metrics are approximations based on model characteristics
  • Statistical tests use simplified distributions
  • Assumes consistent model behavior

Future Enhancements

  • Real API integration with actual model calls
  • Parallel model testing for faster benchmarks
  • More sophisticated quality assessment
  • Interactive visualization dashboard
  • A/B testing framework
  • Confidence interval calculation
  • Cost prediction modeling
  • Automated model selection

License

MIT

Contributing

Contributions welcome! Please ensure:

  • TypeScript type safety
  • Comprehensive documentation
  • Test coverage
  • Performance optimization

Support

For issues or questions: