# DSPy Multi-Model Benchmark Suite

Comprehensive benchmarking system for comparing multiple language models using real **dspy.ts v2.1.1** features.

## Features

### Real DSPy.ts Components

- ✅ **ChainOfThought** - For reasoning-based synthetic data generation
- ✅ **ReAct** - For iterative data quality validation
- ✅ **BootstrapFewShot** - Learn from successful examples (5 rounds)
- ✅ **MIPROv2** - Bayesian prompt optimization (3 trials)
- ✅ **Real Metrics** - f1Score, exactMatch, bleuScore, rougeScore

### Benchmark Capabilities

1. **Multi-Model Comparison**
   - OpenAI models (GPT-4, GPT-3.5-turbo)
   - Anthropic models (Claude 3 Sonnet, Claude 3 Haiku)
   - Automatic model registration and configuration

2. **Quality Metrics**
   - F1 Score
   - Exact Match
   - BLEU Score
   - ROUGE Score
   - Overall quality score

3. **Performance Metrics**
   - Latency (P50, P95, P99)
   - Throughput (samples/second)
   - Success rate
   - Average latency

4. **Cost Analysis** (see the cost sketch after this list)
   - Total cost tracking
   - Cost per sample
   - Cost per quality point
   - Token usage (input/output)

5. **Optimization Comparison**
   - Baseline quality
   - BootstrapFewShot improvement
   - MIPROv2 improvement
   - Quality progression tracking
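
For intuition on the cost accounting, here is a minimal sketch. The type and helper names are illustrative, not the package's actual API; the rates match the `costPer1kTokens` config used in the `addModel()` examples below.

```typescript
// Rates are per 1k tokens, matching the costPer1kTokens config shown below.
interface TokenUsage {
  input: number;
  output: number;
}

function sampleCost(
  usage: TokenUsage,
  rates: { input: number; output: number }
): number {
  // Per-sample cost: prorate each side's token count against its 1k-token rate.
  return (usage.input / 1000) * rates.input + (usage.output / 1000) * rates.output;
}

// e.g. GPT-4 at $0.03 / $0.06 per 1k tokens, 500 input + 200 output tokens:
sampleCost({ input: 500, output: 200 }, { input: 0.03, output: 0.06 }); // 0.027
```
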
## Installation

```bash
cd /home/user/ruvector/packages/agentic-synth
npm install
```

## Setup

Set your API keys as environment variables:

```bash
export OPENAI_API_KEY="your-openai-key"
export ANTHROPIC_API_KEY="your-anthropic-key"
```

Or create a `.env` file:

```env
OPENAI_API_KEY=your-openai-key
ANTHROPIC_API_KEY=your-anthropic-key
SAMPLE_SIZE=100
```
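
The shell exports above are picked up automatically. For the `.env` route, the entry script needs something to load the file; one common approach is the `dotenv` package. This is an assumption, so check whether the script already handles `.env` loading before adding it:

```typescript
// npm install dotenv
// At the very top of the entry script:
import 'dotenv/config';

// process.env.OPENAI_API_KEY etc. are now populated from .env
```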

## Usage

### Basic Usage

```bash
npx tsx training/dspy-multi-model-benchmark.ts
```

### Custom Sample Size

```bash
SAMPLE_SIZE=1000 npx tsx training/dspy-multi-model-benchmark.ts
```

### Programmatic Usage

```typescript
import { DSPyMultiModelBenchmark } from './training/dspy-multi-model-benchmark';

const benchmark = new DSPyMultiModelBenchmark('./results');

// Add models
benchmark.addModel({
  name: 'GPT-4',
  provider: 'openai',
  modelId: 'gpt-4',
  apiKey: process.env.OPENAI_API_KEY,
  costPer1kTokens: { input: 0.03, output: 0.06 },
  maxTokens: 8192
});

// Run comparison across all registered models (argument is the sample size)
const results = await benchmark.runComparison(1000);

// Generate report
await benchmark.generateReport(results);
```

## Output

The benchmark generates two files:

1. **Markdown Report** (`benchmark-report-TIMESTAMP.md`)
   - Executive summary with winners
   - Detailed metrics for each model
   - Rankings by category
   - Recommendations for different use cases

2. **JSON Results** (`benchmark-results-TIMESTAMP.json`)
   - Complete benchmark data
   - Raw metrics
   - Optimization history
   - Structured for further analysis

### Sample Output Structure

```
training/results/multi-model/
├── benchmark-report-2025-01-22T10-30-45-123Z.md
└── benchmark-results-2025-01-22T10-30-45-123Z.json
```
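
To build further analysis on the JSON results, load and inspect them first. A minimal sketch; the file's internal field names are not documented here, so check the actual output before relying on specific fields:

```typescript
import { readFileSync } from 'node:fs';

// Substitute the real timestamp from your run for TIMESTAMP.
const raw = readFileSync(
  'training/results/multi-model/benchmark-results-TIMESTAMP.json',
  'utf8'
);
const results = JSON.parse(raw);

// Inspect the top-level structure before building analysis on it.
console.log(Object.keys(results));
```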

## Benchmark Workflow

```
┌─────────────────────────────────────────────────────────┐
│ For Each Model                                          │
└─────────────────────────────────────────────────────────┘
                             │
                             ▼
┌─────────────────────────────────────────────────────────┐
│ 1. Baseline Quality                                     │
│    └─ Test with basic ChainOfThought module             │
└─────────────────────────────────────────────────────────┘
                             │
                             ▼
┌─────────────────────────────────────────────────────────┐
│ 2. BootstrapFewShot Optimization                        │
│    └─ 5 rounds of few-shot learning                     │
│    └─ Learn from successful examples                    │
└─────────────────────────────────────────────────────────┘
                             │
                             ▼
┌─────────────────────────────────────────────────────────┐
│ 3. MIPROv2 Optimization                                 │
│    └─ 3 trials of Bayesian optimization                 │
│    └─ Expected Improvement acquisition                  │
└─────────────────────────────────────────────────────────┘
                             │
                             ▼
┌─────────────────────────────────────────────────────────┐
│ 4. Performance Testing                                  │
│    └─ Measure latency (P50, P95, P99)                   │
│    └─ Calculate throughput                              │
└─────────────────────────────────────────────────────────┘
                             │
                             ▼
┌─────────────────────────────────────────────────────────┐
│ 5. Cost Analysis                                        │
│    └─ Track token usage                                 │
│    └─ Calculate cost efficiency                         │
└─────────────────────────────────────────────────────────┘
```

## Metrics Explained

### Quality Metrics

- **F1 Score**: Harmonic mean of precision and recall (see the sketch below)
- **Exact Match**: Percentage of exact matches with the expected output
- **BLEU Score**: Bilingual Evaluation Understudy (text similarity)
- **ROUGE Score**: Recall-Oriented Understudy for Gisting Evaluation
- **Overall**: Weighted average of all quality metrics
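
For intuition, here is a token-level F1 in its standard form. This is a sketch mirroring the definition above; the package's own `f1Score` may differ in tokenization details:

```typescript
// Token-level F1: harmonic mean of precision and recall over overlapping tokens.
function f1Score(predicted: string, expected: string): number {
  const pred = predicted.toLowerCase().split(/\s+/).filter(Boolean);
  const gold = expected.toLowerCase().split(/\s+/).filter(Boolean);
  if (pred.length === 0 || gold.length === 0) return 0;

  // Count gold tokens, then consume them as predicted tokens match.
  const counts = new Map<string, number>();
  for (const t of gold) counts.set(t, (counts.get(t) ?? 0) + 1);
  let overlap = 0;
  for (const t of pred) {
    const c = counts.get(t) ?? 0;
    if (c > 0) {
      overlap += 1;
      counts.set(t, c - 1);
    }
  }
  if (overlap === 0) return 0;

  const precision = overlap / pred.length;
  const recall = overlap / gold.length;
  return (2 * precision * recall) / (precision + recall);
}
```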

### Performance Metrics

- **P50 Latency**: Median response time
- **P95 Latency**: 95th percentile response time
- **P99 Latency**: 99th percentile response time
- **Throughput**: Samples processed per second (see the sketch below)
- **Success Rate**: Percentage of successful generations
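
The latency percentiles and throughput reduce to a few lines. A sketch of the usual definitions (nearest-rank percentile); the benchmark's own implementation may differ:

```typescript
// Nearest-rank percentile over recorded per-sample latencies.
function percentile(latenciesMs: number[], p: number): number {
  const sorted = [...latenciesMs].sort((a, b) => a - b);
  if (sorted.length === 0) return 0;
  const rank = Math.max(1, Math.ceil((p / 100) * sorted.length));
  return sorted[rank - 1];
}

// Throughput: successful samples per second of wall-clock time.
const throughput = (successCount: number, elapsedMs: number) =>
  successCount / (elapsedMs / 1000);

percentile([120, 150, 900, 200, 180], 95); // 900 (P95 of five samples)
```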

### Optimization Metrics

- **Baseline Quality**: Initial quality without optimization
- **Bootstrap Improvement**: Quality gain from BootstrapFewShot
- **MIPRO Improvement**: Quality gain from MIPROv2
- **Improvement %**: Relative improvement over baseline (see below)
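
Concretely, the percentage is computed against the unoptimized baseline. The numbers below are illustrative, chosen to be consistent with the sample run shown later:

```typescript
// Relative improvement over the unoptimized baseline, in percent.
const improvementPct = (optimized: number, baseline: number) =>
  ((optimized - baseline) / baseline) * 100;

improvementPct(0.875, 0.737); // ≈ 18.7, i.e. the "+18.7%" in the example output
```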

## Customization

### Add Custom Models

```typescript
benchmark.addModel({
  name: 'Custom Model',
  provider: 'openrouter',
  modelId: 'model-id',
  apiKey: 'your-key',
  costPer1kTokens: { input: 0.001, output: 0.002 },
  maxTokens: 4096
});
```

### Custom Schema

Modify the schema in `benchmarkModel()`:

```typescript
const schema = {
  id: 'UUID',
  name: 'string (person name)',
  email: 'string (valid email)',
  age: 'number (18-80)',
  // Add your custom fields...
};
```

### Custom Metrics

Implement custom quality scoring:

```typescript
private calculateQualityScore(output: any, expected: any): number {
  // Your custom scoring logic; for example, strict structural equality:
  const score = JSON.stringify(output) === JSON.stringify(expected) ? 1 : 0;
  return score;
}
```

## Performance Tips

1. **Start Small**: Use `SAMPLE_SIZE=10` for quick tests
2. **Increase Gradually**: Scale to 100, 1000, 10000 as needed
3. **Parallel Testing**: Run different models separately
4. **Cost Monitoring**: Check costs before large runs
5. **Rate Limits**: Be aware of API rate limits

## Example Results

```
🔬 DSPy Multi-Model Benchmark Suite
======================================================================
Models: 4
Sample Size: 100
======================================================================

📊 Benchmarking: GPT-4
----------------------------------------------------------------------
→ Running baseline...
→ Optimizing with BootstrapFewShot...
→ Optimizing with MIPROv2...
✓ Quality Score: 0.875
✓ P95 Latency: 1234ms
✓ Cost/Sample: $0.000543
✓ Bootstrap Improvement: +12.3%
✓ MIPRO Improvement: +18.7%

📊 Benchmarking: Claude 3 Sonnet
----------------------------------------------------------------------
→ Running baseline...
→ Optimizing with BootstrapFewShot...
→ Optimizing with MIPROv2...
✓ Quality Score: 0.892
✓ P95 Latency: 987ms
✓ Cost/Sample: $0.000234
✓ Bootstrap Improvement: +14.2%
✓ MIPRO Improvement: +21.5%

======================================================================
✅ Benchmark completed successfully!
📊 Check the results directory for detailed reports.
======================================================================
```

## Troubleshooting

### API Key Issues

```bash
# Check if keys are set
echo $OPENAI_API_KEY
echo $ANTHROPIC_API_KEY

# Set keys temporarily
export OPENAI_API_KEY="sk-..."
export ANTHROPIC_API_KEY="sk-ant-..."
```

### Import Errors

```bash
# Rebuild the package
npm run build

# Check the dspy.ts installation
npm list dspy.ts
```

### Out of Memory

```bash
# Reduce the sample size
SAMPLE_SIZE=10 npx tsx training/dspy-multi-model-benchmark.ts
```

### Rate Limiting

Add delays between requests:

```typescript
// In measurePerformance(): wait 100 ms between consecutive API calls
await new Promise(resolve => setTimeout(resolve, 100));
```

## Architecture

```
DSPyMultiModelBenchmark
├── Model Management
│   ├── OpenAILM (GPT-4, GPT-3.5)
│   ├── AnthropicLM (Claude 3)
│   └── Token tracking
│
├── DSPy Modules
│   ├── SyntheticDataModule (ChainOfThought)
│   └── DataQualityModule (ReAct)
│
├── Optimizers
│   ├── BootstrapFewShot (5 rounds)
│   └── MIPROv2 (3 trials, Bayesian)
│
├── Metrics
│   ├── Quality (F1, EM, BLEU, ROUGE)
│   ├── Performance (latency, throughput)
│   └── Cost (tokens, efficiency)
│
└── Reporting
    ├── Markdown reports
    └── JSON results
```

## Contributing

To add new features:

1. Extend `ModelConfig` for new providers (its apparent shape is sketched after this list)
2. Implement new LM classes
3. Add custom DSPy modules
4. Enhance quality metrics
5. Extend reporting formats
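
For reference, the apparent shape of `ModelConfig`, inferred from the `addModel()` calls in this README; a sketch, with the authoritative definition in the source:

```typescript
// Inferred from usage above; verify against the actual type in the source.
interface ModelConfig {
  name: string;
  provider: string; // e.g. 'openai', 'anthropic', 'openrouter'
  modelId: string;
  apiKey?: string;
  costPer1kTokens: { input: number; output: number };
  maxTokens: number;
}
```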

## License

MIT - Same as dspy.ts and agentic-synth

## References

- [dspy.ts Documentation](https://github.com/ruvnet/dspy.ts)
- [DSPy Paper](https://arxiv.org/abs/2310.03714)
- [MIPROv2 Paper](https://arxiv.org/abs/2406.11695)

---

**Built with dspy.ts v2.1.1** - Declarative AI framework for TypeScript