Squashed 'vendor/ruvector/' content from commit b64c2172

git-subtree-dir: vendor/ruvector
git-subtree-split: b64c21726f2bb37286d9ee36a7869fef60cc6900
2026-02-28 14:39:40 -05:00
commit d803bfe2b1
7854 changed files with 3522914 additions and 0 deletions
--- a/npm/packages/agentic-synth/training/DSPY_INTEGRATION_README.md
+++ b/npm/packages/agentic-synth/training/DSPY_INTEGRATION_README.md
@@ -0,0 +1,563 @@
+# DSPy.ts Real Integration with Agentic-Synth
+
+Production-ready integration of [dspy.ts](https://github.com/dzhng/dspy.ts) v2.1.1 with agentic-synth. The trainer generates synthetic data, scores its quality, and uses the best results to improve later generations automatically.
+
+## Features
+
+- ✅ **Real dspy.ts Integration** - Uses the actual dspy.ts npm package (v2.1.1)
+- ✅ **ChainOfThought Reasoning** - Advanced reasoning for data quality assessment
+- ✅ **BootstrapFewShot Optimization** - Automatic learning from successful generations
+- ✅ **Multi-Model Support** - OpenAI GPT models and Anthropic Claude
+- ✅ **Quality Metrics** - Real-time evaluation using dspy.ts metrics
+- ✅ **Convergence Detection** - Automatically stops when the quality threshold is met
+- ✅ **Event-Driven Architecture** - Hooks for monitoring and coordination
+- ✅ **Production Ready** - Full TypeScript types and error handling
+
+## Architecture
+
+```
+DSPyAgenticSynthTrainer
+├── Language Models (OpenAILM, AnthropicLM)
+├── ChainOfThought Module (Quality reasoning)
+├── BootstrapFewShot Optimizer (Learning)
+└── Quality Evaluator (Metrics)
+```
+
+## Installation
+
+```bash
+# Already installed in agentic-synth
+cd packages/agentic-synth
+npm install  # dspy.ts@2.1.1 is included
+```
+
+## Environment Setup
+
+```bash
+# Required for OpenAI models
+export OPENAI_API_KEY="sk-..."
+
+# Required only if you use Claude models
+export ANTHROPIC_API_KEY="sk-ant-..."
+```
+
+## Usage
+
+### Basic Example
+
+```typescript
+import { DSPyAgenticSynthTrainer } from './training/dspy-real-integration.js';
+
+// Define your data schema
+const schema = {
+  type: 'object',
+  properties: {
+    userId: { type: 'string' },
+    name: { type: 'string' },
+    email: { type: 'string', format: 'email' },
+    age: { type: 'number' }
+  }
+};
+
+// Provide initial training examples
+const examples = [
+  {
+    input: JSON.stringify(schema),
+    output: JSON.stringify({
+      userId: '123',
+      name: 'Alice',
+      email: 'alice@example.com',
+      age: 28
+    }),
+    quality: 0.9
+  }
+];
+
+// Configure trainer
+const trainer = new DSPyAgenticSynthTrainer({
+  models: ['gpt-3.5-turbo'],
+  optimizationRounds: 5,
+  minQualityScore: 0.8,
+  batchSize: 10
+});
+
+// Initialize and train
+await trainer.initialize();
+const result = await trainer.trainWithOptimization(schema, examples);
+
+// Generate optimized data
+const data = await trainer.generateOptimizedData(100, schema);
+```
+
+### Advanced Configuration
+
+```typescript
+const trainer = new DSPyAgenticSynthTrainer({
+  // Models to use for training
+  models: [
+    'gpt-3.5-turbo',
+    'gpt-4',
+    'claude-3-sonnet-20240229'
+  ],
+
+  // Training parameters
+  optimizationRounds: 10,
+  minQualityScore: 0.85,
+  maxExamples: 100,
+  batchSize: 20,
+
+  // Evaluation metrics
+  evaluationMetrics: ['accuracy', 'coherence', 'relevance', 'diversity'],
+
+  // Performance options
+  enableCaching: true,
+
+  // Event hooks
+  hooks: {
+    onIterationComplete: (iteration, metrics) => {
+      console.log(`Iteration ${iteration}: ${metrics.overallScore}`);
+    },
+    onOptimizationComplete: (result) => {
+      console.log(`Improvement: ${result.improvements.improvement}%`);
+    },
+    onError: (error) => {
+      console.error('Training error:', error);
+    }
+  }
+});
+```
+
+### Event Monitoring
+
+```typescript
+// Listen to training events
+trainer.on('status', (message) => {
+  console.log('Status:', message);
+});
+
+trainer.on('progress', ({ current, total }) => {
+  console.log(`Progress: ${current}/${total}`);
+});
+
+trainer.on('complete', (result) => {
+  console.log('Training complete:', result);
+});
+
+trainer.on('error', (error) => {
+  console.error('Error:', error);
+});
+```
+
+## API Reference
+
+### DSPyAgenticSynthTrainer
+
+Main class for training and generating optimized synthetic data.
+
+#### Constructor
+
+```typescript
+constructor(config: DSPyTrainerConfig)
+```
+
+#### Methods
+
+##### `initialize(): Promise<void>`
+
+Initialize dspy.ts language models and modules. Must be called before training.
+
+##### `trainWithOptimization(schema, examples): Promise<TrainingResult>`
+
+Train the model with automatic optimization using BootstrapFewShot.
+
+**Parameters:**
+- `schema`: JSON schema describing the data structure
+- `examples`: Array of training examples with quality scores
+
+**Returns:** Training result with metrics and improvements
+
+##### `generateOptimizedData(count, schema?): Promise<any[]>`
+
+Generate optimized synthetic data using trained models.
+
+**Parameters:**
+- `count`: Number of samples to generate
+- `schema`: Optional schema for generation
+
+**Returns:** Array of generated data samples
+
+##### `evaluateQuality(data): Promise<QualityMetrics>`
+
+Evaluate the quality of generated data.
+
+**Parameters:**
+- `data`: Array of data samples to evaluate
+
+**Returns:** Quality metrics including accuracy, coherence, relevance, diversity
+
+##### `getStatistics()`
+
+Get training statistics.
+
+**Returns:**
+```typescript
+{
+  totalIterations: number;
+  bestScore: number;
+  trainingExamples: number;
+}
+```
+
+### Configuration Types
+
+#### DSPyTrainerConfig
+
+```typescript
+{
+  models: string[];              // Model names to use
+  optimizationRounds?: number;   // Number of optimization rounds (default: 5)
+  minQualityScore?: number;      // Minimum quality threshold (default: 0.8)
+  maxExamples?: number;          // Max training examples (default: 50)
+  batchSize?: number;            // Generation batch size (default: 10)
+  evaluationMetrics?: string[];  // Metrics to track
+  enableCaching?: boolean;       // Enable result caching
+  hooks?: {                      // Event callbacks
+    onIterationComplete?: (iteration, metrics) => void;
+    onOptimizationComplete?: (result) => void;
+    onError?: (error) => void;
+  };
+}
+```
+
+#### TrainingResult
+
+```typescript
+{
+  success: boolean;
+  iterations: IterationMetrics[];
+  bestIteration: IterationMetrics;
+  optimizedPrompt: string;
+  improvements: {
+    initialScore: number;
+    finalScore: number;
+    improvement: number;  // percentage gain over initialScore
+  };
+  metadata: {
+    totalDuration: number;
+    modelsUsed: string[];
+    totalGenerated: number;
+    convergenceIteration?: number;
+  };
+}
+```
+
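As a rough illustration of how the `improvements` fields relate, the sketch below derives a percentage from `initialScore` and `finalScore`. The relative-gain formula is an assumption inferred from the sample run in this README (0.753 to 0.867, reported as 15.1%), and `relativeImprovement` is a hypothetical helper, not part of the trainer's API.

```typescript
// Hypothetical helper: percentage gain of finalScore over initialScore.
// Assumes a relative-gain formula; the trainer's own computation may differ.
function relativeImprovement(initialScore: number, finalScore: number): number {
  if (initialScore <= 0) {
    // A relative gain is undefined without a positive baseline
    return 0;
  }
  return ((finalScore - initialScore) / initialScore) * 100;
}

// Scores taken from the sample run shown under "Running the Example"
const improvement = relativeImprovement(0.753, 0.867);
console.log(`Improvement: ${improvement.toFixed(1)}%`); // Improvement: 15.1%
```

Returning 0 for a non-positive baseline is a design choice of this sketch; throwing an error would be equally reasonable.
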
+#### QualityMetrics
+
+```typescript
+{
+  accuracy: number;      // 0-1
+  coherence: number;     // 0-1
+  relevance: number;     // 0-1
+  diversity: number;     // 0-1
+  overallScore: number;  // 0-1
+  timestamp: Date;
+}
+```
+
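The README does not say how `overallScore` is derived from the four component metrics. One plausible aggregation, shown here purely as an assumption, is an unweighted mean over whichever metrics are selected. `MetricScores` and `overallScore` are hypothetical names for this sketch.

```typescript
// Mirrors the numeric fields of QualityMetrics documented above (timestamp omitted).
interface MetricScores {
  accuracy: number;
  coherence: number;
  relevance: number;
  diversity: number;
}

type MetricName = keyof MetricScores;

// Hypothetical aggregation: unweighted mean of the selected metrics.
// The real trainer may weight or combine metrics differently.
function overallScore(
  scores: MetricScores,
  selected: MetricName[] = ['accuracy', 'coherence', 'relevance', 'diversity']
): number {
  if (selected.length === 0) return 0;
  const total = selected.reduce((sum, name) => sum + scores[name], 0);
  return total / selected.length;
}

const sample: MetricScores = { accuracy: 1, coherence: 0.75, relevance: 0.5, diversity: 0.25 };
console.log(overallScore(sample));                            // 0.625
console.log(overallScore(sample, ['accuracy', 'coherence'])); // 0.875
```
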
+## Running the Example
+
+```bash
+# Set API key
+export OPENAI_API_KEY="sk-..."
+
+# Run the built-in example
+cd packages/agentic-synth
+npx tsx training/dspy-real-integration.ts
+```
+
+Expected output:
+```
+🚀 Starting DSPy.ts Agentic-Synth Integration Example
+
+📊 Initializing DSPy.ts language models...
+📊 Initialized OpenAI model: gpt-3.5-turbo
+📊 DSPy.ts initialization complete
+
+📊 Starting training with optimization...
+📊 Phase 1: Baseline generation
+✓ Iteration 1: Score = 0.753
+
+📊 Phase 2: Running optimization rounds
+✓ Iteration 2: Score = 0.812
+✓ Iteration 3: Score = 0.845
+✓ Iteration 4: Score = 0.867
+
+✅ Optimization complete!
+Improvement: 15.1%
+
+============================================================
+TRAINING RESULTS
+============================================================
+Success: true
+Total Iterations: 4
+Best Model: gpt-3.5-turbo
+Best Score: 0.867
+Improvement: 15.1%
+Total Duration: 12.34s
+Total Generated: 20 samples
+```
+
+## Integration with Agentic-Synth
+
+### Extending BaseGenerator
+
+```typescript
+import { BaseGenerator } from '../src/generators/base.js';
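+import { DSPyAgenticSynthTrainer } from './dspy-real-integration.js';
+// SynthConfig and GeneratorOptions are agentic-synth types; import them from
+// the module that exports them in your checkout.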
+
+class OptimizedGenerator extends BaseGenerator {
+  private trainer: DSPyAgenticSynthTrainer;
+
+  constructor(config: SynthConfig) {
+    super(config);
+    this.trainer = new DSPyAgenticSynthTrainer({
+      models: ['gpt-3.5-turbo'],
+      minQualityScore: 0.8
+    });
+  }
+
+  async generateWithOptimization(options: GeneratorOptions) {
+    await this.trainer.initialize();
+
+    // Resolve the schema once. JSON.stringify(undefined) returns undefined,
+    // not a string, so the fallback must apply before serializing.
+    const schema = options.schema ?? {};
+
+    // Use existing generation as training examples
+    const initial = await this.generate(options);
+    const examples = initial.data.map(item => ({
+      input: JSON.stringify(schema),
+      output: JSON.stringify(item),
+      quality: 0.7  // Placeholder score for unrated baseline output
+    }));
+
+    // Train and optimize
+    await this.trainer.trainWithOptimization(schema, examples);
+
+    // Generate optimized data
+    return this.trainer.generateOptimizedData(
+      options.count || 10,
+      options.schema
+    );
+  }
+}
+```
+
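+## How It Works
+
+The trainer's log output numbers only the two training stages: baseline generation is Phase 1 and the optimization rounds are Phase 2. The four phases below also count initialization and production generation.
+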
+### Phase 1: Initialization
+1. Initialize OpenAI/Anthropic language models via dspy.ts
+2. Configure ChainOfThought module for reasoning
+3. Set up BootstrapFewShot optimizer
+
+### Phase 2: Baseline Generation
+1. Generate initial data with each configured model
+2. Evaluate quality using dspy.ts metrics
+3. Collect successful examples (above threshold)
+
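The collection step in the baseline phase can be sketched as a filter over scored examples. This is a minimal illustration, not the trainer's code: `selectSuccessfulExamples` is a hypothetical helper, treating "above threshold" as "at or above" is an assumption, and the defaults reuse the documented `minQualityScore` and `maxExamples` defaults.

```typescript
interface TrainingExample {
  input: string;
  output: string;
  quality: number; // 0-1, as in the training examples shown earlier
}

// Hypothetical helper: keep examples at or above the threshold,
// best first, capped at maxExamples.
function selectSuccessfulExamples(
  candidates: TrainingExample[],
  minQualityScore = 0.8,
  maxExamples = 50
): TrainingExample[] {
  return candidates
    .filter(example => example.quality >= minQualityScore) // filter() copies, so the input is not mutated
    .sort((a, b) => b.quality - a.quality)
    .slice(0, maxExamples);
}

const pool: TrainingExample[] = [
  { input: '{}', output: '{"age":28}', quality: 0.9 },
  { input: '{}', output: '{"age":-1}', quality: 0.4 },
  { input: '{}', output: '{"age":35}', quality: 0.85 }
];

const kept = selectSuccessfulExamples(pool);
console.log(kept.map(example => example.quality)); // [ 0.9, 0.85 ]
```
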
+### Phase 3: Optimization Rounds
+1. Train BootstrapFewShot with successful examples
+2. Compile optimized program with learned prompts
+3. Generate new data with optimized program
+4. Evaluate and update training set
+5. Repeat until convergence or max rounds
+
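The "repeat until convergence or max rounds" step can be sketched as a bounded loop. The stopping rule below (score at or above the threshold) is an assumption, and the trainer's actual rule may differ. `optimizeUntilConverged` and `runRound` are hypothetical names, with `runRound` standing in for the compile, generate, and evaluate steps. The sketch is synchronous for brevity, whereas a real round would await LLM calls.

```typescript
interface RoundResult {
  round: number;
  score: number;
}

// Hypothetical loop: run rounds until the score reaches the threshold
// or the round budget is spent.
function optimizeUntilConverged(
  runRound: (round: number) => number,
  minQualityScore = 0.8,
  optimizationRounds = 5
): { history: RoundResult[]; convergenceIteration?: number } {
  const history: RoundResult[] = [];
  for (let round = 1; round <= optimizationRounds; round++) {
    const score = runRound(round);
    history.push({ round, score });
    if (score >= minQualityScore) {
      return { history, convergenceIteration: round };
    }
  }
  return { history }; // budget exhausted without reaching the threshold
}

// Stub scorer: scores borrowed from this README's sample run, used only as test data
const sampleScores = [0.753, 0.812, 0.845, 0.867];
const run = optimizeUntilConverged(round => sampleScores[round - 1] ?? 0, 0.85);
console.log(run.convergenceIteration); // 4
```

Leaving `convergenceIteration` undefined when the budget runs out matches the optional field of the same name in `TrainingResult`.
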
+### Phase 4: Production Generation
+1. Use optimized program for data generation
+2. Batch processing for efficiency
+3. Real-time quality monitoring
+4. Return high-quality synthetic data
+
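Batch processing in the production phase amounts to splitting the requested sample count into chunks of at most `batchSize`. The helper below is a hypothetical sketch of that arithmetic and makes no claim about how the trainer schedules its batches.

```typescript
// Hypothetical helper: split a requested sample count into batch sizes.
function planBatches(count: number, batchSize = 10): number[] {
  if (!(batchSize >= 1)) {
    // Also rejects NaN, which would otherwise produce a bogus batch
    throw new RangeError('batchSize must be at least 1');
  }
  const batches: number[] = [];
  for (let remaining = count; remaining > 0; remaining -= batchSize) {
    batches.push(Math.min(batchSize, remaining));
  }
  return batches;
}

console.log(planBatches(100, 20)); // [ 20, 20, 20, 20, 20 ]
console.log(planBatches(25));      // [ 10, 10, 5 ]
```
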
+## DSPy.ts Features Used
+
+### Modules and Optimizers
+- `ChainOfThought` - Multi-step reasoning for quality assessment
+- `BootstrapFewShot` - Automatic few-shot learning optimizer
+
+### Language Models
+- `OpenAILM` - GPT-3.5, GPT-4 support
+- `AnthropicLM` - Claude models support
+- `configureLM()` - Switch between models
+
+### Evaluation
+- `evaluate()` - Batch evaluation of examples
+- `exactMatch()` - Exact string matching metric
+- `f1Score()` - F1 score calculation
+
+### Optimization
+- Automatic prompt optimization
+- Few-shot example selection
+- Quality-driven learning
+
+## Performance
+
+### Benchmarks
+
+- **Initial Quality**: ~0.70-0.75
+- **Optimized Quality**: ~0.85-0.90
+- **Improvement**: 15-25% relative to the initial score
+- **Convergence**: typically 3-5 rounds
+- **Speed**: ~2-5s per iteration (GPT-3.5)
+
+### Optimization
+
+- Caching enabled by default
+- Batch processing for efficiency
+- Parallel model evaluation
+- Convergence detection to avoid unnecessary rounds
+
+## Best Practices
+
+### 1. Provide Quality Examples
+
+```typescript
+const examples = [
+  {
+    input: JSON.stringify(schema),
+    output: JSON.stringify(highQualityData),
+    quality: 0.9  // High quality score
+  }
+];
+```
+
+### 2. Start with Baseline Models
+
+```typescript
+// Start simple, then add advanced models
+models: [
+  'gpt-3.5-turbo',    // Fast baseline
+  'gpt-4'             // High quality
+]
+```
+
+### 3. Monitor Progress
+
+```typescript
+hooks: {
+  onIterationComplete: (iteration, metrics) => {
+    // Track progress
+    if (metrics.overallScore > 0.9) {
+      console.log('Excellent quality achieved!');
+    }
+  }
+}
+```
+
+### 4. Set Realistic Thresholds
+
+```typescript
+{
+  minQualityScore: 0.8,  // Achievable target
+  optimizationRounds: 5   // Balance quality vs. cost
+}
+```
+
+## Troubleshooting
+
+### API Key Issues
+
+```
+Error: OPENAI_API_KEY not set
+```
+
+**Solution:** Set the environment variable:
+```bash
+export OPENAI_API_KEY="sk-..."
+```
+
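A fail-fast check before calling `initialize()` can surface missing keys earlier. The sketch below is hypothetical: `missingApiKeys` is not part of the trainer, and the mapping from model-name prefix to environment variable is an assumption inferred from the model names used in this README.

```typescript
type Env = { [name: string]: string | undefined };

// Assumed mapping from model-name prefix to the environment variable it needs.
const KEY_BY_PREFIX: Array<[string, string]> = [
  ['gpt-', 'OPENAI_API_KEY'],
  ['claude-', 'ANTHROPIC_API_KEY']
];

// Hypothetical helper: list the environment variables that the configured
// models need but that are unset or empty.
function missingApiKeys(models: string[], env: Env): string[] {
  const missing: string[] = [];
  for (const model of models) {
    for (const [prefix, envVar] of KEY_BY_PREFIX) {
      const needed = model.indexOf(prefix) === 0; // model name starts with prefix
      if (needed && !env[envVar] && missing.indexOf(envVar) === -1) {
        missing.push(envVar);
      }
    }
  }
  return missing;
}

const unsetKeys = missingApiKeys(
  ['gpt-3.5-turbo', 'claude-3-sonnet-20240229'],
  { OPENAI_API_KEY: 'sk-...' }
);
console.log(unsetKeys); // [ 'ANTHROPIC_API_KEY' ]
```

In a Node.js process, pass `process.env` as the second argument.
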
+### Low Quality Scores
+
+**Solutions:**
+- Provide better training examples
+- Increase the number of optimization rounds
+- Lower the quality threshold initially
+- Try different models
+
+### Slow Performance
+
+**Solutions:**
+- Reduce the batch size
+- Enable caching
+- Use faster models (gpt-3.5-turbo)
+- Reduce the number of optimization rounds
+
+### Module Import Errors
+
+**Solution:**
+```bash
+# Ensure dependencies are installed
+npm install
+
+# Check dspy.ts version
+npm list dspy.ts
+```
+
+## Example Schemas
+
+### User Profile
+```typescript
+{
+  type: 'object',
+  properties: {
+    userId: { type: 'string' },
+    name: { type: 'string' },
+    email: { type: 'string', format: 'email' },
+    age: { type: 'number', minimum: 18 }
+  }
+}
+```
+
+### E-commerce Product
+```typescript
+{
+  type: 'object',
+  properties: {
+    productId: { type: 'string' },
+    name: { type: 'string' },
+    price: { type: 'number', minimum: 0 },
+    category: { type: 'string' },
+    inStock: { type: 'boolean' }
+  }
+}
+```
+
+### Time Series Data
+```typescript
+{
+  type: 'object',
+  properties: {
+    timestamp: { type: 'string', format: 'date-time' },
+    metric: { type: 'string' },
+    value: { type: 'number' },
+    unit: { type: 'string' }
+  }
+}
+```
+
+## Resources
+
+- [dspy.ts GitHub](https://github.com/dzhng/dspy.ts)
+- [dspy.ts Documentation](https://github.com/dzhng/dspy.ts#readme)
+- [DSPy Paper](https://arxiv.org/abs/2310.03714)
+- [Agentic-Synth](https://github.com/ruvnet/ruvector/tree/main/packages/agentic-synth)
+
+## License
+
+MIT - See LICENSE file for details
+
+## Contributing
+
+Contributions welcome! Please submit PRs to improve the integration.
+
+---
+
+**Built with ❤️ using [dspy.ts](https://github.com/dzhng/dspy.ts) and [agentic-synth](https://github.com/ruvnet/ruvector)**