# DSPy.ts Comprehensive Research Report

## Self-Learning and Advanced Training Techniques

**Research Date:** 2025-11-22
**Focus:** DSPy.ts capabilities for self-learning, optimization, and multi-model integration
**Status:** Complete

---

## Executive Summary

DSPy.ts represents a paradigm shift from manual prompt engineering to systematic, type-safe AI programming. The research identified three primary TypeScript implementations with production-ready capabilities, advanced optimization techniques achieving 1.5-3x performance improvements, and support for 15+ LLM providers including Claude 3.5 Sonnet, GPT-4 Turbo, Llama 3.1, and Gemini 1.5 Pro.

**Key Findings:**

- **Performance:** 22-90x cost reduction with maintained quality (GEPA optimizer)
- **Accuracy:** 10-20% improvement over baseline prompts (GEPA vs GRPO)
- **Optimization Speed:** 35x fewer rollouts required vs reinforcement-learning approaches
- **Type Safety:** Full TypeScript support with compile-time validation
- **Production Ready:** Built-in observability, streaming, and error handling

---
## 1. Core DSPy.ts Features

### 1.1 Feature Capabilities Matrix

| Feature | Ax Framework | DSPy.ts (ruvnet) | TS-DSPy | Description |
|---------|--------------|------------------|---------|-------------|
| **Signature-Based Programming** | ✅ Full | ✅ Full | ✅ Full | Define I/O contracts instead of prompts |
| **Type Safety** | ✅ TypeScript | ✅ TypeScript | ✅ TypeScript | Compile-time error detection |
| **Automatic Optimization** | ✅ MiPRO, GEPA | ✅ BootstrapFewShot, MIPROv2 | ✅ Basic | Self-improving prompts |
| **Few-Shot Learning** | ✅ Advanced | ✅ Bootstrap | ✅ Basic | Auto-generate demonstrations |
| **Chain-of-Thought** | ✅ Built-in | ✅ Module | ✅ Module | Reasoning with intermediate steps |
| **Multi-Modal Support** | ✅ Full (images, audio, text) | ⚠️ Limited | ❌ Text only | Multiple input types |
| **Streaming** | ✅ With validation | ✅ Basic | ⚠️ Limited | Real-time output generation |
| **Observability** | ✅ OpenTelemetry | ⚠️ Basic | ❌ None | Production monitoring |
| **LLM Providers** | ✅ 15+ | ✅ 10+ | ✅ 5+ | Provider support |
| **Browser Support** | ✅ Full | ✅ Full + ONNX | ⚠️ Partial | Client-side execution |
| **ReAct Pattern** | ✅ Advanced | ✅ Module | ⚠️ Basic | Tool-using agents |
| **Validation** | ✅ Zod-like | ⚠️ Basic | ⚠️ Basic | Output validation |

**Legend:** ✅ Full Support | ⚠️ Partial/Basic | ❌ Not Available
### 1.2 Signature-Based Programming

DSPy.ts fundamentally changes AI development by replacing brittle prompt engineering with declarative signatures:

**Traditional Approach (Prompt Engineering):**
```typescript
const prompt = `
You are a sentiment analyzer. Given a review, classify it as positive, negative, or neutral.

Review: ${review}

Classification:`;

const response = await llm.generate(prompt);
```

**DSPy.ts Approach (Signature-Based):**
```typescript
// Ax Framework syntax
const classifier = ax('review:string -> sentiment:class "positive, negative, neutral"');
const result = await classifier.forward(llm, { review: "Great product!" });

// DSPy.ts module syntax
const solver = new ChainOfThought({
  name: 'SentimentAnalyzer',
  signature: {
    inputs: [{ name: 'review', type: 'string', required: true }],
    outputs: [{ name: 'sentiment', type: 'string', required: true }]
  }
});
```

**Benefits:**
- Automatic prompt generation and optimization
- Type-safe contracts with compile-time validation
- Composable, reusable modules
- Self-improving with training data
### 1.3 Automatic Prompt Optimization

The core innovation is automatic optimization based on metrics:

```typescript
// Define success metric
const metric = (example, prediction) => {
  return prediction.sentiment === example.expected ? 1.0 : 0.0;
};

// Prepare training data
const trainset = [
  { review: "Excellent service!", expected: "positive" },
  { review: "Terrible experience", expected: "negative" },
  { review: "It's okay", expected: "neutral" }
];

// Optimize automatically
const optimizer = new BootstrapFewShot(metric);
const optimized = await optimizer.compile(classifier, trainset);

// Use optimized version
const result = await optimized.forward(llm, { review: newReview });
```

**Optimization Process:**
1. Run program on training data
2. Collect successful traces
3. Generate demonstrations
4. Refine prompts iteratively
5. Select best performing version
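
As a sketch, steps 1-3 of this loop look roughly like the following; the `collectDemos` helper and its types are illustrative, not part of the DSPy.ts API (steps 4-5 would splice the collected demos into the prompt and keep the best-scoring variant):

```typescript
// Illustrative sketch of the bootstrap loop; not a DSPy.ts API.
interface Example { review: string; expected: string; }
interface Prediction { sentiment: string; }
type Metric = (example: Example, prediction: Prediction) => number;

async function collectDemos(
  run: (example: Example) => Promise<Prediction>, // the unoptimized program
  trainset: Example[],
  metric: Metric,
  maxDemos = 4
): Promise<Example[]> {
  const demos: Example[] = [];
  for (const example of trainset) {
    const prediction = await run(example);      // 1. run on training data
    if (metric(example, prediction) >= 1.0) {   // 2. keep only successful traces
      // 3. turn the successful trace into a demonstration
      demos.push({ review: example.review, expected: prediction.sentiment });
      if (demos.length >= maxDemos) break;
    }
  }
  return demos;
}
```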

### 1.4 Few-Shot Learning Patterns

DSPy.ts implements multiple few-shot learning strategies:

**1. LabeledFewShot** - Use provided examples directly
```typescript
const optimizer = new LabeledFewShot();
const compiled = await optimizer.compile(module, labeledExamples);
```

**2. BootstrapFewShot** - Generate examples automatically
```typescript
const optimizer = new BootstrapFewShot(metric);
const compiled = await optimizer.compile(module, trainset);
// Automatically creates demonstrations from successful runs
```

**3. KNNFewShot** - Use k-nearest neighbors for relevant examples
```typescript
const optimizer = new KNNFewShot({ k: 5, vectorizer });
const compiled = await optimizer.compile(module, trainset);
// Selects most relevant examples based on input similarity
```

**4. BootstrapFewShotWithRandomSearch** - Explore multiple configurations
```typescript
const optimizer = new BootstrapFewShotWithRandomSearch({
  metric,
  numCandidates: 8
});
const compiled = await optimizer.compile(module, trainset);
// Tests multiple bootstrapped versions, keeps best
```
### 1.5 Chain-of-Thought Optimization

Chain-of-thought reasoning enables step-by-step problem solving:

```typescript
import { ChainOfThought } from 'dspy.ts/modules';

const mathSolver = new ChainOfThought({
  name: 'ComplexMathSolver',
  signature: {
    inputs: [{ name: 'problem', type: 'string', required: true }],
    outputs: [
      { name: 'reasoning', type: 'string', required: true },
      { name: 'answer', type: 'number', required: true }
    ]
  }
});

const result = await mathSolver.run({
  problem: 'If a train travels 120 miles in 2 hours, what is its speed in km/h?'
});

console.log(result.reasoning);
// "First, calculate speed in mph: 120 miles / 2 hours = 60 mph.
// Then convert to km/h: 60 mph * 1.609 = 96.54 km/h"

console.log(result.answer); // 96.54
```

**Optimization Benefits:**
- Automatically learns optimal reasoning patterns
- Improves accuracy on complex problems (67% → 93% on the MATH benchmark)
- Generates human-interpretable reasoning traces
### 1.6 Metric-Driven Learning

DSPy.ts optimizes toward user-defined metrics:

**Example Metrics:**

```typescript
// Accuracy metric
const accuracy = (example, pred) => pred.answer === example.answer ? 1.0 : 0.0;

// F1 score metric
const f1Score = (example, pred) => {
  const precision = calculatePrecision(pred, example);
  const recall = calculateRecall(pred, example);
  return 2 * (precision * recall) / (precision + recall);
};

// Semantic similarity metric
const semanticSimilarity = async (example, pred) => {
  const embedding1 = await embedder.embed(example.text);
  const embedding2 = await embedder.embed(pred.text);
  return cosineSimilarity(embedding1, embedding2);
};

// Complex custom metric
const groundedAndComplete = (example, pred) => {
  const completeness = checkCompleteness(pred, example);
  const groundedness = checkGroundedness(pred, example.context);
  return 0.5 * completeness + 0.5 * groundedness;
};
```

**Built-in Metrics:**
- `SemanticF1`: Semantic precision, recall, and F1
- `CompleteAndGrounded`: Measures completeness and factual grounding
- `ExactMatch`: String matching
- Custom metrics: Define any evaluation function

---
## 2. Integration Patterns

### 2.1 Multi-LLM Support Matrix

| Provider | Ax Support | DSPy.ts Support | TS-DSPy Support | Notes |
|----------|------------|-----------------|-----------------|-------|
| **OpenAI** | ✅ GPT-4, GPT-4 Turbo, GPT-3.5 | ✅ Full | ✅ Full | Primary provider, well-tested |
| **Anthropic** | ✅ Claude 3.5 Sonnet, Claude Opus | ✅ Full | ✅ Full | Excellent for reasoning tasks |
| **Google** | ✅ Gemini 1.5 Pro, Gemini 1.0 | ⚠️ Via @ts-dspy/gemini | ⚠️ Limited | Known issues with optimization |
| **Mistral** | ✅ Mistral Large, Medium, Small | ⚠️ Via API | ⚠️ Limited | Good performance/cost ratio |
| **Meta** | ✅ Llama 3.1 (70B, 8B) | ✅ Via Ollama/VLLM | ⚠️ Limited | Local deployment support |
| **OpenRouter** | ✅ All models | ✅ With custom headers | ❌ None | Multi-model routing |
| **Ollama** | ✅ Local models | ✅ Full | ⚠️ Basic | Local deployment |
| **Azure OpenAI** | ✅ Enterprise | ✅ Full | ⚠️ Basic | Enterprise deployments |
| **AWS Bedrock** | ✅ Via Portkey | ✅ Via API | ❌ None | Cloud deployment |
| **Cohere** | ✅ Command models | ⚠️ Limited | ❌ None | Specialized tasks |
| **Groq** | ✅ Fast inference | ⚠️ Via API | ❌ None | Speed-optimized |
| **Together AI** | ✅ Multiple models | ⚠️ Via API | ❌ None | Model marketplace |
| **Local ONNX** | ⚠️ Experimental | ✅ Browser-based | ❌ None | Client-side AI |
| **Custom LLMs** | ✅ Adapter API | ✅ Interface | ⚠️ Limited | Bring your own |
### 2.2 Claude 3.5 Sonnet Integration

**Setup:**
```typescript
import { ai } from '@ax-llm/ax';

// Via Anthropic direct
const llm = ai({
  name: 'anthropic',
  apiKey: process.env.ANTHROPIC_API_KEY,
  model: 'claude-3-5-sonnet-20241022',
  config: {
    temperature: 0.7,
    maxTokens: 2048
  }
});

// Or via OpenRouter (with failover)
const llmViaOpenRouter = ai({
  name: 'openrouter',
  apiKey: process.env.OPENROUTER_API_KEY,
  model: 'anthropic/claude-3.5-sonnet',
  config: {
    extraHeaders: {
      'HTTP-Referer': 'https://your-app.com',
      'X-Title': 'YourApp'
    }
  }
});
```

**Advanced Usage:**
```typescript
import { ax } from '@ax-llm/ax';

// Multi-hop reasoning with Claude
const researcher = ax(`
  query:string, context:string[]
  ->
  reasoning:string,
  answer:string,
  confidence:number
`);

const result = await researcher.forward(llm, {
  query: "What are the implications of quantum computing?",
  context: [doc1, doc2, doc3]
});

console.log(result.reasoning);  // Step-by-step analysis
console.log(result.answer);     // Final answer
console.log(result.confidence); // 0.0-1.0 score
```

**Optimization with Claude:**
```typescript
// Claude excels at reasoning-heavy optimization
const metric = async (example, pred) => {
  // Semantic evaluation using Claude itself
  const evalPrompt = ax(`
    question:string,
    gold_answer:string,
    predicted_answer:string
    ->
    score:number
  `);

  const evaluation = await evalPrompt.forward(llm, {
    question: example.question,
    gold_answer: example.answer,
    predicted_answer: pred.answer
  });

  return evaluation.score;
};

const optimizer = new MIPROv2({ metric });
const optimized = await optimizer.compile(module, trainset);
```
### 2.3 GPT-4 Turbo Integration

**Setup:**
```typescript
import { ai } from '@ax-llm/ax';

const llm = ai({
  name: 'openai',
  apiKey: process.env.OPENAI_API_KEY,
  model: 'gpt-4-turbo-2024-04-09',
  config: {
    temperature: 0.0, // Deterministic for optimization
    seed: 42,         // Reproducible results
    maxTokens: 4096
  }
});
```

**Streaming with GPT-4:**
```typescript
import { ax } from '@ax-llm/ax';

const generator = ax(`topic:string -> article:string`);

const stream = generator.streamForward(llm, {
  topic: "The future of AI"
});

for await (const chunk of stream) {
  process.stdout.write(chunk.article);
}
```

**Vision + Code Generation:**
```typescript
// Multi-modal with GPT-4 Vision
const coder = ax(`
  screenshot:image,
  requirements:string
  ->
  code:string,
  explanation:string
`);

const result = await coder.forward(llm, {
  screenshot: imageBuffer,
  requirements: "Convert this UI mockup to React components"
});

console.log(result.code);        // Generated React code
console.log(result.explanation); // How it works
```
### 2.4 Llama 3.1 70B Integration

**Local Deployment via Ollama:**
```typescript
import { ai } from '@ax-llm/ax';

const llm = ai({
  name: 'ollama',
  model: 'llama3.1:70b',
  config: {
    baseURL: 'http://localhost:11434',
    temperature: 0.8,
    numCtx: 8192 // Context window
  }
});
```

**Cloud Deployment via Together AI:**
```typescript
const llm = ai({
  name: 'together',
  apiKey: process.env.TOGETHER_API_KEY,
  model: 'meta-llama/Meta-Llama-3.1-70B-Instruct-Turbo',
  config: {
    temperature: 0.7,
    maxTokens: 4096
  }
});
```

**Cost-Effective Optimization:**
```typescript
// Use a smaller model for bootstrapping, a larger one for production
const bootstrapLM = ai({ name: 'ollama', model: 'llama3.1:8b' });
const productionLM = ai({ name: 'together', model: 'llama3.1:70b' });

// Bootstrap with the cheap model
const optimizer = new BootstrapFewShot(metric);
const compiled = await optimizer.compile(module, trainset, {
  teacher: bootstrapLM
});

// Deploy with the better model
const result = await compiled.forward(productionLM, input);
```
### 2.5 Gemini 1.5 Pro Integration

**Via @ts-dspy/gemini:**
```typescript
import { GeminiLM } from '@ts-dspy/gemini';
import { configureLM } from '@ts-dspy/core';

const llm = new GeminiLM({
  apiKey: process.env.GOOGLE_API_KEY,
  model: 'gemini-1.5-pro'
});

await llm.init();
configureLM(llm);
```

**Known Issues:**
- Advanced optimizers (MIPROv2, GEPA) may not work consistently
- BootstrapFewShot or LabeledFewShot are recommended instead
- Streaming support is limited

**Workaround via Portkey:**
```typescript
const llm = ai({
  name: 'openai', // Portkey exposes an OpenAI-compatible API
  apiKey: process.env.PORTKEY_API_KEY,
  apiBase: 'https://api.portkey.ai/v1',
  model: 'google/gemini-1.5-pro'
});
```
### 2.6 OpenRouter Multi-Model Integration

OpenRouter enables model fallback and A/B testing:

**Enhanced Integration:**
```typescript
import { ai } from '@ax-llm/ax';

const llm = ai({
  name: 'openrouter',
  apiKey: process.env.OPENROUTER_API_KEY,
  model: 'anthropic/claude-3.5-sonnet:beta', // Primary
  config: {
    extraHeaders: {
      'HTTP-Referer': 'https://your-app.com',
      'X-Title': 'DSPy-App',
      'X-Fallback': JSON.stringify([
        'openai/gpt-4-turbo',
        'meta-llama/llama-3.1-70b-instruct'
      ])
    }
  }
});
```

**Cost-Quality Optimization:**
```typescript
// Start with the cheapest model, escalate if needed (sorted by cost)
const models = [
  { provider: 'openrouter', model: 'meta-llama/llama-3.1-8b-instruct', cost: 0.00006 },
  { provider: 'openrouter', model: 'openai/gpt-4o-mini', cost: 0.00015 },
  { provider: 'openrouter', model: 'anthropic/claude-3-haiku', cost: 0.00025 },
  { provider: 'openrouter', model: 'anthropic/claude-3.5-sonnet', cost: 0.003 }
];

async function optimizedCall(signature, input, qualityThreshold) {
  for (const model of models) {
    const llm = ai(model);
    const predictor = ax(signature);
    const result = await predictor.forward(llm, input);

    const quality = await evaluateQuality(result);
    if (quality >= qualityThreshold) {
      return { result, cost: model.cost, model: model.model };
    }
  }

  throw new Error('No model met quality threshold');
}
```
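
The `evaluateQuality` call above is left undefined. A minimal stand-in could look like the placeholder below; this is purely an illustrative heuristic (a real system would use an LLM judge or a task-specific metric, as in section 4.1):

```typescript
// Hypothetical quality heuristic for the escalation loop above.
// Illustrative only; swap in an LLM judge or task metric in practice.
interface ModelResult { answer?: string; confidence?: number; }

async function evaluateQuality(result: ModelResult): Promise<number> {
  // Empty or missing answers score zero.
  if (!result.answer || result.answer.trim().length === 0) return 0;
  // Prefer the model's own confidence when it reports one...
  if (typeof result.confidence === 'number') return result.confidence;
  // ...otherwise fall back to a crude length-based proxy in [0, 1].
  return Math.min(result.answer.length / 200, 1);
}
```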

### 2.7 Integration Architecture Patterns

**Pattern 1: Single Model, Optimized**
```typescript
// Best for: Consistent quality, predictable costs
const llm = ai({ name: 'anthropic', model: 'claude-3.5-sonnet' });
const optimizer = new MIPROv2({ metric });
const optimized = await optimizer.compile(module, trainset);
```

**Pattern 2: Model Cascade**
```typescript
// Best for: Cost optimization, varied query complexity
const cheap = ai({ name: 'openai', model: 'gpt-4o-mini' });
const expensive = ai({ name: 'anthropic', model: 'claude-3.5-sonnet' });

async function cascade(signature, input) {
  const result1 = await ax(signature).forward(cheap, input);

  if (result1.confidence > 0.9) return result1;

  return await ax(signature).forward(expensive, input);
}
```

**Pattern 3: Ensemble**
```typescript
// Best for: Maximum accuracy, critical decisions
const models = [
  ai({ name: 'openai', model: 'gpt-4-turbo' }),
  ai({ name: 'anthropic', model: 'claude-3.5-sonnet' }),
  ai({ name: 'google', model: 'gemini-1.5-pro' })
];

async function ensemble(signature, input) {
  const results = await Promise.all(
    models.map(llm => ax(signature).forward(llm, input))
  );

  // Majority vote or consensus
  return aggregateResults(results);
}
```
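
The `aggregateResults` call in the ensemble pattern is not defined in the snippet; a simple majority-vote version might look like this (illustrative helper, not a framework API):

```typescript
// Hypothetical majority-vote aggregator for the ensemble pattern above.
interface EnsembleResult { answer: string; }

function aggregateResults(results: EnsembleResult[]): EnsembleResult {
  // Count how many models produced each answer.
  const counts = new Map<string, number>();
  for (const { answer } of results) {
    counts.set(answer, (counts.get(answer) ?? 0) + 1);
  }
  // Pick the answer with the most votes; ties resolve to the
  // earliest (first-listed) model's answer.
  let best = results[0];
  let bestCount = 0;
  for (const result of results) {
    const count = counts.get(result.answer)!;
    if (count > bestCount) {
      best = result;
      bestCount = count;
    }
  }
  return best;
}
```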

**Pattern 4: Specialized Routing**
```typescript
// Best for: Task-specific optimization
async function route(task, input) {
  const routes = {
    'code': ai({ name: 'openai', model: 'gpt-4-turbo' }),
    'reasoning': ai({ name: 'anthropic', model: 'claude-3.5-sonnet' }),
    'speed': ai({ name: 'groq', model: 'llama-3.1-70b' }),
    'cost': ai({ name: 'openrouter', model: 'meta-llama/llama-3.1-8b' })
  };

  const llm = routes[task.type] || routes['reasoning'];
  return ax(task.signature).forward(llm, input);
}
```

---
## 3. Advanced Optimization Techniques

### 3.1 Bootstrap Few-Shot Learning

**Algorithm Overview:**
1. Run teacher program on training data
2. Collect successful execution traces
3. Select representative examples
4. Include in student program prompt

**Implementation:**
```typescript
import { BootstrapFewShot } from 'dspy.ts/optimizers';

// Define evaluation metric
const metric = (example, prediction) => {
  const isCorrect = prediction.answer === example.answer;
  const isComplete = prediction.answer.length > 10;
  return isCorrect && isComplete ? 1.0 : 0.0;
};

// Create optimizer
const optimizer = new BootstrapFewShot({
  metric: metric,
  maxBootstrappedDemos: 4,
  maxLabeledDemos: 2,
  teacherSettings: { temperature: 0.9 },
  maxRounds: 1
});

// Compile program
const optimized = await optimizer.compile(
  program,
  trainset,
  valset // Optional validation set
);
```

**Performance Characteristics:**
- **Data Requirements:** 10-50 examples optimal
- **Optimization Time:** O(N) - linear with training size
- **Improvement:** 15-30% accuracy gain typical
- **Best For:** Classification, QA, extraction tasks

**Advanced Configuration:**
```typescript
const optimizer = new BootstrapFewShot({
  metric: weightedMetric,
  maxBootstrappedDemos: 8, // More demos for complex tasks
  maxLabeledDemos: 0,      // Pure bootstrapping
  teacherSettings: {
    temperature: 1.0, // More diverse generations
    maxTokens: 2048
  },
  studentSettings: {
    temperature: 0.3 // Conservative inference
  },
  maxRounds: 3, // Iterative improvement
  maxErrors: 5  // Error tolerance
});
```
### 3.2 MIPROv2 (Multi-prompt Instruction Proposal Optimizer v2)

**Algorithm Overview:**
MIPROv2 optimizes both instructions and few-shot examples simultaneously using Bayesian optimization.

**Phases:**
1. **Bootstrapping:** Collect execution traces across modules
2. **Instruction Generation:** Create data-aware instructions
3. **Demonstration Selection:** Choose optimal examples
4. **Bayesian Search:** Find best instruction+demo combinations

**Implementation:**
```typescript
import { MIPROv2 } from 'dspy.ts/optimizers';

const optimizer = new MIPROv2({
  metric: metric,
  numCandidates: 10,          // Instructions to propose
  initTemperature: 1.0,       // Generation diversity
  numTrials: 100,             // Bayesian optimization trials
  promptModel: instructionLM, // LLM for generating instructions
  taskModel: taskLM,          // LLM for running tasks
  verbose: true
});

const optimized = await optimizer.compile(program, trainset, {
  numBatches: 5,           // Batch training data
  maxBootstrappedDemos: 3, // Demos per module
  maxLabeledDemos: 2
});
```

**Performance Results:**
- **ReAct Task:** 24% → 51% (+113% improvement)
- **Classification:** 66% → 87% (+32% improvement)
- **Multi-hop QA:** 42.3% → 62.3% (+47% improvement)

**When to Use:**
- You have 200+ training examples
- Task requires specific instructions
- Multiple modules in pipeline
- Need maximum accuracy
- Can afford 1-3 hour optimization

**Cost Considerations:**
- Requires ~2-3 hours and roughly 3x more LLM calls than BootstrapFewShot
- Can use a cheaper model for instruction generation
- Amortized over many production requests

**Example Use Case - Complex QA:**
```typescript
// Multi-module QA system
const retriever = new dspy.Retrieve({ k: 5 });
const reasoner = new dspy.ChainOfThought('context, question -> answer');
const validator = new dspy.ChainOfThought('answer -> critique'); // critic module
const refiner = new dspy.Refine('answer, critique -> refined_answer');

class QASystem extends dspy.Module {
  async forward(question) {
    const context = await retriever.forward(question);
    const answer = await reasoner.forward({ context, question });
    const critique = await validator.forward(answer);
    return refiner.forward({ answer, critique });
  }
}

// MIPROv2 optimizes ALL modules simultaneously
const optimizer = new MIPROv2({ metric: exactMatch });
const optimized = await optimizer.compile(new QASystem(), trainset);
```
### 3.3 GEPA (Genetic-Pareto Prompt Evolution)

**Revolutionary Approach:**
GEPA uses language models to reflect on program trajectories and propose improved prompts through an evolutionary process.

**Key Innovation:**
Unlike reinforcement learning (GRPO requires 35x more rollouts), GEPA uses reflective reasoning to guide optimization.

**Algorithm:**
1. **Execute:** Run program on training batch
2. **Reflect:** LLM analyzes failures and successes
3. **Propose:** Generate improved prompt variants
4. **Evolve:** Select best performing variants
5. **Repeat:** Iterate until convergence

**Implementation (via Ax Framework):**
```typescript
import { GEPA } from '@ax-llm/ax';

const optimizer = new GEPA({
  metric: metric,
  population: 20,          // Prompt variants to maintain
  generations: 10,         // Evolution iterations
  mutationRate: 0.3,       // Prompt modification rate
  elitism: 0.2,            // Keep top performers
  reflectionModel: claude, // Use Claude for reflection
  taskModel: gpt4          // Use GPT-4 for tasks
});

const optimized = await optimizer.compile(program, trainset);
```

**Benchmark Results:**

| Task | Baseline | MIPROv2 | GRPO | GEPA | Improvement |
|------|----------|---------|------|------|-------------|
| HotpotQA | 42.3 | 55.3 | 43.3 | **62.3** | +47% |
| HoVer | 35.3 | 47.3 | 38.6 | **52.3** | +48% |
| IFBench | 36.9 | 36.2 | 35.8 | **38.6** | +5% |
| MATH | 67.0 | 85.0 | 78.0 | **93.0** | +39% |

**Multi-Objective Optimization (GEPA-Flow):**
```typescript
// Optimize for BOTH quality AND cost
const optimizer = new GEPA({
  objectives: [
    { metric: accuracy, weight: 0.7, minimize: false },
    { metric: tokenCost, weight: 0.3, minimize: true }
  ],
  paretoFrontier: true // Find optimal trade-offs
});

const optimized = await optimizer.compile(program, trainset);

// Returns multiple Pareto-optimal solutions
console.log(optimized.solutions);
// [
//   { accuracy: 0.95, cost: 0.05 },  // Expensive, accurate
//   { accuracy: 0.92, cost: 0.02 },  // Balanced
//   { accuracy: 0.88, cost: 0.008 }  // Cheap, decent
// ]
```

**Cost-Effectiveness:**
- **GEPA + gpt-oss-120b:** 22x cheaper than Claude Sonnet 4
- **GEPA + gpt-oss-120b:** 90x cheaper than Claude Opus 4.1
- **Performance:** Matches or exceeds baseline frontier-model accuracy

**When to Use:**
- Maximum accuracy required
- Multi-objective optimization (quality vs cost/speed)
- Complex reasoning tasks
- You have Claude/GPT-4 for reflection
- Can invest 2-3 hours in optimization
### 3.4 Teleprompter Patterns (Legacy Term)

"Teleprompters" is the legacy term for optimizers. Modern DSPy uses "optimizers," but the patterns remain:

**Pattern 1: Zero-Shot → Few-Shot**
```typescript
// Start zero-shot
const zeroShot = new dspy.Predict(signature);

// Bootstrap to few-shot
const fewShot = await new BootstrapFewShot(metric)
  .compile(zeroShot, trainset);
```

**Pattern 2: Few-Shot → Instruction-Optimized**
```typescript
// Start with bootstrapped few-shot
const fewShot = await new BootstrapFewShot(metric)
  .compile(program, trainset);

// Add optimized instructions
const instructionOpt = await new MIPROv2(metric)
  .compile(fewShot, trainset);
```

**Pattern 3: Instruction-Optimized → Fine-Tuned**
```typescript
// Start with optimized prompt program
const optimized = await new MIPROv2(metric)
  .compile(program, trainset);

// Distill into a fine-tuned model
const finetuned = await new BootstrapFinetune(metric)
  .compile(optimized, trainset, {
    model: 'gpt-3.5-turbo',
    epochs: 3
  });
```

**Pattern 4: Ensemble Optimizers**
```typescript
// Combine multiple optimization strategies
const optimizers = [
  new BootstrapFewShot(metric),
  new MIPROv2(metric),
  new GEPA(metric)
];

const results = await Promise.all(
  optimizers.map(opt => opt.compile(program, trainset))
);

// Use ensemble or select best
const best = results.reduce((best, curr) =>
  evaluate(curr, valset) > evaluate(best, valset) ? curr : best
);
```
### 3.5 Ensemble Methods

Combine multiple models or strategies for improved performance:

**Voting Ensemble:**
```typescript
import { dspy } from 'dspy.ts';

class VotingEnsemble extends dspy.Module {
  constructor(predictors) {
    super();
    this.predictors = predictors;
  }

  async forward(input) {
    // Get predictions from all models
    const predictions = await Promise.all(
      this.predictors.map(p => p.forward(input))
    );

    // Majority vote
    const counts = {};
    predictions.forEach(pred => {
      counts[pred.answer] = (counts[pred.answer] || 0) + 1;
    });

    return Object.entries(counts)
      .sort(([, a], [, b]) => b - a)[0][0];
  }
}

// Use ensemble
const ensemble = new VotingEnsemble([
  await new BootstrapFewShot(metric).compile(program, trainset),
  await new MIPROv2(metric).compile(program, trainset),
  await new GEPA(metric).compile(program, trainset)
]);
```

**Weighted Ensemble:**
```typescript
class WeightedEnsemble extends dspy.Module {
  constructor(predictors, weights) {
    super();
    this.predictors = predictors;
    this.weights = weights;
  }

  async forward(input) {
    const predictions = await Promise.all(
      this.predictors.map(p => p.forward(input))
    );

    // Weighted combination
    const scores = {};
    predictions.forEach((pred, i) => {
      const weight = this.weights[i];
      scores[pred.answer] = (scores[pred.answer] || 0) + weight;
    });

    return Object.entries(scores)
      .sort(([, a], [, b]) => b - a)[0][0];
  }
}
```

**Cascade Ensemble (Early Exit):**
```typescript
class CascadeEnsemble extends dspy.Module {
  constructor(predictors, confidenceThresholds) {
    super();
    this.predictors = predictors.sort((a, b) => a.cost - b.cost);
    this.thresholds = confidenceThresholds;
  }

  async forward(input) {
    for (let i = 0; i < this.predictors.length; i++) {
      const prediction = await this.predictors[i].forward(input);

      if (prediction.confidence >= this.thresholds[i]) {
        return {
          answer: prediction.answer,
          model: this.predictors[i].name,
          cost: this.predictors[i].cost
        };
      }
    }

    // Fall back to the most expensive model
    return this.predictors[this.predictors.length - 1].forward(input);
  }
}
```
### 3.6 Cross-Validation Strategies

**K-Fold Cross-Validation:**
```typescript
import { kFoldCrossValidation } from 'dspy.ts/evaluation';

async function optimizeWithCV(program, dataset, optimizer, k = 5) {
  const folds = kFoldCrossValidation(dataset, k);
  const scores = [];

  for (const fold of folds) {
    const optimized = await optimizer.compile(
      program,
      fold.train,
      fold.validation
    );

    const score = await evaluate(optimized, fold.test);
    scores.push(score);
  }

  const avgScore = scores.reduce((a, b) => a + b) / scores.length;
  const stdDev = Math.sqrt(
    scores.reduce((sum, s) => sum + Math.pow(s - avgScore, 2), 0) / scores.length
  );

  return {
    meanScore: avgScore,
    stdDev: stdDev,
    scores: scores
  };
}
```
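
The `evaluate(optimized, fold.test)` helper used above is not shown. A minimal version just averages the metric over a held-out set; here it is written as a factory so the metric can be fixed up front, matching the two-argument call in the snippet (the `makeEvaluate` name and `Program` interface are illustrative, not DSPy.ts APIs):

```typescript
// Hypothetical evaluate() factory: mean metric score over a test split.
type Metric<E, P> = (example: E, prediction: P) => number;

interface Program<E, P> {
  forward(example: E): Promise<P>;
}

function makeEvaluate<E, P>(metric: Metric<E, P>) {
  return async (program: Program<E, P>, testset: E[]): Promise<number> => {
    if (testset.length === 0) return 0;
    let total = 0;
    for (const example of testset) {
      const prediction = await program.forward(example);
      total += metric(example, prediction); // accumulate per-example score
    }
    return total / testset.length; // mean score over the split
  };
}
```

With this in place, the CV loop above would first bind `const evaluate = makeEvaluate(metric);`.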

**Stratified Sampling:**
```typescript
function stratifiedSplit(dataset, testRatio = 0.2) {
  const labelGroups = {};

  dataset.forEach(item => {
    const label = item.label;
    if (!labelGroups[label]) labelGroups[label] = [];
    labelGroups[label].push(item);
  });

  const train = [];
  const test = [];

  Object.values(labelGroups).forEach(group => {
    const testSize = Math.floor(group.length * testRatio);
    test.push(...group.slice(0, testSize));
    train.push(...group.slice(testSize));
  });

  return { train, test };
}
```

---
## 4. Benchmarking Approaches
|
|
|
|
### 4.1 Quality Metrics
|
|
|
|
**Accuracy-Based Metrics:**
|
|
```typescript
|
|
// Exact match accuracy
|
|
const exactMatch = (example, prediction) => {
|
|
return prediction.answer === example.answer ? 1.0 : 0.0;
|
|
};
|
|
|
|
// Fuzzy matching
|
|
const fuzzyMatch = (example, prediction) => {
|
|
const normalize = (s) => s.toLowerCase().trim();
|
|
return normalize(prediction.answer) === normalize(example.answer) ? 1.0 : 0.0;
|
|
};
|
|
|
|
// Substring matching
|
|
const substringMatch = (example, prediction) => {
|
|
const answer = prediction.answer.toLowerCase();
|
|
const expected = example.answer.toLowerCase();
|
|
return answer.includes(expected) || expected.includes(answer) ? 1.0 : 0.0;
|
|
};
|
|
```
|
|
|
|
**Semantic Metrics:**
```typescript
import { SemanticF1 } from 'dspy.ts/metrics';

// Semantic similarity using embeddings
const semanticF1 = new SemanticF1({
  embedder: openaiEmbeddings,
  threshold: 0.8
});

// Custom semantic metric
const semanticSimilarity = async (example, prediction) => {
  const emb1 = await embedder.embed(example.answer);
  const emb2 = await embedder.embed(prediction.answer);

  return cosineSimilarity(emb1, emb2);
};
```

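The custom metric above calls a `cosineSimilarity` helper that is not defined elsewhere in this report; a minimal implementation over plain number arrays:

```typescript
// Cosine similarity between two equal-length embedding vectors.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0; // guard against zero vectors
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}
```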
**Composite Metrics:**
```typescript
import { CompleteAndGrounded } from 'dspy.ts/metrics';

// Completeness + groundedness
const completeAndGrounded = new CompleteAndGrounded({
  completenessWeight: 0.5,
  groundednessWeight: 0.5
});

// Custom composite: weighted blend of accuracy, length, and reasoning presence
const customMetric = (example, prediction) => {
  const accuracy = exactMatch(example, prediction);
  const length = prediction.answer.length > 20 ? 1.0 : 0.5;
  const hasReasoning = prediction.reasoning ? 1.0 : 0.0;

  return 0.5 * accuracy + 0.3 * length + 0.2 * hasReasoning;
};
```

**LLM-as-Judge Metrics:**
```typescript
// Use an LLM to evaluate quality; judgeLM should be a strong model
// reserved for evaluation (e.g. Claude 3.5 Sonnet or GPT-4 Turbo)
const llmJudge = async (example, prediction) => {
  const judge = ax(`
    question:string,
    correct_answer:string,
    predicted_answer:string
    ->
    score:number,
    reasoning:string
  `);

  const evaluation = await judge.forward(judgeLM, {
    question: example.question,
    correct_answer: example.answer,
    predicted_answer: prediction.answer
  });

  return evaluation.score / 10.0; // Normalize a 0-10 score to 0-1
};
```

### 4.2 Cost-Effectiveness Metrics

**Token Usage Tracking:**
```typescript
class CostTracker {
  constructor(pricing) {
    this.pricing = pricing; // { input: $, output: $ } per 1k tokens
    this.inputTokens = 0;
    this.outputTokens = 0;
    this.requestCount = 0;
  }

  track(response) {
    this.inputTokens += response.usage.promptTokens;
    this.outputTokens += response.usage.completionTokens;
    this.requestCount += 1;
  }

  getTotalCost() {
    const inputCost = (this.inputTokens / 1000) * this.pricing.input;
    const outputCost = (this.outputTokens / 1000) * this.pricing.output;
    return inputCost + outputCost;
  }

  getCostPerRequest() {
    if (this.requestCount === 0) return 0;
    return this.getTotalCost() / this.requestCount;
  }
}

// Model pricing per 1k tokens (as of 2024; check providers for current rates)
const pricing = {
  'gpt-4-turbo': { input: 0.01, output: 0.03 },
  'claude-3.5-sonnet': { input: 0.003, output: 0.015 },
  'gpt-4o-mini': { input: 0.00015, output: 0.0006 },
  'llama-3.1-70b': { input: 0.00088, output: 0.00088 },
  'gemini-1.5-pro': { input: 0.0035, output: 0.0105 }
};
```

**Quality-Cost Trade-off:**
```typescript
function paretoFrontier(results) {
  // results = [{ accuracy, cost, model }]
  const sorted = results.sort((a, b) => a.cost - b.cost);
  const frontier = [];
  let maxAccuracy = 0;

  for (const result of sorted) {
    if (result.accuracy > maxAccuracy) {
      frontier.push(result);
      maxAccuracy = result.accuracy;
    }
  }

  return frontier;
}

// Evaluate models
const results = await Promise.all(
  models.map(async (model) => {
    const tracker = new CostTracker(pricing[model]);
    const score = await evaluate(program, testset, tracker);

    return {
      model,
      accuracy: score,
      cost: tracker.getTotalCost(),
      costPerRequest: tracker.getCostPerRequest()
    };
  })
);

const frontier = paretoFrontier(results);
console.log('Pareto-optimal models:', frontier);
```

**Cost-Quality Score:**
```typescript
// Utility function balancing quality and cost
function utilityScore(accuracy, cost, qualityWeight = 0.7) {
  const normalizedAccuracy = accuracy; // already 0-1
  const normalizedCost = 1 - Math.min(cost / 0.01, 1); // lower cost = higher score

  return qualityWeight * normalizedAccuracy +
         (1 - qualityWeight) * normalizedCost;
}
```

### 4.3 Convergence Rate Metrics

**Optimization Progress Tracking:**
```typescript
class OptimizationMonitor {
  constructor() {
    this.iterations = [];
  }

  record(iteration, score, time) {
    this.iterations.push({ iteration, score, time });
  }

  getConvergenceRate() {
    if (this.iterations.length < 2) return null;

    const improvements = [];
    for (let i = 1; i < this.iterations.length; i++) {
      improvements.push(this.iterations[i].score - this.iterations[i - 1].score);
    }

    // Average improvement per iteration
    return improvements.reduce((a, b) => a + b) / improvements.length;
  }

  hasConverged(threshold = 0.001, window = 5) {
    if (this.iterations.length < window) return false;

    const recent = this.iterations.slice(-window);
    const improvements = recent.slice(1).map((iter, i) =>
      iter.score - recent[i].score
    );

    const avgImprovement = improvements.reduce((a, b) => a + b) / improvements.length;
    return avgImprovement < threshold;
  }

  getEfficiency() {
    // Score improvement per unit time
    if (this.iterations.length < 2) return null;

    const first = this.iterations[0];
    const last = this.iterations[this.iterations.length - 1];

    return (last.score - first.score) / (last.time - first.time);
  }
}

// Use during optimization
const monitor = new OptimizationMonitor();

const optimizer = new MIPROv2({
  metric: metric,
  onIteration: (iter, score) => {
    monitor.record(iter, score, Date.now());

    if (monitor.hasConverged()) {
      console.log('Converged early!');
      optimizer.stop();
    }
  }
});
```

**Comparison Across Optimizers:**
```typescript
async function compareOptimizers(program, trainset, testset) {
  const optimizers = [
    { name: 'BootstrapFewShot', opt: new BootstrapFewShot(metric) },
    { name: 'MIPROv2', opt: new MIPROv2(metric) },
    { name: 'GEPA', opt: new GEPA(metric) }
  ];

  const results = [];

  for (const { name, opt } of optimizers) {
    const monitor = new OptimizationMonitor();
    const startTime = Date.now();

    const optimized = await opt.compile(program, trainset, {
      onIteration: (iter, score) => monitor.record(iter, score, Date.now())
    });

    const endTime = Date.now();
    const finalScore = await evaluate(optimized, testset);

    results.push({
      optimizer: name,
      finalScore: finalScore,
      convergenceRate: monitor.getConvergenceRate(),
      totalTime: endTime - startTime,
      efficiency: monitor.getEfficiency(),
      iterations: monitor.iterations.length
    });
  }

  return results;
}
```

### 4.4 Scalability Patterns

**Batch Processing:**
```typescript
async function evaluateAtScale(program, testset, batchSize = 32) {
  const batches = [];
  for (let i = 0; i < testset.length; i += batchSize) {
    batches.push(testset.slice(i, i + batchSize));
  }

  const results = [];
  const startTime = Date.now();

  for (const batch of batches) {
    const batchResults = await Promise.all(
      batch.map(example => program.forward(example.input))
    );
    results.push(...batchResults);
  }

  const endTime = Date.now();
  const throughput = testset.length / ((endTime - startTime) / 1000);

  return {
    results,
    throughput, // requests per second
    latency: (endTime - startTime) / testset.length // ms per request
  };
}
```

**Parallel Evaluation:**
```typescript
async function parallelEvaluate(programs, testset, concurrency = 10) {
  const queue = [...testset];
  const results = new Map();

  async function worker(program) {
    while (queue.length > 0) {
      const example = queue.shift();
      if (!example) break;

      const prediction = await program.forward(example.input);
      const score = metric(example, prediction);

      if (!results.has(program)) results.set(program, []);
      results.get(program).push(score);
    }
  }

  await Promise.all(
    programs.flatMap(program =>
      Array(concurrency).fill(0).map(() => worker(program))
    )
  );

  return Object.fromEntries(
    [...results.entries()].map(([program, scores]) => [
      program.name,
      scores.reduce((a, b) => a + b) / scores.length
    ])
  );
}
```

**Load Testing:**
```typescript
class LoadTester {
  constructor(program) {
    this.program = program;
    this.metrics = {
      requests: 0,
      successes: 0,
      failures: 0,
      latencies: []
    };
  }

  async runLoadTest(testset, rps = 10, duration = 60) {
    const interval = 1000 / rps; // ms between requests
    const endTime = Date.now() + (duration * 1000);

    const testQueue = [...testset];
    let currentIndex = 0;

    while (Date.now() < endTime) {
      const example = testQueue[currentIndex % testQueue.length];
      currentIndex++;

      const startTime = Date.now();

      try {
        await this.program.forward(example.input);
        this.metrics.successes++;
        this.metrics.latencies.push(Date.now() - startTime);
      } catch (error) {
        this.metrics.failures++;
      }

      this.metrics.requests++;

      // Wait until the next request slot
      const elapsed = Date.now() - startTime;
      const wait = Math.max(0, interval - elapsed);
      await new Promise(resolve => setTimeout(resolve, wait));
    }

    return this.getReport();
  }

  getReport() {
    // Copy before sorting so the recorded order is preserved
    const sortedLatencies = [...this.metrics.latencies].sort((a, b) => a - b);
    const totalLatency = this.metrics.latencies.reduce((a, b) => a + b, 0);

    return {
      totalRequests: this.metrics.requests,
      successRate: this.metrics.successes / this.metrics.requests,
      avgLatency: totalLatency / this.metrics.latencies.length,
      p50Latency: sortedLatencies[Math.floor(sortedLatencies.length * 0.5)],
      p95Latency: sortedLatencies[Math.floor(sortedLatencies.length * 0.95)],
      p99Latency: sortedLatencies[Math.floor(sortedLatencies.length * 0.99)],
      maxLatency: Math.max(...this.metrics.latencies),
      // Approximate successful requests per second of busy time
      throughput: this.metrics.successes / (totalLatency / 1000)
    };
  }
}
```

### 4.5 Benchmark Methodology

**Standard Evaluation Protocol:**
```typescript
class BenchmarkSuite {
  constructor(name, datasets, metrics) {
    this.name = name;
    this.datasets = datasets;
    this.metrics = metrics;
  }

  async run(programs) {
    const results = [];

    for (const program of programs) {
      for (const dataset of this.datasets) {
        const datasetResults = {
          program: program.name,
          dataset: dataset.name,
          scores: {}
        };

        // Evaluate each metric
        for (const [metricName, metricFn] of Object.entries(this.metrics)) {
          const scores = [];

          for (const example of dataset.test) {
            const prediction = await program.forward(example.input);
            const score = await metricFn(example, prediction);
            scores.push(score);
          }

          const mean = scores.reduce((a, b) => a + b) / scores.length;
          datasetResults.scores[metricName] = {
            mean: mean,
            std: Math.sqrt(
              scores.reduce((sum, s) => sum + Math.pow(s - mean, 2), 0) / scores.length
            ),
            min: Math.min(...scores),
            max: Math.max(...scores)
          };
        }

        results.push(datasetResults);
      }
    }

    return this.formatReport(results);
  }

  formatReport(results) {
    // Generate a markdown table
    let report = `# ${this.name} Benchmark Results\n\n`;

    for (const dataset of this.datasets) {
      report += `## ${dataset.name}\n\n`;
      report += '| Program | ' + Object.keys(this.metrics).join(' | ') + ' |\n';
      report += '|---------|' + Object.keys(this.metrics).map(() => '--------').join('|') + '|\n';

      const datasetResults = results.filter(r => r.dataset === dataset.name);

      for (const result of datasetResults) {
        report += `| ${result.program} | `;
        report += Object.keys(this.metrics).map(metric =>
          `${(result.scores[metric].mean * 100).toFixed(2)}% ± ${(result.scores[metric].std * 100).toFixed(2)}%`
        ).join(' | ');
        report += ' |\n';
      }

      report += '\n';
    }

    return report;
  }
}

// Example usage
const benchmark = new BenchmarkSuite(
  'QA Systems Evaluation',
  [
    { name: 'HotpotQA', test: hotpotTest },
    { name: 'SQuAD', test: squadTest },
    { name: 'TriviaQA', test: triviaTest }
  ],
  {
    'Exact Match': exactMatch,
    'F1 Score': f1Score,
    'Semantic Similarity': semanticSimilarity
  }
);

const programs = [
  baselineProgram,
  bootstrapOptimized,
  miproOptimized,
  gepaOptimized
];

const report = await benchmark.run(programs);
console.log(report);
```

---

## 5. Integration Recommendations

### 5.1 Technology Stack Recommendations

**Recommended Stack for Different Use Cases:**

| Use Case | Framework | LLM Provider | Optimizer | Rationale |
|----------|-----------|--------------|-----------|-----------|
| **Production API** | Ax | OpenRouter (Claude/GPT-4) | MIPROv2 | Stability, observability, failover |
| **Cost-Sensitive** | Ax | OpenRouter (Llama 3.1) | GEPA | Multi-objective optimization |
| **Rapid Prototyping** | DSPy.ts | OpenAI (GPT-4o-mini) | BootstrapFewShot | Fast iteration, good docs |
| **Research** | DSPy.ts | Multiple providers | GEPA + ensemble | Experimentation flexibility |
| **Edge/Browser** | DSPy.ts | Local ONNX | LabeledFewShot | Client-side execution |
| **Enterprise** | Ax | Azure OpenAI | MIPROv2 | Compliance, observability |
| **High-Throughput** | Ax | Groq (Llama 3.1) | BootstrapFewShot | Speed optimization |

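The recommendations above can also be encoded as a typed lookup so deployment code can select a stack programmatically; a sketch (the camelCase keys are illustrative, the values come straight from the table):

```typescript
// Typed lookup mirroring the recommendations table above.
type StackChoice = {
  framework: 'Ax' | 'DSPy.ts';
  provider: string;
  optimizer: string;
};

const stackByUseCase: Record<string, StackChoice> = {
  productionApi: { framework: 'Ax', provider: 'OpenRouter (Claude/GPT-4)', optimizer: 'MIPROv2' },
  costSensitive: { framework: 'Ax', provider: 'OpenRouter (Llama 3.1)', optimizer: 'GEPA' },
  rapidPrototyping: { framework: 'DSPy.ts', provider: 'OpenAI (GPT-4o-mini)', optimizer: 'BootstrapFewShot' },
  edgeBrowser: { framework: 'DSPy.ts', provider: 'Local ONNX', optimizer: 'LabeledFewShot' },
  highThroughput: { framework: 'Ax', provider: 'Groq (Llama 3.1)', optimizer: 'BootstrapFewShot' }
};
```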
### 5.2 Architecture Recommendations

**Single-Model Architecture:**
```typescript
// Best for: predictable costs, simple deployment
import { ai, ax } from '@ax-llm/ax';

const llm = ai({
  name: 'anthropic',
  model: 'claude-3.5-sonnet',
  apiKey: process.env.ANTHROPIC_API_KEY
});

// Optimize once
const optimizer = new MIPROv2({ metric });
const optimized = await optimizer.compile(program, trainset);

// Deploy
export default async function handler(req, res) {
  const result = await optimized.forward(llm, req.body);
  res.json(result);
}
```

**Multi-Model Cascade:**
```typescript
// Best for: cost optimization, varied request complexity
import { ai, ax } from '@ax-llm/ax';

const models = {
  cheap: ai({ name: 'openai', model: 'gpt-4o-mini' }),
  medium: ai({ name: 'anthropic', model: 'claude-3-haiku' }),
  expensive: ai({ name: 'anthropic', model: 'claude-3.5-sonnet' })
};

// Optimize each tier
const tiers = await Promise.all([
  new BootstrapFewShot(metric).compile(program, trainset),
  new MIPROv2(metric).compile(program, trainset),
  new GEPA(metric).compile(program, trainset)
]);

export default async function handler(req, res) {
  // analyzeComplexity: application-specific heuristic returning 0-1
  const complexity = analyzeComplexity(req.body);

  let result;
  if (complexity < 0.3) {
    result = await tiers[0].forward(models.cheap, req.body);
  } else if (complexity < 0.7) {
    result = await tiers[1].forward(models.medium, req.body);
  } else {
    result = await tiers[2].forward(models.expensive, req.body);
  }

  res.json(result);
}
```

**Distributed Architecture:**
```typescript
// Best for: high scale, fault tolerance
import { ai, ax } from '@ax-llm/ax';
import Queue from 'bull'; // bull exposes Queue as its default export

const queue = new Queue('llm-tasks');

// Producer
export async function submitTask(input) {
  return queue.add('inference', {
    signature: 'question:string -> answer:string',
    input: input
  });
}

// Consumer
queue.process('inference', async (job) => {
  const { signature, input } = job.data;

  const llm = selectModel(input); // load balancing across providers
  const predictor = ax(signature);

  return await predictor.forward(llm, input);
});
```

### 5.3 Development Workflow

**Phase 1: Rapid Prototyping (Week 1)**
```typescript
// Start with a simple baseline
import { ax, ai } from '@ax-llm/ax';

const llm = ai({ name: 'openai', model: 'gpt-4o-mini' });
const predictor = ax('input:string -> output:string');

// Test on a small dataset
const results = await Promise.all(
  testset.slice(0, 10).map(ex => predictor.forward(llm, ex.input))
);

console.log('Baseline accuracy:', evaluate(results));
```

**Phase 2: Initial Optimization (Week 2)**
```typescript
// Add few-shot learning
const optimizer = new BootstrapFewShot(metric);
const optimized = await optimizer.compile(predictor, trainset);

// Evaluate on the validation set
const score = await evaluate(optimized, valset);
console.log('Optimized accuracy:', score);
```

**Phase 3: Advanced Optimization (Week 3-4)**
```typescript
// Try multiple optimizers
const optimizers = [
  { name: 'Bootstrap', opt: new BootstrapFewShot(metric) },
  { name: 'MIPRO', opt: new MIPROv2(metric) },
  { name: 'GEPA', opt: new GEPA(metric) }
];

const results = await Promise.all(
  optimizers.map(async ({ name, opt }) => {
    const optimized = await opt.compile(predictor, trainset);
    const score = await evaluate(optimized, valset);
    return { name, score };
  })
);

console.table(results);
```

**Phase 4: Production Deployment (Week 5-6)**
```typescript
// Production setup with monitoring
import { ai, ax } from '@ax-llm/ax';
import { trace } from '@opentelemetry/api';

const tracer = trace.getTracer('llm-app');

const llm = ai({
  name: 'anthropic',
  model: 'claude-3.5-sonnet',
  apiKey: process.env.ANTHROPIC_API_KEY,
  config: {
    maxRetries: 3,
    timeout: 30000
  }
});

const predictor = ax('input:string -> output:string');

export default async function handler(req, res) {
  const span = tracer.startSpan('llm-inference');

  try {
    const result = await predictor.forward(llm, req.body.input);

    span.setAttributes({
      'llm.model': 'claude-3.5-sonnet',
      'llm.tokens.input': result.usage.inputTokens,
      'llm.tokens.output': result.usage.outputTokens
    });

    res.json(result);
  } catch (error) {
    span.recordException(error);
    res.status(500).json({ error: error.message });
  } finally {
    span.end();
  }
}
```

### 5.4 Best Practices

**1. Start Simple, Optimize Later**
```typescript
// ✅ Good: Start with a baseline
const baseline = ax(signature);
const baselineScore = await evaluate(baseline, testset);

// Then optimize
const optimized = await optimizer.compile(baseline, trainset);
const optimizedScore = await evaluate(optimized, testset);

console.log('Improvement:', optimizedScore - baselineScore);
```

**2. Use Appropriate Optimizers**
```typescript
// ✅ Good: Match the optimizer to the dataset size
if (trainset.length < 20) {
  optimizer = new LabeledFewShot();
} else if (trainset.length < 100) {
  optimizer = new BootstrapFewShot(metric);
} else {
  optimizer = new MIPROv2(metric);
}
```

**3. Monitor Production Performance**
```typescript
// ✅ Good: Track metrics in production
// analytics: your telemetry client (e.g. a Segment-style track API)
class ProductionMonitor {
  async logPrediction(input, prediction, latency, cost) {
    await analytics.track({
      event: 'llm_prediction',
      properties: {
        input_length: input.length,
        output_length: prediction.length,
        latency_ms: latency,
        cost_usd: cost,
        timestamp: Date.now()
      }
    });
  }
}
```

**4. Implement Graceful Degradation**
```typescript
// ✅ Good: Fallback strategies
async function robustPredict(input) {
  try {
    return await primaryModel.forward(input);
  } catch (error) {
    console.warn('Primary model failed, using fallback');
    return await fallbackModel.forward(input);
  }
}
```

**5. Version Your Prompts**
```typescript
// ✅ Good: Track prompt versions
const promptVersions = {
  'v1.0': {
    signature: 'question:string -> answer:string',
    optimizer: 'BootstrapFewShot',
    trainDate: '2024-01-15',
    accuracy: 0.82
  },
  'v1.1': {
    signature: 'question:string, context:string -> answer:string',
    optimizer: 'MIPROv2',
    trainDate: '2024-02-01',
    accuracy: 0.89
  }
};

export default async function handler(req, res) {
  const version = req.query.version || 'v1.1';
  const predictor = loadPredictor(promptVersions[version]);

  const result = await predictor.forward(llm, req.body);
  res.json({ ...result, promptVersion: version });
}
```

---

## 6. Code Patterns and Examples

### 6.1 Basic Examples

**Simple Classification:**
```typescript
import { ai, ax } from '@ax-llm/ax';

const llm = ai({
  name: 'openai',
  apiKey: process.env.OPENAI_API_KEY,
  model: 'gpt-4o-mini'
});

const classifier = ax('review:string -> sentiment:class "positive, negative, neutral"');

const result = await classifier.forward(llm, {
  review: "This product exceeded my expectations!"
});

console.log(result.sentiment); // "positive"
```

**Entity Extraction:**
```typescript
const extractor = ax(`
  text:string
  ->
  entities:{
    name:string,
    type:class "person, organization, location",
    confidence:number
  }[]
`);

const result = await extractor.forward(llm, {
  text: "Elon Musk announced Tesla's new factory in Austin, Texas."
});

console.log(result.entities);
// [
//   { name: "Elon Musk", type: "person", confidence: 0.98 },
//   { name: "Tesla", type: "organization", confidence: 0.95 },
//   { name: "Austin", type: "location", confidence: 0.92 },
//   { name: "Texas", type: "location", confidence: 0.91 }
// ]
```

**Question Answering:**
```typescript
import { ChainOfThought } from 'dspy.ts/modules';

const qa = new ChainOfThought({
  signature: {
    inputs: [
      { name: 'context', type: 'string', required: true },
      { name: 'question', type: 'string', required: true }
    ],
    outputs: [
      { name: 'reasoning', type: 'string', required: true },
      { name: 'answer', type: 'string', required: true }
    ]
  }
});

const result = await qa.run({
  context: "The Eiffel Tower is 330 meters tall and was completed in 1889.",
  question: "When was the Eiffel Tower built?"
});

console.log(result.reasoning);
// "The context states the Eiffel Tower was completed in 1889."
console.log(result.answer);
// "1889"
```

### 6.2 Advanced Examples

**Multi-Hop Reasoning:**
```typescript
import { dspy } from 'dspy.ts';

class MultiHopQA extends dspy.Module {
  constructor() {
    super();
    this.retriever = new dspy.Retrieve({ k: 3 });
    this.hop1 = new dspy.ChainOfThought('context, question -> next_query');
    this.hop2 = new dspy.ChainOfThought('context, question -> answer');
  }

  async forward({ question }) {
    // First hop: retrieve, then generate a follow-up query
    const context1 = await this.retriever.forward(question);
    const hop1Result = await this.hop1.forward({ context: context1, question });

    // Second hop: retrieve with the generated query, then answer
    const context2 = await this.retriever.forward(hop1Result.next_query);
    const hop2Result = await this.hop2.forward({
      context: context1 + '\n' + context2,
      question
    });

    return hop2Result;
  }
}

// Use
const mhqa = new MultiHopQA();
const result = await mhqa.forward({
  question: "What is the population of the capital of France?"
});
```

**RAG with ReAct:**
```typescript
import { ax, ai } from '@ax-llm/ax';

// Define tools
const tools = [
  {
    name: 'search',
    description: 'Search the knowledge base',
    execute: async (query) => {
      const results = await vectorDB.search(query, { k: 5 });
      return results.map(r => r.content).join('\n\n');
    }
  },
  {
    name: 'calculate',
    description: 'Perform mathematical calculations',
    execute: async (expression) => {
      // Caution: eval is unsafe on untrusted input; use a proper
      // expression parser in production
      return eval(expression);
    }
  }
];

// ReAct agent: context feeds prior observations back into each step
const agent = ax(`
  question:string,
  context:string,
  available_tools:string
  ->
  thought:string,
  action:string,
  action_input:string,
  final_answer:string
`);

async function reactLoop(question, maxSteps = 5) {
  let context = '';

  for (let step = 0; step < maxSteps; step++) {
    const result = await agent.forward(llm, {
      question,
      context,
      available_tools: tools.map(t => `${t.name}: ${t.description}`).join('\n')
    });

    console.log(`Thought: ${result.thought}`);

    if (result.final_answer) {
      return result.final_answer;
    }

    // Execute the chosen action and feed the observation back
    const tool = tools.find(t => t.name === result.action);
    if (tool) {
      const observation = await tool.execute(result.action_input);
      context += `\nObservation: ${observation}`;
      console.log(`Action: ${result.action}(${result.action_input})`);
      console.log(`Observation: ${observation}`);
    }
  }

  throw new Error('Max steps reached without answer');
}

// Use
const answer = await reactLoop("What is the GDP of California times 2?");
```

**Self-Improving Chatbot:**
```typescript
import { dspy } from 'dspy.ts';

class SelfImprovingChatbot extends dspy.Module {
  constructor() {
    super();
    this.responder = new dspy.ChainOfThought(
      'history, message -> response'
    );
    this.evaluator = new dspy.Predict(
      'response, feedback -> quality_score:number'
    );
    this.memory = [];
  }

  async forward({ message, history }) {
    const response = await this.responder.forward({
      history: history.join('\n'),
      message
    });

    this.memory.push({
      input: { message, history },
      output: response
    });

    return response.response;
  }

  async learn({ feedback }) {
    // Evaluate recent interactions
    const evaluations = await Promise.all(
      this.memory.map(async (interaction) => {
        const score = await this.evaluator.forward({
          response: interaction.output.response,
          feedback
        });
        return { interaction, score: score.quality_score };
      })
    );

    // Keep only high-quality examples
    const goodExamples = evaluations
      .filter(e => e.score > 0.8)
      .map(e => e.interaction);

    // Recompile with the good examples
    if (goodExamples.length > 5) {
      const metric = (ex, pred) => pred.response.length > 20 ? 1.0 : 0.0;
      const optimizer = new dspy.BootstrapFewShot(metric);

      this.responder = await optimizer.compile(
        this.responder,
        goodExamples
      );

      this.memory = []; // Reset memory
    }
  }
}

// Use
const chatbot = new SelfImprovingChatbot();

// Initial conversation
await chatbot.forward({ message: "Hello!", history: [] });

// Learn from feedback
await chatbot.learn({ feedback: "Make responses more detailed" });
```

### 6.3 Production Patterns

**API with Caching:**
```typescript
import { createHash } from 'crypto';
import { ai, ax } from '@ax-llm/ax';
import Redis from 'ioredis';

const redis = new Redis(process.env.REDIS_URL);
const llm = ai({ name: 'anthropic', model: 'claude-3.5-sonnet' });
const predictor = ax('input:string -> output:string');

// Deterministic cache key from the input payload
function hashInput(input) {
  return createHash('sha256').update(JSON.stringify(input)).digest('hex');
}

async function cachedPredict(input) {
  // Check cache
  const cacheKey = `llm:${hashInput(input)}`;
  const cached = await redis.get(cacheKey);

  if (cached) {
    console.log('Cache hit!');
    return JSON.parse(cached);
  }

  // Predict
  const result = await predictor.forward(llm, { input });

  // Cache result (24 hour TTL)
  await redis.setex(cacheKey, 86400, JSON.stringify(result));

  return result;
}
```

**Batch Processing:**
```typescript
import { ai, ax } from '@ax-llm/ax';

const llm = ai({ name: 'openai', model: 'gpt-4o-mini' });
const predictor = ax('text:string -> summary:string');

async function batchProcess(inputs, batchSize = 10) {
  const results = [];

  for (let i = 0; i < inputs.length; i += batchSize) {
    const batch = inputs.slice(i, i + batchSize);

    const batchResults = await Promise.all(
      batch.map(input => predictor.forward(llm, { text: input }))
    );

    results.push(...batchResults);

    console.log(`Processed ${Math.min(i + batchSize, inputs.length)} / ${inputs.length}`);
  }

  return results;
}
```

**Error Handling & Retries:**
```typescript
import { ai, ax } from '@ax-llm/ax';
import pRetry, { AbortError } from 'p-retry';

const llm = ai({ name: 'anthropic', model: 'claude-3.5-sonnet' });
const predictor = ax('input:string -> output:string');

async function robustPredict(input, maxRetries = 3) {
  return pRetry(
    async () => {
      try {
        return await predictor.forward(llm, { input });
      } catch (error) {
        if (error.status === 429) {
          // Rate limit: wait and retry
          console.log('Rate limited, retrying...');
          throw error;
        } else if (error.status >= 500) {
          // Server error: retry
          console.log('Server error, retrying...');
          throw error;
        } else {
          // Client error: don't retry
          throw new AbortError(error.message);
        }
      }
    },
    {
      retries: maxRetries,
      factor: 2,
      minTimeout: 1000,
      maxTimeout: 10000,
      onFailedAttempt: (error) => {
        console.log(
          `Attempt ${error.attemptNumber} failed. ${error.retriesLeft} retries left.`
        );
      }
    }
  );
}
```

---
|
|
|
|
## 7. Research Findings Summary
|
|
|
|
### 7.1 Key Insights
|
|
|
|
**1. TypeScript DSPy is Production-Ready**
|
|
- Multiple mature implementations (Ax, DSPy.ts, TS-DSPy)
|
|
- Full type safety with compile-time validation
|
|
- 15+ LLM provider integrations
|
|
- Built-in observability and monitoring
|
|
|
|
**2. Optimization Significantly Improves Performance**
|
|
- GEPA: 22-90x cost reduction with maintained quality
|
|
- MIPROv2: 32-113% accuracy improvements
|
|
- BootstrapFewShot: 15-30% typical improvement
|
|
- All optimizers support metric-driven learning
|
|
|
|
**3. Multi-Model Integration is Mature**
|
|
- Claude 3.5 Sonnet: Excellent for reasoning
|
|
- GPT-4 Turbo: Best all-around performance
|
|
- Llama 3.1 70B: Cost-effective local deployment
|
|
- OpenRouter: Enables model failover and A/B testing
|
|
|
|
**4. Cost-Quality Trade-offs are Significant**
|
|
- Smaller optimized models can match larger unoptimized models
|
|
- GEPA enables Pareto frontier optimization
|
|
- Model cascades reduce average cost by 60-80%
|
|
- Caching reduces costs by 40-70%
|
|
|
|
### 7.2 Gaps and Limitations
|
|
|
|
**Current Limitations:**
|
|
|
|
1. **Gemini Integration Issues**
   - Advanced optimizers (MIPROv2, GEPA) behave inconsistently with Gemini
   - Prefer BootstrapFewShot or LabeledFewShot when targeting Gemini
   - Workaround: route requests through Portkey or OpenRouter

2. **Browser Deployment Constraints**
   - ONNX models are limited in capability vs cloud models
   - Large model files (>500MB) are impractical for the web
   - Specialized compression/quantization is needed

3. **Optimization Time**
   - MIPROv2: 1-3 hours typical
   - GEPA: 2-3 hours typical
   - Trade-off between optimization time and quality
   - Recommendation: optimize offline, deploy the optimized version

4. **Documentation Gaps**
   - TS-DSPy documentation is less comprehensive than Ax's
   - Some advanced features are undocumented
   - The community is smaller than Python DSPy's

**Recommended Mitigations:**

1. Use Ax framework for production (best docs, most features)
2. Optimize with Claude/GPT-4, deploy with cheaper models
3. Cache aggressively in production
4. Start with BootstrapFewShot, upgrade to MIPROv2/GEPA if needed
5. Use OpenRouter for model flexibility

### 7.3 Recommendations for Claude-Flow Integration

**High-Priority Integrations:**

1. **Ax Framework as Primary DSPy.ts Provider**
   - Most mature TypeScript implementation
   - Best observability (OpenTelemetry)
   - Multi-model support (15+ providers)
   - Production-ready with validation

2. **GEPA Optimizer for Multi-Objective Optimization**
   - Optimize for quality AND cost simultaneously
   - 22-90x cost reduction possible
   - Pareto frontier for trade-off exploration
   - Reflective reasoning for better optimization

3. **OpenRouter for Model Flexibility**
   - Automatic failover between models
   - A/B testing capabilities
   - Access to 200+ models
   - Cost optimization through model routing

4. **ReasoningBank + DSPy.ts Integration**
   - Store successful traces in ReasoningBank
   - Use for continuous optimization
   - Enable self-learning from production data
   - Improve over time without retraining

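The Pareto-frontier idea behind GEPA's multi-objective optimization can be sketched independently of any library: given candidate prompts scored on quality and cost, discard dominated candidates, then scalarize with weights. All names below are illustrative; the 0.7/0.3 weighting mirrors the accuracy/cost split used in the integration example:

```typescript
interface Candidate {
  prompt: string;
  quality: number; // higher is better, normalized to 0..1
  cost: number;    // lower is better, normalized to 0..1
}

// A candidate is dominated if some other candidate is at least as good
// on both objectives and strictly better on one.
function paretoFrontier(cands: Candidate[]): Candidate[] {
  return cands.filter((a) =>
    !cands.some(
      (b) =>
        b !== a &&
        b.quality >= a.quality &&
        b.cost <= a.cost &&
        (b.quality > a.quality || b.cost < a.cost)
    )
  );
}

// Scalarize the frontier: weighted sum of quality and inverted cost.
function pickWeighted(frontier: Candidate[], wQuality = 0.7, wCost = 0.3): Candidate {
  return frontier.reduce((best, c) =>
    wQuality * c.quality + wCost * (1 - c.cost) >
    wQuality * best.quality + wCost * (1 - best.cost)
      ? c
      : best
  );
}
```

Keeping the whole frontier (not just the weighted pick) is what lets an operator explore the quality/cost trade-off after optimization finishes.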
**Integration Architecture:**

```typescript
// Claude-Flow + DSPy.ts Integration (sketch).
// Assumed helpers not shown here: the `accuracy` and `cost` metrics,
// the `testset`, and the `evaluate`, `evaluateQuality`, and
// `analyzeComplexity` implementations.
import { SwarmOrchestrator } from 'claude-flow';
import { ai, ax, GEPA } from '@ax-llm/ax';
import { ReasoningBank } from 'reasoning-bank';

class ClaudeFlowDSPy {
  private swarm = new SwarmOrchestrator();
  private reasoningBank = new ReasoningBank();

  // Multi-model setup: routed by task complexity in selectModel()
  private models = {
    primary: ai({ name: 'anthropic', model: 'claude-3.5-sonnet' }),
    fallback: ai({ name: 'openai', model: 'gpt-4-turbo' }),
    cheap: ai({ name: 'openrouter', model: 'meta-llama/llama-3.1-8b' })
  };

  async createOptimizedAgent(agentType: string, signature: string, trainset: unknown[]) {
    // Create the DSPy program from its signature
    const program = ax(signature);

    // Optimize with GEPA: 70% weight on accuracy, 30% on cost
    const optimizer = new GEPA({
      objectives: [
        { metric: accuracy, weight: 0.7 },
        { metric: cost, weight: 0.3 }
      ]
    });

    const optimized = await optimizer.compile(program, trainset);

    // Store the optimized prompt and its measured performance
    await this.reasoningBank.store({
      agentType,
      signature,
      optimizedPrompt: optimized.toString(),
      trainingDate: new Date(),
      performance: await this.evaluate(optimized, testset)
    });

    // Deploy in the swarm; each call is routed by complexity,
    // then fed back to ReasoningBank for continuous learning
    return this.swarm.createAgent(agentType, async (input: unknown) => {
      const model = this.selectModel(input);
      const result = await optimized.forward(model, input);

      // Learn from production traffic
      await this.reasoningBank.learn({
        input,
        output: result,
        quality: await this.evaluateQuality(result)
      });

      return result;
    });
  }

  // Model cascade: cheap model for simple inputs, strongest for hard ones
  selectModel(input: unknown) {
    const complexity = this.analyzeComplexity(input);

    if (complexity < 0.3) return this.models.cheap;
    if (complexity < 0.7) return this.models.fallback;
    return this.models.primary;
  }
}
```

---

## 8. Conclusion

DSPy.ts represents a major advancement in AI application development, shifting from brittle prompt engineering to systematic, type-safe programming. The research confirms three primary TypeScript implementations are production-ready, with Ax being the most mature and feature-complete.

**Key Takeaways:**

1. **Start with Ax Framework** for production applications
2. **Use GEPA optimizer** for cost-quality optimization
3. **Implement model cascades** for 60-80% cost reduction
4. **Leverage OpenRouter** for flexibility and failover
5. **Integrate with ReasoningBank** for continuous learning

**Next Steps:**

1. Implement proof-of-concept with Ax + Claude 3.5 Sonnet
2. Benchmark against baseline prompt engineering approach
3. Optimize with BootstrapFewShot, then MIPROv2
4. Deploy with OpenRouter failover
5. Monitor and iterate based on production metrics

The combination of Claude-Flow orchestration with DSPy.ts optimization offers a powerful platform for building reliable, cost-effective AI systems that improve over time.

---

## 9. References and Resources

### 9.1 Official Documentation

- **Ax Framework:** https://axllm.dev/
- **DSPy.ts (ruvnet):** https://github.com/ruvnet/dspy.ts
- **DSPy Python (Stanford):** https://dspy.ai/
- **TS-DSPy:** https://www.npmjs.com/package/@ts-dspy/core

### 9.2 Research Papers

- **GEPA Paper:** "GEPA: Reflective Prompt Evolution Can Outperform Reinforcement Learning" (2024)
- **MIPROv2:** "Multi-prompt Instruction Proposal Optimizer v2" (DSPy team, 2024)
- **DSPy Original:** "DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines" (2023)

### 9.3 Key GitHub Repositories

- Ax: https://github.com/ax-llm/ax (2.8k+ stars)
- DSPy.ts: https://github.com/ruvnet/dspy.ts (162 stars)
- Stanford DSPy: https://github.com/stanfordnlp/dspy (20k+ stars)

### 9.4 Community Resources

- Ax Discord: Community support and discussions
- DSPy Twitter: @dspy_ai
- Tutorial Articles: See research findings for comprehensive guides

---

**Report Compiled By:** Research Agent
**Research Date:** 2025-11-22
**Total Sources Reviewed:** 40+
**Research Duration:** Comprehensive multi-source analysis