20 KiB
Agentic-Synth Architecture
Overview
Agentic-Synth is a TypeScript-based synthetic data generation package that provides both CLI and SDK interfaces for generating time-series, events, and structured data using AI models (Gemini and OpenRouter APIs). It integrates seamlessly with midstreamer for streaming and agentic-robotics for automation workflows.
Architecture Decision Records
ADR-001: TypeScript with ESM Modules
Status: Accepted
Context:
- Need modern JavaScript/TypeScript support
- Integration with Node.js ecosystem
- Support for both CLI and SDK usage
- Future-proof module system
Decision: Use TypeScript with ESM (ECMAScript Modules) as the primary module system.
Rationale:
- ESM is the standard for modern JavaScript
- Better tree-shaking and optimization
- Native TypeScript support
- Aligns with Node.js 18+ best practices
Consequences:
- Requires Node.js 18+
- All imports must use
.jsextensions in output - Better interoperability with modern tools
ADR-002: No Redis Dependency
Status: Accepted
Context:
- Need caching for context and API responses
- Avoid external service dependencies
- Simplify deployment and usage
Decision: Use in-memory caching with LRU (Least Recently Used) strategy and optional file-based persistence.
Rationale:
- Simpler deployment (no Redis server needed)
- Faster for most use cases (in-process memory)
- File-based persistence for session continuity
- Optional integration with ruvector for advanced caching
Consequences:
- Cache doesn't survive process restart (unless persisted to file)
- Memory-limited (configurable max size)
- Single-process only (no distributed caching)
ADR-003: Dual Interface (CLI + SDK)
Status: Accepted
Context:
- Need both programmatic access and command-line usage
- Different user personas (developers vs. operators)
- Consistent behavior across interfaces
Decision: Implement core logic in SDK with CLI as a thin wrapper.
Rationale:
- Single source of truth for logic
- CLI uses SDK internally
- Easy to test and maintain
- Clear separation of concerns
Consequences:
- SDK must be feature-complete
- CLI is primarily for ergonomics
- Documentation needed for both interfaces
ADR-004: Model Router Architecture
Status: Accepted
Context:
- Support multiple AI providers (Gemini, OpenRouter)
- Different models for different data types
- Cost optimization and fallback strategies
Decision: Implement a model router that selects appropriate models based on data type, cost, and availability.
Rationale:
- Flexibility in model selection
- Automatic fallback on failures
- Cost optimization through smart routing
- Provider-agnostic interface
Consequences:
- More complex routing logic
- Need configuration for routing rules
- Performance monitoring required
ADR-005: Plugin Architecture for Generators
Status: Accepted
Context:
- Different data types need different generation strategies
- Extensibility for custom generators
- Community contributions
Decision: Use a plugin-based architecture where each data type has a dedicated generator.
Rationale:
- Clear separation of concerns
- Easy to add new data types
- Testable in isolation
- Community can contribute generators
Consequences:
- Need generator registration system
- Consistent generator interface
- Documentation for custom generators
ADR-006: Optional Integration Pattern
Status: Accepted
Context:
- Integration with midstreamer, agentic-robotics, and ruvector
- Not all users need all integrations
- Avoid mandatory dependencies
Decision: Use optional peer dependencies with runtime detection.
Rationale:
- Lighter install for basic usage
- Pay-as-you-go complexity
- Clear integration boundaries
- Graceful degradation
Consequences:
- Runtime checks for integration availability
- Clear documentation about optional features
- Integration adapters with null implementations
System Architecture
High-Level Component Diagram (C4 Level 2)
┌─────────────────────────────────────────────────────────────────┐
│ Agentic-Synth │
│ │
│ ┌──────────────┐ ┌─────────────────┐ │
│ │ CLI │ │ SDK │ │
│ │ (Commander) │◄──────────────────────────► (Public API) │ │
│ └──────┬───────┘ └────────┬────────┘ │
│ │ │ │
│ └────────────────────┬───────────────────────┘ │
│ │ │
│ ┌─────────▼────────┐ │
│ │ Core Engine │ │
│ │ │ │
│ │ - Generator Hub │ │
│ │ - Model Router │ │
│ │ - Cache Manager │ │
│ │ - Config System │ │
│ └─────────┬────────┘ │
│ │ │
│ ┌────────────────────┼────────────────────┐ │
│ │ │ │ │
│ ┌────▼─────┐ ┌──────▼──────┐ ┌─────▼──────┐ │
│ │Generator │ │ Models │ │Integration │ │
│ │ System │ │ System │ │ Adapters │ │
│ │ │ │ │ │ │ │
│ │-TimeSeries│ │- Gemini │ │-Midstreamer│ │
│ │-Events │ │- OpenRouter │ │-Robotics │ │
│ │-Structured│ │- Router │ │-Ruvector │ │
│ └──────────┘ └─────────────┘ └────────────┘ │
└───────────────────────────────────────────────────────────────┘
│ │ │
▼ ▼ ▼
┌─────────────┐ ┌──────────────┐ ┌──────────────────┐
│ Output │ │ AI APIs │ │ External │
│ (Streams) │ │ │ │ Integrations │
│ │ │ - Gemini API │ │ │
│ - JSON │ │ - OpenRouter │ │ - Midstreamer │
│ - CSV │ │ │ │ - Agentic-Robot │
│ - Parquet │ └──────────────┘ │ - Ruvector DB │
└─────────────┘ └──────────────────┘
Data Flow Diagram
┌─────────┐
│ User │
└────┬────┘
│
│ (CLI Command or SDK Call)
▼
┌─────────────────┐
│ Request Parser │ ──► Validate schema, parse options
└────┬────────────┘
│
▼
┌─────────────────┐
│ Generator Hub │ ──► Select appropriate generator
└────┬────────────┘
│
▼
┌─────────────────┐
│ Model Router │ ──► Choose AI model (Gemini/OpenRouter)
└────┬────────────┘
│
├──► Check Cache ─────► Cache Hit? ─────► Return cached
│ │
│ │ (Miss)
▼ ▼
┌─────────────────┐ ┌──────────────────┐
│ AI Provider │───►│ Context Builder │
│ (Gemini/OR) │ │ (Prompt + Schema)│
└────┬────────────┘ └──────────────────┘
│
│ (Generated Data)
▼
┌─────────────────┐
│ Post-Processor │ ──► Validate, transform, format
└────┬────────────┘
│
├──► Store in Cache
│
├──► Stream via Midstreamer (if enabled)
│
├──► Store in Ruvector (if enabled)
│
▼
┌─────────────────┐
│ Output Handler │ ──► JSON/CSV/Parquet/Stream
└─────────────────┘
Core Components
1. Generator System
Purpose: Generate different types of synthetic data.
Components:
TimeSeriesGenerator: Generate time-series data with trends, seasonality, noiseEventGenerator: Generate event streams with timestamps and metadataStructuredGenerator: Generate structured records (JSON, tables)CustomGenerator: Base class for user-defined generators
Interface:
interface Generator<T = any> {
readonly type: string;
readonly schema: z.ZodSchema<T>;
generate(options: GenerateOptions): Promise<T>;
generateBatch(count: number, options: GenerateOptions): AsyncIterator<T>;
validate(data: unknown): T;
}
2. Model System
Purpose: Interface with AI providers for data generation.
Components:
GeminiProvider: Google Gemini API integrationOpenRouterProvider: OpenRouter API integrationModelRouter: Smart routing between providersContextCache: Cache prompts and responses
Interface:
interface ModelProvider {
readonly name: string;
readonly supportedModels: string[];
generate(prompt: string, options: ModelOptions): Promise<string>;
generateStream(prompt: string, options: ModelOptions): AsyncIterator<string>;
getCost(model: string, tokens: number): number;
}
3. Cache Manager
Purpose: Cache API responses and context without Redis.
Strategy:
- In-memory LRU cache (configurable size)
- Optional file-based persistence
- Content-based cache keys (hash of prompt + options)
- TTL support
Implementation:
class CacheManager {
private cache: Map<string, CacheEntry>;
private maxSize: number;
private ttl: number;
get(key: string): CacheEntry | undefined;
set(key: string, value: any, ttl?: number): void;
clear(): void;
persist(path: string): Promise<void>;
restore(path: string): Promise<void>;
}
4. Integration Adapters
Purpose: Optional integrations with external tools.
Adapters:
MidstreamerAdapter
interface MidstreamerAdapter {
isAvailable(): boolean;
stream(data: AsyncIterator<any>): Promise<void>;
createPipeline(config: PipelineConfig): StreamPipeline;
}
AgenticRoboticsAdapter
interface AgenticRoboticsAdapter {
isAvailable(): boolean;
registerWorkflow(name: string, generator: Generator): void;
triggerWorkflow(name: string, options: any): Promise<void>;
}
RuvectorAdapter
interface RuvectorAdapter {
isAvailable(): boolean;
store(data: any, metadata?: any): Promise<string>;
search(query: any, limit?: number): Promise<any[]>;
}
API Design
SDK API
Basic Usage
import { AgenticSynth, TimeSeriesGenerator } from 'agentic-synth';
// Initialize
const synth = new AgenticSynth({
apiKeys: {
gemini: process.env.GEMINI_API_KEY,
openRouter: process.env.OPENROUTER_API_KEY
},
cache: {
enabled: true,
maxSize: 1000,
ttl: 3600000 // 1 hour
}
});
// Generate time-series data
const data = await synth.generate('timeseries', {
count: 1000,
schema: {
timestamp: 'datetime',
temperature: { type: 'number', min: -20, max: 40 },
humidity: { type: 'number', min: 0, max: 100 }
},
model: 'gemini-pro'
});
// Stream generation
for await (const record of synth.generateStream('events', options)) {
console.log(record);
}
Advanced Usage with Integrations
import { AgenticSynth } from 'agentic-synth';
import { enableMidstreamer, enableRuvector } from 'agentic-synth/integrations';
const synth = new AgenticSynth({
apiKeys: { ... }
});
// Enable optional integrations
enableMidstreamer(synth, {
pipeline: 'synthetic-data-stream'
});
enableRuvector(synth, {
dbPath: './data/vectors.db'
});
// Generate and automatically stream + store
await synth.generate('structured', {
count: 10000,
stream: true, // Auto-streams via midstreamer
vectorize: true // Auto-stores in ruvector
});
CLI API
Basic Commands
# Generate time-series data
npx agentic-synth generate timeseries \
--count 1000 \
--schema ./schema.json \
--output data.json \
--model gemini-pro
# Generate events
npx agentic-synth generate events \
--count 5000 \
--rate 100/sec \
--stream \
--output events.jsonl
# Generate structured data
npx agentic-synth generate structured \
--schema ./user-schema.json \
--count 10000 \
--format csv \
--output users.csv
Advanced Commands
# With model routing
npx agentic-synth generate timeseries \
--count 1000 \
--auto-route \
--fallback gemini-pro,gpt-4 \
--budget 0.10
# With integrations
npx agentic-synth generate events \
--count 10000 \
--stream-to midstreamer \
--vectorize-with ruvector \
--cache-policy aggressive
# Batch generation
npx agentic-synth batch generate \
--config ./batch-config.yaml \
--parallel 4 \
--output ./output-dir/
Configuration System
Configuration File Format (.agentic-synth.json)
{
"apiKeys": {
"gemini": "${GEMINI_API_KEY}",
"openRouter": "${OPENROUTER_API_KEY}"
},
"cache": {
"enabled": true,
"maxSize": 1000,
"ttl": 3600000,
"persistPath": "./.cache/agentic-synth"
},
"models": {
"routing": {
"strategy": "cost-optimized",
"fallbackChain": ["gemini-pro", "gpt-4", "claude-3"]
},
"defaults": {
"timeseries": "gemini-pro",
"events": "gpt-4-turbo",
"structured": "claude-3-sonnet"
}
},
"integrations": {
"midstreamer": {
"enabled": true,
"defaultPipeline": "synthetic-data"
},
"agenticRobotics": {
"enabled": false
},
"ruvector": {
"enabled": true,
"dbPath": "./data/vectors.db"
}
},
"generators": {
"timeseries": {
"defaultSampleRate": "1s",
"defaultDuration": "1h"
},
"events": {
"defaultRate": "100/sec"
}
}
}
Technology Stack
Core Dependencies
- TypeScript 5.7+: Type safety and modern JavaScript features
- Zod 3.23+: Runtime schema validation
- Commander 12+: CLI framework
- Winston 3+: Logging system
AI Provider SDKs
- @google/generative-ai: Gemini API integration
- openai: OpenRouter API (compatible with OpenAI SDK)
Optional Integrations
- midstreamer: Streaming data pipelines
- agentic-robotics: Automation workflows
- ruvector: Vector database for embeddings
Development Tools
- Vitest: Testing framework
- ESLint: Linting
- Prettier: Code formatting
Performance Considerations
Context Caching Strategy
- Cache Key Generation: Hash of (prompt template + schema + model options)
- Cache Storage: In-memory Map with LRU eviction
- Cache Persistence: Optional file-based storage for session continuity
- Cache Invalidation: TTL-based + manual invalidation API
Model Selection Optimization
- Cost-Based Routing: Select cheapest model that meets requirements
- Performance-Based Routing: Select fastest model
- Quality-Based Routing: Select highest quality model
- Hybrid Routing: Balance cost/performance/quality
Memory Management
- Streaming generation for large datasets (avoid loading all in memory)
- Chunked processing for batch operations
- Configurable batch sizes
- Memory-efficient serialization formats (JSONL, Parquet)
Security Considerations
API Key Management
- Environment variable loading via dotenv
- Config file with environment variable substitution
- Never log API keys
- Secure storage in config files (encrypted or gitignored)
Data Validation
- Input validation using Zod schemas
- Output validation before returning to user
- Sanitization of AI-generated content
- Rate limiting for API calls
Error Handling
- Graceful degradation on provider failures
- Automatic retry with exponential backoff
- Detailed error logging without sensitive data
- User-friendly error messages
Testing Strategy
Unit Tests
- Individual generator tests
- Model provider mocks
- Cache manager tests
- Integration adapter tests (with mocks)
Integration Tests
- End-to-end generation workflows
- Real API calls (with test API keys)
- Integration with midstreamer/robotics (optional)
- CLI command tests
Performance Tests
- Benchmark generation speed
- Memory usage profiling
- Cache hit rate analysis
- Model routing efficiency
Deployment & Distribution
NPM Package
- Published as
agentic-synth - Dual CJS/ESM support (via tsconfig)
- Tree-shakeable exports
- Type definitions included
CLI Distribution
- Available via
npx agentic-synth - Self-contained executable (includes dependencies)
- Automatic updates via npm
Documentation
- README.md: Quick start guide
- API.md: Complete SDK reference
- CLI.md: Command-line reference
- EXAMPLES.md: Common use cases
- INTEGRATIONS.md: Optional integration guides
Future Enhancements
Phase 2 Features
- Support for more AI providers (Anthropic, Cohere, local models)
- Advanced schema generation from examples
- Multi-modal data generation (text + images)
- Distributed generation across multiple nodes
- Web UI for visual data generation
Phase 3 Features
- Real-time data generation with WebSocket support
- Integration with data orchestration platforms (Airflow, Prefect)
- Custom model fine-tuning for domain-specific data
- Data quality metrics and validation
- Automated testing dataset generation
Conclusion
This architecture provides a solid foundation for agentic-synth as a flexible, performant, and extensible synthetic data generation tool. The dual CLI/SDK interface, optional integrations, and plugin-based architecture ensure it can serve a wide range of use cases while remaining simple for basic usage.