Files

ruv cd5943df23 Merge commit 'd803bfe2b1fe7f5e219e50ac20d6801a0a58ac75' as 'vendor/ruvector'

2026-02-28 14:39:40 -05:00

20 KiB

Raw Blame History

Agentic-Synth Architecture

Overview

Agentic-Synth is a TypeScript-based synthetic data generation package that provides both CLI and SDK interfaces for generating time-series, events, and structured data using AI models (Gemini and OpenRouter APIs). It integrates seamlessly with midstreamer for streaming and agentic-robotics for automation workflows.

Architecture Decision Records

ADR-001: TypeScript with ESM Modules

Status: Accepted

Context:

Need modern JavaScript/TypeScript support
Integration with Node.js ecosystem
Support for both CLI and SDK usage
Future-proof module system

Decision: Use TypeScript with ESM (ECMAScript Modules) as the primary module system.

Rationale:

ESM is the standard for modern JavaScript
Better tree-shaking and optimization
Native TypeScript support
Aligns with Node.js 18+ best practices

Consequences:

Requires Node.js 18+
All imports must use .js extensions in output
Better interoperability with modern tools

ADR-002: No Redis Dependency

Status: Accepted

Context:

Need caching for context and API responses
Avoid external service dependencies
Simplify deployment and usage

Decision: Use in-memory caching with LRU (Least Recently Used) strategy and optional file-based persistence.

Rationale:

Simpler deployment (no Redis server needed)
Faster for most use cases (in-process memory)
File-based persistence for session continuity
Optional integration with ruvector for advanced caching

Consequences:

Cache doesn't survive process restart (unless persisted to file)
Memory-limited (configurable max size)
Single-process only (no distributed caching)

ADR-003: Dual Interface (CLI + SDK)

Status: Accepted

Context:

Need both programmatic access and command-line usage
Different user personas (developers vs. operators)
Consistent behavior across interfaces

Decision: Implement core logic in SDK with CLI as a thin wrapper.

Rationale:

Single source of truth for logic
CLI uses SDK internally
Easy to test and maintain
Clear separation of concerns

Consequences:

SDK must be feature-complete
CLI is primarily for ergonomics
Documentation needed for both interfaces

ADR-004: Model Router Architecture

Status: Accepted

Context:

Support multiple AI providers (Gemini, OpenRouter)
Different models for different data types
Cost optimization and fallback strategies

Decision: Implement a model router that selects appropriate models based on data type, cost, and availability.

Rationale:

Flexibility in model selection
Automatic fallback on failures
Cost optimization through smart routing
Provider-agnostic interface

Consequences:

More complex routing logic
Need configuration for routing rules
Performance monitoring required

ADR-005: Plugin Architecture for Generators

Status: Accepted

Context:

Different data types need different generation strategies
Extensibility for custom generators
Community contributions

Decision: Use a plugin-based architecture where each data type has a dedicated generator.

Rationale:

Clear separation of concerns
Easy to add new data types
Testable in isolation
Community can contribute generators

Consequences:

Need generator registration system
Consistent generator interface
Documentation for custom generators

ADR-006: Optional Integration Pattern

Status: Accepted

Context:

Integration with midstreamer, agentic-robotics, and ruvector
Not all users need all integrations
Avoid mandatory dependencies

Decision: Use optional peer dependencies with runtime detection.

Rationale:

Lighter install for basic usage
Pay-as-you-go complexity
Clear integration boundaries
Graceful degradation

Consequences:

Runtime checks for integration availability
Clear documentation about optional features
Integration adapters with null implementations

System Architecture

High-Level Component Diagram (C4 Level 2)

┌─────────────────────────────────────────────────────────────────┐
│                         Agentic-Synth                            │
│                                                                   │
│  ┌──────────────┐                           ┌─────────────────┐  │
│  │     CLI      │                           │       SDK       │  │
│  │  (Commander) │◄──────────────────────────►  (Public API)  │  │
│  └──────┬───────┘                           └────────┬────────┘  │
│         │                                            │           │
│         └────────────────────┬───────────────────────┘           │
│                              │                                   │
│                    ┌─────────▼────────┐                          │
│                    │   Core Engine    │                          │
│                    │                  │                          │
│                    │  - Generator Hub │                          │
│                    │  - Model Router  │                          │
│                    │  - Cache Manager │                          │
│                    │  - Config System │                          │
│                    └─────────┬────────┘                          │
│                              │                                   │
│         ┌────────────────────┼────────────────────┐              │
│         │                    │                    │              │
│    ┌────▼─────┐      ┌──────▼──────┐      ┌─────▼──────┐       │
│    │Generator │      │   Models    │      │Integration │       │
│    │  System  │      │   System    │      │  Adapters  │       │
│    │          │      │             │      │            │       │
│    │-TimeSeries│     │- Gemini     │      │-Midstreamer│       │
│    │-Events    │     │- OpenRouter │      │-Robotics   │       │
│    │-Structured│     │- Router     │      │-Ruvector   │       │
│    └──────────┘      └─────────────┘      └────────────┘       │
└───────────────────────────────────────────────────────────────┘
         │                     │                       │
         ▼                     ▼                       ▼
┌─────────────┐      ┌──────────────┐       ┌──────────────────┐
│   Output    │      │   AI APIs    │       │    External      │
│  (Streams)  │      │              │       │  Integrations    │
│             │      │ - Gemini API │       │                  │
│ - JSON      │      │ - OpenRouter │       │ - Midstreamer    │
│ - CSV       │      │              │       │ - Agentic-Robot  │
│ - Parquet   │      └──────────────┘       │ - Ruvector DB    │
└─────────────┘                             └──────────────────┘

Data Flow Diagram

┌─────────┐
│  User   │
└────┬────┘
     │
     │ (CLI Command or SDK Call)
     ▼
┌─────────────────┐
│ Request Parser  │ ──► Validate schema, parse options
└────┬────────────┘
     │
     ▼
┌─────────────────┐
│ Generator Hub   │ ──► Select appropriate generator
└────┬────────────┘
     │
     ▼
┌─────────────────┐
│  Model Router   │ ──► Choose AI model (Gemini/OpenRouter)
└────┬────────────┘
     │
     ├──► Check Cache ─────► Cache Hit? ─────► Return cached
     │                            │
     │                            │ (Miss)
     ▼                            ▼
┌─────────────────┐    ┌──────────────────┐
│   AI Provider   │───►│ Context Builder  │
│  (Gemini/OR)    │    │ (Prompt + Schema)│
└────┬────────────┘    └──────────────────┘
     │
     │ (Generated Data)
     ▼
┌─────────────────┐
│  Post-Processor │ ──► Validate, transform, format
└────┬────────────┘
     │
     ├──► Store in Cache
     │
     ├──► Stream via Midstreamer (if enabled)
     │
     ├──► Store in Ruvector (if enabled)
     │
     ▼
┌─────────────────┐
│ Output Handler  │ ──► JSON/CSV/Parquet/Stream
└─────────────────┘

Core Components

1. Generator System

Purpose: Generate different types of synthetic data.

Components:

TimeSeriesGenerator: Generate time-series data with trends, seasonality, noise
EventGenerator: Generate event streams with timestamps and metadata
StructuredGenerator: Generate structured records (JSON, tables)
CustomGenerator: Base class for user-defined generators

Interface:

interface Generator<T = any> {
  readonly type: string;
  readonly schema: z.ZodSchema<T>;

  generate(options: GenerateOptions): Promise<T>;
  generateBatch(count: number, options: GenerateOptions): AsyncIterator<T>;
  validate(data: unknown): T;
}

2. Model System

Purpose: Interface with AI providers for data generation.

Components:

GeminiProvider: Google Gemini API integration
OpenRouterProvider: OpenRouter API integration
ModelRouter: Smart routing between providers
ContextCache: Cache prompts and responses

Interface:

interface ModelProvider {
  readonly name: string;
  readonly supportedModels: string[];

  generate(prompt: string, options: ModelOptions): Promise<string>;
  generateStream(prompt: string, options: ModelOptions): AsyncIterator<string>;
  getCost(model: string, tokens: number): number;
}

3. Cache Manager

Purpose: Cache API responses and context without Redis.

Strategy:

In-memory LRU cache (configurable size)
Optional file-based persistence
Content-based cache keys (hash of prompt + options)
TTL support

Implementation:

class CacheManager {
  private cache: Map<string, CacheEntry>;
  private maxSize: number;
  private ttl: number;

  get(key: string): CacheEntry | undefined;
  set(key: string, value: any, ttl?: number): void;
  clear(): void;
  persist(path: string): Promise<void>;
  restore(path: string): Promise<void>;
}

4. Integration Adapters

Purpose: Optional integrations with external tools.

Adapters:

MidstreamerAdapter

interface MidstreamerAdapter {
  isAvailable(): boolean;
  stream(data: AsyncIterator<any>): Promise<void>;
  createPipeline(config: PipelineConfig): StreamPipeline;
}

AgenticRoboticsAdapter

interface AgenticRoboticsAdapter {
  isAvailable(): boolean;
  registerWorkflow(name: string, generator: Generator): void;
  triggerWorkflow(name: string, options: any): Promise<void>;
}

RuvectorAdapter

interface RuvectorAdapter {
  isAvailable(): boolean;
  store(data: any, metadata?: any): Promise<string>;
  search(query: any, limit?: number): Promise<any[]>;
}

API Design

SDK API

Basic Usage

import { AgenticSynth, TimeSeriesGenerator } from 'agentic-synth';

// Initialize
const synth = new AgenticSynth({
  apiKeys: {
    gemini: process.env.GEMINI_API_KEY,
    openRouter: process.env.OPENROUTER_API_KEY
  },
  cache: {
    enabled: true,
    maxSize: 1000,
    ttl: 3600000 // 1 hour
  }
});

// Generate time-series data
const data = await synth.generate('timeseries', {
  count: 1000,
  schema: {
    timestamp: 'datetime',
    temperature: { type: 'number', min: -20, max: 40 },
    humidity: { type: 'number', min: 0, max: 100 }
  },
  model: 'gemini-pro'
});

// Stream generation
for await (const record of synth.generateStream('events', options)) {
  console.log(record);
}

Advanced Usage with Integrations

import { AgenticSynth } from 'agentic-synth';
import { enableMidstreamer, enableRuvector } from 'agentic-synth/integrations';

const synth = new AgenticSynth({
  apiKeys: { ... }
});

// Enable optional integrations
enableMidstreamer(synth, {
  pipeline: 'synthetic-data-stream'
});

enableRuvector(synth, {
  dbPath: './data/vectors.db'
});

// Generate and automatically stream + store
await synth.generate('structured', {
  count: 10000,
  stream: true,        // Auto-streams via midstreamer
  vectorize: true      // Auto-stores in ruvector
});

CLI API

Basic Commands

# Generate time-series data
npx agentic-synth generate timeseries \
  --count 1000 \
  --schema ./schema.json \
  --output data.json \
  --model gemini-pro

# Generate events
npx agentic-synth generate events \
  --count 5000 \
  --rate 100/sec \
  --stream \
  --output events.jsonl

# Generate structured data
npx agentic-synth generate structured \
  --schema ./user-schema.json \
  --count 10000 \
  --format csv \
  --output users.csv

Advanced Commands

# With model routing
npx agentic-synth generate timeseries \
  --count 1000 \
  --auto-route \
  --fallback gemini-pro,gpt-4 \
  --budget 0.10

# With integrations
npx agentic-synth generate events \
  --count 10000 \
  --stream-to midstreamer \
  --vectorize-with ruvector \
  --cache-policy aggressive

# Batch generation
npx agentic-synth batch generate \
  --config ./batch-config.yaml \
  --parallel 4 \
  --output ./output-dir/

Configuration System

Configuration File Format (.agentic-synth.json)

{
  "apiKeys": {
    "gemini": "${GEMINI_API_KEY}",
    "openRouter": "${OPENROUTER_API_KEY}"
  },
  "cache": {
    "enabled": true,
    "maxSize": 1000,
    "ttl": 3600000,
    "persistPath": "./.cache/agentic-synth"
  },
  "models": {
    "routing": {
      "strategy": "cost-optimized",
      "fallbackChain": ["gemini-pro", "gpt-4", "claude-3"]
    },
    "defaults": {
      "timeseries": "gemini-pro",
      "events": "gpt-4-turbo",
      "structured": "claude-3-sonnet"
    }
  },
  "integrations": {
    "midstreamer": {
      "enabled": true,
      "defaultPipeline": "synthetic-data"
    },
    "agenticRobotics": {
      "enabled": false
    },
    "ruvector": {
      "enabled": true,
      "dbPath": "./data/vectors.db"
    }
  },
  "generators": {
    "timeseries": {
      "defaultSampleRate": "1s",
      "defaultDuration": "1h"
    },
    "events": {
      "defaultRate": "100/sec"
    }
  }
}

Technology Stack

Core Dependencies

TypeScript 5.7+: Type safety and modern JavaScript features
Zod 3.23+: Runtime schema validation
Commander 12+: CLI framework
Winston 3+: Logging system

AI Provider SDKs

@google/generative-ai: Gemini API integration
openai: OpenRouter API (compatible with OpenAI SDK)

Optional Integrations

midstreamer: Streaming data pipelines
agentic-robotics: Automation workflows
ruvector: Vector database for embeddings

Development Tools

Vitest: Testing framework
ESLint: Linting
Prettier: Code formatting

Performance Considerations

Context Caching Strategy

Cache Key Generation: Hash of (prompt template + schema + model options)
Cache Storage: In-memory Map with LRU eviction
Cache Persistence: Optional file-based storage for session continuity
Cache Invalidation: TTL-based + manual invalidation API

Model Selection Optimization

Cost-Based Routing: Select cheapest model that meets requirements
Performance-Based Routing: Select fastest model
Quality-Based Routing: Select highest quality model
Hybrid Routing: Balance cost/performance/quality

Memory Management

Streaming generation for large datasets (avoid loading all in memory)
Chunked processing for batch operations
Configurable batch sizes
Memory-efficient serialization formats (JSONL, Parquet)

Security Considerations

API Key Management

Environment variable loading via dotenv
Config file with environment variable substitution
Never log API keys
Secure storage in config files (encrypted or gitignored)

Data Validation

Input validation using Zod schemas
Output validation before returning to user
Sanitization of AI-generated content
Rate limiting for API calls

Error Handling

Graceful degradation on provider failures
Automatic retry with exponential backoff
Detailed error logging without sensitive data
User-friendly error messages

Testing Strategy

Unit Tests

Individual generator tests
Model provider mocks
Cache manager tests
Integration adapter tests (with mocks)

Integration Tests

End-to-end generation workflows
Real API calls (with test API keys)
Integration with midstreamer/robotics (optional)
CLI command tests

Performance Tests

Benchmark generation speed
Memory usage profiling
Cache hit rate analysis
Model routing efficiency

Deployment & Distribution

NPM Package

Published as agentic-synth
Dual CJS/ESM support (via tsconfig)
Tree-shakeable exports
Type definitions included

CLI Distribution

Available via npx agentic-synth
Self-contained executable (includes dependencies)
Automatic updates via npm

Documentation

README.md: Quick start guide
API.md: Complete SDK reference
CLI.md: Command-line reference
EXAMPLES.md: Common use cases
INTEGRATIONS.md: Optional integration guides

Future Enhancements

Phase 2 Features

Support for more AI providers (Anthropic, Cohere, local models)
Advanced schema generation from examples
Multi-modal data generation (text + images)
Distributed generation across multiple nodes
Web UI for visual data generation

Phase 3 Features

Real-time data generation with WebSocket support
Integration with data orchestration platforms (Airflow, Prefect)
Custom model fine-tuning for domain-specific data
Data quality metrics and validation
Automated testing dataset generation

Conclusion

This architecture provides a solid foundation for agentic-synth as a flexible, performant, and extensible synthetic data generation tool. The dual CLI/SDK interface, optional integrations, and plugin-based architecture ensure it can serve a wide range of use cases while remaining simple for basic usage.

20 KiB Raw Blame History

Agentic-Synth Architecture

Overview

Architecture Decision Records

ADR-001: TypeScript with ESM Modules

ADR-002: No Redis Dependency

ADR-003: Dual Interface (CLI + SDK)

ADR-004: Model Router Architecture

ADR-005: Plugin Architecture for Generators

ADR-006: Optional Integration Pattern

System Architecture

High-Level Component Diagram (C4 Level 2)

Data Flow Diagram

Core Components

1. Generator System

2. Model System

3. Cache Manager

4. Integration Adapters

MidstreamerAdapter

AgenticRoboticsAdapter

RuvectorAdapter

API Design

SDK API

Basic Usage

Advanced Usage with Integrations

CLI API

Basic Commands

Advanced Commands

Configuration System

Configuration File Format (.agentic-synth.json)

Technology Stack

Core Dependencies

AI Provider SDKs

Optional Integrations

Development Tools

Performance Considerations

Context Caching Strategy

Model Selection Optimization

Memory Management

Security Considerations

API Key Management

Data Validation

Error Handling

Testing Strategy

Unit Tests

Integration Tests

Performance Tests

Deployment & Distribution

NPM Package

CLI Distribution

Documentation

Future Enhancements

Phase 2 Features

Phase 3 Features

Conclusion

20 KiB

Raw Blame History