Merge commit 'd803bfe2b1fe7f5e219e50ac20d6801a0a58ac75' as 'vendor/ruvector'

2026-02-28 14:39:40 -05:00
parent 7885bf6278 d803bfe2b1
commit cd5943df23
7854 changed files with 3522914 additions and 0 deletions
--- a/vendor/ruvector/npm/packages/agentic-synth/docs/ARCHITECTURE.md
+++ b/vendor/ruvector/npm/packages/agentic-synth/docs/ARCHITECTURE.md
@@ -0,0 +1,644 @@
+# Agentic-Synth Architecture
+
+## Overview
+
+Agentic-Synth is a TypeScript synthetic data generation package that exposes both a CLI and an SDK for generating time-series, event, and structured data with AI models (via the Gemini and OpenRouter APIs). It optionally integrates with midstreamer for streaming, agentic-robotics for automation workflows, and ruvector for vector storage.
+
+## Architecture Decision Records
+
+### ADR-001: TypeScript with ESM Modules
+
+**Status:** Accepted
+
+**Context:**
+- Need modern JavaScript/TypeScript support
+- Integration with Node.js ecosystem
+- Support for both CLI and SDK usage
+- Future-proof module system
+
+**Decision:**
+Use TypeScript with ESM (ECMAScript Modules) as the primary module system.
+
+**Rationale:**
+- ESM is the standard for modern JavaScript
+- Better tree-shaking and optimization
+- Native TypeScript support
+- Aligns with Node.js 18+ best practices
+
+**Consequences:**
+- Requires Node.js 18+
+- Relative imports must be written with `.js` extensions so that the compiled output resolves under Node's ESM loader
+- Better interoperability with modern tools
+
+---
+
+### ADR-002: No Redis Dependency
+
+**Status:** Accepted
+
+**Context:**
+- Need caching for context and API responses
+- Avoid external service dependencies
+- Simplify deployment and usage
+
+**Decision:**
+Use in-memory caching with LRU (Least Recently Used) strategy and optional file-based persistence.
+
+**Rationale:**
+- Simpler deployment (no Redis server needed)
+- Faster for most use cases (in-process memory)
+- File-based persistence for session continuity
+- Optional integration with ruvector for advanced caching
+
+**Consequences:**
+- Cache doesn't survive process restart (unless persisted to file)
+- Memory-limited (configurable max size)
+- Single-process only (no distributed caching)
+
+---
+
+### ADR-003: Dual Interface (CLI + SDK)
+
+**Status:** Accepted
+
+**Context:**
+- Need both programmatic access and command-line usage
+- Different user personas (developers vs. operators)
+- Consistent behavior across interfaces
+
+**Decision:**
+Implement core logic in SDK with CLI as a thin wrapper.
+
+**Rationale:**
+- Single source of truth for logic
+- CLI uses SDK internally
+- Easy to test and maintain
+- Clear separation of concerns
+
+**Consequences:**
+- SDK must be feature-complete
+- CLI is primarily for ergonomics
+- Documentation needed for both interfaces
+
+---
+
+### ADR-004: Model Router Architecture
+
+**Status:** Accepted
+
+**Context:**
+- Support multiple AI providers (Gemini, OpenRouter)
+- Different models for different data types
+- Cost optimization and fallback strategies
+
+**Decision:**
+Implement a model router that selects appropriate models based on data type, cost, and availability.
+
+**Rationale:**
+- Flexibility in model selection
+- Automatic fallback on failures
+- Cost optimization through smart routing
+- Provider-agnostic interface
+
+**Consequences:**
+- More complex routing logic
+- Need configuration for routing rules
+- Performance monitoring required
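+
+The fallback behaviour described in this ADR could be sketched roughly as follows. This is a minimal illustration, not the package's router: `RoutedProvider` and `generateWithFallback` are hypothetical names, and a real router would also weigh cost and data type.
+
+```typescript
+// Hypothetical provider shape for illustration only.
+interface RoutedProvider {
+  name: string;
+  generate(prompt: string): Promise<string>;
+}
+
+// Try each provider in order; return the first successful response.
+async function generateWithFallback(
+  chain: RoutedProvider[],
+  prompt: string,
+): Promise<{ provider: string; text: string }> {
+  const failures: string[] = [];
+  for (const provider of chain) {
+    try {
+      const text = await provider.generate(prompt);
+      return { provider: provider.name, text };
+    } catch (error) {
+      // Record the failure and move on to the next provider in the chain.
+      const reason = error instanceof Error ? error.message : String(error);
+      failures.push(`${provider.name}: ${reason}`);
+    }
+  }
+  throw new Error(`All providers failed (${failures.join('; ')})`);
+}
+```
+
+Collecting the failure reasons keeps the final error useful when every provider in the chain is down.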
+
+---
+
+### ADR-005: Plugin Architecture for Generators
+
+**Status:** Accepted
+
+**Context:**
+- Different data types need different generation strategies
+- Extensibility for custom generators
+- Community contributions
+
+**Decision:**
+Use a plugin-based architecture where each data type has a dedicated generator.
+
+**Rationale:**
+- Clear separation of concerns
+- Easy to add new data types
+- Testable in isolation
+- Community can contribute generators
+
+**Consequences:**
+- Need generator registration system
+- Consistent generator interface
+- Documentation for custom generators
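+
+One possible shape for the generator registration system mentioned above is a small registry keyed by data type. The names `SimpleGenerator` and `GeneratorRegistry` are assumptions made for this sketch and are not taken from the package.
+
+```typescript
+// Hypothetical, reduced generator shape used only in this sketch.
+interface SimpleGenerator<T> {
+  readonly type: string;
+  generate(): Promise<T>;
+}
+
+class GeneratorRegistry {
+  private readonly generators = new Map<string, SimpleGenerator<unknown>>();
+
+  // Register a generator under its declared type; duplicates are rejected.
+  register(generator: SimpleGenerator<unknown>): void {
+    if (this.generators.has(generator.type)) {
+      throw new Error(`Generator already registered: ${generator.type}`);
+    }
+    this.generators.set(generator.type, generator);
+  }
+
+  // Look up the generator for a data type, failing loudly if none exists.
+  resolve(type: string): SimpleGenerator<unknown> {
+    const generator = this.generators.get(type);
+    if (!generator) {
+      throw new Error(`No generator registered for type: ${type}`);
+    }
+    return generator;
+  }
+}
+```
+
+Rejecting duplicate registrations is a design choice of the sketch; a registry could equally allow overrides so that custom generators replace built-in ones.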
+
+---
+
+### ADR-006: Optional Integration Pattern
+
+**Status:** Accepted
+
+**Context:**
+- Integration with midstreamer, agentic-robotics, and ruvector
+- Not all users need all integrations
+- Avoid mandatory dependencies
+
+**Decision:**
+Use optional peer dependencies with runtime detection.
+
+**Rationale:**
+- Lighter install for basic usage
+- Pay-as-you-go complexity
+- Clear integration boundaries
+- Graceful degradation
+
+**Consequences:**
+- Runtime checks for integration availability
+- Clear documentation about optional features
+- Integration adapters with null implementations
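+
+Runtime detection with a null implementation might look like the sketch below. It assumes the optional package can be loaded with a dynamic `import()`; `StreamAdapter`, `nullAdapter`, and `loadAdapter` are illustrative names rather than the package's API.
+
+```typescript
+// Hypothetical adapter shape, following the isAvailable() convention used in this document.
+interface StreamAdapter {
+  isAvailable(): boolean;
+  send(record: unknown): Promise<void>;
+}
+
+// Null implementation: stands in when the optional package is not installed.
+const nullAdapter: StreamAdapter = {
+  isAvailable: () => false,
+  send: async () => {},
+};
+
+// Try to load an optional peer dependency at runtime and wrap it in an adapter.
+// Any load failure (for example, the package is missing) yields the null adapter.
+async function loadAdapter(
+  packageName: string,
+  wrap: (mod: unknown) => StreamAdapter,
+): Promise<StreamAdapter> {
+  try {
+    const mod: unknown = await import(packageName);
+    return wrap(mod);
+  } catch {
+    return nullAdapter;
+  }
+}
+```
+
+Callers can then invoke `send` unconditionally, or check `isAvailable()` first when they want to report that an integration is disabled.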
+
+## System Architecture
+
+### High-Level Component Diagram (C4 Level 2)
+
+```
+┌─────────────────────────────────────────────────────────────────┐
+│                         Agentic-Synth                            │
+│                                                                   │
+│  ┌──────────────┐                           ┌─────────────────┐  │
+│  │     CLI      │                           │       SDK       │  │
+│  │  (Commander) │◄──────────────────────────►  (Public API)  │  │
+│  └──────┬───────┘                           └────────┬────────┘  │
+│         │                                            │           │
+│         └────────────────────┬───────────────────────┘           │
+│                              │                                   │
+│                    ┌─────────▼────────┐                          │
+│                    │   Core Engine    │                          │
+│                    │                  │                          │
+│                    │  - Generator Hub │                          │
+│                    │  - Model Router  │                          │
+│                    │  - Cache Manager │                          │
+│                    │  - Config System │                          │
+│                    └─────────┬────────┘                          │
+│                              │                                   │
+│         ┌────────────────────┼────────────────────┐              │
+│         │                    │                    │              │
+│    ┌────▼─────┐      ┌──────▼──────┐      ┌─────▼──────┐       │
+│    │Generator │      │   Models    │      │Integration │       │
+│    │  System  │      │   System    │      │  Adapters  │       │
+│    │          │      │             │      │            │       │
+│    │-TimeSeries│     │- Gemini     │      │-Midstreamer│       │
+│    │-Events    │     │- OpenRouter │      │-Robotics   │       │
+│    │-Structured│     │- Router     │      │-Ruvector   │       │
+│    └──────────┘      └─────────────┘      └────────────┘       │
+└───────────────────────────────────────────────────────────────┘
+         │                     │                       │
+         ▼                     ▼                       ▼
+┌─────────────┐      ┌──────────────┐       ┌──────────────────┐
+│   Output    │      │   AI APIs    │       │    External      │
+│  (Streams)  │      │              │       │  Integrations    │
+│             │      │ - Gemini API │       │                  │
+│ - JSON      │      │ - OpenRouter │       │ - Midstreamer    │
+│ - CSV       │      │              │       │ - Agentic-Robot  │
+│ - Parquet   │      └──────────────┘       │ - Ruvector DB    │
+└─────────────┘                             └──────────────────┘
+```
+
+### Data Flow Diagram
+
+```
+┌─────────┐
+│  User   │
+└────┬────┘
+     │
+     │ (CLI Command or SDK Call)
+     ▼
+┌─────────────────┐
+│ Request Parser  │ ──► Validate schema, parse options
+└────┬────────────┘
+     │
+     ▼
+┌─────────────────┐
+│ Generator Hub   │ ──► Select appropriate generator
+└────┬────────────┘
+     │
+     ▼
+┌─────────────────┐
+│  Model Router   │ ──► Choose AI model (Gemini/OpenRouter)
+└────┬────────────┘
+     │
+     ├──► Check Cache ─────► Cache Hit? ─────► Return cached
+     │                            │
+     │                            │ (Miss)
+     ▼                            ▼
+┌─────────────────┐    ┌──────────────────┐
+│   AI Provider   │───►│ Context Builder  │
+│  (Gemini/OR)    │    │ (Prompt + Schema)│
+└────┬────────────┘    └──────────────────┘
+     │
+     │ (Generated Data)
+     ▼
+┌─────────────────┐
+│  Post-Processor │ ──► Validate, transform, format
+└────┬────────────┘
+     │
+     ├──► Store in Cache
+     │
+     ├──► Stream via Midstreamer (if enabled)
+     │
+     ├──► Store in Ruvector (if enabled)
+     │
+     ▼
+┌─────────────────┐
+│ Output Handler  │ ──► JSON/CSV/Parquet/Stream
+└─────────────────┘
+```
+
+## Core Components
+
+### 1. Generator System
+
+**Purpose:** Generate different types of synthetic data.
+
+**Components:**
+- `TimeSeriesGenerator`: Generate time-series data with trends, seasonality, noise
+- `EventGenerator`: Generate event streams with timestamps and metadata
+- `StructuredGenerator`: Generate structured records (JSON, tables)
+- `CustomGenerator`: Base class for user-defined generators
+
+**Interface:**
+```typescript
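+import { z } from 'zod';
+
+// GenerateOptions is not shown here.
+interface Generator<T = any> {
+  readonly type: string;            // data type handled, e.g. 'timeseries'
+  readonly schema: z.ZodSchema<T>;  // Zod schema describing one generated record
+
+  generate(options: GenerateOptions): Promise<T>;
+  generateBatch(count: number, options: GenerateOptions): AsyncIterator<T>;
+  validate(data: unknown): T;       // returns the typed record for valid input
+}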
+```
+
+### 2. Model System
+
+**Purpose:** Interface with AI providers for data generation.
+
+**Components:**
+- `GeminiProvider`: Google Gemini API integration
+- `OpenRouterProvider`: OpenRouter API integration
+- `ModelRouter`: Smart routing between providers
+- `ContextCache`: Cache prompts and responses
+
+**Interface:**
+```typescript
+interface ModelProvider {
+  readonly name: string;
+  readonly supportedModels: string[];
+
+  generate(prompt: string, options: ModelOptions): Promise<string>;
+  generateStream(prompt: string, options: ModelOptions): AsyncIterator<string>;
+  getCost(model: string, tokens: number): number;
+}
+```
+
+### 3. Cache Manager
+
+**Purpose:** Cache API responses and context without Redis.
+
+**Strategy:**
+- In-memory LRU cache (configurable size)
+- Optional file-based persistence
+- Content-based cache keys (hash of prompt + options)
+- TTL support
+
+**Implementation:**
+```typescript
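+// Declaration only: method bodies are omitted, and CacheEntry is not shown here.
+declare class CacheManager {
+  private cache: Map<string, CacheEntry>; // keyed by content hash
+  private maxSize: number;                // maximum number of entries
+  private ttl: number;                    // default time-to-live in milliseconds
+
+  get(key: string): CacheEntry | undefined;
+  set(key: string, value: any, ttl?: number): void;
+  clear(): void;
+  persist(path: string): Promise<void>;   // write the cache to a file
+  restore(path: string): Promise<void>;   // reload a previously persisted cache
+}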
+```
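+
+As a rough illustration of the LRU and TTL strategy listed above, the sketch below relies on the insertion order of a `Map` to track recency. `LruTtlCache` is a hypothetical stand-in and not the package's `CacheManager`; persistence is left out, and the `now` parameters exist only to make expiry easy to exercise.
+
+```typescript
+interface Entry<V> {
+  value: V;
+  expiresAt: number; // absolute time in milliseconds
+}
+
+class LruTtlCache<V> {
+  private readonly entries = new Map<string, Entry<V>>();
+
+  constructor(
+    private readonly maxSize: number,
+    private readonly ttlMs: number,
+  ) {}
+
+  get(key: string, now: number = Date.now()): V | undefined {
+    const entry = this.entries.get(key);
+    if (!entry) return undefined;
+    if (entry.expiresAt <= now) {
+      // Expired entries are dropped lazily on access.
+      this.entries.delete(key);
+      return undefined;
+    }
+    // Re-insert so the key becomes the most recently used.
+    this.entries.delete(key);
+    this.entries.set(key, entry);
+    return entry.value;
+  }
+
+  set(key: string, value: V, now: number = Date.now()): void {
+    this.entries.delete(key);
+    if (this.entries.size >= this.maxSize) {
+      // The first key in iteration order is the least recently used.
+      const oldest = this.entries.keys().next().value;
+      if (oldest !== undefined) this.entries.delete(oldest);
+    }
+    this.entries.set(key, { value, expiresAt: now + this.ttlMs });
+  }
+
+  get size(): number {
+    return this.entries.size;
+  }
+}
+```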
+
+### 4. Integration Adapters
+
+**Purpose:** Optional integrations with external tools.
+
+**Adapters:**
+
+#### MidstreamerAdapter
+```typescript
+interface MidstreamerAdapter {
+  isAvailable(): boolean;
+  stream(data: AsyncIterator<any>): Promise<void>;
+  createPipeline(config: PipelineConfig): StreamPipeline;
+}
+```
+
+#### AgenticRoboticsAdapter
+```typescript
+interface AgenticRoboticsAdapter {
+  isAvailable(): boolean;
+  registerWorkflow(name: string, generator: Generator): void;
+  triggerWorkflow(name: string, options: any): Promise<void>;
+}
+```
+
+#### RuvectorAdapter
+```typescript
+interface RuvectorAdapter {
+  isAvailable(): boolean;
+  store(data: any, metadata?: any): Promise<string>;
+  search(query: any, limit?: number): Promise<any[]>;
+}
+```
+
+## API Design
+
+### SDK API
+
+#### Basic Usage
+```typescript
+import { AgenticSynth } from 'agentic-synth';
+
+// Initialize
+const synth = new AgenticSynth({
+  apiKeys: {
+    gemini: process.env.GEMINI_API_KEY,
+    openRouter: process.env.OPENROUTER_API_KEY
+  },
+  cache: {
+    enabled: true,
+    maxSize: 1000,
+    ttl: 3600000 // 1 hour
+  }
+});
+
+// Generate time-series data
+const data = await synth.generate('timeseries', {
+  count: 1000,
+  schema: {
+    timestamp: 'datetime',
+    temperature: { type: 'number', min: -20, max: 40 },
+    humidity: { type: 'number', min: 0, max: 100 }
+  },
+  model: 'gemini-pro'
+});
+
+// Stream generation (the options object here is illustrative)
+for await (const record of synth.generateStream('events', { count: 100 })) {
+  console.log(record);
+}
+```
+
+#### Advanced Usage with Integrations
+```typescript
+import { AgenticSynth } from 'agentic-synth';
+import { enableMidstreamer, enableRuvector } from 'agentic-synth/integrations';
+
+const synth = new AgenticSynth({
+  apiKeys: { ... }
+});
+
+// Enable optional integrations
+enableMidstreamer(synth, {
+  pipeline: 'synthetic-data-stream'
+});
+
+enableRuvector(synth, {
+  dbPath: './data/vectors.db'
+});
+
+// Generate and automatically stream + store
+await synth.generate('structured', {
+  count: 10000,
+  stream: true,        // Auto-streams via midstreamer
+  vectorize: true      // Auto-stores in ruvector
+});
+```
+
+### CLI API
+
+#### Basic Commands
+```bash
+# Generate time-series data
+npx agentic-synth generate timeseries \
+  --count 1000 \
+  --schema ./schema.json \
+  --output data.json \
+  --model gemini-pro
+
+# Generate events
+npx agentic-synth generate events \
+  --count 5000 \
+  --rate 100/sec \
+  --stream \
+  --output events.jsonl
+
+# Generate structured data
+npx agentic-synth generate structured \
+  --schema ./user-schema.json \
+  --count 10000 \
+  --format csv \
+  --output users.csv
+```
+
+#### Advanced Commands
+```bash
+# With model routing
+npx agentic-synth generate timeseries \
+  --count 1000 \
+  --auto-route \
+  --fallback gemini-pro,gpt-4 \
+  --budget 0.10
+
+# With integrations
+npx agentic-synth generate events \
+  --count 10000 \
+  --stream-to midstreamer \
+  --vectorize-with ruvector \
+  --cache-policy aggressive
+
+# Batch generation
+npx agentic-synth batch generate \
+  --config ./batch-config.yaml \
+  --parallel 4 \
+  --output ./output-dir/
+```
+
+## Configuration System
+
+### Configuration File Format (.agentic-synth.json)
+
+```json
+{
+  "apiKeys": {
+    "gemini": "${GEMINI_API_KEY}",
+    "openRouter": "${OPENROUTER_API_KEY}"
+  },
+  "cache": {
+    "enabled": true,
+    "maxSize": 1000,
+    "ttl": 3600000,
+    "persistPath": "./.cache/agentic-synth"
+  },
+  "models": {
+    "routing": {
+      "strategy": "cost-optimized",
+      "fallbackChain": ["gemini-pro", "gpt-4", "claude-3"]
+    },
+    "defaults": {
+      "timeseries": "gemini-pro",
+      "events": "gpt-4-turbo",
+      "structured": "claude-3-sonnet"
+    }
+  },
+  "integrations": {
+    "midstreamer": {
+      "enabled": true,
+      "defaultPipeline": "synthetic-data"
+    },
+    "agenticRobotics": {
+      "enabled": false
+    },
+    "ruvector": {
+      "enabled": true,
+      "dbPath": "./data/vectors.db"
+    }
+  },
+  "generators": {
+    "timeseries": {
+      "defaultSampleRate": "1s",
+      "defaultDuration": "1h"
+    },
+    "events": {
+      "defaultRate": "100/sec"
+    }
+  }
+}
+```
+
+## Technology Stack
+
+### Core Dependencies
+- **TypeScript 5.7+**: Type safety and modern JavaScript features
+- **Zod 3.23+**: Runtime schema validation
+- **Commander 12+**: CLI framework
+- **Winston 3+**: Logging system
+
+### AI Provider SDKs
+- **@google/generative-ai**: Gemini API integration
+- **openai**: OpenRouter access through the OpenAI SDK (OpenRouter exposes an OpenAI-compatible API)
+
+### Optional Integrations
+- **midstreamer**: Streaming data pipelines
+- **agentic-robotics**: Automation workflows
+- **ruvector**: Vector database for embeddings
+
+### Development Tools
+- **Vitest**: Testing framework
+- **ESLint**: Linting
+- **Prettier**: Code formatting
+
+## Performance Considerations
+
+### Context Caching Strategy
+1. **Cache Key Generation**: Hash of (prompt template + schema + model options)
+2. **Cache Storage**: In-memory Map with LRU eviction
+3. **Cache Persistence**: Optional file-based storage for session continuity
+4. **Cache Invalidation**: TTL-based + manual invalidation API
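+
+Content-based key generation could be sketched as below, assuming SHA-256 from Node's built-in `crypto` module and JSON-like inputs. The helper names `stableStringify` and `cacheKey` are hypothetical. Object keys are sorted before hashing so that two logically equal option objects produce the same key.
+
+```typescript
+import { createHash } from 'node:crypto';
+
+// Serialize JSON-like values with sorted object keys for a deterministic result.
+function stableStringify(value: unknown): string {
+  if (value === undefined) return 'null';
+  if (Array.isArray(value)) {
+    return `[${value.map(stableStringify).join(',')}]`;
+  }
+  if (value !== null && typeof value === 'object') {
+    const obj = value as Record<string, unknown>;
+    const body = Object.keys(obj)
+      .sort()
+      .map((key) => `${JSON.stringify(key)}:${stableStringify(obj[key])}`)
+      .join(',');
+    return `{${body}}`;
+  }
+  return JSON.stringify(value);
+}
+
+// Hash of (prompt template + schema + model options), hex encoded.
+function cacheKey(promptTemplate: string, schema: unknown, modelOptions: unknown): string {
+  return createHash('sha256')
+    .update(stableStringify({ promptTemplate, schema, modelOptions }))
+    .digest('hex');
+}
+```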
+
+### Model Selection Optimization
+1. **Cost-Based Routing**: Select cheapest model that meets requirements
+2. **Performance-Based Routing**: Select fastest model
+3. **Quality-Based Routing**: Select highest quality model
+4. **Hybrid Routing**: Balance cost/performance/quality
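+
+Cost-based routing, the first strategy above, can be illustrated with a small selection function. The `ModelProfile` fields and the idea of a numeric quality threshold are assumptions of this sketch; the document does not specify how requirements are expressed.
+
+```typescript
+// Hypothetical model metadata; the field names are illustrative.
+interface ModelProfile {
+  name: string;
+  costPerMillionTokens: number;
+  qualityScore: number; // higher is better
+}
+
+// Pick the cheapest model whose quality meets the minimum requirement.
+function selectCheapest(
+  models: ModelProfile[],
+  minQuality: number,
+): ModelProfile | undefined {
+  return models
+    .filter((model) => model.qualityScore >= minQuality)
+    .reduce<ModelProfile | undefined>(
+      (best, model) =>
+        best === undefined || model.costPerMillionTokens < best.costPerMillionTokens
+          ? model
+          : best,
+      undefined,
+    );
+}
+```
+
+Returning `undefined` when no model qualifies leaves the caller free to relax the requirement or fall back to another strategy.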
+
+### Memory Management
+- Streaming generation for large datasets (avoid loading all in memory)
+- Chunked processing for batch operations
+- Configurable batch sizes
+- Memory-efficient serialization formats (JSONL, Parquet)
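+
+Chunked processing over a stream can be sketched with an async generator that buffers at most one chunk at a time. `inChunks` is a hypothetical helper, not part of the package.
+
+```typescript
+// Group items from an async source into arrays of at most `size` elements.
+async function* inChunks<T>(
+  source: AsyncIterable<T>,
+  size: number,
+): AsyncGenerator<T[]> {
+  if (!Number.isInteger(size) || size < 1) {
+    throw new Error('size must be a positive integer');
+  }
+  let chunk: T[] = [];
+  for await (const item of source) {
+    chunk.push(item);
+    if (chunk.length === size) {
+      yield chunk;
+      chunk = [];
+    }
+  }
+  // Emit the final, possibly shorter, chunk.
+  if (chunk.length > 0) yield chunk;
+}
+```
+
+Because only the current chunk is held in memory, peak usage depends on the chunk size rather than on the total number of records.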
+
+## Security Considerations
+
+### API Key Management
+- Environment variable loading via dotenv
+- Config file with environment variable substitution
+- Never log API keys
+- Secure storage in config files (encrypted or gitignored)
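+
+Environment variable substitution for placeholders such as `${GEMINI_API_KEY}` might be implemented along these lines. `substituteEnv` is a hypothetical helper, and throwing on a missing variable is a choice made for this sketch rather than documented behaviour.
+
+```typescript
+// Replace ${NAME} placeholders in a config string with values from `env`.
+function substituteEnv(
+  value: string,
+  env: Record<string, string | undefined>,
+): string {
+  return value.replace(/\$\{([A-Za-z_][A-Za-z0-9_]*)\}/g, (_match, name: string) => {
+    const resolved = env[name];
+    if (resolved === undefined) {
+      // Report only the variable name, never a secret value.
+      throw new Error(`Missing environment variable: ${name}`);
+    }
+    return resolved;
+  });
+}
+```
+
+In practice the second argument would typically be `process.env`.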
+
+### Data Validation
+- Input validation using Zod schemas
+- Output validation before returning to user
+- Sanitization of AI-generated content
+- Rate limiting for API calls
+
+### Error Handling
+- Graceful degradation on provider failures
+- Automatic retry with exponential backoff
+- Detailed error logging without sensitive data
+- User-friendly error messages
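+
+Automatic retry with exponential backoff can be sketched as follows. The function name, the default attempt count, and the base delay are assumptions for illustration; a production version would likely add jitter and retry only on transient errors.
+
+```typescript
+// Retry an async operation, doubling the delay after each failed attempt.
+async function retryWithBackoff<T>(
+  operation: () => Promise<T>,
+  maxAttempts = 3,
+  baseDelayMs = 100,
+): Promise<T> {
+  let lastError: unknown;
+  for (let attempt = 0; attempt < maxAttempts; attempt++) {
+    try {
+      return await operation();
+    } catch (error) {
+      lastError = error;
+      if (attempt < maxAttempts - 1) {
+        // Delays grow as baseDelayMs, 2 * baseDelayMs, 4 * baseDelayMs, ...
+        const delayMs = baseDelayMs * 2 ** attempt;
+        await new Promise<void>((resolve) => setTimeout(resolve, delayMs));
+      }
+    }
+  }
+  throw lastError;
+}
+```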
+
+## Testing Strategy
+
+### Unit Tests
+- Individual generator tests
+- Model provider mocks
+- Cache manager tests
+- Integration adapter tests (with mocks)
+
+### Integration Tests
+- End-to-end generation workflows
+- Real API calls (with test API keys)
+- Integration with midstreamer/robotics (optional)
+- CLI command tests
+
+### Performance Tests
+- Benchmark generation speed
+- Memory usage profiling
+- Cache hit rate analysis
+- Model routing efficiency
+
+## Deployment & Distribution
+
+### NPM Package
+- Published as `agentic-synth`
+- Dual CJS/ESM support (separate build outputs configured via tsconfig, with ESM as the primary format per ADR-001)
+- Tree-shakeable exports
+- Type definitions included
+
+### CLI Distribution
+- Available via `npx agentic-synth`
+- Self-contained executable (includes dependencies)
+- Automatic updates via npm
+
+### Documentation
+- README.md: Quick start guide
+- API.md: Complete SDK reference
+- CLI.md: Command-line reference
+- EXAMPLES.md: Common use cases
+- INTEGRATIONS.md: Optional integration guides
+
+## Future Enhancements
+
+### Phase 2 Features
+- Support for more AI providers (Anthropic, Cohere, local models)
+- Advanced schema generation from examples
+- Multi-modal data generation (text + images)
+- Distributed generation across multiple nodes
+- Web UI for visual data generation
+
+### Phase 3 Features
+- Real-time data generation with WebSocket support
+- Integration with data orchestration platforms (Airflow, Prefect)
+- Custom model fine-tuning for domain-specific data
+- Data quality metrics and validation
+- Automated testing dataset generation
+
+## Conclusion
+
+This architecture keeps agentic-synth simple for basic usage while leaving room to extend it: the SDK holds the core logic with the CLI as a thin wrapper, integrations stay optional, and new data types are added as generator plugins.