
ADR-004: Background Workers

Status: Accepted
Date: 2026-01-27
Decision Makers: RuVector Architecture Team
Technical Area: Infrastructure, Async Processing


Context and Problem Statement

RuvBot requires background processing for tasks that:

  1. Take too long for synchronous request handling (> 30s)
  2. Should be deferred to reduce response latency
  3. Need retry logic for unreliable operations
  4. Run on schedules for maintenance and optimization
  5. Process in batches for efficiency

Key use cases:

  • Memory consolidation: Merge episodic memories into semantic knowledge
  • Embedding generation: Batch process text to vectors
  • Pattern learning: Train on trajectories during off-peak hours
  • Session cleanup: Expire and archive old sessions
  • Index optimization: Rebalance HNSW indices
  • Webhook dispatch: Reliable delivery with retries

Decision Drivers

Reliability Requirements

| Requirement | Description |
| --- | --- |
| At-least-once delivery | Jobs must execute even after worker crashes |
| Idempotency | Safe to retry without side effects |
| Visibility | Monitor job status, progress, failures |
| Dead letter handling | Capture failed jobs for analysis |
| Graceful shutdown | Complete in-progress jobs before stopping |

Performance Requirements

| Metric | Target |
| --- | --- |
| Job pickup latency | < 100ms |
| Throughput | 1000 jobs/second |
| Concurrent workers | 50 per instance |
| Job timeout | Configurable, default 5 minutes |
| Retry delay | Exponential backoff, max 1 hour |
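
The retry-delay target maps naturally to a capped exponential schedule. A one-line sketch of that schedule (the doubling base and the 1-hour cap come from the table above, not from library defaults):

// delay(attempt) = min(base * 2^attempt, 1 hour)
function retryDelay(attempt: number, baseMs: number): number {
  return Math.min(baseMs * 2 ** attempt, 60 * 60 * 1000);
}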

Decision Outcome

Adopt BullMQ with Custom Worker Framework

We implement a worker framework on top of BullMQ (Redis-backed), integrated with the agentic-flow worker pattern.

+-----------------------------------------------------------------------------+
|                           WORKER ARCHITECTURE                                |
+-----------------------------------------------------------------------------+

                    +---------------------------+
                    |      Job Dispatcher       |
                    |   (Application Layer)     |
                    +-------------+-------------+
                                  |
                    +-------------v-------------+
                    |        BullMQ Queue       |
                    |   (Redis-backed storage)  |
                    +-------------+-------------+
                                  |
          +-----------------------+-----------------------+
          |                       |                       |
+---------v---------+   +---------v---------+   +---------v---------+
|   Worker Pool     |   |   Worker Pool     |   |   Worker Pool     |
|   (Instance 1)    |   |   (Instance 2)    |   |   (Instance N)    |
|-------------------|   |-------------------|   |-------------------|
| - memory-consol   |   | - embedding-batch |   | - webhook-disp    |
| - pattern-learn   |   | - session-cleanup |   | - index-optimize  |
+-------------------+   +-------------------+   +-------------------+
          |                       |                       |
          +-----------------------+-----------------------+
                                  |
                    +-------------v-------------+
                    |       Worker Monitor      |
                    | (Metrics, Alerts, Admin)  |
                    +---------------------------+
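
The dispatcher and every worker pool share a single Redis-backed queue. A minimal wiring sketch, assuming ioredis for the connection (the queue name "ruvbot-jobs" is illustrative):

import { Queue } from 'bullmq';
import IORedis from 'ioredis';

// Shared connection; BullMQ workers require maxRetriesPerRequest: null
const connection = new IORedis(process.env.REDIS_URL ?? 'redis://localhost:6379', {
  maxRetriesPerRequest: null,
});

// Single queue shared by the dispatcher and all worker pools
export const jobQueue = new Queue('ruvbot-jobs', { connection });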

Job Types

Core Worker Definitions

// Job type registry
const JOB_TYPES = {
  // Memory workers
  MEMORY_CONSOLIDATION: 'memory-consolidation',
  MEMORY_CLEANUP: 'memory-cleanup',
  MEMORY_IMPORTANCE_DECAY: 'memory-importance-decay',

  // Embedding workers
  EMBEDDING_BATCH: 'embedding-batch',
  EMBEDDING_REINDEX: 'embedding-reindex',

  // Learning workers
  TRAJECTORY_PROCESS: 'trajectory-process',
  PATTERN_TRAINING: 'pattern-training',
  EWC_CONSOLIDATION: 'ewc-consolidation',

  // Session workers
  SESSION_CLEANUP: 'session-cleanup',
  SESSION_ARCHIVE: 'session-archive',

  // Index workers
  INDEX_OPTIMIZATION: 'index-optimization',
  INDEX_REBALANCE: 'index-rebalance',

  // Integration workers
  WEBHOOK_DISPATCH: 'webhook-dispatch',
  SLACK_SYNC: 'slack-sync',

  // Maintenance workers
  QUOTA_CHECK: 'quota-check',
  AUDIT_EXPORT: 'audit-export',
  DATA_EXPORT: 'data-export',
  DATA_DELETION: 'data-deletion',
} as const;

type JobType = typeof JOB_TYPES[keyof typeof JOB_TYPES];

Job Configuration

// Job configuration with defaults
interface JobConfig<T> {
  type: JobType;
  data: T;
  options?: {
    priority?: number;        // Higher = more urgent (default: 0)
    delay?: number;           // ms to wait before processing
    attempts?: number;        // Max retry attempts (default: 3)
    backoff?: {
      type: 'exponential' | 'fixed';
      delay: number;          // Base delay in ms
    };
    timeout?: number;         // Max execution time in ms
    removeOnComplete?: boolean | number;  // Keep N completed jobs
    removeOnFail?: boolean | number;      // Keep N failed jobs
  };
  tenant?: {
    orgId: string;
    workspaceId?: string;
  };
}

// Default configurations per job type
const JOB_DEFAULTS: Record<JobType, Partial<JobConfig<unknown>['options']>> = {
  'memory-consolidation': {
    priority: 5,
    attempts: 3,
    backoff: { type: 'exponential', delay: 5000 },
    timeout: 300000,  // 5 minutes
  },
  'embedding-batch': {
    priority: 10,
    attempts: 5,
    backoff: { type: 'exponential', delay: 2000 },
    timeout: 60000,   // 1 minute
  },
  'webhook-dispatch': {
    priority: 15,
    attempts: 10,
    backoff: { type: 'exponential', delay: 1000 },
    timeout: 30000,   // 30 seconds
  },
  'pattern-training': {
    priority: 1,
    attempts: 2,
    backoff: { type: 'fixed', delay: 60000 },
    timeout: 3600000, // 1 hour
  },
  'session-cleanup': {
    priority: 3,
    attempts: 3,
    backoff: { type: 'exponential', delay: 10000 },
    timeout: 120000,  // 2 minutes
  },
  // ... other defaults
};
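
Enqueueing goes through a dispatcher helper that merges these per-type defaults with caller-supplied options. A sketch (the dispatch name is hypothetical; custom fields such as timeout are interpreted by the worker framework rather than by BullMQ itself):

// Hypothetical dispatcher: merge per-type defaults, attach tenant context
async function dispatch<T extends object>(
  queue: Queue,
  config: JobConfig<T>
): Promise<Job> {
  const data = { ...config.data, tenantContext: config.tenant };
  return queue.add(config.type, data, {
    ...JOB_DEFAULTS[config.type],
    ...config.options,
  });
}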

Worker Implementation

Worker Base Class

// Abstract worker with tenant context and lifecycle hooks
abstract class BaseWorker<TData, TResult> {
  protected logger: Logger;
  protected metrics: MetricsCollector;
  protected tenantContext?: TenantContext;

  constructor(
    protected readonly type: JobType,
    protected readonly config: WorkerConfig
  ) {
    this.logger = createLogger(`worker:${type}`);
    this.metrics = createMetricsCollector(`worker.${type}`);
  }

  // Main processing method - implement in subclasses
  abstract process(data: TData, job: Job): Promise<TResult>;

  // Lifecycle hooks
  async onStart(job: Job): Promise<void> {
    this.metrics.increment('started');
    this.logger.info(`Starting job ${job.id}`, { data: job.data });

    // Set tenant context if provided
    if (job.data.tenantContext) {
      this.tenantContext = job.data.tenantContext;
    }
  }

  async onComplete(job: Job, result: TResult): Promise<void> {
    this.metrics.increment('completed');
    this.metrics.timing('duration', Date.now() - job.processedOn!);
    this.logger.info(`Completed job ${job.id}`, { result });
  }

  async onFailed(job: Job, error: Error): Promise<void> {
    this.metrics.increment('failed');
    this.logger.error(`Failed job ${job.id}`, { error: error.message });

    // Alert if this was the final attempt (attemptsMade includes the
    // attempt that just failed)
    if (job.attemptsMade >= (job.opts.attempts ?? 3)) {
      await this.alertOnFinalFailure(job, error);
    }
  }

  async onProgress(job: Job, progress: number | object): Promise<void> {
    this.logger.debug(`Progress ${job.id}`, { progress });
  }

  // Error classification
  isRetryable(error: Error): boolean {
    // Network errors, rate limits, temporary failures
    return (
      error.name === 'NetworkError' ||
      error.name === 'TimeoutError' ||
      (error as any).code === 'ECONNRESET' ||
      (error as any).status === 429 ||
      (error as any).status >= 500
    );
  }

  private async alertOnFinalFailure(job: Job, error: Error): Promise<void> {
    // Send to alerting system
    await this.metrics.alert({
      severity: 'warning',
      title: `Worker ${this.type} final failure`,
      message: `Job ${job.id} failed after ${job.attemptsMade} attempts: ${error.message}`,
      tags: {
        jobType: this.type,
        jobId: job.id,
        tenantId: this.tenantContext?.orgId,
      },
    });
  }
}
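
Subclasses are attached to BullMQ by wrapping process() in a Worker processor and binding the lifecycle hooks to BullMQ's worker events. A minimal sketch (attachWorker is a hypothetical helper; the event names are BullMQ's):

import { Worker } from 'bullmq';

// Hypothetical glue: run lifecycle hooks around the BullMQ processor
function attachWorker<TData, TResult>(
  impl: BaseWorker<TData, TResult>,
  queueName: string,
  connection: IORedis,
  concurrency = 5
): Worker<TData, TResult> {
  const worker = new Worker<TData, TResult>(
    queueName,
    async (job) => {
      await impl.onStart(job);
      return impl.process(job.data, job);
    },
    { connection, concurrency }
  );

  worker.on('completed', (job, result) => void impl.onComplete(job, result));
  worker.on('failed', (job, error) => { if (job) void impl.onFailed(job, error); });
  worker.on('progress', (job, progress) => void impl.onProgress(job, progress));

  return worker;
}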

Memory Consolidation Worker

// Consolidate episodic memories into semantic knowledge
class MemoryConsolidationWorker extends BaseWorker<
  MemoryConsolidationData,
  MemoryConsolidationResult
> {
  constructor(
    private memoryRepo: MemoryRepository,
    private vectorStore: RuVectorAdapter,
    private llm: LLMProvider
  ) {
    super(JOB_TYPES.MEMORY_CONSOLIDATION, {
      concurrency: 5,
      limiter: { max: 10, duration: 1000 },
    });
  }

  async process(data: MemoryConsolidationData, job: Job): Promise<MemoryConsolidationResult> {
    const { workspaceId, userId, memoryIds, consolidationType } = data;

    // Load memories to consolidate
    const memories = await this.memoryRepo.findByIds(memoryIds);

    if (memories.length < 2) {
      return { consolidated: false, reason: 'insufficient_memories' };
    }

    await job.updateProgress({ phase: 'analyzing', percent: 10 });

    // Analyze memories for consolidation
    const clusters = await this.clusterMemories(memories);

    await job.updateProgress({ phase: 'generating', percent: 40 });

    // Generate consolidated knowledge for each cluster
    const consolidatedMemories: Memory[] = [];

    for (const cluster of clusters) {
      if (cluster.memories.length < 2) continue;

      // Use LLM to synthesize cluster into semantic knowledge
      const synthesis = await this.synthesizeCluster(cluster);

      // Create new semantic memory
      const semanticMemory = await this.memoryRepo.save({
        id: crypto.randomUUID(),
        type: 'semantic',
        content: synthesis.content,
        importance: synthesis.importance,
        sourceType: 'consolidation',
        sourceId: job.id,
        metadata: {
          sourceMemoryIds: cluster.memories.map(m => m.id),
          consolidationType,
          synthesisConfidence: synthesis.confidence,
        },
      });

      consolidatedMemories.push(semanticMemory);

      // Optionally reduce importance of source memories
      if (consolidationType === 'merge') {
        await Promise.all(
          cluster.memories.map(m =>
            this.memoryRepo.updateImportance(m.id, m.importance * 0.5)
          )
        );
      }
    }

    await job.updateProgress({ phase: 'indexing', percent: 80 });

    // Update vector indices
    await this.reindexMemories(consolidatedMemories);

    await job.updateProgress({ phase: 'complete', percent: 100 });

    return {
      consolidated: true,
      clustersProcessed: clusters.length,
      memoriesCreated: consolidatedMemories.length,
      sourceMemoriesProcessed: memories.length,
    };
  }

  private async clusterMemories(memories: Memory[]): Promise<MemoryCluster[]> {
    // Get embeddings
    const embeddings = await Promise.all(
      memories.map(m => this.vectorStore.getById(m.embeddingId!))
    );

    // Simple clustering by cosine similarity threshold
    const clusters: MemoryCluster[] = [];
    const assigned = new Set<string>();

    for (let i = 0; i < memories.length; i++) {
      if (assigned.has(memories[i].id)) continue;

      const cluster: MemoryCluster = {
        memories: [memories[i]],
        centroid: embeddings[i]!,
      };
      assigned.add(memories[i].id);

      for (let j = i + 1; j < memories.length; j++) {
        if (assigned.has(memories[j].id)) continue;

        const similarity = cosineSimilarity(embeddings[i]!, embeddings[j]!);
        if (similarity > 0.8) {
          cluster.memories.push(memories[j]);
          assigned.add(memories[j].id);
        }
      }

      clusters.push(cluster);
    }

    return clusters;
  }

  private async synthesizeCluster(cluster: MemoryCluster): Promise<Synthesis> {
    const prompt = `Synthesize the following related memories into a single coherent fact or concept:

${cluster.memories.map((m, i) => `${i + 1}. ${m.content}`).join('\n')}

Provide a concise synthesis that captures the key information.`;

    const response = await this.llm.complete(prompt, {
      maxTokens: 200,
      temperature: 0.3,
    });

    return {
      content: response.text.trim(),
      importance: Math.max(...cluster.memories.map(m => m.importance)),
      confidence: cluster.memories.length >= 3 ? 0.9 : 0.7,
    };
  }
}
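
clusterMemories relies on a cosineSimilarity helper that this ADR does not define; a straightforward implementation over plain number arrays:

// Cosine similarity between two equal-length vectors
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  const denom = Math.sqrt(normA) * Math.sqrt(normB);
  return denom === 0 ? 0 : dot / denom;
}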

Embedding Batch Worker

// Batch process embeddings for efficiency
class EmbeddingBatchWorker extends BaseWorker<
  EmbeddingBatchData,
  EmbeddingBatchResult
> {
  constructor(
    private embedder: EmbeddingService,
    private vectorStore: RuVectorAdapter,
    private db: PostgresAdapter
  ) {
    super(JOB_TYPES.EMBEDDING_BATCH, {
      concurrency: 3,
      limiter: { max: 5, duration: 1000 },
    });
  }

  async process(data: EmbeddingBatchData, job: Job): Promise<EmbeddingBatchResult> {
    const { items, namespace, updateTable, updateColumn } = data;

    const results: EmbeddingResult[] = [];
    const batchSize = 32;
    const totalBatches = Math.ceil(items.length / batchSize);

    for (let i = 0; i < items.length; i += batchSize) {
      const batch = items.slice(i, i + batchSize);
      const batchNum = Math.floor(i / batchSize) + 1;

      await job.updateProgress({
        phase: 'embedding',
        batch: batchNum,
        totalBatches,
        percent: (batchNum / totalBatches) * 80,
      });

      try {
        // Generate embeddings
        const embeddings = await this.embedder.embedBatch(
          batch.map(item => item.text)
        );

        // Prepare vector entries
        const entries: VectorEntry[] = batch.map((item, idx) => ({
          id: item.embeddingId || crypto.randomUUID(),
          vector: embeddings[idx],
          metadata: item.metadata,
        }));

        // Insert into vector store
        await this.vectorStore.upsert(namespace, entries);

        // Update source records if needed
        if (updateTable && updateColumn) {
          await this.updateSourceRecords(
            updateTable,
            updateColumn,
            batch.map((item, idx) => ({
              id: item.sourceId,
              embeddingId: entries[idx].id,
            }))
          );
        }

        results.push(
          ...batch.map((item, idx) => ({
            sourceId: item.sourceId,
            embeddingId: entries[idx].id,
            success: true,
          }))
        );
      } catch (error) {
        // Record failures but continue with next batch
        this.logger.warn(`Batch ${batchNum} failed`, { error });
        results.push(
          ...batch.map(item => ({
            sourceId: item.sourceId,
            embeddingId: null,
            success: false,
            error: (error as Error).message,
          }))
        );
      }
    }

    await job.updateProgress({ phase: 'complete', percent: 100 });

    return {
      total: items.length,
      succeeded: results.filter(r => r.success).length,
      failed: results.filter(r => !r.success).length,
      results,
    };
  }

  private async updateSourceRecords(
    table: string,
    column: string,
    updates: Array<{ id: string; embeddingId: string }>
  ): Promise<void> {
    // Batch update using UNNEST. table/column are SQL identifiers and cannot
    // be parameterized, so they must come from a trusted allowlist.
    await this.db.query(`
      UPDATE ${table} AS t
      SET ${column} = u.embedding_id
      FROM UNNEST($1::uuid[], $2::uuid[]) AS u(id, embedding_id)
      WHERE t.id = u.id
    `, [
      updates.map(u => u.id),
      updates.map(u => u.embeddingId),
    ]);
  }
}
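
Enqueueing a batch could look like the following (a hypothetical call using the dispatch helper sketched earlier; rows is an assumed result set, and the field names mirror EmbeddingBatchData as used in process() above):

// Hypothetical: embed newly created messages and link them back
await dispatch(jobQueue, {
  type: JOB_TYPES.EMBEDDING_BATCH,
  data: {
    namespace: 'messages',
    updateTable: 'messages',
    updateColumn: 'embedding_id',
    items: rows.map(r => ({
      sourceId: r.id,
      text: r.content,
      metadata: { orgId: r.orgId },
    })),
  },
  tenant: { orgId: 'org-123' },
});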

Webhook Dispatch Worker

// Reliable webhook delivery with retries
class WebhookDispatchWorker extends BaseWorker<
  WebhookDispatchData,
  WebhookDispatchResult
> {
  constructor(
    private http: HttpClient,
    private webhookRepo: WebhookRepository
  ) {
    super(JOB_TYPES.WEBHOOK_DISPATCH, {
      concurrency: 20,
      limiter: { max: 100, duration: 1000 },
    });
  }

  async process(data: WebhookDispatchData, job: Job): Promise<WebhookDispatchResult> {
    const { webhookId, payload, headers, signature } = data;

    // Load webhook configuration
    const webhook = await this.webhookRepo.findById(webhookId);
    if (!webhook || !webhook.isEnabled) {
      return { success: false, reason: 'webhook_disabled' };
    }

    // Prepare request
    const requestHeaders: Record<string, string> = {
      'Content-Type': 'application/json',
      'User-Agent': 'RuvBot-Webhook/1.0',
      'X-Webhook-Id': webhookId,
      'X-Delivery-Id': job.id,
      'X-Signature': signature,
      ...headers,
    };

    const startTime = Date.now();

    try {
      const response = await this.http.post(webhook.url, {
        body: JSON.stringify(payload),
        headers: requestHeaders,
        timeout: 10000,
      });

      const latency = Date.now() - startTime;

      // Record delivery
      await this.webhookRepo.recordDelivery({
        webhookId,
        jobId: job.id,
        status: 'success',
        statusCode: response.status,
        latencyMs: latency,
        responseBody: response.data?.slice(0, 1000),
      });

      // Update webhook stats
      await this.webhookRepo.updateStats(webhookId, {
        lastDeliveredAt: new Date(),
        successCount: { increment: 1 },
        avgLatencyMs: latency,
      });

      return {
        success: true,
        statusCode: response.status,
        latencyMs: latency,
      };
    } catch (error) {
      const latency = Date.now() - startTime;
      const httpError = error as HttpError;

      // Record failure
      await this.webhookRepo.recordDelivery({
        webhookId,
        jobId: job.id,
        status: 'failed',
        statusCode: httpError.status,
        latencyMs: latency,
        errorMessage: httpError.message,
      });

      // Update failure stats
      await this.webhookRepo.updateStats(webhookId, {
        failureCount: { increment: 1 },
      });

      // Check if should disable webhook
      const recentFailures = await this.webhookRepo.countRecentFailures(
        webhookId,
        24 * 60 * 60 * 1000 // 24 hours
      );

      if (recentFailures >= webhook.maxFailures) {
        await this.webhookRepo.disable(webhookId, 'max_failures_exceeded');
        this.metrics.increment('webhook_disabled');
      }

      // Determine if retryable
      if (!this.isRetryable(error as Error)) {
        return {
          success: false,
          reason: 'non_retryable_error',
          error: httpError.message,
        };
      }

      throw error; // Let BullMQ handle retry
    }
  }
}
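
The X-Signature header is computed by the dispatcher before enqueueing. The exact scheme is not specified in this ADR; a common construction is an HMAC-SHA256 over the serialized payload:

import { createHmac } from 'node:crypto';

// Assumed scheme: HMAC-SHA256 of the JSON payload, hex-encoded
function signPayload(secret: string, payload: unknown): string {
  return 'sha256=' + createHmac('sha256', secret)
    .update(JSON.stringify(payload))
    .digest('hex');
}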

Scheduling

Cron-Based Scheduling

// Scheduled job configuration
interface ScheduledJob {
  name: string;
  cron: string;
  jobType: JobType;
  jobData: Record<string, unknown>;
  options?: {
    tz?: string;              // IANA timezone (BullMQ repeat option)
    startDate?: Date;
    endDate?: Date;
    limit?: number;
  };
}

// Default scheduled jobs
const SCHEDULED_JOBS: ScheduledJob[] = [
  {
    name: 'session-cleanup',
    cron: '0 */15 * * * *',  // Every 15 minutes
    jobType: JOB_TYPES.SESSION_CLEANUP,
    jobData: { maxAge: 24 * 60 * 60 * 1000 },  // 24 hours
  },
  {
    name: 'memory-importance-decay',
    cron: '0 0 */6 * * *',  // Every 6 hours
    jobType: JOB_TYPES.MEMORY_IMPORTANCE_DECAY,
    jobData: { decayFactor: 0.99 },
  },
  {
    name: 'index-optimization',
    cron: '0 0 3 * * *',  // 3 AM daily
    jobType: JOB_TYPES.INDEX_OPTIMIZATION,
    jobData: { threshold: 0.8 },
    options: { tz: 'UTC' },
  },
  {
    name: 'pattern-training-daily',
    cron: '0 0 4 * * *',  // 4 AM daily
    jobType: JOB_TYPES.PATTERN_TRAINING,
    jobData: { batchSize: 1000, epochs: 5 },
    options: { tz: 'UTC' },
  },
  {
    name: 'quota-check',
    cron: '0 0 * * * *',  // Every hour
    jobType: JOB_TYPES.QUOTA_CHECK,
    jobData: {},
  },
];

// Scheduler class
class WorkerScheduler {
  private scheduled: Map<string, ScheduledJob> = new Map();

  constructor(private queue: Queue) {}

  async start(): Promise<void> {
    for (const job of SCHEDULED_JOBS) {
      await this.queue.add(
        job.jobType,
        job.jobData,
        {
          repeat: {
            pattern: job.cron,
            ...job.options,
          },
        }
      );
      this.scheduled.set(job.name, job);
      console.log(`Scheduled ${job.name}: ${job.cron}`);
    }
  }

  async stop(): Promise<void> {
    // Repeatable jobs are removed by name + repeat options, not via a job handle
    for (const [name, job] of this.scheduled) {
      await this.queue.removeRepeatable(job.jobType, {
        pattern: job.cron,
        ...job.options,
      });
      console.log(`Unscheduled ${name}`);
    }
    this.scheduled.clear();
  }

  async reschedule(name: string, cron: string): Promise<void> {
    const job = SCHEDULED_JOBS.find(j => j.name === name);
    if (!job) throw new Error(`Unknown scheduled job: ${name}`);

    const existing = this.scheduled.get(name);
    if (existing) {
      await this.queue.removeRepeatable(existing.jobType, {
        pattern: existing.cron,
        ...existing.options,
      });
    }

    await this.queue.add(job.jobType, job.jobData, {
      repeat: { pattern: cron, ...job.options },
    });
    this.scheduled.set(name, { ...job, cron });
  }
}
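
Startup and shutdown tie the scheduler and workers together. Worker.close() waits for in-progress jobs, which satisfies the graceful-shutdown requirement from the decision drivers. A lifecycle sketch (workers is an assumed array of the Worker instances created via attachWorker above):

// Drain in-progress jobs before exiting
const scheduler = new WorkerScheduler(jobQueue);
await scheduler.start();

process.on('SIGTERM', async () => {
  await scheduler.stop();
  await Promise.all(workers.map(w => w.close()));
  process.exit(0);
});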

Monitoring and Observability

Metrics Collection

// Worker metrics
interface WorkerMetrics {
  // Counters
  jobsStarted: Counter;
  jobsCompleted: Counter;
  jobsFailed: Counter;
  jobsRetried: Counter;

  // Gauges
  activeJobs: Gauge;
  waitingJobs: Gauge;
  delayedJobs: Gauge;
  workerInstances: Gauge;

  // Histograms
  jobDuration: Histogram;
  jobAttempts: Histogram;
  queueWaitTime: Histogram;
}

// Metrics exporter
class WorkerMetricsExporter {
  private registry: MetricRegistry;

  constructor() {
    this.registry = new MetricRegistry();
    this.initializeMetrics();
  }

  private initializeMetrics(): void {
    // Job lifecycle metrics
    this.registry.registerCounter('ruvbot_worker_jobs_total', {
      help: 'Total jobs processed',
      labels: ['type', 'status'],
    });

    this.registry.registerGauge('ruvbot_worker_jobs_active', {
      help: 'Currently active jobs',
      labels: ['type'],
    });

    this.registry.registerHistogram('ruvbot_worker_job_duration_seconds', {
      help: 'Job processing duration',
      labels: ['type'],
      buckets: [0.1, 0.5, 1, 5, 10, 30, 60, 300],
    });

    // Queue metrics
    this.registry.registerGauge('ruvbot_worker_queue_depth', {
      help: 'Number of jobs in queue',
      labels: ['type', 'state'],
    });
  }

  async collectMetrics(): Promise<string> {
    // Collect from all queues
    const queues = await this.getQueueStats();

    for (const [queueName, stats] of Object.entries(queues)) {
      this.registry.setGauge('ruvbot_worker_queue_depth', stats.waiting, {
        type: queueName,
        state: 'waiting',
      });
      this.registry.setGauge('ruvbot_worker_queue_depth', stats.active, {
        type: queueName,
        state: 'active',
      });
      this.registry.setGauge('ruvbot_worker_queue_depth', stats.delayed, {
        type: queueName,
        state: 'delayed',
      });
    }

    return this.registry.export();
  }
}
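
The exporter output is plain Prometheus text, so exposing it is a small HTTP handler. A sketch using Node's built-in http module (port 9090 is illustrative):

import http from 'node:http';

const exporter = new WorkerMetricsExporter();

// Prometheus scrape endpoint
http.createServer(async (req, res) => {
  if (req.url === '/metrics') {
    res.setHeader('Content-Type', 'text/plain; version=0.0.4');
    res.end(await exporter.collectMetrics());
  } else {
    res.statusCode = 404;
    res.end();
  }
}).listen(9090);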

Health Checks

// Worker health check
interface WorkerHealthCheck {
  status: 'healthy' | 'degraded' | 'unhealthy';
  checks: {
    redis: HealthStatus;
    workers: HealthStatus;
    queues: HealthStatus;
    deadLetters: HealthStatus;
  };
  details: {
    activeWorkers: number;
    totalQueues: number;
    oldestWaitingJob: number | null;
    deadLetterCount: number;
  };
}

class WorkerHealthChecker {
  // Injected dependencies used by the checks below (full types elided here)
  constructor(
    private redis: { ping(): Promise<string> },
    private config: { minWorkers?: number; deadLetterThreshold?: number }
  ) {}

  async check(): Promise<WorkerHealthCheck> {
    const [redis, workers, queues, deadLetters] = await Promise.all([
      this.checkRedis(),
      this.checkWorkers(),
      this.checkQueues(),
      this.checkDeadLetters(),
    ]);

    const status = this.aggregateStatus([redis, workers, queues, deadLetters]);

    return {
      status,
      checks: { redis, workers, queues, deadLetters },
      details: {
        activeWorkers: workers.details?.count ?? 0,
        totalQueues: queues.details?.count ?? 0,
        oldestWaitingJob: queues.details?.oldestWaiting ?? null,
        deadLetterCount: deadLetters.details?.count ?? 0,
      },
    };
  }

  private async checkRedis(): Promise<HealthStatus> {
    try {
      const start = Date.now();
      await this.redis.ping();
      const latency = Date.now() - start;

      return {
        status: latency < 100 ? 'healthy' : 'degraded',
        latency,
      };
    } catch (error) {
      return { status: 'unhealthy', error: (error as Error).message };
    }
  }

  private async checkWorkers(): Promise<HealthStatus> {
    const workers = await this.getActiveWorkers();
    const minWorkers = this.config.minWorkers ?? 1;

    if (workers.length >= minWorkers) {
      return { status: 'healthy', details: { count: workers.length } };
    } else if (workers.length > 0) {
      return { status: 'degraded', details: { count: workers.length } };
    } else {
      return { status: 'unhealthy', details: { count: 0 } };
    }
  }

  private async checkDeadLetters(): Promise<HealthStatus> {
    const count = await this.getDeadLetterCount();
    const threshold = this.config.deadLetterThreshold ?? 100;

    if (count === 0) {
      return { status: 'healthy', details: { count } };
    } else if (count < threshold) {
      return { status: 'degraded', details: { count } };
    } else {
      return { status: 'unhealthy', details: { count } };
    }
  }
}
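
check() aggregates with a worst-status-wins rule via aggregateStatus, which is not shown above; a plausible reduction:

// Worst status wins: any unhealthy check makes the whole report unhealthy
function aggregateStatus(
  checks: HealthStatus[]
): 'healthy' | 'degraded' | 'unhealthy' {
  if (checks.some(c => c.status === 'unhealthy')) return 'unhealthy';
  if (checks.some(c => c.status === 'degraded')) return 'degraded';
  return 'healthy';
}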

Error Handling and Dead Letters

Dead Letter Queue

// Dead letter handling
interface DeadLetterEntry {
  id: string;
  originalJobId: string;
  jobType: JobType;
  jobData: unknown;
  error: {
    message: string;
    stack?: string;
    code?: string;
  };
  attempts: number;
  failedAt: Date;
  tenant?: {
    orgId: string;
    workspaceId?: string;
  };
}

class DeadLetterManager {
  constructor(
    private queue: Queue,
    private storage: DeadLetterStorage,
    private alerter: Alerter
  ) {}

  async moveToDeadLetter(job: Job, error: Error): Promise<void> {
    const entry: DeadLetterEntry = {
      id: crypto.randomUUID(),
      originalJobId: job.id,
      jobType: job.name as JobType,
      jobData: job.data,
      error: {
        message: error.message,
        stack: error.stack,
        code: (error as any).code,
      },
      attempts: job.attemptsMade,
      failedAt: new Date(),
      tenant: job.data.tenantContext,
    };

    await this.storage.save(entry);

    // Alert on critical job types
    if (this.isCriticalJobType(job.name as JobType)) {
      await this.alerter.send({
        severity: 'high',
        title: `Critical job moved to dead letter: ${job.name}`,
        message: `Job ${job.id} failed after ${job.attemptsMade} attempts: ${error.message}`,
        metadata: {
          jobType: job.name,
          jobId: job.id,
          tenant: entry.tenant,
        },
      });
    }
  }

  async retry(deadLetterId: string): Promise<Job> {
    const entry = await this.storage.findById(deadLetterId);
    if (!entry) throw new Error(`Dead letter not found: ${deadLetterId}`);

    // Re-enqueue with fresh attempts
    const job = await this.queue.add(entry.jobType, entry.jobData, {
      attempts: 3,
      removeOnComplete: true,
    });

    // Mark as retried
    await this.storage.markRetried(deadLetterId, job.id);

    return job;
  }

  async purge(filter: DeadLetterFilter): Promise<number> {
    const entries = await this.storage.find(filter);
    await this.storage.deleteMany(entries.map(e => e.id));
    return entries.length;
  }

  private isCriticalJobType(type: JobType): boolean {
    const critical: JobType[] = [
      JOB_TYPES.DATA_DELETION,
      JOB_TYPES.WEBHOOK_DISPATCH,
      JOB_TYPES.AUDIT_EXPORT,
    ];
    return critical.includes(type);
  }
}
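
moveToDeadLetter is driven from the worker's failed event once retries are exhausted. A wiring sketch (worker and deadLetterManager are assumed instances; BullMQ emits 'failed' on every attempt, so the final attempt is detected by comparing attemptsMade against the configured attempts):

// Move to the DLQ only on the final failed attempt
worker.on('failed', async (job, error) => {
  if (!job) return;
  const maxAttempts = job.opts.attempts ?? 3;
  if (job.attemptsMade >= maxAttempts) {
    await deadLetterManager.moveToDeadLetter(job, error);
  }
});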

Consequences

Benefits

  1. Reliability: At-least-once delivery with persistent storage
  2. Observability: Comprehensive metrics and health checks
  3. Flexibility: Extensible worker framework
  4. Efficiency: Batch processing and rate limiting
  5. Maintainability: Clean separation of concerns

Trade-offs

| Benefit | Trade-off |
| --- | --- |
| Persistence | Redis memory usage |
| Concurrency | Complexity in ordering |
| Retries | Potential duplicates (need idempotency) |
| Monitoring | Observability overhead |

Related Decisions

  • ADR-001: Architecture Overview
  • ADR-003: Persistence Layer (Redis integration)
  • ADR-007: Learning System (training workers)

Revision History

| Version | Date | Author | Changes |
| --- | --- | --- | --- |
| 1.0 | 2026-01-27 | RuVector Architecture Team | Initial version |