Squashed 'vendor/ruvector/' content from commit b64c2172

git-subtree-dir: vendor/ruvector git-subtree-split: b64c21726f2bb37286d9ee36a7869fef60cc6900
2026-02-28 14:39:40 -05:00
commit d803bfe2b1
7854 changed files with 3522914 additions and 0 deletions
--- a/npm/packages/ruvbot/docs/adr/ADR-009-hybrid-search.md
+++ b/npm/packages/ruvbot/docs/adr/ADR-009-hybrid-search.md
@@ -0,0 +1,159 @@
+# ADR-009: Hybrid Search Architecture
+
+## Status
+Accepted (Implemented)
+
+## Date
+2026-01-27
+
+## Context
+
+Clawdbot uses basic vector search with external embedding APIs. RuvBot improves on this with:
+- Local WASM embeddings (75x faster)
+- HNSW indexing (150x-12,500x faster)
+- Need for hybrid search combining vector + keyword (BM25)
+
+## Decision
+
+### Hybrid Search Pipeline
+
+```
+┌─────────────────────────────────────────────────────────────────┐
+│                    RuvBot Hybrid Search                          │
+├─────────────────────────────────────────────────────────────────┤
+│  Query Input                                                     │
+│    └─ Text normalization                                        │
+│    └─ Query embedding (WASM, <3ms)                              │
+├─────────────────────────────────────────────────────────────────┤
+│  Parallel Search (Promise.all)                                   │
+│    ├─ Vector Search (HNSW)          ├─ Keyword Search (BM25)   │
+│    │    └─ Cosine similarity        │    └─ Inverted index     │
+│    │    └─ Top-K candidates         │    └─ IDF + TF scoring   │
+├─────────────────────────────────────────────────────────────────┤
+│  Result Fusion                                                   │
+│    └─ Reciprocal Rank Fusion (RRF)                              │
+│    └─ Linear combination                                        │
+│    └─ Weighted average with presence bonus                      │
+├─────────────────────────────────────────────────────────────────┤
+│  Post-Processing                                                 │
+│    └─ Score normalization (BM25 max-normalized)                 │
+│    └─ Matched term tracking                                     │
+│    └─ Threshold filtering                                       │
+└─────────────────────────────────────────────────────────────────┘
+```
+
+### Implementation
+
+Located in `/npm/packages/ruvbot/src/learning/search/`:
+- `HybridSearch.ts` - Main hybrid search coordinator
+- `BM25Index.ts` - BM25 keyword search implementation
+
+### Configuration
+
+```typescript
+interface HybridSearchConfig {
+  vector: {
+    enabled: boolean;
+    weight: number;  // 0.0-1.0, default: 0.7
+  };
+  keyword: {
+    enabled: boolean;
+    weight: number;  // 0.0-1.0, default: 0.3
+    k1?: number;     // BM25 k1 parameter, default: 1.2
+    b?: number;      // BM25 b parameter, default: 0.75
+  };
+  fusion: {
+    method: 'rrf' | 'linear' | 'weighted';
+    k: number;       // RRF constant, default: 60
+    candidateMultiplier: number;  // default: 3
+  };
+}
+
+interface HybridSearchOptions {
+  topK?: number;       // default: 10
+  threshold?: number;  // default: 0
+  vectorOnly?: boolean;
+  keywordOnly?: boolean;
+}
+
+interface HybridSearchResult {
+  id: string;
+  vectorScore: number;
+  keywordScore: number;
+  fusedScore: number;
+  matchedTerms?: string[];
+}
+```
+
+### Fusion Methods
+
+| Method | Algorithm | Best For |
+|--------|-----------|----------|
+| `rrf` | Reciprocal Rank Fusion: `1/(k + rank)` | General use, rank-based |
+| `linear` | `α·vectorScore + β·keywordScore` | Score-sensitive ranking |
+| `weighted` | Linear + 0.1 bonus for dual matches | Boosting exact matches |
+
+### BM25 Implementation
+
+```typescript
+interface BM25Config {
+  k1: number;  // Term frequency saturation (default: 1.2)
+  b: number;   // Document length normalization (default: 0.75)
+}
+```
+
+Features:
+- Inverted index with document frequency tracking
+- Built-in stopword filtering (100+ common words)
+- Basic Porter-style stemming (ing, ed, es, s, ly, tion)
+- Average document length normalization
+
+### Performance Targets
+
+| Operation | Target | Achieved |
+|-----------|--------|----------|
+| Query embedding | <5ms | 2.7ms |
+| Vector search (100K) | <10ms | <5ms |
+| Keyword search | <20ms | <15ms |
+| Fusion | <5ms | <2ms |
+| Total hybrid | <40ms | <25ms |
+
+### Usage Example
+
+```typescript
+import { HybridSearch, createHybridSearch } from './learning/search';
+
+// Create with custom config
+const search = createHybridSearch({
+  vector: { enabled: true, weight: 0.7 },
+  keyword: { enabled: true, weight: 0.3, k1: 1.2, b: 0.75 },
+  fusion: { method: 'rrf', k: 60, candidateMultiplier: 3 },
+});
+
+// Initialize with vector index and embedder
+search.initialize(vectorIndex, embedder);
+
+// Add documents
+await search.add('doc1', 'Document content here');
+
+// Search
+const results = await search.search('query text', { topK: 10 });
+```
+
+## Consequences
+
+### Positive
+- Better recall than vector-only search
+- Handles exact matches and semantic similarity
+- Maintains keyword search for debugging
+- Parallel search execution for low latency
+
+### Negative
+- Slightly higher latency than vector-only
+- Requires maintaining both indices
+- More complex tuning
+
+### Trade-offs
+- Weight tuning requires experimentation
+- Memory overhead for dual indices
+- BM25 stemming is basic (not full Porter algorithm)