Merge commit 'd803bfe2b1fe7f5e219e50ac20d6801a0a58ac75' as 'vendor/ruvector'

2026-02-28 14:39:40 -05:00
parent 7885bf6278 d803bfe2b1
commit cd5943df23
7854 changed files with 3522914 additions and 0 deletions
--- a/vendor/ruvector/docs/research/rvf/spec/03-temperature-tiering.md
+++ b/vendor/ruvector/docs/research/rvf/spec/03-temperature-tiering.md
@@ -0,0 +1,285 @@
+# RVF Temperature Tiering
+
+## 1. Adaptive Layout as a First-Class Concept
+
+Traditional vector formats place data once and leave it. RVF treats data placement
+as a **continuous optimization problem**. Every vector block has a temperature, and
+the format periodically reorganizes to keep hot data fast and cold data small.
+
+```
+                Access Frequency
+                     ^
+                     |
+Tier 0 (HOT)        |  ████████   fp16 / 8-bit, interleaved
+                     |  ████████   < 1μs random access
+                     |
+Tier 1 (WARM)        |  ░░░░░░░░░░░░░░░░   5-7 bit quantized
+                     |  ░░░░░░░░░░░░░░░░   columnar, compressed
+                     |
+Tier 2 (COLD)        |  ▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒   3-bit or 1-bit
+                     |  ▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒   heavy compression
+                     |
+                     +------------------------------------> Vector ID
+```
+
+### Tier Definitions
+
+| Tier | Name | Quantization | Layout | Compression | Access Latency |
+|------|------|-------------|--------|-------------|----------------|
+| 0 | Hot | fp16 or int8 | Interleaved (row-major) | None or LZ4 | < 1 μs |
+| 1 | Warm | 5-7 bit SQ/PQ | Columnar | LZ4 or ZSTD | 1-10 μs |
+| 2 | Cold | 3-bit or binary | Columnar | ZSTD level 9+ | 10-100 μs |
+
+### Memory Ratios
+
+For 384-dimensional vectors (typical embedding size):
+
+| Tier | Bytes/Vector | Ratio vs fp32 | 10M Vectors |
+|------|-------------|---------------|-------------|
+| fp32 (raw) | 1536 B | 1.0x | 14.3 GB |
+| Tier 0 (fp16) | 768 B | 2.0x | 7.2 GB |
+| Tier 0 (int8) | 384 B | 4.0x | 3.6 GB |
+| Tier 1 (6-bit) | 288 B | 5.3x | 2.7 GB |
+| Tier 1 (5-bit) | 240 B | 6.4x | 2.2 GB |
+| Tier 2 (3-bit) | 144 B | 10.7x | 1.3 GB |
+| Tier 2 (1-bit) | 48 B | 32.0x | 0.45 GB |
+
+## 2. Access Counter Sketch
+
+Temperature decisions require knowing which blocks are accessed frequently.
+RVF maintains a lightweight **Count-Min Sketch** per block set, stored in
+SKETCH_SEG segments.
+
+### Sketch Parameters
+
+```
+Width (w):    1024 counters
+Depth (d):    4 hash functions
+Counter size: 8-bit saturating (max 255)
+Memory:       1024 * 4 * 1 = 4 KB per sketch
+Granularity:  One sketch per 1024-vector block
+Decay:        Halve all counters every 2^16 accesses (aging)
+```
+
+For 10M vectors in 1024-vector blocks:
+- 9,766 blocks
+- 9,766 * 4 KB = ~38 MB of sketches
+- Stored in SKETCH_SEG, referenced by manifest
+
+### Sketch Operations
+
+**On query access**:
+```
+block_id = vector_id / block_size
+for i in 0..depth:
+    idx = hash_i(block_id) % width
+    sketch[i][idx] = min(sketch[i][idx] + 1, 255)
+```
+
+**On temperature check**:
+```
+count = min over i of sketch[i][hash_i(block_id) % width]
+if count > HOT_THRESHOLD:   tier = 0
+elif count > WARM_THRESHOLD: tier = 1
+else:                        tier = 2
+```
+
+**Aging** (every 2^16 accesses):
+```
+for all counters: counter = counter >> 1
+```
+
+This ensures the sketch tracks *recent* access patterns, not cumulative history.
+
+### Why Count-Min Sketch
+
+| Alternative | Memory | Accuracy | Update Cost |
+|------------|--------|----------|-------------|
+| Per-vector counter | 80 MB (10M * 8B) | Exact | O(1) |
+| Count-Min Sketch | 38 MB | ~99.9% | O(depth) = O(4) |
+| HyperLogLog | 6 MB | ~98% | O(1) but cardinality only |
+| Bloom filter | 12 MB | No counting | N/A |
+
+Count-Min Sketch is the best trade-off: sub-exact accuracy with bounded memory
+and constant-time updates.
+
+## 3. Promotion and Demotion
+
+### Promotion: Warm/Cold -> Hot
+
+When a block's access count exceeds HOT_THRESHOLD for two consecutive sketch
+epochs:
+
+```
+1. Read the block from its current VEC_SEG
+2. Decode/dequantize vectors to fp16 or int8
+3. Rearrange from columnar to interleaved layout
+4. Write as a new HOT_SEG (or append to existing HOT_SEG)
+5. Update manifest with new tier assignment
+6. Optionally: add neighbor lists to HOT_SEG for locality
+```
+
+### Demotion: Hot -> Warm -> Cold
+
+When a block's access count drops below WARM_THRESHOLD:
+
+```
+1. The block is not immediately rewritten
+2. On next compaction cycle, the block is written to the appropriate tier
+3. Quantization is applied during compaction (not lazily)
+4. The HOT_SEG entry is tombstoned in the manifest
+```
+
+### Eviction as Compression
+
+The key insight: **eviction from hot tier is just compression, not deletion**.
+The vector data is always present — it just moves to a more compressed
+representation. This means:
+
+- No data loss on eviction
+- Recall degrades gracefully (quantized vectors still contribute to search)
+- The file naturally compresses over time as access patterns stabilize
+
+## 4. Temperature-Aware Compaction
+
+Standard compaction merges segments for space efficiency. Temperature-aware
+compaction also **rearranges blocks by tier**:
+
+```
+Before compaction:
+  VEC_SEG_1:  [hot] [cold] [warm] [hot] [cold]
+  VEC_SEG_2:  [warm] [hot] [cold] [warm] [warm]
+
+After temperature-aware compaction:
+  HOT_SEG:    [hot] [hot] [hot]       <- interleaved, fp16
+  VEC_SEG_W:  [warm] [warm] [warm] [warm]  <- columnar, 6-bit
+  VEC_SEG_C:  [cold] [cold] [cold]     <- columnar, 3-bit
+```
+
+This creates **physical locality by temperature**: hot blocks are contiguous
+(good for sequential scan), warm blocks are contiguous (good for batch decode),
+cold blocks are contiguous (good for compression ratio).
+
+### Compaction Triggers
+
+| Trigger | Condition | Action |
+|---------|-----------|--------|
+| Sketch epoch | Every N writes | Evaluate all block temperatures |
+| Space amplification | Dead space > 30% | Merge + rewrite segments |
+| Tier imbalance | Hot tier > 20% of data | Demote cold blocks |
+| Hot miss rate | Hot cache miss > 10% | Promote missing blocks |
+
+## 5. Quantization Strategies by Tier
+
+### Tier 0: Hot
+
+**Scalar quantization to int8** (preferred) or **fp16** (for maximum recall).
+
+```
+Encoding:
+  q = round((v - min) / (max - min) * 255)
+
+Decoding:
+  v = q / 255 * (max - min) + min
+
+Parameters stored in QUANT_SEG:
+  min: f32 per dimension
+  max: f32 per dimension
+```
+
+Distance computation directly on int8 using SIMD (vpsubb + vpmaddubsw on AVX-512).
+
+### Tier 1: Warm
+
+**Product Quantization (PQ)** with 5-7 bits per sub-vector.
+
+```
+Parameters:
+  M subspaces:          48 (for 384-dim vectors, 8 dims per subspace)
+  K centroids per sub:  64 (6-bit) or 128 (7-bit)
+  Codebook:             M * K * 8 * sizeof(f32) = 48 * 64 * 8 * 4 = 96 KB
+
+Encoding:
+  For each subvector: find nearest centroid -> store centroid index
+
+Distance computation:
+  ADC (Asymmetric Distance Computation) with precomputed distance tables
+```
+
+### Tier 2: Cold
+
+**Binary quantization** (1-bit) or **ternary quantization** (2-bit / 3-bit).
+
+```
+Binary encoding:
+  b = sign(v)  -> 1 bit per dimension
+  384 dims -> 48 bytes per vector (32x compression)
+
+Distance:
+  Hamming distance via POPCNT
+  XOR + POPCNT on AVX-512: 512 bits per cycle
+
+Ternary (3-bit with magnitude):
+  t = {-1, 0, +1} based on threshold
+  magnitude = |v| quantized to 3 levels
+  384 dims -> 144 bytes per vector (10.7x compression)
+```
+
+### Codebook Storage
+
+All quantization parameters (codebooks, min/max ranges, centroids) are stored
+in QUANT_SEG segments. The root manifest's `quantdict_seg_offset` hotset pointer
+references the active quantization dictionary for fast boot.
+
+Multiple QUANT_SEGs can coexist for different tiers — the manifest maps each
+tier to its dictionary.
+
+## 6. Hardware Adaptation
+
+### Desktop (AVX-512)
+
+- Hot tier: int8 with VNNI dot product (4 int8 multiplies per cycle)
+- Warm tier: PQ with AVX-512 gather for table lookups
+- Cold tier: Binary with VPOPCNTDQ (512-bit popcount)
+
+### ARM (NEON)
+
+- Hot tier: int8 with SDOT instruction
+- Warm tier: PQ with TBL for table lookups
+- Cold tier: Binary with CNT (population count)
+
+### WASM (v128)
+
+- Hot tier: int8 with i8x16.dot_i7x16_i16x8 (if available)
+- Warm tier: Scalar PQ (no gather)
+- Cold tier: Binary with manual popcount
+
+### Cognitum Tile (8KB code + 8KB data + 64KB SIMD)
+
+- Hot tier only: int8 interleaved, fits in SIMD scratch
+- No warm/cold — data stays on hub, tile fetches blocks on demand
+- Sketch is maintained by hub, not tile
+
+## 7. Self-Organization Over Time
+
+```
+t=0    All data Tier 1 (default warm)
+       |
+t+N    First sketch epoch: identify hot blocks
+       Promote top 5% to Tier 0
+       |
+t+2N   Second epoch: validate promotions
+       Demote false positives back to Tier 1
+       Identify true cold blocks (0 access in 2 epochs)
+       |
+t+3N   Compaction: physically separate tiers
+       HOT_SEG created with interleaved layout
+       Cold blocks compressed to 3-bit
+       |
+t+∞    Equilibrium: ~5% hot, ~30% warm, ~65% cold
+       File size: ~2-3x smaller than uniform fp16
+       Query p95: dominated by hot tier latency
+```
+
+The format converges to an equilibrium that reflects actual usage. No manual
+tuning required.