RVF Temperature Tiering

1. Adaptive Layout as a First-Class Concept

Traditional vector formats place data once and leave it. RVF treats data placement as a continuous optimization problem. Every vector block has a temperature, and the format periodically reorganizes to keep hot data fast and cold data small.

                Access Frequency
                     ^
                     |
Tier 0 (HOT)        |  ████████   fp16 / 8-bit, interleaved
                     |  ████████   < 1μs random access
                     |
Tier 1 (WARM)        |  ░░░░░░░░░░░░░░░░   5-7 bit quantized
                     |  ░░░░░░░░░░░░░░░░   columnar, compressed
                     |
Tier 2 (COLD)        |  ▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒   3-bit or 1-bit
                     |  ▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒   heavy compression
                     |
                     +------------------------------------> Vector ID

Tier Definitions

| Tier | Name | Quantization    | Layout                  | Compression   | Access Latency |
|------|------|-----------------|-------------------------|---------------|----------------|
| 0    | Hot  | fp16 or int8    | Interleaved (row-major) | None or LZ4   | < 1 μs         |
| 1    | Warm | 5-7 bit SQ/PQ   | Columnar                | LZ4 or ZSTD   | 1-10 μs        |
| 2    | Cold | 3-bit or binary | Columnar                | ZSTD level 9+ | 10-100 μs      |

Memory Ratios

For 384-dimensional vectors (typical embedding size):

| Tier           | Bytes/Vector | Ratio vs fp32 | 10M Vectors |
|----------------|--------------|---------------|-------------|
| fp32 (raw)     | 1536 B       | 1.0x          | 14.3 GB     |
| Tier 0 (fp16)  | 768 B        | 2.0x          | 7.2 GB      |
| Tier 0 (int8)  | 384 B        | 4.0x          | 3.6 GB      |
| Tier 1 (6-bit) | 288 B        | 5.3x          | 2.7 GB      |
| Tier 1 (5-bit) | 240 B        | 6.4x          | 2.2 GB      |
| Tier 2 (3-bit) | 144 B        | 10.7x         | 1.3 GB      |
| Tier 2 (1-bit) | 48 B         | 32.0x         | 0.45 GB     |
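These figures follow directly from bits per dimension. A quick sanity check of the table (assuming 384 dimensions and binary GiB for the totals):

```python
def bytes_per_vector(dims: int, bits: int) -> int:
    """Storage per vector at a given bit width, rounded up to whole bytes."""
    return (dims * bits + 7) // 8

dims = 384
for label, bits in [("fp32", 32), ("fp16", 16), ("int8", 8),
                    ("6-bit", 6), ("5-bit", 5), ("3-bit", 3), ("1-bit", 1)]:
    b = bytes_per_vector(dims, bits)
    ratio = bytes_per_vector(dims, 32) / b        # compression vs raw fp32
    total_gib = b * 10_000_000 / 2**30            # 10M vectors
    print(f"{label:>5}: {b:4d} B  {ratio:4.1f}x  {total_gib:.2f} GiB")
```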

2. Access Counter Sketch

Temperature decisions require knowing which blocks are accessed frequently. RVF maintains a lightweight Count-Min Sketch per block set, stored in SKETCH_SEG segments.

Sketch Parameters

Width (w):    1024 counters
Depth (d):    4 hash functions
Counter size: 8-bit saturating (max 255)
Memory:       1024 * 4 * 1 = 4 KB per sketch
Granularity:  One sketch per 1024-vector block
Decay:        Halve all counters every 2^16 accesses (aging)

For 10M vectors in 1024-vector blocks:

  • 9,766 blocks
  • 9,766 * 4 KB = ~38 MB of sketches
  • Stored in SKETCH_SEG, referenced by manifest

Sketch Operations

On query access:

block_id = vector_id / block_size
for i in 0..depth:
    idx = hash_i(block_id) % width
    sketch[i][idx] = min(sketch[i][idx] + 1, 255)

On temperature check:

count = min over i of sketch[i][hash_i(block_id) % width]
if count > HOT_THRESHOLD:   tier = 0
elif count > WARM_THRESHOLD: tier = 1
else:                        tier = 2

Aging (every 2^16 accesses):

for all counters: counter = counter >> 1

This ensures the sketch tracks recent access patterns, not cumulative history.
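The update, estimate, and aging steps above can be sketched as a minimal Count-Min Sketch. The salted `blake2b` hash standing in for `hash_i` and the threshold values in `tier_for` are illustrative choices, not part of the RVF wire format:

```python
import hashlib

WIDTH, DEPTH, MAX_COUNT = 1024, 4, 255

class CountMinSketch:
    def __init__(self):
        self.table = [[0] * WIDTH for _ in range(DEPTH)]

    def _index(self, row: int, block_id: int) -> int:
        # One independent hash per row; a salted blake2b stands in for hash_i.
        h = hashlib.blake2b(f"{row}:{block_id}".encode(), digest_size=8)
        return int.from_bytes(h.digest(), "little") % WIDTH

    def record_access(self, block_id: int) -> None:
        for row in range(DEPTH):
            idx = self._index(row, block_id)
            # 8-bit saturating increment.
            self.table[row][idx] = min(self.table[row][idx] + 1, MAX_COUNT)

    def estimate(self, block_id: int) -> int:
        # Point estimate = minimum over rows; never underestimates.
        return min(self.table[row][self._index(row, block_id)]
                   for row in range(DEPTH))

    def age(self) -> None:
        # Halve every counter so recent accesses dominate the estimate.
        for row in self.table:
            for i in range(WIDTH):
                row[i] >>= 1

def tier_for(count: int, hot: int = 64, warm: int = 8) -> int:
    # Illustrative thresholds; the spec leaves HOT/WARM_THRESHOLD open.
    return 0 if count > hot else (1 if count > warm else 2)
```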

Why Count-Min Sketch

| Alternative        | Memory           | Accuracy    | Update Cost           |
|--------------------|------------------|-------------|-----------------------|
| Per-vector counter | 80 MB (10M * 8 B) | Exact       | O(1)                  |
| Count-Min Sketch   | 38 MB            | ~99.9%      | O(depth) = O(4)       |
| HyperLogLog        | 6 MB             | ~98%        | O(1), but cardinality only |
| Bloom filter       | 12 MB            | No counting | N/A                   |

Count-Min Sketch is the best trade-off: near-exact accuracy with bounded memory and constant-time updates.

3. Promotion and Demotion

Promotion: Warm/Cold -> Hot

When a block's access count exceeds HOT_THRESHOLD for two consecutive sketch epochs:

1. Read the block from its current VEC_SEG
2. Decode/dequantize vectors to fp16 or int8
3. Rearrange from columnar to interleaved layout
4. Write as a new HOT_SEG (or append to existing HOT_SEG)
5. Update manifest with new tier assignment
6. Optionally: add neighbor lists to HOT_SEG for locality

Demotion: Hot -> Warm -> Cold

When a block's access count drops below WARM_THRESHOLD:

1. The block is not immediately rewritten
2. On next compaction cycle, the block is written to the appropriate tier
3. Quantization is applied during compaction (not lazily)
4. The HOT_SEG entry is tombstoned in the manifest
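One way to express the two-epoch hysteresis described above — promotion only after two consecutive hot epochs, demotion deferred to the next compaction cycle. The thresholds and state layout are assumptions for illustration, not spec'd values:

```python
HOT_THRESHOLD, WARM_THRESHOLD = 64, 8

def next_tier(current_tier: int, epoch_counts: list[int]) -> int:
    """Decide a block's tier from its recent sketch-epoch counts.

    epoch_counts[-1] is the most recent epoch's estimate for the block.
    """
    last_two = epoch_counts[-2:]
    # Promote only if the block was hot in two consecutive epochs.
    if len(last_two) == 2 and all(c > HOT_THRESHOLD for c in last_two):
        return 0
    # Demote out of the hot tier once the count falls below warm;
    # the rewrite itself happens on the next compaction cycle.
    if current_tier == 0 and epoch_counts[-1] < WARM_THRESHOLD:
        return 1
    # Anything untouched for two full epochs is cold.
    if len(last_two) == 2 and all(c == 0 for c in last_two):
        return 2
    return current_tier
```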

Eviction as Compression

The key insight: eviction from hot tier is just compression, not deletion. The vector data is always present — it just moves to a more compressed representation. This means:

  • No data loss on eviction
  • Recall degrades gracefully (quantized vectors still contribute to search)
  • The file naturally compresses over time as access patterns stabilize

4. Temperature-Aware Compaction

Standard compaction merges segments for space efficiency. Temperature-aware compaction also rearranges blocks by tier:

Before compaction:
  VEC_SEG_1:  [hot] [cold] [warm] [hot] [cold]
  VEC_SEG_2:  [warm] [hot] [cold] [warm] [warm]

After temperature-aware compaction:
  HOT_SEG:    [hot] [hot] [hot]       <- interleaved, fp16
  VEC_SEG_W:  [warm] [warm] [warm] [warm]  <- columnar, 6-bit
  VEC_SEG_C:  [cold] [cold] [cold]     <- columnar, 3-bit

This creates physical locality by temperature: hot blocks are contiguous (good for sequential scan), warm blocks are contiguous (good for batch decode), cold blocks are contiguous (good for compression ratio).
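The regrouping step amounts to a stable partition of blocks by tier before segments are rewritten. A minimal sketch, using the segment names from the diagram above:

```python
def partition_by_tier(blocks: list[tuple[int, int]]) -> dict[str, list[int]]:
    """blocks: (block_id, tier) pairs read from the pre-compaction segments.

    Returns block ids grouped per output segment, preserving input order
    within each tier (a stable partition).
    """
    segments = {"HOT_SEG": [], "VEC_SEG_W": [], "VEC_SEG_C": []}
    names = {0: "HOT_SEG", 1: "VEC_SEG_W", 2: "VEC_SEG_C"}
    for block_id, tier in blocks:
        segments[names[tier]].append(block_id)
    return segments
```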

Compaction Triggers

| Trigger             | Condition              | Action                          |
|---------------------|------------------------|---------------------------------|
| Sketch epoch        | Every N writes         | Evaluate all block temperatures |
| Space amplification | Dead space > 30%       | Merge + rewrite segments        |
| Tier imbalance      | Hot tier > 20% of data | Demote the coldest hot blocks   |
| Hot miss rate       | Hot cache miss > 10%   | Promote missing blocks          |

5. Quantization Strategies by Tier

Tier 0: Hot

Scalar quantization to int8 (preferred) or fp16 (for maximum recall).

Encoding:
  q = round((v - min) / (max - min) * 255)

Decoding:
  v = q / 255 * (max - min) + min

Parameters stored in QUANT_SEG:
  min: f32 per dimension
  max: f32 per dimension

Distance computation directly on int8 using SIMD (vpsubb + vpmaddubsw on AVX-512).
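A per-dimension round trip of the encode/decode formulas above (pure Python for clarity; a real implementation would vectorize this, and the `max == min` guard is an assumption for degenerate dimensions):

```python
def sq_encode(v: list[float], lo: list[float], hi: list[float]) -> list[int]:
    # q = round((v - min) / (max - min) * 255), clamped to [0, 255]
    return [
        min(255, max(0, round((x - l) / (h - l) * 255))) if h > l else 0
        for x, l, h in zip(v, lo, hi)
    ]

def sq_decode(q: list[int], lo: list[float], hi: list[float]) -> list[float]:
    # v = q / 255 * (max - min) + min
    return [c / 255 * (h - l) + l for c, l, h in zip(q, lo, hi)]

lo, hi = [-1.0, 0.0], [1.0, 2.0]          # per-dimension ranges from QUANT_SEG
codes = sq_encode([0.5, 1.0], lo, hi)
approx = sq_decode(codes, lo, hi)
# Reconstruction error is bounded by half a quantization step per dimension.
assert all(abs(a - b) <= (h - l) / 255 / 2 + 1e-9
           for a, b, l, h in zip(approx, [0.5, 1.0], lo, hi))
```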

Tier 1: Warm

Product Quantization (PQ) with 5-7 bits per sub-vector.

Parameters:
  M subspaces:          48 (for 384-dim vectors, 8 dims per subspace)
  K centroids per sub:  64 (6-bit) or 128 (7-bit)
  Codebook:             M * K * 8 * sizeof(f32) = 48 * 64 * 8 * 4 = 96 KB

Encoding:
  For each subvector: find nearest centroid -> store centroid index

Distance computation:
  ADC (Asymmetric Distance Computation) with precomputed distance tables
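A toy version of PQ encoding and ADC lookup. The tiny dimensions and hand-written centroids are purely illustrative; real codebooks come from k-means training and live in QUANT_SEG:

```python
def pq_encode(v, codebook):
    """codebook[m][k] = centroid k of subspace m; returns one code per subspace."""
    d_sub = len(v) // len(codebook)
    codes = []
    for m, centroids in enumerate(codebook):
        sub = v[m * d_sub:(m + 1) * d_sub]
        # Nearest centroid by squared L2 distance -> store its index.
        codes.append(min(
            range(len(centroids)),
            key=lambda k: sum((a - b) ** 2 for a, b in zip(sub, centroids[k])),
        ))
    return codes

def adc_tables(query, codebook):
    """Precompute distance from each query subvector to every centroid."""
    d_sub = len(query) // len(codebook)
    return [
        [sum((a - b) ** 2 for a, b in zip(query[m * d_sub:(m + 1) * d_sub], c))
         for c in centroids]
        for m, centroids in enumerate(codebook)
    ]

def adc_distance(codes, tables):
    # Asymmetric distance: one table lookup per subspace, summed.
    return sum(tables[m][k] for m, k in enumerate(codes))
```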

Tier 2: Cold

Binary quantization (1-bit) or ternary quantization (2-bit / 3-bit).

Binary encoding:
  b = sign(v)  -> 1 bit per dimension
  384 dims -> 48 bytes per vector (32x compression)

Distance:
  Hamming distance via POPCNT
  XOR + POPCNT on AVX-512: 512 bits per cycle

Ternary (3-bit with magnitude):
  t = {-1, 0, +1} based on threshold
  magnitude = |v| quantized to 3 levels
  384 dims -> 144 bytes per vector (10.7x compression)

Codebook Storage

All quantization parameters (codebooks, min/max ranges, centroids) are stored in QUANT_SEG segments. The root manifest's quantdict_seg_offset hotset pointer references the active quantization dictionary for fast boot.

Multiple QUANT_SEGs can coexist for different tiers — the manifest maps each tier to its dictionary.

6. Hardware Adaptation

Desktop (AVX-512)

  • Hot tier: int8 with VNNI dot product (4 int8 multiplies per cycle)
  • Warm tier: PQ with AVX-512 gather for table lookups
  • Cold tier: Binary with VPOPCNTDQ (512-bit popcount)

ARM (NEON)

  • Hot tier: int8 with SDOT instruction
  • Warm tier: PQ with TBL for table lookups
  • Cold tier: Binary with CNT (population count)

WASM (v128)

  • Hot tier: int8 with i16x8.relaxed_dot_i8x16_i7x16_s (relaxed SIMD, if available)
  • Warm tier: Scalar PQ (no gather)
  • Cold tier: Binary with manual popcount

Cognitum Tile (8KB code + 8KB data + 64KB SIMD)

  • Hot tier only: int8 interleaved, fits in SIMD scratch
  • No warm/cold — data stays on hub, tile fetches blocks on demand
  • Sketch is maintained by hub, not tile

7. Self-Organization Over Time

t=0    All data Tier 1 (default warm)
       |
t+N    First sketch epoch: identify hot blocks
       Promote top 5% to Tier 0
       |
t+2N   Second epoch: validate promotions
       Demote false positives back to Tier 1
       Identify true cold blocks (0 access in 2 epochs)
       |
t+3N   Compaction: physically separate tiers
       HOT_SEG created with interleaved layout
       Cold blocks compressed to 3-bit
       |
t+∞    Equilibrium: ~5% hot, ~30% warm, ~65% cold
       File size: ~2-3x smaller than uniform fp16
       Query p95: dominated by hot tier latency

The format converges to an equilibrium that reflects actual usage. No manual tuning required.