RVF Temperature Tiering

1. Adaptive Layout as a First-Class Concept

Traditional vector formats place data once and leave it. RVF treats data placement as a continuous optimization problem. Every vector block has a temperature, and the format periodically reorganizes to keep hot data fast and cold data small.

                Access Frequency
                     ^
                     |
Tier 0 (HOT)        |  ████████   fp16 / 8-bit, interleaved
                     |  ████████   < 1μs random access
                     |
Tier 1 (WARM)        |  ░░░░░░░░░░░░░░░░   5-7 bit quantized
                     |  ░░░░░░░░░░░░░░░░   columnar, compressed
                     |
Tier 2 (COLD)        |  ▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒   3-bit or 1-bit
                     |  ▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒   heavy compression
                     |
                     +------------------------------------> Vector ID

Tier Definitions

| Tier | Name | Quantization    | Layout                  | Compression   | Access Latency |
|------|------|-----------------|-------------------------|---------------|----------------|
| 0    | Hot  | fp16 or int8    | Interleaved (row-major) | None or LZ4   | < 1 μs         |
| 1    | Warm | 5-7 bit SQ/PQ   | Columnar                | LZ4 or ZSTD   | 1-10 μs        |
| 2    | Cold | 3-bit or binary | Columnar                | ZSTD level 9+ | 10-100 μs      |

Memory Ratios

For 384-dimensional vectors (typical embedding size):

| Tier           | Bytes/Vector | Ratio vs fp32 | 10M Vectors |
|----------------|--------------|---------------|-------------|
| fp32 (raw)     | 1536 B       | 1.0x          | 14.3 GB     |
| Tier 0 (fp16)  | 768 B        | 2.0x          | 7.2 GB      |
| Tier 0 (int8)  | 384 B        | 4.0x          | 3.6 GB      |
| Tier 1 (6-bit) | 288 B        | 5.3x          | 2.7 GB      |
| Tier 1 (5-bit) | 240 B        | 6.4x          | 2.2 GB      |
| Tier 2 (3-bit) | 144 B        | 10.7x         | 1.3 GB      |
| Tier 2 (1-bit) | 48 B         | 32.0x         | 0.45 GB     |
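These figures follow directly from bits per dimension. A quick sanity check of the table (assuming 384 dimensions and binary GiB for the totals):

```python
def bytes_per_vector(dims: int, bits: int) -> int:
    """Storage per vector at a given bit width, rounded up to whole bytes."""
    return (dims * bits + 7) // 8

dims = 384
for label, bits in [("fp32", 32), ("fp16", 16), ("int8", 8),
                    ("6-bit", 6), ("5-bit", 5), ("3-bit", 3), ("1-bit", 1)]:
    b = bytes_per_vector(dims, bits)
    ratio = bytes_per_vector(dims, 32) / b        # compression vs raw fp32
    total_gib = b * 10_000_000 / 2**30            # 10M vectors
    print(f"{label:>5}: {b:4d} B  {ratio:4.1f}x  {total_gib:.2f} GiB")
```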

2. Access Counter Sketch

Temperature decisions require knowing which blocks are accessed frequently. RVF maintains a lightweight Count-Min Sketch per block set, stored in SKETCH_SEG segments.

Sketch Parameters

Width (w):    1024 counters
Depth (d):    4 hash functions
Counter size: 8-bit saturating (max 255)
Memory:       1024 * 4 * 1 = 4 KB per sketch
Granularity:  One sketch per 1024-vector block
Decay:        Halve all counters every 2^16 accesses (aging)

For 10M vectors in 1024-vector blocks:

  • 9,766 blocks
  • 9,766 * 4 KB = ~38 MB of sketches
  • Stored in SKETCH_SEG, referenced by manifest

Sketch Operations

On query access:

block_id = vector_id / block_size
for i in 0..depth:
    idx = hash_i(block_id) % width
    sketch[i][idx] = min(sketch[i][idx] + 1, 255)

On temperature check:

count = min over i of sketch[i][hash_i(block_id) % width]
if count > HOT_THRESHOLD:   tier = 0
elif count > WARM_THRESHOLD: tier = 1
else:                        tier = 2

Aging (every 2^16 accesses):

for all counters: counter = counter >> 1

This ensures the sketch tracks recent access patterns, not cumulative history.
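The update, estimate, and aging steps above can be sketched as a minimal Count-Min Sketch. The salted `blake2b` hash standing in for `hash_i` and the threshold values in `tier_for` are illustrative choices, not part of the RVF wire format:

```python
import hashlib

WIDTH, DEPTH, MAX_COUNT = 1024, 4, 255

class CountMinSketch:
    def __init__(self):
        self.table = [[0] * WIDTH for _ in range(DEPTH)]

    def _index(self, row: int, block_id: int) -> int:
        # One independent hash per row; a salted blake2b stands in for hash_i.
        h = hashlib.blake2b(f"{row}:{block_id}".encode(), digest_size=8)
        return int.from_bytes(h.digest(), "little") % WIDTH

    def record_access(self, block_id: int) -> None:
        for row in range(DEPTH):
            idx = self._index(row, block_id)
            # 8-bit saturating increment.
            self.table[row][idx] = min(self.table[row][idx] + 1, MAX_COUNT)

    def estimate(self, block_id: int) -> int:
        # Point estimate = minimum over rows; never underestimates.
        return min(self.table[row][self._index(row, block_id)]
                   for row in range(DEPTH))

    def age(self) -> None:
        # Halve every counter so recent accesses dominate the estimate.
        for row in self.table:
            for i in range(WIDTH):
                row[i] >>= 1

def tier_for(count: int, hot: int = 64, warm: int = 8) -> int:
    # Illustrative thresholds; the spec leaves HOT/WARM_THRESHOLD open.
    return 0 if count > hot else (1 if count > warm else 2)
```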

Why Count-Min Sketch

| Alternative        | Memory           | Accuracy    | Update Cost           |
|--------------------|------------------|-------------|-----------------------|
| Per-vector counter | 80 MB (10M * 8 B) | Exact       | O(1)                  |
| Count-Min Sketch   | 38 MB            | ~99.9%      | O(depth) = O(4)       |
| HyperLogLog        | 6 MB             | ~98%        | O(1), but cardinality only |
| Bloom filter       | 12 MB            | No counting | N/A                   |

Count-Min Sketch is the best trade-off: near-exact accuracy with bounded memory and constant-time updates.

3. Promotion and Demotion

Promotion: Warm/Cold -> Hot

When a block's access count exceeds HOT_THRESHOLD for two consecutive sketch epochs:

1. Read the block from its current VEC_SEG
2. Decode/dequantize vectors to fp16 or int8
3. Rearrange from columnar to interleaved layout
4. Write as a new HOT_SEG (or append to existing HOT_SEG)
5. Update manifest with new tier assignment
6. Optionally: add neighbor lists to HOT_SEG for locality

Demotion: Hot -> Warm -> Cold

When a block's access count drops below WARM_THRESHOLD:

1. The block is not immediately rewritten
2. On next compaction cycle, the block is written to the appropriate tier
3. Quantization is applied during compaction (not lazily)
4. The HOT_SEG entry is tombstoned in the manifest
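One way to express the two-epoch hysteresis described above — promotion only after two consecutive hot epochs, demotion deferred to the next compaction cycle. The thresholds and state layout are assumptions for illustration, not spec'd values:

```python
HOT_THRESHOLD, WARM_THRESHOLD = 64, 8

def next_tier(current_tier: int, epoch_counts: list[int]) -> int:
    """Decide a block's tier from its recent sketch-epoch counts.

    epoch_counts[-1] is the most recent epoch's estimate for the block.
    """
    last_two = epoch_counts[-2:]
    # Promote only if the block was hot in two consecutive epochs.
    if len(last_two) == 2 and all(c > HOT_THRESHOLD for c in last_two):
        return 0
    # Demote out of the hot tier once the count falls below warm;
    # the rewrite itself happens on the next compaction cycle.
    if current_tier == 0 and epoch_counts[-1] < WARM_THRESHOLD:
        return 1
    # Anything untouched for two full epochs is cold.
    if len(last_two) == 2 and all(c == 0 for c in last_two):
        return 2
    return current_tier
```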

Eviction as Compression

The key insight: eviction from hot tier is just compression, not deletion. The vector data is always present — it just moves to a more compressed representation. This means:

  • No data loss on eviction
  • Recall degrades gracefully (quantized vectors still contribute to search)
  • The file naturally compresses over time as access patterns stabilize

4. Temperature-Aware Compaction

Standard compaction merges segments for space efficiency. Temperature-aware compaction also rearranges blocks by tier:

Before compaction:
  VEC_SEG_1:  [hot] [cold] [warm] [hot] [cold]
  VEC_SEG_2:  [warm] [hot] [cold] [warm] [warm]

After temperature-aware compaction:
  HOT_SEG:    [hot] [hot] [hot]       <- interleaved, fp16
  VEC_SEG_W:  [warm] [warm] [warm] [warm]  <- columnar, 6-bit
  VEC_SEG_C:  [cold] [cold] [cold]     <- columnar, 3-bit

This creates physical locality by temperature: hot blocks are contiguous (good for sequential scan), warm blocks are contiguous (good for batch decode), cold blocks are contiguous (good for compression ratio).
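The regrouping step amounts to a stable partition of blocks by tier before segments are rewritten. A minimal sketch, using the segment names from the diagram above:

```python
def partition_by_tier(blocks: list[tuple[int, int]]) -> dict[str, list[int]]:
    """blocks: (block_id, tier) pairs read from the pre-compaction segments.

    Returns block ids grouped per output segment, preserving input order
    within each tier (a stable partition).
    """
    segments = {"HOT_SEG": [], "VEC_SEG_W": [], "VEC_SEG_C": []}
    names = {0: "HOT_SEG", 1: "VEC_SEG_W", 2: "VEC_SEG_C"}
    for block_id, tier in blocks:
        segments[names[tier]].append(block_id)
    return segments
```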

Compaction Triggers

| Trigger             | Condition              | Action                          |
|---------------------|------------------------|---------------------------------|
| Sketch epoch        | Every N writes         | Evaluate all block temperatures |
| Space amplification | Dead space > 30%       | Merge + rewrite segments        |
| Tier imbalance      | Hot tier > 20% of data | Demote the coldest hot blocks   |
| Hot miss rate       | Hot cache miss > 10%   | Promote missing blocks          |

5. Quantization Strategies by Tier

Tier 0: Hot

Scalar quantization to int8 (preferred) or fp16 (for maximum recall).

Encoding:
  q = round((v - min) / (max - min) * 255)

Decoding:
  v = q / 255 * (max - min) + min

Parameters stored in QUANT_SEG:
  min: f32 per dimension
  max: f32 per dimension

Distance computation directly on int8 using SIMD (vpsubb + vpmaddubsw on AVX-512).
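A per-dimension round trip of the encode/decode formulas above (pure Python for clarity; a real implementation would vectorize this, and the `max == min` guard is an assumption for degenerate dimensions):

```python
def sq_encode(v: list[float], lo: list[float], hi: list[float]) -> list[int]:
    # q = round((v - min) / (max - min) * 255), clamped to [0, 255]
    return [
        min(255, max(0, round((x - l) / (h - l) * 255))) if h > l else 0
        for x, l, h in zip(v, lo, hi)
    ]

def sq_decode(q: list[int], lo: list[float], hi: list[float]) -> list[float]:
    # v = q / 255 * (max - min) + min
    return [c / 255 * (h - l) + l for c, l, h in zip(q, lo, hi)]

lo, hi = [-1.0, 0.0], [1.0, 2.0]          # per-dimension ranges from QUANT_SEG
codes = sq_encode([0.5, 1.0], lo, hi)
approx = sq_decode(codes, lo, hi)
# Reconstruction error is bounded by half a quantization step per dimension.
assert all(abs(a - b) <= (h - l) / 255 / 2 + 1e-9
           for a, b, l, h in zip(approx, [0.5, 1.0], lo, hi))
```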

Tier 1: Warm

Product Quantization (PQ) with 5-7 bits per sub-vector.

Parameters:
  M subspaces:          48 (for 384-dim vectors, 8 dims per subspace)
  K centroids per sub:  64 (6-bit) or 128 (7-bit)
  Codebook:             M * K * 8 * sizeof(f32) = 48 * 64 * 8 * 4 = 96 KB

Encoding:
  For each subvector: find nearest centroid -> store centroid index

Distance computation:
  ADC (Asymmetric Distance Computation) with precomputed distance tables
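A toy version of PQ encoding and ADC lookup. The tiny dimensions and hand-written centroids are purely illustrative; real codebooks come from k-means training and live in QUANT_SEG:

```python
def pq_encode(v, codebook):
    """codebook[m][k] = centroid k of subspace m; returns one code per subspace."""
    d_sub = len(v) // len(codebook)
    codes = []
    for m, centroids in enumerate(codebook):
        sub = v[m * d_sub:(m + 1) * d_sub]
        # Nearest centroid by squared L2 distance -> store its index.
        codes.append(min(
            range(len(centroids)),
            key=lambda k: sum((a - b) ** 2 for a, b in zip(sub, centroids[k])),
        ))
    return codes

def adc_tables(query, codebook):
    """Precompute distance from each query subvector to every centroid."""
    d_sub = len(query) // len(codebook)
    return [
        [sum((a - b) ** 2 for a, b in zip(query[m * d_sub:(m + 1) * d_sub], c))
         for c in centroids]
        for m, centroids in enumerate(codebook)
    ]

def adc_distance(codes, tables):
    # Asymmetric distance: one table lookup per subspace, summed.
    return sum(tables[m][k] for m, k in enumerate(codes))
```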

Tier 2: Cold

Binary quantization (1-bit) or ternary quantization (2-bit / 3-bit).

Binary encoding:
  b = sign(v)  -> 1 bit per dimension
  384 dims -> 48 bytes per vector (32x compression)

Distance:
  Hamming distance via POPCNT
  XOR + POPCNT on AVX-512: 512 bits per cycle

Ternary (3-bit with magnitude):
  t = {-1, 0, +1} based on threshold
  magnitude = |v| quantized to 3 levels
  384 dims -> 144 bytes per vector (10.7x compression)

Codebook Storage

All quantization parameters (codebooks, min/max ranges, centroids) are stored in QUANT_SEG segments. The root manifest's quantdict_seg_offset hotset pointer references the active quantization dictionary for fast boot.

Multiple QUANT_SEGs can coexist for different tiers — the manifest maps each tier to its dictionary.

6. Hardware Adaptation

Desktop (AVX-512)

  • Hot tier: int8 with VNNI dot product (4 int8 multiplies per cycle)
  • Warm tier: PQ with AVX-512 gather for table lookups
  • Cold tier: Binary with VPOPCNTDQ (512-bit popcount)

ARM (NEON)

  • Hot tier: int8 with SDOT instruction
  • Warm tier: PQ with TBL for table lookups
  • Cold tier: Binary with CNT (population count)

WASM (v128)

  • Hot tier: int8 with i16x8.relaxed_dot_i8x16_i7x16_s (relaxed SIMD, if available)
  • Warm tier: Scalar PQ (no gather)
  • Cold tier: Binary with manual popcount

Cognitum Tile (8KB code + 8KB data + 64KB SIMD)

  • Hot tier only: int8 interleaved, fits in SIMD scratch
  • No warm/cold — data stays on hub, tile fetches blocks on demand
  • Sketch is maintained by hub, not tile

7. Self-Organization Over Time

t=0    All data Tier 1 (default warm)
       |
t+N    First sketch epoch: identify hot blocks
       Promote top 5% to Tier 0
       |
t+2N   Second epoch: validate promotions
       Demote false positives back to Tier 1
       Identify true cold blocks (0 access in 2 epochs)
       |
t+3N   Compaction: physically separate tiers
       HOT_SEG created with interleaved layout
       Cold blocks compressed to 3-bit
       |
t+∞    Equilibrium: ~5% hot, ~30% warm, ~65% cold
       File size: ~2-3x smaller than uniform fp16
       Query p95: dominated by hot tier latency

The format converges to an equilibrium that reflects actual usage. No manual tuning required.