RVF Progressive Indexing
1. Index as Layers of Availability
Traditional HNSW serialization is all-or-nothing: either the full graph is loaded, or nothing works. RVF decomposes the index into three layers of availability, each independently useful, each stored in separate INDEX_SEG segments.
Layer C: Full Adjacency
+--------------------------------------------------+
| Complete neighbor lists for every node at every |
| HNSW level. Built lazily. Optional for queries. |
| Recall: >= 0.95 |
+--------------------------------------------------+
^ loaded last (seconds to minutes)
|
Layer B: Partial Adjacency
+--------------------------------------------------+
| Neighbor lists for the most-accessed region |
| (determined by temperature sketch). Covers the |
| hot working set of the graph. |
| Recall: >= 0.85 |
+--------------------------------------------------+
^ loaded second (100ms - 1s)
|
Layer A: Entry Points + Coarse Routing
+--------------------------------------------------+
| HNSW entry points. Top-layer adjacency lists. |
| Cluster centroids for IVF pre-routing. |
| Always present. Always in Level 0 hotset. |
| Recall: >= 0.70 |
+--------------------------------------------------+
^ loaded first (< 5ms)
|
File open
Why Three Layers
| Layer | Purpose | Data Size (10M vectors) | Load Time (NVMe) |
|---|---|---|---|
| A | First query possible | 1-4 MB | < 5 ms |
| B | Good quality for working set | 50-200 MB | 100-500 ms |
| C | Full recall for all queries | 1-4 GB | 2-10 s |
A system that only loads Layer A can still answer queries — just with lower recall. As layers B and C load asynchronously, quality improves transparently.
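The layered-availability idea can be sketched as a small tracker whose guaranteed recall floor rises as layers finish loading. This is an illustrative model, not part of the format; the names (`IndexAvailability`, `RECALL_FLOOR`) are hypothetical, and the floors come from the table above.

```python
from dataclasses import dataclass, field

# Recall floors per layer, from the table above (hypothetical constants)
RECALL_FLOOR = {"A": 0.70, "B": 0.85, "C": 0.95}

@dataclass
class IndexAvailability:
    loaded: set = field(default_factory=set)

    def mark_loaded(self, layer: str) -> None:
        # Layers arrive asynchronously in order A -> B -> C;
        # each is independently useful once present.
        self.loaded.add(layer)

    def recall_floor(self) -> float:
        # The guarantee comes from the best layer loaded so far.
        if not self.loaded:
            return 0.0
        return max(RECALL_FLOOR[layer] for layer in self.loaded)

state = IndexAvailability()
state.mark_loaded("A")        # first ~5 ms: queries become possible
print(state.recall_floor())   # 0.7
state.mark_loaded("B")        # background load completes later
print(state.recall_floor())   # 0.85
```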
2. Layer A: Entry Points and Coarse Routing
Content
- HNSW entry points: The node(s) at the highest layer of the HNSW graph. Typically 1 node, but may be multiple for redundancy.
- Top-layer adjacency: Full neighbor lists for all nodes at HNSW layers >= ceil(ln(N) / ln(M)) - 2. For 10M vectors with M=16, this is layers 5-6, containing ~100-1000 nodes.
- Cluster centroids: K centroids (K = sqrt(N) typically, so ~3162 for 10M) used for IVF-style partition routing.
- Centroid-to-partition map: Which centroid owns which vector ID ranges.
Storage
Layer A data is stored in a dedicated INDEX_SEG with flags.HOT set. The root
manifest's hotset pointers reference this segment directly. On cold start, this
is the first data mapped after the manifest.
Binary Layout of Layer A INDEX_SEG
+-------------------------------------------+
| Header: INDEX_SEG, flags=HOT |
+-------------------------------------------+
| Block 0: Entry Points |
| entry_count: u32 |
| max_layer: u32 |
| [entry_node_id: u64, layer: u32] * entry_count |
+-------------------------------------------+
| Block 1: Top-Layer Adjacency |
| layer_count: u32 |
| For each layer (top to bottom): |
| node_count: u32 |
| For each node: |
| node_id: u64 |
| neighbor_count: u16 |
| [neighbor_id: u64] * neighbor_count |
| [64B padding] |
+-------------------------------------------+
| Block 2: Centroids |
| centroid_count: u32 |
| dim: u16 |
| dtype: u8 (fp16) |
| [centroid_vector: fp16 * dim] * K |
| [64B aligned] |
+-------------------------------------------+
| Block 3: Partition Map |
| partition_count: u32 |
| For each partition: |
| centroid_id: u32 |
| vector_id_start: u64 |
| vector_id_end: u64 |
| segment_ref: u64 (segment_id) |
| block_ref: u32 (block offset) |
+-------------------------------------------+
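As a concreteness check, Block 0 of this layout can be parsed with a few lines of `struct`. This is a sketch under two assumptions not stated above: fields are little-endian, and records are packed with no padding inside the block.

```python
import struct

def parse_entry_points(buf: bytes, offset: int = 0):
    """Parse Block 0 (Entry Points) of a Layer A INDEX_SEG.

    Assumed layout (little-endian, packed):
      entry_count: u32, max_layer: u32,
      then entry_count * (entry_node_id: u64, layer: u32).
    Returns (max_layer, [(node_id, layer), ...], next_offset).
    """
    entry_count, max_layer = struct.unpack_from("<II", buf, offset)
    offset += 8
    entries = []
    for _ in range(entry_count):
        node_id, layer = struct.unpack_from("<QI", buf, offset)
        entries.append((node_id, layer))
        offset += 12
    return max_layer, entries, offset

# Round-trip against a hand-built buffer: one entry point at layer 6
blob = struct.pack("<II", 1, 6) + struct.pack("<QI", 42, 6)
max_layer, entries, end = parse_entry_points(blob)
print(max_layer, entries)  # 6 [(42, 6)]
```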
Query Using Only Layer A
def query_layer_a_only(query, k, layer_a, n_probe=8, hot_cache=None):
    # Step 1: Find nearest centroids (IVF-style pre-routing)
    dists = [distance(query, c) for c in layer_a.centroids]
    top_partitions = top_n(dists, n_probe)

    # Step 2: HNSW greedy descent through the available top layers only
    current = layer_a.entry_points[0]
    for layer in range(layer_a.max_layer, layer_a.min_available_layer - 1, -1):
        current = greedy_search(query, current, layer_a.adjacency[layer])

    # Step 3: If a hot cache is available, refine against it
    if hot_cache is not None:
        candidates = scan_hot_cache(query, hot_cache, current.partition)
        return top_k(candidates, k)

    # Step 4: Otherwise, return centroid-approximate results
    return approximate_from_centroids(query, top_partitions, k)
Expected recall: 0.65-0.75 (depends on centroid quality and hot cache coverage).
3. Layer B: Partial Adjacency
Content
Neighbor lists for the hot region of the graph — the set of nodes that appear most frequently in query traversals. Determined by the temperature sketch (see 03-temperature-tiering.md).
Typically covers:
- All nodes at HNSW layers >= 2
- Layer 0-1 nodes in the hot temperature tier
- ~10-20% of total nodes
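The coverage rule above can be sketched as a selection function over per-node (level, temperature) pairs. The interface is hypothetical; in the real system the temperature comes from the sketch described in 03-temperature-tiering.md.

```python
def select_layer_b_nodes(nodes, hot_fraction=0.15):
    """Pick the nodes that receive Layer B adjacency (a sketch).

    `nodes` maps node_id -> (hnsw_level, temperature). All nodes at
    level >= 2 are always included; level 0-1 nodes are included only
    if they fall in the hottest `hot_fraction` of all nodes.
    """
    always = {nid for nid, (lvl, _) in nodes.items() if lvl >= 2}
    low = [(temp, nid) for nid, (lvl, temp) in nodes.items() if lvl < 2]
    low.sort(reverse=True)  # hottest first
    budget = int(len(nodes) * hot_fraction)
    hot = {nid for _, nid in low[:budget]}
    return always | hot

hot = select_layer_b_nodes(
    {1: (3, 0.1), 2: (0, 0.9), 3: (0, 0.2), 4: (1, 0.5)},
    hot_fraction=0.5)
print(sorted(hot))  # [1, 2, 4]
```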
Storage
Layer B is stored in one or more INDEX_SEGs without the HOT flag. The Level 1 manifest maps these segments and records which node ID ranges they cover.
Incremental Build
Layer B can be built incrementally:
1. After Layer A is loaded, begin query serving
2. In background: read VEC_SEGs for hot-tier blocks
3. Build HNSW adjacency for those blocks
4. Write as new INDEX_SEG
5. Update manifest to include Layer B
6. Future queries use Layer B for better recall
This means the index improves over time without blocking any queries.
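Steps 1-6 can be sketched as a background task that never blocks the serving path. The `store` and `temperature_sketch` objects here are hypothetical interfaces standing in for the segment store and temperature tracking; only the control flow mirrors the list above.

```python
import threading

def build_layer_b_in_background(store, temperature_sketch):
    """Sketch of the incremental Layer B build (steps 1-6 above).

    Hypothetical interfaces: the store reads VEC_SEG blocks, writes
    INDEX_SEGs, and publishes a new manifest; the sketch ranks blocks
    by heat. Query serving continues on other threads throughout.
    """
    def worker():
        hot_blocks = temperature_sketch.hot_tier_blocks()           # step 2
        vectors = [v for b in hot_blocks for v in store.read_vec_seg(b)]
        adjacency = store.build_hnsw_adjacency(vectors)             # step 3
        seg_id = store.write_index_seg(adjacency)                   # step 4
        store.publish_manifest(extra_segments=[seg_id])             # step 5
        # Step 6 is implicit: new queries see the updated manifest.

    t = threading.Thread(target=worker, daemon=True)
    t.start()
    return t
```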
Partial Adjacency Routing
When a query traversal reaches a node without Layer B adjacency (i.e., it's in the cold region), the system falls back to:
- Centroid routing: Use Layer A centroids to estimate the nearest region
- Linear scan: Scan the relevant VEC_SEG block directly
- Approximate: Accept slightly lower recall for that portion
def search_with_partial_index(query, k, layers):
    # Start with Layer A routing (descend to layer 2)
    current = hnsw_search_layers(query, layers.a, layers.a.max_layer, 2)

    # Continue with Layer B where adjacency is available
    if layers.b.has_node(current):
        current = hnsw_search_layers(query, layers.b, 1, 0, start=current)
    else:
        # Fallback: linear-scan the block containing current
        candidates = linear_scan_block(query, current.block)
        current = best_of(current, candidates)

    return top_k(current.visited, k)
4. Layer C: Full Adjacency
Content
Complete neighbor lists for every node at every HNSW level. This is the traditional full HNSW graph.
Storage
Layer C may be split across multiple INDEX_SEGs for large datasets. The manifest records the node ID ranges covered by each segment.
Lazy Build
Layer C is built lazily — it is not required for the file to be functional. The build process runs as a background task:
1. Identify unindexed VEC_SEG blocks (those without Layer C adjacency)
2. Read blocks in partition order (good locality)
3. Build HNSW adjacency using the existing partial graph as scaffold
4. Write new INDEX_SEG(s)
5. Update manifest
Build Prioritization
Blocks are indexed in temperature order:
- Hot blocks first (most query benefit)
- Warm blocks next
- Cold blocks last (may never be indexed if queries don't reach them)
This means the index build converges to useful quality fast, then approaches completeness asymptotically.
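The prioritization can be sketched as a heap keyed on temperature tier; the tier names and the `build_order` helper are illustrative.

```python
import heapq

# Hypothetical tier ranks: lower rank is indexed sooner
TIER_PRIORITY = {"hot": 0, "warm": 1, "cold": 2}

def build_order(blocks):
    """Order unindexed blocks for the Layer C build (a sketch).

    `blocks` is a list of (block_id, tier) pairs; hotter tiers come
    first so recall improves fastest where queries actually land.
    """
    heap = [(TIER_PRIORITY[tier], block_id) for block_id, tier in blocks]
    heapq.heapify(heap)
    return [heapq.heappop(heap)[1] for _ in range(len(heap))]

print(build_order([(3, "cold"), (1, "hot"), (2, "warm")]))  # [1, 2, 3]
```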
5. Index Segment Binary Format
Adjacency List Encoding
Neighbor lists are stored using varint delta encoding with restart points for fast random access:
+-------------------------------------------+
| Restart Point Index |
| restart_interval: u32 (e.g., 64) |
| restart_count: u32 |
| [restart_offset: u32] * restart_count |
| [64B aligned] |
+-------------------------------------------+
| Adjacency Data |
| For each node (sorted by node_id): |
| neighbor_count: varint |
| [delta_encoded_neighbor_id: varint] |
| (restart point every N nodes) |
+-------------------------------------------+
Restart points: Every restart_interval nodes (default 64), the delta
encoding resets to absolute IDs. This bounds the cost of random access to any
node's neighbors:
- Binary search the restart point index for the nearest restart <= target
- Seek to that restart offset
- Sequentially decode from restart to target (at most restart_interval - 1 = 63 decodes)
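The lookup can be sketched as follows. Two simplifying assumptions not in the format: node indices are dense and consecutive (so the restart slot is a division rather than the binary search described above), and `decode_at` is a hypothetical decoder returning one node's neighbor list plus the next offset.

```python
def find_neighbors(node_index, restart_interval, restart_offsets, decode_at):
    """Random access into delta-encoded adjacency via restart points.

    `restart_offsets[i]` is the offset where node i * restart_interval
    begins with an absolute (non-delta) encoding. `decode_at(offset)`
    returns (neighbor_list, next_offset) for the node at `offset`.
    """
    # Nearest restart at or before the target node
    restart_idx = node_index // restart_interval
    offset = restart_offsets[restart_idx]
    # Sequentially decode from the restart up to the target
    neighbors = None
    for _ in range(node_index - restart_idx * restart_interval + 1):
        neighbors, offset = decode_at(offset)
    return neighbors
```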
Varint Encoding
Standard LEB128 varint:
- Values 0-127: 1 byte
- Values 128-16383: 2 bytes
- Values 16384-2097151: 3 bytes
For delta-encoded neighbor IDs (typical delta: 1-1000), most values fit in 1-2 bytes, giving ~3-4x compression over fixed u64.
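A minimal LEB128 encoder plus the delta step makes the compression claim concrete; `encode_neighbors` is an illustrative helper, not the exact segment writer.

```python
def encode_varint(value: int) -> bytes:
    """Standard unsigned LEB128: 7 payload bits per byte, high bit = continue."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)

def encode_neighbors(neighbor_ids):
    """Delta-encode a sorted neighbor list, then varint each delta."""
    out = bytearray(encode_varint(len(neighbor_ids)))
    prev = 0
    for nid in neighbor_ids:
        out += encode_varint(nid - prev)
        prev = nid
    return bytes(out)

blob = encode_neighbors([1000, 1003, 1010, 1042])
print(len(blob))  # 6: one byte for the count, two for the first delta, one each after
```

Four u64 neighbors would take 32 bytes fixed-width; here they pack into 6.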
Prefetch Hints
The manifest's prefetch table maps node ID ranges to contiguous page ranges:
Prefetch Entry:
node_id_start: u64
node_id_end: u64
page_offset: u64 Offset of first contiguous page
page_count: u32 Number of contiguous pages
prefetch_ahead: u32 Pages to prefetch ahead of current access
When the HNSW search accesses a node, the runtime issues madvise(WILLNEED)
(or equivalent) for the next prefetch_ahead pages. This hides disk/memory
latency behind computation.
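On POSIX platforms with Python 3.8+, the hint can be issued through `mmap.madvise`; this sketch applies one prefetch entry's semantics and clamps to the mapping bounds (the `prefetch_pages` name is illustrative).

```python
import mmap

def prefetch_pages(mm: mmap.mmap, page_offset: int, page_count: int,
                   prefetch_ahead: int, page_size: int = mmap.PAGESIZE):
    """Advise the kernel to fault in the pages just past the current access.

    Mirrors the prefetch-entry fields above: after touching pages
    [page_offset, page_offset + page_count), request the next
    `prefetch_ahead` pages. No-op where madvise is unavailable.
    """
    start = (page_offset + page_count) * page_size
    length = min(prefetch_ahead * page_size, max(len(mm) - start, 0))
    if length > 0 and hasattr(mm, "madvise"):
        mm.madvise(mmap.MADV_WILLNEED, start, length)
```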
6. Index Consistency
Append-Only Index Updates
When new vectors are added:
- New vectors go into a fresh VEC_SEG (append-only)
- A temporary in-memory index covers the new vectors
- When the in-memory index reaches a threshold, it is written as a new INDEX_SEG
- The manifest is updated to include both the old and new INDEX_SEGs
- Queries search both indexes and merge results
This is analogous to LSM-tree compaction levels but for graph indexes.
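The merge step can be sketched as searching every live index and keeping the best distance per vector ID; the per-index `search` interface here is hypothetical.

```python
import heapq

def merged_top_k(query, indexes, k):
    """Search every live INDEX_SEG's index and merge results (a sketch).

    Each index is assumed to expose
    `search(query, k) -> [(distance, vector_id), ...]`. A vector that
    appears in more than one segment keeps its best distance.
    """
    best = {}
    for index in indexes:
        for dist, vid in index.search(query, k):
            if vid not in best or dist < best[vid]:
                best[vid] = dist
    # Global top-k over the deduplicated candidates
    return heapq.nsmallest(k, ((d, v) for v, d in best.items()))
```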
Index Merging
When too many small INDEX_SEGs accumulate:
1. Read all small INDEX_SEGs
2. Build a unified HNSW graph over all vectors
3. Write as a single sealed INDEX_SEG
4. Tombstone old INDEX_SEGs in manifest
Concurrent Read/Write
Readers always see a consistent snapshot through the manifest chain:
- Reader opens file -> reads manifest -> has immutable segment set
- Writer appends new segments + new manifest
- Reader continues using old manifest until it explicitly re-reads
- No locks needed — append-only guarantees no mutation of existing data
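The snapshot discipline can be modeled in a few lines: readers pin an immutable manifest, and publishing only swaps a pointer. This models the pointer swap, not the on-disk manifest chain; class names are illustrative.

```python
from dataclasses import dataclass
import threading

@dataclass(frozen=True)
class Manifest:
    """An immutable snapshot: the segment IDs visible to a reader."""
    segments: tuple

class ManifestChain:
    """Append-only manifest publication (a sketch).

    Writers publish a new Manifest; a reader keeps the snapshot it
    pinned at open time. The lock guards only the pointer swap --
    never the segment data, which is append-only and never mutated.
    """
    def __init__(self):
        self._latest = Manifest(segments=())
        self._lock = threading.Lock()

    def pin(self) -> Manifest:
        return self._latest  # reader's immutable snapshot

    def publish(self, new_segments):
        with self._lock:
            merged = self._latest.segments + tuple(new_segments)
            self._latest = Manifest(segments=merged)

chain = ManifestChain()
reader_view = chain.pin()       # reader opens the file
chain.publish([1, 2])           # writer appends segments + manifest
print(reader_view.segments)     # () -- old snapshot unchanged
print(chain.pin().segments)     # (1, 2)
```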
7. Query Path Integration
The complete query path combining progressive indexing with temperature tiering:
Query
|
v
+-----------+
| Layer A | Entry points + top-layer routing
| (always) | ~5ms to load on cold start
+-----------+
|
Is Layer B available for this region?
/ \
Yes No
/ \
+-----------+ +-----------+
| Layer B | | Centroid |
| HNSW | | Fallback |
| search | | + scan |
+-----------+ +-----------+
\ /
\ /
v v
+-----------+
| Candidate |
| Set |
+-----------+
|
Is hot cache available?
/ \
Yes No
/ \
+-----------+ +-----------+
| Hot cache | | Decode |
| re-rank | | from |
| (int8/fp16)| | VEC_SEG |
+-----------+ +-----------+
\ /
v v
+-----------+
| Top-K |
| Results |
+-----------+
Recall Expectations by State
| State | Layers Available | Expected Recall@10 |
|---|---|---|
| Cold start (L0 only) | A | 0.65-0.75 |
| L0 + hot cache | A + hot | 0.75-0.85 |
| L0 + L1 loading | A + B partial | 0.80-0.90 |
| L1 complete | A + B | 0.85-0.92 |
| Full load | A + B + C | 0.95-0.99 |
| Full + optimized | A + B + C + hot | 0.98-0.999 |