# RVF Progressive Indexing

## 1. Index as Layers of Availability

Traditional HNSW serialization is all-or-nothing: either the full graph is loaded, or nothing works. RVF decomposes the index into three layers of availability, each independently useful, each stored in separate INDEX_SEG segments.

```
Layer C: Full Adjacency
+--------------------------------------------------+
| Complete neighbor lists for every node at every  |
| HNSW level. Built lazily. Optional for queries.  |
| Recall: >= 0.95                                  |
+--------------------------------------------------+
          ^ loaded last (seconds to minutes)
          |
Layer B: Partial Adjacency
+--------------------------------------------------+
| Neighbor lists for the most-accessed region      |
| (determined by temperature sketch). Covers the   |
| hot working set of the graph.                    |
| Recall: >= 0.85                                  |
+--------------------------------------------------+
          ^ loaded second (100ms - 1s)
          |
Layer A: Entry Points + Coarse Routing
+--------------------------------------------------+
| HNSW entry points. Top-layer adjacency lists.    |
| Cluster centroids for IVF pre-routing.           |
| Always present. Always in Level 0 hotset.        |
| Recall: >= 0.70                                  |
+--------------------------------------------------+
          ^ loaded first (< 5ms)
          |
      File open
```

### Why Three Layers

| Layer | Purpose | Data Size (10M vectors) | Load Time (NVMe) |
|-------|---------|-------------------------|------------------|
| A | First query possible | 1-4 MB | < 5 ms |
| B | Good quality for working set | 50-200 MB | 100-500 ms |
| C | Full recall for all queries | 1-4 GB | 2-10 s |

A system that only loads Layer A can still answer queries — just with lower recall. As layers B and C load asynchronously, quality improves transparently.
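
To make the loading model concrete, here is a small self-contained simulation of the availability progression. The loader threads, timings, and recall constants are stand-ins taken from the table above, not a real RVF API:

```python
import threading
import time

# Simulation of progressive availability: queries are answerable the
# moment Layer A is mapped, and recall climbs as B and C arrive.
# Recall figures mirror the per-layer bounds in the diagram above.
RECALL = {"A": 0.70, "B": 0.85, "C": 0.95}
available = {"A"}                      # Layer A maps in < 5 ms

def load_layer(name: str, seconds: float) -> None:
    time.sleep(seconds)                # stand-in for reading INDEX_SEGs
    available.add(name)

threading.Thread(target=load_layer, args=("B", 0.3), daemon=True).start()
threading.Thread(target=load_layer, args=("C", 5.0), daemon=True).start()

def query_recall() -> float:
    # Quality is whatever the best currently loaded layer provides.
    return max(RECALL[layer] for layer in available)

print(query_recall())   # 0.70 immediately after open
time.sleep(1.0)
print(query_recall())   # 0.85 once Layer B has loaded
```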

## 2. Layer A: Entry Points and Coarse Routing

### Content

- **HNSW entry points**: The node(s) at the highest layer of the HNSW graph. Typically 1 node, but may be multiple for redundancy.
- **Top-layer adjacency**: Full neighbor lists for all nodes at HNSW layers >= ceil(ln(N) / ln(M)) - 2. For 10M vectors with M=16, this is layers 5-6, containing ~100-1000 nodes (checked numerically in the sketch after this list).
- **Cluster centroids**: K centroids (typically K = sqrt(N), so ~3162 for 10M) used for IVF-style partition routing.
- **Centroid-to-partition map**: Which centroid owns which vector ID ranges.
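
A quick numeric check of the formulas in this list, for the 10M-vector, M=16 example (plain Python, no RVF code involved):

```python
import math

# Layer A constants for the example N = 10M vectors, HNSW parameter M = 16.
N, M = 10_000_000, 16

# Expected height of the HNSW graph: ln(N) / ln(M) ~= 5.8, so the top
# layer sits around layer 5-6.
top_layer = math.log(N) / math.log(M)

# Layer A keeps full adjacency for all layers >= this cutoff.
cutoff = math.ceil(math.log(N) / math.log(M)) - 2

# Number of IVF routing centroids: K = sqrt(N).
K = round(math.sqrt(N))

print(f"top layer ~ {top_layer:.1f}, adjacency kept for layers >= {cutoff}, K = {K}")
# top layer ~ 5.8, adjacency kept for layers >= 4, K = 3162
```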

### Storage

Layer A data is stored in a dedicated INDEX_SEG with `flags.HOT` set. The root manifest's hotset pointers reference this segment directly. On cold start, this is the first data mapped after the manifest.

### Binary Layout of Layer A INDEX_SEG

```
+-------------------------------------------+
| Header: INDEX_SEG, flags=HOT              |
+-------------------------------------------+
| Block 0: Entry Points                     |
|   entry_count: u32                        |
|   max_layer: u32                          |
|   [entry_node_id: u64, layer: u32] * N    |
+-------------------------------------------+
| Block 1: Top-Layer Adjacency              |
|   layer_count: u32                        |
|   For each layer (top to bottom):         |
|     node_count: u32                       |
|     For each node:                        |
|       node_id: u64                        |
|       neighbor_count: u16                 |
|       [neighbor_id: u64] * neighbor_count |
|   [64B padding]                           |
+-------------------------------------------+
| Block 2: Centroids                        |
|   centroid_count: u32                     |
|   dim: u16                                |
|   dtype: u8 (fp16)                        |
|   [centroid_vector: fp16 * dim] * K       |
|   [64B aligned]                           |
+-------------------------------------------+
| Block 3: Partition Map                    |
|   partition_count: u32                    |
|   For each partition:                     |
|     centroid_id: u32                      |
|     vector_id_start: u64                  |
|     vector_id_end: u64                    |
|     segment_ref: u64 (segment_id)         |
|     block_ref: u32 (block offset)         |
+-------------------------------------------+
```
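
As an illustration, a parser for Block 0 might look like the following. Byte order is not pinned down in this excerpt, so the little-endian `<` format here is an assumption, and the function name is ours:

```python
import struct

# Sketch: parsing Block 0 (Entry Points) from the layout above.
# Assumes little-endian, tightly packed fields.
def parse_entry_points(buf: bytes, offset: int = 0):
    entry_count, max_layer = struct.unpack_from("<II", buf, offset)
    offset += 8
    entries = []
    for _ in range(entry_count):
        node_id, layer = struct.unpack_from("<QI", buf, offset)
        entries.append((node_id, layer))
        offset += 12
    return max_layer, entries, offset

# Round-trip on a synthetic buffer: one entry point, node 42 at layer 6.
blob = struct.pack("<II", 1, 6) + struct.pack("<QI", 42, 6)
print(parse_entry_points(blob))  # (6, [(42, 6)], 20)
```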

### Query Using Only Layer A

```python
def query_layer_a_only(query, k, layer_a, n_probe=8, hot_cache=None):
    # Step 1: Route to the nearest IVF partitions via the centroids
    dists = [distance(query, c) for c in layer_a.centroids]
    top_partitions = top_n(dists, n_probe)

    # Step 2: HNSW search through the top layers only (Layer A stores
    # no adjacency below min_available_layer)
    current = layer_a.entry_points[0]
    for layer in range(layer_a.max_layer, layer_a.min_available_layer, -1):
        current = greedy_search(query, current, layer_a.adjacency[layer])

    # Step 3: If a hot cache is resident, refine against it
    if hot_cache is not None:
        candidates = scan_hot_cache(query, hot_cache, current.partition)
        return top_k(candidates, k)

    # Step 4: Otherwise, return centroid-approximate results
    return approximate_from_centroids(query, top_partitions, k)
```

Expected recall: 0.65-0.75 (depends on centroid quality and hot cache coverage).

## 3. Layer B: Partial Adjacency

### Content

Neighbor lists for the **hot region** of the graph — the set of nodes that appear most frequently in query traversals. Determined by the temperature sketch (see 03-temperature-tiering.md).

Typically covers:

- All nodes at HNSW layers >= 2
- Layer 0-1 nodes in the hot temperature tier
- ~10-20% of total nodes

### Storage

Layer B is stored in one or more INDEX_SEGs without the HOT flag. The Level 1 manifest maps these segments and records which node ID ranges they cover.

### Incremental Build

Layer B can be built incrementally:

```
1. After Layer A is loaded, begin query serving
2. In background: read VEC_SEGs for hot-tier blocks
3. Build HNSW adjacency for those blocks
4. Write as new INDEX_SEG
5. Update manifest to include Layer B
6. Future queries use Layer B for better recall
```

This means the index improves over time without blocking any queries.
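
A minimal simulation of this loop, with stub functions standing in for the real segment I/O (every name here is illustrative, and the locking is only sketched):

```python
import threading
import time

# Stub I/O standing in for the real segment operations (illustrative).
def read_hot_blocks():
    return ["blk7", "blk3"]            # chosen by the temperature sketch

def build_adjacency(block):
    time.sleep(0.1)                    # stand-in for HNSW construction
    return {block: ["neighbors..."]}

def write_index_seg(adjacency):
    return f"INDEX_SEG({sorted(adjacency)})"

manifest = {"layers": ["A"]}           # queries consult this snapshot
manifest_lock = threading.Lock()

def background_build_layer_b():
    # Steps 2-4: read hot blocks, build adjacency, write the segment
    adjacency = {}
    for block in read_hot_blocks():
        adjacency.update(build_adjacency(block))
    seg = write_index_seg(adjacency)
    with manifest_lock:                # step 5: publish atomically
        manifest["layers"].append("B")
        manifest["b_segment"] = seg

# Step 1: serving starts immediately; the build never blocks queries.
threading.Thread(target=background_build_layer_b, daemon=True).start()
```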

### Partial Adjacency Routing

When a query traversal reaches a node without Layer B adjacency (i.e., it's in the cold region), the system falls back to:

1. **Centroid routing**: Use Layer A centroids to estimate the nearest region
2. **Linear scan**: Scan the relevant VEC_SEG block directly
3. **Approximate**: Accept slightly lower recall for that portion

```python
def search_with_partial_index(query, k, layers):
    # Route with Layer A: descend from the top layer down to layer 2
    current = hnsw_search_layers(query, layers.a, layers.a.max_layer, 2)

    # Continue with Layer B adjacency where it exists
    if layers.b.has_node(current):
        current = hnsw_search_layers(query, layers.b, 1, 0, start=current)
    else:
        # Fallback: the node is in the cold region, so linearly scan
        # the VEC_SEG block containing it
        candidates = linear_scan_block(query, current.block)
        current = best_of(current, candidates)

    return top_k(current.visited, k)
```

## 4. Layer C: Full Adjacency

### Content

Complete neighbor lists for every node at every HNSW level. This is the traditional full HNSW graph.

### Storage

Layer C may be split across multiple INDEX_SEGs for large datasets. The manifest records the node ID ranges covered by each segment.

### Lazy Build

Layer C is built lazily — it is not required for the file to be functional. The build process runs as a background task:

```
1. Identify unindexed VEC_SEG blocks (those without Layer C adjacency)
2. Read blocks in partition order (good locality)
3. Build HNSW adjacency using the existing partial graph as scaffold
4. Write new INDEX_SEG(s)
5. Update manifest
```

### Build Prioritization

Blocks are indexed in temperature order:

1. Hot blocks first (most query benefit)
2. Warm blocks next
3. Cold blocks last (may never be indexed if queries don't reach them)

This means the index build converges to useful quality quickly, then approaches completeness asymptotically.
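
In code, the prioritization is just a sort on the tier reported by the temperature sketch. A tiny sketch with made-up block IDs and tiers:

```python
# Ordering unindexed blocks by temperature tier, hottest first.
TIER_PRIORITY = {"hot": 0, "warm": 1, "cold": 2}

blocks = [("blk3", "cold"), ("blk7", "hot"), ("blk1", "warm")]
build_order = sorted(blocks, key=lambda b: TIER_PRIORITY[b[1]])
print([b[0] for b in build_order])  # ['blk7', 'blk1', 'blk3']
```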

## 5. Index Segment Binary Format

### Adjacency List Encoding

Neighbor lists are stored using **varint delta encoding with restart points** for fast random access:

```
+-------------------------------------------+
| Restart Point Index                       |
|   restart_interval: u32 (e.g., 64)        |
|   restart_count: u32                      |
|   [restart_offset: u32] * restart_count   |
|   [64B aligned]                           |
+-------------------------------------------+
| Adjacency Data                            |
|   For each node (sorted by node_id):      |
|     neighbor_count: varint                |
|     [delta_encoded_neighbor_id: varint]   |
|   (restart point every N nodes)           |
+-------------------------------------------+
```

**Restart points**: Every `restart_interval` nodes (default 64), the delta encoding resets to absolute IDs. This gives fast, bounded-cost random access to any node's neighbors:

1. Binary search the restart point index for the nearest restart <= target
2. Seek to that restart offset
3. Sequentially decode from restart to target (at most 63 decodes)

### Varint Encoding

Standard LEB128 varint:

- Values 0-127: 1 byte
- Values 128-16383: 2 bytes
- Values 16384-2097151: 3 bytes

For delta-encoded neighbor IDs (typical delta: 1-1000), most values fit in 1-2 bytes, giving ~3-4x compression over fixed u64.
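
A self-contained sketch of the encoder and decoder. LEB128 itself is standard; the delta framing below is simplified to a single neighbor list and omits the restart-point index:

```python
def encode_varint(value: int, out: bytearray) -> None:
    # Standard LEB128: 7 payload bits per byte, high bit = continuation.
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return

def decode_varint(buf: bytes, pos: int) -> tuple[int, int]:
    result = shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7

# Delta-encode a sorted neighbor list: first ID absolute, rest as gaps.
neighbors = [1000, 1003, 1050, 2000]
buf = bytearray()
prev = 0
for n in neighbors:
    encode_varint(n - prev, buf)
    prev = n
print(len(buf), "bytes vs", 8 * len(neighbors), "as fixed u64")  # 6 vs 32
```

A restart point would simply re-encode an absolute ID every `restart_interval` entries, letting a decoder begin mid-stream as described above.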

### Prefetch Hints

The manifest's prefetch table maps node ID ranges to contiguous page ranges:

```
Prefetch Entry:
  node_id_start: u64
  node_id_end: u64
  page_offset: u64       Offset of first contiguous page
  page_count: u32        Number of contiguous pages
  prefetch_ahead: u32    Pages to prefetch ahead of current access
```

When the HNSW search accesses a node, the runtime issues `madvise(WILLNEED)` (or equivalent) for the next `prefetch_ahead` pages. This hides disk/memory latency behind computation.
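
On Linux with Python 3.8+, the hint can be issued through `mmap.madvise`. The file name and page numbers below are illustrative:

```python
import mmap

PAGE = mmap.PAGESIZE

def prefetch_pages(mm: mmap.mmap, page_offset: int, prefetch_ahead: int) -> None:
    # Hint the kernel to fault in the next `prefetch_ahead` pages so the
    # HNSW traversal doesn't stall on them when it arrives.
    start = page_offset * PAGE
    length = min(prefetch_ahead * PAGE, len(mm) - start)
    if length > 0:
        mm.madvise(mmap.MADV_WILLNEED, start, length)

with open("index.seg", "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, prot=mmap.PROT_READ)
    prefetch_pages(mm, page_offset=128, prefetch_ahead=8)
```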

## 6. Index Consistency

### Append-Only Index Updates

When new vectors are added:

1. New vectors go into a **fresh VEC_SEG** (append-only)
2. A temporary in-memory index covers the new vectors
3. When the in-memory index reaches a threshold, it is written as a new INDEX_SEG
4. The manifest is updated to include both the old and new INDEX_SEGs
5. Queries search both indexes and merge results

This is analogous to LSM-tree compaction levels but for graph indexes.
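
A sketch of step 5 above: each INDEX_SEG answers independently, and results are merged by distance with de-duplication on vector ID. The candidate tuples here are made up for illustration:

```python
import heapq

def merge_results(k: int, *per_index_results):
    # Each input list is (distance, vector_id) tuples, sorted by distance.
    merged, seen = [], set()
    for dist, vec_id in heapq.merge(*per_index_results):
        if vec_id not in seen:          # a vector may appear in both indexes
            seen.add(vec_id)
            merged.append((dist, vec_id))
        if len(merged) == k:
            break
    return merged

sealed = [(0.12, 101), (0.30, 205), (0.55, 88)]   # old, sealed INDEX_SEG
fresh  = [(0.18, 901), (0.30, 205), (0.41, 902)]  # new in-memory index
print(merge_results(3, sealed, fresh))
# [(0.12, 101), (0.18, 901), (0.3, 205)]
```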

### Index Merging

When too many small INDEX_SEGs accumulate:

```
1. Read all small INDEX_SEGs
2. Build a unified HNSW graph over all vectors
3. Write as a single sealed INDEX_SEG
4. Tombstone old INDEX_SEGs in manifest
```

### Concurrent Read/Write

Readers always see a consistent snapshot through the manifest chain (sketched below):

- Reader opens file -> reads manifest -> has immutable segment set
- Writer appends new segments + new manifest
- Reader continues using old manifest until it explicitly re-reads
- No locks needed — append-only guarantees no mutation of existing data
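
The snapshot discipline can be sketched in a few lines: the writer only ever publishes a fresh immutable manifest, so a reader's pinned reference stays valid. Names and structures here are illustrative:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Manifest:
    segments: tuple  # immutable segment set

latest = Manifest(segments=("VEC_SEG_0", "INDEX_SEG_0"))

def open_reader():
    return latest              # reader pins this snapshot

def writer_append(new_segment):
    global latest
    # Append-only: build a new manifest; existing readers are untouched.
    latest = Manifest(segments=latest.segments + (new_segment,))

reader = open_reader()
writer_append("INDEX_SEG_1")
print(reader.segments)         # still ('VEC_SEG_0', 'INDEX_SEG_0')
print(open_reader().segments)  # ('VEC_SEG_0', 'INDEX_SEG_0', 'INDEX_SEG_1')
```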

## 7. Query Path Integration

The complete query path, combining progressive indexing with temperature tiering:

```
                 Query
                   |
                   v
             +-----------+
             |  Layer A  |  Entry points + top-layer routing
             | (always)  |  ~5ms to load on cold start
             +-----------+
                   |
      Is Layer B available for this region?
             /           \
           Yes            No
           /                \
  +-----------+        +-----------+
  |  Layer B  |        | Centroid  |
  |   HNSW    |        | Fallback  |
  |  search   |        |  + scan   |
  +-----------+        +-----------+
           \                /
            v              v
             +-----------+
             | Candidate |
             |    Set    |
             +-----------+
                   |
          Is hot cache available?
             /           \
           Yes            No
           /                \
  +------------+       +-----------+
  | Hot cache  |       |  Decode   |
  |  re-rank   |       |   from    |
  | (int8/fp16)|       |  VEC_SEG  |
  +------------+       +-----------+
           \                /
            v              v
             +-----------+
             |   Top-K   |
             |  Results  |
             +-----------+
```

### Recall Expectations by State

| State | Layers Available | Expected Recall@10 |
|-------|------------------|--------------------|
| Cold start (L0 only) | A | 0.65-0.75 |
| L0 + hot cache | A + hot | 0.75-0.85 |
| L0 + L1 loading | A + B partial | 0.80-0.90 |
| L1 complete | A + B | 0.85-0.92 |
| Full load | A + B + C | 0.95-0.99 |
| Full + optimized | A + B + C + hot | 0.98-0.999 |