git-subtree-dir: vendor/ruvector git-subtree-split: b64c21726f2bb37286d9ee36a7869fef60cc6900
309 lines
10 KiB
Markdown
309 lines
10 KiB
Markdown
# RVF Overlay Epochs
|
|
|
|
## 1. Streaming Dynamic Min-Cut Overlay
|
|
|
|
The overlay system manages dynamic graph partitioning — how the vector space is
|
|
subdivided for distributed search, shard routing, and load balancing. Unlike
|
|
static partitioning, RVF overlays evolve with the data through an epoch-based
|
|
model that bounds memory, bounds load time, and enables rollback.
|
|
|
|
## 2. Overlay Segment Structure
|
|
|
|
Each OVERLAY_SEG stores a delta relative to the previous epoch's partition state:
|
|
|
|
```
|
|
+-------------------------------------------+
|
|
| Header: OVERLAY_SEG |
|
|
+-------------------------------------------+
|
|
| Epoch Header |
|
|
| epoch: u32 |
|
|
| parent_epoch: u32 |
|
|
| parent_seg_id: u64 |
|
|
| rollback_offset: u64 |
|
|
| timestamp_ns: u64 |
|
|
| delta_count: u32 |
|
|
| partition_count: u32 |
|
|
+-------------------------------------------+
|
|
| Edge Deltas |
|
|
| For each delta: |
|
|
| delta_type: u8 (ADD=1, REMOVE=2, |
|
|
| REWEIGHT=3) |
|
|
| src_node: u64 |
|
|
| dst_node: u64 |
|
|
| weight: f32 (for ADD/REWEIGHT) |
|
|
| [64B aligned] |
|
|
+-------------------------------------------+
|
|
| Partition Summaries |
|
|
| For each partition: |
|
|
| partition_id: u32 |
|
|
| node_count: u64 |
|
|
| edge_cut_weight: f64 |
|
|
| centroid: [fp16 * dim] |
|
|
| node_id_range_start: u64 |
|
|
| node_id_range_end: u64 |
|
|
| [64B aligned] |
|
|
+-------------------------------------------+
|
|
| Min-Cut Witness |
|
|
| witness_type: u8 |
|
|
| 0 = checksum only |
|
|
| 1 = full certificate |
|
|
| cut_value: f64 |
|
|
| cut_edge_count: u32 |
|
|
| partition_hash: [u8; 32] (SHAKE-256) |
|
|
| If witness_type == 1: |
|
|
| [cut_edge: (u64, u64)] * count |
|
|
| [64B aligned] |
|
|
+-------------------------------------------+
|
|
| Rollback Pointer |
|
|
| prev_epoch_offset: u64 |
|
|
| prev_epoch_hash: [u8; 16] |
|
|
+-------------------------------------------+
|
|
```
|
|
|
|
## 3. Epoch Lifecycle
|
|
|
|
### Epoch Creation
|
|
|
|
A new epoch is created when:
|
|
- A batch of vectors is inserted that changes partition balance by > threshold
|
|
- The accumulated edge deltas exceed a size limit (default: 1 MB)
|
|
- A manual rebalance is triggered
|
|
- A merge/compaction produces a new partition layout
|
|
|
|
```
|
|
Epoch 0 (initial) Epoch 1 Epoch 2
|
|
+----------------+ +----------------+ +----------------+
|
|
| Full snapshot | | Deltas vs E0 | | Deltas vs E1 |
|
|
| of partitions | | +50 edges | | +30 edges |
|
|
| 32 partitions | | -12 edges | | -8 edges |
|
|
| min-cut: 0.342 | | rebalance: P3 | | split: P7->P7a |
|
|
+----------------+ +----------------+ +----------------+
|
|
```
|
|
|
|
### State Reconstruction
|
|
|
|
To reconstruct the current partition state:
|
|
|
|
```
|
|
1. Read latest MANIFEST_SEG -> get current_epoch
|
|
2. Read OVERLAY_SEG for current_epoch
|
|
3. If overlay is a delta: recursively read parent epochs
|
|
4. Apply deltas in order: base -> epoch 1 -> epoch 2 -> ... -> current
|
|
5. Result: complete partition state
|
|
```
|
|
|
|
For efficiency, the manifest caches the **last full snapshot epoch**. Delta
|
|
chains never exceed a configurable depth (default: 8 epochs) before a new
|
|
snapshot is forced.
|
|
|
|
### Compaction (Epoch Collapse)
|
|
|
|
When the delta chain reaches maximum depth:
|
|
|
|
```
|
|
1. Reconstruct full state from chain
|
|
2. Write new OVERLAY_SEG with witness_type=full_snapshot
|
|
3. This becomes the new base epoch
|
|
4. Old overlay segments are tombstoned
|
|
5. New delta chain starts from this base
|
|
```
|
|
|
|
```
|
|
Before: E0(snap) -> E1(delta) -> E2(delta) -> ... -> E8(delta)
|
|
After: E0(snap) -> ... -> E8(delta) -> E9(snap, compacted)
|
|
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
|
These can be garbage collected
|
|
```
|
|
|
|
## 4. Min-Cut Witness
|
|
|
|
The min-cut witness provides a cryptographic proof that the current partition
|
|
is "good enough" — that the edge cut is within acceptable bounds.
|
|
|
|
### Witness Types
|
|
|
|
**Type 0: Checksum Only**
|
|
|
|
A SHAKE-256 hash of the complete partition state. Allows verification that
|
|
the state is consistent but doesn't prove optimality.
|
|
|
|
```
|
|
witness = SHAKE-256(
|
|
for each partition sorted by id:
|
|
partition_id || node_count || sorted(node_ids) || edge_cut_weight
|
|
)
|
|
```
|
|
|
|
**Type 1: Full Certificate**
|
|
|
|
Lists the actual cut edges. Allows any reader to verify that:
|
|
1. The listed edges are the only edges crossing partition boundaries
|
|
2. The total cut weight matches `cut_value`
|
|
3. No better cut exists within the local search neighborhood (optional)
|
|
|
|
### Bounded-Time Min-Cut Updates
|
|
|
|
Full min-cut computation is expensive (O(V * E) for max-flow). RVF uses
|
|
**incremental min-cut maintenance**:
|
|
|
|
For each edge delta:
|
|
```
|
|
1. If ADD(u, v) where u and v are in same partition:
|
|
-> No cut change. O(1).
|
|
|
|
2. If ADD(u, v) where u in P_i and v in P_j:
|
|
-> cut_weight[P_i][P_j] += weight. O(1).
|
|
-> Check if moving u to P_j or v to P_i reduces total cut.
|
|
-> If yes: execute move, update partition summaries. O(degree).
|
|
|
|
3. If REMOVE(u, v) across partitions:
|
|
-> cut_weight[P_i][P_j] -= weight. O(1).
|
|
-> No rebalance needed (cut improved).
|
|
|
|
4. If REMOVE(u, v) within same partition:
|
|
-> Check connectivity. If partition splits: create new partition. O(component).
|
|
```
|
|
|
|
This bounds update time to O(max_degree) per edge delta in the common case,
|
|
with O(component_size) in the rare partition-split case.
|
|
|
|
### Semi-Streaming Min-Cut
|
|
|
|
For large-scale rebalancing (e.g., after bulk insert), RVF uses a semi-streaming
|
|
algorithm inspired by Assadi et al.:
|
|
|
|
```
|
|
Phase 1: Single pass over edges to build a sparse skeleton
|
|
- Sample each edge with probability O(1/epsilon)
|
|
- Space: O(n * polylog(n))
|
|
|
|
Phase 2: Compute min-cut on skeleton
|
|
- Standard max-flow on sparse graph
|
|
- Time: O(n^2 * polylog(n))
|
|
|
|
Phase 3: Verify against full edge set
|
|
- Stream edges again, check cut validity
|
|
- If invalid: refine skeleton and repeat
|
|
```
|
|
|
|
This runs in O(n * polylog(n)) space regardless of edge count, making it
|
|
suitable for streaming over massive graphs.
|
|
|
|
## 5. Overlay Size Management
|
|
|
|
### Size Threshold
|
|
|
|
Each OVERLAY_SEG has a maximum payload size (configurable, default 1 MB).
|
|
When the accumulated deltas for the current epoch approach this threshold,
|
|
a new epoch is forced.
|
|
|
|
### Memory Budget
|
|
|
|
The total memory for overlay state is bounded:
|
|
|
|
```
|
|
max_overlay_memory = max_chain_depth * max_seg_size + snapshot_size
|
|
= 8 * 1 MB + snapshot_size
|
|
```
|
|
|
|
For 10M vectors with 32 partitions:
|
|
- Snapshot: ~32 * (8 + 16 + 768) bytes per partition ≈ 25 KB
|
|
- Delta chain: ≤ 8 MB
|
|
- Total: ≤ 9 MB
|
|
|
|
This is a fixed overhead regardless of dataset size (partition count scales
|
|
sublinearly).
|
|
|
|
### Garbage Collection
|
|
|
|
Overlay segments behind the last full snapshot are candidates for garbage
|
|
collection. The manifest tracks which overlay segments are still reachable
|
|
from the current epoch chain.
|
|
|
|
```
|
|
Reachable: current_epoch -> parent -> ... -> last_snapshot
|
|
Unreachable: Everything before last_snapshot (safely deletable)
|
|
```
|
|
|
|
GC runs during compaction. Old OVERLAY_SEGs are tombstoned in the manifest
|
|
and their space is reclaimed on file rewrite.
|
|
|
|
## 6. Distributed Overlay Coordination
|
|
|
|
When RVF files are sharded across multiple nodes, the overlay system coordinates
|
|
partition state:
|
|
|
|
### Shard-Local Overlays
|
|
|
|
Each shard maintains its own OVERLAY_SEG chain for its local partitions.
|
|
The global partition state is the union of all shard-local overlays.
|
|
|
|
### Cross-Shard Rebalancing
|
|
|
|
When a partition becomes unbalanced across shards:
|
|
|
|
```
|
|
1. Coordinator computes target partition assignment
|
|
2. Each shard writes a JOURNAL_SEG with vector move instructions
|
|
3. Vectors are copied (not moved — append-only) to target shards
|
|
4. Each shard writes a new OVERLAY_SEG reflecting the new partition
|
|
5. Coordinator writes a global MANIFEST_SEG with new shard map
|
|
```
|
|
|
|
This is eventually consistent — during rebalancing, queries may search both
|
|
old and new locations and deduplicate results.
|
|
|
|
### Consistency Model
|
|
|
|
**Within a shard**: Linearizable (single-writer, manifest chain)
|
|
**Across shards**: Eventually consistent with bounded staleness
|
|
|
|
The epoch counter provides a total order for convergence checking:
|
|
- If all shards report epoch >= E, the global state at epoch E is complete
|
|
- Stale shards are detectable by comparing epoch counters
|
|
|
|
## 7. Epoch-Aware Query Routing
|
|
|
|
Queries use the overlay state for partition routing:
|
|
|
|
```python
|
|
def route_query(query, overlay):
|
|
# Find nearest partition centroids
|
|
dists = [distance(query, p.centroid) for p in overlay.partitions]
|
|
target_partitions = top_n(dists, n_probe)
|
|
|
|
# Check epoch freshness
|
|
if overlay.epoch < current_epoch - stale_threshold:
|
|
# Overlay is stale — broaden search
|
|
target_partitions = top_n(dists, n_probe * 2)
|
|
|
|
return target_partitions
|
|
```
|
|
|
|
### Epoch Rollback
|
|
|
|
If an overlay epoch is found to be corrupt or suboptimal:
|
|
|
|
```
|
|
1. Read rollback_pointer from current OVERLAY_SEG
|
|
2. The pointer gives the offset of the previous epoch's OVERLAY_SEG
|
|
3. Write a new MANIFEST_SEG pointing to the previous epoch as current
|
|
4. Future writes continue from the rolled-back state
|
|
```
|
|
|
|
This provides O(1) rollback to any ancestor epoch in the chain.
|
|
|
|
## 8. Integration with Progressive Indexing
|
|
|
|
The overlay system and the index system are coupled:
|
|
|
|
- **Partition centroids** in the overlay guide Layer A routing
|
|
- **Partition boundaries** determine which INDEX_SEGs cover which regions
|
|
- **Partition rebalancing** may invalidate Layer B adjacency for moved vectors
|
|
(these are rebuilt lazily)
|
|
- **Layer C** is partitioned aligned — each INDEX_SEG covers vectors within
|
|
a single partition for locality
|
|
|
|
This means overlay compaction can trigger partial index rebuild, but only for
|
|
the affected partitions — not the entire index.
|