Merge commit 'd803bfe2b1fe7f5e219e50ac20d6801a0a58ac75' as 'vendor/ruvector'
vendor/ruvector/docs/research/rvf/spec/05-overlay-epochs.md
# RVF Overlay Epochs

## 1. Streaming Dynamic Min-Cut Overlay

The overlay system manages dynamic graph partitioning: how the vector space is
subdivided for distributed search, shard routing, and load balancing. Unlike
static partitioning, RVF overlays evolve with the data through an epoch-based
model that bounds memory, bounds load time, and enables rollback.

## 2. Overlay Segment Structure

Each OVERLAY_SEG stores a delta relative to the previous epoch's partition state:

```
+-------------------------------------------+
| Header: OVERLAY_SEG                       |
+-------------------------------------------+
| Epoch Header                              |
|   epoch: u32                              |
|   parent_epoch: u32                       |
|   parent_seg_id: u64                      |
|   rollback_offset: u64                    |
|   timestamp_ns: u64                       |
|   delta_count: u32                        |
|   partition_count: u32                    |
+-------------------------------------------+
| Edge Deltas                               |
|   For each delta:                         |
|     delta_type: u8 (ADD=1, REMOVE=2,      |
|                     REWEIGHT=3)           |
|     src_node: u64                         |
|     dst_node: u64                         |
|     weight: f32 (for ADD/REWEIGHT)        |
|   [64B aligned]                           |
+-------------------------------------------+
| Partition Summaries                       |
|   For each partition:                     |
|     partition_id: u32                     |
|     node_count: u64                       |
|     edge_cut_weight: f64                  |
|     centroid: [fp16 * dim]                |
|     node_id_range_start: u64              |
|     node_id_range_end: u64                |
|   [64B aligned]                           |
+-------------------------------------------+
| Min-Cut Witness                           |
|   witness_type: u8                        |
|     0 = checksum only                     |
|     1 = full certificate                  |
|   cut_value: f64                          |
|   cut_edge_count: u32                     |
|   partition_hash: [u8; 32] (SHAKE-256)    |
|   If witness_type == 1:                   |
|     [cut_edge: (u64, u64)] * count        |
|   [64B aligned]                           |
+-------------------------------------------+
| Rollback Pointer                          |
|   prev_epoch_offset: u64                  |
|   prev_epoch_hash: [u8; 16]               |
+-------------------------------------------+
```
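
As an illustration of the layout above, the fixed-size Epoch Header could be decoded as follows. This is a minimal sketch, not the reference reader: it assumes little-endian byte order and tight packing with no padding between fields, neither of which this section states. `EPOCH_HEADER`, `EpochHeader`, and `parse_epoch_header` are hypothetical names.

```python
import struct
from typing import NamedTuple

# Assumption: little-endian, tightly packed, fields in the order drawn above.
# u32, u32, u64, u64, u64, u32, u32 -> 40 bytes.
EPOCH_HEADER = struct.Struct("<IIQQQII")


class EpochHeader(NamedTuple):
    epoch: int
    parent_epoch: int
    parent_seg_id: int
    rollback_offset: int
    timestamp_ns: int
    delta_count: int
    partition_count: int


def parse_epoch_header(buf: bytes, offset: int = 0) -> EpochHeader:
    """Decode the Epoch Header that starts at `offset` in `buf`."""
    return EpochHeader(*EPOCH_HEADER.unpack_from(buf, offset))


# Round-trip a made-up header to show the field mapping.
raw = EPOCH_HEADER.pack(2, 1, 17, 4096, 1_700_000_000_000_000_000, 38, 32)
header = parse_epoch_header(raw)
print(header.epoch, header.parent_epoch, header.delta_count)
```

The variable-length sections that follow (edge deltas, partition summaries, witness) would be read using `delta_count` and `partition_count` from this header.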

## 3. Epoch Lifecycle

### Epoch Creation

A new epoch is created when:
- A batch of vectors is inserted that changes partition balance by more than a configured threshold
- The accumulated edge deltas exceed a size limit (default: 1 MB)
- A manual rebalance is triggered
- A merge/compaction produces a new partition layout

```
Epoch 0 (initial)      Epoch 1                Epoch 2
+----------------+     +----------------+     +----------------+
| Full snapshot  |     | Deltas vs E0   |     | Deltas vs E1   |
| of partitions  |     | +50 edges      |     | +30 edges      |
| 32 partitions  |     | -12 edges      |     | -8 edges       |
| min-cut: 0.342 |     | rebalance: P3  |     | split: P7->P7a |
+----------------+     +----------------+     +----------------+
```

### State Reconstruction

To reconstruct the current partition state:

```
1. Read latest MANIFEST_SEG -> get current_epoch
2. Read OVERLAY_SEG for current_epoch
3. If overlay is a delta: recursively read parent epochs
4. Apply deltas in order: base -> epoch 1 -> epoch 2 -> ... -> current
5. Result: complete partition state
```

For efficiency, the manifest caches the **last full snapshot epoch**. Delta
chains never exceed a configurable depth (default: 8 epochs) before a new
snapshot is forced.
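
The reconstruction steps can be sketched in a few lines. This is a simplified model under stated assumptions: overlays are already parsed into memory, state is reduced to an edge-weight map (partition summaries are omitted), and a snapshot is marked by `parent_epoch is None`. The `Overlay` class and `reconstruct` function are hypothetical names, and the walk is iterative rather than recursive.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

ADD, REMOVE, REWEIGHT = 1, 2, 3  # delta_type values from the segment layout

Edge = Tuple[int, int]


@dataclass
class Overlay:
    epoch: int
    parent_epoch: Optional[int]  # None marks a full snapshot (assumption)
    edges: Dict[Edge, float] = field(default_factory=dict)  # snapshot payload
    deltas: List[Tuple[int, int, int, float]] = field(default_factory=list)


def reconstruct(overlays: Dict[int, Overlay], current_epoch: int) -> Dict[Edge, float]:
    """Walk parent links back to the snapshot, then replay deltas oldest-first."""
    chain = []
    node = overlays[current_epoch]
    while node.parent_epoch is not None:
        chain.append(node)
        node = overlays[node.parent_epoch]
    state = dict(node.edges)  # copy the base snapshot
    for overlay in reversed(chain):
        for kind, src, dst, weight in overlay.deltas:
            if kind == REMOVE:
                state.pop((src, dst), None)
            else:  # ADD and REWEIGHT both set the edge weight
                state[(src, dst)] = weight
    return state


overlays = {
    0: Overlay(0, None, edges={(1, 2): 0.5, (2, 3): 0.9}),
    1: Overlay(1, 0, deltas=[(ADD, 3, 4, 0.7), (REMOVE, 1, 2, 0.0)]),
    2: Overlay(2, 1, deltas=[(REWEIGHT, 2, 3, 0.4)]),
}
print(reconstruct(overlays, 2))  # {(2, 3): 0.4, (3, 4): 0.7}
```

Because the chain depth is capped, the loop visits at most the configured number of delta segments before reaching a snapshot.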

### Compaction (Epoch Collapse)

When the delta chain reaches maximum depth:

```
1. Reconstruct full state from chain
2. Write a new OVERLAY_SEG containing the full snapshot instead of deltas
3. This becomes the new base epoch
4. Old overlay segments are tombstoned
5. New delta chain starts from this base
```

```
Before: E0(snap) -> E1(delta) -> E2(delta) -> ... -> E8(delta)
After:  E0(snap) -> ... -> E8(delta) -> E9(snap, compacted)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        These can be garbage collected
```

## 4. Min-Cut Witness

The min-cut witness provides a cryptographic proof that the current partition
is "good enough", meaning that the edge cut is within acceptable bounds.

### Witness Types

**Type 0: Checksum Only**

A SHAKE-256 hash of the complete partition state. Allows verification that
the state is consistent but doesn't prove optimality.

```
witness = SHAKE-256(
    for each partition sorted by id:
        partition_id || node_count || sorted(node_ids) || edge_cut_weight
)
```
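
A Type 0 witness along these lines can be computed with Python's `hashlib.shake_256`. The byte encoding of each field is not specified here, so the sketch assumes little-endian fixed-width integers and an IEEE 754 double for the weight. `checksum_witness` is a hypothetical name.

```python
import hashlib
import struct
from typing import Dict, Iterable, Tuple

# partition_id -> (node_ids, edge_cut_weight)
Partitions = Dict[int, Tuple[Iterable[int], float]]


def checksum_witness(partitions: Partitions) -> bytes:
    """Hash partitions in id order; the field encoding is an assumption."""
    h = hashlib.shake_256()
    for pid in sorted(partitions):
        node_ids, edge_cut_weight = partitions[pid]
        nodes = sorted(node_ids)
        h.update(struct.pack("<IQ", pid, len(nodes)))     # partition_id || node_count
        h.update(struct.pack(f"<{len(nodes)}Q", *nodes))  # sorted(node_ids)
        h.update(struct.pack("<d", edge_cut_weight))      # edge_cut_weight
    return h.digest(32)  # 32 bytes, matching partition_hash: [u8; 32]


# Input order must not matter: both calls describe the same partition state.
a = checksum_witness({0: ([3, 1, 2], 0.25), 1: ([9, 7], 0.25)})
b = checksum_witness({1: ([7, 9], 0.25), 0: ([1, 2, 3], 0.25)})
print(a == b, len(a))  # True 32
```

Sorting both the partitions and the node ids makes the digest independent of iteration order, which is what lets two readers compare hashes.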

**Type 1: Full Certificate**

Lists the actual cut edges. Allows any reader to verify that:
1. The listed edges are the only edges crossing partition boundaries
2. The total cut weight matches `cut_value`
3. No better cut exists within the local search neighborhood (optional)
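
Checks 1 and 2 are mechanical given the full edge set and a node-to-partition map. The sketch below assumes both are available in memory and skips the optional local-search check. `verify_certificate` and its parameters are hypothetical names.

```python
import math
from typing import Dict, Iterable, Set, Tuple

Edge = Tuple[int, int]


def verify_certificate(
    edges: Dict[Edge, float],      # full edge set: (src, dst) -> weight
    partition_of: Dict[int, int],  # node id -> partition id
    cut_edges: Iterable[Edge],     # edges listed in the witness
    cut_value: float,
) -> bool:
    """Check that the listed edges are exactly the crossing edges and that
    their weights sum to cut_value."""
    crossing: Set[Edge] = {
        (u, v) for (u, v) in edges if partition_of[u] != partition_of[v]
    }
    if set(cut_edges) != crossing:
        return False
    total = sum(edges[e] for e in crossing)
    return math.isclose(total, cut_value, rel_tol=1e-9, abs_tol=1e-12)


edges = {(1, 2): 0.5, (2, 3): 0.2, (3, 4): 0.8}
partition_of = {1: 0, 2: 0, 3: 1, 4: 1}
print(verify_certificate(edges, partition_of, [(2, 3)], 0.2))          # True
print(verify_certificate(edges, partition_of, [(2, 3), (3, 4)], 1.0))  # False
```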

### Bounded-Time Min-Cut Updates

Full min-cut computation is expensive (O(V * E) for max-flow). RVF instead uses
**incremental min-cut maintenance**, handling each edge delta as follows:

```
1. If ADD(u, v) where u and v are in the same partition:
   -> No cut change. O(1).

2. If ADD(u, v) where u in P_i and v in P_j:
   -> cut_weight[P_i][P_j] += weight. O(1).
   -> Check if moving u to P_j or v to P_i reduces total cut.
   -> If yes: execute move, update partition summaries. O(degree).

3. If REMOVE(u, v) where u in P_i and v in P_j:
   -> cut_weight[P_i][P_j] -= weight. O(1).
   -> No rebalance needed (cut improved).

4. If REMOVE(u, v) within the same partition:
   -> Check connectivity. If partition splits: create new partition. O(component).
```

This bounds update time to O(max_degree) per edge delta in the common case,
with O(component_size) in the rare partition-split case.
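
Cases 1 through 3 can be illustrated with a small in-memory tracker. This is a sketch under simplifying assumptions: it keeps a single total cut weight rather than a per-pair `cut_weight[P_i][P_j]` table, ignores partition size balance when deciding a move, and leaves out the connectivity check of case 4. `IncrementalCut` is a hypothetical name.

```python
from collections import defaultdict
from typing import Dict


class IncrementalCut:
    def __init__(self, partition_of: Dict[int, int]):
        self.partition_of = dict(partition_of)  # node -> partition
        self.adj = defaultdict(dict)            # node -> {neighbor: weight}
        self.cut_weight = 0.0                   # total weight crossing partitions

    def _gain(self, node: int, target: int) -> float:
        """Cut reduction if `node` moved to partition `target`. O(degree)."""
        home = self.partition_of[node]
        to_target = sum(w for n, w in self.adj[node].items() if self.partition_of[n] == target)
        to_home = sum(w for n, w in self.adj[node].items() if self.partition_of[n] == home)
        return to_target - to_home

    def add(self, u: int, v: int, weight: float) -> None:
        self.adj[u][v] = self.adj[v][u] = weight
        pu, pv = self.partition_of[u], self.partition_of[v]
        if pu == pv:
            return  # case 1: no cut change
        self.cut_weight += weight  # case 2: new crossing edge
        gain, node, target = max((self._gain(u, pv), u, pv), (self._gain(v, pu), v, pu))
        if gain > 0:  # moving one endpoint reduces the total cut
            self.partition_of[node] = target
            self.cut_weight -= gain

    def remove(self, u: int, v: int) -> None:
        weight = self.adj[u].pop(v)
        self.adj[v].pop(u)
        if self.partition_of[u] != self.partition_of[v]:
            self.cut_weight -= weight  # case 3: cut improved
        # case 4 (same partition) would need a connectivity check; omitted here


cut = IncrementalCut({1: 0, 2: 0, 3: 1, 4: 1, 5: 0})
cut.add(1, 2, 1.0)
cut.add(3, 4, 1.0)
cut.add(2, 3, 0.5)  # crossing edge, no profitable move
cut.add(5, 3, 2.0)  # node 5 moves to partition 1
print(cut.cut_weight, cut.partition_of[5])  # 0.5 1
```

The gain of a move is the weight of edges that become internal minus the weight of edges that become crossing, so each check touches only the moved node's neighbors.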

### Semi-Streaming Min-Cut

For large-scale rebalancing (e.g., after bulk insert), RVF uses a semi-streaming
algorithm inspired by Assadi et al.:

```
Phase 1: Single pass over edges to build a sparse skeleton
  - Sample each edge with probability O(1/epsilon)
  - Space: O(n * polylog(n))

Phase 2: Compute min-cut on skeleton
  - Standard max-flow on sparse graph
  - Time: O(n^2 * polylog(n))

Phase 3: Verify against full edge set
  - Stream edges again, check cut validity
  - If invalid: refine skeleton and repeat
```

This runs in O(n * polylog(n)) space regardless of edge count, making it
suitable for streaming over massive graphs.

## 5. Overlay Size Management

### Size Threshold

Each OVERLAY_SEG has a maximum payload size (configurable, default 1 MB).
When the accumulated deltas for the current epoch approach this threshold,
a new epoch is forced.

### Memory Budget

The total memory for overlay state is bounded:

```
max_overlay_memory = max_chain_depth * max_seg_size + snapshot_size
                   = 8 * 1 MB + snapshot_size
```

For 10M vectors with 32 partitions:
- Snapshot: 32 partitions * (8 + 16 + 768) bytes per partition ≈ 25 KB
- Delta chain: ≤ 8 MB
- Total: ≤ 9 MB

This overhead is nearly independent of dataset size: the delta chain is capped,
and the snapshot grows only with partition count, which scales sublinearly.
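
The figures above can be reproduced directly. The sketch assumes 1 MB means 1024 * 1024 bytes and reads the 768-byte term as an fp16 centroid of dimension 384; the source gives the byte counts but not that breakdown.

```python
MB = 1024 * 1024  # assumption: binary megabytes

max_chain_depth = 8    # default maximum delta-chain depth
max_seg_size = 1 * MB  # default OVERLAY_SEG payload limit
partitions = 32
# Per-partition bytes as itemised above: 8 + 16 + 768.
# Assumption: 768 bytes is an fp16 centroid of dimension 384.
bytes_per_partition = 8 + 16 + 768

snapshot_size = partitions * bytes_per_partition
max_overlay_memory = max_chain_depth * max_seg_size + snapshot_size

print(snapshot_size)                      # 25344 bytes, about 25 KB
print(round(max_overlay_memory / MB, 2))  # 8.02
```

The result sits comfortably under the 9 MB bound quoted above.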

### Garbage Collection

Overlay segments behind the last full snapshot are candidates for garbage
collection. The manifest tracks which overlay segments are still reachable
from the current epoch chain.

```
Reachable:   current_epoch -> parent -> ... -> last_snapshot
Unreachable: Everything before last_snapshot (safely deletable)
```

GC runs during compaction. Old OVERLAY_SEGs are tombstoned in the manifest
and their space is reclaimed on file rewrite.
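
Reachability reduces to following parent links until a snapshot is hit. In this sketch a snapshot is modelled as an epoch whose parent is `None`, which is an assumption about how the manifest would expose it. `reachable_epochs` and `parent_of` are hypothetical names.

```python
from typing import Dict, Optional, Set


def reachable_epochs(parent_of: Dict[int, Optional[int]], current_epoch: int) -> Set[int]:
    """Follow parent links from the current epoch until a snapshot (parent None)."""
    reachable = set()
    epoch: Optional[int] = current_epoch
    while epoch is not None:
        reachable.add(epoch)
        epoch = parent_of[epoch]
    return reachable


# Epochs 0..8 were compacted into snapshot 9; 10 and 11 are deltas on top of it.
parent_of = {0: None, **{e: e - 1 for e in range(1, 9)}, 9: None, 10: 9, 11: 10}
live = reachable_epochs(parent_of, current_epoch=11)
garbage = sorted(set(parent_of) - live)
print(sorted(live))  # [9, 10, 11]
print(garbage)       # [0, 1, 2, 3, 4, 5, 6, 7, 8]
```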

## 6. Distributed Overlay Coordination

When RVF files are sharded across multiple nodes, the overlay system coordinates
partition state:

### Shard-Local Overlays

Each shard maintains its own OVERLAY_SEG chain for its local partitions.
The global partition state is the union of all shard-local overlays.

### Cross-Shard Rebalancing

When a partition becomes unbalanced across shards:

```
1. Coordinator computes target partition assignment
2. Each shard writes a JOURNAL_SEG with vector move instructions
3. Vectors are copied (not moved, since files are append-only) to target shards
4. Each shard writes a new OVERLAY_SEG reflecting the new partition
5. Coordinator writes a global MANIFEST_SEG with new shard map
```

This is eventually consistent: during rebalancing, queries may search both
old and new locations and deduplicate results.

### Consistency Model

- **Within a shard**: Linearizable (single-writer, manifest chain)
- **Across shards**: Eventually consistent with bounded staleness

The epoch counter provides a total order for convergence checking:
- If all shards report epoch >= E, the global state at epoch E is complete
- Stale shards are detectable by comparing epoch counters
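
Both rules reduce to a minimum and a comparison over the reported counters. The sketch assumes each shard reports one integer epoch and that counters are comparable across shards. `convergence` and the shard names are hypothetical.

```python
from typing import Dict, List, Tuple


def convergence(shard_epochs: Dict[str, int]) -> Tuple[int, List[str]]:
    """Return the highest globally complete epoch and the shards behind the newest one."""
    complete = min(shard_epochs.values())  # every shard reports epoch >= complete
    newest = max(shard_epochs.values())
    stale = sorted(s for s, e in shard_epochs.items() if e < newest)
    return complete, stale


print(convergence({"shard-a": 7, "shard-b": 5, "shard-c": 7}))  # (5, ['shard-b'])
```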

## 7. Epoch-Aware Query Routing

Queries use the overlay state for partition routing:

```python
import heapq
import math


def route_query(query, overlay, current_epoch, n_probe, stale_threshold):
    # Check epoch freshness
    if overlay.epoch < current_epoch - stale_threshold:
        # Overlay is stale: broaden the search
        n_probe *= 2

    # Find the nearest partition centroids (Euclidean distance shown here;
    # substitute the metric the index actually uses)
    return heapq.nsmallest(
        n_probe,
        overlay.partitions,
        key=lambda p: math.dist(query, p.centroid),
    )
```

### Epoch Rollback

If an overlay epoch is found to be corrupt or suboptimal:

```
1. Read rollback_pointer from current OVERLAY_SEG
2. The pointer gives the offset of the previous epoch's OVERLAY_SEG
3. Write a new MANIFEST_SEG pointing to the previous epoch as current
4. Future writes continue from the rolled-back state
```

This provides O(1) rollback to the parent epoch. Repeating the step reaches any
ancestor still in the chain, at one pointer hop per epoch.
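
The rollback steps amount to a pointer hop followed by an append. The toy model below keeps OVERLAY_SEGs in a dict keyed by file offset and represents the manifest chain as a list of offsets. Both are simplifications for illustration, and `OverlaySeg` and `rollback` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class OverlaySeg:
    epoch: int
    prev_epoch_offset: int  # rollback pointer: offset of the parent OVERLAY_SEG


def rollback(segments: Dict[int, OverlaySeg], manifests: List[int], steps: int = 1) -> int:
    """Append a manifest entry pointing `steps` epochs back; nothing is overwritten."""
    offset = manifests[-1]  # offset of the current OVERLAY_SEG
    for _ in range(steps):
        offset = segments[offset].prev_epoch_offset  # one pointer hop per epoch
    manifests.append(offset)  # new MANIFEST_SEG names the older epoch as current
    return segments[offset].epoch


segments = {0: OverlaySeg(0, 0), 4096: OverlaySeg(1, 0), 8192: OverlaySeg(2, 4096)}
manifests = [8192]
print(rollback(segments, manifests))  # 1
print(manifests)                      # [8192, 4096]
```

The rolled-back segment stays in the file, so the rollback itself can be undone by writing another manifest entry.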

## 8. Integration with Progressive Indexing

The overlay system and the index system are coupled:

- **Partition centroids** in the overlay guide Layer A routing
- **Partition boundaries** determine which INDEX_SEGs cover which regions
- **Partition rebalancing** may invalidate Layer B adjacency for moved vectors
  (these are rebuilt lazily)
- **Layer C** is partition-aligned: each INDEX_SEG covers vectors within
  a single partition for locality

This means overlay compaction can trigger partial index rebuild, but only for
the affected partitions, not the entire index.