Merge commit 'd803bfe2b1fe7f5e219e50ac20d6801a0a58ac75' as 'vendor/ruvector'
vendor/ruvector/docs/research/rvf/spec/00-overview.md (vendored, new file)
@@ -0,0 +1,140 @@

# RVF: RuVector Format Specification

## The Universal Substrate for Living Intelligence

**Version**: 0.1.0-draft
**Status**: Research
**Date**: 2026-02-13

---

## What RVF Is

RVF is not a file format. It is a **runtime substrate** — a living, self-reorganizing
binary medium that stores, streams, indexes, and adapts vector intelligence across
any domain, any scale, and any hardware tier.

Where traditional formats are snapshots of data, RVF is a **continuously evolving
organism**. It ingests without rewriting. It answers queries before it finishes loading.
It reorganizes its own layout to match access patterns. It survives crashes without
journals. It fits on a 64 KB WASM tile or scales to a petabyte hub.

## The Four Laws of RVF

Every design decision in RVF derives from four inviolable laws:

### Law 1: Truth Lives at the Tail

The most recent `MANIFEST_SEG` at the tail of the file is the sole source of truth.
No front-loaded metadata. No section directory that must be rewritten on mutation.
Readers scan backward from EOF to find the latest manifest and know exactly what
to map.

**Consequence**: Append-only writes. Streaming ingest. No global rewrite ever.

### Law 2: Every Segment Is Independently Valid

Each segment carries its own magic number, length, content hash, and type tag.
A reader encountering any segment in isolation can verify it, identify it, and
decide whether to process it. No segment depends on prior segments for structural
validity.

**Consequence**: Crash safety for free. Parallel verification. Segment-level
integrity without a global checksum.

### Law 3: Data and State Are Separated

Vector payloads, index structures, overlay graphs, quantization dictionaries, and
runtime metadata live in distinct segment types. The manifest binds them together
but they never intermingle. This means you can replace the index without touching
vectors, update the overlay without rebuilding adjacency, or swap quantization
without re-encoding.

**Consequence**: Incremental updates. Modular evolution. Zero-copy segment reuse.

### Law 4: The Format Adapts to Its Workload

RVF monitors access patterns through lightweight sketches and periodically
reorganizes: promoting hot vectors to faster tiers, compacting stale overlays,
lazily building deeper index layers. The format is not static — it converges
toward the optimal layout for its actual workload.

**Consequence**: Self-tuning performance. No manual optimization. The file gets
faster the more you use it.

## Design Coordinates

| Property | RVF Answer |
|----------|-----------|
| Write model | Append-only segments + background compaction |
| Read model | Tail-manifest scan, then progressive mmap |
| Index model | Layered availability (entry points -> partial -> full) |
| Compression | Temperature-tiered (fp16 hot, 5-7 bit warm, 3 bit cold) |
| Alignment | 64-byte for SIMD (AVX-512, NEON, WASM v128) |
| Crash safety | Segment-level hashes, no WAL required |
| Crypto | Post-quantum (ML-DSA-65 signatures, SHAKE-256 hashes) |
| Streaming | Yes — first query before full load |
| Hardware | 8 KB tile to petabyte hub |
| Domain | Universal — genomics, text, graph, vision as profiles |

## Acceptance Test

> Cold start on a 10 million vector file: load and answer the first query with a
> useful (recall >= 0.7) result without reading more than the last 4 MB, then
> converge to full quality (recall >= 0.95) as it progressively maps more segments.

## Document Map

| Document | Path | Content |
|----------|------|---------|
| This overview | `spec/00-overview.md` | Philosophy, laws, design coordinates |
| Segment model | `spec/01-segment-model.md` | Segment types, headers, append-only rules |
| Manifest system | `spec/02-manifest-system.md` | Two-level manifests, hotset pointers |
| Temperature tiering | `spec/03-temperature-tiering.md` | Adaptive layout, access sketches, promotion |
| Progressive indexing | `spec/04-progressive-indexing.md` | Layered HNSW, partial availability |
| Overlay epochs | `spec/05-overlay-epochs.md` | Streaming min-cut, epoch boundaries |
| Wire format | `wire/binary-layout.md` | Byte-level binary format reference |
| WASM microkernel | `microkernel/wasm-runtime.md` | Cognitum tile mapping, WASM exports |
| Domain profiles | `profiles/domain-profiles.md` | RVDNA, RVText, RVGraph, RVVision |
| Crypto spec | `crypto/quantum-signatures.md` | Post-quantum primitives, segment signing |
| Benchmarks | `benchmarks/acceptance-tests.md` | Performance targets, test methodology |

## Relationship to RVDNA

RVDNA (RuVector DNA) was the first domain-specific format for genomic vector
intelligence. In the RVF model, RVDNA becomes a **profile** — a set of conventions
for how genomic data maps onto the universal RVF substrate:

```
RVF (universal substrate)
 |
 +-- RVF Core Profile (minimal, fits on 64KB tile)
 +-- RVF Hot Profile (chip-optimized, SIMD-heavy)
 +-- RVF Full Profile (hub-scale, all features)
 |
 +-- Domain Profiles
      +-- RVDNA (genomics: codons, motifs, k-mers)
      +-- RVText (language: embeddings, token graphs)
      +-- RVGraph (networks: adjacency, partitions)
      +-- RVVision (imagery: feature maps, patch vectors)
```

The substrate carries the laws. The profiles carry the semantics.

## Design Answers

**Q: Random writes or append-only plus compaction?**
A: Append-only plus compaction. This gives speed and crash safety almost for free.
Random writes add complexity for marginal benefit in the vector workload.

**Q: Primary target mmap on desktop CPUs or also microcontroller tiles?**
A: Both. RVF defines three hardware profiles. The Core profile fits in 8 KB code +
8 KB data + 64 KB SIMD scratch. The Full profile assumes mmap on desktop-class
memory. The wire format is identical — only the runtime behavior changes.

**Q: Which property matters most?**
A: All four are non-negotiable, but the priority order for conflict resolution is:

1. **Streamable** (never block on write)
2. **Progressive** (answer before fully loaded)
3. **Adaptive** (self-optimize over time)
4. **p95 speed** (predictable tail latency)
vendor/ruvector/docs/research/rvf/spec/01-segment-model.md (vendored, new file)
@@ -0,0 +1,224 @@

# RVF Segment Model

## 1. Append-Only Segment Architecture

An RVF file is a linear sequence of **segments**. Each segment is a self-contained,
independently verifiable unit. New data is always appended — never inserted into or
overwritten within existing segments.

```
+------------+------------+------------+     +------------+
| Segment 0  | Segment 1  | Segment 2  | ... | Segment N  |  <-- EOF
+------------+------------+------------+     +------------+
                                                    ^
                                         Latest MANIFEST_SEG
                                          (source of truth)
```

### Why Append-Only

| Property | Benefit |
|----------|---------|
| Write amplification | Zero — each byte written once until compaction |
| Crash safety | Partial segment at tail is detectable and discardable |
| Concurrent reads | Readers see a consistent snapshot at any manifest boundary |
| Streaming ingest | Writer never blocks on reorganization |
| mmap friendliness | Pages only grow — no invalidation of mapped regions |

## 2. Segment Header

Every segment begins with a fixed 64-byte header. The header is 64-byte aligned
to match SIMD register width.

```
Offset  Size  Field             Description
------  ----  -----             -----------
0x00    4     magic             0x52564653 ("RVFS" in ASCII)
0x04    1     version           Segment format version (currently 1)
0x05    1     seg_type          Segment type enum (see below)
0x06    2     flags             Bitfield: compressed, encrypted, signed, sealed, etc.
0x08    8     segment_id        Monotonically increasing segment ordinal
0x10    8     payload_length    Byte length of payload (after header, before footer)
0x18    8     timestamp_ns      Nanosecond UNIX timestamp of segment creation
0x20    1     checksum_algo     Hash algorithm enum: 0=CRC32C, 1=XXH3-128, 2=SHAKE-256
0x21    1     compression       Compression enum: 0=none, 1=LZ4, 2=ZSTD, 3=custom
0x22    2     reserved_0        Must be zero
0x24    4     reserved_1        Must be zero
0x28    16    content_hash      First 128 bits of payload hash (algorithm per checksum_algo)
0x38    4     uncompressed_len  Original payload size (0 if no compression)
0x3C    4     alignment_pad     Padding to reach 64-byte boundary
```

**Total header**: 64 bytes (one cache line, one AVX-512 register width).

### Magic Validation

Readers scanning backward from EOF look for `0x52564653` at 64-byte aligned
boundaries. This enables fast tail-scan even on corrupted files.

### Flags Bitfield

```
Bit 0: COMPRESSED   Payload is compressed per compression field
Bit 1: ENCRYPTED    Payload is encrypted (key info in manifest)
Bit 2: SIGNED       A signature footer follows the payload
Bit 3: SEALED       Segment is immutable (compaction output)
Bit 4: PARTIAL      Segment is a partial write (streaming ingest)
Bit 5: TOMBSTONE    Segment logically deletes a prior segment
Bit 6: HOT          Segment contains temperature-promoted data
Bit 7: OVERLAY      Segment contains overlay/delta data
Bit 8: SNAPSHOT     Segment contains full snapshot (not delta)
Bit 9: CHECKPOINT   Segment is a safe rollback point
Bits 10-15: reserved
```
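
As a concrete illustration, the header table above maps directly onto Python's `struct` module. This is a sketch, not normative: the spec fixes offsets and sizes but not byte order, so little-endian fields and storing the magic as the in-order bytes `RVFS` are assumptions here.

```python
import struct
from dataclasses import dataclass

# Field-for-field encoding of the header table above. Little-endian ("<")
# and raw b"RVFS" magic bytes are assumptions; the spec does not pin down
# byte order.
HEADER_FMT = "<4sBBHQQQBBHI16sII"
HEADER_SIZE = struct.calcsize(HEADER_FMT)  # 64 bytes, one cache line

@dataclass
class SegmentHeader:
    version: int
    seg_type: int
    flags: int
    segment_id: int
    payload_length: int
    timestamp_ns: int
    checksum_algo: int
    compression: int
    content_hash: bytes
    uncompressed_len: int

def parse_segment_header(buf: bytes) -> SegmentHeader:
    (magic, version, seg_type, flags, segment_id, payload_length,
     timestamp_ns, checksum_algo, compression, _res0, _res1,
     content_hash, uncompressed_len, _pad) = struct.unpack(HEADER_FMT, buf[:HEADER_SIZE])
    if magic != b"RVFS":
        raise ValueError("not an RVF segment header")
    return SegmentHeader(version, seg_type, flags, segment_id, payload_length,
                         timestamp_ns, checksum_algo, compression,
                         content_hash, uncompressed_len)
```

The two reserved fields and the alignment padding are unpacked and discarded, which is exactly the degree of freedom the "must be zero" convention buys: future versions can assign them meaning without breaking old readers.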

## 3. Segment Types

```
Value  Name          Purpose
-----  ----          -------
0x01   VEC_SEG       Raw vector payloads (the actual embeddings)
0x02   INDEX_SEG     HNSW adjacency lists, entry points, routing tables
0x03   OVERLAY_SEG   Graph overlay deltas, partition updates, min-cut witnesses
0x04   JOURNAL_SEG   Metadata mutations (label changes, deletions, moves)
0x05   MANIFEST_SEG  Segment directory, hotset pointers, epoch state
0x06   QUANT_SEG     Quantization dictionaries and codebooks
0x07   META_SEG      Arbitrary key-value metadata (tags, provenance, lineage)
0x08   HOT_SEG       Temperature-promoted hot data (vectors + neighbors)
0x09   SKETCH_SEG    Access counter sketches for temperature decisions
0x0A   WITNESS_SEG   Capability manifests, proof of computation, audit trails
0x0B   PROFILE_SEG   Domain profile declarations (RVDNA, RVText, etc.)
0x0C   CRYPTO_SEG    Key material, signature chains, certificate anchors
0x0D   METAIDX_SEG   Metadata inverted indexes for filtered search
```

### Reserved Range

Types `0x00` and `0xF0`-`0xFF` are reserved. `0x00` indicates an uninitialized
or zeroed region (not a valid segment). `0xF0`-`0xFF` are reserved for
implementation-specific extensions.

## 4. Segment Footer

If the `SIGNED` flag is set, the payload is followed by a signature footer:

```
Offset  Size  Field          Description
------  ----  -----          -----------
0x00    2     sig_algo       Signature algorithm: 0=Ed25519, 1=ML-DSA-65, 2=SLH-DSA-128s
0x02    2     sig_length     Byte length of signature
0x04    var   signature      The signature bytes
var     4     footer_length  Total footer size (for backward scanning)
```

Unsigned segments have no footer — the next segment header follows immediately
after the payload (at the next 64-byte aligned boundary).

## 5. Segment Lifecycle

### Write Path

```
1. Allocate segment ID (monotonic counter)
2. Compute payload hash
3. Write header + payload + optional footer
4. fsync (or fdatasync for non-manifest segments)
5. Write MANIFEST_SEG referencing the new segment
6. fsync the manifest
```

The two-fsync protocol ensures that:

- If a crash occurs before step 6, the orphan segment is harmless (no manifest points to it)
- If a crash occurs during step 6, the partial manifest is detectable (bad hash)
- After step 6, the segment is durably committed
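
The write path above can be sketched in a few lines of Python. The header here is a simplified stand-in for the full 64-byte layout (magic, version, type, length, SHAKE-256 hash, zero padding); the exact packing is an illustrative assumption, not the wire format.

```python
import hashlib
import os
import struct

MANIFEST_SEG = 0x05

def _pad64(n: int) -> int:
    return (n + 63) & ~63  # round up to the next 64-byte boundary

def append_segment(fd: int, seg_type: int, payload: bytes) -> None:
    # Simplified 64-byte header: magic, version, type, payload length,
    # 16-byte SHAKE-256 content hash (step 2), zero padding.
    header = struct.pack("<4sBBxxQ16s32x", b"RVFS", 1, seg_type,
                         len(payload), hashlib.shake_256(payload).digest(16))
    record = header + payload
    record += b"\x00" * (_pad64(len(record)) - len(record))  # keep headers aligned
    os.write(fd, record)  # step 3
    os.fsync(fd)          # step 4 (or step 6 when this is the manifest)

def commit(fd: int, data_payload: bytes, manifest_payload: bytes) -> None:
    VEC_SEG = 0x01
    append_segment(fd, VEC_SEG, data_payload)           # steps 1-4
    append_segment(fd, MANIFEST_SEG, manifest_payload)  # steps 5-6
```

Note how the crash-safety argument falls out of the ordering alone: the data segment is durable before any manifest names it, so no fsync barrier beyond the two shown is needed.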

### Read Path

```
1. Seek to EOF
2. Scan backward for latest MANIFEST_SEG (look for magic at aligned boundaries)
3. Parse manifest -> get segment directory
4. Map segments on demand (progressive loading)
```

### Compaction

Compaction merges multiple segments into fewer, larger, sealed segments:

```
Before:  [VEC_SEG_1] [VEC_SEG_2] [VEC_SEG_3] [MANIFEST_3]
After:   [VEC_SEG_1] [VEC_SEG_2] [VEC_SEG_3] [MANIFEST_3] [VEC_SEG_sealed] [MANIFEST_4]
                                                           ^^^^^^^^^^^^^^^
                                                          New sealed segment
                                                           merging 1+2+3
```

Old segments are marked with TOMBSTONE entries in the new manifest. Space is
reclaimed when the file is eventually rewritten (or, in multi-file mode, when
the shard holding the old segments is dropped).

### Multi-File Mode

For very large datasets, RVF can span multiple files:

```
data.rvf         Main file with manifests and hot data
data.rvf.cold.0  Cold segment shard 0
data.rvf.cold.1  Cold segment shard 1
data.rvf.idx.0   Index segment shard 0
```

The manifest in the main file contains shard references with file paths and
byte ranges. This enables cold data to live on slower storage while hot data
stays on fast storage.

## 6. Segment Addressing

Segments are addressed by their `segment_id` (a monotonically increasing 64-bit
integer). The manifest maps segment IDs to file offsets (and optionally shard
file paths in multi-file mode).

Within a segment, data is addressed by **block offset** — a 32-bit offset from
the start of the segment payload. This limits individual segments to 4 GB, which
is intentional: it keeps segments manageable for compaction and progressive loading.

### Block Structure Within VEC_SEG

```
+-------------------+
| Block Header (16B)|
|   block_id: u32   |
|   count: u32      |
|   dim: u16        |
|   dtype: u8       |
|   pad: [u8; 5]    |
+-------------------+
| Vectors           |
| (count * dim *    |
|  sizeof(dtype))   |
| [64B aligned]     |
+-------------------+
| ID Map            |
| (varint delta     |
|  encoded IDs)     |
+-------------------+
| Block Footer      |
|   crc32c: u32     |
+-------------------+
```

Vectors within a block are stored **columnar** — all dimension 0 values, then all
dimension 1 values, and so on. This maximizes compression ratio. But the HOT_SEG
stores vectors **interleaved** (row-major) for cache-friendly sequential scan during
top-K refinement.
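
The two layouts are simple permutations of each other. A pure-Python sketch of the conversion (illustrative only — real blocks hold packed binary components, not Python lists):

```python
def to_columnar(rows: list[list[float]]) -> list[float]:
    """Row-major vectors -> dimension-major order, as stored in VEC_SEG blocks:
    all dimension-0 components first, then all dimension-1 components, etc."""
    count, dim = len(rows), len(rows[0])
    return [rows[r][d] for d in range(dim) for r in range(count)]

def to_interleaved(col: list[float], count: int, dim: int) -> list[list[float]]:
    """Dimension-major -> row-major, the HOT_SEG layout for sequential scans."""
    return [[col[d * count + r] for d in range(dim)] for r in range(count)]
```

Promotion to the hot tier (section on temperature tiering) is essentially `to_interleaved` applied after dequantization: the same values, reordered so one vector's components are contiguous.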

## 7. Invariants

1. Segment IDs are strictly monotonically increasing within a file
2. A valid RVF file contains at least one MANIFEST_SEG
3. The last MANIFEST_SEG is always the source of truth
4. Segment headers are always 64-byte aligned
5. No segment payload exceeds 4 GB
6. Content hashes are computed over the raw (uncompressed, unencrypted) payload
7. Sealed segments are never modified — only tombstoned
8. A reader that cannot find a valid MANIFEST_SEG must reject the file
vendor/ruvector/docs/research/rvf/spec/02-manifest-system.md (vendored, new file)
@@ -0,0 +1,287 @@

# RVF Manifest System

## 1. Two-Level Manifest Architecture

The manifest system is what makes RVF progressive. Instead of a monolithic directory
that must be fully parsed before any query, RVF uses a two-level manifest that
enables instant boot followed by incremental refinement.

```
EOF
 |
 v
+--------------------------------------------------+
| Level 0: Root Manifest (fixed 4096 bytes)        |
|  - Magic + version                               |
|  - Pointer to Level 1 manifest segment           |
|  - Hotset pointers (inline)                      |
|  - Total vector count                            |
|  - Dimension                                     |
|  - Epoch counter                                 |
|  - Profile declaration                           |
+--------------------------------------------------+
 |
 | points to
 v
+--------------------------------------------------+
| Level 1: Full Manifest (variable size)           |
|  - Complete segment directory                    |
|  - Temperature tier map                          |
|  - Index layer availability                      |
|  - Overlay epoch chain                           |
|  - Compaction state                              |
|  - Shard references (multi-file)                 |
|  - Capability manifest                           |
+--------------------------------------------------+
```

### Why Two Levels

A reader performing cold start only needs Level 0 (4 KB). From Level 0 alone,
it can locate the entry points, coarse routing graph, quantization dictionary,
and centroids — enough to answer approximate queries immediately.

Level 1 is loaded asynchronously to enable full-quality queries, but the system
is functional before Level 1 is fully parsed.

## 2. Level 0: Root Manifest

The root manifest is always the **last 4096 bytes** of the file (or the last
4096 bytes of the most recent MANIFEST_SEG). Its fixed size enables instant
location: `seek(EOF - 4096)`.

### Binary Layout

```
Offset  Size  Field               Description
------  ----  -----               -----------
0x000   4     magic               0x52564D30 ("RVM0")
0x004   2     version             Root manifest version
0x006   2     flags               Root manifest flags
0x008   8     l1_manifest_offset  Byte offset to Level 1 manifest segment
0x010   8     l1_manifest_length  Byte length of Level 1 manifest segment
0x018   8     total_vector_count  Total vectors across all segments
0x020   2     dimension           Vector dimensionality
0x022   1     base_dtype          Base data type enum
0x023   1     profile_id          Domain profile (0=generic, 1=dna, 2=text, 3=graph, 4=vision)
0x024   4     epoch               Current overlay epoch number
0x028   8     created_ns          File creation timestamp (ns)
0x030   8     modified_ns         Last modification timestamp (ns)

--- Hotset Pointers (the key to instant boot) ---

0x038   8     entrypoint_seg_offset    Offset to segment containing HNSW entry points
0x040   4     entrypoint_block_offset  Block offset within that segment
0x044   4     entrypoint_count         Number of entry points

0x048   8     toplayer_seg_offset      Offset to segment with top-layer adjacency
0x050   4     toplayer_block_offset    Block offset
0x054   4     toplayer_node_count     Nodes in top layer

0x058   8     centroid_seg_offset      Offset to segment with cluster centroids / pivots
0x060   4     centroid_block_offset    Block offset
0x064   4     centroid_count           Number of centroids

0x068   8     quantdict_seg_offset     Offset to quantization dictionary segment
0x070   4     quantdict_block_offset   Block offset
0x074   4     quantdict_size           Dictionary size in bytes

0x078   8     hot_cache_seg_offset     Offset to HOT_SEG with interleaved hot vectors
0x080   4     hot_cache_block_offset   Block offset
0x084   4     hot_cache_vector_count   Vectors in hot cache

0x088   8     prefetch_map_offset      Offset to prefetch hint table
0x090   4     prefetch_map_entries     Number of prefetch entries

--- Crypto ---

0x094   2     sig_algo            Manifest signature algorithm
0x096   2     sig_length          Signature length
0x098   var   signature           Manifest signature (up to 3400 bytes for ML-DSA-65)

--- Padding to 4096 bytes ---

0xF00   252   reserved            Reserved / zero-padded to 4096
0xFFC   4     root_checksum       CRC32C of bytes 0x000-0xFFB
```

**Total**: Exactly 4096 bytes (one page, one disk sector on most hardware).

### Hotset Pointers

The six hotset pointers are the minimum information needed to answer a query:

1. **Entry points**: Where to start HNSW traversal
2. **Top-layer adjacency**: Coarse routing to the right neighborhood
3. **Centroids/pivots**: For IVF-style pre-filtering or partition routing
4. **Quantization dictionary**: For decoding compressed vectors
5. **Hot cache**: Pre-decoded interleaved vectors for top-K refinement
6. **Prefetch map**: Contiguous neighbor-list pages with prefetch offsets

With these six pointers, a reader can:

- Start HNSW search at the entry point
- Route through the top layer
- Quantize the query using the dictionary
- Scan the hot cache for refinement
- Prefetch neighbor pages for cache-friendly traversal

All without reading Level 1 or any cold segments.

## 3. Level 1: Full Manifest

Level 1 is a variable-size segment (type `MANIFEST_SEG`) referenced by Level 0.
It contains the complete file directory.

### Structure

Level 1 is encoded as a sequence of typed records using a tag-length-value (TLV)
scheme for forward compatibility:

```
+---+---+---+---+---+---+---+---+
| Tag (2B) | Length (4B) | Pad  |  <- 8-byte aligned record header
+---+---+---+---+---+---+---+---+
| Value (Length bytes)          |
| [padded to 8-byte boundary]   |
+-------------------------------+
```
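
Walking this record stream is a short loop. In this sketch, little-endian byte order is an assumption (the spec fixes field sizes, not endianness), as is treating a zero tag as end-of-stream:

```python
import struct

def iter_tlv_records(buf: bytes):
    """Yield (tag, value) pairs from a Level 1 TLV stream: an 8-byte record
    header (u16 tag, u32 length, 2 pad bytes) followed by the value, with
    the next record starting at the following 8-byte boundary."""
    off = 0
    while off + 8 <= len(buf):
        tag, length = struct.unpack_from("<HI", buf, off)
        if tag == 0:
            break  # assumption: a zero tag marks padding / end of records
        yield tag, buf[off + 8 : off + 8 + length]
        off += 8 + ((length + 7) & ~7)  # skip the value plus its padding
```

Forward compatibility falls out naturally: a reader that does not recognize a tag simply skips `length` (padded) bytes and continues.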

### Record Types

```
Tag     Name                 Description
---     ----                 -----------
0x0001  SEGMENT_DIR          Array of segment directory entries
0x0002  TEMP_TIER_MAP        Temperature tier assignments per block
0x0003  INDEX_LAYERS         Index layer availability bitmap
0x0004  OVERLAY_CHAIN        Epoch chain with rollback pointers
0x0005  COMPACTION_STATE     Active/tombstoned segment sets
0x0006  SHARD_REFS           Multi-file shard references
0x0007  CAPABILITY_MANIFEST  What this file can do (features, limits)
0x0008  PROFILE_CONFIG       Domain-specific configuration
0x0009  ACCESS_SKETCH_REF    Pointer to latest SKETCH_SEG
0x000A  PREFETCH_TABLE       Full prefetch hint table
0x000B  ID_RESTART_POINTS    Restart point index for varint delta IDs
0x000C  WITNESS_CHAIN        Proof-of-computation witness chain
0x000D  KEY_DIRECTORY        Encryption key references (not keys themselves)
```

### Segment Directory Entry

```
Offset  Size  Field              Description
------  ----  -----              -----------
0x00    8     segment_id         Segment ordinal
0x08    1     seg_type           Segment type enum
0x09    1     tier               Temperature tier (0=hot, 1=warm, 2=cold)
0x0A    2     flags              Segment flags
0x0C    4     reserved           Must be zero
0x10    8     file_offset        Byte offset in file (or shard)
0x18    8     payload_length     Decompressed payload length
0x20    8     compressed_length  Compressed length (0 if uncompressed)
0x28    2     shard_id           Shard index (0 for main file)
0x2A    2     compression        Compression algorithm
0x2C    4     block_count        Number of blocks in segment
0x30    16    content_hash       Payload hash (first 128 bits)
```

**Total**: 64 bytes per entry (cache-line aligned).

## 4. Manifest Lifecycle

### Writing a New Manifest

Every mutation to the file produces a new MANIFEST_SEG appended at the tail:

```
1. Compute new Level 1 manifest (segment directory + metadata)
2. Write Level 1 as a MANIFEST_SEG payload
3. Compute Level 0 root manifest pointing to Level 1
4. Write Level 0 as the last 4096 bytes of the MANIFEST_SEG
5. fsync
```

The MANIFEST_SEG payload structure is:

```
+-----------------------------------+
| Level 1 manifest (variable size)  |
+-----------------------------------+
| Level 0 root manifest (4096 B)    |  <-- Always the last 4096 bytes
+-----------------------------------+
```

### Reading the Manifest

```
1. seek(EOF - 4096)
2. Read 4096 bytes -> Level 0 root manifest
3. Validate magic (0x52564D30) and checksum
4. If valid: extract hotset pointers -> system is queryable
5. Async: read Level 1 at l1_manifest_offset -> full directory
6. If Level 0 is invalid: scan backward for previous MANIFEST_SEG
```

Step 6 provides crash recovery. If the latest manifest write was interrupted,
the previous manifest is still valid. Readers scan backward at 64-byte aligned
boundaries looking for the RVFS magic + MANIFEST_SEG type.

### Manifest Chain

Each manifest implicitly forms a chain through the segment ID ordering. For
explicit rollback support, Level 1 contains the `OVERLAY_CHAIN` record, which
stores:

```
epoch: u32                  Current epoch
prev_manifest_offset: u64   Offset of previous MANIFEST_SEG
prev_manifest_id: u64       Segment ID of previous MANIFEST_SEG
checkpoint_hash: [u8; 16]   Hash of the complete state at this epoch
```

This enables point-in-time recovery and bisection debugging.

## 5. Hotset Pointer Semantics

### Entry Point Stability

Entry points are the HNSW nodes at the highest layer. They change rarely (only
when the index is rebuilt or a new highest-layer node is inserted). The root
manifest caches them directly so they survive across manifest generations without
re-reading the index.

### Centroid Refresh

Centroids may drift as data is added. The manifest tracks a `centroid_epoch` — if
the current epoch exceeds centroid_epoch + threshold, the runtime should schedule
centroid recomputation. But the stale centroids remain usable (recall degrades
gracefully; it does not fail).

### Hot Cache Coherence

The hot cache in HOT_SEG is a **read-optimized snapshot** of the most-accessed
vectors. It may be stale relative to the latest VEC_SEGs. The manifest tracks
a `hot_cache_epoch` for staleness detection. Queries use the hot cache for fast
initial results, then refine against authoritative VEC_SEGs if needed.

## 6. Progressive Boot Sequence

```
Time     Action                         System State
----     ------                         ------------
t=0      Read last 4 KB (Level 0)       Booting
t+1ms    Parse hotset pointers          Queryable (approximate)
t+2ms    mmap entry points + top layer  Better routing
t+5ms    mmap hot cache + quant dict    Fast top-K refinement
t+10ms   Start loading Level 1          Discovering full directory
t+50ms   Level 1 parsed                 Full segment awareness
t+100ms  mmap warm VEC_SEGs             Recall improving
t+500ms  mmap cold VEC_SEGs             Full recall
t+1s     Background index layer build   Converging to optimal
```

For a 10M vector file (~7.2 GB at 384 dimensions, float16):

- Level 0 read: 4 KB in <1 ms
- Hotset data: ~2-4 MB (entry points + top layer + centroids + hot cache)
- First query: within 5-10 ms of open
- Full convergence: 1-5 seconds depending on storage speed
vendor/ruvector/docs/research/rvf/spec/03-temperature-tiering.md (vendored, new file)
@@ -0,0 +1,285 @@

# RVF Temperature Tiering

## 1. Adaptive Layout as a First-Class Concept

Traditional vector formats place data once and leave it. RVF treats data placement
as a **continuous optimization problem**. Every vector block has a temperature, and
the format periodically reorganizes to keep hot data fast and cold data small.

```
Access Frequency
       ^
       |
Tier 0 | ████████                      fp16 / 8-bit, interleaved
(HOT)  | ████████                      < 1μs random access
       |
Tier 1 | ░░░░░░░░░░░░░░░░              5-7 bit quantized
(WARM) | ░░░░░░░░░░░░░░░░              columnar, compressed
       |
Tier 2 | ▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒  3-bit or 1-bit
(COLD) | ▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒  heavy compression
       |
       +------------------------------------> Vector ID
```

### Tier Definitions

| Tier | Name | Quantization | Layout | Compression | Access Latency |
|------|------|--------------|--------|-------------|----------------|
| 0 | Hot | fp16 or int8 | Interleaved (row-major) | None or LZ4 | < 1 μs |
| 1 | Warm | 5-7 bit SQ/PQ | Columnar | LZ4 or ZSTD | 1-10 μs |
| 2 | Cold | 3-bit or binary | Columnar | ZSTD level 9+ | 10-100 μs |

### Memory Ratios

For 384-dimensional vectors (a typical embedding size):

| Tier | Bytes/Vector | Ratio vs fp32 | 10M Vectors |
|------|--------------|---------------|-------------|
| fp32 (raw) | 1536 B | 1.0x | 14.3 GB |
| Tier 0 (fp16) | 768 B | 2.0x | 7.2 GB |
| Tier 0 (int8) | 384 B | 4.0x | 3.6 GB |
| Tier 1 (6-bit) | 288 B | 5.3x | 2.7 GB |
| Tier 1 (5-bit) | 240 B | 6.4x | 2.2 GB |
| Tier 2 (3-bit) | 144 B | 10.7x | 1.3 GB |
| Tier 2 (1-bit) | 48 B | 32.0x | 0.45 GB |
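
The table's figures follow directly from the dimension and the per-component bit width (the "GB" column is binary GiB). A small checker, assuming whole-number bit widths:

```python
def tier_footprint(dim: int, bits: int, count: int) -> tuple[int, float]:
    """Return (bytes per vector, total GiB) for `count` vectors of `dim`
    components quantized to `bits` bits each."""
    bytes_per_vec = dim * bits // 8
    return bytes_per_vec, round(count * bytes_per_vec / 2**30, 1)
```

For example, `tier_footprint(384, 32, 10_000_000)` reproduces the fp32 row, `(1536, 14.3)`, and `tier_footprint(384, 3, 10_000_000)` the 3-bit row, `(144, 1.3)`.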

## 2. Access Counter Sketch

Temperature decisions require knowing which blocks are accessed frequently.
RVF maintains a lightweight **Count-Min Sketch** per block set, stored in
SKETCH_SEG segments.

### Sketch Parameters

```
Width (w):     1024 counters
Depth (d):     4 hash functions
Counter size:  8-bit saturating (max 255)
Memory:        1024 * 4 * 1 = 4 KB per sketch
Granularity:   One sketch per 1024-vector block
Decay:         Halve all counters every 2^16 accesses (aging)
```

For 10M vectors in 1024-vector blocks:

- 9,766 blocks
- 9,766 * 4 KB = ~38 MB of sketches
- Stored in SKETCH_SEG, referenced by manifest

### Sketch Operations

**On query access**:
```
block_id = vector_id / block_size
for i in 0..depth:
    idx = hash_i(block_id) % width
    sketch[i][idx] = min(sketch[i][idx] + 1, 255)
```

**On temperature check**:
```
count = min over i of sketch[i][hash_i(block_id) % width]
if count > HOT_THRESHOLD:    tier = 0
elif count > WARM_THRESHOLD: tier = 1
else:                        tier = 2
```

**Aging** (every 2^16 accesses):
```
for all counters: counter = counter >> 1
```

This ensures the sketch tracks *recent* access patterns, not cumulative history.
|
||||
|
||||
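The operations above can be made concrete with a minimal Python sketch. Parameter defaults follow the spec (w=1024, d=4, 8-bit saturating counters); the hash construction via seeded BLAKE2b is an illustrative choice, not mandated by the format:

```python
import hashlib

class CountMinSketch:
    """Minimal Count-Min Sketch with 8-bit saturating counters and aging."""

    def __init__(self, width=1024, depth=4):
        self.width, self.depth = width, depth
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, row, block_id):
        # One independent hash per row, derived by salting the digest.
        h = hashlib.blake2b(block_id.to_bytes(8, "little"),
                            salt=bytes([row] * 8))
        return int.from_bytes(h.digest()[:4], "little") % self.width

    def update(self, block_id):
        for i in range(self.depth):
            idx = self._index(i, block_id)
            self.table[i][idx] = min(self.table[i][idx] + 1, 255)  # saturate

    def estimate(self, block_id):
        # CMS never underestimates: take the minimum over rows.
        return min(self.table[i][self._index(i, block_id)]
                   for i in range(self.depth))

    def age(self):
        """Halve every counter so stale activity decays away."""
        for row in self.table:
            for j in range(len(row)):
                row[j] >>= 1
```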
### Why Count-Min Sketch

| Alternative | Memory | Accuracy | Update Cost |
|------------|--------|----------|-------------|
| Per-vector counter | 80 MB (10M * 8B) | Exact | O(1) |
| Count-Min Sketch | 38 MB | ~99.9% | O(depth) = O(4) |
| HyperLogLog | 6 MB | ~98% | O(1) but cardinality only |
| Bloom filter | 12 MB | No counting | N/A |

Count-Min Sketch is the best trade-off: near-exact accuracy with bounded memory
and constant-time updates.
## 3. Promotion and Demotion

### Promotion: Warm/Cold -> Hot

When a block's access count exceeds HOT_THRESHOLD for two consecutive sketch
epochs:

```
1. Read the block from its current VEC_SEG
2. Decode/dequantize vectors to fp16 or int8
3. Rearrange from columnar to interleaved layout
4. Write as a new HOT_SEG (or append to existing HOT_SEG)
5. Update manifest with new tier assignment
6. Optionally: add neighbor lists to HOT_SEG for locality
```

### Demotion: Hot -> Warm -> Cold

When a block's access count drops below WARM_THRESHOLD:

```
1. The block is not immediately rewritten
2. On the next compaction cycle, the block is written to the appropriate tier
3. Quantization is applied during compaction (not lazily)
4. The HOT_SEG entry is tombstoned in the manifest
```

### Eviction as Compression

The key insight: **eviction from the hot tier is just compression, not deletion**.
The vector data is always present — it just moves to a more compressed
representation. This means:

- No data loss on eviction
- Recall degrades gracefully (quantized vectors still contribute to search)
- The file naturally compresses over time as access patterns stabilize
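The promotion/demotion rules above can be sketched as a small policy function. The two-epoch promotion hysteresis is from the spec; the one-step demotion and the threshold values are assumptions for illustration:

```python
HOT_THRESHOLD = 64   # illustrative; the spec leaves thresholds configurable
WARM_THRESHOLD = 8

def next_tier(current_tier: int, latest: int, previous: int) -> int:
    """Decide a block's next tier from its last two epoch counts.

    Promote only when both recent epochs exceeded HOT_THRESHOLD
    (two-epoch hysteresis); demote one step toward cold when the
    latest epoch fell below WARM_THRESHOLD."""
    if latest > HOT_THRESHOLD and previous > HOT_THRESHOLD:
        return 0                          # promote to hot
    if latest < WARM_THRESHOLD:
        return min(current_tier + 1, 2)   # demote one step, floor at cold
    return current_tier                   # otherwise stay put
```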
## 4. Temperature-Aware Compaction

Standard compaction merges segments for space efficiency. Temperature-aware
compaction also **rearranges blocks by tier**:

```
Before compaction:
  VEC_SEG_1: [hot] [cold] [warm] [hot] [cold]
  VEC_SEG_2: [warm] [hot] [cold] [warm] [warm]

After temperature-aware compaction:
  HOT_SEG:   [hot] [hot] [hot]             <- interleaved, fp16
  VEC_SEG_W: [warm] [warm] [warm] [warm]   <- columnar, 6-bit
  VEC_SEG_C: [cold] [cold] [cold]          <- columnar, 3-bit
```

This creates **physical locality by temperature**: hot blocks are contiguous
(good for sequential scan), warm blocks are contiguous (good for batch decode),
cold blocks are contiguous (good for compression ratio).

### Compaction Triggers

| Trigger | Condition | Action |
|---------|-----------|--------|
| Sketch epoch | Every N writes | Evaluate all block temperatures |
| Space amplification | Dead space > 30% | Merge + rewrite segments |
| Tier imbalance | Hot tier > 20% of data | Demote cold blocks |
| Hot miss rate | Hot cache miss > 10% | Promote missing blocks |
## 5. Quantization Strategies by Tier

### Tier 0: Hot

**Scalar quantization to int8** (preferred) or **fp16** (for maximum recall).

```
Encoding:
  q = round((v - min) / (max - min) * 255)

Decoding:
  v = q / 255 * (max - min) + min

Parameters stored in QUANT_SEG:
  min: f32 per dimension
  max: f32 per dimension
```

Distances are computed directly on int8 using SIMD (vpsubb + vpmaddubsw on
AVX-512).
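The encode/decode formulas above translate directly into a minimal sketch (per-dimension min/max, as stored in QUANT_SEG; function names are illustrative):

```python
def sq_encode(v, lo, hi):
    """Per-dimension scalar quantization to uint8: q = round((v-min)/(max-min)*255)."""
    return [round((x - l) / (h - l) * 255) for x, l, h in zip(v, lo, hi)]

def sq_decode(q, lo, hi):
    """Inverse mapping: v = q/255*(max-min) + min."""
    return [c / 255 * (h - l) + l for c, l, h in zip(q, lo, hi)]

v = [0.1, -0.5, 0.9]
lo, hi = [-1.0] * 3, [1.0] * 3
restored = sq_decode(sq_encode(v, lo, hi), lo, hi)
# Maximum round-trip error is half a quantization step per dimension.
assert all(abs(a - b) <= (h - l) / 255 for a, b, l, h in zip(v, restored, lo, hi))
```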
### Tier 1: Warm

**Product Quantization (PQ)** with 5-7 bits per sub-vector.

```
Parameters:
  M subspaces:         48 (for 384-dim vectors, 8 dims per subspace)
  K centroids per sub: 64 (6-bit) or 128 (7-bit)
  Codebook:            M * K * 8 * sizeof(f32) = 48 * 64 * 8 * 4 = 96 KB

Encoding:
  For each subvector: find nearest centroid -> store centroid index

Distance computation:
  ADC (Asymmetric Distance Computation) with precomputed distance tables
```
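PQ encoding and ADC lookup can be sketched in a few lines (pure-Python, squared-L2; a toy illustration of the technique, not the format's wire layout):

```python
def pq_encode(v, codebooks):
    """Assign each subvector to its nearest centroid (squared-L2)."""
    d = len(v) // len(codebooks)          # dims per subspace
    codes = []
    for m, book in enumerate(codebooks):
        sub = v[m * d:(m + 1) * d]
        codes.append(min(range(len(book)),
                         key=lambda k: sum((a - b) ** 2
                                           for a, b in zip(sub, book[k]))))
    return codes

def adc_tables(query, codebooks):
    """Precompute query-to-centroid distances per subspace (the ADC tables)."""
    d = len(query) // len(codebooks)
    return [[sum((a - b) ** 2 for a, b in zip(query[m * d:(m + 1) * d], c))
             for c in book] for m, book in enumerate(codebooks)]

def adc_distance(codes, tables):
    """Approximate distance = sum of one table lookup per subspace."""
    return sum(tables[m][c] for m, c in enumerate(codes))
```

At query time the tables are built once per query, after which every encoded vector costs only M lookups and adds.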
### Tier 2: Cold

**Binary quantization** (1-bit) or **ternary quantization** (2-bit / 3-bit).

```
Binary encoding:
  b = sign(v) -> 1 bit per dimension
  384 dims -> 48 bytes per vector (32x compression)

Distance:
  Hamming distance via POPCNT
  XOR + POPCNT on AVX-512: 512 bits per cycle

Ternary (3-bit with magnitude):
  t = {-1, 0, +1} based on threshold
  magnitude = |v| quantized to 3 levels
  384 dims -> 144 bytes per vector (10.7x compression)
```
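Binary quantization and the XOR + popcount distance are simple enough to show end to end (Python ints stand in for the packed bit words; hardware would use POPCNT/VPOPCNTDQ):

```python
def binary_encode(v):
    """Pack sign bits into a single int (bit i set iff v[i] > 0)."""
    bits = 0
    for i, x in enumerate(v):
        if x > 0:
            bits |= 1 << i
    return bits

def hamming(a, b):
    """Hamming distance = popcount of the XOR (POPCNT in hardware)."""
    return bin(a ^ b).count("1")

x = binary_encode([0.3, -0.2, 0.9, -0.7])   # bits 0 and 2 set
y = binary_encode([0.3, 0.1, -0.4, -0.7])   # bits 0 and 1 set
assert hamming(x, y) == 2                    # differ in bits 1 and 2
```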
### Codebook Storage

All quantization parameters (codebooks, min/max ranges, centroids) are stored
in QUANT_SEG segments. The root manifest's `quantdict_seg_offset` hotset pointer
references the active quantization dictionary for fast boot.

Multiple QUANT_SEGs can coexist for different tiers — the manifest maps each
tier to its dictionary.
## 6. Hardware Adaptation

### Desktop (AVX-512)

- Hot tier: int8 with VNNI dot product (VPDPBUSD: 4 int8 multiply-accumulates per 32-bit lane)
- Warm tier: PQ with AVX-512 gather for table lookups
- Cold tier: Binary with VPOPCNTDQ (512-bit popcount)

### ARM (NEON)

- Hot tier: int8 with the SDOT instruction
- Warm tier: PQ with TBL for table lookups
- Cold tier: Binary with CNT (population count)

### WASM (v128)

- Hot tier: int8 with the relaxed-SIMD dot product `i16x8.relaxed_dot_i8x16_i7x16_s` (if available)
- Warm tier: Scalar PQ (no gather)
- Cold tier: Binary with manual popcount

### Cognitum Tile (8KB code + 8KB data + 64KB SIMD)

- Hot tier only: int8 interleaved, fits in SIMD scratch
- No warm/cold — data stays on hub, tile fetches blocks on demand
- Sketch is maintained by hub, not tile
## 7. Self-Organization Over Time

```
t=0    All data in Tier 1 (default warm)
        |
t+N    First sketch epoch: identify hot blocks
       Promote top 5% to Tier 0
        |
t+2N   Second epoch: validate promotions
       Demote false positives back to Tier 1
       Identify true cold blocks (0 accesses in 2 epochs)
        |
t+3N   Compaction: physically separate tiers
       HOT_SEG created with interleaved layout
       Cold blocks compressed to 3-bit
        |
t+∞    Equilibrium: ~5% hot, ~30% warm, ~65% cold
       File size: ~2-3x smaller than uniform fp16
       Query p95: dominated by hot tier latency
```

The format converges to an equilibrium that reflects actual usage. No manual
tuning required.
374
vendor/ruvector/docs/research/rvf/spec/04-progressive-indexing.md
vendored
Normal file
@@ -0,0 +1,374 @@
# RVF Progressive Indexing

## 1. Index as Layers of Availability

Traditional HNSW serialization is all-or-nothing: either the full graph is loaded,
or nothing works. RVF decomposes the index into three layers of availability, each
independently useful, each stored in separate INDEX_SEG segments.
```
Layer C: Full Adjacency
+--------------------------------------------------+
| Complete neighbor lists for every node at every  |
| HNSW level. Built lazily. Optional for queries.  |
| Recall: >= 0.95                                  |
+--------------------------------------------------+
          ^ loaded last (seconds to minutes)
          |
Layer B: Partial Adjacency
+--------------------------------------------------+
| Neighbor lists for the most-accessed region      |
| (determined by temperature sketch). Covers the   |
| hot working set of the graph.                    |
| Recall: >= 0.85                                  |
+--------------------------------------------------+
          ^ loaded second (100ms - 1s)
          |
Layer A: Entry Points + Coarse Routing
+--------------------------------------------------+
| HNSW entry points. Top-layer adjacency lists.    |
| Cluster centroids for IVF pre-routing.           |
| Always present. Always in Level 0 hotset.        |
| Recall: >= 0.70                                  |
+--------------------------------------------------+
          ^ loaded first (< 5ms)
          |
      File open
```
### Why Three Layers

| Layer | Purpose | Data Size (10M vectors) | Load Time (NVMe) |
|-------|---------|------------------------|-------------------|
| A | First query possible | 1-4 MB | < 5 ms |
| B | Good quality for working set | 50-200 MB | 100-500 ms |
| C | Full recall for all queries | 1-4 GB | 2-10 s |

A system that only loads Layer A can still answer queries — just with lower recall.
As Layers B and C load asynchronously, quality improves transparently.
## 2. Layer A: Entry Points and Coarse Routing

### Content

- **HNSW entry points**: The node(s) at the highest layer of the HNSW graph.
  Typically 1 node, but may be multiple for redundancy.
- **Top-layer adjacency**: Full neighbor lists for all nodes at HNSW layers
  >= ceil(ln(N) / ln(M)) - 2. For 10M vectors with M=16, this is layers 5-6,
  containing ~100-1000 nodes.
- **Cluster centroids**: K centroids (K = sqrt(N) typically, so ~3162 for 10M)
  used for IVF-style partition routing.
- **Centroid-to-partition map**: Which centroid owns which vector ID ranges.

### Storage

Layer A data is stored in a dedicated INDEX_SEG with `flags.HOT` set. The root
manifest's hotset pointers reference this segment directly. On cold start, this
is the first data mapped after the manifest.
### Binary Layout of Layer A INDEX_SEG

```
+-------------------------------------------+
| Header: INDEX_SEG, flags=HOT              |
+-------------------------------------------+
| Block 0: Entry Points                     |
|   entry_count: u32                        |
|   max_layer: u32                          |
|   [entry_node_id: u64, layer: u32] * N    |
+-------------------------------------------+
| Block 1: Top-Layer Adjacency              |
|   layer_count: u32                        |
|   For each layer (top to bottom):         |
|     node_count: u32                       |
|     For each node:                        |
|       node_id: u64                        |
|       neighbor_count: u16                 |
|       [neighbor_id: u64] * neighbor_count |
|   [64B padding]                           |
+-------------------------------------------+
| Block 2: Centroids                        |
|   centroid_count: u32                     |
|   dim: u16                                |
|   dtype: u8 (fp16)                        |
|   [centroid_vector: fp16 * dim] * K       |
|   [64B aligned]                           |
+-------------------------------------------+
| Block 3: Partition Map                    |
|   partition_count: u32                    |
|   For each partition:                     |
|     centroid_id: u32                      |
|     vector_id_start: u64                  |
|     vector_id_end: u64                    |
|     segment_ref: u64 (segment_id)         |
|     block_ref: u32 (block offset)         |
+-------------------------------------------+
```
### Query Using Only Layer A

```python
def query_layer_a_only(query, k, layer_a):
    # Step 1: Find nearest centroids
    dists = [distance(query, c) for c in layer_a.centroids]
    top_partitions = top_n(dists, n_probe)

    # Step 2: HNSW search through top layers only
    entry = layer_a.entry_points[0]
    current = entry
    for layer in range(layer_a.max_layer, layer_a.min_available_layer, -1):
        current = greedy_search(query, current, layer_a.adjacency[layer])

    # Step 3: If a hot cache is available, refine against it
    if hot_cache:
        candidates = scan_hot_cache(query, hot_cache, current.partition)
        return top_k(candidates, k)

    # Step 4: Otherwise, return centroid-approximate results
    return approximate_from_centroids(query, top_partitions, k)
```

Expected recall: 0.65-0.75 (depends on centroid quality and hot cache coverage).
## 3. Layer B: Partial Adjacency

### Content

Neighbor lists for the **hot region** of the graph — the set of nodes that appear
most frequently in query traversals. Determined by the temperature sketch (see
03-temperature-tiering.md).

Typically covers:

- All nodes at HNSW layers >= 2
- Layer 0-1 nodes in the hot temperature tier
- ~10-20% of total nodes
### Storage

Layer B is stored in one or more INDEX_SEGs without the HOT flag. The Level 1
manifest maps these segments and records which node ID ranges they cover.

### Incremental Build

Layer B can be built incrementally:

```
1. After Layer A is loaded, begin query serving
2. In background: read VEC_SEGs for hot-tier blocks
3. Build HNSW adjacency for those blocks
4. Write as a new INDEX_SEG
5. Update manifest to include Layer B
6. Future queries use Layer B for better recall
```

This means the index improves over time without blocking any queries.
### Partial Adjacency Routing

When a query traversal reaches a node without Layer B adjacency (i.e., it's in
the cold region), the system falls back to:

1. **Centroid routing**: Use Layer A centroids to estimate the nearest region
2. **Linear scan**: Scan the relevant VEC_SEG block directly
3. **Approximate**: Accept slightly lower recall for that portion

```python
def search_with_partial_index(query, k, layers):
    # Start with Layer A routing
    current = hnsw_search_layers(query, layers.a, layers.a.max_layer, 2)

    # Continue with Layer B (where available)
    if layers.b.has_node(current):
        current = hnsw_search_layers(query, layers.b, 1, 0,
                                     start=current)
    else:
        # Fallback: scan the block containing current
        candidates = linear_scan_block(query, current.block)
        current = best_of(current, candidates)

    return top_k(current.visited, k)
```
## 4. Layer C: Full Adjacency

### Content

Complete neighbor lists for every node at every HNSW level. This is the
traditional full HNSW graph.

### Storage

Layer C may be split across multiple INDEX_SEGs for large datasets. The
manifest records the node ID ranges covered by each segment.

### Lazy Build

Layer C is built lazily — it is not required for the file to be functional.
The build process runs as a background task:

```
1. Identify unindexed VEC_SEG blocks (those without Layer C adjacency)
2. Read blocks in partition order (good locality)
3. Build HNSW adjacency using the existing partial graph as scaffold
4. Write new INDEX_SEG(s)
5. Update manifest
```

### Build Prioritization

Blocks are indexed in temperature order:

1. Hot blocks first (most query benefit)
2. Warm blocks next
3. Cold blocks last (may never be indexed if queries don't reach them)

This means the index build converges to useful quality fast, then approaches
completeness asymptotically.
## 5. Index Segment Binary Format

### Adjacency List Encoding

Neighbor lists are stored using **varint delta encoding with restart points**
for fast random access:

```
+-------------------------------------------+
| Restart Point Index                       |
|   restart_interval: u32 (e.g., 64)        |
|   restart_count: u32                      |
|   [restart_offset: u32] * restart_count   |
|   [64B aligned]                           |
+-------------------------------------------+
| Adjacency Data                            |
|   For each node (sorted by node_id):      |
|     neighbor_count: varint                |
|     [delta_encoded_neighbor_id: varint]   |
|   (restart point every N nodes)           |
+-------------------------------------------+
```
**Restart points**: Every `restart_interval` nodes (default 64), the delta
encoding resets to absolute IDs. This enables fast random access to any node's
neighbors:

1. Binary search the restart point index for the nearest restart <= target
2. Seek to that restart offset
3. Sequentially decode from the restart to the target (at most 63 decodes)

### Varint Encoding

Standard LEB128 varint:

- Values 0-127: 1 byte
- Values 128-16383: 2 bytes
- Values 16384-2097151: 3 bytes

For delta-encoded neighbor IDs (typical delta: 1-1000), most values fit in 1-2
bytes, giving ~3-4x compression over fixed u64.
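The LEB128 varint and the delta encoding of a sorted neighbor list can be sketched as follows (a minimal illustration of the encoding, not the exact segment layout):

```python
def varint_encode(n):
    """LEB128: 7 payload bits per byte, high bit = continuation."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        out.append(byte | (0x80 if n else 0))
        if not n:
            return bytes(out)

def varint_decode(buf, pos):
    """Decode one varint starting at pos; return (value, next_pos)."""
    value, shift = 0, 0
    while True:
        b = buf[pos]
        pos += 1
        value |= (b & 0x7F) << shift
        if not b & 0x80:
            return value, pos
        shift += 7

def encode_neighbors(sorted_ids):
    """Delta-encode a sorted neighbor list: count, then successive deltas."""
    out = bytearray(varint_encode(len(sorted_ids)))
    prev = 0
    for nid in sorted_ids:
        out += varint_encode(nid - prev)  # small deltas -> 1-2 bytes each
        prev = nid
    return bytes(out)
```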
### Prefetch Hints

The manifest's prefetch table maps node ID ranges to contiguous page ranges:

```
Prefetch Entry:
  node_id_start: u64
  node_id_end: u64
  page_offset: u64     Offset of first contiguous page
  page_count: u32      Number of contiguous pages
  prefetch_ahead: u32  Pages to prefetch ahead of current access
```

When the HNSW search accesses a node, the runtime issues `madvise(WILLNEED)`
(or equivalent) for the next `prefetch_ahead` pages. This hides disk/memory
latency behind computation.
## 6. Index Consistency

### Append-Only Index Updates

When new vectors are added:

1. New vectors go into a **fresh VEC_SEG** (append-only)
2. A temporary in-memory index covers the new vectors
3. When the in-memory index reaches a threshold, it is written as a new INDEX_SEG
4. The manifest is updated to include both the old and new INDEX_SEGs
5. Queries search both indexes and merge results

This is analogous to LSM-tree compaction levels, but for graph indexes.

### Index Merging

When too many small INDEX_SEGs accumulate:

```
1. Read all small INDEX_SEGs
2. Build a unified HNSW graph over all vectors
3. Write as a single sealed INDEX_SEG
4. Tombstone old INDEX_SEGs in manifest
```

### Concurrent Read/Write

Readers always see a consistent snapshot through the manifest chain:

- Reader opens file -> reads manifest -> has an immutable segment set
- Writer appends new segments + a new manifest
- Reader continues using the old manifest until it explicitly re-reads
- No locks needed — append-only guarantees no mutation of existing data
## 7. Query Path Integration

The complete query path combining progressive indexing with temperature tiering:

```
Query
  |
  v
+-----------+
| Layer A   |  Entry points + top-layer routing
| (always)  |  ~5ms to load on cold start
+-----------+
      |
Is Layer B available for this region?
      /      \
    Yes       No
    /           \
+-----------+  +-----------+
| Layer B   |  | Centroid  |
| HNSW      |  | Fallback  |
| search    |  | + scan    |
+-----------+  +-----------+
      \           /
       v         v
     +-----------+
     | Candidate |
     | Set       |
     +-----------+
           |
Is hot cache available?
      /      \
    Yes       No
    /           \
+------------+  +-----------+
| Hot cache  |  | Decode    |
| re-rank    |  | from      |
| (int8/fp16)|  | VEC_SEG   |
+------------+  +-----------+
      \           /
       v         v
     +-----------+
     | Top-K     |
     | Results   |
     +-----------+
```
### Recall Expectations by State

| State | Layers Available | Expected Recall@10 |
|-------|-----------------|-------------------|
| Cold start (L0 only) | A | 0.65-0.75 |
| L0 + hot cache | A + hot | 0.75-0.85 |
| L0 + L1 loading | A + B partial | 0.80-0.90 |
| L1 complete | A + B | 0.85-0.92 |
| Full load | A + B + C | 0.95-0.99 |
| Full + optimized | A + B + C + hot | 0.98-0.999 |
308
vendor/ruvector/docs/research/rvf/spec/05-overlay-epochs.md
vendored
Normal file
@@ -0,0 +1,308 @@
# RVF Overlay Epochs

## 1. Streaming Dynamic Min-Cut Overlay

The overlay system manages dynamic graph partitioning — how the vector space is
subdivided for distributed search, shard routing, and load balancing. Unlike
static partitioning, RVF overlays evolve with the data through an epoch-based
model that bounds memory, bounds load time, and enables rollback.
## 2. Overlay Segment Structure

Each OVERLAY_SEG stores a delta relative to the previous epoch's partition state:

```
+-------------------------------------------+
| Header: OVERLAY_SEG                       |
+-------------------------------------------+
| Epoch Header                              |
|   epoch: u32                              |
|   parent_epoch: u32                       |
|   parent_seg_id: u64                      |
|   rollback_offset: u64                    |
|   timestamp_ns: u64                       |
|   delta_count: u32                        |
|   partition_count: u32                    |
+-------------------------------------------+
| Edge Deltas                               |
|   For each delta:                         |
|     delta_type: u8 (ADD=1, REMOVE=2,      |
|                     REWEIGHT=3)           |
|     src_node: u64                         |
|     dst_node: u64                         |
|     weight: f32 (for ADD/REWEIGHT)        |
|   [64B aligned]                           |
+-------------------------------------------+
| Partition Summaries                       |
|   For each partition:                     |
|     partition_id: u32                     |
|     node_count: u64                       |
|     edge_cut_weight: f64                  |
|     centroid: [fp16 * dim]                |
|     node_id_range_start: u64              |
|     node_id_range_end: u64                |
|   [64B aligned]                           |
+-------------------------------------------+
| Min-Cut Witness                           |
|   witness_type: u8                        |
|     0 = checksum only                     |
|     1 = full certificate                  |
|   cut_value: f64                          |
|   cut_edge_count: u32                     |
|   partition_hash: [u8; 32] (SHAKE-256)    |
|   If witness_type == 1:                   |
|     [cut_edge: (u64, u64)] * count        |
|   [64B aligned]                           |
+-------------------------------------------+
| Rollback Pointer                          |
|   prev_epoch_offset: u64                  |
|   prev_epoch_hash: [u8; 16]               |
+-------------------------------------------+
```
## 3. Epoch Lifecycle

### Epoch Creation

A new epoch is created when:

- A batch of vectors is inserted that changes partition balance by > threshold
- The accumulated edge deltas exceed a size limit (default: 1 MB)
- A manual rebalance is triggered
- A merge/compaction produces a new partition layout

```
Epoch 0 (initial)      Epoch 1                Epoch 2
+----------------+     +----------------+     +----------------+
| Full snapshot  |     | Deltas vs E0   |     | Deltas vs E1   |
| of partitions  |     |   +50 edges    |     |   +30 edges    |
| 32 partitions  |     |   -12 edges    |     |   -8 edges     |
| min-cut: 0.342 |     | rebalance: P3  |     | split: P7->P7a |
+----------------+     +----------------+     +----------------+
```

### State Reconstruction

To reconstruct the current partition state:

```
1. Read latest MANIFEST_SEG -> get current_epoch
2. Read OVERLAY_SEG for current_epoch
3. If overlay is a delta: recursively read parent epochs
4. Apply deltas in order: base -> epoch 1 -> epoch 2 -> ... -> current
5. Result: complete partition state
```

For efficiency, the manifest caches the **last full snapshot epoch**. Delta
chains never exceed a configurable depth (default: 8 epochs) before a new
snapshot is forced.
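The reconstruction walk above can be sketched in Python. The in-memory epoch shape (a dict of snapshots, parent pointers, and delta tuples) is assumed for illustration; the file format stores this in OVERLAY_SEGs:

```python
def reconstruct_state(epochs, current_epoch):
    """Walk parent pointers back to the last full snapshot, then replay
    deltas forward in epoch order."""
    chain = []
    e = current_epoch
    while epochs[e]["snapshot"] is None:   # walk back to a full snapshot
        chain.append(e)
        e = epochs[e]["parent"]
    edges = dict(epochs[e]["snapshot"])    # start from the snapshot's edge set
    for epoch in reversed(chain):          # replay deltas oldest-first
        for op, u, v, w in epochs[epoch]["deltas"]:
            if op in ("ADD", "REWEIGHT"):
                edges[(u, v)] = w
            elif op == "REMOVE":
                edges.pop((u, v), None)
    return edges
```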
### Compaction (Epoch Collapse)

When the delta chain reaches maximum depth:

```
1. Reconstruct full state from chain
2. Write new OVERLAY_SEG with witness_type=full_snapshot
3. This becomes the new base epoch
4. Old overlay segments are tombstoned
5. New delta chain starts from this base
```

```
Before: E0(snap) -> E1(delta) -> E2(delta) -> ... -> E8(delta)
After:  E0(snap) -> ... -> E8(delta) -> E9(snap, compacted)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        These can be garbage collected
```
## 4. Min-Cut Witness

The min-cut witness provides a cryptographic proof that the current partition
is "good enough" — that the edge cut is within acceptable bounds.

### Witness Types

**Type 0: Checksum Only**

A SHAKE-256 hash of the complete partition state. Allows verification that
the state is consistent but doesn't prove optimality.

```
witness = SHAKE-256(
    for each partition sorted by id:
        partition_id || node_count || sorted(node_ids) || edge_cut_weight
)
```

**Type 1: Full Certificate**

Lists the actual cut edges. Allows any reader to verify that:

1. The listed edges are the only edges crossing partition boundaries
2. The total cut weight matches `cut_value`
3. No better cut exists within the local search neighborhood (optional)
### Bounded-Time Min-Cut Updates

Full min-cut computation is expensive (O(V * E) for max-flow). RVF uses
**incremental min-cut maintenance**. For each edge delta:

```
1. If ADD(u, v) where u and v are in the same partition:
   -> No cut change. O(1).

2. If ADD(u, v) where u in P_i and v in P_j:
   -> cut_weight[P_i][P_j] += weight. O(1).
   -> Check if moving u to P_j or v to P_i reduces the total cut.
   -> If yes: execute the move, update partition summaries. O(degree).

3. If REMOVE(u, v) across partitions:
   -> cut_weight[P_i][P_j] -= weight. O(1).
   -> No rebalance needed (cut improved).

4. If REMOVE(u, v) within the same partition:
   -> Check connectivity. If the partition splits: create a new partition. O(component).
```

This bounds update time to O(max_degree) per edge delta in the common case,
with O(component_size) in the rare partition-split case.
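The O(1) bookkeeping for cases 1-3 can be sketched as follows. `cut` maps a partition pair to its total crossing weight and `part` maps node to partition id; both are assumed in-memory shapes, and the move-check and split-check of the spec are omitted for brevity:

```python
def apply_edge_delta(cut, part, op, u, v, w):
    """Maintain pairwise cut weights under ADD/REMOVE edge deltas."""
    pi, pj = part[u], part[v]
    if pi == pj:
        return  # intra-partition edges never change the cut; O(1)
    key = (min(pi, pj), max(pi, pj))      # canonical partition pair
    if op == "ADD":
        cut[key] = cut.get(key, 0.0) + w
    elif op == "REMOVE":
        cut[key] = cut.get(key, 0.0) - w
        if cut[key] <= 0:
            del cut[key]                  # no remaining crossing weight
```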
### Semi-Streaming Min-Cut

For large-scale rebalancing (e.g., after a bulk insert), RVF uses a semi-streaming
algorithm inspired by Assadi et al.:

```
Phase 1: Single pass over edges to build a sparse skeleton
  - Sample each edge with probability O(1/epsilon)
  - Space: O(n * polylog(n))

Phase 2: Compute min-cut on the skeleton
  - Standard max-flow on the sparse graph
  - Time: O(n^2 * polylog(n))

Phase 3: Verify against the full edge set
  - Stream edges again, check cut validity
  - If invalid: refine the skeleton and repeat
```

This runs in O(n * polylog(n)) space regardless of edge count, making it
suitable for streaming over massive graphs.
## 5. Overlay Size Management

### Size Threshold

Each OVERLAY_SEG has a maximum payload size (configurable, default 1 MB).
When the accumulated deltas for the current epoch approach this threshold,
a new epoch is forced.

### Memory Budget

The total memory for overlay state is bounded:

```
max_overlay_memory = max_chain_depth * max_seg_size + snapshot_size
                   = 8 * 1 MB + snapshot_size
```

For 10M vectors with 32 partitions:

- Snapshot: ~32 * (8 + 16 + 768) bytes per partition ≈ 25 KB
- Delta chain: ≤ 8 MB
- Total: ≤ 9 MB

This is a fixed overhead regardless of dataset size (partition count scales
sublinearly).
### Garbage Collection

Overlay segments behind the last full snapshot are candidates for garbage
collection. The manifest tracks which overlay segments are still reachable
from the current epoch chain.

```
Reachable:   current_epoch -> parent -> ... -> last_snapshot
Unreachable: Everything before last_snapshot (safely deletable)
```

GC runs during compaction. Old OVERLAY_SEGs are tombstoned in the manifest
and their space is reclaimed on file rewrite.
## 6. Distributed Overlay Coordination

When RVF files are sharded across multiple nodes, the overlay system coordinates
partition state:

### Shard-Local Overlays

Each shard maintains its own OVERLAY_SEG chain for its local partitions.
The global partition state is the union of all shard-local overlays.

### Cross-Shard Rebalancing

When a partition becomes unbalanced across shards:

```
1. Coordinator computes the target partition assignment
2. Each shard writes a JOURNAL_SEG with vector move instructions
3. Vectors are copied (not moved — append-only) to target shards
4. Each shard writes a new OVERLAY_SEG reflecting the new partition
5. Coordinator writes a global MANIFEST_SEG with the new shard map
```

This is eventually consistent — during rebalancing, queries may search both
old and new locations and deduplicate results.

### Consistency Model

**Within a shard**: Linearizable (single-writer, manifest chain)
**Across shards**: Eventually consistent with bounded staleness

The epoch counter provides a total order for convergence checking:

- If all shards report epoch >= E, the global state at epoch E is complete
- Stale shards are detectable by comparing epoch counters
## 7. Epoch-Aware Query Routing

Queries use the overlay state for partition routing:

```python
def route_query(query, overlay):
    # Find nearest partition centroids
    dists = [distance(query, p.centroid) for p in overlay.partitions]
    target_partitions = top_n(dists, n_probe)

    # Check epoch freshness
    if overlay.epoch < current_epoch - stale_threshold:
        # Overlay is stale — broaden the search
        target_partitions = top_n(dists, n_probe * 2)

    return target_partitions
```

### Epoch Rollback

If an overlay epoch is found to be corrupt or suboptimal:

```
1. Read rollback_pointer from the current OVERLAY_SEG
2. The pointer gives the offset of the previous epoch's OVERLAY_SEG
3. Write a new MANIFEST_SEG pointing to the previous epoch as current
4. Future writes continue from the rolled-back state
```

This provides O(1) rollback to the previous epoch, and chain-walk rollback to
any ancestor epoch.
## 8. Integration with Progressive Indexing

The overlay system and the index system are coupled:

- **Partition centroids** in the overlay guide Layer A routing
- **Partition boundaries** determine which INDEX_SEGs cover which regions
- **Partition rebalancing** may invalidate Layer B adjacency for moved vectors
  (these are rebuilt lazily)
- **Layer C** is partition-aligned — each INDEX_SEG covers vectors within
  a single partition for locality

This means overlay compaction can trigger a partial index rebuild, but only for
the affected partitions — not the entire index.
386
vendor/ruvector/docs/research/rvf/spec/06-query-optimization.md
vendored
Normal file
386
vendor/ruvector/docs/research/rvf/spec/06-query-optimization.md
vendored
Normal file
@@ -0,0 +1,386 @@
# RVF Ultra-Fast Query Path

## 1. CPU Shape Optimization

The block layout determines performance at the hardware level. RVF is designed
to match the shape of modern CPUs: wide SIMD, deep caches, hardware prefetch.

### Four Optimizations

1. **Strict 64-byte alignment** for all numeric arrays
2. **Columnar + interleaved hybrid** for compression and speed
3. **Prefetch hints** for cache-friendly graph traversal
4. **Dictionary-coded IDs** for fast random access

## 2. Strict Alignment

Every numeric array in RVF starts at a 64-byte aligned offset. This matches:

| Target | Register Width | Alignment |
|--------|---------------|-----------|
| AVX-512 | 512 bits = 64 bytes | 64 B |
| AVX2 | 256 bits = 32 bytes | 64 B (superset) |
| ARM NEON | 128 bits = 16 bytes | 64 B (superset) |
| WASM v128 | 128 bits = 16 bytes | 64 B (superset) |
| Cache line | Typically 64 bytes | 64 B (exact) |

By aligning to 64 bytes, RVF ensures:

- Zero-copy load into any SIMD register (no unaligned penalty)
- No cache-line splits (each access touches exactly one cache line)
- Optimal hardware prefetch behavior (prefetcher operates on cache lines)

### Alignment in Practice

```
Segment header:        64 B (naturally aligned, first item in segment)
Block header:          Padded to 64 B boundary
Vector data start:     64 B aligned from block start
Each dimension column: 64 B aligned (columnar VEC_SEG)
Each vector entry:     64 B aligned (interleaved HOT_SEG)
ID map:                64 B aligned
Restart point index:   64 B aligned
```

Padding bytes between sections are zero-filled and excluded from checksums.
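The alignment rule reduces to a standard power-of-two round-up. A small sketch (helper names are illustrative, not part of the spec):

```python
def align_up(offset, alignment=64):
    """Round `offset` up to the next multiple of `alignment`
    (alignment must be a power of two)."""
    return (offset + alignment - 1) & ~(alignment - 1)

def pad_bytes(offset, alignment=64):
    """Zero-filled padding inserted before the next aligned section."""
    return align_up(offset, alignment) - offset
```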
## 3. Columnar + Interleaved Hybrid

### Columnar Storage (VEC_SEG) — Optimized for Compression

```
Block layout (1024 vectors, 384 dimensions, fp16):

Offset 0x000:   dim_0[vec_0], dim_0[vec_1], ..., dim_0[vec_1023]   (2048 B)
Offset 0x800:   dim_1[vec_0], dim_1[vec_1], ..., dim_1[vec_1023]   (2048 B)
...
Offset 0xBF800: dim_383[vec_0], ..., dim_383[vec_1023]             (2048 B)

Total: 384 * 2048 = 786,432 bytes (768 KB per block)
```

**Why columnar for cold/warm storage**:

- Adjacent values in the same dimension are correlated -> higher compression ratio
- LZ4 on columnar fp16 achieves 1.5-2.5x compression (vs 1.1-1.3x on interleaved)
- ZSTD on columnar fp16 achieves 2.5-4x compression
- Batch operations (computing mean, variance) scan one dimension at a time
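The column offsets in the block layout above follow directly from `dim * n_vectors * elem_size`. A quick sketch that reproduces them (names are illustrative):

```python
def column_offset(dim, n_vectors=1024, elem_size=2):
    """Byte offset of dimension column `dim` in a columnar VEC_SEG block
    (fp16 => elem_size=2), relative to the vector-data start."""
    return dim * n_vectors * elem_size

def block_size(n_dims=384, n_vectors=1024, elem_size=2):
    """Total payload size of one columnar block."""
    return n_dims * n_vectors * elem_size
```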
### Interleaved Storage (HOT_SEG) — Optimized for Speed

```
Entry layout (one hot vector, 384 dim fp16):

Offset 0x000: vector_id (8 B)
Offset 0x008: dim_0, dim_1, dim_2, ..., dim_383 (768 B)
Offset 0x308: neighbor_count (2 B)
Offset 0x30A: neighbor_0, neighbor_1, ... (8 B each)
Offset 0x38A: padding to 64 B boundary
--> 960 bytes per entry (at M=16 neighbors)
```

**Why interleaved for hot data**:

- One vector = one sequential read (no column gathering)
- Distance computation: load vector, compute, move to next (streaming pattern)
- Neighbors co-located: after finding a good candidate, immediately traverse
- 960 bytes per entry = 15 cache lines = predictable memory access
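The 960-byte figure can be reproduced with a small sketch (parameter names are illustrative, not part of the wire format):

```python
def hot_entry_size(dims=384, neighbors=16, elem_size=2,
                   id_size=8, count_size=2, neighbor_id_size=8):
    """Size of one interleaved HOT_SEG entry, padded to a 64 B boundary:
    vector_id + dims*elem + neighbor_count + neighbors*id."""
    raw = id_size + dims * elem_size + count_size + neighbors * neighbor_id_size
    return (raw + 63) & ~63  # round up to 64 B
```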
### When to Use Each

| Operation | Layout | Reason |
|-----------|--------|--------|
| Bulk distance computation | Columnar | SIMD operates on dimension columns |
| Top-K refinement scan | Interleaved | Sequential scan of candidates |
| Compression/archival | Columnar | Better ratio |
| HNSW search (hot region) | Interleaved | Vector + neighbors together |
| Batch insert | Columnar | Write once, compress well |

## 4. Prefetch Hints

### The Problem

HNSW search is pointer-chasing: compute distance at node A, read neighbor
list, jump to node B, compute distance, repeat. Each jump is a random
memory access. On a 10M vector file, this means:

```
HNSW search:         ~100-200 distance computations per query
Each computation:    1 random read (vector) + 1 random read (neighbors)
Random read latency: 50-100 ns (DRAM), 10-50 μs (SSD)
Total:               10-40 μs (DRAM), 1-10 ms (SSD) without prefetch
```

### The Solution

Store neighbor lists **contiguously** and add **prefetch offsets** in the
manifest so the runtime can issue prefetch instructions ahead of time.

### Prefetch Table Structure

The manifest contains a prefetch table mapping node ID ranges to contiguous
page regions:

```
prefetch_table:
  entry_count: u32
  entries:
    [0]: node_ids 0-9999      -> pages at offset 0x100000, 50 pages, prefetch 3 ahead
    [1]: node_ids 10000-19999 -> pages at offset 0x200000, 50 pages, prefetch 3 ahead
    ...
```

### Runtime Prefetch Strategy

```python
def hnsw_search_with_prefetch(query, entry_point, ef_search, k):
    candidates = MaxHeap()
    visited = BitSet()
    worklist = MinHeap([(distance(query, entry_point), entry_point)])

    while worklist:
        dist, node = worklist.pop()

        # PREFETCH: while processing this node, prefetch neighbors' data
        neighbors = get_neighbors(node)
        for n in neighbors[:PREFETCH_AHEAD]:
            if n not in visited:
                prefetch_vector(n)     # madvise(WILLNEED) or __builtin_prefetch
                prefetch_neighbors(n)  # prefetch neighbor list page

        # COMPUTE: distance to neighbors (data should be in cache by now)
        for n in neighbors:
            if n not in visited:
                visited.add(n)
                d = distance(query, get_vector(n))
                if len(candidates) < ef_search or d < candidates.max():
                    candidates.push((d, n))
                    worklist.push((d, n))

    return candidates.top_k(k)
```
### Contiguous Neighbor Layout

HOT_SEG stores vectors and neighbors together. For cold INDEX_SEGs, neighbor
lists are laid out in **node ID order** within contiguous pages:

```
Page 0: neighbors[node_0], neighbors[node_1], ..., neighbors[node_63]
Page 1: neighbors[node_64], ..., neighbors[node_127]
...
```

Because HNSW search tends to traverse nodes in the same graph neighborhood
(spatially close node IDs if data was inserted in order), sequential node
IDs tend to be accessed together. Contiguous layout turns random access
into sequential reads.

### Expected Improvement

| Configuration | p95 Latency (10M vectors) |
|--------------|--------------------------|
| No prefetch, random layout | 2.5 ms |
| No prefetch, contiguous layout | 1.2 ms |
| Prefetch, contiguous layout | 0.3 ms |
| Prefetch, contiguous + hot cache | 0.15 ms |

## 5. Dictionary-Coded IDs

### The Problem

Vector IDs in neighbor lists and ID maps are 64-bit integers. For 10M vectors,
most IDs fit in 24 bits. Storing full 64-bit IDs wastes ~5 bytes per entry.

With M=16 neighbors per node and 10M nodes:

- Raw: 10M * 16 * 8 = 1.2 GB of ID data
- Desired: < 300 MB

### Varint Delta Encoding

IDs within a block or neighbor list are sorted and delta-encoded:

```
Original IDs: [1000, 1005, 1008, 1020, 1100]
Deltas:       [1000,    5,    3,   12,   80]
Varint bytes: [  2B,   1B,   1B,   1B,   1B] = 6 bytes (vs 40 bytes raw)
```
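The scheme can be sketched in a few lines. An LEB128-style varint (7 payload bits per byte, high bit as continuation flag) is an assumption here; the spec text does not pin down the varint flavor:

```python
def encode_varint(value):
    """LEB128-style varint: 7 payload bits per byte, MSB = continuation."""
    out = bytearray()
    while True:
        b = value & 0x7F
        value >>= 7
        out.append(b | 0x80 if value else b)
        if not value:
            return bytes(out)

def encode_deltas(sorted_ids):
    """Delta-encode a sorted ID list and varint-pack the deltas."""
    out, prev = bytearray(), 0
    for vid in sorted_ids:
        out += encode_varint(vid - prev)
        prev = vid
    return bytes(out)
```

For the example above, the five deltas pack into exactly six bytes.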
### Restart Points

Every N entries (default N=64), the delta resets to an absolute value:

```
Group 0 (entries 0-63):    delta from 0 (absolute start)
Group 1 (entries 64-127):  delta from entry[64] (restart)
Group 2 (entries 128-191): delta from entry[128] (restart)
```

The restart point index stores the offset of each restart group:

```
restart_index:
  interval: 64
  offsets: [0, 156, 298, 445, ...]  // byte offsets into encoded data
```

### Random Access

To find the neighbors of node N:

```
1. group = N / restart_interval           // O(1)
2. offset = restart_index[group]          // O(1)
3. seek to offset in encoded data         // O(1)
4. decode sequentially from restart to N  // O(restart_interval) = O(64)
```

Total: O(64) varint decodes = ~50-100 ns. Compare with sorted array binary
search: O(log N) = O(24) comparisons with cache misses = ~200-500 ns.
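Putting restarts and deltas together, the four-step random access above can be sketched end to end. This is a self-contained illustration (an LEB128-style varint is assumed; names are not part of the spec):

```python
def encode_varint(value):
    """7 payload bits per byte, MSB = continuation."""
    out = bytearray()
    while True:
        b = value & 0x7F
        value >>= 7
        out.append(b | 0x80 if value else b)
        if not value:
            return bytes(out)

def build_restart_stream(sorted_ids, interval=64):
    """Varint-delta-encode sorted IDs with an absolute restart every
    `interval` entries; returns (encoded bytes, restart byte offsets)."""
    data, offsets, prev = bytearray(), [], 0
    for i, vid in enumerate(sorted_ids):
        if i % interval == 0:
            offsets.append(len(data))
            prev = 0  # restart: first delta of the group is absolute
        data += encode_varint(vid - prev)
        prev = vid
    return bytes(data), offsets

def decode_entry(data, offsets, n, interval=64):
    """Random access: O(1) group lookup + at most `interval` varint decodes."""
    pos, value = offsets[n // interval], 0
    for _ in range(n % interval + 1):
        delta = shift = 0
        while True:
            b = data[pos]
            pos += 1
            delta |= (b & 0x7F) << shift
            shift += 7
            if not b & 0x80:
                break
        value += delta
    return value
```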
### SIMD Varint Decoding

Modern SIMD can decode varints in bulk:

```
AVX-512 VBMI: ~8 varints per cycle using VPERMB + VPSHUFB
Throughput:   2-4 billion integers/second (Lemire et al.)
```

At 16 neighbors per node, one HNSW search step decodes 16 varints in ~2-4 ns.

### Compression Ratio

| Encoding | Bytes per ID (avg) | 10M * 16 neighbors |
|----------|-------------------|-------------------|
| Raw u64 | 8.0 B | 1,220 MB |
| Raw u32 | 4.0 B | 610 MB |
| Varint (no delta) | 3.2 B | 488 MB |
| Varint delta | 1.5 B | 229 MB |
| Varint delta + restart | 1.6 B | 244 MB |

Delta encoding with restart points achieves ~5x compression over raw u64
while maintaining fast random access.

## 6. Cache Behavior Analysis

### L1/L2/L3 Working Sets

For a typical query on 10M vectors (384 dim, fp16):

```
HNSW search:
  ~150 distance computations
  Each computation: 768 B (vector) + ~128 B (neighbor list) ≈ 896 B
  Total working set: 150 * 896 ≈ 131 KB

Top-K refinement (hot cache scan):
  ~1000 candidates checked
  Each: 960 B (interleaved HOT_SEG entry)
  Total: 960 KB

Query vector:        768 B (always in L1)
Quantization tables: 96 KB (PQ codebook, always in L2)
```

| Cache Level | Size | What Fits |
|------------|------|-----------|
| L1 (32-48 KB) | Query vector + current node | Always hit |
| L2 (256 KB-1 MB) | PQ tables + 100-200 hot entries | Usually hit |
| L3 (8-32 MB) | Hot cache + partial index | Mostly hit |
| DRAM | Everything | Full dataset |

### p95 Latency Budget

```
HNSW traversal:   150 nodes * 100 ns/node = 15 μs   (L3 hit)
Distance compute: 150 * 50 ns             = 7.5 μs  (SIMD)
Top-K refinement: 1000 * 10 ns            = 10 μs   (hot cache, L2/L3 hit)
Overhead:         5 μs                              (heap ops, bookkeeping)
                                          -------
Total p95:        ~37.5 μs ≈ 0.04 ms

With prefetch: ~30 μs (hide 25% of traversal latency)
```

This comfortably meets the target of < 0.3 ms p95 on desktop hardware. The
dominant cost is memory bandwidth, not computation — which is why
cache-friendly layout and prefetch are critical.
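The budget arithmetic above, as a quick sketch (parameter names are illustrative):

```python
def p95_budget_us(traversal_nodes=150, ns_per_node=100,
                  distance_ops=150, ns_per_distance=50,
                  refine_candidates=1000, ns_per_refine=10,
                  overhead_us=5.0):
    """Sum of the latency-budget line items, in microseconds."""
    total_ns = (traversal_nodes * ns_per_node
                + distance_ops * ns_per_distance
                + refine_candidates * ns_per_refine)
    return total_ns / 1000 + overhead_us
```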
## 7. Distance Function SIMD Implementations

### L2 Distance (fp16, 384 dim, AVX-512)

```
; 384 fp16 values = 768 bytes = 12 ZMM registers
; Process 32 fp16 values per iteration (convert to 16 fp32 per half)

.loop:
    vmovdqu16     zmm0, [rsi + rcx]   ; Load 32 fp16 from A
    vmovdqu16     zmm1, [rdi + rcx]   ; Load 32 fp16 from B
    vcvtph2ps     zmm2, ymm0          ; Convert low 16 to fp32
    vcvtph2ps     zmm3, ymm1
    vsubps        zmm2, zmm2, zmm3    ; diff = A - B
    vfmadd231ps   zmm4, zmm2, zmm2    ; acc += diff * diff
    ; Repeat for high 16
    vextracti64x4 ymm0, zmm0, 1
    vextracti64x4 ymm1, zmm1, 1
    vcvtph2ps     zmm2, ymm0
    vcvtph2ps     zmm3, ymm1
    vsubps        zmm2, zmm2, zmm3
    vfmadd231ps   zmm4, zmm2, zmm2
    add           rcx, 64
    cmp           rcx, 768
    jl            .loop

; Horizontal sum of zmm4 -> scalar result
; ~12 iterations, ~24 FMA ops, ~12 cycles total
```

### Inner Product (int8, 384 dim, AVX-512 VNNI)

```
; 384 int8 values = 384 bytes = 6 ZMM registers
; VPDPBUSD: 64 uint8*int8 multiply-adds per cycle

.loop:
    vmovdqu8  zmm0, [rsi + rcx]   ; 64 uint8 from A
    vmovdqu8  zmm1, [rdi + rcx]   ; 64 int8 from B
    vpdpbusd  zmm2, zmm0, zmm1    ; acc += dot(A, B) per 4 bytes
    add       rcx, 64
    cmp       rcx, 384
    jl        .loop

; 6 iterations, 6 VPDPBUSD ops, ~6 cycles
; ~16x faster than fp16 L2
```

### Hamming Distance (binary, 384 dim, AVX-512)

```
; 384 bits = 48 bytes = 1 masked ZMM load (k1 = 0xFFFFFFFFFFFF, 48 byte-lanes)
; VPOPCNTDQ: popcount on 8 x 64-bit words per cycle

vmovdqu8  zmm0{k1}{z}, [rsi]   ; Load 48 bytes (384 bits) from A, zero the rest
vmovdqu8  zmm1{k1}{z}, [rdi]   ; Load 48 bytes from B
vpxorq    zmm2, zmm0, zmm1     ; XOR -> differing bits
vpopcntq  zmm3, zmm2           ; Popcount per 64-bit word
; Horizontal sum of 6 popcounts -> Hamming distance
; ~3 cycles total
```
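A scalar reference implementation is useful as a correctness oracle when validating SIMD kernels like the one above; a minimal sketch in Python:

```python
def hamming_distance(a, b):
    """Scalar reference for the SIMD kernel: XOR each byte pair, then
    count the differing bits."""
    assert len(a) == len(b)
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))
```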
## 8. Summary: Query Path Hot Loop

The complete hot path for one HNSW search step:

```
1. Load current node's neighbor list    [L2/L3 cache, 128 B, ~5 ns]
2. Issue prefetch for next neighbors    [~1 ns]
3. For each neighbor (M=16):
   a. Check visited bitmap              [L1, ~1 ns]
   b. Load neighbor vector (hot cache)  [L2/L3, 768 B, ~5-10 ns]
   c. SIMD distance (fp16, 384 dim)     [~12 cycles = ~4 ns]
   d. Heap insert if better             [~5 ns]
4. Total per step: ~300-500 ns
5. Total per query (~150 steps): ~50-75 μs
```

This achieves 13,000-20,000 QPS per thread on desktop hardware — matching
or exceeding dedicated vector databases for in-memory workloads.

580
vendor/ruvector/docs/research/rvf/spec/07-deletion-lifecycle.md
vendored
Normal file
@@ -0,0 +1,580 @@
# RVF Deletion Lifecycle

## 1. Overview

Deletion in RVF follows a two-phase protocol consistent with the append-only
segment architecture. Vectors are never removed in-place. Instead, a soft
delete records intent in a JOURNAL_SEG, and a subsequent compaction hard
deletes by physically excluding the vectors from sealed output segments.

```
        JOURNAL_SEG        Compaction           GC / Rewrite
        (append)           (merge)              (reclaim)
ACTIVE -----> SOFT_DELETED -----> HARD_DELETED ------> RECLAIMED
  |               |                    |                   |
  | query path    | query path         |                   |
  | returns vec   | skips vec          | vec absent        | space freed
  |               | (bitmap check)     | from output seg   |
```

Readers always see a consistent snapshot: a deletion is invisible until
the manifest referencing the new deletion bitmap is durably committed.

## 2. Vector Lifecycle State Machine

```
+----------+   JOURNAL_SEG           +-----------------+
|          |   DELETE_VECTOR / RANGE |                 |
|  ACTIVE  +------------------------>+  SOFT_DELETED   |
|          |                         |                 |
+----------+                         +--------+--------+
                                              | Compaction seals output
                                              v excluding this vector
                                     +--------+--------+
                                     |  HARD_DELETED   |
                                     +--------+--------+
                                              | File rewrite / truncation
                                              v reclaims physical space
                                     +--------+--------+
                                     |   RECLAIMED     |
                                     +-----------------+
```

| State | Bitmap Bit | Physical Bytes | Query Visible |
|-------|------------|----------------|---------------|
| ACTIVE | 0 | Vector in VEC_SEG | Yes |
| SOFT_DELETED | 1 | Vector in VEC_SEG | No |
| HARD_DELETED | N/A | Excluded from sealed output | No |
| RECLAIMED | N/A | Bytes overwritten / freed | No |

| Transition | Trigger | Durability |
|------------|---------|------------|
| ACTIVE -> SOFT_DELETED | JOURNAL_SEG + MANIFEST_SEG with bitmap | After manifest fsync |
| SOFT_DELETED -> HARD_DELETED | Compaction writes sealed VEC_SEG without vector | After compaction manifest fsync |
| HARD_DELETED -> RECLAIMED | File rewrite or old shard deletion | After shard unlink |
## 3. JOURNAL_SEG Wire Format (type 0x04)

A JOURNAL_SEG records metadata mutations: deletions, metadata updates, tier
moves, and ID remappings. Its payload follows the standard 64-byte segment
header (see `01-segment-model.md` section 2).

### 3.1 Journal Header (64 bytes)

```
Offset  Type    Field                Description
------  ----    -----                -----------
0x00    u32     entry_count          Number of journal entries
0x04    u32     journal_epoch        Epoch when this journal was written
0x08    u64     prev_journal_seg_id  Segment ID of previous JOURNAL_SEG (0 if first)
0x10    u32     flags                Reserved, must be 0
0x14    u8[44]  reserved             Zero-padded to 64-byte alignment
```

### 3.2 Journal Entry Format

Each entry begins on an 8-byte aligned boundary:

```
Offset  Type  Field         Description
------  ----  -----         -----------
0x00    u8    entry_type    Entry type enum
0x01    u8    reserved      Must be 0x00
0x02    u16   entry_length  Byte length of type-specific payload
0x04    u8[]  payload       Type-specific payload
var     u8[]  padding       Zero-pad to next 8-byte boundary
```

### 3.3 Entry Types

```
Value  Name             Payload Size  Description
-----  ----             ------------  -----------
0x01   DELETE_VECTOR    8 B           Delete a single vector by ID
0x02   DELETE_RANGE     16 B          Delete a contiguous range of vector IDs
0x03   UPDATE_METADATA  variable      Update key-value metadata for a vector
0x04   MOVE_VECTOR      24 B          Reassign vector to a different segment/tier
0x05   REMAP_ID         16 B          Reassign vector ID (post-compaction)
```

### 3.4 Type-Specific Payloads

**DELETE_VECTOR (0x01)**
```
0x00  u64  vector_id  ID of the vector to soft-delete
```

**DELETE_RANGE (0x02)**
```
0x00  u64  start_id  First vector ID (inclusive)
0x08  u64  end_id    Last vector ID (exclusive)
```
Invariant: `start_id < end_id`. Range `[start_id, end_id)` is half-open.

**UPDATE_METADATA (0x03)**
```
0x00   u64   vector_id  Target vector ID
0x08   u16   key_len    Byte length of metadata key
0x0A   u8[]  key        Metadata key (UTF-8)
var    u16   val_len    Byte length of metadata value
var+2  u8[]  val        Metadata value (opaque bytes)
```

**MOVE_VECTOR (0x04)**
```
0x00  u64  vector_id  Target vector ID
0x08  u64  src_seg    Source segment ID
0x10  u64  dst_seg    Destination segment ID
```

**REMAP_ID (0x05)**
```
0x00  u64  old_id  Original vector ID
0x08  u64  new_id  New vector ID after compaction
```

### 3.5 Complete JOURNAL_SEG Example

Deleting vector 42, deleting range [1000, 2000), remapping ID 500 -> 3:

```
Byte offset  Content                 Notes
-----------  -------                 -----
0x00-0x3F    Segment header (64 B)   seg_type=0x04, magic=RVFS
0x40-0x7F    Journal header (64 B)   entry_count=3, epoch=7,
                                     prev_journal_seg_id=12
--- Entry 0: DELETE_VECTOR ---
0x80         0x01                    entry_type
0x81         0x00                    reserved
0x82-0x83    0x0008                  entry_length = 8
0x84-0x8B    0x000000000000002A      vector_id = 42
0x8C-0x8F    0x00000000              padding to 8B

--- Entry 1: DELETE_RANGE ---
0x90         0x02                    entry_type
0x91         0x00                    reserved
0x92-0x93    0x0010                  entry_length = 16
0x94-0x9B    0x00000000000003E8      start_id = 1000
0x9C-0xA3    0x00000000000007D0      end_id = 2000
0xA4-0xA7    0x00000000              padding to 8B

--- Entry 2: REMAP_ID ---
0xA8         0x05                    entry_type
0xA9         0x00                    reserved
0xAA-0xAB    0x0010                  entry_length = 16
0xAC-0xB3    0x00000000000001F4      old_id = 500
0xB4-0xBB    0x0000000000000003      new_id = 3
```
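The entry bytes can be produced with a short packing sketch. Little-endian byte order is an assumption (the spec text above does not fix endianness); the constants come from the entry-type table, and the offsets match the worked example relative to the first entry at 0x80:

```python
import struct

DELETE_VECTOR, DELETE_RANGE, REMAP_ID = 0x01, 0x02, 0x05

def pack_entry(entry_type, payload):
    """One journal entry: 4 B header (type, reserved, length) + payload,
    zero-padded to the next 8-byte boundary."""
    raw = struct.pack("<BBH", entry_type, 0x00, len(payload)) + payload
    return raw + b"\x00" * (-len(raw) % 8)

# The three entries from the example above.
body = (
    pack_entry(DELETE_VECTOR, struct.pack("<Q", 42))
    + pack_entry(DELETE_RANGE, struct.pack("<QQ", 1000, 2000))
    + pack_entry(REMAP_ID, struct.pack("<QQ", 500, 3))
)
```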
## 4. Deletion Bitmap

### 4.1 Manifest Record

The deletion bitmap is stored in the Level 1 manifest as a TLV record:

```
Tag     Name             Description
---     ----             -----------
0x000E  DELETION_BITMAP  Roaring bitmap of soft-deleted vector IDs
```

This extends the TLV tag space (previous: 0x000D KEY_DIRECTORY).

### 4.2 Roaring Bitmap Binary Layout

Vector IDs are 64-bit. The upper 48 bits select a **high key**; the lower
16 bits index into a 65,536-slot **container** for that high key.

```
+----------------------------------------------+
| DELETION_BITMAP TLV Value                    |
+----------------------------------------------+
| Bitmap Header                                |
|   cookie: u32 (0x3B3A3332)                   |
|   high_key_count: u32                        |
|   For each high key:                         |
|     high_key: u32                            |
|     container_type: u8                       |
|       0x01 = ARRAY_CONTAINER                 |
|       0x02 = BITMAP_CONTAINER                |
|       0x03 = RUN_CONTAINER                   |
|     container_offset: u32 (from bitmap start)|
|   [8B aligned]                               |
+----------------------------------------------+
| Container Data                               |
|   Container 0: [type-specific layout]        |
|   Container 1: ...                           |
|   [8B aligned per container]                 |
+----------------------------------------------+
```

### 4.3 Container Types

**ARRAY_CONTAINER (0x01)** -- Sparse deletions (< 4096 set bits per 64K range).
```
0x00  u16    cardinality  Number of set values (1-4096)
0x02  u16[]  values       Sorted array of 16-bit values
```
Size: `2 + 2 * cardinality` bytes.

**BITMAP_CONTAINER (0x02)** -- Dense deletions (>= 4096 set bits per 64K range).
```
0x00  u16       cardinality  Number of set bits
0x02  u8[8192]  bitmap       Fixed 65536-bit bitmap (8 KB)
```
Size: 8194 bytes (fixed).

**RUN_CONTAINER (0x03)** -- Contiguous ranges of deletions.
```
0x00  u16        run_count  Number of runs
0x02  (u16,u16)  runs[]     Array of (start, length-1) pairs
```
Size: `2 + 4 * run_count` bytes.

### 4.4 Size Estimation

| Deletion Pattern | Deleted IDs | Container Types | Bitmap Size |
|------------------|-------------|-----------------|-------------|
| Sparse random | 10,000 (0.1%) | ~153 array | ~22 KB |
| Clustered ranges | 10,000 (0.1%) | ~5 run | ~0.1 KB |
| Mixed workload | 100,000 (1%) | array + run | ~80 KB |
| Heavy deletion | 1,000,000 (10%) | bitmap + run | ~200 KB |

Even at 200 KB the bitmap fits entirely in L2 cache.
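The per-container size formulas can be folded into a small chooser. A sketch (the thresholds follow the container descriptions above; the function name is illustrative):

```python
def container_size(cardinality, run_count=None):
    """Serialized size in bytes of the smallest valid container for one
    64 K chunk: array below 4096 set bits, run if a run count is known,
    otherwise the fixed-size bitmap."""
    sizes = [8194]  # BITMAP_CONTAINER: 2 B cardinality + 8192 B bitmap
    if cardinality < 4096:
        sizes.append(2 + 2 * cardinality)  # ARRAY_CONTAINER
    if run_count is not None:
        sizes.append(2 + 4 * run_count)    # RUN_CONTAINER
    return min(sizes)
```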
### 4.5 Bitmap Operations

```python
def bitmap_check(bitmap, vector_id):
    """Returns True if vector_id is soft-deleted. O(1) amortized."""
    high_key = vector_id >> 16
    low_val = vector_id & 0xFFFF
    container = bitmap.get_container(high_key)
    if container is None:
        return False
    return container.contains(low_val)  # array: bsearch, bitmap: bit test, run: bsearch

def bitmap_set(bitmap, vector_id):
    """Mark a vector as soft-deleted."""
    high_key = vector_id >> 16
    low_val = vector_id & 0xFFFF
    container = bitmap.get_or_create_container(high_key)
    container.add(low_val)
    if container.type == ARRAY and container.cardinality > 4096:
        container.promote_to_bitmap()
```

## 5. Delete-Aware Query Path

### 5.1 HNSW Traversal with Deletion Filtering

Deleted vectors remain in the HNSW graph until compaction rebuilds the index.
During search, the deletion bitmap is checked per candidate. Deleted nodes are
still traversed for connectivity but excluded from the result set.

```python
def hnsw_search_delete_aware(query, entry_point, ef_search, k, del_bitmap):
    candidates = MaxHeap()  # worst candidate on top
    visited = BitSet()
    worklist = MinHeap()    # best candidate first

    d0 = distance(query, get_vector(entry_point))
    worklist.push((d0, entry_point))
    visited.add(entry_point)
    if not bitmap_check(del_bitmap, entry_point):
        candidates.push((d0, entry_point))

    while worklist:
        dist, node = worklist.pop()
        if candidates.size() >= ef_search and dist > candidates.peek_max():
            break

        neighbors = get_neighbors(node)
        for n in neighbors[:PREFETCH_AHEAD]:
            if n not in visited:
                prefetch_vector(n)

        for n in neighbors:
            if n in visited:
                continue
            visited.add(n)
            d = distance(query, get_vector(n))
            is_deleted = bitmap_check(del_bitmap, n)  # O(1) bitmap lookup

            # Always add to worklist (graph connectivity)
            if candidates.size() < ef_search or d < candidates.peek_max():
                worklist.push((d, n))
            # Only add to results if NOT deleted
            if not is_deleted:
                if candidates.size() < ef_search:
                    candidates.push((d, n))
                elif d < candidates.peek_max():
                    candidates.replace_max((d, n))

    return candidates.top_k(k)
```

### 5.2 Top-K Refinement with Deletion Filtering

```python
def topk_refine_delete_aware(candidates, hot_cache, query, k, del_bitmap):
    heap = MaxHeap()
    for cand_dist, cand_id in candidates:
        heap.push((cand_dist, cand_id))

    for entry in hot_cache.sequential_scan():
        if bitmap_check(del_bitmap, entry.vector_id):
            continue  # skip soft-deleted
        d = distance(query, entry.vector)
        if heap.size() < k:
            heap.push((d, entry.vector_id))
        elif d < heap.peek_max():
            heap.replace_max((d, entry.vector_id))

    return heap.drain_sorted()
```

### 5.3 Performance Impact

| Operation | Without Deletions | With Deletions | Overhead |
|-----------|-------------------|----------------|----------|
| Bitmap check | N/A | ~2-5 ns (L1/L2 hit) | Per candidate |
| HNSW step (M=16) | ~300-500 ns | ~330-580 ns | +10% |
| Top-K refine (1000) | ~10 μs | ~12 μs | +20% worst |
| Total query | ~50-75 μs | ~55-85 μs | +10-13% |

At typical deletion rates (< 5%), overhead is negligible: the bitmap fits in
L2 cache, graph connectivity is preserved, and the cost is one branch plus
one bitmap load per candidate.

## 6. Deletion Write Path

All deletion operations follow the same two-fsync protocol:

```python
def delete_vectors(file, entries):
    """Soft-delete vectors. entries: list of DeleteVector or DeleteRange."""
    # 1. Append JOURNAL_SEG
    journal = JournalSegment(
        epoch=current_epoch(file),
        prev_journal_seg_id=latest_journal_id(file),
        entries=entries
    )
    append_segment(file, journal)
    fsync(file)  # orphan-safe: no manifest references this yet

    # 2. Update deletion bitmap in memory
    bitmap = load_deletion_bitmap(file)
    for e in entries:
        if e.type == DELETE_VECTOR:
            bitmap_set(bitmap, e.vector_id)
        elif e.type == DELETE_RANGE:
            bitmap.add_range(e.start_id, e.end_id)

    # 3. Append MANIFEST_SEG with updated bitmap
    manifest = build_manifest(file, deletion_bitmap=bitmap)
    append_segment(file, manifest)
    fsync(file)  # deletion now visible to all new readers
```
Single deletes, bulk ranges, and batch deletes all use this path. Batch
|
||||
operations pack multiple entries into one JOURNAL_SEG to amortize fsync cost.
|
||||
|
||||
## 7. Compaction with Deletions
|
||||
|
||||
### 7.1 Compaction Process
|
||||
|
||||
```
|
||||
Before:
|
||||
[VEC_1] [VEC_2] [JOURNAL_1] [VEC_3] [JOURNAL_2] [MANIFEST_5]
|
||||
0-999 1000- del:42, 3000- del:[1000, bitmap={42,500,
|
||||
2999 del:500 4999 2000) 1000..1999}
|
||||
|
||||
After:
|
||||
... [MANIFEST_5] [VEC_sealed] [INDEX_new] [MANIFEST_6]
|
||||
vectors 0-4999 bitmap={}
|
||||
MINUS deleted (empty for
|
||||
compacted range)
|
||||
```
|
||||
|
||||
### 7.2 Compaction Algorithm
|
||||
|
||||
```python
|
||||
def compact_with_deletions(file, seg_ids):
|
||||
bitmap = load_deletion_bitmap(file)
|
||||
output, id_remap, next_id = [], {}, 0
|
||||
|
||||
for seg_id in sorted(seg_ids):
|
||||
seg = load_segment(file, seg_id)
|
||||
if seg.seg_type != VEC_SEG:
|
||||
continue
|
||||
for vec_id, vector in seg.all_vectors():
|
||||
if bitmap_check(bitmap, vec_id):
|
||||
continue # physically exclude
|
||||
id_remap[vec_id] = next_id
|
||||
output.append((next_id, vector))
|
||||
next_id += 1
|
||||
|
||||
append_segment(file, VecSegment(flags=SEALED, vectors=output))
|
||||
|
||||
remaps = [RemapIdEntry(old, new) for old, new in id_remap.items() if old != new]
|
||||
if remaps:
|
||||
append_segment(file, JournalSegment(entries=remaps))
|
||||
|
||||
append_segment(file, build_hnsw_index(output))
|
||||
|
||||
for old_id in id_remap:
|
||||
bitmap.remove(old_id)
|
||||
|
||||
manifest = build_manifest(file,
|
||||
tombstone_seg_ids=seg_ids,
|
||||
deletion_bitmap=bitmap)
|
||||
append_segment(file, manifest)
|
||||
fsync(file)
|
||||
```
|
||||
|
||||
### 7.3 Journal Merging
|
||||
|
||||
During compaction, JOURNAL_SEGs covering the compacted range are consumed:
|
||||
|
||||
| Entry Type | Materialization |
|
||||
|------------|-----------------|
|
||||
| DELETE_VECTOR / DELETE_RANGE | Vectors excluded from output |
|
||||
| UPDATE_METADATA | Applied to output META_SEG |
|
||||
| MOVE_VECTOR | Tier assignment applied in new manifest |
|
||||
| REMAP_ID | Chained: old remap composed with new remap |
|
||||
|
||||
Consumed JOURNAL_SEGs are tombstoned alongside compacted VEC_SEGs.
|
||||
|
||||
### 7.4 Compaction Invariants
|
||||
|
||||
| ID | Invariant |
|
||||
|----|-----------|
|
||||
| INV-D1 | After compaction, deletion bitmap is empty for compacted range |
|
||||
| INV-D2 | Sealed output contains only ACTIVE vectors |
|
||||
| INV-D3 | REMAP_ID entries journaled for every relocated vector |
|
||||
| INV-D4 | Compacted input segments tombstoned in new manifest |
|
||||
| INV-D5 | Sealed segments are never modified |
|
||||
| INV-D6 | Rebuilt indexes exclude deleted nodes |
|
||||
|
||||

## 8. Deletion Consistency

### 8.1 Crash Safety

```
Write path:
  1. Append JOURNAL_SEG  -> fsync    (crash here: orphan, invisible)
  2. Append MANIFEST_SEG -> fsync    (crash here: partial manifest, fallback)

Recovery:
  - Crash after step 1: JOURNAL_SEG orphaned. No manifest references it.
    Reader sees previous manifest. Deletion NOT visible. Orphan cleaned
    up by next compaction.
  - Crash during step 2: Partial MANIFEST_SEG has bad checksum. Reader
    falls back to previous valid manifest. Deletion NOT visible.
  - After step 2 success: Manifest durable. Deletion visible.
```

**Guarantee**: Uncommitted deletions never affect readers. Deletion is
atomic at the manifest fsync boundary.
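
The recovery rule can be sketched as a backward scan for the newest intact manifest. This is a toy model, not the wire format: segments are `(type, payload, checksum)` tuples, the type tags are illustrative, and CRC32 stands in for the spec's content hash:

```python
import zlib

MANIFEST, JOURNAL = 0x01, 0x04  # illustrative tags (JOURNAL_SEG is 0x04 per the summary)

def append_segment(file, seg_type, payload):
    # Each segment carries its own checksum, so it is independently verifiable.
    file.append((seg_type, payload, zlib.crc32(payload)))

def recover_manifest(file):
    """Scan backward from EOF for the newest MANIFEST whose checksum verifies."""
    for seg_type, payload, crc in reversed(file):
        if seg_type == MANIFEST and zlib.crc32(payload) == crc:
            return payload
    return None

f = []
append_segment(f, MANIFEST, b"bitmap={}")
append_segment(f, JOURNAL, b"delete 42")   # step 1 fsync'd, but no manifest references it
f.append((MANIFEST, b"bitmap={4", 0))      # step 2 torn mid-write: checksum cannot verify
```

Here `recover_manifest(f)` falls back to the pre-deletion manifest, so the uncommitted deletion stays invisible, matching the guarantee above.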

### 8.2 Manifest Chain Visibility

```
MANIFEST_3: bitmap = {}
     |  JOURNAL_SEG written (delete vector 42)
MANIFEST_4: bitmap = {42}    <-- deletion visible from here
     |  Compaction runs
MANIFEST_5: bitmap = {}      <-- vector 42 physically removed
```

A reader holding MANIFEST_3 continues to see vector 42. A reader opening
after MANIFEST_4 will not. This provides snapshot isolation at manifest
granularity.
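
Snapshot isolation at manifest granularity can be sketched with an append-only manifest chain; a reader pins whichever manifest it opened and never observes later deletions until it re-reads. The dict shape is illustrative, not the on-disk layout:

```python
# Manifest chain as an append-only list; each manifest carries a full
# snapshot of the deletion bitmap.
manifests = [{"bitmap": set()}]            # MANIFEST_3

def visible(manifest, vector_id):
    # A vector is visible unless the pinned manifest's bitmap marks it deleted.
    return vector_id not in manifest["bitmap"]

reader_a = manifests[-1]                   # opened at MANIFEST_3
manifests.append({"bitmap": {42}})         # MANIFEST_4: deletion committed
reader_b = manifests[-1]                   # opened at MANIFEST_4
```

`reader_a` keeps seeing vector 42; `reader_b` never does.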

### 8.3 Multi-File Mode

In multi-file mode, each shard maintains its own deletion bitmap. The
DELETION_BITMAP TLV record supports two modes:

```
+----------------------------------------------+
| mode: u8                                     |
|   0x00 = SINGLE  (one bitmap, inline)        |
|   0x01 = SHARDED (per-shard references)      |
+----------------------------------------------+
SINGLE (0x00):
| roaring_bitmap: [u8; ...]                    |

SHARDED (0x01):
| shard_count: u16                             |
| For each shard:                              |
|   shard_id: u16                              |
|   bitmap_offset: u64  (in shard file)        |
|   bitmap_length: u32                         |
|   bitmap_hash: hash128                       |
+----------------------------------------------+
```

Queries spanning shards load per-shard bitmaps and check each candidate
against its shard's bitmap.
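
A SHARDED payload can be serialized as a sketch with `struct`. The field order follows the record layout above; the little-endian format strings are this sketch's choice and the 128-bit hash is carried as 16 raw bytes:

```python
import struct

SINGLE, SHARDED = 0x00, 0x01

def encode_sharded(shards):
    """Encode a SHARDED DELETION_BITMAP payload.

    `shards` is a list of (shard_id, bitmap_offset, bitmap_length, bitmap_hash)
    tuples, bitmap_hash being 16 bytes.
    """
    out = struct.pack("<BH", SHARDED, len(shards))
    for shard_id, offset, length, hash128 in shards:
        out += struct.pack("<HQI", shard_id, offset, length) + hash128
    return out

def decode_sharded(buf):
    mode, count = struct.unpack_from("<BH", buf, 0)
    assert mode == SHARDED
    pos, shards = 3, []
    for _ in range(count):
        shard_id, offset, length = struct.unpack_from("<HQI", buf, pos)
        pos += 14                                   # u16 + u64 + u32
        shards.append((shard_id, offset, length, buf[pos:pos + 16]))
        pos += 16                                   # hash128
    return shards
```

A query router would decode this once per manifest load, then probe each candidate against the bitmap of its owning shard.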

### 8.4 Concurrent Access

One writer at a time (file-level advisory lock). Multiple readers are safe
due to the append-only architecture. A reader that opened before a deletion
sees the pre-deletion snapshot until it re-reads the manifest.

## 9. Space Reclamation

| Trigger | Threshold | Action |
|---------|-----------|--------|
| Deletion ratio | > 20% of vectors deleted | Schedule compaction |
| Bitmap size | > 1 MB | Schedule compaction |
| Segment count | > 64 mutable segments | Schedule compaction |
| Manual | User-initiated | Compact immediately |

Space accounting derived from the manifest:
```
total_vector_count:    10,000,000   (Level 0 root manifest)
deleted_vector_count:     150,000   (bitmap cardinality)
active_vector_count:    9,850,000   (total - deleted)
deletion_ratio:              1.5%   (below threshold)
wasted_bytes:             ~115 MB   (150K * 768 B per fp16-384 vector)
```
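
The trigger table can be collapsed into a single predicate. The thresholds are the spec's; the function shape is a sketch:

```python
def should_compact(total, deleted, bitmap_bytes, mutable_segments, manual=False):
    """Return True when any space-reclamation trigger from the table fires."""
    if manual:
        return True                          # user-initiated: compact immediately
    if total and deleted / total > 0.20:     # deletion ratio trigger
        return True
    if bitmap_bytes > 1 << 20:               # bitmap size trigger (1 MB)
        return True
    if mutable_segments > 64:                # segment count trigger
        return True
    return False
```

For the worked example above (10M vectors, 150K deleted, bitmap well under 1 MB), no trigger fires, matching the "below threshold" note.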

## 10. Summary

### Deletion Protocol

| Step | Action | Durability |
|------|--------|------------|
| 1 | Append JOURNAL_SEG with DELETE entries | fsync (orphan-safe) |
| 2 | Update roaring deletion bitmap | In-memory |
| 3 | Append MANIFEST_SEG with new bitmap | fsync (deletion visible) |
| 4 | Compaction excludes deleted vectors | fsync (physical removal) |
| 5 | File rewrite reclaims space | fsync (space freed) |

### New Wire Format Elements

| Element | Type / Tag | Section |
|---------|------------|---------|
| JOURNAL_SEG | Segment type 0x04 | 3 |
| DELETE_VECTOR | Journal entry 0x01 | 3.4 |
| DELETE_RANGE | Journal entry 0x02 | 3.4 |
| UPDATE_METADATA | Journal entry 0x03 | 3.4 |
| MOVE_VECTOR | Journal entry 0x04 | 3.4 |
| REMAP_ID | Journal entry 0x05 | 3.4 |
| DELETION_BITMAP | Level 1 TLV 0x000E | 4 |

### Invariants

| ID | Invariant |
|----|-----------|
| INV-D1 | After compaction, deletion bitmap is empty for compacted range |
| INV-D2 | Sealed output segments contain only ACTIVE vectors |
| INV-D3 | ID remappings journaled for every compaction-relocated vector |
| INV-D4 | Compacted input segments tombstoned in new manifest |
| INV-D5 | Sealed segments are never modified |
| INV-D6 | Rebuilt indexes exclude deleted nodes |
| INV-D7 | Uncommitted deletions never affect readers (crash safety) |
| INV-D8 | Deletion visibility is atomic at the manifest fsync boundary |
724
vendor/ruvector/docs/research/rvf/spec/08-filtered-search.md
vendored
Normal file

# RVF Filtered Search

## 1. Motivation

Domain profiles declare metadata schemas with indexed fields (e.g., `"organism"` in
RVDNA, `"language"` in RVText, `"node_type"` in RVGraph), but the format provides no
specification for how those indexes are built, stored, or evaluated at query time.

Filtered search is the combination of vector similarity search with metadata
predicates. Without it, a caller must retrieve an over-sized result set and filter
client-side -- wasting bandwidth, latency, and recall budget.

This specification adds:

1. **META_SEG** payload layout (segment type 0x07) for storing per-vector metadata
2. **Filter expression language** with a compact binary encoding
3. **Three evaluation strategies** (pre-, post-, and intra-filtering)
4. **METAIDX_SEG** (new segment type 0x0D) for inverted and bitmap indexes
5. **Manifest integration** via a new Level 1 TLV record
6. **Temperature tier coordination** for metadata segments

## 2. META_SEG Payload Layout (Segment Type 0x07)

META_SEG stores the actual metadata values associated with vectors. It uses the
standard 64-byte segment header (see `binary-layout.md` Section 3) with
`seg_type = 0x07`.

```
META_SEG Payload:

+------------------------------------------+
| Meta Header (64 bytes, padded)           |
|   schema_id: u32              | References PROFILE_SEG schema
|   vector_id_range_start: u64  | First vector ID covered
|   vector_id_range_end: u64    | Last vector ID covered (inclusive)
|   field_count: u16            | Number of fields in this segment
|   encoding: u8                | 0 = row-oriented, 1 = column-oriented
|   reserved: [u8; 37]          | Must be zero
|   [64B aligned]               |
+------------------------------------------+
| Field Directory                          |
|   For each field (field_count entries):  |
|     field_id: u16                        |
|     field_type: u8                       |
|     flags: u8                            |
|     field_offset: u32         | Byte offset from payload start
|   [64B aligned]                          |
+------------------------------------------+
| Field Data (column-oriented)             |
|   (see Section 2.1 for per-type layout)  |
+------------------------------------------+
```

### Field Type Enum

```
Value  Type    Wire Size          Description
-----  ----    ---------          -----------
0x00   string  Variable           UTF-8, dictionary-encoded in column layout
0x01   u32     4 bytes            Unsigned 32-bit integer
0x02   u64     8 bytes            Unsigned 64-bit integer
0x03   f32     4 bytes            IEEE 754 single-precision float
0x04   enum    Variable (packed)  Enumeration with defined label set
0x05   bool    1 bit (packed)     Boolean
```

### Field Flags

```
Bit  Mask  Name      Meaning
---  ----  ----      -------
0    0x01  INDEXED   Field has a corresponding METAIDX_SEG
1    0x02  SORTED    Values are stored in sorted order
2    0x04  NULLABLE  Null bitmap present before values
3    0x08  STORED    Field value returned in query results (not just filterable)
4-7        reserved  Must be zero
```

### 2.1 Column-Oriented Field Layouts

Column-oriented encoding (encoding = 1) is the preferred layout. Each field's data
block starts at a 64-byte aligned boundary.

**String fields** (dictionary-encoded):

```
dict_size: u32                    Number of distinct strings
For each dict entry:
  length: u16                     Byte length of UTF-8 string
  bytes: [u8; length]             UTF-8 encoded string
[4B aligned after dictionary]
codes: [varint; vector_count]     Dictionary code per vector
[64B aligned]
```

Dictionary codes are 0-indexed into the dictionary array. Code `0xFFFFFFFF` (max
varint value for the u32 range) represents null if the NULLABLE flag is set.

**Numeric fields** (u32, u64, f32 -- direct array):

```
If NULLABLE:
  null_bitmap: [u8; ceil(vector_count / 8)]   Bit-packed, 1 = present, 0 = null
  [8B aligned]
values: [field_type; vector_count]            Dense array of values
[64B aligned]
```

Values for null entries are zero-filled but must not be relied upon.

**Enum fields** (bit-packed):

```
enum_count: u8                    Number of enum labels
For each enum label:
  length: u8                      Byte length of label
  bytes: [u8; length]             UTF-8 label string
bits_per_code: u8                 ceil(log2(enum_count))
codes: packed bit array           bits_per_code bits per vector
  [ceil(vector_count * bits_per_code / 8) bytes]
[64B aligned]
```

For example, an enum with 3 values (`"+", "-", "."`) uses 2 bits per vector.
1M vectors = 250 KB.

**Bool fields** (bit-packed):

```
If NULLABLE:
  null_bitmap: [u8; ceil(vector_count / 8)]
  [8B aligned]
values: [u8; ceil(vector_count / 8)]          Bit-packed, 1 = true, 0 = false
[64B aligned]
```
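
The enum bit-packing above can be sketched as follows. The bit order within a byte (low bit first) is this sketch's choice, not mandated by the excerpt:

```python
import math

def pack_enum_codes(codes, enum_count):
    """Bit-pack enum codes at ceil(log2(enum_count)) bits per vector."""
    bits = max(1, math.ceil(math.log2(enum_count)))
    out = bytearray(math.ceil(len(codes) * bits / 8))
    for i, code in enumerate(codes):
        pos = i * bits
        for b in range(bits):
            if code & (1 << b):
                out[(pos + b) // 8] |= 1 << ((pos + b) % 8)
    return bytes(out), bits

def unpack_enum_codes(buf, bits, count):
    codes = []
    for i in range(count):
        pos, code = i * bits, 0
        for b in range(bits):
            if buf[(pos + b) // 8] & (1 << ((pos + b) % 8)):
                code |= 1 << b
        codes.append(code)
    return codes
```

With 3 labels this packs 2 bits per vector, so 1M vectors occupy `ceil(2,000,000 / 8)` = 250,000 bytes, matching the 250 KB figure.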

### 2.2 Sorted Index (Inline)

For fields with the SORTED flag, an additional sorted permutation index follows
the field data:

```
sorted_count: u32                      Must equal vector_count
sorted_order: [varint delta-encoded]   Vector IDs in ascending value order
restart_interval: u16                  Restart every N entries (default 128)
restart_offsets: [u32; ceil(sorted_count / restart_interval)]
[64B aligned]
```

This enables binary search over field values for range queries without requiring
a separate METAIDX_SEG. It is suitable for fields where a full inverted index
would be wasteful (high cardinality numeric fields like `position_start`).

## 3. Filter Expression Language

### 3.1 Abstract Syntax

A filter expression is a tree of predicates combined with boolean logic:

```
expr ::= field_ref CMP literal         -- comparison
       | field_ref IN literal_set      -- set membership
       | field_ref PREFIX string_lit   -- string prefix match
       | field_ref CONTAINS string_lit -- substring containment
       | expr AND expr                 -- conjunction
       | expr OR expr                  -- disjunction
       | NOT expr                      -- negation
```

### 3.2 Binary Encoding (Postfix / RPN)

Filter expressions are encoded as a postfix (Reverse Polish Notation) token stream
for stack-based evaluation. This avoids the need for recursive parsing and enables
single-pass evaluation with a fixed-size stack.

```
Filter Expression Binary Layout:

header:
  node_count: u16          Total number of tokens
  stack_depth: u8          Maximum stack depth required
  reserved: u8             Must be zero

tokens (postfix order):
  For each token:
    node_type: u8          Token type (see enum below)
    payload: type-specific Variable-size payload
```

### Token Type Enum

```
Value  Name       Stack Effect   Payload
-----  ----       ------------   -------
0x01   FIELD_REF  push +1        field_id: u16
0x02   LIT_U32    push +1        value: u32
0x03   LIT_U64    push +1        value: u64
0x04   LIT_F32    push +1        value: f32
0x05   LIT_STR    push +1        length: u16, bytes: [u8; length]
0x06   LIT_BOOL   push +1        value: u8 (0 or 1)
0x07   LIT_NULL   push +1        (no payload)

0x10   CMP_EQ     pop 2, push 1  (no payload) -- a == b
0x11   CMP_NE     pop 2, push 1  (no payload) -- a != b
0x12   CMP_LT     pop 2, push 1  (no payload) -- a < b
0x13   CMP_LE     pop 2, push 1  (no payload) -- a <= b
0x14   CMP_GT     pop 2, push 1  (no payload) -- a > b
0x15   CMP_GE     pop 2, push 1  (no payload) -- a >= b

0x20   IN_SET     pop 1, push 1  set_size: u16, [encoded values]
0x21   PREFIX     pop 2, push 1  (no payload) -- string prefix
0x22   CONTAINS   pop 2, push 1  (no payload) -- substring match

0x30   AND        pop 2, push 1  (no payload)
0x31   OR         pop 2, push 1  (no payload)
0x32   NOT        pop 1, push 1  (no payload)
```

### 3.3 Encoding Example

Filter: `organism = "E. coli" AND position_start >= 1000`

```
Token 0: FIELD_REF field_id=0 (organism)        stack: [organism_val]
Token 1: LIT_STR "E. coli"                      stack: [organism_val, "E. coli"]
Token 2: CMP_EQ                                 stack: [true/false]
Token 3: FIELD_REF field_id=3 (position_start)  stack: [bool, pos_val]
Token 4: LIT_U64 1000                           stack: [bool, pos_val, 1000]
Token 5: CMP_GE                                 stack: [bool, true/false]
Token 6: AND                                    stack: [result]

Binary: node_count=7, stack_depth=3
01 00:00  05 00:07 "E. coli"  10  01 00:03  03 00:00:00:00:00:00:03:E8  15  30
```
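
Producing the token stream and its header can be sketched by flattening an expression tree and replaying the per-token stack effects from the enum above. The tuple-based AST shape is this sketch's own, not part of the wire format:

```python
# Leaf kinds push one value; NOT and IN_SET pop 1 / push 1; everything else
# in the enum pops 2 and pushes 1.
PUSH = {"FIELD_REF", "LIT_U32", "LIT_U64", "LIT_F32", "LIT_STR", "LIT_BOOL", "LIT_NULL"}
POP1 = {"NOT", "IN_SET"}

def to_postfix(ast, out=None):
    """Flatten a nested tuple AST like ("AND", left, right) to postfix order."""
    if out is None:
        out = []
    if ast[0] in PUSH:
        out.append(ast)                 # leaf token, e.g. ("LIT_U64", 1000)
    else:
        for child in ast[1:]:
            to_postfix(child, out)
        out.append((ast[0],))           # operator token follows its operands
    return out

def header(tokens):
    """Compute node_count and the stack_depth the evaluator must pre-allocate."""
    depth = max_depth = 0
    for tok in tokens:
        if tok[0] in PUSH:
            depth += 1
        elif tok[0] not in POP1:
            depth -= 1                  # pop 2, push 1 nets -1
        max_depth = max(max_depth, depth)
    return {"node_count": len(tokens), "stack_depth": max_depth}
```

Encoding the Section 3.3 filter this way reproduces its header: 7 tokens, maximum stack depth 3.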

### 3.4 Evaluation

Evaluation processes tokens left to right using a fixed-size boolean/value stack:

```python
def evaluate(tokens, vector_id, metadata):
    stack = []
    for token in tokens:
        if token.type == FIELD_REF:
            stack.append(metadata.get_value(vector_id, token.field_id))
        elif token.type in (LIT_U32, LIT_U64, LIT_F32, LIT_STR, LIT_BOOL, LIT_NULL):
            stack.append(token.value)
        elif token.type in (CMP_EQ, CMP_NE, CMP_LT, CMP_LE, CMP_GT, CMP_GE):
            b, a = stack.pop(), stack.pop()
            stack.append(compare(a, token.type, b))
        elif token.type == IN_SET:
            a = stack.pop()
            stack.append(a in token.value_set)
        elif token.type in (PREFIX, CONTAINS):
            b, a = stack.pop(), stack.pop()
            stack.append(string_match(a, token.type, b))
        elif token.type == AND:
            b, a = stack.pop(), stack.pop()
            stack.append(a and b)
        elif token.type == OR:
            b, a = stack.pop(), stack.pop()
            stack.append(a or b)
        elif token.type == NOT:
            stack.append(not stack.pop())
    return stack[0]
```

Maximum stack depth is declared in the header so the evaluator can pre-allocate.
Implementations must reject expressions with `stack_depth > 16`.

## 4. Filter Evaluation Strategies

The runtime selects one of three strategies based on the estimated **selectivity**
of the filter (the fraction of vectors passing the filter).

### 4.1 Pre-Filtering (Selectivity < 1%)

Build the candidate ID set from metadata indexes first, then run vector search
only on the filtered subset.

```
1. Evaluate filter using METAIDX_SEG inverted/bitmap indexes
2. Collect matching vector IDs into a candidate set C
3. If |C| < ef_search:
     Flat scan all candidates, return top-K
   Else:
     Build temporary flat index over C, run HNSW search restricted to C
4. Return top-K results
```

**Tradeoffs**:
- Optimal when the candidate set is very small (hundreds to low thousands)
- Risk: if the candidate set is disconnected in the HNSW graph, search cannot
  traverse from entry points to candidates. The flat scan fallback handles this.
- Memory: candidate set bitmap = `ceil(total_vectors / 8)` bytes
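
The flat-scan fallback in step 3 can be sketched as an exact top-K pass over the candidate set. `get_vector` and `distance` stand in for the storage and metric layers, which are assumptions of this sketch:

```python
import heapq

def flat_scan_topk(query, candidate_ids, get_vector, distance, k):
    """Exact top-K over a small pre-filtered candidate set.

    Keeps a size-k max-heap (negated distances), so recall is 1.0 by
    construction -- the property the pre-filter strategy relies on.
    """
    heap = []  # entries are (-distance, vector_id)
    for vid in candidate_ids:
        d = distance(query, get_vector(vid))
        if len(heap) < k:
            heapq.heappush(heap, (-d, vid))
        elif -heap[0][0] > d:
            heapq.heapreplace(heap, (-d, vid))
    return sorted((-nd, vid) for nd, vid in heap)
```

For candidate sets in the hundreds this is a handful of microseconds, consistent with the pre-filter rows of the Section 7.1 table.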

### 4.2 Post-Filtering (Selectivity > 20%)

Run standard HNSW search with over-retrieval, then filter results.

```
1. Compute over_retrieval_factor = min(1.0 / selectivity, 10.0)
2. Set ef_search_adj = ef_search * over_retrieval_factor
3. Run standard HNSW search with ef_search_adj
4. Filter result set by evaluating filter expression per candidate
5. Return top-K from filtered results
```

**Tradeoffs**:
- Optimal when the filter passes most vectors (minimal wasted computation)
- Risk: if the over-retrieval factor is too low, fewer than K results survive
  filtering. The caller should retry with a higher factor or fall back to
  intra-filtering.
- No modification to HNSW traversal logic required.

### 4.3 Intra-Filtering (1% <= Selectivity <= 20%)

Evaluate the filter during HNSW traversal, skipping nodes that fail the predicate.

```python
def filtered_hnsw_search(query, filter_expr, entry_point, ef_search, k):
    candidates = MaxHeap()  # top-K results (max-heap by distance)
    worklist = MinHeap()    # exploration frontier (min-heap by distance)
    visited = BitSet()
    filtered_skips = 0
    max_skips = ef_search * 3  # backoff threshold

    worklist.push((distance(query, entry_point), entry_point))
    visited.add(entry_point)

    while worklist and filtered_skips < max_skips:
        dist, node = worklist.pop()

        # Check filter predicate
        if not evaluate(filter_expr, node, metadata):
            filtered_skips += 1
            # Still expand neighbors (maintain graph connectivity)
            neighbors = get_neighbors(node)
            for n in neighbors:
                if n not in visited:
                    visited.add(n)
                    d = distance(query, get_vector(n))
                    worklist.push((d, n))
            continue

        filtered_skips = 0  # reset skip counter on successful match
        candidates.push((dist, node))
        if len(candidates) > k:
            candidates.pop()  # evict worst

        # Expand neighbors
        neighbors = get_neighbors(node)
        for n in neighbors:
            if n not in visited:
                visited.add(n)
                d = distance(query, get_vector(n))
                if len(candidates) < ef_search or d < candidates.max():
                    worklist.push((d, n))

    return candidates.top_k(k)
```

**Key design decisions**:

1. **Skipped nodes still expand neighbors**: This preserves graph connectivity.
   A node that fails the filter may have neighbors that pass it.

2. **Skip counter with backoff**: If too many consecutive nodes fail the filter,
   the search is exhausting the local neighborhood without finding matches. The
   `max_skips` threshold triggers termination to avoid unbounded traversal.

3. **Adaptive ef expansion**: When `filtered_skips > ef_search`, the effective
   search frontier is larger than requested, compensating for filtered-out nodes.

### 4.4 Strategy Selection

```
selectivity = estimate_selectivity(filter_expr, metaidx_stats)

if selectivity < 0.01:
    strategy = PRE_FILTER
elif selectivity > 0.20:
    strategy = POST_FILTER
else:
    strategy = INTRA_FILTER
```

Selectivity estimation uses statistics stored in the METAIDX_SEG header:

- **Inverted index**: `posting_list_length / total_vectors` per term
- **Bitmap index**: `popcount(bitmap) / total_vectors` per enum value
- **Range tree**: count of values in range / total_vectors

For compound filters (AND/OR), selectivity is estimated under an independence
assumption: `P(A AND B) = P(A) * P(B)`, `P(A OR B) = P(A) + P(B) - P(A) * P(B)`.
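
Combining the per-leaf statistics with the independence assumption can be sketched over the postfix token stream. Here leaf selectivities are supplied directly per predicate token; in the real planner they would come from METAIDX_SEG stats:

```python
def estimate_selectivity(postfix_tokens, leaf_selectivity):
    """Fold the independence-assumption formulas over a postfix filter."""
    stack = []
    for tok in postfix_tokens:
        if tok == "AND":
            b, a = stack.pop(), stack.pop()
            stack.append(a * b)              # P(A AND B) = P(A) * P(B)
        elif tok == "OR":
            b, a = stack.pop(), stack.pop()
            stack.append(a + b - a * b)      # inclusion-exclusion under independence
        elif tok == "NOT":
            stack.append(1.0 - stack.pop())
        else:                                # predicate leaf
            stack.append(leaf_selectivity[tok])
    return stack[0]

def pick_strategy(selectivity):
    """The threshold scheme from the selection pseudocode above."""
    if selectivity < 0.01:
        return "PRE_FILTER"
    if selectivity > 0.20:
        return "POST_FILTER"
    return "INTRA_FILTER"
```

For example, two predicates with selectivities 0.5 and 0.3 combine to 0.15 under AND (intra-filter range) but 0.65 under OR (post-filter range).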

## 5. METAIDX_SEG (Segment Type 0x0D)

METAIDX_SEG stores secondary indexes over metadata fields for fast predicate
evaluation. Each METAIDX_SEG covers one field. The segment type enum value 0x0D
is allocated from the reserved range (see `binary-layout.md` Section 3).

```
METAIDX_SEG Payload:

+------------------------------------------+
| Index Header (64 bytes, padded)          |
|   field_id: u16       | Field being indexed
|   index_type: u8      | 0=inverted, 1=range_tree, 2=bitmap
|   field_type: u8      | Mirrors META_SEG field_type
|   total_vectors: u64  | Vectors covered by this index
|   unique_values: u64  | Cardinality (distinct values)
|   reserved: [u8; 42]  |
|   [64B aligned]       |
+------------------------------------------+
| Index Data (type-specific)               |
+------------------------------------------+
```

### 5.1 Inverted Index (index_type = 0)

Best for: string fields with moderate cardinality (100 to 100K distinct values).

```
term_count: u32
For each term (sorted by encoded value):
  term_length: u16
  term_bytes: [u8; term_length]      Encoded value (UTF-8 for strings)
  posting_length: u32                Number of vector IDs
  postings: [varint delta-encoded]   Sorted vector IDs
  [8B aligned after postings]
[64B aligned]
```

Posting lists use varint delta encoding identical to the ID encoding in VEC_SEG
(see `binary-layout.md` Section 5). Restart points every 128 entries enable
binary search within a posting list for intersection operations.
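
Varint delta encoding of a sorted posting list can be sketched as follows. The LEB128-style varint (7 payload bits per byte, high bit as continuation) is this sketch's assumption; the authoritative flavor is defined in `binary-layout.md`:

```python
def encode_postings(ids):
    """Delta-encode a sorted posting list as continuation-bit varints."""
    out, prev = bytearray(), 0
    for vid in ids:
        delta = vid - prev          # gaps are small, so they pack tightly
        prev = vid
        while True:
            byte = delta & 0x7F
            delta >>= 7
            out.append(byte | (0x80 if delta else 0x00))
            if not delta:
                break
    return bytes(out)

def decode_postings(buf):
    ids, vid, shift, delta = [], 0, 0, 0
    for byte in buf:
        delta |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7              # continuation: more delta bits follow
        else:
            vid += delta            # gap complete: reconstruct the absolute ID
            ids.append(vid)
            shift, delta = 0, 0
    return ids
```

Dense posting lists (small gaps) cost roughly one byte per entry, which is why the format favors delta encoding over raw u64 IDs.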

### 5.2 Range Tree (index_type = 1)

Best for: numeric fields requiring range queries (u32, u64, f32).

```
page_size: u32     Fixed 4096 bytes (4 KB, one disk page)
page_count: u32
root_page: u32     Page index of B+ tree root
tree_height: u8
reserved: [u8; 47]
[64B aligned]

Internal Page (4096 bytes):
  page_type: u8                   (0 = internal)
  key_count: u16
  keys: [field_type; key_count]   Separator keys
  children: [u32; key_count + 1]  Child page indices
  [zero-padded to 4096]

Leaf Page (4096 bytes):
  page_type: u8                   (1 = leaf)
  entry_count: u16
  prev_leaf: u32                  Linked-list pointer for range scan
  next_leaf: u32
  entries:
    For each entry:
      value: field_type           The metadata value
      vector_id: u64              Associated vector ID
  [zero-padded to 4096]
```

Leaf pages form a doubly-linked list for efficient range scans. A range query
`position_start >= 1000 AND position_start <= 5000` descends the tree to find
the first leaf with value >= 1000, then scans forward until value > 5000.
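
The descend-then-scan access pattern can be sketched over a flat sorted list, with `bisect` playing the role of the B+ tree descent and the list playing the role of the linked leaf pages:

```python
import bisect

def range_query(entries, lo, hi):
    """Leaf-level range scan over sorted (value, vector_id) pairs.

    `entries` stands in for the concatenated leaf pages. The bisect locates
    the first entry with value >= lo (the tree descent); the forward scan
    stops once a value exceeds hi, exactly as the spec's example describes.
    """
    start = bisect.bisect_left(entries, (lo, 0))
    out = []
    for value, vector_id in entries[start:]:
        if value > hi:
            break
        out.append(vector_id)
    return out
```

On the real structure the scan crosses leaf pages via `next_leaf` pointers instead of list slicing, but the cost shape is the same: one descent plus work proportional to the number of matches.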

### 5.3 Bitmap Index (index_type = 2)

Best for: enum and bool fields with low cardinality (< 64 distinct values).

```
value_count: u8                        Number of distinct enum/bool values
For each value:
  value_label_len: u8
  value_label: [u8; value_label_len]   The enum label or "true"/"false"
  bitmap_format: u8                    0 = raw, 1 = roaring
  bitmap_length: u32                   Byte length of bitmap data
  bitmap_data: [u8; bitmap_length]     Bitmap of matching vector IDs
  [8B aligned]
[64B aligned]
```

**Raw bitmaps** are used when `total_vectors < 8192` (1 KB per bitmap).

**Roaring bitmaps** are used for larger datasets. The roaring format stores
the bitmap as a set of containers (array, bitmap, or run-length) per 64K chunk.
This matches the industry-standard Roaring bitmap serialization (compatible with
the CRoaring / roaring-rs wire format).

Bitmap intersection and union operations map directly to AND/OR filter predicates
using SIMD bitwise operations. For 10M vectors:

```
Raw bitmap:     ~1.2 MB per value (impractical for many values)
Roaring bitmap: 100 KB - 1 MB per value depending on density
AND/OR:         ~0.1 ms per operation (AVX-512 on 1 MB bitmap)
```
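
The predicate-to-bitwise mapping can be sketched with raw bitmaps (the roaring container logic is omitted; this is the `bitmap_format = 0` case only):

```python
def bitmap_from_ids(ids, total_vectors):
    """Raw bitmap: one bit per vector ID, low bit first within each byte."""
    bm = bytearray((total_vectors + 7) // 8)
    for vid in ids:
        bm[vid // 8] |= 1 << (vid % 8)
    return bm

def bitmap_and(a, b):
    # AND filter predicate == bitwise intersection; SIMD widens this to 64B lanes.
    return bytearray(x & y for x, y in zip(a, b))

def bitmap_or(a, b):
    # OR filter predicate == bitwise union.
    return bytearray(x | y for x, y in zip(a, b))

def ids_from_bitmap(bm):
    return [i * 8 + b for i, byte in enumerate(bm) for b in range(8) if byte >> b & 1]
```

Compound filters over several enum values thus reduce to a chain of byte-wise (or SIMD-wide) operations before any vector distance is computed.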

## 6. Level 1 Manifest Addition

### Tag 0x000F: METADATA_INDEX_DIR

A new TLV record in the Level 1 manifest (see `02-manifest-system.md` Section 3)
that maps indexed metadata fields to their METAIDX_SEG segment IDs.

```
Tag:  0x000F
Name: METADATA_INDEX_DIR

Payload:
  entry_count: u16
  For each entry:
    field_id: u16                      Matches META_SEG field_id
    field_name_len: u8
    field_name: [u8; field_name_len]   UTF-8 field name for debugging
    index_seg_id: u64                  Segment ID of METAIDX_SEG
    index_type: u8                     0=inverted, 1=range_tree, 2=bitmap
    stats:
      total_vectors: u64
      unique_values: u64
      min_posting_len: u32             Smallest posting list size
      max_posting_len: u32             Largest posting list size
```

This allows the query planner to estimate selectivity without reading the
METAIDX_SEG segments themselves. The `min_posting_len` and `max_posting_len`
fields provide bounds for cardinality estimation.

### Updated Record Types Table

```
Tag     Name                Description
---     ----                -----------
0x0001  SEGMENT_DIR         Array of segment directory entries
0x0002  TEMP_TIER_MAP       Temperature tier assignments per block
...
0x000D  KEY_DIRECTORY       Encryption key references
0x000E  DELETION_BITMAP     Deletion bitmap (see the deletion spec)
0x000F  METADATA_INDEX_DIR  Metadata field -> METAIDX_SEG mapping
```

## 7. Performance Analysis

### 7.1 Filter Strategy vs Selectivity vs Recall

| Selectivity | Strategy | Recall@10 | Latency (10M vectors) | Notes |
|-------------|----------|-----------|-----------------------|-------|
| 0.001% (100 matches) | Pre-filter | 1.00 | 0.02 ms | Flat scan on 100 candidates |
| 0.01% (1K matches) | Pre-filter | 0.99 | 0.08 ms | Flat scan on 1K candidates |
| 0.1% (10K matches) | Pre-filter | 0.98 | 0.5 ms | Mini-HNSW on 10K candidates |
| 1% (100K matches) | Intra-filter | 0.96 | 0.12 ms | ~10% node skip overhead |
| 5% (500K matches) | Intra-filter | 0.95 | 0.08 ms | ~5% node skip overhead |
| 10% (1M matches) | Intra-filter | 0.94 | 0.06 ms | Minimal skip overhead |
| 20% (2M matches) | Post-filter | 0.95 | 0.10 ms | 5x over-retrieval |
| 50% (5M matches) | Post-filter | 0.97 | 0.06 ms | 2x over-retrieval |
| 100% (no filter) | None | 0.98 | 0.04 ms | Baseline unfiltered |

### 7.2 Memory Overhead of Metadata Indexes

For 10M vectors with the RVDNA profile (5 indexed fields):

| Field | Type | Cardinality | Index Type | Size |
|-------|------|-------------|------------|------|
| organism | string | ~50K | Inverted | ~80 MB |
| gene_id | string | ~500K | Inverted | ~120 MB |
| chromosome | string | ~25 | Bitmap (roaring) | ~12 MB |
| position_start | u64 | ~10M | Range tree | ~160 MB |
| position_end | u64 | ~10M | Range tree | ~160 MB |
| **Total** | | | | **~532 MB** |

As a fraction of vector data (10M * 384 dim * fp16 = 7.2 GB): **~7.4% overhead**.

For the RVText profile (2 indexed fields, typically lower cardinality):

| Field | Type | Cardinality | Index Type | Size |
|-------|------|-------------|------------|------|
| source_url | string | ~100K | Inverted | ~90 MB |
| language | string | ~50 | Bitmap (roaring) | ~8 MB |
| **Total** | | | | **~98 MB** |

Overhead: **~1.4%** of vector data.

### 7.3 Query Latency Breakdown (Filtered Intra-Search)

```
Phase                         Time       Notes
-----                         ----       -----
Parse filter expression       0.5 us     Stack-based, no allocation
Estimate selectivity          1.0 us     Read manifest stats
Load METAIDX_SEG (if cold)    50-200 us  First query only; cached after
HNSW traversal (150 steps)    45 us      Baseline unfiltered
  + filter eval per node      +12 us     ~80 ns per eval * 150 nodes
  + skip expansion            +8 us      ~20% more nodes visited at 5% sel.
Top-K collection              10 us      Heap operations
                              --------
Total (warm cache)            ~76 us
Total (cold start)            ~276 us
```

## 8. Integration with Temperature Tiering

Metadata follows the same temperature model as vector data (see
`03-temperature-tiering.md`), but with its own tier assignments.

### 8.1 Hot Metadata

Indexed fields for hot-tier vectors are kept resident in memory:

- **Bitmap indexes** for low-cardinality fields (enum, bool) are always hot.
  Total size is bounded: `cardinality * ceil(hot_vectors / 8)` bytes. For 100K
  hot vectors and 25 enum values: 25 * 12.5 KB = 312 KB.

- **Inverted index posting lists** are cached using an LRU policy keyed by
  (field_id, term). Frequently queried terms (e.g., `language = "en"`) remain
  resident.

- **Range tree pages** follow the standard B+ tree buffer pool model. Hot pages
  (the root plus the first two levels) are pinned. Leaf pages are demand-paged.

### 8.2 Cold Metadata

Cold metadata covers vectors that are rarely accessed:

- META_SEG data for cold vectors is compressed with ZSTD (level 9+) and stored
  in cold-tier segments.
- METAIDX_SEG posting lists for cold vectors are not loaded until a query
  specifically requests them.
- When a filter matches only cold vectors (detected via the temperature tier
  map), the runtime issues a warning: filtered search on cold data may incur
  decompression latency of 10-100 ms.

### 8.3 Compaction Coordination

When temperature-aware compaction reorganizes vector segments (see
`03-temperature-tiering.md` Section 4), metadata must follow:

```
1. Identify vectors moving between tiers
2. Rewrite META_SEG for affected vector ID ranges
3. Rebuild METAIDX_SEG posting lists (vector IDs may be renumbered during
   compaction if the COMPACTION_RENUMBER flag is set)
4. Update METADATA_INDEX_DIR in the new manifest
5. Tombstone old META_SEG and METAIDX_SEG segments
```

Metadata compaction piggybacks on vector compaction -- it never triggers
independently. This ensures metadata and vector segments remain in consistent
temperature tiers.

### 8.4 Metadata-Aware Promotion

When a filter query frequently accesses metadata for warm-tier vectors, those
metadata segments become candidates for promotion to the hot tier. The access
sketch (SKETCH_SEG) tracks metadata segment accesses alongside vector accesses:

```
sketch_key = (META_SEG_ID << 32) | block_id
```

This reuses the existing sketch infrastructure without modification.

## 9. Wire Protocol: Filtered Query Message

For completeness, the filter expression is carried in the query message as a
tagged field. The query wire format is outside the scope of the storage spec,
but the filter payload is defined here for interoperability.

```
Query Message Filter Field:
  tag: u16                          (0x0040 = FILTER)
  length: u32
  filter_version: u8                (1)
  filter_payload: [u8; length - 1]  Binary filter expression (Section 3.2)
```

Implementations that do not support filtered search must ignore tag 0x0040 and
return unfiltered results. This preserves backward compatibility.

## 10. Implementation Notes

### 10.1 Index Selection Heuristics

When building indexes for a new META_SEG field, implementations should select
the index type automatically:

```
if field_type in (enum, bool) and cardinality < 64:
    index_type = BITMAP
elif field_type in (u32, u64, f32):
    index_type = RANGE_TREE
else:
    index_type = INVERTED
```

Fields without the `"indexed": true` property in the profile schema must not
have METAIDX_SEG segments built. They are stored in META_SEG for retrieval
only (the STORED flag).
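
The heuristic above translates almost verbatim into code. This is an illustrative sketch; the string type names stand in for the spec's field-type enum:

```python
def choose_index_type(field_type: str, cardinality: int) -> str:
    # Section 10.1 heuristic: small enums/bools get bitmaps, numeric
    # fields get range trees, everything else an inverted index.
    if field_type in ("enum", "bool") and cardinality < 64:
        return "BITMAP"
    elif field_type in ("u32", "u64", "f32"):
        return "RANGE_TREE"
    else:
        return "INVERTED"
```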

### 10.2 Posting List Intersection

For AND filters on multiple indexed fields, posting list intersection is
performed using a merge-based algorithm on sorted, delta-decoded posting lists:

```
Sorted Intersection (two-pointer merge):
  Time: O(|A| + |B|) worst case; skip-ahead via restart points
        approaches O(min(|A|, |B|)) when one list is much shorter
  Practical: ~100 ns per 1000 common elements (SIMD comparison)
```

For OR filters, posting list union uses a similar merge with deduplication.
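
A minimal scalar sketch of the two-pointer merge, without the restart-point skip-ahead or SIMD comparison the spec describes:

```python
def intersect_sorted(a: list[int], b: list[int]) -> list[int]:
    # Two-pointer merge over sorted posting lists of vector IDs.
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1  # advance the pointer behind the smaller value
        else:
            j += 1
    return out
```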

### 10.3 Null Handling

- `FIELD_REF` for a null value pushes a sentinel NULL onto the stack
- `CMP_EQ NULL` returns true only for null values
- `CMP_NE NULL` returns true for all non-null values
- All other comparisons against NULL return false (SQL-style three-valued logic)
- `IN_SET` never matches NULL unless NULL is explicitly in the set

474
vendor/ruvector/docs/research/rvf/spec/09-concurrency-versioning.md
vendored
Normal file

# RVF Concurrency, Versioning, and Space Reclamation

## 1. Single-Writer / Multi-Reader Model

RVF uses a **single-writer, multi-reader** concurrency model. At most one process
may append segments to an RVF file at any time. Any number of readers may operate
concurrently with each other and with the writer. This model is enforced by an
advisory lock file, not by OS-level mandatory locking.

| Concern | Advisory Lock | Mandatory Lock (flock/fcntl) |
|---------|---------------|------------------------------|
| NFS compatibility | Works (lock file is a regular file) | Broken on many NFS configs |
| Crash recovery | Stale lock detectable by PID check | Kernel auto-releases, but only locally |
| Cross-language | Any language can create a file | Requires OS-specific syscalls |
| Visibility | Lock state inspectable by humans | Opaque kernel state |
| Multi-file mode | One lock covers all shards | Would need per-shard locks |

## 2. Writer Lock File

The writer lock is a file named `<basename>.rvf.lock` in the same directory as the
RVF file. For example, `data.rvf` uses `data.rvf.lock`.

### Binary Layout

```
Offset  Size  Field         Description
------  ----  -----         -----------
0x00    4     magic         0x52564C46 ("RVLF" in ASCII)
0x04    4     pid           Writer process ID (u32)
0x08    64    hostname      Null-terminated hostname (max 63 chars + null)
0x48    8     timestamp_ns  Lock acquisition time (nanosecond UNIX timestamp)
0x50    16    writer_id     Random UUID (128-bit, written as raw bytes)
0x60    4     lock_version  Lock protocol version (currently 1)
0x64    4     checksum      CRC32C of bytes 0x00-0x63
```

**Total**: 104 bytes.
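
The layout packs into exactly 104 bytes. A sketch of the encoding, with one stated substitution: `zlib.crc32` (plain CRC32) stands in for the CRC32C the spec requires, since CRC32C is not in the Python standard library:

```python
import os
import struct
import time
import uuid
import zlib

def build_lock_bytes() -> bytes:
    # Encode the 104-byte lock file layout from Section 2.
    # NOTE: zlib.crc32 is plain CRC32, used here as a stand-in for
    # the CRC32C (Castagnoli) checksum the spec actually mandates.
    hostname = os.uname().nodename.encode()[:63]
    body = struct.pack(
        "<II64sQ16sI",
        0x52564C46,          # magic "RVLF"
        os.getpid(),         # writer PID
        hostname,            # null-padded hostname field
        time.time_ns(),      # acquisition timestamp (ns)
        uuid.uuid4().bytes,  # random 128-bit writer_id
        1,                   # lock_version
    )
    return body + struct.pack("<I", zlib.crc32(body))  # checksum at 0x64
```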

### Lock Acquisition Protocol

```
1. Construct lock file content (magic, PID, hostname, timestamp, random UUID)
2. Compute CRC32C over bytes 0x00-0x63, store at 0x64
3. Attempt open("<basename>.rvf.lock", O_CREAT | O_EXCL | O_WRONLY)
4. If open succeeds:
   a. Write 104 bytes
   b. fsync
   c. Lock acquired — proceed with writes
5. If open fails (EEXIST):
   a. Read existing lock file
   b. Validate magic and checksum
   c. If invalid: delete stale lock, retry from step 3
   d. If valid: run stale lock detection (see below)
   e. If stale: delete lock, retry from step 3
   f. If not stale: lock acquisition fails — another writer is active
```

The `O_CREAT | O_EXCL` combination is atomic on POSIX filesystems, preventing
two processes from simultaneously creating the lock.
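
Steps 3-4 of the protocol reduce to a single exclusive-create call. A sketch (stale-lock handling from step 5 omitted):

```python
import os

def try_acquire(lock_path: str, content: bytes) -> bool:
    # O_CREAT | O_EXCL is atomic on POSIX: exactly one process can
    # create the lock file; everyone else gets EEXIST.
    try:
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False  # another writer holds (or abandoned) the lock
    try:
        os.write(fd, content)
        os.fsync(fd)  # durable before we rely on the lock
    finally:
        os.close(fd)
    return True
```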

### Stale Lock Detection

A lock is considered stale when **both** of the following are true:

1. **PID is dead**: `kill(pid, 0)` returns `ESRCH` (process does not exist), OR
   the hostname does not match the current host (remote crash)
2. **Age exceeds threshold**: `now_ns - timestamp_ns > 30_000_000_000` (30 seconds)

The age check prevents a race where a PID is recycled by the OS. A lock younger
than 30 seconds is never considered stale, even if the PID appears dead, because
PID reuse on modern systems can occur within milliseconds.

If the hostname differs from the current host, the PID check is not meaningful.
In this case, only the age threshold applies. Implementations SHOULD use a longer
threshold (300 seconds) for cross-host lock recovery to account for clock skew.
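
The two-condition check can be sketched as follows. The `alive` parameter is injectable so the dead-PID path is testable; the 300-second cross-host threshold is left out for brevity:

```python
import os
import time

STALE_NS = 30_000_000_000  # 30 s threshold from Section 2

def pid_alive(pid: int) -> bool:
    # kill(pid, 0) probes for process existence without signaling it.
    try:
        os.kill(pid, 0)
        return True
    except ProcessLookupError:  # ESRCH: no such process
        return False
    except PermissionError:     # EPERM: exists, owned by another user
        return True

def is_stale(pid: int, lock_host: str, timestamp_ns: int,
             alive=pid_alive) -> bool:
    # Stale only when BOTH hold: the owner is dead (or remote) AND the
    # lock is older than the threshold (guards against PID recycling).
    same_host = lock_host == os.uname().nodename
    dead = (not same_host) or (not alive(pid))
    old = time.time_ns() - timestamp_ns > STALE_NS
    return dead and old
```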

### Lock Release Protocol

```
1. fsync all pending data and manifest segments
2. Verify the lock file still contains our writer_id (re-read and compare)
3. If writer_id matches: unlink("<basename>.rvf.lock")
4. If writer_id does not match: abort — another process stole the lock
```

Step 2 prevents a writer from deleting a lock that was legitimately taken over
after a stale lock recovery by another process.

If a writer crashes without releasing the lock, the lock file persists on disk.
The next writer detects the orphan via stale lock detection and reclaims it.
No data corruption occurs because the append-only segment model guarantees that
partial writes are detectable: a segment with a bad content hash or a truncated
manifest is simply ignored.

## 3. Reader-Writer Coordination

Readers and writers operate independently. The append-only architecture ensures
they never conflict.

### Reader Protocol

```
1. Open file (read-only, no lock required)
2. Read Level 0 root manifest (last 4096 bytes)
3. Parse hotset pointers and Level 1 offset
4. This manifest snapshot defines the reader's view of the file
5. All queries within this session use the snapshot
6. To see new data: re-read Level 0 (explicit refresh)
```

### Writer Protocol

```
1. Acquire lock (Section 2)
2. Read current manifest to learn segment directory state
3. Append new segments (VEC_SEG, INDEX_SEG, etc.)
4. Append new MANIFEST_SEG referencing all live segments
5. fsync
6. Release lock (Section 2)
```

### Concurrent Timeline

```
Time  Writer                   Reader A            Reader B
----  ------                   --------            --------
t=0   Acquires lock
t=1   Appends VEC_SEG_4                            Opens file
t=2   Appends VEC_SEG_5        Opens file          Reads manifest M3
t=3   Appends MANIFEST_SEG M4  Reads manifest M3   Queries (sees M3)
t=4   fsync, releases lock     Queries (sees M3)   Queries (sees M3)
t=5                            Queries (sees M3)   Refreshes -> M4
t=6                            Refreshes -> M4     Queries (sees M4)
```

Reader A opened during the write but read manifest M3 (already stable) and never
sees partially written segments. Reader B sees M3 until explicit refresh. Neither
reader is blocked; the writer is never blocked by readers.

### Snapshot Isolation Guarantees

A reader holding a manifest snapshot is guaranteed:

1. All referenced segments are fully written and fsynced
2. Segment content hashes match (the manifest would not reference broken segments)
3. The snapshot is internally consistent (no partial epoch states)
4. The snapshot remains valid for the lifetime of the open file descriptor, even
   if the file is compacted and replaced (old inode persists until close)

## 4. Format Versioning

RVF uses explicit version fields at every structural level. The versioning rules
are designed for forward compatibility — older readers can safely process files
produced by newer writers, with graceful degradation.

### Segment Version Compatibility

The segment header `version` field (offset 0x04, currently `1`) governs
segment-level compatibility.

| Rule | Description |
|------|-------------|
| S1 | A v1 reader MUST successfully process all v1 segments |
| S2 | A v1 reader MUST skip segments with version > 1 |
| S3 | A v1 reader MUST log a warning when skipping unknown versions |
| S4 | A v1 reader MUST NOT reject a file because it contains unknown-version segments |
| S5 | A v2+ writer MUST write a root manifest readable by v1 readers (if the root manifest format allows it) |
| S6 | A v2+ writer MAY write segments with version > 1 |
| S7 | Readers MUST use `payload_length` from the segment header to skip unknown segments |

Skipping works because the segment header layout is stable: magic, version,
seg_type, and payload_length occupy fixed offsets. A reader skips unknown
segments by seeking past `64 + payload_length` bytes (header + payload).

### Unknown Segment Types

The segment type enum (offset 0x05) may be extended in future versions.

| Rule | Description |
|------|-------------|
| T1 | A reader MUST skip segment types outside the recognized range (currently 0x01-0x0C) |
| T2 | A reader MUST NOT reject a file because of unknown segment types |
| T3 | A reader MUST use the header's `payload_length` to skip the unknown segment |
| T4 | A reader SHOULD log unknown types at diagnostic/debug level |
| T5 | Types 0x00 and 0xF0-0xFF remain reserved (see spec 01, Section 3) |
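
The skip rules above amount to a simple scan loop. A sketch with one stated assumption: `payload_length` is read here as a u64 at header offset 0x08, which is illustrative only — see spec 01 for the authoritative 64-byte header layout:

```python
import struct

HEADER_SIZE = 64
KNOWN_TYPES = set(range(0x01, 0x0D))  # recognized range 0x01-0x0C

def scan_segments(buf: bytes):
    # Walk segments, applying rules S2 (skip unknown versions) and
    # T1/T3 (skip unknown types via payload_length). Magic/hash
    # verification is omitted for brevity.
    off, seen = 0, []
    while off + HEADER_SIZE <= len(buf):
        version = buf[off + 4]   # version at 0x04
        seg_type = buf[off + 5]  # seg_type at 0x05
        # ASSUMPTION: u64 payload_length at 0x08 (illustrative).
        (payload_len,) = struct.unpack_from("<Q", buf, off + 8)
        if version == 1 and seg_type in KNOWN_TYPES:
            seen.append((seg_type, off))
        off += HEADER_SIZE + payload_len  # seek past header + payload
    return seen
```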

### Level 1 TLV Forward Compatibility

Level 1 manifest records use tag-length-value encoding. New tags may be added
in any version.

| Rule | Description |
|------|-------------|
| L1 | A reader MUST skip TLV records with unknown tags |
| L2 | A reader MUST use the record's `length` field (4 bytes at tag offset +2) to skip |
| L3 | A writer MUST NOT change the semantics of an existing tag |
| L4 | A writer MUST NOT reuse a tag value for a different purpose |
| L5 | New tags MUST be assigned sequentially from 0x000E onward |

### Root Manifest Compatibility

The root manifest (Level 0) has the strictest compatibility requirements because
it is the entry point for all readers.

| Rule | Description |
|------|-------------|
| R1 | The magic `0x52564D30` at offset 0x000 is frozen forever |
| R2 | The layout of bytes 0x000-0x007 (magic + version + flags) is frozen forever |
| R3 | New fields may be added to reserved space at offsets 0xF00-0xFFB |
| R4 | Readers MUST ignore non-zero bytes in reserved space they do not understand |
| R5 | The root checksum at 0xFFC always covers bytes 0x000-0xFFB |
| R6 | A v2+ writer extending reserved space MUST ensure the checksum remains valid |

There is no explicit version negotiation. Compatibility is achieved through the
skip rules above. A reader processes what it understands and skips what it does
not. This avoids capability exchange, making RVF suitable for offline and
archival use cases.

## 5. Variable Dimension Support

The root manifest declares a `dimension` field (offset 0x020, u16) and each
VEC_SEG block declares its own `dim` field (block header offset 0x08, u16).
These may differ.

### Dimension Rules

| Rule | Description |
|------|-------------|
| D1 | The root manifest `dimension` is the **primary dimension** (most common in the file) |
| D2 | An RVF file MAY contain VEC_SEG blocks with dimensions different from the primary |
| D3 | Each VEC_SEG block's `dim` field is authoritative for the vectors in that block |
| D4 | The HNSW index (INDEX_SEG) covers only vectors matching the primary dimension |
| D5 | Vectors with non-primary dimensions are searchable via flat scan or a separate index |
| D6 | A PROFILE_SEG may declare multiple expected dimensions |

### Dimension Catalog (Level 1 Record)

A new Level 1 TLV record (tag `0x0010`, DIMENSION_CATALOG) enables readers to
discover all dimensions present without scanning every VEC_SEG.

Record layout:

```
Offset  Size  Field        Description
------  ----  -----        -----------
0x00    2     entry_count  Number of dimension entries
0x02    2     reserved     Must be zero
```

Followed by `entry_count` entries of:

```
Offset  Size  Field             Description
------  ----  -----             -----------
0x00    2     dimension         Vector dimensionality
0x02    1     dtype             Data type enum for these vectors
0x03    1     flags             0x01 = primary, 0x02 = has_index
0x04    4     vector_count      Number of vectors with this dimension
0x08    8     index_seg_offset  Offset to dedicated index (0 if none)
```

**Entry size**: 16 bytes.

Example for an RVDNA profile file:

```
DIMENSION_CATALOG:
  entry_count: 3
  [0] dim=64,   dtype=f16, flags=0x03 (primary, has_index), count=10000000, index=0x1A00000
  [1] dim=384,  dtype=f16, flags=0x02 (has_index),          count=500000,   index=0x3F00000
  [2] dim=4096, dtype=f32, flags=0x00 (flat scan only),     count=10000,    index=0
```
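
The 16-byte entry layout maps cleanly onto a fixed struct format. An illustrative decoder for the record body:

```python
import struct

# dimension(u16), dtype(u8), flags(u8), vector_count(u32),
# index_seg_offset(u64) -- 16 bytes, little-endian, no padding.
ENTRY = struct.Struct("<HBBIQ")

def parse_dimension_catalog(payload: bytes) -> list[dict]:
    # Decode the DIMENSION_CATALOG record body described above.
    entry_count, _reserved = struct.unpack_from("<HH", payload, 0)
    entries = []
    for i in range(entry_count):
        dim, dtype, flags, count, idx_off = ENTRY.unpack_from(
            payload, 4 + i * ENTRY.size)
        entries.append({
            "dim": dim,
            "dtype": dtype,
            "primary": bool(flags & 0x01),
            "has_index": bool(flags & 0x02),
            "count": count,
            "index_offset": idx_off,
        })
    return entries
```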

## 6. Space Reclamation

Over time, tombstoned segments and superseded manifests accumulate dead space.
RVF provides three reclamation strategies, each suited to different operating
conditions.

### Strategy 1: Hole-Punching

On Linux filesystems that support `fallocate(2)` with `FALLOC_FL_PUNCH_HOLE`
(ext4, XFS, btrfs), tombstoned segment ranges can be released back to the
filesystem without rewriting the file.

```
Before: [VEC_1 live] [VEC_2 dead] [VEC_3 dead] [VEC_4 live] [MANIFEST]
After:  [VEC_1 live] [   hole   ] [   hole   ] [VEC_4 live] [MANIFEST]
```

File size is unchanged but disk blocks are freed. No data movement occurs — each
punch is O(1). Reader mmap still works (holes read as zeros, but the manifest
never references them). Hole-punching is performed only on segments marked as
TOMBSTONE in the current manifest's COMPACTION_STATE record.

### Strategy 2: Copy-Compact

Copy-compact rewrites the file, including only live segments. This is the
universal strategy that works on all filesystems.

```
Protocol:
1. Acquire writer lock
2. Read current manifest to enumerate live segments
3. Create temporary file: <basename>.rvf.compact.tmp
4. Write live segments sequentially to temporary file
5. Write new MANIFEST_SEG with updated offsets
6. fsync temporary file
7. Atomic rename: <basename>.rvf.compact.tmp -> <basename>.rvf
8. Release writer lock
```

The atomic rename (step 7) ensures readers either see the old file or the new
file, never a partial state. Readers that opened the old file before the rename
continue operating on the old inode via their open file descriptor. The old
inode is freed when the last reader closes its descriptor.
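
Steps 3-7 of the copy-compact protocol can be sketched with `os.replace`, which performs the atomic rename on POSIX. Lock handling and manifest rewriting are elided; segments are modeled as opaque byte strings:

```python
import os

def copy_compact(path: str, live_segments: list[bytes]) -> None:
    # Write live segments to a temp file, fsync, then atomically
    # rename over the original (steps 3-7 of the protocol).
    tmp_path = path + ".compact.tmp"
    fd = os.open(tmp_path, os.O_CREAT | os.O_WRONLY | os.O_TRUNC)
    try:
        for seg in live_segments:
            os.write(fd, seg)
        os.fsync(fd)  # durable before the rename makes it visible
    finally:
        os.close(fd)
    os.replace(tmp_path, path)  # atomic: readers see old or new, never partial
```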

### Strategy 3: Shard Rewrite (Multi-File Mode)

In multi-file mode, individual shard files can be rewritten independently:

```
Protocol:
1. Acquire writer lock
2. Read shard reference from Level 1 SHARD_REFS record
3. Write new shard: <basename>.rvf.cold.<N>.compact.tmp
4. fsync new shard
5. Update main file manifest with new shard reference
6. fsync main file
7. Atomic rename new shard over old shard
8. Release writer lock
```

The old shard is safe to delete after all readers close their descriptors.
Implementations MAY defer deletion using a grace period (default: 60 seconds).

## 7. Space Reclamation Triggers

Reclamation is not performed on every write. Implementations SHOULD evaluate
triggers after each manifest write and act when thresholds are exceeded.

| Trigger | Threshold | Action |
|---------|-----------|--------|
| Dead space ratio | > 50% of file size | Copy-compact |
| Dead space absolute | > 1 GB | Hole-punch if supported, else copy-compact |
| Tombstone count | > 10,000 JOURNAL_SEG tombstone entries | Consolidate journal segments |
| Time since last compaction | > 7 days | Evaluate dead space ratio, compact if > 25% |

### Dead Space Calculation

Dead space is computed from the manifest's COMPACTION_STATE record:

```
dead_bytes  = sum(payload_length + 64) for each tombstoned segment
total_bytes = file_size
dead_ratio  = dead_bytes / total_bytes
```

The `+ 64` accounts for the segment header.

### Trigger Evaluation Protocol

```
1. After writing a new MANIFEST_SEG, compute dead_bytes and dead_ratio
2. If dead_ratio > 0.50: schedule copy-compact
3. Else if dead_bytes > 1 GB:
   a. If fallocate supported: hole-punch tombstoned ranges
   b. Else: schedule copy-compact
4. If tombstone_count > 10,000: consolidate JOURNAL_SEGs
5. If days_since_last_compact > 7 AND dead_ratio > 0.25: schedule copy-compact
```

Scheduled compactions MAY be deferred to a background process or low-activity
period.
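
The trigger protocol above is a pure function of a few counters, so it transcribes directly:

```python
GB = 1 << 30

def evaluate_triggers(dead_bytes: int, file_size: int,
                      tombstone_count: int, days_since_compact: int,
                      punch_supported: bool) -> set[str]:
    # Direct transcription of the trigger evaluation protocol;
    # returns the set of actions to schedule.
    actions = set()
    dead_ratio = dead_bytes / file_size if file_size else 0.0
    if dead_ratio > 0.50:
        actions.add("copy-compact")
    elif dead_bytes > GB:
        actions.add("hole-punch" if punch_supported else "copy-compact")
    if tombstone_count > 10_000:
        actions.add("consolidate-journal")
    if days_since_compact > 7 and dead_ratio > 0.25:
        actions.add("copy-compact")
    return actions
```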

## 8. Multi-Process Compaction

Compaction is a write operation and requires the writer lock. Only one process
may compact at a time.

### Background Compaction Process

A dedicated compaction process can run alongside the application:

```
1. Attempt writer lock acquisition
2. If lock acquired:
   a. Read current manifest
   b. Evaluate reclamation triggers
   c. If compaction needed:
      i.   Write WITNESS_SEG with compaction_state = STARTED
      ii.  Perform compaction (copy-compact or hole-punch)
      iii. Write WITNESS_SEG with compaction_state = COMPLETED
      iv.  Write new MANIFEST_SEG
   d. Release lock
3. If lock not acquired: sleep and retry
```

### Crash Safety

Compaction is crash-safe by construction. Copy-compact does not rename until
fsynced — a crash before rename leaves the original file untouched, and the
temporary file is cleaned up on next startup. Hole-punch `fallocate` calls are
individually atomic; a crash mid-sequence leaves the manifest consistent because
it references only live segments. Shard rewrite follows the same atomic rename
pattern as copy-compact.

### Compaction Progress and Resumability

For long-running compactions, the writer records progress in WITNESS_SEG segments:

```
WITNESS_SEG compaction payload:
Offset  Size  Field               Description
------  ----  -----               -----------
0x00    4     state               0=STARTED, 1=IN_PROGRESS, 2=COMPLETED, 3=ABORTED
0x04    8     source_manifest_id  Segment ID of manifest being compacted
0x0C    8     last_copied_seg_id  Last segment ID successfully written to new file
0x14    8     bytes_written       Total bytes written to new file so far
0x1C    8     bytes_remaining     Estimated bytes remaining
0x24    16    temp_file_hash      Hash of temporary file at last checkpoint
```

If a compaction process crashes and restarts, it can:

1. Find the latest WITNESS_SEG with `state = IN_PROGRESS`
2. Verify the temporary file exists and matches `temp_file_hash`
3. Resume from `last_copied_seg_id + 1`
4. If verification fails, delete the temporary file and restart compaction

## 9. Crash Recovery Summary

RVF recovers from crashes at any point without external tooling.

| Crash Point | State After Recovery | Action Required |
|-------------|----------------------|-----------------|
| Segment append (before manifest) | Orphan segment at tail | None — manifest does not reference it |
| Manifest write | Partial manifest at tail | Scan backward to previous valid manifest |
| Lock acquisition | Lock file may or may not exist | Stale lock detection resolves it |
| Lock release | Lock file persists | Stale lock detection resolves it |
| Copy-compact (before rename) | Temporary file on disk | Delete `*.compact.tmp` on startup |
| Copy-compact (during rename) | Atomic — old or new | No action needed |
| Hole-punch | Partial holes punched | No action — manifest is consistent |
| Shard rewrite | Temporary shard on disk | Delete `*.compact.tmp` on startup |

### Startup Recovery Protocol

On startup, before acquiring a write lock, a writer SHOULD:

```
1. Delete any <basename>.rvf.compact.tmp files (orphaned compaction)
2. Delete any <basename>.rvf.cold.*.compact.tmp files (orphaned shard compaction)
3. Validate the lock file (if present) for staleness
4. Open the RVF file and locate the latest valid manifest
5. If the tail contains a partial segment (magic present, bad hash):
   a. Log a warning with the partial segment's offset and type
   b. The partial segment is outside the manifest — it is harmless
   c. The next append will overwrite it (or it will be compacted away)
```

## 10. Invariants

The following invariants extend those in spec 01 (Section 7):

1. At most one writer lock exists per RVF file at any time
2. A lock file with valid magic and checksum represents an active or stale lock
3. Readers never require a lock, regardless of operation
4. A manifest snapshot is immutable for the lifetime of a reader session
5. Compaction never modifies live segments — it creates new ones
6. Hole-punched regions are never referenced by any manifest
7. The root manifest magic and first 8 bytes are frozen across all versions
8. Unknown segment versions and types are skipped, never rejected
9. Unknown TLV tags in Level 1 are skipped, never rejected
10. Each VEC_SEG block's `dim` field is authoritative for that block's vectors

688
vendor/ruvector/docs/research/rvf/spec/10-operations-api.md
vendored
Normal file

# RVF Operations API

## 1. Scope

This document specifies the operational surface of an RVF runtime: error codes
returned by all operations, wire formats for batch queries, batch ingest, and
batch deletes, the network streaming protocol for progressive loading over HTTP
and TCP, and the compaction scheduling policy. It complements the segment model
(spec 01), manifest system (spec 02), and query optimization (spec 06).

All multi-byte integers are little-endian unless otherwise noted. All offsets
within messages are byte offsets from the start of the message payload.

## 2. Error Code Enumeration

Error codes are 16-bit unsigned integers. The high byte identifies the error
category; the low byte identifies the specific error within that category.
Implementations must preserve unrecognized codes in responses and must not
treat unknown codes as fatal unless the high byte is `0x01` (format error).

### Category 0x00: Success

```
Code    Name                  Description
------  --------------------  ----------------------------------------
0x0000  OK                    Operation succeeded
0x0001  OK_PARTIAL            Partial success (some items failed)
```

`OK_PARTIAL` is returned when a batch operation succeeds for some items and
fails for others. The response body contains per-item status details.
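
The category/specific split and the unknown-code rule reduce to a pair of one-liners:

```python
def error_category(code: int) -> int:
    # High byte identifies the category, low byte the specific error.
    return (code >> 8) & 0xFF

def unknown_code_is_fatal(code: int, known: set[int]) -> bool:
    # Unknown codes are fatal only when the category is 0x01 (format error).
    return code not in known and error_category(code) == 0x01
```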

### Category 0x01: Format Errors

```
Code    Name                  Description
------  --------------------  ----------------------------------------
0x0100  INVALID_MAGIC         Segment magic mismatch (expected 0x52564653)
0x0101  INVALID_VERSION       Unsupported segment version
0x0102  INVALID_CHECKSUM      Segment hash verification failed
0x0103  INVALID_SIGNATURE     Cryptographic signature invalid
0x0104  TRUNCATED_SEGMENT     Segment payload shorter than declared length
0x0105  INVALID_MANIFEST      Root manifest validation failed
0x0106  MANIFEST_NOT_FOUND    No valid MANIFEST_SEG in file
0x0107  UNKNOWN_SEGMENT_TYPE  Segment type not recognized (warning, not fatal)
0x0108  ALIGNMENT_ERROR       Data not at expected 64B boundary
```

`UNKNOWN_SEGMENT_TYPE` is advisory. A reader encountering an unknown segment
type should skip it and continue. All other format errors in this category
are fatal for the affected segment.

### Category 0x02: Query Errors

```
Code    Name                  Description
------  --------------------  ----------------------------------------
0x0200  DIMENSION_MISMATCH    Query vector dimension != index dimension
0x0201  EMPTY_INDEX           No index segments available
0x0202  METRIC_UNSUPPORTED    Requested distance metric not available
0x0203  FILTER_PARSE_ERROR    Invalid filter expression
0x0204  K_TOO_LARGE           Requested K exceeds available vectors
0x0205  TIMEOUT               Query exceeded time budget
```

When `K_TOO_LARGE` is returned, the response still contains all available
results. The result count will be less than the requested K.

### Category 0x03: Write Errors

```
Code    Name                  Description
------  --------------------  ----------------------------------------
0x0300  LOCK_HELD             Another writer holds the lock
0x0301  LOCK_STALE            Lock file exists but owner process is dead
0x0302  DISK_FULL             Insufficient space for write
0x0303  FSYNC_FAILED          Durable write failed
0x0304  SEGMENT_TOO_LARGE     Segment exceeds 4 GB limit
0x0305  READ_ONLY             File opened in read-only mode
```

`LOCK_STALE` is informational. The runtime may attempt to break the stale
lock and retry. If recovery succeeds, the original operation proceeds with
an `OK` status.

### Category 0x04: Tile Errors (WASM Microkernel)

```
Code    Name                  Description
------  --------------------  ----------------------------------------
0x0400  TILE_TRAP             WASM trap (OOB, unreachable, stack overflow)
0x0401  TILE_OOM              Tile exceeded scratch memory (64 KB)
0x0402  TILE_TIMEOUT          Tile computation exceeded time budget
0x0403  TILE_INVALID_MSG      Malformed hub-tile message
0x0404  TILE_UNSUPPORTED_OP   Operation not available on this profile
```

All tile errors trigger the fault isolation protocol described in
`microkernel/wasm-runtime.md` section 8. The hub reassigns the tile's
work and optionally restarts the faulted tile.

### Category 0x05: Crypto Errors

```
Code    Name                  Description
------  --------------------  ----------------------------------------
0x0500  KEY_NOT_FOUND         Referenced key_id not in CRYPTO_SEG
0x0501  KEY_EXPIRED           Key past valid_until timestamp
0x0502  DECRYPT_FAILED        Decryption or auth tag verification failed
0x0503  ALGO_UNSUPPORTED      Cryptographic algorithm not implemented
```

Crypto errors are always fatal for the affected segment. An implementation
must not serve data from a segment that fails signature or decryption checks.

## 3. Batch Query API

### Wire Format: Request

Batch queries amortize connection overhead and enable the runtime to
schedule vector block loads across multiple queries simultaneously.

```
Offset  Size  Field              Description
------  ----  -----------------  ----------------------------------------
0x00    4     query_count        Number of queries in batch (max 1024)
0x04    4     k                  Shared top-K parameter
0x08    1     metric             Distance metric: 0=L2, 1=IP, 2=cosine, 3=hamming
0x09    3     reserved           Must be zero
0x0C    4     ef_search          HNSW ef_search parameter
0x10    4     shared_filter_len  Byte length of shared filter (0 = no filter)
0x14    var   shared_filter      Filter expression (applies to all queries)
var     var   queries[]          Per-query entries (see below)
```

Each query entry:

```
Offset  Size  Field       Description
------  ----  ----------  ----------------------------------------
0x00    4     query_id    Client-assigned correlation ID
0x04    2     dim         Vector dimensionality
0x06    1     dtype       Data type: 0=fp32, 1=fp16, 2=i8, 3=binary
0x07    1     flags       Bit 0: has per-query filter
0x08    var   vector      Query vector (dim * sizeof(dtype) bytes)
var     4     filter_len  Byte length of per-query filter (if flags bit 0)
var     var   filter      Per-query filter (overrides shared filter)
```

When both a shared filter and a per-query filter are present, the per-query
filter takes precedence. A per-query filter of zero length inherits the
shared filter.
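
The fixed portion of the request header is 0x14 (20) bytes. An illustrative encoder for the header plus the optional shared filter:

```python
import struct

def pack_batch_header(query_count: int, k: int, metric: int,
                      ef_search: int, shared_filter: bytes = b"") -> bytes:
    # "<IIB3xII": query_count(u32), k(u32), metric(u8), 3 reserved
    # zero bytes, ef_search(u32), shared_filter_len(u32) = 0x14 bytes.
    hdr = struct.pack("<IIB3xII", query_count, k, metric, ef_search,
                      len(shared_filter))
    return hdr + shared_filter
```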

### Wire Format: Response

```
Offset  Size  Field        Description
------  ----  -----------  ----------------------------------------
0x00    4     query_count  Number of query results
0x04    var   results[]    Per-query result entries
```

Each result entry:

```
Offset  Size  Field         Description
------  ----  ------------  ----------------------------------------
0x00    4     query_id      Correlation ID from request
0x04    2     status        Error code (0x0000 = OK)
0x06    2     reserved      Must be zero
0x08    4     result_count  Number of results returned
0x0C    var   results[]     Array of (vector_id: u64, distance: f32) pairs
```

Each result pair is 12 bytes: 8 bytes for the vector ID followed by 4 bytes
for the distance value. Results are sorted by distance ascending (nearest first).

### Batch Scheduling

The runtime should process batch queries using the following strategy:

1. Parse all query vectors and load them into memory
2. Identify shared segments across queries (block deduplication)
3. Load each vector block once and evaluate all relevant queries against it
4. Merge per-query top-K heaps independently
5. Return results as soon as each query completes (streaming response)

This amortizes I/O: if N queries touch the same vector block, the block is
read once instead of N times.
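
The block-deduplication step (2-3 above) is an inversion of the query-to-block mapping, so each block is loaded once and evaluated for every query that needs it:

```python
from collections import defaultdict

def schedule_block_loads(query_blocks: dict[int, set[int]]) -> dict[int, list[int]]:
    # Invert "query -> blocks it touches" into "block -> queries",
    # so each vector block is loaded exactly once.
    by_block = defaultdict(list)
    for qid, blocks in query_blocks.items():
        for blk in blocks:
            by_block[blk].append(qid)
    return dict(by_block)
```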

## 4. Batch Ingest API

### Wire Format: Request

```
Offset Size   Field              Description
------ ------ ------------------ ----------------------------------------
0x00   4      vector_count       Number of vectors to ingest (max 65536)
0x04   2      dim                Vector dimensionality
0x06   1      dtype              Data type: 0=fp32, 1=fp16, 2=i8, 3=binary
0x07   1      flags              Bit 0: metadata_included
0x08   var    vectors[]          Vector entries
var    var    metadata[]         Metadata entries (if flags bit 0)
```

Each vector entry:

```
Offset Size   Field              Description
------ ------ ------------------ ----------------------------------------
0x00   8      vector_id          Globally unique vector ID
0x08   var    vector             Vector data (dim * sizeof(dtype) bytes)
```

Each metadata entry (when metadata_included is set):

```
Offset Size   Field              Description
------ ------ ------------------ ----------------------------------------
0x00   2      field_count        Number of metadata fields
0x02   var    fields[]           Field entries
```

Each metadata field:

```
Offset Size   Field              Description
------ ------ ------------------ ----------------------------------------
0x00   2      field_id           Field identifier (application-defined)
0x02   1      value_type         0=u64, 1=i64, 2=f64, 3=string, 4=bytes
0x03   var    value              Encoded value (u64/i64/f64: 8B; string/bytes: 4B length + data)
```

### Wire Format: Response

```
Offset Size   Field              Description
------ ------ ------------------ ----------------------------------------
0x00   4      accepted_count     Number of vectors accepted
0x04   4      rejected_count     Number of vectors rejected
0x08   4      manifest_epoch     Epoch of manifest after commit
0x0C   var    rejected_ids[]     Array of rejected vector IDs (u64 * rejected_count)
var    var    rejected_reasons[] Array of error codes (u16 * rejected_count)
```

The `manifest_epoch` field is the epoch of the MANIFEST_SEG written after the
ingest is committed. Clients can use this value to confirm that a subsequent
read will include the ingested vectors.

### Ingest Commit Semantics

1. The runtime writes vectors to a new VEC_SEG (append-only)
2. If metadata is included, a META_SEG is appended
3. Both segments are fsynced
4. A new MANIFEST_SEG is written referencing the new segments
5. The manifest is fsynced
6. The response is sent with the new manifest_epoch

Vectors are visible to queries only after step 6 completes.
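
The ordering above is the whole durability argument: data segments are fsynced before the manifest that references them, and the manifest is fsynced before the client learns the new epoch. A minimal sketch, using an in-memory event log in place of real segment writes and fsyncs (the function and log strings are illustrative):

```rust
/// Model the six commit steps as an ordered log so the ordering invariant
/// can be checked: manifest fsync always precedes the response.
pub fn commit_ingest(with_metadata: bool) -> Vec<&'static str> {
    let mut log = Vec::new();
    log.push("append VEC_SEG");                    // step 1 (append-only)
    if with_metadata {
        log.push("append META_SEG");               // step 2
    }
    log.push("fsync data segments");               // step 3
    log.push("append MANIFEST_SEG");               // step 4
    log.push("fsync manifest");                    // step 5
    log.push("send response with manifest_epoch"); // step 6: now visible
    log
}
```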

## 5. Batch Delete API

### Wire Format: Request

```
Offset Size   Field              Description
------ ------ ------------------ ----------------------------------------
0x00   1      delete_type        0=by_id, 1=by_range, 2=by_filter
0x01   3      reserved           Must be zero
0x04   var    payload            Type-specific payload (see below)
```

Delete by ID (`delete_type = 0`):

```
Offset Size   Field              Description
------ ------ ------------------ ----------------------------------------
0x00   4      count              Number of IDs to delete
0x04   var    ids[]              Array of vector IDs (u64 * count)
```

Delete by range (`delete_type = 1`):

```
Offset Size   Field              Description
------ ------ ------------------ ----------------------------------------
0x00   8      start_id           Start of range (inclusive)
0x08   8      end_id             End of range (exclusive)
```

Delete by filter (`delete_type = 2`):

```
Offset Size   Field              Description
------ ------ ------------------ ----------------------------------------
0x00   4      filter_len         Byte length of filter expression
0x04   var    filter             Filter expression
```

### Wire Format: Response

```
Offset Size   Field              Description
------ ------ ------------------ ----------------------------------------
0x00   8      deleted_count      Number of vectors deleted
0x08   2      status             Error code (0x0000 = OK)
0x0A   2      reserved           Must be zero
0x0C   4      manifest_epoch     Epoch of manifest after delete committed
```

### Delete Mechanics

Deletes are logical. The runtime appends a JOURNAL_SEG containing tombstone
entries for the deleted vector IDs. The new MANIFEST_SEG marks affected
VEC_SEGs as partially dead. Physical reclamation happens during compaction.

## 6. Network Streaming Protocol

### 6.1 HTTP Range Requests (Read-Only Access)

RVF's progressive loading model maps naturally to HTTP byte-range requests.
A client can boot from a remote `.rvf` file and become queryable without
downloading the entire file.

**Phase 1: Boot (mandatory)**

```
GET /file.rvf    Range: bytes=-4096
```

Retrieves the last 4 KB of the file. This contains the Level 0 root manifest
(MANIFEST_SEG). The client parses hotset pointers, the segment directory, and
the profile ID.

If the file is smaller than 4 KB, the entire file is returned. If the last
4 KB does not contain a valid MANIFEST_SEG, the client extends the range
backward in 4 KB increments until one is found or 1 MB is scanned (at which
point it returns `MANIFEST_NOT_FOUND`).
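
The backward scan can be sketched over an in-memory tail buffer. Note the 4-byte `MANIFEST_MAGIC` marker here is hypothetical; a real reader locates the MANIFEST_SEG by parsing full segment headers, but the windowing logic (grow by 4 KB, give up at 1 MB) is the part specified above:

```rust
const MANIFEST_MAGIC: &[u8; 4] = b"RVMF"; // hypothetical marker for the sketch

/// Scan backward from EOF for the latest manifest marker, widening the
/// window in 4 KB steps and giving up after 1 MB (MANIFEST_NOT_FOUND).
pub fn find_tail_manifest(file: &[u8]) -> Option<usize> {
    const STEP: usize = 4096;
    const LIMIT: usize = 1 << 20;
    let mut window = STEP.min(file.len());
    loop {
        let start = file.len() - window;
        // rposition: prefer the latest (closest-to-EOF) occurrence.
        if let Some(pos) = file[start..]
            .windows(4)
            .rposition(|w| w == MANIFEST_MAGIC.as_slice())
        {
            return Some(start + pos);
        }
        if window >= file.len() || window >= LIMIT {
            return None; // MANIFEST_NOT_FOUND
        }
        window = (window + STEP).min(file.len());
    }
}
```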

**Phase 2: Hotset (parallel, mandatory for queries)**

Using offsets from the Level 0 manifest, the client issues up to 5 parallel
range requests:

```
GET /file.rvf    Range: bytes=<entrypoint_offset>-<entrypoint_end>
GET /file.rvf    Range: bytes=<toplayer_offset>-<toplayer_end>
GET /file.rvf    Range: bytes=<centroid_offset>-<centroid_end>
GET /file.rvf    Range: bytes=<quantdict_offset>-<quantdict_end>
GET /file.rvf    Range: bytes=<hotcache_offset>-<hotcache_end>
```

These fetch the HNSW entry point, top-layer graph, routing centroids,
quantization dictionary, and the hot cache (HOT_SEG). After these 5 requests
complete, the system is queryable with recall >= 0.7.

**Phase 3: Level 1 (background)**

```
GET /file.rvf    Range: bytes=<l1_offset>-<l1_end>
```

Fetches the Level 1 manifest containing the full segment directory. This
enables the client to discover all segments and plan on-demand fetches.

**Phase 4: On-demand (per query)**

For queries that require cold data not yet fetched:

```
GET /file.rvf    Range: bytes=<segment_offset>-<segment_end>
```

The client caches fetched segments locally. Repeated queries against the
same data region do not trigger additional requests.

### HTTP Requirements

- Server must support `Accept-Ranges: bytes`
- Server must return `206 Partial Content` for range requests
- Server should support multiple ranges in a single request (`multipart/byteranges`)
- Client should use `If-None-Match` with the file's ETag to detect stale caches

### 6.2 TCP Streaming Protocol (Real-Time Access)

For real-time ingest and low-latency queries, RVF defines a binary TCP
protocol over TLS 1.3.

**Connection Setup**

```
1. Client opens TCP connection to server
2. TLS 1.3 handshake (mandatory, no plaintext mode)
3. Client sends HELLO message with protocol version and capabilities
4. Server responds with HELLO_ACK confirming capabilities
5. Connection is ready for messages
```

**Framing**

All messages are length-prefixed:

```
Offset Size   Field              Description
------ ------ ------------------ ----------------------------------------
0x00   4      frame_length       Payload length (big-endian, max 16 MB)
0x04   1      msg_type           Message type (see below)
0x05   3      msg_id             Correlation ID (big-endian, wraps at 2^24)
0x08   var    payload            Message-specific payload
```

Frame length is big-endian (network byte order) for consistency with TLS
framing. The 16 MB maximum prevents a single message from monopolizing the
connection. Payloads larger than 16 MB must be split across multiple messages
using continuation framing (see section 6.4).
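
The 8-byte header layout can be sketched as a codec. Function names are illustrative; the layout (big-endian u32 length, 1-byte type, 3-byte big-endian correlation ID wrapping at 2^24) is from the table above:

```rust
/// Encode the 8-byte frame header.
pub fn encode_frame_header(frame_length: u32, msg_type: u8, msg_id: u32) -> [u8; 8] {
    assert!(frame_length <= 16 * 1024 * 1024, "max 16 MB payload");
    let mut h = [0u8; 8];
    h[0..4].copy_from_slice(&frame_length.to_be_bytes());
    h[4] = msg_type;
    // msg_id occupies 3 bytes, big-endian, and wraps at 2^24.
    h[5..8].copy_from_slice(&(msg_id & 0x00FF_FFFF).to_be_bytes()[1..4]);
    h
}

/// Decode (frame_length, msg_type, msg_id) from the 8-byte header.
pub fn decode_frame_header(h: &[u8; 8]) -> (u32, u8, u32) {
    let frame_length = u32::from_be_bytes(h[0..4].try_into().unwrap());
    let msg_type = h[4];
    let msg_id = u32::from_be_bytes([0, h[5], h[6], h[7]]);
    (frame_length, msg_type, msg_id)
}
```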

**Message Types**

```
Client -> Server:
  0x01  QUERY          Batch query (payload = Batch Query Request)
  0x02  INGEST         Batch ingest (payload = Batch Ingest Request)
  0x03  DELETE         Batch delete (payload = Batch Delete Request)
  0x04  STATUS         Request server status (no payload)
  0x05  SUBSCRIBE      Subscribe to update notifications

Server -> Client:
  0x81  QUERY_RESULT   Batch query result
  0x82  INGEST_ACK     Batch ingest acknowledgment
  0x83  DELETE_ACK     Batch delete acknowledgment
  0x84  STATUS_RESP    Server status response
  0x85  UPDATE_NOTIFY  Push notification of new data
  0xFF  ERROR          Error with code and description
```

**ERROR Message Payload**

```
Offset Size   Field              Description
------ ------ ------------------ ----------------------------------------
0x00   2      error_code         Error code from section 2
0x02   2      description_len    Byte length of description string
0x04   var    description        UTF-8 error description (human-readable)
```

### 6.3 Streaming Ingest Protocol

The TCP protocol supports continuous ingest where the client streams vectors
without waiting for per-batch acknowledgments.

**Flow**

```
Client                                Server
  |                                   |
  |--- INGEST (batch 0) ------------->|
  |--- INGEST (batch 1) ------------->|  Pipelining: send without waiting
  |--- INGEST (batch 2) ------------->|
  |                                   |  Server writes VEC_SEGs, appends manifest
  |<--- INGEST_ACK (batch 0) ---------|
  |<--- INGEST_ACK (batch 1) ---------|
  |                                   |  Backpressure: server delays ACK
  |--- INGEST (batch 3) ------------->|  Client respects window
  |<--- INGEST_ACK (batch 2) ---------|
  |                                   |
```

**Backpressure**

The server controls ingest rate by delaying INGEST_ACK responses. The client
must limit its in-flight (unacknowledged) ingest messages to a configurable
window size (default: 8 messages). When the window is full, the client must
wait for an ACK before sending the next batch.
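
A minimal sketch of the client-side window, assuming a default limit of 8. The `Window` type is illustrative; a real client would block or buffer instead of returning a bool:

```rust
/// Track in-flight (unacknowledged) INGEST messages against a window limit.
pub struct Window {
    limit: usize,
    in_flight: usize,
}

impl Window {
    pub fn new(limit: usize) -> Self {
        Window { limit, in_flight: 0 }
    }

    /// Try to claim a send slot; false means wait for an INGEST_ACK first.
    pub fn try_send(&mut self) -> bool {
        if self.in_flight < self.limit {
            self.in_flight += 1;
            true
        } else {
            false
        }
    }

    /// Release a slot when an INGEST_ACK arrives.
    pub fn on_ack(&mut self) {
        self.in_flight = self.in_flight.saturating_sub(1);
    }
}
```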

The server should send backpressure when:
- Write queue exceeds 80% capacity
- Compaction is falling behind (dead space > 50%)
- Available disk space drops below 10%

**Commit Semantics**

Each INGEST_ACK contains the `manifest_epoch` after commit. The server
guarantees that all vectors acknowledged with epoch E are visible to any
query that reads the manifest at epoch >= E.

### 6.4 Continuation Framing

For payloads exceeding the 16 MB frame limit:

```
Frame 0: msg_type = original type, flags bit 0 = CONTINUATION_START
Frame 1: msg_type = 0x00 (CONTINUATION), flags bit 0 = 0
Frame 2: msg_type = 0x00 (CONTINUATION), flags bit 0 = 0
Frame N: msg_type = 0x00 (CONTINUATION), flags bit 1 = CONTINUATION_END
```

The receiver reassembles the payload from all continuation frames before
processing. The msg_id is shared across all frames of a continuation sequence.

### 6.5 SUBSCRIBE and UPDATE_NOTIFY

The SUBSCRIBE message registers the client for push notifications when new
data is committed:

```
SUBSCRIBE payload:
Offset Size   Field              Description
------ ------ ------------------ ----------------------------------------
0x00   4      min_epoch          Only notify for epochs > this value
0x04   1      notify_flags       Bit 0: ingest, Bit 1: delete, Bit 2: compaction
0x05   3      reserved           Must be zero
```

The server sends UPDATE_NOTIFY whenever a new MANIFEST_SEG is committed that
matches the subscription criteria:

```
UPDATE_NOTIFY payload:
Offset Size   Field              Description
------ ------ ------------------ ----------------------------------------
0x00   4      epoch              New manifest epoch
0x04   1      event_type         0=ingest, 1=delete, 2=compaction
0x05   3      reserved           Must be zero
0x08   4      affected_count     Number of vectors affected
0x0C   8      new_total          Total vector count after event
```

## 7. Compaction Scheduling Policy

Compaction merges small, overlapping, or partially-dead segments into larger,
sealed segments. Because compaction competes with queries and ingest for I/O
bandwidth, the runtime enforces a scheduling policy.

### 7.1 IO Budget

Compaction must consume at most 30% of available IOPS. The runtime measures
IOPS over a 5-second sliding window and throttles compaction I/O to stay
within budget.

```
available_iops      = measured_iops_capacity (from benchmarking at startup)
compaction_budget   = available_iops * 0.30
compaction_throttle = max(compaction_budget - current_compaction_iops, 0)
```
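
The budget arithmetic above is direct to translate. A sketch (parameter names are illustrative; `measured_iops_capacity` comes from the startup benchmark):

```rust
/// Remaining compaction IOPS headroom: 30% of measured capacity, minus what
/// compaction is already consuming, clamped at zero.
pub fn compaction_throttle(measured_iops_capacity: f64, current_compaction_iops: f64) -> f64 {
    let compaction_budget = measured_iops_capacity * 0.30;
    (compaction_budget - current_compaction_iops).max(0.0)
}
```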

### 7.2 Priority Ordering

When I/O bandwidth is contended, operations are prioritized:

```
Priority 1 (highest): Queries    (reads from VEC_SEG, INDEX_SEG, HOT_SEG)
Priority 2:           Ingest     (writes to VEC_SEG, META_SEG, MANIFEST_SEG)
Priority 3 (lowest):  Compaction (reads + writes of sealed segments)
```

Compaction yields to queries and ingest. If a compaction I/O operation would
cause a query to exceed its time budget, the compaction operation is deferred.

### 7.3 Scheduling Triggers

Compaction runs when all of the following conditions are met:

| Condition | Threshold | Rationale |
|-----------|-----------|-----------|
| Query load | < 50% of capacity | Avoid competing with active queries |
| Dead space ratio | > 20% of total file size | Smaller amounts are not worth compacting |
| Segment count | > 32 active segments | Many small segments hurt read performance |
| Time since last compaction | > 60 seconds | Prevent compaction storms |

The runtime evaluates these conditions every 10 seconds.
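
Because all four conditions must hold, the check reduces to a conjunction. A sketch with illustrative names mirroring the table:

```rust
/// Inputs to the section 7.3 trigger check, sampled every 10 seconds.
pub struct CompactionStats {
    pub query_load_pct: u32,     // current query load, % of capacity
    pub dead_space_pct: u32,     // dead bytes as % of total file size
    pub active_segments: u32,
    pub secs_since_last_run: u64,
}

/// All four thresholds from the table must be met.
pub fn should_compact(s: &CompactionStats) -> bool {
    s.query_load_pct < 50             // avoid competing with active queries
        && s.dead_space_pct > 20      // enough dead space to be worth it
        && s.active_segments > 32     // many small segments hurt reads
        && s.secs_since_last_run > 60 // prevent compaction storms
}
```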

### 7.4 Emergency Compaction

If dead space exceeds 70% of total file size, compaction enters emergency mode:

```
Emergency compaction rules:
1. Compaction preempts ingest (ingest is paused, not rejected)
2. IO budget increases to 60% of available IOPS
3. Compaction runs regardless of query load
4. Ingest resumes after dead space drops below 50%
```

During emergency compaction, the server responds to INGEST messages with
delayed ACKs (backpressure) rather than rejecting them. Queries continue to
be served at highest priority.

### 7.5 Compaction Progress Reporting

The STATUS response includes compaction state:

```
STATUS_RESP compaction fields:
Offset Size   Field               Description
------ ------ ------------------- ----------------------------------------
0x00   1      compaction_state    0=idle, 1=running, 2=emergency
0x01   1      progress_pct        Completion percentage (0-100)
0x02   2      reserved            Must be zero
0x04   8      dead_bytes          Total dead space in bytes
0x0C   8      total_bytes         Total file size in bytes
0x14   4      segments_remaining  Segments left to compact
0x18   4      segments_completed  Segments compacted in current run
0x1C   4      estimated_seconds   Estimated time to completion
0x20   4      io_budget_pct       Current IO budget percentage (30 or 60)
```

### 7.6 Compaction Segment Selection

The runtime selects segments for compaction using a tiered strategy:

```
1. Tombstoned segments:     Always compacted first (reclaim dead space)
2. Small VEC_SEGs:          Segments < 1 MB merged into larger segments
3. High-overlap INDEX_SEGs: Index segments covering the same ID range
4. Cold OVERLAY_SEGs:       Overlay deltas merged into base segments
```

The compaction output is always a sealed segment (SEALED flag set). Sealed
segments are immutable and can be verified independently.

## 8. STATUS Response Format

The STATUS message provides a snapshot of the server state for monitoring
and diagnostics.

```
STATUS_RESP payload:
Offset Size   Field               Description
------ ------ ------------------- ----------------------------------------
0x00   4      protocol_version    Protocol version (currently 1)
0x04   4      manifest_epoch      Current manifest epoch
0x08   8      total_vectors       Total vector count
0x10   8      total_segments      Total segment count
0x18   8      file_size_bytes     Total file size
0x20   4      query_qps           Queries per second (last 5s window)
0x24   4      ingest_vps          Vectors ingested per second (last 5s window)
0x28   36     compaction          Compaction state (see section 7.5)
0x4C   1      profile_id          Active hardware profile (0x00-0x03)
0x4D   1      health              0=healthy, 1=degraded, 2=read_only
0x4E   2      reserved            Must be zero
0x50   4      uptime_seconds      Server uptime
```

## 9. Filter Expression Format

Filter expressions used in batch queries and batch deletes share a common
binary encoding:

```
Offset Size   Field              Description
------ ------ ------------------ ----------------------------------------
0x00   1      op                 Operator enum (see below)
0x01   2      field_id           Metadata field to filter on
0x03   1      value_type         Value type (matches metadata field types)
0x04   var    value              Comparison value
var    var    children[]         Sub-expressions (for AND/OR/NOT)
```

Operator enum:

```
0x00  EQ     field == value
0x01  NE     field != value
0x02  LT     field < value
0x03  LE     field <= value
0x04  GT     field > value
0x05  GE     field >= value
0x06  IN     field in [values]
0x07  RANGE  field in [low, high)
0x10  AND    All children must match
0x11  OR     Any child must match
0x12  NOT    Negate single child
```

Filters are evaluated during the query scan phase. Vectors that do not match
the filter are excluded from distance computation entirely (pre-filtering) or
from the result set (post-filtering), depending on the runtime's cost model.
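
Evaluation of the expression tree is a straightforward recursion. A minimal sketch restricted to u64 metadata values and a subset of operators; the in-memory `Filter` enum stands in for the binary encoding above and is not the wire decoder:

```rust
use std::collections::HashMap;

/// In-memory stand-in for a decoded filter expression (subset of operators).
pub enum Filter {
    Eq(u16, u64),         // EQ:    field == value
    Range(u16, u64, u64), // RANGE: field in [low, high)
    And(Vec<Filter>),     // all children must match
    Or(Vec<Filter>),      // any child must match
    Not(Box<Filter>),     // negate single child
}

/// Recursively evaluate a filter against one vector's metadata fields.
pub fn eval_filter(f: &Filter, meta: &HashMap<u16, u64>) -> bool {
    match f {
        Filter::Eq(id, v) => meta.get(id) == Some(v),
        Filter::Range(id, lo, hi) => meta.get(id).map_or(false, |x| x >= lo && x < hi),
        Filter::And(cs) => cs.iter().all(|c| eval_filter(c, meta)),
        Filter::Or(cs) => cs.iter().any(|c| eval_filter(c, meta)),
        Filter::Not(c) => !eval_filter(c, meta),
    }
}
```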

## 10. Invariants

1. Error codes are stable across versions; new codes are additive only
2. Batch operations are atomic per-item, not per-batch (partial success is valid)
3. TCP connections are always TLS 1.3; plaintext is not permitted
4. Frame length is big-endian; all other multi-byte fields are little-endian
5. HTTP progressive loading must succeed with at most 7 round trips to become queryable
6. Compaction never runs at more than 60% of available IOPS, even in emergency mode
7. The STATUS response is always available, even during emergency compaction
8. Filter expressions are limited to 64 levels of nesting depth

420
vendor/ruvector/docs/research/rvf/spec/11-wasm-bootstrap.md
vendored
Normal File
@@ -0,0 +1,420 @@

# RVF WASM Self-Bootstrapping Specification

## 1. Motivation

Traditional file formats require an external runtime to interpret their contents.
A JPEG needs an image decoder. A SQLite database needs the SQLite library. An RVF
file needs a vector search engine.

What if the file carried its own runtime?

By embedding a tiny WASM interpreter inside the RVF file itself, we eliminate the
last external dependency. The host only needs **raw execution capability** — the
ability to run bytes as instructions. RVF becomes **self-bootstrapping**: a single
file that contains both its data and the complete machinery to process that data.

This is the transition from "needs a compatible runtime" to **"runs anywhere
compute exists."**

## 2. Architecture

### The Bootstrap Stack

```
Layer 3: RVF Data Segments (VEC_SEG, INDEX_SEG, MANIFEST_SEG, ...)
         ^
         | processes
         |
Layer 2: WASM Microkernel (WASM_SEG, role=Microkernel, ~5.5 KB)
         ^   14 exports: query, ingest, distance, top-K
         | executes
         |
Layer 1: WASM Interpreter (WASM_SEG, role=Interpreter, ~50 KB)
         ^   Minimal stack machine that runs WASM bytecode
         | loads
         |
Layer 0: Raw Bytes (The .rvf file on any storage medium)
```

Each layer depends only on the one below it. The host reads Layer 0 (raw bytes),
finds the interpreter at Layer 1, uses it to execute the microkernel at Layer 2,
which then processes the data at Layer 3.

### Segment Layout

```
┌──────────────────────────────────────────────────────────────────────┐
│  bootable.rvf                                                        │
│                                                                      │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐  ┌─────────┐   │
│  │ WASM_SEG     │  │ WASM_SEG     │  │ VEC_SEG      │  │ INDEX   │   │
│  │ 0x10         │  │ 0x10         │  │ 0x01         │  │ _SEG    │   │
│  │              │  │              │  │              │  │ 0x02    │   │
│  │ role=Interp  │  │ role=uKernel │  │ 10M vectors  │  │ HNSW    │   │
│  │ ~50 KB       │  │ ~5.5 KB      │  │ 384-dim fp16 │  │ L0+L1   │   │
│  │ priority=0   │  │ priority=1   │  │              │  │         │   │
│  └──────────────┘  └──────────────┘  └──────────────┘  └─────────┘   │
│                                                                      │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐                │
│  │ QUANT_SEG    │  │ WITNESS_SEG  │  │ MANIFEST_SEG │  ← tail        │
│  │ codebooks    │  │ audit trail  │  │ source of    │                │
│  │              │  │              │  │ truth        │                │
│  └──────────────┘  └──────────────┘  └──────────────┘                │
└──────────────────────────────────────────────────────────────────────┘
```

## 3. WASM_SEG Wire Format

### Segment Type

```
Value: 0x10
Name:  WASM_SEG
```

Uses the standard 64-byte RVF segment header (`SegmentHeader`), followed by
a 64-byte `WasmHeader`, followed by the WASM bytecode.

### WasmHeader (64 bytes)

```
Offset Size Type    Field              Description
------ ---- ------- ------------------ -----------
0x00   4    u32     wasm_magic         0x5256574D ("RVWM" big-endian)
0x04   2    u16     header_version     Currently 1
0x06   1    u8      role               Bootstrap role (see WasmRole enum)
0x07   1    u8      target             Target platform (see WasmTarget enum)
0x08   2    u16     required_features  WASM feature bitfield
0x0A   2    u16     export_count       Number of WASM exports
0x0C   4    u32     bytecode_size      Uncompressed bytecode size (bytes)
0x10   4    u32     compressed_size    Compressed size (0 = no compression)
0x14   1    u8      compression        0=none, 1=LZ4, 2=ZSTD
0x15   1    u8      min_memory_pages   Minimum linear memory (64 KB each)
0x16   1    u8      max_memory_pages   Maximum linear memory (0 = no limit)
0x17   1    u8      table_count        Number of WASM tables
0x18   32   hash256 bytecode_hash      SHAKE-256-256 of uncompressed bytecode
0x38   1    u8      bootstrap_priority Lower = tried first in chain
0x39   1    u8      interpreter_type   Interpreter variant (if role=Interpreter)
0x3A   6    u8[6]   reserved           Must be zero
```
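
A sketch of validating the fixed-offset fields. It assumes the magic is stored as the bytes `RVWM` (as the big-endian parenthetical suggests) and that `header_version` is little-endian; struct and function names are illustrative:

```rust
/// Subset of WasmHeader fields read by this sketch.
pub struct WasmHeaderInfo {
    pub role: u8,
    pub bytecode_size: u32,
}

/// Validate magic and version, then pull out role and bytecode_size.
pub fn parse_wasm_header(h: &[u8; 64]) -> Result<WasmHeaderInfo, &'static str> {
    // wasm_magic at 0x00: 0x5256574D stored big-endian reads as "RVWM".
    if &h[0..4] != b"RVWM" {
        return Err("bad wasm_magic");
    }
    // header_version at 0x04 (assumed little-endian).
    if u16::from_le_bytes([h[4], h[5]]) != 1 {
        return Err("unsupported header_version");
    }
    Ok(WasmHeaderInfo {
        role: h[6],                                                       // 0x06
        bytecode_size: u32::from_le_bytes(h[12..16].try_into().unwrap()), // 0x0C
    })
}
```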

### WasmRole Enum

```
Value Name         Description
----- ----         -----------
0x00  Microkernel  RVF query engine (5.5 KB Cognitum tile runtime)
0x01  Interpreter  Minimal WASM interpreter for self-bootstrapping
0x02  Combined     Interpreter + microkernel linked together
0x03  Extension    Domain-specific module (custom distance, decoder)
0x04  ControlPlane Store management (create, export, segment parsing)
```

### WasmTarget Enum

```
Value Name     Description
----- ----     -----------
0x00  Wasm32   Generic wasm32 (any compliant runtime)
0x01  WasiP1   WASI Preview 1 (requires WASI syscalls)
0x02  WasiP2   WASI Preview 2 (component model)
0x03  Browser  Browser-optimized (expects Web APIs)
0x04  BareTile Bare-metal Cognitum tile (hub-tile protocol only)
```

### Required Features Bitfield

```
Bit  Mask    Feature
---  ----    -------
0    0x0001  SIMD (v128 operations)
1    0x0002  Bulk memory operations
2    0x0004  Multi-value returns
3    0x0008  Reference types
4    0x0010  Threads (shared memory)
5    0x0020  Tail call optimization
6    0x0040  GC (garbage collection)
7    0x0080  Exception handling
```

### Interpreter Type (when role=Interpreter)

```
Value Name            Description
----- ----            -----------
0x00  StackMachine    Generic stack-based interpreter
0x01  Wasm3Compatible wasm3-style (register machine)
0x02  WamrCompatible  WAMR-style (AOT + interpreter)
0x03  WasmiCompatible wasmi-style (pure stack machine)
```

## 4. Bootstrap Resolution Protocol

### Discovery

1. Scan all segments for `seg_type == 0x10` (WASM_SEG)
2. Parse the 64-byte WasmHeader from each
3. Validate `wasm_magic == 0x5256574D`
4. Sort by `bootstrap_priority` ascending

### Resolution

```
IF any WASM_SEG has role=Combined:
    → SelfContained bootstrap (single module does everything)

ELIF WASM_SEGs with role=Interpreter and role=Microkernel both exist:
    → TwoStage bootstrap (interpreter runs microkernel)

ELIF only a WASM_SEG with role=Microkernel exists:
    → HostRequired (needs external WASM runtime)

ELSE:
    → No WASM bootstrap available
```
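
The resolution rules are a pure function of the roles discovered. A sketch using the WasmRole codes from section 3 (`BootstrapMode` and the function name are illustrative):

```rust
#[derive(Debug, PartialEq)]
pub enum BootstrapMode {
    SelfContained, // role=Combined present
    TwoStage,      // Interpreter + Microkernel present
    HostRequired,  // only Microkernel present
    None,
}

/// Role codes: 0x00 Microkernel, 0x01 Interpreter, 0x02 Combined.
/// Rules are checked in the IF/ELIF order of the pseudocode above.
pub fn resolve_bootstrap(roles: &[u8]) -> BootstrapMode {
    let has = |r: u8| roles.contains(&r);
    if has(0x02) {
        BootstrapMode::SelfContained
    } else if has(0x01) && has(0x00) {
        BootstrapMode::TwoStage
    } else if has(0x00) {
        BootstrapMode::HostRequired
    } else {
        BootstrapMode::None
    }
}
```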

### Execution Sequence (Two-Stage)

```
Host                     Interpreter              Microkernel         Data
  |                        |                        |                   |
  |-- read WASM_SEG[0] --->|                        |                   |
  |  (interpreter bytes)   |                        |                   |
  |                        |                        |                   |
  |-- instantiate -------->|                        |                   |
  |  (load into memory)    |                        |                   |
  |                        |                        |                   |
  |-- feed WASM_SEG[1] --->|-- instantiate -------->|                   |
  |  (microkernel bytes)   |  (via interpreter)     |                   |
  |                        |                        |                   |
  |-- LOAD_QUERY --------->|------- forward ------->|                   |
  |                        |                        |-- read VEC_SEG -->|
  |                        |                        |<- vector block ---|
  |                        |                        |                   |
  |                        |                        |  rvf_distances()  |
  |                        |                        |  rvf_topk_merge() |
  |                        |                        |                   |
  |<-- TOPK_RESULT --------|<------ return ---------|                   |
```

## 5. Size Budget

### Microkernel (role=Microkernel)

Already specified in `microkernel/wasm-runtime.md`:

```
Total:   ~5,500 bytes (< 8 KB code budget)
Exports: 14 (query path + quantization + HNSW + verification)
Memory:  8 KB data + 64 KB SIMD scratch
```

### Interpreter (role=Interpreter)

Target: minimal WASM bytecode interpreter sufficient to run the microkernel.

```
Component                   Estimated Size
---------                   --------------
WASM binary parser          4 KB
  (magic, section parsing)
Type section decoder        1 KB
  (function types)
Import/Export resolution    2 KB
Code section interpreter    12 KB
  (control flow, locals)
Stack machine engine        8 KB
  (operand stack, call stack)
Memory management           3 KB
  (linear memory, grow)
i32/i64 integer ops         4 KB
  (add, sub, mul, div, rem, shifts)
f32/f64 float ops           6 KB
  (add, sub, mul, div, sqrt, conversions)
v128 SIMD ops (optional)    8 KB
  (only if WASM_FEAT_SIMD required)
Table + call_indirect       2 KB
                            ----------
Total (no SIMD):            ~42 KB
Total (with SIMD):          ~50 KB
```

### Combined (role=Combined)

Interpreter linked with microkernel in a single module:

```
Total: ~48-56 KB (interpreter + microkernel, with overlap eliminated)
```

### Self-Bootstrapping Overhead

For a 10M vector file (~7.3 GB at 384-dim fp16):
- Bootstrap overhead: ~56 KB / ~7.3 GB = **0.0008%**
- The file is 99.9992% data, 0.0008% self-sufficient runtime

For a 1000-vector file (~750 KB):
- Bootstrap overhead: ~56 KB / ~750 KB = **7.5%**
- Still practical for edge/IoT deployments

## 6. Execution Tiers (Extended)

The original three-tier model from ADR-030 is extended:

| Tier | Segment | Size | Boot | Self-Bootstrap? |
|------|---------|------|------|-----------------|
| 0: Embedded WASM Interpreter | WASM_SEG (role=Interpreter) | ~50 KB | <5 ms | **Yes** — file carries its own runtime |
| 1: WASM Microkernel | WASM_SEG (role=Microkernel) | 5.5 KB | <1 ms | No — needs host or Tier 0 |
| 2: eBPF | EBPF_SEG | 10-50 KB | <20 ms | No — needs Linux kernel |
| 3: Unikernel | KERNEL_SEG | 200 KB-2 MB | <125 ms | No — needs VMM (Firecracker) |

**Key insight**: Tier 0 makes all other tiers optional. An RVF file with
Tier 0 embedded runs on *any* host that can execute bytes — bare metal,
browser, microcontroller, FPGA with a soft CPU, or even another WASM runtime.

## 7. "Runs Anywhere Compute Exists"

### What This Means

A self-bootstrapping RVF file requires exactly **one capability** from its host:

> The ability to read bytes from storage and execute them as instructions.

That's it. No operating system. No file system. No network stack. No runtime
library. No package manager. No container engine.

### Where It Runs

| Host | How It Works |
|------|-------------|
| **x86 server** | Native WASM runtime (Wasmtime/WAMR) runs microkernel directly |
| **ARM edge device** | Same — native WASM runtime |
| **Browser tab** | `WebAssembly.instantiate()` on the microkernel bytes |
| **Microcontroller** | Embedded interpreter runs microkernel in 64 KB scratch |
| **FPGA soft CPU** | Interpreter mapped to BRAM, microkernel in flash |
| **Another WASM runtime** | Interpreter-in-WASM runs microkernel-in-WASM (turtles) |
| **Bare metal** | Bootloader extracts interpreter, interpreter runs microkernel |
| **TEE enclave** | Enclave loads interpreter, verified via WITNESS_SEG attestation |

### The Bootstrapping Invariant

For any host `H` with execution capability `E`:

```
∀ H, E: can_execute(H, E) ∧ can_read_bytes(H)
        → can_process_rvf(H, self_bootstrapping_rvf_file)
```

The file is a **fixed point** of the execution relation: it contains everything
needed to process itself.

## 8. Security Considerations

### Interpreter Verification

The embedded interpreter's bytecode is hashed with SHAKE-256-256 and stored
in the WasmHeader (`bytecode_hash`). A WITNESS_SEG can chain the interpreter
hash to a trusted build, providing:

- **Provenance**: Who built this interpreter?
- **Integrity**: Has the interpreter been modified?
- **Attestation**: Can a TEE verify the interpreter before execution?

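One way to realize the `bytecode_hash` check is a constant-time comparison against a trusted digest. A hedged sketch — the 32-byte length matches SHAKE-256-256's 256-bit output, the function name is illustrative, and the hashing itself is out of scope:

```rust
/// Compare an interpreter's bytecode hash against a known-good digest in
/// constant time, so a mismatch does not leak how many prefix bytes agree.
fn interpreter_hash_matches(bytecode_hash: &[u8; 32], trusted: &[u8; 32]) -> bool {
    // Accumulate XOR differences instead of early-returning on the first
    // mismatching byte; the loop always runs over all 32 bytes.
    let mut diff = 0u8;
    for (a, b) in bytecode_hash.iter().zip(trusted.iter()) {
        diff |= a ^ b;
    }
    diff == 0
}
```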
### Sandbox Guarantees

The WASM sandbox model applies at every layer:

- The interpreter cannot access host memory beyond its linear memory
- The microkernel cannot access interpreter memory
- Each layer communicates only through defined exports/imports
- A trapped module cannot corrupt other modules

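These guarantees can be modeled in plain Rust (no real WASM runtime): each layer owns its linear memory privately and is reachable only through explicit exports, and an out-of-bounds access traps instead of corrupting a neighbor. All names here are illustrative:

```rust
/// Toy model of one sandboxed layer: memory is private, and the only way
/// in is a defined export with explicit arguments.
struct SandboxedModule {
    linear_memory: Vec<u8>, // not reachable from any other module
}

impl SandboxedModule {
    fn new(pages: usize) -> Self {
        // WASM linear memory grows in 64 KiB pages.
        SandboxedModule { linear_memory: vec![0; pages * 64 * 1024] }
    }

    /// A defined export. A store outside linear memory returns a trap
    /// error rather than touching anything outside the sandbox.
    fn export_store(&mut self, addr: usize, value: u8) -> Result<(), &'static str> {
        let slot = self
            .linear_memory
            .get_mut(addr)
            .ok_or("trap: out-of-bounds store")?;
        *slot = value;
        Ok(())
    }
}
```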
### Bootstrap Attack Surface

| Attack | Mitigation |
|--------|------------|
| Malicious interpreter | Verify `bytecode_hash` against known-good hash in WITNESS_SEG |
| Modified microkernel | Interpreter verifies microkernel hash before instantiation |
| Data corruption | Segment-level CRC32C/SHAKE-256 hashes (Law 2) |
| Code injection | WASM validates all code at load time (type checking) |
| Resource exhaustion | `max_memory_pages` cap, epoch-based interruption |

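The `max_memory_pages` mitigation can be enforced before a module is ever instantiated. A sketch, assuming the standard 64 KiB WASM page size; the function name is illustrative, while `max_memory_pages` is the WasmHeader field named above:

```rust
/// Bytes per WASM linear-memory page, fixed by the core WASM spec.
const WASM_PAGE_BYTES: usize = 64 * 1024;

/// Return the module's byte budget, or None if it requests more pages
/// than the header's `max_memory_pages` cap permits (resource-exhaustion
/// guard: refuse to instantiate rather than let the module grow unbounded).
fn memory_budget(requested_pages: u32, max_memory_pages: u32) -> Option<usize> {
    if requested_pages > max_memory_pages {
        None
    } else {
        Some(requested_pages as usize * WASM_PAGE_BYTES)
    }
}
```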
## 9. API

### Rust (rvf-runtime)

```rust
// Embed a WASM module. `embed_wasm` takes positional arguments; the
// parameter names are shown as trailing comments.
store.embed_wasm(
    WasmRole::Microkernel as u8,  // role
    WasmTarget::Wasm32 as u8,     // target
    WASM_FEAT_SIMD,               // required_features
    &microkernel_bytes,           // wasm_bytecode
    14,                           // export_count
    1,                            // bootstrap_priority
    0,                            // interpreter_type
)?;

// Make the file self-bootstrapping by also embedding an interpreter.
store.embed_wasm(
    WasmRole::Interpreter as u8,  // role
    WasmTarget::Wasm32 as u8,     // target
    0,                            // required_features
    &interpreter_bytes,           // wasm_bytecode
    3,                            // export_count
    0,                            // bootstrap_priority (runs first)
    0x03,                         // interpreter_type: wasmi-compatible
)?;

// Check if the file is self-bootstrapping
assert!(store.is_self_bootstrapping());

// Extract all WASM modules (ordered by priority)
let modules = store.extract_wasm_all()?;
```

|
||||
|
||||
### WASM (rvf-wasm bootstrap module)
|
||||
|
||||
```rust
|
||||
use rvf_wasm::bootstrap::{resolve_bootstrap_chain, get_bytecode, BootstrapChain};
|
||||
|
||||
let chain = resolve_bootstrap_chain(&rvf_bytes);
|
||||
|
||||
match chain {
|
||||
BootstrapChain::SelfContained { combined } => {
|
||||
let bytecode = get_bytecode(&rvf_bytes, &combined).unwrap();
|
||||
// Instantiate and run
|
||||
}
|
||||
BootstrapChain::TwoStage { interpreter, microkernel } => {
|
||||
let interp_code = get_bytecode(&rvf_bytes, &interpreter).unwrap();
|
||||
let kernel_code = get_bytecode(&rvf_bytes, µkernel).unwrap();
|
||||
// Load interpreter, then use it to run microkernel
|
||||
}
|
||||
_ => { /* use host runtime */ }
|
||||
}
|
||||
```

## 10. Relationship to Existing Segments

| Segment | Relationship to WASM_SEG |
|---------|--------------------------|
| KERNEL_SEG (0x0E) | Alternative execution tier — KERNEL_SEG boots a full unikernel, WASM_SEG runs a lightweight microkernel. Both make the file self-executing, but at different capability levels. |
| EBPF_SEG (0x0F) | Complementary — eBPF accelerates hot-path queries on Linux hosts while WASM provides universal portability. |
| WITNESS_SEG (0x0A) | Verification — WITNESS_SEG chains can attest the interpreter and microkernel hashes, providing a trust anchor for the bootstrap chain. |
| CRYPTO_SEG (0x0C) | Signing — CRYPTO_SEG key material can sign WASM_SEG contents for tamper detection. |
| MANIFEST_SEG (0x05) | Discovery — the tail manifest references all WASM_SEGs with their roles and priorities. |

## 11. Implementation Status

| Component | Crate | Status |
|-----------|-------|--------|
| `SegmentType::Wasm` (0x10) | `rvf-types` | Implemented |
| `WasmHeader` (64-byte header) | `rvf-types` | Implemented |
| `WasmRole`, `WasmTarget` enums | `rvf-types` | Implemented |
| `write_wasm_seg` | `rvf-runtime` | Implemented |
| `embed_wasm` / `extract_wasm` | `rvf-runtime` | Implemented |
| `extract_wasm_all` (priority-sorted) | `rvf-runtime` | Implemented |
| `is_self_bootstrapping` | `rvf-runtime` | Implemented |
| `resolve_bootstrap_chain` | `rvf-wasm` | Implemented |
| `get_bytecode` (zero-copy extraction) | `rvf-wasm` | Implemented |
| Embedded interpreter (wasmi-based) | `rvf-wasm` | Future |
| Combined interpreter+microkernel build | `rvf-wasm` | Future |