Merge commit 'd803bfe2b1fe7f5e219e50ac20d6801a0a58ac75' as 'vendor/ruvector'
ruv, 2026-02-28 14:39:40 -05:00
7854 changed files with 3522914 additions and 0 deletions

---

# RVF: RuVector Format Specification
## The Universal Substrate for Living Intelligence
**Version**: 0.1.0-draft
**Status**: Research
**Date**: 2026-02-13
---
## What RVF Is
RVF is not a file format. It is a **runtime substrate** — a living, self-reorganizing
binary medium that stores, streams, indexes, and adapts vector intelligence across
any domain, any scale, and any hardware tier.
Where traditional formats are snapshots of data, RVF is a **continuously evolving
organism**. It ingests without rewriting. It answers queries before it finishes loading.
It reorganizes its own layout to match access patterns. It survives crashes without
journals. It fits on a 64 KB WASM tile or scales to a petabyte hub.
## The Four Laws of RVF
Every design decision in RVF derives from four inviolable laws:
### Law 1: Truth Lives at the Tail
The most recent `MANIFEST_SEG` at the tail of the file is the sole source of truth.
No front-loaded metadata. No section directory that must be rewritten on mutation.
Readers scan backward from EOF to find the latest manifest and know exactly what
to map.
**Consequence**: Append-only writes. Streaming ingest. No global rewrite ever.
### Law 2: Every Segment Is Independently Valid
Each segment carries its own magic number, length, content hash, and type tag.
A reader encountering any segment in isolation can verify it, identify it, and
decide whether to process it. No segment depends on prior segments for structural
validity.
**Consequence**: Crash safety for free. Parallel verification. Segment-level
integrity without a global checksum.
### Law 3: Data and State Are Separated
Vector payloads, index structures, overlay graphs, quantization dictionaries, and
runtime metadata live in distinct segment types. The manifest binds them together
but they never intermingle. This means you can replace the index without touching
vectors, update the overlay without rebuilding adjacency, or swap quantization
without re-encoding.
**Consequence**: Incremental updates. Modular evolution. Zero-copy segment reuse.
### Law 4: The Format Adapts to Its Workload
RVF monitors access patterns through lightweight sketches and periodically
reorganizes: promoting hot vectors to faster tiers, compacting stale overlays,
lazily building deeper index layers. The format is not static — it converges
toward the optimal layout for its actual workload.
**Consequence**: Self-tuning performance. No manual optimization. The file gets
faster the more you use it.
## Design Coordinates
| Property | RVF Answer |
|----------|-----------|
| Write model | Append-only segments + background compaction |
| Read model | Tail-manifest scan, then progressive mmap |
| Index model | Layered availability (entry points -> partial -> full) |
| Compression | Temperature-tiered (fp16 hot, 5-7 bit warm, 3 bit cold) |
| Alignment | 64-byte for SIMD (AVX-512, NEON, WASM v128) |
| Crash safety | Segment-level hashes, no WAL required |
| Crypto | Post-quantum (ML-DSA-65 signatures, SHAKE-256 hashes) |
| Streaming | Yes — first query before full load |
| Hardware | 8 KB tile to petabyte hub |
| Domain | Universal — genomics, text, graph, vision as profiles |
## Acceptance Test
> Cold start on a 10 million vector file: load and answer the first query with a
> useful (recall >= 0.7) result without reading more than the last 4 MB, then
> converge to full quality (recall >= 0.95) as it progressively maps more segments.
## Document Map
| Document | Path | Content |
|----------|------|---------|
| This overview | `spec/00-overview.md` | Philosophy, laws, design coordinates |
| Segment model | `spec/01-segment-model.md` | Segment types, headers, append-only rules |
| Manifest system | `spec/02-manifest-system.md` | Two-level manifests, hotset pointers |
| Temperature tiering | `spec/03-temperature-tiering.md` | Adaptive layout, access sketches, promotion |
| Progressive indexing | `spec/04-progressive-indexing.md` | Layered HNSW, partial availability |
| Overlay epochs | `spec/05-overlay-epochs.md` | Streaming min-cut, epoch boundaries |
| Wire format | `wire/binary-layout.md` | Byte-level binary format reference |
| WASM microkernel | `microkernel/wasm-runtime.md` | Cognitum tile mapping, WASM exports |
| Domain profiles | `profiles/domain-profiles.md` | RVDNA, RVText, RVGraph, RVVision |
| Crypto spec | `crypto/quantum-signatures.md` | Post-quantum primitives, segment signing |
| Benchmarks | `benchmarks/acceptance-tests.md` | Performance targets, test methodology |
## Relationship to RVDNA
RVDNA (RuVector DNA) was the first domain-specific format for genomic vector
intelligence. In the RVF model, RVDNA becomes a **profile** — a set of conventions
for how genomic data maps onto the universal RVF substrate:
```
RVF (universal substrate)
|
+-- RVF Core Profile  (minimal, fits on 64KB tile)
+-- RVF Hot Profile   (chip-optimized, SIMD-heavy)
+-- RVF Full Profile  (hub-scale, all features)
|
+-- Domain Profiles
    +-- RVDNA    (genomics: codons, motifs, k-mers)
    +-- RVText   (language: embeddings, token graphs)
    +-- RVGraph  (networks: adjacency, partitions)
    +-- RVVision (imagery: feature maps, patch vectors)
```
The substrate carries the laws. The profiles carry the semantics.
## Design Answers
**Q: Random writes or append-only plus compaction?**
A: Append-only plus compaction. This gives speed and crash safety almost for free.
Random writes add complexity for marginal benefit in the vector workload.
**Q: Primary target mmap on desktop CPUs or also microcontroller tiles?**
A: Both. RVF defines three hardware profiles. The Core profile fits in 8 KB code +
8 KB data + 64 KB SIMD scratch. The Full profile assumes mmap on desktop-class
memory. The wire format is identical — only the runtime behavior changes.
**Q: Which property matters most?**
A: All four are non-negotiable, but the priority order for conflict resolution is:
1. **Streamable** (never block on write)
2. **Progressive** (answer before fully loaded)
3. **Adaptive** (self-optimize over time)
4. **p95 speed** (predictable tail latency)

---
# RVF Segment Model
## 1. Append-Only Segment Architecture
An RVF file is a linear sequence of **segments**. Each segment is a self-contained,
independently verifiable unit. New data is always appended — never inserted into or
overwritten within existing segments.
```
+------------+------------+------------+     +------------+
| Segment 0  | Segment 1  | Segment 2  | ... | Segment N  | <-- EOF
+------------+------------+------------+     +------------+
                                                   ^
                                          Latest MANIFEST_SEG
                                           (source of truth)
```
### Why Append-Only
| Property | Benefit |
|----------|---------|
| Write amplification | Zero — each byte written once until compaction |
| Crash safety | Partial segment at tail is detectable and discardable |
| Concurrent reads | Readers see a consistent snapshot at any manifest boundary |
| Streaming ingest | Writer never blocks on reorganization |
| mmap friendliness | Pages only grow — no invalidation of mapped regions |
## 2. Segment Header
Every segment begins with a fixed 64-byte header. The header is 64-byte aligned
to match SIMD register width.
```
Offset Size Field Description
------ ---- ----- -----------
0x00 4 magic 0x52564653 ("RVFS" in ASCII)
0x04 1 version Segment format version (currently 1)
0x05 1 seg_type Segment type enum (see below)
0x06 2 flags Bitfield: compressed, encrypted, signed, sealed, etc.
0x08 8 segment_id Monotonically increasing segment ordinal
0x10 8 payload_length Byte length of payload (after header, before footer)
0x18 8 timestamp_ns Nanosecond UNIX timestamp of segment creation
0x20 1 checksum_algo Hash algorithm enum: 0=CRC32C, 1=XXH3-128, 2=SHAKE-256
0x21 1 compression Compression enum: 0=none, 1=LZ4, 2=ZSTD, 3=custom
0x22 2 reserved_0 Must be zero
0x24 4 reserved_1 Must be zero
0x28 16 content_hash First 128 bits of payload hash (algorithm per checksum_algo)
0x38 4 uncompressed_len Original payload size (0 if no compression)
0x3C 4 alignment_pad Padding to reach 64-byte boundary
```
**Total header**: 64 bytes (one cache line, one AVX-512 register width).
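As a sketch of how the 64-byte header round-trips, the table above maps directly onto a fixed `struct` layout. Little-endian byte order and the helper names are assumptions here, not part of the wire spec:

```python
import struct

# Field order mirrors the header table: magic, version, seg_type, flags,
# segment_id, payload_length, timestamp_ns, checksum_algo, compression,
# reserved_0, reserved_1, content_hash (16 bytes), uncompressed_len, pad.
# Little-endian is an assumption; the spec table does not fix byte order.
HEADER_FMT = "<IBBHQQQBBHI16sII"
HEADER_SIZE = struct.calcsize(HEADER_FMT)  # 64 bytes: one cache line
RVFS_MAGIC = 0x52564653                    # "RVFS"

def pack_header(seg_type, segment_id, payload, *, flags=0, timestamp_ns=0,
                checksum_algo=1, compression=0, content_hash=b"\x00" * 16):
    """Build a version-1 segment header for the given payload (hypothetical helper)."""
    return struct.pack(HEADER_FMT, RVFS_MAGIC, 1, seg_type, flags, segment_id,
                       len(payload), timestamp_ns, checksum_algo, compression,
                       0, 0, content_hash, 0, 0)

def parse_header(buf):
    """Decode the fields a reader needs to skip or verify a segment."""
    fields = struct.unpack_from(HEADER_FMT, buf)
    if fields[0] != RVFS_MAGIC:
        raise ValueError("not an RVF segment header")
    return {"version": fields[1], "seg_type": fields[2], "flags": fields[3],
            "segment_id": fields[4], "payload_length": fields[5]}
```

The format string totals exactly 64 bytes, so `struct.calcsize` doubles as a layout check.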
### Magic Validation
Readers scanning backward from EOF look for `0x52564653` at 64-byte aligned
boundaries. This enables fast tail-scan even on corrupted files.
### Flags Bitfield
```
Bit 0: COMPRESSED Payload is compressed per compression field
Bit 1: ENCRYPTED Payload is encrypted (key info in manifest)
Bit 2: SIGNED A signature footer follows the payload
Bit 3: SEALED Segment is immutable (compaction output)
Bit 4: PARTIAL Segment is a partial write (streaming ingest)
Bit 5: TOMBSTONE Segment logically deletes a prior segment
Bit 6: HOT Segment contains temperature-promoted data
Bit 7: OVERLAY Segment contains overlay/delta data
Bit 8: SNAPSHOT Segment contains full snapshot (not delta)
Bit 9: CHECKPOINT Segment is a safe rollback point
Bits 10-15: reserved
```
## 3. Segment Types
```
Value Name Purpose
----- ---- -------
0x01 VEC_SEG Raw vector payloads (the actual embeddings)
0x02 INDEX_SEG HNSW adjacency lists, entry points, routing tables
0x03 OVERLAY_SEG Graph overlay deltas, partition updates, min-cut witnesses
0x04 JOURNAL_SEG Metadata mutations (label changes, deletions, moves)
0x05 MANIFEST_SEG Segment directory, hotset pointers, epoch state
0x06 QUANT_SEG Quantization dictionaries and codebooks
0x07 META_SEG Arbitrary key-value metadata (tags, provenance, lineage)
0x08 HOT_SEG Temperature-promoted hot data (vectors + neighbors)
0x09 SKETCH_SEG Access counter sketches for temperature decisions
0x0A WITNESS_SEG Capability manifests, proof of computation, audit trails
0x0B PROFILE_SEG Domain profile declarations (RVDNA, RVText, etc.)
0x0C CRYPTO_SEG Key material, signature chains, certificate anchors
0x0D METAIDX_SEG Metadata inverted indexes for filtered search
```
### Reserved Range
Types `0x00` and `0xF0`-`0xFF` are reserved. `0x00` indicates an uninitialized
or zeroed region (not a valid segment). `0xF0`-`0xFF` are reserved for
implementation-specific extensions.
## 4. Segment Footer
If the `SIGNED` flag is set, the payload is followed by a signature footer:
```
Offset Size Field Description
------ ---- ----- -----------
0x00 2 sig_algo Signature algorithm: 0=Ed25519, 1=ML-DSA-65, 2=SLH-DSA-128s
0x02 2 sig_length Byte length of signature
0x04 var signature The signature bytes
var 4 footer_length Total footer size (for backward scanning)
```
Unsigned segments have no footer — the next segment header follows immediately
after the payload (at the next 64-byte aligned boundary).
## 5. Segment Lifecycle
### Write Path
```
1. Allocate segment ID (monotonic counter)
2. Compute payload hash
3. Write header + payload + optional footer
4. fsync (or fdatasync for non-manifest segments)
5. Write MANIFEST_SEG referencing the new segment
6. fsync the manifest
```
The two-fsync protocol ensures that:
- If crash occurs before step 6, the orphan segment is harmless (no manifest points to it)
- If crash occurs during step 6, the partial manifest is detectable (bad hash)
- After step 6, the segment is durably committed
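The two-fsync protocol can be sketched in a few lines. This is a minimal illustration over a single plain file, assuming the segment and manifest bytes are already packed; the function name is hypothetical:

```python
import os

def append_segment(path, seg_bytes, manifest_bytes):
    """Two-fsync append: the segment only becomes visible once a manifest
    referencing it is also durable (sketch, not the real writer)."""
    with open(path, "ab") as f:
        f.write(seg_bytes)       # steps 1-3: header + payload + optional footer
        f.flush()
        os.fsync(f.fileno())     # step 4: segment bytes are on stable storage
        f.write(manifest_bytes)  # step 5: MANIFEST_SEG referencing the segment
        f.flush()
        os.fsync(f.fileno())     # step 6: commit point
```

A crash between the two `fsync` calls leaves an orphan segment that no manifest points to, which is exactly the harmless case described above.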
### Read Path
```
1. Seek to EOF
2. Scan backward for latest MANIFEST_SEG (look for magic at aligned boundaries)
3. Parse manifest -> get segment directory
4. Map segments on demand (progressive loading)
```
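Step 2's backward scan is simple because headers sit on 64-byte boundaries. A sketch over an in-memory file image (on-disk byte order of the magic is an assumption):

```python
SEG_MAGIC = (0x52564653).to_bytes(4, "little")  # byte order is an assumption
MANIFEST_SEG = 0x05

def find_latest_manifest(data):
    """Return the offset of the most recent MANIFEST_SEG header, or None.
    Headers are 64-byte aligned; seg_type sits at header offset 0x05."""
    if len(data) < 64:
        return None
    pos = (len(data) - 64) // 64 * 64  # last aligned boundary
    while pos >= 0:
        if data[pos:pos + 4] == SEG_MAGIC and data[pos + 5] == MANIFEST_SEG:
            return pos
        pos -= 64
    return None
```

A real reader would additionally verify the header's content hash before trusting the hit, per Law 2.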
### Compaction
Compaction merges multiple segments into fewer, larger, sealed segments:
```
Before: [VEC_SEG_1] [VEC_SEG_2] [VEC_SEG_3] [MANIFEST_3]
After:  [VEC_SEG_1] [VEC_SEG_2] [VEC_SEG_3] [MANIFEST_3] [VEC_SEG_sealed] [MANIFEST_4]
                                                         ^^^^^^^^^^^^^^^^
                                                         New sealed segment
                                                         merging 1+2+3
```
Old segments are marked with TOMBSTONE entries in the new manifest. Space is
reclaimed when the file is eventually rewritten (or old segments are in a
separate file in multi-file mode).
### Multi-File Mode
For very large datasets, RVF can span multiple files:
```
data.rvf Main file with manifests and hot data
data.rvf.cold.0 Cold segment shard 0
data.rvf.cold.1 Cold segment shard 1
data.rvf.idx.0 Index segment shard 0
```
The manifest in the main file contains shard references with file paths and
byte ranges. This enables cold data to live on slower storage while hot data
stays on fast storage.
## 6. Segment Addressing
Segments are addressed by their `segment_id` (monotonically increasing 64-bit
integer). The manifest maps segment IDs to file offsets (and optionally shard
file paths in multi-file mode).
Within a segment, data is addressed by **block offset** — a 32-bit offset from
the start of the segment payload. This limits individual segments to 4 GB, which
is intentional: it keeps segments manageable for compaction and progressive loading.
### Block Structure Within VEC_SEG
```
+-------------------+
| Block Header (16B)|
| block_id: u32 |
| count: u32 |
| dim: u16 |
| dtype: u8 |
| pad: [u8; 5] |
+-------------------+
| Vectors |
| (count * dim * |
| sizeof(dtype)) |
| [64B aligned] |
+-------------------+
| ID Map |
| (varint delta |
| encoded IDs) |
+-------------------+
| Block Footer |
| crc32c: u32 |
+-------------------+
```
Vectors within a block are stored **columnar** — all dimension 0 values, then all
dimension 1 values, etc. This maximizes compression ratio. But the HOT_SEG stores
vectors **interleaved** (row-major) for cache-friendly sequential scan during
top-K refinement.
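The columnar/interleaved distinction is just a transpose of the block's layout. A small sketch (helper names are illustrative, not spec API):

```python
def to_columnar(vectors):
    """Row-major [[v0d0, v0d1, ...], ...] -> dimension-major flat layout,
    as stored in a VEC_SEG block for better compression."""
    count, dim = len(vectors), len(vectors[0])
    return [vectors[i][d] for d in range(dim) for i in range(count)]

def to_interleaved(columnar, count, dim):
    """Inverse transform: rebuild row-major vectors, the HOT_SEG layout
    used for cache-friendly sequential scans."""
    return [[columnar[d * count + i] for d in range(dim)] for i in range(count)]
```

Promotion to HOT_SEG (next document) is essentially this inverse transform applied after dequantization.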
## 7. Invariants
1. Segment IDs are strictly monotonically increasing within a file
2. A valid RVF file contains at least one MANIFEST_SEG
3. The last MANIFEST_SEG is always the source of truth
4. Segment headers are always 64-byte aligned
5. No segment payload exceeds 4 GB
6. Content hashes are computed over the raw (uncompressed, unencrypted) payload
7. Sealed segments are never modified — only tombstoned
8. A reader that cannot find a valid MANIFEST_SEG must reject the file

---
# RVF Manifest System
## 1. Two-Level Manifest Architecture
The manifest system is what makes RVF progressive. Instead of a monolithic directory
that must be fully parsed before any query, RVF uses a two-level manifest that
enables instant boot followed by incremental refinement.
```
EOF
|
v
+--------------------------------------------------+
| Level 0: Root Manifest (fixed 4096 bytes) |
| - Magic + version |
| - Pointer to Level 1 manifest segment |
| - Hotset pointers (inline) |
| - Total vector count |
| - Dimension |
| - Epoch counter |
| - Profile declaration |
+--------------------------------------------------+
|
| points to
v
+--------------------------------------------------+
| Level 1: Full Manifest (variable size) |
| - Complete segment directory |
| - Temperature tier map |
| - Index layer availability |
| - Overlay epoch chain |
| - Compaction state |
| - Shard references (multi-file) |
| - Capability manifest |
+--------------------------------------------------+
```
### Why Two Levels
A reader performing cold start only needs Level 0 (4 KB). From Level 0 alone,
it can locate the entry points, coarse routing graph, quantization dictionary,
and centroids — enough to answer approximate queries immediately.
Level 1 is loaded asynchronously to enable full-quality queries, but the system
is functional before Level 1 is fully parsed.
## 2. Level 0: Root Manifest
The root manifest is always the **last 4096 bytes** of the file (or the last
4096 bytes of the most recent MANIFEST_SEG). Its fixed size enables instant
location: `seek(EOF - 4096)`.
### Binary Layout
```
Offset Size Field Description
------ ---- ----- -----------
0x000 4 magic 0x52564D30 ("RVM0")
0x004 2 version Root manifest version
0x006 2 flags Root manifest flags
0x008 8 l1_manifest_offset Byte offset to Level 1 manifest segment
0x010 8 l1_manifest_length Byte length of Level 1 manifest segment
0x018 8 total_vector_count Total vectors across all segments
0x020 2 dimension Vector dimensionality
0x022 1 base_dtype Base data type enum
0x023 1 profile_id Domain profile (0=generic, 1=dna, 2=text, 3=graph, 4=vision)
0x024 4 epoch Current overlay epoch number
0x028 8 created_ns File creation timestamp (ns)
0x030 8 modified_ns Last modification timestamp (ns)
--- Hotset Pointers (the key to instant boot) ---
0x038 8 entrypoint_seg_offset Offset to segment containing HNSW entry points
0x040 4 entrypoint_block_offset Block offset within that segment
0x044 4 entrypoint_count Number of entry points
0x048 8 toplayer_seg_offset Offset to segment with top-layer adjacency
0x050 4 toplayer_block_offset Block offset
0x054 4 toplayer_node_count Nodes in top layer
0x058 8 centroid_seg_offset Offset to segment with cluster centroids / pivots
0x060 4 centroid_block_offset Block offset
0x064 4 centroid_count Number of centroids
0x068 8 quantdict_seg_offset Offset to quantization dictionary segment
0x070 4 quantdict_block_offset Block offset
0x074 4 quantdict_size Dictionary size in bytes
0x078 8 hot_cache_seg_offset Offset to HOT_SEG with interleaved hot vectors
0x080 4 hot_cache_block_offset Block offset
0x084 4 hot_cache_vector_count Vectors in hot cache
0x088 8 prefetch_map_offset Offset to prefetch hint table
0x090 4 prefetch_map_entries Number of prefetch entries
--- Crypto ---
0x094 2 sig_algo Manifest signature algorithm
0x096 2 sig_length Signature length
0x098 var signature Manifest signature (up to 3400 bytes for ML-DSA-65)
--- Padding to 4096 bytes ---
0xF00 252 reserved Reserved / zero-padded to 4096
0xFFC 4 root_checksum CRC32C of bytes 0x000-0xFFB
```
**Total**: Exactly 4096 bytes (one page, one disk sector on most hardware).
### Hotset Pointers
The six hotset pointers are the minimum information needed to answer a query:
1. **Entry points**: Where to start HNSW traversal
2. **Top-layer adjacency**: Coarse routing to the right neighborhood
3. **Centroids/pivots**: For IVF-style pre-filtering or partition routing
4. **Quantization dictionary**: For decoding compressed vectors
5. **Hot cache**: Pre-decoded interleaved vectors for top-K refinement
6. **Prefetch map**: Contiguous neighbor-list pages with prefetch offsets
With these six pointers, a reader can:
- Start HNSW search at the entry point
- Route through the top layer
- Quantize the query using the dictionary
- Scan the hot cache for refinement
- Prefetch neighbor pages for cache-friendly traversal
All without reading Level 1 or any cold segments.
## 3. Level 1: Full Manifest
Level 1 is a variable-size segment (type `MANIFEST_SEG`) referenced by Level 0.
It contains the complete file directory.
### Structure
Level 1 is encoded as a sequence of typed records using a tag-length-value (TLV)
scheme for forward compatibility:
```
+---+---+---+---+---+---+---+---+
| Tag (2B) | Length (4B) | Pad | <- 8-byte aligned record header
+---+---+---+---+---+---+---+---+
| Value (Length bytes) |
| [padded to 8-byte boundary] |
+---------------------------------+
```
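A forward-compatible TLV walk skips records it does not recognize. A minimal parser over the layout above (little-endian and end-of-stream handling are assumptions):

```python
import struct

def iter_tlv(buf):
    """Yield (tag, value) records from a Level 1 manifest payload.
    8-byte record header (tag u16, length u32, 2 pad bytes); values are
    padded to the next 8-byte boundary. Unknown tags can simply be skipped."""
    pos = 0
    while pos + 8 <= len(buf):
        tag, length = struct.unpack_from("<HI", buf, pos)
        if tag == 0:
            break  # zero padding / end of records (assumed convention)
        yield tag, buf[pos + 8 : pos + 8 + length]
        pos += 8 + (length + 7) // 8 * 8  # advance past padded value
```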
### Record Types
```
Tag Name Description
--- ---- -----------
0x0001 SEGMENT_DIR Array of segment directory entries
0x0002 TEMP_TIER_MAP Temperature tier assignments per block
0x0003 INDEX_LAYERS Index layer availability bitmap
0x0004 OVERLAY_CHAIN Epoch chain with rollback pointers
0x0005 COMPACTION_STATE Active/tombstoned segment sets
0x0006 SHARD_REFS Multi-file shard references
0x0007 CAPABILITY_MANIFEST What this file can do (features, limits)
0x0008 PROFILE_CONFIG Domain-specific configuration
0x0009 ACCESS_SKETCH_REF Pointer to latest SKETCH_SEG
0x000A PREFETCH_TABLE Full prefetch hint table
0x000B ID_RESTART_POINTS Restart point index for varint delta IDs
0x000C WITNESS_CHAIN Proof-of-computation witness chain
0x000D KEY_DIRECTORY Encryption key references (not keys themselves)
```
### Segment Directory Entry
```
Offset Size Field Description
------ ---- ----- -----------
0x00 8 segment_id Segment ordinal
0x08 1 seg_type Segment type enum
0x09 1 tier Temperature tier (0=hot, 1=warm, 2=cold)
0x0A 2 flags Segment flags
0x0C 4 reserved Must be zero
0x10 8 file_offset Byte offset in file (or shard)
0x18 8 payload_length Decompressed payload length
0x20 8 compressed_length Compressed length (0 if uncompressed)
0x28 2 shard_id Shard index (0 for main file)
0x2A 2 compression Compression algorithm
0x2C 4 block_count Number of blocks in segment
0x30 16 content_hash Payload hash (first 128 bits)
```
**Total**: 64 bytes per entry (cache-line aligned).
## 4. Manifest Lifecycle
### Writing a New Manifest
Every mutation to the file produces a new MANIFEST_SEG appended at the tail:
```
1. Compute new Level 1 manifest (segment directory + metadata)
2. Write Level 1 as a MANIFEST_SEG payload
3. Compute Level 0 root manifest pointing to Level 1
4. Write Level 0 as the last 4096 bytes of the MANIFEST_SEG
5. fsync
```
The MANIFEST_SEG payload structure is:
```
+-----------------------------------+
| Level 1 manifest (variable size) |
+-----------------------------------+
| Level 0 root manifest (4096 B) | <-- Always the last 4096 bytes
+-----------------------------------+
```
### Reading the Manifest
```
1. seek(EOF - 4096)
2. Read 4096 bytes -> Level 0 root manifest
3. Validate magic (0x52564D30) and checksum
4. If valid: extract hotset pointers -> system is queryable
5. Async: read Level 1 at l1_manifest_offset -> full directory
6. If Level 0 is invalid: scan backward for previous MANIFEST_SEG
```
Step 6 provides crash recovery. If the latest manifest write was interrupted,
the previous manifest is still valid. Readers scan backward at 64-byte aligned
boundaries looking for the RVFS magic + MANIFEST_SEG type.
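Steps 1-4 amount to one fixed-size read and two checks. A sketch over an in-memory file image; `zlib.crc32` (plain CRC32) stands in for CRC32C here, and little-endian order is an assumption:

```python
import struct, zlib

ROOT_MAGIC = 0x52564D30  # "RVM0"
ROOT_SIZE = 4096

def read_root_manifest(data):
    """Validate the last 4096 bytes as a Level 0 root manifest and return
    the Level 1 pointer, or None (caller then scans backward, step 6)."""
    if len(data) < ROOT_SIZE:
        return None
    root = data[-ROOT_SIZE:]
    magic, version = struct.unpack_from("<IH", root, 0x000)
    stored_crc = struct.unpack_from("<I", root, ROOT_SIZE - 4)[0]
    # zlib.crc32 stands in for CRC32C in this sketch
    if magic != ROOT_MAGIC or stored_crc != zlib.crc32(root[:ROOT_SIZE - 4]):
        return None
    l1_off, l1_len = struct.unpack_from("<QQ", root, 0x008)
    return {"version": version, "l1_offset": l1_off, "l1_length": l1_len}
```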
### Manifest Chain
Each manifest implicitly forms a chain through the segment ID ordering. For
explicit rollback support, Level 1 contains the `OVERLAY_CHAIN` record which
stores:
```
epoch: u32 Current epoch
prev_manifest_offset: u64 Offset of previous MANIFEST_SEG
prev_manifest_id: u64 Segment ID of previous MANIFEST_SEG
checkpoint_hash: [u8; 16] Hash of the complete state at this epoch
```
This enables point-in-time recovery and bisection debugging.
## 5. Hotset Pointer Semantics
### Entry Point Stability
Entry points are the HNSW nodes at the highest layer. They change rarely (only
when the index is rebuilt or a new highest-layer node is inserted). The root
manifest caches them directly so they survive across manifest generations without
re-reading the index.
### Centroid Refresh
Centroids may drift as data is added. The manifest tracks a `centroid_epoch` — if
the current epoch exceeds centroid_epoch + threshold, the runtime should schedule
centroid recomputation. But the stale centroids remain usable (recall degrades
gracefully, it does not fail).
### Hot Cache Coherence
The hot cache in HOT_SEG is a **read-optimized snapshot** of the most-accessed
vectors. It may be stale relative to the latest VEC_SEGs. The manifest tracks
a `hot_cache_epoch` for staleness detection. Queries use the hot cache for fast
initial results, then refine against authoritative VEC_SEGs if needed.
## 6. Progressive Boot Sequence
```
Time Action System State
---- ------ ------------
t=0 Read last 4 KB (Level 0) Booting
t+1ms Parse hotset pointers Queryable (approximate)
t+2ms mmap entry points + top layer Better routing
t+5ms mmap hot cache + quant dict Fast top-K refinement
t+10ms Start loading Level 1 Discovering full directory
t+50ms Level 1 parsed Full segment awareness
t+100ms mmap warm VEC_SEGs Recall improving
t+500ms mmap cold VEC_SEGs Full recall
t+1s Background index layer build Converging to optimal
```
For a 10M vector file at 384 dimensions (~7.7 GB in uniform float16, shrinking toward 3-4 GB once temperature tiering compresses warm and cold blocks):
- Level 0 read: 4 KB in <1 ms
- Hotset data: ~2-4 MB (entry points + top layer + centroids + hot cache)
- First query: within 5-10 ms of open
- Full convergence: 1-5 seconds depending on storage speed

---
# RVF Temperature Tiering
## 1. Adaptive Layout as a First-Class Concept
Traditional vector formats place data once and leave it. RVF treats data placement
as a **continuous optimization problem**. Every vector block has a temperature, and
the format periodically reorganizes to keep hot data fast and cold data small.
```
Access Frequency
^
|
Tier 0 (HOT) | ████████ fp16 / 8-bit, interleaved
| ████████ < 1μs random access
|
Tier 1 (WARM) | ░░░░░░░░░░░░░░░░ 5-7 bit quantized
| ░░░░░░░░░░░░░░░░ columnar, compressed
|
Tier 2 (COLD) | ▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒ 3-bit or 1-bit
| ▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒ heavy compression
|
+------------------------------------> Vector ID
```
### Tier Definitions
| Tier | Name | Quantization | Layout | Compression | Access Latency |
|------|------|-------------|--------|-------------|----------------|
| 0 | Hot | fp16 or int8 | Interleaved (row-major) | None or LZ4 | < 1 μs |
| 1 | Warm | 5-7 bit SQ/PQ | Columnar | LZ4 or ZSTD | 1-10 μs |
| 2 | Cold | 3-bit or binary | Columnar | ZSTD level 9+ | 10-100 μs |
### Memory Ratios
For 384-dimensional vectors (typical embedding size):
| Tier | Bytes/Vector | Ratio vs fp32 | 10M Vectors |
|------|-------------|---------------|-------------|
| fp32 (raw) | 1536 B | 1.0x | 14.3 GB |
| Tier 0 (fp16) | 768 B | 2.0x | 7.2 GB |
| Tier 0 (int8) | 384 B | 4.0x | 3.6 GB |
| Tier 1 (6-bit) | 288 B | 5.3x | 2.7 GB |
| Tier 1 (5-bit) | 240 B | 6.4x | 2.2 GB |
| Tier 2 (3-bit) | 144 B | 10.7x | 1.3 GB |
| Tier 2 (1-bit) | 48 B | 32.0x | 0.45 GB |
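The per-vector sizes in the table follow directly from `dim * bits / 8`. A quick arithmetic check of the 384-dimension rows:

```python
def bytes_per_vector(dim, bits):
    """Storage per vector at a given bit width (ignores per-block headers)."""
    return dim * bits // 8

# Reproduce the table rows for 384-dimensional vectors
assert bytes_per_vector(384, 32) == 1536  # fp32 baseline
assert bytes_per_vector(384, 16) == 768   # Tier 0, fp16
assert bytes_per_vector(384, 8) == 384    # Tier 0, int8
assert bytes_per_vector(384, 6) == 288    # Tier 1, 6-bit
assert bytes_per_vector(384, 3) == 144    # Tier 2, 3-bit
assert bytes_per_vector(384, 1) == 48     # Tier 2, binary (32x vs fp32)
```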
## 2. Access Counter Sketch
Temperature decisions require knowing which blocks are accessed frequently.
RVF maintains a lightweight **Count-Min Sketch** per block set, stored in
SKETCH_SEG segments.
### Sketch Parameters
```
Width (w): 1024 counters
Depth (d): 4 hash functions
Counter size: 8-bit saturating (max 255)
Memory: 1024 * 4 * 1 = 4 KB per sketch
Granularity: One sketch per 1024-vector block
Decay: Halve all counters every 2^16 accesses (aging)
```
For 10M vectors in 1024-vector blocks:
- 9,766 blocks
- 9,766 * 4 KB = ~38 MB of sketches
- Stored in SKETCH_SEG, referenced by manifest
### Sketch Operations
**On query access**:
```
block_id = vector_id / block_size
for i in 0..depth:
idx = hash_i(block_id) % width
sketch[i][idx] = min(sketch[i][idx] + 1, 255)
```
**On temperature check**:
```
count = min over i of sketch[i][hash_i(block_id) % width]
if count > HOT_THRESHOLD: tier = 0
elif count > WARM_THRESHOLD: tier = 1
else: tier = 2
```
**Aging** (every 2^16 accesses):
```
for all counters: counter = counter >> 1
```
This ensures the sketch tracks *recent* access patterns, not cumulative history.
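The three operations above fit in a small class. This is a sketch with the stated parameters (width 1024, depth 4, 8-bit saturating counters, halving decay); the hash choice and threshold values are assumptions, not spec:

```python
import hashlib

class AccessSketch:
    """Count-Min Sketch for block access counts (hash function is an assumption)."""
    WIDTH, DEPTH = 1024, 4

    def __init__(self):
        self.rows = [[0] * self.WIDTH for _ in range(self.DEPTH)]

    def _idx(self, i, block_id):
        # One salted hash per row stands in for the d hash functions
        h = hashlib.blake2b(block_id.to_bytes(8, "little"),
                            digest_size=8, salt=bytes([i])).digest()
        return int.from_bytes(h, "little") % self.WIDTH

    def touch(self, block_id):
        """Record one access (8-bit saturating counters)."""
        for i in range(self.DEPTH):
            j = self._idx(i, block_id)
            self.rows[i][j] = min(self.rows[i][j] + 1, 255)

    def estimate(self, block_id):
        """Min over rows: never undercounts, may overcount on collisions."""
        return min(self.rows[i][self._idx(i, block_id)] for i in range(self.DEPTH))

    def decay(self):
        """Aging step: halve every counter so recent accesses dominate."""
        for row in self.rows:
            for j in range(self.WIDTH):
                row[j] >>= 1

def tier_for(count, hot_threshold=64, warm_threshold=8):
    """Temperature check per the pseudocode above (thresholds illustrative)."""
    return 0 if count > hot_threshold else (1 if count > warm_threshold else 2)
```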
### Why Count-Min Sketch
| Alternative | Memory | Accuracy | Update Cost |
|------------|--------|----------|-------------|
| Per-vector counter | 80 MB (10M * 8B) | Exact | O(1) |
| Count-Min Sketch | 38 MB | ~99.9% | O(depth) = O(4) |
| HyperLogLog | 6 MB | ~98% | O(1) but cardinality only |
| Bloom filter | 12 MB | No counting | N/A |
Count-Min Sketch is the best trade-off: near-exact accuracy with bounded memory
and constant-time updates.
## 3. Promotion and Demotion
### Promotion: Warm/Cold -> Hot
When a block's access count exceeds HOT_THRESHOLD for two consecutive sketch
epochs:
```
1. Read the block from its current VEC_SEG
2. Decode/dequantize vectors to fp16 or int8
3. Rearrange from columnar to interleaved layout
4. Write as a new HOT_SEG (or append to existing HOT_SEG)
5. Update manifest with new tier assignment
6. Optionally: add neighbor lists to HOT_SEG for locality
```
### Demotion: Hot -> Warm -> Cold
When a block's access count drops below WARM_THRESHOLD:
```
1. The block is not immediately rewritten
2. On next compaction cycle, the block is written to the appropriate tier
3. Quantization is applied during compaction (not lazily)
4. The HOT_SEG entry is tombstoned in the manifest
```
### Eviction as Compression
The key insight: **eviction from hot tier is just compression, not deletion**.
The vector data is always present — it just moves to a more compressed
representation. This means:
- No data loss on eviction
- Recall degrades gracefully (quantized vectors still contribute to search)
- The file naturally compresses over time as access patterns stabilize
## 4. Temperature-Aware Compaction
Standard compaction merges segments for space efficiency. Temperature-aware
compaction also **rearranges blocks by tier**:
```
Before compaction:
VEC_SEG_1: [hot] [cold] [warm] [hot] [cold]
VEC_SEG_2: [warm] [hot] [cold] [warm] [warm]
After temperature-aware compaction:
HOT_SEG: [hot] [hot] [hot] <- interleaved, fp16
VEC_SEG_W: [warm] [warm] [warm] [warm] <- columnar, 6-bit
VEC_SEG_C: [cold] [cold] [cold] <- columnar, 3-bit
```
This creates **physical locality by temperature**: hot blocks are contiguous
(good for sequential scan), warm blocks are contiguous (good for batch decode),
cold blocks are contiguous (good for compression ratio).
### Compaction Triggers
| Trigger | Condition | Action |
|---------|-----------|--------|
| Sketch epoch | Every N writes | Evaluate all block temperatures |
| Space amplification | Dead space > 30% | Merge + rewrite segments |
| Tier imbalance | Hot tier > 20% of data | Demote cold blocks |
| Hot miss rate | Hot cache miss > 10% | Promote missing blocks |
## 5. Quantization Strategies by Tier
### Tier 0: Hot
**Scalar quantization to int8** (preferred) or **fp16** (for maximum recall).
```
Encoding:
q = round((v - min) / (max - min) * 255)
Decoding:
v = q / 255 * (max - min) + min
Parameters stored in QUANT_SEG:
min: f32 per dimension
max: f32 per dimension
```
Distance computation directly on int8 using SIMD (vpsubb + vpmaddubsw on AVX-512).
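The encode/decode formulas translate directly to code. A per-dimension sketch (the min/max parameters would come from QUANT_SEG; function names are illustrative):

```python
def sq_encode(v, lo, hi):
    """Scalar-quantize one vector with per-dimension min/max,
    per the formulas above: codes in 0..255."""
    return [round((x - l) / (h - l) * 255) for x, l, h in zip(v, lo, hi)]

def sq_decode(q, lo, hi):
    """Approximate reconstruction from the 8-bit codes."""
    return [c / 255 * (h - l) + l for c, l, h in zip(q, lo, hi)]
```

The reconstruction error per dimension is bounded by half a quantization step, `(max - min) / 510`.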
### Tier 1: Warm
**Product Quantization (PQ)** with 5-7 bits per sub-vector.
```
Parameters:
M subspaces: 48 (for 384-dim vectors, 8 dims per subspace)
K centroids per sub: 64 (6-bit) or 128 (7-bit)
Codebook: M * K * 8 * sizeof(f32) = 48 * 64 * 8 * 4 = 96 KB
Encoding:
For each subvector: find nearest centroid -> store centroid index
Distance computation:
ADC (Asymmetric Distance Computation) with precomputed distance tables
```
### Tier 2: Cold
**Binary quantization** (1-bit) or **ternary quantization** (2-bit / 3-bit).
```
Binary encoding:
b = sign(v) -> 1 bit per dimension
384 dims -> 48 bytes per vector (32x compression)
Distance:
Hamming distance via POPCNT
XOR + POPCNT on AVX-512: 512 bits per cycle
Ternary (3-bit with magnitude):
t = {-1, 0, +1} based on threshold
magnitude = |v| quantized to 3 levels
384 dims -> 144 bytes per vector (10.7x compression)
```
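Binary quantization and Hamming distance are a few lines each. A software sketch (a real implementation would use POPCNT/VPOPCNTDQ as noted above; helper names are illustrative):

```python
def binarize(v):
    """Pack sign bits, one bit per dimension: 384 dims -> 48 bytes."""
    bits = 0
    for i, x in enumerate(v):
        if x > 0:
            bits |= 1 << i
    return bits.to_bytes((len(v) + 7) // 8, "little")

def hamming(a, b):
    """XOR then popcount; bin().count is a software stand-in for POPCNT."""
    x = int.from_bytes(a, "little") ^ int.from_bytes(b, "little")
    return bin(x).count("1")
```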
### Codebook Storage
All quantization parameters (codebooks, min/max ranges, centroids) are stored
in QUANT_SEG segments. The root manifest's `quantdict_seg_offset` hotset pointer
references the active quantization dictionary for fast boot.
Multiple QUANT_SEGs can coexist for different tiers — the manifest maps each
tier to its dictionary.
## 6. Hardware Adaptation
### Desktop (AVX-512)
- Hot tier: int8 with VNNI dot product (4 int8 multiply-adds per 32-bit lane per instruction)
- Warm tier: PQ with AVX-512 gather for table lookups
- Cold tier: Binary with VPOPCNTDQ (512-bit popcount)
### ARM (NEON)
- Hot tier: int8 with SDOT instruction
- Warm tier: PQ with TBL for table lookups
- Cold tier: Binary with CNT (population count)
### WASM (v128)
- Hot tier: int8 with `i16x8.relaxed_dot_i8x16_i7x16_s` (relaxed SIMD, if available)
- Warm tier: Scalar PQ (no gather)
- Cold tier: Binary with manual popcount
### Cognitum Tile (8KB code + 8KB data + 64KB SIMD)
- Hot tier only: int8 interleaved, fits in SIMD scratch
- No warm/cold — data stays on hub, tile fetches blocks on demand
- Sketch is maintained by hub, not tile
## 7. Self-Organization Over Time
```
t=0 All data Tier 1 (default warm)
|
t+N First sketch epoch: identify hot blocks
Promote top 5% to Tier 0
|
t+2N Second epoch: validate promotions
Demote false positives back to Tier 1
Identify true cold blocks (0 access in 2 epochs)
|
t+3N Compaction: physically separate tiers
HOT_SEG created with interleaved layout
Cold blocks compressed to 3-bit
|
t+∞ Equilibrium: ~5% hot, ~30% warm, ~65% cold
File size: ~2-3x smaller than uniform fp16
Query p95: dominated by hot tier latency
```
The format converges to an equilibrium that reflects actual usage. No manual
tuning required.

---
# RVF Progressive Indexing
## 1. Index as Layers of Availability
Traditional HNSW serialization is all-or-nothing: either the full graph is loaded,
or nothing works. RVF decomposes the index into three layers of availability, each
independently useful, each stored in separate INDEX_SEG segments.
```
Layer C: Full Adjacency
+--------------------------------------------------+
| Complete neighbor lists for every node at every |
| HNSW level. Built lazily. Optional for queries. |
| Recall: >= 0.95 |
+--------------------------------------------------+
^ loaded last (seconds to minutes)
|
Layer B: Partial Adjacency
+--------------------------------------------------+
| Neighbor lists for the most-accessed region |
| (determined by temperature sketch). Covers the |
| hot working set of the graph. |
| Recall: >= 0.85 |
+--------------------------------------------------+
^ loaded second (100ms - 1s)
|
Layer A: Entry Points + Coarse Routing
+--------------------------------------------------+
| HNSW entry points. Top-layer adjacency lists. |
| Cluster centroids for IVF pre-routing. |
| Always present. Always in Level 0 hotset. |
| Recall: >= 0.70 |
+--------------------------------------------------+
^ loaded first (< 5ms)
|
File open
```
### Why Three Layers
| Layer | Purpose | Data Size (10M vectors) | Load Time (NVMe) |
|-------|---------|------------------------|-------------------|
| A | First query possible | 1-4 MB | < 5 ms |
| B | Good quality for working set | 50-200 MB | 100-500 ms |
| C | Full recall for all queries | 1-4 GB | 2-10 s |
A system that only loads Layer A can still answer queries — just with lower recall.
As layers B and C load asynchronously, quality improves transparently.
## 2. Layer A: Entry Points and Coarse Routing
### Content
- **HNSW entry points**: The node(s) at the highest layer of the HNSW graph.
Typically 1 node, but may be multiple for redundancy.
- **Top-layer adjacency**: Full neighbor lists for all nodes at HNSW layers
>= ceil(ln(N) / ln(M)) - 2. For 10M vectors with M=16, this is layers 4-6,
containing ~100-1000 nodes.
- **Cluster centroids**: K centroids (K = sqrt(N) typically, so ~3162 for 10M)
used for IVF-style partition routing.
- **Centroid-to-partition map**: Which centroid owns which vector ID ranges.
### Storage
Layer A data is stored in a dedicated INDEX_SEG with `flags.HOT` set. The root
manifest's hotset pointers reference this segment directly. On cold start, this
is the first data mapped after the manifest.
### Binary Layout of Layer A INDEX_SEG
```
+-------------------------------------------+
| Header: INDEX_SEG, flags=HOT |
+-------------------------------------------+
| Block 0: Entry Points |
| entry_count: u32 |
| max_layer: u32 |
| [entry_node_id: u64, layer: u32] * N |
+-------------------------------------------+
| Block 1: Top-Layer Adjacency |
| layer_count: u32 |
| For each layer (top to bottom): |
| node_count: u32 |
| For each node: |
| node_id: u64 |
| neighbor_count: u16 |
| [neighbor_id: u64] * neighbor_count |
| [64B padding] |
+-------------------------------------------+
| Block 2: Centroids |
| centroid_count: u32 |
| dim: u16 |
| dtype: u8 (fp16) |
| [centroid_vector: fp16 * dim] * K |
| [64B aligned] |
+-------------------------------------------+
| Block 3: Partition Map |
| partition_count: u32 |
| For each partition: |
| centroid_id: u32 |
| vector_id_start: u64 |
| vector_id_end: u64 |
| segment_ref: u64 (segment_id) |
| block_ref: u32 (block offset) |
+-------------------------------------------+
```
### Query Using Only Layer A
```python
def query_layer_a_only(query, k, layer_a, hot_cache=None):
    # Step 1: Find nearest centroids (IVF-style routing)
    dists = [distance(query, c) for c in layer_a.centroids]
    top_partitions = top_n(dists, n_probe)
    # Step 2: HNSW search through top layers only
    current = layer_a.entry_points[0]
    for layer in range(layer_a.max_layer, layer_a.min_available_layer, -1):
        current = greedy_search(query, current, layer_a.adjacency[layer])
    # Step 3: If a hot cache is mapped, refine against it
    if hot_cache is not None:
        candidates = scan_hot_cache(query, hot_cache, current.partition)
        return top_k(candidates, k)
    # Step 4: Otherwise, return centroid-approximate results
    return approximate_from_centroids(query, top_partitions, k)
```
Expected recall: 0.65-0.75 (depends on centroid quality and hot cache coverage).
## 3. Layer B: Partial Adjacency
### Content
Neighbor lists for the **hot region** of the graph — the set of nodes that appear
most frequently in query traversals. Determined by the temperature sketch (see
03-temperature-tiering.md).
Typically covers:
- All nodes at HNSW layers >= 2
- Layer 0-1 nodes in the hot temperature tier
- ~10-20% of total nodes
### Storage
Layer B is stored in one or more INDEX_SEGs without the HOT flag. The Level 1
manifest maps these segments and records which node ID ranges they cover.
### Incremental Build
Layer B can be built incrementally:
```
1. After Layer A is loaded, begin query serving
2. In background: read VEC_SEGs for hot-tier blocks
3. Build HNSW adjacency for those blocks
4. Write as new INDEX_SEG
5. Update manifest to include Layer B
6. Future queries use Layer B for better recall
```
This means the index improves over time without blocking any queries.
### Partial Adjacency Routing
When a query traversal reaches a node without Layer B adjacency (i.e., it's in
the cold region), the system falls back to:
1. **Centroid routing**: Use Layer A centroids to estimate the nearest region
2. **Linear scan**: Scan the relevant VEC_SEG block directly
3. **Approximate**: Accept slightly lower recall for that portion
```python
def search_with_partial_index(query, k, layers):
    # Start with Layer A routing
    current = hnsw_search_layers(query, layers.a, layers.a.max_layer, 2)
    # Continue with Layer B (where available)
    if layers.b.has_node(current):
        current = hnsw_search_layers(query, layers.b, 1, 0, start=current)
    else:
        # Fallback: scan the block containing current
        candidates = linear_scan_block(query, current.block)
        current = best_of(current, candidates)
    return top_k(current.visited, k)
```
## 4. Layer C: Full Adjacency
### Content
Complete neighbor lists for every node at every HNSW level. This is the
traditional full HNSW graph.
### Storage
Layer C may be split across multiple INDEX_SEGs for large datasets. The
manifest records the node ID ranges covered by each segment.
### Lazy Build
Layer C is built lazily — it is not required for the file to be functional.
The build process runs as a background task:
```
1. Identify unindexed VEC_SEG blocks (those without Layer C adjacency)
2. Read blocks in partition order (good locality)
3. Build HNSW adjacency using the existing partial graph as scaffold
4. Write new INDEX_SEG(s)
5. Update manifest
```
### Build Prioritization
Blocks are indexed in temperature order:
1. Hot blocks first (most query benefit)
2. Warm blocks next
3. Cold blocks last (may never be indexed if queries don't reach them)
This means the index build converges to useful quality fast, then approaches
completeness asymptotically.
## 5. Index Segment Binary Format
### Adjacency List Encoding
Neighbor lists are stored using **varint delta encoding with restart points**
for fast random access:
```
+-------------------------------------------+
| Restart Point Index |
| restart_interval: u32 (e.g., 64) |
| restart_count: u32 |
| [restart_offset: u32] * restart_count |
| [64B aligned] |
+-------------------------------------------+
| Adjacency Data |
| For each node (sorted by node_id): |
| neighbor_count: varint |
| [delta_encoded_neighbor_id: varint] |
| (restart point every N nodes) |
+-------------------------------------------+
```
**Restart points**: Every `restart_interval` nodes (default 64), the delta
encoding resets to absolute IDs. This enables O(1) random access to any node's
neighbors by:
1. Binary search the restart point index for the nearest restart <= target
2. Seek to that restart offset
3. Sequentially decode from restart to target (at most 63 decodes)
### Varint Encoding
Standard LEB128 varint:
- Values 0-127: 1 byte
- Values 128-16383: 2 bytes
- Values 16384-2097151: 3 bytes
For delta-encoded neighbor IDs (typical delta: 1-1000), most values fit in 1-2
bytes, giving ~3-4x compression over fixed u64.
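The LEB128 scheme can be sketched in a few lines; `encode_neighbors` is a hypothetical helper that packs a sorted neighbor list as a varint count followed by varint deltas, matching the adjacency layout above:

```python
def varint_encode(n: int) -> bytes:
    """LEB128: 7 payload bits per byte, high bit = continuation."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)   # more bytes follow
        else:
            out.append(byte)
            return bytes(out)

def encode_neighbors(ids) -> bytes:
    """Sort, delta-encode, and varint-pack one neighbor list."""
    ids = sorted(ids)
    out = bytearray(varint_encode(len(ids)))   # neighbor_count
    prev = 0
    for i in ids:
        out += varint_encode(i - prev)         # delta from previous ID
        prev = i
    return bytes(out)

# [1000, 1005, 1008, 1020, 1100] -> deltas [1000, 5, 3, 12, 80]
enc = encode_neighbors([1000, 1005, 1008, 1020, 1100])
assert len(enc) == 7   # 1 count byte + 2 + 1 + 1 + 1 + 1 delta bytes
```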
### Prefetch Hints
The manifest's prefetch table maps node ID ranges to contiguous page ranges:
```
Prefetch Entry:
node_id_start: u64
node_id_end: u64
page_offset: u64 Offset of first contiguous page
page_count: u32 Number of contiguous pages
prefetch_ahead: u32 Pages to prefetch ahead of current access
```
When the HNSW search accesses a node, the runtime issues `madvise(WILLNEED)`
(or equivalent) for the next `prefetch_ahead` pages. This hides disk/memory
latency behind computation.
## 6. Index Consistency
### Append-Only Index Updates
When new vectors are added:
1. New vectors go into a **fresh VEC_SEG** (append-only)
2. A temporary in-memory index covers the new vectors
3. When the in-memory index reaches a threshold, it is written as a new INDEX_SEG
4. The manifest is updated to include both the old and new INDEX_SEGs
5. Queries search both indexes and merge results
This is analogous to LSM-tree compaction levels but for graph indexes.
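A sketch of step 5 (searching both indexes and merging results), assuming hypothetical index objects whose `search` returns distance-sorted `(distance, vector_id)` pairs:

```python
import heapq

def search_merged(query, k, sealed_index, memtable_index, deleted):
    """Query both the sealed on-disk index and the in-memory index
    covering freshly appended vectors, then merge by distance."""
    a = sealed_index.search(query, k)      # distance-sorted hits
    b = memtable_index.search(query, k)
    merged = {}
    for dist, vid in heapq.merge(a, b):    # both inputs already sorted
        if vid not in deleted and vid not in merged:
            merged[vid] = dist             # dedup: keep best distance
        if len(merged) == k:
            break
    return sorted((d, v) for v, d in merged.items())

class _StubIndex:
    """Hypothetical stand-in returning precomputed, sorted hits."""
    def __init__(self, hits):
        self.hits = hits
    def search(self, query, k):
        return self.hits[:k]

sealed = _StubIndex([(0.1, 1), (0.4, 3)])
memtable = _StubIndex([(0.2, 2), (0.3, 1)])
res = search_merged(None, 2, sealed, memtable, deleted={3})
# res == [(0.1, 1), (0.2, 2)]
```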
### Index Merging
When too many small INDEX_SEGs accumulate:
```
1. Read all small INDEX_SEGs
2. Build a unified HNSW graph over all vectors
3. Write as a single sealed INDEX_SEG
4. Tombstone old INDEX_SEGs in manifest
```
### Concurrent Read/Write
Readers always see a consistent snapshot through the manifest chain:
- Reader opens file -> reads manifest -> has immutable segment set
- Writer appends new segments + new manifest
- Reader continues using old manifest until it explicitly re-reads
- No locks needed — append-only guarantees no mutation of existing data
## 7. Query Path Integration
The complete query path combining progressive indexing with temperature tiering:
```
Query
|
v
+-----------+
| Layer A | Entry points + top-layer routing
| (always) | ~5ms to load on cold start
+-----------+
|
Is Layer B available for this region?
/ \
Yes No
/ \
+-----------+ +-----------+
| Layer B | | Centroid |
| HNSW | | Fallback |
| search | | + scan |
+-----------+ +-----------+
\ /
\ /
v v
+-----------+
| Candidate |
| Set |
+-----------+
|
Is hot cache available?
/ \
Yes No
/ \
+-----------+ +-----------+
| Hot cache | | Decode |
| re-rank | | from |
| (int8/fp16)| | VEC_SEG |
+-----------+ +-----------+
\ /
v v
+-----------+
| Top-K |
| Results |
+-----------+
```
### Recall Expectations by State
| State | Layers Available | Expected Recall@10 |
|-------|-----------------|-------------------|
| Cold start (L0 only) | A | 0.65-0.75 |
| L0 + hot cache | A + hot | 0.75-0.85 |
| L0 + L1 loading | A + B partial | 0.80-0.90 |
| L1 complete | A + B | 0.85-0.92 |
| Full load | A + B + C | 0.95-0.99 |
| Full + optimized | A + B + C + hot | 0.98-0.999 |

---
# RVF Overlay Epochs
## 1. Streaming Dynamic Min-Cut Overlay
The overlay system manages dynamic graph partitioning — how the vector space is
subdivided for distributed search, shard routing, and load balancing. Unlike
static partitioning, RVF overlays evolve with the data through an epoch-based
model that bounds memory, bounds load time, and enables rollback.
## 2. Overlay Segment Structure
Each OVERLAY_SEG stores a delta relative to the previous epoch's partition state:
```
+-------------------------------------------+
| Header: OVERLAY_SEG |
+-------------------------------------------+
| Epoch Header |
| epoch: u32 |
| parent_epoch: u32 |
| parent_seg_id: u64 |
| rollback_offset: u64 |
| timestamp_ns: u64 |
| delta_count: u32 |
| partition_count: u32 |
+-------------------------------------------+
| Edge Deltas |
| For each delta: |
| delta_type: u8 (ADD=1, REMOVE=2, |
| REWEIGHT=3) |
| src_node: u64 |
| dst_node: u64 |
| weight: f32 (for ADD/REWEIGHT) |
| [64B aligned] |
+-------------------------------------------+
| Partition Summaries |
| For each partition: |
| partition_id: u32 |
| node_count: u64 |
| edge_cut_weight: f64 |
| centroid: [fp16 * dim] |
| node_id_range_start: u64 |
| node_id_range_end: u64 |
| [64B aligned] |
+-------------------------------------------+
| Min-Cut Witness |
| witness_type: u8 |
| 0 = checksum only |
| 1 = full certificate |
| cut_value: f64 |
| cut_edge_count: u32 |
| partition_hash: [u8; 32] (SHAKE-256) |
| If witness_type == 1: |
| [cut_edge: (u64, u64)] * count |
| [64B aligned] |
+-------------------------------------------+
| Rollback Pointer |
| prev_epoch_offset: u64 |
| prev_epoch_hash: [u8; 16] |
+-------------------------------------------+
```
## 3. Epoch Lifecycle
### Epoch Creation
A new epoch is created when:
- A batch of vectors is inserted that changes partition balance by > threshold
- The accumulated edge deltas exceed a size limit (default: 1 MB)
- A manual rebalance is triggered
- A merge/compaction produces a new partition layout
```
Epoch 0 (initial) Epoch 1 Epoch 2
+----------------+ +----------------+ +----------------+
| Full snapshot | | Deltas vs E0 | | Deltas vs E1 |
| of partitions | | +50 edges | | +30 edges |
| 32 partitions | | -12 edges | | -8 edges |
| min-cut: 0.342 | | rebalance: P3 | | split: P7->P7a |
+----------------+ +----------------+ +----------------+
```
### State Reconstruction
To reconstruct the current partition state:
```
1. Read latest MANIFEST_SEG -> get current_epoch
2. Read OVERLAY_SEG for current_epoch
3. If overlay is a delta: recursively read parent epochs
4. Apply deltas in order: base -> epoch 1 -> epoch 2 -> ... -> current
5. Result: complete partition state
```
For efficiency, the manifest caches the **last full snapshot epoch**. Delta
chains never exceed a configurable depth (default: 8 epochs) before a new
snapshot is forced.
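A sketch of the reconstruction walk, with a hypothetical `read_overlay(epoch)` accessor and deliberately simplified delta application (a real implementation applies ADD/REMOVE/REWEIGHT edge deltas):

```python
def reconstruct_state(current_epoch, read_overlay):
    """Walk parent pointers back to the last full snapshot, then
    apply deltas forward in epoch order (oldest first)."""
    chain = []
    seg = read_overlay(current_epoch)
    while not seg.is_snapshot:          # delta segment: follow parent
        chain.append(seg)
        seg = read_overlay(seg.parent_epoch)
    state = dict(seg.base_state)        # partition state of the snapshot
    for delta_seg in reversed(chain):   # base -> ... -> current
        for key, value in delta_seg.deltas:
            state[key] = value          # stand-in for edge-delta apply
    return state

class Seg:
    """Hypothetical overlay-segment shape for the demo."""
    def __init__(self, is_snapshot, parent_epoch=None, base_state=None, deltas=()):
        self.is_snapshot = is_snapshot
        self.parent_epoch = parent_epoch
        self.base_state = base_state or {}
        self.deltas = list(deltas)

segs = {0: Seg(True, base_state={"P0": 10}),
        1: Seg(False, 0, deltas=[("P1", 5)]),
        2: Seg(False, 1, deltas=[("P0", 12)])}
state = reconstruct_state(2, segs.__getitem__)   # {"P0": 12, "P1": 5}
```

Because the chain depth is capped (default 8), this walk touches a bounded number of segments regardless of file age.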
### Compaction (Epoch Collapse)
When the delta chain reaches maximum depth:
```
1. Reconstruct full state from chain
2. Write new OVERLAY_SEG with witness_type=full_snapshot
3. This becomes the new base epoch
4. Old overlay segments are tombstoned
5. New delta chain starts from this base
```
```
Before: E0(snap) -> E1(delta) -> E2(delta) -> ... -> E8(delta)
After: E0(snap) -> ... -> E8(delta) -> E9(snap, compacted)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
These can be garbage collected
```
## 4. Min-Cut Witness
The min-cut witness provides a cryptographic proof that the current partition
is "good enough" — that the edge cut is within acceptable bounds.
### Witness Types
**Type 0: Checksum Only**
A SHAKE-256 hash of the complete partition state. Allows verification that
the state is consistent but doesn't prove optimality.
```
witness = SHAKE-256(
for each partition sorted by id:
partition_id || node_count || sorted(node_ids) || edge_cut_weight
)
```
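A Type-0 witness can be computed with a standard SHAKE-256 implementation; the field widths here are illustrative, not normative:

```python
import hashlib
import struct

def partition_witness(partitions) -> bytes:
    """Type-0 witness: SHAKE-256 over the canonical (sorted) partition
    state. `partitions`: iterable of (partition_id, node_ids, cut_weight)."""
    h = hashlib.shake_256()
    for pid, node_ids, cut_w in sorted(partitions):
        h.update(pid.to_bytes(4, "little"))
        h.update(len(node_ids).to_bytes(8, "little"))
        for nid in sorted(node_ids):          # canonical node order
            h.update(nid.to_bytes(8, "little"))
        h.update(struct.pack("<d", cut_w))
    return h.digest(32)                       # 32-byte partition_hash

w1 = partition_witness([(0, [3, 1], 1.0), (1, [2], 0.5)])
w2 = partition_witness([(1, [2], 0.5), (0, [1, 3], 1.0)])
assert w1 == w2 and len(w1) == 32             # order-independent
```

Sorting both partitions and node IDs before hashing is what makes the witness canonical: two readers with the same logical state always derive the same hash.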
**Type 1: Full Certificate**
Lists the actual cut edges. Allows any reader to verify that:
1. The listed edges are the only edges crossing partition boundaries
2. The total cut weight matches `cut_value`
3. No better cut exists within the local search neighborhood (optional)
### Bounded-Time Min-Cut Updates
Full min-cut computation is expensive (O(V * E) for max-flow). RVF uses
**incremental min-cut maintenance**:
For each edge delta:
```
1. If ADD(u, v) where u and v are in same partition:
-> No cut change. O(1).
2. If ADD(u, v) where u in P_i and v in P_j:
-> cut_weight[P_i][P_j] += weight. O(1).
-> Check if moving u to P_j or v to P_i reduces total cut.
-> If yes: execute move, update partition summaries. O(degree).
3. If REMOVE(u, v) across partitions:
-> cut_weight[P_i][P_j] -= weight. O(1).
-> No rebalance needed (cut improved).
4. If REMOVE(u, v) within same partition:
-> Check connectivity. If partition splits: create new partition. O(component).
```
This bounds update time to O(max_degree) per edge delta in the common case,
with O(component_size) in the rare partition-split case.
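The O(degree) local-move check in case 2 reduces to a gain computation. This sketch assumes a plain adjacency map and leaves executing the move (and the `cut_weight` bookkeeping) to the caller:

```python
def move_gain(node, target, part_of, adj):
    """Cut-weight reduction from moving `node` into partition `target`:
    edges to `target` leave the cut, formerly-internal edges join it."""
    internal = external = 0.0
    for nbr, w in adj[node]:
        if part_of[nbr] == part_of[node]:
            internal += w                 # would become cut edges
        elif part_of[nbr] == target:
            external += w                 # would leave the cut
    return external - internal            # > 0: move strictly reduces cut

# ADD(0, 2) with weight 2.0 crosses P0/P1: check both one-hop moves
part_of = {0: "P0", 1: "P0", 2: "P1"}
adj = {0: [(1, 1.0), (2, 2.0)], 1: [(0, 1.0)], 2: [(0, 2.0)]}
assert move_gain(0, "P1", part_of, adj) == 1.0
assert move_gain(2, "P0", part_of, adj) == 2.0
```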
### Semi-Streaming Min-Cut
For large-scale rebalancing (e.g., after bulk insert), RVF uses a semi-streaming
algorithm inspired by Assadi et al.:
```
Phase 1: Single pass over edges to build a sparse skeleton
- Sample each edge with probability O(1/epsilon)
- Space: O(n * polylog(n))
Phase 2: Compute min-cut on skeleton
- Standard max-flow on sparse graph
- Time: O(n^2 * polylog(n))
Phase 3: Verify against full edge set
- Stream edges again, check cut validity
- If invalid: refine skeleton and repeat
```
This runs in O(n * polylog(n)) space regardless of edge count, making it
suitable for streaming over massive graphs.
## 5. Overlay Size Management
### Size Threshold
Each OVERLAY_SEG has a maximum payload size (configurable, default 1 MB).
When the accumulated deltas for the current epoch approach this threshold,
a new epoch is forced.
### Memory Budget
The total memory for overlay state is bounded:
```
max_overlay_memory = max_chain_depth * max_seg_size + snapshot_size
= 8 * 1 MB + snapshot_size
```
For 10M vectors with 32 partitions:
- Snapshot: 32 partitions * (8 + 16 + 768) bytes per partition ≈ 25 KB
- Delta chain: ≤ 8 MB
- Total: ≤ 9 MB
This is a fixed overhead regardless of dataset size (partition count scales
sublinearly).
### Garbage Collection
Overlay segments behind the last full snapshot are candidates for garbage
collection. The manifest tracks which overlay segments are still reachable
from the current epoch chain.
```
Reachable: current_epoch -> parent -> ... -> last_snapshot
Unreachable: Everything before last_snapshot (safely deletable)
```
GC runs during compaction. Old OVERLAY_SEGs are tombstoned in the manifest
and their space is reclaimed on file rewrite.
## 6. Distributed Overlay Coordination
When RVF files are sharded across multiple nodes, the overlay system coordinates
partition state:
### Shard-Local Overlays
Each shard maintains its own OVERLAY_SEG chain for its local partitions.
The global partition state is the union of all shard-local overlays.
### Cross-Shard Rebalancing
When a partition becomes unbalanced across shards:
```
1. Coordinator computes target partition assignment
2. Each shard writes a JOURNAL_SEG with vector move instructions
3. Vectors are copied (not moved — append-only) to target shards
4. Each shard writes a new OVERLAY_SEG reflecting the new partition
5. Coordinator writes a global MANIFEST_SEG with new shard map
```
This is eventually consistent — during rebalancing, queries may search both
old and new locations and deduplicate results.
### Consistency Model
**Within a shard**: Linearizable (single-writer, manifest chain)
**Across shards**: Eventually consistent with bounded staleness
The epoch counter provides a total order for convergence checking:
- If all shards report epoch >= E, the global state at epoch E is complete
- Stale shards are detectable by comparing epoch counters
## 7. Epoch-Aware Query Routing
Queries use the overlay state for partition routing:
```python
def route_query(query, overlay):
    # Find nearest partition centroids
    dists = [distance(query, p.centroid) for p in overlay.partitions]
    target_partitions = top_n(dists, n_probe)
    # Check epoch freshness
    if overlay.epoch < current_epoch - stale_threshold:
        # Overlay is stale: broaden the search
        target_partitions = top_n(dists, n_probe * 2)
    return target_partitions
```
### Epoch Rollback
If an overlay epoch is found to be corrupt or suboptimal:
```
1. Read rollback_pointer from current OVERLAY_SEG
2. The pointer gives the offset of the previous epoch's OVERLAY_SEG
3. Write a new MANIFEST_SEG pointing to the previous epoch as current
4. Future writes continue from the rolled-back state
```
This provides O(1) rollback to any ancestor epoch in the chain.
## 8. Integration with Progressive Indexing
The overlay system and the index system are coupled:
- **Partition centroids** in the overlay guide Layer A routing
- **Partition boundaries** determine which INDEX_SEGs cover which regions
- **Partition rebalancing** may invalidate Layer B adjacency for moved vectors
(these are rebuilt lazily)
- **Layer C** is partition-aligned: each INDEX_SEG covers vectors within
  a single partition for locality
This means overlay compaction can trigger partial index rebuild, but only for
the affected partitions — not the entire index.

---
# RVF Ultra-Fast Query Path
## 1. CPU Shape Optimization
The block layout determines performance at the hardware level. RVF is designed
to match the shape of modern CPUs: wide SIMD, deep caches, hardware prefetch.
### Four Optimizations
1. **Strict 64-byte alignment** for all numeric arrays
2. **Columnar + interleaved hybrid** for compression and speed
3. **Prefetch hints** for cache-friendly graph traversal
4. **Dictionary-coded IDs** for fast random access
## 2. Strict Alignment
Every numeric array in RVF starts at a 64-byte aligned offset. This matches:
| Target | Register Width | Alignment |
|--------|---------------|-----------|
| AVX-512 | 512 bits = 64 bytes | 64 B |
| AVX2 | 256 bits = 32 bytes | 64 B (superset) |
| ARM NEON | 128 bits = 16 bytes | 64 B (superset) |
| WASM v128 | 128 bits = 16 bytes | 64 B (superset) |
| Cache line | Typically 64 bytes | 64 B (exact) |
By aligning to 64 bytes, RVF ensures:
- Zero-copy load into any SIMD register (no unaligned penalty)
- No cache-line splits (each access touches exactly one cache line)
- Optimal hardware prefetch behavior (prefetcher operates on cache lines)
### Alignment in Practice
```
Segment header: 64 B (naturally aligned, first item in segment)
Block header: Padded to 64 B boundary
Vector data start: 64 B aligned from block start
Each dimension column: 64 B aligned (columnar VEC_SEG)
Each vector entry: 64 B aligned (interleaved HOT_SEG)
ID map: 64 B aligned
Restart point index: 64 B aligned
```
Padding bytes between sections are zero-filled and excluded from checksums.
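The padding rule reduces to a standard power-of-two round-up; a minimal helper:

```python
def align_up(offset: int, alignment: int = 64) -> int:
    """Round `offset` up to the next alignment boundary (power of two).
    No-op when the offset is already aligned."""
    return (offset + alignment - 1) & ~(alignment - 1)

assert align_up(0) == 0
assert align_up(1) == 64
assert align_up(906) == 960   # e.g. a 906-byte entry padded to 960 B
```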
## 3. Columnar + Interleaved Hybrid
### Columnar Storage (VEC_SEG) — Optimized for Compression
```
Block layout (1024 vectors, 384 dimensions, fp16):
Offset 0x000: dim_0[vec_0], dim_0[vec_1], ..., dim_0[vec_1023] (2048 B)
Offset 0x800: dim_1[vec_0], dim_1[vec_1], ..., dim_1[vec_1023] (2048 B)
...
Offset 0xBF800: dim_383[vec_0], ..., dim_383[vec_1023] (2048 B)
Total: 384 * 2048 = 786,432 bytes (768 KB per block)
```
**Why columnar for cold/warm storage**:
- Adjacent values in the same dimension are correlated -> higher compression ratio
- LZ4 on columnar fp16 achieves 1.5-2.5x compression (vs 1.1-1.3x on interleaved)
- ZSTD on columnar fp16 achieves 2.5-4x compression
- Batch operations (computing mean, variance) scan one dimension at a time
### Interleaved Storage (HOT_SEG) — Optimized for Speed
```
Entry layout (one hot vector, 384 dim fp16):
Offset 0x000: vector_id (8 B)
Offset 0x008: dim_0, dim_1, dim_2, ..., dim_383 (768 B)
Offset 0x308: neighbor_count (2 B)
Offset 0x30A: neighbor_0, neighbor_1, ... (8 B each)
Offset 0x38A: padding to 64B boundary
--> 960 bytes per entry (at M=16 neighbors)
```
**Why interleaved for hot data**:
- One vector = one sequential read (no column gathering)
- Distance computation: load vector, compute, move to next (streaming pattern)
- Neighbors co-located: after finding a good candidate, immediately traverse
- 960 bytes per entry = 15 cache lines = predictable memory access
### When to Use Each
| Operation | Layout | Reason |
|-----------|--------|--------|
| Bulk distance computation | Columnar | SIMD operates on dimension columns |
| Top-K refinement scan | Interleaved | Sequential scan of candidates |
| Compression/archival | Columnar | Better ratio |
| HNSW search (hot region) | Interleaved | Vector + neighbors together |
| Batch insert | Columnar | Write once, compress well |
## 4. Prefetch Hints
### The Problem
HNSW search is pointer-chasing: compute distance at node A, read neighbor
list, jump to node B, compute distance, repeat. Each jump is a random
memory access. On a 10M vector file, this means:
```
HNSW search: ~100-200 distance computations per query
Each computation: 1 random read (vector) + 1 random read (neighbors)
Random read latency: 50-100 ns (DRAM), 10-50 μs (SSD)
Total: 10-40 μs (DRAM), 1-10 ms (SSD) without prefetch
```
### The Solution
Store neighbor lists **contiguously** and add **prefetch offsets** in the
manifest so the runtime can issue prefetch instructions ahead of time.
### Prefetch Table Structure
The manifest contains a prefetch table mapping node ID ranges to contiguous
page regions:
```
prefetch_table:
entry_count: u32
entries:
[0]: node_ids 0-9999 -> pages at offset 0x100000, 50 pages, prefetch 3 ahead
[1]: node_ids 10000-19999 -> pages at offset 0x200000, 50 pages, prefetch 3 ahead
...
```
### Runtime Prefetch Strategy
```python
def hnsw_search_with_prefetch(query, entry_point, ef_search, k):
    candidates = MaxHeap()
    visited = BitSet()
    worklist = MinHeap([(distance(query, entry_point), entry_point)])
    while worklist:
        dist, node = worklist.pop()
        # PREFETCH: while processing this node, prefetch neighbors' data
        neighbors = get_neighbors(node)
        for n in neighbors[:PREFETCH_AHEAD]:
            if n not in visited:
                prefetch_vector(n)     # madvise(WILLNEED) or __builtin_prefetch
                prefetch_neighbors(n)  # prefetch the neighbor-list page
        # COMPUTE: distance to neighbors (data should be in cache by now)
        for n in neighbors:
            if n not in visited:
                visited.add(n)
                d = distance(query, get_vector(n))
                if len(candidates) < ef_search or d < candidates.max():
                    candidates.push((d, n))
                    worklist.push((d, n))
    return candidates.top_k(k)
```
### Contiguous Neighbor Layout
HOT_SEG stores vectors and neighbors together. For cold INDEX_SEGs, neighbor
lists are laid out in **node ID order** within contiguous pages:
```
Page 0: neighbors[node_0], neighbors[node_1], ..., neighbors[node_63]
Page 1: neighbors[node_64], ..., neighbors[node_127]
...
```
Because HNSW search tends to traverse nodes in the same graph neighborhood
(spatially close node IDs if data was inserted in order), sequential node
IDs tend to be accessed together. Contiguous layout turns random access
into sequential reads.
### Expected Improvement
| Configuration | p95 Latency (10M vectors) |
|--------------|--------------------------|
| No prefetch, random layout | 2.5 ms |
| No prefetch, contiguous layout | 1.2 ms |
| Prefetch, contiguous layout | 0.3 ms |
| Prefetch, contiguous + hot cache | 0.15 ms |
## 5. Dictionary-Coded IDs
### The Problem
Vector IDs in neighbor lists and ID maps are 64-bit integers. For 10M vectors,
most IDs fit in 24 bits. Storing full 64-bit IDs wastes ~5 bytes per entry.
With M=16 neighbors per node and 10M nodes:
- Raw: 10M * 16 * 8 = 1.2 GB of ID data
- Desired: < 300 MB
### Varint Delta Encoding
IDs within a block or neighbor list are sorted and delta-encoded:
```
Original IDs: [1000, 1005, 1008, 1020, 1100]
Deltas: [1000, 5, 3, 12, 80]
Varint bytes: [ 2B, 1B, 1B, 1B, 1B] = 6 bytes (vs 40 bytes raw)
```
### Restart Points
Every N entries (default N=64), the delta resets to an absolute value:
```
Group 0 (entries 0-63): delta from 0 (absolute start)
Group 1 (entries 64-127): delta from entry[64] (restart)
Group 2 (entries 128-191): delta from entry[128] (restart)
```
The restart point index stores the offset of each restart group:
```
restart_index:
interval: 64
offsets: [0, 156, 298, 445, ...] // byte offsets into encoded data
```
### Random Access
To find the neighbors of node N:
```
1. group = N / restart_interval // O(1)
2. offset = restart_index[group] // O(1)
3. seek to offset in encoded data // O(1)
4. decode sequentially from restart to N // O(restart_interval) = O(64)
```
Total: O(64) varint decodes = ~50-100 ns. Compare with sorted array binary
search: O(log N) = O(24) comparisons with cache misses = ~200-500 ns.
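The restart-point lookup can be sketched end to end (a small encoder is included for the demo; shapes are illustrative, not the normative wire format):

```python
def _varint(n: int) -> bytes:
    """LEB128 encode: 7 payload bits per byte, high bit = continuation."""
    out = bytearray()
    while True:
        b = n & 0x7F
        n >>= 7
        out.append(b | 0x80 if n else b)
        if not n:
            return bytes(out)

def encode_with_restarts(ids, interval=64):
    """Delta-encode sorted IDs; each restart group begins absolute."""
    encoded, restarts = bytearray(), []
    for i, v in enumerate(ids):
        if i % interval == 0:
            restarts.append(len(encoded))    # byte offset of the group
            encoded += _varint(v)            # absolute value at restart
        else:
            encoded += _varint(v - ids[i - 1])
    return bytes(encoded), restarts

def id_at(index, encoded, restarts, interval=64):
    """O(interval) random access: jump to the restart group, decode forward."""
    group, skip = divmod(index, interval)
    pos, val = restarts[group], 0            # group starts absolute
    for _ in range(skip + 1):                # at most `interval` decodes
        delta = shift = 0
        while True:
            b = encoded[pos]; pos += 1
            delta |= (b & 0x7F) << shift
            if not b & 0x80:
                break
            shift += 7
        val += delta
    return val

ids = list(range(1000, 1500, 3))             # sorted IDs
enc, restarts = encode_with_restarts(ids)
assert id_at(0, enc, restarts) == 1000
assert id_at(100, enc, restarts) == ids[100]
```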
### SIMD Varint Decoding
Modern SIMD can decode varints in bulk:
```
AVX-512 VBMI: ~8 varints per cycle using VPERMB + VPSHUFB
Throughput: 2-4 billion integers/second (Lemire et al.)
```
At 16 neighbors per node, one HNSW search step decodes 16 varints in ~2-4 ns.
### Compression Ratio
| Encoding | Bytes per ID (avg) | 10M * 16 neighbors |
|----------|-------------------|-------------------|
| Raw u64 | 8.0 B | 1,220 MB |
| Raw u32 | 4.0 B | 610 MB |
| Varint (no delta) | 3.2 B | 488 MB |
| Varint delta | 1.5 B | 229 MB |
| Varint delta + restart | 1.6 B | 244 MB |
Delta encoding with restart points achieves ~5x compression over raw u64
while maintaining fast random access.
## 6. Cache Behavior Analysis
### L1/L2/L3 Working Sets
For a typical query on 10M vectors (384 dim, fp16):
```
HNSW search:
~150 distance computations
Each computation: 768 B (vector) + ~128 B (neighbor list) ≈ 896 B
Total working set: 150 * 896 ≈ 131 KB
Top-K refinement (hot cache scan):
~1000 candidates checked
Each: 960 B (interleaved HOT_SEG entry)
Total: 960 KB
Query vector: 768 B (always in L1)
Quantization tables: 96 KB (PQ codebook, always in L2)
```
| Cache Level | Size | What Fits |
|------------|------|-----------|
| L1 (32-48 KB) | Query vector + current node | Always hit |
| L2 (256 KB-1 MB) | PQ tables + 100-200 hot entries | Usually hit |
| L3 (8-32 MB) | Hot cache + partial index | Mostly hit |
| DRAM | Everything | Full dataset |
### p95 Latency Budget
```
HNSW traversal: 150 nodes * 100 ns/node = 15 μs (L3 hit)
Distance compute: 150 * 50 ns = 7.5 μs (SIMD)
Top-K refinement: 1000 * 10 ns = 10 μs (hot cache, L2/L3 hit)
Overhead: 5 μs (heap ops, bookkeeping)
-------
Total p95: ~37.5 μs ≈ 0.04 ms
With prefetch: ~30 μs (hide 25% of traversal latency)
```
This matches the target of < 0.3 ms p95 on desktop hardware. The dominant
cost is memory bandwidth, not computation — which is why cache-friendly
layout and prefetch are critical.
## 7. Distance Function SIMD Implementations
### L2 Distance (fp16, 384 dim, AVX-512)
```
; 384 fp16 values = 768 bytes = 12 ZMM registers
; Process 32 fp16 values per iteration (convert to 16 fp32 per half)
.loop:
vmovdqu16 zmm0, [rsi + rcx] ; Load 32 fp16 from A
vmovdqu16 zmm1, [rdi + rcx] ; Load 32 fp16 from B
vcvtph2ps zmm2, ymm0 ; Convert low 16 to fp32
vcvtph2ps zmm3, ymm1
vsubps zmm2, zmm2, zmm3 ; diff = A - B
vfmadd231ps zmm4, zmm2, zmm2 ; acc += diff * diff
; Repeat for high 16
vextracti64x4 ymm0, zmm0, 1
vextracti64x4 ymm1, zmm1, 1
vcvtph2ps zmm2, ymm0
vcvtph2ps zmm3, ymm1
vsubps zmm2, zmm2, zmm3
vfmadd231ps zmm4, zmm2, zmm2
add rcx, 64
cmp rcx, 768
jl .loop
; Horizontal sum of zmm4 -> scalar result
; ~12 iterations, ~24 FMA ops, ~12 cycles total
```
### Inner Product (int8, 384 dim, AVX-512 VNNI)
```
; 384 int8 values = 384 bytes = 6 ZMM registers
; VPDPBUSD: 64 uint8*int8 multiply-adds per cycle
.loop:
vmovdqu8 zmm0, [rsi + rcx] ; 64 uint8 from A
vmovdqu8 zmm1, [rdi + rcx] ; 64 int8 from B
vpdpbusd zmm2, zmm0, zmm1 ; acc += dot(A, B) per 4 bytes
add rcx, 64
cmp rcx, 384
jl .loop
; 6 iterations, 6 VPDPBUSD ops, ~6 cycles
; ~16x faster than fp16 L2
```
### Hamming Distance (binary, 384 dim, AVX-512)
```
; 384 bits = 48 bytes = 1 partial ZMM load
; VPOPCNTDQ: popcount on 8 x 64-bit words per cycle
vmovdqu8 zmm0, [rsi] ; Load 48 bytes (384 bits) from A
vmovdqu8 zmm1, [rdi] ; Load 48 bytes from B
vpxorq zmm2, zmm0, zmm1 ; XOR -> differing bits
vpopcntq zmm3, zmm2 ; Popcount per 64-bit word
; Horizontal sum of 6 popcounts -> Hamming distance
; ~3 cycles total
```
## 8. Summary: Query Path Hot Loop
The complete hot path for one HNSW search step:
```
1. Load current node's neighbor list [L2/L3 cache, 128 B, ~5 ns]
2. Issue prefetch for next neighbors [~1 ns]
3. For each neighbor (M=16):
a. Check visited bitmap [L1, ~1 ns]
b. Load neighbor vector (hot cache) [L2/L3, 768 B, ~5-10 ns]
c. SIMD distance (fp16, 384 dim) [~12 cycles = ~4 ns]
d. Heap insert if better [~5 ns]
4. Total per step: ~300-500 ns
5. Total per query (~150 steps): ~50-75 μs
```
This achieves 13,000-20,000 QPS per thread on desktop hardware — matching
or exceeding dedicated vector databases for in-memory workloads.

---
# RVF Deletion Lifecycle
## 1. Overview
Deletion in RVF follows a two-phase protocol consistent with the append-only
segment architecture. Vectors are never removed in place. Instead, a soft
delete records intent in a JOURNAL_SEG, and a subsequent compaction performs
the hard delete by physically excluding those vectors from sealed output segments.
```
JOURNAL_SEG Compaction GC / Rewrite
(append) (merge) (reclaim)
ACTIVE -----> SOFT_DELETED -----> HARD_DELETED ------> RECLAIMED
| | | |
| query path | query path | |
| returns vec | skips vec | vec absent | space freed
| | (bitmap check) | from output seg |
```
Readers always see a consistent snapshot: a deletion is invisible until
the manifest referencing the new deletion bitmap is durably committed.
## 2. Vector Lifecycle State Machine
```
+----------+ JOURNAL_SEG +-----------------+
| | DELETE_VECTOR / RANGE | |
| ACTIVE +----------------------->+ SOFT_DELETED |
| | | |
+----------+ +--------+--------+
| Compaction seals output
v excluding this vector
+--------+--------+
| HARD_DELETED |
+--------+--------+
| File rewrite / truncation
v reclaims physical space
+--------+--------+
| RECLAIMED |
+-----------------+
```
| State | Bitmap Bit | Physical Bytes | Query Visible |
|-------|------------|----------------|---------------|
| ACTIVE | 0 | Vector in VEC_SEG | Yes |
| SOFT_DELETED | 1 | Vector in VEC_SEG | No |
| HARD_DELETED | N/A | Excluded from sealed output | No |
| RECLAIMED | N/A | Bytes overwritten / freed | No |
| Transition | Trigger | Durability |
|------------|---------|------------|
| ACTIVE -> SOFT_DELETED | JOURNAL_SEG + MANIFEST_SEG with bitmap | After manifest fsync |
| SOFT_DELETED -> HARD_DELETED | Compaction writes sealed VEC_SEG without vector | After compaction manifest fsync |
| HARD_DELETED -> RECLAIMED | File rewrite or old shard deletion | After shard unlink |
## 3. JOURNAL_SEG Wire Format (type 0x04)
A JOURNAL_SEG records metadata mutations: deletions, metadata updates, tier
moves, and ID remappings. Its payload follows the standard 64-byte segment
header (see `01-segment-model.md` section 2).
### 3.1 Journal Header (64 bytes)
```
Offset Type Field Description
------ ---- ----- -----------
0x00 u32 entry_count Number of journal entries
0x04 u32 journal_epoch Epoch when this journal was written
0x08 u64 prev_journal_seg_id Segment ID of previous JOURNAL_SEG (0 if first)
0x10 u32 flags Reserved, must be 0
0x14 u8[44] reserved Zero-padded to 64-byte alignment
```
### 3.2 Journal Entry Format
Each entry begins on an 8-byte aligned boundary:
```
Offset Type Field Description
------ ---- ----- -----------
0x00 u8 entry_type Entry type enum
0x01 u8 reserved Must be 0x00
0x02 u16 entry_length Byte length of type-specific payload
0x04 u8[] payload Type-specific payload
var u8[] padding Zero-pad to next 8-byte boundary
```
### 3.3 Entry Types
```
Value Name Payload Size Description
----- ---- ------------ -----------
0x01 DELETE_VECTOR 8 B Delete a single vector by ID
0x02 DELETE_RANGE 16 B Delete a contiguous range of vector IDs
0x03 UPDATE_METADATA variable Update key-value metadata for a vector
0x04 MOVE_VECTOR 24 B Reassign vector to a different segment/tier
0x05 REMAP_ID 16 B Reassign vector ID (post-compaction)
```
### 3.4 Type-Specific Payloads
**DELETE_VECTOR (0x01)**
```
0x00 u64 vector_id ID of the vector to soft-delete
```
**DELETE_RANGE (0x02)**
```
0x00 u64 start_id First vector ID (inclusive)
0x08 u64 end_id Last vector ID (exclusive)
```
Invariant: `start_id < end_id`. Range `[start_id, end_id)` is half-open.
**UPDATE_METADATA (0x03)**
```
0x00 u64 vector_id Target vector ID
0x08 u16 key_len Byte length of metadata key
0x0A u8[] key Metadata key (UTF-8)
var u16 val_len Byte length of metadata value
var+2 u8[] val Metadata value (opaque bytes)
```
**MOVE_VECTOR (0x04)**
```
0x00 u64 vector_id Target vector ID
0x08 u64 src_seg Source segment ID
0x10 u64 dst_seg Destination segment ID
```
**REMAP_ID (0x05)**
```
0x00 u64 old_id Original vector ID
0x08 u64 new_id New vector ID after compaction
```
### 3.5 Complete JOURNAL_SEG Example
Deleting vector 42, deleting range [1000, 2000), remapping ID 500 -> 3:
```
Byte offset Content Notes
----------- ------- -----
0x00-0x3F Segment header (64 B) seg_type=0x04, magic=RVFS
0x40-0x7F Journal header (64 B) entry_count=3, epoch=7,
prev_journal_seg_id=12
--- Entry 0: DELETE_VECTOR ---
0x80 0x01 entry_type
0x81 0x00 reserved
0x82-0x83 0x0008 entry_length = 8
0x84-0x8B 0x000000000000002A vector_id = 42
0x8C-0x8F 0x00000000 padding to 8B
--- Entry 1: DELETE_RANGE ---
0x90        0x02                 entry_type
0x91        0x00                 reserved
0x92-0x93   0x0010               entry_length = 16
0x94-0x9B   0x00000000000003E8   start_id = 1000
0x9C-0xA3   0x00000000000007D0   end_id = 2000
0xA4-0xA7   0x00000000           padding to 8B
--- Entry 2: REMAP_ID ---
0xA8        0x05                 entry_type
0xA9        0x00                 reserved
0xAA-0xAB   0x0010               entry_length = 16
0xAC-0xB3   0x00000000000001F4   old_id = 500
0xB4-0xBB   0x0000000000000003   new_id = 3
0xBC-0xBF   0x00000000           padding to 8B
```
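The 8-byte alignment rule from Section 3.2 can be sketched as an entry packer; helper names here are illustrative, not part of the spec:

```python
import struct

# Entry type values from Section 3.3.
DELETE_VECTOR, DELETE_RANGE, REMAP_ID = 0x01, 0x02, 0x05

def pack_entry(entry_type: int, payload: bytes) -> bytes:
    """Pack one journal entry: u8 type, u8 reserved, u16 length, payload."""
    body = struct.pack("<BBH", entry_type, 0x00, len(payload)) + payload
    pad = (-len(body)) % 8  # zero-pad to the next 8-byte boundary
    return body + b"\x00" * pad

def pack_journal_entries(entries) -> bytes:
    """Concatenate entries; each starts 8-byte aligned by construction."""
    return b"".join(pack_entry(t, p) for t, p in entries)

# The three entries from the worked example above.
stream = pack_journal_entries([
    (DELETE_VECTOR, struct.pack("<Q", 42)),
    (DELETE_RANGE,  struct.pack("<QQ", 1000, 2000)),
    (REMAP_ID,      struct.pack("<QQ", 500, 3)),
])
```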
## 4. Deletion Bitmap
### 4.1 Manifest Record
The deletion bitmap is stored in the Level 1 manifest as a TLV record:
```
Tag Name Description
--- ---- -----------
0x000E DELETION_BITMAP Roaring bitmap of soft-deleted vector IDs
```
This extends the TLV tag space (previous: 0x000D KEY_DIRECTORY).
### 4.2 Roaring Bitmap Binary Layout
Vector IDs are 64-bit. The upper 32 bits select a **high key**; the lower
32 bits index into a **container** for that high key.
```
+---------------------------------------------+
| DELETION_BITMAP TLV Value |
+---------------------------------------------+
| Bitmap Header |
| cookie: u32 (0x3B3A3332) |
| high_key_count: u32 |
| For each high key: |
| high_key: u32 |
| container_type: u8 |
| 0x01 = ARRAY_CONTAINER |
| 0x02 = BITMAP_CONTAINER |
| 0x03 = RUN_CONTAINER |
| container_offset: u32 (from bitmap start)|
| [8B aligned] |
+---------------------------------------------+
| Container Data |
| Container 0: [type-specific layout] |
| Container 1: ... |
| [8B aligned per container] |
+---------------------------------------------+
```
### 4.3 Container Types
**ARRAY_CONTAINER (0x01)** -- Sparse deletions (< 4096 set bits per 64K range).
```
0x00 u16 cardinality Number of set values (1-4096)
0x02 u16[] values Sorted array of 16-bit values
```
Size: `2 + 2 * cardinality` bytes.
**BITMAP_CONTAINER (0x02)** -- Dense deletions (>= 4096 set bits per 64K range).
```
0x00 u16 cardinality Number of set bits
0x02 u8[8192] bitmap Fixed 65536-bit bitmap (8 KB)
```
Size: 8194 bytes (fixed).
**RUN_CONTAINER (0x03)** -- Contiguous ranges of deletions.
```
0x00 u16 run_count Number of runs
0x02 (u16,u16) runs[] Array of (start, length-1) pairs
```
Size: `2 + 4 * run_count` bytes.
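A sketch of these size formulas, which also shows why 4096 is the array-to-bitmap crossover: at exactly 4096 values, the array container reaches the bitmap container's fixed size.

```python
def container_size_bytes(kind: str, cardinality: int = 0, run_count: int = 0) -> int:
    """Serialized size of each roaring container type (Section 4.3)."""
    if kind == "array":
        return 2 + 2 * cardinality   # u16 cardinality + u16 values
    if kind == "bitmap":
        return 2 + 8192              # u16 cardinality + 65536-bit bitmap
    if kind == "run":
        return 2 + 4 * run_count     # u16 run_count + (u16, u16) pairs
    raise ValueError(f"unknown container kind: {kind}")
```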
### 4.4 Size Estimation
| Deletion Pattern | Deleted IDs | Container Types | Bitmap Size |
|------------------|-------------|-----------------|-------------|
| Sparse random | 10,000 (0.1%) | ~153 array | ~22 KB |
| Clustered ranges | 10,000 (0.1%) | ~5 run | ~0.1 KB |
| Mixed workload | 100,000 (1%) | array + run | ~80 KB |
| Heavy deletion | 1,000,000 (10%) | bitmap + run | ~200 KB |
Even at 200 KB the bitmap fits entirely in L2 cache.
### 4.5 Bitmap Operations
```python
def bitmap_check(bitmap, vector_id):
"""Returns True if vector_id is soft-deleted. O(1) amortized."""
high_key = vector_id >> 16
low_val = vector_id & 0xFFFF
container = bitmap.get_container(high_key)
if container is None:
return False
    return container.contains(low_val)  # array: bsearch, bitmap: bit test, run: bsearch

def bitmap_set(bitmap, vector_id):
"""Mark a vector as soft-deleted."""
high_key = vector_id >> 16
low_val = vector_id & 0xFFFF
container = bitmap.get_or_create_container(high_key)
container.add(low_val)
if container.type == ARRAY and container.cardinality > 4096:
container.promote_to_bitmap()
```
## 5. Delete-Aware Query Path
### 5.1 HNSW Traversal with Deletion Filtering
Deleted vectors remain in the HNSW graph until compaction rebuilds the index.
During search, the deletion bitmap is checked per candidate. Deleted nodes are
still traversed for connectivity but excluded from the result set.
```python
def hnsw_search_delete_aware(query, entry_point, ef_search, k, del_bitmap):
candidates = MaxHeap() # worst candidate on top
visited = BitSet()
worklist = MinHeap() # best candidate first
d0 = distance(query, get_vector(entry_point))
worklist.push((d0, entry_point))
visited.add(entry_point)
if not bitmap_check(del_bitmap, entry_point):
candidates.push((d0, entry_point))
while worklist:
dist, node = worklist.pop()
if candidates.size() >= ef_search and dist > candidates.peek_max():
break
neighbors = get_neighbors(node)
for n in neighbors[:PREFETCH_AHEAD]:
if n not in visited:
prefetch_vector(n)
for n in neighbors:
if n in visited:
continue
visited.add(n)
d = distance(query, get_vector(n))
is_deleted = bitmap_check(del_bitmap, n) # O(1) bitmap lookup
# Always add to worklist (graph connectivity)
if candidates.size() < ef_search or d < candidates.peek_max():
worklist.push((d, n))
# Only add to results if NOT deleted
if not is_deleted:
if candidates.size() < ef_search:
candidates.push((d, n))
elif d < candidates.peek_max():
candidates.replace_max((d, n))
return candidates.top_k(k)
```
### 5.2 Top-K Refinement with Deletion Filtering
```python
def topk_refine_delete_aware(candidates, hot_cache, query, k, del_bitmap):
heap = MaxHeap()
for cand_dist, cand_id in candidates:
heap.push((cand_dist, cand_id))
for entry in hot_cache.sequential_scan():
if bitmap_check(del_bitmap, entry.vector_id):
continue # skip soft-deleted
d = distance(query, entry.vector)
if heap.size() < k:
heap.push((d, entry.vector_id))
elif d < heap.peek_max():
heap.replace_max((d, entry.vector_id))
return heap.drain_sorted()
```
### 5.3 Performance Impact
| Operation | Without Deletions | With Deletions | Overhead |
|-----------|-------------------|----------------|----------|
| Bitmap check | N/A | ~2-5 ns (L1/L2 hit) | Per candidate |
| HNSW step (M=16) | ~300-500 ns | ~330-580 ns | +10% |
| Top-K refine (1000) | ~10 μs | ~12 μs | +20% worst |
| Total query | ~50-75 μs | ~55-85 μs | +10-13% |
At typical deletion rates (< 5%), overhead is negligible: the bitmap fits in
L2 cache, graph connectivity is preserved, and the cost is one branch plus
one bitmap load per candidate.
## 6. Deletion Write Path
All deletion operations follow the same two-fsync protocol:
```python
def delete_vectors(file, entries):
"""Soft-delete vectors. entries: list of DeleteVector or DeleteRange."""
# 1. Append JOURNAL_SEG
journal = JournalSegment(
epoch=current_epoch(file),
prev_journal_seg_id=latest_journal_id(file),
entries=entries
)
append_segment(file, journal)
fsync(file) # orphan-safe: no manifest references this yet
# 2. Update deletion bitmap in memory
bitmap = load_deletion_bitmap(file)
for e in entries:
if e.type == DELETE_VECTOR:
bitmap_set(bitmap, e.vector_id)
elif e.type == DELETE_RANGE:
bitmap.add_range(e.start_id, e.end_id)
# 3. Append MANIFEST_SEG with updated bitmap
manifest = build_manifest(file, deletion_bitmap=bitmap)
append_segment(file, manifest)
fsync(file) # deletion now visible to all new readers
```
Single deletes, bulk ranges, and batch deletes all use this path. Batch
operations pack multiple entries into one JOURNAL_SEG to amortize fsync cost.
## 7. Compaction with Deletions
### 7.1 Compaction Process
```
Before:
[VEC_1] [VEC_2] [JOURNAL_1] [VEC_3] [JOURNAL_2] [MANIFEST_5]
0-999 1000- del:42, 3000- del:[1000, bitmap={42,500,
2999 del:500 4999 2000) 1000..1999}
After:
... [MANIFEST_5] [VEC_sealed] [INDEX_new] [MANIFEST_6]
vectors 0-4999 bitmap={}
MINUS deleted (empty for
compacted range)
```
### 7.2 Compaction Algorithm
```python
def compact_with_deletions(file, seg_ids):
bitmap = load_deletion_bitmap(file)
output, id_remap, next_id = [], {}, 0
for seg_id in sorted(seg_ids):
seg = load_segment(file, seg_id)
if seg.seg_type != VEC_SEG:
continue
for vec_id, vector in seg.all_vectors():
if bitmap_check(bitmap, vec_id):
continue # physically exclude
id_remap[vec_id] = next_id
output.append((next_id, vector))
next_id += 1
append_segment(file, VecSegment(flags=SEALED, vectors=output))
remaps = [RemapIdEntry(old, new) for old, new in id_remap.items() if old != new]
if remaps:
append_segment(file, JournalSegment(entries=remaps))
append_segment(file, build_hnsw_index(output))
for old_id in id_remap:
bitmap.remove(old_id)
manifest = build_manifest(file,
tombstone_seg_ids=seg_ids,
deletion_bitmap=bitmap)
append_segment(file, manifest)
fsync(file)
```
### 7.3 Journal Merging
During compaction, JOURNAL_SEGs covering the compacted range are consumed:
| Entry Type | Materialization |
|------------|-----------------|
| DELETE_VECTOR / DELETE_RANGE | Vectors excluded from output |
| UPDATE_METADATA | Applied to output META_SEG |
| MOVE_VECTOR | Tier assignment applied in new manifest |
| REMAP_ID | Chained: old remap composed with new remap |
Consumed JOURNAL_SEGs are tombstoned alongside compacted VEC_SEGs.
### 7.4 Compaction Invariants
| ID | Invariant |
|----|-----------|
| INV-D1 | After compaction, deletion bitmap is empty for compacted range |
| INV-D2 | Sealed output contains only ACTIVE vectors |
| INV-D3 | REMAP_ID entries journaled for every relocated vector |
| INV-D4 | Compacted input segments tombstoned in new manifest |
| INV-D5 | Sealed segments are never modified |
| INV-D6 | Rebuilt indexes exclude deleted nodes |
## 8. Deletion Consistency
### 8.1 Crash Safety
```
Write path:
1. Append JOURNAL_SEG -> fsync crash here: orphan, invisible
2. Append MANIFEST_SEG -> fsync crash here: partial manifest, fallback
Recovery:
- Crash after step 1: JOURNAL_SEG orphaned. No manifest references it.
Reader sees previous manifest. Deletion NOT visible. Orphan cleaned
up by next compaction.
- Crash during step 2: Partial MANIFEST_SEG has bad checksum. Reader
falls back to previous valid manifest. Deletion NOT visible.
- After step 2 success: Manifest durable. Deletion visible.
```
**Guarantee**: Uncommitted deletions never affect readers. Deletion is
atomic at the manifest fsync boundary.
### 8.2 Manifest Chain Visibility
```
MANIFEST_3: bitmap = {}
| JOURNAL_SEG written (delete vector 42)
MANIFEST_4: bitmap = {42} <-- deletion visible from here
| Compaction runs
MANIFEST_5: bitmap = {} <-- vector 42 physically removed
```
A reader holding MANIFEST_3 continues to see vector 42. A reader opening
after MANIFEST_4 will not. This provides snapshot isolation at manifest
granularity.
### 8.3 Multi-File Mode
In multi-file mode, each shard maintains its own deletion bitmap. The
DELETION_BITMAP TLV record supports two modes:
```
+----------------------------------------------+
| mode: u8 |
| 0x00 = SINGLE (one bitmap, inline) |
| 0x01 = SHARDED (per-shard references) |
+----------------------------------------------+
SINGLE (0x00):
| roaring_bitmap: [u8; ...] |
SHARDED (0x01):
| shard_count: u16 |
| For each shard: |
| shard_id: u16 |
| bitmap_offset: u64 (in shard file) |
| bitmap_length: u32 |
| bitmap_hash: hash128 |
+----------------------------------------------+
```
Queries spanning shards load per-shard bitmaps and check each candidate
against its shard's bitmap.
### 8.4 Concurrent Access
One writer at a time (file-level advisory lock). Multiple readers are safe
due to append-only architecture. A reader that opened before a deletion
sees the pre-deletion snapshot until it re-reads the manifest.
## 9. Space Reclamation
| Trigger | Threshold | Action |
|---------|-----------|--------|
| Deletion ratio | > 20% of vectors deleted | Schedule compaction |
| Bitmap size | > 1 MB | Schedule compaction |
| Segment count | > 64 mutable segments | Schedule compaction |
| Manual | User-initiated | Compact immediately |
Space accounting derived from the manifest:
```
total_vector_count: 10,000,000 (Level 0 root manifest)
deleted_vector_count: 150,000 (bitmap cardinality)
active_vector_count: 9,850,000 (total - deleted)
deletion_ratio: 1.5% (below threshold)
wasted_bytes: ~115 MB (150K * 768 B per fp16-384 vector)
```
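This derivation reduces to simple arithmetic over manifest fields; a minimal sketch (function name illustrative):

```python
def space_accounting(total_vectors: int, deleted: int, bytes_per_vector: int):
    """Derive deletion-ratio and wasted-space figures from manifest counts."""
    active = total_vectors - deleted
    ratio = deleted / total_vectors
    wasted_bytes = deleted * bytes_per_vector
    return active, ratio, wasted_bytes

# fp16-384 vectors: 384 dims * 2 bytes = 768 B per vector
active, ratio, wasted = space_accounting(10_000_000, 150_000, 768)
```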
## 10. Summary
### Deletion Protocol
| Step | Action | Durability |
|------|--------|------------|
| 1 | Append JOURNAL_SEG with DELETE entries | fsync (orphan-safe) |
| 2 | Update roaring deletion bitmap | In-memory |
| 3 | Append MANIFEST_SEG with new bitmap | fsync (deletion visible) |
| 4 | Compaction excludes deleted vectors | fsync (physical removal) |
| 5 | File rewrite reclaims space | fsync (space freed) |
### New Wire Format Elements
| Element | Type / Tag | Section |
|---------|------------|---------|
| JOURNAL_SEG | Segment type 0x04 | 3 |
| DELETE_VECTOR | Journal entry 0x01 | 3.4 |
| DELETE_RANGE | Journal entry 0x02 | 3.4 |
| UPDATE_METADATA | Journal entry 0x03 | 3.4 |
| MOVE_VECTOR | Journal entry 0x04 | 3.4 |
| REMAP_ID | Journal entry 0x05 | 3.4 |
| DELETION_BITMAP | Level 1 TLV 0x000E | 4 |
### Invariants
| ID | Invariant |
|----|-----------|
| INV-D1 | After compaction, deletion bitmap is empty for compacted range |
| INV-D2 | Sealed output segments contain only ACTIVE vectors |
| INV-D3 | ID remappings journaled for every compaction-relocated vector |
| INV-D4 | Compacted input segments tombstoned in new manifest |
| INV-D5 | Sealed segments are never modified |
| INV-D6 | Rebuilt indexes exclude deleted nodes |
| INV-D7 | Uncommitted deletions never affect readers (crash safety) |
| INV-D8 | Deletion visibility is atomic at the manifest fsync boundary |
# RVF Filtered Search
## 1. Motivation
Domain profiles declare metadata schemas with indexed fields (e.g., `"organism"` in
RVDNA, `"language"` in RVText, `"node_type"` in RVGraph), but the format provides no
specification for how those indexes are built, stored, or evaluated at query time.
Filtered search is the combination of vector similarity search with metadata
predicates. Without it, a caller must retrieve an over-sized result set and filter
client-side — wasting bandwidth, latency, and recall budget.
This specification adds:
1. **META_SEG** payload layout (segment type 0x07) for storing per-vector metadata
2. **Filter expression language** with a compact binary encoding
3. **Three evaluation strategies** (pre-, post-, and intra-filtering)
4. **METAIDX_SEG** (new segment type 0x0D) for inverted and bitmap indexes
5. **Manifest integration** via a new Level 1 TLV record
6. **Temperature tier coordination** for metadata segments
## 2. META_SEG Payload Layout (Segment Type 0x07)
META_SEG stores the actual metadata values associated with vectors. It uses the
standard 64-byte segment header (see `binary-layout.md` Section 3) with
`seg_type = 0x07`.
```
META_SEG Payload:
+------------------------------------------+
| Meta Header (64 bytes, padded) |
| schema_id: u32 | References PROFILE_SEG schema
| vector_id_range_start: u64 | First vector ID covered
| vector_id_range_end: u64 | Last vector ID covered (inclusive)
| field_count: u16 | Number of fields in this segment
| encoding: u8 | 0 = row-oriented, 1 = column-oriented
| reserved: [u8; 41] | Must be zero
| [64B aligned] |
+------------------------------------------+
| Field Directory |
| For each field (field_count entries): |
| field_id: u16 |
| field_type: u8 |
| flags: u8 |
| field_offset: u32 | Byte offset from payload start
| [64B aligned] |
+------------------------------------------+
| Field Data (column-oriented) |
| (see Section 2.1 for per-type layout) |
+------------------------------------------+
```
### Field Type Enum
```
Value Type Wire Size Description
----- ---- --------- -----------
0x00 string Variable UTF-8, dictionary-encoded in column layout
0x01 u32 4 bytes Unsigned 32-bit integer
0x02 u64 8 bytes Unsigned 64-bit integer
0x03 f32 4 bytes IEEE 754 single-precision float
0x04 enum Variable (packed) Enumeration with defined label set
0x05 bool 1 bit (packed) Boolean
```
### Field Flags
```
Bit Mask Name Meaning
--- ---- ---- -------
0 0x01 INDEXED Field has a corresponding METAIDX_SEG
1 0x02 SORTED Values are stored in sorted order
2 0x04 NULLABLE Null bitmap present before values
3 0x08 STORED Field value returned in query results (not just filterable)
4-7 reserved Must be zero
```
### 2.1 Column-Oriented Field Layouts
Column-oriented encoding (encoding = 1) is the preferred layout. Each field's data
block starts at a 64-byte aligned boundary.
**String fields** (dictionary-encoded):
```
dict_size: u32 Number of distinct strings
For each dict entry:
length: u16 Byte length of UTF-8 string
bytes: [u8; length] UTF-8 encoded string
[4B aligned after dictionary]
codes: [varint; vector_count] Dictionary code per vector
[64B aligned]
```
Dictionary codes are 0-indexed into the dictionary array. Code `0xFFFFFFFF` (max
varint value for u32 range) represents null if the NULLABLE flag is set.
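The dictionary-encoding step for string columns can be sketched as follows (illustrative, not normative — it builds the distinct-string dictionary and the per-vector code array):

```python
def dict_encode(values):
    """Build a string dictionary and per-vector codes (Section 2.1)."""
    dictionary, codes, index = [], [], {}
    for v in values:
        if v not in index:
            index[v] = len(dictionary)  # first occurrence: assign next code
            dictionary.append(v)
        codes.append(index[v])
    return dictionary, codes
```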
**Numeric fields** (u32, u64, f32 -- direct array):
```
If NULLABLE:
null_bitmap: [u8; ceil(vector_count / 8)] Bit-packed, 1 = present, 0 = null
[8B aligned]
values: [field_type; vector_count] Dense array of values
[64B aligned]
```
Values for null entries are zero-filled but must not be relied upon.
**Enum fields** (bit-packed):
```
enum_count: u8 Number of enum labels
For each enum label:
length: u8 Byte length of label
bytes: [u8; length] UTF-8 label string
bits_per_code: u8 ceil(log2(enum_count))
codes: packed bit array bits_per_code bits per vector
[ceil(vector_count * bits_per_code / 8) bytes]
[64B aligned]
```
For example, an enum with 3 values (`"+", "-", "."`) uses 2 bits per vector.
1M vectors = 250 KB.
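The bit-packed column size follows directly from `bits_per_code = ceil(log2(enum_count))`; a small sketch:

```python
import math

def enum_column_bytes(vector_count: int, enum_count: int) -> int:
    """Size of a bit-packed enum column, per the layout above."""
    bits_per_code = max(1, math.ceil(math.log2(enum_count)))
    return math.ceil(vector_count * bits_per_code / 8)
```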
**Bool fields** (bit-packed):
```
If NULLABLE:
null_bitmap: [u8; ceil(vector_count / 8)]
[8B aligned]
values: [u8; ceil(vector_count / 8)] Bit-packed, 1 = true, 0 = false
[64B aligned]
```
### 2.2 Sorted Index (Inline)
For fields with the SORTED flag, an additional sorted permutation index follows
the field data:
```
sorted_count: u32 Must equal vector_count
sorted_order: [varint delta-encoded] Vector IDs in ascending value order
restart_interval: u16 Restart every N entries (default 128)
restart_offsets: [u32; ceil(sorted_count / restart_interval)]
[64B aligned]
```
This enables binary search over field values for range queries without requiring
a separate METAIDX_SEG. It is suitable for fields where a full inverted index
would be wasteful (high cardinality numeric fields like `position_start`).
## 3. Filter Expression Language
### 3.1 Abstract Syntax
A filter expression is a tree of predicates combined with boolean logic:
```
expr ::= field_ref CMP literal -- comparison
| field_ref IN literal_set -- set membership
| field_ref PREFIX string_lit -- string prefix match
| field_ref CONTAINS string_lit -- substring containment
| expr AND expr -- conjunction
| expr OR expr -- disjunction
| NOT expr -- negation
```
### 3.2 Binary Encoding (Postfix / RPN)
Filter expressions are encoded as a postfix (Reverse Polish Notation) token stream
for stack-based evaluation. This avoids the need for recursive parsing and enables
single-pass evaluation with a fixed-size stack.
```
Filter Expression Binary Layout:
header:
node_count: u16 Total number of tokens
stack_depth: u8 Maximum stack depth required
reserved: u8 Must be zero
tokens (postfix order):
For each token:
node_type: u8 Token type (see enum below)
payload: type-specific Variable-size payload
```
### Token Type Enum
```
Value Name Stack Effect Payload
----- ---- ------------ -------
0x01 FIELD_REF push +1 field_id: u16
0x02 LIT_U32 push +1 value: u32
0x03 LIT_U64 push +1 value: u64
0x04 LIT_F32 push +1 value: f32
0x05 LIT_STR push +1 length: u16, bytes: [u8; length]
0x06 LIT_BOOL push +1 value: u8 (0 or 1)
0x07 LIT_NULL push +1 (no payload)
0x10 CMP_EQ pop 2, push 1 (no payload) -- a == b
0x11 CMP_NE pop 2, push 1 (no payload) -- a != b
0x12 CMP_LT pop 2, push 1 (no payload) -- a < b
0x13 CMP_LE pop 2, push 1 (no payload) -- a <= b
0x14 CMP_GT pop 2, push 1 (no payload) -- a > b
0x15 CMP_GE pop 2, push 1 (no payload) -- a >= b
0x20 IN_SET pop 1, push 1 set_size: u16, [encoded values]
0x21 PREFIX pop 2, push 1 (no payload) -- string prefix
0x22 CONTAINS pop 2, push 1 (no payload) -- substring match
0x30 AND pop 2, push 1 (no payload)
0x31 OR pop 2, push 1 (no payload)
0x32 NOT pop 1, push 1 (no payload)
```
### 3.3 Encoding Example
Filter: `organism = "E. coli" AND position_start >= 1000`
```
Token 0: FIELD_REF field_id=0 (organism) stack: [organism_val]
Token 1: LIT_STR "E. coli" stack: [organism_val, "E. coli"]
Token 2: CMP_EQ stack: [true/false]
Token 3: FIELD_REF field_id=3 (position_start) stack: [bool, pos_val]
Token 4: LIT_U64 1000 stack: [bool, pos_val, 1000]
Token 5: CMP_GE stack: [bool, true/false]
Token 6: AND stack: [result]
Binary: node_count=7, stack_depth=3
01 00:00 05 00:07 "E. coli" 10 01 00:03 03 00:00:00:00:00:00:03:E8 15 30
```
### 3.4 Evaluation
Evaluation processes tokens left to right using a fixed-size boolean/value stack:
```python
def evaluate(tokens, vector_id, metadata):
stack = []
for token in tokens:
if token.type == FIELD_REF:
stack.append(metadata.get_value(vector_id, token.field_id))
elif token.type in (LIT_U32, LIT_U64, LIT_F32, LIT_STR, LIT_BOOL, LIT_NULL):
stack.append(token.value)
elif token.type in (CMP_EQ, CMP_NE, CMP_LT, CMP_LE, CMP_GT, CMP_GE):
b, a = stack.pop(), stack.pop()
stack.append(compare(a, token.type, b))
elif token.type == IN_SET:
a = stack.pop()
stack.append(a in token.value_set)
elif token.type in (PREFIX, CONTAINS):
b, a = stack.pop(), stack.pop()
stack.append(string_match(a, token.type, b))
elif token.type == AND:
b, a = stack.pop(), stack.pop()
stack.append(a and b)
elif token.type == OR:
b, a = stack.pop(), stack.pop()
stack.append(a or b)
elif token.type == NOT:
stack.append(not stack.pop())
return stack[0]
```
Maximum stack depth is declared in the header so the evaluator can pre-allocate.
Implementations must reject expressions with `stack_depth > 16`.
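A validator can compute the required depth in one pass using the stack effects from the token table. This is a sketch; the token constants are assumed to match Section 3.2, and only token types (not payloads) are inspected:

```python
# Stack effect per token type: literals/field refs push +1,
# binary ops pop 2 push 1 (net -1), unary ops pop 1 push 1 (net 0).
PUSH_TOKENS   = {0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07}
BINARY_TOKENS = {0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x21, 0x22, 0x30, 0x31}
UNARY_TOKENS  = {0x20, 0x32}  # IN_SET, NOT

def validate_expression(token_types, max_depth=16):
    """Return peak stack depth; raise on malformed or too-deep expressions."""
    depth = peak = 0
    for t in token_types:
        if t in PUSH_TOKENS:
            depth += 1
        elif t in BINARY_TOKENS:
            depth -= 1
        elif t not in UNARY_TOKENS:
            raise ValueError(f"unknown token 0x{t:02X}")
        if depth < 1:
            raise ValueError("stack underflow")
        peak = max(peak, depth)
    if depth != 1:
        raise ValueError("expression must leave exactly one result")
    if peak > max_depth:
        raise ValueError("stack_depth exceeds limit")
    return peak
```

Run against the Section 3.3 example (FIELD_REF, LIT_STR, CMP_EQ, FIELD_REF, LIT_U64, CMP_GE, AND), this yields the declared `stack_depth = 3`.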
## 4. Filter Evaluation Strategies
The runtime selects one of three strategies based on the estimated **selectivity**
of the filter (the fraction of vectors passing the filter).
### 4.1 Pre-Filtering (Selectivity < 1%)
Build the candidate ID set from metadata indexes first, then run vector search
only on the filtered subset.
```
1. Evaluate filter using METAIDX_SEG inverted/bitmap indexes
2. Collect matching vector IDs into a candidate set C
3. If |C| < ef_search:
Flat scan all candidates, return top-K
Else:
Build temporary flat index over C, run HNSW search restricted to C
4. Return top-K results
```
**Tradeoffs**:
- Optimal when the candidate set is very small (hundreds to low thousands)
- Risk: if the candidate set is disconnected in the HNSW graph, search cannot
traverse from entry points to candidates. The flat scan fallback handles this.
- Memory: candidate set bitmap = `ceil(total_vectors / 8)` bytes
### 4.2 Post-Filtering (Selectivity > 20%)
Run standard HNSW search with over-retrieval, then filter results.
```
1. Compute over_retrieval_factor = min(1.0 / selectivity, 10.0)
2. Set ef_search_adj = ef_search * over_retrieval_factor
3. Run standard HNSW search with ef_search_adj
4. Filter result set by evaluating filter expression per candidate
5. Return top-K from filtered results
```
**Tradeoffs**:
- Optimal when the filter passes most vectors (minimal wasted computation)
- Risk: if over-retrieval factor is too low, fewer than K results survive filtering.
The caller should retry with a higher factor or fall back to intra-filtering.
- No modification to HNSW traversal logic required.
### 4.3 Intra-Filtering (1% <= Selectivity <= 20%)
Evaluate the filter during HNSW traversal, skipping nodes that fail the predicate.
```python
def filtered_hnsw_search(query, filter_expr, entry_point, ef_search, k):
candidates = MaxHeap() # top-K results (max-heap by distance)
worklist = MinHeap() # exploration frontier (min-heap by distance)
visited = BitSet()
filtered_skips = 0
max_skips = ef_search * 3 # backoff threshold
worklist.push((distance(query, entry_point), entry_point))
visited.add(entry_point)
while worklist and filtered_skips < max_skips:
dist, node = worklist.pop()
# Check filter predicate
if not evaluate(filter_expr, node, metadata):
filtered_skips += 1
# Still expand neighbors (maintain graph connectivity)
neighbors = get_neighbors(node)
for n in neighbors:
if n not in visited:
visited.add(n)
d = distance(query, get_vector(n))
worklist.push((d, n))
continue
filtered_skips = 0 # reset skip counter on successful match
candidates.push((dist, node))
if len(candidates) > k:
candidates.pop() # evict worst
# Expand neighbors
neighbors = get_neighbors(node)
for n in neighbors:
if n not in visited:
visited.add(n)
d = distance(query, get_vector(n))
                if len(candidates) < ef_search or d < candidates.peek_max():
worklist.push((d, n))
return candidates.top_k(k)
```
**Key design decisions**:
1. **Skipped nodes still expand neighbors**: This preserves graph connectivity.
A node that fails the filter may have neighbors that pass it.
2. **Skip counter with backoff**: If too many consecutive nodes fail the filter,
the search is exhausting the local neighborhood without finding matches. The
`max_skips` threshold triggers termination to avoid unbounded traversal.
3. **Adaptive ef expansion**: When `filtered_skips > ef_search`, the effective
search frontier is larger than requested, compensating for filtered-out nodes.
### 4.4 Strategy Selection
```
selectivity = estimate_selectivity(filter_expr, metaidx_stats)
if selectivity < 0.01:
strategy = PRE_FILTER
elif selectivity > 0.20:
strategy = POST_FILTER
else:
strategy = INTRA_FILTER
```
Selectivity estimation uses statistics stored in the METAIDX_SEG header:
- **Inverted index**: `posting_list_length / total_vectors` per term
- **Bitmap index**: `popcount(bitmap) / total_vectors` per enum value
- **Range tree**: count of values in range / total_vectors
For compound filters (AND/OR), selectivity is estimated using independence
assumption: `P(A AND B) = P(A) * P(B)`, `P(A OR B) = P(A) + P(B) - P(A) * P(B)`.
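The strategy thresholds and the independence-based estimators can be sketched as:

```python
def est_and(p_a: float, p_b: float) -> float:
    return p_a * p_b  # independence assumption

def est_or(p_a: float, p_b: float) -> float:
    return p_a + p_b - p_a * p_b

def choose_strategy(selectivity: float) -> str:
    """Map estimated selectivity to a filter evaluation strategy (Section 4.4)."""
    if selectivity < 0.01:
        return "PRE_FILTER"
    if selectivity > 0.20:
        return "POST_FILTER"
    return "INTRA_FILTER"
```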
## 5. METAIDX_SEG (Segment Type 0x0D)
METAIDX_SEG stores secondary indexes over metadata fields for fast predicate
evaluation. Each METAIDX_SEG covers one field. The segment type enum value 0x0D
is allocated from the reserved range (see `binary-layout.md` Section 3).
```
METAIDX_SEG Payload:
+------------------------------------------+
| Index Header (64 bytes, padded) |
| field_id: u16 | Field being indexed
| index_type: u8 | 0=inverted, 1=range_tree, 2=bitmap
| field_type: u8 | Mirrors META_SEG field_type
| total_vectors: u64 | Vectors covered by this index
| unique_values: u64 | Cardinality (distinct values)
| reserved: [u8; 44] |
| [64B aligned] |
+------------------------------------------+
| Index Data (type-specific) |
+------------------------------------------+
```
### 5.1 Inverted Index (index_type = 0)
Best for: string fields with moderate cardinality (100 to 100K distinct values).
```
term_count: u32
For each term (sorted by encoded value):
term_length: u16
term_bytes: [u8; term_length] Encoded value (UTF-8 for strings)
posting_length: u32 Number of vector IDs
postings: [varint delta-encoded] Sorted vector IDs
[8B aligned after postings]
[64B aligned]
```
Posting lists use varint delta encoding identical to the ID encoding in VEC_SEG
(see `binary-layout.md` Section 5). Restart points every 128 entries enable
binary search within a posting list for intersection operations.
### 5.2 Range Tree (index_type = 1)
Best for: numeric fields requiring range queries (u32, u64, f32).
```
page_size: u32 Fixed 4096 bytes (4 KB, one disk page)
page_count: u32
root_page: u32 Page index of B+ tree root
tree_height: u8
reserved: [u8; 51]
[64B aligned]
Internal Page (4096 bytes):
page_type: u8 (0 = internal)
key_count: u16
keys: [field_type; key_count] Separator keys
children: [u32; key_count + 1] Child page indices
[zero-padded to 4096]
Leaf Page (4096 bytes):
page_type: u8 (1 = leaf)
entry_count: u16
prev_leaf: u32 Linked-list pointer for range scan
next_leaf: u32
entries:
For each entry:
value: field_type The metadata value
vector_id: u64 Associated vector ID
[zero-padded to 4096]
```
Leaf pages form a doubly-linked list for efficient range scans. A range query
`position_start >= 1000 AND position_start <= 5000` descends the tree to find
the first leaf with value >= 1000, then scans forward until value > 5000.
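The leaf-level range scan reduces to a binary search plus a forward walk. A sketch over a flattened sorted array standing in for the linked 4 KB leaf pages:

```python
import bisect

def range_scan(sorted_entries, lo, hi):
    """Return vector IDs whose value lies in [lo, hi].

    sorted_entries: list of (value, vector_id) sorted by value,
    standing in for the doubly-linked leaf pages.
    """
    i = bisect.bisect_left(sorted_entries, (lo,))  # first entry with value >= lo
    out = []
    while i < len(sorted_entries) and sorted_entries[i][0] <= hi:
        out.append(sorted_entries[i][1])
        i += 1
    return out
```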
### 5.3 Bitmap Index (index_type = 2)
Best for: enum and bool fields with low cardinality (< 64 distinct values).
```
value_count: u8 Number of distinct enum/bool values
For each value:
value_label_len: u8
value_label: [u8; value_label_len] The enum label or "true"/"false"
bitmap_format: u8 0 = raw, 1 = roaring
bitmap_length: u32 Byte length of bitmap data
bitmap_data: [u8; bitmap_length] Bitmap of matching vector IDs
[8B aligned]
[64B aligned]
```
**Raw bitmaps** are used when `total_vectors < 8192` (1 KB per bitmap).
**Roaring bitmaps** are used for larger datasets. The roaring format stores
the bitmap as a set of containers (array, bitmap, or run-length) per 64K chunk.
This matches the industry-standard Roaring bitmap serialization (compatible with
CRoaring / roaring-rs wire format).
Bitmap intersection and union operations map directly to AND/OR filter predicates
using SIMD bitwise operations. For 10M vectors:
```
Raw bitmap: ~1.2 MB per value (impractical for many values)
Roaring bitmap: 100 KB - 1 MB per value depending on density
AND/OR: ~0.1 ms per operation (AVX-512 on 1 MB bitmap)
```
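The raw-bitmap predicate evaluation can be modeled compactly: below, Python's arbitrary-precision integers stand in for the packed bit arrays, and the `&`/`|` operators stand in for the SIMD word-wise AND/OR a real reader would use.

```python
def make_bitmap(ids):
    """Build a raw bitmap (bit i set iff vector i matches)."""
    bm = 0
    for i in ids:
        bm |= 1 << i
    return bm

def bitmap_ids(bm):
    """Expand a bitmap back into a sorted list of vector IDs."""
    ids, i = [], 0
    while bm:
        if bm & 1:
            ids.append(i)
        bm >>= 1
        i += 1
    return ids
```

An AND filter such as `chromosome = "1" AND language = "en"` then reduces to a single bitwise AND of the two value bitmaps.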
## 6. Level 1 Manifest Addition
### Tag 0x000F: METADATA_INDEX_DIR
A new TLV record in the Level 1 manifest (see `02-manifest-system.md` Section 3)
that maps indexed metadata fields to their METAIDX_SEG segment IDs.
```
Tag: 0x000F
Name: METADATA_INDEX_DIR
Payload:
entry_count: u16
For each entry:
field_id: u16 Matches META_SEG field_id
field_name_len: u8
field_name: [u8; field_name_len] UTF-8 field name for debugging
index_seg_id: u64 Segment ID of METAIDX_SEG
index_type: u8 0=inverted, 1=range_tree, 2=bitmap
stats:
total_vectors: u64
unique_values: u64
min_posting_len: u32 Smallest posting list size
max_posting_len: u32 Largest posting list size
```
This allows the query planner to estimate selectivity without reading the
METAIDX_SEG segments themselves. The `min_posting_len` and `max_posting_len`
fields provide bounds for cardinality estimation.
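A parser for the tag 0x000F payload follows directly from the field list. This sketch assumes a packed little-endian layout with no padding between fields, which the record definition implies but does not state explicitly.

```python
import struct

def parse_metadata_index_dir(payload):
    """Parse the METADATA_INDEX_DIR (tag 0x000F) payload into dicts."""
    (entry_count,) = struct.unpack_from("<H", payload, 0)
    off, entries = 2, []
    for _ in range(entry_count):
        field_id, name_len = struct.unpack_from("<HB", payload, off)
        off += 3
        name = payload[off:off + name_len].decode("utf-8")
        off += name_len
        seg_id, index_type, total, unique, min_pl, max_pl = struct.unpack_from(
            "<QBQQII", payload, off)   # 8+1+8+8+4+4 = 33 bytes
        off += 33
        entries.append({"field_id": field_id, "name": name,
                        "index_seg_id": seg_id, "index_type": index_type,
                        "total_vectors": total, "unique_values": unique,
                        "min_posting_len": min_pl, "max_posting_len": max_pl})
    return entries
```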
### Updated Record Types Table
```
Tag Name Description
--- ---- -----------
0x0001 SEGMENT_DIR Array of segment directory entries
0x0002 TEMP_TIER_MAP Temperature tier assignments per block
...
0x000D KEY_DIRECTORY Encryption key references
0x000E (reserved)
0x000F METADATA_INDEX_DIR Metadata field -> METAIDX_SEG mapping
```
## 7. Performance Analysis
### 7.1 Filter Strategy vs Selectivity vs Recall
| Selectivity | Strategy | Recall@10 | Latency (10M vectors) | Notes |
|-------------|----------|-----------|----------------------|-------|
| 0.001% (100 matches) | Pre-filter | 1.00 | 0.02 ms | Flat scan on 100 candidates |
| 0.01% (1K matches) | Pre-filter | 0.99 | 0.08 ms | Flat scan on 1K candidates |
| 0.1% (10K matches) | Pre-filter | 0.98 | 0.5 ms | Mini-HNSW on 10K candidates |
| 1% (100K matches) | Intra-filter | 0.96 | 0.12 ms | ~10% node skip overhead |
| 5% (500K matches) | Intra-filter | 0.95 | 0.08 ms | ~5% node skip overhead |
| 10% (1M matches) | Intra-filter | 0.94 | 0.06 ms | Minimal skip overhead |
| 20% (2M matches) | Post-filter | 0.95 | 0.10 ms | 5x over-retrieval |
| 50% (5M matches) | Post-filter | 0.97 | 0.06 ms | 2x over-retrieval |
| 100% (no filter) | None | 0.98 | 0.04 ms | Baseline unfiltered |
### 7.2 Memory Overhead of Metadata Indexes
For 10M vectors with the RVDNA profile (5 indexed fields):
| Field | Type | Cardinality | Index Type | Size |
|-------|------|-------------|------------|------|
| organism | string | ~50K | Inverted | ~80 MB |
| gene_id | string | ~500K | Inverted | ~120 MB |
| chromosome | string | ~25 | Bitmap (roaring) | ~12 MB |
| position_start | u64 | ~10M | Range tree | ~160 MB |
| position_end | u64 | ~10M | Range tree | ~160 MB |
| **Total** | | | | **~532 MB** |
As a fraction of vector data (10M * 384 dim * fp16 = 7.2 GB): **~7.4% overhead**.
For the RVText profile (2 indexed fields, typically lower cardinality):
| Field | Type | Cardinality | Index Type | Size |
|-------|------|-------------|------------|------|
| source_url | string | ~100K | Inverted | ~90 MB |
| language | string | ~50 | Bitmap (roaring) | ~8 MB |
| **Total** | | | | **~98 MB** |
Overhead: **~1.4%** of vector data.
### 7.3 Query Latency Breakdown (Filtered Intra-Search)
```
Phase Time Notes
----- ---- -----
Parse filter expression 0.5 us Stack-based, no allocation
Estimate selectivity 1.0 us Read manifest stats
Load METAIDX_SEG (if cold) 50-200 us First query only; cached after
HNSW traversal (150 steps) 45 us Baseline unfiltered
+ filter eval per node +12 us ~80 ns per eval * 150 nodes
+ skip expansion +8 us ~20% more nodes visited at 5% sel.
Top-K collection 10 us Heap operations
--------
Total (warm cache) ~76 us
Total (cold start) ~276 us
```
## 8. Integration with Temperature Tiering
Metadata follows the same temperature model as vector data (see
`03-temperature-tiering.md`), but with its own tier assignments.
### 8.1 Hot Metadata
Indexed fields for hot-tier vectors are kept resident in memory:
- **Bitmap indexes** for low-cardinality fields (enum, bool) are always hot.
Total size is bounded: `cardinality * ceil(hot_vectors / 8)` bytes. For 100K
hot vectors and 25 enum values: 25 * 12.5 KB = 312 KB.
- **Inverted index posting lists** are cached using an LRU policy keyed by
(field_id, term). Frequently queried terms (e.g., `language = "en"`) remain
resident.
- **Range tree pages** follow the standard B+ tree buffer pool model. Hot pages
(root + first two levels) are pinned. Leaf pages are demand-paged.
### 8.2 Cold Metadata
Cold metadata covers vectors that are rarely accessed:
- META_SEG data for cold vectors is compressed with ZSTD (level 9+) and stored
in cold-tier segments.
- METAIDX_SEG posting lists for cold vectors are not loaded until a query
specifically requests them.
- When a filter matches only cold vectors (detected via the temperature tier
map), the runtime issues a warning: filtered search on cold data may require
decompression latency of 10-100 ms.
### 8.3 Compaction Coordination
When temperature-aware compaction reorganizes vector segments (see
`03-temperature-tiering.md` Section 4), metadata must follow:
```
1. Identify vectors moving between tiers
2. Rewrite META_SEG for affected vector ID ranges
3. Rebuild METAIDX_SEG posting lists (vector IDs may be renumbered during
compaction if the COMPACTION_RENUMBER flag is set)
4. Update METADATA_INDEX_DIR in the new manifest
5. Tombstone old META_SEG and METAIDX_SEG segments
```
Metadata compaction piggybacks on vector compaction -- it never triggers
independently. This ensures metadata and vector segments remain in consistent
temperature tiers.
### 8.4 Metadata-Aware Promotion
When a filter query frequently accesses metadata for warm-tier vectors, those
metadata segments are candidates for promotion to hot tier. The access sketch
(SKETCH_SEG) tracks metadata segment accesses alongside vector accesses:
```
sketch_key = (META_SEG_ID << 32) | block_id
```
This reuses the existing sketch infrastructure without modification.
## 9. Wire Protocol: Filtered Query Message
For completeness, the filter expression is carried in the query message as a
tagged field. The query wire format is outside the scope of the storage spec,
but the filter payload is defined here for interoperability.
```
Query Message Filter Field:
tag: u16 (0x0040 = FILTER)
length: u32
filter_version: u8 (1)
filter_payload: [u8; length - 1] Binary filter expression (Section 3.2)
```
Implementations that do not support filtered search must ignore tag 0x0040 and
return unfiltered results. This preserves backward compatibility.
## 10. Implementation Notes
### 10.1 Index Selection Heuristics
When building indexes for a new META_SEG field, implementations should select
the index type automatically:
```
if field_type in (enum, bool) and cardinality < 64:
index_type = BITMAP
elif field_type in (u32, u64, f32):
index_type = RANGE_TREE
else:
index_type = INVERTED
```
Fields without the `"indexed": true` property in the profile schema must not
have METAIDX_SEG segments built. They are stored in META_SEG for retrieval
only (the STORED flag).
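A runnable version of the heuristic above, including the `indexed` check. Field types are plain strings here; a real implementation would use the META_SEG field-type enum.

```python
INVERTED, RANGE_TREE, BITMAP = 0, 1, 2   # index_type values from Section 5

def select_index_type(field_type, cardinality, indexed=True):
    """Pick an index_type for a metadata field, or None for STORED-only fields."""
    if not indexed:
        return None                      # no METAIDX_SEG is built
    if field_type in ("enum", "bool") and cardinality < 64:
        return BITMAP
    if field_type in ("u32", "u64", "f32"):
        return RANGE_TREE
    return INVERTED
```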
### 10.2 Posting List Intersection
For AND filters on multiple indexed fields, posting list intersection is
performed using a merge-based algorithm on sorted, delta-decoded posting lists:
```
Sorted Intersection (two-pointer merge):
Time: O(min(|A|, |B|)) with skip-ahead via restart points
Practical: ~100 ns per 1000 common elements (SIMD comparison)
```
For OR filters, posting list union uses a similar merge with deduplication.
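The AND-path merge can be sketched as a classic two-pointer intersection; the restart-point skip-ahead and SIMD comparison are elided, so this version is O(|A| + |B|).

```python
def intersect(a, b):
    """Intersect two sorted, duplicate-free posting lists."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i]); i += 1; j += 1
        elif a[i] < b[j]:
            i += 1                       # advance the smaller head
        else:
            j += 1
    return out
```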
### 10.3 Null Handling
- `FIELD_REF` for a null value pushes a sentinel NULL onto the stack
- `CMP_EQ NULL` returns true only for null values
- `CMP_NE NULL` returns true for all non-null values
- All other comparisons against NULL return false (SQL-style three-valued logic)
- `IN_SET` never matches NULL unless NULL is explicitly in the set
---
# RVF Concurrency, Versioning, and Space Reclamation
## 1. Single-Writer / Multi-Reader Model
RVF uses a **single-writer, multi-reader** concurrency model. At most one process
may append segments to an RVF file at any time. Any number of readers may operate
concurrently with each other and with the writer. This model is enforced by an
advisory lock file, not by OS-level mandatory locking.
| Concern | Advisory Lock | Mandatory Lock (flock/fcntl) |
|---------|---------------|------------------------------|
| NFS compatibility | Works (lock file is a regular file) | Broken on many NFS configs |
| Crash recovery | Stale lock detectable by PID check | Kernel auto-releases, but only locally |
| Cross-language | Any language can create a file | Requires OS-specific syscalls |
| Visibility | Lock state inspectable by humans | Opaque kernel state |
| Multi-file mode | One lock covers all shards | Would need per-shard locks |
## 2. Writer Lock File
The writer lock is a file named `<basename>.rvf.lock` in the same directory as the
RVF file. For example, `data.rvf` uses `data.rvf.lock`.
### Binary Layout
```
Offset Size Field Description
------ ---- ----- -----------
0x00 4 magic 0x52564C46 ("RVLF" in ASCII)
0x04 4 pid Writer process ID (u32)
0x08 64 hostname Null-terminated hostname (max 63 chars + null)
0x48 8 timestamp_ns Lock acquisition time (nanosecond UNIX timestamp)
0x50 16 writer_id Random UUID (128-bit, written as raw bytes)
0x60 4 lock_version Lock protocol version (currently 1)
0x64 4 checksum CRC32C of bytes 0x00-0x63
```
**Total**: 104 bytes.
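Packing the lock file is a straightforward struct encode plus a CRC-32C trailer. Two assumptions in this sketch: the magic is written as a little-endian u32, and the CRC is the reflected Castagnoli variant (polynomial 0x1EDC6F41); the bitwise CRC here is slow but dependency-free.

```python
import struct, time, uuid

def crc32c(data):
    """Reflected CRC-32C (Castagnoli). Real code would use a hardware path."""
    crc = 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = (crc >> 1) ^ (0x82F63B78 if crc & 1 else 0)
    return crc ^ 0xFFFFFFFF

def build_lock_file(pid, hostname, writer_id=None):
    """Pack the 104-byte lock file described above."""
    body = struct.pack(
        "<II64sQ16sI",
        0x52564C46,                      # magic "RVLF" (assumed LE u32)
        pid,
        hostname.encode()[:63],          # struct null-pads to 64 bytes
        time.time_ns(),                  # timestamp_ns
        (writer_id or uuid.uuid4()).bytes,
        1,                               # lock_version
    )
    return body + struct.pack("<I", crc32c(body))
```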
### Lock Acquisition Protocol
```
1. Construct lock file content (magic, PID, hostname, timestamp, random UUID)
2. Compute CRC32C over bytes 0x00-0x63, store at 0x64
3. Attempt open("<basename>.rvf.lock", O_CREAT | O_EXCL | O_WRONLY)
4. If open succeeds:
a. Write 104 bytes
b. fsync
c. Lock acquired — proceed with writes
5. If open fails (EEXIST):
a. Read existing lock file
b. Validate magic and checksum
c. If invalid: delete stale lock, retry from step 3
d. If valid: run stale lock detection (see below)
e. If stale: delete lock, retry from step 3
f. If not stale: lock acquisition fails — another writer is active
```
The `O_CREAT | O_EXCL` combination is atomic on POSIX filesystems, preventing
two processes from simultaneously creating the lock.
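Steps 3-4 of the protocol map directly onto `os.open` with the exclusive-create flags; a minimal sketch (lock-content construction and the EEXIST fallback path are left to the caller):

```python
import os

def try_acquire_lock(lock_path, lock_bytes):
    """Attempt exclusive creation of the lock file. Returns True on success;
    False means a lock already exists (proceed to stale-lock detection)."""
    try:
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY, 0o644)
    except FileExistsError:              # EEXIST: another writer holds (or held) it
        return False
    try:
        os.write(fd, lock_bytes)
        os.fsync(fd)                     # durable before we act as the writer
    finally:
        os.close(fd)
    return True
```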
### Stale Lock Detection
A lock is considered stale when **both** of the following are true:
1. **PID is dead**: `kill(pid, 0)` returns `ESRCH` (process does not exist), OR
the hostname does not match the current host (remote crash)
2. **Age exceeds threshold**: `now_ns - timestamp_ns > 30_000_000_000` (30 seconds)
The age check prevents a race where a PID is recycled by the OS. A lock younger
than 30 seconds is never considered stale, even if the PID appears dead, because
PID reuse on modern systems can occur within milliseconds.
If the hostname differs from the current host, the PID check is not meaningful.
In this case, only the age threshold applies. Implementations SHOULD use a longer
threshold (300 seconds) for cross-host lock recovery to account for clock skew.
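The two-part staleness test can be sketched directly; `kill(pid, 0)` probes process existence without delivering a signal, and EPERM (the process exists but belongs to another user) counts as alive.

```python
import os

STALE_AGE_NS = 30_000_000_000            # 30 s same-host threshold
REMOTE_AGE_NS = 300_000_000_000          # 300 s cross-host threshold

def pid_alive(pid):
    try:
        os.kill(pid, 0)
        return True
    except ProcessLookupError:           # ESRCH: no such process
        return False
    except PermissionError:              # EPERM: exists, different user
        return True

def lock_is_stale(pid, lock_host, lock_ts_ns, local_host, now_ns):
    if lock_host != local_host:
        return now_ns - lock_ts_ns > REMOTE_AGE_NS   # PID check meaningless
    return (not pid_alive(pid)) and now_ns - lock_ts_ns > STALE_AGE_NS
```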
### Lock Release Protocol
```
1. fsync all pending data and manifest segments
2. Verify the lock file still contains our writer_id (re-read and compare)
3. If writer_id matches: unlink("<basename>.rvf.lock")
4. If writer_id does not match: abort — another process stole the lock
```
Step 2 prevents a writer from deleting a lock that was legitimately taken over
after a stale lock recovery by another process.
If a writer crashes without releasing the lock, the lock file persists on disk.
The next writer detects the orphan via stale lock detection and reclaims it.
No data corruption occurs because the append-only segment model guarantees that
partial writes are detectable: a segment with a bad content hash or a truncated
manifest is simply ignored.
## 3. Reader-Writer Coordination
Readers and writers operate independently. The append-only architecture ensures
they never conflict.
### Reader Protocol
```
1. Open file (read-only, no lock required)
2. Read Level 0 root manifest (last 4096 bytes)
3. Parse hotset pointers and Level 1 offset
4. This manifest snapshot defines the reader's view of the file
5. All queries within this session use the snapshot
6. To see new data: re-read Level 0 (explicit refresh)
```
### Writer Protocol
```
1. Acquire lock (Section 2)
2. Read current manifest to learn segment directory state
3. Append new segments (VEC_SEG, INDEX_SEG, etc.)
4. Append new MANIFEST_SEG referencing all live segments
5. fsync
6. Release lock (Section 2)
```
### Concurrent Timeline
```
Time Writer Reader A Reader B
---- ------ -------- --------
t=0 Acquires lock
t=1 Appends VEC_SEG_4 Opens file
t=2 Appends VEC_SEG_5 Opens file Reads manifest M3
t=3 Appends MANIFEST_SEG M4 Reads manifest M3 Queries (sees M3)
t=4 fsync, releases lock Queries (sees M3) Queries (sees M3)
t=5 Queries (sees M3) Refreshes -> M4
t=6 Refreshes -> M4 Queries (sees M4)
```
Reader A opened during the write but read manifest M3 (already stable) and never
sees partially written segments. Reader B sees M3 until explicit refresh. Neither
reader is blocked; the writer is never blocked by readers.
### Snapshot Isolation Guarantees
A reader holding a manifest snapshot is guaranteed:
1. All referenced segments are fully written and fsynced
2. Segment content hashes match (the manifest would not reference broken segments)
3. The snapshot is internally consistent (no partial epoch states)
4. The snapshot remains valid for the lifetime of the open file descriptor, even
if the file is compacted and replaced (old inode persists until close)
## 4. Format Versioning
RVF uses explicit version fields at every structural level. The versioning rules
are designed for forward compatibility — older readers can safely process files
produced by newer writers, with graceful degradation.
### Segment Version Compatibility
The segment header `version` field (offset 0x04, currently `1`) governs
segment-level compatibility.
| Rule | Description |
|------|-------------|
| S1 | A v1 reader MUST successfully process all v1 segments |
| S2 | A v1 reader MUST skip segments with version > 1 |
| S3 | A v1 reader MUST log a warning when skipping unknown versions |
| S4 | A v1 reader MUST NOT reject a file because it contains unknown-version segments |
| S5 | A v2+ writer MUST write a root manifest readable by v1 readers (if the root manifest format allows it) |
| S6 | A v2+ writer MAY write segments with version > 1 |
| S7 | Readers MUST use `payload_length` from the segment header to skip unknown segments |
Skipping works because the segment header layout is stable: magic, version,
seg_type, and payload_length occupy fixed offsets. A reader skips unknown
segments by seeking past `64 + payload_length` bytes (header + payload).
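The skip loop can be sketched as below. The offsets of `magic`, `version`, and `seg_type` follow the stated fixed layout; the `payload_length` position (assumed here to be a u32 at header offset 0x08) is an assumption for illustration — the authoritative offsets live in the segment-header spec.

```python
import struct

SEG_MAGIC = 0x52564653                   # "RVFS" segment magic
HEADER_SIZE = 64

def scan_segments(buf, known_versions=frozenset({1})):
    """Yield (offset, seg_type) for understood segments; skip the rest
    via payload_length, per rules S2/S7 and T1/T3."""
    off, out = 0, []
    while off + HEADER_SIZE <= len(buf):
        magic, version, seg_type = struct.unpack_from("<IBB", buf, off)
        if magic != SEG_MAGIC:
            break                        # not a segment header: stop scanning
        (payload_length,) = struct.unpack_from("<I", buf, off + 8)  # assumed offset
        if version in known_versions:
            out.append((off, seg_type))
        off += HEADER_SIZE + payload_length
    return out
```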
### Unknown Segment Types
The segment type enum (offset 0x05) may be extended in future versions.
| Rule | Description |
|------|-------------|
| T1 | A reader MUST skip segment types outside the recognized range (currently 0x01-0x0C) |
| T2 | A reader MUST NOT reject a file because of unknown segment types |
| T3 | A reader MUST use the header's `payload_length` to skip the unknown segment |
| T4 | A reader SHOULD log unknown types at diagnostic/debug level |
| T5 | Types 0x00 and 0xF0-0xFF remain reserved (see spec 01, Section 3) |
### Level 1 TLV Forward Compatibility
Level 1 manifest records use tag-length-value encoding. New tags may be added
in any version.
| Rule | Description |
|------|-------------|
| L1 | A reader MUST skip TLV records with unknown tags |
| L2 | A reader MUST use the record's `length` field (4 bytes at tag offset +2) to skip |
| L3 | A writer MUST NOT change the semantics of an existing tag |
| L4 | A writer MUST NOT reuse a tag value for a different purpose |
| L5 | New tags MUST be assigned sequentially from the lowest unassigned value (0x000F and 0x0010 are already defined by companion specs) |
### Root Manifest Compatibility
The root manifest (Level 0) has the strictest compatibility requirements because
it is the entry point for all readers.
| Rule | Description |
|------|-------------|
| R1 | The magic `0x52564D30` at offset 0x000 is frozen forever |
| R2 | The layout of bytes 0x000-0x007 (magic + version + flags) is frozen forever |
| R3 | New fields may be added to reserved space at offsets 0xF00-0xFFB |
| R4 | Readers MUST ignore non-zero bytes in reserved space they do not understand |
| R5 | The root checksum at 0xFFC always covers bytes 0x000-0xFFB |
| R6 | A v2+ writer extending reserved space MUST ensure the checksum remains valid |
There is no explicit version negotiation. Compatibility is achieved through the
skip rules above. A reader processes what it understands and skips what it does
not. This avoids capability exchange, making RVF suitable for offline and
archival use cases.
## 5. Variable Dimension Support
The root manifest declares a `dimension` field (offset 0x020, u16) and each
VEC_SEG block declares its own `dim` field (block header offset 0x08, u16).
These may differ.
### Dimension Rules
| Rule | Description |
|------|-------------|
| D1 | The root manifest `dimension` is the **primary dimension** (most common in the file) |
| D2 | An RVF file MAY contain VEC_SEG blocks with dimensions different from the primary |
| D3 | Each VEC_SEG block's `dim` field is authoritative for the vectors in that block |
| D4 | The HNSW index (INDEX_SEG) covers only vectors matching the primary dimension |
| D5 | Vectors with non-primary dimensions are searchable via flat scan or a separate index |
| D6 | A PROFILE_SEG may declare multiple expected dimensions |
### Dimension Catalog (Level 1 Record)
A new Level 1 TLV record (tag `0x0010`, DIMENSION_CATALOG) enables readers to
discover all dimensions present without scanning every VEC_SEG.
Record layout:
```
Offset Size Field Description
------ ---- ----- -----------
0x00 2 entry_count Number of dimension entries
0x02 2 reserved Must be zero
```
Followed by `entry_count` entries of:
```
Offset Size Field Description
------ ---- ----- -----------
0x00 2 dimension Vector dimensionality
0x02 1 dtype Data type enum for these vectors
0x03 1 flags 0x01 = primary, 0x02 = has_index
0x04 4 vector_count Number of vectors with this dimension
0x08 8 index_seg_offset Offset to dedicated index (0 if none)
```
**Entry size**: 16 bytes.
Example for an RVDNA profile file:
```
DIMENSION_CATALOG:
entry_count: 3
  [0] dim=64, dtype=f16, flags=0x03 (primary, has_index), count=10000000, index=0x1A00000
[1] dim=384, dtype=f16, flags=0x02 (has_index), count=500000, index=0x3F00000
[2] dim=4096, dtype=f32, flags=0x00 (flat scan only), count=10000, index=0
```
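Since each entry is a fixed 16 bytes, parsing the catalog is a short struct walk ( `<HBBIQ` packs to exactly 16 bytes with no padding under little-endian rules):

```python
import struct

def parse_dimension_catalog(payload):
    """Parse a DIMENSION_CATALOG (tag 0x0010) record body."""
    entry_count, _reserved = struct.unpack_from("<HH", payload, 0)
    entries = []
    for i in range(entry_count):
        dim, dtype, flags, count, index_off = struct.unpack_from(
            "<HBBIQ", payload, 4 + 16 * i)
        entries.append({
            "dimension": dim,
            "dtype": dtype,
            "primary": bool(flags & 0x01),
            "has_index": bool(flags & 0x02),
            "vector_count": count,
            "index_seg_offset": index_off,
        })
    return entries
```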
## 6. Space Reclamation
Over time, tombstoned segments and superseded manifests accumulate dead space.
RVF provides three reclamation strategies, each suited to different operating
conditions.
### Strategy 1: Hole-Punching
On Linux filesystems that support `fallocate(2)` with `FALLOC_FL_PUNCH_HOLE`
(ext4, XFS, btrfs), tombstoned segment ranges can be released back to the
filesystem without rewriting the file.
```
Before: [VEC_1 live] [VEC_2 dead] [VEC_3 dead] [VEC_4 live] [MANIFEST]
After: [VEC_1 live] [ hole ] [ hole ] [VEC_4 live] [MANIFEST]
```
File size is unchanged but disk blocks are freed. No data movement occurs — each
punch is O(1). Reader mmap still works (holes read as zeros, but the manifest
never references them). Hole-punching is performed only on segments marked as
TOMBSTONE in the current manifest's COMPACTION_STATE record.
### Strategy 2: Copy-Compact
Copy-compact rewrites the file, including only live segments. This is the
universal strategy that works on all filesystems.
```
Protocol:
1. Acquire writer lock
2. Read current manifest to enumerate live segments
3. Create temporary file: <basename>.rvf.compact.tmp
4. Write live segments sequentially to temporary file
5. Write new MANIFEST_SEG with updated offsets
6. fsync temporary file
7. Atomic rename: <basename>.rvf.compact.tmp -> <basename>.rvf
8. Release writer lock
```
The atomic rename (step 7) ensures readers either see the old file or the new
file, never a partial state. Readers that opened the old file before the rename
continue operating on the old inode via their open file descriptor. The old
inode is freed when the last reader closes its descriptor.
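The fsync-then-rename core of the protocol can be sketched as follows; lock acquisition, live-segment enumeration, and the fresh MANIFEST_SEG are elided, and `live_segments` is simply an iterable of raw segment bytes.

```python
import os

def copy_compact(path, live_segments):
    """Rewrite an RVF file containing only live segments (steps 3-7)."""
    tmp = path + ".compact.tmp"
    with open(tmp, "wb") as f:
        for seg in live_segments:
            f.write(seg)
        # A real writer appends a new MANIFEST_SEG with updated offsets here.
        f.flush()
        os.fsync(f.fileno())             # step 6: durable before rename
    os.replace(tmp, path)                # step 7: atomic rename
```

`os.replace` maps to `rename(2)`, which is atomic within a filesystem: a crash before the call leaves the original file untouched, as the crash-safety discussion in Section 8 requires.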
### Strategy 3: Shard Rewrite (Multi-File Mode)
In multi-file mode, individual shard files can be rewritten independently:
```
Protocol:
1. Acquire writer lock
2. Read shard reference from Level 1 SHARD_REFS record
3. Write new shard: <basename>.rvf.cold.<N>.compact.tmp
4. fsync new shard
5. Update main file manifest with new shard reference
6. fsync main file
7. Atomic rename new shard over old shard
8. Release writer lock
```
The old shard is safe to delete after all readers close their descriptors.
Implementations MAY defer deletion using a grace period (default: 60 seconds).
## 7. Space Reclamation Triggers
Reclamation is not performed on every write. Implementations SHOULD evaluate
triggers after each manifest write and act when thresholds are exceeded.
| Trigger | Threshold | Action |
|---------|-----------|--------|
| Dead space ratio | > 50% of file size | Copy-compact |
| Dead space absolute | > 1 GB | Hole-punch if supported, else copy-compact |
| Tombstone count | > 10,000 JOURNAL_SEG tombstone entries | Consolidate journal segments |
| Time since last compaction | > 7 days | Evaluate dead space ratio, compact if > 25% |
### Dead Space Calculation
Dead space is computed from the manifest's COMPACTION_STATE record:
```
dead_bytes = sum(payload_length + 64) for each tombstoned segment
total_bytes = file_size
dead_ratio = dead_bytes / total_bytes
```
The `+ 64` accounts for the segment header.
### Trigger Evaluation Protocol
```
1. After writing a new MANIFEST_SEG, compute dead_bytes and dead_ratio
2. If dead_ratio > 0.50: schedule copy-compact
3. Else if dead_bytes > 1 GB:
a. If fallocate supported: hole-punch tombstoned ranges
b. Else: schedule copy-compact
4. If tombstone_count > 10,000: consolidate JOURNAL_SEGs
5. If days_since_last_compact > 7 AND dead_ratio > 0.25: schedule copy-compact
```
Scheduled compactions MAY be deferred to a background process or low-activity
period.
## 8. Multi-Process Compaction
Compaction is a write operation and requires the writer lock. Only one process
may compact at a time.
### Background Compaction Process
A dedicated compaction process can run alongside the application:
```
1. Attempt writer lock acquisition
2. If lock acquired:
a. Read current manifest
b. Evaluate reclamation triggers
c. If compaction needed:
i. Write WITNESS_SEG with compaction_state = STARTED
ii. Perform compaction (copy-compact or hole-punch)
iii. Write WITNESS_SEG with compaction_state = COMPLETED
iv. Write new MANIFEST_SEG
d. Release lock
3. If lock not acquired: sleep and retry
```
### Crash Safety
Compaction is crash-safe by construction. Copy-compact does not rename until
fsynced — a crash before rename leaves the original file untouched and the
temporary file is cleaned up on next startup. Hole-punch `fallocate` calls are
individually atomic; a crash mid-sequence leaves the manifest consistent because
it references only live segments. Shard rewrite follows the same atomic rename
pattern as copy-compact.
### Compaction Progress and Resumability
For long-running compactions, the writer records progress in WITNESS_SEG segments:
```
WITNESS_SEG compaction payload:
Offset Size Field Description
------ ---- ----- -----------
0x00 4 state 0=STARTED, 1=IN_PROGRESS, 2=COMPLETED, 3=ABORTED
0x04 8 source_manifest_id Segment ID of manifest being compacted
0x0C 8 last_copied_seg_id Last segment ID successfully written to new file
0x14 8 bytes_written Total bytes written to new file so far
0x1C 8 bytes_remaining Estimated bytes remaining
0x24 16 temp_file_hash Hash of temporary file at last checkpoint
```
If a compaction process crashes and restarts, it can:
1. Find the latest WITNESS_SEG with `state = IN_PROGRESS`
2. Verify the temporary file exists and matches `temp_file_hash`
3. Resume from `last_copied_seg_id + 1`
4. If verification fails, delete the temporary file and restart compaction
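Decoding the 52-byte witness payload is a single struct unpack (`<IQQQQ16s` matches the offsets above exactly, with no implicit padding):

```python
import struct

STARTED, IN_PROGRESS, COMPLETED, ABORTED = range(4)
WITNESS_FMT = "<IQQQQ16s"                # 4+8+8+8+8+16 = 52 bytes

def parse_witness(payload):
    """Parse a WITNESS_SEG compaction payload into a dict."""
    state, src, last_seg, written, remaining, tmp_hash = struct.unpack_from(
        WITNESS_FMT, payload, 0)
    return {"state": state, "source_manifest_id": src,
            "last_copied_seg_id": last_seg, "bytes_written": written,
            "bytes_remaining": remaining, "temp_file_hash": tmp_hash}
```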
## 9. Crash Recovery Summary
RVF recovers from crashes at any point without external tooling.
| Crash Point | State After Recovery | Action Required |
|-------------|---------------------|-----------------|
| Segment append (before manifest) | Orphan segment at tail | None — manifest does not reference it |
| Manifest write | Partial manifest at tail | Scan backward to previous valid manifest |
| Lock acquisition | Lock file may or may not exist | Stale lock detection resolves it |
| Lock release | Lock file persists | Stale lock detection resolves it |
| Copy-compact (before rename) | Temporary file on disk | Delete `*.compact.tmp` on startup |
| Copy-compact (during rename) | Atomic — old or new | No action needed |
| Hole-punch | Partial holes punched | No action — manifest is consistent |
| Shard rewrite | Temporary shard on disk | Delete `*.compact.tmp` on startup |
### Startup Recovery Protocol
On startup, before acquiring a write lock, a writer SHOULD:
```
1. Delete any <basename>.rvf.compact.tmp files (orphaned compaction)
2. Delete any <basename>.rvf.cold.*.compact.tmp files (orphaned shard compaction)
3. Validate the lock file (if present) for staleness
4. Open the RVF file and locate the latest valid manifest
5. If the tail contains a partial segment (magic present, bad hash):
a. Log a warning with the partial segment's offset and type
b. The partial segment is outside the manifest — it is harmless
c. The next append will overwrite it (or it will be compacted away)
```
## 10. Invariants
The following invariants extend those in spec 01 (Section 7):
1. At most one writer lock exists per RVF file at any time
2. A lock file with valid magic and checksum represents an active or stale lock
3. Readers never require a lock, regardless of operation
4. A manifest snapshot is immutable for the lifetime of a reader session
5. Compaction never modifies live segments — it creates new ones
6. Hole-punched regions are never referenced by any manifest
7. The root manifest magic and first 8 bytes are frozen across all versions
8. Unknown segment versions and types are skipped, never rejected
9. Unknown TLV tags in Level 1 are skipped, never rejected
10. Each VEC_SEG block's `dim` field is authoritative for that block's vectors
---
# RVF Operations API
## 1. Scope
This document specifies the operational surface of an RVF runtime: error codes
returned by all operations, wire formats for batch queries, batch ingest, and
batch deletes, the network streaming protocol for progressive loading over HTTP
and TCP, and the compaction scheduling policy. It complements the segment model
(spec 01), manifest system (spec 02), and query optimization (spec 06).
All multi-byte integers are little-endian unless otherwise noted. All offsets
within messages are byte offsets from the start of the message payload.
## 2. Error Code Enumeration
Error codes are 16-bit unsigned integers. The high byte identifies the error
category; the low byte identifies the specific error within that category.
Implementations must preserve unrecognized codes in responses and must not
treat unknown codes as fatal unless the high byte is `0x01` (format error).
### Category 0x00: Success
```
Code Name Description
------ -------------------- ----------------------------------------
0x0000 OK Operation succeeded
0x0001 OK_PARTIAL Partial success (some items failed)
```
`OK_PARTIAL` is returned when a batch operation succeeds for some items and
fails for others. The response body contains per-item status details.
### Category 0x01: Format Errors
```
Code Name Description
------ -------------------- ----------------------------------------
0x0100 INVALID_MAGIC Segment magic mismatch (expected 0x52564653)
0x0101 INVALID_VERSION Unsupported segment version
0x0102 INVALID_CHECKSUM Segment hash verification failed
0x0103 INVALID_SIGNATURE Cryptographic signature invalid
0x0104 TRUNCATED_SEGMENT Segment payload shorter than declared length
0x0105 INVALID_MANIFEST Root manifest validation failed
0x0106 MANIFEST_NOT_FOUND No valid MANIFEST_SEG in file
0x0107 UNKNOWN_SEGMENT_TYPE Segment type not recognized (warning, not fatal)
0x0108 ALIGNMENT_ERROR Data not at expected 64B boundary
```
`UNKNOWN_SEGMENT_TYPE` is advisory. A reader encountering an unknown segment
type should skip it and continue. All other format errors in this category
are fatal for the affected segment.
### Category 0x02: Query Errors
```
Code Name Description
------ -------------------- ----------------------------------------
0x0200 DIMENSION_MISMATCH Query vector dimension != index dimension
0x0201 EMPTY_INDEX No index segments available
0x0202 METRIC_UNSUPPORTED Requested distance metric not available
0x0203 FILTER_PARSE_ERROR Invalid filter expression
0x0204 K_TOO_LARGE Requested K exceeds available vectors
0x0205 TIMEOUT Query exceeded time budget
```
When `K_TOO_LARGE` is returned, the response still contains all available
results. The result count will be less than the requested K.
### Category 0x03: Write Errors
```
Code Name Description
------ -------------------- ----------------------------------------
0x0300 LOCK_HELD Another writer holds the lock
0x0301 LOCK_STALE Lock file exists but owner process is dead
0x0302 DISK_FULL Insufficient space for write
0x0303 FSYNC_FAILED Durable write failed
0x0304 SEGMENT_TOO_LARGE Segment exceeds 4 GB limit
0x0305 READ_ONLY File opened in read-only mode
```
`LOCK_STALE` is informational. The runtime may attempt to break the stale
lock and retry. If recovery succeeds, the original operation proceeds with
an `OK` status.
### Category 0x04: Tile Errors (WASM Microkernel)
```
Code Name Description
------ -------------------- ----------------------------------------
0x0400 TILE_TRAP WASM trap (OOB, unreachable, stack overflow)
0x0401 TILE_OOM Tile exceeded scratch memory (64 KB)
0x0402 TILE_TIMEOUT Tile computation exceeded time budget
0x0403 TILE_INVALID_MSG Malformed hub-tile message
0x0404 TILE_UNSUPPORTED_OP Operation not available on this profile
```
All tile errors trigger the fault isolation protocol described in
`microkernel/wasm-runtime.md` section 8. The hub reassigns the tile's
work and optionally restarts the faulted tile.
### Category 0x05: Crypto Errors
```
Code Name Description
------ -------------------- ----------------------------------------
0x0500 KEY_NOT_FOUND Referenced key_id not in CRYPTO_SEG
0x0501 KEY_EXPIRED Key past valid_until timestamp
0x0502 DECRYPT_FAILED Decryption or auth tag verification failed
0x0503 ALGO_UNSUPPORTED Cryptographic algorithm not implemented
```
Crypto errors are always fatal for the affected segment. An implementation
must not serve data from a segment that fails signature or decryption checks.
## 3. Batch Query API
### Wire Format: Request
Batch queries amortize connection overhead and enable the runtime to
schedule vector block loads across multiple queries simultaneously.
```
Offset Size Field Description
------ ------ ------------------ ----------------------------------------
0x00 4 query_count Number of queries in batch (max 1024)
0x04 4 k Shared top-K parameter
0x08 1 metric Distance metric: 0=L2, 1=IP, 2=cosine, 3=hamming
0x09 3 reserved Must be zero
0x0C 4 ef_search HNSW ef_search parameter
0x10 4 shared_filter_len Byte length of shared filter (0 = no filter)
0x14 var shared_filter Filter expression (applies to all queries)
var var queries[] Per-query entries (see below)
```
Each query entry:
```
Offset Size Field Description
------ ------ ------------------ ----------------------------------------
0x00 4 query_id Client-assigned correlation ID
0x04 2 dim Vector dimensionality
0x06 1 dtype Data type: 0=fp32, 1=fp16, 2=i8, 3=binary
0x07 1 flags Bit 0: has per-query filter
0x08 var vector Query vector (dim * sizeof(dtype) bytes)
var 4 filter_len Byte length of per-query filter (if flags bit 0)
var var filter Per-query filter (overrides shared filter)
```
When both a shared filter and a per-query filter are present, the per-query
filter takes precedence. A per-query filter of zero length inherits the
shared filter.
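The layout above can be sketched as a request encoder. This is a minimal illustration, not an RVF crate API: the function name and the decision to omit per-query filters (`flags = 0`) are ours. Multi-byte fields are little-endian, per the invariants in section 10.

```rust
// Hypothetical sketch: encode a Batch Query Request header plus fp32
// query entries with no per-query filters. Offsets follow the tables
// above; names are illustrative, not from any RVF crate.
fn encode_batch_query(
    k: u32,
    metric: u8,
    ef_search: u32,
    shared_filter: &[u8],
    queries: &[(u32, Vec<f32>)], // (query_id, vector)
) -> Vec<u8> {
    let mut buf = Vec::new();
    buf.extend_from_slice(&(queries.len() as u32).to_le_bytes()); // query_count
    buf.extend_from_slice(&k.to_le_bytes());                      // k
    buf.push(metric);                                             // metric
    buf.extend_from_slice(&[0u8; 3]);                             // reserved
    buf.extend_from_slice(&ef_search.to_le_bytes());              // ef_search
    buf.extend_from_slice(&(shared_filter.len() as u32).to_le_bytes());
    buf.extend_from_slice(shared_filter);                         // shared_filter
    for (query_id, vector) in queries {
        buf.extend_from_slice(&query_id.to_le_bytes());           // query_id
        buf.extend_from_slice(&(vector.len() as u16).to_le_bytes()); // dim
        buf.push(0); // dtype = fp32
        buf.push(0); // flags: bit 0 clear -> inherit shared filter
        for v in vector {
            buf.extend_from_slice(&v.to_le_bytes());              // vector data
        }
    }
    buf
}
```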
### Wire Format: Response
```
Offset Size Field Description
------ ------ ------------------ ----------------------------------------
0x00 4 query_count Number of query results
0x04 var results[] Per-query result entries
```
Each result entry:
```
Offset Size Field Description
------ ------ ------------------ ----------------------------------------
0x00 4 query_id Correlation ID from request
0x04 2 status Error code (0x0000 = OK)
0x06 2 reserved Must be zero
0x08 4 result_count Number of results returned
0x0C var results[] Array of (vector_id: u64, distance: f32) pairs
```
Each result pair is 12 bytes: 8 bytes for the vector ID followed by 4 bytes
for the distance value. Results are sorted by distance ascending (nearest first).
### Batch Scheduling
The runtime should process batch queries using the following strategy:
1. Parse all query vectors and load them into memory
2. Identify shared segments across queries (block deduplication)
3. Load each vector block once and evaluate all relevant queries against it
4. Merge per-query top-K heaps independently
5. Return results as soon as each query completes (streaming response)
This amortizes I/O: if N queries touch the same vector block, the block is
read once instead of N times.
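Steps 2 and 3 of the strategy can be sketched as a grouping pass. The routing step that decides which blocks a query touches is assumed here (the caller supplies it); only the deduplication itself is shown.

```rust
use std::collections::HashMap;

// Hypothetical sketch of block deduplication: group query indices by
// the vector blocks they touch, so each block is loaded exactly once
// and evaluated against every relevant query.
fn plan_block_loads(blocks_per_query: &[Vec<u64>]) -> HashMap<u64, Vec<usize>> {
    let mut plan: HashMap<u64, Vec<usize>> = HashMap::new();
    for (qi, blocks) in blocks_per_query.iter().enumerate() {
        for &b in blocks {
            plan.entry(b).or_default().push(qi); // one entry per distinct block
        }
    }
    plan
}
```

The runtime would then iterate `plan`, load each block once, and feed distances into the per-query top-K heaps.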
## 4. Batch Ingest API
### Wire Format: Request
```
Offset Size Field Description
------ ------ ------------------ ----------------------------------------
0x00 4 vector_count Number of vectors to ingest (max 65536)
0x04 2 dim Vector dimensionality
0x06 1 dtype Data type: 0=fp32, 1=fp16, 2=i8, 3=binary
0x07 1 flags Bit 0: metadata_included
0x08 var vectors[] Vector entries
var var metadata[] Metadata entries (if flags bit 0)
```
Each vector entry:
```
Offset Size Field Description
------ ------ ------------------ ----------------------------------------
0x00 8 vector_id Globally unique vector ID
0x08 var vector Vector data (dim * sizeof(dtype) bytes)
```
Each metadata entry (when metadata_included is set):
```
Offset Size Field Description
------ ------ ------------------ ----------------------------------------
0x00 2 field_count Number of metadata fields
0x02 var fields[] Field entries
```
Each metadata field:
```
Offset Size Field Description
------ ------ ------------------ ----------------------------------------
0x00 2 field_id Field identifier (application-defined)
0x02 1 value_type 0=u64, 1=i64, 2=f64, 3=string, 4=bytes
0x03 var value Encoded value (u64/i64/f64: 8B; string/bytes: 4B length + data)
```
### Wire Format: Response
```
Offset Size Field Description
------ ------ ------------------ ----------------------------------------
0x00 4 accepted_count Number of vectors accepted
0x04 4 rejected_count Number of vectors rejected
0x08 4 manifest_epoch Epoch of manifest after commit
0x0C var rejected_ids[] Array of rejected vector IDs (u64 * rejected_count)
var var rejected_reasons[] Array of error codes (u16 * rejected_count)
```
The `manifest_epoch` field is the epoch of the MANIFEST_SEG written after the
ingest is committed. Clients can use this value to confirm that a subsequent
read will include the ingested vectors.
### Ingest Commit Semantics
1. The runtime writes vectors to a new VEC_SEG (append-only)
2. If metadata is included, a META_SEG is appended
3. Both segments are fsynced
4. A new MANIFEST_SEG is written referencing the new segments
5. The manifest is fsynced
6. The response is sent with the new manifest_epoch
Vectors are visible to queries only after step 6 completes.
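The ordering above can be sketched over a single append-only file. This is an illustration of the fsync ordering only (metadata segments and epoch bookkeeping are omitted): the key property is that the manifest is made durable strictly after the data it references, so a crash can never expose a manifest pointing at unsynced vectors.

```rust
use std::fs::OpenOptions;
use std::io::Write;

// Hypothetical sketch of the commit ordering: append VEC_SEG, fsync,
// append MANIFEST_SEG, fsync. Returns the file length after commit.
fn commit_ingest(path: &str, vec_seg: &[u8], manifest_seg: &[u8]) -> std::io::Result<u64> {
    let mut f = OpenOptions::new().create(true).append(true).open(path)?;
    f.write_all(vec_seg)?;      // steps 1-2: append data segments
    f.sync_all()?;              // step 3: data durable before manifest
    f.write_all(manifest_seg)?; // step 4: append MANIFEST_SEG
    f.sync_all()?;              // step 5: manifest durable -> commit point
    Ok(f.metadata()?.len())     // step 6: respond (epoch derived elsewhere)
}
```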
## 5. Batch Delete API
### Wire Format: Request
```
Offset Size Field Description
------ ------ ------------------ ----------------------------------------
0x00 1 delete_type 0=by_id, 1=by_range, 2=by_filter
0x01 3 reserved Must be zero
0x04 var payload Type-specific payload (see below)
```
Delete by ID (`delete_type = 0`):
```
Offset Size Field Description
------ ------ ------------------ ----------------------------------------
0x00 4 count Number of IDs to delete
0x04 var ids[] Array of vector IDs (u64 * count)
```
Delete by range (`delete_type = 1`):
```
Offset Size Field Description
------ ------ ------------------ ----------------------------------------
0x00 8 start_id Start of range (inclusive)
0x08 8 end_id End of range (exclusive)
```
Delete by filter (`delete_type = 2`):
```
Offset Size Field Description
------ ------ ------------------ ----------------------------------------
0x00 4 filter_len Byte length of filter expression
0x04 var filter Filter expression
```
### Wire Format: Response
```
Offset Size Field Description
------ ------ ------------------ ----------------------------------------
0x00 8 deleted_count Number of vectors deleted
0x08 2 status Error code (0x0000 = OK)
0x0A 2 reserved Must be zero
0x0C 4 manifest_epoch Epoch of manifest after delete committed
```
### Delete Mechanics
Deletes are logical. The runtime appends a JOURNAL_SEG containing tombstone
entries for the deleted vector IDs. The new MANIFEST_SEG marks affected
VEC_SEGs as partially dead. Physical reclamation happens during compaction.
## 6. Network Streaming Protocol
### 6.1 HTTP Range Requests (Read-Only Access)
RVF's progressive loading model maps naturally to HTTP byte-range requests.
A client can boot from a remote `.rvf` file and become queryable without
downloading the entire file.
**Phase 1: Boot (mandatory)**
```
GET /file.rvf Range: bytes=-4096
```
Retrieves the last 4 KB of the file. This contains the Level 0 root manifest
(MANIFEST_SEG). The client parses hotset pointers, the segment directory, and
the profile ID.
If the file is smaller than 4 KB, the entire file is returned. If the last
4 KB does not contain a valid MANIFEST_SEG, the client extends the range
backward in 4 KB increments until one is found or 1 MB is scanned (at which
point it returns `MANIFEST_NOT_FOUND`).
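The backward scan can be sketched over an in-memory byte slice standing in for the ranged reads. The manifest-detection closure is a placeholder for real MANIFEST_SEG magic and hash validation, which this sketch does not implement.

```rust
// Hypothetical sketch of the Phase 1 tail scan: widen the tail window
// in 4 KB steps until a manifest is found, giving up after 1 MB.
fn tail_scan(
    file: &[u8],
    is_manifest_in: impl Fn(&[u8]) -> Option<usize>, // offset within window
) -> Option<usize> {
    const STEP: usize = 4096;
    const LIMIT: usize = 1 << 20; // scan at most the last 1 MB
    let mut window = STEP.min(file.len());
    loop {
        let start = file.len() - window;
        if let Some(off) = is_manifest_in(&file[start..]) {
            return Some(start + off); // absolute offset of manifest
        }
        if window >= file.len() || window >= LIMIT {
            return None; // MANIFEST_NOT_FOUND
        }
        window = (window + STEP).min(file.len()).min(LIMIT);
    }
}
```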
**Phase 2: Hotset (parallel, mandatory for queries)**
Using offsets from the Level 0 manifest, the client issues up to 5 parallel
range requests:
```
GET /file.rvf Range: bytes=<entrypoint_offset>-<entrypoint_end>
GET /file.rvf Range: bytes=<toplayer_offset>-<toplayer_end>
GET /file.rvf Range: bytes=<centroid_offset>-<centroid_end>
GET /file.rvf Range: bytes=<quantdict_offset>-<quantdict_end>
GET /file.rvf Range: bytes=<hotcache_offset>-<hotcache_end>
```
These fetch the HNSW entry point, top-layer graph, routing centroids,
quantization dictionary, and the hot cache (HOT_SEG). After these 5 requests
complete, the system is queryable with recall >= 0.7.
**Phase 3: Level 1 (background)**
```
GET /file.rvf Range: bytes=<l1_offset>-<l1_end>
```
Fetches the Level 1 manifest containing the full segment directory. This
enables the client to discover all segments and plan on-demand fetches.
**Phase 4: On-demand (per query)**
For queries that require cold data not yet fetched:
```
GET /file.rvf Range: bytes=<segment_offset>-<segment_end>
```
The client caches fetched segments locally. Repeated queries against the
same data region do not trigger additional requests.
### HTTP Requirements
- Server must support `Accept-Ranges: bytes`
- Server must return `206 Partial Content` for range requests
- Server should support multiple ranges in a single request (`multipart/byteranges`)
- Client should use `If-None-Match` with the file's ETag to detect stale caches
### 6.2 TCP Streaming Protocol (Real-Time Access)
For real-time ingest and low-latency queries, RVF defines a binary TCP
protocol over TLS 1.3.
**Connection Setup**
```
1. Client opens TCP connection to server
2. TLS 1.3 handshake (mandatory, no plaintext mode)
3. Client sends HELLO message with protocol version and capabilities
4. Server responds with HELLO_ACK confirming capabilities
5. Connection is ready for messages
```
**Framing**
All messages are length-prefixed:
```
Offset Size Field Description
------ ------ ------------------ ----------------------------------------
0x00 4 frame_length Payload length (big-endian, max 16 MB)
0x04 1 msg_type Message type (see below)
0x05 3 msg_id Correlation ID (big-endian, wraps at 2^24)
0x08 var payload Message-specific payload
```
Frame length is big-endian (network byte order) for consistency with TLS
framing. The 16 MB maximum prevents a single message from monopolizing the
connection. Payloads larger than 16 MB must be split across multiple messages
using continuation framing (see section 6.4).
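The frame header can be sketched as follows. Note the mixed endianness: `frame_length` and `msg_id` are big-endian per the table above, while payload fields remain little-endian. The function name is ours.

```rust
const MAX_FRAME: usize = 16 << 20; // 16 MB payload cap

// Hypothetical sketch of frame encoding: 4-byte big-endian length,
// 1-byte message type, 3-byte big-endian correlation ID, payload.
fn encode_frame(msg_type: u8, msg_id: u32, payload: &[u8]) -> Option<Vec<u8>> {
    if payload.len() > MAX_FRAME {
        return None; // caller must use continuation framing (section 6.4)
    }
    let id = msg_id & 0x00FF_FFFF; // msg_id wraps at 2^24
    let mut frame = Vec::with_capacity(8 + payload.len());
    frame.extend_from_slice(&(payload.len() as u32).to_be_bytes()); // frame_length
    frame.push(msg_type);
    frame.extend_from_slice(&id.to_be_bytes()[1..]); // low 3 bytes, big-endian
    frame.extend_from_slice(payload);
    Some(frame)
}
```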
**Message Types**
```
Client -> Server:
0x01 QUERY Batch query (payload = Batch Query Request)
0x02 INGEST Batch ingest (payload = Batch Ingest Request)
0x03 DELETE Batch delete (payload = Batch Delete Request)
0x04 STATUS Request server status (no payload)
0x05 SUBSCRIBE Subscribe to update notifications
Server -> Client:
0x81 QUERY_RESULT Batch query result
0x82 INGEST_ACK Batch ingest acknowledgment
0x83 DELETE_ACK Batch delete acknowledgment
0x84 STATUS_RESP Server status response
0x85 UPDATE_NOTIFY Push notification of new data
0xFF ERROR Error with code and description
```
**ERROR Message Payload**
```
Offset Size Field Description
------ ------ ------------------ ----------------------------------------
0x00 2 error_code Error code from section 2
0x02 2 description_len Byte length of description string
0x04 var description UTF-8 error description (human-readable)
```
### 6.3 Streaming Ingest Protocol
The TCP protocol supports continuous ingest where the client streams vectors
without waiting for per-batch acknowledgments.
**Flow**
```
Client Server
| |
|--- INGEST (batch 0) ------------->|
|--- INGEST (batch 1) ------------->| Pipelining: send without waiting
|--- INGEST (batch 2) ------------->|
| | Server writes VEC_SEGs, appends manifest
|<--- INGEST_ACK (batch 0) ---------|
|<--- INGEST_ACK (batch 1) ---------|
| | Backpressure: server delays ACK
|--- INGEST (batch 3) ------------->| Client respects window
|<--- INGEST_ACK (batch 2) ---------|
| |
```
**Backpressure**
The server controls ingest rate by delaying INGEST_ACK responses. The client
must limit its in-flight (unacknowledged) ingest messages to a configurable
window size (default: 8 messages). When the window is full, the client must
wait for an ACK before sending the next batch.
The server should send backpressure when:
- Write queue exceeds 80% capacity
- Compaction is falling behind (dead space > 50%)
- Available disk space drops below 10%
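The client side of this contract can be sketched as a small in-flight window. The type and method names are illustrative; only the invariant matters: never more than `capacity` unacknowledged batches.

```rust
use std::collections::VecDeque;

// Hypothetical sketch of the client-side ingest window (default
// capacity 8). `in_flight` holds msg_ids awaiting INGEST_ACK.
struct IngestWindow {
    capacity: usize,
    in_flight: VecDeque<u32>,
}

impl IngestWindow {
    fn new(capacity: usize) -> Self {
        Self { capacity, in_flight: VecDeque::new() }
    }

    /// Returns true if the batch may be sent now; false means the
    /// window is full and the caller must wait for an ACK.
    fn try_send(&mut self, msg_id: u32) -> bool {
        if self.in_flight.len() >= self.capacity {
            return false; // respect server backpressure
        }
        self.in_flight.push_back(msg_id);
        true
    }

    fn on_ack(&mut self, msg_id: u32) {
        self.in_flight.retain(|&id| id != msg_id); // free one window slot
    }
}
```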
**Commit Semantics**
Each INGEST_ACK contains the `manifest_epoch` after commit. The server
guarantees that all vectors acknowledged with epoch E are visible to any
query that reads the manifest at epoch >= E.
### 6.4 Continuation Framing
For payloads exceeding the 16 MB frame limit:
```
Frame 0: msg_type = original type, flags bit 0 = CONTINUATION_START
Frame 1: msg_type = 0x00 (CONTINUATION), flags bit 0 = 0
Frame 2: msg_type = 0x00 (CONTINUATION), flags bit 0 = 0
Frame N: msg_type = 0x00 (CONTINUATION), flags bit 1 = CONTINUATION_END
```
The receiver reassembles the payload from all continuation frames before
processing. The msg_id is shared across all frames of a continuation sequence.
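The sender side can be sketched as a chunking pass. The tables in section 6.2 do not pin down where the START/END flag bits are carried, so this sketch returns `(msg_type, is_last, chunk)` tuples and leaves flag encoding to the framing layer; the frame-size limit is a parameter so the behavior is easy to exercise.

```rust
// Hypothetical sketch of continuation splitting: the first frame keeps
// the original message type, later frames use CONTINUATION (0x00), and
// the final frame is marked as the end of the sequence. `max` must be
// nonzero (16 MB in the real protocol).
fn split_continuation<'a>(
    orig_type: u8,
    payload: &'a [u8],
    max: usize,
) -> Vec<(u8, bool, &'a [u8])> {
    let chunks: Vec<&[u8]> = payload.chunks(max).collect();
    let n = chunks.len();
    chunks
        .into_iter()
        .enumerate()
        .map(|(i, chunk)| {
            let ty = if i == 0 { orig_type } else { 0x00 }; // CONTINUATION
            (ty, i == n - 1, chunk) // is_last marks CONTINUATION_END
        })
        .collect()
}
```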
### 6.5 SUBSCRIBE and UPDATE_NOTIFY
The SUBSCRIBE message registers the client for push notifications when new
data is committed:
```
SUBSCRIBE payload:
Offset Size Field Description
------ ------ ------------------ ----------------------------------------
0x00 4 min_epoch Only notify for epochs > this value
0x04 1 notify_flags Bit 0: ingest, Bit 1: delete, Bit 2: compaction
0x05 3 reserved Must be zero
```
The server sends UPDATE_NOTIFY whenever a new MANIFEST_SEG is committed that
matches the subscription criteria:
```
UPDATE_NOTIFY payload:
Offset Size Field Description
------ ------ ------------------ ----------------------------------------
0x00 4 epoch New manifest epoch
0x04 1 event_type 0=ingest, 1=delete, 2=compaction
0x05 3 reserved Must be zero
0x08 4 affected_count Number of vectors affected
0x0C 8 new_total Total vector count after event
```
## 7. Compaction Scheduling Policy
Compaction merges small, overlapping, or partially-dead segments into larger,
sealed segments. Because compaction competes with queries and ingest for I/O
bandwidth, the runtime enforces a scheduling policy.
### 7.1 IO Budget
Compaction must consume at most 30% of available IOPS. The runtime measures
IOPS over a 5-second sliding window and throttles compaction I/O to stay
within budget.
```
available_iops = measured_iops_capacity (from benchmarking at startup)
compaction_budget = available_iops * 0.30
compaction_throttle = max(compaction_budget - current_compaction_iops, 0)
```
### 7.2 Priority Ordering
When I/O bandwidth is contended, operations are prioritized:
```
Priority 1 (highest): Queries (reads from VEC_SEG, INDEX_SEG, HOT_SEG)
Priority 2: Ingest (writes to VEC_SEG, META_SEG, MANIFEST_SEG)
Priority 3 (lowest): Compaction (reads + writes of sealed segments)
```
Compaction yields to queries and ingest. If a compaction I/O operation would
cause a query to exceed its time budget, the compaction operation is deferred.
### 7.3 Scheduling Triggers
Compaction runs when all of the following conditions are met:
| Condition | Threshold | Rationale |
|-----------|-----------|-----------|
| Query load | < 50% of capacity | Avoid competing with active queries |
| Dead space ratio | > 20% of total file size | Not worth compacting small amounts |
| Segment count | > 32 active segments | Many small segments hurt read performance |
| Time since last compaction | > 60 seconds | Prevent compaction storms |
The runtime evaluates these conditions every 10 seconds.
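The trigger check is a pure conjunction of the four thresholds, which can be sketched directly. The struct and field names are ours; percentages are taken as already-computed integers.

```rust
// Hypothetical sketch of the compaction trigger evaluation, run every
// 10 seconds. All four conditions from the table must hold.
struct CompactionStats {
    query_load_pct: u32,      // % of query capacity in use
    dead_space_pct: u32,      // dead bytes as % of total file size
    active_segments: u32,     // count of active (unsealed) segments
    secs_since_last_run: u64, // time since last compaction
}

fn should_compact(s: &CompactionStats) -> bool {
    s.query_load_pct < 50        // avoid competing with active queries
        && s.dead_space_pct > 20 // enough dead space to be worth it
        && s.active_segments > 32 // many small segments hurt reads
        && s.secs_since_last_run > 60 // prevent compaction storms
}
```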
### 7.4 Emergency Compaction
If dead space exceeds 70% of total file size, compaction enters emergency mode:
```
Emergency compaction rules:
1. Compaction preempts ingest (ingest is paused, not rejected)
2. IO budget increases to 60% of available IOPS
3. Compaction runs regardless of query load
4. Ingest resumes after dead space drops below 50%
```
During emergency compaction, the server responds to INGEST messages with
delayed ACKs (backpressure) rather than rejecting them. Queries continue to
be served at highest priority.
### 7.5 Compaction Progress Reporting
The STATUS response includes compaction state:
```
STATUS_RESP compaction fields:
Offset Size Field Description
------ ------ ------------------- ----------------------------------------
0x00 1 compaction_state 0=idle, 1=running, 2=emergency
0x01 1 progress_pct Completion percentage (0-100)
0x02 2 reserved Must be zero
0x04 8 dead_bytes Total dead space in bytes
0x0C 8 total_bytes Total file size in bytes
0x14 4 segments_remaining Segments left to compact
0x18 4 segments_completed Segments compacted in current run
0x1C 4 estimated_seconds Estimated time to completion
0x20 4 io_budget_pct Current IO budget percentage (30 or 60)
```
### 7.6 Compaction Segment Selection
The runtime selects segments for compaction using a tiered strategy:
```
1. Tombstoned segments: Always compacted first (reclaim dead space)
2. Small VEC_SEGs: Segments < 1 MB merged into larger segments
3. High-overlap INDEX_SEGs: Index segments covering the same ID range
4. Cold OVERLAY_SEGs: Overlay deltas merged into base segments
```
The compaction output is always a sealed segment (SEALED flag set). Sealed
segments are immutable and can be verified independently.
## 8. STATUS Response Format
The STATUS message provides a snapshot of the server state for monitoring
and diagnostics.
```
STATUS_RESP payload:
Offset Size Field Description
------ ------ ------------------- ----------------------------------------
0x00 4 protocol_version Protocol version (currently 1)
0x04 4 manifest_epoch Current manifest epoch
0x08 8 total_vectors Total vector count
0x10 8 total_segments Total segment count
0x18 8 file_size_bytes Total file size
0x20 4 query_qps Queries per second (last 5s window)
0x24 4 ingest_vps Vectors ingested per second (last 5s window)
0x28    36      compaction          Compaction state (see section 7.5)
0x4C    1       profile_id          Active hardware profile (0x00-0x03)
0x4D    1       health              0=healthy, 1=degraded, 2=read_only
0x4E    2       reserved            Must be zero
0x50    4       uptime_seconds      Server uptime
```
## 9. Filter Expression Format
Filter expressions used in batch queries and batch deletes share a common
binary encoding:
```
Offset Size Field Description
------ ------ ------------------ ----------------------------------------
0x00 1 op Operator enum (see below)
0x01 2 field_id Metadata field to filter on
0x03 1 value_type Value type (matches metadata field types)
0x04 var value Comparison value
var var children[] Sub-expressions (for AND/OR/NOT)
```
Operator enum:
```
0x00 EQ field == value
0x01 NE field != value
0x02 LT field < value
0x03 LE field <= value
0x04 GT field > value
0x05 GE field >= value
0x06 IN field in [values]
0x07 RANGE field in [low, high)
0x10 AND All children must match
0x11 OR Any child must match
0x12 NOT Negate single child
```
Filters are evaluated during the query scan phase. Vectors that do not match
the filter are excluded from distance computation entirely (pre-filtering) or
from the result set (post-filtering), depending on the runtime's cost model.
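Evaluation over a decoded filter tree can be sketched as a recursive match. The wire decoding step is omitted, and the sketch restricts values to `i64` for brevity; the enum mirrors the operator table above, with names of our choosing. A metadata lookup that returns `None` (field absent) fails every comparison operator.

```rust
// Hypothetical sketch of filter-tree evaluation against one vector's
// metadata, i64 values only. `get` maps a field_id to its value.
enum Filter {
    Eq(u16, i64), Ne(u16, i64), Lt(u16, i64), Le(u16, i64),
    Gt(u16, i64), Ge(u16, i64),
    Range(u16, i64, i64),               // field in [low, high)
    And(Vec<Filter>), Or(Vec<Filter>), Not(Box<Filter>),
}

fn matches(f: &Filter, get: &impl Fn(u16) -> Option<i64>) -> bool {
    match f {
        Filter::Eq(id, x) => get(*id) == Some(*x),
        Filter::Ne(id, x) => get(*id).map_or(false, |a| a != *x),
        Filter::Lt(id, x) => get(*id).map_or(false, |a| a < *x),
        Filter::Le(id, x) => get(*id).map_or(false, |a| a <= *x),
        Filter::Gt(id, x) => get(*id).map_or(false, |a| a > *x),
        Filter::Ge(id, x) => get(*id).map_or(false, |a| a >= *x),
        Filter::Range(id, lo, hi) => get(*id).map_or(false, |a| *lo <= a && a < *hi),
        Filter::And(cs) => cs.iter().all(|c| matches(c, get)), // all children
        Filter::Or(cs) => cs.iter().any(|c| matches(c, get)),  // any child
        Filter::Not(c) => !matches(c, get),                    // negate child
    }
}
```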
## 10. Invariants
1. Error codes are stable across versions; new codes are additive only
2. Batch operations are atomic per-item, not per-batch (partial success is valid)
3. TCP connections are always TLS 1.3; plaintext is not permitted
4. In the TCP frame header, frame_length and msg_id are big-endian; all other multi-byte fields are little-endian
5. HTTP progressive loading must succeed with at most 7 round trips to become queryable
6. Compaction never runs at more than 60% of available IOPS, even in emergency mode
7. The STATUS response is always available, even during emergency compaction
8. Filter expressions are limited to 64 levels of nesting depth

# RVF WASM Self-Bootstrapping Specification
## 1. Motivation
Traditional file formats require an external runtime to interpret their contents.
A JPEG needs an image decoder. A SQLite database needs the SQLite library. An RVF
file needs a vector search engine.
What if the file carried its own runtime?
By embedding a tiny WASM interpreter inside the RVF file itself, we eliminate the
last external dependency. The host only needs **raw execution capability** — the
ability to run bytes as instructions. RVF becomes **self-bootstrapping**: a single
file that contains both its data and the complete machinery to process that data.
This is the transition from "needs a compatible runtime" to **"runs anywhere
compute exists."**
## 2. Architecture
### The Bootstrap Stack
```
Layer 3: RVF Data Segments (VEC_SEG, INDEX_SEG, MANIFEST_SEG, ...)
^
| processes
|
Layer 2: WASM Microkernel (WASM_SEG, role=Microkernel, ~5.5 KB)
^ 14 exports: query, ingest, distance, top-K
| executes
|
Layer 1: WASM Interpreter (WASM_SEG, role=Interpreter, ~50 KB)
^ Minimal stack machine that runs WASM bytecode
| loads
|
Layer 0: Raw Bytes (The .rvf file on any storage medium)
```
Each layer depends only on the one below it. The host reads Layer 0 (raw bytes),
finds the interpreter at Layer 1, uses it to execute the microkernel at Layer 2,
which then processes the data at Layer 3.
### Segment Layout
```
┌──────────────────────────────────────────────────────────────────────┐
│ bootable.rvf │
│ │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ ┌─────────┐ │
│ │ WASM_SEG │ │ WASM_SEG │ │ VEC_SEG │ │ INDEX │ │
│ │ 0x10 │ │ 0x10 │ │ 0x01 │ │ _SEG │ │
│ │ │ │ │ │ │ │ 0x02 │ │
│ │ role=Interp │ │ role=uKernel │ │ 10M vectors │ │ HNSW │ │
│ │ ~50 KB │ │ ~5.5 KB │ │ 384-dim fp16 │ │ L0+L1 │ │
│ │ priority=0 │ │ priority=1 │ │ │ │ │ │
│ └──────────────┘ └──────────────┘ └──────────────┘ └─────────┘ │
│ │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ QUANT_SEG │ │ WITNESS_SEG │ │ MANIFEST_SEG │ ← tail │
│ │ codebooks │ │ audit trail │ │ source of │ │
│ │ │ │ │ │ truth │ │
│ └──────────────┘ └──────────────┘ └──────────────┘ │
└──────────────────────────────────────────────────────────────────────┘
```
## 3. WASM_SEG Wire Format
### Segment Type
```
Value: 0x10
Name: WASM_SEG
```
Uses the standard 64-byte RVF segment header (`SegmentHeader`), followed by
a 64-byte `WasmHeader`, followed by the WASM bytecode.
### WasmHeader (64 bytes)
```
Offset Size Type Field Description
------ ---- ---- ----- -----------
0x00 4 u32 wasm_magic 0x5256574D ("RVWM" big-endian)
0x04 2 u16 header_version Currently 1
0x06 1 u8 role Bootstrap role (see WasmRole enum)
0x07 1 u8 target Target platform (see WasmTarget enum)
0x08 2 u16 required_features WASM feature bitfield
0x0A 2 u16 export_count Number of WASM exports
0x0C 4 u32 bytecode_size Uncompressed bytecode size (bytes)
0x10 4 u32 compressed_size Compressed size (0 = no compression)
0x14 1 u8 compression 0=none, 1=LZ4, 2=ZSTD
0x15 1 u8 min_memory_pages Minimum linear memory (64 KB each)
0x16 1 u8 max_memory_pages Maximum linear memory (0 = no limit)
0x17 1 u8 table_count Number of WASM tables
0x18 32 hash256 bytecode_hash SHAKE-256-256 of uncompressed bytecode
0x38 1 u8 bootstrap_priority Lower = tried first in chain
0x39 1 u8 interpreter_type Interpreter variant (if role=Interpreter)
0x3A 6 u8[6] reserved Must be zero
```
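A bootstrap loader only needs a handful of these fields to sort and select modules. The sketch below assumes the magic is stored as the bytes `"RVWM"` (0x5256574D read big-endian, as the comment in the table suggests) and that the other multi-byte fields are little-endian per RVF convention; the struct name is ours.

```rust
// Hypothetical sketch: parse the fields a bootstrap loader needs from
// the 64-byte WasmHeader. Offsets follow the table above.
struct WasmHeaderLite {
    role: u8,               // WasmRole enum value
    target: u8,             // WasmTarget enum value
    required_features: u16, // WASM feature bitfield
    bytecode_size: u32,     // uncompressed bytecode size
    bootstrap_priority: u8, // lower = tried first
}

fn parse_wasm_header(h: &[u8]) -> Option<WasmHeaderLite> {
    if h.len() < 64 || &h[0..4] != b"RVWM" {
        return None; // short header or wasm_magic mismatch
    }
    Some(WasmHeaderLite {
        role: h[0x06],
        target: h[0x07],
        required_features: u16::from_le_bytes([h[0x08], h[0x09]]),
        bytecode_size: u32::from_le_bytes([h[0x0C], h[0x0D], h[0x0E], h[0x0F]]),
        bootstrap_priority: h[0x38],
    })
}
```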
### WasmRole Enum
```
Value Name Description
----- ---- -----------
0x00 Microkernel RVF query engine (5.5 KB Cognitum tile runtime)
0x01 Interpreter Minimal WASM interpreter for self-bootstrapping
0x02 Combined Interpreter + microkernel linked together
0x03 Extension Domain-specific module (custom distance, decoder)
0x04 ControlPlane Store management (create, export, segment parsing)
```
### WasmTarget Enum
```
Value Name Description
----- ---- -----------
0x00 Wasm32 Generic wasm32 (any compliant runtime)
0x01 WasiP1 WASI Preview 1 (requires WASI syscalls)
0x02 WasiP2 WASI Preview 2 (component model)
0x03 Browser Browser-optimized (expects Web APIs)
0x04 BareTile Bare-metal Cognitum tile (hub-tile protocol only)
```
### Required Features Bitfield
```
Bit Mask Feature
--- ---- -------
0 0x0001 SIMD (v128 operations)
1 0x0002 Bulk memory operations
2 0x0004 Multi-value returns
3 0x0008 Reference types
4 0x0010 Threads (shared memory)
5 0x0020 Tail call optimization
6 0x0040 GC (garbage collection)
7 0x0080 Exception handling
```
### Interpreter Type (when role=Interpreter)
```
Value Name Description
----- ---- -----------
0x00 StackMachine Generic stack-based interpreter
0x01 Wasm3Compatible wasm3-style (register machine)
0x02 WamrCompatible WAMR-style (AOT + interpreter)
0x03 WasmiCompatible wasmi-style (pure stack machine)
```
## 4. Bootstrap Resolution Protocol
### Discovery
1. Scan all segments for `seg_type == 0x10` (WASM_SEG)
2. Parse the 64-byte WasmHeader from each
3. Validate `wasm_magic == 0x5256574D`
4. Sort by `bootstrap_priority` ascending
### Resolution
```
IF any WASM_SEG has role=Combined:
→ SelfContained bootstrap (single module does everything)
ELIF WASM_SEG with role=Interpreter AND role=Microkernel both exist:
→ TwoStage bootstrap (interpreter runs microkernel)
ELIF only WASM_SEG with role=Microkernel exists:
→ HostRequired (needs external WASM runtime)
ELSE:
→ No WASM bootstrap available
```
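The resolution rules reduce to a few membership tests over the discovered roles. A minimal sketch, with an enum of our naming (the real `rvf-wasm` crate exposes `BootstrapChain` with different shape):

```rust
// Hypothetical sketch of bootstrap resolution over the role bytes of
// all discovered WASM_SEGs. Role values follow the WasmRole enum:
// 0x00 = Microkernel, 0x01 = Interpreter, 0x02 = Combined.
#[derive(Debug, PartialEq)]
enum Bootstrap {
    SelfContained, // single Combined module does everything
    TwoStage,      // interpreter runs microkernel
    HostRequired,  // microkernel only: needs external WASM runtime
    None,          // no WASM bootstrap available
}

fn resolve(roles: &[u8]) -> Bootstrap {
    let has = |r: u8| roles.iter().any(|&x| x == r);
    if has(0x02) {
        Bootstrap::SelfContained
    } else if has(0x01) && has(0x00) {
        Bootstrap::TwoStage
    } else if has(0x00) {
        Bootstrap::HostRequired
    } else {
        Bootstrap::None
    }
}
```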
### Execution Sequence (Two-Stage)
```
Host Interpreter Microkernel Data
| | | |
|-- read WASM_SEG[0] --->| | |
| (interpreter bytes) | | |
| | | |
|-- instantiate -------->| | |
| (load into memory) | | |
| | | |
|-- feed WASM_SEG[1] --->|-- instantiate -------->| |
| (microkernel bytes) | (via interpreter) | |
| | | |
|-- LOAD_QUERY --------->|------- forward ------->| |
| | |-- read VEC_SEG -->|
| | |<- vector block ---|
| | | |
| | | rvf_distances() |
| | | rvf_topk_merge() |
| | | |
|<-- TOPK_RESULT --------|<------ return ---------| |
```
## 5. Size Budget
### Microkernel (role=Microkernel)
Already specified in `microkernel/wasm-runtime.md`:
```
Total: ~5,500 bytes (< 8 KB code budget)
Exports: 14 (query path + quantization + HNSW + verification)
Memory: 8 KB data + 64 KB SIMD scratch
```
### Interpreter (role=Interpreter)
Target: minimal WASM bytecode interpreter sufficient to run the microkernel.
```
Component Estimated Size
--------- --------------
WASM binary parser 4 KB
(magic, section parsing)
Type section decoder 1 KB
(function types)
Import/Export resolution 2 KB
Code section interpreter 12 KB
(control flow, locals)
Stack machine engine 8 KB
(operand stack, call stack)
Memory management 3 KB
(linear memory, grow)
i32/i64 integer ops 4 KB
(add, sub, mul, div, rem, shifts)
f32/f64 float ops 6 KB
(add, sub, mul, div, sqrt, conversions)
v128 SIMD ops (optional) 8 KB
(only if WASM_FEAT_SIMD required)
Table + call_indirect 2 KB
----------
Total (no SIMD): ~42 KB
Total (with SIMD): ~50 KB
```
### Combined (role=Combined)
Interpreter linked with microkernel in a single module:
```
Total: ~48-56 KB (interpreter + microkernel, with overlap eliminated)
```
### Self-Bootstrapping Overhead
For a 10M vector file (~7.3 GB at 384-dim fp16):
- Bootstrap overhead: ~56 KB / ~7.3 GB = **0.0008%**
- The file is 99.9992% data, 0.0008% self-sufficient runtime
For a 1000-vector file (~750 KB):
- Bootstrap overhead: ~56 KB / ~750 KB = **7.5%**
- Still practical for edge/IoT deployments
## 6. Execution Tiers (Extended)
The original three-tier model from ADR-030 is extended:
| Tier | Segment | Size | Boot | Self-Bootstrap? |
|------|---------|------|------|-----------------|
| 0: Embedded WASM Interpreter | WASM_SEG (role=Interpreter) | ~50 KB | <5 ms | **Yes** — file carries its own runtime |
| 1: WASM Microkernel | WASM_SEG (role=Microkernel) | 5.5 KB | <1 ms | No — needs host or Tier 0 |
| 2: eBPF | EBPF_SEG | 10-50 KB | <20 ms | No — needs Linux kernel |
| 3: Unikernel | KERNEL_SEG | 200 KB-2 MB | <125 ms | No — needs VMM (Firecracker) |
**Key insight**: Tier 0 makes all other tiers optional. An RVF file with
Tier 0 embedded runs on *any* host that can execute bytes — bare metal,
browser, microcontroller, FPGA with a soft CPU, or even another WASM runtime.
## 7. "Runs Anywhere Compute Exists"
### What This Means
A self-bootstrapping RVF file requires exactly **one capability** from its host:
> The ability to read bytes from storage and execute them as instructions.
That's it. No operating system. No file system. No network stack. No runtime
library. No package manager. No container engine.
### Where It Runs
| Host | How It Works |
|------|-------------|
| **x86 server** | Native WASM runtime (Wasmtime/WAMR) runs microkernel directly |
| **ARM edge device** | Same — native WASM runtime |
| **Browser tab** | `WebAssembly.instantiate()` on the microkernel bytes |
| **Microcontroller** | Embedded interpreter runs microkernel in 64 KB scratch |
| **FPGA soft CPU** | Interpreter mapped to BRAM, microkernel in flash |
| **Another WASM runtime** | Interpreter-in-WASM runs microkernel-in-WASM (turtles) |
| **Bare metal** | Bootloader extracts interpreter, interpreter runs microkernel |
| **TEE enclave** | Enclave loads interpreter, verified via WITNESS_SEG attestation |
### The Bootstrapping Invariant
For any host `H` with execution capability `E`:
```
∀ H, E: can_execute(H, E) ∧ can_read_bytes(H)
→ can_process_rvf(H, self_bootstrapping_rvf_file)
```
The file is a **fixed point** of the execution relation: it contains everything
needed to process itself.
## 8. Security Considerations
### Interpreter Verification
The embedded interpreter's bytecode is hashed with SHAKE-256-256 and stored
in the WasmHeader (`bytecode_hash`). A WITNESS_SEG can chain the interpreter
hash to a trusted build, providing:
- **Provenance**: Who built this interpreter?
- **Integrity**: Has the interpreter been modified?
- **Attestation**: Can a TEE verify the interpreter before execution?
### Sandbox Guarantees
The WASM sandbox model applies at every layer:
- The interpreter cannot access host memory beyond its linear memory
- The microkernel cannot access interpreter memory
- Each layer communicates only through defined exports/imports
- A trapped module cannot corrupt other modules
### Bootstrap Attack Surface
| Attack | Mitigation |
|--------|-----------|
| Malicious interpreter | Verify `bytecode_hash` against known-good hash in WITNESS_SEG |
| Modified microkernel | Interpreter verifies microkernel hash before instantiation |
| Data corruption | Segment-level CRC32C/SHAKE-256 hashes (Law 2) |
| Code injection | WASM validates all code at load time (type checking) |
| Resource exhaustion | `max_memory_pages` cap, epoch-based interruption |
## 9. API
### Rust (rvf-runtime)
```rust
// Embed a WASM module (Rust has no named arguments; each positional
// argument is annotated with the WasmHeader field it populates)
store.embed_wasm(
    WasmRole::Microkernel as u8, // role
    WasmTarget::Wasm32 as u8,    // target
    WASM_FEAT_SIMD,              // required_features
    &microkernel_bytes,          // wasm_bytecode
    14,                          // export_count
    1,                           // bootstrap_priority
    0,                           // interpreter_type
)?;
// Make self-bootstrapping
store.embed_wasm(
    WasmRole::Interpreter as u8, // role
    WasmTarget::Wasm32 as u8,    // target
    0,                           // required_features (none)
    &interpreter_bytes,          // wasm_bytecode
    3,                           // export_count
    0,                           // bootstrap_priority (tried first)
    0x03,                        // interpreter_type: wasmi-compatible
)?;
// Check if file is self-bootstrapping
assert!(store.is_self_bootstrapping());
// Extract all WASM modules (ordered by priority)
let modules = store.extract_wasm_all()?;
```
### WASM (rvf-wasm bootstrap module)
```rust
use rvf_wasm::bootstrap::{resolve_bootstrap_chain, get_bytecode, BootstrapChain};
let chain = resolve_bootstrap_chain(&rvf_bytes);
match chain {
BootstrapChain::SelfContained { combined } => {
let bytecode = get_bytecode(&rvf_bytes, &combined).unwrap();
// Instantiate and run
}
BootstrapChain::TwoStage { interpreter, microkernel } => {
let interp_code = get_bytecode(&rvf_bytes, &interpreter).unwrap();
let kernel_code = get_bytecode(&rvf_bytes, &microkernel).unwrap();
// Load interpreter, then use it to run microkernel
}
_ => { /* use host runtime */ }
}
```
## 10. Relationship to Existing Segments
| Segment | Relationship to WASM_SEG |
|---------|-------------------------|
| KERNEL_SEG (0x0E) | Alternative execution tier — KERNEL_SEG boots a full unikernel, WASM_SEG runs a lightweight microkernel. Both make the file self-executing but at different capability levels. |
| EBPF_SEG (0x0F) | Complementary — eBPF accelerates hot-path queries on Linux hosts while WASM provides universal portability. |
| WITNESS_SEG (0x0A) | Verification — WITNESS_SEG chains can attest the interpreter and microkernel hashes, providing a trust anchor for the bootstrap chain. |
| CRYPTO_SEG (0x0C) | Signing — CRYPTO_SEG key material can sign WASM_SEG contents for tamper detection. |
| MANIFEST_SEG (0x05) | Discovery — the tail manifest references all WASM_SEGs with their roles and priorities. |
## 11. Implementation Status
| Component | Crate | Status |
|-----------|-------|--------|
| `SegmentType::Wasm` (0x10) | `rvf-types` | Implemented |
| `WasmHeader` (64-byte header) | `rvf-types` | Implemented |
| `WasmRole`, `WasmTarget` enums | `rvf-types` | Implemented |
| `write_wasm_seg` | `rvf-runtime` | Implemented |
| `embed_wasm` / `extract_wasm` | `rvf-runtime` | Implemented |
| `extract_wasm_all` (priority-sorted) | `rvf-runtime` | Implemented |
| `is_self_bootstrapping` | `rvf-runtime` | Implemented |
| `resolve_bootstrap_chain` | `rvf-wasm` | Implemented |
| `get_bytecode` (zero-copy extraction) | `rvf-wasm` | Implemented |
| Embedded interpreter (wasmi-based) | `rvf-wasm` | Future |
| Combined interpreter+microkernel build | `rvf-wasm` | Future |