Merge commit 'd803bfe2b1fe7f5e219e50ac20d6801a0a58ac75' as 'vendor/ruvector'
ruv, 2026-02-28 14:39:40 -05:00
7854 changed files with 3522914 additions and 0 deletions

---

# RVF: RuVector Format Specification
## The Universal Substrate for Living Intelligence
**Version**: 0.1.0-draft
**Status**: Research
**Date**: 2026-02-13
---
## What RVF Is
RVF is not a file format. It is a **runtime substrate** — a living, self-reorganizing
binary medium that stores, streams, indexes, and adapts vector intelligence across
any domain, any scale, and any hardware tier.
Where traditional formats are snapshots of data, RVF is a **continuously evolving
organism**. It ingests without rewriting. It answers queries before it finishes loading.
It reorganizes its own layout to match access patterns. It survives crashes without
journals. It fits on a 64 KB WASM tile or scales to a petabyte hub.
## The Four Laws of RVF
Every design decision in RVF derives from four inviolable laws:
### Law 1: Truth Lives at the Tail
The most recent `MANIFEST_SEG` at the tail of the file is the sole source of truth.
No front-loaded metadata. No section directory that must be rewritten on mutation.
Readers scan backward from EOF to find the latest manifest and know exactly what
to map.
**Consequence**: Append-only writes. Streaming ingest. No global rewrite ever.
### Law 2: Every Segment Is Independently Valid
Each segment carries its own magic number, length, content hash, and type tag.
A reader encountering any segment in isolation can verify it, identify it, and
decide whether to process it. No segment depends on prior segments for structural
validity.
**Consequence**: Crash safety for free. Parallel verification. Segment-level
integrity without a global checksum.
### Law 3: Data and State Are Separated
Vector payloads, index structures, overlay graphs, quantization dictionaries, and
runtime metadata live in distinct segment types. The manifest binds them together
but they never intermingle. This means you can replace the index without touching
vectors, update the overlay without rebuilding adjacency, or swap quantization
without re-encoding.
**Consequence**: Incremental updates. Modular evolution. Zero-copy segment reuse.
### Law 4: The Format Adapts to Its Workload
RVF monitors access patterns through lightweight sketches and periodically
reorganizes: promoting hot vectors to faster tiers, compacting stale overlays,
lazily building deeper index layers. The format is not static — it converges
toward the optimal layout for its actual workload.
**Consequence**: Self-tuning performance. No manual optimization. The file gets
faster the more you use it.
## Design Coordinates
| Property | RVF Answer |
|----------|-----------|
| Write model | Append-only segments + background compaction |
| Read model | Tail-manifest scan, then progressive mmap |
| Index model | Layered availability (entry points -> partial -> full) |
| Compression | Temperature-tiered (fp16 hot, 5-7 bit warm, 3 bit cold) |
| Alignment | 64-byte for SIMD (AVX-512, NEON, WASM v128) |
| Crash safety | Segment-level hashes, no WAL required |
| Crypto | Post-quantum (ML-DSA-65 signatures, SHAKE-256 hashes) |
| Streaming | Yes — first query before full load |
| Hardware | 8 KB tile to petabyte hub |
| Domain | Universal — genomics, text, graph, vision as profiles |
## Acceptance Test
> Cold start on a 10 million vector file: load and answer the first query with a
> useful (recall >= 0.7) result without reading more than the last 4 MB, then
> converge to full quality (recall >= 0.95) as it progressively maps more segments.
## Document Map
| Document | Path | Content |
|----------|------|---------|
| This overview | `spec/00-overview.md` | Philosophy, laws, design coordinates |
| Segment model | `spec/01-segment-model.md` | Segment types, headers, append-only rules |
| Manifest system | `spec/02-manifest-system.md` | Two-level manifests, hotset pointers |
| Temperature tiering | `spec/03-temperature-tiering.md` | Adaptive layout, access sketches, promotion |
| Progressive indexing | `spec/04-progressive-indexing.md` | Layered HNSW, partial availability |
| Overlay epochs | `spec/05-overlay-epochs.md` | Streaming min-cut, epoch boundaries |
| Wire format | `wire/binary-layout.md` | Byte-level binary format reference |
| WASM microkernel | `microkernel/wasm-runtime.md` | Cognitum tile mapping, WASM exports |
| Domain profiles | `profiles/domain-profiles.md` | RVDNA, RVText, RVGraph, RVVision |
| Crypto spec | `crypto/quantum-signatures.md` | Post-quantum primitives, segment signing |
| Benchmarks | `benchmarks/acceptance-tests.md` | Performance targets, test methodology |
## Relationship to RVDNA
RVDNA (RuVector DNA) was the first domain-specific format for genomic vector
intelligence. In the RVF model, RVDNA becomes a **profile** — a set of conventions
for how genomic data maps onto the universal RVF substrate:
```
RVF (universal substrate)
|
+-- RVF Core Profile  (minimal, fits on 64KB tile)
+-- RVF Hot Profile   (chip-optimized, SIMD-heavy)
+-- RVF Full Profile  (hub-scale, all features)
|
+-- Domain Profiles
    +-- RVDNA    (genomics: codons, motifs, k-mers)
    +-- RVText   (language: embeddings, token graphs)
    +-- RVGraph  (networks: adjacency, partitions)
    +-- RVVision (imagery: feature maps, patch vectors)
```
The substrate carries the laws. The profiles carry the semantics.
## Design Answers
**Q: Random writes or append-only plus compaction?**
A: Append-only plus compaction. This gives speed and crash safety almost for free.
Random writes add complexity for marginal benefit in the vector workload.
**Q: Primary target mmap on desktop CPUs or also microcontroller tiles?**
A: Both. RVF defines three hardware profiles. The Core profile fits in 8 KB code +
8 KB data + 64 KB SIMD scratch. The Full profile assumes mmap on desktop-class
memory. The wire format is identical — only the runtime behavior changes.
**Q: Which property matters most?**
A: All four are non-negotiable, but the priority order for conflict resolution is:
1. **Streamable** (never block on write)
2. **Progressive** (answer before fully loaded)
3. **Adaptive** (self-optimize over time)
4. **p95 speed** (predictable tail latency)

---
# RVF Segment Model
## 1. Append-Only Segment Architecture
An RVF file is a linear sequence of **segments**. Each segment is a self-contained,
independently verifiable unit. New data is always appended — never inserted into or
overwritten within existing segments.
```
+------------+------------+------------+     +------------+
| Segment 0  | Segment 1  | Segment 2  | ... | Segment N  | <-- EOF
+------------+------------+------------+     +------------+
                                                   ^
                                          Latest MANIFEST_SEG
                                           (source of truth)
```
### Why Append-Only
| Property | Benefit |
|----------|---------|
| Write amplification | Zero — each byte written once until compaction |
| Crash safety | Partial segment at tail is detectable and discardable |
| Concurrent reads | Readers see a consistent snapshot at any manifest boundary |
| Streaming ingest | Writer never blocks on reorganization |
| mmap friendliness | Pages only grow — no invalidation of mapped regions |
## 2. Segment Header
Every segment begins with a fixed 64-byte header. The header is 64-byte aligned
to match SIMD register width.
```
Offset Size Field Description
------ ---- ----- -----------
0x00 4 magic 0x52564653 ("RVFS" in ASCII)
0x04 1 version Segment format version (currently 1)
0x05 1 seg_type Segment type enum (see below)
0x06 2 flags Bitfield: compressed, encrypted, signed, sealed, etc.
0x08 8 segment_id Monotonically increasing segment ordinal
0x10 8 payload_length Byte length of payload (after header, before footer)
0x18 8 timestamp_ns Nanosecond UNIX timestamp of segment creation
0x20 1 checksum_algo Hash algorithm enum: 0=CRC32C, 1=XXH3-128, 2=SHAKE-256
0x21 1 compression Compression enum: 0=none, 1=LZ4, 2=ZSTD, 3=custom
0x22 2 reserved_0 Must be zero
0x24 4 reserved_1 Must be zero
0x28 16 content_hash First 128 bits of payload hash (algorithm per checksum_algo)
0x38 4 uncompressed_len Original payload size (0 if no compression)
0x3C 4 alignment_pad Padding to reach 64-byte boundary
```
**Total header**: 64 bytes (one cache line, one AVX-512 register width).
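As a sketch of how the 64-byte header round-trips, the table above maps directly onto a fixed `struct` layout. Little-endian byte order and the helper names are assumptions here, not part of the wire spec:

```python
import struct

# Field order mirrors the header table: magic, version, seg_type, flags,
# segment_id, payload_length, timestamp_ns, checksum_algo, compression,
# reserved_0, reserved_1, content_hash (16 bytes), uncompressed_len, pad.
# Little-endian is an assumption; the spec table does not fix byte order.
HEADER_FMT = "<IBBHQQQBBHI16sII"
HEADER_SIZE = struct.calcsize(HEADER_FMT)  # 64 bytes: one cache line
RVFS_MAGIC = 0x52564653                    # "RVFS"

def pack_header(seg_type, segment_id, payload, *, flags=0, timestamp_ns=0,
                checksum_algo=1, compression=0, content_hash=b"\x00" * 16):
    """Build a version-1 segment header for the given payload (hypothetical helper)."""
    return struct.pack(HEADER_FMT, RVFS_MAGIC, 1, seg_type, flags, segment_id,
                       len(payload), timestamp_ns, checksum_algo, compression,
                       0, 0, content_hash, 0, 0)

def parse_header(buf):
    """Decode the fields a reader needs to skip or verify a segment."""
    fields = struct.unpack_from(HEADER_FMT, buf)
    if fields[0] != RVFS_MAGIC:
        raise ValueError("not an RVF segment header")
    return {"version": fields[1], "seg_type": fields[2], "flags": fields[3],
            "segment_id": fields[4], "payload_length": fields[5]}
```

The format string totals exactly 64 bytes, so `struct.calcsize` doubles as a layout check.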
### Magic Validation
Readers scanning backward from EOF look for `0x52564653` at 64-byte aligned
boundaries. This enables fast tail-scan even on corrupted files.
### Flags Bitfield
```
Bit 0: COMPRESSED Payload is compressed per compression field
Bit 1: ENCRYPTED Payload is encrypted (key info in manifest)
Bit 2: SIGNED A signature footer follows the payload
Bit 3: SEALED Segment is immutable (compaction output)
Bit 4: PARTIAL Segment is a partial write (streaming ingest)
Bit 5: TOMBSTONE Segment logically deletes a prior segment
Bit 6: HOT Segment contains temperature-promoted data
Bit 7: OVERLAY Segment contains overlay/delta data
Bit 8: SNAPSHOT Segment contains full snapshot (not delta)
Bit 9: CHECKPOINT Segment is a safe rollback point
Bits 10-15: reserved
```
## 3. Segment Types
```
Value Name Purpose
----- ---- -------
0x01 VEC_SEG Raw vector payloads (the actual embeddings)
0x02 INDEX_SEG HNSW adjacency lists, entry points, routing tables
0x03 OVERLAY_SEG Graph overlay deltas, partition updates, min-cut witnesses
0x04 JOURNAL_SEG Metadata mutations (label changes, deletions, moves)
0x05 MANIFEST_SEG Segment directory, hotset pointers, epoch state
0x06 QUANT_SEG Quantization dictionaries and codebooks
0x07 META_SEG Arbitrary key-value metadata (tags, provenance, lineage)
0x08 HOT_SEG Temperature-promoted hot data (vectors + neighbors)
0x09 SKETCH_SEG Access counter sketches for temperature decisions
0x0A WITNESS_SEG Capability manifests, proof of computation, audit trails
0x0B PROFILE_SEG Domain profile declarations (RVDNA, RVText, etc.)
0x0C CRYPTO_SEG Key material, signature chains, certificate anchors
0x0D METAIDX_SEG Metadata inverted indexes for filtered search
```
### Reserved Range
Types `0x00` and `0xF0`-`0xFF` are reserved. `0x00` indicates an uninitialized
or zeroed region (not a valid segment). `0xF0`-`0xFF` are reserved for
implementation-specific extensions.
## 4. Segment Footer
If the `SIGNED` flag is set, the payload is followed by a signature footer:
```
Offset Size Field Description
------ ---- ----- -----------
0x00 2 sig_algo Signature algorithm: 0=Ed25519, 1=ML-DSA-65, 2=SLH-DSA-128s
0x02 2 sig_length Byte length of signature
0x04 var signature The signature bytes
var 4 footer_length Total footer size (for backward scanning)
```
Unsigned segments have no footer — the next segment header follows immediately
after the payload (at the next 64-byte aligned boundary).
## 5. Segment Lifecycle
### Write Path
```
1. Allocate segment ID (monotonic counter)
2. Compute payload hash
3. Write header + payload + optional footer
4. fsync (or fdatasync for non-manifest segments)
5. Write MANIFEST_SEG referencing the new segment
6. fsync the manifest
```
The two-fsync protocol ensures that:
- If crash occurs before step 6, the orphan segment is harmless (no manifest points to it)
- If crash occurs during step 6, the partial manifest is detectable (bad hash)
- After step 6, the segment is durably committed
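The two-fsync protocol can be sketched in a few lines. This is a minimal illustration over a single plain file, assuming the segment and manifest bytes are already packed; the function name is hypothetical:

```python
import os

def append_segment(path, seg_bytes, manifest_bytes):
    """Two-fsync append: the segment only becomes visible once a manifest
    referencing it is also durable (sketch, not the real writer)."""
    with open(path, "ab") as f:
        f.write(seg_bytes)       # steps 1-3: header + payload + optional footer
        f.flush()
        os.fsync(f.fileno())     # step 4: segment bytes are on stable storage
        f.write(manifest_bytes)  # step 5: MANIFEST_SEG referencing the segment
        f.flush()
        os.fsync(f.fileno())     # step 6: commit point
```

A crash between the two `fsync` calls leaves an orphan segment that no manifest points to, which is exactly the harmless case described above.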
### Read Path
```
1. Seek to EOF
2. Scan backward for latest MANIFEST_SEG (look for magic at aligned boundaries)
3. Parse manifest -> get segment directory
4. Map segments on demand (progressive loading)
```
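Step 2's backward scan is simple because headers sit on 64-byte boundaries. A sketch over an in-memory file image (on-disk byte order of the magic is an assumption):

```python
SEG_MAGIC = (0x52564653).to_bytes(4, "little")  # byte order is an assumption
MANIFEST_SEG = 0x05

def find_latest_manifest(data):
    """Return the offset of the most recent MANIFEST_SEG header, or None.
    Headers are 64-byte aligned; seg_type sits at header offset 0x05."""
    if len(data) < 64:
        return None
    pos = (len(data) - 64) // 64 * 64  # last aligned boundary
    while pos >= 0:
        if data[pos:pos + 4] == SEG_MAGIC and data[pos + 5] == MANIFEST_SEG:
            return pos
        pos -= 64
    return None
```

A real reader would additionally verify the header's content hash before trusting the hit, per Law 2.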
### Compaction
Compaction merges multiple segments into fewer, larger, sealed segments:
```
Before: [VEC_SEG_1] [VEC_SEG_2] [VEC_SEG_3] [MANIFEST_3]
After:  [VEC_SEG_1] [VEC_SEG_2] [VEC_SEG_3] [MANIFEST_3] [VEC_SEG_sealed] [MANIFEST_4]
                                                         ^^^^^^^^^^^^^^^^
                                                         New sealed segment
                                                         merging 1+2+3
```
Old segments are marked with TOMBSTONE entries in the new manifest. Space is
reclaimed when the file is eventually rewritten (or old segments are in a
separate file in multi-file mode).
### Multi-File Mode
For very large datasets, RVF can span multiple files:
```
data.rvf Main file with manifests and hot data
data.rvf.cold.0 Cold segment shard 0
data.rvf.cold.1 Cold segment shard 1
data.rvf.idx.0 Index segment shard 0
```
The manifest in the main file contains shard references with file paths and
byte ranges. This enables cold data to live on slower storage while hot data
stays on fast storage.
## 6. Segment Addressing
Segments are addressed by their `segment_id` (monotonically increasing 64-bit
integer). The manifest maps segment IDs to file offsets (and optionally shard
file paths in multi-file mode).
Within a segment, data is addressed by **block offset** — a 32-bit offset from
the start of the segment payload. This limits individual segments to 4 GB, which
is intentional: it keeps segments manageable for compaction and progressive loading.
### Block Structure Within VEC_SEG
```
+-------------------+
| Block Header (16B)|
| block_id: u32 |
| count: u32 |
| dim: u16 |
| dtype: u8 |
| pad: [u8; 5] |
+-------------------+
| Vectors |
| (count * dim * |
| sizeof(dtype)) |
| [64B aligned] |
+-------------------+
| ID Map |
| (varint delta |
| encoded IDs) |
+-------------------+
| Block Footer |
| crc32c: u32 |
+-------------------+
```
Vectors within a block are stored **columnar** — all dimension 0 values, then all
dimension 1 values, etc. This maximizes compression ratio. But the HOT_SEG stores
vectors **interleaved** (row-major) for cache-friendly sequential scan during
top-K refinement.
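The columnar/interleaved distinction is just a transpose of the block's layout. A small sketch (helper names are illustrative, not spec API):

```python
def to_columnar(vectors):
    """Row-major [[v0d0, v0d1, ...], ...] -> dimension-major flat layout,
    as stored in a VEC_SEG block for better compression."""
    count, dim = len(vectors), len(vectors[0])
    return [vectors[i][d] for d in range(dim) for i in range(count)]

def to_interleaved(columnar, count, dim):
    """Inverse transform: rebuild row-major vectors, the HOT_SEG layout
    used for cache-friendly sequential scans."""
    return [[columnar[d * count + i] for d in range(dim)] for i in range(count)]
```

Promotion to HOT_SEG (next document) is essentially this inverse transform applied after dequantization.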
## 7. Invariants
1. Segment IDs are strictly monotonically increasing within a file
2. A valid RVF file contains at least one MANIFEST_SEG
3. The last MANIFEST_SEG is always the source of truth
4. Segment headers are always 64-byte aligned
5. No segment payload exceeds 4 GB
6. Content hashes are computed over the raw (uncompressed, unencrypted) payload
7. Sealed segments are never modified — only tombstoned
8. A reader that cannot find a valid MANIFEST_SEG must reject the file

---
# RVF Manifest System
## 1. Two-Level Manifest Architecture
The manifest system is what makes RVF progressive. Instead of a monolithic directory
that must be fully parsed before any query, RVF uses a two-level manifest that
enables instant boot followed by incremental refinement.
```
EOF
|
v
+--------------------------------------------------+
| Level 0: Root Manifest (fixed 4096 bytes) |
| - Magic + version |
| - Pointer to Level 1 manifest segment |
| - Hotset pointers (inline) |
| - Total vector count |
| - Dimension |
| - Epoch counter |
| - Profile declaration |
+--------------------------------------------------+
|
| points to
v
+--------------------------------------------------+
| Level 1: Full Manifest (variable size) |
| - Complete segment directory |
| - Temperature tier map |
| - Index layer availability |
| - Overlay epoch chain |
| - Compaction state |
| - Shard references (multi-file) |
| - Capability manifest |
+--------------------------------------------------+
```
### Why Two Levels
A reader performing cold start only needs Level 0 (4 KB). From Level 0 alone,
it can locate the entry points, coarse routing graph, quantization dictionary,
and centroids — enough to answer approximate queries immediately.
Level 1 is loaded asynchronously to enable full-quality queries, but the system
is functional before Level 1 is fully parsed.
## 2. Level 0: Root Manifest
The root manifest is always the **last 4096 bytes** of the file (or the last
4096 bytes of the most recent MANIFEST_SEG). Its fixed size enables instant
location: `seek(EOF - 4096)`.
### Binary Layout
```
Offset Size Field Description
------ ---- ----- -----------
0x000 4 magic 0x52564D30 ("RVM0")
0x004 2 version Root manifest version
0x006 2 flags Root manifest flags
0x008 8 l1_manifest_offset Byte offset to Level 1 manifest segment
0x010 8 l1_manifest_length Byte length of Level 1 manifest segment
0x018 8 total_vector_count Total vectors across all segments
0x020 2 dimension Vector dimensionality
0x022 1 base_dtype Base data type enum
0x023 1 profile_id Domain profile (0=generic, 1=dna, 2=text, 3=graph, 4=vision)
0x024 4 epoch Current overlay epoch number
0x028 8 created_ns File creation timestamp (ns)
0x030 8 modified_ns Last modification timestamp (ns)
--- Hotset Pointers (the key to instant boot) ---
0x038 8 entrypoint_seg_offset Offset to segment containing HNSW entry points
0x040 4 entrypoint_block_offset Block offset within that segment
0x044 4 entrypoint_count Number of entry points
0x048 8 toplayer_seg_offset Offset to segment with top-layer adjacency
0x050 4 toplayer_block_offset Block offset
0x054 4 toplayer_node_count Nodes in top layer
0x058 8 centroid_seg_offset Offset to segment with cluster centroids / pivots
0x060 4 centroid_block_offset Block offset
0x064 4 centroid_count Number of centroids
0x068 8 quantdict_seg_offset Offset to quantization dictionary segment
0x070 4 quantdict_block_offset Block offset
0x074 4 quantdict_size Dictionary size in bytes
0x078 8 hot_cache_seg_offset Offset to HOT_SEG with interleaved hot vectors
0x080 4 hot_cache_block_offset Block offset
0x084 4 hot_cache_vector_count Vectors in hot cache
0x088 8 prefetch_map_offset Offset to prefetch hint table
0x090 4 prefetch_map_entries Number of prefetch entries
--- Crypto ---
0x094 2 sig_algo Manifest signature algorithm
0x096 2 sig_length Signature length
0x098 var signature Manifest signature (up to 3400 bytes for ML-DSA-65)
--- Padding to 4096 bytes ---
0xF00 252 reserved Reserved / zero-padded to 4096
0xFFC 4 root_checksum CRC32C of bytes 0x000-0xFFB
```
**Total**: Exactly 4096 bytes (one page, one disk sector on most hardware).
### Hotset Pointers
The six hotset pointers are the minimum information needed to answer a query:
1. **Entry points**: Where to start HNSW traversal
2. **Top-layer adjacency**: Coarse routing to the right neighborhood
3. **Centroids/pivots**: For IVF-style pre-filtering or partition routing
4. **Quantization dictionary**: For decoding compressed vectors
5. **Hot cache**: Pre-decoded interleaved vectors for top-K refinement
6. **Prefetch map**: Contiguous neighbor-list pages with prefetch offsets
With these six pointers, a reader can:
- Start HNSW search at the entry point
- Route through the top layer
- Quantize the query using the dictionary
- Scan the hot cache for refinement
- Prefetch neighbor pages for cache-friendly traversal
All without reading Level 1 or any cold segments.
## 3. Level 1: Full Manifest
Level 1 is a variable-size segment (type `MANIFEST_SEG`) referenced by Level 0.
It contains the complete file directory.
### Structure
Level 1 is encoded as a sequence of typed records using a tag-length-value (TLV)
scheme for forward compatibility:
```
+---+---+---+---+---+---+---+---+
| Tag (2B) | Length (4B) | Pad | <- 8-byte aligned record header
+---+---+---+---+---+---+---+---+
| Value (Length bytes) |
| [padded to 8-byte boundary] |
+---------------------------------+
```
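A forward-compatible TLV walk skips records it does not recognize. A minimal parser over the layout above (little-endian and end-of-stream handling are assumptions):

```python
import struct

def iter_tlv(buf):
    """Yield (tag, value) records from a Level 1 manifest payload.
    8-byte record header (tag u16, length u32, 2 pad bytes); values are
    padded to the next 8-byte boundary. Unknown tags can simply be skipped."""
    pos = 0
    while pos + 8 <= len(buf):
        tag, length = struct.unpack_from("<HI", buf, pos)
        if tag == 0:
            break  # zero padding / end of records (assumed convention)
        yield tag, buf[pos + 8 : pos + 8 + length]
        pos += 8 + (length + 7) // 8 * 8  # advance past padded value
```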
### Record Types
```
Tag Name Description
--- ---- -----------
0x0001 SEGMENT_DIR Array of segment directory entries
0x0002 TEMP_TIER_MAP Temperature tier assignments per block
0x0003 INDEX_LAYERS Index layer availability bitmap
0x0004 OVERLAY_CHAIN Epoch chain with rollback pointers
0x0005 COMPACTION_STATE Active/tombstoned segment sets
0x0006 SHARD_REFS Multi-file shard references
0x0007 CAPABILITY_MANIFEST What this file can do (features, limits)
0x0008 PROFILE_CONFIG Domain-specific configuration
0x0009 ACCESS_SKETCH_REF Pointer to latest SKETCH_SEG
0x000A PREFETCH_TABLE Full prefetch hint table
0x000B ID_RESTART_POINTS Restart point index for varint delta IDs
0x000C WITNESS_CHAIN Proof-of-computation witness chain
0x000D KEY_DIRECTORY Encryption key references (not keys themselves)
```
### Segment Directory Entry
```
Offset Size Field Description
------ ---- ----- -----------
0x00 8 segment_id Segment ordinal
0x08 1 seg_type Segment type enum
0x09 1 tier Temperature tier (0=hot, 1=warm, 2=cold)
0x0A 2 flags Segment flags
0x0C 4 reserved Must be zero
0x10 8 file_offset Byte offset in file (or shard)
0x18 8 payload_length Decompressed payload length
0x20 8 compressed_length Compressed length (0 if uncompressed)
0x28 2 shard_id Shard index (0 for main file)
0x2A 2 compression Compression algorithm
0x2C 4 block_count Number of blocks in segment
0x30 16 content_hash Payload hash (first 128 bits)
```
**Total**: 64 bytes per entry (cache-line aligned).
## 4. Manifest Lifecycle
### Writing a New Manifest
Every mutation to the file produces a new MANIFEST_SEG appended at the tail:
```
1. Compute new Level 1 manifest (segment directory + metadata)
2. Write Level 1 as a MANIFEST_SEG payload
3. Compute Level 0 root manifest pointing to Level 1
4. Write Level 0 as the last 4096 bytes of the MANIFEST_SEG
5. fsync
```
The MANIFEST_SEG payload structure is:
```
+-----------------------------------+
| Level 1 manifest (variable size) |
+-----------------------------------+
| Level 0 root manifest (4096 B) | <-- Always the last 4096 bytes
+-----------------------------------+
```
### Reading the Manifest
```
1. seek(EOF - 4096)
2. Read 4096 bytes -> Level 0 root manifest
3. Validate magic (0x52564D30) and checksum
4. If valid: extract hotset pointers -> system is queryable
5. Async: read Level 1 at l1_manifest_offset -> full directory
6. If Level 0 is invalid: scan backward for previous MANIFEST_SEG
```
Step 6 provides crash recovery. If the latest manifest write was interrupted,
the previous manifest is still valid. Readers scan backward at 64-byte aligned
boundaries looking for the RVFS magic + MANIFEST_SEG type.
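Steps 1-4 amount to one fixed-size read and two checks. A sketch over an in-memory file image; `zlib.crc32` (plain CRC32) stands in for CRC32C here, and little-endian order is an assumption:

```python
import struct, zlib

ROOT_MAGIC = 0x52564D30  # "RVM0"
ROOT_SIZE = 4096

def read_root_manifest(data):
    """Validate the last 4096 bytes as a Level 0 root manifest and return
    the Level 1 pointer, or None (caller then scans backward, step 6)."""
    if len(data) < ROOT_SIZE:
        return None
    root = data[-ROOT_SIZE:]
    magic, version = struct.unpack_from("<IH", root, 0x000)
    stored_crc = struct.unpack_from("<I", root, ROOT_SIZE - 4)[0]
    # zlib.crc32 stands in for CRC32C in this sketch
    if magic != ROOT_MAGIC or stored_crc != zlib.crc32(root[:ROOT_SIZE - 4]):
        return None
    l1_off, l1_len = struct.unpack_from("<QQ", root, 0x008)
    return {"version": version, "l1_offset": l1_off, "l1_length": l1_len}
```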
### Manifest Chain
Each manifest implicitly forms a chain through the segment ID ordering. For
explicit rollback support, Level 1 contains the `OVERLAY_CHAIN` record which
stores:
```
epoch: u32 Current epoch
prev_manifest_offset: u64 Offset of previous MANIFEST_SEG
prev_manifest_id: u64 Segment ID of previous MANIFEST_SEG
checkpoint_hash: [u8; 16] Hash of the complete state at this epoch
```
This enables point-in-time recovery and bisection debugging.
## 5. Hotset Pointer Semantics
### Entry Point Stability
Entry points are the HNSW nodes at the highest layer. They change rarely (only
when the index is rebuilt or a new highest-layer node is inserted). The root
manifest caches them directly so they survive across manifest generations without
re-reading the index.
### Centroid Refresh
Centroids may drift as data is added. The manifest tracks a `centroid_epoch` — if
the current epoch exceeds centroid_epoch + threshold, the runtime should schedule
centroid recomputation. But the stale centroids remain usable (recall degrades
gracefully, it does not fail).
### Hot Cache Coherence
The hot cache in HOT_SEG is a **read-optimized snapshot** of the most-accessed
vectors. It may be stale relative to the latest VEC_SEGs. The manifest tracks
a `hot_cache_epoch` for staleness detection. Queries use the hot cache for fast
initial results, then refine against authoritative VEC_SEGs if needed.
## 6. Progressive Boot Sequence
```
Time Action System State
---- ------ ------------
t=0 Read last 4 KB (Level 0) Booting
t+1ms Parse hotset pointers Queryable (approximate)
t+2ms mmap entry points + top layer Better routing
t+5ms mmap hot cache + quant dict Fast top-K refinement
t+10ms Start loading Level 1 Discovering full directory
t+50ms Level 1 parsed Full segment awareness
t+100ms mmap warm VEC_SEGs Recall improving
t+500ms mmap cold VEC_SEGs Full recall
t+1s Background index layer build Converging to optimal
```
For a 10M vector file at 384 dimensions (~7.7 GB in uniform float16, shrinking toward 3-4 GB once temperature tiering compresses warm and cold blocks):
- Level 0 read: 4 KB in <1 ms
- Hotset data: ~2-4 MB (entry points + top layer + centroids + hot cache)
- First query: within 5-10 ms of open
- Full convergence: 1-5 seconds depending on storage speed

---
# RVF Temperature Tiering
## 1. Adaptive Layout as a First-Class Concept
Traditional vector formats place data once and leave it. RVF treats data placement
as a **continuous optimization problem**. Every vector block has a temperature, and
the format periodically reorganizes to keep hot data fast and cold data small.
```
Access Frequency
^
|
Tier 0 (HOT) | ████████ fp16 / 8-bit, interleaved
| ████████ < 1μs random access
|
Tier 1 (WARM) | ░░░░░░░░░░░░░░░░ 5-7 bit quantized
| ░░░░░░░░░░░░░░░░ columnar, compressed
|
Tier 2 (COLD) | ▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒ 3-bit or 1-bit
| ▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒ heavy compression
|
+------------------------------------> Vector ID
```
### Tier Definitions
| Tier | Name | Quantization | Layout | Compression | Access Latency |
|------|------|-------------|--------|-------------|----------------|
| 0 | Hot | fp16 or int8 | Interleaved (row-major) | None or LZ4 | < 1 μs |
| 1 | Warm | 5-7 bit SQ/PQ | Columnar | LZ4 or ZSTD | 1-10 μs |
| 2 | Cold | 3-bit or binary | Columnar | ZSTD level 9+ | 10-100 μs |
### Memory Ratios
For 384-dimensional vectors (typical embedding size):
| Tier | Bytes/Vector | Ratio vs fp32 | 10M Vectors |
|------|-------------|---------------|-------------|
| fp32 (raw) | 1536 B | 1.0x | 14.3 GB |
| Tier 0 (fp16) | 768 B | 2.0x | 7.2 GB |
| Tier 0 (int8) | 384 B | 4.0x | 3.6 GB |
| Tier 1 (6-bit) | 288 B | 5.3x | 2.7 GB |
| Tier 1 (5-bit) | 240 B | 6.4x | 2.2 GB |
| Tier 2 (3-bit) | 144 B | 10.7x | 1.3 GB |
| Tier 2 (1-bit) | 48 B | 32.0x | 0.45 GB |
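The per-vector sizes in the table follow directly from `dim * bits / 8`. A quick arithmetic check of the 384-dimension rows:

```python
def bytes_per_vector(dim, bits):
    """Storage per vector at a given bit width (ignores per-block headers)."""
    return dim * bits // 8

# Reproduce the table rows for 384-dimensional vectors
assert bytes_per_vector(384, 32) == 1536  # fp32 baseline
assert bytes_per_vector(384, 16) == 768   # Tier 0, fp16
assert bytes_per_vector(384, 8) == 384    # Tier 0, int8
assert bytes_per_vector(384, 6) == 288    # Tier 1, 6-bit
assert bytes_per_vector(384, 3) == 144    # Tier 2, 3-bit
assert bytes_per_vector(384, 1) == 48     # Tier 2, binary (32x vs fp32)
```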
## 2. Access Counter Sketch
Temperature decisions require knowing which blocks are accessed frequently.
RVF maintains a lightweight **Count-Min Sketch** per block set, stored in
SKETCH_SEG segments.
### Sketch Parameters
```
Width (w): 1024 counters
Depth (d): 4 hash functions
Counter size: 8-bit saturating (max 255)
Memory: 1024 * 4 * 1 = 4 KB per sketch
Granularity: One sketch per 1024-vector block
Decay: Halve all counters every 2^16 accesses (aging)
```
For 10M vectors in 1024-vector blocks:
- 9,766 blocks
- 9,766 * 4 KB = ~38 MB of sketches
- Stored in SKETCH_SEG, referenced by manifest
### Sketch Operations
**On query access**:
```
block_id = vector_id / block_size
for i in 0..depth:
idx = hash_i(block_id) % width
sketch[i][idx] = min(sketch[i][idx] + 1, 255)
```
**On temperature check**:
```
count = min over i of sketch[i][hash_i(block_id) % width]
if count > HOT_THRESHOLD: tier = 0
elif count > WARM_THRESHOLD: tier = 1
else: tier = 2
```
**Aging** (every 2^16 accesses):
```
for all counters: counter = counter >> 1
```
This ensures the sketch tracks *recent* access patterns, not cumulative history.
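The three operations above fit in a small class. This is a sketch with the stated parameters (width 1024, depth 4, 8-bit saturating counters, halving decay); the hash choice and threshold values are assumptions, not spec:

```python
import hashlib

class AccessSketch:
    """Count-Min Sketch for block access counts (hash function is an assumption)."""
    WIDTH, DEPTH = 1024, 4

    def __init__(self):
        self.rows = [[0] * self.WIDTH for _ in range(self.DEPTH)]

    def _idx(self, i, block_id):
        # One salted hash per row stands in for the d hash functions
        h = hashlib.blake2b(block_id.to_bytes(8, "little"),
                            digest_size=8, salt=bytes([i])).digest()
        return int.from_bytes(h, "little") % self.WIDTH

    def touch(self, block_id):
        """Record one access (8-bit saturating counters)."""
        for i in range(self.DEPTH):
            j = self._idx(i, block_id)
            self.rows[i][j] = min(self.rows[i][j] + 1, 255)

    def estimate(self, block_id):
        """Min over rows: never undercounts, may overcount on collisions."""
        return min(self.rows[i][self._idx(i, block_id)] for i in range(self.DEPTH))

    def decay(self):
        """Aging step: halve every counter so recent accesses dominate."""
        for row in self.rows:
            for j in range(self.WIDTH):
                row[j] >>= 1

def tier_for(count, hot_threshold=64, warm_threshold=8):
    """Temperature check per the pseudocode above (thresholds illustrative)."""
    return 0 if count > hot_threshold else (1 if count > warm_threshold else 2)
```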
### Why Count-Min Sketch
| Alternative | Memory | Accuracy | Update Cost |
|------------|--------|----------|-------------|
| Per-vector counter | 80 MB (10M * 8B) | Exact | O(1) |
| Count-Min Sketch | 38 MB | ~99.9% | O(depth) = O(4) |
| HyperLogLog | 6 MB | ~98% | O(1) but cardinality only |
| Bloom filter | 12 MB | No counting | N/A |
Count-Min Sketch is the best trade-off: near-exact accuracy with bounded memory
and constant-time updates.
## 3. Promotion and Demotion
### Promotion: Warm/Cold -> Hot
When a block's access count exceeds HOT_THRESHOLD for two consecutive sketch
epochs:
```
1. Read the block from its current VEC_SEG
2. Decode/dequantize vectors to fp16 or int8
3. Rearrange from columnar to interleaved layout
4. Write as a new HOT_SEG (or append to existing HOT_SEG)
5. Update manifest with new tier assignment
6. Optionally: add neighbor lists to HOT_SEG for locality
```
### Demotion: Hot -> Warm -> Cold
When a block's access count drops below WARM_THRESHOLD:
```
1. The block is not immediately rewritten
2. On next compaction cycle, the block is written to the appropriate tier
3. Quantization is applied during compaction (not lazily)
4. The HOT_SEG entry is tombstoned in the manifest
```
### Eviction as Compression
The key insight: **eviction from hot tier is just compression, not deletion**.
The vector data is always present — it just moves to a more compressed
representation. This means:
- No data loss on eviction
- Recall degrades gracefully (quantized vectors still contribute to search)
- The file naturally compresses over time as access patterns stabilize
## 4. Temperature-Aware Compaction
Standard compaction merges segments for space efficiency. Temperature-aware
compaction also **rearranges blocks by tier**:
```
Before compaction:
VEC_SEG_1: [hot] [cold] [warm] [hot] [cold]
VEC_SEG_2: [warm] [hot] [cold] [warm] [warm]
After temperature-aware compaction:
HOT_SEG: [hot] [hot] [hot] <- interleaved, fp16
VEC_SEG_W: [warm] [warm] [warm] [warm] <- columnar, 6-bit
VEC_SEG_C: [cold] [cold] [cold] <- columnar, 3-bit
```
This creates **physical locality by temperature**: hot blocks are contiguous
(good for sequential scan), warm blocks are contiguous (good for batch decode),
cold blocks are contiguous (good for compression ratio).
### Compaction Triggers
| Trigger | Condition | Action |
|---------|-----------|--------|
| Sketch epoch | Every N writes | Evaluate all block temperatures |
| Space amplification | Dead space > 30% | Merge + rewrite segments |
| Tier imbalance | Hot tier > 20% of data | Demote cold blocks |
| Hot miss rate | Hot cache miss > 10% | Promote missing blocks |
## 5. Quantization Strategies by Tier
### Tier 0: Hot
**Scalar quantization to int8** (preferred) or **fp16** (for maximum recall).
```
Encoding:
q = round((v - min) / (max - min) * 255)
Decoding:
v = q / 255 * (max - min) + min
Parameters stored in QUANT_SEG:
min: f32 per dimension
max: f32 per dimension
```
Distance computation directly on int8 using SIMD (vpsubb + vpmaddubsw on AVX-512).
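The encode/decode formulas translate directly to code. A per-dimension sketch (the min/max parameters would come from QUANT_SEG; function names are illustrative):

```python
def sq_encode(v, lo, hi):
    """Scalar-quantize one vector with per-dimension min/max,
    per the formulas above: codes in 0..255."""
    return [round((x - l) / (h - l) * 255) for x, l, h in zip(v, lo, hi)]

def sq_decode(q, lo, hi):
    """Approximate reconstruction from the 8-bit codes."""
    return [c / 255 * (h - l) + l for c, l, h in zip(q, lo, hi)]
```

The reconstruction error per dimension is bounded by half a quantization step, `(max - min) / 510`.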
### Tier 1: Warm
**Product Quantization (PQ)** with 5-7 bits per sub-vector.
```
Parameters:
M subspaces: 48 (for 384-dim vectors, 8 dims per subspace)
K centroids per sub: 64 (6-bit) or 128 (7-bit)
Codebook: M * K * 8 * sizeof(f32) = 48 * 64 * 8 * 4 = 96 KB
Encoding:
For each subvector: find nearest centroid -> store centroid index
Distance computation:
ADC (Asymmetric Distance Computation) with precomputed distance tables
```
### Tier 2: Cold
**Binary quantization** (1-bit) or **ternary quantization** (2-bit / 3-bit).
```
Binary encoding:
b = sign(v) -> 1 bit per dimension
384 dims -> 48 bytes per vector (32x compression)
Distance:
Hamming distance via POPCNT
XOR + POPCNT on AVX-512: 512 bits per cycle
Ternary (3-bit with magnitude):
t = {-1, 0, +1} based on threshold
magnitude = |v| quantized to 3 levels
384 dims -> 144 bytes per vector (10.7x compression)
```
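Binary quantization and Hamming distance are a few lines each. A software sketch (a real implementation would use POPCNT/VPOPCNTDQ as noted above; helper names are illustrative):

```python
def binarize(v):
    """Pack sign bits, one bit per dimension: 384 dims -> 48 bytes."""
    bits = 0
    for i, x in enumerate(v):
        if x > 0:
            bits |= 1 << i
    return bits.to_bytes((len(v) + 7) // 8, "little")

def hamming(a, b):
    """XOR then popcount; bin().count is a software stand-in for POPCNT."""
    x = int.from_bytes(a, "little") ^ int.from_bytes(b, "little")
    return bin(x).count("1")
```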
### Codebook Storage
All quantization parameters (codebooks, min/max ranges, centroids) are stored
in QUANT_SEG segments. The root manifest's `quantdict_seg_offset` hotset pointer
references the active quantization dictionary for fast boot.
Multiple QUANT_SEGs can coexist for different tiers — the manifest maps each
tier to its dictionary.
## 6. Hardware Adaptation
### Desktop (AVX-512)
- Hot tier: int8 with VNNI dot product (4 int8 multiply-adds per 32-bit lane per instruction)
- Warm tier: PQ with AVX-512 gather for table lookups
- Cold tier: Binary with VPOPCNTDQ (512-bit popcount)
### ARM (NEON)
- Hot tier: int8 with SDOT instruction
- Warm tier: PQ with TBL for table lookups
- Cold tier: Binary with CNT (population count)
### WASM (v128)
- Hot tier: int8 with `i16x8.relaxed_dot_i8x16_i7x16_s` (relaxed SIMD, if available)
- Warm tier: Scalar PQ (no gather)
- Cold tier: Binary with manual popcount
### Cognitum Tile (8KB code + 8KB data + 64KB SIMD)
- Hot tier only: int8 interleaved, fits in SIMD scratch
- No warm/cold — data stays on hub, tile fetches blocks on demand
- Sketch is maintained by hub, not tile
## 7. Self-Organization Over Time
```
t=0 All data Tier 1 (default warm)
|
t+N First sketch epoch: identify hot blocks
Promote top 5% to Tier 0
|
t+2N Second epoch: validate promotions
Demote false positives back to Tier 1
Identify true cold blocks (0 access in 2 epochs)
|
t+3N Compaction: physically separate tiers
HOT_SEG created with interleaved layout
Cold blocks compressed to 3-bit
|
t+∞ Equilibrium: ~5% hot, ~30% warm, ~65% cold
File size: ~2-3x smaller than uniform fp16
Query p95: dominated by hot tier latency
```
The format converges to an equilibrium that reflects actual usage. No manual
tuning required.

---
# RVF Progressive Indexing
## 1. Index as Layers of Availability
Traditional HNSW serialization is all-or-nothing: either the full graph is loaded,
or nothing works. RVF decomposes the index into three layers of availability, each
independently useful, each stored in separate INDEX_SEG segments.
```
Layer C: Full Adjacency
+--------------------------------------------------+
| Complete neighbor lists for every node at every |
| HNSW level. Built lazily. Optional for queries. |
| Recall: >= 0.95 |
+--------------------------------------------------+
^ loaded last (seconds to minutes)
|
Layer B: Partial Adjacency
+--------------------------------------------------+
| Neighbor lists for the most-accessed region |
| (determined by temperature sketch). Covers the |
| hot working set of the graph. |
| Recall: >= 0.85 |
+--------------------------------------------------+
^ loaded second (100ms - 1s)
|
Layer A: Entry Points + Coarse Routing
+--------------------------------------------------+
| HNSW entry points. Top-layer adjacency lists. |
| Cluster centroids for IVF pre-routing. |
| Always present. Always in Level 0 hotset. |
| Recall: >= 0.70 |
+--------------------------------------------------+
^ loaded first (< 5ms)
|
File open
```
### Why Three Layers
| Layer | Purpose | Data Size (10M vectors) | Load Time (NVMe) |
|-------|---------|------------------------|-------------------|
| A | First query possible | 1-4 MB | < 5 ms |
| B | Good quality for working set | 50-200 MB | 100-500 ms |
| C | Full recall for all queries | 1-4 GB | 2-10 s |
A system that only loads Layer A can still answer queries — just with lower recall.
As layers B and C load asynchronously, quality improves transparently.
## 2. Layer A: Entry Points and Coarse Routing
### Content
- **HNSW entry points**: The node(s) at the highest layer of the HNSW graph.
Typically 1 node, but may be multiple for redundancy.
- **Top-layer adjacency**: Full neighbor lists for all nodes at HNSW layers
>= ceil(ln(N) / ln(M)) - 2. For 10M vectors with M=16, this is layers 4-6,
containing ~100-1000 nodes.
- **Cluster centroids**: K centroids (K = sqrt(N) typically, so ~3162 for 10M)
used for IVF-style partition routing.
- **Centroid-to-partition map**: Which centroid owns which vector ID ranges.
### Storage
Layer A data is stored in a dedicated INDEX_SEG with `flags.HOT` set. The root
manifest's hotset pointers reference this segment directly. On cold start, this
is the first data mapped after the manifest.
### Binary Layout of Layer A INDEX_SEG
```
+-------------------------------------------+
| Header: INDEX_SEG, flags=HOT |
+-------------------------------------------+
| Block 0: Entry Points |
| entry_count: u32 |
| max_layer: u32 |
| [entry_node_id: u64, layer: u32] * N |
+-------------------------------------------+
| Block 1: Top-Layer Adjacency |
| layer_count: u32 |
| For each layer (top to bottom): |
| node_count: u32 |
| For each node: |
| node_id: u64 |
| neighbor_count: u16 |
| [neighbor_id: u64] * neighbor_count |
| [64B padding] |
+-------------------------------------------+
| Block 2: Centroids |
| centroid_count: u32 |
| dim: u16 |
| dtype: u8 (fp16) |
| [centroid_vector: fp16 * dim] * K |
| [64B aligned] |
+-------------------------------------------+
| Block 3: Partition Map |
| partition_count: u32 |
| For each partition: |
| centroid_id: u32 |
| vector_id_start: u64 |
| vector_id_end: u64 |
| segment_ref: u64 (segment_id) |
| block_ref: u32 (block offset) |
+-------------------------------------------+
```
### Query Using Only Layer A
```python
def query_layer_a_only(query, k, layer_a, hot_cache=None):
    # Step 1: Find nearest centroids (IVF-style routing)
    dists = [distance(query, c) for c in layer_a.centroids]
    top_partitions = top_n(dists, n_probe)
    # Step 2: HNSW search through top layers only
    current = layer_a.entry_points[0]
    for layer in range(layer_a.max_layer, layer_a.min_available_layer, -1):
        current = greedy_search(query, current, layer_a.adjacency[layer])
    # Step 3: If a hot cache is mapped, refine against it
    if hot_cache is not None:
        candidates = scan_hot_cache(query, hot_cache, current.partition)
        return top_k(candidates, k)
    # Step 4: Otherwise, return centroid-approximate results
    return approximate_from_centroids(query, top_partitions, k)
```
Expected recall: 0.65-0.75 (depends on centroid quality and hot cache coverage).
## 3. Layer B: Partial Adjacency
### Content
Neighbor lists for the **hot region** of the graph — the set of nodes that appear
most frequently in query traversals. Determined by the temperature sketch (see
03-temperature-tiering.md).
Typically covers:
- All nodes at HNSW layers >= 2
- Layer 0-1 nodes in the hot temperature tier
- ~10-20% of total nodes
### Storage
Layer B is stored in one or more INDEX_SEGs without the HOT flag. The Level 1
manifest maps these segments and records which node ID ranges they cover.
### Incremental Build
Layer B can be built incrementally:
```
1. After Layer A is loaded, begin query serving
2. In background: read VEC_SEGs for hot-tier blocks
3. Build HNSW adjacency for those blocks
4. Write as new INDEX_SEG
5. Update manifest to include Layer B
6. Future queries use Layer B for better recall
```
This means the index improves over time without blocking any queries.
### Partial Adjacency Routing
When a query traversal reaches a node without Layer B adjacency (i.e., it's in
the cold region), the system falls back to:
1. **Centroid routing**: Use Layer A centroids to estimate the nearest region
2. **Linear scan**: Scan the relevant VEC_SEG block directly
3. **Approximate**: Accept slightly lower recall for that portion
```python
def search_with_partial_index(query, k, layers):
    # Start with Layer A routing
    current = hnsw_search_layers(query, layers.a, layers.a.max_layer, 2)
    # Continue with Layer B (where available)
    if layers.b.has_node(current):
        current = hnsw_search_layers(query, layers.b, 1, 0, start=current)
    else:
        # Fallback: scan the block containing current
        candidates = linear_scan_block(query, current.block)
        current = best_of(current, candidates)
    return top_k(current.visited, k)
```
## 4. Layer C: Full Adjacency
### Content
Complete neighbor lists for every node at every HNSW level. This is the
traditional full HNSW graph.
### Storage
Layer C may be split across multiple INDEX_SEGs for large datasets. The
manifest records the node ID ranges covered by each segment.
### Lazy Build
Layer C is built lazily — it is not required for the file to be functional.
The build process runs as a background task:
```
1. Identify unindexed VEC_SEG blocks (those without Layer C adjacency)
2. Read blocks in partition order (good locality)
3. Build HNSW adjacency using the existing partial graph as scaffold
4. Write new INDEX_SEG(s)
5. Update manifest
```
### Build Prioritization
Blocks are indexed in temperature order:
1. Hot blocks first (most query benefit)
2. Warm blocks next
3. Cold blocks last (may never be indexed if queries don't reach them)
This means the index build converges to useful quality fast, then approaches
completeness asymptotically.
## 5. Index Segment Binary Format
### Adjacency List Encoding
Neighbor lists are stored using **varint delta encoding with restart points**
for fast random access:
```
+-------------------------------------------+
| Restart Point Index |
| restart_interval: u32 (e.g., 64) |
| restart_count: u32 |
| [restart_offset: u32] * restart_count |
| [64B aligned] |
+-------------------------------------------+
| Adjacency Data |
| For each node (sorted by node_id): |
| neighbor_count: varint |
| [delta_encoded_neighbor_id: varint] |
| (restart point every N nodes) |
+-------------------------------------------+
```
**Restart points**: Every `restart_interval` nodes (default 64), the delta
encoding resets to absolute IDs. This enables O(1) random access to any node's
neighbors by:
1. Binary search the restart point index for the nearest restart <= target
2. Seek to that restart offset
3. Sequentially decode from restart to target (at most 63 decodes)
### Varint Encoding
Standard LEB128 varint:
- Values 0-127: 1 byte
- Values 128-16383: 2 bytes
- Values 16384-2097151: 3 bytes
For delta-encoded neighbor IDs (typical delta: 1-1000), most values fit in 1-2
bytes, giving ~3-4x compression over fixed u64.
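The LEB128 scheme can be sketched in a few lines; `encode_neighbors` is a hypothetical helper that packs a sorted neighbor list as a varint count followed by varint deltas, matching the adjacency layout above:

```python
def varint_encode(n: int) -> bytes:
    """LEB128: 7 payload bits per byte, high bit = continuation."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)   # more bytes follow
        else:
            out.append(byte)
            return bytes(out)

def encode_neighbors(ids) -> bytes:
    """Sort, delta-encode, and varint-pack one neighbor list."""
    ids = sorted(ids)
    out = bytearray(varint_encode(len(ids)))   # neighbor_count
    prev = 0
    for i in ids:
        out += varint_encode(i - prev)         # delta from previous ID
        prev = i
    return bytes(out)

# [1000, 1005, 1008, 1020, 1100] -> deltas [1000, 5, 3, 12, 80]
enc = encode_neighbors([1000, 1005, 1008, 1020, 1100])
assert len(enc) == 7   # 1 count byte + 2 + 1 + 1 + 1 + 1 delta bytes
```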
### Prefetch Hints
The manifest's prefetch table maps node ID ranges to contiguous page ranges:
```
Prefetch Entry:
node_id_start: u64
node_id_end: u64
page_offset: u64 Offset of first contiguous page
page_count: u32 Number of contiguous pages
prefetch_ahead: u32 Pages to prefetch ahead of current access
```
When the HNSW search accesses a node, the runtime issues `madvise(WILLNEED)`
(or equivalent) for the next `prefetch_ahead` pages. This hides disk/memory
latency behind computation.
## 6. Index Consistency
### Append-Only Index Updates
When new vectors are added:
1. New vectors go into a **fresh VEC_SEG** (append-only)
2. A temporary in-memory index covers the new vectors
3. When the in-memory index reaches a threshold, it is written as a new INDEX_SEG
4. The manifest is updated to include both the old and new INDEX_SEGs
5. Queries search both indexes and merge results
This is analogous to LSM-tree compaction levels but for graph indexes.
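A sketch of step 5 (searching both indexes and merging results), assuming hypothetical index objects whose `search` returns distance-sorted `(distance, vector_id)` pairs:

```python
import heapq

def search_merged(query, k, sealed_index, memtable_index, deleted):
    """Query both the sealed on-disk index and the in-memory index
    covering freshly appended vectors, then merge by distance."""
    a = sealed_index.search(query, k)      # distance-sorted hits
    b = memtable_index.search(query, k)
    merged = {}
    for dist, vid in heapq.merge(a, b):    # both inputs already sorted
        if vid not in deleted and vid not in merged:
            merged[vid] = dist             # dedup: keep best distance
        if len(merged) == k:
            break
    return sorted((d, v) for v, d in merged.items())

class _StubIndex:
    """Hypothetical stand-in returning precomputed, sorted hits."""
    def __init__(self, hits):
        self.hits = hits
    def search(self, query, k):
        return self.hits[:k]

sealed = _StubIndex([(0.1, 1), (0.4, 3)])
memtable = _StubIndex([(0.2, 2), (0.3, 1)])
res = search_merged(None, 2, sealed, memtable, deleted={3})
# res == [(0.1, 1), (0.2, 2)]
```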
### Index Merging
When too many small INDEX_SEGs accumulate:
```
1. Read all small INDEX_SEGs
2. Build a unified HNSW graph over all vectors
3. Write as a single sealed INDEX_SEG
4. Tombstone old INDEX_SEGs in manifest
```
### Concurrent Read/Write
Readers always see a consistent snapshot through the manifest chain:
- Reader opens file -> reads manifest -> has immutable segment set
- Writer appends new segments + new manifest
- Reader continues using old manifest until it explicitly re-reads
- No locks needed — append-only guarantees no mutation of existing data
## 7. Query Path Integration
The complete query path combining progressive indexing with temperature tiering:
```
Query
|
v
+-----------+
| Layer A | Entry points + top-layer routing
| (always) | ~5ms to load on cold start
+-----------+
|
Is Layer B available for this region?
/ \
Yes No
/ \
+-----------+ +-----------+
| Layer B | | Centroid |
| HNSW | | Fallback |
| search | | + scan |
+-----------+ +-----------+
\ /
\ /
v v
+-----------+
| Candidate |
| Set |
+-----------+
|
Is hot cache available?
/ \
Yes No
/ \
+-----------+ +-----------+
| Hot cache | | Decode |
| re-rank | | from |
| (int8/fp16)| | VEC_SEG |
+-----------+ +-----------+
\ /
v v
+-----------+
| Top-K |
| Results |
+-----------+
```
### Recall Expectations by State
| State | Layers Available | Expected Recall@10 |
|-------|-----------------|-------------------|
| Cold start (L0 only) | A | 0.65-0.75 |
| L0 + hot cache | A + hot | 0.75-0.85 |
| L0 + L1 loading | A + B partial | 0.80-0.90 |
| L1 complete | A + B | 0.85-0.92 |
| Full load | A + B + C | 0.95-0.99 |
| Full + optimized | A + B + C + hot | 0.98-0.999 |

---
# RVF Overlay Epochs
## 1. Streaming Dynamic Min-Cut Overlay
The overlay system manages dynamic graph partitioning — how the vector space is
subdivided for distributed search, shard routing, and load balancing. Unlike
static partitioning, RVF overlays evolve with the data through an epoch-based
model that bounds memory, bounds load time, and enables rollback.
## 2. Overlay Segment Structure
Each OVERLAY_SEG stores a delta relative to the previous epoch's partition state:
```
+-------------------------------------------+
| Header: OVERLAY_SEG |
+-------------------------------------------+
| Epoch Header |
| epoch: u32 |
| parent_epoch: u32 |
| parent_seg_id: u64 |
| rollback_offset: u64 |
| timestamp_ns: u64 |
| delta_count: u32 |
| partition_count: u32 |
+-------------------------------------------+
| Edge Deltas |
| For each delta: |
| delta_type: u8 (ADD=1, REMOVE=2, |
| REWEIGHT=3) |
| src_node: u64 |
| dst_node: u64 |
| weight: f32 (for ADD/REWEIGHT) |
| [64B aligned] |
+-------------------------------------------+
| Partition Summaries |
| For each partition: |
| partition_id: u32 |
| node_count: u64 |
| edge_cut_weight: f64 |
| centroid: [fp16 * dim] |
| node_id_range_start: u64 |
| node_id_range_end: u64 |
| [64B aligned] |
+-------------------------------------------+
| Min-Cut Witness |
| witness_type: u8 |
| 0 = checksum only |
| 1 = full certificate |
| cut_value: f64 |
| cut_edge_count: u32 |
| partition_hash: [u8; 32] (SHAKE-256) |
| If witness_type == 1: |
| [cut_edge: (u64, u64)] * count |
| [64B aligned] |
+-------------------------------------------+
| Rollback Pointer |
| prev_epoch_offset: u64 |
| prev_epoch_hash: [u8; 16] |
+-------------------------------------------+
```
## 3. Epoch Lifecycle
### Epoch Creation
A new epoch is created when:
- A batch of vectors is inserted that changes partition balance by > threshold
- The accumulated edge deltas exceed a size limit (default: 1 MB)
- A manual rebalance is triggered
- A merge/compaction produces a new partition layout
```
Epoch 0 (initial) Epoch 1 Epoch 2
+----------------+ +----------------+ +----------------+
| Full snapshot | | Deltas vs E0 | | Deltas vs E1 |
| of partitions | | +50 edges | | +30 edges |
| 32 partitions | | -12 edges | | -8 edges |
| min-cut: 0.342 | | rebalance: P3 | | split: P7->P7a |
+----------------+ +----------------+ +----------------+
```
### State Reconstruction
To reconstruct the current partition state:
```
1. Read latest MANIFEST_SEG -> get current_epoch
2. Read OVERLAY_SEG for current_epoch
3. If overlay is a delta: recursively read parent epochs
4. Apply deltas in order: base -> epoch 1 -> epoch 2 -> ... -> current
5. Result: complete partition state
```
For efficiency, the manifest caches the **last full snapshot epoch**. Delta
chains never exceed a configurable depth (default: 8 epochs) before a new
snapshot is forced.
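A sketch of the reconstruction walk, with a hypothetical `read_overlay(epoch)` accessor and deliberately simplified delta application (a real implementation applies ADD/REMOVE/REWEIGHT edge deltas):

```python
def reconstruct_state(current_epoch, read_overlay):
    """Walk parent pointers back to the last full snapshot, then
    apply deltas forward in epoch order (oldest first)."""
    chain = []
    seg = read_overlay(current_epoch)
    while not seg.is_snapshot:          # delta segment: follow parent
        chain.append(seg)
        seg = read_overlay(seg.parent_epoch)
    state = dict(seg.base_state)        # partition state of the snapshot
    for delta_seg in reversed(chain):   # base -> ... -> current
        for key, value in delta_seg.deltas:
            state[key] = value          # stand-in for edge-delta apply
    return state

class Seg:
    """Hypothetical overlay-segment shape for the demo."""
    def __init__(self, is_snapshot, parent_epoch=None, base_state=None, deltas=()):
        self.is_snapshot = is_snapshot
        self.parent_epoch = parent_epoch
        self.base_state = base_state or {}
        self.deltas = list(deltas)

segs = {0: Seg(True, base_state={"P0": 10}),
        1: Seg(False, 0, deltas=[("P1", 5)]),
        2: Seg(False, 1, deltas=[("P0", 12)])}
state = reconstruct_state(2, segs.__getitem__)   # {"P0": 12, "P1": 5}
```

Because the chain depth is capped (default 8), this walk touches a bounded number of segments regardless of file age.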
### Compaction (Epoch Collapse)
When the delta chain reaches maximum depth:
```
1. Reconstruct full state from chain
2. Write new OVERLAY_SEG with witness_type=full_snapshot
3. This becomes the new base epoch
4. Old overlay segments are tombstoned
5. New delta chain starts from this base
```
```
Before: E0(snap) -> E1(delta) -> E2(delta) -> ... -> E8(delta)
After: E0(snap) -> ... -> E8(delta) -> E9(snap, compacted)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
These can be garbage collected
```
## 4. Min-Cut Witness
The min-cut witness provides a cryptographic proof that the current partition
is "good enough" — that the edge cut is within acceptable bounds.
### Witness Types
**Type 0: Checksum Only**
A SHAKE-256 hash of the complete partition state. Allows verification that
the state is consistent but doesn't prove optimality.
```
witness = SHAKE-256(
for each partition sorted by id:
partition_id || node_count || sorted(node_ids) || edge_cut_weight
)
```
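A Type-0 witness can be computed with a standard SHAKE-256 implementation; the field widths here are illustrative, not normative:

```python
import hashlib
import struct

def partition_witness(partitions) -> bytes:
    """Type-0 witness: SHAKE-256 over the canonical (sorted) partition
    state. `partitions`: iterable of (partition_id, node_ids, cut_weight)."""
    h = hashlib.shake_256()
    for pid, node_ids, cut_w in sorted(partitions):
        h.update(pid.to_bytes(4, "little"))
        h.update(len(node_ids).to_bytes(8, "little"))
        for nid in sorted(node_ids):          # canonical node order
            h.update(nid.to_bytes(8, "little"))
        h.update(struct.pack("<d", cut_w))
    return h.digest(32)                       # 32-byte partition_hash

w1 = partition_witness([(0, [3, 1], 1.0), (1, [2], 0.5)])
w2 = partition_witness([(1, [2], 0.5), (0, [1, 3], 1.0)])
assert w1 == w2 and len(w1) == 32             # order-independent
```

Sorting both partitions and node IDs before hashing is what makes the witness canonical: two readers with the same logical state always derive the same hash.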
**Type 1: Full Certificate**
Lists the actual cut edges. Allows any reader to verify that:
1. The listed edges are the only edges crossing partition boundaries
2. The total cut weight matches `cut_value`
3. No better cut exists within the local search neighborhood (optional)
### Bounded-Time Min-Cut Updates
Full min-cut computation is expensive (O(V * E) for max-flow). RVF uses
**incremental min-cut maintenance**:
For each edge delta:
```
1. If ADD(u, v) where u and v are in same partition:
-> No cut change. O(1).
2. If ADD(u, v) where u in P_i and v in P_j:
-> cut_weight[P_i][P_j] += weight. O(1).
-> Check if moving u to P_j or v to P_i reduces total cut.
-> If yes: execute move, update partition summaries. O(degree).
3. If REMOVE(u, v) across partitions:
-> cut_weight[P_i][P_j] -= weight. O(1).
-> No rebalance needed (cut improved).
4. If REMOVE(u, v) within same partition:
-> Check connectivity. If partition splits: create new partition. O(component).
```
This bounds update time to O(max_degree) per edge delta in the common case,
with O(component_size) in the rare partition-split case.
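The O(degree) local-move check in case 2 reduces to a gain computation. This sketch assumes a plain adjacency map and leaves executing the move (and the `cut_weight` bookkeeping) to the caller:

```python
def move_gain(node, target, part_of, adj):
    """Cut-weight reduction from moving `node` into partition `target`:
    edges to `target` leave the cut, formerly-internal edges join it."""
    internal = external = 0.0
    for nbr, w in adj[node]:
        if part_of[nbr] == part_of[node]:
            internal += w                 # would become cut edges
        elif part_of[nbr] == target:
            external += w                 # would leave the cut
    return external - internal            # > 0: move strictly reduces cut

# ADD(0, 2) with weight 2.0 crosses P0/P1: check both one-hop moves
part_of = {0: "P0", 1: "P0", 2: "P1"}
adj = {0: [(1, 1.0), (2, 2.0)], 1: [(0, 1.0)], 2: [(0, 2.0)]}
assert move_gain(0, "P1", part_of, adj) == 1.0
assert move_gain(2, "P0", part_of, adj) == 2.0
```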
### Semi-Streaming Min-Cut
For large-scale rebalancing (e.g., after bulk insert), RVF uses a semi-streaming
algorithm inspired by Assadi et al.:
```
Phase 1: Single pass over edges to build a sparse skeleton
- Sample each edge with probability O(1/epsilon)
- Space: O(n * polylog(n))
Phase 2: Compute min-cut on skeleton
- Standard max-flow on sparse graph
- Time: O(n^2 * polylog(n))
Phase 3: Verify against full edge set
- Stream edges again, check cut validity
- If invalid: refine skeleton and repeat
```
This runs in O(n * polylog(n)) space regardless of edge count, making it
suitable for streaming over massive graphs.
## 5. Overlay Size Management
### Size Threshold
Each OVERLAY_SEG has a maximum payload size (configurable, default 1 MB).
When the accumulated deltas for the current epoch approach this threshold,
a new epoch is forced.
### Memory Budget
The total memory for overlay state is bounded:
```
max_overlay_memory = max_chain_depth * max_seg_size + snapshot_size
= 8 * 1 MB + snapshot_size
```
For 10M vectors with 32 partitions:
- Snapshot: 32 partitions * (8 + 16 + 768) bytes per partition ≈ 25 KB
- Delta chain: ≤ 8 MB
- Total: ≤ 9 MB
This is a fixed overhead regardless of dataset size (partition count scales
sublinearly).
### Garbage Collection
Overlay segments behind the last full snapshot are candidates for garbage
collection. The manifest tracks which overlay segments are still reachable
from the current epoch chain.
```
Reachable: current_epoch -> parent -> ... -> last_snapshot
Unreachable: Everything before last_snapshot (safely deletable)
```
GC runs during compaction. Old OVERLAY_SEGs are tombstoned in the manifest
and their space is reclaimed on file rewrite.
## 6. Distributed Overlay Coordination
When RVF files are sharded across multiple nodes, the overlay system coordinates
partition state:
### Shard-Local Overlays
Each shard maintains its own OVERLAY_SEG chain for its local partitions.
The global partition state is the union of all shard-local overlays.
### Cross-Shard Rebalancing
When a partition becomes unbalanced across shards:
```
1. Coordinator computes target partition assignment
2. Each shard writes a JOURNAL_SEG with vector move instructions
3. Vectors are copied (not moved — append-only) to target shards
4. Each shard writes a new OVERLAY_SEG reflecting the new partition
5. Coordinator writes a global MANIFEST_SEG with new shard map
```
This is eventually consistent — during rebalancing, queries may search both
old and new locations and deduplicate results.
### Consistency Model
**Within a shard**: Linearizable (single-writer, manifest chain)
**Across shards**: Eventually consistent with bounded staleness
The epoch counter provides a total order for convergence checking:
- If all shards report epoch >= E, the global state at epoch E is complete
- Stale shards are detectable by comparing epoch counters
## 7. Epoch-Aware Query Routing
Queries use the overlay state for partition routing:
```python
def route_query(query, overlay):
    # Find nearest partition centroids
    dists = [distance(query, p.centroid) for p in overlay.partitions]
    target_partitions = top_n(dists, n_probe)
    # Check epoch freshness
    if overlay.epoch < current_epoch - stale_threshold:
        # Overlay is stale: broaden the search
        target_partitions = top_n(dists, n_probe * 2)
    return target_partitions
```
### Epoch Rollback
If an overlay epoch is found to be corrupt or suboptimal:
```
1. Read rollback_pointer from current OVERLAY_SEG
2. The pointer gives the offset of the previous epoch's OVERLAY_SEG
3. Write a new MANIFEST_SEG pointing to the previous epoch as current
4. Future writes continue from the rolled-back state
```
This provides O(1) rollback to any ancestor epoch in the chain.
## 8. Integration with Progressive Indexing
The overlay system and the index system are coupled:
- **Partition centroids** in the overlay guide Layer A routing
- **Partition boundaries** determine which INDEX_SEGs cover which regions
- **Partition rebalancing** may invalidate Layer B adjacency for moved vectors
(these are rebuilt lazily)
- **Layer C** is partition-aligned: each INDEX_SEG covers vectors within
  a single partition for locality
This means overlay compaction can trigger partial index rebuild, but only for
the affected partitions — not the entire index.

---
# RVF Ultra-Fast Query Path
## 1. CPU Shape Optimization
The block layout determines performance at the hardware level. RVF is designed
to match the shape of modern CPUs: wide SIMD, deep caches, hardware prefetch.
### Four Optimizations
1. **Strict 64-byte alignment** for all numeric arrays
2. **Columnar + interleaved hybrid** for compression and speed
3. **Prefetch hints** for cache-friendly graph traversal
4. **Dictionary-coded IDs** for fast random access
## 2. Strict Alignment
Every numeric array in RVF starts at a 64-byte aligned offset. This matches:
| Target | Register Width | Alignment |
|--------|---------------|-----------|
| AVX-512 | 512 bits = 64 bytes | 64 B |
| AVX2 | 256 bits = 32 bytes | 64 B (superset) |
| ARM NEON | 128 bits = 16 bytes | 64 B (superset) |
| WASM v128 | 128 bits = 16 bytes | 64 B (superset) |
| Cache line | Typically 64 bytes | 64 B (exact) |
By aligning to 64 bytes, RVF ensures:
- Zero-copy load into any SIMD register (no unaligned penalty)
- No cache-line splits (each access touches exactly one cache line)
- Optimal hardware prefetch behavior (prefetcher operates on cache lines)
### Alignment in Practice
```
Segment header: 64 B (naturally aligned, first item in segment)
Block header: Padded to 64 B boundary
Vector data start: 64 B aligned from block start
Each dimension column: 64 B aligned (columnar VEC_SEG)
Each vector entry: 64 B aligned (interleaved HOT_SEG)
ID map: 64 B aligned
Restart point index: 64 B aligned
```
Padding bytes between sections are zero-filled and excluded from checksums.
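The padding rule reduces to a standard power-of-two round-up; a minimal helper:

```python
def align_up(offset: int, alignment: int = 64) -> int:
    """Round `offset` up to the next alignment boundary (power of two).
    No-op when the offset is already aligned."""
    return (offset + alignment - 1) & ~(alignment - 1)

assert align_up(0) == 0
assert align_up(1) == 64
assert align_up(906) == 960   # e.g. a 906-byte entry padded to 960 B
```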
## 3. Columnar + Interleaved Hybrid
### Columnar Storage (VEC_SEG) — Optimized for Compression
```
Block layout (1024 vectors, 384 dimensions, fp16):
Offset 0x000: dim_0[vec_0], dim_0[vec_1], ..., dim_0[vec_1023] (2048 B)
Offset 0x800: dim_1[vec_0], dim_1[vec_1], ..., dim_1[vec_1023] (2048 B)
...
Offset 0xBF800: dim_383[vec_0], ..., dim_383[vec_1023] (2048 B)
Total: 384 * 2048 = 786,432 bytes (768 KB per block)
```
**Why columnar for cold/warm storage**:
- Adjacent values in the same dimension are correlated -> higher compression ratio
- LZ4 on columnar fp16 achieves 1.5-2.5x compression (vs 1.1-1.3x on interleaved)
- ZSTD on columnar fp16 achieves 2.5-4x compression
- Batch operations (computing mean, variance) scan one dimension at a time
### Interleaved Storage (HOT_SEG) — Optimized for Speed
```
Entry layout (one hot vector, 384 dim fp16):
Offset 0x000: vector_id (8 B)
Offset 0x008: dim_0, dim_1, dim_2, ..., dim_383 (768 B)
Offset 0x308: neighbor_count (2 B)
Offset 0x30A: neighbor_0, neighbor_1, ... (8 B each)
Offset 0x38A: padding to 64B boundary
--> 960 bytes per entry (at M=16 neighbors)
```
**Why interleaved for hot data**:
- One vector = one sequential read (no column gathering)
- Distance computation: load vector, compute, move to next (streaming pattern)
- Neighbors co-located: after finding a good candidate, immediately traverse
- 960 bytes per entry = 15 cache lines = predictable memory access
### When to Use Each
| Operation | Layout | Reason |
|-----------|--------|--------|
| Bulk distance computation | Columnar | SIMD operates on dimension columns |
| Top-K refinement scan | Interleaved | Sequential scan of candidates |
| Compression/archival | Columnar | Better ratio |
| HNSW search (hot region) | Interleaved | Vector + neighbors together |
| Batch insert | Columnar | Write once, compress well |
## 4. Prefetch Hints
### The Problem
HNSW search is pointer-chasing: compute distance at node A, read neighbor
list, jump to node B, compute distance, repeat. Each jump is a random
memory access. On a 10M vector file, this means:
```
HNSW search: ~100-200 distance computations per query
Each computation: 1 random read (vector) + 1 random read (neighbors)
Random read latency: 50-100 ns (DRAM), 10-50 μs (SSD)
Total: 10-40 μs (DRAM), 1-10 ms (SSD) without prefetch
```
### The Solution
Store neighbor lists **contiguously** and add **prefetch offsets** in the
manifest so the runtime can issue prefetch instructions ahead of time.
### Prefetch Table Structure
The manifest contains a prefetch table mapping node ID ranges to contiguous
page regions:
```
prefetch_table:
entry_count: u32
entries:
[0]: node_ids 0-9999 -> pages at offset 0x100000, 50 pages, prefetch 3 ahead
[1]: node_ids 10000-19999 -> pages at offset 0x200000, 50 pages, prefetch 3 ahead
...
```
### Runtime Prefetch Strategy
```python
def hnsw_search_with_prefetch(query, entry_point, ef_search, k):
    candidates = MaxHeap()
    visited = BitSet()
    worklist = MinHeap([(distance(query, entry_point), entry_point)])
    while worklist:
        dist, node = worklist.pop()
        # PREFETCH: while processing this node, prefetch neighbors' data
        neighbors = get_neighbors(node)
        for n in neighbors[:PREFETCH_AHEAD]:
            if n not in visited:
                prefetch_vector(n)     # madvise(WILLNEED) or __builtin_prefetch
                prefetch_neighbors(n)  # prefetch the neighbor-list page
        # COMPUTE: distance to neighbors (data should be in cache by now)
        for n in neighbors:
            if n not in visited:
                visited.add(n)
                d = distance(query, get_vector(n))
                if len(candidates) < ef_search or d < candidates.max():
                    candidates.push((d, n))
                    worklist.push((d, n))
    return candidates.top_k(k)
```
### Contiguous Neighbor Layout
HOT_SEG stores vectors and neighbors together. For cold INDEX_SEGs, neighbor
lists are laid out in **node ID order** within contiguous pages:
```
Page 0: neighbors[node_0], neighbors[node_1], ..., neighbors[node_63]
Page 1: neighbors[node_64], ..., neighbors[node_127]
...
```
Because HNSW search tends to traverse nodes in the same graph neighborhood
(spatially close node IDs if data was inserted in order), sequential node
IDs tend to be accessed together. Contiguous layout turns random access
into sequential reads.
### Expected Improvement
| Configuration | p95 Latency (10M vectors) |
|--------------|--------------------------|
| No prefetch, random layout | 2.5 ms |
| No prefetch, contiguous layout | 1.2 ms |
| Prefetch, contiguous layout | 0.3 ms |
| Prefetch, contiguous + hot cache | 0.15 ms |
## 5. Dictionary-Coded IDs
### The Problem
Vector IDs in neighbor lists and ID maps are 64-bit integers. For 10M vectors,
most IDs fit in 24 bits. Storing full 64-bit IDs wastes ~5 bytes per entry.
With M=16 neighbors per node and 10M nodes:
- Raw: 10M * 16 * 8 = 1.2 GB of ID data
- Desired: < 300 MB
### Varint Delta Encoding
IDs within a block or neighbor list are sorted and delta-encoded:
```
Original IDs: [1000, 1005, 1008, 1020, 1100]
Deltas: [1000, 5, 3, 12, 80]
Varint bytes: [ 2B, 1B, 1B, 1B, 1B] = 6 bytes (vs 40 bytes raw)
```
### Restart Points
Every N entries (default N=64), the delta resets to an absolute value:
```
Group 0 (entries 0-63): delta from 0 (absolute start)
Group 1 (entries 64-127): delta from entry[64] (restart)
Group 2 (entries 128-191): delta from entry[128] (restart)
```
The restart point index stores the offset of each restart group:
```
restart_index:
interval: 64
offsets: [0, 156, 298, 445, ...] // byte offsets into encoded data
```
### Random Access
To find the neighbors of node N:
```
1. group = N / restart_interval // O(1)
2. offset = restart_index[group] // O(1)
3. seek to offset in encoded data // O(1)
4. decode sequentially from restart to N // O(restart_interval) = O(64)
```
Total: O(64) varint decodes = ~50-100 ns. Compare with sorted array binary
search: O(log N) = O(24) comparisons with cache misses = ~200-500 ns.
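The restart-point lookup can be sketched end to end (a small encoder is included for the demo; shapes are illustrative, not the normative wire format):

```python
def _varint(n: int) -> bytes:
    """LEB128 encode: 7 payload bits per byte, high bit = continuation."""
    out = bytearray()
    while True:
        b = n & 0x7F
        n >>= 7
        out.append(b | 0x80 if n else b)
        if not n:
            return bytes(out)

def encode_with_restarts(ids, interval=64):
    """Delta-encode sorted IDs; each restart group begins absolute."""
    encoded, restarts = bytearray(), []
    for i, v in enumerate(ids):
        if i % interval == 0:
            restarts.append(len(encoded))    # byte offset of the group
            encoded += _varint(v)            # absolute value at restart
        else:
            encoded += _varint(v - ids[i - 1])
    return bytes(encoded), restarts

def id_at(index, encoded, restarts, interval=64):
    """O(interval) random access: jump to the restart group, decode forward."""
    group, skip = divmod(index, interval)
    pos, val = restarts[group], 0            # group starts absolute
    for _ in range(skip + 1):                # at most `interval` decodes
        delta = shift = 0
        while True:
            b = encoded[pos]; pos += 1
            delta |= (b & 0x7F) << shift
            if not b & 0x80:
                break
            shift += 7
        val += delta
    return val

ids = list(range(1000, 1500, 3))             # sorted IDs
enc, restarts = encode_with_restarts(ids)
assert id_at(0, enc, restarts) == 1000
assert id_at(100, enc, restarts) == ids[100]
```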
### SIMD Varint Decoding
Modern SIMD can decode varints in bulk:
```
AVX-512 VBMI: ~8 varints per cycle using VPERMB + VPSHUFB
Throughput: 2-4 billion integers/second (Lemire et al.)
```
At 16 neighbors per node, one HNSW search step decodes 16 varints in ~2-4 ns.
### Compression Ratio
| Encoding | Bytes per ID (avg) | 10M * 16 neighbors |
|----------|-------------------|-------------------|
| Raw u64 | 8.0 B | 1,220 MB |
| Raw u32 | 4.0 B | 610 MB |
| Varint (no delta) | 3.2 B | 488 MB |
| Varint delta | 1.5 B | 229 MB |
| Varint delta + restart | 1.6 B | 244 MB |
Delta encoding with restart points achieves ~5x compression over raw u64
while maintaining fast random access.
## 6. Cache Behavior Analysis
### L1/L2/L3 Working Sets
For a typical query on 10M vectors (384 dim, fp16):
```
HNSW search:
~150 distance computations
Each computation: 768 B (vector) + ~128 B (neighbor list) ≈ 896 B
Total working set: 150 * 896 ≈ 131 KB
Top-K refinement (hot cache scan):
~1000 candidates checked
Each: 960 B (interleaved HOT_SEG entry)
Total: 960 KB
Query vector: 768 B (always in L1)
Quantization tables: 96 KB (PQ codebook, always in L2)
```
| Cache Level | Size | What Fits |
|------------|------|-----------|
| L1 (32-48 KB) | Query vector + current node | Always hit |
| L2 (256 KB-1 MB) | PQ tables + 100-200 hot entries | Usually hit |
| L3 (8-32 MB) | Hot cache + partial index | Mostly hit |
| DRAM | Everything | Full dataset |
### p95 Latency Budget
```
HNSW traversal: 150 nodes * 100 ns/node = 15 μs (L3 hit)
Distance compute: 150 * 50 ns = 7.5 μs (SIMD)
Top-K refinement: 1000 * 10 ns = 10 μs (hot cache, L2/L3 hit)
Overhead: 5 μs (heap ops, bookkeeping)
-------
Total p95: ~37.5 μs ≈ 0.04 ms
With prefetch: ~30 μs (hide 25% of traversal latency)
```
This matches the target of < 0.3 ms p95 on desktop hardware. The dominant
cost is memory bandwidth, not computation — which is why cache-friendly
layout and prefetch are critical.
## 7. Distance Function SIMD Implementations
### L2 Distance (fp16, 384 dim, AVX-512)
```
; 384 fp16 values = 768 bytes = 12 ZMM registers
; Process 32 fp16 values per iteration (convert to 16 fp32 per half)
.loop:
vmovdqu16 zmm0, [rsi + rcx] ; Load 32 fp16 from A
vmovdqu16 zmm1, [rdi + rcx] ; Load 32 fp16 from B
vcvtph2ps zmm2, ymm0 ; Convert low 16 to fp32
vcvtph2ps zmm3, ymm1
vsubps zmm2, zmm2, zmm3 ; diff = A - B
vfmadd231ps zmm4, zmm2, zmm2 ; acc += diff * diff
; Repeat for high 16
vextracti64x4 ymm0, zmm0, 1
vextracti64x4 ymm1, zmm1, 1
vcvtph2ps zmm2, ymm0
vcvtph2ps zmm3, ymm1
vsubps zmm2, zmm2, zmm3
vfmadd231ps zmm4, zmm2, zmm2
add rcx, 64
cmp rcx, 768
jl .loop
; Horizontal sum of zmm4 -> scalar result
; ~12 iterations, ~24 FMA ops, ~12 cycles total
```
### Inner Product (int8, 384 dim, AVX-512 VNNI)
```
; 384 int8 values = 384 bytes = 6 ZMM registers
; VPDPBUSD: 64 uint8*int8 multiply-adds per cycle
.loop:
vmovdqu8 zmm0, [rsi + rcx] ; 64 uint8 from A
vmovdqu8 zmm1, [rdi + rcx] ; 64 int8 from B
vpdpbusd zmm2, zmm0, zmm1 ; acc += dot(A, B) per 4 bytes
add rcx, 64
cmp rcx, 384
jl .loop
; 6 iterations, 6 VPDPBUSD ops, ~6 cycles
; ~16x faster than fp16 L2
```
### Hamming Distance (binary, 384 dim, AVX-512)
```
; 384 bits = 48 bytes = 1 partial ZMM load
; VPOPCNTDQ: popcount on 8 x 64-bit words per cycle
vmovdqu8 zmm0, [rsi] ; Load 48 bytes (384 bits) from A
vmovdqu8 zmm1, [rdi] ; Load 48 bytes from B
vpxorq zmm2, zmm0, zmm1 ; XOR -> differing bits
vpopcntq zmm3, zmm2 ; Popcount per 64-bit word
; Horizontal sum of 6 popcounts -> Hamming distance
; ~3 cycles total
```
## 8. Summary: Query Path Hot Loop
The complete hot path for one HNSW search step:
```
1. Load current node's neighbor list [L2/L3 cache, 128 B, ~5 ns]
2. Issue prefetch for next neighbors [~1 ns]
3. For each neighbor (M=16):
a. Check visited bitmap [L1, ~1 ns]
b. Load neighbor vector (hot cache) [L2/L3, 768 B, ~5-10 ns]
c. SIMD distance (fp16, 384 dim) [~12 cycles = ~4 ns]
d. Heap insert if better [~5 ns]
4. Total per step: ~300-500 ns
5. Total per query (~150 steps): ~50-75 μs
```
This achieves 13,000-20,000 QPS per thread on desktop hardware — matching
or exceeding dedicated vector databases for in-memory workloads.

---
# RVF Deletion Lifecycle
## 1. Overview
Deletion in RVF follows a two-phase protocol consistent with the append-only
segment architecture. Vectors are never removed in place. Instead, a soft
delete records intent in a JOURNAL_SEG, and a subsequent compaction performs
the hard delete by physically excluding those vectors from sealed output segments.
```
JOURNAL_SEG Compaction GC / Rewrite
(append) (merge) (reclaim)
ACTIVE -----> SOFT_DELETED -----> HARD_DELETED ------> RECLAIMED
| | | |
| query path | query path | |
| returns vec | skips vec | vec absent | space freed
| | (bitmap check) | from output seg |
```
Readers always see a consistent snapshot: a deletion is invisible until
the manifest referencing the new deletion bitmap is durably committed.
## 2. Vector Lifecycle State Machine
```
+----------+ JOURNAL_SEG +-----------------+
| | DELETE_VECTOR / RANGE | |
| ACTIVE +----------------------->+ SOFT_DELETED |
| | | |
+----------+ +--------+--------+
| Compaction seals output
v excluding this vector
+--------+--------+
| HARD_DELETED |
+--------+--------+
| File rewrite / truncation
v reclaims physical space
+--------+--------+
| RECLAIMED |
+-----------------+
```
| State | Bitmap Bit | Physical Bytes | Query Visible |
|-------|------------|----------------|---------------|
| ACTIVE | 0 | Vector in VEC_SEG | Yes |
| SOFT_DELETED | 1 | Vector in VEC_SEG | No |
| HARD_DELETED | N/A | Excluded from sealed output | No |
| RECLAIMED | N/A | Bytes overwritten / freed | No |
| Transition | Trigger | Durability |
|------------|---------|------------|
| ACTIVE -> SOFT_DELETED | JOURNAL_SEG + MANIFEST_SEG with bitmap | After manifest fsync |
| SOFT_DELETED -> HARD_DELETED | Compaction writes sealed VEC_SEG without vector | After compaction manifest fsync |
| HARD_DELETED -> RECLAIMED | File rewrite or old shard deletion | After shard unlink |
## 3. JOURNAL_SEG Wire Format (type 0x04)
A JOURNAL_SEG records metadata mutations: deletions, metadata updates, tier
moves, and ID remappings. Its payload follows the standard 64-byte segment
header (see `01-segment-model.md` section 2).
### 3.1 Journal Header (64 bytes)
```
Offset Type Field Description
------ ---- ----- -----------
0x00 u32 entry_count Number of journal entries
0x04 u32 journal_epoch Epoch when this journal was written
0x08 u64 prev_journal_seg_id Segment ID of previous JOURNAL_SEG (0 if first)
0x10 u32 flags Reserved, must be 0
0x14 u8[44] reserved Zero-padded to 64-byte alignment
```
### 3.2 Journal Entry Format
Each entry begins on an 8-byte aligned boundary:
```
Offset Type Field Description
------ ---- ----- -----------
0x00 u8 entry_type Entry type enum
0x01 u8 reserved Must be 0x00
0x02 u16 entry_length Byte length of type-specific payload
0x04 u8[] payload Type-specific payload
var u8[] padding Zero-pad to next 8-byte boundary
```
### 3.3 Entry Types
```
Value Name Payload Size Description
----- ---- ------------ -----------
0x01 DELETE_VECTOR 8 B Delete a single vector by ID
0x02 DELETE_RANGE 16 B Delete a contiguous range of vector IDs
0x03 UPDATE_METADATA variable Update key-value metadata for a vector
0x04 MOVE_VECTOR 24 B Reassign vector to a different segment/tier
0x05 REMAP_ID 16 B Reassign vector ID (post-compaction)
```
### 3.4 Type-Specific Payloads
**DELETE_VECTOR (0x01)**
```
0x00 u64 vector_id ID of the vector to soft-delete
```
**DELETE_RANGE (0x02)**
```
0x00 u64 start_id First vector ID (inclusive)
0x08 u64 end_id Last vector ID (exclusive)
```
Invariant: `start_id < end_id`. Range `[start_id, end_id)` is half-open.
**UPDATE_METADATA (0x03)**
```
0x00 u64 vector_id Target vector ID
0x08 u16 key_len Byte length of metadata key
0x0A u8[] key Metadata key (UTF-8)
var u16 val_len Byte length of metadata value
var+2 u8[] val Metadata value (opaque bytes)
```
**MOVE_VECTOR (0x04)**
```
0x00 u64 vector_id Target vector ID
0x08 u64 src_seg Source segment ID
0x10 u64 dst_seg Destination segment ID
```
**REMAP_ID (0x05)**
```
0x00 u64 old_id Original vector ID
0x08 u64 new_id New vector ID after compaction
```
### 3.5 Complete JOURNAL_SEG Example
Deleting vector 42, deleting range [1000, 2000), remapping ID 500 -> 3:
```
Byte offset Content Notes
----------- ------- -----
0x00-0x3F Segment header (64 B) seg_type=0x04, magic=RVFS
0x40-0x7F Journal header (64 B) entry_count=3, epoch=7,
prev_journal_seg_id=12
--- Entry 0: DELETE_VECTOR ---
0x80 0x01 entry_type
0x81 0x00 reserved
0x82-0x83 0x0008 entry_length = 8
0x84-0x8B 0x000000000000002A vector_id = 42
0x8C-0x8F 0x00000000 padding to 8B
--- Entry 1: DELETE_RANGE ---
0x90        0x02                 entry_type
0x91        0x00                 reserved
0x92-0x93   0x0010               entry_length = 16
0x94-0x9B   0x00000000000003E8   start_id = 1000
0x9C-0xA3   0x00000000000007D0   end_id = 2000
0xA4-0xA7   0x00000000           padding to 8B
--- Entry 2: REMAP_ID ---
0xA8        0x05                 entry_type
0xA9        0x00                 reserved
0xAA-0xAB   0x0010               entry_length = 16
0xAC-0xB3   0x00000000000001F4   old_id = 500
0xB4-0xBB   0x0000000000000003   new_id = 3
0xBC-0xBF   0x00000000           padding to 8B
```
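The 8-byte alignment rule from Section 3.2 can be sketched as an entry packer; helper names here are illustrative, not part of the spec:

```python
import struct

# Entry type values from Section 3.3.
DELETE_VECTOR, DELETE_RANGE, REMAP_ID = 0x01, 0x02, 0x05

def pack_entry(entry_type: int, payload: bytes) -> bytes:
    """Pack one journal entry: u8 type, u8 reserved, u16 length, payload."""
    body = struct.pack("<BBH", entry_type, 0x00, len(payload)) + payload
    pad = (-len(body)) % 8  # zero-pad to the next 8-byte boundary
    return body + b"\x00" * pad

def pack_journal_entries(entries) -> bytes:
    """Concatenate entries; each starts 8-byte aligned by construction."""
    return b"".join(pack_entry(t, p) for t, p in entries)

# The three entries from the worked example above.
stream = pack_journal_entries([
    (DELETE_VECTOR, struct.pack("<Q", 42)),
    (DELETE_RANGE,  struct.pack("<QQ", 1000, 2000)),
    (REMAP_ID,      struct.pack("<QQ", 500, 3)),
])
```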
## 4. Deletion Bitmap
### 4.1 Manifest Record
The deletion bitmap is stored in the Level 1 manifest as a TLV record:
```
Tag Name Description
--- ---- -----------
0x000E DELETION_BITMAP Roaring bitmap of soft-deleted vector IDs
```
This extends the TLV tag space (previous: 0x000D KEY_DIRECTORY).
### 4.2 Roaring Bitmap Binary Layout
Vector IDs are 64-bit. The upper 32 bits select a **high key**; the lower
32 bits index into a **container** for that high key.
```
+---------------------------------------------+
| DELETION_BITMAP TLV Value |
+---------------------------------------------+
| Bitmap Header |
| cookie: u32 (0x3B3A3332) |
| high_key_count: u32 |
| For each high key: |
| high_key: u32 |
| container_type: u8 |
| 0x01 = ARRAY_CONTAINER |
| 0x02 = BITMAP_CONTAINER |
| 0x03 = RUN_CONTAINER |
| container_offset: u32 (from bitmap start)|
| [8B aligned] |
+---------------------------------------------+
| Container Data |
| Container 0: [type-specific layout] |
| Container 1: ... |
| [8B aligned per container] |
+---------------------------------------------+
```
### 4.3 Container Types
**ARRAY_CONTAINER (0x01)** -- Sparse deletions (< 4096 set bits per 64K range).
```
0x00 u16 cardinality Number of set values (1-4096)
0x02 u16[] values Sorted array of 16-bit values
```
Size: `2 + 2 * cardinality` bytes.
**BITMAP_CONTAINER (0x02)** -- Dense deletions (>= 4096 set bits per 64K range).
```
0x00 u16 cardinality Number of set bits
0x02 u8[8192] bitmap Fixed 65536-bit bitmap (8 KB)
```
Size: 8194 bytes (fixed).
**RUN_CONTAINER (0x03)** -- Contiguous ranges of deletions.
```
0x00 u16 run_count Number of runs
0x02 (u16,u16) runs[] Array of (start, length-1) pairs
```
Size: `2 + 4 * run_count` bytes.
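A sketch of these size formulas, which also shows why 4096 is the array-to-bitmap crossover: at exactly 4096 values, the array container reaches the bitmap container's fixed size.

```python
def container_size_bytes(kind: str, cardinality: int = 0, run_count: int = 0) -> int:
    """Serialized size of each roaring container type (Section 4.3)."""
    if kind == "array":
        return 2 + 2 * cardinality   # u16 cardinality + u16 values
    if kind == "bitmap":
        return 2 + 8192              # u16 cardinality + 65536-bit bitmap
    if kind == "run":
        return 2 + 4 * run_count     # u16 run_count + (u16, u16) pairs
    raise ValueError(f"unknown container kind: {kind}")
```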
### 4.4 Size Estimation
| Deletion Pattern | Deleted IDs | Container Types | Bitmap Size |
|------------------|-------------|-----------------|-------------|
| Sparse random | 10,000 (0.1%) | ~153 array | ~22 KB |
| Clustered ranges | 10,000 (0.1%) | ~5 run | ~0.1 KB |
| Mixed workload | 100,000 (1%) | array + run | ~80 KB |
| Heavy deletion | 1,000,000 (10%) | bitmap + run | ~200 KB |
Even at 200 KB the bitmap fits entirely in L2 cache.
### 4.5 Bitmap Operations
```python
def bitmap_check(bitmap, vector_id):
"""Returns True if vector_id is soft-deleted. O(1) amortized."""
high_key = vector_id >> 16
low_val = vector_id & 0xFFFF
container = bitmap.get_container(high_key)
if container is None:
return False
    return container.contains(low_val)  # array: bsearch, bitmap: bit test, run: bsearch

def bitmap_set(bitmap, vector_id):
"""Mark a vector as soft-deleted."""
high_key = vector_id >> 16
low_val = vector_id & 0xFFFF
container = bitmap.get_or_create_container(high_key)
container.add(low_val)
if container.type == ARRAY and container.cardinality > 4096:
container.promote_to_bitmap()
```
## 5. Delete-Aware Query Path
### 5.1 HNSW Traversal with Deletion Filtering
Deleted vectors remain in the HNSW graph until compaction rebuilds the index.
During search, the deletion bitmap is checked per candidate. Deleted nodes are
still traversed for connectivity but excluded from the result set.
```python
def hnsw_search_delete_aware(query, entry_point, ef_search, k, del_bitmap):
candidates = MaxHeap() # worst candidate on top
visited = BitSet()
worklist = MinHeap() # best candidate first
d0 = distance(query, get_vector(entry_point))
worklist.push((d0, entry_point))
visited.add(entry_point)
if not bitmap_check(del_bitmap, entry_point):
candidates.push((d0, entry_point))
while worklist:
dist, node = worklist.pop()
if candidates.size() >= ef_search and dist > candidates.peek_max():
break
neighbors = get_neighbors(node)
for n in neighbors[:PREFETCH_AHEAD]:
if n not in visited:
prefetch_vector(n)
for n in neighbors:
if n in visited:
continue
visited.add(n)
d = distance(query, get_vector(n))
is_deleted = bitmap_check(del_bitmap, n) # O(1) bitmap lookup
# Always add to worklist (graph connectivity)
if candidates.size() < ef_search or d < candidates.peek_max():
worklist.push((d, n))
# Only add to results if NOT deleted
if not is_deleted:
if candidates.size() < ef_search:
candidates.push((d, n))
elif d < candidates.peek_max():
candidates.replace_max((d, n))
return candidates.top_k(k)
```
### 5.2 Top-K Refinement with Deletion Filtering
```python
def topk_refine_delete_aware(candidates, hot_cache, query, k, del_bitmap):
heap = MaxHeap()
for cand_dist, cand_id in candidates:
heap.push((cand_dist, cand_id))
for entry in hot_cache.sequential_scan():
if bitmap_check(del_bitmap, entry.vector_id):
continue # skip soft-deleted
d = distance(query, entry.vector)
if heap.size() < k:
heap.push((d, entry.vector_id))
elif d < heap.peek_max():
heap.replace_max((d, entry.vector_id))
return heap.drain_sorted()
```
### 5.3 Performance Impact
| Operation | Without Deletions | With Deletions | Overhead |
|-----------|-------------------|----------------|----------|
| Bitmap check | N/A | ~2-5 ns (L1/L2 hit) | Per candidate |
| HNSW step (M=16) | ~300-500 ns | ~330-580 ns | +10% |
| Top-K refine (1000) | ~10 μs | ~12 μs | +20% worst |
| Total query | ~50-75 μs | ~55-85 μs | +10-13% |
At typical deletion rates (< 5%), overhead is negligible: the bitmap fits in
L2 cache, graph connectivity is preserved, and the cost is one branch plus
one bitmap load per candidate.
## 6. Deletion Write Path
All deletion operations follow the same two-fsync protocol:
```python
def delete_vectors(file, entries):
"""Soft-delete vectors. entries: list of DeleteVector or DeleteRange."""
# 1. Append JOURNAL_SEG
journal = JournalSegment(
epoch=current_epoch(file),
prev_journal_seg_id=latest_journal_id(file),
entries=entries
)
append_segment(file, journal)
fsync(file) # orphan-safe: no manifest references this yet
# 2. Update deletion bitmap in memory
bitmap = load_deletion_bitmap(file)
for e in entries:
if e.type == DELETE_VECTOR:
bitmap_set(bitmap, e.vector_id)
elif e.type == DELETE_RANGE:
bitmap.add_range(e.start_id, e.end_id)
# 3. Append MANIFEST_SEG with updated bitmap
manifest = build_manifest(file, deletion_bitmap=bitmap)
append_segment(file, manifest)
fsync(file) # deletion now visible to all new readers
```
Single deletes, bulk ranges, and batch deletes all use this path. Batch
operations pack multiple entries into one JOURNAL_SEG to amortize fsync cost.
## 7. Compaction with Deletions
### 7.1 Compaction Process
```
Before:
[VEC_1] [VEC_2] [JOURNAL_1] [VEC_3] [JOURNAL_2] [MANIFEST_5]
0-999 1000- del:42, 3000- del:[1000, bitmap={42,500,
2999 del:500 4999 2000) 1000..1999}
After:
... [MANIFEST_5] [VEC_sealed] [INDEX_new] [MANIFEST_6]
vectors 0-4999 bitmap={}
MINUS deleted (empty for
compacted range)
```
### 7.2 Compaction Algorithm
```python
def compact_with_deletions(file, seg_ids):
bitmap = load_deletion_bitmap(file)
output, id_remap, next_id = [], {}, 0
for seg_id in sorted(seg_ids):
seg = load_segment(file, seg_id)
if seg.seg_type != VEC_SEG:
continue
for vec_id, vector in seg.all_vectors():
if bitmap_check(bitmap, vec_id):
continue # physically exclude
id_remap[vec_id] = next_id
output.append((next_id, vector))
next_id += 1
append_segment(file, VecSegment(flags=SEALED, vectors=output))
remaps = [RemapIdEntry(old, new) for old, new in id_remap.items() if old != new]
if remaps:
append_segment(file, JournalSegment(entries=remaps))
append_segment(file, build_hnsw_index(output))
for old_id in id_remap:
bitmap.remove(old_id)
manifest = build_manifest(file,
tombstone_seg_ids=seg_ids,
deletion_bitmap=bitmap)
append_segment(file, manifest)
fsync(file)
```
### 7.3 Journal Merging
During compaction, JOURNAL_SEGs covering the compacted range are consumed:
| Entry Type | Materialization |
|------------|-----------------|
| DELETE_VECTOR / DELETE_RANGE | Vectors excluded from output |
| UPDATE_METADATA | Applied to output META_SEG |
| MOVE_VECTOR | Tier assignment applied in new manifest |
| REMAP_ID | Chained: old remap composed with new remap |
Consumed JOURNAL_SEGs are tombstoned alongside compacted VEC_SEGs.
### 7.4 Compaction Invariants
| ID | Invariant |
|----|-----------|
| INV-D1 | After compaction, deletion bitmap is empty for compacted range |
| INV-D2 | Sealed output contains only ACTIVE vectors |
| INV-D3 | REMAP_ID entries journaled for every relocated vector |
| INV-D4 | Compacted input segments tombstoned in new manifest |
| INV-D5 | Sealed segments are never modified |
| INV-D6 | Rebuilt indexes exclude deleted nodes |
## 8. Deletion Consistency
### 8.1 Crash Safety
```
Write path:
1. Append JOURNAL_SEG -> fsync crash here: orphan, invisible
2. Append MANIFEST_SEG -> fsync crash here: partial manifest, fallback
Recovery:
- Crash after step 1: JOURNAL_SEG orphaned. No manifest references it.
Reader sees previous manifest. Deletion NOT visible. Orphan cleaned
up by next compaction.
- Crash during step 2: Partial MANIFEST_SEG has bad checksum. Reader
falls back to previous valid manifest. Deletion NOT visible.
- After step 2 success: Manifest durable. Deletion visible.
```
**Guarantee**: Uncommitted deletions never affect readers. Deletion is
atomic at the manifest fsync boundary.
### 8.2 Manifest Chain Visibility
```
MANIFEST_3: bitmap = {}
| JOURNAL_SEG written (delete vector 42)
MANIFEST_4: bitmap = {42} <-- deletion visible from here
| Compaction runs
MANIFEST_5: bitmap = {} <-- vector 42 physically removed
```
A reader holding MANIFEST_3 continues to see vector 42. A reader opening
after MANIFEST_4 will not. This provides snapshot isolation at manifest
granularity.
### 8.3 Multi-File Mode
In multi-file mode, each shard maintains its own deletion bitmap. The
DELETION_BITMAP TLV record supports two modes:
```
+----------------------------------------------+
| mode: u8 |
| 0x00 = SINGLE (one bitmap, inline) |
| 0x01 = SHARDED (per-shard references) |
+----------------------------------------------+
SINGLE (0x00):
| roaring_bitmap: [u8; ...] |
SHARDED (0x01):
| shard_count: u16 |
| For each shard: |
| shard_id: u16 |
| bitmap_offset: u64 (in shard file) |
| bitmap_length: u32 |
| bitmap_hash: hash128 |
+----------------------------------------------+
```
Queries spanning shards load per-shard bitmaps and check each candidate
against its shard's bitmap.
### 8.4 Concurrent Access
One writer at a time (file-level advisory lock). Multiple readers are safe
due to append-only architecture. A reader that opened before a deletion
sees the pre-deletion snapshot until it re-reads the manifest.
## 9. Space Reclamation
| Trigger | Threshold | Action |
|---------|-----------|--------|
| Deletion ratio | > 20% of vectors deleted | Schedule compaction |
| Bitmap size | > 1 MB | Schedule compaction |
| Segment count | > 64 mutable segments | Schedule compaction |
| Manual | User-initiated | Compact immediately |
Space accounting derived from the manifest:
```
total_vector_count: 10,000,000 (Level 0 root manifest)
deleted_vector_count: 150,000 (bitmap cardinality)
active_vector_count: 9,850,000 (total - deleted)
deletion_ratio: 1.5% (below threshold)
wasted_bytes: ~115 MB (150K * 768 B per fp16-384 vector)
```
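This derivation reduces to simple arithmetic over manifest fields; a minimal sketch (function name illustrative):

```python
def space_accounting(total_vectors: int, deleted: int, bytes_per_vector: int):
    """Derive deletion-ratio and wasted-space figures from manifest counts."""
    active = total_vectors - deleted
    ratio = deleted / total_vectors
    wasted_bytes = deleted * bytes_per_vector
    return active, ratio, wasted_bytes

# fp16-384 vectors: 384 dims * 2 bytes = 768 B per vector
active, ratio, wasted = space_accounting(10_000_000, 150_000, 768)
```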
## 10. Summary
### Deletion Protocol
| Step | Action | Durability |
|------|--------|------------|
| 1 | Append JOURNAL_SEG with DELETE entries | fsync (orphan-safe) |
| 2 | Update roaring deletion bitmap | In-memory |
| 3 | Append MANIFEST_SEG with new bitmap | fsync (deletion visible) |
| 4 | Compaction excludes deleted vectors | fsync (physical removal) |
| 5 | File rewrite reclaims space | fsync (space freed) |
### New Wire Format Elements
| Element | Type / Tag | Section |
|---------|------------|---------|
| JOURNAL_SEG | Segment type 0x04 | 3 |
| DELETE_VECTOR | Journal entry 0x01 | 3.4 |
| DELETE_RANGE | Journal entry 0x02 | 3.4 |
| UPDATE_METADATA | Journal entry 0x03 | 3.4 |
| MOVE_VECTOR | Journal entry 0x04 | 3.4 |
| REMAP_ID | Journal entry 0x05 | 3.4 |
| DELETION_BITMAP | Level 1 TLV 0x000E | 4 |
### Invariants
| ID | Invariant |
|----|-----------|
| INV-D1 | After compaction, deletion bitmap is empty for compacted range |
| INV-D2 | Sealed output segments contain only ACTIVE vectors |
| INV-D3 | ID remappings journaled for every compaction-relocated vector |
| INV-D4 | Compacted input segments tombstoned in new manifest |
| INV-D5 | Sealed segments are never modified |
| INV-D6 | Rebuilt indexes exclude deleted nodes |
| INV-D7 | Uncommitted deletions never affect readers (crash safety) |
| INV-D8 | Deletion visibility is atomic at the manifest fsync boundary |
# RVF Filtered Search
## 1. Motivation
Domain profiles declare metadata schemas with indexed fields (e.g., `"organism"` in
RVDNA, `"language"` in RVText, `"node_type"` in RVGraph), but the format provides no
specification for how those indexes are built, stored, or evaluated at query time.
Filtered search is the combination of vector similarity search with metadata
predicates. Without it, a caller must retrieve an over-sized result set and filter
client-side — wasting bandwidth, latency, and recall budget.
This specification adds:
1. **META_SEG** payload layout (segment type 0x07) for storing per-vector metadata
2. **Filter expression language** with a compact binary encoding
3. **Three evaluation strategies** (pre-, post-, and intra-filtering)
4. **METAIDX_SEG** (new segment type 0x0D) for inverted and bitmap indexes
5. **Manifest integration** via a new Level 1 TLV record
6. **Temperature tier coordination** for metadata segments
## 2. META_SEG Payload Layout (Segment Type 0x07)
META_SEG stores the actual metadata values associated with vectors. It uses the
standard 64-byte segment header (see `binary-layout.md` Section 3) with
`seg_type = 0x07`.
```
META_SEG Payload:
+------------------------------------------+
| Meta Header (64 bytes, padded) |
| schema_id: u32 | References PROFILE_SEG schema
| vector_id_range_start: u64 | First vector ID covered
| vector_id_range_end: u64 | Last vector ID covered (inclusive)
| field_count: u16 | Number of fields in this segment
| encoding: u8 | 0 = row-oriented, 1 = column-oriented
| reserved: [u8; 41] | Must be zero
| [64B aligned] |
+------------------------------------------+
| Field Directory |
| For each field (field_count entries): |
| field_id: u16 |
| field_type: u8 |
| flags: u8 |
| field_offset: u32 | Byte offset from payload start
| [64B aligned] |
+------------------------------------------+
| Field Data (column-oriented) |
| (see Section 2.1 for per-type layout) |
+------------------------------------------+
```
### Field Type Enum
```
Value Type Wire Size Description
----- ---- --------- -----------
0x00 string Variable UTF-8, dictionary-encoded in column layout
0x01 u32 4 bytes Unsigned 32-bit integer
0x02 u64 8 bytes Unsigned 64-bit integer
0x03 f32 4 bytes IEEE 754 single-precision float
0x04 enum Variable (packed) Enumeration with defined label set
0x05 bool 1 bit (packed) Boolean
```
### Field Flags
```
Bit Mask Name Meaning
--- ---- ---- -------
0 0x01 INDEXED Field has a corresponding METAIDX_SEG
1 0x02 SORTED Values are stored in sorted order
2 0x04 NULLABLE Null bitmap present before values
3 0x08 STORED Field value returned in query results (not just filterable)
4-7 reserved Must be zero
```
### 2.1 Column-Oriented Field Layouts
Column-oriented encoding (encoding = 1) is the preferred layout. Each field's data
block starts at a 64-byte aligned boundary.
**String fields** (dictionary-encoded):
```
dict_size: u32 Number of distinct strings
For each dict entry:
length: u16 Byte length of UTF-8 string
bytes: [u8; length] UTF-8 encoded string
[4B aligned after dictionary]
codes: [varint; vector_count] Dictionary code per vector
[64B aligned]
```
Dictionary codes are 0-indexed into the dictionary array. Code `0xFFFFFFFF` (max
varint value for u32 range) represents null if the NULLABLE flag is set.
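The dictionary-encoding step for string columns can be sketched as follows (illustrative, not normative — it builds the distinct-string dictionary and the per-vector code array):

```python
def dict_encode(values):
    """Build a string dictionary and per-vector codes (Section 2.1)."""
    dictionary, codes, index = [], [], {}
    for v in values:
        if v not in index:
            index[v] = len(dictionary)  # first occurrence: assign next code
            dictionary.append(v)
        codes.append(index[v])
    return dictionary, codes
```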
**Numeric fields** (u32, u64, f32 -- direct array):
```
If NULLABLE:
null_bitmap: [u8; ceil(vector_count / 8)] Bit-packed, 1 = present, 0 = null
[8B aligned]
values: [field_type; vector_count] Dense array of values
[64B aligned]
```
Values for null entries are zero-filled but must not be relied upon.
**Enum fields** (bit-packed):
```
enum_count: u8 Number of enum labels
For each enum label:
length: u8 Byte length of label
bytes: [u8; length] UTF-8 label string
bits_per_code: u8 ceil(log2(enum_count))
codes: packed bit array bits_per_code bits per vector
[ceil(vector_count * bits_per_code / 8) bytes]
[64B aligned]
```
For example, an enum with 3 values (`"+", "-", "."`) uses 2 bits per vector.
1M vectors = 250 KB.
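The bit-packed column size follows directly from `bits_per_code = ceil(log2(enum_count))`; a small sketch:

```python
import math

def enum_column_bytes(vector_count: int, enum_count: int) -> int:
    """Size of a bit-packed enum column, per the layout above."""
    bits_per_code = max(1, math.ceil(math.log2(enum_count)))
    return math.ceil(vector_count * bits_per_code / 8)
```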
**Bool fields** (bit-packed):
```
If NULLABLE:
null_bitmap: [u8; ceil(vector_count / 8)]
[8B aligned]
values: [u8; ceil(vector_count / 8)] Bit-packed, 1 = true, 0 = false
[64B aligned]
```
### 2.2 Sorted Index (Inline)
For fields with the SORTED flag, an additional sorted permutation index follows
the field data:
```
sorted_count: u32 Must equal vector_count
sorted_order: [varint delta-encoded] Vector IDs in ascending value order
restart_interval: u16 Restart every N entries (default 128)
restart_offsets: [u32; ceil(sorted_count / restart_interval)]
[64B aligned]
```
This enables binary search over field values for range queries without requiring
a separate METAIDX_SEG. It is suitable for fields where a full inverted index
would be wasteful (high cardinality numeric fields like `position_start`).
## 3. Filter Expression Language
### 3.1 Abstract Syntax
A filter expression is a tree of predicates combined with boolean logic:
```
expr ::= field_ref CMP literal -- comparison
| field_ref IN literal_set -- set membership
| field_ref PREFIX string_lit -- string prefix match
| field_ref CONTAINS string_lit -- substring containment
| expr AND expr -- conjunction
| expr OR expr -- disjunction
| NOT expr -- negation
```
### 3.2 Binary Encoding (Postfix / RPN)
Filter expressions are encoded as a postfix (Reverse Polish Notation) token stream
for stack-based evaluation. This avoids the need for recursive parsing and enables
single-pass evaluation with a fixed-size stack.
```
Filter Expression Binary Layout:
header:
node_count: u16 Total number of tokens
stack_depth: u8 Maximum stack depth required
reserved: u8 Must be zero
tokens (postfix order):
For each token:
node_type: u8 Token type (see enum below)
payload: type-specific Variable-size payload
```
### Token Type Enum
```
Value Name Stack Effect Payload
----- ---- ------------ -------
0x01 FIELD_REF push +1 field_id: u16
0x02 LIT_U32 push +1 value: u32
0x03 LIT_U64 push +1 value: u64
0x04 LIT_F32 push +1 value: f32
0x05 LIT_STR push +1 length: u16, bytes: [u8; length]
0x06 LIT_BOOL push +1 value: u8 (0 or 1)
0x07 LIT_NULL push +1 (no payload)
0x10 CMP_EQ pop 2, push 1 (no payload) -- a == b
0x11 CMP_NE pop 2, push 1 (no payload) -- a != b
0x12 CMP_LT pop 2, push 1 (no payload) -- a < b
0x13 CMP_LE pop 2, push 1 (no payload) -- a <= b
0x14 CMP_GT pop 2, push 1 (no payload) -- a > b
0x15 CMP_GE pop 2, push 1 (no payload) -- a >= b
0x20 IN_SET pop 1, push 1 set_size: u16, [encoded values]
0x21 PREFIX pop 2, push 1 (no payload) -- string prefix
0x22 CONTAINS pop 2, push 1 (no payload) -- substring match
0x30 AND pop 2, push 1 (no payload)
0x31 OR pop 2, push 1 (no payload)
0x32 NOT pop 1, push 1 (no payload)
```
### 3.3 Encoding Example
Filter: `organism = "E. coli" AND position_start >= 1000`
```
Token 0: FIELD_REF field_id=0 (organism) stack: [organism_val]
Token 1: LIT_STR "E. coli" stack: [organism_val, "E. coli"]
Token 2: CMP_EQ stack: [true/false]
Token 3: FIELD_REF field_id=3 (position_start) stack: [bool, pos_val]
Token 4: LIT_U64 1000 stack: [bool, pos_val, 1000]
Token 5: CMP_GE stack: [bool, true/false]
Token 6: AND stack: [result]
Binary: node_count=7, stack_depth=3
01 00:00 05 00:07 "E. coli" 10 01 00:03 03 00:00:00:00:00:00:03:E8 15 30
```
### 3.4 Evaluation
Evaluation processes tokens left to right using a fixed-size boolean/value stack:
```python
def evaluate(tokens, vector_id, metadata):
stack = []
for token in tokens:
if token.type == FIELD_REF:
stack.append(metadata.get_value(vector_id, token.field_id))
elif token.type in (LIT_U32, LIT_U64, LIT_F32, LIT_STR, LIT_BOOL, LIT_NULL):
stack.append(token.value)
elif token.type in (CMP_EQ, CMP_NE, CMP_LT, CMP_LE, CMP_GT, CMP_GE):
b, a = stack.pop(), stack.pop()
stack.append(compare(a, token.type, b))
elif token.type == IN_SET:
a = stack.pop()
stack.append(a in token.value_set)
elif token.type in (PREFIX, CONTAINS):
b, a = stack.pop(), stack.pop()
stack.append(string_match(a, token.type, b))
elif token.type == AND:
b, a = stack.pop(), stack.pop()
stack.append(a and b)
elif token.type == OR:
b, a = stack.pop(), stack.pop()
stack.append(a or b)
elif token.type == NOT:
stack.append(not stack.pop())
return stack[0]
```
Maximum stack depth is declared in the header so the evaluator can pre-allocate.
Implementations must reject expressions with `stack_depth > 16`.
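A validator can compute the required depth in one pass using the stack effects from the token table. This is a sketch; the token constants are assumed to match Section 3.2, and only token types (not payloads) are inspected:

```python
# Stack effect per token type: literals/field refs push +1,
# binary ops pop 2 push 1 (net -1), unary ops pop 1 push 1 (net 0).
PUSH_TOKENS   = {0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07}
BINARY_TOKENS = {0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x21, 0x22, 0x30, 0x31}
UNARY_TOKENS  = {0x20, 0x32}  # IN_SET, NOT

def validate_expression(token_types, max_depth=16):
    """Return peak stack depth; raise on malformed or too-deep expressions."""
    depth = peak = 0
    for t in token_types:
        if t in PUSH_TOKENS:
            depth += 1
        elif t in BINARY_TOKENS:
            depth -= 1
        elif t not in UNARY_TOKENS:
            raise ValueError(f"unknown token 0x{t:02X}")
        if depth < 1:
            raise ValueError("stack underflow")
        peak = max(peak, depth)
    if depth != 1:
        raise ValueError("expression must leave exactly one result")
    if peak > max_depth:
        raise ValueError("stack_depth exceeds limit")
    return peak
```

Run against the Section 3.3 example (FIELD_REF, LIT_STR, CMP_EQ, FIELD_REF, LIT_U64, CMP_GE, AND), this yields the declared `stack_depth = 3`.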
## 4. Filter Evaluation Strategies
The runtime selects one of three strategies based on the estimated **selectivity**
of the filter (the fraction of vectors passing the filter).
### 4.1 Pre-Filtering (Selectivity < 1%)
Build the candidate ID set from metadata indexes first, then run vector search
only on the filtered subset.
```
1. Evaluate filter using METAIDX_SEG inverted/bitmap indexes
2. Collect matching vector IDs into a candidate set C
3. If |C| < ef_search:
Flat scan all candidates, return top-K
Else:
Build temporary flat index over C, run HNSW search restricted to C
4. Return top-K results
```
**Tradeoffs**:
- Optimal when the candidate set is very small (hundreds to low thousands)
- Risk: if the candidate set is disconnected in the HNSW graph, search cannot
traverse from entry points to candidates. The flat scan fallback handles this.
- Memory: candidate set bitmap = `ceil(total_vectors / 8)` bytes
### 4.2 Post-Filtering (Selectivity > 20%)
Run standard HNSW search with over-retrieval, then filter results.
```
1. Compute over_retrieval_factor = min(1.0 / selectivity, 10.0)
2. Set ef_search_adj = ef_search * over_retrieval_factor
3. Run standard HNSW search with ef_search_adj
4. Filter result set by evaluating filter expression per candidate
5. Return top-K from filtered results
```
**Tradeoffs**:
- Optimal when the filter passes most vectors (minimal wasted computation)
- Risk: if over-retrieval factor is too low, fewer than K results survive filtering.
The caller should retry with a higher factor or fall back to intra-filtering.
- No modification to HNSW traversal logic required.
### 4.3 Intra-Filtering (1% <= Selectivity <= 20%)
Evaluate the filter during HNSW traversal, skipping nodes that fail the predicate.
```python
def filtered_hnsw_search(query, filter_expr, entry_point, ef_search, k):
candidates = MaxHeap() # top-K results (max-heap by distance)
worklist = MinHeap() # exploration frontier (min-heap by distance)
visited = BitSet()
filtered_skips = 0
max_skips = ef_search * 3 # backoff threshold
worklist.push((distance(query, entry_point), entry_point))
visited.add(entry_point)
while worklist and filtered_skips < max_skips:
dist, node = worklist.pop()
# Check filter predicate
if not evaluate(filter_expr, node, metadata):
filtered_skips += 1
# Still expand neighbors (maintain graph connectivity)
neighbors = get_neighbors(node)
for n in neighbors:
if n not in visited:
visited.add(n)
d = distance(query, get_vector(n))
worklist.push((d, n))
continue
filtered_skips = 0 # reset skip counter on successful match
candidates.push((dist, node))
if len(candidates) > k:
candidates.pop() # evict worst
# Expand neighbors
neighbors = get_neighbors(node)
for n in neighbors:
if n not in visited:
visited.add(n)
d = distance(query, get_vector(n))
                if len(candidates) < ef_search or d < candidates.peek_max():
worklist.push((d, n))
return candidates.top_k(k)
```
**Key design decisions**:
1. **Skipped nodes still expand neighbors**: This preserves graph connectivity.
A node that fails the filter may have neighbors that pass it.
2. **Skip counter with backoff**: If too many consecutive nodes fail the filter,
the search is exhausting the local neighborhood without finding matches. The
`max_skips` threshold triggers termination to avoid unbounded traversal.
3. **Adaptive ef expansion**: When `filtered_skips > ef_search`, the effective
search frontier is larger than requested, compensating for filtered-out nodes.
### 4.4 Strategy Selection
```
selectivity = estimate_selectivity(filter_expr, metaidx_stats)
if selectivity < 0.01:
strategy = PRE_FILTER
elif selectivity > 0.20:
strategy = POST_FILTER
else:
strategy = INTRA_FILTER
```
Selectivity estimation uses statistics stored in the METAIDX_SEG header:
- **Inverted index**: `posting_list_length / total_vectors` per term
- **Bitmap index**: `popcount(bitmap) / total_vectors` per enum value
- **Range tree**: count of values in range / total_vectors
For compound filters (AND/OR), selectivity is estimated using independence
assumption: `P(A AND B) = P(A) * P(B)`, `P(A OR B) = P(A) + P(B) - P(A) * P(B)`.
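The strategy thresholds and the independence-based estimators can be sketched as:

```python
def est_and(p_a: float, p_b: float) -> float:
    return p_a * p_b  # independence assumption

def est_or(p_a: float, p_b: float) -> float:
    return p_a + p_b - p_a * p_b

def choose_strategy(selectivity: float) -> str:
    """Map estimated selectivity to a filter evaluation strategy (Section 4.4)."""
    if selectivity < 0.01:
        return "PRE_FILTER"
    if selectivity > 0.20:
        return "POST_FILTER"
    return "INTRA_FILTER"
```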
## 5. METAIDX_SEG (Segment Type 0x0D)
METAIDX_SEG stores secondary indexes over metadata fields for fast predicate
evaluation. Each METAIDX_SEG covers one field. The segment type enum value 0x0D
is allocated from the reserved range (see `binary-layout.md` Section 3).
```
METAIDX_SEG Payload:
+------------------------------------------+
| Index Header (64 bytes, padded) |
| field_id: u16 | Field being indexed
| index_type: u8 | 0=inverted, 1=range_tree, 2=bitmap
| field_type: u8 | Mirrors META_SEG field_type
| total_vectors: u64 | Vectors covered by this index
| unique_values: u64 | Cardinality (distinct values)
| reserved: [u8; 44] |
| [64B aligned] |
+------------------------------------------+
| Index Data (type-specific) |
+------------------------------------------+
```
### 5.1 Inverted Index (index_type = 0)
Best for: string fields with moderate cardinality (100 to 100K distinct values).
```
term_count: u32
For each term (sorted by encoded value):
term_length: u16
term_bytes: [u8; term_length] Encoded value (UTF-8 for strings)
posting_length: u32 Number of vector IDs
postings: [varint delta-encoded] Sorted vector IDs
[8B aligned after postings]
[64B aligned]
```
Posting lists use varint delta encoding identical to the ID encoding in VEC_SEG
(see `binary-layout.md` Section 5). Restart points every 128 entries enable
binary search within a posting list for intersection operations.
### 5.2 Range Tree (index_type = 1)
Best for: numeric fields requiring range queries (u32, u64, f32).
```
page_size: u32 Fixed 4096 bytes (4 KB, one disk page)
page_count: u32
root_page: u32 Page index of B+ tree root
tree_height: u8
reserved: [u8; 51]
[64B aligned]
Internal Page (4096 bytes):
page_type: u8 (0 = internal)
key_count: u16
keys: [field_type; key_count] Separator keys
children: [u32; key_count + 1] Child page indices
[zero-padded to 4096]
Leaf Page (4096 bytes):
page_type: u8 (1 = leaf)
entry_count: u16
prev_leaf: u32 Linked-list pointer for range scan
next_leaf: u32
entries:
For each entry:
value: field_type The metadata value
vector_id: u64 Associated vector ID
[zero-padded to 4096]
```
Leaf pages form a doubly-linked list for efficient range scans. A range query
`position_start >= 1000 AND position_start <= 5000` descends the tree to find
the first leaf with value >= 1000, then scans forward until value > 5000.
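The leaf-level range scan reduces to a binary search plus a forward walk. A sketch over a flattened sorted array standing in for the linked 4 KB leaf pages:

```python
import bisect

def range_scan(sorted_entries, lo, hi):
    """Return vector IDs whose value lies in [lo, hi].

    sorted_entries: list of (value, vector_id) sorted by value,
    standing in for the doubly-linked leaf pages.
    """
    i = bisect.bisect_left(sorted_entries, (lo,))  # first entry with value >= lo
    out = []
    while i < len(sorted_entries) and sorted_entries[i][0] <= hi:
        out.append(sorted_entries[i][1])
        i += 1
    return out
```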
### 5.3 Bitmap Index (index_type = 2)
Best for: enum and bool fields with low cardinality (< 64 distinct values).
```
value_count: u8 Number of distinct enum/bool values
For each value:
value_label_len: u8
value_label: [u8; value_label_len] The enum label or "true"/"false"
bitmap_format: u8 0 = raw, 1 = roaring
bitmap_length: u32 Byte length of bitmap data
bitmap_data: [u8; bitmap_length] Bitmap of matching vector IDs
[8B aligned]
[64B aligned]
```
**Raw bitmaps** are used when `total_vectors < 8192` (1 KB per bitmap).
**Roaring bitmaps** are used for larger datasets. The roaring format stores
the bitmap as a set of containers (array, bitmap, or run-length) per 64K chunk.
This matches the industry-standard Roaring bitmap serialization (compatible with
CRoaring / roaring-rs wire format).
Bitmap intersection and union operations map directly to AND/OR filter predicates
using SIMD bitwise operations. For 10M vectors:
```
Raw bitmap: ~1.2 MB per value (impractical for many values)
Roaring bitmap: 100 KB - 1 MB per value depending on density
AND/OR: ~0.1 ms per operation (AVX-512 on 1 MB bitmap)
```
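The raw-bitmap predicate evaluation can be modeled compactly: below, Python's arbitrary-precision integers stand in for the packed bit arrays, and the `&`/`|` operators stand in for the SIMD word-wise AND/OR a real reader would use.

```python
def make_bitmap(ids):
    """Build a raw bitmap (bit i set iff vector i matches)."""
    bm = 0
    for i in ids:
        bm |= 1 << i
    return bm

def bitmap_ids(bm):
    """Expand a bitmap back into a sorted list of vector IDs."""
    ids, i = [], 0
    while bm:
        if bm & 1:
            ids.append(i)
        bm >>= 1
        i += 1
    return ids
```

An AND filter such as `chromosome = "1" AND language = "en"` then reduces to a single bitwise AND of the two value bitmaps.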
## 6. Level 1 Manifest Addition
### Tag 0x000F: METADATA_INDEX_DIR
A new TLV record in the Level 1 manifest (see `02-manifest-system.md` Section 3)
that maps indexed metadata fields to their METAIDX_SEG segment IDs.
```
Tag: 0x000F
Name: METADATA_INDEX_DIR
Payload:
entry_count: u16
For each entry:
field_id: u16 Matches META_SEG field_id
field_name_len: u8
field_name: [u8; field_name_len] UTF-8 field name for debugging
index_seg_id: u64 Segment ID of METAIDX_SEG
index_type: u8 0=inverted, 1=range_tree, 2=bitmap
stats:
total_vectors: u64
unique_values: u64
min_posting_len: u32 Smallest posting list size
max_posting_len: u32 Largest posting list size
```
This allows the query planner to estimate selectivity without reading the
METAIDX_SEG segments themselves. The `min_posting_len` and `max_posting_len`
fields provide bounds for cardinality estimation.
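A parser for the tag 0x000F payload follows directly from the field list. This sketch assumes a packed little-endian layout with no padding between fields, which the record definition implies but does not state explicitly.

```python
import struct

def parse_metadata_index_dir(payload):
    """Parse the METADATA_INDEX_DIR (tag 0x000F) payload into dicts."""
    (entry_count,) = struct.unpack_from("<H", payload, 0)
    off, entries = 2, []
    for _ in range(entry_count):
        field_id, name_len = struct.unpack_from("<HB", payload, off)
        off += 3
        name = payload[off:off + name_len].decode("utf-8")
        off += name_len
        seg_id, index_type, total, unique, min_pl, max_pl = struct.unpack_from(
            "<QBQQII", payload, off)   # 8+1+8+8+4+4 = 33 bytes
        off += 33
        entries.append({"field_id": field_id, "name": name,
                        "index_seg_id": seg_id, "index_type": index_type,
                        "total_vectors": total, "unique_values": unique,
                        "min_posting_len": min_pl, "max_posting_len": max_pl})
    return entries
```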
### Updated Record Types Table
```
Tag Name Description
--- ---- -----------
0x0001 SEGMENT_DIR Array of segment directory entries
0x0002 TEMP_TIER_MAP Temperature tier assignments per block
...
0x000D KEY_DIRECTORY Encryption key references
0x000E (reserved)
0x000F METADATA_INDEX_DIR Metadata field -> METAIDX_SEG mapping
```
## 7. Performance Analysis
### 7.1 Filter Strategy vs Selectivity vs Recall
| Selectivity | Strategy | Recall@10 | Latency (10M vectors) | Notes |
|-------------|----------|-----------|----------------------|-------|
| 0.001% (100 matches) | Pre-filter | 1.00 | 0.02 ms | Flat scan on 100 candidates |
| 0.01% (1K matches) | Pre-filter | 0.99 | 0.08 ms | Flat scan on 1K candidates |
| 0.1% (10K matches) | Pre-filter | 0.98 | 0.5 ms | Mini-HNSW on 10K candidates |
| 1% (100K matches) | Intra-filter | 0.96 | 0.12 ms | ~10% node skip overhead |
| 5% (500K matches) | Intra-filter | 0.95 | 0.08 ms | ~5% node skip overhead |
| 10% (1M matches) | Intra-filter | 0.94 | 0.06 ms | Minimal skip overhead |
| 20% (2M matches) | Post-filter | 0.95 | 0.10 ms | 5x over-retrieval |
| 50% (5M matches) | Post-filter | 0.97 | 0.06 ms | 2x over-retrieval |
| 100% (no filter) | None | 0.98 | 0.04 ms | Baseline unfiltered |
### 7.2 Memory Overhead of Metadata Indexes
For 10M vectors with the RVDNA profile (5 indexed fields):
| Field | Type | Cardinality | Index Type | Size |
|-------|------|-------------|------------|------|
| organism | string | ~50K | Inverted | ~80 MB |
| gene_id | string | ~500K | Inverted | ~120 MB |
| chromosome | string | ~25 | Bitmap (roaring) | ~12 MB |
| position_start | u64 | ~10M | Range tree | ~160 MB |
| position_end | u64 | ~10M | Range tree | ~160 MB |
| **Total** | | | | **~532 MB** |
As a fraction of vector data (10M * 384 dim * fp16 = 7.2 GB): **~7.4% overhead**.
For the RVText profile (2 indexed fields, typically lower cardinality):
| Field | Type | Cardinality | Index Type | Size |
|-------|------|-------------|------------|------|
| source_url | string | ~100K | Inverted | ~90 MB |
| language | string | ~50 | Bitmap (roaring) | ~8 MB |
| **Total** | | | | **~98 MB** |
Overhead: **~1.4%** of vector data.
### 7.3 Query Latency Breakdown (Filtered Intra-Search)
```
Phase Time Notes
----- ---- -----
Parse filter expression 0.5 us Stack-based, no allocation
Estimate selectivity 1.0 us Read manifest stats
Load METAIDX_SEG (if cold) 50-200 us First query only; cached after
HNSW traversal (150 steps) 45 us Baseline unfiltered
+ filter eval per node +12 us ~80 ns per eval * 150 nodes
+ skip expansion +8 us ~20% more nodes visited at 5% sel.
Top-K collection 10 us Heap operations
--------
Total (warm cache) ~76 us
Total (cold start) ~276 us
```
## 8. Integration with Temperature Tiering
Metadata follows the same temperature model as vector data (see
`03-temperature-tiering.md`), but with its own tier assignments.
### 8.1 Hot Metadata
Indexed fields for hot-tier vectors are kept resident in memory:
- **Bitmap indexes** for low-cardinality fields (enum, bool) are always hot.
Total size is bounded: `cardinality * ceil(hot_vectors / 8)` bytes. For 100K
hot vectors and 25 enum values: 25 * 12.5 KB = 312 KB.
- **Inverted index posting lists** are cached using an LRU policy keyed by
(field_id, term). Frequently queried terms (e.g., `language = "en"`) remain
resident.
- **Range tree pages** follow the standard B+ tree buffer pool model. Hot pages
(root + first two levels) are pinned. Leaf pages are demand-paged.
### 8.2 Cold Metadata
Cold metadata covers vectors that are rarely accessed:
- META_SEG data for cold vectors is compressed with ZSTD (level 9+) and stored
in cold-tier segments.
- METAIDX_SEG posting lists for cold vectors are not loaded until a query
specifically requests them.
- When a filter matches only cold vectors (detected via the temperature tier
map), the runtime issues a warning: filtered search on cold data may require
decompression latency of 10-100 ms.
### 8.3 Compaction Coordination
When temperature-aware compaction reorganizes vector segments (see
`03-temperature-tiering.md` Section 4), metadata must follow:
```
1. Identify vectors moving between tiers
2. Rewrite META_SEG for affected vector ID ranges
3. Rebuild METAIDX_SEG posting lists (vector IDs may be renumbered during
compaction if the COMPACTION_RENUMBER flag is set)
4. Update METADATA_INDEX_DIR in the new manifest
5. Tombstone old META_SEG and METAIDX_SEG segments
```
Metadata compaction piggybacks on vector compaction -- it never triggers
independently. This ensures metadata and vector segments remain in consistent
temperature tiers.
### 8.4 Metadata-Aware Promotion
When a filter query frequently accesses metadata for warm-tier vectors, those
metadata segments are candidates for promotion to hot tier. The access sketch
(SKETCH_SEG) tracks metadata segment accesses alongside vector accesses:
```
sketch_key = (META_SEG_ID << 32) | block_id
```
This reuses the existing sketch infrastructure without modification.
## 9. Wire Protocol: Filtered Query Message
For completeness, the filter expression is carried in the query message as a
tagged field. The query wire format is outside the scope of the storage spec,
but the filter payload is defined here for interoperability.
```
Query Message Filter Field:
tag: u16 (0x0040 = FILTER)
length: u32
filter_version: u8 (1)
filter_payload: [u8; length - 1] Binary filter expression (Section 3.2)
```
Implementations that do not support filtered search must ignore tag 0x0040 and
return unfiltered results. This preserves backward compatibility.
## 10. Implementation Notes
### 10.1 Index Selection Heuristics
When building indexes for a new META_SEG field, implementations should select
the index type automatically:
```
if field_type in (enum, bool) and cardinality < 64:
index_type = BITMAP
elif field_type in (u32, u64, f32):
index_type = RANGE_TREE
else:
index_type = INVERTED
```
Fields without the `"indexed": true` property in the profile schema must not
have METAIDX_SEG segments built. They are stored in META_SEG for retrieval
only (the STORED flag).
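A runnable version of the heuristic above, including the `indexed` check. Field types are plain strings here; a real implementation would use the META_SEG field-type enum.

```python
INVERTED, RANGE_TREE, BITMAP = 0, 1, 2   # index_type values from Section 5

def select_index_type(field_type, cardinality, indexed=True):
    """Pick an index_type for a metadata field, or None for STORED-only fields."""
    if not indexed:
        return None                      # no METAIDX_SEG is built
    if field_type in ("enum", "bool") and cardinality < 64:
        return BITMAP
    if field_type in ("u32", "u64", "f32"):
        return RANGE_TREE
    return INVERTED
```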
### 10.2 Posting List Intersection
For AND filters on multiple indexed fields, posting list intersection is
performed using a merge-based algorithm on sorted, delta-decoded posting lists:
```
Sorted Intersection (two-pointer merge):
Time: O(min(|A|, |B|)) with skip-ahead via restart points
Practical: ~100 ns per 1000 common elements (SIMD comparison)
```
For OR filters, posting list union uses a similar merge with deduplication.
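The AND-path merge can be sketched as a classic two-pointer intersection; the restart-point skip-ahead and SIMD comparison are elided, so this version is O(|A| + |B|).

```python
def intersect(a, b):
    """Intersect two sorted, duplicate-free posting lists."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i]); i += 1; j += 1
        elif a[i] < b[j]:
            i += 1                       # advance the smaller head
        else:
            j += 1
    return out
```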
### 10.3 Null Handling
- `FIELD_REF` for a null value pushes a sentinel NULL onto the stack
- `CMP_EQ NULL` returns true only for null values
- `CMP_NE NULL` returns true for all non-null values
- All other comparisons against NULL return false (SQL-style three-valued logic)
- `IN_SET` never matches NULL unless NULL is explicitly in the set
---
# RVF Concurrency, Versioning, and Space Reclamation
## 1. Single-Writer / Multi-Reader Model
RVF uses a **single-writer, multi-reader** concurrency model. At most one process
may append segments to an RVF file at any time. Any number of readers may operate
concurrently with each other and with the writer. This model is enforced by an
advisory lock file, not by OS-level mandatory locking.
| Concern | Advisory Lock | Mandatory Lock (flock/fcntl) |
|---------|---------------|------------------------------|
| NFS compatibility | Works (lock file is a regular file) | Broken on many NFS configs |
| Crash recovery | Stale lock detectable by PID check | Kernel auto-releases, but only locally |
| Cross-language | Any language can create a file | Requires OS-specific syscalls |
| Visibility | Lock state inspectable by humans | Opaque kernel state |
| Multi-file mode | One lock covers all shards | Would need per-shard locks |
## 2. Writer Lock File
The writer lock is a file named `<basename>.rvf.lock` in the same directory as the
RVF file. For example, `data.rvf` uses `data.rvf.lock`.
### Binary Layout
```
Offset Size Field Description
------ ---- ----- -----------
0x00 4 magic 0x52564C46 ("RVLF" in ASCII)
0x04 4 pid Writer process ID (u32)
0x08 64 hostname Null-terminated hostname (max 63 chars + null)
0x48 8 timestamp_ns Lock acquisition time (nanosecond UNIX timestamp)
0x50 16 writer_id Random UUID (128-bit, written as raw bytes)
0x60 4 lock_version Lock protocol version (currently 1)
0x64 4 checksum CRC32C of bytes 0x00-0x63
```
**Total**: 104 bytes.
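Packing the lock file is a straightforward struct encode plus a CRC-32C trailer. Two assumptions in this sketch: the magic is written as a little-endian u32, and the CRC is the reflected Castagnoli variant (polynomial 0x1EDC6F41); the bitwise CRC here is slow but dependency-free.

```python
import struct, time, uuid

def crc32c(data):
    """Reflected CRC-32C (Castagnoli). Real code would use a hardware path."""
    crc = 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = (crc >> 1) ^ (0x82F63B78 if crc & 1 else 0)
    return crc ^ 0xFFFFFFFF

def build_lock_file(pid, hostname, writer_id=None):
    """Pack the 104-byte lock file described above."""
    body = struct.pack(
        "<II64sQ16sI",
        0x52564C46,                      # magic "RVLF" (assumed LE u32)
        pid,
        hostname.encode()[:63],          # struct null-pads to 64 bytes
        time.time_ns(),                  # timestamp_ns
        (writer_id or uuid.uuid4()).bytes,
        1,                               # lock_version
    )
    return body + struct.pack("<I", crc32c(body))
```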
### Lock Acquisition Protocol
```
1. Construct lock file content (magic, PID, hostname, timestamp, random UUID)
2. Compute CRC32C over bytes 0x00-0x63, store at 0x64
3. Attempt open("<basename>.rvf.lock", O_CREAT | O_EXCL | O_WRONLY)
4. If open succeeds:
a. Write 104 bytes
b. fsync
c. Lock acquired — proceed with writes
5. If open fails (EEXIST):
a. Read existing lock file
b. Validate magic and checksum
c. If invalid: delete stale lock, retry from step 3
d. If valid: run stale lock detection (see below)
e. If stale: delete lock, retry from step 3
f. If not stale: lock acquisition fails — another writer is active
```
The `O_CREAT | O_EXCL` combination is atomic on POSIX filesystems, preventing
two processes from simultaneously creating the lock.
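Steps 3-4 of the protocol map directly onto `os.open` with the exclusive-create flags; a minimal sketch (lock-content construction and the EEXIST fallback path are left to the caller):

```python
import os

def try_acquire_lock(lock_path, lock_bytes):
    """Attempt exclusive creation of the lock file. Returns True on success;
    False means a lock already exists (proceed to stale-lock detection)."""
    try:
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY, 0o644)
    except FileExistsError:              # EEXIST: another writer holds (or held) it
        return False
    try:
        os.write(fd, lock_bytes)
        os.fsync(fd)                     # durable before we act as the writer
    finally:
        os.close(fd)
    return True
```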
### Stale Lock Detection
A lock is considered stale when **both** of the following are true:
1. **PID is dead**: `kill(pid, 0)` returns `ESRCH` (process does not exist), OR
the hostname does not match the current host (remote crash)
2. **Age exceeds threshold**: `now_ns - timestamp_ns > 30_000_000_000` (30 seconds)
The age check prevents a race where a PID is recycled by the OS. A lock younger
than 30 seconds is never considered stale, even if the PID appears dead, because
PID reuse on modern systems can occur within milliseconds.
If the hostname differs from the current host, the PID check is not meaningful.
In this case, only the age threshold applies. Implementations SHOULD use a longer
threshold (300 seconds) for cross-host lock recovery to account for clock skew.
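The two-part staleness test can be sketched directly; `kill(pid, 0)` probes process existence without delivering a signal, and EPERM (the process exists but belongs to another user) counts as alive.

```python
import os

STALE_AGE_NS = 30_000_000_000            # 30 s same-host threshold
REMOTE_AGE_NS = 300_000_000_000          # 300 s cross-host threshold

def pid_alive(pid):
    try:
        os.kill(pid, 0)
        return True
    except ProcessLookupError:           # ESRCH: no such process
        return False
    except PermissionError:              # EPERM: exists, different user
        return True

def lock_is_stale(pid, lock_host, lock_ts_ns, local_host, now_ns):
    if lock_host != local_host:
        return now_ns - lock_ts_ns > REMOTE_AGE_NS   # PID check meaningless
    return (not pid_alive(pid)) and now_ns - lock_ts_ns > STALE_AGE_NS
```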
### Lock Release Protocol
```
1. fsync all pending data and manifest segments
2. Verify the lock file still contains our writer_id (re-read and compare)
3. If writer_id matches: unlink("<basename>.rvf.lock")
4. If writer_id does not match: abort — another process stole the lock
```
Step 2 prevents a writer from deleting a lock that was legitimately taken over
after a stale lock recovery by another process.
If a writer crashes without releasing the lock, the lock file persists on disk.
The next writer detects the orphan via stale lock detection and reclaims it.
No data corruption occurs because the append-only segment model guarantees that
partial writes are detectable: a segment with a bad content hash or a truncated
manifest is simply ignored.
## 3. Reader-Writer Coordination
Readers and writers operate independently. The append-only architecture ensures
they never conflict.
### Reader Protocol
```
1. Open file (read-only, no lock required)
2. Read Level 0 root manifest (last 4096 bytes)
3. Parse hotset pointers and Level 1 offset
4. This manifest snapshot defines the reader's view of the file
5. All queries within this session use the snapshot
6. To see new data: re-read Level 0 (explicit refresh)
```
### Writer Protocol
```
1. Acquire lock (Section 2)
2. Read current manifest to learn segment directory state
3. Append new segments (VEC_SEG, INDEX_SEG, etc.)
4. Append new MANIFEST_SEG referencing all live segments
5. fsync
6. Release lock (Section 2)
```
### Concurrent Timeline
```
Time Writer Reader A Reader B
---- ------ -------- --------
t=0 Acquires lock
t=1 Appends VEC_SEG_4 Opens file
t=2 Appends VEC_SEG_5 Opens file Reads manifest M3
t=3 Appends MANIFEST_SEG M4 Reads manifest M3 Queries (sees M3)
t=4 fsync, releases lock Queries (sees M3) Queries (sees M3)
t=5 Queries (sees M3) Refreshes -> M4
t=6 Refreshes -> M4 Queries (sees M4)
```
Reader A opened during the write but read manifest M3 (already stable) and never
sees partially written segments. Reader B sees M3 until explicit refresh. Neither
reader is blocked; the writer is never blocked by readers.
### Snapshot Isolation Guarantees
A reader holding a manifest snapshot is guaranteed:
1. All referenced segments are fully written and fsynced
2. Segment content hashes match (the manifest would not reference broken segments)
3. The snapshot is internally consistent (no partial epoch states)
4. The snapshot remains valid for the lifetime of the open file descriptor, even
if the file is compacted and replaced (old inode persists until close)
## 4. Format Versioning
RVF uses explicit version fields at every structural level. The versioning rules
are designed for forward compatibility — older readers can safely process files
produced by newer writers, with graceful degradation.
### Segment Version Compatibility
The segment header `version` field (offset 0x04, currently `1`) governs
segment-level compatibility.
| Rule | Description |
|------|-------------|
| S1 | A v1 reader MUST successfully process all v1 segments |
| S2 | A v1 reader MUST skip segments with version > 1 |
| S3 | A v1 reader MUST log a warning when skipping unknown versions |
| S4 | A v1 reader MUST NOT reject a file because it contains unknown-version segments |
| S5 | A v2+ writer MUST write a root manifest readable by v1 readers (if the root manifest format allows it) |
| S6 | A v2+ writer MAY write segments with version > 1 |
| S7 | Readers MUST use `payload_length` from the segment header to skip unknown segments |
Skipping works because the segment header layout is stable: magic, version,
seg_type, and payload_length occupy fixed offsets. A reader skips unknown
segments by seeking past `64 + payload_length` bytes (header + payload).
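The skip loop can be sketched as below. The offsets of `magic`, `version`, and `seg_type` follow the stated fixed layout; the `payload_length` position (assumed here to be a u32 at header offset 0x08) is an assumption for illustration — the authoritative offsets live in the segment-header spec.

```python
import struct

SEG_MAGIC = 0x52564653                   # "RVFS" segment magic
HEADER_SIZE = 64

def scan_segments(buf, known_versions=frozenset({1})):
    """Yield (offset, seg_type) for understood segments; skip the rest
    via payload_length, per rules S2/S7 and T1/T3."""
    off, out = 0, []
    while off + HEADER_SIZE <= len(buf):
        magic, version, seg_type = struct.unpack_from("<IBB", buf, off)
        if magic != SEG_MAGIC:
            break                        # not a segment header: stop scanning
        (payload_length,) = struct.unpack_from("<I", buf, off + 8)  # assumed offset
        if version in known_versions:
            out.append((off, seg_type))
        off += HEADER_SIZE + payload_length
    return out
```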
### Unknown Segment Types
The segment type enum (offset 0x05) may be extended in future versions.
| Rule | Description |
|------|-------------|
| T1 | A reader MUST skip segment types outside the recognized range (currently 0x01-0x0C) |
| T2 | A reader MUST NOT reject a file because of unknown segment types |
| T3 | A reader MUST use the header's `payload_length` to skip the unknown segment |
| T4 | A reader SHOULD log unknown types at diagnostic/debug level |
| T5 | Types 0x00 and 0xF0-0xFF remain reserved (see spec 01, Section 3) |
### Level 1 TLV Forward Compatibility
Level 1 manifest records use tag-length-value encoding. New tags may be added
in any version.
| Rule | Description |
|------|-------------|
| L1 | A reader MUST skip TLV records with unknown tags |
| L2 | A reader MUST use the record's `length` field (4 bytes at tag offset +2) to skip |
| L3 | A writer MUST NOT change the semantics of an existing tag |
| L4 | A writer MUST NOT reuse a tag value for a different purpose |
| L5 | New tags MUST be assigned sequentially from the lowest unassigned value (0x000F and 0x0010 are already defined by companion specs) |
### Root Manifest Compatibility
The root manifest (Level 0) has the strictest compatibility requirements because
it is the entry point for all readers.
| Rule | Description |
|------|-------------|
| R1 | The magic `0x52564D30` at offset 0x000 is frozen forever |
| R2 | The layout of bytes 0x000-0x007 (magic + version + flags) is frozen forever |
| R3 | New fields may be added to reserved space at offsets 0xF00-0xFFB |
| R4 | Readers MUST ignore non-zero bytes in reserved space they do not understand |
| R5 | The root checksum at 0xFFC always covers bytes 0x000-0xFFB |
| R6 | A v2+ writer extending reserved space MUST ensure the checksum remains valid |
There is no explicit version negotiation. Compatibility is achieved through the
skip rules above. A reader processes what it understands and skips what it does
not. This avoids capability exchange, making RVF suitable for offline and
archival use cases.
## 5. Variable Dimension Support
The root manifest declares a `dimension` field (offset 0x020, u16) and each
VEC_SEG block declares its own `dim` field (block header offset 0x08, u16).
These may differ.
### Dimension Rules
| Rule | Description |
|------|-------------|
| D1 | The root manifest `dimension` is the **primary dimension** (most common in the file) |
| D2 | An RVF file MAY contain VEC_SEG blocks with dimensions different from the primary |
| D3 | Each VEC_SEG block's `dim` field is authoritative for the vectors in that block |
| D4 | The HNSW index (INDEX_SEG) covers only vectors matching the primary dimension |
| D5 | Vectors with non-primary dimensions are searchable via flat scan or a separate index |
| D6 | A PROFILE_SEG may declare multiple expected dimensions |
### Dimension Catalog (Level 1 Record)
A new Level 1 TLV record (tag `0x0010`, DIMENSION_CATALOG) enables readers to
discover all dimensions present without scanning every VEC_SEG.
Record layout:
```
Offset Size Field Description
------ ---- ----- -----------
0x00 2 entry_count Number of dimension entries
0x02 2 reserved Must be zero
```
Followed by `entry_count` entries of:
```
Offset Size Field Description
------ ---- ----- -----------
0x00 2 dimension Vector dimensionality
0x02 1 dtype Data type enum for these vectors
0x03 1 flags 0x01 = primary, 0x02 = has_index
0x04 4 vector_count Number of vectors with this dimension
0x08 8 index_seg_offset Offset to dedicated index (0 if none)
```
**Entry size**: 16 bytes.
Example for an RVDNA profile file:
```
DIMENSION_CATALOG:
entry_count: 3
  [0] dim=64, dtype=f16, flags=0x03 (primary, has_index), count=10000000, index=0x1A00000
[1] dim=384, dtype=f16, flags=0x02 (has_index), count=500000, index=0x3F00000
[2] dim=4096, dtype=f32, flags=0x00 (flat scan only), count=10000, index=0
```
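Since each entry is a fixed 16 bytes, parsing the catalog is a short struct walk ( `<HBBIQ` packs to exactly 16 bytes with no padding under little-endian rules):

```python
import struct

def parse_dimension_catalog(payload):
    """Parse a DIMENSION_CATALOG (tag 0x0010) record body."""
    entry_count, _reserved = struct.unpack_from("<HH", payload, 0)
    entries = []
    for i in range(entry_count):
        dim, dtype, flags, count, index_off = struct.unpack_from(
            "<HBBIQ", payload, 4 + 16 * i)
        entries.append({
            "dimension": dim,
            "dtype": dtype,
            "primary": bool(flags & 0x01),
            "has_index": bool(flags & 0x02),
            "vector_count": count,
            "index_seg_offset": index_off,
        })
    return entries
```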
## 6. Space Reclamation
Over time, tombstoned segments and superseded manifests accumulate dead space.
RVF provides three reclamation strategies, each suited to different operating
conditions.
### Strategy 1: Hole-Punching
On Linux filesystems that support `fallocate(2)` with `FALLOC_FL_PUNCH_HOLE`
(ext4, XFS, btrfs), tombstoned segment ranges can be released back to the
filesystem without rewriting the file.
```
Before: [VEC_1 live] [VEC_2 dead] [VEC_3 dead] [VEC_4 live] [MANIFEST]
After: [VEC_1 live] [ hole ] [ hole ] [VEC_4 live] [MANIFEST]
```
File size is unchanged but disk blocks are freed. No data movement occurs — each
punch is O(1). Reader mmap still works (holes read as zeros, but the manifest
never references them). Hole-punching is performed only on segments marked as
TOMBSTONE in the current manifest's COMPACTION_STATE record.
### Strategy 2: Copy-Compact
Copy-compact rewrites the file, including only live segments. This is the
universal strategy that works on all filesystems.
```
Protocol:
1. Acquire writer lock
2. Read current manifest to enumerate live segments
3. Create temporary file: <basename>.rvf.compact.tmp
4. Write live segments sequentially to temporary file
5. Write new MANIFEST_SEG with updated offsets
6. fsync temporary file
7. Atomic rename: <basename>.rvf.compact.tmp -> <basename>.rvf
8. Release writer lock
```
The atomic rename (step 7) ensures readers either see the old file or the new
file, never a partial state. Readers that opened the old file before the rename
continue operating on the old inode via their open file descriptor. The old
inode is freed when the last reader closes its descriptor.
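The fsync-then-rename core of the protocol can be sketched as follows; lock acquisition, live-segment enumeration, and the fresh MANIFEST_SEG are elided, and `live_segments` is simply an iterable of raw segment bytes.

```python
import os

def copy_compact(path, live_segments):
    """Rewrite an RVF file containing only live segments (steps 3-7)."""
    tmp = path + ".compact.tmp"
    with open(tmp, "wb") as f:
        for seg in live_segments:
            f.write(seg)
        # A real writer appends a new MANIFEST_SEG with updated offsets here.
        f.flush()
        os.fsync(f.fileno())             # step 6: durable before rename
    os.replace(tmp, path)                # step 7: atomic rename
```

`os.replace` maps to `rename(2)`, which is atomic within a filesystem: a crash before the call leaves the original file untouched, as the crash-safety discussion in Section 8 requires.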
### Strategy 3: Shard Rewrite (Multi-File Mode)
In multi-file mode, individual shard files can be rewritten independently:
```
Protocol:
1. Acquire writer lock
2. Read shard reference from Level 1 SHARD_REFS record
3. Write new shard: <basename>.rvf.cold.<N>.compact.tmp
4. fsync new shard
5. Update main file manifest with new shard reference
6. fsync main file
7. Atomic rename new shard over old shard
8. Release writer lock
```
The old shard is safe to delete after all readers close their descriptors.
Implementations MAY defer deletion using a grace period (default: 60 seconds).
## 7. Space Reclamation Triggers
Reclamation is not performed on every write. Implementations SHOULD evaluate
triggers after each manifest write and act when thresholds are exceeded.
| Trigger | Threshold | Action |
|---------|-----------|--------|
| Dead space ratio | > 50% of file size | Copy-compact |
| Dead space absolute | > 1 GB | Hole-punch if supported, else copy-compact |
| Tombstone count | > 10,000 JOURNAL_SEG tombstone entries | Consolidate journal segments |
| Time since last compaction | > 7 days | Evaluate dead space ratio, compact if > 25% |
### Dead Space Calculation
Dead space is computed from the manifest's COMPACTION_STATE record:
```
dead_bytes = sum(payload_length + 64) for each tombstoned segment
total_bytes = file_size
dead_ratio = dead_bytes / total_bytes
```
The `+ 64` accounts for the segment header.
### Trigger Evaluation Protocol
```
1. After writing a new MANIFEST_SEG, compute dead_bytes and dead_ratio
2. If dead_ratio > 0.50: schedule copy-compact
3. Else if dead_bytes > 1 GB:
a. If fallocate supported: hole-punch tombstoned ranges
b. Else: schedule copy-compact
4. If tombstone_count > 10,000: consolidate JOURNAL_SEGs
5. If days_since_last_compact > 7 AND dead_ratio > 0.25: schedule copy-compact
```
Scheduled compactions MAY be deferred to a background process or low-activity
period.
## 8. Multi-Process Compaction
Compaction is a write operation and requires the writer lock. Only one process
may compact at a time.
### Background Compaction Process
A dedicated compaction process can run alongside the application:
```
1. Attempt writer lock acquisition
2. If lock acquired:
a. Read current manifest
b. Evaluate reclamation triggers
c. If compaction needed:
i. Write WITNESS_SEG with compaction_state = STARTED
ii. Perform compaction (copy-compact or hole-punch)
iii. Write WITNESS_SEG with compaction_state = COMPLETED
iv. Write new MANIFEST_SEG
d. Release lock
3. If lock not acquired: sleep and retry
```
### Crash Safety
Compaction is crash-safe by construction. Copy-compact does not rename until
fsynced — a crash before rename leaves the original file untouched and the
temporary file is cleaned up on next startup. Hole-punch `fallocate` calls are
individually atomic; a crash mid-sequence leaves the manifest consistent because
it references only live segments. Shard rewrite follows the same atomic rename
pattern as copy-compact.
### Compaction Progress and Resumability
For long-running compactions, the writer records progress in WITNESS_SEG segments:
```
WITNESS_SEG compaction payload:
Offset Size Field Description
------ ---- ----- -----------
0x00 4 state 0=STARTED, 1=IN_PROGRESS, 2=COMPLETED, 3=ABORTED
0x04 8 source_manifest_id Segment ID of manifest being compacted
0x0C 8 last_copied_seg_id Last segment ID successfully written to new file
0x14 8 bytes_written Total bytes written to new file so far
0x1C 8 bytes_remaining Estimated bytes remaining
0x24 16 temp_file_hash Hash of temporary file at last checkpoint
```
If a compaction process crashes and restarts, it can:
1. Find the latest WITNESS_SEG with `state = IN_PROGRESS`
2. Verify the temporary file exists and matches `temp_file_hash`
3. Resume from `last_copied_seg_id + 1`
4. If verification fails, delete the temporary file and restart compaction
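Decoding the 52-byte witness payload is a single struct unpack (`<IQQQQ16s` matches the offsets above exactly, with no implicit padding):

```python
import struct

STARTED, IN_PROGRESS, COMPLETED, ABORTED = range(4)
WITNESS_FMT = "<IQQQQ16s"                # 4+8+8+8+8+16 = 52 bytes

def parse_witness(payload):
    """Parse a WITNESS_SEG compaction payload into a dict."""
    state, src, last_seg, written, remaining, tmp_hash = struct.unpack_from(
        WITNESS_FMT, payload, 0)
    return {"state": state, "source_manifest_id": src,
            "last_copied_seg_id": last_seg, "bytes_written": written,
            "bytes_remaining": remaining, "temp_file_hash": tmp_hash}
```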
## 9. Crash Recovery Summary
RVF recovers from crashes at any point without external tooling.
| Crash Point | State After Recovery | Action Required |
|-------------|---------------------|-----------------|
| Segment append (before manifest) | Orphan segment at tail | None — manifest does not reference it |
| Manifest write | Partial manifest at tail | Scan backward to previous valid manifest |
| Lock acquisition | Lock file may or may not exist | Stale lock detection resolves it |
| Lock release | Lock file persists | Stale lock detection resolves it |
| Copy-compact (before rename) | Temporary file on disk | Delete `*.compact.tmp` on startup |
| Copy-compact (during rename) | Atomic — old or new | No action needed |
| Hole-punch | Partial holes punched | No action — manifest is consistent |
| Shard rewrite | Temporary shard on disk | Delete `*.compact.tmp` on startup |
### Startup Recovery Protocol
On startup, before acquiring a write lock, a writer SHOULD:
```
1. Delete any <basename>.rvf.compact.tmp files (orphaned compaction)
2. Delete any <basename>.rvf.cold.*.compact.tmp files (orphaned shard compaction)
3. Validate the lock file (if present) for staleness
4. Open the RVF file and locate the latest valid manifest
5. If the tail contains a partial segment (magic present, bad hash):
a. Log a warning with the partial segment's offset and type
b. The partial segment is outside the manifest — it is harmless
c. The next append will overwrite it (or it will be compacted away)
```
## 10. Invariants
The following invariants extend those in spec 01 (Section 7):
1. At most one writer lock exists per RVF file at any time
2. A lock file with valid magic and checksum represents an active or stale lock
3. Readers never require a lock, regardless of operation
4. A manifest snapshot is immutable for the lifetime of a reader session
5. Compaction never modifies live segments — it creates new ones
6. Hole-punched regions are never referenced by any manifest
7. The root manifest magic and first 8 bytes are frozen across all versions
8. Unknown segment versions and types are skipped, never rejected
9. Unknown TLV tags in Level 1 are skipped, never rejected
10. Each VEC_SEG block's `dim` field is authoritative for that block's vectors
---
# RVF Operations API
## 1. Scope
This document specifies the operational surface of an RVF runtime: error codes
returned by all operations, wire formats for batch queries, batch ingest, and
batch deletes, the network streaming protocol for progressive loading over HTTP
and TCP, and the compaction scheduling policy. It complements the segment model
(spec 01), manifest system (spec 02), and query optimization (spec 06).
All multi-byte integers are little-endian unless otherwise noted. All offsets
within messages are byte offsets from the start of the message payload.
## 2. Error Code Enumeration
Error codes are 16-bit unsigned integers. The high byte identifies the error
category; the low byte identifies the specific error within that category.
Implementations must preserve unrecognized codes in responses and must not
treat unknown codes as fatal unless the high byte is `0x01` (format error).
### Category 0x00: Success
```
Code Name Description
------ -------------------- ----------------------------------------
0x0000 OK Operation succeeded
0x0001 OK_PARTIAL Partial success (some items failed)
```
`OK_PARTIAL` is returned when a batch operation succeeds for some items and
fails for others. The response body contains per-item status details.
### Category 0x01: Format Errors
```
Code Name Description
------ -------------------- ----------------------------------------
0x0100 INVALID_MAGIC Segment magic mismatch (expected 0x52564653)
0x0101 INVALID_VERSION Unsupported segment version
0x0102 INVALID_CHECKSUM Segment hash verification failed
0x0103 INVALID_SIGNATURE Cryptographic signature invalid
0x0104 TRUNCATED_SEGMENT Segment payload shorter than declared length
0x0105 INVALID_MANIFEST Root manifest validation failed
0x0106 MANIFEST_NOT_FOUND No valid MANIFEST_SEG in file
0x0107 UNKNOWN_SEGMENT_TYPE Segment type not recognized (warning, not fatal)
0x0108 ALIGNMENT_ERROR Data not at expected 64B boundary
```
`UNKNOWN_SEGMENT_TYPE` is advisory. A reader encountering an unknown segment
type should skip it and continue. All other format errors in this category
are fatal for the affected segment.
### Category 0x02: Query Errors
```
Code Name Description
------ -------------------- ----------------------------------------
0x0200 DIMENSION_MISMATCH Query vector dimension != index dimension
0x0201 EMPTY_INDEX No index segments available
0x0202 METRIC_UNSUPPORTED Requested distance metric not available
0x0203 FILTER_PARSE_ERROR Invalid filter expression
0x0204 K_TOO_LARGE Requested K exceeds available vectors
0x0205 TIMEOUT Query exceeded time budget
```
When `K_TOO_LARGE` is returned, the response still contains all available
results. The result count will be less than the requested K.
### Category 0x03: Write Errors
```
Code Name Description
------ -------------------- ----------------------------------------
0x0300 LOCK_HELD Another writer holds the lock
0x0301 LOCK_STALE Lock file exists but owner process is dead
0x0302 DISK_FULL Insufficient space for write
0x0303 FSYNC_FAILED Durable write failed
0x0304 SEGMENT_TOO_LARGE Segment exceeds 4 GB limit
0x0305 READ_ONLY File opened in read-only mode
```
`LOCK_STALE` is informational. The runtime may attempt to break the stale
lock and retry. If recovery succeeds, the original operation proceeds with
an `OK` status.
### Category 0x04: Tile Errors (WASM Microkernel)
```
Code Name Description
------ -------------------- ----------------------------------------
0x0400 TILE_TRAP WASM trap (OOB, unreachable, stack overflow)
0x0401 TILE_OOM Tile exceeded scratch memory (64 KB)
0x0402 TILE_TIMEOUT Tile computation exceeded time budget
0x0403 TILE_INVALID_MSG Malformed hub-tile message
0x0404 TILE_UNSUPPORTED_OP Operation not available on this profile
```
All tile errors trigger the fault isolation protocol described in
`microkernel/wasm-runtime.md` section 8. The hub reassigns the tile's
work and optionally restarts the faulted tile.
### Category 0x05: Crypto Errors
```
Code Name Description
------ -------------------- ----------------------------------------
0x0500 KEY_NOT_FOUND Referenced key_id not in CRYPTO_SEG
0x0501 KEY_EXPIRED Key past valid_until timestamp
0x0502 DECRYPT_FAILED Decryption or auth tag verification failed
0x0503 ALGO_UNSUPPORTED Cryptographic algorithm not implemented
```
Crypto errors are always fatal for the affected segment. An implementation
must not serve data from a segment that fails signature or decryption checks.
## 3. Batch Query API
### Wire Format: Request
Batch queries amortize connection overhead and enable the runtime to
schedule vector block loads across multiple queries simultaneously.
```
Offset Size Field Description
------ ------ ------------------ ----------------------------------------
0x00 4 query_count Number of queries in batch (max 1024)
0x04 4 k Shared top-K parameter
0x08 1 metric Distance metric: 0=L2, 1=IP, 2=cosine, 3=hamming
0x09 3 reserved Must be zero
0x0C 4 ef_search HNSW ef_search parameter
0x10 4 shared_filter_len Byte length of shared filter (0 = no filter)
0x14 var shared_filter Filter expression (applies to all queries)
var var queries[] Per-query entries (see below)
```
Each query entry:
```
Offset Size Field Description
------ ------ ------------------ ----------------------------------------
0x00 4 query_id Client-assigned correlation ID
0x04 2 dim Vector dimensionality
0x06 1 dtype Data type: 0=fp32, 1=fp16, 2=i8, 3=binary
0x07 1 flags Bit 0: has per-query filter
0x08 var vector Query vector (dim * sizeof(dtype) bytes)
var 4 filter_len Byte length of per-query filter (if flags bit 0)
var var filter Per-query filter (overrides shared filter)
```
When both a shared filter and a per-query filter are present, the per-query
filter takes precedence. A per-query filter of zero length inherits the
shared filter.
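The layout above can be sketched as a request encoder. This is a minimal illustration, not an RVF crate API: the function name and the decision to omit per-query filters (`flags = 0`) are ours. Multi-byte fields are little-endian, per the invariants in section 10.

```rust
// Hypothetical sketch: encode a Batch Query Request header plus fp32
// query entries with no per-query filters. Offsets follow the tables
// above; names are illustrative, not from any RVF crate.
fn encode_batch_query(
    k: u32,
    metric: u8,
    ef_search: u32,
    shared_filter: &[u8],
    queries: &[(u32, Vec<f32>)], // (query_id, vector)
) -> Vec<u8> {
    let mut buf = Vec::new();
    buf.extend_from_slice(&(queries.len() as u32).to_le_bytes()); // query_count
    buf.extend_from_slice(&k.to_le_bytes());                      // k
    buf.push(metric);                                             // metric
    buf.extend_from_slice(&[0u8; 3]);                             // reserved
    buf.extend_from_slice(&ef_search.to_le_bytes());              // ef_search
    buf.extend_from_slice(&(shared_filter.len() as u32).to_le_bytes());
    buf.extend_from_slice(shared_filter);                         // shared_filter
    for (query_id, vector) in queries {
        buf.extend_from_slice(&query_id.to_le_bytes());           // query_id
        buf.extend_from_slice(&(vector.len() as u16).to_le_bytes()); // dim
        buf.push(0); // dtype = fp32
        buf.push(0); // flags: bit 0 clear -> inherit shared filter
        for v in vector {
            buf.extend_from_slice(&v.to_le_bytes());              // vector data
        }
    }
    buf
}
```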
### Wire Format: Response
```
Offset Size Field Description
------ ------ ------------------ ----------------------------------------
0x00 4 query_count Number of query results
0x04 var results[] Per-query result entries
```
Each result entry:
```
Offset Size Field Description
------ ------ ------------------ ----------------------------------------
0x00 4 query_id Correlation ID from request
0x04 2 status Error code (0x0000 = OK)
0x06 2 reserved Must be zero
0x08 4 result_count Number of results returned
0x0C var results[] Array of (vector_id: u64, distance: f32) pairs
```
Each result pair is 12 bytes: 8 bytes for the vector ID followed by 4 bytes
for the distance value. Results are sorted by distance ascending (nearest first).
### Batch Scheduling
The runtime should process batch queries using the following strategy:
1. Parse all query vectors and load them into memory
2. Identify shared segments across queries (block deduplication)
3. Load each vector block once and evaluate all relevant queries against it
4. Merge per-query top-K heaps independently
5. Return results as soon as each query completes (streaming response)
This amortizes I/O: if N queries touch the same vector block, the block is
read once instead of N times.
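Steps 2 and 3 of the strategy can be sketched as a grouping pass. The routing step that decides which blocks a query touches is assumed here (the caller supplies it); only the deduplication itself is shown.

```rust
use std::collections::HashMap;

// Hypothetical sketch of block deduplication: group query indices by
// the vector blocks they touch, so each block is loaded exactly once
// and evaluated against every relevant query.
fn plan_block_loads(blocks_per_query: &[Vec<u64>]) -> HashMap<u64, Vec<usize>> {
    let mut plan: HashMap<u64, Vec<usize>> = HashMap::new();
    for (qi, blocks) in blocks_per_query.iter().enumerate() {
        for &b in blocks {
            plan.entry(b).or_default().push(qi); // one entry per distinct block
        }
    }
    plan
}
```

The runtime would then iterate `plan`, load each block once, and feed distances into the per-query top-K heaps.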
## 4. Batch Ingest API
### Wire Format: Request
```
Offset Size Field Description
------ ------ ------------------ ----------------------------------------
0x00 4 vector_count Number of vectors to ingest (max 65536)
0x04 2 dim Vector dimensionality
0x06 1 dtype Data type: 0=fp32, 1=fp16, 2=i8, 3=binary
0x07 1 flags Bit 0: metadata_included
0x08 var vectors[] Vector entries
var var metadata[] Metadata entries (if flags bit 0)
```
Each vector entry:
```
Offset Size Field Description
------ ------ ------------------ ----------------------------------------
0x00 8 vector_id Globally unique vector ID
0x08 var vector Vector data (dim * sizeof(dtype) bytes)
```
Each metadata entry (when metadata_included is set):
```
Offset Size Field Description
------ ------ ------------------ ----------------------------------------
0x00 2 field_count Number of metadata fields
0x02 var fields[] Field entries
```
Each metadata field:
```
Offset Size Field Description
------ ------ ------------------ ----------------------------------------
0x00 2 field_id Field identifier (application-defined)
0x02 1 value_type 0=u64, 1=i64, 2=f64, 3=string, 4=bytes
0x03 var value Encoded value (u64/i64/f64: 8B; string/bytes: 4B length + data)
```
### Wire Format: Response
```
Offset Size Field Description
------ ------ ------------------ ----------------------------------------
0x00 4 accepted_count Number of vectors accepted
0x04 4 rejected_count Number of vectors rejected
0x08 4 manifest_epoch Epoch of manifest after commit
0x0C var rejected_ids[] Array of rejected vector IDs (u64 * rejected_count)
var var rejected_reasons[] Array of error codes (u16 * rejected_count)
```
The `manifest_epoch` field is the epoch of the MANIFEST_SEG written after the
ingest is committed. Clients can use this value to confirm that a subsequent
read will include the ingested vectors.
### Ingest Commit Semantics
1. The runtime writes vectors to a new VEC_SEG (append-only)
2. If metadata is included, a META_SEG is appended
3. Both segments are fsynced
4. A new MANIFEST_SEG is written referencing the new segments
5. The manifest is fsynced
6. The response is sent with the new manifest_epoch
Vectors are visible to queries only after step 6 completes.
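The ordering above can be sketched over a single append-only file. This is an illustration of the fsync ordering only (metadata segments and epoch bookkeeping are omitted): the key property is that the manifest is made durable strictly after the data it references, so a crash can never expose a manifest pointing at unsynced vectors.

```rust
use std::fs::OpenOptions;
use std::io::Write;

// Hypothetical sketch of the commit ordering: append VEC_SEG, fsync,
// append MANIFEST_SEG, fsync. Returns the file length after commit.
fn commit_ingest(path: &str, vec_seg: &[u8], manifest_seg: &[u8]) -> std::io::Result<u64> {
    let mut f = OpenOptions::new().create(true).append(true).open(path)?;
    f.write_all(vec_seg)?;      // steps 1-2: append data segments
    f.sync_all()?;              // step 3: data durable before manifest
    f.write_all(manifest_seg)?; // step 4: append MANIFEST_SEG
    f.sync_all()?;              // step 5: manifest durable -> commit point
    Ok(f.metadata()?.len())     // step 6: respond (epoch derived elsewhere)
}
```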
## 5. Batch Delete API
### Wire Format: Request
```
Offset Size Field Description
------ ------ ------------------ ----------------------------------------
0x00 1 delete_type 0=by_id, 1=by_range, 2=by_filter
0x01 3 reserved Must be zero
0x04 var payload Type-specific payload (see below)
```
Delete by ID (`delete_type = 0`):
```
Offset Size Field Description
------ ------ ------------------ ----------------------------------------
0x00 4 count Number of IDs to delete
0x04 var ids[] Array of vector IDs (u64 * count)
```
Delete by range (`delete_type = 1`):
```
Offset Size Field Description
------ ------ ------------------ ----------------------------------------
0x00 8 start_id Start of range (inclusive)
0x08 8 end_id End of range (exclusive)
```
Delete by filter (`delete_type = 2`):
```
Offset Size Field Description
------ ------ ------------------ ----------------------------------------
0x00 4 filter_len Byte length of filter expression
0x04 var filter Filter expression
```
### Wire Format: Response
```
Offset Size Field Description
------ ------ ------------------ ----------------------------------------
0x00 8 deleted_count Number of vectors deleted
0x08 2 status Error code (0x0000 = OK)
0x0A 2 reserved Must be zero
0x0C 4 manifest_epoch Epoch of manifest after delete committed
```
### Delete Mechanics
Deletes are logical. The runtime appends a JOURNAL_SEG containing tombstone
entries for the deleted vector IDs. The new MANIFEST_SEG marks affected
VEC_SEGs as partially dead. Physical reclamation happens during compaction.
## 6. Network Streaming Protocol
### 6.1 HTTP Range Requests (Read-Only Access)
RVF's progressive loading model maps naturally to HTTP byte-range requests.
A client can boot from a remote `.rvf` file and become queryable without
downloading the entire file.
**Phase 1: Boot (mandatory)**
```
GET /file.rvf Range: bytes=-4096
```
Retrieves the last 4 KB of the file. This contains the Level 0 root manifest
(MANIFEST_SEG). The client parses hotset pointers, the segment directory, and
the profile ID.
If the file is smaller than 4 KB, the entire file is returned. If the last
4 KB does not contain a valid MANIFEST_SEG, the client extends the range
backward in 4 KB increments until one is found or 1 MB is scanned (at which
point it returns `MANIFEST_NOT_FOUND`).
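The backward scan can be sketched over an in-memory byte slice standing in for the ranged reads. The manifest-detection closure is a placeholder for real MANIFEST_SEG magic and hash validation, which this sketch does not implement.

```rust
// Hypothetical sketch of the Phase 1 tail scan: widen the tail window
// in 4 KB steps until a manifest is found, giving up after 1 MB.
fn tail_scan(
    file: &[u8],
    is_manifest_in: impl Fn(&[u8]) -> Option<usize>, // offset within window
) -> Option<usize> {
    const STEP: usize = 4096;
    const LIMIT: usize = 1 << 20; // scan at most the last 1 MB
    let mut window = STEP.min(file.len());
    loop {
        let start = file.len() - window;
        if let Some(off) = is_manifest_in(&file[start..]) {
            return Some(start + off); // absolute offset of manifest
        }
        if window >= file.len() || window >= LIMIT {
            return None; // MANIFEST_NOT_FOUND
        }
        window = (window + STEP).min(file.len()).min(LIMIT);
    }
}
```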
**Phase 2: Hotset (parallel, mandatory for queries)**
Using offsets from the Level 0 manifest, the client issues up to 5 parallel
range requests:
```
GET /file.rvf Range: bytes=<entrypoint_offset>-<entrypoint_end>
GET /file.rvf Range: bytes=<toplayer_offset>-<toplayer_end>
GET /file.rvf Range: bytes=<centroid_offset>-<centroid_end>
GET /file.rvf Range: bytes=<quantdict_offset>-<quantdict_end>
GET /file.rvf Range: bytes=<hotcache_offset>-<hotcache_end>
```
These fetch the HNSW entry point, top-layer graph, routing centroids,
quantization dictionary, and the hot cache (HOT_SEG). After these 5 requests
complete, the system is queryable with recall >= 0.7.
**Phase 3: Level 1 (background)**
```
GET /file.rvf Range: bytes=<l1_offset>-<l1_end>
```
Fetches the Level 1 manifest containing the full segment directory. This
enables the client to discover all segments and plan on-demand fetches.
**Phase 4: On-demand (per query)**
For queries that require cold data not yet fetched:
```
GET /file.rvf Range: bytes=<segment_offset>-<segment_end>
```
The client caches fetched segments locally. Repeated queries against the
same data region do not trigger additional requests.
### HTTP Requirements
- Server must support `Accept-Ranges: bytes`
- Server must return `206 Partial Content` for range requests
- Server should support multiple ranges in a single request (`multipart/byteranges`)
- Client should use `If-None-Match` with the file's ETag to detect stale caches
### 6.2 TCP Streaming Protocol (Real-Time Access)
For real-time ingest and low-latency queries, RVF defines a binary TCP
protocol over TLS 1.3.
**Connection Setup**
```
1. Client opens TCP connection to server
2. TLS 1.3 handshake (mandatory, no plaintext mode)
3. Client sends HELLO message with protocol version and capabilities
4. Server responds with HELLO_ACK confirming capabilities
5. Connection is ready for messages
```
**Framing**
All messages are length-prefixed:
```
Offset Size Field Description
------ ------ ------------------ ----------------------------------------
0x00 4 frame_length Payload length (big-endian, max 16 MB)
0x04 1 msg_type Message type (see below)
0x05 3 msg_id Correlation ID (big-endian, wraps at 2^24)
0x08 var payload Message-specific payload
```
Frame length is big-endian (network byte order) for consistency with TLS
framing. The 16 MB maximum prevents a single message from monopolizing the
connection. Payloads larger than 16 MB must be split across multiple messages
using continuation framing (see section 6.4).
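The frame header can be sketched as follows. Note the mixed endianness: `frame_length` and `msg_id` are big-endian per the table above, while payload fields remain little-endian. The function name is ours.

```rust
const MAX_FRAME: usize = 16 << 20; // 16 MB payload cap

// Hypothetical sketch of frame encoding: 4-byte big-endian length,
// 1-byte message type, 3-byte big-endian correlation ID, payload.
fn encode_frame(msg_type: u8, msg_id: u32, payload: &[u8]) -> Option<Vec<u8>> {
    if payload.len() > MAX_FRAME {
        return None; // caller must use continuation framing (section 6.4)
    }
    let id = msg_id & 0x00FF_FFFF; // msg_id wraps at 2^24
    let mut frame = Vec::with_capacity(8 + payload.len());
    frame.extend_from_slice(&(payload.len() as u32).to_be_bytes()); // frame_length
    frame.push(msg_type);
    frame.extend_from_slice(&id.to_be_bytes()[1..]); // low 3 bytes, big-endian
    frame.extend_from_slice(payload);
    Some(frame)
}
```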
**Message Types**
```
Client -> Server:
0x01 QUERY Batch query (payload = Batch Query Request)
0x02 INGEST Batch ingest (payload = Batch Ingest Request)
0x03 DELETE Batch delete (payload = Batch Delete Request)
0x04 STATUS Request server status (no payload)
0x05 SUBSCRIBE Subscribe to update notifications
Server -> Client:
0x81 QUERY_RESULT Batch query result
0x82 INGEST_ACK Batch ingest acknowledgment
0x83 DELETE_ACK Batch delete acknowledgment
0x84 STATUS_RESP Server status response
0x85 UPDATE_NOTIFY Push notification of new data
0xFF ERROR Error with code and description
```
**ERROR Message Payload**
```
Offset Size Field Description
------ ------ ------------------ ----------------------------------------
0x00 2 error_code Error code from section 2
0x02 2 description_len Byte length of description string
0x04 var description UTF-8 error description (human-readable)
```
### 6.3 Streaming Ingest Protocol
The TCP protocol supports continuous ingest where the client streams vectors
without waiting for per-batch acknowledgments.
**Flow**
```
Client Server
| |
|--- INGEST (batch 0) ------------->|
|--- INGEST (batch 1) ------------->| Pipelining: send without waiting
|--- INGEST (batch 2) ------------->|
| | Server writes VEC_SEGs, appends manifest
|<--- INGEST_ACK (batch 0) ---------|
|<--- INGEST_ACK (batch 1) ---------|
| | Backpressure: server delays ACK
|--- INGEST (batch 3) ------------->| Client respects window
|<--- INGEST_ACK (batch 2) ---------|
| |
```
**Backpressure**
The server controls ingest rate by delaying INGEST_ACK responses. The client
must limit its in-flight (unacknowledged) ingest messages to a configurable
window size (default: 8 messages). When the window is full, the client must
wait for an ACK before sending the next batch.
The server should send backpressure when:
- Write queue exceeds 80% capacity
- Compaction is falling behind (dead space > 50%)
- Available disk space drops below 10%
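The client side of this contract can be sketched as a small in-flight window. The type and method names are illustrative; only the invariant matters: never more than `capacity` unacknowledged batches.

```rust
use std::collections::VecDeque;

// Hypothetical sketch of the client-side ingest window (default
// capacity 8). `in_flight` holds msg_ids awaiting INGEST_ACK.
struct IngestWindow {
    capacity: usize,
    in_flight: VecDeque<u32>,
}

impl IngestWindow {
    fn new(capacity: usize) -> Self {
        Self { capacity, in_flight: VecDeque::new() }
    }

    /// Returns true if the batch may be sent now; false means the
    /// window is full and the caller must wait for an ACK.
    fn try_send(&mut self, msg_id: u32) -> bool {
        if self.in_flight.len() >= self.capacity {
            return false; // respect server backpressure
        }
        self.in_flight.push_back(msg_id);
        true
    }

    fn on_ack(&mut self, msg_id: u32) {
        self.in_flight.retain(|&id| id != msg_id); // free one window slot
    }
}
```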
**Commit Semantics**
Each INGEST_ACK contains the `manifest_epoch` after commit. The server
guarantees that all vectors acknowledged with epoch E are visible to any
query that reads the manifest at epoch >= E.
### 6.4 Continuation Framing
For payloads exceeding the 16 MB frame limit:
```
Frame 0: msg_type = original type, flags bit 0 = CONTINUATION_START
Frame 1: msg_type = 0x00 (CONTINUATION), flags bit 0 = 0
Frame 2: msg_type = 0x00 (CONTINUATION), flags bit 0 = 0
Frame N: msg_type = 0x00 (CONTINUATION), flags bit 1 = CONTINUATION_END
```
The receiver reassembles the payload from all continuation frames before
processing. The msg_id is shared across all frames of a continuation sequence.
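The sender side can be sketched as a chunking pass. The tables in section 6.2 do not pin down where the START/END flag bits are carried, so this sketch returns `(msg_type, is_last, chunk)` tuples and leaves flag encoding to the framing layer; the frame-size limit is a parameter so the behavior is easy to exercise.

```rust
// Hypothetical sketch of continuation splitting: the first frame keeps
// the original message type, later frames use CONTINUATION (0x00), and
// the final frame is marked as the end of the sequence. `max` must be
// nonzero (16 MB in the real protocol).
fn split_continuation<'a>(
    orig_type: u8,
    payload: &'a [u8],
    max: usize,
) -> Vec<(u8, bool, &'a [u8])> {
    let chunks: Vec<&[u8]> = payload.chunks(max).collect();
    let n = chunks.len();
    chunks
        .into_iter()
        .enumerate()
        .map(|(i, chunk)| {
            let ty = if i == 0 { orig_type } else { 0x00 }; // CONTINUATION
            (ty, i == n - 1, chunk) // is_last marks CONTINUATION_END
        })
        .collect()
}
```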
### 6.5 SUBSCRIBE and UPDATE_NOTIFY
The SUBSCRIBE message registers the client for push notifications when new
data is committed:
```
SUBSCRIBE payload:
Offset Size Field Description
------ ------ ------------------ ----------------------------------------
0x00 4 min_epoch Only notify for epochs > this value
0x04 1 notify_flags Bit 0: ingest, Bit 1: delete, Bit 2: compaction
0x05 3 reserved Must be zero
```
The server sends UPDATE_NOTIFY whenever a new MANIFEST_SEG is committed that
matches the subscription criteria:
```
UPDATE_NOTIFY payload:
Offset Size Field Description
------ ------ ------------------ ----------------------------------------
0x00 4 epoch New manifest epoch
0x04 1 event_type 0=ingest, 1=delete, 2=compaction
0x05 3 reserved Must be zero
0x08 4 affected_count Number of vectors affected
0x0C 8 new_total Total vector count after event
```
## 7. Compaction Scheduling Policy
Compaction merges small, overlapping, or partially-dead segments into larger,
sealed segments. Because compaction competes with queries and ingest for I/O
bandwidth, the runtime enforces a scheduling policy.
### 7.1 IO Budget
Compaction must consume at most 30% of available IOPS. The runtime measures
IOPS over a 5-second sliding window and throttles compaction I/O to stay
within budget.
```
available_iops = measured_iops_capacity (from benchmarking at startup)
compaction_budget = available_iops * 0.30
compaction_throttle = max(compaction_budget - current_compaction_iops, 0)
```
### 7.2 Priority Ordering
When I/O bandwidth is contended, operations are prioritized:
```
Priority 1 (highest): Queries (reads from VEC_SEG, INDEX_SEG, HOT_SEG)
Priority 2: Ingest (writes to VEC_SEG, META_SEG, MANIFEST_SEG)
Priority 3 (lowest): Compaction (reads + writes of sealed segments)
```
Compaction yields to queries and ingest. If a compaction I/O operation would
cause a query to exceed its time budget, the compaction operation is deferred.
### 7.3 Scheduling Triggers
Compaction runs when all of the following conditions are met:
| Condition | Threshold | Rationale |
|-----------|-----------|-----------|
| Query load | < 50% of capacity | Avoid competing with active queries |
| Dead space ratio | > 20% of total file size | Not worth compacting small amounts |
| Segment count | > 32 active segments | Many small segments hurt read performance |
| Time since last compaction | > 60 seconds | Prevent compaction storms |
The runtime evaluates these conditions every 10 seconds.
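The trigger check is a pure conjunction of the four thresholds, which can be sketched directly. The struct and field names are ours; percentages are taken as already-computed integers.

```rust
// Hypothetical sketch of the compaction trigger evaluation, run every
// 10 seconds. All four conditions from the table must hold.
struct CompactionStats {
    query_load_pct: u32,      // % of query capacity in use
    dead_space_pct: u32,      // dead bytes as % of total file size
    active_segments: u32,     // count of active (unsealed) segments
    secs_since_last_run: u64, // time since last compaction
}

fn should_compact(s: &CompactionStats) -> bool {
    s.query_load_pct < 50        // avoid competing with active queries
        && s.dead_space_pct > 20 // enough dead space to be worth it
        && s.active_segments > 32 // many small segments hurt reads
        && s.secs_since_last_run > 60 // prevent compaction storms
}
```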
### 7.4 Emergency Compaction
If dead space exceeds 70% of total file size, compaction enters emergency mode:
```
Emergency compaction rules:
1. Compaction preempts ingest (ingest is paused, not rejected)
2. IO budget increases to 60% of available IOPS
3. Compaction runs regardless of query load
4. Ingest resumes after dead space drops below 50%
```
During emergency compaction, the server responds to INGEST messages with
delayed ACKs (backpressure) rather than rejecting them. Queries continue to
be served at highest priority.
### 7.5 Compaction Progress Reporting
The STATUS response includes compaction state:
```
STATUS_RESP compaction fields:
Offset Size Field Description
------ ------ ------------------- ----------------------------------------
0x00 1 compaction_state 0=idle, 1=running, 2=emergency
0x01 1 progress_pct Completion percentage (0-100)
0x02 2 reserved Must be zero
0x04 8 dead_bytes Total dead space in bytes
0x0C 8 total_bytes Total file size in bytes
0x14 4 segments_remaining Segments left to compact
0x18 4 segments_completed Segments compacted in current run
0x1C 4 estimated_seconds Estimated time to completion
0x20 4 io_budget_pct Current IO budget percentage (30 or 60)
```
### 7.6 Compaction Segment Selection
The runtime selects segments for compaction using a tiered strategy:
```
1. Tombstoned segments: Always compacted first (reclaim dead space)
2. Small VEC_SEGs: Segments < 1 MB merged into larger segments
3. High-overlap INDEX_SEGs: Index segments covering the same ID range
4. Cold OVERLAY_SEGs: Overlay deltas merged into base segments
```
The compaction output is always a sealed segment (SEALED flag set). Sealed
segments are immutable and can be verified independently.
## 8. STATUS Response Format
The STATUS message provides a snapshot of the server state for monitoring
and diagnostics.
```
STATUS_RESP payload:
Offset Size Field Description
------ ------ ------------------- ----------------------------------------
0x00 4 protocol_version Protocol version (currently 1)
0x04 4 manifest_epoch Current manifest epoch
0x08 8 total_vectors Total vector count
0x10 8 total_segments Total segment count
0x18 8 file_size_bytes Total file size
0x20 4 query_qps Queries per second (last 5s window)
0x24 4 ingest_vps Vectors ingested per second (last 5s window)
0x28    36      compaction          Compaction state (see section 7.5)
0x4C    1       profile_id          Active hardware profile (0x00-0x03)
0x4D    1       health              0=healthy, 1=degraded, 2=read_only
0x4E    2       reserved            Must be zero
0x50    4       uptime_seconds      Server uptime
```
## 9. Filter Expression Format
Filter expressions used in batch queries and batch deletes share a common
binary encoding:
```
Offset Size Field Description
------ ------ ------------------ ----------------------------------------
0x00 1 op Operator enum (see below)
0x01 2 field_id Metadata field to filter on
0x03 1 value_type Value type (matches metadata field types)
0x04 var value Comparison value
var var children[] Sub-expressions (for AND/OR/NOT)
```
Operator enum:
```
0x00 EQ field == value
0x01 NE field != value
0x02 LT field < value
0x03 LE field <= value
0x04 GT field > value
0x05 GE field >= value
0x06 IN field in [values]
0x07 RANGE field in [low, high)
0x10 AND All children must match
0x11 OR Any child must match
0x12 NOT Negate single child
```
Filters are evaluated during the query scan phase. Vectors that do not match
the filter are excluded from distance computation entirely (pre-filtering) or
from the result set (post-filtering), depending on the runtime's cost model.
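Evaluation over a decoded filter tree can be sketched as a recursive match. The wire decoding step is omitted, and the sketch restricts values to `i64` for brevity; the enum mirrors the operator table above, with names of our choosing. A metadata lookup that returns `None` (field absent) fails every comparison operator.

```rust
// Hypothetical sketch of filter-tree evaluation against one vector's
// metadata, i64 values only. `get` maps a field_id to its value.
enum Filter {
    Eq(u16, i64), Ne(u16, i64), Lt(u16, i64), Le(u16, i64),
    Gt(u16, i64), Ge(u16, i64),
    Range(u16, i64, i64),               // field in [low, high)
    And(Vec<Filter>), Or(Vec<Filter>), Not(Box<Filter>),
}

fn matches(f: &Filter, get: &impl Fn(u16) -> Option<i64>) -> bool {
    match f {
        Filter::Eq(id, x) => get(*id) == Some(*x),
        Filter::Ne(id, x) => get(*id).map_or(false, |a| a != *x),
        Filter::Lt(id, x) => get(*id).map_or(false, |a| a < *x),
        Filter::Le(id, x) => get(*id).map_or(false, |a| a <= *x),
        Filter::Gt(id, x) => get(*id).map_or(false, |a| a > *x),
        Filter::Ge(id, x) => get(*id).map_or(false, |a| a >= *x),
        Filter::Range(id, lo, hi) => get(*id).map_or(false, |a| *lo <= a && a < *hi),
        Filter::And(cs) => cs.iter().all(|c| matches(c, get)), // all children
        Filter::Or(cs) => cs.iter().any(|c| matches(c, get)),  // any child
        Filter::Not(c) => !matches(c, get),                    // negate child
    }
}
```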
## 10. Invariants
1. Error codes are stable across versions; new codes are additive only
2. Batch operations are atomic per-item, not per-batch (partial success is valid)
3. TCP connections are always TLS 1.3; plaintext is not permitted
4. In the TCP frame header, frame_length and msg_id are big-endian; all other multi-byte fields are little-endian
5. HTTP progressive loading must succeed with at most 7 round trips to become queryable
6. Compaction never runs at more than 60% of available IOPS, even in emergency mode
7. The STATUS response is always available, even during emergency compaction
8. Filter expressions are limited to 64 levels of nesting depth

# RVF WASM Self-Bootstrapping Specification
## 1. Motivation
Traditional file formats require an external runtime to interpret their contents.
A JPEG needs an image decoder. A SQLite database needs the SQLite library. An RVF
file needs a vector search engine.
What if the file carried its own runtime?
By embedding a tiny WASM interpreter inside the RVF file itself, we eliminate the
last external dependency. The host only needs **raw execution capability** — the
ability to run bytes as instructions. RVF becomes **self-bootstrapping**: a single
file that contains both its data and the complete machinery to process that data.
This is the transition from "needs a compatible runtime" to **"runs anywhere
compute exists."**
## 2. Architecture
### The Bootstrap Stack
```
Layer 3: RVF Data Segments (VEC_SEG, INDEX_SEG, MANIFEST_SEG, ...)
^
| processes
|
Layer 2: WASM Microkernel (WASM_SEG, role=Microkernel, ~5.5 KB)
^ 14 exports: query, ingest, distance, top-K
| executes
|
Layer 1: WASM Interpreter (WASM_SEG, role=Interpreter, ~50 KB)
^ Minimal stack machine that runs WASM bytecode
| loads
|
Layer 0: Raw Bytes (The .rvf file on any storage medium)
```
Each layer depends only on the one below it. The host reads Layer 0 (raw bytes),
finds the interpreter at Layer 1, uses it to execute the microkernel at Layer 2,
which then processes the data at Layer 3.
### Segment Layout
```
┌──────────────────────────────────────────────────────────────────────┐
│ bootable.rvf │
│ │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ ┌─────────┐ │
│ │ WASM_SEG │ │ WASM_SEG │ │ VEC_SEG │ │ INDEX │ │
│ │ 0x10 │ │ 0x10 │ │ 0x01 │ │ _SEG │ │
│ │ │ │ │ │ │ │ 0x02 │ │
│ │ role=Interp │ │ role=uKernel │ │ 10M vectors │ │ HNSW │ │
│ │ ~50 KB │ │ ~5.5 KB │ │ 384-dim fp16 │ │ L0+L1 │ │
│ │ priority=0 │ │ priority=1 │ │ │ │ │ │
│ └──────────────┘ └──────────────┘ └──────────────┘ └─────────┘ │
│ │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ QUANT_SEG │ │ WITNESS_SEG │ │ MANIFEST_SEG │ ← tail │
│ │ codebooks │ │ audit trail │ │ source of │ │
│ │ │ │ │ │ truth │ │
│ └──────────────┘ └──────────────┘ └──────────────┘ │
└──────────────────────────────────────────────────────────────────────┘
```
## 3. WASM_SEG Wire Format
### Segment Type
```
Value: 0x10
Name: WASM_SEG
```
Uses the standard 64-byte RVF segment header (`SegmentHeader`), followed by
a 64-byte `WasmHeader`, followed by the WASM bytecode.
### WasmHeader (64 bytes)
```
Offset Size Type Field Description
------ ---- ---- ----- -----------
0x00 4 u32 wasm_magic 0x5256574D ("RVWM" big-endian)
0x04 2 u16 header_version Currently 1
0x06 1 u8 role Bootstrap role (see WasmRole enum)
0x07 1 u8 target Target platform (see WasmTarget enum)
0x08 2 u16 required_features WASM feature bitfield
0x0A 2 u16 export_count Number of WASM exports
0x0C 4 u32 bytecode_size Uncompressed bytecode size (bytes)
0x10 4 u32 compressed_size Compressed size (0 = no compression)
0x14 1 u8 compression 0=none, 1=LZ4, 2=ZSTD
0x15 1 u8 min_memory_pages Minimum linear memory (64 KB each)
0x16 1 u8 max_memory_pages Maximum linear memory (0 = no limit)
0x17 1 u8 table_count Number of WASM tables
0x18 32 hash256 bytecode_hash SHAKE-256-256 of uncompressed bytecode
0x38 1 u8 bootstrap_priority Lower = tried first in chain
0x39 1 u8 interpreter_type Interpreter variant (if role=Interpreter)
0x3A 6 u8[6] reserved Must be zero
```
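A bootstrap loader only needs a handful of these fields to sort and select modules. The sketch below assumes the magic is stored as the bytes `"RVWM"` (0x5256574D read big-endian, as the comment in the table suggests) and that the other multi-byte fields are little-endian per RVF convention; the struct name is ours.

```rust
// Hypothetical sketch: parse the fields a bootstrap loader needs from
// the 64-byte WasmHeader. Offsets follow the table above.
struct WasmHeaderLite {
    role: u8,               // WasmRole enum value
    target: u8,             // WasmTarget enum value
    required_features: u16, // WASM feature bitfield
    bytecode_size: u32,     // uncompressed bytecode size
    bootstrap_priority: u8, // lower = tried first
}

fn parse_wasm_header(h: &[u8]) -> Option<WasmHeaderLite> {
    if h.len() < 64 || &h[0..4] != b"RVWM" {
        return None; // short header or wasm_magic mismatch
    }
    Some(WasmHeaderLite {
        role: h[0x06],
        target: h[0x07],
        required_features: u16::from_le_bytes([h[0x08], h[0x09]]),
        bytecode_size: u32::from_le_bytes([h[0x0C], h[0x0D], h[0x0E], h[0x0F]]),
        bootstrap_priority: h[0x38],
    })
}
```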
### WasmRole Enum
```
Value Name Description
----- ---- -----------
0x00 Microkernel RVF query engine (5.5 KB Cognitum tile runtime)
0x01 Interpreter Minimal WASM interpreter for self-bootstrapping
0x02 Combined Interpreter + microkernel linked together
0x03 Extension Domain-specific module (custom distance, decoder)
0x04 ControlPlane Store management (create, export, segment parsing)
```
### WasmTarget Enum
```
Value Name Description
----- ---- -----------
0x00 Wasm32 Generic wasm32 (any compliant runtime)
0x01 WasiP1 WASI Preview 1 (requires WASI syscalls)
0x02 WasiP2 WASI Preview 2 (component model)
0x03 Browser Browser-optimized (expects Web APIs)
0x04 BareTile Bare-metal Cognitum tile (hub-tile protocol only)
```
### Required Features Bitfield
```
Bit Mask Feature
--- ---- -------
0 0x0001 SIMD (v128 operations)
1 0x0002 Bulk memory operations
2 0x0004 Multi-value returns
3 0x0008 Reference types
4 0x0010 Threads (shared memory)
5 0x0020 Tail call optimization
6 0x0040 GC (garbage collection)
7 0x0080 Exception handling
```
### Interpreter Type (when role=Interpreter)
```
Value Name Description
----- ---- -----------
0x00 StackMachine Generic stack-based interpreter
0x01 Wasm3Compatible wasm3-style (register machine)
0x02 WamrCompatible WAMR-style (AOT + interpreter)
0x03 WasmiCompatible wasmi-style (pure stack machine)
```
## 4. Bootstrap Resolution Protocol
### Discovery
1. Scan all segments for `seg_type == 0x10` (WASM_SEG)
2. Parse the 64-byte WasmHeader from each
3. Validate `wasm_magic == 0x5256574D`
4. Sort by `bootstrap_priority` ascending
### Resolution
```
IF any WASM_SEG has role=Combined:
→ SelfContained bootstrap (single module does everything)
ELIF WASM_SEG with role=Interpreter AND role=Microkernel both exist:
→ TwoStage bootstrap (interpreter runs microkernel)
ELIF only WASM_SEG with role=Microkernel exists:
→ HostRequired (needs external WASM runtime)
ELSE:
→ No WASM bootstrap available
```
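The resolution rules reduce to a few membership tests over the discovered roles. A minimal sketch, with an enum of our naming (the real `rvf-wasm` crate exposes `BootstrapChain` with different shape):

```rust
// Hypothetical sketch of bootstrap resolution over the role bytes of
// all discovered WASM_SEGs. Role values follow the WasmRole enum:
// 0x00 = Microkernel, 0x01 = Interpreter, 0x02 = Combined.
#[derive(Debug, PartialEq)]
enum Bootstrap {
    SelfContained, // single Combined module does everything
    TwoStage,      // interpreter runs microkernel
    HostRequired,  // microkernel only: needs external WASM runtime
    None,          // no WASM bootstrap available
}

fn resolve(roles: &[u8]) -> Bootstrap {
    let has = |r: u8| roles.iter().any(|&x| x == r);
    if has(0x02) {
        Bootstrap::SelfContained
    } else if has(0x01) && has(0x00) {
        Bootstrap::TwoStage
    } else if has(0x00) {
        Bootstrap::HostRequired
    } else {
        Bootstrap::None
    }
}
```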
### Execution Sequence (Two-Stage)
```
Host Interpreter Microkernel Data
| | | |
|-- read WASM_SEG[0] --->| | |
| (interpreter bytes) | | |
| | | |
|-- instantiate -------->| | |
| (load into memory) | | |
| | | |
|-- feed WASM_SEG[1] --->|-- instantiate -------->| |
| (microkernel bytes) | (via interpreter) | |
| | | |
|-- LOAD_QUERY --------->|------- forward ------->| |
| | |-- read VEC_SEG -->|
| | |<- vector block ---|
| | | |
| | | rvf_distances() |
| | | rvf_topk_merge() |
| | | |
|<-- TOPK_RESULT --------|<------ return ---------| |
```
## 5. Size Budget
### Microkernel (role=Microkernel)
Already specified in `microkernel/wasm-runtime.md`:
```
Total: ~5,500 bytes (< 8 KB code budget)
Exports: 14 (query path + quantization + HNSW + verification)
Memory: 8 KB data + 64 KB SIMD scratch
```
### Interpreter (role=Interpreter)
Target: minimal WASM bytecode interpreter sufficient to run the microkernel.
```
Component Estimated Size
--------- --------------
WASM binary parser 4 KB
(magic, section parsing)
Type section decoder 1 KB
(function types)
Import/Export resolution 2 KB
Code section interpreter 12 KB
(control flow, locals)
Stack machine engine 8 KB
(operand stack, call stack)
Memory management 3 KB
(linear memory, grow)
i32/i64 integer ops 4 KB
(add, sub, mul, div, rem, shifts)
f32/f64 float ops 6 KB
(add, sub, mul, div, sqrt, conversions)
v128 SIMD ops (optional) 8 KB
(only if WASM_FEAT_SIMD required)
Table + call_indirect 2 KB
----------
Total (no SIMD): ~42 KB
Total (with SIMD): ~50 KB
```
### Combined (role=Combined)
Interpreter linked with microkernel in a single module:
```
Total: ~48-56 KB (interpreter + microkernel, with overlap eliminated)
```
### Self-Bootstrapping Overhead
For a 10M vector file (~7.3 GB at 384-dim fp16):
- Bootstrap overhead: ~56 KB / ~7.3 GB = **0.0008%**
- The file is 99.9992% data, 0.0008% self-sufficient runtime
For a 1000-vector file (~750 KB):
- Bootstrap overhead: ~56 KB / ~750 KB = **7.5%**
- Still practical for edge/IoT deployments
## 6. Execution Tiers (Extended)
The original three-tier model from ADR-030 is extended:
| Tier | Segment | Size | Boot | Self-Bootstrap? |
|------|---------|------|------|-----------------|
| 0: Embedded WASM Interpreter | WASM_SEG (role=Interpreter) | ~50 KB | <5 ms | **Yes** — file carries its own runtime |
| 1: WASM Microkernel | WASM_SEG (role=Microkernel) | 5.5 KB | <1 ms | No — needs host or Tier 0 |
| 2: eBPF | EBPF_SEG | 10-50 KB | <20 ms | No — needs Linux kernel |
| 3: Unikernel | KERNEL_SEG | 200 KB-2 MB | <125 ms | No — needs VMM (Firecracker) |
**Key insight**: Tier 0 makes all other tiers optional. An RVF file with
Tier 0 embedded runs on *any* host that can execute bytes — bare metal,
browser, microcontroller, FPGA with a soft CPU, or even another WASM runtime.
## 7. "Runs Anywhere Compute Exists"
### What This Means
A self-bootstrapping RVF file requires exactly **one capability** from its host:
> The ability to read bytes from storage and execute them as instructions.
That's it. No operating system. No file system. No network stack. No runtime
library. No package manager. No container engine.
### Where It Runs
| Host | How It Works |
|------|-------------|
| **x86 server** | Native WASM runtime (Wasmtime/WAMR) runs microkernel directly |
| **ARM edge device** | Same — native WASM runtime |
| **Browser tab** | `WebAssembly.instantiate()` on the microkernel bytes |
| **Microcontroller** | Embedded interpreter runs microkernel in 64 KB scratch |
| **FPGA soft CPU** | Interpreter mapped to BRAM, microkernel in flash |
| **Another WASM runtime** | Interpreter-in-WASM runs microkernel-in-WASM (turtles) |
| **Bare metal** | Bootloader extracts interpreter, interpreter runs microkernel |
| **TEE enclave** | Enclave loads interpreter, verified via WITNESS_SEG attestation |
### The Bootstrapping Invariant
For any host `H` with execution capability `E`:
```
∀ H, E: can_execute(H, E) ∧ can_read_bytes(H)
→ can_process_rvf(H, self_bootstrapping_rvf_file)
```
The file is a **fixed point** of the execution relation: it contains everything
needed to process itself.
## 8. Security Considerations
### Interpreter Verification
The embedded interpreter's bytecode is hashed with SHAKE-256-256 and stored
in the WasmHeader (`bytecode_hash`). A WITNESS_SEG can chain the interpreter
hash to a trusted build, providing:
- **Provenance**: Who built this interpreter?
- **Integrity**: Has the interpreter been modified?
- **Attestation**: Can a TEE verify the interpreter before execution?
### Sandbox Guarantees
The WASM sandbox model applies at every layer:
- The interpreter cannot access host memory beyond its linear memory
- The microkernel cannot access interpreter memory
- Each layer communicates only through defined exports/imports
- A trapped module cannot corrupt other modules
### Bootstrap Attack Surface
| Attack | Mitigation |
|--------|-----------|
| Malicious interpreter | Verify `bytecode_hash` against known-good hash in WITNESS_SEG |
| Modified microkernel | Interpreter verifies microkernel hash before instantiation |
| Data corruption | Segment-level CRC32C/SHAKE-256 hashes (Law 2) |
| Code injection | WASM validates all code at load time (type checking) |
| Resource exhaustion | `max_memory_pages` cap, epoch-based interruption |
## 9. API
### Rust (rvf-runtime)
```rust
// Embed a WASM module (Rust has no named arguments; each positional
// argument is annotated with the WasmHeader field it populates)
store.embed_wasm(
    WasmRole::Microkernel as u8, // role
    WasmTarget::Wasm32 as u8,    // target
    WASM_FEAT_SIMD,              // required_features
    &microkernel_bytes,          // wasm_bytecode
    14,                          // export_count
    1,                           // bootstrap_priority
    0,                           // interpreter_type
)?;
// Make self-bootstrapping
store.embed_wasm(
    WasmRole::Interpreter as u8, // role
    WasmTarget::Wasm32 as u8,    // target
    0,                           // required_features (none)
    &interpreter_bytes,          // wasm_bytecode
    3,                           // export_count
    0,                           // bootstrap_priority (tried first)
    0x03,                        // interpreter_type: wasmi-compatible
)?;
// Check if file is self-bootstrapping
assert!(store.is_self_bootstrapping());
// Extract all WASM modules (ordered by priority)
let modules = store.extract_wasm_all()?;
```
### WASM (rvf-wasm bootstrap module)
```rust
use rvf_wasm::bootstrap::{resolve_bootstrap_chain, get_bytecode, BootstrapChain};
let chain = resolve_bootstrap_chain(&rvf_bytes);
match chain {
BootstrapChain::SelfContained { combined } => {
let bytecode = get_bytecode(&rvf_bytes, &combined).unwrap();
// Instantiate and run
}
BootstrapChain::TwoStage { interpreter, microkernel } => {
let interp_code = get_bytecode(&rvf_bytes, &interpreter).unwrap();
let kernel_code = get_bytecode(&rvf_bytes, &microkernel).unwrap();
// Load interpreter, then use it to run microkernel
}
_ => { /* use host runtime */ }
}
```
## 10. Relationship to Existing Segments
| Segment | Relationship to WASM_SEG |
|---------|-------------------------|
| KERNEL_SEG (0x0E) | Alternative execution tier — KERNEL_SEG boots a full unikernel, WASM_SEG runs a lightweight microkernel. Both make the file self-executing but at different capability levels. |
| EBPF_SEG (0x0F) | Complementary — eBPF accelerates hot-path queries on Linux hosts while WASM provides universal portability. |
| WITNESS_SEG (0x0A) | Verification — WITNESS_SEG chains can attest the interpreter and microkernel hashes, providing a trust anchor for the bootstrap chain. |
| CRYPTO_SEG (0x0C) | Signing — CRYPTO_SEG key material can sign WASM_SEG contents for tamper detection. |
| MANIFEST_SEG (0x05) | Discovery — the tail manifest references all WASM_SEGs with their roles and priorities. |
## 11. Implementation Status
| Component | Crate | Status |
|-----------|-------|--------|
| `SegmentType::Wasm` (0x10) | `rvf-types` | Implemented |
| `WasmHeader` (64-byte header) | `rvf-types` | Implemented |
| `WasmRole`, `WasmTarget` enums | `rvf-types` | Implemented |
| `write_wasm_seg` | `rvf-runtime` | Implemented |
| `embed_wasm` / `extract_wasm` | `rvf-runtime` | Implemented |
| `extract_wasm_all` (priority-sorted) | `rvf-runtime` | Implemented |
| `is_self_bootstrapping` | `rvf-runtime` | Implemented |
| `resolve_bootstrap_chain` | `rvf-wasm` | Implemented |
| `get_bytecode` (zero-copy extraction) | `rvf-wasm` | Implemented |
| Embedded interpreter (wasmi-based) | `rvf-wasm` | Future |
| Combined interpreter+microkernel build | `rvf-wasm` | Future |