# RVF Wire Format Reference

## 1. File Structure

An RVF file is a byte stream with no fixed header at offset 0. All structure is
discovered from the tail.

```
Byte 0                                                               EOF
|                                                                      |
v                                                                      v
+--------+--------+--------+     +--------+---------+--------+---------+
| Seg 0  | Seg 1  | Seg 2  | ... | Seg N  | Seg N+1 | Seg N+2| Mfst K  |
| VEC    | VEC    | INDEX  |     | VEC    | HOT     | INDEX  | MANIF   |
+--------+--------+--------+     +--------+---------+--------+---------+
                                                             ^         ^
                                                             |         |
                                                       Level 1 Mfst    |
                                                         Level 0 (last 4KB)
```

### Alignment Rule

Every segment starts at a **64-byte aligned** boundary. If a segment's
payload + footer does not end on a 64-byte boundary, zero-padding is
inserted before the next segment header.

### Byte Order

All multi-byte integers are **little-endian**. All floating-point values
are IEEE 754 little-endian. This matches x86, ARM (in default mode), and
WASM native byte order.

## 2. Primitive Types

```
Type        Size    Encoding
----        ----    --------
u8          1       Unsigned 8-bit integer
u16         2       Unsigned 16-bit little-endian
u32         4       Unsigned 32-bit little-endian
u64         8       Unsigned 64-bit little-endian
i32         4       Signed 32-bit little-endian (two's complement)
i64         8       Signed 64-bit little-endian (two's complement)
f16         2       IEEE 754 half-precision little-endian
f32         4       IEEE 754 single-precision little-endian
f64         8       IEEE 754 double-precision little-endian
varint      1-10    LEB128 unsigned variable-length integer
svarint     1-10    ZigZag + LEB128 signed variable-length integer
hash128     16      First 128 bits of hash output
hash256     32      First 256 bits of hash output
```

### Varint Encoding (LEB128)

```
Value 0-127:          1 byte   [0xxxxxxx]
Value 128-16383:      2 bytes  [1xxxxxxx 0xxxxxxx]
Value 16384-2097151:  3 bytes  [1xxxxxxx 1xxxxxxx 0xxxxxxx]
...up to 10 bytes for u64
```

### Delta Encoding

Sequences of sorted integers use delta encoding:

```
Original: [100, 105, 108, 120, 200]
Deltas:   [100, 5, 3, 12, 80]
Encoded:  [varint(100), varint(5), varint(3), varint(12), varint(80)]
```

With restart points every N entries, the first value in each restart group
is absolute (not delta-encoded).

## 3. Segment Header (64 bytes)

```
Offset  Type     Field             Notes
------  ----     -----             -----
0x00    u32      magic             Always 0x52564653 ("RVFS")
0x04    u8       version           Format version (1)
0x05    u8       seg_type          Segment type enum
0x06    u16      flags             See flags bitfield
0x08    u64      segment_id        Monotonic ordinal
0x10    u64      payload_length    Bytes after header, before footer
0x18    u64      timestamp_ns      UNIX nanoseconds
0x20    u8       checksum_algo     0=CRC32C, 1=XXH3-128, 2=SHAKE-256
0x21    u8       compression       0=none, 1=LZ4, 2=ZSTD, 3=custom
0x22    u16      reserved_0        Must be 0x0000
0x24    u32      reserved_1        Must be 0x00000000
0x28    hash128  content_hash      Payload hash (first 128 bits)
0x38    u32      uncompressed_len  Original payload size (0 if no compression)
0x3C    u32      alignment_pad     Zero padding to 64B boundary
```

### Segment Type Enum

```
0x00        INVALID        Not a valid segment
0x01        VEC_SEG        Vector payloads
0x02        INDEX_SEG      HNSW adjacency
0x03        OVERLAY_SEG    Graph overlay deltas
0x04        JOURNAL_SEG    Metadata mutations
0x05        MANIFEST_SEG   Segment directory
0x06        QUANT_SEG      Quantization dictionaries
0x07        META_SEG       Key-value metadata
0x08        HOT_SEG        Temperature-promoted data
0x09        SKETCH_SEG     Access counter sketches
0x0A        WITNESS_SEG    Capability manifests
0x0B        PROFILE_SEG    Domain profile declarations
0x0C        CRYPTO_SEG     Key material / certificate anchors
0x0D-0xEF   reserved
0xF0-0xFF   extension      Implementation-specific
```

### Flags Bitfield

```
Bit    Mask    Name        Meaning
---    ----    ----        -------
0      0x0001  COMPRESSED  Payload compressed per compression field
1      0x0002  ENCRYPTED   Payload encrypted (key in CRYPTO_SEG)
2      0x0004  SIGNED      Signature footer follows payload
3      0x0008  SEALED      Immutable (compaction output)
4      0x0010  PARTIAL     Partial/streaming write
5      0x0020  TOMBSTONE   Logically deletes prior segment
6      0x0040  HOT         Contains hot-tier data
7      0x0080  OVERLAY     Contains overlay/delta data
8      0x0100  SNAPSHOT    Full snapshot (not delta)
9      0x0200  CHECKPOINT  Safe rollback point
10-15          reserved    Must be zero
```

## 4. Signature Footer

Present only if `SIGNED` flag is set.
Follows immediately after the payload.

```
Offset  Type    Field          Notes
------  ----    -----          -----
0x00    u16     sig_algo       0=Ed25519, 1=ML-DSA-65, 2=SLH-DSA-128s
0x02    u16     sig_length     Signature byte length
0x04    u8[]    signature      Signature bytes
var     u32     footer_length  Total footer size (for backward scan)
```

### Signature Algorithm Sizes

| Algorithm | sig_length | Post-Quantum | Performance |
|-----------|-----------|-------------|-------------|
| Ed25519 | 64 B | No | ~76,000 sign/s |
| ML-DSA-65 | 3,309 B | Yes (NIST Level 3) | ~4,500 sign/s |
| SLH-DSA-128s | 7,856 B | Yes (NIST Level 1) | ~350 sign/s |

## 5. VEC_SEG Payload Layout

Vector segments store blocks of vectors in columnar layout for compression.

```
+------------------------------------------+
| VEC_SEG Payload                          |
+------------------------------------------+
| Block Directory                          |
|   block_count: u32                       |
|   For each block:                        |
|     block_offset: u32 (from payload start)|
|     vector_count: u32                    |
|     dim: u16                             |
|     dtype: u8                            |
|     tier: u8                             |
|   [64B aligned]                          |
+------------------------------------------+
| Block 0                                  |
|   +-- Columnar Vectors --+               |
|   | dim_0[0..count]      |  <- all vals  |
|   | dim_1[0..count]      |     for dim 0 |
|   | ...                  |     then dim 1|
|   | dim_D[0..count]      |     etc.      |
|   +----------------------+               |
|   +-- ID Map --+                         |
|   | encoding: u8 (0=raw, 1=delta-varint) |
|   | restart_interval: u16                |
|   | id_count: u32                        |
|   | [restart_offsets: u32[]] (if delta)  |
|   | [ids: encoded]                       |
|   +-----------+                          |
|   +-- Block CRC --+                      |
|   | crc32c: u32     |                    |
|   +----------------+                     |
|   [64B padding]                          |
+------------------------------------------+
| Block 1                                  |
|   ...                                    |
+------------------------------------------+
```

### Data Type Enum

```
0x00  f32      32-bit float
0x01  f16      16-bit float
0x02  bf16     bfloat16
0x03  i8       signed 8-bit integer (scalar quantized)
0x04  u8       unsigned 8-bit integer
0x05  i4       4-bit integer (packed, 2 per byte)
0x06  binary   1-bit (packed, 8 per byte)
0x07  pq       Product-quantized codes
0x08  custom   Custom encoding (see QUANT_SEG)
```

### Columnar vs Interleaved

**VEC_SEG** (columnar): `dim_0[all], dim_1[all], ..., dim_D[all]`
- Better compression (similar values adjacent)
- Better for batch operations
- Worse for single-vector random access

**HOT_SEG** (interleaved): `vec_0[all_dims], vec_1[all_dims], ...`
- Better for single-vector access (each vector occupies contiguous cache lines)
- Better for top-K refinement (sequential scan)
- No compression benefit

## 6. INDEX_SEG Payload Layout

```
+------------------------------------------+
| INDEX_SEG Payload                        |
+------------------------------------------+
| Index Header                             |
|   index_type: u8 (0=HNSW, 1=IVF, 2=flat) |
|   layer_level: u8 (A=0, B=1, C=2)        |
|   M: u16 (HNSW max neighbors per layer)  |
|   ef_construction: u32                   |
|   node_count: u64                        |
|   [64B aligned]                          |
+------------------------------------------+
| Restart Point Index                      |
|   restart_interval: u32                  |
|   restart_count: u32                     |
|   [restart_offset: u32] * count          |
|   [64B aligned]                          |
+------------------------------------------+
| Adjacency Data                           |
|   For each node (sorted by node_id):     |
|     layer_count: varint                  |
|     For each layer:                      |
|       neighbor_count: varint             |
|       [delta_neighbor_id: varint] * cnt  |
|   [64B padding per restart group]        |
+------------------------------------------+
| Prefetch Hints (optional)                |
|   hint_count: u32                        |
|   For each hint:                         |
|     node_range_start: u64                |
|     node_range_end: u64                  |
|     page_offset: u64                     |
|     page_count: u32                      |
|     prefetch_ahead: u32                  |
|   [64B aligned]                          |
+------------------------------------------+
```

## 7. HOT_SEG Payload Layout

The hot segment stores the most-accessed vectors in interleaved (row-major)
layout with their neighbor lists co-located for cache locality.

```
+------------------------------------------+
| HOT_SEG Payload                          |
+------------------------------------------+
| Hot Header                               |
|   vector_count: u32                      |
|   dim: u16                               |
|   dtype: u8 (f16 or i8)                  |
|   neighbor_M: u16                        |
|   [64B aligned]                          |
+------------------------------------------+
| Interleaved Hot Data                     |
|   For each hot vector:                   |
|     vector_id: u64                       |
|     vector: [dtype * dim]                |
|     neighbor_count: u16                  |
|     [neighbor_id: u64] * neighbor_count  |
|     [64B aligned per entry]              |
+------------------------------------------+
```

Each hot entry is self-contained: vector + neighbors in one contiguous block.
A sequential scan of the HOT_SEG for top-K refinement reads vectors and
neighbors without any pointer chasing.

### Hot Entry Size Example

For 384-dim fp16 vectors with M=16 neighbors:

```
8 (id) + 768 (vector) + 2 (count) + 128 (neighbors) = 906 bytes
Padded to 64B: 960 bytes per entry
```

1000 hot vectors occupy 960 KB, which fits in L2 on CPUs with at least 1 MB
of L2 cache.

## 8. MANIFEST_SEG Payload Layout

```
+------------------------------------------+
| MANIFEST_SEG Payload                     |
+------------------------------------------+
| TLV Records (Level 1 manifest)           |
|   For each record:                       |
|     tag: u16                             |
|     length: u32                          |
|     pad: u16 (to 8B alignment)           |
|     value: [u8; length]                  |
|   [8B aligned]                           |
+------------------------------------------+
| Level 0 Root Manifest (last 4096 bytes)  |
|   (See 02-manifest-system.md for layout) |
+------------------------------------------+
```

## 9. SKETCH_SEG Payload Layout

```
+------------------------------------------+
| SKETCH_SEG Payload                       |
+------------------------------------------+
| Sketch Header                            |
|   block_count: u32                       |
|   width: u32 (counters per row)          |
|   depth: u32 (hash functions)            |
|   counter_bits: u8 (8 or 16)             |
|   decay_shift: u8 (aging right-shift)    |
|   total_accesses: u64                    |
|   [64B aligned]                          |
+------------------------------------------+
| Sketch Data                              |
|   For each block:                        |
|     block_id: u32                        |
|     counters: [u8; width * depth]        |
|     [64B aligned per block]              |
+------------------------------------------+
```

## 10. QUANT_SEG Payload Layout

```
+------------------------------------------+
| QUANT_SEG Payload                        |
+------------------------------------------+
| Quant Header                             |
|   quant_type: u8                         |
|     0 = scalar (min-max per dim)         |
|     1 = product quantization             |
|     2 = binary threshold                 |
|     3 = residual PQ                      |
|   tier: u8                               |
|   dim: u16                               |
|   [64B aligned]                          |
+------------------------------------------+
| Type-specific data:                      |
|                                          |
| Scalar (type 0):                         |
|   min: [f32; dim]                        |
|   max: [f32; dim]                        |
|                                          |
| PQ (type 1):                             |
|   M: u16 (subspaces)                     |
|   K: u16 (centroids per sub)             |
|   sub_dim: u16 (dims per sub)            |
|   codebook: [f32; M * K * sub_dim]       |
|                                          |
| Binary (type 2):                         |
|   threshold: [f32; dim]                  |
|                                          |
| Residual PQ (type 3):                    |
|   coarse_centroids: [f32; K_coarse * dim]|
|   residual_codebook: [f32; M * K * sub]  |
|                                          |
| [64B aligned]                            |
+------------------------------------------+
```

## 11. Checksum Algorithms

| ID | Algorithm | Output | Speed (HW accel) | Use Case |
|----|-----------|--------|-------------------|----------|
| 0 | CRC32C | 4 B (stored in 16B field, zero-padded) | ~3 GB/s (SSE4.2) | Per-block integrity |
| 1 | XXH3-128 | 16 B | ~50 GB/s (AVX2) | Segment content hash |
| 2 | SHAKE-256 | 16 or 32 B | ~1 GB/s | Cryptographic verification |

Default recommendation:
- Block-level CRC: CRC32C (fastest, hardware accelerated)
- Segment content hash: XXH3-128 (fast, good distribution)
- Crypto witness hashes: SHAKE-256 (post-quantum safe)

## 12. Compression

| ID | Algorithm | Ratio | Decompress Speed | Use Case |
|----|-----------|-------|-----------------|----------|
| 0 | None | 1.0x | N/A | Hot tier |
| 1 | LZ4 | 1.5-3x | ~4 GB/s | Warm tier, low latency |
| 2 | ZSTD | 3-6x | ~1.5 GB/s | Cold tier, high ratio |
| 3 | Custom | Varies | Varies | Domain-specific |

Compression is applied per-segment payload. Individual blocks within a
segment share the same compression.

## 13. Tail Scan Algorithm

```python
import os


class CorruptFileError(Exception):
    """Raised when no valid manifest can be located."""


# verify_crc, verify_segment_header, parse_root_manifest, and
# parse_manifest_segment are assumed to be defined elsewhere.
def find_latest_manifest(file):
    file_size = file.seek(0, os.SEEK_END)

    # Fast path: the Level 0 root manifest occupies the last 4096 bytes
    if file_size >= 4096:
        file.seek(file_size - 4096)
        root = file.read(4096)
        if root[0:4] == b'RVM0' and verify_crc(root):
            return parse_root_manifest(root)

    # Slow path: scan backward for a MANIFEST_SEG header,
    # starting at the last 64B boundary that can hold a full header
    scan_pos = (file_size - 64) // 64 * 64
    while scan_pos >= 0:
        file.seek(scan_pos)
        header = file.read(64)
        if (header[0:4] == b'RVFS' and
                header[5] == 0x05 and  # seg_type == MANIFEST_SEG
                verify_segment_header(header)):
            return parse_manifest_segment(file, scan_pos)
        scan_pos -= 64  # Previous 64B boundary

    raise CorruptFileError("No valid MANIFEST_SEG found")
```

Worst case: full backward scan at 64B granularity. For a 4 GB file, this is
about 67M checks, but each check is a 4-byte comparison, so it completes in
~100ms on a modern CPU with mmap. In practice, the fast path succeeds on the
first try for non-corrupt files.
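
The tail scan calls `verify_segment_header` without defining it. The following is a minimal sketch of what such a check might look like, assuming it enforces only the structural rules from Section 3: magic, version 1, a non-INVALID segment type, zeroed reserved fields and flag bits, and known enum values. The names `SEGMENT_HEADER`, `parse_segment_header`, and `RESERVED_FLAGS_MASK` are hypothetical, and the real helper may check more (for example, the content hash). The magic is read as four raw bytes so that it compares against `b"RVFS"` the same way the scan does; whether that matches the on-disk byte order of the `u32` constant is an assumption.

```python
import struct

# Field order and sizes follow the Section 3 table (little-endian, 64 bytes).
# Assumption: magic is unpacked as 4 raw bytes, mirroring the tail-scan
# comparison against b"RVFS" rather than the u32 constant in the table.
SEGMENT_HEADER = struct.Struct("<4sBBHQQQBBHI16sII")
assert SEGMENT_HEADER.size == 64

RESERVED_FLAGS_MASK = 0xFC00  # flag bits 10-15 must be zero

FIELD_NAMES = (
    "magic", "version", "seg_type", "flags", "segment_id",
    "payload_length", "timestamp_ns", "checksum_algo", "compression",
    "reserved_0", "reserved_1", "content_hash", "uncompressed_len",
    "alignment_pad",
)


def parse_segment_header(header: bytes) -> dict:
    """Unpack a 64-byte segment header into a dict keyed by field name."""
    return dict(zip(FIELD_NAMES, SEGMENT_HEADER.unpack(header)))


def verify_segment_header(header: bytes) -> bool:
    """Structural checks only; the payload hash is not verified here."""
    if len(header) != SEGMENT_HEADER.size:
        return False
    h = parse_segment_header(header)
    return (
        h["magic"] == b"RVFS"
        and h["version"] == 1
        and h["seg_type"] != 0x00                    # 0x00 is INVALID
        and (h["flags"] & RESERVED_FLAGS_MASK) == 0
        and h["checksum_algo"] in (0, 1, 2)
        and h["compression"] in (0, 1, 2, 3)
        and h["reserved_0"] == 0
        and h["reserved_1"] == 0
    )


# Usage: build a MANIFEST_SEG header (seg_type 0x05) and check it.
example = SEGMENT_HEADER.pack(
    b"RVFS", 1, 0x05, 0x0000, 42, 4096, 0,  # magic .. timestamp_ns
    1, 0, 0, 0, bytes(16), 0, 0,            # checksum_algo .. alignment_pad
)
print(verify_segment_header(example))                 # True
print(verify_segment_header(b"XXXX" + example[4:]))   # False
```

Keeping the check purely structural makes each probe in the backward scan cheap; a reader would still need to validate the payload against `content_hash` before trusting the manifest.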