Merge commit 'd803bfe2b1fe7f5e219e50ac20d6801a0a58ac75' as 'vendor/ruvector'

This commit is contained in:
ruv
2026-02-28 14:39:40 -05:00
7854 changed files with 3522914 additions and 0 deletions

# RVF: RuVector Format
## A Living, Self-Reorganizing Runtime Substrate for Vector Intelligence
---
### Document Index
#### Core Specification (`spec/`)
| # | Document | Description |
|---|----------|-------------|
| 00 | [Overview](spec/00-overview.md) | The Four Laws, design coordinates, philosophy |
| 01 | [Segment Model](spec/01-segment-model.md) | Append-only segments, headers, lifecycle, multi-file |
| 02 | [Manifest System](spec/02-manifest-system.md) | Two-level manifests, hotset pointers, progressive boot |
| 03 | [Temperature Tiering](spec/03-temperature-tiering.md) | Adaptive layout, access sketches, promotion/demotion |
| 04 | [Progressive Indexing](spec/04-progressive-indexing.md) | Layer A/B/C availability, lazy build, partial search |
| 05 | [Overlay Epochs](spec/05-overlay-epochs.md) | Streaming min-cut, epoch boundaries, rollback |
| 06 | [Query Optimization](spec/06-query-optimization.md) | SIMD alignment, prefetch, varint IDs, cache analysis |
| 07 | [Deletion & Lifecycle](spec/07-deletion-lifecycle.md) | Vector deletion, JOURNAL_SEG wire format, deletion bitmaps, compaction |
| 08 | [Filtered Search](spec/08-filtered-search.md) | META_SEG wire format, filter expressions, metadata indexes |
| 09 | [Concurrency & Versioning](spec/09-concurrency-versioning.md) | Writer locking, reader-writer coordination, space reclamation |
| 10 | [Operations API](spec/10-operations-api.md) | Batch ops, error codes, network streaming, compaction scheduling |
#### Wire Format (`wire/`)
| Document | Description |
|----------|-------------|
| [Binary Layout](wire/binary-layout.md) | Byte-level format reference, all segment payloads |
#### WASM Microkernel (`microkernel/`)
| Document | Description |
|----------|-------------|
| [WASM Runtime](microkernel/wasm-runtime.md) | Cognitum tile mapping, 14 exports, hub-tile protocol |
#### Domain Profiles (`profiles/`)
| Document | Description |
|----------|-------------|
| [Domain Profiles](profiles/domain-profiles.md) | RVDNA, RVText, RVGraph, RVVision specifications |
#### Cryptography (`crypto/`)
| Document | Description |
|----------|-------------|
| [Quantum Signatures](crypto/quantum-signatures.md) | ML-DSA-65, SHAKE-256, hybrid encryption, witnesses |
#### Benchmarks (`benchmarks/`)
| Document | Description |
|----------|-------------|
| [Acceptance Tests](benchmarks/acceptance-tests.md) | Performance targets, crash safety, scalability |
---
### Quick Reference
**The Four Laws**
1. Truth lives at the tail
2. Every segment is independently valid
3. Data and state are separated
4. The format adapts to its workload
**Minimal Upgrade Path** (smallest changes that unlock everything)
1. Add tail manifest segments
2. Make every payload a segment with its own hash and length
3. Add hotset pointers in the manifest
4. Add an epoch overlay model
**Hardware Profiles**
- **Core**: 8 KB code + 8 KB data + 64 KB SIMD (Cognitum tile)
- **Hot**: Multi-tile chip with shared memory
- **Full**: Desktop/server with mmap and full feature set
**Key Numbers**
- Boot: 4 KB read, < 5 ms
- First query: <= 4 MB read, recall >= 0.70
- Full quality: recall >= 0.95
- Signing: ML-DSA-65, 3,309 B signatures, ~4,500 sign/s
- Distance: 384-dim fp16 L2 in ~12 AVX-512 cycles
- Hot entry: 960 bytes (vector + 16 neighbors, cache-line aligned)
**Design Choices**
- Append-only + compaction (not random writes)
- Both mmap desktop and microcontroller tiles
- Priority: streamable > progressive > adaptive > p95 speed

# RVF Implementation Swarm Guidance
## Objective
Implement, test, optimize, and publish the RVF (RuVector Format) as the canonical binary format across all RuVector libraries. Deliver as Rust crates (crates.io), WASM packages (npm), and Node.js N-API bindings (npm).
## Phase Overview
```
Phase 1: Foundation (rvf-types + rvf-wire) ── Week 1-2
Phase 2: Core Runtime (manifest + index + quant) ── Week 3-5
Phase 3: Integration (library adapters) ── Week 6-8
Phase 4: WASM + Node Bindings ── Week 9-10
Phase 5: Testing + Benchmarks ── Week 11-12
Phase 6: Optimization + Publishing ── Week 13-14
```
---
## Phase 1: Foundation — `rvf-types` + `rvf-wire`
### Agent Assignments
| Agent | Role | Crate | Deliverable |
|-------|------|-------|-------------|
| **coder-1** | Types specialist | `crates/rvf/rvf-types/` | All segment types, enums, headers |
| **coder-2** | Wire format specialist | `crates/rvf/rvf-wire/` | Read/write segment headers + payloads |
| **tester-1** | TDD for types/wire | `crates/rvf/rvf-types/tests/`, `crates/rvf/rvf-wire/tests/` | Round-trip tests, fuzz targets |
| **reviewer-1** | Spec compliance | N/A | Verify code matches wire format spec |
### `rvf-types` (no_std, no alloc dependency)
```toml
[package]
name = "rvf-types"
version = "0.1.0"
edition = "2021"
description = "RuVector Format core types — segment headers, enums, flags"
license = "MIT OR Apache-2.0"
categories = ["data-structures", "no-std"]
[features]
default = []
std = []
serde = ["dep:serde"]
```
**Files to create:**
```
crates/rvf/rvf-types/
  src/
    lib.rs        # Re-exports
    segment.rs    # SegmentHeader (64 bytes), SegmentType enum
    flags.rs      # Flags bitfield (COMPRESSED, ENCRYPTED, SIGNED, etc.)
    manifest.rs   # Level0Root (4096 bytes), ManifestTag enum
    vec_seg.rs    # BlockDirectory, BlockHeader, DataType enum
    index_seg.rs  # IndexHeader, IndexType, AdjacencyLayout
    hot_seg.rs    # HotHeader, HotEntry layout
    quant_seg.rs  # QuantHeader, QuantType enum
    sketch_seg.rs # SketchHeader layout
    meta_seg.rs   # MetaField, FilterOp enum
    profile.rs    # ProfileId, ProfileMagic constants
    error.rs      # RvfError enum (format, query, write, tile, crypto)
    constants.rs  # Magic numbers, alignment, limits
  Cargo.toml
```
**Key constants (from spec):**
```rust
pub const SEGMENT_MAGIC: u32 = 0x52564653; // "RVFS"
pub const ROOT_MANIFEST_MAGIC: u32 = 0x52564D30; // "RVM0"
pub const SEGMENT_ALIGNMENT: usize = 64;
pub const ROOT_MANIFEST_SIZE: usize = 4096;
pub const MAX_SEGMENT_PAYLOAD: u64 = 4 * 1024 * 1024 * 1024; // 4 GB
```
**SegmentType enum (from spec 01):**
```rust
#[repr(u8)]
pub enum SegmentType {
Invalid = 0x00,
Vec = 0x01,
Index = 0x02,
Overlay = 0x03,
Journal = 0x04,
Manifest = 0x05,
Quant = 0x06,
Meta = 0x07,
Hot = 0x08,
Sketch = 0x09,
Witness = 0x0A,
Profile = 0x0B,
Crypto = 0x0C,
MetaIdx = 0x0D,
}
```
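A reader turning the wire byte back into this enum needs a fallible conversion, since unknown type bytes must be skippable rather than fatal. A minimal sketch with a hypothetical `from_u8` helper (variants abbreviated; the helper name is illustrative, not part of the spec):

```rust
// Fallible conversion from a raw header byte to a SegmentType.
// Variants are abbreviated here; the full enum is given in the spec above.
#[repr(u8)]
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SegmentType {
    Invalid = 0x00,
    Vec = 0x01,
    Index = 0x02,
    // ... remaining variants elided for brevity
    MetaIdx = 0x0D,
}

// Illustrative helper: unknown bytes yield None so a forward-compatible
// reader can skip the segment instead of failing.
pub fn from_u8(b: u8) -> Option<SegmentType> {
    Some(match b {
        0x00 => SegmentType::Invalid,
        0x01 => SegmentType::Vec,
        0x02 => SegmentType::Index,
        0x0D => SegmentType::MetaIdx,
        _ => return None,
    })
}
```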
### `rvf-wire` (no_std + alloc)
```toml
[package]
name = "rvf-wire"
version = "0.1.0"
description = "RuVector Format wire format reader/writer"
[dependencies]
rvf-types = { path = "../rvf-types" }
[features]
default = ["std"]
std = ["rvf-types/std"]
```
**Files to create:**
```
crates/rvf/rvf-wire/
  src/
    lib.rs
    reader.rs           # SegmentReader: parse header, validate magic/hash
    writer.rs           # SegmentWriter: build header, compute hash, align
    varint.rs           # LEB128 encode/decode
    delta.rs            # Delta encoding with restart points
    crc32c.rs           # CRC32C (software + hardware detect)
    xxh3.rs             # XXH3-128 hash (or re-export from xxhash-rust)
    tail_scan.rs        # find_latest_manifest() backward scan
    manifest_reader.rs  # Level 0 root manifest parser
    manifest_writer.rs  # Level 0 + Level 1 manifest builder
    vec_seg_codec.rs    # VEC_SEG columnar encode/decode
    hot_seg_codec.rs    # HOT_SEG interleaved encode/decode
    index_seg_codec.rs  # INDEX_SEG adjacency encode/decode
  Cargo.toml
```
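The `varint.rs` codec above is standard unsigned LEB128: 7 payload bits per byte, high bit set on all but the last byte. A minimal sketch of what the encode/decode pair might look like (function names are illustrative):

```rust
// Unsigned LEB128 encode: emit 7 bits per byte, low bits first,
// with the high bit marking "more bytes follow".
pub fn encode_uleb128(mut v: u64, out: &mut Vec<u8>) {
    loop {
        let byte = (v & 0x7F) as u8;
        v >>= 7;
        if v == 0 {
            out.push(byte);
            return;
        }
        out.push(byte | 0x80);
    }
}

// Decode one LEB128 value; returns (value, bytes consumed),
// or None if the input is truncated mid-value.
pub fn decode_uleb128(buf: &[u8]) -> Option<(u64, usize)> {
    let mut v: u64 = 0;
    for (i, &b) in buf.iter().enumerate().take(10) {
        v |= u64::from(b & 0x7F) << (7 * i as u32);
        if b & 0x80 == 0 {
            return Some((v, i + 1));
        }
    }
    None
}
```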
### Phase 1 Acceptance Criteria
- [ ] `rvf-types` compiles with `#![no_std]`
- [ ] `rvf-wire` round-trips: create segment -> serialize -> deserialize -> compare
- [ ] Tail scan finds manifest in valid file
- [ ] CRC32C matches reference implementation
- [ ] Varint codec matches LEB128 spec
- [ ] `cargo test` passes for both crates
- [ ] `cargo clippy` clean, `cargo fmt` clean
---
## Phase 2: Core Runtime — manifest + index + quant
### Agent Assignments
| Agent | Role | Crate | Deliverable |
|-------|------|-------|-------------|
| **coder-3** | Manifest system | `crates/rvf/rvf-manifest/` | Two-level manifest, progressive boot |
| **coder-4** | Progressive indexing | `crates/rvf/rvf-index/` | Layer A/B/C HNSW with progressive load |
| **coder-5** | Quantization | `crates/rvf/rvf-quant/` | Temperature-tiered quant (fp16/i8/PQ/binary) |
| **coder-6** | Full runtime | `crates/rvf/rvf-runtime/` | RvfStore API, compaction, append-only |
| **tester-2** | Integration tests | `crates/rvf/tests/` | Progressive load, crash safety, recall |
### `rvf-manifest`
**Key functionality:**
- Parse Level 0 root manifest (4096 bytes) -> extract hotset pointers
- Parse Level 1 TLV records -> build segment directory
- Write new manifest on mutation (two-fsync protocol)
- Manifest chain for rollback (OVERLAY_CHAIN record)
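The Level 0 parse boils down to a fixed-size read plus validation. A hedged sketch of the first step, assuming the magic sits in the first four bytes stored little-endian (field offsets here are illustrative, not the normative layout from the wire spec):

```rust
pub const ROOT_MANIFEST_MAGIC: u32 = 0x52564D30; // "RVM0"
pub const ROOT_MANIFEST_SIZE: usize = 4096;

// Validate a candidate Level 0 root manifest block: exact size, then magic.
// A real parser would go on to read hotset pointers and the Level 1 offset.
pub fn check_root_manifest(block: &[u8]) -> Result<u32, &'static str> {
    if block.len() != ROOT_MANIFEST_SIZE {
        return Err("root manifest must be exactly 4096 bytes");
    }
    let magic = u32::from_le_bytes([block[0], block[1], block[2], block[3]]);
    if magic != ROOT_MANIFEST_MAGIC {
        return Err("bad root manifest magic");
    }
    Ok(magic)
}
```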
### `rvf-index`
**Key functionality:**
- Layer A: Entry points + top-layer adjacency (from INDEX_SEG with HOT flag)
- Layer B: Partial adjacency for hot region (built incrementally)
- Layer C: Full HNSW adjacency (built lazily in background)
- Varint delta-encoded neighbor lists with restart points
- Prefetch hints for cache-friendly traversal
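The delta step of those neighbor lists is gap coding over sorted IDs: store the first ID absolute and each subsequent one as the difference from its predecessor, so the varint layer sees small values. A minimal sketch (restart points omitted; names illustrative):

```rust
// Gap-code a sorted neighbor list: first entry absolute, rest as deltas.
pub fn delta_encode(ids: &[u32]) -> Vec<u32> {
    let mut out = Vec::with_capacity(ids.len());
    let mut prev = 0u32;
    for (i, &id) in ids.iter().enumerate() {
        out.push(if i == 0 { id } else { id - prev });
        prev = id;
    }
    out
}

// Invert the gap coding by running-sum reconstruction.
pub fn delta_decode(deltas: &[u32]) -> Vec<u32> {
    let mut out = Vec::with_capacity(deltas.len());
    let mut acc = 0u32;
    for (i, &d) in deltas.iter().enumerate() {
        acc = if i == 0 { d } else { acc + d };
        out.push(acc);
    }
    out
}
```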
**Integration with existing ruvector-core HNSW:**
- Wrap `hnsw_rs` graph as the in-memory structure
- Serialize HNSW to INDEX_SEG format
- Deserialize INDEX_SEG into `hnsw_rs` layers
### `rvf-quant`
**Key functionality:**
- Scalar quantization: fp32 -> int8 (4x compression)
- Product quantization: M subspaces, K centroids (8-16x compression)
- Binary quantization: sign bit (32x compression)
- QUANT_SEG read/write for codebooks
- Temperature tier assignment from SKETCH_SEG access counters
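The fp32 -> int8 path can be sketched as symmetric scalar quantization with a per-vector scale (an illustrative scheme; the normative codebook layout lives in QUANT_SEG):

```rust
// Symmetric int8 quantization: scale = max|x| / 127, then round each
// component into [-127, 127]. Returns the codes plus the scale needed
// to dequantize.
pub fn quantize_i8(v: &[f32]) -> (Vec<i8>, f32) {
    let max_abs = v.iter().fold(0f32, |m, x| m.max(x.abs()));
    let scale = if max_abs == 0.0 { 1.0 } else { max_abs / 127.0 };
    let q = v.iter().map(|x| (x / scale).round() as i8).collect();
    (q, scale)
}

// Approximate reconstruction; error is bounded by scale / 2 per component.
pub fn dequantize_i8(q: &[i8], scale: f32) -> Vec<f32> {
    q.iter().map(|&x| x as f32 * scale).collect()
}
```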
### `rvf-runtime`
**Key functionality:**
- `RvfStore::create()` / `RvfStore::open()` / `RvfStore::open_readonly()`
- Append-only write path (VEC_SEG + MANIFEST_SEG)
- Progressive load sequence (Level 0 -> hotset -> Level 1 -> on-demand)
- Background compaction (IO-budget-aware, priority-ordered)
- Count-Min Sketch maintenance for temperature decisions
- Promotion/demotion lifecycle
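The Count-Min Sketch behind temperature decisions is just a few rows of hashed counters: every access increments one bucket per row, and the frequency estimate is the minimum across rows (an overestimate, never an underestimate). A minimal sketch with an illustrative hash mix (width, depth, and the mixer are all assumptions, not the SKETCH_SEG spec):

```rust
// Count-Min Sketch for per-vector access frequency, used to decide
// promotion/demotion between temperature tiers.
pub struct CountMin {
    width: usize,
    rows: Vec<Vec<u32>>,
}

impl CountMin {
    pub fn new(width: usize, depth: usize) -> Self {
        CountMin { width, rows: vec![vec![0; width]; depth] }
    }

    // Cheap splitmix64-style per-row mix; not the normative hash.
    fn bucket(&self, row: usize, id: u64) -> usize {
        let mut h = id ^ (row as u64).wrapping_mul(0x9E37_79B9_7F4A_7C15);
        h ^= h >> 33;
        h = h.wrapping_mul(0xFF51_AFD7_ED55_8CCD);
        h ^= h >> 33;
        (h as usize) % self.width
    }

    pub fn record_access(&mut self, id: u64) {
        for r in 0..self.rows.len() {
            let b = self.bucket(r, id);
            self.rows[r][b] = self.rows[r][b].saturating_add(1);
        }
    }

    // Minimum over rows: collisions can only inflate, never deflate.
    pub fn estimate(&self, id: u64) -> u32 {
        (0..self.rows.len())
            .map(|r| self.rows[r][self.bucket(r, id)])
            .min()
            .unwrap_or(0)
    }
}
```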
### Phase 2 Acceptance Criteria
- [ ] Progressive boot: parse Level 0 in < 1 ms, first query in < 50 ms (1M vectors)
- [ ] Recall@10 >= 0.70 with Layer A only
- [ ] Recall@10 >= 0.95 with all layers loaded
- [ ] Crash safety: kill -9 during write -> recover to last valid manifest
- [ ] Compaction reduces dead space while respecting IO budget
- [ ] Scalar quantization reconstruction error < 0.5%
---
## Phase 3: Integration — Library Adapters
### Agent Assignments
| Agent | Role | Target Library | Deliverable |
|-------|------|---------------|-------------|
| **coder-7** | claude-flow adapter | claude-flow memory | RVF-backed memory store |
| **coder-8** | agentdb adapter | agentdb | RVF as persistence backend |
| **coder-9** | agentic-flow adapter | agentic-flow | RVF streaming for inter-agent exchange |
| **coder-10** | rvlite adapter | rvlite | RVF Core Profile minimal store |
### claude-flow Memory -> RVF
```
Current: JSON flat files + in-memory HNSW
Target:  RVF file per memory namespace

Mapping:
  memory store   -> RvfStore with RVText profile
  memory search  -> rvf_runtime.query()
  memory persist -> RVF append (VEC_SEG + META_SEG + MANIFEST_SEG)
  audit trail    -> WITNESS_SEG with hash chain
  session state  -> META_SEG with TTL metadata
```
### agentdb -> RVF
```
Current: Custom HNSW + serde persistence
Target:  RVF file per database instance

Mapping:
  agentdb.insert()  -> rvf_runtime.ingest_batch()
  agentdb.search()  -> rvf_runtime.query()
  agentdb.persist() -> already persistent (append-only)
  HNSW graph        -> INDEX_SEG (Layer A/B/C)
  Metadata          -> META_SEG + METAIDX_SEG
```
### agentic-flow -> RVF
```
Current: Shared memory blobs between agents
Target:  RVF TCP streaming protocol

Mapping:
  agent memory share -> RVF SUBSCRIBE + UPDATE_NOTIFY
  swarm state        -> META_SEG in shared RVF file
  learning patterns  -> SKETCH_SEG for access tracking
  consensus state    -> WITNESS_SEG with signatures
```
### Phase 3 Acceptance Criteria
- [ ] claude-flow `memory store` and `memory search` work against RVF backend
- [ ] agentdb existing test suite passes with RVF storage (swap in, not rewrite)
- [ ] agentic-flow agents can share vectors through RVF streaming protocol
- [ ] Legacy format import tools for each library
---
## Phase 4: WASM + Node.js Bindings
### Agent Assignments
| Agent | Role | Target | Deliverable |
|-------|------|--------|-------------|
| **coder-11** | WASM microkernel | `crates/rvf/rvf-wasm/` | 14-export WASM module (<8 KB) |
| **coder-12** | WASM full runtime | `npm/packages/rvf-wasm/` | wasm-pack build, browser-compatible |
| **coder-13** | Node.js N-API | `crates/rvf/rvf-node/` | napi-rs bindings, platform packages |
| **coder-14** | TypeScript SDK | `npm/packages/rvf/` | TypeScript wrapper, types, docs |
### WASM Microkernel (`rvf-wasm` crate, `wasm32-unknown-unknown`)
```rust
// 14 exports matching spec (microkernel/wasm-runtime.md)
#[no_mangle] pub extern "C" fn rvf_init(config_ptr: i32) -> i32;
#[no_mangle] pub extern "C" fn rvf_load_query(query_ptr: i32, dim: i32) -> i32;
#[no_mangle] pub extern "C" fn rvf_load_block(block_ptr: i32, count: i32, dtype: i32) -> i32;
#[no_mangle] pub extern "C" fn rvf_distances(metric: i32, result_ptr: i32) -> i32;
#[no_mangle] pub extern "C" fn rvf_topk_merge(dist_ptr: i32, id_ptr: i32, count: i32, k: i32) -> i32;
#[no_mangle] pub extern "C" fn rvf_topk_read(out_ptr: i32) -> i32;
// ... remaining 8 exports
```
**Build command:**
```bash
cargo build --target wasm32-unknown-unknown --release -p rvf-wasm
wasm-opt -Oz -o rvf-microkernel.wasm target/wasm32-unknown-unknown/release/rvf_wasm.wasm
```
**Size budget:** Must be < 8 KB after wasm-opt.
### WASM Full Runtime (wasm-pack, browser)
```bash
cd crates/rvf/rvf-runtime
wasm-pack build --target web --features wasm
```
**npm package:** `@ruvector/rvf-wasm`
```typescript
// npm/packages/rvf-wasm/index.ts
import init, { RvfStore } from './pkg/rvf_runtime.js';
await init();
const store = RvfStore.fromBytes(rvfFileBytes);
const results = store.query(queryVector, 10);
```
### Node.js N-API Bindings (napi-rs)
```bash
cd crates/rvf/rvf-node
npm run build # napi build --platform --release
```
**Platform packages:**
| Package | Target |
|---------|--------|
| `@ruvector/rvf-node` | Main package with postinstall platform select |
| `@ruvector/rvf-node-linux-x64-gnu` | Linux x86_64 glibc |
| `@ruvector/rvf-node-linux-arm64-gnu` | Linux aarch64 glibc |
| `@ruvector/rvf-node-darwin-arm64` | macOS Apple Silicon |
| `@ruvector/rvf-node-darwin-x64` | macOS Intel |
| `@ruvector/rvf-node-win32-x64-msvc` | Windows x64 |
### TypeScript SDK
```typescript
// npm/packages/rvf/src/index.ts
export class RvfDatabase {
static async open(path: string): Promise<RvfDatabase>;
static async create(path: string, options?: RvfOptions): Promise<RvfDatabase>;
async insert(id: string, vector: Float32Array, metadata?: Record<string, unknown>): Promise<void>;
async insertBatch(entries: RvfEntry[]): Promise<RvfIngestResult>;
async query(vector: Float32Array, k: number, options?: RvfQueryOptions): Promise<RvfResult[]>;
async delete(ids: string[]): Promise<RvfDeleteResult>;
// Progressive loading
async openProgressive(source: string | URL): Promise<RvfProgressiveReader>;
}
export interface RvfOptions {
profile?: 'generic' | 'rvdna' | 'rvtext' | 'rvgraph' | 'rvvision';
dimensions: number;
metric?: 'l2' | 'cosine' | 'dotproduct' | 'hamming';
compression?: 'none' | 'lz4' | 'zstd';
signing?: { algorithm: 'ed25519' | 'ml-dsa-65'; key: Uint8Array };
}
```
### Phase 4 Acceptance Criteria
- [ ] WASM microkernel < 8 KB after wasm-opt
- [ ] WASM full runtime works in Chrome, Firefox, Node.js
- [ ] N-API bindings pass same test suite as Rust crate
- [ ] TypeScript types match Rust API surface
- [ ] All platform binaries build in CI
---
## Phase 5: Testing + Benchmarks
### Agent Assignments
| Agent | Role | Scope |
|-------|------|-------|
| **tester-3** | Acceptance tests | 10M vector cold start, recall, crash safety |
| **tester-4** | Benchmark harness | criterion benches, perf targets from spec |
| **tester-5** | Fuzz testing | cargo-fuzz for wire format parsing |
| **tester-6** | WASM tests | Browser + Cognitum tile simulation |
### Test Matrix
| Test Category | Description | Target |
|--------------|-------------|--------|
| **Round-trip** | Write + read all segment types | `rvf-wire` |
| **Progressive boot** | Cold start, measure recall at each phase | `rvf-runtime` |
| **Crash safety** | kill -9 during ingest/manifest/compaction | `rvf-runtime` |
| **Bit flip detection** | Random corruption -> hash/CRC catch | `rvf-wire` |
| **Recall benchmarks** | recall@10 at Layer A, B, C | `rvf-index` |
| **Latency benchmarks** | p50/p95/p99 query latency | `rvf-runtime` |
| **Throughput benchmarks** | QPS and ingest rate | `rvf-runtime` |
| **WASM performance** | Distance compute, top-K in WASM | `rvf-wasm` |
| **Interop** | agentdb/claude-flow/agentic-flow integration | adapters |
| **Profile compatibility** | Generic reader opens RVDNA/RVText files | `rvf-runtime` |
### Benchmark Commands
```bash
# Rust benchmarks
cd crates/rvf/rvf-runtime && cargo bench
# WASM benchmarks
cd npm/packages/rvf-wasm && npm run bench
# Node.js benchmarks
cd npm/packages/rvf-node && npm run bench
# Full acceptance test (10M vectors)
cd crates/rvf && cargo test --release --test acceptance -- --ignored
```
### Phase 5 Acceptance Criteria
- [ ] All performance targets from `benchmarks/acceptance-tests.md` met
- [ ] Zero data loss in crash safety tests (100 iterations)
- [ ] 100% bit-flip detection rate
- [ ] WASM microkernel passes Cognitum tile simulation
- [ ] No memory safety issues found by fuzz testing (1M iterations)
---
## Phase 6: Optimization + Publishing
### Agent Assignments
| Agent | Role | Scope |
|-------|------|-------|
| **optimizer-1** | SIMD tuning | AVX-512/NEON distance kernels, alignment |
| **optimizer-2** | Compression tuning | LZ4/ZSTD level selection, block size |
| **publisher-1** | crates.io publishing | Version management, dependency graph |
| **publisher-2** | npm publishing | Platform packages, wasm-pack output |
### SIMD Optimization Targets
| Operation | AVX-512 Target | NEON Target | WASM v128 Target |
|-----------|---------------|-------------|-----------------|
| L2 distance (384-dim fp16) | ~12 cycles | ~48 cycles | ~96 cycles |
| Dot product (384-dim fp16) | ~12 cycles | ~48 cycles | ~96 cycles |
| Hamming (384-bit) | 1 cycle (VPOPCNTDQ) | ~6 cycles (CNT) | ~24 cycles |
| PQ ADC (48 subspaces) | ~48 cycles (gather) | ~96 cycles (TBL) | ~192 cycles |
### Publishing Dependency Order
Crates must be published in dependency order:
```
1. rvf-types (no deps)
2. rvf-wire (depends on rvf-types)
3. rvf-quant (depends on rvf-types)
4. rvf-manifest (depends on rvf-types, rvf-wire)
5. rvf-index (depends on rvf-types, rvf-wire, rvf-quant)
6. rvf-crypto (depends on rvf-types, rvf-wire)
7. rvf-runtime (depends on all above)
8. rvf-wasm (depends on rvf-types, rvf-wire, rvf-quant)
9. rvf-node (depends on rvf-runtime)
10. rvf-server (depends on rvf-runtime)
```
### crates.io Publishing
```bash
# Publish in dependency order
for crate in rvf-types rvf-wire rvf-quant rvf-manifest rvf-index rvf-crypto rvf-runtime rvf-wasm rvf-node rvf-server; do
cd crates/rvf/$crate
cargo publish
sleep 30 # Wait for crates.io index update
cd -
done
```
### npm Publishing
```bash
# WASM package
cd npm/packages/rvf-wasm
npm publish --access public
# Node.js platform binaries
for platform in linux-x64-gnu linux-arm64-gnu darwin-arm64 darwin-x64 win32-x64-msvc; do
cd npm/packages/rvf-node-$platform
npm publish --access public
cd -
done
# Main Node.js package
cd npm/packages/rvf-node
npm publish --access public
# TypeScript SDK
cd npm/packages/rvf
npm publish --access public
```
### Phase 6 Acceptance Criteria
- [ ] SIMD distance kernels meet cycle targets on each platform
- [ ] All crates published to crates.io with correct dependency graph
- [ ] All npm packages published with correct platform detection
- [ ] `npx rvf --version` works
- [ ] `npm install @ruvector/rvf` works on all supported platforms
- [ ] GitHub release with changelog
---
## Swarm Topology
```
                     ┌──────────────┐
                     │    Queen     │
                     │ Coordinator  │
                     └──────┬───────┘
        ┌───────────────────┼───────────────────┐
        │                   │                   │
 ┌──────▼──────┐     ┌──────▼──────┐     ┌──────▼──────┐
 │ Foundation  │     │   Runtime   │     │ Integration │
 │    Squad    │     │    Squad    │     │    Squad    │
 │ (coder 1-2) │     │ (coder 3-6) │     │ (coder 7-10)│
 │ (tester-1)  │     │ (tester-2)  │     │             │
 │ (reviewer-1)│     │             │     │             │
 └──────┬──────┘     └──────┬──────┘     └──────┬──────┘
        │                   │                   │
        │         ┌─────────┴─────────┐         │
        │  ┌──────▼──────┐     ┌──────▼──────┐  │
        │  │ WASM + Node │     │   Testing   │  │
        │  │    Squad    │     │    Squad    │  │
        │  │(coder 11-14)│     │ (tester 3-6)│  │
        │  └──────┬──────┘     └──────┬──────┘  │
        │         └─────────┬─────────┘         │
        └───────────────────┼───────────────────┘
                     ┌──────▼──────┐
                     │ Optimize +  │
                     │   Publish   │
                     │    Squad    │
                     └─────────────┘
```
### Swarm Init Command
```bash
npx @claude-flow/cli@latest swarm init \
--topology hierarchical \
--max-agents 8 \
--strategy specialized
```
### Agent Spawn Commands (via Claude Code Task tool)
All agents should be spawned as `run_in_background: true` Task calls in a single message. Each agent receives:
1. The relevant RVF spec files to read (from `docs/research/rvf/`)
2. The ADR-029 for context
3. The specific phase deliverables from this guidance
4. The acceptance criteria as exit conditions
---
## Critical Path
```
rvf-types ──> rvf-wire ──┬──> rvf-manifest ──┐
                         ├──> rvf-quant ─────┼──> rvf-runtime ──> adapters ──> publish
                         ├──> rvf-index ─────┘
                         ├──> rvf-wasm (parallel)
                         └──> rvf-node (parallel)
```
**Blocking dependencies:**
- Everything depends on `rvf-types`
- `rvf-wire` unlocks all other crates
- `rvf-runtime` blocks integration adapters
- `rvf-wasm` and `rvf-node` can proceed in parallel once `rvf-wire` exists
---
## File Layout Summary
```
crates/rvf/
  rvf-types/     # Segment types, headers, enums (no_std)
  rvf-wire/      # Wire format read/write (no_std + alloc)
  rvf-index/     # Progressive HNSW indexing
  rvf-manifest/  # Two-level manifest system
  rvf-quant/     # Temperature-tiered quantization
  rvf-crypto/    # ML-DSA-65, SHAKE-256
  rvf-runtime/   # Full runtime (RvfStore API)
  rvf-wasm/      # WASM microkernel (<8 KB)
  rvf-node/      # Node.js N-API bindings
  rvf-server/    # TCP/HTTP streaming server
  tests/         # Integration + acceptance tests
  benches/       # Criterion benchmarks

npm/packages/
  rvf/           # TypeScript SDK (@ruvector/rvf)
  rvf-wasm/      # Browser WASM (@ruvector/rvf-wasm)
  rvf-node/      # Node.js native (@ruvector/rvf-node)
  rvf-node-linux-x64-gnu/
  rvf-node-linux-arm64-gnu/
  rvf-node-darwin-arm64/
  rvf-node-darwin-x64/
  rvf-node-win32-x64-msvc/
```
---
## Success Metrics
| Metric | Target | Measured By |
|--------|--------|-------------|
| Cold boot time | < 5 ms | Phase 5 acceptance test |
| First query recall@10 | >= 0.70 | Phase 5 recall benchmark |
| Full recall@10 | >= 0.95 | Phase 5 recall benchmark |
| Query latency p50 | < 0.3 ms (10M vectors) | Phase 5 latency benchmark |
| WASM microkernel size | < 8 KB | Phase 4 build output |
| Crash safety | 0 data loss in 100 kill tests | Phase 5 crash test |
| Crates published | 10 crates on crates.io | Phase 6 publish |
| NPM packages published | 8+ packages on npm | Phase 6 publish |
| Library integration | 4 libraries using RVF | Phase 3 adapter tests |

# RVF Acceptance Tests and Performance Targets
## 1. Primary Acceptance Test
> **Cold start on a 10 million vector file: load and answer the first query with a
> useful result (recall@10 >= 0.70) without reading more than the last 4 MB, then
> converge to full quality (recall@10 >= 0.95) as it progressively maps more segments.**
### Test Parameters
```
Dataset: 10 million vectors
Dimensions: 384 (sentence embedding size)
Base dtype: fp16 (768 bytes per vector)
Raw file size: ~7.2 GB (vectors only)
With index: ~10-12 GB total
Query set: 1000 queries from held-out test set
Ground truth: Brute-force exact k-NN (k=10)
Metric: L2 distance
```
### Success Criteria
| Phase | Time Budget | Data Read | Min Recall@10 | Description |
|-------|------------|-----------|---------------|-------------|
| Boot | < 5 ms | 4 KB (Level 0) | N/A | Parse root manifest |
| First query | < 50 ms | <= 4 MB | >= 0.70 | Layer A + hot cache |
| Working quality | < 500 ms | <= 200 MB | >= 0.85 | Layer A + B |
| Full quality | < 5 s | <= 4 GB | >= 0.95 | Layers A + B + C |
| Optimized | < 30 s | Full file | >= 0.98 | All layers + hot tier |
### Measurement Methodology
```
1. Create RVF file from 10M vector dataset
   - Build full HNSW index (M=16, ef_construction=200)
   - Compute temperature tiers (default: all warm initially)
   - Write with all segment types

2. Cold start measurement
   - Drop filesystem cache: echo 3 > /proc/sys/vm/drop_caches
   - Open file, start timer
   - Read Level 0 (4 KB), record time T_boot
   - Read hotset data, record time T_hotset
   - Execute first query, record time T_first_query and recall@10
   - Continue progressive loading
   - At each milestone: record time, data read, recall@10

3. Throughput measurement (warm)
   - After full load, execute 1000 queries
   - Measure queries per second (QPS)
   - Measure p50, p95, p99 latency
   - Measure recall@10 average

4. Streaming ingest measurement
   - Start with empty file
   - Ingest 10M vectors in streaming mode
   - Measure ingest rate (vectors/second)
   - Measure file size over time
   - Verify crash safety (kill -9 at random points, verify recovery)
```
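The recall@10 figures in this methodology are plain set overlap against the brute-force ground truth, averaged over queries; a minimal sketch:

```rust
// recall@K = |retrieved ∩ ground_truth| / |ground_truth|, averaged
// over all queries. Both inputs are per-query ID lists of length K.
pub fn recall_at_k(retrieved: &[Vec<u64>], ground_truth: &[Vec<u64>]) -> f64 {
    assert_eq!(retrieved.len(), ground_truth.len());
    let mut total = 0.0;
    for (ret, gt) in retrieved.iter().zip(ground_truth) {
        let hits = ret.iter().filter(|id| gt.contains(*id)).count();
        total += hits as f64 / gt.len() as f64;
    }
    total / retrieved.len() as f64
}
```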
## 2. Performance Targets
### Query Latency (10M vectors, 384 dim, fp16)
| Hardware | QPS (single thread) | p50 Latency | p95 Latency | p99 Latency |
|----------|-------------------|-------------|-------------|-------------|
| Desktop (AVX-512) | 5,000-15,000 | 0.1 ms | 0.3 ms | 1.0 ms |
| Desktop (AVX2) | 3,000-8,000 | 0.2 ms | 0.5 ms | 2.0 ms |
| Laptop (NEON) | 2,000-5,000 | 0.3 ms | 1.0 ms | 3.0 ms |
| WASM (browser) | 500-2,000 | 1.0 ms | 3.0 ms | 10.0 ms |
| Cognitum tile | 100-500 | 2.0 ms | 5.0 ms | 15.0 ms |
### Streaming Ingest Rate
| Hardware | Vectors/Second | Bytes/Second | Notes |
|----------|---------------|-------------|-------|
| NVMe SSD | 200K-500K | 150-380 MB/s | fsync every 1000 vectors |
| SATA SSD | 50K-100K | 38-76 MB/s | fsync every 1000 vectors |
| HDD | 10K-30K | 7-23 MB/s | Sequential append |
| Network (1 Gbps) | 50K-100K | 38-76 MB/s | Streaming over network |
### Progressive Load Times
| Phase | NVMe SSD | SATA SSD | HDD | Network |
|-------|----------|----------|-----|---------|
| Boot (4 KB) | < 0.1 ms | < 0.5 ms | < 10 ms | < 50 ms |
| First query (4 MB) | < 2 ms | < 10 ms | < 100 ms | < 500 ms |
| Working quality (200 MB) | < 100 ms | < 500 ms | < 5 s | < 20 s |
| Full quality (4 GB) | < 2 s | < 10 s | < 120 s | < 400 s |
### Space Efficiency
| Configuration | Bytes/Vector | File Size (10M) | Ratio vs Raw |
|--------------|-------------|-----------------|-------------|
| Raw fp32 | 1,536 | 14.3 GB | 1.0x |
| RVF uniform fp16 | 768 + overhead | 8.0 GB | 0.56x |
| RVF adaptive (equilibrium) | ~300 avg | 3.2 GB | 0.22x |
| RVF aggressive (binary cold) | ~100 avg | 1.1 GB | 0.08x |
## 3. Crash Safety Tests
### Test 1: Kill During Vector Ingest
```
1. Start ingesting 1M vectors
2. After 500K vectors: kill -9 the writer
3. Verify: file is readable
4. Verify: latest valid manifest is found
5. Verify: all vectors referenced by latest manifest are intact
6. Verify: no data corruption (all segment hashes valid)
```
**Pass criteria**: Zero data loss for committed segments. At most the
last incomplete segment is lost (bounded by fsync interval).
### Test 2: Kill During Manifest Write
```
1. Create file with 1M vectors
2. Trigger manifest rewrite (add metadata, trigger compaction)
3. Kill -9 during manifest write
4. Verify: file falls back to previous valid manifest
5. Verify: all queries work correctly with previous manifest
```
**Pass criteria**: Automatic fallback to previous manifest. No manual
recovery needed.
### Test 3: Kill During Compaction
```
1. Create file with 1M vectors across 100 small VEC_SEGs
2. Trigger compaction
3. Kill -9 during compaction
4. Verify: file is readable (old segments still valid)
5. Verify: partial compaction output is safely ignored
```
**Pass criteria**: Old segments remain valid. Incomplete compaction
output has no manifest reference and is safely orphaned.
### Test 4: Bit Flip Detection
```
1. Create valid RVF file
2. Flip random bits in various locations
3. Verify: corruption detected by hash/CRC checks
4. Verify: specific corrupted segment identified
5. Verify: other segments still readable
```
**Pass criteria**: 100% detection of single-bit flips. Corruption
isolated to affected segment.
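Detection rests on the per-segment CRC/hash. A bitwise software CRC32C (Castagnoli, reflected polynomial 0x82F63B78) is enough to demonstrate single-bit-flip detection; real builds use lookup tables or the SSE4.2/ARMv8 CRC instructions, as the `crc32c.rs` module notes:

```rust
// Bitwise software CRC32C (Castagnoli). Init and final XOR are all-ones,
// matching the standard check value crc32c("123456789") = 0xE3069283.
pub fn crc32c(data: &[u8]) -> u32 {
    let mut crc = !0u32;
    for &b in data {
        crc ^= b as u32;
        for _ in 0..8 {
            crc = if crc & 1 != 0 {
                (crc >> 1) ^ 0x82F6_3B78
            } else {
                crc >> 1
            };
        }
    }
    !crc
}
```

Flipping any single bit in a checked payload changes the CRC, which is how Test 4 isolates corruption to the affected segment.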
## 4. Scalability Tests
### Test: 1 Billion Vectors
```
Dataset: 1B vectors, 384 dimensions, fp16
File size: ~700 GB (raw) -> ~200 GB (adaptive RVF)
Hardware: Server with 256 GB RAM, NVMe array
Verify:
- Boot time < 10 ms
- First query < 100 ms
- Full quality convergence < 60 s
- Recall@10 >= 0.95 at full quality
- Streaming ingest sustained at 100K+ vectors/second
```
### Test: High Dimensionality
```
Dataset: 1M vectors, 4096 dimensions (LLM embeddings)
File size: ~8 GB (fp16)
Verify:
- PQ compression to 5-bit achieves >= 10x compression
- Recall@10 >= 0.90 with PQ
- Query latency < 5 ms (p95) with PQ + HNSW
```
### Test: Multi-File Sharding
```
Dataset: 100M vectors across 10 shard files
Verify:
- Transparent query across all shards
- Shard addition without full rebuild
- Individual shard compaction
- Shard removal with manifest update only
```
## 5. WASM Performance Tests
### Browser Environment
```
Runtime: Chrome V8 / Firefox SpiderMonkey
SIMD: WASM v128
Memory: Limited to 4 GB WASM heap
Test: Load 1M vector RVF file via fetch()
- Boot time < 50 ms
- First query < 200 ms (after boot)
- QPS >= 500 (single thread)
- Memory usage < 500 MB
```
### Cognitum Tile Simulation
```
Runtime: wasmtime with memory limits
Code limit: 8 KB
Data limit: 8 KB
Scratch: 64 KB
Test: Process 1000 blocks via hub protocol
- Distance computation matches reference implementation
- Top-K results match brute-force within quantization tolerance
- No memory access out of bounds
- Tile recovers from simulated faults
```
## 6. Interoperability Tests
### Round-Trip Test
```
1. Create RVF file from numpy arrays
2. Read back with independent implementation
3. Verify: all vectors bit-identical
4. Verify: all metadata preserved
5. Verify: index produces same results
```
### Profile Compatibility Test
```
1. Create RVDNA file with genomic data
2. Create RVText file with text embeddings
3. Read both with generic RVF reader
4. Verify: generic reader can access vectors and metadata
5. Verify: profile-specific features degrade gracefully
```
### Version Forward Compatibility Test
```
1. Create RVF file with version 1
2. Add segments with hypothetical version 2 features (unknown tags)
3. Read with version 1 reader
4. Verify: version 1 reader skips unknown segments/tags
5. Verify: version 1 data is fully accessible
```
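The skip behavior falls out of length-prefixed segments: a reader that does not recognize a type byte can still advance past the payload. An illustrative walker over a simplified `[type: u8][len: u32 LE][payload]` stream (the real segment header is 64 bytes; this layout is only for demonstration):

```rust
// Walk a stream of [type][len][payload] records, collecting the types we
// recognize and skipping unknown ones by their declared length.
pub fn known_segments(mut buf: &[u8], known: &[u8]) -> Vec<u8> {
    let mut seen = Vec::new();
    while buf.len() >= 5 {
        let ty = buf[0];
        let len = u32::from_le_bytes([buf[1], buf[2], buf[3], buf[4]]) as usize;
        if buf.len() < 5 + len {
            break; // truncated tail record; stop at the last valid one
        }
        if known.contains(&ty) {
            seen.push(ty);
        }
        buf = &buf[5 + len..];
    }
    seen
}
```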
## 7. Security Tests
### Signature Verification
```
1. Create signed RVF file (ML-DSA-65)
2. Verify all segment signatures
3. Modify one byte in a signed segment
4. Verify: modification detected
5. Verify: other segments still valid
```
### Encryption Round-Trip
```
1. Create encrypted RVF file (ML-KEM-768 + AES-256-GCM)
2. Decrypt with correct key
3. Verify: plaintext matches original
4. Attempt decrypt with wrong key
5. Verify: decryption fails (GCM auth tag mismatch)
```
### Key Rotation
```
1. Create file signed with key A
2. Rotate to key B (write CRYPTO_SEG rotation record)
3. Write new segments signed with key B
4. Verify: old segments valid with key A
5. Verify: new segments valid with key B
6. Verify: cross-signature in rotation record is valid
```
## 8. Benchmark Harness
### Recommended Tools
| Purpose | Tool | Notes |
|---------|------|-------|
| Latency measurement | criterion (Rust) / benchmark.js | Statistical rigor |
| Recall measurement | Custom recall@K computation | Against brute-force ground truth |
| Memory profiling | valgrind massif / Chrome DevTools | Peak and sustained |
| I/O profiling | blktrace / iostat | Verify read patterns |
| SIMD verification | Intel SDE / ARM emulator | Correct SIMD codegen |
| Crash testing | Custom harness with kill -9 | Random timing |
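The recall@K computation the table refers to is small enough to pin down precisely. A minimal sketch — the function name and candidate lists are illustrative, not part of the spec:

```python
def recall_at_k(approx_ids, exact_ids, k):
    """Fraction of the brute-force top-k that the approximate search found."""
    return len(set(approx_ids[:k]) & set(exact_ids[:k])) / k

exact = list(range(10))                    # ground-truth IDs from brute force
approx = [0, 1, 2, 3, 4, 5, 6, 7, 98, 99]  # ANN result: 8 of 10 correct
print(recall_at_k(approx, exact, 10))      # 0.8
```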
### Report Format
Each benchmark run produces a report:
```json
{
"test_name": "cold_start_10m",
"dataset": {
"vector_count": 10000000,
"dimensions": 384,
"dtype": "fp16",
"file_size_bytes": 10737418240
},
"hardware": {
"cpu": "Intel Xeon w5-3435X",
"simd": "AVX-512",
"ram_gb": 256,
"storage": "NVMe Samsung 990 Pro"
},
"results": {
"boot_ms": 0.08,
"first_query_ms": 12.3,
"first_query_recall_at_10": 0.73,
"working_quality_ms": 340,
"working_quality_recall_at_10": 0.87,
"full_quality_ms": 3200,
"full_quality_recall_at_10": 0.96,
"steady_state_qps": 8500,
"steady_state_p50_ms": 0.12,
"steady_state_p95_ms": 0.28,
"steady_state_p99_ms": 0.85,
"data_read_first_query_mb": 3.2,
"data_read_working_quality_mb": 180
}
}
```

# RVF Quantum-Resistant Cryptography
## 1. Threat Model
RVF files may contain high-value intelligence (medical genomics, proprietary
embeddings, classified networks). The cryptographic design must:
1. **Authenticate**: Prove a segment was written by an authorized producer
2. **Integrity**: Detect any modification to segment payloads
3. **Quantum resistance**: Survive attacks by future quantum computers
4. **Performance**: Not bottleneck streaming ingest or query paths
5. **Compactness**: Signatures must fit in segment footers without bloating
### Harvest-Now, Decrypt-Later
Adversaries may archive RVF files today and break classical signatures later
with quantum computers. Post-quantum signatures protect against this from day one.
## 2. Algorithm Selection
### NIST Post-Quantum Standards (FIPS 204, 205, 206)
| Algorithm | Standard | Type | Sig Size | PK Size | SK Size | Sign/s | Verify/s | Level |
|-----------|----------|------|----------|---------|---------|--------|----------|-------|
| ML-DSA-44 | FIPS 204 | Lattice | 2,420 B | 1,312 B | 2,560 B | ~9,000 | ~42,000 | 2 |
| ML-DSA-65 | FIPS 204 | Lattice | 3,309 B | 1,952 B | 4,032 B | ~4,500 | ~17,000 | 3 |
| ML-DSA-87 | FIPS 204 | Lattice | 4,627 B | 2,592 B | 4,896 B | ~2,800 | ~10,000 | 5 |
| SLH-DSA-128s | FIPS 205 | Hash | 7,856 B | 32 B | 64 B | ~350 | ~15,000 | 1 |
| SLH-DSA-128f | FIPS 205 | Hash | 17,088 B | 32 B | 64 B | ~3,000 | ~90,000 | 1 |
| FN-DSA-512 | FIPS 206 | Lattice | 666 B | 897 B | ~1.3 KB | ~5,000 | ~25,000 | 1 |
### RVF Default: ML-DSA-65
**Why ML-DSA-65**:
- NIST Level 3 security (128-bit post-quantum)
- 3,309 byte signatures (manageable in segment footer)
- ~4,500 sign/s (sufficient for streaming ingest at segment level)
- ~17,000 verify/s (fast enough for progressive load verification)
- Well-studied lattice assumption (Module-LWE)
**Alternative for size-constrained environments (Core Profile)**:
FN-DSA-512 with 666 byte signatures — but FIPS 206 is newer and less deployed.
**Alternative for maximum conservatism**:
SLH-DSA-128s (hash-based, stateless, minimal assumptions) — 7,856 byte
signatures but the smallest keys and strongest theoretical foundation.
## 3. Signature Scheme
### What Gets Signed
Each signed segment's signature covers:
```
signed_data = segment_header[0:40] # Header minus content_hash and padding
|| content_hash # The payload hash
|| segment_id_bytes # Prevent replay
|| context_string # Domain separation
```
The signature does NOT cover the raw payload directly — it covers the payload's
hash. This means:
- Signing is O(1) regardless of payload size
- The hash is computed during write anyway (required for integrity)
- Verification requires only the header + hash, not the full payload
### Context String
```
context = "RVF-v1-" || seg_type_name || "-" || profile_name
```
Examples:
- `"RVF-v1-VEC_SEG-rvdna"`
- `"RVF-v1-MANIFEST_SEG-generic"`
Domain separation prevents cross-type signature confusion.
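The signed-data construction and context string can be sketched directly. A hedged example — the 40-byte header prefix comes from the spec text, but the u64 little-endian segment ID encoding and the helper names are assumptions:

```python
def build_context(seg_type_name: str, profile_name: str) -> bytes:
    # context = "RVF-v1-" || seg_type_name || "-" || profile_name
    return b"RVF-v1-" + seg_type_name.encode() + b"-" + profile_name.encode()

def build_signed_data(header: bytes, content_hash: bytes,
                      segment_id: int, context: bytes) -> bytes:
    # Cover the header (minus content_hash/padding), the payload hash,
    # the segment ID (replay protection), and the context (domain separation).
    assert len(header) >= 40
    seg_id_bytes = segment_id.to_bytes(8, "little")  # assumed u64 LE
    return header[:40] + content_hash + seg_id_bytes + context

ctx = build_context("VEC_SEG", "rvdna")
print(ctx)  # b'RVF-v1-VEC_SEG-rvdna'
```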
### Key Management
Keys are stored in CRYPTO_SEG segments:
```
CRYPTO_SEG Payload:
key_type: u8
0 = signing public key
1 = verification certificate chain
2 = encryption public key (for ENCRYPTED segments)
3 = key rotation record
algorithm: u8
0 = Ed25519 (classical)
1 = ML-DSA-65 (post-quantum)
2 = SLH-DSA-128s (hash-based PQ)
3 = X25519 (classical KEM)
4 = ML-KEM-768 (post-quantum KEM)
key_id: [u8; 16] Unique key identifier (hash of public key)
key_data: [u8; var] The actual key material
valid_from: u64 Timestamp (ns) when key becomes valid
valid_until: u64 Timestamp (ns) when key expires (0 = no expiry)
```
### Key Rotation
New keys are introduced by writing a new CRYPTO_SEG with `key_type=3`
(rotation record) that references both old and new key IDs. Segments
signed with either key are valid during the transition period.
```
CRYPTO_SEG (rotation):
old_key_id: [u8; 16]
new_key_id: [u8; 16]
rotation_timestamp: u64
cross_signature: [u8; var] New key signed by old key
```
## 4. Hash Functions
### SHAKE-256 (Primary)
SHAKE-256 from the SHA-3 family is used for:
- Content hashes in segment headers (128-bit truncation for compactness)
- Min-cut witness hashes (256-bit for cryptographic binding)
- Key derivation
- Domain separation
**Why SHAKE-256**:
- Post-quantum safe (Grover's algorithm only halves effective preimage security, so a 256-bit output retains ~128-bit quantum resistance)
- Extendable output function (XOF) — can produce any hash length
- No length extension attacks
- ~1 GB/s in software, faster with hardware SHA-3 extensions
### XXH3-128 (Fast Path)
XXH3 is used for non-cryptographic content hashing where speed matters more
than collision resistance:
- Segment content hashes when crypto verification is not required
- Block-level integrity checks in combination with CRC32C
**Performance**: ~50 GB/s with AVX2. This means hash computation is never
the bottleneck during streaming ingest.
### CRC32C (Block Level)
CRC32C is used for per-block integrity within segments:
- Detects random bit flips and truncation
- Hardware accelerated on x86 (SSE4.2) and ARM (CRC32 extension)
- ~3 GB/s throughput
### Hash Selection by Context
| Context | Algorithm | Output Size | Why |
|---------|-----------|------------|-----|
| Block integrity | CRC32C | 4 B | Fastest, HW accel |
| Segment content hash (fast) | XXH3-128 | 16 B | Very fast, good distribution |
| Segment content hash (crypto) | SHAKE-256 | 16 B | Post-quantum, collision resistant |
| Witness / proof hashes | SHAKE-256 | 32 B | Full crypto strength |
| Key derivation | SHAKE-256 | 32+ B | XOF flexibility |
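Because SHAKE-256 is an XOF, the 16-byte "truncated" content hash is simply the 16-byte prefix of the full output. A sketch using Python's hashlib (the function names are illustrative):

```python
import hashlib

def content_hash_crypto(payload: bytes) -> bytes:
    # 128-bit segment content hash: first 16 bytes of SHAKE-256 output
    return hashlib.shake_256(payload).digest(16)

def witness_hash(data: bytes) -> bytes:
    # Full-strength 256-bit output for witness / proof hashes
    return hashlib.shake_256(data).digest(32)

# XOF property: shorter outputs are prefixes of longer ones
payload = b"segment payload"
assert witness_hash(payload)[:16] == content_hash_crypto(payload)
```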
## 5. Encryption (Optional)
For ENCRYPTED segments, RVF uses hybrid encryption:
### Key Encapsulation
```
Classical: X25519 ECDH
Post-Quantum: ML-KEM-768 (CRYSTALS-Kyber, NIST Level 3)
Hybrid: X25519 || ML-KEM-768 (concatenated shared secrets)
```
### Payload Encryption
```
Algorithm: AES-256-GCM (AEAD)
Key: SHAKE-256(X25519_shared || ML-KEM_shared || context)
Nonce: First 12 bytes of SHAKE-256(segment_id || timestamp)
AAD: segment_header[0:40] (authenticated but not encrypted)
```
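The key and nonce derivation above, sketched with Python's hashlib. The byte widths and little-endian encoding of `segment_id` and the timestamp are assumptions — the spec pins only the hash construction:

```python
import hashlib

def derive_payload_key(x25519_shared: bytes, ml_kem_shared: bytes,
                       context: bytes) -> bytes:
    # AES-256 key = SHAKE-256(X25519_shared || ML-KEM_shared || context)
    return hashlib.shake_256(x25519_shared + ml_kem_shared + context).digest(32)

def derive_nonce(segment_id: int, timestamp_ns: int) -> bytes:
    # Nonce = first 12 bytes of SHAKE-256(segment_id || timestamp)
    material = segment_id.to_bytes(8, "little") + timestamp_ns.to_bytes(8, "little")
    return hashlib.shake_256(material).digest(12)

key = derive_payload_key(bytes(32), bytes(32), b"RVF-v1-VEC_SEG-generic")
print(len(key), len(derive_nonce(1, 2)))  # 32 12
```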
### Encrypted Segment Layout
```
Segment Header (64B, plaintext)
flags: ENCRYPTED set
content_hash: hash of PLAINTEXT payload (for integrity after decrypt)
Encapsulated Keys
x25519_ephemeral_pk: [u8; 32]
ml_kem_ciphertext: [u8; 1088]
key_id_recipient: [u8; 16]
Encrypted Payload
AES-256-GCM ciphertext (same size as plaintext + 16B auth tag)
Signature Footer (if also SIGNED)
Signature covers header + encapsulated keys + encrypted payload
```
## 6. Capability Manifests (WITNESS_SEG)
WITNESS_SEGs provide cryptographic proof of provenance and computation:
### Witness Types
```
0x01 PROVENANCE Who created this file and when
0x02 COMPUTATION Proof that an index was correctly built
0x03 DELEGATION Authorization chain for data access
0x04 AUDIT Record of queries executed against this file
0x05 ATTESTATION Hardware attestation (for Cognitum tiles)
```
### Provenance Witness
```
creator_key_id: [u8; 16]
creation_time: u64
tool_name: [u8; 64]
tool_version: [u8; 16]
input_hashes: [(hash256, description)] Hashes of source data
transform_description: [u8; var] What was done to create vectors
signature: [u8; var] Creator's signature over all above
```
### Computation Witness
```
computation_type: u8
0 = HNSW construction
1 = Quantization training
2 = Temperature compaction
3 = Overlay rebalance
4 = Index merge
input_segments: [segment_id]
output_segments: [segment_id]
parameters: [(key, value)]
result_hash: hash256
duration_ns: u64
signature: [u8; var]
```
This lets any reader verify that the producer attests the index was built from
the declared vectors using the declared parameters — without re-running the
computation. The witness binds inputs, parameters, and result hash under the
producer's signature; it proves provenance, not independent correctness.
## 7. Signing Performance Budget
For streaming ingest at 100K vectors/second with 1024-vector blocks:
```
Segment write rate: ~100 segments/second (1024 vectors per VEC_SEG)
Manifest writes: ~1/second (batched)
ML-DSA-65 signing: ~4,500/second
Signing budget: 100 segment sigs + 1 manifest sig = 101/second
Utilization: 101 / 4,500 = 2.2%
```
Signing is not a bottleneck. Even at 10x the ingest rate, ML-DSA-65 has
headroom.
For verification during progressive load (reading 1000 segments):
```
ML-DSA-65 verify: ~17,000/second
Verification budget: 1000 segments / 17,000 = 59 ms
```
All segments verified in under 60 ms. This runs concurrently with data
loading, so it adds minimal latency to the progressive boot sequence.
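Both budgets reduce to one-line calculations; a sketch using the spec's figures:

```python
def signing_utilization(seg_sigs_per_s: float, manifest_sigs_per_s: float,
                        sign_rate: float) -> float:
    # Fraction of the signer's throughput consumed by ingest
    return (seg_sigs_per_s + manifest_sigs_per_s) / sign_rate

def verify_latency_ms(segment_count: int, verify_rate: float) -> float:
    # Wall-clock time to verify a batch of segment signatures
    return segment_count / verify_rate * 1000

print(f"{signing_utilization(100, 1, 4500):.1%}")  # 2.2%
print(f"{verify_latency_ms(1000, 17000):.0f} ms")  # 59 ms
```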
## 8. Core Profile Crypto
For the Core Profile (8 KB code budget), full ML-DSA-65 verification is
too large (~15 KB of code). Options:
1. **Hub verifies, tile trusts**: Hub checks all signatures before sending
blocks to tiles. Tile only needs CRC32C for transport integrity.
2. **Truncated verification**: Tile verifies only the CRC32C of received
blocks. Hub provides a signed attestation that the source segments
were verified.
3. **FN-DSA-512**: Smaller verification code (~3 KB), 666 byte signatures.
Fits in tile code budget but is less mature.
Recommended: Option 1 (hub verifies, tile trusts) for the initial release.
The hub is a trusted component in the Cognitum architecture, and the
tile-hub channel is physically secure (on-chip mesh).
## 9. Algorithm Agility
The `sig_algo` and `checksum_algo` fields in segment headers and footers
allow algorithm migration without format changes:
```
Today: ML-DSA-65 signatures, SHAKE-256 hashes
Future: May migrate to ML-DSA-87 or newer NIST standards
Transition: Write new segments with new algo, old segments remain valid
Verification: Reader tries algo from header field, no guessing needed
```
New algorithms are introduced by:
1. Assigning a new enum value
2. Writing a CRYPTO_SEG with the new key type
3. Signing new segments with the new algorithm
4. Old segments with old signatures remain verifiable
No file rewrite needed. No flag day. Gradual migration through the
append-only segment model.

# RVF WASM Microkernel and Cognitum Hardware Mapping
## 1. Design Philosophy
RVF must run on hardware ranging from a 64 KB WASM tile to a petabyte
cluster. The WASM microkernel is the minimal runtime that makes a tile
a first-class RVF citizen — capable of answering queries, ingesting
streams, and participating in distributed search.
The microkernel is not a shrunken version of the full runtime. It is a
**purpose-built execution core** that exposes the exact set of operations
a tile needs, and nothing more.
## 2. Cognitum Tile Architecture
### Hardware Constraints
```
+-----------------------------------+
| Cognitum Tile |
| |
| Code Memory: 8 KB |
| Data Memory: 8 KB |
| SIMD Scratch: 64 KB |
| Registers: v128 (WASM SIMD) |
| Clock: ~1 GHz |
| Interconnect: Mesh to hub |
| |
| No filesystem. No mmap. |
| No allocator beyond scratch. |
| All I/O through hub messages. |
+-----------------------------------+
```
### Memory Map
```
Code (8 KB):
0x0000 - 0x0FFF Microkernel WASM bytecode (4 KB)
0x1000 - 0x17FF Distance function hot path (2 KB)
0x1800 - 0x1FFF Decode / quantization stubs (2 KB)
Data (8 KB):
0x0000 - 0x003F Tile configuration (64 B)
0x0040 - 0x00FF Query scratch (192 B: query vector fp16)
0x0100 - 0x01FF Result buffer (256 B: top-K candidates)
0x0200 - 0x03FF Routing table (512 B: entry points + centroids)
0x0400 - 0x07FF Decode workspace (1 KB)
0x0800 - 0x0FFF Message I/O buffer (2 KB)
0x1000 - 0x1FFF Neighbor list cache (4 KB)
SIMD Scratch (64 KB):
0x0000 - 0x7FFF Vector block (up to 42 vectors @ 384-dim fp16)
0x8000 - 0xBFFF Distance accumulator / PQ tables (16 KB)
0xC000 - 0xEFFF Hot cache subset (12 KB)
0xF000 - 0xFFFF Temporary / spill (4 KB)
```
### Tile Budget
For 384-dim fp16 vectors:
- One vector: 768 bytes
- Vector block region holds: 32 KB / 768 B = ~42 vectors
- Top-K result buffer: 16 candidates * 16 B = 256 B
- Query vector: 768 B
A tile can process one block of ~42 vectors per load, computing distances
and maintaining a top-K heap entirely within scratch memory.
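The capacity arithmetic as a sketch; per the memory map, the streamed block lives in the 32 KB vector-block region, with the rest of scratch reserved for PQ tables, hot cache, and spill:

```python
def vectors_per_region(region_bytes: int, dims: int, dtype_bytes: int) -> int:
    """How many vectors of the given shape fit in a scratch region."""
    return region_bytes // (dims * dtype_bytes)

print(vectors_per_region(32 * 1024, 384, 2))  # 42: the vector-block region
print(vectors_per_region(64 * 1024, 384, 2))  # 85: if the whole scratch were used
```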
## 3. Microkernel Exports
The WASM microkernel exports exactly these functions:
```wat
;; === Core Query Path ===
;; Initialize tile with configuration
;; config_ptr: pointer to 64B tile config in data memory
(export "rvf_init" (func $rvf_init (param $config_ptr i32) (result i32)))
;; Load query vector into query scratch
;; query_ptr: pointer to fp16 vector in data memory
;; dim: vector dimensionality
(export "rvf_load_query" (func $rvf_load_query
(param $query_ptr i32) (param $dim i32) (result i32)))
;; Load a block of vectors into SIMD scratch
;; block_ptr: pointer to vector block in SIMD scratch
;; count: number of vectors
;; dtype: data type enum
(export "rvf_load_block" (func $rvf_load_block
(param $block_ptr i32) (param $count i32)
(param $dtype i32) (result i32)))
;; Compute distances between query and loaded block
;; metric: 0=L2, 1=IP, 2=cosine, 3=hamming
;; result_ptr: pointer to write distances
(export "rvf_distances" (func $rvf_distances
(param $metric i32) (param $result_ptr i32) (result i32)))
;; Merge distances into top-K heap
;; dist_ptr: pointer to distance array
;; id_ptr: pointer to vector ID array
;; count: number of candidates
;; k: top-K to maintain
(export "rvf_topk_merge" (func $rvf_topk_merge
(param $dist_ptr i32) (param $id_ptr i32)
(param $count i32) (param $k i32) (result i32)))
;; Read current top-K results
;; out_ptr: pointer to write results (id, distance pairs)
(export "rvf_topk_read" (func $rvf_topk_read
(param $out_ptr i32) (result i32)))
;; === Quantization ===
;; Load scalar quantization parameters (min/max per dim)
(export "rvf_load_sq_params" (func $rvf_load_sq_params
(param $params_ptr i32) (param $dim i32) (result i32)))
;; Dequantize int8 block to fp16 in SIMD scratch
(export "rvf_dequant_i8" (func $rvf_dequant_i8
(param $src_ptr i32) (param $dst_ptr i32)
(param $count i32) (result i32)))
;; Load PQ codebook subset
(export "rvf_load_pq_codebook" (func $rvf_load_pq_codebook
(param $codebook_ptr i32) (param $M i32)
(param $K i32) (result i32)))
;; Compute PQ asymmetric distances
(export "rvf_pq_distances" (func $rvf_pq_distances
(param $codes_ptr i32) (param $count i32)
(param $result_ptr i32) (result i32)))
;; === HNSW Navigation ===
;; Load neighbor list for a node
(export "rvf_load_neighbors" (func $rvf_load_neighbors
(param $node_id i64) (param $layer i32)
(param $out_ptr i32) (result i32)))
;; Greedy search step: given current node, find nearest neighbor
(export "rvf_greedy_step" (func $rvf_greedy_step
(param $current_id i64) (param $layer i32) (result i64)))
;; === Segment Verification ===
;; Verify segment header hash
(export "rvf_verify_header" (func $rvf_verify_header
(param $header_ptr i32) (result i32)))
;; Compute CRC32C of a data region
(export "rvf_crc32c" (func $rvf_crc32c
(param $data_ptr i32) (param $len i32) (result i32)))
```
### Export Count
14 exports. Each maps to a tight inner loop that fits in the 8 KB code budget.
The host (hub) is responsible for all I/O, segment parsing, and orchestration.
## 4. Host-Tile Protocol
Communication between the hub and tile uses fixed-size messages through
the 2 KB I/O buffer:
### Message Format
```
Offset  Size  Field       Description
------  ----  -----       -----------
0x00    2     msg_type    Message type enum
0x02    2     msg_length  Payload length
0x04    4     msg_id      Correlation ID
0x08    var   payload     Type-specific payload
```
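A sketch of packing the 8-byte header plus payload. Little-endian field order is an assumption — the spec gives offsets and sizes but not byte order:

```python
import struct

LOAD_QUERY = 0x01  # hub -> tile message type

def pack_message(msg_type: int, payload: bytes, msg_id: int) -> bytes:
    # u16 msg_type, u16 msg_length, u32 msg_id, then the payload
    return struct.pack("<HHI", msg_type, len(payload), msg_id) + payload

msg = pack_message(LOAD_QUERY, b"\x00" * 768, msg_id=1)
print(len(msg))  # 776: 8-byte header + 768 B fp16 query
```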
### Message Types
```
Hub -> Tile:
0x01 LOAD_QUERY Send query vector (768 B for 384-dim fp16)
0x02 LOAD_BLOCK Send vector block (up to ~1.5 KB compressed)
0x03 LOAD_NEIGHBORS Send neighbor list for a node
0x04 LOAD_PARAMS Send quantization parameters
0x05 COMPUTE Trigger distance computation
0x06 READ_TOPK Request current top-K results
0x07 RESET Clear tile state for new query
Tile -> Hub:
0x81 TOPK_RESULT Top-K results (id, distance pairs)
0x82 NEED_BLOCK Request a specific vector block
0x83 NEED_NEIGHBORS Request neighbor list for a node
0x84 DONE Computation complete
0x85 ERROR Error with code
```
### Execution Flow
```
Hub Tile
| |
|--- LOAD_QUERY (768B) ------------>|
| | rvf_load_query()
|--- LOAD_PARAMS (SQ params) ------>|
| | rvf_load_sq_params()
|--- LOAD_BLOCK (block 0) -------->|
| | rvf_load_block()
| | rvf_distances()
| | rvf_topk_merge()
|--- LOAD_BLOCK (block 1) -------->|
| | rvf_load_block()
| | rvf_distances()
| | rvf_topk_merge()
| ... |
|--- READ_TOPK -------------------->|
| | rvf_topk_read()
|<--- TOPK_RESULT ------------------|
| |
```
### Pull Mode
For HNSW search, the tile drives the traversal:
```
Hub Tile
| |
|--- LOAD_QUERY -------------------->|
|--- LOAD_NEIGHBORS (entry point) -->|
| | rvf_greedy_step()
|<--- NEED_NEIGHBORS (next node) ----|
|--- LOAD_NEIGHBORS (next node) ---->|
| | rvf_greedy_step()
|<--- NEED_BLOCK (for candidate) ----|
|--- LOAD_BLOCK -------------------->|
| | rvf_distances()
| | rvf_topk_merge()
|<--- DONE --------------------------|
|--- READ_TOPK --------------------->|
|<--- TOPK_RESULT -------------------|
```
## 5. Three Hardware Profiles
### RVF Core Profile (Tile)
```
Target: Cognitum tile (8KB + 8KB + 64KB)
Features: Distance compute, top-K, SQ dequant, CRC32C verify
Max vectors: ~42 per block load
Max dimensions: 384 (fp16) or 768 (i8)
Index: None (hub routes, tile computes)
Streaming: Receive blocks from hub
Quantization: i8 scalar only (no PQ on tile)
Compression: None (hub decompresses before sending)
```
### RVF Hot Profile (Chip)
```
Target: Cognitum chip (multiple tiles + shared memory)
Features: Core + PQ distance, HNSW navigation, parallel tiles
Max vectors: Limited by shared memory (~10K in shared cache)
Max dimensions: 1024
Index: Layer A in shared memory
Streaming: Block streaming across tiles
Quantization: i8 scalar + PQ (6-bit)
Compression: LZ4 decompress in shared memory
```
### RVF Full Profile (Hub/Desktop)
```
Target: Desktop CPU, server, hub controller
Features: All features, all segment types, all quantization
Max vectors: Billions (limited by storage)
Max dimensions: Unlimited
Index: Full HNSW (Layers A + B + C)
Streaming: Full append-only segment model
Quantization: All tiers (fp16, i8, PQ, binary)
Compression: All (LZ4, ZSTD, custom)
Crypto: Full (ML-DSA-65 signatures, SHAKE-256)
Temperature: Full adaptive tiering
Overlay: Full epoch model with compaction
```
### Profile Detection
The root manifest's `profile_id` field declares the minimum profile needed:
```
0x00 generic Requires Full Profile features
0x01 core Fully usable with Core Profile
0x02 hot Requires Hot Profile minimum
0x03 full Requires Full Profile
```
A Full Profile reader can always read Core or Hot files. A Core Profile
reader rejects Full Profile files but can read Core files. Hot Profile
readers can read Core and Hot files.
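The read-compatibility rules collapse to a rank comparison. A sketch — the rank table is an illustration of the rules above, not a spec field:

```python
# Minimum reader capability needed to open a file of each declared profile.
REQUIRED = {"core": 1, "hot": 2, "full": 3, "generic": 3}  # generic needs Full
CAPABILITY = {"core": 1, "hot": 2, "full": 3}

def can_read(reader: str, file_profile: str) -> bool:
    return CAPABILITY[reader] >= REQUIRED[file_profile]

print(can_read("full", "core"))  # True: Full reads everything
print(can_read("hot", "core"))   # True: Hot reads Core and Hot
print(can_read("core", "full"))  # False: Core rejects Full files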
## 6. SIMD Strategy by Platform
### WASM v128 (Tile/Browser)
```wasm
;; L2 distance: fp16 vectors, 384 dimensions
;; Process 8 fp16 values per v128 operation
(func $l2_fp16_384 (param $a_ptr i32) (param $b_ptr i32) (result f32)
(local $acc v128)
(local $i i32)
(local.set $acc (v128.const i64x2 0 0))
(local.set $i (i32.const 0))
(block $done
(loop $loop
;; Load 8 fp16 values, widen to f32x4 pairs
;; Subtract, square, accumulate
;; ... (8 values per iteration, 48 iterations for 384 dims)
(br_if $done (i32.ge_u (local.get $i) (i32.const 384)))
(br $loop)
)
)
;; Horizontal sum of accumulator
;; Return L2 distance
)
```
### AVX-512 (Desktop/Server)
```
; Process 32 fp16 values per cycle with VCVTPH2PS + VFMADD231PS
; 384 dims = 12 iterations of 32 values
; ~12 cycles per distance computation
```
### ARM NEON (Mobile/Edge)
```
; Process 8 fp16 values per cycle with FMLA
; 384 dims = 48 iterations of 8 values
; ~48 cycles per distance computation
```
## 7. Microkernel Size Budget
```
Function Estimated Size
-------- --------------
rvf_init 128 B
rvf_load_query 64 B
rvf_load_block 256 B
rvf_distances (L2 fp16) 512 B
rvf_distances (L2 i8) 384 B
rvf_distances (IP fp16) 512 B
rvf_distances (hamming) 256 B
rvf_topk_merge 384 B
rvf_topk_read 64 B
rvf_load_sq_params 64 B
rvf_dequant_i8 256 B
rvf_load_pq_codebook 128 B
rvf_pq_distances 512 B
rvf_load_neighbors 128 B
rvf_greedy_step 512 B
rvf_verify_header 128 B
rvf_crc32c 256 B
Message dispatch loop 384 B
Utility functions 256 B
WASM overhead 512 B
----------
Total ~5,700 B (< 8 KB code budget)
```
Remaining ~2.4 KB of code space is available for domain-specific extensions
(e.g., codon distance for RVDNA profile, token overlap for RVText profile).
## 8. Fault Isolation
Each tile runs in a WASM sandbox. A tile cannot:
- Access hub memory directly
- Communicate with other tiles except through the hub
- Allocate memory beyond its 8 KB data + 64 KB scratch
- Execute code beyond its 8 KB code space
- Trap without the hub catching and recovering
If a tile traps (out-of-bounds, unreachable, stack overflow):
1. Hub catches the trap
2. Hub marks tile as faulted
3. Hub reassigns the tile's work to another tile (or processes locally)
4. Hub optionally restarts the faulted tile with fresh state
This makes the system resilient to individual tile failures — important for
large tile arrays where hardware faults are inevitable.

# RVF Domain Profiles
## 1. Profile Architecture
A domain profile is a **semantic overlay** on the universal RVF substrate. It does
not change the wire format — every profile-specific file is a valid RVF file. The
profile adds:
1. **Semantic type annotations** for vector dimensions
2. **Domain-specific distance metrics**
3. **Custom quantization strategies** optimized for the domain
4. **Metadata schemas** for domain-specific labels and provenance
5. **Query preprocessing** conventions
Profiles are declared in a PROFILE_SEG and referenced by the root manifest's
`profile_id` field.
```
+-- RVF Universal Substrate --+
| Segments, manifests, tiers |
| HNSW index, overlays |
| Temperature, compaction |
+-----------------------------+
|
| profile_id
v
+-- Domain Profile Layer --+
| Semantic types |
| Custom distances |
| Metadata schema |
| Query conventions |
+---------------------------+
```
## 2. PROFILE_SEG Binary Layout
```
Offset  Size  Field                Description
------  ----  -----                -----------
0x00    4     profile_magic        Profile-specific magic number
0x04    2     profile_version      Profile spec version
0x06    2     profile_id           Same as root manifest profile_id
0x08    32    profile_name         UTF-8 null-terminated name
0x28    8     schema_length        Length of metadata schema
0x30    var   metadata_schema      JSON or binary schema for META_SEG entries
var     8     distance_config_len  Length of distance configuration
var     var   distance_config      Distance metric parameters
var     8     quant_config_len     Length of quantization configuration
var     var   quant_config         Domain-specific quantization parameters
var     8     preprocess_len       Length of preprocessing spec
var     var   preprocess_spec      Query preprocessing pipeline description
```
## 3. RVDNA Profile (Genomics)
### Profile Declaration
```
profile_magic: 0x52444E41 ("RDNA")
profile_id: 0x01
profile_name: "rvdna"
```
### Semantic Types
RVDNA vectors encode biological sequences at multiple granularities:
| Granularity | Dimensions | Encoding | Use Case |
|------------|-----------|----------|----------|
| Codon | 64 | Frequency of each codon in reading frame | Gene-level comparison |
| K-mer (k=6) | 4096 | 6-mer frequency spectrum | Species identification |
| Motif | 128-512 | Learned motif embeddings (transformer) | Regulatory element search |
| Structure | 256 | Protein secondary structure embedding | Fold similarity |
| Epigenetic | 384 | Methylation + histone mark embedding | Epigenomic comparison |
### Distance Metrics
```
Codon frequency: Jensen-Shannon divergence (symmetric KL)
K-mer spectrum: Cosine similarity (normalized frequency vectors)
Motif embedding: L2 distance (Euclidean in learned space)
Structure: L2 distance with structure-aware weighting
Epigenetic: Weighted cosine (CpG density as weight)
```
### Quantization Strategy
Genomic vectors have specific statistical properties:
- **Codon frequencies**: Sparse, non-negative, sum-to-1. Use **scalar quantization
with log transform**: `q = round(log2(freq + epsilon) * scale)`. 8-bit covers
6 orders of magnitude.
- **K-mer spectra**: Very sparse (most 6-mers absent in short reads). Use
**sparse encoding**: store only non-zero k-mer indices + values. Typical
compression: 20-50x over dense.
- **Learned embeddings**: Gaussian-distributed. Standard PQ works well.
M=32 subspaces, K=256 centroids (8-bit codes).
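The log-transform scalar quantizer for codon frequencies, as a hedged sketch. The scale factor is illustrative (scale=12 spreads the signed 8-bit range over roughly six decades, matching the claim above); the spec fixes only the formula shape:

```python
import math

def quantize_freq(freq: float, scale: float = 12.0, eps: float = 1e-6) -> int:
    # q = round(log2(freq + epsilon) * scale), clamped to signed 8-bit
    return max(-128, min(127, round(math.log2(freq + eps) * scale)))

def dequantize_freq(q: int, scale: float = 12.0) -> float:
    return 2.0 ** (q / scale)

q = quantize_freq(0.015)
print(q, f"{dequantize_freq(q):.4f}")  # -73 0.0147
```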
### Metadata Schema
```json
{
"type": "rvdna",
"fields": {
"organism": { "type": "string", "indexed": true },
"gene_id": { "type": "string", "indexed": true },
"chromosome": { "type": "string", "indexed": true },
"position_start": { "type": "u64", "indexed": true },
"position_end": { "type": "u64", "indexed": true },
"strand": { "type": "enum", "values": ["+", "-"] },
"quality_score": { "type": "f32" },
"source_format": { "type": "enum", "values": ["FASTA", "FASTQ", "BAM", "VCF"] },
"read_depth": { "type": "u32" },
"gc_content": { "type": "f32" }
}
}
```
### Query Preprocessing
For RVDNA queries:
1. Input: Raw sequence string (ACGT...)
2. Compute k-mer frequency spectrum
3. Apply log transform for codon/k-mer queries
4. Normalize to unit length for cosine metrics
5. Encode as fp16 vector
6. Submit to RVF query path
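Steps 1-2 of the k-mer path can be sketched as follows; k=2 (16 dims) keeps the demo small, where the profile's species-identification spectra use k=6 (4096 dims):

```python
from collections import Counter

def kmer_spectrum(seq: str, k: int = 6):
    """Normalized k-mer frequency vector over the 4**k possible k-mers."""
    idx = {"A": 0, "C": 1, "G": 2, "T": 3}
    counts = Counter(
        sum(idx[c] * 4 ** (k - 1 - i) for i, c in enumerate(seq[j:j + k]))
        for j in range(len(seq) - k + 1)
    )
    total = sum(counts.values())
    return [counts.get(i, 0) / total for i in range(4 ** k)]

spec = kmer_spectrum("ACGTACGT", k=2)
print(len(spec), round(sum(spec), 6))  # 16 1.0
```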
## 4. RVText Profile (Language)
### Profile Declaration
```
profile_magic: 0x52545854 ("RTXT")
profile_id: 0x02
profile_name: "rvtext"
```
### Semantic Types
| Granularity | Dimensions | Source | Use Case |
|------------|-----------|--------|----------|
| Token | 768-1536 | Transformer last hidden state | Semantic search |
| Sentence | 384-768 | Sentence transformer pooled output | Document retrieval |
| Paragraph | 384-1024 | Long-context model embedding | Passage ranking |
| Document | 256-512 | Document-level embedding | Collection search |
| Sparse | 30522 | BM25/SPLADE term weights | Lexical matching |
### Distance Metrics
```
Dense embeddings: Cosine similarity (normalized dot product)
Sparse (SPLADE): Dot product on sparse vectors
Hybrid: alpha * dense_score + (1-alpha) * sparse_score
Matryoshka: Cosine on truncated prefix (adaptive dimensionality)
```
### Quantization Strategy
Text embeddings are well-suited to aggressive quantization:
- **Dense (384-768 dim)**: Binary quantization achieves 0.95+ recall on
normalized embeddings. 384 dims -> 48 bytes. Use binary for cold tier,
int8 for hot.
- **Sparse (SPLADE)**: Store as sorted (term_id, weight) pairs with
delta-encoded term_ids. Typical sparsity: 100-300 non-zero terms out
of 30K vocabulary. Compression: ~100x over dense.
- **Matryoshka**: Store full-dimension vectors but index only the first
D/4 dimensions. Progressive refinement uses more dimensions.
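The sparse storage scheme for SPLADE vectors reduces to gap-encoding the sorted term IDs. A minimal sketch, with a list of tuples standing in for the packed wire encoding:

```python
def delta_encode(pairs):
    # pairs: sorted (term_id, weight); store gaps between successive term_ids
    out, prev = [], 0
    for tid, w in pairs:
        out.append((tid - prev, w))
        prev = tid
    return out

def delta_decode(deltas):
    out, tid = [], 0
    for gap, w in deltas:
        tid += gap
        out.append((tid, w))
    return out

sparse = [(17, 0.8), (1042, 0.3), (29510, 1.2)]
print(delta_encode(sparse))  # [(17, 0.8), (1025, 0.3), (28468, 1.2)]
```

Smaller gaps then varint-compress better than raw 16-bit term IDs would.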
### Metadata Schema
```json
{
"type": "rvtext",
"fields": {
"text": { "type": "string", "stored": true, "max_length": 8192 },
"source_url": { "type": "string", "indexed": true },
"language": { "type": "string", "indexed": true },
"model_id": { "type": "string" },
"chunk_index": { "type": "u32" },
"total_chunks": { "type": "u32" },
"token_count": { "type": "u32" },
"timestamp": { "type": "u64" }
}
}
```
### Query Preprocessing
1. Input: Raw text string
2. Tokenize with model-specific tokenizer
3. Encode through embedding model (or receive pre-computed embedding)
4. L2-normalize for cosine similarity
5. Optionally: compute SPLADE sparse expansion
6. Submit dense + sparse to hybrid query path
## 5. RVGraph Profile (Networks)
### Profile Declaration
```
profile_magic: 0x52475248 ("RGRH")
profile_id: 0x03
profile_name: "rvgraph"
```
### Semantic Types
| Granularity | Dimensions | Source | Use Case |
|------------|-----------|--------|----------|
| Node | 64-256 | Node2Vec / GCN embedding | Node similarity |
| Edge | 64-128 | Edge feature embedding | Link prediction |
| Subgraph | 128-512 | Graph kernel embedding | Subgraph matching |
| Community | 64-256 | Community embedding | Community detection |
| Spectral | 32-128 | Laplacian eigenvectors | Graph structure |
### Distance Metrics
```
Node embedding: L2 distance
Edge embedding: Cosine similarity
Subgraph: Wasserstein distance (approximated by L2 on sorted features)
Community: Cosine similarity
Spectral: L2 on normalized eigenvectors
```
### Integration with Overlay System
RVGraph uniquely integrates with the RVF overlay epoch system:
- **Graph structure** is stored in OVERLAY_SEGs (not just as metadata)
- **Node embeddings** are stored in VEC_SEGs
- **Edge weights** are overlay deltas
- **Community assignments** are partition summaries
- **Min-cut witnesses** directly serve graph partitioning queries
This means RVGraph files are simultaneously vector stores AND graph databases.
The overlay system provides dynamic graph operations (add/remove edges,
rebalance partitions) while the vector system provides similarity search.
### Metadata Schema
```json
{
"type": "rvgraph",
"fields": {
"node_type": { "type": "string", "indexed": true },
"edge_type": { "type": "string", "indexed": true },
"node_label": { "type": "string", "indexed": true },
"degree": { "type": "u32", "indexed": true },
"community_id": { "type": "u32", "indexed": true },
"pagerank": { "type": "f32" },
"clustering_coeff": { "type": "f32" },
"source_graph": { "type": "string" }
}
}
```
## 6. RVVision Profile (Imagery)
### Profile Declaration
```
profile_magic: 0x52564953 ("RVIS")
profile_id: 0x04
profile_name: "rvvision"
```
### Semantic Types
| Granularity | Dimensions | Source | Use Case |
|------------|-----------|--------|----------|
| Patch | 64-256 | ViT patch embedding | Region search |
| Image | 512-2048 | CLIP / DINOv2 global embedding | Image retrieval |
| Object | 256-512 | Object detection crop embedding | Object search |
| Scene | 128-512 | Scene classification embedding | Scene matching |
| Multi-scale | 256 * N | Pyramid of embeddings at scales | Scale-invariant search |
### Distance Metrics
```
CLIP embedding: Cosine similarity (model-normalized)
DINOv2: Cosine similarity
Patch: L2 distance (not normalized)
Multi-scale: Weighted sum of per-scale cosine similarities
```
### Quantization Strategy
Vision embeddings have high intrinsic dimensionality but are compressible:
- **CLIP (512-dim)**: PQ with M=64, K=256 works well. Binary quantization
achieves 0.90+ recall.
- **DINOv2 (768-dim)**: Similar to CLIP. PQ M=96, K=256.
- **Patch embeddings**: Large volume (196+ patches per image). Aggressive
quantization to 4-bit scalar. Use residual PQ for high-recall applications.
### Spatial Metadata
RVVision supports spatial queries through metadata:
```json
{
"type": "rvvision",
"fields": {
"image_id": { "type": "string", "indexed": true },
"patch_row": { "type": "u16" },
"patch_col": { "type": "u16" },
"scale": { "type": "f32" },
"bbox_x": { "type": "f32" },
"bbox_y": { "type": "f32" },
"bbox_w": { "type": "f32" },
"bbox_h": { "type": "f32" },
"object_class": { "type": "string", "indexed": true },
"confidence": { "type": "f32" },
"model_id": { "type": "string" }
}
}
```
## 7. Custom Profile Registration
New profiles can be registered by writing a PROFILE_SEG:
```
1. Choose a unique profile_id (0x10-0xEF for custom profiles)
2. Define a 4-byte profile_magic
3. Define metadata schema
4. Define distance metric configuration
5. Define quantization recommendations
6. Write PROFILE_SEG into the RVF file
7. Set profile_id in root manifest
```
The profile system is open — any domain can define its own profile as long
as it maps onto the RVF substrate. The substrate does not need to understand
the domain semantics; it only needs to store vectors, compute distances,
and maintain indexes.
## 8. Cross-Profile Queries
RVF files with different profiles can be queried together if their vectors
share a compatible embedding space. This is common in multimodal applications:
```
Query: "Find images similar to this text description"
1. Text embedding (RVText profile) -> 512-dim CLIP text vector
2. Image database (RVVision profile) -> 512-dim CLIP image vectors
3. Distance metric: Cosine similarity (shared CLIP space)
4. Result: Images ranked by text-image similarity
```
The query path treats both files as RVF files. The profile only affects
preprocessing and metadata interpretation — the core distance computation
and indexing are profile-agnostic.
## 9. Profile Compatibility Matrix
| Source Profile | Target Profile | Compatible? | Condition |
|---------------|---------------|------------|-----------|
| RVDNA | RVDNA | Yes | Same granularity |
| RVText | RVText | Yes | Same model or compatible space |
| RVVision | RVVision | Yes | Same model or compatible space |
| RVText | RVVision | Yes | If both use CLIP or shared space |
| RVDNA | RVText | No* | Unless mapped through protein language model |
| RVGraph | Any | Partial | Node embeddings may share space |
*Cross-domain compatibility requires explicit embedding space alignment,
which is outside the scope of the format spec but enabled by it.

# RVF: RuVector Format Specification
## The Universal Substrate for Living Intelligence
**Version**: 0.1.0-draft
**Status**: Research
**Date**: 2026-02-13
---
## What RVF Is
RVF is not a file format. It is a **runtime substrate** — a living, self-reorganizing
binary medium that stores, streams, indexes, and adapts vector intelligence across
any domain, any scale, and any hardware tier.
Where traditional formats are snapshots of data, RVF is a **continuously evolving
organism**. It ingests without rewriting. It answers queries before it finishes loading.
It reorganizes its own layout to match access patterns. It survives crashes without
journals. It fits on a 64 KB WASM tile or scales to a petabyte hub.
## The Four Laws of RVF
Every design decision in RVF derives from four inviolable laws:
### Law 1: Truth Lives at the Tail
The most recent `MANIFEST_SEG` at the tail of the file is the sole source of truth.
No front-loaded metadata. No section directory that must be rewritten on mutation.
Readers scan backward from EOF to find the latest manifest and know exactly what
to map.
**Consequence**: Append-only writes. Streaming ingest. No global rewrite ever.
### Law 2: Every Segment Is Independently Valid
Each segment carries its own magic number, length, content hash, and type tag.
A reader encountering any segment in isolation can verify it, identify it, and
decide whether to process it. No segment depends on prior segments for structural
validity.
**Consequence**: Crash safety for free. Parallel verification. Segment-level
integrity without a global checksum.
### Law 3: Data and State Are Separated
Vector payloads, index structures, overlay graphs, quantization dictionaries, and
runtime metadata live in distinct segment types. The manifest binds them together
but they never intermingle. This means you can replace the index without touching
vectors, update the overlay without rebuilding adjacency, or swap quantization
without re-encoding.
**Consequence**: Incremental updates. Modular evolution. Zero-copy segment reuse.
### Law 4: The Format Adapts to Its Workload
RVF monitors access patterns through lightweight sketches and periodically
reorganizes: promoting hot vectors to faster tiers, compacting stale overlays,
lazily building deeper index layers. The format is not static — it converges
toward the optimal layout for its actual workload.
**Consequence**: Self-tuning performance. No manual optimization. The file gets
faster the more you use it.
## Design Coordinates
| Property | RVF Answer |
|----------|-----------|
| Write model | Append-only segments + background compaction |
| Read model | Tail-manifest scan, then progressive mmap |
| Index model | Layered availability (entry points -> partial -> full) |
| Compression | Temperature-tiered (fp16 hot, 5-7 bit warm, 3 bit cold) |
| Alignment | 64-byte for SIMD (AVX-512, NEON, WASM v128) |
| Crash safety | Segment-level hashes, no WAL required |
| Crypto | Post-quantum (ML-DSA-65 signatures, SHAKE-256 hashes) |
| Streaming | Yes — first query before full load |
| Hardware | 8 KB tile to petabyte hub |
| Domain | Universal — genomics, text, graph, vision as profiles |
## Acceptance Test
> Cold start on a 10 million vector file: load and answer the first query with a
> useful (recall >= 0.7) result without reading more than the last 4 MB, then
> converge to full quality (recall >= 0.95) as it progressively maps more segments.
## Document Map
| Document | Path | Content |
|----------|------|---------|
| This overview | `spec/00-overview.md` | Philosophy, laws, design coordinates |
| Segment model | `spec/01-segment-model.md` | Segment types, headers, append-only rules |
| Manifest system | `spec/02-manifest-system.md` | Two-level manifests, hotset pointers |
| Temperature tiering | `spec/03-temperature-tiering.md` | Adaptive layout, access sketches, promotion |
| Progressive indexing | `spec/04-progressive-indexing.md` | Layered HNSW, partial availability |
| Overlay epochs | `spec/05-overlay-epochs.md` | Streaming min-cut, epoch boundaries |
| Wire format | `wire/binary-layout.md` | Byte-level binary format reference |
| WASM microkernel | `microkernel/wasm-runtime.md` | Cognitum tile mapping, WASM exports |
| Domain profiles | `profiles/domain-profiles.md` | RVDNA, RVText, RVGraph, RVVision |
| Crypto spec | `crypto/quantum-signatures.md` | Post-quantum primitives, segment signing |
| Benchmarks | `benchmarks/acceptance-tests.md` | Performance targets, test methodology |
## Relationship to RVDNA
RVDNA (RuVector DNA) was the first domain-specific format for genomic vector
intelligence. In the RVF model, RVDNA becomes a **profile** — a set of conventions
for how genomic data maps onto the universal RVF substrate:
```
RVF (universal substrate)
|
+-- RVF Core Profile (minimal, fits on 64KB tile)
+-- RVF Hot Profile (chip-optimized, SIMD-heavy)
+-- RVF Full Profile (hub-scale, all features)
|
+-- Domain Profiles
+-- RVDNA (genomics: codons, motifs, k-mers)
+-- RVText (language: embeddings, token graphs)
+-- RVGraph (networks: adjacency, partitions)
+-- RVVision (imagery: feature maps, patch vectors)
```
The substrate carries the laws. The profiles carry the semantics.
## Design Answers
**Q: Random writes or append-only plus compaction?**
A: Append-only plus compaction. This gives speed and crash safety almost for free.
Random writes add complexity for marginal benefit in the vector workload.
**Q: Primary target mmap on desktop CPUs or also microcontroller tiles?**
A: Both. RVF defines three hardware profiles. The Core profile fits in 8 KB code +
8 KB data + 64 KB SIMD scratch. The Full profile assumes mmap on desktop-class
memory. The wire format is identical — only the runtime behavior changes.
**Q: Which property matters most?**
A: All four are non-negotiable, but the priority order for conflict resolution is:
1. **Streamable** (never block on write)
2. **Progressive** (answer before fully loaded)
3. **Adaptive** (self-optimize over time)
4. **p95 speed** (predictable tail latency)


# RVF Segment Model
## 1. Append-Only Segment Architecture
An RVF file is a linear sequence of **segments**. Each segment is a self-contained,
independently verifiable unit. New data is always appended — never inserted into or
overwritten within existing segments.
```
+------------+------------+------------+ +------------+
| Segment 0 | Segment 1 | Segment 2 | ... | Segment N | <-- EOF
+------------+------------+------------+ +------------+
^
Latest MANIFEST_SEG
(source of truth)
```
### Why Append-Only
| Property | Benefit |
|----------|---------|
| Write amplification | Zero — each byte written once until compaction |
| Crash safety | Partial segment at tail is detectable and discardable |
| Concurrent reads | Readers see a consistent snapshot at any manifest boundary |
| Streaming ingest | Writer never blocks on reorganization |
| mmap friendliness | Pages only grow — no invalidation of mapped regions |
## 2. Segment Header
Every segment begins with a fixed 64-byte header. The header is 64-byte aligned
to match SIMD register width.
```
Offset Size Field Description
------ ---- ----- -----------
0x00 4 magic 0x52564653 ("RVFS" in ASCII)
0x04 1 version Segment format version (currently 1)
0x05 1 seg_type Segment type enum (see below)
0x06 2 flags Bitfield: compressed, encrypted, signed, sealed, etc.
0x08 8 segment_id Monotonically increasing segment ordinal
0x10 8 payload_length Byte length of payload (after header, before footer)
0x18 8 timestamp_ns Nanosecond UNIX timestamp of segment creation
0x20 1 checksum_algo Hash algorithm enum: 0=CRC32C, 1=XXH3-128, 2=SHAKE-256
0x21 1 compression Compression enum: 0=none, 1=LZ4, 2=ZSTD, 3=custom
0x22 2 reserved_0 Must be zero
0x24 4 reserved_1 Must be zero
0x28 16 content_hash First 128 bits of payload hash (algorithm per checksum_algo)
0x38 4 uncompressed_len Original payload size (0 if no compression)
0x3C 4 alignment_pad Padding to reach 64-byte boundary
```
**Total header**: 64 bytes (one cache line, one AVX-512 register width).
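The fixed layout makes header parsing a single unpack. A sketch (little-endian byte order is an assumption; the spec above gives offsets and sizes but does not pin endianness on this page):

```python
import struct

# 64-byte segment header, fields in the order given above.
# "<" = little-endian with no implicit padding (an assumption here).
HEADER_FMT = "<IBBHQQQBBHI16sII"
assert struct.calcsize(HEADER_FMT) == 64

FIELDS = ("magic", "version", "seg_type", "flags", "segment_id",
          "payload_length", "timestamp_ns", "checksum_algo", "compression",
          "reserved_0", "reserved_1", "content_hash", "uncompressed_len",
          "alignment_pad")

RVFS_MAGIC = 0x52564653  # "RVFS"

def parse_header(buf: bytes) -> dict:
    """Decode one 64-byte segment header; reject a bad magic."""
    header = dict(zip(FIELDS, struct.unpack(HEADER_FMT, buf[:64])))
    if header["magic"] != RVFS_MAGIC:
        raise ValueError("not an RVF segment header")
    return header

# Round-trip a synthetic VEC_SEG header
raw = struct.pack(HEADER_FMT, RVFS_MAGIC, 1, 0x01, 0, 7, 4096,
                  1_700_000_000_000_000_000, 1, 0, 0, 0, b"\x00" * 16, 0, 0)
h = parse_header(raw)
```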
### Magic Validation
Readers scanning backward from EOF look for `0x52564653` at 64-byte aligned
boundaries. This enables fast tail-scan even on corrupted files.
### Flags Bitfield
```
Bit 0: COMPRESSED Payload is compressed per compression field
Bit 1: ENCRYPTED Payload is encrypted (key info in manifest)
Bit 2: SIGNED A signature footer follows the payload
Bit 3: SEALED Segment is immutable (compaction output)
Bit 4: PARTIAL Segment is a partial write (streaming ingest)
Bit 5: TOMBSTONE Segment logically deletes a prior segment
Bit 6: HOT Segment contains temperature-promoted data
Bit 7: OVERLAY Segment contains overlay/delta data
Bit 8: SNAPSHOT Segment contains full snapshot (not delta)
Bit 9: CHECKPOINT Segment is a safe rollback point
Bits 10-15: reserved
```
## 3. Segment Types
```
Value Name Purpose
----- ---- -------
0x01 VEC_SEG Raw vector payloads (the actual embeddings)
0x02 INDEX_SEG HNSW adjacency lists, entry points, routing tables
0x03 OVERLAY_SEG Graph overlay deltas, partition updates, min-cut witnesses
0x04 JOURNAL_SEG Metadata mutations (label changes, deletions, moves)
0x05 MANIFEST_SEG Segment directory, hotset pointers, epoch state
0x06 QUANT_SEG Quantization dictionaries and codebooks
0x07 META_SEG Arbitrary key-value metadata (tags, provenance, lineage)
0x08 HOT_SEG Temperature-promoted hot data (vectors + neighbors)
0x09 SKETCH_SEG Access counter sketches for temperature decisions
0x0A WITNESS_SEG Capability manifests, proof of computation, audit trails
0x0B PROFILE_SEG Domain profile declarations (RVDNA, RVText, etc.)
0x0C CRYPTO_SEG Key material, signature chains, certificate anchors
0x0D METAIDX_SEG Metadata inverted indexes for filtered search
```
### Reserved Range
Types `0x00` and `0xF0`-`0xFF` are reserved. `0x00` indicates an uninitialized
or zeroed region (not a valid segment). `0xF0`-`0xFF` are reserved for
implementation-specific extensions.
## 4. Segment Footer
If the `SIGNED` flag is set, the payload is followed by a signature footer:
```
Offset Size Field Description
------ ---- ----- -----------
0x00 2 sig_algo Signature algorithm: 0=Ed25519, 1=ML-DSA-65, 2=SLH-DSA-128s
0x02 2 sig_length Byte length of signature
0x04 var signature The signature bytes
var 4 footer_length Total footer size (for backward scanning)
```
Unsigned segments have no footer — the next segment header follows immediately
after the payload (at the next 64-byte aligned boundary).
## 5. Segment Lifecycle
### Write Path
```
1. Allocate segment ID (monotonic counter)
2. Compute payload hash
3. Write header + payload + optional footer
4. fsync (or fdatasync for non-manifest segments)
5. Write MANIFEST_SEG referencing the new segment
6. fsync the manifest
```
The two-fsync protocol ensures that:
- If crash occurs before step 6, the orphan segment is harmless (no manifest points to it)
- If crash occurs during step 6, the partial manifest is detectable (bad hash)
- After step 6, the segment is durably committed
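The two-fsync protocol can be sketched as follows, assuming a single writer. The record layout here is deliberately abbreviated (magic + type + length + 128-bit SHAKE-256 hash) rather than the full 64-byte header, so the durability ordering stays in focus:

```python
import hashlib
import os
import struct
import tempfile

RVFS_MAGIC = 0x52564653
MANIFEST_SEG = 0x05

def append_record(fd: int, seg_type: int, payload: bytes) -> None:
    """Append one hashed, length-prefixed record, then make it durable."""
    digest = hashlib.shake_256(payload).digest(16)      # first 128 bits
    header = struct.pack("<IBxxxQ16s", RVFS_MAGIC, seg_type,
                         len(payload), digest)          # 32-byte mini-header
    os.write(fd, header + payload)
    os.fsync(fd)                                        # one fsync per record

def commit(fd: int, seg_type: int, payload: bytes, manifest: bytes) -> None:
    append_record(fd, seg_type, payload)    # steps 1-4: segment + fsync #1
    append_record(fd, MANIFEST_SEG, manifest)  # steps 5-6: manifest + fsync #2

tmp = tempfile.NamedTemporaryFile(suffix=".rvf", delete=False)
tmp.close()
fd = os.open(tmp.name, os.O_WRONLY | os.O_APPEND)
commit(fd, 0x01, b"vectors...", b"manifest-directory")
os.close(fd)
size = os.path.getsize(tmp.name)   # 2 headers (32 B each) + both payloads
```

A crash between the two `append_record` calls leaves only an orphan segment that no manifest references, exactly as described above.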
### Read Path
```
1. Seek to EOF
2. Scan backward for latest MANIFEST_SEG (look for magic at aligned boundaries)
3. Parse manifest -> get segment directory
4. Map segments on demand (progressive loading)
```
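Step 2, the backward scan, can be sketched against an in-memory buffer standing in for an mmap'd file (the magic is assumed to be stored as the ASCII bytes `RVFS`, matching the header's offset 0x05 type byte):

```python
MAGIC = b"RVFS"          # assumed on-disk byte order of 0x52564653
MANIFEST_SEG = 0x05      # seg_type byte lives at header offset 0x05

def find_latest_manifest(data: bytes):
    """Scan backward from EOF at 64-byte aligned offsets for the newest
    MANIFEST_SEG header; return its offset, or None."""
    last_aligned = (len(data) // 64) * 64
    for off in range(last_aligned - 64, -1, -64):
        if data[off:off + 4] == MAGIC and data[off + 5] == MANIFEST_SEG:
            return off
    return None

blob = bytearray(256)
blob[64:68] = MAGIC; blob[69] = MANIFEST_SEG     # old manifest at offset 64
blob[192:196] = MAGIC; blob[197] = MANIFEST_SEG  # newest manifest at 192
off = find_latest_manifest(bytes(blob))          # finds 192, not 64
```

Because the scan walks backward, the newest manifest wins, which is exactly the "truth lives at the tail" rule.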
### Compaction
Compaction merges multiple segments into fewer, larger, sealed segments:
```
Before: [VEC_SEG_1] [VEC_SEG_2] [VEC_SEG_3] [MANIFEST_3]
After: [VEC_SEG_1] [VEC_SEG_2] [VEC_SEG_3] [MANIFEST_3] [VEC_SEG_sealed] [MANIFEST_4]
^^^^^^^^^^^^^^^^^
New sealed segment
merging 1+2+3
```
Old segments are marked with TOMBSTONE entries in the new manifest. Space is
reclaimed when the file is eventually rewritten (or old segments are in a
separate file in multi-file mode).
### Multi-File Mode
For very large datasets, RVF can span multiple files:
```
data.rvf Main file with manifests and hot data
data.rvf.cold.0 Cold segment shard 0
data.rvf.cold.1 Cold segment shard 1
data.rvf.idx.0 Index segment shard 0
```
The manifest in the main file contains shard references with file paths and
byte ranges. This enables cold data to live on slower storage while hot data
stays on fast storage.
## 6. Segment Addressing
Segments are addressed by their `segment_id` (monotonically increasing 64-bit
integer). The manifest maps segment IDs to file offsets (and optionally shard
file paths in multi-file mode).
Within a segment, data is addressed by **block offset** — a 32-bit offset from
the start of the segment payload. This limits individual segments to 4 GB, which
is intentional: it keeps segments manageable for compaction and progressive loading.
### Block Structure Within VEC_SEG
```
+-------------------+
| Block Header (16B)|
| block_id: u32 |
| count: u32 |
| dim: u16 |
| dtype: u8 |
| pad: [u8; 5] |
+-------------------+
| Vectors |
| (count * dim * |
| sizeof(dtype)) |
| [64B aligned] |
+-------------------+
| ID Map |
| (varint delta |
| encoded IDs) |
+-------------------+
| Block Footer |
| crc32c: u32 |
+-------------------+
```
Vectors within a block are stored **columnar** — all dimension 0 values, then all
dimension 1 values, etc. This maximizes compression ratio. But the HOT_SEG stores
vectors **interleaved** (row-major) for cache-friendly sequential scan during
top-K refinement.
## 7. Invariants
1. Segment IDs are strictly monotonically increasing within a file
2. A valid RVF file contains at least one MANIFEST_SEG
3. The last MANIFEST_SEG is always the source of truth
4. Segment headers are always 64-byte aligned
5. No segment payload exceeds 4 GB
6. Content hashes are computed over the raw (uncompressed, unencrypted) payload
7. Sealed segments are never modified — only tombstoned
8. A reader that cannot find a valid MANIFEST_SEG must reject the file

# RVF Manifest System
## 1. Two-Level Manifest Architecture
The manifest system is what makes RVF progressive. Instead of a monolithic directory
that must be fully parsed before any query, RVF uses a two-level manifest that
enables instant boot followed by incremental refinement.
```
EOF
|
v
+--------------------------------------------------+
| Level 0: Root Manifest (fixed 4096 bytes) |
| - Magic + version |
| - Pointer to Level 1 manifest segment |
| - Hotset pointers (inline) |
| - Total vector count |
| - Dimension |
| - Epoch counter |
| - Profile declaration |
+--------------------------------------------------+
|
| points to
v
+--------------------------------------------------+
| Level 1: Full Manifest (variable size) |
| - Complete segment directory |
| - Temperature tier map |
| - Index layer availability |
| - Overlay epoch chain |
| - Compaction state |
| - Shard references (multi-file) |
| - Capability manifest |
+--------------------------------------------------+
```
### Why Two Levels
A reader performing a cold start needs only Level 0 (4 KB). From Level 0 alone,
it can locate the entry points, coarse routing graph, quantization dictionary,
and centroids — enough to answer approximate queries immediately.
Level 1 is loaded asynchronously to enable full-quality queries, but the system
is functional before Level 1 is fully parsed.
## 2. Level 0: Root Manifest
The root manifest is always the **last 4096 bytes** of the file (or the last
4096 bytes of the most recent MANIFEST_SEG). Its fixed size enables instant
location: `seek(EOF - 4096)`.
### Binary Layout
```
Offset Size Field Description
------ ---- ----- -----------
0x000 4 magic 0x52564D30 ("RVM0")
0x004 2 version Root manifest version
0x006 2 flags Root manifest flags
0x008 8 l1_manifest_offset Byte offset to Level 1 manifest segment
0x010 8 l1_manifest_length Byte length of Level 1 manifest segment
0x018 8 total_vector_count Total vectors across all segments
0x020 2 dimension Vector dimensionality
0x022 1 base_dtype Base data type enum
0x023 1 profile_id Domain profile (0=generic, 1=dna, 2=text, 3=graph, 4=vision)
0x024 4 epoch Current overlay epoch number
0x028 8 created_ns File creation timestamp (ns)
0x030 8 modified_ns Last modification timestamp (ns)
--- Hotset Pointers (the key to instant boot) ---
0x038 8 entrypoint_seg_offset Offset to segment containing HNSW entry points
0x040 4 entrypoint_block_offset Block offset within that segment
0x044 4 entrypoint_count Number of entry points
0x048 8 toplayer_seg_offset Offset to segment with top-layer adjacency
0x050 4 toplayer_block_offset Block offset
0x054 4 toplayer_node_count Nodes in top layer
0x058 8 centroid_seg_offset Offset to segment with cluster centroids / pivots
0x060 4 centroid_block_offset Block offset
0x064 4 centroid_count Number of centroids
0x068 8 quantdict_seg_offset Offset to quantization dictionary segment
0x070 4 quantdict_block_offset Block offset
0x074 4 quantdict_size Dictionary size in bytes
0x078 8 hot_cache_seg_offset Offset to HOT_SEG with interleaved hot vectors
0x080 4 hot_cache_block_offset Block offset
0x084 4 hot_cache_vector_count Vectors in hot cache
0x088 8 prefetch_map_offset Offset to prefetch hint table
0x090 4 prefetch_map_entries Number of prefetch entries
--- Crypto ---
0x094 2 sig_algo Manifest signature algorithm
0x096 2 sig_length Signature length
0x098 var signature Manifest signature (up to 3400 bytes for ML-DSA-65)
--- Padding to 4096 bytes ---
0xF00 252 reserved Reserved / zero-padded to 4096
0xFFC 4 root_checksum CRC32C of bytes 0x000-0xFFB
```
**Total**: Exactly 4096 bytes (one page, one disk sector on most hardware).
### Hotset Pointers
The six hotset pointers are the minimum information needed to answer a query:
1. **Entry points**: Where to start HNSW traversal
2. **Top-layer adjacency**: Coarse routing to the right neighborhood
3. **Centroids/pivots**: For IVF-style pre-filtering or partition routing
4. **Quantization dictionary**: For decoding compressed vectors
5. **Hot cache**: Pre-decoded interleaved vectors for top-K refinement
6. **Prefetch map**: Contiguous neighbor-list pages with prefetch offsets
With these six pointers, a reader can:
- Start HNSW search at the entry point
- Route through the top layer
- Quantize the query using the dictionary
- Scan the hot cache for refinement
- Prefetch neighbor pages for cache-friendly traversal
All without reading Level 1 or any cold segments.
## 3. Level 1: Full Manifest
Level 1 is a variable-size segment (type `MANIFEST_SEG`) referenced by Level 0.
It contains the complete file directory.
### Structure
Level 1 is encoded as a sequence of typed records using a tag-length-value (TLV)
scheme for forward compatibility:
```
+---+---+---+---+---+---+---+---+
| Tag (2B) | Length (4B) | Pad | <- 8-byte aligned record header
+---+---+---+---+---+---+---+---+
| Value (Length bytes) |
| [padded to 8-byte boundary] |
+---------------------------------+
```
### Record Types
```
Tag Name Description
--- ---- -----------
0x0001 SEGMENT_DIR Array of segment directory entries
0x0002 TEMP_TIER_MAP Temperature tier assignments per block
0x0003 INDEX_LAYERS Index layer availability bitmap
0x0004 OVERLAY_CHAIN Epoch chain with rollback pointers
0x0005 COMPACTION_STATE Active/tombstoned segment sets
0x0006 SHARD_REFS Multi-file shard references
0x0007 CAPABILITY_MANIFEST What this file can do (features, limits)
0x0008 PROFILE_CONFIG Domain-specific configuration
0x0009 ACCESS_SKETCH_REF Pointer to latest SKETCH_SEG
0x000A PREFETCH_TABLE Full prefetch hint table
0x000B ID_RESTART_POINTS Restart point index for varint delta IDs
0x000C WITNESS_CHAIN Proof-of-computation witness chain
0x000D KEY_DIRECTORY Encryption key references (not keys themselves)
```
### Segment Directory Entry
```
Offset Size Field Description
------ ---- ----- -----------
0x00 8 segment_id Segment ordinal
0x08 1 seg_type Segment type enum
0x09 1 tier Temperature tier (0=hot, 1=warm, 2=cold)
0x0A 2 flags Segment flags
0x0C 4 reserved Must be zero
0x10 8 file_offset Byte offset in file (or shard)
0x18 8 payload_length Decompressed payload length
0x20 8 compressed_length Compressed length (0 if uncompressed)
0x28 2 shard_id Shard index (0 for main file)
0x2A 2 compression Compression algorithm
0x2C 4 block_count Number of blocks in segment
0x30 16 content_hash Payload hash (first 128 bits)
```
**Total**: 64 bytes per entry (cache-line aligned).
## 4. Manifest Lifecycle
### Writing a New Manifest
Every mutation to the file produces a new MANIFEST_SEG appended at the tail:
```
1. Compute new Level 1 manifest (segment directory + metadata)
2. Write Level 1 as a MANIFEST_SEG payload
3. Compute Level 0 root manifest pointing to Level 1
4. Write Level 0 as the last 4096 bytes of the MANIFEST_SEG
5. fsync
```
The MANIFEST_SEG payload structure is:
```
+-----------------------------------+
| Level 1 manifest (variable size) |
+-----------------------------------+
| Level 0 root manifest (4096 B) | <-- Always the last 4096 bytes
+-----------------------------------+
```
### Reading the Manifest
```
1. seek(EOF - 4096)
2. Read 4096 bytes -> Level 0 root manifest
3. Validate magic (0x52564D30) and checksum
4. If valid: extract hotset pointers -> system is queryable
5. Async: read Level 1 at l1_manifest_offset -> full directory
6. If Level 0 is invalid: scan backward for previous MANIFEST_SEG
```
Step 6 provides crash recovery. If the latest manifest write was interrupted,
the previous manifest is still valid. Readers scan backward at 64-byte aligned
boundaries looking for the RVFS magic + MANIFEST_SEG type.
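Steps 1-4 can be sketched directly. One stated substitution: `zlib.crc32` (CRC-32/IEEE) stands in for CRC32C here, since the Castagnoli polynomial is not in the Python standard library; a real reader would use a CRC32C implementation. Offsets match the Level 0 layout above:

```python
import struct
import zlib

ROOT_MAGIC = 0x52564D30   # "RVM0"

def read_root_manifest(data: bytes):
    """Validate the last 4096 bytes as a Level 0 root manifest.
    Returns a partial parse on success, None on failure (which would
    trigger the backward scan in step 6)."""
    if len(data) < 4096:
        return None
    root = data[-4096:]
    magic, = struct.unpack_from("<I", root, 0x000)
    stored, = struct.unpack_from("<I", root, 0xFFC)   # root_checksum
    if magic != ROOT_MAGIC or zlib.crc32(root[:0xFFC]) != stored:
        return None
    epoch, = struct.unpack_from("<I", root, 0x024)
    return {"epoch": epoch}

# Build a synthetic, minimally valid root manifest
root = bytearray(4096)
struct.pack_into("<I", root, 0x000, ROOT_MAGIC)
struct.pack_into("<I", root, 0x024, 7)                         # epoch = 7
struct.pack_into("<I", root, 0xFFC, zlib.crc32(bytes(root[:0xFFC])))
info = read_root_manifest(bytes(root))
```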
### Manifest Chain
Each manifest implicitly forms a chain through the segment ID ordering. For
explicit rollback support, Level 1 contains the `OVERLAY_CHAIN` record which
stores:
```
epoch: u32 Current epoch
prev_manifest_offset: u64 Offset of previous MANIFEST_SEG
prev_manifest_id: u64 Segment ID of previous MANIFEST_SEG
checkpoint_hash: [u8; 16] Hash of the complete state at this epoch
```
This enables point-in-time recovery and bisection debugging.
## 5. Hotset Pointer Semantics
### Entry Point Stability
Entry points are the HNSW nodes at the highest layer. They change rarely (only
when the index is rebuilt or a new highest-layer node is inserted). The root
manifest caches them directly so they survive across manifest generations without
re-reading the index.
### Centroid Refresh
Centroids may drift as data is added. The manifest tracks a `centroid_epoch` — if
the current epoch exceeds centroid_epoch + threshold, the runtime should schedule
centroid recomputation. But the stale centroids remain usable (recall degrades
gracefully, it does not fail).
### Hot Cache Coherence
The hot cache in HOT_SEG is a **read-optimized snapshot** of the most-accessed
vectors. It may be stale relative to the latest VEC_SEGs. The manifest tracks
a `hot_cache_epoch` for staleness detection. Queries use the hot cache for fast
initial results, then refine against authoritative VEC_SEGs if needed.
## 6. Progressive Boot Sequence
```
Time Action System State
---- ------ ------------
t=0 Read last 4 KB (Level 0) Booting
t+1ms Parse hotset pointers Queryable (approximate)
t+2ms mmap entry points + top layer Better routing
t+5ms mmap hot cache + quant dict Fast top-K refinement
t+10ms Start loading Level 1 Discovering full directory
t+50ms Level 1 parsed Full segment awareness
t+100ms mmap warm VEC_SEGs Recall improving
t+500ms mmap cold VEC_SEGs Full recall
t+1s Background index layer build Converging to optimal
```
For a 10M vector file (~7.7 GB at 384 dimensions, float16):
- Level 0 read: 4 KB in <1 ms
- Hotset data: ~2-4 MB (entry points + top layer + centroids + hot cache)
- First query: within 5-10 ms of open
- Full convergence: 1-5 seconds depending on storage speed

# RVF Temperature Tiering
## 1. Adaptive Layout as a First-Class Concept
Traditional vector formats place data once and leave it. RVF treats data placement
as a **continuous optimization problem**. Every vector block has a temperature, and
the format periodically reorganizes to keep hot data fast and cold data small.
```
Access Frequency
^
|
Tier 0 (HOT) | ████████ fp16 / 8-bit, interleaved
| ████████ < 1μs random access
|
Tier 1 (WARM) | ░░░░░░░░░░░░░░░░ 5-7 bit quantized
| ░░░░░░░░░░░░░░░░ columnar, compressed
|
Tier 2 (COLD) | ▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒ 3-bit or 1-bit
| ▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒ heavy compression
|
+------------------------------------> Vector ID
```
### Tier Definitions
| Tier | Name | Quantization | Layout | Compression | Access Latency |
|------|------|-------------|--------|-------------|----------------|
| 0 | Hot | fp16 or int8 | Interleaved (row-major) | None or LZ4 | < 1 μs |
| 1 | Warm | 5-7 bit SQ/PQ | Columnar | LZ4 or ZSTD | 1-10 μs |
| 2 | Cold | 3-bit or binary | Columnar | ZSTD level 9+ | 10-100 μs |
### Memory Ratios
For 384-dimensional vectors (typical embedding size):
| Tier | Bytes/Vector | Ratio vs fp32 | 10M Vectors |
|------|-------------|---------------|-------------|
| fp32 (raw) | 1536 B | 1.0x | 14.3 GB |
| Tier 0 (fp16) | 768 B | 2.0x | 7.2 GB |
| Tier 0 (int8) | 384 B | 4.0x | 3.6 GB |
| Tier 1 (6-bit) | 288 B | 5.3x | 2.7 GB |
| Tier 1 (5-bit) | 240 B | 6.4x | 2.2 GB |
| Tier 2 (3-bit) | 144 B | 10.7x | 1.3 GB |
| Tier 2 (1-bit) | 48 B | 32.0x | 0.45 GB |
## 2. Access Counter Sketch
Temperature decisions require knowing which blocks are accessed frequently.
RVF maintains a lightweight **Count-Min Sketch** per block set, stored in
SKETCH_SEG segments.
### Sketch Parameters
```
Width (w): 1024 counters
Depth (d): 4 hash functions
Counter size: 8-bit saturating (max 255)
Memory: 1024 * 4 * 1 = 4 KB per sketch
Granularity: One sketch per 1024-vector block
Decay: Halve all counters every 2^16 accesses (aging)
```
For 10M vectors in 1024-vector blocks:
- 9,766 blocks
- 9,766 * 4 KB = ~38 MB of sketches
- Stored in SKETCH_SEG, referenced by manifest
### Sketch Operations
**On query access**:
```
block_id = vector_id / block_size
for i in 0..depth:
idx = hash_i(block_id) % width
sketch[i][idx] = min(sketch[i][idx] + 1, 255)
```
**On temperature check**:
```
count = min over i of sketch[i][hash_i(block_id) % width]
if count > HOT_THRESHOLD: tier = 0
elif count > WARM_THRESHOLD: tier = 1
else: tier = 2
```
**Aging** (every 2^16 accesses):
```
for all counters: counter = counter >> 1
```
This ensures the sketch tracks *recent* access patterns, not cumulative history.
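The three operations fit in a few lines. A minimal sketch with the parameters above (width 1024, depth 4, 8-bit saturating counters, halving decay); deriving the row hashes from salted BLAKE2b is an implementation choice of this sketch, not something the spec mandates:

```python
import hashlib

W, D, MAX = 1024, 4, 255   # width, depth, saturating counter ceiling

class AccessSketch:
    def __init__(self):
        self.rows = [[0] * W for _ in range(D)]

    def _idx(self, row, block_id):
        """Row-specific hash: BLAKE2b with a per-row salt."""
        h = hashlib.blake2b(block_id.to_bytes(8, "little"),
                            salt=bytes([row]) * 16, digest_size=8)
        return int.from_bytes(h.digest(), "little") % W

    def touch(self, block_id):
        """On query access: bump one counter per row, saturating at 255."""
        for r in range(D):
            i = self._idx(r, block_id)
            self.rows[r][i] = min(self.rows[r][i] + 1, MAX)

    def estimate(self, block_id):
        """Min over rows: never underestimates the true count."""
        return min(self.rows[r][self._idx(r, block_id)] for r in range(D))

    def age(self):
        """Decay: halve every counter so recent accesses dominate."""
        for row in self.rows:
            for i in range(W):
                row[i] >>= 1

s = AccessSketch()
for _ in range(100):
    s.touch(42)            # block 42 is hot
s.touch(7)                 # block 7 is touched once
hot = s.estimate(42)       # >= 100 (CMS only overestimates)
s.age()                    # after decay, estimates halve
```

Tier assignment is then a threshold comparison on `estimate(block_id)`, as in the pseudocode above.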
### Why Count-Min Sketch
| Alternative | Memory | Accuracy | Update Cost |
|------------|--------|----------|-------------|
| Per-vector counter | 80 MB (10M * 8B) | Exact | O(1) |
| Count-Min Sketch | 38 MB | ~99.9% | O(depth) = O(4) |
| HyperLogLog | 6 MB | ~98% | O(1) but cardinality only |
| Bloom filter | 12 MB | No counting | N/A |
Count-Min Sketch is the best trade-off: near-exact accuracy with bounded memory
and constant-time updates.
## 3. Promotion and Demotion
### Promotion: Warm/Cold -> Hot
When a block's access count exceeds HOT_THRESHOLD for two consecutive sketch
epochs:
```
1. Read the block from its current VEC_SEG
2. Decode/dequantize vectors to fp16 or int8
3. Rearrange from columnar to interleaved layout
4. Write as a new HOT_SEG (or append to existing HOT_SEG)
5. Update manifest with new tier assignment
6. Optionally: add neighbor lists to HOT_SEG for locality
```
### Demotion: Hot -> Warm -> Cold
When a block's access count drops below WARM_THRESHOLD:
```
1. The block is not immediately rewritten
2. On next compaction cycle, the block is written to the appropriate tier
3. Quantization is applied during compaction (not lazily)
4. The HOT_SEG entry is tombstoned in the manifest
```
### Eviction as Compression
The key insight: **eviction from hot tier is just compression, not deletion**.
The vector data is always present — it just moves to a more compressed
representation. This means:
- No data loss on eviction
- Recall degrades gracefully (quantized vectors still contribute to search)
- The file naturally compresses over time as access patterns stabilize
## 4. Temperature-Aware Compaction
Standard compaction merges segments for space efficiency. Temperature-aware
compaction also **rearranges blocks by tier**:
```
Before compaction:
VEC_SEG_1: [hot] [cold] [warm] [hot] [cold]
VEC_SEG_2: [warm] [hot] [cold] [warm] [warm]
After temperature-aware compaction:
HOT_SEG: [hot] [hot] [hot] <- interleaved, fp16
VEC_SEG_W: [warm] [warm] [warm] [warm] <- columnar, 6-bit
VEC_SEG_C: [cold] [cold] [cold] <- columnar, 3-bit
```
This creates **physical locality by temperature**: hot blocks are contiguous
(good for sequential scan), warm blocks are contiguous (good for batch decode),
cold blocks are contiguous (good for compression ratio).
### Compaction Triggers
| Trigger | Condition | Action |
|---------|-----------|--------|
| Sketch epoch | Every N writes | Evaluate all block temperatures |
| Space amplification | Dead space > 30% | Merge + rewrite segments |
| Tier imbalance | Hot tier > 20% of data | Demote cold blocks |
| Hot miss rate | Hot cache miss > 10% | Promote missing blocks |
## 5. Quantization Strategies by Tier
### Tier 0: Hot
**Scalar quantization to int8** (preferred) or **fp16** (for maximum recall).
```
Encoding:
q = round((v - min) / (max - min) * 255)
Decoding:
v = q / 255 * (max - min) + min
Parameters stored in QUANT_SEG:
min: f32 per dimension
max: f32 per dimension
```
Distance computation directly on int8 using SIMD (vpsubb + vpmaddubsw on AVX-512).
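The encode/decode formulas above, applied per dimension with stored `min`/`max` parameters, look like this (pure-Python sketch; a real implementation vectorizes this):

```python
def sq_encode(v, lo, hi):
    """Per-dimension scalar quantization to an 8-bit code (0..255)."""
    return [round((x - l) / (h - l) * 255) for x, l, h in zip(v, lo, hi)]

def sq_decode(q, lo, hi):
    """Dequantize codes back to approximate float values."""
    return [c / 255 * (h - l) + l for c, l, h in zip(q, lo, hi)]

# Per-dimension ranges, as stored in QUANT_SEG
lo = [-1.0, -1.0, 0.0]
hi = [1.0, 1.0, 2.0]

v = [0.5, -0.25, 1.0]
codes = sq_encode(v, lo, hi)
restored = sq_decode(codes, lo, hi)
err = max(abs(a - b) for a, b in zip(v, restored))
# reconstruction error is bounded by half a quantization step per dimension
```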
### Tier 1: Warm
**Product Quantization (PQ)** with 5-7 bits per sub-vector.
```
Parameters:
M subspaces: 48 (for 384-dim vectors, 8 dims per subspace)
K centroids per sub: 64 (6-bit) or 128 (7-bit)
Codebook: M * K * 8 * sizeof(f32) = 48 * 64 * 8 * 4 = 96 KB
Encoding:
For each subvector: find nearest centroid -> store centroid index
Distance computation:
ADC (Asymmetric Distance Computation) with precomputed distance tables
```
### Tier 2: Cold
**Binary quantization** (1-bit) or **ternary quantization** (2-bit / 3-bit).
```
Binary encoding:
b = sign(v) -> 1 bit per dimension
384 dims -> 48 bytes per vector (32x compression)
Distance:
Hamming distance via POPCNT
XOR + POPCNT on AVX-512: 512 bits per cycle
Ternary (3-bit with magnitude):
t = {-1, 0, +1} based on threshold
magnitude = |v| quantized to 3 levels
384 dims -> 144 bytes per vector (10.7x compression)
```
### Codebook Storage
All quantization parameters (codebooks, min/max ranges, centroids) are stored
in QUANT_SEG segments. The root manifest's `quantdict_seg_offset` hotset pointer
references the active quantization dictionary for fast boot.
Multiple QUANT_SEGs can coexist for different tiers — the manifest maps each
tier to its dictionary.
## 6. Hardware Adaptation
### Desktop (AVX-512)
- Hot tier: int8 with VNNI dot product (4 int8 multiplies per cycle)
- Warm tier: PQ with AVX-512 gather for table lookups
- Cold tier: Binary with VPOPCNTDQ (512-bit popcount)
### ARM (NEON)
- Hot tier: int8 with SDOT instruction
- Warm tier: PQ with TBL for table lookups
- Cold tier: Binary with CNT (population count)
### WASM (v128)
- Hot tier: int8 with i16x8.relaxed_dot_i8x16_i7x16_s (relaxed SIMD, if available)
- Warm tier: Scalar PQ (no gather)
- Cold tier: Binary with manual popcount
### Cognitum Tile (8KB code + 8KB data + 64KB SIMD)
- Hot tier only: int8 interleaved, fits in SIMD scratch
- No warm/cold — data stays on hub, tile fetches blocks on demand
- Sketch is maintained by hub, not tile
## 7. Self-Organization Over Time
```
t=0 All data Tier 1 (default warm)
|
t+N First sketch epoch: identify hot blocks
Promote top 5% to Tier 0
|
t+2N Second epoch: validate promotions
Demote false positives back to Tier 1
Identify true cold blocks (0 access in 2 epochs)
|
t+3N Compaction: physically separate tiers
HOT_SEG created with interleaved layout
Cold blocks compressed to 3-bit
|
t+∞ Equilibrium: ~5% hot, ~30% warm, ~65% cold
File size: ~2-3x smaller than uniform fp16
Query p95: dominated by hot tier latency
```
The format converges to an equilibrium that reflects actual usage. No manual
tuning required.

# RVF Progressive Indexing
## 1. Index as Layers of Availability
Traditional HNSW serialization is all-or-nothing: either the full graph is loaded,
or nothing works. RVF decomposes the index into three layers of availability, each
independently useful, each stored in separate INDEX_SEG segments.
```
Layer C: Full Adjacency
+--------------------------------------------------+
| Complete neighbor lists for every node at every |
| HNSW level. Built lazily. Optional for queries. |
| Recall: >= 0.95 |
+--------------------------------------------------+
^ loaded last (seconds to minutes)
|
Layer B: Partial Adjacency
+--------------------------------------------------+
| Neighbor lists for the most-accessed region |
| (determined by temperature sketch). Covers the |
| hot working set of the graph. |
| Recall: >= 0.85 |
+--------------------------------------------------+
^ loaded second (100ms - 1s)
|
Layer A: Entry Points + Coarse Routing
+--------------------------------------------------+
| HNSW entry points. Top-layer adjacency lists. |
| Cluster centroids for IVF pre-routing. |
| Always present. Always in Level 0 hotset. |
| Recall: >= 0.70 |
+--------------------------------------------------+
^ loaded first (< 5ms)
|
File open
```
### Why Three Layers
| Layer | Purpose | Data Size (10M vectors) | Load Time (NVMe) |
|-------|---------|------------------------|-------------------|
| A | First query possible | 1-4 MB | < 5 ms |
| B | Good quality for working set | 50-200 MB | 100-500 ms |
| C | Full recall for all queries | 1-4 GB | 2-10 s |
A system that only loads Layer A can still answer queries — just with lower recall.
As layers B and C load asynchronously, quality improves transparently.
## 2. Layer A: Entry Points and Coarse Routing
### Content
- **HNSW entry points**: The node(s) at the highest layer of the HNSW graph.
Typically 1 node, but may be multiple for redundancy.
- **Top-layer adjacency**: Full neighbor lists for all nodes at HNSW layers
>= ceil(ln(N) / ln(M)) - 2. For 10M vectors with M=16, this is layers 4-6,
containing ~100-1000 nodes.
- **Cluster centroids**: K centroids (K = sqrt(N) typically, so ~3162 for 10M)
used for IVF-style partition routing.
- **Centroid-to-partition map**: Which centroid owns which vector ID ranges.
### Storage
Layer A data is stored in a dedicated INDEX_SEG with `flags.HOT` set. The root
manifest's hotset pointers reference this segment directly. On cold start, this
is the first data mapped after the manifest.
### Binary Layout of Layer A INDEX_SEG
```
+-------------------------------------------+
| Header: INDEX_SEG, flags=HOT |
+-------------------------------------------+
| Block 0: Entry Points |
| entry_count: u32 |
| max_layer: u32 |
| [entry_node_id: u64, layer: u32] * N |
+-------------------------------------------+
| Block 1: Top-Layer Adjacency |
| layer_count: u32 |
| For each layer (top to bottom): |
| node_count: u32 |
| For each node: |
| node_id: u64 |
| neighbor_count: u16 |
| [neighbor_id: u64] * neighbor_count |
| [64B padding] |
+-------------------------------------------+
| Block 2: Centroids |
| centroid_count: u32 |
| dim: u16 |
| dtype: u8 (fp16) |
| [centroid_vector: fp16 * dim] * K |
| [64B aligned] |
+-------------------------------------------+
| Block 3: Partition Map |
| partition_count: u32 |
| For each partition: |
| centroid_id: u32 |
| vector_id_start: u64 |
| vector_id_end: u64 |
| segment_ref: u64 (segment_id) |
| block_ref: u32 (block offset) |
+-------------------------------------------+
```
### Query Using Only Layer A
```python
def query_layer_a_only(query, k, layer_a, n_probe, hot_cache=None):
# Step 1: Find nearest centroids
dists = [distance(query, c) for c in layer_a.centroids]
top_partitions = top_n(dists, n_probe)
# Step 2: HNSW search through top layers only
entry = layer_a.entry_points[0]
current = entry
for layer in range(layer_a.max_layer, layer_a.min_available_layer, -1):
current = greedy_search(query, current, layer_a.adjacency[layer])
# Step 3: If hot cache available, refine against it
if hot_cache:
candidates = scan_hot_cache(query, hot_cache, current.partition)
return top_k(candidates, k)
# Step 4: Otherwise, return centroid-approximate results
return approximate_from_centroids(query, top_partitions, k)
```
Expected recall: 0.65-0.75 (depends on centroid quality and hot cache coverage).
## 3. Layer B: Partial Adjacency
### Content
Neighbor lists for the **hot region** of the graph — the set of nodes that appear
most frequently in query traversals. Determined by the temperature sketch (see
03-temperature-tiering.md).
Typically covers:
- All nodes at HNSW layers >= 2
- Layer 0-1 nodes in the hot temperature tier
- ~10-20% of total nodes
### Storage
Layer B is stored in one or more INDEX_SEGs without the HOT flag. The Level 1
manifest maps these segments and records which node ID ranges they cover.
### Incremental Build
Layer B can be built incrementally:
```
1. After Layer A is loaded, begin query serving
2. In background: read VEC_SEGs for hot-tier blocks
3. Build HNSW adjacency for those blocks
4. Write as new INDEX_SEG
5. Update manifest to include Layer B
6. Future queries use Layer B for better recall
```
This means the index improves over time without blocking any queries.
### Partial Adjacency Routing
When a query traversal reaches a node without Layer B adjacency (i.e., it's in
the cold region), the system falls back to:
1. **Centroid routing**: Use Layer A centroids to estimate the nearest region
2. **Linear scan**: Scan the relevant VEC_SEG block directly
3. **Approximate**: Accept slightly lower recall for that portion
```python
def search_with_partial_index(query, k, layers):
# Start with Layer A routing
current = hnsw_search_layers(query, layers.a, layers.a.max_layer, 2)
# Continue with Layer B (where available)
if layers.b.has_node(current):
current = hnsw_search_layers(query, layers.b, 1, 0,
start=current)
else:
# Fallback: scan the block containing current
candidates = linear_scan_block(query, current.block)
current = best_of(current, candidates)
return top_k(current.visited, k)
```
## 4. Layer C: Full Adjacency
### Content
Complete neighbor lists for every node at every HNSW level. This is the
traditional full HNSW graph.
### Storage
Layer C may be split across multiple INDEX_SEGs for large datasets. The
manifest records the node ID ranges covered by each segment.
### Lazy Build
Layer C is built lazily — it is not required for the file to be functional.
The build process runs as a background task:
```
1. Identify unindexed VEC_SEG blocks (those without Layer C adjacency)
2. Read blocks in partition order (good locality)
3. Build HNSW adjacency using the existing partial graph as scaffold
4. Write new INDEX_SEG(s)
5. Update manifest
```
### Build Prioritization
Blocks are indexed in temperature order:
1. Hot blocks first (most query benefit)
2. Warm blocks next
3. Cold blocks last (may never be indexed if queries don't reach them)
This means the index build converges to useful quality fast, then approaches
completeness asymptotically.
## 5. Index Segment Binary Format
### Adjacency List Encoding
Neighbor lists are stored using **varint delta encoding with restart points**
for fast random access:
```
+-------------------------------------------+
| Restart Point Index |
| restart_interval: u32 (e.g., 64) |
| restart_count: u32 |
| [restart_offset: u32] * restart_count |
| [64B aligned] |
+-------------------------------------------+
| Adjacency Data |
| For each node (sorted by node_id): |
| neighbor_count: varint |
| [delta_encoded_neighbor_id: varint] |
| (restart point every N nodes) |
+-------------------------------------------+
```
**Restart points**: Every `restart_interval` nodes (default 64), the delta
encoding resets to absolute IDs. This enables O(1) random access to any node's
neighbors by:
1. Binary search the restart point index for the nearest restart <= target
2. Seek to that restart offset
3. Sequentially decode from restart to target (at most 63 decodes)
### Varint Encoding
Standard LEB128 varint:
- Values 0-127: 1 byte
- Values 128-16383: 2 bytes
- Values 16384-2097151: 3 bytes
For delta-encoded neighbor IDs (typical delta: 1-1000), most values fit in 1-2
bytes, giving ~3-4x compression over fixed u64.
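A minimal LEB128 encoder/decoder illustrating the byte-length behavior described above (one continuation bit per byte, 7 payload bits):

```python
def leb128_encode(n):
    """Encode an unsigned integer as LEB128: 7 bits per byte, MSB = continue."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)   # more bytes follow
        else:
            out.append(byte)          # final byte
            return bytes(out)

def leb128_decode(buf, pos=0):
    """Decode one LEB128 value; returns (value, next_position)."""
    result = shift = 0
    while True:
        b = buf[pos]
        pos += 1
        result |= (b & 0x7F) << shift
        if not b & 0x80:
            return result, pos
        shift += 7
```

Values up to 127 take one byte and values up to 16383 take two, which is why delta-encoded neighbor IDs (typical deltas of 1-1000) mostly land in the 1-2 byte range.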
### Prefetch Hints
The manifest's prefetch table maps node ID ranges to contiguous page ranges:
```
Prefetch Entry:
node_id_start: u64
node_id_end: u64
page_offset: u64 Offset of first contiguous page
page_count: u32 Number of contiguous pages
prefetch_ahead: u32 Pages to prefetch ahead of current access
```
When the HNSW search accesses a node, the runtime issues `madvise(WILLNEED)`
(or equivalent) for the next `prefetch_ahead` pages. This hides disk/memory
latency behind computation.
## 6. Index Consistency
### Append-Only Index Updates
When new vectors are added:
1. New vectors go into a **fresh VEC_SEG** (append-only)
2. A temporary in-memory index covers the new vectors
3. When the in-memory index reaches a threshold, it is written as a new INDEX_SEG
4. The manifest is updated to include both the old and new INDEX_SEGs
5. Queries search both indexes and merge results
This is analogous to LSM-tree compaction levels but for graph indexes.
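The search-both-and-merge step (step 5) can be sketched as follows. This is an illustrative model where each live index is represented as a callable returning `(distance, vector_id)` pairs, which is an assumption of the sketch, not the actual API:

```python
def search_merged(query, k, searchers):
    """Query every live index (old sealed + new in-memory), merge by distance."""
    candidates = []
    for search in searchers:
        candidates.extend(search(query, k))
    candidates.sort()                 # ascending (distance, id)
    seen, merged = set(), []
    for dist, vid in candidates:
        if vid not in seen:           # dedupe IDs found by multiple indexes
            seen.add(vid)
            merged.append((dist, vid))
        if len(merged) == k:
            break
    return merged
```

Deduplication matters because a vector indexed in a sealed INDEX_SEG may also appear in the in-memory index after an update; the merge keeps its best-ranked occurrence.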
### Index Merging
When too many small INDEX_SEGs accumulate:
```
1. Read all small INDEX_SEGs
2. Build a unified HNSW graph over all vectors
3. Write as a single sealed INDEX_SEG
4. Tombstone old INDEX_SEGs in manifest
```
### Concurrent Read/Write
Readers always see a consistent snapshot through the manifest chain:
- Reader opens file -> reads manifest -> has immutable segment set
- Writer appends new segments + new manifest
- Reader continues using old manifest until it explicitly re-reads
- No locks needed — append-only guarantees no mutation of existing data
## 7. Query Path Integration
The complete query path combining progressive indexing with temperature tiering:
```
Query
|
v
+-----------+
| Layer A | Entry points + top-layer routing
| (always) | ~5ms to load on cold start
+-----------+
|
Is Layer B available for this region?
/ \
Yes No
/ \
+-----------+ +-----------+
| Layer B | | Centroid |
| HNSW | | Fallback |
| search | | + scan |
+-----------+ +-----------+
\ /
\ /
v v
+-----------+
| Candidate |
| Set |
+-----------+
|
Is hot cache available?
/ \
Yes No
/ \
+-----------+ +-----------+
| Hot cache | | Decode |
| re-rank | | from |
| (int8/fp16)| | VEC_SEG |
+-----------+ +-----------+
\ /
v v
+-----------+
| Top-K |
| Results |
+-----------+
```
### Recall Expectations by State
| State | Layers Available | Expected Recall@10 |
|-------|-----------------|-------------------|
| Cold start (L0 only) | A | 0.65-0.75 |
| L0 + hot cache | A + hot | 0.75-0.85 |
| L0 + L1 loading | A + B partial | 0.80-0.90 |
| L1 complete | A + B | 0.85-0.92 |
| Full load | A + B + C | 0.95-0.99 |
| Full + optimized | A + B + C + hot | 0.98-0.999 |

# RVF Overlay Epochs
## 1. Streaming Dynamic Min-Cut Overlay
The overlay system manages dynamic graph partitioning — how the vector space is
subdivided for distributed search, shard routing, and load balancing. Unlike
static partitioning, RVF overlays evolve with the data through an epoch-based
model that bounds memory, bounds load time, and enables rollback.
## 2. Overlay Segment Structure
Each OVERLAY_SEG stores a delta relative to the previous epoch's partition state:
```
+-------------------------------------------+
| Header: OVERLAY_SEG |
+-------------------------------------------+
| Epoch Header |
| epoch: u32 |
| parent_epoch: u32 |
| parent_seg_id: u64 |
| rollback_offset: u64 |
| timestamp_ns: u64 |
| delta_count: u32 |
| partition_count: u32 |
+-------------------------------------------+
| Edge Deltas |
| For each delta: |
| delta_type: u8 (ADD=1, REMOVE=2, |
| REWEIGHT=3) |
| src_node: u64 |
| dst_node: u64 |
| weight: f32 (for ADD/REWEIGHT) |
| [64B aligned] |
+-------------------------------------------+
| Partition Summaries |
| For each partition: |
| partition_id: u32 |
| node_count: u64 |
| edge_cut_weight: f64 |
| centroid: [fp16 * dim] |
| node_id_range_start: u64 |
| node_id_range_end: u64 |
| [64B aligned] |
+-------------------------------------------+
| Min-Cut Witness |
| witness_type: u8 |
| 0 = checksum only |
| 1 = full certificate |
| cut_value: f64 |
| cut_edge_count: u32 |
| partition_hash: [u8; 32] (SHAKE-256) |
| If witness_type == 1: |
| [cut_edge: (u64, u64)] * count |
| [64B aligned] |
+-------------------------------------------+
| Rollback Pointer |
| prev_epoch_offset: u64 |
| prev_epoch_hash: [u8; 16] |
+-------------------------------------------+
```
## 3. Epoch Lifecycle
### Epoch Creation
A new epoch is created when:
- A batch of vectors is inserted that changes partition balance by > threshold
- The accumulated edge deltas exceed a size limit (default: 1 MB)
- A manual rebalance is triggered
- A merge/compaction produces a new partition layout
```
Epoch 0 (initial) Epoch 1 Epoch 2
+----------------+ +----------------+ +----------------+
| Full snapshot | | Deltas vs E0 | | Deltas vs E1 |
| of partitions | | +50 edges | | +30 edges |
| 32 partitions | | -12 edges | | -8 edges |
| min-cut: 0.342 | | rebalance: P3 | | split: P7->P7a |
+----------------+ +----------------+ +----------------+
```
### State Reconstruction
To reconstruct the current partition state:
```
1. Read latest MANIFEST_SEG -> get current_epoch
2. Read OVERLAY_SEG for current_epoch
3. If overlay is a delta: recursively read parent epochs
4. Apply deltas in order: base -> epoch 1 -> epoch 2 -> ... -> current
5. Result: complete partition state
```
For efficiency, the manifest caches the **last full snapshot epoch**. Delta
chains never exceed a configurable depth (default: 8 epochs) before a new
snapshot is forced.
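The reconstruction procedure above can be sketched as a walk back to the last snapshot followed by replay, oldest delta first. The segment field names (`is_snapshot`, `parent_epoch`, `base`, `deltas`) are illustrative, not the wire-format names:

```python
def reconstruct_state(segments, current_epoch):
    """Walk parent pointers to the last full snapshot, then replay deltas."""
    chain, epoch = [], current_epoch
    while True:
        seg = segments[epoch]
        chain.append(seg)
        if seg["is_snapshot"]:
            break                          # bounded by max chain depth (8)
        epoch = seg["parent_epoch"]
    state = dict(chain[-1]["base"])        # start from the snapshot state
    for seg in reversed(chain[:-1]):       # apply deltas oldest -> newest
        for op, key, value in seg["deltas"]:
            if op in ("ADD", "REWEIGHT"):
                state[key] = value
            elif op == "REMOVE":
                state.pop(key, None)
    return state
```

Because the chain depth is capped at 8 epochs before a snapshot is forced, this loop is O(1) segments regardless of file age.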
### Compaction (Epoch Collapse)
When the delta chain reaches maximum depth:
```
1. Reconstruct full state from chain
2. Write new OVERLAY_SEG with witness_type=full_snapshot
3. This becomes the new base epoch
4. Old overlay segments are tombstoned
5. New delta chain starts from this base
```
```
Before: E0(snap) -> E1(delta) -> E2(delta) -> ... -> E8(delta)
After: E0(snap) -> ... -> E8(delta) -> E9(snap, compacted)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
These can be garbage collected
```
## 4. Min-Cut Witness
The min-cut witness provides a verifiable record that the current partition
is "good enough": a checksum attests state consistency, and a full certificate
attests that the edge cut is within acceptable bounds.
### Witness Types
**Type 0: Checksum Only**
A SHAKE-256 hash of the complete partition state. Allows verification that
the state is consistent but doesn't prove optimality.
```
witness = SHAKE-256(
for each partition sorted by id:
partition_id || node_count || sorted(node_ids) || edge_cut_weight
)
```
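A Type-0 witness computation can be sketched with Python's `hashlib.shake_256`. The formula above fixes the field order; the exact byte widths and float encoding used here are assumptions of this sketch, not the wire format:

```python
import hashlib
import struct

def partition_witness(partitions):
    """SHAKE-256 over a canonical encoding of the partition state (sketch)."""
    h = hashlib.shake_256()
    for p in sorted(partitions, key=lambda p: p["id"]):  # sorted by id
        h.update(p["id"].to_bytes(4, "little"))
        h.update(len(p["nodes"]).to_bytes(8, "little"))
        for nid in sorted(p["nodes"]):                   # sorted node ids
            h.update(nid.to_bytes(8, "little"))
        h.update(struct.pack("<d", p["edge_cut_weight"]))
    return h.digest(32)                                  # 32-byte witness
```

Sorting both the partitions and their node IDs makes the digest independent of in-memory iteration order, so any reader reconstructing the same logical state produces the same witness.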
**Type 1: Full Certificate**
Lists the actual cut edges. Allows any reader to verify that:
1. The listed edges are the only edges crossing partition boundaries
2. The total cut weight matches `cut_value`
3. No better cut exists within the local search neighborhood (optional)
### Bounded-Time Min-Cut Updates
Full min-cut computation is expensive (O(V * E) for max-flow). RVF uses
**incremental min-cut maintenance**:
For each edge delta:
```
1. If ADD(u, v) where u and v are in same partition:
-> No cut change. O(1).
2. If ADD(u, v) where u in P_i and v in P_j:
-> cut_weight[P_i][P_j] += weight. O(1).
-> Check if moving u to P_j or v to P_i reduces total cut.
-> If yes: execute move, update partition summaries. O(degree).
3. If REMOVE(u, v) across partitions:
-> cut_weight[P_i][P_j] -= weight. O(1).
-> No rebalance needed (cut improved).
4. If REMOVE(u, v) within same partition:
-> Check connectivity. If partition splits: create new partition. O(component).
```
This bounds update time to O(max_degree) per edge delta in the common case,
with O(component_size) in the rare partition-split case.
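Cases 1 and 2 above (the O(1) bookkeeping on ADD) can be sketched as follows; `part_of` and `cut_weight` are illustrative structures, and the optional move-gain check is omitted:

```python
def apply_add(u, v, w, part_of, cut_weight):
    """Handle an ADD(u, v, w) edge delta against the pairwise cut table."""
    pu, pv = part_of[u], part_of[v]
    if pu == pv:
        return cut_weight                  # case 1: same partition, no cut change
    key = (min(pu, pv), max(pu, pv))       # case 2: cross-partition edge
    cut_weight[key] = cut_weight.get(key, 0.0) + w
    return cut_weight
```

A full implementation would follow the cross-partition update with the local move check (does relocating `u` or `v` reduce the total cut?), which is what bounds the cost at O(degree) rather than O(1).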
### Semi-Streaming Min-Cut
For large-scale rebalancing (e.g., after bulk insert), RVF uses a semi-streaming
algorithm inspired by Assadi et al.:
```
Phase 1: Single pass over edges to build a sparse skeleton
- Sample each edge with probability O(1/epsilon)
- Space: O(n * polylog(n))
Phase 2: Compute min-cut on skeleton
- Standard max-flow on sparse graph
- Time: O(n^2 * polylog(n))
Phase 3: Verify against full edge set
- Stream edges again, check cut validity
- If invalid: refine skeleton and repeat
```
This runs in O(n * polylog(n)) space regardless of edge count, making it
suitable for streaming over massive graphs.
## 5. Overlay Size Management
### Size Threshold
Each OVERLAY_SEG has a maximum payload size (configurable, default 1 MB).
When the accumulated deltas for the current epoch approach this threshold,
a new epoch is forced.
### Memory Budget
The total memory for overlay state is bounded:
```
max_overlay_memory = max_chain_depth * max_seg_size + snapshot_size
= 8 * 1 MB + snapshot_size
```
For 10M vectors with 32 partitions:
- Snapshot: ~32 * (8 + 16 + 768) bytes per partition ≈ 25 KB
- Delta chain: ≤ 8 MB
- Total: ≤ 9 MB
This is a fixed overhead regardless of dataset size (partition count scales
sublinearly).
### Garbage Collection
Overlay segments behind the last full snapshot are candidates for garbage
collection. The manifest tracks which overlay segments are still reachable
from the current epoch chain.
```
Reachable: current_epoch -> parent -> ... -> last_snapshot
Unreachable: Everything before last_snapshot (safely deletable)
```
GC runs during compaction. Old OVERLAY_SEGs are tombstoned in the manifest
and their space is reclaimed on file rewrite.
## 6. Distributed Overlay Coordination
When RVF files are sharded across multiple nodes, the overlay system coordinates
partition state:
### Shard-Local Overlays
Each shard maintains its own OVERLAY_SEG chain for its local partitions.
The global partition state is the union of all shard-local overlays.
### Cross-Shard Rebalancing
When a partition becomes unbalanced across shards:
```
1. Coordinator computes target partition assignment
2. Each shard writes a JOURNAL_SEG with vector move instructions
3. Vectors are copied (not moved — append-only) to target shards
4. Each shard writes a new OVERLAY_SEG reflecting the new partition
5. Coordinator writes a global MANIFEST_SEG with new shard map
```
This is eventually consistent — during rebalancing, queries may search both
old and new locations and deduplicate results.
### Consistency Model
**Within a shard**: Linearizable (single-writer, manifest chain)
**Across shards**: Eventually consistent with bounded staleness
The epoch counter provides a total order for convergence checking:
- If all shards report epoch >= E, the global state at epoch E is complete
- Stale shards are detectable by comparing epoch counters
## 7. Epoch-Aware Query Routing
Queries use the overlay state for partition routing:
```python
def route_query(query, overlay):
# Find nearest partition centroids
dists = [distance(query, p.centroid) for p in overlay.partitions]
target_partitions = top_n(dists, n_probe)
# Check epoch freshness
if overlay.epoch < current_epoch - stale_threshold:
# Overlay is stale — broaden search
target_partitions = top_n(dists, n_probe * 2)
return target_partitions
```
### Epoch Rollback
If an overlay epoch is found to be corrupt or suboptimal:
```
1. Read rollback_pointer from current OVERLAY_SEG
2. The pointer gives the offset of the previous epoch's OVERLAY_SEG
3. Write a new MANIFEST_SEG pointing to the previous epoch as current
4. Future writes continue from the rolled-back state
```
This provides O(1) rollback to any ancestor epoch in the chain.
## 8. Integration with Progressive Indexing
The overlay system and the index system are coupled:
- **Partition centroids** in the overlay guide Layer A routing
- **Partition boundaries** determine which INDEX_SEGs cover which regions
- **Partition rebalancing** may invalidate Layer B adjacency for moved vectors
(these are rebuilt lazily)
- **Layer C** is partition-aligned — each INDEX_SEG covers vectors within
a single partition for locality
This means overlay compaction can trigger partial index rebuild, but only for
the affected partitions — not the entire index.

# RVF Ultra-Fast Query Path
## 1. CPU Shape Optimization
The block layout determines performance at the hardware level. RVF is designed
to match the shape of modern CPUs: wide SIMD, deep caches, hardware prefetch.
### Four Optimizations
1. **Strict 64-byte alignment** for all numeric arrays
2. **Columnar + interleaved hybrid** for compression and speed
3. **Prefetch hints** for cache-friendly graph traversal
4. **Dictionary-coded IDs** for fast random access
## 2. Strict Alignment
Every numeric array in RVF starts at a 64-byte aligned offset. This matches:
| Target | Register Width | Alignment |
|--------|---------------|-----------|
| AVX-512 | 512 bits = 64 bytes | 64 B |
| AVX2 | 256 bits = 32 bytes | 64 B (superset) |
| ARM NEON | 128 bits = 16 bytes | 64 B (superset) |
| WASM v128 | 128 bits = 16 bytes | 64 B (superset) |
| Cache line | Typically 64 bytes | 64 B (exact) |
By aligning to 64 bytes, RVF ensures:
- Zero-copy load into any SIMD register (no unaligned penalty)
- No cache-line splits (each access touches exactly one cache line)
- Optimal hardware prefetch behavior (prefetcher operates on cache lines)
### Alignment in Practice
```
Segment header: 64 B (naturally aligned, first item in segment)
Block header: Padded to 64 B boundary
Vector data start: 64 B aligned from block start
Each dimension column: 64 B aligned (columnar VEC_SEG)
Each vector entry: 64 B aligned (interleaved HOT_SEG)
ID map: 64 B aligned
Restart point index: 64 B aligned
```
Padding bytes between sections are zero-filled and excluded from checksums.
## 3. Columnar + Interleaved Hybrid
### Columnar Storage (VEC_SEG) — Optimized for Compression
```
Block layout (1024 vectors, 384 dimensions, fp16):
Offset 0x000: dim_0[vec_0], dim_0[vec_1], ..., dim_0[vec_1023] (2048 B)
Offset 0x800: dim_1[vec_0], dim_1[vec_1], ..., dim_1[vec_1023] (2048 B)
...
Offset 0xBF800: dim_383[vec_0], ..., dim_383[vec_1023] (2048 B)
Total: 384 * 2048 = 786,432 bytes (768 KB per block)
```
**Why columnar for cold/warm storage**:
- Adjacent values in the same dimension are correlated -> higher compression ratio
- LZ4 on columnar fp16 achieves 1.5-2.5x compression (vs 1.1-1.3x on interleaved)
- ZSTD on columnar fp16 achieves 2.5-4x compression
- Batch operations (computing mean, variance) scan one dimension at a time
### Interleaved Storage (HOT_SEG) — Optimized for Speed
```
Entry layout (one hot vector, 384 dim fp16):
Offset 0x000: vector_id (8 B)
Offset 0x008: dim_0, dim_1, dim_2, ..., dim_383 (768 B)
Offset 0x308: neighbor_count (2 B)
Offset 0x30A: neighbor_0, neighbor_1, ... (8 B each)
Offset 0x38A: padding to 64B boundary
--> 960 bytes per entry (at M=16 neighbors)
```
**Why interleaved for hot data**:
- One vector = one sequential read (no column gathering)
- Distance computation: load vector, compute, move to next (streaming pattern)
- Neighbors co-located: after finding a good candidate, immediately traverse
- 960 bytes per entry = 15 cache lines = predictable memory access
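The 960-byte figure follows directly from the layout: 8 B id + 768 B vector + 2 B neighbor count + 16 × 8 B neighbors = 906 B, rounded up to the next 64 B cache-line boundary. A small sketch of that computation:

```python
def hot_entry_size(dim=384, dtype_bytes=2, max_neighbors=16):
    """64 B-aligned size of one interleaved HOT_SEG entry (sketch)."""
    raw = 8 + dim * dtype_bytes + 2 + max_neighbors * 8
    return (raw + 63) & ~63  # round up to a cache-line multiple
```

The `& ~63` trick works because 64 is a power of two: adding 63 then clearing the low 6 bits rounds any size up to the next multiple of 64.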
### When to Use Each
| Operation | Layout | Reason |
|-----------|--------|--------|
| Bulk distance computation | Columnar | SIMD operates on dimension columns |
| Top-K refinement scan | Interleaved | Sequential scan of candidates |
| Compression/archival | Columnar | Better ratio |
| HNSW search (hot region) | Interleaved | Vector + neighbors together |
| Batch insert | Columnar | Write once, compress well |
## 4. Prefetch Hints
### The Problem
HNSW search is pointer-chasing: compute distance at node A, read neighbor
list, jump to node B, compute distance, repeat. Each jump is a random
memory access. On a 10M vector file, this means:
```
HNSW search: ~100-200 distance computations per query
Each computation: 1 random read (vector) + 1 random read (neighbors)
Random read latency: 50-100 ns (DRAM), 10-50 μs (SSD)
Total: 10-40 μs (DRAM), 1-10 ms (SSD) without prefetch
```
### The Solution
Store neighbor lists **contiguously** and add **prefetch offsets** in the
manifest so the runtime can issue prefetch instructions ahead of time.
### Prefetch Table Structure
The manifest contains a prefetch table mapping node ID ranges to contiguous
page regions:
```
prefetch_table:
entry_count: u32
entries:
[0]: node_ids 0-9999 -> pages at offset 0x100000, 50 pages, prefetch 3 ahead
[1]: node_ids 10000-19999 -> pages at offset 0x200000, 50 pages, prefetch 3 ahead
...
```
### Runtime Prefetch Strategy
```python
def hnsw_search_with_prefetch(query, k, entry_point, ef_search):
candidates = MaxHeap()
visited = BitSet()
worklist = MinHeap([(distance(query, entry_point), entry_point)])
while worklist:
dist, node = worklist.pop()
# PREFETCH: while processing this node, prefetch neighbors' data
neighbors = get_neighbors(node)
for n in neighbors[:PREFETCH_AHEAD]:
if n not in visited:
prefetch_vector(n) # madvise(WILLNEED) or __builtin_prefetch
prefetch_neighbors(n) # prefetch neighbor list page
# COMPUTE: distance to neighbors (data should be in cache by now)
for n in neighbors:
if n not in visited:
visited.add(n)
d = distance(query, get_vector(n))
if d < candidates.max() or len(candidates) < ef_search:
candidates.push((d, n))
worklist.push((d, n))
return candidates.top_k(k)
```
### Contiguous Neighbor Layout
HOT_SEG stores vectors and neighbors together. For cold INDEX_SEGs, neighbor
lists are laid out in **node ID order** within contiguous pages:
```
Page 0: neighbors[node_0], neighbors[node_1], ..., neighbors[node_63]
Page 1: neighbors[node_64], ..., neighbors[node_127]
...
```
Because HNSW search tends to traverse nodes in the same graph neighborhood
(spatially close node IDs if data was inserted in order), sequential node
IDs tend to be accessed together. Contiguous layout turns random access
into sequential reads.
### Expected Improvement
| Configuration | p95 Latency (10M vectors) |
|--------------|--------------------------|
| No prefetch, random layout | 2.5 ms |
| No prefetch, contiguous layout | 1.2 ms |
| Prefetch, contiguous layout | 0.3 ms |
| Prefetch, contiguous + hot cache | 0.15 ms |
## 5. Dictionary-Coded IDs
### The Problem
Vector IDs in neighbor lists and ID maps are 64-bit integers. For 10M vectors,
most IDs fit in 24 bits. Storing full 64-bit IDs wastes ~5 bytes per entry.
With M=16 neighbors per node and 10M nodes:
- Raw: 10M * 16 * 8 = 1.2 GB of ID data
- Desired: < 300 MB
### Varint Delta Encoding
IDs within a block or neighbor list are sorted and delta-encoded:
```
Original IDs: [1000, 1005, 1008, 1020, 1100]
Deltas: [1000, 5, 3, 12, 80]
Varint bytes: [ 2B, 1B, 1B, 1B, 1B] = 6 bytes (vs 40 bytes raw)
```
### Restart Points
Every N entries (default N=64), the delta resets to an absolute value:
```
Group 0 (entries 0-63): delta from 0 (absolute start)
Group 1 (entries 64-127): delta from entry[64] (restart)
Group 2 (entries 128-191): delta from entry[128] (restart)
```
The restart point index stores the offset of each restart group:
```
restart_index:
interval: 64
offsets: [0, 156, 298, 445, ...] // byte offsets into encoded data
```
### Random Access
To find the neighbors of node N:
```
1. group = N / restart_interval // O(1)
2. offset = restart_index[group] // O(1)
3. seek to offset in encoded data // O(1)
4. decode sequentially from restart to N // O(restart_interval) = O(64)
```
Total: O(64) varint decodes = ~50-100 ns. Compare with sorted array binary
search: O(log N) = O(24) comparisons with cache misses = ~200-500 ns.
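The restart mechanism can be sketched over plain integer lists (the varint byte packing from the previous section is omitted so the restart logic stands out):

```python
def build_restarts(sorted_ids, interval=64):
    """Delta-encode sorted IDs with an absolute restart every `interval` entries."""
    encoded, restarts = [], []
    for i, x in enumerate(sorted_ids):
        if i % interval == 0:
            restarts.append(i)                      # position of restart group
            encoded.append(x)                       # absolute value at restart
        else:
            encoded.append(x - sorted_ids[i - 1])   # delta from predecessor
    return encoded, restarts

def decode_at(encoded, restarts, i, interval=64):
    """Random access: jump to the nearest restart, then decode sequentially."""
    g = i // interval
    value = encoded[restarts[g]]
    for j in range(restarts[g] + 1, i + 1):
        value += encoded[j]                         # at most interval-1 adds
    return value
```

Shrinking `interval` speeds up random access but adds more absolute (uncompressed) values; the default of 64 is the trade-off point assumed throughout this spec.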
### SIMD Varint Decoding
Modern SIMD can decode varints in bulk:
```
AVX-512 VBMI: ~8 varints per cycle using VPERMB + VPSHUFB
Throughput: 2-4 billion integers/second (Lemire et al.)
```
At 16 neighbors per node, one HNSW search step decodes 16 varints in ~2-4 ns.
### Compression Ratio
| Encoding | Bytes per ID (avg) | 10M * 16 neighbors |
|----------|-------------------|-------------------|
| Raw u64 | 8.0 B | 1,220 MB |
| Raw u32 | 4.0 B | 610 MB |
| Varint (no delta) | 3.2 B | 488 MB |
| Varint delta | 1.5 B | 229 MB |
| Varint delta + restart | 1.6 B | 244 MB |
Delta encoding with restart points achieves ~5x compression over raw u64
while maintaining fast random access.
## 6. Cache Behavior Analysis
### L1/L2/L3 Working Sets
For a typical query on 10M vectors (384 dim, fp16):
```
HNSW search:
~150 distance computations
Each computation: 768 B (vector) + ~128 B (neighbor list) ≈ 896 B
Total working set: 150 * 896 ≈ 131 KB
Top-K refinement (hot cache scan):
~1000 candidates checked
Each: 960 B (interleaved HOT_SEG entry)
Total: 960 KB
Query vector: 768 B (always in L1)
Quantization tables: 96 KB (PQ codebook, always in L2)
```
| Cache Level | Size | What Fits |
|------------|------|-----------|
| L1 (32-48 KB) | Query vector + current node | Always hit |
| L2 (256 KB-1 MB) | PQ tables + 100-200 hot entries | Usually hit |
| L3 (8-32 MB) | Hot cache + partial index | Mostly hit |
| DRAM | Everything | Full dataset |
### p95 Latency Budget
```
HNSW traversal: 150 nodes * 100 ns/node = 15 μs (L3 hit)
Distance compute: 150 * 50 ns = 7.5 μs (SIMD)
Top-K refinement: 1000 * 10 ns = 10 μs (hot cache, L2/L3 hit)
Overhead: 5 μs (heap ops, bookkeeping)
-------
Total p95: ~37.5 μs ≈ 0.04 ms
With prefetch: ~30 μs (hide 25% of traversal latency)
```
This matches the target of < 0.3 ms p95 on desktop hardware. The dominant
cost is memory bandwidth, not computation — which is why cache-friendly
layout and prefetch are critical.
## 7. Distance Function SIMD Implementations
### L2 Distance (fp16, 384 dim, AVX-512)
```
; 384 fp16 values = 768 bytes = 12 ZMM registers
; Process 32 fp16 values per iteration (convert to 16 fp32 per half)
.loop:
vmovdqu16 zmm0, [rsi + rcx] ; Load 32 fp16 from A
vmovdqu16 zmm1, [rdi + rcx] ; Load 32 fp16 from B
vcvtph2ps zmm2, ymm0 ; Convert low 16 to fp32
vcvtph2ps zmm3, ymm1
vsubps zmm2, zmm2, zmm3 ; diff = A - B
vfmadd231ps zmm4, zmm2, zmm2 ; acc += diff * diff
; Repeat for high 16
vextracti64x4 ymm0, zmm0, 1
vextracti64x4 ymm1, zmm1, 1
vcvtph2ps zmm2, ymm0
vcvtph2ps zmm3, ymm1
vsubps zmm2, zmm2, zmm3
vfmadd231ps zmm4, zmm2, zmm2
add rcx, 64
cmp rcx, 768
jl .loop
; Horizontal sum of zmm4 -> scalar result
; ~12 iterations, ~24 FMA ops, ~12 cycles total
```
### Inner Product (int8, 384 dim, AVX-512 VNNI)
```
; 384 int8 values = 384 bytes = 6 ZMM registers
; VPDPBUSD: 64 uint8*int8 multiply-adds per cycle
.loop:
vmovdqu8 zmm0, [rsi + rcx] ; 64 uint8 from A
vmovdqu8 zmm1, [rdi + rcx] ; 64 int8 from B
vpdpbusd zmm2, zmm0, zmm1 ; acc += dot(A, B) per 4 bytes
add rcx, 64
cmp rcx, 384
jl .loop
; 6 iterations, 6 VPDPBUSD ops, ~6 cycles
; ~2x faster per distance than the fp16 L2 kernel (~6 vs ~12 cycles)
```
### Hamming Distance (binary, 384 dim, AVX-512)
```
; 384 bits = 48 bytes = one masked ZMM load
; VPOPCNTDQ: popcount on 8 x 64-bit words per instruction
mov rax, 0x0000FFFFFFFFFFFF  ; byte mask covering the low 48 bytes
kmovq k1, rax
vmovdqu8 zmm0{k1}{z}, [rsi]  ; Load 48 bytes (384 bits) from A
vmovdqu8 zmm1{k1}{z}, [rdi]  ; Load 48 bytes from B
vpxorq zmm2, zmm0, zmm1 ; XOR -> differing bits
vpopcntq zmm3, zmm2 ; Popcount per 64-bit word
; Horizontal sum of 6 popcounts -> Hamming distance
; ~3 cycles total
```
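The three kernels above compute squared L2 distance, an int8 dot product, and Hamming distance. A scalar reference in Python (illustrative only, not the RVF implementation) is useful for validating SIMD output bit-for-bit:

```python
def l2_sq(a, b):
    """Scalar reference for the fp16 L2 kernel: sum of squared diffs."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def dot_i8(a, b):
    """Scalar reference for the VNNI inner-product kernel."""
    return sum(x * y for x, y in zip(a, b))

def hamming(a, b):
    """Scalar reference for the popcount kernel. a, b: bytes (48 B = 384 bits)."""
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))
```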
## 8. Summary: Query Path Hot Loop
The complete hot path for one HNSW search step:
```
1. Load current node's neighbor list [L2/L3 cache, 128 B, ~5 ns]
2. Issue prefetch for next neighbors [~1 ns]
3. For each neighbor (M=16):
a. Check visited bitmap [L1, ~1 ns]
b. Load neighbor vector (hot cache) [L2/L3, 768 B, ~5-10 ns]
c. SIMD distance (fp16, 384 dim) [~12 cycles = ~4 ns]
d. Heap insert if better [~5 ns]
4. Total per step: ~300-500 ns
5. Total per query (~150 steps): ~50-75 μs
```
This achieves 13,000-20,000 QPS per thread on desktop hardware — matching
or exceeding dedicated vector databases for in-memory workloads.


# RVF Deletion Lifecycle
## 1. Overview
Deletion in RVF follows a two-phase protocol consistent with the append-only
segment architecture. Vectors are never removed in-place. Instead, a soft
delete records intent in a JOURNAL_SEG, and a subsequent compaction hard
deletes by physically excluding the vectors from sealed output segments.
```
JOURNAL_SEG Compaction GC / Rewrite
(append) (merge) (reclaim)
ACTIVE -----> SOFT_DELETED -----> HARD_DELETED ------> RECLAIMED
| | | |
| query path | query path | |
| returns vec | skips vec | vec absent | space freed
| | (bitmap check) | from output seg |
```
Readers always see a consistent snapshot: a deletion is invisible until
the manifest referencing the new deletion bitmap is durably committed.
## 2. Vector Lifecycle State Machine
```
+----------+ JOURNAL_SEG +-----------------+
| | DELETE_VECTOR / RANGE | |
| ACTIVE +----------------------->+ SOFT_DELETED |
| | | |
+----------+ +--------+--------+
| Compaction seals output
v excluding this vector
+--------+--------+
| HARD_DELETED |
+--------+--------+
| File rewrite / truncation
v reclaims physical space
+--------+--------+
| RECLAIMED |
+-----------------+
```
| State | Bitmap Bit | Physical Bytes | Query Visible |
|-------|------------|----------------|---------------|
| ACTIVE | 0 | Vector in VEC_SEG | Yes |
| SOFT_DELETED | 1 | Vector in VEC_SEG | No |
| HARD_DELETED | N/A | Excluded from sealed output | No |
| RECLAIMED | N/A | Bytes overwritten / freed | No |
| Transition | Trigger | Durability |
|------------|---------|------------|
| ACTIVE -> SOFT_DELETED | JOURNAL_SEG + MANIFEST_SEG with bitmap | After manifest fsync |
| SOFT_DELETED -> HARD_DELETED | Compaction writes sealed VEC_SEG without vector | After compaction manifest fsync |
| HARD_DELETED -> RECLAIMED | File rewrite or old shard deletion | After shard unlink |
## 3. JOURNAL_SEG Wire Format (type 0x04)
A JOURNAL_SEG records metadata mutations: deletions, metadata updates, tier
moves, and ID remappings. Its payload follows the standard 64-byte segment
header (see `01-segment-model.md` section 2).
### 3.1 Journal Header (64 bytes)
```
Offset Type Field Description
------ ---- ----- -----------
0x00 u32 entry_count Number of journal entries
0x04 u32 journal_epoch Epoch when this journal was written
0x08 u64 prev_journal_seg_id Segment ID of previous JOURNAL_SEG (0 if first)
0x10 u32 flags Reserved, must be 0
0x14 u8[44] reserved Zero-padded to 64-byte alignment
```
### 3.2 Journal Entry Format
Each entry begins on an 8-byte aligned boundary:
```
Offset Type Field Description
------ ---- ----- -----------
0x00 u8 entry_type Entry type enum
0x01 u8 reserved Must be 0x00
0x02 u16 entry_length Byte length of type-specific payload
0x04 u8[] payload Type-specific payload
var u8[] padding Zero-pad to next 8-byte boundary
```
### 3.3 Entry Types
```
Value Name Payload Size Description
----- ---- ------------ -----------
0x01 DELETE_VECTOR 8 B Delete a single vector by ID
0x02 DELETE_RANGE 16 B Delete a contiguous range of vector IDs
0x03 UPDATE_METADATA variable Update key-value metadata for a vector
0x04 MOVE_VECTOR 24 B Reassign vector to a different segment/tier
0x05 REMAP_ID 16 B Reassign vector ID (post-compaction)
```
### 3.4 Type-Specific Payloads
**DELETE_VECTOR (0x01)**
```
0x00 u64 vector_id ID of the vector to soft-delete
```
**DELETE_RANGE (0x02)**
```
0x00 u64 start_id First vector ID (inclusive)
0x08    u64    end_id        One past the last vector ID (exclusive)
```
Invariant: `start_id < end_id`. Range `[start_id, end_id)` is half-open.
**UPDATE_METADATA (0x03)**
```
0x00 u64 vector_id Target vector ID
0x08 u16 key_len Byte length of metadata key
0x0A u8[] key Metadata key (UTF-8)
var u16 val_len Byte length of metadata value
var+2 u8[] val Metadata value (opaque bytes)
```
**MOVE_VECTOR (0x04)**
```
0x00 u64 vector_id Target vector ID
0x08 u64 src_seg Source segment ID
0x10 u64 dst_seg Destination segment ID
```
**REMAP_ID (0x05)**
```
0x00 u64 old_id Original vector ID
0x08 u64 new_id New vector ID after compaction
```
### 3.5 Complete JOURNAL_SEG Example
Deleting vector 42, deleting range [1000, 2000), remapping ID 500 -> 3:
```
Byte offset Content Notes
----------- ------- -----
0x00-0x3F Segment header (64 B) seg_type=0x04, magic=RVFS
0x40-0x7F Journal header (64 B) entry_count=3, epoch=7,
prev_journal_seg_id=12
--- Entry 0: DELETE_VECTOR ---
0x80 0x01 entry_type
0x81 0x00 reserved
0x82-0x83 0x0008 entry_length = 8
0x84-0x8B 0x000000000000002A vector_id = 42
0x8C-0x8F 0x00000000 padding to 8B
--- Entry 1: DELETE_RANGE ---
0x90 0x02 entry_type
0x91 0x00 reserved
0x92-0x93 0x0010 entry_length = 16
0x94-0x9B 0x00000000000003E8 start_id = 1000
0x9C-0xA3 0x00000000000007D0 end_id = 2000
0xA4-0xA7    0x00000000          padding to 8B
--- Entry 2: REMAP_ID ---
0xA8         0x05                entry_type
0xA9         0x00                reserved
0xAA-0xAB    0x0010              entry_length = 16
0xAC-0xB3    0x00000000000001F4  old_id = 500
0xB4-0xBB    0x0000000000000003  new_id = 3
0xBC-0xBF    0x00000000          padding to 8B
```
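The entry layout above can be reproduced with a small encoder. This is an illustrative sketch (the helper names are not part of the spec) assuming little-endian integers, consistent with the rest of the wire format:

```python
import struct

def encode_entry(entry_type, payload):
    # 4-byte entry header (type: u8, reserved: u8, length: u16), then the
    # payload, zero-padded so the next entry starts on an 8-byte boundary.
    raw = struct.pack("<BBH", entry_type, 0x00, len(payload)) + payload
    return raw + b"\x00" * ((-len(raw)) % 8)

def delete_vector(vector_id):
    return encode_entry(0x01, struct.pack("<Q", vector_id))

def delete_range(start_id, end_id):
    assert start_id < end_id  # half-open [start_id, end_id)
    return encode_entry(0x02, struct.pack("<QQ", start_id, end_id))

def remap_id(old_id, new_id):
    return encode_entry(0x05, struct.pack("<QQ", old_id, new_id))

# The three entries from the example above (64 bytes total):
entries = delete_vector(42) + delete_range(1000, 2000) + remap_id(500, 3)
```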
## 4. Deletion Bitmap
### 4.1 Manifest Record
The deletion bitmap is stored in the Level 1 manifest as a TLV record:
```
Tag Name Description
--- ---- -----------
0x000E DELETION_BITMAP Roaring bitmap of soft-deleted vector IDs
```
This extends the TLV tag space (previous: 0x000D KEY_DIRECTORY).
### 4.2 Roaring Bitmap Binary Layout
Vector IDs are 64-bit. The upper bits (`vector_id >> 16`) select a **high key**;
the lower 16 bits index into a 65,536-slot **container** for that high key.
```
+---------------------------------------------+
| DELETION_BITMAP TLV Value |
+---------------------------------------------+
| Bitmap Header |
| cookie: u32 (0x3B3A3332) |
| high_key_count: u32 |
| For each high key: |
| high_key: u32 |
| container_type: u8 |
| 0x01 = ARRAY_CONTAINER |
| 0x02 = BITMAP_CONTAINER |
| 0x03 = RUN_CONTAINER |
| container_offset: u32 (from bitmap start)|
| [8B aligned] |
+---------------------------------------------+
| Container Data |
| Container 0: [type-specific layout] |
| Container 1: ... |
| [8B aligned per container] |
+---------------------------------------------+
```
### 4.3 Container Types
**ARRAY_CONTAINER (0x01)** -- Sparse deletions (<= 4096 set bits per 64K range).
```
0x00 u16 cardinality Number of set values (1-4096)
0x02 u16[] values Sorted array of 16-bit values
```
Size: `2 + 2 * cardinality` bytes.
**BITMAP_CONTAINER (0x02)** -- Dense deletions (> 4096 set bits per 64K range).
```
0x00 u16 cardinality Number of set bits
0x02 u8[8192] bitmap Fixed 65536-bit bitmap (8 KB)
```
Size: 8194 bytes (fixed).
**RUN_CONTAINER (0x03)** -- Contiguous ranges of deletions.
```
0x00 u16 run_count Number of runs
0x02 (u16,u16) runs[] Array of (start, length-1) pairs
```
Size: `2 + 4 * run_count` bytes.
### 4.4 Size Estimation
| Deletion Pattern | Deleted IDs | Container Types | Bitmap Size |
|------------------|-------------|-----------------|-------------|
| Sparse random | 10,000 (0.1%) | ~153 array | ~22 KB |
| Clustered ranges | 10,000 (0.1%) | ~5 run | ~0.1 KB |
| Mixed workload | 100,000 (1%) | array + run | ~80 KB |
| Heavy deletion | 1,000,000 (10%) | bitmap + run | ~200 KB |
Even at 200 KB the bitmap fits entirely in L2 cache.
### 4.5 Bitmap Operations
```python
def bitmap_check(bitmap, vector_id):
"""Returns True if vector_id is soft-deleted. O(1) amortized."""
high_key = vector_id >> 16
low_val = vector_id & 0xFFFF
container = bitmap.get_container(high_key)
if container is None:
return False
return container.contains(low_val) # array: bsearch, bitmap: bit test, run: bsearch
def bitmap_set(bitmap, vector_id):
"""Mark a vector as soft-deleted."""
high_key = vector_id >> 16
low_val = vector_id & 0xFFFF
container = bitmap.get_or_create_container(high_key)
container.add(low_val)
if container.type == ARRAY and container.cardinality > 4096:
container.promote_to_bitmap()
```
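A minimal in-memory model of the container logic, with array containers promoting to 8 KB bitmap containers past the 4096-entry threshold from Section 4.3 (everything else, including the class shape, is illustrative; the on-disk serialization is not implemented):

```python
import bisect

class RoaringBitmap:
    """Sketch: sorted-u16-list array containers, promoted to 8 KB bitmaps."""
    ARRAY_LIMIT = 4096

    def __init__(self):
        self.containers = {}  # high_key -> sorted list[int] | bytearray(8192)

    def add(self, vector_id):
        hk, lo = vector_id >> 16, vector_id & 0xFFFF
        c = self.containers.get(hk)
        if c is None:
            self.containers[hk] = [lo]
        elif isinstance(c, list):
            i = bisect.bisect_left(c, lo)
            if i == len(c) or c[i] != lo:
                c.insert(i, lo)
                if len(c) > self.ARRAY_LIMIT:
                    bm = bytearray(8192)  # promote ARRAY -> BITMAP container
                    for v in c:
                        bm[v >> 3] |= 1 << (v & 7)
                    self.containers[hk] = bm
        else:
            c[lo >> 3] |= 1 << (lo & 7)

    def contains(self, vector_id):
        hk, lo = vector_id >> 16, vector_id & 0xFFFF
        c = self.containers.get(hk)
        if c is None:
            return False
        if isinstance(c, list):
            i = bisect.bisect_left(c, lo)
            return i < len(c) and c[i] == lo
        return bool(c[lo >> 3] & (1 << (lo & 7)))
```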
## 5. Delete-Aware Query Path
### 5.1 HNSW Traversal with Deletion Filtering
Deleted vectors remain in the HNSW graph until compaction rebuilds the index.
During search, the deletion bitmap is checked per candidate. Deleted nodes are
still traversed for connectivity but excluded from the result set.
```python
def hnsw_search_delete_aware(query, entry_point, ef_search, k, del_bitmap):
candidates = MaxHeap() # worst candidate on top
visited = BitSet()
worklist = MinHeap() # best candidate first
d0 = distance(query, get_vector(entry_point))
worklist.push((d0, entry_point))
visited.add(entry_point)
if not bitmap_check(del_bitmap, entry_point):
candidates.push((d0, entry_point))
while worklist:
dist, node = worklist.pop()
if candidates.size() >= ef_search and dist > candidates.peek_max():
break
neighbors = get_neighbors(node)
for n in neighbors[:PREFETCH_AHEAD]:
if n not in visited:
prefetch_vector(n)
for n in neighbors:
if n in visited:
continue
visited.add(n)
d = distance(query, get_vector(n))
is_deleted = bitmap_check(del_bitmap, n) # O(1) bitmap lookup
# Always add to worklist (graph connectivity)
if candidates.size() < ef_search or d < candidates.peek_max():
worklist.push((d, n))
# Only add to results if NOT deleted
if not is_deleted:
if candidates.size() < ef_search:
candidates.push((d, n))
elif d < candidates.peek_max():
candidates.replace_max((d, n))
return candidates.top_k(k)
```
### 5.2 Top-K Refinement with Deletion Filtering
```python
def topk_refine_delete_aware(candidates, hot_cache, query, k, del_bitmap):
heap = MaxHeap()
for cand_dist, cand_id in candidates:
heap.push((cand_dist, cand_id))
for entry in hot_cache.sequential_scan():
if bitmap_check(del_bitmap, entry.vector_id):
continue # skip soft-deleted
d = distance(query, entry.vector)
if heap.size() < k:
heap.push((d, entry.vector_id))
elif d < heap.peek_max():
heap.replace_max((d, entry.vector_id))
return heap.drain_sorted()
```
### 5.3 Performance Impact
| Operation | Without Deletions | With Deletions | Overhead |
|-----------|-------------------|----------------|----------|
| Bitmap check | N/A | ~2-5 ns (L1/L2 hit) | Per candidate |
| HNSW step (M=16) | ~300-500 ns | ~330-580 ns | +10% |
| Top-K refine (1000) | ~10 us | ~12 us | +20% worst |
| Total query | ~50-75 us | ~55-85 us | +10-13% |
At typical deletion rates (< 5%), overhead is negligible: the bitmap fits in
L2 cache, graph connectivity is preserved, and the cost is one branch plus
one bitmap load per candidate.
## 6. Deletion Write Path
All deletion operations follow the same two-fsync protocol:
```python
def delete_vectors(file, entries):
"""Soft-delete vectors. entries: list of DeleteVector or DeleteRange."""
# 1. Append JOURNAL_SEG
journal = JournalSegment(
epoch=current_epoch(file),
prev_journal_seg_id=latest_journal_id(file),
entries=entries
)
append_segment(file, journal)
fsync(file) # orphan-safe: no manifest references this yet
# 2. Update deletion bitmap in memory
bitmap = load_deletion_bitmap(file)
for e in entries:
if e.type == DELETE_VECTOR:
bitmap_set(bitmap, e.vector_id)
elif e.type == DELETE_RANGE:
bitmap.add_range(e.start_id, e.end_id)
# 3. Append MANIFEST_SEG with updated bitmap
manifest = build_manifest(file, deletion_bitmap=bitmap)
append_segment(file, manifest)
fsync(file) # deletion now visible to all new readers
```
Single deletes, bulk ranges, and batch deletes all use this path. Batch
operations pack multiple entries into one JOURNAL_SEG to amortize fsync cost.
## 7. Compaction with Deletions
### 7.1 Compaction Process
```
Before:
[VEC_1] [VEC_2] [JOURNAL_1] [VEC_3] [JOURNAL_2] [MANIFEST_5]
0-999 1000- del:42, 3000- del:[1000, bitmap={42,500,
2999 del:500 4999 2000) 1000..1999}
After:
... [MANIFEST_5] [VEC_sealed] [INDEX_new] [MANIFEST_6]
vectors 0-4999 bitmap={}
MINUS deleted (empty for
compacted range)
```
### 7.2 Compaction Algorithm
```python
def compact_with_deletions(file, seg_ids):
bitmap = load_deletion_bitmap(file)
output, id_remap, next_id = [], {}, 0
for seg_id in sorted(seg_ids):
seg = load_segment(file, seg_id)
if seg.seg_type != VEC_SEG:
continue
for vec_id, vector in seg.all_vectors():
if bitmap_check(bitmap, vec_id):
continue # physically exclude
id_remap[vec_id] = next_id
output.append((next_id, vector))
next_id += 1
append_segment(file, VecSegment(flags=SEALED, vectors=output))
remaps = [RemapIdEntry(old, new) for old, new in id_remap.items() if old != new]
if remaps:
append_segment(file, JournalSegment(entries=remaps))
append_segment(file, build_hnsw_index(output))
for old_id in id_remap:
bitmap.remove(old_id)
manifest = build_manifest(file,
tombstone_seg_ids=seg_ids,
deletion_bitmap=bitmap)
append_segment(file, manifest)
fsync(file)
```
### 7.3 Journal Merging
During compaction, JOURNAL_SEGs covering the compacted range are consumed:
| Entry Type | Materialization |
|------------|-----------------|
| DELETE_VECTOR / DELETE_RANGE | Vectors excluded from output |
| UPDATE_METADATA | Applied to output META_SEG |
| MOVE_VECTOR | Tier assignment applied in new manifest |
| REMAP_ID | Chained: old remap composed with new remap |
Consumed JOURNAL_SEGs are tombstoned alongside compacted VEC_SEGs.
### 7.4 Compaction Invariants
| ID | Invariant |
|----|-----------|
| INV-D1 | After compaction, deletion bitmap is empty for compacted range |
| INV-D2 | Sealed output contains only ACTIVE vectors |
| INV-D3 | REMAP_ID entries journaled for every relocated vector |
| INV-D4 | Compacted input segments tombstoned in new manifest |
| INV-D5 | Sealed segments are never modified |
| INV-D6 | Rebuilt indexes exclude deleted nodes |
## 8. Deletion Consistency
### 8.1 Crash Safety
```
Write path:
1. Append JOURNAL_SEG -> fsync crash here: orphan, invisible
2. Append MANIFEST_SEG -> fsync crash here: partial manifest, fallback
Recovery:
- Crash after step 1: JOURNAL_SEG orphaned. No manifest references it.
Reader sees previous manifest. Deletion NOT visible. Orphan cleaned
up by next compaction.
- Crash during step 2: Partial MANIFEST_SEG has bad checksum. Reader
falls back to previous valid manifest. Deletion NOT visible.
- After step 2 success: Manifest durable. Deletion visible.
```
**Guarantee**: Uncommitted deletions never affect readers. Deletion is
atomic at the manifest fsync boundary.
### 8.2 Manifest Chain Visibility
```
MANIFEST_3: bitmap = {}
| JOURNAL_SEG written (delete vector 42)
MANIFEST_4: bitmap = {42} <-- deletion visible from here
| Compaction runs
MANIFEST_5: bitmap = {} <-- vector 42 physically removed
```
A reader holding MANIFEST_3 continues to see vector 42. A reader opening
after MANIFEST_4 will not. This provides snapshot isolation at manifest
granularity.
### 8.3 Multi-File Mode
In multi-file mode, each shard maintains its own deletion bitmap. The
DELETION_BITMAP TLV record supports two modes:
```
+----------------------------------------------+
| mode: u8 |
| 0x00 = SINGLE (one bitmap, inline) |
| 0x01 = SHARDED (per-shard references) |
+----------------------------------------------+
SINGLE (0x00):
| roaring_bitmap: [u8; ...] |
SHARDED (0x01):
| shard_count: u16 |
| For each shard: |
| shard_id: u16 |
| bitmap_offset: u64 (in shard file) |
| bitmap_length: u32 |
| bitmap_hash: hash128 |
+----------------------------------------------+
```
Queries spanning shards load per-shard bitmaps and check each candidate
against its shard's bitmap.
### 8.4 Concurrent Access
One writer at a time (file-level advisory lock). Multiple readers are safe
due to append-only architecture. A reader that opened before a deletion
sees the pre-deletion snapshot until it re-reads the manifest.
## 9. Space Reclamation
| Trigger | Threshold | Action |
|---------|-----------|--------|
| Deletion ratio | > 20% of vectors deleted | Schedule compaction |
| Bitmap size | > 1 MB | Schedule compaction |
| Segment count | > 64 mutable segments | Schedule compaction |
| Manual | User-initiated | Compact immediately |
Space accounting derived from the manifest:
```
total_vector_count: 10,000,000 (Level 0 root manifest)
deleted_vector_count: 150,000 (bitmap cardinality)
active_vector_count: 9,850,000 (total - deleted)
deletion_ratio: 1.5% (below threshold)
wasted_bytes: ~115 MB (150K * 768 B per fp16-384 vector)
```
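The accounting above reduces to simple arithmetic over manifest counters. A sketch (the helper is illustrative, not a spec API):

```python
def space_accounting(total_vectors, deleted, bytes_per_vector):
    """Derive compaction-trigger inputs from the manifest counters."""
    active = total_vectors - deleted
    ratio = deleted / total_vectors           # compared to 20% threshold
    wasted = deleted * bytes_per_vector       # bytes held by soft-deleted vectors
    return active, ratio, wasted

# 10M fp16-384 vectors (768 B each), 150K soft-deleted:
active, ratio, wasted = space_accounting(10_000_000, 150_000, 768)
```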
## 10. Summary
### Deletion Protocol
| Step | Action | Durability |
|------|--------|------------|
| 1 | Append JOURNAL_SEG with DELETE entries | fsync (orphan-safe) |
| 2 | Update roaring deletion bitmap | In-memory |
| 3 | Append MANIFEST_SEG with new bitmap | fsync (deletion visible) |
| 4 | Compaction excludes deleted vectors | fsync (physical removal) |
| 5 | File rewrite reclaims space | fsync (space freed) |
### New Wire Format Elements
| Element | Type / Tag | Section |
|---------|------------|---------|
| JOURNAL_SEG | Segment type 0x04 | 3 |
| DELETE_VECTOR | Journal entry 0x01 | 3.4 |
| DELETE_RANGE | Journal entry 0x02 | 3.4 |
| UPDATE_METADATA | Journal entry 0x03 | 3.4 |
| MOVE_VECTOR | Journal entry 0x04 | 3.4 |
| REMAP_ID | Journal entry 0x05 | 3.4 |
| DELETION_BITMAP | Level 1 TLV 0x000E | 4 |
### Invariants
| ID | Invariant |
|----|-----------|
| INV-D1 | After compaction, deletion bitmap is empty for compacted range |
| INV-D2 | Sealed output segments contain only ACTIVE vectors |
| INV-D3 | ID remappings journaled for every compaction-relocated vector |
| INV-D4 | Compacted input segments tombstoned in new manifest |
| INV-D5 | Sealed segments are never modified |
| INV-D6 | Rebuilt indexes exclude deleted nodes |
| INV-D7 | Uncommitted deletions never affect readers (crash safety) |
| INV-D8 | Deletion visibility is atomic at the manifest fsync boundary |


# RVF Filtered Search
## 1. Motivation
Domain profiles declare metadata schemas with indexed fields (e.g., `"organism"` in
RVDNA, `"language"` in RVText, `"node_type"` in RVGraph), but the format provides no
specification for how those indexes are built, stored, or evaluated at query time.
Filtered search is the combination of vector similarity search with metadata
predicates. Without it, a caller must retrieve an over-sized result set and filter
client-side — wasting bandwidth, latency, and recall budget.
This specification adds:
1. **META_SEG** payload layout (segment type 0x07) for storing per-vector metadata
2. **Filter expression language** with a compact binary encoding
3. **Three evaluation strategies** (pre-, post-, and intra-filtering)
4. **METAIDX_SEG** (new segment type 0x0D) for inverted and bitmap indexes
5. **Manifest integration** via a new Level 1 TLV record
6. **Temperature tier coordination** for metadata segments
## 2. META_SEG Payload Layout (Segment Type 0x07)
META_SEG stores the actual metadata values associated with vectors. It uses the
standard 64-byte segment header (see `binary-layout.md` Section 3) with
`seg_type = 0x07`.
```
META_SEG Payload:
+------------------------------------------+
| Meta Header (64 bytes, padded) |
| schema_id: u32 | References PROFILE_SEG schema
| vector_id_range_start: u64 | First vector ID covered
| vector_id_range_end: u64 | Last vector ID covered (inclusive)
| field_count: u16 | Number of fields in this segment
| encoding: u8 | 0 = row-oriented, 1 = column-oriented
|   reserved: [u8; 41]                     | Must be zero (pads header to 64 B)
| [64B aligned] |
+------------------------------------------+
| Field Directory |
| For each field (field_count entries): |
| field_id: u16 |
| field_type: u8 |
| flags: u8 |
| field_offset: u32 | Byte offset from payload start
| [64B aligned] |
+------------------------------------------+
| Field Data (column-oriented) |
| (see Section 2.1 for per-type layout) |
+------------------------------------------+
```
### Field Type Enum
```
Value Type Wire Size Description
----- ---- --------- -----------
0x00 string Variable UTF-8, dictionary-encoded in column layout
0x01 u32 4 bytes Unsigned 32-bit integer
0x02 u64 8 bytes Unsigned 64-bit integer
0x03 f32 4 bytes IEEE 754 single-precision float
0x04 enum Variable (packed) Enumeration with defined label set
0x05 bool 1 bit (packed) Boolean
```
### Field Flags
```
Bit Mask Name Meaning
--- ---- ---- -------
0 0x01 INDEXED Field has a corresponding METAIDX_SEG
1 0x02 SORTED Values are stored in sorted order
2 0x04 NULLABLE Null bitmap present before values
3 0x08 STORED Field value returned in query results (not just filterable)
4-7 reserved Must be zero
```
### 2.1 Column-Oriented Field Layouts
Column-oriented encoding (encoding = 1) is the preferred layout. Each field's data
block starts at a 64-byte aligned boundary.
**String fields** (dictionary-encoded):
```
dict_size: u32 Number of distinct strings
For each dict entry:
length: u16 Byte length of UTF-8 string
bytes: [u8; length] UTF-8 encoded string
[4B aligned after dictionary]
codes: [varint; vector_count] Dictionary code per vector
[64B aligned]
```
Dictionary codes are 0-indexed into the dictionary array. Code `0xFFFFFFFF` (max
varint value for u32 range) represents null if the NULLABLE flag is set.
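The dictionary step can be sketched as follows; varint packing of the codes is omitted so the logical encoding stays visible (the function name is illustrative):

```python
def dict_encode(values):
    """Dictionary-encode a string column: distinct strings plus one
    0-indexed code per vector, as in the layout above."""
    dictionary = sorted(set(values))              # distinct strings
    index = {s: i for i, s in enumerate(dictionary)}
    codes = [index[v] for v in values]            # one code per vector
    return dictionary, codes
```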
**Numeric fields** (u32, u64, f32 -- direct array):
```
If NULLABLE:
null_bitmap: [u8; ceil(vector_count / 8)] Bit-packed, 1 = present, 0 = null
[8B aligned]
values: [field_type; vector_count] Dense array of values
[64B aligned]
```
Values for null entries are zero-filled but must not be relied upon.
**Enum fields** (bit-packed):
```
enum_count: u8 Number of enum labels
For each enum label:
length: u8 Byte length of label
bytes: [u8; length] UTF-8 label string
bits_per_code: u8 ceil(log2(enum_count))
codes: packed bit array bits_per_code bits per vector
[ceil(vector_count * bits_per_code / 8) bytes]
[64B aligned]
```
For example, an enum with 3 values (`"+", "-", "."`) uses 2 bits per vector.
1M vectors = 250 KB.
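The packing rule `bits_per_code = ceil(log2(enum_count))` can be sketched as below. LSB-first bit order within bytes is an assumption here; the spec does not state the bit order:

```python
import math

def pack_enum_codes(codes, enum_count):
    """Bit-pack enum codes at ceil(log2(enum_count)) bits each (LSB-first)."""
    bits = max(1, math.ceil(math.log2(enum_count)))
    out = bytearray((len(codes) * bits + 7) // 8)
    for i, code in enumerate(codes):
        pos = i * bits                       # bit position of this code
        for b in range(bits):
            if code & (1 << b):
                out[(pos + b) >> 3] |= 1 << ((pos + b) & 7)
    return bits, bytes(out)
```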
**Bool fields** (bit-packed):
```
If NULLABLE:
null_bitmap: [u8; ceil(vector_count / 8)]
[8B aligned]
values: [u8; ceil(vector_count / 8)] Bit-packed, 1 = true, 0 = false
[64B aligned]
```
### 2.2 Sorted Index (Inline)
For fields with the SORTED flag, an additional sorted permutation index follows
the field data:
```
sorted_count: u32 Must equal vector_count
sorted_order: [varint delta-encoded] Vector IDs in ascending value order
restart_interval: u16 Restart every N entries (default 128)
restart_offsets: [u32; ceil(sorted_count / restart_interval)]
[64B aligned]
```
This enables binary search over field values for range queries without requiring
a separate METAIDX_SEG. It is suitable for fields where a full inverted index
would be wasteful (high cardinality numeric fields like `position_start`).
## 3. Filter Expression Language
### 3.1 Abstract Syntax
A filter expression is a tree of predicates combined with boolean logic:
```
expr ::= field_ref CMP literal -- comparison
| field_ref IN literal_set -- set membership
| field_ref PREFIX string_lit -- string prefix match
| field_ref CONTAINS string_lit -- substring containment
| expr AND expr -- conjunction
| expr OR expr -- disjunction
| NOT expr -- negation
```
### 3.2 Binary Encoding (Postfix / RPN)
Filter expressions are encoded as a postfix (Reverse Polish Notation) token stream
for stack-based evaluation. This avoids the need for recursive parsing and enables
single-pass evaluation with a fixed-size stack.
```
Filter Expression Binary Layout:
header:
node_count: u16 Total number of tokens
stack_depth: u8 Maximum stack depth required
reserved: u8 Must be zero
tokens (postfix order):
For each token:
node_type: u8 Token type (see enum below)
payload: type-specific Variable-size payload
```
### Token Type Enum
```
Value Name Stack Effect Payload
----- ---- ------------ -------
0x01 FIELD_REF push +1 field_id: u16
0x02 LIT_U32 push +1 value: u32
0x03 LIT_U64 push +1 value: u64
0x04 LIT_F32 push +1 value: f32
0x05 LIT_STR push +1 length: u16, bytes: [u8; length]
0x06 LIT_BOOL push +1 value: u8 (0 or 1)
0x07 LIT_NULL push +1 (no payload)
0x10 CMP_EQ pop 2, push 1 (no payload) -- a == b
0x11 CMP_NE pop 2, push 1 (no payload) -- a != b
0x12 CMP_LT pop 2, push 1 (no payload) -- a < b
0x13 CMP_LE pop 2, push 1 (no payload) -- a <= b
0x14 CMP_GT pop 2, push 1 (no payload) -- a > b
0x15 CMP_GE pop 2, push 1 (no payload) -- a >= b
0x20 IN_SET pop 1, push 1 set_size: u16, [encoded values]
0x21 PREFIX pop 2, push 1 (no payload) -- string prefix
0x22 CONTAINS pop 2, push 1 (no payload) -- substring match
0x30 AND pop 2, push 1 (no payload)
0x31 OR pop 2, push 1 (no payload)
0x32 NOT pop 1, push 1 (no payload)
```
### 3.3 Encoding Example
Filter: `organism = "E. coli" AND position_start >= 1000`
```
Token 0: FIELD_REF field_id=0 (organism) stack: [organism_val]
Token 1: LIT_STR "E. coli" stack: [organism_val, "E. coli"]
Token 2: CMP_EQ stack: [true/false]
Token 3: FIELD_REF field_id=3 (position_start) stack: [bool, pos_val]
Token 4: LIT_U64 1000 stack: [bool, pos_val, 1000]
Token 5: CMP_GE stack: [bool, true/false]
Token 6: AND stack: [result]
Binary: node_count=7, stack_depth=3
01 00:00 05 00:07 "E. coli" 10 01 00:03 03 00:00:00:00:00:00:03:E8 15 30
```
### 3.4 Evaluation
Evaluation processes tokens left to right using a fixed-size boolean/value stack:
```python
def evaluate(tokens, vector_id, metadata):
stack = []
for token in tokens:
if token.type == FIELD_REF:
stack.append(metadata.get_value(vector_id, token.field_id))
elif token.type in (LIT_U32, LIT_U64, LIT_F32, LIT_STR, LIT_BOOL, LIT_NULL):
stack.append(token.value)
elif token.type in (CMP_EQ, CMP_NE, CMP_LT, CMP_LE, CMP_GT, CMP_GE):
b, a = stack.pop(), stack.pop()
stack.append(compare(a, token.type, b))
elif token.type == IN_SET:
a = stack.pop()
stack.append(a in token.value_set)
elif token.type in (PREFIX, CONTAINS):
b, a = stack.pop(), stack.pop()
stack.append(string_match(a, token.type, b))
elif token.type == AND:
b, a = stack.pop(), stack.pop()
stack.append(a and b)
elif token.type == OR:
b, a = stack.pop(), stack.pop()
stack.append(a or b)
elif token.type == NOT:
stack.append(not stack.pop())
return stack[0]
```
Maximum stack depth is declared in the header so the evaluator can pre-allocate.
Implementations must reject expressions with `stack_depth > 16`.
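The evaluator can be exercised end-to-end with the example filter from Section 3.3. Token tags and the field lookup below are simplified stand-ins for the binary wire encoding:

```python
FIELD_REF, LIT, CMP_EQ, CMP_GE, AND = "field", "lit", "eq", "ge", "and"

def evaluate(tokens, fields):
    """Single-pass postfix evaluation with an explicit stack."""
    stack = []
    for kind, arg in tokens:
        if kind == FIELD_REF:
            stack.append(fields[arg])
        elif kind == LIT:
            stack.append(arg)
        elif kind == CMP_EQ:
            b, a = stack.pop(), stack.pop()
            stack.append(a == b)
        elif kind == CMP_GE:
            b, a = stack.pop(), stack.pop()
            stack.append(a >= b)
        elif kind == AND:
            b, a = stack.pop(), stack.pop()
            stack.append(a and b)
    return stack[0]

# organism = "E. coli" AND position_start >= 1000 (Section 3.3, postfix order)
expr = [(FIELD_REF, "organism"), (LIT, "E. coli"), (CMP_EQ, None),
        (FIELD_REF, "position_start"), (LIT, 1000), (CMP_GE, None),
        (AND, None)]
```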
## 4. Filter Evaluation Strategies
The runtime selects one of three strategies based on the estimated **selectivity**
of the filter (the fraction of vectors passing the filter).
### 4.1 Pre-Filtering (Selectivity < 1%)
Build the candidate ID set from metadata indexes first, then run vector search
only on the filtered subset.
```
1. Evaluate filter using METAIDX_SEG inverted/bitmap indexes
2. Collect matching vector IDs into a candidate set C
3. If |C| < ef_search:
Flat scan all candidates, return top-K
Else:
Build temporary flat index over C, run HNSW search restricted to C
4. Return top-K results
```
**Tradeoffs**:
- Optimal when the candidate set is very small (hundreds to low thousands)
- Risk: if the candidate set is disconnected in the HNSW graph, search cannot
traverse from entry points to candidates. The flat scan fallback handles this.
- Memory: candidate set bitmap = `ceil(total_vectors / 8)` bytes
### 4.2 Post-Filtering (Selectivity > 20%)
Run standard HNSW search with over-retrieval, then filter results.
```
1. Compute over_retrieval_factor = min(1.0 / selectivity, 10.0)
2. Set ef_search_adj = ef_search * over_retrieval_factor
3. Run standard HNSW search with ef_search_adj
4. Filter result set by evaluating filter expression per candidate
5. Return top-K from filtered results
```
**Tradeoffs**:
- Optimal when the filter passes most vectors (minimal wasted computation)
- Risk: if over-retrieval factor is too low, fewer than K results survive filtering.
The caller should retry with a higher factor or fall back to intra-filtering.
- No modification to HNSW traversal logic required.
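The over-retrieval computation in steps 1-2 as a helper (illustrative name, not a spec API):

```python
def adjusted_ef(ef_search, selectivity):
    """Post-filtering: widen ef_search by inverse selectivity, capped at 10x."""
    factor = min(1.0 / selectivity, 10.0)
    return int(ef_search * factor)
```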
### 4.3 Intra-Filtering (1% <= Selectivity <= 20%)
Evaluate the filter during HNSW traversal, skipping nodes that fail the predicate.
```python
def filtered_hnsw_search(query, filter_expr, entry_point, ef_search, k):
    candidates = MaxHeap()  # best ef_search candidates (max-heap by distance)
    worklist = MinHeap()    # exploration frontier (min-heap by distance)
visited = BitSet()
filtered_skips = 0
max_skips = ef_search * 3 # backoff threshold
worklist.push((distance(query, entry_point), entry_point))
visited.add(entry_point)
while worklist and filtered_skips < max_skips:
dist, node = worklist.pop()
# Check filter predicate
if not evaluate(filter_expr, node, metadata):
filtered_skips += 1
# Still expand neighbors (maintain graph connectivity)
neighbors = get_neighbors(node)
for n in neighbors:
if n not in visited:
visited.add(n)
d = distance(query, get_vector(n))
worklist.push((d, n))
continue
filtered_skips = 0 # reset skip counter on successful match
candidates.push((dist, node))
        if len(candidates) > ef_search:
            candidates.pop()  # evict worst (keep the best ef_search)
# Expand neighbors
neighbors = get_neighbors(node)
for n in neighbors:
if n not in visited:
visited.add(n)
d = distance(query, get_vector(n))
if len(candidates) < ef_search or d < candidates.max():
worklist.push((d, n))
return candidates.top_k(k)
```
**Key design decisions**:
1. **Skipped nodes still expand neighbors**: This preserves graph connectivity.
A node that fails the filter may have neighbors that pass it.
2. **Skip counter with backoff**: If too many consecutive nodes fail the filter,
the search is exhausting the local neighborhood without finding matches. The
`max_skips` threshold triggers termination to avoid unbounded traversal.
3. **Adaptive ef expansion**: When `filtered_skips > ef_search`, the effective
search frontier is larger than requested, compensating for filtered-out nodes.
### 4.4 Strategy Selection
```
selectivity = estimate_selectivity(filter_expr, metaidx_stats)
if selectivity < 0.01:
strategy = PRE_FILTER
elif selectivity > 0.20:
strategy = POST_FILTER
else:
strategy = INTRA_FILTER
```
Selectivity estimation uses statistics stored in the METAIDX_SEG header:
- **Inverted index**: `posting_list_length / total_vectors` per term
- **Bitmap index**: `popcount(bitmap) / total_vectors` per enum value
- **Range tree**: count of values in range / total_vectors
For compound filters (AND/OR), selectivity is estimated using independence
assumption: `P(A AND B) = P(A) * P(B)`, `P(A OR B) = P(A) + P(B) - P(A) * P(B)`.
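The composition rules and the thresholds from Section 4.4 together give the selector; a sketch under the stated independence assumption:

```python
def and_selectivity(p_a, p_b):
    return p_a * p_b                      # independence assumption

def or_selectivity(p_a, p_b):
    return p_a + p_b - p_a * p_b          # inclusion-exclusion

def select_strategy(selectivity):
    """Thresholds from Section 4.4."""
    if selectivity < 0.01:
        return "PRE_FILTER"
    if selectivity > 0.20:
        return "POST_FILTER"
    return "INTRA_FILTER"
```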
## 5. METAIDX_SEG (Segment Type 0x0D)
METAIDX_SEG stores secondary indexes over metadata fields for fast predicate
evaluation. Each METAIDX_SEG covers one field. The segment type enum value 0x0D
is allocated from the reserved range (see `binary-layout.md` Section 3).
```
METAIDX_SEG Payload:
+------------------------------------------+
| Index Header (64 bytes, padded) |
| field_id: u16 | Field being indexed
| index_type: u8 | 0=inverted, 1=range_tree, 2=bitmap
| field_type: u8 | Mirrors META_SEG field_type
| total_vectors: u64 | Vectors covered by this index
| unique_values: u64 | Cardinality (distinct values)
|   reserved: [u8; 44]                     | Pads header to 64 bytes
| [64B aligned] |
+------------------------------------------+
| Index Data (type-specific) |
+------------------------------------------+
```
### 5.1 Inverted Index (index_type = 0)
Best for: string fields with moderate cardinality (100 to 100K distinct values).
```
term_count: u32
For each term (sorted by encoded value):
term_length: u16
term_bytes: [u8; term_length] Encoded value (UTF-8 for strings)
posting_length: u32 Number of vector IDs
postings: [varint delta-encoded] Sorted vector IDs
[8B aligned after postings]
[64B aligned]
```
Posting lists use varint delta encoding identical to the ID encoding in VEC_SEG
(see `binary-layout.md` Section 5). Restart points every 128 entries enable
binary search within a posting list for intersection operations.
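Varint delta encoding of a sorted posting list can be sketched as below. This is a minimal LEB128-style illustration; the exact on-disk encoder and restart-point layout are defined in `binary-layout.md` Section 5, not here.

```
def encode_postings(ids):
    """Delta-encode sorted vector IDs as 7-bit varints (continuation bit 0x80)."""
    out, prev = bytearray(), 0
    for vid in ids:
        delta = vid - prev
        prev = vid
        while True:
            byte = delta & 0x7F
            delta >>= 7
            if delta:
                out.append(byte | 0x80)  # more bytes follow
            else:
                out.append(byte)
                break
    return bytes(out)

def decode_postings(buf):
    """Decode varint deltas back into absolute, sorted vector IDs."""
    ids, cur, acc, shift = [], 0, 0, 0
    for byte in buf:
        acc |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7
        else:
            cur += acc               # deltas accumulate into absolute IDs
            ids.append(cur)
            acc, shift = 0, 0
    return ids
```

Small deltas between adjacent IDs compress to one byte each, which is why the format requires postings to be sorted.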
### 5.2 Range Tree (index_type = 1)
Best for: numeric fields requiring range queries (u32, u64, f32).
```
page_size: u32 Fixed 4096 bytes (4 KB, one disk page)
page_count: u32
root_page: u32 Page index of B+ tree root
tree_height: u8
reserved: [u8; 47]
[64B aligned]
Internal Page (4096 bytes):
page_type: u8 (0 = internal)
key_count: u16
keys: [field_type; key_count] Separator keys
children: [u32; key_count + 1] Child page indices
[zero-padded to 4096]
Leaf Page (4096 bytes):
page_type: u8 (1 = leaf)
entry_count: u16
prev_leaf: u32 Linked-list pointer for range scan
next_leaf: u32
entries:
For each entry:
value: field_type The metadata value
vector_id: u64 Associated vector ID
[zero-padded to 4096]
```
Leaf pages form a doubly-linked list for efficient range scans. A range query
`position_start >= 1000 AND position_start <= 5000` descends the tree to find
the first leaf with value >= 1000, then scans forward until value > 5000.
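The descend-then-scan behavior can be sketched with in-memory stand-ins for leaf pages. This is illustrative only: real pages are 4096-byte structures, and the tree descent is replaced here by a linear search for the first qualifying leaf.

```
def range_scan(leaves, lo, hi):
    """Scan a linked list of leaf pages for values in [lo, hi].

    `leaves` is a list of dicts {"entries": [(value, vector_id), ...],
    "next": leaf_index_or_None} standing in for on-disk leaf pages.
    """
    results = []
    # Descend step (simplified): first leaf whose last value reaches lo.
    page = next((i for i, l in enumerate(leaves)
                 if l["entries"] and l["entries"][-1][0] >= lo), None)
    while page is not None:
        for value, vector_id in leaves[page]["entries"]:
            if value > hi:
                return results          # past the range: stop the scan
            if value >= lo:
                results.append(vector_id)
        page = leaves[page]["next"]     # follow the next_leaf pointer
    return results
```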
### 5.3 Bitmap Index (index_type = 2)
Best for: enum and bool fields with low cardinality (< 64 distinct values).
```
value_count: u8 Number of distinct enum/bool values
For each value:
value_label_len: u8
value_label: [u8; value_label_len] The enum label or "true"/"false"
bitmap_format: u8 0 = raw, 1 = roaring
bitmap_length: u32 Byte length of bitmap data
bitmap_data: [u8; bitmap_length] Bitmap of matching vector IDs
[8B aligned]
[64B aligned]
```
**Raw bitmaps** are used when `total_vectors < 8192` (1 KB per bitmap).
**Roaring bitmaps** are used for larger datasets. The roaring format stores
the bitmap as a set of containers (array, bitmap, or run-length) per 64K chunk.
This matches the industry-standard Roaring bitmap serialization (compatible with
CRoaring / roaring-rs wire format).
Bitmap intersection and union operations map directly to AND/OR filter predicates
using SIMD bitwise operations. For 10M vectors:
```
Raw bitmap: ~1.2 MB per value (impractical for many values)
Roaring bitmap: 100 KB - 1 MB per value depending on density
AND/OR: ~0.1 ms per operation (AVX-512 on 1 MB bitmap)
```
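The raw-format cutoff and the AND/OR mapping follow directly from the sizes above. A minimal sketch, using Python big integers where a native implementation would use SIMD registers over bitmap words:

```
def raw_bitmap_bytes(total_vectors):
    # One bit per vector, rounded up to whole bytes.
    return (total_vectors + 7) // 8

assert raw_bitmap_bytes(8192) == 1024               # the 1 KB raw cutoff
assert raw_bitmap_bytes(10_000_000) == 1_250_000    # ~1.2 MB per value at 10M

# AND/OR filter predicates map to bitwise ops on the bitmaps.
chr1  = (1 << 5) | (1 << 42)   # vectors 5 and 42 match chromosome = "chr1"
human = (1 << 5) | (1 << 7)    # vectors 5 and 7 match organism = "human"
assert chr1 & human == (1 << 5)          # AND: only vector 5
assert bin(chr1 | human).count("1") == 3  # OR: vectors 5, 7, 42
```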
## 6. Level 1 Manifest Addition
### Tag 0x000F: METADATA_INDEX_DIR
A new TLV record in the Level 1 manifest (see `02-manifest-system.md` Section 3)
that maps indexed metadata fields to their METAIDX_SEG segment IDs.
```
Tag: 0x000F
Name: METADATA_INDEX_DIR
Payload:
entry_count: u16
For each entry:
field_id: u16 Matches META_SEG field_id
field_name_len: u8
field_name: [u8; field_name_len] UTF-8 field name for debugging
index_seg_id: u64 Segment ID of METAIDX_SEG
index_type: u8 0=inverted, 1=range_tree, 2=bitmap
stats:
total_vectors: u64
unique_values: u64
min_posting_len: u32 Smallest posting list size
max_posting_len: u32 Largest posting list size
```
This allows the query planner to estimate selectivity without reading the
METAIDX_SEG segments themselves. The `min_posting_len` and `max_posting_len`
fields provide bounds for cardinality estimation.
### Updated Record Types Table
```
Tag Name Description
--- ---- -----------
0x0001 SEGMENT_DIR Array of segment directory entries
0x0002 TEMP_TIER_MAP Temperature tier assignments per block
...
0x000D KEY_DIRECTORY Encryption key references
0x000E (reserved)
0x000F METADATA_INDEX_DIR Metadata field -> METAIDX_SEG mapping
```
## 7. Performance Analysis
### 7.1 Filter Strategy vs Selectivity vs Recall
| Selectivity | Strategy | Recall@10 | Latency (10M vectors) | Notes |
|-------------|----------|-----------|----------------------|-------|
| 0.001% (100 matches) | Pre-filter | 1.00 | 0.02 ms | Flat scan on 100 candidates |
| 0.01% (1K matches) | Pre-filter | 0.99 | 0.08 ms | Flat scan on 1K candidates |
| 0.1% (10K matches) | Pre-filter | 0.98 | 0.5 ms | Mini-HNSW on 10K candidates |
| 1% (100K matches) | Intra-filter | 0.96 | 0.12 ms | ~10% node skip overhead |
| 5% (500K matches) | Intra-filter | 0.95 | 0.08 ms | ~5% node skip overhead |
| 10% (1M matches) | Intra-filter | 0.94 | 0.06 ms | Minimal skip overhead |
| 20% (2M matches) | Post-filter | 0.95 | 0.10 ms | 5x over-retrieval |
| 50% (5M matches) | Post-filter | 0.97 | 0.06 ms | 2x over-retrieval |
| 100% (no filter) | None | 0.98 | 0.04 ms | Baseline unfiltered |
### 7.2 Memory Overhead of Metadata Indexes
For 10M vectors with the RVDNA profile (5 indexed fields):
| Field | Type | Cardinality | Index Type | Size |
|-------|------|-------------|------------|------|
| organism | string | ~50K | Inverted | ~80 MB |
| gene_id | string | ~500K | Inverted | ~120 MB |
| chromosome | string | ~25 | Bitmap (roaring) | ~12 MB |
| position_start | u64 | ~10M | Range tree | ~160 MB |
| position_end | u64 | ~10M | Range tree | ~160 MB |
| **Total** | | | | **~532 MB** |
As a fraction of vector data (10M * 384 dim * fp16 ≈ 7.2 GiB): **~7.4% overhead**.
For the RVText profile (2 indexed fields, typically lower cardinality):
| Field | Type | Cardinality | Index Type | Size |
|-------|------|-------------|------------|------|
| source_url | string | ~100K | Inverted | ~90 MB |
| language | string | ~50 | Bitmap (roaring) | ~8 MB |
| **Total** | | | | **~98 MB** |
Overhead: **~1.4%** of vector data.
### 7.3 Query Latency Breakdown (Filtered Intra-Search)
```
Phase Time Notes
----- ---- -----
Parse filter expression 0.5 us Stack-based, no allocation
Estimate selectivity 1.0 us Read manifest stats
Load METAIDX_SEG (if cold) 50-200 us First query only; cached after
HNSW traversal (150 steps) 45 us Baseline unfiltered
+ filter eval per node +12 us ~80 ns per eval * 150 nodes
+ skip expansion +8 us ~20% more nodes visited at 5% sel.
Top-K collection 10 us Heap operations
--------
Total (warm cache) ~76 us
Total (cold start) ~276 us
```
## 8. Integration with Temperature Tiering
Metadata follows the same temperature model as vector data (see
`03-temperature-tiering.md`), but with its own tier assignments.
### 8.1 Hot Metadata
Indexed fields for hot-tier vectors are kept resident in memory:
- **Bitmap indexes** for low-cardinality fields (enum, bool) are always hot.
Total size is bounded: `cardinality * ceil(hot_vectors / 8)` bytes. For 100K
hot vectors and 25 enum values: 25 * 12.5 KB = 312 KB.
- **Inverted index posting lists** are cached using an LRU policy keyed by
(field_id, term). Frequently queried terms (e.g., `language = "en"`) remain
resident.
- **Range tree pages** follow the standard B+ tree buffer pool model. Hot pages
(root + first two levels) are pinned. Leaf pages are demand-paged.
### 8.2 Cold Metadata
Cold metadata covers vectors that are rarely accessed:
- META_SEG data for cold vectors is compressed with ZSTD (level 9+) and stored
in cold-tier segments.
- METAIDX_SEG posting lists for cold vectors are not loaded until a query
specifically requests them.
- When a filter matches only cold vectors (detected via the temperature tier
map), the runtime issues a warning: filtered search on cold data may require
decompression latency of 10-100 ms.
### 8.3 Compaction Coordination
When temperature-aware compaction reorganizes vector segments (see
`03-temperature-tiering.md` Section 4), metadata must follow:
```
1. Identify vectors moving between tiers
2. Rewrite META_SEG for affected vector ID ranges
3. Rebuild METAIDX_SEG posting lists (vector IDs may be renumbered during
compaction if the COMPACTION_RENUMBER flag is set)
4. Update METADATA_INDEX_DIR in the new manifest
5. Tombstone old META_SEG and METAIDX_SEG segments
```
Metadata compaction piggybacks on vector compaction -- it never triggers
independently. This ensures metadata and vector segments remain in consistent
temperature tiers.
### 8.4 Metadata-Aware Promotion
When a filter query frequently accesses metadata for warm-tier vectors, those
metadata segments are candidates for promotion to hot tier. The access sketch
(SKETCH_SEG) tracks metadata segment accesses alongside vector accesses:
```
sketch_key = (META_SEG_ID << 32) | block_id
```
This reuses the existing sketch infrastructure without modification.
## 9. Wire Protocol: Filtered Query Message
For completeness, the filter expression is carried in the query message as a
tagged field. The query wire format is outside the scope of the storage spec,
but the filter payload is defined here for interoperability.
```
Query Message Filter Field:
tag: u16 (0x0040 = FILTER)
length: u32
filter_version: u8 (1)
filter_payload: [u8; length - 1] Binary filter expression (Section 3.2)
```
Implementations that do not support filtered search must ignore tag 0x0040 and
return unfiltered results. This preserves backward compatibility.
## 10. Implementation Notes
### 10.1 Index Selection Heuristics
When building indexes for a new META_SEG field, implementations should select
the index type automatically:
```
if field_type in (enum, bool) and cardinality < 64:
index_type = BITMAP
elif field_type in (u32, u64, f32):
index_type = RANGE_TREE
else:
index_type = INVERTED
```
Fields without the `"indexed": true` property in the profile schema must not
have METAIDX_SEG segments built. They are stored in META_SEG for retrieval
only (the STORED flag).
### 10.2 Posting List Intersection
For AND filters on multiple indexed fields, posting list intersection is
performed using a merge-based algorithm on sorted, delta-decoded posting lists:
```
Sorted Intersection (two-pointer merge):
Time: O(min(|A|, |B|)) with skip-ahead via restart points
Practical: ~100 ns per 1000 common elements (SIMD comparison)
```
For OR filters, posting list union uses a similar merge with deduplication.
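Both merges can be sketched as follows; the skip-ahead via restart points and the SIMD comparison are omitted from this illustration.

```
def intersect(a, b):
    """Two-pointer merge intersection of two sorted posting lists."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i]); i += 1; j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

def union(a, b):
    """Merge union with deduplication, for OR filters."""
    out, i, j = [], 0, 0
    while i < len(a) or j < len(b):
        if j == len(b) or (i < len(a) and a[i] < b[j]):
            out.append(a[i]); i += 1
        elif i == len(a) or b[j] < a[i]:
            out.append(b[j]); j += 1
        else:                       # equal: emit once, advance both
            out.append(a[i]); i += 1; j += 1
    return out
```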
### 10.3 Null Handling
- `FIELD_REF` for a null value pushes a sentinel NULL onto the stack
- `CMP_EQ NULL` returns true only for null values
- `CMP_NE NULL` returns true for all non-null values
- All other comparisons against NULL return false (SQL-style three-valued logic)
- `IN_SET` never matches NULL unless NULL is explicitly in the set
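The three-valued rules above can be captured in a small evaluator sketch. The function names and the `NULL` sentinel object are hypothetical; the real evaluator operates on the stack machine of Section 3.2.

```
NULL = object()   # sentinel pushed by FIELD_REF for a missing value

def cmp_eq(field, operand):
    if operand is NULL:
        return field is NULL        # CMP_EQ NULL matches only nulls
    if field is NULL:
        return False                # any other comparison vs NULL is false
    return field == operand

def cmp_ne(field, operand):
    if operand is NULL:
        return field is not NULL    # CMP_NE NULL matches all non-nulls
    if field is NULL:
        return False
    return field != operand

def in_set(field, members):
    if field is NULL:
        return NULL in members      # only if NULL is explicitly in the set
    return field in members
```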
# RVF Concurrency, Versioning, and Space Reclamation
## 1. Single-Writer / Multi-Reader Model
RVF uses a **single-writer, multi-reader** concurrency model. At most one process
may append segments to an RVF file at any time. Any number of readers may operate
concurrently with each other and with the writer. This model is enforced by an
advisory lock file, not by OS-level mandatory locking.
| Concern | Advisory Lock | Mandatory Lock (flock/fcntl) |
|---------|---------------|------------------------------|
| NFS compatibility | Works (lock file is a regular file) | Broken on many NFS configs |
| Crash recovery | Stale lock detectable by PID check | Kernel auto-releases, but only locally |
| Cross-language | Any language can create a file | Requires OS-specific syscalls |
| Visibility | Lock state inspectable by humans | Opaque kernel state |
| Multi-file mode | One lock covers all shards | Would need per-shard locks |
## 2. Writer Lock File
The writer lock is a file named `<basename>.rvf.lock` in the same directory as the
RVF file. For example, `data.rvf` uses `data.rvf.lock`.
### Binary Layout
```
Offset Size Field Description
------ ---- ----- -----------
0x00 4 magic 0x52564C46 ("RVLF" in ASCII)
0x04 4 pid Writer process ID (u32)
0x08 64 hostname Null-terminated hostname (max 63 chars + null)
0x48 8 timestamp_ns Lock acquisition time (nanosecond UNIX timestamp)
0x50 16 writer_id Random UUID (128-bit, written as raw bytes)
0x60 4 lock_version Lock protocol version (currently 1)
0x64 4 checksum CRC32C of bytes 0x00-0x63
```
**Total**: 104 bytes.
### Lock Acquisition Protocol
```
1. Construct lock file content (magic, PID, hostname, timestamp, random UUID)
2. Compute CRC32C over bytes 0x00-0x63, store at 0x64
3. Attempt open("<basename>.rvf.lock", O_CREAT | O_EXCL | O_WRONLY)
4. If open succeeds:
a. Write 104 bytes
b. fsync
c. Lock acquired — proceed with writes
5. If open fails (EEXIST):
a. Read existing lock file
b. Validate magic and checksum
c. If invalid: delete stale lock, retry from step 3
d. If valid: run stale lock detection (see below)
e. If stale: delete lock, retry from step 3
f. If not stale: lock acquisition fails — another writer is active
```
The `O_CREAT | O_EXCL` combination is atomic on POSIX filesystems, preventing
two processes from simultaneously creating the lock.
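Steps 3-4 can be sketched on a POSIX platform as below. The EEXIST branch (step 5, stale-lock handling) is omitted from this sketch, and `try_acquire_lock` is a hypothetical helper name.

```
import os

def try_acquire_lock(lock_path, content):
    """Attempt atomic lock-file creation; returns True on acquisition.

    `content` is the pre-built 104-byte lock record (magic, PID,
    hostname, timestamp, writer UUID, checksum).
    """
    try:
        # O_CREAT | O_EXCL fails atomically if the file already exists.
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY, 0o644)
    except FileExistsError:
        return False   # step 5: inspect the existing lock instead
    try:
        os.write(fd, content)
        os.fsync(fd)   # lock record durable before writes begin
    finally:
        os.close(fd)
    return True
```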
### Stale Lock Detection
A lock is considered stale when **both** of the following are true:
1. **PID is dead**: `kill(pid, 0)` returns `ESRCH` (process does not exist), OR
the hostname does not match the current host (remote crash)
2. **Age exceeds threshold**: `now_ns - timestamp_ns > 30_000_000_000` (30 seconds)
The age check prevents a race where a PID is recycled by the OS. A lock younger
than 30 seconds is never considered stale, even if the PID appears dead, because
PID reuse on modern systems can occur within milliseconds.
If the hostname differs from the current host, the PID check is not meaningful.
In this case, only the age threshold applies. Implementations SHOULD use a longer
threshold (300 seconds) for cross-host lock recovery to account for clock skew.
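The two-condition staleness test can be sketched as follows; the function names are hypothetical, and `now_ns` is parameterized so the logic is testable without waiting.

```
import os
import socket
import time

STALE_AGE_NS = 30_000_000_000          # 30 s for same-host locks
CROSS_HOST_AGE_NS = 300_000_000_000    # 300 s when PID check is meaningless

def pid_is_dead(pid):
    try:
        os.kill(pid, 0)                # signal 0: existence check only
    except ProcessLookupError:
        return True                    # ESRCH: no such process
    except PermissionError:
        return False                   # exists, owned by another user
    return False

def lock_is_stale(pid, hostname, timestamp_ns, now_ns=None):
    now_ns = time.time_ns() if now_ns is None else now_ns
    age = now_ns - timestamp_ns
    if hostname != socket.gethostname():
        return age > CROSS_HOST_AGE_NS   # remote lock: age check only
    # Both conditions required: a dead PID alone is not enough (PID reuse).
    return pid_is_dead(pid) and age > STALE_AGE_NS
```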
### Lock Release Protocol
```
1. fsync all pending data and manifest segments
2. Verify the lock file still contains our writer_id (re-read and compare)
3. If writer_id matches: unlink("<basename>.rvf.lock")
4. If writer_id does not match: abort — another process stole the lock
```
Step 2 prevents a writer from deleting a lock that was legitimately taken over
after a stale lock recovery by another process.
If a writer crashes without releasing the lock, the lock file persists on disk.
The next writer detects the orphan via stale lock detection and reclaims it.
No data corruption occurs because the append-only segment model guarantees that
partial writes are detectable: a segment with a bad content hash or a truncated
manifest is simply ignored.
## 3. Reader-Writer Coordination
Readers and writers operate independently. The append-only architecture ensures
they never conflict.
### Reader Protocol
```
1. Open file (read-only, no lock required)
2. Read Level 0 root manifest (last 4096 bytes)
3. Parse hotset pointers and Level 1 offset
4. This manifest snapshot defines the reader's view of the file
5. All queries within this session use the snapshot
6. To see new data: re-read Level 0 (explicit refresh)
```
### Writer Protocol
```
1. Acquire lock (Section 2)
2. Read current manifest to learn segment directory state
3. Append new segments (VEC_SEG, INDEX_SEG, etc.)
4. Append new MANIFEST_SEG referencing all live segments
5. fsync
6. Release lock (Section 2)
```
### Concurrent Timeline
```
Time Writer Reader A Reader B
---- ------ -------- --------
t=0 Acquires lock
t=1 Appends VEC_SEG_4 Opens file
t=2 Appends VEC_SEG_5 Opens file Reads manifest M3
t=3 Appends MANIFEST_SEG M4 Reads manifest M3 Queries (sees M3)
t=4 fsync, releases lock Queries (sees M3) Queries (sees M3)
t=5 Queries (sees M3) Refreshes -> M4
t=6 Refreshes -> M4 Queries (sees M4)
```
Reader A opened during the write but read manifest M3 (already stable) and never
sees partially written segments. Reader B sees M3 until explicit refresh. Neither
reader is blocked; the writer is never blocked by readers.
### Snapshot Isolation Guarantees
A reader holding a manifest snapshot is guaranteed:
1. All referenced segments are fully written and fsynced
2. Segment content hashes match (the manifest would not reference broken segments)
3. The snapshot is internally consistent (no partial epoch states)
4. The snapshot remains valid for the lifetime of the open file descriptor, even
if the file is compacted and replaced (old inode persists until close)
## 4. Format Versioning
RVF uses explicit version fields at every structural level. The versioning rules
are designed for forward compatibility — older readers can safely process files
produced by newer writers, with graceful degradation.
### Segment Version Compatibility
The segment header `version` field (offset 0x04, currently `1`) governs
segment-level compatibility.
| Rule | Description |
|------|-------------|
| S1 | A v1 reader MUST successfully process all v1 segments |
| S2 | A v1 reader MUST skip segments with version > 1 |
| S3 | A v1 reader MUST log a warning when skipping unknown versions |
| S4 | A v1 reader MUST NOT reject a file because it contains unknown-version segments |
| S5 | A v2+ writer MUST write a root manifest readable by v1 readers (if the root manifest format allows it) |
| S6 | A v2+ writer MAY write segments with version > 1 |
| S7 | Readers MUST use `payload_length` from the segment header to skip unknown segments |
Skipping works because the segment header layout is stable: magic, version,
seg_type, and payload_length occupy fixed offsets. A reader skips unknown
segments by seeking past `64 + payload_length` bytes (header + payload).
### Unknown Segment Types
The segment type enum (offset 0x05) may be extended in future versions.
| Rule | Description |
|------|-------------|
| T1 | A reader MUST skip segment types outside the recognized range (currently 0x01-0x0C) |
| T2 | A reader MUST NOT reject a file because of unknown segment types |
| T3 | A reader MUST use the header's `payload_length` to skip the unknown segment |
| T4 | A reader SHOULD log unknown types at diagnostic/debug level |
| T5 | Types 0x00 and 0xF0-0xFF remain reserved (see spec 01, Section 3) |
### Level 1 TLV Forward Compatibility
Level 1 manifest records use tag-length-value encoding. New tags may be added
in any version.
| Rule | Description |
|------|-------------|
| L1 | A reader MUST skip TLV records with unknown tags |
| L2 | A reader MUST use the record's `length` field (4 bytes at tag offset +2) to skip |
| L3 | A writer MUST NOT change the semantics of an existing tag |
| L4 | A writer MUST NOT reuse a tag value for a different purpose |
| L5 | New tags MUST be assigned sequentially from the lowest unassigned value (0x0011 at the time of writing) |
### Root Manifest Compatibility
The root manifest (Level 0) has the strictest compatibility requirements because
it is the entry point for all readers.
| Rule | Description |
|------|-------------|
| R1 | The magic `0x52564D30` at offset 0x000 is frozen forever |
| R2 | The layout of bytes 0x000-0x007 (magic + version + flags) is frozen forever |
| R3 | New fields may be added to reserved space at offsets 0xF00-0xFFB |
| R4 | Readers MUST ignore non-zero bytes in reserved space they do not understand |
| R5 | The root checksum at 0xFFC always covers bytes 0x000-0xFFB |
| R6 | A v2+ writer extending reserved space MUST ensure the checksum remains valid |
There is no explicit version negotiation. Compatibility is achieved through the
skip rules above. A reader processes what it understands and skips what it does
not. This avoids capability exchange, making RVF suitable for offline and
archival use cases.
## 5. Variable Dimension Support
The root manifest declares a `dimension` field (offset 0x020, u16) and each
VEC_SEG block declares its own `dim` field (block header offset 0x08, u16).
These may differ.
### Dimension Rules
| Rule | Description |
|------|-------------|
| D1 | The root manifest `dimension` is the **primary dimension** (most common in the file) |
| D2 | An RVF file MAY contain VEC_SEG blocks with dimensions different from the primary |
| D3 | Each VEC_SEG block's `dim` field is authoritative for the vectors in that block |
| D4 | The HNSW index (INDEX_SEG) covers only vectors matching the primary dimension |
| D5 | Vectors with non-primary dimensions are searchable via flat scan or a separate index |
| D6 | A PROFILE_SEG may declare multiple expected dimensions |
### Dimension Catalog (Level 1 Record)
A new Level 1 TLV record (tag `0x0010`, DIMENSION_CATALOG) enables readers to
discover all dimensions present without scanning every VEC_SEG.
Record layout:
```
Offset Size Field Description
------ ---- ----- -----------
0x00 2 entry_count Number of dimension entries
0x02 2 reserved Must be zero
```
Followed by `entry_count` entries of:
```
Offset Size Field Description
------ ---- ----- -----------
0x00 2 dimension Vector dimensionality
0x02 1 dtype Data type enum for these vectors
0x03 1 flags 0x01 = primary, 0x02 = has_index
0x04 4 vector_count Number of vectors with this dimension
0x08 8 index_seg_offset Offset to dedicated index (0 if none)
```
**Entry size**: 16 bytes.
Example for an RVDNA profile file:
```
DIMENSION_CATALOG:
entry_count: 3
[0] dim=64, dtype=f16, flags=0x01 (primary, has_index), count=10000000, index=0x1A00000
[1] dim=384, dtype=f16, flags=0x02 (has_index), count=500000, index=0x3F00000
[2] dim=4096, dtype=f32, flags=0x00 (flat scan only), count=10000, index=0
```
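A decoder for this record is a straightforward fixed-layout parse. A minimal sketch, assuming the little-endian layout above; `parse_dimension_catalog` is a hypothetical helper name.

```
import struct

def parse_dimension_catalog(payload):
    """Decode a DIMENSION_CATALOG record: 4-byte header, 16-byte entries."""
    entry_count, _reserved = struct.unpack_from("<HH", payload, 0)
    entries = []
    for i in range(entry_count):
        off = 4 + 16 * i
        dim, dtype, flags, count, idx_off = struct.unpack_from(
            "<HBBIQ", payload, off)      # u16, u8, u8, u32, u64 = 16 bytes
        entries.append({
            "dimension": dim,
            "dtype": dtype,
            "primary": bool(flags & 0x01),
            "has_index": bool(flags & 0x02),
            "vector_count": count,
            "index_seg_offset": idx_off,
        })
    return entries
```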
## 6. Space Reclamation
Over time, tombstoned segments and superseded manifests accumulate dead space.
RVF provides three reclamation strategies, each suited to different operating
conditions.
### Strategy 1: Hole-Punching
On Linux filesystems that support `fallocate(2)` with `FALLOC_FL_PUNCH_HOLE`
(ext4, XFS, btrfs), tombstoned segment ranges can be released back to the
filesystem without rewriting the file.
```
Before: [VEC_1 live] [VEC_2 dead] [VEC_3 dead] [VEC_4 live] [MANIFEST]
After: [VEC_1 live] [ hole ] [ hole ] [VEC_4 live] [MANIFEST]
```
File size is unchanged but disk blocks are freed. No data movement occurs — each
punch is O(1). Reader mmap still works (holes read as zeros, but the manifest
never references them). Hole-punching is performed only on segments marked as
TOMBSTONE in the current manifest's COMPACTION_STATE record.
### Strategy 2: Copy-Compact
Copy-compact rewrites the file, including only live segments. This is the
universal strategy that works on all filesystems.
```
Protocol:
1. Acquire writer lock
2. Read current manifest to enumerate live segments
3. Create temporary file: <basename>.rvf.compact.tmp
4. Write live segments sequentially to temporary file
5. Write new MANIFEST_SEG with updated offsets
6. fsync temporary file
7. Atomic rename: <basename>.rvf.compact.tmp -> <basename>.rvf
8. Release writer lock
```
The atomic rename (step 7) ensures readers either see the old file or the new
file, never a partial state. Readers that opened the old file before the rename
continue operating on the old inode via their open file descriptor. The old
inode is freed when the last reader closes its descriptor.
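The write-fsync-rename sequence (steps 3-7) can be sketched as below. Manifest rewriting and offset patching are elided; segments are passed in as pre-serialized bytes, which is an assumption of this sketch.

```
import os

def copy_compact(live_segments, dst_path):
    """Write live segments to a temp file, fsync, then atomically replace.

    `live_segments` is an iterable of (header_bytes, payload_bytes).
    """
    tmp_path = dst_path + ".compact.tmp"
    with open(tmp_path, "wb") as f:
        for header, payload in live_segments:
            f.write(header)
            f.write(payload)
        f.flush()
        os.fsync(f.fileno())        # durable before the rename is visible
    # Atomic on POSIX: readers see either the old file or the new file.
    os.replace(tmp_path, dst_path)
```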
### Strategy 3: Shard Rewrite (Multi-File Mode)
In multi-file mode, individual shard files can be rewritten independently:
```
Protocol:
1. Acquire writer lock
2. Read shard reference from Level 1 SHARD_REFS record
3. Write new shard: <basename>.rvf.cold.<N>.compact.tmp
4. fsync new shard
5. Update main file manifest with new shard reference
6. fsync main file
7. Atomic rename new shard over old shard
8. Release writer lock
```
The old shard is safe to delete after all readers close their descriptors.
Implementations MAY defer deletion using a grace period (default: 60 seconds).
## 7. Space Reclamation Triggers
Reclamation is not performed on every write. Implementations SHOULD evaluate
triggers after each manifest write and act when thresholds are exceeded.
| Trigger | Threshold | Action |
|---------|-----------|--------|
| Dead space ratio | > 50% of file size | Copy-compact |
| Dead space absolute | > 1 GB | Hole-punch if supported, else copy-compact |
| Tombstone count | > 10,000 JOURNAL_SEG tombstone entries | Consolidate journal segments |
| Time since last compaction | > 7 days | Evaluate dead space ratio, compact if > 25% |
### Dead Space Calculation
Dead space is computed from the manifest's COMPACTION_STATE record:
```
dead_bytes = sum(payload_length + 64) for each tombstoned segment
total_bytes = file_size
dead_ratio = dead_bytes / total_bytes
```
The `+ 64` accounts for the segment header.
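The calculation can be sketched directly from the formula above; `dead_space` is a hypothetical helper name.

```
def dead_space(tombstoned_payload_lengths, file_size):
    """Return (dead_bytes, dead_ratio); +64 covers each segment header."""
    dead = sum(plen + 64 for plen in tombstoned_payload_lengths)
    return dead, dead / file_size

dead, ratio = dead_space([1_000_000, 2_000_000], 10_000_000)
# dead = 3_000_128 bytes; ratio ~ 0.30, below the 50% copy-compact trigger
```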
### Trigger Evaluation Protocol
```
1. After writing a new MANIFEST_SEG, compute dead_bytes and dead_ratio
2. If dead_ratio > 0.50: schedule copy-compact
3. Else if dead_bytes > 1 GB:
a. If fallocate supported: hole-punch tombstoned ranges
b. Else: schedule copy-compact
4. If tombstone_count > 10,000: consolidate JOURNAL_SEGs
5. If days_since_last_compact > 7 AND dead_ratio > 0.25: schedule copy-compact
```
Scheduled compactions MAY be deferred to a background process or low-activity
period.
## 8. Multi-Process Compaction
Compaction is a write operation and requires the writer lock. Only one process
may compact at a time.
### Background Compaction Process
A dedicated compaction process can run alongside the application:
```
1. Attempt writer lock acquisition
2. If lock acquired:
a. Read current manifest
b. Evaluate reclamation triggers
c. If compaction needed:
i. Write WITNESS_SEG with compaction_state = STARTED
ii. Perform compaction (copy-compact or hole-punch)
iii. Write WITNESS_SEG with compaction_state = COMPLETED
iv. Write new MANIFEST_SEG
d. Release lock
3. If lock not acquired: sleep and retry
```
### Crash Safety
Compaction is crash-safe by construction. Copy-compact does not rename until
fsynced — a crash before rename leaves the original file untouched and the
temporary file is cleaned up on next startup. Hole-punch `fallocate` calls are
individually atomic; a crash mid-sequence leaves the manifest consistent because
it references only live segments. Shard rewrite follows the same atomic rename
pattern as copy-compact.
### Compaction Progress and Resumability
For long-running compactions, the writer records progress in WITNESS_SEG segments:
```
WITNESS_SEG compaction payload:
Offset Size Field Description
------ ---- ----- -----------
0x00 4 state 0=STARTED, 1=IN_PROGRESS, 2=COMPLETED, 3=ABORTED
0x04 8 source_manifest_id Segment ID of manifest being compacted
0x0C 8 last_copied_seg_id Last segment ID successfully written to new file
0x14 8 bytes_written Total bytes written to new file so far
0x1C 8 bytes_remaining Estimated bytes remaining
0x24 16 temp_file_hash Hash of temporary file at last checkpoint
```
If a compaction process crashes and restarts, it can:
1. Find the latest WITNESS_SEG with `state = IN_PROGRESS`
2. Verify the temporary file exists and matches `temp_file_hash`
3. Resume from `last_copied_seg_id + 1`
4. If verification fails, delete the temporary file and restart compaction
## 9. Crash Recovery Summary
RVF recovers from crashes at any point without external tooling.
| Crash Point | State After Recovery | Action Required |
|-------------|---------------------|-----------------|
| Segment append (before manifest) | Orphan segment at tail | None — manifest does not reference it |
| Manifest write | Partial manifest at tail | Scan backward to previous valid manifest |
| Lock acquisition | Lock file may or may not exist | Stale lock detection resolves it |
| Lock release | Lock file persists | Stale lock detection resolves it |
| Copy-compact (before rename) | Temporary file on disk | Delete `*.compact.tmp` on startup |
| Copy-compact (during rename) | Atomic — old or new | No action needed |
| Hole-punch | Partial holes punched | No action — manifest is consistent |
| Shard rewrite | Temporary shard on disk | Delete `*.compact.tmp` on startup |
### Startup Recovery Protocol
On startup, before acquiring a write lock, a writer SHOULD:
```
1. Delete any <basename>.rvf.compact.tmp files (orphaned compaction)
2. Delete any <basename>.rvf.cold.*.compact.tmp files (orphaned shard compaction)
3. Validate the lock file (if present) for staleness
4. Open the RVF file and locate the latest valid manifest
5. If the tail contains a partial segment (magic present, bad hash):
a. Log a warning with the partial segment's offset and type
b. The partial segment is outside the manifest — it is harmless
c. The next append will overwrite it (or it will be compacted away)
```
## 10. Invariants
The following invariants extend those in spec 01 (Section 7):
1. At most one writer lock exists per RVF file at any time
2. A lock file with valid magic and checksum represents an active or stale lock
3. Readers never require a lock, regardless of operation
4. A manifest snapshot is immutable for the lifetime of a reader session
5. Compaction never modifies live segments — it creates new ones
6. Hole-punched regions are never referenced by any manifest
7. The root manifest magic and first 8 bytes are frozen across all versions
8. Unknown segment versions and types are skipped, never rejected
9. Unknown TLV tags in Level 1 are skipped, never rejected
10. Each VEC_SEG block's `dim` field is authoritative for that block's vectors
# RVF Operations API
## 1. Scope
This document specifies the operational surface of an RVF runtime: error codes
returned by all operations, wire formats for batch queries, batch ingest, and
batch deletes, the network streaming protocol for progressive loading over HTTP
and TCP, and the compaction scheduling policy. It complements the segment model
(spec 01), manifest system (spec 02), and query optimization (spec 06).
All multi-byte integers are little-endian unless otherwise noted. All offsets
within messages are byte offsets from the start of the message payload.
## 2. Error Code Enumeration
Error codes are 16-bit unsigned integers. The high byte identifies the error
category; the low byte identifies the specific error within that category.
Implementations must preserve unrecognized codes in responses and must not
treat unknown codes as fatal unless the high byte is `0x01` (format error).
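The category split and the unknown-code rule can be sketched as follows; the function names are illustrative, not part of the API surface.

```
def split_error_code(code):
    """High byte = category, low byte = specific error within it."""
    return code >> 8, code & 0xFF

def is_fatal_unknown(code):
    """Unknown codes are fatal only when the category is 0x01 (format)."""
    category, _ = split_error_code(code)
    return category == 0x01

assert split_error_code(0x0203) == (0x02, 0x03)   # FILTER_PARSE_ERROR
assert is_fatal_unknown(0x01FE) is True           # unknown format error
assert is_fatal_unknown(0x02FE) is False          # unknown query error
```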
### Category 0x00: Success
```
Code Name Description
------ -------------------- ----------------------------------------
0x0000 OK Operation succeeded
0x0001 OK_PARTIAL Partial success (some items failed)
```
`OK_PARTIAL` is returned when a batch operation succeeds for some items and
fails for others. The response body contains per-item status details.
### Category 0x01: Format Errors
```
Code Name Description
------ -------------------- ----------------------------------------
0x0100 INVALID_MAGIC Segment magic mismatch (expected 0x52564653)
0x0101 INVALID_VERSION Unsupported segment version
0x0102 INVALID_CHECKSUM Segment hash verification failed
0x0103 INVALID_SIGNATURE Cryptographic signature invalid
0x0104 TRUNCATED_SEGMENT Segment payload shorter than declared length
0x0105 INVALID_MANIFEST Root manifest validation failed
0x0106 MANIFEST_NOT_FOUND No valid MANIFEST_SEG in file
0x0107 UNKNOWN_SEGMENT_TYPE Segment type not recognized (warning, not fatal)
0x0108 ALIGNMENT_ERROR Data not at expected 64B boundary
```
`UNKNOWN_SEGMENT_TYPE` is advisory. A reader encountering an unknown segment
type should skip it and continue. All other format errors in this category
are fatal for the affected segment.
### Category 0x02: Query Errors
```
Code Name Description
------ -------------------- ----------------------------------------
0x0200 DIMENSION_MISMATCH Query vector dimension != index dimension
0x0201 EMPTY_INDEX No index segments available
0x0202 METRIC_UNSUPPORTED Requested distance metric not available
0x0203 FILTER_PARSE_ERROR Invalid filter expression
0x0204 K_TOO_LARGE Requested K exceeds available vectors
0x0205 TIMEOUT Query exceeded time budget
```
When `K_TOO_LARGE` is returned, the response still contains all available
results. The result count will be less than the requested K.
### Category 0x03: Write Errors
```
Code Name Description
------ -------------------- ----------------------------------------
0x0300 LOCK_HELD Another writer holds the lock
0x0301 LOCK_STALE Lock file exists but owner process is dead
0x0302 DISK_FULL Insufficient space for write
0x0303 FSYNC_FAILED Durable write failed
0x0304 SEGMENT_TOO_LARGE Segment exceeds 4 GB limit
0x0305 READ_ONLY File opened in read-only mode
```
`LOCK_STALE` is informational. The runtime may attempt to break the stale
lock and retry. If recovery succeeds, the original operation proceeds with
an `OK` status.
### Category 0x04: Tile Errors (WASM Microkernel)
```
Code Name Description
------ -------------------- ----------------------------------------
0x0400 TILE_TRAP WASM trap (OOB, unreachable, stack overflow)
0x0401 TILE_OOM Tile exceeded scratch memory (64 KB)
0x0402 TILE_TIMEOUT Tile computation exceeded time budget
0x0403 TILE_INVALID_MSG Malformed hub-tile message
0x0404 TILE_UNSUPPORTED_OP Operation not available on this profile
```
All tile errors trigger the fault isolation protocol described in
`microkernel/wasm-runtime.md` section 8. The hub reassigns the tile's
work and optionally restarts the faulted tile.
### Category 0x05: Crypto Errors
```
Code Name Description
------ -------------------- ----------------------------------------
0x0500 KEY_NOT_FOUND Referenced key_id not in CRYPTO_SEG
0x0501 KEY_EXPIRED Key past valid_until timestamp
0x0502 DECRYPT_FAILED Decryption or auth tag verification failed
0x0503 ALGO_UNSUPPORTED Cryptographic algorithm not implemented
```
Crypto errors are always fatal for the affected segment. An implementation
must not serve data from a segment that fails signature or decryption checks.
## 3. Batch Query API
### Wire Format: Request
Batch queries amortize connection overhead and enable the runtime to
schedule vector block loads across multiple queries simultaneously.
```
Offset Size Field Description
------ ------ ------------------ ----------------------------------------
0x00 4 query_count Number of queries in batch (max 1024)
0x04 4 k Shared top-K parameter
0x08 1 metric Distance metric: 0=L2, 1=IP, 2=cosine, 3=hamming
0x09 3 reserved Must be zero
0x0C 4 ef_search HNSW ef_search parameter
0x10 4 shared_filter_len Byte length of shared filter (0 = no filter)
0x14 var shared_filter Filter expression (applies to all queries)
var var queries[] Per-query entries (see below)
```
Each query entry:
```
Offset Size Field Description
------ ------ ------------------ ----------------------------------------
0x00 4 query_id Client-assigned correlation ID
0x04 2 dim Vector dimensionality
0x06 1 dtype Data type: 0=fp32, 1=fp16, 2=i8, 3=binary
0x07 1 flags Bit 0: has per-query filter
0x08 var vector Query vector (dim * sizeof(dtype) bytes)
var 4 filter_len Byte length of per-query filter (if flags bit 0)
var var filter Per-query filter (overrides shared filter)
```
When both a shared filter and a per-query filter are present, the per-query
filter takes precedence. A per-query filter of zero length inherits the
shared filter.
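The precedence rule above can be expressed as a small helper. A minimal sketch (the `Option`-based signature is an illustration of the rule, not part of the wire format):

```rust
/// Resolve the filter that applies to one query in a batch.
/// A per-query filter, when present and non-empty, overrides the
/// shared filter; an absent or zero-length per-query filter inherits it.
fn effective_filter<'a>(
    shared: Option<&'a [u8]>,
    per_query: Option<&'a [u8]>,
) -> Option<&'a [u8]> {
    match per_query {
        Some(f) if !f.is_empty() => Some(f),
        _ => shared,
    }
}
```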
### Wire Format: Response
```
Offset Size Field Description
------ ------ ------------------ ----------------------------------------
0x00 4 query_count Number of query results
0x04 var results[] Per-query result entries
```
Each result entry:
```
Offset Size Field Description
------ ------ ------------------ ----------------------------------------
0x00 4 query_id Correlation ID from request
0x04 2 status Error code (0x0000 = OK)
0x06 2 reserved Must be zero
0x08 4 result_count Number of results returned
0x0C var results[] Array of (vector_id: u64, distance: f32) pairs
```
Each result pair is 12 bytes: 8 bytes for the vector ID followed by 4 bytes
for the distance value. Results are sorted by distance ascending (nearest first).
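A decoder for the pair array follows directly from the layout above (little-endian, per the invariants in section 10); the function name is illustrative:

```rust
/// Decode the `results[]` array of a per-query result entry:
/// each pair is 12 bytes, a little-endian u64 vector ID followed
/// by a little-endian f32 distance.
fn decode_result_pairs(buf: &[u8]) -> Option<Vec<(u64, f32)>> {
    if buf.len() % 12 != 0 {
        return None; // truncated entry
    }
    let mut out = Vec::with_capacity(buf.len() / 12);
    for pair in buf.chunks_exact(12) {
        let id = u64::from_le_bytes(pair[0..8].try_into().unwrap());
        let dist = f32::from_le_bytes(pair[8..12].try_into().unwrap());
        out.push((id, dist));
    }
    Some(out)
}
```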
### Batch Scheduling
The runtime should process batch queries using the following strategy:
1. Parse all query vectors and load them into memory
2. Identify shared segments across queries (block deduplication)
3. Load each vector block once and evaluate all relevant queries against it
4. Merge per-query top-K heaps independently
5. Return results as soon as each query completes (streaming response)
This amortizes I/O: if N queries touch the same vector block, the block is
read once instead of N times.
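The deduplication in steps 2-3 amounts to inverting the query-to-block mapping. A minimal sketch, with block IDs and the map type chosen for illustration:

```rust
use std::collections::BTreeMap;

/// Given, for each query, the set of vector-block IDs it must scan,
/// build the inverse map: block -> queries that need it. Each block
/// is then loaded once and evaluated against all interested queries.
fn plan_block_loads(query_blocks: &[Vec<u32>]) -> BTreeMap<u32, Vec<usize>> {
    let mut plan: BTreeMap<u32, Vec<usize>> = BTreeMap::new();
    for (qid, blocks) in query_blocks.iter().enumerate() {
        for &b in blocks {
            plan.entry(b).or_default().push(qid);
        }
    }
    plan
}
```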
## 4. Batch Ingest API
### Wire Format: Request
```
Offset Size Field Description
------ ------ ------------------ ----------------------------------------
0x00 4 vector_count Number of vectors to ingest (max 65536)
0x04 2 dim Vector dimensionality
0x06 1 dtype Data type: 0=fp32, 1=fp16, 2=i8, 3=binary
0x07 1 flags Bit 0: metadata_included
0x08 var vectors[] Vector entries
var var metadata[] Metadata entries (if flags bit 0)
```
Each vector entry:
```
Offset Size Field Description
------ ------ ------------------ ----------------------------------------
0x00 8 vector_id Globally unique vector ID
0x08 var vector Vector data (dim * sizeof(dtype) bytes)
```
Each metadata entry (when metadata_included is set):
```
Offset Size Field Description
------ ------ ------------------ ----------------------------------------
0x00 2 field_count Number of metadata fields
0x02 var fields[] Field entries
```
Each metadata field:
```
Offset Size Field Description
------ ------ ------------------ ----------------------------------------
0x00 2 field_id Field identifier (application-defined)
0x02 1 value_type 0=u64, 1=i64, 2=f64, 3=string, 4=bytes
0x03 var value Encoded value (u64/i64/f64: 8B; string/bytes: 4B length + data)
```
### Wire Format: Response
```
Offset Size Field Description
------ ------ ------------------ ----------------------------------------
0x00 4 accepted_count Number of vectors accepted
0x04 4 rejected_count Number of vectors rejected
0x08 4 manifest_epoch Epoch of manifest after commit
0x0C var rejected_ids[] Array of rejected vector IDs (u64 * rejected_count)
var var rejected_reasons[] Array of error codes (u16 * rejected_count)
```
The `manifest_epoch` field is the epoch of the MANIFEST_SEG written after the
ingest is committed. Clients can use this value to confirm that a subsequent
read will include the ingested vectors.
### Ingest Commit Semantics
1. The runtime writes vectors to a new VEC_SEG (append-only)
2. If metadata is included, a META_SEG is appended
3. Both segments are fsynced
4. A new MANIFEST_SEG is written referencing the new segments
5. The manifest is fsynced
6. The response is sent with the new manifest_epoch
Vectors are visible to queries only after step 6 completes.
## 5. Batch Delete API
### Wire Format: Request
```
Offset Size Field Description
------ ------ ------------------ ----------------------------------------
0x00 1 delete_type 0=by_id, 1=by_range, 2=by_filter
0x01 3 reserved Must be zero
0x04 var payload Type-specific payload (see below)
```
Delete by ID (`delete_type = 0`):
```
Offset Size Field Description
------ ------ ------------------ ----------------------------------------
0x00 4 count Number of IDs to delete
0x04 var ids[] Array of vector IDs (u64 * count)
```
Delete by range (`delete_type = 1`):
```
Offset Size Field Description
------ ------ ------------------ ----------------------------------------
0x00 8 start_id Start of range (inclusive)
0x08 8 end_id End of range (exclusive)
```
Delete by filter (`delete_type = 2`):
```
Offset Size Field Description
------ ------ ------------------ ----------------------------------------
0x00 4 filter_len Byte length of filter expression
0x04 var filter Filter expression
```
### Wire Format: Response
```
Offset Size Field Description
------ ------ ------------------ ----------------------------------------
0x00 8 deleted_count Number of vectors deleted
0x08 2 status Error code (0x0000 = OK)
0x0A 2 reserved Must be zero
0x0C 4 manifest_epoch Epoch of manifest after delete committed
```
### Delete Mechanics
Deletes are logical. The runtime appends a JOURNAL_SEG containing tombstone
entries for the deleted vector IDs. The new MANIFEST_SEG marks affected
VEC_SEGs as partially dead. Physical reclamation happens during compaction.
## 6. Network Streaming Protocol
### 6.1 HTTP Range Requests (Read-Only Access)
RVF's progressive loading model maps naturally to HTTP byte-range requests.
A client can boot from a remote `.rvf` file and become queryable without
downloading the entire file.
**Phase 1: Boot (mandatory)**
```
GET /file.rvf Range: bytes=-4096
```
Retrieves the last 4 KB of the file. This contains the Level 0 root manifest
(MANIFEST_SEG). The client parses hotset pointers, the segment directory, and
the profile ID.
If the file is smaller than 4 KB, the entire file is returned. If the last
4 KB does not contain a valid MANIFEST_SEG, the client extends the range
backward in 4 KB increments until one is found or 1 MB is scanned (at which
point it returns `MANIFEST_NOT_FOUND`).
**Phase 2: Hotset (parallel, mandatory for queries)**
Using offsets from the Level 0 manifest, the client issues up to 5 parallel
range requests:
```
GET /file.rvf Range: bytes=<entrypoint_offset>-<entrypoint_end>
GET /file.rvf Range: bytes=<toplayer_offset>-<toplayer_end>
GET /file.rvf Range: bytes=<centroid_offset>-<centroid_end>
GET /file.rvf Range: bytes=<quantdict_offset>-<quantdict_end>
GET /file.rvf Range: bytes=<hotcache_offset>-<hotcache_end>
```
These fetch the HNSW entry point, top-layer graph, routing centroids,
quantization dictionary, and the hot cache (HOT_SEG). After these 5 requests
complete, the system is queryable with recall >= 0.7.
**Phase 3: Level 1 (background)**
```
GET /file.rvf Range: bytes=<l1_offset>-<l1_end>
```
Fetches the Level 1 manifest containing the full segment directory. This
enables the client to discover all segments and plan on-demand fetches.
**Phase 4: On-demand (per query)**
For queries that require cold data not yet fetched:
```
GET /file.rvf Range: bytes=<segment_offset>-<segment_end>
```
The client caches fetched segments locally. Repeated queries against the
same data region do not trigger additional requests.
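The boot request uses a suffix range while later phases use absolute ranges; only the `Range` header value differs. A sketch of building those values (HTTP client plumbing omitted; function names are illustrative):

```rust
/// `Range` header value for the boot phase: the last `n` bytes of the file.
fn suffix_range(n: u64) -> String {
    format!("bytes=-{}", n)
}

/// `Range` header value for an absolute byte span [start, end]
/// (inclusive on both ends, as HTTP byte ranges are).
fn absolute_range(start: u64, end: u64) -> String {
    format!("bytes={}-{}", start, end)
}
```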
### HTTP Requirements
- Server must support `Accept-Ranges: bytes`
- Server must return `206 Partial Content` for range requests
- Server should support multiple ranges in a single request (`multipart/byteranges`)
- Client should use `If-None-Match` with the file's ETag to detect stale caches
### 6.2 TCP Streaming Protocol (Real-Time Access)
For real-time ingest and low-latency queries, RVF defines a binary TCP
protocol over TLS 1.3.
**Connection Setup**
```
1. Client opens TCP connection to server
2. TLS 1.3 handshake (mandatory, no plaintext mode)
3. Client sends HELLO message with protocol version and capabilities
4. Server responds with HELLO_ACK confirming capabilities
5. Connection is ready for messages
```
**Framing**
All messages are length-prefixed:
```
Offset Size Field Description
------ ------ ------------------ ----------------------------------------
0x00 4 frame_length Payload length (big-endian, max 16 MB)
0x04 1 msg_type Message type (see below)
0x05 3 msg_id Correlation ID (big-endian, wraps at 2^24)
0x08 var payload Message-specific payload
```
Frame length is big-endian (network byte order) for consistency with TLS
framing. The 16 MB maximum prevents a single message from monopolizing the
connection. Payloads larger than 16 MB must be split across multiple messages
using continuation framing (see section 6.4).
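The 8-byte frame header above can be packed and unpacked as follows; a minimal sketch of the framing layer only:

```rust
/// Encode a TCP frame header: big-endian u32 payload length, one
/// msg_type byte, and a 3-byte big-endian msg_id (wraps at 2^24).
fn encode_frame_header(payload_len: u32, msg_type: u8, msg_id: u32) -> [u8; 8] {
    assert!(payload_len <= 16 * 1024 * 1024, "frame exceeds 16 MB limit");
    let mut h = [0u8; 8];
    h[0..4].copy_from_slice(&payload_len.to_be_bytes());
    h[4] = msg_type;
    let id = msg_id & 0x00FF_FFFF; // wraps at 2^24
    h[5..8].copy_from_slice(&id.to_be_bytes()[1..4]);
    h
}

fn decode_frame_header(h: &[u8; 8]) -> (u32, u8, u32) {
    let len = u32::from_be_bytes(h[0..4].try_into().unwrap());
    let msg_type = h[4];
    let msg_id = u32::from_be_bytes([0, h[5], h[6], h[7]]);
    (len, msg_type, msg_id)
}
```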
**Message Types**
```
Client -> Server:
0x01 QUERY Batch query (payload = Batch Query Request)
0x02 INGEST Batch ingest (payload = Batch Ingest Request)
0x03 DELETE Batch delete (payload = Batch Delete Request)
0x04 STATUS Request server status (no payload)
0x05 SUBSCRIBE Subscribe to update notifications
Server -> Client:
0x81 QUERY_RESULT Batch query result
0x82 INGEST_ACK Batch ingest acknowledgment
0x83 DELETE_ACK Batch delete acknowledgment
0x84 STATUS_RESP Server status response
0x85 UPDATE_NOTIFY Push notification of new data
0xFF ERROR Error with code and description
```
**ERROR Message Payload**
```
Offset Size Field Description
------ ------ ------------------ ----------------------------------------
0x00 2 error_code Error code from section 2
0x02 2 description_len Byte length of description string
0x04 var description UTF-8 error description (human-readable)
```
### 6.3 Streaming Ingest Protocol
The TCP protocol supports continuous ingest where the client streams vectors
without waiting for per-batch acknowledgments.
**Flow**
```
Client Server
| |
|--- INGEST (batch 0) ------------->|
|--- INGEST (batch 1) ------------->| Pipelining: send without waiting
|--- INGEST (batch 2) ------------->|
| | Server writes VEC_SEGs, appends manifest
|<--- INGEST_ACK (batch 0) ---------|
|<--- INGEST_ACK (batch 1) ---------|
| | Backpressure: server delays ACK
|--- INGEST (batch 3) ------------->| Client respects window
|<--- INGEST_ACK (batch 2) ---------|
| |
```
**Backpressure**
The server controls ingest rate by delaying INGEST_ACK responses. The client
must limit its in-flight (unacknowledged) ingest messages to a configurable
window size (default: 8 messages). When the window is full, the client must
wait for an ACK before sending the next batch.
The server should send backpressure when:
- Write queue exceeds 80% capacity
- Compaction is falling behind (dead space > 50%)
- Available disk space drops below 10%
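The client-side window can be sketched as a small counter; the struct and method names are illustrative, not part of the protocol:

```rust
/// Client-side ingest window: caps in-flight (unacknowledged) INGEST
/// messages. The default window size in the text is 8.
struct IngestWindow {
    capacity: usize,
    in_flight: usize,
}

impl IngestWindow {
    fn new(capacity: usize) -> Self {
        Self { capacity, in_flight: 0 }
    }

    /// Returns true if another INGEST may be sent now.
    fn try_send(&mut self) -> bool {
        if self.in_flight < self.capacity {
            self.in_flight += 1;
            true
        } else {
            false // window full: must wait for an INGEST_ACK
        }
    }

    /// Called when an INGEST_ACK arrives.
    fn on_ack(&mut self) {
        self.in_flight = self.in_flight.saturating_sub(1);
    }
}
```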
**Commit Semantics**
Each INGEST_ACK contains the `manifest_epoch` after commit. The server
guarantees that all vectors acknowledged with epoch E are visible to any
query that reads the manifest at epoch >= E.
### 6.4 Continuation Framing
For payloads exceeding the 16 MB frame limit:
```
Frame 0: msg_type = original type, flags bit 0 = CONTINUATION_START
Frame 1: msg_type = 0x00 (CONTINUATION), flags bit 0 = 0
Frame 2: msg_type = 0x00 (CONTINUATION), flags bit 0 = 0
Frame N: msg_type = 0x00 (CONTINUATION), flags bit 1 = CONTINUATION_END
```
The receiver reassembles the payload from all continuation frames before
processing. The msg_id is shared across all frames of a continuation sequence.
### 6.5 SUBSCRIBE and UPDATE_NOTIFY
The SUBSCRIBE message registers the client for push notifications when new
data is committed:
```
SUBSCRIBE payload:
Offset Size Field Description
------ ------ ------------------ ----------------------------------------
0x00 4 min_epoch Only notify for epochs > this value
0x04 1 notify_flags Bit 0: ingest, Bit 1: delete, Bit 2: compaction
0x05 3 reserved Must be zero
```
The server sends UPDATE_NOTIFY whenever a new MANIFEST_SEG is committed that
matches the subscription criteria:
```
UPDATE_NOTIFY payload:
Offset Size Field Description
------ ------ ------------------ ----------------------------------------
0x00 4 epoch New manifest epoch
0x04 1 event_type 0=ingest, 1=delete, 2=compaction
0x05 3 reserved Must be zero
0x08 4 affected_count Number of vectors affected
0x0C 8 new_total Total vector count after event
```
## 7. Compaction Scheduling Policy
Compaction merges small, overlapping, or partially-dead segments into larger,
sealed segments. Because compaction competes with queries and ingest for I/O
bandwidth, the runtime enforces a scheduling policy.
### 7.1 IO Budget
Compaction must consume at most 30% of available IOPS. The runtime measures
IOPS over a 5-second sliding window and throttles compaction I/O to stay
within budget.
```
available_iops = measured_iops_capacity (from benchmarking at startup)
compaction_budget = available_iops * 0.30
compaction_throttle = max(compaction_budget - current_compaction_iops, 0)
```
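The budget formulas above translate directly into code; the 60% figure is the emergency-mode budget from section 7.4:

```rust
/// Compaction I/O throttle from the formulas above. `available_iops`
/// comes from benchmarking at startup; `current` is compaction IOPS
/// measured over the 5-second sliding window.
fn compaction_throttle(available_iops: f64, current: f64, emergency: bool) -> f64 {
    let budget_frac = if emergency { 0.60 } else { 0.30 };
    let budget = available_iops * budget_frac;
    (budget - current).max(0.0)
}
```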
### 7.2 Priority Ordering
When I/O bandwidth is contended, operations are prioritized:
```
Priority 1 (highest): Queries (reads from VEC_SEG, INDEX_SEG, HOT_SEG)
Priority 2: Ingest (writes to VEC_SEG, META_SEG, MANIFEST_SEG)
Priority 3 (lowest): Compaction (reads + writes of sealed segments)
```
Compaction yields to queries and ingest. If a compaction I/O operation would
cause a query to exceed its time budget, the compaction operation is deferred.
### 7.3 Scheduling Triggers
Compaction runs when all of the following conditions are met:
| Condition | Threshold | Rationale |
|-----------|-----------|-----------|
| Query load | < 50% of capacity | Avoid competing with active queries |
| Dead space ratio | > 20% of total file size | Below this, reclaiming space is not worth the I/O |
| Segment count | > 32 active segments | Many small segments hurt read performance |
| Time since last compaction | > 60 seconds | Prevent compaction storms |
The runtime evaluates these conditions every 10 seconds.
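The four triggers combine with a logical AND; a minimal sketch of the check the runtime runs every 10 seconds (parameter names are illustrative):

```rust
/// Evaluate the compaction triggers from the table above.
/// All four conditions must hold for compaction to start.
fn should_compact(
    query_load_frac: f64,  // fraction of query capacity in use
    dead_space_frac: f64,  // dead bytes / total file bytes
    active_segments: u32,
    secs_since_last: u64,
) -> bool {
    query_load_frac < 0.50
        && dead_space_frac > 0.20
        && active_segments > 32
        && secs_since_last > 60
}
```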
### 7.4 Emergency Compaction
If dead space exceeds 70% of total file size, compaction enters emergency mode:
```
Emergency compaction rules:
1. Compaction preempts ingest (ingest is paused, not rejected)
2. IO budget increases to 60% of available IOPS
3. Compaction runs regardless of query load
4. Ingest resumes after dead space drops below 50%
```
During emergency compaction, the server responds to INGEST messages with
delayed ACKs (backpressure) rather than rejecting them. Queries continue to
be served at highest priority.
### 7.5 Compaction Progress Reporting
The STATUS response includes compaction state:
```
STATUS_RESP compaction fields:
Offset Size Field Description
------ ------ ------------------- ----------------------------------------
0x00 1 compaction_state 0=idle, 1=running, 2=emergency
0x01 1 progress_pct Completion percentage (0-100)
0x02 2 reserved Must be zero
0x04 8 dead_bytes Total dead space in bytes
0x0C 8 total_bytes Total file size in bytes
0x14 4 segments_remaining Segments left to compact
0x18 4 segments_completed Segments compacted in current run
0x1C 4 estimated_seconds Estimated time to completion
0x20 4 io_budget_pct Current IO budget percentage (30 or 60)
```
### 7.6 Compaction Segment Selection
The runtime selects segments for compaction using a tiered strategy:
```
1. Tombstoned segments: Always compacted first (reclaim dead space)
2. Small VEC_SEGs: Segments < 1 MB merged into larger segments
3. High-overlap INDEX_SEGs: Index segments covering the same ID range
4. Cold OVERLAY_SEGs: Overlay deltas merged into base segments
```
The compaction output is always a sealed segment (SEALED flag set). Sealed
segments are immutable and can be verified independently.
## 8. STATUS Response Format
The STATUS message provides a snapshot of the server state for monitoring
and diagnostics.
```
STATUS_RESP payload:
Offset Size Field Description
------ ------ ------------------- ----------------------------------------
0x00 4 protocol_version Protocol version (currently 1)
0x04 4 manifest_epoch Current manifest epoch
0x08 8 total_vectors Total vector count
0x10 8 total_segments Total segment count
0x18 8 file_size_bytes Total file size
0x20 4 query_qps Queries per second (last 5s window)
0x24 4 ingest_vps Vectors ingested per second (last 5s window)
0x28 24 compaction Compaction state (see section 7.5)
0x40 1 profile_id Active hardware profile (0x00-0x03)
0x41 1 health 0=healthy, 1=degraded, 2=read_only
0x42 2 reserved Must be zero
0x44 4 uptime_seconds Server uptime
```
## 9. Filter Expression Format
Filter expressions used in batch queries and batch deletes share a common
binary encoding:
```
Offset Size Field Description
------ ------ ------------------ ----------------------------------------
0x00 1 op Operator enum (see below)
0x01 2 field_id Metadata field to filter on
0x03 1 value_type Value type (matches metadata field types)
0x04 var value Comparison value
var var children[] Sub-expressions (for AND/OR/NOT)
```
Operator enum:
```
0x00 EQ field == value
0x01 NE field != value
0x02 LT field < value
0x03 LE field <= value
0x04 GT field > value
0x05 GE field >= value
0x06 IN field in [values]
0x07 RANGE field in [low, high)
0x10 AND All children must match
0x11 OR Any child must match
0x12 NOT Negate single child
```
Filters are evaluated during the query scan phase. Vectors that do not match
the filter are excluded from distance computation entirely (pre-filtering) or
from the result set (post-filtering), depending on the runtime's cost model.
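Once decoded, a filter is a small expression tree. A minimal evaluator sketch covering a subset of the operators, with i64 values only and metadata modeled as a field lookup (missing fields failing comparisons is an assumption, not specified above):

```rust
/// A decoded filter expression node (the in-memory shape after
/// parsing the binary encoding; names are illustrative).
enum Filter {
    Eq(u16, i64),
    Lt(u16, i64),
    And(Vec<Filter>),
    Or(Vec<Filter>),
    Not(Box<Filter>),
}

/// Evaluate a filter against one vector's metadata, modeled as a
/// lookup from field_id to an i64 value.
fn eval(f: &Filter, get: &dyn Fn(u16) -> Option<i64>) -> bool {
    match f {
        Filter::Eq(field, v) => get(*field) == Some(*v),
        Filter::Lt(field, v) => get(*field).map_or(false, |x| x < *v),
        Filter::And(cs) => cs.iter().all(|c| eval(c, get)),
        Filter::Or(cs) => cs.iter().any(|c| eval(c, get)),
        Filter::Not(c) => !eval(c, get),
    }
}
```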
## 10. Invariants
1. Error codes are stable across versions; new codes are additive only
2. Batch operations are atomic per-item, not per-batch (partial success is valid)
3. TCP connections are always TLS 1.3; plaintext is not permitted
4. Frame length is big-endian; all other multi-byte fields are little-endian
5. HTTP progressive loading must succeed with at most 7 round trips to become queryable
6. Compaction never runs at more than 60% of available IOPS, even in emergency mode
7. The STATUS response is always available, even during emergency compaction
8. Filter expressions are limited to 64 levels of nesting depth

# RVF WASM Self-Bootstrapping Specification
## 1. Motivation
Traditional file formats require an external runtime to interpret their contents.
A JPEG needs an image decoder. A SQLite database needs the SQLite library. An RVF
file needs a vector search engine.
What if the file carried its own runtime?
By embedding a tiny WASM interpreter inside the RVF file itself, we eliminate the
last external dependency. The host only needs **raw execution capability** — the
ability to run bytes as instructions. RVF becomes **self-bootstrapping**: a single
file that contains both its data and the complete machinery to process that data.
This is the transition from "needs a compatible runtime" to **"runs anywhere
compute exists."**
## 2. Architecture
### The Bootstrap Stack
```
Layer 3: RVF Data Segments (VEC_SEG, INDEX_SEG, MANIFEST_SEG, ...)
^
| processes
|
Layer 2: WASM Microkernel (WASM_SEG, role=Microkernel, ~5.5 KB)
^ 14 exports: query, ingest, distance, top-K
| executes
|
Layer 1: WASM Interpreter (WASM_SEG, role=Interpreter, ~50 KB)
^ Minimal stack machine that runs WASM bytecode
| loads
|
Layer 0: Raw Bytes (The .rvf file on any storage medium)
```
Each layer depends only on the one below it. The host reads Layer 0 (raw bytes),
finds the interpreter at Layer 1, uses it to execute the microkernel at Layer 2,
which then processes the data at Layer 3.
### Segment Layout
```
┌──────────────────────────────────────────────────────────────────────┐
│ bootable.rvf │
│ │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ ┌─────────┐ │
│ │ WASM_SEG │ │ WASM_SEG │ │ VEC_SEG │ │ INDEX │ │
│ │ 0x10 │ │ 0x10 │ │ 0x01 │ │ _SEG │ │
│ │ │ │ │ │ │ │ 0x02 │ │
│ │ role=Interp │ │ role=uKernel │ │ 10M vectors │ │ HNSW │ │
│ │ ~50 KB │ │ ~5.5 KB │ │ 384-dim fp16 │ │ L0+L1 │ │
│ │ priority=0 │ │ priority=1 │ │ │ │ │ │
│ └──────────────┘ └──────────────┘ └──────────────┘ └─────────┘ │
│ │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ QUANT_SEG │ │ WITNESS_SEG │ │ MANIFEST_SEG │ ← tail │
│ │ codebooks │ │ audit trail │ │ source of │ │
│ │ │ │ │ │ truth │ │
│ └──────────────┘ └──────────────┘ └──────────────┘ │
└──────────────────────────────────────────────────────────────────────┘
```
## 3. WASM_SEG Wire Format
### Segment Type
```
Value: 0x10
Name: WASM_SEG
```
Uses the standard 64-byte RVF segment header (`SegmentHeader`), followed by
a 64-byte `WasmHeader`, followed by the WASM bytecode.
### WasmHeader (64 bytes)
```
Offset Size Type Field Description
------ ---- ---- ----- -----------
0x00 4 u32 wasm_magic 0x5256574D ("RVWM" big-endian)
0x04 2 u16 header_version Currently 1
0x06 1 u8 role Bootstrap role (see WasmRole enum)
0x07 1 u8 target Target platform (see WasmTarget enum)
0x08 2 u16 required_features WASM feature bitfield
0x0A 2 u16 export_count Number of WASM exports
0x0C 4 u32 bytecode_size Uncompressed bytecode size (bytes)
0x10 4 u32 compressed_size Compressed size (0 = no compression)
0x14 1 u8 compression 0=none, 1=LZ4, 2=ZSTD
0x15 1 u8 min_memory_pages Minimum linear memory (64 KB each)
0x16 1 u8 max_memory_pages Maximum linear memory (0 = no limit)
0x17 1 u8 table_count Number of WASM tables
0x18 32 hash256 bytecode_hash SHAKE-256-256 of uncompressed bytecode
0x38 1 u8 bootstrap_priority Lower = tried first in chain
0x39 1 u8 interpreter_type Interpreter variant (if role=Interpreter)
0x3A 6 u8[6] reserved Must be zero
```
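A sketch of parsing the fixed fields of this header. It assumes multi-byte fields are stored little-endian, consistent with the rest of RVF, so the magic bytes on disk read "RVWM" in order; only a few fields are decoded here:

```rust
/// A few fixed fields of the 64-byte WasmHeader, decoded for illustration.
struct WasmHeader {
    role: u8,
    target: u8,
    required_features: u16,
    bytecode_size: u32,
    bootstrap_priority: u8,
}

fn parse_wasm_header(h: &[u8]) -> Option<WasmHeader> {
    if h.len() < 64 {
        return None; // header is fixed at 64 bytes
    }
    let magic = u32::from_le_bytes(h[0..4].try_into().unwrap());
    if magic != 0x5256574D {
        return None; // not "RVWM"
    }
    Some(WasmHeader {
        role: h[0x06],
        target: h[0x07],
        required_features: u16::from_le_bytes(h[0x08..0x0A].try_into().unwrap()),
        bytecode_size: u32::from_le_bytes(h[0x0C..0x10].try_into().unwrap()),
        bootstrap_priority: h[0x38],
    })
}
```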
### WasmRole Enum
```
Value Name Description
----- ---- -----------
0x00 Microkernel RVF query engine (5.5 KB Cognitum tile runtime)
0x01 Interpreter Minimal WASM interpreter for self-bootstrapping
0x02 Combined Interpreter + microkernel linked together
0x03 Extension Domain-specific module (custom distance, decoder)
0x04 ControlPlane Store management (create, export, segment parsing)
```
### WasmTarget Enum
```
Value Name Description
----- ---- -----------
0x00 Wasm32 Generic wasm32 (any compliant runtime)
0x01 WasiP1 WASI Preview 1 (requires WASI syscalls)
0x02 WasiP2 WASI Preview 2 (component model)
0x03 Browser Browser-optimized (expects Web APIs)
0x04 BareTile Bare-metal Cognitum tile (hub-tile protocol only)
```
### Required Features Bitfield
```
Bit Mask Feature
--- ---- -------
0 0x0001 SIMD (v128 operations)
1 0x0002 Bulk memory operations
2 0x0004 Multi-value returns
3 0x0008 Reference types
4 0x0010 Threads (shared memory)
5 0x0020 Tail call optimization
6 0x0040 GC (garbage collection)
7 0x0080 Exception handling
```
### Interpreter Type (when role=Interpreter)
```
Value Name Description
----- ---- -----------
0x00 StackMachine Generic stack-based interpreter
0x01 Wasm3Compatible wasm3-style (register machine)
0x02 WamrCompatible WAMR-style (AOT + interpreter)
0x03 WasmiCompatible wasmi-style (pure stack machine)
```
## 4. Bootstrap Resolution Protocol
### Discovery
1. Scan all segments for `seg_type == 0x10` (WASM_SEG)
2. Parse the 64-byte WasmHeader from each
3. Validate `wasm_magic == 0x5256574D`
4. Sort by `bootstrap_priority` ascending
### Resolution
```
IF any WASM_SEG has role=Combined:
→ SelfContained bootstrap (single module does everything)
ELIF WASM_SEGs with role=Interpreter AND role=Microkernel both exist:
→ TwoStage bootstrap (interpreter runs microkernel)
ELIF only WASM_SEG with role=Microkernel exists:
→ HostRequired (needs external WASM runtime)
ELSE:
→ No WASM bootstrap available
```
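The resolution rules above can be written as a pure function over the discovered roles; the enum and function names are illustrative (the shipped API calls this `resolve_bootstrap_chain`):

```rust
#[derive(Debug, PartialEq)]
enum Bootstrap {
    SelfContained,
    TwoStage,
    HostRequired,
    None,
}

/// Resolve the bootstrap chain from the roles of discovered WASM_SEGs
/// (already sorted by bootstrap_priority). Role values follow the
/// WasmRole enum: 0x00 Microkernel, 0x01 Interpreter, 0x02 Combined.
fn resolve_bootstrap(roles: &[u8]) -> Bootstrap {
    let has = |r: u8| roles.contains(&r);
    if has(0x02) {
        Bootstrap::SelfContained
    } else if has(0x01) && has(0x00) {
        Bootstrap::TwoStage
    } else if has(0x00) {
        Bootstrap::HostRequired
    } else {
        Bootstrap::None
    }
}
```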
### Execution Sequence (Two-Stage)
```
Host Interpreter Microkernel Data
| | | |
|-- read WASM_SEG[0] --->| | |
| (interpreter bytes) | | |
| | | |
|-- instantiate -------->| | |
| (load into memory) | | |
| | | |
|-- feed WASM_SEG[1] --->|-- instantiate -------->| |
| (microkernel bytes) | (via interpreter) | |
| | | |
|-- LOAD_QUERY --------->|------- forward ------->| |
| | |-- read VEC_SEG -->|
| | |<- vector block ---|
| | | |
| | | rvf_distances() |
| | | rvf_topk_merge() |
| | | |
|<-- TOPK_RESULT --------|<------ return ---------| |
```
## 5. Size Budget
### Microkernel (role=Microkernel)
Already specified in `microkernel/wasm-runtime.md`:
```
Total: ~5,500 bytes (< 8 KB code budget)
Exports: 14 (query path + quantization + HNSW + verification)
Memory: 8 KB data + 64 KB SIMD scratch
```
### Interpreter (role=Interpreter)
Target: minimal WASM bytecode interpreter sufficient to run the microkernel.
```
Component Estimated Size
--------- --------------
WASM binary parser 4 KB
(magic, section parsing)
Type section decoder 1 KB
(function types)
Import/Export resolution 2 KB
Code section interpreter 12 KB
(control flow, locals)
Stack machine engine 8 KB
(operand stack, call stack)
Memory management 3 KB
(linear memory, grow)
i32/i64 integer ops 4 KB
(add, sub, mul, div, rem, shifts)
f32/f64 float ops 6 KB
(add, sub, mul, div, sqrt, conversions)
v128 SIMD ops (optional) 8 KB
(only if WASM_FEAT_SIMD required)
Table + call_indirect 2 KB
----------
Total (no SIMD): ~42 KB
Total (with SIMD): ~50 KB
```
### Combined (role=Combined)
Interpreter linked with microkernel in a single module:
```
Total: ~48-56 KB (interpreter + microkernel, with overlap eliminated)
```
### Self-Bootstrapping Overhead
For a 10M vector file (~7.3 GB at 384-dim fp16):
- Bootstrap overhead: ~56 KB / ~7.3 GB = **0.0008%**
- The file is 99.9992% data, 0.0008% self-sufficient runtime
For a 1000-vector file (~750 KB):
- Bootstrap overhead: ~56 KB / ~750 KB = **7.5%**
- Still practical for edge/IoT deployments
## 6. Execution Tiers (Extended)
The original three-tier model from ADR-030 is extended:
| Tier | Segment | Size | Boot | Self-Bootstrap? |
|------|---------|------|------|-----------------|
| 0: Embedded WASM Interpreter | WASM_SEG (role=Interpreter) | ~50 KB | <5 ms | **Yes** — file carries its own runtime |
| 1: WASM Microkernel | WASM_SEG (role=Microkernel) | 5.5 KB | <1 ms | No — needs host or Tier 0 |
| 2: eBPF | EBPF_SEG | 10-50 KB | <20 ms | No — needs Linux kernel |
| 3: Unikernel | KERNEL_SEG | 200 KB-2 MB | <125 ms | No — needs VMM (Firecracker) |
**Key insight**: Tier 0 makes all other tiers optional. An RVF file with
Tier 0 embedded runs on *any* host that can execute bytes — bare metal,
browser, microcontroller, FPGA with a soft CPU, or even another WASM runtime.
## 7. "Runs Anywhere Compute Exists"
### What This Means
A self-bootstrapping RVF file requires exactly **one capability** from its host:
> The ability to read bytes from storage and execute them as instructions.
That's it. No operating system. No file system. No network stack. No runtime
library. No package manager. No container engine.
### Where It Runs
| Host | How It Works |
|------|-------------|
| **x86 server** | Native WASM runtime (Wasmtime/WAMR) runs microkernel directly |
| **ARM edge device** | Same — native WASM runtime |
| **Browser tab** | `WebAssembly.instantiate()` on the microkernel bytes |
| **Microcontroller** | Embedded interpreter runs microkernel in 64 KB scratch |
| **FPGA soft CPU** | Interpreter mapped to BRAM, microkernel in flash |
| **Another WASM runtime** | Interpreter-in-WASM runs microkernel-in-WASM (turtles) |
| **Bare metal** | Bootloader extracts interpreter, interpreter runs microkernel |
| **TEE enclave** | Enclave loads interpreter, verified via WITNESS_SEG attestation |
### The Bootstrapping Invariant
For any host `H` with execution capability `E`:
```
∀ H, E: can_execute(H, E) ∧ can_read_bytes(H)
→ can_process_rvf(H, self_bootstrapping_rvf_file)
```
The file is a **fixed point** of the execution relation: it contains everything
needed to process itself.
## 8. Security Considerations
### Interpreter Verification
The embedded interpreter's bytecode is hashed with SHAKE-256-256 and stored
in the WasmHeader (`bytecode_hash`). A WITNESS_SEG can chain the interpreter
hash to a trusted build, providing:
- **Provenance**: Who built this interpreter?
- **Integrity**: Has the interpreter been modified?
- **Attestation**: Can a TEE verify the interpreter before execution?
### Sandbox Guarantees
The WASM sandbox model applies at every layer:
- The interpreter cannot access host memory beyond its linear memory
- The microkernel cannot access interpreter memory
- Each layer communicates only through defined exports/imports
- A trapped module cannot corrupt other modules
### Bootstrap Attack Surface
| Attack | Mitigation |
|--------|-----------|
| Malicious interpreter | Verify `bytecode_hash` against known-good hash in WITNESS_SEG |
| Modified microkernel | Interpreter verifies microkernel hash before instantiation |
| Data corruption | Segment-level CRC32C/SHAKE-256 hashes (Law 2) |
| Code injection | WASM validates all code at load time (type checking) |
| Resource exhaustion | `max_memory_pages` cap, epoch-based interruption |
## 9. API
### Rust (rvf-runtime)
```rust
// Embed a WASM module
store.embed_wasm(
role: WasmRole::Microkernel as u8,
target: WasmTarget::Wasm32 as u8,
required_features: WASM_FEAT_SIMD,
wasm_bytecode: &microkernel_bytes,
export_count: 14,
bootstrap_priority: 1,
interpreter_type: 0,
)?;
// Make self-bootstrapping
store.embed_wasm(
role: WasmRole::Interpreter as u8,
target: WasmTarget::Wasm32 as u8,
required_features: 0,
wasm_bytecode: &interpreter_bytes,
export_count: 3,
bootstrap_priority: 0,
interpreter_type: 0x03, // wasmi-compatible
)?;
// Check if file is self-bootstrapping
assert!(store.is_self_bootstrapping());
// Extract all WASM modules (ordered by priority)
let modules = store.extract_wasm_all()?;
```
### WASM (rvf-wasm bootstrap module)
```rust
use rvf_wasm::bootstrap::{resolve_bootstrap_chain, get_bytecode, BootstrapChain};
let chain = resolve_bootstrap_chain(&rvf_bytes);
match chain {
BootstrapChain::SelfContained { combined } => {
let bytecode = get_bytecode(&rvf_bytes, &combined).unwrap();
// Instantiate and run
}
BootstrapChain::TwoStage { interpreter, microkernel } => {
let interp_code = get_bytecode(&rvf_bytes, &interpreter).unwrap();
let kernel_code = get_bytecode(&rvf_bytes, &microkernel).unwrap();
// Load interpreter, then use it to run microkernel
}
_ => { /* use host runtime */ }
}
```
## 10. Relationship to Existing Segments
| Segment | Relationship to WASM_SEG |
|---------|-------------------------|
| KERNEL_SEG (0x0E) | Alternative execution tier — KERNEL_SEG boots a full unikernel, WASM_SEG runs a lightweight microkernel. Both make the file self-executing but at different capability levels. |
| EBPF_SEG (0x0F) | Complementary — eBPF accelerates hot-path queries on Linux hosts while WASM provides universal portability. |
| WITNESS_SEG (0x0A) | Verification — WITNESS_SEG chains can attest the interpreter and microkernel hashes, providing a trust anchor for the bootstrap chain. |
| CRYPTO_SEG (0x0C) | Signing — CRYPTO_SEG key material can sign WASM_SEG contents for tamper detection. |
| MANIFEST_SEG (0x05) | Discovery — the tail manifest references all WASM_SEGs with their roles and priorities. |
## 11. Implementation Status
| Component | Crate | Status |
|-----------|-------|--------|
| `SegmentType::Wasm` (0x10) | `rvf-types` | Implemented |
| `WasmHeader` (64-byte header) | `rvf-types` | Implemented |
| `WasmRole`, `WasmTarget` enums | `rvf-types` | Implemented |
| `write_wasm_seg` | `rvf-runtime` | Implemented |
| `embed_wasm` / `extract_wasm` | `rvf-runtime` | Implemented |
| `extract_wasm_all` (priority-sorted) | `rvf-runtime` | Implemented |
| `is_self_bootstrapping` | `rvf-runtime` | Implemented |
| `resolve_bootstrap_chain` | `rvf-wasm` | Implemented |
| `get_bytecode` (zero-copy extraction) | `rvf-wasm` | Implemented |
| Embedded interpreter (wasmi-based) | `rvf-wasm` | Future |
| Combined interpreter+microkernel build | `rvf-wasm` | Future |
# RVF Wire Format Reference
## 1. File Structure
An RVF file is a byte stream with no fixed header at offset 0. All structure
is discovered from the tail.
```
Byte 0 EOF
| |
v v
+--------+--------+--------+ +--------+---------+--------+---------+
| Seg 0 | Seg 1 | Seg 2 | ... | Seg N | Seg N+1 | Seg N+2| Mfst K |
| VEC | VEC | INDEX | | VEC | HOT | INDEX | MANIF |
+--------+--------+--------+ +--------+---------+--------+---------+
^ ^
| |
Level 1 Mfst |
Level 0
(last 4KB)
```
### Alignment Rule
Every segment starts at a **64-byte aligned** boundary. If a segment's
payload + footer does not end on a 64-byte boundary, zero-padding is inserted
before the next segment header.
### Byte Order
All multi-byte integers are **little-endian**. All floating-point values
are IEEE 754 little-endian. This matches x86, ARM (in default mode), and
WASM native byte order.
## 2. Primitive Types
```
Type Size Encoding
---- ---- --------
u8 1 Unsigned 8-bit integer
u16 2 Unsigned 16-bit little-endian
u32 4 Unsigned 32-bit little-endian
u64 8 Unsigned 64-bit little-endian
i32 4 Signed 32-bit little-endian (two's complement)
i64 8 Signed 64-bit little-endian (two's complement)
f16 2 IEEE 754 half-precision little-endian
f32 4 IEEE 754 single-precision little-endian
f64 8 IEEE 754 double-precision little-endian
varint 1-10 LEB128 unsigned variable-length integer
svarint 1-10 ZigZag + LEB128 signed variable-length integer
hash128 16 First 128 bits of hash output
hash256 32 First 256 bits of hash output
```
### Varint Encoding (LEB128)
```
Value 0-127: 1 byte [0xxxxxxx]
Value 128-16383: 2 bytes [1xxxxxxx 0xxxxxxx]
Value 16384-2097151: 3 bytes [1xxxxxxx 1xxxxxxx 0xxxxxxx]
...up to 10 bytes for u64
```
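A reference encoder/decoder for this scheme, as a Python sketch (not taken from any RVF crate):

```python
def encode_varint(value: int) -> bytes:
    """LEB128 unsigned: 7 payload bits per byte, MSB = continuation."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)         # final byte has MSB clear
            return bytes(out)

def decode_varint(buf: bytes, pos: int = 0):
    """Decode one varint starting at pos; return (value, next_pos)."""
    result = shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not (byte & 0x80):
            return result, pos
        shift += 7
```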
### Delta Encoding
Sequences of sorted integers use delta encoding:
```
Original: [100, 105, 108, 120, 200]
Deltas: [100, 5, 3, 12, 80]
Encoded: [varint(100), varint(5), varint(3), varint(12), varint(80)]
```
With restart points every N entries, the first value in each restart group
is absolute (not delta-encoded).
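In Python, the restart-group scheme looks like this (a sketch; varint packing of the resulting deltas is omitted):

```python
def delta_encode(ids, restart_interval=4):
    """Delta-encode a sorted ID list; every restart_interval-th entry
    is stored absolute so a reader can start decoding mid-stream."""
    out, prev = [], 0
    for i, v in enumerate(ids):
        if i % restart_interval == 0:
            out.append(v)         # restart point: absolute value
        else:
            out.append(v - prev)  # otherwise: delta from previous
        prev = v
    return out

def delta_decode(deltas, restart_interval=4):
    out, prev = [], 0
    for i, d in enumerate(deltas):
        v = d if i % restart_interval == 0 else prev + d
        out.append(v)
        prev = v
    return out
```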
## 3. Segment Header (64 bytes)
```
Offset Type Field Notes
------ ---- ----- -----
0x00    u8[4]   magic            Bytes "RVFS" (0x52 0x56 0x46 0x53)
0x04 u8 version Format version (1)
0x05 u8 seg_type Segment type enum
0x06 u16 flags See flags bitfield
0x08 u64 segment_id Monotonic ordinal
0x10 u64 payload_length Bytes after header, before footer
0x18 u64 timestamp_ns UNIX nanoseconds
0x20 u8 checksum_algo 0=CRC32C, 1=XXH3-128, 2=SHAKE-256
0x21 u8 compression 0=none, 1=LZ4, 2=ZSTD, 3=custom
0x22 u16 reserved_0 Must be 0x0000
0x24 u32 reserved_1 Must be 0x00000000
0x28 hash128 content_hash Payload hash (first 128 bits)
0x38 u32 uncompressed_len Original payload size (0 if no compression)
0x3C u32 alignment_pad Zero padding to 64B boundary
```
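The layout can be cross-checked with Python's `struct` module; with the `<` (little-endian, packed) prefix the fields land at exactly the offsets above. An illustrative sketch, not a reference implementation:

```python
import struct

# "<": little-endian, no implicit padding, so offsets match the table
SEG_HEADER = struct.Struct("<4sBBHQQQBBHI16sII")

def pack_header(seg_type, segment_id, payload_len, timestamp_ns,
                content_hash, flags=0, checksum_algo=1, compression=0):
    return SEG_HEADER.pack(
        b"RVFS",        # magic
        1,              # version
        seg_type,       # e.g. 0x01 = VEC_SEG
        flags,
        segment_id,
        payload_len,
        timestamp_ns,
        checksum_algo,  # 1 = XXH3-128
        compression,    # 0 = none
        0,              # reserved_0
        0,              # reserved_1
        content_hash,   # first 128 bits of the payload hash
        0,              # uncompressed_len (0 = not compressed)
        0,              # alignment_pad
    )
```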
### Segment Type Enum
```
0x00 INVALID Not a valid segment
0x01 VEC_SEG Vector payloads
0x02 INDEX_SEG HNSW adjacency
0x03 OVERLAY_SEG Graph overlay deltas
0x04 JOURNAL_SEG Metadata mutations
0x05 MANIFEST_SEG Segment directory
0x06 QUANT_SEG Quantization dictionaries
0x07 META_SEG Key-value metadata
0x08 HOT_SEG Temperature-promoted data
0x09 SKETCH_SEG Access counter sketches
0x0A WITNESS_SEG Capability manifests
0x0B PROFILE_SEG Domain profile declarations
0x0C CRYPTO_SEG Key material / certificate anchors
0x0D            reserved
0x0E  KERNEL_SEG    Unikernel boot image
0x0F  EBPF_SEG      eBPF hot-path programs
0x10  WASM_SEG      Embedded WASM modules
0x11-0xEF       reserved
0xF0-0xFF extension Implementation-specific
```
### Flags Bitfield
```
Bit Mask Name Meaning
--- ---- ---- -------
0 0x0001 COMPRESSED Payload compressed per compression field
1 0x0002 ENCRYPTED Payload encrypted (key in CRYPTO_SEG)
2 0x0004 SIGNED Signature footer follows payload
3 0x0008 SEALED Immutable (compaction output)
4 0x0010 PARTIAL Partial/streaming write
5 0x0020 TOMBSTONE Logically deletes prior segment
6 0x0040 HOT Contains hot-tier data
7 0x0080 OVERLAY Contains overlay/delta data
8 0x0100 SNAPSHOT Full snapshot (not delta)
9 0x0200 CHECKPOINT Safe rollback point
10-15 reserved Must be zero
```
## 4. Signature Footer
Present only if `SIGNED` flag is set. Follows immediately after the payload.
```
Offset Type Field Notes
------ ---- ----- -----
0x00 u16 sig_algo 0=Ed25519, 1=ML-DSA-65, 2=SLH-DSA-128s
0x02 u16 sig_length Signature byte length
0x04 u8[] signature Signature bytes
var u32 footer_length Total footer size (for backward scan)
```
### Signature Algorithm Sizes
| Algorithm | sig_length | Post-Quantum | Performance |
|-----------|-----------|-------------|-------------|
| Ed25519 | 64 B | No | ~76,000 sign/s |
| ML-DSA-65 | 3,309 B | Yes (NIST Level 3) | ~4,500 sign/s |
| SLH-DSA-128s | 7,856 B | Yes (NIST Level 1) | ~350 sign/s |
## 5. VEC_SEG Payload Layout
Vector segments store blocks of vectors in columnar layout for compression.
```
+------------------------------------------+
| VEC_SEG Payload |
+------------------------------------------+
| Block Directory |
| block_count: u32 |
| For each block: |
| block_offset: u32 (from payload start)|
| vector_count: u32 |
| dim: u16 |
| dtype: u8 |
| tier: u8 |
| [64B aligned] |
+------------------------------------------+
| Block 0 |
| +-- Columnar Vectors --+ |
| | dim_0[0..count] | <- all vals |
| | dim_1[0..count] | for dim 0 |
| | ... | then dim 1 |
| | dim_D[0..count] | etc. |
| +----------------------+ |
| +-- ID Map --+ |
| | encoding: u8 (0=raw, 1=delta-varint) |
| | restart_interval: u16 |
| | id_count: u32 |
| | [restart_offsets: u32[]] (if delta) |
| | [ids: encoded] |
| +-----------+ |
| +-- Block CRC --+ |
| | crc32c: u32 | |
| +----------------+ |
| [64B padding] |
+------------------------------------------+
| Block 1 |
| ... |
+------------------------------------------+
```
### Data Type Enum
```
0x00 f32 32-bit float
0x01 f16 16-bit float
0x02 bf16 bfloat16
0x03 i8 signed 8-bit integer (scalar quantized)
0x04 u8 unsigned 8-bit integer
0x05 i4 4-bit integer (packed, 2 per byte)
0x06 binary 1-bit (packed, 8 per byte)
0x07 pq Product-quantized codes
0x08 custom Custom encoding (see QUANT_SEG)
```
### Columnar vs Interleaved
**VEC_SEG** (columnar): `dim_0[all], dim_1[all], ..., dim_D[all]`
- Better compression (similar values adjacent)
- Better for batch operations
- Worse for single-vector random access
**HOT_SEG** (interleaved): `vec_0[all_dims], vec_1[all_dims], ...`
- Better for single-vector access (one cache line per vector)
- Better for top-K refinement (sequential scan)
- No compression benefit
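The two layouts are transposes of each other; a Python sketch of the conversion:

```python
def to_columnar(vectors):
    """Transpose row-major vectors [[v0d0, v0d1, ...], ...] into the
    VEC_SEG columnar layout [[v0d0, v1d0, ...], [v0d1, v1d1, ...], ...]."""
    return [list(col) for col in zip(*vectors)]

def to_interleaved(columns):
    """Inverse transform, back to the HOT_SEG row-major layout."""
    return [list(row) for row in zip(*columns)]
```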
## 6. INDEX_SEG Payload Layout
```
+------------------------------------------+
| INDEX_SEG Payload |
+------------------------------------------+
| Index Header |
| index_type: u8 (0=HNSW, 1=IVF, 2=flat)|
| layer_level: u8 (A=0, B=1, C=2) |
| M: u16 (HNSW max neighbors per layer) |
| ef_construction: u32 |
| node_count: u64 |
| [64B aligned] |
+------------------------------------------+
| Restart Point Index |
| restart_interval: u32 |
| restart_count: u32 |
| [restart_offset: u32] * count |
| [64B aligned] |
+------------------------------------------+
| Adjacency Data |
| For each node (sorted by node_id): |
| layer_count: varint |
| For each layer: |
| neighbor_count: varint |
| [delta_neighbor_id: varint] * cnt |
| [64B padding per restart group] |
+------------------------------------------+
| Prefetch Hints (optional) |
| hint_count: u32 |
| For each hint: |
| node_range_start: u64 |
| node_range_end: u64 |
| page_offset: u64 |
| page_count: u32 |
| prefetch_ahead: u32 |
| [64B aligned] |
+------------------------------------------+
```
## 7. HOT_SEG Payload Layout
The hot segment stores the most-accessed vectors in interleaved (row-major)
layout with their neighbor lists co-located for cache locality.
```
+------------------------------------------+
| HOT_SEG Payload |
+------------------------------------------+
| Hot Header |
| vector_count: u32 |
| dim: u16 |
| dtype: u8 (f16 or i8) |
| neighbor_M: u16 |
| [64B aligned] |
+------------------------------------------+
| Interleaved Hot Data |
| For each hot vector: |
| vector_id: u64 |
| vector: [dtype * dim] |
| neighbor_count: u16 |
| [neighbor_id: u64] * neighbor_count |
| [64B aligned per entry] |
+------------------------------------------+
```
Each hot entry is self-contained: vector + neighbors in one contiguous block.
A sequential scan of the HOT_SEG for top-K refinement reads vectors and
neighbors without any pointer chasing.
### Hot Entry Size Example
For 384-dim fp16 vectors with M=16 neighbors:
```
8 (id) + 768 (vector) + 2 (count) + 128 (neighbors) = 906 bytes
Padded to 64B: 960 bytes per entry
```
1000 hot vectors = 960 KB (fits in L2 cache on most CPUs).
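The arithmetic above generalizes to a one-line formula (a sketch; the helper name is illustrative):

```python
def hot_entry_size(dim: int, dtype_bytes: int, neighbor_count: int) -> int:
    """Bytes per HOT_SEG entry: u64 id + vector + u16 neighbor count +
    u64 neighbor IDs, rounded up to the next 64-byte boundary."""
    raw = 8 + dim * dtype_bytes + 2 + 8 * neighbor_count
    return (raw + 63) // 64 * 64
```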
## 8. MANIFEST_SEG Payload Layout
```
+------------------------------------------+
| MANIFEST_SEG Payload |
+------------------------------------------+
| TLV Records (Level 1 manifest) |
| For each record: |
| tag: u16 |
| length: u32 |
| pad: u16 (to 8B alignment) |
| value: [u8; length] |
| [8B aligned] |
+------------------------------------------+
| Level 0 Root Manifest (last 4096 bytes) |
| (See 02-manifest-system.md for layout) |
+------------------------------------------+
```
## 9. SKETCH_SEG Payload Layout
```
+------------------------------------------+
| SKETCH_SEG Payload |
+------------------------------------------+
| Sketch Header |
| block_count: u32 |
| width: u32 (counters per row) |
| depth: u32 (hash functions) |
| counter_bits: u8 (8 or 16) |
| decay_shift: u8 (aging right-shift) |
| total_accesses: u64 |
| [64B aligned] |
+------------------------------------------+
| Sketch Data |
| For each block: |
| block_id: u32 |
|     counters: [u8 or u16; width * depth]|
| [64B aligned per block] |
+------------------------------------------+
```
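The header fields map directly onto a count-min sketch with saturating counters and shift-based aging. A Python sketch under those parameters (the hash construction here is illustrative; the on-disk format does not mandate one):

```python
import hashlib

class CountMinSketch:
    """Count-min sketch: `width` counters per row, `depth` hash rows,
    saturating 8-bit counters, aging via right-shift (decay_shift)."""

    def __init__(self, width=256, depth=4, counter_max=255):
        self.width, self.depth, self.counter_max = width, depth, counter_max
        self.rows = [[0] * width for _ in range(depth)]

    def _index(self, key: int, row: int) -> int:
        # One independent hash per row, derived via a per-row salt
        h = hashlib.blake2b(key.to_bytes(8, "little"), digest_size=8,
                            salt=row.to_bytes(4, "little")).digest()
        return int.from_bytes(h, "little") % self.width

    def record(self, key: int):
        for r in range(self.depth):
            i = self._index(key, r)
            if self.rows[r][i] < self.counter_max:  # saturating increment
                self.rows[r][i] += 1

    def estimate(self, key: int) -> int:
        # Minimum across rows bounds the overestimate from collisions
        return min(self.rows[r][self._index(key, r)] for r in range(self.depth))

    def decay(self, shift: int = 1):
        """Aging: right-shift every counter by decay_shift bits."""
        for row in self.rows:
            for i in range(len(row)):
                row[i] >>= shift
```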
## 10. QUANT_SEG Payload Layout
```
+------------------------------------------+
| QUANT_SEG Payload |
+------------------------------------------+
| Quant Header |
| quant_type: u8 |
| 0 = scalar (min-max per dim) |
| 1 = product quantization |
| 2 = binary threshold |
| 3 = residual PQ |
| tier: u8 |
| dim: u16 |
| [64B aligned] |
+------------------------------------------+
| Type-specific data: |
| |
| Scalar (type 0): |
| min: [f32; dim] |
| max: [f32; dim] |
| |
| PQ (type 1): |
| M: u16 (subspaces) |
| K: u16 (centroids per sub) |
| sub_dim: u16 (dims per sub) |
| codebook: [f32; M * K * sub_dim] |
| |
| Binary (type 2): |
| threshold: [f32; dim] |
| |
| Residual PQ (type 3): |
| coarse_centroids: [f32; K_coarse * dim]|
| residual_codebook: [f32; M * K * sub] |
| |
| [64B aligned] |
+------------------------------------------+
```
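For the scalar (type 0) scheme, a minimal Python encode/decode sketch: only the per-dimension `min`/`max` arrays need to be persisted in the QUANT_SEG, while the u8 codes live in VEC_SEG blocks.

```python
def scalar_quantize(vectors, dim):
    """Per-dimension min-max scalar quantization: each f32 maps to a
    u8 code in [0, 255]."""
    mins = [min(v[d] for v in vectors) for d in range(dim)]
    maxs = [max(v[d] for v in vectors) for d in range(dim)]
    codes = []
    for v in vectors:
        row = []
        for d in range(dim):
            span = maxs[d] - mins[d] or 1.0  # guard constant dimensions
            row.append(round((v[d] - mins[d]) / span * 255))
        codes.append(row)
    return mins, maxs, codes

def scalar_dequantize(mins, maxs, codes):
    """Approximate reconstruction from u8 codes and the stored ranges."""
    return [[mins[d] + c[d] / 255 * (maxs[d] - mins[d])
             for d in range(len(mins))] for c in codes]
```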
## 11. Checksum Algorithms
| ID | Algorithm | Output | Speed (HW accel) | Use Case |
|----|-----------|--------|-------------------|----------|
| 0 | CRC32C | 4 B (stored in 16B field, zero-padded) | ~3 GB/s (SSE4.2) | Per-block integrity |
| 1 | XXH3-128 | 16 B | ~50 GB/s (AVX2) | Segment content hash |
| 2 | SHAKE-256 | 16 or 32 B | ~1 GB/s | Cryptographic verification |
Default recommendation:
- Block-level CRC: CRC32C (fastest, hardware accelerated)
- Segment content hash: XXH3-128 (fast, good distribution)
- Crypto witness hashes: SHAKE-256 (post-quantum safe)
## 12. Compression
| ID | Algorithm | Ratio | Decompress Speed | Use Case |
|----|-----------|-------|-----------------|----------|
| 0 | None | 1.0x | N/A | Hot tier |
| 1 | LZ4 | 1.5-3x | ~4 GB/s | Warm tier, low latency |
| 2 | ZSTD | 3-6x | ~1.5 GB/s | Cold tier, high ratio |
| 3 | Custom | Varies | Varies | Domain-specific |
Compression is applied per-segment payload. Individual blocks within a
segment share the same compression.
## 13. Tail Scan Algorithm
```python
import os

def find_latest_manifest(file):
    file_size = file.seek(0, os.SEEK_END)
    # Fast path: the Level 0 root manifest occupies the last 4096 bytes
    if file_size >= 4096:
        file.seek(file_size - 4096)
        root = file.read(4096)
        if root[0:4] == b'RVM0' and verify_crc(root):
            return parse_root_manifest(root)
    # Slow path: scan backward for a MANIFEST_SEG header
    scan_pos = (file_size // 64 - 1) * 64   # last 64B-aligned boundary
    while scan_pos >= 0:
        file.seek(scan_pos)
        header = file.read(64)
        if (header[0:4] == b'RVFS' and
                header[5] == 0x05 and       # seg_type == MANIFEST_SEG
                verify_segment_header(header)):
            return parse_manifest_segment(file, scan_pos)
        scan_pos -= 64                      # previous 64B boundary
    raise CorruptFileError("No valid MANIFEST_SEG found")
```
Worst case: full backward scan at 64B granularity. For a 4 GB file, this is
67M checks — but each check is a 4-byte comparison, so it completes in ~100ms
on a modern CPU with mmap. In practice, the fast path succeeds on the first try
for non-corrupt files.