Merge commit 'd803bfe2b1fe7f5e219e50ac20d6801a0a58ac75' as 'vendor/ruvector'
87  vendor/ruvector/docs/research/rvf/INDEX.md  vendored  Normal file
@@ -0,0 +1,87 @@
# RVF: RuVector Format

## A Living, Self-Reorganizing Runtime Substrate for Vector Intelligence

---

### Document Index

#### Core Specification (`spec/`)

| # | Document | Description |
|---|----------|-------------|
| 00 | [Overview](spec/00-overview.md) | The Four Laws, design coordinates, philosophy |
| 01 | [Segment Model](spec/01-segment-model.md) | Append-only segments, headers, lifecycle, multi-file |
| 02 | [Manifest System](spec/02-manifest-system.md) | Two-level manifests, hotset pointers, progressive boot |
| 03 | [Temperature Tiering](spec/03-temperature-tiering.md) | Adaptive layout, access sketches, promotion/demotion |
| 04 | [Progressive Indexing](spec/04-progressive-indexing.md) | Layer A/B/C availability, lazy build, partial search |
| 05 | [Overlay Epochs](spec/05-overlay-epochs.md) | Streaming min-cut, epoch boundaries, rollback |
| 06 | [Query Optimization](spec/06-query-optimization.md) | SIMD alignment, prefetch, varint IDs, cache analysis |
| 07 | [Deletion & Lifecycle](spec/07-deletion-lifecycle.md) | Vector deletion, JOURNAL_SEG wire format, deletion bitmaps, compaction |
| 08 | [Filtered Search](spec/08-filtered-search.md) | META_SEG wire format, filter expressions, metadata indexes |
| 09 | [Concurrency & Versioning](spec/09-concurrency-versioning.md) | Writer locking, reader-writer coordination, space reclamation |
| 10 | [Operations API](spec/10-operations-api.md) | Batch ops, error codes, network streaming, compaction scheduling |

#### Wire Format (`wire/`)

| Document | Description |
|----------|-------------|
| [Binary Layout](wire/binary-layout.md) | Byte-level format reference, all segment payloads |

#### WASM Microkernel (`microkernel/`)

| Document | Description |
|----------|-------------|
| [WASM Runtime](microkernel/wasm-runtime.md) | Cognitum tile mapping, 14 exports, hub-tile protocol |

#### Domain Profiles (`profiles/`)

| Document | Description |
|----------|-------------|
| [Domain Profiles](profiles/domain-profiles.md) | RVDNA, RVText, RVGraph, RVVision specifications |

#### Cryptography (`crypto/`)

| Document | Description |
|----------|-------------|
| [Quantum Signatures](crypto/quantum-signatures.md) | ML-DSA-65, SHAKE-256, hybrid encryption, witnesses |

#### Benchmarks (`benchmarks/`)

| Document | Description |
|----------|-------------|
| [Acceptance Tests](benchmarks/acceptance-tests.md) | Performance targets, crash safety, scalability |

---

### Quick Reference

**The Four Laws**

1. Truth lives at the tail
2. Every segment is independently valid
3. Data and state are separated
4. The format adapts to its workload

**Minimal Upgrade Path** (smallest changes that unlock everything)

1. Add tail manifest segments
2. Make every payload a segment with its own hash and length
3. Add hotset pointers in the manifest
4. Add an epoch overlay model

**Hardware Profiles**

- **Core**: 8 KB code + 8 KB data + 64 KB SIMD (Cognitum tile)
- **Hot**: multi-tile chip with shared memory
- **Full**: desktop/server with mmap and the full feature set

**Key Numbers**

- Boot: 4 KB read, < 5 ms
- First query: <= 4 MB read, recall >= 0.70
- Full quality: recall >= 0.95
- Signing: ML-DSA-65, 3,309 B signatures, ~4,500 sign/s
- Distance: 384-dim fp16 L2 in ~12 AVX-512 cycles
- Hot entry: 960 bytes (vector + 16 neighbors, cache-line aligned)

**Design Choices**

- Append-only + compaction (not random writes)
- Targets both mmap'd desktop files and microcontroller tiles
- Priority: streamable > progressive > adaptive > p95 speed
633  vendor/ruvector/docs/research/rvf/SWARM-GUIDANCE.md  vendored  Normal file
@@ -0,0 +1,633 @@
# RVF Implementation Swarm Guidance

## Objective

Implement, test, optimize, and publish RVF (RuVector Format) as the canonical binary format across all RuVector libraries. Deliver as Rust crates (crates.io), WASM packages (npm), and Node.js N-API bindings (npm).

## Phase Overview

```
Phase 1: Foundation (rvf-types + rvf-wire)        ── Week 1-2
Phase 2: Core Runtime (manifest + index + quant)  ── Week 3-5
Phase 3: Integration (library adapters)           ── Week 6-8
Phase 4: WASM + Node Bindings                     ── Week 9-10
Phase 5: Testing + Benchmarks                     ── Week 11-12
Phase 6: Optimization + Publishing                ── Week 13-14
```

---

## Phase 1: Foundation — `rvf-types` + `rvf-wire`

### Agent Assignments

| Agent | Role | Crate | Deliverable |
|-------|------|-------|-------------|
| **coder-1** | Types specialist | `crates/rvf/rvf-types/` | All segment types, enums, headers |
| **coder-2** | Wire format specialist | `crates/rvf/rvf-wire/` | Read/write segment headers + payloads |
| **tester-1** | TDD for types/wire | `crates/rvf/rvf-types/tests/`, `crates/rvf/rvf-wire/tests/` | Round-trip tests, fuzz targets |
| **reviewer-1** | Spec compliance | N/A | Verify code matches the wire format spec |

### `rvf-types` (no_std, no alloc dependency)

```toml
[package]
name = "rvf-types"
version = "0.1.0"
edition = "2021"
description = "RuVector Format core types — segment headers, enums, flags"
license = "MIT OR Apache-2.0"
categories = ["data-structures", "no-std"]

[features]
default = []
std = []
serde = ["dep:serde"]
```
**Files to create:**

```
crates/rvf/rvf-types/
  src/
    lib.rs         # Re-exports
    segment.rs     # SegmentHeader (64 bytes), SegmentType enum
    flags.rs       # Flags bitfield (COMPRESSED, ENCRYPTED, SIGNED, etc.)
    manifest.rs    # Level0Root (4096 bytes), ManifestTag enum
    vec_seg.rs     # BlockDirectory, BlockHeader, DataType enum
    index_seg.rs   # IndexHeader, IndexType, AdjacencyLayout
    hot_seg.rs     # HotHeader, HotEntry layout
    quant_seg.rs   # QuantHeader, QuantType enum
    sketch_seg.rs  # SketchHeader layout
    meta_seg.rs    # MetaField, FilterOp enum
    profile.rs     # ProfileId, ProfileMagic constants
    error.rs       # RvfError enum (format, query, write, tile, crypto)
    constants.rs   # Magic numbers, alignment, limits
  Cargo.toml
```

**Key constants (from spec):**

```rust
pub const SEGMENT_MAGIC: u32 = 0x52564653;       // "RVFS"
pub const ROOT_MANIFEST_MAGIC: u32 = 0x52564D30; // "RVM0"
pub const SEGMENT_ALIGNMENT: usize = 64;
pub const ROOT_MANIFEST_SIZE: usize = 4096;
pub const MAX_SEGMENT_PAYLOAD: u64 = 4 * 1024 * 1024 * 1024; // 4 GB
```
**SegmentType enum (from spec 01):**

```rust
#[repr(u8)]
pub enum SegmentType {
    Invalid = 0x00,
    Vec = 0x01,
    Index = 0x02,
    Overlay = 0x03,
    Journal = 0x04,
    Manifest = 0x05,
    Quant = 0x06,
    Meta = 0x07,
    Hot = 0x08,
    Sketch = 0x09,
    Witness = 0x0A,
    Profile = 0x0B,
    Crypto = 0x0C,
    MetaIdx = 0x0D,
}
```
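A wire reader needs the inverse mapping, raw byte to enum, and an unknown byte must be an error rather than undefined behavior. A minimal fallible conversion sketch (the match arms mirror the discriminants above; `UnknownSegmentType` is a placeholder error type, not necessarily the spec's `RvfError` variant):

```rust
// Fallible byte -> SegmentType conversion for the wire reader.
// Discriminants mirror the spec enum; UnknownSegmentType is an
// illustrative error type, not the crate's actual RvfError.
#[repr(u8)]
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SegmentType {
    Invalid = 0x00, Vec = 0x01, Index = 0x02, Overlay = 0x03,
    Journal = 0x04, Manifest = 0x05, Quant = 0x06, Meta = 0x07,
    Hot = 0x08, Sketch = 0x09, Witness = 0x0A, Profile = 0x0B,
    Crypto = 0x0C, MetaIdx = 0x0D,
}

#[derive(Debug, PartialEq)]
pub struct UnknownSegmentType(pub u8);

impl TryFrom<u8> for SegmentType {
    type Error = UnknownSegmentType;
    fn try_from(b: u8) -> Result<Self, Self::Error> {
        use SegmentType::*;
        Ok(match b {
            0x00 => Invalid, 0x01 => Vec, 0x02 => Index, 0x03 => Overlay,
            0x04 => Journal, 0x05 => Manifest, 0x06 => Quant, 0x07 => Meta,
            0x08 => Hot, 0x09 => Sketch, 0x0A => Witness, 0x0B => Profile,
            0x0C => Crypto, 0x0D => MetaIdx,
            other => return Err(UnknownSegmentType(other)),
        })
    }
}

fn main() {
    assert_eq!(SegmentType::try_from(0x05), Ok(SegmentType::Manifest));
    assert!(SegmentType::try_from(0xFF).is_err());
    println!("segment type round-trip ok");
}
```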
### `rvf-wire` (no_std + alloc)

```toml
[package]
name = "rvf-wire"
version = "0.1.0"
description = "RuVector Format wire format reader/writer"

[dependencies]
rvf-types = { path = "../rvf-types" }

[features]
default = ["std"]
std = ["rvf-types/std"]
```

**Files to create:**

```
crates/rvf/rvf-wire/
  src/
    lib.rs
    reader.rs           # SegmentReader: parse header, validate magic/hash
    writer.rs           # SegmentWriter: build header, compute hash, align
    varint.rs           # LEB128 encode/decode
    delta.rs            # Delta encoding with restart points
    crc32c.rs           # CRC32C (software + hardware detect)
    xxh3.rs             # XXH3-128 hash (or re-export from xxhash-rust)
    tail_scan.rs        # find_latest_manifest() backward scan
    manifest_reader.rs  # Level 0 root manifest parser
    manifest_writer.rs  # Level 0 + Level 1 manifest builder
    vec_seg_codec.rs    # VEC_SEG columnar encode/decode
    hot_seg_codec.rs    # HOT_SEG interleaved encode/decode
    index_seg_codec.rs  # INDEX_SEG adjacency encode/decode
  Cargo.toml
```
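The `varint.rs` module above is small enough to sketch in full. A minimal unsigned LEB128 codec, 7 payload bits per byte with the high bit as a continuation flag (function names are illustrative, not the crate's actual API):

```rust
// Unsigned LEB128: 7 payload bits per byte, high bit = continuation.
// Small values (the common case after delta encoding) fit in one byte.
pub fn leb128_encode(mut v: u64, out: &mut Vec<u8>) {
    loop {
        let byte = (v & 0x7F) as u8;
        v >>= 7;
        if v == 0 {
            out.push(byte);
            return;
        }
        out.push(byte | 0x80); // more bytes follow
    }
}

// Returns (value, bytes consumed), or None on truncated/oversized input.
pub fn leb128_decode(buf: &[u8]) -> Option<(u64, usize)> {
    let mut v = 0u64;
    let mut shift = 0u32;
    for (i, &b) in buf.iter().enumerate() {
        if shift >= 64 {
            return None; // would overflow u64
        }
        v |= u64::from(b & 0x7F) << shift;
        if b & 0x80 == 0 {
            return Some((v, i + 1));
        }
        shift += 7;
    }
    None // ran out of bytes mid-value
}

fn main() {
    let mut buf = Vec::new();
    for &v in &[0u64, 1, 127, 128, 300, u64::MAX] {
        buf.clear();
        leb128_encode(v, &mut buf);
        assert_eq!(leb128_decode(&buf), Some((v, buf.len())));
    }
    println!("LEB128 round-trip ok");
}
```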
### Phase 1 Acceptance Criteria

- [ ] `rvf-types` compiles with `#![no_std]`
- [ ] `rvf-wire` round-trips: create segment -> serialize -> deserialize -> compare
- [ ] Tail scan finds the manifest in a valid file
- [ ] CRC32C matches a reference implementation
- [ ] Varint codec matches the LEB128 spec
- [ ] `cargo test` passes for both crates
- [ ] `cargo clippy` clean, `cargo fmt` clean
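The tail-scan criterion can be pictured concretely: walk backward from the end of the file, checking only 64-byte-aligned offsets (`SEGMENT_ALIGNMENT`) for the root-manifest magic. A sketch under stated assumptions — the magic is assumed little-endian on disk here, and a real `find_latest_manifest()` would also validate the header's length and hash before trusting a hit:

```rust
// Scan backward from the end of the file image for the most recent
// manifest. Only 64-byte-aligned offsets are candidates; a magic match
// alone counts as a hit in this sketch.
const ROOT_MANIFEST_MAGIC: u32 = 0x5256_4D30; // "RVM0"
const SEGMENT_ALIGNMENT: usize = 64;

pub fn find_latest_manifest(file: &[u8]) -> Option<usize> {
    if file.len() < 4 {
        return None;
    }
    // Highest aligned offset with room for a 4-byte magic.
    let mut off = (file.len() - 4) / SEGMENT_ALIGNMENT * SEGMENT_ALIGNMENT;
    loop {
        let word = u32::from_le_bytes(file[off..off + 4].try_into().unwrap());
        if word == ROOT_MANIFEST_MAGIC {
            return Some(off); // latest manifest wins: we scan tail-first
        }
        if off == 0 {
            return None;
        }
        off -= SEGMENT_ALIGNMENT;
    }
}

fn main() {
    let mut file = vec![0u8; 4096];
    // Plant two manifests; the scan must return the later one.
    file[512..516].copy_from_slice(&ROOT_MANIFEST_MAGIC.to_le_bytes());
    file[1984..1988].copy_from_slice(&ROOT_MANIFEST_MAGIC.to_le_bytes());
    assert_eq!(find_latest_manifest(&file), Some(1984));
    println!("tail scan ok");
}
```

This is also why Law 1 ("truth lives at the tail") composes with crash safety: a torn write past the last valid manifest is simply skipped over.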
---

## Phase 2: Core Runtime — manifest + index + quant

### Agent Assignments

| Agent | Role | Crate | Deliverable |
|-------|------|-------|-------------|
| **coder-3** | Manifest system | `crates/rvf/rvf-manifest/` | Two-level manifest, progressive boot |
| **coder-4** | Progressive indexing | `crates/rvf/rvf-index/` | Layer A/B/C HNSW with progressive load |
| **coder-5** | Quantization | `crates/rvf/rvf-quant/` | Temperature-tiered quant (fp16/i8/PQ/binary) |
| **coder-6** | Full runtime | `crates/rvf/rvf-runtime/` | RvfStore API, compaction, append-only |
| **tester-2** | Integration tests | `crates/rvf/tests/` | Progressive load, crash safety, recall |

### `rvf-manifest`

**Key functionality:**

- Parse the Level 0 root manifest (4096 bytes) -> extract hotset pointers
- Parse Level 1 TLV records -> build the segment directory
- Write a new manifest on mutation (two-fsync protocol)
- Manifest chain for rollback (OVERLAY_CHAIN record)
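The two-fsync protocol named above can be sketched with `std::fs` alone: the first fsync makes the appended payload durable, the second makes the manifest that references it durable, so a crash between the two leaves the previous manifest authoritative. The function name and byte layout here are illustrative, not the crate's API:

```rust
use std::fs::OpenOptions;
use std::io::Write;

// Append a payload segment, fsync, then append the manifest that
// references it, fsync again. If the process dies between the two
// syncs, the tail scan still finds the previous manifest, so the
// half-committed payload is simply invisible.
fn append_committed(path: &str, payload: &[u8], manifest: &[u8]) -> std::io::Result<()> {
    let mut f = OpenOptions::new().create(true).append(true).open(path)?;
    f.write_all(payload)?;
    f.sync_all()?; // fsync #1: payload durable
    f.write_all(manifest)?;
    f.sync_all()?; // fsync #2: manifest durable -> commit point
    Ok(())
}

fn main() -> std::io::Result<()> {
    let tmp = std::env::temp_dir().join("rvf-two-fsync-demo.bin");
    let path = tmp.to_str().unwrap().to_owned();
    let _ = std::fs::remove_file(&path);
    append_committed(&path, b"PAYLOAD.", b"MANIFEST")?;
    assert_eq!(std::fs::read(&path)?, b"PAYLOAD.MANIFEST");
    std::fs::remove_file(&path)?;
    println!("two-fsync append ok");
    Ok(())
}
```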
### `rvf-index`

**Key functionality:**

- Layer A: entry points + top-layer adjacency (from INDEX_SEG with the HOT flag)
- Layer B: partial adjacency for the hot region (built incrementally)
- Layer C: full HNSW adjacency (built lazily in the background)
- Varint delta-encoded neighbor lists with restart points
- Prefetch hints for cache-friendly traversal

**Integration with the existing ruvector-core HNSW:**

- Wrap the `hnsw_rs` graph as the in-memory structure
- Serialize HNSW to the INDEX_SEG format
- Deserialize INDEX_SEG into `hnsw_rs` layers
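The "delta-encoded neighbor lists with restart points" idea above can be sketched on plain `u64` IDs before any varint packing: store the full ID at each restart point and only the gap elsewhere, so gaps stay small (and varint-encode tightly) while restarts let a reader land mid-list. The restart interval and layout are illustrative choices, not the spec's:

```rust
// Delta-encode a sorted ID list: absolute value at each restart point,
// gap from the predecessor elsewhere. In the real codec the output
// would then be LEB128-packed; here we keep u64s for clarity.
const RESTART_INTERVAL: usize = 4;

pub fn delta_encode(ids: &[u64]) -> Vec<u64> {
    let mut out = Vec::with_capacity(ids.len());
    for (i, &id) in ids.iter().enumerate() {
        if i % RESTART_INTERVAL == 0 {
            out.push(id); // restart point: absolute value
        } else {
            out.push(id - ids[i - 1]); // gap from predecessor
        }
    }
    out
}

pub fn delta_decode(deltas: &[u64]) -> Vec<u64> {
    let mut out: Vec<u64> = Vec::with_capacity(deltas.len());
    for (i, &d) in deltas.iter().enumerate() {
        if i % RESTART_INTERVAL == 0 {
            out.push(d);
        } else {
            let prev = out[i - 1];
            out.push(prev + d);
        }
    }
    out
}

fn main() {
    let ids = vec![3u64, 7, 8, 20, 100, 101, 150, 151, 900];
    let enc = delta_encode(&ids);
    assert_eq!(delta_decode(&enc), ids);
    assert_eq!(enc[1], 4); // gap 7 - 3 stays small even for large IDs
    println!("delta round-trip ok");
}
```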
### `rvf-quant`

**Key functionality:**

- Scalar quantization: fp32 -> int8 (4x compression)
- Product quantization: M subspaces, K centroids (8-16x compression)
- Binary quantization: sign bit (32x compression)
- QUANT_SEG read/write for codebooks
- Temperature tier assignment from SKETCH_SEG access counters
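The scalar fp32 -> int8 path is the simplest of the three and bounds the reconstruction-error criterion below directly. A minimal sketch using a per-vector affine (min/scale) pair; where RVF stores these parameters (QUANT_SEG) is per the spec, but the exact codebook layout here is illustrative:

```rust
// Scalar quantization: map each fp32 component into i8 via a
// per-vector (min, scale) pair -- 4x smaller than fp32, and the
// reconstruction error is bounded by half a quantization step.
pub fn sq_encode(v: &[f32]) -> (Vec<i8>, f32, f32) {
    let min = v.iter().cloned().fold(f32::INFINITY, f32::min);
    let max = v.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let scale = if max > min { (max - min) / 255.0 } else { 1.0 };
    let codes = v
        .iter()
        .map(|&x| (((x - min) / scale).round() - 128.0) as i8) // 0..255 -> -128..127
        .collect();
    (codes, min, scale)
}

pub fn sq_decode(codes: &[i8], min: f32, scale: f32) -> Vec<f32> {
    codes
        .iter()
        .map(|&c| min + (f32::from(c) + 128.0) * scale)
        .collect()
}

fn main() {
    let v: Vec<f32> = (0..384).map(|i| (i as f32 * 0.37).sin()).collect();
    let (codes, min, scale) = sq_encode(&v);
    let back = sq_decode(&codes, min, scale);
    let worst = v
        .iter()
        .zip(&back)
        .map(|(a, b)| (a - b).abs())
        .fold(0.0f32, f32::max);
    assert!(worst <= scale * 0.5 + 1e-6); // half-step error bound
    println!("max reconstruction error: {worst}");
}
```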
### `rvf-runtime`

**Key functionality:**

- `RvfStore::create()` / `RvfStore::open()` / `RvfStore::open_readonly()`
- Append-only write path (VEC_SEG + MANIFEST_SEG)
- Progressive load sequence (Level 0 -> hotset -> Level 1 -> on-demand)
- Background compaction (IO-budget-aware, priority-ordered)
- Count-Min Sketch maintenance for temperature decisions
- Promotion/demotion lifecycle
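The Count-Min Sketch named above is what makes temperature decisions cheap: fixed-size approximate counters that never undercount, only overcount on hash collisions. A minimal sketch; the width/depth and the hash-mixing constants are illustrative choices, not what `rvf-runtime` ships:

```rust
// Count-Min Sketch: each key increments one cell per row; the estimate
// is the minimum over rows, so it never undercounts and overcounts only
// when hashes collide. Hot/cold tiering reads these estimates.
pub struct CountMin {
    rows: Vec<Vec<u32>>,
    width: usize,
}

impl CountMin {
    pub fn new(width: usize, depth: usize) -> Self {
        Self { rows: vec![vec![0; width]; depth], width }
    }

    fn index(&self, row: usize, key: u64) -> usize {
        // Cheap per-row mixing (splitmix64-style constants).
        let h = (key ^ (row as u64).wrapping_mul(0x9E37_79B9_7F4A_7C15))
            .wrapping_mul(0xFF51_AFD7_ED55_8CCD);
        (h >> 17) as usize % self.width
    }

    pub fn increment(&mut self, key: u64) {
        for r in 0..self.rows.len() {
            let i = self.index(r, key);
            self.rows[r][i] = self.rows[r][i].saturating_add(1);
        }
    }

    pub fn estimate(&self, key: u64) -> u32 {
        (0..self.rows.len())
            .map(|r| self.rows[r][self.index(r, key)])
            .min()
            .unwrap_or(0)
    }
}

fn main() {
    let mut cm = CountMin::new(1024, 4);
    for _ in 0..50 {
        cm.increment(42);
    }
    cm.increment(7);
    assert!(cm.estimate(42) >= 50); // never undercounts
    assert!(cm.estimate(7) >= 1);
    println!("hot key estimate: {}", cm.estimate(42));
}
```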
### Phase 2 Acceptance Criteria

- [ ] Progressive boot: parse Level 0 in < 1 ms, first query in < 50 ms (1M vectors)
- [ ] Recall@10 >= 0.70 with Layer A only
- [ ] Recall@10 >= 0.95 with all layers loaded
- [ ] Crash safety: `kill -9` during a write -> recover to the last valid manifest
- [ ] Compaction reduces dead space while respecting the IO budget
- [ ] Scalar quantization reconstruction error < 0.5%

---

## Phase 3: Integration — Library Adapters

### Agent Assignments

| Agent | Role | Target Library | Deliverable |
|-------|------|---------------|-------------|
| **coder-7** | claude-flow adapter | claude-flow memory | RVF-backed memory store |
| **coder-8** | agentdb adapter | agentdb | RVF as persistence backend |
| **coder-9** | agentic-flow adapter | agentic-flow | RVF streaming for inter-agent exchange |
| **coder-10** | rvlite adapter | rvlite | RVF Core Profile minimal store |

### claude-flow Memory -> RVF

```
Current: JSON flat files + in-memory HNSW
Target:  RVF file per memory namespace

Mapping:
  memory store   -> RvfStore with RVText profile
  memory search  -> rvf_runtime.query()
  memory persist -> RVF append (VEC_SEG + META_SEG + MANIFEST_SEG)
  audit trail    -> WITNESS_SEG with hash chain
  session state  -> META_SEG with TTL metadata
```

### agentdb -> RVF

```
Current: Custom HNSW + serde persistence
Target:  RVF file per database instance

Mapping:
  agentdb.insert()  -> rvf_runtime.ingest_batch()
  agentdb.search()  -> rvf_runtime.query()
  agentdb.persist() -> already persistent (append-only)
  HNSW graph        -> INDEX_SEG (Layer A/B/C)
  Metadata          -> META_SEG + METAIDX_SEG
```

### agentic-flow -> RVF

```
Current: Shared memory blobs between agents
Target:  RVF TCP streaming protocol

Mapping:
  agent memory share -> RVF SUBSCRIBE + UPDATE_NOTIFY
  swarm state        -> META_SEG in shared RVF file
  learning patterns  -> SKETCH_SEG for access tracking
  consensus state    -> WITNESS_SEG with signatures
```
### Phase 3 Acceptance Criteria

- [ ] claude-flow `memory store` and `memory search` work against the RVF backend
- [ ] The existing agentdb test suite passes with RVF storage (swap in, not rewrite)
- [ ] agentic-flow agents can share vectors through the RVF streaming protocol
- [ ] Legacy-format import tools exist for each library

---

## Phase 4: WASM + Node.js Bindings

### Agent Assignments

| Agent | Role | Target | Deliverable |
|-------|------|--------|-------------|
| **coder-11** | WASM microkernel | `crates/rvf/rvf-wasm/` | 14-export WASM module (< 8 KB) |
| **coder-12** | WASM full runtime | `npm/packages/rvf-wasm/` | wasm-pack build, browser-compatible |
| **coder-13** | Node.js N-API | `crates/rvf/rvf-node/` | napi-rs bindings, platform packages |
| **coder-14** | TypeScript SDK | `npm/packages/rvf/` | TypeScript wrapper, types, docs |

### WASM Microkernel (`rvf-wasm` crate, `wasm32-unknown-unknown`)

```rust
// 14 exports matching the spec (microkernel/wasm-runtime.md)
#[no_mangle] pub extern "C" fn rvf_init(config_ptr: i32) -> i32;
#[no_mangle] pub extern "C" fn rvf_load_query(query_ptr: i32, dim: i32) -> i32;
#[no_mangle] pub extern "C" fn rvf_load_block(block_ptr: i32, count: i32, dtype: i32) -> i32;
#[no_mangle] pub extern "C" fn rvf_distances(metric: i32, result_ptr: i32) -> i32;
#[no_mangle] pub extern "C" fn rvf_topk_merge(dist_ptr: i32, id_ptr: i32, count: i32, k: i32) -> i32;
#[no_mangle] pub extern "C" fn rvf_topk_read(out_ptr: i32) -> i32;
// ... remaining 8 exports
```

**Build command:**

```bash
cargo build --target wasm32-unknown-unknown --release -p rvf-wasm
wasm-opt -Oz -o rvf-microkernel.wasm target/wasm32-unknown-unknown/release/rvf_wasm.wasm
```

**Size budget:** must be < 8 KB after wasm-opt.

### WASM Full Runtime (wasm-pack, browser)

```bash
cd crates/rvf/rvf-runtime
wasm-pack build --target web --features wasm
```

**npm package:** `@ruvector/rvf-wasm`

```typescript
// npm/packages/rvf-wasm/index.ts
import init, { RvfStore } from './pkg/rvf_runtime.js';

await init();
const store = RvfStore.fromBytes(rvfFileBytes);
const results = store.query(queryVector, 10);
```

### Node.js N-API Bindings (napi-rs)

```bash
cd crates/rvf/rvf-node
npm run build   # napi build --platform --release
```

**Platform packages:**

| Package | Target |
|---------|--------|
| `@ruvector/rvf-node` | Main package with postinstall platform select |
| `@ruvector/rvf-node-linux-x64-gnu` | Linux x86_64 glibc |
| `@ruvector/rvf-node-linux-arm64-gnu` | Linux aarch64 glibc |
| `@ruvector/rvf-node-darwin-arm64` | macOS Apple Silicon |
| `@ruvector/rvf-node-darwin-x64` | macOS Intel |
| `@ruvector/rvf-node-win32-x64-msvc` | Windows x64 |

### TypeScript SDK

```typescript
// npm/packages/rvf/src/index.ts
export class RvfDatabase {
  static async open(path: string): Promise<RvfDatabase>;
  static async create(path: string, options?: RvfOptions): Promise<RvfDatabase>;

  async insert(id: string, vector: Float32Array, metadata?: Record<string, unknown>): Promise<void>;
  async insertBatch(entries: RvfEntry[]): Promise<RvfIngestResult>;
  async query(vector: Float32Array, k: number, options?: RvfQueryOptions): Promise<RvfResult[]>;
  async delete(ids: string[]): Promise<RvfDeleteResult>;

  // Progressive loading
  async openProgressive(source: string | URL): Promise<RvfProgressiveReader>;
}

export interface RvfOptions {
  profile?: 'generic' | 'rvdna' | 'rvtext' | 'rvgraph' | 'rvvision';
  dimensions: number;
  metric?: 'l2' | 'cosine' | 'dotproduct' | 'hamming';
  compression?: 'none' | 'lz4' | 'zstd';
  signing?: { algorithm: 'ed25519' | 'ml-dsa-65'; key: Uint8Array };
}
```
### Phase 4 Acceptance Criteria

- [ ] WASM microkernel < 8 KB after wasm-opt
- [ ] WASM full runtime works in Chrome, Firefox, and Node.js
- [ ] N-API bindings pass the same test suite as the Rust crate
- [ ] TypeScript types match the Rust API surface
- [ ] All platform binaries build in CI

---

## Phase 5: Testing + Benchmarks

### Agent Assignments

| Agent | Role | Scope |
|-------|------|-------|
| **tester-3** | Acceptance tests | 10M-vector cold start, recall, crash safety |
| **tester-4** | Benchmark harness | criterion benches, perf targets from the spec |
| **tester-5** | Fuzz testing | cargo-fuzz for wire format parsing |
| **tester-6** | WASM tests | Browser + Cognitum tile simulation |

### Test Matrix

| Test Category | Description | Target |
|--------------|-------------|--------|
| **Round-trip** | Write + read all segment types | `rvf-wire` |
| **Progressive boot** | Cold start, measure recall at each phase | `rvf-runtime` |
| **Crash safety** | `kill -9` during ingest/manifest/compaction | `rvf-runtime` |
| **Bit-flip detection** | Random corruption -> hash/CRC catch | `rvf-wire` |
| **Recall benchmarks** | recall@10 at Layers A, B, C | `rvf-index` |
| **Latency benchmarks** | p50/p95/p99 query latency | `rvf-runtime` |
| **Throughput benchmarks** | QPS and ingest rate | `rvf-runtime` |
| **WASM performance** | Distance compute, top-K in WASM | `rvf-wasm` |
| **Interop** | agentdb/claude-flow/agentic-flow integration | adapters |
| **Profile compatibility** | Generic reader opens RVDNA/RVText files | `rvf-runtime` |

### Benchmark Commands

```bash
# Rust benchmarks
cd crates/rvf/rvf-runtime && cargo bench

# WASM benchmarks
cd npm/packages/rvf-wasm && npm run bench

# Node.js benchmarks
cd npm/packages/rvf-node && npm run bench

# Full acceptance test (10M vectors)
cd crates/rvf && cargo test --release --test acceptance -- --ignored
```

### Phase 5 Acceptance Criteria

- [ ] All performance targets from `benchmarks/acceptance-tests.md` met
- [ ] Zero data loss in crash safety tests (100 iterations)
- [ ] 100% bit-flip detection rate
- [ ] WASM microkernel passes Cognitum tile simulation
- [ ] No memory-safety issues found by fuzz testing (1M iterations)

---

## Phase 6: Optimization + Publishing

### Agent Assignments

| Agent | Role | Scope |
|-------|------|-------|
| **optimizer-1** | SIMD tuning | AVX-512/NEON distance kernels, alignment |
| **optimizer-2** | Compression tuning | LZ4/ZSTD level selection, block size |
| **publisher-1** | crates.io publishing | Version management, dependency graph |
| **publisher-2** | npm publishing | Platform packages, wasm-pack output |

### SIMD Optimization Targets

| Operation | AVX-512 Target | NEON Target | WASM v128 Target |
|-----------|---------------|-------------|-----------------|
| L2 distance (384-dim fp16) | ~12 cycles | ~48 cycles | ~96 cycles |
| Dot product (384-dim fp16) | ~12 cycles | ~48 cycles | ~96 cycles |
| Hamming (384-bit) | 1 cycle (VPOPCNTDQ) | ~6 cycles (CNT) | ~24 cycles |
| PQ ADC (48 subspaces) | ~48 cycles (gather) | ~96 cycles (TBL) | ~192 cycles |

### Publishing Dependency Order

Crates must be published in dependency order:

```
 1. rvf-types    (no deps)
 2. rvf-wire     (depends on rvf-types)
 3. rvf-quant    (depends on rvf-types)
 4. rvf-manifest (depends on rvf-types, rvf-wire)
 5. rvf-index    (depends on rvf-types, rvf-wire, rvf-quant)
 6. rvf-crypto   (depends on rvf-types, rvf-wire)
 7. rvf-runtime  (depends on all above)
 8. rvf-wasm     (depends on rvf-types, rvf-wire, rvf-quant)
 9. rvf-node     (depends on rvf-runtime)
10. rvf-server   (depends on rvf-runtime)
```
### crates.io Publishing

```bash
# Publish in dependency order
for crate in rvf-types rvf-wire rvf-quant rvf-manifest rvf-index rvf-crypto rvf-runtime rvf-wasm rvf-node rvf-server; do
  cd crates/rvf/$crate
  cargo publish
  sleep 30   # wait for the crates.io index to update
  cd -
done
```

### npm Publishing

```bash
# WASM package
cd npm/packages/rvf-wasm
npm publish --access public

# Node.js platform binaries
for platform in linux-x64-gnu linux-arm64-gnu darwin-arm64 darwin-x64 win32-x64-msvc; do
  cd npm/packages/rvf-node-$platform
  npm publish --access public
  cd -
done

# Main Node.js package
cd npm/packages/rvf-node
npm publish --access public

# TypeScript SDK
cd npm/packages/rvf
npm publish --access public
```

### Phase 6 Acceptance Criteria

- [ ] SIMD distance kernels meet cycle targets on each platform
- [ ] All crates published to crates.io with the correct dependency graph
- [ ] All npm packages published with correct platform detection
- [ ] `npx rvf --version` works
- [ ] `npm install @ruvector/rvf` works on all supported platforms
- [ ] GitHub release with changelog

---

## Swarm Topology

```
                ┌──────────────┐
                │    Queen     │
                │ Coordinator  │
                └──────┬───────┘
                       │
        ┌──────────────┼──────────────┐
        │              │              │
 ┌──────▼──────┐  ┌────▼─────┐  ┌─────▼───────┐
 │ Foundation  │  │ Runtime  │  │ Integration │
 │   Squad     │  │  Squad   │  │   Squad     │
 │ (coder 1-2) │  │ (coder   │  │ (coder 7-10)│
 │ (tester-1)  │  │  3-6)    │  │             │
 │ (reviewer-1)│  │ (test-2) │  │             │
 └──────┬──────┘  └────┬─────┘  └─────┬───────┘
        │              │              │
        │    ┌─────────┼──────────┐   │
        │    │         │          │   │
        │  ┌─▼───────┐ │ ┌────────▼─┐ │
        │  │ WASM +  │ │ │ Testing  │ │
        │  │ Node    │ │ │ Squad    │ │
        │  │ Squad   │ │ │ (tester  │ │
        │  │ (coder  │ │ │  3-6)    │ │
        │  │  11-14) │ │ │          │ │
        │  └─────────┘ │ └──────────┘ │
        │              │              │
        └──────────────┼──────────────┘
                ┌──────▼─────┐
                │ Optimize + │
                │  Publish   │
                │   Squad    │
                └────────────┘
```
### Swarm Init Command

```bash
npx @claude-flow/cli@latest swarm init \
  --topology hierarchical \
  --max-agents 8 \
  --strategy specialized
```

### Agent Spawn Commands (via the Claude Code Task tool)

All agents should be spawned as `run_in_background: true` Task calls in a single message. Each agent receives:

1. The relevant RVF spec files to read (from `docs/research/rvf/`)
2. ADR-029 for context
3. The specific phase deliverables from this guidance
4. The acceptance criteria as exit conditions

---

## Critical Path

```
rvf-types ──> rvf-wire ──> rvf-manifest ──> rvf-runtime ──> adapters ──> publish
                 │                              │
                 └──> rvf-quant ────────────────┘
                 │                              │
                 └──> rvf-index ────────────────┘
                                                │
rvf-wasm (parallel) ────────────────────────────┘
rvf-node (parallel) ────────────────────────────┘
```

**Blocking dependencies:**

- Everything depends on `rvf-types`
- `rvf-wire` unlocks all other crates
- `rvf-runtime` blocks the integration adapters
- `rvf-wasm` and `rvf-node` can proceed in parallel once `rvf-wire` exists

---

## File Layout Summary

```
crates/rvf/
  rvf-types/     # Segment types, headers, enums (no_std)
  rvf-wire/      # Wire format read/write (no_std + alloc)
  rvf-index/     # Progressive HNSW indexing
  rvf-manifest/  # Two-level manifest system
  rvf-quant/     # Temperature-tiered quantization
  rvf-crypto/    # ML-DSA-65, SHAKE-256
  rvf-runtime/   # Full runtime (RvfStore API)
  rvf-wasm/      # WASM microkernel (< 8 KB)
  rvf-node/      # Node.js N-API bindings
  rvf-server/    # TCP/HTTP streaming server
  tests/         # Integration + acceptance tests
  benches/       # Criterion benchmarks

npm/packages/
  rvf/           # TypeScript SDK (@ruvector/rvf)
  rvf-wasm/      # Browser WASM (@ruvector/rvf-wasm)
  rvf-node/      # Node.js native (@ruvector/rvf-node)
  rvf-node-linux-x64-gnu/
  rvf-node-linux-arm64-gnu/
  rvf-node-darwin-arm64/
  rvf-node-darwin-x64/
  rvf-node-win32-x64-msvc/
```

---

## Success Metrics

| Metric | Target | Measured By |
|--------|--------|-------------|
| Cold boot time | < 5 ms | Phase 5 acceptance test |
| First-query recall@10 | >= 0.70 | Phase 5 recall benchmark |
| Full recall@10 | >= 0.95 | Phase 5 recall benchmark |
| Query latency p50 | < 0.3 ms (10M vectors) | Phase 5 latency benchmark |
| WASM microkernel size | < 8 KB | Phase 4 build output |
| Crash safety | 0 data loss in 100 kill tests | Phase 5 crash test |
| Crates published | 10 crates on crates.io | Phase 6 publish |
| npm packages published | 8+ packages on npm | Phase 6 publish |
| Library integration | 4 libraries using RVF | Phase 3 adapter tests |
341  vendor/ruvector/docs/research/rvf/benchmarks/acceptance-tests.md  vendored  Normal file
@@ -0,0 +1,341 @@
# RVF Acceptance Tests and Performance Targets

## 1. Primary Acceptance Test

> **Cold start on a 10 million vector file: load and answer the first query with a
> useful result (recall@10 >= 0.70) without reading more than the last 4 MB, then
> converge to full quality (recall@10 >= 0.95) as it progressively maps more segments.**

### Test Parameters

```
Dataset:       10 million vectors
Dimensions:    384 (sentence embedding size)
Base dtype:    fp16 (768 bytes per vector)
Raw file size: ~7.2 GB (vectors only)
With index:    ~10-12 GB total
Query set:     1000 queries from a held-out test set
Ground truth:  Brute-force exact k-NN (k=10)
Metric:        L2 distance
```

### Success Criteria

| Phase | Time Budget | Data Read | Min Recall@10 | Description |
|-------|------------|-----------|---------------|-------------|
| Boot | < 5 ms | 4 KB (Level 0) | N/A | Parse root manifest |
| First query | < 50 ms | <= 4 MB | >= 0.70 | Layer A + hot cache |
| Working quality | < 500 ms | <= 200 MB | >= 0.85 | Layers A + B |
| Full quality | < 5 s | <= 4 GB | >= 0.95 | Layers A + B + C |
| Optimized | < 30 s | Full file | >= 0.98 | All layers + hot tier |

### Measurement Methodology

```
1. Create an RVF file from the 10M-vector dataset
   - Build the full HNSW index (M=16, ef_construction=200)
   - Compute temperature tiers (default: all warm initially)
   - Write with all segment types

2. Cold start measurement
   - Drop the filesystem cache: echo 3 > /proc/sys/vm/drop_caches
   - Open the file, start the timer
   - Read Level 0 (4 KB), record time T_boot
   - Read hotset data, record time T_hotset
   - Execute the first query, record time T_first_query and recall@10
   - Continue progressive loading
   - At each milestone: record time, data read, recall@10

3. Throughput measurement (warm)
   - After a full load, execute 1000 queries
   - Measure queries per second (QPS)
   - Measure p50, p95, p99 latency
   - Measure average recall@10

4. Streaming ingest measurement
   - Start with an empty file
   - Ingest 10M vectors in streaming mode
   - Measure ingest rate (vectors/second)
   - Measure file size over time
   - Verify crash safety (kill -9 at random points, verify recovery)
```
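The recall@10 figure recorded at each milestone is just set overlap against the brute-force ground truth, averaged over queries. A minimal sketch (function name and ID types are illustrative):

```rust
// recall@k: fraction of the ground-truth top-k neighbors that the
// approximate search also returned, averaged over the query set.
pub fn recall_at_k(truth: &[Vec<u64>], found: &[Vec<u64>], k: usize) -> f64 {
    let mut hits = 0usize;
    let mut total = 0usize;
    for (t, f) in truth.iter().zip(found) {
        for id in t.iter().take(k) {
            total += 1;
            if f.iter().take(k).any(|x| x == id) {
                hits += 1;
            }
        }
    }
    hits as f64 / total as f64
}

fn main() {
    // One query: ground truth {1..10}, search returned 7 of them.
    let truth = vec![(1u64..=10).collect::<Vec<_>>()];
    let found = vec![vec![1u64, 2, 3, 4, 5, 6, 7, 99, 98, 97]];
    let r = recall_at_k(&truth, &found, 10);
    assert!((r - 0.7).abs() < 1e-9);
    println!("recall@10 = {r}");
}
```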
|
||||
|
||||
## 2. Performance Targets
|
||||
|
||||
### Query Latency (10M vectors, 384 dim, fp16)
|
||||
|
||||
| Hardware | QPS (single thread) | p50 Latency | p95 Latency | p99 Latency |
|
||||
|----------|-------------------|-------------|-------------|-------------|
|
||||
| Desktop (AVX-512) | 5,000-15,000 | 0.1 ms | 0.3 ms | 1.0 ms |
|
||||
| Desktop (AVX2) | 3,000-8,000 | 0.2 ms | 0.5 ms | 2.0 ms |
|
||||
| Laptop (NEON) | 2,000-5,000 | 0.3 ms | 1.0 ms | 3.0 ms |
|
||||
| WASM (browser) | 500-2,000 | 1.0 ms | 3.0 ms | 10.0 ms |
|
||||
| Cognitum tile | 100-500 | 2.0 ms | 5.0 ms | 15.0 ms |
|
||||
|
### Streaming Ingest Rate

| Hardware | Vectors/Second | Bytes/Second | Notes |
|----------|----------------|--------------|-------|
| NVMe SSD | 200K-500K | 150-380 MB/s | fsync every 1000 vectors |
| SATA SSD | 50K-100K | 38-76 MB/s | fsync every 1000 vectors |
| HDD | 10K-30K | 7-23 MB/s | Sequential append |
| Network (1 Gbps) | 50K-100K | 38-76 MB/s | Streaming over network |

### Progressive Load Times

| Phase | NVMe SSD | SATA SSD | HDD | Network |
|-------|----------|----------|-----|---------|
| Boot (4 KB) | < 0.1 ms | < 0.5 ms | < 10 ms | < 50 ms |
| First query (4 MB) | < 2 ms | < 10 ms | < 100 ms | < 500 ms |
| Working quality (200 MB) | < 100 ms | < 500 ms | < 5 s | < 20 s |
| Full quality (4 GB) | < 2 s | < 10 s | < 120 s | < 400 s |

### Space Efficiency

| Configuration | Bytes/Vector | File Size (10M) | Ratio vs Raw |
|---------------|--------------|-----------------|--------------|
| Raw fp32 | 1,536 | 14.3 GB | 1.0x |
| RVF uniform fp16 | 768 + overhead | 8.0 GB | 0.56x |
| RVF adaptive (equilibrium) | ~300 avg | 3.2 GB | 0.22x |
| RVF aggressive (binary cold) | ~100 avg | 1.1 GB | 0.08x |
## 3. Crash Safety Tests

### Test 1: Kill During Vector Ingest

```
1. Start ingesting 1M vectors
2. After 500K vectors: kill -9 the writer
3. Verify: file is readable
4. Verify: latest valid manifest is found
5. Verify: all vectors referenced by latest manifest are intact
6. Verify: no data corruption (all segment hashes valid)
```

**Pass criteria**: Zero data loss for committed segments. At most the
last incomplete segment is lost (bounded by fsync interval).
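The recovery property behind this pass criterion can be demonstrated with a self-contained toy harness: a writer appends fixed-size checksummed records (standing in for RVF segments), gets killed mid-ingest, and the reader keeps every complete record while dropping any torn tail. The record layout and names are illustrative only, not the RVF wire format.

```python
import struct, subprocess, sys, time, zlib

REC = struct.Struct("<I60s")   # 4-byte CRC32 over a fixed 60-byte payload

WRITER = r"""
import struct, sys, zlib
rec = struct.Struct("<I60s")
f = open(sys.argv[1], "ab", buffering=0)   # unbuffered: bytes hit the OS in order
i = 0
while True:
    payload = (b"%016d" % i).ljust(60, b".")
    f.write(rec.pack(zlib.crc32(payload), payload))
    i += 1
"""

def crash_and_recover(path, run_seconds=0.25):
    """Kill a writer mid-ingest, then recover every complete record."""
    open(path, "wb").close()                     # ensure the file exists
    proc = subprocess.Popen([sys.executable, "-c", WRITER, path])
    time.sleep(run_seconds)
    proc.kill()                                  # the kill -9 of step 2
    proc.wait()
    data = open(path, "rb").read()
    complete = len(data) // REC.size             # torn tail record is dropped
    for n in range(complete):
        crc, payload = REC.unpack_from(data, n * REC.size)
        assert zlib.crc32(payload) == crc        # step 6: checksums valid
    return complete
```

Because appends are sequential, every record before the torn tail is fully written, which mirrors the pass criterion: loss is bounded by the last uncommitted unit.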
### Test 2: Kill During Manifest Write

```
1. Create file with 1M vectors
2. Trigger manifest rewrite (add metadata, trigger compaction)
3. Kill -9 during manifest write
4. Verify: file falls back to previous valid manifest
5. Verify: all queries work correctly with previous manifest
```

**Pass criteria**: Automatic fallback to previous manifest. No manual
recovery needed.

### Test 3: Kill During Compaction

```
1. Create file with 1M vectors across 100 small VEC_SEGs
2. Trigger compaction
3. Kill -9 during compaction
4. Verify: file is readable (old segments still valid)
5. Verify: partial compaction output is safely ignored
```

**Pass criteria**: Old segments remain valid. Incomplete compaction
output has no manifest reference and is safely orphaned.

### Test 4: Bit Flip Detection

```
1. Create valid RVF file
2. Flip random bits in various locations
3. Verify: corruption detected by hash/CRC checks
4. Verify: specific corrupted segment identified
5. Verify: other segments still readable
```

**Pass criteria**: 100% detection of single-bit flips. Corruption
isolated to affected segment.
## 4. Scalability Tests

### Test: 1 Billion Vectors

```
Dataset:   1B vectors, 384 dimensions, fp16
File size: ~700 GB (raw) -> ~200 GB (adaptive RVF)
Hardware:  Server with 256 GB RAM, NVMe array

Verify:
- Boot time < 10 ms
- First query < 100 ms
- Full quality convergence < 60 s
- Recall@10 >= 0.95 at full quality
- Streaming ingest sustained at 100K+ vectors/second
```

### Test: High Dimensionality

```
Dataset:   1M vectors, 4096 dimensions (LLM embeddings)
File size: ~8 GB (fp16)

Verify:
- PQ compression to 5-bit achieves >= 10x compression
- Recall@10 >= 0.90 with PQ
- Query latency < 5 ms (p95) with PQ + HNSW
```

### Test: Multi-File Sharding

```
Dataset: 100M vectors across 10 shard files

Verify:
- Transparent query across all shards
- Shard addition without full rebuild
- Individual shard compaction
- Shard removal with manifest update only
```
## 5. WASM Performance Tests

### Browser Environment

```
Runtime: Chrome V8 / Firefox SpiderMonkey
SIMD:    WASM v128
Memory:  Limited to 4 GB WASM heap

Test: Load 1M vector RVF file via fetch()
- Boot time < 50 ms
- First query < 200 ms (after boot)
- QPS >= 500 (single thread)
- Memory usage < 500 MB
```

### Cognitum Tile Simulation

```
Runtime:    wasmtime with memory limits
Code limit: 8 KB
Data limit: 8 KB
Scratch:    64 KB

Test: Process 1000 blocks via hub protocol
- Distance computation matches reference implementation
- Top-K results match brute-force within quantization tolerance
- No memory access out of bounds
- Tile recovers from simulated faults
```
## 6. Interoperability Tests

### Round-Trip Test

```
1. Create RVF file from numpy arrays
2. Read back with independent implementation
3. Verify: all vectors bit-identical
4. Verify: all metadata preserved
5. Verify: index produces same results
```

### Profile Compatibility Test

```
1. Create RVDNA file with genomic data
2. Create RVText file with text embeddings
3. Read both with generic RVF reader
4. Verify: generic reader can access vectors and metadata
5. Verify: profile-specific features degrade gracefully
```

### Version Forward Compatibility Test

```
1. Create RVF file with version 1
2. Add segments with hypothetical version 2 features (unknown tags)
3. Read with version 1 reader
4. Verify: version 1 reader skips unknown segments/tags
5. Verify: version 1 data is fully accessible
```
## 7. Security Tests

### Signature Verification

```
1. Create signed RVF file (ML-DSA-65)
2. Verify all segment signatures
3. Modify one byte in a signed segment
4. Verify: modification detected
5. Verify: other segments still valid
```

### Encryption Round-Trip

```
1. Create encrypted RVF file (ML-KEM-768 + AES-256-GCM)
2. Decrypt with correct key
3. Verify: plaintext matches original
4. Attempt decrypt with wrong key
5. Verify: decryption fails (GCM auth tag mismatch)
```

### Key Rotation

```
1. Create file signed with key A
2. Rotate to key B (write CRYPTO_SEG rotation record)
3. Write new segments signed with key B
4. Verify: old segments valid with key A
5. Verify: new segments valid with key B
6. Verify: cross-signature in rotation record is valid
```
## 8. Benchmark Harness

### Recommended Tools

| Purpose | Tool | Notes |
|---------|------|-------|
| Latency measurement | criterion (Rust) / benchmark.js | Statistical rigor |
| Recall measurement | Custom recall@K computation | Against brute-force ground truth |
| Memory profiling | valgrind massif / Chrome DevTools | Peak and sustained |
| I/O profiling | blktrace / iostat | Verify read patterns |
| SIMD verification | Intel SDE / ARM emulator | Correct SIMD codegen |
| Crash testing | Custom harness with kill -9 | Random timing |

### Report Format

Each benchmark run produces a report:

```json
{
  "test_name": "cold_start_10m",
  "dataset": {
    "vector_count": 10000000,
    "dimensions": 384,
    "dtype": "fp16",
    "file_size_bytes": 10737418240
  },
  "hardware": {
    "cpu": "Intel Xeon w5-3435X",
    "simd": "AVX-512",
    "ram_gb": 256,
    "storage": "NVMe Samsung 990 Pro"
  },
  "results": {
    "boot_ms": 0.08,
    "first_query_ms": 12.3,
    "first_query_recall_at_10": 0.73,
    "working_quality_ms": 340,
    "working_quality_recall_at_10": 0.87,
    "full_quality_ms": 3200,
    "full_quality_recall_at_10": 0.96,
    "steady_state_qps": 8500,
    "steady_state_p50_ms": 0.12,
    "steady_state_p95_ms": 0.28,
    "steady_state_p99_ms": 0.85,
    "data_read_first_query_mb": 3.2,
    "data_read_working_quality_mb": 180
  }
}
```
---

`vendor/ruvector/docs/research/rvf/crypto/quantum-signatures.md` (312 lines, vendored, new file)
# RVF Quantum-Resistant Cryptography

## 1. Threat Model

RVF files may contain high-value intelligence (medical genomics, proprietary
embeddings, classified networks). The cryptographic design must:

1. **Authenticate**: Prove a segment was written by an authorized producer
2. **Integrity**: Detect any modification to segment payloads
3. **Quantum resistance**: Survive attacks by future quantum computers
4. **Performance**: Avoid bottlenecking streaming ingest or query paths
5. **Compactness**: Fit signatures in segment footers without bloating the file

### Harvest-Now, Decrypt-Later

Adversaries may archive RVF files today and break classical signatures later
with quantum computers. Post-quantum signatures protect against this from day one.
## 2. Algorithm Selection

### NIST Post-Quantum Standards (FIPS 204, 205, 206)

| Algorithm | Standard | Type | Sig Size | PK Size | SK Size | Sign/s | Verify/s | Level |
|-----------|----------|------|----------|---------|---------|--------|----------|-------|
| ML-DSA-44 | FIPS 204 | Lattice | 2,420 B | 1,312 B | 2,560 B | ~9,000 | ~42,000 | 2 |
| ML-DSA-65 | FIPS 204 | Lattice | 3,309 B | 1,952 B | 4,032 B | ~4,500 | ~17,000 | 3 |
| ML-DSA-87 | FIPS 204 | Lattice | 4,627 B | 2,592 B | 4,896 B | ~2,800 | ~10,000 | 5 |
| SLH-DSA-128s | FIPS 205 | Hash | 7,856 B | 32 B | 64 B | ~350 | ~15,000 | 1 |
| SLH-DSA-128f | FIPS 205 | Hash | 17,088 B | 32 B | 64 B | ~3,000 | ~90,000 | 1 |
| FN-DSA-512 | FIPS 206 (draft) | Lattice | 666 B | 897 B | ~1.3 KB | ~5,000 | ~25,000 | 1 |

### RVF Default: ML-DSA-65

**Why ML-DSA-65**:
- NIST Level 3 security (128-bit post-quantum)
- 3,309 byte signatures (manageable in segment footer)
- ~4,500 sign/s (sufficient for streaming ingest at segment level)
- ~17,000 verify/s (fast enough for progressive load verification)
- Well-studied lattice assumption (Module-LWE)

**Alternative for size-constrained environments (Core Profile)**:
FN-DSA-512 with 666 byte signatures — but FIPS 206 is still a draft and less
widely deployed.

**Alternative for maximum conservatism**:
SLH-DSA-128s (hash-based, stateless, minimal assumptions) — 7,856 byte
signatures but the smallest keys and strongest theoretical foundation.
## 3. Signature Scheme

### What Gets Signed

Each signed segment's signature covers:

```
signed_data = segment_header[0:40]   # Header minus content_hash and padding
           || content_hash           # The payload hash
           || segment_id_bytes       # Prevent replay
           || context_string         # Domain separation
```

The signature does NOT cover the raw payload directly — it covers the payload's
hash. This means:
- Signing is O(1) regardless of payload size
- The hash is computed during write anyway (required for integrity)
- Verification requires only the header + hash, not the full payload

### Context String

```
context = "RVF-v1-" || seg_type_name || "-" || profile_name
```

Examples:
- `"RVF-v1-VEC_SEG-rvdna"`
- `"RVF-v1-MANIFEST_SEG-generic"`

Domain separation prevents cross-type signature confusion.
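Assembled as bytes, the two definitions above look like the following sketch. The 8-byte little-endian segment id and the 16-byte SHAKE-256 truncation are assumptions for illustration, not normative encodings:

```python
import hashlib

def context_string(seg_type: str, profile: str) -> bytes:
    # "RVF-v1-" || seg_type_name || "-" || profile_name
    return f"RVF-v1-{seg_type}-{profile}".encode()

def signed_data(header64: bytes, payload: bytes, segment_id: int,
                seg_type: str, profile: str) -> bytes:
    """Bytes covered by the segment signature: header prefix, payload hash,
    segment id (replay protection), and the domain-separation context."""
    content_hash = hashlib.shake_256(payload).digest(16)   # 128-bit truncation
    return (header64[:40]                                  # minus hash + padding
            + content_hash
            + segment_id.to_bytes(8, "little")             # assumed encoding
            + context_string(seg_type, profile))
```

Note that only the payload's hash enters `signed_data`, so the byte string to sign stays constant-size no matter how large the payload is.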
### Key Management

Keys are stored in CRYPTO_SEG segments:

```
CRYPTO_SEG Payload:
  key_type: u8
    0 = signing public key
    1 = verification certificate chain
    2 = encryption public key (for ENCRYPTED segments)
    3 = key rotation record

  algorithm: u8
    0 = Ed25519 (classical)
    1 = ML-DSA-65 (post-quantum)
    2 = SLH-DSA-128s (hash-based PQ)
    3 = X25519 (classical KEM)
    4 = ML-KEM-768 (post-quantum KEM)

  key_id: [u8; 16]      Unique key identifier (hash of public key)
  key_data: [u8; var]   The actual key material
  valid_from: u64       Timestamp (ns) when key becomes valid
  valid_until: u64      Timestamp (ns) when key expires (0 = no expiry)
```

### Key Rotation

New keys are introduced by writing a new CRYPTO_SEG with `key_type=3`
(rotation record) that references both old and new key IDs. Segments
signed with either key are valid during the transition period.

```
CRYPTO_SEG (rotation):
  old_key_id: [u8; 16]
  new_key_id: [u8; 16]
  rotation_timestamp: u64
  cross_signature: [u8; var]   New key signed by old key
```
## 4. Hash Functions

### SHAKE-256 (Primary)

SHAKE-256 from the SHA-3 family is used for:
- Content hashes in segment headers (128-bit truncation for compactness)
- Min-cut witness hashes (256-bit for cryptographic binding)
- Key derivation
- Domain separation

**Why SHAKE-256**:
- Post-quantum safe (Grover's algorithm at most halves effective preimage
  strength, so a 256-bit output retains 128-bit quantum security)
- Extendable output function (XOF) — can produce any hash length
- No length extension attacks
- ~1 GB/s in software, faster with hardware SHA-3 extensions

### XXH3-128 (Fast Path)

XXH3 is used for non-cryptographic content hashing where speed matters more
than collision resistance:
- Segment content hashes when crypto verification is not required
- Block-level integrity checks in combination with CRC32C

**Performance**: ~50 GB/s with AVX2. This means hash computation is never
the bottleneck during streaming ingest.

### CRC32C (Block Level)

CRC32C is used for per-block integrity within segments:
- Detects random bit flips and truncation
- Hardware accelerated on x86 (SSE4.2) and ARM (CRC32 extension)
- ~3 GB/s throughput

### Hash Selection by Context

| Context | Algorithm | Output Size | Why |
|---------|-----------|-------------|-----|
| Block integrity | CRC32C | 4 B | Fastest, HW accel |
| Segment content hash (fast) | XXH3-128 | 16 B | Very fast, good distribution |
| Segment content hash (crypto) | SHAKE-256 | 16 B | Post-quantum, collision resistant |
| Witness / proof hashes | SHAKE-256 | 32 B | Full crypto strength |
| Key derivation | SHAKE-256 | 32+ B | XOF flexibility |
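The selection table maps onto the Python standard library only partially: `hashlib` provides SHAKE-256, but XXH3-128 and CRC32C need third-party modules (e.g. `xxhash`, `google-crc32c`). This sketch therefore substitutes zlib's plain CRC32 for the block checksum and uses SHAKE-256 on both content-hash paths:

```python
import hashlib, zlib

def block_crc(block: bytes) -> int:
    # Stand-in: zlib's CRC32; the spec calls for CRC32C (Castagnoli),
    # which uses a different polynomial and a third-party module.
    return zlib.crc32(block)

def content_hash(payload: bytes) -> bytes:
    # Crypto-grade content hash: SHAKE-256 truncated to 16 B.
    # The fast path would use XXH3-128, which is not in the stdlib.
    return hashlib.shake_256(payload).digest(16)

def witness_hash(data: bytes) -> bytes:
    # Witness / proof hashes keep the full 256-bit output.
    return hashlib.shake_256(data).digest(32)
```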
## 5. Encryption (Optional)

For ENCRYPTED segments, RVF uses hybrid encryption:

### Key Encapsulation

```
Classical:     X25519 ECDH
Post-Quantum:  ML-KEM-768 (CRYSTALS-Kyber, NIST Level 3)
Hybrid:        X25519 || ML-KEM-768 (concatenated shared secrets)
```

### Payload Encryption

```
Algorithm:  AES-256-GCM (AEAD)
Key:        SHAKE-256(X25519_shared || ML-KEM_shared || context)
Nonce:      First 12 bytes of SHAKE-256(segment_id || timestamp)
AAD:        segment_header[0:40] (authenticated but not encrypted)
```
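The key and nonce derivation steps can be sketched with `hashlib` alone; the AES-256-GCM step itself needs a third-party library (e.g. `cryptography`) and is omitted here. The 8-byte little-endian integer encodings are assumptions for illustration:

```python
import hashlib

def derive_key(x25519_ss: bytes, mlkem_ss: bytes, context: bytes) -> bytes:
    """AES-256 key from the concatenated hybrid KEM shared secrets."""
    return hashlib.shake_256(x25519_ss + mlkem_ss + context).digest(32)

def derive_nonce(segment_id: int, timestamp_ns: int) -> bytes:
    """96-bit GCM nonce: first 12 bytes of SHAKE-256(segment_id || timestamp)."""
    msg = (segment_id.to_bytes(8, "little")        # assumed encoding
           + timestamp_ns.to_bytes(8, "little"))   # assumed encoding
    return hashlib.shake_256(msg).digest(12)
```

Hashing both shared secrets together means an attacker must break both X25519 and ML-KEM-768 to recover the payload key.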
### Encrypted Segment Layout

```
Segment Header (64B, plaintext)
  flags: ENCRYPTED set
  content_hash: hash of PLAINTEXT payload (for integrity after decrypt)

Encapsulated Keys
  x25519_ephemeral_pk: [u8; 32]
  ml_kem_ciphertext: [u8; 1088]
  key_id_recipient: [u8; 16]

Encrypted Payload
  AES-256-GCM ciphertext (same size as plaintext + 16B auth tag)

Signature Footer (if also SIGNED)
  Signature covers header + encapsulated keys + encrypted payload
```
## 6. Capability Manifests (WITNESS_SEG)

WITNESS_SEGs provide cryptographic proof of provenance and computation:

### Witness Types

```
0x01  PROVENANCE   Who created this file and when
0x02  COMPUTATION  Proof that an index was correctly built
0x03  DELEGATION   Authorization chain for data access
0x04  AUDIT        Record of queries executed against this file
0x05  ATTESTATION  Hardware attestation (for Cognitum tiles)
```

### Provenance Witness

```
creator_key_id: [u8; 16]
creation_time: u64
tool_name: [u8; 64]
tool_version: [u8; 16]
input_hashes: [(hash256, description)]   Hashes of source data
transform_description: [u8; var]         What was done to create vectors
signature: [u8; var]                     Creator's signature over all above
```

### Computation Witness

```
computation_type: u8
  0 = HNSW construction
  1 = Quantization training
  2 = Temperature compaction
  3 = Overlay rebalance
  4 = Index merge

input_segments: [segment_id]
output_segments: [segment_id]
parameters: [(key, value)]
result_hash: hash256
duration_ns: u64
signature: [u8; var]
```

This allows any reader to verify that the index was built from the declared
vectors using the declared parameters — without re-running the computation.
## 7. Signing Performance Budget

For streaming ingest at 100K vectors/second with 1024-vector blocks:

```
Segment write rate:  ~100 segments/second (1024 vectors per VEC_SEG)
Manifest writes:     ~1/second (batched)

ML-DSA-65 signing:   ~4,500/second
Signing budget:      100 segment sigs + 1 manifest sig = 101/second
Utilization:         101 / 4,500 = 2.2%
```

Signing is not a bottleneck. Even at 10x the ingest rate, ML-DSA-65 has
headroom.
For verification during progressive load (reading 1000 segments):

```
ML-DSA-65 verify:     ~17,000/second
Verification budget:  1000 segments / 17,000 = 59 ms
```

All segments verified in under 60 ms. This runs concurrently with data
loading, so it adds minimal latency to the progressive boot sequence.
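Both budgets restated as a quick arithmetic check (the throughput constants are the estimates from the algorithm table above):

```python
VECTORS_PER_SEC = 100_000
VECTORS_PER_SEG = 1024
MLDSA65_SIGN_PER_SEC = 4_500
MLDSA65_VERIFY_PER_SEC = 17_000

seg_rate = VECTORS_PER_SEC / VECTORS_PER_SEG        # ~97.7 VEC_SEGs/second
sigs_per_sec = seg_rate + 1                         # plus ~1 manifest signature
utilization = sigs_per_sec / MLDSA65_SIGN_PER_SEC   # ~2.2% of signing capacity
verify_ms = 1000 / MLDSA65_VERIFY_PER_SEC * 1000    # ~59 ms for 1000 segments
```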
## 8. Core Profile Crypto

For the Core Profile (8 KB code budget), full ML-DSA-65 verification is
too large (~15 KB of code). Options:

1. **Hub verifies, tile trusts**: Hub checks all signatures before sending
   blocks to tiles. Tile only needs CRC32C for transport integrity.

2. **Truncated verification**: Tile verifies only the CRC32C of received
   blocks. Hub provides a signed attestation that the source segments
   were verified.

3. **FN-DSA-512**: Smaller verification code (~3 KB), 666 byte signatures.
   Fits in the tile code budget but is less mature.

Recommended: Option 1 (hub verifies, tile trusts) for the initial release.
The hub is a trusted component in the Cognitum architecture, and the
tile-hub channel is physically secure (on-chip mesh).
## 9. Algorithm Agility

The `sig_algo` and `checksum_algo` fields in segment headers and footers
allow algorithm migration without format changes:

```
Today:        ML-DSA-65 signatures, SHAKE-256 hashes
Future:       May migrate to ML-DSA-87 or newer NIST standards
Transition:   Write new segments with new algo, old segments remain valid
Verification: Reader tries algo from header field, no guessing needed
```

New algorithms are introduced by:
1. Assigning a new enum value
2. Writing a CRYPTO_SEG with the new key type
3. Signing new segments with the new algorithm
4. Old segments with old signatures remain verifiable

No file rewrite needed. No flag day. Gradual migration through the
append-only segment model.
---

`vendor/ruvector/docs/research/rvf/microkernel/wasm-runtime.md` (397 lines, vendored, new file)
# RVF WASM Microkernel and Cognitum Hardware Mapping

## 1. Design Philosophy

RVF must run on hardware ranging from a 64 KB WASM tile to a petabyte
cluster. The WASM microkernel is the minimal runtime that makes a tile
a first-class RVF citizen — capable of answering queries, ingesting
streams, and participating in distributed search.

The microkernel is not a shrunken version of the full runtime. It is a
**purpose-built execution core** that exposes the exact set of operations
a tile needs, and nothing more.
## 2. Cognitum Tile Architecture

### Hardware Constraints

```
+-----------------------------------+
|           Cognitum Tile           |
|                                   |
|  Code Memory:    8 KB             |
|  Data Memory:    8 KB             |
|  SIMD Scratch:   64 KB            |
|  Registers:      v128 (WASM SIMD) |
|  Clock:          ~1 GHz           |
|  Interconnect:   Mesh to hub      |
|                                   |
|  No filesystem. No mmap.          |
|  No allocator beyond scratch.     |
|  All I/O through hub messages.    |
+-----------------------------------+
```
### Memory Map

```
Code (8 KB):
  0x0000 - 0x0FFF   Microkernel WASM bytecode (4 KB)
  0x1000 - 0x17FF   Distance function hot path (2 KB)
  0x1800 - 0x1FFF   Decode / quantization stubs (2 KB)

Data (8 KB):
  0x0000 - 0x003F   Tile configuration (64 B)
  0x0040 - 0x00FF   Query scratch (192 B: query vector fp16)
  0x0100 - 0x01FF   Result buffer (256 B: top-K candidates)
  0x0200 - 0x03FF   Routing table (512 B: entry points + centroids)
  0x0400 - 0x07FF   Decode workspace (1 KB)
  0x0800 - 0x0FFF   Message I/O buffer (2 KB)
  0x1000 - 0x1FFF   Neighbor list cache (4 KB)

SIMD Scratch (64 KB):
  0x0000 - 0x7FFF   Vector block (up to 85 vectors @ 384-dim fp16)
  0x8000 - 0xBFFF   Distance accumulator / PQ tables (16 KB)
  0xC000 - 0xEFFF   Hot cache subset (12 KB)
  0xF000 - 0xFFFF   Temporary / spill (4 KB)
```
### Tile Budget

For 384-dim fp16 vectors:
- One vector: 768 bytes
- SIMD scratch holds: 64 KB / 768 B = ~85 vectors
- Top-K result buffer: 16 candidates * 16 B = 256 B
- Query vector: 768 B

A tile processes one block of ~85 vectors per hub round, computing distances
and maintaining a top-K heap entirely within scratch memory.
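The budget arithmetic as a quick check (constants are the figures above):

```python
DIM, FP16_BYTES = 384, 2

vector_bytes = DIM * FP16_BYTES               # 768 B per vector
block_capacity = (64 * 1024) // vector_bytes  # vectors that fit in SIMD scratch
topk_bytes = 16 * 16                          # 16 candidates * 16 B each
```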
## 3. Microkernel Exports

The WASM microkernel exports exactly these functions:

```wat
;; === Core Query Path ===

;; Initialize tile with configuration
;; config_ptr: pointer to 64B tile config in data memory
(export "rvf_init" (func $rvf_init (param $config_ptr i32) (result i32)))

;; Load query vector into query scratch
;; query_ptr: pointer to fp16 vector in data memory
;; dim: vector dimensionality
(export "rvf_load_query" (func $rvf_load_query
  (param $query_ptr i32) (param $dim i32) (result i32)))

;; Load a block of vectors into SIMD scratch
;; block_ptr: pointer to vector block in SIMD scratch
;; count: number of vectors
;; dtype: data type enum
(export "rvf_load_block" (func $rvf_load_block
  (param $block_ptr i32) (param $count i32)
  (param $dtype i32) (result i32)))

;; Compute distances between query and loaded block
;; metric: 0=L2, 1=IP, 2=cosine, 3=hamming
;; result_ptr: pointer to write distances
(export "rvf_distances" (func $rvf_distances
  (param $metric i32) (param $result_ptr i32) (result i32)))

;; Merge distances into top-K heap
;; dist_ptr: pointer to distance array
;; id_ptr: pointer to vector ID array
;; count: number of candidates
;; k: top-K to maintain
(export "rvf_topk_merge" (func $rvf_topk_merge
  (param $dist_ptr i32) (param $id_ptr i32)
  (param $count i32) (param $k i32) (result i32)))

;; Read current top-K results
;; out_ptr: pointer to write results (id, distance pairs)
(export "rvf_topk_read" (func $rvf_topk_read
  (param $out_ptr i32) (result i32)))

;; === Quantization ===

;; Load scalar quantization parameters (min/max per dim)
(export "rvf_load_sq_params" (func $rvf_load_sq_params
  (param $params_ptr i32) (param $dim i32) (result i32)))

;; Dequantize int8 block to fp16 in SIMD scratch
(export "rvf_dequant_i8" (func $rvf_dequant_i8
  (param $src_ptr i32) (param $dst_ptr i32)
  (param $count i32) (result i32)))

;; Load PQ codebook subset
(export "rvf_load_pq_codebook" (func $rvf_load_pq_codebook
  (param $codebook_ptr i32) (param $M i32)
  (param $K i32) (result i32)))

;; Compute PQ asymmetric distances
(export "rvf_pq_distances" (func $rvf_pq_distances
  (param $codes_ptr i32) (param $count i32)
  (param $result_ptr i32) (result i32)))

;; === HNSW Navigation ===

;; Load neighbor list for a node
(export "rvf_load_neighbors" (func $rvf_load_neighbors
  (param $node_id i64) (param $layer i32)
  (param $out_ptr i32) (result i32)))

;; Greedy search step: given current node, find nearest neighbor
(export "rvf_greedy_step" (func $rvf_greedy_step
  (param $current_id i64) (param $layer i32) (result i64)))

;; === Segment Verification ===

;; Verify segment header hash
(export "rvf_verify_header" (func $rvf_verify_header
  (param $header_ptr i32) (result i32)))

;; Compute CRC32C of a data region
(export "rvf_crc32c" (func $rvf_crc32c
  (param $data_ptr i32) (param $len i32) (result i32)))
```

### Export Count

14 exports. Each maps to a tight inner loop that fits in the 8 KB code budget.
The host (hub) is responsible for all I/O, segment parsing, and orchestration.
## 4. Host-Tile Protocol

Communication between the hub and tile uses fixed-size messages through
the 2 KB I/O buffer:

### Message Format

```
Offset  Size  Field       Description
------  ----  -----       -----------
0x00    2     msg_type    Message type enum
0x02    2     msg_length  Payload length
0x04    4     msg_id      Correlation ID
0x08    var   payload     Type-specific payload
```
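The 8-byte header packs naturally with `struct`; the byte order is not stated in the format above, so little-endian here is an assumption:

```python
import struct

HEADER = struct.Struct("<HHI")   # msg_type: u16, msg_length: u16, msg_id: u32

def pack_msg(msg_type: int, msg_id: int, payload: bytes) -> bytes:
    """Frame a message for the 2 KB I/O buffer (8 B header + payload)."""
    assert len(payload) <= 2048 - HEADER.size
    return HEADER.pack(msg_type, len(payload), msg_id) + payload

def unpack_msg(buf: bytes):
    """Split a framed message back into (msg_type, msg_id, payload)."""
    msg_type, length, msg_id = HEADER.unpack_from(buf)
    return msg_type, msg_id, buf[HEADER.size:HEADER.size + length]
```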
### Message Types

```
Hub -> Tile:
  0x01  LOAD_QUERY      Send query vector (768 B for 384-dim fp16)
  0x02  LOAD_BLOCK      Send vector block (up to ~1.5 KB compressed)
  0x03  LOAD_NEIGHBORS  Send neighbor list for a node
  0x04  LOAD_PARAMS     Send quantization parameters
  0x05  COMPUTE         Trigger distance computation
  0x06  READ_TOPK       Request current top-K results
  0x07  RESET           Clear tile state for new query

Tile -> Hub:
  0x81  TOPK_RESULT     Top-K results (id, distance pairs)
  0x82  NEED_BLOCK      Request a specific vector block
  0x83  NEED_NEIGHBORS  Request neighbor list for a node
  0x84  DONE            Computation complete
  0x85  ERROR           Error with code
```
### Execution Flow

```
Hub                                  Tile
 |                                    |
 |--- LOAD_QUERY (768B) ------------->|
 |                                    | rvf_load_query()
 |--- LOAD_PARAMS (SQ params) ------->|
 |                                    | rvf_load_sq_params()
 |--- LOAD_BLOCK (block 0) ---------->|
 |                                    | rvf_load_block()
 |                                    | rvf_distances()
 |                                    | rvf_topk_merge()
 |--- LOAD_BLOCK (block 1) ---------->|
 |                                    | rvf_load_block()
 |                                    | rvf_distances()
 |                                    | rvf_topk_merge()
 |                ...                 |
 |--- READ_TOPK --------------------->|
 |                                    | rvf_topk_read()
 |<--- TOPK_RESULT -------------------|
 |                                    |
```

### Pull Mode

For HNSW search, the tile drives the traversal:

```
Hub                                  Tile
 |                                    |
 |--- LOAD_QUERY -------------------->|
 |--- LOAD_NEIGHBORS (entry point) -->|
 |                                    | rvf_greedy_step()
 |<--- NEED_NEIGHBORS (next node) ----|
 |--- LOAD_NEIGHBORS (next node) ---->|
 |                                    | rvf_greedy_step()
 |<--- NEED_BLOCK (for candidate) ----|
 |--- LOAD_BLOCK -------------------->|
 |                                    | rvf_distances()
 |                                    | rvf_topk_merge()
 |<--- DONE --------------------------|
 |--- READ_TOPK --------------------->|
 |<--- TOPK_RESULT -------------------|
```
## 5. Three Hardware Profiles

### RVF Core Profile (Tile)

```
Target:         Cognitum tile (8 KB + 8 KB + 64 KB)
Features:       Distance compute, top-K, SQ dequant, CRC32C verify
Max vectors:    ~85 per block load
Max dimensions: 384 (fp16) or 768 (i8)
Index:          None (hub routes, tile computes)
Streaming:      Receive blocks from hub
Quantization:   i8 scalar only (no PQ on tile)
Compression:    None (hub decompresses before sending)
```

### RVF Hot Profile (Chip)

```
Target:         Cognitum chip (multiple tiles + shared memory)
Features:       Core + PQ distance, HNSW navigation, parallel tiles
Max vectors:    Limited by shared memory (~10K in shared cache)
Max dimensions: 1024
Index:          Layer A in shared memory
Streaming:      Block streaming across tiles
Quantization:   i8 scalar + PQ (6-bit)
Compression:    LZ4 decompress in shared memory
```

### RVF Full Profile (Hub/Desktop)

```
Target:         Desktop CPU, server, hub controller
Features:       All features, all segment types, all quantization
Max vectors:    Billions (limited by storage)
Max dimensions: Unlimited
Index:          Full HNSW (Layers A + B + C)
Streaming:      Full append-only segment model
Quantization:   All tiers (fp16, i8, PQ, binary)
Compression:    All (LZ4, ZSTD, custom)
Crypto:         Full (ML-DSA-65 signatures, SHAKE-256)
Temperature:    Full adaptive tiering
Overlay:        Full epoch model with compaction
```
### Profile Detection

The root manifest's `profile_id` field declares the minimum profile needed:

```
0x00  generic  Requires Full Profile features
0x01  core     Fully usable with Core Profile
0x02  hot      Requires Hot Profile minimum
0x03  full     Requires Full Profile
```

A Full Profile reader can always read Core or Hot files. A Core Profile
reader rejects Full Profile files but can read Core files. Hot Profile
readers can read Core and Hot files.
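These compatibility rules reduce to a one-line ordering check over the profile IDs; a sketch (treating `generic` per the table above, as requiring Full Profile features):

```python
CORE, HOT, FULL = 0x01, 0x02, 0x03

def can_read(reader_profile: int, file_profile: int) -> bool:
    """A reader handles its own profile and anything simpler."""
    if file_profile == 0x00:          # generic: needs Full Profile features
        return reader_profile == FULL
    return reader_profile >= file_profile
```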
|
||||
|
||||
## 6. SIMD Strategy by Platform

### WASM v128 (Tile/Browser)

```wasm
;; L2 distance: fp16 vectors, 384 dimensions
;; Process 8 fp16 values per v128 operation

(func $l2_fp16_384 (param $a_ptr i32) (param $b_ptr i32) (result f32)
  (local $acc v128)
  (local $i i32)
  (local.set $acc (v128.const i64x2 0 0))
  (local.set $i (i32.const 0))

  (block $done
    (loop $loop
      ;; Load 8 fp16 values, widen to f32x4 pairs
      ;; Subtract, square, accumulate
      ;; ... (8 values per iteration, 48 iterations for 384 dims)

      (local.set $i (i32.add (local.get $i) (i32.const 8)))
      (br_if $done (i32.ge_u (local.get $i) (i32.const 384)))
      (br $loop)
    )
  )
  ;; Horizontal sum of accumulator
  ;; Return L2 distance
)
```

### AVX-512 (Desktop/Server)

```
; Process 32 fp16 values per cycle with VCVTPH2PS + VFMADD231PS
; 384 dims = 12 iterations of 32 values
; ~12 cycles per distance computation
```

### ARM NEON (Mobile/Edge)

```
; Process 8 fp16 values per cycle with FMLA
; 384 dims = 48 iterations of 8 values
; ~48 cycles per distance computation
```

## 7. Microkernel Size Budget

```
Function                  Estimated Size
--------                  --------------
rvf_init                  128 B
rvf_load_query            64 B
rvf_load_block            256 B
rvf_distances (L2 fp16)   512 B
rvf_distances (L2 i8)     384 B
rvf_distances (IP fp16)   512 B
rvf_distances (hamming)   256 B
rvf_topk_merge            384 B
rvf_topk_read             64 B
rvf_load_sq_params        64 B
rvf_dequant_i8            256 B
rvf_load_pq_codebook      128 B
rvf_pq_distances          512 B
rvf_load_neighbors        128 B
rvf_greedy_step           512 B
rvf_verify_header         128 B
rvf_crc32c                256 B
Message dispatch loop     384 B
Utility functions         256 B
WASM overhead             512 B
                          ----------
Total                     ~5,700 B (< 8 KB code budget)
```

Remaining ~2.5 KB of code space is available for domain-specific extensions
(e.g., codon distance for the RVDNA profile, token overlap for the RVText profile).

## 8. Fault Isolation

Each tile runs in a WASM sandbox. A tile cannot:

- Access hub memory directly
- Communicate with other tiles except through the hub
- Allocate memory beyond its 8 KB data + 64 KB scratch
- Execute code beyond its 8 KB code space
- Trap without the hub catching and recovering

If a tile traps (out-of-bounds, unreachable, stack overflow):

1. The hub catches the trap
2. The hub marks the tile as faulted
3. The hub reassigns the tile's work to another tile (or processes it locally)
4. The hub optionally restarts the faulted tile with fresh state

This makes the system resilient to individual tile failures — important for
large tile arrays where hardware faults are inevitable.
377
vendor/ruvector/docs/research/rvf/profiles/domain-profiles.md
vendored
Normal file
@@ -0,0 +1,377 @@
# RVF Domain Profiles

## 1. Profile Architecture

A domain profile is a **semantic overlay** on the universal RVF substrate. It does
not change the wire format — every profile-specific file is a valid RVF file. The
profile adds:

1. **Semantic type annotations** for vector dimensions
2. **Domain-specific distance metrics**
3. **Custom quantization strategies** optimized for the domain
4. **Metadata schemas** for domain-specific labels and provenance
5. **Query preprocessing** conventions

Profiles are declared in a PROFILE_SEG and referenced by the root manifest's
`profile_id` field.

```
+-- RVF Universal Substrate --+
| Segments, manifests, tiers  |
| HNSW index, overlays        |
| Temperature, compaction     |
+-----------------------------+
              |
              | profile_id
              v
+-- Domain Profile Layer --+
| Semantic types           |
| Custom distances         |
| Metadata schema          |
| Query conventions        |
+--------------------------+
```

## 2. PROFILE_SEG Binary Layout

```
Offset  Size  Field                Description
------  ----  -----                -----------
0x00    4     profile_magic        Profile-specific magic number
0x04    2     profile_version      Profile spec version
0x06    2     profile_id           Same as root manifest profile_id
0x08    32    profile_name         UTF-8 null-terminated name
0x28    8     schema_length        Length of metadata schema
0x30    var   metadata_schema      JSON or binary schema for META_SEG entries
var     8     distance_config_len  Length of distance configuration
var     var   distance_config      Distance metric parameters
var     8     quant_config_len     Length of quantization configuration
var     var   quant_config         Domain-specific quantization parameters
var     8     preprocess_len       Length of preprocessing spec
var     var   preprocess_spec      Query preprocessing pipeline description
```

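The fixed 0x30-byte prefix of this layout (through `schema_length`) maps onto a single struct format. A non-normative sketch, assuming little-endian encoding (the table does not state byte order); the helper names are illustrative:

```python
import struct

# Fixed prefix: u32 magic, u16 version, u16 id, 32-byte name, u64 schema_length.
PREFIX_FMT = "<IHH32sQ"  # 4 + 2 + 2 + 32 + 8 = 0x30 bytes

def pack_profile_prefix(magic: int, version: int, profile_id: int,
                        name: str, schema_length: int) -> bytes:
    # Truncate to 31 bytes so the name stays null-terminated.
    name_bytes = name.encode("utf-8")[:31].ljust(32, b"\x00")
    return struct.pack(PREFIX_FMT, magic, version, profile_id,
                       name_bytes, schema_length)

def parse_profile_prefix(buf: bytes):
    magic, version, profile_id, name, schema_len = struct.unpack_from(PREFIX_FMT, buf, 0)
    return magic, version, profile_id, name.rstrip(b"\x00").decode("utf-8"), schema_len
```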
## 3. RVDNA Profile (Genomics)

### Profile Declaration

```
profile_magic: 0x52444E41 ("RDNA")
profile_id:    0x01
profile_name:  "rvdna"
```

### Semantic Types

RVDNA vectors encode biological sequences at multiple granularities:

| Granularity | Dimensions | Encoding | Use Case |
|-------------|------------|----------|----------|
| Codon | 64 | Frequency of each codon in reading frame | Gene-level comparison |
| K-mer (k=6) | 4096 | 6-mer frequency spectrum | Species identification |
| Motif | 128-512 | Learned motif embeddings (transformer) | Regulatory element search |
| Structure | 256 | Protein secondary structure embedding | Fold similarity |
| Epigenetic | 384 | Methylation + histone mark embedding | Epigenomic comparison |

### Distance Metrics

```
Codon frequency:  Jensen-Shannon divergence (symmetric KL)
K-mer spectrum:   Cosine similarity (normalized frequency vectors)
Motif embedding:  L2 distance (Euclidean in learned space)
Structure:        L2 distance with structure-aware weighting
Epigenetic:       Weighted cosine (CpG density as weight)
```

### Quantization Strategy

Genomic vectors have specific statistical properties:

- **Codon frequencies**: Sparse, non-negative, sum-to-1. Use **scalar quantization
  with log transform**: `q = round(log2(freq + epsilon) * scale)`. 8-bit covers
  6 orders of magnitude.

- **K-mer spectra**: Very sparse (most 6-mers absent in short reads). Use
  **sparse encoding**: store only non-zero k-mer indices + values. Typical
  compression: 20-50x over dense.

- **Learned embeddings**: Gaussian-distributed. Standard PQ works well.
  M=32 subspaces, K=256 centroids (8-bit codes).

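The codon-frequency scheme above can be sketched directly. The `EPSILON` and `SCALE` values below are illustrative choices, not fixed by the profile; the point is that the log transform spreads roughly six orders of magnitude of frequency across an unsigned 8-bit code:

```python
import math

EPSILON = 1e-6   # floor so that freq = 0 stays finite
SCALE = -12.0    # maps log2 of [1e-6, 1] onto roughly [0, 239]

def quantize(freq: float) -> int:
    """q = round(log2(freq + epsilon) * scale), clamped to an 8-bit code."""
    q = round(math.log2(freq + EPSILON) * SCALE)
    return max(0, min(255, q))

def dequantize(q: int) -> float:
    """Invert the log transform (approximate, due to rounding)."""
    return 2.0 ** (q / SCALE) - EPSILON
```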
### Metadata Schema

```json
{
  "type": "rvdna",
  "fields": {
    "organism": { "type": "string", "indexed": true },
    "gene_id": { "type": "string", "indexed": true },
    "chromosome": { "type": "string", "indexed": true },
    "position_start": { "type": "u64", "indexed": true },
    "position_end": { "type": "u64", "indexed": true },
    "strand": { "type": "enum", "values": ["+", "-"] },
    "quality_score": { "type": "f32" },
    "source_format": { "type": "enum", "values": ["FASTA", "FASTQ", "BAM", "VCF"] },
    "read_depth": { "type": "u32" },
    "gc_content": { "type": "f32" }
  }
}
```

### Query Preprocessing

For RVDNA queries:

1. Input: Raw sequence string (ACGT...)
2. Compute k-mer frequency spectrum
3. Apply log transform for codon/k-mer queries
4. Normalize to unit length for cosine metrics
5. Encode as fp16 vector
6. Submit to RVF query path

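Steps 1-4 above can be sketched for the k-mer granularity. k=3 (64 dimensions) keeps the example small; the RVDNA k-mer type uses k=6 (4096 dimensions). The function name is illustrative:

```python
import math
from collections import Counter
from itertools import product

def kmer_spectrum(seq: str, k: int = 3) -> list:
    """Raw sequence -> unit-length k-mer frequency vector (4^k dims)."""
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    index = {km: i for i, km in enumerate(kmers)}
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    vec = [0.0] * len(kmers)
    for km, c in counts.items():
        if km in index:                 # skip k-mers containing N, gaps, etc.
            vec[index[km]] = float(c)
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]      # unit length for cosine metrics
```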
## 4. RVText Profile (Language)

### Profile Declaration

```
profile_magic: 0x52545854 ("RTXT")
profile_id:    0x02
profile_name:  "rvtext"
```

### Semantic Types

| Granularity | Dimensions | Source | Use Case |
|-------------|------------|--------|----------|
| Token | 768-1536 | Transformer last hidden state | Semantic search |
| Sentence | 384-768 | Sentence transformer pooled output | Document retrieval |
| Paragraph | 384-1024 | Long-context model embedding | Passage ranking |
| Document | 256-512 | Document-level embedding | Collection search |
| Sparse | 30522 | BM25/SPLADE term weights | Lexical matching |

### Distance Metrics

```
Dense embeddings:  Cosine similarity (normalized dot product)
Sparse (SPLADE):   Dot product on sparse vectors
Hybrid:            alpha * dense_score + (1-alpha) * sparse_score
Matryoshka:        Cosine on truncated prefix (adaptive dimensionality)
```

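The hybrid metric above is a plain linear blend. A minimal sketch, assuming dense vectors are already L2-normalized (so cosine reduces to a dot product) and sparse vectors are SPLADE-style `term_id -> weight` maps; the function names are illustrative:

```python
def dense_score(a: list, b: list) -> float:
    """Cosine on pre-normalized vectors = dot product."""
    return sum(x * y for x, y in zip(a, b))

def sparse_score(a: dict, b: dict) -> float:
    """Dot product over the (usually small) set of shared terms."""
    return sum(w * b[t] for t, w in a.items() if t in b)

def hybrid_score(dense_a, dense_b, sparse_a, sparse_b, alpha: float = 0.5) -> float:
    """alpha * dense + (1 - alpha) * sparse, per the table above."""
    return alpha * dense_score(dense_a, dense_b) + (1 - alpha) * sparse_score(sparse_a, sparse_b)
```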
### Quantization Strategy

Text embeddings are well-suited to aggressive quantization:

- **Dense (384-768 dim)**: Binary quantization achieves 0.95+ recall on
  normalized embeddings. 384 dims -> 48 bytes. Use binary for the cold tier,
  int8 for hot.

- **Sparse (SPLADE)**: Store as sorted (term_id, weight) pairs with
  delta-encoded term_ids. Typical sparsity: 100-300 non-zero terms out
  of 30K vocabulary. Compression: ~100x over dense.

- **Matryoshka**: Store full-dimension vectors but index only the first
  D/4 dimensions. Progressive refinement uses more dimensions.

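The sparse storage scheme above hinges on the delta step: sorting the `(term_id, weight)` pairs makes the id gaps small, which is what makes subsequent varint packing effective. A sketch of just the delta step (varint packing omitted; the helper names are illustrative):

```python
def delta_encode(pairs: list) -> list:
    """Sorted (term_id, weight) pairs -> (id_delta, weight) pairs."""
    out, prev = [], 0
    for term_id, w in sorted(pairs):
        out.append((term_id - prev, w))
        prev = term_id
    return out

def delta_decode(encoded: list) -> list:
    """Invert delta_encode, recovering absolute term_ids."""
    out, acc = [], 0
    for delta, w in encoded:
        acc += delta
        out.append((acc, w))
    return out
```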
### Metadata Schema

```json
{
  "type": "rvtext",
  "fields": {
    "text": { "type": "string", "stored": true, "max_length": 8192 },
    "source_url": { "type": "string", "indexed": true },
    "language": { "type": "string", "indexed": true },
    "model_id": { "type": "string" },
    "chunk_index": { "type": "u32" },
    "total_chunks": { "type": "u32" },
    "token_count": { "type": "u32" },
    "timestamp": { "type": "u64" }
  }
}
```

### Query Preprocessing

1. Input: Raw text string
2. Tokenize with model-specific tokenizer
3. Encode through embedding model (or receive pre-computed embedding)
4. L2-normalize for cosine similarity
5. Optionally: compute SPLADE sparse expansion
6. Submit dense + sparse to hybrid query path

## 5. RVGraph Profile (Networks)

### Profile Declaration

```
profile_magic: 0x52475248 ("RGRH")
profile_id:    0x03
profile_name:  "rvgraph"
```

### Semantic Types

| Granularity | Dimensions | Source | Use Case |
|-------------|------------|--------|----------|
| Node | 64-256 | Node2Vec / GCN embedding | Node similarity |
| Edge | 64-128 | Edge feature embedding | Link prediction |
| Subgraph | 128-512 | Graph kernel embedding | Subgraph matching |
| Community | 64-256 | Community embedding | Community detection |
| Spectral | 32-128 | Laplacian eigenvectors | Graph structure |

### Distance Metrics

```
Node embedding:  L2 distance
Edge embedding:  Cosine similarity
Subgraph:        Wasserstein distance (approximated by L2 on sorted features)
Community:       Cosine similarity
Spectral:        L2 on normalized eigenvectors
```

### Integration with Overlay System

RVGraph uniquely integrates with the RVF overlay epoch system:

- **Graph structure** is stored in OVERLAY_SEGs (not just as metadata)
- **Node embeddings** are stored in VEC_SEGs
- **Edge weights** are overlay deltas
- **Community assignments** are partition summaries
- **Min-cut witnesses** directly serve graph partitioning queries

This means RVGraph files are simultaneously vector stores AND graph databases.
The overlay system provides dynamic graph operations (add/remove edges,
rebalance partitions) while the vector system provides similarity search.

### Metadata Schema

```json
{
  "type": "rvgraph",
  "fields": {
    "node_type": { "type": "string", "indexed": true },
    "edge_type": { "type": "string", "indexed": true },
    "node_label": { "type": "string", "indexed": true },
    "degree": { "type": "u32", "indexed": true },
    "community_id": { "type": "u32", "indexed": true },
    "pagerank": { "type": "f32" },
    "clustering_coeff": { "type": "f32" },
    "source_graph": { "type": "string" }
  }
}
```

## 6. RVVision Profile (Imagery)

### Profile Declaration

```
profile_magic: 0x52564953 ("RVIS")
profile_id:    0x04
profile_name:  "rvvision"
```

### Semantic Types

| Granularity | Dimensions | Source | Use Case |
|-------------|------------|--------|----------|
| Patch | 64-256 | ViT patch embedding | Region search |
| Image | 512-2048 | CLIP / DINOv2 global embedding | Image retrieval |
| Object | 256-512 | Object detection crop embedding | Object search |
| Scene | 128-512 | Scene classification embedding | Scene matching |
| Multi-scale | 256 * N | Pyramid of embeddings at scales | Scale-invariant search |

### Distance Metrics

```
CLIP embedding:  Cosine similarity (model-normalized)
DINOv2:          Cosine similarity
Patch:           L2 distance (not normalized)
Multi-scale:     Weighted sum of per-scale cosine similarities
```

### Quantization Strategy

Vision embeddings have high intrinsic dimensionality but are compressible:

- **CLIP (512-dim)**: PQ with M=64, K=256 works well. Binary quantization
  achieves 0.90+ recall.

- **DINOv2 (768-dim)**: Similar to CLIP. PQ M=96, K=256.

- **Patch embeddings**: Large volume (196+ patches per image). Aggressive
  quantization to 4-bit scalar. Use residual PQ for high-recall applications.

### Spatial Metadata

RVVision supports spatial queries through metadata:

```json
{
  "type": "rvvision",
  "fields": {
    "image_id": { "type": "string", "indexed": true },
    "patch_row": { "type": "u16" },
    "patch_col": { "type": "u16" },
    "scale": { "type": "f32" },
    "bbox_x": { "type": "f32" },
    "bbox_y": { "type": "f32" },
    "bbox_w": { "type": "f32" },
    "bbox_h": { "type": "f32" },
    "object_class": { "type": "string", "indexed": true },
    "confidence": { "type": "f32" },
    "model_id": { "type": "string" }
  }
}
```

## 7. Custom Profile Registration

New profiles can be registered by writing a PROFILE_SEG:

```
1. Choose a unique profile_id (0x10-0xEF for custom profiles)
2. Define a 4-byte profile_magic
3. Define metadata schema
4. Define distance metric configuration
5. Define quantization recommendations
6. Write PROFILE_SEG into the RVF file
7. Set profile_id in root manifest
```

The profile system is open — any domain can define its own profile as long
as it maps onto the RVF substrate. The substrate does not need to understand
the domain semantics; it only needs to store vectors, compute distances,
and maintain indexes.

## 8. Cross-Profile Queries

RVF files with different profiles can be queried together if their vectors
share a compatible embedding space. This is common in multimodal applications:

```
Query: "Find images similar to this text description"

1. Text embedding (RVText profile)     -> 512-dim CLIP text vector
2. Image database (RVVision profile)   -> 512-dim CLIP image vectors
3. Distance metric: Cosine similarity (shared CLIP space)
4. Result: Images ranked by text-image similarity
```

The query path treats both files as RVF files. The profile only affects
preprocessing and metadata interpretation — the core distance computation
and indexing are profile-agnostic.

## 9. Profile Compatibility Matrix

| Source Profile | Target Profile | Compatible? | Condition |
|----------------|----------------|-------------|-----------|
| RVDNA | RVDNA | Yes | Same granularity |
| RVText | RVText | Yes | Same model or compatible space |
| RVVision | RVVision | Yes | Same model or compatible space |
| RVText | RVVision | Yes | If both use CLIP or shared space |
| RVDNA | RVText | No* | Unless mapped through protein language model |
| RVGraph | Any | Partial | Node embeddings may share space |

*Cross-domain compatibility requires explicit embedding space alignment,
which is outside the scope of the format spec but enabled by it.
140
vendor/ruvector/docs/research/rvf/spec/00-overview.md
vendored
Normal file
@@ -0,0 +1,140 @@
# RVF: RuVector Format Specification

## The Universal Substrate for Living Intelligence

**Version**: 0.1.0-draft
**Status**: Research
**Date**: 2026-02-13

---

## What RVF Is

RVF is not a file format. It is a **runtime substrate** — a living, self-reorganizing
binary medium that stores, streams, indexes, and adapts vector intelligence across
any domain, any scale, and any hardware tier.

Where traditional formats are snapshots of data, RVF is a **continuously evolving
organism**. It ingests without rewriting. It answers queries before it finishes loading.
It reorganizes its own layout to match access patterns. It survives crashes without
journals. It fits on a 64 KB WASM tile or scales to a petabyte hub.

## The Four Laws of RVF

Every design decision in RVF derives from four inviolable laws:

### Law 1: Truth Lives at the Tail

The most recent `MANIFEST_SEG` at the tail of the file is the sole source of truth.
No front-loaded metadata. No section directory that must be rewritten on mutation.
Readers scan backward from EOF to find the latest manifest and know exactly what
to map.

**Consequence**: Append-only writes. Streaming ingest. No global rewrite ever.

### Law 2: Every Segment Is Independently Valid

Each segment carries its own magic number, length, content hash, and type tag.
A reader encountering any segment in isolation can verify it, identify it, and
decide whether to process it. No segment depends on prior segments for structural
validity.

**Consequence**: Crash safety for free. Parallel verification. Segment-level
integrity without a global checksum.

### Law 3: Data and State Are Separated

Vector payloads, index structures, overlay graphs, quantization dictionaries, and
runtime metadata live in distinct segment types. The manifest binds them together
but they never intermingle. This means you can replace the index without touching
vectors, update the overlay without rebuilding adjacency, or swap quantization
without re-encoding.

**Consequence**: Incremental updates. Modular evolution. Zero-copy segment reuse.

### Law 4: The Format Adapts to Its Workload

RVF monitors access patterns through lightweight sketches and periodically
reorganizes: promoting hot vectors to faster tiers, compacting stale overlays,
lazily building deeper index layers. The format is not static — it converges
toward the optimal layout for its actual workload.

**Consequence**: Self-tuning performance. No manual optimization. The file gets
faster the more you use it.

## Design Coordinates

| Property | RVF Answer |
|----------|------------|
| Write model | Append-only segments + background compaction |
| Read model | Tail-manifest scan, then progressive mmap |
| Index model | Layered availability (entry points -> partial -> full) |
| Compression | Temperature-tiered (fp16 hot, 5-7 bit warm, 3 bit cold) |
| Alignment | 64-byte for SIMD (AVX-512, NEON, WASM v128) |
| Crash safety | Segment-level hashes, no WAL required |
| Crypto | Post-quantum (ML-DSA-65 signatures, SHAKE-256 hashes) |
| Streaming | Yes — first query before full load |
| Hardware | 8 KB tile to petabyte hub |
| Domain | Universal — genomics, text, graph, vision as profiles |

## Acceptance Test

> Cold start on a 10 million vector file: load and answer the first query with a
> useful (recall >= 0.7) result without reading more than the last 4 MB, then
> converge to full quality (recall >= 0.95) as it progressively maps more segments.

## Document Map

| Document | Path | Content |
|----------|------|---------|
| This overview | `spec/00-overview.md` | Philosophy, laws, design coordinates |
| Segment model | `spec/01-segment-model.md` | Segment types, headers, append-only rules |
| Manifest system | `spec/02-manifest-system.md` | Two-level manifests, hotset pointers |
| Temperature tiering | `spec/03-temperature-tiering.md` | Adaptive layout, access sketches, promotion |
| Progressive indexing | `spec/04-progressive-indexing.md` | Layered HNSW, partial availability |
| Overlay epochs | `spec/05-overlay-epochs.md` | Streaming min-cut, epoch boundaries |
| Wire format | `wire/binary-layout.md` | Byte-level binary format reference |
| WASM microkernel | `microkernel/wasm-runtime.md` | Cognitum tile mapping, WASM exports |
| Domain profiles | `profiles/domain-profiles.md` | RVDNA, RVText, RVGraph, RVVision |
| Crypto spec | `crypto/quantum-signatures.md` | Post-quantum primitives, segment signing |
| Benchmarks | `benchmarks/acceptance-tests.md` | Performance targets, test methodology |

## Relationship to RVDNA

RVDNA (RuVector DNA) was the first domain-specific format for genomic vector
intelligence. In the RVF model, RVDNA becomes a **profile** — a set of conventions
for how genomic data maps onto the universal RVF substrate:

```
RVF (universal substrate)
|
+-- RVF Core Profile  (minimal, fits on 64KB tile)
+-- RVF Hot Profile   (chip-optimized, SIMD-heavy)
+-- RVF Full Profile  (hub-scale, all features)
|
+-- Domain Profiles
    +-- RVDNA    (genomics: codons, motifs, k-mers)
    +-- RVText   (language: embeddings, token graphs)
    +-- RVGraph  (networks: adjacency, partitions)
    +-- RVVision (imagery: feature maps, patch vectors)
```

The substrate carries the laws. The profiles carry the semantics.

## Design Answers

**Q: Random writes or append-only plus compaction?**
A: Append-only plus compaction. This gives speed and crash safety almost for free.
Random writes add complexity for marginal benefit in the vector workload.

**Q: Primary target mmap on desktop CPUs or also microcontroller tiles?**
A: Both. RVF defines three hardware profiles. The Core profile fits in 8 KB code +
8 KB data + 64 KB SIMD scratch. The Full profile assumes mmap on desktop-class
memory. The wire format is identical — only the runtime behavior changes.

**Q: Which property matters most?**
A: All four are non-negotiable, but the priority order for conflict resolution is:
1. **Streamable** (never block on write)
2. **Progressive** (answer before fully loaded)
3. **Adaptive** (self-optimize over time)
4. **p95 speed** (predictable tail latency)
224
vendor/ruvector/docs/research/rvf/spec/01-segment-model.md
vendored
Normal file
@@ -0,0 +1,224 @@
# RVF Segment Model

## 1. Append-Only Segment Architecture

An RVF file is a linear sequence of **segments**. Each segment is a self-contained,
independently verifiable unit. New data is always appended — never inserted into or
overwritten within existing segments.

```
+------------+------------+------------+     +------------+
| Segment 0  | Segment 1  | Segment 2  | ... | Segment N  |  <-- EOF
+------------+------------+------------+     +------------+
                                                   ^
                                        Latest MANIFEST_SEG
                                        (source of truth)
```

### Why Append-Only

| Property | Benefit |
|----------|---------|
| Write amplification | Zero — each byte written once until compaction |
| Crash safety | Partial segment at tail is detectable and discardable |
| Concurrent reads | Readers see a consistent snapshot at any manifest boundary |
| Streaming ingest | Writer never blocks on reorganization |
| mmap friendliness | Pages only grow — no invalidation of mapped regions |

## 2. Segment Header

Every segment begins with a fixed 64-byte header. The header is 64-byte aligned
to match SIMD register width.

```
Offset  Size  Field             Description
------  ----  -----             -----------
0x00    4     magic             0x52564653 ("RVFS" in ASCII)
0x04    1     version           Segment format version (currently 1)
0x05    1     seg_type          Segment type enum (see below)
0x06    2     flags             Bitfield: compressed, encrypted, signed, sealed, etc.
0x08    8     segment_id        Monotonically increasing segment ordinal
0x10    8     payload_length    Byte length of payload (after header, before footer)
0x18    8     timestamp_ns      Nanosecond UNIX timestamp of segment creation
0x20    1     checksum_algo     Hash algorithm enum: 0=CRC32C, 1=XXH3-128, 2=SHAKE-256
0x21    1     compression       Compression enum: 0=none, 1=LZ4, 2=ZSTD, 3=custom
0x22    2     reserved_0        Must be zero
0x24    4     reserved_1        Must be zero
0x28    16    content_hash      First 128 bits of payload hash (algorithm per checksum_algo)
0x38    4     uncompressed_len  Original payload size (0 if no compression)
0x3C    4     alignment_pad     Padding to reach 64-byte boundary
```

**Total header**: 64 bytes (one cache line, one AVX-512 register width).

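The layout above maps onto a single `struct` format string. A non-normative sketch that packs a minimal header, assuming little-endian encoding (the table does not state byte order):

```python
import struct

# Fields in table order: magic, version, seg_type, flags, segment_id,
# payload_length, timestamp_ns, checksum_algo, compression, reserved_0,
# reserved_1, content_hash, uncompressed_len, alignment_pad = 64 bytes.
HEADER_FMT = "<IBBHQQQBBHI16sII"

def pack_header(seg_type: int, segment_id: int, payload_length: int,
                timestamp_ns: int, content_hash: bytes) -> bytes:
    return struct.pack(
        HEADER_FMT,
        0x52564653,      # magic "RVFS"
        1,               # version
        seg_type,
        0,               # flags: none set
        segment_id,
        payload_length,
        timestamp_ns,
        0,               # checksum_algo: CRC32C
        0,               # compression: none
        0,               # reserved_0
        0,               # reserved_1
        content_hash,    # first 128 bits of payload hash
        0,               # uncompressed_len (0 = no compression)
        0,               # alignment_pad
    )
```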
### Magic Validation

Readers scanning backward from EOF look for `0x52564653` at 64-byte aligned
boundaries. This enables fast tail-scan even on corrupted files.

### Flags Bitfield

```
Bit 0:  COMPRESSED   Payload is compressed per compression field
Bit 1:  ENCRYPTED    Payload is encrypted (key info in manifest)
Bit 2:  SIGNED       A signature footer follows the payload
Bit 3:  SEALED       Segment is immutable (compaction output)
Bit 4:  PARTIAL      Segment is a partial write (streaming ingest)
Bit 5:  TOMBSTONE    Segment logically deletes a prior segment
Bit 6:  HOT          Segment contains temperature-promoted data
Bit 7:  OVERLAY      Segment contains overlay/delta data
Bit 8:  SNAPSHOT     Segment contains full snapshot (not delta)
Bit 9:  CHECKPOINT   Segment is a safe rollback point
Bits 10-15: reserved
```

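For illustration, the bitfield maps onto constants like the following (names mirror the table; the `has_flag` helper is not part of the spec):

```python
# 1 << bit, per the table above.
COMPRESSED = 1 << 0
ENCRYPTED  = 1 << 1
SIGNED     = 1 << 2
SEALED     = 1 << 3
PARTIAL    = 1 << 4
TOMBSTONE  = 1 << 5
HOT        = 1 << 6
OVERLAY    = 1 << 7
SNAPSHOT   = 1 << 8
CHECKPOINT = 1 << 9

def has_flag(flags: int, flag: int) -> bool:
    """Test one flag in the 16-bit flags field."""
    return bool(flags & flag)
```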
## 3. Segment Types

```
Value  Name          Purpose
-----  ----          -------
0x01   VEC_SEG       Raw vector payloads (the actual embeddings)
0x02   INDEX_SEG     HNSW adjacency lists, entry points, routing tables
0x03   OVERLAY_SEG   Graph overlay deltas, partition updates, min-cut witnesses
0x04   JOURNAL_SEG   Metadata mutations (label changes, deletions, moves)
0x05   MANIFEST_SEG  Segment directory, hotset pointers, epoch state
0x06   QUANT_SEG     Quantization dictionaries and codebooks
0x07   META_SEG      Arbitrary key-value metadata (tags, provenance, lineage)
0x08   HOT_SEG       Temperature-promoted hot data (vectors + neighbors)
0x09   SKETCH_SEG    Access counter sketches for temperature decisions
0x0A   WITNESS_SEG   Capability manifests, proof of computation, audit trails
0x0B   PROFILE_SEG   Domain profile declarations (RVDNA, RVText, etc.)
0x0C   CRYPTO_SEG    Key material, signature chains, certificate anchors
0x0D   METAIDX_SEG   Metadata inverted indexes for filtered search
```

### Reserved Range

Types `0x00` and `0xF0`-`0xFF` are reserved. `0x00` indicates an uninitialized
or zeroed region (not a valid segment). `0xF0`-`0xFF` are reserved for
implementation-specific extensions.

## 4. Segment Footer

If the `SIGNED` flag is set, the payload is followed by a signature footer:

```
Offset  Size  Field          Description
------  ----  -----          -----------
0x00    2     sig_algo       Signature algorithm: 0=Ed25519, 1=ML-DSA-65, 2=SLH-DSA-128s
0x02    2     sig_length     Byte length of signature
0x04    var   signature      The signature bytes
var     4     footer_length  Total footer size (for backward scanning)
```

Unsigned segments have no footer — the next segment header follows immediately
after the payload (at the next 64-byte aligned boundary).

## 5. Segment Lifecycle

### Write Path

```
1. Allocate segment ID (monotonic counter)
2. Compute payload hash
3. Write header + payload + optional footer
4. fsync (or fdatasync for non-manifest segments)
5. Write MANIFEST_SEG referencing the new segment
6. fsync the manifest
```

The two-fsync protocol ensures that:

- If a crash occurs before step 6, the orphan segment is harmless (no manifest points to it)
- If a crash occurs during step 6, the partial manifest is detectable (bad hash)
- After step 6, the segment is durably committed

### Read Path

```
1. Seek to EOF
2. Scan backward for latest MANIFEST_SEG (look for magic at aligned boundaries)
3. Parse manifest -> get segment directory
4. Map segments on demand (progressive loading)
```

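Step 2 can be sketched as a backward walk over 64-byte boundaries. `MAGIC` assumes the magic word is stored little-endian, which the spec does not state here; a real reader would also verify the header's content hash before trusting a hit:

```python
MAGIC = b"SFVR"  # 0x52564653 little-endian (byte order is an assumption)

def find_last_magic(data: bytes):
    """Offset of the last 64-byte-aligned segment header, or None."""
    end = (len(data) // 64) * 64          # last aligned boundary
    for off in range(end - 64, -1, -64):  # walk backward from EOF
        if data[off:off + 4] == MAGIC:
            return off
    return None
```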
### Compaction

Compaction merges multiple segments into fewer, larger, sealed segments:

```
Before:  [VEC_SEG_1] [VEC_SEG_2] [VEC_SEG_3] [MANIFEST_3]
After:   [VEC_SEG_1] [VEC_SEG_2] [VEC_SEG_3] [MANIFEST_3] [VEC_SEG_sealed] [MANIFEST_4]
                                                          ^^^^^^^^^^^^^^^^^
                                                          New sealed segment
                                                          merging 1+2+3
```

Old segments are marked with TOMBSTONE entries in the new manifest. Space is
reclaimed when the file is eventually rewritten (or old segments are in a
separate file in multi-file mode).

### Multi-File Mode

For very large datasets, RVF can span multiple files:

```
data.rvf         Main file with manifests and hot data
data.rvf.cold.0  Cold segment shard 0
data.rvf.cold.1  Cold segment shard 1
data.rvf.idx.0   Index segment shard 0
```

The manifest in the main file contains shard references with file paths and
byte ranges. This enables cold data to live on slower storage while hot data
stays on fast storage.

## 6. Segment Addressing

Segments are addressed by their `segment_id` (monotonically increasing 64-bit
integer). The manifest maps segment IDs to file offsets (and optionally shard
file paths in multi-file mode).

Within a segment, data is addressed by **block offset** — a 32-bit offset from
the start of the segment payload. This limits individual segments to 4 GB, which
is intentional: it keeps segments manageable for compaction and progressive loading.

### Block Structure Within VEC_SEG

```
+-------------------+
| Block Header (16B)|
|  block_id: u32    |
|  count: u32       |
|  dim: u16         |
|  dtype: u8        |
|  pad: [u8; 5]     |
+-------------------+
| Vectors           |
|  (count * dim *   |
|   sizeof(dtype))  |
|  [64B aligned]    |
+-------------------+
| ID Map            |
|  (varint delta    |
|   encoded IDs)    |
+-------------------+
| Block Footer      |
|  crc32c: u32      |
+-------------------+
```

Vectors within a block are stored **columnar** — all dimension 0 values, then all
dimension 1 values, etc. This maximizes compression ratio. But the HOT_SEG stores
vectors **interleaved** (row-major) for cache-friendly sequential scan during
top-K refinement.

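The ID map's varint delta encoding can be sketched as follows. LEB128-style varints (7 payload bits per byte, high bit as continuation flag) are an assumption here; the exact varint flavor is defined by the wire-format documents:

```python
def encode_varint(n: int, out: bytearray) -> None:
    # LEB128-style: 7 bits per byte, high bit set means "more bytes follow".
    while True:
        b = n & 0x7F
        n >>= 7
        if n:
            out.append(b | 0x80)
        else:
            out.append(b)
            return

def encode_id_map(ids: list) -> bytes:
    """Delta-encode a sorted vector-ID list as varints."""
    out = bytearray()
    prev = 0
    for v in ids:
        encode_varint(v - prev, out)
        prev = v
    return bytes(out)

def decode_id_map(data: bytes, count: int) -> list:
    """Inverse of encode_id_map: accumulate deltas back into absolute IDs."""
    ids, pos, prev = [], 0, 0
    for _ in range(count):
        shift = delta = 0
        while True:
            b = data[pos]
            pos += 1
            delta |= (b & 0x7F) << shift
            shift += 7
            if not (b & 0x80):
                break
        prev += delta
        ids.append(prev)
    return ids
```

Because IDs within a block are sorted, the deltas are small and most encode in a single byte.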
## 7. Invariants

1. Segment IDs are strictly monotonically increasing within a file
2. A valid RVF file contains at least one MANIFEST_SEG
3. The last MANIFEST_SEG is always the source of truth
4. Segment headers are always 64-byte aligned
5. No segment payload exceeds 4 GB
6. Content hashes are computed over the raw (uncompressed, unencrypted) payload
7. Sealed segments are never modified — only tombstoned
8. A reader that cannot find a valid MANIFEST_SEG must reject the file

287 vendor/ruvector/docs/research/rvf/spec/02-manifest-system.md vendored Normal file

# RVF Manifest System

## 1. Two-Level Manifest Architecture

The manifest system is what makes RVF progressive. Instead of a monolithic directory
that must be fully parsed before any query, RVF uses a two-level manifest that
enables instant boot followed by incremental refinement.

```
EOF
 |
 v
+--------------------------------------------------+
| Level 0: Root Manifest (fixed 4096 bytes)        |
|  - Magic + version                               |
|  - Pointer to Level 1 manifest segment           |
|  - Hotset pointers (inline)                      |
|  - Total vector count                            |
|  - Dimension                                    |
|  - Epoch counter                                 |
|  - Profile declaration                           |
+--------------------------------------------------+
 |
 | points to
 v
+--------------------------------------------------+
| Level 1: Full Manifest (variable size)           |
|  - Complete segment directory                    |
|  - Temperature tier map                          |
|  - Index layer availability                      |
|  - Overlay epoch chain                           |
|  - Compaction state                              |
|  - Shard references (multi-file)                 |
|  - Capability manifest                           |
+--------------------------------------------------+
```

### Why Two Levels

A reader performing cold start only needs Level 0 (4 KB). From Level 0 alone,
it can locate the entry points, coarse routing graph, quantization dictionary,
and centroids — enough to answer approximate queries immediately.

Level 1 is loaded asynchronously to enable full-quality queries, but the system
is functional before Level 1 is fully parsed.

## 2. Level 0: Root Manifest

The root manifest is always the **last 4096 bytes** of the file (or the last
4096 bytes of the most recent MANIFEST_SEG). Its fixed size enables instant
location: `seek(EOF - 4096)`.

### Binary Layout

```
Offset  Size  Field                    Description
------  ----  -----                    -----------
0x000   4     magic                    0x52564D30 ("RVM0")
0x004   2     version                  Root manifest version
0x006   2     flags                    Root manifest flags
0x008   8     l1_manifest_offset       Byte offset to Level 1 manifest segment
0x010   8     l1_manifest_length       Byte length of Level 1 manifest segment
0x018   8     total_vector_count       Total vectors across all segments
0x020   2     dimension                Vector dimensionality
0x022   1     base_dtype               Base data type enum
0x023   1     profile_id               Domain profile (0=generic, 1=dna, 2=text, 3=graph, 4=vision)
0x024   4     epoch                    Current overlay epoch number
0x028   8     created_ns               File creation timestamp (ns)
0x030   8     modified_ns              Last modification timestamp (ns)

--- Hotset Pointers (the key to instant boot) ---

0x038   8     entrypoint_seg_offset    Offset to segment containing HNSW entry points
0x040   4     entrypoint_block_offset  Block offset within that segment
0x044   4     entrypoint_count         Number of entry points

0x048   8     toplayer_seg_offset      Offset to segment with top-layer adjacency
0x050   4     toplayer_block_offset    Block offset
0x054   4     toplayer_node_count      Nodes in top layer

0x058   8     centroid_seg_offset      Offset to segment with cluster centroids / pivots
0x060   4     centroid_block_offset    Block offset
0x064   4     centroid_count           Number of centroids

0x068   8     quantdict_seg_offset     Offset to quantization dictionary segment
0x070   4     quantdict_block_offset   Block offset
0x074   4     quantdict_size           Dictionary size in bytes

0x078   8     hot_cache_seg_offset     Offset to HOT_SEG with interleaved hot vectors
0x080   4     hot_cache_block_offset   Block offset
0x084   4     hot_cache_vector_count   Vectors in hot cache

0x088   8     prefetch_map_offset      Offset to prefetch hint table
0x090   4     prefetch_map_entries     Number of prefetch entries

--- Crypto ---

0x094   2     sig_algo                 Manifest signature algorithm
0x096   2     sig_length               Signature length
0x098   var   signature                Manifest signature (up to 3400 bytes for ML-DSA-65)

--- Padding to 4096 bytes ---

0xF00   252   reserved                 Reserved / zero-padded to 4096
0xFFC   4     root_checksum            CRC32C of bytes 0x000-0xFFB
```

**Total**: Exactly 4096 bytes (one page, one disk sector on most hardware).

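The fixed-offset fields above map directly onto a struct unpack. A hedged sketch of parsing the leading fields: little-endian byte order is an assumption (the table gives offsets and sizes, not endianness), and CRC32C verification of `root_checksum` is omitted here:

```python
import struct

ROOT_MAGIC = 0x52564D30  # "RVM0"

def parse_root_manifest(page: bytes) -> dict:
    """Parse the fixed fields at 0x000-0x027 of a 4096-byte Level 0 page."""
    assert len(page) == 4096
    # I=u32, H=u16, Q=u64, B=u8; '<' disables padding so offsets match the table.
    (magic, version, flags,
     l1_offset, l1_length, total_vectors,
     dimension, base_dtype, profile_id, epoch) = struct.unpack_from(
        "<IHHQQQHBBI", page, 0)
    if magic != ROOT_MAGIC:
        raise ValueError("not a root manifest")
    return {
        "version": version, "flags": flags,
        "l1_offset": l1_offset, "l1_length": l1_length,
        "total_vectors": total_vectors, "dimension": dimension,
        "base_dtype": base_dtype, "profile_id": profile_id, "epoch": epoch,
    }
```

The format string packs with no implicit padding, so each field lands exactly at its table offset (e.g. `dimension` at 0x020, `epoch` at 0x024).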
### Hotset Pointers

The six hotset pointers are the minimum information needed to answer a query:

1. **Entry points**: Where to start HNSW traversal
2. **Top-layer adjacency**: Coarse routing to the right neighborhood
3. **Centroids/pivots**: For IVF-style pre-filtering or partition routing
4. **Quantization dictionary**: For decoding compressed vectors
5. **Hot cache**: Pre-decoded interleaved vectors for top-K refinement
6. **Prefetch map**: Contiguous neighbor-list pages with prefetch offsets

With these six pointers, a reader can:

- Start HNSW search at the entry point
- Route through the top layer
- Quantize the query using the dictionary
- Scan the hot cache for refinement
- Prefetch neighbor pages for cache-friendly traversal

All without reading Level 1 or any cold segments.

## 3. Level 1: Full Manifest

Level 1 is a variable-size segment (type `MANIFEST_SEG`) referenced by Level 0.
It contains the complete file directory.

### Structure

Level 1 is encoded as a sequence of typed records using a tag-length-value (TLV)
scheme for forward compatibility:

```
+---+---+---+---+---+---+---+---+
| Tag (2B) | Length (4B) | Pad  |  <- 8-byte aligned record header
+---+---+---+---+---+---+---+---+
| Value (Length bytes)          |
| [padded to 8-byte boundary]   |
+-------------------------------+
```

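Walking the TLV stream is a short loop. A sketch, assuming little-endian record headers (u16 tag, u32 length, 2 pad bytes) and treating a zero tag as end-of-stream — that terminator convention is an assumption, not stated by the spec:

```python
import struct

def iter_tlv_records(buf: bytes):
    """Yield (tag, value) pairs from a Level 1 TLV stream."""
    pos = 0
    while pos + 8 <= len(buf):
        tag, length = struct.unpack_from("<HI2x", buf, pos)
        if tag == 0:  # assumed terminator / zero padding
            break
        pos += 8
        yield tag, buf[pos:pos + length]
        pos += length + (-length) % 8  # value is padded to the next 8-byte boundary
```

Unknown tags are simply skipped by the caller, which is what gives the TLV scheme its forward compatibility.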
### Record Types

```
Tag     Name                 Description
---     ----                 -----------
0x0001  SEGMENT_DIR          Array of segment directory entries
0x0002  TEMP_TIER_MAP        Temperature tier assignments per block
0x0003  INDEX_LAYERS         Index layer availability bitmap
0x0004  OVERLAY_CHAIN        Epoch chain with rollback pointers
0x0005  COMPACTION_STATE     Active/tombstoned segment sets
0x0006  SHARD_REFS           Multi-file shard references
0x0007  CAPABILITY_MANIFEST  What this file can do (features, limits)
0x0008  PROFILE_CONFIG       Domain-specific configuration
0x0009  ACCESS_SKETCH_REF    Pointer to latest SKETCH_SEG
0x000A  PREFETCH_TABLE       Full prefetch hint table
0x000B  ID_RESTART_POINTS    Restart point index for varint delta IDs
0x000C  WITNESS_CHAIN        Proof-of-computation witness chain
0x000D  KEY_DIRECTORY        Encryption key references (not keys themselves)
```

### Segment Directory Entry

```
Offset  Size  Field              Description
------  ----  -----              -----------
0x00    8     segment_id         Segment ordinal
0x08    1     seg_type           Segment type enum
0x09    1     tier               Temperature tier (0=hot, 1=warm, 2=cold)
0x0A    2     flags              Segment flags
0x0C    4     reserved           Must be zero
0x10    8     file_offset        Byte offset in file (or shard)
0x18    8     payload_length     Decompressed payload length
0x20    8     compressed_length  Compressed length (0 if uncompressed)
0x28    2     shard_id           Shard index (0 for main file)
0x2A    2     compression        Compression algorithm
0x2C    4     block_count        Number of blocks in segment
0x30    16    content_hash       Payload hash (first 128 bits)
```

**Total**: 64 bytes per entry (cache-line aligned).

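The 64-byte entry maps onto a single struct format. A sketch, again assuming little-endian byte order; field order follows the table above:

```python
import struct

# Q=u64, B=u8, H=u16, I=u32, 16s=16-byte hash; total is exactly 64 bytes.
DIR_ENTRY = struct.Struct("<QBBHIQQQHHI16s")
assert DIR_ENTRY.size == 64

def unpack_dir_entry(buf: bytes, off: int = 0) -> dict:
    """Decode one segment directory entry at byte offset `off`."""
    (segment_id, seg_type, tier, flags, _reserved,
     file_offset, payload_length, compressed_length,
     shard_id, compression, block_count, content_hash) = DIR_ENTRY.unpack_from(buf, off)
    return {
        "segment_id": segment_id, "seg_type": seg_type, "tier": tier,
        "flags": flags, "file_offset": file_offset,
        "payload_length": payload_length, "compressed_length": compressed_length,
        "shard_id": shard_id, "compression": compression,
        "block_count": block_count, "content_hash": content_hash,
    }
```

Because entries are fixed-size and cache-line aligned, entry `i` of a SEGMENT_DIR record lives at offset `64 * i` — no scanning required.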
## 4. Manifest Lifecycle

### Writing a New Manifest

Every mutation to the file produces a new MANIFEST_SEG appended at the tail:

```
1. Compute new Level 1 manifest (segment directory + metadata)
2. Write Level 1 as a MANIFEST_SEG payload
3. Compute Level 0 root manifest pointing to Level 1
4. Write Level 0 as the last 4096 bytes of the MANIFEST_SEG
5. fsync
```

The MANIFEST_SEG payload structure is:

```
+-----------------------------------+
| Level 1 manifest (variable size)  |
+-----------------------------------+
| Level 0 root manifest (4096 B)    |  <-- Always the last 4096 bytes
+-----------------------------------+
```

### Reading the Manifest

```
1. seek(EOF - 4096)
2. Read 4096 bytes -> Level 0 root manifest
3. Validate magic (0x52564D30) and checksum
4. If valid: extract hotset pointers -> system is queryable
5. Async: read Level 1 at l1_manifest_offset -> full directory
6. If Level 0 is invalid: scan backward for previous MANIFEST_SEG
```

Step 6 provides crash recovery. If the latest manifest write was interrupted,
the previous manifest is still valid. Readers scan backward at 64-byte aligned
boundaries looking for the RVFS magic + MANIFEST_SEG type.

### Manifest Chain

Each manifest implicitly forms a chain through the segment ID ordering. For
explicit rollback support, Level 1 contains the `OVERLAY_CHAIN` record, which
stores:

```
epoch: u32                  Current epoch
prev_manifest_offset: u64   Offset of previous MANIFEST_SEG
prev_manifest_id: u64       Segment ID of previous MANIFEST_SEG
checkpoint_hash: [u8; 16]   Hash of the complete state at this epoch
```

This enables point-in-time recovery and bisection debugging.

## 5. Hotset Pointer Semantics

### Entry Point Stability

Entry points are the HNSW nodes at the highest layer. They change rarely (only
when the index is rebuilt or a new highest-layer node is inserted). The root
manifest caches them directly so they survive across manifest generations without
re-reading the index.

### Centroid Refresh

Centroids may drift as data is added. The manifest tracks a `centroid_epoch` — if
the current epoch exceeds centroid_epoch + threshold, the runtime should schedule
centroid recomputation. Stale centroids remain usable in the meantime: recall
degrades gracefully rather than failing.

### Hot Cache Coherence

The hot cache in HOT_SEG is a **read-optimized snapshot** of the most-accessed
vectors. It may be stale relative to the latest VEC_SEGs. The manifest tracks
a `hot_cache_epoch` for staleness detection. Queries use the hot cache for fast
initial results, then refine against authoritative VEC_SEGs if needed.

## 6. Progressive Boot Sequence

```
Time     Action                           System State
----     ------                           ------------
t=0      Read last 4 KB (Level 0)         Booting
t+1ms    Parse hotset pointers            Queryable (approximate)
t+2ms    mmap entry points + top layer    Better routing
t+5ms    mmap hot cache + quant dict      Fast top-K refinement
t+10ms   Start loading Level 1            Discovering full directory
t+50ms   Level 1 parsed                   Full segment awareness
t+100ms  mmap warm VEC_SEGs               Recall improving
t+500ms  mmap cold VEC_SEGs               Full recall
t+1s     Background index layer build     Converging to optimal
```

For a 10M vector file (~7 GB at 384 dimensions, float16):

- Level 0 read: 4 KB in < 1 ms
- Hotset data: ~2-4 MB (entry points + top layer + centroids + hot cache)
- First query: within 5-10 ms of open
- Full convergence: 1-5 seconds depending on storage speed

285 vendor/ruvector/docs/research/rvf/spec/03-temperature-tiering.md vendored Normal file

# RVF Temperature Tiering

## 1. Adaptive Layout as a First-Class Concept

Traditional vector formats place data once and leave it. RVF treats data placement
as a **continuous optimization problem**. Every vector block has a temperature, and
the format periodically reorganizes to keep hot data fast and cold data small.

```
Access Frequency
        ^
Tier 0  | ████████                       fp16 / 8-bit, interleaved
(HOT)   | ████████                       < 1μs random access
        |
Tier 1  | ░░░░░░░░░░░░░░░░               5-7 bit quantized
(WARM)  | ░░░░░░░░░░░░░░░░               columnar, compressed
        |
Tier 2  | ▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒  3-bit or 1-bit
(COLD)  | ▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒  heavy compression
        |
        +------------------------------------> Vector ID
```

### Tier Definitions

| Tier | Name | Quantization | Layout | Compression | Access Latency |
|------|------|-------------|--------|-------------|----------------|
| 0 | Hot | fp16 or int8 | Interleaved (row-major) | None or LZ4 | < 1 μs |
| 1 | Warm | 5-7 bit SQ/PQ | Columnar | LZ4 or ZSTD | 1-10 μs |
| 2 | Cold | 3-bit or binary | Columnar | ZSTD level 9+ | 10-100 μs |

### Memory Ratios

For 384-dimensional vectors (typical embedding size):

| Tier | Bytes/Vector | Ratio vs fp32 | 10M Vectors |
|------|-------------|---------------|-------------|
| fp32 (raw) | 1536 B | 1.0x | 14.3 GB |
| Tier 0 (fp16) | 768 B | 2.0x | 7.2 GB |
| Tier 0 (int8) | 384 B | 4.0x | 3.6 GB |
| Tier 1 (6-bit) | 288 B | 5.3x | 2.7 GB |
| Tier 1 (5-bit) | 240 B | 6.4x | 2.2 GB |
| Tier 2 (3-bit) | 144 B | 10.7x | 1.3 GB |
| Tier 2 (1-bit) | 48 B | 32.0x | 0.45 GB |

## 2. Access Counter Sketch

Temperature decisions require knowing which blocks are accessed frequently.
RVF maintains a lightweight **Count-Min Sketch** per block set, stored in
SKETCH_SEG segments.

### Sketch Parameters

```
Width (w):     1024 counters
Depth (d):     4 hash functions
Counter size:  8-bit saturating (max 255)
Memory:        1024 * 4 * 1 = 4 KB per sketch
Granularity:   One sketch per 1024-vector block
Decay:         Halve all counters every 2^16 accesses (aging)
```

For 10M vectors in 1024-vector blocks:

- 9,766 blocks
- 9,766 * 4 KB = ~38 MB of sketches
- Stored in SKETCH_SEG, referenced by manifest

### Sketch Operations

**On query access**:

```
block_id = vector_id / block_size
for i in 0..depth:
    idx = hash_i(block_id) % width
    sketch[i][idx] = min(sketch[i][idx] + 1, 255)
```

**On temperature check**:

```
count = min over i of sketch[i][hash_i(block_id) % width]
if count > HOT_THRESHOLD:    tier = 0
elif count > WARM_THRESHOLD: tier = 1
else:                        tier = 2
```

**Aging** (every 2^16 accesses):

```
for all counters: counter = counter >> 1
```

This ensures the sketch tracks *recent* access patterns, not cumulative history.

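The three operations above compose into a small sketch class. This illustration uses Python's tuple hashing with a per-row salt as the hash family, and illustrative threshold values; the spec mandates neither:

```python
class CountMinSketch:
    WIDTH, DEPTH, MAX = 1024, 4, 255

    def __init__(self):
        self.rows = [[0] * self.WIDTH for _ in range(self.DEPTH)]
        self.accesses = 0

    def _idx(self, i: int, block_id: int) -> int:
        # Row index i acts as a salt, giving DEPTH independent-ish hashes.
        return hash((i, block_id)) % self.WIDTH

    def record_access(self, block_id: int) -> None:
        for i in range(self.DEPTH):
            j = self._idx(i, block_id)
            self.rows[i][j] = min(self.rows[i][j] + 1, self.MAX)  # saturating
        self.accesses += 1
        if self.accesses % (1 << 16) == 0:
            self.age()

    def estimate(self, block_id: int) -> int:
        # Min over rows bounds the overestimate from hash collisions.
        return min(self.rows[i][self._idx(i, block_id)] for i in range(self.DEPTH))

    def age(self) -> None:
        # Halve all counters so the sketch tracks recent access patterns.
        for row in self.rows:
            for j in range(self.WIDTH):
                row[j] >>= 1

def tier_for(count: int, hot: int = 64, warm: int = 8) -> int:
    # Thresholds are illustrative, not normative.
    return 0 if count > hot else (1 if count > warm else 2)
```
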
### Why Count-Min Sketch

| Alternative | Memory | Accuracy | Update Cost |
|------------|--------|----------|-------------|
| Per-vector counter | 80 MB (10M * 8B) | Exact | O(1) |
| Count-Min Sketch | 38 MB | ~99.9% | O(depth) = O(4) |
| HyperLogLog | 6 MB | ~98% | O(1) but cardinality only |
| Bloom filter | 12 MB | No counting | N/A |

Count-Min Sketch is the best trade-off: near-exact accuracy with bounded memory
and constant-time updates.

## 3. Promotion and Demotion

### Promotion: Warm/Cold -> Hot

When a block's access count exceeds HOT_THRESHOLD for two consecutive sketch
epochs:

```
1. Read the block from its current VEC_SEG
2. Decode/dequantize vectors to fp16 or int8
3. Rearrange from columnar to interleaved layout
4. Write as a new HOT_SEG (or append to existing HOT_SEG)
5. Update manifest with new tier assignment
6. Optionally: add neighbor lists to HOT_SEG for locality
```

### Demotion: Hot -> Warm -> Cold

When a block's access count drops below WARM_THRESHOLD:

```
1. The block is not immediately rewritten
2. On next compaction cycle, the block is written to the appropriate tier
3. Quantization is applied during compaction (not lazily)
4. The HOT_SEG entry is tombstoned in the manifest
```

### Eviction as Compression

The key insight: **eviction from the hot tier is just compression, not deletion**.
The vector data is always present — it just moves to a more compressed
representation. This means:

- No data loss on eviction
- Recall degrades gracefully (quantized vectors still contribute to search)
- The file naturally compresses over time as access patterns stabilize

## 4. Temperature-Aware Compaction

Standard compaction merges segments for space efficiency. Temperature-aware
compaction also **rearranges blocks by tier**:

```
Before compaction:
  VEC_SEG_1: [hot] [cold] [warm] [hot] [cold]
  VEC_SEG_2: [warm] [hot] [cold] [warm] [warm]

After temperature-aware compaction:
  HOT_SEG:   [hot] [hot] [hot]              <- interleaved, fp16
  VEC_SEG_W: [warm] [warm] [warm] [warm]    <- columnar, 6-bit
  VEC_SEG_C: [cold] [cold] [cold]           <- columnar, 3-bit
```

This creates **physical locality by temperature**: hot blocks are contiguous
(good for sequential scan), warm blocks are contiguous (good for batch decode),
cold blocks are contiguous (good for compression ratio).

### Compaction Triggers

| Trigger | Condition | Action |
|---------|-----------|--------|
| Sketch epoch | Every N writes | Evaluate all block temperatures |
| Space amplification | Dead space > 30% | Merge + rewrite segments |
| Tier imbalance | Hot tier > 20% of data | Demote the coldest hot blocks |
| Hot miss rate | Hot cache miss > 10% | Promote missing blocks |

## 5. Quantization Strategies by Tier

### Tier 0: Hot

**Scalar quantization to int8** (preferred) or **fp16** (for maximum recall).

```
Encoding:
  q = round((v - min) / (max - min) * 255)

Decoding:
  v = q / 255 * (max - min) + min

Parameters stored in QUANT_SEG:
  min: f32 per dimension
  max: f32 per dimension
```

Distance computation runs directly on int8 using SIMD (vpsubb + vpmaddubsw on AVX-512).

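A per-dimension round trip through the formulas above, in plain Python for clarity (real implementations vectorize this and compute distances on the int8 codes directly):

```python
def sq_encode(v, lo, hi):
    """q = round((v - min) / (max - min) * 255), per dimension."""
    return [round((x - l) / (h - l) * 255) for x, l, h in zip(v, lo, hi)]

def sq_decode(q, lo, hi):
    """v = q / 255 * (max - min) + min, per dimension."""
    return [c / 255 * (h - l) + l for c, l, h in zip(q, lo, hi)]
```

With 256 levels over the per-dimension range, the worst-case reconstruction error is half a quantization step, i.e. `(max - min) / 510` per dimension.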
### Tier 1: Warm

**Product Quantization (PQ)** with 5-7 bits per sub-vector.

```
Parameters:
  M subspaces:         48 (for 384-dim vectors, 8 dims per subspace)
  K centroids per sub: 64 (6-bit) or 128 (7-bit)
  Codebook:            M * K * 8 * sizeof(f32) = 48 * 64 * 8 * 4 = 96 KB

Encoding:
  For each subvector: find nearest centroid -> store centroid index

Distance computation:
  ADC (Asymmetric Distance Computation) with precomputed distance tables
```

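ADC precomputes, per subspace, the distance from the query subvector to each of the K centroids; scoring a PQ code is then just M table lookups. A sketch with toy sizes (codebook training is out of scope here):

```python
def adc_tables(query, codebook, M, K, sub_dim):
    """tables[m][k] = squared L2 distance from query subvector m to centroid k."""
    tables = []
    for m in range(M):
        qs = query[m * sub_dim:(m + 1) * sub_dim]
        tables.append([
            sum((a - b) ** 2 for a, b in zip(qs, codebook[m][k]))
            for k in range(K)
        ])
    return tables

def adc_distance(code, tables):
    # One table lookup per subspace, summed — no vector decode needed.
    return sum(tables[m][c] for m, c in enumerate(code))
```

The table-build cost is paid once per query; every candidate afterwards costs only M additions, which is what makes PQ scanning fast.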
### Tier 2: Cold

**Binary quantization** (1-bit) or **ternary quantization** (2-bit / 3-bit).

```
Binary encoding:
  b = sign(v) -> 1 bit per dimension
  384 dims -> 48 bytes per vector (32x compression)

Distance:
  Hamming distance via POPCNT
  XOR + POPCNT on AVX-512: 512 bits per cycle

Ternary (3-bit with magnitude):
  t = {-1, 0, +1} based on threshold
  magnitude = |v| quantized to 3 levels
  384 dims -> 144 bytes per vector (10.7x compression)
```

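Binary quantization and Hamming scoring fit in a few lines. `bin(x).count("1")` stands in for the hardware POPCNT described above:

```python
def binarize(v):
    """b = sign(v): one bit per dimension, packed into a Python int."""
    bits = 0
    for i, x in enumerate(v):
        if x > 0:
            bits |= 1 << i
    return bits

def hamming(a: int, b: int) -> int:
    # XOR then population count — the POPCNT path described above.
    return bin(a ^ b).count("1")
```
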
### Codebook Storage

All quantization parameters (codebooks, min/max ranges, centroids) are stored
in QUANT_SEG segments. The root manifest's `quantdict_seg_offset` hotset pointer
references the active quantization dictionary for fast boot.

Multiple QUANT_SEGs can coexist for different tiers — the manifest maps each
tier to its dictionary.

## 6. Hardware Adaptation

### Desktop (AVX-512)

- Hot tier: int8 with VNNI dot product (4 int8 multiplies per cycle)
- Warm tier: PQ with AVX-512 gather for table lookups
- Cold tier: Binary with VPOPCNTDQ (512-bit popcount)

### ARM (NEON)

- Hot tier: int8 with SDOT instruction
- Warm tier: PQ with TBL for table lookups
- Cold tier: Binary with CNT (population count)

### WASM (v128)

- Hot tier: int8 with i16x8.relaxed_dot_i8x16_i7x16_s (if relaxed SIMD is available)
- Warm tier: Scalar PQ (no gather)
- Cold tier: Binary with manual popcount

### Cognitum Tile (8KB code + 8KB data + 64KB SIMD)

- Hot tier only: int8 interleaved, fits in SIMD scratch
- No warm/cold — data stays on hub, tile fetches blocks on demand
- Sketch is maintained by hub, not tile

## 7. Self-Organization Over Time

```
t=0    All data Tier 1 (default warm)
        |
t+N    First sketch epoch: identify hot blocks
       Promote top 5% to Tier 0
        |
t+2N   Second epoch: validate promotions
       Demote false positives back to Tier 1
       Identify true cold blocks (0 access in 2 epochs)
        |
t+3N   Compaction: physically separate tiers
       HOT_SEG created with interleaved layout
       Cold blocks compressed to 3-bit
        |
t+∞    Equilibrium: ~5% hot, ~30% warm, ~65% cold
       File size: ~2-3x smaller than uniform fp16
       Query p95: dominated by hot tier latency
```

The format converges to an equilibrium that reflects actual usage. No manual
tuning required.

374 vendor/ruvector/docs/research/rvf/spec/04-progressive-indexing.md vendored Normal file

# RVF Progressive Indexing

## 1. Index as Layers of Availability

Traditional HNSW serialization is all-or-nothing: either the full graph is loaded,
or nothing works. RVF decomposes the index into three layers of availability, each
independently useful, each stored in separate INDEX_SEG segments.

```
Layer C: Full Adjacency
+--------------------------------------------------+
| Complete neighbor lists for every node at every  |
| HNSW level. Built lazily. Optional for queries.  |
| Recall: >= 0.95                                  |
+--------------------------------------------------+
        ^ loaded last (seconds to minutes)
        |
Layer B: Partial Adjacency
+--------------------------------------------------+
| Neighbor lists for the most-accessed region      |
| (determined by temperature sketch). Covers the   |
| hot working set of the graph.                    |
| Recall: >= 0.85                                  |
+--------------------------------------------------+
        ^ loaded second (100ms - 1s)
        |
Layer A: Entry Points + Coarse Routing
+--------------------------------------------------+
| HNSW entry points. Top-layer adjacency lists.    |
| Cluster centroids for IVF pre-routing.           |
| Always present. Always in Level 0 hotset.        |
| Recall: >= 0.70                                  |
+--------------------------------------------------+
        ^ loaded first (< 5ms)
        |
    File open
```

### Why Three Layers

| Layer | Purpose | Data Size (10M vectors) | Load Time (NVMe) |
|-------|---------|------------------------|-------------------|
| A | First query possible | 1-4 MB | < 5 ms |
| B | Good quality for working set | 50-200 MB | 100-500 ms |
| C | Full recall for all queries | 1-4 GB | 2-10 s |

A system that only loads Layer A can still answer queries — just with lower recall.
As layers B and C load asynchronously, quality improves transparently.

## 2. Layer A: Entry Points and Coarse Routing

### Content

- **HNSW entry points**: The node(s) at the highest layer of the HNSW graph.
  Typically 1 node, but may be multiple for redundancy.
- **Top-layer adjacency**: Full neighbor lists for all nodes at HNSW layers
  >= ceil(ln(N) / ln(M)) - 2. For 10M vectors with M=16, this is layers 5-6,
  containing ~100-1000 nodes.
- **Cluster centroids**: K centroids (typically K = sqrt(N), so ~3162 for 10M)
  used for IVF-style partition routing.
- **Centroid-to-partition map**: Which centroid owns which vector ID ranges.

### Storage

Layer A data is stored in a dedicated INDEX_SEG with `flags.HOT` set. The root
manifest's hotset pointers reference this segment directly. On cold start, this
is the first data mapped after the manifest.

### Binary Layout of Layer A INDEX_SEG

```
+-------------------------------------------+
| Header: INDEX_SEG, flags=HOT              |
+-------------------------------------------+
| Block 0: Entry Points                     |
|   entry_count: u32                        |
|   max_layer: u32                          |
|   [entry_node_id: u64, layer: u32] * N    |
+-------------------------------------------+
| Block 1: Top-Layer Adjacency              |
|   layer_count: u32                        |
|   For each layer (top to bottom):         |
|     node_count: u32                       |
|     For each node:                        |
|       node_id: u64                        |
|       neighbor_count: u16                 |
|       [neighbor_id: u64] * neighbor_count |
|   [64B padding]                           |
+-------------------------------------------+
| Block 2: Centroids                        |
|   centroid_count: u32                     |
|   dim: u16                                |
|   dtype: u8 (fp16)                        |
|   [centroid_vector: fp16 * dim] * K       |
|   [64B aligned]                           |
+-------------------------------------------+
| Block 3: Partition Map                    |
|   partition_count: u32                    |
|   For each partition:                     |
|     centroid_id: u32                      |
|     vector_id_start: u64                  |
|     vector_id_end: u64                    |
|     segment_ref: u64 (segment_id)         |
|     block_ref: u32 (block offset)         |
+-------------------------------------------+
```

### Query Using Only Layer A

```python
def query_layer_a_only(query, k, layer_a, n_probe, hot_cache=None):
    # Step 1: Find nearest centroids
    dists = [distance(query, c) for c in layer_a.centroids]
    top_partitions = top_n(dists, n_probe)

    # Step 2: HNSW search through top layers only
    entry = layer_a.entry_points[0]
    current = entry
    for layer in range(layer_a.max_layer, layer_a.min_available_layer, -1):
        current = greedy_search(query, current, layer_a.adjacency[layer])

    # Step 3: If hot cache available, refine against it
    if hot_cache is not None:
        candidates = scan_hot_cache(query, hot_cache, current.partition)
        return top_k(candidates, k)

    # Step 4: Otherwise, return centroid-approximate results
    return approximate_from_centroids(query, top_partitions, k)
```

Expected recall: 0.65-0.75 (depends on centroid quality and hot cache coverage).

## 3. Layer B: Partial Adjacency

### Content

Neighbor lists for the **hot region** of the graph — the set of nodes that appear
most frequently in query traversals. Determined by the temperature sketch (see
03-temperature-tiering.md).

Typically covers:

- All nodes at HNSW layers >= 2
- Layer 0-1 nodes in the hot temperature tier
- ~10-20% of total nodes

### Storage

Layer B is stored in one or more INDEX_SEGs without the HOT flag. The Level 1
manifest maps these segments and records which node ID ranges they cover.

### Incremental Build

Layer B can be built incrementally:

```
1. After Layer A is loaded, begin query serving
2. In background: read VEC_SEGs for hot-tier blocks
3. Build HNSW adjacency for those blocks
4. Write as new INDEX_SEG
5. Update manifest to include Layer B
6. Future queries use Layer B for better recall
```

This means the index improves over time without blocking any queries.

### Partial Adjacency Routing
|
||||
|
||||
When a query traversal reaches a node without Layer B adjacency (i.e., it's in
|
||||
the cold region), the system falls back to:
|
||||
|
||||
1. **Centroid routing**: Use Layer A centroids to estimate the nearest region
|
||||
2. **Linear scan**: Scan the relevant VEC_SEG block directly
|
||||
3. **Approximate**: Accept slightly lower recall for that portion
|
||||
|
||||
```python
|
||||
def search_with_partial_index(query, k, layers):
|
||||
# Start with Layer A routing
|
||||
current = hnsw_search_layers(query, layers.a, layers.a.max_layer, 2)
|
||||
|
||||
# Continue with Layer B (where available)
|
||||
if layers.b.has_node(current):
|
||||
current = hnsw_search_layers(query, layers.b, 1, 0,
|
||||
start=current)
|
||||
else:
|
||||
# Fallback: scan the block containing current
|
||||
candidates = linear_scan_block(query, current.block)
|
||||
current = best_of(current, candidates)
|
||||
|
||||
return top_k(current.visited, k)
|
||||
```
## 4. Layer C: Full Adjacency

### Content

Complete neighbor lists for every node at every HNSW level. This is the
traditional full HNSW graph.

### Storage

Layer C may be split across multiple INDEX_SEGs for large datasets. The
manifest records the node ID ranges covered by each segment.

### Lazy Build

Layer C is built lazily — it is not required for the file to be functional.
The build process runs as a background task:

```
1. Identify unindexed VEC_SEG blocks (those without Layer C adjacency)
2. Read blocks in partition order (good locality)
3. Build HNSW adjacency using the existing partial graph as scaffold
4. Write new INDEX_SEG(s)
5. Update manifest
```

### Build Prioritization

Blocks are indexed in temperature order:
1. Hot blocks first (most query benefit)
2. Warm blocks next
3. Cold blocks last (may never be indexed if queries don't reach them)

This means the index build converges to useful quality fast, then approaches
completeness asymptotically.

## 5. Index Segment Binary Format

### Adjacency List Encoding

Neighbor lists are stored using **varint delta encoding with restart points**
for fast random access:

```
+-------------------------------------------+
| Restart Point Index                       |
|   restart_interval: u32 (e.g., 64)        |
|   restart_count: u32                      |
|   [restart_offset: u32] * restart_count   |
|   [64B aligned]                           |
+-------------------------------------------+
| Adjacency Data                            |
|   For each node (sorted by node_id):      |
|     neighbor_count: varint                |
|     [delta_encoded_neighbor_id: varint]   |
|   (restart point every N nodes)           |
+-------------------------------------------+
```

**Restart points**: Every `restart_interval` nodes (default 64), the delta
encoding resets to absolute IDs. This enables near-constant-time random access
to any node's neighbors (bounded by the restart interval):

1. Binary search the restart point index for the nearest restart <= target
2. Seek to that restart offset
3. Sequentially decode from restart to target (at most 63 decodes)

### Varint Encoding

Standard LEB128 varint:
- Values 0-127: 1 byte
- Values 128-16383: 2 bytes
- Values 16384-2097151: 3 bytes

For delta-encoded neighbor IDs (typical delta: 1-1000), most values fit in 1-2
bytes, giving ~3-4x compression over fixed u64.

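
To make the encoding concrete, here is a minimal Python sketch (illustrative names, not the RVF implementation) of LEB128 varints with delta encoding and restart points:

```python
def encode_varint(value):
    """LEB128: 7 payload bits per byte, high bit = continuation."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)

def decode_varint(buf, pos):
    result, shift = 0, 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7

def encode_ids(ids, restart_interval=64):
    """Delta-encode sorted IDs; reset to an absolute value at each restart."""
    data, offsets = bytearray(), []
    for i, v in enumerate(ids):
        if i % restart_interval == 0:
            offsets.append(len(data))
            data += encode_varint(v)               # absolute restart value
        else:
            data += encode_varint(v - ids[i - 1])  # delta from predecessor
    return bytes(data), offsets

def get_id(data, offsets, index, restart_interval=64):
    """Random access: seek to the restart group, decode at most 63 deltas."""
    group = index // restart_interval
    pos = offsets[group]
    value, pos = decode_varint(data, pos)          # absolute restart value
    for _ in range(index - group * restart_interval):
        delta, pos = decode_varint(data, pos)
        value += delta
    return value
```

`get_id` performs exactly the restart-group seek described above: one absolute decode plus at most `restart_interval - 1` delta decodes.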
### Prefetch Hints

The manifest's prefetch table maps node ID ranges to contiguous page ranges:

```
Prefetch Entry:
  node_id_start: u64
  node_id_end: u64
  page_offset: u64       Offset of first contiguous page
  page_count: u32        Number of contiguous pages
  prefetch_ahead: u32    Pages to prefetch ahead of current access
```

When the HNSW search accesses a node, the runtime issues `madvise(WILLNEED)`
(or equivalent) for the next `prefetch_ahead` pages. This hides disk/memory
latency behind computation.

## 6. Index Consistency

### Append-Only Index Updates

When new vectors are added:

1. New vectors go into a **fresh VEC_SEG** (append-only)
2. A temporary in-memory index covers the new vectors
3. When the in-memory index reaches a threshold, it is written as a new INDEX_SEG
4. The manifest is updated to include both the old and new INDEX_SEGs
5. Queries search both indexes and merge results

This is analogous to LSM-tree compaction levels but for graph indexes.

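
The search-and-merge step can be sketched as follows; `search_one` is a hypothetical per-index search routine returning `(distance, vector_id)` pairs, not part of the spec:

```python
import heapq

def search_all(indexes, query, k, search_one):
    """Search every index segment, merge results, keep the k best."""
    merged = []
    for idx in indexes:
        merged.extend(search_one(idx, query, k))
    # Deduplicate by vector id, keeping the smallest distance seen
    best = {}
    for dist, vid in merged:
        if vid not in best or dist < best[vid]:
            best[vid] = dist
    return heapq.nsmallest(k, ((d, v) for v, d in best.items()))
```

Each index (old INDEX_SEGs plus the in-memory index over fresh vectors) is searched independently; only the merge step needs to know about duplicates.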
### Index Merging

When too many small INDEX_SEGs accumulate:

```
1. Read all small INDEX_SEGs
2. Build a unified HNSW graph over all vectors
3. Write as a single sealed INDEX_SEG
4. Tombstone old INDEX_SEGs in manifest
```

### Concurrent Read/Write

Readers always see a consistent snapshot through the manifest chain:
- Reader opens file -> reads manifest -> has immutable segment set
- Writer appends new segments + new manifest
- Reader continues using old manifest until it explicitly re-reads
- No locks needed — append-only guarantees no mutation of existing data

## 7. Query Path Integration

The complete query path combining progressive indexing with temperature tiering:

```
                 Query
                   |
                   v
             +-----------+
             | Layer A   |   Entry points + top-layer routing
             | (always)  |   ~5ms to load on cold start
             +-----------+
                   |
      Is Layer B available for this region?
            /             \
          Yes              No
          /                  \
  +-----------+        +-----------+
  | Layer B   |        | Centroid  |
  | HNSW      |        | Fallback  |
  | search    |        | + scan    |
  +-----------+        +-----------+
          \                  /
           v                v
             +-----------+
             | Candidate |
             | Set       |
             +-----------+
                   |
         Is hot cache available?
            /             \
          Yes              No
          /                  \
  +------------+       +-----------+
  | Hot cache  |       | Decode    |
  | re-rank    |       | from      |
  | (int8/fp16)|       | VEC_SEG   |
  +------------+       +-----------+
          \                  /
           v                v
             +-----------+
             | Top-K     |
             | Results   |
             +-----------+
```

### Recall Expectations by State

| State | Layers Available | Expected Recall@10 |
|-------|------------------|--------------------|
| Cold start (L0 only) | A | 0.65-0.75 |
| L0 + hot cache | A + hot | 0.75-0.85 |
| L0 + L1 loading | A + B partial | 0.80-0.90 |
| L1 complete | A + B | 0.85-0.92 |
| Full load | A + B + C | 0.95-0.99 |
| Full + optimized | A + B + C + hot | 0.98-0.999 |

---

# RVF Overlay Epochs

## 1. Streaming Dynamic Min-Cut Overlay

The overlay system manages dynamic graph partitioning — how the vector space is
subdivided for distributed search, shard routing, and load balancing. Unlike
static partitioning, RVF overlays evolve with the data through an epoch-based
model that bounds memory, bounds load time, and enables rollback.

## 2. Overlay Segment Structure

Each OVERLAY_SEG stores a delta relative to the previous epoch's partition state:

```
+-------------------------------------------+
| Header: OVERLAY_SEG                       |
+-------------------------------------------+
| Epoch Header                              |
|   epoch: u32                              |
|   parent_epoch: u32                       |
|   parent_seg_id: u64                      |
|   rollback_offset: u64                    |
|   timestamp_ns: u64                       |
|   delta_count: u32                        |
|   partition_count: u32                    |
+-------------------------------------------+
| Edge Deltas                               |
|   For each delta:                         |
|     delta_type: u8 (ADD=1, REMOVE=2,      |
|                     REWEIGHT=3)           |
|     src_node: u64                         |
|     dst_node: u64                         |
|     weight: f32 (for ADD/REWEIGHT)        |
|   [64B aligned]                           |
+-------------------------------------------+
| Partition Summaries                       |
|   For each partition:                     |
|     partition_id: u32                     |
|     node_count: u64                       |
|     edge_cut_weight: f64                  |
|     centroid: [fp16 * dim]                |
|     node_id_range_start: u64              |
|     node_id_range_end: u64                |
|   [64B aligned]                           |
+-------------------------------------------+
| Min-Cut Witness                           |
|   witness_type: u8                        |
|     0 = checksum only                     |
|     1 = full certificate                  |
|   cut_value: f64                          |
|   cut_edge_count: u32                     |
|   partition_hash: [u8; 32] (SHAKE-256)    |
|   If witness_type == 1:                   |
|     [cut_edge: (u64, u64)] * count        |
|   [64B aligned]                           |
+-------------------------------------------+
| Rollback Pointer                          |
|   prev_epoch_offset: u64                  |
|   prev_epoch_hash: [u8; 16]               |
+-------------------------------------------+
```

## 3. Epoch Lifecycle

### Epoch Creation

A new epoch is created when:
- A batch of vectors is inserted that changes partition balance by > threshold
- The accumulated edge deltas exceed a size limit (default: 1 MB)
- A manual rebalance is triggered
- A merge/compaction produces a new partition layout

```
Epoch 0 (initial)      Epoch 1               Epoch 2
+----------------+     +----------------+    +----------------+
| Full snapshot  |     | Deltas vs E0   |    | Deltas vs E1   |
| of partitions  |     |   +50 edges    |    |   +30 edges    |
| 32 partitions  |     |   -12 edges    |    |   -8 edges     |
| min-cut: 0.342 |     | rebalance: P3  |    | split: P7->P7a |
+----------------+     +----------------+    +----------------+
```

### State Reconstruction

To reconstruct the current partition state:

```
1. Read latest MANIFEST_SEG -> get current_epoch
2. Read OVERLAY_SEG for current_epoch
3. If overlay is a delta: recursively read parent epochs
4. Apply deltas in order: base -> epoch 1 -> epoch 2 -> ... -> current
5. Result: complete partition state
```

For efficiency, the manifest caches the **last full snapshot epoch**. Delta
chains never exceed a configurable depth (default: 8 epochs) before a new
snapshot is forced.

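
Step 4 (delta application) can be sketched in Python; the tuple shapes here are assumptions for illustration, not the wire format:

```python
ADD, REMOVE, REWEIGHT = 1, 2, 3

def reconstruct(snapshot_edges, delta_chain):
    """snapshot_edges: {(src, dst): weight}.
    delta_chain: list of epochs, oldest first; each epoch is a list of
    (delta_type, src, dst, weight) tuples."""
    edges = dict(snapshot_edges)
    for epoch_deltas in delta_chain:          # base -> epoch 1 -> ... -> current
        for dtype, src, dst, w in epoch_deltas:
            if dtype in (ADD, REWEIGHT):
                edges[(src, dst)] = w         # insert or overwrite weight
            elif dtype == REMOVE:
                edges.pop((src, dst), None)   # idempotent remove
    return edges
```

Because deltas are replayed oldest-first, a REMOVE in a later epoch always wins over an ADD in an earlier one, matching the append-only ordering.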
### Compaction (Epoch Collapse)

When the delta chain reaches maximum depth:

```
1. Reconstruct full state from chain
2. Write new OVERLAY_SEG with witness_type=full_snapshot
3. This becomes the new base epoch
4. Old overlay segments are tombstoned
5. New delta chain starts from this base
```

```
Before:  E0(snap) -> E1(delta) -> E2(delta) -> ... -> E8(delta)
After:   E0(snap) -> ... -> E8(delta) -> E9(snap, compacted)
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
         These can be garbage collected
```

## 4. Min-Cut Witness

The min-cut witness provides a cryptographic proof that the current partition
is "good enough" — that the edge cut is within acceptable bounds.

### Witness Types

**Type 0: Checksum Only**

A SHAKE-256 hash of the complete partition state. Allows verification that
the state is consistent but doesn't prove optimality.

```
witness = SHAKE-256(
    for each partition sorted by id:
        partition_id || node_count || sorted(node_ids) || edge_cut_weight
)
```

**Type 1: Full Certificate**

Lists the actual cut edges. Allows any reader to verify that:
1. The listed edges are the only edges crossing partition boundaries
2. The total cut weight matches `cut_value`
3. No better cut exists within the local search neighborhood (optional)

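
The Type 0 witness maps directly onto Python's `hashlib.shake_256`; the concrete field serialization (big-endian fixed widths) is an assumption for illustration:

```python
import hashlib
import struct

def partition_hash(partitions):
    """partitions: list of (partition_id, node_ids, edge_cut_weight).
    Follows the spec's concatenation order: id || count || ids || cut."""
    h = hashlib.shake_256()
    for pid, node_ids, cut_weight in sorted(partitions):
        h.update(struct.pack(">I", pid))            # partition_id: u32
        h.update(struct.pack(">Q", len(node_ids)))  # node_count: u64
        for nid in sorted(node_ids):                # sorted(node_ids)
            h.update(struct.pack(">Q", nid))
        h.update(struct.pack(">d", cut_weight))     # edge_cut_weight: f64
    return h.digest(32)                             # 32-byte witness
```

Sorting partitions and node IDs before hashing makes the witness independent of in-memory iteration order, so two readers with the same logical state always agree.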
### Bounded-Time Min-Cut Updates

Full min-cut computation is expensive (O(V * E) for max-flow). RVF uses
**incremental min-cut maintenance**:

For each edge delta:
```
1. If ADD(u, v) where u and v are in same partition:
   -> No cut change. O(1).

2. If ADD(u, v) where u in P_i and v in P_j:
   -> cut_weight[P_i][P_j] += weight. O(1).
   -> Check if moving u to P_j or v to P_i reduces total cut.
   -> If yes: execute move, update partition summaries. O(degree).

3. If REMOVE(u, v) across partitions:
   -> cut_weight[P_i][P_j] -= weight. O(1).
   -> No rebalance needed (cut improved).

4. If REMOVE(u, v) within same partition:
   -> Check connectivity. If partition splits: create new partition. O(component).
```

This bounds update time to O(max_degree) per edge delta in the common case,
with O(component_size) in the rare partition-split case.

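
Cases 1 and 2 can be sketched as follows. The dict-based shapes are illustrative, and a full implementation would also re-bucket the moved node's edge weights in the cut table:

```python
def gain_of_move(node, target, part, adj):
    """Cut reduction if `node` moves to partition `target`."""
    stay = sum(w for nbr, w in adj[node].items() if part[nbr] != part[node])
    move = sum(w for nbr, w in adj[node].items() if part[nbr] != target)
    return stay - move

def apply_add(u, v, w, part, cut, adj):
    """part: node -> partition id; cut: (pi, pj) -> weight; adj: node -> {nbr: w}."""
    adj.setdefault(u, {})[v] = w
    adj.setdefault(v, {})[u] = w
    if part[u] == part[v]:
        return                                    # case 1: no cut change, O(1)
    key = tuple(sorted((part[u], part[v])))
    cut[key] = cut.get(key, 0.0) + w              # case 2: cross-partition add
    for node, other in ((u, v), (v, u)):          # try moving either endpoint
        if gain_of_move(node, part[other], part, adj) > 0:
            part[node] = part[other]              # greedy move, O(degree)
            break
```

The O(degree) bound comes from `gain_of_move`, which only inspects the candidate node's own adjacency list.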
### Semi-Streaming Min-Cut

For large-scale rebalancing (e.g., after bulk insert), RVF uses a semi-streaming
algorithm inspired by Assadi et al.:

```
Phase 1: Single pass over edges to build a sparse skeleton
  - Sample each edge with probability O(1/epsilon)
  - Space: O(n * polylog(n))

Phase 2: Compute min-cut on skeleton
  - Standard max-flow on sparse graph
  - Time: O(n^2 * polylog(n))

Phase 3: Verify against full edge set
  - Stream edges again, check cut validity
  - If invalid: refine skeleton and repeat
```

This runs in O(n * polylog(n)) space regardless of edge count, making it
suitable for streaming over massive graphs.

## 5. Overlay Size Management

### Size Threshold

Each OVERLAY_SEG has a maximum payload size (configurable, default 1 MB).
When the accumulated deltas for the current epoch approach this threshold,
a new epoch is forced.

### Memory Budget

The total memory for overlay state is bounded:

```
max_overlay_memory = max_chain_depth * max_seg_size + snapshot_size
                   = 8 * 1 MB + snapshot_size
```

For 10M vectors with 32 partitions:
- Snapshot: 32 partitions * ~(8 + 16 + 768) bytes each ≈ 25 KB
- Delta chain: ≤ 8 MB
- Total: ≤ 9 MB

This is a fixed overhead regardless of dataset size (partition count scales
sublinearly).

### Garbage Collection

Overlay segments behind the last full snapshot are candidates for garbage
collection. The manifest tracks which overlay segments are still reachable
from the current epoch chain.

```
Reachable:   current_epoch -> parent -> ... -> last_snapshot
Unreachable: Everything before last_snapshot (safely deletable)
```

GC runs during compaction. Old OVERLAY_SEGs are tombstoned in the manifest
and their space is reclaimed on file rewrite.

## 6. Distributed Overlay Coordination

When RVF files are sharded across multiple nodes, the overlay system coordinates
partition state:

### Shard-Local Overlays

Each shard maintains its own OVERLAY_SEG chain for its local partitions.
The global partition state is the union of all shard-local overlays.

### Cross-Shard Rebalancing

When a partition becomes unbalanced across shards:

```
1. Coordinator computes target partition assignment
2. Each shard writes a JOURNAL_SEG with vector move instructions
3. Vectors are copied (not moved — append-only) to target shards
4. Each shard writes a new OVERLAY_SEG reflecting the new partition
5. Coordinator writes a global MANIFEST_SEG with new shard map
```

This is eventually consistent — during rebalancing, queries may search both
old and new locations and deduplicate results.

### Consistency Model

**Within a shard**: Linearizable (single-writer, manifest chain)
**Across shards**: Eventually consistent with bounded staleness

The epoch counter provides a total order for convergence checking:
- If all shards report epoch >= E, the global state at epoch E is complete
- Stale shards are detectable by comparing epoch counters

## 7. Epoch-Aware Query Routing

Queries use the overlay state for partition routing:

```python
def route_query(query, overlay):
    # Find nearest partition centroids
    dists = [distance(query, p.centroid) for p in overlay.partitions]
    target_partitions = top_n(dists, n_probe)

    # Check epoch freshness
    if overlay.epoch < current_epoch - stale_threshold:
        # Overlay is stale — broaden search
        target_partitions = top_n(dists, n_probe * 2)

    return target_partitions
```

### Epoch Rollback

If an overlay epoch is found to be corrupt or suboptimal:

```
1. Read rollback_pointer from current OVERLAY_SEG
2. The pointer gives the offset of the previous epoch's OVERLAY_SEG
3. Write a new MANIFEST_SEG pointing to the previous epoch as current
4. Future writes continue from the rolled-back state
```

This provides O(1) rollback to the parent epoch; reaching an older ancestor
walks the rollback chain, one pointer per epoch.

## 8. Integration with Progressive Indexing

The overlay system and the index system are coupled:

- **Partition centroids** in the overlay guide Layer A routing
- **Partition boundaries** determine which INDEX_SEGs cover which regions
- **Partition rebalancing** may invalidate Layer B adjacency for moved vectors
  (these are rebuilt lazily)
- **Layer C** is partition-aligned — each INDEX_SEG covers vectors within
  a single partition for locality

This means overlay compaction can trigger partial index rebuild, but only for
the affected partitions — not the entire index.

---

# RVF Ultra-Fast Query Path

## 1. CPU Shape Optimization

The block layout determines performance at the hardware level. RVF is designed
to match the shape of modern CPUs: wide SIMD, deep caches, hardware prefetch.

### Four Optimizations

1. **Strict 64-byte alignment** for all numeric arrays
2. **Columnar + interleaved hybrid** for compression and speed
3. **Prefetch hints** for cache-friendly graph traversal
4. **Dictionary-coded IDs** for fast random access

## 2. Strict Alignment

Every numeric array in RVF starts at a 64-byte aligned offset. This matches:

| Target | Register Width | Alignment |
|--------|----------------|-----------|
| AVX-512 | 512 bits = 64 bytes | 64 B |
| AVX2 | 256 bits = 32 bytes | 64 B (superset) |
| ARM NEON | 128 bits = 16 bytes | 64 B (superset) |
| WASM v128 | 128 bits = 16 bytes | 64 B (superset) |
| Cache line | Typically 64 bytes | 64 B (exact) |

By aligning to 64 bytes, RVF ensures:
- Zero-copy load into any SIMD register (no unaligned penalty)
- No cache-line splits (each access touches exactly one cache line)
- Optimal hardware prefetch behavior (prefetcher operates on cache lines)

### Alignment in Practice

```
Segment header:        64 B (naturally aligned, first item in segment)
Block header:          Padded to 64 B boundary
Vector data start:     64 B aligned from block start
Each dimension column: 64 B aligned (columnar VEC_SEG)
Each vector entry:     64 B aligned (interleaved HOT_SEG)
ID map:                64 B aligned
Restart point index:   64 B aligned
```

Padding bytes between sections are zero-filled and excluded from checksums.

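
The alignment rule is a single power-of-two round-up; a quick sketch (illustrative, not the RVF writer):

```python
ALIGN = 64

def align_up(offset, align=ALIGN):
    """Round offset up to the next multiple of align (a power of two)."""
    return (offset + align - 1) & ~(align - 1)
```

For example, a 906-byte interleaved entry rounds up to 960 bytes, with the 54-byte gap zero-filled.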
## 3. Columnar + Interleaved Hybrid

### Columnar Storage (VEC_SEG) — Optimized for Compression

```
Block layout (1024 vectors, 384 dimensions, fp16):

Offset 0x000:    dim_0[vec_0], dim_0[vec_1], ..., dim_0[vec_1023]    (2048 B)
Offset 0x800:    dim_1[vec_0], dim_1[vec_1], ..., dim_1[vec_1023]    (2048 B)
...
Offset 0xBF800:  dim_383[vec_0], ..., dim_383[vec_1023]              (2048 B)

Total: 384 * 2048 = 786,432 bytes (768 KB per block)
```

**Why columnar for cold/warm storage**:
- Adjacent values in the same dimension are correlated -> higher compression ratio
- LZ4 on columnar fp16 achieves 1.5-2.5x compression (vs 1.1-1.3x on interleaved)
- ZSTD on columnar fp16 achieves 2.5-4x compression
- Batch operations (computing mean, variance) scan one dimension at a time

### Interleaved Storage (HOT_SEG) — Optimized for Speed

```
Entry layout (one hot vector, 384 dim fp16):

Offset 0x000:  vector_id (8 B)
Offset 0x008:  dim_0, dim_1, dim_2, ..., dim_383 (768 B)
Offset 0x308:  neighbor_count (2 B)
Offset 0x30A:  neighbor_0, neighbor_1, ... (8 B each)
Offset 0x38A:  padding to 64B boundary
--> 960 bytes per entry (at M=16 neighbors)
```

**Why interleaved for hot data**:
- One vector = one sequential read (no column gathering)
- Distance computation: load vector, compute, move to next (streaming pattern)
- Neighbors co-located: after finding a good candidate, immediately traverse
- 960 bytes per entry = 15 cache lines = predictable memory access

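
A hedged sketch of packing one entry with this layout, using Python's `struct`; little-endian byte order is an assumption for illustration, not a spec statement:

```python
import struct

def pack_hot_entry(vector_id, dims_fp16, neighbors):
    """dims_fp16: 384 floats stored as fp16 ('e'); neighbors: up to M=16 u64 IDs."""
    body = struct.pack("<Q", vector_id)                      # 0x000: vector_id
    body += struct.pack("<384e", *dims_fp16)                 # 0x008: 384 fp16 dims
    body += struct.pack("<H", len(neighbors))                # 0x308: neighbor_count
    body += struct.pack(f"<{len(neighbors)}Q", *neighbors)   # 0x30A: neighbor IDs
    padded = len(body) + (-len(body)) % 64                   # pad to 64 B boundary
    return body + b"\x00" * (padded - len(body))
```

With 16 neighbors the unpadded body is 8 + 768 + 2 + 128 = 906 bytes, which pads to the 960-byte entry size quoted above.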
### When to Use Each

| Operation | Layout | Reason |
|-----------|--------|--------|
| Bulk distance computation | Columnar | SIMD operates on dimension columns |
| Top-K refinement scan | Interleaved | Sequential scan of candidates |
| Compression/archival | Columnar | Better ratio |
| HNSW search (hot region) | Interleaved | Vector + neighbors together |
| Batch insert | Columnar | Write once, compress well |

## 4. Prefetch Hints

### The Problem

HNSW search is pointer-chasing: compute distance at node A, read neighbor
list, jump to node B, compute distance, repeat. Each jump is a random
memory access. On a 10M vector file, this means:

```
HNSW search: ~100-200 distance computations per query
Each computation: 1 random read (vector) + 1 random read (neighbors)
Random read latency: 50-100 ns (DRAM), 10-50 μs (SSD)
Total: 10-40 μs (DRAM), 1-10 ms (SSD) without prefetch
```

### The Solution

Store neighbor lists **contiguously** and add **prefetch offsets** in the
manifest so the runtime can issue prefetch instructions ahead of time.

### Prefetch Table Structure

The manifest contains a prefetch table mapping node ID ranges to contiguous
page regions:

```
prefetch_table:
  entry_count: u32
  entries:
    [0]: node_ids 0-9999      -> pages at offset 0x100000, 50 pages, prefetch 3 ahead
    [1]: node_ids 10000-19999 -> pages at offset 0x200000, 50 pages, prefetch 3 ahead
    ...
```

### Runtime Prefetch Strategy

```python
def hnsw_search_with_prefetch(query, entry_point, ef_search, k):
    candidates = MaxHeap()
    visited = BitSet()
    worklist = MinHeap([(distance(query, entry_point), entry_point)])

    while worklist:
        dist, node = worklist.pop()

        # PREFETCH: while processing this node, prefetch neighbors' data
        neighbors = get_neighbors(node)
        for n in neighbors[:PREFETCH_AHEAD]:
            if n not in visited:
                prefetch_vector(n)     # madvise(WILLNEED) or __builtin_prefetch
                prefetch_neighbors(n)  # prefetch neighbor list page

        # COMPUTE: distance to neighbors (data should be in cache by now)
        for n in neighbors:
            if n not in visited:
                visited.add(n)
                d = distance(query, get_vector(n))
                if d < candidates.max() or len(candidates) < ef_search:
                    candidates.push((d, n))
                    worklist.push((d, n))

    return candidates.top_k(k)
```

### Contiguous Neighbor Layout

HOT_SEG stores vectors and neighbors together. For cold INDEX_SEGs, neighbor
lists are laid out in **node ID order** within contiguous pages:

```
Page 0: neighbors[node_0], neighbors[node_1], ..., neighbors[node_63]
Page 1: neighbors[node_64], ..., neighbors[node_127]
...
```

Because HNSW search tends to traverse nodes in the same graph neighborhood
(spatially close node IDs if data was inserted in order), sequential node
IDs tend to be accessed together. Contiguous layout turns random access
into sequential reads.

### Expected Improvement

| Configuration | p95 Latency (10M vectors) |
|---------------|---------------------------|
| No prefetch, random layout | 2.5 ms |
| No prefetch, contiguous layout | 1.2 ms |
| Prefetch, contiguous layout | 0.3 ms |
| Prefetch, contiguous + hot cache | 0.15 ms |

## 5. Dictionary-Coded IDs

### The Problem

Vector IDs in neighbor lists and ID maps are 64-bit integers. For 10M vectors,
most IDs fit in 24 bits. Storing full 64-bit IDs wastes ~5 bytes per entry.

With M=16 neighbors per node and 10M nodes:
- Raw: 10M * 16 * 8 = 1.2 GB of ID data
- Desired: < 300 MB

### Varint Delta Encoding

IDs within a block or neighbor list are sorted and delta-encoded:

```
Original IDs:  [1000, 1005, 1008, 1020, 1100]
Deltas:        [1000,    5,    3,   12,   80]
Varint bytes:  [  2B,   1B,   1B,   1B,   1B] = 6 bytes (vs 40 bytes raw)
```

### Restart Points

Every N entries (default N=64), the delta resets to an absolute value:

```
Group 0 (entries 0-63):    delta from 0 (absolute start)
Group 1 (entries 64-127):  delta from entry[64] (restart)
Group 2 (entries 128-191): delta from entry[128] (restart)
```

The restart point index stores the offset of each restart group:

```
restart_index:
  interval: 64
  offsets: [0, 156, 298, 445, ...]   // byte offsets into encoded data
```

### Random Access

To find the neighbors of node N:

```
1. group = N / restart_interval          // O(1)
2. offset = restart_index[group]         // O(1)
3. seek to offset in encoded data        // O(1)
4. decode sequentially from restart to N // O(restart_interval) = O(64)
```

Total: O(64) varint decodes = ~50-100 ns. Compare with sorted array binary
search: O(log N) = O(24) comparisons with cache misses = ~200-500 ns.

### SIMD Varint Decoding

Modern SIMD can decode varints in bulk:

```
AVX-512 VBMI: ~8 varints per cycle using VPERMB + VPSHUFB
Throughput: 2-4 billion integers/second (Lemire et al.)
```

At 16 neighbors per node, one HNSW search step decodes 16 varints in ~2-4 ns.

### Compression Ratio

| Encoding | Bytes per ID (avg) | 10M * 16 neighbors |
|----------|--------------------|--------------------|
| Raw u64 | 8.0 B | 1,220 MB |
| Raw u32 | 4.0 B | 610 MB |
| Varint (no delta) | 3.2 B | 488 MB |
| Varint delta | 1.5 B | 229 MB |
| Varint delta + restart | 1.6 B | 244 MB |

Delta encoding with restart points achieves ~5x compression over raw u64
while maintaining fast random access.

## 6. Cache Behavior Analysis

### L1/L2/L3 Working Sets

For a typical query on 10M vectors (384 dim, fp16):

```
HNSW search:
  ~150 distance computations
  Each computation: 768 B (vector) + ~128 B (neighbor list) ≈ 896 B
  Total working set: 150 * 896 ≈ 131 KB

Top-K refinement (hot cache scan):
  ~1000 candidates checked
  Each: 960 B (interleaved HOT_SEG entry)
  Total: 960 KB

Query vector: 768 B (always in L1)
Quantization tables: 96 KB (PQ codebook, always in L2)
```

| Cache Level | Size | What Fits |
|-------------|------|-----------|
| L1 (32-48 KB) | Query vector + current node | Always hit |
| L2 (256 KB-1 MB) | PQ tables + 100-200 hot entries | Usually hit |
| L3 (8-32 MB) | Hot cache + partial index | Mostly hit |
| DRAM | Everything | Full dataset |

### p95 Latency Budget

```
HNSW traversal:    150 nodes * 100 ns/node = 15 μs    (L3 hit)
Distance compute:  150 * 50 ns             = 7.5 μs   (SIMD)
Top-K refinement:  1000 * 10 ns            = 10 μs    (hot cache, L2/L3 hit)
Overhead:                                    5 μs     (heap ops, bookkeeping)
                                           -------
Total p95:                                 ~37.5 μs ≈ 0.04 ms

With prefetch: ~30 μs (hide 25% of traversal latency)
```

This matches the target of < 0.3 ms p95 on desktop hardware. The dominant
cost is memory bandwidth, not computation — which is why cache-friendly
layout and prefetch are critical.

## 7. Distance Function SIMD Implementations

### L2 Distance (fp16, 384 dim, AVX-512)

```
; 384 fp16 values = 768 bytes = 12 ZMM registers
; Process 32 fp16 values per iteration (convert to 16 fp32 per half)

.loop:
    vmovdqu16   zmm0, [rsi + rcx]   ; Load 32 fp16 from A
    vmovdqu16   zmm1, [rdi + rcx]   ; Load 32 fp16 from B
    vcvtph2ps   zmm2, ymm0          ; Convert low 16 to fp32
    vcvtph2ps   zmm3, ymm1
    vsubps      zmm2, zmm2, zmm3    ; diff = A - B
    vfmadd231ps zmm4, zmm2, zmm2    ; acc += diff * diff
    ; Repeat for high 16
    vextracti64x4 ymm0, zmm0, 1
    vextracti64x4 ymm1, zmm1, 1
    vcvtph2ps   zmm2, ymm0
    vcvtph2ps   zmm3, ymm1
    vsubps      zmm2, zmm2, zmm3
    vfmadd231ps zmm4, zmm2, zmm2
    add         rcx, 64
    cmp         rcx, 768
    jl          .loop

; Horizontal sum of zmm4 -> scalar result
; ~12 iterations, ~24 FMA ops, ~12 cycles total
```

### Inner Product (int8, 384 dim, AVX-512 VNNI)

```
; 384 int8 values = 384 bytes = 6 ZMM registers
; VPDPBUSD: 64 uint8*int8 multiply-adds per cycle

.loop:
    vmovdqu8   zmm0, [rsi + rcx]    ; 64 uint8 from A
    vmovdqu8   zmm1, [rdi + rcx]    ; 64 int8 from B
    vpdpbusd   zmm2, zmm0, zmm1     ; acc += dot(A, B) per 4 bytes
    add        rcx, 64
    cmp        rcx, 384
    jl         .loop

; 6 iterations, 6 VPDPBUSD ops, ~6 cycles
; ~16x faster than fp16 L2
```

### Hamming Distance (binary, 384 dim, AVX-512)

```
; 384 bits = 48 bytes = 1 partial ZMM load
; VPOPCNTDQ: popcount on 8 x 64-bit words per cycle

    vmovdqu8   zmm0, [rsi]          ; Load 48 bytes (384 bits) from A
    vmovdqu8   zmm1, [rdi]          ; Load 48 bytes from B
    vpxorq     zmm2, zmm0, zmm1     ; XOR -> differing bits
    vpopcntq   zmm3, zmm2           ; Popcount per 64-bit word
    ; Horizontal sum of 6 popcounts -> Hamming distance
    ; ~3 cycles total
```

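
Scalar reference implementations are useful as ground truth when validating SIMD kernels like the ones above; a pure-Python sketch (not performance code, and ignoring the unsigned/signed operand detail of VPDPBUSD):

```python
def l2_sq(a, b):
    """Squared L2 distance over float values (what the fp16 kernel computes)."""
    return sum((x - y) * (x - y) for x, y in zip(a, b))

def dot_i8(a, b):
    """Inner product over int8 values (the quantity the VNNI kernel accumulates)."""
    return sum(x * y for x, y in zip(a, b))

def hamming(a, b):
    """Hamming distance over packed binary vectors (bytes): XOR then popcount."""
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))
```

Running the SIMD kernel and the scalar reference on the same random inputs and comparing results is a cheap correctness check for alignment and tail-handling bugs.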

## 8. Summary: Query Path Hot Loop

The complete hot path for one HNSW search step:

```
1. Load current node's neighbor list     [L2/L3 cache, 128 B, ~5 ns]
2. Issue prefetch for next neighbors     [~1 ns]
3. For each neighbor (M=16):
   a. Check visited bitmap               [L1, ~1 ns]
   b. Load neighbor vector (hot cache)   [L2/L3, 768 B, ~5-10 ns]
   c. SIMD distance (fp16, 384 dim)      [~12 cycles = ~4 ns]
   d. Heap insert if better              [~5 ns]
4. Total per step: ~300-500 ns
5. Total per query (~150 steps): ~50-75 μs
```

This achieves 13,000-20,000 QPS per thread on desktop hardware, matching or
exceeding dedicated vector databases for in-memory workloads.

580
vendor/ruvector/docs/research/rvf/spec/07-deletion-lifecycle.md
vendored
Normal file
@@ -0,0 +1,580 @@

# RVF Deletion Lifecycle

## 1. Overview

Deletion in RVF follows a two-phase protocol consistent with the append-only
segment architecture. Vectors are never removed in-place. Instead, a soft
delete records intent in a JOURNAL_SEG, and a subsequent compaction hard
deletes by physically excluding the vectors from sealed output segments.

```
        JOURNAL_SEG         Compaction           GC / Rewrite
         (append)            (merge)              (reclaim)
ACTIVE -----> SOFT_DELETED -----> HARD_DELETED ------> RECLAIMED
  |               |                   |                    |
  | query path    | query path        |                    |
  | returns vec   | skips vec         | vec absent         | space freed
  |               | (bitmap check)    | from output seg    |
```

Readers always see a consistent snapshot: a deletion is invisible until
the manifest referencing the new deletion bitmap is durably committed.

## 2. Vector Lifecycle State Machine

```
+----------+  JOURNAL_SEG            +-----------------+
|          |  DELETE_VECTOR / RANGE  |                 |
|  ACTIVE  +------------------------>+  SOFT_DELETED   |
|          |                         |                 |
+----------+                         +--------+--------+
                                              | Compaction seals output
                                              v excluding this vector
                                     +--------+--------+
                                     |  HARD_DELETED   |
                                     +--------+--------+
                                              | File rewrite / truncation
                                              v reclaims physical space
                                     +--------+--------+
                                     |    RECLAIMED    |
                                     +-----------------+
```

| State | Bitmap Bit | Physical Bytes | Query Visible |
|-------|------------|----------------|---------------|
| ACTIVE | 0 | Vector in VEC_SEG | Yes |
| SOFT_DELETED | 1 | Vector in VEC_SEG | No |
| HARD_DELETED | N/A | Excluded from sealed output | No |
| RECLAIMED | N/A | Bytes overwritten / freed | No |

| Transition | Trigger | Durability |
|------------|---------|------------|
| ACTIVE -> SOFT_DELETED | JOURNAL_SEG + MANIFEST_SEG with bitmap | After manifest fsync |
| SOFT_DELETED -> HARD_DELETED | Compaction writes sealed VEC_SEG without vector | After compaction manifest fsync |
| HARD_DELETED -> RECLAIMED | File rewrite or old shard deletion | After shard unlink |

## 3. JOURNAL_SEG Wire Format (type 0x04)

A JOURNAL_SEG records metadata mutations: deletions, metadata updates, tier
moves, and ID remappings. Its payload follows the standard 64-byte segment
header (see `01-segment-model.md` section 2).

### 3.1 Journal Header (64 bytes)

```
Offset  Type    Field                Description
------  ----    -----                -----------
0x00    u32     entry_count          Number of journal entries
0x04    u32     journal_epoch        Epoch when this journal was written
0x08    u64     prev_journal_seg_id  Segment ID of previous JOURNAL_SEG (0 if first)
0x10    u32     flags                Reserved, must be 0
0x14    u8[44]  reserved             Zero-padded to 64-byte alignment
```

### 3.2 Journal Entry Format

Each entry begins on an 8-byte aligned boundary:

```
Offset  Type  Field         Description
------  ----  -----         -----------
0x00    u8    entry_type    Entry type enum
0x01    u8    reserved      Must be 0x00
0x02    u16   entry_length  Byte length of type-specific payload
0x04    u8[]  payload       Type-specific payload
var     u8[]  padding       Zero-pad to next 8-byte boundary
```

### 3.3 Entry Types

```
Value  Name             Payload Size  Description
-----  ----             ------------  -----------
0x01   DELETE_VECTOR    8 B           Delete a single vector by ID
0x02   DELETE_RANGE     16 B          Delete a contiguous range of vector IDs
0x03   UPDATE_METADATA  variable      Update key-value metadata for a vector
0x04   MOVE_VECTOR      24 B          Reassign vector to a different segment/tier
0x05   REMAP_ID         16 B          Reassign vector ID (post-compaction)
```

### 3.4 Type-Specific Payloads

**DELETE_VECTOR (0x01)**
```
0x00  u64  vector_id  ID of the vector to soft-delete
```

**DELETE_RANGE (0x02)**
```
0x00  u64  start_id  First vector ID (inclusive)
0x08  u64  end_id    Last vector ID (exclusive)
```
Invariant: `start_id < end_id`. Range `[start_id, end_id)` is half-open.

**UPDATE_METADATA (0x03)**
```
0x00   u64   vector_id  Target vector ID
0x08   u16   key_len    Byte length of metadata key
0x0A   u8[]  key        Metadata key (UTF-8)
var    u16   val_len    Byte length of metadata value
var+2  u8[]  val        Metadata value (opaque bytes)
```

**MOVE_VECTOR (0x04)**
```
0x00  u64  vector_id  Target vector ID
0x08  u64  src_seg    Source segment ID
0x10  u64  dst_seg    Destination segment ID
```

**REMAP_ID (0x05)**
```
0x00  u64  old_id  Original vector ID
0x08  u64  new_id  New vector ID after compaction
```

### 3.5 Complete JOURNAL_SEG Example

Deleting vector 42, deleting range [1000, 2000), remapping ID 500 -> 3:

```
Byte offset  Content                Notes
-----------  -------                -----
0x00-0x3F    Segment header (64 B)  seg_type=0x04, magic=RVFS
0x40-0x7F    Journal header (64 B)  entry_count=3, epoch=7,
                                    prev_journal_seg_id=12
--- Entry 0: DELETE_VECTOR ---
0x80         0x01                   entry_type
0x81         0x00                   reserved
0x82-0x83    0x0008                 entry_length = 8
0x84-0x8B    0x000000000000002A     vector_id = 42
0x8C-0x8F    0x00000000             padding to 8B

--- Entry 1: DELETE_RANGE ---
0x90         0x02                   entry_type
0x91         0x00                   reserved
0x92-0x93    0x0010                 entry_length = 16
0x94-0x9B    0x00000000000003E8     start_id = 1000
0x9C-0xA3    0x00000000000007D0     end_id = 2000
0xA4-0xA7    0x00000000             padding to 8B

--- Entry 2: REMAP_ID ---
0xA8         0x05                   entry_type
0xA9         0x00                   reserved
0xAA-0xAB    0x0010                 entry_length = 16
0xAC-0xB3    0x00000000000001F4     old_id = 500
0xB4-0xBB    0x0000000000000003     new_id = 3
```
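The entry framing in section 3.2 can be exercised with a tiny encoder (a sketch; little-endian byte order is an assumption, since this section does not state endianness):

```python
import struct

def encode_entry(entry_type: int, payload: bytes) -> bytes:
    """Frame one journal entry: type, reserved, length, payload, pad to 8 B."""
    entry = struct.pack("<BBH", entry_type, 0x00, len(payload)) + payload
    return entry + b"\x00" * ((-len(entry)) % 8)

# Three entries: delete vector 42, delete range [1000, 2000), remap 500 -> 3
entries = [
    encode_entry(0x01, struct.pack("<Q", 42)),           # DELETE_VECTOR
    encode_entry(0x02, struct.pack("<QQ", 1000, 2000)),  # DELETE_RANGE
    encode_entry(0x05, struct.pack("<QQ", 500, 3)),      # REMAP_ID
]
body = b"".join(entries)
# Entry offsets within the journal payload: 0x00, 0x10, 0x28 (8-byte aligned),
# i.e. 0x80, 0x90, 0xA8 in the file after the two 64 B headers.
```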

## 4. Deletion Bitmap

### 4.1 Manifest Record

The deletion bitmap is stored in the Level 1 manifest as a TLV record:

```
Tag     Name             Description
---     ----             -----------
0x000E  DELETION_BITMAP  Roaring bitmap of soft-deleted vector IDs
```

This extends the TLV tag space (previous: 0x000D KEY_DIRECTORY).

### 4.2 Roaring Bitmap Binary Layout

Vector IDs are 64-bit. The bits above the low 16 (`vector_id >> 16`, stored as
a u32 **high key**) select a container; the low 16 bits index into that
**container**'s 64K range.

```
|
||||
+---------------------------------------------+
|
||||
| DELETION_BITMAP TLV Value |
|
||||
+---------------------------------------------+
|
||||
| Bitmap Header |
|
||||
| cookie: u32 (0x3B3A3332) |
|
||||
| high_key_count: u32 |
|
||||
| For each high key: |
|
||||
| high_key: u32 |
|
||||
| container_type: u8 |
|
||||
| 0x01 = ARRAY_CONTAINER |
|
||||
| 0x02 = BITMAP_CONTAINER |
|
||||
| 0x03 = RUN_CONTAINER |
|
||||
| container_offset: u32 (from bitmap start)|
|
||||
| [8B aligned] |
|
||||
+---------------------------------------------+
|
||||
| Container Data |
|
||||
| Container 0: [type-specific layout] |
|
||||
| Container 1: ... |
|
||||
| [8B aligned per container] |
|
||||
+---------------------------------------------+
|
||||
```
|
||||
|
||||
### 4.3 Container Types

**ARRAY_CONTAINER (0x01)** -- Sparse deletions (< 4096 set bits per 64K range).
```
0x00  u16    cardinality  Number of set values (1-4096)
0x02  u16[]  values       Sorted array of 16-bit values
```
Size: `2 + 2 * cardinality` bytes.

**BITMAP_CONTAINER (0x02)** -- Dense deletions (>= 4096 set bits per 64K range).
```
0x00  u16       cardinality  Number of set bits
0x02  u8[8192]  bitmap       Fixed 65536-bit bitmap (8 KB)
```
Size: 8194 bytes (fixed).

**RUN_CONTAINER (0x03)** -- Contiguous ranges of deletions.
```
0x00  u16        run_count  Number of runs
0x02  (u16,u16)  runs[]     Array of (start, length-1) pairs
```
Size: `2 + 4 * run_count` bytes.

### 4.4 Size Estimation

| Deletion Pattern | Deleted IDs | Container Types | Bitmap Size |
|------------------|-------------|-----------------|-------------|
| Sparse random | 10,000 (0.1%) | ~153 array | ~22 KB |
| Clustered ranges | 10,000 (0.1%) | ~5 run | ~0.1 KB |
| Mixed workload | 100,000 (1%) | array + run | ~80 KB |
| Heavy deletion | 1,000,000 (10%) | bitmap + run | ~200 KB |

Even at 200 KB the bitmap fits entirely in L2 cache.
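The sparse-random row can be reproduced from the container layouts in 4.3 (a sketch; a 10 M vector file is assumed, and the per-key header cost is the 9 bytes shown in 4.2):

```python
total_vectors = 10_000_000
deleted = 10_000

containers = total_vectors // 65_536 + 1       # ~153 high keys touched
per_container = round(deleted / containers)    # ~65 random IDs each -> ARRAY
array_bytes = 2 + 2 * per_container            # cardinality + u16 values
key_bytes = 4 + 1 + 4                          # high_key + container_type + offset
size = containers * (array_bytes + key_bytes)  # ~21.6 KB, i.e. the ~22 KB row
```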

### 4.5 Bitmap Operations

```python
def bitmap_check(bitmap, vector_id):
    """Returns True if vector_id is soft-deleted. O(1) amortized."""
    high_key = vector_id >> 16
    low_val = vector_id & 0xFFFF
    container = bitmap.get_container(high_key)
    if container is None:
        return False
    return container.contains(low_val)  # array: bsearch, bitmap: bit test, run: bsearch

def bitmap_set(bitmap, vector_id):
    """Mark a vector as soft-deleted."""
    high_key = vector_id >> 16
    low_val = vector_id & 0xFFFF
    container = bitmap.get_or_create_container(high_key)
    container.add(low_val)
    if container.type == ARRAY and container.cardinality > 4096:
        container.promote_to_bitmap()
```

## 5. Delete-Aware Query Path

### 5.1 HNSW Traversal with Deletion Filtering

Deleted vectors remain in the HNSW graph until compaction rebuilds the index.
During search, the deletion bitmap is checked per candidate. Deleted nodes are
still traversed for connectivity but excluded from the result set.

```python
def hnsw_search_delete_aware(query, entry_point, ef_search, k, del_bitmap):
    candidates = MaxHeap()  # worst candidate on top
    visited = BitSet()
    worklist = MinHeap()    # best candidate first

    d0 = distance(query, get_vector(entry_point))
    worklist.push((d0, entry_point))
    visited.add(entry_point)
    if not bitmap_check(del_bitmap, entry_point):
        candidates.push((d0, entry_point))

    while worklist:
        dist, node = worklist.pop()
        if candidates.size() >= ef_search and dist > candidates.peek_max():
            break

        neighbors = get_neighbors(node)
        for n in neighbors[:PREFETCH_AHEAD]:
            if n not in visited:
                prefetch_vector(n)

        for n in neighbors:
            if n in visited:
                continue
            visited.add(n)
            d = distance(query, get_vector(n))
            is_deleted = bitmap_check(del_bitmap, n)  # O(1) bitmap lookup

            # Always add to worklist (graph connectivity)
            if candidates.size() < ef_search or d < candidates.peek_max():
                worklist.push((d, n))
                # Only add to results if NOT deleted
                if not is_deleted:
                    if candidates.size() < ef_search:
                        candidates.push((d, n))
                    elif d < candidates.peek_max():
                        candidates.replace_max((d, n))

    return candidates.top_k(k)
```

### 5.2 Top-K Refinement with Deletion Filtering

```python
def topk_refine_delete_aware(candidates, hot_cache, query, k, del_bitmap):
    heap = MaxHeap()
    for cand_dist, cand_id in candidates:
        heap.push((cand_dist, cand_id))

    for entry in hot_cache.sequential_scan():
        if bitmap_check(del_bitmap, entry.vector_id):
            continue  # skip soft-deleted
        d = distance(query, entry.vector)
        if heap.size() < k:
            heap.push((d, entry.vector_id))
        elif d < heap.peek_max():
            heap.replace_max((d, entry.vector_id))

    return heap.drain_sorted()
```

### 5.3 Performance Impact

| Operation | Without Deletions | With Deletions | Overhead |
|-----------|-------------------|----------------|----------|
| Bitmap check | N/A | ~2-5 ns (L1/L2 hit) | Per candidate |
| HNSW step (M=16) | ~300-500 ns | ~330-580 ns | +10% |
| Top-K refine (1000) | ~10 us | ~12 us | +20% worst |
| Total query | ~50-75 us | ~55-85 us | +10-13% |

At typical deletion rates (< 5%), overhead is negligible: the bitmap fits in
L2 cache, graph connectivity is preserved, and the cost is one branch plus
one bitmap load per candidate.

## 6. Deletion Write Path

All deletion operations follow the same two-fsync protocol:

```python
def delete_vectors(file, entries):
    """Soft-delete vectors. entries: list of DeleteVector or DeleteRange."""
    # 1. Append JOURNAL_SEG
    journal = JournalSegment(
        epoch=current_epoch(file),
        prev_journal_seg_id=latest_journal_id(file),
        entries=entries
    )
    append_segment(file, journal)
    fsync(file)  # orphan-safe: no manifest references this yet

    # 2. Update deletion bitmap in memory
    bitmap = load_deletion_bitmap(file)
    for e in entries:
        if e.type == DELETE_VECTOR:
            bitmap_set(bitmap, e.vector_id)
        elif e.type == DELETE_RANGE:
            bitmap.add_range(e.start_id, e.end_id)

    # 3. Append MANIFEST_SEG with updated bitmap
    manifest = build_manifest(file, deletion_bitmap=bitmap)
    append_segment(file, manifest)
    fsync(file)  # deletion now visible to all new readers
```

Single deletes, bulk ranges, and batch deletes all use this path. Batch
operations pack multiple entries into one JOURNAL_SEG to amortize fsync cost.

## 7. Compaction with Deletions

### 7.1 Compaction Process

```
Before:
[VEC_1]  [VEC_2]  [JOURNAL_1]  [VEC_3]  [JOURNAL_2]   [MANIFEST_5]
 0-999    1000-    del:42,      3000-    del:[1000,    bitmap={42,500,
          2999     del:500      4999     2000)         1000..1999}

After:
... [MANIFEST_5]  [VEC_sealed]      [INDEX_new]  [MANIFEST_6]
                   vectors 0-4999                 bitmap={}
                   MINUS deleted                  (empty for
                                                  compacted range)
```

### 7.2 Compaction Algorithm

```python
def compact_with_deletions(file, seg_ids):
    bitmap = load_deletion_bitmap(file)
    output, id_remap, next_id = [], {}, 0

    for seg_id in sorted(seg_ids):
        seg = load_segment(file, seg_id)
        if seg.seg_type != VEC_SEG:
            continue
        for vec_id, vector in seg.all_vectors():
            if bitmap_check(bitmap, vec_id):
                bitmap.remove(vec_id)  # INV-D1: clear bitmap for compacted range
                continue               # physically exclude
            id_remap[vec_id] = next_id
            output.append((next_id, vector))
            next_id += 1

    append_segment(file, VecSegment(flags=SEALED, vectors=output))

    remaps = [RemapIdEntry(old, new) for old, new in id_remap.items() if old != new]
    if remaps:
        append_segment(file, JournalSegment(entries=remaps))

    append_segment(file, build_hnsw_index(output))

    manifest = build_manifest(file,
                              tombstone_seg_ids=seg_ids,
                              deletion_bitmap=bitmap)
    append_segment(file, manifest)
    fsync(file)
```

### 7.3 Journal Merging

During compaction, JOURNAL_SEGs covering the compacted range are consumed:

| Entry Type | Materialization |
|------------|-----------------|
| DELETE_VECTOR / DELETE_RANGE | Vectors excluded from output |
| UPDATE_METADATA | Applied to output META_SEG |
| MOVE_VECTOR | Tier assignment applied in new manifest |
| REMAP_ID | Chained: old remap composed with new remap |

Consumed JOURNAL_SEGs are tombstoned alongside compacted VEC_SEGs.

### 7.4 Compaction Invariants

| ID | Invariant |
|----|-----------|
| INV-D1 | After compaction, deletion bitmap is empty for compacted range |
| INV-D2 | Sealed output contains only ACTIVE vectors |
| INV-D3 | REMAP_ID entries journaled for every relocated vector |
| INV-D4 | Compacted input segments tombstoned in new manifest |
| INV-D5 | Sealed segments are never modified |
| INV-D6 | Rebuilt indexes exclude deleted nodes |

## 8. Deletion Consistency

### 8.1 Crash Safety

```
Write path:
  1. Append JOURNAL_SEG  -> fsync   crash here: orphan, invisible
  2. Append MANIFEST_SEG -> fsync   crash here: partial manifest, fallback

Recovery:
  - Crash after step 1: JOURNAL_SEG orphaned. No manifest references it.
    Reader sees previous manifest. Deletion NOT visible. Orphan cleaned
    up by next compaction.
  - Crash during step 2: Partial MANIFEST_SEG has bad checksum. Reader
    falls back to previous valid manifest. Deletion NOT visible.
  - After step 2 success: Manifest durable. Deletion visible.
```

**Guarantee**: Uncommitted deletions never affect readers. Deletion is
atomic at the manifest fsync boundary.

### 8.2 Manifest Chain Visibility

```
MANIFEST_3: bitmap = {}
     |  JOURNAL_SEG written (delete vector 42)
MANIFEST_4: bitmap = {42}   <-- deletion visible from here
     |  Compaction runs
MANIFEST_5: bitmap = {}     <-- vector 42 physically removed
```

A reader holding MANIFEST_3 continues to see vector 42. A reader opening
after MANIFEST_4 will not. This provides snapshot isolation at manifest
granularity.

### 8.3 Multi-File Mode

In multi-file mode, each shard maintains its own deletion bitmap. The
DELETION_BITMAP TLV record supports two modes:

```
+----------------------------------------------+
| mode: u8                                     |
|   0x00 = SINGLE  (one bitmap, inline)        |
|   0x01 = SHARDED (per-shard references)      |
+----------------------------------------------+
SINGLE (0x00):
| roaring_bitmap: [u8; ...]                    |

SHARDED (0x01):
| shard_count: u16                             |
| For each shard:                              |
|   shard_id: u16                              |
|   bitmap_offset: u64  (in shard file)        |
|   bitmap_length: u32                         |
|   bitmap_hash: hash128                       |
+----------------------------------------------+
```

Queries spanning shards load per-shard bitmaps and check each candidate
against its shard's bitmap.

### 8.4 Concurrent Access

One writer at a time (file-level advisory lock). Multiple readers are safe
due to append-only architecture. A reader that opened before a deletion
sees the pre-deletion snapshot until it re-reads the manifest.

## 9. Space Reclamation

| Trigger | Threshold | Action |
|---------|-----------|--------|
| Deletion ratio | > 20% of vectors deleted | Schedule compaction |
| Bitmap size | > 1 MB | Schedule compaction |
| Segment count | > 64 mutable segments | Schedule compaction |
| Manual | User-initiated | Compact immediately |

Space accounting derived from the manifest:
```
total_vector_count:   10,000,000  (Level 0 root manifest)
deleted_vector_count: 150,000     (bitmap cardinality)
active_vector_count:  9,850,000   (total - deleted)
deletion_ratio:       1.5%        (below threshold)
wasted_bytes:         ~115 MB     (150K * 768 B per fp16-384 vector)
```
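The accounting block is plain arithmetic over two manifest fields (a sketch):

```python
total = 10_000_000          # Level 0 root manifest
deleted = 150_000           # deletion bitmap cardinality
bytes_per_vector = 768      # fp16 x 384 dims

active = total - deleted                             # 9,850,000
deletion_ratio = deleted / total                     # 0.015, below the 20% trigger
wasted_mb = deleted * bytes_per_vector / 1_000_000   # 115.2 MB
needs_compaction = deletion_ratio > 0.20             # False
```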

## 10. Summary

### Deletion Protocol

| Step | Action | Durability |
|------|--------|------------|
| 1 | Append JOURNAL_SEG with DELETE entries | fsync (orphan-safe) |
| 2 | Update roaring deletion bitmap | In-memory |
| 3 | Append MANIFEST_SEG with new bitmap | fsync (deletion visible) |
| 4 | Compaction excludes deleted vectors | fsync (physical removal) |
| 5 | File rewrite reclaims space | fsync (space freed) |

### New Wire Format Elements

| Element | Type / Tag | Section |
|---------|------------|---------|
| JOURNAL_SEG | Segment type 0x04 | 3 |
| DELETE_VECTOR | Journal entry 0x01 | 3.4 |
| DELETE_RANGE | Journal entry 0x02 | 3.4 |
| UPDATE_METADATA | Journal entry 0x03 | 3.4 |
| MOVE_VECTOR | Journal entry 0x04 | 3.4 |
| REMAP_ID | Journal entry 0x05 | 3.4 |
| DELETION_BITMAP | Level 1 TLV 0x000E | 4 |

### Invariants

| ID | Invariant |
|----|-----------|
| INV-D1 | After compaction, deletion bitmap is empty for compacted range |
| INV-D2 | Sealed output segments contain only ACTIVE vectors |
| INV-D3 | ID remappings journaled for every compaction-relocated vector |
| INV-D4 | Compacted input segments tombstoned in new manifest |
| INV-D5 | Sealed segments are never modified |
| INV-D6 | Rebuilt indexes exclude deleted nodes |
| INV-D7 | Uncommitted deletions never affect readers (crash safety) |
| INV-D8 | Deletion visibility is atomic at the manifest fsync boundary |

724
vendor/ruvector/docs/research/rvf/spec/08-filtered-search.md
vendored
Normal file
@@ -0,0 +1,724 @@
# RVF Filtered Search

## 1. Motivation

Domain profiles declare metadata schemas with indexed fields (e.g., `"organism"` in
RVDNA, `"language"` in RVText, `"node_type"` in RVGraph), but the format provides no
specification for how those indexes are built, stored, or evaluated at query time.

Filtered search is the combination of vector similarity search with metadata
predicates. Without it, a caller must retrieve an over-sized result set and filter
client-side, wasting bandwidth, latency, and recall budget.

This specification adds:

1. **META_SEG** payload layout (segment type 0x07) for storing per-vector metadata
2. **Filter expression language** with a compact binary encoding
3. **Three evaluation strategies** (pre-, post-, and intra-filtering)
4. **METAIDX_SEG** (new segment type 0x0D) for inverted and bitmap indexes
5. **Manifest integration** via a new Level 1 TLV record
6. **Temperature tier coordination** for metadata segments

## 2. META_SEG Payload Layout (Segment Type 0x07)

META_SEG stores the actual metadata values associated with vectors. It uses the
standard 64-byte segment header (see `binary-layout.md` Section 3) with
`seg_type = 0x07`.

```
META_SEG Payload:

+------------------------------------------+
| Meta Header (64 bytes, padded)           |
|   schema_id: u32             | References PROFILE_SEG schema
|   vector_id_range_start: u64 | First vector ID covered
|   vector_id_range_end: u64   | Last vector ID covered (inclusive)
|   field_count: u16           | Number of fields in this segment
|   encoding: u8               | 0 = row-oriented, 1 = column-oriented
|   reserved: [u8; 41]         | Must be zero (pads header to 64 B)
|   [64B aligned]              |
+------------------------------------------+
| Field Directory                          |
|   For each field (field_count entries):  |
|     field_id: u16                        |
|     field_type: u8                       |
|     flags: u8                            |
|     field_offset: u32        | Byte offset from payload start
|   [64B aligned]                          |
+------------------------------------------+
| Field Data (column-oriented)             |
|   (see Section 2.1 for per-type layout)  |
+------------------------------------------+
```
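The fixed 64-byte meta header can be unpacked directly (a sketch; little-endian is assumed, and the reserved field is taken as 41 bytes so that the five declared fields plus padding total exactly 64):

```python
import struct

# schema_id, id_start, id_end, field_count, encoding, reserved
META_HEADER = struct.Struct("<IQQHB41s")  # 4 + 8 + 8 + 2 + 1 + 41 = 64 bytes

def parse_meta_header(buf: bytes) -> dict:
    schema_id, id_start, id_end, field_count, encoding, _ = META_HEADER.unpack(buf[:64])
    return {"schema_id": schema_id,
            "vector_id_range": (id_start, id_end),
            "field_count": field_count,
            "column_oriented": encoding == 1}

hdr = META_HEADER.pack(7, 0, 9999, 3, 1, b"\x00" * 41)
parsed = parse_meta_header(hdr)
```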

### Field Type Enum

```
Value  Type    Wire Size          Description
-----  ----    ---------          -----------
0x00   string  Variable           UTF-8, dictionary-encoded in column layout
0x01   u32     4 bytes            Unsigned 32-bit integer
0x02   u64     8 bytes            Unsigned 64-bit integer
0x03   f32     4 bytes            IEEE 754 single-precision float
0x04   enum    Variable (packed)  Enumeration with defined label set
0x05   bool    1 bit (packed)     Boolean
```

### Field Flags

```
Bit  Mask  Name      Meaning
---  ----  ----      -------
0    0x01  INDEXED   Field has a corresponding METAIDX_SEG
1    0x02  SORTED    Values are stored in sorted order
2    0x04  NULLABLE  Null bitmap present before values
3    0x08  STORED    Field value returned in query results (not just filterable)
4-7        reserved  Must be zero
```

### 2.1 Column-Oriented Field Layouts

Column-oriented encoding (encoding = 1) is the preferred layout. Each field's data
block starts at a 64-byte aligned boundary.

**String fields** (dictionary-encoded):

```
dict_size: u32                  Number of distinct strings
For each dict entry:
  length: u16                   Byte length of UTF-8 string
  bytes: [u8; length]           UTF-8 encoded string
[4B aligned after dictionary]
codes: [varint; vector_count]   Dictionary code per vector
[64B aligned]
```

Dictionary codes are 0-indexed into the dictionary array. Code `0xFFFFFFFF` (max
varint value for u32 range) represents null if the NULLABLE flag is set.

**Numeric fields** (u32, u64, f32 -- direct array):

```
If NULLABLE:
  null_bitmap: [u8; ceil(vector_count / 8)]   Bit-packed, 1 = present, 0 = null
  [8B aligned]
values: [field_type; vector_count]            Dense array of values
[64B aligned]
```

Values for null entries are zero-filled but must not be relied upon.

**Enum fields** (bit-packed):

```
enum_count: u8             Number of enum labels
For each enum label:
  length: u8               Byte length of label
  bytes: [u8; length]      UTF-8 label string
bits_per_code: u8          ceil(log2(enum_count))
codes: packed bit array    bits_per_code bits per vector
  [ceil(vector_count * bits_per_code / 8) bytes]
[64B aligned]
```

For example, an enum with 3 values (`"+"`, `"-"`, `"."`) uses 2 bits per vector.
1M vectors = 250 KB.

**Bool fields** (bit-packed):

```
If NULLABLE:
  null_bitmap: [u8; ceil(vector_count / 8)]
  [8B aligned]
values: [u8; ceil(vector_count / 8)]   Bit-packed, 1 = true, 0 = false
[64B aligned]
```

### 2.2 Sorted Index (Inline)

For fields with the SORTED flag, an additional sorted permutation index follows
the field data:

```
sorted_count: u32                      Must equal vector_count
sorted_order: [varint delta-encoded]   Vector IDs in ascending value order
restart_interval: u16                  Restart every N entries (default 128)
restart_offsets: [u32; ceil(sorted_count / restart_interval)]
[64B aligned]
```

This enables binary search over field values for range queries without requiring
a separate METAIDX_SEG. It is suitable for fields where a full inverted index
would be wasteful (high-cardinality numeric fields like `position_start`).

## 3. Filter Expression Language

### 3.1 Abstract Syntax

A filter expression is a tree of predicates combined with boolean logic:

```
expr ::= field_ref CMP literal          -- comparison
       | field_ref IN literal_set       -- set membership
       | field_ref PREFIX string_lit    -- string prefix match
       | field_ref CONTAINS string_lit  -- substring containment
       | expr AND expr                  -- conjunction
       | expr OR expr                   -- disjunction
       | NOT expr                       -- negation
```

### 3.2 Binary Encoding (Postfix / RPN)

Filter expressions are encoded as a postfix (Reverse Polish Notation) token stream
for stack-based evaluation. This avoids the need for recursive parsing and enables
single-pass evaluation with a fixed-size stack.

```
Filter Expression Binary Layout:

header:
  node_count: u16   Total number of tokens
  stack_depth: u8   Maximum stack depth required
  reserved: u8      Must be zero

tokens (postfix order):
  For each token:
    node_type: u8           Token type (see enum below)
    payload: type-specific  Variable-size payload
```

### Token Type Enum

```
Value  Name       Stack Effect   Payload
-----  ----       ------------   -------
0x01   FIELD_REF  push +1        field_id: u16
0x02   LIT_U32    push +1        value: u32
0x03   LIT_U64    push +1        value: u64
0x04   LIT_F32    push +1        value: f32
0x05   LIT_STR    push +1        length: u16, bytes: [u8; length]
0x06   LIT_BOOL   push +1        value: u8 (0 or 1)
0x07   LIT_NULL   push +1        (no payload)

0x10   CMP_EQ     pop 2, push 1  (no payload) -- a == b
0x11   CMP_NE     pop 2, push 1  (no payload) -- a != b
0x12   CMP_LT     pop 2, push 1  (no payload) -- a < b
0x13   CMP_LE     pop 2, push 1  (no payload) -- a <= b
0x14   CMP_GT     pop 2, push 1  (no payload) -- a > b
0x15   CMP_GE     pop 2, push 1  (no payload) -- a >= b

0x20   IN_SET     pop 1, push 1  set_size: u16, [encoded values]
0x21   PREFIX     pop 2, push 1  (no payload) -- string prefix
0x22   CONTAINS   pop 2, push 1  (no payload) -- substring match

0x30   AND        pop 2, push 1  (no payload)
0x31   OR         pop 2, push 1  (no payload)
0x32   NOT        pop 1, push 1  (no payload)
```

### 3.3 Encoding Example

Filter: `organism = "E. coli" AND position_start >= 1000`

```
Token 0: FIELD_REF field_id=0 (organism)        stack: [organism_val]
Token 1: LIT_STR "E. coli"                      stack: [organism_val, "E. coli"]
Token 2: CMP_EQ                                 stack: [true/false]
Token 3: FIELD_REF field_id=3 (position_start)  stack: [bool, pos_val]
Token 4: LIT_U64 1000                           stack: [bool, pos_val, 1000]
Token 5: CMP_GE                                 stack: [bool, true/false]
Token 6: AND                                    stack: [result]

Binary: node_count=7, stack_depth=3
01 00:00  05 00:07 "E. coli"  10  01 00:03  03 00:00:00:00:00:00:03:E8  15  30
```
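A writer can derive the header's `stack_depth` mechanically from the stack effects listed in the token enum (a sketch; tokens are reduced to `(pops, pushes)` pairs):

```python
def max_stack_depth(effects):
    """effects: (pops, pushes) per token, in postfix order."""
    depth = peak = 0
    for pops, pushes in effects:
        depth -= pops
        assert depth >= 0, "malformed expression: stack underflow"
        depth += pushes
        peak = max(peak, depth)
    assert depth == 1, "expression must leave exactly one result"
    return peak

# organism = "E. coli" AND position_start >= 1000, tokenized as above
effects = [(0, 1), (0, 1), (2, 1), (0, 1), (0, 1), (2, 1), (2, 1)]
depth = max_stack_depth(effects)  # 3, matching stack_depth=3 in the example
```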
|
||||
|
||||
### 3.4 Evaluation
|
||||
|
||||
Evaluation processes tokens left to right using a fixed-size boolean/value stack:
|
||||
|
||||
```python
def evaluate(tokens, vector_id, metadata):
    stack = []
    for token in tokens:
        if token.type == FIELD_REF:
            stack.append(metadata.get_value(vector_id, token.field_id))
        elif token.type in (LIT_U32, LIT_U64, LIT_F32, LIT_STR, LIT_BOOL, LIT_NULL):
            stack.append(token.value)
        elif token.type in (CMP_EQ, CMP_NE, CMP_LT, CMP_LE, CMP_GT, CMP_GE):
            b, a = stack.pop(), stack.pop()
            stack.append(compare(a, token.type, b))
        elif token.type == IN_SET:
            a = stack.pop()
            stack.append(a in token.value_set)
        elif token.type in (PREFIX, CONTAINS):
            b, a = stack.pop(), stack.pop()
            stack.append(string_match(a, token.type, b))
        elif token.type == AND:
            b, a = stack.pop(), stack.pop()
            stack.append(a and b)
        elif token.type == OR:
            b, a = stack.pop(), stack.pop()
            stack.append(a or b)
        elif token.type == NOT:
            stack.append(not stack.pop())
    return stack[0]
```

Maximum stack depth is declared in the header so the evaluator can pre-allocate.
Implementations must reject expressions with `stack_depth > 16`.

## 4. Filter Evaluation Strategies

The runtime selects one of three strategies based on the estimated **selectivity**
of the filter (the fraction of vectors passing the filter).

### 4.1 Pre-Filtering (Selectivity < 1%)

Build the candidate ID set from metadata indexes first, then run vector search
only on the filtered subset.

```
1. Evaluate filter using METAIDX_SEG inverted/bitmap indexes
2. Collect matching vector IDs into a candidate set C
3. If |C| < ef_search:
     Flat scan all candidates, return top-K
   Else:
     Build temporary flat index over C, run HNSW search restricted to C
4. Return top-K results
```

**Tradeoffs**:
- Optimal when the candidate set is very small (hundreds to low thousands)
- Risk: if the candidate set is disconnected in the HNSW graph, search cannot
  traverse from entry points to candidates. The flat scan fallback handles this.
- Memory: candidate set bitmap = `ceil(total_vectors / 8)` bytes

### 4.2 Post-Filtering (Selectivity > 20%)

Run standard HNSW search with over-retrieval, then filter results.

```
1. Compute over_retrieval_factor = min(1.0 / selectivity, 10.0)
2. Set ef_search_adj = ef_search * over_retrieval_factor
3. Run standard HNSW search with ef_search_adj
4. Filter result set by evaluating filter expression per candidate
5. Return top-K from filtered results
```

**Tradeoffs**:
- Optimal when the filter passes most vectors (minimal wasted computation)
- Risk: if the over-retrieval factor is too low, fewer than K results survive
  filtering. The caller should retry with a higher factor or fall back to
  intra-filtering.
- No modification to HNSW traversal logic required.

### 4.3 Intra-Filtering (1% <= Selectivity <= 20%)

Evaluate the filter during HNSW traversal, skipping nodes that fail the predicate.

```python
def filtered_hnsw_search(query, filter_expr, entry_point, ef_search, k):
    candidates = MaxHeap()       # top-K results (max-heap by distance)
    worklist = MinHeap()         # exploration frontier (min-heap by distance)
    visited = BitSet()
    filtered_skips = 0
    max_skips = ef_search * 3    # backoff threshold

    worklist.push((distance(query, entry_point), entry_point))
    visited.add(entry_point)

    while worklist and filtered_skips < max_skips:
        dist, node = worklist.pop()

        # Check filter predicate
        if not evaluate(filter_expr, node, metadata):
            filtered_skips += 1
            # Still expand neighbors (maintain graph connectivity)
            neighbors = get_neighbors(node)
            for n in neighbors:
                if n not in visited:
                    visited.add(n)
                    d = distance(query, get_vector(n))
                    worklist.push((d, n))
            continue

        filtered_skips = 0       # reset skip counter on successful match
        candidates.push((dist, node))
        if len(candidates) > k:
            candidates.pop()     # evict worst

        # Expand neighbors
        neighbors = get_neighbors(node)
        for n in neighbors:
            if n not in visited:
                visited.add(n)
                d = distance(query, get_vector(n))
                if len(candidates) < ef_search or d < candidates.max():
                    worklist.push((d, n))

    return candidates.top_k(k)
```

**Key design decisions**:

1. **Skipped nodes still expand neighbors**: This preserves graph connectivity.
   A node that fails the filter may have neighbors that pass it.

2. **Skip counter with backoff**: If too many consecutive nodes fail the filter,
   the search is exhausting the local neighborhood without finding matches. The
   `max_skips` threshold triggers termination to avoid unbounded traversal.

3. **Adaptive ef expansion**: When `filtered_skips > ef_search`, the effective
   search frontier is larger than requested, compensating for filtered-out nodes.

### 4.4 Strategy Selection

```
selectivity = estimate_selectivity(filter_expr, metaidx_stats)

if selectivity < 0.01:
    strategy = PRE_FILTER
elif selectivity > 0.20:
    strategy = POST_FILTER
else:
    strategy = INTRA_FILTER
```

Selectivity estimation uses statistics stored in the METAIDX_SEG header:

- **Inverted index**: `posting_list_length / total_vectors` per term
- **Bitmap index**: `popcount(bitmap) / total_vectors` per enum value
- **Range tree**: count of values in range / total_vectors

For compound filters (AND/OR), selectivity is estimated under an independence
assumption: `P(A AND B) = P(A) * P(B)`, `P(A OR B) = P(A) + P(B) - P(A) * P(B)`.

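The independence-assumption arithmetic and the Section 4.4 thresholds can be sketched together as follows. The `term_selectivity` lookup is a hypothetical stand-in for statistics derived from posting-list lengths and bitmap popcounts; it is not part of the RVF API.

```python
# Sketch of compound-filter selectivity estimation under the independence
# assumption above. `term_selectivity` is an illustrative dict, not a real API.

def estimate(expr, term_selectivity):
    """expr: ('term', name) | ('not', e) | ('and', a, b) | ('or', a, b)."""
    op = expr[0]
    if op == 'term':
        return term_selectivity[expr[1]]
    if op == 'not':
        return 1.0 - estimate(expr[1], term_selectivity)
    a = estimate(expr[1], term_selectivity)
    b = estimate(expr[2], term_selectivity)
    if op == 'and':
        return a * b                      # P(A AND B) = P(A) * P(B)
    return a + b - a * b                  # P(A OR B) = P(A) + P(B) - P(A)*P(B)

def pick_strategy(sel):
    # Thresholds from Section 4.4
    if sel < 0.01:
        return 'PRE_FILTER'
    if sel > 0.20:
        return 'POST_FILTER'
    return 'INTRA_FILTER'

stats = {'organism=E.coli': 0.004, 'position_start>=1000': 0.6}
sel = estimate(('and', ('term', 'organism=E.coli'),
                       ('term', 'position_start>=1000')), stats)
# 0.004 * 0.6 = 0.0024 < 1% selectivity, so pre-filtering is chosen
```

Note how a highly selective term dominates an AND: one rare term is enough to push the whole filter into the pre-filter regime.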
## 5. METAIDX_SEG (Segment Type 0x0D)

METAIDX_SEG stores secondary indexes over metadata fields for fast predicate
evaluation. Each METAIDX_SEG covers one field. The segment type enum value 0x0D
is allocated from the reserved range (see `binary-layout.md` Section 3).

```
METAIDX_SEG Payload:

+------------------------------------------+
| Index Header (64 bytes, padded)          |
|   field_id: u16      | Field being indexed
|   index_type: u8     | 0=inverted, 1=range_tree, 2=bitmap
|   field_type: u8     | Mirrors META_SEG field_type
|   total_vectors: u64 | Vectors covered by this index
|   unique_values: u64 | Cardinality (distinct values)
|   reserved: [u8; 42] |
|   [64B aligned]      |
+------------------------------------------+
| Index Data (type-specific)               |
+------------------------------------------+
```

### 5.1 Inverted Index (index_type = 0)

Best for: string fields with moderate cardinality (100 to 100K distinct values).

```
term_count: u32
For each term (sorted by encoded value):
  term_length: u16
  term_bytes: [u8; term_length]      Encoded value (UTF-8 for strings)
  posting_length: u32                Number of vector IDs
  postings: [varint delta-encoded]   Sorted vector IDs
  [8B aligned after postings]
[64B aligned]
```

Posting lists use varint delta encoding identical to the ID encoding in VEC_SEG
(see `binary-layout.md` Section 5). Restart points every 128 entries enable
binary search within a posting list for intersection operations.

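The delta-plus-varint scheme can be sketched as follows. This is an LEB128-style illustration with restart points omitted; the authoritative bit layout is the one defined in `binary-layout.md` Section 5 and may differ in detail.

```python
# Minimal sketch of varint delta coding for sorted posting lists: store the
# gap to the previous ID, 7 bits per byte, high bit set on continuation bytes.

def encode_postings(ids):
    out, prev = bytearray(), 0
    for i in ids:                         # ids must be sorted ascending
        delta = i - prev
        prev = i
        while delta >= 0x80:
            out.append((delta & 0x7F) | 0x80)
            delta >>= 7
        out.append(delta)
    return bytes(out)

def decode_postings(buf):
    ids, acc, shift, prev = [], 0, 0, 0
    for b in buf:
        acc |= (b & 0x7F) << shift
        if b & 0x80:
            shift += 7                    # continuation byte
        else:
            prev += acc                   # terminal byte: emit accumulated delta
            ids.append(prev)
            acc, shift = 0, 0
    return ids

ids = [3, 10, 1000, 1001, 70000]
assert decode_postings(encode_postings(ids)) == ids
```

Because gaps in dense posting lists are small, most deltas fit in one byte, which is where the compression comes from.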
### 5.2 Range Tree (index_type = 1)

Best for: numeric fields requiring range queries (u32, u64, f32).

```
page_size: u32      Fixed 4096 bytes (4 KB, one disk page)
page_count: u32
root_page: u32      Page index of B+ tree root
tree_height: u8
reserved: [u8; 47]
[64B aligned]

Internal Page (4096 bytes):
  page_type: u8 (0 = internal)
  key_count: u16
  keys: [field_type; key_count]    Separator keys
  children: [u32; key_count + 1]   Child page indices
  [zero-padded to 4096]

Leaf Page (4096 bytes):
  page_type: u8 (1 = leaf)
  entry_count: u16
  prev_leaf: u32                   Linked-list pointer for range scan
  next_leaf: u32
  entries:
    For each entry:
      value: field_type            The metadata value
      vector_id: u64               Associated vector ID
  [zero-padded to 4096]
```

Leaf pages form a doubly-linked list for efficient range scans. A range query
`position_start >= 1000 AND position_start <= 5000` descends the tree to find
the first leaf with value >= 1000, then scans forward until value > 5000.

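The scan phase can be sketched with leaf pages modeled as in-memory dicts. The tree descent to the first qualifying leaf is elided and replaced by a linear walk; `range_scan` and its leaf model are illustrative only, not part of the on-disk format.

```python
# Leaf-chain range scan sketch: walk entries in key order, collect vector IDs
# with lo <= value <= hi, and stop at the first value past the upper bound
# (possible because leaves and their entries are sorted).

def range_scan(leaves, lo, hi):
    """leaves: list of {'entries': [(value, vector_id), ...]} in key order,
    standing in for the doubly-linked 4096-byte leaf pages."""
    out = []
    for leaf in leaves:                   # descent elided: linear walk here
        for value, vid in leaf['entries']:
            if value > hi:
                return out                # past the range: terminate the scan
            if value >= lo:
                out.append(vid)
    return out

leaves = [{'entries': [(500, 1), (900, 2)]},
          {'entries': [(1000, 3), (4000, 4)]},
          {'entries': [(5000, 5), (6000, 6)]}]
assert range_scan(leaves, 1000, 5000) == [3, 4, 5]
```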
### 5.3 Bitmap Index (index_type = 2)

Best for: enum and bool fields with low cardinality (< 64 distinct values).

```
value_count: u8                        Number of distinct enum/bool values
For each value:
  value_label_len: u8
  value_label: [u8; value_label_len]   The enum label or "true"/"false"
  bitmap_format: u8                    0 = raw, 1 = roaring
  bitmap_length: u32                   Byte length of bitmap data
  bitmap_data: [u8; bitmap_length]     Bitmap of matching vector IDs
  [8B aligned]
[64B aligned]
```

**Raw bitmaps** are used when `total_vectors < 8192` (1 KB per bitmap).

**Roaring bitmaps** are used for larger datasets. The roaring format stores
the bitmap as a set of containers (array, bitmap, or run-length) per 64K chunk.
This matches the industry-standard Roaring bitmap serialization (compatible with
the CRoaring / roaring-rs wire format).

Bitmap intersection and union operations map directly to AND/OR filter predicates
using SIMD bitwise operations. For 10M vectors:

```
Raw bitmap:     ~1.2 MB per value (impractical for many values)
Roaring bitmap: 100 KB - 1 MB per value depending on density
AND/OR:         ~0.1 ms per operation (AVX-512 on 1 MB bitmap)
```
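A toy illustration of how bitmap predicates compose: Python integers serve as bitsets here in place of the 64-bit word arrays (or roaring containers) that a real implementation would operate on with SIMD instructions.

```python
# AND/OR over per-value bitmaps, one bit per vector ID. The field values
# ("chromosome", "coding") are illustrative, not from any real profile.

def bitmap_from_ids(ids):
    bm = 0
    for i in ids:
        bm |= 1 << i
    return bm

def ids_from_bitmap(bm):
    out, i = [], 0
    while bm:
        if bm & 1:
            out.append(i)
        bm >>= 1
        i += 1
    return out

chrom_1   = bitmap_from_ids([0, 2, 5, 9])    # chromosome = "1"
is_coding = bitmap_from_ids([2, 3, 5, 8])    # coding = true

assert ids_from_bitmap(chrom_1 & is_coding) == [2, 5]           # AND predicate
assert ids_from_bitmap(chrom_1 | is_coding) == [0, 2, 3, 5, 8, 9]  # OR
```

The same two bitwise operations cover every AND/OR combination of enum predicates, which is why low-cardinality fields get bitmap indexes.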

## 6. Level 1 Manifest Addition

### Tag 0x000F: METADATA_INDEX_DIR

A new TLV record in the Level 1 manifest (see `02-manifest-system.md` Section 3)
that maps indexed metadata fields to their METAIDX_SEG segment IDs.

```
Tag:  0x000F
Name: METADATA_INDEX_DIR

Payload:
  entry_count: u16
  For each entry:
    field_id: u16                      Matches META_SEG field_id
    field_name_len: u8
    field_name: [u8; field_name_len]   UTF-8 field name for debugging
    index_seg_id: u64                  Segment ID of METAIDX_SEG
    index_type: u8                     0=inverted, 1=range_tree, 2=bitmap
    stats:
      total_vectors: u64
      unique_values: u64
      min_posting_len: u32             Smallest posting list size
      max_posting_len: u32             Largest posting list size
```

This allows the query planner to estimate selectivity without reading the
METAIDX_SEG segments themselves. The `min_posting_len` and `max_posting_len`
fields provide bounds for cardinality estimation.

### Updated Record Types Table

```
Tag     Name                 Description
---     ----                 -----------
0x0001  SEGMENT_DIR          Array of segment directory entries
0x0002  TEMP_TIER_MAP        Temperature tier assignments per block
...
0x000D  KEY_DIRECTORY        Encryption key references
0x000E  (reserved)
0x000F  METADATA_INDEX_DIR   Metadata field -> METAIDX_SEG mapping
```

## 7. Performance Analysis

### 7.1 Filter Strategy vs Selectivity vs Recall

| Selectivity | Strategy | Recall@10 | Latency (10M vectors) | Notes |
|-------------|----------|-----------|-----------------------|-------|
| 0.001% (100 matches) | Pre-filter | 1.00 | 0.02 ms | Flat scan on 100 candidates |
| 0.01% (1K matches) | Pre-filter | 0.99 | 0.08 ms | Flat scan on 1K candidates |
| 0.1% (10K matches) | Pre-filter | 0.98 | 0.5 ms | Mini-HNSW on 10K candidates |
| 1% (100K matches) | Intra-filter | 0.96 | 0.12 ms | ~10% node skip overhead |
| 5% (500K matches) | Intra-filter | 0.95 | 0.08 ms | ~5% node skip overhead |
| 10% (1M matches) | Intra-filter | 0.94 | 0.06 ms | Minimal skip overhead |
| 20% (2M matches) | Post-filter | 0.95 | 0.10 ms | 5x over-retrieval |
| 50% (5M matches) | Post-filter | 0.97 | 0.06 ms | 2x over-retrieval |
| 100% (no filter) | None | 0.98 | 0.04 ms | Baseline unfiltered |

### 7.2 Memory Overhead of Metadata Indexes

For 10M vectors with the RVDNA profile (5 indexed fields):

| Field | Type | Cardinality | Index Type | Size |
|-------|------|-------------|------------|------|
| organism | string | ~50K | Inverted | ~80 MB |
| gene_id | string | ~500K | Inverted | ~120 MB |
| chromosome | string | ~25 | Bitmap (roaring) | ~12 MB |
| position_start | u64 | ~10M | Range tree | ~160 MB |
| position_end | u64 | ~10M | Range tree | ~160 MB |
| **Total** | | | | **~532 MB** |

As a fraction of vector data (10M * 384 dim * fp16 = 7.2 GiB): **~7.4% overhead**.

For the RVText profile (2 indexed fields, typically lower cardinality):

| Field | Type | Cardinality | Index Type | Size |
|-------|------|-------------|------------|------|
| source_url | string | ~100K | Inverted | ~90 MB |
| language | string | ~50 | Bitmap (roaring) | ~8 MB |
| **Total** | | | | **~98 MB** |

Overhead: **~1.4%** of vector data.

### 7.3 Query Latency Breakdown (Filtered Intra-Search)

```
Phase                        Time       Notes
-----                        ----       -----
Parse filter expression      0.5 us     Stack-based, no allocation
Estimate selectivity         1.0 us     Read manifest stats
Load METAIDX_SEG (if cold)   50-200 us  First query only; cached after
HNSW traversal (150 steps)   45 us      Baseline unfiltered
+ filter eval per node       +12 us     ~80 ns per eval * 150 nodes
+ skip expansion             +8 us      ~20% more nodes visited at 5% sel.
Top-K collection             10 us      Heap operations
                             --------
Total (warm cache)           ~76 us
Total (cold start)           ~276 us
```

## 8. Integration with Temperature Tiering

Metadata follows the same temperature model as vector data (see
`03-temperature-tiering.md`), but with its own tier assignments.

### 8.1 Hot Metadata

Indexed fields for hot-tier vectors are kept resident in memory:

- **Bitmap indexes** for low-cardinality fields (enum, bool) are always hot.
  Total size is bounded: `cardinality * ceil(hot_vectors / 8)` bytes. For 100K
  hot vectors and 25 enum values: 25 * 12.5 KB ≈ 312 KB.

- **Inverted index posting lists** are cached using an LRU policy keyed by
  (field_id, term). Frequently queried terms (e.g., `language = "en"`) remain
  resident.

- **Range tree pages** follow the standard B+ tree buffer pool model. Hot pages
  (root + first two levels) are pinned. Leaf pages are demand-paged.

### 8.2 Cold Metadata

Cold metadata covers vectors that are rarely accessed:

- META_SEG data for cold vectors is compressed with ZSTD (level 9+) and stored
  in cold-tier segments.
- METAIDX_SEG posting lists for cold vectors are not loaded until a query
  specifically requests them.
- When a filter matches only cold vectors (detected via the temperature tier
  map), the runtime issues a warning: filtered search on cold data may incur
  decompression latency of 10-100 ms.

### 8.3 Compaction Coordination

When temperature-aware compaction reorganizes vector segments (see
`03-temperature-tiering.md` Section 4), metadata must follow:

```
1. Identify vectors moving between tiers
2. Rewrite META_SEG for affected vector ID ranges
3. Rebuild METAIDX_SEG posting lists (vector IDs may be renumbered during
   compaction if the COMPACTION_RENUMBER flag is set)
4. Update METADATA_INDEX_DIR in the new manifest
5. Tombstone old META_SEG and METAIDX_SEG segments
```

Metadata compaction piggybacks on vector compaction -- it never triggers
independently. This ensures metadata and vector segments remain in consistent
temperature tiers.

### 8.4 Metadata-Aware Promotion

When a filter query frequently accesses metadata for warm-tier vectors, those
metadata segments are candidates for promotion to hot tier. The access sketch
(SKETCH_SEG) tracks metadata segment accesses alongside vector accesses:

```
sketch_key = (META_SEG_ID << 32) | block_id
```

This reuses the existing sketch infrastructure without modification.

## 9. Wire Protocol: Filtered Query Message

For completeness, the filter expression is carried in the query message as a
tagged field. The query wire format is outside the scope of the storage spec,
but the filter payload is defined here for interoperability.

```
Query Message Filter Field:
  tag: u16                          (0x0040 = FILTER)
  length: u32
  filter_version: u8                (1)
  filter_payload: [u8; length - 1]  Binary filter expression (Section 3.2)
```

Implementations that do not support filtered search must ignore tag 0x0040 and
return unfiltered results. This preserves backward compatibility.

## 10. Implementation Notes

### 10.1 Index Selection Heuristics

When building indexes for a new META_SEG field, implementations should select
the index type automatically:

```
if field_type in (enum, bool) and cardinality < 64:
    index_type = BITMAP
elif field_type in (u32, u64, f32):
    index_type = RANGE_TREE
else:
    index_type = INVERTED
```

Fields without the `"indexed": true` property in the profile schema must not
have METAIDX_SEG segments built. They are stored in META_SEG for retrieval
only (the STORED flag).

### 10.2 Posting List Intersection

For AND filters on multiple indexed fields, posting list intersection is
performed using a merge-based algorithm on sorted, delta-decoded posting lists:

```
Sorted Intersection (two-pointer merge):
  Time: O(min(|A|, |B|)) with skip-ahead via restart points
  Practical: ~100 ns per 1000 common elements (SIMD comparison)
```

For OR filters, posting list union uses a similar merge with deduplication.

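The two-pointer merge can be sketched as follows; the restart-point skip-ahead mentioned above is omitted for brevity, so this version is O(|A| + |B|) in the worst case.

```python
# Two-pointer sorted intersection over delta-decoded posting lists.
# Both inputs must be sorted ascending.

def intersect(a, b):
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1                       # advance the pointer behind
        else:
            j += 1
    return out

assert intersect([1, 4, 7, 9, 12], [2, 4, 9, 15]) == [4, 9]
```

With restart points every 128 entries, the pointer on the longer list can binary-search ahead instead of stepping, which is where the O(min(|A|, |B|)) behavior quoted above comes from.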
### 10.3 Null Handling

- `FIELD_REF` for a null value pushes a sentinel NULL onto the stack
- `CMP_EQ NULL` returns true only for null values
- `CMP_NE NULL` returns true for all non-null values
- All other comparisons against NULL return false (SQL-style three-valued logic)
- `IN_SET` never matches NULL unless NULL is explicitly in the set

474
vendor/ruvector/docs/research/rvf/spec/09-concurrency-versioning.md
vendored
Normal file
@@ -0,0 +1,474 @@
# RVF Concurrency, Versioning, and Space Reclamation

## 1. Single-Writer / Multi-Reader Model

RVF uses a **single-writer, multi-reader** concurrency model. At most one process
may append segments to an RVF file at any time. Any number of readers may operate
concurrently with each other and with the writer. This model is enforced by an
advisory lock file, not by OS-level mandatory locking.

| Concern | Advisory Lock | Mandatory Lock (flock/fcntl) |
|---------|---------------|------------------------------|
| NFS compatibility | Works (lock file is a regular file) | Broken on many NFS configs |
| Crash recovery | Stale lock detectable by PID check | Kernel auto-releases, but only locally |
| Cross-language | Any language can create a file | Requires OS-specific syscalls |
| Visibility | Lock state inspectable by humans | Opaque kernel state |
| Multi-file mode | One lock covers all shards | Would need per-shard locks |

## 2. Writer Lock File

The writer lock is a file named `<basename>.rvf.lock` in the same directory as the
RVF file. For example, `data.rvf` uses `data.rvf.lock`.

### Binary Layout

```
Offset  Size  Field         Description
------  ----  -----         -----------
0x00    4     magic         0x52564C46 ("RVLF" in ASCII)
0x04    4     pid           Writer process ID (u32)
0x08    64    hostname      Null-terminated hostname (max 63 chars + null)
0x48    8     timestamp_ns  Lock acquisition time (nanosecond UNIX timestamp)
0x50    16    writer_id     Random UUID (128-bit, written as raw bytes)
0x60    4     lock_version  Lock protocol version (currently 1)
0x64    4     checksum      CRC32C of bytes 0x00-0x63
```

**Total**: 104 bytes.

### Lock Acquisition Protocol

```
1. Construct lock file content (magic, PID, hostname, timestamp, random UUID)
2. Compute CRC32C over bytes 0x00-0x63, store at 0x64
3. Attempt open("<basename>.rvf.lock", O_CREAT | O_EXCL | O_WRONLY)
4. If open succeeds:
   a. Write 104 bytes
   b. fsync
   c. Lock acquired — proceed with writes
5. If open fails (EEXIST):
   a. Read existing lock file
   b. Validate magic and checksum
   c. If invalid: delete stale lock, retry from step 3
   d. If valid: run stale lock detection (see below)
   e. If stale: delete lock, retry from step 3
   f. If not stale: lock acquisition fails — another writer is active
```

The `O_CREAT | O_EXCL` combination is atomic on POSIX filesystems, preventing
two processes from simultaneously creating the lock.

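Steps 1-2 (packing the 104-byte lock content) can be sketched as below. The spec chunk does not pin integer byte order, so little-endian is an assumption here, and the magic is written as the literal bytes "RVLF". The bitwise CRC-32C is slow but dependency-free.

```python
# Build the 104-byte writer lock file content from the Binary Layout table.
import os
import socket
import struct
import time
import uuid

def crc32c(data):
    """Bitwise CRC-32C (Castagnoli polynomial, reflected 0x82F63B78)."""
    crc = 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = (crc >> 1) ^ (0x82F63B78 if crc & 1 else 0)
    return crc ^ 0xFFFFFFFF

def build_lock_bytes():
    body = b'RVLF'                                               # magic 0x52564C46
    body += struct.pack('<I', os.getpid())                       # pid
    body += socket.gethostname().encode()[:63].ljust(64, b'\0')  # hostname + nulls
    body += struct.pack('<Q', time.time_ns())                    # timestamp_ns
    body += uuid.uuid4().bytes                                   # writer_id (raw UUID)
    body += struct.pack('<I', 1)                                 # lock_version
    # Checksum covers bytes 0x00-0x63, stored at 0x64
    return body + struct.pack('<I', crc32c(body))

lock = build_lock_bytes()
assert len(lock) == 104
```

Step 3 then becomes `os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)` followed by writing these bytes and calling `os.fsync`.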
### Stale Lock Detection

A lock is considered stale when **both** of the following are true:

1. **PID is dead**: `kill(pid, 0)` returns `ESRCH` (process does not exist), OR
   the hostname does not match the current host (remote crash)
2. **Age exceeds threshold**: `now_ns - timestamp_ns > 30_000_000_000` (30 seconds)

The age check prevents a race where a PID is recycled by the OS. A lock younger
than 30 seconds is never considered stale, even if the PID appears dead, because
PID reuse on modern systems can occur within milliseconds.

If the hostname differs from the current host, the PID check is not meaningful.
In this case, only the age threshold applies. Implementations SHOULD use a longer
threshold (300 seconds) for cross-host lock recovery to account for clock skew.

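The two-condition staleness test can be sketched as follows. Field offsets match the Section 2 layout; little-endian integers are again an assumption of this sketch.

```python
# Staleness check: PID dead (or remote host) AND age over threshold.
import os
import socket
import struct
import time

def is_stale(lock_bytes, threshold_ns=30_000_000_000):
    pid = struct.unpack_from('<I', lock_bytes, 0x04)[0]
    hostname = lock_bytes[0x08:0x48].split(b'\0', 1)[0].decode()
    ts_ns = struct.unpack_from('<Q', lock_bytes, 0x48)[0]
    age_ok = time.time_ns() - ts_ns > threshold_ns
    if hostname != socket.gethostname()[:63]:
        # Cross-host: the PID check is meaningless; only the age threshold
        # applies (callers SHOULD pass the longer 300 s threshold here).
        return age_ok
    try:
        os.kill(pid, 0)                  # signal 0: existence check only
        pid_dead = False
    except ProcessLookupError:
        pid_dead = True
    except PermissionError:
        pid_dead = False                 # process exists under another user
    return pid_dead and age_ok

```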
### Lock Release Protocol

```
1. fsync all pending data and manifest segments
2. Verify the lock file still contains our writer_id (re-read and compare)
3. If writer_id matches: unlink("<basename>.rvf.lock")
4. If writer_id does not match: abort — another process stole the lock
```

Step 2 prevents a writer from deleting a lock that was legitimately taken over
after a stale lock recovery by another process.

If a writer crashes without releasing the lock, the lock file persists on disk.
The next writer detects the orphan via stale lock detection and reclaims it.
No data corruption occurs because the append-only segment model guarantees that
partial writes are detectable: a segment with a bad content hash or a truncated
manifest is simply ignored.

## 3. Reader-Writer Coordination

Readers and writers operate independently. The append-only architecture ensures
they never conflict.

### Reader Protocol

```
1. Open file (read-only, no lock required)
2. Read Level 0 root manifest (last 4096 bytes)
3. Parse hotset pointers and Level 1 offset
4. This manifest snapshot defines the reader's view of the file
5. All queries within this session use the snapshot
6. To see new data: re-read Level 0 (explicit refresh)
```

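Steps 1-2 can be sketched as below: open read-only without any lock and snapshot the file's last 4096 bytes. The root magic `0x52564D30` comes from the root-manifest rules in Section 4; decoding it little-endian is an assumption, and parsing beyond the magic is out of scope here.

```python
# Reader-side snapshot: the last 4096 bytes of the file are the Level 0
# root manifest, and all queries in the session use this immutable copy.
import os
import struct

def read_root_manifest(path):
    with open(path, 'rb') as f:
        f.seek(-4096, os.SEEK_END)       # root manifest occupies the tail
        root = f.read(4096)
    magic = struct.unpack_from('<I', root, 0)[0]
    if magic != 0x52564D30:
        raise ValueError('not an RVF root manifest')
    return root
```

Refreshing (step 6) is simply calling this again; nothing the writer appends can invalidate an already-read snapshot.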
### Writer Protocol

```
1. Acquire lock (Section 2)
2. Read current manifest to learn segment directory state
3. Append new segments (VEC_SEG, INDEX_SEG, etc.)
4. Append new MANIFEST_SEG referencing all live segments
5. fsync
6. Release lock (Section 2)
```

### Concurrent Timeline

```
Time  Writer                   Reader A            Reader B
----  ------                   --------            --------
t=0   Acquires lock
t=1   Appends VEC_SEG_4        Opens file
t=2   Appends VEC_SEG_5                            Opens file, reads manifest M3
t=3   Appends MANIFEST_SEG M4  Reads manifest M3   Queries (sees M3)
t=4   fsync, releases lock     Queries (sees M3)   Queries (sees M3)
t=5                            Queries (sees M3)   Refreshes -> M4
t=6                            Refreshes -> M4     Queries (sees M4)
```

Reader A opened during the write but read manifest M3 (already stable) and never
sees partially written segments. Reader B sees M3 until explicit refresh. Neither
reader is blocked; the writer is never blocked by readers.

### Snapshot Isolation Guarantees

A reader holding a manifest snapshot is guaranteed:

1. All referenced segments are fully written and fsynced
2. Segment content hashes match (the manifest would not reference broken segments)
3. The snapshot is internally consistent (no partial epoch states)
4. The snapshot remains valid for the lifetime of the open file descriptor, even
   if the file is compacted and replaced (the old inode persists until close)

## 4. Format Versioning

RVF uses explicit version fields at every structural level. The versioning rules
are designed for forward compatibility — older readers can safely process files
produced by newer writers, with graceful degradation.

### Segment Version Compatibility

The segment header `version` field (offset 0x04, currently `1`) governs
segment-level compatibility.

| Rule | Description |
|------|-------------|
| S1 | A v1 reader MUST successfully process all v1 segments |
| S2 | A v1 reader MUST skip segments with version > 1 |
| S3 | A v1 reader MUST log a warning when skipping unknown versions |
| S4 | A v1 reader MUST NOT reject a file because it contains unknown-version segments |
| S5 | A v2+ writer MUST write a root manifest readable by v1 readers (if the root manifest format allows it) |
| S6 | A v2+ writer MAY write segments with version > 1 |
| S7 | Readers MUST use `payload_length` from the segment header to skip unknown segments |

Skipping works because the segment header layout is stable: magic, version,
seg_type, and payload_length occupy fixed offsets. A reader skips unknown
segments by seeking past `64 + payload_length` bytes (header + payload).

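The skip rules S2/S7 (and T1/T3 below) reduce to a simple walk. The offsets of `version` (0x04) and `seg_type` (0x05) are fixed by this section; the position and width of `payload_length` are not stated in this chunk, so a little-endian u64 at header offset 0x08 is an assumption of this sketch.

```python
# Segment walk: yield recognized segments, silently stride over everything
# else using header + payload length (never reject the file).
import struct

HEADER_SIZE = 64
KNOWN_TYPES = set(range(0x01, 0x0D))     # currently recognized: 0x01-0x0C

def walk_segments(buf):
    off = 0
    while off + HEADER_SIZE <= len(buf):
        version = buf[off + 0x04]
        seg_type = buf[off + 0x05]
        # Assumption: payload_length as u64 at offset 0x08
        payload_length = struct.unpack_from('<Q', buf, off + 0x08)[0]
        if version == 1 and seg_type in KNOWN_TYPES:
            yield (off, version, seg_type)
        # else: rules S3/T4 say log at debug level and move on
        off += HEADER_SIZE + payload_length   # S7/T3: skip header + payload
```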
### Unknown Segment Types

The segment type enum (offset 0x05) may be extended in future versions.

| Rule | Description |
|------|-------------|
| T1 | A reader MUST skip segment types outside the recognized range (currently 0x01-0x0C) |
| T2 | A reader MUST NOT reject a file because of unknown segment types |
| T3 | A reader MUST use the header's `payload_length` to skip the unknown segment |
| T4 | A reader SHOULD log unknown types at diagnostic/debug level |
| T5 | Types 0x00 and 0xF0-0xFF remain reserved (see spec 01, Section 3) |

### Level 1 TLV Forward Compatibility

Level 1 manifest records use tag-length-value encoding. New tags may be added
in any version.

| Rule | Description |
|------|-------------|
| L1 | A reader MUST skip TLV records with unknown tags |
| L2 | A reader MUST use the record's `length` field (4 bytes at tag offset +2) to skip |
| L3 | A writer MUST NOT change the semantics of an existing tag |
| L4 | A writer MUST NOT reuse a tag value for a different purpose |
| L5 | New tags MUST be assigned sequentially from 0x000E onward |

### Root Manifest Compatibility

The root manifest (Level 0) has the strictest compatibility requirements because
it is the entry point for all readers.

| Rule | Description |
|------|-------------|
| R1 | The magic `0x52564D30` at offset 0x000 is frozen forever |
| R2 | The layout of bytes 0x000-0x007 (magic + version + flags) is frozen forever |
| R3 | New fields may be added to reserved space at offsets 0xF00-0xFFB |
| R4 | Readers MUST ignore non-zero bytes in reserved space they do not understand |
| R5 | The root checksum at 0xFFC always covers bytes 0x000-0xFFB |
| R6 | A v2+ writer extending reserved space MUST ensure the checksum remains valid |

There is no explicit version negotiation. Compatibility is achieved through the
skip rules above. A reader processes what it understands and skips what it does
not. This avoids capability exchange, making RVF suitable for offline and
archival use cases.

## 5. Variable Dimension Support

The root manifest declares a `dimension` field (offset 0x020, u16) and each
VEC_SEG block declares its own `dim` field (block header offset 0x08, u16).
These may differ.

### Dimension Rules

| Rule | Description |
|------|-------------|
| D1 | The root manifest `dimension` is the **primary dimension** (most common in the file) |
| D2 | An RVF file MAY contain VEC_SEG blocks with dimensions different from the primary |
| D3 | Each VEC_SEG block's `dim` field is authoritative for the vectors in that block |
| D4 | The HNSW index (INDEX_SEG) covers only vectors matching the primary dimension |
| D5 | Vectors with non-primary dimensions are searchable via flat scan or a separate index |
| D6 | A PROFILE_SEG may declare multiple expected dimensions |

### Dimension Catalog (Level 1 Record)

A new Level 1 TLV record (tag `0x0010`, DIMENSION_CATALOG) enables readers to
discover all dimensions present without scanning every VEC_SEG.

Record layout:

```
Offset  Size  Field        Description
------  ----  -----        -----------
0x00    2     entry_count  Number of dimension entries
0x02    2     reserved     Must be zero
```

Followed by `entry_count` entries of:

```
Offset  Size  Field             Description
------  ----  -----             -----------
0x00    2     dimension         Vector dimensionality
0x02    1     dtype             Data type enum for these vectors
0x03    1     flags             0x01 = primary, 0x02 = has_index
0x04    4     vector_count      Number of vectors with this dimension
0x08    8     index_seg_offset  Offset to dedicated index (0 if none)
```

**Entry size**: 16 bytes.

Example for an RVDNA profile file:

```
DIMENSION_CATALOG:
  entry_count: 3
  [0] dim=64,   dtype=f16, flags=0x03 (primary, has_index), count=10000000, index=0x1A00000
  [1] dim=384,  dtype=f16, flags=0x02 (has_index),          count=500000,   index=0x3F00000
  [2] dim=4096, dtype=f32, flags=0x00 (flat scan only),     count=10000,    index=0
```

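Parsing the fixed 16-byte entries is a straight `struct` walk; little-endian byte order is an assumption here, as this chunk of the spec does not state it.

```python
# Decode a DIMENSION_CATALOG payload: 4-byte count header followed by
# entry_count * 16-byte entries, per the record layout above.
import struct

def parse_catalog(payload):
    entry_count, _reserved = struct.unpack_from('<HH', payload, 0)
    entries = []
    for n in range(entry_count):
        dim, dtype, flags, count, index_off = struct.unpack_from(
            '<HBBIQ', payload, 4 + 16 * n)          # 2+1+1+4+8 = 16 bytes
        entries.append({'dim': dim,
                        'dtype': dtype,
                        'primary': bool(flags & 0x01),
                        'has_index': bool(flags & 0x02),
                        'vector_count': count,
                        'index_seg_offset': index_off})
    return entries
```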
## 6. Space Reclamation

Over time, tombstoned segments and superseded manifests accumulate dead space.
RVF provides three reclamation strategies, each suited to different operating
conditions.

### Strategy 1: Hole-Punching

On Linux filesystems that support `fallocate(2)` with `FALLOC_FL_PUNCH_HOLE`
(ext4, XFS, btrfs), tombstoned segment ranges can be released back to the
filesystem without rewriting the file.

```
Before:  [VEC_1 live] [VEC_2 dead] [VEC_3 dead] [VEC_4 live] [MANIFEST]
After:   [VEC_1 live] [   hole   ] [   hole   ] [VEC_4 live] [MANIFEST]
```

File size is unchanged but disk blocks are freed. No data movement occurs — each
punch is O(1). Reader mmap still works (holes read as zeros, but the manifest
never references them). Hole-punching is performed only on segments marked as
TOMBSTONE in the current manifest's COMPACTION_STATE record.

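The punch itself is a single `fallocate(2)` call. A Linux-only Python sketch via ctypes (the function name and the boolean fallback are illustrative, not part of the spec); combining `FALLOC_FL_PUNCH_HOLE` with `FALLOC_FL_KEEP_SIZE` gives exactly the unchanged-file-size behavior described above, and the sketch returns `False` where the filesystem lacks support:

```python
import ctypes
import ctypes.util

FALLOC_FL_KEEP_SIZE = 0x01
FALLOC_FL_PUNCH_HOLE = 0x02

_libc = ctypes.CDLL(ctypes.util.find_library("c"), use_errno=True)
_libc.fallocate.argtypes = [ctypes.c_int, ctypes.c_int,
                            ctypes.c_int64, ctypes.c_int64]
_libc.fallocate.restype = ctypes.c_int

def punch_hole(fd: int, offset: int, length: int) -> bool:
    """Release a tombstoned byte range; file size is preserved (KEEP_SIZE)."""
    ret = _libc.fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                          offset, length)
    return ret == 0  # False on e.g. EOPNOTSUPP (filesystem unsupported)
```

A compactor would call this once per tombstoned `(offset, length)` range listed in COMPACTION_STATE.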
### Strategy 2: Copy-Compact

Copy-compact rewrites the file, including only live segments. This is the
universal strategy that works on all filesystems.

```
Protocol:
1. Acquire writer lock
2. Read current manifest to enumerate live segments
3. Create temporary file: <basename>.rvf.compact.tmp
4. Write live segments sequentially to temporary file
5. Write new MANIFEST_SEG with updated offsets
6. fsync temporary file
7. Atomic rename: <basename>.rvf.compact.tmp -> <basename>.rvf
8. Release writer lock
```

The atomic rename (step 7) ensures readers either see the old file or the new
file, never a partial state. Readers that opened the old file before the rename
continue operating on the old inode via their open file descriptor. The old
inode is freed when the last reader closes its descriptor.

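Steps 3-7 can be sketched in a few lines. In this illustrative Python version (lock handling and manifest encoding are elided, and live segments are modeled as plain `(offset, length)` ranges), `os.replace` supplies the atomic rename that makes the protocol crash-safe:

```python
import os

def copy_compact(path: str, live_ranges, new_manifest: bytes) -> None:
    """Rewrite `path` keeping only live (offset, length) ranges, atomically."""
    tmp_path = path + ".compact.tmp"
    with open(path, "rb") as src, open(tmp_path, "wb") as dst:
        for offset, length in live_ranges:  # live segments only, in order
            src.seek(offset)
            dst.write(src.read(length))
        dst.write(new_manifest)             # MANIFEST_SEG with updated offsets
        dst.flush()
        os.fsync(dst.fileno())              # durable before the rename
    os.replace(tmp_path, path)              # atomic: readers see old or new
```

A crash before `os.replace` leaves the original file untouched; the orphaned `.compact.tmp` is removed on the next startup.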
### Strategy 3: Shard Rewrite (Multi-File Mode)

In multi-file mode, individual shard files can be rewritten independently:

```
Protocol:
1. Acquire writer lock
2. Read shard reference from Level 1 SHARD_REFS record
3. Write new shard: <basename>.rvf.cold.<N>.compact.tmp
4. fsync new shard
5. Update main file manifest with new shard reference
6. fsync main file
7. Atomic rename new shard over old shard
8. Release writer lock
```

The old shard is safe to delete after all readers close their descriptors.
Implementations MAY defer deletion using a grace period (default: 60 seconds).

## 7. Space Reclamation Triggers

Reclamation is not performed on every write. Implementations SHOULD evaluate
triggers after each manifest write and act when thresholds are exceeded.

| Trigger | Threshold | Action |
|---------|-----------|--------|
| Dead space ratio | > 50% of file size | Copy-compact |
| Dead space absolute | > 1 GB | Hole-punch if supported, else copy-compact |
| Tombstone count | > 10,000 JOURNAL_SEG tombstone entries | Consolidate journal segments |
| Time since last compaction | > 7 days | Evaluate dead space ratio, compact if > 25% |

### Dead Space Calculation

Dead space is computed from the manifest's COMPACTION_STATE record:

```
dead_bytes  = sum(payload_length + 64) for each tombstoned segment
total_bytes = file_size
dead_ratio  = dead_bytes / total_bytes
```

The `+ 64` accounts for the segment header.

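The calculation and the trigger table combine into a small pure function. A sketch (names are illustrative; `tombstoned_payload_lengths` stands in for the COMPACTION_STATE record's per-segment lengths):

```python
SEGMENT_HEADER_BYTES = 64  # the "+ 64" segment header

def dead_space(tombstoned_payload_lengths, file_size: int):
    """Return (dead_bytes, dead_ratio) per the Dead Space Calculation."""
    dead_bytes = sum(n + SEGMENT_HEADER_BYTES for n in tombstoned_payload_lengths)
    return dead_bytes, dead_bytes / file_size

def pick_actions(dead_bytes, dead_ratio, tombstone_count,
                 days_since_compact, supports_hole_punch=True):
    """Apply the Section 7 trigger table; returns the scheduled actions."""
    actions = []
    if dead_ratio > 0.50:
        actions.append("copy-compact")
    elif dead_bytes > 1 << 30:  # 1 GB absolute threshold
        actions.append("hole-punch" if supports_hole_punch else "copy-compact")
    if tombstone_count > 10_000:
        actions.append("consolidate-journal")
    if days_since_compact > 7 and dead_ratio > 0.25:
        actions.append("copy-compact")
    return actions
```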
### Trigger Evaluation Protocol

```
1. After writing a new MANIFEST_SEG, compute dead_bytes and dead_ratio
2. If dead_ratio > 0.50: schedule copy-compact
3. Else if dead_bytes > 1 GB:
   a. If fallocate supported: hole-punch tombstoned ranges
   b. Else: schedule copy-compact
4. If tombstone_count > 10,000: consolidate JOURNAL_SEGs
5. If days_since_last_compact > 7 AND dead_ratio > 0.25: schedule copy-compact
```

Scheduled compactions MAY be deferred to a background process or low-activity
period.

## 8. Multi-Process Compaction

Compaction is a write operation and requires the writer lock. Only one process
may compact at a time.

### Background Compaction Process

A dedicated compaction process can run alongside the application:

```
1. Attempt writer lock acquisition
2. If lock acquired:
   a. Read current manifest
   b. Evaluate reclamation triggers
   c. If compaction needed:
      i.   Write WITNESS_SEG with compaction_state = STARTED
      ii.  Perform compaction (copy-compact or hole-punch)
      iii. Write WITNESS_SEG with compaction_state = COMPLETED
      iv.  Write new MANIFEST_SEG
   d. Release lock
3. If lock not acquired: sleep and retry
```

### Crash Safety

Compaction is crash-safe by construction. Copy-compact does not rename until
fsynced — a crash before rename leaves the original file untouched and the
temporary file is cleaned up on next startup. Hole-punch `fallocate` calls are
individually atomic; a crash mid-sequence leaves the manifest consistent because
it references only live segments. Shard rewrite follows the same atomic rename
pattern as copy-compact.

### Compaction Progress and Resumability

For long-running compactions, the writer records progress in WITNESS_SEG segments:

```
WITNESS_SEG compaction payload:
Offset  Size  Field               Description
------  ----  -----               -----------
0x00    4     state               0=STARTED, 1=IN_PROGRESS, 2=COMPLETED, 3=ABORTED
0x04    8     source_manifest_id  Segment ID of manifest being compacted
0x0C    8     last_copied_seg_id  Last segment ID successfully written to new file
0x14    8     bytes_written       Total bytes written to new file so far
0x1C    8     bytes_remaining     Estimated bytes remaining
0x24    16    temp_file_hash      Hash of temporary file at last checkpoint
```

If a compaction process crashes and restarts, it can:

1. Find the latest WITNESS_SEG with `state = IN_PROGRESS`
2. Verify the temporary file exists and matches `temp_file_hash`
3. Resume from `last_copied_seg_id + 1`
4. If verification fails, delete the temporary file and restart compaction

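The payload above is a fixed 52-byte little-endian record, which packs cleanly with `struct`. A round-trip sketch (helper names are illustrative):

```python
import struct

# state, source_manifest_id, last_copied_seg_id, bytes_written,
# bytes_remaining, temp_file_hash — little-endian, no padding, 52 bytes.
WITNESS_COMPACTION = struct.Struct("<IQQQQ16s")
STARTED, IN_PROGRESS, COMPLETED, ABORTED = range(4)

def pack_witness(state, source_manifest_id, last_copied_seg_id,
                 bytes_written, bytes_remaining, temp_file_hash: bytes) -> bytes:
    return WITNESS_COMPACTION.pack(state, source_manifest_id, last_copied_seg_id,
                                   bytes_written, bytes_remaining, temp_file_hash)

def unpack_witness(payload: bytes):
    return WITNESS_COMPACTION.unpack(payload)
```

A resuming compactor would unpack the latest record, check `state == IN_PROGRESS`, and verify `temp_file_hash` before continuing from `last_copied_seg_id + 1`.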
## 9. Crash Recovery Summary

RVF recovers from crashes at any point without external tooling.

| Crash Point | State After Recovery | Action Required |
|-------------|---------------------|-----------------|
| Segment append (before manifest) | Orphan segment at tail | None — manifest does not reference it |
| Manifest write | Partial manifest at tail | Scan backward to previous valid manifest |
| Lock acquisition | Lock file may or may not exist | Stale lock detection resolves it |
| Lock release | Lock file persists | Stale lock detection resolves it |
| Copy-compact (before rename) | Temporary file on disk | Delete `*.compact.tmp` on startup |
| Copy-compact (during rename) | Atomic — old or new | No action needed |
| Hole-punch | Partial holes punched | No action — manifest is consistent |
| Shard rewrite | Temporary shard on disk | Delete `*.compact.tmp` on startup |

### Startup Recovery Protocol

On startup, before acquiring a write lock, a writer SHOULD:

```
1. Delete any <basename>.rvf.compact.tmp files (orphaned compaction)
2. Delete any <basename>.rvf.cold.*.compact.tmp files (orphaned shard compaction)
3. Validate the lock file (if present) for staleness
4. Open the RVF file and locate the latest valid manifest
5. If the tail contains a partial segment (magic present, bad hash):
   a. Log a warning with the partial segment's offset and type
   b. The partial segment is outside the manifest — it is harmless
   c. The next append will overwrite it (or it will be compacted away)
```

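Steps 1-2 reduce to two glob patterns against the base path. An illustrative Python sketch (function name is not from the spec):

```python
import glob
import os

def clean_orphaned_tmp(base: str) -> list:
    """Remove orphaned compaction temporaries for <base>.rvf (steps 1-2)."""
    removed = []
    for pattern in (base + ".rvf.compact.tmp",
                    base + ".rvf.cold.*.compact.tmp"):
        for path in glob.glob(pattern):
            os.remove(path)
            removed.append(path)
    return removed
```

The live `.rvf` file and its cold shards never match either pattern, so the cleanup is safe to run unconditionally before lock acquisition.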
## 10. Invariants

The following invariants extend those in spec 01 (Section 7):

1. At most one writer lock exists per RVF file at any time
2. A lock file with valid magic and checksum represents an active or stale lock
3. Readers never require a lock, regardless of operation
4. A manifest snapshot is immutable for the lifetime of a reader session
5. Compaction never modifies live segments — it creates new ones
6. Hole-punched regions are never referenced by any manifest
7. The root manifest magic and first 8 bytes are frozen across all versions
8. Unknown segment versions and types are skipped, never rejected
9. Unknown TLV tags in Level 1 are skipped, never rejected
10. Each VEC_SEG block's `dim` field is authoritative for that block's vectors

688
vendor/ruvector/docs/research/rvf/spec/10-operations-api.md
vendored
Normal file

# RVF Operations API

## 1. Scope

This document specifies the operational surface of an RVF runtime: error codes
returned by all operations, wire formats for batch queries, batch ingest, and
batch deletes, the network streaming protocol for progressive loading over HTTP
and TCP, and the compaction scheduling policy. It complements the segment model
(spec 01), manifest system (spec 02), and query optimization (spec 06).

All multi-byte integers are little-endian unless otherwise noted. All offsets
within messages are byte offsets from the start of the message payload.

## 2. Error Code Enumeration

Error codes are 16-bit unsigned integers. The high byte identifies the error
category; the low byte identifies the specific error within that category.
Implementations must preserve unrecognized codes in responses and must not
treat unknown codes as fatal unless the high byte is `0x01` (format error).

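The category rule is two bit operations. A sketch of the check (helper names are illustrative):

```python
def category(code: int) -> int:
    """High byte of a 16-bit error code identifies the category."""
    return (code >> 8) & 0xFF

def unknown_code_is_fatal(code: int) -> bool:
    """Per the rule above: unknown codes are fatal only in category 0x01."""
    return category(code) == 0x01  # format errors
```

A forward-compatible client can route any future `0x02xx` code to its query-error handling without treating it as fatal.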
### Category 0x00: Success

```
Code    Name                  Description
------  --------------------  ----------------------------------------
0x0000  OK                    Operation succeeded
0x0001  OK_PARTIAL            Partial success (some items failed)
```

`OK_PARTIAL` is returned when a batch operation succeeds for some items and
fails for others. The response body contains per-item status details.

### Category 0x01: Format Errors

```
Code    Name                  Description
------  --------------------  ----------------------------------------
0x0100  INVALID_MAGIC         Segment magic mismatch (expected 0x52564653)
0x0101  INVALID_VERSION       Unsupported segment version
0x0102  INVALID_CHECKSUM      Segment hash verification failed
0x0103  INVALID_SIGNATURE     Cryptographic signature invalid
0x0104  TRUNCATED_SEGMENT     Segment payload shorter than declared length
0x0105  INVALID_MANIFEST      Root manifest validation failed
0x0106  MANIFEST_NOT_FOUND    No valid MANIFEST_SEG in file
0x0107  UNKNOWN_SEGMENT_TYPE  Segment type not recognized (warning, not fatal)
0x0108  ALIGNMENT_ERROR       Data not at expected 64B boundary
```

`UNKNOWN_SEGMENT_TYPE` is advisory. A reader encountering an unknown segment
type should skip it and continue. All other format errors in this category
are fatal for the affected segment.

### Category 0x02: Query Errors

```
Code    Name                  Description
------  --------------------  ----------------------------------------
0x0200  DIMENSION_MISMATCH    Query vector dimension != index dimension
0x0201  EMPTY_INDEX           No index segments available
0x0202  METRIC_UNSUPPORTED    Requested distance metric not available
0x0203  FILTER_PARSE_ERROR    Invalid filter expression
0x0204  K_TOO_LARGE           Requested K exceeds available vectors
0x0205  TIMEOUT               Query exceeded time budget
```

When `K_TOO_LARGE` is returned, the response still contains all available
results. The result count will be less than the requested K.

### Category 0x03: Write Errors

```
Code    Name                  Description
------  --------------------  ----------------------------------------
0x0300  LOCK_HELD             Another writer holds the lock
0x0301  LOCK_STALE            Lock file exists but owner process is dead
0x0302  DISK_FULL             Insufficient space for write
0x0303  FSYNC_FAILED          Durable write failed
0x0304  SEGMENT_TOO_LARGE     Segment exceeds 4 GB limit
0x0305  READ_ONLY             File opened in read-only mode
```

`LOCK_STALE` is informational. The runtime may attempt to break the stale
lock and retry. If recovery succeeds, the original operation proceeds with
an `OK` status.

### Category 0x04: Tile Errors (WASM Microkernel)

```
Code    Name                  Description
------  --------------------  ----------------------------------------
0x0400  TILE_TRAP             WASM trap (OOB, unreachable, stack overflow)
0x0401  TILE_OOM              Tile exceeded scratch memory (64 KB)
0x0402  TILE_TIMEOUT          Tile computation exceeded time budget
0x0403  TILE_INVALID_MSG      Malformed hub-tile message
0x0404  TILE_UNSUPPORTED_OP   Operation not available on this profile
```

All tile errors trigger the fault isolation protocol described in
`microkernel/wasm-runtime.md` section 8. The hub reassigns the tile's
work and optionally restarts the faulted tile.

### Category 0x05: Crypto Errors

```
Code    Name                  Description
------  --------------------  ----------------------------------------
0x0500  KEY_NOT_FOUND         Referenced key_id not in CRYPTO_SEG
0x0501  KEY_EXPIRED           Key past valid_until timestamp
0x0502  DECRYPT_FAILED        Decryption or auth tag verification failed
0x0503  ALGO_UNSUPPORTED      Cryptographic algorithm not implemented
```

Crypto errors are always fatal for the affected segment. An implementation
must not serve data from a segment that fails signature or decryption checks.

## 3. Batch Query API

### Wire Format: Request

Batch queries amortize connection overhead and enable the runtime to
schedule vector block loads across multiple queries simultaneously.

```
Offset  Size  Field              Description
------  ----  -----              ----------------------------------------
0x00    4     query_count        Number of queries in batch (max 1024)
0x04    4     k                  Shared top-K parameter
0x08    1     metric             Distance metric: 0=L2, 1=IP, 2=cosine, 3=hamming
0x09    3     reserved           Must be zero
0x0C    4     ef_search          HNSW ef_search parameter
0x10    4     shared_filter_len  Byte length of shared filter (0 = no filter)
0x14    var   shared_filter      Filter expression (applies to all queries)
var     var   queries[]          Per-query entries (see below)
```

Each query entry:

```
Offset  Size  Field       Description
------  ----  -----       ----------------------------------------
0x00    4     query_id    Client-assigned correlation ID
0x04    2     dim         Vector dimensionality
0x06    1     dtype       Data type: 0=fp32, 1=fp16, 2=i8, 3=binary
0x07    1     flags       Bit 0: has per-query filter
0x08    var   vector      Query vector (dim * sizeof(dtype) bytes)
var     4     filter_len  Byte length of per-query filter (if flags bit 0)
var     var   filter      Per-query filter (overrides shared filter)
```

When both a shared filter and a per-query filter are present, the per-query
filter takes precedence. A per-query filter of zero length inherits the
shared filter.

### Wire Format: Response

```
Offset  Size  Field        Description
------  ----  -----        ----------------------------------------
0x00    4     query_count  Number of query results
0x04    var   results[]    Per-query result entries
```

Each result entry:

```
Offset  Size  Field         Description
------  ----  -----         ----------------------------------------
0x00    4     query_id      Correlation ID from request
0x04    2     status        Error code (0x0000 = OK)
0x06    2     reserved      Must be zero
0x08    4     result_count  Number of results returned
0x0C    var   results[]     Array of (vector_id: u64, distance: f32) pairs
```

Each result pair is 12 bytes: 8 bytes for the vector ID followed by 4 bytes
for the distance value. Results are sorted by distance ascending (nearest first).

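The fixed 12-byte header and 12-byte pairs decode directly with `struct`. A parsing sketch for one result entry (function name is illustrative):

```python
import struct

RESULT_HEADER = struct.Struct("<IHHI")  # query_id, status, reserved, result_count
RESULT_PAIR = struct.Struct("<Qf")      # vector_id (u64) + distance (f32) = 12 bytes

def parse_result_entry(buf: bytes, offset: int = 0):
    """Decode one result entry; returns (query_id, status, pairs, next_offset)."""
    query_id, status, _reserved, count = RESULT_HEADER.unpack_from(buf, offset)
    offset += RESULT_HEADER.size        # pairs start at entry offset 0x0C
    pairs = []
    for _ in range(count):
        vec_id, dist = RESULT_PAIR.unpack_from(buf, offset)
        pairs.append((vec_id, dist))
        offset += RESULT_PAIR.size
    return query_id, status, pairs, offset
```

Returning the advanced offset lets a caller walk `query_count` variable-length entries back to back.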
### Batch Scheduling

The runtime should process batch queries using the following strategy:

1. Parse all query vectors and load them into memory
2. Identify shared segments across queries (block deduplication)
3. Load each vector block once and evaluate all relevant queries against it
4. Merge per-query top-K heaps independently
5. Return results as soon as each query completes (streaming response)

This amortizes I/O: if N queries touch the same vector block, the block is
read once instead of N times.

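Steps 2-3 can be sketched as a simple inversion of the query-to-block plan. In this illustrative Python version, `blocks_for_query`, `load_block`, and `evaluate` are assumed callbacks, not spec APIs:

```python
from collections import defaultdict

def schedule_batch(queries, blocks_for_query):
    """Invert query -> blocks into block -> interested queries (step 2)."""
    by_block = defaultdict(list)
    for query in queries:
        for block_id in blocks_for_query(query):
            by_block[block_id].append(query)
    return by_block

def run_batch(queries, blocks_for_query, load_block, evaluate):
    """Load each block once and evaluate every interested query (step 3)."""
    loads = 0
    for block_id, interested in schedule_batch(queries, blocks_for_query).items():
        block = load_block(block_id)   # one read regardless of query count
        loads += 1
        for query in interested:
            evaluate(query, block)     # feeds the per-query top-K heap
    return loads
```

With N queries all touching one block, `loads` stays at 1 while `evaluate` runs N times, which is exactly the I/O amortization described above.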
## 4. Batch Ingest API

### Wire Format: Request

```
Offset  Size  Field         Description
------  ----  -----         ----------------------------------------
0x00    4     vector_count  Number of vectors to ingest (max 65536)
0x04    2     dim           Vector dimensionality
0x06    1     dtype         Data type: 0=fp32, 1=fp16, 2=i8, 3=binary
0x07    1     flags         Bit 0: metadata_included
0x08    var   vectors[]     Vector entries
var     var   metadata[]    Metadata entries (if flags bit 0)
```

Each vector entry:

```
Offset  Size  Field      Description
------  ----  -----      ----------------------------------------
0x00    8     vector_id  Globally unique vector ID
0x08    var   vector     Vector data (dim * sizeof(dtype) bytes)
```

Each metadata entry (when metadata_included is set):

```
Offset  Size  Field        Description
------  ----  -----        ----------------------------------------
0x00    2     field_count  Number of metadata fields
0x02    var   fields[]     Field entries
```

Each metadata field:

```
Offset  Size  Field       Description
------  ----  -----       ----------------------------------------
0x00    2     field_id    Field identifier (application-defined)
0x02    1     value_type  0=u64, 1=i64, 2=f64, 3=string, 4=bytes
0x03    var   value       Encoded value (u64/i64/f64: 8B; string/bytes: 4B length + data)
```

### Wire Format: Response

```
Offset  Size  Field               Description
------  ----  -----               ----------------------------------------
0x00    4     accepted_count      Number of vectors accepted
0x04    4     rejected_count      Number of vectors rejected
0x08    4     manifest_epoch      Epoch of manifest after commit
0x0C    var   rejected_ids[]      Array of rejected vector IDs (u64 * rejected_count)
var     var   rejected_reasons[]  Array of error codes (u16 * rejected_count)
```

The `manifest_epoch` field is the epoch of the MANIFEST_SEG written after the
ingest is committed. Clients can use this value to confirm that a subsequent
read will include the ingested vectors.

### Ingest Commit Semantics

1. The runtime writes vectors to a new VEC_SEG (append-only)
2. If metadata is included, a META_SEG is appended
3. Both segments are fsynced
4. A new MANIFEST_SEG is written referencing the new segments
5. The manifest is fsynced
6. The response is sent with the new manifest_epoch

Vectors are visible to queries only after step 6 completes.

## 5. Batch Delete API

### Wire Format: Request

```
Offset  Size  Field        Description
------  ----  -----        ----------------------------------------
0x00    1     delete_type  0=by_id, 1=by_range, 2=by_filter
0x01    3     reserved     Must be zero
0x04    var   payload      Type-specific payload (see below)
```

Delete by ID (`delete_type = 0`):

```
Offset  Size  Field  Description
------  ----  -----  ----------------------------------------
0x00    4     count  Number of IDs to delete
0x04    var   ids[]  Array of vector IDs (u64 * count)
```

Delete by range (`delete_type = 1`):

```
Offset  Size  Field     Description
------  ----  -----     ----------------------------------------
0x00    8     start_id  Start of range (inclusive)
0x08    8     end_id    End of range (exclusive)
```

Delete by filter (`delete_type = 2`):

```
Offset  Size  Field       Description
------  ----  -----       ----------------------------------------
0x00    4     filter_len  Byte length of filter expression
0x04    var   filter      Filter expression
```

### Wire Format: Response

```
Offset  Size  Field           Description
------  ----  -----           ----------------------------------------
0x00    8     deleted_count   Number of vectors deleted
0x08    2     status          Error code (0x0000 = OK)
0x0A    2     reserved        Must be zero
0x0C    4     manifest_epoch  Epoch of manifest after delete committed
```

### Delete Mechanics

Deletes are logical. The runtime appends a JOURNAL_SEG containing tombstone
entries for the deleted vector IDs. The new MANIFEST_SEG marks affected
VEC_SEGs as partially dead. Physical reclamation happens during compaction.

## 6. Network Streaming Protocol

### 6.1 HTTP Range Requests (Read-Only Access)

RVF's progressive loading model maps naturally to HTTP byte-range requests.
A client can boot from a remote `.rvf` file and become queryable without
downloading the entire file.

**Phase 1: Boot (mandatory)**

```
GET /file.rvf    Range: bytes=-4096
```

Retrieves the last 4 KB of the file. This contains the Level 0 root manifest
(MANIFEST_SEG). The client parses hotset pointers, the segment directory, and
the profile ID.

If the file is smaller than 4 KB, the entire file is returned. If the last
4 KB does not contain a valid MANIFEST_SEG, the client extends the range
backward in 4 KB increments until one is found or 1 MB is scanned (at which
point it returns `MANIFEST_NOT_FOUND`).

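One way to drive the backward scan is with HTTP suffix ranges of growing size. A sketch of the `Range` header sequence (illustrative; after the first `206` response the client also learns the total file size from `Content-Range` and could switch to absolute ranges instead of re-fetching suffixes):

```python
CHUNK = 4096          # 4 KB backward step
SCAN_LIMIT = 1 << 20  # give up after 1 MB (MANIFEST_NOT_FOUND)

def boot_ranges():
    """Yield Range header values: bytes=-4096, bytes=-8192, ... up to 1 MB."""
    span = CHUNK
    while span <= SCAN_LIMIT:
        yield "bytes=-%d" % span
        span += CHUNK
```

The caller stops at the first suffix whose bytes contain a valid MANIFEST_SEG; exhausting the generator corresponds to the `MANIFEST_NOT_FOUND` outcome.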
**Phase 2: Hotset (parallel, mandatory for queries)**

Using offsets from the Level 0 manifest, the client issues up to 5 parallel
range requests:

```
GET /file.rvf    Range: bytes=<entrypoint_offset>-<entrypoint_end>
GET /file.rvf    Range: bytes=<toplayer_offset>-<toplayer_end>
GET /file.rvf    Range: bytes=<centroid_offset>-<centroid_end>
GET /file.rvf    Range: bytes=<quantdict_offset>-<quantdict_end>
GET /file.rvf    Range: bytes=<hotcache_offset>-<hotcache_end>
```

These fetch the HNSW entry point, top-layer graph, routing centroids,
quantization dictionary, and the hot cache (HOT_SEG). After these 5 requests
complete, the system is queryable with recall >= 0.7.

**Phase 3: Level 1 (background)**

```
GET /file.rvf    Range: bytes=<l1_offset>-<l1_end>
```

Fetches the Level 1 manifest containing the full segment directory. This
enables the client to discover all segments and plan on-demand fetches.

**Phase 4: On-demand (per query)**

For queries that require cold data not yet fetched:

```
GET /file.rvf    Range: bytes=<segment_offset>-<segment_end>
```

The client caches fetched segments locally. Repeated queries against the
same data region do not trigger additional requests.

### HTTP Requirements

- Server must support `Accept-Ranges: bytes`
- Server must return `206 Partial Content` for range requests
- Server should support multiple ranges in a single request (`multipart/byteranges`)
- Client should use `If-None-Match` with the file's ETag to detect stale caches

### 6.2 TCP Streaming Protocol (Real-Time Access)

For real-time ingest and low-latency queries, RVF defines a binary TCP
protocol over TLS 1.3.

**Connection Setup**

```
1. Client opens TCP connection to server
2. TLS 1.3 handshake (mandatory, no plaintext mode)
3. Client sends HELLO message with protocol version and capabilities
4. Server responds with HELLO_ACK confirming capabilities
5. Connection is ready for messages
```

**Framing**

All messages are length-prefixed:

```
Offset  Size  Field         Description
------  ----  -----         ----------------------------------------
0x00    4     frame_length  Payload length (big-endian, max 16 MB)
0x04    1     msg_type      Message type (see below)
0x05    3     msg_id        Correlation ID (big-endian, wraps at 2^24)
0x08    var   payload       Message-specific payload
```

Frame length is big-endian (network byte order) for consistency with TLS
framing. The 16 MB maximum prevents a single message from monopolizing the
connection. Payloads larger than 16 MB must be split across multiple messages
using continuation framing (see section 6.4).

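The 8-byte header encodes/decodes in a few lines. A codec sketch (function names are illustrative):

```python
import struct

MAX_FRAME = 16 << 20  # 16 MB payload limit

def encode_frame(msg_type: int, msg_id: int, payload: bytes) -> bytes:
    """frame_length (u32 BE) + msg_type (u8) + msg_id (u24 BE) + payload."""
    if len(payload) > MAX_FRAME:
        raise ValueError("use continuation framing for payloads over 16 MB")
    header = struct.pack(">IB", len(payload), msg_type)
    header += (msg_id & 0xFFFFFF).to_bytes(3, "big")  # wraps at 2^24
    return header + payload

def decode_frame(buf: bytes):
    """Inverse of encode_frame; payload starts at offset 0x08."""
    length, msg_type = struct.unpack_from(">IB", buf, 0)
    msg_id = int.from_bytes(buf[5:8], "big")
    return msg_type, msg_id, buf[8:8 + length]
```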
**Message Types**

```
Client -> Server:
0x01  QUERY         Batch query (payload = Batch Query Request)
0x02  INGEST        Batch ingest (payload = Batch Ingest Request)
0x03  DELETE        Batch delete (payload = Batch Delete Request)
0x04  STATUS        Request server status (no payload)
0x05  SUBSCRIBE     Subscribe to update notifications

Server -> Client:
0x81  QUERY_RESULT  Batch query result
0x82  INGEST_ACK    Batch ingest acknowledgment
0x83  DELETE_ACK    Batch delete acknowledgment
0x84  STATUS_RESP   Server status response
0x85  UPDATE_NOTIFY Push notification of new data
0xFF  ERROR         Error with code and description
```

**ERROR Message Payload**

```
Offset  Size  Field            Description
------  ----  -----            ----------------------------------------
0x00    2     error_code       Error code from section 2
0x02    2     description_len  Byte length of description string
0x04    var   description      UTF-8 error description (human-readable)
```

### 6.3 Streaming Ingest Protocol

The TCP protocol supports continuous ingest where the client streams vectors
without waiting for per-batch acknowledgments.

**Flow**

```
Client                                Server
  |                                     |
  |--- INGEST (batch 0) -------------->|
  |--- INGEST (batch 1) -------------->|  Pipelining: send without waiting
  |--- INGEST (batch 2) -------------->|
  |                                    |  Server writes VEC_SEGs, appends manifest
  |<--- INGEST_ACK (batch 0) ----------|
  |<--- INGEST_ACK (batch 1) ----------|
  |                                    |  Backpressure: server delays ACK
  |--- INGEST (batch 3) -------------->|  Client respects window
  |<--- INGEST_ACK (batch 2) ----------|
  |                                    |
```

**Backpressure**

The server controls ingest rate by delaying INGEST_ACK responses. The client
must limit its in-flight (unacknowledged) ingest messages to a configurable
window size (default: 8 messages). When the window is full, the client must
wait for an ACK before sending the next batch.

The server should send backpressure when:
- Write queue exceeds 80% capacity
- Compaction is falling behind (dead space > 50%)
- Available disk space drops below 10%

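The client-side window is a counting semaphore over unacknowledged batches. A sketch (class and method names are illustrative, not spec API):

```python
import threading

class IngestWindow:
    """Gate sends so at most `size` INGEST batches are unacknowledged."""

    def __init__(self, size: int = 8):  # spec default window: 8 messages
        self._slots = threading.Semaphore(size)

    def before_send(self, blocking: bool = True) -> bool:
        """Call before sending INGEST; blocks (or fails) when window is full."""
        return self._slots.acquire(blocking)

    def on_ack(self) -> None:
        """Call when INGEST_ACK arrives; frees one window slot."""
        self._slots.release()
```

A delayed ACK from the server therefore stalls `before_send` on the ninth batch, which is exactly how backpressure propagates to the client.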
**Commit Semantics**

Each INGEST_ACK contains the `manifest_epoch` after commit. The server
guarantees that all vectors acknowledged with epoch E are visible to any
query that reads the manifest at epoch >= E.

### 6.4 Continuation Framing

For payloads exceeding the 16 MB frame limit:

```
Frame 0:  msg_type = original type,         flags bit 0 = CONTINUATION_START
Frame 1:  msg_type = 0x00 (CONTINUATION),   flags bit 0 = 0
Frame 2:  msg_type = 0x00 (CONTINUATION),   flags bit 0 = 0
Frame N:  msg_type = 0x00 (CONTINUATION),   flags bit 1 = CONTINUATION_END
```

The receiver reassembles the payload from all continuation frames before
processing. The msg_id is shared across all frames of a continuation sequence.

### 6.5 SUBSCRIBE and UPDATE_NOTIFY

The SUBSCRIBE message registers the client for push notifications when new
data is committed:

```
SUBSCRIBE payload:
Offset  Size  Field         Description
------  ----  -----         ----------------------------------------
0x00    4     min_epoch     Only notify for epochs > this value
0x04    1     notify_flags  Bit 0: ingest, Bit 1: delete, Bit 2: compaction
0x05    3     reserved      Must be zero
```

The server sends UPDATE_NOTIFY whenever a new MANIFEST_SEG is committed that
matches the subscription criteria:

```
UPDATE_NOTIFY payload:
Offset  Size  Field           Description
------  ----  -----           ----------------------------------------
0x00    4     epoch           New manifest epoch
0x04    1     event_type      0=ingest, 1=delete, 2=compaction
0x05    3     reserved        Must be zero
0x08    4     affected_count  Number of vectors affected
0x0C    8     new_total       Total vector count after event
```

## 7. Compaction Scheduling Policy

Compaction merges small, overlapping, or partially-dead segments into larger,
sealed segments. Because compaction competes with queries and ingest for I/O
bandwidth, the runtime enforces a scheduling policy.

### 7.1 IO Budget

Compaction must consume at most 30% of available IOPS. The runtime measures
IOPS over a 5-second sliding window and throttles compaction I/O to stay
within budget.

```
available_iops      = measured_iops_capacity (from benchmarking at startup)
compaction_budget   = available_iops * 0.30
compaction_throttle = max(compaction_budget - current_compaction_iops, 0)
```

### 7.2 Priority Ordering

When I/O bandwidth is contended, operations are prioritized:

```
Priority 1 (highest):  Queries (reads from VEC_SEG, INDEX_SEG, HOT_SEG)
Priority 2:            Ingest (writes to VEC_SEG, META_SEG, MANIFEST_SEG)
Priority 3 (lowest):   Compaction (reads + writes of sealed segments)
```

Compaction yields to queries and ingest. If a compaction I/O operation would
cause a query to exceed its time budget, the compaction operation is deferred.

### 7.3 Scheduling Triggers

Compaction runs when all of the following conditions are met:

| Condition | Threshold | Rationale |
|-----------|-----------|-----------|
| Query load | < 50% of capacity | Avoid competing with active queries |
| Dead space ratio | > 20% of total file size | Reclaiming less is not worth the I/O |
| Segment count | > 32 active segments | Many small segments hurt read performance |
| Time since last compaction | > 60 seconds | Prevent compaction storms |

The runtime evaluates these conditions every 10 seconds.

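The trigger check above is a pure conjunction, so it can be sketched directly. The struct and field names below are hypothetical; the spec does not define a signals type:

```rust
/// Inputs to the section 7.3 trigger check (hypothetical names).
struct CompactionSignals {
    query_load_pct: f64, // percent of query capacity in use (0-100)
    dead_bytes: u64,
    total_bytes: u64,
    active_segments: u32,
    secs_since_last_compaction: u64,
}

/// True only when every section 7.3 condition holds.
fn should_compact(s: &CompactionSignals) -> bool {
    let dead_ratio = s.dead_bytes as f64 / s.total_bytes as f64;
    s.query_load_pct < 50.0                  // avoid competing with queries
        && dead_ratio > 0.20                 // enough dead space to be worth it
        && s.active_segments > 32            // many small segments hurt reads
        && s.secs_since_last_compaction > 60 // prevent compaction storms
}
```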
### 7.4 Emergency Compaction

If dead space exceeds 70% of total file size, compaction enters emergency mode:

```
Emergency compaction rules:
1. Compaction preempts ingest (ingest is paused, not rejected)
2. IO budget increases to 60% of available IOPS
3. Compaction runs regardless of query load
4. Ingest resumes after dead space drops below 50%
```

During emergency compaction, the server responds to INGEST messages with
delayed ACKs (backpressure) rather than rejecting them. Queries continue to
be served at highest priority.

### 7.5 Compaction Progress Reporting

The STATUS response includes compaction state:

```
STATUS_RESP compaction fields:
Offset Size  Field               Description
------ ----- ------------------- ----------------------------------------
0x00   1     compaction_state    0=idle, 1=running, 2=emergency
0x01   1     progress_pct        Completion percentage (0-100)
0x02   2     reserved            Must be zero
0x04   8     dead_bytes          Total dead space in bytes
0x0C   8     total_bytes         Total file size in bytes
0x14   4     segments_remaining  Segments left to compact
0x18   4     segments_completed  Segments compacted in current run
0x1C   4     estimated_seconds   Estimated time to completion
0x20   4     io_budget_pct       Current IO budget percentage (30 or 60)
```

### 7.6 Compaction Segment Selection

The runtime selects segments for compaction using a tiered strategy:

```
1. Tombstoned segments:     Always compacted first (reclaim dead space)
2. Small VEC_SEGs:          Segments < 1 MB merged into larger segments
3. High-overlap INDEX_SEGs: Index segments covering the same ID range
4. Cold OVERLAY_SEGs:       Overlay deltas merged into base segments
```

The compaction output is always a sealed segment (SEALED flag set). Sealed
segments are immutable and can be verified independently.

## 8. STATUS Response Format

The STATUS message provides a snapshot of the server state for monitoring
and diagnostics.

```
STATUS_RESP payload:
Offset Size  Field               Description
------ ----- ------------------- ----------------------------------------
0x00   4     protocol_version    Protocol version (currently 1)
0x04   4     manifest_epoch      Current manifest epoch
0x08   8     total_vectors       Total vector count
0x10   8     total_segments      Total segment count
0x18   8     file_size_bytes     Total file size
0x20   4     query_qps           Queries per second (last 5s window)
0x24   4     ingest_vps          Vectors ingested per second (last 5s window)
0x28   24    compaction          Compaction state (see section 7.5)
0x40   1     profile_id          Active hardware profile (0x00-0x03)
0x41   1     health              0=healthy, 1=degraded, 2=read_only
0x42   2     reserved            Must be zero
0x44   4     uptime_seconds      Server uptime
```

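Decoding the first few fixed-offset fields of this payload could look like the sketch below. The `parse_status_prefix` helper is hypothetical; all fields are little-endian per invariant 4:

```rust
/// Decode the first three STATUS_RESP fields (hypothetical helper).
/// Offsets follow the table above; all values are little-endian.
fn parse_status_prefix(buf: &[u8]) -> Option<(u32, u32, u64)> {
    if buf.len() < 0x10 {
        return None; // payload too short for the fixed prefix
    }
    let protocol_version = u32::from_le_bytes(buf[0x00..0x04].try_into().ok()?);
    let manifest_epoch = u32::from_le_bytes(buf[0x04..0x08].try_into().ok()?);
    let total_vectors = u64::from_le_bytes(buf[0x08..0x10].try_into().ok()?);
    Some((protocol_version, manifest_epoch, total_vectors))
}
```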
## 9. Filter Expression Format

Filter expressions used in batch queries and batch deletes share a common
binary encoding:

```
Offset Size  Field              Description
------ ----- ------------------ ----------------------------------------
0x00   1     op                 Operator enum (see below)
0x01   2     field_id           Metadata field to filter on
0x03   1     value_type         Value type (matches metadata field types)
0x04   var   value              Comparison value
var    var   children[]         Sub-expressions (for AND/OR/NOT)
```

Operator enum:

```
0x00  EQ     field == value
0x01  NE     field != value
0x02  LT     field < value
0x03  LE     field <= value
0x04  GT     field > value
0x05  GE     field >= value
0x06  IN     field in [values]
0x07  RANGE  field in [low, high)
0x10  AND    All children must match
0x11  OR     Any child must match
0x12  NOT    Negate single child
```

Filters are evaluated during the query scan phase. Vectors that do not match
the filter are excluded from distance computation entirely (pre-filtering) or
from the result set (post-filtering), depending on the runtime's cost model.

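Serializing a single leaf node per the layout table above can be sketched as follows. The `encode_leaf` helper is hypothetical, and `field_id` is assumed little-endian per the byte-order invariant:

```rust
/// Encode one leaf filter node per the section 9 layout (hypothetical helper).
fn encode_leaf(op: u8, field_id: u16, value_type: u8, value: &[u8]) -> Vec<u8> {
    let mut out = Vec::with_capacity(4 + value.len());
    out.push(op);                                   // 0x00: operator enum
    out.extend_from_slice(&field_id.to_le_bytes()); // 0x01: field_id (LE)
    out.push(value_type);                           // 0x03: value_type
    out.extend_from_slice(value);                   // 0x04: comparison value
    out
}
```

For example, `EQ(field 5, u64 value 7)` encodes the opcode, a two-byte field id, the value-type tag, and the eight value bytes, for twelve bytes total.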
## 10. Invariants

1. Error codes are stable across versions; new codes are additive only
2. Batch operations are atomic per item, not per batch (partial success is valid)
3. TCP connections are always TLS 1.3; plaintext is not permitted
4. Frame length is big-endian; all other multi-byte fields are little-endian
5. HTTP progressive loading must become queryable within at most 7 round trips
6. Compaction never runs at more than 60% of available IOPS, even in emergency mode
7. The STATUS response is always available, even during emergency compaction
8. Filter expressions are limited to 64 levels of nesting depth

---

*File: `vendor/ruvector/docs/research/rvf/spec/11-wasm-bootstrap.md` (vendored, new file, 420 lines)*

# RVF WASM Self-Bootstrapping Specification

## 1. Motivation

Traditional file formats require an external runtime to interpret their contents.
A JPEG needs an image decoder. A SQLite database needs the SQLite library. An RVF
file needs a vector search engine.

What if the file carried its own runtime?

By embedding a tiny WASM interpreter inside the RVF file itself, we eliminate the
last external dependency. The host only needs **raw execution capability** — the
ability to run bytes as instructions. RVF becomes **self-bootstrapping**: a single
file that contains both its data and the complete machinery to process that data.

This is the transition from "needs a compatible runtime" to **"runs anywhere
compute exists."**

## 2. Architecture

### The Bootstrap Stack

```
Layer 3: RVF Data Segments (VEC_SEG, INDEX_SEG, MANIFEST_SEG, ...)
              ^
              | processes
              |
Layer 2: WASM Microkernel (WASM_SEG, role=Microkernel, ~5.5 KB)
              ^   14 exports: query, ingest, distance, top-K
              | executes
              |
Layer 1: WASM Interpreter (WASM_SEG, role=Interpreter, ~50 KB)
              ^   Minimal stack machine that runs WASM bytecode
              | loads
              |
Layer 0: Raw Bytes (The .rvf file on any storage medium)
```

Each layer depends only on the one below it. The host reads Layer 0 (raw bytes),
finds the interpreter at Layer 1, uses it to execute the microkernel at Layer 2,
which then processes the data at Layer 3.

### Segment Layout

```
┌──────────────────────────────────────────────────────────────────────┐
│                            bootable.rvf                              │
│                                                                      │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐  ┌─────────┐   │
│  │ WASM_SEG     │  │ WASM_SEG     │  │ VEC_SEG      │  │ INDEX   │   │
│  │ 0x10         │  │ 0x10         │  │ 0x01         │  │ _SEG    │   │
│  │              │  │              │  │              │  │ 0x02    │   │
│  │ role=Interp  │  │ role=uKernel │  │ 10M vectors  │  │ HNSW    │   │
│  │ ~50 KB       │  │ ~5.5 KB      │  │ 384-dim fp16 │  │ L0+L1   │   │
│  │ priority=0   │  │ priority=1   │  │              │  │         │   │
│  └──────────────┘  └──────────────┘  └──────────────┘  └─────────┘   │
│                                                                      │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐                │
│  │ QUANT_SEG    │  │ WITNESS_SEG  │  │ MANIFEST_SEG │  ← tail        │
│  │ codebooks    │  │ audit trail  │  │ source of    │                │
│  │              │  │              │  │ truth        │                │
│  └──────────────┘  └──────────────┘  └──────────────┘                │
└──────────────────────────────────────────────────────────────────────┘
```

## 3. WASM_SEG Wire Format

### Segment Type

```
Value: 0x10
Name:  WASM_SEG
```

Uses the standard 64-byte RVF segment header (`SegmentHeader`), followed by
a 64-byte `WasmHeader`, followed by the WASM bytecode.

### WasmHeader (64 bytes)

```
Offset Size Type    Field              Description
------ ---- ----    -----              -----------
0x00   4    u32     wasm_magic         0x5256574D ("RVWM" big-endian)
0x04   2    u16     header_version     Currently 1
0x06   1    u8      role               Bootstrap role (see WasmRole enum)
0x07   1    u8      target             Target platform (see WasmTarget enum)
0x08   2    u16     required_features  WASM feature bitfield
0x0A   2    u16     export_count       Number of WASM exports
0x0C   4    u32     bytecode_size      Uncompressed bytecode size (bytes)
0x10   4    u32     compressed_size    Compressed size (0 = no compression)
0x14   1    u8      compression        0=none, 1=LZ4, 2=ZSTD
0x15   1    u8      min_memory_pages   Minimum linear memory (64 KB each)
0x16   1    u8      max_memory_pages   Maximum linear memory (0 = no limit)
0x17   1    u8      table_count        Number of WASM tables
0x18   32   hash256 bytecode_hash      SHAKE-256-256 of uncompressed bytecode
0x38   1    u8      bootstrap_priority Lower = tried first in chain
0x39   1    u8      interpreter_type   Interpreter variant (if role=Interpreter)
0x3A   6    u8[6]   reserved           Must be zero
```

### WasmRole Enum

```
Value  Name          Description
-----  ----          -----------
0x00   Microkernel   RVF query engine (5.5 KB Cognitum tile runtime)
0x01   Interpreter   Minimal WASM interpreter for self-bootstrapping
0x02   Combined      Interpreter + microkernel linked together
0x03   Extension     Domain-specific module (custom distance, decoder)
0x04   ControlPlane  Store management (create, export, segment parsing)
```

### WasmTarget Enum

```
Value  Name      Description
-----  ----      -----------
0x00   Wasm32    Generic wasm32 (any compliant runtime)
0x01   WasiP1    WASI Preview 1 (requires WASI syscalls)
0x02   WasiP2    WASI Preview 2 (component model)
0x03   Browser   Browser-optimized (expects Web APIs)
0x04   BareTile  Bare-metal Cognitum tile (hub-tile protocol only)
```

### Required Features Bitfield

```
Bit  Mask    Feature
---  ----    -------
0    0x0001  SIMD (v128 operations)
1    0x0002  Bulk memory operations
2    0x0004  Multi-value returns
3    0x0008  Reference types
4    0x0010  Threads (shared memory)
5    0x0020  Tail call optimization
6    0x0040  GC (garbage collection)
7    0x0080  Exception handling
```

### Interpreter Type (when role=Interpreter)

```
Value  Name             Description
-----  ----             -----------
0x00   StackMachine     Generic stack-based interpreter
0x01   Wasm3Compatible  wasm3-style (register machine)
0x02   WamrCompatible   WAMR-style (AOT + interpreter)
0x03   WasmiCompatible  wasmi-style (pure stack machine)
```

## 4. Bootstrap Resolution Protocol

### Discovery

1. Scan all segments for `seg_type == 0x10` (WASM_SEG)
2. Parse the 64-byte WasmHeader from each
3. Validate `wasm_magic == 0x5256574D`
4. Sort by `bootstrap_priority` ascending

### Resolution

```
IF any WASM_SEG has role=Combined:
    → SelfContained bootstrap (single module does everything)

ELIF WASM_SEGs with role=Interpreter AND role=Microkernel both exist:
    → TwoStage bootstrap (interpreter runs microkernel)

ELIF only a WASM_SEG with role=Microkernel exists:
    → HostRequired (needs external WASM runtime)

ELSE:
    → No WASM bootstrap available
```

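The resolution rules above reduce to a small match over the discovered role bytes. The `Chain` enum below mirrors, but is not, the real `BootstrapChain` type from `rvf-wasm`:

```rust
/// Sketch of the section 4 resolution rules (illustrative type names).
#[derive(Debug, PartialEq)]
enum Chain {
    SelfContained, // role=Combined present
    TwoStage,      // Interpreter + Microkernel both present
    HostRequired,  // Microkernel only
    None,          // no WASM bootstrap available
}

fn resolve(roles: &[u8]) -> Chain {
    let has = |r: u8| roles.contains(&r);
    if has(0x02) {
        Chain::SelfContained
    } else if has(0x01) && has(0x00) {
        Chain::TwoStage
    } else if has(0x00) {
        Chain::HostRequired
    } else {
        Chain::None
    }
}
```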
### Execution Sequence (Two-Stage)

```
Host                      Interpreter              Microkernel           Data
 |                             |                        |                  |
 |-- read WASM_SEG[0] ------>  |                        |                  |
 |   (interpreter bytes)       |                        |                  |
 |                             |                        |                  |
 |-- instantiate ----------->  |                        |                  |
 |   (load into memory)        |                        |                  |
 |                             |                        |                  |
 |-- feed WASM_SEG[1] ------>  |-- instantiate ------>  |                  |
 |   (microkernel bytes)       |   (via interpreter)    |                  |
 |                             |                        |                  |
 |-- LOAD_QUERY ------------>  |------- forward ----->  |                  |
 |                             |                        |-- read VEC_SEG ->|
 |                             |                        |<- vector block --|
 |                             |                        |                  |
 |                             |                        | rvf_distances()  |
 |                             |                        | rvf_topk_merge() |
 |                             |                        |                  |
 |<-- TOPK_RESULT -----------  |<------ return -------  |                  |
```

## 5. Size Budget

### Microkernel (role=Microkernel)

Already specified in `microkernel/wasm-runtime.md`:

```
Total:   ~5,500 bytes (< 8 KB code budget)
Exports: 14 (query path + quantization + HNSW + verification)
Memory:  8 KB data + 64 KB SIMD scratch
```

### Interpreter (role=Interpreter)

Target: a minimal WASM bytecode interpreter sufficient to run the microkernel.

```
Component                               Estimated Size
---------                               --------------
WASM binary parser                      4 KB
  (magic, section parsing)
Type section decoder                    1 KB
  (function types)
Import/Export resolution                2 KB
Code section interpreter                12 KB
  (control flow, locals)
Stack machine engine                    8 KB
  (operand stack, call stack)
Memory management                       3 KB
  (linear memory, grow)
i32/i64 integer ops                     4 KB
  (add, sub, mul, div, rem, shifts)
f32/f64 float ops                       6 KB
  (add, sub, mul, div, sqrt, conversions)
v128 SIMD ops (optional)                8 KB
  (only if WASM_FEAT_SIMD required)
Table + call_indirect                   2 KB
                                        ----------
Total (no SIMD):                        ~42 KB
Total (with SIMD):                      ~50 KB
```

### Combined (role=Combined)

Interpreter linked with microkernel in a single module:

```
Total: ~48-56 KB (interpreter + microkernel, with overlap eliminated)
```

### Self-Bootstrapping Overhead

For a 10M vector file (~7.3 GB at 384-dim fp16):

- Bootstrap overhead: ~56 KB / ~7.3 GB = **0.0008%**
- The file is 99.9992% data, 0.0008% self-sufficient runtime

For a 1000-vector file (~750 KB):

- Bootstrap overhead: ~56 KB / ~750 KB = **7.5%**
- Still practical for edge/IoT deployments

## 6. Execution Tiers (Extended)
|
||||
|
||||
The original three-tier model from ADR-030 is extended:
|
||||
|
||||
| Tier | Segment | Size | Boot | Self-Bootstrap? |
|
||||
|------|---------|------|------|-----------------|
|
||||
| 0: Embedded WASM Interpreter | WASM_SEG (role=Interpreter) | ~50 KB | <5 ms | **Yes** — file carries its own runtime |
|
||||
| 1: WASM Microkernel | WASM_SEG (role=Microkernel) | 5.5 KB | <1 ms | No — needs host or Tier 0 |
|
||||
| 2: eBPF | EBPF_SEG | 10-50 KB | <20 ms | No — needs Linux kernel |
|
||||
| 3: Unikernel | KERNEL_SEG | 200 KB-2 MB | <125 ms | No — needs VMM (Firecracker) |
|
||||
|
||||
**Key insight**: Tier 0 makes all other tiers optional. An RVF file with
|
||||
Tier 0 embedded runs on *any* host that can execute bytes — bare metal,
|
||||
browser, microcontroller, FPGA with a soft CPU, or even another WASM runtime.
|
||||
|
||||
## 7. "Runs Anywhere Compute Exists"
|
||||
|
||||
### What This Means
|
||||
|
||||
A self-bootstrapping RVF file requires exactly **one capability** from its host:
|
||||
|
||||
> The ability to read bytes from storage and execute them as instructions.
|
||||
|
||||
That's it. No operating system. No file system. No network stack. No runtime
|
||||
library. No package manager. No container engine.
|
||||
|
||||
### Where It Runs
|
||||
|
||||
| Host | How It Works |
|
||||
|------|-------------|
|
||||
| **x86 server** | Native WASM runtime (Wasmtime/WAMR) runs microkernel directly |
|
||||
| **ARM edge device** | Same — native WASM runtime |
|
||||
| **Browser tab** | `WebAssembly.instantiate()` on the microkernel bytes |
|
||||
| **Microcontroller** | Embedded interpreter runs microkernel in 64 KB scratch |
|
||||
| **FPGA soft CPU** | Interpreter mapped to BRAM, microkernel in flash |
|
||||
| **Another WASM runtime** | Interpreter-in-WASM runs microkernel-in-WASM (turtles) |
|
||||
| **Bare metal** | Bootloader extracts interpreter, interpreter runs microkernel |
|
||||
| **TEE enclave** | Enclave loads interpreter, verified via WITNESS_SEG attestation |
|
||||
|
||||
### The Bootstrapping Invariant
|
||||
|
||||
For any host `H` with execution capability `E`:
|
||||
|
||||
```
|
||||
∀ H, E: can_execute(H, E) ∧ can_read_bytes(H)
|
||||
→ can_process_rvf(H, self_bootstrapping_rvf_file)
|
||||
```
|
||||
|
||||
The file is a **fixed point** of the execution relation: it contains everything
|
||||
needed to process itself.
|
||||
|
||||
## 8. Security Considerations

### Interpreter Verification

The embedded interpreter's bytecode is hashed with SHAKE-256-256 and stored
in the WasmHeader (`bytecode_hash`). A WITNESS_SEG can chain the interpreter
hash to a trusted build, providing:

- **Provenance**: Who built this interpreter?
- **Integrity**: Has the interpreter been modified?
- **Attestation**: Can a TEE verify the interpreter before execution?

### Sandbox Guarantees

The WASM sandbox model applies at every layer:

- The interpreter cannot access host memory beyond its linear memory
- The microkernel cannot access interpreter memory
- Each layer communicates only through defined exports/imports
- A trapped module cannot corrupt other modules

### Bootstrap Attack Surface

| Attack | Mitigation |
|--------|------------|
| Malicious interpreter | Verify `bytecode_hash` against a known-good hash in WITNESS_SEG |
| Modified microkernel | Interpreter verifies the microkernel hash before instantiation |
| Data corruption | Segment-level CRC32C/SHAKE-256 hashes (Law 2) |
| Code injection | WASM validates all code at load time (type checking) |
| Resource exhaustion | `max_memory_pages` cap, epoch-based interruption |

## 9. API

### Rust (rvf-runtime)

```rust
// Embed a WASM module (arguments annotated with their WasmHeader field names)
store.embed_wasm(
    WasmRole::Microkernel as u8,  // role
    WasmTarget::Wasm32 as u8,     // target
    WASM_FEAT_SIMD,               // required_features
    &microkernel_bytes,           // wasm_bytecode
    14,                           // export_count
    1,                            // bootstrap_priority
    0,                            // interpreter_type
)?;

// Make the file self-bootstrapping
store.embed_wasm(
    WasmRole::Interpreter as u8,  // role
    WasmTarget::Wasm32 as u8,     // target
    0,                            // required_features
    &interpreter_bytes,           // wasm_bytecode
    3,                            // export_count
    0,                            // bootstrap_priority
    0x03,                         // interpreter_type: wasmi-compatible
)?;

// Check whether the file is self-bootstrapping
assert!(store.is_self_bootstrapping());

// Extract all WASM modules (ordered by priority)
let modules = store.extract_wasm_all()?;
```

### WASM (rvf-wasm bootstrap module)

```rust
use rvf_wasm::bootstrap::{resolve_bootstrap_chain, get_bytecode, BootstrapChain};

let chain = resolve_bootstrap_chain(&rvf_bytes);

match chain {
    BootstrapChain::SelfContained { combined } => {
        let bytecode = get_bytecode(&rvf_bytes, &combined).unwrap();
        // Instantiate and run
    }
    BootstrapChain::TwoStage { interpreter, microkernel } => {
        let interp_code = get_bytecode(&rvf_bytes, &interpreter).unwrap();
        let kernel_code = get_bytecode(&rvf_bytes, &microkernel).unwrap();
        // Load the interpreter, then use it to run the microkernel
    }
    _ => { /* use host runtime */ }
}
```

## 10. Relationship to Existing Segments

| Segment | Relationship to WASM_SEG |
|---------|--------------------------|
| KERNEL_SEG (0x0E) | Alternative execution tier — KERNEL_SEG boots a full unikernel, WASM_SEG runs a lightweight microkernel. Both make the file self-executing, but at different capability levels. |
| EBPF_SEG (0x0F) | Complementary — eBPF accelerates hot-path queries on Linux hosts while WASM provides universal portability. |
| WITNESS_SEG (0x0A) | Verification — WITNESS_SEG chains can attest the interpreter and microkernel hashes, providing a trust anchor for the bootstrap chain. |
| CRYPTO_SEG (0x0C) | Signing — CRYPTO_SEG key material can sign WASM_SEG contents for tamper detection. |
| MANIFEST_SEG (0x05) | Discovery — the tail manifest references all WASM_SEGs with their roles and priorities. |

## 11. Implementation Status

| Component | Crate | Status |
|-----------|-------|--------|
| `SegmentType::Wasm` (0x10) | `rvf-types` | Implemented |
| `WasmHeader` (64-byte header) | `rvf-types` | Implemented |
| `WasmRole`, `WasmTarget` enums | `rvf-types` | Implemented |
| `write_wasm_seg` | `rvf-runtime` | Implemented |
| `embed_wasm` / `extract_wasm` | `rvf-runtime` | Implemented |
| `extract_wasm_all` (priority-sorted) | `rvf-runtime` | Implemented |
| `is_self_bootstrapping` | `rvf-runtime` | Implemented |
| `resolve_bootstrap_chain` | `rvf-wasm` | Implemented |
| `get_bytecode` (zero-copy extraction) | `rvf-wasm` | Implemented |
| Embedded interpreter (wasmi-based) | `rvf-wasm` | Future |
| Combined interpreter+microkernel build | `rvf-wasm` | Future |

---

*File: `vendor/ruvector/docs/research/rvf/wire/binary-layout.md` (vendored, new file, 439 lines)*

# RVF Wire Format Reference

## 1. File Structure

An RVF file is a byte stream with no fixed header at offset 0. All structure
is discovered from the tail.

```
Byte 0                                                                EOF
|                                                                      |
v                                                                      v
+--------+--------+--------+     +--------+---------+--------+---------+
| Seg 0  | Seg 1  | Seg 2  | ... | Seg N  | Seg N+1 | Seg N+2| Mfst K  |
| VEC    | VEC    | INDEX  |     | VEC    | HOT     | INDEX  | MANIF   |
+--------+--------+--------+     +--------+---------+--------+---------+
                                      ^                           ^
                                      |                           |
                                 Level 1 Mfst                     |
                                                               Level 0
                                                             (last 4KB)
```

### Alignment Rule

Every segment starts at a **64-byte aligned** boundary. If a segment's
payload + footer does not end on a 64-byte boundary, zero padding is inserted
before the next segment header.

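The padding rule above can be sketched as a small helper (hypothetical name, assuming offsets are byte positions from the start of the file):

```rust
/// Bytes of zero padding needed to reach the next 64-byte boundary.
fn pad_to_64(offset: u64) -> u64 {
    (64 - (offset % 64)) % 64
}
```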
### Byte Order

All multi-byte integers are **little-endian**. All floating-point values
are IEEE 754 little-endian. This matches x86, ARM (in default mode), and
WASM native byte order.

## 2. Primitive Types

```
Type     Size  Encoding
----     ----  --------
u8       1     Unsigned 8-bit integer
u16      2     Unsigned 16-bit little-endian
u32      4     Unsigned 32-bit little-endian
u64      8     Unsigned 64-bit little-endian
i32      4     Signed 32-bit little-endian (two's complement)
i64      8     Signed 64-bit little-endian (two's complement)
f16      2     IEEE 754 half-precision little-endian
f32      4     IEEE 754 single-precision little-endian
f64      8     IEEE 754 double-precision little-endian
varint   1-10  LEB128 unsigned variable-length integer
svarint  1-10  ZigZag + LEB128 signed variable-length integer
hash128  16    First 128 bits of hash output
hash256  32    First 256 bits of hash output
```

### Varint Encoding (LEB128)

```
Value 0-127:         1 byte  [0xxxxxxx]
Value 128-16383:     2 bytes [1xxxxxxx 0xxxxxxx]
Value 16384-2097151: 3 bytes [1xxxxxxx 1xxxxxxx 0xxxxxxx]
...up to 10 bytes for a u64
```

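The byte patterns above correspond to this minimal unsigned LEB128 encoder sketch (hypothetical helper, shown for illustration):

```rust
/// Encode an unsigned integer as LEB128, 7 payload bits per byte;
/// the high bit marks continuation.
fn encode_varint(mut v: u64) -> Vec<u8> {
    let mut out = Vec::new();
    loop {
        let byte = (v & 0x7F) as u8;
        v >>= 7;
        if v == 0 {
            out.push(byte); // final byte: continuation bit clear
            break;
        }
        out.push(byte | 0x80); // more bytes follow
    }
    out
}
```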
### Delta Encoding

Sequences of sorted integers use delta encoding:

```
Original: [100, 105, 108, 120, 200]
Deltas:   [100, 5, 3, 12, 80]
Encoded:  [varint(100), varint(5), varint(3), varint(12), varint(80)]
```

With restart points every N entries, the first value in each restart group
is absolute (not delta-encoded).

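The delta step can be sketched as follows (hypothetical helper; the input must be sorted, matching the example above):

```rust
/// Convert a sorted sequence into deltas; the first value stays absolute.
fn delta_encode(sorted: &[u64]) -> Vec<u64> {
    let mut prev = 0u64;
    sorted
        .iter()
        .map(|&v| {
            let d = v - prev; // safe: input is sorted ascending
            prev = v;
            d
        })
        .collect()
}
```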
## 3. Segment Header (64 bytes)

```
Offset Type    Field            Notes
------ ----    -----            -----
0x00   u32     magic            Always 0x52564653 ("RVFS")
0x04   u8      version          Format version (1)
0x05   u8      seg_type         Segment type enum
0x06   u16     flags            See flags bitfield
0x08   u64     segment_id       Monotonic ordinal
0x10   u64     payload_length   Bytes after header, before footer
0x18   u64     timestamp_ns     UNIX nanoseconds
0x20   u8      checksum_algo    0=CRC32C, 1=XXH3-128, 2=SHAKE-256
0x21   u8      compression      0=none, 1=LZ4, 2=ZSTD, 3=custom
0x22   u16     reserved_0       Must be 0x0000
0x24   u32     reserved_1       Must be 0x00000000
0x28   hash128 content_hash     Payload hash (first 128 bits)
0x38   u32     uncompressed_len Original payload size (0 if no compression)
0x3C   u32     alignment_pad    Zero padding to 64B boundary
```

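A sketch of validating the fixed header prefix. The helper name is hypothetical, and it assumes the magic is stored little-endian like every other multi-byte integer in this format:

```rust
/// Check the magic ("RVFS" = 0x52564653) and version fields of a
/// segment header (hypothetical helper).
fn is_valid_header_prefix(buf: &[u8]) -> bool {
    buf.len() >= 5
        && u32::from_le_bytes([buf[0], buf[1], buf[2], buf[3]]) == 0x5256_4653
        && buf[4] == 1 // format version
}
```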
### Segment Type Enum

```
0x00       INVALID       Not a valid segment
0x01       VEC_SEG       Vector payloads
0x02       INDEX_SEG     HNSW adjacency
0x03       OVERLAY_SEG   Graph overlay deltas
0x04       JOURNAL_SEG   Metadata mutations
0x05       MANIFEST_SEG  Segment directory
0x06       QUANT_SEG     Quantization dictionaries
0x07       META_SEG      Key-value metadata
0x08       HOT_SEG       Temperature-promoted data
0x09       SKETCH_SEG    Access counter sketches
0x0A       WITNESS_SEG   Capability manifests
0x0B       PROFILE_SEG   Domain profile declarations
0x0C       CRYPTO_SEG    Key material / certificate anchors
0x0D       reserved
0x0E       KERNEL_SEG    Unikernel image (see 11-wasm-bootstrap.md)
0x0F       EBPF_SEG      eBPF programs (see 11-wasm-bootstrap.md)
0x10       WASM_SEG      Embedded WASM modules (see 11-wasm-bootstrap.md)
0x11-0xEF  reserved
0xF0-0xFF  extension     Implementation-specific
```

### Flags Bitfield

```
Bit    Mask    Name        Meaning
---    ----    ----        -------
0      0x0001  COMPRESSED  Payload compressed per compression field
1      0x0002  ENCRYPTED   Payload encrypted (key in CRYPTO_SEG)
2      0x0004  SIGNED      Signature footer follows payload
3      0x0008  SEALED      Immutable (compaction output)
4      0x0010  PARTIAL     Partial/streaming write
5      0x0020  TOMBSTONE   Logically deletes prior segment
6      0x0040  HOT         Contains hot-tier data
7      0x0080  OVERLAY     Contains overlay/delta data
8      0x0100  SNAPSHOT    Full snapshot (not delta)
9      0x0200  CHECKPOINT  Safe rollback point
10-15          reserved    Must be zero
```

## 4. Signature Footer

Present only if the `SIGNED` flag is set. Follows immediately after the payload.

```
Offset Type  Field         Notes
------ ----  -----         -----
0x00   u16   sig_algo      0=Ed25519, 1=ML-DSA-65, 2=SLH-DSA-128s
0x02   u16   sig_length    Signature byte length
0x04   u8[]  signature     Signature bytes
var    u32   footer_length Total footer size (for backward scan)
```

### Signature Algorithm Sizes

| Algorithm | sig_length | Post-Quantum | Performance |
|-----------|------------|--------------|-------------|
| Ed25519 | 64 B | No | ~76,000 sign/s |
| ML-DSA-65 | 3,309 B | Yes (NIST Level 3) | ~4,500 sign/s |
| SLH-DSA-128s | 7,856 B | Yes (NIST Level 1) | ~350 sign/s |

## 5. VEC_SEG Payload Layout

Vector segments store blocks of vectors in columnar layout for compression.

```
+--------------------------------------------+
| VEC_SEG Payload                            |
+--------------------------------------------+
| Block Directory                            |
|   block_count: u32                         |
|   For each block:                          |
|     block_offset: u32 (from payload start) |
|     vector_count: u32                      |
|     dim: u16                               |
|     dtype: u8                              |
|     tier: u8                               |
|   [64B aligned]                            |
+--------------------------------------------+
| Block 0                                    |
|   +-- Columnar Vectors --+                 |
|   | dim_0[0..count]      | <- all vals     |
|   | dim_1[0..count]      |    for dim 0,   |
|   | ...                  |    then dim 1,  |
|   | dim_D[0..count]      |    etc.         |
|   +----------------------+                 |
|   +-- ID Map --+                           |
|   | encoding: u8 (0=raw, 1=delta-varint)   |
|   | restart_interval: u16                  |
|   | id_count: u32                          |
|   | [restart_offsets: u32[]] (if delta)    |
|   | [ids: encoded]                         |
|   +-----------+                            |
|   +-- Block CRC --+                        |
|   | crc32c: u32   |                        |
|   +---------------+                        |
|   [64B padding]                            |
+--------------------------------------------+
| Block 1                                    |
| ...                                        |
+--------------------------------------------+
```

### Data Type Enum

```
0x00  f32     32-bit float
0x01  f16     16-bit float
0x02  bf16    bfloat16
0x03  i8      signed 8-bit integer (scalar quantized)
0x04  u8      unsigned 8-bit integer
0x05  i4      4-bit integer (packed, 2 per byte)
0x06  binary  1-bit (packed, 8 per byte)
0x07  pq      Product-quantized codes
0x08  custom  Custom encoding (see QUANT_SEG)
```

### Columnar vs Interleaved

**VEC_SEG** (columnar): `dim_0[all], dim_1[all], ..., dim_D[all]`

- Better compression (similar values are adjacent)
- Better for batch operations
- Worse for single-vector random access

**HOT_SEG** (interleaved): `vec_0[all_dims], vec_1[all_dims], ...`

- Better for single-vector access (one cache line per vector)
- Better for top-K refinement (sequential scan)
- No compression benefit

## 6. INDEX_SEG Payload Layout

```
+--------------------------------------------+
| INDEX_SEG Payload                          |
+--------------------------------------------+
| Index Header                               |
|   index_type: u8 (0=HNSW, 1=IVF, 2=flat)   |
|   layer_level: u8 (A=0, B=1, C=2)          |
|   M: u16 (HNSW max neighbors per layer)    |
|   ef_construction: u32                     |
|   node_count: u64                          |
|   [64B aligned]                            |
+--------------------------------------------+
| Restart Point Index                        |
|   restart_interval: u32                    |
|   restart_count: u32                       |
|   [restart_offset: u32] * count            |
|   [64B aligned]                            |
+--------------------------------------------+
| Adjacency Data                             |
|   For each node (sorted by node_id):       |
|     layer_count: varint                    |
|     For each layer:                        |
|       neighbor_count: varint               |
|       [delta_neighbor_id: varint] * cnt    |
|   [64B padding per restart group]          |
+--------------------------------------------+
| Prefetch Hints (optional)                  |
|   hint_count: u32                          |
|   For each hint:                           |
|     node_range_start: u64                  |
|     node_range_end: u64                    |
|     page_offset: u64                       |
|     page_count: u32                        |
|     prefetch_ahead: u32                    |
|   [64B aligned]                            |
+--------------------------------------------+
```

## 7. HOT_SEG Payload Layout

The hot segment stores the most-accessed vectors in interleaved (row-major)
layout with their neighbor lists co-located for cache locality.

```
+------------------------------------------+
| HOT_SEG Payload                          |
+------------------------------------------+
| Hot Header                               |
|   vector_count: u32                      |
|   dim: u16                               |
|   dtype: u8 (f16 or i8)                  |
|   neighbor_M: u16                        |
|   [64B aligned]                          |
+------------------------------------------+
| Interleaved Hot Data                     |
|   For each hot vector:                   |
|     vector_id: u64                       |
|     vector: [dtype * dim]                |
|     neighbor_count: u16                  |
|     [neighbor_id: u64] * neighbor_count  |
|   [64B aligned per entry]                |
+------------------------------------------+
```

Each hot entry is self-contained: vector + neighbors in one contiguous block.
A sequential scan of the HOT_SEG for top-K refinement reads vectors and
neighbors without any pointer chasing.

### Hot Entry Size Example

For 384-dim fp16 vectors with M=16 neighbors:

```
8 (id) + 768 (vector) + 2 (count) + 128 (neighbors) = 906 bytes
Padded to 64B: 960 bytes per entry
```

1000 hot vectors = 960 KB (fits in L2 cache on most CPUs).

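The size arithmetic above generalizes to any dimension and neighbor count. A small helper, assuming u64 IDs, a u16 count, and 64B entry alignment exactly as in the layout:

```python
def hot_entry_size(dim, dtype_bytes, neighbor_count):
    """Raw and 64B-padded size of one interleaved hot entry."""
    # id (u64) + vector + neighbor_count (u16) + neighbor IDs (u64 each)
    raw = 8 + dtype_bytes * dim + 2 + 8 * neighbor_count
    padded = (raw + 63) & ~63  # round up to the next 64B boundary
    return raw, padded

# 384-dim fp16 with M=16 neighbors, as in the example above:
print(hot_entry_size(384, 2, 16))  # -> (906, 960)
```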
## 8. MANIFEST_SEG Payload Layout

```
+------------------------------------------+
| MANIFEST_SEG Payload                     |
+------------------------------------------+
| TLV Records (Level 1 manifest)           |
|   For each record:                       |
|     tag: u16                             |
|     length: u32                          |
|     pad: u16 (to 8B alignment)           |
|     value: [u8; length]                  |
|     [8B aligned]                         |
+------------------------------------------+
| Level 0 Root Manifest (last 4096 bytes)  |
|   (See 02-manifest-system.md for layout) |
+------------------------------------------+
```

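Walking the Level 1 TLV region is a straightforward fixed-header scan. A sketch, assuming little-endian fields, an 8-byte record header (tag + length + pad), value bytes re-aligned to 8B before the next record, and a zero tag as an end-of-records sentinel (the sentinel is an assumption, not spec'd in this section):

```python
import struct

def parse_tlv_records(buf):
    """Yield (tag, value) pairs from a Level 1 manifest TLV region."""
    pos = 0
    while pos + 8 <= len(buf):
        tag, length, _pad = struct.unpack_from('<HIH', buf, pos)
        if tag == 0:  # assumed end-of-records sentinel
            break
        pos += 8  # skip the fixed 8-byte record header
        value = buf[pos:pos + length]
        pos = (pos + length + 7) & ~7  # re-align to 8B for the next record
        yield tag, value
```

Unknown tags can simply be skipped, which is what makes TLV manifests forward-compatible.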
## 9. SKETCH_SEG Payload Layout

```
+------------------------------------------+
| SKETCH_SEG Payload                       |
+------------------------------------------+
| Sketch Header                            |
|   block_count: u32                       |
|   width: u32 (counters per row)          |
|   depth: u32 (hash functions)            |
|   counter_bits: u8 (8 or 16)             |
|   decay_shift: u8 (aging right-shift)    |
|   total_accesses: u64                    |
|   [64B aligned]                          |
+------------------------------------------+
| Sketch Data                              |
|   For each block:                        |
|     block_id: u32                        |
|     counters: [u8; width * depth]        |
|   [64B aligned per block]                |
+------------------------------------------+
```

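The header fields map directly onto a count-min sketch with periodic aging. A toy illustration of how `width`, `depth`, `counter_bits`, and `decay_shift` interact; the per-row salted BLAKE2b hash is an illustrative choice, not the spec's hash:

```python
import hashlib

class CountMinSketch:
    """Toy count-min sketch with the aging right-shift from the header."""
    def __init__(self, width, depth, counter_bits=8, decay_shift=1):
        self.width, self.depth = width, depth
        self.max_count = (1 << counter_bits) - 1  # saturating counters
        self.decay_shift = decay_shift
        self.rows = [[0] * width for _ in range(depth)]

    def _slot(self, row, key):
        # One independent hash per row via a per-row salt (illustrative).
        h = hashlib.blake2b(key, salt=row.to_bytes(8, 'little')).digest()
        return int.from_bytes(h[:8], 'little') % self.width

    def update(self, key, n=1):
        for r in range(self.depth):
            slot = self._slot(r, key)
            self.rows[r][slot] = min(self.rows[r][slot] + n, self.max_count)

    def estimate(self, key):
        # Min over rows bounds the overestimate from hash collisions.
        return min(self.rows[r][self._slot(r, key)] for r in range(self.depth))

    def decay(self):
        """Age all counters by decay_shift so stale blocks cool off."""
        for row in self.rows:
            for i in range(len(row)):
                row[i] >>= self.decay_shift
```

With 8-bit saturating counters the whole sketch stays tiny (`width * depth` bytes per block), which is what makes it cheap to persist in SKETCH_SEG.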
## 10. QUANT_SEG Payload Layout

```
+------------------------------------------+
| QUANT_SEG Payload                        |
+------------------------------------------+
| Quant Header                             |
|   quant_type: u8                         |
|     0 = scalar (min-max per dim)         |
|     1 = product quantization             |
|     2 = binary threshold                 |
|     3 = residual PQ                      |
|   tier: u8                               |
|   dim: u16                               |
|   [64B aligned]                          |
+------------------------------------------+
| Type-specific data:                      |
|                                          |
|   Scalar (type 0):                       |
|     min: [f32; dim]                      |
|     max: [f32; dim]                      |
|                                          |
|   PQ (type 1):                           |
|     M: u16 (subspaces)                   |
|     K: u16 (centroids per sub)           |
|     sub_dim: u16 (dims per sub)          |
|     codebook: [f32; M * K * sub_dim]     |
|                                          |
|   Binary (type 2):                       |
|     threshold: [f32; dim]                |
|                                          |
|   Residual PQ (type 3):                  |
|     coarse_centroids: [f32; K_coarse*dim]|
|     residual_codebook: [f32; M * K * sub]|
|                                          |
|   [64B aligned]                          |
+------------------------------------------+
```

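For type 0, the stored `min`/`max` arrays are all a reader needs to encode and decode. A minimal sketch, assuming one u8 code per dimension (the code width is an assumption; the layout above only fixes the min/max parameters):

```python
def scalar_quantize(vec, mins, maxs):
    """Map each f32 component to a u8 code using per-dim min/max."""
    codes = []
    for x, lo, hi in zip(vec, mins, maxs):
        span = hi - lo
        q = 0 if span == 0 else round((x - lo) / span * 255)
        codes.append(max(0, min(255, q)))  # clamp out-of-range inputs
    return bytes(codes)

def scalar_dequantize(codes, mins, maxs):
    """Reconstruct approximate f32 values from u8 codes."""
    return [lo + (c / 255) * (hi - lo) for c, lo, hi in zip(codes, mins, maxs)]
```

The maximum reconstruction error per dimension is half a quantization step, i.e. `(max - min) / 510`.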
## 11. Checksum Algorithms

| ID | Algorithm | Output | Speed (HW accel) | Use Case |
|----|-----------|--------|------------------|----------|
| 0 | CRC32C | 4 B (stored in 16B field, zero-padded) | ~3 GB/s (SSE4.2) | Per-block integrity |
| 1 | XXH3-128 | 16 B | ~50 GB/s (AVX2) | Segment content hash |
| 2 | SHAKE-256 | 16 or 32 B | ~1 GB/s | Cryptographic verification |

Default recommendation:

- Block-level CRC: CRC32C (fastest, hardware accelerated)
- Segment content hash: XXH3-128 (fast, good distribution)
- Crypto witness hashes: SHAKE-256 (post-quantum safe)

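A bit-at-a-time reference CRC32C (Castagnoli polynomial, reflected form `0x82F63B78`) is useful for validating hardware-accelerated output in tests; this sketch is the standard reflected algorithm, not an RVF-specific routine:

```python
def crc32c(data: bytes) -> int:
    """Reference CRC-32C (Castagnoli), reflected, init/xorout 0xFFFFFFFF."""
    crc = 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            # Reflected polynomial 0x82F63B78, one bit per iteration.
            crc = (crc >> 1) ^ (0x82F63B78 if crc & 1 else 0)
    return crc ^ 0xFFFFFFFF

# Standard CRC-32C check value:
print(hex(crc32c(b'123456789')))  # -> 0xe3069283
```

The SSE4.2 `crc32` instruction and most library implementations must agree with this value on the `"123456789"` check string.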
## 12. Compression

| ID | Algorithm | Ratio | Decompress Speed | Use Case |
|----|-----------|-------|------------------|----------|
| 0 | None | 1.0x | N/A | Hot tier |
| 1 | LZ4 | 1.5-3x | ~4 GB/s | Warm tier, low latency |
| 2 | ZSTD | 3-6x | ~1.5 GB/s | Cold tier, high ratio |
| 3 | Custom | Varies | Varies | Domain-specific |

Compression is applied per-segment payload. Individual blocks within a
segment share the same compression.

## 13. Tail Scan Algorithm

```python
import os

def find_latest_manifest(file):
    file_size = file.seek(0, os.SEEK_END)

    # Try fast path: last 4096 bytes
    file.seek(file_size - 4096)
    root = file.read(4096)
    if root[0:4] == b'RVM0' and verify_crc(root):
        return parse_root_manifest(root)

    # Slow path: scan backward for a MANIFEST_SEG header
    scan_pos = file_size - 64  # Start at the last 64B boundary
    while scan_pos >= 0:
        file.seek(scan_pos)
        header = file.read(64)
        if (header[0:4] == b'RVFS' and
                header[5] == 0x05 and  # MANIFEST_SEG
                verify_segment_header(header)):
            return parse_manifest_segment(file, scan_pos)
        scan_pos -= 64  # Previous 64B boundary

    raise CorruptFileError("No valid MANIFEST_SEG found")
```

Worst case: full backward scan at 64B granularity. For a 4 GB file, this is
67M checks, but each check is a 4-byte comparison, so it completes in ~100ms
on a modern CPU with mmap. In practice, the fast path succeeds on the first try
for non-corrupt files.
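The slow path can be exercised against a synthetic in-memory image. A self-contained sketch using the same magic (`RVFS` at offset 0, segment type at byte 5) as the pseudocode above, with the CRC/header validity checks stubbed out:

```python
def tail_scan(image, segment_size=64):
    """Return the offset of the last MANIFEST_SEG header in `image`, or None."""
    scan_pos = len(image) - segment_size
    while scan_pos >= 0:
        header = image[scan_pos:scan_pos + segment_size]
        # Magic at offset 0, segment type 0x05 (MANIFEST_SEG) at offset 5.
        if header[0:4] == b'RVFS' and header[5] == 0x05:
            return scan_pos
        scan_pos -= segment_size
    return None

# Toy image: a MANIFEST_SEG at offset 512, then a non-manifest segment after it.
image = bytearray(640)
image[512:516] = b'RVFS'
image[517] = 0x05
image[576:580] = b'RVFS'
image[581] = 0x01  # some other segment type; the scan must skip it
print(tail_scan(bytes(image)))  # -> 512
```

Because the scan walks backward, the first matching header it accepts is the latest manifest in the file, which is exactly the append-only recovery invariant the format relies on.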