
ruv cd5943df23 Merge commit 'd803bfe2b1fe7f5e219e50ac20d6801a0a58ac75' as 'vendor/ruvector'

2026-02-28 14:39:40 -05:00


ADR-013: RVDNA -- AI-Native Genomic File Format

Status: Accepted | Date: 2026-02-11 | Authors: RuVector Genomics Architecture Team | Parents: ADR-001 (Vision), ADR-003 (HNSW Index), ADR-004 (Attention), ADR-005 (GNN Protein), ADR-006 (Epigenomic)

Context

Every AI genomics pipeline re-encodes from text formats (FASTA, BAM, VCF) into tensors on every run. For a human genome (~3.2 Gbp), this costs 30-120 seconds and dominates latency. No existing format co-locates raw sequence data with pre-computed embeddings, attention matrices, graph adjacencies, or vector indices in a single zero-copy binary.

Format	Era	AI-Ready?	Why Not
FASTA	1985	No	Text, 1 byte/base, no tensors
BAM	2009	Partial	Binary but row-oriented, no embeddings
VCF	2011	No	Text, no graph structures
CRAM	2012	No	Reference-based compression, no AI artifacts

The RuVector DNA crate already implements 2-bit encoding (kmer.rs), HNSW indexing (ruvector-core), attention analysis, GNN protein folding, and epigenomic tracks as in-memory runtime structures. Every restart means full recomputation.

Decision: The RVDNA Binary Format

We define .rvdna -- a sectioned, memory-mappable binary format for mmap(2) + zero-copy access via memmap2. Design principles: (1) zero-copy mmap access, (2) pre-computed AI embeddings co-located with sequences, (3) columnar SIMD-friendly layout, (4) hierarchical indexing (chromosome/region/k-mer/base), (5) native tensor/graph storage (COO, CSR, dense), (6) streaming-compatible chunked encoding. All sections 64-byte aligned.

File Layout Overview

0x0000  64 B    File Header
0x0040  var     Section Directory (16 B per entry, up to 16)
        var     Sec 0: Sequence Data    Sec 1: K-mer Vector Index
        var     Sec 2: Attention        Sec 3: Variant Tensor
        var     Sec 4: Protein Embed    Sec 5: Epigenomic Tracks
        var     Sec 6: Metadata         Footer (16 B)

Header (64 bytes, offset 0x0000)

Off   Sz  Type    Field               Notes
0x00   8  u8[8]   magic               "RVDNA\x01\x00\x00"
0x08   2  u16     version_major       1
0x0A   2  u16     version_minor       0
0x0C   4  u32     flags               bit field (below)
0x10   8  u64     total_file_size
0x18   8  u64     sequence_length     total bases
0x20   4  u32     num_sections        1-7
0x24   4  u32     section_dir_offset
0x28   1  u8      compression         0=none 1=LZ4 2=Zstd 3=Zstd+dict
0x29   1  u8      endianness          0xEF = little-endian (required)
0x2A   2  u16     ref_genome_id       0=none 1=GRCh38 2=T2T-CHM13
0x2C   4  u32     num_chromosomes
0x30   8  u64     creation_timestamp  Unix epoch seconds
0x38   4  u32     creator_version
0x3C   4  u32     header_checksum     CRC32C of 0x00-0x3B

Flags: bit 0=HAS_QUALITY, 1=HAS_KMER_INDEX, 2=HAS_ATTENTION, 3=HAS_VARIANTS, 4=HAS_PROTEIN, 5=HAS_EPIGENOMIC, 6=IS_PAIRED_END, 7=IS_PHASED, 8=KMER_QUANTIZED, 9=ATTENTION_SPARSE, 10=MMAP_SAFE.
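As an illustration of the header table and flag bits above, the following is a minimal, dependency-free sketch of header validation. `HeaderSummary`, `parse_header`, and the `le_*` helpers are hypothetical names, not part of any published crate, and the bitwise `crc32c` is a slow reference implementation chosen only to keep the example self-contained.

```rust
/// Bitwise CRC32C (Castagnoli, reflected polynomial 0x82F63B78).
/// Slow but dependency-free; assumed here only for illustration.
fn crc32c(data: &[u8]) -> u32 {
    let mut crc: u32 = !0;
    for &byte in data {
        crc ^= byte as u32;
        for _ in 0..8 {
            crc = if crc & 1 != 0 { (crc >> 1) ^ 0x82F6_3B78 } else { crc >> 1 };
        }
    }
    !crc
}

const MAGIC: &[u8; 8] = b"RVDNA\x01\x00\x00";
const LITTLE_ENDIAN_TAG: u8 = 0xEF;
// Two of the flag bits listed above, expressed as masks.
const HAS_QUALITY: u32 = 1 << 0;
const HAS_KMER_INDEX: u32 = 1 << 1;

/// Hypothetical summary type holding only a subset of the 64-byte header.
#[derive(Debug, PartialEq)]
struct HeaderSummary {
    version: (u16, u16),
    flags: u32,
    sequence_length: u64,
    num_sections: u32,
}

fn le_u16(b: &[u8], off: usize) -> u16 {
    u16::from_le_bytes([b[off], b[off + 1]])
}

fn le_u32(b: &[u8], off: usize) -> u32 {
    u32::from_le_bytes([b[off], b[off + 1], b[off + 2], b[off + 3]])
}

fn le_u64(b: &[u8], off: usize) -> u64 {
    let mut a = [0u8; 8];
    a.copy_from_slice(&b[off..off + 8]);
    u64::from_le_bytes(a)
}

/// Validates magic, endianness tag, and header checksum, then reads a few fields.
fn parse_header(buf: &[u8]) -> Result<HeaderSummary, &'static str> {
    if buf.len() < 64 {
        return Err("header shorter than 64 bytes");
    }
    if &buf[..8] != &MAGIC[..] {
        return Err("bad magic");
    }
    if buf[0x29] != LITTLE_ENDIAN_TAG {
        return Err("unsupported endianness");
    }
    // header_checksum at 0x3C covers bytes 0x00..=0x3B.
    if le_u32(buf, 0x3C) != crc32c(&buf[..0x3C]) {
        return Err("checksum mismatch");
    }
    Ok(HeaderSummary {
        version: (le_u16(buf, 0x08), le_u16(buf, 0x0A)),
        flags: le_u32(buf, 0x0C),
        sequence_length: le_u64(buf, 0x18),
        num_sections: le_u32(buf, 0x20),
    })
}
```

A reader would typically call `parse_header` on the first 64 bytes of the mapping and then test individual capabilities with masks, for example `summary.flags & HAS_KMER_INDEX != 0`.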

Section Directory (16 bytes per entry)

u64 section_offset    u32 compressed_size    u32 uncompressed_size
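The 16-byte directory entry above can be decoded with plain little-endian reads. This is a sketch only: `SectionEntry` is named in the API sketch later in this ADR, but its fields here are taken from the entry layout and are otherwise an assumption, as are `parse_directory` and `is_aligned`.

```rust
/// One directory entry: u64 section_offset | u32 compressed_size | u32 uncompressed_size.
#[derive(Debug, Clone, Copy, PartialEq)]
struct SectionEntry {
    offset: u64,
    compressed_size: u32,
    uncompressed_size: u32,
}

/// Decodes `num_sections` entries from the raw directory bytes.
/// Returns None if the buffer is too short.
fn parse_directory(dir: &[u8], num_sections: usize) -> Option<Vec<SectionEntry>> {
    if dir.len() < num_sections.checked_mul(16)? {
        return None;
    }
    let mut entries = Vec::with_capacity(num_sections);
    for e in dir.chunks_exact(16).take(num_sections) {
        let mut off = [0u8; 8];
        off.copy_from_slice(&e[0..8]);
        entries.push(SectionEntry {
            offset: u64::from_le_bytes(off),
            compressed_size: u32::from_le_bytes([e[8], e[9], e[10], e[11]]),
            uncompressed_size: u32::from_le_bytes([e[12], e[13], e[14], e[15]]),
        });
    }
    Some(entries)
}

/// Checks the 64-byte alignment rule stated in the design principles.
fn is_aligned(entry: &SectionEntry) -> bool {
    entry.offset % 64 == 0
}
```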

Section 0: Sequence Data (columnar, block-compressed in 16 KB blocks)

Block header (16 B): u32 block_bases | u32 compressed_size | u32 checksum_crc32c | u16 chromosome_id | u16 reserved

Nucleotide encoding: 2 bits/base packed 4 per byte (A=00, C=01, G=10, T=11). N-bases tracked in a separate 1-bit-per-position mask array.
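The 2-bit encoding with a separate N mask can be sketched as below. The bit order within a byte is not stated in this ADR; the sketch assumes the first base occupies the least-significant two bits, and it stores 00 in the packed stream wherever the mask marks an N. `pack_bases` and `unpack_base` are hypothetical helper names.

```rust
/// Packs ASCII bases into 2 bits each (A=00, C=01, G=10, T=11) plus a
/// 1-bit-per-position N mask. Assumption: base i lives in bits 2*(i%4) of
/// byte i/4, and mask bit i%8 of byte i/8 (least-significant first).
fn pack_bases(seq: &[u8]) -> (Vec<u8>, Vec<u8>) {
    let mut packed = vec![0u8; (seq.len() + 3) / 4];
    let mut n_mask = vec![0u8; (seq.len() + 7) / 8];
    for (i, &base) in seq.iter().enumerate() {
        let code: u8 = match base.to_ascii_uppercase() {
            b'A' => 0,
            b'C' => 1,
            b'G' => 2,
            b'T' => 3,
            _ => {
                // Anything else is recorded as N; the packed slot stays 00.
                n_mask[i / 8] |= 1 << (i % 8);
                0
            }
        };
        packed[i / 4] |= code << (2 * (i % 4));
    }
    (packed, n_mask)
}

/// Recovers the ASCII base at position i, consulting the N mask first.
fn unpack_base(packed: &[u8], n_mask: &[u8], i: usize) -> u8 {
    if (n_mask[i / 8] >> (i % 8)) & 1 == 1 {
        return b'N';
    }
    b"ACGT"[((packed[i / 4] >> (2 * (i % 4))) & 3) as usize]
}
```

Under these assumptions, `ACGTN` packs into the two bytes `0xE4 0x00` with a mask byte of `0x10`.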

Quality scores (optional, HAS_QUALITY): 6-bit Phred per position, packed ceil(n*6/8) bytes. Range 0-63.
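A possible packing routine for the 6-bit Phred scores is sketched below. The ADR fixes only the width and the total size `ceil(n*6/8)`; the least-significant-bit-first order and the clamping of out-of-range scores to 63 are assumptions of this sketch, and `pack_quals` / `unpack_qual` are hypothetical names.

```rust
/// Packs Phred scores into 6 bits each, LSB-first (an assumed bit order).
/// Scores above 63 are clamped to 63.
fn pack_quals(quals: &[u8]) -> Vec<u8> {
    let mut out = vec![0u8; (quals.len() * 6 + 7) / 8];
    for (i, &q) in quals.iter().enumerate() {
        let bit = i * 6;
        let (byte, shift) = (bit / 8, bit % 8);
        // A 6-bit value shifted by up to 7 bits fits in 16 bits.
        let wide = (q.min(63) as u16) << shift;
        out[byte] |= wide as u8;
        if shift > 2 {
            // The value straddles a byte boundary; spill the high bits.
            out[byte + 1] |= (wide >> 8) as u8;
        }
    }
    out
}

/// Reads the i-th 6-bit score back out of the packed buffer.
fn unpack_qual(packed: &[u8], i: usize) -> u8 {
    let bit = i * 6;
    let (byte, shift) = (bit / 8, bit % 8);
    let lo = packed[byte] as u16;
    let hi = if shift > 2 { packed[byte + 1] as u16 } else { 0 };
    ((((hi << 8) | lo) >> shift) & 0x3F) as u8
}
```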

Chromosome index table: per chrom: u32 id | u32 name_offset | u64 start_base_offset (16 B each).

Storage per Mb: ~251 KB seq-only, ~1,001 KB with quality.
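The per-megabase figures follow from the bit widths given above. The helpers below (hypothetical names) compute raw payload sizes before block headers, index tables, and compression; the roughly 1 KB gap to the quoted figures is presumably that overhead, which is an inference rather than something this ADR states. Note that an uncompressed N mask would add a further 125,000 bytes per Mb, so the quoted figures appear to assume the mask is absent or compresses to almost nothing.

```rust
/// Raw bytes for 2-bit packed bases.
fn seq_bytes(bases: u64) -> u64 {
    (bases * 2 + 7) / 8
}

/// Raw bytes for 6-bit packed quality scores.
fn qual_bytes(bases: u64) -> u64 {
    (bases * 6 + 7) / 8
}

/// Raw bytes for the 1-bit-per-position N mask, if stored uncompressed.
fn n_mask_bytes(bases: u64) -> u64 {
    (bases + 7) / 8
}
```

For 1,000,000 bases this gives 250,000 bytes of sequence and 750,000 bytes of quality, or 1,000,000 bytes combined.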

Section 1: K-mer Vector Index (HNSW-Ready)

Header (32 B):

u32 num_k_values | u32 num_windows | u32 window_stride
u16 vector_dtype(0=f32,1=f16,2=int8,3=binary) | u16 hnsw_M | u16 hnsw_ef_construction
u16 hnsw_num_layers | u32 hnsw_graph_offset | u64 reserved

Per k-value descriptor (16 B): u8 k | u8 dim_log2 | u16 vector_dim | u32 num_vectors | u64 data_offset

Vector data: contiguous per k. f32: n*dim*4 B. f16: n*dim*2 B. int8: n*dim B + n*8 B (f32 scale + f32 zero per vector; dequant: f32 = (int8 - zero) * scale).
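The int8 dequantization rule and the per-dtype size formulas above translate directly into code. This is a sketch with hypothetical function names; the size of the `binary` dtype is not specified in this ADR, so the sketch returns `None` for it rather than guessing.

```rust
/// Dequantizes one int8 vector: f32 = (int8 - zero) * scale.
fn dequantize(codes: &[i8], scale: f32, zero: f32) -> Vec<f32> {
    codes.iter().map(|&c| (c as f32 - zero) * scale).collect()
}

/// Bytes occupied by `n` vectors of dimension `dim` for a given vector_dtype.
/// 0 = f32, 1 = f16, 2 = int8 (plus 8 bytes of scale/zero per vector).
/// Returns None for dtypes whose size this ADR does not define.
fn vector_bytes(n: u64, dim: u64, dtype: u16) -> Option<u64> {
    match dtype {
        0 => Some(n * dim * 4),
        1 => Some(n * dim * 2),
        2 => Some(n * dim + n * 8),
        _ => None,
    }
}
```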

HNSW graph: per layer top-down: u32 num_nodes, then per node: u16 num_neighbors | u16[neighbors]. Entry point: first u32 after layer count.

Section 2: Attention Matrices (Sparse COO)

Per window (16 B): u64 genomic_start | u32 nnz | u32 data_offset

COO triplets: index_dtype=u16: u16 row | u16 col | f16 value (6 B). index_dtype=u32: u32 row | u32 col | f32 value (12 B).

Cross-attention pairs (optional): per pair header (24 B): u64 query_start | u64 ref_start | u32 nnz | u32 data_offset, followed by COO triplets.
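To show how the COO triplets might be consumed, the sketch below decodes the 12-byte `u32 row | u32 col | f32 value` variant and applies the matrix to a dense vector. It covers only the `index_dtype=u32` case because stable Rust's standard library has no `f16` type; `Triplet`, `decode_coo_u32`, and `spmv` are hypothetical names.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
struct Triplet {
    row: u32,
    col: u32,
    value: f32,
}

/// Decodes `nnz` little-endian 12-byte triplets; None if the buffer is short.
fn decode_coo_u32(data: &[u8], nnz: usize) -> Option<Vec<Triplet>> {
    if data.len() < nnz.checked_mul(12)? {
        return None;
    }
    Some(
        data.chunks_exact(12)
            .take(nnz)
            .map(|t| Triplet {
                row: u32::from_le_bytes([t[0], t[1], t[2], t[3]]),
                col: u32::from_le_bytes([t[4], t[5], t[6], t[7]]),
                value: f32::from_le_bytes([t[8], t[9], t[10], t[11]]),
            })
            .collect(),
    )
}

/// Sparse matrix-vector product y = A * x over COO triplets.
/// Panics if a triplet indexes outside `rows` or `x`.
fn spmv(triplets: &[Triplet], x: &[f32], rows: usize) -> Vec<f32> {
    let mut y = vec![0.0f32; rows];
    for t in triplets {
        y[t.row as usize] += t.value * x[t.col as usize];
    }
    y
}
```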

Section 3: Variant Tensor (Probabilistic)

Haplotype blocks (24 B each): u64 start | u64 end | u32 num_variants | u16 phase_set_id | u16 phase_quality

Calibration (8 B each): f32 reported_quality | f32 empirical_quality

Section 4: Protein Embeddings (GNN-Ready)

Embeddings: row-major num_residues * dim * sizeof(dtype). CSR graph: row_ptr: u32[n+1], col_idx: u32[edges], values: f16[edges]. SS: u8[n] (0=coil, 1=helix, 2=sheet, 3=turn). Binding: u8[n] bit flags (0=DNA, 1=ligand, 2=protein-protein, 3=metal).
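The CSR arrays and binding-site flags above can be read as follows. This is a sketch that ignores the `f16` edge values (stable Rust's standard library has no `f16`), and it reads "0=DNA, 1=ligand, ..." as bit positions, which is an interpretation of the text. `neighbors` and `binding_sites` are hypothetical names.

```rust
/// Returns the contact-graph neighbors of `node` as a zero-copy slice of col_idx.
/// row_ptr has n+1 entries; node's edges are col_idx[row_ptr[node]..row_ptr[node+1]].
fn neighbors<'a>(row_ptr: &[u32], col_idx: &'a [u32], node: usize) -> &'a [u32] {
    let start = row_ptr[node] as usize;
    let end = row_ptr[node + 1] as usize;
    &col_idx[start..end]
}

/// Decodes one residue's binding flags (bit 0=DNA, 1=ligand, 2=protein-protein, 3=metal).
fn binding_sites(flags: u8) -> Vec<&'static str> {
    const NAMES: [&str; 4] = ["DNA", "ligand", "protein-protein", "metal"];
    NAMES
        .iter()
        .enumerate()
        .filter(|(bit, _)| (flags >> *bit) & 1 == 1)
        .map(|(_, name)| *name)
        .collect()
}
```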

Section 5: Epigenomic Tracks (Temporal)

Header (20 B): u32 num_cpg | u32 num_access | u32 num_histone | u32 num_clock | u32 num_timepoints

Section 6: Metadata & Provenance

Header (8 B): u32 msgpack_size | u32 string_table_size

MessagePack-encoded metadata (sample ID, species, reference assembly, source files, pipeline version, per-section CRC32C checksums, model parameters). String table: concatenated null-terminated UTF-8 for chromosome names and identifiers.

Footer (16 bytes)

u64 magic_footer (ASCII "VDNA_END" stored little-endian = 0x444E455F414E4456)
u32 global_checksum (XOR of all section CRC32Cs)
u32 footer_offset (self-offset from file start)
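A footer check might look like the sketch below. Read as little-endian bytes, the hex constant spells the eight ASCII characters `VDNA_END`, which the sketch relies on. `parse_footer` and `global_checksum` are hypothetical names; the per-section CRC32C values are assumed to come from the metadata section.

```rust
const FOOTER_MAGIC: u64 = 0x444E_455F_414E_4456;

/// XOR of all per-section CRC32C values, as described for global_checksum.
fn global_checksum(section_crcs: &[u32]) -> u32 {
    section_crcs.iter().fold(0, |acc, &crc| acc ^ crc)
}

/// Parses the 16-byte footer and returns (global_checksum, footer_offset),
/// or None if the magic does not match.
fn parse_footer(footer: &[u8; 16]) -> Option<(u32, u32)> {
    let mut magic = [0u8; 8];
    magic.copy_from_slice(&footer[0..8]);
    if u64::from_le_bytes(magic) != FOOTER_MAGIC {
        return None;
    }
    let checksum = u32::from_le_bytes([footer[8], footer[9], footer[10], footer[11]]);
    let offset = u32::from_le_bytes([footer[12], footer[13], footer[14], footer[15]]);
    Some((checksum, offset))
}
```

One property worth keeping in mind: because XOR is its own inverse, two sections with identical CRC32C values cancel out of the global checksum, so it complements rather than replaces the per-section checks.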

Indexing Structures

Index	Location	Lookup Time	Format
B+ tree	Sec 0 trailer	<500 ns	64 B nodes: `u16 num_keys, u16 is_leaf, u32 rsv, u64[3] keys, u32[4] children, u8[16] pad`
HNSW	Sec 1 inline	<10 us	Layered neighbor lists (see Sec 1)
Bloom filter	Sec 0 trailer	<100 ns	`u32 num_bits, u32 num_hashes, u8[ceil(bits/8)]`
Interval tree	Sec 3 inline	O(log n + k)	Augmented BST for variant overlap queries
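As an illustration of the Bloom filter row above, here is a minimal sketch using the `u32 num_bits, u32 num_hashes, u8[ceil(bits/8)]` layout. This ADR does not specify the hash functions; the sketch assumes seeded FNV-1a with double hashing purely for demonstration, so its bit positions would not match a real `.rvdna` file unless the format adopts the same scheme. `Bloom` and its methods are hypothetical names.

```rust
/// FNV-1a 64-bit; XORing a seed into the offset basis is this sketch's own choice.
fn fnv1a(data: &[u8], seed: u64) -> u64 {
    let mut h = 0xcbf2_9ce4_8422_2325u64 ^ seed;
    for &b in data {
        h ^= b as u64;
        h = h.wrapping_mul(0x0000_0100_0000_01b3);
    }
    h
}

struct Bloom {
    num_bits: u32,
    num_hashes: u32,
    bits: Vec<u8>,
}

impl Bloom {
    fn new(num_bits: u32, num_hashes: u32) -> Self {
        assert!(num_bits > 0 && num_hashes > 0);
        Bloom { num_bits, num_hashes, bits: vec![0; (num_bits as usize + 7) / 8] }
    }

    /// i-th probe position via double hashing: (h1 + i * h2) mod num_bits.
    fn bit_index(&self, key: &[u8], i: u32) -> usize {
        let h1 = fnv1a(key, 0);
        let h2 = fnv1a(key, 0x9E37_79B9_7F4A_7C15) | 1; // keep the stride odd
        (h1.wrapping_add((i as u64).wrapping_mul(h2)) % self.num_bits as u64) as usize
    }

    fn insert(&mut self, key: &[u8]) {
        for i in 0..self.num_hashes {
            let b = self.bit_index(key, i);
            self.bits[b / 8] |= 1 << (b % 8);
        }
    }

    /// False means definitely absent; true means possibly present.
    fn contains(&self, key: &[u8]) -> bool {
        (0..self.num_hashes).all(|i| {
            let b = self.bit_index(key, i);
            self.bits[b / 8] & (1 << (b % 8)) != 0
        })
    }

    /// Serializes as u32 num_bits | u32 num_hashes | bit array (little-endian).
    fn to_bytes(&self) -> Vec<u8> {
        let mut out = Vec::with_capacity(8 + self.bits.len());
        out.extend_from_slice(&self.num_bits.to_le_bytes());
        out.extend_from_slice(&self.num_hashes.to_le_bytes());
        out.extend_from_slice(&self.bits);
        out
    }
}
```

A negative answer lets a reader skip the B+ tree descent entirely, which is presumably why the filter sits alongside it in the Section 0 trailer.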

Performance Targets

Operation	Target	Mechanism
Random access 1 KB region	<1 us	mmap + B+ tree
K-mer similarity top-10	<10 us	Pre-built HNSW, ef_search=50
Attention matrix 10 KB window	<100 us	Pre-computed COO
Variant at position	<500 ns	B+ tree + block binary search
FASTA conversion (1 Mb)	<1 s	2-bit encode + LZ4
File open + header	<10 us	64 B fixed read

Format Comparison

Property	FASTA	BAM	VCF	CRAM	RVDNA
Storage/Mb (seq)	1,000 KB	300 KB	N/A	50 KB	251 KB
Storage/Mb (seq+AI)	N/A	N/A	N/A	N/A	~5,000 KB
Random access	O(n)	~10 us	O(n)	~50 us	<1 us
AI-ready	No	Partial	No	No	Yes
Streaming	Yes	No	Yes	No	Yes
Vector search	No	No	No	No	HNSW
Tensor/graph	No	No	No	No	COO/CSR
Zero-copy mmap	No	Partial	No	No	Full

Consequences

Positive: Eliminates 30-120s re-encoding tax. Sub-microsecond random access. Pre-built HNSW enables real-time population-scale similarity. Single file -- no sidecar indices. Columnar SIMD access. Partial section loading. 64-byte alignment for cache efficiency.

Negative: Larger than CRAM: about 5x for sequence-only storage (251 KB vs 50 KB per Mb), and about 100x once AI sections are included (~5,000 KB per Mb). Requires re-encoding during transition. Pre-computed tensors go stale on model updates. No existing tool support (samtools, IGV).

Neutral: MessagePack metadata less human-readable than JSON. Write-once/read-many by design. Per-section compression optional.

Options Considered

Extend BAM with custom tags -- rejected: row-oriented layout blocks SIMD; 2-char tag namespace; no sparse tensors; BGZF 64 KB blocks too coarse.
HDF5 with genomic schema -- rejected: not zero-copy mmap-friendly; C library global locks; no HNSW; not no_std Rust compatible.
Arrow/Parquet genomic schema -- rejected: row groups too coarse; no sparse tensor type; no graph adjacency; heavy C++ dependency.
Custom binary (RVDNA) -- selected: purpose-built for AI genomics access patterns; zero-copy; native HNSW/B+/Bloom; WASM-compatible; 100-1000x latency improvement justifies ecosystem investment.

Implementation Strategy

Phase 1 (Weeks 1-4): Header, section directory, footer. Section 0 (sequence + B+ tree). Section 6 (metadata). rvdna-encode CLI. ruvector-rvdna crate with mmap reader.

Phase 2 (Weeks 5-8): Section 1 (k-mer + HNSW). Section 2 (attention COO). Section 3 (variant tensor). Integration with kmer.rs, pipeline.rs, variant.rs.

Phase 3 (Weeks 9-12): Section 4 (protein CSR graphs). Section 5 (epigenomic tracks). GNN integration. End-to-end benchmarks vs BAM/CRAM.

Rust API Sketch

pub struct RvdnaFile { mmap: Mmap, header: &'static RvdnaHeader, sections: Vec<SectionEntry> }

impl RvdnaFile {
    pub fn open(path: &Path) -> Result<Self, RvdnaError>;
    pub fn sequence(&self, chrom: u16, start: u64, len: u64) -> &[u8];       // zero-copy
    pub fn kmer_vectors(&self, k: u8, region: GenomicRange) -> &[f32];       // zero-copy
    pub fn kmer_search(&self, query: &[f32], k: u8, top_n: usize) -> Vec<SearchResult>;
    pub fn attention(&self, window_idx: u32) -> SparseCooMatrix<f16>;
    pub fn variant_at(&self, position: u64) -> Option<VariantRecord>;
    pub fn protein_embedding(&self, id: u32) -> &[f16];                      // zero-copy
    pub fn contact_graph(&self, id: u32) -> CsrGraph<f16>;
    pub fn methylation(&self, region: GenomicRange) -> &[CpgSite];
}

Related Decisions

ADR-003: HNSW genomic vector index -- Section 1 serializes this
ADR-004: Attention architecture -- Section 2 persists attention matrices
ADR-005: GNN protein engine -- Section 4 stores protein graphs
ADR-006: Epigenomic engine -- Section 5 stores methylation/histone tracks
ADR-011: Performance targets -- RVDNA must meet latency budgets defined there

References

SAM/BAM v1.6 | VCF v4.3 | CRAM v3.1
HNSW paper | ESM-2
memmap2 | LZ4 frame format | MessagePack | CRC32C
