ADR-013: RVDNA -- AI-Native Genomic File Format

Status: Accepted | Date: 2026-02-11 | Authors: RuVector Genomics Architecture Team | Parents: ADR-001 (Vision), ADR-003 (HNSW Index), ADR-004 (Attention), ADR-005 (GNN Protein), ADR-006 (Epigenomic)

Context

Every AI genomics pipeline re-encodes from text formats (FASTA, BAM, VCF) into tensors on every run. For a human genome (~3.2 Gbp), this costs 30-120 seconds and dominates latency. No existing format co-locates raw sequence data with pre-computed embeddings, attention matrices, graph adjacencies, or vector indices in a single zero-copy binary.

Format  Era   AI-Ready?  Why Not
FASTA   1985  No         Text, 1 byte/base, no tensors
BAM     2009  Partial    Binary but row-oriented, no embeddings
VCF     2011  No         Text, no graph structures
CRAM    2012  No         Reference-based compression, no AI artifacts

The RuVector DNA crate already implements 2-bit encoding (kmer.rs), HNSW indexing (ruvector-core), attention analysis, GNN protein folding, and epigenomic tracks as in-memory runtime structures. Every restart means full recomputation.

Decision: The RVDNA Binary Format

We define .rvdna -- a sectioned, memory-mappable binary format for mmap(2) + zero-copy access via memmap2. Design principles: (1) zero-copy mmap access, (2) pre-computed AI embeddings co-located with sequences, (3) columnar SIMD-friendly layout, (4) hierarchical indexing (chromosome/region/k-mer/base), (5) native tensor/graph storage (COO, CSR, dense), (6) streaming-compatible chunked encoding. All sections 64-byte aligned.

File Layout Overview

0x0000  64 B    File Header
0x0040  var     Section Directory (16 B per entry, up to 16)
        var     Sec 0: Sequence Data    Sec 1: K-mer Vector Index
        var     Sec 2: Attention        Sec 3: Variant Tensor
        var     Sec 4: Protein Embed    Sec 5: Epigenomic Tracks
        var     Sec 6: Metadata         Footer (16 B)

Header (64 bytes, offset 0x0000)

Off   Sz  Type    Field               Notes
0x00   8  u8[8]   magic               "RVDNA\x01\x00\x00"
0x08   2  u16     version_major       1
0x0A   2  u16     version_minor       0
0x0C   4  u32     flags               bit field (below)
0x10   8  u64     total_file_size
0x18   8  u64     sequence_length     total bases
0x20   4  u32     num_sections        1-7
0x24   4  u32     section_dir_offset
0x28   1  u8      compression         0=none 1=LZ4 2=Zstd 3=Zstd+dict
0x29   1  u8      endianness          0xEF = little-endian (required)
0x2A   2  u16     ref_genome_id       0=none 1=GRCh38 2=T2T-CHM13
0x2C   4  u32     num_chromosomes
0x30   8  u64     creation_timestamp  Unix epoch seconds
0x38   4  u32     creator_version
0x3C   4  u32     header_checksum     CRC32C of 0x00-0x3B

Flags: bit 0=HAS_QUALITY, 1=HAS_KMER_INDEX, 2=HAS_ATTENTION, 3=HAS_VARIANTS, 4=HAS_PROTEIN, 5=HAS_EPIGENOMIC, 6=IS_PAIRED_END, 7=IS_PHASED, 8=KMER_QUANTIZED, 9=ATTENTION_SPARSE, 10=MMAP_SAFE.
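
As a rough illustration of how a reader might validate the fixed header, the sketch below checks the magic, the endianness byte, and the CRC32C over 0x00-0x3B, then reads a subset of fields at the offsets from the table above. The names `HeaderLite`, `parse_header`, and the `String` error type are assumptions made for this example and are not part of the crate; the CRC32C here is a plain bitwise implementation rather than a hardware-accelerated one.

```rust
// Sketch only: offsets follow the header table; names and error type are illustrative.
pub const MAGIC: [u8; 8] = *b"RVDNA\x01\x00\x00";
pub const HAS_KMER_INDEX: u32 = 1 << 1; // flag bit 1
pub const MMAP_SAFE: u32 = 1 << 10; // flag bit 10

#[derive(Debug, PartialEq)]
pub struct HeaderLite {
    pub version_major: u16,
    pub version_minor: u16,
    pub flags: u32,
    pub total_file_size: u64,
    pub sequence_length: u64,
    pub num_sections: u32,
    pub section_dir_offset: u32,
}

fn le_u16(b: &[u8], o: usize) -> u16 {
    u16::from_le_bytes([b[o], b[o + 1]])
}
fn le_u32(b: &[u8], o: usize) -> u32 {
    u32::from_le_bytes([b[o], b[o + 1], b[o + 2], b[o + 3]])
}
fn le_u64(b: &[u8], o: usize) -> u64 {
    let mut a = [0u8; 8];
    a.copy_from_slice(&b[o..o + 8]);
    u64::from_le_bytes(a)
}

/// Bitwise CRC32C (Castagnoli), reflected polynomial 0x82F63B78.
pub fn crc32c(data: &[u8]) -> u32 {
    let mut crc = !0u32;
    for &byte in data {
        crc ^= byte as u32;
        for _ in 0..8 {
            crc = if crc & 1 != 0 { (crc >> 1) ^ 0x82F6_3B78 } else { crc >> 1 };
        }
    }
    !crc
}

/// Validates magic, endianness marker, and checksum, then reads a few fields.
pub fn parse_header(buf: &[u8]) -> Result<HeaderLite, String> {
    if buf.len() < 64 {
        return Err("header truncated".to_string());
    }
    if buf[..8] != MAGIC[..] {
        return Err("bad magic".to_string());
    }
    if buf[0x29] != 0xEF {
        return Err("unsupported endianness marker".to_string());
    }
    if le_u32(buf, 0x3C) != crc32c(&buf[..0x3C]) {
        return Err("header checksum mismatch".to_string());
    }
    Ok(HeaderLite {
        version_major: le_u16(buf, 0x08),
        version_minor: le_u16(buf, 0x0A),
        flags: le_u32(buf, 0x0C),
        total_file_size: le_u64(buf, 0x10),
        sequence_length: le_u64(buf, 0x18),
        num_sections: le_u32(buf, 0x20),
        section_dir_offset: le_u32(buf, 0x24),
    })
}
```

A caller would typically test individual capabilities with a mask, for example `header.flags & HAS_KMER_INDEX != 0`, before touching the corresponding section.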

Section Directory (16 bytes per entry)

u64 section_offset    u32 compressed_size    u32 uncompressed_size
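
The directory entry layout and the 64-byte alignment rule can be sketched as follows. `SectionEntry` borrows its name from the API sketch later in this document, but its fields here, along with `parse_directory` and `align_up_64`, are assumptions for illustration.

```rust
// Sketch only: one 16-byte directory entry is u64 offset | u32 compressed | u32 uncompressed.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct SectionEntry {
    pub offset: u64,
    pub compressed_size: u32,
    pub uncompressed_size: u32,
}

/// Rounds an offset up to the next multiple of 64, as a writer would when placing a section.
pub fn align_up_64(offset: u64) -> u64 {
    (offset + 63) & !63
}

/// Parses `count` little-endian entries and rejects section offsets that are not 64-byte aligned.
pub fn parse_directory(buf: &[u8], count: usize) -> Result<Vec<SectionEntry>, String> {
    if buf.len() < count * 16 {
        return Err("section directory truncated".to_string());
    }
    let mut entries = Vec::with_capacity(count);
    for e in buf[..count * 16].chunks_exact(16) {
        let mut off = [0u8; 8];
        off.copy_from_slice(&e[..8]);
        let entry = SectionEntry {
            offset: u64::from_le_bytes(off),
            compressed_size: u32::from_le_bytes([e[8], e[9], e[10], e[11]]),
            uncompressed_size: u32::from_le_bytes([e[12], e[13], e[14], e[15]]),
        };
        if entry.offset % 64 != 0 {
            return Err(format!("section offset {} is not 64-byte aligned", entry.offset));
        }
        entries.push(entry);
    }
    Ok(entries)
}
```

With a 64-byte header and seven 16-byte entries, the directory ends at byte 176, so the first section would start at `align_up_64(176)`, which is 192.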

Section 0: Sequence Data (columnar, block-compressed in 16 KB blocks)

Block header (16 B): u32 block_bases | u32 compressed_size | u32 checksum_crc32c | u16 chromosome_id | u16 reserved

Nucleotide encoding: 2 bits/base packed 4 per byte (A=00, C=01, G=10, T=11). N-bases tracked in a separate 1-bit-per-position mask array.
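
A minimal sketch of the 2-bit packing with a separate N mask is shown below. The bit order within each byte (first base in the lowest two bits, mask bit i%8 for position i) is an assumption; the text above fixes only the code values. Function names are illustrative.

```rust
// Sketch only: A=00, C=01, G=10, T=11; any other byte is recorded in the N mask.
// Assumed bit order: base i occupies bits (i%4)*2 .. (i%4)*2+1 of byte i/4.
pub fn encode_2bit(seq: &[u8]) -> (Vec<u8>, Vec<u8>) {
    let mut packed = vec![0u8; (seq.len() + 3) / 4];
    let mut n_mask = vec![0u8; (seq.len() + 7) / 8];
    for (i, &base) in seq.iter().enumerate() {
        let code: u8 = match base.to_ascii_uppercase() {
            b'A' => 0,
            b'C' => 1,
            b'G' => 2,
            b'T' => 3,
            _ => {
                // Ambiguous base: flag it in the mask and store a placeholder code.
                n_mask[i / 8] |= 1u8 << (i % 8);
                0
            }
        };
        packed[i / 4] |= code << ((i % 4) * 2);
    }
    (packed, n_mask)
}

/// Inverse of `encode_2bit`; positions flagged in the mask decode to b'N'.
pub fn decode_2bit(packed: &[u8], n_mask: &[u8], len: usize) -> Vec<u8> {
    (0..len)
        .map(|i| {
            if (n_mask[i / 8] >> (i % 8)) & 1 == 1 {
                b'N'
            } else {
                b"ACGT"[((packed[i / 4] >> ((i % 4) * 2)) & 3) as usize]
            }
        })
        .collect()
}
```

Keeping N positions out of the main array is what allows the sequence column to stay at exactly 2 bits per base.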

Quality scores (optional, HAS_QUALITY): 6-bit Phred per position, packed ceil(n*6/8) bytes. Range 0-63.
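
The 6-bit quality packing can be sketched as a simple bit stream. The LSB-first bit order and the clamping of values above 63 are assumptions for this example; the text above specifies only the width and the ceil(n*6/8) size.

```rust
// Sketch only: packs Phred scores as consecutive 6-bit fields, LSB-first (assumed order).
pub fn pack_phred6(quals: &[u8]) -> Vec<u8> {
    let mut out = vec![0u8; (quals.len() * 6 + 7) / 8];
    for (i, &q) in quals.iter().enumerate() {
        let value = q.min(63) as u16; // clamp to the 0-63 range
        let bit = i * 6;
        let (byte, shift) = (bit / 8, bit % 8);
        let wide = value << shift; // at most 13 bits wide
        out[byte] |= wide as u8;
        if shift > 2 {
            // The field straddles a byte boundary; spill the high bits.
            out[byte + 1] |= (wide >> 8) as u8;
        }
    }
    out
}

/// Reads back `n` 6-bit scores written by `pack_phred6`.
pub fn unpack_phred6(packed: &[u8], n: usize) -> Vec<u8> {
    (0..n)
        .map(|i| {
            let bit = i * 6;
            let (byte, shift) = (bit / 8, bit % 8);
            let lo = packed[byte] as u16;
            let hi = if shift > 2 { packed[byte + 1] as u16 } else { 0 };
            ((((hi << 8) | lo) >> shift) & 0x3F) as u8
        })
        .collect()
}
```

For one million positions this yields 750,000 bytes, which is where the gap between the sequence-only and with-quality storage figures comes from.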

Chromosome index table: per chrom: u32 id | u32 name_offset | u64 start_base_offset (16 B each).

Storage per Mb: ~251 KB seq-only, ~1,001 KB with quality.

Section 1: K-mer Vector Index (HNSW-Ready)

Header (32 B):

u32 num_k_values | u32 num_windows | u32 window_stride
u16 vector_dtype(0=f32,1=f16,2=int8,3=binary) | u16 hnsw_M | u16 hnsw_ef_construction
u16 hnsw_num_layers | u32 hnsw_graph_offset | u64 reserved

Per k-value descriptor (16 B): u8 k | u8 dim_log2 | u16 vector_dim | u32 num_vectors | u64 data_offset

Vector data: contiguous per k. f32: n*dim*4 B. f16: n*dim*2 B. int8: n*dim B + n*8 B (f32 scale + f32 zero per vector; dequant: f32 = (int8 - zero) * scale).
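
The size formulas and the int8 dequantization rule above can be sketched as below. Where the per-vector scale and zero values sit relative to the int8 codes is not specified here, so the sketch takes them as plain arguments; the byte size of the binary dtype is also not given and is left out. Function names are illustrative.

```rust
// Sketch only: applies f32 = (int8 - zero) * scale to one quantized vector.
pub fn dequantize_int8(codes: &[i8], scale: f32, zero: f32) -> Vec<f32> {
    codes.iter().map(|&c| (c as f32 - zero) * scale).collect()
}

/// Bytes used by `n` vectors of dimension `dim` for vector_dtype 0 (f32), 1 (f16), 2 (int8).
/// Returns None for dtypes whose size the text does not state (for example 3 = binary).
pub fn vector_block_bytes(n: usize, dim: usize, vector_dtype: u16) -> Option<usize> {
    match vector_dtype {
        0 => Some(n * dim * 4),
        1 => Some(n * dim * 2),
        2 => Some(n * dim + n * 8), // codes plus f32 scale and f32 zero per vector
        _ => None,
    }
}
```

For example, 1,000 int8 vectors of dimension 128 take 136,000 bytes, against 512,000 bytes for the same vectors in f32.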

HNSW graph: per layer top-down: u32 num_nodes, then per node: u16 num_neighbors | u16[neighbors]. Entry point: first u32 after layer count.

Section 2: Attention Matrices (Sparse COO)

Header (24 B): u32 num_windows | u32 window_size | u32 num_heads | u16 value_dtype(0=f32,1=f16,2=bf16) | u16 index_dtype(0=u16,1=u32) | u32 total_nnz | u32 sparsity_threshold

Per window (16 B): u64 genomic_start | u32 nnz | u32 data_offset

COO triplets: index_dtype=u16: u16 row | u16 col | f16 value (6 B). index_dtype=u32: u32 row | u32 col | f32 value (12 B).
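
To make the triplet layout concrete, the sketch below decodes the 12-byte variant (u32 row, u32 col, f32 value) and applies the resulting sparse matrix to a dense vector. The `Coo` struct and function names are assumptions for this example; the 6-byte f16 variant would need a half-precision conversion that is omitted here.

```rust
// Sketch only: decodes `nnz` little-endian (u32 row, u32 col, f32 value) triplets.
pub struct Coo {
    pub rows: Vec<u32>,
    pub cols: Vec<u32>,
    pub vals: Vec<f32>,
}

pub fn decode_coo_u32(buf: &[u8], nnz: usize) -> Option<Coo> {
    if buf.len() < nnz * 12 {
        return None; // not enough bytes for the declared number of non-zeros
    }
    let mut m = Coo {
        rows: Vec::with_capacity(nnz),
        cols: Vec::with_capacity(nnz),
        vals: Vec::with_capacity(nnz),
    };
    for t in buf[..nnz * 12].chunks_exact(12) {
        m.rows.push(u32::from_le_bytes([t[0], t[1], t[2], t[3]]));
        m.cols.push(u32::from_le_bytes([t[4], t[5], t[6], t[7]]));
        m.vals.push(f32::from_bits(u32::from_le_bytes([t[8], t[9], t[10], t[11]])));
    }
    Some(m)
}

/// Sparse matrix-vector product y = A * x over the stored non-zeros only.
pub fn spmv(m: &Coo, x: &[f32], n_rows: usize) -> Vec<f32> {
    let mut y = vec![0.0f32; n_rows];
    for i in 0..m.vals.len() {
        y[m.rows[i] as usize] += m.vals[i] * x[m.cols[i] as usize];
    }
    y
}
```

Because only entries above the sparsity threshold are stored, the cost of `spmv` scales with nnz rather than with window_size squared.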

Cross-attention pairs (optional): per pair header (24 B): u64 query_start | u64 ref_start | u32 nnz | u32 data_offset, followed by COO triplets.

Section 3: Variant Tensor (Probabilistic)

Header (24 B): u32 num_variant_sites | u32 max_alleles | u32 num_haplotype_blocks | u16 likelihood_dtype | u16 ploidy | u32 calibration_points | u32 reserved

Per variant site: u64 position | u8 ref_allele(2-bit) | u8 num_alt | u8[num_alt] alts | f16[G] genotype_likelihoods | f16 allele_freq | u8 filter_flags where G=(num_alt+1)*(num_alt+2)/2 for diploid.
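
The genotype count and the resulting variable record size can be worked through in a few lines. The field widths come from the per-site layout above; `genotype_index` uses the VCF ordering for unphased diploid genotypes, which is an assumption here since this document does not state how the likelihood array is ordered.

```rust
// Sketch only: diploid genotype count G = (num_alt + 1) * (num_alt + 2) / 2.
pub fn num_genotypes(num_alt: u8) -> usize {
    let alleles = num_alt as usize + 1; // reference plus alternates
    alleles * (alleles + 1) / 2
}

/// Bytes per site: u64 position + u8 ref + u8 num_alt + alts + f16[G] + f16 freq + u8 flags.
pub fn site_record_bytes(num_alt: u8) -> usize {
    8 + 1 + 1 + num_alt as usize + 2 * num_genotypes(num_alt) + 2 + 1
}

/// Index of genotype j/k (j <= k) in the likelihood array, assuming VCF ordering.
pub fn genotype_index(j: usize, k: usize) -> usize {
    k * (k + 1) / 2 + j
}
```

A biallelic site (num_alt = 1) therefore carries 3 likelihoods and occupies 20 bytes, while a triallelic site carries 6 likelihoods and occupies 27 bytes.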

Haplotype blocks (24 B each): u64 start | u64 end | u32 num_variants | u16 phase_set_id | u16 phase_quality

Calibration (8 B each): f32 reported_quality | f32 empirical_quality

Section 4: Protein Embeddings (GNN-Ready)

Header (24 B): u32 num_proteins | u16 embedding_dim | u16 dtype | u32 total_residues | u32 total_contacts | u32 ss_present | u32 binding_present

Per protein (32 B): u32 protein_id | u32 gene_id | u32 num_residues | u32 embed_offset | u32 csr_rowptr_off | u32 csr_colidx_off | u32 csr_values_off | u32 annotation_off

Embeddings: row-major num_residues * dim * sizeof(dtype). CSR graph: row_ptr: u32[n+1], col_idx: u32[edges], values: f16[edges]. SS: u8[n] (0=coil, 1=helix, 2=sheet, 3=turn). Binding: u8[n] bit flags (0=DNA, 1=ligand, 2=protein-protein, 3=metal).
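
Reading a residue's contacts from the CSR arrays works as sketched below. The `Csr` view type is an assumption for this example, and edge values are shown as f32 for simplicity where the format stores f16.

```rust
// Sketch only: a borrowed view over the three CSR arrays of one protein contact graph.
pub struct Csr<'a> {
    pub row_ptr: &'a [u32], // length n + 1
    pub col_idx: &'a [u32], // length = number of edges
    pub values: &'a [f32],  // f16 in the file; widened to f32 here
}

impl<'a> Csr<'a> {
    /// Number of residues (rows) in the graph.
    pub fn num_nodes(&self) -> usize {
        self.row_ptr.len().saturating_sub(1)
    }

    /// Neighbor indices and edge weights of residue `i`, as sub-slices with no copying.
    pub fn neighbors(&self, i: usize) -> (&'a [u32], &'a [f32]) {
        let start = self.row_ptr[i] as usize;
        let end = self.row_ptr[i + 1] as usize;
        (&self.col_idx[start..end], &self.values[start..end])
    }
}
```

Since each row's neighbors are contiguous, a GNN message-passing step can walk `row_ptr` once and touch `col_idx` and `values` sequentially.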

Section 5: Epigenomic Tracks (Temporal)

Header (20 B): u32 num_cpg | u32 num_access | u32 num_histone | u32 num_clock | u32 num_timepoints

CpG (12 B each): u64 position | f16 beta | u16 coverage. ATAC peaks (16 B): u64 start | u32 width | f16 score | u16 reserved. Histone (6 B): u32 bin_index | f16 signal. Clock (12 B): u32 cpg_idx | f32 coeff | f32 intercept_contrib.
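
As an example of decoding one of these fixed-size records, the sketch below parses a 12-byte CpG entry and widens its f16 beta value to f32 using a small IEEE 754 half-precision conversion. `CpgDecoded`, `parse_cpg`, and `f16_to_f32` are names chosen for this example; a real reader might use a half-precision crate instead.

```rust
// Sketch only: converts an IEEE 754 binary16 bit pattern to f32.
pub fn f16_to_f32(h: u16) -> f32 {
    let sign = if h & 0x8000 != 0 { -1.0f32 } else { 1.0f32 };
    let exp = ((h >> 10) & 0x1F) as i32;
    let frac = (h & 0x3FF) as f32;
    match exp {
        // Zero and subnormals: frac * 2^-24.
        0 => sign * frac * (1.0 / 16_777_216.0),
        // All-ones exponent: infinity or NaN.
        0x1F => {
            if frac == 0.0 { sign * f32::INFINITY } else { f32::NAN }
        }
        // Normal numbers: (1 + frac/1024) * 2^(exp - 15).
        _ => sign * (1.0 + frac / 1024.0) * 2.0f32.powi(exp - 15),
    }
}

#[derive(Debug, PartialEq)]
pub struct CpgDecoded {
    pub position: u64,
    pub beta: f32, // methylation fraction, stored as f16
    pub coverage: u16,
}

/// Parses one little-endian CpG record: u64 position | f16 beta | u16 coverage.
pub fn parse_cpg(rec: &[u8]) -> Option<CpgDecoded> {
    if rec.len() < 12 {
        return None;
    }
    let mut pos = [0u8; 8];
    pos.copy_from_slice(&rec[..8]);
    Some(CpgDecoded {
        position: u64::from_le_bytes(pos),
        beta: f16_to_f32(u16::from_le_bytes([rec[8], rec[9]])),
        coverage: u16::from_le_bytes([rec[10], rec[11]]),
    })
}
```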

Section 6: Metadata & Provenance

Header (8 B): u32 msgpack_size | u32 string_table_size

MessagePack-encoded metadata (sample ID, species, reference assembly, source files, pipeline version, per-section CRC32C checksums, model parameters). String table: concatenated null-terminated UTF-8 for chromosome names and identifiers.

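Footer (16 B)

u64 magic_footer      0x444E455F414E4456 (the ASCII bytes "VDNA_END" read as a little-endian u64; the 9-character string "RVDNA_END" does not fit in 8 bytes)
u32 global_checksum   XOR of all section CRC32Cs
u32 footer_offset     self-offset from file start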
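
A reader could verify the footer roughly as follows. The sketch assumes the footer occupies the last 16 bytes of the file and that the per-section CRC32C values have already been computed or read from the metadata section; `check_footer` and its error strings are illustrative. Note that the hex constant corresponds to the eight ASCII bytes "VDNA_END" in little-endian order.

```rust
// Sketch only: validates magic, XOR checksum, and self-offset of the 16-byte footer.
pub const FOOTER_MAGIC: u64 = 0x444E_455F_414E_4456;

/// XOR of all per-section CRC32C values.
pub fn global_checksum(section_crcs: &[u32]) -> u32 {
    section_crcs.iter().fold(0u32, |acc, &c| acc ^ c)
}

pub fn check_footer(file: &[u8], section_crcs: &[u32]) -> Result<(), String> {
    if file.len() < 16 {
        return Err("file too short for footer".to_string());
    }
    let footer_start = file.len() - 16; // assumed: footer is the final 16 bytes
    let f = &file[footer_start..];
    let mut magic = [0u8; 8];
    magic.copy_from_slice(&f[..8]);
    if u64::from_le_bytes(magic) != FOOTER_MAGIC {
        return Err("bad footer magic".to_string());
    }
    let stored = u32::from_le_bytes([f[8], f[9], f[10], f[11]]);
    if stored != global_checksum(section_crcs) {
        return Err("global checksum mismatch".to_string());
    }
    let self_offset = u32::from_le_bytes([f[12], f[13], f[14], f[15]]);
    if self_offset as usize != footer_start {
        return Err("footer_offset does not match footer position".to_string());
    }
    Ok(())
}
```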

Indexing Structures

Index          Location       Lookup Time   Format
B+ tree        Sec 0 trailer  <500 ns       64 B nodes: u16 num_keys, u16 is_leaf, u32 rsv, u64[3] keys, u32[4] children, u8[16] pad
HNSW           Sec 1 inline   <10 us        Layered neighbor lists (see Sec 1)
Bloom filter   Sec 0 trailer  <100 ns       u32 num_bits, u32 num_hashes, u8[ceil(bits/8)]
Interval tree  Sec 3 inline   O(log n + k)  Augmented BST for variant overlap queries
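
The Bloom filter row can be illustrated with a small in-memory sketch that mirrors the stored fields (num_bits, num_hashes, bit array). The hash scheme is not specified in this document; the double hashing over seeded FNV-1a used below is purely an assumption for the example, and a real implementation would need to fix the hash functions as part of the format.

```rust
// Sketch only: Bloom filter with the same fields as the on-disk layout.
pub struct Bloom {
    pub num_bits: u32,
    pub num_hashes: u32,
    pub bits: Vec<u8>, // ceil(num_bits / 8) bytes
}

// Assumed hash: 64-bit FNV-1a with the seed XORed into the offset basis.
fn fnv1a(data: &[u8], seed: u64) -> u64 {
    let mut h = 0xcbf2_9ce4_8422_2325u64 ^ seed;
    for &b in data {
        h ^= b as u64;
        h = h.wrapping_mul(0x0000_0100_0000_01b3);
    }
    h
}

impl Bloom {
    pub fn new(num_bits: u32, num_hashes: u32) -> Bloom {
        Bloom { num_bits, num_hashes, bits: vec![0u8; (num_bits as usize + 7) / 8] }
    }

    // Double hashing: position_i = (h1 + i * h2) mod num_bits.
    fn positions(&self, key: &[u8]) -> Vec<usize> {
        let h1 = fnv1a(key, 0);
        let h2 = fnv1a(key, 0x9E37_79B9_7F4A_7C15) | 1;
        (0..self.num_hashes as u64)
            .map(|i| (h1.wrapping_add(i.wrapping_mul(h2)) % (self.num_bits as u64)) as usize)
            .collect()
    }

    pub fn insert(&mut self, key: &[u8]) {
        for p in self.positions(key) {
            self.bits[p / 8] |= 1u8 << (p % 8);
        }
    }

    /// False means definitely absent; true means possibly present.
    pub fn may_contain(&self, key: &[u8]) -> bool {
        self.positions(key).iter().all(|&p| (self.bits[p / 8] >> (p % 8)) & 1 == 1)
    }
}
```

A negative answer lets a k-mer lookup skip the B+ tree entirely, which is the reason the filter has the tightest latency budget in the table.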

Performance Targets

Operation                         Target   Mechanism
Random access 1 KB region         <1 us    mmap + B+ tree
K-mer similarity top-10           <10 us   Pre-built HNSW, ef_search=50
Attention matrix 10 KB window     <100 us  Pre-computed COO
Variant at position               <500 ns  B+ tree + block binary search
FASTA conversion (1 Mb)           <1 s     2-bit encode + LZ4
File open + header                <10 us   64 B fixed read

Format Comparison

Property             FASTA     BAM      VCF   CRAM    RVDNA
Storage/Mb (seq)     1,000 KB  300 KB   N/A   50 KB   251 KB
Storage/Mb (seq+AI)  N/A       N/A      N/A   N/A     ~5,000 KB
Random access        O(n)      ~10 us   O(n)  ~50 us  <1 us
AI-ready             No        No       No    No      Yes
Streaming            Yes       No       Yes   No      Yes
Vector search        No        No       No    No      HNSW
Tensor/graph         No        No       No    No      COO/CSR
Zero-copy mmap       No        Partial  No    No      Full

Consequences

Positive: Eliminates 30-120s re-encoding tax. Sub-microsecond random access. Pre-built HNSW enables real-time population-scale similarity. Single file -- no sidecar indices. Columnar SIMD access. Partial section loading. 64-byte alignment for cache efficiency.

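Negative: Larger than CRAM: about 5x for sequence-only storage (251 KB vs 50 KB per Mb in the comparison above), and roughly 100x once the AI sections are included (~5,000 KB per Mb). Requires re-encoding during transition. Pre-computed tensors go stale on model updates. No existing tool support (samtools, IGV).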

Neutral: MessagePack metadata less human-readable than JSON. Write-once/read-many by design. Per-section compression optional.

Options Considered

  1. Extend BAM with custom tags -- rejected: row-oriented layout blocks SIMD; 2-char tag namespace; no sparse tensors; BGZF 64 KB blocks too coarse.
  2. HDF5 with genomic schema -- rejected: not zero-copy mmap-friendly; C library global locks; no HNSW; not no_std Rust compatible.
  3. Arrow/Parquet genomic schema -- rejected: row groups too coarse; no sparse tensor type; no graph adjacency; heavy C++ dependency.
  4. Custom binary (RVDNA) -- selected: purpose-built for AI genomics access patterns; zero-copy; native HNSW/B+/Bloom; WASM-compatible; 100-1000x latency improvement justifies ecosystem investment.

Implementation Strategy

Phase 1 (Weeks 1-4): Header, section directory, footer. Section 0 (sequence + B+ tree). Section 6 (metadata). rvdna-encode CLI. ruvector-rvdna crate with mmap reader.

Phase 2 (Weeks 5-8): Section 1 (k-mer + HNSW). Section 2 (attention COO). Section 3 (variant tensor). Integration with kmer.rs, pipeline.rs, variant.rs.

Phase 3 (Weeks 9-12): Section 4 (protein CSR graphs). Section 5 (epigenomic tracks). GNN integration. End-to-end benchmarks vs BAM/CRAM.

Rust API Sketch

pub struct RvdnaFile { mmap: Mmap, header: &'static RvdnaHeader, sections: Vec<SectionEntry> }

impl RvdnaFile {
    pub fn open(path: &Path) -> Result<Self, RvdnaError>;
    pub fn sequence(&self, chrom: u16, start: u64, len: u64) -> &[u8];       // zero-copy
    pub fn kmer_vectors(&self, k: u8, region: GenomicRange) -> &[f32];       // zero-copy
    pub fn kmer_search(&self, query: &[f32], k: u8, top_n: usize) -> Vec<SearchResult>;
    pub fn attention(&self, window_idx: u32) -> SparseCooMatrix<f16>;
    pub fn variant_at(&self, position: u64) -> Option<VariantRecord>;
    pub fn protein_embedding(&self, id: u32) -> &[f16];                      // zero-copy
    pub fn contact_graph(&self, id: u32) -> CsrGraph<f16>;
    pub fn methylation(&self, region: GenomicRange) -> &[CpgSite];
}

Related Decisions

  • ADR-003: HNSW genomic vector index -- Section 1 serializes this
  • ADR-004: Attention architecture -- Section 2 persists attention matrices
  • ADR-005: GNN protein engine -- Section 4 stores protein graphs
  • ADR-006: Epigenomic engine -- Section 5 stores methylation/histone tracks
  • ADR-011: Performance targets -- RVDNA must meet latency budgets defined there

References