
ruv cd5943df23 Merge commit 'd803bfe2b1fe7f5e219e50ac20d6801a0a58ac75' as 'vendor/ruvector'

2026-02-28 14:39:40 -05:00


RVF Domain Profiles

1. Profile Architecture

A domain profile is a semantic overlay on the universal RVF substrate. It does not change the wire format — every profile-specific file is a valid RVF file. The profile adds:

Semantic type annotations for vector dimensions
Domain-specific distance metrics
Custom quantization strategies optimized for the domain
Metadata schemas for domain-specific labels and provenance
Query preprocessing conventions

Profiles are declared in a PROFILE_SEG and referenced by the root manifest's profile_id field.

+-- RVF Universal Substrate --+
| Segments, manifests, tiers  |
| HNSW index, overlays        |
| Temperature, compaction     |
+-----------------------------+
        |
        | profile_id
        v
+-- Domain Profile Layer --+
| Semantic types            |
| Custom distances          |
| Metadata schema           |
| Query conventions         |
+---------------------------+

2. PROFILE_SEG Binary Layout

Offset  Size  Field                Description
------  ----  -----                -----------
0x00    4     profile_magic        Profile-specific magic number
0x04    2     profile_version      Profile spec version
0x06    2     profile_id           Same as root manifest profile_id
0x08    32    profile_name         UTF-8 null-terminated name
0x28    8     schema_length        Length of metadata schema
0x30    var   metadata_schema      JSON or binary schema for META_SEG entries
var     8     distance_config_len  Length of distance configuration
var     var   distance_config      Distance metric parameters
var     8     quant_config_len     Length of quantization configuration
var     var   quant_config         Domain-specific quantization parameters
var     8     preprocess_len       Length of preprocessing spec
var     var   preprocess_spec      Query preprocessing pipeline description
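
The layout above can be walked with a fixed 0x30-byte header followed by length-prefixed blocks. The following is a minimal sketch, not a reference parser: the function names are hypothetical, and little-endian byte order is an assumption, since the table does not state one.

```python
import struct

# Assumption: little-endian integers; the layout table does not state a byte order.
FIXED_HEADER = struct.Struct("<IHH32sQ")  # magic, version, id, name, schema_length


def _read_block(buf, offset):
    """Read an 8-byte length prefix at `offset`, then that many payload bytes."""
    if offset + 8 > len(buf):
        raise ValueError("segment truncated before length prefix")
    (length,) = struct.unpack_from("<Q", buf, offset)
    start = offset + 8
    end = start + length
    if end > len(buf):
        raise ValueError("length prefix runs past the end of the segment")
    return bytes(buf[start:end]), end


def parse_profile_seg(buf):
    """Decode a PROFILE_SEG payload into a dict keyed by the table's field names."""
    magic, version, profile_id, raw_name, schema_len = FIXED_HEADER.unpack_from(buf, 0)
    # profile_name is null-terminated inside its fixed 32-byte slot.
    name = raw_name.split(b"\x00", 1)[0].decode("utf-8")
    offset = FIXED_HEADER.size  # 0x30, where metadata_schema begins
    end = offset + schema_len
    if end > len(buf):
        raise ValueError("schema_length runs past the end of the segment")
    schema = bytes(buf[offset:end])
    distance_config, end = _read_block(buf, end)
    quant_config, end = _read_block(buf, end)
    preprocess_spec, end = _read_block(buf, end)
    return {
        "profile_magic": magic,
        "profile_version": version,
        "profile_id": profile_id,
        "profile_name": name,
        "metadata_schema": schema,
        "distance_config": distance_config,
        "quant_config": quant_config,
        "preprocess_spec": preprocess_spec,
    }
```

The fixed header size works out to 4 + 2 + 2 + 32 + 8 = 48 bytes, which matches the 0x30 offset given for metadata_schema.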

3. RVDNA Profile (Genomics)

Profile Declaration

profile_magic:    0x52444E41 ("RDNA")
profile_id:       0x01
profile_name:     "rvdna"

Semantic Types

RVDNA vectors encode biological sequences at multiple granularities:

Granularity	Dimensions	Encoding	Use Case
Codon	64	Frequency of each codon in reading frame	Gene-level comparison
K-mer (k=6)	4096	6-mer frequency spectrum	Species identification
Motif	128-512	Learned motif embeddings (transformer)	Regulatory element search
Structure	256	Protein secondary structure embedding	Fold similarity
Epigenetic	384	Methylation + histone mark embedding	Epigenomic comparison

Distance Metrics

Codon frequency:     Jensen-Shannon divergence (symmetrized, bounded form of KL)
K-mer spectrum:      Cosine similarity (normalized frequency vectors)
Motif embedding:     L2 distance (Euclidean in learned space)
Structure:           L2 distance with structure-aware weighting
Epigenetic:          Weighted cosine (CpG density as weight)
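
As an illustration of the codon-frequency metric, the sketch below computes the Jensen-Shannon divergence between two frequency vectors. Using log base 2 is an assumption (it bounds the result to [0, 1]); the function names are illustrative and not part of the format.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) in bits; terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence: average KL of p and q against their midpoint."""
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    # The midpoint m is non-zero wherever p or q is non-zero, so the
    # KL terms below never divide by zero.
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)
```

Unlike plain KL divergence, this stays finite when one codon is absent from one of the two sequences, which is why it suits frequency vectors that contain zeros.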

Quantization Strategy

Genomic vectors have specific statistical properties:

Codon frequencies: Sparse, non-negative, sum-to-1. Use scalar quantization with log transform: q = round(log2(freq + epsilon) * scale). 8-bit covers 6 orders of magnitude.
K-mer spectra: Very sparse (most 6-mers absent in short reads). Use sparse encoding: store only non-zero k-mer indices + values. Typical compression: 20-50x over dense.
Learned embeddings: Gaussian-distributed. Standard PQ works well. M=32 subspaces, K=256 centroids (8-bit codes).
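
The log-transform scalar quantization for codon frequencies can be sketched as follows. The source formula leaves epsilon, scale, and the mapping to an unsigned range unspecified; the choices below (epsilon = 1e-6 and an offset so that a zero frequency maps to code 0) are assumptions made for illustration.

```python
import math

EPSILON = 1e-6                # assumed floor, about 6 orders of magnitude below 1.0
LOG_MIN = math.log2(EPSILON)  # about -19.93
SCALE = 255.0 / -LOG_MIN      # code steps per octave, about 12.8


def quantize(freq):
    """Map a frequency in [0, 1] to an 8-bit code on a log2 scale."""
    q = round((math.log2(freq + EPSILON) - LOG_MIN) * SCALE)
    return max(0, min(255, q))  # clamp into the unsigned 8-bit range


def dequantize(code):
    """Invert quantize(); the result is approximate because of rounding."""
    return max(0.0, 2.0 ** (code / SCALE + LOG_MIN) - EPSILON)
```

With these constants, adjacent codes differ by a factor of roughly 2^(1/12.8), so the relative reconstruction error stays within a few percent for frequencies well above epsilon.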

Metadata Schema

{
  "type": "rvdna",
  "fields": {
    "organism": { "type": "string", "indexed": true },
    "gene_id": { "type": "string", "indexed": true },
    "chromosome": { "type": "string", "indexed": true },
    "position_start": { "type": "u64", "indexed": true },
    "position_end": { "type": "u64", "indexed": true },
    "strand": { "type": "enum", "values": ["+", "-"] },
    "quality_score": { "type": "f32" },
    "source_format": { "type": "enum", "values": ["FASTA", "FASTQ", "BAM", "VCF"] },
    "read_depth": { "type": "u32" },
    "gc_content": { "type": "f32" }
  }
}

Query Preprocessing

For RVDNA queries:

1. Input: Raw sequence string (ACGT...)
2. Compute k-mer frequency spectrum
3. Apply log transform for codon/k-mer queries
4. Normalize to unit length for cosine metrics
5. Encode as fp16 vector
6. Submit to RVF query path
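
The preprocessing steps above, up to the fp16 encoding, can be sketched with the standard library alone. The function name, the epsilon value, the shift that maps absent k-mers to zero, and the little-endian output are all assumptions; submission to the query path is left out.

```python
import math
import struct
from itertools import product


def kmer_query_vector(sequence, k=6, epsilon=1e-6):
    """Turn a raw ACGT string into a unit-length fp16 k-mer vector (as bytes)."""
    # Lexicographic index over all 4**k k-mers (4096 slots for k=6).
    index = {"".join(kmer): i for i, kmer in enumerate(product("ACGT", repeat=k))}
    counts = [0] * len(index)
    seq = sequence.upper()
    total = 0
    for i in range(len(seq) - k + 1):
        slot = index.get(seq[i:i + k])  # k-mers with N or other symbols are skipped
        if slot is not None:
            counts[slot] += 1
            total += 1
    if total == 0:
        raise ValueError("sequence contains no valid k-mers")
    freqs = [c / total for c in counts]
    # Log transform, shifted so that absent k-mers map to exactly 0 (assumption).
    floor = math.log2(epsilon)
    logged = [math.log2(f + epsilon) - floor for f in freqs]
    # Normalize to unit length for cosine metrics.
    norm = math.sqrt(sum(x * x for x in logged))
    unit = [x / norm for x in logged]
    # struct's "e" format is IEEE 754 half precision (fp16).
    return struct.pack(f"<{len(unit)}e", *unit)
```

For k=6 the result is 4096 half-precision values, or 8192 bytes per query vector.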

4. RVText Profile (Language)

Profile Declaration

profile_magic:    0x52545854 ("RTXT")
profile_id:       0x02
profile_name:     "rvtext"

Semantic Types

Granularity	Dimensions	Source	Use Case
Token	768-1536	Transformer last hidden state	Semantic search
Sentence	384-768	Sentence transformer pooled output	Document retrieval
Paragraph	384-1024	Long-context model embedding	Passage ranking
Document	256-512	Document-level embedding	Collection search
Sparse	30522	BM25/SPLADE term weights	Lexical matching

Distance Metrics

Dense embeddings:    Cosine similarity (normalized dot product)
Sparse (SPLADE):     Dot product on sparse vectors
Hybrid:              alpha * dense_score + (1-alpha) * sparse_score
Matryoshka:          Cosine on truncated prefix (adaptive dimensionality)
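
The hybrid and Matryoshka metrics can be written out directly. This is a minimal sketch with illustrative names; representing sparse vectors as term_id-to-weight dicts is an assumption, and in practice the dense and sparse scores may need rescaling to comparable ranges before mixing.

```python
import math


def cosine(a, b):
    """Cosine similarity of two dense vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def sparse_dot(a, b):
    """Dot product of two sparse vectors stored as {term_id: weight}."""
    if len(a) > len(b):
        a, b = b, a  # iterate over the smaller dict
    return sum(w * b[t] for t, w in a.items() if t in b)


def hybrid_score(dense_q, dense_d, sparse_q, sparse_d, alpha=0.5):
    """alpha * dense_score + (1 - alpha) * sparse_score, as in the table above."""
    return alpha * cosine(dense_q, dense_d) + (1 - alpha) * sparse_dot(sparse_q, sparse_d)


def prefix_cosine(a, b, dims):
    """Matryoshka-style score: cosine over only the first `dims` dimensions."""
    return cosine(a[:dims], b[:dims])
```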

Quantization Strategy

Text embeddings are well-suited to aggressive quantization:

Dense (384-768 dim): Binary quantization (1 bit per dimension) achieves 0.95+ recall on normalized embeddings; 384 dims pack into 48 bytes. Use binary for the cold tier and int8 for the hot tier.
Sparse (SPLADE): Store as sorted (term_id, weight) pairs with delta-encoded term_ids. Typical sparsity: 100-300 non-zero terms out of 30K vocabulary. Compression: ~100x over dense.
Matryoshka: Store full-dimension vectors but index only the first D/4 dimensions. Progressive refinement uses more dimensions.
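
To make the binary quantization arithmetic concrete, the sketch below packs one sign bit per dimension and compares codes by Hamming distance. Thresholding at zero and the big-endian bit order are assumptions; the function names are illustrative.

```python
def binarize(vec):
    """Pack one bit per dimension (1 if the value is positive) into bytes."""
    bits = 0
    for x in vec:
        bits = (bits << 1) | (1 if x > 0 else 0)
    pad = (-len(vec)) % 8  # zero bits needed to fill the final byte
    bits <<= pad
    return bits.to_bytes((len(vec) + pad) // 8, "big")


def hamming(a, b):
    """Number of differing bits between two equal-length binary codes."""
    if len(a) != len(b):
        raise ValueError("codes must have the same length")
    return bin(int.from_bytes(a, "big") ^ int.from_bytes(b, "big")).count("1")
```

A 384-dimensional vector yields 384 / 8 = 48 bytes, and a smaller Hamming distance between two codes stands in for a higher cosine similarity between the original vectors.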

Metadata Schema

{
  "type": "rvtext",
  "fields": {
    "text": { "type": "string", "stored": true, "max_length": 8192 },
    "source_url": { "type": "string", "indexed": true },
    "language": { "type": "string", "indexed": true },
    "model_id": { "type": "string" },
    "chunk_index": { "type": "u32" },
    "total_chunks": { "type": "u32" },
    "token_count": { "type": "u32" },
    "timestamp": { "type": "u64" }
  }
}

Query Preprocessing

1. Input: Raw text string
2. Tokenize with model-specific tokenizer
3. Encode through embedding model (or receive pre-computed embedding)
4. L2-normalize for cosine similarity
5. Optionally: compute SPLADE sparse expansion
6. Submit dense + sparse to hybrid query path

5. RVGraph Profile (Networks)

Profile Declaration

profile_magic:    0x52475248 ("RGRH")
profile_id:       0x03
profile_name:     "rvgraph"

Semantic Types

Granularity	Dimensions	Source	Use Case
Node	64-256	Node2Vec / GCN embedding	Node similarity
Edge	64-128	Edge feature embedding	Link prediction
Subgraph	128-512	Graph kernel embedding	Subgraph matching
Community	64-256	Community embedding	Community detection
Spectral	32-128	Laplacian eigenvectors	Graph structure

Distance Metrics

Node embedding:      L2 distance
Edge embedding:      Cosine similarity
Subgraph:            Wasserstein distance (approximated by L2 on sorted features)
Community:           Cosine similarity
Spectral:            L2 on normalized eigenvectors
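
The subgraph approximation relies on a one-dimensional fact: for two equal-size sets of scalar features, the optimal transport plan pairs values in sorted order, so the L2 distance between the sorted lists equals sqrt(n) times the 2-Wasserstein distance. A minimal sketch, with an illustrative function name and assuming each subgraph is summarized by the same number of scalar features:

```python
import math


def sorted_l2(a, b):
    """L2 distance between two feature lists after sorting each one."""
    if len(a) != len(b):
        raise ValueError("feature lists must have the same length")
    # Sorting makes the distance invariant to the order in which features
    # (for example, per-node values) were listed.
    sa, sb = sorted(a), sorted(b)
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(sa, sb)))
```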

Integration with Overlay System

RVGraph uniquely integrates with the RVF overlay epoch system:

Graph structure is stored in OVERLAY_SEGs (not just as metadata)
Node embeddings are stored in VEC_SEGs
Edge weights are overlay deltas
Community assignments are partition summaries
Min-cut witnesses directly serve graph partitioning queries

This means RVGraph files are simultaneously vector stores AND graph databases. The overlay system provides dynamic graph operations (add/remove edges, rebalance partitions) while the vector system provides similarity search.

Metadata Schema

{
  "type": "rvgraph",
  "fields": {
    "node_type": { "type": "string", "indexed": true },
    "edge_type": { "type": "string", "indexed": true },
    "node_label": { "type": "string", "indexed": true },
    "degree": { "type": "u32", "indexed": true },
    "community_id": { "type": "u32", "indexed": true },
    "pagerank": { "type": "f32" },
    "clustering_coeff": { "type": "f32" },
    "source_graph": { "type": "string" }
  }
}

6. RVVision Profile (Imagery)

Profile Declaration

profile_magic:    0x52564953 ("RVIS")
profile_id:       0x04
profile_name:     "rvvision"

Semantic Types

Granularity	Dimensions	Source	Use Case
Patch	64-256	ViT patch embedding	Region search
Image	512-2048	CLIP / DINOv2 global embedding	Image retrieval
Object	256-512	Object detection crop embedding	Object search
Scene	128-512	Scene classification embedding	Scene matching
Multi-scale	256 * N	Pyramid of embeddings at N scales	Scale-invariant search

Distance Metrics

CLIP embedding:      Cosine similarity (model-normalized)
DINOv2:              Cosine similarity
Patch:               L2 distance (not normalized)
Multi-scale:         Weighted sum of per-scale cosine similarities

Quantization Strategy

Vision embeddings have high intrinsic dimensionality but are compressible:

CLIP (512-dim): PQ with M=64, K=256 works well. Binary quantization achieves 0.90+ recall.
DINOv2 (768-dim): Similar to CLIP. PQ M=96, K=256.
Patch embeddings: Large volume (196+ patches per image). Aggressive quantization to 4-bit scalar. Use residual PQ for high-recall applications.
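
One way to realize 4-bit scalar quantization for patch embeddings is sketched below. The per-vector min/max range, the two-codes-per-byte packing, and the function names are assumptions for illustration; the profile does not prescribe them.

```python
def quantize_4bit(vec):
    """Quantize floats to 16 uniform levels and pack two codes per byte."""
    lo, hi = min(vec), max(vec)
    span = (hi - lo) or 1.0  # avoid division by zero for constant vectors
    codes = [round((x - lo) / span * 15) for x in vec]
    if len(codes) % 2:
        codes.append(0)  # pad to an even count so codes pair up
    packed = bytes((codes[i] << 4) | codes[i + 1] for i in range(0, len(codes), 2))
    return packed, lo, hi


def dequantize_4bit(packed, lo, hi, n):
    """Unpack n values; each is within about (hi - lo) / 30 of the original."""
    span = hi - lo
    out = []
    for byte in packed:
        out.append(lo + (byte >> 4) / 15 * span)    # high nibble
        out.append(lo + (byte & 0x0F) / 15 * span)  # low nibble
    return out[:n]
```

At 4 bits per dimension a 64-dim patch embedding takes 32 bytes plus the two range values, which matters when each image contributes 196 or more patches.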

Spatial Metadata

RVVision supports spatial queries through metadata:

{
  "type": "rvvision",
  "fields": {
    "image_id": { "type": "string", "indexed": true },
    "patch_row": { "type": "u16" },
    "patch_col": { "type": "u16" },
    "scale": { "type": "f32" },
    "bbox_x": { "type": "f32" },
    "bbox_y": { "type": "f32" },
    "bbox_w": { "type": "f32" },
    "bbox_h": { "type": "f32" },
    "object_class": { "type": "string", "indexed": true },
    "confidence": { "type": "f32" },
    "model_id": { "type": "string" }
  }
}

7. Custom Profile Registration

New profiles can be registered by writing a PROFILE_SEG:

1. Choose a unique profile_id (0x10-0xEF for custom profiles)
2. Define a 4-byte profile_magic
3. Define metadata schema
4. Define distance metric configuration
5. Define quantization recommendations
6. Write PROFILE_SEG into the RVF file
7. Set profile_id in root manifest

The profile system is open — any domain can define its own profile as long as it maps onto the RVF substrate. The substrate does not need to understand the domain semantics; it only needs to store vectors, compute distances, and maintain indexes.
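
The registration steps can be sketched as a function that validates the custom profile_id range and serializes a PROFILE_SEG payload following the Section 2 layout. The function name, the example "rvaudio" profile, little-endian byte order, and JSON encoding of the schema are assumptions; writing the segment into an RVF file and updating the root manifest are not shown.

```python
import json
import struct

CUSTOM_ID_RANGE = range(0x10, 0xF0)  # 0x10-0xEF inclusive, reserved for custom profiles


def build_profile_seg(profile_id, magic, name, schema,
                      distance_config=b"", quant_config=b"",
                      preprocess_spec=b"", version=1):
    """Serialize a custom profile into PROFILE_SEG payload bytes."""
    if profile_id not in CUSTOM_ID_RANGE:
        raise ValueError("custom profile_id must be in 0x10-0xEF")
    if len(magic) != 4:
        raise ValueError("profile_magic must be exactly 4 bytes")
    name_bytes = name.encode("utf-8")
    if len(name_bytes) > 31:
        raise ValueError("profile_name must leave room for a null terminator")
    schema_bytes = json.dumps(schema).encode("utf-8")
    # Read the tag as a big-endian integer so b"RDNA" would give 0x52444E41.
    magic_int = int.from_bytes(magic, "big")
    # "32s" pads the name with null bytes up to the fixed 32-byte slot.
    out = struct.pack("<IHH32sQ", magic_int, version, profile_id,
                      name_bytes, len(schema_bytes))
    out += schema_bytes
    for block in (distance_config, quant_config, preprocess_spec):
        out += struct.pack("<Q", len(block)) + block
    return out
```

A call such as `build_profile_seg(0x10, b"RAUD", "rvaudio", {"type": "rvaudio", "fields": {}})` would produce the bytes for a hypothetical audio profile.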

8. Cross-Profile Queries

RVF files with different profiles can be queried together if their vectors share a compatible embedding space. This is common in multimodal applications:

Query: "Find images similar to this text description"

1. Text embedding (RVText profile) -> 512-dim CLIP text vector
2. Image database (RVVision profile) -> 512-dim CLIP image vectors
3. Distance metric: Cosine similarity (shared CLIP space)
4. Result: Images ranked by text-image similarity

The query path treats both files as RVF files. The profile only affects preprocessing and metadata interpretation — the core distance computation and indexing are profile-agnostic.
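
The text-to-image flow reduces to ranking one set of vectors by cosine similarity against a query vector from the other profile. The sketch below uses toy 2-dimensional vectors in place of real 512-dim CLIP embeddings; the function names and the dict-based candidate store are illustrative assumptions, not an RVF API.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so that a dot product equals cosine."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def rank_by_cosine(query, candidates, top_k=3):
    """Return the top_k (id, score) pairs, highest cosine similarity first.

    `query` stands in for a text vector and `candidates` for image vectors
    that share the same embedding space.
    """
    q = l2_normalize(query)
    scored = []
    for item_id, vec in candidates.items():
        v = l2_normalize(vec)
        scored.append((item_id, sum(a * b for a, b in zip(q, v))))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]
```

Nothing in the ranking step depends on which profile produced each vector, which mirrors the point that the core distance computation is profile-agnostic.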

9. Profile Compatibility Matrix

Source Profile	Target Profile	Compatible?	Condition
RVDNA	RVDNA	Yes	Same granularity
RVText	RVText	Yes	Same model or compatible space
RVVision	RVVision	Yes	Same model or compatible space
RVText	RVVision	Yes	If both use CLIP or shared space
RVDNA	RVText	No*	Unless mapped through protein language model
RVGraph	Any	Partial	Node embeddings may share space

*Cross-domain compatibility requires explicit embedding space alignment, which is outside the scope of the format spec but enabled by it.
