Files
wifi-densepose/vendor/ruvector/docs/research/rvf/profiles/domain-profiles.md

12 KiB

RVF Domain Profiles

1. Profile Architecture

A domain profile is a semantic overlay on the universal RVF substrate. It does not change the wire format — every profile-specific file is a valid RVF file. The profile adds:

  1. Semantic type annotations for vector dimensions
  2. Domain-specific distance metrics
  3. Custom quantization strategies optimized for the domain
  4. Metadata schemas for domain-specific labels and provenance
  5. Query preprocessing conventions

Profiles are declared in a PROFILE_SEG and referenced by the root manifest's profile_id field.

+-- RVF Universal Substrate --+
| Segments, manifests, tiers  |
| HNSW index, overlays        |
| Temperature, compaction     |
+-----------------------------+
        |
        | profile_id
        v
+-- Domain Profile Layer --+
| Semantic types            |
| Custom distances          |
| Metadata schema           |
| Query conventions         |
+---------------------------+

2. PROFILE_SEG Binary Layout

Offset  Size  Field              Description
------  ----  -----              -----------
0x00    4     profile_magic      Profile-specific magic number
0x04    2     profile_version    Profile spec version
0x06    2     profile_id         Same as root manifest profile_id
0x08    32    profile_name       UTF-8 null-terminated name
0x28    8     schema_length      Length of metadata schema
0x30    var   metadata_schema    JSON or binary schema for META_SEG entries
var     8     distance_config_len Length of distance configuration
var     var   distance_config    Distance metric parameters
var     8     quant_config_len   Length of quantization configuration
var     var   quant_config       Domain-specific quantization parameters
var     8     preprocess_len     Length of preprocessing spec
var     var   preprocess_spec    Query preprocessing pipeline description

3. RVDNA Profile (Genomics)

Profile Declaration

profile_magic:    0x52444E41 ("RDNA")
profile_id:       0x01
profile_name:     "rvdna"

Semantic Types

RVDNA vectors encode biological sequences at multiple granularities:

Granularity Dimensions Encoding Use Case
Codon 64 Frequency of each codon in reading frame Gene-level comparison
K-mer (k=6) 4096 6-mer frequency spectrum Species identification
Motif 128-512 Learned motif embeddings (transformer) Regulatory element search
Structure 256 Protein secondary structure embedding Fold similarity
Epigenetic 384 Methylation + histone mark embedding Epigenomic comparison

Distance Metrics

Codon frequency:     Jensen-Shannon divergence (symmetric KL)
K-mer spectrum:      Cosine similarity (normalized frequency vectors)
Motif embedding:     L2 distance (Euclidean in learned space)
Structure:           L2 distance with structure-aware weighting
Epigenetic:          Weighted cosine (CpG density as weight)

Quantization Strategy

Genomic vectors have specific statistical properties:

  • Codon frequencies: Sparse, non-negative, sum-to-1. Use scalar quantization with log transform: q = round(log2(freq + epsilon) * scale). 8-bit covers 6 orders of magnitude.

  • K-mer spectra: Very sparse (most 6-mers absent in short reads). Use sparse encoding: store only non-zero k-mer indices + values. Typical compression: 20-50x over dense.

  • Learned embeddings: Gaussian-distributed. Standard PQ works well. M=32 subspaces, K=256 centroids (8-bit codes).

Metadata Schema

{
  "type": "rvdna",
  "fields": {
    "organism": { "type": "string", "indexed": true },
    "gene_id": { "type": "string", "indexed": true },
    "chromosome": { "type": "string", "indexed": true },
    "position_start": { "type": "u64", "indexed": true },
    "position_end": { "type": "u64", "indexed": true },
    "strand": { "type": "enum", "values": ["+", "-"] },
    "quality_score": { "type": "f32" },
    "source_format": { "type": "enum", "values": ["FASTA", "FASTQ", "BAM", "VCF"] },
    "read_depth": { "type": "u32" },
    "gc_content": { "type": "f32" }
  }
}

Query Preprocessing

For RVDNA queries:

  1. Input: Raw sequence string (ACGT...)
  2. Compute k-mer frequency spectrum
  3. Apply log transform for codon/k-mer queries
  4. Normalize to unit length for cosine metrics
  5. Encode as fp16 vector
  6. Submit to RVF query path

4. RVText Profile (Language)

Profile Declaration

profile_magic:    0x52545854 ("RTXT")
profile_id:       0x02
profile_name:     "rvtext"

Semantic Types

Granularity Dimensions Source Use Case
Token 768-1536 Transformer last hidden state Semantic search
Sentence 384-768 Sentence transformer pooled output Document retrieval
Paragraph 384-1024 Long-context model embedding Passage ranking
Document 256-512 Document-level embedding Collection search
Sparse 30522 BM25/SPLADE term weights Lexical matching

Distance Metrics

Dense embeddings:    Cosine similarity (normalized dot product)
Sparse (SPLADE):     Dot product on sparse vectors
Hybrid:              alpha * dense_score + (1-alpha) * sparse_score
Matryoshka:          Cosine on truncated prefix (adaptive dimensionality)

Quantization Strategy

Text embeddings are well-suited to aggressive quantization:

  • Dense (384-768 dim): Binary quantization achieves 0.95+ recall on normalized embeddings. 384 dims -> 48 bytes. Use binary for cold tier, int8 for hot.

  • Sparse (SPLADE): Store as sorted (term_id, weight) pairs with delta-encoded term_ids. Typical sparsity: 100-300 non-zero terms out of 30K vocabulary. Compression: ~100x over dense.

  • Matryoshka: Store full-dimension vectors but index only the first D/4 dimensions. Progressive refinement uses more dimensions.

Metadata Schema

{
  "type": "rvtext",
  "fields": {
    "text": { "type": "string", "stored": true, "max_length": 8192 },
    "source_url": { "type": "string", "indexed": true },
    "language": { "type": "string", "indexed": true },
    "model_id": { "type": "string" },
    "chunk_index": { "type": "u32" },
    "total_chunks": { "type": "u32" },
    "token_count": { "type": "u32" },
    "timestamp": { "type": "u64" }
  }
}

Query Preprocessing

  1. Input: Raw text string
  2. Tokenize with model-specific tokenizer
  3. Encode through embedding model (or receive pre-computed embedding)
  4. L2-normalize for cosine similarity
  5. Optionally: compute SPLADE sparse expansion
  6. Submit dense + sparse to hybrid query path

5. RVGraph Profile (Networks)

Profile Declaration

profile_magic:    0x52475248 ("RGRH")
profile_id:       0x03
profile_name:     "rvgraph"

Semantic Types

Granularity Dimensions Source Use Case
Node 64-256 Node2Vec / GCN embedding Node similarity
Edge 64-128 Edge feature embedding Link prediction
Subgraph 128-512 Graph kernel embedding Subgraph matching
Community 64-256 Community embedding Community detection
Spectral 32-128 Laplacian eigenvectors Graph structure

Distance Metrics

Node embedding:      L2 distance
Edge embedding:      Cosine similarity
Subgraph:            Wasserstein distance (approximated by L2 on sorted features)
Community:           Cosine similarity
Spectral:            L2 on normalized eigenvectors

Integration with Overlay System

RVGraph uniquely integrates with the RVF overlay epoch system:

  • Graph structure is stored in OVERLAY_SEGs (not just as metadata)
  • Node embeddings are stored in VEC_SEGs
  • Edge weights are overlay deltas
  • Community assignments are partition summaries
  • Min-cut witnesses directly serve graph partitioning queries

This means RVGraph files are simultaneously vector stores AND graph databases. The overlay system provides dynamic graph operations (add/remove edges, rebalance partitions) while the vector system provides similarity search.

Metadata Schema

{
  "type": "rvgraph",
  "fields": {
    "node_type": { "type": "string", "indexed": true },
    "edge_type": { "type": "string", "indexed": true },
    "node_label": { "type": "string", "indexed": true },
    "degree": { "type": "u32", "indexed": true },
    "community_id": { "type": "u32", "indexed": true },
    "pagerank": { "type": "f32" },
    "clustering_coeff": { "type": "f32" },
    "source_graph": { "type": "string" }
  }
}

6. RVVision Profile (Imagery)

Profile Declaration

profile_magic:    0x52564953 ("RVIS")
profile_id:       0x04
profile_name:     "rvvision"

Semantic Types

Granularity Dimensions Source Use Case
Patch 64-256 ViT patch embedding Region search
Image 512-2048 CLIP / DINOv2 global embedding Image retrieval
Object 256-512 Object detection crop embedding Object search
Scene 128-512 Scene classification embedding Scene matching
Multi-scale 256 * N Pyramid of embeddings at scales Scale-invariant search

Distance Metrics

CLIP embedding:      Cosine similarity (model-normalized)
DINOv2:              Cosine similarity
Patch:               L2 distance (not normalized)
Multi-scale:         Weighted sum of per-scale cosine similarities

Quantization Strategy

Vision embeddings have high intrinsic dimensionality but are compressible:

  • CLIP (512-dim): PQ with M=64, K=256 works well. Binary quantization achieves 0.90+ recall.

  • DINOv2 (768-dim): Similar to CLIP. PQ M=96, K=256.

  • Patch embeddings: Large volume (196+ patches per image). Aggressive quantization to 4-bit scalar. Use residual PQ for high-recall applications.

Spatial Metadata

RVVision supports spatial queries through metadata:

{
  "type": "rvvision",
  "fields": {
    "image_id": { "type": "string", "indexed": true },
    "patch_row": { "type": "u16" },
    "patch_col": { "type": "u16" },
    "scale": { "type": "f32" },
    "bbox_x": { "type": "f32" },
    "bbox_y": { "type": "f32" },
    "bbox_w": { "type": "f32" },
    "bbox_h": { "type": "f32" },
    "object_class": { "type": "string", "indexed": true },
    "confidence": { "type": "f32" },
    "model_id": { "type": "string" }
  }
}

7. Custom Profile Registration

New profiles can be registered by writing a PROFILE_SEG:

1. Choose a unique profile_id (0x10-0xEF for custom profiles)
2. Define a 4-byte profile_magic
3. Define metadata schema
4. Define distance metric configuration
5. Define quantization recommendations
6. Write PROFILE_SEG into the RVF file
7. Set profile_id in root manifest

The profile system is open — any domain can define its own profile as long as it maps onto the RVF substrate. The substrate does not need to understand the domain semantics; it only needs to store vectors, compute distances, and maintain indexes.

8. Cross-Profile Queries

RVF files with different profiles can be queried together if their vectors share a compatible embedding space. This is common in multimodal applications:

Query: "Find images similar to this text description"

1. Text embedding (RVText profile) -> 512-dim CLIP text vector
2. Image database (RVVision profile) -> 512-dim CLIP image vectors
3. Distance metric: Cosine similarity (shared CLIP space)
4. Result: Images ranked by text-image similarity

The query path treats both files as RVF files. The profile only affects preprocessing and metadata interpretation — the core distance computation and indexing are profile-agnostic.

9. Profile Compatibility Matrix

Source Profile Target Profile Compatible? Condition
RVDNA RVDNA Yes Same granularity
RVText RVText Yes Same model or compatible space
RVVision RVVision Yes Same model or compatible space
RVText RVVision Yes If both use CLIP or shared space
RVDNA RVText No* Unless mapped through protein language model
RVGraph Any Partial Node embeddings may share space

*Cross-domain compatibility requires explicit embedding space alignment, which is outside the scope of the format spec but enabled by it.