# RVF Domain Profiles

## 1. Profile Architecture

A domain profile is a **semantic overlay** on the universal RVF substrate. It does
not change the wire format — every profile-specific file is a valid RVF file. The
profile adds:

1. **Semantic type annotations** for vector dimensions
2. **Domain-specific distance metrics**
3. **Custom quantization strategies** optimized for the domain
4. **Metadata schemas** for domain-specific labels and provenance
5. **Query preprocessing** conventions

Profiles are declared in a PROFILE_SEG and referenced by the root manifest's
`profile_id` field.

```
+-- RVF Universal Substrate --+
| Segments, manifests, tiers  |
| HNSW index, overlays        |
| Temperature, compaction     |
+-----------------------------+
        |
        | profile_id
        v
+-- Domain Profile Layer ---+
| Semantic types            |
| Custom distances          |
| Metadata schema           |
| Query conventions         |
+---------------------------+
```

## 2. PROFILE_SEG Binary Layout

```
Offset  Size  Field                Description
------  ----  -----                -----------
0x00    4     profile_magic        Profile-specific magic number
0x04    2     profile_version      Profile spec version
0x06    2     profile_id           Same as root manifest profile_id
0x08    32    profile_name         UTF-8 null-terminated name
0x28    8     schema_length        Length of metadata schema
0x30    var   metadata_schema      JSON or binary schema for META_SEG entries
var     8     distance_config_len  Length of distance configuration
var     var   distance_config      Distance metric parameters
var     8     quant_config_len     Length of quantization configuration
var     var   quant_config         Domain-specific quantization parameters
var     8     preprocess_len       Length of preprocessing spec
var     var   preprocess_spec      Query preprocessing pipeline description
```
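
The layout above amounts to a fixed 48-byte header followed by the schema and three length-prefixed blobs. The sketch below packs and parses that structure with Python's `struct` module. This section does not state a byte order, so little-endian is an assumption here, and the names `pack_profile_seg` and `parse_profile_seg` are illustrative rather than part of any RVF API.

```python
import struct

# Assumption: little-endian integers; this section does not specify byte order.
# Fields: profile_magic, profile_version, profile_id, profile_name, schema_length
HEADER = struct.Struct("<IHH32sQ")  # 4 + 2 + 2 + 32 + 8 = 48 bytes (0x30)
LENGTH = struct.Struct("<Q")        # 8-byte length prefix for each later blob


def pack_profile_seg(magic, version, profile_id, name,
                     schema, distance, quant, preprocess):
    """Serialize the fixed header, the schema, then three length-prefixed blobs."""
    name_bytes = name.encode("utf-8")
    if len(name_bytes) > 31:
        raise ValueError("profile_name needs room for a terminating NUL in 32 bytes")
    # "32s" pads the name with NUL bytes up to 32 bytes.
    out = bytearray(HEADER.pack(magic, version, profile_id, name_bytes, len(schema)))
    out += schema
    for blob in (distance, quant, preprocess):
        out += LENGTH.pack(len(blob)) + blob
    return bytes(out)


def parse_profile_seg(buf):
    """Inverse of pack_profile_seg; keys follow the field names in the table."""
    magic, version, profile_id, raw_name, schema_len = HEADER.unpack_from(buf, 0)
    offset = HEADER.size
    schema = buf[offset:offset + schema_len]
    offset += schema_len
    blobs = []
    for _ in range(3):
        (length,) = LENGTH.unpack_from(buf, offset)
        offset += LENGTH.size
        blobs.append(buf[offset:offset + length])
        offset += length
    return {
        "profile_magic": magic,
        "profile_version": version,
        "profile_id": profile_id,
        "profile_name": raw_name.split(b"\x00", 1)[0].decode("utf-8"),
        "metadata_schema": schema,
        "distance_config": blobs[0],
        "quant_config": blobs[1],
        "preprocess_spec": blobs[2],
    }
```

Under the little-endian assumption, a magic of `0x52444E41` is stored on disk as the bytes `ANDR`; switching the format prefix to `>` would store it as `RDNA`.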

## 3. RVDNA Profile (Genomics)

### Profile Declaration

```
profile_magic:    0x52444E41 ("RDNA")
profile_id:       0x01
profile_name:     "rvdna"
```

### Semantic Types

RVDNA vectors encode biological sequences at multiple granularities:

| Granularity | Dimensions | Encoding | Use Case |
|------------|-----------|----------|----------|
| Codon | 64 | Frequency of each codon in reading frame | Gene-level comparison |
| K-mer (k=6) | 4096 | 6-mer frequency spectrum | Species identification |
| Motif | 128-512 | Learned motif embeddings (transformer) | Regulatory element search |
| Structure | 256 | Protein secondary structure embedding | Fold similarity |
| Epigenetic | 384 | Methylation + histone mark embedding | Epigenomic comparison |

### Distance Metrics

```
Codon frequency:     Jensen-Shannon divergence (symmetrized, bounded form of KL)
K-mer spectrum:      Cosine similarity (normalized frequency vectors)
Motif embedding:     L2 distance (Euclidean in learned space)
Structure:           L2 distance with structure-aware weighting
Epigenetic:          Weighted cosine (CpG density as weight)
```
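
As an illustration of the codon-frequency metric, here is a minimal sketch of Jensen-Shannon divergence between two frequency vectors. It assumes both inputs are already normalized to sum to 1 and uses base-2 logarithms, which bounds the result to [0, 1]. The function names are illustrative.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) in bits; terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Average KL divergence of p and q from their midpoint distribution m."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    # m_i > 0 whenever p_i > 0 or q_i > 0, so the KL terms never divide by zero.
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)
```

Unlike plain KL divergence, this stays finite when one distribution has zeros where the other does not, which is common for codon tables built from short sequences.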

### Quantization Strategy

Genomic vectors have specific statistical properties:

- **Codon frequencies**: Sparse, non-negative, and summing to 1. Use **scalar
  quantization with log transform**: `q = round(log2(freq + epsilon) * scale)`.
  An 8-bit code covers about 6 orders of magnitude (roughly 20 octaves, or
  about 12 levels per octave).

- **K-mer spectra**: Very sparse (most 6-mers absent in short reads). Use
  **sparse encoding**: store only non-zero k-mer indices + values. Typical
  compression: 20-50x over dense.

- **Learned embeddings**: Gaussian-distributed. Standard PQ works well.
  M=32 subspaces, K=256 centroids (8-bit codes).

### Metadata Schema

```json
{
  "type": "rvdna",
  "fields": {
    "organism": { "type": "string", "indexed": true },
    "gene_id": { "type": "string", "indexed": true },
    "chromosome": { "type": "string", "indexed": true },
    "position_start": { "type": "u64", "indexed": true },
    "position_end": { "type": "u64", "indexed": true },
    "strand": { "type": "enum", "values": ["+", "-"] },
    "quality_score": { "type": "f32" },
    "source_format": { "type": "enum", "values": ["FASTA", "FASTQ", "BAM", "VCF"] },
    "read_depth": { "type": "u32" },
    "gc_content": { "type": "f32" }
  }
}
```
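
To show how a schema of this shape might be applied, here is a small validator over a subset of the fields above. The unsigned-integer range checks and the error message format are assumptions made for this sketch. The schema only names the types and does not say how a reader should enforce them.

```python
# Subset of the RVDNA schema above, written as a Python dict.
SCHEMA = {
    "type": "rvdna",
    "fields": {
        "organism": {"type": "string", "indexed": True},
        "position_start": {"type": "u64", "indexed": True},
        "strand": {"type": "enum", "values": ["+", "-"]},
        "quality_score": {"type": "f32"},
        "read_depth": {"type": "u32"},
    },
}

UINT_BITS = {"u16": 16, "u32": 32, "u64": 64}


def is_valid(value, spec):
    """Check one value against one field spec."""
    kind = spec["type"]
    if kind == "string":
        return isinstance(value, str)
    if kind == "enum":
        return value in spec["values"]
    if isinstance(value, bool):  # bool is an int subclass; reject it for numbers
        return False
    if kind == "f32":
        return isinstance(value, (int, float))
    if kind in UINT_BITS:
        return isinstance(value, int) and 0 <= value < 2 ** UINT_BITS[kind]
    return False


def validate(record, schema=SCHEMA):
    """Return a list of error strings; an empty list means the record conforms."""
    errors = []
    for name, value in record.items():
        spec = schema["fields"].get(name)
        if spec is None:
            errors.append(f"{name}: not in schema")
        elif not is_valid(value, spec):
            errors.append(f"{name}: expected {spec['type']}")
    return errors
```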

### Query Preprocessing

For RVDNA queries:
1. Input: Raw sequence string (ACGT...)
2. Compute k-mer frequency spectrum
3. Apply log transform for codon/k-mer queries
4. Normalize to unit length for cosine metrics
5. Encode as fp16 vector
6. Submit to RVF query path
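
A minimal sketch of steps 1, 2, 4, and 5 for a 6-mer query follows. It assumes a lexicographic k-mer index over `ACGT` and little-endian fp16 output. Neither is specified in this section. It also skips the log transform of step 3 and silently drops windows containing ambiguity codes such as `N`.

```python
import math
import struct
from itertools import product

K = 6
# Assumed ordering: lexicographic over ACGT, so AAAAAA -> 0 and TTTTTT -> 4095.
KMER_INDEX = {"".join(p): i for i, p in enumerate(product("ACGT", repeat=K))}


def kmer_spectrum(sequence):
    """Relative frequency of every 6-mer in the sequence (4096 dimensions)."""
    seq = sequence.upper()
    counts = [0.0] * len(KMER_INDEX)
    total = 0
    for i in range(len(seq) - K + 1):
        idx = KMER_INDEX.get(seq[i:i + K])
        if idx is not None:  # windows with N or other symbols are skipped
            counts[idx] += 1.0
            total += 1
    return [c / total for c in counts] if total else counts


def l2_normalize(vec):
    """Scale to unit length so that dot product equals cosine similarity."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else list(vec)


def encode_fp16(vec):
    """Pack floats as IEEE 754 half precision (struct format 'e')."""
    return struct.pack(f"<{len(vec)}e", *vec)


def preprocess_query(sequence):
    return encode_fp16(l2_normalize(kmer_spectrum(sequence)))
```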

## 4. RVText Profile (Language)

### Profile Declaration

```
profile_magic:    0x52545854 ("RTXT")
profile_id:       0x02
profile_name:     "rvtext"
```

### Semantic Types

| Granularity | Dimensions | Source | Use Case |
|------------|-----------|--------|----------|
| Token | 768-1536 | Transformer last hidden state | Semantic search |
| Sentence | 384-768 | Sentence transformer pooled output | Document retrieval |
| Paragraph | 384-1024 | Long-context model embedding | Passage ranking |
| Document | 256-512 | Document-level embedding | Collection search |
| Sparse | 30522 | BM25/SPLADE term weights | Lexical matching |

### Distance Metrics

```
Dense embeddings:    Cosine similarity (normalized dot product)
Sparse (SPLADE):     Dot product on sparse vectors
Hybrid:              alpha * dense_score + (1-alpha) * sparse_score
Matryoshka:          Cosine on truncated prefix (adaptive dimensionality)
```
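
The four metrics above reduce to a handful of small functions. The sketch below represents sparse vectors as `{term_id: weight}` dicts and defaults `alpha` to 0.5. Both are choices made for illustration, since the profile does not fix either.

```python
import math


def cosine(a, b):
    """Cosine similarity of two dense vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def sparse_dot(a, b):
    """Dot product of two sparse vectors stored as {term_id: weight}."""
    if len(a) > len(b):
        a, b = b, a  # iterate over the smaller dict
    return sum(w * b.get(term, 0.0) for term, w in a.items())


def hybrid_score(dense_score, sparse_score, alpha=0.5):
    """alpha * dense_score + (1 - alpha) * sparse_score."""
    return alpha * dense_score + (1.0 - alpha) * sparse_score


def matryoshka_cosine(a, b, dims):
    """Cosine similarity computed on the first `dims` dimensions only."""
    return cosine(a[:dims], b[:dims])
```

Dense cosine scores lie in [-1, 1] while sparse dot products are unbounded, so in practice the two scores usually need to be brought onto a comparable scale before mixing. How that is done is outside what this section specifies.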

### Quantization Strategy

Text embeddings are well-suited to aggressive quantization:

- **Dense (384-768 dim)**: Binary quantization achieves 0.95+ recall on
  normalized embeddings. 384 dims -> 48 bytes. Use binary for cold tier,
  int8 for hot.

- **Sparse (SPLADE)**: Store as sorted (term_id, weight) pairs with
  delta-encoded term_ids. Typical sparsity: 100-300 non-zero terms out
  of 30K vocabulary. Compression: ~100x over dense.

- **Matryoshka**: Store full-dimension vectors but index only the first
  D/4 dimensions. Progressive refinement uses more dimensions.
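
The byte counts above follow directly from the encodings. Below is a minimal sketch of sign-based binary quantization (one bit per dimension, so 384 dimensions pack into 48 bytes) and of delta-encoding sorted term ids. The bit order within a byte and the function names are assumptions made for this example.

```python
def binarize(vec):
    """Keep one sign bit per dimension, packed 8 dimensions per byte."""
    out = bytearray((len(vec) + 7) // 8)
    for i, v in enumerate(vec):
        if v > 0:
            out[i >> 3] |= 1 << (i & 7)  # assumed order: least significant bit first
    return bytes(out)


def hamming(a, b):
    """Number of differing bits; a cheap proxy for distance between binary codes."""
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))


def delta_encode(pairs):
    """Sort (term_id, weight) pairs and replace each id with the gap to the previous id."""
    out, prev = [], 0
    for term_id, weight in sorted(pairs):
        out.append((term_id - prev, weight))
        prev = term_id
    return out


def delta_decode(encoded):
    """Rebuild absolute term ids from the gaps."""
    out, current = [], 0
    for gap, weight in encoded:
        current += gap
        out.append((current, weight))
    return out
```

The gaps are small integers, which is what makes a variable-length integer encoding effective on top of this step.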

### Metadata Schema

```json
{
  "type": "rvtext",
  "fields": {
    "text": { "type": "string", "stored": true, "max_length": 8192 },
    "source_url": { "type": "string", "indexed": true },
    "language": { "type": "string", "indexed": true },
    "model_id": { "type": "string" },
    "chunk_index": { "type": "u32" },
    "total_chunks": { "type": "u32" },
    "token_count": { "type": "u32" },
    "timestamp": { "type": "u64" }
  }
}
```

### Query Preprocessing

1. Input: Raw text string
2. Tokenize with model-specific tokenizer
3. Encode through embedding model (or receive pre-computed embedding)
4. L2-normalize for cosine similarity
5. Optionally: compute SPLADE sparse expansion
6. Submit dense + sparse to hybrid query path

## 5. RVGraph Profile (Networks)

### Profile Declaration

```
profile_magic:    0x52475248 ("RGRH")
profile_id:       0x03
profile_name:     "rvgraph"
```

### Semantic Types

| Granularity | Dimensions | Source | Use Case |
|------------|-----------|--------|----------|
| Node | 64-256 | Node2Vec / GCN embedding | Node similarity |
| Edge | 64-128 | Edge feature embedding | Link prediction |
| Subgraph | 128-512 | Graph kernel embedding | Subgraph matching |
| Community | 64-256 | Community embedding | Community detection |
| Spectral | 32-128 | Laplacian eigenvectors | Graph structure |

### Distance Metrics

```
Node embedding:      L2 distance
Edge embedding:      Cosine similarity
Subgraph:            Wasserstein distance (approximated by L2 on sorted features)
Community:           Cosine similarity
Spectral:            L2 on normalized eigenvectors
```
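
The subgraph approximation can be sketched as follows, assuming both feature vectors have the same length. For two equal-sized sets of scalar samples, the L2 distance between the sorted values equals the 2-Wasserstein distance up to a factor of `sqrt(n)`, which is why sorting is a reasonable stand-in here. The function name is illustrative.

```python
import math


def sorted_l2(a, b):
    """L2 distance between the sorted feature values of two subgraphs."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(sorted(a), sorted(b))))
```

Because sorting discards the position of each feature, the result is invariant to feature order, which matches the permutation invariance expected of a subgraph comparison.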

### Integration with Overlay System

RVGraph uniquely integrates with the RVF overlay epoch system:

- **Graph structure** is stored in OVERLAY_SEGs (not just as metadata)
- **Node embeddings** are stored in VEC_SEGs
- **Edge weights** are overlay deltas
- **Community assignments** are partition summaries
- **Min-cut witnesses** directly serve graph partitioning queries

This means RVGraph files are simultaneously vector stores AND graph databases.
The overlay system provides dynamic graph operations (add/remove edges,
rebalance partitions) while the vector system provides similarity search.

### Metadata Schema

```json
{
  "type": "rvgraph",
  "fields": {
    "node_type": { "type": "string", "indexed": true },
    "edge_type": { "type": "string", "indexed": true },
    "node_label": { "type": "string", "indexed": true },
    "degree": { "type": "u32", "indexed": true },
    "community_id": { "type": "u32", "indexed": true },
    "pagerank": { "type": "f32" },
    "clustering_coeff": { "type": "f32" },
    "source_graph": { "type": "string" }
  }
}
```

## 6. RVVision Profile (Imagery)

### Profile Declaration

```
profile_magic:    0x52564953 ("RVIS")
profile_id:       0x04
profile_name:     "rvvision"
```

### Semantic Types

| Granularity | Dimensions | Source | Use Case |
|------------|-----------|--------|----------|
| Patch | 64-256 | ViT patch embedding | Region search |
| Image | 512-2048 | CLIP / DINOv2 global embedding | Image retrieval |
| Object | 256-512 | Object detection crop embedding | Object search |
| Scene | 128-512 | Scene classification embedding | Scene matching |
| Multi-scale | 256 * N | Pyramid of embeddings at N scales | Scale-invariant search |

### Distance Metrics

```
CLIP embedding:      Cosine similarity (model-normalized)
DINOv2:              Cosine similarity
Patch:               L2 distance (not normalized)
Multi-scale:         Weighted sum of per-scale cosine similarities
```
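
For the multi-scale metric, a minimal sketch is shown below. It treats each item as a list of per-scale vectors and takes the weights as a caller-supplied list. The weight values are not defined in this section, so they are purely an input here.

```python
import math


def cosine(a, b):
    """Cosine similarity of two dense vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def multiscale_similarity(query_scales, item_scales, weights):
    """Weighted sum of cosine similarities, one term per pyramid scale."""
    if not (len(query_scales) == len(item_scales) == len(weights)):
        raise ValueError("need one vector and one weight per scale")
    return sum(w * cosine(q, x)
               for w, q, x in zip(weights, query_scales, item_scales))
```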

### Quantization Strategy

Vision embeddings have high intrinsic dimensionality but are compressible:

- **CLIP (512-dim)**: PQ with M=64, K=256 works well. Binary quantization
  achieves 0.90+ recall.

- **DINOv2 (768-dim)**: Similar to CLIP. PQ M=96, K=256.

- **Patch embeddings**: Large volume (196+ patches per image). Aggressive
  quantization to 4-bit scalar. Use residual PQ for high-recall applications.
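
As an example of the 4-bit scalar option for patch embeddings, the sketch below uses a per-vector min/max range and packs two codes per byte. The per-vector range and the nibble order are assumptions made for this example. A real encoder might instead use per-dimension or globally trained ranges.

```python
def quantize_4bit(vec):
    """Return (lo, scale, packed) with two 4-bit codes stored per byte."""
    lo, hi = min(vec), max(vec)
    scale = (hi - lo) / 15 or 1.0  # fall back to 1.0 for constant vectors
    packed = bytearray((len(vec) + 1) // 2)
    for i, v in enumerate(vec):
        code = min(15, max(0, round((v - lo) / scale)))
        packed[i >> 1] |= code << (4 * (i & 1))  # assumed order: low nibble first
    return lo, scale, bytes(packed)


def dequantize_4bit(lo, scale, packed, n):
    """Rebuild n approximate values from the packed codes."""
    return [lo + ((packed[i >> 1] >> (4 * (i & 1))) & 0xF) * scale
            for i in range(n)]
```

A 256-dimensional patch embedding shrinks to 128 bytes plus the two range parameters, with a worst-case error of half a quantization step per dimension.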

### Spatial Metadata

RVVision supports spatial queries through metadata:

```json
{
  "type": "rvvision",
  "fields": {
    "image_id": { "type": "string", "indexed": true },
    "patch_row": { "type": "u16" },
    "patch_col": { "type": "u16" },
    "scale": { "type": "f32" },
    "bbox_x": { "type": "f32" },
    "bbox_y": { "type": "f32" },
    "bbox_w": { "type": "f32" },
    "bbox_h": { "type": "f32" },
    "object_class": { "type": "string", "indexed": true },
    "confidence": { "type": "f32" },
    "model_id": { "type": "string" }
  }
}
```

## 7. Custom Profile Registration

New profiles can be registered by writing a PROFILE_SEG:

```
1. Choose a unique profile_id (0x10-0xEF for custom profiles)
2. Define a 4-byte profile_magic
3. Define metadata schema
4. Define distance metric configuration
5. Define quantization recommendations
6. Write PROFILE_SEG into the RVF file
7. Set profile_id in root manifest
```

The profile system is open — any domain can define its own profile as long
as it maps onto the RVF substrate. The substrate does not need to understand
the domain semantics; it only needs to store vectors, compute distances,
and maintain indexes.
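
Steps 1 and 2 can be checked mechanically. The helpers below derive a magic number from a 4-character ASCII tag in the same way the built-in profiles appear to (for example `"RDNA"` gives `0x52444E41`) and verify that an id falls in the custom range. The function names are hypothetical and not part of any RVF tooling.

```python
CUSTOM_ID_RANGE = range(0x10, 0xF0)  # 0x10 through 0xEF inclusive


def magic_from_tag(tag):
    """Turn a 4-character ASCII tag into a 32-bit magic, first character highest."""
    raw = tag.encode("ascii")
    if len(raw) != 4:
        raise ValueError("profile_magic tag must be exactly 4 ASCII characters")
    return int.from_bytes(raw, "big")


def check_custom_profile_id(profile_id):
    """Return the id unchanged if it lies in the custom range, else raise."""
    if profile_id not in CUSTOM_ID_RANGE:
        raise ValueError(f"0x{profile_id:02X} is outside the custom range 0x10-0xEF")
    return profile_id
```

These checks cannot detect a collision with another custom profile that happens to use the same id, because this section describes no central registry.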

## 8. Cross-Profile Queries

RVF files with different profiles can be queried together if their vectors
share a compatible embedding space. This is common in multimodal applications:

```
Query: "Find images similar to this text description"

1. Text embedding (RVText profile) -> 512-dim CLIP text vector
2. Image database (RVVision profile) -> 512-dim CLIP image vectors
3. Distance metric: Cosine similarity (shared CLIP space)
4. Result: Images ranked by text-image similarity
```

The query path treats both files as RVF files. The profile only affects
preprocessing and metadata interpretation — the core distance computation
and indexing are profile-agnostic.
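
The text-to-image example can be sketched as a brute-force ranking. The code assumes the vectors from both files come from the same embedding model and have equal dimension. It normalizes them and ranks by dot product, which equals cosine similarity on unit vectors. A real query would go through the index rather than scanning every candidate, and `rank_cross_profile` is an illustrative name.

```python
import math


def l2_normalize(vec):
    """Scale to unit length; zero vectors are returned unchanged."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else list(vec)


def rank_cross_profile(query_vec, candidates, top_k=10):
    """Rank (item_id, vector) pairs from another profile against a query vector."""
    q = l2_normalize(query_vec)
    scored = []
    for item_id, vec in candidates:
        if len(vec) != len(q):
            raise ValueError(f"{item_id}: dimension mismatch, spaces are not compatible")
        score = sum(a * b for a, b in zip(q, l2_normalize(vec)))
        scored.append((item_id, score))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]
```

The dimension check catches only the most obvious incompatibility. Two models can emit vectors of the same size in unrelated spaces, so matching dimensions is necessary but not sufficient.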

## 9. Profile Compatibility Matrix

| Source Profile | Target Profile | Compatible? | Condition |
|---------------|---------------|------------|-----------|
| RVDNA | RVDNA | Yes | Same granularity |
| RVText | RVText | Yes | Same model or compatible space |
| RVVision | RVVision | Yes | Same model or compatible space |
| RVText | RVVision | Yes | If both use CLIP or shared space |
| RVDNA | RVText | No* | Unless mapped through protein language model |
| RVGraph | Any | Partial | Node embeddings may share space |

*Cross-domain compatibility requires explicit embedding space alignment,
which is outside the scope of the format spec but enabled by it.