Merge commit 'd803bfe2b1fe7f5e219e50ac20d6801a0a58ac75' as 'vendor/ruvector'
vendor/ruvector/docs/research/rvf/profiles/domain-profiles.md
# RVF Domain Profiles

## 1. Profile Architecture

A domain profile is a **semantic overlay** on the universal RVF substrate. It does
not change the wire format: every profile-specific file is a valid RVF file. The
profile adds:

1. **Semantic type annotations** for vector dimensions
2. **Domain-specific distance metrics**
3. **Custom quantization strategies** optimized for the domain
4. **Metadata schemas** for domain-specific labels and provenance
5. **Query preprocessing** conventions

Profiles are declared in a PROFILE_SEG and referenced by the root manifest's
`profile_id` field.

```
+-- RVF Universal Substrate --+
|  Segments, manifests, tiers |
|  HNSW index, overlays       |
|  Temperature, compaction    |
+-----------------------------+
             |
             | profile_id
             v
+-- Domain Profile Layer --+
|  Semantic types          |
|  Custom distances        |
|  Metadata schema         |
|  Query conventions       |
+--------------------------+
```

## 2. PROFILE_SEG Binary Layout

```
Offset  Size  Field                Description
------  ----  -----                -----------
0x00    4     profile_magic        Profile-specific magic number
0x04    2     profile_version      Profile spec version
0x06    2     profile_id           Same as root manifest profile_id
0x08    32    profile_name         UTF-8 null-terminated name
0x28    8     schema_length        Length of metadata schema
0x30    var   metadata_schema      JSON or binary schema for META_SEG entries
var     8     distance_config_len  Length of distance configuration
var     var   distance_config      Distance metric parameters
var     8     quant_config_len     Length of quantization configuration
var     var   quant_config         Domain-specific quantization parameters
var     8     preprocess_len       Length of preprocessing spec
var     var   preprocess_spec      Query preprocessing pipeline description
```
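
The layout above can be read with a fixed 0x30-byte prefix followed by three length-prefixed blobs. The following is a minimal sketch rather than the reference parser: the byte order (little-endian), the unsigned integer widths, and every name in the code (`HEADER`, `ProfileSeg`, `parse_profile_seg`) are assumptions made for illustration, since the table does not specify them.

```python
import struct
from dataclasses import dataclass

# Fixed-size prefix from the layout table: magic (4), version (2),
# profile_id (2), name (32), schema_length (8) = 0x30 bytes.
# ASSUMPTION: little-endian unsigned integers; the table does not say.
HEADER = struct.Struct("<IHH32sQ")


@dataclass
class ProfileSeg:
    magic: int
    version: int
    profile_id: int
    name: str
    metadata_schema: bytes
    distance_config: bytes
    quant_config: bytes
    preprocess_spec: bytes


def _take(buf, offset, length):
    """Return (buf[offset:offset+length], offset just past it)."""
    end = offset + length
    if end > len(buf):
        raise ValueError("field runs past the end of the segment")
    return buf[offset:end], end


def _take_prefixed(buf, offset):
    """Read an 8-byte length, then that many payload bytes."""
    raw, offset = _take(buf, offset, 8)
    (length,) = struct.unpack("<Q", raw)
    return _take(buf, offset, length)


def parse_profile_seg(buf):
    """Hypothetical parser for the PROFILE_SEG payload described above."""
    if len(buf) < HEADER.size:
        raise ValueError("segment shorter than the fixed header")
    magic, version, profile_id, raw_name, schema_len = HEADER.unpack_from(buf, 0)
    # The name is null-terminated inside its 32-byte slot.
    name = raw_name.split(b"\x00", 1)[0].decode("utf-8")
    schema, offset = _take(buf, HEADER.size, schema_len)
    distance, offset = _take_prefixed(buf, offset)
    quant, offset = _take_prefixed(buf, offset)
    preprocess, offset = _take_prefixed(buf, offset)
    return ProfileSeg(magic, version, profile_id, name,
                      schema, distance, quant, preprocess)
```

Reading the variable-length fields through one bounds-checked helper keeps a truncated or corrupt segment from being silently parsed as shorter fields.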

## 3. RVDNA Profile (Genomics)

### Profile Declaration

```
profile_magic: 0x52444E41 ("RDNA")
profile_id:    0x01
profile_name:  "rvdna"
```

### Semantic Types

RVDNA vectors encode biological sequences at multiple granularities:

| Granularity | Dimensions | Encoding | Use Case |
|------------|-----------|----------|----------|
| Codon | 64 | Frequency of each codon in reading frame | Gene-level comparison |
| K-mer (k=6) | 4096 | 6-mer frequency spectrum | Species identification |
| Motif | 128-512 | Learned motif embeddings (transformer) | Regulatory element search |
| Structure | 256 | Protein secondary structure embedding | Fold similarity |
| Epigenetic | 384 | Methylation + histone mark embedding | Epigenomic comparison |

### Distance Metrics

```
Codon frequency:   Jensen-Shannon divergence (symmetric KL)
K-mer spectrum:    Cosine similarity (normalized frequency vectors)
Motif embedding:   L2 distance (Euclidean in learned space)
Structure:         L2 distance with structure-aware weighting
Epigenetic:        Weighted cosine (CpG density as weight)
```
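
As an illustration of the codon-frequency metric, the sketch below computes the Jensen-Shannon divergence between two frequency vectors by averaging the KL divergence of each input against their midpoint. The function names and the choice of base-2 logarithms (which bounds the result to the range 0 to 1) are assumptions, not something the profile spec fixes.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) in bits; terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence between two probability vectors."""
    if len(p) != len(q):
        raise ValueError("vectors must have the same length")
    # The midpoint is non-zero wherever p or q is non-zero,
    # so the KL terms below never divide by zero.
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)
```

Unlike plain KL divergence, this value is symmetric in its arguments and stays finite when one vector has zeros where the other does not, which matters for sparse codon frequencies.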

### Quantization Strategy

Genomic vectors have specific statistical properties:

- **Codon frequencies**: Sparse, non-negative, sum-to-1. Use **scalar quantization
  with log transform**: `q = round(log2(freq + epsilon) * scale)`. 8-bit covers
  6 orders of magnitude.

- **K-mer spectra**: Very sparse (most 6-mers absent in short reads). Use
  **sparse encoding**: store only non-zero k-mer indices + values. Typical
  compression: 20-50x over dense.

- **Learned embeddings**: Gaussian-distributed. Standard PQ works well.
  M=32 subspaces, K=256 centroids (8-bit codes).
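
The log-transform quantizer for codon frequencies can be sketched as follows. The formula in the text gives negative values for frequencies below 1, so this sketch adds an offset to land in the unsigned 8-bit range; that offset, the `EPSILON` value of 1e-6 (chosen to span six orders of magnitude), and the function names are assumptions for illustration.

```python
import math

EPSILON = 1e-6                    # assumed floor, six decades below 1.0
LOG_FLOOR = math.log2(EPSILON)    # about -19.93
SCALE = 255.0 / -LOG_FLOOR        # maps [LOG_FLOOR, 0] onto [0, 255]


def quantize_log(freq):
    """Map a frequency in [0, 1] to an 8-bit code on a log2 scale."""
    code = round((math.log2(freq + EPSILON) - LOG_FLOOR) * SCALE)
    return max(0, min(255, code))


def dequantize_log(code):
    """Approximate inverse of quantize_log."""
    return max(0.0, 2.0 ** (code / SCALE + LOG_FLOOR) - EPSILON)
```

Because the codes are evenly spaced in log space, the relative error is roughly constant across the whole range, so rare codons keep about the same precision as common ones.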

### Metadata Schema

```json
{
  "type": "rvdna",
  "fields": {
    "organism": { "type": "string", "indexed": true },
    "gene_id": { "type": "string", "indexed": true },
    "chromosome": { "type": "string", "indexed": true },
    "position_start": { "type": "u64", "indexed": true },
    "position_end": { "type": "u64", "indexed": true },
    "strand": { "type": "enum", "values": ["+", "-"] },
    "quality_score": { "type": "f32" },
    "source_format": { "type": "enum", "values": ["FASTA", "FASTQ", "BAM", "VCF"] },
    "read_depth": { "type": "u32" },
    "gc_content": { "type": "f32" }
  }
}
```

### Query Preprocessing

For RVDNA queries:

1. Input: Raw sequence string (ACGT...)
2. Compute k-mer frequency spectrum
3. Apply log transform for codon/k-mer queries
4. Normalize to unit length for cosine metrics
5. Encode as fp16 vector
6. Submit to RVF query path
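
Steps 2, 4, and 5 of the pipeline above can be sketched as follows; the log transform of step 3 is left out to keep the example short. The base-to-index mapping (A=0, C=1, G=2, T=3), the decision to skip k-mers containing other characters such as N, and the little-endian fp16 encoding are assumptions made for this sketch.

```python
import math
import struct

# ASSUMPTION: base order used to turn a k-mer into a vector index.
BASE_INDEX = {"A": 0, "C": 1, "G": 2, "T": 3}


def kmer_spectrum(seq, k=6):
    """Return a unit-length k-mer frequency vector of size 4**k."""
    counts = [0.0] * (4 ** k)
    seq = seq.upper()
    total = 0
    for i in range(len(seq) - k + 1):
        index = 0
        for base in seq[i:i + k]:
            code = BASE_INDEX.get(base)
            if code is None:
                break  # skip k-mers with ambiguous bases such as N
            index = index * 4 + code
        else:
            counts[index] += 1
            total += 1
    if total == 0:
        return counts
    freqs = [c / total for c in counts]
    norm = math.sqrt(sum(f * f for f in freqs))
    return [f / norm for f in freqs]


def to_fp16_bytes(vec):
    """Encode a vector as little-endian IEEE 754 half-precision floats."""
    return struct.pack(f"<{len(vec)}e", *vec)
```

With the default `k=6` the vector has 4096 entries, matching the K-mer row of the semantic types table.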

## 4. RVText Profile (Language)

### Profile Declaration

```
profile_magic: 0x52545854 ("RTXT")
profile_id:    0x02
profile_name:  "rvtext"
```

### Semantic Types

| Granularity | Dimensions | Source | Use Case |
|------------|-----------|--------|----------|
| Token | 768-1536 | Transformer last hidden state | Semantic search |
| Sentence | 384-768 | Sentence transformer pooled output | Document retrieval |
| Paragraph | 384-1024 | Long-context model embedding | Passage ranking |
| Document | 256-512 | Document-level embedding | Collection search |
| Sparse | 30522 | BM25/SPLADE term weights | Lexical matching |

### Distance Metrics

```
Dense embeddings:  Cosine similarity (normalized dot product)
Sparse (SPLADE):   Dot product on sparse vectors
Hybrid:            alpha * dense_score + (1-alpha) * sparse_score
Matryoshka:        Cosine on truncated prefix (adaptive dimensionality)
```
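
The hybrid formula can be made concrete with a short sketch that combines a dense cosine score with a sparse dot product. Representing sparse vectors as `{term_id: weight}` dictionaries and defaulting `alpha` to 0.5 are assumptions for illustration; the spec does not prescribe either.

```python
import math


def cosine(a, b):
    """Cosine similarity between two dense vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def sparse_dot(a, b):
    """Dot product of two sparse vectors stored as {term_id: weight}."""
    if len(a) > len(b):
        a, b = b, a  # iterate over the smaller dictionary
    return sum(weight * b[term] for term, weight in a.items() if term in b)


def hybrid_score(dense_q, dense_d, sparse_q, sparse_d, alpha=0.5):
    """alpha * dense_score + (1 - alpha) * sparse_score."""
    dense = cosine(dense_q, dense_d)
    sparse = sparse_dot(sparse_q, sparse_d)
    return alpha * dense + (1 - alpha) * sparse
```

Cosine scores are bounded by 1 while sparse dot products are not, so in practice the two scores may need to be brought onto a comparable scale before `alpha` has an intuitive meaning.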

### Quantization Strategy

Text embeddings are well-suited to aggressive quantization:

- **Dense (384-768 dim)**: Binary quantization achieves 0.95+ recall on
  normalized embeddings. 384 dims -> 48 bytes. Use binary for cold tier,
  int8 for hot.

- **Sparse (SPLADE)**: Store as sorted (term_id, weight) pairs with
  delta-encoded term_ids. Typical sparsity: 100-300 non-zero terms out
  of 30K vocabulary. Compression: ~100x over dense.

- **Matryoshka**: Store full-dimension vectors but index only the first
  D/4 dimensions. Progressive refinement uses more dimensions.
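
The "384 dims -> 48 bytes" figure follows from storing one bit per dimension. A minimal sketch of that idea is shown below, assuming the bit is simply the sign of each component and that candidates are compared by Hamming distance; the spec does not state the thresholding rule, so both choices and the function names are assumptions.

```python
def binarize(vec):
    """Pack one sign bit per dimension, most significant bit first."""
    bits = 0
    for x in vec:
        bits = (bits << 1) | (1 if x > 0 else 0)
    nbytes = (len(vec) + 7) // 8
    # Left-align the bits when the dimension is not a multiple of 8.
    bits <<= nbytes * 8 - len(vec)
    return bits.to_bytes(nbytes, "big")


def hamming(a, b):
    """Number of differing bits between two equal-length codes."""
    if len(a) != len(b):
        raise ValueError("codes must have the same length")
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))
```

A common pattern is to shortlist candidates by Hamming distance on these codes and then rescore the shortlist with a higher-precision representation such as int8.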

### Metadata Schema

```json
{
  "type": "rvtext",
  "fields": {
    "text": { "type": "string", "stored": true, "max_length": 8192 },
    "source_url": { "type": "string", "indexed": true },
    "language": { "type": "string", "indexed": true },
    "model_id": { "type": "string" },
    "chunk_index": { "type": "u32" },
    "total_chunks": { "type": "u32" },
    "token_count": { "type": "u32" },
    "timestamp": { "type": "u64" }
  }
}
```

### Query Preprocessing

1. Input: Raw text string
2. Tokenize with model-specific tokenizer
3. Encode through embedding model (or receive pre-computed embedding)
4. L2-normalize for cosine similarity
5. Optionally: compute SPLADE sparse expansion
6. Submit dense + sparse to hybrid query path

## 5. RVGraph Profile (Networks)

### Profile Declaration

```
profile_magic: 0x52475248 ("RGRH")
profile_id:    0x03
profile_name:  "rvgraph"
```

### Semantic Types

| Granularity | Dimensions | Source | Use Case |
|------------|-----------|--------|----------|
| Node | 64-256 | Node2Vec / GCN embedding | Node similarity |
| Edge | 64-128 | Edge feature embedding | Link prediction |
| Subgraph | 128-512 | Graph kernel embedding | Subgraph matching |
| Community | 64-256 | Community embedding | Community detection |
| Spectral | 32-128 | Laplacian eigenvectors | Graph structure |

### Distance Metrics

```
Node embedding:    L2 distance
Edge embedding:    Cosine similarity
Subgraph:          Wasserstein distance (approximated by L2 on sorted features)
Community:         Cosine similarity
Spectral:          L2 on normalized eigenvectors
```
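
The subgraph approximation can be sketched in a few lines. Treating each feature vector as a bag of one-dimensional samples, sorting both and comparing them position by position pairs values in rank order, which is how optimal transport matches equal-sized samples on a line. The function name is hypothetical, and the sketch assumes both vectors have the same length.

```python
import math


def sorted_l2(a, b):
    """L2 distance between the sorted values of two feature vectors."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(sorted(a), sorted(b))))
```

Because sorting discards the position of each feature, the result is invariant to feature order, which is the property that makes it behave like a distance between distributions rather than between points.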

### Integration with Overlay System

RVGraph uniquely integrates with the RVF overlay epoch system:

- **Graph structure** is stored in OVERLAY_SEGs (not just as metadata)
- **Node embeddings** are stored in VEC_SEGs
- **Edge weights** are overlay deltas
- **Community assignments** are partition summaries
- **Min-cut witnesses** directly serve graph partitioning queries

This means RVGraph files are simultaneously vector stores AND graph databases.
The overlay system provides dynamic graph operations (add/remove edges,
rebalance partitions) while the vector system provides similarity search.

### Metadata Schema

```json
{
  "type": "rvgraph",
  "fields": {
    "node_type": { "type": "string", "indexed": true },
    "edge_type": { "type": "string", "indexed": true },
    "node_label": { "type": "string", "indexed": true },
    "degree": { "type": "u32", "indexed": true },
    "community_id": { "type": "u32", "indexed": true },
    "pagerank": { "type": "f32" },
    "clustering_coeff": { "type": "f32" },
    "source_graph": { "type": "string" }
  }
}
```

## 6. RVVision Profile (Imagery)

### Profile Declaration

```
profile_magic: 0x52564953 ("RVIS")
profile_id:    0x04
profile_name:  "rvvision"
```

### Semantic Types

| Granularity | Dimensions | Source | Use Case |
|------------|-----------|--------|----------|
| Patch | 64-256 | ViT patch embedding | Region search |
| Image | 512-2048 | CLIP / DINOv2 global embedding | Image retrieval |
| Object | 256-512 | Object detection crop embedding | Object search |
| Scene | 128-512 | Scene classification embedding | Scene matching |
| Multi-scale | 256 * N | Pyramid of embeddings at scales | Scale-invariant search |

### Distance Metrics

```
CLIP embedding:    Cosine similarity (model-normalized)
DINOv2:            Cosine similarity
Patch:             L2 distance (not normalized)
Multi-scale:       Weighted sum of per-scale cosine similarities
```
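
For the multi-scale metric, one plausible reading is that a `256 * N` vector is the concatenation of N per-scale embeddings, each compared with cosine similarity and combined with per-scale weights. The sketch below follows that reading; the concatenated layout, the `dim_per_scale` parameter, and the caller-supplied weights are assumptions, not part of the spec.

```python
import math


def cosine(a, b):
    """Cosine similarity between two dense vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def multiscale_similarity(a, b, weights, dim_per_scale=256):
    """Weighted sum of cosine similarities over concatenated scales."""
    expected = dim_per_scale * len(weights)
    if len(a) != expected or len(b) != expected:
        raise ValueError("vector length must equal dim_per_scale * number of scales")
    total = 0.0
    for scale, weight in enumerate(weights):
        lo = scale * dim_per_scale
        hi = lo + dim_per_scale
        total += weight * cosine(a[lo:hi], b[lo:hi])
    return total
```

If the weights sum to 1, the combined score stays in the same range as a single cosine similarity, which keeps thresholds comparable across profiles.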

### Quantization Strategy

Vision embeddings have high intrinsic dimensionality but are compressible:

- **CLIP (512-dim)**: PQ with M=64, K=256 works well. Binary quantization
  achieves 0.90+ recall.

- **DINOv2 (768-dim)**: Similar to CLIP. PQ M=96, K=256.

- **Patch embeddings**: Large volume (196+ patches per image). Aggressive
  quantization to 4-bit scalar. Use residual PQ for high-recall applications.
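
A 4-bit scalar quantizer for patch embeddings can be sketched as a uniform 16-level grid with two codes packed per byte. The fixed `lo`/`hi` clipping range supplied by the caller, the high-nibble-first packing order, and the function names are assumptions for illustration only.

```python
def quantize_4bit(vec, lo, hi):
    """Quantize to 16 uniform levels in [lo, hi], two codes per byte."""
    span = hi - lo
    if span <= 0:
        raise ValueError("hi must be greater than lo")
    codes = []
    for x in vec:
        clipped = min(max(x, lo), hi)
        codes.append(round((clipped - lo) / span * 15))
    if len(codes) % 2:
        codes.append(0)  # pad odd-length vectors to a whole byte
    return bytes((codes[i] << 4) | codes[i + 1] for i in range(0, len(codes), 2))


def dequantize_4bit(packed, n, lo, hi):
    """Recover n approximate values from the packed 4-bit codes."""
    span = hi - lo
    out = []
    for byte in packed:
        out.append(lo + (byte >> 4) / 15 * span)
        out.append(lo + (byte & 0x0F) / 15 * span)
    return out[:n]
```

At this precision a 256-dimensional patch embedding occupies 128 bytes, an eighth of its fp32 size.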

### Spatial Metadata

RVVision supports spatial queries through metadata:

```json
{
  "type": "rvvision",
  "fields": {
    "image_id": { "type": "string", "indexed": true },
    "patch_row": { "type": "u16" },
    "patch_col": { "type": "u16" },
    "scale": { "type": "f32" },
    "bbox_x": { "type": "f32" },
    "bbox_y": { "type": "f32" },
    "bbox_w": { "type": "f32" },
    "bbox_h": { "type": "f32" },
    "object_class": { "type": "string", "indexed": true },
    "confidence": { "type": "f32" },
    "model_id": { "type": "string" }
  }
}
```

## 7. Custom Profile Registration

New profiles can be registered by writing a PROFILE_SEG:

```
1. Choose a unique profile_id (0x10-0xEF for custom profiles)
2. Define a 4-byte profile_magic
3. Define metadata schema
4. Define distance metric configuration
5. Define quantization recommendations
6. Write PROFILE_SEG into the RVF file
7. Set profile_id in root manifest
```
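
Steps 1 through 5 amount to assembling the bytes of a PROFILE_SEG payload. The sketch below serializes one following the Section 2 layout and rejects profile IDs outside the custom range. It is an illustration under stated assumptions: little-endian integers, a JSON-encoded schema, and the hypothetical `rvaudio` profile with magic `0x52415544` ("RAUD") are all invented for the example, and writing the segment into an RVF file (step 6) is not shown.

```python
import json
import struct

CUSTOM_ID_RANGE = range(0x10, 0xF0)  # 0x10-0xEF inclusive, per step 1


def _prefixed(payload):
    """Prepend an 8-byte length to a variable-size field."""
    return struct.pack("<Q", len(payload)) + payload


def build_profile_seg(profile_id, magic, name, schema,
                      distance_config=b"", quant_config=b"",
                      preprocess_spec=b"", version=1):
    """Serialize a PROFILE_SEG payload (byte order is an assumption)."""
    if profile_id not in CUSTOM_ID_RANGE:
        raise ValueError("custom profile_id must be in 0x10-0xEF")
    if not 0 <= magic <= 0xFFFFFFFF:
        raise ValueError("profile_magic must fit in 4 bytes")
    raw_name = name.encode("utf-8")
    if len(raw_name) > 31:
        raise ValueError("profile_name must leave room for the terminator")
    schema_bytes = json.dumps(schema, sort_keys=True).encode("utf-8")
    # "32s" pads the name with null bytes up to the 32-byte slot.
    header = struct.pack("<IHH32sQ", magic, version, profile_id,
                         raw_name, len(schema_bytes))
    return (header + schema_bytes + _prefixed(distance_config)
            + _prefixed(quant_config) + _prefixed(preprocess_spec))
```

For example, `build_profile_seg(0x10, 0x52415544, "rvaudio", {"type": "rvaudio", "fields": {}})` produces a payload whose first 0x30 bytes match the fixed header, followed by the schema and three empty length-prefixed fields.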

The profile system is open: any domain can define its own profile as long
as it maps onto the RVF substrate. The substrate does not need to understand
the domain semantics; it only needs to store vectors, compute distances,
and maintain indexes.

## 8. Cross-Profile Queries

RVF files with different profiles can be queried together if their vectors
share a compatible embedding space. This is common in multimodal applications:

```
Query: "Find images similar to this text description"

1. Text embedding (RVText profile) -> 512-dim CLIP text vector
2. Image database (RVVision profile) -> 512-dim CLIP image vectors
3. Distance metric: Cosine similarity (shared CLIP space)
4. Result: Images ranked by text-image similarity
```

The query path treats both files as RVF files. The profile only affects
preprocessing and metadata interpretation; the core distance computation
and indexing are profile-agnostic.
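
The text-to-image example reduces to ranking stored image vectors by cosine similarity to a query vector from the same embedding space. The sketch below shows that ranking with a brute-force scan; the in-memory dictionary standing in for an RVVision file, the function names, and the dimension check used as a stand-in for "compatible space" are assumptions, and a real query would go through the RVF index rather than a linear scan.

```python
import math


def cosine(a, b):
    """Cosine similarity; vectors must come from the same space."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_images(text_vec, image_vecs, top_k=10):
    """Return (image_id, score) pairs, best match first."""
    scored = [(cosine(text_vec, vec), image_id)
              for image_id, vec in image_vecs.items()]
    scored.sort(reverse=True)
    return [(image_id, score) for score, image_id in scored[:top_k]]
```

Matching dimensions are necessary but not sufficient: two 512-dimensional vectors from unrelated models will compare without error and still produce meaningless scores.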

## 9. Profile Compatibility Matrix

| Source Profile | Target Profile | Compatible? | Condition |
|---------------|---------------|------------|-----------|
| RVDNA | RVDNA | Yes | Same granularity |
| RVText | RVText | Yes | Same model or compatible space |
| RVVision | RVVision | Yes | Same model or compatible space |
| RVText | RVVision | Yes | If both use CLIP or shared space |
| RVDNA | RVText | No* | Unless mapped through protein language model |
| RVGraph | Any | Partial | Node embeddings may share space |

*Cross-domain compatibility requires explicit embedding space alignment,
which is outside the scope of the format spec but enabled by it.