RVF WASM Microkernel and Cognitum Hardware Mapping
1. Design Philosophy
RVF must run on hardware ranging from a 64 KB WASM tile to a petabyte cluster. The WASM microkernel is the minimal runtime that makes a tile a first-class RVF citizen — capable of answering queries, ingesting streams, and participating in distributed search.
The microkernel is not a shrunken version of the full runtime. It is a purpose-built execution core that exposes the exact set of operations a tile needs, and nothing more.
2. Cognitum Tile Architecture
Hardware Constraints
+-----------------------------------+
|           Cognitum Tile           |
|                                   |
|  Code Memory:   8 KB              |
|  Data Memory:   8 KB              |
|  SIMD Scratch:  64 KB             |
|  Registers:     v128 (WASM SIMD)  |
|  Clock:         ~1 GHz            |
|  Interconnect:  Mesh to hub       |
|                                   |
|  No filesystem. No mmap.          |
|  No allocator beyond scratch.     |
|  All I/O through hub messages.    |
+-----------------------------------+
Memory Map
Code (8 KB):
0x0000 - 0x0FFF Microkernel WASM bytecode (4 KB)
0x1000 - 0x17FF Distance function hot path (2 KB)
0x1800 - 0x1FFF Decode / quantization stubs (2 KB)
Data (8 KB):
0x0000 - 0x003F Tile configuration (64 B)
0x0040 - 0x00FF Query scratch (192 B: query vector fp16)
0x0100 - 0x01FF Result buffer (256 B: top-K candidates)
0x0200 - 0x03FF Routing table (512 B: entry points + centroids)
0x0400 - 0x07FF Decode workspace (1 KB)
0x0800 - 0x0FFF Message I/O buffer (2 KB)
0x1000 - 0x1FFF Neighbor list cache (4 KB)
SIMD Scratch (64 KB):
0x0000 - 0x7FFF Vector block (32 KB: up to 42 vectors @ 384-dim fp16, or 85 @ 384-dim i8)
0x8000 - 0xBFFF Distance accumulator / PQ tables (16 KB)
0xC000 - 0xEFFF Hot cache subset (12 KB)
0xF000 - 0xFFFF Temporary / spill (4 KB)
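The map above can be sanity-checked mechanically. The following is a minimal Python sketch, not part of the microkernel: the region tuples are transcribed from the map, and the helper names `check_contiguous` and `region_bytes` are hypothetical. It verifies that each address space is tiled without gaps or overlaps.

```python
# Regions transcribed from the memory map: (start, end_inclusive, label).
CODE = [
    (0x0000, 0x0FFF, "microkernel bytecode"),
    (0x1000, 0x17FF, "distance hot path"),
    (0x1800, 0x1FFF, "decode / quantization stubs"),
]
DATA = [
    (0x0000, 0x003F, "tile configuration"),
    (0x0040, 0x00FF, "query scratch"),
    (0x0100, 0x01FF, "result buffer"),
    (0x0200, 0x03FF, "routing table"),
    (0x0400, 0x07FF, "decode workspace"),
    (0x0800, 0x0FFF, "message I/O buffer"),
    (0x1000, 0x1FFF, "neighbor list cache"),
]
SCRATCH = [
    (0x0000, 0x7FFF, "vector block"),
    (0x8000, 0xBFFF, "distance accumulator / PQ tables"),
    (0xC000, 0xEFFF, "hot cache subset"),
    (0xF000, 0xFFFF, "temporary / spill"),
]


def check_contiguous(regions, total_bytes):
    """Return True if the regions tile [0, total_bytes) with no gap or overlap."""
    cursor = 0
    for start, end, _label in regions:
        if start != cursor:
            return False
        cursor = end + 1
    return cursor == total_bytes


def region_bytes(regions, label):
    """Size in bytes of the region with the given label."""
    for start, end, name in regions:
        if name == label:
            return end - start + 1
    raise KeyError(label)


for name, regions, total in (
    ("code", CODE, 8 * 1024),
    ("data", DATA, 8 * 1024),
    ("scratch", SCRATCH, 64 * 1024),
):
    print(name, "ok" if check_contiguous(regions, total) else "BROKEN")
```

A check like this is cheap to keep next to the map so that later edits to one region cannot silently shift its neighbors.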
Tile Budget
For 384-dim fp16 vectors:
- One vector: 768 bytes
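- SIMD scratch holds: 64 KB / 768 B = ~85 vectors if the whole scratch is used; the 32 KB vector block region (0x0000 - 0x7FFF) alone holds 42 fp16 vectors, or 85 as i8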
- Top-K result buffer: 16 candidates * 16 B = 256 B
- Query vector: 768 B
A tile processes one block of up to ~85 vectors per block load, computing distances and maintaining a top-K heap entirely within scratch memory.
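The budget figures follow from simple integer arithmetic. As a hedged illustration (the constant names are made up for this sketch; the values come from the memory map and the list above), the numbers can be reproduced like this:

```python
DIM = 384                      # vector dimensionality
BYTES_PER_FP16 = 2
SCRATCH_BYTES = 64 * 1024      # whole SIMD scratch
VECTOR_BLOCK_BYTES = 0x8000    # 0x0000 - 0x7FFF region of the scratch
TOP_K = 16
CANDIDATE_BYTES = 16           # bytes per top-K candidate entry

vector_bytes = DIM * BYTES_PER_FP16
vectors_in_scratch = SCRATCH_BYTES // vector_bytes
vectors_in_block_region = VECTOR_BLOCK_BYTES // vector_bytes
i8_vectors_in_block_region = VECTOR_BLOCK_BYTES // DIM
result_buffer_bytes = TOP_K * CANDIDATE_BYTES

print("bytes per fp16 vector:", vector_bytes)
print("fp16 vectors in 64 KB scratch:", vectors_in_scratch)
print("fp16 vectors in 32 KB block region:", vectors_in_block_region)
print("i8 vectors in 32 KB block region:", i8_vectors_in_block_region)
print("result buffer bytes:", result_buffer_bytes)
```

The ~85 figure is therefore an upper bound for fp16 data that uses the entire scratch; it also matches the number of i8 vectors that fit in the 32 KB vector block region.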
3. Microkernel Exports
The WASM microkernel exports exactly these functions:
;; === Core Query Path ===
;; Initialize tile with configuration
;; config_ptr: pointer to 64B tile config in data memory
(export "rvf_init" (func $rvf_init (param $config_ptr i32) (result i32)))
;; Load query vector into query scratch
;; query_ptr: pointer to fp16 vector in data memory
;; dim: vector dimensionality
(export "rvf_load_query" (func $rvf_load_query
(param $query_ptr i32) (param $dim i32) (result i32)))
;; Load a block of vectors into SIMD scratch
;; block_ptr: pointer to vector block in SIMD scratch
;; count: number of vectors
;; dtype: data type enum
(export "rvf_load_block" (func $rvf_load_block
(param $block_ptr i32) (param $count i32)
(param $dtype i32) (result i32)))
;; Compute distances between query and loaded block
;; metric: 0=L2, 1=IP, 2=cosine, 3=hamming
;; result_ptr: pointer to write distances
(export "rvf_distances" (func $rvf_distances
(param $metric i32) (param $result_ptr i32) (result i32)))
;; Merge distances into top-K heap
;; dist_ptr: pointer to distance array
;; id_ptr: pointer to vector ID array
;; count: number of candidates
;; k: top-K to maintain
(export "rvf_topk_merge" (func $rvf_topk_merge
(param $dist_ptr i32) (param $id_ptr i32)
(param $count i32) (param $k i32) (result i32)))
;; Read current top-K results
;; out_ptr: pointer to write results (id, distance pairs)
(export "rvf_topk_read" (func $rvf_topk_read
(param $out_ptr i32) (result i32)))
;; === Quantization ===
;; Load scalar quantization parameters (min/max per dim)
(export "rvf_load_sq_params" (func $rvf_load_sq_params
(param $params_ptr i32) (param $dim i32) (result i32)))
;; Dequantize int8 block to fp16 in SIMD scratch
(export "rvf_dequant_i8" (func $rvf_dequant_i8
(param $src_ptr i32) (param $dst_ptr i32)
(param $count i32) (result i32)))
;; Load PQ codebook subset
(export "rvf_load_pq_codebook" (func $rvf_load_pq_codebook
(param $codebook_ptr i32) (param $M i32)
(param $K i32) (result i32)))
;; Compute PQ asymmetric distances
(export "rvf_pq_distances" (func $rvf_pq_distances
(param $codes_ptr i32) (param $count i32)
(param $result_ptr i32) (result i32)))
;; === HNSW Navigation ===
;; Load neighbor list for a node
(export "rvf_load_neighbors" (func $rvf_load_neighbors
(param $node_id i64) (param $layer i32)
(param $out_ptr i32) (result i32)))
;; Greedy search step: given current node, find nearest neighbor
(export "rvf_greedy_step" (func $rvf_greedy_step
(param $current_id i64) (param $layer i32) (result i64)))
;; === Segment Verification ===
;; Verify segment header hash
(export "rvf_verify_header" (func $rvf_verify_header
(param $header_ptr i32) (result i32)))
;; Compute CRC32C of a data region
(export "rvf_crc32c" (func $rvf_crc32c
(param $data_ptr i32) (param $len i32) (result i32)))
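For host-side testing, the value returned by rvf_crc32c can be cross-checked against a reference. The sketch below is a bit-at-a-time CRC-32C (Castagnoli) in Python. It assumes the common convention of the reflected polynomial 0x82F63B78 with an initial value and final XOR of 0xFFFFFFFF, which this document does not spell out; the function name `crc32c_reference` is hypothetical.

```python
CRC32C_POLY_REFLECTED = 0x82F63B78  # Castagnoli polynomial, bit-reversed form


def crc32c_reference(data: bytes) -> int:
    """Bitwise CRC-32C; slow, but easy to audit against a table or SIMD version."""
    crc = 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            # Shift out one bit; fold in the polynomial when that bit was set.
            crc = (crc >> 1) ^ (CRC32C_POLY_REFLECTED if crc & 1 else 0)
    return crc ^ 0xFFFFFFFF


# Standard CRC-32C check value for the ASCII string "123456789".
print(hex(crc32c_reference(b"123456789")))  # 0xe3069283
```

A tile implementation would more likely use a lookup table or a hardware instruction; the bitwise form is only meant as an oracle for tests.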
Export Count
14 exports. Each maps to a tight inner loop that fits in the 8 KB code budget. The host (hub) is responsible for all I/O, segment parsing, and orchestration.
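To make the contract of rvf_topk_merge and rvf_topk_read concrete, here is a small Python reference model. It is a sketch under assumptions: the bounded max-heap is one reasonable way to maintain the k smallest distances, not necessarily what the microkernel does, and the Python function names merely mirror the exports.

```python
import heapq


def topk_merge(heap, dists, ids, k):
    """Merge candidates into a heap holding the k smallest distances.

    Entries are stored as (-distance, id) so that heap[0] is always the
    current worst candidate and can be evicted in O(log k).
    Returns the number of entries held after the merge.
    """
    for dist, vec_id in zip(dists, ids):
        if len(heap) < k:
            heapq.heappush(heap, (-dist, vec_id))
        elif dist < -heap[0][0]:
            heapq.heapreplace(heap, (-dist, vec_id))
    return len(heap)


def topk_read(heap):
    """Return (id, distance) pairs ordered from nearest to farthest."""
    return sorted(((vec_id, -neg) for neg, vec_id in heap), key=lambda p: p[1])


heap = []
topk_merge(heap, [0.5, 0.1, 0.9], [10, 11, 12], k=2)   # first block
topk_merge(heap, [0.3, 0.05], [13, 14], k=2)           # second block
print(topk_read(heap))
```

Because the heap never grows beyond k entries, its footprint is fixed, which is the property that lets 16 candidates of 16 B each live in the 256 B result buffer.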
4. Host-Tile Protocol
Communication between the hub and tile uses messages with a fixed 8-byte header and a variable-length payload, exchanged through the 2 KB I/O buffer:
Message Format
Offset  Size  Field       Description
------  ----  -----       -----------
0x00    2     msg_type    Message type enum
0x02    2     msg_length  Payload length
0x04    4     msg_id      Correlation ID
0x08    var   payload     Type-specific payload
Message Types
Hub -> Tile:
0x01 LOAD_QUERY Send query vector (768 B for 384-dim fp16)
0x02 LOAD_BLOCK Send vector block (up to ~1.5 KB compressed)
0x03 LOAD_NEIGHBORS Send neighbor list for a node
0x04 LOAD_PARAMS Send quantization parameters
0x05 COMPUTE Trigger distance computation
0x06 READ_TOPK Request current top-K results
0x07 RESET Clear tile state for new query
Tile -> Hub:
0x81 TOPK_RESULT Top-K results (id, distance pairs)
0x82 NEED_BLOCK Request a specific vector block
0x83 NEED_NEIGHBORS Request neighbor list for a node
0x84 DONE Computation complete
0x85 ERROR Error with code
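The header layout and type codes above are enough to sketch an encoder and decoder. The Python below is illustrative only: little-endian byte order is an assumption (it matches WASM linear memory, but the format table does not state it), and `encode_message` / `decode_message` are hypothetical helper names.

```python
import struct

# msg_type (u16), msg_length (u16), msg_id (u32); byte order is assumed.
HEADER = struct.Struct("<HHI")
IO_BUFFER_BYTES = 2 * 1024

LOAD_QUERY = 0x01
TOPK_RESULT = 0x81


def encode_message(msg_type, msg_id, payload=b""):
    """Build header + payload, refusing anything larger than the I/O buffer."""
    if HEADER.size + len(payload) > IO_BUFFER_BYTES:
        raise ValueError("message does not fit in the 2 KB I/O buffer")
    return HEADER.pack(msg_type, len(payload), msg_id) + payload


def decode_message(buf):
    """Split a buffer into (msg_type, msg_id, payload)."""
    msg_type, length, msg_id = HEADER.unpack_from(buf, 0)
    payload = bytes(buf[HEADER.size:HEADER.size + length])
    if len(payload) != length:
        raise ValueError("truncated payload")
    return msg_type, msg_id, payload


query = bytes(768)  # placeholder 384-dim fp16 query, all zero bits
msg = encode_message(LOAD_QUERY, msg_id=7, payload=query)
print(len(msg), decode_message(msg)[:2])
```

With an 8-byte header, the largest payload that fits in the buffer is 2,040 bytes, so a 768 B query fits in a single message.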
Execution Flow
Hub                                 Tile
|                                   |
|--- LOAD_QUERY (768B) ------------>|
|                                   | rvf_load_query()
|--- LOAD_PARAMS (SQ params) ------>|
|                                   | rvf_load_sq_params()
|--- LOAD_BLOCK (block 0) --------->|
|                                   | rvf_load_block()
|                                   | rvf_distances()
|                                   | rvf_topk_merge()
|--- LOAD_BLOCK (block 1) --------->|
|                                   | rvf_load_block()
|                                   | rvf_distances()
|                                   | rvf_topk_merge()
|                ...                |
|--- READ_TOPK -------------------->|
|                                   | rvf_topk_read()
|<--- TOPK_RESULT ------------------|
|                                   |
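The push-mode flow above can be simulated on the host to test orchestration logic before involving real tiles. This is a minimal sketch, assuming plain Python lists stand in for vectors; `TileModel` and `hub_scan` are hypothetical names, and the sort-based top-K is chosen for brevity rather than fidelity to the microkernel.

```python
def l2_squared(a, b):
    """Squared L2 distance between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


class TileModel:
    """Host-side stand-in for a tile; methods mirror the hub messages."""

    def __init__(self, k):
        self.k = k
        self.query = None
        self.topk = []  # (distance, id) pairs, nearest first

    def load_query(self, query):          # LOAD_QUERY
        self.query = list(query)
        self.topk = []

    def load_block(self, ids, vectors):   # LOAD_BLOCK: distances, then merge
        scored = [(l2_squared(self.query, v), i) for i, v in zip(ids, vectors)]
        self.topk = sorted(self.topk + scored)[: self.k]

    def read_topk(self):                  # READ_TOPK -> TOPK_RESULT
        return [(i, d) for d, i in self.topk]


def hub_scan(tile, query, dataset, block_size):
    """Stream the dataset to the tile block by block, then read the result."""
    tile.load_query(query)
    ids = sorted(dataset)
    for start in range(0, len(ids), block_size):
        chunk = ids[start:start + block_size]
        tile.load_block(chunk, [dataset[i] for i in chunk])
    return tile.read_topk()


dataset = {0: [0, 0], 1: [1, 1], 2: [3, 3], 3: [0.5, 0.5], 4: [5, 5]}
print(hub_scan(TileModel(k=2), [0, 0], dataset, block_size=2))
```

The tile only ever holds one block plus its top-K state, so the result is independent of the block size the hub chooses.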
Pull Mode
For HNSW search, the tile drives the traversal:
Hub                                  Tile
|                                    |
|--- LOAD_QUERY -------------------->|
|--- LOAD_NEIGHBORS (entry point) -->|
|                                    | rvf_greedy_step()
|<--- NEED_NEIGHBORS (next node) ----|
|--- LOAD_NEIGHBORS (next node) ---->|
|                                    | rvf_greedy_step()
|<--- NEED_BLOCK (for candidate) ----|
|--- LOAD_BLOCK -------------------->|
|                                    | rvf_distances()
|                                    | rvf_topk_merge()
|<--- DONE --------------------------|
|--- READ_TOPK --------------------->|
|<--- TOPK_RESULT -------------------|
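The tile-driven traversal amounts to repeating a greedy step until no neighbor improves on the current node. The Python below sketches that loop under assumptions: the `graph` dictionary lookup stands in for a NEED_NEIGHBORS / LOAD_NEIGHBORS round trip, `dist` stands in for the distance kernel, and the function names are hypothetical.

```python
def greedy_step(current, neighbors, dist):
    """Return the neighbor nearest to the query, or current if none is closer."""
    best, best_dist = current, dist(current)
    for node in neighbors:
        d = dist(node)
        if d < best_dist:
            best, best_dist = node, d
    return best


def greedy_search(entry, graph, dist):
    """Walk the graph from entry until a local minimum of dist is reached."""
    current = entry
    while True:
        # In pull mode this lookup is where the tile would ask the hub
        # for the neighbor list of `current`.
        nxt = greedy_step(current, graph[current], dist)
        if nxt == current:
            return current
        current = nxt


# Toy 1-D example: node positions on a line, query at 6.0.
position = {0: 0.0, 1: 2.0, 2: 5.0, 3: 9.0}
graph = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
nearest = greedy_search(0, graph, lambda n: abs(position[n] - 6.0))
print(nearest)
```

Each iteration costs one neighbor-list fetch, which is why the 4 KB neighbor list cache in data memory matters for pull-mode latency.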
5. Three Hardware Profiles
RVF Core Profile (Tile)
Target: Cognitum tile (8KB + 8KB + 64KB)
Features: Distance compute, top-K, SQ dequant, CRC32C verify
Max vectors: ~85 per block load
Max dimensions: 384 (fp16) or 768 (i8)
Index: None (hub routes, tile computes)
Streaming: Receive blocks from hub
Quantization: i8 scalar only (no PQ on tile)
Compression: None (hub decompresses before sending)
RVF Hot Profile (Chip)
Target: Cognitum chip (multiple tiles + shared memory)
Features: Core + PQ distance, HNSW navigation, parallel tiles
Max vectors: Limited by shared memory (~10K in shared cache)
Max dimensions: 1024
Index: Layer A in shared memory
Streaming: Block streaming across tiles
Quantization: i8 scalar + PQ (6-bit)
Compression: LZ4 decompress in shared memory
RVF Full Profile (Hub/Desktop)
Target: Desktop CPU, server, hub controller
Features: All features, all segment types, all quantization
Max vectors: Billions (limited by storage)
Max dimensions: Unlimited
Index: Full HNSW (Layers A + B + C)
Streaming: Full append-only segment model
Quantization: All tiers (fp16, i8, PQ, binary)
Compression: All (LZ4, ZSTD, custom)
Crypto: Full (ML-DSA-65 signatures, SHAKE-256)
Temperature: Full adaptive tiering
Overlay: Full epoch model with compaction
Profile Detection
The root manifest's profile_id field declares the minimum profile needed:
0x00 generic Requires Full Profile features
0x01 core Fully usable with Core Profile
0x02 hot Requires Hot Profile minimum
0x03 full Requires Full Profile
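A Full Profile reader can read Core, Hot, and Full files. A Hot Profile reader can read Core and Hot files and rejects Full files. A Core Profile reader can read only Core files and rejects Hot and Full files. Files marked generic (0x00) require Full Profile features, so only a Full Profile reader accepts them.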
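The compatibility rule reduces to comparing a reader's capability level with the minimum level a file declares. A minimal Python sketch follows; the numeric ordering of levels and the rejection of unknown profile_id values are assumptions of this example, and `can_read` is a hypothetical name.

```python
# Reader capability levels, ordered from least to most capable (assumed).
CORE, HOT, FULL = 1, 2, 3

# profile_id -> minimum reader level, per the table above.
REQUIRED_LEVEL = {
    0x00: FULL,  # generic
    0x01: CORE,  # core
    0x02: HOT,   # hot
    0x03: FULL,  # full
}


def can_read(reader_level, profile_id):
    """True if a reader at reader_level may open a file with this profile_id."""
    required = REQUIRED_LEVEL.get(profile_id)
    if required is None:
        return False  # unknown profile_id: reject (assumption, not from the spec)
    return reader_level >= required


for name, level in (("core", CORE), ("hot", HOT), ("full", FULL)):
    readable = [hex(pid) for pid in sorted(REQUIRED_LEVEL) if can_read(level, pid)]
    print(name, "reader accepts", readable)
```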
6. SIMD Strategy by Platform
WASM v128 (Tile/Browser)
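;; L2 distance: fp16 vectors, 384 dimensions
;; Process 8 fp16 values per v128 load
(func $l2_fp16_384 (param $a_ptr i32) (param $b_ptr i32) (result f32)
  (local $acc v128)
  (local $i i32)
  (local.set $acc (v128.const f32x4 0 0 0 0))
  (local.set $i (i32.const 0))
  (block $done
    (loop $loop
      ;; Exit once all 384 dimensions have been consumed
      (br_if $done (i32.ge_u (local.get $i) (i32.const 384)))
      ;; Load 8 fp16 values, widen to f32x4 pairs
      ;; Subtract, square, accumulate into $acc
      ;; ... (8 values per iteration, 48 iterations for 384 dims)
      ;; Advance to the next group of 8 dimensions
      (local.set $i (i32.add (local.get $i) (i32.const 8)))
      (br $loop)
    )
  )
  ;; Horizontal sum of accumulator lanes (sum of squared differences)
  ;; Return L2 distance
  (f32.add
    (f32.add (f32x4.extract_lane 0 (local.get $acc))
             (f32x4.extract_lane 1 (local.get $acc)))
    (f32.add (f32x4.extract_lane 2 (local.get $acc))
             (f32x4.extract_lane 3 (local.get $acc))))
)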
AVX-512 (Desktop/Server)
; Process 32 fp16 values per iteration: two VCVTPH2PS widenings (16 values each), then VFMADD231PS
; 384 dims = 12 iterations of 32 values
; ~12 cycles per distance computation, assuming an idealized one iteration per cycle
ARM NEON (Mobile/Edge)
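; Process 8 fp16 values per iteration with FMLA (fp16 FMLA requires the ARMv8.2-A half-precision extension)
; 384 dims = 48 iterations of 8 values
; ~48 cycles per distance computation, assuming an idealized one iteration per cycle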
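Whatever the platform, a SIMD kernel is easiest to validate against a scalar reference that consumes the same fp16 bytes. The Python sketch below is such an oracle, written under assumptions: little-endian IEEE half-precision input, a squared-L2 result, and hypothetical helper names `pack_fp16` and `l2_squared_fp16`. It accumulates in double precision, so results from an f32 kernel may differ in the last bits.

```python
import struct

DIM = 384
LANES = 8  # fp16 values per 128-bit register


def pack_fp16(values):
    """Encode a sequence of floats as little-endian IEEE half precision."""
    return struct.pack(f"<{len(values)}e", *values)


def l2_squared_fp16(a_bytes, b_bytes, dim=DIM):
    """Scalar reference: sum of squared differences over dim fp16 values."""
    a = struct.unpack(f"<{dim}e", a_bytes)
    b = struct.unpack(f"<{dim}e", b_bytes)
    acc = 0.0
    # Walk the vectors in groups of 8 to mirror the SIMD loop structure.
    for i in range(0, dim, LANES):
        for x, y in zip(a[i:i + LANES], b[i:i + LANES]):
            diff = x - y
            acc += diff * diff
    return acc


a = pack_fp16([1.0] * DIM)
b = pack_fp16([0.5] * DIM)
print(len(a), l2_squared_fp16(a, b))
```

Here 1.0 and 0.5 are exactly representable in fp16, so the expected value is 384 * 0.25 with no rounding involved.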
7. Microkernel Size Budget
Function Estimated Size
-------- --------------
rvf_init 128 B
rvf_load_query 64 B
rvf_load_block 256 B
rvf_distances (L2 fp16) 512 B
rvf_distances (L2 i8) 384 B
rvf_distances (IP fp16) 512 B
rvf_distances (hamming) 256 B
rvf_topk_merge 384 B
rvf_topk_read 64 B
rvf_load_sq_params 64 B
rvf_dequant_i8 256 B
rvf_load_pq_codebook 128 B
rvf_pq_distances 512 B
rvf_load_neighbors 128 B
rvf_greedy_step 512 B
rvf_verify_header 128 B
rvf_crc32c 256 B
Message dispatch loop 384 B
Utility functions 256 B
WASM overhead 512 B
----------
Total 5,696 B (< 8 KB code budget)
The remaining ~2.4 KB of code space is available for domain-specific extensions (e.g., codon distance for RVDNA profile, token overlap for RVText profile).
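The total and the headroom can be recomputed directly from the line items. The snippet below is a small Python check; the dictionary keys are labels copied from the table, and the 8 KB limit is the code memory size from the tile constraints.

```python
CODE_BUDGET_BYTES = 8 * 1024

ESTIMATED_SIZES = {
    "rvf_init": 128,
    "rvf_load_query": 64,
    "rvf_load_block": 256,
    "rvf_distances (L2 fp16)": 512,
    "rvf_distances (L2 i8)": 384,
    "rvf_distances (IP fp16)": 512,
    "rvf_distances (hamming)": 256,
    "rvf_topk_merge": 384,
    "rvf_topk_read": 64,
    "rvf_load_sq_params": 64,
    "rvf_dequant_i8": 256,
    "rvf_load_pq_codebook": 128,
    "rvf_pq_distances": 512,
    "rvf_load_neighbors": 128,
    "rvf_greedy_step": 512,
    "rvf_verify_header": 128,
    "rvf_crc32c": 256,
    "Message dispatch loop": 384,
    "Utility functions": 256,
    "WASM overhead": 512,
}

total = sum(ESTIMATED_SIZES.values())
remaining = CODE_BUDGET_BYTES - total
print(f"total = {total} B, remaining = {remaining} B ({remaining / 1024:.2f} KB)")
```

The line items sum to 5,696 B, which leaves 2,496 B (about 2.4 KB) under the 8 KB limit.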
8. Fault Isolation
Each tile runs in a WASM sandbox. A tile cannot:
- Access hub memory directly
- Communicate with other tiles except through the hub
- Allocate memory beyond its 8 KB data + 64 KB scratch
- Execute code beyond its 8 KB code space
- Trap without the hub catching and recovering
If a tile traps (out-of-bounds, unreachable, stack overflow):
- Hub catches the trap
- Hub marks tile as faulted
- Hub reassigns the tile's work to another tile (or processes locally)
- Hub optionally restarts the faulted tile with fresh state
This makes the system resilient to individual tile failures — important for large tile arrays where hardware faults are inevitable.
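The recovery steps above map onto a small scheduling loop on the hub. The Python sketch below illustrates one possible shape, not the hub's actual implementation: `TileTrap`, `run_with_recovery`, and the two callbacks are hypothetical, and restarting a faulted tile is left out.

```python
class TileTrap(Exception):
    """Raised when a tile traps (out-of-bounds, unreachable, stack overflow)."""


def run_with_recovery(tiles, work_items, run_on_tile, run_locally):
    """Run each work item on the first healthy tile; fall back to the hub.

    Returns (results, faulted) where results maps item -> output and
    faulted is the set of tiles that trapped.
    """
    faulted = set()
    results = {}
    for item in work_items:
        for tile in tiles:
            if tile in faulted:
                continue                      # skip tiles already marked faulted
            try:
                results[item] = run_on_tile(tile, item)
                break
            except TileTrap:
                faulted.add(tile)             # mark and try the next tile
        else:
            results[item] = run_locally(item)  # no healthy tile left
    return results, faulted


def flaky_tile(tile, item):
    # Tile "t0" always traps in this demo; others succeed.
    if tile == "t0":
        raise TileTrap("out-of-bounds")
    return (tile, item * 2)


results, faulted = run_with_recovery(
    ["t0", "t1"], [1, 2], flaky_tile, lambda item: ("hub", item * 2)
)
print(results, faulted)
```

Because a tile's state is limited to its own memory and every input arrives through hub messages, reassigning an item only requires replaying those messages to another tile.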