Files
wifi-densepose/vendor/ruvector/docs/research/rvf/microkernel/wasm-runtime.md

13 KiB

RVF WASM Microkernel and Cognitum Hardware Mapping

1. Design Philosophy

RVF must run on hardware ranging from a 64 KB WASM tile to a petabyte cluster. The WASM microkernel is the minimal runtime that makes a tile a first-class RVF citizen — capable of answering queries, ingesting streams, and participating in distributed search.

The microkernel is not a shrunken version of the full runtime. It is a purpose-built execution core that exposes the exact set of operations a tile needs, and nothing more.

2. Cognitum Tile Architecture

Hardware Constraints

+-----------------------------------+
| Cognitum Tile                     |
|                                   |
|  Code Memory:    8 KB             |
|  Data Memory:    8 KB             |
|  SIMD Scratch:   64 KB            |
|  Registers:      v128 (WASM SIMD) |
|  Clock:          ~1 GHz           |
|  Interconnect:   Mesh to hub      |
|                                   |
|  No filesystem. No mmap.          |
|  No allocator beyond scratch.     |
|  All I/O through hub messages.    |
+-----------------------------------+

Memory Map

Code (8 KB):
  0x0000 - 0x0FFF   Microkernel WASM bytecode (4 KB)
  0x1000 - 0x17FF   Distance function hot path (2 KB)
  0x1800 - 0x1FFF   Decode / quantization stubs (2 KB)

Data (8 KB):
  0x0000 - 0x003F   Tile configuration (64 B)
  0x0040 - 0x00FF   Query scratch (192 B: query vector fp16)
  0x0100 - 0x01FF   Result buffer (256 B: top-K candidates)
  0x0200 - 0x03FF   Routing table (512 B: entry points + centroids)
  0x0400 - 0x07FF   Decode workspace (1 KB)
  0x0800 - 0x0FFF   Message I/O buffer (2 KB)
  0x1000 - 0x1FFF   Neighbor list cache (4 KB)

SIMD Scratch (64 KB):
  0x0000 - 0x7FFF   Vector block (up to 85 vectors @ 384-dim fp16)
  0x8000 - 0xBFFF   Distance accumulator / PQ tables (16 KB)
  0xC000 - 0xEFFF   Hot cache subset (12 KB)
  0xF000 - 0xFFFF   Temporary / spill (4 KB)

Tile Budget

For 384-dim fp16 vectors:

  • One vector: 768 bytes
  • SIMD scratch holds: 64 KB / 768 = ~85 vectors
  • Top-K result buffer: 16 candidates * 16 B = 256 B
  • Query vector: 768 B

A tile can process one block of ~85 vectors per cycle, computing distances and maintaining a top-K heap entirely within scratch memory.

3. Microkernel Exports

The WASM microkernel exports exactly these functions:

;; === Core Query Path ===

;; Initialize tile with configuration
;; config_ptr: pointer to 64B tile config in data memory
(export "rvf_init" (func $rvf_init (param $config_ptr i32) (result i32)))

;; Load query vector into query scratch
;; query_ptr: pointer to fp16 vector in data memory
;; dim: vector dimensionality
(export "rvf_load_query" (func $rvf_load_query
    (param $query_ptr i32) (param $dim i32) (result i32)))

;; Load a block of vectors into SIMD scratch
;; block_ptr: pointer to vector block in SIMD scratch
;; count: number of vectors
;; dtype: data type enum
(export "rvf_load_block" (func $rvf_load_block
    (param $block_ptr i32) (param $count i32)
    (param $dtype i32) (result i32)))

;; Compute distances between query and loaded block
;; metric: 0=L2, 1=IP, 2=cosine, 3=hamming
;; result_ptr: pointer to write distances
(export "rvf_distances" (func $rvf_distances
    (param $metric i32) (param $result_ptr i32) (result i32)))

;; Merge distances into top-K heap
;; dist_ptr: pointer to distance array
;; id_ptr: pointer to vector ID array
;; count: number of candidates
;; k: top-K to maintain
(export "rvf_topk_merge" (func $rvf_topk_merge
    (param $dist_ptr i32) (param $id_ptr i32)
    (param $count i32) (param $k i32) (result i32)))

;; Read current top-K results
;; out_ptr: pointer to write results (id, distance pairs)
(export "rvf_topk_read" (func $rvf_topk_read
    (param $out_ptr i32) (result i32)))

;; === Quantization ===

;; Load scalar quantization parameters (min/max per dim)
(export "rvf_load_sq_params" (func $rvf_load_sq_params
    (param $params_ptr i32) (param $dim i32) (result i32)))

;; Dequantize int8 block to fp16 in SIMD scratch
(export "rvf_dequant_i8" (func $rvf_dequant_i8
    (param $src_ptr i32) (param $dst_ptr i32)
    (param $count i32) (result i32)))

;; Load PQ codebook subset
(export "rvf_load_pq_codebook" (func $rvf_load_pq_codebook
    (param $codebook_ptr i32) (param $M i32)
    (param $K i32) (result i32)))

;; Compute PQ asymmetric distances
(export "rvf_pq_distances" (func $rvf_pq_distances
    (param $codes_ptr i32) (param $count i32)
    (param $result_ptr i32) (result i32)))

;; === HNSW Navigation ===

;; Load neighbor list for a node
(export "rvf_load_neighbors" (func $rvf_load_neighbors
    (param $node_id i64) (param $layer i32)
    (param $out_ptr i32) (result i32)))

;; Greedy search step: given current node, find nearest neighbor
(export "rvf_greedy_step" (func $rvf_greedy_step
    (param $current_id i64) (param $layer i32) (result i64)))

;; === Segment Verification ===

;; Verify segment header hash
(export "rvf_verify_header" (func $rvf_verify_header
    (param $header_ptr i32) (result i32)))

;; Compute CRC32C of a data region
(export "rvf_crc32c" (func $rvf_crc32c
    (param $data_ptr i32) (param $len i32) (result i32)))

Export Count

14 exports. Each maps to a tight inner loop that fits in the 8 KB code budget. The host (hub) is responsible for all I/O, segment parsing, and orchestration.

4. Host-Tile Protocol

Communication between the hub and tile uses fixed-size messages through the 2 KB I/O buffer:

Message Format

Offset  Size  Field        Description
------  ----  -----        -----------
0x00    2     msg_type     Message type enum
0x02    2     msg_length   Payload length
0x04    4     msg_id       Correlation ID
0x08    var   payload      Type-specific payload

Message Types

Hub -> Tile:
  0x01  LOAD_QUERY       Send query vector (768 B for 384-dim fp16)
  0x02  LOAD_BLOCK       Send vector block (up to ~1.5 KB compressed)
  0x03  LOAD_NEIGHBORS   Send neighbor list for a node
  0x04  LOAD_PARAMS      Send quantization parameters
  0x05  COMPUTE          Trigger distance computation
  0x06  READ_TOPK        Request current top-K results
  0x07  RESET            Clear tile state for new query

Tile -> Hub:
  0x81  TOPK_RESULT      Top-K results (id, distance pairs)
  0x82  NEED_BLOCK       Request a specific vector block
  0x83  NEED_NEIGHBORS   Request neighbor list for a node
  0x84  DONE             Computation complete
  0x85  ERROR            Error with code

Execution Flow

Hub                                 Tile
 |                                    |
 |--- LOAD_QUERY (768B) ------------>|
 |                                    | rvf_load_query()
 |--- LOAD_PARAMS (SQ params) ------>|
 |                                    | rvf_load_sq_params()
 |--- LOAD_BLOCK (block 0) -------->|
 |                                    | rvf_load_block()
 |                                    | rvf_distances()
 |                                    | rvf_topk_merge()
 |--- LOAD_BLOCK (block 1) -------->|
 |                                    | rvf_load_block()
 |                                    | rvf_distances()
 |                                    | rvf_topk_merge()
 |    ...                             |
 |--- READ_TOPK -------------------->|
 |                                    | rvf_topk_read()
 |<--- TOPK_RESULT ------------------|
 |                                    |

Pull Mode

For HNSW search, the tile drives the traversal:

Hub                                 Tile
 |                                    |
 |--- LOAD_QUERY -------------------->|
 |--- LOAD_NEIGHBORS (entry point) -->|
 |                                    | rvf_greedy_step()
 |<--- NEED_NEIGHBORS (next node) ----|
 |--- LOAD_NEIGHBORS (next node) ---->|
 |                                    | rvf_greedy_step()
 |<--- NEED_BLOCK (for candidate) ----|
 |--- LOAD_BLOCK -------------------->|
 |                                    | rvf_distances()
 |                                    | rvf_topk_merge()
 |<--- DONE ----------------------------|
 |--- READ_TOPK --------------------->|
 |<--- TOPK_RESULT ------------------|

5. Three Hardware Profiles

RVF Core Profile (Tile)

Target:         Cognitum tile (8KB + 8KB + 64KB)
Features:       Distance compute, top-K, SQ dequant, CRC32C verify
Max vectors:    ~85 per block load
Max dimensions: 384 (fp16) or 768 (i8)
Index:          None (hub routes, tile computes)
Streaming:      Receive blocks from hub
Quantization:   i8 scalar only (no PQ on tile)
Compression:    None (hub decompresses before sending)

RVF Hot Profile (Chip)

Target:         Cognitum chip (multiple tiles + shared memory)
Features:       Core + PQ distance, HNSW navigation, parallel tiles
Max vectors:    Limited by shared memory (~10K in shared cache)
Max dimensions: 1024
Index:          Layer A in shared memory
Streaming:      Block streaming across tiles
Quantization:   i8 scalar + PQ (6-bit)
Compression:    LZ4 decompress in shared memory

RVF Full Profile (Hub/Desktop)

Target:         Desktop CPU, server, hub controller
Features:       All features, all segment types, all quantization
Max vectors:    Billions (limited by storage)
Max dimensions: Unlimited
Index:          Full HNSW (Layers A + B + C)
Streaming:      Full append-only segment model
Quantization:   All tiers (fp16, i8, PQ, binary)
Compression:    All (LZ4, ZSTD, custom)
Crypto:         Full (ML-DSA-65 signatures, SHAKE-256)
Temperature:    Full adaptive tiering
Overlay:        Full epoch model with compaction

Profile Detection

The root manifest's profile_id field declares the minimum profile needed:

0x00    generic     Requires Full Profile features
0x01    core        Fully usable with Core Profile
0x02    hot         Requires Hot Profile minimum
0x03    full        Requires Full Profile

A Full Profile reader can always read Core or Hot files. A Core Profile reader rejects Full Profile files but can read Core files. Hot Profile readers can read Core and Hot files.

6. SIMD Strategy by Platform

WASM v128 (Tile/Browser)

;; L2 distance: fp16 vectors, 384 dimensions
;; Process 8 fp16 values per v128 operation

(func $l2_fp16_384 (param $a_ptr i32) (param $b_ptr i32) (result f32)
    (local $acc v128)
    (local $i i32)
    (local.set $acc (v128.const i64x2 0 0))
    (local.set $i (i32.const 0))

    (block $done
        (loop $loop
            ;; Load 8 fp16 values, widen to f32x4 pairs
            ;; Subtract, square, accumulate
            ;; ... (8 values per iteration, 48 iterations for 384 dims)

            (br_if $done (i32.ge_u (local.get $i) (i32.const 384)))
            (br $loop)
        )
    )
    ;; Horizontal sum of accumulator
    ;; Return L2 distance
)

AVX-512 (Desktop/Server)

; Process 32 fp16 values per cycle with VCVTPH2PS + VFMADD231PS
; 384 dims = 12 iterations of 32 values
; ~12 cycles per distance computation

ARM NEON (Mobile/Edge)

; Process 8 fp16 values per cycle with FMLA
; 384 dims = 48 iterations of 8 values
; ~48 cycles per distance computation

7. Microkernel Size Budget

Function                    Estimated Size
--------                    --------------
rvf_init                    128 B
rvf_load_query              64 B
rvf_load_block              256 B
rvf_distances (L2 fp16)     512 B
rvf_distances (L2 i8)       384 B
rvf_distances (IP fp16)     512 B
rvf_distances (hamming)     256 B
rvf_topk_merge              384 B
rvf_topk_read               64 B
rvf_load_sq_params          64 B
rvf_dequant_i8              256 B
rvf_load_pq_codebook        128 B
rvf_pq_distances            512 B
rvf_load_neighbors          128 B
rvf_greedy_step             512 B
rvf_verify_header           128 B
rvf_crc32c                  256 B
Message dispatch loop       384 B
Utility functions           256 B
WASM overhead               512 B
                            ----------
Total                       ~5,500 B (< 8 KB code budget)

Remaining ~2.5 KB of code space is available for domain-specific extensions (e.g., codon distance for RVDNA profile, token overlap for RVText profile).

8. Fault Isolation

Each tile runs in a WASM sandbox. A tile cannot:

  • Access hub memory directly
  • Communicate with other tiles except through the hub
  • Allocate memory beyond its 8 KB data + 64 KB scratch
  • Execute code beyond its 8 KB code space
  • Trap without the hub catching and recovering

If a tile traps (out-of-bounds, unreachable, stack overflow):

  1. Hub catches the trap
  2. Hub marks tile as faulted
  3. Hub reassigns the tile's work to another tile (or processes locally)
  4. Hub optionally restarts the faulted tile with fresh state

This makes the system resilient to individual tile failures — important for large tile arrays where hardware faults are inevitable.