Merge commit 'd803bfe2b1fe7f5e219e50ac20d6801a0a58ac75' as 'vendor/ruvector'
This commit is contained in:
688
vendor/ruvector/docs/research/rvf/spec/10-operations-api.md
vendored
Normal file
688
vendor/ruvector/docs/research/rvf/spec/10-operations-api.md
vendored
Normal file
@@ -0,0 +1,688 @@
|
||||
# RVF Operations API
|
||||
|
||||
## 1. Scope
|
||||
|
||||
This document specifies the operational surface of an RVF runtime: error codes
|
||||
returned by all operations, wire formats for batch queries, batch ingest, and
|
||||
batch deletes, the network streaming protocol for progressive loading over HTTP
|
||||
and TCP, and the compaction scheduling policy. It complements the segment model
|
||||
(spec 01), manifest system (spec 02), and query optimization (spec 06).
|
||||
|
||||
All multi-byte integers are little-endian unless otherwise noted. All offsets
|
||||
within messages are byte offsets from the start of the message payload.
|
||||
|
||||
## 2. Error Code Enumeration
|
||||
|
||||
Error codes are 16-bit unsigned integers. The high byte identifies the error
|
||||
category; the low byte identifies the specific error within that category.
|
||||
Implementations must preserve unrecognized codes in responses and must not
|
||||
treat unknown codes as fatal unless the high byte is `0x01` (format error).
|
||||
|
||||
### Category 0x00: Success
|
||||
|
||||
```
|
||||
Code Name Description
|
||||
------ -------------------- ----------------------------------------
|
||||
0x0000 OK Operation succeeded
|
||||
0x0001 OK_PARTIAL Partial success (some items failed)
|
||||
```
|
||||
|
||||
`OK_PARTIAL` is returned when a batch operation succeeds for some items and
|
||||
fails for others. The response body contains per-item status details.
|
||||
|
||||
### Category 0x01: Format Errors
|
||||
|
||||
```
|
||||
Code Name Description
|
||||
------ -------------------- ----------------------------------------
|
||||
0x0100 INVALID_MAGIC Segment magic mismatch (expected 0x52564653)
|
||||
0x0101 INVALID_VERSION Unsupported segment version
|
||||
0x0102 INVALID_CHECKSUM Segment hash verification failed
|
||||
0x0103 INVALID_SIGNATURE Cryptographic signature invalid
|
||||
0x0104 TRUNCATED_SEGMENT Segment payload shorter than declared length
|
||||
0x0105 INVALID_MANIFEST Root manifest validation failed
|
||||
0x0106 MANIFEST_NOT_FOUND No valid MANIFEST_SEG in file
|
||||
0x0107 UNKNOWN_SEGMENT_TYPE Segment type not recognized (warning, not fatal)
|
||||
0x0108 ALIGNMENT_ERROR Data not at expected 64B boundary
|
||||
```
|
||||
|
||||
`UNKNOWN_SEGMENT_TYPE` is advisory. A reader encountering an unknown segment
|
||||
type should skip it and continue. All other format errors in this category
|
||||
are fatal for the affected segment.
|
||||
|
||||
### Category 0x02: Query Errors
|
||||
|
||||
```
|
||||
Code Name Description
|
||||
------ -------------------- ----------------------------------------
|
||||
0x0200 DIMENSION_MISMATCH Query vector dimension != index dimension
|
||||
0x0201 EMPTY_INDEX No index segments available
|
||||
0x0202 METRIC_UNSUPPORTED Requested distance metric not available
|
||||
0x0203 FILTER_PARSE_ERROR Invalid filter expression
|
||||
0x0204 K_TOO_LARGE Requested K exceeds available vectors
|
||||
0x0205 TIMEOUT Query exceeded time budget
|
||||
```
|
||||
|
||||
When `K_TOO_LARGE` is returned, the response still contains all available
|
||||
results. The result count will be less than the requested K.
|
||||
|
||||
### Category 0x03: Write Errors
|
||||
|
||||
```
|
||||
Code Name Description
|
||||
------ -------------------- ----------------------------------------
|
||||
0x0300 LOCK_HELD Another writer holds the lock
|
||||
0x0301 LOCK_STALE Lock file exists but owner process is dead
|
||||
0x0302 DISK_FULL Insufficient space for write
|
||||
0x0303 FSYNC_FAILED Durable write failed
|
||||
0x0304 SEGMENT_TOO_LARGE Segment exceeds 4 GB limit
|
||||
0x0305 READ_ONLY File opened in read-only mode
|
||||
```
|
||||
|
||||
`LOCK_STALE` is informational. The runtime may attempt to break the stale
|
||||
lock and retry. If recovery succeeds, the original operation proceeds with
|
||||
an `OK` status.
|
||||
|
||||
### Category 0x04: Tile Errors (WASM Microkernel)
|
||||
|
||||
```
|
||||
Code Name Description
|
||||
------ -------------------- ----------------------------------------
|
||||
0x0400 TILE_TRAP WASM trap (OOB, unreachable, stack overflow)
|
||||
0x0401 TILE_OOM Tile exceeded scratch memory (64 KB)
|
||||
0x0402 TILE_TIMEOUT Tile computation exceeded time budget
|
||||
0x0403 TILE_INVALID_MSG Malformed hub-tile message
|
||||
0x0404 TILE_UNSUPPORTED_OP Operation not available on this profile
|
||||
```
|
||||
|
||||
All tile errors trigger the fault isolation protocol described in
|
||||
`microkernel/wasm-runtime.md` section 8. The hub reassigns the tile's
|
||||
work and optionally restarts the faulted tile.
|
||||
|
||||
### Category 0x05: Crypto Errors
|
||||
|
||||
```
|
||||
Code Name Description
|
||||
------ -------------------- ----------------------------------------
|
||||
0x0500 KEY_NOT_FOUND Referenced key_id not in CRYPTO_SEG
|
||||
0x0501 KEY_EXPIRED Key past valid_until timestamp
|
||||
0x0502 DECRYPT_FAILED Decryption or auth tag verification failed
|
||||
0x0503 ALGO_UNSUPPORTED Cryptographic algorithm not implemented
|
||||
```
|
||||
|
||||
Crypto errors are always fatal for the affected segment. An implementation
|
||||
must not serve data from a segment that fails signature or decryption checks.
|
||||
|
||||
## 3. Batch Query API
|
||||
|
||||
### Wire Format: Request
|
||||
|
||||
Batch queries amortize connection overhead and enable the runtime to
|
||||
schedule vector block loads across multiple queries simultaneously.
|
||||
|
||||
```
|
||||
Offset Size Field Description
|
||||
------ ------ ------------------ ----------------------------------------
|
||||
0x00 4 query_count Number of queries in batch (max 1024)
|
||||
0x04 4 k Shared top-K parameter
|
||||
0x08 1 metric Distance metric: 0=L2, 1=IP, 2=cosine, 3=hamming
|
||||
0x09 3 reserved Must be zero
|
||||
0x0C 4 ef_search HNSW ef_search parameter
|
||||
0x10 4 shared_filter_len Byte length of shared filter (0 = no filter)
|
||||
0x14 var shared_filter Filter expression (applies to all queries)
|
||||
var var queries[] Per-query entries (see below)
|
||||
```
|
||||
|
||||
Each query entry:
|
||||
|
||||
```
|
||||
Offset Size Field Description
|
||||
------ ------ ------------------ ----------------------------------------
|
||||
0x00 4 query_id Client-assigned correlation ID
|
||||
0x04 2 dim Vector dimensionality
|
||||
0x06 1 dtype Data type: 0=fp32, 1=fp16, 2=i8, 3=binary
|
||||
0x07 1 flags Bit 0: has per-query filter
|
||||
0x08 var vector Query vector (dim * sizeof(dtype) bytes)
|
||||
var 4 filter_len Byte length of per-query filter (if flags bit 0)
|
||||
var var filter Per-query filter (overrides shared filter)
|
||||
```
|
||||
|
||||
When both a shared filter and a per-query filter are present, the per-query
|
||||
filter takes precedence. A per-query filter of zero length inherits the
|
||||
shared filter.
|
||||
|
||||
### Wire Format: Response
|
||||
|
||||
```
|
||||
Offset Size Field Description
|
||||
------ ------ ------------------ ----------------------------------------
|
||||
0x00 4 query_count Number of query results
|
||||
0x04 var results[] Per-query result entries
|
||||
```
|
||||
|
||||
Each result entry:
|
||||
|
||||
```
|
||||
Offset Size Field Description
|
||||
------ ------ ------------------ ----------------------------------------
|
||||
0x00 4 query_id Correlation ID from request
|
||||
0x04 2 status Error code (0x0000 = OK)
|
||||
0x06 2 reserved Must be zero
|
||||
0x08 4 result_count Number of results returned
|
||||
0x0C var results[] Array of (vector_id: u64, distance: f32) pairs
|
||||
```
|
||||
|
||||
Each result pair is 12 bytes: 8 bytes for the vector ID followed by 4 bytes
|
||||
for the distance value. Results are sorted by distance ascending (nearest first).
|
||||
|
||||
### Batch Scheduling
|
||||
|
||||
The runtime should process batch queries using the following strategy:
|
||||
|
||||
1. Parse all query vectors and load them into memory
|
||||
2. Identify shared segments across queries (block deduplication)
|
||||
3. Load each vector block once and evaluate all relevant queries against it
|
||||
4. Merge per-query top-K heaps independently
|
||||
5. Return results as soon as each query completes (streaming response)
|
||||
|
||||
This amortizes I/O: if N queries touch the same vector block, the block is
|
||||
read once instead of N times.
|
||||
|
||||
## 4. Batch Ingest API
|
||||
|
||||
### Wire Format: Request
|
||||
|
||||
```
|
||||
Offset Size Field Description
|
||||
------ ------ ------------------ ----------------------------------------
|
||||
0x00 4 vector_count Number of vectors to ingest (max 65536)
|
||||
0x04 2 dim Vector dimensionality
|
||||
0x06 1 dtype Data type: 0=fp32, 1=fp16, 2=i8, 3=binary
|
||||
0x07 1 flags Bit 0: metadata_included
|
||||
0x08 var vectors[] Vector entries
|
||||
var var metadata[] Metadata entries (if flags bit 0)
|
||||
```
|
||||
|
||||
Each vector entry:
|
||||
|
||||
```
|
||||
Offset Size Field Description
|
||||
------ ------ ------------------ ----------------------------------------
|
||||
0x00 8 vector_id Globally unique vector ID
|
||||
0x08 var vector Vector data (dim * sizeof(dtype) bytes)
|
||||
```
|
||||
|
||||
Each metadata entry (when metadata_included is set):
|
||||
|
||||
```
|
||||
Offset Size Field Description
|
||||
------ ------ ------------------ ----------------------------------------
|
||||
0x00 2 field_count Number of metadata fields
|
||||
0x02 var fields[] Field entries
|
||||
```
|
||||
|
||||
Each metadata field:
|
||||
|
||||
```
|
||||
Offset Size Field Description
|
||||
------ ------ ------------------ ----------------------------------------
|
||||
0x00 2 field_id Field identifier (application-defined)
|
||||
0x02 1 value_type 0=u64, 1=i64, 2=f64, 3=string, 4=bytes
|
||||
0x03 var value Encoded value (u64/i64/f64: 8B; string/bytes: 4B length + data)
|
||||
```
|
||||
|
||||
### Wire Format: Response
|
||||
|
||||
```
|
||||
Offset Size Field Description
|
||||
------ ------ ------------------ ----------------------------------------
|
||||
0x00 4 accepted_count Number of vectors accepted
|
||||
0x04 4 rejected_count Number of vectors rejected
|
||||
0x08 4 manifest_epoch Epoch of manifest after commit
|
||||
0x0C var rejected_ids[] Array of rejected vector IDs (u64 * rejected_count)
|
||||
var var rejected_reasons[] Array of error codes (u16 * rejected_count)
|
||||
```
|
||||
|
||||
The `manifest_epoch` field is the epoch of the MANIFEST_SEG written after the
|
||||
ingest is committed. Clients can use this value to confirm that a subsequent
|
||||
read will include the ingested vectors.
|
||||
|
||||
### Ingest Commit Semantics
|
||||
|
||||
1. The runtime writes vectors to a new VEC_SEG (append-only)
|
||||
2. If metadata is included, a META_SEG is appended
|
||||
3. Both segments are fsynced
|
||||
4. A new MANIFEST_SEG is written referencing the new segments
|
||||
5. The manifest is fsynced
|
||||
6. The response is sent with the new manifest_epoch
|
||||
|
||||
Vectors are visible to queries only after step 6 completes.
|
||||
|
||||
## 5. Batch Delete API
|
||||
|
||||
### Wire Format: Request
|
||||
|
||||
```
|
||||
Offset Size Field Description
|
||||
------ ------ ------------------ ----------------------------------------
|
||||
0x00 1 delete_type 0=by_id, 1=by_range, 2=by_filter
|
||||
0x01 3 reserved Must be zero
|
||||
0x04 var payload Type-specific payload (see below)
|
||||
```
|
||||
|
||||
Delete by ID (`delete_type = 0`):
|
||||
|
||||
```
|
||||
Offset Size Field Description
|
||||
------ ------ ------------------ ----------------------------------------
|
||||
0x00 4 count Number of IDs to delete
|
||||
0x04 var ids[] Array of vector IDs (u64 * count)
|
||||
```
|
||||
|
||||
Delete by range (`delete_type = 1`):
|
||||
|
||||
```
|
||||
Offset Size Field Description
|
||||
------ ------ ------------------ ----------------------------------------
|
||||
0x00 8 start_id Start of range (inclusive)
|
||||
0x08 8 end_id End of range (exclusive)
|
||||
```
|
||||
|
||||
Delete by filter (`delete_type = 2`):
|
||||
|
||||
```
|
||||
Offset Size Field Description
|
||||
------ ------ ------------------ ----------------------------------------
|
||||
0x00 4 filter_len Byte length of filter expression
|
||||
0x04 var filter Filter expression
|
||||
```
|
||||
|
||||
### Wire Format: Response
|
||||
|
||||
```
|
||||
Offset Size Field Description
|
||||
------ ------ ------------------ ----------------------------------------
|
||||
0x00 8 deleted_count Number of vectors deleted
|
||||
0x08 2 status Error code (0x0000 = OK)
|
||||
0x0A 2 reserved Must be zero
|
||||
0x0C 4 manifest_epoch Epoch of manifest after delete committed
|
||||
```
|
||||
|
||||
### Delete Mechanics
|
||||
|
||||
Deletes are logical. The runtime appends a JOURNAL_SEG containing tombstone
|
||||
entries for the deleted vector IDs. The new MANIFEST_SEG marks affected
|
||||
VEC_SEGs as partially dead. Physical reclamation happens during compaction.
|
||||
|
||||
## 6. Network Streaming Protocol
|
||||
|
||||
### 6.1 HTTP Range Requests (Read-Only Access)
|
||||
|
||||
RVF's progressive loading model maps naturally to HTTP byte-range requests.
|
||||
A client can boot from a remote `.rvf` file and become queryable without
|
||||
downloading the entire file.
|
||||
|
||||
**Phase 1: Boot (mandatory)**
|
||||
|
||||
```
|
||||
GET /file.rvf Range: bytes=-4096
|
||||
```
|
||||
|
||||
Retrieves the last 4 KB of the file. This contains the Level 0 root manifest
|
||||
(MANIFEST_SEG). The client parses hotset pointers, the segment directory, and
|
||||
the profile ID.
|
||||
|
||||
If the file is smaller than 4 KB, the entire file is returned. If the last
|
||||
4 KB does not contain a valid MANIFEST_SEG, the client extends the range
|
||||
backward in 4 KB increments until one is found or 1 MB is scanned (at which
|
||||
point it returns `MANIFEST_NOT_FOUND`).
|
||||
|
||||
**Phase 2: Hotset (parallel, mandatory for queries)**
|
||||
|
||||
Using offsets from the Level 0 manifest, the client issues up to 5 parallel
|
||||
range requests:
|
||||
|
||||
```
|
||||
GET /file.rvf Range: bytes=<entrypoint_offset>-<entrypoint_end>
|
||||
GET /file.rvf Range: bytes=<toplayer_offset>-<toplayer_end>
|
||||
GET /file.rvf Range: bytes=<centroid_offset>-<centroid_end>
|
||||
GET /file.rvf Range: bytes=<quantdict_offset>-<quantdict_end>
|
||||
GET /file.rvf Range: bytes=<hotcache_offset>-<hotcache_end>
|
||||
```
|
||||
|
||||
These fetch the HNSW entry point, top-layer graph, routing centroids,
|
||||
quantization dictionary, and the hot cache (HOT_SEG). After these 5 requests
|
||||
complete, the system is queryable with recall >= 0.7.
|
||||
|
||||
**Phase 3: Level 1 (background)**
|
||||
|
||||
```
|
||||
GET /file.rvf Range: bytes=<l1_offset>-<l1_end>
|
||||
```
|
||||
|
||||
Fetches the Level 1 manifest containing the full segment directory. This
|
||||
enables the client to discover all segments and plan on-demand fetches.
|
||||
|
||||
**Phase 4: On-demand (per query)**
|
||||
|
||||
For queries that require cold data not yet fetched:
|
||||
|
||||
```
|
||||
GET /file.rvf Range: bytes=<segment_offset>-<segment_end>
|
||||
```
|
||||
|
||||
The client caches fetched segments locally. Repeated queries against the
|
||||
same data region do not trigger additional requests.
|
||||
|
||||
### HTTP Requirements
|
||||
|
||||
- Server must support `Accept-Ranges: bytes`
|
||||
- Server must return `206 Partial Content` for range requests
|
||||
- Server should support multiple ranges in a single request (`multipart/byteranges`)
|
||||
- Client should use `If-None-Match` with the file's ETag to detect stale caches
|
||||
|
||||
### 6.2 TCP Streaming Protocol (Real-Time Access)
|
||||
|
||||
For real-time ingest and low-latency queries, RVF defines a binary TCP
|
||||
protocol over TLS 1.3.
|
||||
|
||||
**Connection Setup**
|
||||
|
||||
```
|
||||
1. Client opens TCP connection to server
|
||||
2. TLS 1.3 handshake (mandatory, no plaintext mode)
|
||||
3. Client sends HELLO message with protocol version and capabilities
|
||||
4. Server responds with HELLO_ACK confirming capabilities
|
||||
5. Connection is ready for messages
|
||||
```
|
||||
|
||||
**Framing**
|
||||
|
||||
All messages are length-prefixed:
|
||||
|
||||
```
|
||||
Offset Size Field Description
|
||||
------ ------ ------------------ ----------------------------------------
|
||||
0x00 4 frame_length Payload length (big-endian, max 16 MB)
|
||||
0x04 1 msg_type Message type (see below)
|
||||
0x05 3 msg_id Correlation ID (big-endian, wraps at 2^24)
|
||||
0x08 var payload Message-specific payload
|
||||
```
|
||||
|
||||
Frame length is big-endian (network byte order) for consistency with TLS
|
||||
framing. The 16 MB maximum prevents a single message from monopolizing the
|
||||
connection. Payloads larger than 16 MB must be split across multiple messages
|
||||
using continuation framing (see section 6.4).
|
||||
|
||||
**Message Types**
|
||||
|
||||
```
|
||||
Client -> Server:
|
||||
0x01 QUERY Batch query (payload = Batch Query Request)
|
||||
0x02 INGEST Batch ingest (payload = Batch Ingest Request)
|
||||
0x03 DELETE Batch delete (payload = Batch Delete Request)
|
||||
0x04 STATUS Request server status (no payload)
|
||||
0x05 SUBSCRIBE Subscribe to update notifications
|
||||
|
||||
Server -> Client:
|
||||
0x81 QUERY_RESULT Batch query result
|
||||
0x82 INGEST_ACK Batch ingest acknowledgment
|
||||
0x83 DELETE_ACK Batch delete acknowledgment
|
||||
0x84 STATUS_RESP Server status response
|
||||
0x85 UPDATE_NOTIFY Push notification of new data
|
||||
0xFF ERROR Error with code and description
|
||||
```
|
||||
|
||||
**ERROR Message Payload**
|
||||
|
||||
```
|
||||
Offset Size Field Description
|
||||
------ ------ ------------------ ----------------------------------------
|
||||
0x00 2 error_code Error code from section 2
|
||||
0x02 2 description_len Byte length of description string
|
||||
0x04 var description UTF-8 error description (human-readable)
|
||||
```
|
||||
|
||||
### 6.3 Streaming Ingest Protocol
|
||||
|
||||
The TCP protocol supports continuous ingest where the client streams vectors
|
||||
without waiting for per-batch acknowledgments.
|
||||
|
||||
**Flow**
|
||||
|
||||
```
|
||||
Client Server
|
||||
| |
|
||||
|--- INGEST (batch 0) ------------->|
|
||||
|--- INGEST (batch 1) ------------->| Pipelining: send without waiting
|
||||
|--- INGEST (batch 2) ------------->|
|
||||
| | Server writes VEC_SEGs, appends manifest
|
||||
|<--- INGEST_ACK (batch 0) ---------|
|
||||
|<--- INGEST_ACK (batch 1) ---------|
|
||||
| | Backpressure: server delays ACK
|
||||
|--- INGEST (batch 3) ------------->| Client respects window
|
||||
|<--- INGEST_ACK (batch 2) ---------|
|
||||
| |
|
||||
```
|
||||
|
||||
**Backpressure**
|
||||
|
||||
The server controls ingest rate by delaying INGEST_ACK responses. The client
|
||||
must limit its in-flight (unacknowledged) ingest messages to a configurable
|
||||
window size (default: 8 messages). When the window is full, the client must
|
||||
wait for an ACK before sending the next batch.
|
||||
|
||||
The server should send backpressure when:
|
||||
- Write queue exceeds 80% capacity
|
||||
- Compaction is falling behind (dead space > 50%)
|
||||
- Available disk space drops below 10%
|
||||
|
||||
**Commit Semantics**
|
||||
|
||||
Each INGEST_ACK contains the `manifest_epoch` after commit. The server
|
||||
guarantees that all vectors acknowledged with epoch E are visible to any
|
||||
query that reads the manifest at epoch >= E.
|
||||
|
||||
### 6.4 Continuation Framing
|
||||
|
||||
For payloads exceeding the 16 MB frame limit:
|
||||
|
||||
```
|
||||
Frame 0: msg_type = original type, flags bit 0 = CONTINUATION_START
|
||||
Frame 1: msg_type = 0x00 (CONTINUATION), flags bit 0 = 0
|
||||
Frame 2: msg_type = 0x00 (CONTINUATION), flags bit 0 = 0
|
||||
Frame N: msg_type = 0x00 (CONTINUATION), flags bit 1 = CONTINUATION_END
|
||||
```
|
||||
|
||||
The receiver reassembles the payload from all continuation frames before
|
||||
processing. The msg_id is shared across all frames of a continuation sequence.
|
||||
|
||||
### 6.5 SUBSCRIBE and UPDATE_NOTIFY
|
||||
|
||||
The SUBSCRIBE message registers the client for push notifications when new
|
||||
data is committed:
|
||||
|
||||
```
|
||||
SUBSCRIBE payload:
|
||||
Offset Size Field Description
|
||||
------ ------ ------------------ ----------------------------------------
|
||||
0x00 4 min_epoch Only notify for epochs > this value
|
||||
0x04 1 notify_flags Bit 0: ingest, Bit 1: delete, Bit 2: compaction
|
||||
0x05 3 reserved Must be zero
|
||||
```
|
||||
|
||||
The server sends UPDATE_NOTIFY whenever a new MANIFEST_SEG is committed that
|
||||
matches the subscription criteria:
|
||||
|
||||
```
|
||||
UPDATE_NOTIFY payload:
|
||||
Offset Size Field Description
|
||||
------ ------ ------------------ ----------------------------------------
|
||||
0x00 4 epoch New manifest epoch
|
||||
0x04 1 event_type 0=ingest, 1=delete, 2=compaction
|
||||
0x05 3 reserved Must be zero
|
||||
0x08 4 affected_count Number of vectors affected
|
||||
0x0C 8 new_total Total vector count after event
|
||||
```
|
||||
|
||||
## 7. Compaction Scheduling Policy
|
||||
|
||||
Compaction merges small, overlapping, or partially-dead segments into larger,
|
||||
sealed segments. Because compaction competes with queries and ingest for I/O
|
||||
bandwidth, the runtime enforces a scheduling policy.
|
||||
|
||||
### 7.1 IO Budget
|
||||
|
||||
Compaction must consume at most 30% of available IOPS. The runtime measures
|
||||
IOPS over a 5-second sliding window and throttles compaction I/O to stay
|
||||
within budget.
|
||||
|
||||
```
|
||||
available_iops = measured_iops_capacity (from benchmarking at startup)
|
||||
compaction_budget = available_iops * 0.30
|
||||
compaction_throttle = max(compaction_budget - current_compaction_iops, 0)
|
||||
```
|
||||
|
||||
### 7.2 Priority Ordering
|
||||
|
||||
When I/O bandwidth is contended, operations are prioritized:
|
||||
|
||||
```
|
||||
Priority 1 (highest): Queries (reads from VEC_SEG, INDEX_SEG, HOT_SEG)
|
||||
Priority 2: Ingest (writes to VEC_SEG, META_SEG, MANIFEST_SEG)
|
||||
Priority 3 (lowest): Compaction (reads + writes of sealed segments)
|
||||
```
|
||||
|
||||
Compaction yields to queries and ingest. If a compaction I/O operation would
|
||||
cause a query to exceed its time budget, the compaction operation is deferred.
|
||||
|
||||
### 7.3 Scheduling Triggers
|
||||
|
||||
Compaction runs when all of the following conditions are met:
|
||||
|
||||
| Condition | Threshold | Rationale |
|
||||
|-----------|-----------|-----------|
|
||||
| Query load | < 50% of capacity | Avoid competing with active queries |
|
||||
| Dead space ratio | > 20% of total file size | Not worth compacting small amounts |
|
||||
| Segment count | > 32 active segments | Many small segments hurt read performance |
|
||||
| Time since last compaction | > 60 seconds | Prevent compaction storms |
|
||||
|
||||
The runtime evaluates these conditions every 10 seconds.
|
||||
|
||||
### 7.4 Emergency Compaction
|
||||
|
||||
If dead space exceeds 70% of total file size, compaction enters emergency mode:
|
||||
|
||||
```
|
||||
Emergency compaction rules:
|
||||
1. Compaction preempts ingest (ingest is paused, not rejected)
|
||||
2. IO budget increases to 60% of available IOPS
|
||||
3. Compaction runs regardless of query load
|
||||
4. Ingest resumes after dead space drops below 50%
|
||||
```
|
||||
|
||||
During emergency compaction, the server responds to INGEST messages with
|
||||
delayed ACKs (backpressure) rather than rejecting them. Queries continue to
|
||||
be served at highest priority.
|
||||
|
||||
### 7.5 Compaction Progress Reporting
|
||||
|
||||
The STATUS response includes compaction state:
|
||||
|
||||
```
|
||||
STATUS_RESP compaction fields:
|
||||
Offset Size Field Description
|
||||
------ ------ ------------------- ----------------------------------------
|
||||
0x00 1 compaction_state 0=idle, 1=running, 2=emergency
|
||||
0x01 1 progress_pct Completion percentage (0-100)
|
||||
0x02 2 reserved Must be zero
|
||||
0x04 8 dead_bytes Total dead space in bytes
|
||||
0x0C 8 total_bytes Total file size in bytes
|
||||
0x14 4 segments_remaining Segments left to compact
|
||||
0x18 4 segments_completed Segments compacted in current run
|
||||
0x1C 4 estimated_seconds Estimated time to completion
|
||||
0x20 4 io_budget_pct Current IO budget percentage (30 or 60)
|
||||
```
|
||||
|
||||
### 7.6 Compaction Segment Selection
|
||||
|
||||
The runtime selects segments for compaction using a tiered strategy:
|
||||
|
||||
```
|
||||
1. Tombstoned segments: Always compacted first (reclaim dead space)
|
||||
2. Small VEC_SEGs: Segments < 1 MB merged into larger segments
|
||||
3. High-overlap INDEX_SEGs: Index segments covering the same ID range
|
||||
4. Cold OVERLAY_SEGs: Overlay deltas merged into base segments
|
||||
```
|
||||
|
||||
The compaction output is always a sealed segment (SEALED flag set). Sealed
|
||||
segments are immutable and can be verified independently.
|
||||
|
||||
## 8. STATUS Response Format
|
||||
|
||||
The STATUS message provides a snapshot of the server state for monitoring
|
||||
and diagnostics.
|
||||
|
||||
```
|
||||
STATUS_RESP payload:
|
||||
Offset Size Field Description
|
||||
------ ------ ------------------- ----------------------------------------
|
||||
0x00 4 protocol_version Protocol version (currently 1)
|
||||
0x04 4 manifest_epoch Current manifest epoch
|
||||
0x08 8 total_vectors Total vector count
|
||||
0x10 8 total_segments Total segment count
|
||||
0x18 8 file_size_bytes Total file size
|
||||
0x20 4 query_qps Queries per second (last 5s window)
|
||||
0x24 4 ingest_vps Vectors ingested per second (last 5s window)
|
||||
0x28 24 compaction Compaction state (see section 7.5)
|
||||
0x40 1 profile_id Active hardware profile (0x00-0x03)
|
||||
0x41 1 health 0=healthy, 1=degraded, 2=read_only
|
||||
0x42 2 reserved Must be zero
|
||||
0x44 4 uptime_seconds Server uptime
|
||||
```
|
||||
|
||||
## 9. Filter Expression Format
|
||||
|
||||
Filter expressions used in batch queries and batch deletes share a common
|
||||
binary encoding:
|
||||
|
||||
```
|
||||
Offset Size Field Description
|
||||
------ ------ ------------------ ----------------------------------------
|
||||
0x00 1 op Operator enum (see below)
|
||||
0x01 2 field_id Metadata field to filter on
|
||||
0x03 1 value_type Value type (matches metadata field types)
|
||||
0x04 var value Comparison value
|
||||
var var children[] Sub-expressions (for AND/OR/NOT)
|
||||
```
|
||||
|
||||
Operator enum:
|
||||
|
||||
```
|
||||
0x00 EQ field == value
|
||||
0x01 NE field != value
|
||||
0x02 LT field < value
|
||||
0x03 LE field <= value
|
||||
0x04 GT field > value
|
||||
0x05 GE field >= value
|
||||
0x06 IN field in [values]
|
||||
0x07 RANGE field in [low, high)
|
||||
0x10 AND All children must match
|
||||
0x11 OR Any child must match
|
||||
0x12 NOT Negate single child
|
||||
```
|
||||
|
||||
Filters are evaluated during the query scan phase. Vectors that do not match
|
||||
the filter are excluded from distance computation entirely (pre-filtering) or
|
||||
from the result set (post-filtering), depending on the runtime's cost model.
|
||||
|
||||
## 10. Invariants
|
||||
|
||||
1. Error codes are stable across versions; new codes are additive only
|
||||
2. Batch operations are atomic per-item, not per-batch (partial success is valid)
|
||||
3. TCP connections are always TLS 1.3; plaintext is not permitted
|
||||
4. Frame length is big-endian; all other multi-byte fields are little-endian
|
||||
5. HTTP progressive loading must succeed with at most 7 round trips to become queryable
|
||||
6. Compaction never runs at more than 60% of available IOPS, even in emergency mode
|
||||
7. The STATUS response is always available, even during emergency compaction
|
||||
8. Filter expressions are limited to 64 levels of nesting depth
|
||||
Reference in New Issue
Block a user