# ADR-019: Tiered Quantization Formats for Temporal Tensor Store

**Status**: Proposed
**Date**: 2026-02-08
**Parent**: ADR-017 Temporal Tensor Compression, ADR-018 Block-Based Storage Engine
**Author**: System Architecture Team

**Note**: Tiered quantization formats are now implemented in the rvf-quant crate as part of ADR-029 (RVF). See the RVF temperature-tiering specification (docs/research/rvf/spec/03-temperature-tiering.md).

## Version History

| Version | Date | Author | Changes |
|---------|------|--------|---------|
| 0.1 | 2026-02-08 | Architecture Team | Initial proposal |

---
## Abstract

This ADR defines the concrete quantization formats, bit-packing layouts, and codec
interfaces for the five tiers of tensor storage established in ADR-017. Where ADR-017
introduced the concept of access-frequency-driven quantization and temporal scale
reuse, this document specifies the exact byte-level formats for 8-bit (Tier 1 / Hot),
7-bit and 5-bit (Tier 2 / Warm), 3-bit (Tier 3 / Cold), and Compression-to-Zero
(Tier 0 / Absent). It also resolves two open design questions from ADR-017: whether
5-bit quantization is permitted within the warm tier, and how Tier 0 reads behave
when no reconstruction policy exists.

The `codec_bits` module provides a single allocation-free bit packer/unpacker that
all sub-byte formats share. The `quant` module provides per-format quantize and
dequantize functions, with SIMD-accelerated `max_abs` on native targets and a
portable fallback for WASM. Rust trait interfaces are defined so that new bit widths
can be added without modifying the core codec.

---
## 1. Context and Motivation

### 1.1 Gap in ADR-017

ADR-017 established the tiered compression architecture and segment binary format
but left the per-tier quantization details at the algorithmic level. Implementers
need exact byte layouts to write interoperable encoders and decoders, particularly
for the sub-byte formats (7-bit, 5-bit, 3-bit) where values do not align on byte
boundaries.

### 1.2 Sub-Byte Packing Complexity

Standard 8-bit quantization maps trivially to `[u8]` storage. Sub-byte formats
require a bit-packing codec that can write and read arbitrary-width codes into a
byte stream without wasting bits. The codec must:

- Handle bit widths 3, 5, and 7 (with 8 as a degenerate identity case).
- Operate without heap allocations (caller provides output slice).
- Be deterministic and platform-independent (little-endian byte order).
- Support WASM targets where SIMD is optional.

### 1.3 Outlier Handling in 3-Bit

At 3 bits per value, the quantization range is `[-3, +3]` (qmax = 3). Large
outliers in the tensor distribution can cause severe clamping. ADR-017 noted this
risk but did not specify a mitigation. This ADR introduces a two-level scale
option for Tier 3 that uses a 1-bit flag per value to select between a primary
scale (covering the majority of values) and a secondary scale (covering outliers),
while keeping the packed format compact.

### 1.4 Tier 0 Semantics

ADR-017 listed Compression-to-Zero as a future possibility. This ADR formalizes
it: Tier 0 stores no quantized data at all. Only metadata and an optional
`reconstruct_policy` survive. This enables aggressive memory reclamation for
tensors that are no longer accessed but may be reconstructable from other sources
(deltas, factors, or recomputation).

### 1.5 Design Questions Resolved

| Question | Resolution |
|----------|------------|
| Allow 5-bit within warm tier? | Yes. Dynamic downgrade from 7-bit to 5-bit when warm set exceeds a configurable byte cap (`warm_byte_cap`). |
| Tier 0 read semantics? | Return zeros by default. If a `reconstruct_policy` (Delta or Factor) exists, reconstruct from stored representation. |

---
## 2. Decision

We adopt the following five-tier quantization format hierarchy, each with a
well-defined byte layout, packing strategy, and error budget:

| Tier | Name | Bits | Compression vs f32 | Use Case |
|------|------|------|-------------------|----------|
| 1 | Hot | 8 | 4.00x | Active tensors, full fidelity |
| 2a | Warm | 7 | 4.57x | Default warm, near-lossless |
| 2b | Warm-aggressive | 5 | 6.40x | Warm set exceeds `warm_byte_cap` |
| 3 | Cold | 3 | 10.67x | Archived tensors, bounded error |
| 0 | Absent | 0 | Infinite | No data stored; metadata only |

All sub-byte formats share the `codec_bits` packer. All quantization formats use
symmetric per-block quantization with `scale = max_abs / qmax` stored as f32 per
block. The choice of f32 (rather than f16 as in ADR-017 segment headers) is
deliberate at this layer: the segment encoder may convert to f16 for storage, but
the quantizer operates in f32 for precision during the quantize/dequantize path.

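The tier table can be captured in a small helper. A sketch only: the `Tier` enum and method names below are illustrative, not the rvf-quant API.

```rust
/// Illustrative tier labels with the bit widths from the table above.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Tier { Hot, Warm, WarmAggressive, Cold, Absent }

impl Tier {
    /// Bits per value for each tier (0 = no data stored).
    fn bits(self) -> u8 {
        match self {
            Tier::Hot => 8,
            Tier::Warm => 7,
            Tier::WarmAggressive => 5,
            Tier::Cold => 3,
            Tier::Absent => 0,
        }
    }

    /// Asymptotic compression ratio vs. f32 (32 bits per value),
    /// ignoring per-block scale overhead; None for Tier 0 ("infinite").
    fn compression_vs_f32(self) -> Option<f32> {
        match self.bits() {
            0 => None,
            b => Some(32.0 / b as f32),
        }
    }
}

fn main() {
    assert_eq!(Tier::Hot.compression_vs_f32(), Some(4.0));
    assert!((Tier::Warm.compression_vs_f32().unwrap() - 4.57).abs() < 0.01);
    assert_eq!(Tier::Absent.compression_vs_f32(), None);
}
```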
---

## 3. Detailed Design

### 3.1 Tier 1: 8-Bit Quantization (Hot)

**Algorithm**: Symmetric per-block quantization.

```
Given: block of N f32 values, block_size typically 64 or 128
scale = max_abs(values) / 127
q[i] = round(values[i] / scale)
q[i] = clamp(q[i], -127, +127)     // i8 range
store: q as [i8; N] + scale as f32
```

**Storage layout** (one block, block_size = 8 for illustration):

```
Byte offset: 0    1    2    3    4    5    6    7    8    9    10   11
            [ scale (f32, LE) ] [q0] [q1] [q2] [q3] [q4] [q5] [q6] [q7]
            ~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
             4 bytes             8 bytes (1 byte per i8 value)

Total per block: 4 + block_size bytes
```

**Effective compression** (block_size = 64):

```
raw   = 64 * 4 = 256 bytes
quant = 4 + 64 * 1 = 68 bytes
ratio = 256 / 68 = 3.76x (single block)
```

With temporal amortization (100 frames sharing scales): `256*100 / (4 + 64*100)` = 4.00x.

**Dequantize**:

```
values[i] = q[i] as f32 * scale
```

**Error bound**: `max_error = scale / 2 = max_abs / 254` (the stored scale is the
quantization step). See Section 3.8 for the full analysis.

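A minimal sketch of the Tier 1 path, assuming only the formulas above; the function names are illustrative, not the `quant` module API (which is block-iterating and defined in Section 3.7).

```rust
/// Quantize one block symmetrically to i8: scale = max_abs / 127,
/// q[i] = clamp(round(x[i] / scale), -127, 127). Sketch only.
fn quantize_s8_block(values: &[f32]) -> (f32, Vec<i8>) {
    let max_abs = values.iter().fold(0.0f32, |m, v| m.max(v.abs()));
    if max_abs == 0.0 {
        return (0.0, vec![0; values.len()]); // all-zero block
    }
    let scale = max_abs / 127.0;
    let codes = values
        .iter()
        .map(|&x| (x / scale).round().clamp(-127.0, 127.0) as i8)
        .collect();
    (scale, codes)
}

/// Dequantize: x'[i] = q[i] as f32 * scale.
fn dequantize_s8_block(scale: f32, codes: &[i8]) -> Vec<f32> {
    codes.iter().map(|&q| q as f32 * scale).collect()
}

fn main() {
    let block = [0.5f32, -1.0, 0.125, 0.9999, -0.25, 0.0, 0.75, -0.5];
    let (scale, codes) = quantize_s8_block(&block);
    let restored = dequantize_s8_block(scale, &codes);
    // Max rounding error is scale / 2 = max_abs / 254.
    for (x, y) in block.iter().zip(&restored) {
        assert!((x - y).abs() <= scale / 2.0 + 1e-7);
    }
}
```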
### 3.2 Tier 2a: 7-Bit Quantization (Warm)

**Algorithm**: Symmetric per-block, 7-bit codes packed into a bitstream.

```
Given: block of N f32 values
scale = max_abs(values) / 63        // qmax = 2^(7-1) - 1 = 63
q[i] = round(values[i] / scale)
q[i] = clamp(q[i], -63, +63)
u[i] = q[i] + 63                    // bias to unsigned [0, 126], fits 7 bits
pack u[i] values using codec_bits at width=7
```

**Bit-packing layout** (8 values packed into 7 bytes):

```
Values:  u0     u1     u2     u3     u4     u5     u6     u7
Bits:    [6..0] [6..0] [6..0] [6..0] [6..0] [6..0] [6..0] [6..0]
         7 bits 7 bits 7 bits 7 bits 7 bits 7 bits 7 bits 7 bits

Packed into 7 bytes (56 bits = 8 * 7 bits):

Byte 0: [u0[6:0] | u1[0]  ]   = u0(7) + u1(1) = 8 bits
        |<--- 7 bits --->|<1>|

Byte 1: [u1[6:1] | u2[1:0]]   = u1(6) + u2(2) = 8 bits
        |<-- 6 bits -->|<-2->|

Byte 2: [u2[6:2] | u3[2:0]]   = u2(5) + u3(3) = 8 bits
        |<- 5 bits ->|<--3-->|

Byte 3: [u3[6:3] | u4[3:0]]   = u3(4) + u4(4) = 8 bits
        |<-4 bits->|<---4--->|

Byte 4: [u4[6:4] | u5[4:0]]   = u4(3) + u5(5) = 8 bits
        |<-3->|<--- 5 bits --->|

Byte 5: [u5[6:5] | u6[5:0]]   = u5(2) + u6(6) = 8 bits
        |<2>|<---- 6 bits ---->|

Byte 6: [u6[6]   | u7[6:0]]   = u6(1) + u7(7) = 8 bits
        |1|<----- 7 bits ----->|

Total: 7 bytes for 8 values = 0.875 bytes/value
```

**Storage per block** (block_size = 64):

```
scale: 4 bytes (f32)
data:  ceil(64 * 7 / 8) = 56 bytes
total: 60 bytes
ratio: 256 / 60 = 4.27x
```

### 3.3 Tier 2b: 5-Bit Quantization (Warm Aggressive)

**Algorithm**: Symmetric per-block, 5-bit codes.

```
Given: block of N f32 values
scale = max_abs(values) / 15        // qmax = 2^(5-1) - 1 = 15
q[i] = round(values[i] / scale)
q[i] = clamp(q[i], -15, +15)
u[i] = q[i] + 15                    // bias to unsigned [0, 30], fits 5 bits
pack u[i] values using codec_bits at width=5
```

**Activation policy**: 5-bit is used instead of 7-bit when the total warm set
size exceeds `warm_byte_cap` (default: 64 MiB). The tier policy monitors
aggregate warm storage and downgrades from 7-bit to 5-bit for the least recently
accessed warm tensors until the cap is satisfied.

**Bit-packing layout** (8 values packed into 5 bytes):

```
Values:  u0     u1     u2     u3     u4     u5     u6     u7
Bits:    [4..0] [4..0] [4..0] [4..0] [4..0] [4..0] [4..0] [4..0]
         5 bits 5 bits 5 bits 5 bits 5 bits 5 bits 5 bits 5 bits

Packed into 5 bytes (40 bits = 8 * 5 bits):

Byte 0: [u0[4:0] | u1[2:0]]           = u0(5) + u1(3) = 8 bits
        |<-- 5 bits -->|<--3-->|

Byte 1: [u1[4:3] | u2[4:0] | u3[0]]   = u1(2) + u2(5) + u3(1) = 8 bits
        |<2>|<-- 5 bits -->|<1>|

Byte 2: [u3[4:1] | u4[3:0]]           = u3(4) + u4(4) = 8 bits
        |<-- 4 bits -->|<--4-->|

Byte 3: [u4[4] | u5[4:0] | u6[1:0]]   = u4(1) + u5(5) + u6(2) = 8 bits
        |1|<-- 5 bits -->|<-2->|

Byte 4: [u6[4:2] | u7[4:0]]           = u6(3) + u7(5) = 8 bits
        |<-3->|<--- 5 bits --->|

Total: 5 bytes for 8 values = 0.625 bytes/value
```

**Storage per block** (block_size = 64):

```
scale: 4 bytes (f32)
data:  ceil(64 * 5 / 8) = 40 bytes
total: 44 bytes
ratio: 256 / 44 = 5.82x
```

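The activation policy above (downgrade the least recently accessed warm tensors from 7-bit to 5-bit until `warm_byte_cap` is satisfied) can be sketched as follows. All names here (`WarmEntry`, `enforce_warm_cap`, the logical-clock `last_access`) are illustrative assumptions, not rvf-quant types.

```rust
/// Hypothetical bookkeeping entry for one warm tensor.
struct WarmEntry {
    last_access: u64, // logical clock; smaller = older
    num_values: usize,
    bits: u8, // 7 or 5
}

/// Bytes for one entry: a 4-byte f32 scale per 64-value block plus packed data.
fn warm_bytes(e: &WarmEntry) -> usize {
    let blocks = (e.num_values + 63) / 64;
    blocks * 4 + (e.num_values * e.bits as usize + 7) / 8
}

/// Downgrade oldest-first from 7-bit to 5-bit until the cap is met.
fn enforce_warm_cap(entries: &mut [WarmEntry], warm_byte_cap: usize) {
    entries.sort_by_key(|e| e.last_access); // oldest first
    let mut total: usize = entries.iter().map(warm_bytes).sum();
    for e in entries.iter_mut() {
        if total <= warm_byte_cap {
            break;
        }
        if e.bits == 7 {
            total -= warm_bytes(e);
            e.bits = 5;
            total += warm_bytes(e);
        }
    }
}

fn main() {
    // Two 64-value tensors at 7-bit: 60 bytes each, 120 total; cap = 110.
    let mut warm = vec![
        WarmEntry { last_access: 1, num_values: 64, bits: 7 },
        WarmEntry { last_access: 2, num_values: 64, bits: 7 },
    ];
    enforce_warm_cap(&mut warm, 110);
    assert_eq!(warm[0].bits, 5); // oldest downgraded (now 44 bytes)
    assert_eq!(warm[1].bits, 7); // cap satisfied, newer entry untouched
}
```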
### 3.4 Tier 3: 3-Bit Quantization (Cold)

**Algorithm**: Symmetric per-block, 3-bit codes with optional two-level scale.

#### Standard Mode

```
Given: block of N f32 values
scale = max_abs(values) / 3         // qmax = 2^(3-1) - 1 = 3
q[i] = round(values[i] / scale)
q[i] = clamp(q[i], -3, +3)
u[i] = q[i] + 3                     // bias to unsigned [0, 6], fits 3 bits
pack u[i] values using codec_bits at width=3
```

#### Two-Level Scale Mode (Outlier Handling)

When the value distribution has outliers (values significantly larger than the
bulk of the distribution), a single scale wastes most of the 3-bit range on the
long tail. The two-level scale splits the range:

```
Given: block of N f32 values, outlier_fraction (default: 0.05)
sorted_abs    = sort(|values|, descending)
outlier_count = ceil(N * outlier_fraction)
primary_max   = sorted_abs[outlier_count]   // excludes top 5%
secondary_max = sorted_abs[0]               // full range

primary_scale   = primary_max / 3           // covers bulk values
secondary_scale = secondary_max / 3         // covers outliers

For each value[i]:
    if |value[i]| > primary_max:
        flag[i] = 1                         // use secondary scale
        q[i] = round(value[i] / secondary_scale)
    else:
        flag[i] = 0                         // use primary scale
        q[i] = round(value[i] / primary_scale)
    q[i] = clamp(q[i], -3, +3)
    u[i] = q[i] + 3

store: primary_scale (f32) + secondary_scale (f32) + flag bits + packed codes
```

**Bit-packing layout** (8 values packed into 3 bytes):

```
Values:  u0     u1     u2     u3     u4     u5     u6     u7
Bits:    [2..0] [2..0] [2..0] [2..0] [2..0] [2..0] [2..0] [2..0]
         3 bits 3 bits 3 bits 3 bits 3 bits 3 bits 3 bits 3 bits

Packed into 3 bytes (24 bits = 8 * 3 bits):

Byte 0: [u0[2:0] | u1[2:0] | u2[1:0]]        = u0(3) + u1(3) + u2(2) = 8 bits
        |<-3->|<-3->|<2>|

Byte 1: [u2[2] | u3[2:0] | u4[2:0] | u5[0]]  = u2(1) + u3(3) + u4(3) + u5(1) = 8 bits
        |1|<-3->|<-3->|1|

Byte 2: [u5[2:1] | u6[2:0] | u7[2:0]]        = u5(2) + u6(3) + u7(3) = 8 bits
        |<2>|<-3->|<-3->|

Total: 3 bytes for 8 values = 0.375 bytes/value
```

**Two-level scale storage layout** (one block, block_size = 64):

```
Byte offset: 0                   4                      8              16
            [primary_scale f32] [secondary_scale f32] [flag bytes  ] [packed codes]
            ~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~ ~~~~~~~~~~~~~
             4 bytes             4 bytes               ceil(64/8)=8   ceil(64*3/8)=24

Total per block (two-level): 4 + 4 + 8 + 24 = 40 bytes
Total per block (standard):  4 + 24 = 28 bytes
ratio (standard):  256 / 28 = 9.14x
ratio (two-level): 256 / 40 = 6.40x
```

The two-level mode trades compression ratio for outlier fidelity. It is selected
automatically when the ratio `max_abs / median_abs` exceeds a configurable
threshold (default: 5.0), indicating a heavy-tailed distribution.

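The scale-selection step of the two-level pseudocode can be sketched in Rust. Illustrative only: `two_level_scales` is a hypothetical name, and the edge case where `outlier_count == N` is not handled.

```rust
/// Pick primary/secondary scales and per-value flags for a 3-bit block,
/// following the pseudocode above (outlier_fraction default: 0.05).
fn two_level_scales(values: &[f32], outlier_fraction: f32) -> (f32, f32, Vec<bool>) {
    let mut sorted_abs: Vec<f32> = values.iter().map(|v| v.abs()).collect();
    sorted_abs.sort_by(|a, b| b.partial_cmp(a).unwrap()); // descending
    let outlier_count = ((values.len() as f32) * outlier_fraction).ceil() as usize;
    let primary_max = sorted_abs[outlier_count]; // excludes top outliers
    let secondary_max = sorted_abs[0];           // full range
    let primary_scale = primary_max / 3.0;       // covers bulk values
    let secondary_scale = secondary_max / 3.0;   // covers outliers
    // flag = 1 (true) selects the secondary scale.
    let flags = values.iter().map(|v| v.abs() > primary_max).collect();
    (primary_scale, secondary_scale, flags)
}

fn main() {
    // 19 small values plus one large outlier (N = 20, so outlier_count = 1).
    let mut vals = vec![0.1f32; 19];
    vals.push(10.0);
    let (p, s, flags) = two_level_scales(&vals, 0.05);
    assert!((p - 0.1f32 / 3.0).abs() < 1e-6);  // primary covers the bulk
    assert!((s - 10.0f32 / 3.0).abs() < 1e-6); // secondary covers the outlier
    assert_eq!(flags.iter().filter(|&&f| f).count(), 1);
}
```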
### 3.5 Tier 0: Compression to Zero (Absent)

**Algorithm**: No quantized data is stored.

```
Tier 0 representation:
  metadata:           TensorMeta (id, shape, dtype, timestamps)
  reconstruct_policy: Option<ReconstructPolicy>
  quantized_data:     None

enum ReconstructPolicy {
    None,                                     // reads return zeros
    Delta { base_id: TensorId, delta: ... },  // reconstruct as base + delta
    Factor { source_id: TensorId, ... },      // reconstruct via transformation
}
```

**Read semantics**:

| `reconstruct_policy` | Behavior |
|----------------------|----------|
| `None` | Return a zero-filled tensor of the recorded shape. Fast-fail mode returns `Err(TierZeroNoPolicy)` instead. |
| `Delta` | Load base tensor, apply stored delta. May trigger recursive decompression if base is also tiered. |
| `Factor` | Load source tensor, apply stored transformation (scale, permutation, projection). |

**Transition to Tier 0**: A tensor is eligible for Tier 0 when its tier score
drops below `absent_min_score` (default: 1) and it has not been accessed for
longer than `absent_age_threshold` (default: 24 hours). The transition is
irreversible without external data: once quantized data is discarded, only the
reconstruction policy (if any) can recover approximate values.

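A hedged sketch of the read-semantics table. The `ReconstructPolicy` variants here are simplified placeholders (dense vectors instead of tensor IDs and stored deltas), and `read_tier0` is an illustrative name.

```rust
/// Simplified placeholder for the stored reconstruction policy.
#[derive(Clone)]
enum ReconstructPolicy {
    Delta { base: Vec<f32>, delta: Vec<f32> }, // reconstruct as base + delta
    Factor { source: Vec<f32>, scale: f32 },   // reconstruct via transformation
}

#[derive(Debug, PartialEq)]
enum ReadErr { TierZeroNoPolicy }

/// Tier 0 read: zeros by default, error in fast-fail mode,
/// reconstruction when a policy exists.
fn read_tier0(
    shape_len: usize,
    policy: Option<&ReconstructPolicy>,
    fast_fail: bool,
) -> Result<Vec<f32>, ReadErr> {
    match policy {
        None if fast_fail => Err(ReadErr::TierZeroNoPolicy),
        None => Ok(vec![0.0; shape_len]),
        Some(ReconstructPolicy::Delta { base, delta }) => {
            Ok(base.iter().zip(delta).map(|(b, d)| b + d).collect())
        }
        Some(ReconstructPolicy::Factor { source, scale }) => {
            Ok(source.iter().map(|s| s * scale).collect())
        }
    }
}

fn main() {
    assert_eq!(read_tier0(3, None, false), Ok(vec![0.0, 0.0, 0.0]));
    assert_eq!(read_tier0(3, None, true), Err(ReadErr::TierZeroNoPolicy));
    let p = ReconstructPolicy::Delta { base: vec![1.0, 2.0], delta: vec![0.5, -0.5] };
    assert_eq!(read_tier0(2, Some(&p), false), Ok(vec![1.5, 1.5]));
}
```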
### 3.6 Bit Packing Module: `codec_bits`

The core packing and unpacking functions shared by all sub-byte formats.

```rust
/// Errors from bit codec operations.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum CodecErr {
    /// Output buffer too small. Contains the required size in bytes.
    OutputTooSmall { required: usize },
    /// Input buffer too small for the declared number of values.
    InputTooSmall { required: usize },
    /// Bit width must be in [1, 8].
    InvalidBitWidth { bits: u8 },
}

/// Pack `values.len()` signed codes into `out`, using `bits` bits per code.
///
/// Each value in `values` is treated as a signed integer in `[-(2^(bits-1)-1), 2^(bits-1)-1]`.
/// It is biased to unsigned before packing: `u = v + (2^(bits-1) - 1)`.
///
/// Returns the number of bytes written to `out`.
///
/// # Errors
/// - `CodecErr::OutputTooSmall` if `out` cannot hold the packed data.
/// - `CodecErr::InvalidBitWidth` if `bits` is 0 or greater than 8.
pub fn pack_bits(values: &[i8], bits: u8, out: &mut [u8]) -> Result<usize, CodecErr> {
    if bits == 0 || bits > 8 {
        return Err(CodecErr::InvalidBitWidth { bits });
    }
    let total_bits = values.len() as u64 * bits as u64;
    let required = ((total_bits + 7) / 8) as usize;
    if out.len() < required {
        return Err(CodecErr::OutputTooSmall { required });
    }

    // i16 for the bias offset: `(1i8 << 7) - 1` would overflow i8 at bits = 8.
    let qmax = (1i16 << (bits - 1)) - 1;
    let mask: u64 = (1u64 << bits) - 1;
    let mut acc: u64 = 0;
    let mut acc_bits: u32 = 0;
    let mut pos: usize = 0;

    for &v in values {
        let u = (v as i16 + qmax) as u64 & mask;
        acc |= u << acc_bits;
        acc_bits += bits as u32;
        while acc_bits >= 8 {
            out[pos] = (acc & 0xFF) as u8;
            pos += 1;
            acc >>= 8;
            acc_bits -= 8;
        }
    }
    // Flush remaining bits
    if acc_bits > 0 {
        out[pos] = (acc & 0xFF) as u8;
        pos += 1;
    }
    Ok(pos)
}

/// Unpack codes from `inp` into `out`, reading `bits` bits per code.
///
/// Reads exactly `out.len()` values. Each unsigned code is unbiased back to signed:
/// `v = u - (2^(bits-1) - 1)`.
///
/// Returns the number of bytes consumed from `inp`.
///
/// # Errors
/// - `CodecErr::InputTooSmall` if `inp` does not contain enough data.
/// - `CodecErr::InvalidBitWidth` if `bits` is 0 or greater than 8.
pub fn unpack_bits(inp: &[u8], bits: u8, out: &mut [i8]) -> Result<usize, CodecErr> {
    if bits == 0 || bits > 8 {
        return Err(CodecErr::InvalidBitWidth { bits });
    }
    let total_bits = out.len() as u64 * bits as u64;
    let required = ((total_bits + 7) / 8) as usize;
    if inp.len() < required {
        return Err(CodecErr::InputTooSmall { required });
    }

    // Same i16 bias as in `pack_bits` to avoid i8 overflow at bits = 8.
    let qmax = (1i16 << (bits - 1)) - 1;
    let mask: u64 = (1u64 << bits) - 1;
    let mut acc: u64 = 0;
    let mut acc_bits: u32 = 0;
    let mut byte_pos: usize = 0;
    let mut val_pos: usize = 0;

    while val_pos < out.len() {
        while acc_bits < bits as u32 {
            acc |= (inp[byte_pos] as u64) << acc_bits;
            acc_bits += 8;
            byte_pos += 1;
        }
        let u = (acc & mask) as i16;
        out[val_pos] = (u - qmax) as i8;
        acc >>= bits;
        acc_bits -= bits as u32;
        val_pos += 1;
    }
    Ok(required)
}
```

**Properties**:

- No heap allocations. Callers provide both input and output slices.
- Single bit writer / bit reader using a 64-bit accumulator.
- Deterministic little-endian byte order.
- `pack_bits` and `unpack_bits` are mutual inverses: `unpack(pack(v)) == v`
  for all valid inputs.

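A round-trip usage sketch for the codec. To keep the example self-contained it inlines compact, `Vec`-based versions of `pack_bits`/`unpack_bits` with error handling elided; the real API above is allocation-free and slice-based.

```rust
/// Compact pack: bias each signed code and stream it LSB-first
/// through a 64-bit accumulator.
fn pack_bits(values: &[i8], bits: u8, out: &mut Vec<u8>) {
    let qmax = (1i16 << (bits - 1)) - 1;
    let mask: u64 = (1u64 << bits) - 1;
    let (mut acc, mut acc_bits) = (0u64, 0u32);
    for &v in values {
        acc |= ((v as i16 + qmax) as u64 & mask) << acc_bits;
        acc_bits += bits as u32;
        while acc_bits >= 8 {
            out.push((acc & 0xFF) as u8);
            acc >>= 8;
            acc_bits -= 8;
        }
    }
    if acc_bits > 0 {
        out.push((acc & 0xFF) as u8); // flush remainder
    }
}

/// Compact unpack: read `count` codes and unbias them back to signed.
fn unpack_bits(inp: &[u8], bits: u8, count: usize) -> Vec<i8> {
    let qmax = (1i16 << (bits - 1)) - 1;
    let mask: u64 = (1u64 << bits) - 1;
    let (mut acc, mut acc_bits, mut pos) = (0u64, 0u32, 0usize);
    let mut out = Vec::with_capacity(count);
    for _ in 0..count {
        while acc_bits < bits as u32 {
            acc |= (inp[pos] as u64) << acc_bits;
            acc_bits += 8;
            pos += 1;
        }
        out.push(((acc & mask) as i16 - qmax) as i8);
        acc >>= bits;
        acc_bits -= bits as u32;
    }
    out
}

fn main() {
    // Lossless round-trip over the full valid range at every supported width.
    for &bits in &[3u8, 5, 7, 8] {
        let qmax = (1i16 << (bits - 1)) - 1;
        let codes: Vec<i8> = (-qmax..=qmax).map(|v| v as i8).collect();
        let mut packed = Vec::new();
        pack_bits(&codes, bits, &mut packed);
        assert_eq!(packed.len(), (codes.len() * bits as usize + 7) / 8);
        assert_eq!(unpack_bits(&packed, bits, codes.len()), codes);
    }
}
```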
### 3.7 Quant Module Functions

```rust
/// Block-level quantization configuration.
pub struct QuantConfig {
    pub block_size: usize,          // elements per quantization block (default: 64)
    pub two_level_threshold: f32,   // max/median ratio to trigger two-level (default: 5.0)
}

/// Quantized block result.
pub struct QuantizedBlock {
    pub scale: f32,
    pub secondary_scale: Option<f32>, // only for two-level 3-bit
    pub flags: Option<Vec<u8>>,       // 1-bit-per-value flags for two-level
    pub codes: Vec<i8>,               // signed quantized codes
    pub bits: u8,
}

/// Symmetric 8-bit quantization (Tier 1 - Hot).
///
/// Quantizes each block of `block_size` values independently.
///   scale = max_abs(block) / 127
///   q[i]  = clamp(round(x[i] / scale), -127, 127)
pub fn quantize_s8(
    values: &[f32],
    config: &QuantConfig,
) -> Vec<QuantizedBlock>;

/// Symmetric N-bit quantization (Tier 2/3 - Warm/Cold).
///
/// `bits` must be one of: 7, 5, 3.
///   qmax  = 2^(bits-1) - 1
///   scale = max_abs(block) / qmax
///   q[i]  = clamp(round(x[i] / scale), -qmax, qmax)
///
/// For bits=3 and config.two_level_threshold exceeded: uses two-level scale.
pub fn quantize_bits(
    values: &[f32],
    bits: u8,
    config: &QuantConfig,
) -> Vec<QuantizedBlock>;

/// Dequantize a block back to f32 values.
///
/// For standard mode:  x'[i] = codes[i] as f32 * scale
/// For two-level mode: x'[i] = codes[i] as f32 * (if flags[i] then secondary_scale else scale)
pub fn dequantize(block: &QuantizedBlock) -> Vec<f32>;

/// Compute the maximum absolute value across a slice.
///
/// On native targets with `target_feature = "avx2"` or `target_feature = "neon"`:
/// uses SIMD intrinsics for 4-8x throughput.
/// On WASM with `target_feature = "simd128"` (optional):
/// uses wasm_simd128 intrinsics.
/// Fallback: portable scalar loop.
#[inline]
pub fn max_abs(values: &[f32]) -> f32;
```

**SIMD implementation sketch for `max_abs`** (AVX2):

```rust
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn max_abs_avx2(values: &[f32]) -> f32 {
    use std::arch::x86_64::*;
    let sign_mask = _mm256_set1_ps(f32::from_bits(0x7FFF_FFFF)); // abs mask
    let mut vmax = _mm256_setzero_ps();
    let chunks = values.len() / 8;

    for i in 0..chunks {
        let v = _mm256_loadu_ps(values.as_ptr().add(i * 8));
        let abs_v = _mm256_and_ps(v, sign_mask);
        vmax = _mm256_max_ps(vmax, abs_v);
    }

    // Horizontal max reduction
    let hi128 = _mm256_extractf128_ps(vmax, 1);
    let lo128 = _mm256_castps256_ps128(vmax);
    let max128 = _mm_max_ps(hi128, lo128);
    let shuf = _mm_movehdup_ps(max128);
    let max64 = _mm_max_ps(max128, shuf);
    let shuf2 = _mm_movehl_ps(max64, max64);
    let max32 = _mm_max_ss(max64, shuf2);
    let mut result = _mm_cvtss_f32(max32);

    // Handle remainder
    for i in (chunks * 8)..values.len() {
        result = result.max(values[i].abs());
    }
    result
}
```

**WASM portable fallback**:

```rust
#[cfg(not(any(target_arch = "x86_64", target_arch = "aarch64")))]
pub fn max_abs(values: &[f32]) -> f32 {
    let mut m: f32 = 0.0;
    for &v in values {
        let a = v.abs();
        if a > m {
            m = a;
        }
    }
    m
}
```

When WASM SIMD is enabled via `target_feature = "simd128"`, a vectorized path
processes 4 f32 values per iteration using `v128` types. This is optional and
gated behind a cargo feature flag `wasm-simd`.

### 3.8 Error Bound Analysis

For symmetric quantization with bit width `B`, block maximum `max_abs`, and
`qmax = 2^(B-1) - 1` (so the stored scale is `s = max_abs / qmax`):

```
quantization_step  = s = max_abs / qmax
max_element_error  = quantization_step / 2        (from rounding)
max_relative_error = 1 / (2 * qmax)               (relative to max_abs, worst case)
rms_error          = quantization_step / sqrt(12) (uniform quantization noise)
```

**Per-tier error bounds**:

| Tier | Bits | qmax | Max Rel. Error | RMS Rel. Error | Max Abs. Error (max_abs = 1.0) |
|------|------|------|----------------|----------------|--------------------------------|
| Hot (8-bit) | 8 | 127 | 0.394% | 0.228% | 0.00394 |
| Warm (7-bit) | 7 | 63 | 0.794% | 0.458% | 0.00794 |
| Warm-agg (5-bit) | 5 | 15 | 3.333% | 1.925% | 0.03333 |
| Cold (3-bit, std) | 3 | 3 | 16.667% | 9.623% | 0.16667 |
| Cold (3-bit, 2-level) | 3 | 3 | 16.667% per scale | 9.623% | Reduced for bulk values |

**Two-level scale improvement for 3-bit**: When 95% of values fall within
`primary_max` and outliers use `secondary_scale`:

| Component | Fraction | Scale | Effective Max Error |
|-----------|----------|-------|---------------------|
| Bulk values (95%) | 0.95 | primary_scale (smaller) | 16.7% of primary_max |
| Outlier values (5%) | 0.05 | secondary_scale (larger) | 16.7% of secondary_max |

The bulk values achieve much lower absolute error because `primary_scale` is
typically 3-10x smaller than the single-scale `scale`. The outliers retain the
same relative error but are fewer in number.

**Drift compounding**: When drift tolerance is `d` (e.g., 10%), and a frame is
quantized with scales from an earlier frame, the effective max relative error
becomes `(1 + d) / (2 * qmax)`. For 8-bit with 10% drift: `1.1 / 254 = 0.433%`.

**Cumulative error table with drift**:

| Tier | Bits | No Drift | 10% Drift | 20% Drift |
|------|------|----------|-----------|-----------|
| Hot | 8 | 0.394% | 0.433% | 0.472% |
| Warm | 7 | 0.794% | 0.873% | 0.952% |
| Warm-agg | 5 | 3.333% | 3.667% | 4.000% |
| Cold | 3 | 16.667% | 18.333% | 20.000% |

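The table values follow directly from the formulas; a quick arithmetic check of the no-drift columns:

```rust
/// Verify max relative error = 1 / (2 * qmax) and
/// RMS relative error = 1 / (qmax * sqrt(12)) for each tier.
fn main() {
    let tiers = [("Hot", 8u32), ("Warm", 7), ("Warm-agg", 5), ("Cold", 3)];
    for (name, bits) in tiers {
        let qmax = (1u32 << (bits - 1)) - 1;
        let max_rel = 100.0 / (2.0 * qmax as f64);        // percent
        let rms_rel = 100.0 / (qmax as f64 * 12f64.sqrt()); // percent
        println!("{name}: qmax={qmax} max={max_rel:.3}% rms={rms_rel:.3}%");
    }
    // Max relative errors from the table above.
    assert!((100.0 / 254.0f64 - 0.394).abs() < 0.001);  // 8-bit
    assert!((100.0 / 126.0f64 - 0.794).abs() < 0.001);  // 7-bit
    assert!((100.0 / 30.0f64 - 3.333).abs() < 0.001);   // 5-bit
    assert!((100.0 / 6.0f64 - 16.667).abs() < 0.001);   // 3-bit
}
```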
### 3.9 Complete Quantizer and Packer Traits

```rust
/// Trait for quantization formats that can encode and decode tensor blocks.
pub trait TensorQuantizer {
    /// The bit width of this quantizer.
    fn bit_width(&self) -> u8;

    /// Quantize a block of f32 values into signed codes and scale(s).
    fn quantize_block(&self, values: &[f32], config: &QuantConfig) -> QuantizedBlock;

    /// Dequantize a block back to f32 values.
    fn dequantize_block(&self, block: &QuantizedBlock, out: &mut [f32]) -> Result<(), CodecErr>;

    /// Returns the packed byte size for `num_values` at this bit width,
    /// excluding scale storage.
    fn packed_data_size(&self, num_values: usize) -> usize {
        (num_values * self.bit_width() as usize + 7) / 8
    }

    /// Returns total block storage size including scale(s) and flags.
    fn block_storage_size(&self, block_size: usize) -> usize;
}

/// Trait for bit-level packing codecs.
pub trait BitCodec {
    /// Pack signed codes into a byte buffer.
    fn pack(&self, codes: &[i8], bits: u8, out: &mut [u8]) -> Result<usize, CodecErr>;

    /// Unpack codes from a byte buffer.
    fn unpack(&self, data: &[u8], bits: u8, out: &mut [i8]) -> Result<usize, CodecErr>;
}

/// Standard implementation using the accumulator-based codec_bits functions.
pub struct StandardBitCodec;

impl BitCodec for StandardBitCodec {
    fn pack(&self, codes: &[i8], bits: u8, out: &mut [u8]) -> Result<usize, CodecErr> {
        pack_bits(codes, bits, out)
    }

    fn unpack(&self, data: &[u8], bits: u8, out: &mut [i8]) -> Result<usize, CodecErr> {
        unpack_bits(data, bits, out)
    }
}
```

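Adding a new bit width only requires implementing the trait, not touching the codec. A sketch: the trait below is a trimmed copy of `TensorQuantizer` (quantize/dequantize elided for brevity), and the 6-bit format `Q6` is hypothetical.

```rust
/// Trimmed copy of the trait above: just the bit width and the
/// default packed-size calculation.
trait TensorQuantizer {
    fn bit_width(&self) -> u8;
    fn packed_data_size(&self, num_values: usize) -> usize {
        (num_values * self.bit_width() as usize + 7) / 8
    }
}

/// Hypothetical 6-bit format: only the bit width needs specifying;
/// the default `packed_data_size` comes along for free.
struct Q6;
impl TensorQuantizer for Q6 {
    fn bit_width(&self) -> u8 { 6 }
}

fn main() {
    // 64 values at 6 bits pack into ceil(384 / 8) = 48 bytes.
    assert_eq!(Q6.packed_data_size(64), 48);
}
```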
### 3.10 Block Storage Summary Diagram

```
TIER 1 (8-bit):
+--------+-------+-------+-------+-----+-------+
| scale  | q[0]  | q[1]  | q[2]  | ... | q[63] |
| f32 LE | i8    | i8    | i8    |     | i8    |
+--------+-------+-------+-------+-----+-------+
 4 bytes   1       1       1             1        = 68 bytes / block

TIER 2a (7-bit):
+--------+--------------------------------------------+
| scale  | packed 7-bit codes (56 bytes for 64 vals)  |
| f32 LE | bitstream, little-endian accumulator       |
+--------+--------------------------------------------+
 4 bytes   ceil(64*7/8) = 56 bytes                      = 60 bytes / block

TIER 2b (5-bit):
+--------+--------------------------------------------+
| scale  | packed 5-bit codes (40 bytes for 64 vals)  |
| f32 LE | bitstream, little-endian accumulator       |
+--------+--------------------------------------------+
 4 bytes   ceil(64*5/8) = 40 bytes                      = 44 bytes / block

TIER 3 standard (3-bit):
+--------+--------------------------------------------+
| scale  | packed 3-bit codes (24 bytes for 64 vals)  |
| f32 LE | bitstream, little-endian accumulator       |
+--------+--------------------------------------------+
 4 bytes   ceil(64*3/8) = 24 bytes                      = 28 bytes / block

TIER 3 two-level (3-bit):
+--------+--------+----------+-------------------------------+
| pscale | sscale | flags    | packed 3-bit codes            |
| f32 LE | f32 LE | ceil(N/8)| bitstream                     |
+--------+--------+----------+-------------------------------+
 4        4        8 bytes    24 bytes                         = 40 bytes / block

TIER 0 (absent):
+--------------------------------------+
| TensorMeta + ReconstructPolicy only  |
| NO quantized data                    |
+--------------------------------------+
 variable (typically 32-128 bytes metadata)
```

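The per-block sizes in the diagrams reduce to `4 + ceil(64 * bits / 8)` for the single-scale formats (the two-level variant adds a second scale plus flag bytes); a quick check, with `block_size_bytes` as an illustrative helper name:

```rust
/// Bytes per 64-value block: one f32 scale plus packed codes.
fn block_size_bytes(bits: usize) -> usize {
    4 + (64 * bits + 7) / 8
}

fn main() {
    assert_eq!(block_size_bytes(8), 68);         // Tier 1
    assert_eq!(block_size_bytes(7), 60);         // Tier 2a
    assert_eq!(block_size_bytes(5), 44);         // Tier 2b
    assert_eq!(block_size_bytes(3), 28);         // Tier 3 standard
    assert_eq!(block_size_bytes(3) + 4 + 8, 40); // Tier 3 two-level
}
```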
---

## 4. Alternatives Considered

### 4.1 4-Bit as the Warm Tier

4-bit quantization (qmax = 7, 8.00x compression) is the most widely studied
format (GPTQ, AWQ). We considered using 4-bit instead of 7-bit for the warm
tier. **Rejected** because: (a) the jump from 8-bit to 4-bit is too large for
tensors that were recently hot, causing unnecessary quality loss; (b) 7-bit
provides a gentler step-down; (c) 5-bit is available as an intermediate when
memory pressure increases.

### 4.2 Uniform 4-Bit Across All Non-Hot Tiers

A simpler design with only two quantization levels (8-bit hot, 4-bit everything
else). **Rejected** because: (a) cold tensors waste 1 extra bit per value when
3-bit suffices; (b) no path to aggressive compression under memory pressure;
(c) loses the granularity that enables smooth quality degradation.

### 4.3 Asymmetric Quantization for 3-Bit

Using asymmetric quantization (with zero-point) for 3-bit to better utilize the
`[0, 7]` unsigned range when distributions are not centered. **Rejected**
because: (a) adds 4 bytes of zero-point storage per block; (b) requires an
additional subtraction in the dequantize path; (c) the two-level scale approach
handles asymmetric distributions more effectively by splitting the scale rather
than shifting the range.

### 4.4 Lookup Table (Codebook) Quantization for Cold

Using a small codebook (e.g., 8 centroids) instead of uniform 3-bit levels.
**Rejected** because: (a) requires a per-block or per-tensor codebook training
step that is expensive for streaming data; (b) codebook storage overhead is
comparable to scale storage but with higher decode complexity; (c) uniform
quantization is simpler to implement and reason about.

### 4.5 No Two-Level Scale (Simpler 3-Bit)

Omitting the two-level scale option entirely. **Considered but rejected** because
agent embedding tensors frequently exhibit heavy-tailed distributions where a few
dimensions carry disproportionate magnitude. Without two-level scale, these
outliers cause the single scale to be too large, wasting most of the 3-bit range
on the bulk of near-zero values.

---

## 5. Acceptance Criteria
|
||||
|
||||
### 5.1 Format Correctness
|
||||
|
||||
- [ ] `pack_bits` followed by `unpack_bits` is a lossless round-trip for all
|
||||
bit widths (3, 5, 7, 8) and all valid signed input ranges.
|
||||
- [ ] `quantize_s8` followed by `dequantize` produces values within the
|
||||
theoretical error bound (`scale / 254`) of the originals.
|
||||
- [ ] `quantize_bits(7, ...)` followed by `dequantize` produces values within
|
||||
`scale / 126` of the originals.
|
||||
- [ ] `quantize_bits(5, ...)` followed by `dequantize` produces values within
|
||||
`scale / 30` of the originals.
|
||||
- [ ] `quantize_bits(3, ...)` followed by `dequantize` produces values within
|
||||
`scale / 6` of the originals (standard mode).
|
||||
- [ ] Two-level 3-bit mode activates when `max/median > two_level_threshold`.
|
||||
- [ ] Tier 0 reads return zeros when `reconstruct_policy` is `None`.
|
||||
- [ ] Tier 0 reads invoke reconstruction when a policy exists.
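
The round-trip criteria above can be exercised end to end with a minimal sketch.
This is not the actual `codec_bits`/`quant` API: the function names, the single
per-block scale, and the LSB-first accumulator layout are assumptions made for
illustration. The half-step error bound `scale / (2 * qmax)` reproduces the
`scale / 254`, `/ 126`, `/ 30`, and `/ 6` figures listed above.

```rust
/// Symmetric quantization to a signed b-bit code: qmax = 2^(b-1) - 1.
fn quantize(bits: u32, scale: f32, x: &[f32]) -> Vec<i8> {
    let qmax = ((1i32 << (bits - 1)) - 1) as f32; // 3, 15, 63, 127
    x.iter()
        .map(|&v| (v / scale * qmax).round().clamp(-qmax, qmax) as i8)
        .collect()
}

fn dequantize(bits: u32, scale: f32, codes: &[i8]) -> Vec<f32> {
    let qmax = ((1i32 << (bits - 1)) - 1) as f32;
    codes.iter().map(|&c| c as f32 * scale / qmax).collect()
}

/// LSB-first bit packer: keeps the low `bits` of each two's-complement code.
fn pack_bits(bits: u32, codes: &[i8], out: &mut Vec<u8>) {
    let mask = (1u64 << bits) - 1;
    let (mut acc, mut n) = (0u64, 0u32);
    for &c in codes {
        acc |= ((c as u8 as u64) & mask) << n;
        n += bits;
        while n >= 8 {
            out.push(acc as u8); // flush whole bytes; at most 7 bits remain
            acc >>= 8;
            n -= 8;
        }
    }
    if n > 0 {
        out.push(acc as u8); // tail byte for the final partial group
    }
}

fn unpack_bits(bits: u32, count: usize, packed: &[u8]) -> Vec<i8> {
    let mask = (1u64 << bits) - 1;
    let mut it = packed.iter();
    let (mut acc, mut n) = (0u64, 0u32);
    (0..count)
        .map(|_| {
            while n < bits {
                acc |= (*it.next().expect("truncated stream") as u64) << n;
                n += 8;
            }
            let raw = (acc & mask) as u32;
            acc >>= bits;
            n -= bits;
            // Sign-extend the low `bits` back to a signed 8-bit code.
            ((raw << (32 - bits)) as i32 >> (32 - bits)) as i8
        })
        .collect()
}

fn main() {
    let x: Vec<f32> = (0..256).map(|i| (i as f32 * 0.37).sin()).collect();
    let scale = x.iter().fold(0.0f32, |m, v| m.max(v.abs()));
    for bits in [3u32, 5, 7, 8] {
        let qmax = ((1i32 << (bits - 1)) - 1) as f32;
        let codes = quantize(bits, scale, &x);
        let mut packed = Vec::new();
        pack_bits(bits, &codes, &mut packed);
        assert_eq!(packed.len(), (x.len() * bits as usize + 7) / 8);
        assert_eq!(unpack_bits(bits, x.len(), &packed), codes); // lossless
        let y = dequantize(bits, scale, &codes);
        let bound = scale / (2.0 * qmax) + 1e-6; // scale/6, /30, /126, /254
        assert!(x.iter().zip(&y).all(|(a, b)| (a - b).abs() <= bound));
    }
}
```

Running the sketch checks both the lossless pack/unpack round-trip and the
per-bit-width error bounds in one pass.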

### 5.2 Performance

- [ ] `pack_bits` throughput >= 2 GB/s on native (AVX2-capable hardware).
- [ ] `unpack_bits` throughput >= 2 GB/s on native.
- [ ] `max_abs` with SIMD is >= 3x faster than the scalar fallback on 512+ element blocks.
- [ ] WASM `pack_bits` / `unpack_bits` throughput >= 500 MB/s (without SIMD).
- [ ] No heap allocations in `pack_bits`, `unpack_bits`, or `max_abs`.

### 5.3 Storage Efficiency

- [ ] 8-bit block storage: exactly `4 + block_size` bytes.
- [ ] 7-bit block storage: exactly `4 + ceil(block_size * 7 / 8)` bytes.
- [ ] 5-bit block storage: exactly `4 + ceil(block_size * 5 / 8)` bytes.
- [ ] 3-bit block storage (standard): exactly `4 + ceil(block_size * 3 / 8)` bytes.
- [ ] 3-bit block storage (two-level): exactly `8 + ceil(block_size / 8) + ceil(block_size * 3 / 8)` bytes.
- [ ] No padding bits between consecutive blocks in a segment.
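
The byte counts above can be checked with a small helper. `block_bytes` is a
sketch, not the crate's API, and reading the `ceil(block_size / 8)` term of the
two-level format as a one-bit-per-value selector bitmap is an assumption.

```rust
/// Exact per-block storage size in bytes for each format listed above.
fn block_bytes(bits: usize, block_size: usize, two_level: bool) -> usize {
    let payload = (block_size * bits + 7) / 8; // packed codes, no padding
    if two_level {
        // two f32 scales (8 bytes) + selector bitmap + packed 3-bit payload
        8 + (block_size + 7) / 8 + payload
    } else {
        4 + payload // one f32 scale + packed payload
    }
}

fn main() {
    // block_size = 4096 (one 16KB f32 block)
    assert_eq!(block_bytes(8, 4096, false), 4 + 4096);
    assert_eq!(block_bytes(7, 4096, false), 4 + 3584);
    assert_eq!(block_bytes(5, 4096, false), 4 + 2560);
    assert_eq!(block_bytes(3, 4096, false), 4 + 1536);
    assert_eq!(block_bytes(3, 4096, true), 8 + 512 + 1536);
}
```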

### 5.4 Dynamic Tier 2 Downgrade

- [ ] When aggregate warm storage exceeds `warm_byte_cap`, the least recently
  accessed warm tensors are re-encoded from 7-bit to 5-bit.
- [ ] The downgrade is reversible: if warm storage drops below
  `warm_byte_cap * 0.8` (hysteresis), tensors can be re-promoted to 7-bit
  on next access.
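
A minimal sketch of this decision with the 20% hysteresis band, assuming the
names `WarmBits` and `warm_transition` (they are illustrative, not the store's
API) and omitting the rate limiting discussed in Section 6:

```rust
#[derive(Clone, Copy, Debug, PartialEq)]
enum WarmBits {
    B7, // 7-bit warm encoding
    B5, // 5-bit warm encoding under memory pressure
}

/// Decide the warm encoding given aggregate warm-tier usage and the cap.
fn warm_transition(current: WarmBits, warm_bytes: u64, cap: u64) -> WarmBits {
    match current {
        // Over the cap: downgrade (applied to least recently accessed tensors).
        WarmBits::B7 if warm_bytes > cap => WarmBits::B5,
        // Below 80% of the cap: eligible for re-promotion on next access.
        WarmBits::B5 if warm_bytes < cap * 4 / 5 => WarmBits::B7,
        // Inside the band: hold the current encoding to avoid thrashing.
        _ => current,
    }
}

fn main() {
    let cap = 1_000;
    assert_eq!(warm_transition(WarmBits::B7, 1_100, cap), WarmBits::B5);
    assert_eq!(warm_transition(WarmBits::B5, 900, cap), WarmBits::B5); // band
    assert_eq!(warm_transition(WarmBits::B5, 700, cap), WarmBits::B7);
}
```

The dead band between 80% and 100% of the cap is what prevents oscillation when
the warm set hovers near the limit.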

---

## 6. Risks and Mitigations

| Risk | Severity | Likelihood | Mitigation |
|------|----------|------------|------------|
| 3-bit two-level scale adds format complexity without sufficient accuracy gain for most distributions | Medium | Medium | Gate behind a cargo feature `two-level-cold`; default to standard 3-bit. Benchmark on real agent embeddings before enabling by default. |
| Dynamic 7-bit to 5-bit downgrade causes thrashing when the warm set oscillates near the byte cap | Medium | Medium | Implement hysteresis (20% band). Only downgrade when above the cap; only upgrade when below 80% of the cap. Rate-limit downgrades to at most once per minute. |
| `pack_bits` accumulator overflow for large inputs | Low | Low | The packer flushes whole bytes whenever 8 or more bits are pending, so at most 7 bits remain after each flush and at most `7 + bits <= 15` bits are ever pending, far below the 64-bit accumulator's capacity. No overflow is possible. |
| Tier 0 reconstruction from Delta/Factor introduces unbounded latency | Medium | Low | Set a maximum reconstruction depth (default: 3). If the base tensor is also Tier 0, fail with `ReconstructionDepthExceeded` rather than recursing indefinitely. |
| WASM scalar `max_abs` is a bottleneck for large tensors | Low | High | Expected. The WASM SIMD feature flag provides a 3-4x improvement. For non-SIMD targets, `max_abs` cost is small relative to the full quantize pipeline. |
| Block size mismatch between encoder and decoder | High | Low | Block size is stored in the segment header (ADR-017 format). The decoder reads it from the header rather than assuming a default. |

---

## 7. References

1. ADR-017: Temporal Tensor Compression with Tiered Quantization. RuVector Architecture Team, 2026.
2. ADR-018: Block-Based Storage Engine for Temporal Tensor Segments (forthcoming).
3. Frantar, E., et al. "GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers." ICLR 2023.
4. Lin, J., et al. "AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration." MLSys 2024.
5. Kim, S., et al. "SqueezeLLM: Dense-and-Sparse Quantization." ICML 2024.
6. Liu, Z., et al. "KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache." ICML 2024.
7. Pelkonen, T., et al. "Gorilla: A Fast, Scalable, In-Memory Time Series Database." VLDB 2015.
8. IEEE 754-2019. "IEEE Standard for Floating-Point Arithmetic."
9. Lemire, D. and Boytsov, L. "Decoding billions of integers in milliseconds through vectorized bit packing." Software: Practice and Experience, 2015.
10. WebAssembly SIMD Proposal. https://github.com/WebAssembly/simd. Finalized 2023.
File diff suppressed because it is too large
Load Diff
1062
docs/adr/temporal-tensor-store/ADR-022-wasm-api-cross-platform.md
Normal file
File diff suppressed because it is too large
Load Diff
@@ -0,0 +1,422 @@

# ADR-023: Benchmarking, Failure Modes, and Acceptance Criteria

**Status**: Proposed
**Date**: 2026-02-08
**Parent**: ADR-017 Temporal Tensor Compression, ADR-018 Block-Based Storage Engine
**Author**: System Architecture Team

## Version History

| Version | Date | Author | Changes |
|---------|------|--------|---------|
| 0.1 | 2026-02-08 | Architecture Team | Initial proposal |

---

## Abstract

This ADR defines benchmarking methodology, acceptance thresholds, failure modes, and CI strategy for the Temporal Tensor Store. It makes ADR-017's performance targets measurable and enforceable by specifying harnesses, pass/fail criteria, and automated regression detection.

---

## 1. Context

ADR-017 and ADR-018 together form the Temporal Tensor Store but leave gaps in how targets are measured, what happens when they are missed, and how regressions are caught. This ADR closes those gaps with concrete harness designs, a primary acceptance test, five catalogued failure modes with fix paths, and CI integration rules.

---

## 2. Microbenchmark Targets

All measurements use a single 16KB block (4096 f32 values, group_len=64). Harness: Criterion.rs with 200 samples, 5s measurement, 2s warm-up.

### 2.1 Quantize and Dequantize Throughput

| Operation | Bit Width | Native Target | WASM Target |
|-----------|-----------|--------------|-------------|
| Quantize | 8-bit | < 2 us | < 20 us |
| Quantize | 7-bit | < 2 us | < 20 us |
| Quantize | 5-bit | < 2.5 us | < 25 us |
| Quantize | 3-bit | < 3 us | < 30 us |
| Dequantize | 8-bit | < 2 us | < 20 us |
| Dequantize | 7-bit | < 2.5 us | < 25 us |
| Dequantize | 5-bit | < 3 us | < 30 us |
| Dequantize | 3-bit | < 5 us | < 50 us |

### 2.2 Pack and Unpack Speed

| Operation | Bit Width | Native Target | WASM Target |
|-----------|-----------|--------------|-------------|
| Pack 16KB | 8-bit | < 0.5 us | < 5 us |
| Pack 16KB | 7-bit | < 1 us | < 10 us |
| Pack 16KB | 5-bit | < 1 us | < 10 us |
| Pack 16KB | 3-bit | < 1.5 us | < 15 us |
| Unpack 16KB | 8-bit | < 0.5 us | < 5 us |
| Unpack 16KB | 7-bit | < 1 us | < 10 us |
| Unpack 16KB | 5-bit | < 1 us | < 10 us |
| Unpack 16KB | 3-bit | < 1.5 us | < 15 us |

### 2.3 Tier Decision and Scoring

| Operation | Native Target | WASM Target |
|-----------|--------------|-------------|
| Tier decision per block | < 50 ns | < 500 ns |
| Per-block scoring | < 20 ns | < 200 ns |
| Maintenance tick (1000 candidates) | < 1 ms | < 10 ms |
| Delta apply (sparse, 10% nnz) | < 1 us | < 10 us |
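
The "Delta apply (sparse, 10% nnz)" row can be pictured with a short sketch.
Modeling the delta as parallel (index, value) arrays added onto the block is an
assumption for illustration; `apply_sparse_delta` is not the store's API.

```rust
/// Patch a block in place with a sparse additive delta.
fn apply_sparse_delta(block: &mut [f32], indices: &[u32], values: &[f32]) {
    for (&i, &v) in indices.iter().zip(values) {
        block[i as usize] += v; // touch only the ~10% of entries that changed
    }
}

fn main() {
    let mut block = vec![0.0f32; 4096];
    // 10% nnz: every tenth element carries a correction.
    let indices: Vec<u32> = (0..4096).step_by(10).map(|i| i as u32).collect();
    let values = vec![0.5f32; indices.len()];
    apply_sparse_delta(&mut block, &indices, &values);
    assert_eq!(block[0], 0.5);
    assert_eq!(block[1], 0.0);
}
```

The sub-microsecond target follows from touching ~410 of 4096 entries rather
than rewriting the whole block.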

### 2.4 Auxiliary Operations

| Operation | Native Target | WASM Target |
|-----------|--------------|-------------|
| f32-to-f16 / f16-to-f32 (single) | < 5 ns | < 50 ns |
| Drift check (64-group block) | < 50 ns | < 500 ns |
| CRC32 checksum (16KB) | < 1 us | < 10 us |
| Segment encode (16KB, 1 frame) | < 3 us | < 30 us |
| Segment decode (16KB, 1 frame) | < 3 us | < 30 us |

---

## 3. Macrobenchmark Targets

### 3.1 KV Cache-Like Workload with Zipf Access Pattern

| Parameter | Value | Rationale |
|-----------|-------|-----------|
| Total blocks | 1,000,000 | ~16 GB raw; representative large cache |
| Total accesses | 10,000,000 | Statistical stability |
| Distribution | Zipf (alpha=1.2) | Models real attention-pattern skew |
| Block size | 16 KB | Standard block from ADR-018 |
| Tier-1 byte cap | 2 GB | Memory-constrained deployment |

### 3.2 Measurements

Average read latency, P95 read latency, P99 read latency, bytes stored per token, MSE per tier (sampled from 1000 blocks per tier), tier churn rate (transitions/block/minute), Tier-1 occupancy (snapshotted every simulated second), and eviction count.

### 3.3 Macrobenchmark Acceptance Thresholds

| Metric | Target | Hard Fail |
|--------|--------|-----------|
| Avg read latency (native) | < 3 us | > 10 us |
| P95 read latency (native) | < 10 us | > 50 us |
| P99 read latency (native) | < 25 us | > 100 us |
| Avg read latency (WASM) | < 30 us | > 100 us |
| P95 read latency (WASM) | < 100 us | > 500 us |
| Bytes stored per token | < 2.5 bytes | > 4 bytes |
| Tier churn per block per min | < 0.1 avg | > 0.5 |
| Tier-1 byte usage | Under cap always | Any violation |

---

## 4. Acceptance Thresholds (Critical)

These gate merges to main. Any violation blocks the PR.

### 4.1 Latency

| Metric | Target |
|--------|--------|
| Tier-1 dequant latency (16KB block, native) | < 2 us |
| Tier-3 dequant latency (16KB block, native) | < 5 us |
| WASM dequant latency (16KB block, Node.js) | < 50 us |

**Derivation**: A 16KB block requires 4096 multiplies. On AVX2 at 3.5 GHz (8 f32 multiplies per cycle), the theoretical floor is ~146 ns. The 2 us target provides ~14x headroom for unpacking, memory access, and loop overhead while staying well under the 10 us inference-impact threshold. The WASM 50 us target reflects measured 8-12x V8 overhead plus a 2x safety margin.

### 4.2 Stability

| Metric | Target |
|--------|--------|
| Tier churn per block per min | < 0.1 avg |
| Tier-1 byte budget | Under configured cap |
| Segment boundary rate | < 1 per 100 frames (stable tensor) |

**Derivation**: At 0.1 transitions/block/min with 1M blocks, total transitions are ~1,667/sec. At ~5-10 us each, this consumes < 2% of a core. At 1.0/block/min it becomes 8-17%, which is unacceptable.

### 4.3 Quality Thresholds

| Tier | Bits | Max MSE (normalized) | Max Relative Error |
|------|------|---------------------|-------------------|
| Hot (8-bit) | 8 | < 0.0001 | < 0.8% |
| Warm (7-bit) | 7 | < 0.0004 | < 1.6% |
| Warm (5-bit) | 5 | < 0.004 | < 6.5% |
| Cold (3-bit) | 3 | < 0.03 | < 30% |

MSE is normalized by the squared L2-norm of the original block. Relative error is the maximum element-wise error divided by the block's maximum absolute value.
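
As a sketch of these two metrics, assuming "normalized MSE" means the sum of
squared errors over the block's squared L2 norm (the per-element 1/n factors
cancel); the function name is illustrative:

```rust
/// Returns (normalized MSE, max relative error) for one block.
fn quality_metrics(orig: &[f32], recon: &[f32]) -> (f32, f32) {
    let err_sq: f32 = orig.iter().zip(recon).map(|(a, b)| (a - b) * (a - b)).sum();
    let l2_sq: f32 = orig.iter().map(|a| a * a).sum();
    let max_err = orig
        .iter()
        .zip(recon)
        .map(|(a, b)| (a - b).abs())
        .fold(0.0f32, f32::max);
    let max_abs = orig.iter().map(|a| a.abs()).fold(0.0f32, f32::max);
    (err_sq / l2_sq, max_err / max_abs)
}

fn main() {
    let orig = [1.0f32, -2.0, 4.0];
    let recon = [1.0f32, -2.0, 3.6];
    let (nmse, rel) = quality_metrics(&orig, &recon);
    // err_sq = 0.4^2 = 0.16, l2_sq = 21, max_err/max_abs = 0.4/4.0
    assert!((nmse - 0.16 / 21.0).abs() < 1e-4);
    assert!((rel - 0.1).abs() < 1e-4);
}
```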

---

## 5. Primary Acceptance Test

### 5.1 Configuration

```
blocks: 1,000,000    accesses: 10,000,000    distribution: Zipf(1.2)
tier1_byte_cap: 2GB  block_size: 16KB        group_len: 64
hot_min_score: 512   warm_min_score: 64      hysteresis: 32
min_residency: 60    drift_pct_q8: 26        max_delta_chain: 8
```

### 5.2 Pass Criteria

The simulation PASSES if and only if all four hold simultaneously:

1. **Budget**: Tier-1 stays under the configured byte cap at every epoch snapshot.
2. **Stability**: Average tier flips per block per minute < 0.1.
3. **Latency**: P95 read latency stays within the tier target on the host.
4. **Quality**: Sampled per-tier MSE stays under the Section 4.3 thresholds.

### 5.3 Zipf Simulation Pseudocode

```
function run_zipf_simulation(config):
    store = BlockStore::new(config.tier1_byte_cap)
    blocks = Array[config.num_blocks]
    for i in 0..config.num_blocks:
        blocks[i] = generate_random_f32_block(config.block_size)
        store.ingest(block_id=i, data=blocks[i], initial_tier=COLD)

    zipf = ZipfDistribution::new(config.num_blocks, config.alpha)
    rng = StableRng::seed(42)

    latencies = Vec::new()
    tier_flips = Array[config.num_blocks].fill(0)
    prev_tier = Array[config.num_blocks].fill(COLD)
    epoch_snapshots = Vec::new()
    sim_clock = 0

    for access in 0..config.num_accesses:
        block_id = zipf.sample(rng)
        sim_clock += 1

        t_start = precise_now()
        tier = store.current_tier(block_id)
        data = store.read_block(block_id, sim_clock)
        t_end = precise_now()
        latencies.push(t_end - t_start)

        if tier != prev_tier[block_id]:
            tier_flips[block_id] += 1
            prev_tier[block_id] = tier

        if access % config.maintenance_interval == 0:
            store.run_maintenance_tick(sim_clock)
        if access % config.snapshot_interval == 0:
            epoch_snapshots.push(EpochSnapshot {
                sim_clock,
                tier1_bytes: store.tier1_bytes(),
                tier2_bytes: store.tier2_bytes(),
                tier3_bytes: store.tier3_bytes(),
            })

    sim_minutes = sim_clock / config.ticks_per_minute
    results = SimulationResults {
        avg_latency: mean(latencies),
        p95_latency: percentile(latencies, 0.95),
        p99_latency: percentile(latencies, 0.99),
        avg_churn: mean(tier_flips) / sim_minutes,
        budget_violated: any(s.tier1_bytes > config.tier1_byte_cap for s in epoch_snapshots),
    }

    // Quality sampling: 1000 blocks per tier
    for tier in [HOT, WARM, COLD]:
        for id in store.sample_block_ids(tier, 1000):
            reconstructed = store.read_block(id, sim_clock)
            results.quality[tier].push(mse(blocks[id], reconstructed))
    return results

function assert_pass(results, config):
    assert !results.budget_violated             // Criterion 1
    assert results.avg_churn < 0.1              // Criterion 2
    assert results.p95_latency < config.p95     // Criterion 3
    for tier, samples in results.quality:
        for mse in samples:
            assert mse < config.mse_threshold[tier]  // Criterion 4
```

### 5.4 Reproducibility

Fixed RNG seed (42), Zipf-Mandelbrot inverse CDF, monotonic clock (`Instant::now()`), CPU frequency scaling disabled or handled by Criterion warm-up.
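
Deterministic Zipf sampling via an inverse-CDF table can be sketched as below.
The table-based sampler and the SplitMix64-style generator standing in for
`StableRng` are assumptions for illustration, not the harness's actual
implementation.

```rust
/// Zipf distribution over ranks 0..n via a precomputed CDF table.
struct Zipf {
    cdf: Vec<f64>, // cdf[k] = P(rank <= k); last entry is 1.0
}

impl Zipf {
    fn new(n: usize, alpha: f64) -> Self {
        let mut cdf = Vec::with_capacity(n);
        let mut acc = 0.0;
        for k in 1..=n {
            acc += 1.0 / (k as f64).powf(alpha);
            cdf.push(acc);
        }
        for c in cdf.iter_mut() {
            *c /= acc; // normalize by the truncated harmonic sum
        }
        Zipf { cdf }
    }

    /// Map a uniform u in [0, 1) to a 0-based block id by binary search.
    fn sample(&self, u: f64) -> usize {
        self.cdf.partition_point(|&c| c < u).min(self.cdf.len() - 1)
    }
}

/// SplitMix64: tiny deterministic generator for a fixed seed.
fn next_uniform(state: &mut u64) -> f64 {
    *state = state.wrapping_add(0x9E3779B97F4A7C15);
    let mut z = *state;
    z = (z ^ (z >> 30)).wrapping_mul(0xBF58476D1CE4E5B9);
    z = (z ^ (z >> 27)).wrapping_mul(0x94D049BB133111EB);
    ((z ^ (z >> 31)) >> 11) as f64 / (1u64 << 53) as f64
}

fn main() {
    let zipf = Zipf::new(1_000, 1.2);
    let mut state = 42u64;
    let mut hits = vec![0u64; 1_000];
    for _ in 0..100_000 {
        hits[zipf.sample(next_uniform(&mut state))] += 1;
    }
    // Skewed as expected, and bit-for-bit reproducible across runs.
    assert!(hits[0] > hits[999] * 10);
}
```

Fixing the seed makes every access sequence, and therefore every tier
trajectory, replayable for debugging a failed run.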

---

## 6. Failure Modes and Fix Paths

### 6.1 Thrashing

- **Symptom**: Tier flips > 0.1/block/min; excessive segment boundaries
- **Root cause**: Hysteresis too small; tau too large, causing score oscillation
- **Fix**: Increase hysteresis (32 to 64+), increase min_residency (60 to 120+ ticks), reduce tau

### 6.2 Delta Chain Blowup

- **Symptom**: P95 read latency > 10x tier target; growing read amplification
- **Root cause**: Delta chains not compacted; unbounded chain growth
- **Fix**: Compact when a chain exceeds max_delta_chain (default 8); schedule in the maintenance tick; a hard cap forces synchronous compaction on read at 2x max

### 6.3 Scale Instability

- **Symptom**: MSE exceeds threshold on bimodal/heavy-tailed tensors
- **Root cause**: A single per-group scale is insufficient for outlier distributions
- **Fix**: Enable two-level scale for 3-bit; reduce group_len to 32 for affected blocks; clamp outliers at 3-sigma with a sparse correction side-channel

### 6.4 Hot Set Misprediction

- **Symptom**: Tier-1 byte usage exceeds the configured cap
- **Root cause**: Scoring promotes too many blocks; hot_min_score too low
- **Fix**: Raise t1_threshold, lower w_pop, enforce the per-tier byte cap with LRU eviction, add a feedback loop (auto-raise the threshold when the eviction rate exceeds N/sec)

### 6.5 Checksum Corruption

- **Symptom**: CRC32 mismatch on read
- **Root cause**: Bit flip in storage; partial write; pack/unpack bug
- **Fix**: Rehydrate from the delta chain if available; attempt factor-decomposition recovery; otherwise mark the block Unrecoverable and emit an alert metric; enable background scrubbing on idle blocks

---

## 7. Benchmark Harness Design

### 7.1 Microbenchmarks (Criterion.rs)

```
crates/ruvector-temporal-tensor/benches/
    quantize.rs        -- per bit width
    dequantize.rs      -- per bit width
    bitpack.rs         -- pack/unpack per bit width
    tier_policy.rs     -- scoring and tier decision
    f16_conversion.rs  -- f32<->f16
    segment.rs         -- encode/decode round-trip
    maintenance.rs     -- maintenance tick with N candidates
```

Input data: fixed seed (42), standard normal scaled to [-1.0, 1.0]. The median is the primary statistic. A regression is detected when the new confidence-interval lower bound exceeds the baseline upper bound by more than 5%.

### 7.2 Zipf Simulation (Custom Rust)

Located at `crates/ruvector-temporal-tensor/tests/zipf_simulation.rs`. Supports `--quick` (100K blocks, 1M accesses, ~30s) for PR checks and `--full` (1M blocks, 10M accesses, ~5-10 min) for nightly runs. Outputs JSON for CI and a human-readable summary to stdout. Configurable via env vars (`ZIPF_BLOCKS`, `ZIPF_ACCESSES`, `ZIPF_ALPHA`).

### 7.3 WASM Benchmarks

Built with `wasm-pack build --release --target nodejs`. A Node.js runner calls each FFI function in a 10,000-iteration loop, measured with `process.hrtime.bigint()`. It reports median, P95, and P99, and computes the WASM/native overhead ratio.

---

## 8. CI Integration Guidelines

### 8.1 Pipeline Stages

| Stage | Trigger | Timeout | Scope |
|-------|---------|---------|-------|
| PR check | Every PR | 10 min | Criterion quick, Zipf quick, quality |
| Nightly | 02:00 UTC | 30 min | Full Criterion, Zipf full, WASM, quality sweep |
| Release gate | Tag push | 45 min | All benchmarks, cross-platform, WASM + native |

### 8.2 Regression Detection

```yaml
benchmark-check:
  steps:
    - run: cargo bench --bench '*' -- --output-format bencher | tee output.txt
    - run: python scripts/bench_compare.py --baseline .bench_baseline.json
           --current output.txt --threshold 0.10 --fail-on-regression
    - run: cargo test --release --test zipf_simulation -- --quick
```

Baselines are committed as `.bench_baseline.json` on main and updated only via architecture-team-reviewed PRs that modify quantization or storage code. Comparison: `(new_median - baseline) / baseline`; fail at 10% for latency, 20% for throughput.
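
The comparison rule can be stated compactly. This Rust sketch mirrors the logic
that `bench_compare.py` is described as applying (the script itself is Python;
the function names here are illustrative): latency regresses upward, throughput
regresses downward, with separate thresholds.

```rust
/// Latency regression: median grew by more than 10% over baseline.
fn latency_regressed(baseline: f64, new_median: f64) -> bool {
    (new_median - baseline) / baseline > 0.10
}

/// Throughput regression: median dropped by more than 20% under baseline.
fn throughput_regressed(baseline: f64, new_median: f64) -> bool {
    (baseline - new_median) / baseline > 0.20
}

fn main() {
    assert!(!latency_regressed(2.0, 2.1)); // +5%: within the noise budget
    assert!(latency_regressed(2.0, 2.3)); // +15%: fail
    assert!(!throughput_regressed(2.0, 1.7)); // -15%: pass
    assert!(throughput_regressed(2.0, 1.5)); // -25%: fail
}
```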

### 8.3 Alerting

| Condition | Action |
|-----------|--------|
| PR regression > 10% | Block merge; PR comment |
| Nightly regression > 15% | GitHub issue: `perf-regression` |
| Zipf simulation failure | GitHub issue: `acceptance-failure` |
| WASM overhead > 15x native | GitHub issue: `wasm-performance` |
| Quality violation | Block merge/release |

---

## 9. SOTA Integration Benchmarks

### 9.1 Reference Systems

| System | Year | Key Result |
|--------|------|-----------|
| **RIPPLE++** | 2026 | Tens of thousands of updates/sec, sub-ms latency for incremental graph computation |
| **OMEGA** | 2025 | Sub-ms GNN inference via selective recompute |
| **STAG** | 2025 | Additivity-based incremental propagation; linear scaling with delta size |

### 9.2 Comparison

| Metric | Temporal Tensor Store | RIPPLE++ | OMEGA | STAG |
|--------|----------------------|----------|-------|------|
| Single read | < 2-5 us | N/A (graph) | ~100 us | ~50 us |
| Batch update (1000) | < 1 ms | ~10 ms | ~5 ms | ~2 ms |
| Memory/element | 0.375-1.0 B | 8 B | 4-8 B | 4 B |

The store targets block-level compression rather than graph-level computation but shares the sub-millisecond incremental-update goal. The maintenance tick budget (< 1 ms for 1000 candidates) is competitive.

---

## 10. Test Scenarios

### 10.1 Scenario Matrix

| ID | Purpose | Blocks | Accesses | Distribution |
|----|---------|--------|----------|-------------|
| S1 | Baseline: uniform access | 10K | 1M | Uniform |
| S2 | Primary acceptance (Zipf) | 1M | 10M | Zipf(1.2) |
| S3 | High skew stress | 1M | 10M | Zipf(2.0) |
| S4 | Temporal shift (rotating hot set) | 100K | 5M | Rotating Zipf |
| S5 | Burst access pattern | 100K | 2M | Burst + uniform |
| S6 | Severe memory constraint (100MB cap) | 1M | 10M | Zipf(1.2) |
| S7 | Outlier/bimodal tensors | 10K | 500K | Zipf(1.2) |
| S8 | Stable tensors (near-zero drift) | 10K | 500K | Zipf(1.2) |

### 10.2 Per-Scenario Pass Criteria

| ID | Pass Condition |
|----|---------------|
| S1 | All blocks converge to the same tier within 2x access count |
| S2 | Full acceptance test (Section 5.2) |
| S3 | Tier-1 < 5% of blocks; no budget violation |
| S4 | Churn < 0.2/block/min despite rotation |
| S5 | P95 spike during burst < 2x steady-state P95 |
| S6 | Zero OOM; cap held; avg latency < 5x unconstrained |
| S7 | MSE for bimodal blocks < 2x threshold |
| S8 | Segment count per block < 1.1 |

---

## 11. Risks and Mitigations

| Risk | Severity | Mitigation |
|------|----------|------------|
| CI noise causes false regressions | Medium | 2% Criterion noise threshold; require 3 consecutive failures; pin CI hardware |
| Zipf simulation too slow for PR | Medium | Quick mode (~30s); full mode nightly only |
| WASM results platform-dependent | Low | Pin Node.js version; accept 20% variance |
| Baseline drift over time | Medium | Rebaseline quarterly or on hardware change |

---

## 12. Implementation Roadmap

**Phase 1 (Week 1)**: Criterion benchmarks for all Section 2 operations; initial baselines; `bench_compare.py` script; PR pipeline integration.

**Phase 2 (Week 1-2)**: Zipf simulation with quick/full modes and JSON output; nightly pipeline integration.

**Phase 3 (Week 2)**: WASM Node.js benchmark runner; WASM-specific baselines; nightly pipeline.

**Phase 4 (Week 2-3)**: Failure mode detectors (thrashing counter, delta chain monitor, quality sampler, corruption injection test); wire into the simulation harness.

**Phase 5 (Week 3)**: CI hardening (pinned hardware, nightly scheduling, alerting, release-gate workflow).

---

## 13. References

1. Frantar, E., et al. "GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers." ICLR 2023.
2. Lin, J., et al. "AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration." MLSys 2024.
3. Criterion.rs documentation. https://bheisler.github.io/criterion.rs/
4. Gray, J. "The Benchmark Handbook." Morgan Kaufmann, 1993.
5. Pelkonen, T., et al. "Gorilla: A Fast, Scalable, In-Memory Time Series Database." VLDB 2015.
6. Li et al. "RIPPLE++: Incremental Graph Computation." SIGMOD 2026.
7. Chen et al. "OMEGA: Selective Recompute for Low-Latency GNN Serving." OSDI 2025.
8. Wang et al. "STAG: Additivity-Based Incremental Graph Propagation." VLDB 2025.
9. ADR-017: Temporal Tensor Compression. RuVector Architecture Team, 2026.
10. ADR-018: Block-Based Storage Engine. RuVector Architecture Team, 2026.