Squashed 'vendor/ruvector/' content from commit b64c2172
git-subtree-dir: vendor/ruvector git-subtree-split: b64c21726f2bb37286d9ee36a7869fef60cc6900
This commit is contained in:
@@ -0,0 +1,880 @@
|
||||
# ADR-019: Tiered Quantization Formats for Temporal Tensor Store
|
||||
|
||||
**Status**: Proposed
|
||||
**Date**: 2026-02-08
|
||||
**Parent**: ADR-017 Temporal Tensor Compression, ADR-018 Block-Based Storage Engine
|
||||
**Author**: System Architecture Team
|
||||
|
||||
**Note**: Tiered quantization formats are now implemented in the rvf-quant crate as part of ADR-029 (RVF). See the RVF temperature-tiering specification (docs/research/rvf/spec/03-temperature-tiering.md).
|
||||
|
||||
## Version History
|
||||
|
||||
| Version | Date | Author | Changes |
|
||||
|---------|------|--------|---------|
|
||||
| 0.1 | 2026-02-08 | Architecture Team | Initial proposal |
|
||||
|
||||
---
|
||||
|
||||
## Abstract
|
||||
|
||||
This ADR defines the concrete quantization formats, bit-packing layouts, and codec
|
||||
interfaces for the five tiers of tensor storage established in ADR-017. Where ADR-017
|
||||
introduced the concept of access-frequency-driven quantization and temporal scale
|
||||
reuse, this document specifies the exact byte-level formats for 8-bit (Tier 1 / Hot),
|
||||
7-bit and 5-bit (Tier 2 / Warm), 3-bit (Tier 3 / Cold), and Compression-to-Zero
|
||||
(Tier 0 / Absent). It also resolves two open design questions from ADR-017: whether
|
||||
5-bit quantization is permitted within the warm tier, and how Tier 0 reads behave
|
||||
when no reconstruction policy exists.
|
||||
|
||||
The `codec_bits` module provides a single allocation-free bit packer/unpacker that
|
||||
all sub-byte formats share. The `quant` module provides per-format quantize and
|
||||
dequantize functions, with SIMD-accelerated `max_abs` on native targets and a
|
||||
portable fallback for WASM. Rust trait interfaces are defined so that new bit widths
|
||||
can be added without modifying the core codec.
|
||||
|
||||
---
|
||||
|
||||
## 1. Context and Motivation
|
||||
|
||||
### 1.1 Gap in ADR-017
|
||||
|
||||
ADR-017 established the tiered compression architecture and segment binary format
|
||||
but left the per-tier quantization details at the algorithmic level. Implementers
|
||||
need exact byte layouts to write interoperable encoders and decoders, particularly
|
||||
for the sub-byte formats (7-bit, 5-bit, 3-bit) where values do not align on byte
|
||||
boundaries.
|
||||
|
||||
### 1.2 Sub-Byte Packing Complexity
|
||||
|
||||
Standard 8-bit quantization maps trivially to `[u8]` storage. Sub-byte formats
|
||||
require a bit-packing codec that can write and read arbitrary-width codes into a
|
||||
byte stream without wasting bits. The codec must:
|
||||
|
||||
- Handle bit widths 3, 5, and 7 (with 8 as a degenerate identity case).
|
||||
- Operate without heap allocations (caller provides output slice).
|
||||
- Be deterministic and platform-independent (little-endian byte order).
|
||||
- Support WASM targets where SIMD is optional.
|
||||
|
||||
### 1.3 Outlier Handling in 3-Bit
|
||||
|
||||
At 3 bits per value, the quantization range is `[-3, +3]` (qmax = 3). Large
|
||||
outliers in the tensor distribution can cause severe clamping. ADR-017 noted this
|
||||
risk but did not specify a mitigation. This ADR introduces a two-level scale
|
||||
option for Tier 3 that uses a 1-bit flag per value to select between a primary
|
||||
scale (covering the majority of values) and a secondary scale (covering outliers),
|
||||
while keeping the packed format compact.
|
||||
|
||||
### 1.4 Tier 0 Semantics
|
||||
|
||||
ADR-017 listed Compression-to-Zero as a future possibility. This ADR formalizes
|
||||
it: Tier 0 stores no quantized data at all. Only metadata and an optional
|
||||
`reconstruct_policy` survive. This enables aggressive memory reclamation for
|
||||
tensors that are no longer accessed but may be reconstructable from other sources
|
||||
(deltas, factors, or recomputation).
|
||||
|
||||
### 1.5 Design Questions Resolved
|
||||
|
||||
| Question | Resolution |
|
||||
|----------|------------|
|
||||
| Allow 5-bit within warm tier? | Yes. Dynamic downgrade from 7-bit to 5-bit when warm set exceeds a configurable byte cap (`warm_byte_cap`). |
|
||||
| Tier 0 read semantics? | Return zeros by default. If a `reconstruct_policy` (Delta or Factor) exists, reconstruct from stored representation. |
|
||||
|
||||
---
|
||||
|
||||
## 2. Decision
|
||||
|
||||
We adopt the following five-tier quantization format hierarchy, each with a
|
||||
well-defined byte layout, packing strategy, and error budget:
|
||||
|
||||
| Tier | Name | Bits | Compression vs f32 | Use Case |
|
||||
|------|------|------|-------------------|----------|
|
||||
| 1 | Hot | 8 | 4.00x | Active tensors, full fidelity |
|
||||
| 2a | Warm | 7 | 4.57x | Default warm, near-lossless |
|
||||
| 2b | Warm-aggressive | 5 | 6.40x | Warm set exceeds `warm_byte_cap` |
|
||||
| 3 | Cold | 3 | 10.67x | Archived tensors, bounded error |
|
||||
| 0 | Absent | 0 | Infinite | No data stored; metadata only |
|
||||
|
||||
All sub-byte formats share the `codec_bits` packer. All quantization formats use
|
||||
symmetric per-block quantization with `scale = max_abs / qmax` stored as f32 per
|
||||
block. The choice of f32 (rather than f16 as in ADR-017 segment headers) is
|
||||
deliberate at this layer: the segment encoder may convert to f16 for storage, but
|
||||
the quantizer operates in f32 for precision during the quantize/dequantize path.
|
||||
|
||||
---
|
||||
|
||||
## 3. Detailed Design
|
||||
|
||||
### 3.1 Tier 1: 8-Bit Quantization (Hot)
|
||||
|
||||
**Algorithm**: Symmetric per-block quantization.
|
||||
|
||||
```
|
||||
Given: block of N f32 values, block_size typically 64 or 128
|
||||
scale = max_abs(values) / 127
|
||||
q[i] = round(values[i] / scale)
|
||||
q[i] = clamp(q[i], -127, +127) // i8 range
|
||||
store: q as [i8; N] + scale as f32
|
||||
```
|
||||
|
||||
**Storage layout** (one block, block_size = 8 for illustration):
|
||||
|
||||
```
|
||||
Byte offset: 0 1 2 3 4 5 6 7 8 9 10 11
|
||||
[ scale (f32, LE) ] [q0] [q1] [q2] [q3] [q4] [q5] [q6] [q7]
|
||||
~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
4 bytes 8 bytes (1 byte per i8 value)
|
||||
|
||||
Total per block: 4 + block_size bytes
|
||||
```
|
||||
|
||||
**Effective compression** (block_size = 64):
|
||||
|
||||
```
|
||||
raw = 64 * 4 = 256 bytes
|
||||
quant = 4 + 64 * 1 = 68 bytes
|
||||
ratio = 256 / 68 = 3.76x (single block)
|
||||
```
|
||||
|
||||
With temporal amortization (100 frames sharing scales): `256*100 / (4 + 64*100)` = 4.00x.
|
||||
|
||||
**Dequantize**:
|
||||
|
||||
```
|
||||
values[i] = q[i] as f32 * scale
|
||||
```
|
||||
|
||||
**Error bound**: `max_error = scale / (2 * 127)`. See Section 3.7 for full analysis.
|
||||
|
||||
### 3.2 Tier 2a: 7-Bit Quantization (Warm)
|
||||
|
||||
**Algorithm**: Symmetric per-block, 7-bit codes packed into a bitstream.
|
||||
|
||||
```
|
||||
Given: block of N f32 values
|
||||
scale = max_abs(values) / 63 // qmax = 2^(7-1) - 1 = 63
|
||||
q[i] = round(values[i] / scale)
|
||||
q[i] = clamp(q[i], -63, +63)
|
||||
u[i] = q[i] + 63 // bias to unsigned [0, 126], fits 7 bits
|
||||
pack u[i] values using codec_bits at width=7
|
||||
```
|
||||
|
||||
**Bit-packing layout** (8 values packed into 7 bytes):
|
||||
|
||||
```
|
||||
Values: u0 u1 u2 u3 u4 u5 u6 u7
|
||||
Bits: [6..0] [6..0] [6..0] [6..0] [6..0] [6..0] [6..0] [6..0]
|
||||
7 bits 7 bits 7 bits 7 bits 7 bits 7 bits 7 bits 7 bits
|
||||
|
||||
Packed into 7 bytes (56 bits = 8 * 7 bits):
|
||||
|
||||
Byte 0: [u0[6:0] | u1[0] ] = u0(7) + u1(1) = 8 bits
|
||||
|<--- 7 bits --->|<1>|
|
||||
|
||||
Byte 1: [u1[6:1] | u2[1:0]] = u1(6) + u2(2) = 8 bits
|
||||
|<--- 6 bits --->|<-2->|
|
||||
|
||||
Byte 2: [u2[6:2] | u3[2:0] ] = u2(5) + u3(3) = 8 bits
|
||||
|<-- 5 bits -->|<--3-->|
|
||||
|
||||
Byte 3: [u3[6:3] | u4[3:0] ] = u3(4) + u4(4) = 8 bits
|
||||
|<- 4 bits ->|<--4--->|
|
||||
|
||||
Byte 4: [u4[6:4] | u5[4:0] ] = u4(3) + u5(5) = 8 bits
|
||||
|<-3->|<---- 5 bits ---->|
|
||||
|
||||
Byte 5: [u5[6:5] | u6[5:0] ] = u5(2) + u6(6) = 8 bits
|
||||
|<2>|<----- 6 bits ------>|
|
||||
|
||||
Byte 6: [u6[6] | u7[6:0] ] = u6(1) + u7(7) = 8 bits
|
||||
|1|<------- 7 bits ------->|
|
||||
|
||||
Total: 7 bytes for 8 values = 0.875 bytes/value
|
||||
```
|
||||
|
||||
**Storage per block** (block_size = 64):
|
||||
|
||||
```
|
||||
scale: 4 bytes (f32)
|
||||
data: ceil(64 * 7 / 8) = 56 bytes
|
||||
total: 60 bytes
|
||||
ratio: 256 / 60 = 4.27x
|
||||
```
|
||||
|
||||
### 3.3 Tier 2b: 5-Bit Quantization (Warm Aggressive)
|
||||
|
||||
**Algorithm**: Symmetric per-block, 5-bit codes.
|
||||
|
||||
```
|
||||
Given: block of N f32 values
|
||||
scale = max_abs(values) / 15 // qmax = 2^(5-1) - 1 = 15
|
||||
q[i] = round(values[i] / scale)
|
||||
q[i] = clamp(q[i], -15, +15)
|
||||
u[i] = q[i] + 15 // bias to unsigned [0, 30], fits 5 bits
|
||||
pack u[i] values using codec_bits at width=5
|
||||
```
|
||||
|
||||
**Activation policy**: 5-bit is used instead of 7-bit when the total warm set
|
||||
size exceeds `warm_byte_cap` (default: 64 MiB). The tier policy monitors
|
||||
aggregate warm storage and downgrades from 7-bit to 5-bit for the least recently
|
||||
accessed warm tensors until the cap is satisfied.
|
||||
|
||||
**Bit-packing layout** (8 values packed into 5 bytes):
|
||||
|
||||
```
|
||||
Values: u0 u1 u2 u3 u4 u5 u6 u7
|
||||
Bits: [4..0] [4..0] [4..0] [4..0] [4..0] [4..0] [4..0] [4..0]
|
||||
5 bits 5 bits 5 bits 5 bits 5 bits 5 bits 5 bits 5 bits
|
||||
|
||||
Packed into 5 bytes (40 bits = 8 * 5 bits):
|
||||
|
||||
Byte 0: [u0[4:0] | u1[2:0] ] = u0(5) + u1(3) = 8 bits
|
||||
|<-- 5 bits -->|<--3-->|
|
||||
|
||||
Byte 1: [u1[4:3] | u2[4:0] | u3[0]] = u1(2) + u2(5) + u3(1) = 8 bits
|
||||
|<2>|<-- 5 bits -->|<1>|
|
||||
|
||||
Byte 2: [u3[4:1] | u4[3:0] ] = u3(4) + u4(4) = 8 bits
|
||||
|<-- 4 bits -->|<--4-->|
|
||||
|
||||
Byte 3: [u4[4] | u5[4:0] | u6[1:0]] = u4(1) + u5(5) + u6(2) = 8 bits
|
||||
|1|<-- 5 bits -->|<-2->|
|
||||
|
||||
Byte 4: [u6[4:2] | u7[4:0] ] = u6(3) + u7(5) = 8 bits
|
||||
|<-3->|<--- 5 bits --->|
|
||||
|
||||
Total: 5 bytes for 8 values = 0.625 bytes/value
|
||||
```
|
||||
|
||||
**Storage per block** (block_size = 64):
|
||||
|
||||
```
|
||||
scale: 4 bytes (f32)
|
||||
data: ceil(64 * 5 / 8) = 40 bytes
|
||||
total: 44 bytes
|
||||
ratio: 256 / 44 = 5.82x
|
||||
```
|
||||
|
||||
### 3.4 Tier 3: 3-Bit Quantization (Cold)
|
||||
|
||||
**Algorithm**: Symmetric per-block, 3-bit codes with optional two-level scale.
|
||||
|
||||
#### Standard Mode
|
||||
|
||||
```
|
||||
Given: block of N f32 values
|
||||
scale = max_abs(values) / 3 // qmax = 2^(3-1) - 1 = 3
|
||||
q[i] = round(values[i] / scale)
|
||||
q[i] = clamp(q[i], -3, +3)
|
||||
u[i] = q[i] + 3 // bias to unsigned [0, 6], fits 3 bits
|
||||
pack u[i] values using codec_bits at width=3
|
||||
```
|
||||
|
||||
#### Two-Level Scale Mode (Outlier Handling)
|
||||
|
||||
When the value distribution has outliers (values significantly larger than the
|
||||
bulk of the distribution), a single scale wastes most of the 3-bit range on the
|
||||
long tail. The two-level scale splits the range:
|
||||
|
||||
```
|
||||
Given: block of N f32 values, outlier_fraction (default: 0.05)
|
||||
sorted_abs = sort(|values|, descending)
|
||||
outlier_count = ceil(N * outlier_fraction)
|
||||
primary_max = sorted_abs[outlier_count] // excludes top 5%
|
||||
secondary_max = sorted_abs[0] // full range
|
||||
|
||||
primary_scale = primary_max / 3 // covers bulk values
|
||||
secondary_scale = secondary_max / 3 // covers outliers
|
||||
|
||||
For each value[i]:
|
||||
if |value[i]| > primary_max:
|
||||
flag[i] = 1 // use secondary scale
|
||||
q[i] = round(value[i] / secondary_scale)
|
||||
else:
|
||||
flag[i] = 0 // use primary scale
|
||||
q[i] = round(value[i] / primary_scale)
|
||||
q[i] = clamp(q[i], -3, +3)
|
||||
u[i] = q[i] + 3
|
||||
|
||||
store: primary_scale (f32) + secondary_scale (f32) + flag bits + packed codes
|
||||
```
|
||||
|
||||
**Bit-packing layout** (8 values packed into 3 bytes):
|
||||
|
||||
```
|
||||
Values: u0 u1 u2 u3 u4 u5 u6 u7
|
||||
Bits: [2..0] [2..0] [2..0] [2..0] [2..0] [2..0] [2..0] [2..0]
|
||||
3 bits 3 bits 3 bits 3 bits 3 bits 3 bits 3 bits 3 bits
|
||||
|
||||
Packed into 3 bytes (24 bits = 8 * 3 bits):
|
||||
|
||||
Byte 0: [u0[2:0] | u1[2:0] | u2[1:0] ] = u0(3) + u1(3) + u2(2) = 8 bits
|
||||
|<-3->|<-3->|<2>|
|
||||
|
||||
Byte 1: [u2[2] | u3[2:0] | u4[2:0] | u5[0]] = u2(1) + u3(3) + u4(3) + u5(1) = 8 bits
|
||||
|1|<-3->|<-3->|1|
|
||||
|
||||
Byte 2: [u5[2:1] | u6[2:0] | u7[2:0] ] = u5(2) + u6(3) + u7(3) = 8 bits
|
||||
|<2>|<-3->|<-3->|
|
||||
|
||||
Total: 3 bytes for 8 values = 0.375 bytes/value
|
||||
```
|
||||
|
||||
**Two-level scale storage layout** (one block, block_size = 64):
|
||||
|
||||
```
|
||||
Byte offset: 0 3 7 8 9 ... 15 16 ...
|
||||
[primary_scale f32] [secondary_scale f32] [flag bytes ] [packed codes]
|
||||
~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~ ~~~~~~~~~~~~~
|
||||
4 bytes 4 bytes ceil(64/8)=8 ceil(64*3/8)=24
|
||||
|
||||
Total per block (two-level): 4 + 4 + 8 + 24 = 40 bytes
|
||||
Total per block (standard): 4 + 24 = 28 bytes
|
||||
ratio (standard): 256 / 28 = 9.14x
|
||||
ratio (two-level): 256 / 40 = 6.40x
|
||||
```
|
||||
|
||||
The two-level mode trades compression ratio for outlier fidelity. It is selected
|
||||
automatically when the ratio `max_abs / median_abs` exceeds a configurable
|
||||
threshold (default: 5.0), indicating a heavy-tailed distribution.
|
||||
|
||||
### 3.5 Tier 0: Compression to Zero (Absent)
|
||||
|
||||
**Algorithm**: No quantized data is stored.
|
||||
|
||||
```
|
||||
Tier 0 representation:
|
||||
metadata: TensorMeta (id, shape, dtype, timestamps)
|
||||
reconstruct_policy: Option<ReconstructPolicy>
|
||||
quantized_data: None
|
||||
|
||||
enum ReconstructPolicy {
|
||||
None, // reads return zeros
|
||||
Delta { base_id: TensorId, delta: ... }, // reconstruct as base + delta
|
||||
Factor { source_id: TensorId, ... }, // reconstruct via transformation
|
||||
}
|
||||
```
|
||||
|
||||
**Read semantics**:
|
||||
|
||||
| `reconstruct_policy` | Behavior |
|
||||
|----------------------|----------|
|
||||
| `None` | Return a zero-filled tensor of the recorded shape. Fast-fail mode returns `Err(TierZeroNoPolicy)` instead. |
|
||||
| `Delta` | Load base tensor, apply stored delta. May trigger recursive decompression if base is also tiered. |
|
||||
| `Factor` | Load source tensor, apply stored transformation (scale, permutation, projection). |
|
||||
|
||||
**Transition to Tier 0**: A tensor is eligible for Tier 0 when its tier score
|
||||
drops below `absent_min_score` (default: 1) and it has not been accessed for
|
||||
longer than `absent_age_threshold` (default: 24 hours). The transition is
|
||||
irreversible without external data: once quantized data is discarded, only the
|
||||
reconstruction policy (if any) can recover approximate values.
|
||||
|
||||
### 3.6 Bit Packing Module: `codec_bits`
|
||||
|
||||
The core packing and unpacking functions shared by all sub-byte formats.
|
||||
|
||||
```rust
|
||||
/// Errors from bit codec operations.
|
||||
#[derive(Debug, Clone, PartialEq, Eq)]
|
||||
pub enum CodecErr {
|
||||
/// Output buffer too small. Contains the required size in bytes.
|
||||
OutputTooSmall { required: usize },
|
||||
/// Input buffer too small for the declared number of values.
|
||||
InputTooSmall { required: usize },
|
||||
/// Bit width must be in [1, 8].
|
||||
InvalidBitWidth { bits: u8 },
|
||||
}
|
||||
|
||||
/// Pack `values.len()` signed codes into `out`, using `bits` bits per code.
|
||||
///
|
||||
/// Each value in `values` is treated as a signed integer in `[-(2^(bits-1)-1), 2^(bits-1)-1]`.
|
||||
/// It is biased to unsigned before packing: `u = v + (2^(bits-1) - 1)`.
|
||||
///
|
||||
/// Returns the number of bytes written to `out`.
|
||||
///
|
||||
/// # Errors
|
||||
/// - `CodecErr::OutputTooSmall` if `out` cannot hold the packed data.
|
||||
/// - `CodecErr::InvalidBitWidth` if `bits` is 0 or greater than 8.
|
||||
pub fn pack_bits(values: &[i8], bits: u8, out: &mut [u8]) -> Result<usize, CodecErr> {
|
||||
if bits == 0 || bits > 8 {
|
||||
return Err(CodecErr::InvalidBitWidth { bits });
|
||||
}
|
||||
let total_bits = values.len() as u64 * bits as u64;
|
||||
let required = ((total_bits + 7) / 8) as usize;
|
||||
if out.len() < required {
|
||||
return Err(CodecErr::OutputTooSmall { required });
|
||||
}
|
||||
|
||||
let qmax = (1i8 << (bits - 1)) - 1; // bias offset
|
||||
let mask: u64 = (1u64 << bits) - 1;
|
||||
let mut acc: u64 = 0;
|
||||
let mut acc_bits: u32 = 0;
|
||||
let mut pos: usize = 0;
|
||||
|
||||
for &v in values {
|
||||
let u = (v as i16 + qmax as i16) as u64 & mask;
|
||||
acc |= u << acc_bits;
|
||||
acc_bits += bits as u32;
|
||||
while acc_bits >= 8 {
|
||||
out[pos] = (acc & 0xFF) as u8;
|
||||
pos += 1;
|
||||
acc >>= 8;
|
||||
acc_bits -= 8;
|
||||
}
|
||||
}
|
||||
// Flush remaining bits
|
||||
if acc_bits > 0 {
|
||||
out[pos] = (acc & 0xFF) as u8;
|
||||
pos += 1;
|
||||
}
|
||||
Ok(pos)
|
||||
}
|
||||
|
||||
/// Unpack codes from `inp` into `out`, reading `bits` bits per code.
|
||||
///
|
||||
/// Reads exactly `out.len()` values. Each unsigned code is unbiased back to signed:
|
||||
/// `v = u - (2^(bits-1) - 1)`.
|
||||
///
|
||||
/// Returns the number of bytes consumed from `inp`.
|
||||
///
|
||||
/// # Errors
|
||||
/// - `CodecErr::InputTooSmall` if `inp` does not contain enough data.
|
||||
/// - `CodecErr::InvalidBitWidth` if `bits` is 0 or greater than 8.
|
||||
pub fn unpack_bits(inp: &[u8], bits: u8, out: &mut [i8]) -> Result<usize, CodecErr> {
|
||||
if bits == 0 || bits > 8 {
|
||||
return Err(CodecErr::InvalidBitWidth { bits });
|
||||
}
|
||||
let total_bits = out.len() as u64 * bits as u64;
|
||||
let required = ((total_bits + 7) / 8) as usize;
|
||||
if inp.len() < required {
|
||||
return Err(CodecErr::InputTooSmall { required });
|
||||
}
|
||||
|
||||
let qmax = (1i8 << (bits - 1)) - 1;
|
||||
let mask: u64 = (1u64 << bits) - 1;
|
||||
let mut acc: u64 = 0;
|
||||
let mut acc_bits: u32 = 0;
|
||||
let mut byte_pos: usize = 0;
|
||||
let mut val_pos: usize = 0;
|
||||
|
||||
while val_pos < out.len() {
|
||||
while acc_bits < bits as u32 {
|
||||
acc |= (inp[byte_pos] as u64) << acc_bits;
|
||||
acc_bits += 8;
|
||||
byte_pos += 1;
|
||||
}
|
||||
let u = (acc & mask) as i16;
|
||||
out[val_pos] = (u - qmax as i16) as i8;
|
||||
acc >>= bits;
|
||||
acc_bits -= bits as u32;
|
||||
val_pos += 1;
|
||||
}
|
||||
Ok(required)
|
||||
}
|
||||
```
|
||||
|
||||
**Properties**:
|
||||
|
||||
- No heap allocations. Callers provide both input and output slices.
|
||||
- Single bit writer / bit reader using a 64-bit accumulator.
|
||||
- Deterministic little-endian byte order.
|
||||
- The `pack_bits` / `unpack_bits` pair is its own inverse: `unpack(pack(v)) == v`
|
||||
for all valid inputs.
|
||||
|
||||
### 3.7 Quant Module Functions
|
||||
|
||||
```rust
|
||||
/// Block-level quantization configuration.
|
||||
pub struct QuantConfig {
|
||||
pub block_size: usize, // elements per quantization block (default: 64)
|
||||
pub two_level_threshold: f32, // max/median ratio to trigger two-level (default: 5.0)
|
||||
}
|
||||
|
||||
/// Quantized block result.
|
||||
pub struct QuantizedBlock {
|
||||
pub scale: f32,
|
||||
pub secondary_scale: Option<f32>, // only for two-level 3-bit
|
||||
pub flags: Option<Vec<u8>>, // 1-bit-per-value flags for two-level
|
||||
pub codes: Vec<i8>, // signed quantized codes
|
||||
pub bits: u8,
|
||||
}
|
||||
|
||||
/// Symmetric 8-bit quantization (Tier 1 - Hot).
|
||||
///
|
||||
/// Quantizes each block of `block_size` values independently.
|
||||
/// scale = max_abs(block) / 127
|
||||
/// q[i] = clamp(round(x[i] / scale), -127, 127)
|
||||
pub fn quantize_s8(
|
||||
values: &[f32],
|
||||
config: &QuantConfig,
|
||||
) -> Vec<QuantizedBlock>;
|
||||
|
||||
/// Symmetric N-bit quantization (Tier 2/3 - Warm/Cold).
|
||||
///
|
||||
/// `bits` must be one of: 7, 5, 3.
|
||||
/// qmax = 2^(bits-1) - 1
|
||||
/// scale = max_abs(block) / qmax
|
||||
/// q[i] = clamp(round(x[i] / scale), -qmax, qmax)
|
||||
///
|
||||
/// For bits=3 and config.two_level_threshold exceeded: uses two-level scale.
|
||||
pub fn quantize_bits(
|
||||
values: &[f32],
|
||||
bits: u8,
|
||||
config: &QuantConfig,
|
||||
) -> Vec<QuantizedBlock>;
|
||||
|
||||
/// Dequantize a block back to f32 values.
|
||||
///
|
||||
/// For standard mode: x'[i] = codes[i] as f32 * scale
|
||||
/// For two-level mode: x'[i] = codes[i] as f32 * (if flags[i] then secondary_scale else scale)
|
||||
pub fn dequantize(block: &QuantizedBlock) -> Vec<f32>;
|
||||
|
||||
/// Compute the maximum absolute value across a slice.
|
||||
///
|
||||
/// On native targets with `target_feature = "avx2"` or `target_feature = "neon"`:
|
||||
/// uses SIMD intrinsics for 4-8x throughput.
|
||||
/// On WASM with `target_feature = "simd128"` (optional):
|
||||
/// uses wasm_simd128 intrinsics.
|
||||
/// Fallback: portable scalar loop.
|
||||
#[inline]
|
||||
pub fn max_abs(values: &[f32]) -> f32;
|
||||
```
|
||||
|
||||
**SIMD implementation sketch for `max_abs`** (AVX2):
|
||||
|
||||
```rust
|
||||
#[cfg(target_arch = "x86_64")]
|
||||
#[target_feature(enable = "avx2")]
|
||||
unsafe fn max_abs_avx2(values: &[f32]) -> f32 {
|
||||
use std::arch::x86_64::*;
|
||||
let sign_mask = _mm256_set1_ps(f32::from_bits(0x7FFF_FFFF)); // abs mask
|
||||
let mut vmax = _mm256_setzero_ps();
|
||||
let chunks = values.len() / 8;
|
||||
|
||||
for i in 0..chunks {
|
||||
let v = _mm256_loadu_ps(values.as_ptr().add(i * 8));
|
||||
let abs_v = _mm256_and_ps(v, sign_mask);
|
||||
vmax = _mm256_max_ps(vmax, abs_v);
|
||||
}
|
||||
|
||||
// Horizontal max reduction
|
||||
let hi128 = _mm256_extractf128_ps(vmax, 1);
|
||||
let lo128 = _mm256_castps256_ps128(vmax);
|
||||
let max128 = _mm_max_ps(hi128, lo128);
|
||||
let shuf = _mm_movehdup_ps(max128);
|
||||
let max64 = _mm_max_ps(max128, shuf);
|
||||
let shuf2 = _mm_movehl_ps(max64, max64);
|
||||
let max32 = _mm_max_ss(max64, shuf2);
|
||||
let mut result = _mm_cvtss_f32(max32);
|
||||
|
||||
// Handle remainder
|
||||
for i in (chunks * 8)..values.len() {
|
||||
result = result.max(values[i].abs());
|
||||
}
|
||||
result
|
||||
}
|
||||
```
|
||||
|
||||
**WASM portable fallback**:
|
||||
|
||||
```rust
|
||||
#[cfg(not(any(target_arch = "x86_64", target_arch = "aarch64")))]
|
||||
pub fn max_abs(values: &[f32]) -> f32 {
|
||||
let mut m: f32 = 0.0;
|
||||
for &v in values {
|
||||
let a = v.abs();
|
||||
if a > m {
|
||||
m = a;
|
||||
}
|
||||
}
|
||||
m
|
||||
}
|
||||
```
|
||||
|
||||
When WASM SIMD is enabled via `target_feature = "simd128"`, a vectorized path
|
||||
processes 4 f32 values per iteration using `v128` types. This is optional and
|
||||
gated behind a cargo feature flag `wasm-simd`.
|
||||
|
||||
### 3.8 Error Bound Analysis
|
||||
|
||||
For symmetric quantization with bit width `B`, block scale `s`, and `qmax = 2^(B-1) - 1`:
|
||||
|
||||
```
|
||||
quantization_step = s / qmax
|
||||
max_element_error = quantization_step / 2 (from rounding)
|
||||
max_relative_error = 1 / (2 * qmax) (per element, worst case)
|
||||
rms_error = quantization_step / sqrt(12) (uniform quantization noise)
|
||||
```
|
||||
|
||||
**Per-tier error bounds**:
|
||||
|
||||
| Tier | Bits | qmax | Max Rel. Error | RMS Rel. Error | Max Abs. Error (scale=1.0) |
|
||||
|------|------|------|---------------|----------------|---------------------------|
|
||||
| Hot (8-bit) | 8 | 127 | 0.394% | 0.228% | 0.00394 |
|
||||
| Warm (7-bit) | 7 | 63 | 0.794% | 0.458% | 0.00794 |
|
||||
| Warm-agg (5-bit) | 5 | 15 | 3.333% | 1.925% | 0.03333 |
|
||||
| Cold (3-bit, std) | 3 | 3 | 16.667% | 9.623% | 0.16667 |
|
||||
| Cold (3-bit, 2-level) | 3 | 3 | 16.667% per scale | 9.623% | Reduced for bulk values |
|
||||
|
||||
**Two-level scale improvement for 3-bit**: When 95% of values fall within
|
||||
`primary_max` and outliers use `secondary_scale`:
|
||||
|
||||
| Component | Fraction | Scale | Effective Max Error |
|
||||
|-----------|----------|-------|-------------------|
|
||||
| Bulk values (95%) | 0.95 | primary_scale (smaller) | 16.7% of primary_max |
|
||||
| Outlier values (5%) | 0.05 | secondary_scale (larger) | 16.7% of secondary_max |
|
||||
|
||||
The bulk values achieve much lower absolute error because `primary_scale` is
|
||||
typically 3-10x smaller than the single-scale `scale`. The outliers retain the
|
||||
same relative error but are fewer in number.
|
||||
|
||||
**Drift compounding**: When drift tolerance is `d` (e.g., 10%), and a frame is
|
||||
quantized with scales from an earlier frame, the effective max relative error
|
||||
becomes `(1 + d) / (2 * qmax)`. For 8-bit with 10% drift: `1.1 / 254 = 0.433%`.
|
||||
|
||||
**Cumulative error table with drift**:
|
||||
|
||||
| Tier | Bits | No Drift | 10% Drift | 20% Drift |
|
||||
|------|------|----------|-----------|-----------|
|
||||
| Hot | 8 | 0.394% | 0.433% | 0.472% |
|
||||
| Warm | 7 | 0.794% | 0.873% | 0.952% |
|
||||
| Warm-agg | 5 | 3.333% | 3.667% | 4.000% |
|
||||
| Cold | 3 | 16.667% | 18.333% | 20.000% |
|
||||
|
||||
### 3.9 Complete Quantizer and Packer Traits
|
||||
|
||||
```rust
|
||||
/// Trait for quantization formats that can encode and decode tensor blocks.
|
||||
pub trait TensorQuantizer {
|
||||
/// The bit width of this quantizer.
|
||||
fn bit_width(&self) -> u8;
|
||||
|
||||
/// Quantize a block of f32 values into signed codes and scale(s).
|
||||
fn quantize_block(
|
||||
&self,
|
||||
values: &[f32],
|
||||
config: &QuantConfig,
|
||||
) -> QuantizedBlock;
|
||||
|
||||
/// Dequantize a block back to f32 values.
|
||||
fn dequantize_block(
|
||||
&self,
|
||||
block: &QuantizedBlock,
|
||||
out: &mut [f32],
|
||||
) -> Result<(), CodecErr>;
|
||||
|
||||
/// Returns the packed byte size for `num_values` at this bit width,
|
||||
/// excluding scale storage.
|
||||
fn packed_data_size(&self, num_values: usize) -> usize {
|
||||
(num_values * self.bit_width() as usize + 7) / 8
|
||||
}
|
||||
|
||||
/// Returns total block storage size including scale(s) and flags.
|
||||
fn block_storage_size(&self, block_size: usize) -> usize;
|
||||
}
|
||||
|
||||
/// Trait for bit-level packing codecs.
|
||||
pub trait BitCodec {
|
||||
/// Pack signed codes into a byte buffer.
|
||||
fn pack(
|
||||
&self,
|
||||
codes: &[i8],
|
||||
bits: u8,
|
||||
out: &mut [u8],
|
||||
) -> Result<usize, CodecErr>;
|
||||
|
||||
/// Unpack codes from a byte buffer.
|
||||
fn unpack(
|
||||
&self,
|
||||
data: &[u8],
|
||||
bits: u8,
|
||||
out: &mut [i8],
|
||||
) -> Result<usize, CodecErr>;
|
||||
}
|
||||
|
||||
/// Standard implementation using the accumulator-based codec_bits functions.
|
||||
pub struct StandardBitCodec;
|
||||
|
||||
impl BitCodec for StandardBitCodec {
|
||||
fn pack(
|
||||
&self,
|
||||
codes: &[i8],
|
||||
bits: u8,
|
||||
out: &mut [u8],
|
||||
) -> Result<usize, CodecErr> {
|
||||
pack_bits(codes, bits, out)
|
||||
}
|
||||
|
||||
fn unpack(
|
||||
&self,
|
||||
data: &[u8],
|
||||
bits: u8,
|
||||
out: &mut [i8],
|
||||
) -> Result<usize, CodecErr> {
|
||||
unpack_bits(data, bits, out)
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
### 3.10 Block Storage Summary Diagram
|
||||
|
||||
```
|
||||
TIER 1 (8-bit):
|
||||
+--------+-------+-------+-------+-----+-------+
|
||||
| scale | q[0] | q[1] | q[2] | ... | q[63] |
|
||||
| f32 LE | i8 | i8 | i8 | | i8 |
|
||||
+--------+-------+-------+-------+-----+-------+
|
||||
4 bytes 1 1 1 1 = 68 bytes / block
|
||||
|
||||
TIER 2a (7-bit):
|
||||
+--------+--------------------------------------------+
|
||||
| scale | packed 7-bit codes (56 bytes for 64 vals) |
|
||||
| f32 LE | bitstream, little-endian accumulator |
|
||||
+--------+--------------------------------------------+
|
||||
4 bytes ceil(64*7/8) = 56 bytes = 60 bytes / block
|
||||
|
||||
TIER 2b (5-bit):
|
||||
+--------+--------------------------------------------+
|
||||
| scale | packed 5-bit codes (40 bytes for 64 vals) |
|
||||
| f32 LE | bitstream, little-endian accumulator |
|
||||
+--------+--------------------------------------------+
|
||||
4 bytes ceil(64*5/8) = 40 bytes = 44 bytes / block
|
||||
|
||||
TIER 3 standard (3-bit):
|
||||
+--------+--------------------------------------------+
|
||||
| scale | packed 3-bit codes (24 bytes for 64 vals) |
|
||||
| f32 LE | bitstream, little-endian accumulator |
|
||||
+--------+--------------------------------------------+
|
||||
4 bytes ceil(64*3/8) = 24 bytes = 28 bytes / block
|
||||
|
||||
TIER 3 two-level (3-bit):
|
||||
+--------+--------+----------+-------------------------------+
|
||||
| pscale | sscale | flags | packed 3-bit codes |
|
||||
| f32 LE | f32 LE | ceil(N/8)| bitstream |
|
||||
+--------+--------+----------+-------------------------------+
|
||||
4 4 8 bytes 24 bytes = 40 bytes / block
|
||||
|
||||
TIER 0 (absent):
|
||||
+--------------------------------------+
|
||||
| TensorMeta + ReconstructPolicy only |
|
||||
| NO quantized data |
|
||||
+--------------------------------------+
|
||||
variable (typically 32-128 bytes metadata)
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## 4. Alternatives Considered
|
||||
|
||||
### 4.1 4-Bit as the Warm Tier
|
||||
|
||||
4-bit quantization (qmax = 7, 8.00x compression) is the most widely studied
|
||||
format (GPTQ, AWQ). We considered using 4-bit instead of 7-bit for the warm
|
||||
tier. **Rejected** because: (a) the jump from 8-bit to 4-bit is too large for
|
||||
tensors that were recently hot, causing unnecessary quality loss; (b) 7-bit
|
||||
provides a gentler step-down; (c) 5-bit is available as an intermediate when
|
||||
memory pressure increases.
|
||||
|
||||
### 4.2 Uniform 4-Bit Across All Non-Hot Tiers
|
||||
|
||||
A simpler design with only two quantization levels (8-bit hot, 4-bit everything
|
||||
else). **Rejected** because: (a) cold tensors waste 1 extra bit per value when
|
||||
3-bit suffices; (b) no path to aggressive compression under memory pressure;
|
||||
(c) loses the granularity that enables smooth quality degradation.
|
||||
|
||||
### 4.3 Asymmetric Quantization for 3-Bit
|
||||
|
||||
Using asymmetric quantization (with zero-point) for 3-bit to better utilize the
|
||||
`[0, 7]` unsigned range when distributions are not centered. **Rejected**
|
||||
because: (a) adds 4 bytes of zero-point storage per block; (b) requires an
|
||||
additional subtraction in the dequantize path; (c) the two-level scale approach
|
||||
handles asymmetric distributions more effectively by splitting the scale rather
|
||||
than shifting the range.
|
||||
|
||||
### 4.4 Lookup Table (Codebook) Quantization for Cold
|
||||
|
||||
Using a small codebook (e.g., 8 centroids) instead of uniform 3-bit levels.
|
||||
**Rejected** because: (a) requires a per-block or per-tensor codebook training
|
||||
step that is expensive for streaming data; (b) codebook storage overhead is
|
||||
comparable to scale storage but with higher decode complexity; (c) uniform
|
||||
quantization is simpler to implement and reason about.
|
||||
|
||||
### 4.5 No Two-Level Scale (Simpler 3-Bit)
|
||||
|
||||
Omitting the two-level scale option entirely. **Considered but rejected** because
|
||||
agent embedding tensors frequently exhibit heavy-tailed distributions where a few
|
||||
dimensions carry disproportionate magnitude. Without two-level scale, these
|
||||
outliers cause the single scale to be too large, wasting most of the 3-bit range
|
||||
on the bulk of near-zero values.
|
||||
|
||||
---
|
||||
|
||||
## 5. Acceptance Criteria
|
||||
|
||||
### 5.1 Format Correctness
|
||||
|
||||
- [ ] `pack_bits` followed by `unpack_bits` is a lossless round-trip for all
|
||||
bit widths (3, 5, 7, 8) and all valid signed input ranges.
|
||||
- [ ] `quantize_s8` followed by `dequantize` produces values within the
|
||||
theoretical error bound (`scale / 254`) of the originals.
|
||||
- [ ] `quantize_bits(7, ...)` followed by `dequantize` produces values within
|
||||
`scale / 126` of the originals.
|
||||
- [ ] `quantize_bits(5, ...)` followed by `dequantize` produces values within
|
||||
`scale / 30` of the originals.
|
||||
- [ ] `quantize_bits(3, ...)` followed by `dequantize` produces values within
|
||||
`scale / 6` of the originals (standard mode).
|
||||
- [ ] Two-level 3-bit mode activates when `max/median > two_level_threshold`.
|
||||
- [ ] Tier 0 reads return zeros when `reconstruct_policy` is `None`.
|
||||
- [ ] Tier 0 reads invoke reconstruction when a policy exists.
|
||||
|
||||
### 5.2 Performance
|
||||
|
||||
- [ ] `pack_bits` throughput >= 2 GB/s on native (AVX2-capable hardware).
|
||||
- [ ] `unpack_bits` throughput >= 2 GB/s on native.
|
||||
- [ ] `max_abs` with SIMD is >= 3x faster than the scalar fallback on 512+ element blocks.
|
||||
- [ ] WASM `pack_bits` / `unpack_bits` throughput >= 500 MB/s (without SIMD).
|
||||
- [ ] No heap allocations in `pack_bits`, `unpack_bits`, or `max_abs`.
|
||||
|
||||
### 5.3 Storage Efficiency
|
||||
|
||||
- [ ] 8-bit block storage: exactly `4 + block_size` bytes.
|
||||
- [ ] 7-bit block storage: exactly `4 + ceil(block_size * 7 / 8)` bytes.
|
||||
- [ ] 5-bit block storage: exactly `4 + ceil(block_size * 5 / 8)` bytes.
|
||||
- [ ] 3-bit block storage (standard): exactly `4 + ceil(block_size * 3 / 8)` bytes.
|
||||
- [ ] 3-bit block storage (two-level): exactly `8 + ceil(block_size / 8) + ceil(block_size * 3 / 8)` bytes.
|
||||
- [ ] No padding bits between consecutive blocks in a segment.
|
||||
|
||||
### 5.4 Dynamic Tier 2 Downgrade
|
||||
|
||||
- [ ] When aggregate warm storage exceeds `warm_byte_cap`, the least recently
|
||||
accessed warm tensors are re-encoded from 7-bit to 5-bit.
|
||||
- [ ] The downgrade is reversible: if warm storage drops below
|
||||
`warm_byte_cap * 0.8` (hysteresis), tensors can be re-promoted to 7-bit
|
||||
on next access.
|
||||
|
||||
---
|
||||
|
||||
## 6. Risks and Mitigations
|
||||
|
||||
| Risk | Severity | Likelihood | Mitigation |
|
||||
|------|----------|------------|------------|
|
||||
| 3-bit two-level scale adds format complexity without sufficient accuracy gain for most distributions | Medium | Medium | Gate behind a cargo feature `two-level-cold`; default to standard 3-bit. Benchmark on real agent embeddings before enabling by default. |
|
||||
| Dynamic 7-bit to 5-bit downgrade causes thrashing when warm set oscillates near the byte cap | Medium | Medium | Implement hysteresis (20% band). Only downgrade when above cap; only upgrade when below 80% of cap. Rate-limit downgrades to at most once per minute. |
|
||||
| `pack_bits` accumulator overflow for large inputs | Low | Low | The 64-bit accumulator can hold up to 56 bits of pending data (7 bytes). Since we flush at 8 bits, the maximum pending bits is `bits - 1 = 7`, well within the 64-bit range. No overflow possible. |
|
||||
| Tier 0 reconstruction from Delta/Factor introduces unbounded latency | Medium | Low | Set a maximum reconstruction depth (default: 3). If the base tensor is also Tier 0, fail with `ReconstructionDepthExceeded` rather than recursing indefinitely. |
|
||||
| WASM scalar `max_abs` is a bottleneck for large tensors | Low | High | Expected. The WASM SIMD feature flag provides 3-4x improvement. For non-SIMD targets, `max_abs` cost is small relative to the full quantize pipeline. |
|
||||
| Block size mismatch between encoder and decoder | High | Low | Block size is stored in the segment header (ADR-017 format). Decoder reads it from the header rather than assuming a default. |
|
||||
|
||||
---
|
||||
|
||||
## 7. References
|
||||
|
||||
1. ADR-017: Temporal Tensor Compression with Tiered Quantization. RuVector Architecture Team, 2026.
|
||||
2. ADR-018: Block-Based Storage Engine for Temporal Tensor Segments (forthcoming).
|
||||
3. Frantar, E., et al. "GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers." ICLR 2023.
|
||||
4. Lin, J., et al. "AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration." MLSys 2024.
|
||||
5. Kim, S., et al. "SqueezeLLM: Dense-and-Sparse Quantization." ICML 2024.
|
||||
6. Liu, Z., et al. "KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache." ICML 2024.
|
||||
7. Pelkonen, T., et al. "Gorilla: A Fast, Scalable, In-Memory Time Series Database." VLDB 2015.
|
||||
8. IEEE 754-2019. "IEEE Standard for Floating-Point Arithmetic."
|
||||
9. Lemire, D. and Boytsov, L. "Decoding billions of integers in milliseconds through vectorized bit packing." Software: Practice and Experience, 2015.
|
||||
10. WebAssembly SIMD Proposal. https://github.com/WebAssembly/simd. Finalized 2023.
|
||||
Reference in New Issue
Block a user