# ADR-019: Tiered Quantization Formats for Temporal Tensor Store **Status**: Proposed **Date**: 2026-02-08 **Parent**: ADR-017 Temporal Tensor Compression, ADR-018 Block-Based Storage Engine **Author**: System Architecture Team **Note**: Tiered quantization formats are now implemented in the rvf-quant crate as part of ADR-029 (RVF). See the RVF temperature-tiering specification (docs/research/rvf/spec/03-temperature-tiering.md). ## Version History | Version | Date | Author | Changes | |---------|------|--------|---------| | 0.1 | 2026-02-08 | Architecture Team | Initial proposal | --- ## Abstract This ADR defines the concrete quantization formats, bit-packing layouts, and codec interfaces for the five tiers of tensor storage established in ADR-017. Where ADR-017 introduced the concept of access-frequency-driven quantization and temporal scale reuse, this document specifies the exact byte-level formats for 8-bit (Tier 1 / Hot), 7-bit and 5-bit (Tier 2 / Warm), 3-bit (Tier 3 / Cold), and Compression-to-Zero (Tier 0 / Absent). It also resolves two open design questions from ADR-017: whether 5-bit quantization is permitted within the warm tier, and how Tier 0 reads behave when no reconstruction policy exists. The `codec_bits` module provides a single allocation-free bit packer/unpacker that all sub-byte formats share. The `quant` module provides per-format quantize and dequantize functions, with SIMD-accelerated `max_abs` on native targets and a portable fallback for WASM. Rust trait interfaces are defined so that new bit widths can be added without modifying the core codec. --- ## 1. Context and Motivation ### 1.1 Gap in ADR-017 ADR-017 established the tiered compression architecture and segment binary format but left the per-tier quantization details at the algorithmic level. Implementers need exact byte layouts to write interoperable encoders and decoders, particularly for the sub-byte formats (7-bit, 5-bit, 3-bit) where values do not align on byte boundaries. ### 1.2 Sub-Byte Packing Complexity Standard 8-bit quantization maps trivially to `[u8]` storage. Sub-byte formats require a bit-packing codec that can write and read arbitrary-width codes into a byte stream without wasting bits. The codec must: - Handle bit widths 3, 5, and 7 (with 8 as a degenerate identity case). - Operate without heap allocations (caller provides output slice). - Be deterministic and platform-independent (little-endian byte order). - Support WASM targets where SIMD is optional. ### 1.3 Outlier Handling in 3-Bit At 3 bits per value, the quantization range is `[-3, +3]` (qmax = 3). Large outliers in the tensor distribution can cause severe clamping. ADR-017 noted this risk but did not specify a mitigation. This ADR introduces a two-level scale option for Tier 3 that uses a 1-bit flag per value to select between a primary scale (covering the majority of values) and a secondary scale (covering outliers), while keeping the packed format compact. ### 1.4 Tier 0 Semantics ADR-017 listed Compression-to-Zero as a future possibility. This ADR formalizes it: Tier 0 stores no quantized data at all. Only metadata and an optional `reconstruct_policy` survive. This enables aggressive memory reclamation for tensors that are no longer accessed but may be reconstructable from other sources (deltas, factors, or recomputation). ### 1.5 Design Questions Resolved | Question | Resolution | |----------|------------| | Allow 5-bit within warm tier? | Yes. Dynamic downgrade from 7-bit to 5-bit when warm set exceeds a configurable byte cap (`warm_byte_cap`). | | Tier 0 read semantics? | Return zeros by default. If a `reconstruct_policy` (Delta or Factor) exists, reconstruct from stored representation. | --- ## 2. Decision We adopt the following five-tier quantization format hierarchy, each with a well-defined byte layout, packing strategy, and error budget: | Tier | Name | Bits | Compression vs f32 | Use Case | |------|------|------|-------------------|----------| | 1 | Hot | 8 | 4.00x | Active tensors, full fidelity | | 2a | Warm | 7 | 4.57x | Default warm, near-lossless | | 2b | Warm-aggressive | 5 | 6.40x | Warm set exceeds `warm_byte_cap` | | 3 | Cold | 3 | 10.67x | Archived tensors, bounded error | | 0 | Absent | 0 | Infinite | No data stored; metadata only | All sub-byte formats share the `codec_bits` packer. All quantization formats use symmetric per-block quantization with `scale = max_abs / qmax` stored as f32 per block. The choice of f32 (rather than f16 as in ADR-017 segment headers) is deliberate at this layer: the segment encoder may convert to f16 for storage, but the quantizer operates in f32 for precision during the quantize/dequantize path. --- ## 3. Detailed Design ### 3.1 Tier 1: 8-Bit Quantization (Hot) **Algorithm**: Symmetric per-block quantization. ``` Given: block of N f32 values, block_size typically 64 or 128 scale = max_abs(values) / 127 q[i] = round(values[i] / scale) q[i] = clamp(q[i], -127, +127) // i8 range store: q as [i8; N] + scale as f32 ``` **Storage layout** (one block, block_size = 8 for illustration): ``` Byte offset: 0 1 2 3 4 5 6 7 8 9 10 11 [ scale (f32, LE) ] [q0] [q1] [q2] [q3] [q4] [q5] [q6] [q7] ~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 4 bytes 8 bytes (1 byte per i8 value) Total per block: 4 + block_size bytes ``` **Effective compression** (block_size = 64): ``` raw = 64 * 4 = 256 bytes quant = 4 + 64 * 1 = 68 bytes ratio = 256 / 68 = 3.76x (single block) ``` With temporal amortization (100 frames sharing scales): `256*100 / (4 + 64*100)` = 4.00x. **Dequantize**: ``` values[i] = q[i] as f32 * scale ``` **Error bound**: `max_error = scale / (2 * 127)`. See Section 3.7 for full analysis. ### 3.2 Tier 2a: 7-Bit Quantization (Warm) **Algorithm**: Symmetric per-block, 7-bit codes packed into a bitstream. ``` Given: block of N f32 values scale = max_abs(values) / 63 // qmax = 2^(7-1) - 1 = 63 q[i] = round(values[i] / scale) q[i] = clamp(q[i], -63, +63) u[i] = q[i] + 63 // bias to unsigned [0, 126], fits 7 bits pack u[i] values using codec_bits at width=7 ``` **Bit-packing layout** (8 values packed into 7 bytes): ``` Values: u0 u1 u2 u3 u4 u5 u6 u7 Bits: [6..0] [6..0] [6..0] [6..0] [6..0] [6..0] [6..0] [6..0] 7 bits 7 bits 7 bits 7 bits 7 bits 7 bits 7 bits 7 bits Packed into 7 bytes (56 bits = 8 * 7 bits): Byte 0: [u0[6:0] | u1[0] ] = u0(7) + u1(1) = 8 bits |<--- 7 bits --->|<1>| Byte 1: [u1[6:1] | u2[1:0]] = u1(6) + u2(2) = 8 bits |<--- 6 bits --->|<-2->| Byte 2: [u2[6:2] | u3[2:0] ] = u2(5) + u3(3) = 8 bits |<-- 5 bits -->|<--3-->| Byte 3: [u3[6:3] | u4[3:0] ] = u3(4) + u4(4) = 8 bits |<- 4 bits ->|<--4--->| Byte 4: [u4[6:4] | u5[4:0] ] = u4(3) + u5(5) = 8 bits |<-3->|<---- 5 bits ---->| Byte 5: [u5[6:5] | u6[5:0] ] = u5(2) + u6(6) = 8 bits |<2>|<----- 6 bits ------>| Byte 6: [u6[6] | u7[6:0] ] = u6(1) + u7(7) = 8 bits |1|<------- 7 bits ------->| Total: 7 bytes for 8 values = 0.875 bytes/value ``` **Storage per block** (block_size = 64): ``` scale: 4 bytes (f32) data: ceil(64 * 7 / 8) = 56 bytes total: 60 bytes ratio: 256 / 60 = 4.27x ``` ### 3.3 Tier 2b: 5-Bit Quantization (Warm Aggressive) **Algorithm**: Symmetric per-block, 5-bit codes. ``` Given: block of N f32 values scale = max_abs(values) / 15 // qmax = 2^(5-1) - 1 = 15 q[i] = round(values[i] / scale) q[i] = clamp(q[i], -15, +15) u[i] = q[i] + 15 // bias to unsigned [0, 30], fits 5 bits pack u[i] values using codec_bits at width=5 ``` **Activation policy**: 5-bit is used instead of 7-bit when the total warm set size exceeds `warm_byte_cap` (default: 64 MiB). The tier policy monitors aggregate warm storage and downgrades from 7-bit to 5-bit for the least recently accessed warm tensors until the cap is satisfied. **Bit-packing layout** (8 values packed into 5 bytes): ``` Values: u0 u1 u2 u3 u4 u5 u6 u7 Bits: [4..0] [4..0] [4..0] [4..0] [4..0] [4..0] [4..0] [4..0] 5 bits 5 bits 5 bits 5 bits 5 bits 5 bits 5 bits 5 bits Packed into 5 bytes (40 bits = 8 * 5 bits): Byte 0: [u0[4:0] | u1[2:0] ] = u0(5) + u1(3) = 8 bits |<-- 5 bits -->|<--3-->| Byte 1: [u1[4:3] | u2[4:0] | u3[0]] = u1(2) + u2(5) + u3(1) = 8 bits |<2>|<-- 5 bits -->|<1>| Byte 2: [u3[4:1] | u4[3:0] ] = u3(4) + u4(4) = 8 bits |<-- 4 bits -->|<--4-->| Byte 3: [u4[4] | u5[4:0] | u6[1:0]] = u4(1) + u5(5) + u6(2) = 8 bits |1|<-- 5 bits -->|<-2->| Byte 4: [u6[4:2] | u7[4:0] ] = u6(3) + u7(5) = 8 bits |<-3->|<--- 5 bits --->| Total: 5 bytes for 8 values = 0.625 bytes/value ``` **Storage per block** (block_size = 64): ``` scale: 4 bytes (f32) data: ceil(64 * 5 / 8) = 40 bytes total: 44 bytes ratio: 256 / 44 = 5.82x ``` ### 3.4 Tier 3: 3-Bit Quantization (Cold) **Algorithm**: Symmetric per-block, 3-bit codes with optional two-level scale. #### Standard Mode ``` Given: block of N f32 values scale = max_abs(values) / 3 // qmax = 2^(3-1) - 1 = 3 q[i] = round(values[i] / scale) q[i] = clamp(q[i], -3, +3) u[i] = q[i] + 3 // bias to unsigned [0, 6], fits 3 bits pack u[i] values using codec_bits at width=3 ``` #### Two-Level Scale Mode (Outlier Handling) When the value distribution has outliers (values significantly larger than the bulk of the distribution), a single scale wastes most of the 3-bit range on the long tail. The two-level scale splits the range: ``` Given: block of N f32 values, outlier_fraction (default: 0.05) sorted_abs = sort(|values|, descending) outlier_count = ceil(N * outlier_fraction) primary_max = sorted_abs[outlier_count] // excludes top 5% secondary_max = sorted_abs[0] // full range primary_scale = primary_max / 3 // covers bulk values secondary_scale = secondary_max / 3 // covers outliers For each value[i]: if |value[i]| > primary_max: flag[i] = 1 // use secondary scale q[i] = round(value[i] / secondary_scale) else: flag[i] = 0 // use primary scale q[i] = round(value[i] / primary_scale) q[i] = clamp(q[i], -3, +3) u[i] = q[i] + 3 store: primary_scale (f32) + secondary_scale (f32) + flag bits + packed codes ``` **Bit-packing layout** (8 values packed into 3 bytes): ``` Values: u0 u1 u2 u3 u4 u5 u6 u7 Bits: [2..0] [2..0] [2..0] [2..0] [2..0] [2..0] [2..0] [2..0] 3 bits 3 bits 3 bits 3 bits 3 bits 3 bits 3 bits 3 bits Packed into 3 bytes (24 bits = 8 * 3 bits): Byte 0: [u0[2:0] | u1[2:0] | u2[1:0] ] = u0(3) + u1(3) + u2(2) = 8 bits |<-3->|<-3->|<2>| Byte 1: [u2[2] | u3[2:0] | u4[2:0] | u5[0]] = u2(1) + u3(3) + u4(3) + u5(1) = 8 bits |1|<-3->|<-3->|1| Byte 2: [u5[2:1] | u6[2:0] | u7[2:0] ] = u5(2) + u6(3) + u7(3) = 8 bits |<2>|<-3->|<-3->| Total: 3 bytes for 8 values = 0.375 bytes/value ``` **Two-level scale storage layout** (one block, block_size = 64): ``` Byte offset: 0 3 7 8 9 ... 15 16 ... [primary_scale f32] [secondary_scale f32] [flag bytes ] [packed codes] ~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~ ~~~~~~~~~~~~~ 4 bytes 4 bytes ceil(64/8)=8 ceil(64*3/8)=24 Total per block (two-level): 4 + 4 + 8 + 24 = 40 bytes Total per block (standard): 4 + 24 = 28 bytes ratio (standard): 256 / 28 = 9.14x ratio (two-level): 256 / 40 = 6.40x ``` The two-level mode trades compression ratio for outlier fidelity. It is selected automatically when the ratio `max_abs / median_abs` exceeds a configurable threshold (default: 5.0), indicating a heavy-tailed distribution. ### 3.5 Tier 0: Compression to Zero (Absent) **Algorithm**: No quantized data is stored. ``` Tier 0 representation: metadata: TensorMeta (id, shape, dtype, timestamps) reconstruct_policy: Option quantized_data: None enum ReconstructPolicy { None, // reads return zeros Delta { base_id: TensorId, delta: ... }, // reconstruct as base + delta Factor { source_id: TensorId, ... }, // reconstruct via transformation } ``` **Read semantics**: | `reconstruct_policy` | Behavior | |----------------------|----------| | `None` | Return a zero-filled tensor of the recorded shape. Fast-fail mode returns `Err(TierZeroNoPolicy)` instead. | | `Delta` | Load base tensor, apply stored delta. May trigger recursive decompression if base is also tiered. | | `Factor` | Load source tensor, apply stored transformation (scale, permutation, projection). | **Transition to Tier 0**: A tensor is eligible for Tier 0 when its tier score drops below `absent_min_score` (default: 1) and it has not been accessed for longer than `absent_age_threshold` (default: 24 hours). The transition is irreversible without external data: once quantized data is discarded, only the reconstruction policy (if any) can recover approximate values. ### 3.6 Bit Packing Module: `codec_bits` The core packing and unpacking functions shared by all sub-byte formats. ```rust /// Errors from bit codec operations. #[derive(Debug, Clone, PartialEq, Eq)] pub enum CodecErr { /// Output buffer too small. Contains the required size in bytes. OutputTooSmall { required: usize }, /// Input buffer too small for the declared number of values. InputTooSmall { required: usize }, /// Bit width must be in [1, 8]. InvalidBitWidth { bits: u8 }, } /// Pack `values.len()` signed codes into `out`, using `bits` bits per code. /// /// Each value in `values` is treated as a signed integer in `[-(2^(bits-1)-1), 2^(bits-1)-1]`. /// It is biased to unsigned before packing: `u = v + (2^(bits-1) - 1)`. /// /// Returns the number of bytes written to `out`. /// /// # Errors /// - `CodecErr::OutputTooSmall` if `out` cannot hold the packed data. /// - `CodecErr::InvalidBitWidth` if `bits` is 0 or greater than 8. pub fn pack_bits(values: &[i8], bits: u8, out: &mut [u8]) -> Result { if bits == 0 || bits > 8 { return Err(CodecErr::InvalidBitWidth { bits }); } let total_bits = values.len() as u64 * bits as u64; let required = ((total_bits + 7) / 8) as usize; if out.len() < required { return Err(CodecErr::OutputTooSmall { required }); } let qmax = (1i8 << (bits - 1)) - 1; // bias offset let mask: u64 = (1u64 << bits) - 1; let mut acc: u64 = 0; let mut acc_bits: u32 = 0; let mut pos: usize = 0; for &v in values { let u = (v as i16 + qmax as i16) as u64 & mask; acc |= u << acc_bits; acc_bits += bits as u32; while acc_bits >= 8 { out[pos] = (acc & 0xFF) as u8; pos += 1; acc >>= 8; acc_bits -= 8; } } // Flush remaining bits if acc_bits > 0 { out[pos] = (acc & 0xFF) as u8; pos += 1; } Ok(pos) } /// Unpack codes from `inp` into `out`, reading `bits` bits per code. /// /// Reads exactly `out.len()` values. Each unsigned code is unbiased back to signed: /// `v = u - (2^(bits-1) - 1)`. /// /// Returns the number of bytes consumed from `inp`. /// /// # Errors /// - `CodecErr::InputTooSmall` if `inp` does not contain enough data. /// - `CodecErr::InvalidBitWidth` if `bits` is 0 or greater than 8. pub fn unpack_bits(inp: &[u8], bits: u8, out: &mut [i8]) -> Result { if bits == 0 || bits > 8 { return Err(CodecErr::InvalidBitWidth { bits }); } let total_bits = out.len() as u64 * bits as u64; let required = ((total_bits + 7) / 8) as usize; if inp.len() < required { return Err(CodecErr::InputTooSmall { required }); } let qmax = (1i8 << (bits - 1)) - 1; let mask: u64 = (1u64 << bits) - 1; let mut acc: u64 = 0; let mut acc_bits: u32 = 0; let mut byte_pos: usize = 0; let mut val_pos: usize = 0; while val_pos < out.len() { while acc_bits < bits as u32 { acc |= (inp[byte_pos] as u64) << acc_bits; acc_bits += 8; byte_pos += 1; } let u = (acc & mask) as i16; out[val_pos] = (u - qmax as i16) as i8; acc >>= bits; acc_bits -= bits as u32; val_pos += 1; } Ok(required) } ``` **Properties**: - No heap allocations. Callers provide both input and output slices. - Single bit writer / bit reader using a 64-bit accumulator. - Deterministic little-endian byte order. - The `pack_bits` / `unpack_bits` pair is its own inverse: `unpack(pack(v)) == v` for all valid inputs. ### 3.7 Quant Module Functions ```rust /// Block-level quantization configuration. pub struct QuantConfig { pub block_size: usize, // elements per quantization block (default: 64) pub two_level_threshold: f32, // max/median ratio to trigger two-level (default: 5.0) } /// Quantized block result. pub struct QuantizedBlock { pub scale: f32, pub secondary_scale: Option, // only for two-level 3-bit pub flags: Option>, // 1-bit-per-value flags for two-level pub codes: Vec, // signed quantized codes pub bits: u8, } /// Symmetric 8-bit quantization (Tier 1 - Hot). /// /// Quantizes each block of `block_size` values independently. /// scale = max_abs(block) / 127 /// q[i] = clamp(round(x[i] / scale), -127, 127) pub fn quantize_s8( values: &[f32], config: &QuantConfig, ) -> Vec; /// Symmetric N-bit quantization (Tier 2/3 - Warm/Cold). /// /// `bits` must be one of: 7, 5, 3. /// qmax = 2^(bits-1) - 1 /// scale = max_abs(block) / qmax /// q[i] = clamp(round(x[i] / scale), -qmax, qmax) /// /// For bits=3 and config.two_level_threshold exceeded: uses two-level scale. pub fn quantize_bits( values: &[f32], bits: u8, config: &QuantConfig, ) -> Vec; /// Dequantize a block back to f32 values. /// /// For standard mode: x'[i] = codes[i] as f32 * scale /// For two-level mode: x'[i] = codes[i] as f32 * (if flags[i] then secondary_scale else scale) pub fn dequantize(block: &QuantizedBlock) -> Vec; /// Compute the maximum absolute value across a slice. /// /// On native targets with `target_feature = "avx2"` or `target_feature = "neon"`: /// uses SIMD intrinsics for 4-8x throughput. /// On WASM with `target_feature = "simd128"` (optional): /// uses wasm_simd128 intrinsics. /// Fallback: portable scalar loop. #[inline] pub fn max_abs(values: &[f32]) -> f32; ``` **SIMD implementation sketch for `max_abs`** (AVX2): ```rust #[cfg(target_arch = "x86_64")] #[target_feature(enable = "avx2")] unsafe fn max_abs_avx2(values: &[f32]) -> f32 { use std::arch::x86_64::*; let sign_mask = _mm256_set1_ps(f32::from_bits(0x7FFF_FFFF)); // abs mask let mut vmax = _mm256_setzero_ps(); let chunks = values.len() / 8; for i in 0..chunks { let v = _mm256_loadu_ps(values.as_ptr().add(i * 8)); let abs_v = _mm256_and_ps(v, sign_mask); vmax = _mm256_max_ps(vmax, abs_v); } // Horizontal max reduction let hi128 = _mm256_extractf128_ps(vmax, 1); let lo128 = _mm256_castps256_ps128(vmax); let max128 = _mm_max_ps(hi128, lo128); let shuf = _mm_movehdup_ps(max128); let max64 = _mm_max_ps(max128, shuf); let shuf2 = _mm_movehl_ps(max64, max64); let max32 = _mm_max_ss(max64, shuf2); let mut result = _mm_cvtss_f32(max32); // Handle remainder for i in (chunks * 8)..values.len() { result = result.max(values[i].abs()); } result } ``` **WASM portable fallback**: ```rust #[cfg(not(any(target_arch = "x86_64", target_arch = "aarch64")))] pub fn max_abs(values: &[f32]) -> f32 { let mut m: f32 = 0.0; for &v in values { let a = v.abs(); if a > m { m = a; } } m } ``` When WASM SIMD is enabled via `target_feature = "simd128"`, a vectorized path processes 4 f32 values per iteration using `v128` types. This is optional and gated behind a cargo feature flag `wasm-simd`. ### 3.8 Error Bound Analysis For symmetric quantization with bit width `B`, block scale `s`, and `qmax = 2^(B-1) - 1`: ``` quantization_step = s / qmax max_element_error = quantization_step / 2 (from rounding) max_relative_error = 1 / (2 * qmax) (per element, worst case) rms_error = quantization_step / sqrt(12) (uniform quantization noise) ``` **Per-tier error bounds**: | Tier | Bits | qmax | Max Rel. Error | RMS Rel. Error | Max Abs. Error (scale=1.0) | |------|------|------|---------------|----------------|---------------------------| | Hot (8-bit) | 8 | 127 | 0.394% | 0.228% | 0.00394 | | Warm (7-bit) | 7 | 63 | 0.794% | 0.458% | 0.00794 | | Warm-agg (5-bit) | 5 | 15 | 3.333% | 1.925% | 0.03333 | | Cold (3-bit, std) | 3 | 3 | 16.667% | 9.623% | 0.16667 | | Cold (3-bit, 2-level) | 3 | 3 | 16.667% per scale | 9.623% | Reduced for bulk values | **Two-level scale improvement for 3-bit**: When 95% of values fall within `primary_max` and outliers use `secondary_scale`: | Component | Fraction | Scale | Effective Max Error | |-----------|----------|-------|-------------------| | Bulk values (95%) | 0.95 | primary_scale (smaller) | 16.7% of primary_max | | Outlier values (5%) | 0.05 | secondary_scale (larger) | 16.7% of secondary_max | The bulk values achieve much lower absolute error because `primary_scale` is typically 3-10x smaller than the single-scale `scale`. The outliers retain the same relative error but are fewer in number. **Drift compounding**: When drift tolerance is `d` (e.g., 10%), and a frame is quantized with scales from an earlier frame, the effective max relative error becomes `(1 + d) / (2 * qmax)`. For 8-bit with 10% drift: `1.1 / 254 = 0.433%`. **Cumulative error table with drift**: | Tier | Bits | No Drift | 10% Drift | 20% Drift | |------|------|----------|-----------|-----------| | Hot | 8 | 0.394% | 0.433% | 0.472% | | Warm | 7 | 0.794% | 0.873% | 0.952% | | Warm-agg | 5 | 3.333% | 3.667% | 4.000% | | Cold | 3 | 16.667% | 18.333% | 20.000% | ### 3.9 Complete Quantizer and Packer Traits ```rust /// Trait for quantization formats that can encode and decode tensor blocks. pub trait TensorQuantizer { /// The bit width of this quantizer. fn bit_width(&self) -> u8; /// Quantize a block of f32 values into signed codes and scale(s). fn quantize_block( &self, values: &[f32], config: &QuantConfig, ) -> QuantizedBlock; /// Dequantize a block back to f32 values. fn dequantize_block( &self, block: &QuantizedBlock, out: &mut [f32], ) -> Result<(), CodecErr>; /// Returns the packed byte size for `num_values` at this bit width, /// excluding scale storage. fn packed_data_size(&self, num_values: usize) -> usize { (num_values * self.bit_width() as usize + 7) / 8 } /// Returns total block storage size including scale(s) and flags. fn block_storage_size(&self, block_size: usize) -> usize; } /// Trait for bit-level packing codecs. pub trait BitCodec { /// Pack signed codes into a byte buffer. fn pack( &self, codes: &[i8], bits: u8, out: &mut [u8], ) -> Result; /// Unpack codes from a byte buffer. fn unpack( &self, data: &[u8], bits: u8, out: &mut [i8], ) -> Result; } /// Standard implementation using the accumulator-based codec_bits functions. pub struct StandardBitCodec; impl BitCodec for StandardBitCodec { fn pack( &self, codes: &[i8], bits: u8, out: &mut [u8], ) -> Result { pack_bits(codes, bits, out) } fn unpack( &self, data: &[u8], bits: u8, out: &mut [i8], ) -> Result { unpack_bits(data, bits, out) } } ``` ### 3.10 Block Storage Summary Diagram ``` TIER 1 (8-bit): +--------+-------+-------+-------+-----+-------+ | scale | q[0] | q[1] | q[2] | ... | q[63] | | f32 LE | i8 | i8 | i8 | | i8 | +--------+-------+-------+-------+-----+-------+ 4 bytes 1 1 1 1 = 68 bytes / block TIER 2a (7-bit): +--------+--------------------------------------------+ | scale | packed 7-bit codes (56 bytes for 64 vals) | | f32 LE | bitstream, little-endian accumulator | +--------+--------------------------------------------+ 4 bytes ceil(64*7/8) = 56 bytes = 60 bytes / block TIER 2b (5-bit): +--------+--------------------------------------------+ | scale | packed 5-bit codes (40 bytes for 64 vals) | | f32 LE | bitstream, little-endian accumulator | +--------+--------------------------------------------+ 4 bytes ceil(64*5/8) = 40 bytes = 44 bytes / block TIER 3 standard (3-bit): +--------+--------------------------------------------+ | scale | packed 3-bit codes (24 bytes for 64 vals) | | f32 LE | bitstream, little-endian accumulator | +--------+--------------------------------------------+ 4 bytes ceil(64*3/8) = 24 bytes = 28 bytes / block TIER 3 two-level (3-bit): +--------+--------+----------+-------------------------------+ | pscale | sscale | flags | packed 3-bit codes | | f32 LE | f32 LE | ceil(N/8)| bitstream | +--------+--------+----------+-------------------------------+ 4 4 8 bytes 24 bytes = 40 bytes / block TIER 0 (absent): +--------------------------------------+ | TensorMeta + ReconstructPolicy only | | NO quantized data | +--------------------------------------+ variable (typically 32-128 bytes metadata) ``` --- ## 4. Alternatives Considered ### 4.1 4-Bit as the Warm Tier 4-bit quantization (qmax = 7, 8.00x compression) is the most widely studied format (GPTQ, AWQ). We considered using 4-bit instead of 7-bit for the warm tier. **Rejected** because: (a) the jump from 8-bit to 4-bit is too large for tensors that were recently hot, causing unnecessary quality loss; (b) 7-bit provides a gentler step-down; (c) 5-bit is available as an intermediate when memory pressure increases. ### 4.2 Uniform 4-Bit Across All Non-Hot Tiers A simpler design with only two quantization levels (8-bit hot, 4-bit everything else). **Rejected** because: (a) cold tensors waste 1 extra bit per value when 3-bit suffices; (b) no path to aggressive compression under memory pressure; (c) loses the granularity that enables smooth quality degradation. ### 4.3 Asymmetric Quantization for 3-Bit Using asymmetric quantization (with zero-point) for 3-bit to better utilize the `[0, 7]` unsigned range when distributions are not centered. **Rejected** because: (a) adds 4 bytes of zero-point storage per block; (b) requires an additional subtraction in the dequantize path; (c) the two-level scale approach handles asymmetric distributions more effectively by splitting the scale rather than shifting the range. ### 4.4 Lookup Table (Codebook) Quantization for Cold Using a small codebook (e.g., 8 centroids) instead of uniform 3-bit levels. **Rejected** because: (a) requires a per-block or per-tensor codebook training step that is expensive for streaming data; (b) codebook storage overhead is comparable to scale storage but with higher decode complexity; (c) uniform quantization is simpler to implement and reason about. ### 4.5 No Two-Level Scale (Simpler 3-Bit) Omitting the two-level scale option entirely. **Considered but rejected** because agent embedding tensors frequently exhibit heavy-tailed distributions where a few dimensions carry disproportionate magnitude. Without two-level scale, these outliers cause the single scale to be too large, wasting most of the 3-bit range on the bulk of near-zero values. --- ## 5. Acceptance Criteria ### 5.1 Format Correctness - [ ] `pack_bits` followed by `unpack_bits` is a lossless round-trip for all bit widths (3, 5, 7, 8) and all valid signed input ranges. - [ ] `quantize_s8` followed by `dequantize` produces values within the theoretical error bound (`scale / 254`) of the originals. - [ ] `quantize_bits(7, ...)` followed by `dequantize` produces values within `scale / 126` of the originals. - [ ] `quantize_bits(5, ...)` followed by `dequantize` produces values within `scale / 30` of the originals. - [ ] `quantize_bits(3, ...)` followed by `dequantize` produces values within `scale / 6` of the originals (standard mode). - [ ] Two-level 3-bit mode activates when `max/median > two_level_threshold`. - [ ] Tier 0 reads return zeros when `reconstruct_policy` is `None`. - [ ] Tier 0 reads invoke reconstruction when a policy exists. ### 5.2 Performance - [ ] `pack_bits` throughput >= 2 GB/s on native (AVX2-capable hardware). - [ ] `unpack_bits` throughput >= 2 GB/s on native. - [ ] `max_abs` with SIMD is >= 3x faster than the scalar fallback on 512+ element blocks. - [ ] WASM `pack_bits` / `unpack_bits` throughput >= 500 MB/s (without SIMD). - [ ] No heap allocations in `pack_bits`, `unpack_bits`, or `max_abs`. ### 5.3 Storage Efficiency - [ ] 8-bit block storage: exactly `4 + block_size` bytes. - [ ] 7-bit block storage: exactly `4 + ceil(block_size * 7 / 8)` bytes. - [ ] 5-bit block storage: exactly `4 + ceil(block_size * 5 / 8)` bytes. - [ ] 3-bit block storage (standard): exactly `4 + ceil(block_size * 3 / 8)` bytes. - [ ] 3-bit block storage (two-level): exactly `8 + ceil(block_size / 8) + ceil(block_size * 3 / 8)` bytes. - [ ] No padding bits between consecutive blocks in a segment. ### 5.4 Dynamic Tier 2 Downgrade - [ ] When aggregate warm storage exceeds `warm_byte_cap`, the least recently accessed warm tensors are re-encoded from 7-bit to 5-bit. - [ ] The downgrade is reversible: if warm storage drops below `warm_byte_cap * 0.8` (hysteresis), tensors can be re-promoted to 7-bit on next access. --- ## 6. Risks and Mitigations | Risk | Severity | Likelihood | Mitigation | |------|----------|------------|------------| | 3-bit two-level scale adds format complexity without sufficient accuracy gain for most distributions | Medium | Medium | Gate behind a cargo feature `two-level-cold`; default to standard 3-bit. Benchmark on real agent embeddings before enabling by default. | | Dynamic 7-bit to 5-bit downgrade causes thrashing when warm set oscillates near the byte cap | Medium | Medium | Implement hysteresis (20% band). Only downgrade when above cap; only upgrade when below 80% of cap. Rate-limit downgrades to at most once per minute. | | `pack_bits` accumulator overflow for large inputs | Low | Low | The 64-bit accumulator can hold up to 56 bits of pending data (7 bytes). Since we flush at 8 bits, the maximum pending bits is `bits - 1 = 7`, well within the 64-bit range. No overflow possible. | | Tier 0 reconstruction from Delta/Factor introduces unbounded latency | Medium | Low | Set a maximum reconstruction depth (default: 3). If the base tensor is also Tier 0, fail with `ReconstructionDepthExceeded` rather than recursing indefinitely. | | WASM scalar `max_abs` is a bottleneck for large tensors | Low | High | Expected. The WASM SIMD feature flag provides 3-4x improvement. For non-SIMD targets, `max_abs` cost is small relative to the full quantize pipeline. | | Block size mismatch between encoder and decoder | High | Low | Block size is stored in the segment header (ADR-017 format). Decoder reads it from the header rather than assuming a default. | --- ## 7. References 1. ADR-017: Temporal Tensor Compression with Tiered Quantization. RuVector Architecture Team, 2026. 2. ADR-018: Block-Based Storage Engine for Temporal Tensor Segments (forthcoming). 3. Frantar, E., et al. "GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers." ICLR 2023. 4. Lin, J., et al. "AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration." MLSys 2024. 5. Kim, S., et al. "SqueezeLLM: Dense-and-Sparse Quantization." ICML 2024. 6. Liu, Z., et al. "KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache." ICML 2024. 7. Pelkonen, T., et al. "Gorilla: A Fast, Scalable, In-Memory Time Series Database." VLDB 2015. 8. IEEE 754-2019. "IEEE Standard for Floating-Point Arithmetic." 9. Lemire, D. and Boytsov, L. "Decoding billions of integers in milliseconds through vectorized bit packing." Software: Practice and Experience, 2015. 10. WebAssembly SIMD Proposal. https://github.com/WebAssembly/simd. Finalized 2023.