Squashed 'vendor/ruvector/' content from commit b64c2172

git-subtree-dir: vendor/ruvector
git-subtree-split: b64c21726f2bb37286d9ee36a7869fef60cc6900
2026-02-28 14:39:40 -05:00
commit d803bfe2b1
7854 changed files with 3522914 additions and 0 deletions
--- a/docs/adr/temporal-tensor-store/ADR-019-tiered-quantization-formats.md
+++ b/docs/adr/temporal-tensor-store/ADR-019-tiered-quantization-formats.md
@@ -0,0 +1,880 @@
+# ADR-019: Tiered Quantization Formats for Temporal Tensor Store
+
+**Status**: Proposed
+**Date**: 2026-02-08
+**Parent**: ADR-017 Temporal Tensor Compression, ADR-018 Block-Based Storage Engine
+**Author**: System Architecture Team
+
+**Note**: Tiered quantization formats are now implemented in the rvf-quant crate as part of ADR-029 (RVF). See the RVF temperature-tiering specification (docs/research/rvf/spec/03-temperature-tiering.md).
+
+## Version History
+
+| Version | Date | Author | Changes |
+|---------|------|--------|---------|
+| 0.1 | 2026-02-08 | Architecture Team | Initial proposal |
+
+---
+
+## Abstract
+
+This ADR defines the concrete quantization formats, bit-packing layouts, and codec
+interfaces for the five tiers of tensor storage established in ADR-017. Where ADR-017
+introduced the concept of access-frequency-driven quantization and temporal scale
+reuse, this document specifies the exact byte-level formats for 8-bit (Tier 1 / Hot),
+7-bit and 5-bit (Tier 2 / Warm), 3-bit (Tier 3 / Cold), and Compression-to-Zero
+(Tier 0 / Absent). It also resolves two open design questions from ADR-017: whether
+5-bit quantization is permitted within the warm tier, and how Tier 0 reads behave
+when no reconstruction policy exists.
+
+The `codec_bits` module provides a single allocation-free bit packer/unpacker that
+all sub-byte formats share. The `quant` module provides per-format quantize and
+dequantize functions, with SIMD-accelerated `max_abs` on native targets and a
+portable fallback for WASM. Rust trait interfaces are defined so that new bit widths
+can be added without modifying the core codec.
+
+---
+
+## 1. Context and Motivation
+
+### 1.1 Gap in ADR-017
+
+ADR-017 established the tiered compression architecture and segment binary format
+but left the per-tier quantization details at the algorithmic level. Implementers
+need exact byte layouts to write interoperable encoders and decoders, particularly
+for the sub-byte formats (7-bit, 5-bit, 3-bit) where values do not align on byte
+boundaries.
+
+### 1.2 Sub-Byte Packing Complexity
+
+Standard 8-bit quantization maps trivially to `[u8]` storage. Sub-byte formats
+require a bit-packing codec that can write and read arbitrary-width codes into a
+byte stream without wasting bits. The codec must:
+
+- Handle bit widths 3, 5, and 7 (with 8 as a degenerate identity case).
+- Operate without heap allocations (caller provides output slice).
+- Be deterministic and platform-independent (little-endian byte order).
+- Support WASM targets where SIMD is optional.
+
+### 1.3 Outlier Handling in 3-Bit
+
+At 3 bits per value, the quantization range is `[-3, +3]` (qmax = 3). Large
+outliers in the tensor distribution can cause severe clamping. ADR-017 noted this
+risk but did not specify a mitigation. This ADR introduces a two-level scale
+option for Tier 3 that uses a 1-bit flag per value to select between a primary
+scale (covering the majority of values) and a secondary scale (covering outliers),
+while keeping the packed format compact.
+
+### 1.4 Tier 0 Semantics
+
+ADR-017 listed Compression-to-Zero as a future possibility. This ADR formalizes
+it: Tier 0 stores no quantized data at all. Only metadata and an optional
+`reconstruct_policy` survive. This enables aggressive memory reclamation for
+tensors that are no longer accessed but may be reconstructable from other sources
+(deltas, factors, or recomputation).
+
+### 1.5 Design Questions Resolved
+
+| Question | Resolution |
+|----------|------------|
+| Allow 5-bit within warm tier? | Yes. Dynamic downgrade from 7-bit to 5-bit when warm set exceeds a configurable byte cap (`warm_byte_cap`). |
+| Tier 0 read semantics? | Return zeros by default. If a `reconstruct_policy` (Delta or Factor) exists, reconstruct from stored representation. |
+
+---
+
+## 2. Decision
+
+We adopt the following five-tier quantization format hierarchy, each with a
+well-defined byte layout, packing strategy, and error budget:
+
+| Tier | Name | Bits | Compression vs f32 (codes only) | Use Case |
+|------|------|------|-------------------|----------|
+| 1 | Hot | 8 | 4.00x | Active tensors, highest fidelity |
+| 2a | Warm | 7 | 4.57x | Default warm, near-lossless |
+| 2b | Warm-aggressive | 5 | 6.40x | Warm set exceeds `warm_byte_cap` |
+| 3 | Cold | 3 | 10.67x | Archived tensors, bounded error |
+| 0 | Absent | 0 | Infinite | No data stored; metadata only |
+
+All sub-byte formats share the `codec_bits` packer. All quantization formats use
+symmetric per-block quantization with `scale = max_abs / qmax` stored as f32 per
+block. The choice of f32 (rather than f16 as in ADR-017 segment headers) is
+deliberate at this layer: the segment encoder may convert to f16 for storage, but
+the quantizer operates in f32 for precision during the quantize/dequantize path.
+
+---
+
+## 3. Detailed Design
+
+### 3.1 Tier 1: 8-Bit Quantization (Hot)
+
+**Algorithm**: Symmetric per-block quantization.
+
+```
+Given: block of N f32 values, block_size typically 64 or 128
+  scale    = max_abs(values) / 127
+  q[i]     = round(values[i] / scale)
+  q[i]     = clamp(q[i], -127, +127)      // symmetric range; i8 value -128 is unused
+  store:     q as [i8; N] + scale as f32
+```
+
+**Storage layout** (one block, block_size = 8 for illustration):
+
+```
+Byte offset:  0    1    2    3    4    5    6    7    8    9   10   11
+             [  scale (f32, LE)  ] [q0] [q1] [q2] [q3] [q4] [q5] [q6] [q7]
+              ~~~~~~~~~~~~~~~~~~~  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+              4 bytes              8 bytes (1 byte per i8 value)
+
+Total per block: 4 + block_size bytes
+```
+
+**Effective compression** (block_size = 64):
+
+```
+raw     = 64 * 4           = 256 bytes
+quant   = 4 + 64 * 1       = 68 bytes
+ratio   = 256 / 68         = 3.76x (single block)
+```
+
+With temporal amortization (100 frames sharing one scale): `256*100 / (4 + 64*100)` ≈ 4.00x.
+
+**Dequantize**:
+
+```
+values[i] = q[i] as f32 * scale
+```
+
+**Error bound**: `max_error = scale / 2 = max_abs / (2 * 127)`, because `scale` is itself the quantization step. See Section 3.8 for the full analysis.
+
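A minimal Rust sketch of the Tier 1 quantize/dequantize path described above, for a single block. The names `quantize_block_s8` and `dequantize_block_s8` are illustrative and not part of the `quant` module, the output is heap-allocated for brevity, and the all-zero-block guard is an assumption: the algorithm above does not say what happens when `max_abs` is 0.

```rust
/// Quantize one block to symmetric 8-bit codes. Returns (scale, codes).
/// Assumption: an all-zero block yields scale = 0.0 and all-zero codes,
/// which avoids dividing by zero.
pub fn quantize_block_s8(values: &[f32]) -> (f32, Vec<i8>) {
    let max_abs = values.iter().fold(0.0f32, |m, v| m.max(v.abs()));
    if max_abs == 0.0 {
        return (0.0, vec![0; values.len()]);
    }
    let scale = max_abs / 127.0;
    let codes = values
        .iter()
        // Round to nearest, then clamp to the symmetric range [-127, 127].
        .map(|&v| (v / scale).round().clamp(-127.0, 127.0) as i8)
        .collect();
    (scale, codes)
}

/// Reconstruct approximate f32 values: x'[i] = q[i] * scale.
pub fn dequantize_block_s8(scale: f32, codes: &[i8]) -> Vec<f32> {
    codes.iter().map(|&q| q as f32 * scale).collect()
}
```

Under these assumptions every reconstructed value lies within `scale / 2` of its original, which is the bound stated above.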
+### 3.2 Tier 2a: 7-Bit Quantization (Warm)
+
+**Algorithm**: Symmetric per-block, 7-bit codes packed into a bitstream.
+
+```
+Given: block of N f32 values
+  scale    = max_abs(values) / 63          // qmax = 2^(7-1) - 1 = 63
+  q[i]     = round(values[i] / scale)
+  q[i]     = clamp(q[i], -63, +63)
+  u[i]     = q[i] + 63                    // bias to unsigned [0, 126], fits 7 bits
+  pack u[i] values using codec_bits at width=7
+```
+
+**Bit-packing layout** (8 values packed into 7 bytes):
+
+```
+Values:     u0       u1       u2       u3       u4       u5       u6       u7
+Bits:    [6..0]   [6..0]   [6..0]   [6..0]   [6..0]   [6..0]   [6..0]   [6..0]
+         7 bits   7 bits   7 bits   7 bits   7 bits   7 bits   7 bits   7 bits
+
+Packed into 7 bytes (56 bits = 8 * 7 bits):
+
+Byte 0:  [u0[6:0]              | u1[0]  ]   = u0(7) + u1(1) = 8 bits
+         |<--- 7 bits --->|<1>|
+
+Byte 1:  [u1[6:1]                | u2[1:0]]  = u1(6) + u2(2) = 8 bits
+         |<--- 6 bits --->|<-2->|
+
+Byte 2:  [u2[6:2]              | u3[2:0] ]   = u2(5) + u3(3) = 8 bits
+         |<-- 5 bits -->|<--3-->|
+
+Byte 3:  [u3[6:3]            | u4[3:0]   ]   = u3(4) + u4(4) = 8 bits
+         |<- 4 bits ->|<--4--->|
+
+Byte 4:  [u4[6:4]          | u5[4:0]     ]   = u4(3) + u5(5) = 8 bits
+         |<-3->|<---- 5 bits ---->|
+
+Byte 5:  [u5[6:5]        | u6[5:0]       ]   = u5(2) + u6(6) = 8 bits
+         |<2>|<----- 6 bits ------>|
+
+Byte 6:  [u6[6]        | u7[6:0]         ]   = u6(1) + u7(7) = 8 bits
+         |1|<------- 7 bits ------->|
+
+Total:   7 bytes for 8 values = 0.875 bytes/value
+```
+
+**Storage per block** (block_size = 64):
+
+```
+scale:   4 bytes (f32)
+data:    ceil(64 * 7 / 8) = 56 bytes
+total:   60 bytes
+ratio:   256 / 60 = 4.27x
+```
+
+### 3.3 Tier 2b: 5-Bit Quantization (Warm Aggressive)
+
+**Algorithm**: Symmetric per-block, 5-bit codes.
+
+```
+Given: block of N f32 values
+  scale    = max_abs(values) / 15          // qmax = 2^(5-1) - 1 = 15
+  q[i]     = round(values[i] / scale)
+  q[i]     = clamp(q[i], -15, +15)
+  u[i]     = q[i] + 15                    // bias to unsigned [0, 30], fits 5 bits
+  pack u[i] values using codec_bits at width=5
+```
+
+**Activation policy**: 5-bit is used instead of 7-bit when the total warm set
+size exceeds `warm_byte_cap` (default: 64 MiB). The tier policy monitors
+aggregate warm storage and downgrades from 7-bit to 5-bit for the least recently
+accessed warm tensors until the cap is satisfied.
+
+**Bit-packing layout** (8 values packed into 5 bytes):
+
+```
+Values:     u0       u1       u2       u3       u4       u5       u6       u7
+Bits:    [4..0]   [4..0]   [4..0]   [4..0]   [4..0]   [4..0]   [4..0]   [4..0]
+         5 bits   5 bits   5 bits   5 bits   5 bits   5 bits   5 bits   5 bits
+
+Packed into 5 bytes (40 bits = 8 * 5 bits):
+
+Byte 0:  [u0[4:0]        | u1[2:0]   ]   = u0(5) + u1(3) = 8 bits
+         |<-- 5 bits -->|<--3-->|
+
+Byte 1:  [u1[4:3]    | u2[4:0]   | u3[0]]  = u1(2) + u2(5) + u3(1) = 8 bits
+         |<2>|<-- 5 bits -->|<1>|
+
+Byte 2:  [u3[4:1]          | u4[3:0]  ]   = u3(4) + u4(4) = 8 bits
+         |<-- 4 bits -->|<--4-->|
+
+Byte 3:  [u4[4]    | u5[4:0]   | u6[1:0]] = u4(1) + u5(5) + u6(2) = 8 bits
+         |1|<-- 5 bits -->|<-2->|
+
+Byte 4:  [u6[4:2]        | u7[4:0]    ]   = u6(3) + u7(5) = 8 bits
+         |<-3->|<--- 5 bits --->|
+
+Total:   5 bytes for 8 values = 0.625 bytes/value
+```
+
+**Storage per block** (block_size = 64):
+
+```
+scale:   4 bytes (f32)
+data:    ceil(64 * 5 / 8) = 40 bytes
+total:   44 bytes
+ratio:   256 / 44 = 5.82x
+```
+
+### 3.4 Tier 3: 3-Bit Quantization (Cold)
+
+**Algorithm**: Symmetric per-block, 3-bit codes with optional two-level scale.
+
+#### Standard Mode
+
+```
+Given: block of N f32 values
+  scale    = max_abs(values) / 3           // qmax = 2^(3-1) - 1 = 3
+  q[i]     = round(values[i] / scale)
+  q[i]     = clamp(q[i], -3, +3)
+  u[i]     = q[i] + 3                     // bias to unsigned [0, 6], fits 3 bits
+  pack u[i] values using codec_bits at width=3
+```
+
+#### Two-Level Scale Mode (Outlier Handling)
+
+When the value distribution has outliers (values significantly larger than the
+bulk of the distribution), a single scale wastes most of the 3-bit range on the
+long tail. The two-level scale splits the range:
+
+```
+Given: block of N f32 values, outlier_fraction (default: 0.05)
+  sorted_abs     = sort(|values|, descending)
+  outlier_count  = ceil(N * outlier_fraction)
+  primary_max    = sorted_abs[outlier_count]      // excludes top 5%
+  secondary_max  = sorted_abs[0]                  // full range
+
+  primary_scale   = primary_max / 3               // covers bulk values
+  secondary_scale = secondary_max / 3             // covers outliers
+
+  For each value[i]:
+    if |value[i]| > primary_max:
+      flag[i] = 1                                  // use secondary scale
+      q[i]    = round(value[i] / secondary_scale)
+    else:
+      flag[i] = 0                                  // use primary scale
+      q[i]    = round(value[i] / primary_scale)
+    q[i] = clamp(q[i], -3, +3)
+    u[i] = q[i] + 3
+
+  store: primary_scale (f32) + secondary_scale (f32) + flag bits + packed codes
+```
+
+**Bit-packing layout** (8 values packed into 3 bytes):
+
+```
+Values:     u0       u1       u2       u3       u4       u5       u6       u7
+Bits:    [2..0]   [2..0]   [2..0]   [2..0]   [2..0]   [2..0]   [2..0]   [2..0]
+         3 bits   3 bits   3 bits   3 bits   3 bits   3 bits   3 bits   3 bits
+
+Packed into 3 bytes (24 bits = 8 * 3 bits):
+
+Byte 0:  [u0[2:0]  | u1[2:0]  | u2[1:0] ]   = u0(3) + u1(3) + u2(2) = 8 bits
+         |<-3->|<-3->|<2>|
+
+Byte 1:  [u2[2]  | u3[2:0]  | u4[2:0]  | u5[0]]  = u2(1) + u3(3) + u4(3) + u5(1) = 8 bits
+         |1|<-3->|<-3->|1|
+
+Byte 2:  [u5[2:1]  | u6[2:0]  | u7[2:0]  ]   = u5(2) + u6(3) + u7(3) = 8 bits
+         |<2>|<-3->|<-3->|
+
+Total:   3 bytes for 8 values = 0.375 bytes/value
+```
+
+**Two-level scale storage layout** (one block, block_size = 64):
+
+```
+Byte offset:  0..3                4..7                  8..15          16..39
+             [primary_scale f32] [secondary_scale f32] [flag bytes  ] [packed codes]
+              ~~~~~~~~~~~~~~~~~~  ~~~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~  ~~~~~~~~~~~~~
+              4 bytes             4 bytes               ceil(64/8)=8   ceil(64*3/8)=24
+
+Total per block (two-level): 4 + 4 + 8 + 24 = 40 bytes
+Total per block (standard):  4 + 24          = 28 bytes
+ratio (standard):  256 / 28 = 9.14x
+ratio (two-level): 256 / 40 = 6.40x
+```
+
+The two-level mode trades compression ratio for outlier fidelity. It is selected
+automatically when the ratio `max_abs / median_abs` exceeds a configurable
+threshold (default: 5.0), indicating a heavy-tailed distribution.
+
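The two-level procedure above can be sketched in Rust as follows. This is an illustration under stated assumptions, not the `quant` module's implementation: the `TwoLevelBlock` type, its field names, and the function names are hypothetical, flags are kept as `Vec<bool>` rather than packed bits, and a zero scale maps every value to code 0.

```rust
/// Illustrative container; names are assumptions, not the ADR's types.
pub struct TwoLevelBlock {
    pub primary_scale: f32,
    pub secondary_scale: f32,
    pub flags: Vec<bool>, // true = value uses secondary_scale
    pub codes: Vec<i8>,   // signed codes in [-3, 3]
}

const QMAX: f32 = 3.0;

fn quantize_one(v: f32, scale: f32) -> i8 {
    if scale == 0.0 {
        0 // assumption: degenerate scale maps everything to zero
    } else {
        (v / scale).round().clamp(-QMAX, QMAX) as i8
    }
}

pub fn quantize_two_level_3bit(values: &[f32], outlier_fraction: f32) -> TwoLevelBlock {
    // Sort magnitudes in descending order to find the outlier cutoff.
    let mut sorted_abs: Vec<f32> = values.iter().map(|v| v.abs()).collect();
    sorted_abs.sort_by(|a, b| b.total_cmp(a));
    let outlier_count = (values.len() as f32 * outlier_fraction).ceil() as usize;
    let idx = outlier_count.min(sorted_abs.len().saturating_sub(1));
    let primary_max = sorted_abs.get(idx).copied().unwrap_or(0.0);
    let secondary_max = sorted_abs.first().copied().unwrap_or(0.0);
    let primary_scale = primary_max / QMAX;
    let secondary_scale = secondary_max / QMAX;

    let mut flags = Vec::with_capacity(values.len());
    let mut codes = Vec::with_capacity(values.len());
    for &v in values {
        let is_outlier = v.abs() > primary_max;
        let scale = if is_outlier { secondary_scale } else { primary_scale };
        flags.push(is_outlier);
        codes.push(quantize_one(v, scale));
    }
    TwoLevelBlock { primary_scale, secondary_scale, flags, codes }
}

pub fn dequantize_two_level(block: &TwoLevelBlock) -> Vec<f32> {
    block
        .codes
        .iter()
        .zip(&block.flags)
        .map(|(&q, &is_outlier)| {
            let s = if is_outlier { block.secondary_scale } else { block.primary_scale };
            q as f32 * s
        })
        .collect()
}
```

With 64 values and `outlier_fraction = 0.05`, `outlier_count` is `ceil(3.2) = 4`, so at most the four largest magnitudes are flagged for the secondary scale.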
+### 3.5 Tier 0: Compression to Zero (Absent)
+
+**Algorithm**: No quantized data is stored.
+
+```
+Tier 0 representation:
+  metadata:            TensorMeta (id, shape, dtype, timestamps)
+  reconstruct_policy:  Option<ReconstructPolicy>
+  quantized_data:      None
+
+enum ReconstructPolicy {
+    None,                                    // reads return zeros
+    Delta { base_id: TensorId, delta: ... }, // reconstruct as base + delta
+    Factor { source_id: TensorId, ... },     // reconstruct via transformation
+}
+```
+
+**Read semantics**:
+
+| `reconstruct_policy` | Behavior |
+|----------------------|----------|
+| `None` | Return a zero-filled tensor of the recorded shape. Fast-fail mode returns `Err(TierZeroNoPolicy)` instead. |
+| `Delta` | Load base tensor, apply stored delta. May trigger recursive decompression if base is also tiered. |
+| `Factor` | Load source tensor, apply stored transformation (scale, permutation, projection). |
+
+**Transition to Tier 0**: A tensor is eligible for Tier 0 when its tier score
+drops below `absent_min_score` (default: 1) and it has not been accessed for
+longer than `absent_age_threshold` (default: 24 hours). The transition is
+irreversible without external data: once quantized data is discarded, only the
+reconstruction policy (if any) can recover approximate values.
+
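The read semantics in the table above might look like the following Rust sketch. Everything here is hypothetical scaffolding: the `Stored` enum, the `HashMap` store, the `read` function, the `Vec<f32>` delta payload, and the scalar-only `Factor` are simplifications chosen for illustration (the ADR leaves the Delta and Factor payloads unspecified). The depth limit and the `ReconstructionDepthExceeded` error follow the mitigation listed in Section 6.

```rust
use std::collections::HashMap;

pub type TensorId = u64;

pub enum ReconstructPolicy {
    None,
    Delta { base_id: TensorId, delta: Vec<f32> }, // assumed payload: dense delta
    Factor { source_id: TensorId, scale: f32 },   // assumed payload: scalar factor only
}

pub enum Stored {
    Data(Vec<f32>), // stands in for any tier that still holds data
    Absent { len: usize, policy: ReconstructPolicy }, // Tier 0
}

#[derive(Debug, PartialEq)]
pub enum ReadErr {
    TierZeroNoPolicy,
    ReconstructionDepthExceeded,
    Missing(TensorId),
    ShapeMismatch,
}

pub fn read(
    store: &HashMap<TensorId, Stored>,
    id: TensorId,
    fast_fail: bool,
    depth_left: u32,
) -> Result<Vec<f32>, ReadErr> {
    match store.get(&id).ok_or(ReadErr::Missing(id))? {
        Stored::Data(v) => Ok(v.clone()),
        Stored::Absent { len, policy } => match policy {
            ReconstructPolicy::None if fast_fail => Err(ReadErr::TierZeroNoPolicy),
            ReconstructPolicy::None => Ok(vec![0.0; *len]),
            ReconstructPolicy::Delta { base_id, delta } => {
                if depth_left == 0 {
                    return Err(ReadErr::ReconstructionDepthExceeded);
                }
                // The base may itself be Tier 0, so this can recurse.
                let mut base = read(store, *base_id, fast_fail, depth_left - 1)?;
                if base.len() != delta.len() {
                    return Err(ReadErr::ShapeMismatch);
                }
                for (b, d) in base.iter_mut().zip(delta) {
                    *b += *d;
                }
                Ok(base)
            }
            ReconstructPolicy::Factor { source_id, scale } => {
                if depth_left == 0 {
                    return Err(ReadErr::ReconstructionDepthExceeded);
                }
                let mut src = read(store, *source_id, fast_fail, depth_left - 1)?;
                for x in src.iter_mut() {
                    *x *= *scale;
                }
                Ok(src)
            }
        },
    }
}
```

Passing the remaining depth as an argument keeps the recursion bounded without any shared state; a caller would start with the configured maximum (3 by default, per Section 6).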
+### 3.6 Bit Packing Module: `codec_bits`
+
+The core packing and unpacking functions shared by all sub-byte formats.
+
+```rust
+/// Errors from bit codec operations.
+#[derive(Debug, Clone, PartialEq, Eq)]
+pub enum CodecErr {
+    /// Output buffer too small. Contains the required size in bytes.
+    OutputTooSmall { required: usize },
+    /// Input buffer too small for the declared number of values.
+    InputTooSmall { required: usize },
+    /// Bit width must be in [1, 8].
+    InvalidBitWidth { bits: u8 },
+}
+
+/// Pack `values.len()` signed codes into `out`, using `bits` bits per code.
+///
+/// Each value in `values` is treated as a signed integer in `[-(2^(bits-1)-1), 2^(bits-1)-1]`.
+/// It is biased to unsigned before packing: `u = v + (2^(bits-1) - 1)`.
+///
+/// Returns the number of bytes written to `out`.
+///
+/// # Errors
+/// - `CodecErr::OutputTooSmall` if `out` cannot hold the packed data.
+/// - `CodecErr::InvalidBitWidth` if `bits` is 0 or greater than 8.
+pub fn pack_bits(values: &[i8], bits: u8, out: &mut [u8]) -> Result<usize, CodecErr> {
+    if bits == 0 || bits > 8 {
+        return Err(CodecErr::InvalidBitWidth { bits });
+    }
+    let total_bits = values.len() as u64 * bits as u64;
+    let required = ((total_bits + 7) / 8) as usize;
+    if out.len() < required {
+        return Err(CodecErr::OutputTooSmall { required });
+    }
+
+    let qmax = (1i16 << (bits - 1)) - 1;  // bias offset; i16 because (1i8 << 7) - 1 overflows when bits == 8
+    let mask: u64 = (1u64 << bits) - 1;
+    let mut acc: u64 = 0;
+    let mut acc_bits: u32 = 0;
+    let mut pos: usize = 0;
+
+    for &v in values {
+        let u = (v as i16 + qmax as i16) as u64 & mask;
+        acc |= u << acc_bits;
+        acc_bits += bits as u32;
+        while acc_bits >= 8 {
+            out[pos] = (acc & 0xFF) as u8;
+            pos += 1;
+            acc >>= 8;
+            acc_bits -= 8;
+        }
+    }
+    // Flush remaining bits
+    if acc_bits > 0 {
+        out[pos] = (acc & 0xFF) as u8;
+        pos += 1;
+    }
+    Ok(pos)
+}
+
+/// Unpack codes from `inp` into `out`, reading `bits` bits per code.
+///
+/// Reads exactly `out.len()` values. Each unsigned code is unbiased back to signed:
+/// `v = u - (2^(bits-1) - 1)`.
+///
+/// Returns the number of bytes consumed from `inp`.
+///
+/// # Errors
+/// - `CodecErr::InputTooSmall` if `inp` does not contain enough data.
+/// - `CodecErr::InvalidBitWidth` if `bits` is 0 or greater than 8.
+pub fn unpack_bits(inp: &[u8], bits: u8, out: &mut [i8]) -> Result<usize, CodecErr> {
+    if bits == 0 || bits > 8 {
+        return Err(CodecErr::InvalidBitWidth { bits });
+    }
+    let total_bits = out.len() as u64 * bits as u64;
+    let required = ((total_bits + 7) / 8) as usize;
+    if inp.len() < required {
+        return Err(CodecErr::InputTooSmall { required });
+    }
+
+    let qmax = (1i16 << (bits - 1)) - 1;  // i16 because (1i8 << 7) - 1 overflows when bits == 8
+    let mask: u64 = (1u64 << bits) - 1;
+    let mut acc: u64 = 0;
+    let mut acc_bits: u32 = 0;
+    let mut byte_pos: usize = 0;
+    let mut val_pos: usize = 0;
+
+    while val_pos < out.len() {
+        while acc_bits < bits as u32 {
+            acc |= (inp[byte_pos] as u64) << acc_bits;
+            acc_bits += 8;
+            byte_pos += 1;
+        }
+        let u = (acc & mask) as i16;
+        out[val_pos] = (u - qmax as i16) as i8;
+        acc >>= bits;
+        acc_bits -= bits as u32;
+        val_pos += 1;
+    }
+    Ok(required)
+}
+```
+
+**Properties**:
+
+- No heap allocations. Callers provide both input and output slices.
+- Single bit writer / bit reader using a 64-bit accumulator.
+- Deterministic little-endian byte order.
+- `unpack_bits` inverts `pack_bits`: `unpack(pack(v)) == v` for every code `v` in
+  `[-qmax, qmax]` at the given bit width.
+
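To make the LSB-first layout concrete, here is a compact, self-contained Rust sketch of the same accumulator technique together with a worked 3-bit example. The names `pack_lsb_first` and `unpack_lsb_first` are hypothetical stand-ins, and they return `Option` instead of `CodecErr` to keep the sketch short.

```rust
/// Pack signed codes LSB-first at `bits` bits per code.
/// Returns bytes written, or None on an invalid width or a short buffer.
pub fn pack_lsb_first(codes: &[i8], bits: u32, out: &mut [u8]) -> Option<usize> {
    if bits == 0 || bits > 8 {
        return None;
    }
    let needed = (codes.len() * bits as usize + 7) / 8;
    if out.len() < needed {
        return None;
    }
    let bias = (1i16 << (bits - 1)) - 1; // qmax, computed in i16 to avoid overflow
    let mask = (1u32 << bits) - 1;
    let (mut acc, mut acc_bits, mut pos) = (0u32, 0u32, 0usize);
    for &c in codes {
        let u = ((c as i16 + bias) as u32) & mask; // bias to unsigned
        acc |= u << acc_bits;
        acc_bits += bits;
        while acc_bits >= 8 {
            out[pos] = acc as u8; // emit the low byte
            pos += 1;
            acc >>= 8;
            acc_bits -= 8;
        }
    }
    if acc_bits > 0 {
        out[pos] = acc as u8; // flush the partial final byte
        pos += 1;
    }
    Some(pos)
}

/// Unpack `out.len()` codes; returns bytes consumed, or None on bad input.
pub fn unpack_lsb_first(data: &[u8], bits: u32, out: &mut [i8]) -> Option<usize> {
    if bits == 0 || bits > 8 {
        return None;
    }
    let needed = (out.len() * bits as usize + 7) / 8;
    if data.len() < needed {
        return None;
    }
    let bias = (1i16 << (bits - 1)) - 1;
    let mask = (1u32 << bits) - 1;
    let (mut acc, mut acc_bits, mut pos) = (0u32, 0u32, 0usize);
    for slot in out.iter_mut() {
        while acc_bits < bits {
            acc |= (data[pos] as u32) << acc_bits; // refill one byte at a time
            acc_bits += 8;
            pos += 1;
        }
        *slot = ((acc & mask) as i16 - bias) as i8; // remove the bias
        acc >>= bits;
        acc_bits -= bits;
    }
    Some(needed)
}
```

As a worked example, the eight 3-bit codes `[-3, -2, -1, 0, 1, 2, 3, 0]` bias to `[0, 1, 2, 3, 4, 5, 6, 3]` and pack into the three bytes `0x88 0xC6 0x7A`, matching the byte-by-byte layout in Section 3.4.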
+### 3.7 Quant Module Functions
+
+```rust
+/// Block-level quantization configuration.
+pub struct QuantConfig {
+    pub block_size: usize,          // elements per quantization block (default: 64)
+    pub two_level_threshold: f32,   // max/median ratio to trigger two-level (default: 5.0)
+}
+
+/// Quantized block result.
+pub struct QuantizedBlock {
+    pub scale: f32,
+    pub secondary_scale: Option<f32>,    // only for two-level 3-bit
+    pub flags: Option<Vec<u8>>,          // 1-bit-per-value flags for two-level
+    pub codes: Vec<i8>,                  // signed quantized codes
+    pub bits: u8,
+}
+
+/// Symmetric 8-bit quantization (Tier 1 - Hot).
+///
+/// Quantizes each block of `block_size` values independently.
+/// scale = max_abs(block) / 127
+/// q[i]  = clamp(round(x[i] / scale), -127, 127)
+pub fn quantize_s8(
+    values: &[f32],
+    config: &QuantConfig,
+) -> Vec<QuantizedBlock>;
+
+/// Symmetric N-bit quantization (Tier 2/3 - Warm/Cold).
+///
+/// `bits` must be one of: 7, 5, 3.
+/// qmax  = 2^(bits-1) - 1
+/// scale = max_abs(block) / qmax
+/// q[i]  = clamp(round(x[i] / scale), -qmax, qmax)
+///
+/// For bits=3 and config.two_level_threshold exceeded: uses two-level scale.
+pub fn quantize_bits(
+    values: &[f32],
+    bits: u8,
+    config: &QuantConfig,
+) -> Vec<QuantizedBlock>;
+
+/// Dequantize a block back to f32 values.
+///
+/// For standard mode:   x'[i] = codes[i] as f32 * scale
+/// For two-level mode:  x'[i] = codes[i] as f32 * (if flags[i] then secondary_scale else scale)
+pub fn dequantize(block: &QuantizedBlock) -> Vec<f32>;
+
+/// Compute the maximum absolute value across a slice.
+///
+/// On native targets with `target_feature = "avx2"` or `target_feature = "neon"`:
+///   uses SIMD intrinsics for 4-8x throughput.
+/// On WASM with `target_feature = "simd128"` (optional):
+///   uses the `core::arch::wasm32` SIMD intrinsics.
+/// Fallback: portable scalar loop.
+#[inline]
+pub fn max_abs(values: &[f32]) -> f32;
+```
+
+**SIMD implementation sketch for `max_abs`** (AVX2):
+
+```rust
+#[cfg(target_arch = "x86_64")]
+#[target_feature(enable = "avx2")]
+unsafe fn max_abs_avx2(values: &[f32]) -> f32 {
+    use std::arch::x86_64::*;
+    let abs_mask = _mm256_set1_ps(f32::from_bits(0x7FFF_FFFF)); // clears the sign bit
+    let mut vmax = _mm256_setzero_ps();
+    let chunks = values.len() / 8;
+
+    for i in 0..chunks {
+        let v = _mm256_loadu_ps(values.as_ptr().add(i * 8));
+        let abs_v = _mm256_and_ps(v, abs_mask);
+        vmax = _mm256_max_ps(vmax, abs_v);
+    }
+
+    // Horizontal max reduction
+    let hi128 = _mm256_extractf128_ps(vmax, 1);
+    let lo128 = _mm256_castps256_ps128(vmax);
+    let max128 = _mm_max_ps(hi128, lo128);
+    let shuf = _mm_movehdup_ps(max128);
+    let max64 = _mm_max_ps(max128, shuf);
+    let shuf2 = _mm_movehl_ps(max64, max64);
+    let max32 = _mm_max_ss(max64, shuf2);
+    let mut result = _mm_cvtss_f32(max32);
+
+    // Handle remainder
+    for i in (chunks * 8)..values.len() {
+        result = result.max(values[i].abs());
+    }
+    result
+}
+```
+
+**WASM portable fallback**:
+
+```rust
+#[cfg(not(any(target_arch = "x86_64", target_arch = "aarch64")))]
+pub fn max_abs(values: &[f32]) -> f32 {
+    let mut m: f32 = 0.0;
+    for &v in values {
+        let a = v.abs();
+        if a > m {
+            m = a;
+        }
+    }
+    m
+}
+```
+
+When WASM SIMD is enabled via `target_feature = "simd128"`, a vectorized path
+processes 4 f32 values per iteration using `v128` types. This is optional and
+gated behind a cargo feature flag `wasm-simd`.
+
+### 3.8 Error Bound Analysis
+
+For symmetric quantization with bit width `B`, block maximum `m = max_abs(block)`, and
+`qmax = 2^(B-1) - 1`, the stored scale `scale = m / qmax` is also the quantization step:
+
+```
+quantization_step   = scale = m / qmax
+max_element_error   = quantization_step / 2         (from rounding)
+max_relative_error  = 1 / (2 * qmax)                (relative to m, worst case)
+rms_error           = quantization_step / sqrt(12)   (uniform quantization noise)
+```
+
+**Per-tier error bounds**:
+
+| Tier | Bits | qmax | Max Rel. Error | RMS Rel. Error | Max Abs. Error (max_abs = 1.0) |
+|------|------|------|---------------|----------------|---------------------------|
+| Hot (8-bit) | 8 | 127 | 0.394% | 0.227% | 0.00394 |
+| Warm (7-bit) | 7 | 63 | 0.794% | 0.458% | 0.00794 |
+| Warm-agg (5-bit) | 5 | 15 | 3.333% | 1.925% | 0.03333 |
+| Cold (3-bit, std) | 3 | 3 | 16.667% | 9.623% | 0.16667 |
+| Cold (3-bit, 2-level) | 3 | 3 | 16.667% per scale | 9.623% | Reduced for bulk values |
+
+**Two-level scale improvement for 3-bit**: When 95% of values fall within
+`primary_max` and outliers use `secondary_scale`:
+
+| Component | Fraction | Scale | Effective Max Error |
+|-----------|----------|-------|-------------------|
+| Bulk values (95%) | 0.95 | primary_scale (smaller) | 16.7% of primary_max |
+| Outlier values (5%) | 0.05 | secondary_scale (larger) | 16.7% of secondary_max |
+
+The bulk values achieve much lower absolute error because `primary_scale` is
+typically 3-10x smaller than the single-scale `scale`. The outliers retain the
+same relative error but are fewer in number.
+
+**Drift compounding**: When drift tolerance is `d` (e.g., 10%), and a frame is
+quantized with scales from an earlier frame, the effective max relative error
+becomes `(1 + d) / (2 * qmax)`. For 8-bit with 10% drift: `1.1 / 254 = 0.433%`.
+
+**Cumulative error table with drift**:
+
+| Tier | Bits | No Drift | 10% Drift | 20% Drift |
+|------|------|----------|-----------|-----------|
+| Hot | 8 | 0.394% | 0.433% | 0.472% |
+| Warm | 7 | 0.794% | 0.873% | 0.952% |
+| Warm-agg | 5 | 3.333% | 3.667% | 4.000% |
+| Cold | 3 | 16.667% | 18.333% | 20.000% |
+
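The entries in the error tables above follow directly from `qmax`, so they can be regenerated with a few lines of Rust. The function names here are illustrative, and `bits` is assumed to be between 2 and 8.

```rust
/// qmax = 2^(bits-1) - 1 for a symmetric signed code of `bits` bits.
pub fn qmax(bits: u32) -> f64 {
    ((1u32 << (bits - 1)) - 1) as f64
}

/// Worst-case error relative to the block maximum, with drift tolerance
/// `drift` (0.1 means 10%): (1 + d) / (2 * qmax).
pub fn max_rel_error(bits: u32, drift: f64) -> f64 {
    (1.0 + drift) / (2.0 * qmax(bits))
}

/// RMS error relative to the block maximum under the uniform-noise model:
/// step / sqrt(12) with step = 1 / qmax.
pub fn rms_rel_error(bits: u32) -> f64 {
    1.0 / (qmax(bits) * 12.0_f64.sqrt())
}
```

For example, `max_rel_error(8, 0.1)` evaluates to `1.1 / 254`, the 0.433% figure in the drift table.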
+### 3.9 Complete Quantizer and Packer Traits
+
+```rust
+/// Trait for quantization formats that can encode and decode tensor blocks.
+pub trait TensorQuantizer {
+    /// The bit width of this quantizer.
+    fn bit_width(&self) -> u8;
+
+    /// Quantize a block of f32 values into signed codes and scale(s).
+    fn quantize_block(
+        &self,
+        values: &[f32],
+        config: &QuantConfig,
+    ) -> QuantizedBlock;
+
+    /// Dequantize a block back to f32 values.
+    fn dequantize_block(
+        &self,
+        block: &QuantizedBlock,
+        out: &mut [f32],
+    ) -> Result<(), CodecErr>;
+
+    /// Returns the packed byte size for `num_values` at this bit width,
+    /// excluding scale storage.
+    fn packed_data_size(&self, num_values: usize) -> usize {
+        (num_values * self.bit_width() as usize + 7) / 8
+    }
+
+    /// Returns total block storage size including scale(s) and flags.
+    fn block_storage_size(&self, block_size: usize) -> usize;
+}
+
+/// Trait for bit-level packing codecs.
+pub trait BitCodec {
+    /// Pack signed codes into a byte buffer.
+    fn pack(
+        &self,
+        codes: &[i8],
+        bits: u8,
+        out: &mut [u8],
+    ) -> Result<usize, CodecErr>;
+
+    /// Unpack codes from a byte buffer.
+    fn unpack(
+        &self,
+        data: &[u8],
+        bits: u8,
+        out: &mut [i8],
+    ) -> Result<usize, CodecErr>;
+}
+
+/// Standard implementation using the accumulator-based codec_bits functions.
+pub struct StandardBitCodec;
+
+impl BitCodec for StandardBitCodec {
+    fn pack(
+        &self,
+        codes: &[i8],
+        bits: u8,
+        out: &mut [u8],
+    ) -> Result<usize, CodecErr> {
+        pack_bits(codes, bits, out)
+    }
+
+    fn unpack(
+        &self,
+        data: &[u8],
+        bits: u8,
+        out: &mut [i8],
+    ) -> Result<usize, CodecErr> {
+        unpack_bits(data, bits, out)
+    }
+}
+```
+
+### 3.10 Block Storage Summary Diagram
+
+```
+TIER 1 (8-bit):
+--------+-------+-------+-------+-----+-------+
+| scale  | q[0]  | q[1]  | q[2]  | ... | q[63] |
+| f32 LE | i8    | i8    | i8    |     | i8    |
+--------+-------+-------+-------+-----+-------+
+  4 bytes   1       1       1             1        = 68 bytes / block
+
+TIER 2a (7-bit):
+--------+--------------------------------------------+
+| scale  | packed 7-bit codes (56 bytes for 64 vals)   |
+| f32 LE | bitstream, little-endian accumulator        |
+--------+--------------------------------------------+
+  4 bytes   ceil(64*7/8) = 56 bytes                    = 60 bytes / block
+
+TIER 2b (5-bit):
+--------+--------------------------------------------+
+| scale  | packed 5-bit codes (40 bytes for 64 vals)   |
+| f32 LE | bitstream, little-endian accumulator        |
+--------+--------------------------------------------+
+  4 bytes   ceil(64*5/8) = 40 bytes                    = 44 bytes / block
+
+TIER 3 standard (3-bit):
+--------+--------------------------------------------+
+| scale  | packed 3-bit codes (24 bytes for 64 vals)   |
+| f32 LE | bitstream, little-endian accumulator        |
+--------+--------------------------------------------+
+  4 bytes   ceil(64*3/8) = 24 bytes                    = 28 bytes / block
+
+TIER 3 two-level (3-bit):
+--------+--------+----------+-------------------------------+
+| pscale | sscale | flags    | packed 3-bit codes            |
+| f32 LE | f32 LE | ceil(N/8)| bitstream                     |
+--------+--------+----------+-------------------------------+
+  4         4        8 bytes    24 bytes                       = 40 bytes / block
+
+TIER 0 (absent):
+--------------------------------------+
+| TensorMeta + ReconstructPolicy only  |
+| NO quantized data                    |
+--------------------------------------+
+  variable (typically 32-128 bytes metadata)
+```
+
+---
+
+## 4. Alternatives Considered
+
+### 4.1 4-Bit as the Warm Tier
+
+4-bit quantization (qmax = 7, 8.00x compression) is the most widely studied
+format (GPTQ, AWQ). We considered using 4-bit instead of 7-bit for the warm
+tier. **Rejected** because: (a) the jump from 8-bit to 4-bit is too large for
+tensors that were recently hot, causing unnecessary quality loss; (b) 7-bit
+provides a gentler step-down; (c) 5-bit is available as an intermediate when
+memory pressure increases.
+
+### 4.2 Uniform 4-Bit Across All Non-Hot Tiers
+
+A simpler design with only two quantization levels (8-bit hot, 4-bit everything
+else). **Rejected** because: (a) cold tensors waste 1 extra bit per value when
+3-bit suffices; (b) no path to aggressive compression under memory pressure;
+(c) loses the granularity that enables smooth quality degradation.
+
+### 4.3 Asymmetric Quantization for 3-Bit
+
+Using asymmetric quantization (with zero-point) for 3-bit to better utilize the
+`[0, 7]` unsigned range when distributions are not centered. **Rejected**
+because: (a) adds 4 bytes of zero-point storage per block; (b) requires an
+additional subtraction in the dequantize path; (c) the two-level scale approach
+handles asymmetric distributions more effectively by splitting the scale rather
+than shifting the range.
+
+### 4.4 Lookup Table (Codebook) Quantization for Cold
+
+Using a small codebook (e.g., 8 centroids) instead of uniform 3-bit levels.
+**Rejected** because: (a) requires a per-block or per-tensor codebook training
+step that is expensive for streaming data; (b) codebook storage overhead is
+comparable to scale storage but with higher decode complexity; (c) uniform
+quantization is simpler to implement and reason about.
+
+### 4.5 No Two-Level Scale (Simpler 3-Bit)
+
+Omitting the two-level scale option entirely. **Considered but rejected** because
+agent embedding tensors frequently exhibit heavy-tailed distributions where a few
+dimensions carry disproportionate magnitude. Without two-level scale, these
+outliers inflate the single scale, so the bulk of near-zero values collapses onto
+one or two codes and most of the 3-bit range goes unused.
+
+---
+
+## 5. Acceptance Criteria
+
+### 5.1 Format Correctness
+
+- [ ] `pack_bits` followed by `unpack_bits` is a lossless round-trip for all
+      bit widths (3, 5, 7, 8) and all valid signed input ranges.
+- [ ] `quantize_s8` followed by `dequantize` produces values within the
+      theoretical error bound (`max_abs / 254`, i.e. `scale / 2`) of the originals.
+- [ ] `quantize_bits(values, 7, config)` followed by `dequantize` produces values within
+      `max_abs / 126` of the originals.
+- [ ] `quantize_bits(values, 5, config)` followed by `dequantize` produces values within
+      `max_abs / 30` of the originals.
+- [ ] `quantize_bits(values, 3, config)` followed by `dequantize` produces values within
+      `max_abs / 6` of the originals (standard mode).
+- [ ] Two-level 3-bit mode activates when `max/median > two_level_threshold`.
+- [ ] Tier 0 reads return zeros when `reconstruct_policy` is `None`.
+- [ ] Tier 0 reads invoke reconstruction when a policy exists.
+
+### 5.2 Performance
+
+- [ ] `pack_bits` throughput >= 2 GB/s on native (AVX2-capable hardware).
+- [ ] `unpack_bits` throughput >= 2 GB/s on native.
+- [ ] `max_abs` with SIMD is >= 3x faster than the scalar fallback on 512+ element blocks.
+- [ ] WASM `pack_bits` / `unpack_bits` throughput >= 500 MB/s (without SIMD).
+- [ ] No heap allocations in `pack_bits`, `unpack_bits`, or `max_abs`.
+
+### 5.3 Storage Efficiency
+
+- [ ] 8-bit block storage: exactly `4 + block_size` bytes.
+- [ ] 7-bit block storage: exactly `4 + ceil(block_size * 7 / 8)` bytes.
+- [ ] 5-bit block storage: exactly `4 + ceil(block_size * 5 / 8)` bytes.
+- [ ] 3-bit block storage (standard): exactly `4 + ceil(block_size * 3 / 8)` bytes.
+- [ ] 3-bit block storage (two-level): exactly `8 + ceil(block_size / 8) + ceil(block_size * 3 / 8)` bytes.
+- [ ] No padding bits between consecutive blocks in a segment.
+
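The storage formulas in the checklist above can be expressed as a small Rust helper, which is handy for asserting exact block sizes in tests. The function names are illustrative and not taken from the crate.

```rust
/// Bytes needed for `n` codes at `bits` bits each: ceil(n * bits / 8).
pub fn packed_len(n: usize, bits: u32) -> usize {
    (n * bits as usize + 7) / 8
}

/// Total bytes for one block: scale(s) + optional flag bytes + packed codes.
/// `two_level` applies only to the 3-bit cold format.
pub fn block_bytes(block_size: usize, bits: u32, two_level: bool) -> usize {
    let scales = if two_level { 8 } else { 4 }; // one or two f32 scales
    let flags = if two_level { (block_size + 7) / 8 } else { 0 };
    scales + flags + packed_len(block_size, bits)
}

/// Compression ratio of one block relative to raw f32 storage.
pub fn ratio_vs_f32(block_size: usize, bits: u32, two_level: bool) -> f64 {
    (block_size * 4) as f64 / block_bytes(block_size, bits, two_level) as f64
}
```

With `block_size = 64` this reproduces the per-block totals from Section 3: 68, 60, 44, 28, and 40 bytes.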
+### 5.4 Dynamic Tier 2 Downgrade
+
+- [ ] When aggregate warm storage exceeds `warm_byte_cap`, the least recently
+      accessed warm tensors are re-encoded from 7-bit to 5-bit.
+- [ ] The downgrade is reversible: if warm storage drops below
+      `warm_byte_cap * 0.8` (hysteresis), tensors can be re-promoted to 7-bit
+      on next access.
+
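The hysteresis rule above can be sketched as a pure decision function in Rust. This is a deliberate simplification: it decides a single warm format from the aggregate size, whereas the ADR downgrades individual least-recently-accessed tensors and re-promotes on next access. `WarmFormat` and `next_warm_format` are hypothetical names.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum WarmFormat {
    Bits7,
    Bits5,
}

/// Decide the warm format given current aggregate warm storage.
/// Downgrade above the cap; re-promote only below 80% of the cap;
/// otherwise keep the current format (the hysteresis band).
pub fn next_warm_format(current: WarmFormat, warm_bytes: u64, warm_byte_cap: u64) -> WarmFormat {
    // 80% of the cap, written as cap - cap/5 so it cannot overflow.
    let low_water = warm_byte_cap - warm_byte_cap / 5;
    if warm_bytes > warm_byte_cap {
        WarmFormat::Bits5
    } else if warm_bytes < low_water {
        WarmFormat::Bits7
    } else {
        current
    }
}
```

Inside the band between 80% and 100% of the cap the function returns `current` unchanged, which is what prevents oscillation when the warm set hovers near the cap.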
+---
+
+## 6. Risks and Mitigations
+
+| Risk | Severity | Likelihood | Mitigation |
+|------|----------|------------|------------|
+| 3-bit two-level scale adds format complexity without sufficient accuracy gain for most distributions | Medium | Medium | Gate behind a cargo feature `two-level-cold`; default to standard 3-bit. Benchmark on real agent embeddings before enabling by default. |
+| Dynamic 7-bit to 5-bit downgrade causes thrashing when warm set oscillates near the byte cap | Medium | Medium | Implement hysteresis (20% band). Only downgrade when above cap; only upgrade when below 80% of cap. Rate-limit downgrades to at most once per minute. |
+| `pack_bits` accumulator overflow for large inputs | Low | Low | Not possible by construction. After each flush loop at most 7 bits remain pending, so appending one code of at most 8 bits leaves at most 15 bits in the accumulator, far below its 64-bit capacity regardless of input length. |
+| Tier 0 reconstruction from Delta/Factor introduces unbounded latency | Medium | Low | Set a maximum reconstruction depth (default: 3). If the base tensor is also Tier 0, fail with `ReconstructionDepthExceeded` rather than recursing indefinitely. |
+| WASM scalar `max_abs` is a bottleneck for large tensors | Low | High | Expected. The WASM SIMD feature flag provides 3-4x improvement. For non-SIMD targets, `max_abs` cost is small relative to the full quantize pipeline. |
+| Block size mismatch between encoder and decoder | High | Low | Block size is stored in the segment header (ADR-017 format). Decoder reads it from the header rather than assuming a default. |
+
+---
+
+## 7. References
+
+1. ADR-017: Temporal Tensor Compression with Tiered Quantization. RuVector Architecture Team, 2026.
+2. ADR-018: Block-Based Storage Engine for Temporal Tensor Segments (forthcoming).
+3. Frantar, E., et al. "GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers." ICLR 2023.
+4. Lin, J., et al. "AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration." MLSys 2024.
+5. Kim, S., et al. "SqueezeLLM: Dense-and-Sparse Quantization." ICML 2024.
+6. Liu, Z., et al. "KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache." ICML 2024.
+7. Pelkonen, T., et al. "Gorilla: A Fast, Scalable, In-Memory Time Series Database." VLDB 2015.
+8. IEEE 754-2019. "IEEE Standard for Floating-Point Arithmetic."
+9. Lemire, D. and Boytsov, L. "Decoding billions of integers in milliseconds through vectorized bit packing." Software: Practice and Experience, 2015.
+10. WebAssembly SIMD Proposal. https://github.com/WebAssembly/simd. Finalized 2023.