{\rtf1\ansi\ansicpg1252\cocoartf2867
\cocoatextscaling0\cocoaplatform0{\fonttbl\f0\froman\fcharset0 Times-Bold;\f1\froman\fcharset0 Times-Roman;\f2\froman\fcharset0 Times-Italic;
\f3\froman\fcharset0 Times-BoldItalic;\f4\fmodern\fcharset0 Courier;\f5\fnil\fcharset0 HelveticaNeue;
\f6\fmodern\fcharset0 Courier-Bold;}
{\colortbl;\red255\green255\blue255;\red0\green0\blue0;\red0\green0\blue0;\red179\green179\blue179;
}
{\*\expandedcolortbl;;\cssrgb\c0\c0\c0;\cssrgb\c0\c0\c0\c84706;\cssrgb\c75294\c75294\c75294;
}
{\*\listtable{\list\listtemplateid1\listhybrid{\listlevel\levelnfc23\levelnfcn23\leveljc0\leveljcn0\levelfollow0\levelstartat1\levelspace360\levelindent0{\*\levelmarker \{disc\}}{\leveltext\leveltemplateid1\'01\uc0\u8226 ;}{\levelnumbers;}\fi-360\li720\lin720 }{\listname ;}\listid1}
{\list\listtemplateid2\listhybrid{\listlevel\levelnfc0\levelnfcn0\leveljc0\leveljcn0\levelfollow0\levelstartat1\levelspace360\levelindent0{\*\levelmarker \{decimal\}}{\leveltext\leveltemplateid101\'01\'00;}{\levelnumbers\'01;}\fi-360\li720\lin720 }{\listlevel\levelnfc23\levelnfcn23\leveljc0\leveljcn0\levelfollow0\levelstartat1\levelspace360\levelindent0{\*\levelmarker \{circle\}}{\leveltext\leveltemplateid102\'01\uc0\u9702 ;}{\levelnumbers;}\fi-360\li1440\lin1440 }{\listname ;}\listid2}
{\list\listtemplateid3\listhybrid{\listlevel\levelnfc0\levelnfcn0\leveljc0\leveljcn0\levelfollow0\levelstartat1\levelspace360\levelindent0{\*\levelmarker \{decimal\}}{\leveltext\leveltemplateid201\'01\'00;}{\levelnumbers\'01;}\fi-360\li720\lin720 }{\listname ;}\listid3}
{\list\listtemplateid4\listhybrid{\listlevel\levelnfc0\levelnfcn0\leveljc0\leveljcn0\levelfollow0\levelstartat1\levelspace360\levelindent0{\*\levelmarker \{decimal\}}{\leveltext\leveltemplateid301\'01\'00;}{\levelnumbers\'01;}\fi-360\li720\lin720 }{\listname ;}\listid4}
{\list\listtemplateid5\listhybrid{\listlevel\levelnfc0\levelnfcn0\leveljc0\leveljcn0\levelfollow0\levelstartat1\levelspace360\levelindent0{\*\levelmarker \{decimal\}}{\leveltext\leveltemplateid401\'01\'00;}{\levelnumbers\'01;}\fi-360\li720\lin720 }{\listname ;}\listid5}
{\list\listtemplateid6\listhybrid{\listlevel\levelnfc23\levelnfcn23\leveljc0\leveljcn0\levelfollow0\levelstartat1\levelspace360\levelindent0{\*\levelmarker \{disc\}}{\leveltext\leveltemplateid501\'01\uc0\u8226 ;}{\levelnumbers;}\fi-360\li720\lin720 }{\listname ;}\listid6}
{\list\listtemplateid7\listhybrid{\listlevel\levelnfc0\levelnfcn0\leveljc0\leveljcn0\levelfollow0\levelstartat1\levelspace360\levelindent0{\*\levelmarker \{decimal\}}{\leveltext\leveltemplateid601\'01\'00;}{\levelnumbers\'01;}\fi-360\li720\lin720 }{\listname ;}\listid7}
{\list\listtemplateid8\listhybrid{\listlevel\levelnfc0\levelnfcn0\leveljc0\leveljcn0\levelfollow0\levelstartat1\levelspace360\levelindent0{\*\levelmarker \{decimal\}}{\leveltext\leveltemplateid701\'01\'00;}{\levelnumbers\'01;}\fi-360\li720\lin720 }{\listname ;}\listid8}
{\list\listtemplateid9\listhybrid{\listlevel\levelnfc0\levelnfcn0\leveljc0\leveljcn0\levelfollow0\levelstartat1\levelspace360\levelindent0{\*\levelmarker \{decimal\}}{\leveltext\leveltemplateid801\'01\'00;}{\levelnumbers\'01;}\fi-360\li720\lin720 }{\listname ;}\listid9}
{\list\listtemplateid10\listhybrid{\listlevel\levelnfc0\levelnfcn0\leveljc0\leveljcn0\levelfollow0\levelstartat1\levelspace360\levelindent0{\*\levelmarker \{decimal\}}{\leveltext\leveltemplateid901\'01\'00;}{\levelnumbers\'01;}\fi-360\li720\lin720 }{\listname ;}\listid10}
{\list\listtemplateid11\listhybrid{\listlevel\levelnfc0\levelnfcn0\leveljc0\leveljcn0\levelfollow0\levelstartat1\levelspace360\levelindent0{\*\levelmarker \{decimal\}}{\leveltext\leveltemplateid1001\'01\'00;}{\levelnumbers\'01;}\fi-360\li720\lin720 }{\listname ;}\listid11}
{\list\listtemplateid12\listhybrid{\listlevel\levelnfc23\levelnfcn23\leveljc0\leveljcn0\levelfollow0\levelstartat1\levelspace360\levelindent0{\*\levelmarker \{disc\}}{\leveltext\leveltemplateid1101\'01\uc0\u8226 ;}{\levelnumbers;}\fi-360\li720\lin720 }{\listname ;}\listid12}
{\list\listtemplateid13\listhybrid{\listlevel\levelnfc23\levelnfcn23\leveljc0\leveljcn0\levelfollow0\levelstartat1\levelspace360\levelindent0{\*\levelmarker \{disc\}}{\leveltext\leveltemplateid1201\'01\uc0\u8226 ;}{\levelnumbers;}\fi-360\li720\lin720 }{\listname ;}\listid13}
{\list\listtemplateid14\listhybrid{\listlevel\levelnfc23\levelnfcn23\leveljc0\leveljcn0\levelfollow0\levelstartat1\levelspace360\levelindent0{\*\levelmarker \{disc\}}{\leveltext\leveltemplateid1301\'01\uc0\u8226 ;}{\levelnumbers;}\fi-360\li720\lin720 }{\listname ;}\listid14}
{\list\listtemplateid15\listhybrid{\listlevel\levelnfc23\levelnfcn23\leveljc0\leveljcn0\levelfollow0\levelstartat1\levelspace360\levelindent0{\*\levelmarker \{disc\}}{\leveltext\leveltemplateid1401\'01\uc0\u8226 ;}{\levelnumbers;}\fi-360\li720\lin720 }{\listname ;}\listid15}
{\list\listtemplateid16\listhybrid{\listlevel\levelnfc23\levelnfcn23\leveljc0\leveljcn0\levelfollow0\levelstartat1\levelspace360\levelindent0{\*\levelmarker \{disc\}}{\leveltext\leveltemplateid1501\'01\uc0\u8226 ;}{\levelnumbers;}\fi-360\li720\lin720 }{\listname ;}\listid16}
{\list\listtemplateid17\listhybrid{\listlevel\levelnfc23\levelnfcn23\leveljc0\leveljcn0\levelfollow0\levelstartat1\levelspace360\levelindent0{\*\levelmarker \{disc\}}{\leveltext\leveltemplateid1601\'01\uc0\u8226 ;}{\levelnumbers;}\fi-360\li720\lin720 }{\listname ;}\listid17}
{\list\listtemplateid18\listhybrid{\listlevel\levelnfc23\levelnfcn23\leveljc0\leveljcn0\levelfollow0\levelstartat1\levelspace360\levelindent0{\*\levelmarker \{disc\}}{\leveltext\leveltemplateid1701\'01\uc0\u8226 ;}{\levelnumbers;}\fi-360\li720\lin720 }{\listname ;}\listid18}
{\list\listtemplateid19\listhybrid{\listlevel\levelnfc23\levelnfcn23\leveljc0\leveljcn0\levelfollow0\levelstartat1\levelspace360\levelindent0{\*\levelmarker \{disc\}}{\leveltext\leveltemplateid1801\'01\uc0\u8226 ;}{\levelnumbers;}\fi-360\li720\lin720 }{\listname ;}\listid19}
{\list\listtemplateid20\listhybrid{\listlevel\levelnfc23\levelnfcn23\leveljc0\leveljcn0\levelfollow0\levelstartat1\levelspace360\levelindent0{\*\levelmarker \{disc\}}{\leveltext\leveltemplateid1901\'01\uc0\u8226 ;}{\levelnumbers;}\fi-360\li720\lin720 }{\listname ;}\listid20}}
{\*\listoverridetable{\listoverride\listid1\listoverridecount0\ls1}{\listoverride\listid2\listoverridecount0\ls2}{\listoverride\listid3\listoverridecount0\ls3}{\listoverride\listid4\listoverridecount0\ls4}{\listoverride\listid5\listoverridecount0\ls5}{\listoverride\listid6\listoverridecount0\ls6}{\listoverride\listid7\listoverridecount0\ls7}{\listoverride\listid8\listoverridecount0\ls8}{\listoverride\listid9\listoverridecount0\ls9}{\listoverride\listid10\listoverridecount0\ls10}{\listoverride\listid11\listoverridecount0\ls11}{\listoverride\listid12\listoverridecount0\ls12}{\listoverride\listid13\listoverridecount0\ls13}{\listoverride\listid14\listoverridecount0\ls14}{\listoverride\listid15\listoverridecount0\ls15}{\listoverride\listid16\listoverridecount0\ls16}{\listoverride\listid17\listoverridecount0\ls17}{\listoverride\listid18\listoverridecount0\ls18}{\listoverride\listid19\listoverridecount0\ls19}{\listoverride\listid20\listoverridecount0\ls20}}
\margl1440\margr1440\vieww11520\viewh8400\viewkind0
\deftab720
\pard\pardeftab720\sa298\partightenfactor0
\f0\b\fs36 \cf0 \expnd0\expndtw0\kerning0
\outl0\strokewidth0 \strokec2 Frontier Plan: Rust, SIMD, WASM and Edge LLM Serving (State of the Art 2026)\
\pard\pardeftab720\sa298\partightenfactor0
\cf0 \strokec2 State of the Art Context (2026)\
\pard\pardeftab720\sa280\partightenfactor0
\fs28 \cf0 KV Cache Bottleneck and New Solutions\
\pard\pardeftab720\sa240\partightenfactor0
\fs24 \cf0 KV Cache as the Bottleneck:
\f1\b0 \strokec2 In modern LLM inference with long contexts and batched requests, the memory footprint of the
\f0\b \strokec2 key-value (KV) cache
\f1\b0 \strokec2 dominates. For example, for a 540B-parameter model (PaLM) at batch size 512 with a 2048-token context, the KV cache alone reaches ~3\uc0\u8239 TB \'96 about 3\'d7 the size of the model\'92s parameters. This memory overhead (and the need to read the cached KV data for every generated token) makes the KV cache a primary performance and cost bottleneck for long contexts. Recent work has therefore shifted focus from further weight compression to 
\f0\b \strokec2 KV cache optimizations
\f1\b0 \strokec2 as the key to pushing context lengths and multi-query throughput.\
\f0\b \strokec2 KV Cache Quantization:
\f1\b0 \strokec2 Techniques for compressing the KV cache have matured to achieve very low precision storage with minimal quality loss, making KV quantization a
\f0\b \strokec2 first-class feature
\f1\b0 \strokec2 rather than a hack.
\f2\i \strokec2 KVQuant
\f1\i0 \strokec2 (NeurIPS 2024) demonstrates that with careful handling of outlier entries and structure (e.g. per-channel quantization for keys, isolating per-vector outliers), one can push KV down to ~3-bit precision while keeping perplexity near baseline.
\f2\i \strokec2 KIVI
\f1\i0 \strokec2 (2024) goes even further: it introduces a
\f0\b \strokec2 tuning-free 2-bit KV cache quantization
\f1\b0 \strokec2 that treats keys and values asymmetrically (keys quantized per-channel, values per-token) and still \'93maintains almost the same quality\'94 as FP16, with dramatically lower peak memory usage. These low-bit KV schemes translate directly into higher batch capacity or longer contexts \'96 the KIVI authors report about 2.6\'d7 lower peak memory, up to 4\'d7 larger batch sizes, and roughly 2.35\'963.47\'d7 higher throughput on real inference workloads. Beyond pure quantization, 
\f2\i \strokec2 XQuant
\f1\i0 \strokec2 (2025) takes a different approach:
\f0\b \strokec2 rematerialization
\f1\b0 \strokec2 . Instead of storing full key/value tensors, XQuant stores a much smaller
\f0\b \strokec2 quantized intermediate (the \'93X\'94 layer input)
\f1\b0 \strokec2 and
\f0\b \strokec2 recomputes the K and V on the fly
\f1\b0 \strokec2 each step. This trades extra compute for big memory savings \'96 an immediate 2\'d7 reduction vs. standard KV caching, and up to ~7.7\'d7 less memory with negligible (<0.1) perplexity loss compared to FP16 storage. In practice, systems may also combine strategies (quantize what is stored, recompute some parts when needed) to balance latency and memory. Notably, the open-source
\f0\b \strokec2 vLLM
\f1\b0 \strokec2 library already implements a pragmatic scheme: it
\f0\b \strokec2 stores KV in a quantized format (e.g. FP8)
\f1\b0 \strokec2 to roughly halve memory use, but
\f0\b \strokec2 dequantizes back to high precision (FP16/BF16) for the actual attention computations
\f1\b0 \strokec2 . This way,
\f0\b \strokec2 memory is saved on storage, but the attention math still runs at full precision
\f1\b0 \strokec2 for stability. In summary, the cutting edge has shifted toward aggressive KV cache compression and smart management, since that is now the main limiter for scaling context and concurrency on the edge.\
\pard\pardeftab720\sa280\partightenfactor0
\f0\b\fs28 \cf0 \strokec2 Multi-Adapter LoRA Serving as a Systems Challenge\
\pard\pardeftab720\sa240\partightenfactor0
\f1\b0\fs24 \cf0 \strokec2 Fine-tuning large models via
\f0\b \strokec2 LoRA (Low-Rank Adaptation) 
\f1\b0 \strokec2 is commonplace for personalization, but LoRA is no longer just a training trick: serving many different LoRA-tuned variants efficiently is a systems design problem. Naively 
\f0\b \strokec2 merging each LoRA into a separate model copy
\f1\b0 \strokec2 doesn\'92t scale when each user session might require a different adapter (the memory overhead would be enormous). Recent research addresses this by treating
\f2\i \strokec2 many LoRAs + one base model
\f1\i0 \strokec2 as a single multi-tenant deployment.
\f0\b \strokec2 S-LoRA
\f1\b0 \strokec2 (MLSys 2024) is a flagship example: it serves
\f2\i \strokec2 thousands of LoRA adapters concurrently
\f1\i0 \strokec2 on a single base model instance. The S-LoRA system avoids merging adapters into the weights; instead it keeps all adapter weight matrices in CPU memory and
\f0\b \strokec2 pages in the ones needed for active queries into GPU memory on the fly
\f1\b0 \strokec2 . To make this feasible, S-LoRA introduces
\f0\b \strokec2 Unified Paging
\f1\b0 \strokec2 , which uses a unified GPU memory pool to manage both the dynamic adapter weights
\f2\i \strokec2 and
\f1\i0 \strokec2 the KV cache tensors. This unified pool with a paging strategy prevents memory fragmentation and ensures efficient utilization even as different sessions load and unload different LoRAs. (In other words, the residency of KV cache and LoRA data is handled together as one scheduling/allocation problem.) S-LoRA also employs custom CUDA kernels for
\f0\b \strokec2 heterogeneous batching
\f1\b0 \strokec2 , so that requests using different LoRA adapters (even with different rank sizes) can still be batched together for throughput. The result is that
\f2\i \strokec2 on-demand personalization
\f1\i0 \strokec2 is possible at scale: S-LoRA reports up to 4\'d7 higher throughput than baselines such as Hugging Face PEFT and vLLM with naive LoRA support, and it can serve orders of magnitude more adapters than approaches that \'93merge and deploy\'94 one model per adapter.\
Another advance is
\f2\i \strokec2 DoRA (Weight-Decomposed Low-Rank Adaptation)
\f1\i0 \strokec2 by NVIDIA (ICML 2024). DoRA is a new fine-tuning method that improves on LoRA\'92s quality without adding any inference overhead. It
\f0\b \strokec2 decomposes each pretrained weight matrix into magnitude and direction components, applies LoRA-style low-rank updates to the directional component, and then merges them back
\f1\b0 \strokec2 before inference. The key point:
\f0\b \strokec2 DoRA\'92s adapted weights can be merged into the base model for runtime
\f1\b0 \strokec2 , so serving a DoRA-tuned model is as cheap as serving the original model (no extra latency or memory), while achieving higher accuracy than standard LoRA. This matters for cases where you
\f2\i \strokec2 can
\f1\i0 \strokec2 afford separate deployments for fixed domains: you get better quality per trainable parameter while inference stays as cheap as with a merged LoRA (once merged, the low-rank updates add no compute at runtime). In summary, 
\f2\i \strokec2 serving personalized models at the edge
\f1\i0 \strokec2 is now being tackled with system-level optimizations: unified memory management for many adapters (S-LoRA) when you need lots of per-request variants, and improved fine-tuning techniques (DoRA) when you want high-quality adapters that incur no runtime penalty.\
\pard\pardeftab720\sa280\partightenfactor0
\f0\b\fs28 \cf0 \strokec2 WASM at the Edge: Setting the Right Boundaries\
\pard\pardeftab720\sa240\partightenfactor0
\fs24 \cf0 WebAssembly (WASM)
\f1\b0 \strokec2 has emerged as a credible option for sandboxing and deploying code at the network edge, including portions of ML inference pipelines \'96 but
\f2\i \strokec2 where
\f1\i0 \strokec2 to use WASM is critical. The consensus is that
\f0\b \strokec2 WASM is best used for plugin-like modules or kernels with strict resource budgets
\f1\b0 \strokec2 , rather than as a full deep learning runtime by itself. In practice, this means heavy-duty math is often left to optimized native libraries, but you can run those libraries behind a WASM interface safely. A notable development here is
\f0\b \strokec2 WASI-NN
\f1\b0 \strokec2 , a proposed WebAssembly system interface for neural network inference. WASI-NN defines a
\f0\b \strokec2 vendor-neutral API
\f1\b0 \strokec2 where a WASM module can load a model and ask the host to perform inference. Crucially, the WASI-NN spec is
\f0\b \strokec2 framework-agnostic
\f1\b0 \strokec2 : it treats the model as an opaque blob and can delegate to any backend \'96 e.g. passing an ONNX model to an ONNX Runtime or OpenVINO backend, or a TensorFlow model to a TensorFlow backend, depending on the host\'92s implementation. This allows a sandboxed WASM plugin to leverage high-performance native inference engines (like ONNX Runtime, OpenVINO, TensorRT) 
\f0\b \strokec2 without exposing the host to risk
\f1\b0 \strokec2 , since the WASM side only sees a safe API. In edge scenarios, one might use this for less time-critical ML tasks running in a plugin environment or to run classical ML models (vision, etc.) on the fly in a restricted sandbox.\
For the
\f2\i \strokec2 core
\f1\i0 \strokec2 per-token neural network loops of a large language model, however,
\f0\b \strokec2 pure WASM will add overhead
\f1\b0 \strokec2 unless carefully optimized. The state of the art approach is to keep those tight loops small and efficient, possibly with explicit SIMD use, and to
\f0\b \strokec2 use WASM engine features to enforce execution budgets
\f1\b0 \strokec2 . For example,
\f0\b \strokec2 Wasmtime
\f1\b0 \strokec2 (a popular WASM runtime) offers two ways to preempt long-running code: a
\f0\b \strokec2 deterministic fuel mechanism
\f1\b0 \strokec2 where each WASM operation decrements a counter and traps when fuel is exhausted, and a faster
\f0\b \strokec2 epoch-based interruption
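\f1\b0 \strokec2 (the host advances an epoch counter, typically from a timer, and the guest traps or yields once its deadline epoch passes) \'96 both allow halting runaway code safely. Fuel is essentially a 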
\f2\i \strokec2 \'93you only get N instructions\'94
\f1\i0 \strokec2 meter, whereas epoch timers act like a timeout. In an edge inference setting, one can compile critical kernels to WASM and use these features to guarantee no kernel exceeds its time budget (preventing latency spikes or infinite loops). Meanwhile, for extremely resource-constrained devices, projects like
\f0\b \strokec2 WAMR (WebAssembly Micro Runtime)
\f1\b0 \strokec2 provide a
\f2\i \strokec2 tiny
\f1\i0 \strokec2 footprint WASM runtime. WAMR\'92s documentation cites a runtime binary of roughly 85 KB for the interpreter and roughly 50 KB for the AOT runtime, making it suitable for embedding in IoT or mobile deployments. Despite its size, WAMR supports ahead-of-time compilation to native code, which brings execution speed close to native with minimal runtime overhead. The bottom line: 
\f0\b \strokec2 WASM is \'93credible\'94 for edge ML when used wisely
\f1\b0 \strokec2 \'96 for example, shipping optimized kernels or small models as sandboxed modules, enforcing limits with fuel/time quotas, and choosing lightweight runtimes for embedded use \'96 rather than trying to run an entire PyTorch/TensorFlow stack inside WASM. This approach gives you safety, flexibility (hot-swapping modules), and portability, without sacrificing much performance.\
\pard\pardeftab720\sa280\partightenfactor0
\f0\b\fs28 \cf0 \strokec2 Rust Ecosystem for Portable ML Inference\
\pard\pardeftab720\sa240\partightenfactor0
\fs24 \cf0 Rust
\f1\b0 \strokec2 has gained traction for ML inference, especially for edge and client-side use, due to its combination of performance, safety, and portability (including easy targeting of WASM). Two reference points illustrate the state of the art in 2026:\
\pard\tx220\tx720\pardeftab720\li720\fi-720\sa240\partightenfactor0
\ls1\ilvl0
\f2\i \cf0 \kerning1\expnd0\expndtw0 \outl0\strokewidth0 {\listtext \uc0\u8226 }\expnd0\expndtw0\kerning0
\outl0\strokewidth0 \strokec2 Rust LLM engines:
\f1\i0
\f0\b mistral.rs
\f1\b0 is a community-driven Rust inference engine that has made waves with its speed and minimalism. It\'92s a
\f0\b Rust-native implementation
\f1\b0 of the Mistral (and other) model families focused on low latency and low resource usage. By leveraging Rust\'92s zero-cost abstractions and careful optimization (SIMD, quantization, multi-threading), mistral.rs can run chat models in constrained environments (even on CPU or WASM) with performance approaching highly optimized C++ backends. It supports multimodal and instruct models with a fraction of the footprint of larger frameworks. In short, it proves that
\f0\b \'93pure Rust\'94 inference is viable
\f1\b0 and even advantageous for edge deployments, offering memory safety and portability without sacrificing speed. Its design choices (e.g. an asynchronous API, support for paged attention, flash attention, etc.) are now influencing other projects.\
\ls1\ilvl0
\f2\i \kerning1\expnd0\expndtw0 \outl0\strokewidth0 {\listtext \uc0\u8226 }\expnd0\expndtw0\kerning0
\outl0\strokewidth0 \strokec2 Rust ML libraries:
\f1\i0 For building custom inference solutions in Rust, an important building block is
\f0\b Hugging Face\'92s Candle
\f1\b0 library. Candle is a
\f0\b minimalist ML framework in Rust
\f1\b0 that provides tensor operations and model loading, optimized for performance and small binary size. It\'92s designed to feel like writing PyTorch, but under the hood it\'92s pure Rust with support for CPU vectorization, GPU (CUDA) kernels, and even WebAssembly as backends. One of Candle\'92s notable strengths is that it can
\f0\b compile models to WebAssembly and run them in-browser or in a WASM runtime
\f1\b0 with ease. It also makes it straightforward to integrate or write custom ops in Rust (for example, the library supports plugging in user-defined GPU kernels like FlashAttention v2). This means if you need a particular optimized kernel (say for a new quantization scheme or a custom activation), you can add it in Rust and still have the whole model run end-to-end in a single Rust binary or WASM module. Candle\'92s focus on
\f2\i performance and flexibility
\f1\i0 (without a giant runtime dependency) has made it a
\f0\b \'93clean base\'94 for Rust ML
\f1\b0 \'96 many experimental Rust LLM projects build on Candle for tensor implementations and then add their own model-specific logic.\
\pard\pardeftab720\sa240\partightenfactor0
\cf0 \strokec2 In summary, the Rust ecosystem by 2026 offers both high-level engines (like mistral.rs) ready to use for fast inference, and lower-level frameworks (like Candle) for crafting your own solution, all with an eye toward portability (WASM support) and efficiency (zero-copy, SIMD, etc.). These will be the foundation for the next generation of edge ML runtimes.\
\pard\pardeftab720\sa298\partightenfactor0
\f0\b\fs36 \cf0 \strokec2 Updated Architecture Plan for
\f3\i \strokec2 ruvllm
\f0\i0 \strokec2 (Edge + WASM)\
\pard\pardeftab720\sa240\partightenfactor0
\f1\b0\fs24 \cf0 \strokec2 Given the above trends, here is a
\f0\b \strokec2 \'93frontier\'94 plan
\f1\b0 \strokec2 to build an edge-focused LLM serving runtime (which we\'92ll call
\f2\i \strokec2 ruvllm
\f1\i0 \strokec2 for now) using Rust, vectorized kernels, WASM modules, and the latest best practices. The plan is organized in stages:\
\pard\pardeftab720\sa280\partightenfactor0
\f0\b\fs28 \cf0 \strokec2 Core Plan (Immediate Priorities)\
\pard\tx220\tx720\pardeftab720\li720\fi-720\sa240\partightenfactor0
\ls2\ilvl0
\fs24 \cf0 \kerning1\expnd0\expndtw0 \outl0\strokewidth0 {\listtext 1 }\expnd0\expndtw0\kerning0
\outl0\strokewidth0 \strokec2 Keep Orchestration Native; Make Kernels Pluggable:
\f1\b0 Use Rust for the high-level orchestration, scheduling, and policy logic, and define a stable
\f0\b Kernel Provider ABI
\f1\b0 (application binary interface) for the low-level math kernels. This way, the coordinator (token loop, model logic, memory management) stays in safe native Rust, but you can
\f0\b swap out kernel implementations easily
\f1\b0 \'96 e.g. use highly optimized native Rust kernels, or load alternative implementations (possibly as WASM modules) without changing the core. This aligns with our design goals of auditability and flexibility: the core runtime handles things like policy decisions, logging, and safety checks, while the matrix multiplication, attention, etc., can be provided by plugin modules. In essence, the
\f2\i ruvllm
\f1\i0 engine should treat kernels as replaceable artifacts (almost like drivers). This was directionally in the original spec; now we double down on it by clearly delineating the boundary and ensuring that, for example, a kernel pack can be distributed or updated independently (e.g. a user could load a SIMD-optimized WASM kernel pack for their CPU). The base system should run with a default kernel set, but allow override via this interface.\
\ls2\ilvl0
\f0\b \kerning1\expnd0\expndtw0 \outl0\strokewidth0 {\listtext 2 }\expnd0\expndtw0\kerning0
\outl0\strokewidth0 \strokec2 Implement Two-Tier KV Cache (Quantization as a First-Class Feature):
\f1\b0 Based on the state-of-the-art practices above, the runtime must manage the KV cache in a memory-optimized way by default. Concretely, 
\f2\i ruvllm
\f1\i0 should use a
\f0\b two-tier KV cache
\f1\b0 :\
\pard\tx940\tx1440\pardeftab720\li1440\fi-1440\sa240\partightenfactor0
\ls2\ilvl1\cf0 \kerning1\expnd0\expndtw0 \outl0\strokewidth0 {\listtext \uc0\u9702 }\expnd0\expndtw0\kerning0
\outl0\strokewidth0 \strokec2 A small
\f0\b high-precision \'93tail\'94 buffer
\f1\b0 that holds the most recent tokens\'92 KV (e.g. the last N tokens in full FP16 or FP32 precision).\
\ls2\ilvl1\kerning1\expnd0\expndtw0 \outl0\strokewidth0 {\listtext \uc0\u9702 }\expnd0\expndtw0\kerning0
\outl0\strokewidth0 \strokec2 A larger
\f0\b quantized store for older tokens\'92 KV
\f1\b0 (using low-bit encoding).\
\pard\tx220\tx720\pardeftab720\li720\fi-720\sa240\partightenfactor0
\ls2\ilvl0\cf0 \kerning1\expnd0\expndtw0 \outl0\strokewidth0 {\listtext 3 }\expnd0\expndtw0\kerning0
\outl0\strokewidth0 \strokec2 New KV entries start in the high-precision tail; as the sequence grows and that buffer fills, the oldest entries are quantized and moved into the store. This design ensures that attention over the most recent tokens (which are often the most important for quality) runs directly on high-precision data, while longer-term history is kept compressed. We will use 
\f0\b group-wise quantization with explicit scales
\f1\b0 for the compressed KV \'96 for example, quantizing in blocks (or per head) with a stored scale and offset, in the same spirit as the fine-grained grouping that KVQuant and KIVI use to preserve quality. Importantly, 
\f0\b when performing an attention step, the system will
\f3\i dequantize
\f0\i0 any needed KV blocks from the store back into a high-precision scratch buffer for the dot-product computation
\f1\b0 , then discard that scratch after use. This follows the proven pattern from production systems:
\f2\i quantize for storage, dequantize for compute
\f1\i0 . By doing this, we keep arithmetic precision high where it matters (the attention math itself adds no error beyond what quantized storage already introduced), while steady-state memory usage is greatly reduced. The research cited above suggests that even 3\'964-bit precision for KV can work with proper calibration, so 
\f2\i ruvllm
\f1\i0 should make KV compression \'93on by default.\'94 In essence, KV cache management becomes a core feature of the runtime (with policies to decide when to quantize or not) rather than an afterthought. This matches what both academic work and deployed systems (like vLLM) indicate about KV being the new bottleneck.\
\pard\tx220\tx720\pardeftab720\li720\fi-720\sa240\partightenfactor0
\ls2\ilvl0
\f0\b \cf0 \kerning1\expnd0\expndtw0 \outl0\strokewidth0 {\listtext 4 }\expnd0\expndtw0\kerning0
\outl0\strokewidth0 \strokec2 Add Mixed-Precision KV Support (per-Component and per-Layer):
\f1\b0 Not all parts of the KV cache are equal \'96
\f0\b keys vs. values, and different model layers, have different sensitivity to quantization
\f1\b0 . The design should allow flexibility in precision
\f2\i per component
\f1\i0 . For instance, research (e.g. KIVI) found that keys are best quantized per channel while values are best quantized per token, so the two need different treatment. 
\f2\i ruvllm
\f1\i0 should support configuring
\f0\b separate quantization formats for K and V
\f1\b0 , and even leaving some layers\'92 KV in higher precision if desired. This means the KV cache API/structs need to carry type information (one layer\'92s KV might be 8-bit and another\'92s 4-bit, or keys might use one format and values another). By making this configurable, the system can implement policies like: \'93Layers 1\'96N quantized to 4-bit, final layers 8-bit for fidelity\'94 or \'93Key caches use a higher precision than Value caches.\'94 Prior studies suggest that such mixed schemes improve the quality-memory trade-off. The key point is to 
\f0\b treat precision as a tunable policy lever
\f1\b0 . The runtime core will expose hooks to apply different quantization to different layers or switch formats on the fly, while the kernels themselves just see either a dequantized float array or a quantized buffer reference. This separation of concerns (policy in Rust, mechanism in kernels) is crucial. We will implement the infrastructure for it now (even if the initial policy is simple). In practice, this might involve defining a small enum for supported precisions (FP16, FP8-E4M3, INT4, etc.) and having the KV cache manager know how to transcode between them.\
\ls2\ilvl0
\f0\b \kerning1\expnd0\expndtw0 \outl0\strokewidth0 {\listtext 5 }\expnd0\expndtw0\kerning0
\outl0\strokewidth0 \strokec2 Manage LoRA Adapters with Unified Memory Pooling:
\f1\b0 Extend the runtime\'92s memory management to handle
\f0\b LoRA adapter weights using the same strategy as KV scratch buffers
\f1\b0 . In other words,
\f2\i ruvllm
\f1\i0 will allocate a
\f0\b unified arena (pool)
\f1\b0 in memory that is used for:\
\pard\tx940\tx1440\pardeftab720\li1440\fi-1440\sa240\partightenfactor0
\ls2\ilvl1\cf0 \kerning1\expnd0\expndtw0 \outl0\strokewidth0 {\listtext \uc0\u9702 }\expnd0\expndtw0\kerning0
\outl0\strokewidth0 \strokec2 Temporarily holding
\f0\b currently active
\f1\b0 LoRA adapter weights (for whatever personalization is in use for the current request or batch).\
\ls2\ilvl1\kerning1\expnd0\expndtw0 \outl0\strokewidth0 {\listtext \uc0\u9702 }\expnd0\expndtw0\kerning0
\outl0\strokewidth0 \strokec2 Staging
\f0\b dequantized KV tensors
\f1\b0 for attention computation (as described above).\
\ls2\ilvl1\kerning1\expnd0\expndtw0 \outl0\strokewidth0 {\listtext \uc0\u9702 }\expnd0\expndtw0\kerning0
\outl0\strokewidth0 \strokec2 Other temporary scratch (intermediate results) as needed.\
\pard\tx220\tx720\pardeftab720\li720\fi-720\sa240\partightenfactor0
\ls2\ilvl0\cf0 \kerning1\expnd0\expndtw0 \outl0\strokewidth0 {\listtext 6 }\expnd0\expndtw0\kerning0
\outl0\strokewidth0 \strokec2 This unified pool prevents fragmentation and simplifies allocation. We can draw inspiration from S-LoRA\'92s paging system: S-LoRA had one memory pool that managed both adapter tensors and KV tensors to reduce fragmentation and overhead.
\f2\i ruvllm
\f1\i0 on edge may not have as many concurrent adapters as S-LoRA\'92s cloud scenario, but the same principle of
\f0\b unified budgeting
\f1\b0 applies (especially on memory-constrained CPU devices). For example, if multiple sessions with different LoRAs are running, we might swap idle adapters out of the pool (similar to paging to CPU or disk if needed), or compress them. We will also consider storing adapters in a compressed form (e.g. 8-bit or in some efficient delta form) when idle, and only 
|
||
\f0\b materializing the full-weight matrices in the pool when a request that needs them is processing
|
||
\f1\b0 . Essentially, treat adapter weights as another class of cacheable data that can move between memory tiers (RAM, disk) just like KV caches in LMCache. By pooling KV scratch and adapter data together, the runtime can make globally optimal decisions (e.g. evict some long-term KV to make room for a just-activated adapter, or vice versa) with a single allocator. This will ensure we
|
||
\f0\b \'93stop adapters from being a mess\'94
|
||
\f1\b0 \'96 no more memory blow-ups when many fine-tunes exist, as everything lives in one managed space with predictable limits. We will incorporate mechanisms like LRU or usage-based eviction for this pool, and possibly prefetching of adapters if we anticipate usage.\
|
||
\pard\tx220\tx720\pardeftab720\li720\fi-720\sa240\partightenfactor0
|
||
\ls2\ilvl0
|
||
\f0\b \cf0 \kerning1\expnd0\expndtw0 \outl0\strokewidth0 {\listtext 7 }\expnd0\expndtw0\kerning0
|
||
\outl0\strokewidth0 \strokec2 Deploy Core Kernels as Tiny WASM Modules (Kernel Packs):
|
||
\f1\b0 Following the idea of pluggable kernels, we will ship a default set of critical kernels in a portable WASM form as well. These
|
||
\f0\b WASM kernel packs
|
||
\f1\b0 (statically compiled, with SIMD support enabled) will include routines for things like positional embedding (RoPE), layer normalization (RMSNorm), the SwiGLU activation, KV pack/unpack (quantize/dequantize routines), and applying LoRA deltas. Each of these is a small, self-contained function that we can compile to WebAssembly with near-native performance (especially with new WASM SIMD instructions) and then load into our runtime. The benefits are twofold: (a) it provides a sandboxed execution environment for critical code (improving safety when running untrusted or user-supplied models), and (b) it makes the engine
|
||
\f0\b highly portable
|
||
\f1\b0 \'96 e.g. you can run the same kernel pack in a Wasmtime environment on x86, or in WAMR on an ARM microcontroller, without recompiling the Rust core. We will ensure these WASM kernels have a minimal ABI (e.g. direct pointers to memory plus lengths, no complex objects) to reduce overhead when calling from Rust into WASM. 
|
||
\f0\b Attention
|
||
\f1\b0 : we will
|
||
\f0\b not
|
||
\f1\b0 move the entire attention computation into WASM in the initial version. The attention softmax and batched matrix multiply are performance-critical and depend on memory layout \'96 we will keep those native for now (or use an optimized library) until we\'92ve stabilized everything else. The first WASM kernels will handle the pieces around the edges (embedding, non-linearities, elementwise ops) which are easier to sandbox and unlikely to be the bottleneck. Over time, as we gain confidence and possibly as WASM gains new features (threads, better SIMD), we can consider migrating more into that sandbox.\
|
||
\ls2\ilvl0
|
||
\f0\b \kerning1\expnd0\expndtw0 \outl0\strokewidth0 {\listtext 8 }\expnd0\expndtw0\kerning0
|
||
\outl0\strokewidth0 \strokec2 Tailor the WASM Runtime to Device Class:
|
||
\f1\b0
|
||
\f2\i ruvllm
|
||
\f1\i0 should be able to run on a spectrum of edge devices, so we will
|
||
\f0\b adopt different WASM runtime configurations depending on the deployment target
|
||
\f1\b0 :\
|
||
\pard\tx940\tx1440\pardeftab720\li1440\fi-1440\sa240\partightenfactor0
|
||
\ls2\ilvl1
|
||
\f0\b \cf0 \kerning1\expnd0\expndtw0 \outl0\strokewidth0 {\listtext \uc0\u9702 }\expnd0\expndtw0\kerning0
|
||
\outl0\strokewidth0 \strokec2 Edge servers / x86 Linux
|
||
\f1\b0 : Use
|
||
\f0\b Wasmtime
|
||
\f1\b0 with features like epoch interruption (or fuel) to ensure rogue kernels can be stopped. We\'92ll configure strict memory limits for the WASM instances (so a kernel can\'92t overflow its allotted memory). This gives us a high-performance JIT for WASM (important for SIMD) and the safety controls we need.\
|
||
\ls2\ilvl1
|
||
\f0\b \kerning1\expnd0\expndtw0 \outl0\strokewidth0 {\listtext \uc0\u9702 }\expnd0\expndtw0\kerning0
|
||
\outl0\strokewidth0 \strokec2 Embedded devices / microcontrollers
|
||
\f1\b0 : Use
|
||
\f0\b WAMR (WebAssembly Micro Runtime)
|
||
\f1\b0 , possibly in ahead-of-time (AOT) compiled mode. WAMR\'92s tiny footprint (on the order of tens of KB) and ability to precompile the WASM to native code make it well suited to constrained hardware. We might not have all features (e.g. no threads), but we won\'92t need them for these small devices. Essentially, the kernel pack would be compiled to a native blob via WAMR\'92s AOT compiler and then loaded.\
|
||
\ls2\ilvl1
|
||
\f0\b \kerning1\expnd0\expndtw0 \outl0\strokewidth0 {\listtext \uc0\u9702 }\expnd0\expndtw0\kerning0
|
||
\outl0\strokewidth0 \strokec2 Optional: WASI Threads
|
||
\f1\b0 where available: If running on a platform that supports WASM threads (and our runtime allows it), we could enable multi-threaded kernels in WASM for things like batched BLAS. But this will be off by default since support is still emerging and platform-specific in 2026.\
|
||
\pard\tx220\tx720\pardeftab720\li720\fi-720\sa240\partightenfactor0
|
||
\ls2\ilvl0\cf0 \kerning1\expnd0\expndtw0 \outl0\strokewidth0 {\listtext 9 }\expnd0\expndtw0\kerning0
|
||
\outl0\strokewidth0 \strokec2 The runtime will detect or be configured with the target class and choose the appropriate engine. Importantly, the
|
||
\f0\b functional behavior is consistent
|
||
\f1\b0 across these \'96 because both runtimes implement the WebAssembly and WASI standards, the same module can run in Wasmtime or WAMR. This flexibility means we can deploy the same 
|
||
\f2\i ruvllm
|
||
\f1\i0 engine from a beefy edge box down to an IoT gateway, just swapping the WASM runtime and maybe the kernel pack flavor (some modules might include alternative code paths for different SIMD instruction sets, selected at load time).\
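\pard\pardeftab720\sa240\partightenfactor0
\f1\b0 \cf0 \strokec2 The unified pool from the LoRA item above can be sketched as plain byte-budget bookkeeping with least-recently-used eviction. This is a minimal sketch under stated assumptions, not the ruvllm allocator: the names UnifiedPool, BlockKind, alloc, touch and free are hypothetical, and the sketch tracks sizes only instead of owning real memory.\

```rust
use std::collections::HashMap;

/// What a block holds. All names here are illustrative, not from the ruvllm spec.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum BlockKind {
    LoraWeights,
    KvDequantScratch,
    Scratch,
}

struct Block {
    kind: BlockKind,
    bytes: usize,
    last_used: u64, // logical clock value of the most recent use
    pinned: bool,   // pinned blocks are never evicted
}

/// Byte-budget bookkeeping for one pool shared by adapters and KV scratch.
/// It tracks sizes only; a real arena would also own the backing memory.
pub struct UnifiedPool {
    capacity: usize,
    used: usize,
    clock: u64,
    next_id: u64,
    blocks: HashMap<u64, Block>,
}

impl UnifiedPool {
    pub fn new(capacity: usize) -> Self {
        UnifiedPool { capacity, used: 0, clock: 0, next_id: 0, blocks: HashMap::new() }
    }

    pub fn used(&self) -> usize {
        self.used
    }

    pub fn contains(&self, id: u64) -> bool {
        self.blocks.contains_key(&id)
    }

    pub fn kind_of(&self, id: u64) -> Option<BlockKind> {
        self.blocks.get(&id).map(|b| b.kind)
    }

    /// Reserve `bytes`, evicting least-recently-used unpinned blocks if needed.
    /// Returns None (and evicts nothing) if the request cannot fit.
    pub fn alloc(&mut self, kind: BlockKind, bytes: usize, pinned: bool) -> Option<u64> {
        let evictable: usize =
            self.blocks.values().filter(|b| !b.pinned).map(|b| b.bytes).sum();
        if self.used - evictable + bytes > self.capacity {
            return None;
        }
        while self.used + bytes > self.capacity {
            let victim = self
                .blocks
                .iter()
                .filter(|(_, b)| !b.pinned)
                .min_by_key(|(_, b)| b.last_used)
                .map(|(id, _)| *id)?;
            self.free(victim);
        }
        self.clock += 1;
        let id = self.next_id;
        self.next_id += 1;
        self.blocks.insert(id, Block { kind, bytes, last_used: self.clock, pinned });
        self.used += bytes;
        Some(id)
    }

    /// Mark a block as just used so it is evicted last.
    pub fn touch(&mut self, id: u64) -> bool {
        self.clock += 1;
        let now = self.clock;
        match self.blocks.get_mut(&id) {
            Some(b) => {
                b.last_used = now;
                true
            }
            None => false,
        }
    }

    /// Release a block; returns false if the id is unknown.
    pub fn free(&mut self, id: u64) -> bool {
        match self.blocks.remove(&id) {
            Some(b) => {
                self.used -= b.bytes;
                true
            }
            None => false,
        }
    }
}
```

With a 100-byte budget, allocating two 40-byte blocks, touching the first, and then requesting a third 40-byte block evicts the second one. A request that could only fit by evicting pinned blocks is refused without evicting anything, which is one possible answer to the question of what happens when the pool is exhausted.\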
|
||
\pard\pardeftab720\sa240\partightenfactor0
|
||
\cf0 \strokec2 The above
|
||
\f0\b \strokec2 Core Plan
|
||
\f1\b0 \strokec2 items are the must-haves to get an initial version of
|
||
\f2\i \strokec2 ruvllm
|
||
\f1\i0 \strokec2 that is
|
||
\f0\b \strokec2 predictable, memory-efficient, and extensible
|
||
\f1\b0 \strokec2 . To summarize, after this stage: KV cache usage will be under control via quantization; adapter overhead will be managed via unified pooling; our system will be modular with safe, portable kernel plugins; and we\'92ll be equipped to enforce strict execution budgets (no stalls or leaks). These are all foundations for a system that can run
|
||
\f2\i \strokec2 continuously
|
||
\f1\i0 \strokec2 at the edge without surprises.\
|
||
\pard\pardeftab720\sa280\partightenfactor0
|
||
|
||
\f0\b\fs28 \cf0 \strokec2 Stretch Plan (Next Phases Once Core is Stable)\
|
||
\pard\tx220\tx720\pardeftab720\li720\fi-720\sa240\partightenfactor0
|
||
\ls3\ilvl0
|
||
\fs24 \cf0 \kerning1\expnd0\expndtw0 \outl0\strokewidth0 {\listtext 1 }\expnd0\expndtw0\kerning0
|
||
\outl0\strokewidth0 \strokec2 WASM-Accelerated Attention Kernels with SIMD:
|
||
\f1\b0 With the core in place (especially the scratch buffer strategy for KV), we can attempt to move the
|
||
\f0\b attention computation itself into a WASM kernel
|
||
\f1\b0 for even greater portability. This is challenging because attention involves a lot of data movement and computation (dot products, softmax) on potentially large matrices. However, by the time the core is stable, we\'92ll have a good handle on memory layouts and will have SIMD support. We can write an optimized attention step (for one head or a block of heads) in a low-level language, or use a portable SIMD library, then compile to WASM. The goal is to see minimal performance loss relative to native. If achieved, this means the 
|
||
\f2\i entire
|
||
\f1\i0 model forward pass could run in the sandbox (important for untrusted code execution scenarios). We will attempt this only after verifying that copying cost (from host memory to WASM memory and back) doesn\'92t outweigh the benefits. One approach is to use
|
||
\f0\b shared memory
|
||
\f1\b0 so that the WASM module can directly read the model weights or KV buffers without copying (Wasmtime supports a mechanism for this in some configurations). Success here would make our engine extremely flexible \'96 imagine downloading a new attention algorithm as a WASM module and plugging it in at runtime.\
|
||
\ls3\ilvl0
|
||
\f0\b \kerning1\expnd0\expndtw0 \outl0\strokewidth0 {\listtext 2 }\expnd0\expndtw0\kerning0
|
||
\outl0\strokewidth0 \strokec2 Adapter-Aware Batching (Multi-Tenant Batching):
|
||
\f1\b0 In the core system, we will already support applying different LoRA adapters to different sessions. A stretch goal is to implement
|
||
\f0\b heterogeneous batching
|
||
\f1\b0 across requests that use different adapters. S-LoRA already demonstrated this with custom kernels that batch the LoRA computation across requests using different adapters. Initially, we might run requests for different adapters in separate batches, but to fully utilize the hardware, we want to batch them together when possible. We can design the compute such that, for example, during the forward pass we compute the base model output once and then, for each adapter in the batch, apply its rank-r update (possibly via a fused kernel per adapter). This requires careful scheduling: all requests in the batch share the base computation, then each gets its adapter contribution added. We don\'92t need the full generality of S-LoRA\'92s solution at first; we can start with a simpler approach (e.g., limit to 2\'964 distinct adapters in a batch and handle them explicitly). The key is that nothing in the core design should prevent this \'96 indeed our unified memory pool and kernel ABI should make it feasible. Achieving this will further improve throughput for multi-tenant scenarios (common on edge boxes that handle requests from different users/models). It also forces our memory manager to be robust (since multiple adapters must reside in GPU/CPU memory simultaneously during a batch). We\'92ll take cues from S-LoRA and Punica on how they batch and schedule multi-adapter workloads, ensuring 
|
||
\f2\i ruvllm
|
||
\f1\i0 can eventually do the same.\
|
||
\ls3\ilvl0
|
||
\f0\b \kerning1\expnd0\expndtw0 \outl0\strokewidth0 {\listtext 3 }\expnd0\expndtw0\kerning0
|
||
\outl0\strokewidth0 \strokec2 Formal KV Cache \'93Scheduler\'94 Policies:
|
||
\f1\b0 Once basic KV quantization is in place, we can enhance it with smarter dynamic policies. The idea is to make KV cache precision
|
||
\f0\b adaptive
|
||
\f1\b0 and the process
|
||
\f0\b auditable
|
||
\f1\b0 . For example, the engine could monitor per-token latency (p95 or p99) and, if it\'92s well under budget, decide to spend extra time compressing older KV blocks (e.g. quantize from 8-bit down to 4-bit) to save memory. Conversely, if certain quality metrics start degrading (or if the model perplexity drifts up as measured on recent tokens), the engine might pause quantization or bump certain layers back to higher precision. This is essentially a 
|
||
\f0\b KV cache scheduler
|
||
\f1\b0 that decides
|
||
\f2\i when
|
||
\f1\i0 and
|
||
\f2\i how aggressively
|
||
\f1\i0 to quantize. It should operate deterministically given the same sequence of inputs (for reproducibility). A concrete policy could be:
|
||
\f2\i \'93After every 50 tokens generated, quantize all KV blocks older than 100 tokens to 4-bit unless doing so has previously caused >5% increase in loss, and ensure at least 1 block (tail) remains in FP16\'94
|
||
\f1\i0 . These rules can be encoded in a policy module. We will also
|
||
\f0\b log every quantization change as an event
|
||
\f1\b0 (with layer, precision, reason) so that debugging and auditing are possible \'96 you can see exactly when the engine decided to compress memory and when it reverted. This feature is forward-looking; it draws on the idea of 
|
||
\f0\b predictable and transparent scheduling
|
||
\f1\b0 common in OS design (and hinted at by some LLM cache research). By implementing a simple version, we lay the groundwork for more complex memory management in the future (and build trust \'96 users can see and control the trade-offs). This scheduler could also interface with external signals \'96 e.g., a monitoring agent could instruct our runtime to \'93enter low-memory mode\'94 which triggers more aggressive KV quantization across the board.\
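\pard\pardeftab720\sa240\partightenfactor0
\f1\b0 \cf0 \strokec2 The concrete policy quoted above (every 50 tokens, compress blocks older than 100 tokens to 4-bit, keep the tail in FP16, skip if quality was hurt before) can be sketched as a small deterministic policy module with an event log. All type and field names below are hypothetical, and the quality check is reduced to a flag that the caller sets; how the loss increase is measured is left open here.\

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Precision {
    Fp16,
    Int4,
}

#[derive(Debug, Clone, PartialEq, Eq)]
pub struct KvBlock {
    pub layer: usize,
    pub last_token: usize, // position of the newest token stored in this block
    pub precision: Precision,
}

/// One audit-log entry per precision change.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct QuantEvent {
    pub token_step: usize,
    pub layer: usize,
    pub last_token: usize,
    pub to: Precision,
    pub reason: &'static str,
}

pub struct KvScheduler {
    pub interval: usize,    // run the policy every `interval` generated tokens
    pub min_age: usize,     // only blocks older than this many tokens are compressed
    pub quality_veto: bool, // set by the caller when 4-bit previously hurt loss too much
    pub log: Vec<QuantEvent>,
}

impl KvScheduler {
    pub fn new(interval: usize, min_age: usize) -> Self {
        KvScheduler { interval, min_age, quality_veto: false, log: Vec::new() }
    }

    /// Apply the policy at generation step `token_step`.
    /// Blocks are visited in slice order, so the same inputs give the same log.
    pub fn step(&mut self, token_step: usize, blocks: &mut [KvBlock]) {
        if self.interval == 0 || token_step == 0 || token_step % self.interval != 0 {
            return;
        }
        if self.quality_veto {
            return;
        }
        // Blocks holding the newest tokens (the tail) always stay in FP16.
        let tail = blocks.iter().map(|b| b.last_token).max();
        for b in blocks.iter_mut() {
            let is_tail = Some(b.last_token) == tail;
            let age = token_step.saturating_sub(b.last_token);
            if !is_tail && age > self.min_age && b.precision == Precision::Fp16 {
                b.precision = Precision::Int4;
                self.log.push(QuantEvent {
                    token_step,
                    layer: b.layer,
                    last_token: b.last_token,
                    to: Precision::Int4,
                    reason: "block older than min_age",
                });
            }
        }
    }
}
```

Calling step with KvScheduler::new(50, 100) at token 200 compresses only blocks whose newest token is more than 100 positions back, and each change appends one entry to the log. Because the policy reads nothing but its arguments and its own fields, replaying the same sequence of calls reproduces the same log.\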
|
||
\pard\pardeftab720\sa280\partightenfactor0
|
||
|
||
\f0\b\fs28 \cf0 \strokec2 Frontier Plan (Long-Term Innovations)\
|
||
\pard\pardeftab720\sa240\partightenfactor0
|
||
|
||
\f1\b0\fs24 \cf0 \strokec2 Finally, if we look further out, there are a few
|
||
\f0\b \strokec2 disruptive ideas
|
||
\f1\b0 \strokec2 that could push
|
||
\f2\i \strokec2 ruvllm
|
||
\f1\i0 \strokec2 to true state-of-the-art and beyond, especially in low-memory edge environments:\
|
||
\pard\tx220\tx720\pardeftab720\li720\fi-720\sa240\partightenfactor0
|
||
\ls4\ilvl0
|
||
\f0\b \cf0 \kerning1\expnd0\expndtw0 \outl0\strokewidth0 {\listtext 1 }\expnd0\expndtw0\kerning0
|
||
\outl0\strokewidth0 \strokec2 Extreme KV Rematerialization:
|
||
\f1\b0 Build on the concept from XQuant and similar research to implement a mode where
|
||
\f2\i ruvllm
|
||
\f1\i0 
|
||
\f2\i does not store KV activations at all
|
||
\f1\i0 , but instead
|
||
\f0\b recomputes or regenerates them as needed
|
||
\f1\b0 . For instance, instead of caching every Key and Value vector, we might cache a smaller \'93fingerprint\'94 of the intermediate state (like the
|
||
\f2\i X
|
||
\f1\i0 that XQuant stores or perhaps a compressed hidden state before projection) and on each new token, recompute the necessary K/V from that. This could drastically reduce memory \'96 potentially enabling context lengths in the millions on edge devices \'96 at the cost of extra compute per token. For devices that have idle compute capacity (or specialized accelerators), this trade could be worthwhile when memory is the hard limit. We\'92d need to integrate this with our scheduling: e.g., only turn on rematerialization beyond a certain context length or when memory usage hits a cap. It\'92s essentially taking quantization to the limit (1-bit or mathematical regeneration). The challenge is keeping it fast; we might use multi-threading or SIMD to recompute multiple layers of KV in parallel if needed. This \'93compute instead of cache\'94 approach would differentiate
|
||
\f2\i ruvllm
|
||
\f1\i0 for ultra-long contexts. It\'92s a frontier area, but the pieces we will have (unified memory, kernel plugins, etc.) will help experiment with it.\
|
||
\ls4\ilvl0
|
||
\f0\b \kerning1\expnd0\expndtw0 \outl0\strokewidth0 {\listtext 2 }\expnd0\expndtw0\kerning0
|
||
\outl0\strokewidth0 \strokec2 True Unified Memory Paging Across All Components:
|
||
\f1\b0 We plan a unified pool for KV scratch and adapters; the ultimate vision is to treat
|
||
\f2\i all
|
||
\f1\i0 model memory (KV cache, adapter weights, and even the model weights or other state) under one unified memory management system with multiple tiers (HBM/GPU, CPU RAM, perhaps disk/SSD). In a sense, this would function like a
|
||
\f0\b virtual memory system for the LLM
|
||
\f1\b0 , where each type of data (model weights, activations, cache, deltas) can be evicted or swapped out in a coordinated way. Some early signs of this are in systems like
|
||
\f0\b LMCache
|
||
\f1\b0 , which offloads KV to CPU or disk with a controller API. We\'92d extend that concept to include LoRA and other aux data. The benefit is that
|
||
\f2\i ruvllm
|
||
\f1\i0 could then
|
||
\f0\b run indefinitely (as a service) without ever hitting an out-of-memory error
|
||
\f1\b0 , because it would have mechanisms to spill to secondary storage deterministically. It would also make consolidation easier: e.g., on a multi-model edge server, a single pool could be shared among multiple model instances, improving utilization. Achieving this \'93runs forever\'94 reliability is the difference between a demo and a production service. We will aim for a design where adding a \'93unified paging\'94 module later is feasible \'96 likely by abstracting memory accesses through a layer that can decide to fetch from a slower tier. Logging and determinism will be critical here, as well as avoiding thrashing (hence the importance of the earlier scheduler component).\
|
||
\ls4\ilvl0
|
||
\f0\b \kerning1\expnd0\expndtw0 \outl0\strokewidth0 {\listtext 3 }\expnd0\expndtw0\kerning0
|
||
\outl0\strokewidth0 \strokec2 WASI-NN for Auxiliary Models:
|
||
\f1\b0 While the main LLM forward pass will use our custom kernels for performance, we can leverage
|
||
\f0\b WASI-NN and the WASM plugin approach to run
|
||
\f3\i auxiliary models
|
||
\f1\i0\b0 alongside the LLM. For example, an edge assistant might have a small
|
||
\f0\b gating or routing model
|
||
\f1\b0 that decides if a request should even go to the LLM, or an
|
||
\f0\b anomaly detector
|
||
\f1\b0 that inspects outputs for safety. These smaller models (say a 50 MB vision model or a 100 MB classifier) could be run via the WASI-NN interface inside a WASM sandbox. The advantage is that we can reuse highly optimized engines (such as OpenVINO or ONNX Runtime) for these tasks by exposing them to the WASM module. Meanwhile, the primary LLM stays on our own optimized path, where existing frameworks may not yet offer what we need. This hybrid approach keeps the 
|
||
\f0\b hot path
|
||
\f1\b0 (token generation) under tight control and maximum speed, while still allowing
|
||
\f0\b extensibility for supporting tasks
|
||
\f1\b0 using the broader ML ecosystem. Concretely, we could define a WASM ABI for calling out to \'93side models\'94 asynchronously. This frontier idea means
|
||
\f2\i ruvllm
|
||
\f1\i0 could become not just a single-model server but an intelligent host that safely runs multiple models in concert (each in its sandbox), e.g., for multi-modal capabilities or guardrail systems, without compromising the main model\'92s performance.\
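\pard\pardeftab720\sa240\partightenfactor0
\f1\b0 \cf0 \strokec2 The memory-versus-compute trade in the rematerialization item can be made concrete with a little arithmetic. The sketch below compares caching K and V against caching only the layer input X and recomputing K and V from it. The struct, the function names and the example shapes are assumptions chosen for illustration; quantization of X and any scale overhead are ignored.\

```rust
/// Illustrative model shape; field names are assumptions for this sketch.
#[derive(Debug, Clone, Copy)]
pub struct ModelShape {
    pub n_layers: usize,
    pub d_model: usize,
    pub n_kv_heads: usize,
    pub head_dim: usize,
}

/// Bytes per token when both K and V are cached for every layer.
pub fn kv_bytes_per_token(m: &ModelShape, bytes_per_elem: usize) -> usize {
    2 * m.n_layers * m.n_kv_heads * m.head_dim * bytes_per_elem
}

/// Bytes per token when only the layer input X is cached.
pub fn x_bytes_per_token(m: &ModelShape, bytes_per_elem: usize) -> usize {
    m.n_layers * m.d_model * bytes_per_elem
}

/// Multiply-accumulates needed to rebuild K and V for one cached token in one
/// layer: two projections from d_model to n_kv_heads * head_dim.
pub fn remat_macs_per_token_per_layer(m: &ModelShape) -> usize {
    2 * m.d_model * m.n_kv_heads * m.head_dim
}
```

For a multi-head shape with 32 layers, d_model 4096 and 32 KV heads of dimension 128 at 2 bytes per element, the K/V cache costs 524288 bytes per token and the X cache 262144, a factor of two before any quantization. With grouped-query attention (for example 8 KV heads) the K/V cache is already smaller than X, so this naive form of the idea saves nothing there. The recompute cost grows with the number of cached tokens, which is why the text suggests enabling the mode only past a context-length or memory threshold.\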
|
||
\pard\pardeftab720\sa298\partightenfactor0
|
||
|
||
\f0\b\fs36 \cf0 \strokec2 Recommended Specification Changes\
|
||
\pard\pardeftab720\sa240\partightenfactor0
|
||
|
||
\f1\b0\fs24 \cf0 \strokec2 To implement the above plan in
|
||
\f2\i \strokec2 ruvllm
|
||
\f1\i0 \strokec2 , here are the key changes and additions we would make to the current system specification:\
|
||
\pard\tx220\tx720\pardeftab720\li720\fi-720\sa240\partightenfactor0
|
||
\ls5\ilvl0
|
||
\f0\b \cf0 \kerning1\expnd0\expndtw0 \outl0\strokewidth0 {\listtext 1 }\expnd0\expndtw0\kerning0
|
||
\outl0\strokewidth0 \strokec2 Dual-Format KV Cache API:
|
||
\f1\b0 Everywhere the KV cache is referenced in the spec, update it to handle
|
||
\f0\b independent Key and Value formats
|
||
\f1\b0 . For example, instead of a single
|
||
\f4\fs26 dtype
|
||
\f1\fs24 for the entire KV, the interface might carry
|
||
\f4\fs26 key_dtype
|
||
\f1\fs24 and
|
||
\f4\fs26 value_dtype
|
||
\f1\fs24 , and functions operating on the cache must handle the possibility that each has its own precision or storage layout. Additionally, add support for
|
||
\f2\i per-layer
|
||
\f1\i0 KV settings \'96 e.g. an array of formats indexed by layer. This allows the runtime to use mixed precision strategies (as found in KIVI\'92s analysis of key vs value distributions). Internally, the KV cache structure could hold pointers (or offsets) to separate key and value storage for each layer, rather than assuming a contiguous array of homogeneous type. This change lays the groundwork for both asymmetrical quantization and easier experimentation (you could plug in a new compression for values without touching keys, for instance).\
|
||
\ls5\ilvl0
|
||
\f0\b \kerning1\expnd0\expndtw0 \outl0\strokewidth0 {\listtext 2 }\expnd0\expndtw0\kerning0
|
||
\outl0\strokewidth0 \strokec2 Unified Memory Pool Manager:
|
||
\f1\b0 Introduce a new component (module) in the spec dedicated to memory management of transient data \'96 essentially an
|
||
\f0\b arena allocator for both KV scratch and adapter weights
|
||
\f1\b0 (and any other temporary tensors). The spec should define how this allocator works: e.g., it reserves a fixed maximum size (perhaps configurable per deployment), and it exposes operations to allocate/free blocks for \'93KV-dequant buffer\'94 or \'93LoRA weights buffer\'94 etc. Importantly, this manager would implement policies like pooling and paging. For instance, it might have an API to \'93pin\'94 certain data in GPU memory or to \'93evict\'94 least-recently-used blocks. We saw in S-LoRA that a unified memory pool is key to avoiding fragmentation and handling varying tensor sizes smoothly. By specifying a single allocator for all transient data, we prevent each subsystem from over-provisioning. The spec should also clarify the behavior when the pool is exhausted (e.g., block until space is freed, or evict something \'96 likely the latter with a strategy defined). This unified pool concept aligns with both S-LoRA\'92s unified paging and LMCache\'92s unified CPU pool for caches. It will make the system more robust under load (no unpredictable OOMs due to many adapters or long prompts, as long as total demand stays within the configured limit).\
|
||
\ls5\ilvl0
|
||
\f0\b \kerning1\expnd0\expndtw0 \outl0\strokewidth0 {\listtext 3 }\expnd0\expndtw0\kerning0
|
||
\outl0\strokewidth0 \strokec2 Explicit Quantization Policy Configuration:
|
||
\f1\b0 Extend the model-loading or runtime configuration part of the spec to include a
|
||
\f0\b quantization policy schema
|
||
\f1\b0 . This would allow one to specify things like: quantization bit-width for KV (global or per-layer), group size (if using group quantization), \'93float tail\'94 length (how many recent tokens to keep in full precision), and thresholds for dynamic quantization triggers (if any). By having a structured policy, the engine can log and adhere to it strictly. For example, a policy might say: \'93Use 8-bit for layers 0\'9610, 4-bit for layers 11\'9623, keep last 32 tokens in float, group-size 16, outlier threshold 0.1%\'94 \'96 the runtime then follows this and we can verify that via logs. The spec should define a standard way to represent this (maybe a JSON or Rust struct). This will help with
|
||
\f0\b auditability
|
||
\f1\b0 \'96 one can replay a run and see exactly which decisions were made according to policy. It also separates concerns: the core engine deals with enforcing the policy, while the policy itself can be tuned or even learned. Following vLLM\'92s approach (among others), the spec should clearly separate storage precision from compute precision \'96 e.g., it could state \'93KV cache may be stored at lower precision than used in attention computations; the system must convert as needed transparently.\'94 This makes it clear to users that storage dtype and compute dtype are different knobs.\
|
||
\ls5\ilvl0
|
||
\f0\b \kerning1\expnd0\expndtw0 \outl0\strokewidth0 {\listtext 4 }\expnd0\expndtw0\kerning0
|
||
\outl0\strokewidth0 \strokec2 Decoupled Storage vs Compute Precision:
|
||
\f1\b0 As mentioned, formally update the spec (and code) to treat
|
||
\f0\b storage precision and compute precision as distinct
|
||
\f1\b0 for model data. This applies not just to KV (where we store quantized and compute in float) but potentially to weights (one could imagine using 8-bit weights but converting to 16-bit on the fly for multiplication if needed, etc.). The documentation should stress that any quantized representation will be
|
||
\f2\i losslessly (or near-losslessly)
|
||
\f1\i0 converted to a higher precision for actual math operations. By following this pattern (seen in practice with vLLM\'92s FP8 KV cache), we ensure that accuracy is easier to maintain and reasoning about precision is simpler. In implementation, this might mean providing utility functions to \'93prepare tensor for compute\'94 which does dequantization, and making sure kernels always assume they might receive quantized inputs and need to convert. Logging here is useful too \'96 e.g., when a block was dequantized for use, record it. This change is partly philosophical: treat quantization as a compression technique for transit/storage, not as a different kind of \'93tensor\'94 from the compute perspective. It will simplify kernel development and debugging (kernels can mostly assume floats), and align with hardware trends (many new GPUs handle FP8<->FP16 conversion in hardware seamlessly).\
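\pard\pardeftab720\sa240\partightenfactor0
\f1\b0 \cf0 \strokec2 The dual-format cache and the quantization policy schema from the list above could be represented as a plain Rust struct. This is a minimal sketch, assuming hypothetical names (KvDtype, LayerKvFormat, QuantPolicy) and assuming that 8-bit means an integer format and that the float tail is kept in the compute dtype; the spec itself would have to settle those choices.\

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum KvDtype {
    Fp16,
    Fp8E4M3,
    Int8,
    Int4,
}

impl KvDtype {
    pub fn bits(self) -> usize {
        match self {
            KvDtype::Fp16 => 16,
            KvDtype::Fp8E4M3 | KvDtype::Int8 => 8,
            KvDtype::Int4 => 4,
        }
    }
}

/// Keys and values carry independent storage formats.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct LayerKvFormat {
    pub key_dtype: KvDtype,
    pub value_dtype: KvDtype,
}

#[derive(Debug, Clone, PartialEq, Eq)]
pub struct QuantPolicy {
    pub layers: Vec<LayerKvFormat>, // storage format, indexed by layer
    pub float_tail_tokens: usize,   // newest tokens kept unquantized
    pub group_size: usize,          // elements sharing one quantization scale
    pub compute_dtype: KvDtype,     // precision used for the attention math
}

impl QuantPolicy {
    /// The example from the text: 8-bit for layers 0 to 10, 4-bit for layers
    /// 11 to 23, last 32 tokens in float, group size 16.
    pub fn example() -> Self {
        let mut layers = Vec::new();
        for layer in 0..24 {
            let d = if layer <= 10 { KvDtype::Int8 } else { KvDtype::Int4 };
            layers.push(LayerKvFormat { key_dtype: d, value_dtype: d });
        }
        QuantPolicy {
            layers,
            float_tail_tokens: 32,
            group_size: 16,
            compute_dtype: KvDtype::Fp16,
        }
    }

    /// Storage format for a token `age` positions behind the newest one.
    /// Returns None for a layer index outside the policy.
    pub fn storage_format(&self, layer: usize, age: usize) -> Option<LayerKvFormat> {
        let fmt = *self.layers.get(layer)?;
        if age < self.float_tail_tokens {
            let d = self.compute_dtype;
            Some(LayerKvFormat { key_dtype: d, value_dtype: d })
        } else {
            Some(fmt)
        }
    }

    /// Stored bits for one token across all layers, with `width` elements per
    /// key and per value in each layer. Group scales are ignored here.
    pub fn bits_per_token(&self, width: usize, age: usize) -> usize {
        (0..self.layers.len())
            .filter_map(|l| self.storage_format(l, age))
            .map(|f| width * (f.key_dtype.bits() + f.value_dtype.bits()))
            .sum()
    }
}
```

Because key_dtype and value_dtype are separate fields, a policy can give values a different format from keys in any layer without touching the rest of the structure. Keeping compute_dtype alongside the per-layer storage formats makes the storage-versus-compute distinction explicit in the configuration itself.\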
|
||
\pard\pardeftab720\sa240\partightenfactor0
|
||
\cf0 \strokec2 By making the above spec changes, we would get a system that is far more aligned with current best practices and ready for the next steps. Specifically, we add more flexibility in memory and precision management, which are crucial for edge scenarios.\
|
||
\pard\pardeftab720\sa298\partightenfactor0
|
||
|
||
\f0\b\fs36 \cf0 \strokec2 Expected Outcomes\
|
||
\pard\pardeftab720\sa240\partightenfactor0
|
||
|
||
\f1\b0\fs24 \cf0 \strokec2 If we execute this plan, the resulting
|
||
\f2\i \strokec2 ruvllm
|
||
\f1\i0 \strokec2 runtime will be a
|
||
\f0\b \strokec2 predictable, auditable, and high-performance edge inference engine
|
||
\f1\b0 \strokec2 with several key advantages:\
|
||
\pard\tx220\tx720\pardeftab720\li720\fi-720\sa240\partightenfactor0
|
||
\ls6\ilvl0
|
||
\f0\b \cf0 \kerning1\expnd0\expndtw0 \outl0\strokewidth0 {\listtext \uc0\u8226 }\expnd0\expndtw0\kerning0
|
||
\outl0\strokewidth0 \strokec2 Memory Bottlenecks Tamed:
|
||
\f1\b0 The KV cache will no longer be a mysterious source of out-of-memory errors or latency spikes. With quantization and unified management, its footprint is bounded and under control. Long contexts can be handled more gracefully on limited hardware (e.g., quantizing old tokens means you can support chat history 2\'964\'d7 longer within the same RAM). Adapters similarly become lightweight to serve \'96 you can host hundreds of personalized variants and only pay marginal cost per active one, instead of needing separate copies of the whole model.\
|
||
\ls6\ilvl0
|
||
\f0\b \kerning1\expnd0\expndtw0 \outl0\strokewidth0 {\listtext \uc0\u8226 }\expnd0\expndtw0\kerning0
|
||
\outl0\strokewidth0 \strokec2 System-Wide Efficiency:
|
||
\f1\b0 Through unified pooling and scheduling, all parts of the model (base weights, KV, LoRAs, scratch) share resources cooperatively. This avoids the fragmentation and over-allocation that plague less integrated solutions. It also means
|
||
\f2\i ruvllm
|
||
\f1\i0 can
|
||
\f0\b run continuously (\'93forever\'94)
|
||
\f1\b0 in an edge environment without needing manual restarts to clear caches \'96 it has its own internal \'93garbage collection\'94 and paging for model data. The worst-case memory usage is predictable and capped, which is vital for production services.\
|
||
\ls6\ilvl0
|
||
\f0\b \kerning1\expnd0\expndtw0 \outl0\strokewidth0 {\listtext \uc0\u8226 }\expnd0\expndtw0\kerning0
|
||
\outl0\strokewidth0 \strokec2 Modularity and Upgradability:
|
||
\f1\b0 By using a plugin kernel approach (with WASM as the vehicle), we achieve a level of modularity where kernels can be improved or specialized for different platforms without changing the core. For example, if a new faster matrix multiply library comes out, one could package it as a WASM plugin; or if a security issue is found in an old kernel, it can be updated independently. This is a step towards treating model execution like loading drivers \'96 the core orchestrator doesn\'92t need recompilation for every tweak. It also means
|
||
\f2\i ruvllm
|
||
\f1\i0 could be extended with new ops or model architectures by adding modules, not by overhauling the engine. This
|
||
\f0\b decoupling of policy and mechanism
|
||
\f1\b0 (Rust core sets policy, WASM/Kernel modules do mechanism) follows good systems design and makes the system easier to maintain.\
|
||
\ls6\ilvl0
|
||
\f0\b \kerning1\expnd0\expndtw0 \outl0\strokewidth0 {\listtext \uc0\u8226 }\expnd0\expndtw0\kerning0
|
||
\outl0\strokewidth0 \strokec2 Cross-Platform Edge Deployment:
|
||
\f1\b0 The use of Rust and WASM together lets our engine run on a wide range of edge devices. Rust gives native performance on any platform we can compile to, and WASM gives a safe, sandboxed fallback for portability. An outcome of the plan is that we could have, for instance, the same model and code running on an x86 server, an ARM laptop, and a browser environment (through a WASI shim) with only minor differences in the loaded kernel modules. This level of portability is cutting-edge \'96 enabling 
|
||
\f2\i \'93run anywhere\'94
|
||
\f1\i0 agents that keep consistent behavior and safety checks across devices.\
|
||
\ls6\ilvl0
|
||
\f0\b \kerning1\expnd0\expndtw0 \outl0\strokewidth0 {\listtext \uc0\u8226 }\expnd0\expndtw0\kerning0
|
||
\outl0\strokewidth0 \strokec2 Always-On Reliability:
|
||
\f1\b0 Ultimately, the combination of all these features means
|
||
\f2\i ruvllm
|
||
\f1\i0 can serve as the backbone for
|
||
\f0\b always-on edge AI agents
|
||
\f1\b0 that stay responsive and
|
||
\f0\b \'93calm under load\'94
|
||
\f1\b0 . Surprises due to memory exhaustion or latency spikes will be minimized because we\'92ve built in backpressure and adaptation (quantize more if needed, etc.). Everything is auditable: if the model\'92s quality dips or it runs out of budget, you can trace it to a logged event (e.g., \'93KV for layer 10 quantized to 4-bit at 13:05:23 due to memory threshold\'94). This is crucial for trust and debugging, which in turn is crucial for deploying AI in the wild. Moreover, because the foundation is strong, new research ideas (like those in the Frontier section) can be incorporated without a rewrite \'96 we can try extreme strategies knowing the system\'92s modular pieces (schedulers, plugins, etc.) support experimentation.\
|
||
\pard\pardeftab720\sa240\partightenfactor0
|
||
\cf0 \strokec2 In essence, by following this plan we transform the original
|
||
\f2\i \strokec2 ruvllm
|
||
\f1\i0 \strokec2 spec into a
|
||
\f0\b \strokec2 state-of-the-art edge inference system
|
||
\f1\b0 \strokec2 that embodies the lessons from the latest research (2024\'962025) and anticipates future needs. It balances performance with safety and flexibility, enabling cutting-edge ML models to run efficiently at the edge. This positions us well for the coming era where AI services are expected to be ubiquitous, personal, and reliable.\
|
||
\pard\pardeftab720\sa240\partightenfactor0
|
||
|
||
\f0\b \cf0 \strokec2 Sources:
|
||
\f1\b0 \strokec2 \
|
||
\pard\tx220\tx720\pardeftab720\li720\fi-720\sa240\partightenfactor0
|
||
\ls7\ilvl0\cf0 \kerning1\expnd0\expndtw0 \outl0\strokewidth0 {\listtext 1 }\expnd0\expndtw0\kerning0
|
||
\outl0\strokewidth0 \strokec2 Hooper
|
||
\f2\i et al.
|
||
\f1\i0 ,
|
||
\f2\i \'93KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization,\'94
|
||
\f1\i0 NeurIPS 2024.\
|
||
\ls7\ilvl0\kerning1\expnd0\expndtw0 \outl0\strokewidth0 {\listtext 2 }\expnd0\expndtw0\kerning0
|
||
\outl0\strokewidth0 \strokec2 Liu
|
||
\f2\i et al.
|
||
\f1\i0 ,
|
||
\f2\i \'93KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache,\'94
|
||
\f1\i0 2024.\
|
||
\ls7\ilvl0\kerning1\expnd0\expndtw0 \outl0\strokewidth0 {\listtext 3 }\expnd0\expndtw0\kerning0
|
||
\outl0\strokewidth0 \strokec2 Tomar
|
||
\f2\i et al.
|
||
\f1\i0 ,
|
||
\f2\i \'93XQuant: Breaking the Memory Wall for LLM Inference with KV Cache Rematerialization,\'94
|
||
\f1\i0 arXiv 2508.10395, Aug 2025.\
|
||
\ls7\ilvl0
|
||
\f0\b \kerning1\expnd0\expndtw0 \outl0\strokewidth0 {\listtext 4 }\expnd0\expndtw0\kerning0
|
||
\outl0\strokewidth0 \strokec2 vLLM
|
||
\f1\b0 documentation \'96
|
||
\f2\i Quantized KV Cache
|
||
\f1\i0 feature description (FP8 storage with FP16 compute).\
|
||
\ls7\ilvl0\kerning1\expnd0\expndtw0 \outl0\strokewidth0 {\listtext 5 }\expnd0\expndtw0\kerning0
|
||
\outl0\strokewidth0 \strokec2 Cheng
|
||
\f2\i et al.
|
||
\f1\i0 ,
|
||
\f2\i \'93LMCACHE: An Efficient KV Cache Layer for Enterprise-Scale LLM Inference,\'94
|
||
\f1\i0 Tech Report 2024 (LMCache).\
|
||
\ls7\ilvl0\kerning1\expnd0\expndtw0 \outl0\strokewidth0 {\listtext 6 }\expnd0\expndtw0\kerning0
|
||
\outl0\strokewidth0 \strokec2 Sheng
|
||
\f2\i et al.
|
||
\f1\i0 ,
|
||
\f2\i \'93S-LoRA: Serving Thousands of Concurrent LoRA Adapters,\'94
|
||
\f1\i0 MLSys 2024.\
|
||
\ls7\ilvl0\kerning1\expnd0\expndtw0 \outl0\strokewidth0 {\listtext 7 }\expnd0\expndtw0\kerning0
|
||
\outl0\strokewidth0 \strokec2 NVIDIA Technical Blog,
|
||
\f2\i \'93Introducing DoRA, a High-Performing Alternative to LoRA,\'94
|
||
\f1\i0 Jun 2024.\
|
||
\ls7\ilvl0
|
||
\f0\b \kerning1\expnd0\expndtw0 \outl0\strokewidth0 {\listtext 8 }\expnd0\expndtw0\kerning0
|
||
\outl0\strokewidth0 \strokec2 WASI-NN Specification
|
||
\f1\b0 \'96 Bytecode Alliance post,
|
||
\f2\i \'93Machine Learning in WebAssembly: Using wasi-nn,\'94
|
||
\f1\i0 2023.\
|
||
\ls7\ilvl0\kerning1\expnd0\expndtw0 \outl0\strokewidth0 {\listtext 9 }\expnd0\expndtw0\kerning0
|
||
\outl0\strokewidth0 \strokec2 Bytecode Alliance,
|
||
\f0\b Wasmtime
|
||
\f1\b0 examples \'96
|
||
\f2\i fuel and epoch-based interruption
|
||
\f1\i0 documentation.\
|
||
\ls7\ilvl0
|
||
\f0\b \kerning1\expnd0\expndtw0 \outl0\strokewidth0 {\listtext 10 }\expnd0\expndtw0\kerning0
|
||
\outl0\strokewidth0 \strokec2 WAMR
|
||
\f1\b0 (WebAssembly Micro Runtime) project docs \'96 LogRocket Blog,
|
||
\f2\i \'93WebAssembly runtimes compared,\'94
|
||
\f1\i0 2023.\
|
||
\ls7\ilvl0\kerning1\expnd0\expndtw0 \outl0\strokewidth0 {\listtext 11 }\expnd0\expndtw0\kerning0
|
||
\outl0\strokewidth0 \strokec2 Buehler,
|
||
\f2\i mistral.rs
|
||
\f1\i0 \'96 GitHub README and Jimmy Song\'92s overview, 2024.\
|
||
\ls7\ilvl0\kerning1\expnd0\expndtw0 \outl0\strokewidth0 {\listtext 12 }\expnd0\expndtw0\kerning0
|
||
\outl0\strokewidth0 \strokec2 Hugging Face
|
||
\f0\b Candle
|
||
\f1\b0 \'96 GitHub README (features: custom kernels, WASM support), 2023.\
|
||
\ls7\ilvl0\kerning1\expnd0\expndtw0 \outl0\strokewidth0 {\listtext 13 }\expnd0\expndtw0\kerning0
|
||
\outl0\strokewidth0 \strokec2 AWS Machine Learning Blog,
|
||
\f2\i \'93Multi-tenant LoRA serving with Sagemaker (S-LoRA and Punica),\'94
|
||
\f1\i0 2024.\
|
||
\pard\pardeftab720\qc\partightenfactor0
|
||
|
||
|
||
\pard\pardeftab720\partightenfactor0
|
||
|
||
\f1\fs24 \cf0 \strokec2 \
|
||
\
|
||
What\'92s missing?\
|
||
\
|
||
\
|
||
|
||
|
||
\f1\fs24 \cf0 \cb1 \strokec2 \
|
||
\
|
||
\pard\pardeftab720\sa240\partightenfactor0
|
||
\cf0 \strokec2 Here is what is missing right now. The plan is directionally correct, but it still lacks the pieces that make it runnable, measurable, and hard to break.\
|
||
\pard\tx220\tx720\pardeftab720\li720\fi-720\sa240\partightenfactor0
|
||
\ls8\ilvl0\cf0 \kerning1\expnd0\expndtw0 \outl0\strokewidth0 {\listtext 1 }\expnd0\expndtw0\kerning0
|
||
\outl0\strokewidth0 \strokec2 A committed attention and matmul path for the hot loop\uc0\u8232 Right now the plan says \'93defer attention in WASM,\'94 but it does not specify what runs attention today. You need a clear default backend choice and fallback order, for example a mistral.rs style paged attention backend for throughput and fragmentation control, with KV cache quantization support already proven in that codebase. \
|
||
\ls8\ilvl0\kerning1\expnd0\expndtw0 \outl0\strokewidth0 {\listtext 2 }\expnd0\expndtw0\kerning0
|
||
\outl0\strokewidth0 \strokec2 A real SIMD strategy that matches Rust reality in 2026\uc0\u8232 If you want deterministic speed on edge CPUs, you need to pick a SIMD approach that compiles everywhere and can multiversion per CPU. std::simd was still nightly-only in 2025, and people recommend crates like wide, pulp, or macerator depending on your needs. That decision affects every kernel you ship. \
|
||
\ls8\ilvl0\kerning1\expnd0\expndtw0 \outl0\strokewidth0 {\listtext 3 }\expnd0\expndtw0\kerning0
|
||
\outl0\strokewidth0 \strokec2 The missing KV cache algorithm layer between quantize and rematerialize\uc0\u8232 You listed KVQuant, KIVI, XQuant, but you have not specified when to use which, nor how to combine them safely. Also, there are newer KV quant methods like SQuat that change the error geometry rather than just lowering bits. That belongs in your decision tree and benchmark suite. \
|
||
\ls8\ilvl0\kerning1\expnd0\expndtw0 \outl0\strokewidth0 {\listtext 4 }\expnd0\expndtw0\kerning0
|
||
\outl0\strokewidth0 \strokec2 Unified paging is described, but the allocator and eviction policy are not specified\uc0\u8232 You need concrete answers for: page size, metadata layout, pinning rules, eviction order, concurrency model, and how you prevent thrash. Without this, \'93unified pool\'94 becomes a soft idea instead of a production subsystem.\
|
||
\ls8\ilvl0\kerning1\expnd0\expndtw0 \outl0\strokewidth0 {\listtext 5 }\expnd0\expndtw0\kerning0
|
||
\outl0\strokewidth0 \strokec2 The WASM boundary is not locked down\uc0\u8232 You need to decide whether you are shipping raw ABI calls or a component model interface, how memory is shared, and what happens on trap. You also need to commit to interruption mechanics. Wasmtime documents fuel and epoch interruption, with epoch generally faster, and the choice matters for tail latency. \
|
||
\ls8\ilvl0\kerning1\expnd0\expndtw0 \outl0\strokewidth0 {\listtext 6 }\expnd0\expndtw0\kerning0
|
||
\outl0\strokewidth0 \strokec2 Kernel pack supply chain and rollback is not fully designed\uc0\u8232 You mentioned signing, but you still need: manifest schema, signature rotation, allow lists, version compatibility gates, reproducible builds, and a safe rollback protocol that is deterministic under load.\
|
||
\ls8\ilvl0\kerning1\expnd0\expndtw0 \outl0\strokewidth0 {\listtext 7 }\expnd0\expndtw0\kerning0
|
||
\outl0\strokewidth0 \strokec2 Adapter serving needs a multi-tenant execution plan\uc0\u8232 Micro LoRA apply is specified, but what is missing is the serving strategy: adapter residency rules, batching rules when multiple adapters are active, adapter compression, and a unified paging contract that covers KV plus adapters together. The S-LoRA approach is the reference for why this matters. \
|
||
\ls8\ilvl0\kerning1\expnd0\expndtw0 \outl0\strokewidth0 {\listtext 8 }\expnd0\expndtw0\kerning0
|
||
\outl0\strokewidth0 \strokec2 Evaluation harness is not concrete enough\uc0\u8232 You need a mandatory benchmark and correctness suite that covers:\u8232 a) p50, p95, and p99 decode step latency\u8232 b) memory per token and KV growth curves\u8232 c) quality drift across long contexts\u8232 d) adapter correctness and regressions\u8232 e) quantization error accumulation and recovery\u8232 Until this exists, you will not know if the scheduler and gating rules are helping or silently hurting.\
|
||
\ls8\ilvl0\kerning1\expnd0\expndtw0 \outl0\strokewidth0 {\listtext 9 }\expnd0\expndtw0\kerning0
|
||
\outl0\strokewidth0 \strokec2 ruvector integration is still conceptual\uc0\u8232 You have not defined the exact role ruvector plays in the runtime loop. Missing choices include:\u8232 a) ruvector as the policy memory store for learned thresholds\u8232 b) ruvector as the session state index for adapter selection and cache locality\u8232 c) ruvector as the witness log index for postmortem and audit queries\u8232 You need concrete APIs and storage models for these.\
|
||
\ls8\ilvl0\kerning1\expnd0\expndtw0 \outl0\strokewidth0 {\listtext 10 }\expnd0\expndtw0\kerning0
|
||
\outl0\strokewidth0 \strokec2 A clear mapping of model formats and feature coverage\uc0\u8232 The plan does not specify model formats, tokenizer pipeline, and attention variants that must be supported from day one, like grouped query attention, sliding window attention, RoPE variants, and long context scaling. This affects your kernel ABI and test vectors.\
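As one possible starting point for the latency part of the evaluation harness (item 8a), the sketch below reduces per-step decode latencies to p50, p95, and p99. The function name latency_summary and the millisecond unit are assumptions made for illustration; the plan does not fix a harness API.\

```python
import statistics


def latency_summary(step_latencies_ms):
    # Hypothetical helper: summarize per-step decode latencies (milliseconds).
    if len(step_latencies_ms) < 2:
        raise ValueError("need at least two samples")
    # n=100 yields 99 cut points; cut point i-1 is the i-th percentile.
    cuts = statistics.quantiles(step_latencies_ms, n=100, method="inclusive")
    return dict(p50=cuts[49], p95=cuts[94], p99=cuts[98])
```

Tail percentiles are only meaningful with enough samples, so a harness built on this would likely record every decode step of a run rather than a handful of spot checks.\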
|
||
\pard\pardeftab720\sa240\partightenfactor0
|
||
\cf0 \strokec2 If you want the updated frontier plan to be actionable, the next step is to lock three things in writing:\
|
||
\pard\tx220\tx720\pardeftab720\li720\fi-720\sa240\partightenfactor0
|
||
\ls9\ilvl0\cf0 \kerning1\expnd0\expndtw0 \outl0\strokewidth0 {\listtext 1 }\expnd0\expndtw0\kerning0
|
||
\outl0\strokewidth0 \strokec2 Default inference backend for attention and matmul, with a defined fallback order, likely using mistral.rs as the reference for paged attention and quantized KV support \
|
||
\ls9\ilvl0\kerning1\expnd0\expndtw0 \outl0\strokewidth0 {\listtext 2 }\expnd0\expndtw0\kerning0
|
||
\outl0\strokewidth0 \strokec2 SIMD kernel strategy using crates that work today, since std::simd remains nightly-only \
|
||
\ls9\ilvl0\kerning1\expnd0\expndtw0 \outl0\strokewidth0 {\listtext 3 }\expnd0\expndtw0\kerning0
|
||
\outl0\strokewidth0 \strokec2 WASM execution budget mechanism, with epoch interruption as the default where possible \
|
||
\pard\pardeftab720\sa240\partightenfactor0
|
||
\cf0 \strokec2 Once those are fixed, everything else becomes an engineering pipeline instead of a research debate.\
|
||
\pard\pardeftab720\partightenfactor0
|
||
\cf0 \strokec2 \
|
||
Give me full implementation \
|
||
\
|
||
\pard\pardeftab720\sa240\partightenfactor0
|
||
\cf0 \strokec2 To give you a full implementation, could you clarify the scope and target first?\
|
||
\pard\tx220\tx720\pardeftab720\li720\fi-720\sa240\partightenfactor0
|
||
\ls10\ilvl0
|
||
\f0\b \cf0 \kerning1\expnd0\expndtw0 \outl0\strokewidth0 {\listtext 1 }\expnd0\expndtw0\kerning0
|
||
\outl0\strokewidth0 \strokec2 Hardware targets:
|
||
\f1\b0 What devices should this support (e.g. x86 edge servers, ARM SBCs, mobile, microcontrollers)?\
|
||
\ls10\ilvl0
|
||
\f0\b \kerning1\expnd0\expndtw0 \outl0\strokewidth0 {\listtext 2 }\expnd0\expndtw0\kerning0
|
||
\outl0\strokewidth0 \strokec2 Model types:
|
||
\f1\b0 Are you targeting Mistral-7B or similar LLMs, or smaller instruction-tuned models?\
|
||
\ls10\ilvl0
|
||
\f0\b \kerning1\expnd0\expndtw0 \outl0\strokewidth0 {\listtext 3 }\expnd0\expndtw0\kerning0
|
||
\outl0\strokewidth0 \strokec2 Deployment format:
|
||
\f1\b0 Should this build as a static binary, a dynamic runtime, or a WASM package?\
|
||
\ls10\ilvl0
|
||
\f0\b \kerning1\expnd0\expndtw0 \outl0\strokewidth0 {\listtext 4 }\expnd0\expndtw0\kerning0
|
||
\outl0\strokewidth0 \strokec2 Priorities:
|
||
\f1\b0 Should I optimize for minimal memory, max throughput, or multi-user batching?\
|
||
\ls10\ilvl0
|
||
\f0\b \kerning1\expnd0\expndtw0 \outl0\strokewidth0 {\listtext 5 }\expnd0\expndtw0\kerning0
|
||
\outl0\strokewidth0 \strokec2 Use of ruvector:
|
||
\f1\b0 Do you want this as a vector DB, witness logger, policy engine, or all of the above?\
|
||
\pard\pardeftab720\sa240\partightenfactor0
|
||
\cf0 \strokec2 Once I have that, I\'92ll generate the full code and build structure.\
|
||
\pard\pardeftab720\partightenfactor0
|
||
\cf0 \strokec2 \
|
||
Here is what is missing right now. The plan is directionally correct, but it still lacks the pieces that make it runnable, measurable, and hard to break. 1. A committed attention and matmul path for the hot loop Right now the plan says \'93defer attention in WASM,\'94 but it does not specify what runs attention today. You need a clear default backend choice and fallback order, for example a mistral.rs style paged attention backend for throughput and fragmentation control, with KV cache quantization support already proven in that codebase. 2. A real SIMD strategy that matches Rust reality in 2026 If you want deterministic speed on edge CPU, you need to pick a SIMD approach that compiles everywhere and can multiversion per CPU. std simd is still nightly in 2025, with people recommending crates like wide, pulp, or macerator depending on your needs. That decision affects every kernel you ship. 3. The missing KV cache algorithm layer between quantize and rematerialize You listed KVQuant, KIVI, XQuant, but you have not specified when to use which, nor how to combine them safely. Also, there are newer KV quant methods like SQuat that change the error geometry rather than just lowering bits. That belongs in your decision tree and benchmark suite. 4. Unified paging is described, but the allocator and eviction policy are not specified You need concrete answers for: page size, metadata layout, pinning rules, eviction order, concurrency model, and how you prevent thrash. Without this, \'93unified pool\'94 becomes a soft idea instead of a production subsystem. 5. The WASM boundary is not locked down You need to decide whether you are shipping raw ABI calls or a component model interface, how memory is shared, and what happens on trap. You also need to commit to interruption mechanics. Wasmtime documents fuel and epoch interruption, with epoch generally faster, and the choice matters for tail latency. 6. 
Kernel pack supply chain and rollback is not fully designed You mentioned signing, but you still need: manifest schema, signature rotation, allow lists, version compatibility gates, reproducible builds, and a safe rollback protocol that is deterministic under load. 7. Adapter serving needs a multi tenant execution plan Micro LoRA apply is specified, but what is missing is the serving strategy: adapter residency rules, batching rules when multiple adapters are active, adapter compression, and a unified paging contract that covers KV plus adapters together. The S LoRA approach is the reference for why this matters. 8. Evaluation harness is not concrete enough You need a mandatory benchmark and correctness suite that covers: a) p50 p95 p99 decode step latency b) memory per token and KV growth curves c) quality drift across long contexts d) adapter correctness and regressions e) quantization error accumulation and recovery Until this exists, you will not know if the scheduler and gating rules are helping or silently hurting. 9. ruvector integration is still conceptual You have not defined the exact role ruvector plays in the runtime loop. Missing choices include: a) ruvector as the policy memory store for learned thresholds b) ruvector as the session state index for adapter selection and cache locality c) ruvector as the witness log index for postmortem and audit queries You need concrete APIs and storage models for these. 10. A clear mapping of model formats and feature coverage The plan does not specify model formats, tokenizer pipeline, and attention variants that must be supported from day one, like grouped query attention, sliding window attention, rope variants, and long context scaling. This affects your kernel ABI and test vectors. If you want the updated frontier plan to be actionable, the next step is to lock three things in writing: 1. 
Default inference backend for attention and matmul, with a defined fallback order, likely using mistral.rs as the reference for paged attention and quantized KV support 2. SIMD kernel strategy using crates that work today, since std simd remains nightly for many setups 3. WASM execution budget mechanism, with epoch interruption as the default where possible Once those are fixed, everything else becomes an engineering pipeline instead of a research debate.\
|
||
\
|
||
\pard\pardeftab720\sa240\partightenfactor0
|
||
\cf0 \strokec2 Understood. I will research and specify:\
|
||
\pard\tx220\tx720\pardeftab720\li720\fi-720\sa240\partightenfactor0
|
||
\ls11\ilvl0\cf0 \kerning1\expnd0\expndtw0 \outl0\strokewidth0 {\listtext 1 }\expnd0\expndtw0\kerning0
|
||
\outl0\strokewidth0 \strokec2 A committed, default attention and matmul backend (e.g. mistral.rs, Candle, FlashAttention alternatives) with fallback order for edge inference.\
|
||
\ls11\ilvl0\kerning1\expnd0\expndtw0 \outl0\strokewidth0 {\listtext 2 }\expnd0\expndtw0\kerning0
|
||
\outl0\strokewidth0 \strokec2 A Rust-compatible SIMD strategy usable in production as of 2026 (e.g.
|
||
\f4\fs26 wide
|
||
\f1\fs24 ,
|
||
\f4\fs26 pulp
|
||
\f1\fs24 ,
|
||
\f4\fs26 safe_arch
|
||
\f1\fs24 , or, once it is stabilized,
|
||
\f4\fs26 std::simd
|
||
\f1\fs24 ), with multiversioning guidance.\
|
||
\ls11\ilvl0\kerning1\expnd0\expndtw0 \outl0\strokewidth0 {\listtext 3 }\expnd0\expndtw0\kerning0
|
||
\outl0\strokewidth0 \strokec2 WASM kernel execution limits\'97compare Wasmtime's fuel vs epoch interruption\'97and define a safe default.\
|
||
\pard\pardeftab720\sa240\partightenfactor0
|
||
\cf0 \strokec2 Once those are defined, the rest of the system can be locked down as an implementation pipeline.\
|
||
\pard\pardeftab720\partightenfactor0
|
||
|
||
|
||
\f1\fs24 \cf0 \cb1 \strokec2 \
|
||
\
|
||
\pard\pardeftab720\sa298\partightenfactor0
|
||
|
||
\f0\b\fs36 \cf0 Implementing the Frontier Plan: Filling the Gaps\
|
||
\pard\pardeftab720\sa298\partightenfactor0
|
||
\cf0 \strokec2 Introduction\
|
||
\pard\pardeftab720\sa240\partightenfactor0
|
||
|
||
\f1\b0\fs24 \cf0 \strokec2 The frontier inference plan is on the right track conceptually, but several critical implementation details are missing. This comprehensive plan addresses the ten identified gaps, turning a high-level design into a
|
||
\f0\b \strokec2 runnable, measurable, and robust
|
||
\f1\b0 \strokec2 system. We then lock in three key decisions (attention backend, SIMD strategy, and WASM budgeting) to make the plan actionable.\
|
||
\pard\pardeftab720\sa298\partightenfactor0
|
||
|
||
\f0\b\fs36 \cf0 \strokec2 1. Committed Attention and Matmul Path for the Hot Loop\
|
||
\pard\pardeftab720\sa240\partightenfactor0
|
||
|
||
\fs24 \cf0 Default Attention/Matmul Backend:
|
||
\f1\b0 \strokec2 We will implement a high-performance attention kernel and matrix multiplication backend for the main inference loop. The default choice is a
|
||
\f0\b \strokec2 paged attention
|
||
\f1\b0 \strokec2 mechanism inspired by
|
||
\f2\i \strokec2 mistral.rs
|
||
\f1\i0 \strokec2 , which has proven success in managing long contexts efficiently. In paged attention, the KV cache is split into fixed-size slots (\'93pages\'94) rather than one contiguous buffer. The attention computation uses an indirection table of indices into these pages, avoiding ever assembling a giant contiguous KV tensor. This design improves
|
||
\f0\b \strokec2 throughput
|
||
\f1\b0 \strokec2 and
|
||
\f0\b \strokec2 fragmentation control
|
||
\f1\b0 \strokec2 , as it reuses memory slots and avoids reallocating huge buffers. mistral.rs demonstrates the approach in practice: its paged attention reduces memory waste from fragmentation, and it also supports quantized KV cache entries (e.g. 8-bit) to further reduce overhead. We will follow this approach: store keys/values in a paged structure and have attention kernels gather from pages on the fly.\
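A minimal sketch of the page-table indirection described above, written in Python for brevity rather than as the Rust kernel itself. The class name PagedKV, its method names, and the page size of 4 tokens are assumptions for illustration; they are not mistral.rs APIs.\

```python
import math

PAGE_SIZE = 4  # tokens per page (assumed; chosen small for the example)


class PagedKV:
    # Hypothetical paged KV cache: a shared pool of fixed-size pages
    # plus one page table per sequence.
    def __init__(self, num_pages):
        self.pages = [[] for _ in range(num_pages)]
        self.free = list(range(num_pages))
        self.tables = dict()  # seq_id -> list of physical page ids

    def append(self, seq_id, key, value):
        table = self.tables.setdefault(seq_id, [])
        # Take a new page only when the sequence has none or its last one is full.
        if not table or len(self.pages[table[-1]]) == PAGE_SIZE:
            if not self.free:
                raise MemoryError("no free KV pages")
            table.append(self.free.pop())
        self.pages[table[-1]].append((key, value))

    def release(self, seq_id):
        # Return all pages of a finished sequence to the pool.
        for page_id in self.tables.pop(seq_id, []):
            self.pages[page_id].clear()
            self.free.append(page_id)

    def iter_kv(self, seq_id):
        # Walk the page table in logical order; no contiguous copy is built.
        for page_id in self.tables.get(seq_id, []):
            for pair in self.pages[page_id]:
                yield pair


def attend(cache, seq_id, query):
    # Single-head softmax attention that gathers K/V page by page.
    scores, values = [], []
    for key, value in cache.iter_kv(seq_id):
        dot = sum(q * k for q, k in zip(query, key))
        scores.append(dot / math.sqrt(len(query)))
        values.append(value)
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) / total
            for i in range(dim)]
```

Each sequence only ever holds whole pages, so ending a session returns its pages to the pool without moving data that belongs to other sequences.\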
|
||
|
||
\f0\b \strokec2 Quantized KV Support:
|
||
\f1\b0 \strokec2 The attention implementation will natively support
|
||
\f2\i \strokec2 quantized
|
||
\f1\i0 \strokec2 KV cache entries. For example, we can store keys and values in 8-bit or 4-bit formats during attention computations to save memory. This follows the approach demonstrated in Mistral.rs where KV cache quantization to FP8 was added to cut memory usage. Our attention kernel will treat quantized KV appropriately (dequantizing on read or operating in quantized space if possible) so that using lower precision doesn\'92t break the model\'92s computations.\
|
||
|
||
\f0\b \strokec2 Backend Fallback Order:
|
||
\f1\b0 \strokec2 We will define a clear fallback sequence for the attention/matmul computation in case the fastest path is not available on a given system:\
|
||
\pard\tx220\tx720\pardeftab720\li720\fi-720\sa240\partightenfactor0
|
||
\ls12\ilvl0
|
||
\f2\i \cf0 \kerning1\expnd0\expndtw0 \outl0\strokewidth0 {\listtext \uc0\u8226 }\expnd0\expndtw0\kerning0
|
||
\outl0\strokewidth0 \strokec2 Primary:
|
||
\f1\i0 a custom Rust SIMD kernel (see SIMD strategy below) implementing the
|
||
\f0\b paged attention
|
||
\f1\b0 with quantized KV. This is optimized for CPU inference on long contexts.\
|
||
\ls12\ilvl0
|
||
\f2\i \kerning1\expnd0\expndtw0 \outl0\strokewidth0 {\listtext \uc0\u8226 }\expnd0\expndtw0\kerning0
|
||
\outl0\strokewidth0 \strokec2 Secondary:
|
||
\f1\i0 if the environment provides specialized libraries (e.g. BLAS or vendor-optimized kernels) or a GPU is available, we can offload the matrix multiplies to those. For instance, on a GPU we might use a CUDA attention kernel (as in vLLM or FlashAttention), and on x86 we might call into oneDNN for large matmuls, while still using our paged scheme for KV management. The system will detect available backends and
|
||
\f0\b choose the best
|
||
\f1\b0 at runtime.\
|
||
\ls12\ilvl0
|
||
\f2\i \kerning1\expnd0\expndtw0 \outl0\strokewidth0 {\listtext \uc0\u8226 }\expnd0\expndtw0\kerning0
|
||
\outl0\strokewidth0 \strokec2 Tertiary:
|
||
\f1\i0 a simple reference implementation (contiguous full-precision KV cache with a straightforward attention loop) as a last resort. This ensures correctness on any platform even if performance is lower. It\'92s essentially a \'93naive\'94 attention that we can always fall back to for functional safety.\
|
||
\pard\pardeftab720\sa240\partightenfactor0
|
||
\cf0 \strokec2 By committing to this
|
||
\f0\b \strokec2 default paged-attention path
|
||
\f1\b0 \strokec2 with quantized KV, we ensure the main loop is efficient. On long contexts, our approach avoids the latency spikes from huge memory allocations and keeps throughput high by reusing fixed pages. When a platform can\'92t support that, the graceful degradation path is clearly defined by the fallback order above.\
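The fallback order above can be sketched as a priority list of probes, with the first backend that reports itself usable winning. This is written in Python for brevity; the backend names, the Backend tuple, and the simulated probe results are assumptions for illustration, not part of any existing API.\

```python
from collections import namedtuple

# Hypothetical backend descriptor: a name plus a probe that reports availability.
Backend = namedtuple("Backend", ["name", "probe"])


def select_backend(backends):
    # Walk the list in priority order and return the first usable backend.
    for backend in backends:
        try:
            if backend.probe():
                return backend.name
        except Exception:
            # A probe that crashes (missing driver, bad library) counts as unavailable.
            continue
    raise RuntimeError("no attention backend available")


def missing_gpu_probe():
    # Simulated failure for the example: the GPU driver cannot be loaded.
    raise OSError("CUDA driver not found")


# Simulated probe results for illustration only.
FALLBACK_ORDER = [
    Backend("simd_paged", lambda: False),         # primary: SIMD paged kernel
    Backend("vendor_or_gpu", missing_gpu_probe),  # secondary: BLAS / GPU offload
    Backend("reference", lambda: True),           # tertiary: naive attention
]
```

With these simulated probes, select_backend(FALLBACK_ORDER) returns the reference backend. Treating a crashing probe as unavailable keeps a broken driver from taking down startup, at the cost of hiding the reason unless it is also logged.\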
|
||
\pard\pardeftab720\sa298\partightenfactor0
|
||
|
||
\f0\b\fs36 \cf0 \strokec2 2. Real SIMD Strategy Matching Rust 2026 Reality\
|
||
\pard\pardeftab720\sa240\partightenfactor0
|
||
|
||
\fs24 \cf0 Choosing a SIMD Abstraction:
|
||
\f1\b0 \strokec2 To achieve deterministic speed on edge CPUs, we need to pick a SIMD approach that is
|
||
\f0\b \strokec2 portable
|
||
\f1\b0 \strokec2 and can
|
||
\f0\b \strokec2 multi-version
|
||
\f1\b0 \strokec2 (dispatch optimized code per CPU feature) on stable Rust. As of 2025/2026,
|
||
\f4\fs26 \strokec2 std::simd
|
||
\f1\fs24 \strokec2 is still nightly-only and not suitable for stable builds. Community consensus suggests: use
|
||
\f4\fs26 \strokec2 std::simd
|
||
\f1\fs24 \strokec2 on nightly, but for stable builds consider
|
||
\f6\b\fs26 \strokec2 wide
|
||
\f1\b0\fs24 \strokec2 if no CPU feature dispatch is needed, or
|
||
\f6\b\fs26 \strokec2 pulp
|
||
\f0\fs24 /
|
||
\f6\fs26 macerator
|
||
\f1\b0\fs24 \strokec2 for multi-version support.\
|
||
Given we want broad deployment on various CPUs with optimal vector instructions, we will adopt
|
||
\f6\b\fs26 \strokec2 macerator
|
||
\f1\b0\fs24 \strokec2 as our SIMD backbone. The
|
||
\f2\i \strokec2 macerator
|
||
\f1\i0 \strokec2 crate is a fork of pulp offering generic SIMD traits and expanded instruction set coverage. It supports all modern x86 (SSE, AVX2, AVX-512), ARM NEON, WASM SIMD, etc., making it ideal for a cross-platform engine. It also handles runtime feature detection (multi-versioning) so our code can transparently use AVX-512 on a server chip, AVX2 on an older laptop, or NEON on ARM. This choice influences every low-level kernel (from dot products in attention to LoRA merges), ensuring we exploit hardware vector units fully.\
|
||
|
||
\f0\b \strokec2 Alternate Crate Options:
|
||
\f1\b0 \strokec2 We considered
|
||
\f4\fs26 \strokec2 pulp
|
||
\f1\fs24 \strokec2 and
|
||
\f4\fs26 \strokec2 wide
|
||
\f1\fs24 \strokec2 as well.
|
||
\f4\fs26 \strokec2 pulp
|
||
\f1\fs24 \strokec2 is proven (used in faer\'92s matrix multiply) and has built-in multiversioning, but it only supports native width vectors and a limited set of architectures (AVX2, AVX-512, NEON).
|
||
\f4\fs26 \strokec2 wide
|
||
\f1\fs24 \strokec2 is very ergonomic and covers many types, but it lacks multiversioning and would force a single target ISA. Since our system must scale from consumer devices to servers,
|
||
\f6\b\fs26 \strokec2 macerator
|
||
\f1\b0\fs24 \strokec2 gives the best balance of stability and performance. It operates on stable Rust and supports writing generic SIMD code that compiles to all necessary backends.\
|
||
|
||
\f0\b \strokec2 SIMD Utilization in Kernels:
|
||
\f1\b0 \strokec2 With
|
||
\f4\fs26 \strokec2 macerator
|
||
\f1\fs24 \strokec2 , we will implement all heavy compute kernels (matrix multiplies, layernorm, activation functions, etc.) to operate on SIMD types. This yields consistent speedups across platforms. We will also use its traits to write code once and have it vectorize to different widths. For example, the attention score calculation (Q\'b7K^T) and the output combination (scores\'b7V) will be vectorized. The crate will handle selecting 128-bit vs 256-bit vs 512-bit operations depending on the CPU. This approach ensures
|
||
\f0\b \strokec2 deterministic speed-ups
|
||
\f1\b0 \strokec2 and keeps each CPU\'92s vector units in use. In summary, our SIMD strategy is locked to a
|
||
\f0\b \strokec2 stable, cross-platform solution (macerator)
|
||
\f1\b0 \strokec2 that aligns with Rust 2026 best practices, avoiding nightly features while still leveraging modern SIMD instructions.\
|
||
\pard\pardeftab720\sa298\partightenfactor0
|
||
|
||
\f0\b\fs36 \cf0 \strokec2 3. KV Cache Quantization and Rematerialization Strategy\
|
||
\pard\pardeftab720\sa240\partightenfactor0
|
||
|
||
\f1\b0\fs24 \cf0 \strokec2 We need a clear algorithmic policy for managing the KV cache as sequences grow, including
|
||
\f0\b \strokec2 when to quantize
|
||
\f1\b0 \strokec2 cached keys/values,
|
||
\f0\b \strokec2 when to rematerialize
|
||
\f1\b0 \strokec2 (recompute) them if needed, and how to combine different quantization methods safely. In the original plan, methods like
|
||
\f2\i \strokec2 KVQuant, KIVI,
|
||
\f1\i0 \strokec2 and
|
||
\f2\i \strokec2 XQuant
|
||
\f1\i0 \strokec2 were mentioned. We will refine this with newer research (like
|
||
\f0\b \strokec2 SQuat
|
||
\f1\b0 \strokec2 ) and define usage scenarios for each technique:\
|
||
\pard\tx220\tx720\pardeftab720\li720\fi-720\sa240\partightenfactor0
|
||
\ls13\ilvl0
|
||
\f0\b \cf0 \kerning1\expnd0\expndtw0 \outl0\strokewidth0 {\listtext \uc0\u8226 }\expnd0\expndtw0\kerning0
|
||
\outl0\strokewidth0 \strokec2 Default KV Cache Precision:
|
||
\f1\b0 By default, new tokens\'92 K/V tensors start in full precision (e.g. FP16/BF16). This ensures maximal quality for recent tokens where the model is most sensitive. As the context grows, the oldest entries contribute less to near-term predictions, so we apply quantization progressively.\
|
||
\ls13\ilvl0
|
||
\f0\b \kerning1\expnd0\expndtw0 \outl0\strokewidth0 {\listtext \uc0\u8226 }\expnd0\expndtw0\kerning0
|
||
\outl0\strokewidth0 \strokec2 KIVI 2-bit Compression for Stale Segments:
|
||
\f1\b0 KIVI is a tuning-free 2-bit KV quantization method that quantizes keys per-channel and values per-token. It achieves about 2.6\'d7 lower peak memory (model weights included, as reported by its authors) with
|
||
\f0\b minimal quality loss
|
||
\f1\b0 (nearly unchanged quality) and enables up to 4\'d7 larger batch sizes for Llama-2 and other models. We will use KIVI-style quantization on
|
||
\f2\i older tokens
|
||
\f1\i0 in the cache. For example, once the context exceeds a threshold (say 512 or 1024 tokens), we convert the oldest segment of the KV cache to 2-bit using KIVI\'92s scheme. This sharply reduces the memory footprint while largely preserving quality for those older positions. The system will keep the small per-group scale and zero-point metadata that KIVI\'92s asymmetric quantization produces, so these entries can be dequantized whenever attention reads them.\
|
||
\ls13\ilvl0
|
||
\f0\b \kerning1\expnd0\expndtw0 \outl0\strokewidth0 {\listtext \uc0\u8226 }\expnd0\expndtw0\kerning0
|
||
\outl0\strokewidth0 \strokec2 SQuat for Aggressive Compression:
|
||
\f1\b0 For extremely long contexts or memory-constrained scenarios, we introduce an option to use
|
||
\f0\b SQuat (Subspace-orthogonal quantization)
|
||
\f1\b0 . SQuat builds a subspace from query tensors and quantizes keys so that the quantization error stays orthogonal to that subspace, which limits its effect on attention scores. This lets it reach very low bit-widths (2-bit) while minimizing impact on attention outputs. Its authors report ~2.2\'962.8\'d7 lower peak memory and ~2.5\'963.6\'d7 higher throughput relative to the baseline KV cache, with better benchmark quality than other low-bit methods. We will deploy SQuat for
|
||
\f2\i very long
|
||
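\f1\i0 contexts (e.g. >2048 tokens, or when the user opts in to a \'93max compression\'94 mode). In practice, KIVI handles the first round of compression; if the context keeps growing, SQuat is used for the oldest half of the cache, quantizing it to 2-bit with orthogonal error minimization. Because KIVI output is already 2-bit, SQuat saves no further memory on segments KIVI has processed; its benefit is lower attention error, so it should be applied to segments that are still held at higher precision rather than stacked on KIVI output. This staged policy (KIVI by default, SQuat at extreme lengths) keeps the quantization error seen by attention small even for very long contexts.\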
|
||
\ls13\ilvl0
|
||
\f0\b \kerning1\expnd0\expndtw0 \outl0\strokewidth0 {\listtext \uc0\u8226 }\expnd0\expndtw0\kerning0
|
||
\outl0\strokewidth0 \strokec2 KVQuant for Quality-Critical Long Contexts:
|
||
\f1\b0
|
||
\f2\i KVQuant
|
||
\f1\i0 is a comprehensive approach incorporating per-channel key quantization,
|
||
\f0\b pre-RoPE quantization
|
||
\f1\b0 (quantizing keys before the rotary positional embedding is applied, which makes them easier to quantize), non-uniform quantization levels, and outlier isolation. It is more complex but yields the highest fidelity: <0.1 perplexity degradation at 3-bit precision on Wikitext-2 and C4, outperforming simpler methods. KVQuant enabled LLaMA-7B to handle up to
|
||
\f0\b 1 million tokens
|
||
\f1\b0 on a single 80GB A100 GPU (10 million on 8 GPUs), thanks to its ultra-low precision and custom CUDA kernels. We will provide KVQuant as an
|
||
\f2\i optional
|
||
\f1\i0 backend for deployments that absolutely require maximum context length with minimal quality degradation (e.g. enterprise use of >100k context). If the hardware support is present (our system detects GPU and has the KVQuant CUDA kernels available in the kernel pack), we can invoke KVQuant to quantize the cache down to 3-bit or 2-bit once context exceeds a high threshold. This will come with the trade-off of needing the specialized kernels and possibly slightly higher latencies (due to more complex quantization logic), so we won\'92t use it by default on every platform \'96 only when long-context support is paramount and the environment can support it.\
|
||
\ls13\ilvl0
|
||
\f0\b \kerning1\expnd0\expndtw0 \outl0\strokewidth0 {\listtext \uc0\u8226 }\expnd0\expndtw0\kerning0
|
||
\outl0\strokewidth0 \strokec2 \'93XQuant\'94 and Others:
|
||
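\f1\b0 Unlike the methods above,
|
||
\f2\i XQuant
|
||
\f1\i0 (Tomar et al., listed under Sources) does not quantize cached keys and values directly: it quantizes and caches the layer input activations X and rematerializes keys and values from them on the fly, spending extra compute to save memory. We therefore treat it as a rematerialization option rather than as another branch of the quantization decision tree. Other experimental approaches (for example, exllama\'92s KV compression using a Hadamard transform) can be added later; the design will be extensible so that new quantization plugins can be slotted in. For now, our primary quantization decision tree is: use KIVI for general compression, escalate to SQuat for aggressive compression, or switch to KVQuant when available for best-in-class long-range support. Each method is applied exclusively to avoid conflict (e.g. we wouldn\'92t stack KIVI and KVQuant on the same data \'96 we choose one path based on scenario). All quantization modes will be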
|
||
\f0\b configurable
|
||
\f1\b0 , so operators can disable quantization entirely (for maximum quality) or choose a desired memory-accuracy trade-off profile.\
|
||
\ls13\ilvl0
|
||
\f0\b \kerning1\expnd0\expndtw0 \outl0\strokewidth0 {\listtext \uc0\u8226 }\expnd0\expndtw0\kerning0
|
||
\outl0\strokewidth0 \strokec2 Rematerialization Policy:
|
||
\f1\b0 In addition to quantization, we have a strategy for
|
||
\f2\i rematerialization
|
||
\f1\i0 . If the context window grows so large that even quantization can\'92t save enough memory (or if quality starts degrading), the system can
|
||
\f0\b evict
|
||
\f1\b0 the oldest KV cache blocks and
|
||
\f0\b rematerialize
|
||
\f1\b0 them on-the-fly if needed. Concretely, if a session goes beyond N tokens (configurable, e.g. 8192), we might drop the oldest 2048 tokens\'92 KV caches entirely to free memory (those tokens remain in the textual history). If the model later needs to attend to those dropped tokens (which might happen depending on attention pattern), our scheduler will
|
||
\f0\b recompute
|
||
\f1\b0 that portion by running the model on that segment again to regenerate the missing keys/values. This is similar to activation checkpointing in training \'96 trading compute for memory. We will be careful to only rematerialize in controlled ways: for example, never drop the
|
||
\f2\i most recent
|
||
\f1\i0 context so the majority of attention queries hit quantized or full-precision cache. Rematerialization might be paired with summary techniques (e.g. the system could insert a summary embedding for very old content to avoid fully recomputing). The exact trigger for rematerialization vs. further quantization will be informed by the evaluation harness (see section 8): if we detect quantization error accumulating beyond a tolerance, we prefer to rematerialize old context rather than quantize it further. This ensures
|
||
\f0\b quality does not silently degrade
|
||
\f1\b0 on ultra-long sessions.\
|
||
\pard\pardeftab720\sa240\partightenfactor0
|
||
\cf0 \strokec2 In summary, the KV cache management layer will dynamically choose between
|
||
\f2\i \strokec2 no quantization
|
||
\f1\i0 \strokec2 ,
|
||
\f2\i \strokec2 KIVI 2-bit quant
|
||
\f1\i0 \strokec2 ,
|
||
\f2\i \strokec2 SQuat subspace quant
|
||
\f1\i0 \strokec2 , or
|
||
\f2\i \strokec2 KVQuant advanced quant
|
||
\f1\i0 \strokec2 based on context length, hardware capabilities, and desired quality. It will also utilize eviction and recompute (rematerialization) as a backstop to prevent unbounded memory growth. All these are orchestrated by a scheduler that monitors memory usage per token and error metrics so it can
|
||
\f0\b \strokec2 combine these techniques safely
|
||
\f1\b0 \strokec2 . Newer research like SQuat is incorporated to improve the error characteristics of quantization (ensuring quant errors are orthogonal to the important subspace of queries). By explicitly defining when and how to apply each method, we turn a menu of options into a concrete algorithm that will be implemented and tuned in the system.\
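The decision tree summarized above can be sketched as a single selection function. This is a minimal illustration, not the implementation: the token thresholds, the CacheState fields, and the policy names are all hypothetical placeholders chosen for the example, and the real triggers would come from the evaluation harness.\

```python
from dataclasses import dataclass

# Hypothetical thresholds for illustration only; not values from the design.
KIVI_MIN_TOKENS = 4096
SQUAT_MIN_TOKENS = 32768
KVQUANT_MIN_TOKENS = 100000


@dataclass
class CacheState:
    context_tokens: int          # tokens currently held in the KV cache
    has_kvquant_kernels: bool    # GPU detected and KVQuant kernels in the kernel pack
    quantization_enabled: bool   # operator switch; False keeps full precision
    error_over_tolerance: bool   # assumed signal from the evaluation harness


def choose_kv_policy(state):
    """Return exactly one policy name; the methods are never stacked."""
    if not state.quantization_enabled:
        return "none"
    if state.error_over_tolerance:
        # Prefer evict-and-recompute over quantizing old context further.
        return "rematerialize"
    if state.context_tokens >= KVQUANT_MIN_TOKENS and state.has_kvquant_kernels:
        return "kvquant"
    if state.context_tokens >= SQUAT_MIN_TOKENS:
        return "squat"
    if state.context_tokens >= KIVI_MIN_TOKENS:
        return "kivi"
    return "none"
```

Returning a single name per call mirrors the rule that one path is chosen per scenario; a long context without KVQuant kernels falls back to SQuat rather than failing.\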
|
||
\pard\pardeftab720\sa298\partightenfactor0
|
||
|
||
\f0\b\fs36 \cf0 \strokec2 4. Unified Paging Allocator and Eviction Policy\
|
||
\pard\pardeftab720\sa240\partightenfactor0
|
||
|
||
\f1\b0\fs24 \cf0 \strokec2 We will implement a
|
||
\f0\b \strokec2 unified memory pool
|
||
\f1\b0 \strokec2 for all dynamic data (KV cache pages, adapter weights, etc.) and define the allocator behavior in detail. This unified paging system is inspired by S-LoRA\'92s approach, which introduced a unified memory pool to manage both KV cache tensors and LoRA adapter weights of varying sizes. Here we specify the parameters and policies:\
|
||
\pard\tx220\tx720\pardeftab720\li720\fi-720\sa240\partightenfactor0
|
||
\ls14\ilvl0
|
||
\f0\b \cf0 \kerning1\expnd0\expndtw0 \outl0\strokewidth0 {\listtext \uc0\u8226 }\expnd0\expndtw0\kerning0
|
||
\outl0\strokewidth0 \strokec2 Page Size:
|
||
\f1\b0 We choose a page size of
|
||
\f0\b 2 MB
|
||
\f1\b0 for the unified pool by default. This size is a balance between fragmentation and overhead: 2 MB pages are large enough to hold typical KV cache blocks and moderately sized LoRA weights, while small enough that we can allocate and move them without large latency. For scale, assuming LLaMA-7B dimensions (hidden size 4096, 32 layers) and FP16 storage, keys plus values take 16 KB per token per layer, so a 512-token block is 8 MB (four pages) per layer and 256 MB across all layers before quantization. We will allow the page size to be configurable (e.g. in the 512 KB to 4 MB range) to tune for different hardware. We also align pages to 2 MB because on GPU memory, large allocations (or allocations through the CUDA unified memory manager) often perform better when aligned to power-of-two boundaries.\
|
||
\ls14\ilvl0
|
||
\f0\b \kerning1\expnd0\expndtw0 \outl0\strokewidth0 {\listtext \uc0\u8226 }\expnd0\expndtw0\kerning0
|
||
\outl0\strokewidth0 \strokec2 Memory Pool Layout:
|
||
\f1\b0 At initialization, we allocate one large contiguous chunk of memory on each device (CPU RAM for CPU inference, GPU VRAM for GPU inference) that will serve as the unified pool. The pool is sized from what remains after the static model weights are loaded, minus some headroom for activations and other overhead; for instance, on an 80 GB GPU serving a 7B model in FP16 (roughly 14 GB of weights), we might dedicate about 60 GB to the pool. Within this pool, memory is managed in
|
||
\f0\b fixed-size pages
|
||
\f1\b0 . We maintain a metadata table (in host memory) with an entry per page indicating its status:
|
||
\f2\i free
|
||
\f1\i0 or
|
||
\f2\i allocated
|
||
\f1\i0 , and if allocated, what it contains (e.g. \'93Adapter X layer 2 weights\'94 or \'93KV cache for Session Y, layer 10, tokens 100-199\'94). This metadata also stores usage stats like last access time for eviction decisions.\
|
||
\ls14\ilvl0
|
||
\f0\b \kerning1\expnd0\expndtw0 \outl0\strokewidth0 {\listtext \uc0\u8226 }\expnd0\expndtw0\kerning0
|
||
\outl0\strokewidth0 \strokec2 Allocation Strategy:
|
||
\f1\b0 When a request comes in that needs memory (say, loading a LoRA adapter or extending a KV cache), the allocator will find a sufficient number of free pages in the pool to satisfy it. If the object is larger than one page,
|
||
\f0\b contiguous pages
|
||
\f1\b0 will be allocated (the pool supports allocating ranges of pages). The allocator will look for a contiguous run using a first-fit or best-fit strategy to limit external fragmentation. Metadata will then mark those pages as in use and link them to the object. Smaller objects (e.g. a very small LoRA of only a few KB) will still consume one full page: we won\'92t sub-partition pages at this time, accepting some internal fragmentation in exchange for simplicity \'96 the page is our atomic unit of allocation.\
|
||
\ls14\ilvl0
|
||
\f0\b \kerning1\expnd0\expndtw0 \outl0\strokewidth0 {\listtext \uc0\u8226 }\expnd0\expndtw0\kerning0
|
||
\outl0\strokewidth0 \strokec2 Pinning Rules:
|
||
\f1\b0
|
||
\f0\b Active data is pinned in memory.
|
||
\f1\b0 Any KV cache pages that correspond to tokens still in the current context of an ongoing request are pinned (not evictable) until those tokens are no longer needed (e.g. dropped by our KV scheduler as described above, or the session ends). Similarly, an adapter whose weights are in use by at least one active inference is pinned in GPU memory. \'93In use\'94 means the model is either actively processing a prompt/batch that uses that adapter, or the adapter is expected to be used in an upcoming batch that is already scheduled (to prevent thrashing during scheduled batches). Pinning ensures we don\'92t evict something mid-use. We implement reference counting in the metadata: each page has a pin count (number of active uses), and only when that count drops to zero does the page become evictable.\
|
||
\ls14\ilvl0
|
||
\f0\b \kerning1\expnd0\expndtw0 \outl0\strokewidth0 {\listtext \uc0\u8226 }\expnd0\expndtw0\kerning0
|
||
\outl0\strokewidth0 \strokec2 Eviction Policy:
|
||
\f1\b0 When the memory pool is exhausted (no sufficient free pages for a new allocation) we need to evict something. We will use an
|
||
\f0\b LRU (Least Recently Used)
|
||
\f1\b0 policy by default for evictable pages, with some refinements. Specifically, we\'92ll evict the pages that have the oldest
|
||
\f2\i last access timestamp
|
||
\f1\i0 among those not pinned. This tends to remove long-unused adapters or KV from old sessions. However, we also factor in
|
||
\f2\i size
|
||
\f1\i0 and
|
||
\f2\i importance
|
||
\f1\i0 : if multiple candidates are similarly old, we might prefer evicting a larger adapter that frees more space in one go, unless that adapter is expected to be needed soon (we could use a simple heuristic: if an adapter hasn\'92t been used for X minutes, it\'92s unlikely to be needed again immediately). For KV cache pages, eviction would generally target entire old sessions that have been idle (e.g. if a user hasn\'92t sent a message in a while, we evict that session\'92s KV to free memory). The eviction process will copy evicted data to slower storage if needed: for an adapter, we might offload it back to host memory or disk so it can be loaded later; for KV cache, we likely just drop it (if we support long-term persistence of conversation state, we might serialize it to disk, but by default KV eviction means those tokens must be recomputed if they are needed again).\
|
||
\ls14\ilvl0
|
||
\f0\b \kerning1\expnd0\expndtw0 \outl0\strokewidth0 {\listtext \uc0\u8226 }\expnd0\expndtw0\kerning0
|
||
\outl0\strokewidth0 \strokec2 Concurrency Model:
|
||
\f1\b0 The allocator must handle concurrent requests, as multiple threads may be generating outputs for different users. We will implement a
|
||
\f0\b lock-free free-list
|
||
\f1\b0 or use atomic operations for page allocation to avoid global locks. One approach is to divide the pool among threads or use a per-thread cache of free pages to reduce contention. However, since the pool is unified, a global view is needed to decide eviction. We will therefore protect only the slow path (an allocation that requires eviction) with a lightweight mutex or an ordered lock: when a new allocation request finds insufficient space, the thread takes the eviction lock, performs the LRU eviction of some pages, marks them free, then allocates. This eviction step will be tuned to be fast (we\'92ll evict in bulk if possible, and perhaps do it asynchronously if the requester can wait a bit). The design will ensure that most
|
||
\f0\b fast-path allocations (when free pages exist)
|
||
\f1\b0 don\'92t need global locking \'96 they can pop from a free list quickly.\
|
||
\ls14\ilvl0
|
||
\f0\b \kerning1\expnd0\expndtw0 \outl0\strokewidth0 {\listtext \uc0\u8226 }\expnd0\expndtw0\kerning0
|
||
\outl0\strokewidth0 \strokec2 Thrash Prevention:
|
||
\f1\b0 To avoid pathological thrashing (constant evict-load cycles of the same data), we implement a
|
||
\f0\b hysteresis
|
||
\f1\b0 and
|
||
\f0\b prioritization
|
||
\f1\b0 scheme. For example, if an adapter was evicted very recently and now is needed again, the system might decide to keep it in memory a little longer next time (mark it as \'93recently evicted, do not evict again for Y minutes\'94). Similarly, if a certain session\'92s KV is causing frequent evictions back and forth, we may choose to
|
||
\f2\i grow
|
||
\f1\i0 the pool or refuse new allocations (backpressure) rather than constantly evict and rematerialize. The unified allocator will expose metrics like current utilization and eviction rate; if eviction rate is high (thrashing signal), it could trigger load-shedding (e.g. refuse loading a low-priority adapter until memory pressure eases) or increase the quantization of KV to reduce footprint. Additionally, we might reserve a small percentage of the pool as a buffer that\'92s only used under extreme conditions, to absorb spikes without immediate evictions. These measures ensure the system remains stable under load and doesn\'92t degrade into continuous swapping of pages in and out.\
|
||
\pard\pardeftab720\sa240\partightenfactor0
|
||
\cf0 \strokec2 In summary,
|
||
\f0\b \strokec2 unified paging
|
||
\f1\b0 \strokec2 means all dynamic memory (adapters and caches) lives in one managed pool, reducing fragmentation and allowing intelligent trade-offs between them. We have concretely defined the page size, metadata, pin/evict rules, concurrency handling, and thrash avoidance. This turns the \'93unified pool\'94 concept into a working subsystem ready for implementation and tuning.\
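The page, pin, and eviction rules above can be modeled in a few lines. The following is a toy sketch under stated assumptions: PagePool and its method names are hypothetical, a logical counter stands in for access timestamps, and contiguity, offload to host memory, and locking are deliberately left out.\

```python
PAGE_SIZE = 2 * 1024 * 1024  # 2 MB default page size from the text


def pages_needed(nbytes):
    # Round up: even an object of a few KB occupies one whole page.
    return max(1, -(-nbytes // PAGE_SIZE))


class PagePool:
    """Toy model: fixed-size pages, per-object pin counts, LRU eviction."""

    def __init__(self, num_pages):
        self.free_pages = num_pages
        self.entries = dict()  # label -> dict(pages, pins, last_access)
        self.tick = 0          # logical clock standing in for timestamps

    def _touch(self, label):
        self.tick += 1
        self.entries[label]["last_access"] = self.tick

    def _evict_lru(self):
        # Only unpinned objects are candidates; the oldest access goes first.
        unpinned = [k for k, e in self.entries.items() if e["pins"] == 0]
        if not unpinned:
            raise MemoryError("pool exhausted: every resident object is pinned")
        victim = min(unpinned, key=lambda k: self.entries[k]["last_access"])
        self.free_pages += self.entries.pop(victim)["pages"]
        return victim

    def allocate(self, label, nbytes):
        """Allocate pages for label, evicting LRU objects if needed.

        Returns the labels that were evicted to make room.
        """
        need = pages_needed(nbytes)
        evicted = []
        while self.free_pages < need:
            evicted.append(self._evict_lru())
        self.free_pages -= need
        self.entries[label] = dict(pages=need, pins=0, last_access=0)
        self._touch(label)
        return evicted

    def pin(self, label):
        self.entries[label]["pins"] += 1
        self._touch(label)

    def unpin(self, label):
        self.entries[label]["pins"] -= 1
```

As a usage example, in a four-page pool holding a 3 MB adapter (two pages) and a pinned KV block (one page), requesting another two-page adapter evicts the first adapter and leaves the pinned KV block untouched. The helper also reproduces the sizing arithmetic: pages_needed(512 * 16 * 1024) gives 4 for a 512-token block at 16 KB per token.\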
|
||
\pard\pardeftab720\sa298\partightenfactor0
|
||
|
||
\f0\b\fs36 \cf0 \strokec2 5. WASM Execution Boundary and Safety\
|
||
\pard\pardeftab720\sa240\partightenfactor0
|
||
|
||
\f1\b0\fs24 \cf0 \strokec2 We clarify how we will integrate WebAssembly (WASM) into the system, including the interface, memory sharing, trap handling, and the execution budget (fuel/epoch) for timeouts. The plan is to use WASM for any sandboxed or user-provided logic (for example, custom embedding functions, plugin scripts, or untrusted model components) while ensuring it cannot hang or crash the host.\
|
||
\pard\tx220\tx720\pardeftab720\li720\fi-720\sa240\partightenfactor0
|
||
\ls15\ilvl0
|
||
\f0\b \cf0 \kerning1\expnd0\expndtw0 \outl0\strokewidth0 {\listtext \uc0\u8226 }\expnd0\expndtw0\kerning0
|
||
\outl0\strokewidth0 \strokec2 ABI vs Component Model:
|
||
\f1\b0 We will initially expose a
|
||
\f0\b raw ABI interface
|
||
\f1\b0 to the WASM modules for simplicity and performance. That is, the host and WASM will communicate through low-level function calls and memory, rather than using the full WASM Component Model (which, as of 2026, is still maturing and could add overhead). Concretely, we\'92ll define a small set of C-ABI functions that a module can export (e.g.
|
||
\f4\fs26 compute_attention(query_ptr, key_ptr, value_ptr, len)
|
||
\f1\fs24 or
|
||
\f4\fs26 apply_adapter(layer_idx, input_ptr, output_ptr, len)
|
||
\f1\fs24 ) and the host will use Wasmtime (our chosen runtime) to call these functions directly. This avoids serialization of data or complex adapters \'96 it\'92s essentially like calling a dynamic library, but in a safe WASM sandbox.\
|
||
\ls15\ilvl0
|
||
\f0\b \kerning1\expnd0\expndtw0 \outl0\strokewidth0 {\listtext \uc0\u8226 }\expnd0\expndtw0\kerning0
|
||
\outl0\strokewidth0 \strokec2 Memory Sharing:
|
||
\f1\b0 To minimize copying overhead across the host-WASM boundary, we will
|
||
\f0\b share memory
|
||
\f1\b0 where possible. Wasmtime allows the host to allocate a memory and then expose it to the WASM module, or vice versa. We will use this to let the WASM module operate on data that resides in host memory (or in a memory-mapped region accessible to both). For instance, we can allocate the model\'92s tensor data in a shared memory and pass pointers (offsets) to WASM functions. Because a WASM linear memory uses 32-bit indexing by default, which caps it at 4 GiB, we might enable the WebAssembly memory64 proposal (which Wasmtime supports) if needed for large models, or simply manage multiple memory segments. The key is that heavy data (tensors) won\'92t be copied into WASM; instead, WASM gets access to a controlled memory region containing those tensors. This improves performance while maintaining safety (WASM can\'92t access anything outside the shared region, and we set memory limits).\
|
||
\ls15\ilvl0
|
||
\f0\b \kerning1\expnd0\expndtw0 \outl0\strokewidth0 {\listtext \uc0\u8226 }\expnd0\expndtw0\kerning0
|
||
\outl0\strokewidth0 \strokec2 Trap Handling:
|
||
\f1\b0 If a WASM module traps (for example, due to an out-of-bounds access or an explicit
|
||
\f4\fs26 unreachable
|
||
\f1\fs24 ), the host will catch this as a runtime error. Our policy is to treat WASM traps as
|
||
\f0\b non-fatal errors
|
||
\f1\b0 for the request in question. The engine will abort the current operation in that WASM instance, log the error (with enough context to debug), and free or reset the WASM instance. For the user request that triggered it, we\'92ll propagate an error up (or fall back to a safe implementation if possible). The system remains running \'96 the trap does not crash the whole server. We will design these error paths carefully so that any partially acquired resources (memory pages, locks, etc.) are released when a trap occurs. Wasmtime reports a trap to the embedder as an error value returned from the call into the module, so we can handle it at the call site. Additionally, we might impose a
|
||
\f0\b restart policy
|
||
\f1\b0 : if a particular WASM module traps frequently, we may unload or block that module until it\'92s fixed, to avoid repeated failures.\
|
||
\ls15\ilvl0
|
||
\f0\b \kerning1\expnd0\expndtw0 \outl0\strokewidth0 {\listtext \uc0\u8226 }\expnd0\expndtw0\kerning0
|
||
\outl0\strokewidth0 \strokec2 Interruption and Budgeting:
|
||
\f1\b0 To ensure a runaway WASM execution (e.g. an infinite loop in user plugin code) doesn\'92t hang our service, we will use Wasmtime\'92s
|
||
\f0\b epoch-based interruption
|
||
\f1\b0 as the default mechanism. Wasmtime offers two approaches:
|
||
\f2\i fuel
|
||
\f1\i0 , where you decrement a counter for executed instructions (which adds overhead per instruction), and
|
||
\f2\i epoch timers
|
||
\f1\i0 , where compiled code periodically compares a global epoch counter against a deadline, at much lower overhead. We choose
|
||
\f0\b epoch interruption
|
||
\f1\b0 because it has a much smaller performance impact on tight loops than fuel, which is crucial for tail latency in our hot loop. Concretely, we will configure an epoch deadline for each WASM execution. The host increments a global epoch counter asynchronously (e.g. every N milliseconds), and if the execution runs past its deadline, Wasmtime raises a trap at the next check point. The generated code includes safe points (typically at function entries and loop back-edges) where it observes the epoch. This way, if (for example) an attention kernel in WASM is taking too long (maybe it got an unexpectedly large input), we can interrupt it and perhaps fall back to a simpler implementation rather than let it drag out p99 latency.\
|
||
\ls15\ilvl0
|
||
\f0\b \kerning1\expnd0\expndtw0 \outl0\strokewidth0 {\listtext \uc0\u8226 }\expnd0\expndtw0\kerning0
|
||
\outl0\strokewidth0 \strokec2 WASM vs Native Execution Choices:
|
||
\f1\b0 For clarity, not all core inference will run in WASM \'96 our default attention/matmul is in native Rust (for speed and control). The WASM boundary is used for
|
||
\f2\i extensibility
|
||
\f1\i0 and
|
||
\f2\i isolation
|
||
\f1\i0 . For instance, if we support user-defined functions or a model from a third party in WASM, we use this sandbox. We will document which parts of the system use WASM. For those parts, we\'92ll ensure the interface is minimal (passing pointers/lengths and getting results) to keep overhead low. Memory is mostly shared, so copying is minimal. And with epoch-based cancellation, we guarantee that even a malicious or buggy WASM cannot spin forever: it will be stopped in a timely manner, preserving SLAs for other users.\
|
||
\pard\pardeftab720\sa240\partightenfactor0
|
||
\cf0 \strokec2 By locking down these WASM details, we ensure that the
|
||
\f0\b \strokec2 component boundary is robust
|
||
\f1\b0 \strokec2 : modules have a defined way to call/behave, cannot harm the host, and can be preempted if needed. This transforms \'93defer to WASM\'94 from a vague idea into a concrete, safe extension mechanism ready for production (leveraging Wasmtime\'92s proven sandbox and interruption features).\
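The epoch mechanism described above can be illustrated with a toy model in plain Python. This does not use Wasmtime or any WASM runtime: ToyEngine, run_guest, and start_ticker are hypothetical names, and the guest is an ordinary Python loop that checks the epoch at each back-edge the way compiled code would at a safe point.\

```python
import threading


class EpochInterrupted(Exception):
    """Stands in for the trap a runtime raises once the deadline passes."""


class ToyEngine:
    def __init__(self):
        self.epoch = 0  # advanced by the host, only read by guest code

    def increment_epoch(self):
        self.epoch += 1


def run_guest(engine, deadline_ticks, max_iterations):
    """Simulated guest loop with an epoch check at every back-edge."""
    deadline = engine.epoch + deadline_ticks
    done = 0
    while done < max_iterations:
        if engine.epoch >= deadline:  # safe point: one cheap comparison
            raise EpochInterrupted("stopped after %d iterations" % done)
        done += 1
    return done


def start_ticker(engine, period_s, stop_event):
    """Host-side thread that advances the epoch every period_s seconds."""
    def tick():
        # Event.wait returns False on timeout, True once stop is requested.
        while not stop_event.wait(period_s):
            engine.increment_epoch()
    thread = threading.Thread(target=tick, daemon=True)
    thread.start()
    return thread
```

With no ticker running, a short guest call completes normally; with a 1 ms ticker and a deadline of two ticks, an effectively unbounded loop is cut off with EpochInterrupted, at which point the host can fall back to a simpler implementation. The guest pays only one integer comparison per iteration, which is the property that motivates preferring epochs over fuel.\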
|
||
\pard\pardeftab720\sa298\partightenfactor0
|
||
|
||
\f0\b\fs36 \cf0 \strokec2 6. Kernel Pack Supply Chain and Rollback Design\
|
||
\pard\pardeftab720\sa240\partightenfactor0
|
||
|
||
\f1\b0\fs24 \cf0 \strokec2 The plan to allow pluggable optimized kernels (for various hardware or updated algorithms) requires a solid supply chain setup. We outline how kernel packs are versioned, verified, and rolled back safely:\
|
||
\pard\tx220\tx720\pardeftab720\li720\fi-720\sa240\partightenfactor0
|
||
\ls16\ilvl0
|
||
\f0\b \cf0 \kerning1\expnd0\expndtw0 \outl0\strokewidth0 {\listtext \uc0\u8226 }\expnd0\expndtw0\kerning0
|
||
\outl0\strokewidth0 \strokec2 Manifest Schema:
|
||
\f1\b0 Each kernel pack (a bundle of optimized kernels or functions, possibly compiled to native code or WASM) will include a manifest file (e.g.
|
||
\f4\fs26 kernels.json
|
||
\f1\fs24 or similar). This manifest lists the contents and metadata: for each kernel, it might have fields like
|
||
\f4\fs26 name
|
||
\f1\fs24 (e.g. "attention_avx512"),
|
||
\f4\fs26 version
|
||
\f1\fs24 (of that kernel),
|
||
\f4\fs26 supported_targets
|
||
\f1\fs24 (CPU features or GPU architectures), a cryptographic
|
||
\f4\fs26 hash
|
||
\f1\fs24 of the binary, and a signature. It also lists a
|
||
\f2\i pack version
|
||
\f1\i0 and a
|
||
\f2\i minimum engine version
|
||
\f1\i0 required (to ensure compatibility with our runtime). The manifest schema will be strict and documented, enabling the runtime to parse it and decide which kernels to load on startup.\
|
||
\ls16\ilvl0
|
||
\f0\b \kerning1\expnd0\expndtw0 \outl0\strokewidth0 {\listtext \uc0\u8226 }\expnd0\expndtw0\kerning0
|
||
\outl0\strokewidth0 \strokec2 Signatures and Trust:
|
||
\f1\b0 The kernel pack (or each kernel in it) will be
|
||
\f0\b digitally signed
|
||
\f1\b0 by the provider. We will use an asymmetric key (e.g. an Ed25519 or ECDSA keypair) to sign the manifest. The public key (or a root certificate) will be embedded in our application as the root of trust (we can allow updates to this via secure channels for rotation). When the runtime fetches or is given a new kernel pack, it verifies the signature against the trusted key. Only if the signature is valid and the content hash matches (to prevent tampering) will the pack be accepted. This prevents unauthorized or malicious code from being loaded into our process.\
|
||
\ls16\ilvl0
|
||
\f0\b \kerning1\expnd0\expndtw0 \outl0\strokewidth0 {\listtext \uc0\u8226 }\expnd0\expndtw0\kerning0
|
||
\outl0\strokewidth0 \strokec2 Signature Rotation:
|
||
\f1\b0 Over time, keys may need to be changed (because of compromise or routine operational rotation). Our design supports multiple trusted keys with metadata (e.g. key ID and expiry). The manifest will indicate which key signed it (through an identifier or embedded certificate). The runtime will have an
|
||
\f2\i allowlist of valid signers
|
||
\f1\i0 . We can update that allowlist via our own secure update mechanism if needed (for example, ship an update that trusts a new key before retiring an old one). We will also timestamp our manifests; if a manifest is signed by an expired key or past a certain date, the runtime may warn or refuse it depending on policy. This ensures a compromised old key can be phased out.\
|
||
\ls16\ilvl0
|
||
\f0\b \kerning1\expnd0\expndtw0 \outl0\strokewidth0 {\listtext \uc0\u8226 }\expnd0\expndtw0\kerning0
|
||
\outl0\strokewidth0 \strokec2 Allowed List and Version Gates:
|
||
\f1\b0 Not all kernels or packs are automatically acceptable. We will maintain an
|
||
\f0\b allowlist
|
||
\f1\b0 of known-good kernel pack versions. For example, if version 1.2 is the latest stable, the runtime might refuse to load an unrecognized version unless explicitly overridden. This is a safeguard so that even with a valid signature, a radically different pack won\'92t be loaded blindly. Moreover, each pack will declare compatibility (like \'93for Frontier runtime >= 1.0.0 and < 2.0.0\'94). The runtime will cross-check its own version and the pack\'92s intended range. Incompatible packs are rejected to prevent crashes from API mismatches. Essentially, we bake in
|
||
\f0\b version gating
|
||
\f1\b0 : both the engine and the kernels must agree on an interface version.\
|
||
\ls16\ilvl0
|
||
\f0\b \kerning1\expnd0\expndtw0 \outl0\strokewidth0 {\listtext \uc0\u8226 }\expnd0\expndtw0\kerning0
|
||
\outl0\strokewidth0 \strokec2 Reproducible Builds and Audits:
|
||
\f1\b0 To build trust in these kernel packs, we will strive for
|
||
\f0\b reproducible builds
|
||
\f1\b0 of the kernels. This means anyone (including ourselves or third-party auditors) can rebuild the kernel code from source and get the same hash that is listed in the manifest. Using deterministic compilation techniques (specific compiler versions, flags, etc.) is part of the pipeline. We will also publish the source or at least the hash of source control for each official kernel binary. This supply chain transparency helps ensure no hidden code is present. For internal development, when we integrate a new kernel (say an updated quantization kernel), we will have a CI step that reproduces the binary, signs it, and packages the manifest in a verifiable way.\
|
||
\ls16\ilvl0
|
||
\f0\b \kerning1\expnd0\expndtw0 \outl0\strokewidth0 {\listtext \uc0\u8226 }\expnd0\expndtw0\kerning0
|
||
\outl0\strokewidth0 \strokec2 Safe Rollback Protocol:
|
||
\f1\b0 If a new kernel pack is deployed and causes problems (e.g. performance regression or crashes), we need a deterministic and safe rollback. We design the system to
|
||
\f0\b retain the previous kernel pack
|
||
\f1\b0 in memory or on disk until the new one is proven. On startup, the engine can keep two versions loaded (current and last-known-good) but uses the current one by default. If certain health checks fail (for instance, the new kernels trigger errors in the evaluation harness or fail a quick self-test), the system can switch back to the old pack on the fly or after a restart. We also implement an administrative override: an operator can send a command to revert to the previous pack, which the runtime will then use for subsequent requests. The key point is to
|
||
\f0\b never remove or overwrite the last known good kernels until we are sure the new ones are stable
|
||
\f1\b0 . All kernel packs are immutable (versioned), so rollback is simply a matter of toggling which version is active. Additionally, under load, we ensure consistency: we won\'92t have one request using the new kernel while another uses an old one in an inconsistent way. The switchover is handled at a synchronization point (e.g. no requests in mid-attention computation) to avoid nondeterminism.\
|
||
\ls16\ilvl0
|
||
\f0\b \kerning1\expnd0\expndtw0 \outl0\strokewidth0 {\listtext \uc0\u8226 }\expnd0\expndtw0\kerning0
|
||
\outl0\strokewidth0 \strokec2 Deterministic Behavior Under Load:
|
||
\f1\b0 We need to guarantee that rolling out a new kernel (or rolling back) does not cause half-computed results or divergent behavior. To do this, we plan a
|
||
\f0\b two-phase activation
|
||
\f1\b0 : load the new kernel pack in parallel (so both old and new are in memory), then
|
||
\f0\b quiesce
|
||
\f1\b0 incoming work (let ongoing queries finish), then switch a flag so that new queries use the new kernels, while any in-flight ones finish on the old pack if needed. In practice, since inference requests are short-lived relative to deployments, we might simply drain and then flip. For rollback, the same process applies in reverse. This way, at any given moment each session is consistently using one set of kernels. Logging will note which version was used for each request for audit.\
|
||
\pard\pardeftab720\sa240\partightenfactor0
|
||
\cf0 \strokec2 By specifying manifest, signing, allowlists, compatibility checks, and rollback steps, we convert \'93kernel pack with signing\'94 from an idea into a concrete
|
||
\f0\b \strokec2 supply chain security protocol
|
||
\f1\b0 \strokec2 . This will protect users from malicious or unvetted optimizations and give maintainers the confidence to push updates and revert if issues arise, all in a controlled,
|
||
\f0\b \strokec2 deterministic
|
||
\f1\b0 \strokec2 fashion.\
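The acceptance checks above (trusted signer, allowlisted pack version, engine version range, per-kernel hash) can be sketched as one function. This is an illustration under assumptions: the field names key_id, pack_version, min_engine, max_engine_exclusive, and sha256 are hypothetical, as are the constants, and the signature check is passed in as a callable because real Ed25519 or ECDSA verification would need a cryptography library outside the standard library.\

```python
import hashlib

# Hypothetical values for illustration only.
ENGINE_VERSION = (1, 4, 0)
ALLOWED_PACK_VERSIONS = frozenset(["1.2"])
TRUSTED_KEY_IDS = frozenset(["key-2026-a"])


def verify_pack(manifest, binaries, signature_ok):
    """Return a list of rejection reasons; an empty list means accepted.

    manifest     -- parsed manifest (mapping) with the assumed fields above
    binaries     -- mapping of kernel name to the raw bytes of its binary
    signature_ok -- caller-supplied callable(manifest) -> bool that performs
                    the actual signature verification
    """
    reasons = []
    if manifest["key_id"] not in TRUSTED_KEY_IDS:
        reasons.append("untrusted signer")
    elif not signature_ok(manifest):
        reasons.append("bad signature")
    if manifest["pack_version"] not in ALLOWED_PACK_VERSIONS:
        reasons.append("pack version not allowlisted")
    low = manifest["min_engine"]
    high = manifest["max_engine_exclusive"]
    if not (low <= ENGINE_VERSION < high):
        reasons.append("engine version out of range")
    for kernel in manifest["kernels"]:
        digest = hashlib.sha256(binaries[kernel["name"]]).hexdigest()
        if digest != kernel["sha256"]:
            reasons.append("hash mismatch: " + kernel["name"])
    return reasons
```

Collecting every failed gate instead of stopping at the first one gives operators a complete picture in the logs; a loader would activate the pack only when the returned list is empty and otherwise keep the last known good pack active.\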
|
||
\pard\pardeftab720\sa298\partightenfactor0
|
||
|
||
\f0\b\fs36 \cf0 \strokec2 7. Multi-Tenant Adapter Serving Strategy\
|
||
\pard\pardeftab720\sa240\partightenfactor0
|
||
|
||
\f1\b0\fs24 \cf0 \strokec2 Serving multiple LoRA adapters concurrently (multi-tenant) introduces challenges in how to
|
||
\f0\b \strokec2 load, apply, and schedule
|
||
\f1\b0 \strokec2 adapters efficiently. We have outlined micro-batched LoRA application in the plan; now we detail the serving strategy:\
|
||
\pard\tx220\tx720\pardeftab720\li720\fi-720\sa240\partightenfactor0
|
||
\ls17\ilvl0
|
||
\f0\b \cf0 \kerning1\expnd0\expndtw0 \outl0\strokewidth0 {\listtext \uc0\u8226 }\expnd0\expndtw0\kerning0
|
||
\outl0\strokewidth0 \strokec2 Adapter Residency Rules:
|
||
\f1\b0 We cannot keep
|
||
\f2\i all
|
||
\f1\i0 adapters loaded on the GPU at all times if there are thousands, so we need rules for which adapters stay in memory. Our policy will be usage-driven: adapters that have
|
||
\f0\b recently or frequently been used
|
||
\f1\b0 will remain resident in GPU memory (the unified pool) for faster access, while others are evicted to CPU memory when idle (using the unified paging system described above). Concretely, if an adapter hasn\'92t been used in, say, the last 5 minutes and memory is needed, it becomes a candidate for eviction. However, we also allow pinning of certain high-priority adapters (for instance, a globally important one or a very frequently used one can be configured to always stay loaded). This ensures critical adapters don\'92t thrash. All adapters are still always stored in host memory (main RAM), so evicted just means \'93not currently on GPU.\'94 Loading an evicted adapter from CPU to GPU will incur a transfer latency (which we aim to amortize with batching).\
|
||
\ls17\ilvl0
|
||
\f0\b \kerning1\expnd0\expndtw0 \outl0\strokewidth0 {\listtext \uc0\u8226 }\expnd0\expndtw0\kerning0
|
||
\outl0\strokewidth0 \strokec2 Batching and Scheduling:
|
||
\f1\b0 We leverage
|
||
\f2\i heterogeneous batching
|
||
\f1\i0 similar to S-LoRA and Punica. The idea is to
|
||
\f0\b batch inference requests even if they use different adapters
|
||
\f1\b0 , to maximize GPU utilization. S-LoRA introduced custom kernels for this: for example, if two requests are running on the same base model but with different LoRAs, we can still combine parts of their computation. We plan to implement a scheduling algorithm that groups incoming requests by the work they have in common. All requests share the base model weights, so the base-weight matrix multiplications can run for a whole batch of requests (regardless of adapter), and each request\'92s adapter adjustment is then applied in parallel using a specialized kernel. Punica\'92s SGMV (Segmented Gather Matrix-Vector multiplication) kernel does exactly this, applying different LoRA deltas for different requests in one fused GPU operation. We will incorporate a similar approach:
|
||
\f2\i during each transformer layer
|
||
\f1\i0 , we separate the computation into the base part (which can be batched across requests) and the adapter part (which is small rank updates). The adapter parts for multiple requests can be fused by parallelizing over the batch dimension. This way, multi-adapter batches approach the efficiency of single-adapter ones.\
|
||
\ls17\ilvl0
|
||
\f0\b \kerning1\expnd0\expndtw0 \outl0\strokewidth0 {\listtext \uc0\u8226 }\expnd0\expndtw0\kerning0
|
||
\outl0\strokewidth0 \strokec2 Adapter Loading and Compression:
|
||
\f1\b0 To reduce load time and memory, we will
|
||
\f0\b compress adapter weights
|
||
\f1\b0 in memory. Adapters (which are essentially small matrices) can be stored in 16-bit or even 8-bit precision without significant quality loss (some LoRA papers even quantize adapters). We\'92ll use a compressed representation (e.g. int8 + scale) in the unified pool, and only decompress to FP16 when applying them. If the adapter is rarely used, keeping it compressed saves memory and speeds up transfers from CPU to GPU. Loading an adapter that is not yet staged in host memory (for example, one that has just been registered) might additionally involve reading from disk or the network. We aim to hide this latency by prefetching: the scheduler can predict which adapter will be needed (based on the queue of incoming requests) and initiate the load in advance. Also, if multiple requests for the same adapter come in, they will
|
||
\f0\b share
|
||
\f1\b0 the single loaded instance \'96 we won\'92t duplicate the adapter in memory for each request. Instead, one copy is loaded and reference-counted.\
|
||
\ls17\ilvl0
|
||
\f0\b \kerning1\expnd0\expndtw0 \outl0\strokewidth0 {\listtext \uc0\u8226 }\expnd0\expndtw0\kerning0
|
||
\outl0\strokewidth0 \strokec2 Unified Paging of KV + Adapters:
|
||
\f1\b0 Because both KV caches and adapters draw from the same memory pool, we enforce a unified policy to prevent one from starving the other. For instance, we might reserve a portion of GPU memory specifically for adapters (say 20%) and the rest for KV cache, or make it fully dynamic with priority. Our eviction policy (from section 4) will consider both types: e.g. if the pool is full and a new adapter needs memory, it might evict some KV pages from an idle session
|
||
\f2\i or
|
||
\f1\i0 evict a less-used adapter, whichever has lower impact. The
|
||
\f0\b eviction order
|
||
\f1\b0 across types might be decided by a cost heuristic: evict whichever of (some KV pages vs an adapter) frees the most space with least expected future penalty. S-LoRA\'92s unified paging indicates that managing them together is possible and beneficial. We will likely implement a unified LRU across all pages, but with a tweak: adapter pages might have a slightly different aging curve than KV pages. For example, KV pages might naturally cycle as sessions end, whereas adapter pages might stick around longer. We ensure that eviction does not consistently choose KV over adapters or vice versa in a way that thrashes one side; tuning may involve weighting recency for adapters differently.\
|
||
\ls17\ilvl0
|
||
\f0\b \kerning1\expnd0\expndtw0 \outl0\strokewidth0 {\listtext \uc0\u8226 }\expnd0\expndtw0\kerning0
|
||
\outl0\strokewidth0 \strokec2 S-LoRA Reference Implementation:
|
||
\f1\b0 We take inspiration from
|
||
\f0\b S-LoRA
|
||
\f1\b0 , which reports up to 4\'d7 higher throughput and several orders of magnitude more adapters served than naive approaches. The key features we emulate are: unified memory management (addressed above), heterogeneous batching (addressed with our scheduler), and parallel LoRA application kernels. We may use or adapt S-LoRA\'92s published CUDA kernels or implement our own. The goal is that whether there are 2 adapters or 2000, the system can load and unload them dynamically and schedule work so that GPU utilization remains high. In practical terms, if requests for 100 different adapters arrive at once, our system can consolidate their work onto a handful of GPUs by intermixing them instead of dedicating one GPU per adapter. This consolidation is what makes multi-tenancy efficient.\
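The int8-plus-scale representation mentioned under Adapter Loading and Compression can be sketched as follows. This is a minimal illustration, assuming symmetric per-row quantization; the function names and the per-row granularity are hypothetical choices, not a format the plan has fixed.\

```python
# Hypothetical sketch: symmetric per-row int8 quantization of a LoRA matrix.
# Each row is stored as integer codes plus one float scale.

def quantize_row(row):
    """Return (codes, scale) such that code * scale approximates each value."""
    max_abs = max(abs(v) for v in row)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    codes = [max(-127, min(127, round(v / scale))) for v in row]
    return codes, scale

def dequantize_row(codes, scale):
    """Expand the codes back to floats just before the adapter is applied."""
    return [c * scale for c in codes]

def quantize_matrix(matrix):
    return [quantize_row(row) for row in matrix]

def dequantize_matrix(packed):
    return [dequantize_row(codes, scale) for codes, scale in packed]

# Tiny made-up adapter matrix, used only to exercise the round trip.
lora_a = [[0.5, -1.0, 0.25], [0.0, 0.0, 0.0], [2.0, 1.0, -2.0]]
packed = quantize_matrix(lora_a)
restored = dequantize_matrix(packed)
max_err = max(
    abs(x - y)
    for original, approx in zip(lora_a, restored)
    for x, y in zip(original, approx)
)
```

The round-trip error of each value is bounded by half of its row scale, which is the kind of bound the adapter correctness tests in the evaluation harness could assert.\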
|
||
\pard\pardeftab720\sa240\partightenfactor0
|
||
\cf0 \strokec2 In summary, our multi-tenant adapter serving plan ensures that
|
||
\f0\b \strokec2 many adapters can co-exist with minimal overhead
|
||
\f1\b0 \strokec2 . Adapters are on the GPU only when needed, requests for different adapters can be batched together, and memory is shared with the KV cache under a unified policy to avoid conflict. By following the techniques introduced by S-LoRA and Punica, we build on a design that has been shown to scale to thousands of adapters with high throughput.\
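The cross-type eviction heuristic described under Unified Paging of KV + Adapters could look roughly like the sketch below. The scoring formula, the field names, and the adapter aging weight are all assumptions made for illustration; the plan only states that eviction should free the most space for the least expected future penalty and that adapters may age differently from KV pages.\

```python
from dataclasses import dataclass

@dataclass
class Resident:
    name: str
    kind: str            # "kv" or "adapter"
    size_bytes: int
    idle_seconds: float  # time since last use
    reload_cost: float   # hypothetical estimate of the penalty if needed again

def eviction_score(item, adapter_age_weight=0.5):
    """Higher score means a better eviction candidate (hypothetical heuristic)."""
    age = item.idle_seconds
    if item.kind == "adapter":
        age *= adapter_age_weight  # adapters age more slowly than KV pages
    return item.size_bytes * age / (1.0 + item.reload_cost)

def choose_victims(pool, bytes_needed):
    """Evict the highest-scoring residents until enough bytes are freed."""
    victims, freed = [], 0
    for item in sorted(pool, key=eviction_score, reverse=True):
        if freed >= bytes_needed:
            break
        victims.append(item)
        freed += item.size_bytes
    return victims

pool = [
    Resident("kv:session-1", "kv", 4096, idle_seconds=30.0, reload_cost=1.0),
    Resident("kv:session-2", "kv", 4096, idle_seconds=1.0, reload_cost=1.0),
    Resident("adapter:style", "adapter", 8192, idle_seconds=20.0, reload_cost=4.0),
]
victims = choose_victims(pool, bytes_needed=6000)
```

In this example the long-idle KV pages go first and the adapter second, while the recently used session keeps its pages. Tuning the aging weight is one way to keep either side from being thrashed.\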
|
||
\pard\pardeftab720\sa298\partightenfactor0
|
||
|
||
\f0\b\fs36 \cf0 \strokec2 8. Comprehensive Evaluation Harness\
|
||
\pard\pardeftab720\sa240\partightenfactor0
|
||
|
||
\f1\b0\fs24 \cf0 \strokec2 We will develop a rigorous evaluation harness to measure performance and correctness, covering the critical metrics listed. This harness will be used continuously during development to validate that each subsystem (scheduling, quantization, etc.) is helping rather than harming the end goals.\
|
||
Key components of the evaluation suite:\
|
||
\pard\tx220\tx720\pardeftab720\li720\fi-720\sa240\partightenfactor0
|
||
\ls18\ilvl0
|
||
\f0\b \cf0 \kerning1\expnd0\expndtw0 \outl0\strokewidth0 {\listtext \uc0\u8226 }\expnd0\expndtw0\kerning0
|
||
\outl0\strokewidth0 \strokec2 Latency Benchmarks (p50/p95/p99):
|
||
\f1\b0 We will measure the decode step latency (time per token generation) under various conditions. This includes single-token latency (p50 median latency for a single inference) and tail latency (p95, p99 for bursts or batched scenarios). The harness will simulate realistic loads \'96 e.g., multiple concurrent users with different context lengths \'96 and record token generation times. We need to ensure that tail latency is within acceptable ranges, especially with our scheduling and WASM interruption in place. For example, we might find that a certain configuration causes p99 to spike; the harness would catch that so we can adjust (maybe adjust the epoch timing or batch scheduler to cut off outliers). We\'92ll automate this test across different hardware (CPU vs GPU) as well.\
|
||
\ls18\ilvl0
|
||
\f0\b \kerning1\expnd0\expndtw0 \outl0\strokewidth0 {\listtext \uc0\u8226 }\expnd0\expndtw0\kerning0
|
||
\outl0\strokewidth0 \strokec2 Memory Usage per Token & KV Growth:
|
||
\f1\b0 We will create tests that feed increasingly long inputs and record memory usage patterns. For example, feed a conversation that grows to 1k, 2k, \'85 10k tokens, and log memory allocated for KV cache, how much was quantized, etc. This produces a
|
||
\f0\b KV growth curve
|
||
\f1\b0 . We expect, with our quantization and eviction, that memory usage will plateau or grow sub-linearly beyond a point (as older parts get compressed or dropped). If instead we see linear growth without bound, that means our policies failed \'96 the harness would flag it. Additionally, we measure memory
|
||
\f2\i per token
|
||
\f1\i0 : how many bytes of GPU RAM are used for each token in context. With quantization, this number should drop: going from FP16 to 2-bit storage, for example, is a nominal 8\'d7 reduction in KV payload before scales and zero-points are counted. We will verify the measurements against theoretical expectations and published results (KIVI reports up to 2.6\'d7 lower peak memory, a figure that includes model weights). Any discrepancy might reveal fragmentation or metadata overhead, which we can then optimize.\
|
||
\ls18\ilvl0
|
||
\f0\b \kerning1\expnd0\expndtw0 \outl0\strokewidth0 {\listtext \uc0\u8226 }\expnd0\expndtw0\kerning0
|
||
\outl0\strokewidth0 \strokec2 Quality Drift over Long Contexts:
|
||
\f1\b0 We will validate that model output quality does not degrade unacceptably over long contexts. This requires tests on tasks where ground truth or expected behavior is known even with long prompts. For instance, we can use a long document QA dataset: provide a very long text and ask a question about the beginning. We compare the answer when the model has the full context precisely vs when it has gone through our pipeline (with quantization, maybe some rematerialization). We also use perplexity measurements on language modeling benchmarks at various context lengths. The harness will, for example, take a 4,000-token text, run the model with no quantization (as reference), and then run with our progressive quantization, and compute perplexity on predicting the next tokens. If perplexity rises significantly, that indicates quality loss due to quantization error accumulation. We then test mitigations (like SQuat or more frequent rematerialization) to see if we recover quality. By doing this systematically (e.g. at 1k, 2k, 4k, 8k context), we can chart how quality metrics drift and ensure they stay within acceptable bounds. Our target is that even at maximum context, the model\'92s performance is close to a baseline (perhaps <5% degradation in perplexity or accuracy on tasks). If we see larger drops, the harness will highlight it and we\'92ll adjust our quantization strategy thresholds.\
|
||
\ls18\ilvl0
|
||
\f0\b \kerning1\expnd0\expndtw0 \outl0\strokewidth0 {\listtext \uc0\u8226 }\expnd0\expndtw0\kerning0
|
||
\outl0\strokewidth0 \strokec2 Adapter Correctness and Regression:
|
||
\f1\b0 We must verify that applying adapters (LoRAs) yields the intended model behavior exactly as if the model were fine-tuned or the LoRA merged offline. The harness will include tests where we have ground-truth outputs: for example, if we have a small LoRA that changes the model to output in a certain style, we check that on a known prompt, the output matches the expected adaptation. We\'92ll run the model with the adapter applied at runtime and compare it against an offline-merged version of the model; the difference should be only minor floating-point noise. We\'92ll also run the base model without the adapter to confirm that the adapter actually changes the output. We will also test concurrent adapters: ensure that when two requests use different adapters, they don\'92t interfere (no cross-talk or memory corruption). Additionally, we will maintain a set of core metrics for each adapter (like accuracy on a task it was meant for) and ensure that as we change scheduling or memory management, those metrics don\'92t regress. For instance, if an adapter was fine-tuned for sentiment analysis, we\'92ll run a small evaluation of that in the harness and confirm the outputs remain as before, even as we tweak the system\'92s internals.\
|
||
\ls18\ilvl0
|
||
\f0\b \kerning1\expnd0\expndtw0 \outl0\strokewidth0 {\listtext \uc0\u8226 }\expnd0\expndtw0\kerning0
|
||
\outl0\strokewidth0 \strokec2 Quantization Error Accumulation and Recovery:
|
||
\f1\b0 This test suite will intentionally stress the quantization system. For example, run the model for a long conversation (many tokens) and periodically
|
||
\f0\b unquantize
|
||
\f1\b0 the cache to see if the model can recover. One specific test: take a scenario and run it twice \'96 once with periodic cache resets (clear and recompute, akin to a perfect-precision refresh) and once with continuous quantization. Then compare the model\'92s answers or logit distributions at certain checkpoints. If we observe divergence, we measure its magnitude and whether it grows with context length. The harness can include automated detection: e.g., embed the outputs or use a similarity measure to see if responses drift after a long time. If drift is found, we test our recovery mechanisms: perhaps after N tokens we flush part of the cache (forcing a rematerialization, which effectively resets quantization errors). The harness will help find an optimal schedule for such refreshes if needed. Our goal is that any accumulation of quantization error is bounded \'96 ideally the model\'92s outputs after long runs remain coherent and correct. Should the harness find unbounded error growth, that will prompt us to incorporate techniques like
|
||
\f0\b high precision periodic sync
|
||
\f1\b0 (maybe temporarily using FP16 for a layer once every few hundred tokens to realign) or using SQuat which is designed to minimize accumulated error by its orthogonal projection approach.\
|
||
\ls18\ilvl0
|
||
\f0\b \kerning1\expnd0\expndtw0 \outl0\strokewidth0 {\listtext \uc0\u8226 }\expnd0\expndtw0\kerning0
|
||
\outl0\strokewidth0 \strokec2 Scheduler and Gating Effects:
|
||
\f1\b0 We will also explicitly test the contribution of our scheduling policies (like adapter batching or KV eviction thresholds). For example, we\'92ll measure throughput with and without heterogeneous batching to quantify the gain (expecting something akin to S-LoRA\'92s up to 4\'d7 throughput improvement on multi-adapter loads). We\'92ll simulate high load multi-tenant scenarios to ensure our scheduler improves p95 latency and throughput. If any gating rule (like \'93if context > N, quantize\'94) ends up harming quality disproportionately, the harness will catch that and we can adjust N or use a smarter criterion (like based on model confidence).\
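The p50/p95/p99 figures used throughout this list can be computed from raw per-token timings with a nearest-rank rule, as in the sketch below. The function names and the 10% regression tolerance are illustrative assumptions, not values fixed by the plan.\

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: smallest sample with at least p percent at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]

def latency_report(samples_ms):
    """Return (p50, p95, p99) for a list of per-token latencies in milliseconds."""
    return (
        percentile(samples_ms, 50),
        percentile(samples_ms, 95),
        percentile(samples_ms, 99),
    )

def tail_regressed(baseline_p99, current_p99, tolerance=0.10):
    """Flag a run whose p99 exceeds the baseline by more than the tolerance fraction."""
    return current_p99 > baseline_p99 * (1.0 + tolerance)

# Synthetic timings of 1..100 ms, used only to exercise the functions.
samples = [float(i) for i in range(1, 101)]
p50, p95, p99 = latency_report(samples)
```

A CI job could store the baseline p99 per configuration and fail the build when tail_regressed returns true.\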
|
||
\pard\pardeftab720\sa240\partightenfactor0
|
||
\cf0 \strokec2 The evaluation harness will be integrated into our CI pipeline \'96 every change runs these tests on a suite of models (small ones for speed and large ones for stress). By making this mandatory, we ensure
|
||
\f0\b \strokec2 no optimization silently hurts the system
|
||
\f1\b0 \strokec2 : we\'92ll either see it in metrics or not merge that change. Only with this concrete benchmark suite can we iterate confidently on the complex scheduler and quantization logic.\
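For the memory-per-token check described above, the theoretical KV payload can be derived from the model shape. The sketch below assumes a hypothetical model with 32 layers, 8 key/value heads and a head dimension of 128, and it ignores quantization metadata such as scales and zero-points.\

```python
def kv_bytes_per_token(layers, kv_heads, head_dim, bits_per_value):
    """Payload bytes for one token's keys and values, excluding scales and zero-points."""
    values = 2 * layers * kv_heads * head_dim  # 2 = one key vector plus one value vector
    return values * bits_per_value // 8

def growth_is_bounded(samples, max_bytes_per_token):
    """samples: list of (context_tokens, kv_bytes) pairs recorded by the harness."""
    return all(kv_bytes <= tokens * max_bytes_per_token for tokens, kv_bytes in samples)

# Hypothetical model shape: 32 layers, 8 KV heads, head dimension 128.
fp16 = kv_bytes_per_token(32, 8, 128, 16)
two_bit = kv_bytes_per_token(32, 8, 128, 2)
```

For this shape the FP16 payload is 131072 bytes (128 KiB) per token and the 2-bit payload is 16384 bytes, a nominal 8x reduction. Measured values above the theoretical figure point to metadata or fragmentation overhead.\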
|
||
\pard\pardeftab720\sa298\partightenfactor0
|
||
|
||
\f0\b\fs36 \cf0 \strokec2 9. Ruvector Integration Roles and APIs\
|
||
\pard\pardeftab720\sa240\partightenfactor0
|
||
|
||
\fs24 \cf0 Ruvector
|
||
\f1\b0 \strokec2 is envisioned as a
|
||
\f2\i \strokec2 learning and memory subsystem
|
||
\f1\i0 \strokec2 for the runtime. We clarify how ruvector will be used in three roles and outline the APIs/storage involved:\
|
||
\pard\tx220\tx720\pardeftab720\li720\fi-720\sa240\partightenfactor0
|
||
\ls19\ilvl0
|
||
\f0\b \cf0 \kerning1\expnd0\expndtw0 \outl0\strokewidth0 {\listtext \uc0\u8226 }\expnd0\expndtw0\kerning0
|
||
\outl0\strokewidth0 \strokec2 A. Policy Memory Store for Learned Thresholds:
|
||
\f1\b0 As the system runs, it could learn optimal settings (like quantization levels or eviction thresholds) based on observed outcomes. Ruvector can serve as a database to store these learned policies. For example, we might use reinforcement learning or simply heuristic logging to adjust something like \'93optimal KV quantization bit-width as a function of context length and content type.\'94 The data to back this could be high-dimensional (taking into account properties of the input). Ruvector, being a vector database that
|
||
\f2\i \'93learns and improves from every interaction\'94
|
||
\f1\i0 , can store an
|
||
\f0\b embedding of the context
|
||
\f1\b0 or some representation of the session state along with the outcome metrics (latency, quality). Over time, we can cluster or correlate which contexts benefit from which strategy. The API here would look like: after each session or critical event, we form a vector (embedding) of the situation and store it via
|
||
\f4\fs26 ruvector.insert(key, vector, metadata)
|
||
\f1\fs24 , where metadata might include the strategy used and the results (e.g. \'93quantized at 4k tokens, quality good\'94). Later, when a new session starts or reaches a decision point, we query ruvector for similar situations:
|
||
\f4\fs26 ruvector.query(vector, k=5)
|
||
\f1\fs24 might return the closest stored experiences. If those all indicate that, say, KIVI 2-bit was fine, we proceed; if they indicate that quality dropped, we might choose a different path (maybe use SQuat or none). This effectively allows the system to
|
||
\f0\b learn thresholds dynamically
|
||
\f1\b0 rather than hard-coding them. Initially, we might run in a logging mode and manually analyze, but eventually the loop can be closed for self-optimizing behavior.\
|
||
\ls19\ilvl0
|
||
\f0\b \kerning1\expnd0\expndtw0 \outl0\strokewidth0 {\listtext \uc0\u8226 }\expnd0\expndtw0\kerning0
|
||
\outl0\strokewidth0 \strokec2 B. Session State Index for Adapter Selection & Cache Locality:
|
||
\f1\b0 In a multi-tenant system, ruvector can act as an index of sessions and their properties, enabling better placement and retrieval. We\'92ll maintain an entry in ruvector for each active session (or recently active one). The vector could encode the user\'92s preferences or past usage (for example, which adapters they use frequently, what kind of prompts they have \'96 embedded into a vector). The
|
||
\f0\b adapter selection
|
||
\f1\b0 part means if a new request comes in for a certain task, we can quickly find if there\'92s an adapter that fits (by querying similar past sessions or known task vectors). This could guide routing: e.g. we identify which LoRA to load for a given query if not explicitly specified. The
|
||
\f0\b cache locality
|
||
\f1\b0 aspect is about optimizing memory usage: ruvector can help us decide if we should colocate certain sessions on the same GPU because they use similar adapters or content (thus could batch). If ruvector tells us Session X and Y are very similar in embedding (say both are long chats about programming), we might schedule them closely so they can share context or at least ensure their memory pages might reside together for efficiency. The API could be: when a session ends or after N turns, update its embedding in ruvector. When scheduling new requests, query for the nearest neighbor sessions and see if they are on a particular server or GPU \'96 then possibly assign the new request to the same location to
|
||
\f0\b reuse loaded adapters or cached knowledge
|
||
\f1\b0 . Essentially, ruvector becomes a smart directory of sessions, supporting decisions like \'93which adapter to pre-load for this user\'92s next query\'94 or \'93which server has the most relevant data for this request\'94.\
|
||
\ls19\ilvl0
|
||
\f0\b \kerning1\expnd0\expndtw0 \outl0\strokewidth0 {\listtext \uc0\u8226 }\expnd0\expndtw0\kerning0
|
||
\outl0\strokewidth0 \strokec2 C. Witness Log Index for Postmortem & Audit:
|
||
\f1\b0 Every interaction and decision in the system can be logged as a \'93witness\'94 \'96 a data point that can be later analyzed. We will use ruvector to index these logs in a semantic way. For example, each completed request can produce: an embedding of the prompt+response (to capture the content semantics), the system decisions (quantization used, any interruptions, etc.), and outcome (latency, errors if any). This goes into ruvector as a vector with attached metadata (the log). Later, if we are investigating an incident (say a particular query produced a wrong answer or the system lagged), we can perform a
|
||
\f0\b semantic search
|
||
\f1\b0 in these logs. Perhaps an auditor can query \'93find all requests that had to be interrupted for timeout\'94 or \'93find similar conversations to this one that got a bad answer\'94. Ruvector\'92s ability to store and search high-dimensional data makes it ideal for this. The
|
||
\f0\b witness logs
|
||
\f1\b0 stored allow
|
||
\f2\i postmortem analysis
|
||
\f1\i0 : after deployment, developers can cluster failures or outliers and derive improvements. They also assist in compliance audits \'96 e.g. if a user reports an inappropriate response, we can find it and related cases by similarity. API-wise, we\'92d have something like
|
||
\f4\fs26 ruvector.log(vector, info)
|
||
\f1\fs24 for each request. For retrieval, an admin tool might do
|
||
\f4\fs26 ruvector.search(vector_or_text, filter=condition)
|
||
\f1\fs24 .\
|
||
\ls19\ilvl0
|
||
\f0\b \kerning1\expnd0\expndtw0 \outl0\strokewidth0 {\listtext \uc0\u8226 }\expnd0\expndtw0\kerning0
|
||
\outl0\strokewidth0 \strokec2 Concrete Storage Model:
|
||
\f1\b0 Ruvector in distributed mode would likely run as a service or embedded database. We will integrate it such that there is a ruvector instance on each node (or one central instance). It can scale horizontally with Raft consensus (per their docs), which is good for reliability. We\'92ll define the schemas for each entry type (policy, session, log), possibly as separate collections or namespaces in ruvector. The vectors themselves might have different dimensions (small for policy entries, larger for session embeddings). Ruvector\'92s learning capability means it might adjust indices or create summary indices automatically \'96 we\'92ll leverage that to improve query speed as data grows. We\'92ll also ensure that PII and other sensitive information are handled appropriately, most likely by storing embeddings rather than raw text and anonymizing metadata where practical.\
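The insert-then-query loop described for the policy memory store (role A) can be sketched with an in-memory stand-in. The class below is not the ruvector API; PolicyMemory, its method signatures, and the decision rule in choose_strategy are hypothetical and only illustrate the flow of storing an outcome and consulting similar past situations.\

```python
import math

class PolicyMemory:
    """In-memory stand-in for the policy store; this is not the ruvector API."""

    def __init__(self):
        self.entries = []  # tuples of (key, vector, strategy, quality_ok)

    def insert(self, key, vector, strategy, quality_ok):
        self.entries.append((key, list(vector), strategy, quality_ok))

    def query(self, vector, k=5):
        """Return the k stored entries most similar to vector by cosine similarity."""
        def cosine(a, b):
            dot = sum(x * y for x, y in zip(a, b))
            norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
            return dot / norm if norm else 0.0
        ranked = sorted(self.entries, key=lambda e: cosine(vector, e[1]), reverse=True)
        return ranked[:k]

def choose_strategy(memory, vector, candidate, fallback, k=3):
    """Use the candidate only if every similar past run that used it kept quality acceptable."""
    similar = [e for e in memory.query(vector, k) if e[2] == candidate]
    if similar and all(e[3] for e in similar):
        return candidate
    return fallback

memory = PolicyMemory()
memory.insert("s1", [1.0, 0.0], "kivi-2bit", True)
memory.insert("s2", [0.9, 0.1], "kivi-2bit", True)
memory.insert("s3", [0.0, 1.0], "kivi-2bit", False)
decision_a = choose_strategy(memory, [1.0, 0.05], "kivi-2bit", "fp16", k=2)
decision_b = choose_strategy(memory, [0.05, 1.0], "kivi-2bit", "fp16", k=1)
```

Here the first session resembles past runs where 2-bit quantization held up, so it proceeds with it, while the second resembles a run where quality dropped and falls back to the higher-precision path.\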
|
||
\pard\pardeftab720\sa240\partightenfactor0
|
||
\cf0 \strokec2 In summary, ruvector will be the
|
||
\f0\b \strokec2 intelligent memory
|
||
\f1\b0 \strokec2 of the system: storing past experiences and using them to make the system smarter and more auditable over time. We\'92ve identified clear APIs for inserting and querying this information, which cements its role beyond a conceptual \'93nice-to-have\'94. It becomes a core component for adaptive behavior (learning thresholds), efficient multi-session handling (session index), and traceability (audit log).\
|
||
\pard\pardeftab720\sa298\partightenfactor0
|
||
|
||
\f0\b\fs36 \cf0 \strokec2 10. Model Format Support and Feature Coverage from Day One\
|
||
\pard\pardeftab720\sa240\partightenfactor0
|
||
|
||
\f1\b0\fs24 \cf0 \strokec2 Finally, we specify the range of model formats and architectural features the system will support to ensure broad applicability:\
|
||
\pard\tx220\tx720\pardeftab720\li720\fi-720\sa240\partightenfactor0
|
||
\ls20\ilvl0
|
||
\f0\b \cf0 \kerning1\expnd0\expndtw0 \outl0\strokewidth0 {\listtext \uc0\u8226 }\expnd0\expndtw0\kerning0
|
||
\outl0\strokewidth0 \strokec2 Model File Formats:
|
||
\f1\b0 We will support the
|
||
\f0\b GGUF/GGML
|
||
\f1\b0 family of formats out-of-the-box, as they are popular for LLMs with quantization and easy CPU loading. Additionally, we will support Hugging Face Transformer models (through either direct loading of PyTorch safetensors or via conversion to our internal format). Our internal format might be a variant of GGUF (since it\'92s extensible and already used for quantized models like LLaMA). We\'92ll document a conversion tool for PyTorch checkpoints to our format if needed. Tokenizers will be supported via the Hugging Face
|
||
\f2\i tokenizers
|
||
\f1\i0 library or a compatible Rust implementation, ensuring we can handle BPE, SentencePiece, etc., for all major model types. On day one, we aim to run models like LLaMA-2, Falcon, Mistral, GPT-NeoX, etc., which covers a broad set of attention mechanisms and tokenizer quirks.\
|
||
\ls20\ilvl0
|
||
\f0\b \kerning1\expnd0\expndtw0 \outl0\strokewidth0 {\listtext \uc0\u8226 }\expnd0\expndtw0\kerning0
|
||
\outl0\strokewidth0 \strokec2 Rotary and ALiBi Positional Embeddings:
|
||
\f1\b0 We will fully support
|
||
\f0\b RoPE (Rotary Positional Embedding)
|
||
\f1\b0 as used in LLaMA/Mistral, including extended context modifications (like NTK scaling). This means our attention kernels can incorporate the rotary transforms on keys and queries. Importantly, KVQuant quantizes keys before RoPE is applied, so when that scheme is used our kernels apply the rotation after dequantization; the design accounts for this. We will also support
|
||
\f0\b ALiBi (Attention with Linear Biases)
|
||
\f1\b0 positions for models that use it (such as BLOOM and MPT). This requires the attention code to add a static, head-specific bias that grows linearly with query-key distance; we will include that in our kernel implementations or handle it in the model data. Variants such as T5\'92s relative attention or XLNet\'92s are not a first priority, but our code is written for extensibility so that new attention bias patterns can be added.\
|
||
\ls20\ilvl0
|
||
\f0\b \kerning1\expnd0\expndtw0 \outl0\strokewidth0 {\listtext \uc0\u8226 }\expnd0\expndtw0\kerning0
|
||
\outl0\strokewidth0 \strokec2 Grouped Query Attention (GQA):
|
||
\f1\b0 Some newer models (e.g. LLaMA-2 70B and Mistral-7B) use
|
||
\f0\b GQA
|
||
\f1\b0 , where multiple query heads share the same key/value projections (reducing the number of distinct key/value heads). Our implementation will handle GQA by allowing the number of query heads in the model to differ from the number of key/value heads. For example, if a model has 8 key/value groups for 32 query heads, our attention will treat it appropriately (keys/values effectively have a shape with 8 groups, and each group is used by 4 query heads). We ensure our data structures (especially the KV cache) and kernels account for this shared-head scenario. This might involve minor changes in how we index the KV cache pages (group index vs head index). We\'92ll test on a model known to use GQA (like Mistral-7B) to confirm correctness.\
|
||
\ls20\ilvl0
|
||
\f0\b \kerning1\expnd0\expndtw0 \outl0\strokewidth0 {\listtext \uc0\u8226 }\expnd0\expndtw0\kerning0
|
||
\outl0\strokewidth0 \strokec2 Sliding Window / Local Attention:
|
||
\f1\b0 Some models (e.g. Longformer, Mistral-7B) use sliding window or block sparse attention patterns instead of full attention. Our system will support at least a
|
||
\f0\b sliding window attention
|
||
\f1\b0 mechanism: this means the attention kernel will only attend to the last N tokens instead of all previous tokens for each new token (or some fixed pattern). We\'92ll implement this by allowing a configurable attention mask or range. If a model config in the format indicates a sliding window of size w, our attention code will simply mask out (or not retrieve from KV) any keys older than w tokens relative to the current. This can actually integrate well with our KV paging: we can simply not hold pages older than the window since they\'92ll never be used. That yields big memory savings. For block-sparse or other patterns, we may not do a fully general sparse attention on day one, but we will at least be able to support any
|
||
\f2\i contiguous window or prefix
|
||
\f1\i0 style restriction easily. This covers many use cases where context is long but only a recent subset is actively attended.\
|
||
\ls20\ilvl0
|
||
\f0\b \kerning1\expnd0\expndtw0 \outl0\strokewidth0 {\listtext \uc0\u8226 }\expnd0\expndtw0\kerning0
|
||
\outl0\strokewidth0 \strokec2 Long Context Strategies:
|
||
\f1\b0 We commit to supporting long context extensions like
|
||
\f0\b RoPE scaling
|
||
\f1\b0 (the family of techniques, such as position interpolation and NTK-aware scaling, used by community long-context variants of LLaMA-style models). If a model uses an extended context via scaled RoPE frequencies, our position embedding implementation will include those formulas (for example, NTK-aware scaling, where the RoPE base is increased so that low-frequency components are stretched more than high-frequency ones). We\'92ll verify on known long-context models (like LLaMA-2 32k variants or others) that the perplexity matches the reference. Additionally, if models use techniques like position interpolation or segment recurrence, we will handle those in the model forward pass logic. We aim to support any model that adheres to standard transformer architectures up to late 2025: this includes Multi-Query Attention (a single key/value head per layer shared across all query heads, which we support as a special case of GQA) and potentially newer variants such as MHA with linear bias.\
|
||
\ls20\ilvl0
|
||
\f0\b \kerning1\expnd0\expndtw0 \outl0\strokewidth0 {\listtext \uc0\u8226 }\expnd0\expndtw0\kerning0
|
||
\outl0\strokewidth0 \strokec2 Kernel ABI and Test Vectors:
|
||
\f1\b0 With these features in mind, our kernel interface is designed to be flexible. The ABI between model and kernels will pass information about the attention type (full vs grouped vs sliding) so the kernel can handle it. For example, we might have an enum indicating if it\'92s full attention or local with window=256, etc., that the kernel reads. We will prepare
|
||
\f0\b test vectors
|
||
\f1\b0 \'96 known inputs and outputs \'96 for each variant. For instance, we\'92ll take a small model with rotary and ensure our attention output matches a reference PyTorch implementation\'92s output to high precision. We\'92ll do the same for a model with ALiBi, and one with grouped heads, etc. These test vectors will be part of our continuous testing to catch any regression in supporting these features.\
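The head indexing for GQA and the key range for sliding window attention, both described in the bullets above, reduce to a few lines of integer arithmetic. The sketch below assumes that a window of size w covers the current token and the w - 1 tokens before it; the function names and the page-size parameter are hypothetical.\

```python
def kv_head_for_query_head(query_head, num_query_heads, num_kv_heads):
    """Map a query head to the KV head (group) it reads from under GQA."""
    if num_query_heads % num_kv_heads != 0:
        raise ValueError("query heads must divide evenly into KV heads")
    heads_per_group = num_query_heads // num_kv_heads
    return query_head // heads_per_group

def visible_key_range(position, window=None):
    """Key positions a query at the given position may attend to (causal)."""
    start = 0 if window is None else max(0, position - window + 1)
    return range(start, position + 1)

def droppable_pages(position, window, page_size):
    """Number of leading KV pages that lie entirely outside the sliding window."""
    first_visible = visible_key_range(position, window).start
    return first_visible // page_size
```

With 32 query heads and 8 KV heads, query heads 0 to 3 read group 0 and head 31 reads group 7; setting the KV head count to 1 gives Multi-Query Attention, and setting it equal to the query head count gives ordinary multi-head attention. These small cases are natural candidates for the test vectors.\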
|
||
\pard\pardeftab720\sa240\partightenfactor0
|
||
\cf0 \strokec2 By outlining these format and feature supports, we ensure that from day one our system isn\'92t limited to a toy scenario \'96 it can run real modern models with long contexts, different attention schemes, and different fine-tuning methods. This comprehensive coverage influences the design of our kernels and data structures (they must be general enough), but now we have it
|
||
\f0\b \strokec2 locked in writing which models and features are priority
|
||
\f1\b0 \strokec2 , removing any ambiguity for implementers.\
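As a concrete instance of the static bias mentioned in the ALiBi bullet, the sketch below computes per-head slopes for a power-of-two head count using the geometric sequence 2^(-8i/n) from the ALiBi paper, and the bias for one query-key pair. Head counts that are not powers of two need an extra interpolation step that this sketch deliberately leaves out.\

```python
def alibi_slopes(num_heads):
    """Per-head slopes for a power-of-two head count: 2^(-8*i/n) for i = 1..n."""
    if num_heads < 1 or (num_heads & (num_heads - 1)) != 0:
        raise ValueError("this sketch only covers power-of-two head counts")
    return [2.0 ** (-8.0 * i / num_heads) for i in range(1, num_heads + 1)]

def alibi_bias(slope, query_pos, key_pos):
    """Bias added to the attention score: zero on the diagonal, more negative with distance."""
    return -slope * (query_pos - key_pos)

slopes = alibi_slopes(8)
bias = alibi_bias(slopes[0], query_pos=10, key_pos=7)
```

For 8 heads the slopes run from 1/2 down to 1/256, so the first head penalizes a distance of 3 tokens by 1.5. Values like these can be checked against a reference implementation when building the ALiBi test vectors.\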
|
||
\pard\pardeftab720\sa298\partightenfactor0
|
||
|
||
\f0\b\fs36 \cf0 \strokec2 Locking Key Decisions: Next Steps\
|
||
\pard\pardeftab720\sa240\partightenfactor0
|
||
|
||
\f1\b0\fs24 \cf0 \strokec2 To make the
|
||
\f0\b \strokec2 frontier plan actionable
|
||
\f1\b0 \strokec2 , we lock in three critical decisions that guide all implementation efforts going forward:\
|
||
\pard\pardeftab720\sa280\partightenfactor0
|
||
|
||
\f0\b\fs28 \cf0 \strokec2 Decision 1: Default Inference Backend for Attention/Matmul\
|
||
\pard\pardeftab720\sa240\partightenfactor0
|
||
|
||
\f1\b0\fs24 \cf0 \strokec2 We will use a
|
||
\f0\b \strokec2 Mistral.rs-inspired paged attention
|
||
\f1\b0 \strokec2 as the default backend, augmented with quantized KV cache support. This means our primary inference loop will utilize paged memory for KV (improving cache management) and include optimized matrix multiplication for attention. If available (on GPU or specialized hardware), we will integrate custom kernels from projects like vLLM or FlashAttention as a first preference due to their speed; otherwise, our Rust SIMD implementation will handle it. The fallback order is defined: try high-performance device-specific kernels, fall back to our Rust SIMD paged-attention, and finally to a safe reference implementation if needed. By choosing this path, we leverage existing successful techniques in a novel Rust implementation, ensuring we meet performance needs for long contexts by default. The
|
||
\f0\b \strokec2 mistral.rs PagedAttention
|
||
\f1\b0 \strokec2 model serves as a reference for behavior and we explicitly include KV cache quantization in this backend from the start. This decision is now fixed and all team members can proceed assuming paged attention + quantized KV is the core of our attention mechanism.\
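The fallback order fixed by this decision amounts to a first-match selection over an ordered preference list. The backend names in the sketch below are placeholders, not identifiers from the codebase.\

```python
# Hypothetical backend names, listed in the preference order stated above.
PREFERENCE = ("device_kernel", "rust_simd_paged", "reference")

def select_attention_backend(available, preference=PREFERENCE):
    """Return the first backend in preference order that is available."""
    for name in preference:
        if name in available:
            return name
    raise RuntimeError("no attention backend available")

gpu_host = select_attention_backend(["reference", "rust_simd_paged", "device_kernel"])
cpu_host = select_attention_backend(["reference", "rust_simd_paged"])
```

Keeping the order in one place means the test suite can assert the selection for each hardware profile.\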
|
||
\pard\pardeftab720\sa280\partightenfactor0
|
||
|
||
\f0\b\fs28 \cf0 \strokec2 Decision 2: SIMD Kernel Strategy \'96 Use Portable Stable SIMD (Macerator)\
|
||
\pard\pardeftab720\sa240\partightenfactor0
|
||
|
||
\f1\b0\fs24 \cf0 \strokec2 We will implement all low-level compute kernels using
|
||
\f0\b \strokec2 portable SIMD via the
|
||
\f6\fs26 macerator
|
||
\f0\fs24 crate
|
||
\f1\b0 \strokec2 (with fallback to
|
||
\f4\fs26 \strokec2 pulp
|
||
\f1\fs24 \strokec2 if needed). This locks us into a stable Rust solution (no nightly
|
||
\f4\fs26 \strokec2 std::simd
|
||
\f1\fs24 \strokec2 ) that can auto-dispatch to different instruction sets. Macerator\'92s broad architecture support and generic vector types make it an ideal choice for longevity. Developers will write kernel code in a SIMD-generic way; the crate will produce optimized versions for AVX2, AVX-512, NEON, etc. We accept the slight risk of macerator being less tested than pulp, but given its design and expanded support, it aligns with 2026 reality where stable, cross-platform performance is required. Should std::simd stabilize in the future, we can consider migrating, but for now
|
||
\f0\b \strokec2 we standardize on macerator
|
||
\f1\b0 \strokec2 . This decision affects every compute kernel we ship \'96 everyone will use the chosen SIMD abstraction for consistency. Multiversioning will be achieved by macerator\'92s runtime dispatch, ensuring users get the best performance their CPU can offer. By locking this in, we remove uncertainty about how to write and optimize our math routines: the team can confidently proceed with macerator-based implementations.\
|
||
\pard\pardeftab720\sa280\partightenfactor0
|
||
|
||
\f0\b\fs28 \cf0 \strokec2 Decision 3: WASM Execution Budget \'96 Use Epoch-Based Interruption\
|
||
\pard\pardeftab720\sa240\partightenfactor0
|
||
|
||
\f1\b0\fs24 \cf0 \strokec2 We will enforce execution budgets on any WASM-run code using
|
||
\f0\b \strokec2 epoch-based interruption
|
||
\f1\b0 \strokec2 (Wasmtime) as the default mechanism. This means every WASM module invocation will be associated with an epoch deadline to guarantee it can be interrupted if it runs too long, without incurring high per-instruction overhead. We choose epoch over fuel because of its lower overhead for long-running computations (it avoids fuel\'92s fine-grained accounting) and the simplicity of resetting deadlines. The epoch mechanism will be integrated as follows: the engine holds a single epoch counter that a timer thread increments at a fixed interval, and each inference request\'92s store is given a deadline a fixed number of ticks ahead. When the counter reaches that deadline, the running WASM code traps at its next epoch check (at function entries and loops). This keeps tail latencies in check and prevents any rogue code from hanging the system. This decision is now final:
|
||
\f0\b \strokec2 epoch interruption will be used wherever possible
|
||
\f1\b0 \strokec2 . Where epoch interruption isn\'92t available (for example, an older WebAssembly runtime or a special deployment scenario), we\'92ll fall back to fuel metering; our platform of choice, Wasmtime, supports epoch interruption, so that is our standard. We also commit to handling these interruptions gracefully by cleaning up the WASM instance and returning a controlled error. Locking in epoch-based budgeting gives us uniform handling of WASM across the project and avoids later debate about how to handle timeouts.\
|
||
With these three key decisions settled \'96 the attention backend (paged and quantized, influenced by mistral.rs), the SIMD approach (macerator stable SIMD), and the WASM budget (epoch interrupts) \'96 the plan moves from exploration to execution. All other pieces (memory management, scheduling, etc.) will be built on top of these firm choices. We have taken what was previously a research discussion and turned it into an 
|
||
\f0\b \strokec2 engineering blueprint
|
||
\f1\b0 \strokec2 with concrete defaults and fallback strategies. From here, the team can proceed to implementation, confident that the foundational decisions are made and the system\'92s behavior is specified in detail.\
|
||
\pard\pardeftab720\sa298\partightenfactor0
|
||
|
||
\f0\b\fs36 \cf0 \strokec2 Conclusion\
|
||
\pard\pardeftab720\sa240\partightenfactor0
|
||
|
||
\f1\b0\fs24 \cf0 \strokec2 By addressing each missing piece with concrete implementations and policies, we have transformed the original high-level plan into a detailed design ready for engineering. We have specified how attention and matmul will run, how SIMD will be done in Rust today, how we\'92ll quantize and manage KV caches, how memory paging and eviction work, the exact WASM integration approach, secure handling of kernel plugins, multi-adapter serving strategies, testing protocols, ruvector\'92s integration, and comprehensive model support. Finally, we cemented three critical decisions (attention backend, SIMD crate, WASM interruption) to guide development.\
|
||
This updated plan is now
|
||
\f0\b \strokec2 actionable
|
||
\f1\b0 \strokec2 \'96 each component can be implemented and tested according to the descriptions here. With these details in writing, the project moves out of the realm of research debate and into execution. The result will be a robust, state-of-the-art LLM inference engine that is efficient, scalable, and hard to break, by design.\
|
||
} |