ruv cd5943df23 Merge commit 'd803bfe2b1fe7f5e219e50ac20d6801a0a58ac75' as 'vendor/ruvector'

2026-02-28 14:39:40 -05:00

ADR-017: Craftsman Ultra 30b 1bit — BitNet Integration with RuvLLM

Status: Proposed
Date: 2026-02-03
Decision Makers: Ruvector Architecture Team
Technical Area: 1-Bit LLM Inference / MoE Architecture / CPU-Native Serving

Context and Problem Statement

Large language models require substantial GPU resources for inference, limiting deployment to cloud environments and specialized hardware. Recent advances in 1-bit quantization, specifically Microsoft Research's BitNet b1.58, demonstrate that ternary-weight models ({-1, 0, +1}) can match full-precision performance at 3B+ parameters while enabling CPU-only inference at human reading speed.

Concurrently, Zhipu AI's GLM-4.7-Flash introduces a 30B-A3B Mixture-of-Experts architecture that activates only ~3B parameters per token while storing 30B total knowledge, achieving strong coding and agentic benchmarks (SWE-bench Verified: 59.2%, LiveCodeBench v6: 64.0%) with 200K context.

Craftsman Ultra 30b 1bit is a proposed model that combines these two paradigms: a 30B-A3B MoE architecture with native BitNet b1.58 ternary quantization, purpose-built for CPU inference within the RuvLLM serving runtime. This ADR evaluates the integration path, architectural decisions, and trade-offs.

Strategic Goal

Deliver a 30B-class coding/agentic model that runs entirely on consumer CPUs (no GPU required) at 5-15 tokens/second decode, with memory footprint under 8GB, integrated into the RuvLLM + Ruvector ecosystem with SONA self-learning capabilities.

Decision Drivers

Performance Requirements

Metric	Target	Rationale
Decode throughput (CPU)	5-15 tok/s	Human reading speed per BitNet 100B benchmarks
Prefill latency (1K tokens)	<2s	Interactive coding assistant responsiveness
Memory footprint (model)	<8 GB	Fits in 16GB system RAM with OS + KV cache
Memory footprint (KV cache, 4K ctx)	<2 GB	Q8 KV cache for 4096-token context
Active parameter GEMM	Addition-only	BitNet eliminates multiplication in W×A
Energy per inference	<0.05J	BitNet CPU efficiency benchmarks

Architecture Requirements

MoE routing must remain full-precision: Expert selection requires accurate gating scores
Expert weights are ternary: Each expert's linear layers use BitLinear (W1.58A8)
Activations quantized to INT8: Per-token absmax scaling
Shared layers (embeddings, LM head) remain FP16: Critical for quality preservation
GGUF-compatible: Must serialize to/load from GGUF v3 format with custom metadata

Ecosystem Requirements

Integrate with RuvLLM's existing backend abstraction (backends/mod.rs)
Leverage existing GGUF parser (gguf/parser.rs, gguf/quantization.rs)
Support SONA learning loops for per-session adaptation
Compatible with Claude Flow agent routing for task delegation
NAPI bindings for Node.js consumption via npm/packages/ruvllm

Research Summary

BitNet b1.58 Architecture

Source: Microsoft Research, "The Era of 1-bit LLMs" (Feb 2024), bitnet.cpp (Oct 2024)

BitNet b1.58 replaces standard nn.Linear with BitLinear layers:

Forward Pass:
  1. W_ternary = RoundClip(W / (gamma + epsilon), -1, 1)
     where gamma = mean(|W|) (absmean quantization)
  2. X_int8 = Quant(X, absmax)  (per-token 8-bit activation)
  3. Y = W_ternary @ X_int8      (integer addition only, no multiplication)
  4. Y_float = Dequant(Y)         (rescale to float)

Key properties:

Weights: ternary {-1, 0, +1} → 1.58 bits per parameter
Activations: INT8 per-token (absmax scaling)
Matrix multiply becomes addition and subtraction only (no FP multiply)
Zero weights enable feature filtering (sparse activation within dense layers)
Must be trained from scratch — post-training quantization to 1-bit destroys quality
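The forward pass above can be sketched in a few lines of Rust. This is a minimal illustration, not the bitnet.cpp or RuvLLM implementation: the function names (`quantize_absmean`, `ternary_dot`, `dequantize`) and the `EPS` value are assumptions made for the example, and the input activations are taken as already quantized to INT8.

```rust
/// Absmean ternary quantization (step 1): returns (ternary weights, gamma).
/// `EPS` is an assumed guard against division by zero for an all-zero block.
pub fn quantize_absmean(weights: &[f32]) -> (Vec<i8>, f32) {
    const EPS: f32 = 1e-8;
    if weights.is_empty() {
        return (Vec::new(), 0.0);
    }
    let gamma = weights.iter().map(|w| w.abs()).sum::<f32>() / weights.len() as f32;
    let ternary = weights
        .iter()
        .map(|w| (w / (gamma + EPS)).round().clamp(-1.0, 1.0) as i8)
        .collect();
    (ternary, gamma)
}

/// Dot product of one ternary weight row with INT8 activations (step 3).
/// Only integer add/subtract is used; zero weights are skipped entirely.
pub fn ternary_dot(w: &[i8], x: &[i8]) -> i32 {
    assert_eq!(w.len(), x.len());
    let mut acc = 0i32;
    for (&wi, &xi) in w.iter().zip(x) {
        match wi {
            1 => acc += xi as i32,
            -1 => acc -= xi as i32,
            _ => {}
        }
    }
    acc
}

/// Rescale the integer accumulator to float (step 4):
/// y ≈ acc * gamma * x_scale, where x_scale = max|X| / 127.
pub fn dequantize(acc: i32, gamma: f32, x_scale: f32) -> f32 {
    acc as f32 * gamma * x_scale
}
```

For example, the weights `[0.9, -0.05, -1.1, 0.4]` have gamma = 0.6125 and quantize to `[1, 0, -1, 1]`; against activations `[10, 20, 30, -5]` the accumulator is 10 - 30 - 5 = -25.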

Inference kernels (bitnet.cpp):

Kernel	Method	Compression	Best For
I2_S	2-bit pack, unpack-and-multiply	2 bits/weight	Bandwidth-limited
TL1	2-weight → 4-bit LUT index	2 bits/weight	Balanced CPU
TL2	3-weight → 5-bit LUT index	1.67 bits/weight	Memory-limited

CPU performance (bitnet.cpp benchmarks):

Platform	Speedup vs FP16	Energy Reduction
ARM (NEON)	1.37x – 5.07x	55-70%
x86 (AVX2)	2.37x – 6.17x	72-82%
x86 (AVX512)	~6x+	~85%

GLM-4.7-Flash Architecture

Source: Zhipu AI / Z.AI (Jan 2026)

Property	Value
Total parameters	~30B (31B reported)
Active parameters	~3B (A3B)
Architecture	Mixture of Experts (MoE)
Shared layers	~2B parameters
Expert layers	~28B (distributed across experts)
Context window	200K tokens (MLA-based)
Training data	15T general + 7T reasoning/code tokens
Attention	Multi-head Latent Attention (MLA) with QK-Norm
Activation	SwiGLU
Position encoding	RoPE
Speculative decoding	Multi-Token Prediction (MTP) layer
Reasoning	Interleaved + Retention-Based + Round-Level

Benchmark performance:

Benchmark	Score
AIME 25	91.6%
GPQA	75.2%
SWE-bench Verified	59.2%
LiveCodeBench v6	64.0%
HLE	14.4%
tau2-Bench	79.5%

RuvLLM Current Capabilities (Relevant)

GGUF v3 parser: Full format support including IQ1_S (1.56 bits/weight, type 19)
Quantization pipeline: Q4_K_M, Q5_K_M, Q8_0, F16 (no native ternary training)
Backends: Candle (Metal/CUDA), mistral-rs (PagedAttention), CoreML (ANE)
No CPU-optimized ternary kernel: Current backends target GPU acceleration
SIMD kernels: Existing NEON/SSE4.1/AVX2 infrastructure in crates/ruvllm/src/kernels/
MicroLoRA: Rank 1-2 adapters with <1ms adaptation (compatible with BitNet)
SONA: Three-tier learning (instant/background/deep) — can drive ternary adapter training

RuvLLM RLM Training Stack (Reusable for Distillation)

RuvLLM contains a mature reinforcement-learning-from-model-feedback (RLM) training stack that directly accelerates Craftsman Ultra distillation. These components are production-tested and reduce net-new code by ~70%.

GRPO — Group Relative Policy Optimization (training/grpo.rs, 897 lines)

Critic-free RL: computes relative advantages within sample groups
Adaptive KL divergence penalty (kl_target, clip_range) controls teacher-student divergence
PPO-style clipping prevents catastrophic updates
Preset configs: GrpoConfig::stable() (safe distillation), GrpoConfig::for_tool_use() (expert routing)
Thread-safe batch processing via RwLock<VecDeque<SampleGroup>>
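As a rough illustration of the two ideas listed above (group-relative advantages and PPO-style clipping), the following sketch uses the generic textbook formulation. It is not taken from training/grpo.rs; the function names and the `EPS` constant are hypothetical.

```rust
/// Group-relative advantage: (r_i - mean) / (std + eps), computed inside one
/// sample group so that no learned critic is needed.
pub fn group_relative_advantages(rewards: &[f32]) -> Vec<f32> {
    const EPS: f32 = 1e-8; // assumed stabilizer for zero-variance groups
    if rewards.is_empty() {
        return Vec::new();
    }
    let n = rewards.len() as f32;
    let mean = rewards.iter().sum::<f32>() / n;
    let var = rewards.iter().map(|r| (r - mean).powi(2)).sum::<f32>() / n;
    let std = var.sqrt();
    rewards.iter().map(|r| (r - mean) / (std + EPS)).collect()
}

/// PPO-style clipped surrogate for one sample:
/// min(ratio * A, clip(ratio, 1 - c, 1 + c) * A).
pub fn clipped_objective(ratio: f32, advantage: f32, clip_range: f32) -> f32 {
    let clipped = ratio.clamp(1.0 - clip_range, 1.0 + clip_range);
    (ratio * advantage).min(clipped * advantage)
}
```

With rewards `[1.0, -0.5, 1.0, -0.5]` (the two reward values the evaluator emits), the group mean is 0.25 and the standard deviation 0.75, so the advantages come out as +1 and -1.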

RealContrastiveTrainer (training/real_trainer.rs, 1000 lines)

Candle-based training loop with GGUF model loading and GGUF weight export
Combined loss: Triplet (margin) + InfoNCE (contrastive) + GRPO reward scaling
AdamW optimizer with gradient clipping, LR warmup, checkpointing
GrpoEvaluator computes per-prediction rewards (1.0 correct, -0.5 wrong)
Metal/CUDA acceleration via Candle device dispatch

MicroLoRA + EWC++ Training Pipeline (lora/training.rs, 798 lines)

Single-example gradient computation (batch_size=1 for real-time)
EWC++ regularizer: λ/2 * Σ F_i * (w_i - w*_i)² prevents catastrophic forgetting
Fisher diagonal tracking with exponential decay (fisher_decay: 0.999)
7 learning rate schedules (Cosine, OneCycle, Step, etc.)
Async adaptation with buffered gradient accumulation
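The EWC++ regularizer formula above translates directly into code. The sketch below is an assumption-laden illustration rather than the contents of lora/training.rs: the function names are hypothetical, and the exponential-moving-average form of the Fisher update is one common choice that matches the `fisher_decay` parameter.

```rust
/// EWC penalty: (lambda / 2) * sum_i F_i * (w_i - w*_i)^2,
/// where `anchor` holds the weights w* saved after the previous task.
pub fn ewc_penalty(weights: &[f32], anchor: &[f32], fisher: &[f32], lambda: f32) -> f32 {
    assert!(weights.len() == anchor.len() && weights.len() == fisher.len());
    let sum: f32 = weights
        .iter()
        .zip(anchor)
        .zip(fisher)
        .map(|((w, a), f)| f * (w - a).powi(2))
        .sum();
    0.5 * lambda * sum
}

/// Running Fisher diagonal with exponential decay (assumed form):
/// F_i <- decay * F_i + (1 - decay) * g_i^2
pub fn update_fisher(fisher: &mut [f32], grads: &[f32], decay: f32) {
    assert_eq!(fisher.len(), grads.len());
    for (f, g) in fisher.iter_mut().zip(grads) {
        *f = decay * *f + (1.0 - decay) * g * g;
    }
}
```

Weights with a large Fisher value are penalized heavily for moving away from their anchor, while weights with a Fisher value near zero remain free to adapt.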

Memory Distillation (reasoning_bank/distillation.rs, 856 lines)

Compresses trajectories to KeyLesson objects with semantic embeddings
Smart extraction: explicit lessons, implicit patterns, error patterns, recovery patterns
Semantic deduplication (Jaccard + cosine similarity, threshold 0.85)
Quality-gated: only trajectories above min_quality_threshold are preserved

Policy Store (policy_store.rs, 474 lines)

Ruvector-backed semantic policy persistence with HNSW indexing
Policy types: Quantization, Router, Ewc, Pattern
Per-layer QuantizationPolicy with precision, activation thresholds, quality-latency tradeoff
Policy source tracking: InstantLoop, BackgroundLoop, DeepLoop, Federated

Contrastive Training (training/contrastive.rs, 634 lines)

Two-stage: Triplet Loss (margin=0.5) + InfoNCE (temperature=0.07)
13 agent types with 1,078 training triplets (578 base + 500 hard negatives)
Hard negative mining at 48.4% ratio (Claude-generated confusing pairs)
Proven 100% routing accuracy with hybrid keyword-first + embedding fallback
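The two loss terms named above can be sketched as follows. This is a generic formulation under stated assumptions: cosine similarity is assumed as the similarity measure, and the function names are illustrative rather than the API of training/contrastive.rs.

```rust
/// Cosine similarity; the small floor on the norm avoids division by zero.
fn cosine(a: &[f32], b: &[f32]) -> f32 {
    assert_eq!(a.len(), b.len());
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let na = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let nb = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    dot / (na * nb).max(1e-12)
}

/// Triplet margin loss on cosine distance: max(0, d(a, p) - d(a, n) + margin).
pub fn triplet_loss(anchor: &[f32], positive: &[f32], negative: &[f32], margin: f32) -> f32 {
    let d_pos = 1.0 - cosine(anchor, positive);
    let d_neg = 1.0 - cosine(anchor, negative);
    (d_pos - d_neg + margin).max(0.0)
}

/// InfoNCE: -log(exp(s_pos / t) / sum_i exp(s_i / t)), computed with a
/// max-shifted log-sum-exp for numerical stability.
pub fn info_nce(anchor: &[f32], positive: &[f32], negatives: &[Vec<f32>], temperature: f32) -> f32 {
    let mut logits = vec![cosine(anchor, positive) / temperature];
    logits.extend(negatives.iter().map(|n| cosine(anchor, n) / temperature));
    let max = logits.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let log_sum = logits.iter().map(|l| (l - max).exp()).sum::<f32>().ln() + max;
    log_sum - logits[0]
}
```

With margin 0.5, a triplet whose positive is identical to the anchor and whose negative is orthogonal contributes zero loss; swapping positive and negative yields 1.5.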

Considered Options

Option A: Post-Training Quantization of GLM-4.7-Flash (PTQ Tiers)

Take the existing BF16 GLM-4.7-Flash weights and quantize to low-bit formats without full distillation training.

Critical distinction — IQ1_S ≠ BitNet b1.58:

Property	GGUF IQ1_S	BitNet b1.58
Encoding	Codebook-based importance quantization	Ternary {-1, 0, +1} via absmean
Bits/weight	1.56 bpw	1.58 bpw
Inference	Dequantize → FP multiply	Integer addition only (no multiply)
Speed benefit	Memory bandwidth only	Bandwidth + compute (multiplication-free)
How obtained	Post-training quantization	Trained from scratch or distilled
Quality at 7B	Near-random / broken outputs	Matches FP16

Existing GLM-4.7-Flash GGUF quantizations available (community-published):

Repository	Lowest Quant	Size	Notes
bartowski/zai-org_GLM-4.7-Flash-GGUF	IQ2_XXS (2.06 bpw)	7.62 GB	No IQ1_S published
unsloth/GLM-4.7-Flash-GGUF	UD-Q2_K_XL (2.7 bpw dynamic)	~11 GB	Dynamic quant, recommended
ngxson/GLM-4.7-Flash-GGUF	Q4_K_M (4.5 bpw)	18.1 GB	55 variants available

No IQ1_S quantization has been published for GLM-4.7-Flash by any community quantizer — this itself is a signal (too aggressive for practical use).

Sub-options ranked by increasing effort:

Sub-option 0A: Download existing IQ2_XXS GGUF

Download bartowski's IQ2_XXS at 7.62 GB
Cost: $0, time: 5 minutes (just download)
Quality: ~75-80% of FP16 (2.06 bpw is usable per community reports)
NOT 1-bit, NOT BitNet — just aggressive 2-bit compression
RuvLLM gap: IQ2_XXS dequantization not implemented (falls to error catch-all in quantization.rs:358)
RuvLLM Q2_K dequantization IS implemented and works

Sub-option 0B: Quantize to IQ1_S via llama.cpp

Run llama-quantize GLM-4.7-Flash-F16.gguf IQ1_S with importance matrix
Cost: $0, time: ~30 minutes on CPU
Quality: SEVERE degradation — blind testing shows IQ1_S is "broken rather than just bad" on 7B; outputs contain garbled text despite acceptable perplexity scores. 30B MoE may survive better due to parameter redundancy, but expert routing is highly sensitive to weight perturbation
RuvLLM gap: IQ1_S dequantization not implemented (quantization.rs:358 catch-all)
Does NOT achieve BitNet multiplication-free inference

Sub-option 0C: PT-BitNet ternary PTQ (per PT-BitNet paper)

Apply absmean ternary quantization (BitNet's native method) to pre-trained weights with calibration data
Cost: $0 (runs locally on Mac Studio via mmap + Metal; 1-4 hours wall time)
Alternative: ~$50-200 on cloud GPU if no local Apple Silicon hardware
Quality: ~55-65% downstream accuracy (PT-BitNet reports 61% on 70B; GLM-4.7-Flash's 30B-A3B may differ)
THIS IS proper BitNet ternary format → enables multiplication-free inference with AD-4 kernels
Requires implementing absmean ternary quantizer (~200-300 lines of new code)
Requires calibration dataset (WikiText-2 or similar, ~1M tokens)
Mac Studio M4 Max 64GB+ or M3 Ultra 96GB+ recommended (see AD-18)

Sub-option 0D: BitDistill Lite (10B tokens) (per BitDistill paper)

3-stage: SubLN insertion → 10B-token continued pre-training → KL + attention distillation
Cost: ~$200-500 (8× GPU hours on MI300X/A100 class)
Quality: ~90-95% of FP16 (BitDistill reports 88.17% vs 88.01% FP16 on MNLI at 0.6B)
Near-full quality recovery with only 10B tokens (vs 200B+ for Phase 1 full distillation)
Requires SubLN module insertion + distillation fine-tuning loop
Bridges gap between pure PTQ and full expert distillation (Phase 1)

Summary comparison:

Sub-option	Cost	Time	Quality (est.)	BitNet Speedup	RuvLLM Ready
0A: IQ2_XXS download	$0	5 min	~75-80%	No	No (missing dequant)
0B: IQ1_S quantize	$0	30 min	~40-50%	No	No (missing dequant)
0C: PT-BitNet PTQ	$0 (Mac Studio)	1-4 hrs	~55-65%	Yes	Needs quantizer impl
0D: BitDistill Lite	$0 local / ~$300 cloud	2-4 wks / 1-2 days	~90-95%	Yes	Needs SubLN + KD loop

Pros (of PTQ approach generally):

Immediate or near-immediate results ($0-$300, minutes to days)
No large-scale training infrastructure
Validates inference pipeline and kernels before investing in full distillation
Sub-option 0C produces genuine BitNet ternary format for kernel development

Cons:

Sub-options 0A/0B: Quality too degraded for production coding tasks
Sub-options 0A/0B: No BitNet multiplication-free inference (still dequant-then-multiply)
Sub-option 0C: Significant quality loss (~35-45%) vs teacher — adequate for kernel validation, not production
Sub-option 0D: Requires non-trivial training code (SubLN, KD loss) but much less than full Phase 1
IQ1_S blind test results: statistically indistinguishable from random on smaller models

Verdict: Recommended as Phase 0 rapid prototype. Sub-option 0C (PT-BitNet PTQ) is the optimal entry point: $0 on a local Mac Studio (or ~$50-200 on cloud GPU), 1-4 hours, and it produces genuine BitNet ternary format for kernel development and inference validation. Sub-option 0D (BitDistill Lite) bridges to Phase 1 if higher quality is needed before committing to full expert distillation. Sub-options 0A/0B are useful only as baselines for comparison.

Option B: Native BitNet Training of GLM-4.7-Flash Architecture (Full)

Train Craftsman Ultra 30b 1bit from scratch using BitNet b1.58 methodology on the GLM-4.7-Flash MoE architecture.

Approach:

Implement BitLinear layers for all expert MLPs and attention projections
Keep MoE router, embeddings, and LM head in FP16
Train on 4T+ tokens with ternary weight updates via straight-through estimator
Export to custom GGUF with ternary tensor metadata

Pros:

Maximum quality — matches FP16 at 3B+ active parameter scale
True multiplication-free inference for expert forward passes
Full TL1/TL2 kernel optimization possible
Scientifically validated approach (BitNet b1.58 2B4T results)

Cons:

Massive training compute: estimated 4,000-8,000 A100-hours for 4T tokens
Requires custom training framework (BitNet + MoE + MLA integration)
6-12 month timeline for training pipeline + training run
No pre-existing GLM-4.7-class BitNet training recipe

Verdict: Recommended long-term — Highest quality but requires significant investment.

Option C: Hybrid Approach — BitNet Distillation from GLM-4.7-Flash (RLM-Accelerated)

Use knowledge distillation to transfer GLM-4.7-Flash capabilities into a BitNet architecture, reducing training cost by 5-10x. Leverages the existing RLM training stack to eliminate ~70% of net-new training code.

Approach:

Initialize Craftsman Ultra with GLM-4.7-Flash architecture (30B-A3B MoE)
Replace all expert linear layers with BitLinear (ternary {-1, 0, +1})
Keep router, embeddings, LM head in FP16
Extend RealContrastiveTrainer with KD loss (KL div + hard-label CE) replacing triplet+InfoNCE
Use GrpoOptimizer for per-expert quality rewards during distillation — each SampleGroup maps to one expert's teacher vs student outputs
Apply EwcRegularizer across distillation phases to prevent early-trained experts from being overwritten
Log distillation trajectories to MemoryDistiller for quality tracking and KeyLesson extraction
Persist per-layer ternary policies via PolicyStore (quantization thresholds, scale distributions)
Export to GGUF with ternary tensor metadata and TL1/TL2 kernel hints via existing GgufExportResult

RLM Component Reuse:

Existing Component	Reuse	Adaptation Needed
`RealContrastiveTrainer`	Training loop, GGUF export, checkpointing	Replace triplet+InfoNCE with KD loss
`GrpoOptimizer`	Reward scaling, adaptive KL, PPO clipping	Map `SampleGroup` to per-expert outputs
`EwcRegularizer`	Fisher diagonal, forgetting prevention	Apply across expert distillation phases
`MemoryDistiller`	Trajectory compression, lesson extraction	Map `Verdict` to teacher-student quality delta
`PolicyStore`	Semantic policy persistence	Add `PolicyType::TernaryScale` for per-block absmean tracking
`ContrastiveTrainer`	Hard negative mining framework	Reuse for expert-routing contrastive pre-training

Pros:

5-10x less compute than training from scratch (~800-1,600 A100-hours)
~70% existing code reuse — only BitLinear forward/backward and MoE data loading are net-new
Leverages GLM-4.7-Flash's proven architecture and routing
GRPO's adaptive KL prevents ternary student from diverging too far from teacher
EWC++ ensures sequential expert distillation doesn't corrupt earlier experts
Teacher model provides strong supervision signal for ternary convergence
Can incrementally improve with more distillation tokens
PolicyStore enables learned per-layer quantization decisions
Distillation quality tracked end-to-end via MemoryDistiller trajectory logging

Cons:

Slight quality gap vs native training (estimated 2-5% on benchmarks)
RealContrastiveTrainer embedding_dim (896) must scale to GLM-4.7-Flash hidden_size
Teacher inference cost during distillation
Distillation may not perfectly transfer MoE routing behavior

Verdict: Recommended near-term — Best balance of quality, cost, and timeline. RLM reuse eliminates the "custom framework" risk.

Option D: BitNet Expert Replacement (Incremental, RLM-Accelerated)

Keep GLM-4.7-Flash structure but replace only the expert MLP layers with BitLinear, leaving attention in FP16. Reuses existing RLM stack for the entire distillation loop.

Approach:

Load GLM-4.7-Flash architecture
Replace expert FFN layers (gate_proj, up_proj, down_proj) with BitLinear
Keep attention (Q/K/V/O projections) in FP16
Use RealContrastiveTrainer + GrpoOptimizer for expert-only distillation (~200B tokens)
Apply EwcRegularizer to prevent expert N+1 distillation from corrupting expert N
Attention weights loaded directly from GLM-4.7-Flash (no distillation needed)
Use contrastive pre-training to validate MoE routing still selects correct experts after ternary conversion

Pros:

Fastest path to working model
Attention quality preserved exactly
Expert FFN is 60-70% of active parameters — gets most BitNet benefits
Simpler distillation (only FFN layers)
Lower memory: ~5.5 GB for ternary experts + FP16 attention
Minimal net-new code: BitLinear layer + GGUF ternary type only; training loop is 100% reused

Cons:

Attention layers still require FP multiply (not fully multiplication-free)
Mixed-precision inference path complexity
~40% of compute still in FP16 attention

Verdict: Recommended as Phase 1 — Enables rapid prototyping and validation. RLM reuse makes this achievable with only ~30% new code.

Decision

Phased approach: A(0C) → RLM Refinement → D → C → B

Phase 0: PTQ Rapid Prototype (Option A, Sub-option 0C)

Timeline: 1-2 weeks
Cost: $0 (runs entirely on Mac Studio locally)
Platform: Mac Studio (M4 Max 64GB+ or M3 Ultra 96GB+)
Goal: Produce a genuine BitNet ternary GGUF of GLM-4.7-Flash for kernel development, inference pipeline validation, and baseline quality measurement
Deliverables:
- PT-BitNet ternary quantized GLM-4.7-Flash GGUF file (~6-7 GB)
- Absmean ternary quantizer implementation (~200-300 lines)
- IQ1_S / BITNET_T158 dequantization kernel in RuvLLM
- Baseline quality benchmarks (HumanEval, MMLU) to compare against Phase 1+
- Functional TL1 kernel validated against ternary model
Expected quality: ~55-65% of GLM-4.7-Flash (adequate for kernel validation, not production)
Key value: De-risks Phase 1 by validating the entire inference pipeline (GGUF loading → ternary dequant → TL1 kernel → MoE routing → token generation) at zero cost before committing to $1,300+ distillation training
Why Mac Studio works: Phase 0 is PTQ (no training loop) — just load FP16 weights via mmap, compute absmean per block, round to ternary, export. The absmean computation is trivial math; the bottleneck is memory bandwidth, not compute. Calibration forward pass uses Metal GPU acceleration via existing Candle integration.
Optional upgrade (0D): If 0C quality is too low for meaningful testing, apply BitDistill Lite (10B tokens, ~$300 cloud or ~$0 on Mac Studio over several weeks) to reach ~90-95% quality

Phase 0.5: RLM Post-Quantization Refinement (NEW — Mac Studio, $0)

Timeline: 1-3 weeks (overlaps with Phase 0 kernel development)
Cost: $0 (runs on Mac Studio, ~2-12 days training wall time with Metal; ~4-24 days SIMD-only)
Platform: Mac Studio (same as Phase 0) — supports both Metal GPU and pure SIMD/CPU modes (see AD-20)
Goal: Improve Phase 0 PTQ quality from ~55-65% to ~70-80% by training only the small FP16 components using the existing RLM stack — no traditional distillation, no cloud GPU
Approach: Freeze ternary weights, train FP16 corrections using RLM components:
1. MicroLoRA adapters (rank 1-2) on each expert FFN — adds small FP16 correction: Y = BitLinear(X) + LoRA_B @ LoRA_A @ X
2. Router fine-tuning via ContrastiveTrainer — corrects misrouting caused by PTQ weight changes
3. Scale factor optimization via GRPO rewards — per-block FP16 absmean scales are differentiable
4. EWC++ regularization — prevents router fix from breaking already-good routing paths
5. Quality tracking via MemoryDistiller — identifies worst-degraded experts for focused training
6. Policy persistence via PolicyStore — stores optimized per-layer configurations
Trainable parameters: ~200-400M (1-2% of 30B total) — router (~30M), MicroLoRA adapters (~50-100M), LM head (~150M), scale factors (~0.1M)
Training data: 100M-500M tokens (sufficient for <400M trainable params)
Throughput: ~500-1000 tok/s (Metal) or ~200-500 tok/s (NEON SIMD only); processing 100M-500M tokens at these rates takes roughly 2-12 days (Metal) or 4-24 days (SIMD-only) on Mac Studio
Deliverables:
- RLM-refined GGUF with ternary experts + optimized FP16 components
- MicroLoRA adapter weights (exportable, ~20-100 MB)
- Optimized router weights and scale factors
- Quality benchmarks showing improvement over Phase 0 baseline
Expected quality: ~70-80% of GLM-4.7-Flash (up from ~55-65% Phase 0 PTQ)
Key value: Gets a usable model on Mac Studio at $0 before committing to cloud GPU. If 70-80% quality is sufficient for the use case, Phase 1 cloud distillation may be deferred or skipped entirely.
100% RLM code reuse: MicroLoRA, TrainingPipeline, EwcRegularizer, GrpoOptimizer, ContrastiveTrainer, MemoryDistiller, PolicyStore — all production-tested, zero new training code needed

Phase 1: BitNet Expert Replacement (Option D)

Timeline: 3-4 months
Cost: ~$1,300-$2,000 (4× A100 spot, ~46 days)
Goal: Full-quality ternary experts via distillation, validated against Phase 0/0.5 baselines
Deliverables: Working Craftsman Ultra 30b 1bit (mixed: ternary experts, FP16 attention)
Expected quality: ~90-95% of GLM-4.7-Flash on coding benchmarks
Prerequisites: Phase 0 validates inference pipeline; Phase 0.5 provides quality baseline

Phase 2: Full BitNet Distillation (Option C)

Timeline: 4-6 months after Phase 1
Cost: ~$2,500-$5,000 (4× H100, 16-32 days)
Goal: Full ternary model with complete BitNet inference optimization
Deliverables: Craftsman Ultra 30b 1bit v2 (full ternary except router/embed/head)
Expected quality: ~95-98% of GLM-4.7-Flash

Phase 3: Native BitNet Training (Option B)

Timeline: 6-12 months after Phase 2, contingent on funding/compute
Cost: ~$15,000-$30,000 (8× H100 cluster, 90-180 days)
Goal: Surpass GLM-4.7-Flash quality with native ternary training
Deliverables: Craftsman Ultra 30b 1bit v3 (trained from scratch)
Expected quality: 100%+ of GLM-4.7-Flash (target; BitNet b1.58 is reported to match FP16 at 3B+ parameters, and exceeding it at 30B is an assumption)

Architectural Decisions

AD-1: Ternary Weight Representation

Decision: Use BitNet b1.58 absmean quantization for weight ternary encoding.

W_ternary = RoundClip(W / (mean(|W|) + epsilon), -1, 1)

Each weight is one of {-1, 0, +1}, stored as 2-bit packed integers (I2_S format) in GGUF tensors. Per-block scale factor stored as FP16.

Storage format per block (256 elements):

64 bytes for ternary weights (2 bits × 256)
2 bytes for absmean scale (FP16)
Total: 66 bytes / 256 weights = 2.06 bits/weight
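The block layout above can be checked with a small packing sketch. The 2-bit code assignment (-1 → 0, 0 → 1, +1 → 2) and the function names are assumptions for illustration; the actual I2_S bit order in bitnet.cpp may differ.

```rust
pub const BLOCK_SIZE: usize = 256;
pub const PACKED_BYTES: usize = BLOCK_SIZE / 4; // 2 bits per weight -> 64 bytes

/// Pack 256 ternary weights into 64 bytes, four 2-bit codes per byte.
/// Assumed encoding: -1 -> 0b00, 0 -> 0b01, +1 -> 0b10.
pub fn pack_block(ternary: &[i8; BLOCK_SIZE]) -> [u8; PACKED_BYTES] {
    let mut out = [0u8; PACKED_BYTES];
    for (i, &t) in ternary.iter().enumerate() {
        debug_assert!((-1..=1).contains(&t));
        let code = (t + 1) as u8;
        out[i / 4] |= code << ((i % 4) * 2);
    }
    out
}

/// Inverse of `pack_block`.
pub fn unpack_block(packed: &[u8; PACKED_BYTES]) -> [i8; BLOCK_SIZE] {
    let mut out = [0i8; BLOCK_SIZE];
    for (i, slot) in out.iter_mut().enumerate() {
        let code = (packed[i / 4] >> ((i % 4) * 2)) & 0b11;
        *slot = code as i8 - 1;
    }
    out
}

/// Effective bits per weight including the 2-byte FP16 scale:
/// (64 + 2) * 8 / 256 = 2.0625.
pub fn bits_per_weight() -> f64 {
    ((PACKED_BYTES + 2) * 8) as f64 / BLOCK_SIZE as f64
}
```

The fourth 2-bit code (0b11) is unused in this encoding, which is where the gap between the 2.06 bits/weight storage cost and the 1.58-bit information content comes from.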

AD-2: MoE Router Precision

Decision: MoE gating/routing network remains in FP16.

Rationale: Expert selection requires high-precision softmax scores to maintain routing quality. Quantizing the router to ternary would collapse expert selection, effectively turning a 30B model into a random-expert 3B model. The router is <0.1% of total parameters.

Components kept in FP16:

Expert gating weights (router)
Token embedding table
LM head (output projection)
RoPE frequency table
LayerNorm/RMSNorm parameters

AD-3: Activation Quantization

Decision: INT8 per-token absmax quantization for activations flowing through BitLinear layers.

X_int8 = clamp(round(X * 127 / max(|X|)), -128, 127)

Rationale: Consistent with BitNet b1.58 specification. INT8 activations enable integer-only GEMM in expert forward passes. Attention activations remain in FP16/BF16 for KV cache compatibility.
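A minimal sketch of the per-token absmax quantizer follows. The function name and the handling of an all-zero token (returning scale 1.0) are assumptions for the example, not part of the BitNet specification.

```rust
/// Per-token absmax INT8 quantization.
/// Returns (quantized values, scale) such that x ≈ q * scale.
pub fn quantize_activations(x: &[f32]) -> (Vec<i8>, f32) {
    let absmax = x.iter().fold(0.0f32, |m, v| m.max(v.abs()));
    if absmax == 0.0 {
        // Assumed convention: an all-zero token quantizes to zeros with unit scale.
        return (vec![0; x.len()], 1.0);
    }
    let inv = 127.0 / absmax;
    let q = x
        .iter()
        .map(|v| (v * inv).round().clamp(-128.0, 127.0) as i8)
        .collect();
    (q, absmax / 127.0)
}
```

For a token `[0.3, -1.0, 0.25, 0.0]` the absmax is 1.0, so the values quantize to `[38, -127, 32, 0]` and each reconstructed value is within half a quantization step (scale / 2) of the original.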

AD-4: CPU Inference Kernel Strategy

Decision: Implement all three bitnet.cpp kernel types, with runtime selection based on hardware detection.

Kernel	Target Hardware	Selection Criteria
I2_S	x86 AVX512, ARM SVE	Systems with wide SIMD and high bandwidth
TL1	x86 AVX2, ARM NEON	General-purpose, balanced performance
TL2	Memory-constrained	Systems with <16GB RAM or high cache pressure

Implementation path: Adapt bitnet.cpp's kernel generation scripts (Python codegen) to produce Rust SIMD intrinsics compatible with RuvLLM's existing kernels/ module structure.

Key kernel operations:

Pack ternary weights into 2-bit (I2_S) or LUT index (TL1: 4-bit, TL2: 5-bit)
Generate lookup tables for activation sums at model load time
Execute GEMM via table lookup + integer addition (no floating-point multiply)
Accumulate in INT16 with pack-and-unpack technique (lossless, no quantization of partials)
Dequantize output with per-block FP16 scale
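To make the lookup-table idea concrete, here is a scalar sketch in the spirit of TL1: each pair of ternary weights becomes one index in 0..9 (which fits in 4 bits), and each pair of activations gets a 9-entry table of every possible signed sum. The index layout and function names are assumptions, the tables here are built from the activation vector, and the accumulator is widened to i32 for simplicity instead of the INT16 pack-and-unpack scheme.

```rust
/// Encode each pair of ternary weights (w0, w1) as index (w0 + 1) * 3 + (w1 + 1).
/// This can be done once, when the weights are loaded.
pub fn encode_pairs(w: &[i8]) -> Vec<u8> {
    assert!(w.len() % 2 == 0);
    w.chunks_exact(2)
        .map(|p| ((p[0] + 1) * 3 + (p[1] + 1)) as u8)
        .collect()
}

/// For each activation pair (a0, a1), tabulate w0*a0 + w1*a1 for all nine
/// ternary weight pairs, using only negation, addition, and subtraction.
pub fn build_luts(x: &[i8]) -> Vec<[i16; 9]> {
    assert!(x.len() % 2 == 0);
    x.chunks_exact(2)
        .map(|a| {
            let (a0, a1) = (a[0] as i16, a[1] as i16);
            [
                -a0 - a1, -a0, -a0 + a1, // w0 = -1
                -a1, 0, a1,              // w0 =  0
                a0 - a1, a0, a0 + a1,    // w0 = +1
            ]
        })
        .collect()
}

/// Dot product as table lookups plus integer addition.
pub fn lut_dot(indices: &[u8], luts: &[[i16; 9]]) -> i32 {
    assert_eq!(indices.len(), luts.len());
    indices
        .iter()
        .zip(luts)
        .map(|(&i, t)| t[i as usize] as i32)
        .sum()
}
```

The tables are shared by every weight row that multiplies the same activation vector, so the cost of building them is amortized across all output channels of the layer.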

AD-5: GGUF Tensor Format Extension

Decision: Extend RuvLLM's GGUF format with BitNet-specific metadata and a new BITNET_TERNARY quantization type.

New GGUF metadata keys:

craftsman.bitnet.version = 1
craftsman.bitnet.weight_encoding = "absmean_ternary"
craftsman.bitnet.activation_bits = 8
craftsman.bitnet.router_precision = "f16"
craftsman.bitnet.kernel_hint = "tl1"  // preferred kernel
craftsman.moe.total_params = 30000000000
craftsman.moe.active_params = 3000000000
craftsman.moe.num_experts = <N>
craftsman.moe.active_experts = <K>

Tensor storage: Map to existing IQ1_S (type 19) for ternary expert weights, with additional metadata distinguishing post-training IQ1_S from native BitNet ternary. Alternatively, register a new type BITNET_T158 = 29 if the existing IQ1_S block format is incompatible with absmean-scale-per-block layout.

AD-6: RuvLLM Backend Integration

Decision: Create a new BitNetBackend alongside existing Candle and mistral-rs backends.

backends/
├── mod.rs                 // Backend trait + dispatch
├── candle_backend.rs      // GPU (Metal/CUDA)
├── mistral_backend.rs     // PagedAttention + ISQ
├── coreml_backend.rs      // Apple Neural Engine
└── bitnet_backend.rs      // NEW: CPU ternary inference

BitNetBackend responsibilities:

Load GGUF with ternary tensor detection
Initialize TL1/TL2/I2_S lookup tables per layer
Execute MoE routing in FP16 → select active experts
Run selected expert forward passes using ternary GEMM kernels
Attention in FP16 (Phase 1) or ternary (Phase 2+)
KV cache management (Q8 two-tier, existing infrastructure)

Backend trait compliance:

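impl InferenceBackend for BitNetBackend {
    // Parse the GGUF file, detect ternary tensors, and build kernel lookup tables.
    fn load_model(&mut self, path: &Path, config: ModelConfig) -> Result<()> { todo!() }
    // FP16 routing plus ternary expert GEMM for each decoded token.
    fn generate(&self, prompt: &str, params: GenerateParams) -> Result<Response> { todo!() }
    fn get_embeddings(&self, text: &str) -> Result<Vec<f32>> { todo!() }
    fn supports_architecture(&self, arch: &str) -> bool { todo!() }
}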

AD-7: MoE Forward Pass Pipeline

Decision: Split MoE forward pass into FP16 routing + ternary expert execution.

Input Token Embedding (FP16)
  │
  ▼
┌─────────────────────────────────────────┐
│ For each transformer layer:             │
│                                         │
│  1. RMSNorm (FP16)                      │
│  2. Self-Attention                      │
│     ├─ Q/K/V projection (Phase 1: FP16, │
│     │                    Phase 2: Ternary)│
│     ├─ RoPE (FP16)                      │
│     ├─ Scaled dot-product attention      │
│     └─ Output projection                │
│  3. RMSNorm (FP16)                      │
│  4. MoE Block:                          │
│     ├─ Router (FP16 gating network)     │
│     │   → Select top-K experts          │
│     ├─ Expert FFN (TERNARY BitLinear)   │
│     │   ├─ gate_proj: W_ternary @ X_int8│
│     │   ├─ up_proj:   W_ternary @ X_int8│
│     │   ├─ SwiGLU activation            │
│     │   └─ down_proj: W_ternary @ X_int8│
│     └─ Weighted sum of expert outputs   │
│  5. Residual connection                 │
└─────────────────────────────────────────┘
  │
  ▼
LM Head (FP16) → Logits → Token
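Step 4 of the pipeline (router, top-K selection, weighted sum) can be sketched as below. This assumes softmax gating with renormalization over the selected experts; GLM-4.7-Flash's actual gating function may differ, and the function names are illustrative. The `expert` closure stands in for the ternary FFN.

```rust
/// Numerically stable softmax over router logits.
pub fn softmax(logits: &[f32]) -> Vec<f32> {
    let max = logits.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = logits.iter().map(|l| (l - max).exp()).collect();
    let sum: f32 = exps.iter().sum();
    exps.iter().map(|e| e / sum).collect()
}

/// Select the top-k experts and renormalize their gates to sum to 1.
/// Returns (expert index, gate weight), highest gate first.
pub fn route_top_k(logits: &[f32], k: usize) -> Vec<(usize, f32)> {
    let probs = softmax(logits);
    let mut order: Vec<usize> = (0..probs.len()).collect();
    order.sort_by(|&a, &b| probs[b].total_cmp(&probs[a]));
    order.truncate(k);
    let total: f32 = order.iter().map(|&i| probs[i]).sum();
    order.into_iter().map(|i| (i, probs[i] / total)).collect()
}

/// Weighted sum of the selected experts' outputs; unselected experts never run.
pub fn moe_forward<F>(x: &[f32], logits: &[f32], k: usize, expert: F) -> Vec<f32>
where
    F: Fn(usize, &[f32]) -> Vec<f32>,
{
    let mut out = vec![0.0; x.len()];
    for (idx, gate) in route_top_k(logits, k) {
        for (o, y) in out.iter_mut().zip(expert(idx, x)) {
            *o += gate * y;
        }
    }
    out
}
```

Because only the k selected experts are evaluated, per-token compute scales with the active parameter count rather than the total.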

AD-8: SONA Integration for Ternary Adaptation

Decision: MicroLoRA adapters applied as FP16 deltas on top of ternary base weights.

Rationale: Ternary weights cannot be directly fine-tuned at inference time (gradient updates don't map to {-1, 0, +1}). Instead, SONA's MicroLoRA applies rank-1 FP16 adapters whose output is added to the ternary forward pass output:

Y = BitLinear(X) + LoRA_B @ LoRA_A @ X

Where BitLinear(X) uses ternary GEMM and LoRA_B @ LoRA_A @ X is a small FP16 correction. This preserves BitNet's efficiency for 99%+ of computation while enabling per-session adaptation.
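For rank 1, LoRA_A is a single row and LoRA_B a single column, so the correction reduces to one dot product followed by one scaled vector. The sketch below illustrates this; the function names are hypothetical and `base` stands for the already-dequantized BitLinear output.

```rust
/// Rank-1 LoRA correction: delta = b * (a · x),
/// where `a` has the input dimension and `b` the output dimension.
pub fn lora_rank1(a: &[f32], b: &[f32], x: &[f32]) -> Vec<f32> {
    assert_eq!(a.len(), x.len());
    let s: f32 = a.iter().zip(x).map(|(ai, xi)| ai * xi).sum();
    b.iter().map(|bi| bi * s).collect()
}

/// Y = BitLinear(X) + LoRA_B @ LoRA_A @ X, with `base` = BitLinear(X).
pub fn adapted_output(base: &[f32], a: &[f32], b: &[f32], x: &[f32]) -> Vec<f32> {
    assert_eq!(base.len(), b.len());
    base.iter()
        .zip(lora_rank1(a, b, x))
        .map(|(y, d)| y + d)
        .collect()
}
```

The correction costs d_in + d_out multiply-adds per token, compared with d_in × d_out for the layer itself, which is why the FP16 path stays a small fraction of total compute.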

AD-9: Memory Budget Analysis

Decision: Target <8GB model + 2GB KV cache = 10GB total for 4K context.

Component	Precision	Size	Notes
Expert weights (28B params)	1.58-bit	~5.5 GB	28B × 2.06 bits = ~7.2 GB raw on disk; resident size is lower because inactive experts stay memory-mapped and demand-loaded
Shared layers (2B params)	FP16	~4 GB	Embeddings, LM head, router, norms
Expert routing tables	FP16	~50 MB	Gating network weights
TL1/TL2 lookup tables	INT16	~200 MB	Pre-computed at load time
KV cache (4K context)	Q8	~1.5 GB	Two-tier cache (hot FP16 + warm Q8)
MicroLoRA adapters	FP16	~10 MB	Rank-1, <1MB per target module
Total	—	~11.3 GB	Sum of the rows above (~9.8 GB excluding KV cache); fits in 16GB system RAM

Note: The 28B ternary expert weights occupy ~7.2 GB on disk (2.06 bits/weight including per-block scales). At runtime, only active expert weights (~3B active) are in hot memory for any given token, with inactive expert pages memory-mapped and demand-loaded.

AD-10: Platform-Specific Kernel Dispatch

Decision: Runtime hardware detection drives kernel selection.

pub fn select_kernel(caps: &HardwareCaps) -> BitNetKernel {
    if caps.has_avx512() {
        BitNetKernel::I2S_AVX512
    } else if caps.has_avx2() {
        BitNetKernel::TL1_AVX2
    } else if caps.has_neon() {
        if caps.cache_size_l2 >= 2 * 1024 * 1024 {
            BitNetKernel::TL1_NEON
        } else {
            BitNetKernel::TL2_NEON  // memory-constrained
        }
    } else if caps.has_sse41() {
        BitNetKernel::TL1_SSE41
    } else {
        BitNetKernel::I2S_Scalar  // fallback
    }
}

Integration: Leverages RuvLLM's existing autodetect.rs hardware capability detection module.
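For reference, the capability flags consumed by a dispatcher like this can be obtained from the Rust standard library's runtime feature detection macros. The sketch below is an assumption about how such detection could look, not the contents of autodetect.rs; the `SimdCaps` struct and `detect_simd` function are hypothetical, and L2 cache size detection is omitted because the standard library does not expose it.

```rust
/// Hypothetical subset of hardware capabilities relevant to kernel selection.
#[derive(Debug, Default, Clone, Copy, PartialEq, Eq)]
pub struct SimdCaps {
    pub avx512f: bool,
    pub avx2: bool,
    pub sse41: bool,
    pub neon: bool,
}

/// Query CPU features at runtime. Flags for other architectures stay false.
#[allow(unused_mut)]
pub fn detect_simd() -> SimdCaps {
    let mut caps = SimdCaps::default();
    #[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
    {
        caps.avx512f = std::arch::is_x86_feature_detected!("avx512f");
        caps.avx2 = std::arch::is_x86_feature_detected!("avx2");
        caps.sse41 = std::arch::is_x86_feature_detected!("sse4.1");
    }
    #[cfg(target_arch = "aarch64")]
    {
        caps.neon = std::arch::is_aarch64_feature_detected!("neon");
    }
    caps
}
```

Detection runs once at startup; the result can then be passed to the dispatcher so the hot loop never re-queries the CPU.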

AD-11: GRPO-Guided Distillation Loss

Decision: Use GrpoOptimizer to compute per-expert reward scaling during knowledge distillation, replacing a traditional fixed-weight KD loss.

Rationale: Standard KD uses a static alpha to blend KL divergence and hard-label cross-entropy. GRPO adds a dynamic reward signal that upweights expert-student pairs where ternary output closely matches the teacher, and downweights divergent pairs. This is achieved by mapping each expert's teacher-vs-student output comparison to a SampleGroup:

Combined Loss = KD_base * GRPO_scale
Where:
  KD_base  = α * KL(teacher_logits/T, student_logits/T)
           + (1-α) * CE(labels, student_logits)
  GRPO_scale = (1 + reward * 0.1)

  reward = GrpoEvaluator.evaluate(student_expert_output, teacher_expert_output)
         → 1.0 when cosine_sim > 0.95
         → -0.5 when cosine_sim < 0.7
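The loss above can be sketched numerically as follows. This assumes GRPO_scale acts as a multiplicative factor on KD_base (the usual role of a scale near 1.0) and that the reward is 0.0 for cosine similarity between 0.7 and 0.95, a band the definition above leaves unspecified. All function names are hypothetical and do not reflect the GrpoEvaluator API.

```rust
/// Softmax with temperature, max-shifted for numerical stability.
fn softmax_t(logits: &[f32], temperature: f32) -> Vec<f32> {
    let scaled: Vec<f32> = logits.iter().map(|l| l / temperature).collect();
    let max = scaled.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = scaled.iter().map(|s| (s - max).exp()).collect();
    let sum: f32 = exps.iter().sum();
    exps.iter().map(|e| e / sum).collect()
}

/// KL(p || q) with a small floor on q to avoid log of zero.
fn kl_div(p: &[f32], q: &[f32]) -> f32 {
    p.iter()
        .zip(q)
        .filter(|(pi, _)| **pi > 0.0)
        .map(|(pi, qi)| pi * (pi / qi.max(1e-12)).ln())
        .sum()
}

/// Reward thresholds from the text; the 0.0 middle band is an assumption.
pub fn reward_from_cosine(cos: f32) -> f32 {
    if cos > 0.95 {
        1.0
    } else if cos < 0.7 {
        -0.5
    } else {
        0.0
    }
}

/// (alpha * KL + (1 - alpha) * CE) * (1 + reward * 0.1)
pub fn combined_loss(
    teacher: &[f32],
    student: &[f32],
    label: usize,
    alpha: f32,
    temperature: f32,
    reward: f32,
) -> f32 {
    let kd = kl_div(&softmax_t(teacher, temperature), &softmax_t(student, temperature));
    let ce = -softmax_t(student, 1.0)[label].ln();
    let kd_base = alpha * kd + (1.0 - alpha) * ce;
    kd_base * (1.0 + reward * 0.1)
}
```

With these thresholds the scale ranges from 0.95 (divergent expert) to 1.10 (closely matching expert), so the reward nudges the base loss rather than dominating it.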

Key configuration (extending GrpoConfig::stable()):

GrpoConfig {
    group_size: num_experts,        // One group per MoE layer
    learning_rate: 1e-6,            // Conservative for distillation
    kl_coefficient: 0.1,            // Tight teacher adherence
    kl_target: 0.02,                // Low divergence target
    clip_range: 0.1,                // Narrow clipping for stability
    normalize_advantages: true,     // Normalize across experts in group
    adaptive_kl: true,              // Auto-adjust KL penalty
    ..GrpoConfig::stable()
}

Reused: GrpoOptimizer, GrpoConfig, SampleGroup, GrpoEvaluator from training/grpo.rs. New: BitNetGrpoAdapter that maps expert forward pass outputs to GrpoSample structs.

AD-12: Contrastive Pre-Training for Expert Routing Validation

Decision: After ternary conversion of expert weights, use the existing ContrastiveTrainer to verify that MoE routing still selects the correct experts.

Rationale: Replacing expert FFN weights with ternary approximations changes the output distribution of each expert. If expert N's ternary output becomes more similar to expert M's output, the router may misroute tokens. Contrastive pre-training on expert embeddings detects and corrects this.

Approach:

For each token in a calibration set, record which expert the teacher model's router selects
Generate TrainingTriplets: anchor = hidden state, positive = correct expert output, negative = wrong expert output
Use existing hard negative mining to find expert pairs that become confusable after ternary conversion
Fine-tune the FP16 router gating weights using contrastive loss to restore correct expert selection

Reused: ContrastiveTrainer, ContrastiveConfig, TrainingTriplet from training/contrastive.rs. New: ExpertTripletGenerator that produces triplets from MoE routing decisions.
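The confusability check and the triplet objective can be sketched as follows. This is an illustration under stated assumptions, not the ContrastiveTrainer implementation: it uses cosine similarity as the distance, a fixed margin, and hypothetical function names.

```rust
/// Cosine similarity between two equal-length vectors (0.0 if either is all zeros).
pub fn cosine(a: &[f32], b: &[f32]) -> f32 {
    assert_eq!(a.len(), b.len());
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let na = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let nb = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if na == 0.0 || nb == 0.0 {
        0.0
    } else {
        dot / (na * nb)
    }
}

/// Expert pairs whose outputs for the same input exceed `threshold` in cosine
/// similarity. These are candidate hard negatives after ternary conversion.
pub fn confusable_pairs(expert_outputs: &[Vec<f32>], threshold: f32) -> Vec<(usize, usize)> {
    let mut pairs = Vec::new();
    for i in 0..expert_outputs.len() {
        for j in (i + 1)..expert_outputs.len() {
            if cosine(&expert_outputs[i], &expert_outputs[j]) > threshold {
                pairs.push((i, j));
            }
        }
    }
    pairs
}

/// Triplet margin loss on cosine similarity:
/// max(0, margin - sim(anchor, positive) + sim(anchor, negative)).
pub fn triplet_loss(anchor: &[f32], positive: &[f32], negative: &[f32], margin: f32) -> f32 {
    (margin - cosine(anchor, positive) + cosine(anchor, negative)).max(0.0)
}
```

A loss of zero means the correct expert is already separated from the wrong one by at least the margin; only triplets with non-zero loss contribute to router fine-tuning.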

AD-13: EWC++ Cross-Expert Stability During Sequential Distillation

Decision: Apply EwcRegularizer from lora/training.rs during sequential expert distillation to prevent catastrophic forgetting across experts.

Rationale: Distilling 30B MoE experts sequentially (expert 0, then 1, ..., then N) risks overwriting shared representations. EWC++ computes Fisher information diagonals for each expert's contribution to the shared attention layers, then regularizes subsequent expert distillation to not deviate from previously-learned important weights.

Configuration:

TrainingConfig {
    ewc_lambda: 5000.0,           // Higher than default (2000) for cross-expert stability
    fisher_decay: 0.995,           // Slower decay to preserve Fisher across expert phases
    quality_threshold: 0.5,        // Only learn from high-quality distillation samples
    lr_schedule: LearningRateSchedule::Cosine,
    warmup_steps: 500,             // Longer warmup for 30B scale
    ..Default::default()
}

Concrete protection:

After distilling expert 0: compute Fisher diagonal F_0 over validation set
When distilling expert 1: add penalty ewc_lambda/2 * Σ F_0_i * (w_i - w*_0_i)²
Accumulate: F_cumulative = fisher_decay * F_prev + (1-fisher_decay) * F_new
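The two formulas above translate directly into code. The sketch below operates on flat slices and uses illustrative function names; it is not the EwcRegularizer API.

```rust
/// EWC penalty: (lambda / 2) * sum_i F_i * (w_i - w*_i)^2
pub fn ewc_penalty(lambda: f32, fisher: &[f32], w: &[f32], w_star: &[f32]) -> f32 {
    assert_eq!(fisher.len(), w.len());
    assert_eq!(w.len(), w_star.len());
    let weighted_sq: f32 = fisher
        .iter()
        .zip(w.iter().zip(w_star.iter()))
        .map(|(f, (wi, ws))| f * (wi - ws).powi(2))
        .sum();
    0.5 * lambda * weighted_sq
}

/// Running Fisher estimate: F_cumulative = decay * F_prev + (1 - decay) * F_new
pub fn accumulate_fisher(decay: f32, prev: &[f32], new: &[f32]) -> Vec<f32> {
    assert_eq!(prev.len(), new.len());
    prev.iter()
        .zip(new)
        .map(|(p, n)| decay * p + (1.0 - decay) * n)
        .collect()
}
```

With fisher_decay = 0.995 each new expert contributes only 0.5% to the running estimate, which is why earlier experts' importance weights persist across phases.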

Reused: EwcRegularizer, TrainingPipeline, TrainingConfig, FisherDiagonal from lora/training.rs. New: SequentialExpertDistiller that wraps EwcRegularizer across expert phases.

AD-14: Policy Store for Per-Layer Ternary Scale Tracking

Decision: Extend PolicyStore with a new PolicyType::TernaryScale to persist per-block absmean scale distributions and learned quantization decisions.

Rationale: Not all layers quantize equally well to ternary. Attention layers may need different scale clipping than FFN layers. The policy store enables the distillation pipeline to learn and persist per-layer quantization strategies that can be retrieved and applied in future distillation runs or model updates.

New policy type:

pub enum PolicyType {
    Quantization,
    Router,
    Ewc,
    Pattern,
    TernaryScale,      // NEW: Per-layer ternary quantization metadata
}

pub struct TernaryScalePolicy {
    pub layer_idx: usize,
    pub module: String,              // "gate_proj", "up_proj", "down_proj", "q_proj", etc.
    pub mean_absmean: f32,           // Average scale factor across blocks
    pub std_absmean: f32,            // Standard deviation of scale factors
    pub sparsity: f32,               // Fraction of zero weights
    pub quality_vs_teacher: f32,     // Cosine similarity to teacher output
    pub distillation_loss: f32,      // Final loss for this layer
    pub recommended_block_size: usize, // 256 default, may vary
}

Reused: PolicyStore, PolicyEntry, PolicySource from policy_store.rs. New: TernaryScalePolicy struct and PolicyType::TernaryScale variant.
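The summary fields of TernaryScalePolicy could be derived from a layer's per-block scales and ternary weights roughly as follows. This is a sketch: `ScaleStats` and `scale_stats` are hypothetical helpers, and the use of the population standard deviation is an assumption.

```rust
/// Subset of the TernaryScalePolicy fields that can be computed from the
/// quantized tensor alone (hypothetical helper type).
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct ScaleStats {
    pub mean_absmean: f32,
    pub std_absmean: f32,
    pub sparsity: f32,
}

/// `block_scales` holds one absmean scale per block; `ternary` holds the
/// unpacked weights in {-1, 0, +1}. Empty inputs yield zeros.
pub fn scale_stats(block_scales: &[f32], ternary: &[i8]) -> ScaleStats {
    let n = block_scales.len().max(1) as f32;
    let mean = block_scales.iter().sum::<f32>() / n;
    // Population variance (divide by n); an assumption for this sketch.
    let var = block_scales.iter().map(|s| (s - mean).powi(2)).sum::<f32>() / n;
    let zeros = ternary.iter().filter(|&&t| t == 0).count() as f32;
    let sparsity = zeros / ternary.len().max(1) as f32;
    ScaleStats {
        mean_absmean: mean,
        std_absmean: var.sqrt(),
        sparsity,
    }
}
```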

AD-15: Memory Distillation for Training Quality Tracking

Decision: Log all distillation teacher-student comparisons as Trajectory objects in the ReasoningBank, enabling MemoryDistiller to extract KeyLessons about which layers, experts, and configurations produce the best ternary quality.

Rationale: Distillation is iterative — understanding which experts converge quickly, which resist ternary conversion, and what scale distributions correlate with quality enables intelligent scheduling of future distillation runs.

Mapping:

ReasoningBank Concept	Distillation Mapping
`Trajectory`	One expert's distillation run (N steps)
`Verdict`	`Success` if cosine_sim > 0.9, `Failure` if < 0.7
`PatternCategory`	Expert index + layer type (e.g., "expert_3_gate_proj")
`KeyLesson`	"Expert 7 gate_proj converges fastest with lr=2e-6 and block_size=128"
`CompressedTrajectory`	Summary of entire expert distillation phase

Reused: MemoryDistiller, DistillationConfig, CompressedTrajectory, KeyLesson from reasoning_bank/distillation.rs. New: DistillationTrajectoryRecorder that adapts expert training steps to Trajectory format.
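A small sketch of the mapping in the table above. The thresholds come from the Verdict row; the `Inconclusive` variant for the 0.7 to 0.9 band is an assumption, since the table leaves that range unspecified. `DistillVerdict`, `verdict_from_cosine`, and `pattern_category` are hypothetical names, not ReasoningBank types.

```rust
/// Outcome of one expert's distillation run (hypothetical enum).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum DistillVerdict {
    Success,
    /// Assumed label for 0.7 <= cosine_sim <= 0.9, which the table does not define.
    Inconclusive,
    Failure,
}

/// Success if cosine_sim > 0.9, Failure if cosine_sim < 0.7.
pub fn verdict_from_cosine(cos_sim: f32) -> DistillVerdict {
    if cos_sim > 0.9 {
        DistillVerdict::Success
    } else if cos_sim < 0.7 {
        DistillVerdict::Failure
    } else {
        DistillVerdict::Inconclusive
    }
}

/// Category key combining expert index and layer type, e.g. "expert_3_gate_proj".
pub fn pattern_category(expert_idx: usize, module: &str) -> String {
    format!("expert_{}_{}", expert_idx, module)
}
```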

AD-16: Distillation Pipeline Composition

Decision: Compose the full Craftsman Ultra distillation pipeline from existing RLM components wired through a new CraftsmanDistiller orchestrator.

Pipeline architecture:

┌─────────────────────────────────────────────────────────────────┐
│                  CraftsmanDistiller (NEW orchestrator)           │
│                                                                 │
│  ┌───────────────┐    ┌──────────────────┐    ┌──────────────┐ │
│  │ TeacherModel  │───▶│BitLinearTrainer   │───▶│ GGUFExporter │ │
│  │(GLM-4.7-Flash)│    │(NEW: STE+shadow)  │    │(REUSED)      │ │
│  └───────┬───────┘    └────────┬─────────┘    └──────────────┘ │
│          │                     │                                │
│          │   ┌─────────────────┼─────────────────┐              │
│          │   │                 │                  │              │
│          ▼   ▼                 ▼                  ▼              │
│  ┌──────────────┐   ┌──────────────┐   ┌──────────────────┐    │
│  │GrpoOptimizer │   │EwcRegularizer│   │ContrastiveTrainer│    │
│  │(REUSED)      │   │(REUSED)      │   │(REUSED)          │    │
│  │Per-expert    │   │Cross-expert  │   │Router validation │    │
│  │reward scaling│   │stability     │   │post-ternary      │    │
│  └──────┬───────┘   └──────┬───────┘   └────────┬─────────┘    │
│         │                  │                     │              │
│         ▼                  ▼                     ▼              │
│  ┌──────────────────────────────────────────────────────┐      │
│  │              Quality Feedback Loop                    │      │
│  │                                                       │      │
│  │  MemoryDistiller ──▶ KeyLesson extraction             │      │
│  │  PolicyStore    ──▶ TernaryScale persistence          │      │
│  │  (BOTH REUSED)                                        │      │
│  └──────────────────────────────────────────────────────┘      │
└─────────────────────────────────────────────────────────────────┘

Net-new code:  BitLinearTrainer (STE + shadow weights), CraftsmanDistiller (orchestrator)
Reused code:   GrpoOptimizer, EwcRegularizer, ContrastiveTrainer, MemoryDistiller,
               PolicyStore, GGUFExporter, TrainingConfig, LR schedules
Reuse ratio:   ~70% existing / ~30% new

Optimization: Expert-Parallel Distillation

Experts are independent during forward pass. Distill multiple experts concurrently across CPU cores:

// Distill experts in parallel (independent FFN weights)
let expert_results: Vec<DistillResult> = experts
    .par_iter()                          // rayon parallel iterator
    .enumerate()
    .map(|(idx, expert)| {
        let mut trainer = BitLinearTrainer::new(expert, &teacher_expert[idx]);
        let mut ewc = EwcRegularizer::new_with_fisher(cumulative_fisher[idx]);
        let mut grpo = GrpoOptimizer::new(GrpoConfig::stable());

        for batch in dataset.batches() {
            let student_out = trainer.forward_ternary(batch);
            let teacher_out = teacher.forward_expert(idx, batch);

            let reward = grpo.evaluate(&student_out, &teacher_out);
            let kd_loss = kd_loss_fn(&student_out, &teacher_out, alpha, temperature);
            let ewc_penalty = ewc.penalty(&trainer.shadow_weights());
            let total_loss = kd_loss * reward.scale() + ewc_penalty;

            trainer.backward_ste(total_loss);
        }

        ewc.update_fisher(&trainer);      // Fisher for this expert (merge across experts after collect)
        DistillResult { idx, weights: trainer.export_ternary(), fisher: ewc.fisher() }
    })
    .collect();

AD-17: Training Infrastructure — Cloud GPU over Local SIMD

Decision: Use Google Cloud A100/H100 GPU instances for distillation training. Reserve local CPU/SIMD for inference validation, MicroLoRA adaptation, and GGUF export only.

Rationale: Local CPU/SIMD training is impractical at the 200B+ token scale required for expert distillation. The existing RuvLLM SIMD kernels (kernels/) are inference-only: they implement no backpropagation or gradient computation. The training code (real_trainer.rs:178-184) supports Metal (macOS) or CPU but not CUDA, and CPU training throughput of ~50-100 tok/s would require ~65-130 years for 200B tokens.

Memory analysis (per-expert distillation):

Component	Size	Notes
Single expert FFN shadow weights (FP16)	~2 GB	~1B params per expert (28B ÷ N experts)
Gradients (FP32)	~4 GB	Full precision for STE backprop
AdamW optimizer state (2× FP32)	~8 GB	First + second moment
Teacher activations cache	~1 GB	Per-batch FP16
EWC++ Fisher diagonal	~0.5 GB	Per-expert accumulated
Per-expert total	~15.5 GB	Fits in A100 40GB with headroom

Full model simultaneous (Phase 2+):

Component	Size	Notes
30B shadow weights (FP16)	~60 GB	Requires A100 80GB or H100
Gradients + optimizer	~360 GB	Requires multi-GPU parallelism
Total	~430 GB	Exceeds the 320 GB aggregate of 4× A100 80GB or 4× H100 80GB; needs optimizer-state sharding/offload or more GPUs
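The weight, gradient, and optimizer rows in both tables follow from bytes per parameter. The sketch below reproduces that arithmetic; it ignores activations and Fisher state, and the type and function names are illustrative.

```rust
// Bytes per parameter for each training-time buffer, as in the tables above.
const FP16_BYTES: f64 = 2.0; // shadow weights
const FP32_BYTES: f64 = 4.0; // gradients
const ADAMW_BYTES: f64 = 8.0; // first + second moment, FP32 each

#[derive(Debug, Clone, Copy)]
pub struct TrainingMemoryGb {
    pub shadow: f64,
    pub gradients: f64,
    pub optimizer: f64,
}

impl TrainingMemoryGb {
    /// Sum of the three buffers in GB (activations and Fisher not included).
    pub fn total(&self) -> f64 {
        self.shadow + self.gradients + self.optimizer
    }
}

/// `params_billion` is the trainable parameter count in billions;
/// 1e9 parameters at N bytes each occupy N GB (decimal).
pub fn training_memory_gb(params_billion: f64) -> TrainingMemoryGb {
    TrainingMemoryGb {
        shadow: params_billion * FP16_BYTES,
        gradients: params_billion * FP32_BYTES,
        optimizer: params_billion * ADAMW_BYTES,
    }
}

/// Whether `total_gb` fits in the aggregate memory of `gpus` devices.
pub fn fits(total_gb: f64, gpus: u32, gb_per_gpu: f64) -> bool {
    total_gb <= gpus as f64 * gb_per_gpu
}
```

For a single ~1B-parameter expert this gives 2 + 4 + 8 = 14 GB, matching the first three rows of the per-expert table; the `fits` helper can then be used to check a candidate GPU count against the summed buffers.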

Throughput and cost comparison:

Platform	Training tok/s	Time (200B tok, Phase 1)	Cost	Phase 0 PTQ?	Phase 0.5 RLM?
Mac Studio M4 Max (Metal)	~500-1000	~6.5-13 years	N/A	Yes (1-4 hrs, $0)	Yes (2-12 days, $0)
Mac Studio M4 Max (NEON SIMD only, no Metal)	~200-500	~13-32 years	N/A	Yes (2-6 hrs, $0)	Yes (4-24 days, $0)
Mac Studio M3 Ultra (Metal)	~800-1500	~4.2-8 years	N/A	Yes (1-1.5 hrs, $0)	Yes (1.5-8 days, $0)
Mac Studio M3 Ultra (NEON SIMD only, no Metal)	~300-700	~9-21 years	N/A	Yes (1.5-3 hrs, $0)	Yes (3-16 days, $0)
CPU AVX2 (Ryzen 9), scalar fallback	~50-150	~43-130 years	N/A	Yes (2-6 hrs, $0)	Yes (14-58 days, $0)
1× A100 80GB (GCP on-demand)	~15,000	~155 days	~$3,700	Yes (30 min, ~$5)	Overkill
4× A100 80GB (GCP on-demand)	~50,000	~46 days	~$4,400	Overkill for PTQ	Overkill
4× A100 80GB (GCP spot)	~50,000	~46 days	~$1,300	Overkill for PTQ	Overkill
1× H100 (DataCrunch)	~40,000	~58 days	~$2,900	Overkill for PTQ	Overkill
4× H100 (DataCrunch)	~140,000	~16 days	~$3,200	Overkill for PTQ	Overkill

Key insight: Mac Studio is infeasible for Phase 1+ training (years of wall time) but ideal for Phase 0 PTQ (hours, $0). This separation justifies the phased approach.
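The wall-time column is tokens divided by sustained throughput. A small sketch of that conversion, with illustrative function names:

```rust
const SECONDS_PER_DAY: f64 = 86_400.0;
const DAYS_PER_YEAR: f64 = 365.25;

/// Days needed to process `tokens` at a sustained `tok_per_s`.
pub fn wall_time_days(tokens: f64, tok_per_s: f64) -> f64 {
    assert!(tok_per_s > 0.0, "throughput must be positive");
    tokens / tok_per_s / SECONDS_PER_DAY
}

/// Same estimate expressed in years.
pub fn wall_time_years(tokens: f64, tok_per_s: f64) -> f64 {
    wall_time_days(tokens, tok_per_s) / DAYS_PER_YEAR
}

/// (best case, worst case) in days for a throughput range `slow..=fast`.
pub fn wall_time_range_days(tokens: f64, slow: f64, fast: f64) -> (f64, f64) {
    (wall_time_days(tokens, fast), wall_time_days(tokens, slow))
}
```

For example, 200B tokens at 50,000 tok/s comes out to roughly 46 days, matching the 4× A100 rows, while 1,000 tok/s gives a little over 6 years.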

Recommended infrastructure per phase:

Phase	Instance	Duration	Estimated Cost	Strategy
Phase 0 (PTQ)	Mac Studio (M4 Max/M3 Ultra)	1-4 hours	$0	Mmap FP16 weights → absmean quantize → export GGUF; Metal GPU for calibration pass
Phase 0D (BitDistill Lite, 10B tok)	Mac Studio Metal or 1× A100 spot	2-4 weeks (local) / 1-2 days (cloud)	$0 (local) / ~$300 (cloud)	Optional quality upgrade if Phase 0C too degraded
Phase 0.5 (RLM refinement, Metal)	Mac Studio (Metal)	3-14 days	$0	MicroLoRA + router fix + scale opt using existing RLM stack
Phase 0.5 (RLM refinement, SIMD-only)	Mac Studio (NEON CPU)	5-28 days	$0	Same pipeline, no Metal required — pure ndarray + NEON SIMD (see AD-20)
Phase 1 (expert FFN, 200B tok)	4× A100 80GB spot (GCP)	~46 days	$1,300-$2,000	Per-expert sequential with EWC++; each expert fits 1 GPU
Phase 1 (router validation)	Mac Studio Metal or 1× A100	~2-4 hours	$0 (local) / <$10 (cloud)	Contrastive training on router only (~30M params)
Phase 2 (full ternary, 500B tok)	4× H100 (DataCrunch)	~16-32 days	$2,500-$5,000	All layers; model-parallel across GPUs
Phase 3 (native training, 4T tok)	8× H100 cluster	~90-180 days	$15,000-$30,000	Full from-scratch; depends on funding
Inference validation	Mac Studio (NEON)	Continuous	$0	TL1/TL2 kernel testing on ARM NEON
MicroLoRA adaptation	Mac Studio	<1ms/update	$0	Existing ndarray-based EWC++ pipeline

Required code change: Add CUDA device dispatch to RealContrastiveTrainer:

// Current (real_trainer.rs:178-184):
let device = if config.use_metal {
    Device::new_metal(0).unwrap_or(Device::Cpu)
} else {
    Device::Cpu
};

// Required for cloud GPU training:
let device = if config.use_cuda {
    Device::new_cuda(config.cuda_device_id).unwrap_or(Device::Cpu)
} else if config.use_metal {
    Device::new_metal(0).unwrap_or(Device::Cpu)
} else {
    Device::Cpu
};

This is a two-field addition to RealTrainingConfig (use_cuda: bool, cuda_device_id: usize) and a 3-line change to device selection. The rest of the Candle training pipeline (tensors, optimizer, loss computation) works identically across CPU/Metal/CUDA.

Cost optimization strategies:

Spot instances: GCP A100 spot at ~$1/GPU-hr (70% off on-demand) — requires checkpointing every 30 min
DataCrunch / Lambda Labs: H100 at $1.99-$2.10/hr (40-50% below GCP on-demand)
Expert-sequential on fewer GPUs: Distill 1 expert at a time on 1× A100 80GB (~$1.50/hr), increasing wall time but reducing per-hour cost
Mixed precision training: FP16 shadow weights + BF16 activations reduces memory, enabling smaller instances
Gradient checkpointing: Trade compute for memory to fit on fewer GPUs

AD-18: Phase 0 — PT-BitNet Post-Training Quantization on Mac Studio

Decision: Implement a PT-BitNet ternary post-training quantizer as Phase 0, running entirely on a local Mac Studio, producing a rapid prototype GGUF for inference pipeline validation before investing in full distillation.

Rationale: The original Option A ("Rejected") assumed only generic IQ1_S quantization, which produces garbled outputs at 1.56 bpw. However, PT-BitNet (2025) demonstrates that applying BitNet's native absmean ternary quantization to pre-trained weights with calibration data achieves significantly better results (61% downstream at 70B) than generic codebook PTQ. This produces genuine BitNet ternary format that enables multiplication-free inference with TL1/TL2 kernels — unlike IQ1_S which still requires dequant-then-multiply.

Target platform: Mac Studio (Apple Silicon)

Phase 0 is pure quantization (no training loop), making it ideal for local execution on Mac Studio:

Config	Unified RAM	FP16 Load	PTQ?	Calibration?	Notes
M4 Max 36GB	36 GB	mmap (demand-paged)	Yes	Slow (paging)	Minimum viable; mmap means only active tensor pages in RAM
M4 Max 64GB	64 GB	Fits with mmap assist	Yes	Yes	Comfortable for PTQ; calibration may page
M4 Max 128GB	128 GB	Fits entirely	Yes	Yes	Ideal — FP16 model (60GB) + ternary output (7GB) + calibration buffers all in RAM
M3 Ultra 96GB	96 GB	Fits entirely	Yes	Yes	Good headroom
M3 Ultra 192GB+	192+ GB	Fits entirely	Yes	Yes	Ample room for full model + calibration + inference validation

Why Mac Studio works for Phase 0 (but not Phase 1+):

PTQ is not training: No gradient computation, no optimizer state, no backpropagation — just load → quantize → export
Memory-mapped I/O: FP16 weights can be mmap'd from disk; only the current tensor's pages need to be in RAM
Per-tensor processing: Quantize one tensor at a time (read FP16 block → compute absmean → round to ternary → write output) — working memory is ~2-4 MB per tensor regardless of total model size
Metal GPU for calibration: RuvLLM's existing RealContrastiveTrainer and kernels/matmul.rs support Metal via Candle (use_metal: true default, 3x speedup on M4 Pro GEMV)
ARM NEON for TL1 kernels: Mac Studio's Apple Silicon has NEON SIMD — the same target ISA as the TL1 kernel for ternary inference validation
Phase 1 still needs cloud GPU: 200B token distillation at ~500-1000 tok/s (Metal) = ~6.5-13 years locally vs ~46 days on 4× A100

Estimated Phase 0 wall time on Mac Studio:

Step	M4 Max 128GB	M4 Max 64GB	M3 Ultra 192GB
Download GLM-4.7-Flash FP16 (~60GB)	~30 min (1Gbps)	~30 min	~30 min
Absmean ternary quantization	~5-15 min	~10-30 min (paging)	~5-10 min
Calibration pass (1000 samples, Metal)	~30-60 min	~60-120 min	~20-40 min
GGUF export	~2-5 min	~2-5 min	~2-5 min
TL1 kernel validation inference	~10-20 min	~10-20 min	~10-20 min
Total	~1-2 hours	~2-4 hours	~1-1.5 hours

Implementation approach:

Phase 0 Pipeline (runs on Mac Studio):
  1. Load GLM-4.7-Flash FP16/BF16 weights via mmap
  2. For each linear layer in expert FFNs:
     a. Compute gamma = mean(|W|)  (absmean scale)
     b. W_ternary = RoundClip(W / (gamma + epsilon), -1, 1)
     c. Store: 2-bit packed ternary weights + FP16 scale per block
  3. Calibration pass (optional, improves quality, uses Metal GPU):
     a. Run ~1000 calibration samples through teacher model
     b. Record activation statistics per layer
     c. Optimize scale factors to minimize MSE between teacher and ternary outputs
  4. Export to GGUF with BITNET_T158 tensor type + metadata
  5. Validate: load in BitNetBackend → TL1 NEON kernel → generate tokens

Absmean ternary quantizer (core algorithm):

Input:  W ∈ R^{m×n} (FP16 weight matrix)
Output: W_t ∈ {-1,0,+1}^{m×n}, scale ∈ R (per-block FP16)

For each block of 256 elements:
  1. gamma = mean(|block|) + 1e-8
  2. normalized = block / gamma
  3. ternary = round(clamp(normalized, -1, 1))  → {-1, 0, +1}
  4. Pack: 2 bits per weight (00=-1, 01=0, 10=+1)
  5. Store scale = gamma as FP16
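The five steps above can be sketched for a single block as follows. Assumptions in this example: the scale is kept as f32 because stable Rust has no built-in FP16 type (the pipeline stores it as FP16), weights are packed low bits first within each byte, and `TernaryBlock`, `quantize_block`, and `dequantize_block` are hypothetical names rather than the planned PtBitnetQuantizer API.

```rust
/// One quantized block: 2-bit codes (00 = -1, 01 = 0, 10 = +1) plus its scale.
pub struct TernaryBlock {
    pub packed: Vec<u8>, // four weights per byte, low bits first (assumed order)
    pub scale: f32,      // absmean gamma; stored as FP16 in the real format
    pub len: usize,      // number of weights in the block
}

/// Absmean ternary quantization of one block (256 elements in the ADR).
pub fn quantize_block(block: &[f32]) -> TernaryBlock {
    let n = block.len().max(1) as f32;
    // Step 1: gamma = mean(|block|) + epsilon
    let gamma = block.iter().map(|w| w.abs()).sum::<f32>() / n + 1e-8;
    let mut packed = vec![0u8; (block.len() + 3) / 4];
    for (i, &w) in block.iter().enumerate() {
        // Steps 2-3: normalize, clamp to [-1, 1], round to {-1, 0, +1}
        let t = (w / gamma).clamp(-1.0, 1.0).round() as i8;
        // Step 4: shift to an unsigned 2-bit code and pack
        let code = (t + 1) as u8;
        packed[i / 4] |= code << ((i % 4) * 2);
    }
    // Step 5: keep gamma as the block scale
    TernaryBlock { packed, scale: gamma, len: block.len() }
}

/// Inverse mapping: unpack each 2-bit code and multiply by the block scale.
pub fn dequantize_block(b: &TernaryBlock) -> Vec<f32> {
    (0..b.len)
        .map(|i| {
            let code = (b.packed[i / 4] >> ((i % 4) * 2)) & 0b11;
            (code as i8 - 1) as f32 * b.scale
        })
        .collect()
}
```

A 256-element block packs into 64 bytes plus one scale, which is where the roughly 2 bits per weight on disk comes from.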

What stays FP16 (same as AD-2):

MoE router gating weights
Token embeddings + LM head
RoPE frequencies
LayerNorm/RMSNorm parameters

RuvLLM implementation gaps to fill:

Gap	Effort	Details
Absmean ternary quantizer	~200-300 lines	New function in `gguf/quantization.rs` or new module
IQ1_S / BITNET_T158 dequantization	~80-120 lines	Add to `dequantize_tensor` match arm (currently falls to error at line 358)
GGUF export with ternary metadata	~100-150 lines	Extend `GgufExportResult` with BitNet metadata keys from AD-5
TL1 kernel smoke test	~200 lines	Validate ternary GEMM produces correct output on PTQ model

Total new code: ~600-800 lines (vs ~15,000+ for Phase 1 full distillation pipeline)

Quality expectations (conservative estimates for GLM-4.7-Flash 30B-A3B):

Benchmark	FP16 Baseline	Phase 0 PTQ (est.)	Phase 1 Distill (est.)
HumanEval pass@1	~65%	~35-45%	~55-60%
MMLU	~75%	~45-55%	~65-70%
SWE-bench Verified	59.2%	~25-35%	~50-55%
LiveCodeBench v6	64.0%	~30-40%	~55-60%

Why Phase 0 quality is still useful:

Kernel validation: Ternary GEMM correctness doesn't depend on model quality
Memory profiling: Real-world memory usage measurement with actual MoE activation patterns
Throughput benchmarking: Measure real tok/s with TL1/TL2/I2_S kernels on target hardware
Pipeline testing: End-to-end GGUF load → inference → token output
Baseline measurement: Quantitative quality floor establishes improvement target for Phase 1
Cost: $0 on Mac Studio vs ~$1,300 for Phase 1 — validates infrastructure at zero cost before committing to cloud GPU

Key configuration:

pub struct PtBitnetConfig {
    pub calibration_samples: usize,     // 1000 default (WikiText-2 or code corpus)
    pub block_size: usize,              // 256 (matches AD-1)
    pub optimize_scales: bool,          // true: MSE-optimized scales; false: raw absmean
    pub layers_to_quantize: LayerMask,  // ExpertsOnly (Phase 0) or All (future)
    pub export_format: TernaryFormat,   // BitnetT158 (native) or IQ1S (llama.cpp compat)
    pub router_precision: Precision,    // FP16 (always, per AD-2)
    pub use_mmap: bool,                 // true: memory-map FP16 weights (required for <128GB systems)
    pub use_metal_calibration: bool,    // true: Metal GPU for calibration pass (Mac Studio)
    pub max_memory_gb: Option<f32>,     // Cap memory usage; enables streaming quantization
}

Reused: GGUF parser, tensor metadata, GgufQuantType enum, export pipeline. New: PtBitnetQuantizer, absmean_ternary(), BITNET_T158 dequantization kernel.

AD-19: Phase 0.5 — RLM Post-Quantization Refinement (No Traditional Training)

Decision: Use the existing RLM training stack to refine the Phase 0 PTQ model on Mac Studio by training only the small FP16 components (~1-2% of parameters), freezing ternary weights. This replaces traditional distillation for the rapid prototype phase.

Rationale: Traditional knowledge distillation (Phase 1) requires shadow weights, straight-through estimator, and GPU-scale compute to modify the ternary weights themselves. However, the Phase 0 PTQ model already has ternary weights — the quality loss comes from:

Sub-optimal per-block scale factors (absmean is a rough approximation)
MoE router misrouting tokens to wrong experts (expert output distributions changed)
No adaptation to ternary output characteristics

All three can be addressed by training only the FP16 components using the existing RLM stack, without touching the ternary weights.

What gets trained (FP16, differentiable) vs frozen (ternary, not differentiable):

Component	Params	Size	Trainable?	Training Method
Expert FFN ternary weights	~28B	~5.5 GB	Frozen	N/A — {-1,0,+1} not differentiable
MicroLoRA adapters (rank-2, per expert FFN)	~50-100M	~100-200 MB	Yes	`TrainingPipeline` + `EwcRegularizer`
MoE router gating weights	~30M	~60 MB	Yes	`ContrastiveTrainer` (triplet + InfoNCE)
Per-block absmean scale factors	~110M	~220 MB	Yes	GRPO reward-guided optimization
LM head (output projection)	~150M	~300 MB	Yes (optional)	Standard fine-tuning
Attention Q/K/V/O (FP16)	~2B	~4 GB	Optional	Can add LoRA here too if budget allows
Total trainable	~200-400M	~400-800 MB		~1-2% of 30B total

Why RLM works here (vs traditional distillation):

Property	Traditional KD (Phase 1)	RLM Refinement (Phase 0.5)
Modifies ternary weights	Yes (shadow weights + STE)	No (frozen)
Trainable params	~28B (all expert weights)	~200-400M (1-2%)
Training tokens needed	200B	100M-500M (400-2000x fewer)
GPU requirement	4× A100 ($1,300+)	Mac Studio Metal ($0)
Training time	~46 days (cloud)	2-12 days (local)
Quality target	~90-95% of FP16	~70-80% of FP16
New code required	~15,000 lines (BitLinear, STE, orchestrator)	~200-300 lines (RlmRefiner orchestrator only; training code is 100% RLM reuse)

RLM component mapping:

┌──────────────────────────────────────────────────────────────────┐
│              Phase 0.5: RLM Refinement Pipeline                  │
│              (100% existing RLM code, 0% new training code)      │
│                                                                  │
│  Frozen Ternary Model (Phase 0 PTQ output)                       │
│  ┌────────────────────────────────────────────┐                  │
│  │  Expert FFNs: {-1,0,+1} weights (FROZEN)   │                  │
│  │  Router: FP16 gating (TRAINABLE)            │                  │
│  │  Attention: FP16 (TRAINABLE via LoRA opt.)  │                  │
│  │  Scales: FP16 per-block (TRAINABLE)         │                  │
│  └────────────────────────────────────────────┘                  │
│           │                                                       │
│     ┌─────▼──────────────────────────────────────────┐           │
│     │  Step 1: Router Repair                          │           │
│     │  ContrastiveTrainer (REUSED, contrastive.rs)    │           │
│     │  • Generate triplets: anchor=hidden, +correct   │           │
│     │    expert, -wrong expert                        │           │
│     │  • Triplet + InfoNCE loss on FP16 router        │           │
│     │  • Fix misrouting from PTQ weight changes       │           │
│     │  Training: ~10M tokens, ~3-6 hours (Metal)      │           │
│     └─────┬──────────────────────────────────────────┘           │
│           │                                                       │
│     ┌─────▼──────────────────────────────────────────┐           │
│     │  Step 2: MicroLoRA Injection + Training         │           │
│     │  TrainingPipeline + MicroLoRA (REUSED,          │           │
│     │    lora/training.rs + lora/micro_lora.rs)       │           │
│     │  • Rank-2 LoRA per expert FFN: Y = BitLinear(X) │           │
│     │    + LoRA_B @ LoRA_A @ X                        │           │
│     │  • Loss: MSE(teacher_output, student+LoRA)      │           │
│     │  • EWC++ across expert phases                   │           │
│     │  Training: ~100-500M tokens, ~2-12 days (Metal) │           │
│     └─────┬──────────────────────────────────────────┘           │
│           │                                                       │
│     ┌─────▼──────────────────────────────────────────┐           │
│     │  Step 3: Scale Factor + Quality Optimization    │           │
│     │  GrpoOptimizer (REUSED, grpo.rs)                │           │
│     │  • Per-expert output quality → reward signal     │           │
│     │  • Optimize FP16 scale factors to maximize       │           │
│     │    cosine similarity with teacher output          │           │
│     │  • Adaptive KL prevents over-correction          │           │
│     │  Training: concurrent with Step 2               │           │
│     └─────┬──────────────────────────────────────────┘           │
│           │                                                       │
│     ┌─────▼──────────────────────────────────────────┐           │
│     │  Feedback Loop                                  │           │
│     │  MemoryDistiller → KeyLessons (REUSED)          │           │
│     │  PolicyStore → TernaryScale policies (REUSED)   │           │
│     │  • Track which experts improve most             │           │
│     │  • Store optimized configs for reproducibility  │           │
│     └────────────────────────────────────────────────┘           │
└──────────────────────────────────────────────────────────────────┘
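Step 2 of the pipeline computes Y = BitLinear(X) + LoRA_B @ LoRA_A @ X. A minimal single-vector sketch of that forward pass is shown below. It assumes one scale per matrix for brevity (the quantizer in AD-18 uses per-block scales), row-major layouts, and illustrative function names; it is not the MicroLoRA implementation.

```rust
/// Ternary matrix-vector product: weights in {-1, 0, +1}, so each term is an
/// add, a subtract, or a skip. `w_t` is row-major with shape rows x cols.
pub fn bitlinear_forward(w_t: &[i8], scale: f32, rows: usize, cols: usize, x: &[f32]) -> Vec<f32> {
    assert_eq!(w_t.len(), rows * cols);
    assert_eq!(x.len(), cols);
    (0..rows)
        .map(|r| {
            let mut acc = 0.0f32;
            for c in 0..cols {
                match w_t[r * cols + c] {
                    1 => acc += x[c],
                    -1 => acc -= x[c],
                    _ => {} // zero weight contributes nothing
                }
            }
            acc * scale
        })
        .collect()
}

/// Low-rank correction B @ (A @ x). `a` is rank x cols, `b` is rows x rank.
pub fn lora_forward(a: &[f32], b: &[f32], rank: usize, rows: usize, cols: usize, x: &[f32]) -> Vec<f32> {
    assert_eq!(a.len(), rank * cols);
    assert_eq!(b.len(), rows * rank);
    let h: Vec<f32> = (0..rank)
        .map(|k| (0..cols).map(|c| a[k * cols + c] * x[c]).sum::<f32>())
        .collect();
    (0..rows)
        .map(|r| (0..rank).map(|k| b[r * rank + k] * h[k]).sum::<f32>())
        .collect()
}

/// Frozen ternary path plus trainable FP correction.
pub fn bitlinear_lora_forward(
    w_t: &[i8], scale: f32, lora_a: &[f32], lora_b: &[f32],
    rank: usize, rows: usize, cols: usize, x: &[f32],
) -> Vec<f32> {
    let base = bitlinear_forward(w_t, scale, rows, cols, x);
    let delta = lora_forward(lora_a, lora_b, rank, rows, cols, x);
    base.iter().zip(&delta).map(|(y, d)| y + d).collect()
}
```

Only `lora_a` and `lora_b` would receive gradients here; the ternary weights are read-only, which is what keeps the trainable parameter count small.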

Memory budget on Mac Studio during Phase 0.5 training:

Component	Size	Notes
PTQ ternary model (mmap)	~7 GB disk / ~3-7 GB RAM	Demand-paged; only active expert pages in RAM
Teacher FP16 model (mmap)	~60 GB disk / ~4-8 GB RAM	Only forward pass activations; demand-paged
MicroLoRA adapters (rank-2)	~200 MB	All experts in RAM
LoRA gradients + optimizer (AdamW 2×FP32)	~1.5 GB	For ~400M trainable params
EWC++ Fisher diagonal	~200 MB	Per-expert accumulated
KV cache + activations	~2 GB	Calibration/training forward pass
Total active RAM	~12-20 GB	Fits in any Mac Studio config

Key insight: The teacher model is only needed for forward pass (no gradients), so it can be mmap'd and demand-paged. The ternary student is similarly mmap'd. Only the ~400M trainable parameters and their optimizer state need to be fully in RAM (~2 GB), which fits comfortably in even the 36GB M4 Max.

Training schedule on Mac Studio M4 Max 128GB:

Step	Tokens	Wall Time	What Changes
Router repair	~10M	~3-6 hours	FP16 router gating weights
LoRA training (per-expert, sequential)	~100-500M	2-12 days	MicroLoRA A/B matrices per expert FFN
Scale optimization	~10M	~3-6 hours	Per-block FP16 absmean scales
Validation + export	—	~1-2 hours	Benchmark + GGUF re-export
Total	~120-520M	~3-14 days

Expected quality improvement:

Benchmark	Phase 0 PTQ	Phase 0.5 RLM	Phase 1 Distill	FP16 Baseline
HumanEval pass@1	~35-45%	~45-55%	~55-60%	~65%
MMLU	~45-55%	~55-65%	~65-70%	~75%
SWE-bench Verified	~25-35%	~35-45%	~50-55%	59.2%

RLM can therefore replace traditional training for the rapid-prototype phase, with one critical caveat: RLM refinement trains the FP16 corrections around frozen ternary weights, not the ternary weights themselves. This is fundamentally different from traditional distillation, but it achieves meaningful quality recovery (an estimated +10 percentage points over Phase 0, per the table above) at zero cloud cost.

Reused (100%): MicroLoRA, TrainingPipeline, EwcRegularizer, GrpoOptimizer, ContrastiveTrainer, MemoryDistiller, PolicyStore, TrainingConfig, LR schedules, GGUF export. New (0%): No new training code. The only new code is a thin RlmRefiner orchestrator (~200-300 lines) that wires the existing components together for the Phase 0.5 pipeline.

AD-20: Phase 0.5 — SIMD-Only Training Mode (No Metal GPU Required)

Decision: Phase 0.5 RLM refinement supports a pure SIMD/CPU execution mode with no Metal GPU dependency. Metal is an optional acceleration path (~2-3x faster) but not required.

Rationale: Analysis of the RLM training stack shows that only one component is tied to Metal (RealContrastiveTrainer via Candle); ContrastiveTrainer uses Metal optionally with a CPU fallback, and all other training components are pure ndarray/CPU. Since Phase 0.5 uses the lightweight ContrastiveTrainer (not RealContrastiveTrainer) for router repair, and all gradient computation is ndarray-based, the entire pipeline runs on pure CPU with SIMD acceleration for inference forward passes.

Component-by-component GPU dependency analysis:

Component	Source	GPU Dependency	SIMD-Only Mode
`MicroLoRA.forward_simd()`	`lora/micro_lora.rs:279`	None — ARM NEON intrinsics with scalar fallback	NEON on aarch64, scalar on x86
`MicroLoRA.apply_gradients()`	`lora/micro_lora.rs:621+`	None — pure ndarray	Works everywhere
`MicroLoRA.apply_gradients_with_ewc()`	`lora/micro_lora.rs:621+`	None — pure ndarray	Works everywhere
`TrainingPipeline`	`lora/training.rs`	None — pure ndarray CPU	Works everywhere
`EwcRegularizer`	`lora/training.rs`	None — pure ndarray CPU	Works everywhere
`GrpoOptimizer`	`training/grpo.rs`	None — pure ndarray CPU	Works everywhere
`ContrastiveTrainer`	`training/contrastive.rs:169-175`	Optional — `use_metal: true` default, but `Device::new_metal(0).unwrap_or(Device::Cpu)` fallback	Set `use_metal: false` for CPU-only; also has non-Candle pure CPU path (line 475)
`MemoryDistiller`	`reasoning_bank/distillation.rs`	None — pure Rust	Works everywhere
`PolicyStore`	`policy_store.rs`	None — pure Rust	Works everywhere
`RealContrastiveTrainer`	`training/real_trainer.rs:178`	Yes — Metal/Candle	NOT used in Phase 0.5 (used in full distillation only)

Inference forward pass (for loss computation) SIMD support:

Kernel	NEON (aarch64)	x86	Source
GEMM	`gemm_neon`	`gemm_scalar` fallback	`kernels/matmul.rs:520`
GEMV	`gemv_neon`	`gemv_scalar` fallback	`kernels/matmul.rs:184`
SiLU	`silu_neon_impl` (~3.5x speedup)	scalar fallback	`kernels/activations.rs`
GeLU	`gelu_neon_impl` (~3.2x speedup)	scalar fallback	`kernels/activations.rs`
ReLU	`relu_neon_impl` (~4.0x speedup)	scalar fallback	`kernels/activations.rs`
RMSNorm	`rms_norm_neon`	scalar fallback	`kernels/norm.rs`
RoPE	`apply_rope_neon`	scalar fallback	`kernels/rope.rs`
Softmax	`softmax_neon` (~2.8x speedup)	scalar fallback	`kernels/activations.rs`

Key observation: The matmul kernels only dispatch on target_arch = "aarch64" vs scalar. There are no explicit AVX2 or AVX512 SIMD implementations for x86 in the current kernel codebase. This means:

Apple Silicon (aarch64): Full NEON SIMD acceleration — primary target for SIMD-only mode
x86 (AMD/Intel): Falls to scalar fallback — works but ~3-5x slower than NEON
Future opportunity: Adding AVX2/AVX512 kernels to matmul.rs would make x86 competitive with NEON

Throughput comparison for Phase 0.5 (100M tokens, ~200-400M trainable params, 3B active forward):

Execution Mode	Forward tok/s	Effective Training tok/s	100M Tokens	500M Tokens
Metal GPU (M4 Max)	~500-1500	~300-700	~2-4 days	~8-19 days
NEON SIMD only (M4 Max CPU)	~200-500	~100-300	~4-12 days	~19-58 days
NEON SIMD only (M3 Ultra CPU)	~300-700	~150-400	~3-8 days	~14-39 days
x86 scalar (Ryzen 9, no AVX2 kernels)	~50-150	~30-80	~14-39 days	~72-193 days

Why SIMD-only is ~2-3x slower than Metal (not 10x):

Phase 0.5 training is dominated by the forward pass through the frozen 3B active parameters to compute loss against the teacher
The forward pass uses SIMD-accelerated GEMM/GEMV (gemm_neon/gemv_neon) which gets ~60-70% of Metal throughput for these matrix sizes
Gradient computation for the ~200-400M trainable params is pure ndarray — identical speed regardless of Metal availability
The training bottleneck is I/O (loading teacher activations from mmap) not compute, further narrowing the gap

Platform portability (bonus of SIMD-only mode):

SIMD-only mode extends Phase 0.5 beyond Mac Studio to any platform with ndarray support:

Platform	SIMD Path	Effective tok/s	Feasible?
Mac Studio M4 Max (aarch64)	NEON intrinsics	~100-300	Yes — primary target
Mac Studio M3 Ultra (aarch64)	NEON intrinsics	~150-400	Yes — faster than M4 Max
Linux ARM64 (Ampere/Graviton)	NEON intrinsics	~80-200	Yes — cloud ARM instances
Linux x86 (Ryzen/Xeon)	Scalar fallback	~30-80	Marginal — 100M tokens feasible (~14-39 days), 500M not practical
macOS Intel	Scalar fallback	~20-50	Not recommended

Configuration for SIMD-only mode:

// Phase 0.5 SIMD-only config (no Metal)
let contrastive_config = ContrastiveConfig {
    use_metal: false,    // Force CPU path in ContrastiveTrainer
    ..Default::default()
};

// MicroLoRA — already pure SIMD/ndarray, no config change needed
// TrainingPipeline — already pure ndarray
// GrpoOptimizer — already pure ndarray
// EwcRegularizer — already pure ndarray

The only config change is setting use_metal: false in the ContrastiveConfig passed to ContrastiveTrainer. All other RLM components are GPU-agnostic by design.

SIMD-only Phase 0.5 exit criteria (in addition to standard Phase 0.5 criteria):

All training completes without Metal GPU dependency
ContrastiveTrainer runs with use_metal: false and produces equivalent router accuracy
MicroLoRA forward_simd() executes NEON path on aarch64 (verified via cfg compile check)
Training throughput measured and documented for SIMD-only vs Metal comparison

Recommendation: Use Metal when available (2-3x faster), fall back to SIMD-only when Metal is unavailable or on non-Mac platforms. The training code requires zero changes — only ContrastiveTrainer.use_metal needs to be set to false.

Reused: 100% of existing RLM stack — MicroLoRA NEON forward, ndarray training, ContrastiveTrainer CPU fallback, all existing SIMD kernels. New: 0 lines. SIMD-only mode is already supported by the existing code paths; AD-20 documents this capability explicitly.

AD-21: Native Rust Ternary Kernels with WASM Target (bitnet.cpp Port Strategy)

Decision: Port bitnet.cpp's ternary inference kernels (TL1, TL2, I2_S) to native Rust with dual compilation targets: native SIMD (NEON/AVX2/AVX512) and WebAssembly SIMD128. This replaces the original AD-4 strategy of Python codegen → Rust intrinsics with a pure Rust implementation that leverages existing open-source work.

Rationale: Three significant developments change the AD-4 implementation calculus:

R3-Engine (https://github.com/r3-engine/r3-engine) — A pure Safe Rust BitNet inference engine achieving 80-117 tok/s single-threaded on Ryzen 9950X3D, with native WASM SIMD128 cross-compilation. Uses bit-sliced ternary matrices with AVX-512 VPOPCNTDQ, zero-copy mmap, and zero heap allocations during generation.
bitnet.rs (https://github.com/ocentra/bitnet.rs) — Pure Rust BitNet toolkit with conversion, inference, training, and streaming. Apache 2.0 license. GPU path via WGSL/wgpu (Vulkan/Metal/DX12). Dedicated bitnet-wasm crate for browser deployment.
WASM SIMD128 maturity — Fixed-width 128-bit SIMD now supported in all major browsers (Chrome, Firefox, Safari, Edge). Rust's core::arch::wasm32 provides direct intrinsic access via simd128 LLVM feature flag.

Comparison of approaches:

Approach	Native Performance	WASM Support	Safety	Integration Effort	Code Reuse
A: Python codegen (original AD-4)	Optimal (platform-tuned)	None	C-level unsafe	High — custom codegen pipeline	bitnet.cpp algorithms
B: Port bitnet.cpp to Rust	Near-optimal	Manual WASM SIMD	Mixed (`unsafe` for intrinsics)	Medium — translate C → Rust	bitnet.cpp algorithms
C: Reference R3-Engine patterns	80-117 tok/s proven	Native dual-target	100% Safe Rust	Low-medium — adapt patterns	R3 bit-slicing + mmap
D: Integrate bitnet.rs crate	GPU: 32x (WGSL), CPU: scalar	`bitnet-wasm` crate	Safe Rust + WGSL	Low — add dependency	Full crate

Recommended: Approach C (Reference R3-Engine) with RuvLLM integration

R3-Engine's techniques are the strongest fit because:

100% Safe Rust — no unsafe blocks in the hot path
Dual-target proven — same codebase compiles to AVX-512 native and WASM SIMD128
Zero-copy mmap — matches our Phase 0 mmap strategy (AD-18)
Cache-aligned bit-slicing — 64-byte aligned CacheLines match CPU cache architecture
VPOPCNTDQ — bit-population-count approach to ternary GEMM is elegant and SIMD-width-agnostic
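
To make the popcount idea concrete, the sketch below computes a ternary dot product from two bit-planes per operand using only AND and `count_ones`. It is an illustration under assumptions, not R3-Engine's actual memory layout: `TernaryWord`, `pack`, and `dot` are hypothetical names, and both operands are taken to be ternary (INT8 activations would need additional bit-planes).

```rust
/// Up to 64 ternary values stored as two bit-planes:
/// bit i of `pos` set means +1, bit i of `neg` set means -1, neither means 0.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
struct TernaryWord {
    pos: u64,
    neg: u64,
}

/// Pack a slice of values in {-1, 0, +1} into one bit-sliced word.
fn pack(values: &[i8]) -> TernaryWord {
    assert!(values.len() <= 64, "one word holds at most 64 values");
    let mut word = TernaryWord { pos: 0, neg: 0 };
    for (i, &v) in values.iter().enumerate() {
        match v {
            1 => word.pos |= 1u64 << i,
            -1 => word.neg |= 1u64 << i,
            0 => {}
            _ => panic!("value must be -1, 0, or +1"),
        }
    }
    word
}

/// Dot product of two bit-sliced ternary words.
/// Positions where the signs agree contribute +1, positions where they differ contribute -1.
fn dot(w: TernaryWord, a: TernaryWord) -> i32 {
    let agree = (w.pos & a.pos).count_ones() + (w.neg & a.neg).count_ones();
    let differ = (w.pos & a.neg).count_ones() + (w.neg & a.pos).count_ones();
    agree as i32 - differ as i32
}
```

The same four AND-plus-popcount steps apply at any register width, which is what makes the approach width-agnostic: a wider SIMD lane simply processes more 64-bit words per instruction.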

WASM SIMD128 kernel mapping for TL1:

WASM SIMD128 provides v128 type (128 bits):
- i8x16: 16 × 8-bit integers — pack 64 ternary weights (2-bit each)
- i16x8: 8 × 16-bit integers — accumulation without overflow
- i32x4: 4 × 32-bit integers — final dequantized output

TL1 LUT (16 entries) maps naturally to a single v128:
  v128.load(lut_ptr)           → load 16-entry LUT
  i8x16.swizzle(lut, indices)  → parallel 16-way table lookup
  i16x8.add(accum, partial)    → INT16 accumulation
  f32x4.mul(dequant, scale)    → FP32 scale application

Estimated WASM SIMD128 throughput:
  ~20-40 tok/s for 3B active params (vs ~5-10 tok/s scalar WASM)
  ~4-8x speedup over non-SIMD WebAssembly
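
The lookup-and-accumulate flow above can be sketched in portable scalar Rust. This is an illustration under assumptions: the index encoding `(w0 + 1) * 3 + (w1 + 1)`, the `i16` table entries, and the helper names are all hypothetical and may differ from bitnet.cpp's actual TL1 layout. A SIMD version would replace the inner lookup with a byte-wise swizzle over the same 16-entry table.

```rust
/// Map a pair of ternary weights to a 4-bit table index (0..=8).
/// The encoding is an assumption made for this sketch.
fn encode_pair(w0: i8, w1: i8) -> u8 {
    assert!((-1..=1).contains(&w0) && (-1..=1).contains(&w1));
    ((w0 + 1) * 3 + (w1 + 1)) as u8
}

/// Build one 16-entry table per activation pair. Entry `idx` holds
/// w0 * a0 + w1 * a1 for the weight pair encoded as `idx`; slots 9..16 are zero padding.
/// Tables depend only on the activations, so they are built once and reused for every row.
fn build_luts(activations: &[i8]) -> Vec<[i16; 16]> {
    assert!(activations.len() % 2 == 0, "activations are consumed in pairs");
    activations
        .chunks_exact(2)
        .map(|pair| {
            let (a0, a1) = (pair[0] as i16, pair[1] as i16);
            let mut lut = [0i16; 16];
            for w0 in -1i8..=1 {
                for w1 in -1i8..=1 {
                    lut[encode_pair(w0, w1) as usize] = w0 as i16 * a0 + w1 as i16 * a1;
                }
            }
            lut
        })
        .collect()
}

/// Dot product of one weight row against the activations, using lookups and adds only.
fn tl1_dot(row_indices: &[u8], luts: &[[i16; 16]]) -> i32 {
    assert_eq!(row_indices.len(), luts.len());
    row_indices
        .iter()
        .zip(luts)
        .map(|(&idx, lut)| lut[(idx & 0x0F) as usize] as i32)
        .sum()
}
```

No multiplication happens in `tl1_dot`: all products are folded into the tables, so the per-row cost is one lookup and one add per weight pair.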

WASM SIMD128 limitations:

Fixed 128-bit width only (vs NEON 128, AVX2 256, AVX512 512)
No wide-lane popcount (only per-byte i8x16.popcnt; VPOPCNTDQ-style 64-bit counts must be assembled from byte counts or emulated via lookup)
No gather/scatter operations (LUT access must be sequential or use swizzle)
Memory alignment not enforced (no hardware-guaranteed 64-byte alignment)
Single-threaded unless SharedArrayBuffer + Web Workers enabled

Dual-target compilation strategy (Cargo feature flags):

# In Cargo.toml:
[features]
default = ["native-simd"]
native-simd = []           # AVX2/AVX512/NEON via std::arch
wasm-simd = []             # WASM SIMD128 via core::arch::wasm32; build with RUSTFLAGS="-C target-feature=+simd128"

// In kernel code:
#[cfg(all(target_arch = "aarch64", feature = "native-simd"))]
fn ternary_gemv_neon(weights: &TernaryTensor, activations: &[i8], output: &mut [f32]) { ... }

#[cfg(all(target_arch = "x86_64", feature = "native-simd"))]
fn ternary_gemv_avx2(weights: &TernaryTensor, activations: &[i8], output: &mut [f32]) { ... }

#[cfg(all(target_arch = "wasm32", feature = "wasm-simd"))]
fn ternary_gemv_wasm128(weights: &TernaryTensor, activations: &[i8], output: &mut [f32]) { ... }

// Scalar fallback (always available):
fn ternary_gemv_scalar(weights: &TernaryTensor, activations: &[i8], output: &mut [f32]) { ... }
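
A minimal sketch of how a dispatcher could choose among these paths, assuming a simplified weight layout (row-major `&[i8]` instead of `TernaryTensor`) and hypothetical names `KernelPath`, `has_avx2`, and `select_kernel`. Only the scalar reference is implemented here; the SIMD variants would slot in behind the same selection logic.

```rust
/// Kernel variants the dispatcher can choose from (names are illustrative).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum KernelPath {
    Neon,
    Avx2,
    Wasm128,
    Scalar,
}

#[cfg(target_arch = "x86_64")]
fn has_avx2() -> bool {
    // Runtime CPU feature check from the standard library.
    is_x86_feature_detected!("avx2")
}

#[cfg(not(target_arch = "x86_64"))]
fn has_avx2() -> bool {
    false
}

/// Pick the best path for the current target, falling back to scalar.
fn select_kernel() -> KernelPath {
    if cfg!(target_arch = "aarch64") {
        KernelPath::Neon
    } else if cfg!(all(target_arch = "wasm32", target_feature = "simd128")) {
        KernelPath::Wasm128
    } else if has_avx2() {
        KernelPath::Avx2
    } else {
        KernelPath::Scalar
    }
}

/// Scalar reference GEMV over row-major ternary weights in {-1, 0, +1}.
/// Each output is the integer accumulation scaled by `scale`.
fn ternary_gemv_scalar(
    weights: &[i8],
    cols: usize,
    activations: &[i8],
    scale: f32,
    output: &mut [f32],
) {
    assert_eq!(activations.len(), cols);
    assert_eq!(weights.len(), cols * output.len());
    for (row, out) in weights.chunks_exact(cols).zip(output.iter_mut()) {
        let mut acc = 0i32;
        for (&w, &a) in row.iter().zip(activations) {
            match w {
                1 => acc += a as i32,
                -1 => acc -= a as i32,
                _ => {} // zero weights contribute nothing
            }
        }
        *out = acc as f32 * scale;
    }
}
```

Keeping the scalar path as the bit-exact reference makes it straightforward to validate each SIMD variant against it in tests.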

Integration with existing RuvLLM architecture:

Existing Component	Change Needed	Impact
`kernels/mod.rs`	Add `ternary` module export	Low
`kernels/matmul.rs`	Add ternary GEMV dispatch alongside existing FP16/Metal GEMV	Low
`bitnet/mod.rs` (new)	Wire TernaryTensor to kernel dispatch	Already created (Phase 0)
`gguf/quantization.rs`	BitnetT158 dequant already integrated	Already done
`autodetect.rs`	Add AVX512 VPOPCNTDQ detection + WASM target detection	Low
`Cargo.toml`	Add `wasm-simd` feature flag, `wasm32` target conditional deps	Low
`backends/`	New `BitNetBackend` uses ternary kernel dispatch	Medium (new backend)

Estimated implementation effort (Rust ternary kernels with WASM):

Component	Lines	Complexity	Notes
TL1 kernel (NEON + scalar)	~200	Medium	Reference R3-Engine bit-slicing
TL1 kernel (AVX2/AVX512)	~250	Medium	VPOPCNTDQ for AVX512, lookup for AVX2
TL1 kernel (WASM SIMD128)	~150	Medium	v128 swizzle + i16x8 accumulation
I2_S kernel (all targets)	~300	Low	Simpler unpack-and-add
TL2 kernel (all targets)	~250	Medium-High	5-bit index, 32-entry LUT
Kernel dispatch + autodetect	~100	Low	Match existing `matmul.rs` pattern
LUT generation	~80	Low	Pre-compute at model load
Total	~1,330	—	Compiles to native + WASM from single source

Phase 0 impact: The Phase 0 smoke test (TL1 NEON + scalar) is already partially covered by the existing bitnet/ module. AD-21 extends this to production-grade kernels with WASM as an additional target.

Exit criteria:

TL1 kernel passes bit-exact validation against bitnet.cpp reference output
WASM SIMD128 build produces functional .wasm binary
Native NEON throughput ≥ 80% of R3-Engine (≥ ~64-94 tok/s for 2B model)
AVX2 path tested on x86 Linux
Scalar fallback tested on generic platform
WASM throughput ≥ 20 tok/s for 3B active params in browser
Zero unsafe blocks in WASM path (Safe Rust only)
Kernel dispatch selects optimal path via autodetect.rs feature detection

Open question resolved: AD-21 answers open question #5 (WASM target for ternary kernels) — yes, WASM SIMD128 is viable for TL1/I2_S, with ~4-8x speedup over scalar WASM. TL2's 5-bit index is less natural for 128-bit SIMD but still implementable via two-stage lookup.

AD-22: Evaluation Infrastructure and Behavioral Gates

Decision: Define a three-gate behavioral evaluation framework with a structured trace schema, auto-labeling strategy, and Go/No-Go shipping rule. All gates are non-LLM-judge, deterministic, reproducible, and executable on CPU without external API calls. The system ships on integrity/citations/refusal behavior, not raw model quality benchmarks. Full GPU distillation (Phase 1+) is deferred; the eval infrastructure must validate Phase 0 and Phase 0.5 outputs at zero marginal cost.

Rationale: Standard LLM evaluation relies on either (a) benchmark suites (HumanEval, MMLU) that measure general capability, or (b) LLM-as-judge approaches that are non-deterministic, expensive, and unsuitable for gating CI/CD pipelines. For Craftsman Ultra, the critical shipping question is not "does it score well on benchmarks?" but "does it route correctly, cite honestly, and refuse when uncertain?" These behavioral properties are testable with deterministic, cheap-to-run gate checks that compare model outputs against known ground-truth traces.

The three gates correspond to the three failure modes that would make the system untrustworthy regardless of benchmark scores:

Misrouting — wrong experts selected, producing semantically wrong outputs from correct-seeming completions
Hallucinated citations — model cites evidence that does not exist or does not support the claim
Over/under-refusal — model refuses answerable questions or confidently answers indeterminate ones

Gate 1 — Routing Correctness

Run the FP16 teacher model once on the 200-prompt evaluation suite to record ground-truth routing traces: which experts are selected, with what softmax weights, per token per layer. Then run the ternary student model on the same prompts and compare routing decisions.

Parameter	Value
Metric	`routing_agreement = count(same_topk_experts) / total_token_layer_pairs`
Comparison	Per-token, per-layer: do the top-K selected expert indices match between teacher and student?
Pass threshold	>= 0.85 (85% of token-layer pairs route to the same expert set as the teacher)
Fail action	Trigger targeted router repair via `ContrastiveTrainer` (AD-19, AD-20) with triplets generated from the misrouted token positions

Teacher traces are recorded once and cached as JSONL. The ternary model is evaluated against these cached traces on every pipeline run. Agreement is measured at the expert-set level (order-invariant): if teacher selects experts {2, 5} and student selects {5, 2}, this counts as agreement.
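
The order-invariant comparison can be sketched as follows. The function names and the `(teacher, student)` tuple representation are assumptions for illustration; the real gate would read these from the cached JSONL traces.

```rust
/// True when both models picked the same expert set, ignoring order.
fn same_expert_set(teacher: &[u32], student: &[u32]) -> bool {
    let mut t = teacher.to_vec();
    let mut s = student.to_vec();
    t.sort_unstable();
    s.sort_unstable();
    t == s
}

/// Fraction of records whose teacher and student expert sets agree.
/// Each record is one (teacher_expert_ids, student_expert_ids) pair.
fn routing_agreement(records: &[(Vec<u32>, Vec<u32>)]) -> f64 {
    if records.is_empty() {
        return 0.0;
    }
    let agree = records
        .iter()
        .filter(|(t, s)| same_expert_set(t, s))
        .count();
    agree as f64 / records.len() as f64
}

/// Gate 1 passes at 0.85 agreement or higher.
fn gate1_passes(records: &[(Vec<u32>, Vec<u32>)]) -> bool {
    routing_agreement(records) >= 0.85
}
```

Sorting copies of the ID lists keeps the check independent of the order in which the router emits its top-K selections.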

Gate 2 — Citation Correctness

For retrieval-augmented responses, verify that citations are grounded in the actual retrieval corpus. This gate requires a labeled subset of the 200-prompt suite where prompts include retrieval context with known chunk IDs.

Parameter	Value
Metric (precision)	`citation_precision = valid_citations / total_citations`
Metric (recall)	`citation_recall = cited_evidence / relevant_evidence` (from labeled prompts)
Validity check	For each cited `chunk_id`: (1) chunk exists in retrieval corpus, (2) cited span is an exact substring match OR Jaccard similarity between cited span and chunk content > 0.6
Pass threshold	Precision >= 0.90, Recall >= 0.70
Fail action	Trigger retrieval-first policy training via `GrpoOptimizer` (GRPO reward penalizes hallucinated citations, rewards grounded ones)

Jaccard similarity is computed at the word level: |intersection(words_cited, words_chunk)| / |union(words_cited, words_chunk)|. This catches paraphrased citations while rejecting fabricated ones. The 0.6 threshold was chosen to allow minor rephrasing while catching wholesale fabrication.
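
A minimal sketch of the validity check, assuming whitespace tokenization and case-insensitive comparison (neither is specified above). The function names are hypothetical, and `chunk` is `None` when the cited `chunk_id` is missing from the corpus.

```rust
use std::collections::HashSet;

/// Lowercased whitespace-separated words of `text`.
fn word_set(text: &str) -> HashSet<String> {
    text.split_whitespace().map(|w| w.to_lowercase()).collect()
}

/// Word-level Jaccard similarity: |intersection| / |union|.
fn jaccard(cited: &str, chunk: &str) -> f64 {
    let a = word_set(cited);
    let b = word_set(chunk);
    let union = a.union(&b).count();
    if union == 0 {
        return 0.0;
    }
    a.intersection(&b).count() as f64 / union as f64
}

/// A citation is valid when the chunk exists and the span is either an
/// exact substring of the chunk or has Jaccard similarity above 0.6.
fn citation_valid(span: &str, chunk: Option<&str>) -> bool {
    match chunk {
        None => false,
        Some(content) => content.contains(span) || jaccard(span, content) > 0.6,
    }
}

/// citation_precision = valid_citations / total_citations.
fn citation_precision(valid: usize, total: usize) -> f64 {
    if total == 0 { 0.0 } else { valid as f64 / total as f64 }
}
```

Because Jaccard is computed against the whole chunk, a short paraphrase of a long chunk scores low; the exact-substring branch covers short verbatim quotes.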

Gate 3 — Refusal Calibration

Test the model's ability to refuse when evidence is insufficient and answer when evidence is adequate. Uses the auto-labeled prompt suite (see below) where each prompt is classified as resolved, contested, or indeterminate.

Parameter	Value
Metric	`refusal_f1 = harmonic_mean(refusal_precision, refusal_recall)`
Refusal detection	Output contains a refusal signal (configurable string set, e.g., "I cannot determine", "insufficient evidence", "I'm not sure", or a structured `<refusal>` tag)
Must-refuse rate	Model must refuse >= 80% of `indeterminate` prompts
Must-answer rate	Model must answer (not refuse) >= 95% of `resolved` prompts
Pass threshold	Refusal F1 >= 0.85
Fail action	Adjust refusal threshold in controller policy, or retrain controller via `GrpoOptimizer` with refusal-aware reward signal

contested prompts (sources actively contradict) are evaluated separately and not gated — they are tracked for monitoring but the correct behavior (refuse vs. present both sides) is domain-dependent.
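
The F1 computation can be sketched as below, treating "refused" as the positive class. That choice, the `RefusalStats` struct, and the `(should_refuse, did_refuse)` tuple input are assumptions for illustration.

```rust
/// Precision, recall, and F1 with "refused" as the positive class.
#[derive(Debug, Clone, Copy)]
struct RefusalStats {
    precision: f64,
    recall: f64,
    f1: f64,
}

/// Each outcome is (should_refuse, did_refuse) for one gated prompt.
fn refusal_stats(outcomes: &[(bool, bool)]) -> RefusalStats {
    let count = |want: (bool, bool)| outcomes.iter().filter(|&&o| o == want).count() as f64;
    let true_pos = count((true, true)); // refused when it should
    let false_pos = count((false, true)); // refused an answerable prompt
    let false_neg = count((true, false)); // answered an indeterminate prompt

    let precision = if true_pos + false_pos > 0.0 {
        true_pos / (true_pos + false_pos)
    } else {
        0.0
    };
    let recall = if true_pos + false_neg > 0.0 {
        true_pos / (true_pos + false_neg)
    } else {
        0.0
    };
    // Harmonic mean of precision and recall.
    let f1 = if precision + recall > 0.0 {
        2.0 * precision * recall / (precision + recall)
    } else {
        0.0
    };
    RefusalStats { precision, recall, f1 }
}
```

Under this framing, refusal recall over `indeterminate` prompts is the must-refuse rate, while over-refusal on `resolved` prompts lowers precision.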

Trace Schema (JSONL format)

Every evaluation run produces a JSONL trace file where each line records per-token, per-layer routing decisions alongside response-level citation and refusal assessments:

{
  "prompt_id": "p-001",
  "token_idx": 42,
  "layer_idx": 3,
  "routing": {
    "topk_expert_ids": [2, 5],
    "topk_weights": [0.62, 0.38],
    "teacher_expert_ids": [2, 5],
    "teacher_weights": [0.65, 0.35],
    "agreement": true
  },
  "citations": [
    {"chunk_id": "doc-17-p3", "span": "exact quoted text", "valid": true}
  ],
  "refusal": {
    "should_refuse": false,
    "did_refuse": false,
    "correct": true
  },
  "coherence_score": 0.91,
  "stop_reason": "eos"
}

Schema notes:

routing is emitted per-token per-layer (one record per token-layer pair)
citations and refusal are emitted once per response (attached to the final token record, stop_reason != null)
coherence_score is the cosine similarity between student and teacher hidden states at the final layer — a cheap proxy for output quality without LLM-judge
Trace files are stored in eval/traces/ (never in the project root) and named {model_version}_{prompt_suite}_{timestamp}.jsonl

Auto-Labeling Strategy

The 200-prompt evaluation suite is labeled without manual annotation by using RuVector retrieval signals as proxy ground truth:

Label	Condition	Meaning	Gate Usage
`resolved`	Evidence redundancy > 3 (multiple independent sources agree on the answer)	The question is clearly answerable from the corpus	Gate 3: model must answer (not refuse)
`contested`	Cluster disagreement > 0.4 (sources actively contradict each other)	The question has conflicting evidence	Monitored only (not gated)
`indeterminate`	Mincut fragility > 0.7 (removing a single source breaks the entire evidence chain)	The question cannot be reliably answered	Gate 3: model must refuse

These labels also feed Gate 2:

resolved prompts provide the relevant_evidence denominator for citation recall (all supporting chunks should be cited)
indeterminate prompts should produce no citations (any citation on an indeterminate prompt is likely hallucinated)

Auto-labeling is deterministic given a fixed retrieval corpus and runs on CPU via existing RuVector HNSW search. Labels are stored alongside prompts in the evaluation suite and versioned with the corpus.
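
A minimal sketch of the labeling rule. The table does not say which label wins when several conditions hold, so the precedence here (fragility first, then disagreement, then redundancy) is an assumption, as are the struct and field names and the `None` result for prompts that match no condition.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Label {
    Resolved,
    Contested,
    Indeterminate,
}

/// Retrieval signals for one prompt (field names are illustrative).
struct RetrievalSignals {
    evidence_redundancy: u32,
    cluster_disagreement: f32,
    mincut_fragility: f32,
}

/// Classify a prompt from its retrieval signals.
/// Returns None when no condition from the table applies.
fn auto_label(s: &RetrievalSignals) -> Option<Label> {
    if s.mincut_fragility > 0.7 {
        Some(Label::Indeterminate)
    } else if s.cluster_disagreement > 0.4 {
        Some(Label::Contested)
    } else if s.evidence_redundancy > 3 {
        Some(Label::Resolved)
    } else {
        None
    }
}
```

Checking the most conservative label first means a fragile evidence chain is never labeled `resolved` just because several sources happen to agree.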

Go/No-Go Rule

All three gates must pass on the same evaluation suite run for the system to ship:

SHIP = (routing_agreement >= 0.85)
     AND (citation_precision >= 0.90)
     AND (citation_recall >= 0.70)
     AND (refusal_f1 >= 0.85)

If any gate fails, the system cannot ship. The remediation path is gate-specific:

Failed Gate	Remediation	Component	Estimated Duration
Routing Correctness	Router repair via `ContrastiveTrainer` with misrouted-token triplets	`training/contrastive.rs`	1-4 hours
Citation Correctness	Retrieval-first policy training via `GrpoOptimizer` (reward grounded citations)	`training/grpo.rs`	2-8 hours
Refusal Calibration	Adjust refusal threshold or retrain controller policy via `GrpoOptimizer`	`training/grpo.rs` + controller config	1-2 hours

Re-evaluation after remediation must re-run all three gates (not just the failed one) to confirm no regression.
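
The rule and its remediation mapping can be sketched as a pure function over the four metrics. `GateMetrics`, `Gate`, and the function names are hypothetical; the thresholds and remediation text come from the tables above.

```rust
/// Metrics from a single evaluation suite run.
struct GateMetrics {
    routing_agreement: f64,
    citation_precision: f64,
    citation_recall: f64,
    refusal_f1: f64,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Gate {
    RoutingCorrectness,
    CitationCorrectness,
    RefusalCalibration,
}

/// A NaN metric compares false here, so it fails its gate.
fn meets(value: f64, threshold: f64) -> bool {
    value >= threshold
}

/// Gates that failed on this run, in gate order.
fn failed_gates(m: &GateMetrics) -> Vec<Gate> {
    let mut failed = Vec::new();
    if !meets(m.routing_agreement, 0.85) {
        failed.push(Gate::RoutingCorrectness);
    }
    if !(meets(m.citation_precision, 0.90) && meets(m.citation_recall, 0.70)) {
        failed.push(Gate::CitationCorrectness);
    }
    if !meets(m.refusal_f1, 0.85) {
        failed.push(Gate::RefusalCalibration);
    }
    failed
}

/// SHIP only when every gate passes on the same run.
fn can_ship(m: &GateMetrics) -> bool {
    failed_gates(m).is_empty()
}

/// Remediation path for a failed gate.
fn remediation(gate: Gate) -> &'static str {
    match gate {
        Gate::RoutingCorrectness => "router repair via ContrastiveTrainer",
        Gate::CitationCorrectness => "retrieval-first policy training via GrpoOptimizer",
        Gate::RefusalCalibration => "adjust refusal threshold or retrain controller via GrpoOptimizer",
    }
}
```

Because `failed_gates` always evaluates all three gates, a re-run after remediation naturally surfaces regressions in gates that previously passed.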

Implementation location:

Component	Path	Lines	Notes
Gate runner orchestrator	`crates/ruvllm/src/eval/gates.rs`	~300	New module; runs all three gates, produces trace JSONL
Routing trace recorder	`crates/ruvllm/src/eval/routing_trace.rs`	~150	Records teacher routing decisions; compares against student
Citation validator	`crates/ruvllm/src/eval/citation_check.rs`	~200	Substring match + Jaccard similarity; corpus lookup
Refusal detector	`crates/ruvllm/src/eval/refusal_detect.rs`	~100	Configurable refusal signal set; F1 computation
Auto-labeler	`crates/ruvllm/src/eval/auto_label.rs`	~150	RuVector signal extraction; prompt classification
Trace schema types	`crates/ruvllm/src/eval/trace.rs`	~80	Serde-annotated structs matching the JSONL schema
Total new code		~980	All CPU-only, no external dependencies

Exit criteria:

Teacher routing traces recorded for full 200-prompt suite and cached as JSONL
Gate 1 (routing agreement) runs in < 30 minutes on Mac Studio for 200 prompts
Gate 2 (citation correctness) validates chunk_id existence and span grounding
Gate 3 (refusal calibration) correctly classifies refusal signals in model output
Auto-labeler produces resolved/contested/indeterminate labels from RuVector signals
All gates produce deterministic results (same inputs = same pass/fail, bit-exact)
Trace JSONL files are written to eval/traces/, never to project root
Go/No-Go rule enforced: all three gates must pass on same run
Failed gate triggers correct remediation path (ContrastiveTrainer or GrpoOptimizer)
Total eval suite runtime < 2 hours on Mac Studio (CPU-only)

AD-23: Phase-1 Distillation via External GPU Teacher Artifacts

Status: Accepted

Context: The Ultra 30B ternary MoE system prioritizes CPU-first inference, integrity-driven behavior, and low operational cost. Phase-1 performance goals focus on routing correctness after ternary quantization, citation-grounded answers, and calibrated refusal under thin or conflicting evidence. Full end-to-end GPU distillation of a 30B teacher is expensive, slow, and misaligned with the system's long-term architecture — where RuVector provides memory and structure, and the generator model is intentionally small and cheap. However, pure PTQ ternary conversion (Phase 0) introduces unacceptable degradation in MoE routing stability, answer fidelity on contested prompts, and refusal behavior calibration. We therefore require a limited refinement phase that recovers task-relevant behavior without committing to ongoing GPU dependence.

Decision: Phase-1 distillation SHALL be implemented as a one-time, external GPU artifact generation step, followed by local CPU-only refinement.

A full-precision FP16 teacher is executed once on a short-lived cloud GPU instance
The teacher produces behavioral artifacts, not trained weights
All refinement and training occurs locally on CPU using these artifacts
GPU infrastructure is not a runtime dependency

Scope of Teacher Artifacts (GPU job exports only):

Artifact	Content	Purpose
Routing Traces	Per token, per MoE layer: top-k expert indices + routing probabilities/margins	Preserve expert selection behavior post-quantization
Sparse Logits	Answer spans, refusal boundaries, contradiction disclosure points only	Guide LoRA residual correction and refusal calibration without full sequence distillation
Preference Labels	Per-prompt classification: resolved / contested / indeterminate	Train stop decisions and disclosure behavior

Artifacts SHALL be stored as immutable, versioned files and reused across refinement runs.

CPU-Only Refinement Strategy (using teacher artifacts):

Router Repair — Match student top-k routing to teacher traces; penalize expert churn and margin collapse
Low-Rank Residual Correction — Apply LoRA-style residuals to compensate ternary approximation error; enforce strict parameter budget
EWC++ Preservation — Prevent catastrophic drift outside repaired regions
Policy Optimization — Train RLM stop and retrieval behavior; optimize for citation correctness and calibrated refusal

No full expert weight updates are allowed in Phase-1.

Evaluation Gate: A checkpoint SHALL NOT be promoted unless it passes behavioral evaluation, not reconstruction metrics. Mandatory metrics:

Metric	Criterion	Gate
Routing correctness	Top-k overlap with teacher + margin correlation	Gate 1 (AD-22)
Citation correctness	Span hash verification + evidence support via RuVector	Gate 2 (AD-22)
Refusal calibration	Refuse on indeterminate, disclose on contested, pass on resolved	Gate 3 (AD-22)

compute_dequant_error is a sanity check only, not a promotion criterion.

Acceptance Criteria:

System passes the 200-prompt disagreement suite
Routing correctness meets Gate 1 threshold (>= 0.85)
Citation precision meets the Gate 2 precision threshold (>= 0.90)
Refusal behavior aligns with RuVector coherence signals (Gate 3 F1 >= 0.85)
Results remain stable under 10% corpus perturbation
GPU artifact generation completes in single cloud session (< 4 hours)
CPU refinement reproducible without GPU access

Alternatives Considered:

Alternative	Verdict	Reason
Full GPU distillation	Rejected	High cost, long iteration cycles, misalignment with CPU-first design
Pure PTQ without refinement	Rejected	Unacceptable routing instability, incorrect refusal behavior, citation degradation
Continuous GPU shadow training	Rejected	Operational complexity, long-term infrastructure lock-in

Consequences:

Positive: GPU cost is bounded and minimal; refinement is repeatable and auditable; CPU-first deployment remains intact; system behavior aligns with integrity goals; distillation artifacts are reusable
Negative: General language quality parity with FP16 teacher is not guaranteed; some PTQ loss may remain in non-critical behaviors; requires building custom evaluation infrastructure (addressed by AD-22)
Note: This ADR does not preclude a future Phase-2 distillation if product requirements shift toward general language parity. Phase-2 would be a separate decision.

AD-24: RLM-Style Recursive Sentence Transformer Embedder

Status: Accepted

Context: The Craftsman Ultra system uses RuVector for evidence retrieval, cluster analysis, contradiction detection, and mincut fragility scoring. Standard sentence transformers produce embeddings in a single forward pass — one chunk in, one vector out. This works for basic retrieval but fails at three critical boundaries:

Contradiction boundaries: Two chunks with opposing claims embed near each other because they share vocabulary, despite being semantically opposed
Domain drift: Embeddings trained on general corpora perform poorly when the corpus shifts to a specialized domain (legal, medical, code)
Context blindness: The embedding of a chunk is independent of its neighborhood, losing structural signals that RuVector already knows (entity links, claim chains, cluster membership)

A normal embedding pipeline cannot distinguish "Drug X cures condition Y" from "Drug X does NOT cure condition Y" — they embed almost identically. The system needs embeddings that reflect the structural position of a chunk within the evidence graph, not just its surface semantics.

Decision: Implement an RLM-style recursive embedder — not a new architecture, but an inference strategy that wraps any base sentence transformer in a short iterative loop that retrieves context, decomposes, re-embeds, and merges.

Core Loop (bounded to 2-3 iterations):

State: { text, intent, neighbors, candidate_embeddings, iteration, stop_reason }

1. Embed the base chunk                           → base_embedding
2. Retrieve k nearest neighbors from RuVector      → neighbors[]
3. Normalize/summarize chunk with neighbor context  → contextualized_text
4. Re-embed the normalized view                    → ctx_embedding
5. If contested (low-cut boundary), embed both     → cluster_a_emb, cluster_b_emb
   sides of the disagreement separately
6. Merge into final representation                 → final_embedding + metadata
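
The iterative part of the loop (steps 1-4 plus the stop policy) can be sketched as below. This is a simplified illustration: `embed` and `contextualize` are stand-in closures for the base sentence transformer and the RuVector-backed neighbor summarization, the contested branch (step 5) and the merge (step 6) are omitted, and all names are hypothetical.

```rust
#[allow(dead_code)]
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum StopReason {
    Converged,
    MaxIterations,
    Contested, // not produced by this simplified sketch
}

/// Cosine similarity; returns 0.0 if either vector has zero norm.
fn cosine(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let na = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let nb = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if na == 0.0 || nb == 0.0 { 0.0 } else { dot / (na * nb) }
}

/// Re-embed `text` with context until consecutive embeddings agree.
/// Returns (embedding, confidence, stop_reason), where confidence is the
/// cosine similarity between the last two iterations.
fn recursive_embed<E, C>(
    text: &str,
    embed: E,
    contextualize: C,
    max_iters: usize,
    threshold: f32,
) -> (Vec<f32>, f32, StopReason)
where
    E: Fn(&str) -> Vec<f32>,
    C: Fn(&str, usize) -> String,
{
    let mut current = embed(text); // step 1: base embedding
    let mut confidence = 0.0;
    for iteration in 0..max_iters {
        let view = contextualize(text, iteration); // steps 2-3: neighbor context
        let next = embed(&view); // step 4: re-embed the contextualized view
        confidence = cosine(&current, &next);
        current = next;
        if confidence >= threshold {
            return (current, confidence, StopReason::Converged);
        }
    }
    (current, confidence, StopReason::MaxIterations)
}
```

Bounding `max_iters` to 2 or 3 keeps the worst-case cost at a small fixed multiple of a single-pass embedding.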

Output Schema:

Field	Type	Description
`embedding`	`Vec<f32>`	Final merged embedding vector
`confidence`	`f32`	Embedding stability across iterations (cosine similarity between iteration N and N-1)
`evidence_neighbor_ids`	`Vec<String>`	RuVector chunk IDs used as context
`contradiction_flags`	`Vec<bool>`	Per-neighbor: true if neighbor is in opposing cluster
`cluster_id`	`Option<usize>`	Primary cluster assignment
`stop_reason`	`StopReason`	Why the loop terminated: `Converged`, `MaxIterations`, `Contested`

Three Embedding Variants:

Variant	Conditioning	Use Case	Output
A: Query-Conditioned	Query text + neighborhood	Retrieval under a specific query	Embedding optimized for that query's intent
B: Corpus-Conditioned	Stable neighbors + entity graph	Corpus indexing	Embedding stable over time, less sensitive to local phrasing
C: Contradiction-Aware Twin	Both sides of a low-cut boundary	Disputed claims	Bimodal representation: one embedding per cluster side

Merge Rule (auditable linear combination, no black-box fusion):

final = normalize(w0 * base + w1 * ctx + w2 * anti)

Where anti is the embedding of the strongest counter-cluster neighbor set. Weights can be fixed (w0=0.6, w1=0.3, w2=0.1) or learned with a small regression on the eval set.
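
A minimal sketch of the merge rule as written, using the fixed example weights. The function names are hypothetical, and the inputs are assumed to share one dimensionality.

```rust
/// Scale `v` to unit L2 norm in place; a zero vector is left unchanged.
fn l2_normalize(v: &mut [f32]) {
    let norm = v.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm > 0.0 {
        for x in v.iter_mut() {
            *x /= norm;
        }
    }
}

/// final = normalize(w0 * base + w1 * ctx + w2 * anti)
fn merge(base: &[f32], ctx: &[f32], anti: &[f32], w: [f32; 3]) -> Vec<f32> {
    assert!(base.len() == ctx.len() && ctx.len() == anti.len());
    let mut out: Vec<f32> = base
        .iter()
        .zip(ctx)
        .zip(anti)
        .map(|((b, c), a)| w[0] * b + w[1] * c + w[2] * a)
        .collect();
    l2_normalize(&mut out);
    out
}
```

For example, `merge(&[1.0, 0.0], &[0.0, 1.0], &[0.0, 0.0], [0.6, 0.3, 0.1])` yields a unit vector along (0.6, 0.3), approximately (0.894, 0.447). With only three scalar weights, the contribution of each component can be read directly from the result.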

Training Strategy (minimal, no full model training):

Only three components are trainable:

Merge weights (w0, w1, w2) — 3 parameters, learned via grid search or small regression
Stop policy — when to terminate the loop (convergence threshold on cosine similarity between iterations)
Adapter layer — optional small linear layer on top of base embeddings for domain adaptation (rank-4 LoRA or single linear)

Evaluation Criteria:

Metric	Definition	Target
Top-k retrieval accuracy	Correct chunk in top-k results	Improvement over single-pass baseline
False neighbor rate	Contradicting chunks incorrectly ranked as similar	Reduction vs baseline
Cluster purity	Intra-cluster coherence after re-embedding	Improvement vs baseline
Contradiction separation	Cosine distance between opposing claim embeddings	> 0.3 (vs ~0.05 for single-pass)
Stability under perturbation	Embedding change when 10% of corpus is modified	< 0.05 cosine drift
Latency per embedding	Wall time including retrieval + re-embedding	< 50ms for 2 iterations on target hardware

Appliance Fit (CPU-first):

Small base embedder model (e.g., 22M-110M params)
2-3 passes maximum per chunk
RuVector supplies all context (no additional retrieval infrastructure)
Ternary quantization of the base embedder is possible (future AD)
Compatible with WASM deployment for browser-side embedding

Acceptance Criteria:

On a held-out corpus slice, RLM-style embedder improves top-k retrieval accuracy vs single-pass baseline
False neighbor matches near contradiction boundaries are reduced
Latency stays within budget (< 50ms for 2 iterations on target hardware)
Memory usage does not exceed appliance budget
Variant C produces measurably separated embeddings for known contradictions
Merge weights are interpretable and auditable (no black-box learned fusion)

Consequences

Positive

CPU-only deployment: 30B-class model running on commodity hardware without GPU
Energy efficiency: 55-82% reduction in inference energy vs FP16
Memory efficiency: ~8GB vs ~60GB for FP16 30B model (7.5x reduction)
Multiplication-free expert GEMM: Integer addition only in expert forward passes
SONA compatibility: MicroLoRA adaptation preserves per-session learning
GGUF ecosystem: Compatible with existing model distribution infrastructure
Incremental path: Phase 0 ($0) validates pipeline; Phase 0.5 ($0) adds RLM quality boost; Phase 1 ($1,300) delivers production quality; Phases 2-3 optimize
~70% RLM code reuse: GRPO, EWC++, ContrastiveTrainer, MemoryDistiller, PolicyStore are production-tested — only BitLinear layer and orchestrator are net-new
Adaptive distillation: GRPO reward scaling dynamically focuses compute on hard-to-distill experts
Cross-expert stability: EWC++ Fisher diagonal prevents catastrophic forgetting during sequential expert distillation
Learned quantization policies: PolicyStore persists per-layer ternary scale distributions for reproducible future distillation runs
Expert-parallel distillation: Independent expert FFNs enable rayon-parallel distillation across CPU cores
Phase 0 de-risks Phase 1 at zero cost: Mac Studio PTQ prototype validates entire inference pipeline (GGUF → dequant → kernel → MoE → generation) for $0 before committing $1,300+ to cloud GPU distillation
Existing GGUF ecosystem: Community-published GLM-4.7-Flash GGUFs (bartowski, unsloth) available as comparison baselines
Phase 0.5 RLM refinement at $0: Existing MicroLoRA + GRPO + EWC++ + ContrastiveTrainer stack provides ~10-15 percentage point quality recovery over raw PTQ with zero new training code, running entirely on Mac Studio
100% RLM reuse for Phase 0.5: No new training infrastructure needed — all 7 RLM components are production-tested and wire together directly
SIMD-only Phase 0.5: Entire RLM refinement pipeline runs on pure CPU SIMD (NEON on aarch64) without Metal GPU — only ~2-3x slower than Metal, extends platform support to Linux ARM64 and (with scalar fallback) x86
Zero-config SIMD mode: All training components (MicroLoRA, TrainingPipeline, EwcRegularizer, GrpoOptimizer) are already GPU-agnostic; only ContrastiveTrainer.use_metal = false needed for full SIMD-only execution
WASM browser deployment: Native Rust kernels compile to WASM SIMD128 via Cargo feature flags, enabling in-browser ternary inference at ~20-40 tok/s without server roundtrip
Single-source dual-target: One Rust codebase compiles to both native SIMD (NEON/AVX2/AVX512) and WASM SIMD128, eliminating the need for separate C++ and JS codebases
Safe Rust kernels: Following R3-Engine's approach, production kernels can be 100% Safe Rust (no unsafe in hot path), eliminating entire classes of memory safety bugs vs bitnet.cpp's C++
Existing Rust ecosystem: R3-Engine (Apache-compatible) and bitnet.rs (Apache 2.0) provide proven reference implementations to accelerate kernel development
Deterministic behavioral gates: Three non-LLM-judge evaluation gates (routing, citation, refusal) provide reproducible pass/fail shipping decisions without expensive API calls or non-deterministic judge models
Structured trace schema: JSONL trace format captures per-token routing, per-response citation, and refusal decisions in a single auditable artifact — enables regression detection across model versions
Zero-annotation auto-labeling: RuVector retrieval signals (evidence redundancy, cluster disagreement, mincut fragility) classify prompts as resolved/contested/indeterminate without human annotation effort
Gate-specific remediation: Each failed gate maps to a concrete repair action using existing RLM components (ContrastiveTrainer for routing, GrpoOptimizer for citations and refusal), avoiding manual debugging cycles
CPU-only evaluation: Full eval suite runs on Mac Studio in < 2 hours with no cloud GPU or external API dependency, keeping the evaluation loop at $0 marginal cost
Bounded GPU cost: Phase-1 distillation requires only a single short-lived cloud GPU session to generate behavioral artifacts (routing traces, sparse logits, preference labels) — no ongoing GPU dependency
Artifact reusability: Teacher artifacts are immutable and versioned; CPU refinement runs can be repeated, tuned, and audited without re-running the GPU job
Behavioral distillation: Distilling routing decisions and refusal signals rather than full logit sequences aligns training objectives with the system's integrity-first design goal
RLM-style embeddings: Recursive context-aware embeddings improve retrieval accuracy and contradiction separation without requiring a larger embedding model — inference strategy, not new architecture
Contradiction-aware twin embeddings: Variant C produces bimodal representations at low-cut boundaries, preserving disagreement structure in the embedding space for downstream decision-making
Minimal training surface: Only 3 merge weights + stop policy + optional adapter need training for the RLM embedder — no full model fine-tuning required

Negative

Training cost: Even distillation requires 800-1,600 A100-hours (~$2K-$5K cloud cost)
Custom kernels: Must implement and maintain platform-specific SIMD kernels in Rust
Quality gap: Phase 1 may be 5-10% below GLM-4.7-Flash on some benchmarks
No GPU acceleration: BitNet kernels are CPU-specific; GPU path requires separate optimization
Mixed-precision complexity: Router (FP16) + experts (ternary) + attention (FP16/ternary) adds dispatch complexity
WASM SIMD128 ceiling: Fixed 128-bit width limits throughput vs native AVX2 (256-bit) or AVX512 (512-bit); no popcount instruction requires emulation; single-threaded unless SharedArrayBuffer enabled — expect ~20-40 tok/s vs ~80-117 tok/s native
RLM scale gap: Existing RealContrastiveTrainer targets 0.5B models (embedding_dim=896); scaling to 30B requires distributed data loading and increased batch sizes
No x86 SIMD kernels: Current kernels/matmul.rs only implements NEON (aarch64); x86 falls back to scalar code (~3-5x slower than NEON). Adding AVX2/AVX512 kernels would make x86 SIMD-only mode competitive, but they are not yet implemented
Teacher trace dependency: Gate 1 requires a full FP16 teacher forward pass to generate ground-truth routing traces; this must be re-run whenever the evaluation suite changes or the teacher model is updated
Auto-label noise: RuVector-derived labels (evidence redundancy, mincut fragility) are proxies for true answerability; edge cases near thresholds (e.g., fragility = 0.69 vs 0.71) may produce inconsistent labels across corpus versions
200-prompt suite coverage: A fixed 200-prompt suite may not cover all failure modes; adversarial or distribution-shifted prompts could pass all gates yet fail in production
General quality ceiling: Phase-1 behavioral distillation intentionally does not target full language quality parity with FP16 teacher; non-critical behaviors may remain degraded
Teacher artifact staleness: If the evaluation prompt suite or teacher model changes, routing traces and preference labels must be regenerated on GPU

Risks

Risk	Likelihood	Impact	Mitigation
Phase 0 PTQ quality too low for meaningful testing	Medium	Low	Phase 0 is for kernel/pipeline validation, not quality; upgrade to 0D (BitDistill Lite) if needed
MoE routing degrades with ternary experts	Medium	High	Phase 0 detects routing issues early; Phase 1 validates routing; router stays FP16; AD-12 contrastive validation
bitnet.cpp kernel translation to Rust introduces bugs	Medium	Medium	Phase 0 PTQ model provides cheap test fixture; extensive kernel unit tests; validate against reference impl
Distillation fails to converge for MoE	Low	High	GRPO reward scaling + per-expert distillation fallback; EWC++ stability (AD-13)
GLM-4.7-Flash architecture changes break compatibility	Low	Medium	Pin to specific HF revision; architecture abstraction layer
IQ1_S GGUF format insufficient for absmean metadata	Medium	Low	Register custom GGUF type (BITNET_T158); backward-compatible extension
EWC++ Fisher accumulation OOM at 30B scale	Medium	Medium	Sparse Fisher (top-k diagonal entries); per-expert rather than global Fisher
GRPO reward signal too noisy for distillation	Low	Low	Fall back to static KD loss; GRPO reward as optional multiplier
RealContrastiveTrainer doesn't scale to 30B	Medium	Medium	Extract training loop; replace Candle Linear with BitLinear; keep optimizer/scheduler
Calibration data bias in Phase 0 PTQ	Low	Low	Use diverse calibration corpus (WikiText + code); measure variance across calibration sets
Auto-label thresholds misclassify edge-case prompts	Medium	Medium	Track label stability across corpus versions; flag prompts with signals near threshold boundaries for manual review
200-prompt suite insufficient for production coverage	Low	Medium	Expand suite iteratively as production failure modes are discovered; run gates on user-submitted adversarial prompts quarterly
Teacher routing traces become stale after model update	Low	Low	Re-record teacher traces as part of every model version bump; cache invalidation keyed on teacher model hash
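
The auto-label mitigation above (flagging prompts whose signals sit near a threshold) can be sketched as follows. This is a minimal illustration, not the RuVector implementation: the function names, the `(value, threshold)` tuple layout, and the 0.02 margin are hypothetical, and the 0.70 threshold is inferred only from the 0.69 vs 0.71 example in the Negative list.

```rust
/// Returns true when `value` lies within `margin` of `threshold`.
/// Hypothetical helper: the margin is an illustrative tuning knob.
pub fn near_threshold(value: f64, threshold: f64, margin: f64) -> bool {
    (value - threshold).abs() <= margin
}

/// A prompt is flagged for manual review if any of its labeling signals
/// (given as `(value, threshold)` pairs) sits close to its threshold.
pub fn needs_manual_review(signals: &[(f64, f64)], margin: f64) -> bool {
    signals
        .iter()
        .any(|&(value, threshold)| near_threshold(value, threshold, margin))
}
```

With a margin of 0.02 around an assumed fragility threshold of 0.70, both 0.69 and 0.71 would be flagged, so a label flip between corpus versions would go to review rather than silently changing the gate outcome.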

Validation Criteria

Phase 0 Exit Criteria

Absmean ternary quantizer produces valid {-1, 0, +1} weights from GLM-4.7-Flash FP16
Quantization runs successfully on Mac Studio via mmap (no cloud GPU required)
GGUF export with BITNET_T158 tensor type loads without error in BitNetBackend
TL1 NEON kernel produces non-zero, bounded output on PTQ ternary weights
MoE routing selects experts (not all-zero or all-same-expert degenerate routing)
End-to-end token generation produces coherent (if degraded) text
Memory usage measured and documented for real MoE activation patterns
Throughput measured: tok/s on Mac Studio (ARM NEON) and optionally x86 AVX2
Baseline quality benchmarks recorded (HumanEval, MMLU) as Phase 1 improvement target
Total Phase 0 cost = $0 (local Mac Studio execution)
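
As a rough illustration of the first exit criterion, the sketch below applies the absmean rule from the BitNet b1.58 paper (scale = mean of absolute values, then round and clip to {-1, 0, +1}) to a single slice of weights. The function names are hypothetical, and quantizing one flat slice at a time is an assumption; the actual quantizer may group weights per tensor or per block.

```rust
/// Absmean ternary quantization over one group of weights:
/// gamma = mean(|w|), q = clamp(round(w / (gamma + eps)), -1, 1).
/// Returns the ternary codes and the scale gamma needed to dequantize.
pub fn absmean_ternary(weights: &[f32]) -> (Vec<i8>, f32) {
    const EPS: f32 = 1e-6; // guards against division by zero for all-zero groups
    if weights.is_empty() {
        return (Vec::new(), 0.0);
    }
    let gamma = weights.iter().map(|w| w.abs()).sum::<f32>() / weights.len() as f32;
    let codes = weights
        .iter()
        .map(|w| (w / (gamma + EPS)).round().clamp(-1.0, 1.0) as i8)
        .collect();
    (codes, gamma)
}

/// Validity check matching the exit criterion: every code is -1, 0, or +1.
pub fn is_valid_ternary(codes: &[i8]) -> bool {
    codes.iter().all(|c| (-1..=1).contains(c))
}
```

For the weights `[0.9, -0.05, -1.2, 0.3]` the scale is 0.6125 and the codes are `[1, 0, -1, 0]`: small weights collapse to zero and large ones saturate at the clip boundary.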

Phase 0.5 Exit Criteria

MicroLoRA adapters (rank-2) attached to all expert FFN layers
Router fine-tuning via ContrastiveTrainer restores >=90% routing accuracy vs teacher
GRPO reward signal shows positive quality improvement over Phase 0 baseline
EWC++ prevents router fix from degrading already-correct routing paths (Fisher delta < 5%)
HumanEval pass@1 >= 45% (up from Phase 0 baseline of ~35-45%)
MicroLoRA + ternary inference produces coherent code completions
Training completes on Mac Studio within 14 days
MemoryDistiller has extracted KeyLessons identifying worst-degraded experts
PolicyStore contains optimized TernaryScale entries for all refined layers
Total Phase 0.5 cost = $0 (local Mac Studio execution)
GGUF re-exported with optimized router, scale factors, and LoRA adapter weights
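
The routing-accuracy criterion above could be measured along these lines. The sketch assumes one particular definition: a token counts as correctly routed when the student selects exactly the same set of experts as the teacher, ignoring order. The ADR does not pin this down here, so both the definition and the function name are assumptions.

```rust
use std::collections::BTreeSet;

/// Fraction of tokens whose student expert set equals the teacher's set.
/// `teacher[i]` and `student[i]` hold the expert indices chosen for token i.
/// Returns None when the traces are empty or differ in length.
pub fn routing_accuracy(teacher: &[Vec<usize>], student: &[Vec<usize>]) -> Option<f64> {
    if teacher.is_empty() || teacher.len() != student.len() {
        return None;
    }
    let matches = teacher
        .iter()
        .zip(student)
        .filter(|(t, s)| {
            // Compare as sets so that expert ordering does not matter.
            let t: BTreeSet<&usize> = t.iter().collect();
            let s: BTreeSet<&usize> = s.iter().collect();
            t == s
        })
        .count();
    Some(matches as f64 / teacher.len() as f64)
}
```

Under this definition the Phase 0.5 gate would pass when the returned value is at least 0.90. A softer metric, such as per-token overlap between the two sets, would give higher numbers for the same traces, so the chosen definition should be recorded alongside the result.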

Phase 1 Exit Criteria

BitNet backend loads GGUF with ternary expert weights
TL1 kernel produces bit-exact output vs reference float implementation
Decode speed >= 5 tok/s on x86_64 AVX2 (AMD Ryzen 7 / Intel i7 class)
HumanEval pass@1 >= 50% (GLM-4.7-Flash baseline: ~65%)
Memory usage < 10GB for 4K context inference
GRPO-guided expert distillation converges (loss < 0.5 for all experts)
EWC++ prevents cross-expert interference (Fisher-regularized loss delta < 5%)
Contrastive router validation: >= 95% expert routing accuracy vs teacher
PolicyStore contains TernaryScale entries for all distilled expert layers
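
One way to set up the bit-exactness check is to compare an add/subtract-only ternary GEMV against a plain float GEMV over the same weights. The sketch below is a scalar reference only; it is not the TL1 table-lookup kernel, and the function names and row-major layout are assumptions. For finite inputs and identical accumulation order, the two paths should produce identical bits, since multiplying by -1, 0, or +1 introduces no rounding.

```rust
/// Ternary GEMV using only adds and subtracts.
/// `w` is row-major with values in {-1, 0, +1}; `scale` is the per-tensor absmean scale.
pub fn ternary_gemv(w: &[i8], rows: usize, cols: usize, x: &[f32], scale: f32) -> Vec<f32> {
    assert_eq!(w.len(), rows * cols);
    assert_eq!(x.len(), cols);
    (0..rows)
        .map(|r| {
            let mut acc = 0.0f32;
            for (wi, xi) in w[r * cols..(r + 1) * cols].iter().zip(x) {
                match *wi {
                    1 => acc += *xi,
                    -1 => acc -= *xi,
                    _ => {} // zero weights contribute nothing
                }
            }
            acc * scale
        })
        .collect()
}

/// Float reference: the same computation with explicit multiplications.
pub fn float_gemv(w: &[i8], rows: usize, cols: usize, x: &[f32], scale: f32) -> Vec<f32> {
    assert_eq!(w.len(), rows * cols);
    assert_eq!(x.len(), cols);
    (0..rows)
        .map(|r| {
            let mut acc = 0.0f32;
            for (wi, xi) in w[r * cols..(r + 1) * cols].iter().zip(x) {
                acc += (*wi as f32) * *xi;
            }
            acc * scale
        })
        .collect()
}

/// Bit-level comparison, stricter than an epsilon test.
pub fn bit_exact(a: &[f32], b: &[f32]) -> bool {
    a.len() == b.len() && a.iter().zip(b).all(|(p, q)| p.to_bits() == q.to_bits())
}
```

A SIMD kernel that reorders the accumulation (for example, summing lanes in parallel) may differ from this reference in the last bits, so the test would need to fix an accumulation order or accumulate in integers for the comparison to be meaningful.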

Phase 2 Exit Criteria

Full ternary model (attention + experts) running on CPU
Decode speed >= 8 tok/s on x86_64 AVX2
SWE-bench Verified >= 52% (about 88% of GLM-4.7-Flash's 59.2%)
SONA MicroLoRA adaptation functional on ternary base
MemoryDistiller has extracted >= 50 KeyLessons from distillation trajectories
GRPO adaptive KL stabilizes below kl_target (0.02) for all experts
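
The KL criterion above leaves "stabilizes" open to interpretation. A minimal sketch, assuming it means that the most recent `window` KL measurements for every expert are all below kl_target, could look like this. The window length and the function names are hypothetical and are not taken from the GrpoOptimizer implementation.

```rust
/// True when the last `window` KL measurements are all strictly below `kl_target`.
/// Returns false if there are fewer than `window` measurements or `window` is zero.
pub fn kl_stabilized(kl_history: &[f32], kl_target: f32, window: usize) -> bool {
    window > 0
        && kl_history.len() >= window
        && kl_history[kl_history.len() - window..]
            .iter()
            .all(|&kl| kl < kl_target)
}

/// Applies the check to every expert's KL history; an empty set never passes.
pub fn all_experts_stabilized(per_expert: &[Vec<f32>], kl_target: f32, window: usize) -> bool {
    !per_expert.is_empty()
        && per_expert
            .iter()
            .all(|history| kl_stabilized(history, kl_target, window))
}
```

With kl_target = 0.02 and a window of 3, a history ending in 0.019, 0.015, 0.012 passes, while widening the window to include an earlier 0.03 fails it.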

Phase 3 Exit Criteria

Native-trained model matches or exceeds GLM-4.7-Flash benchmarks
Published on HuggingFace (ruv/craftsman-ultra-30b-1bit)
GGUF + bitnet kernel distributed via npm/packages/ruvllm
Full distillation pipeline reproducible from PolicyStore policies (no manual tuning)

References

Ma, S. et al., "The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits" (arXiv:2402.17764, Feb 2024)
Ma, S. et al., "BitNet b1.58 2B4T Technical Report" (arXiv:2504.12285, Apr 2025)
Microsoft Research, "bitnet.cpp: Efficient Edge Inference for Ternary LLMs" (arXiv:2502.11880, Feb 2025)
Microsoft, bitnet.cpp — https://github.com/microsoft/BitNet
Zhipu AI, GLM-4.7-Flash — https://huggingface.co/zai-org/GLM-4.7-Flash
Zhipu AI, "GLM-4.7: Advancing the Coding Capability" — https://z.ai/blog/glm-4.7
RuvLLM ADR-002: RuvLLM Integration with Ruvector
RuvLLM GGUF Quantization Module: crates/ruvllm/src/gguf/quantization.rs
Microsoft, bitnet-b1.58-2B-4T-gguf — https://huggingface.co/microsoft/bitnet-b1.58-2B-4T-gguf
RuvLLM GRPO Implementation: crates/ruvllm/src/training/grpo.rs
RuvLLM RealContrastiveTrainer: crates/ruvllm/src/training/real_trainer.rs
RuvLLM EWC++ Training Pipeline: crates/ruvllm/src/lora/training.rs
RuvLLM Memory Distillation: crates/ruvllm/src/reasoning_bank/distillation.rs
RuvLLM Policy Store: crates/ruvllm/src/policy_store.rs
RuvLLM Contrastive Training: crates/ruvllm/src/training/contrastive.rs
PT-BitNet: "Scaling up the 1-Bit large language model with post-training quantization" (2025) — https://www.sciencedirect.com/science/article/abs/pii/S089360802500735X
BitDistill: "BitNet Distillation" (arXiv:2510.13998, Oct 2025) — https://arxiv.org/html/2510.13998v1
bartowski, GLM-4.7-Flash-GGUF quantizations — https://huggingface.co/bartowski/zai-org_GLM-4.7-Flash-GGUF
unsloth, GLM-4.7-Flash-GGUF dynamic quantizations — https://huggingface.co/unsloth/GLM-4.7-Flash-GGUF
llama.cpp IQ1_S blind testing (Discussion #5962) — https://github.com/ggml-org/llama.cpp/discussions/5962
STBLLM: "Breaking the 1-bit Barrier" (ICLR 2025) — https://proceedings.iclr.cc/paper_files/paper/2025/file/ff997469ac66cf893c4183efeb22212a-Paper-Conference.pdf
Apple Mac Studio Technical Specifications (2025) — https://www.apple.com/mac-studio/specs/
RuvLLM Metal GEMV integration: crates/ruvllm/src/kernels/matmul.rs:1444-1582
RuvLLM MicroLoRA NEON SIMD forward: crates/ruvllm/src/lora/micro_lora.rs:279-390 (forward_simd, forward_simd_neon_impl)
RuvLLM NEON SIMD kernels: crates/ruvllm/src/kernels/ (matmul: gemm_neon/gemv_neon, activations: silu_neon/gelu_neon/relu_neon, norm: rms_norm_neon, rope: apply_rope_neon)
RuvLLM ContrastiveTrainer CPU fallback: crates/ruvllm/src/training/contrastive.rs:171-175 (Metal → CPU fallback) and contrastive.rs:475 (non-Candle pure CPU path)
R3-Engine: Pure Rust BitNet inference engine with WASM SIMD128 — https://github.com/r3-engine/r3-engine
bitnet.rs: Pure Rust BitNet toolkit (Apache 2.0) — https://github.com/ocentra/bitnet.rs
WASM SIMD128 specification: Fixed-width 128-bit SIMD for WebAssembly — https://v8.dev/features/simd
Rust core::arch::wasm32 SIMD intrinsics — https://doc.rust-lang.org/beta/core/arch/wasm32/index.html
"The state of SIMD in Rust in 2025" (Sergey Davidoff) — https://shnatsel.medium.com/the-state-of-simd-in-rust-in-2025-32c263e5f53d
"Rust + WebAssembly 2025: WasmGC and SIMD" — https://dev.to/dataformathub/rust-webassembly-2025-why-wasmgc-and-simd-change-everything-3ldh
Bai, Y. et al., "Constitutional AI: Harmlessness from AI Feedback" (arXiv:2212.08073, Dec 2022) — https://arxiv.org/abs/2212.08073
Zheng, L. et al., "Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena" (arXiv:2306.05685, Jun 2023) — https://arxiv.org/abs/2306.05685
Rafailov, R. et al., "Direct Preference Optimization: Your Language Model is Secretly a Reward Model" (arXiv:2305.18290, May 2023) — https://arxiv.org/abs/2305.18290
Min, S. et al., "FActScore: Fine-grained Atomic Evaluation of Factual Precision in Long Form Text Generation" (arXiv:2305.14251, May 2023) — https://arxiv.org/abs/2305.14251
RuvLLM BitNet Backend: crates/ruvllm/src/bitnet/backend.rs (MoE routing, TL1 GEMV, forward pass)
RuvLLM RLM Refiner: crates/ruvllm/src/bitnet/rlm_refiner.rs (Phase 0.5 refinement orchestrator)
