ADR-017: Craftsman Ultra 30b 1bit — BitNet Integration with RuvLLM

Status: Proposed
Date: 2026-02-03
Decision Makers: Ruvector Architecture Team
Technical Area: 1-Bit LLM Inference / MoE Architecture / CPU-Native Serving


Context and Problem Statement

Large language models require substantial GPU resources for inference, limiting deployment to cloud environments and specialized hardware. Recent advances in 1-bit quantization — specifically Microsoft Research's BitNet b1.58 — demonstrate that ternary-weight models ({-1, 0, +1}) can match full-precision performance at 3B+ parameters while enabling CPU-only inference at human-readable speeds.

Concurrently, Zhipu AI's GLM-4.7-Flash introduces a 30B-A3B Mixture-of-Experts architecture that activates only ~3B parameters per token while storing 30B total knowledge, achieving strong coding and agentic benchmarks (SWE-bench Verified: 59.2%, LiveCodeBench v6: 64.0%) with 200K context.

Craftsman Ultra 30b 1bit is a proposed model that combines these two paradigms: a 30B-A3B MoE architecture with native BitNet b1.58 ternary quantization, purpose-built for CPU inference within the RuvLLM serving runtime. This ADR evaluates the integration path, architectural decisions, and trade-offs.

Strategic Goal

Deliver a 30B-class coding/agentic model that runs entirely on consumer CPUs (no GPU required) at 5-15 tokens/second decode, with memory footprint under 8GB, integrated into the RuvLLM + Ruvector ecosystem with SONA self-learning capabilities.


Decision Drivers

Performance Requirements

| Metric | Target | Rationale |
|---|---|---|
| Decode throughput (CPU) | 5-15 tok/s | Human-readable speed per BitNet 100B benchmarks |
| Prefill latency (1K tokens) | <2s | Interactive coding assistant responsiveness |
| Memory footprint (model) | <8 GB | Fits in 16GB system RAM with OS + KV cache |
| Memory footprint (KV cache, 4K ctx) | <2 GB | Q8 KV cache for 4096-token context |
| Active parameter GEMM | Addition-only | BitNet eliminates multiplication in W×A |
| Energy per inference | <0.05J | BitNet CPU efficiency benchmarks |

Architecture Requirements

  • MoE routing must remain full-precision: Expert selection requires accurate gating scores
  • Expert weights are ternary: Each expert's linear layers use BitLinear (W1.58A8)
  • Activations quantized to INT8: Per-token absmax scaling
  • Shared layers (embeddings, LM head) remain FP16: Critical for quality preservation
  • GGUF-compatible: Must serialize to/load from GGUF v3 format with custom metadata

Ecosystem Requirements

  • Integrate with RuvLLM's existing backend abstraction (backends/mod.rs)
  • Leverage existing GGUF parser (gguf/parser.rs, gguf/quantization.rs)
  • Support SONA learning loops for per-session adaptation
  • Compatible with Claude Flow agent routing for task delegation
  • NAPI bindings for Node.js consumption via npm/packages/ruvllm

Research Summary

BitNet b1.58 Architecture

Source: Microsoft Research, "The Era of 1-bit LLMs" (Feb 2024), bitnet.cpp (Oct 2024)

BitNet b1.58 replaces standard nn.Linear with BitLinear layers:

Forward Pass:
  1. W_ternary = RoundClip(W / (gamma + epsilon), -1, 1)
     where gamma = mean(|W|) (absmean quantization)
  2. X_int8 = Quant(X, absmax)  (per-token 8-bit activation)
  3. Y = W_ternary @ X_int8      (integer addition only, no multiplication)
  4. Y_float = Dequant(Y)         (rescale to float)

Key properties:

  • Weights: ternary {-1, 0, +1} → 1.58 bits per parameter
  • Activations: INT8 per-token (absmax scaling)
  • Matrix multiply becomes addition and subtraction only (no FP multiply)
  • Zero weights enable feature filtering (sparse activation within dense layers)
  • Best quality requires training from scratch or distillation; naive post-training quantization to ~1 bit severely degrades quality
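The forward pass and properties above can be sketched for a single output element as follows. This is a minimal illustration and not the bitnet.cpp or RuvLLM implementation: the function names, the epsilon value, and the use of one scale per weight slice are assumptions made for the example.

```rust
/// Absmean ternary quantization: returns weights in {-1, 0, +1} and the scale gamma.
fn quantize_weights_absmean(w: &[f32]) -> (Vec<i8>, f32) {
    let eps = 1e-6_f32; // assumed epsilon, only guards against division by zero
    let gamma = w.iter().map(|x| x.abs()).sum::<f32>() / w.len() as f32;
    let q: Vec<i8> = w
        .iter()
        .map(|&x| (x / (gamma + eps)).round().clamp(-1.0, 1.0) as i8)
        .collect();
    (q, gamma)
}

/// Per-token absmax INT8 quantization: returns int8 activations and the dequant scale.
fn quantize_activations_absmax(x: &[f32]) -> (Vec<i8>, f32) {
    let absmax = x.iter().fold(0.0_f32, |m, v| m.max(v.abs())).max(1e-6);
    let q: Vec<i8> = x
        .iter()
        .map(|&v| (v * 127.0 / absmax).round().clamp(-128.0, 127.0) as i8)
        .collect();
    (q, absmax / 127.0)
}

/// One output element: integer add/subtract only, then a single float rescale.
fn bitlinear_dot(w_t: &[i8], gamma: f32, x_q: &[i8], x_scale: f32) -> f32 {
    let mut acc: i32 = 0;
    for (&w, &x) in w_t.iter().zip(x_q) {
        match w {
            1 => acc += x as i32,
            -1 => acc -= x as i32,
            _ => {} // zero weight: the feature is filtered out
        }
    }
    acc as f32 * gamma * x_scale
}
```

The only floating-point multiplications are the two rescales at the end; the inner loop touches integers only, which is the property the ternary kernels exploit.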

Inference kernels (bitnet.cpp):

| Kernel | Method | Compression | Best For |
|---|---|---|---|
| I2_S | 2-bit pack, unpack-and-multiply | 2 bits/weight | Bandwidth-limited |
| TL1 | 2-weight → 4-bit LUT index | 2 bits/weight | Balanced CPU |
| TL2 | 3-weight → 5-bit LUT index | 1.67 bits/weight | Memory-limited |

CPU performance (bitnet.cpp benchmarks):

| Platform | Speedup vs FP16 | Energy Reduction |
|---|---|---|
| ARM (NEON) | 1.37x-5.07x | 55-70% |
| x86 (AVX2) | 2.37x-6.17x | 72-82% |
| x86 (AVX512) | ~6x+ | ~85% |

GLM-4.7-Flash Architecture

Source: Zhipu AI / Z.AI (Jan 2026)

| Property | Value |
|---|---|
| Total parameters | ~30B (31B reported) |
| Active parameters | ~3B (A3B) |
| Architecture | Mixture of Experts (MoE) |
| Shared layers | ~2B parameters |
| Expert layers | ~28B (distributed across experts) |
| Context window | 200K tokens (MLA-based) |
| Training data | 15T general + 7T reasoning/code tokens |
| Attention | Multi-head Latent Attention (MLA) with QK-Norm |
| Activation | SwiGLU |
| Position encoding | RoPE |
| Speculative decoding | Multi-Token Prediction (MTP) layer |
| Reasoning | Interleaved + Retention-Based + Round-Level |

Benchmark performance:

| Benchmark | Score |
|---|---|
| AIME 25 | 91.6% |
| GPQA | 75.2% |
| SWE-bench Verified | 59.2% |
| LiveCodeBench v6 | 64.0% |
| HLE | 14.4% |
| tau2-Bench | 79.5% |

RuvLLM Current Capabilities (Relevant)

  • GGUF v3 parser: Full format support including IQ1_S (1.56 bits/weight, type 19)
  • Quantization pipeline: Q4_K_M, Q5_K_M, Q8_0, F16 (no native ternary training)
  • Backends: Candle (Metal/CUDA), mistral-rs (PagedAttention), CoreML (ANE)
  • No CPU-optimized ternary kernel: Current backends target GPU acceleration
  • SIMD kernels: Existing NEON/SSE4.1/AVX2 infrastructure in crates/ruvllm/src/kernels/
  • MicroLoRA: Rank 1-2 adapters with <1ms adaptation (compatible with BitNet)
  • SONA: Three-tier learning (instant/background/deep) — can drive ternary adapter training

RuvLLM RLM Training Stack (Reusable for Distillation)

RuvLLM contains a mature reinforcement-learning-from-model-feedback (RLM) training stack that directly accelerates Craftsman Ultra distillation. These components are production-tested and reduce net-new code by ~70%.

GRPO — Group Relative Policy Optimization (training/grpo.rs, 897 lines)

  • Critic-free RL: computes relative advantages within sample groups
  • Adaptive KL divergence penalty (kl_target, clip_range) controls teacher-student divergence
  • PPO-style clipping prevents catastrophic updates
  • Preset configs: GrpoConfig::stable() (safe distillation), GrpoConfig::for_tool_use() (expert routing)
  • Thread-safe batch processing via RwLock<VecDeque<SampleGroup>>
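The two core GRPO computations listed above, group-relative advantages and PPO-style clipping, can be sketched in a few lines. This is an illustrative sketch and not the code in training/grpo.rs; the function names and the population-variance normalization are assumptions.

```rust
/// Critic-free advantage: each reward is normalized against its own sample group.
/// Expects a non-empty group.
fn group_relative_advantages(rewards: &[f64]) -> Vec<f64> {
    let n = rewards.len() as f64;
    let mean = rewards.iter().sum::<f64>() / n;
    let var = rewards.iter().map(|r| (r - mean).powi(2)).sum::<f64>() / n;
    let std = var.sqrt().max(1e-8); // avoid division by zero when all rewards match
    rewards.iter().map(|r| (r - mean) / std).collect()
}

/// PPO-style clipped surrogate for one sample.
/// `ratio` is new_policy_prob / old_policy_prob for the sampled action.
fn clipped_objective(ratio: f64, advantage: f64, clip_range: f64) -> f64 {
    let clipped = ratio.clamp(1.0 - clip_range, 1.0 + clip_range);
    (ratio * advantage).min(clipped * advantage)
}
```

Taking the minimum of the clipped and unclipped terms means a large policy ratio cannot increase the objective beyond the clip range, which is what limits catastrophic updates.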

RealContrastiveTrainer (training/real_trainer.rs, 1000 lines)

  • Candle-based training loop with GGUF model loading and GGUF weight export
  • Combined loss: Triplet (margin) + InfoNCE (contrastive) + GRPO reward scaling
  • AdamW optimizer with gradient clipping, LR warmup, checkpointing
  • GrpoEvaluator computes per-prediction rewards (1.0 correct, -0.5 wrong)
  • Metal/CUDA acceleration via Candle device dispatch

MicroLoRA + EWC++ Training Pipeline (lora/training.rs, 798 lines)

  • Single-example gradient computation (batch_size=1 for real-time)
  • EWC++ regularizer: λ/2 * Σ F_i * (w_i - w*_i)² prevents catastrophic forgetting
  • Fisher diagonal tracking with exponential decay (fisher_decay: 0.999)
  • 7 learning rate schedules (Cosine, OneCycle, Step, etc.)
  • Async adaptation with buffered gradient accumulation
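The EWC++ regularizer formula above translates directly into code. The sketch below is not the lora/training.rs implementation; in particular, the exponential-moving-average form of the Fisher update is an assumption about how fisher_decay is applied.

```rust
/// EWC penalty: (lambda / 2) * sum_i F_i * (w_i - w*_i)^2
fn ewc_penalty(lambda: f64, fisher: &[f64], w: &[f64], w_star: &[f64]) -> f64 {
    let sum: f64 = fisher
        .iter()
        .zip(w.iter().zip(w_star))
        .map(|(f, (wi, ws))| f * (wi - ws).powi(2))
        .sum();
    0.5 * lambda * sum
}

/// Fisher diagonal with exponential decay (assumed form):
/// F_i <- decay * F_i + (1 - decay) * g_i^2
fn update_fisher(fisher: &mut [f64], grad: &[f64], decay: f64) {
    for (f, g) in fisher.iter_mut().zip(grad) {
        *f = decay * *f + (1.0 - decay) * g * g;
    }
}
```

Parameters with a large Fisher value are pulled back strongly toward their anchor w*, while parameters the earlier task did not depend on remain free to move.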

Memory Distillation (reasoning_bank/distillation.rs, 856 lines)

  • Compresses trajectories to KeyLesson objects with semantic embeddings
  • Smart extraction: explicit lessons, implicit patterns, error patterns, recovery patterns
  • Semantic deduplication (Jaccard + cosine similarity, threshold 0.85)
  • Quality-gated: only trajectories above min_quality_threshold are preserved

Policy Store (policy_store.rs, 474 lines)

  • Ruvector-backed semantic policy persistence with HNSW indexing
  • Policy types: Quantization, Router, Ewc, Pattern
  • Per-layer QuantizationPolicy with precision, activation thresholds, quality-latency tradeoff
  • Policy source tracking: InstantLoop, BackgroundLoop, DeepLoop, Federated

Contrastive Training (training/contrastive.rs, 634 lines)

  • Two-stage: Triplet Loss (margin=0.5) + InfoNCE (temperature=0.07)
  • 13 agent types with 1,078 training triplets (578 base + 500 hard negatives)
  • Hard negative mining at 46.4% ratio (500 of 1,078 triplets; Claude-generated confusing pairs)
  • Proven 100% routing accuracy with hybrid keyword-first + embedding fallback
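The two loss terms named above can be sketched as follows. This is a hedged illustration and not the code in training/contrastive.rs: using cosine distance for the triplet term and cosine similarity for the InfoNCE logits is an assumption, and the function names are hypothetical.

```rust
fn cosine(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let na = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let nb = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    dot / (na * nb).max(1e-12)
}

/// Triplet margin loss on cosine distance: max(0, d(a, p) - d(a, n) + margin).
fn triplet_loss(anchor: &[f32], positive: &[f32], negative: &[f32], margin: f32) -> f32 {
    let d_pos = 1.0 - cosine(anchor, positive);
    let d_neg = 1.0 - cosine(anchor, negative);
    (d_pos - d_neg + margin).max(0.0)
}

/// InfoNCE: -log( exp(s_pos / t) / (exp(s_pos / t) + sum_j exp(s_neg_j / t)) ),
/// computed with a max-shifted log-sum-exp for numerical stability.
fn info_nce(anchor: &[f32], positive: &[f32], negatives: &[Vec<f32>], temperature: f32) -> f32 {
    let pos = cosine(anchor, positive) / temperature;
    let mut logits = vec![pos];
    logits.extend(negatives.iter().map(|n| cosine(anchor, n) / temperature));
    let max = logits.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let log_sum = max + logits.iter().map(|l| (l - max).exp()).sum::<f32>().ln();
    log_sum - pos
}
```

With margin 0.5 the triplet term is zero once the negative is at least 0.5 further from the anchor than the positive; the low temperature (0.07) makes InfoNCE focus on the hardest negatives.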

Considered Options

Option A: Post-Training Quantization of GLM-4.7-Flash (PTQ Tiers)

Take the existing BF16 GLM-4.7-Flash weights and quantize to low-bit formats without full distillation training.

Critical distinction — IQ1_S ≠ BitNet b1.58:

| Property | GGUF IQ1_S | BitNet b1.58 |
|---|---|---|
| Encoding | Codebook-based importance quantization | Ternary {-1, 0, +1} via absmean |
| Bits/weight | 1.56 bpw | 1.58 bpw |
| Inference | Dequantize → FP multiply | Integer addition only (no multiply) |
| Speed benefit | Memory bandwidth only | Bandwidth + compute (multiplication-free) |
| How obtained | Post-training quantization | Trained from scratch or distilled |
| Quality at 7B | Near-random / broken outputs | Matches FP16 |

Existing GLM-4.7-Flash GGUF quantizations available (community-published):

| Repository | Lowest Quant | Size | Notes |
|---|---|---|---|
| bartowski/zai-org_GLM-4.7-Flash-GGUF | IQ2_XXS (2.06 bpw) | 7.62 GB | No IQ1_S published |
| unsloth/GLM-4.7-Flash-GGUF | UD-Q2_K_XL (2.7 bpw dynamic) | ~11 GB | Dynamic quant, recommended |
| ngxson/GLM-4.7-Flash-GGUF | Q4_K_M (4.5 bpw) | 18.1 GB | 55 variants available |

No IQ1_S quantization has been published for GLM-4.7-Flash by any community quantizer — this itself is a signal (too aggressive for practical use).

Sub-options ranked by increasing effort:

Sub-option 0A: Download existing IQ2_XXS GGUF

  • Download bartowski's IQ2_XXS at 7.62 GB
  • Cost: $0, time: 5 minutes (just download)
  • Quality: ~75-80% of FP16 (2.06 bpw is usable per community reports)
  • NOT 1-bit, NOT BitNet — just aggressive 2-bit compression
  • RuvLLM gap: IQ2_XXS dequantization not implemented (falls to error catch-all in quantization.rs:358)
  • RuvLLM Q2_K dequantization IS implemented and works

Sub-option 0B: Quantize to IQ1_S via llama.cpp

  • Run llama-quantize GLM-4.7-Flash-F16.gguf IQ1_S with importance matrix
  • Cost: $0, time: ~30 minutes on CPU
  • Quality: SEVERE degradation — blind testing shows IQ1_S is "broken rather than just bad" on 7B; outputs contain garbled text despite acceptable perplexity scores. 30B MoE may survive better due to parameter redundancy, but expert routing is highly sensitive to weight perturbation
  • RuvLLM gap: IQ1_S dequantization not implemented (quantization.rs:358 catch-all)
  • Does NOT achieve BitNet multiplication-free inference

Sub-option 0C: PT-BitNet ternary PTQ (per PT-BitNet paper)

  • Apply absmean ternary quantization (BitNet's native method) to pre-trained weights with calibration data
  • Cost: $0 (runs locally on Mac Studio via mmap + Metal; 1-4 hours wall time)
  • Alternative: ~$50-200 on cloud GPU if no local Apple Silicon hardware
  • Quality: ~55-65% downstream accuracy (PT-BitNet reports 61% on 70B; GLM-4.7-Flash's 30B-A3B may differ)
  • THIS IS proper BitNet ternary format → enables multiplication-free inference with AD-4 kernels
  • Requires implementing absmean ternary quantizer (~200-300 lines of new code)
  • Requires calibration dataset (WikiText-2 or similar, ~1M tokens)
  • Mac Studio M4 Max 64GB+ or M3 Ultra 96GB+ recommended (see AD-18)

Sub-option 0D: BitDistill Lite (10B tokens) (per BitDistill paper)

  • 3-stage: SubLN insertion → 10B-token continued pre-training → KL + attention distillation
  • Cost: ~$200-500 (8× GPU hours on Mi300X/A100 class)
  • Quality: ~90-95% of FP16 (BitDistill reports 88.17% vs 88.01% FP16 on MNLI at 0.6B)
  • Near-full quality recovery with only 10B tokens (vs 200B+ for Phase 1 full distillation)
  • Requires SubLN module insertion + distillation fine-tuning loop
  • Bridges gap between pure PTQ and full expert distillation (Phase 1)

Summary comparison:

| Sub-option | Cost | Time | Quality (est.) | BitNet Speedup | RuvLLM Ready |
|---|---|---|---|---|---|
| 0A: IQ2_XXS download | $0 | 5 min | ~75-80% | No | No (missing dequant) |
| 0B: IQ1_S quantize | $0 | 30 min | ~40-50% | No | No (missing dequant) |
| 0C: PT-BitNet PTQ | $0 (Mac Studio) | 1-4 hrs | ~55-65% | Yes | Needs quantizer impl |
| 0D: BitDistill Lite | $0 local / ~$300 cloud | 2-4 wks / 1-2 days | ~90-95% | Yes | Needs SubLN + KD loop |

Pros (of PTQ approach generally):

  • Immediate or near-immediate results ($0-$300, minutes to days)
  • No large-scale training infrastructure
  • Validates inference pipeline and kernels before investing in full distillation
  • Sub-option 0C produces genuine BitNet ternary format for kernel development

Cons:

  • Sub-options 0A/0B: Quality too degraded for production coding tasks
  • Sub-options 0A/0B: No BitNet multiplication-free inference (still dequant-then-multiply)
  • Sub-option 0C: Significant quality loss (~35-45%) vs teacher — adequate for kernel validation, not production
  • Sub-option 0D: Requires non-trivial training code (SubLN, KD loss) but much less than full Phase 1
  • IQ1_S blind test results: statistically indistinguishable from random on smaller models

Verdict: Recommended as Phase 0 rapid prototype. Sub-option 0C (PT-BitNet PTQ) is the optimal entry point: $0 on local Apple Silicon (or ~$50-200 on a cloud GPU), 1-4 hours of wall time, and it produces genuine BitNet ternary format for kernel development and inference validation. Sub-option 0D (BitDistill Lite) bridges to Phase 1 if higher quality is needed before committing to full expert distillation. Sub-options 0A/0B are useful only as baselines for comparison.

Option B: Native BitNet Training of GLM-4.7-Flash Architecture (Full)

Train Craftsman Ultra 30b 1bit from scratch using BitNet b1.58 methodology on the GLM-4.7-Flash MoE architecture.

Approach:

  1. Implement BitLinear layers for all expert MLPs and attention projections
  2. Keep MoE router, embeddings, and LM head in FP16
  3. Train on 4T+ tokens with ternary weight updates via straight-through estimator
  4. Export to custom GGUF with ternary tensor metadata

Pros:

  • Maximum quality — matches FP16 at 3B+ active parameter scale
  • True multiplication-free inference for expert forward passes
  • Full TL1/TL2 kernel optimization possible
  • Scientifically validated approach (BitNet b1.58 2B4T results)

Cons:

  • Massive training compute: estimated 4,000-8,000 A100-hours for 4T tokens
  • Requires custom training framework (BitNet + MoE + MLA integration)
  • 6-12 month timeline for training pipeline + training run
  • No pre-existing GLM-4.7-class BitNet training recipe

Verdict: Recommended long-term — Highest quality but requires significant investment.

Option C: Hybrid Approach — BitNet Distillation from GLM-4.7-Flash (RLM-Accelerated)

Use knowledge distillation to transfer GLM-4.7-Flash capabilities into a BitNet architecture, reducing training cost by 5-10x. Leverages the existing RLM training stack to eliminate ~70% of net-new training code.

Approach:

  1. Initialize Craftsman Ultra with GLM-4.7-Flash architecture (30B-A3B MoE)
  2. Replace all expert linear layers with BitLinear (ternary {-1, 0, +1})
  3. Keep router, embeddings, LM head in FP16
  4. Extend RealContrastiveTrainer with KD loss (KL div + hard-label CE) replacing triplet+InfoNCE
  5. Use GrpoOptimizer for per-expert quality rewards during distillation — each SampleGroup maps to one expert's teacher vs student outputs
  6. Apply EwcRegularizer across distillation phases to prevent early-trained experts from being overwritten
  7. Log distillation trajectories to MemoryDistiller for quality tracking and KeyLesson extraction
  8. Persist per-layer ternary policies via PolicyStore (quantization thresholds, scale distributions)
  9. Export to GGUF with ternary tensor metadata and TL1/TL2 kernel hints via existing GgufExportResult

RLM Component Reuse:

| Existing Component | Reuse | Adaptation Needed |
|---|---|---|
| RealContrastiveTrainer | Training loop, GGUF export, checkpointing | Replace triplet+InfoNCE with KD loss |
| GrpoOptimizer | Reward scaling, adaptive KL, PPO clipping | Map SampleGroup to per-expert outputs |
| EwcRegularizer | Fisher diagonal, forgetting prevention | Apply across expert distillation phases |
| MemoryDistiller | Trajectory compression, lesson extraction | Map Verdict to teacher-student quality delta |
| PolicyStore | Semantic policy persistence | Add PolicyType::TernaryScale for per-block absmean tracking |
| ContrastiveTrainer | Hard negative mining framework | Reuse for expert-routing contrastive pre-training |

Pros:

  • 5-10x less compute than training from scratch (~800-1,600 A100-hours)
  • ~70% existing code reuse — only BitLinear forward/backward and MoE data loading are net-new
  • Leverages GLM-4.7-Flash's proven architecture and routing
  • GRPO's adaptive KL prevents ternary student from diverging too far from teacher
  • EWC++ ensures sequential expert distillation doesn't corrupt earlier experts
  • Teacher model provides strong supervision signal for ternary convergence
  • Can incrementally improve with more distillation tokens
  • PolicyStore enables learned per-layer quantization decisions
  • Distillation quality tracked end-to-end via MemoryDistiller trajectory logging

Cons:

  • Slight quality gap vs native training (estimated 2-5% on benchmarks)
  • RealContrastiveTrainer embedding_dim (896) must scale to GLM-4.7-Flash hidden_size
  • Teacher inference cost during distillation
  • Distillation may not perfectly transfer MoE routing behavior

Verdict: Recommended near-term — Best balance of quality, cost, and timeline. RLM reuse eliminates the "custom framework" risk.

Option D: BitNet Expert Replacement (Incremental, RLM-Accelerated)

Keep GLM-4.7-Flash structure but replace only the expert MLP layers with BitLinear, leaving attention in FP16. Reuses existing RLM stack for the entire distillation loop.

Approach:

  1. Load GLM-4.7-Flash architecture
  2. Replace expert FFN layers (gate_proj, up_proj, down_proj) with BitLinear
  3. Keep attention (Q/K/V/O projections) in FP16
  4. Use RealContrastiveTrainer + GrpoOptimizer for expert-only distillation (~200B tokens)
  5. Apply EwcRegularizer to prevent expert N+1 distillation from corrupting expert N
  6. Attention weights loaded directly from GLM-4.7-Flash (no distillation needed)
  7. Use contrastive pre-training to validate MoE routing still selects correct experts after ternary conversion

Pros:

  • Fastest path to working model
  • Attention quality preserved exactly
  • Expert FFN is 60-70% of active parameters — gets most BitNet benefits
  • Simpler distillation (only FFN layers)
  • Lower memory: ~5.5 GB for ternary experts + FP16 attention
  • Minimal net-new code: BitLinear layer + GGUF ternary type only; training loop is 100% reused

Cons:

  • Attention layers still require FP multiply (not fully multiplication-free)
  • Mixed-precision inference path complexity
  • ~40% of compute still in FP16 attention

Verdict: Recommended as Phase 1 — Enables rapid prototyping and validation. RLM reuse makes this achievable with only ~30% new code.


Decision

Phased approach: A(0C) → RLM Refinement → D → C → B

Phase 0: PTQ Rapid Prototype (Option A, Sub-option 0C)

  • Timeline: 1-2 weeks
  • Cost: $0 (runs entirely on Mac Studio locally)
  • Platform: Mac Studio (M4 Max 64GB+ or M3 Ultra 96GB+)
  • Goal: Produce a genuine BitNet ternary GGUF of GLM-4.7-Flash for kernel development, inference pipeline validation, and baseline quality measurement
  • Deliverables:
    • PT-BitNet ternary quantized GLM-4.7-Flash GGUF file (~6-7 GB)
    • Absmean ternary quantizer implementation (~200-300 lines)
    • IQ1_S / BITNET_T158 dequantization kernel in RuvLLM
    • Baseline quality benchmarks (HumanEval, MMLU) to compare against Phase 1+
    • Functional TL1 kernel validated against ternary model
  • Expected quality: ~55-65% of GLM-4.7-Flash (adequate for kernel validation, not production)
  • Key value: De-risks Phase 1 by validating the entire inference pipeline (GGUF loading → ternary dequant → TL1 kernel → MoE routing → token generation) at zero cost before committing to $1,300+ distillation training
  • Why Mac Studio works: Phase 0 is PTQ (no training loop) — just load FP16 weights via mmap, compute absmean per block, round to ternary, export. The absmean computation is trivial math; the bottleneck is memory bandwidth, not compute. Calibration forward pass uses Metal GPU acceleration via existing Candle integration.
  • Optional upgrade (0D): If 0C quality is too low for meaningful testing, apply BitDistill Lite (10B tokens, ~$300 cloud or ~$0 on Mac Studio over several weeks) to reach ~90-95% quality

Phase 0.5: RLM Post-Quantization Refinement (NEW — Mac Studio, $0)

  • Timeline: 1-3 weeks (overlaps with Phase 0 kernel development)
  • Cost: $0 (runs on Mac Studio, ~2-12 days training wall time with Metal; ~4-24 days SIMD-only)
  • Platform: Mac Studio (same as Phase 0) — supports both Metal GPU and pure SIMD/CPU modes (see AD-20)
  • Goal: Improve Phase 0 PTQ quality from ~55-65% to ~70-80% by training only the small FP16 components using the existing RLM stack — no traditional distillation, no cloud GPU
  • Approach: Freeze ternary weights, train FP16 corrections using RLM components:
    1. MicroLoRA adapters (rank 1-2) on each expert FFN — adds small FP16 correction: Y = BitLinear(X) + LoRA_B @ LoRA_A @ X
    2. Router fine-tuning via ContrastiveTrainer — corrects misrouting caused by PTQ weight changes
    3. Scale factor optimization via GRPO rewards — per-block FP16 absmean scales are differentiable
    4. EWC++ regularization — prevents router fix from breaking already-good routing paths
    5. Quality tracking via MemoryDistiller — identifies worst-degraded experts for focused training
    6. Policy persistence via PolicyStore — stores optimized per-layer configurations
  • Trainable parameters: ~200-400M (1-2% of 30B total) — router (~30M), MicroLoRA adapters (~50-100M), LM head (~150M), scale factors (~0.1M)
  • Training data: 100M-500M tokens (sufficient for <400M trainable params)
  • Throughput: ~500-1000 tok/s (Metal) or ~200-500 tok/s (NEON SIMD only); at these rates, processing 100M-500M tokens takes roughly 2-12 days (Metal) or 4-24 days (SIMD-only) on Mac Studio
  • Deliverables:
    • RLM-refined GGUF with ternary experts + optimized FP16 components
    • MicroLoRA adapter weights (exportable, ~20-100 MB)
    • Optimized router weights and scale factors
    • Quality benchmarks showing improvement over Phase 0 baseline
  • Expected quality: ~70-80% of GLM-4.7-Flash (up from ~55-65% Phase 0 PTQ)
  • Key value: Gets a usable model on Mac Studio at $0 before committing to cloud GPU. If 70-80% quality is sufficient for the use case, Phase 1 cloud distillation may be deferred or skipped entirely.
  • 100% RLM code reuse: MicroLoRA, TrainingPipeline, EwcRegularizer, GrpoOptimizer, ContrastiveTrainer, MemoryDistiller, PolicyStore — all production-tested, zero new training code needed
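The wall-time estimates for this phase follow from dividing the token budget by sustained throughput. The helper below is a hypothetical worked example. Reading the quoted 2-12 day and 4-24 day ranges as sustained rates of about 500 tok/s (Metal) and about 250 tok/s (SIMD-only) is an interpretation of the figures, not something stated in this ADR.

```rust
/// Wall-clock days needed to process `tokens` at a sustained `tok_per_sec`.
fn training_days(tokens: f64, tok_per_sec: f64) -> f64 {
    const SECONDS_PER_DAY: f64 = 86_400.0;
    tokens / tok_per_sec / SECONDS_PER_DAY
}
```

For example, `training_days(100e6, 500.0)` is about 2.3 days and `training_days(500e6, 500.0)` about 11.6 days; at 250 tok/s the same budgets take about 4.6 and 23.1 days.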

Phase 1: BitNet Expert Replacement (Option D)

  • Timeline: 3-4 months
  • Cost: ~$1,300-$2,000 (4× A100 spot, ~46 days)
  • Goal: Full-quality ternary experts via distillation, validated against Phase 0/0.5 baselines
  • Deliverables: Working Craftsman Ultra 30b 1bit (mixed: ternary experts, FP16 attention)
  • Expected quality: ~90-95% of GLM-4.7-Flash on coding benchmarks
  • Prerequisites: Phase 0 validates inference pipeline; Phase 0.5 provides quality baseline

Phase 2: Full BitNet Distillation (Option C)

  • Timeline: 4-6 months after Phase 1
  • Cost: ~$2,500-$5,000 (4× H100, 16-32 days)
  • Goal: Full ternary model with complete BitNet inference optimization
  • Deliverables: Craftsman Ultra 30b 1bit v2 (full ternary except router/embed/head)
  • Expected quality: ~95-98% of GLM-4.7-Flash

Phase 3: Native BitNet Training (Option B)

  • Timeline: 6-12 months after Phase 2, contingent on funding/compute
  • Cost: ~$15,000-$30,000 (8× H100 cluster, 90-180 days)
  • Goal: Surpass GLM-4.7-Flash quality with native ternary training
  • Deliverables: Craftsman Ultra 30b 1bit v3 (trained from scratch)
  • Expected quality: 100%+ of GLM-4.7-Flash (assumes native ternary training matches or exceeds FP16 at this scale, as the BitNet b1.58 results suggest)

Architectural Decisions

AD-1: Ternary Weight Representation

Decision: Use BitNet b1.58 absmean quantization for weight ternary encoding.

W_ternary = RoundClip(W / (mean(|W|) + epsilon), -1, 1)

Each weight is one of {-1, 0, +1}, stored as 2-bit packed integers (I2_S format) in GGUF tensors. Per-block scale factor stored as FP16.

Storage format per block (256 elements):

  • 64 bytes for ternary weights (2 bits × 256)
  • 2 bytes for absmean scale (FP16)
  • Total: 66 bytes / 256 weights = 2.06 bits/weight
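The block layout above can be checked with a small packing sketch. The 2-bit code assignment (-1 → 0b00, 0 → 0b01, +1 → 0b10) and the little-endian placement within each byte are assumptions for illustration; the actual I2_S bit layout may differ.

```rust
const BLOCK: usize = 256;

/// Pack 256 ternary weights into 64 bytes, four 2-bit codes per byte.
fn pack_block(w: &[i8; BLOCK]) -> [u8; BLOCK / 4] {
    let mut out = [0u8; BLOCK / 4];
    for (i, &v) in w.iter().enumerate() {
        let code = (v + 1) as u8; // maps {-1, 0, +1} to {0, 1, 2}
        out[i / 4] |= code << ((i % 4) * 2);
    }
    out
}

/// Inverse of `pack_block`.
fn unpack_block(p: &[u8; BLOCK / 4]) -> [i8; BLOCK] {
    let mut out = [0i8; BLOCK];
    for i in 0..BLOCK {
        let code = (p[i / 4] >> ((i % 4) * 2)) & 0b11;
        out[i] = code as i8 - 1;
    }
    out
}

/// Effective storage cost: 64 packed bytes plus a 2-byte FP16 scale per block.
fn bits_per_weight() -> f64 {
    let bytes = BLOCK / 4 + 2;
    (bytes * 8) as f64 / BLOCK as f64
}
```

`bits_per_weight()` evaluates to 2.0625, which the text rounds to 2.06 bits/weight. The fourth 2-bit code (0b11) is unused, which is the slack that denser encodings such as TL2 recover.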

AD-2: MoE Router Precision

Decision: MoE gating/routing network remains in FP16.

Rationale: Expert selection requires high-precision softmax scores to maintain routing quality. Quantizing the router to ternary would collapse expert selection, effectively turning a 30B model into a random-expert 3B model. The router is <0.1% of total parameters.

Components kept in FP16:

  • Expert gating weights (router)
  • Token embedding table
  • LM head (output projection)
  • RoPE frequency table
  • LayerNorm/RMSNorm parameters

AD-3: Activation Quantization

Decision: INT8 per-token absmax quantization for activations flowing through BitLinear layers.

X_int8 = clamp(round(X * 127 / max(|X|)), -128, 127)

Rationale: Consistent with BitNet b1.58 specification. INT8 activations enable integer-only GEMM in expert forward passes. Attention activations remain in FP16/BF16 for KV cache compatibility.

AD-4: CPU Inference Kernel Strategy

Decision: Implement all three bitnet.cpp kernel types, with runtime selection based on hardware detection.

| Kernel | Target Hardware | Selection Criteria |
|---|---|---|
| I2_S | x86 AVX512, ARM SVE | Systems with wide SIMD and high bandwidth |
| TL1 | x86 AVX2, ARM NEON | General-purpose, balanced performance |
| TL2 | Memory-constrained | Systems with <16GB RAM or high cache pressure |

Implementation path: Adapt bitnet.cpp's kernel generation scripts (Python codegen) to produce Rust SIMD intrinsics compatible with RuvLLM's existing kernels/ module structure.

Key kernel operations:

  1. Pack ternary weights into 2-bit (I2_S) or LUT index (TL1: 4-bit, TL2: 5-bit)
  2. Generate lookup tables for activation sums at model load time
  3. Execute GEMM via table lookup + integer addition (no floating-point multiply)
  4. Accumulate in INT16 with pack-and-unpack technique (lossless, no quantization of partials)
  5. Dequantize output with per-block FP16 scale
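The lookup-table idea behind steps 1-3 can be illustrated with a scalar TL1-style sketch: two ternary weights form one of 9 combinations (hence a 4-bit index), and for each pair of activations the 9 possible partial sums are tabulated once and reused across every weight row. This is a simplified illustration rather than bitnet.cpp's SIMD kernel; the index encoding and function names are assumptions, and the sketch builds its tables from one activation vector.

```rust
/// Map a pair of ternary weights to one of 9 table slots (fits in a 4-bit index).
fn pair_index(w0: i8, w1: i8) -> u8 {
    ((w0 + 1) * 3 + (w1 + 1)) as u8
}

/// Contribution of one ternary weight: +a, -a, or nothing. No multiply needed.
fn signed(w: i8, a: i8) -> i16 {
    match w {
        1 => a as i16,
        -1 => -(a as i16),
        _ => 0,
    }
}

/// Encode a row of ternary weights (even length) as one index per weight pair.
fn encode_row(w: &[i8]) -> Vec<u8> {
    w.chunks_exact(2).map(|p| pair_index(p[0], p[1])).collect()
}

/// For each activation pair, precompute the partial sum for all 9 weight pairs.
fn build_luts(x: &[i8]) -> Vec<[i16; 9]> {
    x.chunks_exact(2)
        .map(|a| {
            let mut lut = [0i16; 9];
            for w0 in -1i8..=1 {
                for w1 in -1i8..=1 {
                    lut[pair_index(w0, w1) as usize] = signed(w0, a[0]) + signed(w1, a[1]);
                }
            }
            lut
        })
        .collect()
}

/// Dot product of one weight row with the activations: lookups and integer adds only.
fn lut_dot(row_idx: &[u8], luts: &[[i16; 9]]) -> i32 {
    row_idx.iter().zip(luts).map(|(&i, lut)| lut[i as usize] as i32).sum()
}
```

Because the tables depend only on the activations, their construction cost is amortized over all output rows of the weight matrix; each row then costs one lookup and one add per weight pair.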

AD-5: GGUF Tensor Format Extension

Decision: Extend RuvLLM's GGUF format with BitNet-specific metadata and, if the existing IQ1_S block layout proves incompatible, a new BITNET_T158 quantization type.

New GGUF metadata keys:

craftsman.bitnet.version = 1
craftsman.bitnet.weight_encoding = "absmean_ternary"
craftsman.bitnet.activation_bits = 8
craftsman.bitnet.router_precision = "f16"
craftsman.bitnet.kernel_hint = "tl1"  // preferred kernel
craftsman.moe.total_params = 30000000000
craftsman.moe.active_params = 3000000000
craftsman.moe.num_experts = <N>
craftsman.moe.active_experts = <K>

Tensor storage: Map to existing IQ1_S (type 19) for ternary expert weights, with additional metadata distinguishing post-training IQ1_S from native BitNet ternary. Alternatively, register a new type BITNET_T158 = 29 if the existing IQ1_S block format is incompatible with absmean-scale-per-block layout.

AD-6: RuvLLM Backend Integration

Decision: Create a new BitNetBackend alongside existing Candle and mistral-rs backends.

backends/
├── mod.rs                 // Backend trait + dispatch
├── candle_backend.rs      // GPU (Metal/CUDA)
├── mistral_backend.rs     // PagedAttention + ISQ
├── coreml_backend.rs      // Apple Neural Engine
└── bitnet_backend.rs      // NEW: CPU ternary inference

BitNetBackend responsibilities:

  1. Load GGUF with ternary tensor detection
  2. Initialize TL1/TL2/I2_S lookup tables per layer
  3. Execute MoE routing in FP16 → select active experts
  4. Run selected expert forward passes using ternary GEMM kernels
  5. Attention in FP16 (Phase 1) or ternary (Phase 2+)
  6. KV cache management (Q8 two-tier, existing infrastructure)

Backend trait compliance:

// Method bodies are elided in this ADR; todo!() keeps the sketch syntactically valid.
impl InferenceBackend for BitNetBackend {
    fn load_model(&mut self, path: &Path, config: ModelConfig) -> Result<()> { todo!() }
    fn generate(&self, prompt: &str, params: GenerateParams) -> Result<Response> { todo!() }
    fn get_embeddings(&self, text: &str) -> Result<Vec<f32>> { todo!() }
    fn supports_architecture(&self, arch: &str) -> bool { todo!() }
}

AD-7: MoE Forward Pass Pipeline

Decision: Split MoE forward pass into FP16 routing + ternary expert execution.

Input Token Embedding (FP16)
  │
  ▼
┌─────────────────────────────────────────┐
│ For each transformer layer:             │
│                                         │
│  1. RMSNorm (FP16)                      │
│  2. Self-Attention                      │
│     ├─ Q/K/V projection (Phase 1: FP16, │
│     │                    Phase 2: Ternary)│
│     ├─ RoPE (FP16)                      │
│     ├─ Scaled dot-product attention      │
│     └─ Output projection                │
│  3. RMSNorm (FP16)                      │
│  4. MoE Block:                          │
│     ├─ Router (FP16 gating network)     │
│     │   → Select top-K experts          │
│     ├─ Expert FFN (TERNARY BitLinear)   │
│     │   ├─ gate_proj: W_ternary @ X_int8│
│     │   ├─ up_proj:   W_ternary @ X_int8│
│     │   ├─ SwiGLU activation            │
│     │   └─ down_proj: W_ternary @ X_int8│
│     └─ Weighted sum of expert outputs   │
│  5. Residual connection                 │
└─────────────────────────────────────────┘
  │
  ▼
LM Head (FP16) → Logits → Token
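Step 4 of the pipeline (route, run the selected experts, take the weighted sum) can be sketched as below. Renormalizing the top-K gate weights so they sum to 1 is an assumption; this ADR does not specify GLM-4.7-Flash's exact gating normalization. For brevity the sketch takes precomputed outputs for every expert, whereas a real backend would run only the selected ones.

```rust
fn softmax(logits: &[f32]) -> Vec<f32> {
    let max = logits.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = logits.iter().map(|l| (l - max).exp()).collect();
    let sum: f32 = exps.iter().sum();
    exps.iter().map(|e| e / sum).collect()
}

/// Select the top-k experts by gate probability and renormalize their weights.
/// Returns (expert_index, weight) pairs in descending weight order.
fn route_top_k(router_logits: &[f32], k: usize) -> Vec<(usize, f32)> {
    let probs = softmax(router_logits);
    let mut idx: Vec<usize> = (0..probs.len()).collect();
    idx.sort_by(|&a, &b| probs[b].total_cmp(&probs[a]));
    idx.truncate(k);
    let total: f32 = idx.iter().map(|&i| probs[i]).sum();
    idx.into_iter().map(|i| (i, probs[i] / total)).collect()
}

/// Weighted sum of the selected experts' outputs.
fn combine_experts(selected: &[(usize, f32)], expert_outputs: &[Vec<f32>]) -> Vec<f32> {
    let dim = expert_outputs[0].len();
    let mut out = vec![0.0f32; dim];
    for &(e, w) in selected {
        for (o, v) in out.iter_mut().zip(&expert_outputs[e]) {
            *o += w * v;
        }
    }
    out
}
```

The routing math stays in floating point while each expert's FFN can run through the ternary kernels, which is the split this decision describes.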

AD-8: SONA Integration for Ternary Adaptation

Decision: MicroLoRA adapters applied as FP16 deltas on top of ternary base weights.

Rationale: Ternary weights cannot be directly fine-tuned at inference time (gradient updates don't map to {-1, 0, +1}). Instead, SONA's MicroLoRA applies rank-1 FP16 adapters whose output is added to the ternary forward pass output:

Y = BitLinear(X) + LoRA_B @ LoRA_A @ X

Where BitLinear(X) uses ternary GEMM and LoRA_B @ LoRA_A @ X is a small FP16 correction. This preserves BitNet's efficiency for 99%+ of computation while enabling per-session adaptation.
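A minimal sketch of the correction term, assuming the BitLinear output has already been computed and that the adapter matrices are stored row-major (LoRA_A as r × in_dim, LoRA_B as out_dim × r). The function name and layout are hypothetical and do not describe the MicroLoRA API.

```rust
/// y = bitlinear_out + LoRA_B @ (LoRA_A @ x)
fn add_lora_correction(
    bitlinear_out: &[f32],
    x: &[f32],
    lora_a: &[Vec<f32>], // r rows of length in_dim
    lora_b: &[Vec<f32>], // out_dim rows of length r
) -> Vec<f32> {
    // h = A x, a vector of length r (r = 1 for rank-1 adapters)
    let h: Vec<f32> = lora_a
        .iter()
        .map(|row| row.iter().zip(x).map(|(a, xi)| a * xi).sum::<f32>())
        .collect();
    // y_i = base_i + sum_j B[i][j] * h[j]
    bitlinear_out
        .iter()
        .zip(lora_b)
        .map(|(y, b_row)| y + b_row.iter().zip(&h).map(|(b, hj)| b * hj).sum::<f32>())
        .collect()
}
```

Evaluating A·x first keeps the FP16 cost at r × (in_dim + out_dim) multiply-adds instead of in_dim × out_dim, which is why the correction is a small fraction of the layer's work.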

AD-9: Memory Budget Analysis

Decision: Target <8GB model + 2GB KV cache = 10GB total for 4K context.

| Component | Precision | Size | Notes |
|---|---|---|---|
| Expert weights (28B params) | 1.58-bit | ~5.5 GB | 28B × 2.06 bits = ~7.2 GB raw, but only routing metadata for inactive experts |
| Shared layers (2B params) | FP16 | ~4 GB | Embeddings, LM head, router, norms |
| Expert routing tables | FP16 | ~50 MB | Gating network weights |
| TL1/TL2 lookup tables | INT16 | ~200 MB | Pre-computed at load time |
| KV cache (4K context) | Q8 | ~1.5 GB | Two-tier cache (hot FP16 + warm Q8) |
| MicroLoRA adapters | FP16 | ~10 MB | Rank-1, <1MB per target module |
| Total | | ~7.8 GB | Fits in 16GB system with headroom |

Note: The 28B ternary expert weights occupy ~7.2 GB on disk. At runtime, only active expert weights (~3B active) are in hot memory for any given token, with inactive expert pages memory-mapped and demand-loaded.

AD-10: Platform-Specific Kernel Dispatch

Decision: Runtime hardware detection drives kernel selection.

pub fn select_kernel(caps: &HardwareCaps) -> BitNetKernel {
    if caps.has_avx512() {
        BitNetKernel::I2S_AVX512
    } else if caps.has_avx2() {
        BitNetKernel::TL1_AVX2
    } else if caps.has_neon() {
        if caps.cache_size_l2 >= 2 * 1024 * 1024 {
            BitNetKernel::TL1_NEON
        } else {
            BitNetKernel::TL2_NEON  // memory-constrained
        }
    } else if caps.has_sse41() {
        BitNetKernel::TL1_SSE41
    } else {
        BitNetKernel::I2S_Scalar  // fallback
    }
}

Integration: Leverages RuvLLM's existing autodetect.rs hardware capability detection module.

AD-11: GRPO-Guided Distillation Loss

Decision: Use GrpoOptimizer to compute per-expert reward scaling during knowledge distillation, replacing a traditional fixed-weight KD loss.

Rationale: Standard KD uses a static alpha to blend KL divergence and hard-label cross-entropy. GRPO adds a dynamic reward signal that upweights expert-student pairs where ternary output closely matches the teacher, and downweights divergent pairs. This is achieved by mapping each expert's teacher-vs-student output comparison to a SampleGroup:

Combined Loss = KD_base * GRPO_scale
Where:
  KD_base  = α * KL(teacher_logits/T, student_logits/T)
           + (1-α) * CE(labels, student_logits)
  GRPO_scale = (1 + reward * 0.1)

  reward = GrpoEvaluator.evaluate(student_expert_output, teacher_expert_output)
         → 1.0 when cosine_sim > 0.95
         → -0.5 when cosine_sim < 0.7
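A minimal sketch of the reward mapping and loss scaling follows. It applies GRPO_scale multiplicatively, matching the `kd_loss * reward.scale()` line in the expert-parallel loop later in this ADR. The source only specifies the reward above 0.95 and below 0.7; the linear interpolation in between is an assumption made for illustration, and the function names are hypothetical.

```rust
/// Cosine similarity between two equally sized vectors (0.0 if either is all zeros).
pub fn cosine_sim(a: &[f32], b: &[f32]) -> f32 {
    assert_eq!(a.len(), b.len());
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let na = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let nb = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if na == 0.0 || nb == 0.0 { 0.0 } else { dot / (na * nb) }
}

/// Reward from teacher/student cosine similarity.
/// ASSUMPTION: values between the two documented thresholds are linearly
/// interpolated from -0.5 to 1.0; the ADR does not define this range.
pub fn reward_from_cosine(c: f32) -> f32 {
    if c > 0.95 {
        1.0
    } else if c < 0.7 {
        -0.5
    } else {
        -0.5 + (c - 0.7) / (0.95 - 0.7) * 1.5
    }
}

/// KD_base scaled by GRPO_scale = 1 + 0.1 * reward.
pub fn combined_loss(kd_base: f32, reward: f32) -> f32 {
    kd_base * (1.0 + reward * 0.1)
}
```

With these definitions the scale stays within [0.95, 1.10], so the reward nudges the base KD loss rather than dominating it.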

Key configuration (extending GrpoConfig::stable()):

GrpoConfig {
    group_size: num_experts,        // One group per MoE layer
    learning_rate: 1e-6,            // Conservative for distillation
    kl_coefficient: 0.1,            // Tight teacher adherence
    kl_target: 0.02,                // Low divergence target
    clip_range: 0.1,                // Narrow clipping for stability
    normalize_advantages: true,     // Normalize across experts in group
    adaptive_kl: true,              // Auto-adjust KL penalty
    ..GrpoConfig::stable()
}

Reused: GrpoOptimizer, GrpoConfig, SampleGroup, GrpoEvaluator from training/grpo.rs. New: BitNetGrpoAdapter that maps expert forward pass outputs to GrpoSample structs.

AD-12: Contrastive Pre-Training for Expert Routing Validation

Decision: After ternary conversion of expert weights, use the existing ContrastiveTrainer to verify that MoE routing still selects the correct experts.

Rationale: Replacing expert FFN weights with ternary approximations changes the output distribution of each expert. If expert N's ternary output becomes more similar to expert M's output, the router may misroute tokens. Contrastive pre-training on expert embeddings detects and corrects this.

Approach:

  1. For each token in a calibration set, record which expert the teacher model's router selects
  2. Generate TrainingTriplets: anchor = hidden state, positive = correct expert output, negative = wrong expert output
  3. Use existing hard negative mining to find expert pairs that become confusable after ternary conversion
  4. Fine-tune the FP16 router gating weights using contrastive loss to restore correct expert selection

Reused: ContrastiveTrainer, ContrastiveConfig, TrainingTriplet from training/contrastive.rs. New: ExpertTripletGenerator that produces triplets from MoE routing decisions.
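The triplet construction and hard-negative mining in the approach above can be sketched as follows. This covers only a margin-based triplet term with Euclidean distance; the actual loss in contrastive.rs (which also uses InfoNCE) may differ, and `ExpertTriplet` is a hypothetical stand-in for `TrainingTriplet`.

```rust
/// Hypothetical stand-in for the `TrainingTriplet` type named above.
pub struct ExpertTriplet {
    pub anchor: Vec<f32>,   // hidden state
    pub positive: Vec<f32>, // output of the expert the teacher router chose
    pub negative: Vec<f32>, // output of a wrong expert
}

fn euclidean(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| (x - y).powi(2)).sum::<f32>().sqrt()
}

/// Margin triplet loss: max(0, d(anchor, positive) - d(anchor, negative) + margin).
pub fn triplet_loss(t: &ExpertTriplet, margin: f32) -> f32 {
    (euclidean(&t.anchor, &t.positive) - euclidean(&t.anchor, &t.negative) + margin).max(0.0)
}

/// Hard negative mining: index of the wrong-expert output closest to the anchor,
/// i.e. the expert most likely to be confused with the correct one.
pub fn hardest_negative(anchor: &[f32], wrong_outputs: &[Vec<f32>]) -> Option<usize> {
    wrong_outputs
        .iter()
        .enumerate()
        .min_by(|a, b| euclidean(anchor, a.1).total_cmp(&euclidean(anchor, b.1)))
        .map(|(i, _)| i)
}
```

A loss of zero means the wrong expert is already at least `margin` farther from the anchor than the correct one, so that triplet contributes no gradient to the router.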

AD-13: EWC++ Cross-Expert Stability During Sequential Distillation

Decision: Apply EwcRegularizer from lora/training.rs during sequential expert distillation to prevent catastrophic forgetting across experts.

Rationale: Distilling 30B MoE experts sequentially (expert 0, then 1, ..., then N) risks overwriting shared representations. EWC++ computes Fisher information diagonals for each expert's contribution to the shared attention layers, then regularizes subsequent expert distillation to not deviate from previously-learned important weights.

Configuration:

TrainingConfig {
    ewc_lambda: 5000.0,           // Higher than default (2000) for cross-expert stability
    fisher_decay: 0.995,           // Slower decay to preserve Fisher across expert phases
    quality_threshold: 0.5,        // Only learn from high-quality distillation samples
    lr_schedule: LearningRateSchedule::Cosine,
    warmup_steps: 500,             // Longer warmup for 30B scale
    ..Default::default()
}

Concrete protection:

  • After distilling expert 0: compute Fisher diagonal F_0 over validation set
  • When distilling expert 1: add penalty ewc_lambda/2 * Σ F_0_i * (w_i - w*_0_i)²
  • Accumulate: F_cumulative = fisher_decay * F_prev + (1-fisher_decay) * F_new

Reused: EwcRegularizer, TrainingPipeline, TrainingConfig, FisherDiagonal from lora/training.rs. New: SequentialExpertDistiller that wraps EwcRegularizer across expert phases.
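The penalty and accumulation formulas under "Concrete protection" translate directly into code. The sketch below works on flat slices and is not the `EwcRegularizer` API; the function names are illustrative.

```rust
/// EWC penalty: (lambda / 2) * sum_i F_i * (w_i - w*_i)^2.
pub fn ewc_penalty(lambda: f32, fisher: &[f32], w: &[f32], w_star: &[f32]) -> f32 {
    assert!(fisher.len() == w.len() && w.len() == w_star.len());
    let weighted_sq: f32 = fisher
        .iter()
        .zip(w)
        .zip(w_star)
        .map(|((f, wi), ws)| f * (wi - ws).powi(2))
        .sum();
    0.5 * lambda * weighted_sq
}

/// Running Fisher estimate: F_cumulative = decay * F_prev + (1 - decay) * F_new.
pub fn accumulate_fisher(prev: &mut [f32], new: &[f32], decay: f32) {
    assert_eq!(prev.len(), new.len());
    for (p, n) in prev.iter_mut().zip(new) {
        *p = decay * *p + (1.0 - decay) * n;
    }
}
```

With `fisher_decay: 0.995`, each new expert contributes only 0.5% to the running estimate, which is why the configuration describes it as preserving Fisher information across expert phases.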

AD-14: Policy Store for Per-Layer Ternary Scale Tracking

Decision: Extend PolicyStore with a new PolicyType::TernaryScale to persist per-block absmean scale distributions and learned quantization decisions.

Rationale: Not all layers quantize equally well to ternary. Attention layers may need different scale clipping than FFN layers. The policy store enables the distillation pipeline to learn and persist per-layer quantization strategies that can be retrieved and applied in future distillation runs or model updates.

New policy type:

pub enum PolicyType {
    Quantization,
    Router,
    Ewc,
    Pattern,
    TernaryScale,      // NEW: Per-layer ternary quantization metadata
}

pub struct TernaryScalePolicy {
    pub layer_idx: usize,
    pub module: String,              // "gate_proj", "up_proj", "down_proj", "q_proj", etc.
    pub mean_absmean: f32,           // Average scale factor across blocks
    pub std_absmean: f32,            // Standard deviation of scale factors
    pub sparsity: f32,               // Fraction of zero weights
    pub quality_vs_teacher: f32,     // Cosine similarity to teacher output
    pub distillation_loss: f32,      // Final loss for this layer
    pub recommended_block_size: usize, // 256 default, may vary
}

Reused: PolicyStore, PolicyEntry, PolicySource from policy_store.rs. New: TernaryScalePolicy struct and PolicyType::TernaryScale variant.

AD-15: Memory Distillation for Training Quality Tracking

Decision: Log all distillation teacher-student comparisons as Trajectory objects in the ReasoningBank, enabling MemoryDistiller to extract KeyLessons about which layers, experts, and configurations produce the best ternary quality.

Rationale: Distillation is iterative — understanding which experts converge quickly, which resist ternary conversion, and what scale distributions correlate with quality enables intelligent scheduling of future distillation runs.

Mapping:

| ReasoningBank Concept | Distillation Mapping |
| --- | --- |
| Trajectory | One expert's distillation run (N steps) |
| Verdict | Success if cosine_sim > 0.9, Failure if < 0.7 |
| PatternCategory | Expert index + layer type (e.g., "expert_3_gate_proj") |
| KeyLesson | "Expert 7 gate_proj converges fastest with lr=2e-6 and block_size=128" |
| CompressedTrajectory | Summary of entire expert distillation phase |

Reused: MemoryDistiller, DistillationConfig, CompressedTrajectory, KeyLesson from reasoning_bank/distillation.rs. New: DistillationTrajectoryRecorder that adapts expert training steps to Trajectory format.

AD-16: Distillation Pipeline Composition

Decision: Compose the full Craftsman Ultra distillation pipeline from existing RLM components wired through a new CraftsmanDistiller orchestrator.

Pipeline architecture:

┌─────────────────────────────────────────────────────────────────┐
│                  CraftsmanDistiller (NEW orchestrator)           │
│                                                                 │
│  ┌───────────────┐    ┌──────────────────┐    ┌──────────────┐ │
│  │ TeacherModel  │───▶│BitLinearTrainer   │───▶│ GGUFExporter │ │
│  │(GLM-4.7-Flash)│    │(NEW: STE+shadow)  │    │(REUSED)      │ │
│  └───────┬───────┘    └────────┬─────────┘    └──────────────┘ │
│          │                     │                                │
│          │   ┌─────────────────┼─────────────────┐              │
│          │   │                 │                  │              │
│          ▼   ▼                 ▼                  ▼              │
│  ┌──────────────┐   ┌──────────────┐   ┌──────────────────┐    │
│  │GrpoOptimizer │   │EwcRegularizer│   │ContrastiveTrainer│    │
│  │(REUSED)      │   │(REUSED)      │   │(REUSED)          │    │
│  │Per-expert    │   │Cross-expert  │   │Router validation │    │
│  │reward scaling│   │stability     │   │post-ternary      │    │
│  └──────┬───────┘   └──────┬───────┘   └────────┬─────────┘    │
│         │                  │                     │              │
│         ▼                  ▼                     ▼              │
│  ┌──────────────────────────────────────────────────────┐      │
│  │              Quality Feedback Loop                    │      │
│  │                                                       │      │
│  │  MemoryDistiller ──▶ KeyLesson extraction             │      │
│  │  PolicyStore    ──▶ TernaryScale persistence          │      │
│  │  (BOTH REUSED)                                        │      │
│  └──────────────────────────────────────────────────────┘      │
└─────────────────────────────────────────────────────────────────┘

Net-new code:  BitLinearTrainer (STE + shadow weights), CraftsmanDistiller (orchestrator)
Reused code:   GrpoOptimizer, EwcRegularizer, ContrastiveTrainer, MemoryDistiller,
               PolicyStore, GGUFExporter, TrainingConfig, LR schedules
Reuse ratio:   ~70% existing / ~30% new

Optimization: Expert-Parallel Distillation

Experts are independent during forward pass. Distill multiple experts concurrently across CPU cores:

// Distill experts in parallel (independent FFN weights)
let expert_results: Vec<DistillResult> = experts
    .par_iter()                          // rayon parallel iterator
    .enumerate()
    .map(|(idx, expert)| {
        let mut trainer = BitLinearTrainer::new(expert, &teacher_expert[idx]);
        let mut ewc = EwcRegularizer::new_with_fisher(cumulative_fisher[idx]);
        let mut grpo = GrpoOptimizer::new(GrpoConfig::stable());

        for batch in dataset.batches() {
            let student_out = trainer.forward_ternary(batch);
            let teacher_out = teacher.forward_expert(idx, batch);

            let reward = grpo.evaluate(&student_out, &teacher_out);
            let kd_loss = kd_loss_fn(&student_out, &teacher_out, alpha, temperature);
            let ewc_penalty = ewc.penalty(&trainer.shadow_weights());
            let total_loss = kd_loss * reward.scale() + ewc_penalty;

            trainer.backward_ste(total_loss);
        }

        ewc.update_fisher(&trainer);      // Update this expert's Fisher estimate (returned in DistillResult)
        DistillResult { idx, weights: trainer.export_ternary(), fisher: ewc.fisher() }
    })
    .collect();

AD-17: Training Infrastructure — Cloud GPU over Local SIMD

Decision: Use Google Cloud A100/H100 GPU instances for distillation training. Reserve local CPU/SIMD for inference validation, MicroLoRA adaptation, and GGUF export only.

Rationale: Local CPU/SIMD training is impractical at the 200B+ token scale required for expert distillation. The existing RuvLLM SIMD kernels (kernels/) are inference-only, with no backpropagation or gradient computation. The training code (real_trainer.rs:178-184) supports Metal (macOS) or CPU but not CUDA, and at ~50-100 tok/s of CPU training throughput, 200B tokens would take roughly 63-127 years.

Memory analysis (per-expert distillation):

| Component | Size | Notes |
| --- | --- | --- |
| Single expert FFN shadow weights (FP16) | ~2 GB | ~1B params per expert (28B ÷ N experts) |
| Gradients (FP32) | ~4 GB | Full precision for STE backprop |
| AdamW optimizer state (2× FP32) | ~8 GB | First + second moment |
| Teacher activations cache | ~1 GB | Per-batch FP16 |
| EWC++ Fisher diagonal | ~0.5 GB | Per-expert accumulated |
| Per-expert total | ~15.5 GB | Fits in A100 40GB with headroom |
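The per-expert figures follow from bytes-per-parameter arithmetic. The sketch below assumes exactly the precisions listed in the table (FP16 shadow weights, FP32 gradients, two FP32 AdamW moments) and decimal gigabytes; the type name is illustrative.

```rust
/// Training-time memory for `params` trainable weights, in decimal GB.
#[derive(Debug, PartialEq)]
pub struct TrainMemGb {
    pub shadow_fp16: f64, // 2 bytes per parameter
    pub grads_fp32: f64,  // 4 bytes per parameter
    pub adamw_fp32: f64,  // 2 moments x 4 bytes per parameter
}

impl TrainMemGb {
    pub fn for_params(params: f64) -> Self {
        let gb = |bytes_per_param: f64| params * bytes_per_param / 1e9;
        Self {
            shadow_fp16: gb(2.0),
            grads_fp32: gb(4.0),
            adamw_fp32: gb(8.0),
        }
    }

    /// Total including fixed extras such as the activation cache and Fisher diagonal.
    pub fn total(&self, extra_gb: f64) -> f64 {
        self.shadow_fp16 + self.grads_fp32 + self.adamw_fp32 + extra_gb
    }
}
```

For a ~1B-parameter expert, `TrainMemGb::for_params(1e9).total(1.5)` yields 15.5 GB, where the 1.5 GB extra is the teacher activation cache plus the Fisher diagonal from the table.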

Full model simultaneous (Phase 2+):

| Component | Size | Notes |
| --- | --- | --- |
| 30B shadow weights (FP16) | ~60 GB | Requires A100 80GB or H100 |
| Gradients + optimizer | ~360 GB | Requires multi-GPU parallelism |
| Total | ~430 GB | 4× A100 80GB or 4× H100 80GB |

Throughput and cost comparison:

| Platform | Training tok/s | Time (200B tok, Phase 1) | Cost | Phase 0 PTQ? | Phase 0.5 RLM? |
| --- | --- | --- | --- | --- | --- |
| Mac Studio M4 Max (Metal) | ~500-1000 | ~6.5-13 years | N/A | Yes (1-4 hrs, $0) | Yes (2-12 days, $0) |
| Mac Studio M4 Max (NEON SIMD only, no Metal) | ~200-500 | ~13-32 years | N/A | Yes (2-6 hrs, $0) | Yes (4-24 days, $0) |
| Mac Studio M3 Ultra (Metal) | ~800-1500 | ~4.2-8 years | N/A | Yes (1-1.5 hrs, $0) | Yes (1.5-8 days, $0) |
| Mac Studio M3 Ultra (NEON SIMD only, no Metal) | ~300-700 | ~9-21 years | N/A | Yes (1.5-3 hrs, $0) | Yes (3-16 days, $0) |
| CPU AVX2 (Ryzen 9), scalar fallback | ~50-150 | ~43-130 years | N/A | Yes (2-6 hrs, $0) | Yes (14-58 days, $0) |
| 1× A100 80GB (GCP on-demand) | ~15,000 | ~155 days | ~$3,700 | Yes (30 min, ~$5) | Overkill |
| 4× A100 80GB (GCP on-demand) | ~50,000 | ~46 days | ~$4,400 | Overkill for PTQ | Overkill |
| 4× A100 80GB (GCP spot) | ~50,000 | ~46 days | ~$1,300 | Overkill for PTQ | Overkill |
| 1× H100 (DataCrunch) | ~40,000 | ~58 days | ~$2,900 | Overkill for PTQ | Overkill |
| 4× H100 (DataCrunch) | ~140,000 | ~16 days | ~$3,200 | Overkill for PTQ | Overkill |

Key insight: Mac Studio is infeasible for Phase 1+ training (years of wall time) but ideal for Phase 0 PTQ (hours, $0). This separation justifies the phased approach.
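The wall-time column is simply tokens divided by sustained throughput. A small helper, with illustrative names, makes it easy to re-derive the entries or test other throughput assumptions:

```rust
/// Wall-clock days needed to process `tokens` at a sustained `tok_per_sec`.
pub fn training_days(tokens: f64, tok_per_sec: f64) -> f64 {
    tokens / tok_per_sec / 86_400.0
}

/// Same quantity expressed in 365-day years.
pub fn training_years(tokens: f64, tok_per_sec: f64) -> f64 {
    training_days(tokens, tok_per_sec) / 365.0
}
```

For example, `training_days(200e9, 50_000.0)` is about 46 days (4× A100), while `training_years(200e9, 1_000.0)` is about 6.3 years (the upper end of M4 Max Metal throughput). These figures assume continuous utilization with no checkpoint or restart overhead.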

Recommended infrastructure per phase:

| Phase | Instance | Duration | Estimated Cost | Strategy |
| --- | --- | --- | --- | --- |
| Phase 0 (PTQ) | Mac Studio (M4 Max/M3 Ultra) | 1-4 hours | $0 | Mmap FP16 weights → absmean quantize → export GGUF; Metal GPU for calibration pass |
| Phase 0D (BitDistill Lite, 10B tok) | Mac Studio Metal or 1× A100 spot | 2-4 weeks (local) / 1-2 days (cloud) | $0 (local) / ~$300 (cloud) | Optional quality upgrade if Phase 0C too degraded |
| Phase 0.5 (RLM refinement, Metal) | Mac Studio (Metal) | 3-14 days | $0 | MicroLoRA + router fix + scale opt using existing RLM stack |
| Phase 0.5 (RLM refinement, SIMD-only) | Mac Studio (NEON CPU) | 5-28 days | $0 | Same pipeline, no Metal required: pure ndarray + NEON SIMD (see AD-20) |
| Phase 1 (expert FFN, 200B tok) | 4× A100 80GB spot (GCP) | ~46 days | $1,300-$2,000 | Per-expert sequential with EWC++; each expert fits 1 GPU |
| Phase 1 (router validation) | Mac Studio Metal or 1× A100 | ~2-4 hours | $0 (local) / <$10 (cloud) | Contrastive training on router only (~2B params) |
| Phase 2 (full ternary, 500B tok) | 4× H100 (DataCrunch) | ~16-32 days | $2,500-$5,000 | All layers; model-parallel across GPUs |
| Phase 3 (native training, 4T tok) | 8× H100 cluster | ~90-180 days | $15,000-$30,000 | Full from-scratch; depends on funding |
| Inference validation | Mac Studio (NEON) | Continuous | $0 | TL1/TL2 kernel testing on ARM NEON |
| MicroLoRA adaptation | Mac Studio | <1ms/update | $0 | Existing ndarray-based EWC++ pipeline |

Required code change: Add CUDA device dispatch to RealContrastiveTrainer:

// Current (real_trainer.rs:178-184):
let device = if config.use_metal {
    Device::new_metal(0).unwrap_or(Device::Cpu)
} else {
    Device::Cpu
};

// Required for cloud GPU training:
let device = if config.use_cuda {
    Device::new_cuda(config.cuda_device_id).unwrap_or(Device::Cpu)
} else if config.use_metal {
    Device::new_metal(0).unwrap_or(Device::Cpu)
} else {
    Device::Cpu
};

This is a two-field addition to RealTrainingConfig (use_cuda: bool, cuda_device_id: usize) and a 3-line change to device selection. The rest of the Candle training pipeline (tensors, optimizer, loss computation) works identically across CPU/Metal/CUDA.

Cost optimization strategies:

  1. Spot instances: GCP A100 spot at ~$1/GPU-hr (70% off on-demand) — requires checkpointing every 30 min
  2. DataCrunch / Lambda Labs: H100 at $1.99-$2.10/hr (40-50% below GCP on-demand)
  3. Expert-sequential on fewer GPUs: Distill 1 expert at a time on 1× A100 80GB (~$1.50/hr), increasing wall time but reducing per-hour cost
  4. Mixed precision training: FP16 shadow weights + BF16 activations reduces memory, enabling smaller instances
  5. Gradient checkpointing: Trade compute for memory to fit on fewer GPUs

AD-18: Phase 0 — PT-BitNet Post-Training Quantization on Mac Studio

Decision: Implement a PT-BitNet ternary post-training quantizer as Phase 0, running entirely on a local Mac Studio, producing a rapid prototype GGUF for inference pipeline validation before investing in full distillation.

Rationale: The original Option A ("Rejected") assumed only generic IQ1_S quantization, which produces garbled outputs at 1.56 bpw. However, PT-BitNet (2025) demonstrates that applying BitNet's native absmean ternary quantization to pre-trained weights with calibration data achieves significantly better results (61% downstream at 70B) than generic codebook PTQ. This produces genuine BitNet ternary format that enables multiplication-free inference with TL1/TL2 kernels — unlike IQ1_S which still requires dequant-then-multiply.

Target platform: Mac Studio (Apple Silicon)

Phase 0 is pure quantization (no training loop), making it ideal for local execution on Mac Studio:

| Config | Unified RAM | FP16 Load | PTQ? | Calibration? | Notes |
| --- | --- | --- | --- | --- | --- |
| M4 Max 36GB | 36 GB | mmap (demand-paged) | Yes | Slow (paging) | Minimum viable; mmap means only active tensor pages in RAM |
| M4 Max 64GB | 64 GB | Fits with mmap assist | Yes | Yes | Comfortable for PTQ; calibration may page |
| M4 Max 128GB | 128 GB | Fits entirely | Yes | Yes | Ideal: FP16 model (60GB) + ternary output (7GB) + calibration buffers all in RAM |
| M3 Ultra 96GB | 96 GB | Fits entirely | Yes | Yes | Good headroom |
| M3 Ultra 192GB+ | 192+ GB | Fits entirely | Yes | Yes | Ample room for full model + calibration + inference validation |

Why Mac Studio works for Phase 0 (but not Phase 1+):

  • PTQ is not training: No gradient computation, no optimizer state, no backpropagation — just load → quantize → export
  • Memory-mapped I/O: FP16 weights can be mmap'd from disk; only the current tensor's pages need to be in RAM
  • Per-tensor processing: Quantize one tensor at a time (read FP16 block → compute absmean → round to ternary → write output) — working memory is ~2-4 MB per tensor regardless of total model size
  • Metal GPU for calibration: RuvLLM's existing RealContrastiveTrainer and kernels/matmul.rs support Metal via Candle (use_metal: true default, 3x speedup on M4 Pro GEMV)
  • ARM NEON for TL1 kernels: Mac Studio's Apple Silicon has NEON SIMD — the same target ISA as the TL1 kernel for ternary inference validation
  • Phase 1 still needs cloud GPU: 200B token distillation at ~500-1000 tok/s (Metal) = ~6.5 years locally vs ~46 days on 4× A100

Estimated Phase 0 wall time on Mac Studio:

| Step | M4 Max 128GB | M4 Max 64GB | M3 Ultra 192GB |
| --- | --- | --- | --- |
| Download GLM-4.7-Flash FP16 (~60GB) | ~30 min (1Gbps) | ~30 min | ~30 min |
| Absmean ternary quantization | ~5-15 min | ~10-30 min (paging) | ~5-10 min |
| Calibration pass (1000 samples, Metal) | ~30-60 min | ~60-120 min | ~20-40 min |
| GGUF export | ~2-5 min | ~2-5 min | ~2-5 min |
| TL1 kernel validation inference | ~10-20 min | ~10-20 min | ~10-20 min |
| Total | ~1-2 hours | ~2-4 hours | ~1-1.5 hours |

Implementation approach:

Phase 0 Pipeline (runs on Mac Studio):
  1. Load GLM-4.7-Flash FP16/BF16 weights via mmap
  2. For each linear layer in expert FFNs:
     a. Compute gamma = mean(|W|)  (absmean scale)
     b. W_ternary = RoundClip(W / (gamma + epsilon), -1, 1)
     c. Store: 2-bit packed ternary weights + FP16 scale per block
  3. Calibration pass (optional, improves quality, uses Metal GPU):
     a. Run ~1000 calibration samples through teacher model
     b. Record activation statistics per layer
     c. Optimize scale factors to minimize MSE between teacher and ternary outputs
  4. Export to GGUF with BITNET_T158 tensor type + metadata
  5. Validate: load in BitNetBackend → TL1 NEON kernel → generate tokens

Absmean ternary quantizer (core algorithm):

Input:  W ∈ R^{m×n} (FP16 weight matrix)
Output: W_t ∈ {-1,0,+1}^{m×n}, scale ∈ R (per-block FP16)

For each block of 256 elements:
  1. gamma = mean(|block|) + 1e-8
  2. normalized = block / gamma
  3. ternary = round(clamp(normalized, -1, 1))  → {-1, 0, +1}
  4. Pack: 2 bits per weight (00=-1, 01=0, 10=+1)
  5. Store scale = gamma as FP16
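The five steps above map onto a few lines of Rust. This is a sketch under stated assumptions, not the planned `PtBitnetQuantizer`: the scale is kept as f32 because stable Rust has no built-in FP16 type, and the placement of four codes per byte (weight i in bits 2·(i mod 4), low bits first) is an assumption, since the source fixes only the code values.

```rust
/// Quantize one block to ternary values using the absmean scale.
/// Returns the {-1, 0, +1} values and the scale gamma.
pub fn absmean_ternary(block: &[f32]) -> (Vec<i8>, f32) {
    assert!(!block.is_empty());
    // Step 1: gamma = mean(|block|) + 1e-8
    let gamma = block.iter().map(|w| w.abs()).sum::<f32>() / block.len() as f32 + 1e-8;
    // Steps 2-3: normalize, clamp to [-1, 1], round to the nearest integer
    let ternary = block
        .iter()
        .map(|w| (w / gamma).clamp(-1.0, 1.0).round() as i8)
        .collect();
    (ternary, gamma)
}

/// Step 4: pack 4 weights per byte with codes 00 = -1, 01 = 0, 10 = +1.
/// Unused slots in the final byte are left as 00 and ignored on unpack.
pub fn pack_2bit(ternary: &[i8]) -> Vec<u8> {
    let mut out = vec![0u8; (ternary.len() + 3) / 4];
    for (i, &t) in ternary.iter().enumerate() {
        let code = (t + 1) as u8; // -1 -> 0, 0 -> 1, +1 -> 2
        out[i / 4] |= code << (2 * (i % 4));
    }
    out
}

/// Inverse of `pack_2bit` for the first `n` weights.
pub fn unpack_2bit(packed: &[u8], n: usize) -> Vec<i8> {
    (0..n)
        .map(|i| ((packed[i / 4] >> (2 * (i % 4))) & 0b11) as i8 - 1)
        .collect()
}
```

For the block `[0.9, -0.9, 0.1, -0.1]`, gamma is about 0.5, so the large weights saturate to ±1 and the small ones round to 0, which shows how absmean scaling induces sparsity.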

What stays FP16 (same as AD-2):

  • MoE router gating weights
  • Token embeddings + LM head
  • RoPE frequencies
  • LayerNorm/RMSNorm parameters

RuvLLM implementation gaps to fill:

| Gap | Effort | Details |
| --- | --- | --- |
| Absmean ternary quantizer | ~200-300 lines | New function in gguf/quantization.rs or new module |
| IQ1_S / BITNET_T158 dequantization | ~80-120 lines | Add to dequantize_tensor match arm (currently falls to error at line 358) |
| GGUF export with ternary metadata | ~100-150 lines | Extend GgufExportResult with BitNet metadata keys from AD-5 |
| TL1 kernel smoke test | ~200 lines | Validate ternary GEMM produces correct output on PTQ model |

Total new code: ~600-800 lines (vs ~15,000+ for Phase 1 full distillation pipeline)

Quality expectations (conservative estimates for GLM-4.7-Flash 30B-A3B):

| Benchmark | FP16 Baseline | Phase 0 PTQ (est.) | Phase 1 Distill (est.) |
| --- | --- | --- | --- |
| HumanEval pass@1 | ~65% | ~35-45% | ~55-60% |
| MMLU | ~75% | ~45-55% | ~65-70% |
| SWE-bench Verified | 59.2% | ~25-35% | ~50-55% |
| LiveCodeBench v6 | 64.0% | ~30-40% | ~55-60% |

Why Phase 0 quality is still useful:

  1. Kernel validation: Ternary GEMM correctness doesn't depend on model quality
  2. Memory profiling: Real-world memory usage measurement with actual MoE activation patterns
  3. Throughput benchmarking: Measure real tok/s with TL1/TL2/I2_S kernels on target hardware
  4. Pipeline testing: End-to-end GGUF load → inference → token output
  5. Baseline measurement: Quantitative quality floor establishes improvement target for Phase 1
  6. Cost: $0 on Mac Studio vs ~$1,300 for Phase 1 — validates infrastructure at zero cost before committing to cloud GPU

Key configuration:

pub struct PtBitnetConfig {
    pub calibration_samples: usize,     // 1000 default (WikiText-2 or code corpus)
    pub block_size: usize,              // 256 (matches AD-1)
    pub optimize_scales: bool,          // true: MSE-optimized scales; false: raw absmean
    pub layers_to_quantize: LayerMask,  // ExpertsOnly (Phase 0) or All (future)
    pub export_format: TernaryFormat,   // BitnetT158 (native) or IQ1S (llama.cpp compat)
    pub router_precision: Precision,    // FP16 (always, per AD-2)
    pub use_mmap: bool,                 // true: memory-map FP16 weights (required for <128GB systems)
    pub use_metal_calibration: bool,    // true: Metal GPU for calibration pass (Mac Studio)
    pub max_memory_gb: Option<f32>,     // Cap memory usage; enables streaming quantization
}

Reused: GGUF parser, tensor metadata, GgufQuantType enum, export pipeline. New: PtBitnetQuantizer, absmean_ternary(), BITNET_T158 dequantization kernel.

AD-19: Phase 0.5 — RLM Post-Quantization Refinement (No Traditional Training)

Decision: Use the existing RLM training stack to refine the Phase 0 PTQ model on Mac Studio by training only the small FP16 components (~1-2% of parameters), freezing ternary weights. This replaces traditional distillation for the rapid prototype phase.

Rationale: Traditional knowledge distillation (Phase 1) requires shadow weights, straight-through estimator, and GPU-scale compute to modify the ternary weights themselves. However, the Phase 0 PTQ model already has ternary weights — the quality loss comes from:

  1. Sub-optimal per-block scale factors (absmean is a rough approximation)
  2. MoE router misrouting tokens to wrong experts (expert output distributions changed)
  3. No adaptation to ternary output characteristics

All three can be addressed by training only the FP16 components using the existing RLM stack, without touching the ternary weights.

What gets trained (FP16, differentiable) vs frozen (ternary, not differentiable):

| Component | Params | Size | Trainable? | Training Method |
| --- | --- | --- | --- | --- |
| Expert FFN ternary weights | ~28B | ~5.5 GB | Frozen | N/A ({-1,0,+1} is not differentiable) |
| MicroLoRA adapters (rank-2, per expert FFN) | ~50-100M | ~100-200 MB | Yes | TrainingPipeline + EwcRegularizer |
| MoE router gating weights | ~30M | ~60 MB | Yes | ContrastiveTrainer (triplet + InfoNCE) |
| Per-block absmean scale factors | ~0.1M | ~200 KB | Yes | GRPO reward-guided optimization |
| LM head (output projection) | ~150M | ~300 MB | Yes (optional) | Standard fine-tuning |
| Attention Q/K/V/O (FP16) | ~2B | ~4 GB | Optional | Can add LoRA here too if budget allows |
| Total trainable | ~200-400M | ~400-800 MB | | ~1-2% of 30B total |

Why RLM works here (vs traditional distillation):

| Property | Traditional KD (Phase 1) | RLM Refinement (Phase 0.5) |
| --- | --- | --- |
| Modifies ternary weights | Yes (shadow weights + STE) | No (frozen) |
| Trainable params | ~28B (all expert weights) | ~200-400M (1-2%) |
| Training tokens needed | 200B | 100M-500M (400-2000x fewer) |
| GPU requirement | 4× A100 ($1,300+) | Mac Studio Metal ($0) |
| Training time | ~46 days (cloud) | 2-12 days (local) |
| Quality target | ~90-95% of FP16 | ~70-80% of FP16 |
| New code required | ~15,000 lines (BitLinear, STE, orchestrator) | ~200-300 lines (thin RlmRefiner orchestrator; training code is 100% RLM reuse) |

RLM component mapping:

┌──────────────────────────────────────────────────────────────────┐
│              Phase 0.5: RLM Refinement Pipeline                  │
│              (100% existing RLM code, 0% new training code)      │
│                                                                  │
│  Frozen Ternary Model (Phase 0 PTQ output)                       │
│  ┌────────────────────────────────────────────┐                  │
│  │  Expert FFNs: {-1,0,+1} weights (FROZEN)   │                  │
│  │  Router: FP16 gating (TRAINABLE)            │                  │
│  │  Attention: FP16 (TRAINABLE via LoRA opt.)  │                  │
│  │  Scales: FP16 per-block (TRAINABLE)         │                  │
│  └────────────────────────────────────────────┘                  │
│           │                                                       │
│     ┌─────▼──────────────────────────────────────────┐           │
│     │  Step 1: Router Repair                          │           │
│     │  ContrastiveTrainer (REUSED, contrastive.rs)    │           │
│     │  • Generate triplets: anchor=hidden, +correct   │           │
│     │    expert, -wrong expert                        │           │
│     │  • Triplet + InfoNCE loss on FP16 router        │           │
│     │  • Fix misrouting from PTQ weight changes       │           │
│     │  Training: ~10M tokens, ~3-6 hours (Metal)      │           │
│     └─────┬──────────────────────────────────────────┘           │
│           │                                                       │
│     ┌─────▼──────────────────────────────────────────┐           │
│     │  Step 2: MicroLoRA Injection + Training         │           │
│     │  TrainingPipeline + MicroLoRA (REUSED,          │           │
│     │    lora/training.rs + lora/micro_lora.rs)       │           │
│     │  • Rank-2 LoRA per expert FFN: Y = BitLinear(X) │           │
│     │    + LoRA_B @ LoRA_A @ X                        │           │
│     │  • Loss: MSE(teacher_output, student+LoRA)      │           │
│     │  • EWC++ across expert phases                   │           │
│     │  Training: ~100-500M tokens, ~2-12 days (Metal) │           │
│     └─────┬──────────────────────────────────────────┘           │
│           │                                                       │
│     ┌─────▼──────────────────────────────────────────┐           │
│     │  Step 3: Scale Factor + Quality Optimization    │           │
│     │  GrpoOptimizer (REUSED, grpo.rs)                │           │
│     │  • Per-expert output quality → reward signal     │           │
│     │  • Optimize FP16 scale factors to maximize       │           │
│     │    cosine similarity with teacher output          │           │
│     │  • Adaptive KL prevents over-correction          │           │
│     │  Training: concurrent with Step 2               │           │
│     └─────┬──────────────────────────────────────────┘           │
│           │                                                       │
│     ┌─────▼──────────────────────────────────────────┐           │
│     │  Feedback Loop                                  │           │
│     │  MemoryDistiller → KeyLessons (REUSED)          │           │
│     │  PolicyStore → TernaryScale policies (REUSED)   │           │
│     │  • Track which experts improve most             │           │
│     │  • Store optimized configs for reproducibility  │           │
│     └────────────────────────────────────────────────┘           │
└──────────────────────────────────────────────────────────────────┘
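Step 2 of the pipeline computes Y = BitLinear(X) + LoRA_B @ LoRA_A @ X. The sketch below shows that forward pass for a single input vector. It is not the MicroLoRA or BitNetBackend API: the function names are hypothetical, weights are stored row-major, and a single scale is used for the whole ternary matrix, whereas the ADR specifies per-block scales.

```rust
/// Ternary matrix-vector product: only additions and subtractions in the
/// inner loop, with one multiply by the scale per output row.
pub fn ternary_matvec(w: &[i8], rows: usize, cols: usize, scale: f32, x: &[f32]) -> Vec<f32> {
    assert_eq!(w.len(), rows * cols);
    assert_eq!(x.len(), cols);
    (0..rows)
        .map(|r| {
            let mut acc = 0.0f32;
            for (wi, xi) in w[r * cols..(r + 1) * cols].iter().zip(x) {
                match *wi {
                    1 => acc += *xi,
                    -1 => acc -= *xi,
                    _ => {} // zero weight: skip
                }
            }
            acc * scale
        })
        .collect()
}

/// Dense row-major matrix-vector product used for the small LoRA factors.
fn matvec(m: &[f32], rows: usize, cols: usize, x: &[f32]) -> Vec<f32> {
    assert_eq!(m.len(), rows * cols);
    assert_eq!(x.len(), cols);
    (0..rows)
        .map(|r| m[r * cols..(r + 1) * cols].iter().zip(x).map(|(a, b)| a * b).sum::<f32>())
        .collect()
}

/// y = scale * (W_ternary x) + B (A x), with A: rank x cols and B: rows x rank.
pub fn bitlinear_with_lora(
    w: &[i8], rows: usize, cols: usize, scale: f32,
    lora_a: &[f32], lora_b: &[f32], rank: usize, x: &[f32],
) -> Vec<f32> {
    let base = ternary_matvec(w, rows, cols, scale, x);
    let down = matvec(lora_a, rank, cols, x); // project to rank dimensions
    let up = matvec(lora_b, rows, rank, &down); // project back to output size
    base.iter().zip(&up).map(|(b, u)| b + u).collect()
}
```

Because the ternary weights are frozen, only `lora_a` and `lora_b` (and optionally `scale`) would receive gradients, which is what keeps the trainable parameter count at a small fraction of the model.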

Memory budget on Mac Studio during Phase 0.5 training:

| Component | Size | Notes |
| --- | --- | --- |
| PTQ ternary model (mmap) | ~7 GB disk / ~3-7 GB RAM | Demand-paged; only active expert pages in RAM |
| Teacher FP16 model (mmap) | ~60 GB disk / ~4-8 GB RAM | Only forward pass activations; demand-paged |
| MicroLoRA adapters (rank-2) | ~200 MB | All experts in RAM |
| LoRA gradients + optimizer (AdamW 2×FP32) | ~1.5 GB | For ~400M trainable params |
| EWC++ Fisher diagonal | ~200 MB | Per-expert accumulated |
| KV cache + activations | ~2 GB | Calibration/training forward pass |
| Total active RAM | ~12-20 GB | Fits in any Mac Studio config |

Key insight: The teacher model is only needed for forward pass (no gradients), so it can be mmap'd and demand-paged. The ternary student is similarly mmap'd. Only the ~400M trainable parameters and their optimizer state need to be fully in RAM (~2 GB), which fits comfortably in even the 36GB M4 Max.

Training schedule on Mac Studio M4 Max 128GB:

| Step | Tokens | Wall Time | What Changes |
| --- | --- | --- | --- |
| Router repair | ~10M | ~3-6 hours | FP16 router gating weights |
| LoRA training (per-expert, sequential) | ~100-500M | 2-12 days | MicroLoRA A/B matrices per expert FFN |
| Scale optimization | ~10M | ~3-6 hours | Per-block FP16 absmean scales |
| Validation + export | | ~1-2 hours | Benchmark + GGUF re-export |
| Total | ~120-520M | ~3-14 days | |

Expected quality improvement:

| Benchmark | Phase 0 PTQ | Phase 0.5 RLM | Phase 1 Distill | FP16 Baseline |
| --- | --- | --- | --- | --- |
| HumanEval pass@1 | ~35-45% | ~45-55% | ~55-60% | ~65% |
| MMLU | ~45-55% | ~55-65% | ~65-70% | ~75% |
| SWE-bench Verified | ~25-35% | ~35-45% | ~50-55% | 59.2% |

RLM can therefore stand in for traditional training in this phase, with one critical caveat: RLM refinement trains the FP16 corrections around frozen ternary weights, not the ternary weights themselves. This is fundamentally different from traditional distillation, but it recovers meaningful quality (an estimated +10 percentage points on each benchmark in the table above) at zero cloud cost.

Reused (100%): MicroLoRA, TrainingPipeline, EwcRegularizer, GrpoOptimizer, ContrastiveTrainer, MemoryDistiller, PolicyStore, TrainingConfig, LR schedules, GGUF export. New (0%): No new training code. The only new code is a thin RlmRefiner orchestrator (~200-300 lines) that wires the existing components together for the Phase 0.5 pipeline.

AD-20: Phase 0.5 — SIMD-Only Training Mode (No Metal GPU Required)

Decision: Phase 0.5 RLM refinement supports a pure SIMD/CPU execution mode with no Metal GPU dependency. Metal is an optional acceleration path (~2-3x faster) but not required.

Rationale: Analysis of the RLM training stack shows that Metal is required by only one component (RealContrastiveTrainer via Candle); every other training component is pure ndarray/CPU or falls back to CPU. Since Phase 0.5 uses the lightweight ContrastiveTrainer (not RealContrastiveTrainer) for router repair, and all gradient computation is ndarray-based, the entire pipeline runs on pure CPU, with SIMD acceleration for the inference forward passes.

Component-by-component GPU dependency analysis:

| Component | Source | GPU Dependency | SIMD-Only Mode |
| --- | --- | --- | --- |
| MicroLoRA.forward_simd() | lora/micro_lora.rs:279 | None (ARM NEON intrinsics with scalar fallback) | NEON on aarch64, scalar on x86 |
| MicroLoRA.apply_gradients() | lora/micro_lora.rs:621+ | None (pure ndarray) | Works everywhere |
| MicroLoRA.apply_gradients_with_ewc() | lora/micro_lora.rs:621+ | None (pure ndarray) | Works everywhere |
| TrainingPipeline | lora/training.rs | None (pure ndarray CPU) | Works everywhere |
| EwcRegularizer | lora/training.rs | None (pure ndarray CPU) | Works everywhere |
| GrpoOptimizer | training/grpo.rs | None (pure ndarray CPU) | Works everywhere |
| ContrastiveTrainer | training/contrastive.rs:169-175 | Optional: use_metal: true default, but Device::new_metal(0).unwrap_or(Device::Cpu) fallback | Set use_metal: false for CPU-only; also has non-Candle pure CPU path (line 475) |
| MemoryDistiller | reasoning_bank/distillation.rs | None (pure Rust) | Works everywhere |
| PolicyStore | policy_store.rs | None (pure Rust) | Works everywhere |
| RealContrastiveTrainer | training/real_trainer.rs:178 | Yes (Metal/Candle) | NOT used in Phase 0.5 (used in full distillation only) |

Inference forward pass (for loss computation) SIMD support:

| Kernel | NEON (aarch64) | x86 | Source |
| --- | --- | --- | --- |
| GEMM | gemm_neon | gemm_scalar fallback | kernels/matmul.rs:520 |
| GEMV | gemv_neon | gemv_scalar fallback | kernels/matmul.rs:184 |
| SiLU | silu_neon_impl (~3.5x speedup) | scalar fallback | kernels/activations.rs |
| GeLU | gelu_neon_impl (~3.2x speedup) | scalar fallback | kernels/activations.rs |
| ReLU | relu_neon_impl (~4.0x speedup) | scalar fallback | kernels/activations.rs |
| RMSNorm | rms_norm_neon | scalar fallback | kernels/norm.rs |
| RoPE | apply_rope_neon | scalar fallback | kernels/rope.rs |
| Softmax | softmax_neon (~2.8x speedup) | scalar fallback | kernels/activations.rs |

Key observation: The matmul kernels only dispatch on target_arch = "aarch64" vs scalar. There are no explicit AVX2 or AVX512 SIMD implementations for x86 in the current kernel codebase. This means:

  • Apple Silicon (aarch64): Full NEON SIMD acceleration — primary target for SIMD-only mode
  • x86 (AMD/Intel): Falls to scalar fallback — works but ~3-5x slower than NEON
  • Future opportunity: Adding AVX2/AVX512 kernels to matmul.rs would make x86 competitive with NEON
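The dispatch shape described above can be sketched as follows. This is a minimal illustration, not the code in kernels/matmul.rs: `gemv_scalar` here is a stand-in with a hypothetical signature, and `select_gemv_path` is an invented helper that only reports which path a dispatcher would pick. The `avx2` branch shows where a future x86 kernel could slot in via runtime feature detection; no such kernel exists in the codebase today.

```rust
/// Scalar GEMV: y = A * x, with A stored row-major as `rows` x `cols`.
/// Stand-in for the scalar fallback; the real signature may differ.
pub fn gemv_scalar(a: &[f32], x: &[f32], rows: usize, cols: usize) -> Vec<f32> {
    assert_eq!(a.len(), rows * cols);
    assert_eq!(x.len(), cols);
    (0..rows)
        .map(|r| {
            a[r * cols..(r + 1) * cols]
                .iter()
                .zip(x)
                .map(|(w, v)| w * v)
                .sum::<f32>()
        })
        .collect()
}

/// Reports which kernel path a dispatcher would choose on this machine.
/// Hypothetical helper: today only "neon" and "scalar" have real kernels.
pub fn select_gemv_path() -> &'static str {
    if cfg!(target_arch = "aarch64") {
        // NEON is part of the aarch64 baseline, so a compile-time check suffices.
        return "neon";
    }
    #[cfg(target_arch = "x86_64")]
    {
        // Runtime detection: where an AVX2 kernel would be selected if added.
        if std::arch::is_x86_feature_detected!("avx2") {
            return "avx2";
        }
    }
    "scalar"
}
```

On x86 without an AVX2 kernel, the `"avx2"` arm would still have to call the scalar routine, which is the gap the "future opportunity" bullet refers to.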

Throughput comparison for Phase 0.5 (100M tokens, ~200-400M trainable params, 3B active forward):

| Execution Mode | Forward tok/s | Effective Training tok/s | 100M Tokens | 500M Tokens |
|---|---|---|---|---|
| Metal GPU (M4 Max) | ~500-1500 | ~300-700 | ~2-4 days | ~8-19 days |
| NEON SIMD only (M4 Max CPU) | ~200-500 | ~100-300 | ~4-12 days | ~19-58 days |
| NEON SIMD only (M3 Ultra CPU) | ~300-700 | ~150-400 | ~3-8 days | ~14-39 days |
| x86 scalar (Ryzen 9, no AVX2 kernels) | ~50-150 | ~30-80 | ~14-39 days | ~72-193 days |
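The day counts follow directly from the effective training throughput: days = tokens / (tok/s × 86,400). A small sketch of that arithmetic, with hypothetical function names, makes the ranges easy to re-derive when the throughput estimates change:

```rust
const SECONDS_PER_DAY: f64 = 86_400.0;

/// Wall-clock days needed to process `total_tokens` at a sustained rate.
pub fn training_days(total_tokens: f64, tokens_per_second: f64) -> f64 {
    total_tokens / tokens_per_second / SECONDS_PER_DAY
}

/// Returns (best case, worst case) days for a throughput range.
/// The fast rate gives the lower bound, the slow rate the upper bound.
pub fn days_range(total_tokens: f64, slow_tok_s: f64, fast_tok_s: f64) -> (f64, f64) {
    (
        training_days(total_tokens, fast_tok_s),
        training_days(total_tokens, slow_tok_s),
    )
}
```

For example, 100M tokens at ~100-300 effective tok/s gives roughly 3.9 to 11.6 days, which matches the "~4-12 days" entry for NEON on M4 Max.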

Why SIMD-only is ~2-3x slower than Metal (not 10x):

  • Phase 0.5 training is dominated by the forward pass through the frozen 3B active parameters to compute loss against the teacher
  • The forward pass uses SIMD-accelerated GEMM/GEMV (gemm_neon/gemv_neon) which gets ~60-70% of Metal throughput for these matrix sizes
  • Gradient computation for the ~200-400M trainable params is pure ndarray — identical speed regardless of Metal availability
  • The training bottleneck is I/O (loading teacher activations from mmap) not compute, further narrowing the gap

Platform portability (bonus of SIMD-only mode):

SIMD-only mode extends Phase 0.5 beyond Mac Studio to any platform with ndarray support:

| Platform | SIMD Path | Effective tok/s | Feasible? |
|---|---|---|---|
| Mac Studio M4 Max (aarch64) | NEON intrinsics | ~100-300 | Yes (primary target) |
| Mac Studio M3 Ultra (aarch64) | NEON intrinsics | ~150-400 | Yes (faster than M4 Max) |
| Linux ARM64 (Ampere/Graviton) | NEON intrinsics | ~80-200 | Yes (cloud ARM instances) |
| Linux x86 (Ryzen/Xeon) | Scalar fallback | ~30-80 | Marginal: 100M tokens feasible (~14-39 days), 500M not practical |
| macOS Intel | Scalar fallback | ~20-50 | Not recommended |

Configuration for SIMD-only mode:

// Phase 0.5 SIMD-only config (no Metal)
let contrastive_config = ContrastiveConfig {
    use_metal: false,    // Force CPU path in ContrastiveTrainer
    ..Default::default()
};

// MicroLoRA — already pure SIMD/ndarray, no config change needed
// TrainingPipeline — already pure ndarray
// GrpoOptimizer — already pure ndarray
// EwcRegularizer — already pure ndarray

The only config change is use_metal: false in the ContrastiveConfig passed to ContrastiveTrainer. All other RLM components are GPU-agnostic by design.

SIMD-only Phase 0.5 exit criteria (in addition to standard Phase 0.5 criteria):

  • All training completes without Metal GPU dependency
  • ContrastiveTrainer runs with use_metal: false and produces equivalent router accuracy
  • MicroLoRA forward_simd() executes NEON path on aarch64 (verified via cfg compile check)
  • Training throughput measured and documented for SIMD-only vs Metal comparison

Recommendation: Use Metal when available (2-3x faster), and fall back to SIMD-only when Metal is unavailable or on non-Mac platforms. The training code requires zero changes; only use_metal in ContrastiveConfig needs to be set to false.

Reused: 100% of existing RLM stack — MicroLoRA NEON forward, ndarray training, ContrastiveTrainer CPU fallback, all existing SIMD kernels. New: 0 lines. SIMD-only mode is already supported by the existing code paths; AD-20 documents this capability explicitly.

AD-21: Native Rust Ternary Kernels with WASM Target (bitnet.cpp Port Strategy)

Decision: Port bitnet.cpp's ternary inference kernels (TL1, TL2, I2_S) to native Rust with dual compilation targets: native SIMD (NEON/AVX2/AVX512) and WebAssembly SIMD128. This replaces the original AD-4 strategy of Python codegen → Rust intrinsics with a pure Rust implementation that leverages existing open-source work.

Rationale: Three significant developments change the AD-4 implementation calculus:

  1. R3-Engine (https://github.com/r3-engine/r3-engine) — A pure Safe Rust BitNet inference engine achieving 80-117 tok/s single-threaded on Ryzen 9950X3D, with native WASM SIMD128 cross-compilation. Uses bit-sliced ternary matrices with AVX-512 VPOPCNTDQ, zero-copy mmap, and zero heap allocations during generation.

  2. bitnet.rs (https://github.com/ocentra/bitnet.rs) — Pure Rust BitNet toolkit with conversion, inference, training, and streaming. Apache 2.0 license. GPU path via WGSL/wgpu (Vulkan/Metal/DX12). Dedicated bitnet-wasm crate for browser deployment.

  3. WASM SIMD128 maturity — Fixed-width 128-bit SIMD now supported in all major browsers (Chrome, Firefox, Safari, Edge). Rust's core::arch::wasm32 provides direct intrinsic access via simd128 LLVM feature flag.

Comparison of approaches:

| Approach | Native Performance | WASM Support | Safety | Integration Effort | Code Reuse |
|---|---|---|---|---|---|
| A: Python codegen (original AD-4) | Optimal (platform-tuned) | None | C-level unsafe | High (custom codegen pipeline) | bitnet.cpp algorithms |
| B: Port bitnet.cpp to Rust | Near-optimal | Manual WASM SIMD | Mixed (unsafe for intrinsics) | Medium (translate C → Rust) | bitnet.cpp algorithms |
| C: Reference R3-Engine patterns | 80-117 tok/s proven | Native dual-target | 100% Safe Rust | Low-medium (adapt patterns) | R3 bit-slicing + mmap |
| D: Integrate bitnet.rs crate | GPU: 32x (WGSL), CPU: scalar | bitnet-wasm crate | Safe Rust + WGSL | Low (add dependency) | Full crate |

Recommended: Approach C (Reference R3-Engine) with RuvLLM integration

R3-Engine's techniques are the strongest fit because:

  • 100% Safe Rust — no unsafe blocks in the hot path
  • Dual-target proven — same codebase compiles to AVX-512 native and WASM SIMD128
  • Zero-copy mmap — matches our Phase 0 mmap strategy (AD-18)
  • Cache-aligned bit-slicing — 64-byte aligned CacheLines match CPU cache architecture
  • VPOPCNTDQ — bit-population-count approach to ternary GEMM is elegant and SIMD-width-agnostic

WASM SIMD128 kernel mapping for TL1:

WASM SIMD128 provides v128 type (128 bits):
- i8x16: 16 × 8-bit integers — pack 64 ternary weights (2-bit each)
- i16x8: 8 × 16-bit integers — accumulation without overflow
- i32x4: 4 × 32-bit integers — final dequantized output

TL1 LUT (16 entries) maps naturally to a single v128:
  v128.load(lut_ptr)           → load 16-entry LUT
  i8x16.swizzle(lut, indices)  → parallel 16-way table lookup
  i16x8.add(accum, partial)    → INT16 accumulation
  f32x4.mul(dequant, scale)    → FP32 scale application

Estimated WASM SIMD128 throughput:
  ~20-40 tok/s for 3B active params (vs ~5-10 tok/s scalar JS)
  ~4-8x speedup over non-SIMD WebAssembly
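To make the LUT mapping concrete, here is a scalar emulation of the lookup step. It is a sketch under stated assumptions, not the bitnet.cpp or R3-Engine kernel: it assumes TL1 groups ternary weights in pairs, so each pair selects one of 9 used slots in a 16-entry table built from the matching activation pair. The index encoding `(w0 + 1) * 3 + (w1 + 1)` and all function names are hypothetical. A SIMD version would build the table once per activation pair and reuse it across all rows, with the swizzle doing 16 lookups at once.

```rust
/// Maps a pair of ternary weights to a LUT slot in 0..9 (hypothetical encoding).
pub fn pair_index(w0: i8, w1: i8) -> u8 {
    debug_assert!((-1..=1).contains(&w0) && (-1..=1).contains(&w1));
    ((w0 + 1) * 3 + (w1 + 1)) as u8
}

/// Precomputes w0*a0 + w1*a1 for every ternary pair; slots 9..16 stay unused.
pub fn build_lut(a0: i8, a1: i8) -> [i16; 16] {
    let mut lut = [0i16; 16];
    for w0 in -1i16..=1 {
        for w1 in -1i16..=1 {
            let idx = ((w0 + 1) * 3 + (w1 + 1)) as usize;
            lut[idx] = w0 * a0 as i16 + w1 * a1 as i16;
        }
    }
    lut
}

/// Dot product of ternary weights with i8 activations using only table
/// lookups and additions. The table is rebuilt per pair here for clarity.
pub fn lut_dot(weights: &[i8], activations: &[i8]) -> i32 {
    assert_eq!(weights.len(), activations.len());
    assert!(weights.len() % 2 == 0, "weights are consumed in pairs");
    weights
        .chunks_exact(2)
        .zip(activations.chunks_exact(2))
        .map(|(w, a)| build_lut(a[0], a[1])[pair_index(w[0], w[1]) as usize] as i32)
        .sum()
}
```

Note that the table entries here are 16-bit while the swizzle instruction works on 8-bit lanes, so a real WASM kernel would need to split each table into low and high bytes or constrain the activation range.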

WASM SIMD128 limitations:

  • Fixed 128-bit width only (vs NEON 128, AVX2 256, AVX512 512)
  • No wide-lane popcount: only the per-byte i8x16.popcnt is available, so VPOPCNTDQ-style 64-bit counts must be emulated by summing byte counts, or via lookup or bit manipulation
  • No gather/scatter operations (LUT access must be sequential or use swizzle)
  • Memory alignment not enforced (no hardware-guaranteed 64-byte alignment)
  • Single-threaded unless SharedArrayBuffer + Web Workers enabled

Dual-target compilation strategy (Cargo feature flags):

# In Cargo.toml:
[features]
default = ["native-simd"]
native-simd = []           # AVX2/AVX512/NEON via std::arch
wasm-simd = []             # WASM SIMD128 via core::arch::wasm32; simd128 itself is a rustc
                           # target feature (-C target-feature=+simd128), not a Cargo feature

// In kernel code:
#[cfg(all(target_arch = "aarch64", feature = "native-simd"))]
fn ternary_gemv_neon(weights: &TernaryTensor, activations: &[i8], output: &mut [f32]) { ... }

#[cfg(all(target_arch = "x86_64", feature = "native-simd"))]
fn ternary_gemv_avx2(weights: &TernaryTensor, activations: &[i8], output: &mut [f32]) { ... }

#[cfg(all(target_arch = "wasm32", feature = "wasm-simd"))]
fn ternary_gemv_wasm128(weights: &TernaryTensor, activations: &[i8], output: &mut [f32]) { ... }

// Scalar fallback (always available):
fn ternary_gemv_scalar(weights: &TernaryTensor, activations: &[i8], output: &mut [f32]) { ... }
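The scalar fallback is the reference the SIMD paths are validated against, so a sketch of what it might compute is useful. The actual layout of TernaryTensor is not shown in this ADR, so the example below uses a hypothetical packing (2 bits per weight, four weights per byte, `00` = 0, `01` = +1, `10` = -1) and a single per-tensor scale. The names `pack_ternary` and `ternary_gemv_packed` are illustrative only.

```rust
/// Packs ternary weights {-1, 0, +1} at 2 bits each, four per byte.
/// Hypothetical encoding: 00 = 0, 01 = +1, 10 = -1.
pub fn pack_ternary(weights: &[i8]) -> Vec<u8> {
    let mut packed = vec![0u8; (weights.len() + 3) / 4];
    for (i, &w) in weights.iter().enumerate() {
        let code: u8 = match w {
            0 => 0b00,
            1 => 0b01,
            -1 => 0b10,
            _ => panic!("weight {} is not ternary", w),
        };
        packed[i / 4] |= code << ((i % 4) * 2);
    }
    packed
}

/// Multiplication-free GEMV over packed ternary rows: each weight either
/// adds, subtracts, or skips its activation; the scale is applied once per row.
pub fn ternary_gemv_packed(
    packed: &[u8],
    rows: usize,
    cols: usize,
    scale: f32,
    activations: &[i8],
    output: &mut [f32],
) {
    assert_eq!(cols % 4, 0, "rows must be byte-aligned in this sketch");
    assert_eq!(activations.len(), cols);
    assert_eq!(output.len(), rows);
    let bytes_per_row = cols / 4;
    for r in 0..rows {
        let row = &packed[r * bytes_per_row..(r + 1) * bytes_per_row];
        let mut acc: i32 = 0;
        for (c, &a) in activations.iter().enumerate() {
            match (row[c / 4] >> ((c % 4) * 2)) & 0b11 {
                0b01 => acc += a as i32,
                0b10 => acc -= a as i32,
                _ => {}
            }
        }
        output[r] = acc as f32 * scale;
    }
}
```

The inner loop contains no multiplications, which is the property the SIMD kernels preserve while processing many weights per instruction.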

Integration with existing RuvLLM architecture:

| Existing Component | Change Needed | Impact |
|---|---|---|
| kernels/mod.rs | Add ternary module export | Low |
| kernels/matmul.rs | Add ternary GEMV dispatch alongside existing FP16/Metal GEMV | Low |
| bitnet/mod.rs (new) | Wire TernaryTensor to kernel dispatch | Already created (Phase 0) |
| gguf/quantization.rs | BitnetT158 dequant already integrated | Already done |
| autodetect.rs | Add AVX512 VPOPCNTDQ detection + WASM target detection | Low |
| Cargo.toml | Add wasm-simd feature flag, wasm32 target conditional deps | Low |
| backends/ | New BitNetBackend uses ternary kernel dispatch | Medium (new backend) |

Estimated implementation effort (Rust ternary kernels with WASM):

| Component | Lines | Complexity | Notes |
|---|---|---|---|
| TL1 kernel (NEON + scalar) | ~200 | Medium | Reference R3-Engine bit-slicing |
| TL1 kernel (AVX2/AVX512) | ~250 | Medium | VPOPCNTDQ for AVX512, lookup for AVX2 |
| TL1 kernel (WASM SIMD128) | ~150 | Medium | v128 swizzle + i16x8 accumulation |
| I2_S kernel (all targets) | ~300 | Low | Simpler unpack-and-add |
| TL2 kernel (all targets) | ~250 | Medium-High | 5-bit index, 32-entry LUT |
| Kernel dispatch + autodetect | ~100 | Low | Match existing matmul.rs pattern |
| LUT generation | ~80 | Low | Pre-compute at model load |
| Total | ~1,330 | | Compiles to native + WASM from single source |

Phase 0 impact: The Phase 0 smoke test (TL1 NEON + scalar) is already partially covered by the existing bitnet/ module. AD-21 extends this to production-grade kernels with WASM as an additional target.

Exit criteria:

  • TL1 kernel passes bit-exact validation against bitnet.cpp reference output
  • WASM SIMD128 build produces functional .wasm binary
  • Native NEON throughput ≥ 80% of R3-Engine (≥ ~64-94 tok/s for 2B model)
  • AVX2 path tested on x86 Linux
  • Scalar fallback tested on generic platform
  • WASM throughput ≥ 20 tok/s for 3B active params in browser
  • Zero unsafe blocks in WASM path (Safe Rust only)
  • Kernel dispatch selects optimal path via autodetect.rs feature detection

Open question resolved: AD-21 answers open question #5 (WASM target for ternary kernels) — yes, WASM SIMD128 is viable for TL1/I2_S, with ~4-8x speedup over scalar WASM. TL2's 5-bit index is less natural for 128-bit SIMD but still implementable via two-stage lookup.


AD-22: Evaluation Infrastructure and Behavioral Gates

Decision: Define a three-gate behavioral evaluation framework with a structured trace schema, auto-labeling strategy, and Go/No-Go shipping rule. All gates are non-LLM-judge, deterministic, reproducible, and executable on CPU without external API calls. The system ships on integrity/citations/refusal behavior, not raw model quality benchmarks. Full GPU distillation (Phase 1+) is deferred; the eval infrastructure must validate Phase 0 and Phase 0.5 outputs at zero marginal cost.

Rationale: Standard LLM evaluation relies on either (a) benchmark suites (HumanEval, MMLU) that measure general capability, or (b) LLM-as-judge approaches that are non-deterministic, expensive, and unsuitable for gating CI/CD pipelines. For Craftsman Ultra, the critical shipping question is not "does it score well on benchmarks?" but "does it route correctly, cite honestly, and refuse when uncertain?" These behavioral properties are testable with deterministic, cheap-to-run gate checks that compare model outputs against known ground-truth traces.

The three gates correspond to the three failure modes that would make the system untrustworthy regardless of benchmark scores:

  1. Misrouting: wrong experts are selected, producing completions that look correct but are semantically wrong
  2. Hallucinated citations: the model cites evidence that does not exist or does not support the claim
  3. Over/under-refusal: the model refuses answerable questions or confidently answers indeterminate ones

Gate 1 — Routing Correctness

Run the FP16 teacher model once on the 200-prompt evaluation suite to record ground-truth routing traces: which experts are selected, with what softmax weights, per token per layer. Then run the ternary student model on the same prompts and compare routing decisions.

| Parameter | Value |
|---|---|
| Metric | routing_agreement = count(same_topk_experts) / total_tokens |
| Comparison | Per-token, per-layer: do the top-K selected expert indices match between teacher and student? |
| Pass threshold | >= 0.85 (85% of tokens route to the same expert set as the teacher) |
| Fail action | Trigger targeted router repair via ContrastiveTrainer (AD-19, AD-20) with triplets generated from the misrouted token positions |

Teacher traces are recorded once and cached as JSONL. The ternary model is evaluated against these cached traces on every pipeline run. Agreement is measured at the expert-set level (order-invariant): if teacher selects experts {2, 5} and student selects {5, 2}, this counts as agreement.
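A minimal sketch of the agreement metric, assuming each record is one (token, layer) pair holding the teacher's and student's top-K expert indices. The function names and the tuple representation are hypothetical; the order-invariant comparison follows the rule stated above.

```rust
/// True when both routers picked the same experts, regardless of order.
pub fn same_expert_set(teacher: &[u32], student: &[u32]) -> bool {
    let mut t = teacher.to_vec();
    let mut s = student.to_vec();
    t.sort_unstable();
    s.sort_unstable();
    t == s
}

/// Fraction of (teacher, student) records whose expert sets match.
/// Returns 0.0 for an empty trace so an empty run can never pass the gate.
pub fn routing_agreement(records: &[(Vec<u32>, Vec<u32>)]) -> f64 {
    if records.is_empty() {
        return 0.0;
    }
    let agree = records
        .iter()
        .filter(|(t, s)| same_expert_set(t, s))
        .count();
    agree as f64 / records.len() as f64
}
```

With three matching records out of four the score is 0.75, which would fail the 0.85 threshold and trigger router repair.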

Gate 2 — Citation Correctness

For retrieval-augmented responses, verify that citations are grounded in the actual retrieval corpus. This gate requires a labeled subset of the 200-prompt suite where prompts include retrieval context with known chunk IDs.

| Parameter | Value |
|---|---|
| Metric (precision) | citation_precision = valid_citations / total_citations |
| Metric (recall) | citation_recall = cited_evidence / relevant_evidence (from labeled prompts) |
| Validity check | For each cited chunk_id: (1) chunk exists in retrieval corpus, (2) cited span is an exact substring match OR Jaccard similarity between cited span and chunk content > 0.6 |
| Pass threshold | Precision >= 0.90, Recall >= 0.70 |
| Fail action | Trigger retrieval-first policy training via GrpoOptimizer (GRPO reward penalizes hallucinated citations, rewards grounded ones) |

Jaccard similarity is computed at the word level: |intersection(words_cited, words_chunk)| / |union(words_cited, words_chunk)|. This catches paraphrased citations while rejecting fabricated ones. The 0.6 threshold was chosen to allow minor rephrasing while catching wholesale fabrication.
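The validity check can be sketched as below. The tokenization is an assumption: this version splits on whitespace, strips surrounding punctuation, and lowercases, none of which the ADR specifies. `citation_valid` takes the chunk content as an `Option` to model the "chunk exists in retrieval corpus" condition; both function names are hypothetical.

```rust
use std::collections::HashSet;

/// Word set for a text. Lowercasing and punctuation stripping are assumptions.
fn words(text: &str) -> HashSet<String> {
    text.split_whitespace()
        .map(|w| w.trim_matches(|c: char| !c.is_alphanumeric()).to_lowercase())
        .filter(|w| !w.is_empty())
        .collect()
}

/// Word-level Jaccard similarity: |intersection| / |union|.
pub fn jaccard(a: &str, b: &str) -> f64 {
    let (wa, wb) = (words(a), words(b));
    let union = wa.union(&wb).count();
    if union == 0 {
        return 0.0;
    }
    wa.intersection(&wb).count() as f64 / union as f64
}

/// A citation is valid when the chunk exists and the span is either an exact
/// substring of the chunk or has Jaccard similarity above 0.6 with it.
pub fn citation_valid(span: &str, chunk: Option<&str>) -> bool {
    match chunk {
        None => false, // cited chunk_id not found in the corpus
        Some(content) => {
            !span.is_empty() && (content.contains(span) || jaccard(span, content) > 0.6)
        }
    }
}
```

Because both checks are pure string operations, the gate stays deterministic and runs on CPU with no model in the loop.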

Gate 3 — Refusal Calibration

Test the model's ability to refuse when evidence is insufficient and answer when evidence is adequate. Uses the auto-labeled prompt suite (see below) where each prompt is classified as resolved, contested, or indeterminate.

| Parameter | Value |
|---|---|
| Metric | refusal_f1 = harmonic_mean(refusal_precision, refusal_recall) |
| Refusal detection | Output contains a refusal signal (configurable string set, e.g., "I cannot determine", "insufficient evidence", "I'm not sure", or a structured `<refusal>` tag) |
| Must-refuse rate | Model must refuse >= 80% of indeterminate prompts |
| Must-answer rate | Model must answer (not refuse) >= 95% of resolved prompts |
| Pass threshold | Refusal F1 >= 0.85 |
| Fail action | Adjust refusal threshold in controller policy, or retrain controller via GrpoOptimizer with refusal-aware reward signal |

Prompts labeled contested (sources actively contradict) are evaluated separately and not gated. They are tracked for monitoring, but the correct behavior (refuse vs. present both sides) is domain-dependent.
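A sketch of the F1 computation, assuming precision is "correct refusals / all refusals" and recall is "correct refusals / prompts that should be refused", evaluated over the resolved and indeterminate prompts only. The struct mirrors the should_refuse and did_refuse fields of the trace schema; the function name is hypothetical.

```rust
/// One gated prompt: whether refusal was expected and whether it happened.
pub struct RefusalOutcome {
    pub should_refuse: bool,
    pub did_refuse: bool,
}

/// Harmonic mean of refusal precision and recall.
/// Returns 0.0 when there are no correct refusals at all.
pub fn refusal_f1(outcomes: &[RefusalOutcome]) -> f64 {
    let count = |should: bool, did: bool| {
        outcomes
            .iter()
            .filter(|o| o.should_refuse == should && o.did_refuse == did)
            .count() as f64
    };
    let true_pos = count(true, true); // refused an indeterminate prompt
    let false_pos = count(false, true); // refused a resolved prompt
    let false_neg = count(true, false); // answered an indeterminate prompt
    if true_pos == 0.0 {
        return 0.0;
    }
    let precision = true_pos / (true_pos + false_pos);
    let recall = true_pos / (true_pos + false_neg);
    2.0 * precision * recall / (precision + recall)
}
```

As an illustration, 8 correct refusals, 2 missed refusals, and 1 spurious refusal give F1 = 16/19 ≈ 0.842, just under the 0.85 threshold even though the 80% must-refuse rate is met.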

Trace Schema (JSONL format)

Every evaluation run produces a JSONL trace file where each line records per-token, per-layer routing decisions alongside response-level citation and refusal assessments:

{
  "prompt_id": "p-001",
  "token_idx": 42,
  "layer_idx": 3,
  "routing": {
    "topk_expert_ids": [2, 5],
    "topk_weights": [0.62, 0.38],
    "teacher_expert_ids": [2, 5],
    "teacher_weights": [0.65, 0.35],
    "agreement": true
  },
  "citations": [
    {"chunk_id": "doc-17-p3", "span": "exact quoted text", "valid": true}
  ],
  "refusal": {
    "should_refuse": false,
    "did_refuse": false,
    "correct": true
  },
  "coherence_score": 0.91,
  "stop_reason": "eos"
}

Schema notes:

  • routing is emitted per-token per-layer (one record per token-layer pair)
  • citations and refusal are emitted once per response (attached to the final token record, stop_reason != null)
  • coherence_score is the cosine similarity between student and teacher hidden states at the final layer — a cheap proxy for output quality without LLM-judge
  • Trace files are stored in eval/traces/ (never in the project root) and named {model_version}_{prompt_suite}_{timestamp}.jsonl

Auto-Labeling Strategy

The 200-prompt evaluation suite is labeled without manual annotation by using RuVector retrieval signals as proxy ground truth:

| Label | Condition | Meaning | Gate Usage |
|---|---|---|---|
| resolved | Evidence redundancy > 3 (multiple independent sources agree on the answer) | The question is clearly answerable from the corpus | Gate 3: model must answer (not refuse) |
| contested | Cluster disagreement > 0.4 (sources actively contradict each other) | The question has conflicting evidence | Monitored only (not gated) |
| indeterminate | Mincut fragility > 0.7 (removing a single source breaks the entire evidence chain) | The question cannot be reliably answered | Gate 3: model must refuse |

These labels also feed Gate 2:

  • resolved prompts provide the relevant_evidence denominator for citation recall (all supporting chunks should be cited)
  • indeterminate prompts should produce no citations (any citation on an indeterminate prompt is likely hallucinated)

Auto-labeling is deterministic given a fixed retrieval corpus and runs on CPU via existing RuVector HNSW search. Labels are stored alongside prompts in the evaluation suite and versioned with the corpus.
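The labeling rule can be sketched as a pure function of the three signals. Two points are assumptions not fixed by this ADR: the precedence when several conditions fire at once (this sketch checks fragility first, then disagreement, then redundancy) and the handling of prompts that match no condition (returned as `None` here). The struct and field names are hypothetical stand-ins for whatever RuVector exposes.

```rust
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
pub enum Label {
    Resolved,
    Contested,
    Indeterminate,
}

/// Retrieval-derived signals for one prompt (hypothetical field names).
pub struct RetrievalSignals {
    pub evidence_redundancy: u32,
    pub cluster_disagreement: f32,
    pub mincut_fragility: f32,
}

/// Applies the thresholds from the table. Precedence is an assumption:
/// the most cautious label wins when more than one condition holds.
pub fn auto_label(s: &RetrievalSignals) -> Option<Label> {
    if s.mincut_fragility > 0.7 {
        Some(Label::Indeterminate)
    } else if s.cluster_disagreement > 0.4 {
        Some(Label::Contested)
    } else if s.evidence_redundancy > 3 {
        Some(Label::Resolved)
    } else {
        None // no condition met; such prompts would need an explicit policy
    }
}
```

Since the function has no hidden state, identical signals always yield identical labels, which is what keeps the gates reproducible for a fixed corpus.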

Go/No-Go Rule

All three gates must pass on the same evaluation suite run for the system to ship:

SHIP = (routing_agreement >= 0.85)
     AND (citation_precision >= 0.90)
     AND (citation_recall >= 0.70)
     AND (refusal_f1 >= 0.85)

If any gate fails, the system cannot ship. The remediation path is gate-specific:

| Failed Gate | Remediation | Component | Estimated Duration |
|---|---|---|---|
| Routing Correctness | Router repair via ContrastiveTrainer with misrouted-token triplets | training/contrastive.rs | 1-4 hours |
| Citation Correctness | Retrieval-first policy training via GrpoOptimizer (reward grounded citations) | training/grpo.rs | 2-8 hours |
| Refusal Calibration | Adjust refusal threshold or retrain controller policy via GrpoOptimizer | training/grpo.rs + controller config | 1-2 hours |

Re-evaluation after remediation must re-run all three gates (not just the failed one) to confirm no regression.
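The shipping rule and the gate-to-remediation mapping can be expressed in a few lines. This is a sketch with hypothetical type and function names; the thresholds and component paths are taken from the rule and table above.

```rust
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
pub enum Gate {
    RoutingCorrectness,
    CitationCorrectness,
    RefusalCalibration,
}

/// Metrics from a single evaluation-suite run.
pub struct GateMetrics {
    pub routing_agreement: f64,
    pub citation_precision: f64,
    pub citation_recall: f64,
    pub refusal_f1: f64,
}

/// Lists every gate that misses its threshold on this run.
pub fn failed_gates(m: &GateMetrics) -> Vec<Gate> {
    let mut failed = Vec::new();
    if m.routing_agreement < 0.85 {
        failed.push(Gate::RoutingCorrectness);
    }
    if m.citation_precision < 0.90 || m.citation_recall < 0.70 {
        failed.push(Gate::CitationCorrectness);
    }
    if m.refusal_f1 < 0.85 {
        failed.push(Gate::RefusalCalibration);
    }
    failed
}

/// SHIP is true only when all gates pass on the same run.
pub fn ship(m: &GateMetrics) -> bool {
    failed_gates(m).is_empty()
}

/// Component responsible for repairing a failed gate.
pub fn remediation(gate: Gate) -> &'static str {
    match gate {
        Gate::RoutingCorrectness => "ContrastiveTrainer (training/contrastive.rs)",
        Gate::CitationCorrectness => "GrpoOptimizer (training/grpo.rs)",
        Gate::RefusalCalibration => "GrpoOptimizer (training/grpo.rs) + controller config",
    }
}
```

Evaluating all gates from one `GateMetrics` value reflects the requirement that re-evaluation covers every gate, not only the one that failed.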

Implementation location:

| Component | Path | Lines | Notes |
|---|---|---|---|
| Gate runner orchestrator | crates/ruvllm/src/eval/gates.rs | ~300 | New module; runs all three gates, produces trace JSONL |
| Routing trace recorder | crates/ruvllm/src/eval/routing_trace.rs | ~150 | Records teacher routing decisions; compares against student |
| Citation validator | crates/ruvllm/src/eval/citation_check.rs | ~200 | Substring match + Jaccard similarity; corpus lookup |
| Refusal detector | crates/ruvllm/src/eval/refusal_detect.rs | ~100 | Configurable refusal signal set; F1 computation |
| Auto-labeler | crates/ruvllm/src/eval/auto_label.rs | ~150 | RuVector signal extraction; prompt classification |
| Trace schema types | crates/ruvllm/src/eval/trace.rs | ~80 | Serde-annotated structs matching the JSONL schema |
| Total new code | | ~980 | All CPU-only, no external dependencies |

Exit criteria:

  • Teacher routing traces recorded for full 200-prompt suite and cached as JSONL
  • Gate 1 (routing agreement) runs in < 30 minutes on Mac Studio for 200 prompts
  • Gate 2 (citation correctness) validates chunk_id existence and span grounding
  • Gate 3 (refusal calibration) correctly classifies refusal signals in model output
  • Auto-labeler produces resolved/contested/indeterminate labels from RuVector signals
  • All gates produce deterministic results (same inputs = same pass/fail, bit-exact)
  • Trace JSONL files are written to eval/traces/, never to project root
  • Go/No-Go rule enforced: all three gates must pass on same run
  • Failed gate triggers correct remediation path (ContrastiveTrainer or GrpoOptimizer)
  • Total eval suite runtime < 2 hours on Mac Studio (CPU-only)

AD-23: Phase-1 Distillation via External GPU Teacher Artifacts

Status: Accepted

Context: The Ultra 30B ternary MoE system prioritizes CPU-first inference, integrity-driven behavior, and low operational cost. Phase-1 performance goals focus on routing correctness after ternary quantization, citation-grounded answers, and calibrated refusal under thin or conflicting evidence. Full end-to-end GPU distillation of a 30B teacher is expensive, slow, and misaligned with the system's long-term architecture — where RuVector provides memory and structure, and the generator model is intentionally small and cheap. However, pure PTQ ternary conversion (Phase 0) introduces unacceptable degradation in MoE routing stability, answer fidelity on contested prompts, and refusal behavior calibration. We therefore require a limited refinement phase that recovers task-relevant behavior without committing to ongoing GPU dependence.

Decision: Phase-1 distillation SHALL be implemented as a one-time, external GPU artifact generation step, followed by local CPU-only refinement.

  1. A full-precision FP16 teacher is executed once on a short-lived cloud GPU instance
  2. The teacher produces behavioral artifacts, not trained weights
  3. All refinement and training occurs locally on CPU using these artifacts
  4. GPU infrastructure is not a runtime dependency

Scope of Teacher Artifacts (GPU job exports only):

| Artifact | Content | Purpose |
|---|---|---|
| Routing Traces | Per token, per MoE layer: top-k expert indices + routing probabilities/margins | Preserve expert selection behavior post-quantization |
| Sparse Logits | Answer spans, refusal boundaries, contradiction disclosure points only | Guide LoRA residual correction and refusal calibration without full sequence distillation |
| Preference Labels | Per-prompt classification: resolved / contested / indeterminate | Train stop decisions and disclosure behavior |

Artifacts SHALL be stored as immutable, versioned files and reused across refinement runs.

CPU-Only Refinement Strategy (using teacher artifacts):

  1. Router Repair — Match student top-k routing to teacher traces; penalize expert churn and margin collapse
  2. Low-Rank Residual Correction — Apply LoRA-style residuals to compensate ternary approximation error; enforce strict parameter budget
  3. EWC++ Preservation — Prevent catastrophic drift outside repaired regions
  4. Policy Optimization — Train RLM stop and retrieval behavior; optimize for citation correctness and calibrated refusal

No full expert weight updates are allowed in Phase-1.

Evaluation Gate: A checkpoint SHALL NOT be promoted unless it passes behavioral evaluation, not reconstruction metrics. Mandatory metrics:

| Metric | Criterion | Gate |
|---|---|---|
| Routing correctness | Top-k overlap with teacher + margin correlation | Gate 1 (AD-22) |
| Citation correctness | Span hash verification + evidence support via RuVector | Gate 2 (AD-22) |
| Refusal calibration | Refuse on indeterminate, disclose on contested, pass on resolved | Gate 3 (AD-22) |

compute_dequant_error is a sanity check only, not a promotion criterion.

Acceptance Criteria:

  • System passes the 200-prompt disagreement suite
  • Routing correctness meets Gate 1 threshold (>= 0.85)
  • Citation precision meets the Gate 2 precision target (>= 0.90)
  • Refusal behavior aligns with RuVector coherence signals (Gate 3 F1 >= 0.85)
  • Results remain stable under 10% corpus perturbation
  • GPU artifact generation completes in single cloud session (< 4 hours)
  • CPU refinement reproducible without GPU access

Alternatives Considered:

| Alternative | Verdict | Reason |
|---|---|---|
| Full GPU distillation | Rejected | High cost, long iteration cycles, misalignment with CPU-first design |
| Pure PTQ without refinement | Rejected | Unacceptable routing instability, incorrect refusal behavior, citation degradation |
| Continuous GPU shadow training | Rejected | Operational complexity, long-term infrastructure lock-in |

Consequences:

  • Positive: GPU cost is bounded and minimal; refinement is repeatable and auditable; CPU-first deployment remains intact; system behavior aligns with integrity goals; distillation artifacts are reusable
  • Negative: General language quality parity with FP16 teacher is not guaranteed; some PTQ loss may remain in non-critical behaviors; requires building custom evaluation infrastructure (addressed by AD-22)
  • Note: This ADR does not preclude a future Phase-2 distillation if product requirements shift toward general language parity. Phase-2 would be a separate decision

AD-24: RLM-Style Recursive Sentence Transformer Embedder

Status: Accepted

Context: The Craftsman Ultra system uses RuVector for evidence retrieval, cluster analysis, contradiction detection, and mincut fragility scoring. Standard sentence transformers produce embeddings in a single forward pass — one chunk in, one vector out. This works for basic retrieval but fails at three critical boundaries:

  1. Contradiction boundaries: Two chunks with opposing claims embed near each other because they share vocabulary, despite being semantically opposed
  2. Domain drift: Embeddings trained on general corpora perform poorly when the corpus shifts to a specialized domain (legal, medical, code)
  3. Context blindness: The embedding of a chunk is independent of its neighborhood, losing structural signals that RuVector already knows (entity links, claim chains, cluster membership)

A normal embedding pipeline cannot distinguish "Drug X cures condition Y" from "Drug X does NOT cure condition Y" — they embed almost identically. The system needs embeddings that reflect the structural position of a chunk within the evidence graph, not just its surface semantics.

Decision: Implement an RLM-style recursive embedder — not a new architecture, but an inference strategy that wraps any base sentence transformer in a short iterative loop that retrieves context, decomposes, re-embeds, and merges.

Core Loop (bounded to 2-3 iterations):

State: { text, intent, neighbors, candidate_embeddings, iteration, stop_reason }

1. Embed the base chunk                           → base_embedding
2. Retrieve k nearest neighbors from RuVector      → neighbors[]
3. Normalize/summarize chunk with neighbor context  → contextualized_text
4. Re-embed the normalized view                    → ctx_embedding
5. If contested (low-cut boundary), embed both     → cluster_a_emb, cluster_b_emb
   sides of the disagreement separately
6. Merge into final representation                 → final_embedding + metadata
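The bounded loop and its stop policy can be sketched with the embedder and the context step passed in as closures. Everything here is hypothetical scaffolding: `embed` stands in for any base sentence transformer, `contextualize` stands in for steps 2-3 (neighbor retrieval plus normalization), and the contested branch (step 5) and the merge (step 6) are left out. The returned confidence is the cosine similarity between the last two iterations.

```rust
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
pub enum StopReason {
    Converged,
    MaxIterations,
    Contested, // listed for completeness; not produced by this sketch
}

/// Cosine similarity; returns 0.0 if either vector has zero norm.
pub fn cosine(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let na = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let nb = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if na == 0.0 || nb == 0.0 {
        0.0
    } else {
        dot / (na * nb)
    }
}

/// Re-embeds a chunk until consecutive embeddings agree or the budget runs out.
pub fn recursive_embed<E, C>(
    text: &str,
    embed: E,
    contextualize: C,
    max_iters: usize,
    threshold: f32,
) -> (Vec<f32>, f32, StopReason)
where
    E: Fn(&str) -> Vec<f32>,
    C: Fn(&str, usize) -> String,
{
    let mut current = embed(text); // step 1: base embedding
    let mut confidence = 0.0;
    for iteration in 1..=max_iters {
        let view = contextualize(text, iteration); // steps 2-3 (stand-in)
        let next = embed(&view); // step 4: re-embed the contextualized view
        confidence = cosine(&current, &next);
        current = next;
        if confidence >= threshold {
            return (current, confidence, StopReason::Converged);
        }
    }
    (current, confidence, StopReason::MaxIterations)
}
```

Capping `max_iters` at 2-3 keeps the worst-case cost at a small, fixed multiple of a single-pass embedding.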

Output Schema:

| Field | Type | Description |
|---|---|---|
| embedding | `Vec<f32>` | Final merged embedding vector |
| confidence | `f32` | Embedding stability across iterations (cosine similarity between iteration N and N-1) |
| evidence_neighbor_ids | `Vec<String>` | RuVector chunk IDs used as context |
| contradiction_flags | `Vec<bool>` | Per-neighbor: true if neighbor is in opposing cluster |
| cluster_id | `Option<usize>` | Primary cluster assignment |
| stop_reason | `StopReason` | Why the loop terminated: Converged, MaxIterations, Contested |

Three Embedding Variants:

| Variant | Conditioning | Use Case | Output |
|---|---|---|---|
| A: Query-Conditioned | Query text + neighborhood | Retrieval under a specific query | Embedding optimized for that query's intent |
| B: Corpus-Conditioned | Stable neighbors + entity graph | Corpus indexing | Embedding stable over time, less sensitive to local phrasing |
| C: Contradiction-Aware Twin | Both sides of a low-cut boundary | Disputed claims | Bimodal representation: one embedding per cluster side |

Merge Rule (auditable, not learned):

final = normalize(w0 * base + w1 * ctx + w2 * anti)

Where anti is the embedding of the strongest counter-cluster neighbor set. Weights can be fixed (w0=0.6, w1=0.3, w2=0.1) or learned with a small regression on the eval set.
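The merge rule is a weighted sum followed by L2 normalization. A minimal sketch, with a hypothetical function name and the fixed weights from the text passed in as an array:

```rust
/// final = normalize(w0 * base + w1 * ctx + w2 * anti)
/// `w` holds [w0, w1, w2]; all three inputs must share one dimension.
pub fn merge_embeddings(base: &[f32], ctx: &[f32], anti: &[f32], w: [f32; 3]) -> Vec<f32> {
    assert!(base.len() == ctx.len() && ctx.len() == anti.len());
    let mut out: Vec<f32> = (0..base.len())
        .map(|i| w[0] * base[i] + w[1] * ctx[i] + w[2] * anti[i])
        .collect();
    // L2-normalize so downstream cosine comparisons are scale-free.
    let norm = out.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm > 0.0 {
        for x in &mut out {
            *x /= norm;
        }
    }
    out
}
```

Because the rule is a fixed linear combination, the contribution of each input to the final vector can be read directly off the three weights, which is what makes it auditable.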

Training Strategy (minimal, no full model training):

Only three components are trainable:

  1. Merge weights (w0, w1, w2) — 3 parameters, learned via grid search or small regression
  2. Stop policy — when to terminate the loop (convergence threshold on cosine similarity between iterations)
  3. Adapter layer — optional small linear layer on top of base embeddings for domain adaptation (rank-4 LoRA or single linear)

Evaluation Criteria:

| Metric | Definition | Target |
|---|---|---|
| Top-k retrieval accuracy | Correct chunk in top-k results | Improvement over single-pass baseline |
| False neighbor rate | Contradicting chunks incorrectly ranked as similar | Reduction vs baseline |
| Cluster purity | Intra-cluster coherence after re-embedding | Improvement vs baseline |
| Contradiction separation | Cosine distance between opposing claim embeddings | > 0.3 (vs ~0.05 for single-pass) |
| Stability under perturbation | Embedding change when 10% of corpus is modified | < 0.05 cosine drift |
| Latency per embedding | Wall time including retrieval + re-embedding | < 50ms for 2 iterations on target hardware |

Appliance Fit (CPU-first):

  • Small base embedder model (e.g., 22M-110M params)
  • 2-3 passes maximum per chunk
  • RuVector supplies all context (no additional retrieval infrastructure)
  • Ternary quantization of the base embedder is possible (future AD)
  • Compatible with WASM deployment for browser-side embedding

Acceptance Criteria:

  • On a held-out corpus slice, RLM-style embedder improves top-k retrieval accuracy vs single-pass baseline
  • False neighbor matches near contradiction boundaries are reduced
  • Latency stays within budget (< 50ms for 2 iterations on target hardware)
  • Memory usage does not exceed appliance budget
  • Variant C produces measurably separated embeddings for known contradictions
  • Merge weights are interpretable and auditable (no black-box learned fusion)

Consequences

Positive

  1. CPU-only deployment: 30B-class model running on commodity hardware without GPU
  2. Energy efficiency: 55-82% reduction in inference energy vs FP16
  3. Memory efficiency: ~8GB vs ~60GB for FP16 30B model (7.5x reduction)
  4. Multiplication-free expert GEMM: Integer addition only in expert forward passes
  5. SONA compatibility: MicroLoRA adaptation preserves per-session learning
  6. GGUF ecosystem: Compatible with existing model distribution infrastructure
  7. Incremental path: Phase 0 ($0) validates pipeline; Phase 0.5 ($0) adds RLM quality boost; Phase 1 ($1,300) delivers production quality; Phases 2-3 optimize
  8. ~70% RLM code reuse: GRPO, EWC++, ContrastiveTrainer, MemoryDistiller, PolicyStore are production-tested — only BitLinear layer and orchestrator are net-new
  9. Adaptive distillation: GRPO reward scaling dynamically focuses compute on hard-to-distill experts
  10. Cross-expert stability: EWC++ Fisher diagonal prevents catastrophic forgetting during sequential expert distillation
  11. Learned quantization policies: PolicyStore persists per-layer ternary scale distributions for reproducible future distillation runs
  12. Expert-parallel distillation: Independent expert FFNs enable rayon-parallel distillation across CPU cores
  13. Phase 0 de-risks Phase 1 at zero cost: Mac Studio PTQ prototype validates entire inference pipeline (GGUF → dequant → kernel → MoE → generation) for $0 before committing $1,300+ to cloud GPU distillation
  14. Existing GGUF ecosystem: Community-published GLM-4.7-Flash GGUFs (bartowski, unsloth) available as comparison baselines
  15. Phase 0.5 RLM refinement at $0: Existing MicroLoRA + GRPO + EWC++ + ContrastiveTrainer stack provides ~10-15 percentage point quality recovery over raw PTQ with zero new training code, running entirely on Mac Studio
  16. 100% RLM reuse for Phase 0.5: No new training infrastructure needed — all 7 RLM components are production-tested and wire together directly
  17. SIMD-only Phase 0.5: Entire RLM refinement pipeline runs on pure CPU SIMD (NEON on aarch64) without Metal GPU — only ~2-3x slower than Metal, extends platform support to Linux ARM64 and (with scalar fallback) x86
  18. Single-flag SIMD mode: All training components (MicroLoRA, TrainingPipeline, EwcRegularizer, GrpoOptimizer) are already GPU-agnostic; only use_metal: false in ContrastiveConfig is needed for full SIMD-only execution
  19. WASM browser deployment: Native Rust kernels compile to WASM SIMD128 via Cargo feature flags, enabling in-browser ternary inference at ~20-40 tok/s without server roundtrip
  20. Single-source dual-target: One Rust codebase compiles to both native SIMD (NEON/AVX2/AVX512) and WASM SIMD128, eliminating the need for separate C++ and JS codebases
  21. Safe Rust kernels: Following R3-Engine's approach, production kernels can be 100% Safe Rust (no unsafe in hot path), eliminating entire classes of memory safety bugs vs bitnet.cpp's C++
  22. Existing Rust ecosystem: R3-Engine (Apache-compatible) and bitnet.rs (Apache 2.0) provide proven reference implementations to accelerate kernel development
  23. Deterministic behavioral gates: Three non-LLM-judge evaluation gates (routing, citation, refusal) provide reproducible pass/fail shipping decisions without expensive API calls or non-deterministic judge models
  24. Structured trace schema: JSONL trace format captures per-token routing, per-response citation, and refusal decisions in a single auditable artifact — enables regression detection across model versions
  25. Zero-annotation auto-labeling: RuVector retrieval signals (evidence redundancy, cluster disagreement, mincut fragility) classify prompts as resolved/contested/indeterminate without human annotation effort
  26. Gate-specific remediation: Each failed gate maps to a concrete repair action using existing RLM components (ContrastiveTrainer for routing, GrpoOptimizer for citations and refusal), avoiding manual debugging cycles
  27. CPU-only evaluation: Full eval suite runs on Mac Studio in < 2 hours with no cloud GPU or external API dependency, keeping the evaluation loop at $0 marginal cost
  28. Bounded GPU cost: Phase-1 distillation requires only a single short-lived cloud GPU session to generate behavioral artifacts (routing traces, sparse logits, preference labels) — no ongoing GPU dependency
  29. Artifact reusability: Teacher artifacts are immutable and versioned; CPU refinement runs can be repeated, tuned, and audited without re-running the GPU job
  30. Behavioral distillation: Distilling routing decisions and refusal signals rather than full logit sequences aligns training objectives with the system's integrity-first design goal
  31. RLM-style embeddings: Recursive context-aware embeddings improve retrieval accuracy and contradiction separation without requiring a larger embedding model — inference strategy, not new architecture
  32. Contradiction-aware twin embeddings: Variant C produces bimodal representations at low-cut boundaries, preserving disagreement structure in the embedding space for downstream decision-making
  33. Minimal training surface: Only 3 merge weights + stop policy + optional adapter need training for the RLM embedder — no full model fine-tuning required
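
The single-source dual-target idea (item 20) can be sketched as one entry point whose kernel path is chosen at compile time. This is an illustrative sketch only: the ADR says selection happens via Cargo feature flags, whereas this sketch uses `cfg!` target detection as a stand-in, and the names `kernel_backend`, `ternary_dot`, and `ternary_dot_scalar` are hypothetical rather than taken from the RuvLLM codebase.

```rust
/// Name of the kernel path this build would select.
/// The returned labels are hypothetical, not RuvLLM identifiers.
pub fn kernel_backend() -> &'static str {
    if cfg!(all(target_arch = "wasm32", target_feature = "simd128")) {
        "wasm-simd128"
    } else if cfg!(target_arch = "aarch64") {
        "neon"
    } else if cfg!(all(target_arch = "x86_64", target_feature = "avx2")) {
        "avx2"
    } else {
        "scalar"
    }
}

/// Portable scalar fallback: dot product of ternary weights {-1, 0, +1}
/// with i8 activations, using only additions and subtractions.
pub fn ternary_dot_scalar(weights: &[i8], activations: &[i8]) -> i32 {
    weights
        .iter()
        .zip(activations)
        .map(|(&w, &a)| match w {
            1 => a as i32,
            -1 => -(a as i32),
            _ => 0,
        })
        .sum()
}

/// Single entry point shared by native and WASM builds.
/// A real build would dispatch to SIMD implementations behind `#[cfg(...)]`;
/// this sketch always uses the scalar path.
pub fn ternary_dot(weights: &[i8], activations: &[i8]) -> i32 {
    ternary_dot_scalar(weights, activations)
}
```

Callers depend only on `ternary_dot`, so adding a SIMD path later would not change call sites.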

Negative

  1. Training cost: Even distillation requires 800-1,600 A100-hours (~$2K-$5K cloud cost)
  2. Custom kernels: Must implement and maintain platform-specific SIMD kernels in Rust
  3. Quality gap: Phase 1 may be 5-10% below GLM-4.7-Flash on some benchmarks
  4. No GPU acceleration: BitNet kernels are CPU-specific; GPU path requires separate optimization
  5. Mixed-precision complexity: Router (FP16) + experts (ternary) + attention (FP16/ternary) adds dispatch complexity
  6. WASM SIMD128 ceiling: Fixed 128-bit width limits throughput vs native AVX2 (256-bit) or AVX512 (512-bit); popcount is available only per byte lane (i8x16.popcnt), so wider popcounts need an extra reduction step; execution is single-threaded unless SharedArrayBuffer is enabled. Expect ~20-40 tok/s vs ~80-117 tok/s native
  7. RLM scale gap: Existing RealContrastiveTrainer targets 0.5B models (embedding_dim=896); scaling to 30B requires distributed data loading and increased batch sizes
  8. No x86 SIMD kernels: Current kernels/matmul.rs implements only NEON (aarch64); x86 falls back to scalar code (~3-5x slower than NEON). Adding AVX2/AVX512 kernels would make x86 SIMD-only mode competitive, but they are not yet implemented
  9. Teacher trace dependency: Gate 1 requires a full FP16 teacher forward pass to generate ground-truth routing traces; this must be re-run whenever the evaluation suite changes or the teacher model is updated
  10. Auto-label noise: RuVector-derived labels (evidence redundancy, mincut fragility) are proxies for true answerability; edge cases near thresholds (e.g., fragility = 0.69 vs 0.71) may produce inconsistent labels across corpus versions
  11. 200-prompt suite coverage: A fixed 200-prompt suite may not cover all failure modes; adversarial or distribution-shifted prompts could pass all gates yet fail in production
  12. General quality ceiling: Phase-1 behavioral distillation intentionally does not target full language quality parity with FP16 teacher; non-critical behaviors may remain degraded
  13. Teacher artifact staleness: If the evaluation prompt suite or teacher model changes, routing traces and preference labels must be regenerated on GPU

Risks

| Risk | Likelihood | Impact | Mitigation |
|------|------------|--------|------------|
| Phase 0 PTQ quality too low for meaningful testing | Medium | Low | Phase 0 is for kernel/pipeline validation, not quality; upgrade to 0D (BitDistill Lite) if needed |
| MoE routing degrades with ternary experts | Medium | High | Phase 0 detects routing issues early; Phase 1 validates routing; router stays FP16; AD-12 contrastive validation |
| bitnet.cpp kernel translation to Rust introduces bugs | Medium | Medium | Phase 0 PTQ model provides cheap test fixture; extensive kernel unit tests; validate against reference impl |
| Distillation fails to converge for MoE | Low | High | GRPO reward scaling + per-expert distillation fallback; EWC++ stability (AD-13) |
| GLM-4.7-Flash architecture changes break compatibility | Low | Medium | Pin to specific HF revision; architecture abstraction layer |
| IQ1_S GGUF format insufficient for absmean metadata | Medium | Low | Register custom GGUF type (BITNET_T158); backward-compatible extension |
| EWC++ Fisher accumulation OOM at 30B scale | Medium | Medium | Sparse Fisher (top-k diagonal entries); per-expert rather than global Fisher |
| GRPO reward signal too noisy for distillation | Low | Low | Fall back to static KD loss; GRPO reward as optional multiplier |
| RealContrastiveTrainer doesn't scale to 30B | Medium | Medium | Extract training loop; replace Candle Linear with BitLinear; keep optimizer/scheduler |
| Calibration data bias in Phase 0 PTQ | Low | Low | Use diverse calibration corpus (WikiText + code); measure variance across calibration sets |
| Auto-label thresholds misclassify edge-case prompts | Medium | Medium | Track label stability across corpus versions; flag prompts with signals near threshold boundaries for manual review |
| 200-prompt suite insufficient for production coverage | Low | Medium | Expand suite iteratively as production failure modes are discovered; run gates on user-submitted adversarial prompts quarterly |
| Teacher routing traces become stale after model update | Low | Low | Re-record teacher traces as part of every model version bump; cache invalidation keyed on teacher model hash |
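
The mitigation for auto-label threshold noise (flagging prompts whose signals sit near a threshold boundary) might look roughly like the sketch below. The threshold of 0.70 is only implied by the "0.69 vs 0.71" example in the Negative list, and the review margin of 0.02 is an assumption chosen for illustration; `LabelDecision` and `classify_signal` are hypothetical names.

```rust
/// Outcome of comparing one retrieval signal against its threshold.
#[derive(Debug, PartialEq)]
pub enum LabelDecision {
    Above,
    Below,
    /// The signal is within `margin` of the threshold, so the label could
    /// flip between corpus versions and should be reviewed manually.
    NeedsReview,
}

/// Classifies a signal (for example, mincut fragility) against a threshold,
/// routing near-boundary values to manual review instead of auto-labeling.
pub fn classify_signal(value: f64, threshold: f64, margin: f64) -> LabelDecision {
    if (value - threshold).abs() <= margin {
        LabelDecision::NeedsReview
    } else if value > threshold {
        LabelDecision::Above
    } else {
        LabelDecision::Below
    }
}
```

With these assumed values, both 0.69 and 0.71 land in `NeedsReview`, so the two edge cases no longer receive opposite automatic labels.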

Validation Criteria

Phase 0 Exit Criteria

  • Absmean ternary quantizer produces valid {-1, 0, +1} weights from GLM-4.7-Flash FP16
  • Quantization runs successfully on Mac Studio via mmap (no cloud GPU required)
  • GGUF export with BITNET_T158 tensor type loads without error in BitNetBackend
  • TL1 NEON kernel produces non-zero, bounded output on PTQ ternary weights
  • MoE routing selects experts (not all-zero or all-same-expert degenerate routing)
  • End-to-end token generation produces coherent (if degraded) text
  • Memory usage measured and documented for real MoE activation patterns
  • Throughput measured: tok/s on Mac Studio (ARM NEON) and optionally x86 AVX2
  • Baseline quality benchmarks recorded (HumanEval, MMLU) as Phase 1 improvement target
  • Total Phase 0 cost = $0 (local Mac Studio execution)
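
A minimal sketch of the first criterion, assuming the standard absmean rule from the BitNet b1.58 paper (scale = mean of absolute weights, then round and clip to [-1, 1]). The type and function names are hypothetical, and per-tensor scaling is an assumption; the actual quantizer may use per-block scales.

```rust
/// Result of quantizing one weight tensor to ternary values.
pub struct TernaryTensor {
    /// Quantized weights, each in {-1, 0, +1}.
    pub values: Vec<i8>,
    /// Absmean scale; a dequantized weight is approximately `value * scale`.
    pub scale: f32,
}

/// Absmean quantization: scale = mean(|w|),
/// q = clamp(round(w / (scale + eps)), -1, 1).
pub fn absmean_quantize(weights: &[f32]) -> TernaryTensor {
    const EPS: f32 = 1e-6; // guards against division by zero for all-zero tensors
    let n = weights.len().max(1) as f32;
    let scale = weights.iter().map(|w| w.abs()).sum::<f32>() / n;
    let values = weights
        .iter()
        .map(|w| (w / (scale + EPS)).round().clamp(-1.0, 1.0) as i8)
        .collect();
    TernaryTensor { values, scale }
}

/// Checks that every quantized entry is -1, 0, or +1.
pub fn is_valid_ternary(t: &TernaryTensor) -> bool {
    t.values.iter().all(|v| matches!(*v, -1 | 0 | 1))
}
```

For example, the weights `[0.5, -0.5, 0.05, -1.5]` have an absmean of 0.6375, so the small weight rounds to 0 and the large negative weight clips to -1.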

Phase 0.5 Exit Criteria

  • MicroLoRA adapters (rank-2) attached to all expert FFN layers
  • Router fine-tuning via ContrastiveTrainer restores >=90% routing accuracy vs teacher
  • GRPO reward signal shows positive quality improvement over Phase 0 baseline
  • EWC++ prevents router fix from degrading already-correct routing paths (Fisher delta < 5%)
  • HumanEval pass@1 >= 45% (up from Phase 0 baseline of ~35-45%)
  • MicroLoRA + ternary inference produces coherent code completions
  • Training completes on Mac Studio within 14 days
  • MemoryDistiller has extracted KeyLessons identifying worst-degraded experts
  • PolicyStore contains optimized TernaryScale entries for all refined layers
  • Total Phase 0.5 cost = $0 (local Mac Studio execution)
  • GGUF re-exported with optimized router, scale factors, and LoRA adapter weights
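
The routing-accuracy criterion could be checked along the lines of the sketch below. It assumes that "routing accuracy" means the fraction of tokens whose selected expert set exactly matches the teacher's, ignoring order; the ADR section does not define the metric, so this definition and the function names are assumptions.

```rust
use std::collections::HashSet;

/// Fraction of tokens for which the student selects exactly the same set of
/// experts as the teacher (order within the top-k is ignored).
/// Returns `None` if the traces are empty or differ in length.
pub fn routing_accuracy(teacher: &[Vec<usize>], student: &[Vec<usize>]) -> Option<f64> {
    if teacher.is_empty() || teacher.len() != student.len() {
        return None;
    }
    let matching = teacher
        .iter()
        .zip(student)
        .filter(|(t, s)| {
            let t_set: HashSet<&usize> = t.iter().collect();
            let s_set: HashSet<&usize> = s.iter().collect();
            t_set == s_set
        })
        .count();
    Some(matching as f64 / teacher.len() as f64)
}

/// Gate check against a minimum accuracy such as the 0.90 Phase 0.5 target.
pub fn passes_routing_gate(teacher: &[Vec<usize>], student: &[Vec<usize>], min: f64) -> bool {
    routing_accuracy(teacher, student).map_or(false, |acc| acc >= min)
}
```

A softer metric, such as mean per-token overlap, would credit partial matches; exact-set agreement is the stricter reading of the criterion.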

Phase 1 Exit Criteria

  • BitNet backend loads GGUF with ternary expert weights
  • TL1 kernel produces bit-exact output vs reference float implementation
  • Decode speed >= 5 tok/s on x86_64 AVX2 (AMD Ryzen 7 / Intel i7 class)
  • HumanEval pass@1 >= 50% (GLM-4.7-Flash baseline: ~65%)
  • Memory usage < 10GB for 4K context inference
  • GRPO-guided expert distillation converges (loss < 0.5 for all experts)
  • EWC++ prevents cross-expert interference (Fisher-regularized loss delta < 5%)
  • Contrastive router validation: >= 95% expert routing accuracy vs teacher
  • PolicyStore contains TernaryScale entries for all distilled expert layers
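
One way the kernel-versus-reference criterion might be set up is sketched below. This is not the TL1 lookup-table kernel; it is a scalar add/subtract stand-in compared against a float multiply-accumulate reference, with i8 activations assumed so that both paths are exact. All function names are hypothetical.

```rust
/// Float reference: ordinary multiply-accumulate over ternary weights cast
/// to f32. `weights` is row-major with `cols` entries per row.
pub fn gemv_reference(weights: &[i8], cols: usize, x: &[i8]) -> Vec<f32> {
    assert!(cols > 0 && weights.len() % cols == 0 && x.len() == cols);
    weights
        .chunks(cols)
        .map(|row| row.iter().zip(x).map(|(&w, &a)| w as f32 * a as f32).sum::<f32>())
        .collect()
}

/// Multiplication-free ternary GEMV: add where w = +1, subtract where
/// w = -1, skip where w = 0. Accumulates in i32.
pub fn gemv_ternary(weights: &[i8], cols: usize, x: &[i8]) -> Vec<i32> {
    assert!(cols > 0 && weights.len() % cols == 0 && x.len() == cols);
    weights
        .chunks(cols)
        .map(|row| {
            let mut acc = 0i32;
            for (&w, &a) in row.iter().zip(x) {
                match w {
                    1 => acc += a as i32,
                    -1 => acc -= a as i32,
                    _ => {}
                }
            }
            acc
        })
        .collect()
}

/// True when every output matches the reference exactly (no tolerance).
/// Exactness holds while sums stay below 2^24, where f32 represents
/// integers without rounding.
pub fn matches_reference(weights: &[i8], cols: usize, x: &[i8]) -> bool {
    let fast = gemv_ternary(weights, cols, x);
    let reference = gemv_reference(weights, cols, x);
    fast.len() == reference.len()
        && fast.iter().zip(&reference).all(|(&f, &r)| f as f32 == r)
}
```

The absmean scale factor would be applied once per output after accumulation, so it does not affect this integer-level comparison.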

Phase 2 Exit Criteria

  • Full ternary model (attention + experts) running on CPU
  • Decode speed >= 8 tok/s on x86_64 AVX2
  • SWE-bench Verified >= 52% (about 88% of GLM-4.7-Flash's 59.2%)
  • SONA MicroLoRA adaptation functional on ternary base
  • MemoryDistiller has extracted >= 50 KeyLessons from distillation trajectories
  • GRPO adaptive KL stabilizes below kl_target (0.02) for all experts

Phase 3 Exit Criteria

  • Native-trained model matches or exceeds GLM-4.7-Flash benchmarks
  • Published on HuggingFace (ruv/craftsman-ultra-30b-1bit)
  • GGUF + bitnet kernel distributed via npm/packages/ruvllm
  • Full distillation pipeline reproducible from PolicyStore policies (no manual tuning)

References

  1. Ma, S. et al., "The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits" (arXiv:2402.17764, Feb 2024)
  2. Ma, S. et al., "BitNet b1.58 2B4T Technical Report" (arXiv:2504.12285, Apr 2025)
  3. Microsoft Research, "bitnet.cpp: Efficient Edge Inference for Ternary LLMs" (arXiv:2502.11880, Feb 2025)
  4. Microsoft, bitnet.cpp — https://github.com/microsoft/BitNet
  5. Zhipu AI, GLM-4.7-Flash — https://huggingface.co/zai-org/GLM-4.7-Flash
  6. Zhipu AI, "GLM-4.7: Advancing the Coding Capability" — https://z.ai/blog/glm-4.7
  7. RuvLLM ADR-002: RuvLLM Integration with Ruvector
  8. RuvLLM GGUF Quantization Module: crates/ruvllm/src/gguf/quantization.rs
  9. Microsoft, bitnet-b1.58-2B-4T-gguf — https://huggingface.co/microsoft/bitnet-b1.58-2B-4T-gguf
  10. RuvLLM GRPO Implementation: crates/ruvllm/src/training/grpo.rs
  11. RuvLLM RealContrastiveTrainer: crates/ruvllm/src/training/real_trainer.rs
  12. RuvLLM EWC++ Training Pipeline: crates/ruvllm/src/lora/training.rs
  13. RuvLLM Memory Distillation: crates/ruvllm/src/reasoning_bank/distillation.rs
  14. RuvLLM Policy Store: crates/ruvllm/src/policy_store.rs
  15. RuvLLM Contrastive Training: crates/ruvllm/src/training/contrastive.rs
  16. PT-BitNet: "Scaling up the 1-Bit large language model with post-training quantization" (2025) — https://www.sciencedirect.com/science/article/abs/pii/S089360802500735X
  17. BitDistill: "BitNet Distillation" (arXiv:2510.13998, Oct 2025) — https://arxiv.org/html/2510.13998v1
  18. bartowski, GLM-4.7-Flash-GGUF quantizations — https://huggingface.co/bartowski/zai-org_GLM-4.7-Flash-GGUF
  19. unsloth, GLM-4.7-Flash-GGUF dynamic quantizations — https://huggingface.co/unsloth/GLM-4.7-Flash-GGUF
  20. llama.cpp IQ1_S blind testing (Discussion #5962) — https://github.com/ggml-org/llama.cpp/discussions/5962
  21. STBLLM: "Breaking the 1-bit Barrier" (ICLR 2025) — https://proceedings.iclr.cc/paper_files/paper/2025/file/ff997469ac66cf893c4183efeb22212a-Paper-Conference.pdf
  22. Apple Mac Studio Technical Specifications (2025) — https://www.apple.com/mac-studio/specs/
  23. RuvLLM Metal GEMV integration: crates/ruvllm/src/kernels/matmul.rs:1444-1582
  24. RuvLLM MicroLoRA NEON SIMD forward: crates/ruvllm/src/lora/micro_lora.rs:279-390 (forward_simd, forward_simd_neon_impl)
  25. RuvLLM NEON SIMD kernels: crates/ruvllm/src/kernels/ (matmul: gemm_neon/gemv_neon, activations: silu_neon/gelu_neon/relu_neon, norm: rms_norm_neon, rope: apply_rope_neon)
  26. RuvLLM ContrastiveTrainer CPU fallback: crates/ruvllm/src/training/contrastive.rs:171-175 (Metal → CPU fallback) and contrastive.rs:475 (non-Candle pure CPU path)
  27. R3-Engine: Pure Rust BitNet inference engine with WASM SIMD128 — https://github.com/r3-engine/r3-engine
  28. bitnet.rs: Pure Rust BitNet toolkit (Apache 2.0) — https://github.com/ocentra/bitnet.rs
  29. WASM SIMD128 specification: Fixed-width 128-bit SIMD for WebAssembly — https://v8.dev/features/simd
  30. Rust core::arch::wasm32 SIMD intrinsics — https://doc.rust-lang.org/beta/core/arch/wasm32/index.html
  31. "The state of SIMD in Rust in 2025" (Sergey Davidoff) — https://shnatsel.medium.com/the-state-of-simd-in-rust-in-2025-32c263e5f53d
  32. "Rust + WebAssembly 2025: WasmGC and SIMD" — https://dev.to/dataformathub/rust-webassembly-2025-why-wasmgc-and-simd-change-everything-3ldh
  33. Bai, Y. et al., "Constitutional AI: Harmlessness from AI Feedback" (arXiv:2212.08073, Dec 2022) — https://arxiv.org/abs/2212.08073
  34. Zheng, L. et al., "Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena" (arXiv:2306.05685, Jun 2023) — https://arxiv.org/abs/2306.05685
  35. Rafailov, R. et al., "Direct Preference Optimization: Your Language Model is Secretly a Reward Model" (arXiv:2305.18290, May 2023) — https://arxiv.org/abs/2305.18290
  36. Min, S. et al., "FActScore: Fine-grained Atomic Evaluation of Factual Precision in Long Form Text Generation" (arXiv:2305.14251, May 2023) — https://arxiv.org/abs/2305.14251
  37. RuvLLM BitNet Backend: crates/ruvllm/src/bitnet/backend.rs (MoE routing, TL1 GEMV, forward pass)
  38. RuvLLM RLM Refiner: crates/ruvllm/src/bitnet/rlm_refiner.rs (Phase 0.5 refinement orchestrator)