118 KiB
ADR-017: Craftsman Ultra 30b 1bit — BitNet Integration with RuvLLM
Status: Proposed Date: 2026-02-03 Decision Makers: Ruvector Architecture Team Technical Area: 1-Bit LLM Inference / MoE Architecture / CPU-Native Serving
Context and Problem Statement
Large language models require substantial GPU resources for inference, limiting deployment to cloud environments and specialized hardware. Recent advances in 1-bit quantization — specifically Microsoft Research's BitNet b1.58 — demonstrate that ternary-weight models ({-1, 0, +1}) can match full-precision performance at 3B+ parameters while enabling CPU-only inference at human-readable speeds.
Concurrently, Zhipu AI's GLM-4.7-Flash introduces a 30B-A3B Mixture-of-Experts architecture that activates only ~3B parameters per token while storing 30B total knowledge, achieving strong coding and agentic benchmarks (SWE-bench Verified: 59.2%, LiveCodeBench v6: 64.0%) with 200K context.
Craftsman Ultra 30b 1bit is a proposed model that combines these two paradigms: a 30B-A3B MoE architecture with native BitNet b1.58 ternary quantization, purpose-built for CPU inference within the RuvLLM serving runtime. This ADR evaluates the integration path, architectural decisions, and trade-offs.
Strategic Goal
Deliver a 30B-class coding/agentic model that runs entirely on consumer CPUs (no GPU required) at 5-15 tokens/second decode, with memory footprint under 8GB, integrated into the RuvLLM + Ruvector ecosystem with SONA self-learning capabilities.
Decision Drivers
Performance Requirements
| Metric | Target | Rationale |
|---|---|---|
| Decode throughput (CPU) | 5-15 tok/s | Human-readable speed per BitNet 100B benchmarks |
| Prefill latency (1K tokens) | <2s | Interactive coding assistant responsiveness |
| Memory footprint (model) | <8 GB | Fits in 16GB system RAM with OS + KV cache |
| Memory footprint (KV cache, 4K ctx) | <2 GB | Q8 KV cache for 4096-token context |
| Active parameter GEMM | Addition-only | BitNet eliminates multiplication in W×A |
| Energy per inference | <0.05J | BitNet CPU efficiency benchmarks |
Architecture Requirements
- MoE routing must remain full-precision: Expert selection requires accurate gating scores
- Expert weights are ternary: Each expert's linear layers use BitLinear (W1.58A8)
- Activations quantized to INT8: Per-token absmax scaling
- Shared layers (embeddings, LM head) remain FP16: Critical for quality preservation
- GGUF-compatible: Must serialize to/load from GGUF v3 format with custom metadata
Ecosystem Requirements
- Integrate with RuvLLM's existing backend abstraction (
backends/mod.rs) - Leverage existing GGUF parser (
gguf/parser.rs,gguf/quantization.rs) - Support SONA learning loops for per-session adaptation
- Compatible with Claude Flow agent routing for task delegation
- NAPI bindings for Node.js consumption via
npm/packages/ruvllm
Research Summary
BitNet b1.58 Architecture
Source: Microsoft Research, "The Era of 1-bit LLMs" (Feb 2024), bitnet.cpp (Oct 2024)
BitNet b1.58 replaces standard nn.Linear with BitLinear layers:
Forward Pass:
1. W_ternary = RoundClip(W / (gamma + epsilon), -1, 1)
where gamma = mean(|W|) (absmean quantization)
2. X_int8 = Quant(X, absmax) (per-token 8-bit activation)
3. Y = W_ternary @ X_int8 (integer addition only, no multiplication)
4. Y_float = Dequant(Y) (rescale to float)
Key properties:
- Weights: ternary {-1, 0, +1} → 1.58 bits per parameter
- Activations: INT8 per-token (absmax scaling)
- Matrix multiply becomes addition and subtraction only (no FP multiply)
- Zero weights enable feature filtering (sparse activation within dense layers)
- Must be trained from scratch — post-training quantization to 1-bit destroys quality
Inference kernels (bitnet.cpp):
| Kernel | Method | Compression | Best For |
|---|---|---|---|
| I2_S | 2-bit pack, unpack-and-multiply | 2 bits/weight | Bandwidth-limited |
| TL1 | 2-weight → 4-bit LUT index | 2 bits/weight | Balanced CPU |
| TL2 | 3-weight → 5-bit LUT index | 1.67 bits/weight | Memory-limited |
CPU performance (bitnet.cpp benchmarks):
| Platform | Speedup vs FP16 | Energy Reduction |
|---|---|---|
| ARM (NEON) | 1.37x – 5.07x | 55-70% |
| x86 (AVX2) | 2.37x – 6.17x | 72-82% |
| x86 (AVX512) | ~6x+ | ~85% |
GLM-4.7-Flash Architecture
Source: Zhipu AI / Z.AI (Jan 2026)
| Property | Value |
|---|---|
| Total parameters | ~30B (31B reported) |
| Active parameters | ~3B (A3B) |
| Architecture | Mixture of Experts (MoE) |
| Shared layers | ~2B parameters |
| Expert layers | ~28B (distributed across experts) |
| Context window | 200K tokens (MLA-based) |
| Training data | 15T general + 7T reasoning/code tokens |
| Attention | Multi-head Latent Attention (MLA) with QK-Norm |
| Activation | SwiGLU |
| Position encoding | RoPE |
| Speculative decoding | Multi-Token Prediction (MTP) layer |
| Reasoning | Interleaved + Retention-Based + Round-Level |
Benchmark performance:
| Benchmark | Score |
|---|---|
| AIME 25 | 91.6% |
| GPQA | 75.2% |
| SWE-bench Verified | 59.2% |
| LiveCodeBench v6 | 64.0% |
| HLE | 14.4% |
| tau2-Bench | 79.5% |
RuvLLM Current Capabilities (Relevant)
- GGUF v3 parser: Full format support including IQ1_S (1.56 bits/weight, type 19)
- Quantization pipeline: Q4_K_M, Q5_K_M, Q8_0, F16 (no native ternary training)
- Backends: Candle (Metal/CUDA), mistral-rs (PagedAttention), CoreML (ANE)
- No CPU-optimized ternary kernel: Current backends target GPU acceleration
- SIMD kernels: Existing NEON/SSE4.1/AVX2 infrastructure in
crates/ruvllm/src/kernels/ - MicroLoRA: Rank 1-2 adapters with <1ms adaptation (compatible with BitNet)
- SONA: Three-tier learning (instant/background/deep) — can drive ternary adapter training
RuvLLM RLM Training Stack (Reusable for Distillation)
RuvLLM contains a mature reinforcement-learning-from-model-feedback (RLM) training stack that directly accelerates Craftsman Ultra distillation. These components are production-tested and reduce net-new code by ~70%.
GRPO — Group Relative Policy Optimization (training/grpo.rs, 897 lines)
- Critic-free RL: computes relative advantages within sample groups
- Adaptive KL divergence penalty (
kl_target,clip_range) controls teacher-student divergence - PPO-style clipping prevents catastrophic updates
- Preset configs:
GrpoConfig::stable()(safe distillation),GrpoConfig::for_tool_use()(expert routing) - Thread-safe batch processing via
RwLock<VecDeque<SampleGroup>>
RealContrastiveTrainer (training/real_trainer.rs, 1000 lines)
- Candle-based training loop with GGUF model loading and GGUF weight export
- Combined loss: Triplet (margin) + InfoNCE (contrastive) + GRPO reward scaling
- AdamW optimizer with gradient clipping, LR warmup, checkpointing
GrpoEvaluatorcomputes per-prediction rewards (1.0 correct, -0.5 wrong)- Metal/CUDA acceleration via Candle device dispatch
MicroLoRA + EWC++ Training Pipeline (lora/training.rs, 798 lines)
- Single-example gradient computation (batch_size=1 for real-time)
- EWC++ regularizer:
λ/2 * Σ F_i * (w_i - w*_i)²prevents catastrophic forgetting - Fisher diagonal tracking with exponential decay (
fisher_decay: 0.999) - 7 learning rate schedules (Cosine, OneCycle, Step, etc.)
- Async adaptation with buffered gradient accumulation
Memory Distillation (reasoning_bank/distillation.rs, 856 lines)
- Compresses trajectories to
KeyLessonobjects with semantic embeddings - Smart extraction: explicit lessons, implicit patterns, error patterns, recovery patterns
- Semantic deduplication (Jaccard + cosine similarity, threshold 0.85)
- Quality-gated: only trajectories above
min_quality_thresholdare preserved
Policy Store (policy_store.rs, 474 lines)
- Ruvector-backed semantic policy persistence with HNSW indexing
- Policy types:
Quantization,Router,Ewc,Pattern - Per-layer
QuantizationPolicywith precision, activation thresholds, quality-latency tradeoff - Policy source tracking:
InstantLoop,BackgroundLoop,DeepLoop,Federated
Contrastive Training (training/contrastive.rs, 634 lines)
- Two-stage: Triplet Loss (margin=0.5) + InfoNCE (temperature=0.07)
- 13 agent types with 1,078 training triplets (578 base + 500 hard negatives)
- Hard negative mining at 48.4% ratio (Claude-generated confusing pairs)
- Proven 100% routing accuracy with hybrid keyword-first + embedding fallback
Considered Options
Option A: Post-Training Quantization of GLM-4.7-Flash (PTQ Tiers)
Take the existing BF16 GLM-4.7-Flash weights and quantize to low-bit formats without full distillation training.
Critical distinction — IQ1_S ≠ BitNet b1.58:
| Property | GGUF IQ1_S | BitNet b1.58 |
|---|---|---|
| Encoding | Codebook-based importance quantization | Ternary {-1, 0, +1} via absmean |
| Bits/weight | 1.56 bpw | 1.58 bpw |
| Inference | Dequantize → FP multiply | Integer addition only (no multiply) |
| Speed benefit | Memory bandwidth only | Bandwidth + compute (multiplication-free) |
| How obtained | Post-training quantization | Trained from scratch or distilled |
| Quality at 7B | Near-random / broken outputs | Matches FP16 |
Existing GLM-4.7-Flash GGUF quantizations available (community-published):
| Repository | Lowest Quant | Size | Notes |
|---|---|---|---|
| bartowski/zai-org_GLM-4.7-Flash-GGUF | IQ2_XXS (2.06 bpw) | 7.62 GB | No IQ1_S published |
| unsloth/GLM-4.7-Flash-GGUF | UD-Q2_K_XL (2.7 bpw dynamic) | ~11 GB | Dynamic quant, recommended |
| ngxson/GLM-4.7-Flash-GGUF | Q4_K_M (4.5 bpw) | 18.1 GB | 55 variants available |
No IQ1_S quantization has been published for GLM-4.7-Flash by any community quantizer — this itself is a signal (too aggressive for practical use).
Sub-options ranked by increasing effort:
Sub-option 0A: Download existing IQ2_XXS GGUF
- Download bartowski's IQ2_XXS at 7.62 GB
- Cost: $0, time: 5 minutes (just download)
- Quality: ~75-80% of FP16 (2.06 bpw is usable per community reports)
- NOT 1-bit, NOT BitNet — just aggressive 2-bit compression
- RuvLLM gap: IQ2_XXS dequantization not implemented (falls to error catch-all in
quantization.rs:358) - RuvLLM Q2_K dequantization IS implemented and works
Sub-option 0B: Quantize to IQ1_S via llama.cpp
- Run
llama-quantize GLM-4.7-Flash-F16.gguf IQ1_Swith importance matrix - Cost: $0, time: ~30 minutes on CPU
- Quality: SEVERE degradation — blind testing shows IQ1_S is "broken rather than just bad" on 7B; outputs contain garbled text despite acceptable perplexity scores. 30B MoE may survive better due to parameter redundancy, but expert routing is highly sensitive to weight perturbation
- RuvLLM gap: IQ1_S dequantization not implemented (
quantization.rs:358catch-all) - Does NOT achieve BitNet multiplication-free inference
Sub-option 0C: PT-BitNet ternary PTQ (per PT-BitNet paper)
- Apply absmean ternary quantization (BitNet's native method) to pre-trained weights with calibration data
- Cost: $0 (runs locally on Mac Studio via mmap + Metal; 1-4 hours wall time)
- Alternative: ~$50-200 on cloud GPU if no local Apple Silicon hardware
- Quality: ~55-65% downstream accuracy (PT-BitNet reports 61% on 70B; GLM-4.7-Flash's 30B-A3B may differ)
- THIS IS proper BitNet ternary format → enables multiplication-free inference with AD-4 kernels
- Requires implementing absmean ternary quantizer (~200-300 lines of new code)
- Requires calibration dataset (WikiText-2 or similar, ~1M tokens)
- Mac Studio M4 Max 64GB+ or M3 Ultra 96GB+ recommended (see AD-18)
Sub-option 0D: BitDistill Lite (10B tokens) (per BitDistill paper)
- 3-stage: SubLN insertion → 10B-token continued pre-training → KL + attention distillation
- Cost: ~$200-500 (8× GPU hours on Mi300X/A100 class)
- Quality: ~90-95% of FP16 (BitDistill reports 88.17% vs 88.01% FP16 on MNLI at 0.6B)
- Near-full quality recovery with only 10B tokens (vs 200B+ for Phase 1 full distillation)
- Requires SubLN module insertion + distillation fine-tuning loop
- Bridges gap between pure PTQ and full expert distillation (Phase 1)
Summary comparison:
| Sub-option | Cost | Time | Quality (est.) | BitNet Speedup | RuvLLM Ready |
|---|---|---|---|---|---|
| 0A: IQ2_XXS download | $0 | 5 min | ~75-80% | No | No (missing dequant) |
| 0B: IQ1_S quantize | $0 | 30 min | ~40-50% | No | No (missing dequant) |
| 0C: PT-BitNet PTQ | $0 (Mac Studio) | 1-4 hrs | ~55-65% | Yes | Needs quantizer impl |
| 0D: BitDistill Lite | $0 local / ~$300 cloud | 2-4 wks / 1-2 days | ~90-95% | Yes | Needs SubLN + KD loop |
Pros (of PTQ approach generally):
- Immediate or near-immediate results ($0-$300, minutes to days)
- No large-scale training infrastructure
- Validates inference pipeline and kernels before investing in full distillation
- Sub-option 0C produces genuine BitNet ternary format for kernel development
Cons:
- Sub-options 0A/0B: Quality too degraded for production coding tasks
- Sub-options 0A/0B: No BitNet multiplication-free inference (still dequant-then-multiply)
- Sub-option 0C: Significant quality loss (~35-45%) vs teacher — adequate for kernel validation, not production
- Sub-option 0D: Requires non-trivial training code (SubLN, KD loss) but much less than full Phase 1
- IQ1_S blind test results: statistically indistinguishable from random on smaller models
Verdict: Recommended as Phase 0 rapid prototype — Sub-option 0C (PT-BitNet PTQ) is the optimal entry point: $100, 2-4 hours, produces genuine BitNet ternary format for kernel development and inference validation. Sub-option 0D (BitDistill Lite) bridges to Phase 1 if higher quality is needed before committing to full expert distillation. Sub-options 0A/0B are useful only as baselines for comparison.
Option B: Native BitNet Training of GLM-4.7-Flash Architecture (Full)
Train Craftsman Ultra 30b 1bit from scratch using BitNet b1.58 methodology on the GLM-4.7-Flash MoE architecture.
Approach:
- Implement BitLinear layers for all expert MLPs and attention projections
- Keep MoE router, embeddings, and LM head in FP16
- Train on 4T+ tokens with ternary weight updates via straight-through estimator
- Export to custom GGUF with ternary tensor metadata
Pros:
- Maximum quality — matches FP16 at 3B+ active parameter scale
- True multiplication-free inference for expert forward passes
- Full TL1/TL2 kernel optimization possible
- Scientifically validated approach (BitNet b1.58 2B4T results)
Cons:
- Massive training compute: estimated 4,000-8,000 A100-hours for 4T tokens
- Requires custom training framework (BitNet + MoE + MLA integration)
- 6-12 month timeline for training pipeline + training run
- No pre-existing GLM-4.7-class BitNet training recipe
Verdict: Recommended long-term — Highest quality but requires significant investment.
Option C: Hybrid Approach — BitNet Distillation from GLM-4.7-Flash (RLM-Accelerated)
Use knowledge distillation to transfer GLM-4.7-Flash capabilities into a BitNet architecture, reducing training cost by 5-10x. Leverages the existing RLM training stack to eliminate ~70% of net-new training code.
Approach:
- Initialize Craftsman Ultra with GLM-4.7-Flash architecture (30B-A3B MoE)
- Replace all expert linear layers with BitLinear (ternary {-1, 0, +1})
- Keep router, embeddings, LM head in FP16
- Extend
RealContrastiveTrainerwith KD loss (KL div + hard-label CE) replacing triplet+InfoNCE - Use
GrpoOptimizerfor per-expert quality rewards during distillation — eachSampleGroupmaps to one expert's teacher vs student outputs - Apply
EwcRegularizeracross distillation phases to prevent early-trained experts from being overwritten - Log distillation trajectories to
MemoryDistillerfor quality tracking andKeyLessonextraction - Persist per-layer ternary policies via
PolicyStore(quantization thresholds, scale distributions) - Export to GGUF with ternary tensor metadata and TL1/TL2 kernel hints via existing
GgufExportResult
RLM Component Reuse:
| Existing Component | Reuse | Adaptation Needed |
|---|---|---|
RealContrastiveTrainer |
Training loop, GGUF export, checkpointing | Replace triplet+InfoNCE with KD loss |
GrpoOptimizer |
Reward scaling, adaptive KL, PPO clipping | Map SampleGroup to per-expert outputs |
EwcRegularizer |
Fisher diagonal, forgetting prevention | Apply across expert distillation phases |
MemoryDistiller |
Trajectory compression, lesson extraction | Map Verdict to teacher-student quality delta |
PolicyStore |
Semantic policy persistence | Add PolicyType::TernaryScale for per-block absmean tracking |
ContrastiveTrainer |
Hard negative mining framework | Reuse for expert-routing contrastive pre-training |
Pros:
- 5-10x less compute than training from scratch (~800-1,600 A100-hours)
- ~70% existing code reuse — only BitLinear forward/backward and MoE data loading are net-new
- Leverages GLM-4.7-Flash's proven architecture and routing
- GRPO's adaptive KL prevents ternary student from diverging too far from teacher
- EWC++ ensures sequential expert distillation doesn't corrupt earlier experts
- Teacher model provides strong supervision signal for ternary convergence
- Can incrementally improve with more distillation tokens
PolicyStoreenables learned per-layer quantization decisions- Distillation quality tracked end-to-end via
MemoryDistillertrajectory logging
Cons:
- Slight quality gap vs native training (estimated 2-5% on benchmarks)
RealContrastiveTrainerembedding_dim (896) must scale to GLM-4.7-Flash hidden_size- Teacher inference cost during distillation
- Distillation may not perfectly transfer MoE routing behavior
Verdict: Recommended near-term — Best balance of quality, cost, and timeline. RLM reuse eliminates the "custom framework" risk.
Option D: BitNet Expert Replacement (Incremental, RLM-Accelerated)
Keep GLM-4.7-Flash structure but replace only the expert MLP layers with BitLinear, leaving attention in FP16. Reuses existing RLM stack for the entire distillation loop.
Approach:
- Load GLM-4.7-Flash architecture
- Replace expert FFN layers (gate_proj, up_proj, down_proj) with BitLinear
- Keep attention (Q/K/V/O projections) in FP16
- Use
RealContrastiveTrainer+GrpoOptimizerfor expert-only distillation (~200B tokens) - Apply
EwcRegularizerto prevent expert N+1 distillation from corrupting expert N - Attention weights loaded directly from GLM-4.7-Flash (no distillation needed)
- Use contrastive pre-training to validate MoE routing still selects correct experts after ternary conversion
Pros:
- Fastest path to working model
- Attention quality preserved exactly
- Expert FFN is 60-70% of active parameters — gets most BitNet benefits
- Simpler distillation (only FFN layers)
- Lower memory: ~5.5 GB for ternary experts + FP16 attention
- Minimal net-new code: BitLinear layer + GGUF ternary type only; training loop is 100% reused
Cons:
- Attention layers still require FP multiply (not fully multiplication-free)
- Mixed-precision inference path complexity
- ~40% of compute still in FP16 attention
Verdict: Recommended as Phase 1 — Enables rapid prototyping and validation. RLM reuse makes this achievable with only ~30% new code.
Decision
Phased approach: A(0C) → RLM Refinement → D → C → B
Phase 0: PTQ Rapid Prototype (Option A, Sub-option 0C)
- Timeline: 1-2 weeks
- Cost: $0 (runs entirely on Mac Studio locally)
- Platform: Mac Studio (M4 Max 64GB+ or M3 Ultra 96GB+)
- Goal: Produce a genuine BitNet ternary GGUF of GLM-4.7-Flash for kernel development, inference pipeline validation, and baseline quality measurement
- Deliverables:
- PT-BitNet ternary quantized GLM-4.7-Flash GGUF file (~6-7 GB)
- Absmean ternary quantizer implementation (~200-300 lines)
- IQ1_S / BITNET_T158 dequantization kernel in RuvLLM
- Baseline quality benchmarks (HumanEval, MMLU) to compare against Phase 1+
- Functional TL1 kernel validated against ternary model
- Expected quality: ~55-65% of GLM-4.7-Flash (adequate for kernel validation, not production)
- Key value: De-risks Phase 1 by validating the entire inference pipeline (GGUF loading → ternary dequant → TL1 kernel → MoE routing → token generation) at zero cost before committing to $1,300+ distillation training
- Why Mac Studio works: Phase 0 is PTQ (no training loop) — just load FP16 weights via mmap, compute absmean per block, round to ternary, export. The absmean computation is trivial math; the bottleneck is memory bandwidth, not compute. Calibration forward pass uses Metal GPU acceleration via existing Candle integration.
- Optional upgrade (0D): If 0C quality is too low for meaningful testing, apply BitDistill Lite (10B tokens, ~$300 cloud or ~$0 on Mac Studio over several weeks) to reach ~90-95% quality
Phase 0.5: RLM Post-Quantization Refinement (NEW — Mac Studio, $0)
- Timeline: 1-3 weeks (overlaps with Phase 0 kernel development)
- Cost: $0 (runs on Mac Studio, ~2-12 days training wall time with Metal; ~4-24 days SIMD-only)
- Platform: Mac Studio (same as Phase 0) — supports both Metal GPU and pure SIMD/CPU modes (see AD-20)
- Goal: Improve Phase 0 PTQ quality from ~55-65% to ~70-80% by training only the small FP16 components using the existing RLM stack — no traditional distillation, no cloud GPU
- Approach: Freeze ternary weights, train FP16 corrections using RLM components:
- MicroLoRA adapters (rank 1-2) on each expert FFN — adds small FP16 correction:
Y = BitLinear(X) + LoRA_B @ LoRA_A @ X - Router fine-tuning via ContrastiveTrainer — corrects misrouting caused by PTQ weight changes
- Scale factor optimization via GRPO rewards — per-block FP16 absmean scales are differentiable
- EWC++ regularization — prevents router fix from breaking already-good routing paths
- Quality tracking via MemoryDistiller — identifies worst-degraded experts for focused training
- Policy persistence via PolicyStore — stores optimized per-layer configurations
- MicroLoRA adapters (rank 1-2) on each expert FFN — adds small FP16 correction:
- Trainable parameters: ~200-400M (1-2% of 30B total) — router (~30M), MicroLoRA adapters (~50-100M), LM head (~150M), scale factors (~0.1M)
- Training data: 100M-500M tokens (sufficient for <400M trainable params)
- Throughput: ~500-1000 tok/s (Metal) or ~200-500 tok/s (NEON SIMD only) × 100M-500M tokens = 2-12 days (Metal) or 4-24 days (SIMD-only) on Mac Studio
- Deliverables:
- RLM-refined GGUF with ternary experts + optimized FP16 components
- MicroLoRA adapter weights (exportable, ~20-100 MB)
- Optimized router weights and scale factors
- Quality benchmarks showing improvement over Phase 0 baseline
- Expected quality: ~70-80% of GLM-4.7-Flash (up from ~55-65% Phase 0 PTQ)
- Key value: Gets a usable model on Mac Studio at $0 before committing to cloud GPU. If 70-80% quality is sufficient for the use case, Phase 1 cloud distillation may be deferred or skipped entirely.
- 100% RLM code reuse: MicroLoRA, TrainingPipeline, EwcRegularizer, GrpoOptimizer, ContrastiveTrainer, MemoryDistiller, PolicyStore — all production-tested, zero new training code needed
Phase 1: BitNet Expert Replacement (Option D)
- Timeline: 3-4 months
- Cost: ~$1,300-$2,000 (4× A100 spot, ~46 days)
- Goal: Full-quality ternary experts via distillation, validated against Phase 0/0.5 baselines
- Deliverables: Working Craftsman Ultra 30b 1bit (mixed: ternary experts, FP16 attention)
- Expected quality: ~90-95% of GLM-4.7-Flash on coding benchmarks
- Prerequisites: Phase 0 validates inference pipeline; Phase 0.5 provides quality baseline
Phase 2: Full BitNet Distillation (Option C)
- Timeline: 4-6 months after Phase 1
- Cost: ~$2,500-$5,000 (4× H100, 16-32 days)
- Goal: Full ternary model with complete BitNet inference optimization
- Deliverables: Craftsman Ultra 30b 1bit v2 (full ternary except router/embed/head)
- Expected quality: ~95-98% of GLM-4.7-Flash
Phase 3: Native BitNet Training (Option B)
- Timeline: 6-12 months after Phase 2, contingent on funding/compute
- Cost: ~$15,000-$30,000 (8× H100 cluster, 90-180 days)
- Goal: Surpass GLM-4.7-Flash quality with native ternary training
- Deliverables: Craftsman Ultra 30b 1bit v3 (trained from scratch)
- Expected quality: 100%+ of GLM-4.7-Flash (BitNet at scale exceeds FP16)
Architectural Decisions
AD-1: Ternary Weight Representation
Decision: Use BitNet b1.58 absmean quantization for weight ternary encoding.
W_ternary = RoundClip(W / (mean(|W|) + epsilon), -1, 1)
Each weight is one of {-1, 0, +1}, stored as 2-bit packed integers (I2_S format) in GGUF tensors. Per-block scale factor stored as FP16.
Storage format per block (256 elements):
- 64 bytes for ternary weights (2 bits × 256)
- 2 bytes for absmean scale (FP16)
- Total: 66 bytes / 256 weights = 2.06 bits/weight
AD-2: MoE Router Precision
Decision: MoE gating/routing network remains in FP16.
Rationale: Expert selection requires high-precision softmax scores to maintain routing quality. Quantizing the router to ternary would collapse expert selection, effectively turning a 30B model into a random-expert 3B model. The router is <0.1% of total parameters.
Components kept in FP16:
- Expert gating weights (router)
- Token embedding table
- LM head (output projection)
- RoPE frequency table
- LayerNorm/RMSNorm parameters
AD-3: Activation Quantization
Decision: INT8 per-token absmax quantization for activations flowing through BitLinear layers.
X_int8 = clamp(round(X * 127 / max(|X|)), -128, 127)
Rationale: Consistent with BitNet b1.58 specification. INT8 activations enable integer-only GEMM in expert forward passes. Attention activations remain in FP16/BF16 for KV cache compatibility.
AD-4: CPU Inference Kernel Strategy
Decision: Implement all three bitnet.cpp kernel types, with runtime selection based on hardware detection.
| Kernel | Target Hardware | Selection Criteria |
|---|---|---|
| I2_S | x86 AVX512, ARM SVE | Systems with wide SIMD and high bandwidth |
| TL1 | x86 AVX2, ARM NEON | General-purpose, balanced performance |
| TL2 | Memory-constrained | Systems with <16GB RAM or high cache pressure |
Implementation path: Adapt bitnet.cpp's kernel generation scripts (Python codegen) to produce Rust SIMD intrinsics compatible with RuvLLM's existing kernels/ module structure.
Key kernel operations:
- Pack ternary weights into 2-bit (I2_S) or LUT index (TL1: 4-bit, TL2: 5-bit)
- Generate lookup tables for activation sums at model load time
- Execute GEMM via table lookup + integer addition (no floating-point multiply)
- Accumulate in INT16 with pack-and-unpack technique (lossless, no quantization of partials)
- Dequantize output with per-block FP16 scale
AD-5: GGUF Tensor Format Extension
Decision: Extend RuvLLM's GGUF format with BitNet-specific metadata and a new BITNET_TERNARY quantization type.
New GGUF metadata keys:
craftsman.bitnet.version = 1
craftsman.bitnet.weight_encoding = "absmean_ternary"
craftsman.bitnet.activation_bits = 8
craftsman.bitnet.router_precision = "f16"
craftsman.bitnet.kernel_hint = "tl1" // preferred kernel
craftsman.moe.total_params = 30000000000
craftsman.moe.active_params = 3000000000
craftsman.moe.num_experts = <N>
craftsman.moe.active_experts = <K>
Tensor storage: Map to existing IQ1_S (type 19) for ternary expert weights, with additional metadata distinguishing post-training IQ1_S from native BitNet ternary. Alternatively, register a new type BITNET_T158 = 29 if the existing IQ1_S block format is incompatible with absmean-scale-per-block layout.
AD-6: RuvLLM Backend Integration
Decision: Create a new BitNetBackend alongside existing Candle and mistral-rs backends.
backends/
├── mod.rs // Backend trait + dispatch
├── candle_backend.rs // GPU (Metal/CUDA)
├── mistral_backend.rs // PagedAttention + ISQ
├── coreml_backend.rs // Apple Neural Engine
└── bitnet_backend.rs // NEW: CPU ternary inference
BitNetBackend responsibilities:
- Load GGUF with ternary tensor detection
- Initialize TL1/TL2/I2_S lookup tables per layer
- Execute MoE routing in FP16 → select active experts
- Run selected expert forward passes using ternary GEMM kernels
- Attention in FP16 (Phase 1) or ternary (Phase 2+)
- KV cache management (Q8 two-tier, existing infrastructure)
Backend trait compliance:
impl InferenceBackend for BitNetBackend {
fn load_model(&mut self, path: &Path, config: ModelConfig) -> Result<()>;
fn generate(&self, prompt: &str, params: GenerateParams) -> Result<Response>;
fn get_embeddings(&self, text: &str) -> Result<Vec<f32>>;
fn supports_architecture(&self, arch: &str) -> bool;
}
AD-7: MoE Forward Pass Pipeline
Decision: Split MoE forward pass into FP16 routing + ternary expert execution.
Input Token Embedding (FP16)
│
▼
┌─────────────────────────────────────────┐
│ For each transformer layer: │
│ │
│ 1. RMSNorm (FP16) │
│ 2. Self-Attention │
│ ├─ Q/K/V projection (Phase 1: FP16, │
│ │ Phase 2: Ternary)│
│ ├─ RoPE (FP16) │
│ ├─ Scaled dot-product attention │
│ └─ Output projection │
│ 3. RMSNorm (FP16) │
│ 4. MoE Block: │
│ ├─ Router (FP16 gating network) │
│ │ → Select top-K experts │
│ ├─ Expert FFN (TERNARY BitLinear) │
│ │ ├─ gate_proj: W_ternary @ X_int8│
│ │ ├─ up_proj: W_ternary @ X_int8│
│ │ ├─ SwiGLU activation │
│ │ └─ down_proj: W_ternary @ X_int8│
│ └─ Weighted sum of expert outputs │
│ 5. Residual connection │
└─────────────────────────────────────────┘
│
▼
LM Head (FP16) → Logits → Token
AD-8: SONA Integration for Ternary Adaptation
Decision: MicroLoRA adapters applied as FP16 deltas on top of ternary base weights.
Rationale: Ternary weights cannot be directly fine-tuned at inference time (gradient updates don't map to {-1, 0, +1}). Instead, SONA's MicroLoRA applies rank-1 FP16 adapters whose output is added to the ternary forward pass output:
Y = BitLinear(X) + LoRA_B @ LoRA_A @ X
Where BitLinear(X) uses ternary GEMM and LoRA_B @ LoRA_A @ X is a small FP16 correction. This preserves BitNet's efficiency for 99%+ of computation while enabling per-session adaptation.
AD-9: Memory Budget Analysis
Decision: Target <8GB model + 2GB KV cache = 10GB total for 4K context.
| Component | Precision | Size | Notes |
|---|---|---|---|
| Expert weights (28B params) | 1.58-bit | ~5.5 GB | 28B × 2.06 bits = ~7.2 GB raw, but only routing metadata for inactive experts |
| Shared layers (2B params) | FP16 | ~4 GB | Embeddings, LM head, router, norms |
| Expert routing tables | FP16 | ~50 MB | Gating network weights |
| TL1/TL2 lookup tables | INT16 | ~200 MB | Pre-computed at load time |
| KV cache (4K context) | Q8 | ~1.5 GB | Two-tier cache (hot FP16 + warm Q8) |
| MicroLoRA adapters | FP16 | ~10 MB | Rank-1, <1MB per target module |
| Total | — | ~7.8 GB | Fits in 16GB system with headroom |
Note: Full 30B ternary weights on disk are ~7.2 GB. At runtime, only active expert weights (~3B active) are in hot memory for any given token, with inactive expert pages memory-mapped and demand-loaded.
AD-10: Platform-Specific Kernel Dispatch
Decision: Runtime hardware detection drives kernel selection.
pub fn select_kernel(caps: &HardwareCaps) -> BitNetKernel {
if caps.has_avx512() {
BitNetKernel::I2S_AVX512
} else if caps.has_avx2() {
BitNetKernel::TL1_AVX2
} else if caps.has_neon() {
if caps.cache_size_l2 >= 2 * 1024 * 1024 {
BitNetKernel::TL1_NEON
} else {
BitNetKernel::TL2_NEON // memory-constrained
}
} else if caps.has_sse41() {
BitNetKernel::TL1_SSE41
} else {
BitNetKernel::I2S_Scalar // fallback
}
}
Integration: Leverages RuvLLM's existing autodetect.rs hardware capability detection module.
AD-11: GRPO-Guided Distillation Loss
Decision: Use GrpoOptimizer to compute per-expert reward scaling during knowledge distillation, replacing a traditional fixed-weight KD loss.
Rationale: Standard KD uses a static alpha to blend KL divergence and hard-label cross-entropy. GRPO adds a dynamic reward signal that upweights expert-student pairs where ternary output closely matches the teacher, and downweights divergent pairs. This is achieved by mapping each expert's teacher-vs-student output comparison to a SampleGroup:
Combined Loss = KD_base + GRPO_scale
Where:
KD_base = α * KL(teacher_logits/T, student_logits/T)
+ (1-α) * CE(labels, student_logits)
GRPO_scale = (1 + reward * 0.1)
reward = GrpoEvaluator.evaluate(student_expert_output, teacher_expert_output)
→ 1.0 when cosine_sim > 0.95
→ -0.5 when cosine_sim < 0.7
Key configuration (extending GrpoConfig::stable()):
GrpoConfig {
group_size: num_experts, // One group per MoE layer
learning_rate: 1e-6, // Conservative for distillation
kl_coefficient: 0.1, // Tight teacher adherence
kl_target: 0.02, // Low divergence target
clip_range: 0.1, // Narrow clipping for stability
normalize_advantages: true, // Normalize across experts in group
adaptive_kl: true, // Auto-adjust KL penalty
..GrpoConfig::stable()
}
Reused: GrpoOptimizer, GrpoConfig, SampleGroup, GrpoEvaluator from training/grpo.rs.
New: BitNetGrpoAdapter that maps expert forward pass outputs to GrpoSample structs.
AD-12: Contrastive Pre-Training for Expert Routing Validation
Decision: After ternary conversion of expert weights, use the existing ContrastiveTrainer to verify that MoE routing still selects the correct experts.
Rationale: Replacing expert FFN weights with ternary approximations changes the output distribution of each expert. If expert N's ternary output becomes more similar to expert M's output, the router may misroute tokens. Contrastive pre-training on expert embeddings detects and corrects this.
Approach:
- For each token in a calibration set, record which expert the teacher model's router selects
- Generate
TrainingTriplets: anchor = hidden state, positive = correct expert output, negative = wrong expert output - Use existing hard negative mining to find expert pairs that become confusable after ternary conversion
- Fine-tune the FP16 router gating weights using contrastive loss to restore correct expert selection
Reused: ContrastiveTrainer, ContrastiveConfig, TrainingTriplet from training/contrastive.rs.
New: ExpertTripletGenerator that produces triplets from MoE routing decisions.
AD-13: EWC++ Cross-Expert Stability During Sequential Distillation
Decision: Apply EwcRegularizer from lora/training.rs during sequential expert distillation to prevent catastrophic forgetting across experts.
Rationale: Distilling 30B MoE experts sequentially (expert 0, then 1, ..., then N) risks overwriting shared representations. EWC++ computes Fisher information diagonals for each expert's contribution to the shared attention layers, then regularizes subsequent expert distillation to not deviate from previously-learned important weights.
Configuration:
TrainingConfig {
ewc_lambda: 5000.0, // Higher than default (2000) for cross-expert stability
fisher_decay: 0.995, // Slower decay to preserve Fisher across expert phases
quality_threshold: 0.5, // Only learn from high-quality distillation samples
lr_schedule: LearningRateSchedule::Cosine,
warmup_steps: 500, // Longer warmup for 30B scale
..Default::default()
}
Concrete protection:
- After distilling expert 0: compute Fisher diagonal
F_0over validation set - When distilling expert 1: add penalty
ewc_lambda/2 * Σ F_0_i * (w_i - w*_0_i)² - Accumulate:
F_cumulative = fisher_decay * F_prev + (1-fisher_decay) * F_new
Reused: EwcRegularizer, TrainingPipeline, TrainingConfig, FisherDiagonal from lora/training.rs.
New: SequentialExpertDistiller that wraps EwcRegularizer across expert phases.
AD-14: Policy Store for Per-Layer Ternary Scale Tracking
Decision: Extend PolicyStore with a new PolicyType::TernaryScale to persist per-block absmean scale distributions and learned quantization decisions.
Rationale: Not all layers quantize equally well to ternary. Attention layers may need different scale clipping than FFN layers. The policy store enables the distillation pipeline to learn and persist per-layer quantization strategies that can be retrieved and applied in future distillation runs or model updates.
New policy type:
pub enum PolicyType {
Quantization,
Router,
Ewc,
Pattern,
TernaryScale, // NEW: Per-layer ternary quantization metadata
}
pub struct TernaryScalePolicy {
pub layer_idx: usize,
pub module: String, // "gate_proj", "up_proj", "down_proj", "q_proj", etc.
pub mean_absmean: f32, // Average scale factor across blocks
pub std_absmean: f32, // Variance in scale factors
pub sparsity: f32, // Fraction of zero weights
pub quality_vs_teacher: f32, // Cosine similarity to teacher output
pub distillation_loss: f32, // Final loss for this layer
pub recommended_block_size: usize, // 256 default, may vary
}
Reused: PolicyStore, PolicyEntry, PolicySource from policy_store.rs.
New: TernaryScalePolicy struct and PolicyType::TernaryScale variant.
AD-15: Memory Distillation for Training Quality Tracking
Decision: Log all distillation teacher-student comparisons as Trajectory objects in the ReasoningBank, enabling MemoryDistiller to extract KeyLessons about which layers, experts, and configurations produce the best ternary quality.
Rationale: Distillation is iterative — understanding which experts converge quickly, which resist ternary conversion, and what scale distributions correlate with quality enables intelligent scheduling of future distillation runs.
Mapping:
| ReasoningBank Concept | Distillation Mapping |
|---|---|
Trajectory |
One expert's distillation run (N steps) |
Verdict |
Success if cosine_sim > 0.9, Failure if < 0.7 |
PatternCategory |
Expert index + layer type (e.g., "expert_3_gate_proj") |
KeyLesson |
"Expert 7 gate_proj converges fastest with lr=2e-6 and block_size=128" |
CompressedTrajectory |
Summary of entire expert distillation phase |
Reused: MemoryDistiller, DistillationConfig, CompressedTrajectory, KeyLesson from reasoning_bank/distillation.rs.
New: DistillationTrajectoryRecorder that adapts expert training steps to Trajectory format.
AD-16: Distillation Pipeline Composition
Decision: Compose the full Craftsman Ultra distillation pipeline from existing RLM components wired through a new CraftsmanDistiller orchestrator.
Pipeline architecture:
┌─────────────────────────────────────────────────────────────────┐
│ CraftsmanDistiller (NEW orchestrator) │
│ │
│ ┌───────────────┐ ┌──────────────────┐ ┌──────────────┐ │
│ │ TeacherModel │───▶│BitLinearTrainer │───▶│ GGUFExporter │ │
│ │(GLM-4.7-Flash)│ │(NEW: STE+shadow) │ │(REUSED) │ │
│ └───────┬───────┘ └────────┬─────────┘ └──────────────┘ │
│ │ │ │
│ │ ┌─────────────────┼─────────────────┐ │
│ │ │ │ │ │
│ ▼ ▼ ▼ ▼ │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────────┐ │
│ │GrpoOptimizer │ │EwcRegularizer│ │ContrastiveTrainer│ │
│ │(REUSED) │ │(REUSED) │ │(REUSED) │ │
│ │Per-expert │ │Cross-expert │ │Router validation │ │
│ │reward scaling│ │stability │ │post-ternary │ │
│ └──────┬───────┘ └──────┬───────┘ └────────┬─────────┘ │
│ │ │ │ │
│ ▼ ▼ ▼ │
│ ┌──────────────────────────────────────────────────────┐ │
│ │ Quality Feedback Loop │ │
│ │ │ │
│ │ MemoryDistiller ──▶ KeyLesson extraction │ │
│ │ PolicyStore ──▶ TernaryScale persistence │ │
│ │ (BOTH REUSED) │ │
│ └──────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────┘
Net-new code: BitLinearTrainer (STE + shadow weights), CraftsmanDistiller (orchestrator)
Reused code: GrpoOptimizer, EwcRegularizer, ContrastiveTrainer, MemoryDistiller,
PolicyStore, GGUFExporter, TrainingConfig, LR schedules
Reuse ratio: ~70% existing / ~30% new
Optimization: Expert-Parallel Distillation
Experts are independent during forward pass. Distill multiple experts concurrently across CPU cores:
// Distill experts in parallel (independent FFN weights)
let expert_results: Vec<DistillResult> = experts
.par_iter() // rayon parallel iterator
.enumerate()
.map(|(idx, expert)| {
let mut trainer = BitLinearTrainer::new(expert, &teacher_expert[idx]);
let mut ewc = EwcRegularizer::new_with_fisher(cumulative_fisher[idx]);
let mut grpo = GrpoOptimizer::new(GrpoConfig::stable());
for batch in dataset.batches() {
let student_out = trainer.forward_ternary(batch);
let teacher_out = teacher.forward_expert(idx, batch);
let reward = grpo.evaluate(&student_out, &teacher_out);
let kd_loss = kd_loss_fn(&student_out, &teacher_out, alpha, temperature);
let ewc_penalty = ewc.penalty(&trainer.shadow_weights());
let total_loss = kd_loss * reward.scale() + ewc_penalty;
trainer.backward_ste(total_loss);
}
ewc.update_fisher(&trainer); // Update Fisher for next expert
DistillResult { idx, weights: trainer.export_ternary(), fisher: ewc.fisher() }
})
.collect();
AD-17: Training Infrastructure — Cloud GPU over Local SIMD
Decision: Use Google Cloud A100/H100 GPU instances for distillation training. Reserve local CPU/SIMD for inference validation, MicroLoRA adaptation, and GGUF export only.
Rationale: Local CPU/SIMD training is mathematically infeasible at the 200B+ token scale required for expert distillation. The existing RuvLLM SIMD kernels (kernels/) are inference-only — no backpropagation or gradient computation. The training code (real_trainer.rs:178-184) supports Metal (macOS) or CPU but not CUDA, and CPU throughput at ~50-100 tok/s training would require ~65 years for 200B tokens.
Memory analysis (per-expert distillation):
| Component | Size | Notes |
|---|---|---|
| Single expert FFN shadow weights (FP16) | ~2 GB | ~1B params per expert (28B ÷ N experts) |
| Gradients (FP32) | ~4 GB | Full precision for STE backprop |
| AdamW optimizer state (2× FP32) | ~8 GB | First + second moment |
| Teacher activations cache | ~1 GB | Per-batch FP16 |
| EWC++ Fisher diagonal | ~0.5 GB | Per-expert accumulated |
| Per-expert total | ~15.5 GB | Fits in A100 40GB with headroom |
Full model simultaneous (Phase 2+):
| Component | Size | Notes |
|---|---|---|
| 30B shadow weights (FP16) | ~60 GB | Requires A100 80GB or H100 |
| Gradients + optimizer | ~360 GB | Requires multi-GPU parallelism |
| Total | ~430 GB | 4× A100 80GB or 4× H100 80GB |
Throughput and cost comparison:
| Platform | Training tok/s | Time (200B tok, Phase 1) | Cost | Phase 0 PTQ? | Phase 0.5 RLM? |
|---|---|---|---|---|---|
| Mac Studio M4 Max (Metal) | ~500-1000 | ~6.5 years | N/A | Yes — 1-4 hrs, $0 | Yes — 2-12 days, $0 |
| Mac Studio M4 Max (NEON SIMD only, no Metal) | ~200-500 | ~13 years | N/A | Yes — 2-6 hrs, $0 | Yes — 4-24 days, $0 |
| Mac Studio M3 Ultra (Metal) | ~800-1500 | ~4.2 years | N/A | Yes — 1-1.5 hrs, $0 | Yes — 1.5-8 days, $0 |
| Mac Studio M3 Ultra (NEON SIMD only, no Metal) | ~300-700 | ~9 years | N/A | Yes — 1.5-3 hrs, $0 | Yes — 3-16 days, $0 |
| CPU AVX2 (Ryzen 9) — scalar fallback | ~50-150 | ~43-130 years | N/A | Yes — 2-6 hrs, $0 | Yes — 14-58 days, $0 |
| 1× A100 80GB (GCP on-demand) | ~15,000 | ~155 days | ~$3,700 | Yes — 30 min, ~$5 | Overkill |
| 4× A100 80GB (GCP on-demand) | ~50,000 | ~46 days | ~$4,400 | Overkill for PTQ | Overkill |
| 4× A100 80GB (GCP spot) | ~50,000 | ~46 days | ~$1,300 | Overkill for PTQ | Overkill |
| 1× H100 (DataCrunch) | ~40,000 | ~58 days | ~$2,900 | Overkill for PTQ | Overkill |
| 4× H100 (DataCrunch) | ~140,000 | ~16 days | ~$3,200 | Overkill for PTQ | Overkill |
Key insight: Mac Studio is infeasible for Phase 1+ training (years of wall time) but ideal for Phase 0 PTQ (hours, $0). This separation justifies the phased approach.
Recommended infrastructure per phase:
| Phase | Instance | Duration | Estimated Cost | Strategy |
|---|---|---|---|---|
| Phase 0 (PTQ) | Mac Studio (M4 Max/M3 Ultra) | 1-4 hours | $0 | Mmap FP16 weights → absmean quantize → export GGUF; Metal GPU for calibration pass |
| Phase 0D (BitDistill Lite, 10B tok) | Mac Studio Metal or 1× A100 spot | 2-4 weeks (local) / 1-2 days (cloud) | $0 (local) / ~$300 (cloud) | Optional quality upgrade if Phase 0C too degraded |
| Phase 0.5 (RLM refinement, Metal) | Mac Studio (Metal) | 3-14 days | $0 | MicroLoRA + router fix + scale opt using existing RLM stack |
| Phase 0.5 (RLM refinement, SIMD-only) | Mac Studio (NEON CPU) | 5-28 days | $0 | Same pipeline, no Metal required — pure ndarray + NEON SIMD (see AD-20) |
| Phase 1 (expert FFN, 200B tok) | 4× A100 80GB spot (GCP) | ~46 days | $1,300-$2,000 | Per-expert sequential with EWC++; each expert fits 1 GPU |
| Phase 1 (router validation) | Mac Studio Metal or 1× A100 | ~2-4 hours | $0 (local) / <$10 (cloud) | Contrastive training on router only (~2B params) |
| Phase 2 (full ternary, 500B tok) | 4× H100 (DataCrunch) | ~16-32 days | $2,500-$5,000 | All layers; model-parallel across GPUs |
| Phase 3 (native training, 4T tok) | 8× H100 cluster | ~90-180 days | $15,000-$30,000 | Full from-scratch; depends on funding |
| Inference validation | Mac Studio (NEON) | Continuous | $0 | TL1/TL2 kernel testing on ARM NEON |
| MicroLoRA adaptation | Mac Studio | <1ms/update | $0 | Existing ndarray-based EWC++ pipeline |
Required code change: Add CUDA device dispatch to RealContrastiveTrainer:
// Current (real_trainer.rs:178-184):
let device = if config.use_metal {
Device::new_metal(0).unwrap_or(Device::Cpu)
} else {
Device::Cpu
};
// Required for cloud GPU training:
let device = if config.use_cuda {
Device::new_cuda(config.cuda_device_id).unwrap_or(Device::Cpu)
} else if config.use_metal {
Device::new_metal(0).unwrap_or(Device::Cpu)
} else {
Device::Cpu
};
This is a single-line addition to RealTrainingConfig (use_cuda: bool, cuda_device_id: usize) and a 3-line change to device selection. The rest of the Candle training pipeline (tensors, optimizer, loss computation) works identically across CPU/Metal/CUDA.
Cost optimization strategies:
- Spot instances: GCP A100 spot at ~$1/GPU-hr (70% off on-demand) — requires checkpointing every 30 min
- DataCrunch / Lambda Labs: H100 at $1.99-$2.10/hr (40-50% below GCP on-demand)
- Expert-sequential on fewer GPUs: Distill 1 expert at a time on 1× A100 80GB (~$1.50/hr), increasing wall time but reducing per-hour cost
- Mixed precision training: FP16 shadow weights + BF16 activations reduces memory, enabling smaller instances
- Gradient checkpointing: Trade compute for memory to fit on fewer GPUs
AD-18: Phase 0 — PT-BitNet Post-Training Quantization on Mac Studio
Decision: Implement a PT-BitNet ternary post-training quantizer as Phase 0, running entirely on a local Mac Studio, producing a rapid prototype GGUF for inference pipeline validation before investing in full distillation.
Rationale: The original Option A ("Rejected") assumed only generic IQ1_S quantization, which produces garbled outputs at 1.56 bpw. However, PT-BitNet (2025) demonstrates that applying BitNet's native absmean ternary quantization to pre-trained weights with calibration data achieves significantly better results (61% downstream at 70B) than generic codebook PTQ. This produces genuine BitNet ternary format that enables multiplication-free inference with TL1/TL2 kernels — unlike IQ1_S which still requires dequant-then-multiply.
Target platform: Mac Studio (Apple Silicon)
Phase 0 is pure quantization (no training loop), making it ideal for local execution on Mac Studio:
| Config | Unified RAM | FP16 Load | PTQ? | Calibration? | Notes |
|---|---|---|---|---|---|
| M4 Max 36GB | 36 GB | mmap (demand-paged) | Yes | Slow (paging) | Minimum viable; mmap means only active tensor pages in RAM |
| M4 Max 64GB | 64 GB | Fits with mmap assist | Yes | Yes | Comfortable for PTQ; calibration may page |
| M4 Max 128GB | 128 GB | Fits entirely | Yes | Yes | Ideal — FP16 model (60GB) + ternary output (7GB) + calibration buffers all in RAM |
| M3 Ultra 96GB | 96 GB | Fits entirely | Yes | Yes | Good headroom |
| M3 Ultra 192GB+ | 192+ GB | Fits entirely | Yes | Yes | Ample room for full model + calibration + inference validation |
Why Mac Studio works for Phase 0 (but not Phase 1+):
- PTQ is not training: No gradient computation, no optimizer state, no backpropagation — just load → quantize → export
- Memory-mapped I/O: FP16 weights can be mmap'd from disk; only the current tensor's pages need to be in RAM
- Per-tensor processing: Quantize one tensor at a time (read FP16 block → compute absmean → round to ternary → write output) — working memory is ~2-4 MB per tensor regardless of total model size
- Metal GPU for calibration: RuvLLM's existing
RealContrastiveTrainerandkernels/matmul.rssupport Metal via Candle (use_metal: truedefault, 3x speedup on M4 Pro GEMV) - ARM NEON for TL1 kernels: Mac Studio's Apple Silicon has NEON SIMD — the same target ISA as the TL1 kernel for ternary inference validation
- Phase 1 still needs cloud GPU: 200B token distillation at ~500-1000 tok/s (Metal) = ~6.5 years locally vs ~46 days on 4× A100
Estimated Phase 0 wall time on Mac Studio:
| Step | M4 Max 128GB | M4 Max 64GB | M3 Ultra 192GB |
|---|---|---|---|
| Download GLM-4.7-Flash FP16 (~60GB) | ~30 min (1Gbps) | ~30 min | ~30 min |
| Absmean ternary quantization | ~5-15 min | ~10-30 min (paging) | ~5-10 min |
| Calibration pass (1000 samples, Metal) | ~30-60 min | ~60-120 min | ~20-40 min |
| GGUF export | ~2-5 min | ~2-5 min | ~2-5 min |
| TL1 kernel validation inference | ~10-20 min | ~10-20 min | ~10-20 min |
| Total | ~1-2 hours | ~2-4 hours | ~1-1.5 hours |
Implementation approach:
Phase 0 Pipeline (runs on Mac Studio):
1. Load GLM-4.7-Flash FP16/BF16 weights via mmap
2. For each linear layer in expert FFNs:
a. Compute gamma = mean(|W|) (absmean scale)
b. W_ternary = RoundClip(W / (gamma + epsilon), -1, 1)
c. Store: 2-bit packed ternary weights + FP16 scale per block
3. Calibration pass (optional, improves quality, uses Metal GPU):
a. Run ~1000 calibration samples through teacher model
b. Record activation statistics per layer
c. Optimize scale factors to minimize MSE between teacher and ternary outputs
4. Export to GGUF with BITNET_T158 tensor type + metadata
5. Validate: load in BitNetBackend → TL1 NEON kernel → generate tokens
Absmean ternary quantizer (core algorithm):
Input: W ∈ R^{m×n} (FP16 weight matrix)
Output: W_t ∈ {-1,0,+1}^{m×n}, scale ∈ R (per-block FP16)
For each block of 256 elements:
1. gamma = mean(|block|) + 1e-8
2. normalized = block / gamma
3. ternary = round(clamp(normalized, -1, 1)) → {-1, 0, +1}
4. Pack: 2 bits per weight (00=-1, 01=0, 10=+1)
5. Store scale = gamma as FP16
What stays FP16 (same as AD-2):
- MoE router gating weights
- Token embeddings + LM head
- RoPE frequencies
- LayerNorm/RMSNorm parameters
RuvLLM implementation gaps to fill:
| Gap | Effort | Details |
|---|---|---|
| Absmean ternary quantizer | ~200-300 lines | New function in gguf/quantization.rs or new module |
| IQ1_S / BITNET_T158 dequantization | ~80-120 lines | Add to dequantize_tensor match arm (currently falls to error at line 358) |
| GGUF export with ternary metadata | ~100-150 lines | Extend GgufExportResult with BitNet metadata keys from AD-5 |
| TL1 kernel smoke test | ~200 lines | Validate ternary GEMM produces correct output on PTQ model |
Total new code: ~600-800 lines (vs ~15,000+ for Phase 1 full distillation pipeline)
Quality expectations (conservative estimates for GLM-4.7-Flash 30B-A3B):
| Benchmark | FP16 Baseline | Phase 0 PTQ (est.) | Phase 1 Distill (est.) |
|---|---|---|---|
| HumanEval pass@1 | ~65% | ~35-45% | ~55-60% |
| MMLU | ~75% | ~45-55% | ~65-70% |
| SWE-bench Verified | 59.2% | ~25-35% | ~50-55% |
| LiveCodeBench v6 | 64.0% | ~30-40% | ~55-60% |
Why Phase 0 quality is still useful:
- Kernel validation: Ternary GEMM correctness doesn't depend on model quality
- Memory profiling: Real-world memory usage measurement with actual MoE activation patterns
- Throughput benchmarking: Measure real tok/s with TL1/TL2/I2_S kernels on target hardware
- Pipeline testing: End-to-end GGUF load → inference → token output
- Baseline measurement: Quantitative quality floor establishes improvement target for Phase 1
- Cost: $0 on Mac Studio vs ~$1,300 for Phase 1 — validates infrastructure at zero cost before committing to cloud GPU
Key configuration:
pub struct PtBitnetConfig {
pub calibration_samples: usize, // 1000 default (WikiText-2 or code corpus)
pub block_size: usize, // 256 (matches AD-1)
pub optimize_scales: bool, // true: MSE-optimized scales; false: raw absmean
pub layers_to_quantize: LayerMask, // ExpertsOnly (Phase 0) or All (future)
pub export_format: TernaryFormat, // BitnetT158 (native) or IQ1S (llama.cpp compat)
pub router_precision: Precision, // FP16 (always, per AD-2)
pub use_mmap: bool, // true: memory-map FP16 weights (required for <128GB systems)
pub use_metal_calibration: bool, // true: Metal GPU for calibration pass (Mac Studio)
pub max_memory_gb: Option<f32>, // Cap memory usage; enables streaming quantization
}
Reused: GGUF parser, tensor metadata, GgufQuantType enum, export pipeline.
New: PtBitnetQuantizer, absmean_ternary(), BITNET_T158 dequantization kernel.
AD-19: Phase 0.5 — RLM Post-Quantization Refinement (No Traditional Training)
Decision: Use the existing RLM training stack to refine the Phase 0 PTQ model on Mac Studio by training only the small FP16 components (~1-2% of parameters), freezing ternary weights. This replaces traditional distillation for the rapid prototype phase.
Rationale: Traditional knowledge distillation (Phase 1) requires shadow weights, straight-through estimator, and GPU-scale compute to modify the ternary weights themselves. However, the Phase 0 PTQ model already has ternary weights — the quality loss comes from:
- Sub-optimal per-block scale factors (absmean is a rough approximation)
- MoE router misrouting tokens to wrong experts (expert output distributions changed)
- No adaptation to ternary output characteristics
All three can be addressed by training only the FP16 components using the existing RLM stack, without touching the ternary weights.
What gets trained (FP16, differentiable) vs frozen (ternary, not differentiable):
| Component | Params | Size | Trainable? | Training Method |
|---|---|---|---|---|
| Expert FFN ternary weights | ~28B | ~5.5 GB | Frozen | N/A — {-1,0,+1} not differentiable |
| MicroLoRA adapters (rank-2, per expert FFN) | ~50-100M | ~100-200 MB | Yes | TrainingPipeline + EwcRegularizer |
| MoE router gating weights | ~30M | ~60 MB | Yes | ContrastiveTrainer (triplet + InfoNCE) |
| Per-block absmean scale factors | ~0.1M | ~200 KB | Yes | GRPO reward-guided optimization |
| LM head (output projection) | ~150M | ~300 MB | Yes (optional) | Standard fine-tuning |
| Attention Q/K/V/O (FP16) | ~2B | ~4 GB | Optional | Can add LoRA here too if budget allows |
| Total trainable | ~200-400M | ~400-800 MB | ~1-2% of 30B total |
Why RLM works here (vs traditional distillation):
| Property | Traditional KD (Phase 1) | RLM Refinement (Phase 0.5) |
|---|---|---|
| Modifies ternary weights | Yes (shadow weights + STE) | No (frozen) |
| Trainable params | ~28B (all expert weights) | ~200-400M (1-2%) |
| Training tokens needed | 200B | 100M-500M (400x less) |
| GPU requirement | 4× A100 ($1,300+) | Mac Studio Metal ($0) |
| Training time | ~46 days (cloud) | 2-12 days (local) |
| Quality target | ~90-95% of FP16 | ~70-80% of FP16 |
| New code required | ~15,000 lines (BitLinear, STE, orchestrator) | ~0 lines (100% RLM reuse) |
RLM component mapping:
┌──────────────────────────────────────────────────────────────────┐
│ Phase 0.5: RLM Refinement Pipeline │
│ (100% existing RLM code, 0% new training code) │
│ │
│ Frozen Ternary Model (Phase 0 PTQ output) │
│ ┌────────────────────────────────────────────┐ │
│ │ Expert FFNs: {-1,0,+1} weights (FROZEN) │ │
│ │ Router: FP16 gating (TRAINABLE) │ │
│ │ Attention: FP16 (TRAINABLE via LoRA opt.) │ │
│ │ Scales: FP16 per-block (TRAINABLE) │ │
│ └────────────────────────────────────────────┘ │
│ │ │
│ ┌─────▼──────────────────────────────────────────┐ │
│ │ Step 1: Router Repair │ │
│ │ ContrastiveTrainer (REUSED, contrastive.rs) │ │
│ │ • Generate triplets: anchor=hidden, +correct │ │
│ │ expert, -wrong expert │ │
│ │ • Triplet + InfoNCE loss on FP16 router │ │
│ │ • Fix misrouting from PTQ weight changes │ │
│ │ Training: ~10M tokens, ~1-2 hours (Metal) │ │
│ └─────┬──────────────────────────────────────────┘ │
│ │ │
│ ┌─────▼──────────────────────────────────────────┐ │
│ │ Step 2: MicroLoRA Injection + Training │ │
│ │ TrainingPipeline + MicroLoRA (REUSED, │ │
│ │ lora/training.rs + lora/micro_lora.rs) │ │
│ │ • Rank-2 LoRA per expert FFN: Y = BitLinear(X) │ │
│ │ + LoRA_B @ LoRA_A @ X │ │
│ │ • Loss: MSE(teacher_output, student+LoRA) │ │
│ │ • EWC++ across expert phases │ │
│ │ Training: ~100-500M tokens, ~2-12 days (Metal) │ │
│ └─────┬──────────────────────────────────────────┘ │
│ │ │
│ ┌─────▼──────────────────────────────────────────┐ │
│ │ Step 3: Scale Factor + Quality Optimization │ │
│ │ GrpoOptimizer (REUSED, grpo.rs) │ │
│ │ • Per-expert output quality → reward signal │ │
│ │ • Optimize FP16 scale factors to maximize │ │
│ │ cosine similarity with teacher output │ │
│ │ • Adaptive KL prevents over-correction │ │
│ │ Training: concurrent with Step 2 │ │
│ └─────┬──────────────────────────────────────────┘ │
│ │ │
│ ┌─────▼──────────────────────────────────────────┐ │
│ │ Feedback Loop │ │
│ │ MemoryDistiller → KeyLessons (REUSED) │ │
│ │ PolicyStore → TernaryScale policies (REUSED) │ │
│ │ • Track which experts improve most │ │
│ │ • Store optimized configs for reproducibility │ │
│ └────────────────────────────────────────────────┘ │
└──────────────────────────────────────────────────────────────────┘
Memory budget on Mac Studio during Phase 0.5 training:
| Component | Size | Notes |
|---|---|---|
| PTQ ternary model (mmap) | ~7 GB disk / ~3-7 GB RAM | Demand-paged; only active expert pages in RAM |
| Teacher FP16 model (mmap) | ~60 GB disk / ~4-8 GB RAM | Only forward pass activations; demand-paged |
| MicroLoRA adapters (rank-2) | ~200 MB | All experts in RAM |
| LoRA gradients + optimizer (AdamW 2×FP32) | ~1.5 GB | For ~400M trainable params |
| EWC++ Fisher diagonal | ~200 MB | Per-expert accumulated |
| KV cache + activations | ~2 GB | Calibration/training forward pass |
| Total active RAM | ~12-20 GB | Fits in any Mac Studio config |
Key insight: The teacher model is only needed for forward pass (no gradients), so it can be mmap'd and demand-paged. The ternary student is similarly mmap'd. Only the ~400M trainable parameters and their optimizer state need to be fully in RAM (~2 GB), which fits comfortably in even the 36GB M4 Max.
Training schedule on Mac Studio M4 Max 128GB:
| Step | Tokens | Wall Time | What Changes |
|---|---|---|---|
| Router repair | ~10M | ~3-6 hours | FP16 router gating weights |
| LoRA training (per-expert, sequential) | ~100-500M | 2-12 days | MicroLoRA A/B matrices per expert FFN |
| Scale optimization | ~10M | ~3-6 hours | Per-block FP16 absmean scales |
| Validation + export | — | ~1-2 hours | Benchmark + GGUF re-export |
| Total | ~120-520M | ~3-14 days |
Expected quality improvement:
| Benchmark | Phase 0 PTQ | Phase 0.5 RLM | Phase 1 Distill | FP16 Baseline |
|---|---|---|---|---|
| HumanEval pass@1 | ~35-45% | ~45-55% | ~55-60% | ~65% |
| MMLU | ~45-55% | ~55-65% | ~65-70% | ~75% |
| SWE-bench Verified | ~25-35% | ~35-45% | ~50-55% | 59.2% |
The question "can I use RLM rather than traditional training" is answered YES — with the critical caveat that RLM refinement trains the FP16 corrections around frozen ternary weights, not the ternary weights themselves. This is fundamentally different from traditional distillation but achieves meaningful quality recovery (estimated +10-15 percentage points) at zero cost.
Reused (100%): MicroLoRA, TrainingPipeline, EwcRegularizer, GrpoOptimizer, ContrastiveTrainer, MemoryDistiller, PolicyStore, TrainingConfig, LR schedules, GGUF export.
New (0%): No new training code. The only new code is a thin RlmRefiner orchestrator (~200-300 lines) that wires the existing components together for the Phase 0.5 pipeline.
AD-20: Phase 0.5 — SIMD-Only Training Mode (No Metal GPU Required)
Decision: Phase 0.5 RLM refinement supports a pure SIMD/CPU execution mode with no Metal GPU dependency. Metal is an optional acceleration path (~2-3x faster) but not required.
Rationale: Analysis of the RLM training stack reveals that Metal GPU is used by only one component (RealContrastiveTrainer via Candle), while all other training components are pure ndarray/CPU. Since Phase 0.5 uses the lightweight ContrastiveTrainer (not RealContrastiveTrainer) for router repair, and all gradient computation is ndarray-based, the entire pipeline runs on pure CPU with SIMD acceleration for inference forward passes.
Component-by-component GPU dependency analysis:
| Component | Source | GPU Dependency | SIMD-Only Mode |
|---|---|---|---|
MicroLoRA.forward_simd() |
lora/micro_lora.rs:279 |
None — ARM NEON intrinsics with scalar fallback | NEON on aarch64, scalar on x86 |
MicroLoRA.apply_gradients() |
lora/micro_lora.rs:621+ |
None — pure ndarray | Works everywhere |
MicroLoRA.apply_gradients_with_ewc() |
lora/micro_lora.rs:621+ |
None — pure ndarray | Works everywhere |
TrainingPipeline |
lora/training.rs |
None — pure ndarray CPU | Works everywhere |
EwcRegularizer |
lora/training.rs |
None — pure ndarray CPU | Works everywhere |
GrpoOptimizer |
training/grpo.rs |
None — pure ndarray CPU | Works everywhere |
ContrastiveTrainer |
training/contrastive.rs:169-175 |
Optional — use_metal: true default, but Device::new_metal(0).unwrap_or(Device::Cpu) fallback |
Set use_metal: false for CPU-only; also has non-Candle pure CPU path (line 475) |
MemoryDistiller |
reasoning_bank/distillation.rs |
None — pure Rust | Works everywhere |
PolicyStore |
policy_store.rs |
None — pure Rust | Works everywhere |
RealContrastiveTrainer |
training/real_trainer.rs:178 |
Yes — Metal/Candle | NOT used in Phase 0.5 (used in full distillation only) |
Inference forward pass (for loss computation) SIMD support:
| Kernel | NEON (aarch64) | x86 | Source |
|---|---|---|---|
| GEMM | gemm_neon |
gemm_scalar fallback |
kernels/matmul.rs:520 |
| GEMV | gemv_neon |
gemv_scalar fallback |
kernels/matmul.rs:184 |
| SiLU | silu_neon_impl (~3.5x speedup) |
scalar fallback | kernels/activations.rs |
| GeLU | gelu_neon_impl (~3.2x speedup) |
scalar fallback | kernels/activations.rs |
| ReLU | relu_neon_impl (~4.0x speedup) |
scalar fallback | kernels/activations.rs |
| RMSNorm | rms_norm_neon |
scalar fallback | kernels/norm.rs |
| RoPE | apply_rope_neon |
scalar fallback | kernels/rope.rs |
| Softmax | softmax_neon (~2.8x speedup) |
scalar fallback | kernels/activations.rs |
Key observation: The matmul kernels only dispatch on target_arch = "aarch64" vs scalar. There are no explicit AVX2 or AVX512 SIMD implementations for x86 in the current kernel codebase. This means:
- Apple Silicon (aarch64): Full NEON SIMD acceleration — primary target for SIMD-only mode
- x86 (AMD/Intel): Falls to scalar fallback — works but ~3-5x slower than NEON
- Future opportunity: Adding AVX2/AVX512 kernels to
matmul.rswould make x86 competitive with NEON
Throughput comparison for Phase 0.5 (100M tokens, ~200-400M trainable params, 3B active forward):
| Execution Mode | Forward tok/s | Effective Training tok/s | 100M Tokens | 500M Tokens |
|---|---|---|---|---|
| Metal GPU (M4 Max) | ~500-1500 | ~300-700 | ~2-4 days | ~8-19 days |
| NEON SIMD only (M4 Max CPU) | ~200-500 | ~100-300 | ~4-12 days | ~19-58 days |
| NEON SIMD only (M3 Ultra CPU) | ~300-700 | ~150-400 | ~3-8 days | ~14-39 days |
| x86 scalar (Ryzen 9, no AVX2 kernels) | ~50-150 | ~30-80 | ~14-39 days | ~72-193 days |
Why SIMD-only is ~2-3x slower than Metal (not 10x):
- Phase 0.5 training is dominated by the forward pass through the frozen 3B active parameters to compute loss against the teacher
- The forward pass uses SIMD-accelerated GEMM/GEMV (
gemm_neon/gemv_neon) which gets ~60-70% of Metal throughput for these matrix sizes - Gradient computation for the ~200-400M trainable params is pure ndarray — identical speed regardless of Metal availability
- The training bottleneck is I/O (loading teacher activations from mmap) not compute, further narrowing the gap
Platform portability (bonus of SIMD-only mode):
SIMD-only mode extends Phase 0.5 beyond Mac Studio to any platform with ndarray support:
| Platform | SIMD Path | Effective tok/s | Feasible? |
|---|---|---|---|
| Mac Studio M4 Max (aarch64) | NEON intrinsics | ~100-300 | Yes — primary target |
| Mac Studio M3 Ultra (aarch64) | NEON intrinsics | ~150-400 | Yes — faster than M4 Max |
| Linux ARM64 (Ampere/Graviton) | NEON intrinsics | ~80-200 | Yes — cloud ARM instances |
| Linux x86 (Ryzen/Xeon) | Scalar fallback | ~30-80 | Marginal — 100M tokens feasible (~14-39 days), 500M not practical |
| macOS Intel | Scalar fallback | ~20-50 | Not recommended |
Configuration for SIMD-only mode:
// Phase 0.5 SIMD-only config (no Metal)
let contrastive_config = ContrastiveConfig {
use_metal: false, // Force CPU path in ContrastiveTrainer
..Default::default()
};
// MicroLoRA — already pure SIMD/ndarray, no config change needed
// TrainingPipeline — already pure ndarray
// GrpoOptimizer — already pure ndarray
// EwcRegularizer — already pure ndarray
The only config change is ContrastiveTrainer.use_metal = false. All other RLM components are GPU-agnostic by design.
SIMD-only Phase 0.5 exit criteria (in addition to standard Phase 0.5 criteria):
- All training completes without Metal GPU dependency
ContrastiveTrainerruns withuse_metal: falseand produces equivalent router accuracy- MicroLoRA
forward_simd()executes NEON path on aarch64 (verified viacfgcompile check) - Training throughput measured and documented for SIMD-only vs Metal comparison
Recommendation: Use Metal when available (2-3x faster), fall back to SIMD-only when Metal is unavailable or on non-Mac platforms. The training code requires zero changes — only ContrastiveTrainer.use_metal needs to be set to false.
Reused: 100% of existing RLM stack — MicroLoRA NEON forward, ndarray training, ContrastiveTrainer CPU fallback, all existing SIMD kernels.
New: 0 lines. SIMD-only mode is already supported by the existing code paths; AD-20 documents this capability explicitly.
AD-21: Native Rust Ternary Kernels with WASM Target (bitnet.cpp Port Strategy)
Decision: Port bitnet.cpp's ternary inference kernels (TL1, TL2, I2_S) to native Rust with dual compilation targets: native SIMD (NEON/AVX2/AVX512) and WebAssembly SIMD128. This replaces the original AD-4 strategy of Python codegen → Rust intrinsics with a pure Rust implementation that leverages existing open-source work.
Rationale: Three significant developments change the AD-4 implementation calculus:
-
R3-Engine (https://github.com/r3-engine/r3-engine) — A pure Safe Rust BitNet inference engine achieving 80-117 tok/s single-threaded on Ryzen 9950X3D, with native WASM SIMD128 cross-compilation. Uses bit-sliced ternary matrices with AVX-512 VPOPCNTDQ, zero-copy mmap, and zero heap allocations during generation.
-
bitnet.rs (https://github.com/ocentra/bitnet.rs) — Pure Rust BitNet toolkit with conversion, inference, training, and streaming. Apache 2.0 license. GPU path via WGSL/wgpu (Vulkan/Metal/DX12). Dedicated
bitnet-wasmcrate for browser deployment. -
WASM SIMD128 maturity — Fixed-width 128-bit SIMD now supported in all major browsers (Chrome, Firefox, Safari, Edge). Rust's
core::arch::wasm32provides direct intrinsic access viasimd128LLVM feature flag.
Comparison of approaches:
| Approach | Native Performance | WASM Support | Safety | Integration Effort | Code Reuse |
|---|---|---|---|---|---|
| A: Python codegen (original AD-4) | Optimal (platform-tuned) | None | C-level unsafe | High — custom codegen pipeline | bitnet.cpp algorithms |
| B: Port bitnet.cpp to Rust | Near-optimal | Manual WASM SIMD | Mixed (unsafe for intrinsics) |
Medium — translate C → Rust | bitnet.cpp algorithms |
| C: Reference R3-Engine patterns | 80-117 tok/s proven | Native dual-target | 100% Safe Rust | Low-medium — adapt patterns | R3 bit-slicing + mmap |
| D: Integrate bitnet.rs crate | GPU: 32x (WGSL), CPU: scalar | bitnet-wasm crate |
Safe Rust + WGSL | Low — add dependency | Full crate |
Recommended: Approach C (Reference R3-Engine) with RuvLLM integration
R3-Engine's techniques are the strongest fit because:
- 100% Safe Rust — no
unsafeblocks in the hot path - Dual-target proven — same codebase compiles to AVX-512 native and WASM SIMD128
- Zero-copy mmap — matches our Phase 0 mmap strategy (AD-18)
- Cache-aligned bit-slicing — 64-byte aligned CacheLines match CPU cache architecture
- VPOPCNTDQ — bit-population-count approach to ternary GEMM is elegant and SIMD-width-agnostic
WASM SIMD128 kernel mapping for TL1:
WASM SIMD128 provides v128 type (128 bits):
- i8x16: 16 × 8-bit integers — pack 64 ternary weights (2-bit each)
- i16x8: 8 × 16-bit integers — accumulation without overflow
- i32x4: 4 × 32-bit integers — final dequantized output
TL1 LUT (16 entries) maps naturally to a single v128:
v128.load(lut_ptr) → load 16-entry LUT
v128.swizzle(lut, indices) → parallel 16-way table lookup
i16x8.add(accum, partial) → INT16 accumulation
f32x4.mul(dequant, scale) → FP32 scale application
Estimated WASM SIMD128 throughput:
~20-40 tok/s for 3B active params (vs ~5-10 tok/s scalar JS)
~4-8x speedup over non-SIMD WebAssembly
WASM SIMD128 limitations:
- Fixed 128-bit width only (vs NEON 128, AVX2 256, AVX512 512)
- No integer popcount instruction (must emulate VPOPCNTDQ via lookup or bit manipulation)
- No gather/scatter operations (LUT access must be sequential or use swizzle)
- Memory alignment not enforced (no hardware-guaranteed 64-byte alignment)
- Single-threaded unless SharedArrayBuffer + Web Workers enabled
Dual-target compilation strategy (Cargo feature flags):
// In Cargo.toml:
[features]
default = ["native-simd"]
native-simd = [] # AVX2/AVX512/NEON via std::arch
wasm-simd = ["simd128"] # WASM SIMD128 via core::arch::wasm32
// In kernel code:
#[cfg(all(target_arch = "aarch64", feature = "native-simd"))]
fn ternary_gemv_neon(weights: &TernaryTensor, activations: &[i8], output: &mut [f32]) { ... }
#[cfg(all(target_arch = "x86_64", feature = "native-simd"))]
fn ternary_gemv_avx2(weights: &TernaryTensor, activations: &[i8], output: &mut [f32]) { ... }
#[cfg(all(target_arch = "wasm32", feature = "wasm-simd"))]
fn ternary_gemv_wasm128(weights: &TernaryTensor, activations: &[i8], output: &mut [f32]) { ... }
// Scalar fallback (always available):
fn ternary_gemv_scalar(weights: &TernaryTensor, activations: &[i8], output: &mut [f32]) { ... }
Integration with existing RuvLLM architecture:
| Existing Component | Change Needed | Impact |
|---|---|---|
kernels/mod.rs |
Add ternary module export |
Low |
kernels/matmul.rs |
Add ternary GEMV dispatch alongside existing FP16/Metal GEMV | Low |
bitnet/mod.rs (new) |
Wire TernaryTensor to kernel dispatch | Already created (Phase 0) |
gguf/quantization.rs |
BitnetT158 dequant already integrated | Already done |
autodetect.rs |
Add AVX512 VPOPCNTDQ detection + WASM target detection | Low |
Cargo.toml |
Add wasm-simd feature flag, wasm32 target conditional deps |
Low |
backends/ |
New BitNetBackend uses ternary kernel dispatch |
Medium (new backend) |
Estimated implementation effort (Rust ternary kernels with WASM):
| Component | Lines | Complexity | Notes |
|---|---|---|---|
| TL1 kernel (NEON + scalar) | ~200 | Medium | Reference R3-Engine bit-slicing |
| TL1 kernel (AVX2/AVX512) | ~250 | Medium | VPOPCNTDQ for AVX512, lookup for AVX2 |
| TL1 kernel (WASM SIMD128) | ~150 | Medium | v128 swizzle + i16x8 accumulation |
| I2_S kernel (all targets) | ~300 | Low | Simpler unpack-and-add |
| TL2 kernel (all targets) | ~250 | Medium-High | 5-bit index, 32-entry LUT |
| Kernel dispatch + autodetect | ~100 | Low | Match existing matmul.rs pattern |
| LUT generation | ~80 | Low | Pre-compute at model load |
| Total | ~1,330 | — | Compiles to native + WASM from single source |
Phase 0 impact: The Phase 0 smoke test (TL1 NEON + scalar) is already partially covered by the existing bitnet/ module. AD-21 extends this to production-grade kernels with WASM as an additional target.
Exit criteria:
- TL1 kernel passes bit-exact validation against bitnet.cpp reference output
- WASM SIMD128 build produces functional
.wasmbinary - Native NEON throughput ≥ 80% of R3-Engine (≥ ~64-94 tok/s for 2B model)
- AVX2 path tested on x86 Linux
- Scalar fallback tested on generic platform
- WASM throughput ≥ 20 tok/s for 3B active params in browser
- Zero
unsafeblocks in WASM path (Safe Rust only) - Kernel dispatch selects optimal path via
autodetect.rsfeature detection
Open question resolved: AD-21 answers open question #5 (WASM target for ternary kernels) — yes, WASM SIMD128 is viable for TL1/I2_S, with ~4-8x speedup over scalar WASM. TL2's 5-bit index is less natural for 128-bit SIMD but still implementable via two-stage lookup.
AD-22: Evaluation Infrastructure and Behavioral Gates
Decision: Define a three-gate behavioral evaluation framework with a structured trace schema, auto-labeling strategy, and Go/No-Go shipping rule. All gates are non-LLM-judge, deterministic, reproducible, and executable on CPU without external API calls. The system ships on integrity/citations/refusal behavior, not raw model quality benchmarks. Full GPU distillation (Phase 1+) is deferred; the eval infrastructure must validate Phase 0 and Phase 0.5 outputs at zero marginal cost.
Rationale: Standard LLM evaluation relies on either (a) benchmark suites (HumanEval, MMLU) that measure general capability, or (b) LLM-as-judge approaches that are non-deterministic, expensive, and unsuitable for gating CI/CD pipelines. For Craftsman Ultra, the critical shipping question is not "does it score well on benchmarks?" but "does it route correctly, cite honestly, and refuse when uncertain?" These behavioral properties are testable with deterministic, cheap-to-run gate checks that compare model outputs against known ground-truth traces.
The three gates correspond to the three failure modes that would make the system untrustworthy regardless of benchmark scores:
- Misrouting — wrong experts selected, producing semantically wrong outputs from correct-seeming completions
- Hallucinated citations — model cites evidence that does not exist or does not support the claim
- Over/under-refusal — model refuses answerable questions or confidently answers indeterminate ones
Gate 1 — Routing Correctness
Run the FP16 teacher model once on the 200-prompt evaluation suite to record ground-truth routing traces: which experts are selected, with what softmax weights, per token per layer. Then run the ternary student model on the same prompts and compare routing decisions.
| Parameter | Value |
|---|---|
| Metric | routing_agreement = count(same_topk_experts) / total_tokens |
| Comparison | Per-token, per-layer: do the top-K selected expert indices match between teacher and student? |
| Pass threshold | >= 0.85 (85% of tokens route to the same expert set as the teacher) |
| Fail action | Trigger targeted router repair via ContrastiveTrainer (AD-19, AD-20) with triplets generated from the misrouted token positions |
Teacher traces are recorded once and cached as JSONL. The ternary model is evaluated against these cached traces on every pipeline run. Agreement is measured at the expert-set level (order-invariant): if teacher selects experts {2, 5} and student selects {5, 2}, this counts as agreement.
Gate 2 — Citation Correctness
For retrieval-augmented responses, verify that citations are grounded in the actual retrieval corpus. This gate requires a labeled subset of the 200-prompt suite where prompts include retrieval context with known chunk IDs.
| Parameter | Value |
|---|---|
| Metric (precision) | citation_precision = valid_citations / total_citations |
| Metric (recall) | citation_recall = cited_evidence / relevant_evidence (from labeled prompts) |
| Validity check | For each cited chunk_id: (1) chunk exists in retrieval corpus, (2) cited span is an exact substring match OR Jaccard similarity between cited span and chunk content > 0.6 |
| Pass threshold | Precision >= 0.90, Recall >= 0.70 |
| Fail action | Trigger retrieval-first policy training via GrpoOptimizer (GRPO reward penalizes hallucinated citations, rewards grounded ones) |
Jaccard similarity is computed at the word level: |intersection(words_cited, words_chunk)| / |union(words_cited, words_chunk)|. This catches paraphrased citations while rejecting fabricated ones. The 0.6 threshold was chosen to allow minor rephrasing while catching wholesale fabrication.
Gate 3 — Refusal Calibration
Test the model's ability to refuse when evidence is insufficient and answer when evidence is adequate. Uses the auto-labeled prompt suite (see below) where each prompt is classified as resolved, contested, or indeterminate.
| Parameter | Value |
|---|---|
| Metric | refusal_f1 = harmonic_mean(refusal_precision, refusal_recall) |
| Refusal detection | Output contains a refusal signal (configurable string set, e.g., "I cannot determine", "insufficient evidence", "I'm not sure", or a structured <refusal> tag) |
| Must-refuse rate | Model must refuse >= 80% of indeterminate prompts |
| Must-answer rate | Model must NOT refuse >= 95% of resolved prompts |
| Pass threshold | Refusal F1 >= 0.85 |
| Fail action | Adjust refusal threshold in controller policy, or retrain controller via GrpoOptimizer with refusal-aware reward signal |
contested prompts (sources actively contradict) are evaluated separately and not gated — they are tracked for monitoring but the correct behavior (refuse vs. present both sides) is domain-dependent.
Trace Schema (JSONL format)
Every evaluation run produces a JSONL trace file where each line records per-token, per-layer routing decisions alongside response-level citation and refusal assessments:
{
"prompt_id": "p-001",
"token_idx": 42,
"layer_idx": 3,
"routing": {
"topk_expert_ids": [2, 5],
"topk_weights": [0.62, 0.38],
"teacher_expert_ids": [2, 5],
"teacher_weights": [0.65, 0.35],
"agreement": true
},
"citations": [
{"chunk_id": "doc-17-p3", "span": "exact quoted text", "valid": true}
],
"refusal": {
"should_refuse": false,
"did_refuse": false,
"correct": true
},
"coherence_score": 0.91,
"stop_reason": "eos"
}
Schema notes:
routingis emitted per-token per-layer (one record per token-layer pair)citationsandrefusalare emitted once per response (attached to the final token record,stop_reason != null)coherence_scoreis the cosine similarity between student and teacher hidden states at the final layer — a cheap proxy for output quality without LLM-judge- Trace files are stored in
eval/traces/(never in the project root) and named{model_version}_{prompt_suite}_{timestamp}.jsonl
Auto-Labeling Strategy
The 200-prompt evaluation suite is labeled without manual annotation by using RuVector retrieval signals as proxy ground truth:
| Label | Condition | Meaning | Gate Usage |
|---|---|---|---|
resolved |
Evidence redundancy > 3 (multiple independent sources agree on the answer) | The question is clearly answerable from the corpus | Gate 3: model must answer (not refuse) |
contested |
Cluster disagreement > 0.4 (sources actively contradict each other) | The question has conflicting evidence | Monitored only (not gated) |
indeterminate |
Mincut fragility > 0.7 (removing a single source breaks the entire evidence chain) | The question cannot be reliably answered | Gate 3: model must refuse |
These labels also feed Gate 2:
resolvedprompts provide therelevant_evidencedenominator for citation recall (all supporting chunks should be cited)indeterminateprompts should produce no citations (any citation on an indeterminate prompt is likely hallucinated)
Auto-labeling is deterministic given a fixed retrieval corpus and runs on CPU via existing RuVector HNSW search. Labels are stored alongside prompts in the evaluation suite and versioned with the corpus.
Go/No-Go Rule
All three gates must pass on the same evaluation suite run for the system to ship:
SHIP = (routing_agreement >= 0.85)
AND (citation_precision >= 0.90)
AND (citation_recall >= 0.70)
AND (refusal_f1 >= 0.85)
If any gate fails, the system cannot ship. The remediation path is gate-specific:
| Failed Gate | Remediation | Component | Estimated Duration |
|---|---|---|---|
| Routing Correctness | Router repair via ContrastiveTrainer with misrouted-token triplets |
training/contrastive.rs |
1-4 hours |
| Citation Correctness | Retrieval-first policy training via GrpoOptimizer (reward grounded citations) |
training/grpo.rs |
2-8 hours |
| Refusal Calibration | Adjust refusal threshold or retrain controller policy via GrpoOptimizer |
training/grpo.rs + controller config |
1-2 hours |
Re-evaluation after remediation must re-run all three gates (not just the failed one) to confirm no regression.
Implementation location:
| Component | Path | Lines | Notes |
|---|---|---|---|
| Gate runner orchestrator | crates/ruvllm/src/eval/gates.rs |
~300 | New module; runs all three gates, produces trace JSONL |
| Routing trace recorder | crates/ruvllm/src/eval/routing_trace.rs |
~150 | Records teacher routing decisions; compares against student |
| Citation validator | crates/ruvllm/src/eval/citation_check.rs |
~200 | Substring match + Jaccard similarity; corpus lookup |
| Refusal detector | crates/ruvllm/src/eval/refusal_detect.rs |
~100 | Configurable refusal signal set; F1 computation |
| Auto-labeler | crates/ruvllm/src/eval/auto_label.rs |
~150 | RuVector signal extraction; prompt classification |
| Trace schema types | crates/ruvllm/src/eval/trace.rs |
~80 | Serde-annotated structs matching the JSONL schema |
| Total new code | ~980 | All CPU-only, no external dependencies |
Exit criteria:
- Teacher routing traces recorded for full 200-prompt suite and cached as JSONL
- Gate 1 (routing agreement) runs in < 30 minutes on Mac Studio for 200 prompts
- Gate 2 (citation correctness) validates chunk_id existence and span grounding
- Gate 3 (refusal calibration) correctly classifies refusal signals in model output
- Auto-labeler produces
resolved/contested/indeterminatelabels from RuVector signals - All gates produce deterministic results (same inputs = same pass/fail, bit-exact)
- Trace JSONL files are written to
eval/traces/, never to project root - Go/No-Go rule enforced: all three gates must pass on same run
- Failed gate triggers correct remediation path (ContrastiveTrainer or GrpoOptimizer)
- Total eval suite runtime < 2 hours on Mac Studio (CPU-only)
AD-23: Phase-1 Distillation via External GPU Teacher Artifacts
Status: Accepted
Context: The Ultra 30B ternary MoE system prioritizes CPU-first inference, integrity-driven behavior, and low operational cost. Phase-1 performance goals focus on routing correctness after ternary quantization, citation-grounded answers, and calibrated refusal under thin or conflicting evidence. Full end-to-end GPU distillation of a 30B teacher is expensive, slow, and misaligned with the system's long-term architecture — where RuVector provides memory and structure, and the generator model is intentionally small and cheap. However, pure PTQ ternary conversion (Phase 0) introduces unacceptable degradation in MoE routing stability, answer fidelity on contested prompts, and refusal behavior calibration. We therefore require a limited refinement phase that recovers task-relevant behavior without committing to ongoing GPU dependence.
Decision: Phase-1 distillation SHALL be implemented as a one-time, external GPU artifact generation step, followed by local CPU-only refinement.
- A full-precision FP16 teacher is executed once on a short-lived cloud GPU instance
- The teacher produces behavioral artifacts, not trained weights
- All refinement and training occurs locally on CPU using these artifacts
- GPU infrastructure is not a runtime dependency
Scope of Teacher Artifacts (GPU job exports only):
| Artifact | Content | Purpose |
|---|---|---|
| Routing Traces | Per token, per MoE layer: top-k expert indices + routing probabilities/margins | Preserve expert selection behavior post-quantization |
| Sparse Logits | Answer spans, refusal boundaries, contradiction disclosure points only | Guide LoRA residual correction and refusal calibration without full sequence distillation |
| Preference Labels | Per-prompt classification: resolved / contested / indeterminate | Train stop decisions and disclosure behavior |
Artifacts SHALL be stored as immutable, versioned files and reused across refinement runs.
CPU-Only Refinement Strategy (using teacher artifacts):
- Router Repair — Match student top-k routing to teacher traces; penalize expert churn and margin collapse
- Low-Rank Residual Correction — Apply LoRA-style residuals to compensate ternary approximation error; enforce strict parameter budget
- EWC++ Preservation — Prevent catastrophic drift outside repaired regions
- Policy Optimization — Train RLM stop and retrieval behavior; optimize for citation correctness and calibrated refusal
No full expert weight updates are allowed in Phase-1.
Evaluation Gate: A checkpoint SHALL NOT be promoted unless it passes behavioral evaluation, not reconstruction metrics. Mandatory metrics:
| Metric | Criterion | Gate |
|---|---|---|
| Routing correctness | Top-k overlap with teacher + margin correlation | Gate 1 (AD-22) |
| Citation correctness | Span hash verification + evidence support via RuVector | Gate 2 (AD-22) |
| Refusal calibration | Refuse on indeterminate, disclose on contested, pass on resolved | Gate 3 (AD-22) |
compute_dequant_error is a sanity check only, not a promotion criterion.
Acceptance Criteria:
- System passes the 200-prompt disagreement suite
- Routing correctness meets Gate 1 threshold (>= 0.85)
- Citation precision exceeds 0.90 (Gate 2 precision target)
- Refusal behavior aligns with RuVector coherence signals (Gate 3 F1 >= 0.85)
- Results remain stable under 10% corpus perturbation
- GPU artifact generation completes in single cloud session (< 4 hours)
- CPU refinement reproducible without GPU access
Alternatives Considered:
| Alternative | Verdict | Reason |
|---|---|---|
| Full GPU distillation | Rejected | High cost, long iteration cycles, misalignment with CPU-first design |
| Pure PTQ without refinement | Rejected | Unacceptable routing instability, incorrect refusal behavior, citation degradation |
| Continuous GPU shadow training | Rejected | Operational complexity, long-term infrastructure lock-in |
Consequences:
- Positive: GPU cost is bounded and minimal; refinement is repeatable and auditable; CPU-first deployment remains intact; system behavior aligns with integrity goals; distillation artifacts are reusable
- Negative: General language quality parity with FP16 teacher is not guaranteed; some PTQ loss may remain in non-critical behaviors; requires building custom evaluation infrastructure (addressed by AD-22)
- Note: This ADR does not preclude a future Phase-2 distillation if product requirements shift toward general language parity. Phase-2 would be a separate decision
AD-24: RLM-Style Recursive Sentence Transformer Embedder
Status: Accepted
Context: The Craftsman Ultra system uses RuVector for evidence retrieval, cluster analysis, contradiction detection, and mincut fragility scoring. Standard sentence transformers produce embeddings in a single forward pass — one chunk in, one vector out. This works for basic retrieval but fails at three critical boundaries:
- Contradiction boundaries: Two chunks with opposing claims embed near each other because they share vocabulary, despite being semantically opposed
- Domain drift: Embeddings trained on general corpora perform poorly when the corpus shifts to a specialized domain (legal, medical, code)
- Context blindness: The embedding of a chunk is independent of its neighborhood, losing structural signals that RuVector already knows (entity links, claim chains, cluster membership)
A normal embedding pipeline cannot distinguish "Drug X cures condition Y" from "Drug X does NOT cure condition Y" — they embed almost identically. The system needs embeddings that reflect the structural position of a chunk within the evidence graph, not just its surface semantics.
Decision: Implement an RLM-style recursive embedder — not a new architecture, but an inference strategy that wraps any base sentence transformer in a short iterative loop that retrieves context, decomposes, re-embeds, and merges.
Core Loop (bounded to 2-3 iterations):
State: { text, intent, neighbors, candidate_embeddings, iteration, stop_reason }
1. Embed the base chunk → base_embedding
2. Retrieve k nearest neighbors from RuVector → neighbors[]
3. Normalize/summarize chunk with neighbor context → contextualized_text
4. Re-embed the normalized view → ctx_embedding
5. If contested (low-cut boundary), embed both → cluster_a_emb, cluster_b_emb
sides of the disagreement separately
6. Merge into final representation → final_embedding + metadata
Output Schema:
| Field | Type | Description |
|---|---|---|
embedding |
Vec<f32> |
Final merged embedding vector |
confidence |
f32 |
Embedding stability across iterations (cosine similarity between iteration N and N-1) |
evidence_neighbor_ids |
Vec<String> |
RuVector chunk IDs used as context |
contradiction_flags |
Vec<bool> |
Per-neighbor: true if neighbor is in opposing cluster |
cluster_id |
Option<usize> |
Primary cluster assignment |
stop_reason |
StopReason |
Why the loop terminated: Converged, MaxIterations, Contested |
Three Embedding Variants:
| Variant | Conditioning | Use Case | Output |
|---|---|---|---|
| A: Query-Conditioned | Query text + neighborhood | Retrieval under a specific query | Embedding optimized for that query's intent |
| B: Corpus-Conditioned | Stable neighbors + entity graph | Corpus indexing | Embedding stable over time, less sensitive to local phrasing |
| C: Contradiction-Aware Twin | Both sides of a low-cut boundary | Disputed claims | Bimodal representation: one embedding per cluster side |
Merge Rule (auditable, not learned):
final = normalize(w0 * base + w1 * ctx + w2 * anti)
Where anti is the embedding of the strongest counter-cluster neighbor set. Weights can be fixed (w0=0.6, w1=0.3, w2=0.1) or learned with a small regression on the eval set.
Training Strategy (minimal, no full model training):
Only three components are trainable:
- Merge weights (
w0, w1, w2) — 3 parameters, learned via grid search or small regression - Stop policy — when to terminate the loop (convergence threshold on cosine similarity between iterations)
- Adapter layer — optional small linear layer on top of base embeddings for domain adaptation (rank-4 LoRA or single linear)
Evaluation Criteria:
| Metric | Definition | Target |
|---|---|---|
| Top-k retrieval accuracy | Correct chunk in top-k results | Improvement over single-pass baseline |
| False neighbor rate | Contradicting chunks incorrectly ranked as similar | Reduction vs baseline |
| Cluster purity | Intra-cluster coherence after re-embedding | Improvement vs baseline |
| Contradiction separation | Cosine distance between opposing claim embeddings | > 0.3 (vs ~0.05 for single-pass) |
| Stability under perturbation | Embedding change when 10% of corpus is modified | < 0.05 cosine drift |
| Latency per embedding | Wall time including retrieval + re-embedding | < 50ms for 2 iterations on target hardware |
Appliance Fit (CPU-first):
- Small base embedder model (e.g., 22M-110M params)
- 2-3 passes maximum per chunk
- RuVector supplies all context (no additional retrieval infrastructure)
- Ternary quantization of the base embedder is possible (future AD)
- Compatible with WASM deployment for browser-side embedding
Acceptance Criteria:
- On a held-out corpus slice, RLM-style embedder improves top-k retrieval accuracy vs single-pass baseline
- False neighbor matches near contradiction boundaries are reduced
- Latency stays within budget (< 50ms for 2 iterations on target hardware)
- Memory usage does not exceed appliance budget
- Variant C produces measurably separated embeddings for known contradictions
- Merge weights are interpretable and auditable (no black-box learned fusion)
Consequences
Positive
- CPU-only deployment: 30B-class model running on commodity hardware without GPU
- Energy efficiency: 55-82% reduction in inference energy vs FP16
- Memory efficiency: ~8GB vs ~60GB for FP16 30B model (7.5x reduction)
- Multiplication-free expert GEMM: Integer addition only in expert forward passes
- SONA compatibility: MicroLoRA adaptation preserves per-session learning
- GGUF ecosystem: Compatible with existing model distribution infrastructure
- Incremental path: Phase 0 ($0) validates pipeline; Phase 0.5 ($0) adds RLM quality boost; Phase 1 ($1,300) delivers production quality; Phases 2-3 optimize
- ~70% RLM code reuse: GRPO, EWC++, ContrastiveTrainer, MemoryDistiller, PolicyStore are production-tested — only BitLinear layer and orchestrator are net-new
- Adaptive distillation: GRPO reward scaling dynamically focuses compute on hard-to-distill experts
- Cross-expert stability: EWC++ Fisher diagonal prevents catastrophic forgetting during sequential expert distillation
- Learned quantization policies: PolicyStore persists per-layer ternary scale distributions for reproducible future distillation runs
- Expert-parallel distillation: Independent expert FFNs enable rayon-parallel distillation across CPU cores
- Phase 0 de-risks Phase 1 at zero cost: Mac Studio PTQ prototype validates entire inference pipeline (GGUF → dequant → kernel → MoE → generation) for $0 before committing $1,300+ to cloud GPU distillation
- Existing GGUF ecosystem: Community-published GLM-4.7-Flash GGUFs (bartowski, unsloth) available as comparison baselines
- Phase 0.5 RLM refinement at $0: Existing MicroLoRA + GRPO + EWC++ + ContrastiveTrainer stack provides ~10-15 percentage point quality recovery over raw PTQ with zero new training code, running entirely on Mac Studio
- 100% RLM reuse for Phase 0.5: No new training infrastructure needed — all 7 RLM components are production-tested and wire together directly
- SIMD-only Phase 0.5: Entire RLM refinement pipeline runs on pure CPU SIMD (NEON on aarch64) without Metal GPU — only ~2-3x slower than Metal, extends platform support to Linux ARM64 and (with scalar fallback) x86
- Zero-config SIMD mode: All training components (MicroLoRA, TrainingPipeline, EwcRegularizer, GrpoOptimizer) are already GPU-agnostic; only
ContrastiveTrainer.use_metal = falseneeded for full SIMD-only execution - WASM browser deployment: Native Rust kernels compile to WASM SIMD128 via Cargo feature flags, enabling in-browser ternary inference at ~20-40 tok/s without server roundtrip
- Single-source dual-target: One Rust codebase compiles to both native SIMD (NEON/AVX2/AVX512) and WASM SIMD128, eliminating the need for separate C++ and JS codebases
- Safe Rust kernels: Following R3-Engine's approach, production kernels can be 100% Safe Rust (no
unsafein hot path), eliminating entire classes of memory safety bugs vs bitnet.cpp's C++ - Existing Rust ecosystem: R3-Engine (Apache-compatible) and bitnet.rs (Apache 2.0) provide proven reference implementations to accelerate kernel development
- Deterministic behavioral gates: Three non-LLM-judge evaluation gates (routing, citation, refusal) provide reproducible pass/fail shipping decisions without expensive API calls or non-deterministic judge models
- Structured trace schema: JSONL trace format captures per-token routing, per-response citation, and refusal decisions in a single auditable artifact — enables regression detection across model versions
- Zero-annotation auto-labeling: RuVector retrieval signals (evidence redundancy, cluster disagreement, mincut fragility) classify prompts as resolved/contested/indeterminate without human annotation effort
- Gate-specific remediation: Each failed gate maps to a concrete repair action using existing RLM components (ContrastiveTrainer for routing, GrpoOptimizer for citations and refusal), avoiding manual debugging cycles
- CPU-only evaluation: Full eval suite runs on Mac Studio in < 2 hours with no cloud GPU or external API dependency, keeping the evaluation loop at $0 marginal cost
- Bounded GPU cost: Phase-1 distillation requires only a single short-lived cloud GPU session to generate behavioral artifacts (routing traces, sparse logits, preference labels) — no ongoing GPU dependency
- Artifact reusability: Teacher artifacts are immutable and versioned; CPU refinement runs can be repeated, tuned, and audited without re-running the GPU job
- Behavioral distillation: Distilling routing decisions and refusal signals rather than full logit sequences aligns training objectives with the system's integrity-first design goal
- RLM-style embeddings: Recursive context-aware embeddings improve retrieval accuracy and contradiction separation without requiring a larger embedding model — inference strategy, not new architecture
- Contradiction-aware twin embeddings: Variant C produces bimodal representations at low-cut boundaries, preserving disagreement structure in the embedding space for downstream decision-making
- Minimal training surface: Only 3 merge weights + stop policy + optional adapter need training for the RLM embedder — no full model fine-tuning required
Negative
- Training cost: Even distillation requires 800-1,600 A100-hours (~$2K-$5K cloud cost)
- Custom kernels: Must implement and maintain platform-specific SIMD kernels in Rust
- Quality gap: Phase 1 may be 5-10% below GLM-4.7-Flash on some benchmarks
- No GPU acceleration: BitNet kernels are CPU-specific; GPU path requires separate optimization
- Mixed-precision complexity: Router (FP16) + experts (ternary) + attention (FP16/ternary) adds dispatch complexity
- WASM SIMD128 ceiling: Fixed 128-bit width limits throughput vs native AVX2 (256-bit) or AVX512 (512-bit); no popcount instruction requires emulation; single-threaded unless SharedArrayBuffer enabled — expect ~20-40 tok/s vs ~80-117 tok/s native
- RLM scale gap: Existing
RealContrastiveTrainertargets 0.5B models (embedding_dim=896); scaling to 30B requires distributed data loading and increased batch sizes - No x86 SIMD kernels: Current
kernels/matmul.rsonly implements NEON (aarch64); x86 falls to scalar fallback (~3-5x slower than NEON). Adding AVX2/AVX512 kernels would make x86 SIMD-only mode competitive but is not yet implemented - Teacher trace dependency: Gate 1 requires a full FP16 teacher forward pass to generate ground-truth routing traces; this must be re-run whenever the evaluation suite changes or the teacher model is updated
- Auto-label noise: RuVector-derived labels (evidence redundancy, mincut fragility) are proxies for true answerability; edge cases near thresholds (e.g., fragility = 0.69 vs 0.71) may produce inconsistent labels across corpus versions
- 200-prompt suite coverage: A fixed 200-prompt suite may not cover all failure modes; adversarial or distribution-shifted prompts could pass all gates yet fail in production
- General quality ceiling: Phase-1 behavioral distillation intentionally does not target full language quality parity with FP16 teacher; non-critical behaviors may remain degraded
- Teacher artifact staleness: If the evaluation prompt suite or teacher model changes, routing traces and preference labels must be regenerated on GPU
Risks
| Risk | Likelihood | Impact | Mitigation |
|---|---|---|---|
| Phase 0 PTQ quality too low for meaningful testing | Medium | Low | Phase 0 is for kernel/pipeline validation, not quality; upgrade to 0D (BitDistill Lite) if needed |
| MoE routing degrades with ternary experts | Medium | High | Phase 0 detects routing issues early; Phase 1 validates routing; router stays FP16; AD-12 contrastive validation |
| bitnet.cpp kernel translation to Rust introduces bugs | Medium | Medium | Phase 0 PTQ model provides cheap test fixture; extensive kernel unit tests; validate against reference impl |
| Distillation fails to converge for MoE | Low | High | GRPO reward scaling + per-expert distillation fallback; EWC++ stability (AD-13) |
| GLM-4.7-Flash architecture changes break compatibility | Low | Medium | Pin to specific HF revision; architecture abstraction layer |
| IQ1_S GGUF format insufficient for absmean metadata | Medium | Low | Register custom GGUF type (BITNET_T158); backward-compatible extension |
| EWC++ Fisher accumulation OOM at 30B scale | Medium | Medium | Sparse Fisher (top-k diagonal entries); per-expert rather than global Fisher |
| GRPO reward signal too noisy for distillation | Low | Low | Fall back to static KD loss; GRPO reward as optional multiplier |
RealContrastiveTrainer doesn't scale to 30B |
Medium | Medium | Extract training loop; replace Candle Linear with BitLinear; keep optimizer/scheduler |
| Calibration data bias in Phase 0 PTQ | Low | Low | Use diverse calibration corpus (WikiText + code); measure variance across calibration sets |
| Auto-label thresholds misclassify edge-case prompts | Medium | Medium | Track label stability across corpus versions; flag prompts with signals near threshold boundaries for manual review |
| 200-prompt suite insufficient for production coverage | Low | Medium | Expand suite iteratively as production failure modes are discovered; run gates on user-submitted adversarial prompts quarterly |
| Teacher routing traces become stale after model update | Low | Low | Re-record teacher traces as part of every model version bump; cache invalidation keyed on teacher model hash |
Validation Criteria
Phase 0 Exit Criteria
- Absmean ternary quantizer produces valid {-1, 0, +1} weights from GLM-4.7-Flash FP16
- Quantization runs successfully on Mac Studio via mmap (no cloud GPU required)
- GGUF export with BITNET_T158 tensor type loads without error in BitNetBackend
- TL1 NEON kernel produces non-zero, bounded output on PTQ ternary weights
- MoE routing selects experts (not all-zero or all-same-expert degenerate routing)
- End-to-end token generation produces coherent (if degraded) text
- Memory usage measured and documented for real MoE activation patterns
- Throughput measured: tok/s on Mac Studio (ARM NEON) and optionally x86 AVX2
- Baseline quality benchmarks recorded (HumanEval, MMLU) as Phase 1 improvement target
- Total Phase 0 cost = $0 (local Mac Studio execution)
Phase 0.5 Exit Criteria
- MicroLoRA adapters (rank-2) attached to all expert FFN layers
- Router fine-tuning via ContrastiveTrainer restores >=90% routing accuracy vs teacher
- GRPO reward signal shows positive quality improvement over Phase 0 baseline
- EWC++ prevents router fix from degrading already-correct routing paths (Fisher delta < 5%)
- HumanEval pass@1 >= 45% (up from Phase 0 baseline of ~35-45%)
- MicroLoRA + ternary inference produces coherent code completions
- Training completes on Mac Studio within 14 days
- MemoryDistiller has extracted KeyLessons identifying worst-degraded experts
- PolicyStore contains optimized TernaryScale entries for all refined layers
- Total Phase 0.5 cost = $0 (local Mac Studio execution)
- GGUF re-exported with optimized router, scale factors, and LoRA adapter weights
Phase 1 Exit Criteria
- BitNet backend loads GGUF with ternary expert weights
- TL1 kernel produces bit-exact output vs reference float implementation
- Decode speed >= 5 tok/s on x86_64 AVX2 (AMD Ryzen 7 / Intel i7 class)
- HumanEval pass@1 >= 50% (GLM-4.7-Flash baseline: ~65%)
- Memory usage < 10GB for 4K context inference
- GRPO-guided expert distillation converges (loss < 0.5 for all experts)
- EWC++ prevents cross-expert interference (Fisher-regularized loss delta < 5%)
- Contrastive router validation: >= 95% expert routing accuracy vs teacher
- PolicyStore contains TernaryScale entries for all distilled expert layers
Phase 2 Exit Criteria
- Full ternary model (attention + experts) running on CPU
- Decode speed >= 8 tok/s on x86_64 AVX2
- SWE-bench Verified >= 52% (90%+ of GLM-4.7-Flash's 59.2%)
- SONA MicroLoRA adaptation functional on ternary base
- MemoryDistiller has extracted >= 50 KeyLessons from distillation trajectories
- GRPO adaptive KL stabilizes below kl_target (0.02) for all experts
Phase 3 Exit Criteria
- Native-trained model matches or exceeds GLM-4.7-Flash benchmarks
- Published on HuggingFace (ruv/craftsman-ultra-30b-1bit)
- GGUF + bitnet kernel distributed via npm/packages/ruvllm
- Full distillation pipeline reproducible from PolicyStore policies (no manual tuning)
References
- Ma, S. et al., "The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits" (arXiv:2402.17764, Feb 2024)
- Ma, S. et al., "BitNet b1.58 2B4T Technical Report" (arXiv:2504.12285, Apr 2025)
- Microsoft Research, "bitnet.cpp: Efficient Edge Inference for Ternary LLMs" (arXiv:2502.11880, Feb 2025)
- Microsoft, bitnet.cpp — https://github.com/microsoft/BitNet
- Zhipu AI, GLM-4.7-Flash — https://huggingface.co/zai-org/GLM-4.7-Flash
- Zhipu AI, "GLM-4.7: Advancing the Coding Capability" — https://z.ai/blog/glm-4.7
- RuvLLM ADR-002: RuvLLM Integration with Ruvector
- RuvLLM GGUF Quantization Module:
crates/ruvllm/src/gguf/quantization.rs - Microsoft, bitnet-b1.58-2B-4T-gguf — https://huggingface.co/microsoft/bitnet-b1.58-2B-4T-gguf
- RuvLLM GRPO Implementation:
crates/ruvllm/src/training/grpo.rs - RuvLLM RealContrastiveTrainer:
crates/ruvllm/src/training/real_trainer.rs - RuvLLM EWC++ Training Pipeline:
crates/ruvllm/src/lora/training.rs - RuvLLM Memory Distillation:
crates/ruvllm/src/reasoning_bank/distillation.rs - RuvLLM Policy Store:
crates/ruvllm/src/policy_store.rs - RuvLLM Contrastive Training:
crates/ruvllm/src/training/contrastive.rs - PT-BitNet: "Scaling up the 1-Bit large language model with post-training quantization" (2025) — https://www.sciencedirect.com/science/article/abs/pii/S089360802500735X
- BitDistill: "BitNet Distillation" (arXiv:2510.13998, Oct 2025) — https://arxiv.org/html/2510.13998v1
- bartowski, GLM-4.7-Flash-GGUF quantizations — https://huggingface.co/bartowski/zai-org_GLM-4.7-Flash-GGUF
- unsloth, GLM-4.7-Flash-GGUF dynamic quantizations — https://huggingface.co/unsloth/GLM-4.7-Flash-GGUF
- llama.cpp IQ1_S blind testing (Discussion #5962) — https://github.com/ggml-org/llama.cpp/discussions/5962
- STBLLM: "Breaking the 1-bit Barrier" (ICLR 2025) — https://proceedings.iclr.cc/paper_files/paper/2025/file/ff997469ac66cf893c4183efeb22212a-Paper-Conference.pdf
- Apple Mac Studio Technical Specifications (2025) — https://www.apple.com/mac-studio/specs/
- RuvLLM Metal GEMV integration:
crates/ruvllm/src/kernels/matmul.rs:1444-1582 - RuvLLM MicroLoRA NEON SIMD forward:
crates/ruvllm/src/lora/micro_lora.rs:279-390(forward_simd, forward_simd_neon_impl) - RuvLLM NEON SIMD kernels:
crates/ruvllm/src/kernels/(matmul: gemm_neon/gemv_neon, activations: silu_neon/gelu_neon/relu_neon, norm: rms_norm_neon, rope: apply_rope_neon) - RuvLLM ContrastiveTrainer CPU fallback:
crates/ruvllm/src/training/contrastive.rs:171-175(Metal → CPU fallback) andcontrastive.rs:475(non-Candle pure CPU path) - R3-Engine: Pure Rust BitNet inference engine with WASM SIMD128 — https://github.com/r3-engine/r3-engine
- bitnet.rs: Pure Rust BitNet toolkit (Apache 2.0) — https://github.com/ocentra/bitnet.rs
- WASM SIMD128 specification: Fixed-width 128-bit SIMD for WebAssembly — https://v8.dev/features/simd
- Rust
core::arch::wasm32SIMD intrinsics — https://doc.rust-lang.org/beta/core/arch/wasm32/index.html - "The state of SIMD in Rust in 2025" (Sergey Davidoff) — https://shnatsel.medium.com/the-state-of-simd-in-rust-in-2025-32c263e5f53d
- "Rust + WebAssembly 2025: WasmGC and SIMD" — https://dev.to/dataformathub/rust-webassembly-2025-why-wasmgc-and-simd-change-everything-3ldh
- Bai, Y. et al., "Constitutional AI: Harmlessness from AI Feedback" (arXiv:2212.08073, Dec 2022) — https://arxiv.org/abs/2212.08073
- Zheng, L. et al., "Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena" (arXiv:2306.05685, Jun 2023) — https://arxiv.org/abs/2306.05685
- Rafailov, R. et al., "Direct Preference Optimization: Your Language Model is Secretly a Reward Model" (arXiv:2305.18290, May 2023) — https://arxiv.org/abs/2305.18290
- Min, S. et al., "FActScore: Fine-grained Atomic Evaluation of Factual Precision in Long Form Text Generation" (arXiv:2305.14251, May 2023) — https://arxiv.org/abs/2305.14251
- RuvLLM BitNet Backend:
crates/ruvllm/src/bitnet/backend.rs(MoE routing, TL1 GEMV, forward pass) - RuvLLM RLM Refiner:
crates/ruvllm/src/bitnet/rlm_refiner.rs(Phase 0.5 refinement orchestrator)