v8 inference hardening

This ShivasNotes deep dive is written for engineers who want to understand what happens between "here is a model on Hugging Face" and "here is a compiled .so that runs inference in pure C." C-Kernel-Engine (CKE) is not a runtime interpreter; it is a code generator. Template JSON defines the architecture. The IR builder resolves kernels. The memory planner assigns every byte. The codegen emits unrolled C. The compiler produces a shared library that runs independently of Python. This post focuses on the current v8 inference lane: promoted text-family bring-up, high-memory smoke targets, Qwen3-VL multimodal bridge work, and the regression/parity surface that hardens the runtime. Video walkthrough on youtube.com/@antshivrobotics.

The central idea is a compiler split: CKE is smart before codegen and intentionally dumb during codegen. Template choice, kernel binding, memory layout, multimodal bridge policy, and parity instrumentation happen before a single line of emitted C exists. By the time model_v8.c or encoder_v8.c is written, the interesting decisions are already over. That is why the generated file feels so explicit. It is not discovering the model at runtime. It is replaying a compile-time plan that was already settled in JSON and lowered IR.

What this post covers

Sections 1 through 5 frame the active v8 inference front-end: the six-step ck_run_v8.py pipeline, the template JSON system, the Gemma4 hybrid contract, the Qwen3-VL vision encoder, and the multimodal bridge.

Sections 6 through 10 walk the middle of the pipeline: IR1, the fusion pass, the memory planner, lowered IR, and the code generator that emits fully unrolled C.

Sections 11 through 15 land on the hardening surface: what model_v8.c actually looks like, how it becomes libmodel.so, what the v8 regression/parity tools check, and why the smart-front-end / dumb-back-end split is the architectural win.

Introduction — CKE Is a Compiler, Not a Framework

Most LLM frameworks are runtime interpreters. They load a model, keep a graph around, and make decisions while tokens are flowing. CKE can do the opposite because the graph shape is already known. It downloads weights once, resolves a template once, lowers that plan once, then emits static C that calls kernels directly.

The most useful way to read v8 is as a six-step compiler pipeline. Step 1 downloads the model. Step 2 converts Hugging Face or GGUF weights into the bump format CKE expects. Step 3 builds IR1 from template + quant metadata. Step 4 emits C from lowered IR. Step 5 compiles that C into libmodel.so. Step 6 starts chat by loading the shared library. There is no interpreter loop between step 5 and the first generated token.

The surface area is large enough that this is clearly a compiler project, not a wrapper. The v8 inference pipeline spans ck_run_v8.py (1,315 lines), build_ir_v8.py (10,459), codegen_v8.py (1,025), codegen_prefill_v8.py (1,943), and memory_planner_v8.py (705) before the generated C even exists.

That front-loads complexity on purpose. The runtime becomes boring because the builder already paid the planning cost up front.

15,447 lines. The compile pipeline is substantial before inference starts: IR build, memory planning, decode codegen, and prefill codegen are all explicit source files, not hidden framework internals.

[Figure: Six-step CKE v8 pipeline from model download through IR build, code generation, compilation, and runtime loading.]

ck_run_v8.py: the helper that prints the familiar [1/6] step counter (python)
def log_step(step: int, msg: str):
    """Print pipeline step."""
    print(f"{C_ORANGE}[{step}/6]{C_RESET} {C_BOLD}{msg}{C_RESET}")

ck_run_v8.py: steps [1/6] through [6/6] are real function calls in the runner (python)
log_step(1, f"Downloading {model_id}")
log_step(2, f"Converting weights to bump format ({weight_dtype})")
log_step(3, "Building IR1 (Template + Quant → Kernel IDs)")
log_step(4, "Generating C code")
log_step(5, "Compiling to shared library")
log_step(6, "Starting chat")

v8 inference surface area: Python files that prepare and harden generated inference (text)
ck_run_v8.py                 1315 lines
build_ir_v8.py              10459 lines
codegen_v8.py                1025 lines
codegen_prefill_v8.py        1943 lines
memory_planner_v8.py          705 lines
--------------------------------------------
Core pipeline               15447 lines

run_regression_v8.py          844 lines
parity_test_v8.py             899 lines
compare_first_token_logits_v8.py 633 lines
run_cached_model_smoke_v8.py  174 lines

| Stage | Script / component | Lines | Input | Output |
|---|---|---|---|---|
| [1/6] Download model | ck_run_v8.py | 1,315 | HF model id / GGUF repo | local model directory |
| [2/6] Convert weights | convert_hf_to_bump_v8.py or convert_gguf_to_bump_v8.py | orchestrated by ck_run_v8.py | config.json + weights | weights.bump + weights_manifest.json |
| [3/6] Build IR1 | build_ir_v8.py | 10,459 | template + quant manifest | ir1_decode.json + lowered IR |
| [4/6] Generate C | codegen_v8.py | 1,025 | lowered_decode.json | model_v8.c |
| [5/6] Compile .so | gcc / clang / icx | compiler stage | model_v8.c + runtime libs | libmodel.so |
| [6/6] Start chat | scripts/ck_chat.py | Python loader | libmodel.so + tokenizer .so | interactive inference |

Template JSON — The Architecture Specification

Every model family in CKE starts as a template JSON. In v8 the template lane now covers text decoders, hybrid/sliding decoder variants, vision encoder experiments, and the Qwen3-VL multimodal pair. The template is not a cosmetic manifest. It is the architecture contract.

For Qwen3 that contract declares no QKV bias, Q/K norm enabled, a SwiGLU MLP, RoPE position encoding, a BPE tokenizer contract, and a header/body/footer decode sequence. Those body ops are the skeleton that the IR builder expands into 28 explicit layers.

The cleanest way to think about templates is this: they are handwritten architecture truth, while the generated C is mechanically derived truth.

A new model family often means writing 100–200 lines of template JSON, not rewriting the whole compiler. That is why CKE can absorb dense text models, sliding-window hybrids, and multimodal encoders without inventing a new runtime every time.

[Figure: Template JSON as the architecture contract that feeds IR build, memory planning, and code generation.]

[Figure: Template-level header, body, and footer sequencing for a generated decoder pipeline.]

qwen3.json: flags and contracts that tell v8 what kind of decoder to build (json)
{
  "name": "qwen3",
  "family": "llama",
  "flags": {"use_qkv_bias": false, "has_qk_norm": true, "activation": "swiglu", "rope": "rope", "tokenizer": "bpe"},
  "attention_contract": {"rope_layout": "split", "rope_type": "rope", "qk_norm": true, "kv_layout": "layer_major_kv_cache", "attn_variant": "dense"},
  "block_contract": {"norm_type": "rmsnorm", "mlp_formula": "gate_up -> silu_mul -> down", "activation": "swiglu", "qkv_bias": false}
}

qwen3.json: the decoder sequence is a real header/body/footer program (json)
{
  "sequence": ["decoder"],
  "block_types": {
    "decoder": {
      "sequence": ["header", "body", "footer"],
      "header": ["bpe_tokenizer", "dense_embedding_lookup"],
      "body": {
        "type": "dense",
        "ops": [
          "rmsnorm", "qkv_proj", "qk_norm", "rope_qk",
          "attn", "out_proj", "residual_add", "rmsnorm",
          "mlp_gate_up", "silu_mul", "mlp_down", "residual_add"
        ]
      },
      "footer": ["rmsnorm", "lm_head", "logits"]
    }
  }
}
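
To make the "skeleton expanded into explicit layers" idea concrete, here is a minimal sketch of the expansion step. It is not the logic in build_ir_v8.py: the function name `expand_template` and the flattened dictionary shape are assumptions for illustration, and using layer -1 for the footer mirrors the header convention seen in IR1 rather than a confirmed rule. The real builder also inserts extra ops (residual saves, input quantization, separate Q/K/V projections), which is why the actual IR1 is larger than this naive count.

```python
from typing import Dict, List, Tuple

# Op names copied from the qwen3.json excerpt; the flat dict shape is hypothetical.
QWEN3_TEMPLATE: Dict[str, List[str]] = {
    "header": ["bpe_tokenizer", "dense_embedding_lookup"],
    "body": [
        "rmsnorm", "qkv_proj", "qk_norm", "rope_qk",
        "attn", "out_proj", "residual_add", "rmsnorm",
        "mlp_gate_up", "silu_mul", "mlp_down", "residual_add",
    ],
    "footer": ["rmsnorm", "lm_head", "logits"],
}


def expand_template(template: Dict[str, List[str]], num_layers: int) -> List[Tuple[str, int, str]]:
    """Unroll header/body/footer into a flat list of (section, layer, op)."""
    ops: List[Tuple[str, int, str]] = [("header", -1, op) for op in template["header"]]
    for layer in range(num_layers):
        # The body is repeated once per layer with a concrete layer index.
        ops.extend(("body", layer, op) for op in template["body"])
    ops.extend(("footer", -1, op) for op in template["footer"])
    return ops


flat = expand_template(QWEN3_TEMPLATE, num_layers=28)
print(len(flat))  # 2 header + 28 * 12 body + 3 footer = 341
```

The point of the sketch is only that a short template plus a layer count already determines a fully explicit op list before any C exists.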

Template inventory: active v8 template surface (text)
version/v8/templates/
  gemma3.json
  gemma4.json
  gemma4_vision.json
  glm4.json
  llama.json
  nemotron_h.json
  qwen2.json
  qwen3.json
  qwen35.json
  qwen3_vl_vision.json
  qwen3vl.json
  siglip_vit.json

| Template | Family | Body type | Attention variant | Activation | Special features |
|---|---|---|---|---|---|
| qwen3.json | llama-style decoder | dense | dense causal GQA | SwiGLU | Q/K norm, BPE chat contract |
| qwen35.json | qwen3.5 recurrent hybrid | hybrid recurrent attention | hybrid recurrent | SwiGLU | mixed recurrent + dense path |
| gemma4.json | Gemma4 decoder | hybrid_sliding_attention | full + sliding interleaving | GeGLU | layer-kind dispatch, shared-KV variants |
| qwen3_vl_vision.json | vision_transformer_with_branches | dense bidirectional | dense bidirectional | GELU | dual patch projection, 2D positions, deepstack branches |
| qwen3vl.json | multimodal language decoder | dense | dense causal GQA | SwiGLU | mRoPE, vision markers in chat contract |

Gemma4 — Hybrid Sliding Window Interleaving

Gemma4 is where templates stop looking like simple decoder boilerplate and start looking like a compiler necessity. The body type is hybrid_sliding_attention, and each layer can be one of four different kinds.

Those kinds are not cosmetic labels. Shared-KV layers skip explicit k_proj and v_proj. Sliding layers call different attention kernels. The template sets layer_kind_config_key: "layer_kinds" so the builder can read a per-layer kind vector from config.json.

Instead of a runtime loop asking what kind each layer is on every token, the front-end resolves the layer plan once and the back-end unrolls concrete layers with the correct kernel sequence already baked in.

Gemma4 makes the case for code generation better than any slogan could. Once different layers legally have different body shapes, the easiest correct implementation is to emit different C for each layer.

[Figure: Gemma4 layer interleaving between sliding attention, full attention, and shared-KV variants across the decoder stack.]

gemma4.json: the attention contract names per-layer policy keys (json)
{
  "rope_layout": "split",
  "rope_type": "rope",
  "qk_norm": true,
  "kv_layout": "layer_major_kv_cache",
  "attn_variant": "hybrid_sliding_attention",
  "layer_policy_config_key": "layer_attention_plan",
  "layer_kind_config_key": "layer_kinds",
  "kv_policy_config_key": "layer_kv_policy",
  "kv_source_config_key": "layer_kv_source",
  "sliding_window_config_key": "layer_sliding_window",
  "rope_kind_config_key": "layer_rope_kind"
}

gemma4.json: quant_aliases_by_kind remaps weight names by layer kind (json)
{
  "sliding_attention_kv": {
    "wq": "attn_q",
    "wk": "attn_k",
    "wv": "attn_v",
    "wo": "attn_output",
    "q_norm": "attn_q_norm",
    "k_norm": "attn_k_norm"
  },
  "full_attention_kv": {
    "wq": "attn_q",
    "wk": "attn_k",
    "wv": "attn_v",
    "wo": "attn_output",
    "q_norm": "attn_q_norm",
    "k_norm": "attn_k_norm"
  },
  "sliding_attention_shared_kv": {
    "wq": "attn_q",
    "wo": "attn_output",
    "q_norm": "attn_q_norm"
  },
  "full_attention_shared_kv": {
    "wq": "attn_q",
    "wo": "attn_output",
    "q_norm": "attn_q_norm"
  }
}

gemma4.json: ops_by_kind is the real per-layer program the codegen will unroll (json)
sliding_attention_kv:
  attn_norm -> q_proj -> k_proj -> v_proj -> v_norm -> qk_norm
  -> rope_qk -> attn_sliding -> out_proj -> post_attention_norm
  -> residual_add -> ffn_norm -> mlp_gate_up -> geglu -> mlp_down
  -> post_ffn_norm -> residual_add -> gemma4_per_layer_embed

full_attention_shared_kv:
  attn_norm -> q_proj -> q_norm -> rope_q
  -> attn_shared_kv -> out_proj -> post_attention_norm
  -> residual_add -> ffn_norm -> mlp_gate_up -> geglu
  -> mlp_down -> post_ffn_norm -> residual_add -> gemma4_per_layer_embed
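
The per-layer dispatch described above can be sketched in a few lines. This is an illustration, not the builder's code: `unroll_layers` is a hypothetical helper, the three-layer `layer_kinds` vector is made up for the example (a real one comes from config.json), and only the two kinds shown in the excerpt are included.

```python
from typing import Dict, List, Tuple

# Op sequences copied from the ops_by_kind excerpt.
OPS_BY_KIND: Dict[str, List[str]] = {
    "sliding_attention_kv": [
        "attn_norm", "q_proj", "k_proj", "v_proj", "v_norm", "qk_norm",
        "rope_qk", "attn_sliding", "out_proj", "post_attention_norm",
        "residual_add", "ffn_norm", "mlp_gate_up", "geglu", "mlp_down",
        "post_ffn_norm", "residual_add", "gemma4_per_layer_embed",
    ],
    "full_attention_shared_kv": [
        "attn_norm", "q_proj", "q_norm", "rope_q",
        "attn_shared_kv", "out_proj", "post_attention_norm",
        "residual_add", "ffn_norm", "mlp_gate_up", "geglu",
        "mlp_down", "post_ffn_norm", "residual_add", "gemma4_per_layer_embed",
    ],
}


def unroll_layers(layer_kinds: List[str], ops_by_kind: Dict[str, List[str]]) -> List[Tuple[int, str]]:
    """Resolve each layer's kind once and emit its concrete op sequence."""
    program: List[Tuple[int, str]] = []
    for layer, kind in enumerate(layer_kinds):
        if kind not in ops_by_kind:
            raise ValueError(f"layer {layer}: unknown kind {kind!r}")
        program.extend((layer, op) for op in ops_by_kind[kind])
    return program


# Hypothetical kind vector for a three-layer toy model.
layer_kinds = ["sliding_attention_kv", "sliding_attention_kv", "full_attention_shared_kv"]
unrolled = unroll_layers(layer_kinds, OPS_BY_KIND)
print(len(unrolled))  # 18 + 18 + 15 = 51
```

Because the kind lookup happens here, at plan time, the emitted program for layer 2 simply contains no k_proj or v_proj; nothing has to check the layer kind per token.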

Vision Templates — Multi-Modal Architecture (v8)

v8 pushes the same template system into multimodal territory. qwen3_vl_vision.json is not a decoder template at all. Its family is vision_transformer_with_branches. The attention is bidirectional instead of causal. The normalization is LayerNorm instead of RMSNorm. The activation is GELU instead of SwiGLU.

The template also introduces 2D position handling and branch pipelines. There are custom kernels like position_embeddings_add_tiled_2d and spatial_merge_contiguous_tiled. The footer emits vision_embeddings that the language decoder later consumes.

Then qwen3vl.json takes over for the text side. It carries the chat contract, the <|vision_start|> markers, and the mrope attention contract. One template system now covers both the vision encoder and the language decoder.

The current validated multimodal lane is intentionally narrower than “all vision models now work.” The public v8 vision encoder architecture documents the Qwen3-VL path: real image input, deterministic encoder prefix, bridge stitching, and decoder continuation. The flow is GGUF intake to template resolution to IR/layout to bridge prefix to decoder continuation.

That distinction matters. v8 is proving that the same compiler surface can lower and run the vision encoder, expose artifacts like encoder_v8.c, layout.json, call.json, and then hand the projected rows to the decoder. Some host policy is still explicit, especially image preprocessing and chat-template fallback behavior, but the encoder itself is not an external black box.

The same front-end machinery can describe a causal decoder, a hybrid sliding decoder, and a bidirectional vision transformer without changing the core lowering idea. The back-end still does the same boring job at the end: emit function calls that the lowered IR already settled.

qwen3_vl_vision.json: contract and kernel surface for the vision encoder (json)
{
  "family": "vision_transformer_with_branches",
  "flags": {"patch_frontend": "dual_patch_proj_sum", "activation": "gelu", "normalization": "layernorm"},
  "vision_contract": {"input_modality": "image", "position_encoding": "absolute_2d", "output": "vision_embeddings"},
  "attention_contract": {"attn_variant": "dense_bidirectional", "causal": false, "kv_layout": "ephemeral_full_context"},
  "kernels": {"position_embeddings": "position_embeddings_add_tiled_2d", "spatial_merge": "spatial_merge_contiguous_tiled", "attn_prefill": "attention_forward_full_head_major_gqa_ggml_strided"}
}

qwen3_vl_vision.json: structure with patch ops, body ops, branches, and projector footer (json)
vision_encoder:
  header:
    patchify
    patch_proj
    patch_proj_aux
    patch_sum
    patch_bias
    position_embeddings
    vision_position_ids
  body:
    attn_norm -> qkv_packed_proj -> split_qkv -> vision_mrope
    -> attn -> attn_out_proj -> attn_residual
    -> ffn_norm -> mlp_up -> mlp_gelu -> mlp_down -> mlp_residual
  branches:
    deepstack
  footer:
    final_norm -> merge_main -> projector_fc1 -> projector_gelu -> projector_fc2 -> deepstack_concat

qwen3vl.json: language-side template keeps chat markers and mRoPE separate from vision (json)
{
  "flags": {"use_qkv_bias": false, "has_qk_norm": true, "activation": "swiglu", "rope": "mrope", "tokenizer": "bpe"},
  "chat_contract": {
    "image_begin_marker": "<|vision_start|>",
    "image_end_marker": "<|vision_end|>",
    "template_markers": ["<|im_start|>", "<|im_end|>", "<|vision_start|>", "<|vision_end|>", "", ""]
  },
  "attention_contract": {"rope_layout": "multi_section_1d", "rope_type": "mrope", "qk_norm": true, "kv_layout": "layer_major_kv_cache"}
}

v8 Hardening Surface — Regression, Parity, and Smoke

v8 is the active inference bring-up lane. Training workflows remain in v7. That separation is useful: v8 can harden text inference, multimodal inference, tokenizer/chat-template behavior, parity probes, and generated runtime contracts without pretending the training stack has moved lanes.

The public v8 inference runbook names the operational contract. A model family is not promoted because a template exists. It is promoted when conversion, compile, tokenizer behavior, chat-template behavior, first-token or logits parity, and smoke generation are understood well enough to debug repeatably.

The hardening surface sits next to the runner: run_regression_v8.py, parity_test_v8.py, compare_first_token_logits_v8.py, and run_cached_model_smoke_v8.py. Those files are not marketing artifacts. They are the guardrails that stop “it generated something once” from becoming a fake support claim.

The important architectural point is promotion discipline: v8 support means the generated runtime survives repeatable checks, not just a lucky local prompt.

2,550 lines. Regression, parity, and cached smoke helpers give v8 a practical promotion surface around generated C inference.

v8 runbook: canonical text and multimodal smoke commands (bash)
version/v8/scripts/cks-v8-run run \
  hf://Qwen/Qwen3-0.6B-GGUF/Qwen3-0.6B-Q8_0.gguf \
  --prompt "Explain static code generation in one sentence."

version/v8/scripts/cks-v8-run run \
  hf://Qwen/Qwen3-VL-8B-Instruct-GGUF/Qwen3VL-8B-Instruct-Q4_K_M.gguf \
  --mmproj hf://Qwen/Qwen3-VL-8B-Instruct-GGUF/mmproj-Qwen3VL-8B-Instruct-Q8_0.gguf \
  --image-path version/v8/test_assets/v8_vision_doc_card_72.ppm \
  --prompt "Explain this image."

v8 hardening scripts: repeatability beats vibes (text)
run_regression_v8.py              844 lines
parity_test_v8.py                 899 lines
compare_first_token_logits_v8.py  633 lines
run_cached_model_smoke_v8.py      174 lines

common checks:
  build or reuse cached model artifacts
  compile generated libmodel.so / libdecoder_v8.so
  compare first-token logits where reference output exists
  run short text smokes
  run cached Qwen3-VL E2E smoke when artifacts are available

compare_first_token_logits_v8.py: tokenizer-free parity probe (text)
purpose:
  compare CK runtime logits against llama.cpp reference output

runtime side:
  load libmodel.so
  call ck_model_embed_tokens
  call ck_model_forward
  optionally enable ck_set_strict_parity

why it matters:
  if generation is incoherent, first-token logits expose whether the bug is
  tokenization, chat templating, weight conversion, kernel math, or runtime state
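
As a rough illustration of what a first-token logits comparison can report, here is a minimal sketch in plain Python. It is not the logic of compare_first_token_logits_v8.py: the function name, the returned fields, and the tolerance are assumptions chosen for the example, and the two short vectors are made-up stand-ins for real CK and llama.cpp logits.

```python
from typing import Dict, List, Sequence


def compare_logits(ck: Sequence[float], ref: Sequence[float],
                   top_k: int = 5, atol: float = 1e-3) -> Dict[str, object]:
    """Summarize how closely two logits vectors agree."""
    if len(ck) != len(ref) or len(ck) == 0:
        raise ValueError("logits vectors must be non-empty and the same length")

    # Largest elementwise deviation: sensitive to kernel math and conversion bugs.
    max_abs_diff = max(abs(a - b) for a, b in zip(ck, ref))

    def top_indices(values: Sequence[float]) -> List[int]:
        return sorted(range(len(values)), key=lambda i: values[i], reverse=True)[:top_k]

    ck_top, ref_top = top_indices(ck), top_indices(ref)
    return {
        "max_abs_diff": max_abs_diff,
        "argmax_match": ck_top[0] == ref_top[0],  # same greedy first token?
        "top_k_overlap": len(set(ck_top) & set(ref_top)),
        "within_atol": max_abs_diff <= atol,      # atol is a hypothetical threshold
    }


# Toy vectors standing in for real vocabulary-sized logits.
result = compare_logits([0.10, 2.50, -1.00, 0.70], [0.10, 2.49, -1.00, 0.71], top_k=2)
print(result)
```

Feeding both runtimes the same token ids keeps tokenization and chat templating out of the comparison, so a mismatch here points at weights, kernels, or runtime state.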

v8 make gates: fast local lane plus cached vision E2E lane (bash)
make v8-regression-fast
make test-v8-qwen3vl-e2e-smoke

# The second target is intentionally artifact-aware:
# small runners skip cleanly when the large decoder/mmproj cache is absent.

IR1 — The Op Graph (564 Operations)

IR1 is where the template becomes a concrete op graph. For the Qwen3-0.6B GGUF artifact used here, ir1_decode.json contains exactly 564 decode ops and spans 18,304 lines.

Each IR1 op already names the kernel, the semantic op type, the section, the layer index, the dataflow inputs, the output slots, and any weight references. That is why the builder can be smart without being magical.

Op 0 is a perfect example. It says the kernel is embedding_forward_q8_0, the op is dense_embedding_lookup, the input comes from external:token_ids, the output goes to the main_stream slot, and the weight comes from the token embedding at offset 552.

Once you see IR1, the mental model becomes simple: every layer is just a repeated slot machine of “read from this slot, call this kernel, write to that slot.”

564 ops. For the Qwen3-0.6B GGUF decode artifact, IR1 is already large enough that manual runtime dispatch would be silly. Compilation is the cleaner abstraction.

ir1_decode.json: Op 0 is the header embedding lookup for Qwen3-0.6B (json)
{
  "op_id": 0,
  "kernel": "embedding_forward_q8_0",
  "op": "dense_embedding_lookup",
  "section": "header",
  "layer": -1,
  "dataflow": {
    "inputs": {"token_ids": {"from": "external:token_ids", "dtype": "i32", "slot": "external:token_ids"}},
    "outputs": {"out": {"dtype": "fp32", "slot": "main_stream"}}
  },
  "weights": {"token_emb": {"offset": 552, "size": 165306368, "dtype": "q8_0"}}
}

ir1_decode.json: Op 4 is layer-0 q_proj with quantized weights and a typed slot edge (json)
{
  "op_id": 4,
  "kernel": "gemv_q8_0_q8_0",
  "op": "q_proj",
  "section": "body",
  "layer": 0,
  "dataflow": {
    "inputs": {"x": {"from_op": 3, "from_output": "output", "dtype": "q8_0", "slot": "main_stream_q8"}},
    "outputs": {"y": {"dtype": "fp32", "slot": "q_scratch"}}
  },
  "weights": {
    "wq": {"offset": 170026546, "size": 2228224, "dtype": "q8_0"},
    "bq": {"offset": 172254770, "size": 8192, "dtype": "fp32"}
  }
}
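
The weight sizes in these two ops can be checked by hand. The sketch below assumes the GGML Q8_0 block layout (32 int8 values plus one 2-byte fp16 scale, so 34 bytes per block), and it assumes q_proj is 2048 x 1024 and the token embedding is 151936 x 1024, which are the dimensions reported by other artifacts in this post. The helper name `q8_0_nbytes` is hypothetical.

```python
Q8_0_BLOCK_ELEMS = 32  # weights per Q8_0 block
Q8_0_BLOCK_BYTES = 34  # 2-byte fp16 scale + 32 int8 values


def q8_0_nbytes(rows: int, cols: int) -> int:
    """Byte size of a rows x cols matrix stored as Q8_0 blocks."""
    if cols % Q8_0_BLOCK_ELEMS != 0:
        raise ValueError("row length must be a multiple of the block size")
    return rows * (cols // Q8_0_BLOCK_ELEMS) * Q8_0_BLOCK_BYTES


wq_bytes = q8_0_nbytes(2048, 1024)     # layer-0 q_proj weight
emb_bytes = q8_0_nbytes(151936, 1024)  # token embedding table

print(wq_bytes)   # 2228224, the "size" of wq in Op 4
print(emb_bytes)  # 165306368, the "size" of token_emb in Op 0

# wq and bq are packed back to back: wq offset + wq size == bq offset.
print(170026546 + wq_bytes == 172254770)  # True
```

Being able to reproduce these numbers from shapes alone is a quick sanity check that the manifest, the quant type, and the IR agree.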

IR1 decode summary: sections and the most repeated semantic ops (text)
total ops: 564
header ops: 1
body ops: 560
footer ops: 3

top repeated ops:
  rmsnorm: 57
  residual_save: 56
  residual_add: 56
  quantize_input_0: 28
  q_proj: 28
  k_proj: 28
  v_proj: 28
  qk_norm: 28
  rope_qk: 28
  attn: 28

| Section | Op count | What lives there | Example ops |
|---|---|---|---|
| header | 1 | Tokenizer / embedding entry point | dense_embedding_lookup |
| body | 560 | Repeated decoder layer body | residual_save, rmsnorm, q_proj, attn, mlp_down |
| footer | 3 | Final normalization and logits | rmsnorm, quantize_final_output, logits |

IR1 Stage 2 — Fusion Pass

After IR1 generation, build_ir_v8.py runs a fusion pass. The important detail is where fusion lives: not in codegen, and not in the C runtime. It happens while the graph is still symbolic enough for the builder to match sequences of ops against kernel-registry patterns.

The file says exactly how it works: scan the registry for kernels with a fuses field, collect candidate patterns, match consecutive ops, and replace them with a fused kernel while merging metadata.

Even when a specific artifact does not dramatically shrink its visible op count, the architecture still matters. The compiler has a place where operator fusion belongs, and that place is before memory layout and before C emission.

If an optimization changes what the op graph means, it belongs in IR build. Fusion is the canonical example. You want it to happen when the compiler still understands graph adjacency, not after everything has been flattened into raw pointer arithmetic.

build_ir_v8.py: pipeline comment makes fusion a first-class stage (python)
#!/usr/bin/env python3
"""
build_ir_v8.py - Complete IR Pipeline: Template + Quant → IR1 → Fusion → Layout

PIPELINE (4 stages):
    1. IR1 Generation: Template + Quant Summary → Kernel IDs
    2. Fusion Pass: Combine consecutive kernels using registry-driven patterns
    3. Memory Layout: Plan activation buffers and weight offsets
    4. Output: IR1 JSON + Memory Layout JSON

Stage 1 - IR1 Generation (Direct mapping, no intermediate abstractions):
    1. Parse template sequence (what ops to run)
    2. Read quant summary from manifest (what dtypes for weights)
    3. Map template ops → kernel ops → concrete kernel IDs
    4. Return: List of kernel function names

Stage 2 - Fusion Pass:
    1. Scan kernel registry for kernels with "fuses" field
    2. Match consecutive kernel sequences in IR1
    3. Replace matching sequences with fused kernels

build_ir_v8.py: the fusion pass explicitly looks for registry kernels with a fuses field (python)
        1. Scan registry for kernels with "fuses" field
        2. Match consecutive kernel sequences
        3. Replace with fused kernel, merge weights
        4. Track fusion statistics
    """
    print(f"\n{'='*60}")
    print("FUSION PASS")
    print(f"{'='*60}")

    # Check for fusion disable flag (parameter only)
    if no_fusion:
        print("  ⚠️ Fusion DISABLED (--no-fusion)")
        return ir1_ops, {"total_fusions": 0, "kernels_removed": 0, "fusions_applied": [], "disabled": True}

    # Build fusion patterns from registry
    fusion_patterns = []
    for kernel in registry["kernels"]:
        if "fuses" not in kernel:
            continue

        # Check if this fused kernel matches the mode
        # NOTE: Allow prefill fused kernels in decode mode (v8 baseline parity)
        # The fused prefill kernels work for tokens=1 (decode) and are more accurate
        # because they handle quantization internally.
        variant = kernel.get("variant", "")
        # Don't skip prefill kernels in decode mode - they work with tokens=1
        # if mode == "decode" and "prefill" in variant and "decode" not in variant:
        #     continue
        if mode == "prefill" and "decode" in variant and "prefill" not in variant:
            continue

        pattern = {

Fusion pass mental model: what the builder is allowed to do before lowering (text)
registry kernel advertises:
  id: fused_kernel_name
  fuses: [kernel_a, kernel_b, kernel_c]

builder action:
  scan IR1 for consecutive kernel_a -> kernel_b -> kernel_c
  replace that span with fused_kernel_name
  merge weights and params
  record fusion statistics
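
The mental model above can be turned into a small runnable sketch. This is not the pass in build_ir_v8.py: it works on bare kernel names, skips the weight and param merging, ignores the decode/prefill mode filter, and the longest-pattern-first ordering is a choice made for this example. `fuse_consecutive` and the kernel names are placeholders.

```python
from typing import Dict, List, Tuple


def fuse_consecutive(kernels: List[str], patterns: Dict[str, List[str]]) -> Tuple[List[str], int]:
    """Replace consecutive kernel runs that match a pattern with the fused kernel id."""
    # Try longer patterns first so a short pattern cannot shadow a longer one.
    ordered = sorted(patterns.items(), key=lambda item: len(item[1]), reverse=True)
    fused: List[str] = []
    fusions = 0
    i = 0
    while i < len(kernels):
        for fused_id, sequence in ordered:
            if sequence and kernels[i:i + len(sequence)] == sequence:
                fused.append(fused_id)
                i += len(sequence)
                fusions += 1
                break
        else:
            # No pattern starts here; keep the kernel unchanged.
            fused.append(kernels[i])
            i += 1
    return fused, fusions


patterns = {"fused_kernel_name": ["kernel_a", "kernel_b", "kernel_c"]}
ir1 = ["kernel_a", "kernel_b", "kernel_c", "kernel_d", "kernel_a", "kernel_b", "kernel_c"]
fused, count = fuse_consecutive(ir1, patterns)
print(fused, count)  # ['fused_kernel_name', 'kernel_d', 'fused_kernel_name'] 2
```

The sketch only works because the input is still a list of named ops; once ops have been lowered to pointer expressions, adjacency no longer carries the meaning needed to match a pattern.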

The Memory Planner — Every Byte Has an Address

The memory planner is the bridge between symbolic slots and physical addresses. In v8 it lives in memory_planner_v8.py, a 705-line file whose job is exactly what the name says: assign buffers based on the dataflow graph.

The planner starts from canonical buffers like A_EMBEDDED_INPUT, A_LAYER_INPUT, A_RESIDUAL, A_ATTN_SCRATCH, A_MLP_SCRATCH, A_KV_CACHE, A_LOGITS, and A_LAYER_OUTPUT. The current Qwen3 artifact also allocates specialized scratch regions such as A_ATTN_Q_GATE_PACKED and A_ATTN_GATE.

The result for this Qwen3-0.6B GGUF run is deterministic and inspectable. The layout map says the total footprint is 1,606,394,890 bytes, of which 639,587,338 bytes are weights and 966,807,552 bytes are activations.

This is where the compiler earns trust from systems engineers.

1.50 GB. The Qwen3-0.6B GGUF artifact shows a 610 MB weight arena and a 922 MB activation arena for a 1.50 GB total footprint.

[Figure: Deterministic activation and weight layout for the Qwen3-0.6B decode artifact, including KV cache and scratch buffers.]

memory_planner_v8.py: planner contract and the eight canonical physical buffers (python)
#!/usr/bin/env python3
"""
memory_planner_v8.py - Assign physical buffers based on IR1 dataflow graph.

This replaces the buggy ping-pong buffer logic with explicit dataflow-based assignment.

PIPELINE POSITION:
    IR1 (with dataflow) → Kernel Resolution → MEMORY PLANNER → IR Lower

INPUT:
    - IR1 ops with dataflow info (from build_ir_v8.py)
    - Kernel maps (to know dtype requirements)

OUTPUT:
    - Buffer assignments per op: {op_id: {input_name: buffer, output_name: buffer}}

PHYSICAL BUFFERS:
    - A_EMBEDDED_INPUT  : Main activation buffer 1 (FP32)
    - A_LAYER_INPUT     : Main activation buffer 2 (FP32/Q8)
    - A_RESIDUAL        : Saved residual for skip connections (FP32)
    - A_ATTN_SCRATCH    : Q/K/V projections and attention output (FP32)
    - A_MLP_SCRATCH     : MLP gate_up and swiglu output (FP32)
    - A_KV_CACHE        : KV cache (persistent across tokens)
    - A_LOGITS          : Final logits output (FP32)
    - A_LAYER_OUTPUT    : Layer output buffer (FP32)

memory_planner_v8.py: PHYSICAL_BUFFERS registry in the real source file (python)
    "A_EMBEDDED_INPUT": PhysicalBuffer(
        name="A_EMBEDDED_INPUT",
        dtype="fp32",
        last_writer=-1,
        can_hold=["fp32", "q8_0", "q8_k"]
    ),
    "A_LAYER_INPUT": PhysicalBuffer(
        name="A_LAYER_INPUT",
        dtype="fp32",
        last_writer=-1,
        can_hold=["fp32", "q8_0", "q8_k"]
    ),
    "A_RESIDUAL": PhysicalBuffer(
        name="A_RESIDUAL",
        dtype="fp32",
        last_writer=-1,
        can_hold=["fp32"]
    ),
    "A_ATTN_SCRATCH": PhysicalBuffer(
        name="A_ATTN_SCRATCH",

layout_decode.map: Qwen3-0.6B GGUF memory summary (text)
MEMORY SUMMARY
--------------------------------------------------------------------------------
  Total:                1,606,394,890 bytes  (1.50 GB)
  Weights:                639,587,338 bytes  (610.0 MB)
  Activations:            966,807,552 bytes  (922.0 MB)

layout_decode.map: activation buffer table from the real run artifact (text)
Offset         End            Size (bytes)               Buffer                   Shape                         
------------------------------------------------------------------------------------------------------------------------
0x000000000000 0x000000004000       16,384  (  16.00 KB)  text_input               [16384]                       
0x000000004000 0x000000005000        4,096  (   4.00 KB)  token_ids                [1024]                        
0x000000005000 0x000000405000    4,194,304  (   4.00 MB)  embedded_input           [1024, 1024]                  
0x000000405000 0x000000805000    4,194,304  (   4.00 MB)  layer_input              [1024, max(1024, Q8_K(3072))] 
0x000000805000 0x000000C05000    4,194,304  (   4.00 MB)  residual                 [1024, 1024]                  
0x000000C05000 0x00000EC05000  234,881,024  ( 224.00 MB)  kv_cache                 [28, 2, 8, 1024, 128]         
0x00000EC05000 0x00000EC85000      524,288  ( 512.00 KB)  rope_cache               [2, 1024, 64]                 
0x00000EC85000 0x00000F485000    8,388,608  (   8.00 MB)  q_scratch                [16, 1024, 128]               
0x00000F485000 0x00000F885000    4,194,304  (   4.00 MB)  k_scratch                [8, 1024, 128]                
0x00000F885000 0x00000FC85000    4,194,304  (   4.00 MB)  v_scratch                [8, 1024, 128]                
0x00000FC85000 0x000010C85000   16,777,216  (  16.00 MB)  attn_q_gate_packed       [1024, 4096]                  
0x000010C85000 0x000011485000    8,388,608  (   8.00 MB)  attn_gate                [1024, 2048]                  
0x000011485000 0x000011C85000    8,388,608  (   8.00 MB)  attn_scratch             [16, 1024, 128]               
0x000011C85000 0x000014485000   41,943,040  (  40.00 MB)  mlp_scratch              [max(1024*6144, fused_attn, geglu_bf16)]
0x000014485000 0x000014885000    4,194,304  (   4.00 MB)  layer_output             [1024, 1024]                  
0x000014885000 0x000039A05000  622,329,856  ( 593.50 MB)  logits                   [1024, 151936]

| Buffer | Offset | Size (bytes) | Shape | Purpose |
|---|---|---|---|---|
| embedded_input | 0x000000005000 | 4,194,304 | [1024, 1024] | Main FP32 stream after embedding and norm |
| layer_input | 0x000000405000 | 4,194,304 | [1024, max(1024, Q8_K(3072))] | Quantized or alternate main stream |
| residual | 0x000000805000 | 4,194,304 | [1024, 1024] | Saved skip-connection copy |
| kv_cache | 0x000000C05000 | 234,881,024 | [28, 2, 8, 1024, 128] | Persistent decode cache |
| q_scratch | 0x00000EC85000 | 8,388,608 | [16, 1024, 128] | Q projection output |
| attn_scratch | 0x000011485000 | 8,388,608 | [16, 1024, 128] | Attention output workspace |
| mlp_scratch | 0x000011C85000 | 41,943,040 | [max(1024*6144, fused_attn, geglu_bf16)] | MLP gate/up/down workspace |
| logits | 0x000014885000 | 622,329,856 | [1024, 151936] | Final logits buffer |

| Physical buffer name | Declared dtype(s) | Source of truth | Why it exists |
|---|---|---|---|
| A_EMBEDDED_INPUT | fp32 / q8_0 / q8_k | memory_planner_v8.py | First main-stream arena |
| A_LAYER_INPUT | fp32 / q8_0 / q8_k | memory_planner_v8.py | Ping-pong alternate for quantized activations |
| A_RESIDUAL | fp32 | memory_planner_v8.py | Skip connection save buffer |
| A_ATTN_SCRATCH | fp32 | memory_planner_v8.py | Attention-side temporary output |
| A_MLP_SCRATCH | fp32 | memory_planner_v8.py | MLP-side temporary output |
| A_KV_CACHE | fp32 | memory_planner_v8.py | Persistent cross-token state |
| A_LOGITS | fp32 | memory_planner_v8.py | Final output arena |
| A_LAYER_OUTPUT | fp32 | memory_planner_v8.py | Explicit layer-output staging area |
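
The activation map is packed contiguously, so its first few offsets can be reproduced with a simple bump allocator. The sketch below is only that arithmetic: `plan_offsets` is a hypothetical helper, the sizes are derived from the shapes in the table assuming 4-byte elements, and the real planner chooses buffers from the dataflow graph rather than from a fixed list.

```python
from typing import Dict, List, Tuple

FP32_BYTES = 4
I32_BYTES = 4


def plan_offsets(buffers: List[Tuple[str, int]], base: int = 0) -> Dict[str, Tuple[int, int]]:
    """Assign each buffer a [start, end) byte range by bumping a cursor."""
    layout: Dict[str, Tuple[int, int]] = {}
    cursor = base
    for name, size in buffers:
        layout[name] = (cursor, cursor + size)
        cursor += size
    return layout


# Sizes follow the shapes in the activation table above.
buffers = [
    ("text_input", 16384),
    ("token_ids", 1024 * I32_BYTES),
    ("embedded_input", 1024 * 1024 * FP32_BYTES),
    ("layer_input", 4194304),
    ("residual", 1024 * 1024 * FP32_BYTES),
    ("kv_cache", 28 * 2 * 8 * 1024 * 128 * FP32_BYTES),
]
layout = plan_offsets(buffers)

for name, (start, end) in layout.items():
    print(f"{name:16s} 0x{start:012X} 0x{end:012X} {end - start:>12,d}")
```

The KV cache dominates this prefix: 28 layers x 2 (K and V) x 8 KV heads x 1024 positions x 128 dims x 4 bytes is 234,881,024 bytes, the 224 MB row in the map.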

The Lowered IR — Concrete Pointer Expressions

IR1 still talks in slots. Lowered IR talks in addresses. In the Qwen3-0.6B GGUF artifact, lowered_decode.json contains 592 lowered ops and spans 36,181 lines.

Op 0 no longer says “read token_ids and write main_stream.” It says “read tokens from activations + 16384, write output to activations + 20480, and read token embeddings from bump_weights + 0.” Op 4 does the same for the layer-0 Q projection.

Once lowering is done, codegen has almost nothing left to decide. The pointer expressions already exist.

Lowered IR is the moment where the compiler becomes brutally concrete. If a pointer is wrong in generated C, the real bug is usually upstream in memory planning or lowering. Codegen is just copying expressions it was handed.

[Figure: Transformation from symbolic IR1 slots to lowered IR with explicit bump-weight and activation pointer expressions.]

lowered_decode.json: Op 0 after lowering has real pointer expressions (json)
{
  "idx": 0,
  "kernel": "embedding_forward_q8_0",
  "function": "embedding_forward_q8_0",
  "weights": {"token_emb": {"ptr_expr": "bump_weights + 0"}},
  "activations": {"tokens": {"ptr_expr": "activations + 16384"}},
  "outputs": {"output": {"ptr_expr": "activations + 20480"}},
  "params": {"embed_dim": 1024, "num_layers": 28, "seq_len": 1}
}

lowered_decode.json: Op 4 q_proj now points at exact weight and scratch offsets (json)
{
  "idx": 4,
  "kernel": "gemv_q8_0_q8_0",
  "op": "q_proj",
  "weights": {
    "wq": {"ptr_expr": "bump_weights + 170025994"},
    "bq": {"ptr_expr": "bump_weights + 172254218"}
  },
  "activations": {"x": {"ptr_expr": "activations + 4214784"}},
  "outputs": {"y": {"ptr_expr": "activations + 248008704"}},
  "params": {"_output_dim": 2048, "_input_dim": 1024, "_m": 1}
}

lowered_decode.json: final footer ops are already call-ready (condensed from json)
idx 589  rmsnorm
  W final_ln_weight = bump_weights + 639579146
  A input           = activations + 20480
  O output          = activations + 20480

idx 590  quantize_final_output
  A input           = activations + 20480
  O output          = activations + 4214784

idx 591  logits
  W token_emb       = bump_weights + 0
  A input           = activations + 4214784
  O logits          = activations + 344477696
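
The step from slots to addresses can be sketched in a few lines. This is a minimal illustration, assuming a layout table that maps slot names to byte offsets. The slot names token_ids and main_stream and the offsets come from the Op 0 example above; the function lower_op and the shape of the IR1 dict are hypothetical stand-ins, not the real lowering code.

```python
# Hypothetical sketch: resolve symbolic IR1 slots into pointer expressions.
# The dict shapes below are illustrative assumptions, not the real IR schema.

def lower_op(ir1_op, activation_layout, weight_layout):
    """Replace slot and weight names with 'base + offset' pointer expressions."""
    def act(slot):
        return {"ptr_expr": f"activations + {activation_layout[slot]}"}

    def weight(name):
        return {"ptr_expr": f"bump_weights + {weight_layout[name]}"}

    return {
        "idx": ir1_op["idx"],
        "kernel": ir1_op["kernel"],
        "weights": {arg: weight(name) for arg, name in ir1_op["weights"].items()},
        "activations": {arg: act(slot) for arg, slot in ir1_op["inputs"].items()},
        "outputs": {arg: act(slot) for arg, slot in ir1_op["outputs"].items()},
    }


# Offsets taken from the Op 0 example in this post.
activation_layout = {"token_ids": 16384, "main_stream": 20480}
weight_layout = {"token_emb": 0}

# Symbolic form of Op 0: it names slots, not addresses.
ir1_op0 = {
    "idx": 0,
    "kernel": "embedding_forward_q8_0",
    "weights": {"token_emb": "token_emb"},
    "inputs": {"tokens": "token_ids"},
    "outputs": {"output": "main_stream"},
}

lowered_op0 = lower_op(ir1_op0, activation_layout, weight_layout)
print(lowered_op0["outputs"]["output"]["ptr_expr"])
```

The point of the sketch is that lowering is a lookup, not a decision. Every choice that matters was made when the layout table was built.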

Codegen — The Dumb Emitter (1,025 Lines)

The comment at the top of codegen_v8.py is unusually candid, and that is a good thing. It says the job is to create memory layout declarations, parse lowered IR, emit unrolled function calls, and pass pointers cleanly. Then it states the bug-routing rule in plain language: if there are memory issues, fix the memory layout builder, not codegen; if there are kernel issues, fix the IR lower, not codegen.

That is the architectural payoff of the whole pipeline. Codegen does not carry model-family intelligence, dispatch logic, or ad hoc runtime decisions. It walks the lowered ops list, emits one C call after another, wires in stop-op hooks, and includes optional parity/profile instrumentation when the build asked for it.

The prefill generator follows the same philosophy. Even parallelization is treated as upstream truth.

“Dumb” is not an insult here. It is a design goal. The less intelligence codegen carries, the fewer places there are for architecture bugs to hide.
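
To make the "dumb emitter" idea concrete, here is a minimal sketch of an emitter that only serializes what it is handed. The names emit_call and emit_body are hypothetical, and the argument order (weights, then activations, then outputs) is an assumption for illustration only; real kernels have their own signatures, which the lowered IR would carry.

```python
# Hypothetical sketch of a "dumb" emitter: no model knowledge, no decisions.
# The argument order (weights, activations, outputs) is an assumption made
# for illustration; real kernels define their own signatures.

def emit_call(op):
    """Serialize one lowered op into a C call followed by its stop-op check."""
    args = []
    for group in ("weights", "activations", "outputs"):
        for value in op.get(group, {}).values():
            args.append(f"(void*)({value['ptr_expr']})")
    lines = [f"/* Op {op['idx']}: {op['function']} */"]
    lines.append(f"{op['function']}(")
    lines.append(",\n".join(f"    {arg}" for arg in args))
    lines.append(");")
    lines.append(f"if (stop_seq == {op['idx']}) return;")
    return "\n".join(lines)


def emit_body(lowered_ops):
    """Walk the lowered ops in order and concatenate their emitted calls."""
    return "\n".join(emit_call(op) for op in lowered_ops)


# Op 0 as it appears in the lowered IR example from this post.
lowered_ops = [
    {
        "idx": 0,
        "function": "embedding_forward_q8_0",
        "weights": {"token_emb": {"ptr_expr": "bump_weights + 0"}},
        "activations": {"tokens": {"ptr_expr": "activations + 16384"}},
        "outputs": {"output": {"ptr_expr": "activations + 20480"}},
    },
]

c_source = emit_body(lowered_ops)
print(c_source)
```

Nothing in the sketch inspects the model family or picks a kernel. If the emitted pointer is wrong, the only place to look is the data passed in.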

codegen_v8.py: the file tells you exactly what responsibilities it owns (python)
#!/usr/bin/env python3
from __future__ import annotations
"""
codegen_v8.py - Generate C code from lowered IR.

RESPONSIBILITIES:
1. Create memory layout from layout.json (structs, offsets, allocations)
2. Parse lowered IR and emit function calls (unrolled, one after another)
3. Pass pointers cleanly to all functions

If there are memory issues → fix the memory layout builder, not codegen.
If there are kernel issues → fix the IR lower, not codegen.

===============================================================================

codegen_v8.py: technical debt tracker for RoPE and activation assumptions (python)
This section documents values that are hardcoded in codegen but should come from
IR config or dedicated kernels. These WILL BREAK for non-Qwen2 models.

Delete entries from this list as they are properly fixed.

NOTE: Init ops (rope_init, etc.) now use init_call.json pattern:
  manifest.config → init.json → init_call.json → codegen emits calls
  This is the correct pattern for model-specific initialization.

┌─────────────────────────────────────────────────────────────────────────────┐
│ 1. ROPE SCALING TYPE - MEDIUM                                               │
├─────────────────────────────────────────────────────────────────────────────┤
│ Location: rope_precompute_cache kernel                                      │
│ Current: Standard RoPE only (no scaling)                                    │
│ Should be: Support for rope_scaling_type from config:                       │
│   - "linear": freq *= 1/scaling_factor                                      │
│   - "dynamic": NTK-aware dynamic scaling                                    │
│   - "yarn": YaRN (Yet another RoPE extensioN)                               │
│                                                                             │
│ Impact: Context extension won't work for models using scaled RoPE           │
│   - Llama 3.1 uses scaled RoPE for 128K context                             │
│   - Code Llama uses linear scaling                                          │
│                                                                             │
│ Fix: Extend rope_precompute_cache kernel to accept scaling_type param       │
│      init.json already has rope_scaling_type field ready to use             │

codegen_v8.py: emit_decode writes ck_decode from lowered IR and appends stop-op hooks (python)
    tokenizer_include = ""
    if init_call:
        for op in init_call.get("operations", []):
            c_code = op.get("c_code", {})
            if isinstance(c_code, dict) and c_code.get("include"):
                tokenizer_include = c_code["include"]
                break

    parts = []
    parts.append(f'''/*
 * Auto-generated by codegen_v8.py
 * Generated: {now}
 * Model: {config.get("model", "unknown")}
 * Mode: {ir.get("mode", "decode")}
 * Layers: {config.get("num_layers", 0)} (unrolled)
 * RoPE: theta={rope_theta}, rotary={rotary_dim}, scaling={rope_scaling_type}/{rope_scaling_factor}
 * RoPE kernels: init={rope_init_kernel}, qk={rope_qk_kernel}, cache={rope_cache_layout}
 */

#define _GNU_SOURCE
#include 
#include 
#include 
#include 

codegen_prefill_v8.py: multi-token prefill is generated with the same dumb-emitter philosophy (python)
#!/usr/bin/env python3
"""
codegen_prefill_v8.py - Generate C code for PREFILL mode from lowered IR.

This generates ck_prefill() which processes multiple tokens at once.
The IR (lowered_prefill_call.json) already has function names and expressions.
We just substitute num_tokens for const:1 sources.

=============================================================================
IMPORTANT: CODEGEN IS DUMB - NO PARALLELIZATION LOGIC HERE
=============================================================================

When you look at this code, you'll see many `for` loops that LOOK like they
could be parallelized with `#pragma omp parallel for`. You might be tempted
to add pragmas here. DON'T.

WHY NOT?

1. Codegen has NO global view of the computation graph
2. Adding pragmas here could cause FALSE SHARING between ops
3. Two adjacent ops might both parallelize the same buffer = cache thrashing
4. Thread over-subscription if multiple ops spawn threads

WHERE DOES PARALLELIZATION COME FROM?

The Generated C — What model_v8.c Looks Like

The emitted file is not tiny. For a concrete Qwen3-0.6B GGUF artifact, model_v8.c becomes a large model-specific C translation unit. But the structure is easy to read once you know what to look for.

The weight metadata is compile-time data. HeaderOffsets and LayerOffsets L_LAYERS[28] are just raw offsets into the bump file. The activation metadata is the same idea for the single contiguous allocation.

That is also where the debugging hooks become visible. Every op gets an optional if (stop_seq == N) return; check, and parity instrumentation can be compiled in with CK_PARITY_DUMP.

The generated C does not describe a framework. It describes this model, with these offsets, calling these kernels in this exact order.

Generated C: the generated decode+prefill file is large because layers are fully unrolled and every offset is made explicit as compile-time data.

model_v8.c: generated header, configuration defines, and HeaderOffsets (c)
/*
 * Auto-generated by codegen_v8.py
 * Generated: 2026-05-29 21:57:57
 * Model: qwen3
 * Mode: decode
 * Layers: 28 (unrolled)
 * RoPE: theta=1000000.0, rotary=128, scaling=none/1.0
 * RoPE kernels: init=rope_precompute_cache, qk=rope_forward_qk_with_rotary_dim, cache=rotary_dim/2
 */
/* ============================================================================
 * MODEL CONFIGURATION
 * ============================================================================ */
#define EMBED_DIM 1024
#define NUM_HEADS 16
#define NUM_KV_HEADS 8
#define HEAD_DIM 128
#define ROTARY_DIM 128
#define INTERMEDIATE_SIZE 3072
#define NUM_LAYERS 28
#define VOCAB_SIZE 151936
#define MAX_SEQ_LEN 1024
/* RoPE scaling: type=none, factor=1.0 */

/* Memory sizes */
#define WEIGHTS_SIZE 639587338ULL
#define ACTIVATIONS_SIZE 966807552ULL

model_v8.c: LayerOffsets L_LAYERS[28] begins as compile-time data for each layer (c)
/* Per-layer weight offsets */
typedef struct {
    size_t b1;
    size_t b2;
    size_t bk;
    size_t bo;
    size_t bq;
    size_t bv;
    size_t k_norm;
    size_t ln1_gamma;
    size_t ln2_gamma;
    size_t q_norm;
    size_t w1;
    size_t w2;
    size_t wk;
    size_t wo;
    size_t wq;
    size_t wv;
} LayerOffsets;

static const LayerOffsets L_LAYERS[28] = {
    [0] = {
        .b1 = 183416842,
        .b2 = 186783754,
        .bk = 173377034,
        .bo = 176728074,
        .bq = 172254218,
        .bv = 174495754,
        .k_norm = 173381130,
        .ln1_gamma = 170017802,
        .ln2_gamma = 170021898,
        .q_norm = 172262410,
        .w1 = 176732170,
        .w2 = 183441418,
        .wk = 172262922,
        .wo = 174499850,
        .wq = 170025994,
        .wv = 173381642,
    },
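
The layer-0 offsets can be sanity-checked by hand. The sketch below assumes the standard Q8_0 block layout (32 int8 values plus one fp16 scale, 34 bytes per block) and assumes that each weight matrix is followed directly by its bias in the bump file. The tensor shapes are inferred from the configuration defines, including the guess that w1 holds a fused gate and up projection, so treat them as assumptions rather than facts read from the manifest.

```python
# Worked arithmetic: do the layer-0 offsets match Q8_0 tensor sizes?
# Assumptions: Q8_0 stores 32 int8 values plus one fp16 scale per block
# (34 bytes), and each weight matrix sits directly before its bias.

Q8_0_BLOCK_ELEMS = 32
Q8_0_BLOCK_BYTES = 34  # 2-byte fp16 scale + 32 one-byte quants

EMBED_DIM = 1024
NUM_HEADS = 16
NUM_KV_HEADS = 8
HEAD_DIM = 128
INTERMEDIATE_SIZE = 3072


def q8_0_bytes(rows, cols):
    """Byte size of a rows x cols matrix stored as Q8_0 blocks."""
    return rows * cols // Q8_0_BLOCK_ELEMS * Q8_0_BLOCK_BYTES


# Layer-0 offsets copied from L_LAYERS[0].
offsets = {
    "wq": 170025994, "bq": 172254218,
    "wk": 172262922, "bk": 173377034,
    "w1": 176732170, "b1": 183416842,
    "w2": 183441418, "b2": 186783754,
}

# Expected sizes; the fused gate+up shape for w1 is an inference.
expected = {
    "wq": q8_0_bytes(NUM_HEADS * HEAD_DIM, EMBED_DIM),
    "wk": q8_0_bytes(NUM_KV_HEADS * HEAD_DIM, EMBED_DIM),
    "w1": q8_0_bytes(2 * INTERMEDIATE_SIZE, EMBED_DIM),
    "w2": q8_0_bytes(EMBED_DIM, INTERMEDIATE_SIZE),
}

# Distance from each weight matrix to the bias stored right after it.
gaps = {
    "wq": offsets["bq"] - offsets["wq"],
    "wk": offsets["bk"] - offsets["wk"],
    "w1": offsets["b1"] - offsets["w1"],
    "w2": offsets["b2"] - offsets["w2"],
}

for name in expected:
    status = "ok" if gaps[name] == expected[name] else "MISMATCH"
    print(f"{name}: gap={gaps[name]} expected={expected[name]} {status}")
```

Under these assumptions all four gaps line up with the computed sizes, which is the kind of check that makes a raw offset table auditable without running the model.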

model_v8.c: activation defines and the single contiguous CKModel memory story (c)
#define A_TEXT_INPUT 639587890
#define A_TOKEN_IDS 639604274
#define A_EMBEDDED_INPUT 639608370
#define A_LAYER_INPUT 643802674
#define A_RESIDUAL 647996978
#define A_KV_CACHE 652191282
#define A_ROPE_CACHE 887072306
#define A_Q_SCRATCH 887596594
#define A_K_SCRATCH 895985202
#define A_V_SCRATCH 900179506
#define A_ATTN_Q_GATE_PACKED 904373810
#define A_ATTN_GATE 921151026
#define A_ATTN_SCRATCH 929539634
#define A_MLP_SCRATCH 937928242
#define A_LAYER_OUTPUT 979871282
#define A_LOGITS 984065586


/* ============================================================================
 * MODEL STRUCT
 * ============================================================================ */
typedef struct {
    uint8_t *bump;           /* Single contiguous allocation */
    size_t bump_size;
    uint8_t *bump_weights;   /* Weights section */
    float *activations;      /* Activations section */
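
The A_* defines can be checked against the configuration defines with plain arithmetic. This is a sketch under stated assumptions: arenas are laid out back to back so that the gap between two adjacent defines equals the size of the first, cache and hidden-state entries are fp32, and token ids are int32. None of this is read from the planner source; it is inferred from the numbers shown in this post.

```python
# Worked arithmetic: do gaps between A_* defines match config-derived sizes?
# Assumptions: arenas are laid out back to back, cache entries are fp32,
# and token ids are int32.

MAX_SEQ_LEN = 1024
EMBED_DIM = 1024
NUM_LAYERS = 28
NUM_KV_HEADS = 8
HEAD_DIM = 128
ROTARY_DIM = 128
FP32 = 4
INT32 = 4

# Offsets copied from the generated defines.
A_TOKEN_IDS = 639604274
A_EMBEDDED_INPUT = 639608370
A_LAYER_INPUT = 643802674
A_KV_CACHE = 652191282
A_ROPE_CACHE = 887072306
A_Q_SCRATCH = 887596594

checks = {
    # one int32 token id per position
    "token_ids": (A_EMBEDDED_INPUT - A_TOKEN_IDS, MAX_SEQ_LEN * INT32),
    # fp32 hidden state for every position (prefill-sized)
    "embedded_input": (A_LAYER_INPUT - A_EMBEDDED_INPUT,
                       MAX_SEQ_LEN * EMBED_DIM * FP32),
    # K and V for every layer, KV head, position, and head dimension
    "kv_cache": (A_ROPE_CACHE - A_KV_CACHE,
                 NUM_LAYERS * 2 * NUM_KV_HEADS * MAX_SEQ_LEN * HEAD_DIM * FP32),
    # cos and sin tables with rotary_dim/2 entries per position
    "rope_cache": (A_Q_SCRATCH - A_ROPE_CACHE,
                   2 * MAX_SEQ_LEN * (ROTARY_DIM // 2) * FP32),
}

for name, (gap, expected) in checks.items():
    print(f"{name}: gap={gap} expected={expected} match={gap == expected}")
```

The KV cache gap works out to 234,881,024 bytes, which matches 28 layers of K and V for 8 KV heads at 1,024 positions and 128 dimensions in fp32.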

model_v8.c: ck_decode begins with embedding, residual save, norm, quantize, and q_proj (c)
    /* Store token at offset 639604274 (from layout) */
    *(int32_t*)(MEM + 639604274) = token;

    /* Op 0: embedding_forward_q8_0 (dense_embedding_lookup) layer=-1 section=header */
    embedding_forward_q8_0(
        (int32_t*)(model->bump + A_TOKEN_IDS),
        1,
        151936,
        (const void*)(model->bump + W_TOKEN_EMB),
        NULL,
        (float*)(model->bump + A_EMBEDDED_INPUT),
        1024,
        1024,
        1,
        0
    );
    if (stop_seq == 0) return;
    /* Op 1: memcpy (residual_save) layer=0 section=body */
    memcpy(
        (void*)(model->bump + A_RESIDUAL),
        (const void*)(model->bump + A_EMBEDDED_INPUT),
        4096
    );
    if (stop_seq == 1) return;

    /* Op 2: rmsnorm_forward (rmsnorm) layer=0 section=body */
    rmsnorm_forward(
        (const float*)(model->bump + A_EMBEDDED_INPUT),
        (float*)(model->bump + W_LAYER_0_LN1_GAMMA),
        (float*)(model->bump + A_EMBEDDED_INPUT),
        NULL,
        1,
        1024,
        1024,
        9.999999974752427e-07
    );
    if (stop_seq == 2) return;
    /* Op 3: quantize_row_q8_0 (quantize_input_0) layer=0 section=body */
    quantize_row_q8_0(
        (const float*)(model->bump + A_EMBEDDED_INPUT),
        (void*)(model->bump + A_LAYER_INPUT),
        1024
    );
    if (stop_seq == 3) return;

    /* Op 4: gemv_q8_0_q8_0 (q_proj) layer=0 section=body */
    gemv_q8_0_q8_0(
        (float*)(model->bump + A_Q_SCRATCH),
        (const void*)(model->bump + W_LAYER_0_WQ),
        (void*)(model->bump + A_LAYER_INPUT),
        2048,
        1024
    );
    if (stop_seq == 4) return;

model_v8.c: footer ops 589 through 591 finish the decode path with logits (c)
    /* Op 589: rmsnorm_forward (rmsnorm) layer=-1 section=footer */
    rmsnorm_forward(
        (const float*)(model->bump + A_EMBEDDED_INPUT),
        (float*)(model->bump + W_FINAL_LN_WEIGHT),
        (float*)(model->bump + A_EMBEDDED_INPUT),
        NULL,
        1,
        1024,
        1024,
        9.999999974752427e-07
    );
    if (stop_seq == 589) return;

    /* Op 590: quantize_row_q8_0 (quantize_final_output) layer=-1 section=footer */
    quantize_row_q8_0(
        (const float*)(model->bump + A_EMBEDDED_INPUT),
        (void*)(model->bump + A_LAYER_INPUT),
        1024
    );
    if (stop_seq == 590) return;

    /* Op 591: gemv_q8_0_q8_0 (logits) layer=-1 section=footer */
    gemv_q8_0_q8_0(
        (float*)(model->bump + A_LOGITS),
        (const void*)(model->bump + W_TOKEN_EMB),
        (void*)(model->bump + A_LAYER_INPUT),
        151936,
        1024
    );
    if (stop_seq == 591) return;

    model->pos++;
}
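
One thing the per-op stop checks make possible is bisecting for the first op whose output drifts from a reference. The sketch below is hypothetical: run_until and reference_until are stand-ins for "run the generated decode with stop_seq set to n and read the op's output" and "fetch the reference output for op n", and it assumes divergence is monotonic, meaning every checkpoint after the first bad op is also bad. It is not a description of the actual v8 parity tools.

```python
# Hypothetical sketch: bisect for the first diverging op using stop-op runs.
# run_until(n) and reference_until(n) are stand-ins for "run the generated
# decode with stop_seq == n" and "fetch the reference output of op n".

def first_divergent_op(num_ops, run_until, reference_until, tol=1e-4):
    """Return the lowest op index whose output differs from the reference.

    Assumes divergence is monotonic: once an op is wrong, every later
    checkpoint is wrong too. Returns None if all ops match.
    """
    def diverges(n):
        got, want = run_until(n), reference_until(n)
        return any(abs(a - b) > tol for a, b in zip(got, want))

    if not diverges(num_ops - 1):
        return None
    lo, hi = 0, num_ops - 1  # invariant: op hi diverges
    while lo < hi:
        mid = (lo + hi) // 2
        if diverges(mid):
            hi = mid
        else:
            lo = mid + 1
    return lo


# Toy stand-ins: the "generated" path goes wrong starting at op 137.
def fake_reference(n):
    return [float(n), float(n) * 0.5]


def fake_run(n):
    bump = 1.0 if n >= 137 else 0.0
    return [float(n) + bump, float(n) * 0.5]


print(first_divergent_op(592, fake_run, fake_reference))
```

With 592 ops, a bisection like this needs about ten stop-op runs instead of hundreds, which is why a cheap early return per op is worth the emitted lines.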

Compilation and Linking — The .so Is the Model

Step [5/6] is where the compiler hands off to the system compiler. ck_run_v8.py builds or refreshes libckernel_engine.so and libckernel_tokenizer.so, then compiles model_v8.c into libmodel.so.

The compile flags are the ones you would expect for a performance-oriented shared library: -shared, -fPIC, -O3, -march=native, and, critically, -mcmodel=large, which the build script adds to handle the large static data in v8 models.

If CK_V8_COMPILER is set, the runner uses that compiler (gcc, icx, or clang) and exits with an error when it is not on PATH. Otherwise it prefers icx when available and falls back to gcc. Once this step finishes, Python is no longer the execution engine. It is just a loader.

The compiler stage is where the generated C stops being documentation and becomes an executable artifact. After step [5/6], what matters for inference is no longer “did Python build the graph?” but “did the native toolchain compile and link the generated program?”

ck_run_v8.py: step [5/6] builds runtime libs and targets libmodel.so (python)
    log_step(5, "Compiling to shared library")

    # Output library name (ck_chat.py expects libmodel.so or ck-kernel-inference.so)
    lib_path = output_dir / "libmodel.so"
    kernel_lib = BUILD_DIR / "libckernel_engine.so"
    tokenizer_lib = BUILD_DIR / "libckernel_tokenizer.so"

    log(f"  Source: {model_c_path}", C_DIM)
    log(f"  Lines: {sum(1 for _ in open(model_c_path))}", C_DIM)

        runtime_targets.append(kernel_lib)
    if _runtime_lib_needs_rebuild(tokenizer_lib, tokenizer_source_roots):
        runtime_targets.append(tokenizer_lib)
    if runtime_targets:
        verb = "missing/stale" if any(not p.exists() for p in runtime_targets) else "stale"
        log(f"  Building {verb} runtime libs: {', '.join(p.name for p in runtime_targets)}", C_DIM)
        make_targets = [_path_to_make_target(path) for path in runtime_targets]
        run_cmd(["make"] + make_targets, cwd=PROJECT_ROOT)

ck_run_v8.py: compile command with -shared, -fPIC, -O3, -march=native, and -mcmodel=large (python)
    # Override with CK_V8_COMPILER=gcc|icx|clang when needed (e.g., profiling portability).
    import shutil
    compiler = "gcc"
    requested_compiler = os.environ.get("CK_V8_COMPILER", "").strip()
    if requested_compiler:
        if not shutil.which(requested_compiler):
            log_error(f"Requested CK_V8_COMPILER not found in PATH: {requested_compiler}")
            sys.exit(1)
        compiler = requested_compiler
    elif shutil.which("icx"):
        compiler = "icx"

    omp_flag = "-qopenmp" if compiler == "icx" else "-fopenmp"

    cmd = [
        compiler,
        "-shared", "-fPIC",
        "-mcmodel=large",  # Handle large static data in v8 models
        "-O3", "-march=native",
        "-std=c11",
        "-fvisibility=default",  # Export CK_EXPORT symbols
        omp_flag,  # OpenMP for parallelization
        f"-I{include_dir}",
        f"-I{v8_include}",
        f"-I{v8_src}",
        "-o", str(lib_path),
        str(model_c_path),
        str(loader_src),
        str(v8_src / "ck_parallel_decode_v8.c"),  # Thread-pool parallel GEMV dispatch
        str(v8_src / "ck_parallel_prefill_v8.c"),  # Thread-pool parallel GEMM dispatch (prefill)
        f"-L{BUILD_DIR}",
        f"-L{output_dir}",  # Also look in output_dir for libckernel_engine.so
        "-lckernel_tokenizer",  # BPE tokenizer library
        # Keep tokenizer before engine: both export legacy ck_tokenizer_* symbols,

ck_chat.py: after compilation, chat just loads libmodel.so with ctypes (python)
        # Load C library first (needed to check for C tokenizer)
        lib_path = self.model_dir / "ck-kernel-inference.so"
        if not lib_path.exists():
            lib_path = self.model_dir / "ck-kernel-decode.so"
        if not lib_path.exists():
            lib_path = self.model_dir / "libmodel.so"
        if not lib_path.exists():
            print(f"Error: Model library not found in: {self.model_dir}")
            return False

        stale_errors = self._runtime_artifact_staleness_errors(lib_path)
        if stale_errors:
            for msg in stale_errors:
                print(f"Error: {msg}")
            return False
        self.lib = ctypes.CDLL(str(lib_path))

Independence — The Generated Code Stands Alone

The independence claim is not rhetorical. The Qwen3 runtime directory already contains everything the native inference path needs: weights.bump, weights_manifest.json, ir1_decode.json, lowered_decode.json, layout_decode.map, model_v8.c, libmodel.so, libckernel_engine.so, and libckernel_tokenizer.so.

That means a silicon vendor can open the generated C, follow the offsets, profile the kernel calls, and reason about memory access without reverse-engineering a dynamic runtime.

You could port the system to a new platform by reimplementing the kernel ABI and recompiling the generated model file.

This is the strongest form of ahead-of-time compilation: the model has become a library plus a weight blob, not an interpreted object graph. For deployment, the line between “the model” and “the runtime” becomes clear: the model-specific logic lives in libmodel.so, while the reusable math lives in the kernel and tokenizer shared objects.

Figure: The post-compile runtime bundle: generated model library, kernel library, tokenizer library, and weights blob running independently of Python.

Qwen--Qwen3-0.6B-GGUF runtime directory after step [5/6] (text)
ir1_decode.json
layout_decode.map
libckernel_engine.so
libckernel_tokenizer.so
libmodel.so
lowered_decode.json
model_v8.c
weights.bump
weights_manifest.json

model_v8.c: CKModel is one contiguous bump allocation plus typed entry pointers (c)
/* ============================================================================
 * MODEL STRUCT
 * ============================================================================ */
typedef struct {
    uint8_t *bump;           /* Single contiguous allocation */
    size_t bump_size;
    uint8_t *bump_weights;   /* Weights section */
    float *activations;      /* Activations section */
    float *kv_cache;         /* KV cache section */
    float *rope_cos;         /* RoPE cos table */
    float *rope_sin;         /* RoPE sin table */
    float *logits;           /* Output logits */
    int pos;                 /* Current position */

Porting contract: what has to exist on a new platform (text)
required artifacts after compile:
  libmodel.so
  libckernel_engine.so
  libckernel_tokenizer.so
  weights.bump

required platform work:
  implement the kernel ABI in libckernel_engine
  implement tokenizer ABI if using C tokenizer path
  compile generated model_v8.c for the target toolchain
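
The artifact half of that contract is easy to check mechanically. Here is a minimal sketch, assuming the four required files sit directly in one runtime directory; the helper name missing_artifacts is hypothetical and not part of the CKE tooling.

```python
# Hypothetical sketch: verify a runtime directory has the required artifacts.
from pathlib import Path
import tempfile

# File names taken from the porting contract above.
REQUIRED_ARTIFACTS = (
    "libmodel.so",
    "libckernel_engine.so",
    "libckernel_tokenizer.so",
    "weights.bump",
)


def missing_artifacts(runtime_dir):
    """Return the required artifact names that are absent from runtime_dir."""
    root = Path(runtime_dir)
    return [name for name in REQUIRED_ARTIFACTS if not (root / name).is_file()]


# Demonstration against a throwaway directory with one file left out.
with tempfile.TemporaryDirectory() as tmp:
    for name in REQUIRED_ARTIFACTS[:-1]:
        (Path(tmp) / name).touch()
    demo_missing = missing_artifacts(tmp)

print(demo_missing)
```

A check like this only proves the files exist. It says nothing about whether the kernel ABI on the target platform matches what the generated model file calls.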

Smart Front-End / Dumb Back-End — Why This Architecture Works

Now the big design claim should be concrete. The smart side of CKE is the front-end: template resolution, model-family detection, kernel binding, quantization dispatch, fusion detection, backward synthesis, slot planning, buffer assignment, and final lowering into pointer expressions.

The dumb side is the back-end: codegen_v8.py reads already-lowered ops and emits C. It does not decide which attention kernel a model gets. It does not decide whether Q/K norm exists. It does not plan buffers. It just writes what the lowered IR already decided.

This split is what keeps the system debuggable. If a kernel choice is wrong, you debug the builder. If a pointer is wrong, you debug memory planning or lowering. If emitted C is malformed, you debug codegen.

The architecture works because it puts intelligence where global context exists and removes intelligence where only local serialization remains.

10,459 vs 1,025 lines. The line-count ratio is a useful proxy for design intent: most of the logic lives in the front-end builder, while the emitter stays comparatively small and mechanical.

Figure: CKE architecture split between smart front-end decisions and dumb back-end C emission.

Smart front-end responsibilities: what must be decided before any C is emitted (text)
template selection
model-family detection
kernel binding from weight dtype and semantic op
fusion detection
memory planning
pointer lowering
attention-contract specialization
inference hardening and parity promotion

Dumb back-end responsibilities: what codegen_v8.py actually does (text)
walk lowered ops in order
emit one C call per lowered op
splice in precomputed pointer expressions
emit offset tables and A_* defines
insert CK_STOP_OP checkpoints
optionally insert CK_PARITY_DUMP and CK_PROFILE plumbing

codegen_v8.py: the bug-routing rule is explicit in the source comment (python)
#!/usr/bin/env python3
from __future__ import annotations
"""
codegen_v8.py - Generate C code from lowered IR.

RESPONSIBILITIES:
1. Create memory layout from layout.json (structs, offsets, allocations)
2. Parse lowered IR and emit function calls (unrolled, one after another)
3. Pass pointers cleanly to all functions

If there are memory issues → fix the memory layout builder, not codegen.
If there are kernel issues → fix the IR lower, not codegen.

Conclusion — The Compiler Pipeline as Technical Portfolio

Put the whole stack together and the compiler shape is obvious: v8 templates and kernel maps, a 10,459-line IR builder, a 705-line memory planner, a 1,025-line decode codegen, a 1,943-line prefill codegen, generated C for model-specific runtimes, and finally shared libraries that run independently after linking.

It also explains why adding a new model family is such a high-leverage change. Once the kernel library and lowering contracts exist, much of the work is template authoring: describe the architecture correctly, let the builder synthesize IR, let the planner assign memory, and let codegen serialize the plan.

For ShivasNotes readers, this post is the connective tissue for Posts 37 through 41. SIMD, NEON, quantization, flash attention, and performance analysis matter because the compiler pipeline makes those kernels visible, reproducible, and portable.

The final mental model is simple: templates describe the machine, IR proves the plan, the memory map fixes the bytes, lowered IR fixes the pointers, codegen writes the calls, and the compiler turns that into a standalone model library. That is why CKE feels unusual in the LLM world. It treats the generated C as the model-specific executable truth, not as a debugging byproduct.

Pipeline summary: template JSON to standalone libmodel.so (text)
v8 templates + kernel maps
      ↓
10,459-line build_ir_v8.py
      ↓
705-line memory_planner_v8.py
      ↓
1,025-line codegen_v8.py + 1,943-line codegen_prefill_v8.py
      ↓
generated model_v8.c / encoder_v8.c
      ↓
compiler toolchain
      ↓
libmodel.so + libckernel_engine.so + libckernel_tokenizer.so
      ↓
regression + parity + smoke gates

Adding a new model family: what changes versus what stays mechanical (text)
write or extend a template JSON
  declare flags
  declare contracts
  declare body/header/footer ops
  add family-specific kernels only when necessary

then reuse the compiler:
  build IR1
  run fusion
  plan memory
  lower to pointer expressions
  emit C
  compile to libmodel.so

Related ShivasNotes posts 37–41 (text)
SIMD Deep Dive
ARM NEON in CKE
Quantization Deep Dive
Flash Attention on CPU
CPU Performance Analysis

Continue with SIMD Deep Dive, ARM NEON in CKE, Quantization Deep Dive, Flash Attention on CPU, and CPU Performance Analysis for the kernel-level stories this compiler pipeline turns into native code.