C-Kernel-Engine architecture note

Nemotron-style hybrid models are interesting to C-Kernel-Engine because they stop looking like a simple stack of attention plus MLP blocks. A runtime has to handle attention, Mamba2 state-space recurrence, routed MoE experts, ReLU2 expert math, tokenizer guardrails, memory planning, and architecture-specific lowering without turning the implementation into a pile of one-off hacks.

This post is not a broad marketing summary of every Nemotron release. It is a runtime-engineering map: what the architecture forces a CPU-native runtime to represent, how C-Kernel-Engine is adding those pieces, and why inspectability matters before claiming performance.

A model architecture is not just a paper diagram. For a runtime, it is a contract: tensors, states, routes, kernels, memory, and tests.

Nemotron hybrid runtime map showing attention, Mamba2, routed MoE, ReLU2 experts, and CKE lowering surfaces
Nemotron-style hybrid models force the runtime to carry multiple execution surfaces, not just one transformer block template.

Roadmap for this post

First, we map the architecture as CKE sees it: attention blocks, Mamba2 state-space blocks, routed MoE blocks, and ReLU2 experts.

Second, we separate KV cache from Mamba2 state, because they are not the same memory problem.

Third, we describe why the router is a kernel contract, not a vague “choose experts” step.

Fourth, we show the CKE implementation surface: templates, GraphIR, LoweredIR, kernel registry, memory planner, generated C, and test lanes.

Finally, we explain what still has to be hardened before calling this a full optimized Nemotron CPU runtime.

The Main Point

Dense transformer models let a runtime start with a relatively clean mental model:

RMSNorm → QKV → attention → output projection → RMSNorm → MLP → residual

That pattern is still difficult to optimize, but the block shape is familiar. Nemotron-H style models make the runtime more complicated because not every layer is the same kind of layer. Some layers behave like attention layers with KV cache. Some behave like Mamba2 recurrence with persistent state. Some feed tokens into routers that choose sparse experts. Some expert paths use ReLU2 instead of the usual SwiGLU-style MLP shape.

For C-Kernel-Engine, this matters because CKE is not trying to run models by hiding everything inside a giant Python framework. The goal is to make the model inspectable, lowerable, testable, and eventually fast on CPU hardware. That means the architecture must become explicit runtime data.

Architecture Surface

The current CKE concept documentation describes Nemotron-H as a hybrid circuit containing attention layers, Mamba2-style state-space layers, routed MoE blocks, ReLU2 experts, tokenizer guardrails, and model-specific template rules. That is the right level to study it. The architecture is not one trick. It is a composition problem.

| Surface | Runtime Meaning | CKE Needs To Represent |
|---|---|---|
| Attention | QKV projection, positional logic, softmax, value mixing, output projection. | KV cache layout, attention kernel selection, prefill/decode split. |
| Mamba2 | Selective state-space recurrence with convolution/state update behavior. | Persistent state buffers, selective scan kernels, decode state update kernels. |
| MoE router | Scores tokens, applies correction bias, selects groups and top-k experts. | Deterministic routing contract, indices, weights, normalization, scaling. |
| ReLU2 experts | Expert MLP path with squared ReLU activation behavior. | Reference parity, optimized expert kernels, sparse dispatch shape. |
| Tokenizer guardrails | Model-specific text boundary behavior that must match expected tokens. | Template selection and tokenizer capability checks before codegen. |

This is the kind of architecture where a runtime cannot get away with saying “we support transformers.” The runtime has to know which layers are attention, which layers are recurrence, which paths are sparse, and what state lives across decode steps.

Mamba2 Is Not KV Cache

A common mistake is to mentally file every long-context memory mechanism under “cache.” Attention KV cache and Mamba2 state are different contracts. KV cache stores past keys and values so future tokens can attend back to them. Mamba2 state stores recurrent state that is updated as the sequence advances.

Comparison of attention KV cache and Mamba2 recurrent state in decode
Attention grows KV cache with context. Mamba2 updates persistent state. Both are memory systems, but they stress the runtime differently.
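
To make the difference concrete, here is a rough sizing sketch. It only illustrates the scaling behavior: KV cache bytes grow with the number of cached tokens, while Mamba2 state bytes do not. The struct names, field names, and the assumption that the conv window keeps the last `conv_width - 1` inputs per channel are all illustrative, not CKE's actual memory planner.

```c
#include <stddef.h>

/* Hypothetical shape descriptors; the field names are illustrative only. */
typedef struct {
    size_t n_kv_heads;
    size_t head_dim;
} attn_shape_t;

typedef struct {
    size_t n_heads;
    size_t head_dim;
    size_t d_state;
    size_t conv_channels;
    size_t conv_width;
} mamba2_shape_t;

/* KV cache for one attention layer: one key and one value vector per
 * KV head for every cached token, so the size grows with n_tokens. */
size_t kv_cache_bytes(attn_shape_t s, size_t n_tokens, size_t elem_bytes) {
    return 2 * n_tokens * s.n_kv_heads * s.head_dim * elem_bytes;
}

/* Mamba2 state for one layer: the recurrent state plus a small conv
 * window (assumed: the last conv_width - 1 inputs per channel).
 * Nothing here depends on how many tokens have been processed. */
size_t mamba2_state_bytes(mamba2_shape_t s, size_t elem_bytes) {
    size_t ssm_elems = s.n_heads * s.head_dim * s.d_state;
    size_t conv_elems = s.conv_channels * (s.conv_width - 1);
    return (ssm_elems + conv_elems) * elem_bytes;
}
```

The practical consequence for a planner is that the KV region must be sized against a maximum context length, while the Mamba2 region can be sized once from the model shape.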

In CKE terms, Mamba2 support is not just a new activation function. It needs a family of reference kernels and guardrails:

mamba2_in_proj_split_f32
mamba2_conv1d_decode_f32
mamba2_dt_softplus_f32
mamba2_selective_state_update_decode_f32
mamba2_selective_scan_f32
mamba2_rmsnorm_gate_f32

The important word here is reference. Before optimizing this aggressively, CKE needs boring, deterministic, inspectable kernels that prove the state math is correct. Once the reference path is correct, the optimized path can be measured honestly.

Mamba2 Hyper-Detail: From Layer To Kernels

The useful way to understand Mamba2 inside CKE is to stop treating it like one magical operation. A Mamba2 layer lowers into a sequence of smaller contracts: project the hidden vector, split the streams, run local convolution, prepare the time step, update recurrent state, normalize/gate the result, and project back into the model width.

Detailed Mamba2 layer flow from hidden vector through in projection, split streams, conv1d, dt softplus, selective scan, gated RMSNorm, out projection, and residual add
Mamba2 is a layer circuit. CKE lowers that circuit into named kernels and saved intermediate buffers.

What Is A Split Stream?

A split stream is an internal branch tensor produced by one projection. It is not the same thing as the residual stream. The residual stream is the long-lived hidden vector that passes from block to block:

x_next = x + block(x)

The split streams are temporary tensors inside the Mamba2 block. CKE sees this as:

projected = x @ W_in + b
gate, hidden_bc, dt = split(projected)

The reason models do this is practical. One GEMM can create several related streams in one pass over memory. Then each stream gets a different job: the gate controls the final gated normalization path, hidden_bc feeds convolution and state parameters, and dt becomes the learned time step for recurrent state update.

Diagram showing residual stream passing between blocks while split streams are temporary branches produced by in projection inside a Mamba2 block
Residual stream is the long-lived model highway. Split streams are temporary branch tensors inside the Mamba2 layer.
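
A minimal sketch of the split as pointer views into one projected row. The packing order `[gate | hidden_bc | dt]` simply mirrors the split written above and is an assumption; the real order and widths have to come from the model template. The struct and function names are hypothetical and are not the `mamba2_in_proj_split_f32` signature.

```c
#include <stddef.h>

/* Hypothetical view descriptor: three pointers into one projected row.
 * No data is copied; the views alias the projection output buffer. */
typedef struct {
    const float *gate;      /* d_gate floats */
    const float *hidden_bc; /* d_hidden_bc floats */
    const float *dt;        /* remaining floats */
} mamba2_split_t;

/* Assumed packing order: [gate | hidden_bc | dt]. */
mamba2_split_t split_projected_row(const float *projected,
                                   size_t d_gate, size_t d_hidden_bc) {
    mamba2_split_t s;
    s.gate = projected;
    s.hidden_bc = projected + d_gate;
    s.dt = projected + d_gate + d_hidden_bc;
    return s;
}
```

Because the views alias the projected buffer, that buffer has to stay live until every branch has consumed its slice, which is exactly the kind of lifetime fact a memory planner needs.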

The backprop rule is also straightforward but easy to implement poorly. If a projected tensor is split into three views, then backward must gather the gradient contributions from all branches:

dL/dprojected =
    concat_or_scatter(dL/dgate, dL/dhidden_bc, dL/ddt)

Then the projection backward is the standard matrix multiply backward:

dL/dx    = dL/dprojected @ W_in^T
dL/dW_in = x^T @ dL/dprojected
dL/db    = sum_rows(dL/dprojected)

The residual path adds another gradient route:

x_next = x + block(x)
dL/dx = dL/dx_from_identity + dL/dx_from_block

This is where CKE has to be precise. In inference, mamba2_in_proj_split_f32 only needs to split the projected buffer correctly. In training, the same split becomes a backward scatter/gather problem plus projection-gradient GEMMs.
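
The three backward pieces that are not GEMMs can be sketched in a few lines. This is an illustration of the formulas above under the same assumed packing order `[gate | hidden_bc | dt]`; the function names are hypothetical and CKE's training kernels may be organized differently.

```c
#include <stddef.h>
#include <string.h>

/* Rebuild one row of dL/dprojected by writing each branch gradient back
 * to the offset its forward view came from. */
void split_backward_row(float *d_projected,
                        const float *d_gate, size_t n_gate,
                        const float *d_hidden_bc, size_t n_hidden_bc,
                        const float *d_dt, size_t n_dt) {
    memcpy(d_projected, d_gate, n_gate * sizeof(float));
    memcpy(d_projected + n_gate, d_hidden_bc, n_hidden_bc * sizeof(float));
    memcpy(d_projected + n_gate + n_hidden_bc, d_dt, n_dt * sizeof(float));
}

/* dL/db: sum dL/dprojected over rows (tokens), one value per column. */
void bias_backward(const float *d_projected, float *d_b,
                   size_t rows, size_t cols) {
    for (size_t c = 0; c < cols; ++c) {
        d_b[c] = 0.0f;
    }
    for (size_t r = 0; r < rows; ++r) {
        for (size_t c = 0; c < cols; ++c) {
            d_b[c] += d_projected[r * cols + c];
        }
    }
}

/* Residual backward: the identity route and the block route add up. */
void residual_backward(float *d_x, const float *d_x_next,
                       const float *d_x_from_block, size_t n) {
    for (size_t i = 0; i < n; ++i) {
        d_x[i] = d_x_next[i] + d_x_from_block[i];
    }
}
```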

What Is Conv1D In A Text Model?

Conv1D means convolution along one axis. In images, convolution usually slides across height and width. In text or token sequences, convolution slides across the sequence/time axis. The idea is not “images only.” The idea is local pattern mixing.

For a simple causal 1D convolution:

y[t] = w[0] * x[t] + w[1] * x[t-1] + w[2] * x[t-2] + bias

This gives the recurrent layer a small local neighborhood before the selective state update. Attention can read any previous token through the KV cache. Conv1D deliberately reads only a small local window. That makes it cheaper and cache-friendlier, but less globally expressive by itself. Mamba2 uses this local mixing together with recurrent state so the layer has both nearby token mixing and compact long-horizon memory.

Conv1D on text sequence showing neighboring token window, kernel weights, output token, and convolution backward formulas
Conv1D on text is a sliding local filter over tokens. It is local sequence mixing, not an image-only idea.
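
A reference-style sketch of the causal formula above for a single channel. Inputs before the start of the sequence are treated as zero, which is an assumption about padding. Real Mamba2 layers run this per channel and may apply an activation afterwards; both are left out here, and the function name is hypothetical.

```c
#include <stddef.h>

/* y[t] = bias + sum_k w[k] * x[t - k], with x treated as zero for t - k < 0.
 * Single channel; a depthwise layer would repeat this per channel. */
void causal_conv1d_ref(const float *x, const float *w, float bias, float *y,
                       size_t n_tokens, size_t kernel_width) {
    for (size_t t = 0; t < n_tokens; ++t) {
        float acc = bias;
        /* k <= t keeps the window causal and skips the zero padding. */
        for (size_t k = 0; k < kernel_width && k <= t; ++k) {
            acc += w[k] * x[t - k];
        }
        y[t] = acc;
    }
}
```

With `x = {1, 2, 3, 4}`, `w = {1, 10, 100}`, and zero bias, each output digit shows which token landed under which tap: the results are 1, 12, 123, 234.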

The backward pass for Conv1D has two major surfaces: gradient to input and gradient to weights. For a causal convolution:

dL/dx[t] += sum_k dL/dy[t+k] * w[k]
dL/dw[k] += sum_t dL/dy[t] * x[t-k]
dL/db    += sum_t dL/dy[t]

This matters for CKE because Conv1D backward is not conceptually hard, but it is memory-sensitive. The kernel repeatedly reads local windows and writes accumulated gradients. A clean implementation must choose whether to loop by token, by channel, by kernel width, or by vector lane. On CPU, that choice affects cache reuse, store contention, and SIMD packing.
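
Here is one possible loop order for the backward formulas, written as a scatter over tokens: each `dy[t]` pushes its contribution into `dx[t - k]` and `dw[k]`. That is mathematically the same as the gather form above. It is a single-channel sketch with a hypothetical name, and it assumes the caller has zero-initialized the gradient buffers.

```c
#include <stddef.h>

/* Backward of y[t] = bias + sum_k w[k] * x[t - k] (zero padding before t = 0).
 * dx (n_tokens), dw (kernel_width), and *db must be zeroed by the caller,
 * because every gradient here is accumulated. */
void causal_conv1d_backward_ref(const float *x, const float *w,
                                const float *dy, float *dx, float *dw,
                                float *db, size_t n_tokens,
                                size_t kernel_width) {
    for (size_t t = 0; t < n_tokens; ++t) {
        *db += dy[t];
        for (size_t k = 0; k < kernel_width && k <= t; ++k) {
            dx[t - k] += dy[t] * w[k]; /* input gradient */
            dw[k] += dy[t] * x[t - k]; /* weight gradient */
        }
    }
}
```

Looping by token like this keeps reads local but makes every token write into the same small `dw` array, which is the store contention the paragraph above refers to once threads are involved.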

In the current CKE v8 Nemotron/Mamba2 path, the forward/reference surface is the important first step: mamba2_conv1d_decode_f32 handles the decode-time convolution state update. Training backward kernels would need the corresponding input-gradient and weight-gradient paths.
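
To show what a decode-time convolution state update means, here is a one-channel, one-token sketch. The state is assumed to hold the previous `kernel_width - 1` inputs, oldest first. The function name, layout, and signature are illustrative assumptions and are not the actual `mamba2_conv1d_decode_f32` contract.

```c
#include <stddef.h>

/* One decode step for one channel.
 * state[j] holds x[t - (K - 1) + j], so state[K - 2] is x[t - 1].
 * Returns y[t], then shifts the window and appends x_t. */
float conv1d_decode_step(float *state, const float *w, float bias,
                         float x_t, size_t kernel_width) {
    float acc = bias + w[0] * x_t;
    for (size_t k = 1; k < kernel_width; ++k) {
        acc += w[k] * state[kernel_width - 1 - k]; /* this is x[t - k] */
    }
    /* Drop the oldest input and append the newest one. */
    for (size_t i = 0; i + 2 < kernel_width; ++i) {
        state[i] = state[i + 1];
    }
    if (kernel_width > 1) {
        state[kernel_width - 2] = x_t;
    }
    return acc;
}
```

Feeding tokens one at a time through this step from a zeroed state should reproduce the full-sequence causal convolution, which makes it a natural parity test between the prefill and decode paths.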

What Is dt Softplus?

dt can be confusing because in calculus dt often means “an infinitesimal change in time” or appears in derivatives with respect to time. In this Mamba2 runtime context, dt is not the sampling frequency of the input text, and it is not the derivative operator itself. It is a learned delta/time-step value produced by the model for the state update.

During training, the weights that produce dt_raw and the dt_bias terms are learned by backpropagation. During inference, those trained parameters are used to compute dt for the current token. So the parameters behind dt are learned once, while dt itself is recomputed for every token. It controls how aggressively the recurrent state should decay, preserve, or incorporate new information at that token.

In a state-space model, the state update needs a stable positive notion of how much to move the state at this token. Raw neural network outputs can be negative, huge, tiny, or unstable. Softplus turns that raw value into a positive smooth value:

softplus(u) = log(1 + exp(u))
dt_out = clamp(softplus(dt_raw + dt_bias), dt_min, dt_max)

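Why not just use ReLU? ReLU is non-negative, but it outputs exactly zero for every negative input and has a sharp kink at zero, so dt could collapse to zero and pass no gradient. Softplus is strictly positive, smooth, and differentiable everywhere. Its derivative is the sigmoid: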

d softplus(u) / du = sigmoid(u)

So the backward pass is:

dL/du = dL/ddt_out * sigmoid(u)

If clamping is active, the backward rule must respect the clamp. Values outside the allowed range may have zero gradient through the clamp region, depending on the exact clamp semantics used by the reference. That is why CKE treats mamba2_dt_softplus_f32 as its own kernel contract instead of hiding it inside a fused blob too early.

dt softplus diagram showing raw dt plus bias, softplus, clamp, dt_out, derivative sigmoid, and comparison with attention
dt is a learned delta/time-step for state evolution. It is trained by backprop and computed during inference.

Compared with full attention, this is a different memory mechanism. Attention computes a softmax distribution over previous tokens and reads values from KV cache. Mamba2 computes how the compact state should decay and update. Attention asks: “which previous tokens should I read?” Mamba2 asks: “how should my state evolve after seeing this token?”

The core recurrence is simple enough to say, but dangerous to implement casually: use the current token and learned/input-dependent parameters to decay the old state, write new information, and expose an output. The runtime must preserve the exact shape rules for heads, head dimension, state dimension, and groups.

Selective state update math showing previous state and current token flowing through decay, write, state_out, and output y
The selective state update is the mathematical center: state_out depends on the old state, the current token, dt, A, and B, and the output y additionally uses C and D.

CKE makes this concrete through a kernel contract table. Each kernel can be tested independently before the runtime fuses, tiles, vectorizes, or threads the path. That is the discipline: parity first, optimization second.

Table of CKE Mamba2 kernel contracts including in projection split, conv1d decode, dt softplus, selective state update, selective scan, and RMSNorm gate
The Mamba2 path is testable because each piece has a named C kernel contract.

During decode, CKE can lower a selective-scan IR node to the one-token state-update kernel. This is important because decode is not the same workload as prefill. Prefill scans a sequence. Decode updates the state for the next token.

Generated C decode call showing mamba2_selective_state_update_decode_f32 with state, x, dt, A, B, C, D, state_out, y, and shape arguments
In decode, the generated C path can call mamba2_selective_state_update_decode_f32 directly and preserve profiling labels.

This is also why the CKE work added state-shape guardrails. For Nemotron-style Mamba2, the state shape cannot be guessed from a generic transformer assumption. If the runtime assumes the state dimension equals the head dimension, as if the per-head state were square, the code can compile and still be wrong.

State layout is a correctness problem before it is a performance problem.

The Router Is A Kernel Contract

MoE can sound simple at the slogan level: route each token to a few experts. The actual runtime contract is sharper than that. The router has scores, correction bias, group-limited routing, top-k selection, optional probability normalization, and a routed scaling factor. If any of those details are off, the expert path changes.

Nemotron router kernel contract showing scores, correction bias, group selection, top-k experts, weights, and ReLU2 expert dispatch
The router is not metadata. It produces concrete expert indices and weights that determine which compute path runs.

The CKE kernel registry exposes this as a concrete function surface:

nemotron_group_limited_topk_router_f32(
    scores,
    correction_bias,
    indices,
    weights,
    rows,
    n_experts,
    top_k,
    n_group,
    topk_group,
    norm_topk_prob,
    routed_scaling_factor
)

That signature is useful because it turns a model-architecture paragraph into something testable. Given scores and bias, the runtime must produce the same indices and weights as the reference behavior. Only after that should we care whether the kernel is vectorized, tiled, or threaded.
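
To show what such a contract pins down, here is a single-row sketch of group-limited top-k routing. Several details are assumptions made for illustration and may differ from the CKE kernel and from the reference model: selection uses `scores + correction_bias`, a group is ranked by its best biased score, ties go to the lowest index, and the output weights come from the unbiased scores. The function name and the `MAX_EXPERTS` bound are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>

#define MAX_EXPERTS 256 /* sketch-only bound for the stack buffers */

/* Routes one token. Returns 0 on success, -1 on an invalid configuration. */
int route_row_sketch(const float *scores, const float *correction_bias,
                     int *indices, float *weights,
                     int n_experts, int top_k, int n_group, int topk_group,
                     bool norm_topk_prob, float routed_scaling_factor) {
    if (n_experts <= 0 || n_experts > MAX_EXPERTS) return -1;
    if (n_group <= 0 || n_experts % n_group != 0) return -1;
    int per_group = n_experts / n_group;
    if (topk_group <= 0 || topk_group > n_group) return -1;
    if (top_k <= 0 || top_k > topk_group * per_group) return -1;

    bool group_on[MAX_EXPERTS] = {false};
    bool taken[MAX_EXPERTS] = {false};

    /* Step 1: keep the topk_group groups with the best biased score. */
    for (int pick = 0; pick < topk_group; ++pick) {
        int best_g = -1;
        float best = 0.0f;
        for (int g = 0; g < n_group; ++g) {
            if (group_on[g]) continue;
            float gmax = scores[g * per_group] + correction_bias[g * per_group];
            for (int e = 1; e < per_group; ++e) {
                int j = g * per_group + e;
                float v = scores[j] + correction_bias[j];
                if (v > gmax) gmax = v;
            }
            if (best_g < 0 || gmax > best) { best_g = g; best = gmax; }
        }
        group_on[best_g] = true;
    }

    /* Step 2: top_k experts inside the kept groups, by biased score. */
    float sum = 0.0f;
    for (int k = 0; k < top_k; ++k) {
        int best_e = -1;
        float best = 0.0f;
        for (int e = 0; e < n_experts; ++e) {
            if (taken[e] || !group_on[e / per_group]) continue;
            float v = scores[e] + correction_bias[e];
            if (best_e < 0 || v > best) { best_e = e; best = v; }
        }
        taken[best_e] = true;
        indices[k] = best_e;
        weights[k] = scores[best_e]; /* assumed: weight uses unbiased score */
        sum += weights[k];
    }

    /* Step 3: optional normalization, then the routed scaling factor. */
    for (int k = 0; k < top_k; ++k) {
        if (norm_topk_prob && sum > 0.0f) weights[k] /= sum;
        weights[k] *= routed_scaling_factor;
    }
    return 0;
}
```

Every assumption listed in the lead-in is a place where a plausible-looking router can disagree with the reference, which is why indices and weights both belong in the parity test.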

ReLU2 Experts Are Also Part Of The Architecture

Many transformer runtimes are built around a narrow set of MLP assumptions: gate projection, up projection, activation, elementwise multiply, down projection. Nemotron-H style expert paths can require ReLU2 expert handling instead. That changes what the generated circuit should emit and what parity tests should check.
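
For contrast with the gated shape, here is a small reference-style sketch of a ReLU2 expert. It assumes the expert is a plain up projection, squared ReLU, and down projection with no gate multiply and no biases; that shape is an assumption to check against the model template. The names and row-major layouts are hypothetical.

```c
#include <stddef.h>

/* Squared ReLU: max(x, 0)^2. */
static inline float relu2_f32(float x) {
    float r = x > 0.0f ? x : 0.0f;
    return r * r;
}

/* Assumed non-gated expert: out = W_down * relu2(W_up * x).
 * w_up is d_ff x d_model, w_down is d_model x d_ff, both row-major.
 * hidden is caller-provided scratch of d_ff floats. */
void relu2_expert_ref(const float *x, const float *w_up, const float *w_down,
                      float *hidden, float *out,
                      size_t d_model, size_t d_ff) {
    for (size_t f = 0; f < d_ff; ++f) {
        float acc = 0.0f;
        for (size_t m = 0; m < d_model; ++m) {
            acc += w_up[f * d_model + m] * x[m];
        }
        hidden[f] = relu2_f32(acc);
    }
    for (size_t m = 0; m < d_model; ++m) {
        float acc = 0.0f;
        for (size_t f = 0; f < d_ff; ++f) {
            acc += w_down[m * d_ff + f] * hidden[f];
        }
        out[m] = acc;
    }
}
```

Compared with a SwiGLU-style path there is one fewer projection and no elementwise multiply, so a template that always emits gate and up projections would generate the wrong circuit.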

From a CKE perspective, this is exactly why a kernel registry exists. The model template should not scatter architecture-specific if-statements throughout the runtime. The template should lower the graph into named kernel contracts, and the registry should tell codegen how to call those kernels.
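
A toy sketch of the registry idea: a table that maps a kernel name to a function pointer, so the caller asks for a contract by name instead of branching on the architecture. The entry names, the single unary signature, and the lookup function are all hypothetical; a real registry would also carry argument descriptions, dtypes, and variants.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical contract: every kernel in this toy registry is unary. */
typedef void (*unary_kernel_fn)(const float *in, float *out, size_t n);

typedef struct {
    const char *name;
    unary_kernel_fn fn;
} kernel_entry_t;

static void relu2_ref(const float *in, float *out, size_t n) {
    for (size_t i = 0; i < n; ++i) {
        float r = in[i] > 0.0f ? in[i] : 0.0f;
        out[i] = r * r;
    }
}

static void identity_ref(const float *in, float *out, size_t n) {
    for (size_t i = 0; i < n; ++i) {
        out[i] = in[i];
    }
}

/* Illustrative entries only; these are not CKE's registered names. */
static const kernel_entry_t KERNELS[] = {
    {"relu2_f32", relu2_ref},
    {"identity_f32", identity_ref},
};

/* Returns the kernel registered under name, or NULL if there is none. */
unary_kernel_fn kernel_lookup(const char *name) {
    for (size_t i = 0; i < sizeof(KERNELS) / sizeof(KERNELS[0]); ++i) {
        if (strcmp(KERNELS[i].name, name) == 0) {
            return KERNELS[i].fn;
        }
    }
    return NULL;
}
```

A missing name returns NULL, so an unsupported operation can fail at lowering time instead of silently falling back to the wrong kernel.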

The CKE Lowering Surface

C-Kernel-Engine lowering pipeline from model config to template, GraphIR, LoweredIR, kernel registry, memory plan, generated C, and tests
The implementation target is not a handwritten one-off runtime. It is a deterministic lowering path from model facts to generated C.

The clean CKE path looks like this:

Model config + GGUF metadata
  → architecture template
  → GraphIR nodes
  → LoweredIR kernel calls
  → memory planner offsets
  → generated C
  → parity / smoke / performance tests

The reason this matters is simple: once the architecture is explicit in IR, CKE can inspect it, visualize it, test it, and optimize it. If the architecture is only hidden in ad hoc runtime code, every new model family becomes a manual port.

What CKE Already Has In The Nemotron Lane

The current CKE documentation and commits show a reference-first path for Nemotron-style hybrid support. This is the right order: build the boring kernels, prove their behavior, then optimize.

| CKE Lane | Purpose | Evidence Surface |
|---|---|---|
| Mamba2 reference kernels | Provide deterministic selective scan and decode state behavior. | make test-mamba2-reference |
| Nemotron router | Validate group-limited top-k routing with correction bias. | make test-nemotron-router |
| MoE ReLU2 expert | Validate expert activation and projection behavior. | make test-moe-relu2-expert |
| High-memory smoke lane | Exercise larger model bring-up without pretending it is fully optimized. | make test-v8-nemotron9-highmem |
| Tokenizer capability guardrails | Prevent wrong template/codegen behavior for model-specific tokenization. | v8 tokenizer capability tests |

This is also the right way to communicate the status publicly. CKE is not claiming “Nemotron is solved.” CKE is saying: these are the architecture contracts, these are the kernels, these are the guardrails, and these are the tests we are hardening.

Performance Claims Need Boundaries

The CKE docs include local validation notes where the Nemotron router path and ReLU2 expert path show promising speedups against reference-style comparisons. That is useful evidence, but it should be read correctly. A fast router kernel does not mean the full model is fast. A fast expert microbenchmark does not mean the full decode loop is solved.

The honest runtime sequence is reference parity, then kernel microbenchmarks, then layer-level benchmarks, then end-to-end model throughput. Skipping those boundaries creates the kind of AI hardware handwaving that CKE is intentionally trying to avoid.

A kernel win is evidence. It is not the whole model.

Evidence ladder from reference parity to kernel benchmark to layer benchmark to full model throughput
CKE should climb the evidence ladder in order: parity, microbench, layer, full model.
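
The first rung of that ladder can be as small as an elementwise comparison between a reference output and a candidate output. The sketch below uses a combined absolute and relative tolerance and treats NaN as a failure. The function name and the tolerance form are assumptions, not the checks the CKE test lanes actually run.

```c
#include <math.h>
#include <stddef.h>

/* Returns 1 when every element satisfies
 *   |test[i] - ref[i]| <= atol + rtol * |ref[i]|
 * and 0 otherwise. If first_bad is non-NULL it receives the index of the
 * first failing element. */
int parity_ok(const float *ref, const float *test, size_t n,
              float atol, float rtol, size_t *first_bad) {
    for (size_t i = 0; i < n; ++i) {
        float diff = fabsf(test[i] - ref[i]);
        float tol = atol + rtol * fabsf(ref[i]);
        /* Written as !(diff <= tol) so a NaN diff counts as a failure. */
        if (!(diff <= tol)) {
            if (first_bad != NULL) {
                *first_bad = i;
            }
            return 0;
        }
    }
    return 1;
}
```

Reporting the first failing index matters in practice: for a split or state kernel, the position of the first mismatch often reveals a layout bug faster than the size of the error does.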

Why This Matters For CPU AI

The important lesson is not “Nemotron is optimized for one vendor.” The useful lesson is that modern models are increasingly architecture-specific at the runtime level. Attention, recurrence, routing, quantization, expert sparsity, and tokenizer behavior all affect the generated execution plan.

CPUs can run these operations, but they need a runtime that treats the model as an inspectable circuit. That means:

  • the architecture template must be explicit,
  • the kernel contracts must be named and testable,
  • the memory planner must understand persistent state and cache separately,
  • the generated C must be auditable,
  • and the test lane must catch numerical and shape drift before performance work hides the bug.
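
As a sketch of what "understand persistent state and cache separately" could look like in data, here is a minimal offset planner. Each buffer carries a kind tag, gets an aligned offset inside one arena, and the per-kind totals can be reported separately. It is plain bump allocation with no lifetime-based reuse, and every name in it is hypothetical rather than taken from the CKE memory planner.

```c
#include <stddef.h>

typedef enum {
    BUF_KV_CACHE,         /* grows with context, sized for max tokens */
    BUF_PERSISTENT_STATE, /* fixed size, survives across decode steps */
    BUF_ACTIVATION        /* temporary, only live within one step */
} buf_kind_t;

typedef struct {
    const char *name;
    buf_kind_t kind;
    size_t bytes;
    size_t offset; /* filled in by plan_offsets */
} planned_buf_t;

/* Round v up to the next multiple of a (a must be non-zero). */
static size_t align_up(size_t v, size_t a) {
    return (v + a - 1) / a * a;
}

/* Assigns an aligned offset to every buffer and returns the arena size. */
size_t plan_offsets(planned_buf_t *bufs, size_t n, size_t alignment) {
    size_t cursor = 0;
    for (size_t i = 0; i < n; ++i) {
        cursor = align_up(cursor, alignment);
        bufs[i].offset = cursor;
        cursor += bufs[i].bytes;
    }
    return align_up(cursor, alignment);
}

/* Total bytes requested by buffers of one kind, for separate reporting. */
size_t bytes_of_kind(const planned_buf_t *bufs, size_t n, buf_kind_t kind) {
    size_t total = 0;
    for (size_t i = 0; i < n; ++i) {
        if (bufs[i].kind == kind) {
            total += bufs[i].bytes;
        }
    }
    return total;
}
```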

This is the deeper CKE bet. Not that every CPU magically beats every GPU. The bet is that deterministic model lowering, CPU-specific kernels, and distributed memory bandwidth can make a serious CPU-native AI runtime practical for real workloads. The way to get there is not handwaving. It is one architecture contract at a time.

How To Read This In The CKE Docs

Planned follow-up posts in this CKE architecture arc:

  • CKE Templates And Circuit Maps: how model architecture becomes an executable template.
  • GraphIR, LoweredIR, And C Kernels: how the runtime turns model structure into kernel calls.
  • CKE Kernel Registry: why every operation needs a named implementation contract.
  • Prefill vs Decode In CKE: why sequence scan and one-token state update are different workloads.
  • CKE Memory Planner: how state, KV cache, activations, and temporary buffers get placed.

Closing Thought

Nemotron-style hybrid architecture is useful because it forces the runtime to grow up. A toy runtime can run a clean dense transformer. A serious runtime has to survive architecture variation: attention here, state-space recurrence there, sparse experts in another path, and model-specific tokenizer/template behavior around all of it.

That is why CKE keeps moving toward inspectable IR, deterministic kernel contracts, generated C, and measurable test lanes. The model architecture is becoming more diverse. The runtime has to become more explicit.