DeltaNet and SSM/Mamba are two ways to reduce the amount of unbounded attention history a model has to carry by introducing fixed-size learned state. They are often described with the same vague phrase: “linear attention” or “recurrent attention.” That is not enough for C-Kernel-Engine. CKE has to know what state is carried, what kernel updates it, how backprop flows through it, and whether the CPU is fighting compute throughput or memory bandwidth.
The compact comparison is this: DeltaNet writes corrections into an associative matrix. SSM/Mamba evolves a compact recurrent state. Both avoid KV-cache growth, but they are not the same runtime problem.
Same goal: fixed-size memory. Different contract: matrix correction vs state evolution.
Core intuition
Attention: stores and reads old tokens.
DeltaNet: updates an associative key-value matrix.
SSM/Mamba: evolves a recurrent state over time.
The important practical detail is that modern hybrid architectures usually do not delete full attention everywhere. Full attention is still the gold-standard “photogenic memory” because the model can directly look back at token-level keys and values. DeltaNet and SSM layers trade some of that explicit memory for fixed-size state, better decode memory behavior, and better long-context cost shape.
Roadmap for this post
First, we define the shared problem: replacing growing KV cache with fixed-size state.
Second, we explain DeltaNet as a corrective associative memory and show its forward/backward kernel contract.
Third, we explain SSM/Mamba as state evolution, including Conv1D, selective state update, and training surfaces.
Fourth, we compare what CKE has to implement: function signatures, state layout, parity tests, and optimization paths.
Finally, we connect this to yesterday's Nemotron/Mamba post and the larger CPU AI runtime bet.
The Shared Problem: KV Cache Grows
Full attention stores key and value vectors for previous tokens. During decode, each new token can attend over that history. This is powerful because it preserves exact token-level retrieval. It is expensive because the memory grows with context length.
KV cache shape per layer:
K: [tokens, heads, head_dim]
V: [tokens, heads, head_dim]
DeltaNet and SSM/Mamba attack the same pressure point: keep more of the layer stack in a fixed-size state that does not grow linearly with token count. The key difference is what that fixed-size state means.
| Mechanism | State | State Meaning | Cost Shape |
|---|---|---|---|
| Attention | K,V history | Token-level memory log. | Grows with context T. |
| DeltaNet | S[H,D,D] | Associative key-value matrix per head. | Fixed in T, grows with D². |
| SSM/Mamba | recurrent state | Compact evolving summary. | Fixed in T, shaped by state dimension. |
This is why “constant-cost inference” is too vague. Constant with respect to what? DeltaNet is constant with respect to sequence length, but the state is a dense matrix. SSM/Mamba is also constant with respect to sequence length, but the state update has different shape rules and different memory movement. In practice, these mechanisms are often interleaved with periodic full-attention layers because full attention still gives the cleanest exact token retrieval path.
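To put units on "constant with respect to what," here is a minimal sketch that computes the per-layer bytes implied by the shapes above. The function names are illustrative and not part of CKE, and the sketch assumes K and V use the same head count and that head_dim equals the DeltaNet state dimension D.

```c
#include <assert.h>
#include <stddef.h>

/* Bytes for one layer's KV cache: K and V are each [tokens, heads, head_dim]. */
static size_t kv_cache_bytes(size_t tokens, size_t heads, size_t head_dim,
                             size_t elem_size)
{
    assert(elem_size > 0);
    return 2 * tokens * heads * head_dim * elem_size;
}

/* Bytes for one layer's DeltaNet state S[H, D, D]; no dependence on tokens. */
static size_t deltanet_state_bytes(size_t heads, size_t dim, size_t elem_size)
{
    assert(elem_size > 0);
    return heads * dim * dim * elem_size;
}

/* Token count where the growing KV cache matches the fixed state, assuming
 * equal heads and head_dim == dim: 2*T*H*D == H*D*D, so T == D/2. */
static size_t break_even_tokens(size_t dim)
{
    return dim / 2;
}
```

With made-up example sizes of 8 heads, D = 128, and 4-byte floats, the DeltaNet state is 512 KiB per layer at any context length, while the KV cache reaches 32 MiB per layer at 4096 tokens. Under these assumptions the two are equal at only 64 tokens, which is why the D² term matters for short contexts and stops mattering for long ones.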
Why hybrids keep some full attention
A recurrent or linear-memory layer compresses history into learned state. That is the whole performance win. It is also the tradeoff. Compression can lose some token-level detail.
Full attention remains expensive, but it gives the model direct access to the old key/value log. Hybrid models therefore often use DeltaNet or SSM/Mamba for many layers and keep full attention in selected layers as an anchor for retrieval, grounding, and long-range correction.
Not Attention-Free, Attention-Lighter
The mistake is to frame this as a binary contest: attention versus DeltaNet, attention versus SSM, or Transformer versus Mamba. That is clean for a slide, but it is not how many practical architectures are evolving. The real design space is hybrid. Some layers keep full softmax attention. Other layers use a fixed-state mechanism that is cheaper to carry through decode.
Full attention is expensive because the model keeps explicit keys and values for old tokens. But that explicitness is also why it works so well. If token 2,913 matters to token 8,144, attention can directly score that relationship. The model does not have to hope the information survived inside a compressed recurrent state. This is what I mean by photogenic memory: the old tokens are still represented as inspectable key/value vectors.
DeltaNet and SSM/Mamba layers are different. They do not keep a full visible token log. They ask the model to learn a compressed state that is good enough for the next computation. That can be much cheaper, especially at long context, but it is not free. The trade is accuracy and retrieval sharpness for memory shape, throughput, and predictable decode cost. Good hybrid models try to get most of the performance win without deleting the full-attention safety net.
The Photogenic Memory Analogy
A useful way to think about this is a camera. Full attention is like keeping high-resolution photographs of the past. Every previous token has a key and value representation that can still be inspected by the current token. That is expensive, but it is also why full attention is so hard to replace. The model can point back to something concrete.
DeltaNet, SSM/Mamba, and sliding-window attention are more like compressed memory systems. They do not keep every old frame at full resolution. DeltaNet keeps an associative matrix that has been updated by the history. SSM/Mamba keeps an evolving recurrent state. Sliding-window attention keeps exact attention only for a local neighborhood. These are powerful ideas, but they are also compressions.
This is why practical model design can feel like controlled risk. The model designer is asking: how much photogenic memory can I remove before the model starts losing coherence, retrieval ability, or reasoning quality? Too much full attention is expensive. Too little full attention can create a quality cliff. The useful zone is usually somewhere in the middle: enough exact memory to anchor the model, enough compressed memory to make inference practical.
Gemma-style sliding-window/global attention mixtures fit this same pattern. Some layers use cheaper local attention. Other layers preserve broader/global attention. DeltaNet hybrids and SSM/Mamba hybrids are a different mechanism, but the design pressure is similar: ration the expensive exact-memory layers instead of pretending they do not matter.
| Question | Full Attention | DeltaNet | SSM/Mamba |
|---|---|---|---|
| What is remembered? | Old token keys and values. | A learned key-value mapping matrix. | A recurrent dynamical state. |
| What is the cost trade? | Best retrieval, growing memory. | State size fixed in token count, dense matrix update. | State size fixed in token count, recurrent/scan update. |
| What can be lost? | Mostly cost, not visibility. | Some exact token-level detail. | Some exact token-level detail. |
| Why interleave? | Acts as the exact memory anchor. | Reduces repeated KV-cache pressure. | Reduces repeated KV-cache pressure. |
Step-By-Step Algorithms
The easiest way to compare them is to put the algorithms next to each other. Full attention, DeltaNet, and SSM/Mamba all start from the hidden stream, but they diverge at the memory step. Attention reads a growing token memory. DeltaNet reads and corrects a fixed matrix memory. SSM/Mamba evolves a fixed recurrent state.
Full Attention Algorithm
1. Project hidden state into Q, K, V.
2. Split Q, K, V into heads.
3. Compute attention scores per head:
scores = QK^T / sqrt(head_dim)
4. Apply mask and softmax:
probs = softmax(scores)
5. Read values:
context = probs @ V
6. Concatenate heads.
7. Apply attention output projection.
The memory is photogenic: all previous keys and values remain visible. This is why attention has strong retrieval behavior. It is also why long-context decode becomes memory-heavy.
DeltaNet Algorithm
1. Project hidden state into q, k, v, g, beta.
2. Split q, k, v into heads.
3. Decay old matrix memory:
S_decay = exp(g) * S_prev
4. Probe memory with current key:
kv_mem = S_decay^T @ k
5. Compute correction error:
delta = sigmoid(beta) * (v - kv_mem)
6. Write correction into memory:
S_new = S_decay + outer(k, delta)
7. Read updated memory with query:
out = S_new^T @ q
8. Concatenate heads and apply output projection.
S[H,D,D]: a fixed matrix memory per head.
DeltaNet is closer to compressed key-value memory than to a leaky sensor. It asks: “what does memory currently return for this key, and how wrong is it?” The update is the error. That is why the delta rule is corrective.
SSM / Mamba Algorithm
1. Project hidden state into split streams:
gate, hidden/input stream, dt, B, C, etc.
2. Apply local Conv1D over the token sequence:
x_conv[t] = sum_k w[k] * x[t-k]
3. Compute learned state-update delta:
dt = softplus(dt_raw + dt_bias)
4. Decay previous recurrent state:
state_decay = decay(dt, A) * state_prev
5. Write current input into state:
state_new = state_decay + B * x_conv
6. Read from state:
y = C * state_new + D * x_conv
7. Apply gate, normalization, output projection, and residual add.
Mamba-style layers are closer to a learned dynamical system. They ask: “how should my state evolve after seeing this token?” That is different from attention's question and different from DeltaNet's correction rule.
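Steps 4 to 6 can be sketched for one channel and one token. This is a simplified illustration, not the Mamba2 reference kernel: the function name and the per-channel layout are assumptions, and `decay[n]` stands in for `decay(dt, A)` as a value the caller has already computed (Mamba-style layers commonly use `exp(dt * A)` with negative `A`).

```c
#include <assert.h>
#include <stddef.h>

/* One token of state evolution for a single channel (illustrative sketch).
 * state, decay, B, C: [state_dim]
 * decay[n] is assumed precomputed by the caller and stands in for decay(dt, A).
 * D_skip is the scalar skip weight written as D in the step list.
 * Updates `state` in place and returns the output y for this token. */
static float ssm_step_channel(float *state, const float *decay,
                              const float *B, const float *C,
                              float D_skip, float x_conv, size_t state_dim)
{
    assert(state_dim > 0);
    float y = D_skip * x_conv;
    for (size_t n = 0; n < state_dim; ++n) {
        /* Decay the old state, then write the current input into it. */
        state[n] = decay[n] * state[n] + B[n] * x_conv;
        /* Read from the updated state. */
        y += C[n] * state[n];
    }
    return y;
}
```

The state buffer has `state_dim` entries before and after the call, regardless of how many tokens have been processed. Nothing is appended.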
What Is Actually Being Traded?
When people say these architectures are more efficient than attention, the statement needs a unit. More efficient in prefill? More efficient in batch-1 decode? More efficient in memory capacity? More efficient in memory bandwidth? More efficient on GPUs? More efficient on CPUs? These answers can differ.
Full attention has a clear decode memory shape: every layer carries a KV cache whose size depends on context length. That means every new token needs access to a longer and longer history. FlashAttention-style kernels can avoid materializing the full attention matrix during prefill, but the decode-time KV cache is still real. The history exists as old keys and values.
DeltaNet changes the object being carried. Instead of carrying a growing list of keys and values, it carries a fixed matrix per head. That matrix is updated by a gated correction rule. This makes decode memory fixed with respect to token count, but the state itself is not tiny. It is a dense D × D object per head or value-head group. On CPU, this can become a very different performance problem: fewer growing KV reads, but more dense FMA loops over a persistent matrix state.
SSM/Mamba changes the object again. Instead of an associative matrix lookup and correction, it evolves a recurrent state. The layer learns when to keep, forget, decay, and write information. The important kernels become local Conv1D, learned dt softplus, state transition math, gating, normalization, and scan or recurrent update paths. The runtime pressure is not the same as DeltaNet. It can be more layout-sensitive and scan-sensitive than pure dense-matrix update.
So the trade is not “attention bad, recurrent good.” The trade is: how much exact token memory do we keep, how much history do we compress into learned state, how often do we reintroduce full attention, and what does that decision do to the kernels that actually run?
The CKE rule
If an architecture changes the state object, it changes the runtime. If it changes the runtime, it changes the C kernel contract. If it changes the C kernel contract, it changes parity tests, memory planning, SIMD strategy, prefill behavior, decode behavior, and backprop.
DeltaNet: Corrective Associative Memory
DeltaNet keeps a dense state matrix per head. You can think of this matrix as a learned associative memory. Given a key, the memory produces a value. The delta rule checks what the memory currently returns and writes only the correction.
gate = exp(g)
beta_s = sigmoid(beta)
S_decay = gate * S_prev
kv_mem = S_decay^T * k
delta = beta_s * (v - kv_mem)
S_new = S_decay + outer(k, delta)
out = S_new^T * q_scaled
This is different from a KV cache. A KV cache appends another key/value pair to a log. DeltaNet modifies a fixed matrix. If the memory already predicts the value for this key, the correction becomes small. If the memory is wrong, the outer product updates the matrix in the direction that reduces that error.
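The update equations above can be sketched as a single-head, single-token C loop. This is an illustration derived from those equations, not the CKE kernel: the function name and row-major layout are assumptions, `gate = exp(g)` and `beta_s = sigmoid(beta)` are taken as already computed, `q` is taken as already scaled, and any q/k normalization is omitted.

```c
#include <assert.h>
#include <stddef.h>

/* One single-head gated delta-rule step, updating S in place (sketch).
 * S:      [dim, dim] row-major; S[i*dim + j] maps key index i to value index j
 * q,k,v:  [dim]; q is assumed already scaled
 * gate:   exp(g), assumed precomputed; beta_s: sigmoid(beta), assumed precomputed
 * kv_mem: caller scratch [dim]; on return it holds delta
 * out:    [dim] */
static void deltanet_step_head(float *S, const float *q, const float *k,
                               const float *v, float gate, float beta_s,
                               float *kv_mem, float *out, size_t dim)
{
    assert(dim > 0);

    /* S_decay = gate * S_prev, and kv_mem = S_decay^T k, in one sweep. */
    for (size_t j = 0; j < dim; ++j) kv_mem[j] = 0.0f;
    for (size_t i = 0; i < dim; ++i) {
        for (size_t j = 0; j < dim; ++j) {
            S[i * dim + j] *= gate;
            kv_mem[j] += S[i * dim + j] * k[i];
        }
    }

    /* delta = beta_s * (v - kv_mem), stored over kv_mem. */
    for (size_t j = 0; j < dim; ++j)
        kv_mem[j] = beta_s * (v[j] - kv_mem[j]);

    /* S_new = S_decay + outer(k, delta), and out = S_new^T q, in one sweep. */
    for (size_t j = 0; j < dim; ++j) out[j] = 0.0f;
    for (size_t i = 0; i < dim; ++i) {
        for (size_t j = 0; j < dim; ++j) {
            S[i * dim + j] += k[i] * kv_mem[j];
            out[j] += S[i * dim + j] * q[i];
        }
    }
}
```

Calling this twice with the same unit-norm key and the same value, with `gate = 1` and `beta_s = 1`, shows the corrective behavior: the first call writes the value into the matrix, and the second call computes a zero delta and leaves the matrix unchanged.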
DeltaNet Backprop
The backward surface is dense because nearly every term influences the state matrix. CKE has an explicit backward registry entry:
gated_deltanet_autoregressive_backward(
d_out,
d_state_out,
q, k, v, g, beta,
state_in,
state_out,
d_q, d_k, d_v,
d_g, d_beta,
d_state_in,
num_heads,
state_dim,
norm_eps
)
Backprop has to reverse the output read, the rank-one outer-product write, the delta error term, the sigmoid on beta, the exponential gate exp(g), and the state decay. This is not just an inference trick. If CKE wants training kernels, the backward path has to be a first-class kernel contract.
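To make that reversal order concrete, here is a hedged single-head sketch derived by hand from the forward equations in this section. It is not the registry kernel: the function name is hypothetical, it returns gradients with respect to the already-activated `gate` and `beta_s` rather than `g` and `beta`, and it omits whatever normalization `norm_eps` refers to in the signature above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Backward of one single-head step (sketch), for the forward:
 *   Sd = gate*S_in;  m = Sd^T k;  delta = beta_s*(v - m);
 *   S_out = Sd + outer(k, delta);  out = S_out^T q.
 * Matrices are [dim, dim] row-major; vectors are [dim].
 * Chain to the raw parameters afterwards with
 *   d_g = d_gate * gate,  d_beta = d_beta_s * beta_s * (1 - beta_s).
 * d_state_in must not alias state_in. Returns 0, or -1 if malloc fails. */
static int deltanet_step_head_backward(
    const float *d_out, const float *d_state_out,
    const float *q, const float *k, const float *v,
    float gate, float beta_s,
    const float *state_in, const float *state_out,
    float *d_q, float *d_k, float *d_v,
    float *d_gate, float *d_beta_s, float *d_state_in, size_t dim)
{
    assert(dim > 0);
    float *resid = malloc(2 * dim * sizeof *resid);
    if (!resid) return -1;
    float *d_delta = resid + dim;

    /* Recompute the forward residual v - Sd^T k. */
    for (size_t j = 0; j < dim; ++j) {
        float m = 0.0f;
        for (size_t i = 0; i < dim; ++i)
            m += gate * state_in[i * dim + j] * k[i];
        resid[j] = v[j] - m;
        d_delta[j] = 0.0f;
    }

    /* Reverse the read (out = S_out^T q) and the rank-one write. */
    for (size_t i = 0; i < dim; ++i) {
        d_q[i] = 0.0f;
        d_k[i] = 0.0f;
        for (size_t j = 0; j < dim; ++j) {
            const size_t idx = i * dim + j;
            const float dS = d_state_out[idx] + q[i] * d_out[j];
            d_q[i] += state_out[idx] * d_out[j];
            d_k[i] += dS * beta_s * resid[j];
            d_delta[j] += dS * k[i];
            d_state_in[idx] = dS; /* holds the partial dSd for now */
        }
    }

    /* Reverse delta = beta_s * (v - m). */
    *d_beta_s = 0.0f;
    *d_gate = 0.0f;
    for (size_t j = 0; j < dim; ++j) {
        *d_beta_s += d_delta[j] * resid[j];
        d_v[j] = beta_s * d_delta[j];
    }

    /* Reverse m = Sd^T k (d_m = -d_v), then the decay Sd = gate * S_in. */
    for (size_t i = 0; i < dim; ++i) {
        for (size_t j = 0; j < dim; ++j) {
            const size_t idx = i * dim + j;
            const float d_m = -d_v[j];
            const float dSd = d_state_in[idx] + k[i] * d_m;
            d_k[i] += gate * state_in[idx] * d_m;
            *d_gate += dSd * state_in[idx];
            d_state_in[idx] = gate * dSd;
        }
    }

    free(resid);
    return 0;
}
```

Both passes sweep the full D × D state, which is the dense backward surface described above. A finite-difference check against a matching forward step is the natural way to validate a sketch like this before trusting it.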
From the kernel registry: the forward parity target is around 1e-5, and the backward parity target is looser at 5e-4. That difference is reasonable. Backward paths compound more floating-point differences because they reverse several dense dependent operations.
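A parity check against those two targets could look like the following sketch. The helper name and macros are hypothetical, and it assumes the targets are absolute elementwise tolerances; the registry may define them differently, for example as relative error.

```c
#include <assert.h>
#include <stddef.h>

/* Tolerances quoted in the text, treated here as absolute (an assumption). */
#define PARITY_TOL_FORWARD  1e-5f
#define PARITY_TOL_BACKWARD 5e-4f

/* Returns 1 if every element of `got` is within `tol` of `ref`, else 0.
 * The negated comparison also rejects NaN, since NaN <= tol is false. */
static int parity_ok(const float *ref, const float *got, size_t n, float tol)
{
    assert(tol >= 0.0f);
    for (size_t i = 0; i < n; ++i) {
        float d = ref[i] - got[i];
        if (d < 0.0f) d = -d;
        if (!(d <= tol)) return 0;
    }
    return 1;
}
```

An output that is off by about 1e-4 in one element would fail the forward target and pass the backward one, which matches the looser expectation for the backward path.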
SSM/Mamba: State Evolution
SSM/Mamba-style layers do not write corrections into a D × D associative matrix. They evolve a recurrent state. Yesterday's Nemotron/Mamba post went deep on this, especially dt, split streams, Conv1D, and the Mamba2 reference kernels. Here the important contrast is the memory contract.
state_t = decay(dt, A) * state_{t-1} + write(B, x_t)
y_t = read(C, state_t) + D * x_t
A simple SSM Conv1D kernel in CKE uses this layout:
conv_x : [num_seqs, num_channels, kernel_size - 1 + num_tokens]
kernel : [num_channels, kernel_size]
out : [num_seqs, num_tokens, num_channels]
The CKE SSM Conv1D forward equation is:
out[seq, token, ch] =
dot(conv_x[seq, ch, token : token + kernel_size],
kernel[ch, :])
This is local sequence mixing. It is not image-specific convolution. Conv1D here means a small sliding filter over token positions for each channel. It gives the recurrent path nearby context before the state update.
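The forward equation and layout above translate into a short reference loop. This is a sketch written from the stated shapes, not the CKE source; the function name and the `float`/`size_t` types are assumptions.

```c
#include <assert.h>
#include <stddef.h>

/* Reference-style SSM Conv1D forward (sketch), using the layout in the text:
 *   conv_x : [num_seqs, num_channels, kernel_size - 1 + num_tokens]
 *   kernel : [num_channels, kernel_size]
 *   out    : [num_seqs, num_tokens, num_channels] */
static void ssm_conv1d_forward_sketch(const float *conv_x, const float *kernel,
                                      float *out, size_t kernel_size,
                                      size_t num_channels, size_t num_tokens,
                                      size_t num_seqs)
{
    assert(kernel_size > 0);
    const size_t padded = kernel_size - 1 + num_tokens;
    for (size_t seq = 0; seq < num_seqs; ++seq) {
        for (size_t token = 0; token < num_tokens; ++token) {
            for (size_t ch = 0; ch < num_channels; ++ch) {
                /* Window conv_x[seq, ch, token : token + kernel_size]. */
                const float *x = conv_x + (seq * num_channels + ch) * padded + token;
                const float *w = kernel + ch * kernel_size;
                float acc = 0.0f;
                for (size_t kk = 0; kk < kernel_size; ++kk)
                    acc += x[kk] * w[kk];
                out[(seq * num_tokens + token) * num_channels + ch] = acc;
            }
        }
    }
}
```

Note the layout mismatch the kernel has to absorb: the input is channel-major within a sequence, while the output is token-major. That transpose is part of why this surface is sensitive to memory layout rather than arithmetic.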
SSM Conv1D Backprop
CKE already has a reference backward for the SSM Conv1D surface:
ssm_conv1d_backward(
d_out,
conv_x,
kernel,
d_conv_x,
d_kernel,
kernel_size,
num_channels,
num_tokens,
num_seqs
)
The math is the standard convolution backward:
d_conv_x[seq,ch,t+k] += d_out[seq,t,ch] * kernel[ch,k]
d_kernel[ch,k] += d_out[seq,t,ch] * conv_x[seq,ch,t+k]
This is a different training problem from DeltaNet. DeltaNet backward is dense state-matrix calculus. SSM Conv1D backward is sliding-window accumulation. Mamba2 full backward would then add gradients through split streams, dt softplus, state update, gated RMSNorm, and projections.
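The two accumulation rules above can be written as a reference loop over the same layout as the forward pass. This is a sketch, not the CKE implementation: the function name and argument types are assumptions, and it zeroes both gradient buffers before accumulating, which the real kernel may or may not do.

```c
#include <assert.h>
#include <stddef.h>

/* Reference-style SSM Conv1D backward (sketch).
 *   d_out    : [num_seqs, num_tokens, num_channels]
 *   conv_x   : [num_seqs, num_channels, kernel_size - 1 + num_tokens]
 *   kernel   : [num_channels, kernel_size]
 *   d_conv_x : same shape as conv_x; d_kernel : same shape as kernel
 * Assumption: both gradient buffers are zeroed here before accumulation. */
static void ssm_conv1d_backward_sketch(const float *d_out, const float *conv_x,
                                       const float *kernel, float *d_conv_x,
                                       float *d_kernel, size_t kernel_size,
                                       size_t num_channels, size_t num_tokens,
                                       size_t num_seqs)
{
    assert(kernel_size > 0);
    const size_t padded = kernel_size - 1 + num_tokens;
    for (size_t i = 0; i < num_seqs * num_channels * padded; ++i) d_conv_x[i] = 0.0f;
    for (size_t i = 0; i < num_channels * kernel_size; ++i) d_kernel[i] = 0.0f;

    for (size_t seq = 0; seq < num_seqs; ++seq) {
        for (size_t t = 0; t < num_tokens; ++t) {
            for (size_t ch = 0; ch < num_channels; ++ch) {
                const float g = d_out[(seq * num_tokens + t) * num_channels + ch];
                const size_t base = (seq * num_channels + ch) * padded + t;
                for (size_t kk = 0; kk < kernel_size; ++kk) {
                    /* Overlapping windows add into the same input slot. */
                    d_conv_x[base + kk] += g * kernel[ch * kernel_size + kk];
                    /* Every token and sequence adds into the same weight. */
                    d_kernel[ch * kernel_size + kk] += g * conv_x[base + kk];
                }
            }
        }
    }
}
```

Because adjacent tokens write to overlapping `d_conv_x` slots and all tokens write to the same `d_kernel` entries, a parallel version has to partition by channel or use private accumulators to avoid races.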
CKE Kernel Contract Comparison
| Dimension | DeltaNet | SSM / Mamba |
|---|---|---|
| State meaning | Associative memory matrix. | Recurrent evolving state. |
| State shape | [heads, D, D] | Depends on model: heads/head_dim/state_dim/groups. |
| Core update | Error correction plus rank-one outer product. | Decay old state, write current input, read output. |
| Backward shape | Dense gradient through matrix read/write/gates. | Conv gradients plus recurrent-state/projection gradients. |
| CPU pressure | Often compute/FMA-heavy for state sweeps. | Often memory/layout-sensitive, especially local conv/state movement. |
| CKE evidence | Forward/backward registry entries and parity tests. | SSM Conv1D forward/backward plus Mamba2 reference kernels. |
Prefill vs Decode
The prefill/decode split matters. During prefill, the runtime processes a whole sequence. During decode, it usually updates state for one token at a time. That difference changes which kernel the runtime should call.
In yesterday's Mamba2 post, we saw that CKE can lower a selective-scan IR node to mamba2_selective_state_update_decode_f32 in decode. That is not a cosmetic rewrite. It says: for a single token, do not pretend you are scanning a long sequence. Update the recurrent state directly.
DeltaNet has a similar single-token framing in the registry: gated_deltanet_autoregressive_forward. It is explicitly an autoregressive one-token recurrent update. That makes it naturally aligned with decode. For prefill, a runtime has to decide how to process many tokens while respecting the sequential state dependency.
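The simplest prefill strategy that respects the sequential dependency is to call a one-token step in order. The sketch below uses a hypothetical step-function type and a toy recurrence purely for illustration; neither is a CKE interface. A chunked or scan-based prefill would need to reproduce the state and outputs of this loop within parity tolerance.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical one-token step: reads one input, updates state in place,
 * writes one output. Stands in for any autoregressive recurrent kernel. */
typedef void (*token_step_fn)(float *state, const float *x, float *y);

/* Toy step used only for illustration: state = 0.5*state + x, y = state. */
static void toy_decay_step(float *state, const float *x, float *y)
{
    state[0] = 0.5f * state[0] + x[0];
    y[0] = state[0];
}

/* Naive prefill: call the decode-style step once per token, in order.
 * Token t cannot start before token t-1 has written the state. */
static void prefill_sequential(token_step_fn step, float *state,
                               const float *in, float *out,
                               size_t num_tokens, size_t in_dim, size_t out_dim)
{
    assert(step != NULL);
    for (size_t t = 0; t < num_tokens; ++t)
        step(state, in + t * in_dim, out + t * out_dim);
}
```

After prefill, decode continues from the same state buffer by calling the step directly, so the two phases share one state contract even if they use different kernels.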
What This Means For CPU AI
This is exactly why CKE is being built as a kernel engine and not only a model loader. Model architecture choices become CPU execution contracts:
- Attention means KV-cache layout and softmax/FlashAttention-style numerics.
- DeltaNet means dense matrix state, sigmoid/exponential gates, outer-product correction, and dense backward math.
- SSM Conv1D means local sliding-window kernels and memory-sensitive backward accumulation.
- Mamba2 means split streams, dt softplus, state update, gated RMSNorm, and decode-specific state kernels.
These are not interchangeable. If a runtime handwaves them as “linear attention,” it loses the details that actually determine correctness and speed. The CPU does not run marketing names. It runs memory loads, loops, vector instructions, reductions, state writes, and function calls.
How I Read These Architectures In CKE
I do not read these papers only as model architecture papers. I read them as kernel specifications hiding inside a modeling paper. A paper might say “linear-time sequence modeling,” but CKE has to turn that sentence into a deterministic execution path. The runtime needs to know the state layout, the projection layout, the scalar gates, the vector operations, the matrix operations, and the backward equations.
For full attention, the implementation questions are familiar: how are Q/K/V projected, how are heads grouped, how is the KV cache stored, how do we apply RoPE or another positional transform, how do we perform masked softmax without numerical instability, and how do we avoid wasting memory bandwidth during prefill and decode?
For DeltaNet, the implementation questions change: where is the state matrix stored, is it per head or grouped by value heads, how do we apply gate decay, how do we compute S^T k, how do we form the correction v - S^T k, how do we write outer(k, delta), and how do gradients flow through the previous state, the updated state, the gate, beta, key, query, and value?
For SSM/Mamba, the implementation questions are different again: where are the split streams, how is Conv1D buffered, how is dt converted by softplus, how does the recurrent state decay, what is recomputed in backward, what state is kept during decode, and which parts can be vectorized without breaking numerical parity?
This is why I want CKE to expose these layers explicitly. If the engine can name the operation, inspect the dimensions, draw the circuit, test parity, and then emit or call the right C kernel, the model stops being a black box. It becomes a set of deterministic math circuits that can be measured.
Original Paper Trail
The names in this post are not invented. They come from a real sequence-modeling lineage: attention made explicit token memory dominant, linear attention and fast-weight work reframed memory as an updateable key-value map, DeltaNet added a corrective delta rule, and Mamba/Mamba2 made state-space recurrence practical for modern sequence models.
- Attention Is All You Need introduced the Transformer attention architecture that made full token-to-token attention the baseline.
- Linear Transformers Are Secretly Fast Weight Programmers connects linearized attention to fast weights and introduces the delta-rule style corrective memory update.
- Gated Delta Networks: Improving Mamba2 with Delta Rule develops Gated DeltaNet and discusses hybrid designs that combine delta-rule memory with attention or Mamba-style layers.
- Mamba: Linear-Time Sequence Modeling with Selective State Spaces introduces selective SSMs as a practical recurrent sequence backbone with hardware-aware scan behavior.
- Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality gives the Mamba2/SSD view that connects structured state-space models and attention-like computation.
Related Reading
Read these with this post:
- Nemotron Architecture From A C-Kernel-Engine Runtime Perspective
- What Is the C-Kernel-Engine?
- SIMD Deep Dive: How AI Kernels Use SSE, AVX, AVX-512, And VNNI
- Flash Attention On CPU
- Pipeline vs Tensor Parallelism
- C-Kernel-Engine DeltaNet Deep Dive
- C-Kernel-Engine Mamba2 Reference Kernels
Closing Thought
DeltaNet and SSM/Mamba both try to make long-context inference cheaper by replacing a growing token log with learned state. But that shared goal hides a major implementation difference. DeltaNet is a corrective matrix memory. SSM/Mamba is state evolution. One leans into dense associative updates. The other leans into recurrent dynamics and local sequence mixing.
For CKE, this is the whole point: the model is not a black box. The architecture becomes a circuit. The circuit becomes kernel contracts. The kernel contracts become C code, memory plans, parity tests, and performance evidence.