Gated DeltaNet: Recurrent Attention With Error-Driven Memory

Lab note

This ShivasNotes lab note studies the alternative used by the Qwen3.5 / qwen3next recurrent path in C-Kernel-Engine: Gated DeltaNet, where the model stops trying to store every token verbatim and instead learns to store only the correction its memory still needs. It builds on the earlier attention derivation in Attention: The Core Of The Transformer.

Speaker note. The key narrative for this post is simple: standard attention says keep everything and look it up later; DeltaNet says keep a fixed memory and continuously repair it.

DeltaNet is easiest to understand if you stop thinking “compressed attention” and start thinking “learned associative memory with forgetting.”

Roadmap for this post

First, we set up the architectural motivation, the fixed-memory state matrix, and the five per-step inputs that drive one recurrent update.

Then we walk the forward pass, the delta rule, the write gate, and the decay gate.

After that, we switch from math to systems work: buffer layout, SIMD tiers, and ISA dispatch in src/kernels/deltanet_kernels.c.

Finally, we cover training, chunked prefill, Qwen3.5's hybrid architecture, a direct comparison to standard attention, and a kernel engineer's closing perspective.

Introduction — Beyond Standard Attention

Standard self-attention is powerful because every token can compare itself with every earlier token. But that power comes with a scaling cost. During prefill, the model materializes or implicitly traverses a T × T interaction pattern. During autoregressive decode, the score matrix is avoided, but the KV-cache still grows linearly with context length and every new token must scan that history.

That is the practical long-context problem. The more text you feed the model, the more memory traffic you accumulate, and the more each new token depends on a growing external store. Standard attention is therefore exact and expressive, but it does not offer a fixed-size memory story.

Gated DeltaNet offers a different contract. Instead of appending keys and values forever, it keeps one state matrix S ∈ ℝ^d×d per head. That state has fixed size, so the memory cost per head is O(d²) regardless of whether the context is 1 token or 1 million tokens long.

O(d²) Per-head state stays fixed. DeltaNet pays in state dimension, not in context length.

This is not a toy side path. In current Qwen3.5 / qwen3next model code, recurrent layers and full-attention layers coexist. In src/models/qwen35.cpp lines 21-28, recurrent layers are marked so that every fourth main-layer block becomes full attention. The same file switches tensors and kernels based on that flag in lines 73-92. C-Kernel-Engine mirrors that contract with a dedicated DeltaNet kernel family for the recurrent blocks.

Comparison diagram showing standard attention growing with context while DeltaNet keeps a fixed d-by-d state matrix.

Why this matters after the attention derivation

In Attention: The Core Of The Transformer, attention was presented as the transformer's core routing primitive. This post does not replace that derivation. It asks a more engineering-focused question: what if a subset of layers can trade exact token-by-token recall for a learned fixed-size memory that streams efficiently?

The Core Idea — A Learned Associative Memory

The most useful mental model for DeltaNet is an associative memory. Each head owns a matrix S of shape [d, d]. Queries read from it. Keys address it. Values define what the memory should return for those keys. But unlike a hash table or a KV-cache, this memory is updated in place.

Major concept

A KV-cache is append-only. DeltaNet state is corrective. It does not ask, “What new row should I add?” It asks, “What is the error between what I want to remember and what the current matrix already recalls?”

That distinction is the entire algorithm. If we probe the current memory with key k, the recalled vector is Sᵀ · k. If the target value is v, then the memory error is:

error = v - Sᵀ · k

DeltaNet does not write v directly. It writes a gated version of this error. So if the memory already returns the right answer, the update shrinks toward zero.

The state matrix is not a log of the past. It is a compressed hypothesis about what the past should imply for future reads.

Memory view	Standard attention	Gated DeltaNet
Stored object	All past `K` and `V` rows	One evolving matrix `S`
Growth with sequence	Linear in `T`	Constant for fixed `d`
Write rule	Append raw rows	Write the correction `v - Sᵀk`
Update style	Never rewrites old rows	Continuously revises the same state

This is why DeltaNet feels more like an old-school adaptive filter than a transformer cache. The memory is always alive. It decays, gets corrected, and is immediately read back out. In systems language, it is a fixed working set. In learning language, it is an error-driven associative memory.

The Five Inputs Per Step

Every recurrent DeltaNet step consumes five learned signals per head. Three are vectors. Two are scalars. The vector trio tells the memory what to read and what to store. The scalar pair decides how much to forget and how aggressively to write.

Five DeltaNet inputs per step: q, k, v, beta write gate, and g decay gate.

Input	Shape	Role in the update
`q`	`[d]`	Query vector. It asks the memory what to read out after the state update.
`k`	`[d]`	Key vector. It addresses the memory for both recall and writing.
`v`	`[d]`	Value vector. It is the desired memory response for the current key.
`β`	scalar	Write gate. After sigmoid, it becomes a learned learning rate in `(0, 1)`.
`g`	scalar	Decay gate. After exponentiation, it scales the previous state before the new write.

Speaker note. If you remember only one sentence from this section, make it this one: q asks, k addresses, v provides the target content, β controls how hard we write, and g controls how much of yesterday survives.

Two gates matter for two different reasons: β is epistemic confidence about the write; g is temporal confidence about how much past memory should remain.

In the C-Kernel-Engine contract, the vectors are laid out head-major as flat FP32 buffers, and the scalars arrive one per head. That makes the hot loop easy to reason about: for each head, read one q, one k, one v, one β, one g, one previous state matrix, then emit one updated state and one output vector.

Step-by-Step Forward Pass

Here is the six-step forward pass in equation form. This is the clean mathematical version that the implementation realizes.

Step 1. q̂ = L2norm(q) / √d — normalize and scale the query.

Step 2. k̂ = L2norm(k) — normalize the key.

Step 3. S_decay = exp(g) · S_prev — decay the old memory.

Step 4. kv_mem = S_decayᵀ · k̂, then δ = sigmoid(β) · (v - kv_mem) — recall what memory says and compute the correction.

Step 5. S_new = S_decay + outer(k̂, δ) — rank-1 write back into memory.

Step 6. out = S_newᵀ · q̂ — read the updated memory using the query.

6 steps Normalize, decay, recall, compute the delta, write a rank-1 correction, then read out from the updated matrix.

In the current C-Kernel-Engine main branch, the scalar reference implementation appears in src/kernels/deltanet_kernels.c lines 54-111. One subtlety matters immediately: the file comment and the AVX paths note that q and k arrive pre-normalized from recurrent_qk_l2_norm, so the reference code only needs to apply the extra 1 / √d scaling to q.

C-Kernel-Engine — scalar reference forward, lines 54-111

void gated_deltanet_autoregressive_forward_ref(const float *q,
                                               const float *k,
                                               const float *v,
                                               const float *g,
                                               const float *beta,
                                               const float *state_in,
                                               float *state_out,
                                               float *out,
                                               int num_heads,
                                               int state_dim,
                                               float norm_eps)
{
    const float q_scale = 1.0f / sqrtf((float)state_dim);
    const size_t vec_stride = (size_t)state_dim;
    const size_t state_stride = (size_t)state_dim * (size_t)state_dim;

    for (int h = 0; h < num_heads; ++h) {
        const float *q_head = q + (size_t)h * vec_stride;
        const float *k_head = k + (size_t)h * vec_stride;
        const float *v_head = v + (size_t)h * vec_stride;
        const float *state_prev = state_in + (size_t)h * state_stride;
        float *state_cur = state_out + (size_t)h * state_stride;
        float *out_head = out + (size_t)h * vec_stride;

        const float beta_s = ck_deltanet_sigmoidf(beta[h]);
        const float gate = expf(g[h]);

        for (int row = 0; row < state_dim; ++row) {
            const size_t row_off = (size_t)row * (size_t)state_dim;
            for (int col = 0; col < state_dim; ++col) {
                state_cur[row_off + (size_t)col] = state_prev[row_off + (size_t)col] * gate;
            }
        }

        for (int col = 0; col < state_dim; ++col) {
            float kv_mem = 0.0f;
            for (int row = 0; row < state_dim; ++row) {
                const float k_hat = k_head[row];
                kv_mem += state_cur[(size_t)row * (size_t)state_dim + (size_t)col] * k_hat;
            }

            const float delta = (v_head[col] - kv_mem) * beta_s;
            for (int row = 0; row < state_dim; ++row) {
                const float k_hat = k_head[row];
                state_cur[(size_t)row * (size_t)state_dim + (size_t)col] += k_hat * delta;
            }
        }

        for (int col = 0; col < state_dim; ++col) {
            float acc = 0.0f;
            for (int row = 0; row < state_dim; ++row) {
                const float q_hat = q_head[row] * q_scale;
                acc += state_cur[(size_t)row * (size_t)state_dim + (size_t)col] * q_hat;
            }
            out_head[col] = acc;
        }
    }
}

The loop order looks different from the textbook six-step derivation, but the math is the same. The first nested loop is Step 3. The second nested loop fuses Step 4 and Step 5 column by column. The final nested loop is Step 6.

Notice how cache-friendly the memory access is. Each head walks one contiguous d × d block, updates it in place, and immediately reuses that same block for the readout.

One implementation detail worth noticing

The scalar reference separates the algorithm into clean sweeps over the state matrix instead of trying to micro-fuse everything. That keeps the reference path easy to audit for parity, which matters because every ISA-specialized path is supposed to reproduce this math.

The Delta Rule — Why Error-Driven Updates Matter

The word “delta” is not decorative. It comes from the classical delta rule in associative memory and adaptive filtering, often associated with Widrow-Hoff style learning from 1960. The central idea is that you do not overwrite memory with the raw target. You write only the residual error between the target and the current prediction.

For DeltaNet, that rule is: δ = β · (v - Sᵀ·k) after the write gate has been squeezed into a usable range. The current state already contains beliefs about what key k should retrieve. So before writing, the kernel asks the state to answer first.

Delta rule diagram showing memory recall S-transpose-times-k, subtraction from target v, and a corrective write back into state.

If the memory already knows the answer, then Sᵀ·k ≈ v and the correction is tiny. The state barely changes.

If the memory is wrong, the residual is large, and the write becomes a corrective push in the direction that makes the next recall better.

Attention says “store all evidence, decide later.” DeltaNet says “store only the error signal the current memory still cannot explain.”

This is fundamentally different from standard attention. In a KV-cache, every token writes a new row whether or not that row is redundant. In DeltaNet, redundant information naturally suppresses itself because the state can already recall it. Compression is therefore not a post-processing trick. It is built into the write rule.

Situation	Recall `Sᵀ·k`	Correction `δ`	Effect on state
Memory already accurate	Close to `v`	Near zero	Almost no write
Memory partially wrong	Some coordinates right, some wrong	Moderate residual	Targeted correction
Memory very wrong	Far from `v`	Large residual	Strong rewrite along `outer(k, δ)`

Beta — The Learned Write Gate

The parameter β is transformed with a sigmoid before use. In code, the reference path computes beta_s = ck_deltanet_sigmoidf(beta[h]). That squeezes the learned scalar into the interval (0, 1), turning it into a stable write gate.

Conceptually, β_s is a learned learning rate for the current token and head. When it is small, the model treats the current information as something that should barely perturb the state. When it is large, the model decides that the current residual deserves a strong memory update.

Write gate illustration showing low beta causing tiny corrections and high beta causing strong corrective writes.

Low β. Conservative write, useful when the state already carries a stable long-lived memory.

High β. Aggressive correction, useful when the model detects fresh evidence that should quickly reshape the state.

0 → 1 Sigmoid turns an unconstrained learned scalar into a bounded update strength that behaves like a write-rate controller.

Write-gate regime	Behavior	Typical interpretation
`β_s ≈ 0`	Almost no correction is written	Protect a stable memory
`β_s ≈ 0.5`	Partial correction	Blend old memory with new evidence
`β_s ≈ 1`	Full-strength correction	Rapidly adapt to new information

This gate is learned during training, not hand-tuned at inference. So the model gradually discovers which kinds of tokens or hidden states should trigger strong updates, and which should merely probe the memory without changing it very much.

Gate — Exponential Decay and Forgetting

The second scalar gate is g, which the kernel transforms as gate = exp(g). Multiplying the whole state by this factor produces a simple but important effect: forgetting.

Every cell in S is scaled before the new write arrives. So even before the model decides what fresh correction to store, it has already decided how much of the old memory should survive. This makes the state dynamic rather than archival.

State decay illustration showing a memory matrix fading under exp(g) before receiving a new update.

When gate < 1, the memory fades. That is the common case for selective forgetting.

When gate ≈ 1, the old state is preserved almost exactly, so the new token behaves more like a gentle refinement than a reset.

DeltaNet gets a natural context-window mechanism not by dropping tokens, but by continuously shrinking the effective contribution of older memory traces.

This is one reason recurrent DeltaNet layers are attractive for streaming inference. The model can learn a soft notion of recency through decay without hauling an ever-growing cache behind it. Instead of a hard truncation rule, it gets a learned forgetting process.

Decay is not the same as deletion

Nothing in the kernel literally deletes one token's contribution. The state just keeps being rescaled and corrected. Over time, contributions that are no longer reinforced naturally lose influence.

Memory Layout and Shapes

Once you leave the math and open the kernel, layout becomes the story. C-Kernel-Engine documents DeltaNet as a head-major FP32 operator: one state matrix per head, one vector bundle per head, no heap allocation inside the kernel, and no hidden framework objects.

Systems view

The header comment at the top of deltanet_kernels.c lines 5-10 states the rules plainly: no malloc/free, no OpenMP, explicit APIs, pure computation. Line 47 then hard-codes CK_DELTANET_MAX_STACK_DIM = 4096 for the temporary stack buffers used by the kernel family.

Buffer	Shape	Layout detail
State `S`	`[num_heads × d × d]`	One contiguous row-major matrix per head.
`q`, `k`, `v`, `out`	`[num_heads × d]`	Flat vectors with stride `d` between heads.
`g`, `β`	`[num_heads]`	One scalar per head, broadcast inside the loops.
Temporaries	`[d]` each	Stack arrays such as `q_hat`, `k_hat`, `kv_mem`, `delta`.

The row-major flattening is not just a C convenience. It is what makes the per-head sweep simple for both scalar and SIMD implementations. A head starts at one base pointer, walks one contiguous matrix, and then advances by a fixed stride to the next head.

4096 Maximum stack dimension for temporary vectors in the DeltaNet kernels, as defined by CK_DELTANET_MAX_STACK_DIM.

This also explains why the kernel is friendly to a bump allocator model. The long-lived buffers are allocated by the orchestrator. The kernel itself just receives raw pointers, uses a few bounded stack arrays for scratch work, and exits. For low-level inference code, that contract keeps both ownership and performance predictable.

SIMD Optimization — Scalar to AVX-512

One of the nicest things about this kernel family is that the algorithm stays the same across all ISA tiers. The reference path defines the math. The vectorized paths only change how many columns they process per iteration and how much loop overhead they remove.

Tier	Source lines	Main idea
Scalar reference	`54-111`	Bit-clear baseline with explicit decay, corrective write, and readout loops.
AVX	`254-357`	8 floats per iteration with pre-scaled `q_hat` and copied `k_hat`.
AVX2 + FMA	`370-530`	Row-pair unrolling plus fused multiply-add helper `ck_deltanet_fmadd256`.
AVX-512	`552-653`	16 floats per iteration with the same column-wise algorithm and wider registers.

The AVX path first makes the scalar contract explicit for vector code. Lines 286-288 precompute the scaled query and copied key into stack arrays, so the hot loops can operate on clean contiguous buffers. That is exactly the kind of small structural cleanup that helps the compiler and keeps parity straightforward.

AVX2 row-pair unroll pattern from lines 421-445 and 479-507

int row = 0;
for (; row + 2 <= state_dim; row += 2) {
    const size_t row0_off = (size_t)row * (size_t)state_dim;
    const size_t row1_off = (size_t)(row + 1) * (size_t)state_dim;
    const __m256 k0_v = _mm256_set1_ps(k_hat[row]);
    const __m256 k1_v = _mm256_set1_ps(k_hat[row + 1]);

    col = 0;
    for (; col + 8 <= state_dim; col += 8) {
        __m256 prev0_v = _mm256_loadu_ps(state_prev + row0_off + (size_t)col);
        __m256 prev1_v = _mm256_loadu_ps(state_prev + row1_off + (size_t)col);
        __m256 cur0_v = _mm256_mul_ps(prev0_v, gate_v);
        __m256 cur1_v = _mm256_mul_ps(prev1_v, gate_v);
        __m256 kv_v = _mm256_loadu_ps(kv_mem + col);
        kv_v = ck_deltanet_fmadd256(cur0_v, k0_v, kv_v);
        kv_v = ck_deltanet_fmadd256(cur1_v, k1_v, kv_v);
        _mm256_storeu_ps(state_cur + row0_off + (size_t)col, cur0_v);
        _mm256_storeu_ps(state_cur + row1_off + (size_t)col, cur1_v);
        _mm256_storeu_ps(kv_mem + col, kv_v);
    }
}

for (; row + 2 <= state_dim; row += 2) {
    const size_t row0_off = (size_t)row * (size_t)state_dim;
    const size_t row1_off = (size_t)(row + 1) * (size_t)state_dim;
    const __m256 k0_v = _mm256_set1_ps(k_hat[row]);
    const __m256 k1_v = _mm256_set1_ps(k_hat[row + 1]);
    const __m256 q0_v = _mm256_set1_ps(q_hat[row]);
    const __m256 q1_v = _mm256_set1_ps(q_hat[row + 1]);

    col = 0;
    for (; col + 8 <= state_dim; col += 8) {
        __m256 cur0_v = _mm256_loadu_ps(state_cur + row0_off + (size_t)col);
        __m256 cur1_v = _mm256_loadu_ps(state_cur + row1_off + (size_t)col);
        __m256 delta_v = _mm256_loadu_ps(delta + col);
        __m256 out_v = _mm256_loadu_ps(out_head + col);
        __m256 upd0_v = ck_deltanet_fmadd256(k0_v, delta_v, cur0_v);
        __m256 upd1_v = ck_deltanet_fmadd256(k1_v, delta_v, cur1_v);
        out_v = ck_deltanet_fmadd256(upd0_v, q0_v, out_v);
        out_v = ck_deltanet_fmadd256(upd1_v, q1_v, out_v);
        _mm256_storeu_ps(state_cur + row0_off + (size_t)col, upd0_v);
        _mm256_storeu_ps(state_cur + row1_off + (size_t)col, upd1_v);
        _mm256_storeu_ps(out_head + col, out_v);
    }
}

That row-pair strategy matters because DeltaNet repeatedly sweeps the same matrix. Unrolling two rows at a time reduces loop-control overhead, reuses broadcast scalars efficiently, and lets the FMA helper accumulate state updates and readout contributions with fewer separate instructions.

The optimization is not changing the algorithm. It is teaching the machine to walk the same matrix with wider steps and less bookkeeping.

The AVX-512 implementation in lines 552-653 keeps the same column-wise design, simply widening loads and stores to 16 floats. In other CK kernels, AVX-512 often ends with horizontal reductions such as _mm512_reduce_add_ps. DeltaNet's forward path avoids that pattern because it vectorizes across columns of the state matrix rather than forming one explicit dot-product reduction at a time. That is an important implementation detail: the widest path is still structurally the same algorithm, not a different formula.

Scalar, AVX, AVX2, and AVX-512 tiers for the DeltaNet kernel showing progressively wider vectorization.

ISA Dispatch

Once multiple ISA-specific kernels exist, the next question is dispatch. C-Kernel-Engine answers that with a tiny top-level selector in deltanet_kernels.c lines 656-719. The key design idea is that runtime branching happens once before the hot loops, not inside them.

Dispatch and parity selection, lines 656-719

static int ck_deltanet_force_ref(void)
{
    const char *env = getenv("CK_DELTANET_FORCE_REF");
    return env && atoi(env) != 0;
}

const char *gated_deltanet_impl_name(void)
{
    if (ck_strict_parity_enabled() || ck_deltanet_force_ref()) {
        return "REF";
    }
#if defined(__AVX512F__)
    return "AVX512";
#elif defined(__AVX2__)
    return "AVX2";
#elif defined(__AVX__)
    return "AVX";
#else
    return "REF";
#endif
}

void gated_deltanet_autoregressive_forward(const float *q,
                                           const float *k,
                                           const float *v,
                                           const float *g,
                                           const float *beta,
                                           const float *state_in,
                                           float *state_out,
                                           float *out,
                                           int num_heads,
                                           int state_dim,
                                           float norm_eps)
{
    if (!q || !k || !v || !g || !beta || !state_in || !state_out || !out) {
        return;
    }
    if (num_heads <= 0 || state_dim <= 0) {
        return;
    }

    if (ck_strict_parity_enabled() || ck_deltanet_force_ref()) {
        gated_deltanet_autoregressive_forward_ref(
            q, k, v, g, beta, state_in, state_out, out, num_heads, state_dim, norm_eps);
        return;
    }
#if defined(__AVX512F__)
    gated_deltanet_autoregressive_forward_avx512(
        q, k, v, g, beta, state_in, state_out, out, num_heads, state_dim, norm_eps);
#elif defined(__AVX2__)
    gated_deltanet_autoregressive_forward_avx2(
        q, k, v, g, beta, state_in, state_out, out, num_heads, state_dim, norm_eps);
#elif defined(__AVX__)
    gated_deltanet_autoregressive_forward_avx(
        q, k, v, g, beta, state_in, state_out, out, num_heads, state_dim, norm_eps);
#else
    gated_deltanet_autoregressive_forward_ref(
        q, k, v, g, beta, state_in, state_out, out, num_heads, state_dim, norm_eps);
#endif
}

Strict parity mode. Set CK_DELTANET_FORCE_REF=1 to force the scalar reference path even when vector ISAs are available. That is invaluable when benchmarking correctness or chasing a numerical parity issue.

Good kernel engineering means the fastest path and the most trustworthy path share one contract, and you can flip between them without changing the graph above.

Notice what the dispatcher does not do. It does not branch on every row, every column, or every token. It selects the implementation once, then hands the whole head sweep to that implementation. That keeps the hot loops branch-light and predictable.

Backward Pass — Training DeltaNet

A production recurrent kernel is not finished when the forward pass works. Training requires gradients for every meaningful input to the operator. The DeltaNet backward reference in deltanet_kernels.c lines 113-238 therefore emits six outputs: d_q, d_k, d_v, d_g, d_beta, and d_state_in.

The logic mirrors the forward pass in reverse. The output read contributes gradients back into the updated state and the scaled query. The rank-1 write contributes gradients into k, delta, and the state. The delta rule then routes gradient into v, the memory recall term, and the write gate. Finally, the decay multiplies all state-related gradients by the exponential gate and accumulates the gate derivative itself.

Backprop foundation

If the derivative flow here feels dense, read dL/d(LLM): The Full Backward Pass first. DeltaNet backward is the same chain-rule idea applied to a recurrent memory update: gradients flow from the output read, through the rank-1 write, through the delta residual, and back into the gates.

Backward pass diagram for DeltaNet showing gradients flowing to q, k, v, beta, g, and previous state.

The write-gate gradient follows the intuition you would expect: if changing β would make the delta rule better align the write with the residual error, the gradient grows. Algebraically, it is the residual term times the derivative of the sigmoid.

Likewise, the decay-gate gradient accumulates how much the total state gradient wanted the previous state to survive.

6 grads DeltaNet backward returns gradients for query, key, value, write gate, decay gate, and previous state—everything needed to train the recurrent memory end to end.

Key backward excerpt from lines 199-237

for (int col = 0; col < state_dim; ++col) {
    float d_delta_acc = d_out_head[col] * qk_dot;
    for (int row = 0; row < state_dim; ++row) {
        d_delta_acc += d_state_out_head[(size_t)row * (size_t)state_dim + (size_t)col] * k_hat[row];
    }

    d_v_head[col] = beta_s * d_delta_acc;
    d_mem[col] = -beta_s * d_delta_acc;
    beta_acc += d_delta_acc * (v_head[col] - kv_mem[col]);
}

for (int row = 0; row < state_dim; ++row) {
    const size_t row_off = (size_t)row * (size_t)state_dim;
    for (int col = 0; col < state_dim; ++col) {
        const float d_state_total = d_state_out_head[row_off + (size_t)col]
                                  + q_hat[row] * d_out_head[col]
                                  + k_hat[row] * d_mem[col];
        d_state_prev[row_off + (size_t)col] = gate * d_state_total;
        gate_acc += d_state_total * state_prev[row_off + (size_t)col];
    }
}

for (int i = 0; i < state_dim; ++i) {
    d_q_head[i] = d_q_hat[i] * q_scale;
    d_k_head[i] = d_k_hat[i];
}

d_g[h] = gate_acc * gate;
d_beta[h] = beta_acc * beta_s * (1.0f - beta_s);

In compact math, you can summarize the most important scalars this way:

d_beta = d_delta · (v - kv_mem) · sigmoid'(β)

d_g = gate_acc · gate

Those two formulas capture the training story. β learns how strongly a residual should update memory, while g learns how quickly memory should decay.

Chunked Processing for Prefill

The single-token DeltaNet kernel is the autoregressive path. It assumes n_tokens == 1 and updates the state once. But prefill is different. If you need to ingest a whole prompt of length T, you cannot call the single-token path naively and hope to preserve throughput. The graph builder in src/models/delta-net-base.cpp therefore has a chunked prefill path in lines 16-287.

The first important line is 61: const int CS = kda ? 16 : 64; That means chunk size 16 for KDA mode and 64 for GDA mode. Everything that follows reorganizes the prompt into those chunked blocks, builds decay masks from cumulative gate sums, solves for within-chunk attention weights, and propagates the recurrent state chunk by chunk.

Chunk-size choice and chunked triangular solve, delta-net-base.cpp

c++

const int CS = kda ? 16 : 64; // chunk size

const int pad = (CS - n_tokens % CS) % CS;
const int n_chunks = (n_tokens + pad) / CS;

...

ggml_tensor * lhs = ggml_add(ctx0, attn, identity);
attn = ggml_neg(ctx0, attn);

ggml_tensor * lin_solve = ggml_solve_tri(ctx0, lhs, attn, true, true, false);
attn = ggml_add(ctx0, lin_solve, identity);

...

for (int64_t chunk = 0; chunk < n_chunks; chunk++) {
    ggml_tensor * ch_k_cd    = get_slice_2d(ctx0, k_cd,    chunk);
    ggml_tensor * ch_v_t     = get_slice_2d(ctx0, v_t,     chunk);
    ggml_tensor * ch_kq      = get_slice_2d(ctx0, kq,      chunk);
    ggml_tensor * ch_q_g_exp = get_slice_2d(ctx0, q_g_exp, chunk);
    ggml_tensor * ch_kg_t    = get_slice_2d(ctx0, kg_t,    chunk);

    ggml_tensor * v_t_p = ggml_mul_mat(ctx0, ch_k_cd, s);
    ggml_tensor * v_t_new = ggml_sub(ctx0, ch_v_t, v_t_p);
    ggml_tensor * v_attn = ggml_mul_mat(ctx0, v_t_new, ch_kq);
    ggml_tensor * attn_inter = ggml_mul_mat(ctx0, s, ch_q_g_exp);
    ggml_tensor * o_ch = ggml_add(ctx0, attn_inter, v_attn);

    ggml_tensor * kgv = ggml_mul_mat(ctx0, ch_kg_t, v_t_new);
    ggml_tensor * ch_g_last_exp_t = get_slice_2d(ctx0, g_last_exp_t, chunk);

    s = ggml_mul(ctx0, s, ch_g_last_exp_t);
    s = ggml_add(ctx0, s, kgv);
}

Speaker note. The chunk builder is the bridge between two worlds. Inside a chunk, DeltaNet behaves almost like a structured linear-attention solve. Across chunks, the recurrent state s carries the summary forward.

Prefill is not “just call the single-token kernel many times.” It is a chunked graph that uses gate cumsums, decay masks, and a triangular solve to preserve the recurrent semantics efficiently.

Three implementation details are worth highlighting. First, cumulative sums of g create decay masks that encode how much earlier positions should attenuate later ones. Second, the triangular solve turns those masked interactions into stable within-chunk attention weights. Third, after each chunk, the state is explicitly updated and handed to the next chunk, so the recurrent summary continues to propagate across the whole prompt.

Hybrid Attention in Qwen3.5

The real deployment lesson is that Qwen3.5 does not ask DeltaNet to do everything. It uses a hybrid stack. Some layers are recurrent DeltaNet layers. Some layers are standard full-attention layers. That split is what makes the architecture interesting.

In qwen35.cpp lines 21-28, the recurrent-layer flag is populated so that every fourth main layer becomes a full-attention layer. In the same file, lines 73-92 allocate attention tensors when the layer is non-recurrent and DeltaNet-specific tensors when it is recurrent. In qwen3next.cpp lines 131-137, the graph builder then uses that flag to choose between build_layer_attn_linear(...) and build_layer_attn(...).

Hybrid architecture diagram showing recurrent DeltaNet layers interleaved with periodic full-attention layers in Qwen3.5.

The C-Kernel-Engine model matrix describes the pattern explicitly as 3×recurrent → 1×full_attention for the Qwen3.5 template. That means the system is not betting on one universal memory mechanism. It is assigning jobs: recurrent layers handle efficient streaming summaries, while full-attention layers periodically recover exact token-level interactions.

Hybrid design is the practical answer to the false choice between “attention everywhere” and “recurrence everywhere.” Use exact recall where it matters, and fixed-memory streaming where it pays off.

Layer type	Strength	Why Qwen3.5 keeps it
Full attention	Precise long-range token recall	Recover exact interactions that fixed memory may blur.
Gated DeltaNet	Streaming efficiency with fixed state	Keep decode cost stable and memory bounded over long contexts.

If you like slogans, this hybrid stack gives you accuracy where the model most needs explicit recall, and efficiency where the workload most needs bounded state. That is why DeltaNet in Qwen3.5 should be seen as a complement to attention, not a wholesale replacement.

DeltaNet vs Standard Attention — Complete Comparison

By now the contrast should be crisp enough to summarize directly. The table below puts the two mechanisms side by side using the categories that matter most in inference engineering.

Property	Standard Attention	Gated DeltaNet
Memory per head	`O(T × d)` grows	`O(d²)` fixed
Per-token cost	`O(T × d)` scans KV-cache	`O(d²)` constant
State	KV-cache (append-only)	Matrix `S` (delta rule)
Forgetting	None (or windowed)	Exponential decay `exp(g)`
Write mechanism	Append `K,V` rows	Rank-1 correction `outer(k̂, δ)`
Best for	Precise long-range recall	Streaming / very long contexts

The table hides one subtle but important point. Standard attention stores exact episodes. DeltaNet stores a compressed evolving theory of what matters. That means their trade-off is not just computational. It is epistemic.

Store corrections This is the conceptual pivot of the whole post. DeltaNet replaces append-only memory with corrective memory.

Where each mechanism wins

If you need exact token lookup across a moderate context, standard attention is still the gold standard. If you need bounded-state streaming over very long contexts, DeltaNet becomes compelling because its cost no longer scales with prompt length.

Conclusion — The Kernel Engineer's Perspective

From a kernel engineer's perspective, Gated DeltaNet is attractive because it changes the memory story without discarding the language-model story. It does not pretend that attention was a mistake. It identifies a subset of layers where fixed-size recurrent memory is a better systems contract.

The delta rule is the conceptual heart of that contract. The memory does not blindly accumulate raw values. It measures what it already recalls, computes the residual, and writes only the correction. That is why “store corrections, not everything” is the right summary sentence for this architecture.

Takeaway

DeltaNet is not replacing attention. It is complementing it. Fixed memory keeps inference cost from growing with context, learned error-driven updates make the memory adaptive, and the C-Kernel-Engine implementation shows how the same algorithm scales from scalar reference code to AVX, AVX2, and AVX-512 without changing the mathematical contract.

C-Kernel-Engine goes one step further by pairing the forward kernel with a backward reference, so the recurrent attention path is not just inferable but trainable. That is what turns DeltaNet from an interesting runtime trick into a real model component.

The most important architectural move here is not the SIMD work. It is the decision to treat memory as a learned, decaying, corrective object rather than an ever-growing log.

Gated DeltaNet: Recurrent Attention With Error-Driven Memory

Roadmap for this post

Introduction — Beyond Standard Attention

Why this matters after the attention derivation

The Core Idea — A Learned Associative Memory

The Five Inputs Per Step

Step-by-Step Forward Pass

One implementation detail worth noticing

The Delta Rule — Why Error-Driven Updates Matter

Beta — The Learned Write Gate

Gate — Exponential Decay and Forgetting

Decay is not the same as deletion

Memory Layout and Shapes

SIMD Optimization — Scalar to AVX-512

ISA Dispatch

Backward Pass — Training DeltaNet

Backprop foundation

Chunked Processing for Prefill

Hybrid Attention in Qwen3.5

DeltaNet vs Standard Attention — Complete Comparison

Where each mechanism wins

Conclusion — The Kernel Engineer's Perspective

Further reading

Related ShivasNotes posts

Need an intelligent system to work on real hardware?

Embedded systems · Robotics · Constrained AI · CPU and HPC · Accelerators · Distributed systems

ShivasNotes

Read

Support

Gated DeltaNet: Recurrent Attention With Error-Driven Memory

Roadmap for this post

Introduction — Beyond Standard Attention

Why this matters after the attention derivation

The Core Idea — A Learned Associative Memory

The Five Inputs Per Step

Step-by-Step Forward Pass

One implementation detail worth noticing

The Delta Rule — Why Error-Driven Updates Matter

Beta — The Learned Write Gate

Gate — Exponential Decay and Forgetting

Decay is not the same as deletion

Memory Layout and Shapes

SIMD Optimization — Scalar to AVX-512

ISA Dispatch

Backward Pass — Training DeltaNet

Backprop foundation

Chunked Processing for Prefill

Hybrid Attention in Qwen3.5

DeltaNet vs Standard Attention — Complete Comparison

Where each mechanism wins

Conclusion — The Kernel Engineer's Perspective

Further reading

Related ShivasNotes posts

Subscribe

Subscribe to emails from Anthony

Need an intelligent system to work on real hardware?

Embedded systems · Robotics · Constrained AI · CPU and HPC · Accelerators · Distributed systems

ShivasNotes

Read

Support