Attention: The Core Of The Transformer

Lab note

Previously: Tokenization: The First Decision That Shapes Everything.

The previous post ended with tokens becoming vectors. This post is where those vectors finally start talking to one another. Full attention is the moment a transformer turns a row of isolated embeddings into a context-aware state where every token can measure every other token, route information selectively, and write back a blended result.

It is also the point where several earlier posts fuse into one pipeline. The query, key, and value projections are the matrix multiplies from Matrix Wx+b: From Scalars To Transformers. The probability conversion is the stable row-wise normalization from Softmax: The Probability Engine. The input usually arrives through LayerNorm And RMSNorm: Stabilizing The Signal, and the positional geometry comes from Positional Encoding: Teaching Transformers Where To Look. Attention is the place where all of those ideas stop being separate topics and become one executable graph. Attention is not magic. It is a carefully staged routing computation: project, compare, normalize, mix, and project again.

Roadmap for this post

Sections 1 through 9 walk the forward pass from intuition to production variants: what attention does, how Q/K/V are projected, how heads are split, where RoPE enters, how scores become weights, how values are mixed, and why GQA matters.

Sections 10 through 12 reverse the whole graph. We will go step by step through the backward pass, then cover stability, precision, Flash Attention, and the implementation choices in C-Kernel-Engine.

Section 13 closes by reconnecting attention to the full transformer block so the next layer no longer feels like a collection of disconnected diagrams.

Section 1: What Attention Does

At the highest level, attention lets each token look sideways. Instead of processing token 7 only after token 6 and before token 8, as an RNN would, the transformer projects the whole sequence in parallel and asks a matrix question: which other positions matter to each position right now? That single change removes the sequential bottleneck that used to dominate language modeling.

The standard formula is Attention(Q, K, V) = softmax(QK^T / √d_k) · V. Every symbol in that equation is concrete. Q is the query matrix: what each token is looking for. K is the key matrix: what each token says it contains. V is the value matrix: what each token can actually contribute if another token decides to attend to it. 90% compute In long-context workloads, most of the live compute and memory traffic clusters around the attention block. That is why this is the bottleneck kernel engineers spend so much time optimizing.

The output is a weighted sum of value vectors. The weights come from query-key similarity. If token i finds token j highly relevant, then the weight w[i,j] is large and more of V[j] flows into the output state for token i. If the similarity is low, very little of that value survives the mix.

A useful mnemonic is this. Queries ask, “What am I looking for?” Keys answer, “What do I contain?” Values provide, “What do I offer if selected?” Attention scores are therefore not the final content. They are routing coefficients that decide which content vectors get amplified. Scores are not meaning yet. They are the control plane that decides which value vectors are allowed onto the data plane.

Symbol	Shape	Interpretation
`X`	`[T, C]`	Sequence of token embeddings or normalized hidden states entering attention.
`Q`	`[T, C]` before head split	Per-token “search pattern” vectors.
`K`	`[T, C]` before head split	Per-token “index” vectors used for matching.
`V`	`[T, C]` before head split	Per-token content vectors that get mixed into the output.
`QK^T`	`[T, T]` per head	All pairwise query-key compatibilities.
`softmax(...)`	`[T, T]` per head	Row-wise attention weights that sum to 1.
`weights · V`	`[T, D_h]` per head	Weighted mixture of value vectors for each query token.

High-level diagram showing input splitting into Q, K, and V, then moving through scaled dot-product, softmax, weighted value sum, and output projection.

Why transformers displaced recurrent models

RNNs hide context in a single evolving state, which makes training inherently sequential. Attention turns the whole sequence into a matrix problem, so modern hardware can evaluate many token-to-token interactions in parallel.

The cost is quadratic interaction at sequence length T, but the benefit is dramatically better parallel throughput and more direct long-range communication.

Section 2: Q, K, V Projections — Three GEMM Operations

Suppose the attention block receives an input matrix X of shape [T, C]. T is the number of token positions in the current sequence chunk. C is the channel count, also called the embedding dimension or model width. The first thing attention does is multiply that same input by three different learned weight matrices.

The formulas are simple. Q = X · W_Q, K = X · W_K, and V = X · W_V. Each weight matrix is usually [C, C], or equivalently [C, H×D_h] when we think in head terms. These are not three conceptual metaphors. They are literally three independent GEMM calls over the same source tensor. 3× GEMM Attention begins with three dense matrix multiplies. This is the exact same wx+b pattern from the matrix wx+b post, just lifted from vectors to full token matrices.

Q, K, V projections

python

# Q, K, V projections
Q = X @ W_Q  # [T, C] × [C, C] → [T, C]
K = X @ W_K  # [T, C] × [C, C] → [T, C]
V = X @ W_V  # [T, C] × [C, C] → [T, C]

Why three projections instead of one? Because the model wants three different views of the same hidden state. The query view emphasizes what a token needs from context. The key view emphasizes how a token should be matched. The value view emphasizes what information should actually flow if a match happens.

The weights W_Q, W_K, and W_V are trainable parameters. During learning, the optimizer changes these matrices so certain semantic, syntactic, positional, or structural signals become easy to detect through dot products. The input sequence does not come pre-labeled with “query-ness” or “value-ness.” The projections manufacture those roles from the same incoming representation. Q, K, and V are not three different inputs. They are three learned interpretations of one shared input tensor.

Operation	Input shape	Weight shape	Output shape
`Q = X @ W_Q`	`[T, C]`	`[C, C]`	`[T, C]`
`K = X @ W_K`	`[T, C]`	`[C, C]`	`[T, C]`
`V = X @ W_V`	`[T, C]`	`[C, C]`	`[T, C]`

Diagram showing input matrix X multiplied by weight matrices WQ, WK, and WV to produce query, key, and value matrices.

The first major compute phase

If the attention block feels abstract, anchor it here: before there are scores or masks or softmaxes, there are three matrix multiplies. Attention begins as linear algebra.

Section 3: Multi-Head Splitting

A single giant attention head could in principle do all the matching work. In practice, transformers split the channel dimension into H parallel heads. If C = H × D_h, then each head gets a smaller subspace of width D_h. The idea is that different heads can specialize in different relational patterns without fighting over one monolithic attention map.

After projection, each tensor is reshaped from [T, C] to [H, T, D_h]. Nothing is multiplied here. No new parameters appear. This is a memory reinterpretation step: the same bytes are viewed through a head axis so downstream kernels can process one head at a time. C = H×D_h Common decoder settings include H=32, D_h=128 for LLaMA-class models and H=12, D_h=64 for GPT-2-class models. The split trades one wide channel axis for many smaller attention subspaces.

Multi-head split

python

# Multi-head split (reshape, no compute)
# Q: [T, C] → [H, T, D_h]
Q = Q.reshape(T, H, D_h).transpose(0, 1)  # [H, T, D_h]
K = K.reshape(T, H, D_h).transpose(0, 1)  # [H, T, D_h]
V = V.reshape(T, H, D_h).transpose(0, 1)  # [H, T, D_h]

Conceptually, each head can latch onto different features. One head may lock onto nearby syntactic agreement. Another may track topic continuity across dozens of tokens. Another may become strongly position-sensitive once RoPE is applied. “Multi-head” is therefore a representation budget for multiple matching strategies to coexist.

Systems-wise, the layout matters just as much as the math. C-Kernel-Engine keeps tensors in head-major form, so one head’s token rows and feature lanes live contiguously in memory. That makes cache behavior far cleaner than a token-major layout that would interleave multiple heads in the hottest loops. Layout is not cosmetic. It is part of the performance model. head-major Keeping one head contiguous makes score and value kernels much kinder to cache. Token-major layouts tend to introduce unnecessary head-to-head conflict.

Head-major memory layout in C-Kernel-Engine

text

Memory: [head][token][head_dim]
Index: h * T * aligned_head_dim + t * aligned_head_dim + d

A T by C tensor reshaped into eight colored attention heads with a note that the operation is only a reshape.

Why the reshape matters

The multi-head split does not add new information. It creates independent lanes so different relational patterns can be learned and so kernels can work on compact [T, D_h] slices instead of one giant mixed feature block.

Section 4: RoPE Application (Position Encoding)

Queries and keys still do not know where tokens live in the sequence unless position is injected somehow. In modern decoder models, that injection usually comes from RoPE: rotary positional embedding. As discussed in the positional encoding post, RoPE rotates each even-odd dimension pair by a position-dependent angle. The key systems point here is when the rotation happens.

RoPE is applied after the Q/K projections and after the head split, but before the dot product. So the forward order is projection → reshape → RoPE → score computation. Queries and keys are rotated because they define similarity. Values are not rotated because they carry content to be mixed after the routing decision is made. RoPE modifies how tokens compare, not what value payload they deliver. Rotate Q and K, leave V alone.

Applying RoPE to Q and K

python

# Apply RoPE to Q and K (per head)
for h in range(H):
    for t in range(T):
        Q[h, t] = rope_rotate(Q[h, t], position=t, freqs=theta)
        K[h, t] = rope_rotate(K[h, t], position=t, freqs=theta)
# V is untouched by RoPE

The beauty of RoPE is that relative position becomes implicit inside the dot product. After rotation, the similarity between a query at position m and a key at position n depends on the angle difference between those positions. That means the model gets relative position awareness without storing a huge learned position table for every possible context length.

This detail is easy to miss when reading the formula quickly. The attention block does not add a positional vector and then forget about it. It geometrically twists Q and K so the matching operation itself becomes position-aware. That is why RoPE feels so natural inside self-attention specifically. Q/K only If you rotate V as well, you contaminate the value payload with position in a place where only content mixing should happen. The geometric trick belongs in similarity space.

From the positional encoding post to this post

RoPE was not the destination. It was preparation. This is the exact point in the transformer where that preparation pays off: the Q·K dot product now encodes content and relative position simultaneously.

Section 5: Scaled Dot-Product Attention — The Core

Once Q and K are ready, each head computes a score matrix. For head h, the operation is scores[h] = Q[h] @ K[h]^T / √D_h. If Q[h] has shape [T, D_h] and K[h] has shape [T, D_h], then K[h]^T has shape [D_h, T], so the result is [T, T]. Every entry scores[i,j] asks one question: how much should query token i attend to key token j?

The √D_h denominator is not a cosmetic scaling. It is what keeps the dot-product variance under control. If the entries of Q and K are roughly mean-zero with variance 1, then an unscaled dot product over D_h dimensions has variance that grows with D_h. Large variance makes softmax too peaky too early, which in turn kills useful gradients. √D_h With D_h=64, dividing by √64 = 8 keeps score magnitudes in a regime where softmax still has room to express uncertainty and produce non-trivial gradients.

Decoder-only models add one more ingredient: a causal mask. Token i must not see tokens j > i because those are future positions. The standard trick is to write -∞ into the forbidden score entries before softmax. Exponentiating -∞ gives zero, so masked positions vanish from the probability distribution automatically.

This is what makes decoder attention causal rather than fully bidirectional. Remove that upper-triangular mask and you recover full self-attention, which is common in encoder architectures like BERT. Keep it in place and the model can only aggregate information from the left context and the current token. Same math, different visibility rule. mask = decoder The causal mask is the difference between “look anywhere in the sentence” and “predict the next token without cheating.”

Scaled dot-product attention with causal masking

python

# Scaled dot-product with causal mask
for h in range(H):
    scores = Q[h] @ K[h].T  # [T, T]
    scores = scores / sqrt(D_h)

    # Causal mask: -inf for future positions
    for i in range(T):
        for j in range(i + 1, T):
            scores[i, j] = -float('inf')

    # Softmax per row (each query attends to all valid keys)
    weights[h] = softmax(scores, axis=-1)  # [T, T]

Two heatmaps: the left panel shows masked attention scores with the upper triangle set to negative infinity, and the right panel shows softmax probabilities per row.

The score matrix is the interaction map

This [T, T] matrix is the core object of full attention. It explicitly represents all query-key interactions for a head, which is powerful for modeling and expensive for memory. Flash Attention will later show how to keep the math while avoiding materializing the whole matrix.

Section 6: Softmax — Converting Scores to Weights

The raw score matrix is not usable yet. Scores can be negative, positive, unbounded, and incomparable across rows. Softmax converts each row into a probability distribution. For query row i, the output weights[i,:] sums to 1 over all visible keys.

This is exactly the operation from the softmax post, just applied to attention rows instead of vocabulary logits. Each row is an independent normalization problem. Row i asks: among the keys I am allowed to see, how should I distribute one unit of attention mass? The answer is a soft selection, not a hard argmax. row sums = 1 Every query row becomes a normalized routing plan. The weights are probabilities over keys, not over vocabulary classes.

Numerical stability matters here. Exponentials explode if we feed them large positive inputs. So practical softmax implementations subtract the row maximum first. Because softmax is translation-invariant, subtracting the same constant from the whole row changes nothing mathematically but prevents overflow computationally.

Three-phase stable softmax used inside attention kernels

// Phase 1: find max (numerical stability)
float max_val = row[0];
for (int j = 1; j <= i; j++)
    if (row[j] > max_val) max_val = row[j];

// Phase 2: exp and sum
float sum = 0.0f;
for (int j = 0; j <= i; j++) {
    float e = expf(row[j] - max_val);  // Subtract max!
    row[j] = e;
    sum += e;
}

// Phase 3: normalize
float inv_sum = 1.0f / sum;
for (int j = 0; j <= i; j++)
    row[j] *= inv_sum;

Stable softmax is one of those routines that looks almost trivial until you implement it for real hardware. You need to respect masking, avoid overflow, and keep precision errors from accumulating across long rows. That is why production attention kernels often spend more engineering effort on “simple” normalization than outsiders expect. Softmax is the bridge between similarity and routing. It turns a raw score landscape into a legal distribution the value mixer can trust.

Attention weights are not optional polish

Without softmax, the score matrix is just an unbounded similarity table. Softmax gives each row a probabilistic meaning, stabilizes the dynamic range, and makes the later weighted sum interpretable.

Section 7: Weighted Value Sum — The Final GEMM

Once the weights are ready, each head performs one more matrix multiply. output[h] = weights[h] @ V[h]. The shapes are [T, T] × [T, D_h] → [T, D_h]. So for every query token, we compute a weighted combination of all visible value vectors in that head.

This is the actual content transport step. Scores and softmax weights only decide who matters. The value matrix is what gets transported and blended. For token i, the output vector is Σ_j weights[i,j] × V[j]. The result is therefore a convex combination of value vectors along the row. context mix After softmax, each token no longer carries only its own value vector. It carries a learned mixture of the entire visible context.

Weighted value sum per head

python

# Weighted value sum
for h in range(H):
    output[h] = weights[h] @ V[h]  # [T, T] × [T, D_h] → [T, D_h]

Notice the architecture pattern now. Attention is not one matrix multiply. It is a pipeline of matrix multiplies with a nonlinearity-like normalization in the middle. That is why optimized GEMM kernels still sit at the heart of high-performance attention implementations.

Section 8: Concatenation and Output Projection

Each head now holds an output tensor of shape [T, D_h]. To return to the model width C, the heads are concatenated back together. That means reshaping [H, T, D_h] to [T, C]. After that, the block applies one more learned projection W_O.

This final projection matters because it lets the heads communicate. Without W_O, the head outputs would merely sit side by side. With it, the model can blend and remix information discovered by different heads into a single residual-ready output state. In other words, multi-head attention ends with one more dense layer. 4th GEMM The output projection W_O is the fourth learned matrix multiply in the forward pass, after W_Q, W_K, and W_V.

Concatenate heads and apply output projection

python

# Concatenate heads and project
concat = output.transpose(0, 1).reshape(T, C)  # [H, T, D_h] → [T, C]
result = concat @ W_O  # [T, C] × [C, C] → [T, C]  (fourth GEMM)

Summarizing the matrix pipeline helps. There are four learned GEMMs in the block. There are also two data-dependent matrix multiplies per head: the score computation Q @ K^T and the context computation weights @ V. Together they explain why attention is, at heart, a GEMM pipeline wrapped around a softmax.

#	Operation	Role
1	`Q = X @ W_Q`	Learned query projection.
2	`K = X @ W_K`	Learned key projection.
3	`V = X @ W_V`	Learned value projection.
4	`scores = Q @ K^T`	Pairwise query-key similarities per head.
5	`context = weights @ V`	Weighted value mixing per head.
6	`result = concat @ W_O`	Final head recombination.

Complete forward-flow diagram from X through QKV projections, head split, RoPE, score computation, mask, softmax, weighted value sum, concatenation, and output projection.

Attention is a pipeline, not a single formula

The famous equation fits on one line, but the real forward pass is a staged program. That staged view is what makes kernel design, memory layout, and backward differentiation understandable.

Section 9: Grouped Query Attention (GQA)

Standard multi-head attention gives every query head its own key head and value head. That means if you have 32 Q heads, you also have 32 K heads and 32 V heads. Grouped Query Attention relaxes that symmetry. You can keep many Q heads while sharing fewer K/V heads across them.

In a common GQA setup, 32 query heads map to only 8 key heads and 8 value heads. Each K/V head is shared by 4 query heads. The quality drop is usually small, while KV-cache size and bandwidth drop dramatically. That makes GQA a sweet spot for modern decoder inference. 4× smaller KV Relative to full MHA, GQA reduces KV-cache storage and memory traffic in proportion to the number of shared heads. That matters enormously during long-context decode.

Mapping a query head to its shared KV head

python

# GQA: map query head to KV head
kv_head = h * num_kv_heads // num_heads

Type	Q Heads	K,V Heads	KV Cache	Models
MHA	32	32	100%	GPT-3, LLaMA 1
GQA	32	8	25%	LLaMA 2/3
MQA	32	1	3%	Falcon, PaLM

The arithmetic of attention barely changes. What changes is who shares which K and V tensors. Multiple Q heads now point to the same K/V head index. That sharing shows up again in the backward pass because multiple gradient paths accumulate into the same dK and dV buffers. GQA keeps the expressive search space of many queries while shrinking the storage bill of keys and values.

Diagram showing eight query heads connecting to two shared key-value heads, illustrating a four-to-one grouped query attention mapping.

Why GQA became popular

Training still wants expressive attention. Inference wants smaller KV caches and lower memory bandwidth. GQA is the compromise that keeps many query perspectives without paying full multi-head storage costs.

Section 10: Backward Pass — Step by Step Chain Rule

Now for the part that usually gets hand-waved away. Suppose the layer above attention has already computed a gradient dL/d(result) for the output of this block. Our job is to push that gradient backward through every operation we just described and produce five things: dW_Q, dW_K, dW_V, dW_O, and dX. This is pure chain rule. Nothing new is invented in backward. We just visit the forward graph in reverse and apply each local Jacobian carefully.

The cleanest way to stay sane is to mirror the forward order exactly. Start at the output projection. Undo the concatenation. Undo the weighted sum. Undo softmax. Undo the score matrix. Undo RoPE. Undo the Q/K/V projections. Sum any branches that reconverge. Backward is the forward graph read right to left. Every box becomes a local rule, and every fork becomes an accumulation.

Goal of the backward pass

We need parameter gradients for the optimizer and an input gradient for the layer below. The optimizer updates W_Q, W_K, W_V, and W_O. The residual stack needs dX so the gradient can continue through RMSNorm, embeddings, and earlier layers.

Step 1: Backprop through the output projection

The last forward operation was result = concat @ W_O. This is standard matrix-multiply backward, exactly the same rule from the matrix wx+b post. If the upstream gradient is dresult = dL/d(result), then the gradient with respect to the concatenated head tensor is the upstream gradient multiplied by the transpose of the weight matrix. The gradient with respect to the weight matrix is the input transpose times the upstream gradient.

Step 1 formulas

text

Forward:   result = concat @ W_O
Backward:
  dconcat = dresult @ W_O^T
  dW_O    = concat^T @ dresult

Notice the symmetry. The data gradient uses the weight transpose. The weight gradient uses the input transpose. If you are comfortable with ordinary linear layers, this first step should feel completely familiar. same as GEMM The output projection is just another dense layer, so its backward rule is identical to the matrix calculus from the matrix wx+b post.

Step 2: Backprop through concatenation

Forward concatenation took [H, T, D_h], transposed or reshaped it, and produced [T, C]. Backward concatenation does the opposite. There is no multiplication here. We simply reshape dconcat back into [H, T, D_h] and hand each head its slice.

Step 2 formulas

text

Forward:   concat = reshape([output[0], ..., output[H-1]])
Backward:  dout_heads = reshape(dconcat)  # split back into H heads

This is one of the friendliest backward steps in the whole block. Reshape and transpose operations do not create new arithmetic. They only reassign indexing. Their gradients therefore just undo the indexing transformation.

Step 3: Backprop through the weighted value sum

For each head, the forward rule was output[h] = weights[h] @ V[h]. This produces two gradient paths because the output depends on both weights and V. The gradient with respect to the weights asks how sensitive the output is to changing a routing coefficient. The gradient with respect to V asks how sensitive the output is to changing the value payload itself.

Step 3 formulas

text

Forward:   output[h] = weights[h] @ V[h]
Backward:
  dweights[h] = dout[h] @ V[h]^T
  dV[h]      += weights[h]^T @ dout[h]

The += on dV matters. In ordinary multi-head attention, one head contributes to one V[h]. In GQA, several query heads can share the same value head. That means multiple gradient paths must accumulate into the same value-gradient buffer. use += Whenever multiple query heads share one KV head, dK and dV must accumulate contributions from every head that pointed to that shared cache entry.

C-Kernel-Engine: weighted-sum backward inside attention

// Step 1 from attention_backward_causal_head_major_gqa:
// d_weights[h, i, j] = d_output[h, i, :] @ v[kv_h, j, :]^T
float dot = 0.0f;
for (int dd = 0; dd < hd; dd++)
    dot += d_output[d_out_base + dd] * v[v_base + dd];
d_scores[w_idx] = dot;

// d_v[kv_h, j, :] += weights[h, i, j] * d_output[h, i, :]
for (int dd = 0; dd < hd; dd++)
    d_v[v_base + dd] += w * d_output[d_out_base + dd];

Read that loop literally. The first part forms a dot product between the downstream gradient and the value vector to get dweights. The second part scatters weighted downstream gradients back into the value buffer. The code is not mysterious because the calculus is not mysterious.

Step 4: Backprop through softmax

Softmax is the most conceptually delicate local derivative in the whole attention block. Each output probability depends on every input score in the same row, so the Jacobian is dense at the row level. Fortunately the row-wise backward rule has a compact form.

Step 4 formulas

text

Forward:   weights = softmax(scores)
Backward: dscores[i,j] = weights[i,j] * (dweights[i,j] - Σ_k weights[i,k] * dweights[i,k])

This is the same pattern from the softmax post: p ⊙ (g − p^T g). First compute the dot product of the probability row with the incoming gradient row. Then subtract that scalar from each incoming component. Finally multiply elementwise by the probability row.

C-Kernel-Engine: softmax backward inside attention

// Step 2: Softmax backward
// dot_product = sum_j w[i,j] * d_w[i,j]
float dot_product = 0.0f;
for (int j = 0; j <= i; j++)
    dot_product += attn_weights[base + j] * d_scores[base + j];

// d_score[i,j] = w[i,j] * (d_w[i,j] - dot_product)
for (int j = 0; j <= i; j++)
    d_scores[base + j] = attn_weights[base + j] * (d_scores[base + j] - dot_product);

Notice how the mask quietly persists in backward. If a future position was masked in forward, its weight is zero. That means its downstream softmax gradient is also zero. No forbidden token can suddenly reappear during differentiation. Softmax backward is row-local but not element-local. Every score in a row influences every other score in that row through the normalization constraint.

Step 5: Backprop through the scaled dot product

Next we undo scores = Q @ K^T / √D_h. Again there are two parameter-free branches because scores depend on both Q and K. The gradient with respect to Q is a matrix multiply against K. The gradient with respect to K is a matrix multiply against Q, but with the score gradient transposed.

Step 5 formulas

text

Forward:   scores[h] = Q[h] @ K[h]^T / √D_h
Backward:
  dQ[h] = dscores[h]   @ K[h] * (1/√D_h)
  dK[h] = dscores[h]^T @ Q[h] * (1/√D_h)

The scale factor does not disappear. If forward multiplied by 1/√D_h, backward must multiply by that same scalar because it is part of the local derivative. This is a common place to make mistakes when re-deriving attention by memory instead of by chain rule.

C-Kernel-Engine: score backward into dQ and dK

// Step 3: dQ and dK through scaled dot product
for (int j = 0; j <= i; j++) {
    float ds = d_scores[score_index(h, i, j, aw)] * scale;  // scale = 1/sqrt(D_h)
    for (int dd = 0; dd < hd; dd++) {
        d_q[d_q_base + dd] += ds * k[k_base + dd];
        d_k[d_k_base + dd] += ds * q[q_base + dd];
    }
}

Geometrically, this step says: if a certain score should increase, move the query toward the key and the key toward the query in proportion to the gradient. If a score should decrease, move them apart. Backward through a dot product therefore preserves the same relational interpretation as forward. scale backward too Whatever stabilizes forward statistics must also appear in backward. Forgetting the 1/√D_h factor changes gradient magnitudes and breaks parity.

Step 6: Backprop through RoPE

RoPE is a rotation. Rotations are orthogonal linear transforms. For orthogonal matrices, the inverse is the transpose. So backward through RoPE is simply the inverse rotation: use the same cosine values, flip the sine sign, and rotate gradients back into the pre-RoPE query/key space.

Step 6 formulas

text

Forward:   Q_rot = RoPE(Q_pre, θ)
Backward: dQ_pre = RoPE_inverse(dQ_rot, θ)
          dK_pre = RoPE_inverse(dK_rot, θ)

RoPE inverse rotation in C form

// RoPE backward: inverse rotation
d_x[2i]   =  d_out[2i] * cos(θ) + d_out[2i + 1] * sin(θ)
d_x[2i+1] = -d_out[2i] * sin(θ) + d_out[2i + 1] * cos(θ)

There are no RoPE parameters to update. RoPE is deterministic geometry. So this step only transforms gradients in coordinate space. The same cached cosines and sines from forward can be reused in backward. R⁻¹ = Rᵀ Because RoPE is an orthogonal rotation, its backward pass is cheap: apply the transpose rotation. Same angles, opposite sine direction.

Step 7: Backprop through the Q, K, and V projections

After RoPE inverse and head un-splitting, we are back in full-width [T, C] space for Q and K. V never went through RoPE, but it may still need to be reshaped back from head-major layout. Now we differentiate the original linear projections. This is again ordinary matrix-multiply backward, applied three times.

Step 7 formulas

text

Forward:   Q = X @ W_Q
Backward:
  dX_from_Q = dQ_full @ W_Q^T
  dW_Q      = X^T @ dQ_full

Forward:   K = X @ W_K
Backward:
  dX_from_K = dK_full @ W_K^T
  dW_K      = X^T @ dK_full

Forward:   V = X @ W_V
Backward:
  dX_from_V = dV_full @ W_V^T
  dW_V      = X^T @ dV_full

Final input gradient:
  dX = dX_from_Q + dX_from_K + dX_from_V

That final sum into dX is essential. The original input X branched into three projections during forward. So in backward, three independent gradient paths flow back into the same source tensor. The correct total input gradient is their sum.

This is one of the easiest places to reason incorrectly if you narrate the graph casually. Q, K, and V do not compete for gradient ownership. They all depend on the same X, so they all contribute to dX. Branch in forward means add in backward. sum branches Because the input fans out into three projections, the backward pass must fan in by summing dX_from_Q, dX_from_K, and dX_from_V.

Step 8: GQA gradient accumulation

In grouped query attention, several Q heads can share the same K head and V head. So the backward pass must accumulate key and value gradients across all of those query heads. The shared K/V tensors act like reused parameters. Every consumer contributes a partial derivative.

GQA accumulation rule

text

for each query head h:
    kv_head = h * num_kv_heads // num_heads
    dK[kv_head] += contribution_from_head_h
    dV[kv_head] += contribution_from_head_h

This is why C-Kernel-Engine’s backward kernel uses accumulation operators in the K/V paths. Sharing in forward implies summation in backward. Once all heads have been processed, the shared dK and dV buffers contain the total gradient for those reused tensors.

Step	Forward	Backward	Shape
1	`result = concat @ W_O`	`dconcat = dresult @ W_O^T`, `dW_O = concat^T @ dresult`	`[T, C]`
2	`concat = reshape(heads)`	Split `dconcat` back into `H` heads.	`[H, T, D_h]`
3	`out[h] = weights[h] @ V[h]`	`dweights = dout @ V^T`, `dV += weights^T @ dout`	`[T, T]`, `[T, D_h]`
4	`weights = softmax(scores)`	`dscores = w ⊙ (dw − w^T dw)`	`[T, T]`
5	`scores = QK^T / √d`	`dQ = dscores @ K / √d`, `dK += dscores^T @ Q / √d`	`[T, D_h]`
6	`Q_rot = RoPE(Q)`	`dQ_pre = RoPE^-1(dQ_rot)`	`[T, D_h]`
7	`Q = X @ W_Q`	`dX += dQ @ W_Q^T`, `dW_Q = X^T @ dQ`	`[T, C]`

Diagram showing gradients flowing right to left through the attention block from output projection back to dX, including softmax backward, dQ, dK, and dV branches.

Backward pass mental model

If the forward pass feels like “project, compare, normalize, mix, project,” then the backward pass is “unproject, unsplit, unmix, unnormalize, uncompare, unrotate, and accumulate.” The formulas look long only because attention contains many small stages.

Section 11: Numerical Precision and Stability

Attention is numerically delicate. The score matrix can become large. Softmax exponentials can overflow. Reduced-precision storage can drift from reference implementations. And long sequences can turn tiny rounding differences into visible output changes.

Three practical stability rules show up again and again in production code. First, subtract the row maximum before exponentiating. Second, if you store K/V in reduced precision such as FP16 or BF16, perform the critical accumulation work in FP32. Third, when long contexts make [T, T] materialization too expensive, use online softmax so exact normalization happens tile by tile. FP32 compute Even when activations are stored in BF16 or FP16, attention kernels often promote to FP32 for the actual dot products, exponentials, and sums. The accumulator is where precision buys stability.

The user-visible examples from llama.cpp parity are particularly instructive. K/V values may be rounded to FP16 precision before dot products to match reference behavior. BF16 attention paths still often use exact expf rather than a fast approximation because small errors repeated across many rows and layers can compound noticeably. In other words, “close enough” is sometimes not close enough.

Online softmax used by Flash Attention

// Online softmax with running max (Flash Attention)
float m = -INFINITY;  // running max
float s = 0.0f;       // running sum

for each KV tile:
    tile_max = max(scores in tile)
    if (tile_max > m):
        rescale = exp(m - tile_max)
        output *= rescale      // rescale previous accumulation
        s *= rescale
        m = tile_max
    s += sum(exp(scores - m))
    output += softmax(scores) @ V_tile
output /= s  // final normalization

Flash Attention is often described as an approximation, but that framing is misleading. The exact same softmax result can be recovered without storing the whole score matrix, provided the running max and running sum are updated correctly while streaming tiles. The savings are about memory traffic and peak storage, not about changing the mathematics. Flash Attention is exact attention with smarter scheduling. The trick is algebra plus tiling, not a relaxed objective.

That memory story is dramatic at large sequence lengths. A dense [32768, 32768] score matrix in FP32 wants gigabytes of storage. A tiled online algorithm can reduce the live footprint to a tiny streaming working set. Same semantics, radically better systems behavior.

Tiling diagram for Flash Attention showing one Q tile staying active while K and V tiles stream past, with running max and running sum annotations.

Precision is part of the algorithm

On paper, attention is just matrix multiply plus softmax. In production, precision mode, accumulator width, max subtraction, and tile scheduling are all part of whether that paper algorithm survives real hardware without diverging.

Section 12: C-Kernel-Engine Attention Implementation

The C-Kernel-Engine project is useful because it makes the attention pipeline explicit at the kernel level. Instead of hiding everything in a monolithic framework call, it names the layout assumptions, the causal mask rules, the GQA mapping, and the forward/backward decomposition directly. That makes the implementation readable as systems engineering rather than just library invocation.

Kernel	SIMD	Score Matrix	Use Case
`attention_forward_causal_head_major_gqa`	AVX-512	Yes (O(N²))	Training, short context
`attention_forward_causal_head_major_gqa_flash`	AVX-512	No (O(N))	Inference, long context
`attention_forward_decode_head_major_gqa_flash`	AVX-512	No	Single-token decode
`attention_forward_causal_head_major_gqa_bf16`	AVX-512	Yes	BF16 training

Backward kernel signature

void attention_backward_causal_head_major_gqa(
    const float *d_output,      // [H, T, D_h]
    const float *q, *k, *v,     // Forward activations (cached)
    const float *attn_weights,  // Cached softmax output
    float *d_q, *d_k, *d_v,     // Output gradients
    float *d_scores,            // Scratch buffer
    ...)
{
    // Step 1: d_scores = d_output @ V^T, d_V += weights^T @ d_output
    // Step 2: d_scores = softmax_backward(d_scores, weights)
    // Step 3: d_Q = d_scores @ K * scale, d_K += d_scores^T @ Q * scale
}

Several design rules repeat across the engine. There is no malloc or free inside hot kernels. There is no OpenMP inside the kernels themselves because orchestration parallelism is handled elsewhere. Tensors stay in head-major layout. GQA is built in via kv_head = h * num_kv_heads / num_heads. Strict parity mode can even force volatile operations for reproducibility. no malloc High-performance kernels avoid dynamic allocation in the hot loop. Scratch space is managed ahead of time so the math path stays predictable and vector-friendly.

The engine also surfaces low-level implementation choices that are easy to forget when reading only equations. Causal masking is usually implemented by never touching forbidden upper-triangle entries or by explicitly zeroing them after exponentiation. aligned_head_dim pads feature widths for SIMD-friendly lanes. Decode paths may keep the KV cache in FP16 to reduce bandwidth pressure.

The SIMD story is similarly concrete. A 6×2 microkernel processes six output rows and sixteen floats per AVX-512 iteration. Fused multiply-add instructions like _mm512_fmadd_ps do the heavy arithmetic. Prefetching warms L1 before the next chunk of data arrives. This is the hardware face of the abstract Q @ K^T and weights @ V equations. Equations describe what attention means. Kernels decide whether that meaning arrives in microseconds or milliseconds.

Implementation rules that scale

Head-major layout, preallocated scratch buffers, explicit GQA indexing, and carefully chosen precision modes are not incidental engineering preferences. They are what let the mathematically simple attention pipeline run efficiently on real CPUs.

Section 13: Summary — The Full Pipeline

We can now name the entire forward pass cleanly. Tokenization from the tokenization post turns text into token IDs. Embedding lookup turns IDs into vectors. RMSNorm from the normalization post stabilizes the incoming state. Then attention performs the full pipeline we just unpacked.

Tokenization → token IDs.
Embedding lookup → vectors [T, C].
RMSNorm → normalized input.
Q, K, V projections → three GEMM ops from the matrix wx+b post.
Multi-head split → reshape to [H, T, D_h].
RoPE on Q and K → position-aware matching from the positional encoding post.
Scaled dot-product Q·K^T/√d → score matrix.
Causal mask → no future visibility.
Softmax → attention weights from the softmax post.
Weighted value sum weights·V → contextual head outputs.
Concatenate heads + output projection → return to [T, C].
Residual add → x + attention(norm(x)).

Backward simply walks that list in reverse. dL/d(output proj) → dL/d(concat) → dL/d(heads) → dL/d(weights and V) → dL/d(scores) → dL/d(RoPE) → dL/d(Q,K,V proj) → dL/dX. Once you see that reverse itinerary, the attention block stops feeling like a black box and starts feeling like a manageable computational graph. That is the real payoff of going step by step. 2 sublayers Every transformer layer repeats this structure twice: one routed mixing block for attention and one dense expansion block for the FFN. Stack that pattern N times and you get the modern transformer.

Full transformer block diagram showing RMSNorm, multi-head attention, residual add, RMSNorm, FFN, residual add, and output, with references to earlier posts.

Where this leaves us

Attention is the transformer’s routing engine. It projects tokens into search, index, and payload spaces; computes relevance; normalizes that relevance; and uses it to move information across the sequence.

Once you understand that forward and backward pipeline, the rest of the transformer stack becomes much easier to reason about. The remaining blocks are no longer mysterious islands. They are neighboring pieces in the same graph.

Attention: The Core Of The Transformer

Roadmap for this post

Section 1: What Attention Does

Why transformers displaced recurrent models

Section 2: Q, K, V Projections — Three GEMM Operations

The first major compute phase

Section 3: Multi-Head Splitting

Why the reshape matters

Section 4: RoPE Application (Position Encoding)

From the positional encoding post to this post

Section 5: Scaled Dot-Product Attention — The Core

The score matrix is the interaction map

Section 6: Softmax — Converting Scores to Weights

Attention weights are not optional polish

Section 7: Weighted Value Sum — The Final GEMM

Section 8: Concatenation and Output Projection

Attention is a pipeline, not a single formula

Section 9: Grouped Query Attention (GQA)

Why GQA became popular

Section 10: Backward Pass — Step by Step Chain Rule

Goal of the backward pass

Step 1: Backprop through the output projection

Step 2: Backprop through concatenation

Step 3: Backprop through the weighted value sum

Step 4: Backprop through softmax

Step 5: Backprop through the scaled dot product

Step 6: Backprop through RoPE

Step 7: Backprop through the Q, K, and V projections

Step 8: GQA gradient accumulation

Backward pass mental model

Section 11: Numerical Precision and Stability

Precision is part of the algorithm

Section 12: C-Kernel-Engine Attention Implementation

Implementation rules that scale

Section 13: Summary — The Full Pipeline

Where this leaves us

Subscribe

Subscribe to emails from Anthony

Need an intelligent system to work on real hardware?

Embedded systems · Robotics · Constrained AI · CPU and HPC · Accelerators · Distributed systems

ShivasNotes

Read

Support