Companion post to the matrix Wx+b carousel. Previously: scalar wx+b.
Post 18 was the tiny version of the story. One input. One weight. One bias. One prediction. A clean scalar line: y = wx + b.
This sequel is the scale-up. The algebra barely changes, but the objects do. Numbers become vectors. Vectors become matrices. A single prediction becomes a full batch of token embeddings moving through a transformer block.
That jump can feel dramatic when the notation arrives all at once. Wq, Wk, Wv, W1, W2, X, QKᵀ, and a forest of shapes. But the mental model I keep coming back to is still the one from the scalar post: a weighted sum, plus a bias, repeated many times.
If scalar wx+b was the atom, matrix Wx+b is the crystal lattice. The same rule repeats across rows, columns, heads, batches, and layers. "Transformers look huge only until you realize the machinery is mostly the same affine map over and over again." The mental bridge from post 18 to this post
The thesis of this post
Transformers are not magical tensor sculptures. They are mostly disciplined shape management around matrix multiplication, bias addition, and a few nonlinearities. If you can track wx+b, you can track a transformer layer.
Section 1: From Scalar to Matrix
In post 18, the model was scalar: y = wx + b. Here, w was one number, x was one number, and b was one number. The forward pass was a single multiply and a single add.
The first real jump in deep learning is not from linear models to transformers. It is from one weighted sum to many weighted sums at once. Instead of asking one weight to scale one input, we ask a whole matrix of weights to mix a whole vector of inputs.
The equation becomes: y = Wx + b. Now W is an [M×N] matrix, x is an [N×1] vector, b is an [M×1] vector, and the output y is an [M×1] vector.
The easiest way to read this is row by row. Each row of W is one learned template. That row takes a dot product with the input vector x. Then the matching bias term shifts the result. So one matrix multiply is really M scalar weighted sums happening in parallel.

That picture is the whole jump. The scalar case makes one output. The matrix case makes many outputs. But conceptually, every row still says the same thing: multiply corresponding entries, add them up, add a bias.
If I write out a concrete example, the notation becomes less intimidating. Suppose W has shape [3×4], x has shape [4×1], and b has shape [3×1]. Then each of the three rows in W produces one output entry.
The row-wise view is what keeps the matrix case human-sized for me. I do not try to memorize the whole matrix at once. I just remember that row 1 makes output 1, row 2 makes output 2, and so on. M outputs An [M×N] weight matrix means M learned weighted sums, each reading the same N-dimensional input.
One row = one neuron
In a dense layer, every row of the weight matrix is a learned detector. Feed in the same input vector, get a different weighted sum from each row, then stack those results into the output vector.
import numpy as np
# scalar case
w = 2.0
x = 3.0
b = 1.0
y_scalar = w * x + b
# matrix case
W = np.array([
[ 2.0, -1.0, 0.0, 3.0],
[ 1.0, 2.0, 1.0, 0.0],
[-2.0, 1.0, 4.0, 1.0],
])
x_vec = np.array([[1.0], [2.0], [-1.0], [3.0]])
b_vec = np.array([[1.0], [0.0], [2.0]])
y_vec = W @ x_vec + b_vec
print(y_scalar) # 7.0
print(y_vec.ravel()) # [10. 4. 1.]Notice what did not change. There is still a learned multiplier, still an input, still an offset, still an output. The only change is that the multiplier is now a matrix, so the model can produce several outputs from several inputs in one shot.
This is why the scalar post mattered. The scalar formula was not a toy that we now throw away. It was the blueprint. Matrix notation is just the compact way of writing many scalar weighted sums together.
Once that clicks, every dense layer becomes less mysterious. A layer is not a black box. It is a table of weights asking, row by row, how each output should mix the input coordinates.
Section 2: The Dimension Rule
Matrix multiplication has one law that never negotiates: [M×N] × [N×K] = [M×K]. The inner dimensions must match. If they do not, the multiply is undefined.
This rule is so important that I treat shapes as part of the equation, not as decoration next to it. In other words, I do not just read AB. I read [M×N] × [N×K].

This is the invariant behind the whole post. Whether I write Wx+b, XWᵀ+b, QKᵀ, or an MLP projection, the operation is only legal when the shared dimension lines up. The outside dimensions become the output shape. inner dims match [N×M] · [M×K] → [N×K]. The repeated M is consumed by dot products; the remaining N and K define the result.
Why do the inner dimensions matter? Because every entry in the output is built from a dot product. A dot product only works when the two vectors being compared have the same length. That shared length is the inner dimension.
If I store a weight matrix in the opposite orientation, I fix the problem with a transpose. For example, if a weight is stored as [N×M], then Wᵀ becomes [M×N]. Now the multiply can proceed.
This is not a rare trick. It is normal engineering practice. In transformer code, shapes are constantly rearranged to make the inner dimensions line up: transpose for attention scores, reshape for heads, and sometimes store the weight already transposed for cache-friendly memory access.
A lot of tensor confusion is really shape confusion. If I get lost, I stop reading the variable names and just write the dimensions on paper. The dimensions usually reveal the legal operation immediately. "Shapes are unit tests for your algebra." A rule that saves debugging time
| Symbol | Name in practice | Typical meaning | Example |
|---|---|---|---|
T | Tokens / sequence length | How many positions are in the current prompt or batch slice. | T = 4 tokens |
D | Embedding dimension | The width of each token representation. | D = 8 in a toy example |
H | Heads | How many attention subspaces the model splits into. | H = 2 |
d_k | Head dimension | The width of one attention head, usually D / H. | d_k = 4 when D = 8 and H = 2 |
4D | MLP hidden width | The expansion dimension in the feed-forward block. | 32 when D = 8 |
M, N, K | Generic GEMM axes | Rows, columns, and reduction dimension in matrix multiplication kernels. | [M×N] × [N×K] → [M×K] |
In deep learning, I usually keep two naming systems active at once. The generic system is M, N, K for linear algebra and kernels. The model system is T, D, H, and d_k for transformer intuition. They are describing the same arithmetic from different zoom levels.
import numpy as np
A = np.zeros((3, 4)) # [M×N]
B = np.zeros((4, 5)) # [N×K]
C = A @ B
print(C.shape) # (3, 5)
W = np.zeros((4, 3)) # stored the other way around
x = np.zeros((4, 1))
# W @ x would fail because (4,3) @ (4,1) is illegal
print((W.T @ x).shape) # (3, 1)The transpose is not a hack. It is the geometric way of turning columns into rows, or rows into columns, so the correct dot products can happen. Once you start looking for it, you see it everywhere in attention code.
Section 3: Batched Forward Pass
Real models do not process one vector at a time if they can help it. They batch. In transformers, the natural batch inside one sequence is the token matrix. Instead of one input vector x, we now have X with shape [T×D].
Read that as: T tokens, each token carrying D features. Every row is one token embedding. Every column is one feature coordinate across tokens.
A common implementation stores the layer weight as [N×D]. That layout is convenient for row-wise memory access, especially in low-level kernels. So the forward pass is often written as Y = XWᵀ + b.
Here the shapes are: X is [T×D], W is stored as [N×D], Wᵀ becomes [D×N], b is [N], and the output Y is [T×N].
The arithmetic is still the same as the single-vector case. Each output token row in Y is the corresponding input row in X multiplied by the weight matrix, then shifted by the bias. We are just doing it for all tokens in one matrix multiply.
This is the version of wx+b that shows up most often in actual model code: a batch matrix on the left, a transposed weight on the right, and a bias broadcast across rows. T×N outputs For every token row, the layer emits N output features. Stack all token rows and you get Y [T×N].
Let us make it concrete. Say we have T = 4 tokens, D = 8 input dimensions, and a layer that expands to N = 32 output features. Then the forward pass is: [4×8] @ [8×32] + [32] = [4×32].
That means four token rows go in, and four token rows come out, but each output token is now 32 numbers wide. If the next layer expects width 32, great. If it expects width 8, another projection will bring it back down.

The heatmap above shows the structure of one batched multiply. An input element from token 0 only contributes to outputs for token 0. An input element from token 1 only contributes to outputs for token 1. The mixing happens across feature channels, not across token rows, at this particular dense step.
This is a useful distinction. A standard linear layer mixes dimensions within each token representation. Attention is the place where tokens start explicitly mixing with other tokens.
import numpy as np
rng = np.random.default_rng(25)
T, D, N = 4, 8, 32
X = rng.normal(size=(T, D))
W = rng.normal(size=(N, D)) # stored transposed for kernel-friendly access
b = rng.normal(size=(N,))
Y = X @ W.T + b
print(X.shape) # (4, 8)
print(W.shape) # (32, 8)
print(Y.shape) # (4, 32) The bias add here is broadcast. NumPy, PyTorch, and most frameworks understand that a vector of shape [N] can be added to every row of Y [T×N]. That saves us from manually repeating the bias T times.
So the mental picture for a batched forward pass is simple: one token matrix comes in, the layer applies the same learned transform to every token row, and a new token matrix comes out.
Section 4: Inside a Transformer — It's ALL wx+b
Once the batched case feels normal, the transformer block stops looking alien. It is mostly a carefully arranged collection of linear maps. The names change, but the pattern stays familiar.
Start with the attention layer. The input token matrix is X [T×D]. From that same matrix, we create three new matrices: queries, keys, and values.
The formulas are straightforward: Q = X·Wq + bq, K = X·Wk + bk, and V = X·Wv + bv. In the common square case, each weight is [D×D] and each result is also [T×D].
These are literally three separate wx+b operations applied to the same input matrix. They learn three different views of the token embeddings. The query view asks, “what am I looking for?” The key view asks, “what do I contain?” The value view asks, “what content should be passed along if I am selected?”

Next comes the most famous transpose in the architecture: attention scores. After splitting into heads, we compare queries and keys with Q @ Kᵀ / √d_k. The shapes are [T×d_k] @ [d_k×T] = [T×T] inside one head.
That T×T score matrix answers a token-to-token question. For each token, how strongly should it attend to every other token? The transpose on K is what makes the inner dimension d_k line up.
After a softmax, the attention weights are used to mix the value vectors. If I count every raw matrix multiply in the attention path, there is also a separate softmax(scores) @ V step. In this post, I keep the simplified “7 GEMMs per layer” accounting from the outline, so I emphasize the major projections, score multiply, output projection, and MLP projections.

After the weighted value mix, the result is projected back through the output weight: attn_out · Wo + bo. Shape-wise, that is another familiar linear map: [T×D] @ [D×D] + [D] = [T×D].
Then the MLP block repeats the same design in a slightly different width. First we expand: h = X·W1 + b1, usually from D to 4D. Then we apply a nonlinearity like GELU. Then we project back down: out = h·W2 + b2.
This is why transformers stop feeling mystical once the shapes click. The block is not a giant new category of math. It is a sequence of familiar affine maps, plus a token-mixing score step, plus nonlinearities and normalization around them. "Attention is not replacing linear algebra. It is orchestrating more of it." The best high-level summary I know
| Operation | Formula | Shape story | What it does |
|---|---|---|---|
| Query projection | Q = X·Wq + bq | [T×D] @ [D×D] + [D] → [T×D] | Builds the “what am I looking for?” view. |
| Key projection | K = X·Wk + bk | [T×D] @ [D×D] + [D] → [T×D] | Builds the “what do I contain?” view. |
| Value projection | V = X·Wv + bv | [T×D] @ [D×D] + [D] → [T×D] | Builds the content that attention will mix. |
| Attention scores | Q @ Kᵀ / √d_k | [T×d_k] @ [d_k×T] → [T×T] | Creates token-to-token compatibility scores. This is the transpose-heavy matmul. |
| Output projection | attn_out · Wo + bo | [T×D] @ [D×D] + [D] → [T×D] | Maps the mixed attention output back into model space. |
| MLP FC1 | h = X·W1 + b1 | [T×D] @ [D×4D] + [4D] → [T×4D] | Expands representation width before the nonlinearity. |
| MLP FC2 | out = h·W2 + b2 | [T×4D] @ [4D×D] + [D] → [T×D] | Compresses the hidden activation back to model width. |
If I keep the toy dimensions from earlier, say T = 4, D = 8, H = 2, and d_k = 4, I can verify every single one of those shapes by hand. That is exactly what I recommend doing the first few times.
import numpy as np
rng = np.random.default_rng(25)
T, D, H = 4, 8, 2
d_k = D // H
X = rng.normal(size=(T, D))
Wq = rng.normal(size=(D, D))
Wk = rng.normal(size=(D, D))
Wv = rng.normal(size=(D, D))
Wo = rng.normal(size=(D, D))
Q = X @ Wq
K = X @ Wk
V = X @ Wv
Qh = Q.reshape(T, H, d_k).transpose(1, 0, 2) # [H, T, d_k]
Kh = K.reshape(T, H, d_k).transpose(1, 0, 2) # [H, T, d_k]
Vh = V.reshape(T, H, d_k).transpose(1, 0, 2) # [H, T, d_k]
scores = Qh @ Kh.transpose(0, 2, 1) # [H, T, T]
weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)
heads = weights @ Vh # [H, T, d_k]
attn_out = heads.transpose(1, 0, 2).reshape(T, D) @ Wo
print(scores.shape) # (2, 4, 4)
print(attn_out.shape) # (4, 8)The transformer block in one sentence
Project the tokens into several learned spaces, compute token-to-token scores, mix content accordingly, project back, then run a two-layer MLP. Nearly every heavy operation in that sentence is a matrix multiply.
Section 5: Matrix Gradients — Same Chain Rule, Bigger Shapes
The loss is still a scalar. That part never changes. Even in a giant model, training eventually asks one scalar question: how wrong was this step?
What changes is the shape of the derivatives flowing backward. When a parameter is a matrix, its gradient is also a matrix. When a parameter is a vector, its gradient is a vector. The chain rule survives intact; the containers get bigger.
For the affine map Y = XW + b, where X [T×D], W [D×N], and Y [T×N], the core backward formulas are beautifully symmetric.
The weight gradient is dL/dW = Xᵀ @ dL/dY. Shape check: [D×T] @ [T×N] = [D×N]. Same shape as W.
The input gradient is dL/dX = dL/dY @ Wᵀ. Shape check: [T×N] @ [N×D] = [T×D]. Same shape as X.
The bias gradient is dL/db = sum(dL/dY, axis=0). Shape check: [N]. Same shape as b.

This is the key insight I wish I understood earlier: matrix backpropagation is not a different subject from scalar backpropagation. It is the same chain rule with transposes inserted so the dimensions line up.
In post 18, the scalar weight gradient was basically “input times downstream error.” Here it is still that, except the input is a full matrix, the downstream error is a full matrix, and the transpose makes the reduction happen along the token axis. Shape match Every gradient tensor has the same shape as the parameter or activation it belongs to. That is one of the fastest correctness checks you have.
Backward pass intuition
Forward says, “how do these weights combine inputs into outputs?” Backward says, “given the output error, how much credit or blame should flow to the input, the weight matrix, and the bias?”
import numpy as np
rng = np.random.default_rng(25)
T, D, N = 4, 8, 32
X = rng.normal(size=(T, D))
W = rng.normal(size=(D, N))
b = rng.normal(size=(N,))
dY = rng.normal(size=(T, N))
dW = X.T @ dY
dX = dY @ W.T
db = dY.sum(axis=0)
print(W.shape, dW.shape) # (8, 32) (8, 32)
print(X.shape, dX.shape) # (4, 8) (4, 8)
print(b.shape, db.shape) # (32,) (32,) The transpose in Xᵀ @ dY is doing important bookkeeping. It turns the batch axis into the reduction axis so all token contributions accumulate into one weight gradient matrix.
This is also why framework autodiff feels magical but is not magic. The engine is systematically applying local Jacobian rules, shape alignment, and accumulation. Under the hood, the same matrix multiplications keep reappearing.
Section 6: What This Looks Like in C (The Kernel)
At the framework level, we say “linear layer,” “projection,” or “fully connected.” At the kernel level, we often say GEMM: general matrix multiply.
This is where my C-Kernel-Engine reference point becomes useful. A single dense projection is basically one GEMM plus a bias add. That is the real machine behind the high-level layer abstraction.
In the notation of this post, a forward kernel is just: Y = X @ Wᵀ + bias. Shapes: X [T×D], W [N×D] stored transposed, bias [N], and Y [T×N].
The backward path uses the same ingredients again: dL/dX = dL/dY @ W, dL/dW = dL/dYᵀ @ X, and dL/db = sum(dL/dY, axis=0). Different transpose pattern, same overall story.
// Forward: Y = X @ W^T + bias
// X: [T × D], W: [N × D], bias: [N], Y: [T × N]
gemm_blocked(X, W, bias, Y, T, N, D);
// Backward:
// dL/dX = dL/dY @ W
// dL/dW = dL/dY^T @ X
// dL/db = sum(dL/dY, axis=0)
gemm_nn_simd(dY, W, NULL, dX, T, D, N);
gemm_tn_parallel(dY, X, NULL, dW, N, D, T);
sum_bias_grad(dY, db, T, N); Why store W as [N×D] instead of [D×N]? Because low-level code cares about memory order. If each output neuron’s weights live contiguously in memory, the kernel gets more cache-friendly row access.
That storage choice explains why high-level math sometimes says XW while low-level code computes XWᵀ. The model is not changing. The memory layout is.

Real kernels also do not naively multiply one element at a time across giant matrices. They tile. The M, N, and K dimensions are chopped into smaller blocks that fit into cache. That is how the CPU reuses data before it gets evicted.
In practice, I think of three levels at once: the mathematical matrix shapes, the cache tiles, and the SIMD lanes. The math says “multiply rows by columns.” The kernel says “do it in blocks that fit L1 and L2, then process 8 or 16 floating-point multiplies per instruction when AVX or AVX512 is available.”
This is the side of the story that makes transformer math feel tangible. Every projection in the model eventually lands in code that looks like a blocked nested loop over contiguous memory. "Every gemm_blocked_serial() call is one wx+b made physical." How I connect the algebra to the CPU
void gemm_blocked_serial(const float *A, const float *B, const float *bias,
float *C, int M, int N, int K) {
for (int i = 0; i < M; ++i)
for (int j = 0; j < N; ++j)
C[i*N + j] = bias ? bias[j] : 0.0f;
for (int ii = 0; ii < M; ii += BLOCK)
for (int jj = 0; jj < N; jj += BLOCK)
for (int kk = 0; kk < K; kk += BLOCK)
for (int i = ii; i < min(ii + BLOCK, M); ++i)
for (int j = jj; j < min(jj + BLOCK, N); ++j)
for (int k = kk; k < min(kk + BLOCK, K); ++k)
C[i*N + j] += A[i*K + k] * B[j*K + k];
}Why this matters for transformers
Once a transformer is large, performance is often less about fancy equations and more about whether your GEMM kernels keep data resident, vectorize well, and avoid wasting memory bandwidth.
Section 7: Putting It All Together
With the simplified accounting used in this post, one transformer layer contains 7 major GEMM-style operations in the forward pass. In the backward pass, each one typically needs two more major GEMMs: one for dW and one for dX.
That gives a memorable count: 7 forward + 14 backward = 21 GEMM calls per layer per training step.
Then the scale multiplication starts. GPT-2 has 12 layers. So, with this simplified count, that is 12 × 21 = 252 GEMM calls in one training step.
GPT-3 has 96 layers. So the same accounting becomes 96 × 21 = 2016 GEMM calls. That is a lot of linear algebra.
GPT-4 is a special case in any public chart because its exact layer count is not publicly disclosed. So the plot below marks it as an intentionally unresolved placeholder rather than pretending we know the number.

I like this section because it compresses the whole post into one operational sentence: a large language model is an enormous stack of repeated matrix multiplications. The glamour words change, but the compute bill keeps coming back to the same kernels.
So yes, it really is all wx+b in spirit. Sometimes it is scalar, sometimes vectorized, sometimes batched, sometimes split across heads, and sometimes fused into a heavily optimized GEMM kernel. But it is still the same habit of weighted mixing. "Scale does not replace the primitive. It repeats it until hardware becomes the real bottleneck." Why transformer intuition and systems intuition eventually meet
gemms_per_layer = 21 # simplified count for this post
for name, layers in [("GPT-2", 12), ("GPT-3", 96)]:
print(name, layers * gemms_per_layer)
# GPT-2 -> 252
# GPT-3 -> 2016Next systems-level question
Once the algebra is comfortable, the next useful question is not “what is Wx+b?” It is “how does the CPU actually execute it?” That is where cache lines, SIMD lanes, blocking, and memory bandwidth take over the story.
Section 8: Summary
Here is the compressed version I want to keep in my head after writing all this out.
- Scalar
y = wx + band matrixy = Wx + bare the same idea at different widths. - An
[M×N]weight matrix meansMlearned weighted sums over anN-dimensional input. - The dimension rule is non-negotiable:
[M×N] × [N×K] = [M×K]. If shapes do not line up, transpose or reshape until they do. - Batched forward passes use token matrices like
X [T×D]and often computeY = XWᵀ + bbecause the weight is stored as[N×D]. - Inside a transformer, query, key, value, output, and MLP projections are all affine maps.
- Attention scores are where transposes become especially visible:
Q @ Kᵀturns two[T×d_k]views into a token-to-token score matrix[T×T]. - Matrix backpropagation is the same chain rule as the scalar case, just with larger shapes and carefully placed transposes.
- Every gradient has the same shape as the thing it differentiates with respect to:
dWmatchesW,dXmatchesX, anddbmatchesb. - At the systems level, a dense layer is a GEMM plus bias. That is what the kernel actually executes.
- At model scale, transformers are mountains of repeated matrix multiplications. The primitive never disappeared.
If post 18 made backprop feel local and understandable, I want this post to do the same thing for transformer shapes. The notation gets bigger, but the engine underneath is still the one-line idea we already learned.
The reward for understanding this is enormous. Once Wx+b feels natural, papers, kernels, attention code, and performance discussions all become easier to parse.
That is the real reason I keep reducing giant models back to simple algebra. I do not want the scale to hide the mechanism.