Lab note

Companion post to the Softmax carousel. Previously: Matrix Wx+b: From Scalars To Transformers.

The previous post was about the moment a scalar weighted sum grew into a matrix multiply. This post picks up exactly where that one leaves off. Once a transformer has produced a row of scores, logits, or affinities, it still needs one more operation before those numbers can act like choices. That operation is softmax.

If Wx+b is the engine that manufactures scores, softmax is the governor that turns those scores into a normalized distribution. It answers a probability question without throwing away the relative differences between options. That is why it shows up twice in transformers: once inside attention, and once again at the vocabulary output layer.

The easiest way to think about softmax is this: it does not invent information. It reorganizes information so that the row can be interpreted as belief, focus, or preference. A bigger logit still means “more likely,” but now the whole row also adds up to one coherent probability budget. "Softmax is where arbitrary scores become a probability story." Why it deserves its own post

Why this sequel matters

Matrix Wx+b: From Scalars To Transformers explained how matrices create many scores at once. This post explains how those scores become weights, predictions, and gradients. If the previous post was about linear algebra producing raw numbers, this one is about probability geometry shaping what those numbers mean.

Section 1: What Is Softmax?

Softmax takes a vector of real numbers and converts it into a probability distribution. The formula for coordinate i is:

Softmax Formula

\[ \operatorname{softmax}(\mathbf{z})_i = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}} \]

Each logit becomes a positive exponential score, then the shared denominator normalizes the whole vector into one probability budget.

The exponential makes every entry positive, and the division by the shared sum makes the whole vector add up to one.

Notice what softmax does not require. The inputs do not need to be between zero and one. They do not need to sum to anything special. They can be negative, huge, tiny, messy, or centered around zero. Softmax takes care of the cleanup step.

Property What it means Why it matters
0 < p_i < 1 Every output is strictly positive and strictly less than one. Each value can be read as a probability weight.
Σ p_i = 1 The whole vector uses one shared probability budget. Rows become directly comparable and interpretable.
Ordering preserved If z_a > z_b, then p_a > p_b. Softmax respects the ranking already present in the logits.
Differentiable Tiny logit changes produce smooth probability changes. Backpropagation can flow through it.

Softmax is sometimes introduced as “normalized exponentials.” That description is correct, but the important intuition is even simpler. The exponential amplifies gaps. A logit that is a little larger becomes a probability that is noticeably larger. "Softmax is the differentiable version of argmax. It's 'soft' because it gives weight to all options, not just the winner." Section 1 side note

Take the concrete logit vector [2.0, 1.0, 0.1, -1.0, 3.0]. The largest score is the last one, so we expect the last probability to win. Softmax keeps that ordering, but it also tells us how strongly it wins compared with the rest.

Bar chart showing raw logits [2.0, 1.0, 0.1, -1.0, 3.0] beside their softmax probabilities, annotated with a sum of 1.0

For that vector, the probabilities are approximately [0.2334, 0.0858, 0.0349, 0.0116, 0.6343]. The winning logit 3.0 does not become probability 1. It becomes the largest share of the budget. That is the “soft” in softmax.

There is also a crucial implementation detail hidden behind almost every production softmax: the numerical stability trick. Before taking exponentials, we subtract the maximum logit from the whole row. Because every term is shifted by the same constant, the probabilities do not change. But the exponentials stay in a safe numerical range.

Why is that shift legal? Because exp(z_i - c) / Σ exp(z_j - c) equals exp(z_i) / Σ exp(z_j) for any constant c. Choosing c = max(z) guarantees the largest shifted value is zero, so the biggest exponential is exactly one instead of something explosive. subtract max(z) The stable version changes the arithmetic scale, not the final probabilities.

Stable softmax on one vector python
import numpy as np

z = np.array([2.0, 1.0, 0.1, -1.0, 3.0])
shifted = z - z.max()
exp_z = np.exp(shifted)
p = exp_z / exp_z.sum()

print(np.round(shifted, 4))
print(np.round(p, 4))
print(p.sum())  # 1.0

One sentence definition

Softmax turns “how strong is each option?” into “what fraction of attention or probability does each option get?”

Section 2: Softmax on Vectors and Matrices

On a single vector, softmax is easy to picture. Input a shape [1×N] row, output a shape [1×N] probability row. Nothing about the length changes. Only the interpretation changes.

Transformers rarely work with one row at a time. They work with batches of token positions, or with many query rows at once. So in practice, softmax is usually applied to a matrix. The rule is row-wise: each row is normalized independently.

Input object Shape How softmax is applied Result
Single vector [1×N] Across its N entries One probability distribution
Matrix [M×N] Independently across each row M separate probability distributions
Attention scores [T×T] Across each token row One distribution per query token
Vocabulary logits [T×V] Across each vocabulary row One distribution per token position

That row-wise rule is easy to forget when the notation becomes abstract. A matrix softmax is not one giant probability distribution over all M×N cells. It is M separate probability questions living side by side.

Side-by-side heatmaps of a 4 by 6 matrix before and after row-wise softmax, with each row after softmax summing to 1.0

This is the transformer habit worth memorizing. When you see softmax on a matrix, ask: what does one row represent? In attention, one row means one token asking where to look. In the final layer, one row means one position asking which vocabulary token comes next. "Row-wise softmax means each row is a separate probability question. In attention: 'how much should this token attend to every other token?'" Section 2 side note

NumPy makes the row-wise version explicit with an axis argument. If the last axis stores the choices, then softmax happens along axis=-1. The same rule works for a plain matrix or for higher-rank tensors that end in the choice dimension.

Row-wise softmax in NumPy python
softmax = np.exp(z - z.max(axis=-1, keepdims=True)) / np.exp(z - z.max(axis=-1, keepdims=True)).sum(axis=-1, keepdims=True)

You can read that one-liner in two phases. First, shift every row by its own maximum. Second, exponentiate and divide each row by its own row sum. The word “keepdims” is what preserves broadcasting alignment.

Matrix intuition

Softmax does not mix rows together. It only rescales each row so that the row becomes a valid distribution. That locality is why the same operation fits both attention matrices and vocabulary logits.

Section 3: Where Softmax Appears in Transformers

Before we go deeper into derivatives, it helps to know where softmax shows up. It appears in two critical places inside a transformer, and we will explore the full attention mechanism in a future post. For now, a quick sketch.

Place 1 — Attention weights. Inside self-attention, pairwise similarity scores between tokens form a [T×T] matrix (one row per token, one column per candidate). Softmax is applied row-wise, turning each row into a probability distribution: "how much should this token look at every other token?" "Inside attention, softmax answers: given this token, how should I distribute my focus across the context window?" Preview — we will derive the full attention mechanism in a later post

Two 6 by 6 heatmaps showing raw scores before softmax and normalized attention weights after softmax

Place 2 — The final prediction layer. After all transformer layers, the model produces a logit vector over the vocabulary (e.g., 50,257 tokens for GPT-2). Softmax converts those logits into a probability distribution: "which word comes next?" We will spend a full section on this below.

Both uses are the same operation — turn raw scores into probabilities — just applied to different shapes and at different depths in the network. The math we derive next applies identically to both. "The [T×T] attention matrix is why transformers scale as O(T²). Every token attends to every other token." A hint at why efficient attention is a hot research area

Section 4: A Quick Note on Causal vs Full Attention

When softmax is applied inside attention, one important detail changes the shape of the probability distribution: masking.

In full self-attention (encoders like BERT), every token can attend to every other token. The full [T×T] softmax weight matrix is used — each row distributes attention across all T positions.

In causal attention (decoders like GPT), token i can only attend to tokens 0…i. The upper triangle of the score matrix is set to -∞ before softmax, so those positions become zero probability. Each row still sums to 1, but only over the allowed past tokens. "Mask first, softmax second. The mask changes the support of the distribution — the row sums to 1 over a smaller set." We will derive the full attention mechanism in a dedicated post

Two 8 by 8 heatmaps comparing full attention weights against causal lower-triangular attention weightsBar chart for one token showing its attention distribution over previous tokens while future tokens are greyed out

The key insight for this post: whether full or causal, softmax is still the same operation — row-wise normalization to probabilities. The mask only changes which entries participate. All the derivative math in the sections below applies identically to both cases.

Section 5: Softmax in the Final Layer — Logits to Predictions

Attention is not the only place softmax appears. The final layer of a language model also ends with logits. There, the model takes a hidden state and projects it into vocabulary space: logits = hidden @ W_vocab + b_vocab.

If we keep T token positions and the vocabulary has size V, then the result has shape [T×V]. For GPT-2, V is 50,257. Each row is one prediction problem over the entire vocabulary.

Mini-vocabulary bar charts showing logits on the left and softmax probabilities on the right with the winning token highlighted

The model does not output words directly. It outputs logits. Softmax is what turns those logits into a probability distribution that says how plausible each vocabulary token is as the next token. "At inference we only need the last row’s softmax, the prediction for the next token. At training we need all T rows for the loss." Section 5 side note

At inference time, we usually care only about the last row. That row corresponds to the next-token prediction after reading the prompt so far. Sampling, top-k, nucleus filtering, and greedy decoding all operate on that last-row probability distribution.

During training, the story is broader. We compute logits for every position, apply row-wise softmax, and compare every row with the true next token using cross-entropy loss. So the same operation supports both prediction and learning.

Final-layer prediction sketch python
# hidden: [T, D]
# W_vocab: [D, V]
# b_vocab: [V]
logits = hidden @ W_vocab + b_vocab      # [T, V]
probs = softmax(logits, axis=-1)         # [T, V]
next_token_id = np.argmax(probs[-1])     # last row only at inference

Softmax is not the choice rule itself

Softmax creates the distribution. Argmax or sampling creates the actual chosen token. Keeping those two steps separate helps when you reason about inference versus training.

Section 6: The Derivative of Softmax (Scalar Form)

Softmax is smooth, so it has derivatives. But unlike a scalar nonlinearity such as tanh, each output coordinate depends on every input coordinate. Changing one logit changes the whole denominator. That is why the derivative is a Jacobian matrix rather than a single element-wise formula.

Let p_i = softmax(z_i). Then the derivative with respect to logit z_j is ∂p_i/∂z_j = p_i(δ_ij - p_j). The Kronecker delta δ_ij is 1 when i = j and 0 otherwise.

Case Derivative Interpretation
i = j ∂p_i/∂z_i = p_i(1 - p_i) Raising a logit raises its own probability, but not without limit.
i ≠ j ∂p_i/∂z_j = -p_i p_j Raising one logit steals probability mass from the others.
Matrix form J = diag(p) - p pᵀ The whole row is coupled through one dense Jacobian.

The diagonal entries are positive because increasing a class logit tends to increase its own probability. The off-diagonal entries are negative because the distribution must still sum to one. Probability mass gained somewhere must be lost somewhere else.

Color-coded 5 by 5 Jacobian heatmap for softmax with positive diagonal entries and negative off-diagonal entries Step-by-step derivation of the softmax Jacobian text
Let p_i = exp(z_i) / S
where S = Σ_j exp(z_j)

Differentiate with respect to z_j:

∂p_i/∂z_j
= ∂/∂z_j [exp(z_i) / S]
= [exp(z_i)·δ_ij·S - exp(z_i)·exp(z_j)] / S²

Now divide and multiply by S where helpful:

= [exp(z_i)/S]·δ_ij - [exp(z_i)/S]·[exp(z_j)/S]
= p_i·δ_ij - p_i·p_j
= p_i(δ_ij - p_j)

This formula looks dense the first time you see it, but its message is intuitive. Softmax outputs compete because they share one denominator. Each probability is tied to every other probability through normalization. "The Jacobian is dense because the denominator is shared." Why softmax is coupled

Why softmax is differentiable (and min/max are not)

Before we walk through the numbers, a quick but important aside. Softmax uses exp and sum — both are smooth, continuous functions with well-defined derivatives everywhere. The sum S = exp(z₀) + exp(z₁) + … + exp(z₄) is just addition — no branching, no discontinuity. Its derivative with respect to any z_j is simply exp(z_j).

Compare this to max(z) or min(z). At the point where two inputs are exactly equal, max switches which input it "follows." That switch creates a kink — a point where the derivative is undefined. You can work around it with subgradients or smooth approximations, but the raw function is not differentiable at ties. "Softmax is sometimes called 'soft-max' precisely because it is the smooth, differentiable approximation of the hard argmax. Replace the discontinuous winner-take-all with a smooth exponential competition." Why the name matters

This smoothness is why backpropagation works cleanly through softmax. Every gradient exists, every chain rule step is well-defined, and we can compute exact derivatives at every point in the input space.

Walking through the quotient rule on 5 concrete values

Let's make this completely concrete with a 5-element vector. Start with logits z = [2.0, 1.0, 0.5, 0.1, 3.0].

Step 1: Forward pass — compute softmax python
import numpy as np

z = np.array([2.0, 1.0, 0.5, 0.1, 3.0])

# Numerical stability: subtract max
z_shifted = z - z.max()           # [-1.0, -2.0, -2.5, -2.9, 0.0]

# Exponentiate
e = np.exp(z_shifted)             # [0.3679, 0.1353, 0.0821, 0.0550, 1.0000]

# Sum
S = e.sum()                       # 1.6403

# Normalize
p = e / S                         # [0.2243, 0.0825, 0.0500, 0.0335, 0.6097]

print(f"p = {p}")                 # sums to 1.0000
print(f"sum(p) = {p.sum():.4f}")  # 1.0000

So our softmax output is p = [0.2243, 0.0825, 0.0500, 0.0335, 0.6097]. Now let's differentiate.

Recall the definition: p_i = exp(z_i) / S where S = Σ exp(z_j). We want ∂p_i/∂z_j — how does the i-th output change when we nudge the j-th input?

This is a quotient: f(z) / g(z) where f = exp(z_i) and g = S = Σ exp(z_k). The quotient rule says: d(f/g)/dz_j = (f'·g − f·g') / g².

The two cases of the quotient rule

Numerator derivative: ∂exp(z_i)/∂z_j = exp(z_i) if i = j, else 0. That's because exp(z_i) only depends on z_i, not on z_j for j ≠ i.

Denominator derivative: ∂S/∂z_j = exp(z_j) always — since S = exp(z₀) + exp(z₁) + … + exp(z₄), the derivative of that sum with respect to z_j is just exp(z_j). This is why the sum is smooth: no branching, no if-then, just addition.

Step 2: Derive ∂p₀/∂z_j for each j = 0, 1, 2, 3, 4 text
Let's compute the full row 0 of the Jacobian: ∂p₀/∂z_j for all j.

p₀ = exp(z₀) / S

Using quotient rule: ∂p₀/∂z_j = [∂exp(z₀)/∂z_j · S − exp(z₀) · ∂S/∂z_j] / S²

── j = 0 (same index): ─────────────────────────────────
  ∂exp(z₀)/∂z₀ = exp(z₀)    ← numerator depends on z₀
  ∂S/∂z₀ = exp(z₀)          ← sum always has this term

  ∂p₀/∂z₀ = [exp(z₀)·S − exp(z₀)·exp(z₀)] / S²
           = exp(z₀)/S · [1 − exp(z₀)/S]
           = p₀ · (1 − p₀)
           = 0.2243 × (1 − 0.2243) = 0.2243 × 0.7757 = 0.1740

── j = 1 (different index): ─────────────────────────────
  ∂exp(z₀)/∂z₁ = 0           ← numerator does NOT depend on z₁
  ∂S/∂z₁ = exp(z₁)           ← sum always has this term

  ∂p₀/∂z₁ = [0·S − exp(z₀)·exp(z₁)] / S²
           = −exp(z₀)/S · exp(z₁)/S
           = −p₀ · p₁
           = −0.2243 × 0.0825 = −0.0185

── j = 2: ∂p₀/∂z₂ = −p₀·p₂ = −0.2243 × 0.0500 = −0.0112
── j = 3: ∂p₀/∂z₃ = −p₀·p₃ = −0.2243 × 0.0335 = −0.0075
── j = 4: ∂p₀/∂z₄ = −p₀·p₄ = −0.2243 × 0.6097 = −0.1368

Row 0 of Jacobian: [0.1740, −0.0185, −0.0112, −0.0075, −0.1368]
                     ↑ positive (self)   ↑ all negative (competition)

The pattern is clear. The diagonal entry is positive — raising z₀ increases p₀. Every off-diagonal entry is negative — raising any other logit steals probability from p₀. And notice: the biggest negative is −0.1368 for z₄, because p₄ = 0.6097 is the dominant class. The dominant competitor hurts you most. Row sum = 0 Every row of the Jacobian sums to zero. This must be true because the probabilities always sum to 1 — if you raise one logit, the total probability gained equals the total probability lost elsewhere.

Step 3: Compute the full 5×5 Jacobian and verify numerically python
import numpy as np

z = np.array([2.0, 1.0, 0.5, 0.1, 3.0])
p = np.exp(z - z.max()) / np.exp(z - z.max()).sum()

# Build the full Jacobian: J_ij = p_i(δ_ij - p_j)
J = np.diag(p) - np.outer(p, p)
print("Jacobian:")
print(np.round(J, 4))

# Verify: each row sums to 0
print(f"\nRow sums: {J.sum(axis=1).round(10)}")  # all zeros

# Verify: diagonal = p_i(1 - p_i)
print(f"Diagonal: {np.diag(J).round(4)}")
print(f"p*(1-p):  {(p * (1 - p)).round(4)}")  # should match

# Verify against numerical gradient (finite differences)
eps = 1e-5
J_numerical = np.zeros((5, 5))
for j in range(5):
    z_plus = z.copy(); z_plus[j] += eps
    z_minus = z.copy(); z_minus[j] -= eps
    p_plus = np.exp(z_plus - z_plus.max()) / np.exp(z_plus - z_plus.max()).sum()
    p_minus = np.exp(z_minus - z_minus.max()) / np.exp(z_minus - z_minus.max()).sum()
    J_numerical[:, j] = (p_plus - p_minus) / (2 * eps)
print(f"\nMax error vs numerical: {np.abs(J - J_numerical).max():.2e}")  # ~1e-10

The chain rule summary for all 5 indices

For any p_i = exp(z_i) / S:
∂p_i/∂z_j = p_i·δ_ij − p_i·p_j = p_i(δ_ij − p_j)

The quotient rule gives us the f'g − fg' structure. The chain rule gives us the exp(z_j) term in the denominator's derivative. Together they produce: positive on the diagonal (self-reinforcement), negative everywhere else (competition). And because sum is smooth everywhere — unlike min or max which have kinks at ties — this derivative exists at every point.

A full Jacobian is educational for small vectors. It is not something we materialize for real vocabulary sizes. For V = 50,257, the Jacobian would have more than 2.5 billion entries for one row. "The Jacobian is a dense N×N matrix. For vocab_size=50257, that is 2.5 billion entries. Thankfully, we never build it explicitly — see the O(N) trick in Section 7." Why we need the efficient backward identity

Section 7: Backprop Through Softmax — The Efficient Trick

We have the Jacobian: J_ij = p_i(δ_ij - p_j). Naively, to backpropagate through softmax with N outputs, we would build the full N×N Jacobian and multiply. For a vocabulary of 50,257 tokens, that is 2.5 billion entries per row — completely impractical.

Fortunately, there is a beautiful shortcut. Let p be the softmax output and g = dL/dp be the incoming gradient. The vector-Jacobian product simplifies to:

The softmax backward identity

dL/dz = p ⊙ (g - (pᵀg)·1)

Where is element-wise multiplication and pᵀg is a single dot product. This is O(N) per row — no explicit Jacobian needed.

The derivation: (J·g)_i = Σ_j p_i(δ_ij - p_j)·g_j = p_i·g_i - p_i·Σ_j p_j·g_j = p_i·(g_i - pᵀg). Factor out p_i and you get the vector form above. "We never build the N×N Jacobian. Instead: compute one dot product pᵀg, subtract it from g, multiply element-wise by p. Done." The trick that makes softmax backprop practical

Flow diagram showing backward propagation through softmax — incoming gradient g, dot product step, element-wise product, output gradient dL/dz Efficient softmax backward in NumPy python
import numpy as np

def softmax(z, axis=-1):
    shifted = z - z.max(axis=axis, keepdims=True)
    exp_z = np.exp(shifted)
    return exp_z / exp_z.sum(axis=axis, keepdims=True)

def softmax_backward(p, grad_output):
    """Backprop through softmax without building the Jacobian.
    p: softmax output [... x N]
    grad_output: dL/dp [... x N]
    returns: dL/dz [... x N]
    """
    dot = np.sum(p * grad_output, axis=-1, keepdims=True)  # pᵀg
    return p * (grad_output - dot)  # p ⊙ (g - pᵀg)

# Works for any shape — single row or [M×N] matrix
z = np.array([[2.0, 1.0, 0.1], [1.0, 3.0, 0.5]])
p = softmax(z)
g = np.random.randn(*p.shape)  # incoming gradient
dz = softmax_backward(p, g)    # O(N) per row

This same formula is used everywhere softmax appears — whether in attention (where the row length is the context length T) or in the final layer (where the row length is the vocabulary size V). The next section shows an even simpler special case for the output layer. O(N) vs O(N²) The backward identity reduces softmax backprop from quadratic to linear — critical when N = 50,257 (vocab size) or N = 8,192 (context length).

Section 8: The Final Layer Magic — p - one_hot

At the output layer, something special happens when softmax is paired with cross-entropy loss. The combined gradient collapses into a much simpler form than the raw Jacobian suggests. If y is the one-hot target vector and p is the softmax output, then dL/dz = p - y.

Cross-entropy is L = -Σ y_i log(p_i). Because y is one-hot, only the true class survives, so L = -log(p_true). The loss punishes the model only for assigning too little probability to the correct next token.

Derivation of dL/dz = p - y text
Start with cross-entropy:
L = -Σ_i y_i log(p_i)

Differentiate with respect to z_j:

dL/dz_j = Σ_i (dL/dp_i · dp_i/dz_j)
        = Σ_i (-y_i / p_i) · p_i(δ_ij - p_j)
        = Σ_i -y_i(δ_ij - p_j)
        = Σ_i (-y_iδ_ij + y_ip_j)
        = -y_j + p_j Σ_i y_i

Since y is one-hot, Σ_i y_i = 1.
Therefore:

dL/dz_j = p_j - y_j

Vector form:
dL/dz = p - y
Three stacked bar charts showing a softmax output distribution, the one-hot target vector, and the resulting gradient p minus y

This is one of the most elegant identities in deep learning. A complicated normalization plus a logarithmic loss turns into a simple subtraction. Prediction minus truth is all the optimizer needs at the final layer. "This is why cross-entropy + softmax is universal. The gradient is just prediction minus truth. No Jacobians, no complicated math — just p - y." Section 8 side note

The sign pattern is equally intuitive. For the true class, y_i = 1, so the gradient is negative unless the model already predicts probability 1. For all incorrect classes, y_i = 0, so the gradient is just the predicted probability itself and therefore positive. "The true token gets a negative gradient, pushing its logit up. All other tokens get positive gradients, pushing their logits down. That is learning." Section 8 side note

Component Value in the example Training signal
True token 0.70 - 1.00 = -0.30 Increase the correct logit
Wrong token 1 0.10 - 0.00 = 0.10 Decrease that incorrect logit
Wrong token 2 0.05 - 0.00 = 0.05 Decrease a little
Wrong token 3 0.10 - 0.00 = 0.10 Decrease that incorrect logit
Wrong token 4 0.05 - 0.00 = 0.05 Decrease a little

The final-layer shortcut

At the output layer, we do not backprop through softmax and cross-entropy as two separate giant symbolic objects. The combination simplifies to a vector the same size as the vocabulary row. That simplification is one reason modern language model training is mathematically clean despite enormous output spaces.

Section 9: Softmax Temperature and Sharp vs Flat Distributions

Softmax can be made sharper or flatter by dividing logits by a temperature τ before normalization. The formula becomes softmax(z / τ). Changing temperature does not reorder logits, but it changes how strongly the winner dominates.

Temperature regime Effect on probabilities Interpretation
τ < 1 Sharper, more concentrated More confident, closer to argmax
τ = 1 Baseline softmax Default distribution
τ > 1 Flatter, more spread out More exploratory or uncertain
τ → 0 Approaches one-hot Hard winner-take-all behavior
τ → ∞ Approaches uniform Every option looks similar

In language model inference, temperature changes the character of sampling. Low temperature makes outputs more deterministic. High temperature gives weaker options more room to be sampled. That is why people associate temperature with creativity.

Four subplots showing the same logits under temperatures 0.5, 1.0, 2.0, and 5.0, ranging from sharp to flat distributions

Temperature also has a hidden cousin inside attention. The division by √d_k prevents dot products from growing too large as the key dimension grows. Without that scaling, attention softmax would become too peaky too early and gradients would become fragile. "In attention, the √d_k divisor acts like temperature. It keeps score magnitudes from exploding before softmax." Section 9 side note

So temperature is not only an inference knob. It is also a way to reason about calibration, confidence, and gradient quality. Sharp distributions can be decisive, but they are also easier to saturate. τ controls sharpness Lower temperature concentrates mass on the winner. Higher temperature redistributes mass toward the rest.

Limit cases worth remembering

As τ approaches zero, softmax behaves more and more like argmax. As τ grows very large, every logit difference gets washed out and the distribution approaches 1/N.

Section 10: What This Looks Like in C — The Softmax Kernel

In Backpropagation From wx+b: The Smallest Training Loop and Matrix Wx+b: From Scalars To Transformers, we saw how gemm_blocked_serial() implements wx+b in the C-Kernel-Engine. Softmax has its own dedicated kernel — and its structure directly mirrors the math we just derived.

The forward pass has three phases: find max, exponentiate and sum, normalize. Here is a simplified version of what the actual causal_softmax_head_major() kernel does:

Simplified softmax forward kernel (from C-Kernel-Engine) python
# C-Kernel-Engine: softmax_kernels.c (simplified to pseudocode)
# The real kernel uses AVX512/AVX2 SIMD — here is the scalar logic.

def softmax_forward(row, valid_len):
    # Phase 1: find max for numerical stability
    max_val = -inf
    for j in range(valid_len):
        if row[j] > max_val:
            max_val = row[j]

    # Phase 2: exponentiate and sum
    total = 0.0
    for j in range(valid_len):
        row[j] = exp(row[j] - max_val)
        total += row[j]

    # Phase 3: normalize (multiply by 1/sum — faster than dividing)
    inv_sum = 1.0 / total
    for j in range(valid_len):
        row[j] *= inv_sum

    # Phase 4 (causal only): zero out future tokens
    for j in range(valid_len, total_len):
        row[j] = 0.0

The backward pass is the O(N) trick from Section 7, implemented in SIMD. Compute pᵀg (the dot product), then p ⊙ (g − pᵀg) — exactly two vectorized passes over the row: "In the C-Kernel-Engine, backward_causal_softmax_head_major() implements dL/dz = p ⊙ (g − pᵀg) with AVX512 FMA instructions — processing 16 floats per cycle." softmax_kernels.c, line 382

Simplified softmax backward kernel (from C-Kernel-Engine) python
# C-Kernel-Engine: softmax_kernels.c backward (simplified)
# d_scores arrives as dL/dp, weights is the cached softmax output p

def softmax_backward(d_scores, weights, valid_len):
    # Pass 1: compute dot product  pᵀg
    dot_product = 0.0
    for j in range(valid_len):
        dot_product += weights[j] * d_scores[j]

    # Pass 2: compute gradient  p * (g - pᵀg)
    for j in range(valid_len):
        d_scores[j] = weights[j] * (d_scores[j] - dot_product)

    # Zero out future tokens (causal)
    for j in range(valid_len, total_len):
        d_scores[j] = 0.0
Kernel File SIMD What it computes
causal_softmax_head_major softmax_kernels.c AVX512 / AVX2 / scalar Forward: max → exp → sum → normalize → mask future
causal_softmax_head_major_exact softmax_kernels.c scalar (expf) Forward with exact libm exp (for parity testing)
backward_causal_softmax_head_major softmax_kernels.c AVX512 / AVX / scalar Backward: dot product → p⊙(g − pᵀg) → mask future

Notice something interesting: the SIMD forward kernel uses a fast polynomial approximation of exp() instead of calling the C library's expf(). It's accurate to ~1e-4 relative error — good enough for softmax normalization but not for numerical parity tests. That's why there's a separate _exact variant using expf(). 16 exp() calls per cycle AVX512 processes 16 floats in one SIMD register. The polynomial exp approximation avoids expensive libm calls, making softmax forward ~4× faster than scalar expf().

Section 11: Summary

Softmax is easy to write in one line and surprisingly rich in consequences. It is the probability engine inside attention, the probability engine at the vocabulary head, and a key place where numerical stability and gradient algebra matter.

  • Softmax maps any real-valued row to a valid probability distribution.
  • Subtracting the row maximum before exponentiation preserves the answer and prevents overflow.
  • Applied row-wise to a matrix, softmax creates one independent probability distribution per row.
  • In attention, softmax turns [T×T] scores into attention weights that mix the value matrix.
  • Causal masking sets illegal future positions to zero probability by masking before softmax.
  • At the final layer, softmax converts vocabulary logits into next-token probabilities.
  • Its Jacobian is dense, but practical backprop uses vector-Jacobian products instead of explicit matrices.
  • With cross-entropy, the final-layer gradient simplifies to the famous p - y.
  • Temperature controls whether the distribution is sharp, flat, deterministic, or exploratory.

Next up

Next: how the CPU sees softmax — the exp() bottleneck, lookup tables, vectorization, and why this operation is often more memory-bound than it first appears.