This is the companion post to today's backprop-wxplusb carousel. I wanted a version with more room for the pictures, the algebra, and the engineering intuition.
Previous post in this sequence
This post builds directly on Derivatives And The Chain Rule: The Math Of Tiny Changes. That earlier piece explains why derivatives measure tiny changes. Here I use that idea to show how one small model, y = wx + b, actually learns.
The shortest honest story I know about deep learning starts with a line. Not with transformers, not with billion-parameter buzzwords, and not with an intimidating wall of tensor notation. It starts with y = wx + b.
Once this clicked for me, backprop stopped feeling mystical. A model makes a prediction, a loss function measures how wrong that prediction is, gradients tell me which knobs matter, and gradient descent nudges those knobs in a better direction. The whole pipeline is already visible in one weight w and one bias b.
So this post is my engineering notebook version of the topic. I want the math to stay visible, but I also want every equation to connect back to a picture and to a concrete training loop. That is the only way this topic became intuitive for me.
"If I can understand how one line learns, I can understand how a whole network learns."
The mental compression that made backprop feel manageable
The framing I use
Backpropagation is not a separate magic trick layered on top of machine learning. It is the chain rule applied to a computation graph so I can see how changing each parameter changes the final loss.
Why wx + b Is The Foundation
A single linear model is the smallest useful place to study learning. I have an input x, I scale it by w, shift it by b, and get a prediction ŷ. That is all wx + b means.
The reason it matters so much is that every dense neural network layer is this same operation generalized to more dimensions. In one dimension it is wx + b. In many dimensions it becomes Wx + b or xW + b depending on notation.
The nonlinearity usually gets the glamour because it makes deep networks expressive. But the affine part is the plumbing that actually mixes information and creates learnable degrees of freedom. Every layer is basically, "take the input, do a weighted sum, add a bias, then maybe apply a nonlinearity."
I like to keep the layer template in my head like this: input → weighted sum → bias add → nonlinearity → next layer. If I can differentiate that pipeline once, I can differentiate it a thousand times.
1 multiply
The scalar model has one weighted interaction, w · x, plus one offset, b.
scalar neuron:
ŷ = wx + b
vector layer:
z = Wx + b
a = σ(z)
The forward pass and backward pass already exist in that tiny scalar example. The rest of deep learning is mostly scaling the same idea, managing more parameters, and keeping the gradients numerically sane.
"A giant neural network is not conceptually different from wx+b. It is that pattern repeated, widened, and composed."
How I keep large models mentally grounded
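To make the scalar-to-vector step concrete, here is a minimal NumPy sketch of both forms. The specific numbers, the 3-input/2-output shapes, and the choice of tanh for σ are illustrative assumptions, not values used elsewhere in this post.

```python
import numpy as np

# Scalar neuron: one weight, one bias, one input.
w, b, x = 0.5, 3.0, 2.0
y_hat = w * x + b

# Vector layer: 3 inputs -> 2 outputs (shapes are an assumption for illustration).
W = np.array([[0.5, -1.0, 2.0],
              [1.5, 0.0, -0.5]])    # shape (2, 3): one row of weights per output unit
b_vec = np.array([3.0, -1.0])       # shape (2,): one bias per output unit
x_vec = np.array([2.0, 1.0, 0.5])   # shape (3,)

z = W @ x_vec + b_vec               # affine part: weighted sums plus biases
a = np.tanh(z)                      # sigma is taken to be tanh here (an assumption)

# Each output unit is the scalar pattern again: a weighted sum plus one bias.
z0_by_hand = W[0] @ x_vec + b_vec[0]
```

Each entry of z is computed exactly like the scalar ŷ, only with a dot product in place of the single multiply.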
The Forward Pass
The forward pass is just the act of making predictions. I choose initial parameters, feed in each x, compute ŷ = wx + b, and compare that prediction to the data.
For the examples in this post, I generated twenty noisy points from a hidden line that is roughly y = 2x + 1. Then I started the model from a deliberately bad guess: w = 0.5 and b = 3. That gives me something for gradient descent to fix.
The first picture to look at is not the loss. It is the geometry. Is the line too flat? Too high? Too low? That visual mismatch is what the loss will turn into a single scalar number.

import matplotlib.pyplot as plt
import numpy as np
rng = np.random.default_rng(18)
x = np.linspace(-2, 5, 20)
y = 2 * x + 1 + rng.normal(0, 0.65, size=x.shape)
line_x = np.linspace(x.min() - 0.5, x.max() + 0.5, 300)
fig, ax = plt.subplots(figsize=(10, 6))
ax.scatter(x, y, s=46, color="#3b82f6", label="training data")
ax.plot(line_x, 2 * line_x + 1, color="#10b981", linewidth=2.8,
        label="true line: y = 2x + 1")
ax.plot(line_x, 0.5 * line_x + 3, color="#ef4444", linewidth=2.6,
        linestyle="--", label="initial guess: ŷ = 0.5x + 3")
ax.set_title("A Simple Dataset and an Initial Guess")
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.grid(True, alpha=0.3)
ax.legend()
fig.savefig("data-and-guess.png", dpi=150, bbox_inches="tight")

What I notice here is that the guess line is wrong in two ways at once. Its slope is too shallow, and its intercept is too high. That means both w and b need to move.
This is already a useful intuition check. Before I ever compute a derivative, I can predict the signs I expect. I probably need a bigger w to rotate the line upward, and a smaller b to shift the line downward.
20 points
The full training signal in this example comes from only twenty residuals.
| Symbol | Meaning | Role In The Forward Pass |
|---|---|---|
| x | Input feature | The value I feed into the model. |
| w | Weight / slope | Controls how strongly the prediction changes as x changes. |
| b | Bias / intercept | Shifts the whole line up or down. |
| ŷ | Prediction | The model output computed by wx + b. |
The Loss Function
A model can be visually wrong, but training needs a number. The loss function is that number. It compresses "how bad is this fit?" into a scalar that I can optimize.
For this post I am using mean squared error. It is one of the cleanest losses to reason about because it squares each residual, averages them, and makes large mistakes hurt more than small ones.
L = (1/n) * sum((ŷ - y)^2)
I read this as: predict first, subtract the truth, square the error so sign does not cancel, then average across the dataset. That is it.
| Piece | What It Means | Why It Is There |
|---|---|---|
| ŷ - y | The residual | Measures how far the prediction is from the target. |
| (ŷ - y)^2 | Squared residual | Penalizes larger misses more strongly and removes sign cancellation. |
| (1/n) * sum | Dataset average | Turns many pointwise errors into one optimization objective. |
The residual plot is the one I keep coming back to when teaching myself this topic. Every dashed vertical segment is literally an error term that will be squared and averaged. The loss is not abstract anymore once I see it this way.

import matplotlib.pyplot as plt
import numpy as np
rng = np.random.default_rng(18)
x = np.linspace(-2, 5, 20)
y = 2 * x + 1 + rng.normal(0, 0.65, size=x.shape)
y_guess = 0.5 * x + 3
line_x = np.linspace(x.min() - 0.5, x.max() + 0.5, 300)
fig, ax = plt.subplots(figsize=(10, 6))
ax.scatter(x, y, s=46, color="#3b82f6", label="training data")
ax.plot(line_x, 0.5 * line_x + 3, color="#ef4444", linewidth=2.8,
        linestyle="--", label="guess line")
# Draw one dashed vertical segment per point: the gap between y_i and ŷ_i.
for x_i, y_i, yhat_i in zip(x, y, y_guess):
    ax.plot([x_i, x_i], [y_i, yhat_i], color="#ef4444",
            linestyle=(0, (3, 3)), linewidth=1.3)
ax.set_title("Residuals Are the Errors the Loss Sees")
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.grid(True, alpha=0.3)
ax.legend()
fig.savefig("residuals.png", dpi=150, bbox_inches="tight")

What to notice: some residuals are positive, some are negative, but after squaring they all contribute positive mass to the loss. That is why MSE behaves like a bowl near the optimum instead of letting errors cancel each other out.
The square is also why outliers matter a lot. If one prediction is twice as wrong, its squared contribution is four times as large. The loss is telling the optimizer where the big mistakes live.
"The loss is just the residual picture compressed into one number."
The sentence that keeps MSE intuitive for me
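A small worked example helps me check the "predict, subtract, square, average" reading by hand. The three toy points below are an assumption chosen so the arithmetic is easy to follow. They are not the twenty-point dataset used in the plots.

```python
import numpy as np

def mse(w, b, x, y):
    """Mean squared error of the line w*x + b against targets y."""
    residual = (w * x + b) - y
    return np.mean(residual ** 2)

# Toy data lying exactly on y = 2x + 1 (an assumption for easy arithmetic).
x_toy = np.array([0.0, 1.0, 2.0])
y_toy = np.array([1.0, 3.0, 5.0])

# Same bad guess as in the post: w = 0.5, b = 3.
residual = (0.5 * x_toy + 3.0) - y_toy    # [2.0, 0.5, -1.0]
loss_guess = mse(0.5, 3.0, x_toy, y_toy)  # (4 + 0.25 + 1) / 3
loss_true = mse(2.0, 1.0, x_toy, y_toy)   # zero: the true line has no residuals here

# Doubling a residual quadruples its squared contribution.
ratio = (2 * 1.5) ** 2 / 1.5 ** 2
```

The raw residuals sum to 1.5 because the negative one partly cancels the positive ones, while the squared version keeps every miss visible.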
The Loss Landscape
Once the loss is a scalar function of the parameters, I can treat it like terrain. Instead of thinking only about lines in data space, I can think about a surface in parameter space. For each pair (w, b), there is a corresponding loss value.
With mean squared error on a linear model, that surface is beautifully well behaved. It forms a convex bowl. There is a single global minimum, and gradient descent just needs to keep walking downhill.

import matplotlib.pyplot as plt
import numpy as np
rng = np.random.default_rng(18)
x = np.linspace(-2, 5, 20)
y = 2 * x + 1 + rng.normal(0, 0.65, size=x.shape)
def mse(w, b):
    return np.mean((w * x + b - y) ** 2)
w_vals = np.linspace(-1, 5, 140)
b_vals = np.linspace(-3, 7, 140)
W, B = np.meshgrid(w_vals, b_vals)
L = np.mean((W[..., None] * x + B[..., None] - y) ** 2, axis=-1)
fig = plt.figure(figsize=(11, 8))
ax = fig.add_subplot(111, projection="3d")
surf = ax.plot_surface(W, B, L, cmap="coolwarm", linewidth=0)
ax.set_xlabel("w")
ax.set_ylabel("b")
ax.set_zlabel("loss")
ax.set_title("MSE Loss Surface Over Weight and Bias")
fig.colorbar(surf, ax=ax, shrink=0.7, pad=0.08, label="MSE")
fig.savefig("loss-surface-3d.png", dpi=150, bbox_inches="tight")

The important thing to notice in the surface plot is the overall shape, not the exact coordinates. Wherever I start on the bowl, following the local slope downhill leads to the same valley floor. That is why linear regression is such a friendly place to learn optimization.
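One way to check the bowl claim numerically, rather than by eye, is to look at the second derivatives of the loss. For MSE on wx + b they form a constant 2×2 matrix (the Hessian) that depends only on the x values. The sketch below is my own sanity check, not part of the plotting code. It assumes the same x grid as the plots.

```python
import numpy as np

x = np.linspace(-2, 5, 20)  # same x grid as the plots in this post
n = len(x)

# Second derivatives of L(w, b) = (1/n) * sum((w*x + b - y)^2):
#   d2L/dw2 = (2/n) * sum(x^2),  d2L/dwdb = (2/n) * sum(x),  d2L/db2 = 2
# None of them depend on y, w, or b, so the curvature is the same everywhere.
H = (2 / n) * np.array([[np.sum(x ** 2), np.sum(x)],
                        [np.sum(x),      n]])

# eigvalsh returns the eigenvalues of a symmetric matrix in ascending order.
eigenvalues = np.linalg.eigvalsh(H)
is_strictly_convex = bool(np.all(eigenvalues > 0))
```

Both eigenvalues come out positive, which is what "convex bowl with a single minimum" means in algebra. Their unequal sizes are also why the contours are stretched ellipses rather than circles.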
The contour view makes the same idea even easier to see because it turns the 3D bowl into topographic rings. Now the gradient descent path looks like an actual route on a map.

import matplotlib.pyplot as plt
import numpy as np
rng = np.random.default_rng(18)
x = np.linspace(-2, 5, 20)
y = 2 * x + 1 + rng.normal(0, 0.65, size=x.shape)
def grads(w, b):
    # Analytic MSE gradients for the line w*x + b.
    residual = (w * x + b) - y
    grad_w = (2 / len(x)) * np.sum(x * residual)
    grad_b = (2 / len(x)) * np.sum(residual)
    return grad_w, grad_b
w_vals = np.linspace(-1, 5, 220)
b_vals = np.linspace(-3, 7, 220)
W, B = np.meshgrid(w_vals, b_vals)
L = np.mean((W[..., None] * x + B[..., None] - y) ** 2, axis=-1)
path = [(0.5, 3.0)]
w, b, lr = 0.5, 3.0, 0.05
for _ in range(50):
    gw, gb = grads(w, b)
    w -= lr * gw
    b -= lr * gb
    path.append((w, b))
path = np.array(path)
fig, ax = plt.subplots(figsize=(10, 7))
ax.contourf(W, B, L, levels=28, cmap="coolwarm")
ax.contour(W, B, L, levels=18, colors="white", linewidths=0.7, alpha=0.65)
ax.plot(path[:, 0], path[:, 1], color="#111827", linewidth=2)
ax.scatter(path[0, 0], path[0, 1], color="#ef4444", s=90, label="start")
ax.scatter(path[-1, 0], path[-1, 1], color="#10b981", s=120, marker="X", label="end")
ax.set_xlabel("w")
ax.set_ylabel("b")
ax.set_title("Gradient Descent Walking Down the Loss Landscape")
ax.legend()
fig.savefig("loss-contour-gd.png", dpi=150, bbox_inches="tight")

What I like about the contour plot is that it makes the two-parameter case feel like the one-dimensional parabola from intro calculus, just stretched into two directions. I am still doing the same thing: follow the slope downhill.
It also shows why a learning rate matters. If the steps are too large, the path can bounce across the valley. If the steps are too small, convergence is safe but slow.
"Loss landscapes sound abstract until you realize they are just maps from parameters to error. A contour line is really a sentence about which choices of w and b are equally wrong."
The picture that made optimization feel geometric instead of symbolic
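The learning-rate point is easy to test directly. The sketch below reruns the same gradient descent with two step sizes and compares the losses. To keep it deterministic I assume noise-free targets on the post's x grid. The run_gd helper and the value 0.2 are my own illustrative choices.

```python
import numpy as np

x = np.linspace(-2, 5, 20)
y = 2 * x + 1  # noise-free targets: an assumption to keep the demo deterministic

def run_gd(lr, steps=20, w=0.5, b=3.0):
    """Run plain gradient descent and return the loss measured before each update."""
    losses = []
    for _ in range(steps):
        residual = (w * x + b) - y
        losses.append(np.mean(residual ** 2))
        w -= lr * (2 / len(x)) * np.sum(x * residual)
        b -= lr * (2 / len(x)) * np.sum(residual)
    return losses

safe = run_gd(lr=0.05)     # the step size used throughout this post
too_big = run_gd(lr=0.2)   # hypothetical larger step, chosen to overshoot

print(f"lr=0.05: loss {safe[0]:.3f} -> {safe[-1]:.3f}")
print(f"lr=0.2:  loss {too_big[0]:.3f} -> {too_big[-1]:.3e}")
```

With the smaller step the loss shrinks on every iteration. With the larger one each update jumps past the valley floor by more than it started with, so the loss grows instead of falling.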
Computing Gradients
The gradients tell me how sensitive the loss is to each parameter. In this tiny model, I only need two of them: dL/dw and dL/db. These are the signals that drive the update step.
The derivation looks cleaner if I name the residual for each point r_i = ŷ_i - y_i. Then the loss is L = (1/n) * sum(r_i^2). From there the chain rule becomes almost mechanical.
ŷ_i = wx_i + b
r_i = ŷ_i - y_i
L = (1/n) * sum(r_i^2)
Differentiate the square, then differentiate the residual, then differentiate the affine prediction inside the residual. That gives:
dL/dw = (2/n) * sum(x_i * (ŷ_i - y_i))
dL/db = (2/n) * sum(ŷ_i - y_i)
The weight gradient includes x_i because changing the slope matters more for points farther out on the x-axis. The bias gradient does not include x_i because changing the intercept shifts every prediction by the same amount.
| Derivative Link | Local Rule | Why It Appears |
|---|---|---|
| dL/dŷ_i | (2/n)(ŷ_i - y_i) | The square turns residual magnitude into a gradient proportional to error. |
| dŷ_i/dw | x_i | The prediction changes with slope in proportion to the input value. |
| dŷ_i/db | 1 | Bias adds the same offset to every prediction. |
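A standard way to check a hand-derived gradient is to compare it against a finite-difference estimate. The sketch below does that for the two formulas above, at the post's starting point w = 0.5, b = 3. The numeric_grads helper and the step size eps are my own assumptions, not part of the original code.

```python
import numpy as np

rng = np.random.default_rng(18)
x = np.linspace(-2, 5, 20)
y = 2 * x + 1 + rng.normal(0, 0.65, size=x.shape)

def loss(w, b):
    return np.mean((w * x + b - y) ** 2)

def analytic_grads(w, b):
    # dL/dw = (2/n) * sum(x_i * r_i),  dL/db = (2/n) * sum(r_i)
    residual = (w * x + b) - y
    return (2 / len(x)) * np.sum(x * residual), (2 / len(x)) * np.sum(residual)

def numeric_grads(w, b, eps=1e-6):
    # Central differences: nudge one parameter at a time and watch the loss.
    gw = (loss(w + eps, b) - loss(w - eps, b)) / (2 * eps)
    gb = (loss(w, b + eps) - loss(w, b - eps)) / (2 * eps)
    return gw, gb

gw_a, gb_a = analytic_grads(0.5, 3.0)
gw_n, gb_n = numeric_grads(0.5, 3.0)
print(f"dL/dw analytic {gw_a:.4f} vs numeric {gw_n:.4f}")
print(f"dL/db analytic {gb_a:.4f} vs numeric {gb_n:.4f}")
```

If the two columns disagree by more than rounding noise, the derivation or the code has a mistake. This check carries over unchanged to much larger models.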
The computation graph is the best place to see why this works. Each node contributes a local derivative. Backpropagation is just the process of multiplying those local pieces in reverse order.

from pathlib import Path
import subprocess
dot = """
digraph Backprop {
rankdir=LR;
graph [bgcolor="white", pad="0.25"];
node [shape=box, style="rounded,filled", fontname="Helvetica"];
x [label="x\n(input)", fillcolor="#dbeafe"];
w [label="w\n(weight)", fillcolor="#fee2e2"];
b [label="b\n(bias)", fillcolor="#ede9fe"];
y [label="y\n(target)", fillcolor="#ecfccb"];
wx [label="w · x", fillcolor="#f3f4f6"];
yhat [label="ŷ = wx + b", fillcolor="#ffedd5"];
loss [label="Loss = (ŷ - y)²", fillcolor="#dcfce7"];
x -> wx; w -> wx; wx -> yhat; b -> yhat; yhat -> loss; y -> loss;
loss -> yhat [style=dashed, color="#ef4444", label="∂L/∂ŷ"];
yhat -> wx [style=dashed, color="#ef4444", label="∂L/∂(wx)"];
yhat -> b [style=dashed, color="#ef4444", label="∂L/∂b"];
wx -> w [style=dashed, color="#ef4444", label="∂L/∂w"];
}
"""
Path("graph.dot").write_text(dot, encoding="utf-8")
subprocess.run(["dot", "-Tpng", "graph.dot", "-o", "computation-graph.png"], check=True)

What to notice in the graph is the ordering. The forward pass moves left to right and computes values. The backward pass moves right to left and computes sensitivities.
In this tiny graph, the chain rule feels simple enough to do by hand. In a deep network, the principle is identical. The graph is bigger, but the bookkeeping pattern is the same.
"Backprop is the chain rule with data structures."
The least intimidating description I know
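Here is the same graph walked by hand for a single training pair: the forward pass stores each node's value, and the backward pass multiplies local derivatives from the loss back to w and b. The input numbers are an illustrative assumption. The loss is the single-sample squared error shown in the graph, so there is no 1/n factor.

```python
# Forward pass: compute and keep every intermediate node value.
x, y = 2.0, 5.0        # one (input, target) pair, chosen for easy arithmetic
w, b = 0.5, 3.0        # same initial guess as the post

wx = w * x             # node: w · x
y_hat = wx + b         # node: ŷ = wx + b
loss = (y_hat - y) ** 2  # node: Loss = (ŷ - y)²

# Backward pass: start at the loss and multiply local derivatives in reverse.
dL_dyhat = 2 * (y_hat - y)   # derivative of the square
dL_dwx = dL_dyhat * 1.0      # ŷ = wx + b, so dŷ/d(wx) = 1
dL_db = dL_dyhat * 1.0       # ŷ = wx + b, so dŷ/db = 1
dL_dw = dL_dwx * x           # wx = w · x, so d(wx)/dw = x

print(loss, dL_dw, dL_db)
```

Both gradients are negative here, so a gradient descent step would raise w and b for this one point. Averaging the same quantities over all points gives the dataset gradients from the formulas above.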
Gradient Descent In Action
Once I have the gradients, the learning loop becomes procedural. Evaluate the current line, compute the loss, compute the gradients, update the parameters, repeat.
The next figure overlays five snapshots of the model during training: iteration 0, 5, 10, 20, and 50. I like this style because it makes the whole optimization trajectory visible in one frame.

import matplotlib.pyplot as plt
import numpy as np
rng = np.random.default_rng(18)
x = np.linspace(-2, 5, 20)
y = 2 * x + 1 + rng.normal(0, 0.65, size=x.shape)
def grads(w, b):
    residual = (w * x + b) - y
    return (2 / len(x)) * np.sum(x * residual), (2 / len(x)) * np.sum(residual)
history = []
w, b, lr = 0.5, 3.0, 0.05
for step in range(51):
    # Record the state before updating, so history[0] is the initial guess.
    history.append((step, w, b))
    gw, gb = grads(w, b)
    w -= lr * gw
    b -= lr * gb
fig, ax = plt.subplots(figsize=(10, 6))
line_x = np.linspace(x.min() - 0.5, x.max() + 0.5, 300)
ax.scatter(x, y, s=42, color="#3b82f6", label="training data")
ax.plot(line_x, 2 * line_x + 1, color="#10b981", linewidth=3, label="true line")
for step in [0, 5, 10, 20, 50]:
    _, w_s, b_s = history[step]
    ax.plot(line_x, w_s * line_x + b_s, linewidth=2.5, label=f"iter {step}")
ax.set_title("Five Snapshots of Gradient Descent Improving the Fit")
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.grid(True, alpha=0.3)
ax.legend()
fig.savefig("gradient-descent-steps.png", dpi=150, bbox_inches="tight")

The most important thing to notice is that the model improves in a coordinated way. The slope rotates upward while the intercept drifts downward. The optimizer is not separately "learning a slope" and then "learning a bias" in isolation. It is updating both because both influence the loss.
This is also why gradients are so useful. They tell me not just whether the model is wrong, but how the model should change in parameter space to become less wrong.
Learning is repeated local linear repair
Every gradient step assumes the current neighborhood is locally informative. It is a tiny trust exercise: use the current slope information, move a little, recompute, and repeat.
The Update Rule
The update step is the whole training loop in one line. Once I know the gradient, I move against it because the gradient points uphill and I want to go downhill.
w := w - lr * dL/dw
b := b - lr * dL/db
The learning rate lr is just a scale factor on trust. Too large and I overshoot. Too small and I crawl. In this example, 0.05 is enough to move quickly without bouncing around.

import matplotlib.pyplot as plt
import numpy as np
rng = np.random.default_rng(18)
x = np.linspace(-2, 5, 20)
y = 2 * x + 1 + rng.normal(0, 0.65, size=x.shape)
def grads(w, b):
    residual = (w * x + b) - y
    return (2 / len(x)) * np.sum(x * residual), (2 / len(x)) * np.sum(residual)
losses = []
w, b, lr = 0.5, 3.0, 0.05
for _ in range(51):
    # Log the loss before each update, so losses[0] belongs to the initial guess.
    losses.append(np.mean((w * x + b - y) ** 2))
    gw, gb = grads(w, b)
    w -= lr * gw
    b -= lr * gb
fig, ax = plt.subplots(figsize=(10, 6))
ax.plot(np.arange(len(losses)), losses, color="#8b5cf6", linewidth=2.8)
ax.scatter([0, 50], [losses[0], losses[-1]], color=["#ef4444", "#10b981"], s=70)
ax.set_title("Loss Drops Fast, Then Flattens as We Converge")
ax.set_xlabel("iteration")
ax.set_ylabel("MSE loss")
ax.grid(True, alpha=0.3)
fig.savefig("loss-over-iterations.png", dpi=150, bbox_inches="tight")

What to notice in the curve is the shape. The first few updates remove a lot of error because the starting point is bad and the bowl is steep there. Later, the curve flattens because the model is already near the minimum and each extra improvement is smaller.
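This is the optimization pattern I expect in many well-behaved settings: fast early wins, then diminishing returns. It is not unique to linear regression. On a convex bowl like this one, a flattening curve means the parameters are approaching the minimum. On non-convex losses the same shape can also come from a plateau, so there I treat it as a hint rather than proof.
96.1%
In fifty steps, the loss falls from 11.32 to 0.44.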
The table below records the first twenty states of the optimization. I find tables like this surprisingly helpful because they show the training loop as concrete state evolution rather than as an abstract formula.
| Iteration | w | b | Loss | dL/dw | dL/db |
|---|---|---|---|---|---|
| 0 | 0.5000 | 3.0000 | 11.3245 | -15.4209 | -1.0538 |
| 1 | 1.2710 | 3.0527 | 3.5241 | -4.8334 | 1.3647 |
| 2 | 1.5127 | 2.9845 | 2.6131 | -1.7692 | 1.9533 |
| 3 | 1.6012 | 2.8868 | 2.3024 | -0.8657 | 2.0233 |
| 4 | 1.6445 | 2.7856 | 2.0700 | -0.5837 | 1.9508 |
| 5 | 1.6736 | 2.6881 | 1.8694 | -0.4816 | 1.8433 |
| 6 | 1.6977 | 2.5959 | 1.6937 | -0.4324 | 1.7312 |
| 7 | 1.7193 | 2.5094 | 1.5395 | -0.3996 | 1.6229 |
| 8 | 1.7393 | 2.4282 | 1.4043 | -0.3728 | 1.5206 |
| 9 | 1.7580 | 2.3522 | 1.2856 | -0.3488 | 1.4244 |
| 10 | 1.7754 | 2.2810 | 1.1814 | -0.3266 | 1.3343 |
| 11 | 1.7917 | 2.2142 | 1.0901 | -0.3058 | 1.2499 |
| 12 | 1.8070 | 2.1518 | 1.0099 | -0.2865 | 1.1708 |
| 13 | 1.8213 | 2.0932 | 0.9396 | -0.2683 | 1.0967 |
| 14 | 1.8348 | 2.0384 | 0.8779 | -0.2514 | 1.0272 |
| 15 | 1.8473 | 1.9870 | 0.8237 | -0.2354 | 0.9622 |
| 16 | 1.8591 | 1.9389 | 0.7762 | -0.2205 | 0.9013 |
| 17 | 1.8701 | 1.8938 | 0.7345 | -0.2066 | 0.8443 |
| 18 | 1.8805 | 1.8516 | 0.6979 | -0.1935 | 0.7908 |
| 19 | 1.8901 | 1.8121 | 0.6658 | -0.1813 | 0.7408 |
The signs match what I expected from the first picture, with one wrinkle. The early dL/dw values are negative, so subtracting them increases w. dL/db starts slightly negative at iteration 0, which nudges b up to 3.0527, and then turns positive from iteration 1 onward, so subtracting it pushes b downward.
That is a nice sanity check: the algebra and the geometry agree. When those two disagree in a real project, it usually means I have introduced a bug.
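The first table row can be reproduced with nothing more than the update rule. This sketch takes the iteration 0 gradients from the table as given rather than recomputing them from the data, and applies one step with lr = 0.05.

```python
lr = 0.05
w0, b0 = 0.5, 3.0

# Gradients copied from the iteration 0 row of the table above.
grad_w0, grad_b0 = -15.4209, -1.0538

# One gradient descent step: move against the gradient.
w1 = w0 - lr * grad_w0   # 0.5 + 0.05 * 15.4209
b1 = b0 - lr * grad_b0   # 3.0 + 0.05 * 1.0538

print(f"w1 = {w1:.4f}, b1 = {b1:.4f}")
```

To four decimals the results match the iteration 1 row: w = 1.2710 and b = 3.0527. The large jump in w comes from the large negative dL/dw, while b barely moves because its gradient is small at the start.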
From 1D To Neural Networks
The move from a single neuron to a real network is conceptually small and computationally huge. Instead of one scalar input and one scalar output, I now have vectors and matrices. But the operation itself is the same affine map.
1D model:
ŷ = wx + b
matrix form:
z = Wx + b
a = σ(z)
In matrix form, every output unit takes a weighted sum of many input units. The bias becomes a vector. Then I apply a nonlinearity like ReLU, GELU, or tanh so the layer can do more than fit a single flat hyperplane.
What backprop changes is not its logic but its scale. I still compute a forward pass, a loss, local derivatives, and reverse-mode gradient accumulation. There are just many more nodes and many more parameters.

import matplotlib.pyplot as plt
from matplotlib.patches import FancyArrowPatch, FancyBboxPatch
def box(ax, xy, width, height, text, fc):
    # Rounded rectangle with a centered text label.
    patch = FancyBboxPatch(xy, width, height, boxstyle="round,pad=0.03",
                           edgecolor="#e5e7eb", facecolor=fc, linewidth=1.5)
    ax.add_patch(patch)
    ax.text(xy[0] + width / 2, xy[1] + height / 2, text, ha="center", va="center")

def arrow(ax, start, end):
    # Connector arrow between two boxes.
    ax.add_patch(FancyArrowPatch(start, end, arrowstyle="->", mutation_scale=12,
                                 linewidth=1.8, color="#6b7280"))

fig, axes = plt.subplots(1, 2, figsize=(12, 5.5))
for ax in axes:
    ax.axis("off")
axes[0].set_xlim(0, 10); axes[0].set_ylim(0, 6)
box(axes[0], (0.8, 2.2), 1.6, 1.2, "x", "#dbeafe")
box(axes[0], (3.2, 2.2), 1.8, 1.2, "w·x", "#fee2e2")
box(axes[0], (5.8, 2.2), 1.8, 1.2, "+ b", "#ede9fe")
box(axes[0], (8.2, 2.2), 1.2, 1.2, "ŷ", "#dcfce7")
arrow(axes[0], (2.4, 2.8), (3.2, 2.8)); arrow(axes[0], (5.0, 2.8), (5.8, 2.8)); arrow(axes[0], (7.6, 2.8), (8.2, 2.8))
axes[1].set_xlim(0, 12); axes[1].set_ylim(0, 8)
box(axes[1], (0.5, 2.0), 1.8, 3.6, "x\n[x₁\n x₂\n x₃]", "#dbeafe")
box(axes[1], (3.6, 1.6), 2.4, 4.4, "W", "#fee2e2")
box(axes[1], (7.0, 2.2), 1.6, 3.2, "+ b", "#ede9fe")
box(axes[1], (9.6, 2.0), 1.8, 3.6, "z\n[z₁\n z₂]", "#dcfce7")
arrow(axes[1], (2.3, 3.8), (3.6, 3.8)); arrow(axes[1], (6.0, 3.8), (7.0, 3.8)); arrow(axes[1], (8.6, 3.8), (9.6, 3.8))
fig.savefig("1d-vs-matrix.png", dpi=150, bbox_inches="tight")

What to notice in this diagram is that the scalar picture did not disappear. It got packed into a matrix multiply. Each output row is still doing a weighted sum plus a bias.
The next conceptual ingredient is the nonlinearity. If I only stack affine layers with no activation, the whole network collapses into another affine layer. The nonlinearity is what makes depth useful.
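The collapse claim can be verified in a few lines. Substituting one affine layer into another gives W2(W1 x + b1) + b2 = (W2 W1) x + (W2 b1 + b2), which is a single affine layer. The sketch below checks that identity on random weights. The layer sizes and the seed are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, only for reproducibility

# Two affine layers with no activation: 3 inputs -> 4 hidden -> 2 outputs.
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)
x = rng.normal(size=3)

two_layers = W2 @ (W1 @ x + b1) + b2

# The same map written as one affine layer.
W_merged = W2 @ W1
b_merged = W2 @ b1 + b2
one_layer = W_merged @ x + b_merged

# With a ReLU between the layers, the merge above no longer applies in general.
with_relu = W2 @ np.maximum(W1 @ x + b1, 0.0) + b2

print(np.allclose(two_layers, one_layer))
```

The merged weights have shape (2, 3), so the four hidden units add no expressive power without an activation between the layers.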
| Model View | Formula | What Changes |
|---|---|---|
| Single feature regression | ŷ = wx + b | One input, one weight, one bias. |
| Dense layer | z = Wx + b | Many inputs and outputs coupled through a matrix. |
| Neural network block | a = σ(Wx + b) | Same affine core, plus a nonlinearity before the next layer. |
| Deep stack | a_(l+1) = σ(W_l a_l + b_l) | The pattern repeats layer after layer. |
That is the bridge from one scalar line to larger networks.
"Stack enough wx+b blocks, add nonlinearities, and you have the core pattern behind modern neural networks."
The simple sentence hiding underneath a lot of complexity
Summary
If I compress the whole post into the pieces I want to keep cached, it looks like this.
- wx + b is the smallest model that already contains prediction, error, gradients, and learning.
- The forward pass computes ŷ = wx + b for each data point.
- Mean squared error turns the residual picture into a single scalar objective.
- The MSE loss landscape for linear regression is a convex bowl.
- The computation graph makes the backward pass feel like signal flow rather than magic.
- dL/dw = (2/n) * sum(x * (ŷ - y)) and dL/db = (2/n) * sum(ŷ - y) come straight from the chain rule.
- The update rule w := w - lr * dL/dw is just a downhill step in parameter space.
- The jump from wx+b to Wx+b changes scale, not core logic.
- Nonlinearities make stacked affine layers expressive.
- That is why I can honestly say: stack enough wx+b and you get GPT.
The way I think about backprop now is very concrete. It is not a mystical training ritual. It is a bookkeeping system for local sensitivities flowing backward through a graph.
And the reason I keep returning to the scalar case is that it never stops being true. Large models add width, depth, matrix multiplies, nonlinearities, normalization, attention, and a mountain of systems work, but the learning signal is still built from derivatives of composed functions.
In the next post, we'll see what happens when x, w, and b are not scalars but arrays, and how the exact same story becomes matrix calculus.