Derivatives And The Chain Rule: The Math Of Tiny Changes

Lab note

This post walks through the derivative and chain rule from first principles, with visualizations. It is the companion piece to my carousel post on the same topic.

I needed this topic to stop feeling like ceremonial math notation. I needed it to feel like an engineering tool. Once I translated the derivative into the language of local change, and the chain rule into the language of composed sensitivities, calculus stopped looking decorative and started looking operational.

That shift mattered for me because I keep running into derivatives in places that are not introduced as calculus classes. They show up in robotics, control, optimization, machine learning, and backpropagation. If I want to understand why a system moves, converges, explodes, or stalls, rates of change are usually somewhere in the story.

So this is the version I wrote for myself. I am not trying to reproduce a textbook chapter. I am trying to explain how I now think about limits, derivatives, and the chain rule when I am building models, reading papers, or debugging why a gradient-based method behaves strangely.

Why Derivatives Matter

Rates of change are everywhere. If position changes with time, I call that velocity. If cost changes with production, I call that marginal cost. If loss changes with a parameter in a neural network, I call that a gradient. The vocabulary changes, but the underlying question is the same.

The derivative answers a very practical question: if I nudge the input a little right here, how much will the output move? Not on average across the whole graph. Not somewhere far away. Right here, locally, at this operating point.

That local view is what makes derivatives so useful. A system can be calm in one region, unstable in another, and flat in a third. The derivative tells me which regime I am currently in. It is less like a global summary, and more like an instrument reading taken at a specific point.

The framing that unlocked this for me

A derivative is not a magic symbol. It is a disciplined way to ask, “if I perturb the input a tiny amount, what is the local response of the output?” That is exactly the kind of question I care about in ML, robotics, and control.

Domain	Quantity	What The Derivative Means	Why I Care
Physics	Position `x(t)`	`dx/dt` is velocity, and another derivative gives acceleration.	Motion is literally described by change over time.
Economics	Total cost `C(q)`	`dC/dq` is marginal cost, the local cost of producing one more unit.	It separates average cost stories from incremental decision-making.
Machine learning	Loss `L(w)`	`dL/dw` tells me whether changing a parameter raises or lowers loss.	This is the signal gradient descent follows.
Control systems	Plant response	Derivatives measure sensitivity, response speed, and local stability behavior.	I cannot tune a controller well if I do not understand local response.
Robotics	Trajectory or error signal	The derivative tells me how quickly state or error is evolving.	That matters for estimation, control, and planning loops.

I also like a sign-based interpretation. Even before I compute an exact formula, I can ask whether the derivative should be positive, negative, near zero, or huge. That already tells me a lot about what a system is doing.

If `f'(x)` Is...	Local Meaning	Practical Interpretation
Positive	The function is increasing at that point.	A small increase in input pushes the output upward.
Negative	The function is decreasing at that point.	A small increase in input pushes the output downward.
Zero	The graph is locally flat to first order.	I may be near a minimum, maximum, inflection point (a saddle in higher dimensions), or plateau.
Large magnitude	The function is changing rapidly.	Small input mistakes can produce big output swings.

That is why I do not see derivatives as abstract decoration anymore. They are a local sensitivity language. And once I say it that way, the rest of the topic becomes much easier to organize.

What Is A Limit?

Before I can define instantaneous change, I need the idea of approach. That is what a limit gives me. It tells me what value a function tends toward as the input gets closer and closer to some point.

lim(x→c) f(x) = L

I read that as: as x gets close to c, the outputs f(x) get close to L. The important word is approach. A limit is not always about plugging in the point directly. It is about what nearby values are doing.

That distinction feels subtle the first time, but it is the entire foundation of the derivative. Instantaneous slope sounds impossible if I insist on using only a single point. The limit gets around that by letting me study behavior in a shrinking neighborhood.

The classic example is sin(x)/x as x → 0. If I plug in x = 0 directly, I get 0/0, which is indeterminate. But if I look at values near zero, the function clearly settles toward 1.

That is why this example matters. It separates “what happens exactly at the point” from “what the function is tending toward as I approach the point.” Limits care about the second idea.

Plot of sin(x) divided by x approaching 1 as x approaches 0, with a dashed horizontal line at y equals 1 and an open circle at the limiting point

Generate this plot

python

import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(-4 * np.pi, 4 * np.pi, 4000)
y = np.sinc(x / np.pi)          # sinc already computes sin(πx)/(πx)

fig, ax = plt.subplots(figsize=(10, 6))
ax.plot(x, y, color="#3b82f6", linewidth=2.8, label=r"$f(x)=\sin(x)/x$")
ax.axhline(1, color="#f59e0b", linestyle="--", linewidth=2.0, label="limit = 1")

# open circle at the limit point
ax.scatter([0], [1], s=130, facecolors="#ffffff",
           edgecolors="#ef4444", linewidths=2.2, zorder=6)

ax.annotate(r"$\lim_{x \to 0}\, \sin(x)/x = 1$",
            xy=(0, 1), xytext=(1.8, 1.35),
            arrowprops=dict(arrowstyle="->", color="#6b7280", linewidth=1.5),
            fontsize=13)

ax.set_xlim(-4 * np.pi, 4 * np.pi)
ax.set_ylim(-0.35, 1.45)
ax.set_title("The Limit of sin(x)/x as x → 0")
ax.set_xlabel("x"); ax.set_ylabel("sin(x)/x")
ax.legend(); ax.grid(True, alpha=0.3)
fig.savefig("limit-sinx-over-x.png", dpi=150, bbox_inches="tight")

I like this plot because it makes the mental move visual. The curve wiggles, but near the origin it squeezes toward 1. The open circle reminds me that the limit statement is about the approached value, not necessarily the raw formula evaluated at that exact point.

`x`	`sin(x)/x` (approx.)	What I Notice
`-1.0`	`0.8415`	Already below 1, but still fairly close.
`-0.5`	`0.9589`	Moving closer to 1 as I approach zero.
`-0.1`	`0.9983`	Very close to 1.
`0.1`	`0.9983`	Same behavior from the right side.
`0.5`	`0.9589`	Symmetry makes the left and right stories agree.
`1.0`	`0.8415`	Farther away again, because I am no longer in the tiny neighborhood around zero.
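
The table values are easy to reproduce with the standard library. This is a minimal sketch; the helper name `sinc_ratio` is my own label for illustration, not a library function.

```python
import math

def sinc_ratio(x):
    """Return sin(x)/x for nonzero x (hypothetical helper name)."""
    return math.sin(x) / x

# Approach zero from both sides and watch the ratio settle toward 1.
for x in [-1.0, -0.5, -0.1, 0.1, 0.5, 1.0, 0.01, 0.001]:
    print(f"x = {x:>6}: sin(x)/x = {sinc_ratio(x):.6f}")
```

I never evaluate the function at `x = 0` here. The trend in the nearby samples is the whole point.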

A limit is a trend, not a plug-in

When I evaluate a limit, I am watching the output stabilize as the input moves closer and closer to a target. That is why a limit can exist even when direct substitution is undefined, ambiguous, or just not the right question.

This is also a useful place to separate two different situations. Sometimes a function has a removable hole, like the sin(x)/x story at zero. Other times a function truly blows up, or the left and right sides disagree. In those cases, the two-sided limit does not exist as a finite number.

So when I see a limit, I ask three things. What are nearby values doing? Do the left and right sides agree? Is there a stable value emerging? If the two sides agree and a stable value emerges, I have a meaningful local target.
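
Those three questions can be probed numerically. The sketch below samples a function just left and just right of a point at shrinking distances. The helper `one_sided_estimates` and the two extra test functions (`|x|/x` and `1/x²`) are my own illustrative choices, and sampling only suggests a limit, it does not prove one.

```python
import math

def one_sided_estimates(f, c, steps=(1e-1, 1e-2, 1e-3, 1e-4)):
    """Sample f just left and just right of c at shrinking distances."""
    left = [f(c - h) for h in steps]
    right = [f(c + h) for h in steps]
    return left, right

removable = lambda x: math.sin(x) / x   # hole at 0, both sides head toward 1
jump = lambda x: abs(x) / x             # left side is -1, right side is +1
blow_up = lambda x: 1.0 / x ** 2        # grows without bound near 0

for name, f in [("sin(x)/x", removable), ("|x|/x", jump), ("1/x^2", blow_up)]:
    left, right = one_sided_estimates(f, 0.0)
    print(f"{name:>9}  closest left: {left[-1]:.6g}  closest right: {right[-1]:.6g}")
```

Only the first function shows both sides agreeing on a stable value. The second disagrees across sides, and the third keeps growing as the samples move in.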

From Limits To Derivatives

Once the idea of approach makes sense, the derivative definition stops feeling arbitrary. It is just the limit of an average rate of change as the interval shrinks to zero.

f'(x) = lim(h→0) [f(x+h) - f(x)] / h

The numerator, f(x+h) - f(x), is the change in output. The denominator, h, is the change in input. Their ratio is the slope of a secant line, which is an average slope over a small interval.

The derivative takes that secant idea and pushes it to the limit. As h gets smaller and smaller, the second point slides toward the first, and the secant line approaches the tangent line. That limiting slope is the derivative.

Geometrically, this was the picture that made everything click for me. A derivative is not some separate object floating above the graph. It is the slope you get when two nearby points collapse into one, but the ratio of changes still converges.

Plot of x squared with several secant lines from x equals 2 approaching the red tangent line at x equals 2

Generate this plot

python

import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(-1, 4, 600)
y = x ** 2
x0, y0 = 2.0, 4.0
h_values = [2.0, 1.0, 0.5, 0.25]
secant_colors = ["#cbd5e1", "#94a3b8", "#64748b", "#3b82f6"]

fig, ax = plt.subplots(figsize=(10, 6))
ax.plot(x, y, color="#10b981", linewidth=3, label=r"$f(x)=x^2$")

line_x = np.linspace(-0.5, 4.2, 300)
for h, color in zip(h_values, secant_colors):
    x1, y1 = x0 + h, (x0 + h) ** 2
    slope = (y1 - y0) / h
    ax.plot(line_x, y0 + slope * (line_x - x0),
            linestyle="--", linewidth=2, color=color, label=f"secant h={h:g}")
    ax.scatter([x1], [y1], color=color, s=45, zorder=5)

# tangent line: slope = 2x = 4 at x=2
ax.plot(line_x, y0 + 4 * (line_x - x0),
        color="#ef4444", linewidth=3.2, label="tangent at x=2")
ax.scatter([x0], [y0], color="#ef4444", s=80, zorder=6)

ax.set_xlim(-1, 4); ax.set_ylim(-1, 18)
ax.set_title("Secant Lines Approach the Tangent")
ax.legend(loc="upper left"); ax.grid(True, alpha=0.3)
fig.savefig("secant-to-tangent.png", dpi=150, bbox_inches="tight")

In the secant-to-tangent plot, the function is f(x) = x² and the point of interest is x = 2. The secant slopes keep approaching 4 as the horizontal gap h shrinks. That limiting value is the derivative at x = 2.

`h`	Second Point `2+h`	Secant Slope	What It Shows
`2.0`	`4.0`	`6.0`	With a wide interval, I only get a rough average slope.
`1.0`	`3.0`	`5.0`	The average slope is getting closer to the local one.
`0.5`	`2.5`	`4.5`	Smaller intervals capture more local behavior.
`0.25`	`2.25`	`4.25`	The value is visibly converging toward `4`.

Algebra shows the same story cleanly. For f(x) = x², I can compute the difference quotient exactly.

f(x) = x²

at x = 2:
[f(2+h) - f(2)] / h
= [(2+h)² - 4] / h
= [4 + 4h + h² - 4] / h
= (4h + h²) / h
= 4 + h

lim(h→0) (4 + h) = 4

So the derivative of x² at x = 2 is 4. The tangent line there has slope 4. The function is increasing, and increasing fairly steeply, at that point.
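
The same `4 + h` pattern shows up if I just compute the difference quotient for shrinking `h`. A minimal sketch, with `difference_quotient` as a name I picked for illustration:

```python
def difference_quotient(f, x, h):
    """Secant slope between x and x + h."""
    return (f(x + h) - f(x)) / h

square = lambda x: x ** 2

# For f(x) = x^2 at x = 2 the algebra says each slope should be 4 + h.
for h in [2.0, 1.0, 0.5, 0.25, 0.01, 0.0001]:
    print(f"h = {h:<7g} secant slope = {difference_quotient(square, 2.0, h):.6f}")
```

In floating point I cannot push `h` all the way to zero, since the subtraction in the numerator eventually loses precision. That is one reason the limit is a mathematical statement rather than a recipe for choosing a tiny `h`.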

Piece Of The Formula	Meaning	Why It Matters
`f(x+h)`	The function evaluated at a nearby point.	I need a second sample to talk about change.
`f(x+h) - f(x)`	Output change.	This measures how much the function moved vertically.
`h`	Input change.	This measures how far I moved horizontally.
The quotient	Average rate of change over a small interval.	This is the secant slope.
The limit	The interval size collapses toward zero.	That is what turns an average slope into an instantaneous one.

The derivative is local linear prediction

Another way I think about the derivative is this: near a point, a smooth function behaves approximately like a line. The derivative is the slope of that best local line. That viewpoint becomes extremely useful later in optimization.
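
Here is a small sketch of that idea using the running example f(x) = x² at x = 2, where the slope is 4. The helper `linear_prediction` is a hypothetical name. For this particular function the prediction error is exactly the `h²` term dropped in the difference quotient, so it shrinks much faster than the step itself.

```python
def linear_prediction(f_x0, slope, dx):
    """First-order estimate of f(x0 + dx) from the value and slope at x0."""
    return f_x0 + slope * dx

f = lambda x: x ** 2
x0, slope = 2.0, 4.0   # slope of x^2 at x = 2

for dx in [0.5, 0.1, 0.01]:
    actual = f(x0 + dx)
    predicted = linear_prediction(f(x0), slope, dx)
    print(f"dx = {dx:<5g} actual = {actual:.6f}  predicted = {predicted:.6f}  "
          f"error = {actual - predicted:.6f}")
```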

The next plot makes the local nature even clearer. The same function can have different slopes at different points, including flat, negative, and positive regions.

Plot of x cubed minus three x plus one with colored tangent lines at x equals negative one, zero, and one point five showing zero, negative, and positive slopes

Generate this plot

python

import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(-2.5, 3, 700)
y = x ** 3 - 3 * x + 1

tangent_points = [
    (-1.0, "#8b5cf6", "x=-1, slope=0"),
    ( 0.0, "#ef4444", "x=0, slope=-3"),
    ( 1.5, "#f59e0b", "x=1.5, slope=3.75"),
]

fig, ax = plt.subplots(figsize=(10, 6))
ax.plot(x, y, color="#3b82f6", linewidth=3, label=r"$f(x)=x^3-3x+1$")

for x0, color, label in tangent_points:
    y0    = x0 ** 3 - 3 * x0 + 1
    slope = 3 * x0 ** 2 - 3           # f'(x) = 3x² − 3
    local_x = np.linspace(x0 - 0.95, x0 + 0.95, 80)
    ax.plot(local_x, y0 + slope * (local_x - x0),
            color=color, linewidth=2.8, label=label)
    ax.scatter([x0], [y0], color=color, s=65, zorder=5)

ax.set_title("The Derivative as Slope at a Point")
ax.legend(loc="upper left"); ax.grid(True, alpha=0.3)
fig.savefig("derivative-as-slope.png", dpi=150, bbox_inches="tight")

I like the cubic example because one picture shows several derivative behaviors. At x = -1, the tangent is flat. At x = 0, the tangent slopes downward. At x = 1.5, it slopes upward. Same function, different local stories.

That is why the derivative is a pointwise measurement. Asking for “the derivative of the function” really means asking for a new function, one that assigns a slope to every input where the slope exists.
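
One way to make "a new function" concrete is to write something that takes `f` and hands back an approximate `f'`. This sketch uses a central difference with a step size I chose arbitrarily (`h = 1e-5`); `numerical_derivative` is an illustrative name and the result is an approximation, not a symbolic derivative.

```python
def numerical_derivative(f, h=1e-5):
    """Return a function that approximates f' with a central difference."""
    def df(x):
        return (f(x + h) - f(x - h)) / (2 * h)
    return df

cubic = lambda x: x ** 3 - 3 * x + 1
slope_at = numerical_derivative(cubic)

# Same three points as the tangent plot: flat, downhill, uphill.
for x0 in [-1.0, 0.0, 1.5]:
    print(f"x = {x0:>4}: slope ≈ {slope_at(x0):.4f}")
```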

Derivative Rules

The limit definition is the foundation, but nobody wants to re-derive everything from first principles every time. Derivative rules are compressed proofs. They package recurring limit arguments into reusable tools.

I still like keeping the limit definition in the back of my mind, because it tells me what the rules mean. But once the concept is secure, the rules are how I actually work quickly.

Four-panel visual summary of derivative rules showing power rule, sum rule, product rule, and chain rule with matching derivative curves

Generate this plot

python

import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(-2.5, 2.5, 600)
fig, axes = plt.subplots(2, 2, figsize=(10, 7))

# Power Rule
ax = axes[0, 0]
ax.plot(x, x**2, color="#3b82f6", linewidth=2.4, label=r"$x^2$")
ax.plot(x, x**3, color="#8b5cf6", linewidth=2.2, label=r"$x^3$")
ax.plot(x, 2*x,  color="#10b981", linestyle="--", label=r"$2x$")
ax.plot(x, 3*x**2, color="#f59e0b", linestyle="--", label=r"$3x^2$")
ax.set_title("Power Rule"); ax.legend(); ax.grid(True, alpha=0.3)

# Sum Rule
ax = axes[0, 1]
ax.plot(x, np.sin(x) + 0.5*x, color="#3b82f6", linewidth=2.5,
        label=r"$\sin(x)+0.5x$")
ax.plot(x, np.cos(x) + 0.5, color="#ef4444", linestyle="--",
        label=r"$\cos(x)+0.5$")
ax.set_title("Sum Rule"); ax.legend(); ax.grid(True, alpha=0.3)

# Product Rule
ax = axes[1, 0]
ax.plot(x, x*np.sin(x), color="#3b82f6", linewidth=2.5,
        label=r"$x\sin(x)$")
ax.plot(x, np.sin(x) + x*np.cos(x), color="#10b981", linestyle="--",
        label=r"$\sin(x)+x\cos(x)$")
ax.set_title("Product Rule"); ax.legend(); ax.grid(True, alpha=0.3)

# Chain Rule
ax = axes[1, 1]
ax.plot(x, np.sin(x**2), color="#3b82f6", linewidth=2.5,
        label=r"$\sin(x^2)$")
ax.plot(x, 2*x*np.cos(x**2), color="#ef4444", linestyle="--",
        label=r"$2x\cos(x^2)$")
ax.set_title("Chain Rule"); ax.legend(); ax.grid(True, alpha=0.3)

fig.tight_layout()
fig.savefig("derivative-rules-visual.png", dpi=150, bbox_inches="tight")

Rule	Statement	Example	Why It Is Useful
Power rule	`d/dx (xⁿ) = nxⁿ⁻¹`	`d/dx (x⁵) = 5x⁴`	Polynomial terms become easy to differentiate.
Constant rule	`d/dx (c) = 0`	`d/dx (7) = 0`	A fixed number does not change as `x` changes.
Sum rule	`d/dx (f + g) = f' + g'`	`d/dx (x² + sin x) = 2x + cos x`	I can differentiate term by term.
Product rule	`d/dx (fg) = f'g + fg'`	`d/dx (x² sin x) = 2x sin x + x² cos x`	Multiplication mixes both rates of change.
Quotient rule	`d/dx (f/g) = (f'g - fg') / g²`	`d/dx ((x²+1)/(x-1)) = ((2x)(x-1) - (x²+1)) / (x-1)²`	Division has its own bookkeeping, so I do not naively divide derivatives.

The power rule is usually where fluency starts. Once I know that x⁵ becomes 5x⁴, polynomials stop being intimidating. The constant rule then explains why trailing offsets disappear.

d/dx (x⁵) = 5x⁴

d/dx (3x⁴ - 2x² + 7)
= 12x³ - 4x + 0
= 12x³ - 4x
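
For polynomials, the power, constant, and sum rules together reduce to a one-line transformation on coefficients. A minimal sketch, assuming I store a polynomial as a list where index `k` holds the coefficient of `x^k` (that representation and the name `differentiate_poly` are my own choices):

```python
def differentiate_poly(coeffs):
    """coeffs[k] is the coefficient of x^k; return the derivative's coefficients."""
    # Power rule per term: c * x^k -> (k * c) * x^(k-1). The constant term drops out.
    return [k * c for k, c in enumerate(coeffs)][1:]

p = [7, 0, -2, 0, 3]            # 3x^4 - 2x^2 + 7
print(differentiate_poly(p))    # coefficients of 12x^3 - 4x, lowest power first
```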

The sum rule says I can differentiate each term on its own and add the results. That is why so much calculus work turns into structured bookkeeping. Break the expression into pieces, identify the operation, then apply the matching rule.

The product rule is important because multiplication is where naive shortcuts fail. In general, the derivative of a product is not just “the derivative of the left times the derivative of the right.” Each factor can change, so each factor contributes a term.

d/dx [x² sin(x)]
= (d/dx x²) sin(x) + x² (d/dx sin(x))
= (2x) sin(x) + x² cos(x)

The quotient rule is the same kind of caution for division. I use it when one changing quantity is sitting on top of another changing quantity. Again, the structure of the expression tells me which rule to reach for.

d/dx [(x² + 1) / (x - 1)]
= [(2x)(x - 1) - (x² + 1)(1)] / (x - 1)²
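
When I am unsure whether I applied a rule correctly, I compare the formula against a numerical slope. The sketch below does that for the product and quotient examples at an arbitrary test point `x = 2`, and also evaluates the tempting but wrong "derivative times derivative" shortcut. The helper `central_diff` and the step size are assumptions for illustration.

```python
import math

def central_diff(f, x, h=1e-6):
    """Approximate f'(x) with a symmetric difference."""
    return (f(x + h) - f(x - h)) / (2 * h)

x = 2.0

product = lambda t: t ** 2 * math.sin(t)
product_rule = 2 * x * math.sin(x) + x ** 2 * math.cos(x)
naive_product = (2 * x) * math.cos(x)          # wrong: multiplies the two derivatives

quotient = lambda t: (t ** 2 + 1) / (t - 1)
quotient_rule = ((2 * x) * (x - 1) - (x ** 2 + 1)) / (x - 1) ** 2

print("product : numeric", central_diff(product, x), "rule", product_rule,
      "naive", naive_product)
print("quotient: numeric", central_diff(quotient, x), "rule", quotient_rule)
```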

Common Mistake	Why It Fails	Correct Thought
Treating every expression like a sum.	Multiplication and composition do not distribute that simply.	Identify the outer operation before differentiating.
Forgetting the squared denominator in the quotient rule.	The `g²` comes from differentiating `1/g`, so dropping it changes the answer.	Memorize the pattern only after understanding the structure.
Dropping the inner derivative in a composition.	That loses one of the local sensitivities.	Outer derivative first, then multiply by the inner derivative.

The practical habit I use

Before I differentiate, I ask what operation built the function. Is it a sum, a product, a quotient, or a composition? The structure determines the rule.

That habit sounds almost too simple, but it is what keeps derivative work from becoming a blur of symbols. Once I see the expression tree clearly, the derivative becomes a structural transformation.

The Chain Rule

The chain rule is the derivative rule I cared about most once I started thinking seriously about ML and backpropagation. Real functions are rarely isolated. They are nested. One transformation feeds another.

If I square an input and then pass it through a sine, that is a composition. If a neural network layer computes a weighted sum, then applies a nonlinearity, then feeds another layer, that is a composition too. So I need a way to differentiate through layers of dependence.

Why compositions matter

If one quantity depends on a second quantity, and the second depends on x, then the total sensitivity has to combine both links. That combination rule is the chain rule.

d/dx f(g(x)) = f'(g(x)) · g'(x)

I read this in words as: differentiate the outside, evaluated at the inside, and then multiply by the derivative of the inside. Outer first. Inner second. Multiply the local sensitivities.

The worked example I keep coming back to is sin(x²). It is simple enough to see clearly, but rich enough to show exactly why the extra factor appears.

Piece	Role	Derivative Contribution
`g(x) = x²`	Inner function.	`g'(x) = 2x`
`f(u) = sin(u)`	Outer function, written with a temporary input `u`.	`f'(u) = cos(u)`
`f(g(x)) = sin(x²)`	Composition of the two pieces.	`f'(g(x)) · g'(x) = cos(x²) · 2x`

y = sin(x²)

let u = x²
then y = sin(u)

dy/du = cos(u)
du/dx = 2x

dy/dx = (dy/du)(du/dx)
      = cos(x²) · 2x

That last line is the whole point. The derivative is not just cos(x²). If I stop there, I have only differentiated the outer sine. I still need to account for the fact that the input to that sine, namely x², is also changing with x.
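
A quick numerical check makes the missing factor visible. This sketch compares a central-difference slope of sin(x²) against the full chain rule result and against the outer derivative alone. The helper `central_diff`, the step size, and the test points are my own choices.

```python
import math

def central_diff(f, x, h=1e-6):
    """Approximate f'(x) with a symmetric difference."""
    return (f(x + h) - f(x - h)) / (2 * h)

y = lambda x: math.sin(x ** 2)
chain_rule = lambda x: math.cos(x ** 2) * 2 * x   # outer derivative times inner derivative
outer_only = lambda x: math.cos(x ** 2)           # forgets the inner derivative

for x in [1.0, 1.5, 2.0]:
    print(f"x = {x}: numeric = {central_diff(y, x):+.5f}  "
          f"chain rule = {chain_rule(x):+.5f}  outer only = {outer_only(x):+.5f}")
```

I avoided `x = 0.5` on purpose: there `2x = 1`, so the wrong answer happens to match the right one.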

Two-panel plot showing sin of x squared on top and its derivative two x times cosine of x squared below, with annotations for inner and outer function roles

Generate this plot

python

import matplotlib.pyplot as plt
import numpy as np

x  = np.linspace(-3, 3, 1200)
y  = np.sin(x ** 2)
dy = 2 * x * np.cos(x ** 2)

fig, axes = plt.subplots(2, 1, figsize=(10, 7), sharex=True)

axes[0].plot(x, y, color="#3b82f6", linewidth=2.8, label=r"$y=\sin(x^2)$")
axes[0].text(-2.85, 0.82,
    "Inner function: g(x)=x²\nOuter function: f(u)=sin(u)",
    fontsize=12,
    bbox=dict(facecolor="#fff", edgecolor="#e5e7eb", boxstyle="round,pad=0.4"))
axes[0].set_title("Chain Rule: d/dx sin(x²) = 2x·cos(x²)")
axes[0].legend(); axes[0].grid(True, alpha=0.3)

axes[1].plot(x, dy, color="#ef4444", linewidth=2.8, label=r"$y'=2x\cos(x^2)$")
axes[1].text(-2.85, 4.2,
    "f'(g(x)) gives cos(x²)\ng'(x) gives 2x\nMultiply them together.",
    fontsize=12,
    bbox=dict(facecolor="#fff", edgecolor="#e5e7eb", boxstyle="round,pad=0.4"))
axes[1].set_xlabel("x"); axes[1].legend(); axes[1].grid(True, alpha=0.3)

fig.tight_layout(rect=[0, 0, 1, 0.98])
fig.savefig("chain-rule-composition.png", dpi=150, bbox_inches="tight")

The plot makes the composition feel physical. The inner square changes how quickly the sine is traversed. Farther from zero, the argument x² changes faster, so the outer sine is being driven through its oscillations differently. That is why the derivative picks up the extra 2x factor.

I also like drawing the computation graph in words. It makes the chain rule feel like signal flow rather than symbol juggling.

x ──g(x)=x²──▶ u ──f(u)=sin(u)──▶ y

outer local derivative: dy/du = cos(u)
inner local derivative: du/dx = 2x
total derivative:      dy/dx = (dy/du)(du/dx)

This is exactly the structure that reappears in backpropagation. Every node in a computation graph contributes a local derivative. The total gradient is built by chaining those local pieces together.

If I want one more example, I can use the same logic on (3x² + 1)^5. The outer function is u^5. The inner function is 3x² + 1. So the derivative is 5(3x² + 1)^4 · 6x. Same rule, different outer and inner choices.
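
The "same rule, different pieces" point can be written as a tiny helper that assembles `f'(g(x)) · g'(x)` from whatever outer and inner functions I hand it. This is a sketch with a hypothetical name, `chain_derivative`. At `x = 1` the formula gives 5 · 4⁴ · 6 = 7680, which the numerical slope should approximate.

```python
def chain_derivative(outer_prime, inner, inner_prime):
    """Build x -> f'(g(x)) * g'(x) from the three pieces."""
    return lambda x: outer_prime(inner(x)) * inner_prime(x)

inner = lambda x: 3 * x ** 2 + 1      # g(x)
inner_prime = lambda x: 6 * x         # g'(x)
outer = lambda u: u ** 5              # f(u)
outer_prime = lambda u: 5 * u ** 4    # f'(u)

d = chain_derivative(outer_prime, inner, inner_prime)

# Compare against a central-difference slope of the composed function.
x, h = 1.0, 1e-6
numeric = (outer(inner(x + h)) - outer(inner(x - h))) / (2 * h)
print("chain rule:", d(x), " numeric:", numeric)
```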

Computation Graph View	Question	Answer
Local derivative at the outer node	How does the output change if the middle value changes?	Measure `dy/du`.
Local derivative at the inner node	How does the middle value change if the original input changes?	Measure `du/dx`.
Total derivative	How does the output change if the original input changes?	Multiply the local sensitivities: `dy/dx = (dy/du)(du/dx)`.
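
The three rows of that table map directly onto a forward pass that caches intermediate values and a backward pass that multiplies local derivatives. The sketch below hard-codes the sin(x²) graph; the `forward`/`backward` split and the dictionary cache are my own illustrative structure, not a description of how any particular framework works.

```python
import math

def forward(x):
    """Forward pass: x -> u = x^2 -> y = sin(u), caching what backward needs."""
    u = x ** 2
    y = math.sin(u)
    return y, {"x": x, "u": u}

def backward(cache):
    """Backward pass: multiply local derivatives from the output back to the input."""
    dy_du = math.cos(cache["u"])   # local derivative at the outer node
    du_dx = 2 * cache["x"]         # local derivative at the inner node
    return dy_du * du_dx           # total derivative dy/dx

y, cache = forward(1.5)
print("y =", y, " dy/dx =", backward(cache))
```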

Outer first, inner second

This is the phrase I actually remember. Differentiate the outside while keeping the inside in place, then multiply by the derivative of the inside. It is the shortest correct mental algorithm I know.

Derivatives In Practice

Once I started viewing derivatives as local sensitivity measurements, the application list stopped feeling like a bunch of unrelated examples. They are all asking the same question in different costumes.

Field	Function	Derivative Meaning	Why It Matters
Physics	`x(t)`	Velocity is `dx/dt`, acceleration is `d²x/dt²`.	Motion, force, and trajectory tracking all depend on change rates.
Economics	`C(q)`, `R(q)`	Marginal cost and marginal revenue are derivatives.	Decisions are made on incremental impact, not only totals.
Machine learning	`L(w)`	The gradient tells me how sensitive loss is to parameter changes.	Training depends on updating parameters in useful directions.
Engineering	System response curves	Derivatives reveal slope, gain, and local sensitivity.	This matters for control, calibration, and stability analysis.
Signal processing	Changing waveform or error signal	Derivatives emphasize rapid transitions and local structure.	Edge detection, filtering, and estimation all use change information.

Physics is the cleanest story. Position differentiated once becomes velocity. Differentiate again and I get acceleration. The derivative literally tracks how motion is evolving.
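
As a sketch of that story, here is a free-fall position function differentiated numerically once and twice. The constant `G = 9.81` and the model x(t) = ½·G·t² are assumptions I picked for illustration, as are the helper names and step size.

```python
G = 9.81  # m/s^2, assumed constant gravitational acceleration

def position(t):
    """Distance fallen after t seconds, starting from rest."""
    return 0.5 * G * t ** 2

def first_derivative(f, t, h=1e-3):
    """Central-difference estimate of the velocity."""
    return (f(t + h) - f(t - h)) / (2 * h)

def second_derivative(f, t, h=1e-3):
    """Central-difference estimate of the acceleration."""
    return (f(t + h) - 2 * f(t) + f(t - h)) / h ** 2

t = 2.0
print("velocity     ≈", first_derivative(position, t))    # analytic value: G * t
print("acceleration ≈", second_derivative(position, t))   # analytic value: G
```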

Economics changes the nouns, but not the logic. A total cost curve is useful, but the derivative answers the operational question: what is the cost of one more unit right now? That is a local question, so calculus is the natural tool.

In ML, the derivative is the thing that tells me whether a parameter update is helpful or harmful. If increasing a weight raises the loss, that local slope should probably push me in the opposite direction. If the slope is tiny, learning may stall.

In engineering and control, derivatives are tied to responsiveness. How sharply does the output react to a small change in input? How quickly is error changing? Is the response flattening, overshooting, or amplifying noise? These are derivative questions even when the equations are hidden behind hardware or software layers.

Connection To Gradient Descent

This is the bridge that made the whole topic worth internalizing for me. The derivative tells you which direction to step. Gradient descent is just the repeated act of using that local slope information to move downhill.

w_next = w - η dL/dw

That tiny update rule carries a lot of meaning. dL/dw is the local slope of the loss with respect to the parameter. η is the learning rate, which decides how far I trust that local linear guidance.

If the derivative is positive, moving right makes the loss worse, so I step left. If the derivative is negative, moving right makes the loss better, so I step right. If the derivative is near zero, I may be close to a flat region or an optimum.

One-dimensional gradient descent on a parabola-like loss curve with arrows showing iterative steps from w equals 8 toward the minimum at w equals 3

Generate this plot

python

import matplotlib.pyplot as plt
import numpy as np

def loss(w): return (w - 3) ** 2 + 1
def grad(w): return 2 * (w - 3)

x = np.linspace(-1, 9, 500)
y = loss(x)

# run gradient descent
lr = 0.3
steps = [8.0]
for _ in range(6):
    steps.append(steps[-1] - lr * grad(steps[-1]))
step_vals = loss(np.array(steps))

fig, ax = plt.subplots(figsize=(10, 6))
ax.plot(x, y, color="#3b82f6", linewidth=3, label=r"$L(w)=(w-3)^2+1$")
ax.scatter(steps, step_vals, color="#ef4444", s=55, zorder=5,
           label="gradient descent steps")

for i, (w, v) in enumerate(zip(steps, step_vals)):
    label = "start" if i == 0 else f"step {i}"
    ax.annotate(label, xy=(w, v), xytext=(w + 0.12, v + 0.7), fontsize=11)
    if i < len(steps) - 1:
        ax.annotate("", xy=(steps[i+1], step_vals[i+1]), xytext=(w, v),
                    arrowprops=dict(arrowstyle="->", color="#10b981", linewidth=2))

ax.scatter([3], [1], marker="*", s=220, color="#f59e0b", zorder=6, label="minimum")
ax.set_title("Gradient Descent: Following the Derivative Downhill")
ax.set_xlabel("w"); ax.set_ylabel("loss")
ax.legend(loc="upper right"); ax.grid(True, alpha=0.3)
fig.savefig("gradient-descent-1d.png", dpi=150, bbox_inches="tight")

I like this one-dimensional picture because it strips away the intimidation. Gradient descent is just, “look at the slope, then move downhill.” In high dimensions, the geometry is more complicated, but the local logic is the same.
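
The learning rate's role as "how far I trust the slope" is easy to see on the same loss, L(w) = (w - 3)² + 1. This sketch reruns the update rule with three step sizes; the values `0.1` and `1.1` are my own additions alongside the `0.3` used in the plot, and `descend` is an illustrative helper name.

```python
def grad(w):
    """Derivative of L(w) = (w - 3)^2 + 1."""
    return 2 * (w - 3)

def descend(w, lr, n_steps):
    """Apply w <- w - lr * dL/dw repeatedly and return every iterate."""
    history = [w]
    for _ in range(n_steps):
        w = w - lr * grad(w)
        history.append(w)
    return history

for lr in [0.1, 0.3, 1.1]:
    final = descend(8.0, lr, 6)[-1]
    print(f"lr = {lr}: after 6 steps w = {final:.4f}, distance from minimum = {abs(final - 3):.4f}")
```

For this quadratic, each step multiplies the distance to the minimum by `1 - 2·lr`. A small rate creeps, a moderate rate converges quickly, and a rate above 1 overshoots further every step.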

Backprop is chain rule bookkeeping

Gradient descent needs derivatives of the loss with respect to many internal parameters. Backpropagation computes those derivatives efficiently by applying the chain rule across a computation graph. That is why I think of the chain rule as the engine behind backpropagation.

This is the sentence I wish I had heard earlier: the derivative tells you which direction to step, and backpropagation is just the chain rule applied across a computation graph to compute that derivative. Once that clicked, backprop stopped feeling mystical.

Every layer in a neural network transforms its input. So every layer contributes a local derivative. The full gradient is what I get when I combine those local derivatives correctly, in reverse, from the loss back to the earlier parameters.
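
To see that reverse bookkeeping in the smallest possible setting, here is a sketch of a scalar "network" y = w2 · tanh(w1 · x) with a squared-error loss. The architecture, the tanh nonlinearity, and every variable name are assumptions I made for illustration. Each backward line is one local derivative multiplied onto the running product.

```python
import math

def loss_and_grads(w1, w2, x, target):
    """Forward pass, then chain rule in reverse, for y = w2 * tanh(w1 * x)."""
    # Forward: cache every intermediate value.
    z = w1 * x
    h = math.tanh(z)
    y = w2 * h
    loss = (y - target) ** 2

    # Backward: start at the loss and multiply local derivatives going back.
    dL_dy = 2 * (y - target)
    dL_dw2 = dL_dy * h                 # y = w2 * h, so dy/dw2 = h
    dL_dh = dL_dy * w2                 # dy/dh = w2
    dL_dz = dL_dh * (1 - h ** 2)       # d tanh(z)/dz = 1 - tanh(z)^2
    dL_dw1 = dL_dz * x                 # z = w1 * x, so dz/dw1 = x
    return loss, dL_dw1, dL_dw2

loss, g1, g2 = loss_and_grads(w1=0.5, w2=-1.2, x=2.0, target=0.3)
print("loss:", loss, " dL/dw1:", g1, " dL/dw2:", g2)
```

Note that `dL_dy` is computed once and reused for both parameters. That sharing of upstream factors is where the efficiency of working backward comes from.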

That is also why derivative intuition matters when gradients vanish, explode, or get distorted by clipping, quantization, saturation, or bad scaling. Those are all stories about local sensitivity changing shape.
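
Vanishing gradients are the easiest of these to demonstrate. The sketch below chains sigmoid activations with no weights at all, which is a deliberately artificial toy, and multiplies the local derivatives the way the chain rule says to. Since a sigmoid's slope never exceeds 0.25, the product shrinks fast with depth.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def chained_sensitivity(x, depth):
    """Return d(output)/d(input) for `depth` sigmoids applied back to back."""
    total = 1.0
    value = x
    for _ in range(depth):
        value = sigmoid(value)
        # Local derivative: sigmoid'(z) = s(z) * (1 - s(z)), which is at most 0.25.
        total *= value * (1.0 - value)
    return total

for depth in [1, 2, 5, 10]:
    print(f"depth = {depth:>2}: input sensitivity = {chained_sensitivity(0.5, depth):.3e}")
```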

Gradient Descent Ingredient	Role	Derivative Connection
Loss function	Measures how wrong the model currently is.	I differentiate it to know which way improves it.
Parameter	The adjustable quantity I want to update.	The partial derivative tells me the parameter's local effect on loss.
Learning rate	Sets step size.	It scales how strongly I respond to the derivative signal.
Backpropagation	Efficient gradient computation for layered models.	It is repeated chain rule across the network graph.

So for me, the derivative and the chain rule are not only calculus topics. They are infrastructure for understanding optimization. And optimization is infrastructure for modern ML.

Summary

If I compress the whole post into a short list, these are the ideas I want to keep loaded.

Limits let me talk about what a function approaches, even when direct substitution is undefined or misleading.
The derivative is a limit of average change becoming instantaneous change.
Geometrically, secant lines approach a tangent line.
Operationally, the derivative is local slope and local sensitivity.
The power, sum, product, and quotient rules are reusable shortcuts built on the limit definition.
The chain rule handles compositions, which is how most real systems and models are actually built.
d/dx f(g(x)) = f'(g(x)) · g'(x) is the formal statement, but “outer first, inner second” is the memory hook I actually use.
Gradient descent follows derivative information downhill.
Backpropagation is repeated chain rule on a computation graph.
The chain rule is the engine behind backpropagation, which is why understanding it cleanly pays off later.

I wrote this the way I wish someone had explained it to me when I first started connecting calculus to ML, robotics, and optimization. The derivative is not only about curves in a textbook. It is about how systems respond when I perturb them.

The chain rule is not a memorization trap. It is the natural rule for how sensitivities combine when one process feeds another. That is exactly the structure I see in neural networks, control pipelines, and layered engineering systems.

So this post is really a bridge. It starts with limits, moves through slope, and ends at computation graphs. From here, the next step is obvious: the backprop carousel, where the chain rule stops being a chapter heading and becomes the mechanism that trains the model.

Need an intelligent system to work on real hardware?

Embedded systems · Robotics · Constrained AI · CPU and HPC · Accelerators · Distributed systems

ShivasNotes