This post walks through the derivative and chain rule from first principles, with visualizations. It is the companion piece to my carousel post on the same topic.
I needed this topic to stop feeling like ceremonial math notation. I needed it to feel like an engineering tool. Once I translated the derivative into the language of local change, and the chain rule into the language of composed sensitivities, calculus stopped looking decorative and started looking operational.
That shift mattered for me because I keep running into derivatives in places that are not introduced as calculus classes. They show up in robotics, control, optimization, machine learning, and backpropagation. If I want to understand why a system moves, converges, explodes, or stalls, rates of change are usually somewhere in the story.
So this is the version I wrote for myself. I am not trying to reproduce a textbook chapter. I am trying to explain how I now think about limits, derivatives, and the chain rule when I am building models, reading papers, or debugging why a gradient-based method behaves strangely.
Why Derivatives Matter
Rates of change are everywhere. If position changes with time, I call that velocity. If cost changes with production, I call that marginal cost. If loss changes with a parameter in a neural network, I call that a gradient. The vocabulary changes, but the underlying question is the same.
The derivative answers a very practical question: if I nudge the input a little right here, how much will the output move? Not on average across the whole graph. Not somewhere far away. Right here, locally, at this operating point.
That local view is what makes derivatives so useful. A system can be calm in one region, unstable in another, and flat in a third. The derivative tells me which regime I am currently in. It is less like a global summary, and more like an instrument reading taken at a specific point.
The framing that unlocked this for me
A derivative is not a magic symbol. It is a disciplined way to ask, “if I perturb the input a tiny amount, what is the local response of the output?” That is exactly the kind of question I care about in ML, robotics, and control.
| Domain | Quantity | What The Derivative Means | Why I Care |
|---|---|---|---|
| Physics | Position x(t) | dx/dt is velocity, and another derivative gives acceleration. | Motion is literally described by change over time. |
| Economics | Total cost C(q) | dC/dq is marginal cost, the local cost of producing one more unit. | It separates average cost stories from incremental decision-making. |
| Machine learning | Loss L(w) | dL/dw tells me whether changing a parameter raises or lowers loss. | This is the signal gradient descent follows. |
| Control systems | Plant response | Derivatives measure sensitivity, response speed, and local stability behavior. | I cannot tune a controller well if I do not understand local response. |
| Robotics | Trajectory or error signal | The derivative tells me how quickly state or error is evolving. | That matters for estimation, control, and planning loops. |
I also like a sign-based interpretation. Even before I compute an exact formula, I can ask whether the derivative should be positive, negative, near zero, or huge. That already tells me a lot about what a system is doing.
| If f'(x) Is... | Local Meaning | Practical Interpretation |
|---|---|---|
| Positive | The function is increasing at that point. | A small increase in input pushes the output upward. |
| Negative | The function is decreasing at that point. | A small increase in input pushes the output downward. |
| Zero | The graph is locally flat, or at least first-order flat. | I may be near a minimum, maximum, saddle, or plateau. |
| Large magnitude | The function is changing rapidly. | Small input mistakes can produce big output swings. |
That is why I do not see derivatives as abstract decoration anymore. They are a local sensitivity language. And once I say it that way, the rest of the topic becomes much easier to organize.
What Is A Limit?
Before I can define instantaneous change, I need the idea of approach. That is what a limit gives me. It tells me what value a function tends toward as the input gets closer and closer to some point.
lim(x→c) f(x) = L
I read that as: as x gets close to c, the outputs f(x) get close to L. The important word is approach. A limit is not always about plugging in the point directly. It is about what nearby values are doing.
That distinction feels subtle the first time, but it is the entire foundation of the derivative. Instantaneous slope sounds impossible if I insist on using only a single point. The limit gets around that by letting me study behavior in a shrinking neighborhood.
The classic example is sin(x)/x as x → 0. If I plug in x = 0 directly, I get 0/0, which is indeterminate. But if I look at values near zero, the function clearly settles toward 1.
That is why this example matters. It separates “what happens exactly at the point” from “what the function is tending toward as I approach the point.” Limits care about the second idea.

I like this plot because it makes the mental move visual. The curve wiggles, but near the origin it squeezes toward 1. The open circle reminds me that the limit statement is about the approached value, not necessarily the raw formula evaluated at that exact point.
| x | sin(x)/x (approx.) | What I Notice |
|---|---|---|
| -1.0 | 0.8415 | Already below 1, but still fairly close. |
| -0.5 | 0.9589 | Moving closer to 1 as I approach zero. |
| -0.1 | 0.9983 | Very close to 1. |
| 0.1 | 0.9983 | Same behavior from the right side. |
| 0.5 | 0.9589 | Symmetry makes the left and right stories agree. |
| 1.0 | 0.8415 | Farther away again, because I am no longer in the tiny neighborhood around zero. |
A limit is a trend, not a plug-in
When I evaluate a limit, I am watching the output stabilize as the input moves closer and closer to a target. That is why a limit can exist even when direct substitution is undefined, ambiguous, or just not the right question.
This is also a useful place to separate two different situations. Sometimes a function has a removable hole, like the sin(x)/x story at zero. Other times a function truly blows up, or the left and right sides disagree. In those cases, the limit does not exist.
So when I see a limit, I ask three things. What are nearby values doing? Do the left and right sides agree? Is there a stable value emerging? If the answer is yes, I have a meaningful local target.
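Those three questions can be checked numerically. This is a minimal sketch, not a proof: the helper name `approach` and the second test function `sign` (x/|x|, a case where the two sides disagree) are my own illustrative choices, not something from the plot above.

```python
import math

def approach(f, c, steps=6):
    """Sample f just left and just right of c with a shrinking gap h."""
    rows = []
    for k in range(1, steps + 1):
        h = 10.0 ** (-k)
        rows.append((h, f(c - h), f(c + h)))
    return rows

def sinc(x):
    # Undefined at x = 0 itself, but well behaved nearby.
    return math.sin(x) / x

def sign(x):
    # Also undefined at 0, and here the left and right sides disagree.
    return x / abs(x)

for h, left, right in approach(sinc, 0.0):
    print(f"h={h:g}  left={left:.10f}  right={right:.10f}")

for h, left, right in approach(sign, 0.0, steps=3):
    print(f"h={h:g}  left={left:+.1f}  right={right:+.1f}")
```

For sin(x)/x both columns settle on the same value, so a stable target emerges. For x/|x| the left column stays at -1 and the right column stays at +1, so there is no single value to approach.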
From Limits To Derivatives
Once the idea of approach makes sense, the derivative definition stops feeling arbitrary. It is just the limit of an average rate of change as the interval shrinks to zero.
f'(x) = lim(h→0) [f(x+h) - f(x)] / h
The numerator, f(x+h) - f(x), is the change in output. The denominator, h, is the change in input. Their ratio is the slope of a secant line, which is an average slope over a small interval.
The derivative takes that secant idea and pushes it to the limit. As h gets smaller and smaller, the second point slides toward the first, and the secant line approaches the tangent line. That limiting slope is the derivative.
Geometrically, this was the picture that made everything click for me. A derivative is not some separate object floating above the graph. It is the slope you get when two nearby points collapse into one, but the ratio of changes still converges.

In the secant-to-tangent plot, the function is f(x) = x² and the point of interest is x = 2. The secant slopes keep approaching 4 as the horizontal gap h shrinks. That limiting value is the derivative at x = 2.
| h | Second Point 2+h | Secant Slope | What It Shows |
|---|---|---|---|
| 2.0 | 4.0 | 6.0 | With a wide interval, I only get a rough average slope. |
| 1.0 | 3.0 | 5.0 | The average slope is getting closer to the local one. |
| 0.5 | 2.5 | 4.5 | Smaller intervals capture more local behavior. |
| 0.25 | 2.25 | 4.25 | The value is visibly converging toward 4. |
Algebra shows the same story cleanly. For f(x) = x², I can compute the difference quotient exactly.
f(x) = x²
at x = 2:
[f(2+h) - f(2)] / h
= [(2+h)² - 4] / h
= [4 + 4h + h² - 4] / h
= (4h + h²) / h
= 4 + h
lim(h→0) (4 + h) = 4
So the derivative of x² at x = 2 is 4. The tangent line there has slope 4. The function is increasing, and increasing fairly steeply, at that point.
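The same difference quotient can be evaluated directly. Here is a small sketch, with `secant_slope` and `square` as names I made up for illustration. The first four values of h reproduce the secant slopes 6, 5, 4.5, and 4.25, and the smaller ones keep closing in on 4.

```python
def secant_slope(f, x, h):
    """Average rate of change of f between x and x + h."""
    return (f(x + h) - f(x)) / h

def square(x):
    return x * x

for h in [2.0, 1.0, 0.5, 0.25, 0.01, 0.0001]:
    print(f"h={h:<8g} slope={secant_slope(square, 2.0, h):.6f}")
```

One practical caveat: on a computer I cannot shrink h forever. Once h gets extremely small, floating-point rounding in f(x+h) - f(x) starts to dominate, so a moderate h is usually the better choice for numerical checks.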
| Piece Of The Formula | Meaning | Why It Matters |
|---|---|---|
| f(x+h) | The function evaluated at a nearby point. | I need a second sample to talk about change. |
| f(x+h) - f(x) | Output change. | This measures how much the function moved vertically. |
| h | Input change. | This measures how far I moved horizontally. |
| The quotient | Average rate of change over a small interval. | This is the secant slope. |
| The limit | The interval size collapses toward zero. | That is what turns an average slope into an instantaneous one. |
The derivative is local linear prediction
Another way I think about the derivative is this: near a point, a smooth function behaves approximately like a line. The derivative is the slope of that best local line. That viewpoint becomes extremely useful later in optimization.
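To make "best local line" concrete, here is a minimal sketch using f(x) = x² at x = 2, where the derivative is 4. The function names are hypothetical. The prediction f(x₀) + f'(x₀)·dx is compared against the true value for shrinking steps dx.

```python
def f(x):
    return x * x

def f_prime(x):
    return 2 * x

def linear_prediction(x0, dx):
    """Predict f(x0 + dx) using only the value and slope at x0."""
    return f(x0) + f_prime(x0) * dx

x0 = 2.0
for dx in [0.5, 0.1, 0.01]:
    actual = f(x0 + dx)
    predicted = linear_prediction(x0, dx)
    print(f"dx={dx:<5g} actual={actual:.6f} "
          f"predicted={predicted:.6f} error={actual - predicted:.6f}")
```

For this particular function the error is exactly dx², so it shrinks much faster than the step itself. That is the sense in which the line is a good local model and a poor global one.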
The next plot makes the local nature even clearer. The same function can have different slopes at different points, including flat, negative, and positive regions.

I like the cubic example because one picture shows several derivative behaviors. At x = -1, the tangent is flat. At x = 0, the tangent slopes downward. At x = 1.5, it slopes upward. Same function, different local stories.
That is why the derivative is a pointwise measurement. Asking for “the derivative of the function” really means asking for a new function, one that assigns a slope to every input where the slope exists.
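A quick way to see "one slope per input" is to estimate the slope at several points of the same function. In this sketch I assume the cubic is x³ - 3x. That is only a guess that matches the flat, downward, and upward behavior described above, since the plot's exact formula is not stated. The helper `central_difference` is also my own naming.

```python
def central_difference(f, x, h=1e-5):
    """Symmetric slope estimate around x."""
    return (f(x + h) - f(x - h)) / (2 * h)

def cubic(x):
    # Assumed cubic for illustration; the post's plot does not give its formula.
    return x**3 - 3 * x

for x in [-1.0, 0.0, 1.5]:
    print(f"x={x:+.1f}  slope≈{central_difference(cubic, x):+.4f}")
```

Under that assumption the exact derivative is 3x² - 3, which gives 0 at x = -1, -3 at x = 0, and 3.75 at x = 1.5.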
Derivative Rules
The limit definition is the foundation, but nobody wants to re-derive everything from first principles every time. Derivative rules are compressed proofs. They package recurring limit arguments into reusable tools.
I still like keeping the limit definition in the back of my mind, because it tells me what the rules mean. But once the concept is secure, the rules are how I actually work quickly.

| Rule | Statement | Example | Why It Is Useful |
|---|---|---|---|
| Power rule | d/dx (xⁿ) = nxⁿ⁻¹ | d/dx (x⁵) = 5x⁴ | Polynomial terms become easy to differentiate. |
| Constant rule | d/dx (c) = 0 | d/dx (7) = 0 | A fixed number does not change as x changes. |
| Sum rule | d/dx (f + g) = f' + g' | d/dx (x² + sin x) = 2x + cos x | I can differentiate term by term. |
| Product rule | d/dx (fg) = f'g + fg' | d/dx (x² sin x) = 2x sin x + x² cos x | Multiplication mixes both rates of change. |
| Quotient rule | d/dx (f/g) = (f'g - fg') / g² | d/dx ((x²+1)/(x-1)) = ((2x)(x-1) - (x²+1)) / (x-1)² | Division has its own bookkeeping, so I do not naively divide derivatives. |
The power rule is usually where fluency starts. Once I know that x⁵ becomes 5x⁴, polynomials stop being intimidating. The constant rule then explains why trailing offsets disappear.
d/dx (x⁵) = 5x⁴
d/dx (3x⁴ - 2x² + 7)
= 12x³ - 4x + 0
= 12x³ - 4x
The sum rule says I can differentiate each term on its own and add the results. That is why so much calculus work turns into structured bookkeeping. Break the expression into pieces, identify the operation, then apply the matching rule.
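The power, constant, and sum rules together are mechanical enough to fit in a few lines. This is a sketch under one assumption of mine: a polynomial is stored as a list where `coeffs[k]` is the coefficient of x^k. The function name is hypothetical.

```python
def differentiate_polynomial(coeffs):
    """coeffs[k] is the coefficient of x**k; returns the derivative's coefficients."""
    # Power rule per term: c * x**k -> (k * c) * x**(k - 1).
    # The constant term (k = 0) drops out, and the sum rule lets me go term by term.
    return [k * c for k, c in enumerate(coeffs)][1:]

# 3x^4 - 2x^2 + 7, written as coefficients of x^0, x^1, x^2, x^3, x^4
p = [7, 0, -2, 0, 3]
print(differentiate_polynomial(p))  # [0, -4, 0, 12], i.e. 12x^3 - 4x
```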
The product rule is important because multiplication is where naive shortcuts fail. In general, the derivative of a product is not just “the derivative of the left times the derivative of the right.” Each factor can change, so each factor contributes a term.
d/dx [x² sin(x)]
= (d/dx x²) sin(x) + x² (d/dx sin(x))
= (2x) sin(x) + x² cos(x)
The quotient rule is the same kind of caution for division. I use it when one changing quantity is sitting on top of another changing quantity. Again, the structure of the expression tells me which rule to reach for.
d/dx [(x² + 1) / (x - 1)]
= [(2x)(x - 1) - (x² + 1)(1)] / (x - 1)²

| Common Mistake | Why It Fails | Correct Thought |
|---|---|---|
| Treating every expression like a sum. | Multiplication and composition do not distribute that simply. | Identify the outer operation before differentiating. |
| Forgetting the denominator square in the quotient rule. | The algebra of changing ratios is more constrained than it first looks. | Memorize the pattern only after understanding the structure. |
| Dropping the inner derivative in a composition. | That loses one of the local sensitivities. | Outer derivative first, then multiply by the inner derivative. |
The practical habit I use
Before I differentiate, I ask what operation built the function. Is it a sum, a product, a quotient, or a composition? The structure determines the rule.
That habit sounds almost too simple, but it is what keeps derivative work from becoming a blur of symbols. Once I see the expression tree clearly, the derivative becomes a structural transformation.
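When I am unsure whether I applied a rule correctly, I compare the formula against a numerical slope. Here is a minimal sketch for the product and quotient examples from this section. The helper names are mine, and the check point x = 2 is arbitrary.

```python
import math

def central_difference(f, x, h=1e-5):
    """Symmetric slope estimate around x."""
    return (f(x + h) - f(x - h)) / (2 * h)

def product(x):
    return x**2 * math.sin(x)

def product_rule(x):
    # f'g + fg' with f = x^2 and g = sin(x)
    return 2 * x * math.sin(x) + x**2 * math.cos(x)

def naive_product(x):
    # The tempting but wrong shortcut: f' times g'
    return 2 * x * math.cos(x)

def quotient(x):
    return (x**2 + 1) / (x - 1)

def quotient_rule(x):
    # (f'g - fg') / g^2 with f = x^2 + 1 and g = x - 1
    return ((2 * x) * (x - 1) - (x**2 + 1)) / (x - 1) ** 2

x = 2.0
print("product :", central_difference(product, x), product_rule(x), naive_product(x))
print("quotient:", central_difference(quotient, x), quotient_rule(x))
```

The product rule value agrees with the numerical slope, while the naive shortcut lands somewhere else entirely. That mismatch is usually the fastest way to catch a dropped term.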
The Chain Rule
The chain rule is the derivative rule I cared about most once I started thinking seriously about ML and backpropagation. Real functions are rarely isolated. They are nested. One transformation feeds another.
If I square an input and then pass it through a sine, that is a composition. If a neural network layer computes a weighted sum, then applies a nonlinearity, then feeds another layer, that is a composition too. So I need a way to differentiate through layers of dependence.
Why compositions matter
If one quantity depends on a second quantity, and the second depends on x, then the total sensitivity has to combine both links. That combination rule is the chain rule.
d/dx f(g(x)) = f'(g(x)) · g'(x)
I read this in words as: differentiate the outside, evaluated at the inside, and then multiply by the derivative of the inside. Outer first. Inner second. Multiply the local sensitivities.
The worked example I keep coming back to is sin(x²). It is simple enough to see clearly, but rich enough to show exactly why the extra factor appears.
| Piece | Role | Derivative Contribution |
|---|---|---|
| g(x) = x² | Inner function. | g'(x) = 2x |
| f(u) = sin(u) | Outer function, written with a temporary input u. | f'(u) = cos(u) |
| f(g(x)) = sin(x²) | Composition of the two pieces. | f'(g(x)) · g'(x) = cos(x²) · 2x |
y = sin(x²)
let u = x²
then y = sin(u)
dy/du = cos(u)
du/dx = 2x
dy/dx = (dy/du)(du/dx)
= cos(x²) · 2x
That last line is the whole point. The derivative is not just cos(x²). If I stop there, I have only differentiated the outer sine. I still need to account for the fact that the input to that sine, namely x², is also changing with x.
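The missing factor is easy to see numerically. This sketch compares a finite-difference slope of sin(x²) with the full chain rule answer and with the "outer only" answer. The function names and the test point x = 1.5 are my own choices.

```python
import math

def central_difference(f, x, h=1e-5):
    """Symmetric slope estimate around x."""
    return (f(x + h) - f(x - h)) / (2 * h)

def y(x):
    return math.sin(x**2)

def chain_rule(x):
    # Outer derivative evaluated at the inside, times the inner derivative.
    return math.cos(x**2) * 2 * x

def outer_only(x):
    # Forgets that x^2 is itself changing with x.
    return math.cos(x**2)

x = 1.5
print("numerical :", central_difference(y, x))
print("chain rule:", chain_rule(x))
print("outer only:", outer_only(x))
```

At x = 1.5 the inner derivative is 2x = 3, so the outer-only answer is off by exactly that factor.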

The plot makes the composition feel physical. The inner square changes how quickly the sine is traversed. Farther from zero, the argument x² changes faster, so the outer sine is driven through its oscillations more quickly. That is why the derivative picks up the extra 2x factor.
I also like drawing the computation graph in words. It makes the chain rule feel like signal flow rather than symbol juggling.
x ──g(x)=x²──▶ u ──f(u)=sin(u)──▶ y
outer local derivative: dy/du = cos(u)
inner local derivative: du/dx = 2x
total derivative: dy/dx = (dy/du)(du/dx)
This is exactly the structure that reappears in backpropagation. Every node in a computation graph contributes a local derivative. The total gradient is built by chaining those local pieces together.
If I want one more example, I can use the same logic on (3x² + 1)^5. The outer function is u^5. The inner function is 3x² + 1. So the derivative is 5(3x² + 1)^4 · 6x. Same rule, different outer and inner choices.
| Computation Graph View | Question | Answer |
|---|---|---|
| Local derivative at the outer node | How does the output change if the middle value changes? | Measure dy/du. |
| Local derivative at the inner node | How does the middle value change if the original input changes? | Measure du/dx. |
| Total derivative | How does the output change if the original input changes? | Multiply the local sensitivities: dy/dx = (dy/du)(du/dx). |
Outer first, inner second
This is the phrase I actually remember. Differentiate the outside while keeping the inside in place, then multiply by the derivative of the inside. It is the shortest correct mental algorithm I know.
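The computation graph view translates almost directly into code. What follows is a minimal sketch, not how any real autodiff library is implemented. Each node is a hypothetical `(function, derivative)` pair, the forward pass records each local slope, and the total derivative is their product. For (3x² + 1)^5 I split the inner function into two nodes, squaring and then 3u + 1, which is just a finer-grained version of the same outer and inner split.

```python
import math

def forward_backward(nodes, x):
    """nodes: list of (function, derivative) pairs applied left to right.

    Returns the final output and d(output)/dx.
    """
    value = x
    local_slopes = []
    for fn, dfn in nodes:
        local_slopes.append(dfn(value))  # local derivative at this node's input
        value = fn(value)
    total = 1.0
    for slope in reversed(local_slopes):  # chain sensitivities from output to input
        total *= slope
    return value, total

square = (lambda x: x * x, lambda x: 2 * x)
sine = (math.sin, math.cos)
affine = (lambda u: 3 * u + 1, lambda u: 3.0)
fifth = (lambda u: u**5, lambda u: 5 * u**4)

print(forward_backward([square, sine], 1.5))           # sin(x^2) at x = 1.5
print(forward_backward([square, affine, fifth], 1.0))  # (3x^2 + 1)^5 at x = 1
```

At x = 1 the second call returns 1024 for the value and 7680 for the slope, which matches 5(3x² + 1)^4 · 6x = 5 · 256 · 6.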
Derivatives In Practice
Once I started viewing derivatives as local sensitivity measurements, the application list stopped feeling like a bunch of unrelated examples. They are all asking the same question in different costumes.
| Field | Function | Derivative Meaning | Why It Matters |
|---|---|---|---|
| Physics | x(t) | Velocity is dx/dt, acceleration is d²x/dt². | Motion, force, and trajectory tracking all depend on change rates. |
| Economics | C(q), R(q) | Marginal cost and marginal revenue are derivatives. | Decisions are made on incremental impact, not only totals. |
| Machine learning | L(w) | The gradient tells me how sensitive loss is to parameter changes. | Training depends on updating parameters in useful directions. |
| Engineering | System response curves | Derivatives reveal slope, gain, and local sensitivity. | This matters for control, calibration, and stability analysis. |
| Signal processing | Changing waveform or error signal | Derivatives emphasize rapid transitions and local structure. | Edge detection, filtering, and estimation all use change information. |
Physics is the cleanest story. Position differentiated once becomes velocity. Differentiate again and I get acceleration. The derivative literally tracks how motion is evolving.
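As a small illustration, here is a sketch that recovers velocity and acceleration from a position function by finite differences. The position function is a made-up example of mine, distance fallen under a constant 9.8 m/s² with no air resistance, and the step size is a pragmatic choice.

```python
def position(t):
    # Hypothetical example: distance fallen in metres, x(t) = 4.9 t^2.
    return 4.9 * t**2

def velocity(t, h=1e-3):
    """First derivative dx/dt via a central difference."""
    return (position(t + h) - position(t - h)) / (2 * h)

def acceleration(t, h=1e-3):
    """Second derivative d²x/dt² via a second central difference."""
    return (position(t + h) - 2 * position(t) + position(t - h)) / h**2

t = 2.0
print(f"velocity at t={t}: {velocity(t):.4f} m/s")
print(f"acceleration at t={t}: {acceleration(t):.4f} m/s^2")
```

Differentiating once gives 9.8t, so about 19.6 m/s at t = 2. Differentiating again gives the constant 9.8 m/s².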
Economics changes the nouns, but not the logic. A total cost curve is useful, but the derivative answers the operational question: what is the cost of one more unit right now? That is a local question, so calculus is the natural tool.
In ML, the derivative is the thing that tells me whether a parameter update is helpful or harmful. If increasing a weight raises the loss, that local slope should probably push me in the opposite direction. If the slope is tiny, learning may stall.
In engineering and control, derivatives are tied to responsiveness. How sharply does the output react to a small change in input? How quickly is error changing? Is the response flattening, overshooting, or amplifying noise? These are derivative questions even when the equations are hidden behind hardware or software layers.
Connection To Gradient Descent
This is the bridge that made the whole topic worth internalizing for me. The derivative tells you which direction to step. Gradient descent is just the repeated act of using that local slope information to move downhill.
w_next = w - η dL/dw
That tiny update rule carries a lot of meaning. dL/dw is the local slope of the loss with respect to the parameter. η is the learning rate, which decides how far I trust that local linear guidance.
If the derivative is positive, moving right makes the loss worse, so I step left. If the derivative is negative, moving right makes the loss better, so I step right. If the derivative is near zero, I may be close to a flat region or an optimum.

I like this one-dimensional picture because it strips away the intimidation. Gradient descent is just, “look at the slope, then move downhill.” In high dimensions, the geometry is more complicated, but the local logic is the same.
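Here is that one-dimensional loop as a minimal sketch. The loss L(w) = (w - 3)² is a hypothetical bowl I picked so the minimum is known in advance, and the starting point and learning rates are arbitrary.

```python
def loss(w):
    # Hypothetical loss with its minimum at w = 3.
    return (w - 3.0) ** 2

def dloss_dw(w):
    return 2.0 * (w - 3.0)

def gradient_descent(w, learning_rate, steps):
    """Repeatedly apply w_next = w - learning_rate * dL/dw."""
    history = [w]
    for _ in range(steps):
        w = w - learning_rate * dloss_dw(w)
        history.append(w)
    return history

path = gradient_descent(w=0.0, learning_rate=0.1, steps=50)
print("first steps:", [round(w, 4) for w in path[:4]])
print("final w:", path[-1], "final loss:", loss(path[-1]))

# Trusting the local slope too much: each step overshoots and the error grows.
diverged = gradient_descent(w=0.0, learning_rate=1.1, steps=50)
print("with learning_rate=1.1, final w:", diverged[-1])
```

With a learning rate of 0.1 the iterate moves steadily toward 3. With 1.1 every step jumps past the minimum by more than it started with, which is the simplest picture I know of an exploding update.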
Backprop is chain rule bookkeeping
Gradient descent needs derivatives of the loss with respect to many internal parameters. Backpropagation computes those derivatives efficiently by applying the chain rule across a computation graph. That is why I think of the chain rule as the engine behind backpropagation.
This is the sentence I wish I had heard earlier: the derivative tells you which direction to step, and backpropagation is just the chain rule applied across a computation graph to compute that derivative. Once that clicked, backprop stopped feeling mystical.
Every layer in a neural network transforms its input. So every layer contributes a local derivative. The full gradient is what I get when I combine those local derivatives correctly, in reverse, from the loss back to the earlier parameters.
That is also why derivative intuition matters when gradients vanish, explode, or get distorted by clipping, quantization, saturation, or bad scaling. Those are all stories about local sensitivity changing shape.
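To see the reverse-order bookkeeping in the smallest case I can think of, here is a sketch of a one-neuron "network": a weighted sum, a tanh nonlinearity, and a squared-error loss. Everything here (the architecture, the variable names, the input values) is an illustrative assumption, and the finite-difference check is only there to confirm the hand-derived gradient.

```python
import math

def forward(w, b, x, target):
    z = w * x + b              # weighted sum
    a = math.tanh(z)           # nonlinearity
    loss = (a - target) ** 2   # squared error
    return z, a, loss

def backward(w, b, x, target):
    """Gradients of the loss with respect to w and b, chained from the loss backward."""
    z, a, loss = forward(w, b, x, target)
    dloss_da = 2.0 * (a - target)   # local derivative at the loss node
    da_dz = 1.0 - a * a             # derivative of tanh(z) is 1 - tanh(z)^2
    dloss_dz = dloss_da * da_dz     # chain the two local sensitivities
    dloss_dw = dloss_dz * x         # dz/dw = x
    dloss_db = dloss_dz * 1.0       # dz/db = 1
    return dloss_dw, dloss_db

def numeric_grad_w(w, b, x, target, h=1e-6):
    up = forward(w + h, b, x, target)[2]
    down = forward(w - h, b, x, target)[2]
    return (up - down) / (2 * h)

w, b, x, target = 0.5, -0.2, 1.5, 1.0
print("chain rule dL/dw:", backward(w, b, x, target)[0])
print("numerical  dL/dw:", numeric_grad_w(w, b, x, target))

# Saturation: a large weight pushes tanh flat, so the local slope da/dz nearly vanishes.
print("saturated  dL/dw:", backward(10.0, b, x, 0.0)[0])
```

The last line is the vanishing-gradient story in miniature: the loss is far from zero, yet almost no signal reaches w, because one local derivative in the chain is nearly zero.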
| Gradient Descent Ingredient | Role | Derivative Connection |
|---|---|---|
| Loss function | Measures how wrong the model currently is. | I differentiate it to know which way improves it. |
| Parameter | The adjustable quantity I want to update. | The partial derivative tells me the parameter's local effect on loss. |
| Learning rate | Sets step size. | It scales how strongly I respond to the derivative signal. |
| Backpropagation | Efficient gradient computation for layered models. | It is repeated chain rule across the network graph. |
So for me, the derivative and the chain rule are not only calculus topics. They are infrastructure for understanding optimization. And optimization is infrastructure for modern ML.
Summary
If I compress the whole post into a short list, these are the ideas I want to keep loaded.
- Limits let me talk about what a function approaches, even when direct substitution is undefined or misleading.
- The derivative is a limit of average change becoming instantaneous change.
- Geometrically, secant lines approach a tangent line.
- Operationally, the derivative is local slope and local sensitivity.
- The power, sum, product, and quotient rules are reusable shortcuts built on the limit definition.
- The chain rule handles compositions, which is how most real systems and models are actually built.
- d/dx f(g(x)) = f'(g(x)) · g'(x) is the formal statement, but “outer first, inner second” is the memory hook I actually use.
- Gradient descent follows derivative information downhill.
- Backpropagation is repeated chain rule on a computation graph.
- The chain rule is the engine behind backpropagation, which is why understanding it cleanly pays off later.
I wrote this the way I wish someone had explained it to me when I first started connecting calculus to ML, robotics, and optimization. The derivative is not only about curves in a textbook. It is about how systems respond when I perturb them.
The chain rule is not a memorization trap. It is the natural rule for how sensitivities combine when one process feeds another. That is exactly the structure I see in neural networks, control pipelines, and layered engineering systems.
So this post is really a bridge. It starts with limits, moves through slope, and ends at computation graphs. From here, the next step is obvious: the backprop carousel, where the chain rule stops being a chapter heading and becomes the mechanism that trains the model.