Part of the ShivasNotes transformer-from-scratch series. Previously: dL/d(LLM): The Full Backward Pass.
The full backward-pass post ended with every weight in the model holding a gradient. This post is about the step that actually changes those weights. The optimizer takes the raw gradient signal and turns it into a weight update. How that update is computed determines whether the model learns quickly, slowly, or not at all.
We will build from the simplest possible optimizer (just a learning rate) through SGD with momentum, then to Adam, and finally to AdamW — the optimizer that trains nearly every modern LLM. At each step we will show what breaks, why the next idea was needed, and what the real C code looks like in C-Kernel-Engine. The optimizer is the last kernel in the training pipeline. It is the only kernel that actually modifies the model weights. Everything before it was just measuring how wrong the weights are.
Roadmap for this post
Sections 1 and 2 start with vanilla gradient descent: one learning rate, one update rule, and the problems that immediately surface.
Sections 3 and 4 add momentum and explain why a rolling average of past gradients helps escape saddle points and noisy landscapes.
Sections 5 through 7 build Adam: adaptive per-parameter learning rates, bias correction, and the intuition behind each moving average.
Sections 8 and 9 explain the crucial difference between L2 regularization and decoupled weight decay — why Adam got it wrong and AdamW fixed it.
Sections 10 through 12 cover the AVX-512 implementation, gradient clipping, and the fused multi-tensor update in C-Kernel-Engine.
Sections 13 through 15 tally the memory cost, connect the optimizer to the full training pipeline from the backward-pass post, and summarize the progression.
Section 1: The Simplest Optimizer — Vanilla Gradient Descent
The simplest possible weight update rule is gradient descent with a fixed learning rate. After computing the gradient g = dL/dw for every weight, the update is:
# For each weight w with gradient g:
w = w - lr * g
# That's it. One multiplication, one subtraction.
# lr (learning rate) controls the step size.
The gradient points uphill (toward higher loss), so we subtract it to move downhill. The learning rate controls how far we step each time. In C, this would be a trivial loop:
#include <stddef.h>

void sgd_vanilla_f32(const float *grad, float *weight,
                     size_t numel, float lr)
{
    /* One multiply and one subtract per weight. */
    for (size_t i = 0; i < numel; ++i) {
        weight[i] = weight[i] - lr * grad[i];
    }
}
Section 2: Why Vanilla Gradient Descent Breaks
Vanilla gradient descent has three problems that become severe in deep learning.
Problem 1: The learning rate dilemma
A single learning rate must work for every parameter in the entire model. But not all parameters have the same gradient magnitude. Embedding weights might have gradients of 0.001 while attention projection weights have gradients of 10.0. A learning rate sized for the attention weights will be 10,000× too small for the embeddings, and one sized for the embeddings will be 10,000× too large for the attention weights.
One learning rate for millions of parameters with wildly different gradient scales: this is the fundamental limitation of vanilla SGD.
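The dilemma is easy to reproduce on a toy problem. The sketch below is an illustration only, not C-Kernel-Engine code: it assumes a hypothetical two-parameter quadratic loss whose curvatures (100 and 0.01) stand in for the large-gradient and small-gradient parameters, and the function names are made up for this example.

```c
#include <math.h>
#include <stddef.h>

/* Hypothetical toy loss: L(w) = 0.5 * (c[0]*w[0]^2 + c[1]*w[1]^2).
 * The gradient for parameter i is c[i] * w[i], so a large c[i] means
 * large gradients for that parameter. */
void toy_gd_run(float w[2], const float c[2], float lr, size_t steps)
{
    for (size_t t = 0; t < steps; ++t) {
        for (int i = 0; i < 2; ++i) {
            float g = c[i] * w[i];   /* dL/dw_i */
            w[i] = w[i] - lr * g;    /* vanilla gradient descent */
        }
    }
}

/* Runs 100 steps from w = (1, 1) with curvatures (100, 0.01) and
 * returns |w[which]| afterwards. The minimum is at w = (0, 0). */
float toy_gd_final_abs(float lr, int which)
{
    const float c[2] = {100.0f, 0.01f};
    float w[2] = {1.0f, 1.0f};
    toy_gd_run(w, c, lr, 100);
    return fabsf(w[which]);
}
```

With lr = 0.01 the steep parameter converges almost immediately while the shallow one has barely moved after 100 steps. Raising lr to 0.025 already makes the steep parameter diverge (each step multiplies it by -1.5), and the shallow one is still nowhere near the minimum. No single value serves both.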
Problem 2: Noisy gradients
Each mini-batch gives a noisy estimate of the true gradient. Vanilla SGD follows each noisy gradient faithfully, producing a zigzag path toward the minimum. The noise means the optimizer takes many small corrective steps that partially cancel each other.
Problem 3: Saddle points and flat regions
In high-dimensional loss landscapes, saddle points are far more common than local minima. At a saddle point, the gradient is nearly zero, so vanilla SGD stalls. Without momentum, there is no accumulated velocity to push through these flat regions.

Section 3: Adding Momentum — SGD with Velocity
The first fix is momentum. Instead of following the raw gradient at each step, we maintain a running average of past gradients and follow that instead. This smooths out the noise and builds up speed in consistent directions.
# For each weight w with gradient g:
# v is the velocity buffer (initialized to zeros)
v = momentum * v + g # accumulate momentum
w = w - lr * v # step using smoothed gradient
# momentum is typically 0.9
# Each step keeps 90% of the old velocity and adds the full new gradient
# (when unrolled: an exponentially decaying sum of past gradients;
# a constant gradient g builds up to v = g / (1 - momentum) = 10 * g)
The physical analogy is a ball rolling downhill. Without momentum, the ball teleports to a new position each step and has no memory of which direction it was moving. With momentum, the ball has inertia: if it was rolling left and the gradient says to go left again, it accelerates. If the gradient suddenly says right (noise), the ball slows down but does not immediately reverse.
Momentum turns noisy stochastic steps into a smooth trajectory. The signal (consistent gradient direction) accumulates, while the noise (random per-batch variations) cancels out.
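To put numbers on "signal accumulates, noise cancels", here is a minimal sketch that runs only the velocity recurrence v = momentum * v + g. The function name and the alternating-sign gradient are assumptions made for illustration; real batch noise is random, not a perfect alternation.

```c
#include <stddef.h>

/* Runs v = momentum * v + g for `steps` steps and returns the final v.
 * If `alternate` is nonzero the gradient flips sign every step
 * (+g, -g, +g, ...), a crude stand-in for zero-mean batch noise. */
float velocity_after(float g, float momentum, size_t steps, int alternate)
{
    float v = 0.0f;
    for (size_t t = 0; t < steps; ++t) {
        float gt = (alternate && (t % 2 == 1)) ? -g : g;
        v = momentum * v + gt;
    }
    return v;
}
```

With momentum = 0.9, a constant gradient of 1 builds the velocity up to about 1 / (1 - 0.9) = 10, while a gradient that flips sign every step settles at a magnitude of about 1 / (1 + 0.9) ≈ 0.53. The consistent direction is amplified roughly 19 times more than the alternating one.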
#include <stddef.h>
#if defined(__AVX512F__)
#include <immintrin.h>
#endif

void sgd_momentum_update_f32(
    const float *grad, float *weight, float *velocity,
    size_t numel, float lr, float momentum, float weight_decay)
{
    size_t i = 0;
#if defined(__AVX512F__)
    // Broadcast the scalar hyperparameters to all 16 lanes
    const __m512 v_momentum = _mm512_set1_ps(momentum);
    const __m512 v_weight_decay = _mm512_set1_ps(weight_decay);
    const __m512 v_lr = _mm512_set1_ps(lr);
    // AVX-512: process 16 floats at a time
    for (; i + 16 <= numel; i += 16) {
        __m512 g = _mm512_loadu_ps(&grad[i]);
        __m512 w = _mm512_loadu_ps(&weight[i]);
        __m512 vel = _mm512_loadu_ps(&velocity[i]);
        // v = momentum * v + g
        vel = _mm512_fmadd_ps(v_momentum, vel, g);
        // w = w - lr * (v + weight_decay * w)
        __m512 update = _mm512_fmadd_ps(v_weight_decay, w, vel);
        w = _mm512_fnmadd_ps(v_lr, update, w);
        _mm512_storeu_ps(&weight[i], w);
        _mm512_storeu_ps(&velocity[i], vel);
    }
#endif
    // Scalar tail (and the whole array when AVX-512 is unavailable)
    for (; i < numel; ++i) {
        velocity[i] = momentum * velocity[i] + grad[i];
        weight[i] -= lr * (velocity[i] + weight_decay * weight[i]);
    }
}
Momentum solves the noise problem (smoothing) and the saddle point problem (accumulated velocity pushes through flat regions). But it does not solve the learning rate dilemma. Every parameter still shares the same effective step size.

Section 4: The Momentum Buffer — What It Costs
Notice that momentum introduces one extra buffer per weight: the velocity v. For a model with N parameters, SGD with momentum keeps 2N floats (weights + velocity), which doubles the memory footprint of the model parameters. For a 7B parameter model in FP32, the velocity buffer adds 28 GB (7B × 4 bytes).
This is a pattern we will see again: every optimizer improvement trades memory for better convergence. Momentum adds one buffer. Adam will add two.
Section 5: Adam — Adaptive Learning Rates
The key insight of Adam (Adaptive Moment Estimation, Kingma & Ba, 2014) is to give each parameter its own effective learning rate. It does this by tracking two exponential moving averages per parameter:
# m = first moment (mean of gradients) — the momentum term
# v = second moment (mean of squared gradients) — the scale term
m = β₁ * m + (1 - β₁) * g # momentum: where are we going?
v = β₂ * v + (1 - β₂) * g² # scale: how big are the gradients?
# β₁ = 0.9 (typical): 90% old momentum, 10% new gradient
# β₂ = 0.999 (typical): 99.9% old scale, 0.1% new squared gradient
The first moment m is momentum in exponential-moving-average form: it tracks which direction the gradients have been pointing. The second moment v is new. It tracks the squared magnitude of recent gradients for each parameter independently. When we divide the momentum by the square root of the scale, parameters with large gradients get automatically scaled down, and parameters with small gradients get scaled up.
m / √v: this ratio is the heart of Adam. Parameters with consistently large gradients (high v) get smaller steps. Parameters with tiny gradients (low v) get larger steps. The learning rate adapts per-parameter.
Section 6: Bias Correction — Why the First Few Steps Need Fixing
There is a subtle initialization problem. Both m and v are initialized to zero. In the first few steps, the exponential moving averages are biased toward zero because they have not accumulated enough history.
# Without correction, step 1:
# m₁ = 0.9 * 0 + 0.1 * g₁ = 0.1 * g₁ (should be ≈ g₁)
# v₁ = 0.999 * 0 + 0.001 * g₁² = 0.001 * g₁²
# The estimates are 10× too small for m and 1000× too small for v!
# Bias correction:
m_hat = m / (1 - β₁ᵗ) # at step 1: m / (1 - 0.9¹) = m / 0.1 = 10× boost
v_hat = v / (1 - β₂ᵗ) # at step 1: v / (1 - 0.999¹) = v / 0.001 = 1000× boost
# As t grows large:
# (1 - β₁ᵗ) → 1 and (1 - β₂ᵗ) → 1
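# So bias correction becomes negligible after ~50 steps for m (0.9⁵⁰ ≈ 0.005),
# but takes several thousand steps for v (0.999¹⁰⁰⁰ ≈ 0.37, 0.999⁵⁰⁰⁰ ≈ 0.007)
Without bias correction, the first few steps would have the wrong size: at step 1 the uncorrected ratio m / √v is 0.1 / √0.001 ≈ 3.2 times larger than the corrected ratio m_hat / √v_hat. With it, the effective learning rate is stable from step 1.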
Section 7: The Full Adam Update
Putting it all together, the Adam update rule is:
# For each weight w with gradient g at step t:
# 1. Update moving averages
m = β₁ * m + (1 - β₁) * g # first moment (momentum)
v = β₂ * v + (1 - β₂) * g² # second moment (scale)
# 2. Bias correction
m_hat = m / (1 - β₁ᵗ)
v_hat = v / (1 - β₂ᵗ)
# 3. Update weight
w = w - lr * m_hat / (√v_hat + ε)
# ε = 1e-8 prevents division by zero when v_hat ≈ 0
# Typical hyperparameters: lr=1e-3, β₁=0.9, β₂=0.999
The effective per-parameter learning rate is lr / (√v_hat + ε). For a parameter whose recent gradients have magnitude around 100, √v_hat ≈ 100 and the effective lr is about lr / 100. For a parameter whose gradients have magnitude around 0.01, √v_hat ≈ 0.01 and the effective lr is about lr / 0.01 = 100 × lr. When the gradient direction is consistent, the actual step lr * m_hat / (√v_hat + ε) comes out close to lr in size in both cases, so the optimizer automatically compensates for gradient scale differences across the model.
Adam gives each parameter its own effective learning rate, derived from the history of its own gradients. Parameters that need bigger steps get bigger steps. Parameters that need smaller steps get smaller steps.

Section 8: The Weight Decay Problem — Adam vs AdamW
Standard Adam has one significant flaw in how it handles weight decay (regularization). Weight decay penalizes large weights by shrinking each weight slightly at every step. The classic way to implement it is L2 regularization, which adds a small fraction of each weight to its gradient:
# L2 regularization (what Adam does):
g_regularized = g + λ * w # add weight penalty to gradient
# Then run Adam on g_regularized
# Problem: the adaptive scaling in Adam also scales the weight decay!
# The weight decay gets divided by √v along with the gradient.
# This means large-gradient parameters get LESS weight decay,
# which is the opposite of what we want.
# Decoupled weight decay (what AdamW does):
# Run Adam on the raw gradient g (no modification)
# Apply weight decay separately:
w = w - lr * (m_hat / (√v_hat + ε)) - lr * λ * w
#       ├────── Adam update ──────┤   ├ decay ─┤
# Weight decay is NOT scaled by 1/√v.
# Every parameter gets the same proportional shrinkage.
This difference was identified by Loshchilov & Hutter (2017) in the paper "Decoupled Weight Decay Regularization." The fix is elegant: do not bake the weight decay into the gradient. Instead, apply it as a separate multiplicative shrinkage alongside the Adam step. This is why the optimizer is called AdamW: Adam with decoupled Weight decay.
In Adam, weight decay is accidentally coupled to the adaptive learning rate. AdamW fixes this by applying weight decay independently. This single change measurably improves generalization in large models.
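A small numeric sketch makes the coupling concrete. It relies on a simplifying assumption stated in the code: at step 1 with bias correction, m_hat equals the (regularized) gradient and v_hat equals its square, so the Adam step has a closed form. The function names are hypothetical.

```c
#include <math.h>

/* Step-1 simplification (an assumption of this sketch): with bias
 * correction, m_hat = g' and v_hat = g'^2 for the regularized gradient
 * g' = g + lambda * w, so the Adam step is lr * g' / (|g'| + eps). */

/* Part of the step-1 update that comes from the lambda * w term when
 * weight decay is folded into the gradient (L2 style), in units of lr. */
float l2_decay_per_lr(float g, float w, float lambda, float eps)
{
    float g_reg = g + lambda * w;
    return lambda * w / (fabsf(g_reg) + eps);
}

/* Decay applied by the decoupled (AdamW) rule, in units of lr. */
float decoupled_decay_per_lr(float w, float lambda)
{
    return lambda * w;
}
```

For w = 1 and λ = 0.01, the L2-style decay is about 0.001 × lr when the gradient is 10 but about 0.9 × lr when the gradient is 0.001. The decoupled rule gives 0.01 × lr in both cases, independent of gradient scale.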
Section 9: The Full AdamW Update
# For each weight w with gradient g at step t:
# 1. Update moving averages (same as Adam)
m = β₁ * m + (1 - β₁) * g
v = β₂ * v + (1 - β₂) * g²
# 2. Bias correction (same as Adam)
m_hat = m / (1 - β₁ᵗ)
v_hat = v / (1 - β₂ᵗ)
# 3. Weight update with DECOUPLED weight decay
w = w - lr * (m_hat / (√v_hat + ε) + λ * w)
#             ├─ adaptive step ──┤   └ weight decay
# The two terms are added, NOT coupled through v
# λ = 0.01 typically (weight_decay parameter; the PyTorch AdamW default)
# With lr = 1e-3, this shrinks every weight by a factor of lr * λ = 0.00001 per step
In practice, C-Kernel-Engine follows the PyTorch convention for AdamW operator ordering: apply the decoupled weight decay as a multiplicative scale w *= (1 - lr * λ) before the Adam step. Because the Adam term does not depend on w, this produces the same result up to floating-point rounding, and it lets the kernel touch the weight array once. The listing below writes both terms as a single expression, which is the same update.
#include <math.h>
#include <stddef.h>

static void adamw_update_f32_impl(
    const float *grad, float *weight, float *m, float *v,
    size_t numel, float lr, float beta1, float beta2,
    float eps, float weight_decay, int step)
{
    // step counts from 1, so both corrections are nonzero
    float bias_correction1 = 1.0f - powf(beta1, (float)step);
    float bias_correction2 = 1.0f - powf(beta2, (float)step);
    float one_minus_beta1 = 1.0f - beta1;
    float one_minus_beta2 = 1.0f - beta2;
    for (size_t i = 0; i < numel; ++i) {
        float g = grad[i];
        float w = weight[i];
        // Update moving averages
        m[i] = beta1 * m[i] + one_minus_beta1 * g;
        v[i] = beta2 * v[i] + one_minus_beta2 * g * g;
        // Bias-corrected estimates
        float m_hat = m[i] / bias_correction1;
        float v_hat = v[i] / bias_correction2;
        // AdamW: gradient step + decoupled weight decay
        weight[i] = w - lr * (m_hat / (sqrtf(v_hat) + eps)
                              + weight_decay * w);
    }
}
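The claim that the scale-first ordering and the single-expression form agree is easy to check numerically. This is a minimal sketch with hypothetical function names. Here adam_term stands for m_hat / (√v_hat + ε), which is the same in both forms because it does not depend on w.

```c
/* Single-expression form, as in the listing above. */
float adamw_fused_form(float w, float adam_term, float lr, float lambda)
{
    return w - lr * (adam_term + lambda * w);
}

/* Scale-first ordering: shrink the weight, then apply the Adam step. */
float adamw_scale_first(float w, float adam_term, float lr, float lambda)
{
    w *= (1.0f - lr * lambda);
    return w - lr * adam_term;
}
```

Both expand to w * (1 - lr * λ) - lr * adam_term, so any difference between them is floating-point rounding on the order of one unit in the last place.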
Section 10: The SIMD Implementation — AVX-512 AdamW
The scalar loop is correct but slow. C-Kernel-Engine implements the same math using AVX-512 intrinsics to process 16 parameters at once.
// Excerpt: bias_correction1/2, one_minus_beta1/2 and the loop index i (starting at 0)
// are assumed to be set up earlier in the function, as in the scalar version.
#if defined(__AVX512F__)
    __m512 v_beta1 = _mm512_set1_ps(beta1);
    __m512 v_beta2 = _mm512_set1_ps(beta2);
    __m512 v_one_minus_beta1 = _mm512_set1_ps(one_minus_beta1);
    __m512 v_one_minus_beta2 = _mm512_set1_ps(one_minus_beta2);
    __m512 v_lr = _mm512_set1_ps(lr);
    __m512 v_eps = _mm512_set1_ps(eps);
    __m512 v_weight_decay = _mm512_set1_ps(weight_decay);
    __m512 v_bc1_inv = _mm512_set1_ps(1.0f / bias_correction1);
    __m512 v_bc2_inv = _mm512_set1_ps(1.0f / bias_correction2);
    for (; i + 16 <= numel; i += 16) {
        __m512 g = _mm512_loadu_ps(&grad[i]);
        __m512 w = _mm512_loadu_ps(&weight[i]);
        __m512 m_val = _mm512_loadu_ps(&m[i]);
        __m512 v_val = _mm512_loadu_ps(&v[i]);
        // m = β₁m + (1-β₁)g
        m_val = _mm512_fmadd_ps(v_beta1, m_val,
                                _mm512_mul_ps(v_one_minus_beta1, g));
        // v = β₂v + (1-β₂)g²
        __m512 g_sq = _mm512_mul_ps(g, g);
        v_val = _mm512_fmadd_ps(v_beta2, v_val,
                                _mm512_mul_ps(v_one_minus_beta2, g_sq));
        // Bias-corrected: m̂ = m/(1-β₁ᵗ), v̂ = v/(1-β₂ᵗ)
        __m512 m_hat = _mm512_mul_ps(m_val, v_bc1_inv);
        __m512 v_hat = _mm512_mul_ps(v_val, v_bc2_inv);
        // w = w - lr * (m̂/(√v̂ + ε) + λw)
        __m512 denom = _mm512_add_ps(_mm512_sqrt_ps(v_hat), v_eps);
        __m512 update = _mm512_div_ps(m_hat, denom);
        update = _mm512_fmadd_ps(v_weight_decay, w, update);
        w = _mm512_fnmadd_ps(v_lr, update, w);
        _mm512_storeu_ps(&weight[i], w);
        _mm512_storeu_ps(&m[i], m_val);
        _mm512_storeu_ps(&v[i], v_val);
    }
#endif
Each iteration of this loop processes 16 weight-gradient-m-v tuples simultaneously. Four loads, four math sequences, three stores, all operating on 512-bit registers. The _mm512_fmadd_ps instruction computes a*b + c as a single instruction with a single rounding step, which keeps the moving-average updates cheap: β₁ * m + (1-β₁) * g is one multiply followed by one FMA.
16×: AVX-512 processes 16 float32 parameters per loop iteration. For a model with 1.8M parameters, the optimizer loop completes in roughly 112K iterations instead of 1.8M.
C-Kernel-Engine also provides AVX (8 floats), SSE2 (4 floats), and scalar fallback paths for portability. The same math, four levels of SIMD width.
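The common shape behind all of these paths is a fixed-width main loop followed by a scalar tail. Here is a portable sketch of that shape without intrinsics, applied to the vanilla SGD update for brevity. The function names and the BLOCK_WIDTH macro are hypothetical, and whether the inner loop is actually vectorized depends on the compiler and flags.

```c
#include <stddef.h>

#define BLOCK_WIDTH 16  /* matches the 16 floats of one AVX-512 register */

/* Same update as sgd_vanilla_f32, restructured into fixed-width blocks
 * plus a scalar tail. The inner loop has a constant trip count, which
 * compilers can often auto-vectorize. */
void sgd_vanilla_blocked_f32(const float *grad, float *weight,
                             size_t numel, float lr)
{
    size_t i = 0;
    for (; i + BLOCK_WIDTH <= numel; i += BLOCK_WIDTH) {
        for (size_t j = 0; j < BLOCK_WIDTH; ++j)
            weight[i + j] -= lr * grad[i + j];
    }
    for (; i < numel; ++i)   /* tail: fewer than BLOCK_WIDTH elements left */
        weight[i] -= lr * grad[i];
}

/* Number of full-width iterations the main loop executes. */
size_t full_blocks(size_t numel)
{
    return numel / BLOCK_WIDTH;
}
```

For 1.8M parameters, full_blocks returns 112,500 main-loop iterations. A tensor whose size is not a multiple of 16 finishes its last few elements in the tail loop.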
Section 11: Gradient Clipping — Preventing Explosions
Before the optimizer sees the gradients, they often need to be clipped. In early training or when the model encounters unusual data, gradient magnitudes can spike to extreme values. A single bad batch can produce gradients large enough to destroy months of training.
# Compute global gradient norm across ALL parameters
global_norm = sqrt(sum(g² for all gradients across all tensors))
# If the norm exceeds the threshold, scale all gradients down
if global_norm > max_grad_norm:
    scale = max_grad_norm / global_norm
    for g in all_gradients:
        g *= scale
# This preserves gradient direction but caps magnitude
# max_grad_norm = 1.0 is typical for LLM training
// Compute global norm across all weight tensors
float gradient_global_norm_multi_f32(
    const float *const *grads, const size_t *numels,
    int tensor_count)
{
    double sum_sq = 0.0;
    for (int i = 0; i < tensor_count; ++i) {
        sum_sq += gradient_sum_sq_f32_impl(grads[i], numels[i]);
    }
    return sqrtf((float)sum_sq);
}
// Then in adamw_clip_update_multi_f32:
if (max_grad_norm > 0.0f) {
    float global_norm = gradient_global_norm_multi_f32(grads, ...);
    if (global_norm > max_grad_norm) {
        grad_scale = max_grad_norm / global_norm;
    }
}
// All gradients scaled by grad_scale before AdamW update
Notice the accumulation uses double precision. Summing millions of squared float32 values in a float32 accumulator loses precision to accumulated rounding error: once the running sum is large, small addends are partly or wholly rounded away. The double-precision accumulator keeps the global norm accurate even for models with hundreds of millions of parameters.
Gradient clipping is not optional for LLM training. Without it, a single bad batch can produce gradient norms of 1000+ and destroy the model. With clipping at 1.0, the worst case is a normal-sized step in a potentially wrong direction.
Section 12: The Fused Multi-Tensor Update
C-Kernel-Engine does not call adamw_update_f32 once per weight tensor in a loop. Instead, it uses a fused multi-tensor update that processes all 19 weight tensors in a single function call.
void adamw_clip_update_multi_f32(
    float *const *grads,      // array of 19 gradient pointers
    float *const *weights,    // array of 19 weight pointers
    float *const *m_states,   // array of 19 first-moment pointers
    float *const *v_states,   // array of 19 second-moment pointers
    const size_t *numels,     // array of 19 tensor sizes
    int tensor_count,         // 19
    float lr, float beta1, float beta2, float eps,
    float weight_decay, float max_grad_norm, int step)
{
    // 1. Compute global gradient norm across all tensors
    float grad_scale = 1.0f;
    if (max_grad_norm > 0.0f) {
        float global_norm = gradient_global_norm_multi_f32(
            (const float *const *)grads, numels, tensor_count);
        if (global_norm > max_grad_norm)
            grad_scale = max_grad_norm / global_norm;
    }
    // 2. Parallel dispatch: each thread owns a subset of tensors
    //    (written as a plain loop in this listing)
    //    No atomics needed: tensors are independent
    for (int i = 0; i < tensor_count; ++i) {
        if (grad_scale != 1.0f)
            gradient_scale_f32_impl(grads[i], numels[i], grad_scale);
        adamw_update_f32_impl(grads[i], weights[i], m_states[i],
                              v_states[i], numels[i], lr, beta1, beta2, eps,
                              weight_decay, step);
    }
}
The fused design has two advantages. First, gradient clipping computes a single global norm before any updates happen. This cannot be done if you update tensors one at a time because you need all gradients to compute the norm. Second, the threadpool assigns different tensors to different threads with zero contention. Each thread owns its tensors completely: no locks, no atomics, no synchronization.
Section 13: Memory Cost — The Price of Adaptive Optimization
Each optimizer level costs more memory:
| Optimizer | Buffers per weight | Total memory (per param) | For 7B params (FP32) |
|---|---|---|---|
| Vanilla SGD | weight only | 1× = 4 bytes | 28 GB |
| SGD + Momentum | weight + velocity | 2× = 8 bytes | 56 GB |
| Adam / AdamW | weight + m + v | 3× = 12 bytes | 84 GB |
| AdamW + gradients | weight + m + v + grad | 4× = 16 bytes | 112 GB |
This is why optimizer state dominates memory in large-scale training. The model weights for a 7B parameter model are 28 GB. But the optimizer state (m + v) adds another 56 GB. With gradients, the total is 112 GB, 4× the model size, before counting any activations.
4×: training with AdamW needs 4× the memory of just storing the weights. This is why techniques like mixed-precision training (FP16/BF16 weights, FP32 optimizer states) and optimizer state sharding exist.
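The table's numbers follow from one multiplication. A minimal sketch, with a hypothetical function name, assuming every buffer is FP32 and 1 GB means 10^9 bytes:

```c
#include <stdint.h>

/* Bytes of training state for n_params FP32 parameters when each
 * parameter carries `buffers_per_param` float buffers
 * (for example 4 = weight + m + v + grad for AdamW). */
uint64_t training_state_bytes(uint64_t n_params, unsigned buffers_per_param)
{
    return n_params * (uint64_t)buffers_per_param * (uint64_t)sizeof(float);
}
```

For 7 billion parameters this gives 28 × 10^9 bytes with one buffer, 84 × 10^9 with three, and 112 × 10^9 with four, matching the rows above.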

Section 14: Connecting to the Training Pipeline
In the full backward-pass post, we saw the complete training step:
// The full training loop
int ck_train_step_ex(...) {
    // 1. Forward: 19 ops (embedding → layers → logits)
    ck_train_forward_step();
    // 2. Backward: 59 ops (loss → layers → embedding)
    ck_train_backward_step();
    // 3. Accumulate (optional: sum over K micro-batches)
    g_accum_step++;
    if (g_accum_step >= CK_GRAD_ACCUM_STEPS) {
        // 4. THIS POST: the optimizer step
        //    - Compute global gradient norm
        //    - Clip if norm > max_grad_norm
        //    - For all 19 tensors: AdamW update (AVX-512)
        //    - Weights are now updated
        ck_train_optimizer_step(lr);
        // 5. Zero all gradient buffers for next window
        ck_zero_grad();
        g_accum_step = 0;
    }
}
The optimizer is the final kernel in the pipeline. Everything before it (forward pass, loss computation, backward pass, gradient accumulation) produces one thing: a gradient for every weight. The optimizer consumes those gradients and produces one thing: updated weights. Then the gradients are zeroed, and the cycle begins again.
Forward measures the error. Backward distributes blame. The optimizer fixes the weights. Zero grad forgets the past. Repeat until the loss is small enough.
Section 15: The Progression — From Trivial to Production
Here is the complete progression we covered:
| Optimizer | Update rule | What it fixes | What it still lacks |
|---|---|---|---|
| Vanilla SGD | w -= lr * g | — | Noisy, one lr, stalls at saddle points |
| SGD + Momentum | v = μv + g; w -= lr * v | Noise smoothing, saddle escape | Still one lr for all params |
| Adam | update m, v; w -= lr * m̂/(√v̂ + ε) | Adaptive per-param lr | Weight decay coupled to lr scaling |
| AdamW | Adam + decoupled weight decay | Correct regularization | Memory cost (3× model params) |
| AdamW + clipping | AdamW + global norm clipping | Gradient explosion prevention | Production-ready ✓ |
Every step in this progression solves one problem that was killing training at scale. And every step is a kernel in C-Kernel-Engine — from sgd_momentum_update_f32 to adamw_clip_update_multi_f32 — with AVX-512 SIMD, threadpool parallelism, and double-precision gradient norm accumulation.

What the optimizer needs from the rest of the pipeline
Gradients: Computed by the backward pass (dL/d(LLM): The Full Backward Pass).
Saved forward activations: Used by the backward pass to compute those gradients (Attention: The Core Of The Transformer and related posts).
Gradient accumulation: The optimizer only sees the summed gradient after K micro-batches, scaled by 1/K.
Gradient zeroing: The optimizer expects clean (zeroed) gradient buffers at the start of each accumulation window. Forgetting this is the most common training bug.
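The last two items can be sketched in a few lines. This is an illustration under stated assumptions, not the engine's code: the function names are hypothetical, and the 1/K scaling is applied as each micro-batch gradient is added, which is one of several places a real pipeline could apply it.

```c
#include <stddef.h>
#include <string.h>

/* Adds one micro-batch gradient into the accumulation buffer, scaled by
 * 1/K so that after K micro-batches the buffer holds the mean gradient. */
void accumulate_grad_f32(float *accum, const float *micro,
                         size_t numel, int k_micro_batches)
{
    float scale = 1.0f / (float)k_micro_batches;
    for (size_t i = 0; i < numel; ++i)
        accum[i] += scale * micro[i];
}

/* Clears the accumulation buffer at the start of a new window. */
void zero_grad_f32(float *accum, size_t numel)
{
    memset(accum, 0, numel * sizeof(float));
}

/* Runs `windows` accumulation windows of K identical micro-batches with
 * gradient g on one parameter and returns what the optimizer would see
 * at the end of the last window. */
float grad_seen_by_optimizer(float g, int k, int windows, int zero_between)
{
    float accum = 0.0f;
    for (int w = 0; w < windows; ++w) {
        if (zero_between)
            zero_grad_f32(&accum, 1);
        for (int i = 0; i < k; ++i)
            accumulate_grad_f32(&accum, &g, 1, k);
    }
    return accum;
}
```

With K = 4 and a true gradient of 2.0, the optimizer sees 2.0 in every window when the buffer is zeroed. Skip the zeroing and the third window sees 6.0, three windows' worth of stale gradient stacked together.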
Related ShivasNotes posts
AI Kernel Engineer Beginner Guide: Math, Linear Algebra, C/Linux
Thread Pools in C: How CPU Runtimes Dispatch Work Across Cores
dL/d(LLM): The Full Backward Pass
Attention: The Core Of The Transformer