AI Kernels You Should Know To Learn Deep Neural Networks

AI kernel learning map

This post is a bridge between the earlier ShivasNotes fundamentals series and the next long-form video: what AI kernels should you actually know if you want to understand deep neural networks from math to runtime?

A neural network is not one giant mysterious operation. It is a graph of smaller kernels: matrix multiplies, activation functions, normalization, attention, position encoding, cache updates, quantization, and optimizer updates. Once those kernels become familiar, transformer models, recurrent attention, vision encoders, and CPU inference systems become much easier to read. The goal is not to memorize every model architecture. The goal is to recognize the kernel vocabulary underneath the architecture.

Purpose of this post

This is a practical learning map for AI, robotics, and HPC kernel engineering. It connects the math you write by hand to the low-level routines that execute inside real runtimes.

It also links back to the earlier ShivasNotes sequence: matrix wx+b, softmax, activation functions, normalization, position encoding, tokenization, attention, residual connections, backprop, thread pools, vision encoders, SIMD, ARM NEON, and quantization.

A taxonomy map of the AI kernels needed to understand neural network runtimes.

How To Read The Kernel Map

The map is not a taxonomy for academic neatness. It is a debugging tool. When a transformer model fails a parity check against its reference, runs slowly, overflows numerically, consumes too much memory, or cannot scale across threads, the failure usually lands in one of these kernel families.

A beginner often reads a model diagram as a list of named blocks: attention, MLP, norm, residual, head. A kernel engineer reads the same diagram as a list of executable contracts: load this tensor, transform this shape, reduce this axis, preserve this dtype expectation, write this layout, and hand the result to the next kernel.

Model word	Kernel-engineering reading	Typical failure mode
Linear layer	GEMV/GEMM, layout, tile shape, accumulation	wrong transpose, slow memory access, bad accumulation order
Activation	elementwise curve, gating, possible fusion point	wrong approximation, missing gate, incorrect intermediate size
Norm	reduction kernel plus scale application	epsilon mismatch, reduction drift, wrong axis
Attention	projection, score, mask, softmax, value mix, cache	mask bug, unstable softmax, KV layout mismatch
Quantized layer	packed format, metadata, dequant, dot product	wrong scale, block-size mismatch, parity drift
Optimizer	gradient transform plus persistent state update	state memory explosion, wrong weight decay semantics

This is why kernel learning compounds. Every new architecture may introduce a new arrangement, but the same questions keep returning: what are the shapes, what memory is read, what math is applied, what is reduced, what is cached, what is updated, and how do we prove it is correct?

What Is An AI Kernel?

An AI kernel is a low-level routine that implements one mathematical operation efficiently on a compute target. That target can be a scalar CPU loop, a SIMD vector unit, a thread pool, a GPU block, a matrix tile unit, an FPGA datapath, or a distributed cluster stage.

The important part is that a kernel has a contract. It accepts inputs with known shapes, performs a specific mathematical transformation, respects numerical expectations, and writes outputs in a layout the next kernel can consume.

Kernel contract intuition

kernel(input tensors, weights, metadata) -> output tensor

The contract includes:
  shape
  dtype
  memory layout
  numerical meaning
  threading / SIMD rules
  correctness tolerance
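
One way to make such a contract concrete is a small descriptor plus a check that runs before the kernel does. The sketch below is only an illustration: `tensor2d`, `dtype_t`, and `linear_contract_ok` are hypothetical names, not part of any runtime mentioned in this post, and it covers only the shape, dtype, and layout parts of the contract.

```c
#include <stddef.h>
#include <stdbool.h>

/* Hypothetical names: an illustration of a kernel contract, not a real API. */
typedef enum { DT_F32, DT_F16, DT_I8 } dtype_t;

typedef struct {
    const void *data;
    size_t rows, cols;    /* logical shape */
    size_t row_stride;    /* elements between the starts of consecutive rows */
    dtype_t dtype;
} tensor2d;

/* Contract for Y = X W in fp32: true only if shape, dtype and layout agree. */
static bool linear_contract_ok(const tensor2d *x, const tensor2d *w,
                               const tensor2d *y)
{
    if (!x->data || !w->data || !y->data) return false;
    if (x->dtype != DT_F32 || w->dtype != DT_F32 || y->dtype != DT_F32)
        return false;
    if (x->cols != w->rows) return false;                 /* inner dims match */
    if (y->rows != x->rows || y->cols != w->cols) return false;
    if (x->row_stride < x->cols || w->row_stride < w->cols ||
        y->row_stride < y->cols) return false;            /* rows do not overlap */
    return true;
}
```

A check like this belongs outside the hot loop: it runs once per call, so the inner loop can assume the contract holds.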

This is why AI kernel engineering sits between math and systems. If you only know the math, you may write code that is correct but slow. If you only know systems, you may optimize a wrong operation. The kernel engineer has to preserve the math while making it efficient.

The First Kernel: Linear Projection

The first kernel to master is the linear layer:

Linear layer

Y = XW + b

This is the same idea behind scalar wx+b, GEMV, GEMM, attention projections, MLP projections, vision patch projections, and classifier heads. In a transformer block, the model repeatedly applies linear kernels to move between hidden states, Q/K/V tensors, attention outputs, and MLP intermediate vectors.

For one token, the linear layer often looks like GEMV: one vector multiplied by a weight matrix. For many tokens, it becomes GEMM: a matrix of tokens multiplied by a weight matrix. The math is stable; the runtime shape changes the kernel strategy.

Kernel	Shape	Where it appears	What to learn
`wx+b`	scalar or vector	first-principles linear model	derivatives and local gradients
GEMV	`[1,N] × [N,K]`	single-token decode	row/column dot products, cache reuse
GEMM	`[M,N] × [N,K]`	prefill, training, batches	tiling, blocking, arithmetic intensity
Projection	`XWq, XWk, XWv`	attention	shape discipline and memory layout

The same linear idea has different speed rules depending on shape. This is one of the most important lessons in kernel engineering: the equation may look the same, but the memory behavior changes.

Level	Math view	Speed view	What to keep hot
Scalar	`y = wx + b`	almost no memory problem; teaches correctness	`w`, `x`, accumulator in registers
Vector dot	`y_j = sum_i x_i W_{i,j}`	stream one weight column and reuse one input vector	input vector and accumulator
GEMV	`[1,N] × [N,K]`	often memory-bandwidth heavy during decode	current token vector, several output accumulators
GEMM	`[M,N] × [N,K]`	tile A and B so each loaded value is reused many times	small A/B tiles and C accumulators

For scalar code, you mainly learn the derivative and the exact order of operations. For GEMV, you start caring about contiguous memory, vector lanes, and keeping several output accumulators in registers. For GEMM, you care about blocking and tiling because the same input and weight values can be reused across many outputs.

Linear kernel speed ladder

scalar:
  keep w, x, b, acc in registers

GEMV:
  keep x hot
  stream W
  compute several y values before storing

GEMM:
  tile X and W
  keep C accumulators in registers
  reuse cache-resident tiles many times
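
The GEMV rung of the ladder can be sketched in plain C. This is a minimal illustration, assuming the weights are stored transposed (`K` rows of `N` contiguous floats) so that each output is one contiguous dot product; the function name `gemv_f32` and that layout are assumptions, not something the post specifies.

```c
#include <stddef.h>

/* y[k] = b[k] + sum_n x[n] * wt[k*N + n], for k in [0, K).
 * Assumption: wt holds the weights transposed, K rows of N floats. */
void gemv_f32(const float *x, const float *wt, const float *b,
              float *y, size_t N, size_t K)
{
    size_t k = 0;
    /* Four outputs per pass: each x[n] is loaded once and reused four times,
     * and the four partial sums stay in registers until the row is done. */
    for (; k + 4 <= K; k += 4) {
        const float *w0 = wt + (k + 0) * N;
        const float *w1 = wt + (k + 1) * N;
        const float *w2 = wt + (k + 2) * N;
        const float *w3 = wt + (k + 3) * N;
        float a0 = b[k + 0], a1 = b[k + 1], a2 = b[k + 2], a3 = b[k + 3];
        for (size_t n = 0; n < N; ++n) {
            const float xn = x[n];
            a0 += xn * w0[n];
            a1 += xn * w1[n];
            a2 += xn * w2[n];
            a3 += xn * w3[n];
        }
        y[k + 0] = a0;
        y[k + 1] = a1;
        y[k + 2] = a2;
        y[k + 3] = a3;
    }
    /* Remainder outputs when K is not a multiple of four. */
    for (; k < K; ++k) {
        const float *w = wt + k * N;
        float acc = b[k];
        for (size_t n = 0; n < N; ++n) acc += x[n] * w[n];
        y[k] = acc;
    }
}
```

The weights are still streamed from memory exactly once, which is why decode-time GEMV tends to be bandwidth-bound; the accumulators only remove redundant loads of `x` and redundant stores of `y`.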

If you understand this kernel deeply, you can connect the simple wx+b post to transformer inference and backpropagation. The Jacobian is the conceptual object. The kernel implements the shortcut.

The Same Linear Kernel Also Teaches Backprop

For forward inference, the linear layer computes output values. For training, the same local structure tells us how gradients move backward. If Y = XW + b, and the parent gradient arriving from the loss is G = dL/dY, the compact backprop rules are:

Linear layer backward pass

dW = X^T G          [N,M] × [M,K] -> [N,K]
dX = G W^T          [M,K] × [K,N] -> [M,N]
db = sum_rows(G)    sum over the M rows of G -> [K]

This is where the Jacobian intuition becomes useful without forcing us to materialize a giant Jacobian. Conceptually, each output Y[i,j] depends only on row i of X and column j of W, so most entries of the full Jacobian are zero. Computationally, the kernel uses matrix products and reductions to apply the same chain rule efficiently.

That distinction matters. A beginner may try to write every partial derivative one by one. A runtime cannot afford that. A runtime uses the structure of the Jacobian to avoid storing the full object. The math says what is connected. The kernel chooses the efficient contraction.

Backward object	Kernel shape	Meaning
`dW`	`X^T G`	accumulate how each input feature contributed to each output gradient
`dX`	`G W^T`	send gradient back to the previous layer or token representation
`db`	`sum_rows(G)`	bias gradient is a reduction across tokens or examples
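
The three rules in the table can be written as a scalar reference in C. This is a sketch under stated assumptions: all matrices are row-major fp32 with `X` of shape `[M,N]`, `W` of shape `[N,K]`, and `G` of shape `[M,K]`; the name `linear_backward_f32` is hypothetical.

```c
#include <stddef.h>

/* Backward pass of Y = XW + b for row-major fp32 tensors.
 * X:[M,N]  W:[N,K]  G = dL/dY:[M,K]  ->  dX:[M,N]  dW:[N,K]  db:[K] */
void linear_backward_f32(const float *X, const float *W, const float *G,
                         float *dX, float *dW, float *db,
                         size_t M, size_t N, size_t K)
{
    /* dW = X^T G: each weight gradient sums over the M rows (tokens). */
    for (size_t n = 0; n < N; ++n)
        for (size_t k = 0; k < K; ++k) {
            float acc = 0.0f;
            for (size_t m = 0; m < M; ++m) acc += X[m * N + n] * G[m * K + k];
            dW[n * K + k] = acc;
        }
    /* dX = G W^T: each input gradient sums over the K outputs. */
    for (size_t m = 0; m < M; ++m)
        for (size_t n = 0; n < N; ++n) {
            float acc = 0.0f;
            for (size_t k = 0; k < K; ++k) acc += G[m * K + k] * W[n * K + k];
            dX[m * N + n] = acc;
        }
    /* db: reduce G over rows, leaving one value per output column. */
    for (size_t k = 0; k < K; ++k) {
        float acc = 0.0f;
        for (size_t m = 0; m < M; ++m) acc += G[m * K + k];
        db[k] = acc;
    }
}
```

Nothing here builds a Jacobian. The two matrix products and one reduction are the contraction of that Jacobian with `G`, which is the point of the section.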

Activation Kernels

After a linear projection, neural networks usually need a nonlinearity. Without activation kernels, stacked linear layers collapse into another linear operation. Activations bend the signal and make deeper function composition useful.

Activation function curves showing ReLU, GELU, SiLU, and sigmoid.

For older networks, ReLU is the basic starting point. For transformer models, GELU and SiLU-family functions are more common. Modern MLP blocks often use gated activations such as SwiGLU:

SwiGLU shape

gate = SiLU(XW_gate)
up   = XW_up
out  = gate ⊙ up
down = out W_down

This is why activation kernels are not “just one formula.” A gated activation may involve two linear projections, an elementwise activation, an elementwise multiply, and a final projection. In real runtime terms, the activation family changes memory movement and kernel fusion opportunities.

The practical thing to learn is not only the curve. Learn the memory path. A ReLU kernel can stream through one tensor and write one tensor. A SwiGLU block usually needs two projected streams, a nonlinear gate, a multiply, and then a down projection. That means more intermediate data, more layout decisions, and more opportunity to fuse or accidentally spill to memory.

Practice target

Implement ReLU, GELU, SiLU, and SwiGLU in scalar C first. Then add vectorized paths only after the scalar reference and parity tests are stable. The activation is simple enough to learn correctness discipline, but rich enough to teach approximation, vectorization, and fusion.

Normalization Kernels

LayerNorm and RMSNorm stabilize hidden states by controlling scale. These kernels are deceptively simple: compute a reduction, derive a scale factor, then apply it across the vector.

RMSNorm

mean_square = sum(x_i^2) / D
rstd        = 1 / sqrt(mean_square + eps)
y_i         = gamma_i * x_i * rstd

The important kernel lesson is reduction. Many lanes or threads contribute to one scalar statistic. That means the kernel engineer must understand floating-point accumulation, SIMD reductions, thread partitioning, and numerical tolerance.

This connects directly to the earlier LayerNorm and RMSNorm post and the SIMD deep dive.

Normalization is also where “close enough” becomes dangerous. A tiny mismatch in a reduction may not look serious for one token, but model parity is a long chain. Small numerical differences can propagate through attention, MLP, logits, sampling, and future tokens. That is why good kernel work needs both local tests and model-level validation.

In C-Kernel-Engine terms, the scalar reference is not optional. It is the anchor. The optimized path can use SIMD, threads, or architecture-specific instructions, but the optimized path must still agree with the reference under a defined tolerance. Without that discipline, performance numbers are not meaningful.
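
The phrase "agree with the reference under a defined tolerance" can be made concrete with a small comparison helper. This is a hedged sketch: `parity_ok_f32` is a hypothetical name, and the combined absolute-plus-relative tolerance is one common convention, not necessarily the one C-Kernel-Engine uses.

```c
#include <math.h>
#include <stddef.h>
#include <stdbool.h>

/* Returns true if every element satisfies
 *   |got[i] - ref[i]| <= atol + rtol * |ref[i]|.
 * Optionally reports the largest absolute error seen. */
bool parity_ok_f32(const float *ref, const float *got, size_t n,
                   float atol, float rtol, float *max_abs_err)
{
    bool ok = true;
    float worst = 0.0f;
    for (size_t i = 0; i < n; ++i) {
        const float err = fabsf(got[i] - ref[i]);
        const float tol = atol + rtol * fabsf(ref[i]);
        if (!(err <= tol)) ok = false;   /* written this way so NaN fails */
        if (err > worst) worst = err;
    }
    if (max_abs_err) *max_abs_err = worst;
    return ok;
}
```

Logging the worst error, not just pass or fail, makes it easier to see drift growing across layers before it crosses the threshold.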

Attention Kernels

Attention is not one kernel. It is a pipeline:

A transformer block shown as a sequence of kernels: RMSNorm, QKV GEMM, RoPE, attention, projection, residual add, and MLP.

Attention kernel sequence

Q = XWq
K = XWk
V = XWv

scores = QK^T / sqrt(d_k)          (d_k = per-head key dimension)
prob   = softmax(scores + mask)    (causal mask: -inf on future positions)
out    = prob V
Y      = out Wo

The attention map includes linear kernels, position kernels, matrix multiplication, softmax, value mixing, output projection, and residual addition. Each stage has different memory behavior. QKV projection is weight-stream heavy. Scores can become quadratic in sequence length. Softmax needs numerical stabilization. KV cache changes the decode path.

This is why “attention” is a useful model concept but too coarse as a systems concept. Kernel engineering needs to break it into the pieces that actually move bytes and execute instructions.

For learning, attention should be separated into at least five surfaces:

Projection: build Q, K, and V from the hidden state.
Position: apply RoPE or another position mechanism to Q and K.
Score: compute similarity between the current query and available keys.
Probability: apply mask and numerically stable softmax.
Mix: multiply probabilities by values and project back to hidden dimension.

Each surface can become the bottleneck in a different regime. Prefill may be dominated by large matrix operations. Decode may be dominated by KV-cache reads. Long context may become memory-bandwidth heavy. Small batch real-time serving may be latency-sensitive rather than throughput-sensitive. The model word “attention” hides all of those distinctions.

This is also why FlashAttention-style ideas matter. They do not change what attention computes. They are better memory scheduling for the same mathematical object. They reduce unnecessary reads and writes by changing how score, softmax, and value mixing are tiled through memory.

Position, Cache, and State Kernels

Position and memory are where models start to differ more strongly. RoPE rotates query and key vectors by position. KV cache stores past keys and values so decode does not recompute the entire prefix. State-space and recurrent-attention designs compress history into a fixed or structured state update.

A diagram comparing RoPE, KV cache, and recurrent state kernels such as DeltaNet or SSM.

Kernel family	Question it answers	Runtime pressure
RoPE / position	How does the model know token order?	sin/cos lookup, vector rotation, Q/K layout
KV cache	How does decode reuse history?	memory growth with context length
DeltaNet / SSM state	Can history be compressed into recurrent state?	state update correctness and long-context behavior
Sliding-window attention	Can each layer see only local context?	smaller cache/read footprint

This connects to positional encoding, attention, and Gated DeltaNet. The common systems question is: where does the model store sequence information, and how expensive is it to read or update that storage?

Routing and MoE Kernels

Mixture-of-Experts models add another kernel family: routing. Instead of sending every token through the same dense MLP weights, the model chooses a small number of experts for each token. That changes the runtime problem from “one dense block for all tokens” to “select, group, dispatch, compute, and combine.”

MoE routing mental model

scores  = router(hidden_state)
experts = top_k(scores)

for each selected expert:
    route token to expert batch
    run expert MLP
    weight expert output by router score

output = combined expert outputs

MoE is a good example of why kernel engineering is not only arithmetic. Routing introduces sorting, grouping, scatter/gather movement, load balancing, and scheduling. A mathematically valid MoE layer can still be slow if tokens are distributed badly across experts or if memory movement dominates expert compute.

That is also why MoE is relevant to CPU, GPU, and distributed systems thinking. The expert function may be a familiar MLP kernel, but the routing layer decides how work is partitioned and whether compute resources stay busy.

MoE step	Kernel/system concern	What to measure
Router scores	small projection plus top-k selection	latency, stability, top-k correctness
Token grouping	scatter/gather and batch formation	memory movement, expert imbalance
Expert MLP	dense GEMM/GEMV per expert	utilization, batch size per expert
Combine output	weighted sum back into token order	ordering correctness, accumulation tolerance

Quantization Kernels

Quantization kernels turn model deployment from a pure arithmetic problem into a memory-format problem. The kernel has to pack weights, unpack weights, apply scales, handle metadata, and accumulate accurately.

Quantized dot-product intuition

packed weights -> unpack nibbles/bytes
metadata       -> load scales / mins / block sums
dequant        -> reconstruct approximate values in registers
dot product    -> accumulate into int32 or fp32
output         -> return to the next runtime stage

The key lesson from the quantization deep dive is that quantization is not just “use INT4.” It is a format contract plus an ISA-specific kernel path. The correctness test is not whether the model produces plausible text once. The correctness test is whether parity with the reference, within a defined tolerance, survives long-horizon generation and real model bring-up.

The practical reason quantization matters is memory bandwidth. If the bottleneck is moving weights from memory into the compute unit, smaller weight formats can help even before the arithmetic looks impressive. But the savings are not free. The kernel must pay the cost of unpacking, applying scales, handling block metadata, and preserving enough numerical information for the model to remain useful.

This is why quantization is a perfect kernel-engineering topic. It touches math, memory layout, instruction sets, model accuracy, and runtime dispatch at the same time. A quantized kernel is not finished when it runs. It is finished when it runs fast, matches the reference within tolerance, and preserves model behavior on real prompts.

Optimizer Kernels

Optimizer kernels are the training-side counterpart to inference kernels. Backprop computes gradients. The optimizer interprets those gradients and updates the weights.

AdamW in compact form

m_t   = beta1 * m_{t-1} + (1 - beta1) * g_t
v_t   = beta2 * v_{t-1} + (1 - beta2) * g_t^2
m_hat = m_t / (1 - beta1^t)
v_hat = v_t / (1 - beta2^t)
w     = w - lr * (m_hat / (sqrt(v_hat) + eps) + weight_decay * w)

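For kernel engineering, optimizers matter because they are memory-heavy. AdamW keeps two extra state buffers per parameter, the first-moment estimate m and the second-moment estimate v, and every step reads and writes both along with the weights and gradients. Large training jobs are not just about multiplying tensors; they are about moving parameters, gradients, optimizer states, activations, and checkpoints reliably.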

This is why the optimizer post belongs in the same learning path as inference kernels. A serious AI kernel engineer must understand both forward serving and backward training.

Dispatcher and Runtime Kernels

A model runtime is not only a pile of kernels. It also needs a dispatcher. The dispatcher decides which kernel implementation should run for a given operation, shape, dtype, hardware target, and memory layout.

Runtime dispatch sketch

if op == RMSNORM:
    if cpu_has(AVX2):
        run rmsnorm_avx2(...)
    else:
        run rmsnorm_scalar(...)

elif op == Q4_GEMV:
    if cpu_has(AVX512):
        run q4_gemv_avx512(...)
    elif cpu_has(NEON):
        run q4_gemv_neon(...)
    else:
        run q4_gemv_reference(...)

This is the part many explanations skip. A kernel engineer may write several versions of the same operation: scalar reference, portable C, thread-pool version, AVX2 version, AVX-512 version, NEON version, or a backend-specific accelerator version. The runtime has to select the correct one without breaking the model contract.

The dispatcher is also where testing discipline becomes essential. Every optimized path needs parity against the reference path. Every path needs shape guards. Every path needs clear assumptions about alignment, block size, dtype, and scratch memory. Otherwise the runtime becomes fast only when the demo happens to follow the happy path.

How To Think About Kernel Speed

Kernel speed is not only “do fewer operations.” Modern processors are usually limited by where the data lives, how often it moves, and whether the useful values stay close to the execution units.

The simplest mental model is:

Kernel speed mental model

registers  -> fastest, tiny, closest to execution
L1 cache   -> very fast, small
L2 cache   -> fast, larger
L3 cache   -> shared, slower
DRAM       -> huge, much slower
storage    -> enormous, not part of the hot loop

A fast kernel tries to keep the hot working set in registers and cache. A slow kernel repeatedly fetches the same data from DRAM, writes temporary values too early, or touches memory in a pattern the CPU cannot predict and prefetch efficiently.

Concept	Kernel-engineering meaning	Example
Registers	smallest and fastest storage for live values	accumulate several dot-product sums before storing
Hot cache	data recently used and likely still close to the core	reuse a block of weights across several token rows
Cold memory	data not in cache; fetch cost dominates	streaming huge weights from DRAM during decode
Arithmetic intensity	work done per byte moved	GEMM can reuse tiles; GEMV often has lower reuse
Writeback pressure	cost of storing intermediates	unfused activation spills an intermediate tensor
Branch overhead	control flow that disrupts predictable execution	checking dtype or shape inside the inner loop

This is why kernel code often looks different from ordinary application code. The inner loop should be boring and predictable. The dispatcher can make decisions before the loop starts. The loop itself should mostly load, compute, accumulate, and store.

For a dot product, the naive version may load one value, multiply, add, and repeat. A better version tries to keep multiple accumulators in registers, load contiguous values, unroll the loop, and avoid storing partial results until the end.

Dot-product speed intuition

bad mental model:
  for every operation, go back to memory

better mental model:
  load a block
  keep accumulators in registers
  reuse cached data
  store only when the result is complete
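
The better mental model translates into a short C loop. The sketch below uses four independent accumulators and a 4-way unrolled body; `dot_f32_unrolled` is a hypothetical name, and four is an arbitrary choice, not a tuned value.

```c
#include <stddef.h>

/* Dot product with four independent partial sums kept in registers. */
float dot_f32_unrolled(const float *a, const float *b, size_t n)
{
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    size_t i = 0;

    /* Main loop: contiguous loads, no stores, four independent add chains. */
    for (; i + 4 <= n; i += 4) {
        s0 += a[i + 0] * b[i + 0];
        s1 += a[i + 1] * b[i + 1];
        s2 += a[i + 2] * b[i + 2];
        s3 += a[i + 3] * b[i + 3];
    }
    /* Tail: leftover elements when n is not a multiple of four. */
    for (; i < n; ++i) s0 += a[i] * b[i];

    /* Combine only once, at the end. */
    return (s0 + s1) + (s2 + s3);
}
```

Splitting the sum changes the order of floating-point additions, so the result can differ from a single-accumulator loop in the last bits. That is expected, and it is one reason parity tests use a tolerance.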

For AI kernels, this shows up everywhere. RMSNorm wants to reduce a vector without unnecessary passes. GEMV wants to stream weights and reuse the input vector while it is hot. GEMM wants to tile so small blocks of A and B are reused many times. Quantized kernels want to unpack weights into registers and immediately use them before the temporary representation spills. Attention wants to avoid writing the full score matrix when a tiled online softmax can keep only what is needed.

The high-level rule is simple: keep hot data close, reuse it while it is hot, and keep temporaries short-lived. Put constants and accumulators in registers when possible. Keep working tiles inside cache. Avoid large temporary tensors unless they are necessary for correctness or later reuse. Measure the result instead of guessing.

What To Practice For Each Kernel Family

A useful learning path should produce code, not only notes. The table below is a practical curriculum for building intuition one kernel family at a time.

Kernel family	First implementation	Next step	Test
Linear	scalar GEMV	blocked GEMM or threaded rows	compare against Python/NumPy reference
Activation	ReLU, GELU, SiLU	SwiGLU with two projections	curve values and end-to-end MLP parity
Normalization	RMSNorm scalar	SIMD reduction	tolerance across random vectors
Attention	single-head attention	KV-cache decode	same logits as reference for short prompts
Position	RoPE rotation	cached sin/cos or fused Q/K path	exact Q/K rotation parity
Quantization	simple int8 dot product	block quantized Q4/Q8 path	reference dequant and model-level drift
Optimizer	SGD	AdamW with state buffers	known training-step parity
Runtime	manual function call	ISA-aware dispatcher	all implementations agree with reference

This is the same pattern from robotics and embedded systems: start with the clear scalar math, create a reliable reference, measure it, then optimize only after correctness is boring. Whether the target is a flight controller, CPU inference engine, or distributed training system, the engineering discipline is the same.

The Practical Learning Ladder

A practical learning ladder for AI kernel engineering from scalar math to distributed systems and accelerators.

The beginner path is not “jump straight to CUDA” or “memorize transformer papers.” A better path is:

Write scalar wx+b and derivatives by hand.
Convert scalar math into vector and matrix form.
Implement naive GEMV and GEMM in C.
Measure the code with timers and perf.
Add thread-pool row partitioning.
Add SIMD only after the scalar path is correct.
Study cache, NUMA, memory bandwidth, and roofline limits.
Then extend toward distributed systems and accelerators.
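
The naive GEMM step and the later cache-blocking step from the ladder above can be sketched side by side. This is an illustration under stated assumptions: row-major fp32, `C[M,K] = A[M,N] × B[N,K]` to match the shapes used earlier in this post, a caller-chosen `tile > 0`, and hypothetical function names. It shows the loop structure only; a tuned kernel would add register blocking and SIMD.

```c
#include <stddef.h>

/* Reference: one output at a time, no reuse strategy. */
void gemm_naive_f32(const float *A, const float *B, float *C,
                    size_t M, size_t N, size_t K)
{
    for (size_t m = 0; m < M; ++m)
        for (size_t k = 0; k < K; ++k) {
            float acc = 0.0f;
            for (size_t n = 0; n < N; ++n) acc += A[m * N + n] * B[n * K + k];
            C[m * K + k] = acc;
        }
}

static size_t min_sz(size_t a, size_t b) { return a < b ? a : b; }

/* Blocked: work on tile x tile sub-problems so the touched parts of
 * A, B and C can stay in cache while they are being reused. */
void gemm_blocked_f32(const float *A, const float *B, float *C,
                      size_t M, size_t N, size_t K, size_t tile)
{
    for (size_t i = 0; i < M * K; ++i) C[i] = 0.0f;

    for (size_t m0 = 0; m0 < M; m0 += tile)
        for (size_t n0 = 0; n0 < N; n0 += tile)
            for (size_t k0 = 0; k0 < K; k0 += tile) {
                const size_t m1 = min_sz(m0 + tile, M);
                const size_t n1 = min_sz(n0 + tile, N);
                const size_t k1 = min_sz(k0 + tile, K);
                for (size_t m = m0; m < m1; ++m)
                    for (size_t n = n0; n < n1; ++n) {
                        const float a = A[m * N + n];   /* reused across k */
                        for (size_t k = k0; k < k1; ++k)
                            C[m * K + k] += a * B[n * K + k];
                    }
            }
}
```

Keeping the naive version around is the point of the ladder: it is the reference the blocked version is tested against before anyone looks at timings.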

For robotics and control systems, the same foundation appears in smaller form: Jacobians, matrix multiplies, Kalman filters, PID loops, sensor fusion, flight controllers, and embedded inference. For LLMs, the same foundation scales into huge memory systems, networked clusters, and transformer-model runtime graphs.

After this ladder, the path becomes more specialized. On CPUs, you study cache blocking, prefetching, NUMA placement, thread affinity, SIMD instruction selection, and roofline analysis. On GPUs, you study warps, shared memory, occupancy, tensor cores, and memory coalescing. On distributed systems, you study sharding, networking, collective communication, storage throughput, checkpointing, and reliability.

The important point is that these are extensions of the same foundation. The math does not disappear when the system gets larger. It becomes more expensive to move, schedule, verify, and preserve.

How This Connects To C-Kernel-Engine

C-Kernel-Engine is useful as a learning artifact because it forces the model to become explicit. A template maps to a circuit. A circuit maps to kernel order. Kernel order maps to tensor shapes and memory layout. Then each kernel has to be implemented, tested, dispatched, and measured.

The public concept map is here: C-Kernel-Engine concepts. That page is useful because it shows how model-level ideas such as attention, MoE routing, normalization, quantization, and runtime dispatch become concrete kernel surfaces.

That is the practical kernel-engineering loop:

Kernel engineering loop

math definition
  -> tensor shape
  -> scalar reference
  -> C kernel
  -> correctness test
  -> thread partition
  -> SIMD / ISA path
  -> memory measurement
  -> runtime dispatch
  -> model-level validation

The next long-form video can use this post as the map: start with the kernel vocabulary, then walk through the math and implementation path kernel by kernel.

How To Use This Post For The Next Video

The long-form video should not try to teach every kernel fully in one sitting. A better structure is to use this post as the map, then zoom into each family with a concrete example:

Start with the question: what kernels do I need to know before deep learning stops feeling mysterious?
Show the map: linear, activation, norm, attention, position, memory, quantization, optimizer, dispatcher.
Walk one kernel deeply: use Y = XW + b to connect scalar math, GEMV, GEMM, and backprop.
Show runtime thinking: explain why the same math needs different kernels for scalar, SIMD, thread pool, and accelerator paths.
Close with practice: implement, test, measure, optimize, and validate on a real model.

That structure keeps the talk grounded. The audience does not need to become expert in every kernel immediately. They need to see the terrain and understand why the same small set of mathematical operations keeps reappearing inside deep neural networks.

Takeaway

If you want to understand deep neural networks deeply, learn the kernels underneath them: linear algebra, activation, normalization, attention, position, memory, quantization, and optimization.

Architectures change. The kernel vocabulary keeps reappearing.
