AI kernel learning map
This post is a bridge between the earlier ShivasNotes fundamentals series and the next long-form video: what AI kernels should you actually know if you want to understand deep neural networks from math to runtime?
A neural network is not one giant mysterious operation. It is a graph of smaller kernels: matrix multiplies, activation functions, normalization, attention, position encoding, cache updates, quantization, and optimizer updates. Once those kernels become familiar, transformer models, recurrent attention, vision encoders, and CPU inference systems become much easier to read. The goal is not to memorize every model architecture. The goal is to recognize the kernel vocabulary underneath the architecture.
Purpose of this post
This is a practical learning map for AI, robotics, and HPC kernel engineering. It connects the math you write by hand to the low-level routines that execute inside real runtimes.
It also links back to the earlier ShivasNotes sequence: matrix wx+b, softmax, activation functions, normalization, position encoding, tokenization, attention, residual connections, backprop, thread pools, vision encoders, SIMD, ARM NEON, and quantization.

How To Read The Kernel Map
The map is not meant to be a taxonomy for academic neatness. It is a debugging tool. When a transformer model fails parity, runs slowly, overflows numerically, consumes too much memory, or cannot scale across threads, the failure normally lands in one of these kernel families.
A beginner often reads a model diagram as a list of named blocks: attention, MLP, norm, residual, head. A kernel engineer reads the same diagram as a list of executable contracts: load this tensor, transform this shape, reduce this axis, preserve this dtype expectation, write this layout, and hand the result to the next kernel.
| Model word | Kernel-engineering reading | Typical failure mode |
|---|---|---|
| Linear layer | GEMV/GEMM, layout, tile shape, accumulation | wrong transpose, slow memory access, bad accumulation order |
| Activation | elementwise curve, gating, possible fusion point | wrong approximation, missing gate, incorrect intermediate size |
| Norm | reduction kernel plus scale application | epsilon mismatch, reduction drift, wrong axis |
| Attention | projection, score, mask, softmax, value mix, cache | mask bug, unstable softmax, KV layout mismatch |
| Quantized layer | packed format, metadata, dequant, dot product | wrong scale, block-size mismatch, parity drift |
| Optimizer | gradient transform plus persistent state update | state memory explosion, wrong weight decay semantics |
This is why kernel learning compounds. Every new architecture may introduce a new arrangement, but the same questions keep returning: what are the shapes, what memory is read, what math is applied, what is reduced, what is cached, what is updated, and how do we prove it is correct?
What Is An AI Kernel?
An AI kernel is a low-level routine that implements one mathematical operation efficiently on a compute target. That target can be a scalar CPU loop, a SIMD vector unit, a thread pool, a GPU block, a matrix tile unit, an FPGA datapath, or a distributed cluster stage.
The important part is that a kernel has a contract. It accepts inputs with known shapes, performs a specific mathematical transformation, respects numerical expectations, and writes outputs in a layout the next kernel can consume.
kernel(input tensors, weights, metadata) -> output tensor
The contract includes:
shape
dtype
memory layout
numerical meaning
threading / SIMD rules
correctness tolerance
This is why AI kernel engineering sits between math and systems. If you only know the math, you may write code that is correct but slow. If you only know systems, you may optimize a wrong operation. The kernel engineer has to preserve the math while making it efficient.
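
One way to make that contract concrete is a small descriptor plus a check that runs before the kernel is called. The sketch below is hypothetical: `tensor2d`, `dtype_t`, `row_ptr`, and `linear_contract_ok` are illustrative names invented here, and it assumes row-major fp32 tensors.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical descriptor: names and fields are illustrative only. */
typedef enum { DT_F32, DT_I8 } dtype_t;

typedef struct {
    void   *data;
    dtype_t dtype;
    size_t  rows, cols;
    size_t  row_stride;   /* elements between consecutive rows */
} tensor2d;

/* Pointer to the start of row r of an fp32 tensor. */
float *row_ptr(const tensor2d *t, size_t r)
{
    assert(t->dtype == DT_F32 && r < t->rows);
    return (float *)t->data + r * t->row_stride;
}

/* Returns 1 when Y = XW is well-formed under this sketch's assumptions. */
int linear_contract_ok(const tensor2d *X, const tensor2d *W, const tensor2d *Y)
{
    if (!X || !W || !Y) return 0;
    if (X->dtype != DT_F32 || W->dtype != DT_F32 || Y->dtype != DT_F32) return 0;
    if (X->cols != W->rows) return 0;                        /* inner dims chain */
    if (Y->rows != X->rows || Y->cols != W->cols) return 0;  /* output shape */
    if (X->row_stride < X->cols || W->row_stride < W->cols ||
        Y->row_stride < Y->cols) return 0;                   /* rows do not overlap */
    return 1;
}
```

Checking the contract once, outside the hot loop, keeps the kernel body free of shape and dtype branches.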
The First Kernel: Linear Projection
The first kernel to master is the linear layer:
Y = XW + b
This is the same idea behind scalar wx+b, GEMV, GEMM, attention projections, MLP projections, vision patch projections, and classifier heads. In a transformer block, the model repeatedly applies linear kernels to move between hidden states, Q/K/V tensors, attention outputs, and MLP intermediate vectors.
For one token, the linear layer often looks like GEMV: one vector multiplied by a weight matrix. For many tokens, it becomes GEMM: a matrix of tokens multiplied by a weight matrix. The math is stable; the runtime shape changes the kernel strategy.
| Kernel | Shape | Where it appears | What to learn |
|---|---|---|---|
| wx+b | scalar or vector | first-principles linear model | derivatives and local gradients |
| GEMV | [1,N] × [N,K] | single-token decode | row/column dot products, cache reuse |
| GEMM | [M,N] × [N,K] | prefill, training, batches | tiling, blocking, arithmetic intensity |
| Projection | XWq, XWk, XWv | attention | shape discipline and memory layout |
The same linear idea has different speed rules depending on shape. This is one of the most important lessons in kernel engineering: the equation may look the same, but the memory behavior changes.
| Level | Math view | Speed view | What to keep hot |
|---|---|---|---|
| Scalar | y = wx + b | almost no memory problem; teaches correctness | w, x, accumulator in registers |
| Vector dot | y_j = sum_i x_i W_{i,j} | stream one weight column and reuse one input vector | input vector and accumulator |
| GEMV | [1,N] × [N,K] | often memory-bandwidth heavy during decode | current token vector, several output accumulators |
| GEMM | [M,N] × [N,K] | tile A and B so each loaded value is reused many times | small A/B tiles and C accumulators |
For scalar code, you mainly learn the derivative and the exact order of operations. For GEMV, you start caring about contiguous memory, vector lanes, and keeping several output accumulators in registers. For GEMM, you care about blocking and tiling because the same input and weight values can be reused across many outputs.
scalar:
    keep w, x, b, acc in registers
GEMV:
    keep x hot
    stream W
    compute several y values before storing
GEMM:
    tile X and W
    keep C accumulators in registers
    reuse cache-resident tiles many times
If you understand this kernel deeply, you can connect the simple wx+b post to transformer inference and backpropagation. The Jacobian is the conceptual object. The kernel implements the shortcut.
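
The GEMV advice above can be sketched in scalar C. This is a minimal reference, not a tuned kernel: `gemv_ref` is a hypothetical name, `W` is assumed to be row-major with shape [N,K], and four output accumulators stand in for "compute several y values before storing".

```c
#include <assert.h>
#include <stddef.h>

/* y[K] = x[N] * W[N][K] + b[K], with W stored row-major (assumed layout). */
void gemv_ref(const float *x, const float *W, const float *b,
              float *y, size_t N, size_t K)
{
    assert(x && W && b && y);
    size_t j = 0;

    /* Four outputs at a time: each x[i] is loaded once and reused four times. */
    for (; j + 4 <= K; j += 4) {
        float a0 = b[j], a1 = b[j + 1], a2 = b[j + 2], a3 = b[j + 3];
        for (size_t i = 0; i < N; ++i) {
            const float  xi = x[i];
            const float *w  = W + i * K + j;   /* four contiguous weights */
            a0 += xi * w[0];
            a1 += xi * w[1];
            a2 += xi * w[2];
            a3 += xi * w[3];
        }
        y[j] = a0; y[j + 1] = a1; y[j + 2] = a2; y[j + 3] = a3;
    }

    /* Remainder columns, one accumulator each. */
    for (; j < K; ++j) {
        float a = b[j];
        for (size_t i = 0; i < N; ++i)
            a += x[i] * W[i * K + j];
        y[j] = a;
    }
}
```

The accumulators are written to `y` only once per output, which is the register-level version of "store only when the result is complete".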
The Same Linear Kernel Also Teaches Backprop
For forward inference, the linear layer computes output values. For training, the same local structure tells us how gradients move backward. If Y = XW + b, and the parent gradient arriving from the loss is G = dL/dY, the compact backprop rules are:
dW = X^T G
dX = G W^T
db = sum_rows(G)
This is where the Jacobian intuition becomes useful without forcing us to materialize a giant Jacobian. Conceptually, every output depends only on the input row and the corresponding weight column. Computationally, the kernel uses matrix products and reductions to apply the same chain rule efficiently.
That distinction matters. A beginner may try to write every partial derivative one by one. A runtime cannot afford that. A runtime uses the structure of the Jacobian to avoid storing the full object. The math says what is connected. The kernel chooses the efficient contraction.
| Backward object | Kernel shape | Meaning |
|---|---|---|
| dW | X^T G | accumulate how each input feature contributed to each output gradient |
| dX | G W^T | send gradient back to the previous layer or token representation |
| db | sum_rows(G) | bias gradient is a reduction across tokens or examples |
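
The three backward rules can be written as plain loops before any optimization. The sketch below assumes row-major storage with X:[M,N], W:[N,K], and G:[M,K]; `linear_backward_ref` is a hypothetical name, and the loops are deliberately naive so they can serve as a parity reference.

```c
#include <assert.h>
#include <stddef.h>

/* Outputs: dW:[N,K], dX:[M,N], db:[K]. All tensors row-major (assumed). */
void linear_backward_ref(const float *X, const float *W, const float *G,
                         float *dW, float *dX, float *db,
                         size_t M, size_t N, size_t K)
{
    assert(X && W && G && dW && dX && db);

    /* dW = X^T G: contract over the token/example axis m. */
    for (size_t n = 0; n < N; ++n)
        for (size_t k = 0; k < K; ++k) {
            float acc = 0.0f;
            for (size_t m = 0; m < M; ++m)
                acc += X[m * N + n] * G[m * K + k];
            dW[n * K + k] = acc;
        }

    /* dX = G W^T: contract over the output-feature axis k. */
    for (size_t m = 0; m < M; ++m)
        for (size_t n = 0; n < N; ++n) {
            float acc = 0.0f;
            for (size_t k = 0; k < K; ++k)
                acc += G[m * K + k] * W[n * K + k];
            dX[m * N + n] = acc;
        }

    /* db: sum G over its rows, giving one value per output feature. */
    for (size_t k = 0; k < K; ++k) {
        float acc = 0.0f;
        for (size_t m = 0; m < M; ++m)
            acc += G[m * K + k];
        db[k] = acc;
    }
}
```

No Jacobian is ever stored: each output element is one contraction over a single axis.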
Activation Kernels
After a linear projection, neural networks usually need a nonlinearity. Without activation kernels, stacked linear layers collapse into another linear operation. Activations bend the signal and make deeper function composition useful.

For older networks, ReLU is the basic starting point. For transformer models, GELU and SiLU-family functions are more common. Modern MLP blocks often use gated activations such as SwiGLU:
gate = SiLU(XW_gate)
up = XW_up
out = gate ⊙ up
down = out W_down
This is why activation kernels are not “just one formula.” A gated activation may involve two linear projections, an elementwise activation, an elementwise multiply, and a final projection. In real runtime terms, the activation family changes memory movement and kernel fusion opportunities.
The practical thing to learn is not only the curve. Learn the memory path. A ReLU kernel can stream through one tensor and write one tensor. A SwiGLU block usually needs two projected streams, a nonlinear gate, a multiply, and then a down projection. That means more intermediate data, more layout decisions, and more opportunity to fuse or accidentally spill to memory.
Practice target
Implement ReLU, GELU, SiLU, and SwiGLU in scalar C first. Then add vectorized paths only after the scalar reference and parity tests are stable. The activation is simple enough to learn correctness discipline, but rich enough to teach approximation, vectorization, and fusion.
Normalization Kernels
LayerNorm and RMSNorm stabilize hidden states by controlling scale. These kernels are deceptively simple: compute a reduction, derive a scale factor, then apply it across the vector.
mean_square = sum(x_i^2) / D
rstd = 1 / sqrt(mean_square + eps)
y_i = gamma_i * x_i * rstd
The important kernel lesson is reduction. Many lanes or threads contribute to one scalar statistic. That means the kernel engineer must understand floating-point accumulation, SIMD reductions, thread partitioning, and numerical tolerance.
This connects directly to the earlier LayerNorm and RMSNorm post and the SIMD deep dive.
Normalization is also where “close enough” becomes dangerous. A tiny mismatch in a reduction may not look serious for one token, but model parity is a long chain. Small numerical differences can propagate through attention, MLP, logits, sampling, and future tokens. That is why good kernel work needs both local tests and model-level validation.
In C-Kernel-Engine terms, the scalar reference is not optional. It is the anchor. The optimized path can use SIMD, threads, or architecture-specific instructions, but the optimized path must still agree with the reference under a defined tolerance. Without that discipline, performance numbers are not meaningful.
Attention Kernels
Attention is not one kernel. It is a pipeline:

Q = XWq
K = XWk
V = XWv
scores = QK^T / sqrt(d)
prob = softmax(scores)
out = prob V
Y = out Wo
The attention map includes linear kernels, position kernels, matrix multiplication, softmax, value mixing, output projection, and residual addition. Each stage has different memory behavior. QKV projection is weight-stream heavy. Scores can become quadratic in sequence length. Softmax needs numerical stabilization. KV cache changes the decode path.
This is why “attention” is a useful model concept but too coarse as a systems concept. Kernel engineering needs to break it into the pieces that actually move bytes and execute instructions.
For learning, attention should be separated into at least five surfaces:
- Projection: build Q, K, and V from the hidden state.
- Position: apply RoPE or another position mechanism to Q and K.
- Score: compute similarity between the current query and available keys.
- Probability: apply mask and numerically stable softmax.
- Mix: multiply probabilities by values and project back to hidden dimension.
Each surface can become the bottleneck in a different regime. Prefill may be dominated by large matrix operations. Decode may be dominated by KV-cache reads. Long context may become memory-bandwidth heavy. Small batch real-time serving may be latency-sensitive rather than throughput-sensitive. The model word “attention” hides all of those distinctions.
This is also why FlashAttention-style ideas matter. They are not magic attention. They are better memory scheduling for the same mathematical object. They reduce unnecessary reads and writes by changing how score, softmax, and value mixing are tiled through memory.
Position, Cache, and State Kernels
Position and memory are where models start to differ more strongly. RoPE rotates query and key vectors by position. KV cache stores past keys and values so decode does not recompute the entire prefix. State-space and recurrent-attention designs compress history into a fixed or structured state update.

| Kernel family | Question it answers | Runtime pressure |
|---|---|---|
| RoPE / position | How does the model know token order? | sin/cos lookup, vector rotation, Q/K layout |
| KV cache | How does decode reuse history? | memory growth with context length |
| DeltaNet / SSM state | Can history be compressed into recurrent state? | state update correctness and long-context behavior |
| Sliding-window attention | Can each layer see only local context? | smaller cache/read footprint |
This connects to positional encoding, attention, and Gated DeltaNet. The common systems question is: where does the model store sequence information, and how expensive is it to read or update that storage?
Routing and MoE Kernels
Mixture-of-Experts models add another kernel family: routing. Instead of sending every token through the same dense MLP weights, the model chooses a small number of experts for each token. That changes the runtime problem from “one dense block for all tokens” to “select, group, dispatch, compute, and combine.”
scores = router(hidden_state)
experts = top_k(scores)
for each selected expert:
    route token to expert batch
    run expert MLP
    weight expert output by router score
output = combined expert outputs
MoE is a good example of why kernel engineering is not only arithmetic. Routing introduces sorting, grouping, scatter/gather movement, load balancing, and scheduling. A mathematically valid MoE layer can still be slow if tokens are distributed badly across experts or if memory movement dominates expert compute.
That is also why MoE is relevant to CPU, GPU, and distributed systems thinking. The expert function may be a familiar MLP kernel, but the routing layer decides how work is partitioned and whether compute resources stay busy.
| MoE step | Kernel/system concern | What to measure |
|---|---|---|
| Router scores | small projection plus top-k selection | latency, stability, top-k correctness |
| Token grouping | scatter/gather and batch formation | memory movement, expert imbalance |
| Expert MLP | dense GEMM/GEMV per expert | utilization, batch size per expert |
| Combine output | weighted sum back into token order | ordering correctness, accumulation tolerance |
Quantization Kernels
Quantization kernels turn model deployment from a pure arithmetic problem into a memory-format problem. The kernel has to pack weights, unpack weights, apply scales, handle metadata, and accumulate accurately.
packed weights -> unpack nibbles/bytes
metadata -> load scales / mins / block sums
dequant -> reconstruct approximate values in registers
dot product -> accumulate into int32 or fp32
output -> return to the next runtime stage
The key lesson from the quantization deep dive is that quantization is not just “use INT4.” It is a format contract plus an ISA-specific kernel path. The correctness test is not whether the model produces plausible text once. The correctness test is whether strict parity survives long-horizon generation and real model bring-up.
The practical reason quantization matters is memory bandwidth. If the bottleneck is moving weights from memory into the compute unit, smaller weight formats can help even before the arithmetic looks impressive. But the savings are not free. The kernel must pay the cost of unpacking, applying scales, handling block metadata, and preserving enough numerical information for the model to remain useful.
This is why quantization is a perfect kernel-engineering topic. It touches math, memory layout, instruction sets, model accuracy, and runtime dispatch at the same time. A quantized kernel is not finished when it runs. It is finished when it runs fast, matches the reference within tolerance, and preserves model behavior on real prompts.
Optimizer Kernels
Optimizer kernels are the training-side counterpart to inference kernels. Backprop computes gradients. The optimizer interprets those gradients and updates the weights.
m_t = beta1 * m_{t-1} + (1-beta1) * g_t
v_t = beta2 * v_{t-1} + (1-beta2) * g_t^2
m_hat = m_t / (1 - beta1^t)
v_hat = v_t / (1 - beta2^t)
w = w - lr * (m_hat / (sqrt(v_hat) + eps) + weight_decay * w)
For kernel engineering, optimizers matter because they are memory-heavy. AdamW keeps two extra state buffers, m and v, for every parameter. Large training jobs are not just about multiplying tensors; they are about moving parameters, gradients, optimizer states, activations, and checkpoints reliably.
This is why the optimizer post belongs in the same learning path as inference kernels. A serious AI kernel engineer must understand both forward serving and backward training.
Dispatcher and Runtime Kernels
A model runtime is not only a pile of kernels. It also needs a dispatcher. The dispatcher decides which kernel implementation should run for a given operation, shape, dtype, hardware target, and memory layout.
if op == RMSNORM:
    if isa supports AVX2:
        run rmsnorm_avx2(...)
    else:
        run rmsnorm_scalar(...)
elif op == Q4_GEMV:
    if isa supports NEON:
        run q4_gemv_neon(...)
    elif isa supports AVX512:
        run q4_gemv_avx512(...)
    else:
        run q4_gemv_reference(...)
This is the part many explanations skip. A kernel engineer may write several versions of the same operation: scalar reference, portable C, thread-pool version, AVX2 version, AVX-512 version, NEON version, or a backend-specific accelerator version. The runtime has to select the correct one without breaking the model contract.
The dispatcher is also where testing discipline becomes essential. Every optimized path needs parity against the reference path. Every path needs shape guards. Every path needs clear assumptions about alignment, block size, dtype, and scratch memory. Otherwise the runtime becomes fast only when the demo happens to follow the happy path.
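
A dispatcher with a shape guard can be sketched with a function pointer. Everything here is illustrative: the `ISA_*` bits are hypothetical flags that a real runtime would fill in from CPU feature detection, and `sumsq_wide` is a plain-C stand-in for a SIMD path, not actual AVX2 or NEON code. The kernel chosen is the sum-of-squares reduction used by RMSNorm.

```c
#include <assert.h>
#include <stddef.h>

typedef float (*sumsq_fn)(const float *x, size_t n);

/* Hypothetical feature bits; a real runtime would query the CPU. */
enum { ISA_AVX2 = 1 << 0, ISA_NEON = 1 << 1 };

/* Reference path: always valid. */
float sumsq_scalar(const float *x, size_t n)
{
    float s = 0.0f;
    for (size_t i = 0; i < n; ++i)
        s += x[i] * x[i];
    return s;
}

/* Stand-in for a vector path: assumes n is a multiple of 4. */
float sumsq_wide(const float *x, size_t n)
{
    assert(n % 4 == 0);
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    for (size_t i = 0; i < n; i += 4) {
        s0 += x[i]     * x[i];
        s1 += x[i + 1] * x[i + 1];
        s2 += x[i + 2] * x[i + 2];
        s3 += x[i + 3] * x[i + 3];
    }
    return (s0 + s1) + (s2 + s3);
}

/* Decide once, before the hot loop; fall back when the guard fails. */
sumsq_fn select_sumsq(unsigned isa, size_t n)
{
    if ((isa & (ISA_AVX2 | ISA_NEON)) && n % 4 == 0)
        return sumsq_wide;
    return sumsq_scalar;
}
```

The guard on `n % 4` is the important part: the fast path states its assumption, and the dispatcher refuses to select it when the assumption does not hold.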
How To Think About Kernel Speed
Kernel speed is not only “do fewer operations.” Modern processors are usually limited by where the data lives, how often it moves, and whether the useful values stay close to the execution units.
The simplest mental model is:
registers -> fastest, tiny, closest to execution
L1 cache -> very fast, small
L2 cache -> fast, larger
L3 cache -> shared, slower
DRAM -> huge, much slower
storage -> enormous, not part of the hot loop
A fast kernel tries to keep the hot working set in registers and cache. A slow kernel repeatedly fetches the same data from DRAM, writes temporary values too early, or touches memory in a pattern the CPU cannot predict and prefetch efficiently.
| Concept | Kernel-engineering meaning | Example |
|---|---|---|
| Registers | smallest and fastest storage for live values | accumulate several dot-product sums before storing |
| Hot cache | data recently used and likely still close to the core | reuse a block of weights across several token rows |
| Cold memory | data not in cache; fetch cost dominates | streaming huge weights from DRAM during decode |
| Arithmetic intensity | work done per byte moved | GEMM can reuse tiles; GEMV often has lower reuse |
| Writeback pressure | cost of storing intermediates | unfused activation spills an intermediate tensor |
| Branch overhead | control flow that disrupts predictable execution | checking dtype or shape inside the inner loop |
This is why kernel code often looks different from ordinary application code. The inner loop should be boring and predictable. The dispatcher can make decisions before the loop starts. The loop itself should mostly load, compute, accumulate, and store.
For a dot product, the naive version may load one value, multiply, add, and repeat. A better version tries to keep multiple accumulators in registers, load contiguous values, unroll the loop, and avoid storing partial results until the end.
bad mental model:
    for every operation, go back to memory
better mental model:
    load a block
    keep accumulators in registers
    reuse cached data
    store only when the result is complete
For AI kernels, this shows up everywhere. RMSNorm wants to reduce a vector without unnecessary passes. GEMV wants to stream weights and reuse the input vector while it is hot. GEMM wants to tile so small blocks of A and B are reused many times. Quantized kernels want to unpack weights into registers and immediately use them before the temporary representation spills. Attention wants to avoid writing the full score matrix when a tiled online softmax can keep only what is needed.
The high-level rule is simple: keep hot things short-lived, close, and reused. Put constants and accumulators in registers when possible. Keep working tiles inside cache. Avoid large temporary tensors unless they are necessary for correctness or later reuse. Measure the result instead of guessing.
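
The dot-product advice can be made concrete with a naive loop next to an unrolled one. This is a sketch, with hypothetical names `dot_naive` and `dot_unrolled4`; whether the unrolled form is actually faster depends on the compiler and the CPU, so it should be measured rather than assumed.

```c
#include <assert.h>
#include <stddef.h>

/* One accumulator: every add waits on the previous add. */
float dot_naive(const float *a, const float *b, size_t n)
{
    assert(a && b);
    float s = 0.0f;
    for (size_t i = 0; i < n; ++i)
        s += a[i] * b[i];
    return s;
}

/* Four independent accumulators over contiguous loads, combined at the end. */
float dot_unrolled4(const float *a, const float *b, size_t n)
{
    assert(a && b);
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += a[i]     * b[i];
        s1 += a[i + 1] * b[i + 1];
        s2 += a[i + 2] * b[i + 2];
        s3 += a[i + 3] * b[i + 3];
    }
    float s = (s0 + s1) + (s2 + s3);
    for (; i < n; ++i)          /* tail elements */
        s += a[i] * b[i];
    return s;
}
```

The two functions sum in a different order, so on general floating-point data they agree only within a tolerance, not bit for bit.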
What To Practice For Each Kernel Family
A useful learning path should produce code, not only notes. The table below is a practical curriculum for building intuition one kernel family at a time.
| Kernel family | First implementation | Next step | Test |
|---|---|---|---|
| Linear | scalar GEMV | blocked GEMM or threaded rows | compare against Python/NumPy reference |
| Activation | ReLU, GELU, SiLU | SwiGLU with two projections | curve values and end-to-end MLP parity |
| Normalization | RMSNorm scalar | SIMD reduction | tolerance across random vectors |
| Attention | single-head attention | KV-cache decode | same logits as reference for short prompts |
| Position | RoPE rotation | cached sin/cos or fused Q/K path | exact Q/K rotation parity |
| Quantization | simple int8 dot product | block quantized Q4/Q8 path | reference dequant and model-level drift |
| Optimizer | SGD | AdamW with state buffers | known training-step parity |
| Runtime | manual function call | ISA-aware dispatcher | all implementations agree with reference |
This is the same pattern from robotics and embedded systems: start with the clear scalar math, create a reliable reference, measure it, then optimize only after correctness is boring. Whether the target is a flight controller, CPU inference engine, or distributed training system, the engineering discipline is the same.
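
Every row in that table ends in a comparison against a reference, so a small tolerance helper is worth writing first. The sketch below uses a combined absolute and relative bound; the name `allclose_f32` and the specific rule `|test - ref| <= atol + rtol * |ref|` are choices made for this example, not a standard from any particular runtime.

```c
#include <assert.h>
#include <math.h>
#include <stddef.h>

/* Returns 1 when every element is within tolerance; reports the worst gap. */
int allclose_f32(const float *test, const float *ref, size_t n,
                 float atol, float rtol, float *max_abs_diff)
{
    assert(test && ref);
    int ok = 1;
    float worst = 0.0f;
    for (size_t i = 0; i < n; ++i) {
        const float diff = fabsf(test[i] - ref[i]);
        const float tol  = atol + rtol * fabsf(ref[i]);
        /* Written as !(diff <= tol) so a NaN on either side fails the check. */
        if (!(diff <= tol)) ok = 0;
        if (diff > worst) worst = diff;
    }
    if (max_abs_diff) *max_abs_diff = worst;
    return ok;
}
```

Logging the worst difference alongside the pass/fail result makes it easier to see drift growing before it crosses the threshold.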
The Practical Learning Ladder

The beginner path is not “jump straight to CUDA” or “memorize transformer papers.” A better path is:
- Write scalar wx+b and derivatives by hand.
- Convert scalar math into vector and matrix form.
- Implement naive GEMV and GEMM in C.
- Measure the code with timers and perf.
- Add thread-pool row partitioning.
- Add SIMD only after the scalar path is correct.
- Study cache, NUMA, memory bandwidth, and roofline limits.
- Then extend toward distributed systems and accelerators.
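
The thread-pool row partitioning step in the ladder above usually starts with a function that turns a thread id into a half-open row range. The sketch below is one common scheme, with the hypothetical name `partition_rows`: rows are split as evenly as possible and the first `rows % nthreads` threads take one extra row.

```c
#include <assert.h>
#include <stddef.h>

/* Compute the half-open range [*begin, *end) of rows owned by thread tid. */
void partition_rows(size_t rows, size_t nthreads, size_t tid,
                    size_t *begin, size_t *end)
{
    assert(begin && end && nthreads > 0 && tid < nthreads);
    const size_t base = rows / nthreads;   /* rows every thread gets */
    const size_t rem  = rows % nthreads;   /* leftover rows to spread out */

    /* Threads before tid contribute base rows each, plus one extra row
       for each of them that is among the first rem threads. */
    *begin = tid * base + (tid < rem ? tid : rem);
    *end   = *begin + base + (tid < rem ? 1 : 0);
}
```

Because the ranges are disjoint and cover every row exactly once, each thread can write its slice of the output without locks.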
For robotics and control systems, the same foundation appears in smaller form: Jacobians, matrix multiplies, Kalman filters, PID loops, sensor fusion, flight controllers, and embedded inference. For LLMs, the same foundation scales into huge memory systems, networked clusters, and transformer-model runtime graphs.
After this ladder, the path becomes more specialized. On CPUs, you study cache blocking, prefetching, NUMA placement, thread affinity, SIMD instruction selection, and roofline analysis. On GPUs, you study warps, shared memory, occupancy, tensor cores, and memory coalescing. On distributed systems, you study sharding, networking, collective communication, storage throughput, checkpointing, and reliability.
The important point is that these are extensions of the same foundation. The math does not disappear when the system gets larger. It becomes more expensive to move, schedule, verify, and preserve.
How This Connects To C-Kernel-Engine
C-Kernel-Engine is useful as a learning artifact because it forces the model to become explicit. A template maps to a circuit. A circuit maps to kernel order. Kernel order maps to tensor shapes and memory layout. Then each kernel has to be implemented, tested, dispatched, and measured.
The public concept map is here: C-Kernel-Engine concepts. That page is useful because it shows how model-level ideas such as attention, MoE routing, normalization, quantization, and runtime dispatch become concrete kernel surfaces.
That is the practical kernel-engineering loop:
math definition
-> tensor shape
-> scalar reference
-> C kernel
-> correctness test
-> thread partition
-> SIMD / ISA path
-> memory measurement
-> runtime dispatch
-> model-level validation
The next long-form video can use this post as the map: start with the kernel vocabulary, then walk through the math and implementation path kernel by kernel.
How To Use This Post For The Next Video
The long-form video should not try to teach every kernel fully in one sitting. A better structure is to use this post as the map, then zoom into each family with a concrete example:
- Start with the question: what kernels do I need to know before deep learning stops feeling mysterious?
- Show the map: linear, activation, norm, attention, position, memory, quantization, optimizer, dispatcher.
- Walk one kernel deeply: use Y = XW + b to connect scalar math, GEMV, GEMM, and backprop.
- Show runtime thinking: explain why the same math needs different kernels for scalar, SIMD, thread pool, and accelerator paths.
- Close with practice: implement, test, measure, optimize, and validate on a real model.
That structure keeps the talk grounded. The audience does not need to become expert in every kernel immediately. They need to see the terrain and understand why the same small set of mathematical operations keeps reappearing inside deep neural networks.
Takeaway
If you want to understand deep neural networks deeply, learn the kernels underneath them: linear algebra, activation, normalization, attention, position, memory, quantization, and optimization.
Architectures change. The kernel vocabulary keeps reappearing.