AI kernel learning map

This post is a bridge between the earlier ShivasNotes fundamentals series and the next long-form video: what AI kernels should you actually know if you want to understand deep neural networks from math to runtime?

A neural network is not one giant mysterious operation. It is a graph of smaller kernels: matrix multiplies, activation functions, normalization, attention, position encoding, cache updates, quantization, and optimizer updates. Once those kernels become familiar, transformer models, recurrent attention, vision encoders, and CPU inference systems become much easier to read. The goal is not to memorize every model architecture. The goal is to recognize the kernel vocabulary underneath the architecture.

Purpose of this post

This is a practical learning map for AI, robotics, and HPC kernel engineering. It connects the math you write by hand to the low-level routines that execute inside real runtimes.

It also links back to the earlier ShivasNotes sequence: matrix wx+b, softmax, activation functions, normalization, position encoding, tokenization, attention, residual connections, backprop, thread pools, vision encoders, SIMD, ARM NEON, and quantization.

[Figure: A taxonomy map of the AI kernels needed to understand neural network runtimes.]

How To Read The Kernel Map

The map is not meant to be a taxonomy for academic neatness. It is a debugging tool. When a transformer model fails parity, runs slowly, overflows numerically, consumes too much memory, or cannot scale across threads, the failure normally lands in one of these kernel families.

A beginner often reads a model diagram as a list of named blocks: attention, MLP, norm, residual, head. A kernel engineer reads the same diagram as a list of executable contracts: load this tensor, transform this shape, reduce this axis, preserve this dtype expectation, write this layout, and hand the result to the next kernel.

| Model word | Kernel-engineering reading | Typical failure mode |
|---|---|---|
| Linear layer | GEMV/GEMM, layout, tile shape, accumulation | wrong transpose, slow memory access, bad accumulation order |
| Activation | elementwise curve, gating, possible fusion point | wrong approximation, missing gate, incorrect intermediate size |
| Norm | reduction kernel plus scale application | epsilon mismatch, reduction drift, wrong axis |
| Attention | projection, score, mask, softmax, value mix, cache | mask bug, unstable softmax, KV layout mismatch |
| Quantized layer | packed format, metadata, dequant, dot product | wrong scale, block-size mismatch, parity drift |
| Optimizer | gradient transform plus persistent state update | state memory explosion, wrong weight decay semantics |

This is why kernel learning compounds. Every new architecture may introduce a new arrangement, but the same questions keep returning: what are the shapes, what memory is read, what math is applied, what is reduced, what is cached, what is updated, and how do we prove it is correct?

What Is An AI Kernel?

An AI kernel is a low-level routine that implements one mathematical operation efficiently on a compute target. That target can be a scalar CPU loop, a SIMD vector unit, a thread pool, a GPU block, a matrix tile unit, an FPGA datapath, or a distributed cluster stage.

The important part is that a kernel has a contract. It accepts inputs with known shapes, performs a specific mathematical transformation, respects numerical expectations, and writes outputs in a layout the next kernel can consume.

Kernel contract intuition:
kernel(input tensors, weights, metadata) -> output tensor

The contract includes:
  shape
  dtype
  memory layout
  numerical meaning
  threading / SIMD rules
  correctness tolerance

This is why AI kernel engineering sits between math and systems. If you only know the math, you may write code that is correct but slow. If you only know systems, you may optimize a wrong operation. The kernel engineer has to preserve the math while making it efficient.

The First Kernel: Linear Projection

The first kernel to master is the linear layer:

Linear layer:
Y = XW + b

This is the same idea behind scalar wx+b, GEMV, GEMM, attention projections, MLP projections, vision patch projections, and classifier heads. In a transformer block, the model repeatedly applies linear kernels to move between hidden states, Q/K/V tensors, attention outputs, and MLP intermediate vectors.

For one token, the linear layer often looks like GEMV: one vector multiplied by a weight matrix. For many tokens, it becomes GEMM: a matrix of tokens multiplied by a weight matrix. The math is stable; the runtime shape changes the kernel strategy.

| Kernel | Shape | Where it appears | What to learn |
|---|---|---|---|
| wx+b | scalar or vector | first-principles linear model | derivatives and local gradients |
| GEMV | [1,N] × [N,K] | single-token decode | row/column dot products, cache reuse |
| GEMM | [M,N] × [N,K] | prefill, training, batches | tiling, blocking, arithmetic intensity |
| Projection | XWq, XWk, XWv | attention | shape discipline and memory layout |

The same linear idea has different speed rules depending on shape. This is one of the most important lessons in kernel engineering: the equation may look the same, but the memory behavior changes.

| Level | Math view | Speed view | What to keep hot |
|---|---|---|---|
| Scalar | y = wx + b | almost no memory problem; teaches correctness | w, x, accumulator in registers |
| Vector dot | y_j = sum_i x_i W_{i,j} | stream one weight column and reuse one input vector | input vector and accumulator |
| GEMV | [1,N] × [N,K] | often memory-bandwidth heavy during decode | current token vector, several output accumulators |
| GEMM | [M,N] × [N,K] | tile A and B so each loaded value is reused many times | small A/B tiles and C accumulators |

For scalar code, you mainly learn the derivative and the exact order of operations. For GEMV, you start caring about contiguous memory, vector lanes, and keeping several output accumulators in registers. For GEMM, you care about blocking and tiling because the same input and weight values can be reused across many outputs.
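
To make the GEMV point concrete, here is a minimal sketch in C. It assumes W is stored row-major with shape [N,K] and that the bias is optional; the function names `gemv_ref` and `gemv_acc4` are illustrative, not taken from any particular runtime. The reference walks one output column at a time, which strides through W. The second version keeps four output accumulators live and reads four adjacent weights per input value.

```c
#include <assert.h>
#include <stddef.h>

/* Reference GEMV: y[K] = x[N] * W[N,K] + b[K], with W stored row-major.
 * b may be NULL for a bias-free projection. */
void gemv_ref(const float *x, const float *W, const float *b,
              float *y, size_t N, size_t K)
{
    assert(x && W && y);
    for (size_t j = 0; j < K; ++j) {
        float acc = b ? b[j] : 0.0f;
        for (size_t i = 0; i < N; ++i)
            acc += x[i] * W[i * K + j];      /* strided walk down a column */
        y[j] = acc;
    }
}

/* Same math, four output accumulators at a time. Each x[i] is loaded once
 * and reused for four adjacent weights in the same row of W. */
void gemv_acc4(const float *x, const float *W, const float *b,
               float *y, size_t N, size_t K)
{
    assert(x && W && y);
    size_t j = 0;
    for (; j + 4 <= K; j += 4) {
        float a0 = b ? b[j + 0] : 0.0f;
        float a1 = b ? b[j + 1] : 0.0f;
        float a2 = b ? b[j + 2] : 0.0f;
        float a3 = b ? b[j + 3] : 0.0f;
        for (size_t i = 0; i < N; ++i) {
            const float xi = x[i];
            const float *w = W + i * K + j;  /* four contiguous weights */
            a0 += xi * w[0];
            a1 += xi * w[1];
            a2 += xi * w[2];
            a3 += xi * w[3];
        }
        y[j + 0] = a0;
        y[j + 1] = a1;
        y[j + 2] = a2;
        y[j + 3] = a3;
    }
    for (; j < K; ++j) {                     /* tail when K % 4 != 0 */
        float acc = b ? b[j] : 0.0f;
        for (size_t i = 0; i < N; ++i)
            acc += x[i] * W[i * K + j];
        y[j] = acc;
    }
}
```

Both versions add terms in the same order for each output, so they should agree closely. The width of four is an arbitrary choice for the sketch; a real kernel would pick it from the register budget of the target.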

Linear kernel speed ladder:
scalar:
  keep w, x, b, acc in registers

GEMV:
  keep x hot
  stream W
  compute several y values before storing

GEMM:
  tile X and W
  keep C accumulators in registers
  reuse cache-resident tiles many times
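
The GEMM rung of the ladder can be sketched in C as a blocked loop nest. This is a teaching sketch under stated assumptions: row-major A [M,N], B [N,K], and C [M,K], with a caller-chosen `tile` parameter that is hypothetical and would normally be tuned to the cache sizes of the target. `gemm_naive` is included only as the reference to compare against.

```c
#include <assert.h>
#include <stddef.h>

/* Naive reference: C[M,K] = A[M,N] * B[N,K], all row-major. */
void gemm_naive(const float *A, const float *B, float *C,
                size_t M, size_t N, size_t K)
{
    for (size_t i = 0; i < M; ++i)
        for (size_t j = 0; j < K; ++j) {
            float acc = 0.0f;
            for (size_t n = 0; n < N; ++n)
                acc += A[i * N + n] * B[n * K + j];
            C[i * K + j] = acc;
        }
}

static size_t min_sz(size_t a, size_t b) { return a < b ? a : b; }

/* Blocked version: work on tile x tile sub-blocks so the A and B values
 * loaded for one block are reused before they leave cache. */
void gemm_blocked(const float *A, const float *B, float *C,
                  size_t M, size_t N, size_t K, size_t tile)
{
    assert(tile > 0);
    for (size_t i = 0; i < M * K; ++i)
        C[i] = 0.0f;

    for (size_t i0 = 0; i0 < M; i0 += tile)
        for (size_t n0 = 0; n0 < N; n0 += tile)
            for (size_t j0 = 0; j0 < K; j0 += tile) {
                const size_t i1 = min_sz(i0 + tile, M);
                const size_t n1 = min_sz(n0 + tile, N);
                const size_t j1 = min_sz(j0 + tile, K);
                for (size_t i = i0; i < i1; ++i)
                    for (size_t n = n0; n < n1; ++n) {
                        const float a = A[i * N + n];   /* reused across j */
                        for (size_t j = j0; j < j1; ++j)
                            C[i * K + j] += a * B[n * K + j];
                    }
            }
}
```

The blocked version sums partial products in a different order than the naive one, so on real data the two should be compared with a tolerance rather than bit-for-bit.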

If you understand this kernel deeply, you can connect the simple wx+b post to transformer inference and backpropagation. The Jacobian is the conceptual object. The kernel implements the shortcut.

The Same Linear Kernel Also Teaches Backprop

For forward inference, the linear layer computes output values. For training, the same local structure tells us how gradients move backward. If Y = XW + b, and the parent gradient arriving from the loss is G = dL/dY, the compact backprop rules are:

Linear layer backward pass:
dW = X^T G
dX = G W^T
db = sum_rows(G)

This is where the Jacobian intuition becomes useful without forcing us to materialize a giant Jacobian. Conceptually, every output depends only on the input row and the corresponding weight column. Computationally, the kernel uses matrix products and reductions to apply the same chain rule efficiently.

That distinction matters. A beginner may try to write every partial derivative one by one. A runtime cannot afford that. A runtime uses the structure of the Jacobian to avoid storing the full object. The math says what is connected. The kernel chooses the efficient contraction.

| Backward object | Kernel shape | Meaning |
|---|---|---|
| dW | X^T G | accumulate how each input feature contributed to each output gradient |
| dX | G W^T | send gradient back to the previous layer or token representation |
| db | sum_rows(G) | bias gradient is a reduction across tokens or examples |
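
The three backward rules translate almost line for line into scalar C. The sketch below assumes row-major X [M,N], W [N,K], and G [M,K]; the name `linear_backward` is illustrative. It never builds a Jacobian, it only runs the three contractions.

```c
#include <stddef.h>

/* Backward pass of Y = XW + b given G = dL/dY.
 *   dW[N,K] = X^T G
 *   dX[M,N] = G W^T
 *   db[K]   = sum over rows of G
 * All matrices are row-major. */
void linear_backward(const float *X, const float *W, const float *G,
                     float *dW, float *dX, float *db,
                     size_t M, size_t N, size_t K)
{
    /* dW: each (input feature n, output k) pair sums over tokens m. */
    for (size_t n = 0; n < N; ++n)
        for (size_t k = 0; k < K; ++k) {
            float acc = 0.0f;
            for (size_t m = 0; m < M; ++m)
                acc += X[m * N + n] * G[m * K + k];
            dW[n * K + k] = acc;
        }

    /* dX: each token row of G is projected back through W^T. */
    for (size_t m = 0; m < M; ++m)
        for (size_t n = 0; n < N; ++n) {
            float acc = 0.0f;
            for (size_t k = 0; k < K; ++k)
                acc += G[m * K + k] * W[n * K + k];
            dX[m * N + n] = acc;
        }

    /* db: reduce G across the token (row) axis. */
    for (size_t k = 0; k < K; ++k) {
        float acc = 0.0f;
        for (size_t m = 0; m < M; ++m)
            acc += G[m * K + k];
        db[k] = acc;
    }
}
```

A finite-difference check on a tiny layer is a reasonable first test for a routine like this before any optimization.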

Activation Kernels

After a linear projection, neural networks usually need a nonlinearity. Without activation kernels, stacked linear layers collapse into another linear operation. Activations bend the signal and make deeper function composition useful.

[Figure: Activation function curves showing ReLU, GELU, SiLU, and sigmoid.]

For older networks, ReLU is the basic starting point. For transformer models, GELU and SiLU-family functions are more common. Modern MLP blocks often use gated activations such as SwiGLU:

SwiGLU shape:
gate = SiLU(XW_gate)
up   = XW_up
out  = gate ⊙ up
down = out W_down

This is why activation kernels are not “just one formula.” A gated activation may involve two linear projections, an elementwise activation, an elementwise multiply, and a final projection. In real runtime terms, the activation family changes memory movement and kernel fusion opportunities.

The practical thing to learn is not only the curve. Learn the memory path. A ReLU kernel can stream through one tensor and write one tensor. A SwiGLU block usually needs two projected streams, a nonlinear gate, a multiply, and then a down projection. That means more intermediate data, more layout decisions, and more opportunity to fuse or accidentally spill to memory.

Practice target

Implement ReLU, GELU, SiLU, and SwiGLU in scalar C first. Then add vectorized paths only after the scalar reference and parity tests are stable. The activation is simple enough to learn correctness discipline, but rich enough to teach approximation, vectorization, and fusion.

Normalization Kernels

LayerNorm and RMSNorm stabilize hidden states by controlling scale. These kernels are deceptively simple: compute a reduction, derive a scale factor, then apply it across the vector.

RMSNorm:
mean_square = sum(x_i^2) / D
rstd        = 1 / sqrt(mean_square + eps)
y_i         = gamma_i * x_i * rstd

The important kernel lesson is reduction. Many lanes or threads contribute to one scalar statistic. That means the kernel engineer must understand floating-point accumulation, SIMD reductions, thread partitioning, and numerical tolerance.

This connects directly to the earlier LayerNorm and RMSNorm post and the SIMD deep dive.

Normalization is also where “close enough” becomes dangerous. A tiny mismatch in a reduction may not look serious for one token, but model parity is a long chain. Small numerical differences can propagate through attention, MLP, logits, sampling, and future tokens. That is why good kernel work needs both local tests and model-level validation.

In C-Kernel-Engine terms, the scalar reference is not optional. It is the anchor. The optimized path can use SIMD, threads, or architecture-specific instructions, but the optimized path must still agree with the reference under a defined tolerance. Without that discipline, performance numbers are not meaningful.
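
One way to make "agree under a defined tolerance" executable is a small comparison helper. The sketch below is an assumption about how such a check could look, not a description of how C-Kernel-Engine implements it; the name `allclose_f32` and the combined absolute plus relative tolerance are illustrative choices.

```c
#include <math.h>
#include <stddef.h>

/* Returns 1 if every element satisfies |test - ref| <= atol + rtol * |ref|,
 * otherwise 0. Optionally reports the largest absolute error seen. */
int allclose_f32(const float *test, const float *ref, size_t n,
                 float atol, float rtol, float *max_abs_err)
{
    int ok = 1;
    float worst = 0.0f;
    for (size_t i = 0; i < n; ++i) {
        const float err = fabsf(test[i] - ref[i]);
        const float tol = atol + rtol * fabsf(ref[i]);
        if (!(err <= tol))        /* written this way so NaN also fails */
            ok = 0;
        if (err > worst)
            worst = err;
    }
    if (max_abs_err)
        *max_abs_err = worst;
    return ok;
}
```

Reporting the worst error alongside the pass/fail result makes it easier to notice a kernel drifting toward the edge of its tolerance before it actually fails.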

Attention Kernels

Attention is not one kernel. It is a pipeline:

[Figure: A transformer block shown as a sequence of kernels: RMSNorm, QKV GEMM, RoPE, attention, projection, residual add, and MLP.]

Attention kernel sequence:
Q = XWq
K = XWk
V = XWv

scores = QK^T / sqrt(d)
prob   = softmax(scores)
out    = prob V
Y      = out Wo

The attention map includes linear kernels, position kernels, matrix multiplication, softmax, value mixing, output projection, and residual addition. Each stage has different memory behavior. QKV projection is weight-stream heavy. Scores can become quadratic in sequence length. Softmax needs numerical stabilization. KV cache changes the decode path.

This is why “attention” is a useful model concept but too coarse as a systems concept. Kernel engineering needs to break it into the pieces that actually move bytes and execute instructions.

For learning, attention should be separated into at least five surfaces:

  1. Projection: build Q, K, and V from the hidden state.
  2. Position: apply RoPE or another position mechanism to Q and K.
  3. Score: compute similarity between the current query and available keys.
  4. Probability: apply mask and numerically stable softmax.
  5. Mix: multiply probabilities by values and project back to hidden dimension.

Each surface can become the bottleneck in a different regime. Prefill may be dominated by large matrix operations. Decode may be dominated by KV-cache reads. Long context may become memory-bandwidth heavy. Small batch real-time serving may be latency-sensitive rather than throughput-sensitive. The model word “attention” hides all of those distinctions.

This is also why FlashAttention-style ideas matter. They are not magic attention. They are better memory scheduling for the same mathematical object. They reduce unnecessary reads and writes by changing how score, softmax, and value mixing are tiled through memory.

Position, Cache, and State Kernels

Position and memory are where models start to differ more strongly. RoPE rotates query and key vectors by position. KV cache stores past keys and values so decode does not recompute the entire prefix. State-space and recurrent-attention designs compress history into a fixed or structured state update.

[Figure: A diagram comparing RoPE, KV cache, and recurrent state kernels such as DeltaNet or SSM.]

| Kernel family | Question it answers | Runtime pressure |
|---|---|---|
| RoPE / position | How does the model know token order? | sin/cos lookup, vector rotation, Q/K layout |
| KV cache | How does decode reuse history? | memory growth with context length |
| DeltaNet / SSM state | Can history be compressed into recurrent state? | state update correctness and long-context behavior |
| Sliding-window attention | Can each layer see only local context? | smaller cache/read footprint |

This connects to positional encoding, attention, and Gated DeltaNet. The common systems question is: where does the model store sequence information, and how expensive is it to read or update that storage?

Routing and MoE Kernels

Mixture-of-Experts models add another kernel family: routing. Instead of sending every token through the same dense MLP weights, the model chooses a small number of experts for each token. That changes the runtime problem from “one dense block for all tokens” to “select, group, dispatch, compute, and combine.”

MoE routing mental model:
scores  = router(hidden_state)
experts = top_k(scores)

for each selected expert:
    route token to expert batch
    run expert MLP
    weight expert output by router score

output = combined expert outputs

MoE is a good example of why kernel engineering is not only arithmetic. Routing introduces sorting, grouping, scatter/gather movement, load balancing, and scheduling. A mathematically valid MoE layer can still be slow if tokens are distributed badly across experts or if memory movement dominates expert compute.

That is also why MoE is relevant to CPU, GPU, and distributed systems thinking. The expert function may be a familiar MLP kernel, but the routing layer decides how work is partitioned and whether compute resources stay busy.

| MoE step | Kernel/system concern | What to measure |
|---|---|---|
| Router scores | small projection plus top-k selection | latency, stability, top-k correctness |
| Token grouping | scatter/gather and batch formation | memory movement, expert imbalance |
| Expert MLP | dense GEMM/GEMV per expert | utilization, batch size per expert |
| Combine output | weighted sum back into token order | ordering correctness, accumulation tolerance |

Quantization Kernels

Quantization kernels turn model deployment from a pure arithmetic problem into a memory-format problem. The kernel has to pack weights, unpack weights, apply scales, handle metadata, and accumulate accurately.

Quantized dot-product intuition:
packed weights -> unpack nibbles/bytes
metadata       -> load scales / mins / block sums
dequant        -> reconstruct approximate values in registers
dot product    -> accumulate into int32 or fp32
output         -> return to the next runtime stage

The key lesson from the quantization deep dive is that quantization is not just “use INT4.” It is a format contract plus an ISA-specific kernel path. The correctness test is not whether the model produces plausible text once. The correctness test is whether parity against the reference path survives long-horizon generation and real model bring-up.

The practical reason quantization matters is memory bandwidth. If the bottleneck is moving weights from memory into the compute unit, smaller weight formats can help even before the arithmetic looks impressive. But the savings are not free. The kernel must pay the cost of unpacking, applying scales, handling block metadata, and preserving enough numerical information for the model to remain useful.

This is why quantization is a perfect kernel-engineering topic. It touches math, memory layout, instruction sets, model accuracy, and runtime dispatch at the same time. A quantized kernel is not finished when it runs. It is finished when it runs fast, matches the reference within tolerance, and preserves model behavior on real prompts.

Optimizer Kernels

Optimizer kernels are the training-side counterpart to inference kernels. Backprop computes gradients. The optimizer interprets those gradients and updates the weights.

AdamW in compact form:
m_t   = beta1 * m_{t-1} + (1-beta1) * g_t
v_t   = beta2 * v_{t-1} + (1-beta2) * g_t^2
m_hat = m_t / (1 - beta1^t)
v_hat = v_t / (1 - beta2^t)
w     = w - lr * (m_hat / (sqrt(v_hat) + eps) + weight_decay * w)

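For kernel engineering, optimizers matter because they are memory-heavy. AdamW keeps two extra state buffers, m and v, each the same size as the parameters, so every update reads and writes several full-size tensors while doing little arithmetic per byte. Large training jobs are not just about multiplying tensors; they are about moving parameters, gradients, optimizer states, activations, and checkpoints reliably.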

This is why the optimizer post belongs in the same learning path as inference kernels. A serious AI kernel engineer must understand both forward serving and backward training.

Dispatcher and Runtime Kernels

A model runtime is not only a pile of kernels. It also needs a dispatcher. The dispatcher decides which kernel implementation should run for a given operation, shape, dtype, hardware target, and memory layout.

Runtime dispatch sketch:
if op == RMSNORM and isa supports AVX2:
    run rmsnorm_avx2(...)
elif op == RMSNORM:
    run rmsnorm_scalar(...)

if op == Q4_GEMV and isa supports AVX512:
    run q4_gemv_avx512(...)
elif op == Q4_GEMV and isa supports NEON:
    run q4_gemv_neon(...)
elif op == Q4_GEMV:
    run q4_gemv_reference(...)

This is the part many explanations skip. A kernel engineer may write several versions of the same operation: scalar reference, portable C, thread-pool version, AVX2 version, AVX-512 version, NEON version, or a backend-specific accelerator version. The runtime has to select the correct one without breaking the model contract.

The dispatcher is also where testing discipline becomes essential. Every optimized path needs parity against the reference path. Every path needs shape guards. Every path needs clear assumptions about alignment, block size, dtype, and scratch memory. Otherwise the runtime becomes fast only when the demo happens to follow the happy path.

How To Think About Kernel Speed

Kernel speed is not only “do fewer operations.” Modern processors are usually limited by where the data lives, how often it moves, and whether the useful values stay close to the execution units.

The simplest mental model is:

Kernel speed mental model:
registers  -> fastest, tiny, closest to execution
L1 cache   -> very fast, small
L2 cache   -> fast, larger
L3 cache   -> shared, slower
DRAM       -> huge, much slower
storage    -> enormous, not part of the hot loop

A fast kernel tries to keep the hot working set in registers and cache. A slow kernel repeatedly fetches the same data from DRAM, writes temporary values too early, or touches memory in a pattern the CPU cannot predict and prefetch efficiently.

| Concept | Kernel-engineering meaning | Example |
|---|---|---|
| Registers | smallest and fastest storage for live values | accumulate several dot-product sums before storing |
| Hot cache | data recently used and likely still close to the core | reuse a block of weights across several token rows |
| Cold memory | data not in cache; fetch cost dominates | streaming huge weights from DRAM during decode |
| Arithmetic intensity | work done per byte moved | GEMM can reuse tiles; GEMV often has lower reuse |
| Writeback pressure | cost of storing intermediates | unfused activation spills an intermediate tensor |
| Branch overhead | control flow that disrupts predictable execution | checking dtype or shape inside the inner loop |

This is why kernel code often looks different from ordinary application code. The inner loop should be boring and predictable. The dispatcher can make decisions before the loop starts. The loop itself should mostly load, compute, accumulate, and store.

For a dot product, the naive version may load one value, multiply, add, and repeat. A better version tries to keep multiple accumulators in registers, load contiguous values, unroll the loop, and avoid storing partial results until the end.

Dot-product speed intuition:
bad mental model:
  for every operation, go back to memory

better mental model:
  load a block
  keep accumulators in registers
  reuse cached data
  store only when the result is complete
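
The same intuition in C: a naive dot product with one accumulator, and a version with four independent accumulators that are combined only at the end. The function names are illustrative and the unroll factor of four is an arbitrary choice for the sketch. Whether the second version is actually faster depends on the compiler and the target, so it should be measured rather than assumed.

```c
#include <stddef.h>

/* One accumulator: every add waits on the previous add. */
float dot_naive(const float *a, const float *b, size_t n)
{
    float s = 0.0f;
    for (size_t i = 0; i < n; ++i)
        s += a[i] * b[i];
    return s;
}

/* Four independent accumulators: contiguous loads, four dependency
 * chains the core can overlap, one combine at the end. */
float dot_acc4(const float *a, const float *b, size_t n)
{
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += a[i + 0] * b[i + 0];
        s1 += a[i + 1] * b[i + 1];
        s2 += a[i + 2] * b[i + 2];
        s3 += a[i + 3] * b[i + 3];
    }
    float s = (s0 + s1) + (s2 + s3);
    for (; i < n; ++i)            /* tail when n % 4 != 0 */
        s += a[i] * b[i];
    return s;
}
```

The two functions add the products in a different order, so on general floating-point data they agree within a tolerance, not exactly. That is the reduction-drift issue from the normalization section showing up in its smallest form.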

For AI kernels, this shows up everywhere. RMSNorm wants to reduce a vector without unnecessary passes. GEMV wants to stream weights and reuse the input vector while it is hot. GEMM wants to tile so small blocks of A and B are reused many times. Quantized kernels want to unpack weights into registers and immediately use them before the temporary representation spills. Attention wants to avoid writing the full score matrix when a tiled online softmax can keep only what is needed.

The high-level rule is simple: keep hot data close and reused, and keep temporaries short-lived. Put constants and accumulators in registers when possible. Keep working tiles inside cache. Avoid large temporary tensors unless they are necessary for correctness or later reuse. Measure the result instead of guessing.

What To Practice For Each Kernel Family

A useful learning path should produce code, not only notes. The table below is a practical curriculum for building intuition one kernel family at a time.

| Kernel family | First implementation | Next step | Test |
|---|---|---|---|
| Linear | scalar GEMV | blocked GEMM or threaded rows | compare against Python/NumPy reference |
| Activation | ReLU, GELU, SiLU | SwiGLU with two projections | curve values and end-to-end MLP parity |
| Normalization | RMSNorm scalar | SIMD reduction | tolerance across random vectors |
| Attention | single-head attention | KV-cache decode | same logits as reference for short prompts |
| Position | RoPE rotation | cached sin/cos or fused Q/K path | exact Q/K rotation parity |
| Quantization | simple int8 dot product | block quantized Q4/Q8 path | reference dequant and model-level drift |
| Optimizer | SGD | AdamW with state buffers | known training-step parity |
| Runtime | manual function call | ISA-aware dispatcher | all implementations agree with reference |

This is the same pattern from robotics and embedded systems: start with the clear scalar math, create a reliable reference, measure it, then optimize only after correctness is boring. Whether the target is a flight controller, CPU inference engine, or distributed training system, the engineering discipline is the same.

The Practical Learning Ladder

[Figure: A practical learning ladder for AI kernel engineering from scalar math to distributed systems and accelerators.]

The beginner path is not “jump straight to CUDA” or “memorize transformer papers.” A better path is:

  1. Write scalar wx+b and derivatives by hand.
  2. Convert scalar math into vector and matrix form.
  3. Implement naive GEMV and GEMM in C.
  4. Measure the code with timers and perf.
  5. Add thread-pool row partitioning.
  6. Add SIMD only after the scalar path is correct.
  7. Study cache, NUMA, memory bandwidth, and roofline limits.
  8. Then extend toward distributed systems and accelerators.
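
Step 5 of the ladder starts with a small piece of arithmetic: which rows does each worker own? A minimal sketch in C follows. The name `partition_rows` is illustrative, and it assumes a static split where the first few workers absorb the remainder, which is only one of several reasonable policies.

```c
#include <assert.h>
#include <stddef.h>

/* Split `rows` rows across `nthreads` workers as evenly as possible.
 * Worker `tid` owns the half-open range [*begin, *end). The first
 * (rows % nthreads) workers get one extra row each. */
void partition_rows(size_t rows, size_t nthreads, size_t tid,
                    size_t *begin, size_t *end)
{
    assert(nthreads > 0 && tid < nthreads);
    const size_t base = rows / nthreads;
    const size_t rem  = rows % nthreads;
    *begin = tid * base + (tid < rem ? tid : rem);
    *end   = *begin + base + (tid < rem ? 1 : 0);
}
```

Each worker can then run the unchanged scalar kernel on its own row range. Because the ranges do not overlap, the outputs need no locking, and the threaded result can be compared against the single-threaded reference row by row.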

For robotics and control systems, the same foundation appears in smaller form: Jacobians, matrix multiplies, Kalman filters, PID loops, sensor fusion, flight controllers, and embedded inference. For LLMs, the same foundation scales into huge memory systems, networked clusters, and transformer-model runtime graphs.

After this ladder, the path becomes more specialized. On CPUs, you study cache blocking, prefetching, NUMA placement, thread affinity, SIMD instruction selection, and roofline analysis. On GPUs, you study warps, shared memory, occupancy, tensor cores, and memory coalescing. On distributed systems, you study sharding, networking, collective communication, storage throughput, checkpointing, and reliability.

The important point is that these are extensions of the same foundation. The math does not disappear when the system gets larger. It becomes more expensive to move, schedule, verify, and preserve.

How This Connects To C-Kernel-Engine

C-Kernel-Engine is useful as a learning artifact because it forces the model to become explicit. A template maps to a circuit. A circuit maps to kernel order. Kernel order maps to tensor shapes and memory layout. Then each kernel has to be implemented, tested, dispatched, and measured.

The public concept map is here: C-Kernel-Engine concepts. That page is useful because it shows how model-level ideas such as attention, MoE routing, normalization, quantization, and runtime dispatch become concrete kernel surfaces.

That is the practical kernel-engineering loop:

Kernel engineering loop:
math definition
  -> tensor shape
  -> scalar reference
  -> C kernel
  -> correctness test
  -> thread partition
  -> SIMD / ISA path
  -> memory measurement
  -> runtime dispatch
  -> model-level validation

The next long-form video can use this post as the map: start with the kernel vocabulary, then walk through the math and implementation path kernel by kernel.

How To Use This Post For The Next Video

The long-form video should not try to teach every kernel fully in one sitting. A better structure is to use this post as the map, then zoom into each family with a concrete example:

  1. Start with the question: what kernels do I need to know before deep learning stops feeling mysterious?
  2. Show the map: linear, activation, norm, attention, position, memory, quantization, optimizer, dispatcher.
  3. Walk one kernel deeply: use Y = XW + b to connect scalar math, GEMV, GEMM, and backprop.
  4. Show runtime thinking: explain why the same math needs different kernels for scalar, SIMD, thread pool, and accelerator paths.
  5. Close with practice: implement, test, measure, optimize, and validate on a real model.

That structure keeps the talk grounded. The audience does not need to become expert in every kernel immediately. They need to see the terrain and understand why the same small set of mathematical operations keeps reappearing inside deep neural networks.

Takeaway

If you want to understand deep neural networks deeply, learn the kernels underneath them: linear algebra, activation, normalization, attention, position, memory, quantization, and optimization.

Architectures change. The kernel vocabulary keeps reappearing.