AI Kernel Engineer Beginner Guide: Math, Linear Algebra, C/Linux

Part 1: Beginner roadmap

A viewer asked a useful question: “I know some C, some linear algebra, and some calculus. What are the prerequisites to learn this?” My answer is not “learn all of mathematics first.” This beginner guide is the first layer: learn enough calculus to read gradients, enough linear algebra to read tensor shapes, and enough C/Linux system programming to understand how compute actually processes linear algebra. Later parts can go deeper into ML internals, heterogeneous compute, kernel optimization, and performance engineering.

AI kernel engineering is not the same job as AI research. Researchers may invent new objectives, architectures, proofs, and optimization methods. Kernel engineers take the useful mathematics and make it run correctly, fast, and repeatedly on real compute. That means the question is not “what is every topic in calculus and linear algebra?” The better question is: what math lets me understand wx+b, loss functions, backprop, softmax, attention, GEMM, memory layout, and thread dispatch?

The short answer

Calculus teaches gradients. Linear algebra teaches shapes and matrix multiplication. C/Linux teaches how equations become memory loads, cache lines, execution units, instructions, threads, and measured performance.

Start on CPU first. Measure what the CPU is doing. Then slowly extend the same mental model to SIMD, GPUs, FPGAs, and other accelerators.

Section 1: You Need Calculus, But Not All Of Calculus First

For AI kernels, calculus is primarily the language of change. Backpropagation asks: if each parameter changes slightly, how much does the loss change? That question is answered by derivatives and the chain rule, and the answer tells the optimizer which way to move each parameter. You do not need to start with every theorem in a full calculus curriculum. Start with the derivative rules that show up constantly.

\[ \begin{aligned} \text{Constant:}\quad &\frac{d}{dx}[c]=0\\ \text{Power:}\quad &\frac{d}{dx}[x^n]=nx^{n-1}\\ \text{Scale:}\quad &\frac{d}{dx}[c f(x)]=c f'(x)\\ \text{Sum:}\quad &\frac{d}{dx}[f(x)\pm g(x)]=f'(x)\pm g'(x)\\ \text{Product:}\quad &\frac{d}{dx}[f(x)g(x)]=f(x)g'(x)+g(x)f'(x)\\ \text{Quotient:}\quad &\frac{d}{dx}\left[\frac{f(x)}{g(x)}\right]=\frac{g(x)f'(x)-f(x)g'(x)}{[g(x)]^2}\\ \text{Chain:}\quad &h(x)=f(g(x))\Rightarrow h'(x)=f'(g(x))g'(x) \end{aligned} \]

The chain rule matters because neural networks are composed functions. Each layer wraps another function.

[Figure: derivative plots for x squared, exponential, and tanh.]
Derivative intuition: a kernel engineer should understand the local slope because gradients are local sensitivity.

The chain rule is the most important one for neural networks. If h(x)=f(g(x)), then g(x) is the inner function and f is the outer function. The derivative becomes the outer derivative evaluated at the inner function, multiplied by the derivative of the inner function. That is the same idea repeated through a model.

Section 2: Know The Functions That Actually Appear In Models

The next set is not exotic. It is exponential, logarithm, trigonometry, and a little hyperbolic tangent. These show up in softmax, cross entropy, RoPE, activation functions, and normalization-related analysis.

\[ \begin{aligned} \frac{d}{dx}[e^x]&=e^x &\qquad \frac{d}{dx}[\ln x]&=\frac{1}{x}\\ \frac{d}{dx}[a^x]&=a^x\ln a &\qquad \frac{d}{dx}[\log_a x]&=\frac{1}{x\ln a}\\ \frac{d}{dx}[\sin x]&=\cos x &\qquad \frac{d}{dx}[\cos x]&=-\sin x\\ \frac{d}{dx}[\tan x]&=\sec^2 x &\qquad \frac{d}{dx}[\tanh x]&=1-\tanh^2 x \end{aligned} \]

Softmax uses exp. Cross entropy uses log. RoPE uses sin/cos. Activations use tanh-like curves.

You can learn concavity, critical points, mean value theorem, Taylor approximation, and related rates later. They are useful, especially if you want to analyze optimizers or approximation quality. But if the immediate goal is AI kernel engineering, the first milestone is reading the forward and backward pass of common model operations.

Section 3: Linear Algebra Is The Shape Language

Linear algebra is where AI starts looking like compute. Tokens become vectors. Batches become matrices. Projections become matrix multiplication. Attention scores become QK^T. The output projection becomes another matrix multiply.

\[ [M \times N]\cdot[N \times K]=[M \times K] \]

The inner dimensions (N) must match. The result keeps the outer dimensions, M and K.

[Figure: matrix multiplication shape rule, M by N times N by K equals M by K.]
Matrix multiplication is repeated dot products. This is the foundation of GEMV and GEMM kernels.
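The shape rule maps directly onto three nested loops. The sketch below is a naive reference GEMM that follows this guide's naming, with A as M by N, B as N by K, and C as M by K. It assumes row-major storage and is meant as a correctness baseline, not a fast kernel. The name `gemm_naive` is illustrative.

```c
#include <stddef.h>

/* C[M x K] = A[M x N] * B[N x K], all matrices row-major. */
void gemm_naive(const float *A, const float *B, float *C,
                size_t M, size_t N, size_t K) {
    for (size_t i = 0; i < M; ++i) {          /* output row */
        for (size_t k = 0; k < K; ++k) {      /* output column */
            float sum = 0.0f;
            for (size_t j = 0; j < N; ++j) {  /* shared inner dimension */
                sum += A[i * N + j] * B[j * K + k];
            }
            C[i * K + k] = sum;
        }
    }
}
```

Each output element is one dot product between a row of A and a column of B. Note that the inner loop reads B with a stride of K elements, which is one reason production kernels transpose, pack, or tile B.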

For a first pass, prioritize these linear algebra topics:

  • Vectors and dot products
  • Matrix shapes and dimension rules
  • Matrix multiplication
  • Transpose and layout intuition
  • GEMV: matrix-vector multiplication
  • GEMM: matrix-matrix multiplication
  • Broadcasting and row-wise operations

Eigenvalues, eigenvectors, rank, null spaces, SVD, and basis changes are valuable. They matter more as you move into mathematical research, compression, numerical analysis, optimization theory, or low-rank methods. But you do not need to master all of that before writing your first AI kernel. You need to master shape discipline first.

Section 4: From Math To Kernel Math

Kernel math is the subset of math that must become loops, vector instructions, memory accesses, and synchronization. For example, the mathematical expression y = Wx + b becomes repeated dot products, row partitioning, cache-aware data movement, and eventually SIMD or accelerator instructions.

\[ y_i = \sum_{j=0}^{N-1} W_{ij}x_j + b_i \]

This is simple algebra, but the implementation must decide layout, stride, precision, parallelism, and accumulation order.

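GEMV shape thinking (C)
/* W is M x N in row-major order; x has N elements; b and y have M elements. */
for (int row = 0; row < M; ++row) {
    float sum = 0.0f;                          /* accumulator for one output */
    for (int col = 0; col < N; ++col) {
        sum += W[row * N + col] * x[col];      /* dot product of row with x */
    }
    y[row] = sum + b[row];                     /* add the bias once per row */
}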
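Accumulation order is easy to overlook, so here is a small sketch of why it matters. Floating-point addition is not associative, so the same dot product summed in a different order can give a different answer. The example assumes IEEE-754 single precision and a build without fast-math style reordering. The function names are illustrative.

```c
#include <stddef.h>

/* Dot product accumulated from index 0 upward. */
float dot_forward(const float *w, const float *x, size_t n) {
    float sum = 0.0f;
    for (size_t i = 0; i < n; ++i) {
        sum += w[i] * x[i];
    }
    return sum;
}

/* Same dot product accumulated from the last index downward. */
float dot_reverse(const float *w, const float *x, size_t n) {
    float sum = 0.0f;
    for (size_t i = n; i > 0; --i) {
        sum += w[i - 1] * x[i - 1];
    }
    return sum;
}
```

With `w = {1.0f, 1e8f, -1e8f}` and `x = {1.0f, 1.0f, 1.0f}`, the forward order absorbs the 1.0 into 1e8 and returns 0, while the reverse order cancels the large terms first and returns 1. SIMD lanes and multi-threaded reductions change the order in the same way, which is why kernel tests usually compare against a reference with a tolerance rather than exact equality.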

This is why the math foundation and the systems foundation cannot be separated. If you understand the equation but not memory, you may write correct but slow code. If you understand C but not the equation, you may optimize the wrong thing. AI kernel engineering lives where both sides meet.

Section 5: C/Linux Teaches How Compute Processes Linear Algebra

I would start with CPU first because the CPU makes the machine visible. You can see memory, cache, branches, threads, vector instructions, and system calls. Once that model is clear, SIMD, GPUs, and FPGAs are easier to reason about because they are not magic. They are different ways of organizing parallel compute and data movement.

AI kernel engineering stack from math to kernel to runtime to compute to measurement
AI kernel engineering connects math, kernels, runtime, compute, and measurement.

The practical C/Linux checklist is:

  • Pointers, arrays, structs, and function pointers
  • Memory allocation, alignment, and ownership
  • Cache lines, locality, and data movement
  • Execution units and instruction-level work
  • SIMD basics: one instruction operating on multiple data lanes
  • Threads, pthreads, barriers, and work dispatch
  • gcc, make, gdb, perf, and command-line workflow

[Figure: memory hierarchy from registers to storage.]
Most real performance problems are data movement problems before they are pure arithmetic problems.
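A small experiment makes the data movement point tangible. The two functions below compute the same sum over a row-major matrix, but one walks memory with stride 1 and the other with stride `cols`. This is a sketch with illustrative names. How large the speed difference is depends on the matrix size and the machine, so measure it yourself, for example with `perf stat`.

```c
#include <stddef.h>

/* Row-major storage: element (r, c) lives at a[r * cols + c]. */

/* Inner loop touches consecutive addresses (stride 1). */
double sum_row_order(const float *a, size_t rows, size_t cols) {
    double sum = 0.0;
    for (size_t r = 0; r < rows; ++r) {
        for (size_t c = 0; c < cols; ++c) {
            sum += a[r * cols + c];
        }
    }
    return sum;
}

/* Inner loop jumps `cols` elements per step (stride cols). */
double sum_col_order(const float *a, size_t rows, size_t cols) {
    double sum = 0.0;
    for (size_t c = 0; c < cols; ++c) {
        for (size_t r = 0; r < rows; ++r) {
            sum += a[r * cols + c];
        }
    }
    return sum;
}
```

Both functions do the same number of additions. Any timing gap you observe on a large matrix comes from how well each access pattern uses cache lines, not from the arithmetic.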

Section 6: The Actual Learning Path I Would Follow

If someone knows some C, some linear algebra, and some calculus, I would not tell them to disappear into textbooks for a year. I would give them a ladder that constantly connects theory to implementation.

  1. Write derivative rules by hand until the chain rule feels natural.
  2. Implement scalar wx+b and compute the gradient by hand.
  3. Convert scalar wx+b into vector and matrix form.
  4. Implement a naive GEMV in C.
  5. Measure it.
  6. Add row partitioning with a threadpool.
  7. Add SIMD only after the scalar version is correct.
  8. Study softmax, cross entropy, activations, normalization, and attention one kernel at a time.
  9. Connect every forward kernel to its backward kernel.
  10. Keep a formula sheet and a code sheet side by side.
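Step 2 of the ladder can be sketched in a few lines of C. The code below assumes a single training sample and a squared-error loss L = (wx + b - y)^2, so the hand-derived gradients are dL/dw = 2(wx + b - y)x and dL/db = 2(wx + b - y). The struct and function names are hypothetical, chosen for this example only.

```c
/* Parameters of the scalar model y_hat = w * x + b. */
typedef struct {
    float w;
    float b;
} LinearParams;

/* Forward pass. */
float linear_forward(const LinearParams *p, float x) {
    return p->w * x + p->b;
}

/* Squared-error loss for one sample. */
float squared_error(const LinearParams *p, float x, float y) {
    float err = linear_forward(p, x) - y;
    return err * err;
}

/* Backward pass: chain rule through err^2 and then through w * x + b. */
void linear_backward(const LinearParams *p, float x, float y,
                     float *dw, float *db) {
    float err = linear_forward(p, x) - y;
    *dw = 2.0f * err * x;   /* dL/dw */
    *db = 2.0f * err;       /* dL/db */
}

/* One plain SGD update with learning rate lr. */
void sgd_step(LinearParams *p, float dw, float db, float lr) {
    p->w -= lr * dw;
    p->b -= lr * db;
}
```

Starting from w = 0 and b = 0 with x = 2 and y = 4, the gradients are dw = -16 and db = -8. One step with lr = 0.125 moves the parameters to w = 2 and b = 1, and the loss drops from 16 to 1. Working through numbers like these by hand first, then in code, is the habit the ladder is trying to build.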

The working definition

An AI kernel engineer understands the math well enough to preserve correctness and understands compute well enough to make the math efficient.

The goal is not to memorize every theorem. The goal is to know which math becomes hot code, which memory layout feeds that code, and which measurement proves it works.

Section 7: What Comes After The Prerequisites?

After the foundations, the path becomes the transformer building-block path:

  • wx+b: scalar, vector, and matrix projection
  • Loss functions: MSE and cross entropy
  • Backpropagation: local gradients stitched by chain rule
  • Softmax: stable exponentials and probabilities
  • Activations: ReLU, GELU, SwiGLU
  • Normalization: LayerNorm and RMSNorm
  • Attention: Q/K/V, scores, masks, softmax, weighted values
  • Optimizers: SGD, momentum, AdamW, Muon-style matrix updates later
  • Runtime: threadpools, memory pools, dispatch, cache, and NUMA
  • Kernel implementation: scalar, SIMD, quantized, tiled, and accelerator-aware

This is the route I am building through ShivasNotes and C-Kernel-Engine. The notes explain the math. The blogs connect the math to model architecture. The C kernels test whether the understanding survives contact with real compute.

Follow the current ShivasNotes sequence

If you are using this as a roadmap, the practical sequence is not random. Start from the smallest training loop, then walk forward into the full transformer and the runtime that executes it.

  1. Derivatives And The Chain Rule — the math of tiny changes.
  2. Backpropagation From wx+b — the smallest training loop.
  3. Matrix Wx+b — from scalar equations to transformer projections.
  4. Softmax — stable probabilities and logits.
  5. Activation Functions — ReLU, GELU, and SwiGLU intuition.
  6. LayerNorm And RMSNorm — stabilizing the signal through depth.
  7. Positional Encoding — sinusoidal, learned positions, and RoPE.
  8. Tokenization — the first representation decision.
  9. Attention — Q/K/V, masks, softmax, and weighted values.
  10. Residual Connections — the gradient highway that makes depth trainable.
  11. dL/d(LLM) — the full backward pass through an LLM.
  12. Thread Pools In C — how CPU runtimes dispatch work across cores.

This prerequisite post is the map before that path. The sequence above is where the map turns into concrete model math and C runtime engineering.