A fast kernel is not just a loop. A fast kernel is a loop, a memory layout, a thread placement policy, a page-fault policy, and a measurement discipline running inside a real Linux system.
When people first learn AI performance engineering, they usually start with the visible math: matrix multiplication, attention, normalization, quantization, SIMD, and cache lines. That is the right starting point. But once the kernel becomes real, another layer appears underneath it. The operating system is now part of the performance story.
Linux is not only the thing that launches the binary. Linux decides where threads run, how virtual memory becomes physical memory, when pages fault in, which NUMA node owns a buffer, whether memory can be backed by huge pages, and whether a thread that was hot on one core suddenly wakes up somewhere else with cold caches. For a normal application, most of this is invisible. For an AI runtime trying to squeeze useful work out of commodity CPUs, it is the ground truth.
The above image was generated by Codex. The weird hand popping out of the table is AI doing its thing.
The thesis of this post is simple: Linux system programming is part of kernel engineering. If I want C-Kernel-Engine to run predictable CPU inference and eventually distributed CPU training, then the runtime cannot treat Linux as a black box. It has to understand thread affinity, NUMA locality, page size, TLB pressure, memory advice, allocator behavior, page faults, and scheduler noise.
Core idea
The kernel engineer does not merely write math. The kernel engineer makes sure the machine can keep feeding the math.
Why Linux Matters For AI Kernels
The CPU core is fast. The execution units can add, multiply, fuse multiply-add, compare, shuffle, load, store, and retire instructions at terrifying speed. But the core is not the whole machine. The core depends on L1, L2, L3, DRAM, the TLB, branch prediction, the scheduler, and the memory allocator. If the execution unit does not have data, it stalls. If the thread migrates to another core, its local cache history can disappear. If a buffer lives on the wrong NUMA node, a local memory access becomes a remote memory access. If pages are small and the working set is huge, the TLB can become a hidden bottleneck.
This is why performance work cannot stop at -O3 or SIMD intrinsics. The compiler can make a loop better, but it cannot magically guarantee that the operating system keeps the right thread near the right data. That becomes a runtime design problem.
The mental model
A CPU AI runtime has three jobs: keep compute units busy, keep data close, and keep measurements honest. Linux tuning exists because all three jobs can fail even when the C code is mathematically correct.
Core Pinning: Stop Letting Hot Threads Wander
Core pinning means binding a thread to a specific CPU core or a specific set of cores. On Linux, this usually means sched_setaffinity, pthread_setaffinity_np, taskset, numactl --physcpubind, or cgroup CPU placement. The reason is not superstition. A hot thread has history. It has branch predictor history, cache residency, memory access patterns, and synchronization rhythm with other worker threads.
If Linux moves that thread across cores, the program may still be correct, but performance can become noisy. The thread wakes up on a different core, sees colder private caches, and may now compete with different sibling threads. On hybrid CPUs, it may even move from a performance core to an efficiency core unless the runtime or OS policy prevents it. For everyday software this might not matter. For a kernel benchmark, it can be the difference between a real regression and scheduler noise.
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>
#include <string.h>
static int pin_thread_to_cpu(int cpu_id) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu_id, &set);
    return pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}
void *worker_main(void *arg) {
    int cpu_id = *(int *)arg;
    /* pthread functions return an error number instead of setting errno. */
    int rc = pin_thread_to_cpu(cpu_id);
    if (rc != 0) {
        fprintf(stderr, "pthread_setaffinity_np: %s\n", strerror(rc));
    }
    /* Run a shard of the kernel here. */
    return NULL;
}
In a C runtime, this should not be scattered randomly through kernels. It belongs in the thread-pool layer. The worker pool should know how many workers exist, which CPU each worker owns, and whether the current machine should use all cores, only performance cores, one NUMA node, or an explicitly selected core range.
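To make the thread-pool idea concrete, here is a minimal sketch of a pool that starts one worker per CPU the process is allowed to use and records whether each worker really ended up pinned. The names `ck_pool`, `ck_worker`, `ck_pool_start`, `ck_pool_join`, and the cap `CK_MAX_WORKERS` are hypothetical, not part of C-Kernel-Engine, and a real pool would also need a work queue and a policy for performance cores and NUMA nodes.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <pthread.h>
#include <sched.h>
#include <stddef.h>

#define CK_MAX_WORKERS 64 /* hypothetical cap for this sketch */

typedef struct {
    int cpu_id;     /* CPU this worker should own */
    int pin_rc;     /* 0 if pinning succeeded, else an error number */
    int pinned_ok;  /* 1 if the thread's affinity mask is exactly {cpu_id} */
    pthread_t thread;
} ck_worker;

typedef struct {
    ck_worker workers[CK_MAX_WORKERS];
    int count;
} ck_pool;

static void *ck_worker_main(void *arg) {
    ck_worker *w = arg;
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(w->cpu_id, &set);
    w->pin_rc = pthread_setaffinity_np(pthread_self(), sizeof(set), &set);

    /* Read the mask back so the pool can verify placement. */
    cpu_set_t got;
    CPU_ZERO(&got);
    w->pinned_ok = w->pin_rc == 0 &&
        pthread_getaffinity_np(pthread_self(), sizeof(got), &got) == 0 &&
        CPU_COUNT(&got) == 1 && CPU_ISSET(w->cpu_id, &got);

    /* A real worker would now wait for kernel shards. */
    return NULL;
}

/* Start up to `want` workers, one per CPU in the process's allowed set.
   Returns the number of workers actually started. */
static int ck_pool_start(ck_pool *pool, int want) {
    assert(pool != NULL);
    cpu_set_t allowed;
    CPU_ZERO(&allowed);
    pool->count = 0;
    if (sched_getaffinity(0, sizeof(allowed), &allowed) != 0) return 0;
    if (want > CK_MAX_WORKERS) want = CK_MAX_WORKERS;
    for (int cpu = 0; cpu < CPU_SETSIZE && pool->count < want; ++cpu) {
        if (!CPU_ISSET(cpu, &allowed)) continue; /* respect cgroup/taskset limits */
        ck_worker *w = &pool->workers[pool->count];
        w->cpu_id = cpu;
        w->pin_rc = -1;
        w->pinned_ok = 0;
        if (pthread_create(&w->thread, NULL, ck_worker_main, w) != 0) break;
        pool->count++;
    }
    return pool->count;
}

static void ck_pool_join(ck_pool *pool) {
    for (int i = 0; i < pool->count; ++i) {
        pthread_join(pool->workers[i].thread, NULL);
    }
}
```

Starting from the allowed set instead of assuming CPUs 0..N-1 is a deliberate choice in this sketch: under `taskset`, `numactl`, or a cgroup cpuset, some CPU ids may simply not be available to the process.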
NUMA: Memory Has Geography
On a single-socket laptop, memory feels like one pool. On workstations and servers, memory can have geography. NUMA means Non-Uniform Memory Access: a CPU core can access all memory, but not all memory is equally close. A core reading memory attached to its own NUMA node is faster than a core repeatedly reading memory attached to another socket or memory controller.
This matters for AI runtimes because model weights, KV cache, activations, gradients, optimizer states, and staging buffers can be large. If a thread pool is pinned to one side of the machine but the buffers were first touched by a thread on the other side, Linux may place those pages far away. Then the kernel pays a remote-memory tax on every pass.
The practical rule is simple: place memory near the workers that will use it. Use first-touch intentionally, use numactl while measuring, and eventually teach the runtime to make placement decisions explicitly.
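Intentional first-touch can be sketched as follows: map one anonymous arena, then let each worker write its own slice so that, under Linux's default local allocation policy, the pages are likely to come from that worker's node. The names `ck_slice`, `ck_first_touch`, and `ck_arena_first_touch` are hypothetical. The sketch assumes each worker has already been pinned (for example with `pin_thread_to_cpu` from earlier) and that slice sizes are multiples of the page size. It does not verify NUMA placement itself.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>

#define CK_TOUCH_MAX_WORKERS 16 /* hypothetical cap for this sketch */

typedef struct {
    char *base;    /* start of this worker's slice */
    size_t bytes;  /* slice length */
    int worker_id;
} ck_slice;

/* Runs on the worker thread that will later use the slice. */
static void *ck_first_touch(void *arg) {
    ck_slice *s = arg;
    memset(s->base, 0, s->bytes);          /* writes force real page allocation */
    s->base[0] = (char)(s->worker_id + 1); /* debug marker: who touched this slice */
    return NULL;
}

/* Map `bytes` of anonymous memory and let `workers` threads first-touch
   equal slices. Returns the arena, or NULL on failure. */
static char *ck_arena_first_touch(size_t bytes, int workers) {
    assert(workers > 0 && workers <= CK_TOUCH_MAX_WORKERS);
    assert(bytes % (size_t)workers == 0);
    char *arena = mmap(NULL, bytes, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (arena == MAP_FAILED) return NULL;

    pthread_t tid[CK_TOUCH_MAX_WORKERS];
    ck_slice slice[CK_TOUCH_MAX_WORKERS];
    size_t per = bytes / (size_t)workers;
    int started = 0;
    for (int i = 0; i < workers; ++i) {
        slice[i] = (ck_slice){ arena + (size_t)i * per, per, i };
        if (pthread_create(&tid[i], NULL, ck_first_touch, &slice[i]) != 0) break;
        started++;
    }
    for (int i = 0; i < started; ++i) pthread_join(tid[i], NULL);
    if (started != workers) {
        munmap(arena, bytes);
        return NULL;
    }
    return arena;
}
```

Whether the pages actually landed where intended is still something to check with `numastat` rather than assume.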
# Run on a specific NUMA node while testing locality.
numactl --cpunodebind=0 --membind=0 ./ck-infer-smoke
# Inspect the machine topology.
lscpu
numactl --hardware
# Watch whether memory placement is drifting.
numastat -p $(pidof ck-infer-smoke)
Page Faults: The Runtime Tax You Did Not Benchmark
A memory access in C is not automatically a physical DRAM access. User-space code operates on virtual addresses. The kernel maps virtual pages to physical pages. If a page is not resident yet, the first access can trigger a page fault. That does not mean the program is broken. It means Linux had to stop, resolve the mapping, and make the page available.
For a long-running application, some page faults are normal during startup. For a hot inference loop, page faults inside the timing window are poison. They create latency spikes that have nothing to do with the math kernel. This is why benchmark harnesses often warm up memory, touch pages before timing, and separate initialization from measured execution.
The most common beginner trap is assuming that allocation immediately means physical DRAM has been assigned. On Linux, a large malloc or anonymous mmap often gives the process virtual address space first. Physical pages may be attached lazily only when the program first touches those addresses. So a page fault can mean: this virtual page is valid, but no physical frame has been mapped for it yet.
size_t bytes = 1024ULL * 1024 * 1024;
char *buf = malloc(bytes); /* Reserve address space; check for NULL in real code. */
/* The first write to each page may fault and attach physical memory. */
for (size_t i = 0; i < bytes; i += 4096) {
    buf[i] = 0;
}
That first-touch cost is usually a minor page fault when DRAM is available, because Linux can allocate or map a page without waiting on disk. It can still slow down a benchmark if it happens inside the measured loop. This is why a kernel runtime should separate memory reservation, page commitment, warmup, and actual kernel timing.
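One way to see that separation in numbers is to count minor faults around a touch pass with `getrusage`. The sketch below is a hedged illustration: `ck_minor_faults` and `ck_touch_and_count` are hypothetical helper names, the 4 KiB step assumes the common base page size, and with transparent huge pages the cold pass may report far fewer faults than one per 4 KiB page.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/resource.h>

/* Minor faults charged to this process so far, or -1 on error. */
static long ck_minor_faults(void) {
    struct rusage ru;
    if (getrusage(RUSAGE_SELF, &ru) != 0) return -1;
    return ru.ru_minflt;
}

/* Write one byte per 4 KiB step and return how many minor faults
   the process took during the pass. */
static long ck_touch_and_count(char *buf, size_t bytes) {
    assert(buf != NULL);
    long before = ck_minor_faults();
    volatile char *p = buf;
    for (size_t i = 0; i < bytes; i += 4096) {
        p[i] = 1;
    }
    return ck_minor_faults() - before;
}
```

Calling `ck_touch_and_count` twice on a freshly mapped buffer should show the faults concentrated in the first pass. The second pass is the one that resembles a hot loop.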
Fewer long-lived memory chunks can help, but not because they magically remove page faults. They help because the runtime avoids allocator churn, reduces fragmentation, improves layout predictability, and can intentionally touch or warm the pages before the hot path. One giant arena can still fault page-by-page on first touch. The difference is that an arena gives the runtime a place to pay that cost deliberately instead of accidentally during token generation.
/* Write-touch each page so Linux commits a physical frame for it.
   A read alone can leave anonymous memory mapped to the shared zero page.
   Requires a writable mapping; contents are preserved. */
static void touch_pages(char *ptr, size_t bytes) {
    const size_t page = 4096; /* Assumes 4 KiB base pages. */
    volatile char *p = ptr;
    for (size_t i = 0; i < bytes; i += page) {
        p[i] = p[i];
    }
}
In C-Kernel-Engine terms, this matters for generated artifacts too. If a run maps a large weights.bump file and immediately measures token latency, the first pass may include page-in behavior. That is not the steady-state kernel speed. A disciplined runtime should distinguish cold-start, warm-start, and hot-loop measurements.
The confusing part is that "page fault" sounds like an error, but Linux uses the same broad mechanism for several very different events. A page fault can mean "this page is already in RAM, but this process does not have the page-table entry yet." It can also mean "the kernel has to wait on disk or swap before the process can continue." Those two cases have completely different performance meanings.
| Event | What It Means | Performance Interpretation |
|---|---|---|
| Minor page fault | The page is already in RAM or can be created without storage I/O, but the process page table needs to be updated. | Usually normal during allocation, first touch, fork, or mmap warmup. Bad only if it happens inside the hot timing loop. |
| Major page fault | The kernel must wait for storage I/O, usually because the page must be read from a file or swap. | Usually a serious latency problem for inference and kernel benchmarks. |
| Demand-zero fault | The first write to newly allocated anonymous memory causes Linux to map a zeroed physical page. | Normal, but should be paid before measurement if the benchmark is about compute. |
| Copy-on-write fault | A process writes to a page it currently shares copy-on-write, so Linux creates a private physical copy. | Expected after fork or on the first write to a MAP_PRIVATE mapping, but can create hidden allocation and copy costs. |
| File-backed mmap fault | A mapped file page is accessed. If the page cache already has it, the fault can be minor. If disk is needed, it becomes major. | Important for mapped model weights. Cold-start and steady-state can look completely different. |
| Swap fault | The page was evicted to swap and must be read back. | Usually catastrophic for hot AI paths. If swap enters the benchmark, the measurement is no longer kernel speed. |
| TLB miss | The virtual-to-physical mapping exists, but the CPU translation cache missed. | Not a Linux page fault. It is a hardware translation-cache cost. Huge pages may help here. |
| Remote NUMA access | The physical page exists in DRAM, but on a different NUMA node/socket than the worker core. | Not a page fault. It is a memory-locality problem. |
| Segmentation fault | The address is invalid or the access violates permissions, so Linux cannot legally satisfy the fault. | This is a correctness bug, not a performance event. |
The shortest mental model is this: minor faults are mapping work, major faults are storage work, TLB misses are translation-cache misses, and NUMA misses are locality mistakes. They can all slow a program down, but they are not the same bottleneck.
# One-shot benchmark sanity check.
/usr/bin/time -v ./ck-infer-smoke
# Look for:
# Minor (reclaiming a frame) page faults
# Major (requiring I/O) page faults
# Swaps
# Maximum resident set size
# Live per-process memory telemetry.
pidstat -r -p $(pidof ck-infer-smoke) 1
# Useful columns:
# minflt/s minor page faults per second
# majflt/s major page faults per second
# Global VM pressure.
vmstat 1
# Watch these:
# si swap-in from disk
# so swap-out to disk
# Raw kernel counters.
grep -E 'pgfault|pgmajfault|pswpin|pswpout|pgscan|pgsteal' /proc/vmstat
# Hardware and kernel-event view for a benchmark.
perf stat \
-e minor-faults,major-faults,page-faults,dTLB-load-misses,dTLB-store-misses \
./ck-infer-smoke
# Attribute page faults to code locations.
perf record -e page-faults ./ck-infer-smoke
perf report
# If you only care about storage-backed stalls:
perf record -e major-faults ./ck-infer-smoke
perf report
This changes how a runtime should benchmark itself. If minor faults happen during setup, that is expected. If minor faults happen inside the measured token loop, the runtime may be measuring allocation or first-touch cost. If major faults happen inside the hot path, the runtime is no longer measuring compute at all. It is measuring how long Linux waited for storage.
For CPU-native AI work, the practical rule is simple: allocate, map, touch, warm up, then measure. If the goal is to study page-in behavior, measure cold-start separately and name it honestly. If the goal is kernel throughput, first-touch and page-in should be outside the timing window.
mmap, madvise, and Memory As A Runtime Contract
mmap lets a runtime map a file or anonymous memory region into the process address space. For AI systems, this is attractive because model weights are large and mostly read-only. Instead of manually allocating a giant heap buffer and copying bytes into it, a runtime can map the weight file directly. That makes memory layout more explicit and often reduces startup overhead.
madvise lets the program tell Linux how it expects to use a mapped region. It is not a command in the sense of "the kernel must obey me." It is advice. But good advice can help the kernel choose better paging and readahead behavior.
#include <sys/mman.h>
/* Sequential model load or one-pass scan. */
madvise(ptr, bytes, MADV_SEQUENTIAL);
/* Random access pattern, such as scattered lookup. */
madvise(ptr, bytes, MADV_RANDOM);
/* The runtime expects to need this soon. */
madvise(ptr, bytes, MADV_WILLNEED);
/* The runtime is done with this region for now.
   On private anonymous memory this discards the contents. */
madvise(ptr, bytes, MADV_DONTNEED);
This is the kind of API that looks boring until the model becomes large. Once weights, KV cache, and activation buffers become substantial, memory behavior is no longer a background detail. It is a first-class part of runtime design.
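One caveat is worth seeing in code: on Linux, `MADV_DONTNEED` is more than a soft hint for private anonymous memory. The pages are dropped, and the next access sees zero-filled memory. The sketch below demonstrates that on a single page; `ck_value_after_dontneed` is a hypothetical helper written only for this illustration.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Write `value` into a private anonymous page, apply MADV_DONTNEED,
   then read the byte back. Returns the byte seen, or -1 on failure. */
static int ck_value_after_dontneed(unsigned char value) {
    long page = sysconf(_SC_PAGESIZE);
    assert(page > 0);
    unsigned char *p = mmap(NULL, (size_t)page, PROT_READ | PROT_WRITE,
                            MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) return -1;
    p[0] = value;
    int rc = madvise(p, (size_t)page, MADV_DONTNEED);
    int seen = (rc == 0) ? p[0] : -1; /* refaults as a fresh zero page */
    munmap(p, (size_t)page);
    return seen;
}
```

So "done with this region for now" should be read as "done with the data in it". A scratch arena is a reasonable target; a KV cache that is still live is not.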
How mmap, madvise, Warmup, And TLB Policy Fit Together
The important thing to understand is that madvise does not allocate memory, does not pin memory, and does not directly program the TLB. It gives the Linux kernel a hint about how a region will be used. The actual mapping is created by mmap, malloc, the dynamic loader, the file cache, or another allocator path. The physical pages are attached through page faults or explicit population. The CPU fills the TLB after address translation happens. These are different layers of the same pipeline.
| Layer | Tool | What It Controls | What It Does Not Control |
|---|---|---|---|
| Virtual mapping | mmap | Creates a virtual address range backed by a file, anonymous memory, shared memory, or special huge-page mapping. | Does not guarantee all physical pages are resident unless paired with population, locking, or first-touch warmup. |
| Allocator convenience | malloc | Gives C code a usable pointer and hides small/large allocation strategy behind libc. | Does not expose exact mapping policy, page advice, huge-page intent, or placement as cleanly as an explicit arena. |
| Kernel hint | madvise | Tells Linux expected access pattern: sequential, random, soon-needed, done, huge-page-friendly, or huge-page-hostile. | It is advice. The kernel may ignore, delay, or approximate it depending on pressure and configuration. |
| Physical commitment | first touch, MAP_POPULATE, warmup loop | Moves page-fault cost out of the hot path by making pages resident or mapped before measurement. | Does not guarantee NUMA placement unless first touch and worker placement are coordinated. |
| Residency | mlock, mlockall | Asks Linux to keep pages resident and avoid paging them out. | Does not make access cache-hot and can starve the system if abused. |
| Translation cache | TLB, page tables, huge pages | CPU caches virtual-to-physical translations; larger pages reduce number of translations needed. | User code does not directly insert TLB entries. It shapes the page tables Linux builds. |
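The "physical commitment" row can be checked rather than assumed. A minimal sketch, using `mincore` to count how many pages of a mapping are currently resident: `ck_resident_pages` is a hypothetical helper, it assumes the address is page-aligned (as `mmap` results are), and residency is only a snapshot that can change under memory pressure.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

/* Count resident pages in [addr, addr + bytes) using mincore.
   `addr` must be page-aligned. Returns -1 on error. */
static long ck_resident_pages(void *addr, size_t bytes) {
    long page = sysconf(_SC_PAGESIZE);
    assert(page > 0);
    size_t n = (bytes + (size_t)page - 1) / (size_t)page;
    unsigned char *vec = malloc(n); /* one status byte per page */
    if (vec == NULL) return -1;
    long resident = -1;
    if (mincore(addr, bytes, vec) == 0) {
        resident = 0;
        for (size_t i = 0; i < n; ++i) {
            resident += vec[i] & 1; /* low bit set means resident */
        }
    }
    free(vec);
    return resident;
}
```

A benchmark harness could call this after warmup and refuse to start timing until the expected page count is resident.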
A useful runtime sequence looks like this:
decide memory role
-> map or allocate region
-> apply advice
-> place worker threads
-> touch pages from the intended workers
-> optionally lock critical pages
-> warm cache-sensitive kernels
-> measure the hot loop
The reason malloc should be treated as the last resort for serious runtime arenas is not that malloc is bad. It is excellent for general-purpose programs. The problem is that high-performance AI runtimes often need more control than "give me some bytes." They need to know whether a region is file-backed, anonymous, shared, huge-page-friendly, NUMA-local, reusable, temporary, or safe to discard. A normal heap allocation hides too much of that policy.
| Memory Region | Better Primitive | Likely Advice | Why |
|---|---|---|---|
| Read-only model weights | File-backed mmap with MAP_PRIVATE | MADV_WILLNEED, maybe MADV_SEQUENTIAL for loading, maybe huge-page strategy if aligned and stable. | Avoid copying giant weight files into heap memory and let the page cache do useful work. |
| Activation arena | Anonymous mmap arena | MADV_HUGEPAGE for large stable arenas, first-touch by pinned workers. | Predictable lifetime and reuse. Good candidate for arena planning. |
| KV cache | Explicit arena or mapped region | Warm pages before serving, NUMA-aware placement, possibly huge pages for large contexts. | Latency-sensitive and repeatedly accessed during decode. |
| Temporary scratch buffers | Per-thread arena | Avoid sharing, avoid false sharing, reuse instead of allocate/free churn. | Keeps worker-local writes local and prevents allocator noise. |
| Small metadata objects | malloc is usually fine | No special advice needed. | Control matters less when the data is small and not in the hot path. |
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>
/* File-backed, private, read-only mapping for model weights. */
int fd = open("weights.bump", O_RDONLY);
void *weights = mmap(NULL, weight_bytes, PROT_READ, MAP_PRIVATE, fd, 0);
/* Real code must check fd < 0 and weights == MAP_FAILED before continuing. */
madvise(weights, weight_bytes, MADV_WILLNEED);
/* Anonymous runtime arena for activations or KV cache. */
void *arena = mmap(NULL, arena_bytes,
                   PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS,
                   -1, 0);
/* Ask for huge-page treatment where suitable. */
madvise(arena, arena_bytes, MADV_HUGEPAGE);
/* Pay first-touch cost before measuring the hot loop. */
touch_pages((char *)arena, arena_bytes);
mmap flags define the mapping contract. madvise flags describe intended behavior after the mapping exists. Worker placement determines which NUMA node first-touch will likely allocate from. Huge-page policy changes the page-table shape. The TLB then caches the translations that result from those page tables. In other words, the CPU and Linux are not separate stories here. The runtime coordinates them through mapping, advice, placement, warmup, and measurement.
CKE engineering note
C-Kernel-Engine should eventually treat memory allocation as a planned runtime phase, not scattered calls to malloc. The IR/layout planner can classify buffers, the runtime can map arenas with the right policy, and the benchmark harness can verify page faults, TLB misses, cache misses, and NUMA placement separately. That is how a CPU runtime turns Linux memory behavior from mystery into an explicit contract.
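A small bump arena is one plausible shape for that planned phase. The sketch below is not C-Kernel-Engine's allocator: `ck_arena` and its functions are hypothetical names, it has no per-buffer free, and alignment is computed relative to the page-aligned base, so it is only honored for power-of-two alignments up to the page size.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>

typedef struct {
    char *base;       /* start of the mapped region */
    size_t capacity;  /* total bytes mapped */
    size_t used;      /* bump offset */
} ck_arena;

/* Map one anonymous region up front. Returns 0 on success, -1 on failure. */
static int ck_arena_init(ck_arena *a, size_t capacity) {
    void *p = mmap(NULL, capacity, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) return -1;
    a->base = p;
    a->capacity = capacity;
    a->used = 0;
    return 0;
}

/* Carve `bytes` out of the arena at a power-of-two alignment.
   Returns NULL when the arena is full; never calls malloc. */
static void *ck_arena_alloc(ck_arena *a, size_t bytes, size_t align) {
    assert(align != 0 && (align & (align - 1)) == 0);
    size_t start = (a->used + align - 1) & ~(align - 1);
    if (start > a->capacity || bytes > a->capacity - start) return NULL;
    a->used = start + bytes;
    return a->base + start;
}

/* Reuse the same pages for the next step instead of freeing them. */
static void ck_arena_reset(ck_arena *a) {
    a->used = 0;
}

static void ck_arena_destroy(ck_arena *a) {
    munmap(a->base, a->capacity);
    a->base = NULL;
    a->capacity = 0;
    a->used = 0;
}
```

Because the whole region comes from a single `mmap`, advice, huge-page policy, and first-touch can be applied once to the arena instead of to every buffer.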
Huge Pages And The TLB
The TLB, or Translation Lookaside Buffer, is a small hardware cache that remembers virtual-to-physical address translations. Every memory load needs address translation. If the translation is in the TLB, the CPU proceeds quickly. If it misses, the CPU has to walk page tables, which costs time.
Standard Linux pages are commonly 4 KiB. A 1 GiB working set therefore spans 262,144 pages. A 2 MiB huge page covers 512 normal pages. The same 1 GiB working set spans only 512 huge pages. That can dramatically reduce TLB pressure for large streaming buffers.
For a buffer of size \(B\) bytes and page size \(P\), the page count is:
\[ \text{pages} = \left\lceil \frac{B}{P} \right\rceil \]
A 1 GiB region uses 262,144 normal 4 KiB pages, but only 512 huge 2 MiB pages.
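The same arithmetic as a tiny helper, in case it is useful inside a layout planner. `ck_pages_needed` is a hypothetical name; it simply implements the ceiling division above.

```c
#include <assert.h>
#include <stdint.h>

/* Number of pages of size `page_size` needed to cover `bytes` (ceiling division). */
static uint64_t ck_pages_needed(uint64_t bytes, uint64_t page_size) {
    assert(page_size > 0);
    return bytes / page_size + (bytes % page_size != 0);
}
```

For example, `ck_pages_needed(1ULL << 30, 4096)` gives 262,144 and `ck_pages_needed(1ULL << 30, 2ULL << 20)` gives 512, matching the numbers above.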
Huge pages are not magic. They help when the access pattern is large enough and stable enough for TLB pressure to matter. They can hurt if they increase memory waste, complicate allocation, or hide the real bottleneck. The right posture is empirical: measure TLB misses and page-walk behavior before declaring victory.
# Transparent huge page status.
cat /sys/kernel/mm/transparent_hugepage/enabled
# Hardware-counter view of TLB behavior.
perf stat -e dTLB-loads,dTLB-load-misses,iTLB-loads,iTLB-load-misses ./ck-infer-smoke
What A TLB Entry Actually Means
A CPU core does not directly execute loads and stores against the virtual address the C program sees. The program might load from something like 0x7f12_4000_1234, but DRAM is addressed through physical addresses. Somewhere between the load instruction and the cache hierarchy, the processor must translate that virtual address into a physical address. The TLB is the fast cache for those translations.
Conceptually, a TLB entry stores a mapping from a virtual page number to a physical frame number, plus metadata. The exact format is CPU-specific, but the idea is stable:
| TLB Field | Meaning | Why It Matters |
|---|---|---|
| Virtual page number | The upper bits of the virtual address after removing the page offset. | This is the lookup key the CPU uses to ask, "Have I translated this page recently?" |
| Physical frame number | The upper bits of the physical DRAM address for the backing page frame. | This tells the CPU where the data really lives in physical memory. |
| Page size | Whether the mapping covers 4 KiB, 2 MiB, 1 GiB, or another architecture-supported size. | One larger entry can cover much more memory, reducing translation pressure. |
| Access permissions | Readable, writable, executable, user/supervisor, and related protection bits. | A bad access becomes a protection fault instead of silently touching memory. |
| Address-space tag | Usually some form of process/context identifier, such as PCID or ASID on supported CPUs. | This helps the CPU avoid throwing away all translations on every context switch. |
| Memory attributes | Cacheability and ordering behavior derived from page-table attributes. | Important for device memory, write-combining, and special mappings. |
For a 4 KiB page, the lower 12 bits of the virtual address are the offset inside the page because \(2^{12} = 4096\). The TLB translates the upper virtual-page bits into upper physical-frame bits. The page offset stays the same. That is the little trick: the CPU does not translate every byte. It translates the page base, then keeps the offset.
For a 4 KiB page:
\[ \text{virtual address} = \text{virtual page number} \; || \; \text{12-bit offset} \]
\[ \text{physical address} = \text{physical frame number} \; || \; \text{same 12-bit offset} \]
The TLB replaces the page number. The byte offset inside the page is preserved.
Larger pages simply use more offset bits. A 2 MiB page uses a 21-bit offset because \(2^{21} = 2{,}097{,}152\). A 1 GiB page uses a 30-bit offset because \(2^{30} = 1{,}073{,}741{,}824\). The bigger the page, the fewer virtual page numbers the CPU has to translate for the same memory region.
| Page Size | Offset Bits | Pages Needed For 1 GiB | Practical Meaning |
|---|---|---|---|
| 4 KiB | 12 bits | 262,144 pages | Fine-grained and flexible, but many translations for large arrays. |
| 2 MiB | 21 bits | 512 pages | Common huge-page size; useful for large hot buffers and mapped model regions. |
| 1 GiB | 30 bits | 1 page | Very coarse mapping; powerful for stable giant regions, but harder to allocate and easier to waste. |
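The page-number/offset split is plain bit arithmetic, and a sketch makes the table easy to check by hand. `ck_va_parts`, `ck_split_va`, and `ck_join_pa` are hypothetical names, and the frame number passed to `ck_join_pa` is whatever the page tables would supply; user code cannot normally see it.

```c
#include <assert.h>
#include <stdint.h>

typedef struct {
    uint64_t page_number; /* upper bits: what the TLB looks up */
    uint64_t offset;      /* lower bits: passed through unchanged */
} ck_va_parts;

/* Split a virtual address for a page size of 2^offset_bits bytes. */
static ck_va_parts ck_split_va(uint64_t va, unsigned offset_bits) {
    assert(offset_bits > 0 && offset_bits < 64);
    uint64_t mask = (UINT64_C(1) << offset_bits) - 1;
    return (ck_va_parts){ va >> offset_bits, va & mask };
}

/* Rebuild an address from a frame number and the preserved offset. */
static uint64_t ck_join_pa(uint64_t frame_number, uint64_t offset, unsigned offset_bits) {
    return (frame_number << offset_bits) | offset;
}
```

With the example address 0x7f1240001234, a 4 KiB page gives page number 0x7f1240001 and offset 0x234, while a 2 MiB page gives offset 0x1234: more of the address becomes offset, and fewer distinct page numbers remain to translate.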
This is why a TLB miss can become expensive in AI kernels. Imagine a kernel streaming through a multi-gigabyte weight matrix or KV cache. The arithmetic may be vectorized. The cache-line accesses may be predictable. But if the working set spans far more pages than the TLB can remember, the CPU repeatedly has to perform page-table walks. Those walks consume cycles and memory bandwidth that should have gone into useful math.
A page-table walk is not a disk access. It is the hardware walking the process page tables in memory to reconstruct the virtual-to-physical mapping. But it still costs time. The page-table structures themselves live in memory and must be loaded through the cache hierarchy. A bad enough translation pattern can therefore look like mysterious memory latency even when there are no major page faults and no swapping.
# A page-fault counter can be quiet while TLB misses are high.
# That means pages are mapped, but address translation is still expensive.
perf stat \
-e page-faults,minor-faults,major-faults,dTLB-loads,dTLB-load-misses,dtlb_load_misses.walk_completed \
./ck-infer-smoke
The exact event names vary by CPU and kernel. On some Intel systems, dtlb_load_misses.walk_completed or related page-walk events are available. On other systems, use perf list | grep -i tlb to discover the supported counters. The important idea is to separate "the page was missing" from "the translation cache missed."
# Discover TLB/page-walk counters available on this machine.
perf list | grep -i tlb
perf list | grep -i walk
How Linux Can Influence Page Size
User-space code cannot directly edit the processor's TLB. The TLB is maintained by the CPU and the kernel. But a runtime can influence the page-table mappings Linux creates, and those mappings determine what the TLB can cache. This is where huge pages, transparent huge pages, madvise, and explicit huge-page mappings enter the story.
There are two common Linux paths:
| Mechanism | How It Works | Runtime Tradeoff |
|---|---|---|
| Transparent Huge Pages | Linux may automatically back suitable anonymous memory with 2 MiB pages, especially when configured as always or when a region uses MADV_HUGEPAGE. | Easy to try, but not perfectly deterministic. The kernel may collapse or split pages depending on pressure and fragmentation. |
| Explicit HugeTLB pages | The system reserves huge pages up front, and the process maps them explicitly through HugeTLB mechanisms. | More deterministic, but operationally heavier. Requires reservation, permissions, and careful capacity planning. |
# Transparent huge page mode.
cat /sys/kernel/mm/transparent_hugepage/enabled
# Transparent huge page defrag behavior.
cat /sys/kernel/mm/transparent_hugepage/defrag
# HugeTLB pool status.
grep -i huge /proc/meminfo
#include <sys/mman.h>
/* Ask Linux to consider transparent huge pages for this region. */
madvise(ptr, bytes, MADV_HUGEPAGE);
/* Ask Linux not to use transparent huge pages for this region. */
madvise(ptr, bytes, MADV_NOHUGEPAGE);
This is a runtime policy decision, not a universal rule. A large, stable, contiguous activation arena may benefit from huge pages. A tiny scratch buffer probably will not. A sparsely touched mapping can waste memory if backed by huge pages. A file-backed weight map may need a different strategy than anonymous activation memory. The right engine behavior is not "turn on huge pages everywhere." The right behavior is to classify memory regions by access pattern and then measure.
CKE engineering note
For C-Kernel-Engine, page size is a layout concern. Weight blobs, KV-cache arenas, activation workspaces, and temporary scratch buffers should not all receive the same memory policy. The compiler/runtime boundary can eventually annotate regions as streaming, random-access, hot-reused, cold, file-backed, anonymous, huge-page-friendly, or huge-page-hostile.
Hot Cache, Cold Cache, Eviction, And False Sharing
After address translation succeeds, the next question is simple: is the data already close to the execution units? If the data is in L1, the core can use it quickly. If it is in L2, it is still close. If it is in shared L3, it is farther away but still on chip. If it has to come from DRAM, the core may wait for hundreds of cycles. Kernel engineering is the art of arranging work so the important data stays hot long enough to be reused.
| Term | Meaning | Kernel Engineering Consequence |
|---|---|---|
| Hot cache | The data was recently used and is still in L1, L2, or L3. | Reuse is cheap. This is what tiled matrix kernels and fused operators try to create. |
| Cold cache | The data is not currently in cache, so the CPU must fetch it from a lower cache level or DRAM. | First pass over a large model can look much slower than a warmed pass. |
| Cache miss | The requested cache line is not present at the cache level being checked. | Misses are not all equal. L1 miss to L2 is very different from last-level-cache miss to DRAM. |
| Eviction | A cache line is removed to make room for another cache line. | Large working sets, poor loop order, or competing threads can evict data before reuse. |
| Cache line | The minimum block of memory transferred into cache, commonly 64 bytes on modern x86 CPUs. | Accessing one byte usually pulls in the whole line. Layout should make nearby bytes useful. |
| False sharing | Two cores modify different variables that happen to live on the same cache line. | The cores fight over ownership of the same line even though they are not logically sharing data. |
The cache hierarchy is not a storage system in the human sense. It works in cache lines. If a core loads one float, the processor usually brings the entire surrounding 64-byte cache line with it. For float32, that is 16 floats. For float16 or bfloat16, that is 32 values. For int8, that is 64 values. This is why contiguous layout matters so much: if the next values your kernel needs sit beside the current value, one cache-line fetch feeds multiple operations.
For a 64-byte cache line:
\[ \text{float32 values per line} = \frac{64}{4} = 16 \]
\[ \text{int8 values per line} = \frac{64}{1} = 64 \]
The kernel should make the fetched line useful, not drag mostly-unused bytes through the machine.
Cache eviction is the quiet enemy of reuse. Suppose a matrix kernel loads a tile of weights, multiplies it against an activation block, and then reuses that tile many times. If the tile fits in cache, the kernel pays the memory cost once and reuses the hot data. If the tile is too large, or the loop order walks memory poorly, the tile can be evicted before reuse. The program still produces the same answer, but the hardware spends its time moving bytes instead of doing math.
/* Better: contiguous scan, cache-line friendly. */
for (size_t i = 0; i < n; ++i) {
    y[i] += a[i] * x[i];
}
/* Often worse: strided access can waste cache lines. */
for (size_t i = 0; i < n; i += stride) {
    y[i] += a[i] * x[i];
}
The same logic shows up in matrix multiplication. A naive loop order can repeatedly stream through memory in a way that defeats cache reuse. A tiled or blocked kernel deliberately keeps a small region of A, B, and C hot while the execution units chew through it. This is why GEMM is not only math. GEMM is a memory choreography problem.
/* Sketch only: block sizes should be selected empirically for the target CPU. */
for (int ii = 0; ii < M; ii += BM) {
    for (int kk = 0; kk < K; kk += BK) {
        for (int jj = 0; jj < N; jj += BN) {
            /* Work on a tile small enough to reuse in cache. */
            gemm_tile(&A[ii*K + kk], &B[kk*N + jj], &C[ii*N + jj], BM, BN, BK);
        }
    }
}
False Sharing: When Threads Fight Over A Cache Line
False sharing is one of the most annoying performance bugs because the source code can look perfectly parallel. Imagine each thread owns its own counter. Logically, there is no sharing. But if those counters sit next to each other in memory, they may occupy the same 64-byte cache line. When thread 0 updates counter 0 and thread 1 updates counter 1, the CPU cache-coherence protocol may bounce ownership of that line between cores. Both threads are working on different variables, but the hardware sees one shared cache line.
#include <stdalign.h>
#include <stdint.h>
/* Bad: adjacent counters can share one cache line. */
typedef struct {
    uint64_t tokens_processed;
} worker_counter_bad;
/* Better: pad or align per-worker hot writes to separate cache lines. */
typedef struct {
    alignas(64) uint64_t tokens_processed;
    char pad[64 - sizeof(uint64_t)];
} worker_counter_good;
This matters for AI runtimes because worker threads often keep per-thread counters, partial sums, scratch offsets, queue heads, and timing records. If those write-heavy fields are packed together, the runtime can accidentally create a coherence storm. The profiler may show poor scaling even though each thread appears to have independent work.
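A minimal sketch of per-worker counters laid out one per cache line, with a compile-time check on the layout. The names `ck_padded_counter`, `ck_count_job`, and `ck_run_counters` are hypothetical, and the 64-byte line size is an assumption that holds on common x86 parts but not everywhere. The sketch shows the layout, not a benchmark: an optimizing compiler may fold the increment loop, so measuring the coherence cost needs a real workload and `perf`.

```c
#include <assert.h>
#include <pthread.h>
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>

#define CK_CACHE_LINE 64 /* assumed cache-line size */
#define CK_WORKERS 4     /* hypothetical worker count for this sketch */

typedef struct {
    alignas(CK_CACHE_LINE) uint64_t tokens_processed;
} ck_padded_counter;

/* alignas on the member rounds the struct size up to a full line. */
static_assert(sizeof(ck_padded_counter) == CK_CACHE_LINE,
              "each counter must occupy exactly one cache line");

typedef struct {
    ck_padded_counter *slot; /* this worker's private line */
    uint64_t iterations;
} ck_count_job;

static void *ck_count_main(void *arg) {
    ck_count_job *job = arg;
    for (uint64_t i = 0; i < job->iterations; ++i) {
        job->slot->tokens_processed++; /* writes stay on this worker's line */
    }
    return NULL;
}

/* Run CK_WORKERS threads, each bumping its own padded counter.
   Returns the sum of all counters from the workers that started. */
static uint64_t ck_run_counters(ck_padded_counter *counters, uint64_t iterations) {
    pthread_t tid[CK_WORKERS];
    ck_count_job job[CK_WORKERS];
    int started = 0;
    for (int i = 0; i < CK_WORKERS; ++i) {
        counters[i].tokens_processed = 0;
        job[i].slot = &counters[i];
        job[i].iterations = iterations;
        if (pthread_create(&tid[i], NULL, ck_count_main, &job[i]) != 0) break;
        started++;
    }
    uint64_t total = 0;
    for (int i = 0; i < started; ++i) {
        pthread_join(tid[i], NULL);
        total += counters[i].tokens_processed;
    }
    return total;
}
```

The cost is memory: four counters now take 256 bytes instead of 32. That trade is usually worth it only for fields that are written on the hot path.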
| Symptom | Possible Cache Cause | What To Try |
|---|---|---|
| Single thread is fast, many threads scale poorly. | False sharing, shared queue contention, or LLC bandwidth pressure. | Pad per-thread writes, shard queues, pin workers, measure cache misses. |
| Warm run is much faster than first run. | Cold cache, page-in behavior, branch predictor warmup, or allocator warmup. | Separate cold-start benchmark from steady-state benchmark. |
| High cache misses with low page faults. | Data is mapped but not resident in useful cache levels. | Improve blocking, layout, prefetch, and loop order. |
| Performance changes when threads move cores. | Hot L1/L2 state is lost, or L3/NUMA locality changes. | Use affinity and NUMA-aware placement during measurement. |
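The affinity suggestion in the last row can be sketched in C with `sched_getaffinity` and `sched_setaffinity`. This is a minimal sketch for Linux with glibc, not the C-Kernel-Engine thread pool: the `ck_*` helper names are hypothetical, and a real runtime would pick cores from its own topology plan rather than the first allowed one.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE   /* glibc: exposes sched_setaffinity and the CPU_* macros */
#endif
#include <assert.h>
#include <sched.h>

/* Lowest-numbered CPU the calling thread may run on, or -1 on error. */
static int ck_first_allowed_cpu(void) {
    cpu_set_t allowed;
    CPU_ZERO(&allowed);
    if (sched_getaffinity(0, sizeof(allowed), &allowed) != 0) {
        return -1;
    }
    for (int cpu = 0; cpu < CPU_SETSIZE; cpu++) {
        if (CPU_ISSET(cpu, &allowed)) {
            return cpu;
        }
    }
    return -1;
}

/* Number of CPUs in the calling thread's affinity mask, or -1 on error. */
static int ck_allowed_cpu_count(void) {
    cpu_set_t allowed;
    CPU_ZERO(&allowed);
    if (sched_getaffinity(0, sizeof(allowed), &allowed) != 0) {
        return -1;
    }
    return CPU_COUNT(&allowed);
}

/* Pin the calling thread to a single CPU. Returns 0 on success, -1 on failure. */
static int ck_pin_to_cpu(int cpu) {
    assert(cpu >= 0 && cpu < CPU_SETSIZE);
    cpu_set_t one;
    CPU_ZERO(&one);
    CPU_SET(cpu, &one);
    return sched_setaffinity(0, sizeof(one), &one) == 0 ? 0 : -1;
}
```

A benchmark harness could call `ck_pin_to_cpu(ck_first_allowed_cpu())` before the timed loop and record the chosen core next to the result. Containers and cgroup cpusets can restrict or reject the request, so the return value should be checked rather than assumed.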
# Broad cache sanity check.
perf stat -e cycles,instructions,cache-references,cache-misses ./ck-infer-smoke
# More detailed data-cache view where supported.
perf stat -e L1-dcache-loads,L1-dcache-load-misses,LLC-loads,LLC-load-misses ./ck-infer-smoke
# Find supported cache/coherence events on this CPU.
perf list | grep -i cache
perf list | grep -i snoop
Cache tuning should be treated like a hypothesis, not a superstition. If the model is streaming weights once, prefetch and bandwidth dominate. If the kernel reuses a tile many times, blocking and cache residency dominate. If many workers write to nearby metadata, false sharing may dominate. If workers read shared immutable weights, sharing can be good. The point is not "cache is good" or "memory is bad." The point is to know which bytes should be hot, which bytes are allowed to be cold, and which writes must never fight across cores.
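Testing the blocking hypothesis only makes sense if the blocked kernel computes the same result as a simple reference, so it helps to have both side by side before timing either. The sketch below assumes row-major `float` matrices and a single square block size `bs`; the `ck_gemm_*` names are hypothetical, and nothing here is tuned for a particular CPU.

```c
#include <assert.h>
#include <stddef.h>

static size_t ck_min(size_t a, size_t b) { return a < b ? a : b; }

/* Reference: C += A * B, with A (M x K), B (K x N), C (M x N), row-major. */
static void ck_gemm_naive(const float *A, const float *B, float *C,
                          size_t M, size_t N, size_t K) {
    for (size_t i = 0; i < M; i++) {
        for (size_t j = 0; j < N; j++) {
            float acc = 0.0f;
            for (size_t k = 0; k < K; k++) {
                acc += A[i * K + k] * B[k * N + j];
            }
            C[i * N + j] += acc;
        }
    }
}

/* Blocked: same math, but each (ii, kk, jj) step touches only a
 * bs x bs region of A, B, and C. Edge tiles are clipped with ck_min. */
static void ck_gemm_blocked(const float *A, const float *B, float *C,
                            size_t M, size_t N, size_t K, size_t bs) {
    assert(bs > 0);
    for (size_t ii = 0; ii < M; ii += bs) {
        size_t i_end = ck_min(ii + bs, M);
        for (size_t kk = 0; kk < K; kk += bs) {
            size_t k_end = ck_min(kk + bs, K);
            for (size_t jj = 0; jj < N; jj += bs) {
                size_t j_end = ck_min(jj + bs, N);
                for (size_t i = ii; i < i_end; i++) {
                    for (size_t k = kk; k < k_end; k++) {
                        float a = A[i * K + k];
                        /* Innermost loop walks B and C contiguously. */
                        for (size_t j = jj; j < j_end; j++) {
                            C[i * N + j] += a * B[k * N + j];
                        }
                    }
                }
            }
        }
    }
}

/* Largest absolute element-wise difference between two buffers. */
static float ck_max_abs_diff(const float *x, const float *y, size_t n) {
    float worst = 0.0f;
    for (size_t i = 0; i < n; i++) {
        float d = x[i] - y[i];
        if (d < 0.0f) d = -d;
        if (d > worst) worst = d;
    }
    return worst;
}
```

The two versions add terms in a different order, so on general inputs they can differ by floating-point rounding; a correctness check should compare against a small tolerance rather than exact equality. Once they agree, running both under `perf stat` with a few values of `bs` shows whether blocking actually changes cache misses on the machine at hand.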
CKE engineering note
For C-Kernel-Engine, cache behavior is part of the kernel contract. A generated kernel should eventually know its tile shape, cache-line assumptions, per-thread scratch layout, and expected reuse pattern. The runtime should also distinguish cold-start measurements from hot-loop measurements, because both are real but they answer different questions.
Memory Locking: Prevent The OS From Taking Back Hot Pages
mlock and mlockall can ask Linux to keep memory resident rather than paging it out. This can matter for latency-sensitive runtimes. If a model server is supposed to respond predictably, the hot weights and runtime buffers should not disappear into swap under pressure.
But this is also a sharp tool. Locking too much memory can starve the rest of the system. On many machines, locking more than a small amount requires raising the RLIMIT_MEMLOCK limit (ulimit -l) or granting the process CAP_IPC_LOCK. The engineering rule is again boring but powerful: lock only what you understand, and measure the system impact.
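One way to follow the "lock only what you understand" rule is to lock a single known-hot buffer with `mlock` instead of the whole address space. The following is a minimal sketch under that assumption; the `ck_*` helper names are hypothetical, and the policy of continuing unlocked on failure is a design choice for illustration, not something the runtime is known to do.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/resource.h>

/* Soft RLIMIT_MEMLOCK in bytes, or -1 if it is unlimited or unknown. */
static long long ck_memlock_soft_limit(void) {
    struct rlimit lim;
    if (getrlimit(RLIMIT_MEMLOCK, &lim) != 0 ||
        lim.rlim_cur == RLIM_INFINITY) {
        return -1;
    }
    return (long long)lim.rlim_cur;
}

/* Try to keep one buffer resident. Returns 1 if the kernel accepted the
 * lock, 0 if it refused (for example ENOMEM or EPERM). On refusal the
 * caller keeps running with ordinary pageable memory. */
static int ck_try_lock_buffer(const void *buf, size_t len) {
    assert(buf != NULL && len > 0);
    return mlock(buf, len) == 0 ? 1 : 0;
}

/* Release a lock taken by ck_try_lock_buffer. */
static void ck_unlock_buffer(const void *buf, size_t len) {
    assert(buf != NULL && len > 0);
    (void)munlock(buf, len);
}
```

Logging `ck_memlock_soft_limit()` at startup makes a refused lock easier to explain: if the weights are larger than the limit, the fix is a system configuration change, not a code change.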
#include <stdio.h>
#include <sys/mman.h>
/* Lock current and future mapped pages, if permitted by system limits. */
if (mlockall(MCL_CURRENT | MCL_FUTURE) != 0) {
    /* Handle permission or limit failure without crashing blindly. */
    perror("mlockall");
}
Scheduler Noise And Benchmark Honesty
A benchmark is only useful if it measures the thing you think it measures. On Linux, many factors can corrupt that measurement: background daemons, CPU frequency scaling, thermal throttling, interrupts, scheduler migration, page faults, turbo behavior, and NUMA drift. This is why kernel engineers care about repeatability.
The goal is not to turn every laptop into a lab-grade appliance. The goal is to know which numbers are trustworthy. A local development laptop can still produce useful measurements if the harness records enough context: CPU model, governor, thread count, affinity mask, NUMA policy, page size policy, compiler flags, warmup count, and input shape.
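Part of that context can be captured from inside the harness itself, so it travels with every result. The sketch below records only a few of the listed fields, using `uname`, `sysconf`, and the cpufreq sysfs file; the `ck_bench_context` struct and `ck_capture_context` function are hypothetical names, and fields such as affinity mask, NUMA policy, and compiler flags would have to be added separately.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <sys/utsname.h>
#include <unistd.h>

/* Hypothetical record of the machine state a benchmark ran under. */
typedef struct {
    char kernel_release[65];
    char machine[65];
    char governor[32];      /* "unknown" when cpufreq is not exposed */
    long online_cpus;
    long page_size_bytes;
} ck_bench_context;

/* Fill ctx from the running system. Returns 0 on success, -1 on failure. */
static int ck_capture_context(ck_bench_context *ctx) {
    assert(ctx != NULL);
    memset(ctx, 0, sizeof(*ctx));

    struct utsname u;
    if (uname(&u) != 0) {
        return -1;
    }
    snprintf(ctx->kernel_release, sizeof(ctx->kernel_release), "%s", u.release);
    snprintf(ctx->machine, sizeof(ctx->machine), "%s", u.machine);

    ctx->online_cpus = sysconf(_SC_NPROCESSORS_ONLN);
    ctx->page_size_bytes = sysconf(_SC_PAGESIZE);

    /* The governor file is absent in many VMs and containers. */
    snprintf(ctx->governor, sizeof(ctx->governor), "unknown");
    FILE *f = fopen("/sys/devices/system/cpu/cpu0/cpufreq/scaling_governor", "r");
    if (f != NULL) {
        if (fgets(ctx->governor, sizeof(ctx->governor), f) != NULL) {
            ctx->governor[strcspn(ctx->governor, "\n")] = '\0';
        } else {
            snprintf(ctx->governor, sizeof(ctx->governor), "unknown");
        }
        fclose(f);
    }
    return 0;
}
```

Writing this struct into the same log line as the timing result means a surprising number can later be traced to a different governor, kernel, or core count instead of being blamed on the kernel code.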
# Basic runtime context.
lscpu
uname -a
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor 2>/dev/null || true
# Pin a test to cores 2-7.
taskset -c 2-7 ./ck-infer-smoke
# Measure cycles, instructions, cache misses, and TLB misses.
perf stat -e cycles,instructions,cache-misses,dTLB-load-misses ./ck-infer-smoke
What This Means For C-Kernel-Engine
C-Kernel-Engine should not only generate correct C kernels. It should eventually generate or carry enough runtime metadata to explain how those kernels should run on the host machine. The same way an IR visualizer can show the computational graph, the runtime should be able to report: worker count, affinity policy, memory map, page advice, hot buffers, NUMA assumptions, and measured hardware counters.
This is where the project becomes more than "C code for neural nets." It becomes a machine-aware runtime. The generated code owns the math. The memory planner owns layout. The thread pool owns worker dispatch. The Linux integration layer owns placement, page behavior, and measurement. All of these layers matter if the goal is CPU-native inference and training that can scale across commodity machines.
CKE engineering note
The next hardening target is not only faster kernels. It is clearer runtime contracts: which cores are used, where buffers live, what page policy is active, and whether the measured run was cold, warm, or hot.
The Takeaway
Linux systems programming is not separate from AI kernel work. It is the layer that decides whether the kernel gets a stable machine to run on. Core pinning protects cache locality. NUMA policy protects memory locality. Huge pages can reduce TLB pressure. madvise and mmap make memory behavior explicit. mlock can protect hot pages when latency matters. perf tells you whether the bottleneck is real or imagined.
This is also why CPU-native AI is interesting. It is not just "make CPUs do GPU work." It is a full-stack systems problem: math, kernels, memory, scheduler, page tables, Linux, networking, and measurement. That is dry to many people. But for kernel engineers, that is the canvas.