Threadpools And Memory Pools: Why CKE Needs Runtime Ownership For CPU AI Kernels

C-Kernel-Engine runtime comparison

This post is a follow-up to the earlier thread-pool deep dive. The previous post built the pthread idea from first principles. This one asks the systems question: when should a CPU-native AI runtime own its threadpool and memory pools instead of letting every kernel, library, allocator, or framework layer make hidden decisions?

In the earlier post, Thread Pools In C: How CPU Runtimes Dispatch Work Across Cores, the goal was educational. Start with pthread_create. Add a worker function. Add a pool. Add dispatch. Add row partitioning. Then show why C-Kernel-Engine needs persistent workers for CPU kernels.

Today's post is different. This is not another pthread tutorial. This is a comparison between runtime styles: hidden fork-join parallelism versus explicit persistent workers, heap allocation versus planned arenas, ad hoc scratch buffers versus workspace contracts, and generic convenience versus deterministic AI-kernel execution. It connects directly to the Linux/runtime control described in Linux System Programming For AI Kernels and the distributed runtime direction described in Pipeline vs Tensor Parallelism.

Comparison between the earlier threadpool tutorial and this runtime-contract post — The earlier post taught the mechanism. This post explains where that mechanism belongs inside a CPU-native AI runtime.

The key idea is simple: a kernel should do math, not secretly manage the machine. The runtime should own thread dispatch, memory ownership, workspace lifetime, affinity, barriers, and reports. runtime rule A kernel should not surprise the runtime. No hidden malloc, no hidden thread creation, no secret layout conversion.

What this post compares

Threadpool versus OpenMP-style fork/join placement. Memory pool versus general heap allocation. Workspace contract versus kernel-local scratch allocation. CKE-style explicit runtime ownership versus hidden behavior inside low-level kernels.

The Previous Post Built The Mechanism

The older thread-pool post and the video walkthrough were about mechanics. They showed how a CPU runtime can create workers once, wake them for work, split rows by worker id, and join at a barrier. That is the right teaching path because the abstraction is easier to trust after you have built it from raw pthreads.

The core shape was:

Threadpool work contract

typedef void (*work_fn_t)(int ith, int nth, void *arg);

// ith: current worker id
// nth: total active workers
// arg: operation payload

This tiny signature is more important than it looks. It separates scheduling from math. The runtime says: "you are worker ith out of nth." The kernel says: "given that identity, I know which rows, tiles, heads, tokens, or channels I own."

<br>

Today's Question: Who Owns The Machine?

In a small example, it does not matter much whether a kernel creates threads, calls OpenMP, allocates scratch memory, or lets the heap handle temporary buffers. In an LLM runtime, those choices compound. One token may run RMSNorm, QKV projection, RoPE, attention, output projection, MLP projection, activation, residual updates, logits, sampling, and sometimes training or optimizer kernels.

If every stage hides its own parallel region and its own memory allocation, the runtime becomes hard to reason about. You may have a fast kernel, but the full model runner still suffers from allocator locks, thread oversubscription, cache-line fights, NUMA mistakes, and synchronization that nobody planned.

Convenient local decision	Runtime-level consequence
`malloc` inside a kernel	Allocator locks, fragmentation, unpredictable latency, hidden lifetime.
`#pragma omp parallel` inside every op	Repeated fork/join overhead, oversubscription, harder affinity control.
Temporary scratch arrays created ad hoc	No global memory plan, harder training/debug parity, harder RDMA registration later.
Shared counters packed together	False sharing and cache-coherence traffic in the dispatch path.

Threadpool Placement: OpenMP Is Not The Enemy

OpenMP is not bad. OpenMP is useful for experiments, baselines, quick parallel loops, and some production HPC code. The problem is placement. If OpenMP is hidden inside kernels, the outer model runtime loses control over thread count, affinity, nested parallelism, and synchronization boundaries.

CKE's cleaner rule is: kernels expose computation over pointer ranges; the orchestrator owns parallel dispatch. That means one threadpool can serve many kernels without recreating a team of workers inside every hot operation.

Comparison of hidden OpenMP fork join placement versus CKE persistent threadpool dispatch — The issue is not OpenMP itself. The issue is hiding parallel-team creation inside every hot operation when the model runtime should own dispatch.

Threadpool lifecycle showing init, wait, dispatch, barrier, and worker reuse — A threadpool is useful because the expensive part happens once. The hot path should dispatch work to already-created workers, not recreate execution machinery for every operation.

CKE kernel purity rule

text

Kernels must NOT allocate or free memory.
Kernels must NOT create hidden thread teams.
Kernels must expose inputs, outputs, dimensions, and workspace.
The runtime decides active threads, row/tile ownership, and barriers.

Row Partitioning Is The First Contract

A simple GEMV/GEMM-like dispatch often begins by splitting output rows. Worker ith owns a half-open range:

\[ r_0 = \left\lfloor \frac{ith \cdot M}{nth} \right\rfloor,\qquad r_1 = \left\lfloor \frac{(ith+1) \cdot M}{nth} \right\rfloor \] Each worker receives a deterministic row interval. No row should be skipped, no row should be written twice.

Later, high-performance kernels may split by tiles instead of plain rows. Attention may split by heads, token blocks, or query ranges. Training reductions may split by gradient shards. But the runtime principle stays the same: each worker owns a defined slice, writes to a defined output region, and synchronizes only when the next operation depends on complete results.

Thread 0 Should Work Too

A common mistake is letting the main thread dispatch work and then wait while workers compute. For tight CPU inference, that wastes a hardware thread and adds overhead to every dispatch. CKE's threadpool instinct is better: thread 0 is the main thread, and it also performs its slice of the work.

That matters because the main thread is already hot. It already has model state, operation arguments, and cache locality around the dispatch. Unless there is a specific reason to reserve it for orchestration, it should participate.

Cache-Line Aligned Runtime Metadata

Dispatch counters are not glamorous, but they are hot. A threadpool may repeatedly touch generation counters, active-worker counters, completion counters, flags, and condition-variable state. If multiple frequently written values share one cache line, workers can invalidate each other's cache lines even when they are logically updating different variables.

That is false sharing at the runtime-control layer. It is especially embarrassing because the math kernel might be perfectly optimized while the dispatch metadata creates avoidable coherence traffic. This is why cache-line alignment is not cosmetic. It is part of the runtime contract.

False sharing diagram showing hot counters sharing one cache line versus separated cache lines — Threadpool metadata is still data. If hot counters share cache lines, workers can fight each other before the math kernel even starts.

Memory Pools: The Other Half Of The Runtime

Thread ownership answers: who computes this slice? Memory ownership answers: where does that slice read, write, and store temporary state? A CPU AI runtime needs both answers.

If the threadpool is explicit but memory allocation is chaotic, performance still becomes unpredictable. A kernel that needs scratch memory should not call malloc. It should declare the workspace it needs. The runtime should allocate that workspace from a planned arena and pass a pointer.

Memory pool lifetime diagram showing heap chaos versus planned CKE arenas — Memory pools are not just a speed trick. They encode lifetime: model lifetime, request lifetime, operation lifetime, tokenizer-call lifetime, and training-state lifetime.

Threadpool and memory pool ownership contract for C-Kernel-Engine — CKE's runtime contract should answer both questions: which worker owns the work, and which arena slice owns the memory?

Per-worker scratch slices showing non-overlapping workspace ranges for each worker — When work is split across workers, scratch memory should also be split. This avoids malloc, overlap bugs, and cache-line fights inside temporary buffers.

\[ \mathrm{arena} = W + A + G + O + S \] Weights, activations, gradients, optimizer state, and scratch workspace should be planned, not discovered by accident at runtime.

Bump Arenas Beat Heap Chaos

A bump allocator is simple: allocate a large region, keep a cursor or offset, hand out aligned ranges, and reset by lifetime. For CKE, that is more natural than scattering model memory across the general heap. Weights, activations, gradients, optimizer buffers, KV cache blocks, tokenizer scratch, and per-worker scratch can all become explicit regions.

Runtime arena shape

typedef struct {
    uint8_t *base;
    size_t total_size;
    size_t weights_base;
    size_t activations_base;
    size_t gradients_base;
    size_t optimizer_base;
    size_t scratch_base;
} ck_runtime_arena_t;

This also points toward distributed execution. MPI and Linux RDMA backends prefer stable buffers. A giant registered arena with known offsets is easier to move across nodes than a pile of unrelated heap allocations. That does not make distributed inference easy, but it makes the memory model compatible with the direction CKE wants to go. That direction is also why the C-Kernel-Engine scaling thesis and the CKE throughput unit are runtime problems, not just kernel problems.

C-Kernel-Engine dispatch diagram combining IR layout, arena allocation, threadpool dispatch, pure kernel execution, and generated reports — In CKE, threadpool and memory-pool design should meet inside one generated execution plan: the compiler plans offsets, the runtime dispatches workers, and the report proves what happened.

Tokenizer Pools Are Not A Side Quest

Memory pools are not only for GEMM or attention. Tokenizers create short-lived objects: trie walks, byte pieces, merge candidates, temporary spans, decoded fragments, and chat-template buffers. If every encode call produces thousands of tiny heap allocations, the model runtime inherits latency noise before the neural network even runs.

The better pattern is lifetime-based allocation. If a temporary object only lives for one encode call, allocate it from an encode-call scratch pool and reset that pool at the end. If tokenizer vocabulary or trie structures live for the entire model lifetime, allocate them once during initialization.

Tokenizer pool reset diagram showing temporary encode-call memory reset after token ids are produced — Tokenizer pools are the same memory-discipline idea applied to string processing: long-lived vocabulary stays loaded, short-lived encode scratch resets by lifetime.

Training Makes This Non-Negotiable

In inference, hidden allocation is bad. In training, hidden allocation becomes chaos. Backpropagation needs saved activations, gradient buffers, optimizer state, temporary reductions, checkpointing decisions, and numerical parity logs. If those buffers appear from inside kernels, you cannot audit the training step cleanly. This is the same reason the Muon optimizer post matters: training kernels are not only math updates; they are memory ownership, gradient state, optimizer state, numerical stability, and repeatable runtime behavior.

CKE's training direction therefore needs threadpools and memory pools to be visible artifacts, not private implementation details. A generated run should be able to report active thread count, row/tile ranges, workspace bytes, arena offsets, NUMA placement, dispatch latency, and whether a kernel used serial fallback or threadpool dispatch.

Runtime report field	Why it matters
active threads	Shows whether the parallel path actually ran.
row or tile ranges	Makes worker ownership inspectable.
workspace bytes	Prevents hidden allocation.
arena offsets	Connects memory planning to debugging and RDMA-friendly layout.
NUMA node / CPU affinity	Connects Linux placement to measured performance.
dispatch latency	Shows whether threadpool overhead is acceptable.

How This Blog Differs From The Earlier One

The earlier post answered: how does a threadpool work in C? This post answers: why should an AI runtime centralize thread and memory ownership?

Earlier threadpool post	This comparison post
Starts from raw pthreads.	Starts from AI-runtime contracts.
Shows worker creation and dispatch.	Compares persistent workers against hidden fork/join placement.
Teaches row partitioning through a demo.	Connects row/tile ownership to generated model execution.
Focuses mainly on CPU work dispatch.	Adds memory pools, arenas, workspace contracts, tokenizer pools, and training state.
Good for learning pthread mechanics.	Good for understanding why CKE treats runtime design as part of AI kernel engineering.

The Takeaway

Threadpools and memory pools are not infrastructure trivia. They are the layer that turns isolated math kernels into a repeatable model runner. A CPU-native AI runtime needs kernels that are pure, workers that persist, memory that is planned, scratch that is declared, and reports that make the execution path visible.

That is the CKE direction: do not hide scheduling inside the kernel; do not hide allocation inside the kernel; do not let the runtime become a black box; make work ownership and memory ownership explicit enough that the system can be optimized, tested, and eventually distributed across CPU nodes.

Thread Pools In C: How CPU Runtimes Dispatch Work Across Cores

Linux System Programming For AI Kernels: Core Pinning, Huge Pages, TLBs, NUMA, and Memory Discipline

Pipeline vs Tensor Parallelism: How CKE Splits AI Across CPU Nodes

Muon Optimizer: SGD vs AdamW vs Matrix-Aware Training Updates

Video walkthrough: Thread pools from first principles

C-Kernel-Engine documentation

C-Kernel-Engine scaling thesis

C-Kernel-Engine throughput unit

Threadpools And Memory Pools: Why CKE Needs Runtime Ownership For CPU AI Kernels

What this post compares

The Previous Post Built The Mechanism

Today's Question: Who Owns The Machine?

Threadpool Placement: OpenMP Is Not The Enemy

Row Partitioning Is The First Contract

Thread 0 Should Work Too

Cache-Line Aligned Runtime Metadata

Memory Pools: The Other Half Of The Runtime

Bump Arenas Beat Heap Chaos

Tokenizer Pools Are Not A Side Quest

Training Makes This Non-Negotiable

How This Blog Differs From The Earlier One

The Takeaway

Related

ShivasNotes

Explore

Connect

Threadpools And Memory Pools: Why CKE Needs Runtime Ownership For CPU AI Kernels

What this post compares

The Previous Post Built The Mechanism

Today's Question: Who Owns The Machine?

Threadpool Placement: OpenMP Is Not The Enemy

Row Partitioning Is The First Contract

Thread 0 Should Work Too

Cache-Line Aligned Runtime Metadata

Memory Pools: The Other Half Of The Runtime

Bump Arenas Beat Heap Chaos

Tokenizer Pools Are Not A Side Quest

Training Makes This Non-Negotiable

How This Blog Differs From The Earlier One

The Takeaway

Related

Subscribe

Subscribe to emails from Anthony

ShivasNotes

Explore

Connect