Thread Pools in C: How CPU Runtimes Dispatch Work Across Cores
Companion note for the threadpool video. This post zooms into the local runtime layer: how persistent workers receive work, how fn(ith, nth, arg) turns one C function into core-parallel execution, and why CPU AI runtimes should not create threads inside hot kernels.
Companion material
This post is paired with the long-form video walkthrough: Thread Pools in C: CPU runtime dispatch across cores.
The presentation and C demo material are public here: 27-Threadpool-C-demo on GitHub.
If a kernel is the mathematical unit of work, the runtime decides when that work runs, which thread owns which slice, where outputs are written, and when all workers have reached the barrier. A fast CPU model runner therefore needs more than optimized GEMM. It needs a threadpool, cache-line discipline, clear ownership rules, and a contract that prevents hidden thread creation inside the hot path.
C-Kernel-Engine already has the right architectural instinct here. Kernels should be pure. Parallelism should be orchestrated outside kernels. Worker threads should persist instead of being constantly created and destroyed.
Runtime rule: A kernel should not surprise the runtime. No hidden malloc, no hidden thread creation, no secret layout conversion.
Watch the companion walkthrough
The video version walks through the same C threadpool demo, compile steps, and runtime intuition. The written post below is the deeper reference.
Roadmap for this post
Sections 1 through 3 explain why persistent threadpools exist, how the fn(ith, nth, arg) work contract is defined, and how dispatch and barriers work.
Sections 4 through 8 cover dispatch job structs, function pointers, row partitioning, why thread 0 doing real work matters, fixed-role threads versus threadpools, cache-line aligned atomics, false sharing, and why CKE keeps OpenMP out of kernels.
Sections 9 through 12 extend the same runtime-contract idea to memory pools, bump arenas, hugepages, workspace planning, and tokenizer pools. Sections 13 and 14 cover what to build next and a summary.
Section 1: Why A Persistent Threadpool?
Creating threads inside every decode step would be absurdly expensive. A transformer decode loop may call many kernels per token. If each parallel region had to create workers, synchronize them, tear them down, and recreate them, overhead would dominate.
A persistent threadpool creates workers once. The main thread dispatches work descriptors. Workers wait for a dispatch counter to change, run their assigned slice, then join a barrier. Between batches, workers can pause on a condition variable so the process does not burn CPU while idle.
- Sub-microsecond dispatch latency
- Zero allocation after init
- Cache-line aligned atomics to avoid false sharing
- Hybrid polling: spin first, then condvar
- Thread 0 = main thread and also does its share of work
Section 2: The Work Function Contract
The threadpool API is intentionally simple. A work function receives ith, nth, and an opaque argument pointer. That means the worker does not need to know the whole model. It only needs to know its slice of a specific operation.
typedef void (*ck_work_fn_t)(int ith, int nth, void *args);
// ith: current thread index
// nth: active thread count
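// args: operation-specific payload
This is the core abstraction behind deterministic row partitioning. For a GEMV or GEMM-like operation with a given number of output rows, thread ith can own the half-open row range [r0, r1), where r0 = (ith * rows) / nth and r1 = ((ith + 1) * rows) / nth (integer division).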
Each worker receives a contiguous row range. The math is simple, predictable, and easy to test.
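The partitioning rule can be checked in isolation. The sketch below is a minimal illustration, not CKE code: the names ck_slice_t and ck_slice_for are hypothetical, and the arithmetic is done in size_t on the assumption that element counts may exceed what int can hold.

```c
#include <stddef.h>

/* Hypothetical helper type (not from CKE): a half-open range [begin, end). */
typedef struct {
    size_t begin;
    size_t end;
} ck_slice_t;

/* Range owned by worker `ith` out of `nth` when splitting `n` items.
 * Consecutive workers get adjacent ranges, so every item is covered
 * exactly once and range sizes differ by at most one. */
static ck_slice_t ck_slice_for(int ith, int nth, size_t n)
{
    ck_slice_t s;
    /* Multiply in size_t so a large n does not overflow int arithmetic. */
    s.begin = ((size_t)ith * n) / (size_t)nth;
    s.end   = ((size_t)(ith + 1) * n) / (size_t)nth;
    return s;
}
```

With 10 rows and 4 workers this yields [0, 2), [2, 5), [5, 7), and [7, 10). When there are fewer items than workers, some workers simply receive an empty range and return immediately.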
Section 3: Dispatch And Barrier
Dispatch is the handoff from orchestration to workers. The main thread writes the function pointer and args, bumps a dispatch counter, executes its own share as thread 0, then waits until every worker has completed. The barrier prevents the next operation from reading partially written outputs.
ck_threadpool_t *ck_threadpool_create(int n_threads);
void ck_threadpool_dispatch(
ck_threadpool_t *pool,
ck_work_fn_t fn,
void *args);
void ck_threadpool_dispatch_n(
ck_threadpool_t *pool,
int active_threads,
ck_work_fn_t fn,
void *args);
void ck_threadpool_barrier(ck_threadpool_t *pool);
Where the mutex actually appears in CKE
The mutex is not protecting the RMSNorm data, GEMV data, attention buffers, or worker slices. In CKE’s src/ck_threadpool.c, the hot dispatch path is mostly atomic. The pool stores work_fn, work_args, active_threads, n_dispatch, and n_complete. Workers watch the atomic dispatch counter. When it changes, a worker checks whether its ith is inside the active worker count and then runs the submitted function.
// Main thread publishes the job.
pool->work_fn = fn;
pool->work_args = args;
atomic_store(&pool->active_threads, active_threads);
atomic_store(&pool->n_complete, 0);
// This is the hot wakeup signal workers poll.
atomic_fetch_add(&pool->n_dispatch, 1);
// Sleeping workers also need a condition-variable wakeup.
pthread_mutex_lock(&pool->mutex);
pthread_cond_broadcast(&pool->cond_dispatch);
pthread_mutex_unlock(&pool->mutex);
// Main thread is worker 0.
fn(0, active_threads, args);
The condition variable is there because workers should not burn CPU forever when there is no work. A worker spins briefly with _mm_pause(). If no dispatch arrives, or if that worker is outside the active subset, it goes to sleep with pthread_cond_wait. When a new job arrives, the dispatcher broadcasts cond_dispatch, the sleeping worker wakes, sees the new dispatch counter, and computes its slice.
Four-worker example
Assume a pool has four active workers: ith = 0, 1, 2, 3.
For a large RMSNorm prefill job, the dispatcher may set active_threads = 4. Each worker receives the same rmsnorm_job_t *, but each computes a different token range.
For a tiny decode residual-add job, the dispatcher may set active_threads = 1. Worker 0 runs it directly and the other workers stay asleep.
If there is no work, workers stop wasting CPU cycles by entering pthread_cond_wait. When work appears, the condition variable wakes them.
while runtime is alive:
check atomic dispatch counter
if new work and ith < active_threads:
run fn(ith, active_threads, shared_job_pointer)
atomically mark completion
else if no work or ith is not active:
sleep on condvar so CPU is not wasted
Completion uses the same idea. Workers increment n_complete when they finish. The main thread spins for a short time, then can sleep on cond_done if the job takes longer. The last worker signals cond_done so the main thread can continue to the next kernel. That is the hybrid design: fast atomics for hot LLM dispatch, mutex/condvar for idle or long-wait cases.
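The spin-then-sleep handshake can be sketched on its own. This is a simplified illustration under stated assumptions, not the code in src/ck_threadpool.c: the ck_event_t type, the function names, and the CK_SPIN_LIMIT value are all hypothetical, and the spin loop omits _mm_pause() to stay portable.

```c
#include <pthread.h>
#include <stdatomic.h>

/* Hypothetical event counter, not CKE's actual pool struct. */
typedef struct {
    atomic_int      counter;  /* bumped once per published event */
    pthread_mutex_t mutex;    /* guards only the sleep/wake handshake */
    pthread_cond_t  cond;
} ck_event_t;

#define CK_EVENT_INIT { 0, PTHREAD_MUTEX_INITIALIZER, PTHREAD_COND_INITIALIZER }
#define CK_SPIN_LIMIT 2000    /* assumed tuning knob, not a measured value */

/* Publisher: bump the counter first, then wake any sleeping waiters. */
static void ck_event_signal(ck_event_t *ev)
{
    atomic_fetch_add(&ev->counter, 1);
    pthread_mutex_lock(&ev->mutex);
    pthread_cond_broadcast(&ev->cond);
    pthread_mutex_unlock(&ev->mutex);
}

/* Waiter: return the counter value once it differs from `seen`. */
static int ck_event_wait(ck_event_t *ev, int seen)
{
    /* Phase 1: brief spin, cheap when events arrive back to back. */
    for (int i = 0; i < CK_SPIN_LIMIT; ++i) {
        int now = atomic_load(&ev->counter);
        if (now != seen) {
            return now;
        }
    }
    /* Phase 2: sleep. The counter is re-checked while holding the mutex,
     * so a signal sent between the spin and the wait cannot be lost. */
    pthread_mutex_lock(&ev->mutex);
    while (atomic_load(&ev->counter) == seen) {
        pthread_cond_wait(&ev->cond, &ev->mutex);
    }
    int now = atomic_load(&ev->counter);
    pthread_mutex_unlock(&ev->mutex);
    return now;
}
```

The same shape would serve both directions: workers waiting on a dispatch counter and the main thread waiting on a completion counter. The mutex never protects tensor data; it only closes the window in which a wakeup could be missed.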
Section 4: How Dispatch Chooses The Right Kernel
The threadpool itself should be boring. It should not know what RMSNorm is. It should not know what GEMV is. It should not know transformer architecture. A clean threadpool only knows how to wake workers, pass each worker its ith and nth, run one submitted function pointer, and wait for completion.
The intelligence lives one layer above the threadpool: the orchestrator. The orchestrator knows the model graph, current layer, tensor pointers, dimensions, active thread count, and which kernel implementation should run. It packages those details into a job struct, then dispatches the correct work function.
typedef void (*ck_work_fn_t)(int ith, int nth, void *arg);
typedef struct {
const float *x;
const float *weight;
float *out;
int rows;
int cols;
} gemv_job_t;
gemv_job_t job = {
.x = x,
.weight = w,
.out = y,
.rows = M,
.cols = K,
};
ck_threadpool_dispatch(pool, gemv_work, &job);
In this example, the threadpool does not inspect gemv_job_t. It treats the payload as void *. The function pointer gemv_work knows how to cast that payload back to the correct type. This is the normal C runtime pattern: generic scheduler, typed job payload, typed worker function.
One dispatch, one payload type
All workers in the same dispatch receive the same callback and the same void *arg pointer. If the dispatch is GEMV, every worker runs gemv_work and casts the payload to gemv_job_t *. If the next dispatch is RMSNorm, every worker runs rmsnorm_work and casts the payload to rmsnorm_job_t *.
The threadpool API stays generic. The payload struct does not have to be globally identical for every kernel. It only has to match the callback used for that specific dispatch.
static void gemv_work(int ith, int nth, void *arg) {
gemv_job_t *job = (gemv_job_t *)arg;
int r0 = (ith * job->rows) / nth;
int r1 = ((ith + 1) * job->rows) / nth;
for (int r = r0; r < r1; ++r) {
float sum = 0.0f;
for (int c = 0; c < job->cols; ++c) {
sum += job->weight[(size_t)r * job->cols + c] * job->x[c]; // size_t index avoids int overflow on large matrices
}
job->out[r] = sum;
}
}
That is how the runtime knows which function to call. It does not discover it dynamically inside the threadpool. The orchestrator decides: “this operation is GEMV, the tensors are here, the shape is this, the output is there, and the work function is gemv_work.” The threadpool only executes the submitted contract.
One transformer layer becomes many dispatches
A transformer layer is not one giant threadpool job. It is a sequence of kernel dispatches. Each operation has its own payload shape, its own worker adapter, and its own decision about whether parallel execution is worth it. The same persistent worker threads are reused across all of them.
In practice, every parallelizable operation needs a worker adapter that matches the threadpool signature. The name does not have to be exactly [kernel]_worker, but the role is the same: adapt a typed kernel payload into fn(ith, nth, arg). Small kernels may run serially. Large kernels usually get a worker adapter.
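A worker adapter for a simple elementwise operation might look like the sketch below. It assumes a payload layout of its own: residual_add_job_t and residual_add_range are hypothetical, and the adapter name residual_add_work is used only for illustration, without claiming this is how CKE implements it.

```c
#include <stddef.h>

/* Pure kernel: out[i] = a[i] + b[i] over [begin, end).
 * No threads, no allocation, no global state. */
static void residual_add_range(float *out, const float *a, const float *b,
                               size_t begin, size_t end)
{
    for (size_t i = begin; i < end; ++i) {
        out[i] = a[i] + b[i];
    }
}

/* Hypothetical typed payload for this operation. */
typedef struct {
    const float *a;
    const float *b;
    float *out;
    size_t n;      /* total element count */
} residual_add_job_t;

/* Adapter: converts fn(ith, nth, arg) into a call on one element range. */
static void residual_add_work(int ith, int nth, void *arg)
{
    const residual_add_job_t *job = (const residual_add_job_t *)arg;
    size_t begin = ((size_t)ith * job->n) / (size_t)nth;
    size_t end   = ((size_t)(ith + 1) * job->n) / (size_t)nth;
    residual_add_range(job->out, job->a, job->b, begin, end);
}
```

Because the adapter only depends on ith, nth, and the payload, it can be tested without any threads by calling it once per ith in a plain loop. Calling it as residual_add_work(0, 1, &job) gives the serial path.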
In C-Kernel-Engine this idea already appears in two places. First, src/ck_threadpool.c implements the persistent execution machinery: N-1 pthread workers are created once, the main thread is worker 0, and every dispatch updates the current work_fn, work_args, and active thread count. Second, src/ckernel_kernel_specs.c describes the decoder plan and kernel specs: rmsnorm, qkv_project, rope, attention, attn_proj, residual_add, mlp_up, swiglu, and mlp_down. That plan is the model-runtime side. The threadpool is the execution side.
typedef enum {
CK_OP_RESIDUAL_ADD,
CK_OP_RMSNORM,
CK_OP_QKV_GEMM,
CK_OP_ATTENTION_SCORE,
CK_OP_ATTN_PROJ,
CK_OP_MLP_GEMM,
} ck_op_kind_t;
typedef struct {
ck_op_kind_t op;
const char *name;
ck_work_fn_t worker;
bool threadpool_safe;
bool needs_private_scratch;
int min_elements_for_parallel;
} ck_kernel_meta_t;
The exact struct above is intentionally illustrative. CKE already has kernel specs and decoder plan steps; a production runtime can attach dispatch policy to those specs or derive it during codegen. The important idea is not the exact struct name. The important idea is that every operation needs metadata saying: what kernel function runs, what payload shape it expects, whether it is safe to parallelize, and when parallelization is worth the overhead.
// src/ckernel_kernel_specs.c has this kind of forward plan:
rmsnorm
qkv_project
rope
attention
attn_proj
residual_add
rmsnorm
mlp_up
swiglu
mlp_down
residual_add
That plan says what should happen. The dispatcher decides how each plan step runs. rmsnorm can be sliced by token range. qkv_project, attn_proj, mlp_up, and mlp_down can be sliced by rows, output channels, tiles, heads, or token blocks depending on the kernel. residual_add can be sliced by contiguous element ranges. Attention can be sliced by head, query block, or tile, but only if scratch and output ownership are clean.
static const ck_kernel_meta_t ck_dispatch_table[] = {
{
.op = CK_OP_RMSNORM,
.name = "rmsnorm",
.worker = rmsnorm_work,
.threadpool_safe = true, // workers own token ranges
.needs_private_scratch = false,
.min_elements_for_parallel = 4096,
},
{
.op = CK_OP_QKV_GEMM,
.name = "qkv_project",
.worker = qkv_gemm_work,
.threadpool_safe = true, // workers own output rows/heads/tiles
.needs_private_scratch = true,
.min_elements_for_parallel = 8192,
},
{
.op = CK_OP_ATTENTION_SCORE,
.name = "attention",
.worker = attention_score_work,
.threadpool_safe = true, // only if each worker owns heads/blocks
.needs_private_scratch = true,
.min_elements_for_parallel = 16384,
},
{
.op = CK_OP_RESIDUAL_ADD,
.name = "residual_add",
.worker = residual_add_work,
.threadpool_safe = true, // workers own element ranges
.needs_private_scratch = false,
.min_elements_for_parallel = 32768,
},
};
The dispatcher should not blindly parallelize every operation. It should ask whether the kernel is declared threadpool-safe, whether scratch memory is available, and whether the tensor is large enough to justify waking workers. This is where runtime policy lives.
Prefill versus decode changes the policy
The same model layer has different dispatch economics in prefill and decode. In prefill, the runtime processes many prompt tokens at once. Tensor shapes are larger, so RMSNorm over many tokens, QKV projection over many rows, attention over many query positions, and MLP GEMMs usually have enough work to keep several cores busy. In decode, the runtime often processes one new token at a time. Now many operations become smaller GEMV-like kernels, and dispatch overhead can dominate if the runtime wakes too many workers for too little work.
Dispatch policy changes by mode
Prefill: many tokens, larger matrices, more threadpool-friendly, often worth slicing by token block, output row, tile, or head.
Decode: one token or small batch, KV-cache heavy, often memory-bandwidth bound, sometimes better with fewer active workers or fused serial/low-thread kernels.
Rule: threadpool-safe is necessary, but not sufficient. The operation also needs enough work per dispatch to pay for synchronization.
This is why CKE has separate prefill and decode paths in its tooling. The runtime is not just asking “can this kernel run in parallel?” It is asking “for this mode, this tensor shape, this hardware, and this cache state, how many workers should run?”
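One way to turn "how many workers should run?" into a number is to require a minimum amount of work per worker. The helper below is a hypothetical policy sketch, not CKE's actual heuristic; the function name and the per-thread threshold are assumptions, and a real runtime would also weigh hardware and cache state.

```c
/* Hypothetical policy helper: choose an active worker count so that each
 * worker gets at least `min_elements_per_thread` elements, clamped to the
 * range [1, max_threads]. */
static int ck_pick_active_threads(long elements,
                                  long min_elements_per_thread,
                                  int max_threads)
{
    if (max_threads < 1 || min_elements_per_thread < 1) {
        return 1;   /* degenerate input: fall back to serial */
    }
    long want = elements / min_elements_per_thread;
    if (want < 1) {
        want = 1;   /* tiny job: thread 0 runs it alone */
    }
    if (want > max_threads) {
        want = max_threads;
    }
    return (int)want;
}
```

The result could be passed as the active_threads argument of ck_threadpool_dispatch_n. A decode-sized job then tends to land on one or two workers, while a prefill-sized job uses the full pool.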
static void ck_dispatch_op(ck_threadpool_t *pool,
const ck_kernel_meta_t *meta,
void *job,
int elements)
{
if (!meta->threadpool_safe) {
meta->worker(0, 1, job); // serial fallback
return;
}
if (elements < meta->min_elements_for_parallel) {
meta->worker(0, 1, job); // overhead would dominate
return;
}
ck_threadpool_dispatch(pool, meta->worker, job);
}
What about RMSNorm?
RMSNorm follows the same pattern, but the job struct is different. Instead of rows and columns for a matrix-vector multiply, the job describes token count, feature dimension, epsilon, input pointer, weight pointer, and output pointer. Each worker owns a range of tokens.
typedef struct {
const float *x; // [tokens, dim]
const float *weight; // [dim]
float *out; // [tokens, dim]
int tokens;
int dim;
float eps;
} rmsnorm_job_t;
static void rmsnorm_work(int ith, int nth, void *arg) {
rmsnorm_job_t *job = (rmsnorm_job_t *)arg;
int t0 = (ith * job->tokens) / nth;
int t1 = ((ith + 1) * job->tokens) / nth;
for (int t = t0; t < t1; ++t) {
rmsnorm_one_token(
job->x + t * job->dim,
job->weight,
job->out + t * job->dim,
job->dim,
job->eps);
}
}
Again, the threadpool does not need an RMSNorm-specific branch. It does not say if op == RMSNORM. The orchestrator chooses rmsnorm_work and passes an rmsnorm_job_t. The same worker threads that just ran GEMV can now run RMSNorm because the execution contract is generic.
LayerNorm would submit another typed payload because LayerNorm needs both gamma and beta. GEMV needs x, weight, and out. RMSNorm needs x, weight, out, dim, and eps. LayerNorm needs x, gamma, beta, out, dim, and eps. These do not need one universal mega-struct. They need correct pairing: the orchestrator must submit the right callback with the right payload type.
What makes a kernel threadpool-safe?
A function is threadpool-safe when each worker can run its slice without corrupting another worker’s slice. That usually means read-only shared inputs are allowed, but writes must be partitioned. Worker 0 writes one output range, worker 1 writes another output range, and no two workers write the same element unless the kernel explicitly uses atomics or a reduction protocol.
Threadpool-safe kernel contract
Inputs may be shared read-only across workers.
Outputs must be partitioned so workers do not race on the same memory.
Scratch memory must be per-worker, pre-partitioned, or explicitly synchronized.
The function must not call hidden malloc, create hidden threads, or mutate global state without synchronization.
GEMV is naturally threadpool-safe if workers own disjoint output rows. RMSNorm is naturally threadpool-safe if workers own disjoint token ranges. Attention can be threadpool-safe if workers own heads, query blocks, token ranges, or tiles with clear output ownership. Optimizer updates can be threadpool-safe if workers own disjoint parameter ranges.
Some operations are not safe by default. Reductions, histograms, top-k routing counters, or shared accumulators need extra design. The orchestrator may give each worker private scratch space and then run a second combine step. Or it may use atomics. Or it may choose to run a small operation serially because the synchronization overhead is larger than the work.
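The private-scratch-plus-combine option can be sketched as a two-phase sum. Everything here is illustrative: sum_job_t, MAX_WORKERS, and both function names are hypothetical, and the partial slots are left unpadded for brevity even though a production version might pad them to separate cache lines.

```c
#include <stddef.h>

#define MAX_WORKERS 64   /* assumed upper bound on pool size */

/* Hypothetical payload: shared read-only input plus one private slot
 * per worker for its partial result. */
typedef struct {
    const float *x;
    size_t n;
    double partial[MAX_WORKERS];
} sum_job_t;

/* Phase 1, dispatched to all workers: each worker sums its own range
 * and writes only partial[ith], so no two workers touch the same slot. */
static void sum_partial_work(int ith, int nth, void *arg)
{
    sum_job_t *job = (sum_job_t *)arg;
    size_t begin = ((size_t)ith * job->n) / (size_t)nth;
    size_t end   = ((size_t)(ith + 1) * job->n) / (size_t)nth;
    double acc = 0.0;
    for (size_t i = begin; i < end; ++i) {
        acc += job->x[i];
    }
    job->partial[ith] = acc;
}

/* Phase 2, run by one thread after the barrier: combine in index order,
 * so the result is reproducible for a fixed nth. */
static double sum_combine(const sum_job_t *job, int nth)
{
    double total = 0.0;
    for (int i = 0; i < nth; ++i) {
        total += job->partial[i];
    }
    return total;
}
```

The combine order is fixed by worker index rather than by completion order, which is what keeps floating-point results stable from run to run. Changing nth can still change the rounding, so the thread count is part of the reproducibility contract.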
Safe to dispatch:
output rows are independent
output tokens are independent
output tiles are independent
per-worker scratch is available
Be careful:
multiple workers update one counter
multiple workers write one output row
global allocator is touched
global RNG/state is mutated
reduction needs deterministic combine order
This is why the orchestrator matters. It chooses the function, chooses the active worker count, prepares the job struct, prepares scratch space, and decides whether the operation is worth parallelizing at all. The threadpool is the execution machinery. The orchestrator is the runtime brain.
Section 5: Why Thread 0 Should Work
In some threadpool designs, the main thread only dispatches work and waits. CKE’s header explicitly says thread 0 is the main thread and also does serial operations plus its share of parallel work. That is the right instinct. If the main thread only waits, one hardware thread is wasted during every parallel section.
The main thread is already hot. It already has the model state. It should participate in compute unless there is a specific reason not to.
Section 6: Does A Threadpool Use while(1)?
Yes, a threadpool worker often has an internal loop that looks like while (1) or for (;;). But that loop alone does not make something a threadpool. The important question is: does the thread have a fixed job, or is it a generic worker waiting for dispatched work?
Many systems programs use persistent fixed-role threads. A router, server, or embedded runtime might have one thread for IPC, one thread for control-plane messages, one thread for forwarding-plane dispatch, one thread for telemetry, and one thread for hardware events. Those threads may all run forever. But each thread owns a specific service role. That is a persistent threaded architecture, not necessarily a threadpool.
void *rx_thread(void *arg) {
for (;;) {
packet_t pkt = nic_receive();
enqueue_to_forwarding_plane(pkt);
}
}
void *control_thread(void *arg) {
for (;;) {
message_t msg = ipc_receive();
update_routing_state(msg);
}
}
In that design, the receive thread is always the receive thread. The control thread is always the control thread. The forwarding thread is always the forwarding thread. The architecture is organized by responsibility.
A threadpool is different. Workers do not permanently own a business role like “receive packets” or “handle IPC.” Instead, each worker waits for a generic dispatch. Today the dispatch might be RMSNorm. The next dispatch might be GEMV. The next might be attention softmax, optimizer update, image resize, or a batch of HTTP jobs.
void *worker_main(void *arg) {
worker_t *w = arg;
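threadpool_t *p = w->pool;
unsigned seen = 0; // last job generation this worker has run
for (;;) {
pthread_mutex_lock(&p->mu);
// Wait for a job this worker has not run yet. A plain has_job flag
// would let a fast worker run the same job twice, so this sketch
// assumes the pool bumps a generation counter on every dispatch.
while (p->generation == seen && !p->stop) {
pthread_cond_wait(&p->cv, &p->mu);
}
if (p->stop) {
pthread_mutex_unlock(&p->mu);
break;
}
seen = p->generation;
work_fn_t fn = p->fn;
void *job = p->arg;
int nth = p->nthreads; // snapshot while the lock is held
pthread_mutex_unlock(&p->mu);
fn(w->ith, nth, job);
mark_worker_done(p);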
}
return NULL;
}
So the distinction is not whether a thread has an infinite loop. The distinction is the contract. A fixed-role thread is a named subsystem. A threadpool worker is a reusable execution slot. It wakes, receives a function pointer or work descriptor, runs its slice, reports completion, and waits for the next dispatch.
Simple rule
If the thread’s purpose is fixed for the lifetime of the process, call it a persistent service thread.
If the thread’s purpose changes based on submitted jobs, and many jobs share the same worker set, call it a threadpool.
This is why C-Kernel-Engine’s model is a threadpool model. The workers are not “the GEMM thread” or “the RMSNorm thread.” They are reusable CPU workers. The runtime dispatches a kernel job, each worker computes its assigned range, and then the same workers are reused for the next kernel.
Section 7: Cache-Line Aligned Atomics
Threadpool metadata is performance-sensitive. If dispatch counters, done counters, and flags share the same cache line, unrelated writes can invalidate each other. This is false sharing at the runtime-control level.
CKE’s threadpool header calls this out directly: atomics should be cache-line aligned to avoid false sharing. This is a small detail, but it tells you the runtime is thinking like systems software, not like a wrapper script.
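A layout along these lines keeps each hot counter on its own cache line. The field names follow the ones mentioned earlier in this post, but the struct itself, the 64-byte line size, and the name ck_pool_counters_t are assumptions for illustration rather than the actual CKE header.

```c
#include <stdalign.h>
#include <stdatomic.h>
#include <stddef.h>

#define CK_CACHE_LINE 64   /* assumed line size; typical on x86-64 */

/* Hypothetical control block: each atomic starts a new cache line, so a
 * worker bumping n_complete does not invalidate the line that other
 * workers are polling for n_dispatch. */
typedef struct {
    alignas(CK_CACHE_LINE) atomic_int n_dispatch;
    alignas(CK_CACHE_LINE) atomic_int n_complete;
    alignas(CK_CACHE_LINE) atomic_int active_threads;
} ck_pool_counters_t;

/* Compile-time checks so a later refactor cannot silently repack fields. */
_Static_assert(offsetof(ck_pool_counters_t, n_complete) % CK_CACHE_LINE == 0,
               "n_complete must start on its own cache line");
_Static_assert(sizeof(ck_pool_counters_t) == 3 * CK_CACHE_LINE,
               "each counter should occupy one full cache line");
```

The cost is a few hundred bytes of padding per pool, which is negligible next to model weights. The static assertions document the intent in a form the compiler enforces.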
Section 8: OpenMP Is Not Wrong, But Placement Matters
OpenMP can be useful, especially for quick experiments and parity tests. But putting #pragma omp parallel directly inside every kernel makes the runtime harder to control. Nested parallelism, thread oversubscription, inconsistent affinity, and hidden synchronization become harder to reason about.
CKE’s rule is cleaner: kernels define pure computation over pointer ranges; the orchestrator decides how many threads run, which rows each owns, and when barriers happen.
Kernels must NOT allocate or free memory.
Kernels must NOT contain hidden OpenMP parallel regions.
Kernels must expose inputs, outputs, dimensions, and workspace.
Section 9: Memory Pools And Bump Arenas
Runtime allocation is another source of chaos. If every kernel asks the OS for scratch memory, the model runner gets fragmentation, allocator locks, unpredictable latency, and more places for memory bugs. A bump arena solves this by allocating a large region up front, then handing out offsets.
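The core of a bump arena fits in a few lines. This is a generic sketch, not CKE's allocator: bump_arena_t and its functions are hypothetical names, the backing memory is supplied by the caller, and alignment is applied to the offset, so returned pointers are aligned only if the base pointer is.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical minimal arena over caller-provided memory. */
typedef struct {
    uint8_t *base;      /* start of the pre-allocated region */
    size_t   capacity;  /* total bytes available */
    size_t   offset;    /* next free byte */
} bump_arena_t;

static void bump_init(bump_arena_t *a, void *memory, size_t capacity)
{
    a->base = (uint8_t *)memory;
    a->capacity = capacity;
    a->offset = 0;
}

/* Hand out `size` bytes at an offset rounded up to `align`, which must be
 * a power of two. Returns NULL when the arena is exhausted. */
static void *bump_alloc(bump_arena_t *a, size_t size, size_t align)
{
    size_t start = (a->offset + (align - 1)) & ~(align - 1);
    if (start > a->capacity || size > a->capacity - start) {
        return NULL;
    }
    a->offset = start + size;
    return a->base + start;
}

/* Release everything at once; individual frees do not exist. */
static void bump_reset(bump_arena_t *a)
{
    a->offset = 0;
}
```

Allocation is a bounds check and an addition, with no locks and no calls into the OS. The trade-off is that lifetimes must be planned: memory comes back only when the whole arena is reset.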
CKE’s allocator path points in this direction: try hugepage-backed memory, fall back to aligned allocation with transparent hugepage advice, and preserve a logical contiguous arena for weights and activations.
typedef struct {
uint8_t *base;
size_t total_size;
size_t weights_base;
size_t activations_base;
size_t mapped_len;
size_t weights_file_size;
ck_bump_mode_t mode;
} ck_bump_alloc_t;
Section 10: Workspace Is A Contract
A kernel that needs temporary storage should not call malloc. It should declare how much workspace it needs. The orchestrator includes that workspace in the memory plan. Then the generated runtime passes a pointer to the kernel.
This matters for training even more than inference. Backpropagation needs saved activations, gradient buffers, optimizer state, scratch matrices, and temporary reductions. A training runtime that allocates those ad hoc will be extremely hard to debug.
Weights, activations, gradients, optimizer state, and scratch workspace should be planned, not discovered by accident at runtime.
Section 11: Tokenizer Memory Pools
Memory pools are not only for matrix kernels. Tokenization can allocate many small objects: pieces, nodes, merge candidates, temporary strings, or trie structures. CKE’s tokenizer headers include memory-pool interfaces because tokenizer allocation patterns are different from GEMM allocation patterns.
The principle is the same: define the lifetime. If temporary tokenizer strings only live for one encode call, allocate them from a pool or scratch arena and reset the pool after the call. Do not let thousands of tiny allocations leak into the general heap.
Section 12: Threadpool + Arena Together
Threading and memory planning have to agree. If each worker writes to a separate output row range, the memory layout should make those row ranges contiguous. If each worker needs scratch space, the arena should either give each worker a non-overlapping slice or allocate scratch by operation. If workers update shared counters, those counters should avoid false sharing.
The runtime contract is therefore two-dimensional: who owns the work, and who owns the memory. If those two answers are inconsistent, performance collapses or correctness becomes fragile.
Contract: Parallel code is not just “split the loop.” It means splitting the loop, splitting the memory, avoiding cache-line fights, and synchronizing only when the next operation truly needs it.
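Giving each worker a non-overlapping scratch slice can be reduced to stride arithmetic over one pre-planned region. The helpers below are a hypothetical sketch: the names and the 64-byte line size are assumptions, and the scratch region itself is expected to come from the arena rather than from malloc.

```c
#include <stddef.h>
#include <stdint.h>

#define SCRATCH_LINE 64   /* assumed cache-line size */

/* Per-worker stride: the requested size rounded up to a whole number of
 * cache lines, so two workers never share a line at a slice boundary. */
static size_t scratch_stride(size_t bytes_per_worker)
{
    return (bytes_per_worker + SCRATCH_LINE - 1) / SCRATCH_LINE * SCRATCH_LINE;
}

/* Total bytes the memory plan must reserve for `nth` workers. */
static size_t scratch_total(size_t bytes_per_worker, int nth)
{
    return scratch_stride(bytes_per_worker) * (size_t)nth;
}

/* Slice owned by worker `ith` inside a region of scratch_total() bytes. */
static void *scratch_for(void *base, size_t bytes_per_worker, int ith)
{
    return (uint8_t *)base + scratch_stride(bytes_per_worker) * (size_t)ith;
}
```

The orchestrator would reserve scratch_total() bytes once in the memory plan, store the base pointer in the job struct, and let each worker call scratch_for() with its own ith. Ownership of work and ownership of memory are then derived from the same index.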
Section 13: What To Build Next
The next CKE runtime work should make these contracts visible in generated reports: active thread count per operation, row ranges per worker, scratch workspace size, arena offsets, cache-line alignment, NUMA placement, and whether a kernel used serial fallback or threadpool dispatch.
| Runtime report field | Why it matters |
|---|---|
| active threads | Shows whether parallel path actually ran. |
| row ranges | Makes partitioning inspectable. |
| workspace bytes | Prevents hidden allocation. |
| arena offsets | Allows memory layout debugging. |
| NUMA node | Connects placement to performance. |
| dispatch latency | Shows whether threadpool overhead is acceptable. |
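A report record for these fields could be as small as the sketch below. The struct, its field names, and the output format are all hypothetical; row ranges are left out because they can be recomputed from the active thread count and the partitioning rule.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical per-operation report record. */
typedef struct {
    const char *op_name;
    int    active_threads;       /* 1 means serial fallback */
    size_t workspace_bytes;      /* planned scratch for this operation */
    size_t arena_offset;         /* where the output lives in the arena */
    int    numa_node;            /* -1 when placement is unknown */
    double dispatch_latency_us;  /* measured by the runtime */
} ck_op_report_t;

/* Format one record as a single line. Returns what snprintf returns:
 * the number of characters the full line needs, excluding the NUL. */
static int ck_op_report_format(const ck_op_report_t *r, char *buf, size_t len)
{
    return snprintf(buf, len,
                    "%s threads=%d workspace=%zu arena_offset=%zu "
                    "numa=%d dispatch_us=%.2f",
                    r->op_name, r->active_threads, r->workspace_bytes,
                    r->arena_offset, r->numa_node, r->dispatch_latency_us);
}
```

A single line per operation keeps the report easy to diff between runs, which helps when a change in thread count or arena layout shifts performance.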
Section 14: Summary
Threadpools and memory pools are not infrastructure trivia. They are the runtime layer that lets kernels compose into a model. CKE’s design direction is correct: persistent workers, explicit dispatch, cache-line awareness, pre-planned memory, hugepage-capable arenas, and pure kernels.
The larger lesson is that CPU AI performance is built from contracts. The math kernel must be correct. The threadpool must schedule it predictably. The arena must provide memory without surprise. The generated runtime must stitch those pieces together so the CPU can stream through work instead of fighting the OS, the allocator, and the cache-coherence protocol.