C-Kernel-Engine has been hardened first as a CPU-native runtime on single machines. The next phase is not just buying faster CPUs. It is learning how to make multiple CPU nodes cooperate without turning the runtime into a black box.
The previous distributed CPU post framed the larger systems bet: cheap replaceable CPU nodes, Linux control, explicit memory layout, MPI first, and RDMA later if measurement proves the transport layer matters. That bet is also written more directly in the CKE scaling hypothesis. This post zooms into the scheduling question underneath that bet. If CKE has two, three, or five CPU nodes, how should the work be split? Pipeline parallelism and tensor parallelism are two different answers.
The goal is not to pretend a CPU cluster is a GPU box. The goal is to use what CPU clusters actually have: aggregate FLOPs, more memory channels, more DRAM capacity, more NICs, and more concurrent scheduling opportunities. CKE's job is to turn those resources into a planned runtime instead of a pile of machines. The implementation direction is tracked in the C-Kernel-Engine GitHub repository and the public model and kernel matrix.
North star: The cluster is not one giant RAM pool. It is a planned compute, bandwidth, and communication surface.
The Constraint: Active Bytes Per Token
The CKE throughput-unit page frames the deeper target clearly: FLOPs and TOPS are useful hardware numbers, but they do not answer whether the full runtime can move the data needed to produce the next token. The proposed CKE unit is aggregate bytes cycled per second across the token path. In simple form:
CKU asks how fast the full system cycles the bytes it must touch, transform, cache, transmit, or reuse to produce tokens.
This is the theory of constraints for distributed CKE. The runtime should not move everything everywhere. It should minimize active node traffic by keeping weights resident, placing KV cache and recurrent state deliberately, sending only the activation or shard data required at a boundary, and choosing a topology that matches the phase of execution. A topology is good only if it reduces the slowest constraint in the token path.
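One hedged way to write that unit in symbols is shown below. The symbols \(B_n\), \(N\), and \(\Delta t\) are assumptions introduced here for illustration, not notation taken from the throughput-unit page.

\[
\mathrm{CKU} \approx \frac{1}{\Delta t} \sum_{n=1}^{N} B_n
\]

Here \(B_n\) is the number of bytes node \(n\) touches, transforms, caches, transmits, or reuses on the token path during the interval \(\Delta t\), and \(N\) is the node count. Dividing the same sum by the tokens produced in that interval gives active bytes per token.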
The Bet: Make AI Competitive On CPU Nodes
The CKE bet is not that one consumer CPU magically beats a datacenter accelerator at every workload. The bet is more specific: if AI execution is reduced to explicit kernels, planned memory arenas, deterministic generated C, Linux-level scheduling, and measurable data movement, then commodity CPU nodes can become a serious runtime surface for inference and parts of training. That means CKE is trying to make AI run primarily on CPUs and CPU-attached accelerators by controlling the whole path: weights, activations, KV cache, recurrent state, gradients, network hops, and synchronization.
This matters because real deployment is not only peak FLOPs. It is also model memory, context length, batch shape, latency target, power, cooling, hardware availability, and whether a team can add capacity by buying another replaceable node. GPUs are excellent at dense batched math, but large practical systems still pay for sharding, memory capacity, communication, scheduling, correctness checks, and operations. CKE asks whether a CPU-native runtime can win specific zones of that tradeoff by making the full byte path visible and tunable.
| CKE bet | What must be proven | Evidence the runtime should emit |
|---|---|---|
| CPU nodes can be useful AI runtime units | per-node kernels can keep execution units fed | kernel timing, cache/memory counters, active bytes per token |
| Distributed CPU inference can scale economically | network movement is smaller than the compute/memory gain | rank ownership, activation bytes, hop latency, sync timing |
| Training paths can be made inspectable | gradients, optimizer state, and reductions stay numerically stable | parity reports, gradient checks, reduction timing, fault logs |
| Linux can become part of the runtime surface | pinning, NUMA, huge pages, arenas, and prefetch reduce stalls | CPU affinity maps, NUMA placement, page-fault counts, TLB/cache stats |
What Pipeline Parallelism Means
Pipeline parallelism splits the model by layer ranges. Node A owns early layers, Node B owns middle layers, Node C owns later layers, and each node keeps its assigned weights resident. The data that moves between nodes is primarily activation state, not the full model weights.
For a 12-layer model on three nodes, a simple first schedule is: Node A owns layers 1 through 4, Node B owns layers 5 through 8, and Node C owns layers 9 through 12. Each node stores the weights, local KV/state, scratch buffers, and kernel plan for its assigned layer block. Node A does not send its Q/K/V/MLP weights to Node B. It sends the computed activation at the boundary after layer 4.
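The boundary activation size is \(T \cdot d_{\mathrm{model}} \cdot \mathrm{bytes(dtype)}\), where \(T\) is the number of tokens crossing the boundary. For decode, \(T=1\). For prefill, \(T\) can be the prompt/block length. With \(d=4096\), bf16, and \(T=4096\), one boundary activation is about 32 MiB (roughly 33.5 MB).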
This is the easiest distributed schedule to reason about. It maps naturally onto CKE's model-family bring-up work because the compiler can assign a contiguous layer block to a rank. The danger is pipeline bubbles. If Node B is slow, every request waits for Node B. If only one request is active, many nodes may sit idle while the token moves stage by stage.
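The layer-block split and the boundary size described above can be sketched in a few lines of C. This is a minimal illustration, not CKE's planner: the names `ck_stage_t`, `ck_stage_for_rank`, and `ck_boundary_bytes` are hypothetical, and the rule that earlier stages absorb any remainder is an assumption made here.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stage descriptor: an inclusive, 1-based layer range. */
typedef struct {
    int first_layer;
    int last_layer;
} ck_stage_t;

/* Split n_layers into n_stages contiguous blocks.
 * If the division is uneven, the earliest stages take one extra layer
 * (an assumption of this sketch, not a documented CKE rule). */
ck_stage_t ck_stage_for_rank(int n_layers, int n_stages, int rank)
{
    assert(n_stages > 0 && n_layers >= n_stages);
    assert(rank >= 0 && rank < n_stages);
    int base = n_layers / n_stages;
    int extra = n_layers % n_stages;
    int first = rank * base + (rank < extra ? rank : extra) + 1;
    int count = base + (rank < extra ? 1 : 0);
    ck_stage_t stage = { first, first + count - 1 };
    return stage;
}

/* Bytes crossing one stage boundary: T * d_model * bytes(dtype). */
size_t ck_boundary_bytes(size_t tokens, size_t d_model, size_t dtype_bytes)
{
    return tokens * d_model * dtype_bytes;
}
```

Calling `ck_stage_for_rank(12, 3, rank)` for ranks 0, 1, and 2 yields layers 1 to 4, 5 to 8, and 9 to 12. `ck_boundary_bytes(4096, 4096, 2)` gives 33,554,432 bytes for the bf16 prefill case, while the decode case with one token is 8,192 bytes.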
How Big Is A Layer?
The layer weights are dominated by attention projections and MLP projections. A standard transformer block with full multi-head attention has roughly: Q, K, V, and O projection weights, plus the feed-forward weights. If the MLP expansion is around \(4d\), a rough dense layer estimate is:
\[
P_{\mathrm{layer}} \approx 4d^2 + 12d^2 = 16d^2
\]
Q/K/V/O contribute about \(4d^2\). A gated (two-up, one-down) MLP contributes about \(12d^2\) at \(4d\) expansion. Norm weights are small by comparison.
| Hidden size | Approx layer weights, bf16 | Approx layer weights, q8 | Approx layer weights, q4 | Boundary activation, bf16, T=4096 |
|---|---|---|---|---|
| 768 | ~18.9 MB | ~9.4 MB | ~4.7 MB | ~6.3 MB |
| 2048 | ~134 MB | ~67 MB | ~33.5 MB | ~16.8 MB |
| 4096 | ~537 MB | ~268 MB | ~134 MB | ~33.5 MB |
| 8192 | ~2.15 GB | ~1.07 GB | ~537 MB | ~67 MB |
This table is why the cache story has to be precise. A large L3 cache can help keep hot tiles, scales, metadata, routing data, KV slices, and small-model layer weights closer to compute. But for modern large hidden sizes, full layer weights usually do not live entirely in L3 unless the layer is small or heavily quantized. The main job is to stream from DDR5 into cache efficiently, tile the computation, avoid thrashing, and overlap stages enough that the full pipeline stays busy.
That does not mean the CPU path is doomed. Cache residency is not binary, and whether a whole layer fits in L3 is not the whole story. The real question is whether CKE can turn the layer into a predictable stream of tiles so DDR5, cache, prefetch, and execution overlap. This is no longer a high-level AI-framework problem. This is software systems and kernel engineering.
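The per-layer sizes in the table above follow directly from the rough \(16d^2\) parameter estimate. The sketch below reproduces them; the function names are hypothetical, and the calculation deliberately ignores quantization scales, zero points, and norm weights, so real q8/q4 formats will be somewhat larger.

```c
#include <assert.h>
#include <stdint.h>

/* Rough dense-layer parameter count: 4d^2 (Q/K/V/O) + 12d^2 (gated MLP). */
uint64_t ck_layer_params(uint64_t d_model)
{
    return 16u * d_model * d_model;
}

/* Weight bytes for one layer at a given storage width in bits.
 * Ignores quantization metadata and norm weights (assumption of this sketch). */
uint64_t ck_layer_weight_bytes(uint64_t d_model, unsigned bits_per_weight)
{
    assert(bits_per_weight > 0 && bits_per_weight % 4 == 0);
    return ck_layer_params(d_model) * bits_per_weight / 8u;
}
```

For \(d=4096\), `ck_layer_weight_bytes(4096, 16)` is 536,870,912 bytes (the ~537 MB bf16 entry), and the 8-bit and 4-bit calls give 268,435,456 and 134,217,728 bytes.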
| Technique | What CKE tries to protect | Why it matters |
|---|---|---|
| Tiling | working sets that fit cache/registers | the whole layer need not fit if the active tile does |
| Software prefetch | future weight and activation tiles | load the next data before the current compute finishes |
| Double buffering | one tile computing, one tile loading | turn compute-wait-compute into a conveyor belt |
| Arena layout | predictable address streams | helps hardware prefetchers and avoids allocator chaos |
| NUMA and pinning | worker locality | keeps streaming memory closer to the cores consuming it |
| Quantization | active bytes per token | reduces how much data must cross DDR/cache/network |
The ideal local kernel path looks more like a conveyor belt than a stop-and-wait loop:
CKE does not need the whole model hot in cache. It needs the next useful tile to arrive before the execution units starve.
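As a small illustration of the tiling and software-prefetch rows in the table above, the sketch below computes a dot product tile by tile and issues prefetch hints for the next weight tile before consuming the current one. It is not a CKE kernel: the names, the tile size, and the 64-byte cache-line size are assumptions, and `__builtin_prefetch` is a GCC/Clang builtin, so other compilers fall back to a no-op.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__GNUC__) || defined(__clang__)
#define CK_PREFETCH(addr) __builtin_prefetch((addr), 0, 3)
#else
#define CK_PREFETCH(addr) ((void)(addr))
#endif

#define CK_TILE 256        /* floats per tile; assumed value, tune per cache */
#define CK_CACHE_LINE 64   /* bytes; typical on x86-64, assumed here */

/* Dot product over n floats, processed tile by tile.
 * Before computing tile i, issue prefetch hints for tile i+1 so its
 * cache lines can be in flight while tile i is being consumed. */
float ck_dot_tiled_prefetch(const float *w, const float *x, size_t n)
{
    assert(w != NULL && x != NULL);
    float acc = 0.0f;
    for (size_t start = 0; start < n; start += CK_TILE) {
        size_t end = start + CK_TILE < n ? start + CK_TILE : n;
        size_t next_end = end + CK_TILE < n ? end + CK_TILE : n;

        /* One hint per cache line of the next weight tile. */
        for (size_t j = end; j < next_end; j += CK_CACHE_LINE / sizeof(float)) {
            CK_PREFETCH(&w[j]);
        }
        /* Consume the current tile. */
        for (size_t i = start; i < end; ++i) {
            acc += w[i] * x[i];
        }
    }
    return acc;
}
```

Prefetch instructions are only hints, and hardware prefetchers already handle simple sequential streams well, so whether explicit hints help is something to measure per kernel and per CPU. True double buffering, with one tile loading while another computes, additionally needs a second buffer and an asynchronous loader, which this single-threaded sketch does not attempt.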
What Tensor Parallelism Means
Tensor parallelism splits work inside a layer or tensor operation. Instead of Node A owning layers 0 through 9 and Node B owning layers 10 through 19, multiple nodes can own shards of the same projection, expert group, attention head group, or matrix block. A central node can orchestrate a star topology for early experiments: send work to left and right workers, gather partial results, reduce or concatenate, then continue.
Tensor parallelism is more flexible but harder. The central node can become a bottleneck. Reduction and concatenation introduce synchronization. Small shards can waste time on communication. But for heavy layers, expert routing, or large matrix blocks, tensor parallelism may let CKE accumulate compute and memory bandwidth across nodes in a way pipeline parallelism alone cannot.
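The star pattern described above can be illustrated with a row-sharded matrix-vector product, simulated inside one process. This is a sketch under stated assumptions: the function names are hypothetical, there is no MPI or network here, and each call to `ck_matvec_rows` merely stands in for a worker that owns its rows of the projection.

```c
#include <assert.h>
#include <stddef.h>

/* y[r] = sum_c W[r][c] * x[c] for rows in [row_begin, row_end).
 * W is row-major with `cols` columns. Each call stands in for one worker
 * that owns only its rows of W and writes only its slice of y. */
void ck_matvec_rows(const float *W, const float *x, float *y,
                    size_t cols, size_t row_begin, size_t row_end)
{
    assert(row_begin <= row_end);
    for (size_t r = row_begin; r < row_end; ++r) {
        float acc = 0.0f;
        for (size_t c = 0; c < cols; ++c) {
            acc += W[r * cols + c] * x[c];
        }
        y[r] = acc;
    }
}

/* Star-style orchestration, simulated in one process: the "central" caller
 * hands each of n_workers a contiguous row shard, and the concatenated
 * slices of y form the full output. A row split needs no reduction. */
void ck_matvec_sharded(const float *W, const float *x, float *y,
                       size_t rows, size_t cols, size_t n_workers)
{
    assert(n_workers > 0);
    for (size_t w = 0; w < n_workers; ++w) {
        size_t begin = rows * w / n_workers;
        size_t end = rows * (w + 1) / n_workers;
        ck_matvec_rows(W, x, y, cols, begin, end);
    }
}
```

With a row split, every worker needs the full input vector and the results are concatenated. Splitting along columns instead gives each worker only part of the input, but every worker then produces a partial sum for every output element, so the results must be reduced. That is where the reduction and synchronization cost mentioned above comes from.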
Prefill And Decode Are Different
A distributed runtime should not assume the same topology is best for every phase. Prefill processes a larger prompt window and often has more available parallel work. Decode produces one token at a time and becomes more sensitive to latency, KV-cache placement, pipeline fill, and concurrent requests.
The per-token path pays for stage compute, network hops, and synchronization. Throughput improves when the pipeline stays full.
This is where the schedule can legitimately change. Prefill may benefit from tensor or hybrid parallelism because there is a larger token block and more work to amortize communication. Decode may prefer pipeline parallelism or concurrent request scheduling because every token step is smaller and latency-sensitive. The KV cache also grows during decode, but it should remain resident with the layer or shard that owns it instead of being moved blindly across the cluster.
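A common first-order model of that per-token path, written here as an assumption rather than as CKE's own formula, treats a pipeline of \(S\) stages as follows. The symbols are introduced for illustration, and hops are assumed not to overlap with compute.

\[
t_{\mathrm{token}} \approx \sum_{s=1}^{S} t_{\mathrm{compute}}(s) + \sum_{s=1}^{S-1} t_{\mathrm{hop}}(s) + t_{\mathrm{sync}}
\]

\[
\mathrm{throughput}_{\mathrm{full}} \approx \frac{1}{\max_{s} \left( t_{\mathrm{compute}}(s) + t_{\mathrm{hop}}(s) \right)}
\]

The first expression is the latency one token sees when it walks every stage alone. The second is the steady-state rate when enough concurrent requests keep every stage busy, which is why the slowest stage, not the sum of stages, sets throughput once the pipeline is full.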
The CKE Runtime Question
The interesting question is not whether pipeline or tensor parallelism is universally better. The interesting question is whether CKE can measure enough about the model, hardware, and workload to choose a schedule. A small two-node cluster may start with pipeline parallelism. A three-node cluster may test a central orchestration pattern. A five-node cluster may mix pipeline stages with tensor shards inside a heavy stage.
The scheduler therefore needs a constraint model. If a schedule increases FLOPs but doubles network traffic, it may lose. If a schedule keeps weights resident and only moves compact activation blocks, it may win. If tensor parallelism creates too many tiny reductions during decode, pipeline parallelism may be better. If prefill has a large enough chunk of work to hide communication, tensor or hybrid parallelism may become attractive.
How CKE Tries To Do This
The important architectural difference is that CKE should not treat distributed execution as a vague serving feature. It should compile and report the distributed plan. In a CKE-style runtime, a model graph can be lowered into explicit rank ownership: which node owns which layers, which node owns which tensor shard, which arena offset stores the weights, where KV cache lives, which activation boundary crosses the wire, and which synchronization step is required before the next kernel runs. This is why the CKE concepts page matters: kernels are not just math names; they are contracts about layout, memory movement, and execution behavior.
The simplified planner objective is:
The distributed schedule is only useful if it lowers the real token path while preserving numerical correctness.
For pipeline parallelism, the planner tries to keep weights resident and move boundary activations. For tensor parallelism, the planner tries to shard large projections or expert groups and then pay the gather/reduce cost only when that cost is smaller than the compute and memory-bandwidth gain. For prefill, larger token blocks can make tensor or hybrid parallelism easier to amortize. For decode, the per-token path is smaller, so pipeline fill, KV locality, and concurrent requests often matter more. For training, the same idea becomes harder because gradients, optimizer state, activation checkpointing, and collective communication enter the path.
The reason this is not hand-waving is that every term can be measured. Activation boundary size is \(T \cdot d_{\mathrm{model}} \cdot \mathrm{bytes(dtype)}\). A dense transformer layer is roughly \(16d^2\) parameters before model-specific variations such as GQA, MoE, SSM, DeltaNet-style recurrence, quantization, and fused kernels. Network time can be estimated as payload bytes divided by effective link bandwidth, then corrected by measured latency, software overhead, and synchronization. Memory time can be estimated from active bytes divided by measured DRAM/cache bandwidth, then corrected by misses, TLB pressure, NUMA distance, and prefetch quality. CKE's job is to turn those estimates into generated C, runtime artifacts, and measurements.
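Those first-order estimates are simple enough to sketch in C. The code below is an illustration under stated assumptions, not CKE's planner: the type and function names are hypothetical, the stage is assumed to be memory-bound, and memory streaming and the network hop are assumed not to overlap. Every input is meant to come from measurement.

```c
#include <assert.h>
#include <stddef.h>

/* First-order time estimate: payload / bandwidth + fixed overhead.
 * `overhead_s` stands in for measured latency, software overhead, and
 * synchronization. */
double ck_transfer_seconds(double payload_bytes, double bytes_per_second,
                           double overhead_s)
{
    assert(bytes_per_second > 0.0);
    return payload_bytes / bytes_per_second + overhead_s;
}

/* Hypothetical per-stage cost inputs. */
typedef struct {
    double active_bytes;      /* bytes streamed from DRAM for this stage     */
    double dram_bytes_per_s;  /* measured effective DRAM bandwidth           */
    double hop_bytes;         /* boundary activation bytes sent to next rank */
    double link_bytes_per_s;  /* measured effective link bandwidth           */
    double hop_overhead_s;    /* measured latency plus software overhead     */
} ck_stage_cost_t;

/* Memory streaming time plus the activation hop, with no overlap assumed. */
double ck_stage_seconds(const ck_stage_cost_t *c)
{
    assert(c != NULL);
    return ck_transfer_seconds(c->active_bytes, c->dram_bytes_per_s, 0.0)
         + ck_transfer_seconds(c->hop_bytes, c->link_bytes_per_s, c->hop_overhead_s);
}
```

As an illustrative input only, take an 8,192-byte decode boundary over a link with 1.25 GB/s effective bandwidth and 50 microseconds of fixed overhead. The bandwidth term is about 6.6 microseconds, so the hop is dominated by the fixed overhead. That is the kind of result that should push a planner toward fewer, larger transfers during decode.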
The larger formula is not GPU-specific and not CPU-specific:
GPU clusters also pay communication and synchronization costs. CKE's bet is that CPU nodes can be made useful by controlling the whole deterministic byte path.
This is also where the economic thesis enters. GPU systems are technically excellent, but real scale is not only the price of a card. It includes VRAM capacity, power delivery, cooling, servers, racks, networking, and operations. The practical question is whether a small team can keep adding useful compute without first graduating into datacenter-class GPU power and cooling. CKE is exploring whether commodity CPU nodes can scale differently: add another replaceable node, add more cores, add more memory channels, add more DRAM, add another NIC, and let deterministic software decide how the byte path is split. The claim still has to be proven with measurements. But the formula itself does not say "GPU only."
That is why the math matters. If a node can own a layer block, keep its local weights and KV state resident, stream the next tile before the execution units starve, and send only the required activation boundary to the next rank, then the cluster becomes a throughput experiment instead of a single-box benchmark. Adding nodes is not free. Network hops, synchronization, rank imbalance, NUMA placement, and memory bandwidth can erase the gain. But those are measurable engineering constraints, not religious arguments about whether AI must run on one hardware class.
| Strategy | Best fit | Main risk | CKE evidence needed |
|---|---|---|---|
| Pipeline parallelism | layer-block ownership | slowest stage limits throughput | stage timing, bubble report, activation hop cost |
| Tensor parallelism | large projections, experts, head groups | gather/reduce synchronization | shard timing, reduction timing, bandwidth use |
| Hybrid schedule | large models across many nodes | planner complexity | topology report, rank ownership, parity checks |
Why This Fits CKE
CKE already wants model execution to be explicit: generated C, planned memory, sectioned arenas, kernel purity, visible artifacts, and Linux-level control. The project documentation describes this as a CPU-native AI runtime and kernel compiler for auditable inference, training kernels, and distributed CPU execution. Distributed parallelism is an extension of that same philosophy. Instead of hiding the cluster behind a vague serving layer, CKE should expose rank ownership, layer ownership, tensor shard ownership, arena offsets, transfer sizes, and timing reports.
That is how the project can avoid becoming just another benchmark claim. The output should not only be tokens per second. The output should include evidence: which node owned which layer, what moved across the network, how much time each hop took, where the pipeline stalled, and where numerical parity was checked. The throughput-unit view adds one more important artifact: how many active bytes were cycled per token, and where those bytes were spent.
Bottom Line
Pipeline parallelism is the first simple schedule. Tensor parallelism is the harder but more flexible schedule. Prefill and decode may need different schedules. Training adds gradients, optimizer state, and synchronization pressure. The next phase of CKE is learning how to make these choices explicit enough that cheap CPU nodes can be added, replaced, and measured like real runtime units.
Related C-Kernel-Engine References
- C-Kernel-Engine project documentation - overview of the CPU-native runtime and kernel compiler.
- C-Kernel-Engine GitHub repository - source code and active development.
- CKE scaling hypothesis - the larger commodity CPU and distributed systems thesis.
- CKE throughput unit - active bytes per token and aggregate bytes cycled per second.
- CKE concepts - kernel concepts, layouts, and model execution ideas.
- Model and kernel matrix - current model-family and kernel bring-up surface.