Pipeline vs Tensor Parallelism: How CKE Splits AI Across CPU Nodes

Distributed runtime note

C-Kernel-Engine has been hardened first as a CPU-native runtime on single machines. The next phase is not just buying faster CPUs. It is learning how to make multiple CPU nodes cooperate without turning the runtime into a black box.

The previous distributed CPU post framed the larger systems bet: cheap replaceable CPU nodes, Linux control, explicit memory layout, MPI first, and RDMA later if measurement proves the transport layer matters. That bet is also written more directly in the CKE scaling hypothesis. This post zooms into the scheduling question underneath that bet. If CKE has two, three, or five CPU nodes, how should the work be split? Pipeline parallelism and tensor parallelism are two different answers.

The goal is not to pretend a CPU cluster is a GPU box. The goal is to use what CPU clusters actually have: aggregate FLOPs, more memory channels, more DRAM capacity, more NICs, and more concurrent scheduling opportunities. CKE's job is to turn those resources into a planned runtime instead of a pile of machines. The implementation direction is tracked in the C-Kernel-Engine GitHub repository and the public model and kernel matrix.

North star: The cluster is not one giant RAM pool. It is a planned compute, bandwidth, and communication surface.

The Constraint: Active Bytes Per Token

The CKE throughput-unit page frames the deeper target clearly: FLOPs and TOPS are useful hardware numbers, but they do not answer whether the full runtime can move the data needed to produce the next token. The proposed CKE unit is aggregate bytes cycled per second across the token path. In simple form:

\[ \mathrm{CKU} = \frac{\mathrm{active\ bytes\ per\ token}}{\mathrm{seconds\ per\ token}} \]

CKU asks how fast the full system cycles the bytes it must touch, transform, cache, transmit, or reuse to produce tokens.
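As a minimal worked instance of the CKU ratio, the sketch below divides an assumed active-byte count by an assumed token time. The function name and the example figures (4 GB touched per token, 0.25 s per token) are illustrative assumptions, not CKE measurements.

```python
def cku_bytes_per_second(active_bytes_per_token: float, seconds_per_token: float) -> float:
    """CKU = active bytes per token / seconds per token."""
    if seconds_per_token <= 0:
        raise ValueError("seconds_per_token must be positive")
    return active_bytes_per_token / seconds_per_token


# Hypothetical numbers: 4 GB of active bytes per token, 0.25 s per token.
example_cku = cku_bytes_per_second(4e9, 0.25)
print(f"CKU = {example_cku / 1e9:.1f} GB/s")
```

Halving the active bytes at the same token time halves the CKU value, so the number is only meaningful next to the tokens-per-second figure it was measured with.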

This is the theory of constraints for distributed CKE. The runtime should not move everything everywhere. It should minimize active node traffic by keeping weights resident, placing KV cache and recurrent state deliberately, sending only the activation or shard data required at a boundary, and choosing a topology that matches the phase of execution. A topology is good only if it reduces the slowest constraint in the token path.

The Bet: Make AI Competitive On CPU Nodes

The CKE bet is not that one consumer CPU magically beats a datacenter accelerator at every workload. The bet is more specific: if AI execution is reduced to explicit kernels, planned memory arenas, deterministic generated C, Linux-level scheduling, and measurable data movement, then commodity CPU nodes can become a serious runtime surface for inference and parts of training. That means CKE is trying to make AI run primarily on CPUs and CPU-attached accelerators by controlling the whole path: weights, activations, KV cache, recurrent state, gradients, network hops, and synchronization.

This matters because real deployment is not only peak FLOPs. It is also model memory, context length, batch shape, latency target, power, cooling, hardware availability, and whether a team can add capacity by buying another replaceable node. GPUs are excellent at dense batched math, but large practical systems still pay for sharding, memory capacity, communication, scheduling, correctness checks, and operations. CKE asks whether a CPU-native runtime can win specific zones of that tradeoff by making the full byte path visible and tunable.

CKE bet	What must be proven	Evidence the runtime should emit
CPU nodes can be useful AI runtime units	per-node kernels can keep execution units fed	kernel timing, cache/memory counters, active bytes per token
Distributed CPU inference can scale economically	network movement is smaller than the compute/memory gain	rank ownership, activation bytes, hop latency, sync timing
Training paths can be made inspectable	gradients, optimizer state, and reductions stay numerically stable	parity reports, gradient checks, reduction timing, fault logs
Linux can become part of the runtime surface	pinning, NUMA, huge pages, arenas, and prefetch reduce stalls	CPU affinity maps, NUMA placement, page-fault counts, TLB/cache stats

Figure: CKE throughput unit showing active bytes per token through weights, activations, KV cache, network hops, and synchronization. The distributed runtime should optimize the active-byte path, not only raw FLOPs or node count.

What Pipeline Parallelism Means

Pipeline parallelism splits the model by layer ranges. Node A owns early layers, Node B owns middle layers, Node C owns later layers, and each node keeps its assigned weights resident. The data that moves between nodes is primarily activation state, not the full model weights.

Figure: Pipeline parallelism across CPU nodes, where each node owns a layer block and sends activations forward. Pipeline parallelism keeps layer weights resident on each node and moves activations across stage boundaries.

Figure: Detailed C-Kernel-Engine pipeline parallelism diagram showing 12 transformer layers split across 3 CPU nodes, resident weights, KV cache ownership, activation boundary math, cache streaming, and MPI/RDMA transport. A more complete CKE pipeline-parallel plan: each CPU node owns a layer block and sends computed activations, not weights. Prefill and decode may choose different distributed schedules.

For a 12-layer model on three nodes, a simple first schedule is: Node A owns layers 1 through 4, Node B owns layers 5 through 8, and Node C owns layers 9 through 12. Each node stores the weights, local KV/state, scratch buffers, and kernel plan for its assigned layer block. Node A does not send its Q/K/V/MLP weights to Node B. It sends the computed activation at the boundary after layer 4.
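The contiguous split described above can be sketched as a small helper. `assign_layer_blocks` is a hypothetical name, not a CKE API. Layers are numbered from 1 to match the example, and handing any remainder layers to the earliest nodes is just one possible policy.

```python
from typing import List


def assign_layer_blocks(num_layers: int, num_nodes: int) -> List[range]:
    """Split layers 1..num_layers into contiguous blocks, one block per node."""
    if num_nodes <= 0 or num_layers < num_nodes:
        raise ValueError("need at least one layer per node")
    base, extra = divmod(num_layers, num_nodes)
    blocks = []
    start = 1
    for node in range(num_nodes):
        # Assumed policy: the first `extra` nodes take one additional layer.
        size = base + (1 if node < extra else 0)
        blocks.append(range(start, start + size))
        start += size
    return blocks


for name, block in zip("ABC", assign_layer_blocks(12, 3)):
    print(f"Node {name}: layers {block[0]} through {block[-1]}")
```

An even split is only a starting point. A planner working from measured stage timings could rebalance block sizes so that no stage becomes the bottleneck.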

\[ \mathrm{activation\ bytes} = T \cdot d_{\mathrm{model}} \cdot \mathrm{bytes(dtype)} \]

For decode, \(T=1\). For prefill, \(T\) can be the prompt/block length. With \(d_{\mathrm{model}}=4096\), bf16 (2 bytes per element), and \(T=4096\), one boundary activation is 32 MiB (about 33.5 MB). The same boundary during decode carries only 8 KiB per token.
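The boundary formula is easy to check numerically. The sketch below is illustrative only: the helper name and the dtype table are assumptions introduced here, and the values are the ones used in the text.

```python
# Bytes per element for a few common activation dtypes (assumed table).
BYTES_PER_ELEMENT = {"fp32": 4, "bf16": 2, "fp16": 2, "int8": 1}


def activation_bytes(tokens: int, d_model: int, dtype: str = "bf16") -> int:
    """Bytes crossing one pipeline boundary: T * d_model * bytes(dtype)."""
    return tokens * d_model * BYTES_PER_ELEMENT[dtype]


prefill_bytes = activation_bytes(tokens=4096, d_model=4096)  # prompt block
decode_bytes = activation_bytes(tokens=1, d_model=4096)      # one token

print(f"prefill boundary: {prefill_bytes / 2**20:.0f} MiB")
print(f"decode boundary:  {decode_bytes / 2**10:.0f} KiB")
```

The gap between the two payloads, a factor of \(T\), is one reason the same boundary can behave very differently in prefill and decode.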

This is the easiest distributed schedule to reason about. It maps naturally onto CKE's model-family bring-up work because the compiler can assign a contiguous layer block to a rank. The danger is pipeline bubbles. If Node B is slow, every request waits for Node B. If only one request is active, many nodes may sit idle while the token moves stage by stage.
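The bubble problem can be made concrete with a small timing model. This is a sketch under simplifying assumptions: hop and synchronization costs are ignored, stage times are fixed, and the function name and the example timings are hypothetical.

```python
from typing import Dict, List


def pipeline_summary(stage_ms: List[float]) -> Dict[str, object]:
    """Summarize a linear pipeline from per-stage compute times in milliseconds."""
    if not stage_ms or min(stage_ms) <= 0:
        raise ValueError("stage times must be positive")
    latency_ms = sum(stage_ms)     # one item visits every stage in turn
    bottleneck_ms = max(stage_ms)  # a full pipeline is paced by the slowest stage
    return {
        "latency_ms": latency_ms,
        "steady_state_per_s": 1000.0 / bottleneck_ms,
        # With a single request in flight, each node is busy only for its own share.
        "busy_fraction_single_request": [t / latency_ms for t in stage_ms],
    }


# Hypothetical timings: Node B is twice as slow as Nodes A and C.
summary = pipeline_summary([10, 20, 10])
print(summary)
```

In this example Nodes A and C are each idle three quarters of the time when only one request is active, and even a full pipeline cannot exceed the rate set by Node B.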

How Big Is A Layer?

The layer weights are dominated by attention projections and MLP projections. A standard transformer block with full multi-head attention has roughly: Q, K, V, and O projection weights, plus the feed-forward weights. If the MLP expansion is around \(4d\), a rough dense layer estimate is:

\[ W_{\mathrm{layer}} \approx 4d^2 + 3d(4d) = 16d^2 \quad \mathrm{parameters} \]

Q/K/V/O contribute about \(4d^2\). A gated or two-up-one-down MLP can contribute around \(12d^2\). Norm weights are small by comparison.

Hidden size	Approx layer weights, bf16	Approx layer weights, q8	Approx layer weights, q4	Boundary activation, bf16, T=4096
768	~18.9 MB	~9.4 MB	~4.7 MB	~6.3 MB
2048	~134 MB	~67 MB	~33.5 MB	~16.8 MB
4096	~537 MB	~268 MB	~134 MB	~33.5 MB
8192	~2.15 GB	~1.07 GB	~537 MB	~67 MB
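The table values follow directly from the \(16d^2\) estimate. The sketch below reproduces them under the same assumptions (full multi-head attention, a three-matrix MLP with \(4d\) expansion, decimal MB). It also assumes exactly 8 or 4 bits per weight for q8 and q4, which ignores the scale and metadata overhead that real quantized formats carry. The function names are illustrative.

```python
BITS_PER_WEIGHT = {"bf16": 16, "q8": 8, "q4": 4}  # q8/q4 overhead ignored here


def dense_layer_params(d_model: int, mlp_expansion: int = 4) -> int:
    """Rough dense layer size: 4*d^2 for Q/K/V/O plus 3*d*(expansion*d) for the MLP."""
    attention = 4 * d_model * d_model
    mlp = 3 * d_model * (mlp_expansion * d_model)
    return attention + mlp


def layer_weight_bytes(d_model: int, fmt: str) -> int:
    """Approximate bytes needed to store one layer's weights in the given format."""
    return dense_layer_params(d_model) * BITS_PER_WEIGHT[fmt] // 8


for d in (768, 2048, 4096, 8192):
    sizes = ", ".join(
        f"{fmt}: {layer_weight_bytes(d, fmt) / 1e6:8.1f} MB" for fmt in BITS_PER_WEIGHT
    )
    print(f"d={d:5d}  {sizes}")
```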

This table is why the cache story has to be precise. A large L3 cache can help keep hot tiles, scales, metadata, routing data, KV slices, and small-model layer weights closer to compute. But for modern large hidden sizes, full layer weights usually do not live entirely in L3 unless the layer is small or heavily quantized. The main job is to stream from DDR5 into cache efficiently, tile the computation, avoid thrashing, and overlap stages enough that the full pipeline stays busy.

That does not mean the CPU path is doomed. Cache residency is not binary, and whether a whole layer fits in L3 is only part of the story. The real question is whether CKE can turn the layer into a predictable stream of tiles so that DDR5 transfers, cache fills, prefetch, and execution overlap. This is no longer a high-level AI-framework problem. It is a software systems and kernel engineering problem.

Technique	What CKE tries to protect	Why it matters
Tiling	working sets that fit cache/registers	the whole layer need not fit if the active tile does
Software prefetch	future weight and activation tiles	load the next data before the current compute finishes
Double buffering	one tile computing, one tile loading	turn compute-wait-compute into a conveyor belt
Arena layout	predictable address streams	helps hardware prefetchers and avoids allocator chaos
NUMA and pinning	worker locality	keeps streaming memory closer to the cores consuming it
Quantization	active bytes per token	reduces how much data must cross DDR/cache/network

The ideal local kernel path looks more like a conveyor belt than a stop-and-wait loop:

\[ \mathrm{DDR5} \rightarrow \mathrm{L3} \rightarrow \mathrm{L2} \rightarrow \mathrm{registers} \rightarrow \mathrm{execution\ units} \]

CKE does not need the whole model hot in cache. It needs the next useful tile to arrive before the execution units starve.
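The difference between a stop-and-wait loop and a conveyor belt can be estimated with an idealized timing model. This is a sketch, not a kernel: it assumes uniform tiles, two buffers, and perfect overlap between loading tile \(i+1\) and computing tile \(i\). The function names and tile timings are hypothetical.

```python
def stop_and_wait_time(num_tiles: int, load_us: float, compute_us: float) -> float:
    """Each tile is loaded, then computed, with no overlap."""
    return num_tiles * (load_us + compute_us)


def double_buffered_time(num_tiles: int, load_us: float, compute_us: float) -> float:
    """Load of the next tile overlaps compute of the current tile (two buffers)."""
    if num_tiles == 0:
        return 0.0
    # The first load and the last compute cannot be hidden; every step in
    # between is paced by whichever of load or compute is slower.
    return load_us + (num_tiles - 1) * max(load_us, compute_us) + compute_us


# Hypothetical tile costs: 3 us to stream a tile, 5 us to compute on it.
serial = stop_and_wait_time(8, load_us=3, compute_us=5)
overlapped = double_buffered_time(8, load_us=3, compute_us=5)
print(f"stop-and-wait: {serial} us, double-buffered: {overlapped} us")
```

When load time exceeds compute time the kernel is memory-bound and the overlapped time is set by the stream rate, which is where quantization and layout start to matter more than extra compute.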

What Tensor Parallelism Means

Tensor parallelism splits work inside a layer or tensor operation. Instead of Node A owning layers 1 through 10 and Node B owning layers 11 through 20, multiple nodes can own shards of the same projection, expert group, attention head group, or matrix block. A central node can orchestrate a star topology for early experiments: send work to left and right workers, gather partial results, reduce or concatenate, then continue.

A star topology is not the only tensor-parallel schedule, but it is a simple first topology for CKE experiments.

Tensor parallelism is more flexible but harder. The central node can become a bottleneck. Reduction and concatenation introduce synchronization. Small shards can waste time on communication. But for heavy layers, expert routing, or large matrix blocks, tensor parallelism may let CKE accumulate compute and memory bandwidth across nodes in a way pipeline parallelism alone cannot.
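The two basic ways to shard a projection can be shown on a toy matrix. The sketch below uses plain Python lists and runs the "workers" as a loop in one process, so it only demonstrates the arithmetic: splitting by output columns requires a concatenate, and splitting by the inner dimension requires a sum reduction. All function names are illustrative.

```python
from typing import List

Matrix = List[List[int]]


def matmul(x: Matrix, w: Matrix) -> Matrix:
    """Reference product of x (rows x k) and w (k x n)."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*w)] for row in x]


def column_parallel_matmul(x: Matrix, w: Matrix, num_shards: int) -> Matrix:
    """Each shard owns a slice of output columns; results are concatenated."""
    n = len(w[0])
    if n % num_shards:
        raise ValueError("columns must divide evenly across shards")
    width = n // num_shards
    shards = [[row[s * width:(s + 1) * width] for row in w] for s in range(num_shards)]
    partials = [matmul(x, shard) for shard in shards]  # one matmul per worker
    return [sum((p[i] for p in partials), []) for i in range(len(x))]


def row_parallel_matmul(x: Matrix, w: Matrix, num_shards: int) -> Matrix:
    """Each shard owns a slice of the inner dimension; results are summed."""
    k = len(w)
    if k % num_shards:
        raise ValueError("rows must divide evenly across shards")
    depth = k // num_shards
    partials = []
    for s in range(num_shards):
        x_slice = [row[s * depth:(s + 1) * depth] for row in x]
        partials.append(matmul(x_slice, w[s * depth:(s + 1) * depth]))
    # Reduction step: elementwise sum of the partial outputs.
    return [[sum(vals) for vals in zip(*rows)] for rows in zip(*partials)]


x = [[1, 2, 3, 4], [0, 1, 0, 1]]
w = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
print(column_parallel_matmul(x, w, 2) == matmul(x, w))
print(row_parallel_matmul(x, w, 2) == matmul(x, w))
```

With integer inputs both variants match the reference exactly. With floating-point weights the row-parallel reduction changes the summation order, which is one reason parity checks belong in the plan.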

Prefill And Decode Are Different

A distributed runtime should not assume the same topology is best for every phase. Prefill processes a larger prompt window and often has more available parallel work. Decode produces one token at a time and becomes more sensitive to latency, KV-cache placement, pipeline fill, and concurrent requests.

Figure: Prefill and decode schedules for distributed CPU inference, showing larger prefill chunks and smaller decode pipeline work. Prefill can use larger chunks. Decode needs careful latency and concurrency scheduling.

Figure: Detailed C-Kernel-Engine prefill versus decode topology diagram showing tensor or hybrid parallelism during prefill and pipeline or concurrent request scheduling during decode. Prefill and decode can choose different distributed plans. Prefill can amortize tensor reductions over large token blocks; decode must protect latency, KV locality, and pipeline fill.

\[ T_{\mathrm{token}} \approx \sum_i T_{\mathrm{stage},i} + \sum_j T_{\mathrm{hop},j} + T_{\mathrm{sync}} \]

The per-token path pays for stage compute, network hops, and synchronization. Throughput improves when the pipeline stays full.

This is where the schedule can legitimately change. Prefill may benefit from tensor or hybrid parallelism because there is a larger token block and more work to amortize communication. Decode may prefer pipeline parallelism or concurrent request scheduling because every token step is smaller and latency-sensitive. The KV cache also grows during decode, but it should remain resident with the layer or shard that owns it instead of being moved blindly across the cluster.
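One way to see why the two phases differ is to estimate a single boundary hop for each. The sketch below uses a simple latency-plus-bandwidth model. The link figures (10 Gb/s, 50 microseconds of fixed cost per message) are assumptions chosen for illustration, not measurements of any CKE setup.

```python
def hop_seconds(payload_bytes: float, link_bytes_per_s: float, latency_s: float) -> float:
    """Simple hop model: fixed per-message latency plus payload transfer time."""
    return latency_s + payload_bytes / link_bytes_per_s


# Assumed link: 10 Gb/s Ethernet (1.25e9 bytes/s), 50 us per-message overhead.
LINK_BYTES_PER_S = 1.25e9
LINK_LATENCY_S = 50e-6
D_MODEL = 4096
BF16_BYTES = 2

decode_hop = hop_seconds(1 * D_MODEL * BF16_BYTES, LINK_BYTES_PER_S, LINK_LATENCY_S)
prefill_hop = hop_seconds(4096 * D_MODEL * BF16_BYTES, LINK_BYTES_PER_S, LINK_LATENCY_S)

print(f"decode hop:  {decode_hop * 1e6:.1f} us, "
      f"latency share {LINK_LATENCY_S / decode_hop:.0%}")
print(f"prefill hop: {prefill_hop * 1e3:.1f} ms, "
      f"latency share {LINK_LATENCY_S / prefill_hop:.1%}")
```

Under these assumptions the decode hop is dominated by fixed latency and the prefill hop by bandwidth, so a transport change that helps one phase may do little for the other.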

The CKE Runtime Question

The interesting question is not whether pipeline or tensor parallelism is universally better. The interesting question is whether CKE can measure enough about the model, hardware, and workload to choose a schedule. A small two-node cluster may start with pipeline parallelism. A three-node cluster may test a central orchestration pattern. A five-node cluster may mix pipeline stages with tensor shards inside a heavy stage.

The scheduler therefore needs a constraint model. If a schedule increases FLOPs but doubles network traffic, it may lose. If a schedule keeps weights resident and only moves compact activation blocks, it may win. If tensor parallelism creates too many tiny reductions during decode, pipeline parallelism may be better. If prefill has a large enough chunk of work to hide communication, tensor or hybrid parallelism may become attractive.

How CKE Tries To Do This

The important architectural difference is that CKE should not treat distributed execution as a vague serving feature. It should compile and report the distributed plan. In a CKE-style runtime, a model graph can be lowered into explicit rank ownership: which node owns which layers, which node owns which tensor shard, which arena offset stores the weights, where KV cache lives, which activation boundary crosses the wire, and which synchronization step is required before the next kernel runs. This is why the CKE concepts page matters: kernels are not just math names; they are contracts about layout, memory movement, and execution behavior.
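A reportable plan of this kind can be pictured as a plain data structure. The sketch below is hypothetical: `RankPlan`, its fields, and `build_pipeline_plan` are names invented for illustration and do not describe CKE's actual artifacts. It assumes an even layer split, one weight arena per rank, and identical layer sizes.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class RankPlan:
    rank: int
    layers: Tuple[int, ...]          # layers owned by this rank (1-based)
    weight_offsets: Tuple[int, ...]  # byte offset of each layer in the rank's arena
    arena_bytes: int                 # total resident weight bytes on this rank
    sends_activation_to: Optional[int]  # next rank, or None for the last stage


def build_pipeline_plan(num_layers: int, num_ranks: int, layer_bytes: int) -> List[RankPlan]:
    """Build a contiguous layer-block plan with explicit arena offsets."""
    if num_ranks <= 0 or num_layers % num_ranks != 0:
        raise ValueError("this sketch assumes layers divide evenly across ranks")
    per_rank = num_layers // num_ranks
    plans = []
    for rank in range(num_ranks):
        first = rank * per_rank + 1
        layers = tuple(range(first, first + per_rank))
        offsets = tuple(i * layer_bytes for i in range(per_rank))
        next_rank = rank + 1 if rank + 1 < num_ranks else None
        plans.append(RankPlan(rank, layers, offsets, per_rank * layer_bytes, next_rank))
    return plans


# 12 layers on 3 ranks, d_model = 4096 in bf16 (16 * d^2 * 2 bytes per layer).
for plan in build_pipeline_plan(12, 3, 16 * 4096 ** 2 * 2):
    print(plan)
```

Because the plan is ordinary data, it can be printed, diffed between runs, and checked against what each rank actually loaded.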

The simplified planner objective is:

\[ \min \left( T_{\mathrm{compute}} + T_{\mathrm{memory}} + T_{\mathrm{network}} + T_{\mathrm{sync}} \right) \quad \mathrm{while\ preserving\ parity} \]

The distributed schedule is only useful if it lowers the real token path while preserving numerical correctness.
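Read literally, the objective is a constrained minimum over candidate schedules. The sketch below shows that selection step only. The class, the field names, and the timing values are hypothetical, and a real planner would fill them from measurements rather than constants.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class ScheduleEstimate:
    name: str
    compute_s: float
    memory_s: float
    network_s: float
    sync_s: float
    parity_ok: bool  # did this schedule pass its numerical parity check?

    @property
    def total_s(self) -> float:
        return self.compute_s + self.memory_s + self.network_s + self.sync_s


def choose_schedule(candidates: Iterable[ScheduleEstimate]) -> ScheduleEstimate:
    """Pick the lowest estimated token path among schedules that preserve parity."""
    valid = [c for c in candidates if c.parity_ok]
    if not valid:
        raise ValueError("no candidate schedule preserves parity")
    return min(valid, key=lambda c: c.total_s)


# Hypothetical per-token estimates in seconds.
candidates = [
    ScheduleEstimate("pipeline", 0.020, 0.015, 0.002, 0.001, parity_ok=True),
    ScheduleEstimate("tensor", 0.012, 0.010, 0.012, 0.008, parity_ok=True),
    ScheduleEstimate("hybrid-unverified", 0.010, 0.008, 0.004, 0.002, parity_ok=False),
]
best = choose_schedule(candidates)
print(best.name, f"{best.total_s * 1e3:.1f} ms")
```

Here the fastest candidate on paper is rejected because it has not passed parity, and the tensor schedule loses despite lower compute time because its network and synchronization terms are larger.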

For pipeline parallelism, the planner tries to keep weights resident and move boundary activations. For tensor parallelism, the planner tries to shard large projections or expert groups and then pay the gather/reduce cost only when that cost is smaller than the compute and memory-bandwidth gain. For prefill, larger token blocks can make tensor or hybrid parallelism easier to amortize. For decode, the per-token path is smaller, so pipeline fill, KV locality, and concurrent requests often matter more. For training, the same idea becomes harder because gradients, optimizer state, activation checkpointing, and collective communication enter the path.

The reason this is not hand-waving is that every term can be measured. Activation boundary size is \(T \cdot d_{\mathrm{model}} \cdot \mathrm{bytes(dtype)}\). A dense transformer layer is roughly \(16d^2\) parameters before model-specific variations such as GQA, MoE, SSM, DeltaNet-style recurrence, quantization, and fused kernels. Network time can be estimated as payload bytes divided by effective link bandwidth, then corrected by measured latency, software overhead, and synchronization. Memory time can be estimated from active bytes divided by measured DRAM/cache bandwidth, then corrected by misses, TLB pressure, NUMA distance, and prefetch quality. CKE's job is to turn those estimates into generated C, runtime artifacts, and measurements.
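The memory term can be estimated the same way as the other terms. The sketch below divides active bytes by an assumed streaming bandwidth, with a single efficiency factor standing in for the corrections named above (misses, TLB pressure, NUMA distance, prefetch quality). The 60 GB/s figure and the efficiency values are placeholders, not measurements.

```python
def memory_stream_seconds(active_bytes: float,
                          measured_bytes_per_s: float,
                          efficiency: float = 1.0) -> float:
    """Estimated time to stream active_bytes at an achieved fraction of bandwidth."""
    if not 0.0 < efficiency <= 1.0:
        raise ValueError("efficiency must be in (0, 1]")
    return active_bytes / (measured_bytes_per_s * efficiency)


D_MODEL = 4096
LAYER_PARAMS = 16 * D_MODEL ** 2  # dense layer estimate from the text
ASSUMED_DRAM_BYTES_PER_S = 60e9   # placeholder for a measured value

bf16_time = memory_stream_seconds(LAYER_PARAMS * 2, ASSUMED_DRAM_BYTES_PER_S)
q4_time = memory_stream_seconds(LAYER_PARAMS * 0.5, ASSUMED_DRAM_BYTES_PER_S)

print(f"bf16 layer stream: {bf16_time * 1e3:.2f} ms per decode step")
print(f"q4 layer stream:   {q4_time * 1e3:.2f} ms per decode step")
```

During decode every resident layer's weights are streamed once per token, so this term scales with both layer count per node and bytes per weight.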

The larger formula is not GPU-specific and not CPU-specific:

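\[ T_{\mathrm{cluster}} \approx \sum_{\mathrm{nodes}} \left( T_{\mathrm{layers\ owned}} + T_{\mathrm{memory\ stream}} + T_{\mathrm{network\ boundary}} + T_{\mathrm{sync}} \right) \]

The sum describes a request that visits the nodes in sequence, as in a pipeline. Nodes that work concurrently on shards of the same operation contribute the slowest shard rather than the sum. GPU clusters also pay communication and synchronization costs. CKE's bet is that CPU nodes can be made useful by controlling the whole deterministic byte path.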

This is also where the economic thesis enters. GPU systems are technically excellent, but real scale is not only the price of a card. It includes VRAM capacity, power delivery, cooling, servers, racks, networking, and operations. The practical question is whether a small team can keep adding useful compute without first graduating into datacenter-class GPU power and cooling. CKE is exploring whether commodity CPU nodes can scale differently: add another replaceable node, add more cores, add more memory channels, add more DRAM, add another NIC, and let deterministic software decide how the byte path is split. The claim still has to be proven with measurements. But the formula itself does not say "GPU only."

That is why the math matters. If a node can own a layer block, keep its local weights and KV state resident, stream the next tile before the execution units starve, and send only the required activation boundary to the next rank, then the cluster becomes a throughput experiment instead of a single-box benchmark. Adding nodes is not free. Network hops, synchronization, rank imbalance, NUMA placement, and memory bandwidth can erase the gain. But those are measurable engineering constraints, not religious arguments about whether AI must run on one hardware class.

Strategy	Best fit	Main risk	CKE evidence needed
Pipeline parallelism	layer-block ownership	slowest stage limits throughput	stage timing, bubble report, activation hop cost
Tensor parallelism	large projections, experts, head groups	gather/reduce synchronization	shard timing, reduction timing, bandwidth use
Hybrid schedule	large models across many nodes	planner complexity	topology report, rank ownership, parity checks

Figure: C-Kernel-Engine layer weight, activation boundary, cache tiling, DDR5 streaming, and network transfer math for distributed CPU AI. The layer math explains why CKE cares about tiling, prefetch, memory layout, and network payload size. The bottleneck is the measured active-byte path.

Why This Fits CKE

CKE already wants model execution to be explicit: generated C, planned memory, sectioned arenas, kernel purity, visible artifacts, and Linux-level control. The project documentation describes this as a CPU-native AI runtime and kernel compiler for auditable inference, training kernels, and distributed CPU execution. Distributed parallelism is an extension of that same philosophy. Instead of hiding the cluster behind a vague serving layer, CKE should expose rank ownership, layer ownership, tensor shard ownership, arena offsets, transfer sizes, and timing reports.

That is how the project can avoid becoming just another benchmark claim. The output should not only be tokens per second. The output should include evidence: which node owned which layer, what moved across the network, how much time each hop took, where the pipeline stalled, and where numerical parity was checked. The throughput-unit view adds one more important artifact: how many active bytes were cycled per token, and where those bytes were spent.

Bottom Line

Pipeline parallelism is the first simple schedule. Tensor parallelism is the harder but more flexible schedule. Prefill and decode may need different schedules. Training adds gradients, optimizer state, and synchronization pressure. The next phase of CKE is learning how to make these choices explicit enough that cheap CPU nodes can be added, replaced, and measured like real runtime units.

Related C-Kernel-Engine References

C-Kernel-Engine project documentation - overview of the CPU-native runtime and kernel compiler.
C-Kernel-Engine GitHub repository - source code and active development.
CKE scaling hypothesis - the larger commodity CPU and distributed systems thesis.
CKE throughput unit - active bytes per token and aggregate bytes cycled per second.
CKE concepts - kernel concepts, layouts, and model execution ideas.
Model and kernel matrix - current model-family and kernel bring-up surface.
