Abstract / Executive Summary
C Kernel Engine is an LLM inference and training engine optimized for CPUs.
It takes model weights, model configuration, and computational graph templates, then stitches them into deterministic, hardware-targeted C code. The generated C is effectively a model-specific execution file: pointer wiring, kernel stitching logic, memory offsets, buffer planning, and calls into the computational kernels supplied by the C Kernel Engine kernel ecosystem. That generated code is compiled on the machine it is meant to run on, producing a model-specific shared object that can execute independently of the parent C Kernel Engine pipeline.
The goal is not to hide the runtime behind another framework. The goal is to make the runtime explicit: kernels, tensor shapes, memory offsets, saved activations, backward paths, SIMD choices, generated C, and parity reports.
In short: C Kernel Engine turns transformer structure into compiled CPU-native artifacts that can run inference and training workloads with inspectable memory, math, and execution paths.
Project references: documentation, GitHub repository, and scaling thesis.
C Kernel Engine is a CPU-native compiler pipeline for LLM inference and training. It combines model weights with computational graph templates, then runs them through intermediate representations, memory planning, and code generation to produce hardware-aware shared objects for CPUs.

Introduction
Most people talk about AI from the model side.
They talk about prompts, benchmarks, agents, datasets, GPUs, frontier labs, and which model is winning this week. That is useful, but it skips the part I care about most: what actually has to run on the machine.
At some point a transformer stops being a magical object called "AI" and becomes a sequence of operations:

- load tokens
- look up embeddings
- normalize activations
- project Q, K, and V
- apply RoPE
- compute attention
- run softmax
- multiply by V
- run the MLP
- add residuals
- produce logits
- compute loss
- run the backward pass
- update weights
That is the level where the C Kernel Engine lives.
The C Kernel Engine repository is my attempt to build a CPU-native AI runtime and training engine from the kernel level upward. The actual kernel implementations live in src/kernels, which makes the project less abstract: readers can see the small set of math kernels the engine is currently stitching together. It is written in C, tested against PyTorch, designed around explicit memory layouts, and organized so the forward path, backward path, generated C, tensor shapes, and runtime buffers are inspectable.
The important distinction is that C Kernel Engine is not only a runtime you call into forever. It behaves more like a compiler pipeline for models. The Python-side interface can parse configuration, build IR, lower the graph, plan memory, and generate optimized C for the specific model and hardware target. That generated C can then be compiled into a shared object and executed independently of the parent C Kernel Engine project.
That is different from relying on a framework at runtime. In PyTorch, llama.cpp, and most engines, the framework or engine remains the thing that interprets or orchestrates execution. In C Kernel Engine, the goal is to generate deterministic C code and link the mathematical kernels so the resulting compiled artifact can run as its own model-specific runtime.
This is not a random teaching aid. It is infrastructure work.
The simple version is this:
C Kernel Engine turns transformer models into explicit C kernels, deterministic memory layouts, generated runtime code, shared object artifacts, and parity-tested forward/backward execution paths.
The longer version is what this post is about.
1. The Core Bet
The core bet behind C Kernel Engine is that AI systems should not only be understood through Python frameworks and GPU kernels.
PyTorch is excellent. CUDA is excellent. llama.cpp is excellent. oneDNN, BLIS, OpenBLAS, and the rest of the performance ecosystem all matter. I am not pretending those systems do not exist.
But if the only way I can understand a model is by calling a high-level framework and trusting the runtime, then I do not really understand the model as a system.
I want to see the model as:
- kernels
- shapes
- strides
- buffers
- cache behavior
- SIMD lanes
- quantized blocks
- forward graphs
- backward graphs
- saved activations
- layout maps
- generated C
- parity reports
That is the point of C Kernel Engine.
The project is built around a CPU-first question. I expand this more directly in the C Kernel Engine scaling thesis:
If we had to run, train, debug, and scale transformer-style models on CPUs and embedded systems, what would the runtime need to look like?
That question leads to a very different architecture than a normal "use PyTorch and call it done" workflow.
2. What C Kernel Engine Is
C Kernel Engine is a C-first runtime and kernel system for transformer-style AI workloads.
At the lowest level, it contains C kernels for operations that show up again and again inside modern neural networks:
- GEMM and GEMV
- Q/K/V projection
- attention score computation
- softmax
- RoPE
- RMSNorm and LayerNorm
- residual add
- SwiGLU and GELU
- logits
- cross-entropy
- quantized dot products
- forward and backward variants of core ops
But the engine is not only a folder of kernels.
The more important idea is that a model is turned into an explicit runtime plan:
- Parse model configuration and weights.
- Build an intermediate representation of the model.
- Lower that representation into concrete runtime modes.
- Plan memory offsets and buffer sizes.
- Generate C code.
- Compile a runtime library.
- Run inference or training.
- Compare against reference outputs.
That is why I keep coming back to the phrase "kernel engine." The kernel is the mathematical primitive. The engine is the system that decides how those kernels are wired, laid out, compiled, tested, and executed.
Compiler Pipeline
- Model Weights: Parameters, tensors, quantized blocks
- Graph Templates: Architecture rules and operation contracts
- IR: Graph capture, lowering, and shape propagation
- Memory Plan: Offsets, buffers, alignment, and lifetimes
- Generated C: Pointer wiring and kernel stitching logic
- Kernel Ecosystem: Linked computational kernels from CKE
- .so Artifact: Compiled model-specific shared object
- Target CPU: AVX-512, AMX, NEON, FMA execution paths
3. What It Is Not
C Kernel Engine is not trying to be a nicer Python API.
It is not trying to hide the runtime behind more abstraction. It is almost the opposite. The point is to expose the runtime so I can reason about it.
It is also not simply a llama.cpp clone. llama.cpp is an important reference point for CPU inference and GGUF model execution, and C Kernel Engine can use parity comparisons against systems like that. But the architectural goal is different. I want a system that makes forward execution, backpropagation, memory planning, operator stitching, and generated C visible as first-class artifacts.
It is also not "just inference." Inference matters, but training is the real test of whether the system understands the graph. A forward-only runtime can get away with many shortcuts. A training runtime has to know what needs to be saved, what needs to be recomputed, how gradients flow, and how each operation reverses through the chain rule.
That is where the project becomes more serious.
4. Why Write It in C?
I wrote more broadly about this in Why I’m Doubling Down on C in 2026, but the reason C matters for C Kernel Engine is specific.
C is closer to the physics of compute. It makes the cost of memory, layout, pointer movement, cache behavior, and instruction selection harder to ignore. If I want to squeeze cycles out of CPU execution, I need to care about what the machine is actually doing.
There is also a practical ecosystem reason. CPU and silicon vendors expose their most important optimization paths through C and C++: compiler intrinsics, vector extensions, optimized math libraries, BLAS interfaces, OpenMP, platform compilers, and low-level profiling tools. Fortran still matters in parts of HPC, but C is the common systems language that sits close to this hardware surface.
I personally like C because it removes a lot of object-oriented bloat. I do not want a runtime hidden behind layers of classes when the real problem is shapes, buffers, loops, and math. But that preference is not the core reason.
The core reason is that C Kernel Engine generates C.
The user is not supposed to hand-write pointer arithmetic for every model. The engine should know the model structure, tensor shapes, offsets, strides, kernel contracts, and memory plan. Those details are deterministic. Once the graph is lowered, the engine can emit model-specific C code, compile it on the machine that is supposed to run it, and produce a shared object that can execute without the parent engine constantly interpreting the graph.
That is the compiler analogy. A normal compiler deterministically turns source code into assembly and binary artifacts. C Kernel Engine tries to do something similar for transformer execution: turn model structure, weights, and IR into compiled C artifacts that run the actual numerical workload.
| Layer | What C Kernel Engine Controls |
|---|---|
| Model | Weights, configuration, graph templates, tensor contracts, and architecture-specific runtime rules. |
| IR | Graph capture, lowering, shape propagation, execution order, and forward/backward stitching decisions. |
| Memory | Offsets, buffers, alignment, activation lifetimes, KV cache regions, gradient regions, and temporary workspace. |
| Kernels | GEMM, GEMV, attention, softmax, RMSNorm, RoPE, MLP, loss, backward kernels, and quantized dot products. |
| CPU | AVX2, AVX-512, AMX, NEON, FMA, vector lanes, cache locality, threading, and NUMA-aware execution paths. |
| Artifact | Generated C compiled into a model-specific shared object that can run independently on the target hardware. |
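To make the compiler analogy concrete, here is a minimal sketch of what a generated, model-specific step could look like. Everything named in it (the offsets, `ck_residual_add`, `model_step`) is hypothetical and chosen for illustration; it is not the engine's actual generated output.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical offsets (in floats) that a memory planner could emit
 * for a tiny model with hidden size 4. Not real engine output. */
enum {
    HIDDEN     = 4,
    OFF_X      = 0,   /* block input activations  */
    OFF_BRANCH = 4,   /* output of some sub-layer */
    OFF_OUT    = 8,   /* residual result          */
    ARENA_LEN  = 12
};

/* Stand-in for a kernel supplied by a kernel library. */
static void ck_residual_add(const float *a, const float *b,
                            float *out, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        out[i] = a[i] + b[i];
}

/* "Generated" step: no graph interpretation at runtime, only fixed
 * pointer wiring into one arena followed by a direct kernel call. */
void model_step(float *arena, size_t arena_len)
{
    assert(arena != NULL && arena_len >= (size_t)ARENA_LEN);
    const float *x      = arena + OFF_X;
    const float *branch = arena + OFF_BRANCH;
    float       *out    = arena + OFF_OUT;
    ck_residual_add(x, branch, out, HIDDEN);
}
```

The point of the sketch is the shape of the code: every offset is a compile-time constant, so the compiled artifact needs no graph, no parser, and no allocator to run the step.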
5. Why Kernels Matter
A transformer looks complicated at the paper level, but at runtime it is built from a smaller set of repeating computational shapes.
For a transformer running at batch size one, a lot of the work reduces to:
- [T, D] x [D, D]
- [T, D] x [D, 3D]
- [T, D] x [D, 4D]
- [T, d] x [d, T]
- [T, T] x [T, d]
- transposed variants for backprop
Where:
- T is the token/context length.
- D is the embedding dimension.
- H is the number of heads.
- d = D / H is the head dimension.
That means the runtime is not an infinite zoo of magical operations. It is a set of repeated matrix operations, reductions, elementwise functions, and memory movement patterns.
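All of the shapes above are instances of one row-major matrix multiply. A minimal reference sketch, assuming contiguous row-major `float` buffers, is below. The name `gemm_ref` is illustrative, and a real kernel would add cache blocking, SIMD, and threading on top of this.

```c
#include <assert.h>
#include <stddef.h>

/* C[T, N] = A[T, D] x B[D, N], all row-major and contiguous.
 * Loop order (t, k, n) keeps the inner loop walking B and C with
 * unit stride. This is a correctness reference, not a fast kernel. */
void gemm_ref(const float *A, const float *B, float *C,
              size_t T, size_t D, size_t N)
{
    assert(A && B && C);
    for (size_t t = 0; t < T; ++t) {
        for (size_t n = 0; n < N; ++n)
            C[t * N + n] = 0.0f;
        for (size_t k = 0; k < D; ++k) {
            const float a = A[t * D + k];
            for (size_t n = 0; n < N; ++n)
                C[t * N + n] += a * B[k * N + n];
        }
    }
}
```

With `N = 3D` this is the fused QKV projection, with `N = 4D` it is the MLP up projection, and with `D = d`, `N = T` it is the per-head score computation against a transposed K.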
The CPU does not see "attention" as a concept.
The CPU sees:
- loads
- stores
- loops
- strides
- multiply-adds
- vector registers
- cache lines
- branches
- exponentials
- reductions
This is why C Kernel Engine focuses so heavily on GEMM/GEMV, memory layout, SIMD, and explicit operator contracts. If the shapes are wrong, the model is wrong. If the layout is wrong, the kernel is slow. If the saved activations are wrong, backprop is wrong. If the parity tests are weak, the runtime can silently drift.
6. The Forward Path
In the forward path, C Kernel Engine sees a transformer block as a sequence of kernel calls.
A simplified decoder block looks like:
x
-> RMSNorm
-> QKV projection
-> RoPE on Q and K
-> attention scores QK^T
-> softmax
-> attention weights x V
-> attention output projection
-> residual add
-> RMSNorm
-> MLP gate/up/down
-> residual add

In PyTorch, a lot of this is hidden behind tensor operations and autograd. In C Kernel Engine, each step has to be explicit.
That means the engine has to know:
- what input buffer each kernel reads
- what output buffer each kernel writes
- what weight tensor it uses
- what shape the tensor has
- what stride/layout is expected
- whether the op belongs to prefill, decode, backward, or training mode
- what intermediate values must be saved
This explicitness is painful, but it is also the point.
Once the graph is explicit, the system can generate reports, diagrams, runtime code, and parity checks. It becomes easier to ask, "Where exactly did this model diverge from the reference?"
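The per-call facts listed above can be written down as a plain data structure. The sketch below is one hypothetical way to do that; the `kernel_call` struct, its field names, and `kernel_call_output_fits` are assumptions made for illustration, not the engine's real types.

```c
#include <assert.h>
#include <stddef.h>

typedef enum {
    MODE_PREFILL, MODE_DECODE, MODE_TRAIN_FWD, MODE_TRAIN_BWD
} run_mode;

/* Hypothetical descriptor for one stitched kernel call. */
typedef struct {
    const char *op;         /* e.g. "rmsnorm", "qkv_proj"              */
    size_t in_off;          /* input offset in the arena, in floats    */
    size_t out_off;         /* output offset in the arena, in floats   */
    size_t weight_off;      /* weight offset in the weight region      */
    size_t rows, cols;      /* logical shape of the output             */
    size_t row_stride;      /* floats between consecutive output rows  */
    run_mode mode;
    int save_for_backward;  /* nonzero if output must outlive forward  */
} kernel_call;

/* Returns 1 if the output region described by the call fits inside an
 * arena of arena_len floats, 0 otherwise. */
int kernel_call_output_fits(const kernel_call *c, size_t arena_len)
{
    assert(c != NULL);
    if (c->rows == 0 || c->cols == 0)
        return 0;
    if (c->row_stride < c->cols)   /* rows would overlap each other */
        return 0;
    size_t end = c->out_off + (c->rows - 1) * c->row_stride + c->cols;
    return end <= arena_len;
}
```

Once calls are data like this, checks such as the bounds test can run before any kernel executes, and the same records can feed reports and diagrams.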
7. The Backward Path
Backprop is where many AI explanations get hand-wavy.
The usual answer is: "PyTorch handles it."
That is true if you are using PyTorch. It is not true if you are building your own runtime in C.
In C Kernel Engine, every forward operation needs a backward interpretation. Some are straightforward:
- forward add becomes gradient fan-out: each input receives the incoming gradient unchanged
- forward split becomes gradient accumulation
- forward matmul becomes gradients with respect to input and weights
- forward residual add routes the incoming gradient to both branches
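As an example of the matmul rule, here is a minimal sketch of the backward pass for `Y = X x W`, assuming contiguous row-major buffers. The function name is illustrative. The weight gradient is accumulated rather than overwritten, which is a design choice made here so that several micro-batches can sum into the same buffer.

```c
#include <assert.h>
#include <stddef.h>

/* Backward of Y[T, N] = X[T, D] x W[D, N] (row-major, contiguous).
 *   dX[T, D]  = dY x W^T     (overwritten)
 *   dW[D, N] += X^T x dY     (accumulated; caller zeroes it first)
 * X is the saved forward input, which is why it has to stay alive
 * until the backward pass reaches this op. */
void matmul_backward(const float *X, const float *W, const float *dY,
                     float *dX, float *dW,
                     size_t T, size_t D, size_t N)
{
    assert(X && W && dY && dX && dW);
    for (size_t t = 0; t < T; ++t) {
        for (size_t k = 0; k < D; ++k) {
            float acc = 0.0f;
            for (size_t n = 0; n < N; ++n) {
                acc += dY[t * N + n] * W[k * N + n];
                dW[k * N + n] += X[t * D + k] * dY[t * N + n];
            }
            dX[t * D + k] = acc;
        }
    }
}
```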
Some are more delicate:
- softmax backward
- RMSNorm backward
- attention backward
- RoPE backward
- quantized path gradients
- saved activation handling
- cross-entropy gradient reduction
The key idea is simple:
Backprop is incoming gradient times local gradient, repeated through the graph.
The engineering problem is not the slogan. The engineering problem is making the runtime know where every parent gradient goes, which forward values must be available, which buffers are reused, and which backward kernel should be called.
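Softmax is a small case that shows both halves of that problem: the local gradient and the saved forward value it depends on. A minimal sketch for one row, with an illustrative function name, is:

```c
#include <assert.h>
#include <stddef.h>

/* Backward of y = softmax(x) for one row of length n.
 * Needs only the saved forward output y and the incoming gradient dy:
 *   dx[i] = y[i] * (dy[i] - sum_j y[j] * dy[j]) */
void softmax_backward_row(const float *y, const float *dy,
                          float *dx, size_t n)
{
    assert(y && dy && dx);
    float dot = 0.0f;
    for (size_t j = 0; j < n; ++j)
        dot += y[j] * dy[j];
    for (size_t i = 0; i < n; ++i)
        dx[i] = y[i] * (dy[i] - dot);
}
```

Note that the kernel never touches the original input `x`. For softmax, the runtime has to keep the forward output alive, not the forward input, and that is exactly the kind of decision the memory plan has to encode.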
That is why the v7 work matters. v7 is not just "training support" as a checkbox. It is the foundation for explicit backward kernels and stitching rules.
If the forward graph says:
y = attention(x)

The backward graph has to know how to propagate:

dL/dy -> dL/dattention -> dQ, dK, dV, dWqkv, dx

That is a very different engineering problem from calling .backward().
8. IR and Code Generation
C Kernel Engine uses intermediate representations because hand-writing every model variant is not scalable.
The rough pipeline is:
config + weights + template
-> graph IR
-> lowered IR
-> layout plan
-> generated C
-> compiled runtime

This matters because modern models are similar, but not identical.
GPT-style, Llama-style, Qwen-style, Gemma-style, and newer architectures all repeat the same broad transformer ideas, but the details change:
- normalization type
- RoPE details
- grouped-query attention
- MLP shape
- gate/up/down projection layout
- tokenizer behavior
- quantized weight format
- context length
- vocabulary size
- cache layout
The IR is the place where those choices become explicit.
The generated C is the place where those choices become executable.
This is why I do not want the project to become a pile of one-off C files. The whole point is to make model structure, lowering, memory planning, and generated execution part of the same pipeline.
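The last step of that pipeline, turning lowered ops into C text, can be sketched in a few lines. This is a toy under stated assumptions: the `lowered_op` struct, the `emit_c` function, and the call signature it prints are all hypothetical, and a real generator would also emit includes, function wrappers, and weight pointers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical lowered op: a kernel symbol plus planned arena offsets. */
typedef struct {
    const char *kernel;   /* symbol to call, e.g. "ck_add" (made up) */
    size_t in_off;
    size_t out_off;
    size_t n;
} lowered_op;

/* Emits one C statement per op into buf. Returns the number of
 * characters written (excluding the terminator), or -1 if buf is too
 * small. The same op list always produces the same text. */
int emit_c(const lowered_op *ops, size_t count, char *buf, size_t cap)
{
    assert(ops && buf && cap > 0);
    size_t used = 0;
    buf[0] = '\0';
    for (size_t i = 0; i < count; ++i) {
        int w = snprintf(buf + used, cap - used,
                         "%s(arena + %zu, arena + %zu, %zu);\n",
                         ops[i].kernel, ops[i].in_off,
                         ops[i].out_off, ops[i].n);
        if (w < 0 || (size_t)w >= cap - used)
            return -1;   /* output would have been truncated */
        used += (size_t)w;
    }
    return (int)used;
}
```

Because the output depends only on the op list, two runs over the same lowered IR give byte-identical source, which is what makes the generated file diffable and reviewable.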
9. Memory Planning Is Not Optional
Memory layout is one of the most important parts of the project.
An AI model is not just math. It is math moving through memory.
For inference, the runtime needs:
- weights
- activations
- temporary buffers
- KV cache
- logits
- tokenizer state
For training, it also needs:
- saved activations
- gradients
- optimizer state
- loss buffers
- backward intermediates
If memory is handled casually, everything becomes fragile. You get accidental allocations, unclear lifetimes, cache-hostile layouts, pointer bugs, and silent corruption.
C Kernel Engine pushes toward deterministic memory planning:
- compute offsets
- enforce alignment
- track buffer sizes
- separate weights, activations, cache, gradients, and temporary regions
- use canaries and checks where needed
- produce inspectable layout artifacts
This is not cosmetic. This is what lets the runtime become debuggable.
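A minimal sketch of the first three steps (compute offsets, enforce alignment, track sizes) is a bump-style planner. The 64-byte alignment, the `region` struct, and the function names are assumptions for illustration. A real planner would also reuse space between buffers whose lifetimes do not overlap.

```c
#include <assert.h>
#include <stddef.h>

#define CK_ALIGN 64u   /* assumed: cache-line sized, AVX-512 friendly */

/* Rounds n up to the next multiple of align (a power of two). */
static size_t align_up(size_t n, size_t align)
{
    return (n + align - 1) & ~(align - 1);
}

typedef struct {
    const char *name;
    size_t bytes;    /* requested size            */
    size_t offset;   /* filled in by the planner  */
} region;

/* Assigns each region an aligned byte offset in declaration order and
 * returns the total arena size required. */
size_t plan_layout(region *regions, size_t count)
{
    assert(regions != NULL || count == 0);
    size_t cursor = 0;
    for (size_t i = 0; i < count; ++i) {
        cursor = align_up(cursor, CK_ALIGN);
        regions[i].offset = cursor;
        cursor += regions[i].bytes;
    }
    return align_up(cursor, CK_ALIGN);
}
```

The resulting table of names, sizes, and offsets is itself the inspectable layout artifact: it can be printed, diffed between runs, and baked into generated code as constants.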
When something goes wrong, I want to be able to ask:
- which buffer was read?
- which buffer was written?
- what was the expected shape?
- what was the byte offset?
- was the memory aligned?
- did the canary fail?
- did this layer diverge from PyTorch or llama.cpp?
That is the kind of visibility high-level frameworks often hide.
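The canary question in that list can be answered with very little code. The sketch below assumes the planner reserved a few guard bytes after each buffer. The pattern, its length, and the function names are arbitrary choices for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define CANARY_BYTES 8
static const uint8_t CANARY[CANARY_BYTES] =
    { 0xDE, 0xAD, 0xBE, 0xEF, 0xCA, 0xFE, 0xF0, 0x0D };

/* Writes the guard pattern directly after a buffer of `bytes` bytes.
 * The caller must have reserved CANARY_BYTES extra bytes for it. */
void canary_arm(uint8_t *buf, size_t bytes)
{
    assert(buf != NULL);
    memcpy(buf + bytes, CANARY, CANARY_BYTES);
}

/* Returns 1 if the guard pattern is intact, 0 if something wrote past
 * the end of the buffer. */
int canary_ok(const uint8_t *buf, size_t bytes)
{
    assert(buf != NULL);
    return memcmp(buf + bytes, CANARY, CANARY_BYTES) == 0;
}
```

Checking the canaries after each kernel call in a debug build narrows an overrun down to the single op that caused it.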
10. Why CPU-First?
The obvious objection is: why focus on CPUs when frontier AI training is dominated by GPUs?
The honest answer is that GPUs are currently the center of large-scale AI training. Ignoring that would be unserious.
But CPU-first does not mean pretending GPUs do not matter. CPU-first means asking a different set of questions:
- How far can CPU inference and training be pushed with explicit kernels?
- What happens when memory layout is planned from first principles?
- How much can SIMD, cache locality, thread pools, NUMA awareness, and quantization recover?
- Can smaller and domain-specific models be trained or fine-tuned efficiently on CPU clusters?
- Can embedded systems use a smaller version of the same runtime ideas?
- Can a runtime be easier to inspect because it is written close to the hardware?
There is also a strategic point: CPUs are everywhere.
They are in servers, laptops, edge devices, robotics systems, embedded boards, and old machines people already own. If AI only belongs to the most expensive GPU clusters, the stack becomes centralized. If more of the runtime can be understood and executed on CPUs, there is more room for independent engineering.
That is the bet.
It does not mean the work is easy. It means the work is worth doing.
11. Parity Is the Discipline
The dangerous thing about writing numerical code is that it can look correct while being wrong.
A transformer can run and still drift.
A kernel can compile and still produce subtly incorrect results.
A quantized path can look close on one token and diverge later.
A backward pass can produce gradients with the right shape and the wrong values.
That is why parity testing is central to C Kernel Engine.
The project compares C kernels and generated runtime behavior against trusted references like PyTorch and, where relevant, llama.cpp-style execution. The purpose is not to worship another framework. The purpose is to find the first point of divergence.
The questions are:
- Does RMSNorm match?
- Does RoPE match?
- Does QKV projection match?
- Does softmax match?
- Does attention match?
- Does the MLP match?
- Do logits match?
- Does loss match?
- Do gradients match?
This is the discipline that keeps the project from becoming vibes.
If the engine claims to run a model, it should be able to show where the numbers agree and where they do not.
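Finding the first point of divergence is a simple loop once both outputs are flat arrays. A minimal sketch follows, using the common `atol + rtol * |ref|` tolerance form. The function name and the choice of tolerances are assumptions, and sensible thresholds depend on the op and the data type.

```c
#include <assert.h>
#include <stddef.h>

/* Returns the index of the first element where |got - ref| exceeds
 * atol + rtol * |ref|, or n if every element is within tolerance.
 * NaN in either input counts as a divergence. */
size_t first_divergence(const float *got, const float *ref, size_t n,
                        float atol, float rtol)
{
    assert(got && ref);
    for (size_t i = 0; i < n; ++i) {
        float diff = got[i] - ref[i];
        if (diff < 0.0f)
            diff = -diff;
        float mag = ref[i] < 0.0f ? -ref[i] : ref[i];
        if (!(diff <= atol + rtol * mag))   /* also true when diff is NaN */
            return i;
    }
    return n;
}
```

Running this per op, in graph order, turns "the logits are wrong" into "layer 3 RoPE output diverges at element 17", which is a much easier bug to fix.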
12. Quantization and Model Formats
A practical AI runtime has to deal with model formats and quantized weights.
Full FP32 is useful for correctness and development, but it is not enough for practical inference. Quantization changes what the runtime stores and how kernels compute.
Instead of thinking only in floats, the runtime has to understand:
- packed integer blocks
- scale values
- block sizes
- dequantization paths
- quantized dot products
- Q4/Q5/Q8-style formats
- activation quantization tradeoffs
- decode vs prefill differences
This is why quantization is not merely "make the model smaller." It is a storage and execution contract.
If the packing is wrong, the model is wrong.
If the scale is wrong, the model is wrong.
If the kernel dispatch chooses the wrong path, the model is wrong.
That is why C Kernel Engine treats quantization as part of the runtime architecture, not as an afterthought.
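To show what "storage and execution contract" means in practice, here is a simplified 8-bit block format and its dot product. This is a sketch under stated assumptions: the `q8_block` layout, the 32-element block size, and the `float` scale are illustrative and do not reproduce any specific on-disk format (real formats often store the scale in half precision).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define QBLOCK 32   /* elements per block; an assumption for this sketch */

/* Simplified 8-bit block: one scale plus QBLOCK signed bytes.
 * The real value of element i is scale * q[i]. */
typedef struct {
    float  scale;
    int8_t q[QBLOCK];
} q8_block;

/* Dot product of two quantized vectors of nblocks blocks each.
 * Integer products are accumulated per block, then scaled once:
 *   sum over blocks of (scale_a * scale_b) * sum_i qa[i] * qb[i] */
float q8_dot(const q8_block *a, const q8_block *b, size_t nblocks)
{
    assert(a && b);
    float total = 0.0f;
    for (size_t blk = 0; blk < nblocks; ++blk) {
        int32_t acc = 0;
        for (size_t i = 0; i < QBLOCK; ++i)
            acc += (int32_t)a[blk].q[i] * (int32_t)b[blk].q[i];
        total += a[blk].scale * b[blk].scale * (float)acc;
    }
    return total;
}
```

The kernel is only correct if the writer and the reader agree on block size, element order, and scale meaning. Change any one of them on one side and the code still runs, but the numbers are wrong.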
13. Inference vs Training
Inference and training are different runtime modes.
For inference:
- prefill processes the prompt
- decode generates one token at a time
- KV cache matters
- latency and memory bandwidth matter
- quantized GEMV becomes important
For training:
- full forward matters
- saved activations matter
- backward kernels matter
- gradient accumulation matters
- optimizer state matters
- deterministic layout matters
This distinction matters because an optimization that makes inference faster may not apply to training. KV-cache decode is useful for inference, but training needs the full forward/backward path.
C Kernel Engine has to model these modes explicitly. Otherwise the runtime becomes a mess of conditional logic and hidden assumptions.
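The decode-side state can be sketched as a fixed-capacity cache that grows by one row per generated token. The `kv_cache` struct and `kv_append` function below are hypothetical and cover a single head in a single layer. Training has no equivalent structure, because it recomputes the full sequence on every step.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical single-head KV cache with fixed capacity. k and v each
 * point at max_tokens * head_dim floats owned by the caller. */
typedef struct {
    float *k;
    float *v;
    size_t head_dim;
    size_t max_tokens;
    size_t len;        /* tokens currently stored */
} kv_cache;

/* Decode step: append the new token's K and V rows. Prefill could call
 * this once per prompt token or copy a whole [T, d] block at once.
 * Returns 0 on success, -1 if the cache is full. */
int kv_append(kv_cache *c, const float *k_row, const float *v_row)
{
    assert(c && k_row && v_row);
    if (c->len >= c->max_tokens)
        return -1;
    memcpy(c->k + c->len * c->head_dim, k_row, c->head_dim * sizeof(float));
    memcpy(c->v + c->len * c->head_dim, v_row, c->head_dim * sizeof(float));
    c->len += 1;
    return 0;
}
```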
14. What Makes This Different From a Framework?
Most AI frameworks are designed to make the user productive.
C Kernel Engine is designed to make the runtime inspectable.
Those are different goals.
The framework user wants:
loss.backward()
optimizer.step()

The engine builder wants to know:
- which kernel produced this tensor?
- what shape was it?
- what buffer offset did it use?
- which backward kernel consumed it?
- what gradient path was stitched?
- which SIMD implementation ran?
- what was compared against the reference?
- where did the first drift occur?
That is why this project fits my way of thinking. I like systems where the structure is explicit. I like when the runtime tells me what it is doing. I like when the generated code can be inspected. I like when the math is connected to the hardware instead of hidden behind layers of magic.
15. The Long-Term Goal
The long-term goal is ambitious:
A proper distributed and embedded-system framework for CPU-first AI inference and training.
That does not mean the project is already done. It means that is the direction.
The pieces are:
- standalone C kernels
- model-aware IR
- deterministic codegen
- explicit memory planning
- CPU SIMD and threading
- quantized execution
- forward/backward parity
- training support
- inference support
- observability reports
- runbooks
- eventually distributed execution
That is why I care about this project.
It is not only about making a model run.
It is about building the machinery required to understand how a model runs.
16. Why I Am Writing About It Now
I see a lot of AI commentary that talks about models as products, agents, benchmarks, or social change.
That is fine. But the part I am trying to build lives lower in the stack.
I care about the level where attention becomes loops, where backprop becomes explicit kernel stitching, where a tensor shape becomes a byte count, where a model architecture becomes generated C, and where CPU instructions decide whether the system is fast or painfully slow.
This is also why I am using ShivasNotes, carousels, and videos to explain the project. Writing forces me to clarify the architecture. Teaching the kernels forces me to catch hand-wavy thinking. Drawing the shapes forces me to see the memory problem. Recording the explanation forces me to simplify without lying.
The content is not separate from the engineering.
The content is part of the hardening loop.
17. Current State and Roadmap
The honest state of C Kernel Engine is this: it is serious infrastructure work, but it is not yet a general-purpose production replacement for PyTorch, llama.cpp, or a mature accelerator runtime.
That distinction matters. I do not want to position it as a toy, because it is not a toy. It has real architecture, real kernels, real generated code, real parity work, and a real CPU-first thesis. But I also do not want to pretend it is already a finished framework. The work now is hardening the path from experiment to reliable engine.
What exists now
- CPU-native C kernels for core transformer operations.
- Forward and backward kernel thinking, especially in the v7 training path.
- Generated C runtime artifacts instead of only calling into a permanent parent framework.
- IR and lowering work that makes model structure inspectable before execution.
- Parity gates against reference behavior so correctness comes before speed.
- Documentation, runbooks, scaling notes, and visual explanations that make the system easier to audit.
What I am hardening next
- More complete kernel coverage for transformer blocks: norm, attention, softmax, MLP, residual paths, loss, and optimizer-related pieces.
- Backprop stitching rules so the engine can reliably reverse forward operations through explicit backward kernels.
- Memory planning: buffer lifetimes, offsets, alignment, saved activations, recomputation choices, and cache-aware layouts.
- Generated C reproducibility so the same model configuration produces understandable, repeatable runtime artifacts.
- Performance reports that connect kernel timing, memory traffic, SIMD usage, and roofline-style analysis.
- Model coverage across small practical targets before claiming anything broad.
What production-ready should mean
For this project, production-ready should not mean "it can replace PyTorch for everyone." That would be the wrong bar at this stage.
The first credible production-ready milestone is narrower: a reproducible CPU-native runtime for one or two supported model families, with documented commands, generated C artifacts, parity checks, memory reports, and performance numbers that can be reproduced by someone other than me.
After that, the bar moves higher: stable training experiments, more model families, better quantization, threading, distributed execution, and eventually CPU clusters where the engine can prove whether the larger scaling thesis holds.
The timeline
I think of the timeline in stages, not dates.
- Research-grade: the engine can run selected models, expose artifacts, and explain what happened.
- Technical preview: another developer can follow the runbook and reproduce the same result.
- Production narrow path: one supported workload is stable enough to trust repeatedly.
- Framework path: multiple model families, training and inference paths, memory planning, quantization, threading, and distributed execution become part of one coherent system.
That is where the project is heading. The bet is still the same: make inference and training viable on CPUs by making the runtime explicit, inspectable, generated, and optimized around the hardware that actually executes the math.
References
Start here if you want to inspect the project rather than only read the explanation.
- Docs: C Kernel Engine documentation
- Repo: C Kernel Engine GitHub repository
- Kernels: src/kernels: current C kernel implementations
- Scaling: The C Kernel Engine scaling thesis
- v7: backprop, training gates, and explicit gradient plumbing
- v8: vision encoder, inference lane, and multimodal direction
- History: C Kernel Engine version history and roadmap trail
- C: Why I am doubling down on C in 2026
Conclusion
C Kernel Engine is my attempt to make AI runtime internals visible, explicit, controllable, and useful on CPUs.
The real goal is simple to state and hard to execute: make LLM inference and training viable on CPU hardware.
That does not mean pretending GPUs and external accelerators do not matter. They clearly do. But C Kernel Engine is not primarily designed around outsourcing the hard work to an external accelerator. It is designed around the CPU as the execution target, including the accelerators already baked into modern and future CPU silicon: AVX2, AVX-512, AMX, NEON, FMA, vector units, cache hierarchies, threading, NUMA, and whatever future CPU architectures expose next.
It starts with C kernels, but it does not end there.
It connects:
- transformer math
- CPU execution
- SIMD and matrix extensions
- memory layout
- code generation
- quantization
- inference
- backprop
- parity testing
- training
- scaling
This is not a toy project to demonstrate a few kernels. It is an infrastructure project I want to keep building, testing, hardening, and pushing toward real utility.
The reason I keep building it is simple: I do not want AI to remain a black box wrapped in Python calls.
I want to understand the machine, and I want the machine to run the model through explicit code I can inspect.
The only way I know how to really understand it is to build the kernels, wire the graph, plan the memory, run the tests, compare the numbers, use the CPU instruction paths available on the hardware, and keep hardening the system until the abstraction stops being magic.
That is what the C Kernel Engine is.