Last week, Claude Opus 4.7 "got released." And the Anthropic team is doing what Americans are geniuses at: marketing, marketing, marketing.
The whole Claude, OpenAI, Gemini saga is the same playbook: bigger, better, more compute, scaling laws. "You need us to win."
My hunch is different. As these AI models get better using massive weights, there will also be research into how to get the same — or even better — performance using fewer weights, less compute, and a more diverse set of products.
The Contradiction at the Heart of Scaling
Frontier models are currently great at reading large codebases, reasoning through them, and solving problems across C, C++, PHP, Java, JavaScript, React — you name it. Their biggest strength is long-horizon tasks: reading multiple files, reasoning, keeping track, finding solutions, and iterating until they find the core problem. They may not always be right today, but tomorrow's models will be more right than wrong.
Having said that: why can't a smaller model do the same thing?
The frontier labs sell scaling laws. But we've also found that smaller models trained well can do decently too — look at Qwen, Mistral, MiniMax, Kimi, DeepSeek. Most open-source models are 2–3 generations behind the frontier. That will change as they get better.
Many Chinese open-source models are trained primarily on Chinese data, whereas American models are trained on English data. But the Chinese labs are doing something innovative. They don't have the latest NVIDIA GPUs (again, more American marketing), and they've been smart enough to use the compute they do have efficiently. This is where I see the real innovation.
The Historical Pattern: Why Commodity Always Wins
This pattern keeps repeating. In the 1990s, proprietary RISC chips (SPARC, Alpha, PA-RISC) were "obviously superior" to x86. Sun Microsystems and SGI sold $500K+ servers based on that "fact." Then commodity x86 PCs with Linux and software like MapReduce ate their lunch. By 2010, Sun was gone, absorbed by Oracle.
| Era | Proprietary Incumbent | Commodity Disruptor | Result |
|---|---|---|---|
| 1990s | SPARC, Alpha, PA-RISC | x86 commodity chips | Proprietary RISC faded |
| 1998 | Sun/SGI servers ($500K+) | x86 PCs + MapReduce/GFS | Sun absorbed by Oracle (2010) |
| 2009 | Teradata, Netezza ($1M+) | Hadoop on commodity clusters | Big data democratized |
| Now | GPU clusters ($M+) | CPU clusters + software | → ? |
My Three Strategic Bets
- More efficient compute — Not more flops, but more value per flop.
- Smaller models with frontier-scale capabilities — Dense, focused, and trained with high-quality curricula.
- Pre-training to usable models becomes easier — This is the most important point.
The Real Optimization: Training on Anything
What I am betting on is this: the ability to go from random initialization of weights to training any model at will on any commodity hardware.
Once anyone can train their own model, figure out how to use it, and make this as easy as using "Opus 4.7" or whatever, then "bigger is better" only matters at the exact moment of release. This part is still mostly closed source: the training regime, the curriculum, the methods, the flow, the setup.
What we normally get from so-called open-source models is mostly the weights and the model architecture. Both are important, but architecture and training regime go hand-in-hand—and that part is omitted.
As more labs share their training curricula and their recipes for making powerful models, this trend will keep growing. But then comes cost. If open-weight labs can't earn at the same rate as frontier labs, they may go under. That is a very real possibility, and I hope we don't get there.
What the Marketing Won't Tell You: "Innovations" Are Actually GPU Patches
Most "breakthroughs" in LLM architecture are just patches for GPU holes.
| "Innovation" | What It Actually Does | The GPU Constraint It Patches |
|---|---|---|
| GQA (Grouped Query Attention) | Shares KV heads across query heads | KV cache blows up GPU VRAM |
| MoE (Mixture of Experts) | Activates sparse subset of parameters | Dense model won't fit on one GPU |
| Gradient Checkpointing | Recomputes activations instead of storing them | Training activations don't fit in GPU VRAM |
| Tensor Parallelism | Shards weight matrices across GPUs | Single GPU can't hold the full matrix |
| Pipeline Parallelism | Distributes layers across GPUs | All layers don't fit on one GPU |
| Flash Attention | Online softmax with tiled computation | Full attention matrix doesn't fit in GPU SRAM/VRAM |
On a CPU with 2–4TB of RAM, half of these become unnecessary. The entire research direction has been shaped by a GPU limitation, not fundamental progress. My bet is on the architectures that don't need these patches.
The Fundamental Math: Amdahl's Law and "0 × ∞ = 0"
Forget peak FLOPS. The real math is Amdahl's Law: speedup(N) = 1 / (S + (1-S)/N) where S is the serial fraction. For LLMs, that's the next token depending on the last, layer boundaries, and sync points. Even with infinite GPUs, if S is 10%, your max speedup is 10x. You hit a wall.
Then there's the "0 × ∞ = 0" principle: if the model doesn't fit in memory, the FLOPS don't matter.
A 70B model needs ~140GB just for weights (FP16). A single H100 has 80GB. You are forced into a multi-GPU cluster just to start:
- GPU Path: Model doesn't fit → 8 GPUs → $320K + NVLink $50K+ + DGX chassis $80K+ = $450K+ just to START
- CPU Path: Model fits in 4TB RAM → 1-2 servers $30K each + Ethernet $2K = $60K and you're running
The math doesn't lie.
The Real Bottleneck: Data Movement
Everyone is obsessed with FLOPS. But generating the next token isn't about compute. It's about cycling GBs of data through compute as fast as possible.
When you generate a token, you need to move the weights (GBs of data) from RAM to compute, move the KV cache (hundreds of MB to GBs), move the activations — and then you do the math. The math is fast. The data movement is slow.
And this is where CPUs are about to have their moment.
Memory Channels: The Silent Revolution
Today's dual-socket CPUs already have 12–16 memory channels. Each channel of DDR5 delivers ~40 GB/s:
- 12 channels × 40 GB/s = 480 GB/s
- 16 channels × 40 GB/s = 640 GB/s
| Year | Channels per Socket | Dual Socket Total | Bandwidth |
|---|---|---|---|
| Today (2025) | 12 | 24 | ~960 GB/s |
| 2026–2027 | 16 | 32 | ~1.28 TB/s |
| 2028+ | 24 | 48 | ~1.9–2.5 TB/s |
So the hardware is coming. More channels. Wider SIMD. Larger caches. But hardware alone doesn't solve the problem. You need a software stack that actually uses these capabilities coherently. That is where the C-Kernel-Engine comes in.
What Already Exists (And Why I'm Not Using It)
Before I explain what I am building, let me acknowledge what already works on CPUs today. I'm not here to pretend nothing exists.
| Project | What It Does | Why I'm Not Building On It |
|---|---|---|
| llama.cpp | Brilliant inference kernels. Quantization that respects memory bandwidth. | No training stack. No backward passes. Inference only. |
| PyTorch CPU | Full training + autograd. Optimizers. Same API as GPU. | Kernels are generic BLAS, not CPU-optimized. Heavy dependency tree. |
| Intel OpenVINO | Extremely optimized for Intel hardware. Great inference. | Intel-specific. Inference only. Vendor lock-in. |
| ONNX Runtime | Cross-platform. Multiple backends. | Configuration hell. Not training-native. Wrapper complexity. |
These are all good projects. I respect them. But none of them give me a coherent training + inference stack that is:
- CPU-native – written for cache lines, NUMA domains, and memory channels
- Auditable – every kernel has a forward and backward pass, validated against PyTorch
- Trainable – backprop synthesis, gradient accumulation, AdamW
- Portable – runs on any Linux machine with AVX2 or better
- Zero dependencies – no CUDA, no Python at runtime, no heavy frameworks
So I built my own. From scratch. Because coherence doesn't come from glue. It comes from unified design.
The C-Kernel-Engine: Building the Missing Piece
Here is what I am actually building.
I am not stitching existing libraries together. I am not writing wrappers around llama.cpp or PyTorch CPU. I am building a complete CPU-native AI system from scratch — because coherence doesn't come from glue. It comes from unified design.
Let me explain how it works in plain English.
How It Works
Most CPU AI projects today are inference-only. They can run a model, but they cannot train one. Training requires backward passes — the ability to compute gradients and update weights — and almost no one has built that for CPUs in a coherent way.
So I built it. Every kernel — attention, math, normalization — has both a forward pass (running the model) and a backward pass (training the model). Each is tested against PyTorch to ensure it is bit-exact within tolerance.
Quantized models (Q4_K, Q5_K, Q6_K, Q8_0) stay quantized in memory. They dequantize on-the-fly during math operations. This saves memory and bandwidth — the two things that actually matter for inference.
The engine uses a three-stage pipeline to go from model config to executable C code. Stage one builds a graph of operations. Stage two finds patterns it can fuse together (like combining normalization + attention into one fast pass). Stage three assigns memory offsets and generates the final C code. The code generator is deliberately dumb — it only emits what the earlier stages decided. If something is wrong, the bug is upstream, not hidden in the generator.
Memory is managed with a bump allocator — a 128-byte header followed by contiguous blocks for weights, activations, gradients, and optimizer state. It uses huge pages, NUMA binding, and canary sentinels to catch memory corruption. The .bump format is my own design because mmap() of a single contiguous blob is the fastest way to load weights — no parsing, no deserialization, just bytes to memory.
The tokenizer is 16.5x faster than PyTorch's. Trie-based BPE, no memcpy in the hot path. 114k tokens per second on a 15k-character input.
And zero dependencies. No CUDA. No ROCm. No Python at runtime. Just GCC, Linux, and standard C.
The Only Piece I Didn't Write From Scratch
I parse GGUF files (llama.cpp's weight format) because the format is already proven. But I immediately convert to my own .bump format. The conversion happens once. After that, the runtime touches nothing from any other project. Pragmatism where it matters. Independence where it counts.
Where Things Stand
| What | Status | What It Means |
|---|---|---|
| Quantized math kernels | ✅ Done | Q4_K, Q5_K, Q6_K, Q8_0 — memory-efficient inference |
| Attention + normalization + MLP | ✅ Done | Forward + backward passes, tested against PyTorch |
| Single-node inference | ✅ Done | Generate C runtime from a model config |
| Bump allocator + GGUF conversion | ✅ Done | Fast weight loading, custom .bump format |
| Tokenizer | ✅ Done | 16.5x faster than PyTorch |
| v7 training runtime | ✅ Done | Backprop, gradient accumulation, AdamW optimizer |
| Multi-node training (RDMA) | 🔄 Planned | Pipeline + tensor parallelism across servers |
| MEGA kernel fusion | 🔄 Planned | AVX-512 / NEON — fuse RMSNorm + QKV + RoPE into one pass |
The Numbers So Far
- 0.6B quantized model on a 12th-gen Intel (Alder Lake) desktop CPU → ~100 tokens/sec when the system is otherwise idle
- Same model on an older 4-core machine → ~20–25 tokens/sec
- No GPU. No CUDA. No special hardware. Just common x86 instructions and Linux.
📅 See the full version history:
Why This Approach Wins
Building from scratch is harder. It takes longer. But it is the only way to achieve coherence.
When you control the kernels, the IR, the memory layout, and the code generator, you can make decisions that no glued-together stack can:
- The memory allocator knows the kernel's cache line alignment requirements because they were designed together.
- The IR knows which fusion patterns are profitable because it has access to the actual kernel implementations.
- The code generator knows the NUMA topology because the runtime exposes it through the same configuration.
- The backward pass knows what activations were saved in the forward pass because the same IR built both.
You cannot get this from stitching. You can only get this from a unified design.
The Distributed Computing Reality That Marketing Hid
Here's what the marketing doesn't want you to know: CPU distributed computing is rock solid and has been for decades.
- MPI – 30+ years of production use. Top500 supercomputers run on it.
- OpenMP – Shared memory parallelism. Works on any CPU. No proprietary vendor lock.
- RDMA – Remote Direct Memory Access. Zero-copy networking. CPUs have had this forever.
CPU clusters have powered the world's largest supercomputers for decades; most of the Top500 are CPU-based or hybrid machines. MPI + OpenMP + RDMA is battle-tested, debugged, and optimized across millions of nodes.
The Long Bet: Abundant, Not Centralized
Frontier models will take the lead in the next 3–5 years. AI is not going anywhere. And as long as open-weight models are just 2–3 generations behind, by the 5th or 6th year, I believe AI will be abundant — not centralized.
The proprietary path forces you into a corner: "You're now FORCED to buy 8+ GPUs in a cluster."
The CPU path is different: "If the software gets good enough, ordinary servers become practical AI machines."
That's the bet. That's why I stopped getting high on the frontier hype.
The Bottom Line
Generating the next token requires cycling GBs of data through compute. High-channel-count CPUs are the most direct path to doing that efficiently. Distributed computing on CPUs has been solved for decades.
But the software stack for CPU-native AI is immature. So I am building it.
From scratch. Kernels, IR, codegen, memory allocator, training runtime, tokenizer. Everything except the GGUF weight parser.
Not because I don't know about llama.cpp or PyTorch CPU. Because I need coherence. And coherence doesn't come from glue. It comes from unified design.
This is my bet. This is the C-Kernel-Engine.
Who knows if it will work? That's why it's a bet.
But if it does work? AI becomes abundant. Not centralized. Running on the hardware you already own. No GPU required. No vendor lock-in. No marketing hype.
Just code. Just Linux. Just compute.
👉 Read the full C-Kernel-Engine scaling philosophy:
https://c-kernel-engine.github.io/C-Kernel-Engine/scaling.html
Until next time. Take care.