Last week, Claude Opus 4.7 "got released." And the Anthropic team is doing what Americans are geniuses at: marketing, marketing, marketing.
The whole Claude, OpenAI, Gemini saga is the same playbook: bigger, better, more compute, scaling laws. "You need us to win."
My hunch is different. As these AI models get better using massive weights, there will also be research into how to get the same — or even better — performance using fewer weights, less compute, and a more diverse set of products.
The Contradiction at the Heart of Scaling
Frontier models are currently great at reading large codebases, reasoning through them, and solving problems across C, C++, PHP, Java, JavaScript, React — you name it. Their biggest strength is long-horizon tasks: reading multiple files, reasoning, keeping track, finding solutions, and iterating until they find the core problem. They may not always be right today, but tomorrow's models will be more right than wrong.
Having said that: why can't a smaller model do the same thing?
The frontier labs sell scaling laws. But we've also found that smaller models trained well can do decently too — look at Qwen, Mistral, MiniMax, Kimi, DeepSeek. Most open-source models are 2–3 generations behind the frontier. That will change as they get better.
Many Chinese open-source models are trained primarily on Chinese data, whereas American models are trained on English data. But the Chinese labs are doing something innovative. They don't have the latest NVIDIA GPUs (again, more American marketing), and they've been smart enough to use the compute they do have efficiently. This is where I see the real innovation.
The Historical Pattern: Why Commodity Always Wins
This pattern keeps repeating. In the 1990s, proprietary RISC chips (SPARC, Alpha, PA-RISC) were "obviously superior" to x86. Sun Microsystems and SGI sold $500K+ servers based on that "fact." Then commodity x86 PCs with Linux and software like MapReduce ate their lunch. By 2010, Sun was gone, absorbed by Oracle.
| Era | Proprietary Incumbent | Commodity Disruptor | Result |
|---|---|---|---|
| 1990s | SPARC, Alpha, PA-RISC | x86 commodity chips | Proprietary RISC faded |
| 1998 | Sun/SGI servers ($500K+) | x86 PCs + MapReduce/GFS | Sun absorbed by Oracle (2010) |
| 2009 | Teradata, Netezza ($1M+) | Hadoop on commodity clusters | Big data democratized |
| Now | GPU clusters ($M+) | CPU clusters + software | → ? |
My Three Strategic Bets
- More efficient compute — Not more flops, but more value per flop.
- Smaller models with frontier-scale capabilities — Dense, focused, and trained with high-quality curricula.
- Pre-training to usable models becomes easier — This is the most important point.
The Real Optimization: Training on Anything
What I am betting on is this: the ability to go from random initialization of weights to training any model at will on any commodity hardware.
Once anyone can train their own model, figure out how to use it, and make this as easy as using "Opus 4.7" or whatever, then "bigger is better" only matters at the exact moment of release. This part is still mostly closed source: the training regime, the curriculum, the methods, the flow, the setup.
What we normally get from so-called open-source models is mostly the weights and the model architecture. Both are important, but architecture and training regime go hand-in-hand—and that part is omitted.
As more labs share their training curricula and their recipes for making powerful models, this trend will keep growing. But then comes cost. If open-weight labs can't earn at the same rate as frontier labs, they may go under. That is a very real possibility, and I hope we don't get there.
What the Marketing Won't Tell You: "Innovations" Are Actually GPU Patches
Most "breakthroughs" in LLM architecture are just patches for GPU holes.
| "Innovation" | What It Actually Does | The GPU Constraint It Patches |
|---|---|---|
| GQA (Grouped Query Attention) | Shares KV heads across query heads | KV cache blows up GPU VRAM |
| MoE (Mixture of Experts) | Activates sparse subset of parameters | Dense model won't fit on one GPU |
| Gradient Checkpointing | Recomputes activations instead of storing them | Training activations don't fit in GPU VRAM |
| Tensor Parallelism | Shards weight matrices across GPUs | Single GPU can't hold the full matrix |
| Pipeline Parallelism | Distributes layers across GPUs | All layers don't fit on one GPU |
| Flash Attention | Online softmax with tiled computation | Full attention matrix doesn't fit in GPU SRAM/VRAM |
On a CPU with 2–4TB of RAM, half of these become unnecessary. The entire research direction has been shaped by a GPU limitation, not fundamental progress. My bet is on the architectures that don't need these patches.
The Fundamental Math: Amdahl's Law and "0 × ∞ = 0"
Forget peak FLOPS. The real math is Amdahl's Law: speedup(N) = 1 / (S + (1-S)/N) where S is the serial fraction. For LLMs, that's the next token depending on the last, layer boundaries, and sync points. Even with infinite GPUs, if S is 10%, your max speedup is 10x. You hit a wall.
Then there's the "0 × ∞ = 0" principle: if the model doesn't fit in memory, the FLOPS don't matter.
A 70B model needs ~140GB just for weights (FP16). A single H100 has 80GB. You are forced into a multi-GPU cluster just to start:
- GPU Path: Model doesn't fit → 8 GPUs → $320K + NVLink $50K+ + DGX chassis $80K+ = $450K+ just to START
- CPU Path: Model fits in 4TB RAM → 1-2 servers $30K each + Ethernet $2K = $60K and you're running
The math doesn't lie.
The Real Bottleneck: Data Movement
Everyone is obsessed with FLOPS. But generating the next token isn't about compute. It's about cycling GBs of data through compute as fast as possible.
When you generate a token, you need to move the weights (GBs of data) from RAM to compute, move the KV cache (hundreds of MB to GBs), move the activations — and then you do the math. The math is fast. The data movement is slow.
And this is where CPUs are about to have their moment.
Memory Channels: The Silent Revolution
Today's dual-socket CPUs already have 12–16 memory channels. Each channel of DDR5 delivers ~40 GB/s:
- 12 channels × 40 GB/s = 480 GB/s
- 16 channels × 40 GB/s = 640 GB/s
| Year | Channels per Socket | Dual Socket Total | Bandwidth |
|---|---|---|---|
| Today (2025) | 12 | 24 | ~960 GB/s |
| 2026–2027 | 16 | 32 | ~1.28 TB/s |
| 2028+ | 24 | 48 | ~1.9–2.5 TB/s |
So the hardware is coming. More channels. Wider SIMD. Larger caches. But hardware alone doesn't solve the problem. You need a software stack that actually uses these capabilities coherently. That is where the C-Kernel-Engine comes in.
What Already Exists (And Why I'm Not Using It)
Before I explain what I am building, let me acknowledge what already works on CPUs today. I'm not here to pretend nothing exists.
| Project | What It Does | Why I'm Not Building On It |
|---|---|---|
| llama.cpp | Brilliant inference kernels. Quantization that respects memory bandwidth. | No training stack. No backward passes. Inference only. |
| PyTorch CPU | Full training + autograd. Optimizers. Same API as GPU. | Kernels are generic BLAS, not CPU-optimized. Heavy dependency tree. |
| Intel OpenVINO | Extremely optimized for Intel hardware. Great inference. | Intel-specific. Inference only. Vendor lock-in. |
| ONNX Runtime | Cross-platform. Multiple backends. | Configuration hell. Not training-native. Wrapper complexity. |
These are all good projects. I respect them. But none of them give me a coherent training + inference stack that is:
- CPU-native – written for cache lines, NUMA domains, and memory channels
- Auditable – every kernel has a forward and backward pass, validated against PyTorch
- Trainable – backprop synthesis, gradient accumulation, AdamW
- Portable – runs on any Linux machine with AVX2 or better
- Zero dependencies – no CUDA, no Python at runtime, no heavy frameworks
So I built my own. From scratch. Because coherence doesn't come from glue. It comes from unified design.
The C-Kernel-Engine: Building the Missing Piece
Here is what I am actually building.
I am not stitching existing libraries together. I am not writing wrappers around llama.cpp or PyTorch CPU. I am building a complete CPU-native AI system from scratch — because coherence doesn't come from glue. It comes from unified design.
Let me explain how it works in plain English.
How It Works
Most CPU AI projects today are inference-only. They can run a model, but they cannot train one. Training requires backward passes — the ability to compute gradients and update weights — and almost no one has built that for CPUs in a coherent way.
So I built it. Every kernel — attention, math, normalization — has both a forward pass (running the model) and a backward pass (training the model). Each is tested against PyTorch to ensure it is bit-exact within tolerance.
Quantized models (Q4_K, Q5_K, Q6_K, Q8_0) stay quantized in memory. They dequantize on-the-fly during math operations. This saves memory and bandwidth — the two things that actually matter for inference.
The engine uses a three-stage pipeline to go from model config to executable C code. Stage one builds a graph of operations. Stage two finds patterns it can fuse together (like combining normalization + attention into one fast pass). Stage three assigns memory offsets and generates the final C code. The code generator is deliberately dumb — it only emits what the earlier stages decided. If something is wrong, the bug is upstream, not hidden in the generator.
Memory is managed with a bump allocator — a 128-byte header followed by contiguous blocks for weights, activations, gradients, and optimizer state. It uses huge pages, NUMA binding, and canary sentinels to catch memory corruption. The .bump format is my own design because mmap() of a single contiguous blob is the fastest way to load weights — no parsing, no deserialization, just bytes to memory.
The tokenizer is 16.5x faster than PyTorch's. Trie-based BPE, no memcpy in the hot path. 114k tokens per second on a 15k-character input.
And zero dependencies. No CUDA. No ROCm. No Python at runtime. Just GCC, Linux, and standard C.
The Only Piece I Didn't Write From Scratch
I parse GGUF files (llama.cpp's weight format) because the format is already proven. But I immediately convert to my own .bump format. The conversion happens once. After that, the runtime touches nothing from any other project. Pragmatism where it matters. Independence where it counts.
Where Things Stand
| What | Status | What It Means |
|---|---|---|
| Quantized math kernels | ✅ Done | Q4_K, Q5_K, Q6_K, Q8_0 — memory-efficient inference |
| Attention + normalization + MLP | ✅ Done | Forward + backward passes, tested against PyTorch |
| Single-node inference | ✅ Done | Generate C runtime from a model config |
| Bump allocator + GGUF conversion | ✅ Done | Fast weight loading, custom .bump format |
| Tokenizer | ✅ Done | 16.5x faster than PyTorch |
| v7 training runtime | ✅ Done | Backprop, gradient accumulation, AdamW optimizer |
| Multi-node training (RDMA) | 🔄 Planned | Pipeline + tensor parallelism across servers |
| MEGA kernel fusion | 🔄 Planned | AVX-512 / NEON — fuse RMSNorm + QKV + RoPE into one pass |
The Numbers So Far
- 0.6B quantized model on a 12th-gen Intel (Alder Lake) desktop CPU → ~100 tokens/sec when the system is otherwise idle
- Same model on an older 4-core machine → ~20–25 tokens/sec
- No GPU. No CUDA. No special hardware. Just common x86 instructions and Linux.
📅 See the full version history:
Why This Approach Wins
Building from scratch is harder. It takes longer. But it is the only way to achieve coherence.
When you control the kernels, the IR, the memory layout, and the code generator, you can make decisions that no glued-together stack can:
- The memory allocator knows the kernel's cache line alignment requirements because they were designed together.
- The IR knows which fusion patterns are profitable because it has access to the actual kernel implementations.
- The code generator knows the NUMA topology because the runtime exposes it through the same configuration.
- The backward pass knows what activations were saved in the forward pass because the same IR built both.
You cannot get this from stitching. You can only get this from a unified design.
The Distributed Computing Reality That Marketing Hid
Here's what the marketing doesn't want you to know: CPU distributed computing is rock solid and has been for decades.
- MPI – 30+ years of production use. Top500 supercomputers run on it.
- OpenMP – Shared memory parallelism. Works on any CPU. No proprietary vendor lock.
- RDMA – Remote Direct Memory Access. Zero-copy networking. CPUs have had this forever.
CPU clusters have powered the world's largest supercomputers for decades; most of the Top500 are CPU-based or hybrid machines. MPI + OpenMP + RDMA is battle-tested, debugged, and optimized across millions of nodes.
The Long Bet: Abundant, Not Centralized
Frontier models will take the lead in the next 3–5 years. AI is not going anywhere. And as long as open-weight models are just 2–3 generations behind, by the 5th or 6th year, I believe AI will be abundant — not centralized.
The proprietary path forces you into a corner: "You're now FORCED to buy 8+ GPUs in a cluster."
The CPU path is different: "If the software gets good enough, ordinary servers become practical AI machines."
That's the bet. That's why I stopped getting high on the frontier hype.
The Bottom Line
Generating the next token requires cycling GBs of data through compute. High-channel-count CPUs are the most direct path to doing that efficiently. Distributed computing on CPUs has been solved for decades.
But the software stack for CPU-native AI is immature. So I am building it.
From scratch. Kernels, IR, codegen, memory allocator, training runtime, tokenizer. Everything except the GGUF weight parser.
Not because I don't know about llama.cpp or PyTorch CPU. Because I need coherence. And coherence doesn't come from glue. It comes from unified design.
This is my bet. This is the C-Kernel-Engine.
Who knows if it will work? That's why it's a bet.
But if it does work? AI becomes abundant. Not centralized. Running on the hardware you already own. No GPU required. No vendor lock-in. No marketing hype.
Just code. Just Linux. Just compute.
👉 Read the full C-Kernel-Engine scaling philosophy:
https://c-kernel-engine.github.io/C-Kernel-Engine/scaling.html
Until next time. Take care.